Hello guys, this is part 2 of our memory hotplug work. This part
is based on part 1:
"x86, memblock: Allocate memory near kernel image before SRAT parsed"
which is based on 3.12-rc4.
You can refer to part 1 here: https://lkml.org/lkml/2013/10/10/644
Any comments are welcome! Thanks!
[Problem]
The current Linux kernel cannot migrate pages used by the kernel because
of the kernel direct mapping: in kernel space, va = pa + PAGE_OFFSET.
When the pa is changed, we cannot simply update the page table and keep
the va unmodified, so kernel pages are not migratable.
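For illustration only, here is a simplified sketch of the direct-mapping
relation (the PAGE_OFFSET value and the helper name below are just examples;
the real definitions are the __va()/__pa() macros in the x86 arch headers):
/*
 * Simplified sketch of the kernel direct mapping. Illustrative only:
 * the real macros are __va()/__pa(), and PAGE_OFFSET depends on the
 * architecture and configuration.
 */
#define PAGE_OFFSET 0xffff880000000000UL
static inline void *direct_va(unsigned long pa)
{
	return (void *)(pa + PAGE_OFFSET);	/* va = pa + PAGE_OFFSET */
}
Since the va is derived directly from the pa, moving data to a different
physical address necessarily changes its virtual address, which is why
such pages cannot be migrated transparently.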
There are also some other issues that make kernel pages non-migratable.
For example, a physical address may be cached somewhere and used later,
and it is not easy to update all such caches.
When doing memory hotplug in Linux, we first migrate all the pages in one
memory device somewhere else, and then remove the device. But if pages are
used by the kernel, they are not migratable. As a result, memory used by
the kernel cannot be hot-removed.
Modifying the kernel direct mapping mechanism is too difficult, and it
may make the kernel slower and less stable. So we use the following
approach to do memory hotplug.
[What we are doing]
In Linux, memory in one NUMA node is divided into several zones. One of the
zones is ZONE_MOVABLE, which the kernel won't use.
In order to implement memory hotplug in Linux, we are going to arrange all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use this memory.
To do this, we need ACPI's help.
[How we do this]
In ACPI, the SRAT (System Resource Affinity Table) contains the NUMA info.
The memory affinity structures in the SRAT record every memory range in the
system, and also a flag specifying whether the memory range is hotpluggable.
(Please refer to ACPI spec 5.0, section 5.2.16.)
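As a rough illustration of where that flag lives, the check can be written
like this (a sketch using the ACPICA structure and flag names; the real code
is in acpi_numa_memory_affinity_init() in arch/x86/mm/srat.c):
#include <linux/acpi.h>
/* Sketch: is this SRAT memory affinity entry marked hotpluggable? */
static bool srat_mem_hotpluggable(struct acpi_srat_mem_affinity *ma)
{
	return ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE;
}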
With the help of SRAT, we have to do the following two things to achieve our
goal:
1. When doing memory hot-add, allow users to arrange hotpluggable memory
as ZONE_MOVABLE.
(This has been done by the MOVABLE_NODE functionality in Linux.)
2. When the system is booting, prevent the bootmem allocator from allocating
hotpluggable memory for the kernel before the memory initialization
finishes.
(This is what we are going to do. See below.)
[About this patch-set]
In the patches of part 1, we made the kernel allocate memory near the
kernel image before the SRAT is parsed, to avoid allocating hotpluggable
memory for the kernel. So this patch-set does the following things:
1. Improve memblock to support flags, which are used to indicate different
memory types.
2. Mark all hotpluggable memory in memblock.memory[].
3. Make the default memblock allocator skip hotpluggable memory.
4. Improve the "movable_node" boot option to have higher priority than the
movablecore and kernelcore boot options.
Change log v1 -> v2:
1. Rebase this part on the v7 version of part 1.
2. Fix bug: if the movable_node boot option is not specified, memblock
still checks hotpluggable memory when allocating memory.
Tang Chen (7):
memblock, numa: Introduce flag into memblock
memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark
hotpluggable regions
memblock: Make memblock_set_node() support different memblock_type
acpi, numa, mem_hotplug: Mark hotpluggable memory in memblock
acpi, numa, mem_hotplug: Mark all nodes the kernel resides
un-hotpluggable
memblock, mem_hotplug: Make memblock skip hotpluggable regions if
needed
x86, numa, acpi, memory-hotplug: Make movable_node have higher
priority
Yasuaki Ishimatsu (1):
x86: get pg_data_t's memory from other node
arch/metag/mm/init.c | 3 +-
arch/metag/mm/numa.c | 3 +-
arch/microblaze/mm/init.c | 3 +-
arch/powerpc/mm/mem.c | 2 +-
arch/powerpc/mm/numa.c | 8 ++-
arch/sh/kernel/setup.c | 4 +-
arch/sparc/mm/init_64.c | 5 +-
arch/x86/mm/init_32.c | 2 +-
arch/x86/mm/init_64.c | 2 +-
arch/x86/mm/numa.c | 63 +++++++++++++++++++++--
arch/x86/mm/srat.c | 5 ++
include/linux/memblock.h | 39 ++++++++++++++-
mm/memblock.c | 123 ++++++++++++++++++++++++++++++++++++++-------
mm/memory_hotplug.c | 1 +
mm/page_alloc.c | 28 ++++++++++-
15 files changed, 252 insertions(+), 39 deletions(-)
From: Tang Chen <[email protected]>
There is no flag in memblock to describe what type a memory region is.
Sometimes we may use memblock to reserve some memory for special usage,
and we want to know what kind of memory it is. So we need a way to
differentiate memory for different usages.
In a hotplug environment, we want to reserve hotpluggable memory so the
kernel won't be able to use it. And when the system is up, we have to
free this hotpluggable memory to the buddy allocator. So we need to mark
this memory first.
In order to do so, we need to mark out these special memory in memblock.
In this patch, we introduce a new "flags" member into memblock_region:
struct memblock_region {
phys_addr_t base;
phys_addr_t size;
unsigned long flags; /* This is new. */
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int nid;
#endif
};
This patch does the following things:
1) Add "flags" member to memblock_region.
2) Modify the following APIs' prototype:
memblock_add_region()
memblock_insert_region()
3) Add memblock_reserve_region() to support reserving memory with flags, and keep
memblock_reserve()'s prototype unmodified.
4) Modify other APIs to support flags, but keep their prototype unmodified.
The idea is from Wen Congyang <[email protected]> and Liu Jiang <[email protected]>.
v1 -> v2:
As tj suggested, a zero flag MEMBLK_DEFAULT will make users confused: if
they want to specify any other flag, such as MEMBLK_HOTPLUG, they don't
know whether to use MEMBLK_DEFAULT | MEMBLK_HOTPLUG or just MEMBLK_HOTPLUG.
So remove MEMBLK_DEFAULT (which is 0), and just use 0 by default to avoid
confusing users.
Suggested-by: Wen Congyang <[email protected]>
Suggested-by: Liu Jiang <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 53 +++++++++++++++++++++++++++++++++-------------
2 files changed, 39 insertions(+), 15 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 77c60e5..9a805ec 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -22,6 +22,7 @@
struct memblock_region {
phys_addr_t base;
phys_addr_t size;
+ unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int nid;
#endif
diff --git a/mm/memblock.c b/mm/memblock.c
index 53e477b..877973e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -255,6 +255,7 @@ static void __init_memblock memblock_remove_region(struct memblock_type *type, u
type->cnt = 1;
type->regions[0].base = 0;
type->regions[0].size = 0;
+ type->regions[0].flags = 0;
memblock_set_region_node(&type->regions[0], MAX_NUMNODES);
}
}
@@ -405,7 +406,8 @@ static void __init_memblock memblock_merge_regions(struct memblock_type *type)
if (this->base + this->size != next->base ||
memblock_get_region_node(this) !=
- memblock_get_region_node(next)) {
+ memblock_get_region_node(next) ||
+ this->flags != next->flags) {
BUG_ON(this->base + this->size > next->base);
i++;
continue;
@@ -425,13 +427,15 @@ static void __init_memblock memblock_merge_regions(struct memblock_type *type)
* @base: base address of the new region
* @size: size of the new region
* @nid: node id of the new region
+ * @flags: flags of the new region
*
* Insert new memblock region [@base,@base+@size) into @type at @idx.
* @type must already have extra room to accomodate the new region.
*/
static void __init_memblock memblock_insert_region(struct memblock_type *type,
int idx, phys_addr_t base,
- phys_addr_t size, int nid)
+ phys_addr_t size,
+ int nid, unsigned long flags)
{
struct memblock_region *rgn = &type->regions[idx];
@@ -439,6 +443,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn));
rgn->base = base;
rgn->size = size;
+ rgn->flags = flags;
memblock_set_region_node(rgn, nid);
type->cnt++;
type->total_size += size;
@@ -450,6 +455,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
* @base: base address of the new region
* @size: size of the new region
* @nid: nid of the new region
+ * @flags: flags of the new region
*
* Add new memblock region [@base,@base+@size) into @type. The new region
* is allowed to overlap with existing ones - overlaps don't affect already
@@ -460,7 +466,8 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
* 0 on success, -errno on failure.
*/
static int __init_memblock memblock_add_region(struct memblock_type *type,
- phys_addr_t base, phys_addr_t size, int nid)
+ phys_addr_t base, phys_addr_t size,
+ int nid, unsigned long flags)
{
bool insert = false;
phys_addr_t obase = base;
@@ -475,6 +482,7 @@ static int __init_memblock memblock_add_region(struct memblock_type *type,
WARN_ON(type->cnt != 1 || type->total_size);
type->regions[0].base = base;
type->regions[0].size = size;
+ type->regions[0].flags = flags;
memblock_set_region_node(&type->regions[0], nid);
type->total_size = size;
return 0;
@@ -505,7 +513,8 @@ repeat:
nr_new++;
if (insert)
memblock_insert_region(type, i++, base,
- rbase - base, nid);
+ rbase - base, nid,
+ flags);
}
/* area below @rend is dealt with, forget about it */
base = min(rend, end);
@@ -515,7 +524,8 @@ repeat:
if (base < end) {
nr_new++;
if (insert)
- memblock_insert_region(type, i, base, end - base, nid);
+ memblock_insert_region(type, i, base, end - base,
+ nid, flags);
}
/*
@@ -537,12 +547,13 @@ repeat:
int __init_memblock memblock_add_node(phys_addr_t base, phys_addr_t size,
int nid)
{
- return memblock_add_region(&memblock.memory, base, size, nid);
+ return memblock_add_region(&memblock.memory, base, size, nid, 0);
}
int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size)
{
- return memblock_add_region(&memblock.memory, base, size, MAX_NUMNODES);
+ return memblock_add_region(&memblock.memory, base, size,
+ MAX_NUMNODES, 0);
}
/**
@@ -597,7 +608,8 @@ static int __init_memblock memblock_isolate_range(struct memblock_type *type,
rgn->size -= base - rbase;
type->total_size -= base - rbase;
memblock_insert_region(type, i, rbase, base - rbase,
- memblock_get_region_node(rgn));
+ memblock_get_region_node(rgn),
+ rgn->flags);
} else if (rend > end) {
/*
* @rgn intersects from above. Split and redo the
@@ -607,7 +619,8 @@ static int __init_memblock memblock_isolate_range(struct memblock_type *type,
rgn->size -= end - rbase;
type->total_size -= end - rbase;
memblock_insert_region(type, i--, rbase, end - rbase,
- memblock_get_region_node(rgn));
+ memblock_get_region_node(rgn),
+ rgn->flags);
} else {
/* @rgn is fully contained, record it */
if (!*end_rgn)
@@ -649,16 +662,24 @@ int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size)
return __memblock_remove(&memblock.reserved, base, size);
}
-int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+static int __init_memblock memblock_reserve_region(phys_addr_t base,
+ phys_addr_t size,
+ int nid,
+ unsigned long flags)
{
struct memblock_type *_rgn = &memblock.reserved;
- memblock_dbg("memblock_reserve: [%#016llx-%#016llx] %pF\n",
+ memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
(unsigned long long)base,
(unsigned long long)base + size,
- (void *)_RET_IP_);
+ flags, (void *)_RET_IP_);
+
+ return memblock_add_region(_rgn, base, size, nid, flags);
+}
- return memblock_add_region(_rgn, base, size, MAX_NUMNODES);
+int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_reserve_region(base, size, MAX_NUMNODES, 0);
}
/**
@@ -1101,6 +1122,7 @@ void __init_memblock memblock_set_current_limit(phys_addr_t limit)
static void __init_memblock memblock_dump(struct memblock_type *type, char *name)
{
unsigned long long base, size;
+ unsigned long flags;
int i;
pr_info(" %s.cnt = 0x%lx\n", name, type->cnt);
@@ -1111,13 +1133,14 @@ static void __init_memblock memblock_dump(struct memblock_type *type, char *name
base = rgn->base;
size = rgn->size;
+ flags = rgn->flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
if (memblock_get_region_node(rgn) != MAX_NUMNODES)
snprintf(nid_buf, sizeof(nid_buf), " on node %d",
memblock_get_region_node(rgn));
#endif
- pr_info(" %s[%#x]\t[%#016llx-%#016llx], %#llx bytes%s\n",
- name, i, base, base + size - 1, size, nid_buf);
+ pr_info(" %s[%#x]\t[%#016llx-%#016llx], %#llx bytes%s flags: %#lx\n",
+ name, i, base, base + size - 1, size, nid_buf, flags);
}
}
--
1.7.1
From: Tang Chen <[email protected]>
In find_hotpluggable_memory, once we find a memory region which is
hotpluggable, we want to mark it in memblock.memory so that we can later
prevent the memblock allocator from allocating hotpluggable memory for
the kernel.
To achieve this goal, we introduce the MEMBLOCK_HOTPLUG flag to indicate
hotpluggable memory regions in memblock, and a function
memblock_mark_hotplug() to mark hotpluggable memory when we find it.
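As a usage sketch, a later patch in this series calls the new helper from
the SRAT parsing code roughly like this ('hotpluggable', 'start', 'end' and
'ma' follow the names used there):
/* Mark a hotpluggable SRAT range in memblock; warn if it fails. */
if (hotpluggable && memblock_mark_hotplug(start, ma->length))
	pr_warn("SRAT: Failed to mark hotplug range [mem %#010Lx-%#010Lx] in memblock\n",
		(unsigned long long)start, (unsigned long long)end - 1);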
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
include/linux/memblock.h | 17 +++++++++++++++
mm/memblock.c | 52 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 69 insertions(+), 0 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 9a805ec..b788faa 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -19,6 +19,9 @@
#define INIT_MEMBLOCK_REGIONS 128
+/* Definition of memblock flags. */
+#define MEMBLOCK_HOTPLUG 0x1 /* hotpluggable region */
+
struct memblock_region {
phys_addr_t base;
phys_addr_t size;
@@ -60,6 +63,8 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
int memblock_free(phys_addr_t base, phys_addr_t size);
int memblock_reserve(phys_addr_t base, phys_addr_t size);
void memblock_trim_memory(phys_addr_t align);
+int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
+int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
@@ -122,6 +127,18 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start,
i != (u64)ULLONG_MAX; \
__next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid))
+static inline void memblock_set_region_flags(struct memblock_region *r,
+ unsigned long flags)
+{
+ r->flags |= flags;
+}
+
+static inline void memblock_clear_region_flags(struct memblock_region *r,
+ unsigned long flags)
+{
+ r->flags &= ~flags;
+}
+
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);
diff --git a/mm/memblock.c b/mm/memblock.c
index 877973e..5bea331 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -683,6 +683,58 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
}
/**
+ * memblock_mark_hotplug - Mark hotpluggable memory with flag MEMBLOCK_HOTPLUG.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * This function isolates region [@base, @base + @size), and marks it with flag
+ * MEMBLOCK_HOTPLUG.
+ *
+ * Return 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
+{
+ struct memblock_type *type = &memblock.memory;
+ int i, ret, start_rgn, end_rgn;
+
+ ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
+ if (ret)
+ return ret;
+
+ for (i = start_rgn; i < end_rgn; i++)
+ memblock_set_region_flags(&type->regions[i], MEMBLOCK_HOTPLUG);
+
+ memblock_merge_regions(type);
+ return 0;
+}
+
+/**
+ * memblock_clear_hotplug - Clear flag MEMBLOCK_HOTPLUG for a specified region.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * This function isolates region [@base, @base + @size), and clears flag
+ * MEMBLOCK_HOTPLUG for the isolated regions.
+ *
+ * Return 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_clear_hotplug(phys_addr_t base, phys_addr_t size)
+{
+ struct memblock_type *type = &memblock.memory;
+ int i, ret, start_rgn, end_rgn;
+
+ ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
+ if (ret)
+ return ret;
+
+ for (i = start_rgn; i < end_rgn; i++)
+ memblock_clear_region_flags(&type->regions[i], MEMBLOCK_HOTPLUG);
+
+ memblock_merge_regions(type);
+ return 0;
+}
+
+/**
* __next_free_mem_range - next function for for_each_free_mem_range()
* @idx: pointer to u64 loop variable
* @nid: node selector, %MAX_NUMNODES for all nodes
--
1.7.1
From: Tang Chen <[email protected]>
Currently memblock_set_node() only operates on memblock.memory. This patch
adds a memblock_type parameter to memblock_set_node() so that callers can
also set node IDs on the regions of memblock.reserved (used by a later
patch in this series), and updates all callers accordingly.
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
arch/metag/mm/init.c | 3 ++-
arch/metag/mm/numa.c | 3 ++-
arch/microblaze/mm/init.c | 3 ++-
arch/powerpc/mm/mem.c | 2 +-
arch/powerpc/mm/numa.c | 8 +++++---
arch/sh/kernel/setup.c | 4 ++--
arch/sparc/mm/init_64.c | 5 +++--
arch/x86/mm/init_32.c | 2 +-
arch/x86/mm/init_64.c | 2 +-
arch/x86/mm/numa.c | 6 ++++--
include/linux/memblock.h | 3 ++-
mm/memblock.c | 6 +++---
12 files changed, 28 insertions(+), 19 deletions(-)
diff --git a/arch/metag/mm/init.c b/arch/metag/mm/init.c
index 1239195..d94a58f 100644
--- a/arch/metag/mm/init.c
+++ b/arch/metag/mm/init.c
@@ -205,7 +205,8 @@ static void __init do_init_bootmem(void)
start_pfn = memblock_region_memory_base_pfn(reg);
end_pfn = memblock_region_memory_end_pfn(reg);
memblock_set_node(PFN_PHYS(start_pfn),
- PFN_PHYS(end_pfn - start_pfn), 0);
+ PFN_PHYS(end_pfn - start_pfn),
+ &memblock.memory, 0);
}
/* All of system RAM sits in node 0 for the non-NUMA case */
diff --git a/arch/metag/mm/numa.c b/arch/metag/mm/numa.c
index 9ae578c..229407f 100644
--- a/arch/metag/mm/numa.c
+++ b/arch/metag/mm/numa.c
@@ -42,7 +42,8 @@ void __init setup_bootmem_node(int nid, unsigned long start, unsigned long end)
memblock_add(start, end - start);
memblock_set_node(PFN_PHYS(start_pfn),
- PFN_PHYS(end_pfn - start_pfn), nid);
+ PFN_PHYS(end_pfn - start_pfn),
+ &memblock.memory, nid);
/* Node-local pgdat */
pgdat_paddr = memblock_alloc_base(sizeof(struct pglist_data),
diff --git a/arch/microblaze/mm/init.c b/arch/microblaze/mm/init.c
index 74c7bcc..89077d3 100644
--- a/arch/microblaze/mm/init.c
+++ b/arch/microblaze/mm/init.c
@@ -192,7 +192,8 @@ void __init setup_memory(void)
start_pfn = memblock_region_memory_base_pfn(reg);
end_pfn = memblock_region_memory_end_pfn(reg);
memblock_set_node(start_pfn << PAGE_SHIFT,
- (end_pfn - start_pfn) << PAGE_SHIFT, 0);
+ (end_pfn - start_pfn) << PAGE_SHIFT,
+ &memblock.memory, 0);
}
/* free bootmem is whole main memory */
diff --git a/arch/powerpc/mm/mem.c b/arch/powerpc/mm/mem.c
index 3fa93dc..231b785 100644
--- a/arch/powerpc/mm/mem.c
+++ b/arch/powerpc/mm/mem.c
@@ -209,7 +209,7 @@ void __init do_init_bootmem(void)
/* Place all memblock_regions in the same node and merge contiguous
* memblock_regions
*/
- memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0);
+ memblock_set_node(0, (phys_addr_t)ULLONG_MAX, &memblock.memory, 0);
/* Add all physical memory to the bootmem map, mark each area
* present.
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index c916127..f82f2ea 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -670,7 +670,8 @@ static void __init parse_drconf_memory(struct device_node *memory)
node_set_online(nid);
sz = numa_enforce_memory_limit(base, size);
if (sz)
- memblock_set_node(base, sz, nid);
+ memblock_set_node(base, sz,
+ &memblock.memory, nid);
} while (--ranges);
}
}
@@ -760,7 +761,7 @@ new_range:
continue;
}
- memblock_set_node(start, size, nid);
+ memblock_set_node(start, size, &memblock.memory, nid);
if (--ranges)
goto new_range;
@@ -797,7 +798,8 @@ static void __init setup_nonnuma(void)
fake_numa_create_new_node(end_pfn, &nid);
memblock_set_node(PFN_PHYS(start_pfn),
- PFN_PHYS(end_pfn - start_pfn), nid);
+ PFN_PHYS(end_pfn - start_pfn),
+ &memblock.memory, nid);
node_set_online(nid);
}
}
diff --git a/arch/sh/kernel/setup.c b/arch/sh/kernel/setup.c
index 1cf90e9..de19cfa 100644
--- a/arch/sh/kernel/setup.c
+++ b/arch/sh/kernel/setup.c
@@ -230,8 +230,8 @@ void __init __add_active_range(unsigned int nid, unsigned long start_pfn,
pmb_bolt_mapping((unsigned long)__va(start), start, end - start,
PAGE_KERNEL);
- memblock_set_node(PFN_PHYS(start_pfn),
- PFN_PHYS(end_pfn - start_pfn), nid);
+ memblock_set_node(PFN_PHYS(start_pfn), PFN_PHYS(end_pfn - start_pfn),
+ &memblock.memory, nid);
}
void __init __weak plat_early_device_setup(void)
diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index ed82eda..31beb53 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -1021,7 +1021,8 @@ static void __init add_node_ranges(void)
"start[%lx] end[%lx]\n",
nid, start, this_end);
- memblock_set_node(start, this_end - start, nid);
+ memblock_set_node(start, this_end - start,
+ &memblock.memory, nid);
start = this_end;
}
}
@@ -1325,7 +1326,7 @@ static void __init bootmem_init_nonnuma(void)
(top_of_ram - total_ram) >> 20);
init_node_masks_nonnuma();
- memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0);
+ memblock_set_node(0, (phys_addr_t)ULLONG_MAX, &memblock.memory, 0);
allocate_node_data(0);
node_set_online(0);
}
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 4287f1f..d9685b6 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -665,7 +665,7 @@ void __init initmem_init(void)
high_memory = (void *) __va(max_low_pfn * PAGE_SIZE - 1) + 1;
#endif
- memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0);
+ memblock_set_node(0, (phys_addr_t)ULLONG_MAX, &memblock.memory, 0);
sparse_memory_present_with_active_regions(0);
#ifdef CONFIG_FLATMEM
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index 104d56a..f35c66c 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -643,7 +643,7 @@ kernel_physical_mapping_init(unsigned long start,
#ifndef CONFIG_NUMA
void __init initmem_init(void)
{
- memblock_set_node(0, (phys_addr_t)ULLONG_MAX, 0);
+ memblock_set_node(0, (phys_addr_t)ULLONG_MAX, &memblock.memory, 0);
}
#endif
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e17db5d..ab69e1d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -492,7 +492,8 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
for (i = 0; i < mi->nr_blks; i++) {
struct numa_memblk *mb = &mi->blk[i];
- memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
+ memblock_set_node(mb->start, mb->end - mb->start,
+ &memblock.memory, mb->nid);
}
/*
@@ -566,7 +567,8 @@ static int __init numa_init(int (*init_func)(void))
nodes_clear(node_possible_map);
nodes_clear(node_online_map);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
- WARN_ON(memblock_set_node(0, ULLONG_MAX, MAX_NUMNODES));
+ WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
+ MAX_NUMNODES));
numa_reset_distance();
ret = init_func();
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index b788faa..97480d3 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -140,7 +140,8 @@ static inline void memblock_clear_region_flags(struct memblock_region *r,
}
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
-int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);
+int memblock_set_node(phys_addr_t base, phys_addr_t size,
+ struct memblock_type *type, int nid);
static inline void memblock_set_region_node(struct memblock_region *r, int nid)
{
diff --git a/mm/memblock.c b/mm/memblock.c
index 5bea331..7de9c76 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -910,18 +910,18 @@ void __init_memblock __next_mem_pfn_range(int *idx, int nid,
* memblock_set_node - set node ID on memblock regions
* @base: base of area to set node ID for
* @size: size of area to set node ID for
+ * @type: memblock type to set node ID for
* @nid: node ID to set
*
- * Set the nid of memblock memory regions in [@base,@base+@size) to @nid.
+ * Set the nid of memblock @type regions in [@base,@base+@size) to @nid.
* Regions which cross the area boundaries are split as necessary.
*
* RETURNS:
* 0 on success, -errno on failure.
*/
int __init_memblock memblock_set_node(phys_addr_t base, phys_addr_t size,
- int nid)
+ struct memblock_type *type, int nid)
{
- struct memblock_type *type = &memblock.memory;
int start_rgn, end_rgn;
int i, ret;
--
1.7.1
From: Tang Chen <[email protected]>
When parsing the SRAT, we know which memory areas are hotpluggable.
So we invoke memblock_mark_hotplug(), introduced by the previous patch,
to mark hotpluggable memory in memblock.
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
arch/x86/mm/numa.c | 2 ++
arch/x86/mm/srat.c | 5 +++++
2 files changed, 7 insertions(+), 0 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index ab69e1d..408c02d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -569,6 +569,8 @@ static int __init numa_init(int (*init_func)(void))
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
MAX_NUMNODES));
+ /* In case that parsing SRAT failed. */
+ WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
numa_reset_distance();
ret = init_func();
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index 266ca91..ca7c484 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -181,6 +181,11 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
(unsigned long long) start, (unsigned long long) end - 1,
hotpluggable ? " hotplug" : "");
+ /* Mark hotplug range in memblock. */
+ if (hotpluggable && memblock_mark_hotplug(start, ma->length))
+ pr_warn("SRAT: Failed to mark hotplug range [mem %#010Lx-%#010Lx] in memblock\n",
+ (unsigned long long) start, (unsigned long long) end - 1);
+
return 0;
out_err_bad_srat:
bad_srat();
--
1.7.1
From: Tang Chen <[email protected]>
At very early time, the kernel has to use some memory, e.g. for loading
the kernel image. We cannot prevent this anyway. So any node the kernel
resides in should be un-hotpluggable.
Signed-off-by: Zhang Yanfei <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
arch/x86/mm/numa.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 44 insertions(+), 0 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 408c02d..f26b16f 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -494,6 +494,14 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
struct numa_memblk *mb = &mi->blk[i];
memblock_set_node(mb->start, mb->end - mb->start,
&memblock.memory, mb->nid);
+
+ /*
+ * At this time, all memory regions reserved by memblock are
+ * used by the kernel. Set the nid in memblock.reserved will
+ * mark out all the nodes the kernel resides in.
+ */
+ memblock_set_node(mb->start, mb->end - mb->start,
+ &memblock.reserved, mb->nid);
}
/*
@@ -555,6 +563,30 @@ static void __init numa_init_array(void)
}
}
+static void __init numa_clear_kernel_node_hotplug(void)
+{
+ int i, nid;
+ nodemask_t numa_kernel_nodes = NODE_MASK_NONE;
+ unsigned long start, end;
+ struct memblock_type *type = &memblock.reserved;
+
+ /* Mark all kernel nodes. */
+ for (i = 0; i < type->cnt; i++)
+ node_set(type->regions[i].nid, numa_kernel_nodes);
+
+ /* Clear MEMBLOCK_HOTPLUG flag for memory in kernel nodes. */
+ for (i = 0; i < numa_meminfo.nr_blks; i++) {
+ nid = numa_meminfo.blk[i].nid;
+ if (!node_isset(nid, numa_kernel_nodes))
+ continue;
+
+ start = numa_meminfo.blk[i].start;
+ end = numa_meminfo.blk[i].end;
+
+ memblock_clear_hotplug(start, end - start);
+ }
+}
+
static int __init numa_init(int (*init_func)(void))
{
int i;
@@ -569,6 +601,8 @@ static int __init numa_init(int (*init_func)(void))
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.memory,
MAX_NUMNODES));
+ WARN_ON(memblock_set_node(0, ULLONG_MAX, &memblock.reserved,
+ MAX_NUMNODES));
/* In case that parsing SRAT failed. */
WARN_ON(memblock_clear_hotplug(0, ULLONG_MAX));
numa_reset_distance();
@@ -606,6 +640,16 @@ static int __init numa_init(int (*init_func)(void))
numa_clear_node(i);
}
numa_init_array();
+
+ /*
+ * At very early time, the kernel has to use some memory, e.g. for
+ * loading the kernel image. We cannot prevent this anyway. So any
+ * node the kernel resides in should be un-hotpluggable.
+ *
+ * And when we come here, numa_init() won't fail.
+ */
+ numa_clear_kernel_node_hotplug();
+
return 0;
}
--
1.7.1
From: Tang Chen <[email protected]>
The Linux kernel cannot migrate pages used by the kernel. As a result,
hotpluggable memory used by the kernel cannot be hot-removed. To solve this
problem, the basic idea is to prevent memblock from allocating hotpluggable
memory for the kernel at early boot time, and to arrange all hotpluggable
memory in the ACPI SRAT (System Resource Affinity Table) as ZONE_MOVABLE
when initializing zones.
In the previous patches, we have marked hotpluggable memory regions with the
MEMBLOCK_HOTPLUG flag in memblock.memory.
In this patch, we make memblock skip these hotpluggable memory regions in
the default top-down allocation function if the movable_node boot option is
specified.
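To make the effect concrete, here is a small usage sketch: once movable_node
is enabled, an early top-down walk of the free ranges simply never sees
hotpluggable regions (the loop body and variable names below are made up for
illustration):
u64 i;
phys_addr_t this_start, this_end;
/* With movable_node enabled, hotpluggable regions are never returned. */
for_each_free_mem_range_reverse(i, MAX_NUMNODES, &this_start, &this_end, NULL)
	pr_debug("candidate range: [%pa-%pa]\n", &this_start, &this_end);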
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Zhang Yanfei <[email protected]>
---
include/linux/memblock.h | 18 ++++++++++++++++++
mm/memblock.c | 12 ++++++++++++
mm/memory_hotplug.c | 1 +
3 files changed, 31 insertions(+), 0 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 97480d3..bfc1dba 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -47,6 +47,10 @@ struct memblock {
extern struct memblock memblock;
extern int memblock_debug;
+#ifdef CONFIG_MOVABLE_NODE
+/* If movable_node boot option specified */
+extern bool movable_node_enabled;
+#endif /* CONFIG_MOVABLE_NODE */
#define memblock_dbg(fmt, ...) \
if (memblock_debug) printk(KERN_INFO pr_fmt(fmt), ##__VA_ARGS__)
@@ -65,6 +69,20 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
void memblock_trim_memory(phys_addr_t align);
int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
int memblock_clear_hotplug(phys_addr_t base, phys_addr_t size);
+#ifdef CONFIG_MOVABLE_NODE
+static inline bool memblock_is_hotpluggable(struct memblock_region *m)
+{
+ return m->flags & MEMBLOCK_HOTPLUG;
+}
+
+static inline bool movable_node_is_enabled(void)
+{
+ return movable_node_enabled;
+}
+#else
+static inline bool memblock_is_hotpluggable(struct memblock_region *m){ return false; }
+static inline bool movable_node_is_enabled(void) { return false; }
+#endif
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int memblock_search_pfn_nid(unsigned long pfn, unsigned long *start_pfn,
diff --git a/mm/memblock.c b/mm/memblock.c
index 7de9c76..7f69012 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -39,6 +39,9 @@ struct memblock memblock __initdata_memblock = {
};
int memblock_debug __initdata_memblock;
+#ifdef CONFIG_MOVABLE_NODE
+bool movable_node_enabled __initdata_memblock = false;
+#endif
static int memblock_can_resize __initdata_memblock;
static int memblock_memory_in_slab __initdata_memblock = 0;
static int memblock_reserved_in_slab __initdata_memblock = 0;
@@ -819,6 +822,11 @@ void __init_memblock __next_free_mem_range(u64 *idx, int nid,
* @out_nid: ptr to int for nid of the range, can be %NULL
*
* Reverse of __next_free_mem_range().
+ *
+ * Linux kernel cannot migrate pages used by itself. Memory hotplug users won't
+ * be able to hot-remove hotpluggable memory used by the kernel. So this
+ * function skips hotpluggable regions if needed when allocating memory for the
+ * kernel.
*/
void __init_memblock __next_free_mem_range_rev(u64 *idx, int nid,
phys_addr_t *out_start,
@@ -843,6 +851,10 @@ void __init_memblock __next_free_mem_range_rev(u64 *idx, int nid,
if (nid != MAX_NUMNODES && nid != memblock_get_region_node(m))
continue;
+ /* skip hotpluggable memory regions if needed */
+ if (movable_node_is_enabled() && memblock_is_hotpluggable(m))
+ continue;
+
/* scan areas before each reservation for intersection */
for ( ; ri >= 0; ri--) {
struct memblock_region *r = &rsv->regions[ri];
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 8c91d0a..729a2d8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1436,6 +1436,7 @@ static int __init cmdline_parse_movable_node(char *p)
* the kernel away from hotpluggable memory.
*/
memblock_set_bottom_up(true);
+ movable_node_enabled = true;
#else
pr_warn("movable_node option not supported\n");
#endif
--
1.7.1
From: Tang Chen <[email protected]>
If users specify the original movablecore=nn@ss boot option, the kernel will
arrange [ss, ss+nn) as ZONE_MOVABLE. The kernelcore=nn@ss boot option is similar
except it specifies ZONE_NORMAL ranges.
Now, if users specify "movable_node" on the kernel command line, the kernel
will arrange hotpluggable memory in the SRAT as ZONE_MOVABLE. And if users
do this, all the other movablecore=nn@ss and kernelcore=nn@ss options should
be ignored.
For those who don't want this, just specify nothing. The kernel will act as
before.
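To make the priority rule concrete, some illustrative command lines (the
sizes below are made up):
movable_node -> hotpluggable ranges from the SRAT become ZONE_MOVABLE
kernelcore=4G -> old behaviour: 4G stays usable for any allocation, the rest
becomes ZONE_MOVABLE
movable_node kernelcore=4G -> with this patch: kernelcore is ignored, the
SRAT hotplug info wins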
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Zhang Yanfei <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
---
mm/page_alloc.c | 28 ++++++++++++++++++++++++++--
1 files changed, 26 insertions(+), 2 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dd886fa..768ea0e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5021,9 +5021,33 @@ static void __init find_zone_movable_pfns_for_nodes(void)
nodemask_t saved_node_state = node_states[N_MEMORY];
unsigned long totalpages = early_calculate_totalpages();
int usable_nodes = nodes_weight(node_states[N_MEMORY]);
+ struct memblock_type *type = &memblock.memory;
+
+ /* Need to find movable_zone earlier when movable_node is specified. */
+ find_usable_zone_for_movable();
+
+ /*
+ * If movable_node is specified, ignore kernelcore and movablecore
+ * options.
+ */
+ if (movable_node_is_enabled()) {
+ for (i = 0; i < type->cnt; i++) {
+ if (!memblock_is_hotpluggable(&type->regions[i]))
+ continue;
+
+ nid = type->regions[i].nid;
+
+ usable_startpfn = PFN_DOWN(type->regions[i].base);
+ zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
+ min(usable_startpfn, zone_movable_pfn[nid]) :
+ usable_startpfn;
+ }
+
+ goto out2;
+ }
/*
- * If movablecore was specified, calculate what size of
+ * If movablecore=nn[KMG] was specified, calculate what size of
* kernelcore that corresponds so that memory usable for
* any allocation type is evenly spread. If both kernelcore
* and movablecore are specified, then the value of kernelcore
@@ -5049,7 +5073,6 @@ static void __init find_zone_movable_pfns_for_nodes(void)
goto out;
/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
- find_usable_zone_for_movable();
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
restart:
@@ -5140,6 +5163,7 @@ restart:
if (usable_nodes && required_kernelcore > usable_nodes)
goto restart;
+out2:
/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
for (nid = 0; nid < MAX_NUMNODES; nid++)
zone_movable_pfn[nid] =
--
1.7.1
From: Yasuaki Ishimatsu <[email protected]>
If the system can create a movable node, in which all of the node's memory
is allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
the node's pg_data_t. So invoke memblock_alloc_nid(...MAX_NUMNODES) again to
retry when the first allocation fails. Otherwise, the system could fail to
boot.
(We don't use memblock_alloc_try_nid() to retry because in this function,
if the allocation fails, it will panic the system.)
The node_data could be on a hotpluggable node, and so could the page tables
and vmemmap. But for now, doing so would break the memory hot-remove path.
A node could have several memory devices, and the device that holds the node
data should be hot-removed last. But at the NUMA level, we don't know which
memory_block (/sys/devices/system/node/nodeX/memoryXXX) belongs to which
memory device. We only have nodes, so we can only do node hotplug.
But in virtualization, developers are now developing memory hotplug in qemu,
which supports hotplugging a single memory device. So whole-node hotplug will
not satisfy virtualization users.
So at last, we concluded that we'd better do memory hotplug and the local
node things (node data, page tables, vmemmap, ...) in two steps.
Please refer to https://lkml.org/lkml/2013/6/19/73
For now, we put the node_data of a movable node on another node, and will
improve it in the future.
Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Jiang Liu <[email protected]>
Signed-off-by: Zhang Yanfei <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
Acked-by: Toshi Kani <[email protected]>
---
arch/x86/mm/numa.c | 11 ++++++++---
1 files changed, 8 insertions(+), 3 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 24aec58..e17db5d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -211,9 +211,14 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
*/
nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
if (!nd_pa) {
- pr_err("Cannot find %zu bytes in node %d\n",
- nd_size, nid);
- return;
+ pr_warn("Cannot find %zu bytes in node %d, so try other nodes",
+ nd_size, nid);
+ nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES,
+ MAX_NUMNODES);
+ if (!nd_pa) {
+ pr_err("Cannot find %zu bytes in any node\n", nd_size);
+ return;
+ }
}
nd = __va(nd_pa);
--
1.7.1
Hello tejun,
On 10/14/2013 11:19 PM, Tejun Heo wrote:
> Hey,
>
> On Mon, Oct 14, 2013 at 11:06:14PM +0800, Zhang Yanfei wrote:
>> a little difference here, consider a 16-GB node. If we parse SRAT earlier,
>> and still use the top-down allocation, and kernel image is loaded at 16MB,
>> we reserve other nodes but this 16GB node that kernel resides in is used
>> for boot-up allocation. So the page table is allocated from 16GB to 0.
>> The page table is allocated on top of the the memory as possible.
>>
>> But if we use this approach, no matter how large the page table is, we
>> allocate the page table in low memory which is the case that hpa concerns
>> about the DMA.
>
> Yeah, sure there will be cases where parsing SRAT would be better.
>
> 4k mapping is in use, which is mostly for debugging && memory map is
> composed such that the highest non-hotpluggable address is high
> enough.
>
> It's going in circles again but my point has always been that the
> above in itself don't seem to be substantial enough to justify
> putting, say, initrd loading before page table init.
>
> Later some argued that bringing SRAT parsing earlier could help
> implementing finer grained hotplug, which would be an acceptable path
> to follow; however, that doesn't turn out to be true either.
>
> * Again, it matter if and only if 4k mapping is in use. Do we even
> care?
>
> * SRAT isn't enough. The whole device tree needs to be parsed to put
> page tables into local device. It's a lot of churn, including major
> updates to page table allocation, just to support debug 4k mapping
> cases. Doesn't make much sense to me.
>
> So, SRAT's usefulness seems extremely limited - it helps if the user
> wants to use debug features along with memory hotplug on an extreme
> large machine with devices which have low DMA limit, and that's it.
> To me, it seems to be a poor argument. Just declaring memory hotplug
> works iff large kernel mapping is in use feels like a pretty good
> trade-off to me, and I have no idea why I have to repeat all this,
> which I've written multiple times already, in a private thread again.
>
> If the thread is to make progress, one has to provide counter
> arguments to the points raised. It feels like I'm going in circle
> again. The exact same content I wrote above has been repeated
> multiple times in the past discussions and I'm getting tired of doing
> it without getting any actual response.
>
> When replying, please restore cc's and keep the whole body.
>
Thanks for the whole explanation again. I was just raising some arguments
that other guys raised before. I agree with what you said above and have
already put some of it into the patch 4 description in the v7 version.
Now could you please help review part 2? As I said before, no matter how
we implement part 1, part 2 is kind of independent.
--
Thanks.
Zhang Yanfei
On Mon, Oct 14, 2013 at 8:34 AM, Zhang Yanfei <[email protected]> wrote:
> Hello tejun,
>
> On 10/14/2013 11:19 PM, Tejun Heo wrote:
>> Hey,
>>
>> On Mon, Oct 14, 2013 at 11:06:14PM +0800, Zhang Yanfei wrote:
>>> a little difference here, consider a 16-GB node. If we parse SRAT earlier,
>>> and still use the top-down allocation, and kernel image is loaded at 16MB,
>>> we reserve other nodes but this 16GB node that kernel resides in is used
>>> for boot-up allocation. So the page table is allocated from 16GB to 0.
>>> The page table is allocated on top of the the memory as possible.
>>>
>>> But if we use this approach, no matter how large the page table is, we
>>> allocate the page table in low memory which is the case that hpa concerns
>>> about the DMA.
>>
>> Yeah, sure there will be cases where parsing SRAT would be better.
>>
>> 4k mapping is in use, which is mostly for debugging && memory map is
>> composed such that the highest non-hotpluggable address is high
>> enough.
>>
>> It's going in circles again but my point has always been that the
>> above in itself don't seem to be substantial enough to justify
>> putting, say, initrd loading before page table init.
>>
>> Later some argued that bringing SRAT parsing earlier could help
>> implementing finer grained hotplug, which would be an acceptable path
>> to follow; however, that doesn't turn out to be true either.
>>
>> * Again, it matter if and only if 4k mapping is in use. Do we even
>> care?
>>
>> * SRAT isn't enough. The whole device tree needs to be parsed to put
>> page tables into local device. It's a lot of churn, including major
>> updates to page table allocation, just to support debug 4k mapping
>> cases. Doesn't make much sense to me.
>>
>> So, SRAT's usefulness seems extremely limited - it helps if the user
>> wants to use debug features along with memory hotplug on an extreme
>> large machine with devices which have low DMA limit, and that's it.
>> To me, it seems to be a poor argument. Just declaring memory hotplug
>> works iff large kernel mapping is in use feels like a pretty good
>> trade-off to me, and I have no idea why I have to repeat all this,
>> which I've written multiple times already, in a private thread again.
>>
>> If the thread is to make progress, one has to provide counter
>> arguments to the points raised. It feels like I'm going in circle
>> again. The exact same content I wrote above has been repeated
>> multiple times in the past discussions and I'm getting tired of doing
>> it without getting any actual response.
The points for parsing SRAT early instead of Yanfei/Tang v7:
1. We just reached one unified path to setup page tables for 32bit,
64bit and xen or non xen after several years. We should not have add
another path for system
that support hotplug.
2. also we should avoid adding "movable_nodes" command line.
3. debug mapping 4k, and it is working all the way, why breaking it even for
memory hotplug path?
4. numa_meminfo now is static structure.
we have no reason that we can not parse SRAT etc to fill that struct.
5. for device tree, i assume that we could do same like srat parsing to find out
numa to fill the numa_meminfo early. or with help of BRK.
6. in the long run, We should rework our NUMA booting:
a. boot system with boot numa nodes early only.
b. in later init stage or user space, init other nodes
RAM/CPU/PCI...in parallel.
that will reduce boot time for 8 sockets/32 sockets dramatically.
We will need to parse srat table early so could avoid init memory for
non-boot nodes.
Yinghai
Hello, Yinghai.
On Mon, Oct 14, 2013 at 12:34:49PM -0700, Yinghai Lu wrote:
> The points for parsing SRAT early instead of Yanfei/Tang v7:
>
> 1. We just reached one unified path to setup page tables for 32bit,
> 64bit and xen or non xen after several years. We should not have add
> another path for system
> that support hotplug.
The separate code path we're talking about is tiny. It's just an
extra function for page table allocation and another for memblock
allocation which is symmetric to the existing one. Sure, there are
benefits to not diverging code paths but these are fairly trivial in
terms of maintenance overhead and test coverage.
> 2. also we should avoid adding "movable_nodes" command line.
Can we? What about the pgdat? We're allocating them off-node with
movable_nodes which can't be the default behavior.
> 3. debug mapping 4k, and it is working all the way, why breaking it even for
> memory hotplug path?
If it comes for free, sure, no reason to break it. On the other hand,
if maintaining it fully with a niche feature costs overhead, it's
something to be traded off. It's not like using 4k page mapping with
bottom-up allocation will be immediately broken either. It might
affect devices which can't DMA to higher addresses on gigantic
machines under debug configs. It's quite a corner case.
> 4. numa_meminfo now is static structure.
> we have no reason that we can not parse SRAT etc to fill that struct.
Sure, there's no reason we can't. The whole point is that the
benefits aren't strong enough. We don't do things just because we
can.
> 5. for device tree, i assume that we could do same like srat parsing to find out
> numa to fill the numa_meminfo early. or with help of BRK.
Digesting device tree involves a lot more complexity. The whole
reason why things like SRAT are broken into tables in the first place.
We'll be basically pulling in huge chunk of ACPICA into early boot.
Again, justfications. The *only* thing which may benefit from that
are debug setups. We'll have to pull in a lot of complexity before
page table setup and modify page table allocation to be
memory-device-specific just for debug configs, which is not a good
trade-off. Benefit / cost ratio doesn't make any sense.
> 6. in the long run, We should rework our NUMA booting:
> a. boot system with boot numa nodes early only.
> b. in later init stage or user space, init other nodes
> RAM/CPU/PCI...in parallel.
> that will reduce boot time for 8 sockets/32 sockets dramatically.
>
> We will need to parse srat table early so could avoid init memory for
> non-boot nodes.
Among the six you listed, this one sounds somewhat valid but still
assuming huge page, what difference does it make? We're just talking
about page table alloc / init and ACPI init. If you wanna speed up
huge NUMA machine booting and chop down memory init per-NUMA, sure,
move those pieces to later stages. You can init the amount necessary
during early boot and then bring up the rest later on. I don't see
why that'd require parsing SRAT. In fact, I think there'll be more
cases where you want to actively ignore NUMA mapping during early
boot. What if the system maps low memory to a non-boot numa node?
Optimizing NUMA boot just requires moving the heavy lifting to
appropriate NUMA nodes. It doesn't require that early boot phase
should strictly follow NUMA node boundaries.
Thanks.
--
tejun
On Mon, Oct 14, 2013 at 1:04 PM, Tejun Heo <[email protected]> wrote:
>> 6. in the long run, We should rework our NUMA booting:
>> a. boot system with boot numa nodes early only.
>> b. in later init stage or user space, init other nodes
>> RAM/CPU/PCI...in parallel.
>> that will reduce boot time for 8 sockets/32 sockets dramatically.
>>
>> We will need to parse srat table early so could avoid init memory for
>> non-boot nodes.
>
> Among the six you listed, this one sounds somewhat valid but still
> assuming huge page, what difference does it make? We're just talking
> about page table alloc / init and ACPI init. If you wanna speed up
> huge NUMA machine booting and chop down memory init per-NUMA, sure,
> move those pieces to later stages. You can init the amount necessary
> during early boot and then bring up the rest later on. I don't see
> why that'd require parsing SRAT.
The problem is how to define "amount necessary". If we can parse srat early,
then we could just map RAM for all boot nodes one time, instead of try some
small and then after SRAT table, expand it cover non-boot nodes.
To keep non-boot numa node hot-removable. we need to page table (and other
that we allocate during boot stage) on ram of non boot nodes, or their
local node ram. (share page table always should be on boot nodes).
> In fact, I think there'll be more
> cases where you want to actively ignore NUMA mapping during early
> boot. What if the system maps low memory to a non-boot numa node?
Then we treat that non-boot NUMA node as one of the boot nodes, and it
cannot be hot-removed.
Actually that is a BIOS or firmware bug; they should set the memory address
decoders correctly.
>
> Optimizing NUMA boot just requires moving the heavy lifting to
> appropriate NUMA nodes. It doesn't require that early boot phase
> should strictly follow NUMA node boundaries.
At end of day, I like to see all numa system (ram/cpu/pci) could have
non boot nodes to be hot-removed logically. with any boot command
line.
Thanks
Yinghai
On 10/14/2013 12:34 PM, Yinghai Lu wrote:
>
> The points for parsing SRAT early instead of Yanfei/Tang v7:
> 1. We just reached one unified path to setup page tables for 32bit,
> 64bit and xen or non xen after several years. We should not have add
> another path for system
> that support hotplug.
>
> 2. also we should avoid adding "movable_nodes" command line.
>
> 3. debug mapping 4k, and it is working all the way, why breaking it even for
> memory hotplug path?
>
> 4. numa_meminfo now is static structure.
> we have no reason that we can not parse SRAT etc to fill that struct.
>
> 5. for device tree, i assume that we could do same like srat parsing to find out
> numa to fill the numa_meminfo early. or with help of BRK.
>
> 6. in the long run, We should rework our NUMA booting:
> a. boot system with boot numa nodes early only.
> b. in later init stage or user space, init other nodes
> RAM/CPU/PCI...in parallel.
> that will reduce boot time for 8 sockets/32 sockets dramatically.
>
> We will need to parse srat table early so could avoid init memory for
> non-boot nodes.
>
I really like the long-term plan (and, I might want to add, the above
writeup.)
However, I don't understand how we can avoid #2, given that it is
fundamentally a sysadmin-driven tradeoff between performance and
reliability.
-hpa
On Mon, Oct 14, 2013 at 1:35 PM, H. Peter Anvin <[email protected]> wrote:
...
>> 2. also we should avoid adding "movable_nodes" command line.
...
>> 6. in the long run, We should rework our NUMA booting:
>> a. boot system with boot numa nodes early only.
>> b. in later init stage or user space, init other nodes
>> RAM/CPU/PCI...in parallel.
>> that will reduce boot time for 8 sockets/32 sockets dramatically.
>>
>> We will need to parse srat table early so could avoid init memory for
>> non-boot nodes.
>>
>
> I really like the long-term plan (and, I might want to add, the above
> writeup.)
>
> However, I don't understand how we can avoid #2, given that it is
> fundamentally a sysadmin-driven tradeoff between performance and
> reliability.
If we make all NUMA systems support hot-removing nodes logically (like
booting the system with node0 and hot-adding the other nodes one by one),
we should be able to hot-remove them later.
Thanks
Yinghai
On 10/14/2013 01:37 PM, Yinghai Lu wrote:
>>
>> Optimizing NUMA boot just requires moving the heavy lifting to
>> appropriate NUMA nodes. It doesn't require that early boot phase
>> should strictly follow NUMA node boundaries.
>
> At end of day, I like to see all numa system (ram/cpu/pci) could have
> non boot nodes to be hot-removed logically. with any boot command
> line.
>
I don't think that is realistic without hardware support, simply because
all it takes is a single page of kernel locked memory to prevent a page
from being removed. The only realistic way around that, I believe, is
to remove the identity-mapping in the kernel, but it still has all kinds
of funnies involving devices and DMA.
-hpa
On 10/14/2013 01:42 PM, Yinghai Lu wrote:
>>
>> However, I don't understand how we can avoid #2, given that it is
>> fundamentally a sysadmin-driven tradeoff between performance and
>> reliability.
>
> If we make all numa systems support nodes hot-remove logically.
> like we boot system with node0, and hot add other nodes one by one,
> we should hot remove them later.
>
No, it doesn't work that way for memory. You can't do nonmovable
allocations from a node that you may need to yank, unless you can
migrate that memory node transparently (which hardware can do.)
-hpa
Hello,
On Mon, Oct 14, 2013 at 01:37:20PM -0700, Yinghai Lu wrote:
> The problem is how to define "amount necessary". If we can parse srat early,
> then we could just map RAM for all boot nodes one time, instead of try some
> small and then after SRAT table, expand it cover non-boot nodes.
Wouldn't that amount be fairly static and restricted? If you wanna
chunk memory init anyway, there's no reason to init more than
necessary until smp stage is reached. The more you do early, the more
serialized you're, so wouldn't the goal naturally be initing the
minimum possible?
> To keep non-boot numa node hot-removable. we need to page table (and other
> that we allocate during boot stage) on ram of non boot nodes, or their
> local node ram. (share page table always should be on boot nodes).
The above assumes the followings,
* 4k page mappings. It'd be nice to keep everything working for 4k
but just following SRAT isn't enough. What if the non-hotpluggable
boot node doesn't stretch high enough and page table reaches down
too far? This won't be an optional behavior, so it is actually
*likely* to happen on certain setups.
* Memory hotplug is at NUMA node granularity instead of device.
> > Optimizing NUMA boot just requires moving the heavy lifting to
> > appropriate NUMA nodes. It doesn't require that early boot phase
> > should strictly follow NUMA node boundaries.
>
> At end of day, I like to see all numa system (ram/cpu/pci) could have
> non boot nodes to be hot-removed logically. with any boot command
> line.
I suppose you mean "without any boot command line"? Sure, but, first
of all, there is a clear performance trade-off, and, secondly, don't
we want something finer grained? Why would we want to that per-NUMA
node, which is extremely coarse?
Thanks.
--
tejun
Hello tejun, peter and yinghai
On 10/15/2013 04:55 AM, Tejun Heo wrote:
> Hello,
>
> On Mon, Oct 14, 2013 at 01:37:20PM -0700, Yinghai Lu wrote:
>> The problem is how to define "amount necessary". If we can parse srat early,
>> then we could just map RAM for all boot nodes one time, instead of try some
>> small and then after SRAT table, expand it cover non-boot nodes.
>
> Wouldn't that amount be fairly static and restricted? If you wanna
> chunk memory init anyway, there's no reason to init more than
> necessary until smp stage is reached. The more you do early, the more
> serialized you're, so wouldn't the goal naturally be initing the
> minimum possible?
>
>> To keep non-boot numa node hot-removable. we need to page table (and other
>> that we allocate during boot stage) on ram of non boot nodes, or their
>> local node ram. (share page table always should be on boot nodes).
>
> The above assumes the followings,
>
> * 4k page mappings. It'd be nice to keep everything working for 4k
> but just following SRAT isn't enough. What if the non-hotpluggable
> boot node doesn't stretch high enough and page table reaches down
> too far? This won't be an optional behavior, so it is actually
> *likely* to happen on certain setups.
>
> * Memory hotplug is at NUMA node granularity instead of device.
>
>>> Optimizing NUMA boot just requires moving the heavy lifting to
>>> appropriate NUMA nodes. It doesn't require that early boot phase
>>> should strictly follow NUMA node boundaries.
>>
>> At end of day, I like to see all numa system (ram/cpu/pci) could have
>> non boot nodes to be hot-removed logically. with any boot command
>> line.
>
> I suppose you mean "without any boot command line"? Sure, but, first
> of all, there is a clear performance trade-off, and, secondly, don't
> we want something finer grained? Why would we want to that per-NUMA
> node, which is extremely coarse?
>
Both ways seem OK enough *currently*. But what tejun always emphasizes
is the trade-off, or the benefit / cost ratio.
Yinghai and peter insist on the long-term plan. But it seems there are
currently no actual requirements or plans that *must* parse SRAT earlier,
compared to the current approach in this patchset, right?
Should we follow "Make it work first and optimize/beautify it later"?
I think if we ever have a scenario that must parse SRAT earlier, tejun
will have no objection to it.
--
Thanks.
Zhang Yanfei
On Mon, Oct 14, 2013 at 1:55 PM, Tejun Heo <[email protected]> wrote:
> Hello,
>
> On Mon, Oct 14, 2013 at 01:37:20PM -0700, Yinghai Lu wrote:
>> The problem is how to define "amount necessary". If we can parse srat early,
>> then we could just map RAM for all boot nodes one time, instead of try some
>> small and then after SRAT table, expand it cover non-boot nodes.
>
> Wouldn't that amount be fairly static and restricted? If you wanna
> chunk memory init anyway, there's no reason to init more than
> necessary until smp stage is reached. The more you do early, the more
> serialized you're, so wouldn't the goal naturally be initing the
> minimum possible?
Even if we try to map a minimum range instead of the whole range on the
boot node, without parsing SRAT first that minimum range could still
cross node boundaries.
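To make that concrete with toy numbers (everything below is made up for
illustration, not taken from any real SRAT or from the patches): if node0
covers the first 4GB and the kernel image happens to be loaded near its
top, a fixed 1GB early mapping starting at the image already spills into
node1.

#include <stdio.h>

int main(void)
{
	unsigned long long node0_end   = 4ULL << 30;			/* node0 covers [0, 4GB)     */
	unsigned long long kernel_base = node0_end - (128ULL << 20);	/* image loaded near its top */
	unsigned long long early_map   = 1ULL << 30;			/* early "minimum" mapping   */
	unsigned long long map_end     = kernel_base + early_map;

	printf("early mapping ends at %#llx, node0 ends at %#llx -> %s\n",
	       map_end, node0_end,
	       map_end > node0_end ? "crosses into node1" : "stays on node0");
	return 0;
}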
>
>> To keep non-boot numa node hot-removable. we need to page table (and other
>> that we allocate during boot stage) on ram of non boot nodes, or their
>> local node ram. (share page table always should be on boot nodes).
>
> The above assumes the followings,
>
> * 4k page mappings. It'd be nice to keep everything working for 4k
> but just following SRAT isn't enough. What if the non-hotpluggable
> boot node doesn't stretch high enough and page table reaches down
> too far? This won't be an optional behavior, so it is actually
> *likely* to happen on certain setups.
No, this does not assume 4k pages. Even if we are using 1GB mappings, one
node could still take 512G of RAM, and that node's page table then fits in
a single 4k page which can be kept in local node RAM.
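Spelling the arithmetic out with toy numbers (for illustration only): with
1GB mappings, even a 512GB node needs just 512 PUD entries, i.e. a single
4KB page-table page, and that one page can live in the node's own RAM.

#include <stdio.h>

int main(void)
{
	unsigned long long node_ram   = 512ULL << 30;	/* 512GB of node-local RAM   */
	unsigned long long map_unit   = 1ULL << 30;	/* 1GB covered per PUD entry */
	unsigned long long entries    = node_ram / map_unit;
	unsigned long long table_size = entries * 8;	/* 8 bytes per 64-bit entry  */

	printf("%llu entries -> %llu bytes, i.e. one 4KB page-table page\n",
	       entries, table_size);
	return 0;
}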
>
> * Memory hotplug is at NUMA node granularity instead of device.
Yes.
>
>> > Optimizing NUMA boot just requires moving the heavy lifting to
>> > appropriate NUMA nodes. It doesn't require that early boot phase
>> > should strictly follow NUMA node boundaries.
>>
>> At end of day, I like to see all numa system (ram/cpu/pci) could have
>> non boot nodes to be hot-removed logically. with any boot command
>> line.
>
> I suppose you mean "without any boot command line"? Sure, but, first
> of all, there is a clear performance trade-off, and, secondly, don't
> we want something finer grained? Why would we want to that per-NUMA
> node, which is extremely coarse?
On x86 systems with new Intel CPUs the memory controller is built into
the CPU, so we can have hotplug modules (socket plus memory), and each
such module will be serviced as one single unit - much like PCIe cards
are hotpluggable nowadays.
I don't see where the "clear performance trade-off" is.
Yinghai
* H. Peter Anvin <[email protected]> wrote:
> On 10/14/2013 01:37 PM, Yinghai Lu wrote:
> >>
> >> Optimizing NUMA boot just requires moving the heavy lifting to
> >> appropriate NUMA nodes. It doesn't require that early boot phase
> >> should strictly follow NUMA node boundaries.
> >
> > At end of day, I like to see all numa system (ram/cpu/pci) could have
> > non boot nodes to be hot-removed logically. with any boot command
> > line.
> >
>
> I don't think that is realistic without hardware support, simply because
> all it takes is a single page of kernel locked memory to prevent a page
> from being removed. The only realistic way around that, I believe, is
> to remove the identity-mapping in the kernel, but it still has all kinds
> of funnies involving devices and DMA.
We played with virtual kernel memory a decade ago and it's doable. The
only complication was DMA from the kernel stack - that was done with some
really broken old ISA drivers IIRC. Those should be a distant memory, in
terms of practical impact.
So if anyone can implement it using huge pages, with a really fast __va()
and __pa() implementation, then it might be possible. But that's a pretty
major surgery on x86.
Thanks,
Ingo
Hello, Yinghai.
On Mon, Oct 14, 2013 at 07:25:55PM -0700, Yinghai Lu wrote:
> > Wouldn't that amount be fairly static and restricted? If you wanna
> > chunk memory init anyway, there's no reason to init more than
> > necessary until smp stage is reached. The more you do early, the more
> > serialized you're, so wouldn't the goal naturally be initing the
> > minimum possible?
>
> Even we try to go minimum range instead of range that whole range on boot node,
> without parsing srat at first, the minimum range could be crossed the boundary
> of nodes.
I guess it depends on how big the minimum we're talking about is, but
let's say it isn't multiple orders of magnitude larger than the kernel
image. That shouldn't be a problem then, no?
The thing is I don't really see how SRAT would help much. I don't
know how the existing systems are configured, but it's natural to
assume that hardware-wise per-stick removal will be supported, right?
There's no reason why memory sticks of the first NUMA node can't be
hot-unplugged. Likely we'll end up with an SRAT map which splits the
first node into two pieces - a smaller leading part which can't be
removed because firmware and such depend on it, and a larger trailing
chunk which can be removed. Allocating early non-migratable stuff near
the kernel image, which can't be moved without an additional layer of
indirection anyway, would be a fairly good choice regardless, right?
Even if we parse SRAT early, we can't unconditionally make the kernel
allocate early stuff from node0. We do not know what SRAT will look
like in future configurations. If what the hotplug people are saying
is true, the first non-hotpluggable node being relatively small seems
actually quite likely. I don't think we want to factor all those
variables into very early bootstrap stages, and it's not like we're
talking about gigabytes of memory - e.g. bring up the first half gig or
one gig and go from there. That part of memory is highly unlikely to
be hot-unpluggable anyway.
> > * 4k page mappings. It'd be nice to keep everything working for 4k
> > but just following SRAT isn't enough. What if the non-hotpluggable
> > boot node doesn't stretch high enough and page table reaches down
> > too far? This won't be an optional behavior, so it is actually
> > *likely* to happen on certain setups.
>
> no, do not assume 4k page. even we are using 1GB mapping, we will still have
> chance to have one node to take 512G RAM, that means we can have one 4k page
> on local node ram.
Sure, the kernel image can also be located such that its last page
spills over to the next node. No matter what we do, without an
extra layer of indirection, this can't be a complete solution. Think
about the usual node configuration and where the kernel image is usually
loaded. As long as the page table is relatively small, it is highly
unlikely to increase the chance of such issues.
Again, it's all about benefit and cost. Sure, parsing SRAT early will
definitely decrease the chance of such issues. However, as long as
the page table is small enough, just allocating it on top of the
kernel isn't significantly worse. Also, following SRAT earlier
not only increases complexity in vulnerable stages of boot but also
carries higher risk for existing and future configurations,
depending on what their SRAT looks like, if the new behavior is applied
unconditionally. And if we decide to make early SRAT usage conditional,
that's a *LOT* more conditional code than what's added by bottom-up
allocation.
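As a rough, self-contained sketch of what "bottom-up allocation" means
here (a toy model - the region layout, the kernel-end address and the
helper name are all made up; this is not the actual memblock code): free
ranges are searched from low to high, but never below the end of the
kernel image, so early allocations land right on top of it.

#include <stdio.h>

struct region { unsigned long base, size; };

/* made-up free regions and kernel image end, for illustration only */
static struct region free_regions[] = {
	{ 0x00100000UL, 0x00f00000UL },		/*  1MB .. 16MB */
	{ 0x04000000UL, 0x3c000000UL },		/* 64MB ..  1GB */
};
static unsigned long kernel_end = 0x05000000UL;	/* pretend the image ends at 80MB */

static unsigned long alloc_bottom_up(unsigned long size)
{
	for (unsigned int i = 0; i < sizeof(free_regions) / sizeof(free_regions[0]); i++) {
		unsigned long start = free_regions[i].base;

		if (start < kernel_end)
			start = kernel_end;	/* never reach below the kernel image */
		if (start + size <= free_regions[i].base + free_regions[i].size)
			return start;		/* lowest fit above the image wins */
	}
	return 0;				/* no room */
}

int main(void)
{
	printf("early allocation placed at %#lx\n", alloc_bottom_up(0x1000));
	return 0;
}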
> On x86 system with intel new cpus there is memory controller built-in.,
> could have hotplug modules (with socket and memory) and those hotplug modules
> will be serviced as one single point. Just nowadays like we have pcie
> card hotplugable.
>
> I don't see where is the " a clear performance trade-off".
Because kernel data structures have to be allocated off-node.
Thanks.
--
tejun
On 10/14/2013 11:50 PM, Ingo Molnar wrote:
>
> So if anyone can implement it using huge pages, with a really fast __va()
> and __pa() implementation, then it might be possible. But that's a pretty
> major surgery on x86.
>
Well, we already *have* a way to deal with that for Xen (by inserting an
otherwise nonexistent logical level.) I'm wondering if those interfaces
could be (ab)used for this as well, or if that is functionally
equivalent to saying that this should be done in a hypervisor.
-hpa
* H. Peter Anvin <[email protected]> wrote:
> On 10/14/2013 11:50 PM, Ingo Molnar wrote:
> >
> > So if anyone can implement it using huge pages, with a really fast
> > __va() and __pa() implementation, then it might be possible. But
> > that's a pretty major surgery on x86.
>
> Well, we already *have* a way to deal with that for Xen (by inserting an
> otherwise nonexistent logical level.) I'm wondering if those interfaces
> could be (ab)used for this as well, or if that is functionally
> equivalent to saying that this should be done in a hypervisor.
It's not _that_ complex, and it does not need a separate security layer.
I have this distinct memory of seeing working patches, more than a decade
ago, that paged all of the kernel's data. It was all rather disgusting,
because those patches worked at the 4K level - but if a 2MB-granular
solution can be found in an elegant fashion then I think we could
reconsider.
It definitely wasn't hypervisor thick. It probably needs a good hash for
virtual address transformations, and all DMA has to be managed [these days
we do that via the IOMMU anyway] but that's pretty much all - kernel
virtual memory is reconfigured extremely rarely, so all that could be sped
up for reads and mirrored per node and kept lockless, etc. etc. [Plus a
metric ton of details.]
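As a very rough illustration of the direction (a toy model with made-up
names and sizes, not a proposal for how __va()/__pa() would really be
implemented): give each 2MB chunk of kernel virtual space its own
translation entry instead of a single linear offset, so a chunk can be
remapped to different physical memory without changing its virtual
address.

#include <stdint.h>
#include <stdio.h>

#define CHUNK_SHIFT	21			/* 2MB granularity */
#define CHUNK_SIZE	(1ULL << CHUNK_SHIFT)
#define NR_CHUNKS	16			/* tiny address space, demo only */

/* per-2MB-chunk translation tables, updated whenever a chunk is (re)mapped */
static uint64_t va_to_pa_chunk[NR_CHUNKS];	/* indexed by virtual chunk number  */
static uint64_t pa_to_va_chunk[NR_CHUNKS];	/* indexed by physical chunk number */

static uint64_t toy_pa(uint64_t va)
{
	return va_to_pa_chunk[va >> CHUNK_SHIFT] + (va & (CHUNK_SIZE - 1));
}

static uint64_t toy_va(uint64_t pa)
{
	return pa_to_va_chunk[pa >> CHUNK_SHIFT] + (pa & (CHUNK_SIZE - 1));
}

int main(void)
{
	/* map virtual chunk 3 onto physical chunk 7 and record the inverse */
	va_to_pa_chunk[3] = 7ULL << CHUNK_SHIFT;
	pa_to_va_chunk[7] = 3ULL << CHUNK_SHIFT;

	uint64_t va = (3ULL << CHUNK_SHIFT) + 0x1234;

	printf("va %#llx -> pa %#llx -> back to va %#llx\n",
	       (unsigned long long)va,
	       (unsigned long long)toy_pa(va),
	       (unsigned long long)toy_va(toy_pa(va)));
	return 0;
}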
Thanks,
Ingo