This patch-set aims to solve some problems at system boot time
to enhance memory hotplug functionality.
[Background]
The Linux kernel cannot migrate pages used by the kernel because
of the kernel direct mapping. Since va = pa + PAGE_OFFSET, if the
physical address is changed, we cannot simply update the kernel
pagetable. On the contrary, we have to update all the pointers
pointing to the virtual address, which is very difficult to do.
In order to do memory hotplug, we should prevent the kernel to use
hotpluggable memory.
In ACPI, there is a table named SRAT(System Resource Affinity Table).
It contains system NUMA info (CPUs, memory ranges, PXM), and also a
flag field indicating which memory ranges are hotpluggable.
[Problem to be solved]
At the very early time when the system is booting, we use a bootmem
allocator, named memblock, to allocate memory for the kernel.
memblock will start to work before the kernel parse SRAT, which
means memblock won't know which memory is hotpluggable before SRAT
is parsed.
So at this time, memblock could allocate hotpluggable memory for
the kernel to use permanently. For example, the kernel may allocate
pagetables in hotpluggable memory, which cannot be freed when the
system is up.
So we have to prevent memblock allocating hotpluggable memory for
the kernel at the early boot time.
[Earlier solutions]
We have tried to parse SRAT earlier, before memblock is ready. To
do this, we also have to do ACPI_INITRD_TABLE_OVERRIDE earlier.
Otherwise the override tables won't be able to effect.
This is not that easy to do because memblock is ready before direct
mapping is setup. So Yinghai split the ACPI_INITRD_TABLE_OVERRIDE
procedure into two steps: find and copy. Please refer to the
following patch-set:
https://lkml.org/lkml/2013/6/13/587
To this solution, tj gave a lot of comments and the following
suggestions.
[Suggestion from tj]
tj mainly gave the following suggestions:
1. Necessary reordering is OK, but we should not rely on
reordering to achieve the goal because it makes the kernel
too fragile.
2. Memory allocated to kernel for temporary usage is OK because
it will be freed when the system is up. Doing relocation
for permanent allocated hotpluggable memory will make the
the kernel more robust.
3. Need to enhance memblock to discover and complain if any
hotpluggable memory is allocated to kernel.
After a long thinking, we choose not to do the relocation for
the following reasons:
1. It's easy to find out the allocated hotpluggable memory. But
memblock will merge the adjoined ranges owned by different users
and used for different purposes. It's hard to find the owners.
2. Different memory has different way to be relocated. I think one
function for each kind of memory will make the code too messy.
3. Pagetable could be in hotpluggable memory. Relocating pagetable
is too difficult and risky. We have to update all PUD, PMD pages.
And also, ACPI_INITRD_TABLE_OVERRIDE and parsing SRAT procedures
are not long after pagetable is initialized. If we relocate the
pagetable not long after it was initialized, the code will be
very ugly.
[Solution in this patch-set]
In this patch-set, we still do the reordering, but in a new way.
1. Improve memblock with flags, so that it is able to differentiate
memory regions for different usage. And also a MEMBLOCK_HOTPLUGGABLE
flag to mark hotpluggable memory.
(patch 1 ~ 3)
2. When memblock is ready (memblock_x86_fill() is called), initialize
acpi_gbl_root_table_list, fulfill all the ACPI tables' phys addrs.
Now, we have all the ACPI tables' phys addrs provided by firmware.
(patch 4 ~ 8)
3. Check if there is a SRAT in initrd file used to override the one
provided by firmware. If so, get its phys addr.
(patch 12)
4. If no override SRAT in initrd, get the phys addr of the SRAT
provided by firmware.
(patch 13)
Now, we have the phys addr of the to be used SRAT, the one in
initrd or the one in firmware.
5. Parse only the memory affinities in SRAT, find out all the
hotpluggable memory regions and reserve them in memblock with
MEMBLK_HOTPLUGGABLE flag.
(patch 14 ~ 15)
6. The kernel goes through the current path. Any other related parts,
such as ACPI_INITRD_TABLE_OVERRIDE path, the current parsing ACPI
tables pathes, global variable numa_meminfo, and so on, are not
modified. They work as before.
7. Free memory with MEMBLK_HOTPLUGGABLE flag when memory initialization
is finished.
(patch 16)
8. Introduce movablecore=acpi boot option to allow users to enable
and disable this functionality.
(patch 17 ~ 21)
And patch 9 ~ 11 fix some small problems.
In summary, in order to get hotpluggable memory info as early as possible,
this patch-set only parse memory affinities in SRAT one more time right
after memblock is ready, and leave all the other pathes untouched. With
the hotpluggable memory info, we can arrange hotpluggable memory in
ZONE_MOVABLE to prevent the kernel to use it.
Thanks. :)
Tang Chen (20):
acpi: Print Hot-Pluggable Field in SRAT.
memblock, numa: Introduce flag into memblock.
x86, acpi, numa, mem-hotplug: Introduce MEMBLK_HOTPLUGGABLE to
reserve hotpluggable memory.
acpi: Remove "continue" in macro INVALID_TABLE().
acpi: Introduce acpi_invalid_table() to check if a table is invalid.
x86, acpi: Split acpi_boot_table_init() into two parts.
x86, acpi: Initialize ACPI root table list earlier.
x86, acpi: Also initialize signature and length when parsing root
table.
x86: Make get_ramdisk_{image|size}() global.
earlycpio.c: Fix the confusing comment of find_cpio_data().
x86, acpi: Try to find if SRAT is overrided earlier.
x86, acpi: Try to find SRAT in firmware earlier.
x86, acpi, numa: Reserve hotpluggable memory at early time.
x86, acpi, numa: Don't reserve memory on nodes the kernel resides in.
x86, memblock, mem-hotplug: Free hotpluggable memory reserved by
memblock.
page_alloc, mem-hotplug: Improve movablecore to {en|dis}able using
SRAT.
x86, numa: Synchronize nid info in memblock.reserve with
numa_meminfo.
x86, numa: Save nid when reserve memory into memblock.reserved[].
x86, numa, acpi, memory-hotplug: Make movablecore=acpi have higher
priority.
doc, page_alloc, acpi, mem-hotplug: Add doc for movablecore=acpi boot
option.
Yasuaki Ishimatsu (1):
x86: get pg_data_t's memory from other node
Documentation/kernel-parameters.txt | 10 ++
arch/x86/include/asm/setup.h | 3 +
arch/x86/kernel/acpi/boot.c | 38 +++--
arch/x86/kernel/setup.c | 22 +++-
arch/x86/mm/numa.c | 55 +++++++-
arch/x86/mm/srat.c | 11 +-
drivers/acpi/acpica/tbutils.c | 48 ++++++-
drivers/acpi/acpica/tbxface.c | 34 +++++
drivers/acpi/osl.c | 262 ++++++++++++++++++++++++++++++++---
drivers/acpi/tables.c | 7 +-
include/acpi/acpixf.h | 6 +
include/linux/acpi.h | 21 +++-
include/linux/memblock.h | 9 ++
include/linux/memory_hotplug.h | 5 +
include/linux/mm.h | 9 ++
lib/earlycpio.c | 7 +-
mm/memblock.c | 90 ++++++++++--
mm/memory_hotplug.c | 42 ++++++-
mm/nobootmem.c | 3 +
mm/page_alloc.c | 44 ++++++-
20 files changed, 656 insertions(+), 70 deletions(-)
The Hot-Pluggable field in SRAT suggests if the memory could be
hotplugged while the system is running. Print it as well when
parsing SRAT will help users to know which memory is hotpluggable.
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
---
arch/x86/mm/srat.c | 11 +++++++----
1 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index cdd0da9..15ba3a9 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -146,6 +146,7 @@ int __init
acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
{
u64 start, end;
+ u32 hotpluggable;
int node, pxm;
if (srat_disabled())
@@ -154,7 +155,8 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
goto out_err_bad_srat;
if ((ma->flags & ACPI_SRAT_MEM_ENABLED) == 0)
goto out_err;
- if ((ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) && !save_add_info())
+ hotpluggable = ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE;
+ if (hotpluggable && !save_add_info())
goto out_err;
start = ma->base_address;
@@ -174,9 +176,10 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
node_set(node, numa_nodes_parsed);
- printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]\n",
- node, pxm,
- (unsigned long long) start, (unsigned long long) end - 1);
+ pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx] %s\n",
+ node, pxm,
+ (unsigned long long) start, (unsigned long long) end - 1,
+ hotpluggable ? "Hot Pluggable" : "");
return 0;
out_err_bad_srat:
--
1.7.1
The Hot-Pluggable fired in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel,
it cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.
But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the
kernel cannot use memory in movable nodes. This will cause NUMA
performance down. And other users may be unhappy.
So we need a way to allow users to enable and disable this functionality.
In this patch, we improve movablecore boot option to allow users to
choose to reserve hotpluggable memory and set it as ZONE_MOVABLE or not.
Users can specify "movablecore=acpi" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.
Suggested-by: Kamezawa Hiroyuki <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
---
arch/x86/kernel/setup.c | 8 +++++++-
include/linux/memory_hotplug.h | 3 +++
mm/page_alloc.c | 13 +++++++++++++
3 files changed, 23 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 9717760..9d08a03 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1083,8 +1083,14 @@ void __init setup_arch(char **cmdline_p)
* Linux kernel cannot migrate kernel pages, as a result, memory used
* by the kernel cannot be hot-removed. Reserve hotpluggable memory to
* prevent memblock from allocating hotpluggable memory for the kernel.
+ *
+ * If all the memory in a node is hotpluggable, then the kernel won't
+ * be able to use memory on that node. This will cause NUMA performance
+ * down. So by default, we don't reserve any hotpluggable memory. users
+ * may use "movablecore=acpi" boot option to enable this functionality.
*/
- reserve_hotpluggable_memory();
+ if (movablecore_enable_srat)
+ reserve_hotpluggable_memory();
#endif
/*
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 681b97f..9f26e29 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -33,6 +33,9 @@ enum {
ONLINE_MOVABLE,
};
+/* Enable/disable SRAT in movablecore boot option */
+extern bool movablecore_enable_srat;
+
/*
* pgdat resizing functions
*/
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3edb62..6271c36 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -209,6 +209,8 @@ static unsigned long __initdata required_kernelcore;
static unsigned long __initdata required_movablecore;
static unsigned long __meminitdata zone_movable_pfn[MAX_NUMNODES];
+bool __initdata movablecore_enable_srat;
+
/* movable_zone is the "real" zone pages in ZONE_MOVABLE are taken from */
int movable_zone;
EXPORT_SYMBOL(movable_zone);
@@ -5112,6 +5114,12 @@ void __init free_area_init_nodes(unsigned long *max_zone_pfn)
}
}
+static void __init cmdline_movablecore_srat(char *p)
+{
+ if (p && !strcmp(p, "acpi"))
+ movablecore_enable_srat = true;
+}
+
static int __init cmdline_parse_core(char *p, unsigned long *core)
{
unsigned long long coremem;
@@ -5142,6 +5150,11 @@ static int __init cmdline_parse_kernelcore(char *p)
*/
static int __init cmdline_parse_movablecore(char *p)
{
+ cmdline_movablecore_srat(p);
+
+ if (movablecore_enable_srat)
+ return 0;
+
return cmdline_parse_core(p, &required_movablecore);
}
--
1.7.1
In acpi_initrd_override(), it checks several things to ensure the
table it found is valid. In later patches, we need to do these check
somewhere else. So this patch introduces a common function
acpi_invalid_table() to do all these checks, and reuse it in different
places. The function will be used in the subsequent patches.
Signed-off-by: Tang Chen <[email protected]>
---
drivers/acpi/osl.c | 82 ++++++++++++++++++++++++++++++++++++----------------
1 files changed, 57 insertions(+), 25 deletions(-)
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 91d9f54..4531920 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -572,9 +572,64 @@ static const char * const table_sigs[] = {
/* Must not increase 10 or needs code modification below */
#define ACPI_OVERRIDE_TABLES 10
+/**
+ * acpi_invalid_table - Check if an acpi table found in initrd is invalid.
+ * @file: The cpio file returned by find_cpio_data().
+ * @path: The path storing acpi overriding tables in cpio file.
+ * @signature: The table signature to be checked.
+ *
+ * @signature can be NULL. If it is NULL, the function will check if the
+ * table signature matches any signature in table_sigs[].
+ *
+ * Return 0 if it passes all the checks, -EINVAL if any check fails.
+ */
+int __init acpi_invalid_table(struct cpio_data *file,
+ const char *path, const char *signature)
+{
+ int idx;
+ struct acpi_table_header *table = file->data;
+
+ if (file->size < sizeof(struct acpi_table_header)) {
+ INVALID_TABLE("Table smaller than ACPI header",
+ path, file->name);
+ return -EINVAL;
+ }
+
+ if (signature) {
+ if (memcmp(table->signature, signature, 4)) {
+ INVALID_TABLE("Table signature does not match",
+ path, file->name);
+ return -EINVAL;
+ }
+ } else {
+ for (idx = 0; table_sigs[idx]; idx++)
+ if (!memcmp(table->signature, table_sigs[idx], 4))
+ break;
+
+ if (!table_sigs[idx]) {
+ INVALID_TABLE("Unknown signature", path, file->name);
+ return -EINVAL;
+ }
+ }
+
+ if (file->size != table->length) {
+ INVALID_TABLE("File length does not match table length",
+ path, file->name);
+ return -EINVAL;
+ }
+
+ if (acpi_table_checksum(file->data, table->length)) {
+ INVALID_TABLE("Bad table checksum",
+ path, file->name);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
void __init acpi_initrd_override(void *data, size_t size)
{
- int sig, no, table_nr = 0, total_offset = 0;
+ int no, table_nr = 0, total_offset = 0;
long offset = 0;
struct acpi_table_header *table;
char cpio_path[32] = "kernel/firmware/acpi/";
@@ -593,33 +648,10 @@ void __init acpi_initrd_override(void *data, size_t size)
data += offset;
size -= offset;
- if (file.size < sizeof(struct acpi_table_header)) {
- INVALID_TABLE("Table smaller than ACPI header",
- cpio_path, file.name);
- continue;
- }
-
table = file.data;
- for (sig = 0; table_sigs[sig]; sig++)
- if (!memcmp(table->signature, table_sigs[sig], 4))
- break;
-
- if (!table_sigs[sig]) {
- INVALID_TABLE("Unknown signature",
- cpio_path, file.name);
- continue;
- }
- if (file.size != table->length) {
- INVALID_TABLE("File length does not match table length",
- cpio_path, file.name);
- continue;
- }
- if (acpi_table_checksum(file.data, table->length)) {
- INVALID_TABLE("Bad table checksum",
- cpio_path, file.name);
+ if (acpi_invalid_table(&file, cpio_path, NULL))
continue;
- }
pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
table->signature, cpio_path, file.name, table->length);
--
1.7.1
The comments of find_cpio_data() says:
* @offset: When a matching file is found, this is the offset to the
* beginning of the cpio. ......
But according to the code,
dptr = PTR_ALIGN(p + ch[C_NAMESIZE], 4);
nptr = PTR_ALIGN(dptr + ch[C_FILESIZE], 4);
....
*offset = (long)nptr - (long)data; /* data is the cpio file */
@offset is the offset of the next file, not the matching file itself.
This is confused and may cause unnecessary waste of time to debug.
So fix it.
Signed-off-by: Tang Chen <[email protected]>
---
lib/earlycpio.c | 7 ++++---
1 files changed, 4 insertions(+), 3 deletions(-)
diff --git a/lib/earlycpio.c b/lib/earlycpio.c
index 8078ef4..53ccbf7 100644
--- a/lib/earlycpio.c
+++ b/lib/earlycpio.c
@@ -52,9 +52,10 @@ enum cpio_fields {
* @path: The directory to search for, including a slash at the end
* @data: Pointer to the the cpio archive or a header inside
* @len: Remaining length of the cpio based on data pointer
- * @offset: When a matching file is found, this is the offset to the
- * beginning of the cpio. It can be used to iterate through
- * the cpio to find all files inside of a directory path
+ * @offset: When a matching file is found, this is the offset from the
+ * beginning of the cpio to the beginning of the next file, not the
+ * matching file itself. It can be used to iterate through the cpio
+ * to find all files inside of a directory path
*
* @return: struct cpio_data containing the address, length and
* filename (with the directory path cut off) of the found file.
--
1.7.1
Besides the phys addr of the acpi tables, it will be very convenient if
we also have the signature of each table in acpi_gbl_root_table_list at
early time. We can find SRAT easily by comparing the signature.
This patch alse record signature and some other info in
acpi_gbl_root_table_list at early time.
Signed-off-by: Tang Chen <[email protected]>
---
drivers/acpi/acpica/tbutils.c | 23 +++++++++++++++++++++++
1 files changed, 23 insertions(+), 0 deletions(-)
diff --git a/drivers/acpi/acpica/tbutils.c b/drivers/acpi/acpica/tbutils.c
index 9d68ffc..37cc5e4 100644
--- a/drivers/acpi/acpica/tbutils.c
+++ b/drivers/acpi/acpica/tbutils.c
@@ -514,6 +514,7 @@ acpi_tb_install_table(acpi_physical_address address,
* fully mapped later (in verify table). In any case, we must
* unmap the header that was mapped above.
*/
+ table_desc = &acpi_gbl_root_table_list.tables[table_index];
final_table = acpi_tb_table_override(table, table_desc);
if (!final_table) {
final_table = table; /* There was no override */
@@ -627,6 +628,7 @@ acpi_tb_parse_root_table(acpi_physical_address rsdp_address)
u32 i;
u32 table_count;
struct acpi_table_header *table;
+ struct acpi_table_desc *table_desc;
acpi_physical_address address;
acpi_physical_address uninitialized_var(rsdt_address);
u32 length;
@@ -766,6 +768,27 @@ acpi_tb_parse_root_table(acpi_physical_address rsdp_address)
*/
acpi_os_unmap_memory(table, length);
+ /*
+ * Also initialize the table entries here, so that later we can use them
+ * to find SRAT at very eraly time to reserve hotpluggable memory.
+ */
+ for (i = 2; i < table_count; i++) {
+ table = acpi_os_map_memory(
+ acpi_gbl_root_table_list.tables[i].address,
+ sizeof(struct acpi_table_header));
+ if (!table)
+ return_ACPI_STATUS(AE_NO_MEMORY);
+
+ table_desc = &acpi_gbl_root_table_list.tables[i];
+
+ table_desc->pointer = NULL;
+ table_desc->length = table->length;
+ table_desc->flags = ACPI_TABLE_ORIGIN_MAPPED;
+ ACPI_MOVE_32_TO_32(table_desc->signature.ascii, table->signature);
+
+ acpi_os_unmap_memory(table, sizeof(struct acpi_table_header));
+ }
+
return_ACPI_STATUS(AE_OK);
}
--
1.7.1
Arrange hotpluggable memory as ZONE_MOVABLE will cause NUMA performance down
because the kernel cannot use movable memory. For users who don't use memory
hotplug and who don't want to lose their NUMA performance, they need a way to
disable this functionality. So we improved movablecore boot option.
If users specify the original movablecore=nn@ss boot option, the kernel will
arrange [ss, ss+nn) as ZONE_MOVABLE. The kernelcore=nn@ss boot option is similar
except it specifies ZONE_NORMAL ranges.
Now, if users specify "movablecore=acpi" in kernel commandline, the kernel will
arrange hotpluggable memory in SRAT as ZONE_MOVABLE. And if users do this, all
the other movablecore=nn@ss and kernelcore=nn@ss options should be ignored.
For those who don't want this, just specify nothing. The kernel will act as
before.
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 5 +++++
mm/page_alloc.c | 31 +++++++++++++++++++++++++++++--
3 files changed, 35 insertions(+), 2 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index d520015..28ba511 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -64,6 +64,7 @@ int memblock_free(phys_addr_t base, phys_addr_t size);
int memblock_reserve(phys_addr_t base, phys_addr_t size);
int memblock_reserve_hotpluggable(phys_addr_t base, phys_addr_t size, int nid);
int memblock_reserve_node(phys_addr_t base, phys_addr_t size, int nid);
+bool memblock_is_hotpluggable(struct memblock_region *region);
void memblock_free_hotpluggable(void);
void memblock_trim_memory(phys_addr_t align);
diff --git a/mm/memblock.c b/mm/memblock.c
index 1f5dc12..fd3ded8 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -616,6 +616,11 @@ int __init_memblock memblock_reserve_hotpluggable(phys_addr_t base,
return memblock_reserve_region(base, size, nid, MEMBLK_HOTPLUGGABLE);
}
+bool __init_memblock memblock_is_hotpluggable(struct memblock_region *region)
+{
+ return region->flags & MEMBLK_HOTPLUGGABLE;
+}
+
/**
* __next_free_mem_range - next function for for_each_free_mem_range()
* @idx: pointer to u64 loop variable
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6271c36..cdb7919 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4880,9 +4880,37 @@ static void __init find_zone_movable_pfns_for_nodes(void)
nodemask_t saved_node_state = node_states[N_MEMORY];
unsigned long totalpages = early_calculate_totalpages();
int usable_nodes = nodes_weight(node_states[N_MEMORY]);
+ struct memblock_type *reserved = &memblock.reserved;
/*
- * If movablecore was specified, calculate what size of
+ * Need to find movable_zone earlier in case movablecore=acpi is
+ * specified.
+ */
+ find_usable_zone_for_movable();
+
+ /*
+ * If movablecore=acpi was specified, then zone_movable_pfn[] has been
+ * initialized, and no more work needs to do.
+ * NOTE: In this case, we ignore kernelcore option.
+ */
+ if (movablecore_enable_srat) {
+ for (i = 0; i < reserved->cnt; i++) {
+ if (!memblock_is_hotpluggable(&reserved->regions[i]))
+ continue;
+
+ nid = reserved->regions[i].nid;
+
+ usable_startpfn = PFN_DOWN(reserved->regions[i].base);
+ zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
+ min(usable_startpfn, zone_movable_pfn[nid]) :
+ usable_startpfn;
+ }
+
+ goto out;
+ }
+
+ /*
+ * If movablecore=nn[KMG] was specified, calculate what size of
* kernelcore that corresponds so that memory usable for
* any allocation type is evenly spread. If both kernelcore
* and movablecore are specified, then the value of kernelcore
@@ -4908,7 +4936,6 @@ static void __init find_zone_movable_pfns_for_nodes(void)
goto out;
/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
- find_usable_zone_for_movable();
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
restart:
--
1.7.1
If all the memory ranges in SRAT are hotpluggable, we should not reserve all
of them in memblock. Otherwise the kernel won't have enough memory to boot.
And also, memblock will reserve some memory at early time, such initrd file,
kernel code and data segments, and so on. We cannot avoid these anyway.
So we make any node which the kernel resides in un-hotpluggable.
This patch introduces kernel_resides_in_range() to check if the kernel resides
in a memory range. And we can use this function to iterate memblock.reserve[],
and find out which node the kernel resides in. And then, even if the memory in
that node is hotpluggable, we don't reserve it.
Signed-off-by: Tang Chen <[email protected]>
---
drivers/acpi/osl.c | 40 ++++++++++++++++++++++++++++++++++++++++
1 files changed, 40 insertions(+), 0 deletions(-)
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 02a39e2..dfbe6ba 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -774,6 +774,43 @@ phys_addr_t __init early_acpi_firmware_srat()
}
/*
+ * kernel_resides_in_range - Check if kernel resides in a memory range.
+ * @base: The base address of the memory range.
+ * @length: The length of the memory range.
+ *
+ * memblock reserves some memory for the kernel at very early time, such
+ * as kernel code and data segments, initrd file, and so on. So this
+ * function iterates memblock.reserved[] and check if any memory range with
+ * flag MEMBLK_FLAGS_DEFAULT overlaps [@base, @length). If so, the kernel
+ * resides in this memory range.
+ *
+ * Return true if the kernel resides in the memory range, false otherwise.
+ */
+static bool __init kernel_resides_in_range(phys_addr_t base, u64 length)
+{
+ int i;
+ struct memblock_type *reserved = &memblock.reserved;
+ struct memblock_region *region;
+ phys_addr_t start, end;
+
+ for (i = 0; i < reserved->cnt; i++) {
+ region = &reserved->regions[i];
+
+ if (region->flags != MEMBLK_FLAGS_DEFAULT)
+ continue;
+
+ start = region->base;
+ end = region->base + region->size;
+ if (end <= base || start >= base + length)
+ continue;
+
+ return true;
+ }
+
+ return false;
+}
+
+/*
* acpi_reserve_hotpluggable_memory - Reserve hotpluggable memory in memblock.
* @srat_vaddr: The virtual address of SRAT.
*
@@ -817,6 +854,9 @@ void __init acpi_reserve_hotpluggable_memory(void *srat_vaddr)
length = ma->length;
pxm = ma->proximity_domain;
+ if (kernel_resides_in_range(base_address, length))
+ goto next;
+
/*
* In such an early time, we don't have nid. We specify pxm
* instead of MAX_NUMNODES to prevent memblock merging regions
--
1.7.1
From: Yasuaki Ishimatsu <[email protected]>
If system can create movable node which all memory of the
node is allocated as ZONE_MOVABLE, setup_node_data() cannot
allocate memory for the node's pg_data_t.
So, use memblock_alloc_try_nid() instead of memblock_alloc_nid()
to retry when the first allocation fails. Otherwise, the system
could failed to boot.
Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Jiang Liu <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
---
arch/x86/mm/numa.c | 5 ++---
1 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index a71c4e2..5013583 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -209,10 +209,9 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
* Allocate node data. Try node-local memory and then any node.
* Never allocate in DMA zone.
*/
- nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
+ nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
if (!nd_pa) {
- pr_err("Cannot find %zu bytes in node %d\n",
- nd_size, nid);
+ pr_err("Cannot find %zu bytes in any node\n", nd_size);
return;
}
nd = __va(nd_pa);
--
1.7.1
In the following patches, we need to call get_ramdisk_{image|size}()
to get initrd file's address and size. So make these two functions
global.
Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/include/asm/setup.h | 3 +++
arch/x86/kernel/setup.c | 4 ++--
2 files changed, 5 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index b7bf350..69de7a1 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -106,6 +106,9 @@ void *extend_brk(size_t size, size_t align);
RESERVE_BRK(name, sizeof(type) * entries)
extern void probe_roms(void);
+u64 get_ramdisk_image(void);
+u64 get_ramdisk_size(void);
+
#ifdef __i386__
void __init i386_start_kernel(void);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 38a5952..28d2e60 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -297,7 +297,7 @@ static void __init reserve_brk(void)
#ifdef CONFIG_BLK_DEV_INITRD
-static u64 __init get_ramdisk_image(void)
+u64 __init get_ramdisk_image(void)
{
u64 ramdisk_image = boot_params.hdr.ramdisk_image;
@@ -305,7 +305,7 @@ static u64 __init get_ramdisk_image(void)
return ramdisk_image;
}
-static u64 __init get_ramdisk_size(void)
+u64 __init get_ramdisk_size(void)
{
u64 ramdisk_size = boot_params.hdr.ramdisk_size;
--
1.7.1
As mentioned before, in order to prevent the kernel to use hotpluggable
memory, we want to reserve hotpluggable memory in memblock at early time.
As the previous two patches are able to find SRAT in initrd file or
fireware, this patch does the following:
1. Introduces acpi_reserve_hotpluggable_memory() to parse the memory
affinities in SRAT, find out which memory is hotpluggable.
2. Reserve it in memblock with MEMBLK_HOTPLUGGABLE flag.
3. Since at such an early time, nid has not been mapped, we also reserve
the PXM of the hotpluggable memory range in memblock. Later we will
modify it to nid.
In order to setup movable node (a node who has only ZONE_MOVABLE), it
will be very convenient if we can tell which memory range in memblock
is hotpluggable, and belongs to which node. We will see this convenience
in later patches. So we don't want memblock to merge memory ranges in
different nodes together.
PXM is the Proximity num provided by SRAT, used to map nid. At such an
early time, we don't have nid, so we reserve PXM in memblock to prevent
merging of different nodes' memory.
Signed-off-by: Tang Chen <[email protected]>
---
drivers/acpi/osl.c | 65 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/acpi.h | 1 +
mm/memory_hotplug.c | 14 ++++++++++-
3 files changed, 79 insertions(+), 1 deletions(-)
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index a2e4596..02a39e2 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -772,6 +772,71 @@ phys_addr_t __init early_acpi_firmware_srat()
return table_desc.address;
}
+
+/*
+ * acpi_reserve_hotpluggable_memory - Reserve hotpluggable memory in memblock.
+ * @srat_vaddr: The virtual address of SRAT.
+ *
+ * This function parse memory affinities in SRAT, find out which memory is
+ * hotpluggable, and reserve it in memblock with MEMBLK_HOTPLUGGABLE flag.
+ *
+ * NOTE: At such an early time, we don't have nid yet. So use PXM instead of
+ * nid when reserving in memblock, and modify it when nids are mapped.
+ */
+void __init acpi_reserve_hotpluggable_memory(void *srat_vaddr)
+{
+ struct acpi_table_header *table_header;
+ struct acpi_subtable_header *entry;
+ struct acpi_srat_mem_affinity *ma;
+ unsigned long table_end;
+ unsigned int count = 0;
+ u32 pxm;
+ u64 base_address, length;
+
+ table_header = (struct acpi_table_header *)srat_vaddr;
+ table_end = (unsigned long)table_header + table_header->length;
+
+ entry = (struct acpi_subtable_header *)
+ ((unsigned long)table_header + sizeof(struct acpi_table_srat));
+
+ while (((unsigned long)entry) + sizeof(struct acpi_subtable_header) <
+ table_end) {
+ if (entry->length == 0)
+ break;
+
+ if (entry->type != ACPI_SRAT_TYPE_MEMORY_AFFINITY ||
+ count++ >= NR_NODE_MEMBLKS)
+ goto next;
+
+ ma = (struct acpi_srat_mem_affinity *)entry;
+
+ if (!(ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE))
+ goto next;
+
+ base_address = ma->base_address;
+ length = ma->length;
+ pxm = ma->proximity_domain;
+
+ /*
+ * In such an early time, we don't have nid. We specify pxm
+ * instead of MAX_NUMNODES to prevent memblock merging regions
+ * on different nodes. And later modify pxm to nid when nid is
+ * mapped so that we can arrange ZONE_MOVABLE on different
+ * nodes.
+ */
+ memblock_reserve_hotpluggable(base_address, length, pxm);
+
+next:
+ entry = (struct acpi_subtable_header *)
+ ((unsigned long)entry + entry->length);
+ }
+
+ if (count > NR_NODE_MEMBLKS) {
+ pr_warning("[%4.4s:0x%02x] ignored %i entries of %i found\n",
+ ACPI_SIG_SRAT, ACPI_SRAT_TYPE_MEMORY_AFFINITY,
+ count - NR_NODE_MEMBLKS, count);
+ }
+}
#endif /* CONFIG_ACPI_NUMA */
static void acpi_table_taint(struct acpi_table_header *table)
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 6fa7543..21d57a8 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -99,6 +99,7 @@ static inline phys_addr_t early_acpi_override_srat(void)
#ifdef CONFIG_ACPI_NUMA
phys_addr_t early_acpi_firmware_srat(void);
+void acpi_reserve_hotpluggable_memory(void *srat_vaddr);
#endif /* CONFIG_ACPI_NUMA */
char * __acpi_map_table (unsigned long phys_addr, unsigned long size);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 15b11d3..ba3efe9 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -104,7 +104,10 @@ static void release_memory_resource(struct resource *res)
*/
void __init reserve_hotpluggable_memory(void)
{
+ void *srat_vaddr;
phys_addr_t srat_paddr;
+ u32 length;
+ struct acpi_table_header *srat_header;
/* Try to find out if SRAT is overrided */
srat_paddr = early_acpi_override_srat();
@@ -115,7 +118,16 @@ void __init reserve_hotpluggable_memory(void)
return;
}
- /* Will reserve hotpluggable memory here */
+ /* Map the whole SRAT */
+ srat_header = early_ioremap(srat_paddr,
+ sizeof(struct acpi_table_header));
+ length = srat_header->length;
+ early_iounmap(srat_header, sizeof(struct acpi_table_header));
+
+ /* Reserve hotpluggable memory */
+ srat_vaddr = early_ioremap(srat_paddr, length);
+ acpi_reserve_hotpluggable_memory(srat_vaddr);
+ early_iounmap(srat_vaddr, length);
}
#endif /* CONFIG_ACPI_NUMA */
--
1.7.1
This patch introduce early_acpi_firmware_srat() to find the
phys addr of SRAT provided by firmware. And call it in
reserve_hotpluggable_memory().
Since we have initialized acpi_gbl_root_table_list earlier,
and store all the tables' phys addrs and signatures in it,
it is easy to find the SRAT.
Signed-off-by: Tang Chen <[email protected]>
---
drivers/acpi/acpica/tbxface.c | 34 ++++++++++++++++++++++++++++++++++
drivers/acpi/osl.c | 24 ++++++++++++++++++++++++
include/acpi/acpixf.h | 4 ++++
include/linux/acpi.h | 4 ++++
mm/memory_hotplug.c | 10 +++++++---
5 files changed, 73 insertions(+), 3 deletions(-)
diff --git a/drivers/acpi/acpica/tbxface.c b/drivers/acpi/acpica/tbxface.c
index ad11162..95f8d1b 100644
--- a/drivers/acpi/acpica/tbxface.c
+++ b/drivers/acpi/acpica/tbxface.c
@@ -181,6 +181,40 @@ acpi_status acpi_reallocate_root_table(void)
return_ACPI_STATUS(status);
}
+/*
+ * acpi_get_table_desc - Get the acpi table descriptor of a specific table.
+ * @signature: The signature of the table to be found.
+ * @out_desc: The out returned descriptor.
+ *
+ * This function iterates acpi_gbl_root_table_list and find the specified
+ * table's descriptor.
+ *
+ * NOTE: The caller has the responsibility to allocate memory for @out_desc.
+ *
+ * Return AE_OK on success, AE_NOT_FOUND if the table is not found.
+ */
+acpi_status acpi_get_table_desc(char *signature,
+ struct acpi_table_desc *out_desc)
+{
+ int pos;
+
+ for (pos = 0;
+ pos < acpi_gbl_root_table_list.current_table_count;
+ pos++) {
+ if (!ACPI_COMPARE_NAME
+ (&(acpi_gbl_root_table_list.tables[pos].signature),
+ signature))
+ continue;
+
+ memcpy(out_desc, &acpi_gbl_root_table_list.tables[pos],
+ sizeof(struct acpi_table_desc));
+
+ return_ACPI_STATUS(AE_OK);
+ }
+
+ return_ACPI_STATUS(AE_NOT_FOUND);
+}
+
/*******************************************************************************
*
* FUNCTION: acpi_get_table_header
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index fa6b973..a2e4596 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -53,6 +53,7 @@
#include <acpi/acpi.h>
#include <acpi/acpi_bus.h>
#include <acpi/processor.h>
+#include <acpi/acpixf.h>
#define _COMPONENT ACPI_OS_SERVICES
ACPI_MODULE_NAME("osl");
@@ -750,6 +751,29 @@ void __init acpi_initrd_override(void *data, size_t size)
}
#endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */
+#ifdef CONFIG_ACPI_NUMA
+#include <asm/numa.h>
+#include <linux/memblock.h>
+
+/*
+ * early_acpi_firmware_srat - Get the phys addr of SRAT provide by firmware.
+ *
+ * This function iterate acpi_gbl_root_table_list, find SRAT and return the
+ * phys addr of SRAT.
+ *
+ * Return the phys addr of SRAT, or 0 on error.
+ */
+phys_addr_t __init early_acpi_firmware_srat()
+{
+ struct acpi_table_desc table_desc;
+
+ if (acpi_get_table_desc(ACPI_SIG_SRAT, &table_desc))
+ return 0;
+
+ return table_desc.address;
+}
+#endif /* CONFIG_ACPI_NUMA */
+
static void acpi_table_taint(struct acpi_table_header *table)
{
pr_warn(PREFIX
diff --git a/include/acpi/acpixf.h b/include/acpi/acpixf.h
index f5549b5..1d94f89 100644
--- a/include/acpi/acpixf.h
+++ b/include/acpi/acpixf.h
@@ -184,6 +184,10 @@ acpi_status acpi_find_root_pointer(acpi_size *rsdp_address);
acpi_status acpi_unload_table_id(acpi_owner_id id);
acpi_status
+acpi_get_table_desc(char *signature,
+ struct acpi_table_desc *out_desc);
+
+acpi_status
acpi_get_table_header(acpi_string signature,
u32 instance, struct acpi_table_header *out_table_header);
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 17155bc..6fa7543 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -97,6 +97,10 @@ static inline phys_addr_t early_acpi_override_srat(void)
}
#endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */
+#ifdef CONFIG_ACPI_NUMA
+phys_addr_t early_acpi_firmware_srat(void);
+#endif /* CONFIG_ACPI_NUMA */
+
char * __acpi_map_table (unsigned long phys_addr, unsigned long size);
void __acpi_unmap_table(char *map, unsigned long size);
int early_acpi_boot_init(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 066873e..15b11d3 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -106,10 +106,14 @@ void __init reserve_hotpluggable_memory(void)
{
phys_addr_t srat_paddr;
- /* Try to find if SRAT is overrided */
+ /* Try to find out if SRAT is overrided */
srat_paddr = early_acpi_override_srat();
- if (!srat_paddr)
- return;
+ if (!srat_paddr) {
+ /* Try to find SRAT from firmware if it wasn't overrided */
+ srat_paddr = early_acpi_firmware_srat();
+ if (!srat_paddr)
+ return;
+ }
/* Will reserve hotpluggable memory here */
}
--
1.7.1
We reserved hotpluggable memory in memblock at early time. And when memory
initialization is done, we have to free it to buddy system.
This patch free memory reserved by memblock with flag MEMBLK_HOTPLUGGABLE.
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 17 +++++++++++++++++
mm/nobootmem.c | 3 +++
3 files changed, 21 insertions(+), 0 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 90b49ee..8d71795 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -63,6 +63,7 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
int memblock_free(phys_addr_t base, phys_addr_t size);
int memblock_reserve(phys_addr_t base, phys_addr_t size);
int memblock_reserve_hotpluggable(phys_addr_t base, phys_addr_t size, int nid);
+void memblock_free_hotpluggable(void);
void memblock_trim_memory(phys_addr_t align);
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
diff --git a/mm/memblock.c b/mm/memblock.c
index 73fe62d..631b727 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -565,6 +565,23 @@ int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size)
return __memblock_remove(&memblock.reserved, base, size);
}
+static void __init_memblock memblock_free_flags(unsigned long flags)
+{
+ int i;
+ struct memblock_type *reserved = &memblock.reserved;
+
+ for (i = 0; i < reserved->cnt; i++) {
+ if (reserved->regions[i].flags == flags)
+ memblock_remove_region(reserved, i);
+ }
+}
+
+void __init_memblock memblock_free_hotpluggable()
+{
+ memblock_dbg("memblock: free all hotpluggable memory");
+ memblock_free_flags(MEMBLK_HOTPLUGGABLE);
+}
+
static int __init_memblock memblock_reserve_region(phys_addr_t base,
phys_addr_t size,
int nid,
diff --git a/mm/nobootmem.c b/mm/nobootmem.c
index bdd3fa2..dbfbcb9 100644
--- a/mm/nobootmem.c
+++ b/mm/nobootmem.c
@@ -165,6 +165,9 @@ unsigned long __init free_all_bootmem(void)
for_each_online_pgdat(pgdat)
reset_node_lowmem_managed_pages(pgdat);
+ /* Hotpluggable memory reserved by memblock should also be freed. */
+ memblock_free_hotpluggable();
+
/*
* We need to use MAX_NUMNODES instead of NODE_DATA(0)->node_id
* because in some case like Node0 doesn't have RAM installed
--
1.7.1
As we mentioned in previous patches, to prevent the kernel
using hotpluggable memory at early time, we need to reserve
hotpluggable memory in memblock. So we need to parse SRAT
at early time.
This patch does the following two things:
1. Introduce reserve_hotpluggable_memory() to reserve
hotpluggable memory, and call it in setup_arch() right
after memblock is ready.
The main job of this function is not implemented in this
patch. In this patch, it only calls early_acpi_override_srat()
to get SRAT in initrd file if there is any.
2. Introduce early_acpi_override_srat() to check if there
is a SRAT in initrd file used to override SRAT in the
firmware. If so, the function will return the phys addr
of the override SRAT.
At early time in setup_arch(), pagetable has not been setup.
So we have to use early_ioremap() to map the initrd file
temporarily. But early_ioremap() can only map at most 256KB
at one time. So we use a loop to map the whole initrd file,
and map only 256KB one time.
Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/kernel/setup.c | 9 ++++++
drivers/acpi/osl.c | 55 ++++++++++++++++++++++++++++++++++++++++
include/linux/acpi.h | 14 ++++++++-
include/linux/memory_hotplug.h | 2 +
mm/memory_hotplug.c | 26 ++++++++++++++++++-
5 files changed, 103 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 28d2e60..9717760 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1078,6 +1078,15 @@ void __init setup_arch(char **cmdline_p)
/* Initialize ACPI root table */
acpi_root_table_init();
+#ifdef CONFIG_ACPI_NUMA
+ /*
+ * Linux kernel cannot migrate kernel pages, as a result, memory used
+ * by the kernel cannot be hot-removed. Reserve hotpluggable memory to
+ * prevent memblock from allocating hotpluggable memory for the kernel.
+ */
+ reserve_hotpluggable_memory();
+#endif
+
/*
* The EFI specification says that boot service code won't be called
* after ExitBootServices(). This is, in fact, a lie.
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 4531920..fa6b973 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -48,6 +48,7 @@
#include <asm/io.h>
#include <asm/uaccess.h>
+#include <asm/setup.h>
#include <acpi/acpi.h>
#include <acpi/acpi_bus.h>
@@ -627,6 +628,60 @@ int __init acpi_invalid_table(struct cpio_data *file,
return 0;
}
+#ifdef CONFIG_ACPI_NUMA
+/*
+ * early_acpi_override_srat - Try to get the phys addr of SRAT in initrd.
+ *
+ * The ACPI_INITRD_TABLE_OVERRIDE procedure is able to use tables in initrd
+ * file to override the ones provided by firmware. This function checks if
+ * there is a SRAT in initrd at early time. If so, return the phys addr of
+ * the SRAT.
+ *
+ * Return the phys addr of SRAT in initrd, 0 if there is no SRAT.
+ */
+phys_addr_t __init early_acpi_override_srat(void)
+{
+ int i;
+ u32 length;
+ long offset;
+ void *ramdisk_vaddr;
+ struct acpi_table_header *table;
+ unsigned long map_step = NR_FIX_BTMAPS << PAGE_SHIFT;
+ phys_addr_t ramdisk_image = get_ramdisk_image();
+ char cpio_path[32] = "kernel/firmware/acpi/";
+ struct cpio_data file;
+
+ /* Try to find if SRAT is overrided */
+ for (i = 0; i < ACPI_OVERRIDE_TABLES; i++) {
+ ramdisk_vaddr = early_ioremap(ramdisk_image, map_step);
+
+ file = find_cpio_data(cpio_path, ramdisk_vaddr,
+ map_step, &offset);
+ if (!file.data) {
+ early_iounmap(ramdisk_vaddr, map_step);
+ return 0;
+ }
+
+ table = file.data;
+ length = table->length;
+
+ if (acpi_invalid_table(&file, cpio_path, ACPI_SIG_SRAT)) {
+ ramdisk_image += offset;
+ early_iounmap(ramdisk_vaddr, map_step);
+ continue;
+ }
+
+ /* Found SRAT */
+ early_iounmap(ramdisk_vaddr, map_step);
+ ramdisk_image = ramdisk_image + offset - length;
+
+ break;
+ }
+
+ return ramdisk_image;
+}
+#endif /* CONFIG_ACPI_NUMA */
+
void __init acpi_initrd_override(void *data, size_t size)
{
int no, table_nr = 0, total_offset = 0;
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 95f600c..17155bc 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -81,11 +81,21 @@ typedef int (*acpi_tbl_entry_handler)(struct acpi_subtable_header *header,
#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
void acpi_initrd_override(void *data, size_t size);
-#else
+
+#ifdef CONFIG_ACPI_NUMA
+phys_addr_t early_acpi_override_srat(void);
+#endif /* CONFIG_ACPI_NUMA */
+
+#else /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */
static inline void acpi_initrd_override(void *data, size_t size)
{
}
-#endif
+
+static inline phys_addr_t early_acpi_override_srat(void)
+{
+ return 0;
+}
+#endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */
char * __acpi_map_table (unsigned long phys_addr, unsigned long size);
void __acpi_unmap_table(char *map, unsigned long size);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 3e622c6..681b97f 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -104,6 +104,7 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
/* reasonably generic interface to expand the physical pages in a zone */
extern int __add_pages(int nid, struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages);
+extern void reserve_hotpluggable_memory(void);
#ifdef CONFIG_NUMA
extern int memory_add_physaddr_to_nid(u64 start);
@@ -181,6 +182,7 @@ static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
{
}
#endif
+
extern void put_page_bootmem(struct page *page);
extern void get_page_bootmem(unsigned long ingo, struct page *page,
unsigned long type);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 1ad92b4..066873e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -30,6 +30,7 @@
#include <linux/mm_inline.h>
#include <linux/firmware-map.h>
#include <linux/stop_machine.h>
+#include <linux/acpi.h>
#include <asm/tlbflush.h>
@@ -62,7 +63,6 @@ void unlock_memory_hotplug(void)
mutex_unlock(&mem_hotplug_mutex);
}
-
/* add this memory to iomem resource */
static struct resource *register_memory_resource(u64 start, u64 size)
{
@@ -91,6 +91,30 @@ static void release_memory_resource(struct resource *res)
return;
}
+#ifdef CONFIG_ACPI_NUMA
+/*
+ * reserve_hotpluggable_memory - Reserve hotpluggable memory in memblock.
+ *
+ * This function did the following:
+ * 1. Try to find if there is a SRAT in initrd file used to override the one
+ * provided by firmware. If so, get its phys addr.
+ * 2. If there is no override SRAT, get the phys addr of the SRAT in firmware.
+ * 3. Parse SRAT, find out which memory is hotpluggable, and reserve it in
+ * memblock.
+ */
+void __init reserve_hotpluggable_memory(void)
+{
+ phys_addr_t srat_paddr;
+
+ /* Try to find if SRAT is overrided */
+ srat_paddr = early_acpi_override_srat();
+ if (!srat_paddr)
+ return;
+
+ /* Will reserve hotpluggable memory here */
+}
+#endif /* CONFIG_ACPI_NUMA */
+
#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
void get_page_bootmem(unsigned long info, struct page *page,
unsigned long type)
--
1.7.1
Vasilis Liaskovitis found that before we parse SRAT and fulfill numa_meminfo,
the nids of all the regions in memblock.reserve[] are MAX_NUMNODES. That is
because nids have not been mapped at that time.
When we arrange ZONE_MOVABLE in each node later, we need nid in memblock. So
after we parse SRAT and fulfill nume_meminfo, synchronize the nid info to
memblock.reserve[] immediately.
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Vasilis Liaskovitis <[email protected]>
---
arch/x86/mm/numa.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 50 insertions(+), 0 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 5013583..f2a3984 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -548,6 +548,48 @@ static void __init numa_init_array(void)
}
}
+/*
+ * early_numa_find_range_nid - Find nid for a memory range at early time.
+ * @start: start address of the memory range (physaddr)
+ * @size: size of the memory range
+ *
+ * Return nid of the memory range, or MAX_NUMNODES if it failed to find the nid.
+ *
+ * NOTE: This function uses numa_meminfo to find the range's nid, so it should
+ * be called after numa_meminfo has been initialized.
+ */
+int __init early_numa_find_range_nid(u64 start, u64 size)
+{
+ int i;
+ struct numa_meminfo *mi = &numa_meminfo;
+
+ for (i = 0; i < mi->nr_blks; i++)
+ if (start >= mi->blk[i].start &&
+ (start + size - 1) <= mi->blk[i].end)
+ return mi->blk[i].nid;
+
+ return MAX_NUMNODES;
+}
+
+/*
+ * numa_sync_memblock_nid - Synchronize nid info in memblock.reserve[] to
+ * numa_meminfo.
+ *
+ * This function will synchronize the nid fields of regions in
+ * memblock.reserve[] to numa_meminfo.
+ */
+static void __init numa_sync_memblock_nid()
+{
+ int i, nid;
+ struct memblock_type *res = &memblock.reserved;
+
+ for (i = 0; i < res->cnt; i++) {
+ nid = early_numa_find_range_nid(res->regions[i].base,
+ res->regions[i].size);
+ memblock_set_region_node(&res->regions[i], nid);
+ }
+}
+
static int __init numa_init(int (*init_func)(void))
{
int i;
@@ -585,6 +627,14 @@ static int __init numa_init(int (*init_func)(void))
numa_clear_node(i);
}
numa_init_array();
+
+ /*
+ * Before fulfilling numa_meminfo, all regions allocated by memblock
+ * are reserved with nid MAX_NUMNODES because there is no numa node
+ * info at such an early time. Now, fill the correct nid into memblock.
+ */
+ numa_sync_memblock_nid();
+
return 0;
}
--
1.7.1
We have split acpi_table_init() into two steps:
1. Pares RSDT or XSDT, and initialize acpi_gbl_root_table_list.
This step will record all tables' physical address in memory.
2. Check acpi initrd table override and install all tables into
acpi_gbl_root_table_list.
This patch does step 1 earlier, right after memblock is ready.
When memblock_x86_fill() is called to fulfill memblock.memory[],
memblock is able to allocate memory.
This patch introduces a new function acpi_root_table_init() to
do step 1, and call this function right after memblock_x86_fill()
is called.
Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/kernel/acpi/boot.c | 38 +++++++++++++++++++++++---------------
arch/x86/kernel/setup.c | 3 +++
drivers/acpi/tables.c | 7 +++++--
include/linux/acpi.h | 2 ++
4 files changed, 33 insertions(+), 17 deletions(-)
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 230c8ea..3da5b3c 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -1491,6 +1491,28 @@ static struct dmi_system_id __initdata acpi_dmi_table_late[] = {
};
/*
+ * acpi_root_table_init - Initialize acpi_gbl_root_table_list.
+ *
+ * This function will parse RSDT or XSDT, find all tables' phys addr,
+ * initialize acpi_gbl_root_table_list, and record all tables' phys addr
+ * in acpi_gbl_root_table_list.
+ */
+void __init acpi_root_table_init(void)
+{
+ dmi_check_system(acpi_dmi_table);
+
+ /* If acpi_disabled, bail out */
+ if (acpi_disabled)
+ return;
+
+ /* Initialize the ACPI boot-time table parser */
+ if (acpi_table_init()) {
+ disable_acpi();
+ return;
+ }
+}
+
+/*
* acpi_boot_table_init() and acpi_boot_init()
* called from setup_arch(), always.
* 1. checksums all tables
@@ -1511,21 +1533,7 @@ static struct dmi_system_id __initdata acpi_dmi_table_late[] = {
void __init acpi_boot_table_init(void)
{
- dmi_check_system(acpi_dmi_table);
-
- /*
- * If acpi_disabled, bail out
- */
- if (acpi_disabled)
- return;
-
- /*
- * Initialize the ACPI boot-time table parser.
- */
- if (acpi_table_init()) {
- disable_acpi();
- return;
- }
+ acpi_install_root_table();
acpi_table_parse(ACPI_SIG_BOOT, acpi_parse_sbf);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 56f7fcf..38a5952 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1075,6 +1075,9 @@ void __init setup_arch(char **cmdline_p)
memblock.current_limit = ISA_END_ADDRESS;
memblock_x86_fill();
+ /* Initialize ACPI root table */
+ acpi_root_table_init();
+
/*
* The EFI specification says that boot service code won't be called
* after ExitBootServices(). This is, in fact, a lie.
diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index 8860e79..60ecbb8 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -353,10 +353,13 @@ int __init acpi_table_init(void)
if (ACPI_FAILURE(status))
return 1;
- acpi_tb_install_root_table();
+ return 0;
+}
+void __init acpi_install_root_table(void)
+{
+ acpi_tb_install_root_table();
check_multiple_madt();
- return 0;
}
static int __init acpi_parse_apic_instance(char *str)
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 17b5b59..95f600c 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -92,10 +92,12 @@ void __acpi_unmap_table(char *map, unsigned long size);
int early_acpi_boot_init(void);
int acpi_boot_init (void);
void acpi_boot_table_init (void);
+void acpi_root_table_init(void);
int acpi_mps_check (void);
int acpi_numa_init (void);
int acpi_table_init (void);
+void acpi_install_root_table(void);
int acpi_table_parse(char *id, acpi_tbl_table_handler handler);
int __init acpi_table_parse_entries(char *id, unsigned long table_size,
int entry_id,
--
1.7.1
Since we modify movablecore boot option to support
"movablecore=acpi", this patch adds doc for it.
Signed-off-by: Tang Chen <[email protected]>
---
Documentation/kernel-parameters.txt | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 2fe6e76..7aa1099 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1714,6 +1714,16 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
that the amount of memory usable for all allocations
is not too small.
+ movablecore=acpi [KNL,X86] The SRAT (System Resource Affinity
+ Table) in ACPI provides the NUMA info of the system.
+ And also, there is a Hotpluggable field in the memory
+ affinity of the table specifying which memory is
+ hotpluggable. This option enables the kernel to arrange
+ hotpluggable memory in SRAT as ZONE_MOVABLE.
+ NOTE: Any node which the kernel resides in will
+ always be un-hotpluggable so that the kernel
+ will always have enough memory to boot.
+
MTD_Partition= [MTD]
Format: <name>,<region-number>,<size>,<offset>
--
1.7.1
Pages used by the kernel cannot be migrated. As a result, hotpluggable
memory used by the kernel cannot be hot-removed. So for memory
hotplug users, the kernel should not use hotpluggable memory.
Since now we have flags in memblock, we introduce a MEMBLK_HOTPLUGGABLE
flag to mark hotpluggable memory. At the early time, we use memblock to
reserve hotpluggable memory, and mark it with MEMBLK_HOTPLUGGABLE flag.
When the system is up, we free these memory with MEMBLK_HOTPLUGGABLE
flag to the buddy, and arrange them into ZONE_MOVABLE. In this way, the
kernel won't be able to use it.
This patch introduces MEMBLK_HOTPLUGGABLE flag, and an API to reserve
memory with MEMBLK_HOTPLUGGABLE flag. This is a preparation for the
coming patches.
Signed-off-by: Tang Chen <[email protected]>
---
include/linux/memblock.h | 2 ++
mm/memblock.c | 6 ++++++
2 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 93f3453..90b49ee 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -21,6 +21,7 @@
/* Definition of memblock flags. */
#define MEMBLK_FLAGS_DEFAULT 0x0 /* default flag */
+#define MEMBLK_HOTPLUGGABLE 0x1 /* hotpluggable region */
struct memblock_region {
phys_addr_t base;
@@ -61,6 +62,7 @@ int memblock_add(phys_addr_t base, phys_addr_t size);
int memblock_remove(phys_addr_t base, phys_addr_t size);
int memblock_free(phys_addr_t base, phys_addr_t size);
int memblock_reserve(phys_addr_t base, phys_addr_t size);
+int memblock_reserve_hotpluggable(phys_addr_t base, phys_addr_t size, int nid);
void memblock_trim_memory(phys_addr_t align);
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
diff --git a/mm/memblock.c b/mm/memblock.c
index 9e871e9..73fe62d 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -586,6 +586,12 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
MEMBLK_FLAGS_DEFAULT);
}
+int __init_memblock memblock_reserve_hotpluggable(phys_addr_t base,
+ phys_addr_t size, int nid)
+{
+ return memblock_reserve_region(base, size, nid, MEMBLK_HOTPLUGGABLE);
+}
+
/**
* __next_free_mem_range - next function for for_each_free_mem_range()
* @idx: pointer to u64 loop variable
--
1.7.1
In ACPI, SRAT(System Resource Affinity Table) contains NUMA info.
The memory affinities in SRAT record every memory range in the
system, and also, flags specifying if the memory range is
hotpluggable.
(Please refer to ACPI spec 5.0 5.2.16)
memblock starts to work at very early time, and SRAT has not been
parsed. So we don't know which memory is hotpluggable. In order
to use memblock to reserve hotpluggable memory, we need to obtain
SRAT memory affinity info earlier.
In the current acpi_boot_table_init(), it does the following:
1. Parse RSDT, so that we can find all the tables.
2. Initialize acpi_gbl_root_table_list, an array of acpi table
descriptorsused to store each table's address, length, signature,
and so on.
3. Check if there is any table in initrd intending to override
tables from firmware. If so, override the firmware tables.
4. Initialize all the data in acpi_gbl_root_table_list.
In order to parse SRAT at early time, we need to do similar job as
step 1 and 2 above earlier to obtain SRAT. It will be very convenient
if we have acpi_gbl_root_table_list initialized. We can use address
and signature to find SRAT.
Since step 1 and 2 allocates no memory, it is OK to do these two
steps earlier.
But step 3 will check acpi initrd table override, not just SRAT,
but also all the other tables. So it is better to keep it untouched.
This patch splits acpi_boot_table_init() into two steps:
1. Parse RSDT, which cannot be overrided, and initialize
acpi_gbl_root_table_list. (step 1 + 2 above)
2. Install all ACPI tables into acpi_gbl_root_table_list.
(step 3 + 4 above)
In later patches, we will do step 1 + 2 earlier.
Signed-off-by: Tang Chen <[email protected]>
---
drivers/acpi/acpica/tbutils.c | 25 ++++++++++++++++++++++---
drivers/acpi/tables.c | 2 ++
include/acpi/acpixf.h | 2 ++
3 files changed, 26 insertions(+), 3 deletions(-)
diff --git a/drivers/acpi/acpica/tbutils.c b/drivers/acpi/acpica/tbutils.c
index ce3d5db..9d68ffc 100644
--- a/drivers/acpi/acpica/tbutils.c
+++ b/drivers/acpi/acpica/tbutils.c
@@ -766,9 +766,30 @@ acpi_tb_parse_root_table(acpi_physical_address rsdp_address)
*/
acpi_os_unmap_memory(table, length);
+ return_ACPI_STATUS(AE_OK);
+}
+
+/*******************************************************************************
+ *
+ * FUNCTION: acpi_tb_install_root_table
+ *
+ * DESCRIPTION: This function installs all the ACPI tables in RSDT into
+ * acpi_gbl_root_table_list.
+ *
+ ******************************************************************************/
+
+void __init
+acpi_tb_install_root_table()
+{
+ int i;
+
/*
* Complete the initialization of the root table array by examining
- * the header of each table
+ * the header of each table.
+ *
+ * First two entries in the table array are reserved for the DSDT
+ * and FACS, which are not actually present in the RSDT/XSDT - they
+ * come from the FADT.
*/
for (i = 2; i < acpi_gbl_root_table_list.current_table_count; i++) {
acpi_tb_install_table(acpi_gbl_root_table_list.tables[i].
@@ -782,6 +803,4 @@ acpi_tb_parse_root_table(acpi_physical_address rsdp_address)
acpi_tb_parse_fadt(i);
}
}
-
- return_ACPI_STATUS(AE_OK);
}
diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index d67a1fe..8860e79 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -353,6 +353,8 @@ int __init acpi_table_init(void)
if (ACPI_FAILURE(status))
return 1;
+ acpi_tb_install_root_table();
+
check_multiple_madt();
return 0;
}
diff --git a/include/acpi/acpixf.h b/include/acpi/acpixf.h
index 454881e..f5549b5 100644
--- a/include/acpi/acpixf.h
+++ b/include/acpi/acpixf.h
@@ -116,6 +116,8 @@ acpi_status
acpi_initialize_tables(struct acpi_table_desc *initial_storage,
u32 initial_table_count, u8 allow_resize);
+void acpi_tb_install_root_table(void);
+
acpi_status __init acpi_initialize_subsystem(void);
acpi_status acpi_enable_subsystem(u32 flags);
--
1.7.1
The macro INVALID_TABLE() is defined like this:
#define INVALID_TABLE(x, path, name) \
{ pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); continue; }
And it is used like this:
for (...) {
...
if (...)
INVALID_TABLE()
...
}
The "continue" in the macro makes the code hard to understand.
Change it to the style like other macros:
#define INVALID_TABLE(x, path, name) \
do { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); } while (0)
So after this patch, this macro should be used like this:
for (...) {
...
if (...) {
INVALID_TABLE()
continue;
}
...
}
Add the "continue" wherever the macro is called.
(For now, it is only called in acpi_initrd_override().)
The idea is from Yinghai Lu <[email protected]>.
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
---
drivers/acpi/osl.c | 18 +++++++++++++-----
1 files changed, 13 insertions(+), 5 deletions(-)
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index e721863..91d9f54 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -565,7 +565,7 @@ static const char * const table_sigs[] = {
/* Non-fatal errors: Affected tables/files are ignored */
#define INVALID_TABLE(x, path, name) \
- { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); continue; }
+ do { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); } while (0)
#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)
@@ -593,9 +593,11 @@ void __init acpi_initrd_override(void *data, size_t size)
data += offset;
size -= offset;
- if (file.size < sizeof(struct acpi_table_header))
+ if (file.size < sizeof(struct acpi_table_header)) {
INVALID_TABLE("Table smaller than ACPI header",
cpio_path, file.name);
+ continue;
+ }
table = file.data;
@@ -603,15 +605,21 @@ void __init acpi_initrd_override(void *data, size_t size)
if (!memcmp(table->signature, table_sigs[sig], 4))
break;
- if (!table_sigs[sig])
+ if (!table_sigs[sig]) {
INVALID_TABLE("Unknown signature",
cpio_path, file.name);
- if (file.size != table->length)
+ continue;
+ }
+ if (file.size != table->length) {
INVALID_TABLE("File length does not match table length",
cpio_path, file.name);
- if (acpi_table_checksum(file.data, table->length))
+ continue;
+ }
+ if (acpi_table_checksum(file.data, table->length)) {
INVALID_TABLE("Bad table checksum",
cpio_path, file.name);
+ continue;
+ }
pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
table->signature, cpio_path, file.name, table->length);
--
1.7.1
There is no flag in memblock to describe what type the memory is.
Sometimes, we may use memblock to reserve some memory for special usage.
And we want to know what kind of memory it is. So we need a way to
differentiate memory for different usage.
In hotplug environment, we want to reserve hotpluggable memory so the
kernel won't be able to use it. And when the system is up, we have to
free these hotpluggable memory to buddy. So we need to mark these memory
first.
In order to do so, we need to mark out these special memory in memblock.
In this patch, we introduce a new "flags" member into memblock_region:
struct memblock_region {
phys_addr_t base;
phys_addr_t size;
unsigned long flags; /* This is new. */
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int nid;
#endif
};
This patch does the following things:
1) Add "flags" member to memblock_region, and MEMBLK_DEFAULT flag for common usage.
2) Modify the following APIs' prototype:
memblock_add_region()
memblock_insert_region()
3) Add memblock_reserve_region() to support reserve memory with flags, and keep
memblock_reserve()'s prototype unmodified.
4) Modify other APIs to support flags, but keep their prototype unmodified.
The idea is from Wen Congyang <[email protected]> and Liu Jiang <[email protected]>.
Suggested-by: Wen Congyang <[email protected]>
Suggested-by: Liu Jiang <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
---
include/linux/memblock.h | 4 +++
mm/memblock.c | 55 +++++++++++++++++++++++++++++++++------------
2 files changed, 44 insertions(+), 15 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f388203..93f3453 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -19,9 +19,13 @@
#define INIT_MEMBLOCK_REGIONS 128
+/* Definition of memblock flags. */
+#define MEMBLK_FLAGS_DEFAULT 0x0 /* default flag */
+
struct memblock_region {
phys_addr_t base;
phys_addr_t size;
+ unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int nid;
#endif
diff --git a/mm/memblock.c b/mm/memblock.c
index c5fad93..9e871e9 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -157,6 +157,7 @@ static void __init_memblock memblock_remove_region(struct memblock_type *type, u
type->cnt = 1;
type->regions[0].base = 0;
type->regions[0].size = 0;
+ type->regions[0].flags = 0;
memblock_set_region_node(&type->regions[0], MAX_NUMNODES);
}
}
@@ -307,7 +308,8 @@ static void __init_memblock memblock_merge_regions(struct memblock_type *type)
if (this->base + this->size != next->base ||
memblock_get_region_node(this) !=
- memblock_get_region_node(next)) {
+ memblock_get_region_node(next) ||
+ this->flags != next->flags) {
BUG_ON(this->base + this->size > next->base);
i++;
continue;
@@ -327,13 +329,15 @@ static void __init_memblock memblock_merge_regions(struct memblock_type *type)
* @base: base address of the new region
* @size: size of the new region
* @nid: node id of the new region
+ * @flags: flags of the new region
*
* Insert new memblock region [@base,@base+@size) into @type at @idx.
* @type must already have extra room to accomodate the new region.
*/
static void __init_memblock memblock_insert_region(struct memblock_type *type,
int idx, phys_addr_t base,
- phys_addr_t size, int nid)
+ phys_addr_t size,
+ int nid, unsigned long flags)
{
struct memblock_region *rgn = &type->regions[idx];
@@ -341,6 +345,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn));
rgn->base = base;
rgn->size = size;
+ rgn->flags = flags;
memblock_set_region_node(rgn, nid);
type->cnt++;
type->total_size += size;
@@ -352,6 +357,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
* @base: base address of the new region
* @size: size of the new region
* @nid: nid of the new region
+ * @flags: flags of the new region
*
* Add new memblock region [@base,@base+@size) into @type. The new region
* is allowed to overlap with existing ones - overlaps don't affect already
@@ -362,7 +368,8 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
* 0 on success, -errno on failure.
*/
static int __init_memblock memblock_add_region(struct memblock_type *type,
- phys_addr_t base, phys_addr_t size, int nid)
+ phys_addr_t base, phys_addr_t size,
+ int nid, unsigned long flags)
{
bool insert = false;
phys_addr_t obase = base;
@@ -377,6 +384,7 @@ static int __init_memblock memblock_add_region(struct memblock_type *type,
WARN_ON(type->cnt != 1 || type->total_size);
type->regions[0].base = base;
type->regions[0].size = size;
+ type->regions[0].flags = flags;
memblock_set_region_node(&type->regions[0], nid);
type->total_size = size;
return 0;
@@ -407,7 +415,8 @@ repeat:
nr_new++;
if (insert)
memblock_insert_region(type, i++, base,
- rbase - base, nid);
+ rbase - base, nid,
+ flags);
}
/* area below @rend is dealt with, forget about it */
base = min(rend, end);
@@ -417,7 +426,8 @@ repeat:
if (base < end) {
nr_new++;
if (insert)
- memblock_insert_region(type, i, base, end - base, nid);
+ memblock_insert_region(type, i, base, end - base,
+ nid, flags);
}
/*
@@ -439,12 +449,14 @@ repeat:
int __init_memblock memblock_add_node(phys_addr_t base, phys_addr_t size,
int nid)
{
- return memblock_add_region(&memblock.memory, base, size, nid);
+ return memblock_add_region(&memblock.memory, base, size,
+ nid, MEMBLK_FLAGS_DEFAULT);
}
int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size)
{
- return memblock_add_region(&memblock.memory, base, size, MAX_NUMNODES);
+ return memblock_add_region(&memblock.memory, base, size,
+ MAX_NUMNODES, MEMBLK_FLAGS_DEFAULT);
}
/**
@@ -499,7 +511,8 @@ static int __init_memblock memblock_isolate_range(struct memblock_type *type,
rgn->size -= base - rbase;
type->total_size -= base - rbase;
memblock_insert_region(type, i, rbase, base - rbase,
- memblock_get_region_node(rgn));
+ memblock_get_region_node(rgn),
+ rgn->flags);
} else if (rend > end) {
/*
* @rgn intersects from above. Split and redo the
@@ -509,7 +522,8 @@ static int __init_memblock memblock_isolate_range(struct memblock_type *type,
rgn->size -= end - rbase;
type->total_size -= end - rbase;
memblock_insert_region(type, i--, rbase, end - rbase,
- memblock_get_region_node(rgn));
+ memblock_get_region_node(rgn),
+ rgn->flags);
} else {
/* @rgn is fully contained, record it */
if (!*end_rgn)
@@ -551,16 +565,25 @@ int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size)
return __memblock_remove(&memblock.reserved, base, size);
}
-int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+static int __init_memblock memblock_reserve_region(phys_addr_t base,
+ phys_addr_t size,
+ int nid,
+ unsigned long flags)
{
struct memblock_type *_rgn = &memblock.reserved;
- memblock_dbg("memblock_reserve: [%#016llx-%#016llx] %pF\n",
+ memblock_dbg("memblock_reserve: [%#016llx-%#016llx] with flags %#016lx %pF\n",
(unsigned long long)base,
(unsigned long long)base + size,
- (void *)_RET_IP_);
+ flags, (void *)_RET_IP_);
+
+ return memblock_add_region(_rgn, base, size, nid, flags);
+}
- return memblock_add_region(_rgn, base, size, MAX_NUMNODES);
+int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_reserve_region(base, size, MAX_NUMNODES,
+ MEMBLK_FLAGS_DEFAULT);
}
/**
@@ -985,6 +1008,7 @@ void __init_memblock memblock_set_current_limit(phys_addr_t limit)
static void __init_memblock memblock_dump(struct memblock_type *type, char *name)
{
unsigned long long base, size;
+ unsigned long flags;
int i;
pr_info(" %s.cnt = 0x%lx\n", name, type->cnt);
@@ -995,13 +1019,14 @@ static void __init_memblock memblock_dump(struct memblock_type *type, char *name
base = rgn->base;
size = rgn->size;
+ flags = rgn->flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
if (memblock_get_region_node(rgn) != MAX_NUMNODES)
snprintf(nid_buf, sizeof(nid_buf), " on node %d",
memblock_get_region_node(rgn));
#endif
- pr_info(" %s[%#x]\t[%#016llx-%#016llx], %#llx bytes%s\n",
- name, i, base, base + size - 1, size, nid_buf);
+ pr_info(" %s[%#x]\t[%#016llx-%#016llx], %#llx bytes%s flags: %#lx\n",
+ name, i, base, base + size - 1, size, nid_buf, flags);
}
}
--
1.7.1
We have introduced numa_sync_memblock_nid to synchronize nid info in
memblock.reserved[] and numa_meminfo. But memblock_reserve() always
reserve memory with MAX_NUMNODES, even after numa_meminfo has been
initialized.
So this patch improves memblock_reserve() to reserve memory with
correct nid.
Reported-by: Vasilis Liaskovitis <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
---
include/linux/memblock.h | 1 +
include/linux/mm.h | 9 +++++++++
mm/memblock.c | 11 +++++++++--
3 files changed, 19 insertions(+), 2 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 8d71795..d520015 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -63,6 +63,7 @@ int memblock_remove(phys_addr_t base, phys_addr_t size);
int memblock_free(phys_addr_t base, phys_addr_t size);
int memblock_reserve(phys_addr_t base, phys_addr_t size);
int memblock_reserve_hotpluggable(phys_addr_t base, phys_addr_t size, int nid);
+int memblock_reserve_node(phys_addr_t base, phys_addr_t size, int nid);
void memblock_free_hotpluggable(void);
void memblock_trim_memory(phys_addr_t align);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e0c8528..baa1aac 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1665,6 +1665,15 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
unsigned long start, unsigned long end);
#endif
+#ifdef CONFIG_NUMA
+int __init early_numa_find_range_nid(u64 start, u64 size);
+#else
+static inline int __init early_numa_find_range_nid(u64 start, u64 size)
+{
+ return 0;
+}
+#endif
+
struct vm_area_struct *find_extend_vma(struct mm_struct *, unsigned long addr);
int remap_pfn_range(struct vm_area_struct *, unsigned long addr,
unsigned long pfn, unsigned long size, pgprot_t);
diff --git a/mm/memblock.c b/mm/memblock.c
index 631b727..1f5dc12 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -597,12 +597,19 @@ static int __init_memblock memblock_reserve_region(phys_addr_t base,
return memblock_add_region(_rgn, base, size, nid, flags);
}
-int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+int __init_memblock memblock_reserve_node(phys_addr_t base,
+ phys_addr_t size, int nid)
{
- return memblock_reserve_region(base, size, MAX_NUMNODES,
+ return memblock_reserve_region(base, size, nid,
MEMBLK_FLAGS_DEFAULT);
}
+int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+{
+ int nid = early_numa_find_range_nid(base, size);
+ return memblock_reserve_node(base, size, nid);
+}
+
int __init_memblock memblock_reserve_hotpluggable(phys_addr_t base,
phys_addr_t size, int nid)
{
--
1.7.1
Hi,
Forgot to mention, this patch-set is based on linux-3.10 release.
Thanks. :)
On 07/19/2013 03:59 PM, Tang Chen wrote:
> This patch-set aims to solve some problems at system boot time
> to enhance memory hotplug functionality.
>
> [Background]
>
> The Linux kernel cannot migrate pages used by the kernel because
> of the kernel direct mapping. Since va = pa + PAGE_OFFSET, if the
> physical address is changed, we cannot simply update the kernel
> pagetable. On the contrary, we have to update all the pointers
> pointing to the virtual address, which is very difficult to do.
>
> In order to do memory hotplug, we should prevent the kernel to use
> hotpluggable memory.
>
> In ACPI, there is a table named SRAT(System Resource Affinity Table).
> It contains system NUMA info (CPUs, memory ranges, PXM), and also a
> flag field indicating which memory ranges are hotpluggable.
>
>
> [Problem to be solved]
>
> At the very early time when the system is booting, we use a bootmem
> allocator, named memblock, to allocate memory for the kernel.
> memblock will start to work before the kernel parse SRAT, which
> means memblock won't know which memory is hotpluggable before SRAT
> is parsed.
>
> So at this time, memblock could allocate hotpluggable memory for
> the kernel to use permanently. For example, the kernel may allocate
> pagetables in hotpluggable memory, which cannot be freed when the
> system is up.
>
> So we have to prevent memblock allocating hotpluggable memory for
> the kernel at the early boot time.
>
>
> [Earlier solutions]>
> We have tried to parse SRAT earlier, before memblock is ready. To
> do this, we also have to do ACPI_INITRD_TABLE_OVERRIDE earlier.
> Otherwise the override tables won't be able to effect.
>
> This is not that easy to do because memblock is ready before direct
> mapping is setup. So Yinghai split the ACPI_INITRD_TABLE_OVERRIDE
> procedure into two steps: find and copy. Please refer to the
> following patch-set:
> https://lkml.org/lkml/2013/6/13/587
>
> To this solution, tj gave a lot of comments and the following
> suggestions.
>
>
> [Suggestion from tj]
>
> tj mainly gave the following suggestions:
>
> 1. Necessary reordering is OK, but we should not rely on
> reordering to achieve the goal because it makes the kernel
> too fragile.
>
> 2. Memory allocated to kernel for temporary usage is OK because
> it will be freed when the system is up. Doing relocation
> for permanent allocated hotpluggable memory will make the
> the kernel more robust.
>
> 3. Need to enhance memblock to discover and complain if any
> hotpluggable memory is allocated to kernel.
>
> After a long thinking, we choose not to do the relocation for
> the following reasons:
>
> 1. It's easy to find out the allocated hotpluggable memory. But
> memblock will merge the adjoined ranges owned by different users
> and used for different purposes. It's hard to find the owners.
>
> 2. Different memory has different way to be relocated. I think one
> function for each kind of memory will make the code too messy.
>
> 3. Pagetable could be in hotpluggable memory. Relocating pagetable
> is too difficult and risky. We have to update all PUD, PMD pages.
> And also, ACPI_INITRD_TABLE_OVERRIDE and parsing SRAT procedures
> are not long after pagetable is initialized. If we relocate the
> pagetable not long after it was initialized, the code will be
> very ugly.
>
>
> [Solution in this patch-set]
>
> In this patch-set, we still do the reordering, but in a new way.
>
> 1. Improve memblock with flags, so that it is able to differentiate
> memory regions for different usage. And also a MEMBLOCK_HOTPLUGGABLE
> flag to mark hotpluggable memory.
> (patch 1 ~ 3)
>
> 2. When memblock is ready (memblock_x86_fill() is called), initialize
> acpi_gbl_root_table_list, fulfill all the ACPI tables' phys addrs.
> Now, we have all the ACPI tables' phys addrs provided by firmware.
> (patch 4 ~ 8)
>
> 3. Check if there is a SRAT in initrd file used to override the one
> provided by firmware. If so, get its phys addr.
> (patch 12)
>
> 4. If no override SRAT in initrd, get the phys addr of the SRAT
> provided by firmware.
> (patch 13)
>
> Now, we have the phys addr of the to be used SRAT, the one in
> initrd or the one in firmware.
>
> 5. Parse only the memory affinities in SRAT, find out all the
> hotpluggable memory regions and reserve them in memblock with
> MEMBLK_HOTPLUGGABLE flag.
> (patch 14 ~ 15)
>
> 6. The kernel goes through the current path. Any other related parts,
> such as ACPI_INITRD_TABLE_OVERRIDE path, the current parsing ACPI
> tables pathes, global variable numa_meminfo, and so on, are not
> modified. They work as before.
>
> 7. Free memory with MEMBLK_HOTPLUGGABLE flag when memory initialization
> is finished.
> (patch 16)
>
> 8. Introduce movablecore=acpi boot option to allow users to enable
> and disable this functionality.
> (patch 17 ~ 21)
>
> And patch 9 ~ 11 fix some small problems.
>
>
> In summary, in order to get hotpluggable memory info as early as possible,
> this patch-set only parse memory affinities in SRAT one more time right
> after memblock is ready, and leave all the other pathes untouched. With
> the hotpluggable memory info, we can arrange hotpluggable memory in
> ZONE_MOVABLE to prevent the kernel to use it.
>
> Thanks. :)
>
>
> Tang Chen (20):
> acpi: Print Hot-Pluggable Field in SRAT.
> memblock, numa: Introduce flag into memblock.
> x86, acpi, numa, mem-hotplug: Introduce MEMBLK_HOTPLUGGABLE to
> reserve hotpluggable memory.
> acpi: Remove "continue" in macro INVALID_TABLE().
> acpi: Introduce acpi_invalid_table() to check if a table is invalid.
> x86, acpi: Split acpi_boot_table_init() into two parts.
> x86, acpi: Initialize ACPI root table list earlier.
> x86, acpi: Also initialize signature and length when parsing root
> table.
> x86: Make get_ramdisk_{image|size}() global.
> earlycpio.c: Fix the confusing comment of find_cpio_data().
> x86, acpi: Try to find if SRAT is overrided earlier.
> x86, acpi: Try to find SRAT in firmware earlier.
> x86, acpi, numa: Reserve hotpluggable memory at early time.
> x86, acpi, numa: Don't reserve memory on nodes the kernel resides in.
> x86, memblock, mem-hotplug: Free hotpluggable memory reserved by
> memblock.
> page_alloc, mem-hotplug: Improve movablecore to {en|dis}able using
> SRAT.
> x86, numa: Synchronize nid info in memblock.reserve with
> numa_meminfo.
> x86, numa: Save nid when reserve memory into memblock.reserved[].
> x86, numa, acpi, memory-hotplug: Make movablecore=acpi have higher
> priority.
> doc, page_alloc, acpi, mem-hotplug: Add doc for movablecore=acpi boot
> option.
>
> Yasuaki Ishimatsu (1):
> x86: get pg_data_t's memory from other node
>
> Documentation/kernel-parameters.txt | 10 ++
> arch/x86/include/asm/setup.h | 3 +
> arch/x86/kernel/acpi/boot.c | 38 +++--
> arch/x86/kernel/setup.c | 22 +++-
> arch/x86/mm/numa.c | 55 +++++++-
> arch/x86/mm/srat.c | 11 +-
> drivers/acpi/acpica/tbutils.c | 48 ++++++-
> drivers/acpi/acpica/tbxface.c | 34 +++++
> drivers/acpi/osl.c | 262 ++++++++++++++++++++++++++++++++---
> drivers/acpi/tables.c | 7 +-
> include/acpi/acpixf.h | 6 +
> include/linux/acpi.h | 21 +++-
> include/linux/memblock.h | 9 ++
> include/linux/memory_hotplug.h | 5 +
> include/linux/mm.h | 9 ++
> lib/earlycpio.c | 7 +-
> mm/memblock.c | 90 ++++++++++--
> mm/memory_hotplug.c | 42 ++++++-
> mm/nobootmem.c | 3 +
> mm/page_alloc.c | 44 ++++++-
> 20 files changed, 656 insertions(+), 70 deletions(-)
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
On Fri, Jul 19, 2013 at 03:59:14PM +0800, Tang Chen wrote:
> The Hot-Pluggable field in SRAT suggests if the memory could be
> hotplugged while the system is running. Print it as well when
> parsing SRAT will help users to know which memory is hotpluggable.
>
> Signed-off-by: Tang Chen <[email protected]>
> Reviewed-by: Wanpeng Li <[email protected]>
Acked-by: Tejun Heo <[email protected]>
But a nit below
> + pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx] %s\n",
> + node, pxm,
> + (unsigned long long) start, (unsigned long long) end - 1,
> + hotpluggable ? "Hot Pluggable" : "");
The following would be more conventional.
"...10Lx]%s\n", ..., hotpluggable ? " Hot Pluggable" : ""
Also, isn't "Hot Pluggable" a bit too verbose? "hotplug" should be
fine, I think.
Thanks.
--
tejun
Hello,
On Fri, Jul 19, 2013 at 03:59:15PM +0800, Tang Chen wrote:
> +#define MEMBLK_FLAGS_DEFAULT 0x0 /* default flag */
Please don't do this. Just clearing the struct as zero is enough.
> @@ -439,12 +449,14 @@ repeat:
> int __init_memblock memblock_add_node(phys_addr_t base, phys_addr_t size,
> int nid)
> {
> - return memblock_add_region(&memblock.memory, base, size, nid);
> + return memblock_add_region(&memblock.memory, base, size,
> + nid, MEMBLK_FLAGS_DEFAULT);
And just use zero for no flag. Doing something like the above gets
weird with actual flags. e.g. if you add a flag, say, MEMBLK_HOTPLUG,
should it be MEMBLK_FLAGS_DEFAULT | MEMBLK_HOTPLUG or just
MEMBLK_HOTPLUG? If latter, the knowledge that DEFAULT is zero is
implicit, and, if so, why do it at all?
> +static int __init_memblock memblock_reserve_region(phys_addr_t base,
> + phys_addr_t size,
> + int nid,
> + unsigned long flags)
> {
> struct memblock_type *_rgn = &memblock.reserved;
>
> - memblock_dbg("memblock_reserve: [%#016llx-%#016llx] %pF\n",
> + memblock_dbg("memblock_reserve: [%#016llx-%#016llx] with flags %#016lx %pF\n",
Let's please drop "with" and do we really need to print full 16
digits?
Thanks.
--
tejun
On Tue, 2013-07-23 at 14:48 -0400, Tejun Heo wrote:
> On Fri, Jul 19, 2013 at 03:59:14PM +0800, Tang Chen wrote:
> > The Hot-Pluggable field in SRAT suggests if the memory could be
> > hotplugged while the system is running. Print it as well when
> > parsing SRAT will help users to know which memory is hotpluggable.
[]
> a nit below
>
> > + pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx] %s\n",
> > + node, pxm,
> > + (unsigned long long) start, (unsigned long long) end - 1,
> > + hotpluggable ? "Hot Pluggable" : "");
>
> The following would be more conventional.
>
> "...10Lx]%s\n", ..., hotpluggable ? " Hot Pluggable" : ""
>
> Also, isn't "Hot Pluggable" a bit too verbose? "hotplug" should be
> fine, I think.
It's also a tiny nit better to use:
pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]%s\n",
node, pxm,
(unsigned long long) start, (unsigned long long) end - 1,
hotpluggable ? " Hot Pluggable" : "");
(or " hotplug")
so there's no space before newline.
On Fri, Jul 19, 2013 at 03:59:17PM +0800, Tang Chen wrote:
> The macro INVALID_TABLE() is defined like this:
>
> #define INVALID_TABLE(x, path, name) \
> { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); continue; }
>
> And it is used like this:
>
> for (...) {
> ...
> if (...)
> INVALID_TABLE()
> ...
> }
>
> The "continue" in the macro makes the code hard to understand.
> Change it to the style like other macros:
>
> #define INVALID_TABLE(x, path, name) \
> do { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); } while (0)
>
> So after this patch, this macro should be used like this:
>
> for (...) {
> ...
> if (...) {
> INVALID_TABLE()
> continue;
> }
> ...
> }
>
> Add the "continue" wherever the macro is called.
> (For now, it is only called in acpi_initrd_override().)
>
> The idea is from Yinghai Lu <[email protected]>.
>
> Signed-off-by: Tang Chen <[email protected]>
> Signed-off-by: Yinghai Lu <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Thanks.
--
tejun
On Fri, Jul 19, 2013 at 03:59:16PM +0800, Tang Chen wrote:
> /* Definition of memblock flags. */
> #define MEMBLK_FLAGS_DEFAULT 0x0 /* default flag */
> +#define MEMBLK_HOTPLUGGABLE 0x1 /* hotpluggable region */
Given that all existing APIs are using "memblock", wouldn't it be
better to use "MEMBLOCK_" prefix? If it's too long, we can just do
MEMBLOCK_HOTPLUG.
Thanks.
--
tejun
On Tue, Jul 23, 2013 at 12:15:58PM -0700, Joe Perches wrote:
> > The following would be more conventional.
> >
> > "...10Lx]%s\n", ..., hotpluggable ? " Hot Pluggable" : ""
> >
> > Also, isn't "Hot Pluggable" a bit too verbose? "hotplug" should be
> > fine, I think.
>
> It's also a tiny nit better to use:
>
> pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]%s\n",
> node, pxm,
> (unsigned long long) start, (unsigned long long) end - 1,
> hotpluggable ? " Hot Pluggable" : "");
>
> (or " hotplug")
>
> so there's no space before newline.
Which was my first point which apparently wasn't clear enough. :)
Thanks.
--
tejun
On Tue, 2013-07-23 at 15:20 -0400, Tejun Heo wrote:
> On Tue, Jul 23, 2013 at 12:15:58PM -0700, Joe Perches wrote:
[]
> > so there's no space before newline.
> Which was my first point which apparently wasn't clear enough. :)
Nah, I just constantly need to relearn how to read...
On Fri, Jul 19, 2013 at 03:59:21PM +0800, Tang Chen wrote:
> @@ -514,6 +514,7 @@ acpi_tb_install_table(acpi_physical_address address,
> * fully mapped later (in verify table). In any case, we must
> * unmap the header that was mapped above.
> */
> + table_desc = &acpi_gbl_root_table_list.tables[table_index];
> final_table = acpi_tb_table_override(table, table_desc);
> if (!final_table) {
> final_table = table; /* There was no override */
Is this chunk correct? Why is @table_desc being assigned twice in
this function?
> + /*
> + * Also initialize the table entries here, so that later we can use them
> + * to find SRAT at very eraly time to reserve hotpluggable memory.
^ typo
Thanks.
--
tejun
On Fri, Jul 19, 2013 at 03:59:22PM +0800, Tang Chen wrote:
> In the following patches, we need to call get_ramdisk_{image|size}()
> to get initrd file's address and size. So make these two functions
> global.
>
> Signed-off-by: Tang Chen <[email protected]>
> ---
> arch/x86/include/asm/setup.h | 3 +++
> arch/x86/kernel/setup.c | 4 ++--
> 2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
> index b7bf350..69de7a1 100644
> --- a/arch/x86/include/asm/setup.h
> +++ b/arch/x86/include/asm/setup.h
> @@ -106,6 +106,9 @@ void *extend_brk(size_t size, size_t align);
> RESERVE_BRK(name, sizeof(type) * entries)
>
> extern void probe_roms(void);
> +u64 get_ramdisk_image(void);
> +u64 get_ramdisk_size(void);
Might as well make these accessors inline functions.
Thanks.
--
tejun
On Fri, Jul 19, 2013 at 03:59:23PM +0800, Tang Chen wrote:
> - * @offset: When a matching file is found, this is the offset to the
> - * beginning of the cpio. It can be used to iterate through
> - * the cpio to find all files inside of a directory path
> + * @offset: When a matching file is found, this is the offset from the
> + * beginning of the cpio to the beginning of the next file, not the
> + * matching file itself. It can be used to iterate through the cpio
> + * to find all files inside of a directory path
Nicely spotted. I think we can go further and rename it to @nextoff.
Thanks.
--
tejun
On Fri, Jul 19, 2013 at 03:59:24PM +0800, Tang Chen wrote:
> From: Yasuaki Ishimatsu <[email protected]>
>
> If system can create movable node which all memory of the
> node is allocated as ZONE_MOVABLE, setup_node_data() cannot
> allocate memory for the node's pg_data_t.
> So, use memblock_alloc_try_nid() instead of memblock_alloc_nid()
> to retry when the first allocation fails. Otherwise, the system
> could failed to boot.
>
> Signed-off-by: Yasuaki Ishimatsu <[email protected]>
> Signed-off-by: Lai Jiangshan <[email protected]>
> Signed-off-by: Tang Chen <[email protected]>
> Signed-off-by: Jiang Liu <[email protected]>
> Reviewed-by: Wanpeng Li <[email protected]>
> ---
> arch/x86/mm/numa.c | 5 ++---
> 1 files changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index a71c4e2..5013583 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -209,10 +209,9 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
> * Allocate node data. Try node-local memory and then any node.
> * Never allocate in DMA zone.
> */
> - nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
> + nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
> if (!nd_pa) {
> - pr_err("Cannot find %zu bytes in node %d\n",
> - nd_size, nid);
> + pr_err("Cannot find %zu bytes in any node\n", nd_size);
Hmm... we want the node data to be colocated on the same node and I
don't think being hotpluggable necessarily requires the node data to
be allocated on a different node. Does node data of a hotpluggable
node need to stay around after hotunplug?
I don't think it's a huge issue but it'd be great if we can clarify
where the restriction is coming from.
Thanks.
--
tejun
On Fri, Jul 19, 2013 at 03:59:25PM +0800, Tang Chen wrote:
> As we mentioned in previous patches, to prevent the kernel
Prolly best to briefly describe what the overall goal is about.
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 28d2e60..9717760 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1078,6 +1078,15 @@ void __init setup_arch(char **cmdline_p)
> /* Initialize ACPI root table */
> acpi_root_table_init();
>
> +#ifdef CONFIG_ACPI_NUMA
> + /*
> + * Linux kernel cannot migrate kernel pages, as a result, memory used
> + * by the kernel cannot be hot-removed. Reserve hotpluggable memory to
> + * prevent memblock from allocating hotpluggable memory for the kernel.
> + */
> + reserve_hotpluggable_memory();
> +#endif
Hmmm, so you're gonna reserve all hotpluggable memory areas until
everything is up and running, which probably is why allocating
node_data on hotpluggable node doesn't work, right?
> +#ifdef CONFIG_ACPI_NUMA
> +/*
/**
> + * early_acpi_override_srat - Try to get the phys addr of SRAT in initrd.
> + *
> + * The ACPI_INITRD_TABLE_OVERRIDE procedure is able to use tables in initrd
> + * file to override the ones provided by firmware. This function checks if
> + * there is a SRAT in initrd at early time. If so, return the phys addr of
> + * the SRAT.
> + *
> + * Return the phys addr of SRAT in initrd, 0 if there is no SRAT.
> + */
> +phys_addr_t __init early_acpi_override_srat(void)
> +{
> + int i;
> + u32 length;
> + long offset;
> + void *ramdisk_vaddr;
> + struct acpi_table_header *table;
> + unsigned long map_step = NR_FIX_BTMAPS << PAGE_SHIFT;
> + phys_addr_t ramdisk_image = get_ramdisk_image();
> + char cpio_path[32] = "kernel/firmware/acpi/";
> + struct cpio_data file;
Don't we usually put variable declarations with initializers before
others? For some reason, the above block is painful to look at.
> + /* Try to find if SRAT is overrided */
^
overridden?
...
> +#ifdef CONFIG_ACPI_NUMA
> +/*
/**
> + * reserve_hotpluggable_memory - Reserve hotpluggable memory in memblock.
> + *
> + * This function did the following:
> + * 1. Try to find if there is a SRAT in initrd file used to override the one
> + * provided by firmware. If so, get its phys addr.
> + * 2. If there is no override SRAT, get the phys addr of the SRAT in firmware.
> + * 3. Parse SRAT, find out which memory is hotpluggable, and reserve it in
> + * memblock.
> + */
> +void __init reserve_hotpluggable_memory(void)
--
tejun
On Fri, Jul 19, 2013 at 03:59:26PM +0800, Tang Chen wrote:
> +/*
/**
> + * acpi_get_table_desc - Get the acpi table descriptor of a specific table.
> + * @signature: The signature of the table to be found.
> + * @out_desc: The out returned descriptor.
> + *
> + * This function iterates acpi_gbl_root_table_list and find the specified
> + * table's descriptor.
> + *
> + * NOTE: The caller has the responsibility to allocate memory for @out_desc.
> + *
> + * Return AE_OK on success, AE_NOT_FOUND if the table is not found.
> + */
> +acpi_status acpi_get_table_desc(char *signature,
> + struct acpi_table_desc *out_desc)
> +{
> + int pos;
> +
> + for (pos = 0;
> + pos < acpi_gbl_root_table_list.current_table_count;
> + pos++) {
> + if (!ACPI_COMPARE_NAME
> + (&(acpi_gbl_root_table_list.tables[pos].signature),
> + signature))
Hohumm... creative formatting. Can't you just cache the tables
pointer in a local variable with short name and avoid the creativity?
> + continue;
> +
> + memcpy(out_desc, &acpi_gbl_root_table_list.tables[pos],
> + sizeof(struct acpi_table_desc));
> +
> + return_ACPI_STATUS(AE_OK);
> + }
> +
> + return_ACPI_STATUS(AE_NOT_FOUND);
Also, if we already know that SRAT is what we want, I wonder whether
it'd be simpler to store the location of SRAT somewhere instead of
trying to be generic with the early processing.
Thanks.
--
tejun
On Fri, Jul 19, 2013 at 03:59:27PM +0800, Tang Chen wrote:
> + /*
> + * In such an early time, we don't have nid. We specify pxm
> + * instead of MAX_NUMNODES to prevent memblock merging regions
> + * on different nodes. And later modify pxm to nid when nid is
> + * mapped so that we can arrange ZONE_MOVABLE on different
> + * nodes.
> + */
> + memblock_reserve_hotpluggable(base_address, length, pxm);
This is rather hacky. Why not just introduce MEMBLOCK_NO_MERGE flag?
--
tejun
Hello,
On Fri, Jul 19, 2013 at 03:59:28PM +0800, Tang Chen wrote:
> /*
> + * kernel_resides_in_range - Check if kernel resides in a memory range.
> + * @base: The base address of the memory range.
> + * @length: The length of the memory range.
> + *
> + * memblock reserves some memory for the kernel at very early time, such
> + * as kernel code and data segments, initrd file, and so on. So this
> + * function iterates memblock.reserved[] and check if any memory range with
> + * flag MEMBLK_FLAGS_DEFAULT overlaps [@base, @length). If so, the kernel
> + * resides in this memory range.
> + *
> + * Return true if the kernel resides in the memory range, false otherwise.
> + */
> +static bool __init kernel_resides_in_range(phys_addr_t base, u64 length)
> +{
> + int i;
> + struct memblock_type *reserved = &memblock.reserved;
> + struct memblock_region *region;
> + phys_addr_t start, end;
> +
> + for (i = 0; i < reserved->cnt; i++) {
> + region = &reserved->regions[i];
> +
> + if (region->flags != MEMBLK_FLAGS_DEFAULT)
> + continue;
> +
> + start = region->base;
> + end = region->base + region->size;
> + if (end <= base || start >= base + length)
> + continue;
> +
> + return true;
> + }
> +
> + return false;
> +}
This being in acpi/osl.c is rather weird. Overall, the acpi and
memblock parts don't seem very well split. It'd best if acpi just
indicates which regions are hotpluggable and the rest is handled by
x86 boot or memblock code as appropriate.
Thanks.
--
tejun
On Fri, Jul 19, 2013 at 03:59:29PM +0800, Tang Chen wrote:
> We reserved hotpluggable memory in memblock at early time. And when memory
> initialization is done, we have to free it to buddy system.
>
> This patch free memory reserved by memblock with flag MEMBLK_HOTPLUGGABLE.
Sequencing patches this way means machines will run with hotpluggable
regions reserved inbetween. Please put the reserving and freeing into
the same patch.
Thanks.
--
tejun
On Fri, Jul 19, 2013 at 03:59:30PM +0800, Tang Chen wrote:
...
> Users can specify "movablecore=acpi" in kernel commandline to enable this
> functionality. For those who don't use memory hotplug or who don't want
> to lose their NUMA performance, just don't specify anything. The kernel
> will work as before.
The param name is pretty obscure and why would the user care where
that hotplug information came from? Shouldn't it be something handled
by memblock proper? ie. a knob which tells memblock to either honor
or ignore hotpluggable area reservations.
Thanks.
--
tejun
On Tue, Jul 23, 2013 at 05:04:35PM -0400, Tejun Heo wrote:
> On Fri, Jul 19, 2013 at 03:59:30PM +0800, Tang Chen wrote:
> ...
> > Users can specify "movablecore=acpi" in kernel commandline to enable this
> > functionality. For those who don't use memory hotplug or who don't want
> > to lose their NUMA performance, just don't specify anything. The kernel
> > will work as before.
>
> The param name is pretty obscure and why would the user care where
I mean, having movable zone is required for having any decent chance
of memory hotplug and movable zone implies worse affinity for kernel
data structures, so there's no point in distinguishing memory hotplug
enable/disable and this, right?
--
tejun
On Fri, Jul 19, 2013 at 03:59:33PM +0800, Tang Chen wrote:
> Arrange hotpluggable memory as ZONE_MOVABLE will cause NUMA performance down
> because the kernel cannot use movable memory. For users who don't use memory
> hotplug and who don't want to lose their NUMA performance, they need a way to
> disable this functionality. So we improved movablecore boot option.
>
> If users specify the original movablecore=nn@ss boot option, the kernel will
> arrange [ss, ss+nn) as ZONE_MOVABLE. The kernelcore=nn@ss boot option is similar
> except it specifies ZONE_NORMAL ranges.
>
> Now, if users specify "movablecore=acpi" in kernel commandline, the kernel will
> arrange hotpluggable memory in SRAT as ZONE_MOVABLE. And if users do this, all
> the other movablecore=nn@ss and kernelcore=nn@ss options should be ignored.
>
> For those who don't want this, just specify nothing. The kernel will act as
> before.
As I wrote before, I find movablecore=acpi rather weird. Shouldn't it
be memory_hotplug enable/disable switch instead and movablecore, if
specified, overriding information provided by firmware?
Thanks.
--
tejun
On Fri, Jul 19, 2013 at 03:59:34PM +0800, Tang Chen wrote:
> Since we modify movablecore boot option to support
> "movablecore=acpi", this patch adds doc for it.
Please fold this into the patch which makes the code chnage.
Thanks.
--
tejun
On Fri, Jul 19, 2013 at 03:59:31PM +0800, Tang Chen wrote:
> Vasilis Liaskovitis found that before we parse SRAT and fulfill numa_meminfo,
> the nids of all the regions in memblock.reserve[] are MAX_NUMNODES. That is
> because nids have not been mapped at that time.
>
> When we arrange ZONE_MOVABLE in each node later, we need nid in memblock. So
> after we parse SRAT and fulfill nume_meminfo, synchronize the nid info to
> memblock.reserve[] immediately.
Having a separate sync is rather nasty. Why not let
memblock_set_node() and alloc functions set nid on the reserved
regions?
Thanks.
--
tejun
On Tue, Jul 23, 2013 at 04:55:57PM -0400, Tejun Heo wrote:
> On Fri, Jul 19, 2013 at 03:59:27PM +0800, Tang Chen wrote:
> > + /*
> > + * In such an early time, we don't have nid. We specify pxm
> > + * instead of MAX_NUMNODES to prevent memblock merging regions
> > + * on different nodes. And later modify pxm to nid when nid is
> > + * mapped so that we can arrange ZONE_MOVABLE on different
> > + * nodes.
> > + */
> > + memblock_reserve_hotpluggable(base_address, length, pxm);
>
> This is rather hacky. Why not just introduce MEMBLOCK_NO_MERGE flag?
Also, if memblock is gonna know about hotplug memory, why not just let
it control its allocation too instead of blocking it by reserving it
from outside? These are all pretty general memory hotplug logic which
doesn't have much to do with acpi and I think too much is implemented
on the acpi side.
Thanks.
--
tejun
On 07/19/2013 12:59 AM, Tang Chen wrote:
> This patch introduce early_acpi_firmware_srat() to find the
> phys addr of SRAT provided by firmware. And call it in
> reserve_hotpluggable_memory().
>
> Since we have initialized acpi_gbl_root_table_list earlier,
> and store all the tables' phys addrs and signatures in it,
> it is easy to find the SRAT.
>
> Signed-off-by: Tang Chen <[email protected]>
> ---
> drivers/acpi/acpica/tbxface.c | 34 ++++++++++++++++++++++++++++++++++
> drivers/acpi/osl.c | 24 ++++++++++++++++++++++++
> include/acpi/acpixf.h | 4 ++++
> include/linux/acpi.h | 4 ++++
> mm/memory_hotplug.c | 10 +++++++---
> 5 files changed, 73 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/acpi/acpica/tbxface.c b/drivers/acpi/acpica/tbxface.c
> index ad11162..95f8d1b 100644
> --- a/drivers/acpi/acpica/tbxface.c
> +++ b/drivers/acpi/acpica/tbxface.c
> @@ -181,6 +181,40 @@ acpi_status acpi_reallocate_root_table(void)
> return_ACPI_STATUS(status);
> }
>
> +/*
> + * acpi_get_table_desc - Get the acpi table descriptor of a specific table.
> + * @signature: The signature of the table to be found.
> + * @out_desc: The out returned descriptor.
The "@out_desc:" line looks funky. Also, I believe changes to this file
need to go in via acpica & probably conform to their commenting standards?
> + *
> + * This function iterates acpi_gbl_root_table_list and find the specified
> + * table's descriptor.
> + *
> + * NOTE: The caller has the responsibility to allocate memory for @out_desc.
> + *
> + * Return AE_OK on success, AE_NOT_FOUND if the table is not found.
> + */
> +acpi_status acpi_get_table_desc(char *signature,
> + struct acpi_table_desc *out_desc)
> +{
> + int pos;
> +
> + for (pos = 0;
> + pos < acpi_gbl_root_table_list.current_table_count;
> + pos++) {
> + if (!ACPI_COMPARE_NAME
> + (&(acpi_gbl_root_table_list.tables[pos].signature),
> + signature))
> + continue;
> +
> + memcpy(out_desc, &acpi_gbl_root_table_list.tables[pos],
> + sizeof(struct acpi_table_desc));
> +
> + return_ACPI_STATUS(AE_OK);
> + }
> +
> + return_ACPI_STATUS(AE_NOT_FOUND);
> +}
> +
> /*******************************************************************************
> *
> * FUNCTION: acpi_get_table_header
> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> index fa6b973..a2e4596 100644
> --- a/drivers/acpi/osl.c
> +++ b/drivers/acpi/osl.c
> @@ -53,6 +53,7 @@
> #include <acpi/acpi.h>
> #include <acpi/acpi_bus.h>
> #include <acpi/processor.h>
> +#include <acpi/acpixf.h>
>
> #define _COMPONENT ACPI_OS_SERVICES
> ACPI_MODULE_NAME("osl");
> @@ -750,6 +751,29 @@ void __init acpi_initrd_override(void *data, size_t size)
> }
> #endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */
>
> +#ifdef CONFIG_ACPI_NUMA
> +#include <asm/numa.h>
> +#include <linux/memblock.h>
> +
> +/*
> + * early_acpi_firmware_srat - Get the phys addr of SRAT provide by firmware.
s/provide/provided/
> + *
> + * This function iterate acpi_gbl_root_table_list, find SRAT and return the
Perhaps: "Iterate over acpi_gbl_root_table_list to find SRAT then return
its phys addr"
Though I wonder if this comment is even needed, as the iteration is done
in acpi_get_table_desc() (added above).
> + * phys addr of SRAT.
> + *
> + * Return the phys addr of SRAT, or 0 on error.
> + */
> +phys_addr_t __init early_acpi_firmware_srat()
> +{
> + struct acpi_table_desc table_desc;
> +
> + if (acpi_get_table_desc(ACPI_SIG_SRAT, &table_desc))
> + return 0;
> +
> + return table_desc.address;
> +}
> +#endif /* CONFIG_ACPI_NUMA */
> +
> static void acpi_table_taint(struct acpi_table_header *table)
> {
> pr_warn(PREFIX
[...]
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 066873e..15b11d3 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -106,10 +106,14 @@ void __init reserve_hotpluggable_memory(void)
> {
> phys_addr_t srat_paddr;
>
> - /* Try to find if SRAT is overrided */
> + /* Try to find out if SRAT is overrided */
> srat_paddr = early_acpi_override_srat();
> - if (!srat_paddr)
> - return;
> + if (!srat_paddr) {
> + /* Try to find SRAT from firmware if it wasn't overrided */
s/overrided/overridden/
> + srat_paddr = early_acpi_firmware_srat();
> + if (!srat_paddr)
> + return;
> + }
>
> /* Will reserve hotpluggable memory here */
> }
>
On 07/24/2013 02:48 AM, Tejun Heo wrote:
> On Fri, Jul 19, 2013 at 03:59:14PM +0800, Tang Chen wrote:
>> The Hot-Pluggable field in SRAT suggests if the memory could be
>> hotplugged while the system is running. Print it as well when
>> parsing SRAT will help users to know which memory is hotpluggable.
>>
>> Signed-off-by: Tang Chen<[email protected]>
>> Reviewed-by: Wanpeng Li<[email protected]>
>
> Acked-by: Tejun Heo<[email protected]>
>
> But a nit below
>
>> + pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx] %s\n",
>> + node, pxm,
>> + (unsigned long long) start, (unsigned long long) end - 1,
>> + hotpluggable ? "Hot Pluggable" : "");
>
> The following would be more conventional.
>
> "...10Lx]%s\n", ..., hotpluggable ? " Hot Pluggable" : ""
>
> Also, isn't "Hot Pluggable" a bit too verbose? "hotplug" should be
> fine, I think.
>
Hi tj, Joe,
OK,will change it as you guys said.
Thank you very much.
Thanks.
On 07/24/2013 03:09 AM, Tejun Heo wrote:
> Hello,
>
> On Fri, Jul 19, 2013 at 03:59:15PM +0800, Tang Chen wrote:
>> +#define MEMBLK_FLAGS_DEFAULT 0x0 /* default flag */
>
> Please don't do this. Just clearing the struct as zero is enough.
>
>> @@ -439,12 +449,14 @@ repeat:
>> int __init_memblock memblock_add_node(phys_addr_t base, phys_addr_t size,
>> int nid)
>> {
>> - return memblock_add_region(&memblock.memory, base, size, nid);
>> + return memblock_add_region(&memblock.memory, base, size,
>> + nid, MEMBLK_FLAGS_DEFAULT);
>
> And just use zero for no flag. Doing something like the above gets
> weird with actual flags. e.g. if you add a flag, say, MEMBLK_HOTPLUG,
> should it be MEMBLK_FLAGS_DEFAULT | MEMBLK_HOTPLUG or just
> MEMBLK_HOTPLUG? If latter, the knowledge that DEFAULT is zero is
> implicit, and, if so, why do it at all?
OK, will remove MEMBLK_FLAGS_DEFAULT, and use 0 by default.
>
>> +static int __init_memblock memblock_reserve_region(phys_addr_t base,
>> + phys_addr_t size,
>> + int nid,
>> + unsigned long flags)
>> {
>> struct memblock_type *_rgn =&memblock.reserved;
>>
>> - memblock_dbg("memblock_reserve: [%#016llx-%#016llx] %pF\n",
>> + memblock_dbg("memblock_reserve: [%#016llx-%#016llx] with flags %#016lx %pF\n",
>
> Let's please drop "with" and do we really need to print full 16
> digits?
Sure, will remove "with". But I think printing out the full flags is batter.
The output seems more tidy.
Thanks.
On 07/24/2013 03:19 AM, Tejun Heo wrote:
> On Fri, Jul 19, 2013 at 03:59:16PM +0800, Tang Chen wrote:
>> /* Definition of memblock flags. */
>> #define MEMBLK_FLAGS_DEFAULT 0x0 /* default flag */
>> +#define MEMBLK_HOTPLUGGABLE 0x1 /* hotpluggable region */
>
> Given that all existing APIs are using "memblock", wouldn't it be
> better to use "MEMBLOCK_" prefix? If it's too long, we can just do
> MEMBLOCK_HOTPLUG.
OK, followed.
Thanks.
On 07/24/2013 03:56 AM, Tejun Heo wrote:
> On Fri, Jul 19, 2013 at 03:59:22PM +0800, Tang Chen wrote:
>> In the following patches, we need to call get_ramdisk_{image|size}()
>> to get initrd file's address and size. So make these two functions
>> global.
>>
>> Signed-off-by: Tang Chen<[email protected]>
>> ---
>> arch/x86/include/asm/setup.h | 3 +++
>> arch/x86/kernel/setup.c | 4 ++--
>> 2 files changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
>> index b7bf350..69de7a1 100644
>> --- a/arch/x86/include/asm/setup.h
>> +++ b/arch/x86/include/asm/setup.h
>> @@ -106,6 +106,9 @@ void *extend_brk(size_t size, size_t align);
>> RESERVE_BRK(name, sizeof(type) * entries)
>>
>> extern void probe_roms(void);
>> +u64 get_ramdisk_image(void);
>> +u64 get_ramdisk_size(void);
>
> Might as well make these accessors inline functions.\
Sure, will make them as static inline functions in
arch/x86/include/asm/setup.h.
Thanks.
On 07/24/2013 04:02 AM, Tejun Heo wrote:
> On Fri, Jul 19, 2013 at 03:59:23PM +0800, Tang Chen wrote:
>> - * @offset: When a matching file is found, this is the offset to the
>> - * beginning of the cpio. It can be used to iterate through
>> - * the cpio to find all files inside of a directory path
>> + * @offset: When a matching file is found, this is the offset from the
>> + * beginning of the cpio to the beginning of the next file, not the
>> + * matching file itself. It can be used to iterate through the cpio
>> + * to find all files inside of a directory path
>
> Nicely spotted. I think we can go further and rename it to @nextoff.
OK, followed.
Thanks.
On 07/24/2013 04:09 AM, Tejun Heo wrote:
> On Fri, Jul 19, 2013 at 03:59:24PM +0800, Tang Chen wrote:
>> From: Yasuaki Ishimatsu<[email protected]>
>>
>> If system can create movable node which all memory of the
>> node is allocated as ZONE_MOVABLE, setup_node_data() cannot
>> allocate memory for the node's pg_data_t.
>> So, use memblock_alloc_try_nid() instead of memblock_alloc_nid()
>> to retry when the first allocation fails. Otherwise, the system
>> could failed to boot.
......
>> - nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
>> + nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
>> if (!nd_pa) {
>> - pr_err("Cannot find %zu bytes in node %d\n",
>> - nd_size, nid);
>> + pr_err("Cannot find %zu bytes in any node\n", nd_size);
>
> Hmm... we want the node data to be colocated on the same node and I
> don't think being hotpluggable necessarily requires the node data to
> be allocated on a different node. Does node data of a hotpluggable
> node need to stay around after hotunplug?
>
> I don't think it's a huge issue but it'd be great if we can clarify
> where the restriction is coming from.
>
You are right, the node data could be on hotpluggable node. And Yinghai
also said pagetable and vmemmap could be on hotpluggable node.
But for now, doing so will break memory hot-remove path. I should have
mentioned so in the log, which I didn't do.
A node could have several memory devices. And the device who holds node
data should be hot-removed in the last place. But in NUAM level, we don't
know which memory_block (/sys/devices/system/node/nodeX/memoryXXX) belongs
to which memory device. We only have node. So we can only do node hotplug.
Also as Yinghai's previous patch-set did, he put pagetable on local node.
And we met the same problem. when hot-removing memory, we have to ensure
the memory device containing pagetable being hot-removed in the last place.
But in virtualization, developers are now developing memory hotplug in qemu,
which support a single memory device hotplug. So a whole node hotplug will
not satisfy virtualization users.
At last, we concluded that we'd better do memory hotplug and local node
things (local node node data, pagetable, vmemmap, ...) in two steps.
Please refer to https://lkml.org/lkml/2013/6/19/73
The node data should be on local, I agree with that. I'm not saying I
won't do it. Just for now, it will be complicated to fix memory hot-remove
path. So I think pushing this patch for now, and do the local node things
in the next step.
Thanks.
On 07/24/2013 04:27 AM, Tejun Heo wrote:
> On Fri, Jul 19, 2013 at 03:59:25PM +0800, Tang Chen wrote:
>> As we mentioned in previous patches, to prevent the kernel
>
> Prolly best to briefly describe what the overall goal is about.
Sure. Here is the overall picture, and will add it to log.
Linux cannot migrate pages used by the kernel due to the direct mapping
(va = pa + PAGE_OFFSET), any memory used by the kernel cannot be
hot-removed.
So in memory hotplug platform, we have to prevent the kernel from using
hotpluggable memory.
The ACPI table SRAT (System Resource Affinity Table) contains info to
specify
which memory is hotpluggble. After SRAT is parsed, we are aware of which
memory is hotpluggable.
At the early time when system is booting, SRAT has not been parsed. The boot
memory allocator memblock will allocate any memory to the kernel. So we need
SRAT parsed before memblock starts to work.
In this patch, we are going to parse SRAT earlier, right after memblock
is ready.
Generally speaking, SRAT is provided by firmware. But
ACPI_INITRD_TABLE_OVERRIDE
functionality allows users to customize their own SRAT in initrd, and
override
the one from firmware. So if we want to parse SRAT earlier, we also need
to do
SRAT override earlier.
First, we introduce early_acpi_override_srat() to check if SRAT will be
overridden
from initrd.
Second, we introduce reserve_hotpluggable_memory() to reserve
hotpluggable memory,
which will firstly call early_acpi_override_srat() to find out which
memory is
hotpluggable in the override SRAT.
>
>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>> index 28d2e60..9717760 100644
>> --- a/arch/x86/kernel/setup.c
>> +++ b/arch/x86/kernel/setup.c
>> @@ -1078,6 +1078,15 @@ void __init setup_arch(char **cmdline_p)
>> /* Initialize ACPI root table */
>> acpi_root_table_init();
>>
>> +#ifdef CONFIG_ACPI_NUMA
>> + /*
>> + * Linux kernel cannot migrate kernel pages, as a result, memory used
>> + * by the kernel cannot be hot-removed. Reserve hotpluggable memory to
>> + * prevent memblock from allocating hotpluggable memory for the kernel.
>> + */
>> + reserve_hotpluggable_memory();
>> +#endif
>
> Hmmm, so you're gonna reserve all hotpluggable memory areas until
> everything is up and running, which probably is why allocating
> node_data on hotpluggable node doesn't work, right?
Yes, that's right. The node_data of hotpluggable node is now put on another
unhotpluggable node.
>
......
>> +phys_addr_t __init early_acpi_override_srat(void)
>> +{
>> + int i;
>> + u32 length;
>> + long offset;
>> + void *ramdisk_vaddr;
>> + struct acpi_table_header *table;
>> + unsigned long map_step = NR_FIX_BTMAPS<< PAGE_SHIFT;
>> + phys_addr_t ramdisk_image = get_ramdisk_image();
>> + char cpio_path[32] = "kernel/firmware/acpi/";
>> + struct cpio_data file;
>
> Don't we usually put variable declarations with initializers before
> others? For some reason, the above block is painful to look at.
OK, followed.
Thanks.
On 07/24/2013 04:49 AM, Tejun Heo wrote:
> On Fri, Jul 19, 2013 at 03:59:26PM +0800, Tang Chen wrote:
......
>> + for (pos = 0;
>> + pos< acpi_gbl_root_table_list.current_table_count;
>> + pos++) {
>> + if (!ACPI_COMPARE_NAME
>> + (&(acpi_gbl_root_table_list.tables[pos].signature),
>> + signature))
>
> Hohumm... creative formatting. Can't you just cache the tables
> pointer in a local variable with short name and avoid the creativity?
OK, followed.
>
>> + continue;
>> +
>> + memcpy(out_desc,&acpi_gbl_root_table_list.tables[pos],
>> + sizeof(struct acpi_table_desc));
>> +
>> + return_ACPI_STATUS(AE_OK);
>> + }
>> +
>> + return_ACPI_STATUS(AE_NOT_FOUND);
>
> Also, if we already know that SRAT is what we want, I wonder whether
> it'd be simpler to store the location of SRAT somewhere instead of
> trying to be generic with the early processing.
Do you mean get the SRAT's address without touching any ACPI global
variables, such as acpi_gbl_root_table_list ?
The physical addresses of all tables is stored in RSDT (Root System
Description Table), which is the root table. We need to parse RSDT
to get SRAT address.
Using acpi_gbl_root_table_list is very convenient. The initialization
of acpi_gbl_root_table_list is using acpi_os_map_memory(), so it can be
done before init_mem_mapping() and relocate_initrd().
With acpi_gbl_root_table_list initialized, we can iterate it and find
SRAT easily. Otherwise, we have to do the same procedure to parse RSDT,
and find SRAT, which I don't think could be any simpler. I think reuse
the existing acpi_gbl_root_table_list code is better.
Thanks.
On 07/24/2013 07:26 AM, Cody P Schafer wrote:
> On 07/19/2013 12:59 AM, Tang Chen wrote:
......
>> +/*
>> + * acpi_get_table_desc - Get the acpi table descriptor of a specific
>> table.
>> + * @signature: The signature of the table to be found.
>> + * @out_desc: The out returned descriptor.
>
> The "@out_desc:" line looks funky. Also, I believe changes to this file
> need to go in via acpica & probably conform to their commenting standards?
>
OK, followed.
......
>
> Perhaps: "Iterate over acpi_gbl_root_table_list to find SRAT then return
> its phys addr"
>
> Though I wonder if this comment is even needed, as the iteration is done
> in acpi_get_table_desc() (added above).
>
OK, will remove the comment.
Thanks.
On Wed, Jul 24, 2013 at 10:53:10AM +0800, Tang Chen wrote:
> >Let's please drop "with" and do we really need to print full 16
> >digits?
>
> Sure, will remove "with". But I think printing out the full flags is batter.
> The output seems more tidy.
I mean, padding is fine but you can just print out 4 or even 2 digits
and will be fine for the foreseeable future.
--
tejun
On Wed, Jul 24, 2013 at 06:12:03PM +0800, Tang Chen wrote:
> Do you mean get the SRAT's address without touching any ACPI global
> variables, such as acpi_gbl_root_table_list ?
>
> The physical addresses of all tables is stored in RSDT (Root System
> Description Table), which is the root table. We need to parse RSDT
> to get SRAT address.
>
> Using acpi_gbl_root_table_list is very convenient. The initialization
> of acpi_gbl_root_table_list is using acpi_os_map_memory(), so it can be
> done before init_mem_mapping() and relocate_initrd().
>
> With acpi_gbl_root_table_list initialized, we can iterate it and find
> SRAT easily. Otherwise, we have to do the same procedure to parse RSDT,
> and find SRAT, which I don't think could be any simpler. I think reuse
> the existing acpi_gbl_root_table_list code is better.
I see. As long as ACPI people are fine with the modifications, I
don't mind either way.
Thanks.
--
tejun
Hello, Tang.
On Wed, Jul 24, 2013 at 11:52:53AM +0800, Tang Chen wrote:
> The node data should be on local, I agree with that. I'm not saying I
> won't do it. Just for now, it will be complicated to fix memory hot-remove
> path. So I think pushing this patch for now, and do the local node things
> in the next step.
I see. As long as it's clearly noted in the patch description and as
comment && the behavior is off unless explicitly enabled, it should be
fine for now, I think. As currently implemented, the users of memory
hotplug would have to pay pretty heavy price in terms of memory
locality overhead in general and it could be that the ones missed here
might not make noticeable difference anyway.
Thanks.
--
tejun
On 07/24/2013 05:32 AM, Tejun Heo wrote:
> On Tue, Jul 23, 2013 at 04:55:57PM -0400, Tejun Heo wrote:
>> On Fri, Jul 19, 2013 at 03:59:27PM +0800, Tang Chen wrote:
>>> + /*
>>> + * In such an early time, we don't have nid. We specify pxm
>>> + * instead of MAX_NUMNODES to prevent memblock merging regions
>>> + * on different nodes. And later modify pxm to nid when nid is
>>> + * mapped so that we can arrange ZONE_MOVABLE on different
>>> + * nodes.
>>> + */
>>> + memblock_reserve_hotpluggable(base_address, length, pxm);
>>
>> This is rather hacky. Why not just introduce MEMBLOCK_NO_MERGE flag?
The original thinking is to merge regions with the same nid. So I used pxm.
And then refresh the nid field when nids are mapped.
I will try to introduce MEMBLOCK_NO_MERGE and make it less hacky.
>
> Also, if memblock is gonna know about hotplug memory, why not just let
> it control its allocation too instead of blocking it by reserving it
> from outside? These are all pretty general memory hotplug logic which
> doesn't have much to do with acpi and I think too much is implemented
> on the acpi side.
At the very beginning, a long time ago, we just did this.
Please refer to: https://lkml.org/lkml/2012/12/10/656
In order to let memblock control the allocation, we have to store the
hotpluggable ranges somewhere, and keep the allocated range out of the
hotpluggable regions. I just think reserving the hotpluggable regions
and then memblock won't allocate them. No need to do any other limitation.
And also, the acpi side modification in this patch-set is to get SRAT
and parse it. I think most of the logic in
acpi_reserve_hotpluggable_memory()
is necessary. I don't think letting memblock control the allocation will
make the acpi side easier.
Thanks.
On 07/24/2013 04:59 AM, Tejun Heo wrote:
......
>> +static bool __init kernel_resides_in_range(phys_addr_t base, u64 length)
>> +{
>> + int i;
>> + struct memblock_type *reserved =&memblock.reserved;
>> + struct memblock_region *region;
>> + phys_addr_t start, end;
>> +
>> + for (i = 0; i< reserved->cnt; i++) {
>> + region =&reserved->regions[i];
>> +
>> + if (region->flags != MEMBLK_FLAGS_DEFAULT)
>> + continue;
>> +
>> + start = region->base;
>> + end = region->base + region->size;
>> + if (end<= base || start>= base + length)
>> + continue;
>> +
>> + return true;
>> + }
>> +
>> + return false;
>> +}
>
> This being in acpi/osl.c is rather weird. Overall, the acpi and
> memblock parts don't seem very well split. It'd best if acpi just
> indicates which regions are hotpluggable and the rest is handled by
> x86 boot or memblock code as appropriate.
OK. Will move this function out from acpi side.
Thanks.
On 07/24/2013 05:00 AM, Tejun Heo wrote:
> On Fri, Jul 19, 2013 at 03:59:29PM +0800, Tang Chen wrote:
>> We reserved hotpluggable memory in memblock at early time. And when memory
>> initialization is done, we have to free it to buddy system.
>>
>> This patch free memory reserved by memblock with flag MEMBLK_HOTPLUGGABLE.
>
> Sequencing patches this way means machines will run with hotpluggable
> regions reserved inbetween. Please put the reserving and freeing into
> the same patch.
Sure, followed.
Thanks.
On 07/24/2013 05:11 AM, Tejun Heo wrote:
> On Tue, Jul 23, 2013 at 05:04:35PM -0400, Tejun Heo wrote:
>> On Fri, Jul 19, 2013 at 03:59:30PM +0800, Tang Chen wrote:
>> ...
>>> Users can specify "movablecore=acpi" in kernel commandline to enable this
>>> functionality. For those who don't use memory hotplug or who don't want
>>> to lose their NUMA performance, just don't specify anything. The kernel
>>> will work as before.
>>
>> The param name is pretty obscure and why would the user care where
>
> I mean, having movable zone is required for having any decent chance
> of memory hotplug and movable zone implies worse affinity for kernel
> data structures, so there's no point in distinguishing memory hotplug
> enable/disable and this, right?
>
Sorry, I don't quite get this.
By movable zone, do you mean movable node (which has ZONE_MOVABLE only) ?
movablecore boot option was used to specify the size of ZONE_MOVABLE. And
this patch-set aims to arrange ZONE_MOVABLE with SRAT info. So my original
thinking is to reuse movablecore.
Since you said above, I think we have two problems here:
1. Should not let users care about where the hotplug info comes from.
2. Should not distinguish movable node and memory hotplug, since for now,
to use memory hotplug is to use movable node.
So how about something like "movablenode", just like "quiet" boot option.
If users specify "movablenode", then memblock will reserve hotpluggable
memory, and create movable nodes if any. If users specify nothing, then
the kernel acts as before.
Thanks.
On 07/24/2013 05:21 AM, Tejun Heo wrote:
> On Fri, Jul 19, 2013 at 03:59:34PM +0800, Tang Chen wrote:
>> Since we modify movablecore boot option to support
>> "movablecore=acpi", this patch adds doc for it.
>
> Please fold this into the patch which makes the code chnage.
OK, followed.
Thanks.
On 07/24/2013 05:25 AM, Tejun Heo wrote:
> On Fri, Jul 19, 2013 at 03:59:31PM +0800, Tang Chen wrote:
>> Vasilis Liaskovitis found that before we parse SRAT and fulfill numa_meminfo,
>> the nids of all the regions in memblock.reserve[] are MAX_NUMNODES. That is
>> because nids have not been mapped at that time.
>>
>> When we arrange ZONE_MOVABLE in each node later, we need nid in memblock. So
>> after we parse SRAT and fulfill nume_meminfo, synchronize the nid info to
>> memblock.reserve[] immediately.
>
> Having a separate sync is rather nasty. Why not let
> memblock_set_node() and alloc functions set nid on the reserved
> regions?
Node id and pxm are 1-1 mapped. For the current kernel, before SRAT is
parsed,
we don't know nid. So all allocated regions are reserved by memblock with
nid = MAX_NUMNODES. So for early allocated memory, we cannot use
memblock_set_node()
and alloc functions to set the nid.
In this patch-set, we parse SRAT twice, the first one is right after
memblock is ready.
But we didn't setup nid <-> pxm mapping. So we still have this problem.
And as in [patch 14/21], when reserving hotpluggable memory, we use pxm.
So my
idea was to do a nid sync in numa_init(). After this, memblock will set
nid when
it allocates memory.
If we want to let memblock_set_node() and alloc functions set nid on the
reserved
regions, we should setup nid <-> pxm mapping when we parst SRAT for the
first time.
If you think this is OK, I can try it.
Thanks.
On 07/24/2013 11:54 PM, Tejun Heo wrote:
> On Wed, Jul 24, 2013 at 10:53:10AM +0800, Tang Chen wrote:
>>> Let's please drop "with" and do we really need to print full 16
>>> digits?
>>
>> Sure, will remove "with". But I think printing out the full flags is batter.
>> The output seems more tidy.
>
> I mean, padding is fine but you can just print out 4 or even 2 digits
> and will be fine for the foreseeable future.
OK. In this patch-set, there won't more than two flags. I think one
hexadecimal
number is enough. I'll print two digits for the foreseeable future.
Thanks.
On 07/24/2013 03:45 AM, Tejun Heo wrote:
> On Fri, Jul 19, 2013 at 03:59:21PM +0800, Tang Chen wrote:
>> @@ -514,6 +514,7 @@ acpi_tb_install_table(acpi_physical_address address,
>> * fully mapped later (in verify table). In any case, we must
>> * unmap the header that was mapped above.
>> */
>> + table_desc =&acpi_gbl_root_table_list.tables[table_index];
>> final_table = acpi_tb_table_override(table, table_desc);
>> if (!final_table) {
>> final_table = table; /* There was no override */
>
> Is this chunk correct? Why is @table_desc being assigned twice in
> this function?
Oh, my mistake. The second assignment is redundant. Only the first one is
useful. Will remove the redundant assignment in this patch.
Thanks.
>
>> + /*
>> + * Also initialize the table entries here, so that later we can use them
>> + * to find SRAT at very eraly time to reserve hotpluggable memory.
> ^ typo
> Thanks.
>
Hello, Tang.
On Thu, Jul 25, 2013 at 12:09:29PM +0800, Tang Chen wrote:
> And as in [patch 14/21], when reserving hotpluggable memory, we use
> pxm. So my
Which is kinda nasty.
> idea was to do a nid sync in numa_init(). After this, memblock will
> set nid when
> it allocates memory.
Sure, that's the only place we can set the numa node IDs but my point
is that you don't need to add another interface. Jet let
memblock_set_node() handle both memblock.memory and .reserved ranges.
That way, you can make memblock simpler to use and less error-prone.
> If we want to let memblock_set_node() and alloc functions set nid on
> the reserved
> regions, we should setup nid <-> pxm mapping when we parst SRAT for
> the first time.
I don't follow why it has to be different. Why do you need to do
anything differently? What am I missing here?
Thanks.
--
tejun
Hello, Tang.
On Thu, Jul 25, 2013 at 11:50:12AM +0800, Tang Chen wrote:
> movablecore boot option was used to specify the size of ZONE_MOVABLE. And
> this patch-set aims to arrange ZONE_MOVABLE with SRAT info. So my original
> thinking is to reuse movablecore.
>
> Since you said above, I think we have two problems here:
> 1. Should not let users care about where the hotplug info comes from.
> 2. Should not distinguish movable node and memory hotplug, since for now,
> to use memory hotplug is to use movable node.
>
> So how about something like "movablenode", just like "quiet" boot option.
> If users specify "movablenode", then memblock will reserve hotpluggable
> memory, and create movable nodes if any. If users specify nothing, then
> the kernel acts as before.
Maybe I'm confused but memory hotplug isn't likely to work without
this, right? If so, wouldn't it make more sense to have
"memory_hotplug" option rather than "movablecore=acpi" which in no way
indicates that it has something to do with memory hotplug?
Thanks.
--
tejun
Hello,
On Thu, Jul 25, 2013 at 10:13:21AM +0800, Tang Chen wrote:
> >>This is rather hacky. Why not just introduce MEMBLOCK_NO_MERGE flag?
>
> The original thinking is to merge regions with the same nid. So I used pxm.
> And then refresh the nid field when nids are mapped.
>
> I will try to introduce MEMBLOCK_NO_MERGE and make it less hacky.
I kinda don't follow why it's necessary to disallow merging BTW. Can
you plesae elaborate? Shouldn't it be enough to mark the regions
hotpluggable? Why does it matter whether they get merged or not? If
they belong to different nodes, they'll be separated during the
isolation phase while setting nids, which is the modus operandi of
memblock anyway.
> In order to let memblock control the allocation, we have to store the
> hotpluggable ranges somewhere, and keep the allocated range out of the
> hotpluggable regions. I just think reserving the hotpluggable regions
> and then memblock won't allocate them. No need to do any other limitation.
It isn't different from what you're doing right now. Just tell
memblock that the areas are hotpluggable and the default memblock
allocation functions stay away from the areas. That way you can later
add functions which may allocate from hotpluggable areas for
node-local data without resorting to tricks like unreserving part of
it and trying allocation or what not. As it currently stands, you're
scattering hotpluggable memory handling across memblock and acpi which
is kinda nasty. Please make acpi feed information into memblock and
make memblock handle hotpluggable regions appropriately.
> And also, the acpi side modification in this patch-set is to get SRAT
> and parse it. I think most of the logic in
> acpi_reserve_hotpluggable_memory()
> is necessary. I don't think letting memblock control the allocation will
> make the acpi side easier.
It's about proper layering. The code change involved in either case
aren't big but splitting it right would give us less headache when we
later try to support a different firmware or add more features, and
more importantly, it makes things logical and lowers the all important
WTH factor and makes things easier to follow.
Thanks.
--
tejun
On 07/25/2013 11:17 PM, Tejun Heo wrote:
> Hello,
>
> On Thu, Jul 25, 2013 at 10:13:21AM +0800, Tang Chen wrote:
>>>> This is rather hacky. Why not just introduce MEMBLOCK_NO_MERGE flag?
>>
>> The original thinking is to merge regions with the same nid. So I used pxm.
>> And then refresh the nid field when nids are mapped.
>>
>> I will try to introduce MEMBLOCK_NO_MERGE and make it less hacky.
>
> I kinda don't follow why it's necessary to disallow merging BTW. Can
> you plesae elaborate? Shouldn't it be enough to mark the regions
> hotpluggable? Why does it matter whether they get merged or not? If
> they belong to different nodes, they'll be separated during the
> isolation phase while setting nids, which is the modus operandi of
> memblock anyway.
Sorry, I didn't make it clear enough.
The reason why disallowing merging is that in [Patch 20/21], I wanted to
use the nid in each reserved region to set the start address of ZONE_MOVABLE
in each node. And this is only my idea. It is OK without doing this.
But as you said, the isolation phase in memblock_set_node() will split the
specified region and set the nid, I think it is OK to merge the regions
here.
I'll just let it merged here, and not store the pxm.
>
>> In order to let memblock control the allocation, we have to store the
>> hotpluggable ranges somewhere, and keep the allocated range out of the
>> hotpluggable regions. I just think reserving the hotpluggable regions
>> and then memblock won't allocate them. No need to do any other limitation.
>
> It isn't different from what you're doing right now. Just tell
> memblock that the areas are hotpluggable and the default memblock
> allocation functions stay away from the areas. That way you can later
> add functions which may allocate from hotpluggable areas for
> node-local data without resorting to tricks like unreserving part of
> it and trying allocation or what not.
I just don't want to any new variables to store the hotpluggable regions.
But without a new shared variable, it seems difficult to achieve the goal
you said below.
>As it currently stands, you're
> scattering hotpluggable memory handling across memblock and acpi which
> is kinda nasty. Please make acpi feed information into memblock and
> make memblock handle hotpluggable regions appropriately.
Now, when SRAT is found in acpi side, I reserve the region directly in
memblock.
I think this is the one you don't like.
So how about this.
1. Introduce a new global list used to store hotpluggable regions.
2. On acpi side, find and fulfill the list.
3. On memblock side, make the default allocation function stay away from
these regions.
>
>> And also, the acpi side modification in this patch-set is to get SRAT
>> and parse it. I think most of the logic in
>> acpi_reserve_hotpluggable_memory()
>> is necessary. I don't think letting memblock control the allocation will
>> make the acpi side easier.
>
> It's about proper layering. The code change involved in either case
> aren't big but splitting it right would give us less headache when we
> later try to support a different firmware or add more features, and
> more importantly, it makes things logical and lowers the all important
> WTH factor and makes things easier to follow.
OK, followed.
Thanks.
On 07/25/2013 11:09 PM, Tejun Heo wrote:
> Hello, Tang.
>
> On Thu, Jul 25, 2013 at 11:50:12AM +0800, Tang Chen wrote:
>> movablecore boot option was used to specify the size of ZONE_MOVABLE. And
>> this patch-set aims to arrange ZONE_MOVABLE with SRAT info. So my original
>> thinking is to reuse movablecore.
>>
>> Since you said above, I think we have two problems here:
>> 1. Should not let users care about where the hotplug info comes from.
>> 2. Should not distinguish movable node and memory hotplug, since for now,
>> to use memory hotplug is to use movable node.
>>
>> So how about something like "movablenode", just like "quiet" boot option.
>> If users specify "movablenode", then memblock will reserve hotpluggable
>> memory, and create movable nodes if any. If users specify nothing, then
>> the kernel acts as before.
>
> Maybe I'm confused but memory hotplug isn't likely to work without
> this, right?
I don't think so. On x86, I think you are right because we cannot hotplug
a single memory_block (128MB on x86), which is only a small part of a modern
memory device. And now x86 kernel doesn't support a single memory device
hotplug, and what we are trying to do is node hotplug. So on x86, memory
hotplug won't work without movable node.
But on other platform, memory hotplug may work without this.
>If so, wouldn't it make more sense to have
> "memory_hotplug" option rather than "movablecore=acpi" which in no way
> indicates that it has something to do with memory hotplug?
I'm not working on ppcm, but I heard that memory hotplug was introduced
firstly
on ppc, and a memory_block on ppc is only 16MB, which can be hotplugged. It
doesn't need movable node support.
Here, 16MB memory_block hotplug is not the physical device hotplug, I think.
Just logically remove it from one OS, and add it to another OS running
on one
ppc server. This is done by the hardware.
But on x86, we don't have this kind of functionality. A single memory_block
hotplug means nothing. Actually I think struct memory_block is useless
on x86.
But for other platforms, we have to keep this structure.
So for the same reason, I think we cannot just introduce a boot option like
"memory_hotplug" to enable/disable what we are doing in this patch-set.
Sorry I didn't clarify this earlier.
Thanks.
On 07/25/2013 11:05 PM, Tejun Heo wrote:
> Hello, Tang.
>
> On Thu, Jul 25, 2013 at 12:09:29PM +0800, Tang Chen wrote:
>> And as in [patch 14/21], when reserving hotpluggable memory, we use
>> pxm. So my
>
> Which is kinda nasty.
Yes, will remove it.
>
>> idea was to do a nid sync in numa_init(). After this, memblock will
>> set nid when
>> it allocates memory.
>
> Sure, that's the only place we can set the numa node IDs but my point
> is that you don't need to add another interface. Jet let
> memblock_set_node() handle both memblock.memory and .reserved ranges.
> That way, you can make memblock simpler to use and less error-prone.
Yes, I missed the isolation phase in memblock_set_node().
So followed.
>
>> If we want to let memblock_set_node() and alloc functions set nid on
>> the reserved
>> regions, we should setup nid<-> pxm mapping when we parst SRAT for
>> the first time.
>
> I don't follow why it has to be different. Why do you need to do
> anything differently? What am I missing here?
No, it was me who missed the isolation phase in memblock_set_node().
Thanks.
On Fri, Jul 26, 2013 at 11:45:36AM +0800, Tang Chen wrote:
> I just don't want to any new variables to store the hotpluggable regions.
> But without a new shared variable, it seems difficult to achieve the goal
> you said below.
Why can't it be done with the .flags field that was added anyway?
> So how about this.
> 1. Introduce a new global list used to store hotpluggable regions.
> 2. On acpi side, find and fulfill the list.
> 3. On memblock side, make the default allocation function stay away from
> these regions.
I was thinking more along the line of
1. Mark hotpluggable regions with a flag in memblock.
2. On ACPI side, find and mark hotpluggable regions.
3. Make memblock avoid giving out hotpluggable regions for normal
allocations.
Thanks.
--
tejun
On Fri, Jul 26, 2013 at 06:26:09AM -0400, Tejun Heo wrote:
> > So how about this.
> > 1. Introduce a new global list used to store hotpluggable regions.
> > 2. On acpi side, find and fulfill the list.
> > 3. On memblock side, make the default allocation function stay away from
> > these regions.
>
> I was thinking more along the line of
>
> 1. Mark hotpluggable regions with a flag in memblock.
> 2. On ACPI side, find and mark hotpluggable regions.
> 3. Make memblock avoid giving out hotpluggable regions for normal
> allocations.
But adding new regions array is more convenient / cleaner, that's fine
too. Those arrays are dynamically sized anyway.
Thanks.
--
tejun
On 07/26/2013 06:26 PM, Tejun Heo wrote:
> On Fri, Jul 26, 2013 at 11:45:36AM +0800, Tang Chen wrote:
>> I just don't want to any new variables to store the hotpluggable regions.
>> But without a new shared variable, it seems difficult to achieve the goal
>> you said below.
>
> Why can't it be done with the .flags field that was added anyway?
I'm sorry but I'm a little misunderstanding here. There are some more
things I
want to confirm, thanks for your patient. :)
By "the goal" above, I mean making ACPI and memblock parts more
independent from
each other. I think in this patch-set, I called memblock_reserve() which
made
these two parts interactive.
So the point is, how to mark the hotpluggable regions and at the same
time, make
ACPI and memblock parts independent, right ?
But, please see below.
>
>> So how about this.
>> 1. Introduce a new global list used to store hotpluggable regions.
>> 2. On acpi side, find and fulfill the list.
>> 3. On memblock side, make the default allocation function stay away from
>> these regions.
>
> I was thinking more along the line of
>
> 1. Mark hotpluggable regions with a flag in memblock.
> 2. On ACPI side, find and mark hotpluggable regions.
But marking hotpluggable regions on ACPI side will also make ACPI and
memblock
parts more interactive. In this patch-set, I just called memblock_reserve()
directly on ACPI side.
I think marking hotpluggable regions on ACPI side is much the same as
reserving
the regions. I will just call something like memblock_mark_flags() to
mark the
regions. The only difference will be a different memblock_xxx() function
call,
right ?
In the last mail, I suggested a global array. So both sides will just
use the
array, and it seems to be independent. But I think the global array and
the flags
in memblock are redundant. They are for the same goal.
Actually I want to use flags. I think it is also useful when we try to
put thins
on local node, such as node_data.
So, is it OK to mark the hotpluggable regions on ACPI side ?
> 3. Make memblock avoid giving out hotpluggable regions for normal
> allocations.
This step3 is different from this patch-set. I reserved hotpluggable
regions in
memblock.reserved.
So are you saying mark the hotpluggable regions in memblock.memory, but not
reserve them in memblock.reserved, and make the default allocate
function avoid
the hotpluggable regions in memblock.memory ?
This way will be convenient when we put the node_data on local node
(don't need
to free regions from memblock.reserved, as you mentioned before), right?
Thanks.
Hello, Tang.
On Mon, Jul 29, 2013 at 10:12:40AM +0800, Tang Chen wrote:
> So the point is, how to mark the hotpluggable regions and at the
> same time, make
> ACPI and memblock parts independent, right ?
No, not at all. My point is that the roles need to be divided
clearly. The firmware (be that ACPI or whatever) knows memory areas
are hotpluggable but it shouldn't be making policy decisions like not
dispending hotpluggable memory through memblock allocator because that
part of logic has *nothing* to do with ACPI. That is the generic
kernel memory management policy which will apply regardless of what
type of firmware the machine happens to be running on top of.
So, please make ACPI inform memblock of the hotpluggable regions and
implement the allocation policies inside memblock proper.
> So are you saying mark the hotpluggable regions in memblock.memory, but not
> reserve them in memblock.reserved, and make the default allocate
> function avoid
> the hotpluggable regions in memblock.memory ?
>
> This way will be convenient when we put the node_data on local node
> (don't need
> to free regions from memblock.reserved, as you mentioned before), right?
I don't care too much about the specifics and it's likely that you'll
find out which way (flag in memblock.memory, separate region array or
whatever) is better as implementation progresses, but let's please put
things where they belong; otherwise, we end up with weird mess, and,
later on, have to do things like freeing part of reserved hotpluggable
memory for node data from firmware side as you said above, which
basically moves part of memory allocation logic into ACPI, which is
just horrible.
Thanks.
--
tejun