This patch-set aims to solve some problems at system boot time
to enhance memory hotplug functionality.
[Background]
The Linux kernel cannot migrate pages used by the kernel itself because
of the kernel direct mapping. Since va = pa + PAGE_OFFSET, if the
physical address changes, we cannot simply update the kernel
pagetable. Instead, we would have to update every pointer that holds
the old virtual address, which is very difficult to do.
So in order to support memory hotplug, we should prevent the kernel
from using hotpluggable memory.
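The direct-mapping constraint can be sketched in plain user-space C. This
is only an illustration: DEMO_PAGE_OFFSET and the demo_* helpers are
made-up names, and the real PAGE_OFFSET depends on the architecture and
configuration.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical offset, chosen only for illustration. */
#define DEMO_PAGE_OFFSET 0xffff880000000000ULL

/* Direct mapping: the virtual address is a fixed offset from the physical one. */
static uint64_t demo_phys_to_virt(uint64_t pa)
{
    return pa + DEMO_PAGE_OFFSET;
}

/* The inverse is just as fixed, so a page has exactly one direct-map address. */
static uint64_t demo_virt_to_phys(uint64_t va)
{
    assert(va >= DEMO_PAGE_OFFSET);
    return va - DEMO_PAGE_OFFSET;
}
```

Because the mapping is a pure function of the physical address, moving
data to a different physical page necessarily changes the address that
every existing pointer would have to hold.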
In ACPI, there is a table named SRAT(System Resource Affinity Table).
It contains system NUMA info (CPUs, memory ranges, PXM), and also a
flag field indicating which memory ranges are hotpluggable.
[Problem to be solved]
Very early during boot, we use a boot-time allocator, memblock, to
allocate memory for the kernel.
memblock starts to work before the kernel parses SRAT, which
means memblock does not know which memory is hotpluggable until SRAT
is parsed.
So at this time, memblock could allocate hotpluggable memory for
the kernel to use permanently. For example, the kernel may allocate
pagetables in hotpluggable memory, which cannot be freed once the
system is up.
So we have to prevent memblock from allocating hotpluggable memory
for the kernel at early boot time.
[Earlier solutions]
We have tried to parse SRAT earlier, before memblock is ready. To
do this, we also have to do ACPI_INITRD_TABLE_OVERRIDE earlier.
Otherwise the override tables won't take effect.
This is not easy to do because memblock is ready before the direct
mapping is set up. So Yinghai split the ACPI_INITRD_TABLE_OVERRIDE
procedure into two steps: find and copy. Please refer to the
following patch-set:
https://lkml.org/lkml/2013/6/13/587
On this solution, tj gave a lot of comments and the following
suggestions.
[Suggestion from tj]
tj mainly gave the following suggestions:
1. Necessary reordering is OK, but we should not rely on
reordering to achieve the goal because it makes the kernel
too fragile.
2. Memory allocated to the kernel for temporary use is OK because
it will be freed when the system is up. Relocating
permanently allocated hotpluggable memory will make the
kernel more robust.
3. Need to enhance memblock to discover and complain if any
hotpluggable memory is allocated to kernel.
After long consideration, we chose not to do the relocation for
the following reasons:
1. It's easy to find the allocated hotpluggable memory. But
memblock merges adjacent ranges owned by different users
and used for different purposes, so it's hard to find the owners.
2. Different kinds of memory are relocated in different ways. I think
one function for each kind of memory would make the code too messy.
3. Pagetables could be in hotpluggable memory. Relocating pagetables
is too difficult and risky. We would have to update all PUD and PMD
pages. Also, the ACPI_INITRD_TABLE_OVERRIDE and SRAT parsing procedures
run not long after the pagetable is initialized. If we relocate the
pagetable shortly after it was initialized, the code will be
very ugly.
[Solution in this patch-set]
In this patch-set, we still do the reordering, but in a new way.
1. Improve memblock with flags, so that it is able to differentiate
memory regions for different usage. Also add a MEMBLOCK_HOTPLUG
flag to mark hotpluggable memory.
2. When memblock is ready (memblock_x86_fill() is called), initialize
acpi_gbl_root_table_list and fill in all the ACPI tables' phys addrs.
Now, we have all the ACPI tables' phys addrs provided by firmware.
3. Check if there is a SRAT in initrd file used to override the one
provided by firmware. If so, get its phys addr.
4. If no override SRAT in initrd, get the phys addr of the SRAT
provided by firmware.
Now, we have the phys addr of the SRAT to be used, either the one in
initrd or the one in firmware.
5. Parse only the memory affinities in SRAT, find out all the
hotpluggable memory regions and mark them in memblock.memory with
MEMBLOCK_HOTPLUG flag.
6. The kernel goes through the current path. Any other related parts,
such as the ACPI_INITRD_TABLE_OVERRIDE path, the current ACPI table
parsing paths, the global variable numa_meminfo, and so on, are not
modified. They work as before.
7. Make the default memblock allocator skip hotpluggable memory.
8. Introduce movablenode boot option to allow users to enable
and disable this functionality.
In summary, in order to get hotpluggable memory info as early as possible,
this patch-set parses only the memory affinities in SRAT one more time right
after memblock is ready, and leaves all the other paths untouched. With
the hotpluggable memory info, we can arrange hotpluggable memory in
ZONE_MOVABLE to prevent the kernel from using it.
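The core idea of steps 5 and 7 can be sketched as a toy allocator that
skips flagged regions. This is a simplified user-space model, not the
memblock implementation: struct demo_region, DEMO_HOTPLUG and demo_alloc()
are hypothetical names, and the first-fit bump allocation is an assumption
made to keep the example short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_HOTPLUG 0x1UL  /* stand-in for MEMBLOCK_HOTPLUG */

struct demo_region {
    uint64_t base;
    uint64_t size;
    uint64_t used;          /* bytes already handed out from this region */
    unsigned long flags;
};

/*
 * Hand out @size bytes from the first region that is not marked
 * hotpluggable and still has room. Returns 0 on failure, which is
 * unambiguous here only because no demo region starts at address 0.
 */
static uint64_t demo_alloc(struct demo_region *regions, size_t cnt,
                           uint64_t size)
{
    size_t i;

    assert(regions != NULL);
    for (i = 0; i < cnt; i++) {
        struct demo_region *r = &regions[i];

        if (r->flags & DEMO_HOTPLUG)
            continue;       /* never place kernel data here */
        if (r->size - r->used < size)
            continue;       /* not enough room left */
        r->used += size;
        return r->base + r->used - size;
    }
    return 0;
}
```

With one hotpluggable and one ordinary region, every allocation lands in
the ordinary region, and the allocator fails rather than falling back to
the hotpluggable one.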
change log v2 RESEND -> v3:
1. As Rafael and Lv Zheng suggested, split acpi global root table list
initialization procedure into two steps: install and override. And
do the "install" step earlier.
2. Fix some little problems found by Toshi.
change log v2 -> v2 RESEND:
According to Toshi's advice:
1. Rename acpi_invalid_table() to acpi_verify_table().
2. Rename acpi_root_table_init() to early_acpi_boot_table_init().
3. Rename INVALID_TABLE() to ACPI_INVALID_TABLE().
4. Check if ramdisk is present in early_acpi_override_srat().
5. Check if ACPI is disabled in acpi_boot_table_init().
6. Rebased to Linux 3.11-rc3.
change log v1 -> v2:
1. According to Tejun's advice, make ACPI side report which memory regions
are hotpluggable, and memblock side handle the memory allocation.
2. Change "movablecore=acpi" boot option to "movablenode" boot option.
Thanks.
Tang Chen (24):
acpi: Print Hot-Pluggable Field in SRAT.
earlycpio.c: Fix the confusing comment of find_cpio_data().
acpi: Remove "continue" in macro INVALID_TABLE().
acpi: Introduce acpi_verify_initrd() to check if a table is invalid.
acpi, acpica: Split acpi_tb_install_table() into two parts.
acpi, acpica: Call two new functions instead of
acpi_tb_install_table() in acpi_tb_parse_root_table().
acpi, acpica: Split acpi_tb_parse_root_table() into two parts.
acpi, acpica: Call two new functions instead of
acpi_tb_parse_root_table() in acpi_initialize_tables().
acpi, acpica: Split acpi_initialize_tables() into two parts.
x86, acpi: Call two new functions instead of acpi_initialize_tables()
in acpi_table_init().
x86, acpi: Split acpi_table_init() into two parts.
x86, acpi: Rename check_multiple_madt() and make it global.
x86, acpi: Split acpi_boot_table_init() into two parts.
x86, acpi: Initialize acpi global root table list earlier.
x86: Make get_ramdisk_{image|size}() global.
x86, acpica, acpi: Try to find if SRAT is overrided earlier.
x86, acpica, acpi: Try to find SRAT in firmware earlier.
x86, acpi, numa, mem_hotplug: Find hotpluggable memory in SRAT memory
affinities.
x86, numa, mem_hotplug: Skip all the regions the kernel resides in.
memblock, numa: Introduce flag into memblock.
memblock, mem_hotplug: Introduce MEMBLOCK_HOTPLUG flag to mark
hotpluggable regions.
memblock, mem_hotplug: Make memblock skip hotpluggable regions by
default.
mem-hotplug: Introduce movablenode boot option to {en|dis}able using
SRAT.
x86, numa, acpi, memory-hotplug: Make movablenode have higher
priority.
Yasuaki Ishimatsu (1):
x86: get pg_data_t's memory from other node
Documentation/kernel-parameters.txt | 15 ++
arch/x86/include/asm/setup.h | 21 +++
arch/x86/kernel/acpi/boot.c | 32 +++--
arch/x86/kernel/setup.c | 42 +++---
arch/x86/mm/numa.c | 5 +-
arch/x86/mm/srat.c | 11 +-
drivers/acpi/acpica/actables.h | 2 +
drivers/acpi/acpica/tbutils.c | 184 +++++++++++++++++++++++---
drivers/acpi/acpica/tbxface.c | 101 +++++++++++++-
drivers/acpi/osl.c | 252 ++++++++++++++++++++++++++++++++---
drivers/acpi/tables.c | 29 +++-
include/acpi/acpixf.h | 8 +
include/linux/acpi.h | 24 +++-
include/linux/memblock.h | 13 ++
include/linux/memory_hotplug.h | 5 +
lib/earlycpio.c | 27 ++--
mm/memblock.c | 92 +++++++++++--
mm/memory_hotplug.c | 104 ++++++++++++++-
mm/page_alloc.c | 31 ++++-
19 files changed, 873 insertions(+), 125 deletions(-)
The Hot-Pluggable field in SRAT indicates whether the memory can be
hotplugged while the system is running. Printing it as well when
parsing SRAT helps users to know which memory is hotpluggable.
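The resulting log line can be previewed with a small user-space sketch.
demo_format_srat_line() is a hypothetical helper, and it uses the portable
%llx conversion instead of the %Lx spelling used with pr_info() in the
kernel.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Format one SRAT memory line the way the patched message is laid out. */
static int demo_format_srat_line(char *buf, size_t len, unsigned int node,
                                 unsigned int pxm, uint64_t start,
                                 uint64_t end, bool hotpluggable)
{
    assert(buf != NULL && end > start);
    return snprintf(buf, len,
                    "SRAT: Node %u PXM %u [mem %#010llx-%#010llx]%s",
                    node, pxm,
                    (unsigned long long)start,
                    (unsigned long long)(end - 1),
                    hotpluggable ? " Hot Pluggable" : "");
}
```

The suffix is appended only for hotpluggable ranges, so existing lines for
ordinary memory keep their old form.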
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
Acked-by: Tejun Heo <[email protected]>
---
arch/x86/mm/srat.c | 11 +++++++----
1 files changed, 7 insertions(+), 4 deletions(-)
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index cdd0da9..d44c8a4 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -146,6 +146,7 @@ int __init
acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
{
u64 start, end;
+ u32 hotpluggable;
int node, pxm;
if (srat_disabled())
@@ -154,7 +155,8 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
goto out_err_bad_srat;
if ((ma->flags & ACPI_SRAT_MEM_ENABLED) == 0)
goto out_err;
- if ((ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE) && !save_add_info())
+ hotpluggable = ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE;
+ if (hotpluggable && !save_add_info())
goto out_err;
start = ma->base_address;
@@ -174,9 +176,10 @@ acpi_numa_memory_affinity_init(struct acpi_srat_mem_affinity *ma)
node_set(node, numa_nodes_parsed);
- printk(KERN_INFO "SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]\n",
- node, pxm,
- (unsigned long long) start, (unsigned long long) end - 1);
+ pr_info("SRAT: Node %u PXM %u [mem %#010Lx-%#010Lx]%s\n",
+ node, pxm,
+ (unsigned long long) start, (unsigned long long) end - 1,
+ hotpluggable ? " Hot Pluggable" : "");
return 0;
out_err_bad_srat:
--
1.7.1
Arranging hotpluggable memory as ZONE_MOVABLE will degrade NUMA performance
because the kernel cannot use movable memory. Users who don't use memory
hotplug and who don't want to lose their NUMA performance need a way to
disable this functionality. So we introduced the movablenode boot option.
If users specify the original movablecore=nn@ss boot option, the kernel will
arrange [ss, ss+nn) as ZONE_MOVABLE. The kernelcore=nn@ss boot option is similar
except it specifies ZONE_NORMAL ranges.
Now, if users specify "movablenode" on the kernel command line, the kernel will
arrange the hotpluggable memory in SRAT as ZONE_MOVABLE. If users do this, all
the other movablecore=nn@ss and kernelcore=nn@ss options are ignored.
For those who don't want this, just specify nothing. The kernel will act as
before.
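The per-node computation done for movablenode can be sketched as follows.
This is a user-space approximation under stated assumptions: all demo_*
and DEMO_* names are hypothetical, the page size is assumed to be 4 KiB,
and DEMO_ALIGN_PAGES merely stands in for MAX_ORDER_NR_PAGES.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SHIFT  12
#define DEMO_MAX_NODES   4
#define DEMO_ALIGN_PAGES 1024ULL    /* stand-in for MAX_ORDER_NR_PAGES */
#define DEMO_HOTPLUG     0x1UL      /* stand-in for MEMBLOCK_HOTPLUG */

struct demo_region {
    uint64_t base;
    uint64_t size;
    unsigned long flags;
    int nid;
};

/*
 * For each node, ZONE_MOVABLE starts at the lowest pfn of any
 * hotpluggable region on that node, rounded up to the alignment.
 * A value of 0 means the node has no ZONE_MOVABLE.
 */
static void demo_find_zone_movable(const struct demo_region *regions,
                                   size_t cnt,
                                   uint64_t zone_movable_pfn[DEMO_MAX_NODES])
{
    size_t i;
    int nid;

    for (nid = 0; nid < DEMO_MAX_NODES; nid++)
        zone_movable_pfn[nid] = 0;

    for (i = 0; i < cnt; i++) {
        uint64_t start_pfn;

        if (!(regions[i].flags & DEMO_HOTPLUG))
            continue;

        nid = regions[i].nid;
        assert(nid >= 0 && nid < DEMO_MAX_NODES);

        start_pfn = regions[i].base >> DEMO_PAGE_SHIFT;
        if (zone_movable_pfn[nid] == 0 || start_pfn < zone_movable_pfn[nid])
            zone_movable_pfn[nid] = start_pfn;
    }

    for (nid = 0; nid < DEMO_MAX_NODES; nid++)
        zone_movable_pfn[nid] = (zone_movable_pfn[nid] + DEMO_ALIGN_PAGES - 1)
                                / DEMO_ALIGN_PAGES * DEMO_ALIGN_PAGES;
}
```

Nodes without any hotpluggable region keep a start pfn of 0, which
matches the "act as before" behaviour for them.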
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 5 +++++
mm/page_alloc.c | 31 ++++++++++++++++++++++++++++---
3 files changed, 34 insertions(+), 3 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index c0bd31c..e78e32f 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -64,6 +64,7 @@ int memblock_reserve(phys_addr_t base, phys_addr_t size);
void memblock_trim_memory(phys_addr_t align);
int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
+bool memblock_is_hotpluggable(struct memblock_region *region);
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
diff --git a/mm/memblock.c b/mm/memblock.c
index 3ea4301..c8eb5d2 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -610,6 +610,11 @@ int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
return 0;
}
+bool __init_memblock memblock_is_hotpluggable(struct memblock_region *region)
+{
+ return region->flags & MEMBLOCK_HOTPLUG;
+}
+
/**
* __next_free_mem_range - next function for for_each_free_mem_range()
* @idx: pointer to u64 loop variable
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b100255..86d4381 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4948,9 +4948,35 @@ static void __init find_zone_movable_pfns_for_nodes(void)
nodemask_t saved_node_state = node_states[N_MEMORY];
unsigned long totalpages = early_calculate_totalpages();
int usable_nodes = nodes_weight(node_states[N_MEMORY]);
+ struct memblock_type *type = &memblock.memory;
+ /* Need to find movable_zone earlier when movablenode is specified. */
+ find_usable_zone_for_movable();
+
+#ifdef CONFIG_MOVABLE_NODE
/*
- * If movablecore was specified, calculate what size of
+ * If movablenode is specified, ignore kernelcore and movablecore
+ * options.
+ */
+ if (movablenode_enable_srat) {
+ for (i = 0; i < type->cnt; i++) {
+ if (!memblock_is_hotpluggable(&type->regions[i]))
+ continue;
+
+ nid = type->regions[i].nid;
+
+ usable_startpfn = PFN_DOWN(type->regions[i].base);
+ zone_movable_pfn[nid] = zone_movable_pfn[nid] ?
+ min(usable_startpfn, zone_movable_pfn[nid]) :
+ usable_startpfn;
+ }
+
+ goto out;
+ }
+#endif
+
+ /*
+ * If movablecore=nn[KMG] was specified, calculate what size of
* kernelcore that corresponds so that memory usable for
* any allocation type is evenly spread. If both kernelcore
* and movablecore are specified, then the value of kernelcore
@@ -4976,7 +5002,6 @@ static void __init find_zone_movable_pfns_for_nodes(void)
goto out;
/* usable_startpfn is the lowest possible pfn ZONE_MOVABLE can be at */
- find_usable_zone_for_movable();
usable_startpfn = arch_zone_lowest_possible_pfn[movable_zone];
restart:
@@ -5067,12 +5092,12 @@ restart:
if (usable_nodes && required_kernelcore > usable_nodes)
goto restart;
+out:
/* Align start of ZONE_MOVABLE on all nids to MAX_ORDER_NR_PAGES */
for (nid = 0; nid < MAX_NUMNODES; nid++)
zone_movable_pfn[nid] =
roundup(zone_movable_pfn[nid], MAX_ORDER_NR_PAGES);
-out:
/* restore the node_state */
node_states[N_MEMORY] = saved_node_state;
}
--
1.7.1
The Hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel,
it cannot be hot-removed. So memory hotplug users may want to put all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.
Memory hotplug users may also set a node as a movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.
But the kernel cannot use memory in ZONE_MOVABLE. As a result, the
kernel cannot use memory in movable nodes. This will degrade NUMA
performance, which other users may not accept.
So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce the movablenode boot option to allow users to
choose whether to reserve hotpluggable memory and set it as ZONE_MOVABLE.
Users can specify "movablenode" on the kernel command line to enable this
functionality. Those who don't use memory hotplug or who don't want
to lose their NUMA performance should simply not specify it. The kernel
will work as before.
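The opt-in nature of the option can be illustrated with a user-space
sketch of matching a bare flag on a command line. demo_cmdline_has() is a
hypothetical helper and does not reproduce the kernel's early_param()
parsing; it only assumes that options are separated by spaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Return true if @opt appears as a whole space-separated word in @cmdline. */
static bool demo_cmdline_has(const char *cmdline, const char *opt)
{
    size_t optlen = strlen(opt);
    const char *p = cmdline;

    assert(optlen > 0);
    while (*p) {
        size_t len;

        while (*p == ' ')
            p++;                    /* skip separators */
        len = strcspn(p, " ");      /* length of the current word */
        if (len == optlen && strncmp(p, opt, len) == 0)
            return true;
        p += len;
    }
    return false;
}
```

The flag defaults to off: unless the word is present, the result is false
and nothing changes for existing users.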
Suggested-by: Kamezawa Hiroyuki <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
Documentation/kernel-parameters.txt | 15 +++++++++++++++
arch/x86/kernel/setup.c | 10 ++++++++--
include/linux/memory_hotplug.h | 3 +++
mm/memory_hotplug.c | 11 +++++++++++
4 files changed, 37 insertions(+), 2 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index 15356ac..7349d1f 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -1718,6 +1718,21 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
that the amount of memory usable for all allocations
is not too small.
+ movablenode [KNL,X86] This parameter makes the kernel
+ arrange the hotpluggable memory ranges recorded
+ in ACPI SRAT(System Resource Affinity Table) as
+ ZONE_MOVABLE, so that this memory can be hot-removed
+ while the system is up.
+ By specifying this option, all the hotpluggable memory
+ will be in ZONE_MOVABLE, which the kernel cannot use.
+ This will degrade NUMA performance. Users who
+ care about NUMA performance should not use it.
+ If all the memory ranges in the system are hotpluggable,
+ then the ones used by the kernel at early time, such as
+ kernel code and data segments, initrd file and so on,
+ won't be set as ZONE_MOVABLE, and won't be hotpluggable.
+ Otherwise the kernel won't have enough memory to boot.
+
MTD_Partition= [MTD]
Format: <name>,<region-number>,<size>,<offset>
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 36d7fe8..abdfed7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1061,14 +1061,20 @@ void __init setup_arch(char **cmdline_p)
*/
early_acpi_boot_table_init();
-#ifdef CONFIG_ACPI_NUMA
+#if defined(CONFIG_ACPI_NUMA) && defined(CONFIG_MOVABLE_NODE)
/*
* Linux kernel cannot migrate kernel pages, as a result, memory used
* by the kernel cannot be hot-removed. Find and mark hotpluggable
* memory in memblock to prevent memblock from allocating hotpluggable
* memory for the kernel.
+ *
+ * If all the memory in a node is hotpluggable, then the kernel won't
+ * be able to use memory on that node. This degrades NUMA performance.
+ * So by default, we don't reserve any hotpluggable memory. Users
+ * may use "movablenode" boot option to enable this functionality.
*/
- find_hotpluggable_memory();
+ if (movablenode_enable_srat)
+ find_hotpluggable_memory();
#endif
/*
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 463efa9..43eb373 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -33,6 +33,9 @@ enum {
ONLINE_MOVABLE,
};
+/* Enable/disable SRAT in movablenode boot option */
+extern bool movablenode_enable_srat;
+
/*
* pgdat resizing functions
*/
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 0a69ceb..64e9f7e 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -93,6 +93,17 @@ static void release_memory_resource(struct resource *res)
}
#ifdef CONFIG_ACPI_NUMA
+#ifdef CONFIG_MOVABLE_NODE
+bool __initdata movablenode_enable_srat;
+
+static int __init cmdline_parse_movablenode(char *p)
+{
+ movablenode_enable_srat = true;
+ return 0;
+}
+early_param("movablenode", cmdline_parse_movablenode);
+#endif /* CONFIG_MOVABLE_NODE */
+
/**
* kernel_resides_in_range - Check if kernel resides in a memory region.
* @base: The base address of the memory region.
--
1.7.1
In find_hotpluggable_memory(), once we find a memory region which is
hotpluggable, we want to mark it in memblock.memory, so that we can later
keep the memblock allocator from allocating hotpluggable memory for the
kernel.
To achieve this goal, we introduce the MEMBLOCK_HOTPLUG flag to indicate
hotpluggable memory regions in memblock, and a function memblock_mark_hotplug()
to mark a hotpluggable memory region when we find one.
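The isolate-then-mark idea can be sketched in user-space C. This is a
simplified model with hypothetical names (struct demo_type, demo_split(),
demo_mark_hotplug()); unlike the patch it ORs the flag in and does not
merge neighbouring regions afterwards.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_MAX_REGIONS 16
#define DEMO_HOTPLUG     0x1UL  /* stand-in for MEMBLOCK_HOTPLUG */

struct demo_region {
    uint64_t base;
    uint64_t size;
    unsigned long flags;
};

struct demo_type {
    size_t cnt;
    struct demo_region regions[DEMO_MAX_REGIONS];  /* sorted, non-overlapping */
};

/* Split regions[i] in two so that the second half starts at @addr. */
static int demo_split(struct demo_type *t, size_t i, uint64_t addr)
{
    struct demo_region *r = &t->regions[i];

    if (t->cnt == DEMO_MAX_REGIONS)
        return -1;
    memmove(r + 1, r, (t->cnt - i) * sizeof(*r));   /* r[1] is now a copy */
    r[1].base = addr;
    r[1].size = r[0].base + r[0].size - addr;
    r[0].size = addr - r[0].base;
    t->cnt++;
    return 0;
}

/* Mark [@base, @base + @size) as hotpluggable, splitting at the edges. */
static int demo_mark_hotplug(struct demo_type *t, uint64_t base, uint64_t size)
{
    uint64_t end = base + size;
    size_t i;

    assert(t != NULL);
    for (i = 0; i < t->cnt; i++) {
        uint64_t rbase = t->regions[i].base;
        uint64_t rend = rbase + t->regions[i].size;

        if (rend <= base || rbase >= end)
            continue;               /* no overlap with the range */
        if (rbase < base) {
            /* Cut off the head; the tail is handled next iteration. */
            if (demo_split(t, i, base))
                return -1;
            continue;
        }
        if (rend > end && demo_split(t, i, end))
            return -1;              /* could not cut off the tail */
        t->regions[i].flags |= DEMO_HOTPLUG;
    }
    return 0;
}
```

Marking the middle of one large region leaves three regions, and only the
middle one carries the flag.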
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
include/linux/memblock.h | 11 +++++++++++
mm/memblock.c | 26 ++++++++++++++++++++++++++
mm/memory_hotplug.c | 3 ++-
3 files changed, 39 insertions(+), 1 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index e89e0cd..c0bd31c 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -19,6 +19,9 @@
#define INIT_MEMBLOCK_REGIONS 128
+/* Definition of memblock flags. */
+#define MEMBLOCK_HOTPLUG 0x1 /* hotpluggable region */
+
struct memblock_region {
phys_addr_t base;
phys_addr_t size;
@@ -60,6 +63,8 @@ int memblock_free(phys_addr_t base, phys_addr_t size);
int memblock_reserve(phys_addr_t base, phys_addr_t size);
void memblock_trim_memory(phys_addr_t align);
+int memblock_mark_hotplug(phys_addr_t base, phys_addr_t size);
+
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
void __next_mem_pfn_range(int *idx, int nid, unsigned long *out_start_pfn,
unsigned long *out_end_pfn, int *out_nid);
@@ -119,6 +124,12 @@ void __next_free_mem_range_rev(u64 *idx, int nid, phys_addr_t *out_start,
i != (u64)ULLONG_MAX; \
__next_free_mem_range_rev(&i, nid, p_start, p_end, p_nid))
+static inline void memblock_set_region_flags(struct memblock_region *r,
+ unsigned long flags)
+{
+ r->flags = flags;
+}
+
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int memblock_set_node(phys_addr_t base, phys_addr_t size, int nid);
diff --git a/mm/memblock.c b/mm/memblock.c
index 0841a6e..ecd8568 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -585,6 +585,32 @@ int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
}
/**
+ * memblock_mark_hotplug - Mark hotpluggable memory with flag MEMBLOCK_HOTPLUG.
+ * @base: the base phys addr of the region
+ * @size: the size of the region
+ *
+ * This function isolates region [@base, @base + @size), and marks it with flag
+ * MEMBLOCK_HOTPLUG.
+ *
+ * Return 0 on success, -errno on failure.
+ */
+int __init_memblock memblock_mark_hotplug(phys_addr_t base, phys_addr_t size)
+{
+ struct memblock_type *type = &memblock.memory;
+ int i, ret, start_rgn, end_rgn;
+
+ ret = memblock_isolate_range(type, base, size, &start_rgn, &end_rgn);
+ if (ret)
+ return ret;
+
+ for (i = start_rgn; i < end_rgn; i++)
+ memblock_set_region_flags(&type->regions[i], MEMBLOCK_HOTPLUG);
+
+ memblock_merge_regions(type);
+ return 0;
+}
+
+/**
* __next_free_mem_range - next function for for_each_free_mem_range()
* @idx: pointer to u64 loop variable
* @nid: node selector, %MAX_NUMNODES for all nodes
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 3d760fc..0a69ceb 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -174,7 +174,8 @@ void __init find_hotpluggable_memory(void)
if (kernel_resides_in_region(base, size))
continue;
- /* Will mark hotpluggable memory regions here */
+ /* Mark hotpluggable memory regions in memblock.memory */
+ memblock_mark_hotplug(base, size);
}
early_iounmap(srat_vaddr, length);
--
1.7.1
From: Yasuaki Ishimatsu <[email protected]>
If the system can create a movable node, in which all of the node's memory is
allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for the
node's pg_data_t on that node. So use memblock_alloc_try_nid() instead of
memblock_alloc_nid(), so that the allocation falls back to any node when the
node-local attempt fails. Otherwise, the system could fail to boot.
The node_data could be placed on a hotpluggable node, and so could the
pagetable and vmemmap. But for now, doing so would break the memory
hot-remove path.
A node could have several memory devices, and the device that holds the node
data must be hot-removed last. But at the NUMA level, we don't
know which memory_block (/sys/devices/system/node/nodeX/memoryXXX) belongs
to which memory device. We only know the node, so we can only do node hotplug.
But in virtualization, developers are now working on memory hotplug in qemu,
which supports hotplugging a single memory device. So whole-node hotplug will
not satisfy virtualization users.
So in the end, we concluded that we'd better handle memory hotplug and the
node-local data (node data, pagetable, vmemmap, ...) in two steps.
Please refer to https://lkml.org/lkml/2013/6/19/73
For now, we put the node_data of a movable node on another node, and will
improve this in the future.
In later patches, a boot option will be introduced to enable/disable this
functionality. If users disable it, the node_data will still be put on the
local node.
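The fallback behaviour can be sketched with a toy per-node allocator.
This is only a model under stated assumptions: struct demo_node and the
demo_alloc_* helpers are hypothetical and do not mirror the memblock API
beyond the "preferred node first, then any node" order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct demo_node {
    uint64_t next_free;     /* next unallocated address on this node */
    uint64_t end;           /* end of the node's memory, exclusive */
    bool movable;           /* all memory on the node is hotpluggable */
};

/* Allocate from one specific node only; returns 0 on failure. */
static uint64_t demo_alloc_nid(struct demo_node *nodes, int nid, uint64_t size)
{
    struct demo_node *n = &nodes[nid];
    uint64_t addr;

    if (n->movable || n->end - n->next_free < size)
        return 0;
    addr = n->next_free;
    n->next_free += size;
    return addr;
}

/* Try the preferred node first, then fall back to any other node. */
static uint64_t demo_alloc_try_nid(struct demo_node *nodes, int nr_nodes,
                                   int nid, uint64_t size)
{
    uint64_t addr;
    int i;

    assert(nid >= 0 && nid < nr_nodes);
    addr = demo_alloc_nid(nodes, nid, size);
    for (i = 0; !addr && i < nr_nodes; i++)
        if (i != nid)
            addr = demo_alloc_nid(nodes, i, size);
    return addr;
}
```

For a movable node the strict variant fails, while the "try" variant
returns memory from another node and leaves the movable node untouched.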
Signed-off-by: Yasuaki Ishimatsu <[email protected]>
Signed-off-by: Lai Jiangshan <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Jiang Liu <[email protected]>
Reviewed-by: Wanpeng Li <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
Acked-by: Toshi Kani <[email protected]>
---
arch/x86/mm/numa.c | 5 ++---
1 files changed, 2 insertions(+), 3 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 8bf93ba..d532b6d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -209,10 +209,9 @@ static void __init setup_node_data(int nid, u64 start, u64 end)
* Allocate node data. Try node-local memory and then any node.
* Never allocate in DMA zone.
*/
- nd_pa = memblock_alloc_nid(nd_size, SMP_CACHE_BYTES, nid);
+ nd_pa = memblock_alloc_try_nid(nd_size, SMP_CACHE_BYTES, nid);
if (!nd_pa) {
- pr_err("Cannot find %zu bytes in node %d\n",
- nd_size, nid);
+ pr_err("Cannot find %zu bytes in any node\n", nd_size);
return;
}
nd = __va(nd_pa);
--
1.7.1
Early in boot, memblock reserves some memory for the kernel,
such as the kernel code and data segments, the initrd file, and so on,
which means the kernel resides in these memory regions.
Even if these memory regions are hotpluggable, we should not
mark them as hotpluggable. Otherwise the kernel won't have enough
memory to boot.
This patch finds out which memory regions the kernel resides in,
and skips them when finding all hotpluggable memory regions.
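The interval test at the heart of this check can be sketched in
user-space C. demo_resides_in() and struct demo_region are hypothetical
names, and the sketch covers only the overlap test on half-open ranges,
not any flag filtering.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct demo_region {
    uint64_t base;
    uint64_t size;
};

/*
 * Return true if any reserved region overlaps [@base, @base + @length).
 * Both ranges are half-open, so regions that merely touch do not overlap.
 */
static bool demo_resides_in(const struct demo_region *reserved, size_t cnt,
                            uint64_t base, uint64_t length)
{
    size_t i;

    assert(reserved != NULL);
    for (i = 0; i < cnt; i++) {
        uint64_t start = reserved[i].base;
        uint64_t end = reserved[i].base + reserved[i].size;

        if (end <= base || start >= base + length)
            continue;       /* entirely before or after the range */
        return true;
    }
    return false;
}
```

A candidate hotpluggable range that ends exactly where a reserved region
begins is not treated as occupied, while a single shared byte is enough
to exclude it.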
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
mm/memory_hotplug.c | 45 +++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 45 insertions(+), 0 deletions(-)
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ef9ccf8..3d760fc 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -31,6 +31,7 @@
#include <linux/firmware-map.h>
#include <linux/stop_machine.h>
#include <linux/acpi.h>
+#include <linux/memblock.h>
#include <asm/tlbflush.h>
@@ -93,6 +94,40 @@ static void release_memory_resource(struct resource *res)
#ifdef CONFIG_ACPI_NUMA
/**
+ * kernel_resides_in_range - Check if kernel resides in a memory region.
+ * @base: The base address of the memory region.
+ * @length: The length of the memory region.
+ *
+ * This function is used early in boot. It iterates memblock.reserved and
+ * checks if the kernel has used any memory in [@base, @base + @length).
+ *
+ * Return true if the kernel resides in the memory region, false otherwise.
+ */
+static bool __init kernel_resides_in_region(phys_addr_t base, u64 length)
+{
+ int i;
+ phys_addr_t start, end;
+ struct memblock_region *region;
+ struct memblock_type *reserved = &memblock.reserved;
+
+ for (i = 0; i < reserved->cnt; i++) {
+ region = &reserved->regions[i];
+
+ if (region->flags != MEMBLOCK_HOTPLUG)
+ continue;
+
+ start = region->base;
+ end = region->base + region->size;
+ if (end <= base || start >= base + length)
+ continue;
+
+ return true;
+ }
+
+ return false;
+}
+
+/**
* find_hotpluggable_memory - Find out hotpluggable memory from ACPI SRAT.
*
* This function did the following:
@@ -129,6 +164,16 @@ void __init find_hotpluggable_memory(void)
while (ACPI_SUCCESS(acpi_hotplug_mem_affinity(srat_vaddr, &base,
&size, &offset))) {
+ /*
+ * At early time, memblock will reserve some memory for the
+ * kernel, such as the kernel code and data segments, initrd
+ * file, and so on, which means the kernel resides in these
+ * memory regions. These regions should not be hotpluggable.
+ * So do not mark them as hotpluggable.
+ */
+ if (kernel_resides_in_region(base, size))
+ continue;
+
/* Will mark hotpluggable memory regions here */
}
--
1.7.1
In ACPI SRAT(System Resource Affinity Table), there is a memory affinity for each
memory range in the system. Each memory affinity has a field indicating
whether the memory range is hotpluggable.
This patch parses only the memory affinities in SRAT, and finds all the
hotpluggable memory ranges in the system.
This patch doesn't mark hotpluggable memory in memblock yet. Memory marked as
hotpluggable won't be allocated to the kernel. If all the memory in the system
is hotpluggable, then the system won't have enough memory to boot. The basic idea
to solve this problem is to make the nodes the kernel resides in unhotpluggable.
So, until that is done, we don't mark any hotpluggable memory in memblock, so that
memblock keeps working as before.
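The resumable iteration over variable-length entries can be sketched as
follows. The table layout here is deliberately simplified and is not the
real ACPI layout: DEMO_HEADER_LEN, struct demo_entry, the flag value and
both helpers are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define DEMO_HEADER_LEN  8      /* hypothetical table header size */
#define DEMO_TYPE_MEMORY 1      /* entry describes a memory range */
#define DEMO_MEM_HOTPLUG 0x2u   /* range is hotpluggable */

/* Simplified stand-in for one affinity entry. */
struct demo_entry {
    uint8_t type;
    uint8_t length;             /* total size of this entry in bytes */
    uint8_t pad[2];
    uint32_t flags;
    uint64_t base;
    uint64_t size;
};

/* Write one entry at @pos and return the position just after it. */
static size_t demo_put_entry(uint8_t *table, size_t pos, uint8_t type,
                             uint32_t flags, uint64_t base, uint64_t size)
{
    struct demo_entry e = { .type = type, .length = (uint8_t)sizeof(e),
                            .flags = flags, .base = base, .size = size };

    memcpy(table + pos, &e, sizeof(e));
    return pos + sizeof(e);
}

/*
 * Find the next hotpluggable memory entry. *offset is 0 on the first
 * call and is set to the offset of the entry found, so the next call
 * resumes after it. Returns 0 when an entry is found, -1 otherwise.
 */
static int demo_next_hotplug(const uint8_t *table, size_t table_len,
                             size_t *offset, uint64_t *base, uint64_t *size)
{
    size_t pos;

    assert(table != NULL && offset != NULL);
    if (*offset == 0)
        pos = DEMO_HEADER_LEN;                  /* first entry */
    else if (*offset + 2 <= table_len)
        pos = *offset + table[*offset + 1];     /* skip the previous hit */
    else
        return -1;

    while (pos + 2 <= table_len) {
        struct demo_entry e;
        uint8_t len = table[pos + 1];

        if (len == 0 || pos + len > table_len)
            break;                              /* malformed or truncated */
        if (table[pos] == DEMO_TYPE_MEMORY && len >= sizeof(e)) {
            memcpy(&e, table + pos, sizeof(e));
            if (e.flags & DEMO_MEM_HOTPLUG) {
                if (base)
                    *base = e.base;
                if (size)
                    *size = e.size;
                *offset = pos;
                return 0;
            }
        }
        pos += len;
    }
    return -1;
}
```

A caller starts with an offset of 0 and keeps calling until the function
reports failure, which visits each hotpluggable entry exactly once.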
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
drivers/acpi/osl.c | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/acpi.h | 2 +
mm/memory_hotplug.c | 22 ++++++++++++-
3 files changed, 107 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index ec490fe..d01202d 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -780,6 +780,91 @@ phys_addr_t __init early_acpi_firmware_srat(void)
return table_desc.address;
}
+
+/*******************************************************************************
+ *
+ * FUNCTION: acpi_hotplug_mem_affinity
+ *
+ * PARAMETERS: Srat_vaddr - Virt addr of SRAT
+ * Base - The base address of the found hotpluggable
+ * memory region
+ * Size - The size of the found hotpluggable memory
+ * region
+ * Offset - Offset of the found memory affinity
+ *
+ * RETURN: Status
+ *
+ * DESCRIPTION: This function iterates SRAT affinities list to find memory
+ * affinities with hotpluggable memory one by one. Return the
+ * offset of the found memory affinity through @offset. @offset
+ * can be used to iterate the SRAT affinities list to find all the
+ * hotpluggable memory affinities. If @offset is 0, it is the first
+ * time of the iteration.
+ *
+ ******************************************************************************/
+acpi_status __init
+acpi_hotplug_mem_affinity(void *srat_vaddr, u64 *base, u64 *size,
+ unsigned long *offset)
+{
+ struct acpi_table_header *table_header;
+ struct acpi_subtable_header *entry;
+ struct acpi_srat_mem_affinity *ma;
+ unsigned long table_end, curr;
+
+ if (!offset)
+ return_ACPI_STATUS(AE_BAD_PARAMETER);
+
+ table_header = (struct acpi_table_header *)srat_vaddr;
+ table_end = (unsigned long)table_header + table_header->length;
+
+ entry = (struct acpi_subtable_header *)
+ ((unsigned long)table_header + *offset);
+
+ if (*offset) {
+ /*
+ * @offset is the offset of the last affinity found in the
+ * last call. So need to move to the next affinity.
+ */
+ entry = (struct acpi_subtable_header *)
+ ((unsigned long)entry + entry->length);
+ } else {
+ /*
+ * Offset of the first affinity is the size of SRAT
+ * table header.
+ */
+ entry = (struct acpi_subtable_header *)
+ ((unsigned long)entry + sizeof(struct acpi_table_srat));
+ }
+
+ while (((unsigned long)entry) + sizeof(struct acpi_subtable_header) <
+ table_end) {
+ if (entry->length == 0)
+ break;
+
+ if (entry->type != ACPI_SRAT_TYPE_MEMORY_AFFINITY)
+ goto next;
+
+ ma = (struct acpi_srat_mem_affinity *)entry;
+
+ if (!(ma->flags & ACPI_SRAT_MEM_HOT_PLUGGABLE))
+ goto next;
+
+ if (base)
+ *base = ma->base_address;
+
+ if (size)
+ *size = ma->length;
+
+ *offset = (unsigned long)entry - (unsigned long)srat_vaddr;
+ return_ACPI_STATUS(AE_OK);
+
+next:
+ entry = (struct acpi_subtable_header *)
+ ((unsigned long)entry + entry->length);
+ }
+
+ return_ACPI_STATUS(AE_NOT_FOUND);
+}
#endif /* CONFIG_ACPI_NUMA */
static void acpi_table_taint(struct acpi_table_header *table)
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 280078c..f103e91 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -99,6 +99,8 @@ static inline phys_addr_t early_acpi_override_srat(void)
#ifdef CONFIG_ACPI_NUMA
phys_addr_t early_acpi_firmware_srat(void);
+acpi_status acpi_hotplug_mem_affinity(void *srat_vaddr, u64 *base,
+ u64 *size, unsigned long *offset);
#endif /* CONFIG_ACPI_NUMA */
char * __acpi_map_table (unsigned long phys_addr, unsigned long size);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2dfb06f..ef9ccf8 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -103,7 +103,11 @@ static void release_memory_resource(struct resource *res)
*/
void __init find_hotpluggable_memory(void)
{
- phys_addr_t srat_paddr;
+ void *srat_vaddr;
+ phys_addr_t srat_paddr, base, size;
+ u32 length;
+ struct acpi_table_header *srat_header;
+ unsigned long offset = 0;
/* Try to find if SRAT is overridden */
srat_paddr = early_acpi_override_srat();
@@ -114,7 +118,21 @@ void __init find_hotpluggable_memory(void)
return;
}
- /* Will parse SRAT and find out hotpluggable memory here */
+ /* Get the length of SRAT */
+ srat_header = early_ioremap(srat_paddr,
+ sizeof(struct acpi_table_header));
+ length = srat_header->length;
+ early_iounmap(srat_header, sizeof(struct acpi_table_header));
+
+ /* Find all the hotpluggable memory regions */
+ srat_vaddr = early_ioremap(srat_paddr, length);
+
+ while (ACPI_SUCCESS(acpi_hotplug_mem_affinity(srat_vaddr, &base,
+ &size, &offset))) {
+ /* Will mark hotpluggable memory regions here */
+ }
+
+ early_iounmap(srat_vaddr, length);
}
#endif /* CONFIG_ACPI_NUMA */
--
1.7.1
This patch introduces early_acpi_firmware_srat() to find the
phys addr of the SRAT provided by firmware, and calls it in
find_hotpluggable_memory().
Since acpi_gbl_root_table_list has been initialized earlier
and stores the phys addrs and signatures of all the tables,
it is easy to find the SRAT.
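The lookup described above can be sketched in plain C as follows. This is
only an illustration under simplifying assumptions: struct table_desc,
find_table_desc() and firmware_srat_paddr() are hypothetical stand-ins, not
the ACPICA types or functions touched by this patch.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct acpi_table_desc. */
struct table_desc {
	uint64_t address;	/* physical address of the table */
	char signature[4];	/* 4-byte signature, not NUL-terminated */
};

/*
 * Copy the first descriptor whose signature matches @sig into @out.
 * The caller provides the storage for @out.
 * Returns 0 on success, -1 if no such table is in the list.
 */
int find_table_desc(const struct table_desc *list, size_t count,
		    const char *sig, struct table_desc *out)
{
	size_t i;

	assert(list && sig && out);

	for (i = 0; i < count; i++) {
		if (memcmp(list[i].signature, sig, 4) != 0)
			continue;
		*out = list[i];
		return 0;
	}
	return -1;
}

/* Return the physical address of the SRAT, or 0 if it is not present. */
uint64_t firmware_srat_paddr(const struct table_desc *list, size_t count)
{
	struct table_desc desc;

	if (find_table_desc(list, count, "SRAT", &desc))
		return 0;
	return desc.address;
}
```

Returning 0 for "not found" mirrors the convention of
early_acpi_firmware_srat() in this patch, so the caller can fall through
with a plain !srat_paddr check.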
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
drivers/acpi/acpica/tbxface.c | 32 ++++++++++++++++++++++++++++++++
drivers/acpi/osl.c | 22 ++++++++++++++++++++++
include/acpi/acpixf.h | 4 ++++
include/linux/acpi.h | 4 ++++
mm/memory_hotplug.c | 8 ++++++--
5 files changed, 68 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/acpica/tbxface.c b/drivers/acpi/acpica/tbxface.c
index ecaa5e1..a025dcc 100644
--- a/drivers/acpi/acpica/tbxface.c
+++ b/drivers/acpi/acpica/tbxface.c
@@ -236,6 +236,38 @@ acpi_status acpi_reallocate_root_table(void)
return_ACPI_STATUS(status);
}
+/**
+ * acpi_get_table_desc - Get the acpi table descriptor of a specific table.
+ * @signature: The signature of the table to be found.
+ * @out_desc: The out returned descriptor.
+ *
+ * Iterate over acpi_gbl_root_table_list to find a specific table and then
+ * return its phys addr.
+ *
+ * NOTE: The caller has the responsibility to allocate memory for @out_desc.
+ *
+ * Return AE_OK on success, AE_NOT_FOUND if the table is not found.
+ */
+acpi_status acpi_get_table_desc(char *signature,
+ struct acpi_table_desc *out_desc)
+{
+ struct acpi_table_desc *desc;
+ int pos, count = acpi_gbl_root_table_list.current_table_count;
+
+ for (pos = 0; pos < count; pos++) {
+ desc = &acpi_gbl_root_table_list.tables[pos];
+
+ if (!ACPI_COMPARE_NAME(&desc->signature, signature))
+ continue;
+
+ memcpy(out_desc, desc, sizeof(struct acpi_table_desc));
+
+ return_ACPI_STATUS(AE_OK);
+ }
+
+ return_ACPI_STATUS(AE_NOT_FOUND);
+}
+
/*******************************************************************************
*
* FUNCTION: acpi_get_table_header
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index dcbca3e..ec490fe 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -53,6 +53,7 @@
#include <acpi/acpi.h>
#include <acpi/acpi_bus.h>
#include <acpi/processor.h>
+#include <acpi/acpixf.h>
#define _COMPONENT ACPI_OS_SERVICES
ACPI_MODULE_NAME("osl");
@@ -760,6 +761,27 @@ void __init acpi_initrd_override(void *data, size_t size)
}
#endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */
+#ifdef CONFIG_ACPI_NUMA
+/*******************************************************************************
+ *
+ * FUNCTION: early_acpi_firmware_srat
+ *
+ * RETURN: Phys addr of SRAT on success, 0 on error.
+ *
+ * DESCRIPTION: Get the phys addr of SRAT provided by firmware.
+ *
+ ******************************************************************************/
+phys_addr_t __init early_acpi_firmware_srat(void)
+{
+ struct acpi_table_desc table_desc;
+
+ if (acpi_get_table_desc(ACPI_SIG_SRAT, &table_desc))
+ return 0;
+
+ return table_desc.address;
+}
+#endif /* CONFIG_ACPI_NUMA */
+
static void acpi_table_taint(struct acpi_table_header *table)
{
pr_warn(PREFIX
diff --git a/include/acpi/acpixf.h b/include/acpi/acpixf.h
index 99c9d7b..daa7c10 100644
--- a/include/acpi/acpixf.h
+++ b/include/acpi/acpixf.h
@@ -188,6 +188,10 @@ acpi_status acpi_find_root_pointer(acpi_size *rsdp_address);
acpi_status acpi_unload_table_id(acpi_owner_id id);
acpi_status
+acpi_get_table_desc(char *signature,
+ struct acpi_table_desc *out_desc);
+
+acpi_status
acpi_get_table_header(acpi_string signature,
u32 instance, struct acpi_table_header *out_table_header);
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index bdcb9dd..280078c 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -97,6 +97,10 @@ static inline phys_addr_t early_acpi_override_srat(void)
}
#endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */
+#ifdef CONFIG_ACPI_NUMA
+phys_addr_t early_acpi_firmware_srat(void);
+#endif /* CONFIG_ACPI_NUMA */
+
char * __acpi_map_table (unsigned long phys_addr, unsigned long size);
void __acpi_unmap_table(char *map, unsigned long size);
int early_acpi_boot_init(void);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2a57888..2dfb06f 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -107,8 +107,12 @@ void __init find_hotpluggable_memory(void)
/* Try to find if SRAT is overridden */
srat_paddr = early_acpi_override_srat();
- if (!srat_paddr)
- return;
+ if (!srat_paddr) {
+ /* Try to find SRAT from firmware if it wasn't overridden */
+ srat_paddr = early_acpi_firmware_srat();
+ if (!srat_paddr)
+ return;
+ }
/* Will parse SRAT and find out hotpluggable memory here */
}
--
1.7.1
There is no flag in memblock to describe what type the memory is.
Sometimes we use memblock to reserve memory for special usage, and
we want to know what kind of memory it is. So we need a way to
differentiate memory for different usages.
In a hotplug environment, we want to reserve hotpluggable memory so the
kernel won't be able to use it. And when the system is up, we have to
free this hotpluggable memory to the buddy allocator. So we need to mark
this memory first.
In order to do so, we need to mark this special memory in memblock.
In this patch, we introduce a new "flags" member into memblock_region:
struct memblock_region {
phys_addr_t base;
phys_addr_t size;
unsigned long flags; /* This is new. */
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int nid;
#endif
};
This patch does the following things:
1) Add a "flags" member to memblock_region.
2) Modify the prototypes of the following APIs:
memblock_add_region()
memblock_insert_region()
3) Add memblock_reserve_region() to support reserving memory with flags, and
keep memblock_reserve()'s prototype unmodified.
4) Modify other APIs to support flags, but keep their prototypes unmodified.
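One consequence of the new member is that adjacent regions may only be
merged when their flags are equal. The following is a minimal userspace
sketch of that rule, not the kernel code: struct region, REGION_HOTPLUG and
merge_regions() are hypothetical names, and the node id is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical flag value, for illustration only. */
#define REGION_HOTPLUG 0x1UL

/* Simplified stand-in for struct memblock_region (no nid). */
struct region {
	uint64_t base;
	uint64_t size;
	unsigned long flags;
};

/*
 * Merge neighbouring regions in a sorted, non-overlapping array.
 * Two regions are merged only if they are physically adjacent AND
 * carry the same flags. Returns the new number of regions.
 */
size_t merge_regions(struct region *r, size_t cnt)
{
	size_t i = 0;

	while (cnt > 1 && i < cnt - 1) {
		struct region *cur = &r[i];
		struct region *next = &r[i + 1];

		if (cur->base + cur->size != next->base ||
		    cur->flags != next->flags) {
			/* Sorted input must not overlap. */
			assert(cur->base + cur->size <= next->base);
			i++;
			continue;
		}

		cur->size += next->size;
		/* Close the gap left by the absorbed region. */
		memmove(next, next + 1, (cnt - (i + 2)) * sizeof(*r));
		cnt--;
	}
	return cnt;
}
```

With this rule, a hotpluggable range that directly follows ordinary
reserved memory stays a separate region, so its flags are not lost.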
The idea is from Wen Congyang <[email protected]> and Liu Jiang <[email protected]>.
v1 -> v2:
As tj suggested, a zero flag MEMBLK_DEFAULT would confuse users. If
we want to specify any other flag, such as MEMBLK_HOTPLUG, users don't
know whether to use MEMBLK_DEFAULT | MEMBLK_HOTPLUG or just
MEMBLK_HOTPLUG. So remove MEMBLK_DEFAULT (which is 0), and just use 0
by default to avoid confusing users.
Suggested-by: Wen Congyang <[email protected]>
Suggested-by: Liu Jiang <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
include/linux/memblock.h | 1 +
mm/memblock.c | 53 +++++++++++++++++++++++++++++++++-------------
2 files changed, 39 insertions(+), 15 deletions(-)
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index f388203..e89e0cd 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -22,6 +22,7 @@
struct memblock_region {
phys_addr_t base;
phys_addr_t size;
+ unsigned long flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
int nid;
#endif
diff --git a/mm/memblock.c b/mm/memblock.c
index a847bfe..0841a6e 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -157,6 +157,7 @@ static void __init_memblock memblock_remove_region(struct memblock_type *type, u
type->cnt = 1;
type->regions[0].base = 0;
type->regions[0].size = 0;
+ type->regions[0].flags = 0;
memblock_set_region_node(&type->regions[0], MAX_NUMNODES);
}
}
@@ -307,7 +308,8 @@ static void __init_memblock memblock_merge_regions(struct memblock_type *type)
if (this->base + this->size != next->base ||
memblock_get_region_node(this) !=
- memblock_get_region_node(next)) {
+ memblock_get_region_node(next) ||
+ this->flags != next->flags) {
BUG_ON(this->base + this->size > next->base);
i++;
continue;
@@ -327,13 +329,15 @@ static void __init_memblock memblock_merge_regions(struct memblock_type *type)
* @base: base address of the new region
* @size: size of the new region
* @nid: node id of the new region
+ * @flags: flags of the new region
*
* Insert new memblock region [@base,@base+@size) into @type at @idx.
* @type must already have extra room to accomodate the new region.
*/
static void __init_memblock memblock_insert_region(struct memblock_type *type,
int idx, phys_addr_t base,
- phys_addr_t size, int nid)
+ phys_addr_t size,
+ int nid, unsigned long flags)
{
struct memblock_region *rgn = &type->regions[idx];
@@ -341,6 +345,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
memmove(rgn + 1, rgn, (type->cnt - idx) * sizeof(*rgn));
rgn->base = base;
rgn->size = size;
+ rgn->flags = flags;
memblock_set_region_node(rgn, nid);
type->cnt++;
type->total_size += size;
@@ -352,6 +357,7 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
* @base: base address of the new region
* @size: size of the new region
* @nid: nid of the new region
+ * @flags: flags of the new region
*
* Add new memblock region [@base,@base+@size) into @type. The new region
* is allowed to overlap with existing ones - overlaps don't affect already
@@ -362,7 +368,8 @@ static void __init_memblock memblock_insert_region(struct memblock_type *type,
* 0 on success, -errno on failure.
*/
static int __init_memblock memblock_add_region(struct memblock_type *type,
- phys_addr_t base, phys_addr_t size, int nid)
+ phys_addr_t base, phys_addr_t size,
+ int nid, unsigned long flags)
{
bool insert = false;
phys_addr_t obase = base;
@@ -377,6 +384,7 @@ static int __init_memblock memblock_add_region(struct memblock_type *type,
WARN_ON(type->cnt != 1 || type->total_size);
type->regions[0].base = base;
type->regions[0].size = size;
+ type->regions[0].flags = flags;
memblock_set_region_node(&type->regions[0], nid);
type->total_size = size;
return 0;
@@ -407,7 +415,8 @@ repeat:
nr_new++;
if (insert)
memblock_insert_region(type, i++, base,
- rbase - base, nid);
+ rbase - base, nid,
+ flags);
}
/* area below @rend is dealt with, forget about it */
base = min(rend, end);
@@ -417,7 +426,8 @@ repeat:
if (base < end) {
nr_new++;
if (insert)
- memblock_insert_region(type, i, base, end - base, nid);
+ memblock_insert_region(type, i, base, end - base,
+ nid, flags);
}
/*
@@ -439,12 +449,13 @@ repeat:
int __init_memblock memblock_add_node(phys_addr_t base, phys_addr_t size,
int nid)
{
- return memblock_add_region(&memblock.memory, base, size, nid);
+ return memblock_add_region(&memblock.memory, base, size, nid, 0);
}
int __init_memblock memblock_add(phys_addr_t base, phys_addr_t size)
{
- return memblock_add_region(&memblock.memory, base, size, MAX_NUMNODES);
+ return memblock_add_region(&memblock.memory, base, size,
+ MAX_NUMNODES, 0);
}
/**
@@ -499,7 +510,8 @@ static int __init_memblock memblock_isolate_range(struct memblock_type *type,
rgn->size -= base - rbase;
type->total_size -= base - rbase;
memblock_insert_region(type, i, rbase, base - rbase,
- memblock_get_region_node(rgn));
+ memblock_get_region_node(rgn),
+ rgn->flags);
} else if (rend > end) {
/*
* @rgn intersects from above. Split and redo the
@@ -509,7 +521,8 @@ static int __init_memblock memblock_isolate_range(struct memblock_type *type,
rgn->size -= end - rbase;
type->total_size -= end - rbase;
memblock_insert_region(type, i--, rbase, end - rbase,
- memblock_get_region_node(rgn));
+ memblock_get_region_node(rgn),
+ rgn->flags);
} else {
/* @rgn is fully contained, record it */
if (!*end_rgn)
@@ -551,16 +564,24 @@ int __init_memblock memblock_free(phys_addr_t base, phys_addr_t size)
return __memblock_remove(&memblock.reserved, base, size);
}
-int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+static int __init_memblock memblock_reserve_region(phys_addr_t base,
+ phys_addr_t size,
+ int nid,
+ unsigned long flags)
{
struct memblock_type *_rgn = &memblock.reserved;
- memblock_dbg("memblock_reserve: [%#016llx-%#016llx] %pF\n",
+ memblock_dbg("memblock_reserve: [%#016llx-%#016llx] flags %#02lx %pF\n",
(unsigned long long)base,
(unsigned long long)base + size,
- (void *)_RET_IP_);
+ flags, (void *)_RET_IP_);
+
+ return memblock_add_region(_rgn, base, size, nid, flags);
+}
- return memblock_add_region(_rgn, base, size, MAX_NUMNODES);
+int __init_memblock memblock_reserve(phys_addr_t base, phys_addr_t size)
+{
+ return memblock_reserve_region(base, size, MAX_NUMNODES, 0);
}
/**
@@ -985,6 +1006,7 @@ void __init_memblock memblock_set_current_limit(phys_addr_t limit)
static void __init_memblock memblock_dump(struct memblock_type *type, char *name)
{
unsigned long long base, size;
+ unsigned long flags;
int i;
pr_info(" %s.cnt = 0x%lx\n", name, type->cnt);
@@ -995,13 +1017,14 @@ static void __init_memblock memblock_dump(struct memblock_type *type, char *name
base = rgn->base;
size = rgn->size;
+ flags = rgn->flags;
#ifdef CONFIG_HAVE_MEMBLOCK_NODE_MAP
if (memblock_get_region_node(rgn) != MAX_NUMNODES)
snprintf(nid_buf, sizeof(nid_buf), " on node %d",
memblock_get_region_node(rgn));
#endif
- pr_info(" %s[%#x]\t[%#016llx-%#016llx], %#llx bytes%s\n",
- name, i, base, base + size - 1, size, nid_buf);
+ pr_info(" %s[%#x]\t[%#016llx-%#016llx], %#llx bytes%s flags: %#lx\n",
+ name, i, base, base + size - 1, size, nid_buf, flags);
}
}
--
1.7.1
This patch splits acpi_initialize_tables() into two steps, and
introduces two new functions:
acpi_initialize_tables_firmware() and acpi_initialize_tables_override(),
which work just the same as acpi_initialize_tables() if they are called
in sequence.
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
drivers/acpi/acpica/tbxface.c | 64 ++++++++++++++++++++++++++++++++++++----
1 files changed, 57 insertions(+), 7 deletions(-)
diff --git a/drivers/acpi/acpica/tbxface.c b/drivers/acpi/acpica/tbxface.c
index 98e4cad..ecaa5e1 100644
--- a/drivers/acpi/acpica/tbxface.c
+++ b/drivers/acpi/acpica/tbxface.c
@@ -72,8 +72,7 @@ acpi_status acpi_allocate_root_table(u32 initial_table_count)
}
/*******************************************************************************
- *
- * FUNCTION: acpi_initialize_tables
+ * FUNCTION: acpi_initialize_tables_firmware
*
* PARAMETERS: initial_table_array - Pointer to an array of pre-allocated
* struct acpi_table_desc structures. If NULL, the
@@ -86,8 +85,6 @@ acpi_status acpi_allocate_root_table(u32 initial_table_count)
*
* RETURN: Status
*
- * DESCRIPTION: Initialize the table manager, get the RSDP and RSDT/XSDT.
- *
* NOTE: Allows static allocation of the initial table array in order
* to avoid the use of dynamic memory in confined environments
* such as the kernel boot sequence where it may not be available.
@@ -98,8 +95,8 @@ acpi_status acpi_allocate_root_table(u32 initial_table_count)
******************************************************************************/
acpi_status __init
-acpi_initialize_tables(struct acpi_table_desc * initial_table_array,
- u32 initial_table_count, u8 allow_resize)
+acpi_initialize_tables_firmware(struct acpi_table_desc * initial_table_array,
+ u32 initial_table_count, u8 allow_resize)
{
acpi_physical_address rsdp_address;
acpi_status status;
@@ -144,10 +141,63 @@ acpi_initialize_tables(struct acpi_table_desc * initial_table_array,
* in a common, more useable format.
*/
status = acpi_tb_root_table_install(rsdp_address);
+
+ return_ACPI_STATUS(status);
+}
+
+/*******************************************************************************
+ *
+ * FUNCTION: acpi_initialize_tables_override
+ *
+ * PARAMETERS: None
+ *
+ * RETURN: None
+ *
+ * DESCRIPTION: Allow host OS to replace any table installed in global root
+ * table list.
+ *
+ ******************************************************************************/
+
+void acpi_initialize_tables_override(void)
+{
+ acpi_tb_root_table_override();
+}
+
+/*******************************************************************************
+ *
+ * FUNCTION: acpi_initialize_tables
+ *
+ * PARAMETERS: initial_table_array - Pointer to an array of pre-allocated
+ * struct acpi_table_desc structures. If NULL, the
+ * array is dynamically allocated.
+ * initial_table_count - Size of initial_table_array, in number of
+ * struct acpi_table_desc structures
+ * allow_resize - Flag to tell Table Manager if resize of
+ * pre-allocated array is allowed. Ignored
+ * if initial_table_array is NULL.
+ *
+ * RETURN: Status
+ *
+ * DESCRIPTION: Initialize the table manager, get the RSDP and RSDT/XSDT.
+ *
+ ******************************************************************************/
+
+acpi_status __init
+acpi_initialize_tables(struct acpi_table_desc * initial_table_array,
+ u32 initial_table_count, u8 allow_resize)
+{
+ acpi_status status;
+
+ status = acpi_initialize_tables_firmware(initial_table_array,
+ initial_table_count, allow_resize);
if (ACPI_FAILURE(status))
return_ACPI_STATUS(status);
- acpi_tb_root_table_override();
+ /*
+ * Allow host OS to replace any table installed in global root
+ * table list.
+ */
+ acpi_initialize_tables_override();
return_ACPI_STATUS(AE_OK);
}
--
1.7.1
This patch splits acpi_boot_table_init() into two steps,
so that we can do each step separately in later patches.
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
arch/x86/kernel/acpi/boot.c | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index 2627a81..ddb0bc1 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -1529,11 +1529,15 @@ void __init acpi_boot_table_init(void)
/*
* Initialize the ACPI boot-time table parser.
*/
- if (acpi_table_init()) {
+ if (acpi_table_init_firmware()) {
disable_acpi();
return;
}
+ acpi_table_init_override();
+
+ acpi_check_multiple_madt();
+
acpi_table_parse(ACPI_SIG_BOOT, acpi_parse_sbf);
/*
--
1.7.1
This patch splits acpi_table_init() into two steps.
Since acpi_table_init() is used not only on x86 but also on ia64,
we introduce two new functions:
acpi_table_init_firmware() and acpi_table_init_override(),
which work just the same as acpi_table_init() if they are called
in sequence. This keeps acpi_table_init() working as before on
other platforms, and we only call these new functions in Linux.
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
drivers/acpi/tables.c | 26 ++++++++++++++++++++------
include/linux/acpi.h | 2 ++
2 files changed, 22 insertions(+), 6 deletions(-)
diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index c8f2d01..4913a85 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -336,6 +336,23 @@ static void __init check_multiple_madt(void)
return;
}
+int __init acpi_table_init_firmware(void)
+{
+ acpi_status status;
+
+ status = acpi_initialize_tables_firmware(initial_tables,
+ ACPI_MAX_TABLES, 0);
+ if (ACPI_FAILURE(status))
+ return 1;
+
+ return 0;
+}
+
+void __init acpi_table_init_override(void)
+{
+ acpi_initialize_tables_override();
+}
+
/*
* acpi_table_init()
*
@@ -347,16 +364,13 @@ static void __init check_multiple_madt(void)
int __init acpi_table_init(void)
{
- acpi_status status;
-
- status = acpi_initialize_tables_firmware(initial_tables,
- ACPI_MAX_TABLES, 0);
- if (ACPI_FAILURE(status))
+ if (acpi_table_init_firmware())
return 1;
- acpi_initialize_tables_override();
+ acpi_table_init_override();
check_multiple_madt();
+
return 0;
}
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 353ba25..9704179 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -95,6 +95,8 @@ void acpi_boot_table_init (void);
int acpi_mps_check (void);
int acpi_numa_init (void);
+int acpi_table_init_firmware(void);
+void acpi_table_init_override(void);
int acpi_table_init (void);
int acpi_table_parse(char *id, acpi_tbl_table_handler handler);
int __init acpi_table_parse_entries(char *id, unsigned long table_size,
--
1.7.1
This patch splits acpi_tb_parse_root_table() into two steps, and
introduces two new functions:
acpi_tb_root_table_install() and acpi_tb_root_table_override().
They are just the same as acpi_tb_parse_root_table() if they are
called in sequence.
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
drivers/acpi/acpica/tbutils.c | 57 ++++++++++++++++++++++++++++++++++++++--
1 files changed, 54 insertions(+), 3 deletions(-)
diff --git a/drivers/acpi/acpica/tbutils.c b/drivers/acpi/acpica/tbutils.c
index 9bef44b..8ed9b9a 100644
--- a/drivers/acpi/acpica/tbutils.c
+++ b/drivers/acpi/acpica/tbutils.c
@@ -503,14 +503,16 @@ acpi_tb_get_root_table_entry(u8 *table_entry, u32 table_entry_size)
/*******************************************************************************
*
- * FUNCTION: acpi_tb_parse_root_table
+ * FUNCTION: acpi_tb_root_table_install
*
* PARAMETERS: rsdp - Pointer to the RSDP
*
* RETURN: Status
*
* DESCRIPTION: This function is called to parse the Root System Description
- * Table (RSDT or XSDT)
+ * Table (RSDT or XSDT), and install all the system description
+ * tables defined in the root table into the global root table
+ * list.
*
* NOTE: Tables are mapped (not copied) for efficiency. The FACS must
* be mapped and cannot be copied because it contains the actual
@@ -519,7 +521,7 @@ acpi_tb_get_root_table_entry(u8 *table_entry, u32 table_entry_size)
******************************************************************************/
acpi_status __init
-acpi_tb_parse_root_table(acpi_physical_address rsdp_address)
+acpi_tb_root_table_install(acpi_physical_address rsdp_address)
{
struct acpi_table_rsdp *rsdp;
u32 table_entry_size;
@@ -673,7 +675,31 @@ acpi_tb_parse_root_table(acpi_physical_address rsdp_address)
/* Install tables in firmware into acpi_gbl_root_table_list */
acpi_tb_install_table_firmware(acpi_gbl_root_table_list.
tables[i].address, NULL, i);
+ }
+
+ return_ACPI_STATUS(AE_OK);
+}
+
+/*******************************************************************************
+ *
+ * FUNCTION: acpi_tb_root_table_override
+ *
+ * PARAMETERS: None
+ *
+ * RETURN: None
+ *
+ * DESCRIPTION: This function is called to allow the host OS to replace any
+ * table that has been installed into the global root table
+ * list.
+ *
+ ******************************************************************************/
+void __init
+acpi_tb_root_table_override(void)
+{
+ int i;
+
+ for (i = 2; i < acpi_gbl_root_table_list.current_table_count; i++) {
/* Override the installed tables if any */
acpi_tb_install_table_override(acpi_gbl_root_table_list.
tables[i].address, NULL, i);
@@ -686,6 +712,31 @@ acpi_tb_parse_root_table(acpi_physical_address rsdp_address)
acpi_tb_parse_fadt(i);
}
}
+}
+
+/*******************************************************************************
+ *
+ * FUNCTION: acpi_tb_parse_root_table
+ *
+ * PARAMETERS: rsdp - Pointer to the RSDP
+ *
+ * RETURN: Status
+ *
+ * DESCRIPTION: This function is called to parse the Root System Description
+ * Table (RSDT or XSDT)
+ *
+ ******************************************************************************/
+
+acpi_status __init
+acpi_tb_parse_root_table(acpi_physical_address rsdp_address)
+{
+ acpi_status status;
+
+ status = acpi_tb_root_table_install(rsdp_address);
+ if (ACPI_FAILURE(status))
+ return_ACPI_STATUS(status);
+
+ acpi_tb_root_table_override();
return_ACPI_STATUS(AE_OK);
}
--
1.7.1
The previous patch introduces two new functions:
acpi_initialize_tables_firmware() and acpi_initialize_tables_override(),
which work just the same as acpi_initialize_tables() if they are called
in sequence.
In order to split acpi_table_init() on the acpi side, call these two
functions in acpi_table_init().
Since acpi_table_init() is also used on ia64, we keep it working as before.
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
drivers/acpi/tables.c | 5 ++++-
include/acpi/acpixf.h | 4 ++++
2 files changed, 8 insertions(+), 1 deletions(-)
diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index d67a1fe..c8f2d01 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -349,10 +349,13 @@ int __init acpi_table_init(void)
{
acpi_status status;
- status = acpi_initialize_tables(initial_tables, ACPI_MAX_TABLES, 0);
+ status = acpi_initialize_tables_firmware(initial_tables,
+ ACPI_MAX_TABLES, 0);
if (ACPI_FAILURE(status))
return 1;
+ acpi_initialize_tables_override();
+
check_multiple_madt();
return 0;
}
diff --git a/include/acpi/acpixf.h b/include/acpi/acpixf.h
index 22d497e..99c9d7b 100644
--- a/include/acpi/acpixf.h
+++ b/include/acpi/acpixf.h
@@ -115,6 +115,10 @@ extern u32 acpi_rsdt_forced;
* Initialization
*/
acpi_status
+acpi_initialize_tables_firmware(struct acpi_table_desc *initial_storage,
+ u32 initial_table_count, u8 allow_resize);
+void acpi_initialize_tables_override(void);
+acpi_status
acpi_initialize_tables(struct acpi_table_desc *initial_storage,
u32 initial_table_count, u8 allow_resize);
--
1.7.1
The previous patch introduced two new functions:
acpi_tb_install_table_firmware() and acpi_tb_install_table_override().
They are the same as acpi_tb_install_table() if they are called in sequence.
In order to split acpi_tb_parse_root_table(), we call these two functions
instead of acpi_tb_install_table() in acpi_tb_parse_root_table(). This will
keep acpi_tb_parse_root_table() working as before.
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
drivers/acpi/acpica/tbutils.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/acpica/tbutils.c b/drivers/acpi/acpica/tbutils.c
index 2db068c..9bef44b 100644
--- a/drivers/acpi/acpica/tbutils.c
+++ b/drivers/acpi/acpica/tbutils.c
@@ -670,8 +670,13 @@ acpi_tb_parse_root_table(acpi_physical_address rsdp_address)
* the header of each table
*/
for (i = 2; i < acpi_gbl_root_table_list.current_table_count; i++) {
- acpi_tb_install_table(acpi_gbl_root_table_list.tables[i].
- address, NULL, i);
+ /* Install tables in firmware into acpi_gbl_root_table_list */
+ acpi_tb_install_table_firmware(acpi_gbl_root_table_list.
+ tables[i].address, NULL, i);
+
+ /* Override the installed tables if any */
+ acpi_tb_install_table_override(acpi_gbl_root_table_list.
+ tables[i].address, NULL, i);
/* Special case for FADT - get the DSDT and FACS */
--
1.7.1
The comment of find_cpio_data() says:
* @offset: When a matching file is found, this is the offset to the
* beginning of the cpio. ......
But according to the code,
dptr = PTR_ALIGN(p + ch[C_NAMESIZE], 4);
nptr = PTR_ALIGN(dptr + ch[C_FILESIZE], 4);
....
*offset = (long)nptr - (long)data; /* data is the cpio file */
@offset is the offset of the next file, not the matching file itself.
This is confusing and may waste time in debugging.
So fix it.
v1 -> v2:
As tj suggested, rename @offset to @nextoff, which is clearer to
users. Also adjust the new comments.
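To show why a "next offset" is the natural thing to return, here is a small
sketch of a caller that resumes the search from the returned offset. It
uses a toy record layout, NOT the cpio format, and find_entry() and
count_entries() are hypothetical helpers rather than anything in
lib/earlycpio.c.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct entry {
	const unsigned char *data;	/* payload of the match, or NULL */
	size_t size;			/* payload length */
};

/*
 * Toy archive, NOT cpio: each record is
 *   [name_len:1][data_len:1][name][data]
 * Find the first record whose name starts with @prefix. On a match,
 * *@nextoff is the offset from @buf to the record FOLLOWING the match.
 */
struct entry find_entry(const char *prefix, const unsigned char *buf,
			size_t len, size_t *nextoff)
{
	struct entry e = { NULL, 0 };
	size_t plen = strlen(prefix);
	size_t pos = 0;

	assert(buf && nextoff);

	while (pos + 2 <= len) {
		size_t nlen = buf[pos];
		size_t dlen = buf[pos + 1];
		size_t next = pos + 2 + nlen + dlen;

		if (next > len)
			break;		/* truncated record */

		if (nlen >= plen && !memcmp(buf + pos + 2, prefix, plen)) {
			e.data = buf + pos + 2 + nlen;
			e.size = dlen;
			*nextoff = next;
			return e;
		}
		pos = next;
	}
	return e;
}

/* Count all records under @prefix by resuming the search at the next offset. */
size_t count_entries(const char *prefix, const unsigned char *buf, size_t len)
{
	size_t n = 0, off = 0;

	for (;;) {
		struct entry e = find_entry(prefix, buf, len, &off);

		if (!e.data)
			break;
		n++;
		buf += off;	/* skip past the match to the next record */
		len -= off;
	}
	return n;
}
```

If the offset pointed at the matching record itself, the loop in
count_entries() would find the same record again and never terminate, which
is why the parameter name should say "next".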
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
lib/earlycpio.c | 27 ++++++++++++++-------------
1 files changed, 14 insertions(+), 13 deletions(-)
diff --git a/lib/earlycpio.c b/lib/earlycpio.c
index 7aa7ce2..c7ae5ed 100644
--- a/lib/earlycpio.c
+++ b/lib/earlycpio.c
@@ -49,22 +49,23 @@ enum cpio_fields {
/**
* cpio_data find_cpio_data - Search for files in an uncompressed cpio
- * @path: The directory to search for, including a slash at the end
- * @data: Pointer to the the cpio archive or a header inside
- * @len: Remaining length of the cpio based on data pointer
- * @offset: When a matching file is found, this is the offset to the
- * beginning of the cpio. It can be used to iterate through
- * the cpio to find all files inside of a directory path
+ * @path: The directory to search for, including a slash at the end
+ * @data: Pointer to the cpio archive or a header inside
+ * @len: Remaining length of the cpio based on data pointer
+ * @nextoff: When a matching file is found, this is the offset from the
+ * beginning of the cpio to the beginning of the next file, not the
+ * matching file itself. It can be used to iterate through the cpio
+ * to find all files inside of a directory path
*
- * @return: struct cpio_data containing the address, length and
- * filename (with the directory path cut off) of the found file.
- * If you search for a filename and not for files in a directory,
- * pass the absolute path of the filename in the cpio and make sure
- * the match returned an empty filename string.
+ * @return: struct cpio_data containing the address, length and
+ * filename (with the directory path cut off) of the found file.
+ * If you search for a filename and not for files in a directory,
+ * pass the absolute path of the filename in the cpio and make sure
+ * the match returned an empty filename string.
*/
struct cpio_data find_cpio_data(const char *path, void *data,
- size_t len, long *offset)
+ size_t len, long *nextoff)
{
const size_t cpio_header_len = 8*C_NFIELDS - 2;
struct cpio_data cd = { NULL, 0, "" };
@@ -124,7 +125,7 @@ struct cpio_data find_cpio_data(const char *path, void *data,
if ((ch[C_MODE] & 0170000) == 0100000 &&
ch[C_NAMESIZE] >= mypathsize &&
!memcmp(p, path, mypathsize)) {
- *offset = (long)nptr - (long)data;
+ *nextoff = (long)nptr - (long)data;
if (ch[C_NAMESIZE] - mypathsize >= MAX_CPIO_FILE_NAME) {
pr_warn(
"File %s exceeding MAX_CPIO_FILE_NAME [%d]\n",
--
1.7.1
In ACPI, the SRAT (System Resource Affinity Table) contains NUMA info.
The memory affinities in SRAT record every memory range in the
system, along with flags specifying whether the memory range is
hotpluggable.
(Please refer to section 5.2.16 of the ACPI 5.0 spec.)
memblock starts to work very early, when SRAT has not yet been
parsed. So we don't know which memory is hotpluggable. In order
to use memblock to reserve hotpluggable memory, we need to obtain
SRAT memory affinity info earlier.
In the current kernel, the acpica code iterates acpi_gbl_root_table_list
and installs all the acpi tables into it at boot time. For each table,
it first checks whether there is an override table in the global list
acpi_tables_addr. If so, it uses the override table. Otherwise, it
installs the table provided by firmware. Like the following:
setup_arch()
 |->acpi_initrd_override()	/* Initialize acpi_tables_addr with all override tables. */
 |...
 |->acpi_boot_table_init()
    |->acpi_table_init()
       |->acpi_initialize_tables()
          |->acpi_tb_parse_root_table()	/* Parse RSDT or XSDT, find all tables in firmware */
             |->for (each item in acpi_gbl_root_table_list)
                |->acpi_tb_install_table()
                   |-> ......	/* Install one single table */
                   |->acpi_tb_table_override()	/* Override one single table */
It does the table installation one by one.
In order to find SRAT earlier, we want to initialize
acpi_gbl_root_table_list earlier, while keeping the
ACPI_INITRD_TABLE_OVERRIDE procedure working as well.
The basic idea is, split the acpi_gbl_root_table_list initialization
procedure into two steps:
1. Install all tables from firmware, not one by one.
2. Override any table if necessary, not one by one.
After this patch-set, it will work like this:
setup_arch()
|-> ...... /* Install all tables from firmware (Step 1) */
|-> ...... /* Try to find if any override SRAT in initrd file, if yes, use it */
|-> ...... /* Use the SRAT from firmware */
|-> ...... /* memblock starts to work */
|-> ......
|->acpi_initrd_override() /* Initialize acpi_tables_addr with all override table. */
|...
|-> ...... /* Do the table override work for all tables (Step 2) */
In order to achieve this goal, we have to split all the following functions:
ACPICA:
acpi_tb_install_table()
acpi_tb_parse_root_table()
acpi_initialize_tables()
acpi:
acpi_table_init()
acpi_boot_table_init()
Since ACPICA code is not used only by Linux, we should keep the ACPICA
side interfaces unmodified, and introduce new functions for use in Linux.
This patch splits acpi_tb_install_table() into two steps, and introduces
two new functions:
acpi_tb_install_table_firmware() and acpi_tb_install_table_override(),
which will be used later in Linux.
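The two-step idea can be sketched outside the kernel as below. This is a
simplified model under stated assumptions, not the ACPICA implementation:
struct table_desc, struct table_list and the three functions are
hypothetical, and an "override" is reduced to swapping the recorded address.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MAX_TABLES 8

/* Hypothetical, simplified table descriptor. */
struct table_desc {
	char sig[4];		/* 4-byte signature, not NUL-terminated */
	uint64_t address;	/* physical address of the table */
	int overridden;		/* set once an override replaced it */
};

struct table_list {
	struct table_desc tables[MAX_TABLES];
	size_t count;
};

/* Step 1: install every firmware table, with no override applied. */
void install_firmware_tables(struct table_list *list,
			     const struct table_desc *fw, size_t n)
{
	size_t i;

	assert(list && fw && n <= MAX_TABLES);

	for (i = 0; i < n; i++) {
		list->tables[i] = fw[i];
		list->tables[i].overridden = 0;
	}
	list->count = n;
}

/* Step 2: replace installed tables that have an override. Returns the number replaced. */
size_t override_tables(struct table_list *list,
		       const struct table_desc *ovr, size_t n)
{
	size_t i, j, replaced = 0;

	for (i = 0; i < list->count; i++) {
		for (j = 0; j < n; j++) {
			if (memcmp(list->tables[i].sig, ovr[j].sig, 4) != 0)
				continue;
			list->tables[i].address = ovr[j].address;
			list->tables[i].overridden = 1;
			replaced++;
			break;
		}
	}
	return replaced;
}

/* The old single-call behaviour: both steps back to back. */
size_t install_tables(struct table_list *list,
		      const struct table_desc *fw, size_t nfw,
		      const struct table_desc *ovr, size_t novr)
{
	install_firmware_tables(list, fw, nfw);
	return override_tables(list, ovr, novr);
}
```

Between the two steps the list already holds every firmware table, which is
the window in which early code can look up the SRAT. Callers that use the
combined function see the same end state as before the split.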
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
drivers/acpi/acpica/tbutils.c | 118 +++++++++++++++++++++++++++++++++++-----
1 files changed, 103 insertions(+), 15 deletions(-)
diff --git a/drivers/acpi/acpica/tbutils.c b/drivers/acpi/acpica/tbutils.c
index bffdfc7..2db068c 100644
--- a/drivers/acpi/acpica/tbutils.c
+++ b/drivers/acpi/acpica/tbutils.c
@@ -249,28 +249,25 @@ struct acpi_table_header *acpi_tb_copy_dsdt(u32 table_index)
/*******************************************************************************
*
- * FUNCTION: acpi_tb_install_table
+ * FUNCTION: acpi_tb_install_table_firmware
*
- * PARAMETERS: address - Physical address of DSDT or FACS
+ * PARAMETERS: address - Physical address of the table to be
+ * installed
* signature - Table signature, NULL if no need to
* match
* table_index - Index into root table array
*
* RETURN: None
*
- * DESCRIPTION: Install an ACPI table into the global data structure. The
- * table override mechanism is called to allow the host
- * OS to replace any table before it is installed in the root
- * table array.
+ * DESCRIPTION: Install an ACPI table into the global data structure.
*
******************************************************************************/
void
-acpi_tb_install_table(acpi_physical_address address,
- char *signature, u32 table_index)
+acpi_tb_install_table_firmware(acpi_physical_address address,
+ char *signature, u32 table_index)
{
struct acpi_table_header *table;
- struct acpi_table_header *final_table;
struct acpi_table_desc *table_desc;
if (!address) {
@@ -312,6 +309,74 @@ acpi_tb_install_table(acpi_physical_address address,
table_desc->flags = ACPI_TABLE_ORIGIN_MAPPED;
ACPI_MOVE_32_TO_32(table_desc->signature.ascii, table->signature);
+ acpi_tb_print_table_header(table_desc->address, table);
+
+ /* Set the global integer width (based upon revision of the DSDT) */
+
+ if (table_index == ACPI_TABLE_INDEX_DSDT) {
+ acpi_ut_set_integer_width(table->revision);
+ }
+
+ unmap_and_exit:
+
+ /* Always unmap the table header that we mapped above */
+
+ acpi_os_unmap_memory(table, sizeof(struct acpi_table_header));
+}
+
+/*******************************************************************************
+ *
+ * FUNCTION: acpi_tb_install_table_override
+ *
+ * PARAMETERS: address - Physical address of the table to be
+ * installed
+ * signature - Table signature, NULL if no need to
+ * match
+ * table_index - Index into root table array
+ *
+ * RETURN: None
+ *
+ * DESCRIPTION: Override an ACPI table in the global data structure.
+ *
+ ******************************************************************************/
+
+void
+acpi_tb_install_table_override(acpi_physical_address address,
+ char *signature, u32 table_index)
+{
+ struct acpi_table_header *table;
+ struct acpi_table_header *final_table;
+ struct acpi_table_desc *table_desc;
+
+ if (!address) {
+ ACPI_ERROR((AE_INFO,
+ "Null physical address for ACPI table [%s]",
+ signature));
+ return;
+ }
+
+ /* Map just the table header */
+
+ table = acpi_os_map_memory(address, sizeof(struct acpi_table_header));
+ if (!table) {
+ ACPI_ERROR((AE_INFO,
+ "Could not map memory for table [%s] at %p",
+ signature, ACPI_CAST_PTR(void, address)));
+ return;
+ }
+
+ /* If a particular signature is expected (DSDT/FACS), it must match */
+
+ if (signature && !ACPI_COMPARE_NAME(table->signature, signature)) {
+ ACPI_BIOS_ERROR((AE_INFO,
+ "Invalid signature 0x%X for ACPI table, expected [%s]",
+ *ACPI_CAST_PTR(u32, table->signature),
+ signature));
+ goto unmap_and_exit;
+ }
+
+ table_desc = &acpi_gbl_root_table_list.tables[table_index];
+
/*
* ACPI Table Override:
*
@@ -332,12 +397,6 @@ acpi_tb_install_table(acpi_physical_address address,
acpi_tb_print_table_header(table_desc->address, final_table);
- /* Set the global integer width (based upon revision of the DSDT) */
-
- if (table_index == ACPI_TABLE_INDEX_DSDT) {
- acpi_ut_set_integer_width(final_table->revision);
- }
-
/*
* If we have a physical override during this early loading of the ACPI
* tables, unmap the table for now. It will be mapped again later when
@@ -359,6 +418,35 @@ acpi_tb_install_table(acpi_physical_address address,
/*******************************************************************************
*
+ * FUNCTION: acpi_tb_install_table
+ *
+ * PARAMETERS: address - Physical address of DSDT or FACS
+ * signature - Table signature, NULL if no need to
+ * match
+ * table_index - Index into root table array
+ *
+ * RETURN: None
+ *
+ * DESCRIPTION: Install an ACPI table into the global data structure. The
+ * table override mechanism is called to allow the host
+ * OS to replace any table which has been installed in the root
+ * table array.
+ *
+ ******************************************************************************/
+
+void
+acpi_tb_install_table(acpi_physical_address address,
+ char *signature, u32 table_index)
+{
+ /* Install a table from firmware into acpi_gbl_root_table_list. */
+ acpi_tb_install_table_firmware(address, signature, table_index);
+
+ /* Override an installed table. */
+ acpi_tb_install_table_override(address, signature, table_index);
+}
+
+/*******************************************************************************
+ *
* FUNCTION: acpi_tb_get_root_table_entry
*
* PARAMETERS: table_entry - Pointer to the RSDT/XSDT table entry
--
1.7.1
acpi_initrd_override() performs several checks to ensure that the
table it found is valid. Later patches need to do these checks
elsewhere, so this patch introduces a common function,
acpi_verify_table(), that does all of them and can be reused in
different places. It will be used in the subsequent patches.
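As a rough userspace illustration of the checks being factored out (header size, signature, length, checksum), consider the sketch below. The toy_* names and the short signature list are assumptions made for the example. Only the header layout (36 bytes, Length at offset 4, Checksum at offset 9) and the rule that all bytes sum to zero modulo 256 come from the ACPI specification.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_HDR_SIZE 36u	/* size of the standard ACPI table header */

/* Hypothetical allow-list, standing in for table_sigs[] in osl.c. */
static const char *const toy_known_sigs[] = { "SRAT", "SLIT", NULL };

/* Sum all bytes modulo 256; a valid ACPI table sums to zero. */
static uint8_t toy_checksum(const uint8_t *buf, size_t len)
{
	uint8_t sum = 0;

	while (len--)
		sum = (uint8_t)(sum + *buf++);
	return sum;
}

/* Read the little-endian 32-bit Length field at byte offset 4. */
static uint32_t toy_table_length(const uint8_t *buf)
{
	return (uint32_t)buf[4] | (uint32_t)buf[5] << 8 |
	       (uint32_t)buf[6] << 16 | (uint32_t)buf[7] << 24;
}

/* Return 0 if every check passes, -1 otherwise. sig may be NULL. */
static int toy_verify_table(const uint8_t *buf, size_t size, const char *sig)
{
	size_t i;

	assert(buf != NULL);
	if (size < TOY_HDR_SIZE)
		return -1;		/* smaller than an ACPI header */

	if (sig) {
		if (memcmp(buf, sig, 4))
			return -1;	/* not the requested table */
	} else {
		for (i = 0; toy_known_sigs[i]; i++)
			if (!memcmp(buf, toy_known_sigs[i], 4))
				break;
		if (!toy_known_sigs[i])
			return -1;	/* unknown signature */
	}

	if (size != toy_table_length(buf))
		return -1;		/* file length != table length */

	if (toy_checksum(buf, size))
		return -1;		/* bad checksum */

	return 0;
}
```

Returning an error code instead of using "continue" is what lets the same checks be called both inside and outside a loop.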
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
Acked-by: Toshi Kani <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
---
drivers/acpi/osl.c | 86 +++++++++++++++++++++++++++++++++++++---------------
1 files changed, 61 insertions(+), 25 deletions(-)
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 3b8bab2..0043e9f 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -572,9 +572,68 @@ static const char * const table_sigs[] = {
/* Must not increase 10 or needs code modification below */
#define ACPI_OVERRIDE_TABLES 10
+/*******************************************************************************
+ *
+ * FUNCTION: acpi_verify_table
+ *
+ * PARAMETERS: File - The initrd file
+ * Path - Path to acpi overriding tables in cpio file
+ * Signature - Signature of the table
+ *
+ * RETURN: 0 if it passes all the checks, -EINVAL if any check fails.
+ *
+ * DESCRIPTION: Check if an acpi table found in initrd is invalid.
+ * @signature can be NULL. If it is NULL, the function will check
+ * if the table signature matches any signature in table_sigs[].
+ *
+ ******************************************************************************/
+int __init acpi_verify_table(struct cpio_data *file,
+ const char *path, const char *signature)
+{
+ int idx;
+ struct acpi_table_header *table = file->data;
+
+ if (file->size < sizeof(struct acpi_table_header)) {
+ ACPI_INVALID_TABLE("Table smaller than ACPI header",
+ path, file->name);
+ return -EINVAL;
+ }
+
+ if (signature) {
+ if (memcmp(table->signature, signature, 4)) {
+ ACPI_INVALID_TABLE("Table signature does not match",
+ path, file->name);
+ return -EINVAL;
+ }
+ } else {
+ for (idx = 0; table_sigs[idx]; idx++)
+ if (!memcmp(table->signature, table_sigs[idx], 4))
+ break;
+
+ if (!table_sigs[idx]) {
+ ACPI_INVALID_TABLE("Unknown signature", path, file->name);
+ return -EINVAL;
+ }
+ }
+
+ if (file->size != table->length) {
+ ACPI_INVALID_TABLE("File length does not match table length",
+ path, file->name);
+ return -EINVAL;
+ }
+
+ if (acpi_table_checksum(file->data, table->length)) {
+ ACPI_INVALID_TABLE("Bad table checksum",
+ path, file->name);
+ return -EINVAL;
+ }
+
+ return 0;
+}
+
void __init acpi_initrd_override(void *data, size_t size)
{
- int sig, no, table_nr = 0, total_offset = 0;
+ int no, table_nr = 0, total_offset = 0;
long offset = 0;
struct acpi_table_header *table;
char cpio_path[32] = "kernel/firmware/acpi/";
@@ -593,33 +652,10 @@ void __init acpi_initrd_override(void *data, size_t size)
data += offset;
size -= offset;
- if (file.size < sizeof(struct acpi_table_header)) {
- ACPI_INVALID_TABLE("Table smaller than ACPI header",
- cpio_path, file.name);
- continue;
- }
-
table = file.data;
- for (sig = 0; table_sigs[sig]; sig++)
- if (!memcmp(table->signature, table_sigs[sig], 4))
- break;
-
- if (!table_sigs[sig]) {
- ACPI_INVALID_TABLE("Unknown signature",
- cpio_path, file.name);
+ if (acpi_verify_table(&file, cpio_path, NULL))
continue;
- }
- if (file.size != table->length) {
- ACPI_INVALID_TABLE("File length does not match table length",
- cpio_path, file.name);
- continue;
- }
- if (acpi_table_checksum(file.data, table->length)) {
- ACPI_INVALID_TABLE("Bad table checksum",
- cpio_path, file.name);
- continue;
- }
pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
table->signature, cpio_path, file.name, table->length);
--
1.7.1
Since we split acpi_table_init() into two steps and want to do
the two steps separately, we need to call check_multiple_madt() after
acpi_table_init_override().
But acpi_table_init() must stay as it is because it is also used on
ia64, so on x86 we have to call check_multiple_madt() directly
in acpi_boot_table_init().
This patch makes check_multiple_madt() global and renames it to
acpi_check_multiple_madt(), because all interfaces provided by
drivers/acpi/tables.c begin with "acpi_".
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
drivers/acpi/tables.c | 4 ++--
include/linux/acpi.h | 1 +
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/tables.c b/drivers/acpi/tables.c
index 4913a85..45727b2 100644
--- a/drivers/acpi/tables.c
+++ b/drivers/acpi/tables.c
@@ -314,7 +314,7 @@ int __init acpi_table_parse(char *id, acpi_tbl_table_handler handler)
* but some report two. Provide a knob to use either.
* (don't you wish instance 0 and 1 were not the same?)
*/
-static void __init check_multiple_madt(void)
+void __init acpi_check_multiple_madt(void)
{
struct acpi_table_header *table = NULL;
acpi_size tbl_size;
@@ -369,7 +369,7 @@ int __init acpi_table_init(void)
acpi_table_init_override();
- check_multiple_madt();
+ acpi_check_multiple_madt();
return 0;
}
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 9704179..44a3e5f 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -95,6 +95,7 @@ void acpi_boot_table_init (void);
int acpi_mps_check (void);
int acpi_numa_init (void);
+void acpi_check_multiple_madt(void);
int acpi_table_init_firmware(void);
void acpi_table_init_override(void);
int acpi_table_init (void);
--
1.7.1
The previous patch introduced two new functions,
acpi_tb_root_table_install() and acpi_tb_root_table_override(),
which work just the same as acpi_tb_parse_root_table() when they are
called in sequence.
In order to split acpi_initialize_tables(), call these two functions
in acpi_initialize_tables(). This keeps acpi_initialize_tables()
working as before.
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
drivers/acpi/acpica/actables.h | 2 ++
drivers/acpi/acpica/tbxface.c | 9 +++++++--
2 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/acpica/actables.h b/drivers/acpi/acpica/actables.h
index 7755e91..641796e 100644
--- a/drivers/acpi/acpica/actables.h
+++ b/drivers/acpi/acpica/actables.h
@@ -120,6 +120,8 @@ void
acpi_tb_install_table(acpi_physical_address address,
char *signature, u32 table_index);
+acpi_status acpi_tb_root_table_install(acpi_physical_address rsdp_address);
+void acpi_tb_root_table_override(void);
acpi_status acpi_tb_parse_root_table(acpi_physical_address rsdp_address);
#endif /* __ACTABLES_H__ */
diff --git a/drivers/acpi/acpica/tbxface.c b/drivers/acpi/acpica/tbxface.c
index ad11162..98e4cad 100644
--- a/drivers/acpi/acpica/tbxface.c
+++ b/drivers/acpi/acpica/tbxface.c
@@ -143,8 +143,13 @@ acpi_initialize_tables(struct acpi_table_desc * initial_table_array,
* Root Table Array. This array contains the information of the RSDT/XSDT
* in a common, more useable format.
*/
- status = acpi_tb_parse_root_table(rsdp_address);
- return_ACPI_STATUS(status);
+ status = acpi_tb_root_table_install(rsdp_address);
+ if (ACPI_FAILURE(status))
+ return_ACPI_STATUS(status);
+
+ acpi_tb_root_table_override();
+
+ return_ACPI_STATUS(AE_OK);
}
/*******************************************************************************
--
1.7.1
The macro INVALID_TABLE() is defined like this:
#define INVALID_TABLE(x, path, name) \
{ pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); continue; }
And it is used like this:
for (...) {
...
if (...)
INVALID_TABLE()
...
}
The "continue" in the macro makes the code hard to understand.
Change it to the style like other macros:
#define INVALID_TABLE(x, path, name) \
do { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); } while (0)
In addition, INVALID_TABLE() is used to check ACPI tables, so rename it to
ACPI_INVALID_TABLE(). This was suggested by Toshi Kani <[email protected]>.
So after this patch, this macro should be used like this:
for (...) {
...
if (...) {
ACPI_INVALID_TABLE()
continue;
}
...
}
Add the "continue" wherever the macro is called.
(For now, it is only called in acpi_initrd_override().)
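A standalone sketch of the resulting pattern, using a hypothetical TOY_INVALID_TABLE() macro and fprintf() in place of pr_err(), may make the intent clearer. The macro only reports, and the loop control stays visible in the caller:

```c
#include <assert.h>
#include <stdio.h>

/* Report only; control flow is left to the caller. */
#define TOY_INVALID_TABLE(msg, name) \
	do { fprintf(stderr, "OVERRIDE: " msg " [%s]\n", name); } while (0)

/* Count entries whose size is at least min_size, skipping the rest. */
static int toy_count_valid(const int *sizes, int n, int min_size)
{
	int i, valid = 0;

	assert(sizes != NULL || n == 0);
	for (i = 0; i < n; i++) {
		if (sizes[i] < min_size) {
			TOY_INVALID_TABLE("Table too small", "toy");
			continue;	/* explicit at the call site */
		}
		valid++;
	}
	return valid;
}
```

The do { } while (0) wrapper also makes the macro behave as a single statement, so it stays safe in an unbraced if/else.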
The idea is from Yinghai Lu <[email protected]>.
Signed-off-by: Tang Chen <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Acked-by: Rafael J. Wysocki <[email protected]>
Acked-by: Toshi Kani <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
drivers/acpi/osl.c | 28 ++++++++++++++++++----------
1 files changed, 18 insertions(+), 10 deletions(-)
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 6ab2c35..3b8bab2 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -564,8 +564,8 @@ static const char * const table_sigs[] = {
ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };
/* Non-fatal errors: Affected tables/files are ignored */
-#define INVALID_TABLE(x, path, name) \
- { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); continue; }
+#define ACPI_INVALID_TABLE(x, path, name) \
+ do { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); } while (0)
#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)
@@ -593,9 +593,11 @@ void __init acpi_initrd_override(void *data, size_t size)
data += offset;
size -= offset;
- if (file.size < sizeof(struct acpi_table_header))
- INVALID_TABLE("Table smaller than ACPI header",
+ if (file.size < sizeof(struct acpi_table_header)) {
+ ACPI_INVALID_TABLE("Table smaller than ACPI header",
cpio_path, file.name);
+ continue;
+ }
table = file.data;
@@ -603,15 +605,21 @@ void __init acpi_initrd_override(void *data, size_t size)
if (!memcmp(table->signature, table_sigs[sig], 4))
break;
- if (!table_sigs[sig])
- INVALID_TABLE("Unknown signature",
+ if (!table_sigs[sig]) {
+ ACPI_INVALID_TABLE("Unknown signature",
cpio_path, file.name);
- if (file.size != table->length)
- INVALID_TABLE("File length does not match table length",
+ continue;
+ }
+ if (file.size != table->length) {
+ ACPI_INVALID_TABLE("File length does not match table length",
cpio_path, file.name);
- if (acpi_table_checksum(file.data, table->length))
- INVALID_TABLE("Bad table checksum",
+ continue;
+ }
+ if (acpi_table_checksum(file.data, table->length)) {
+ ACPI_INVALID_TABLE("Bad table checksum",
cpio_path, file.name);
+ continue;
+ }
pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
table->signature, cpio_path, file.name, table->length);
--
1.7.1
Linux cannot migrate pages used by the kernel because of the direct mapping
(va = pa + PAGE_OFFSET), so any memory used by the kernel cannot be hot-removed.
When using memory hotplug, we therefore have to prevent the kernel from using
hotpluggable memory.
The ACPI table SRAT (System Resource Affinity Table) specifies
which memory is hotpluggable. Once SRAT is parsed, we know which
memory is hotpluggable.
Early in boot, SRAT has not been parsed yet, and the boot
memory allocator memblock will allocate any memory to the kernel. So we need
SRAT parsed before memblock starts to work.
In this patch, we are going to parse SRAT earlier, right after memblock is ready.
Generally speaking, tables such as SRAT are provided by firmware. But
ACPI_INITRD_TABLE_OVERRIDE functionality allows users to customize their own
tables in initrd, and override the ones from firmware. So if we want to parse
SRAT earlier, we also need to do SRAT override earlier.
First, we introduce early_acpi_override_srat() to check whether SRAT will be
overridden from initrd.
Second, we introduce find_hotpluggable_memory() to reserve hotpluggable memory.
It first calls early_acpi_override_srat() to find out which memory is
hotpluggable in the override SRAT.
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
arch/x86/kernel/setup.c | 10 ++++++
drivers/acpi/osl.c | 61 ++++++++++++++++++++++++++++++++++++++++
include/linux/acpi.h | 14 ++++++++-
include/linux/memory_hotplug.h | 2 +
mm/memory_hotplug.c | 25 +++++++++++++++-
5 files changed, 109 insertions(+), 3 deletions(-)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index da44353..36d7fe8 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1061,6 +1061,16 @@ void __init setup_arch(char **cmdline_p)
*/
early_acpi_boot_table_init();
+#ifdef CONFIG_ACPI_NUMA
+ /*
+ * Linux kernel cannot migrate kernel pages, as a result, memory used
+ * by the kernel cannot be hot-removed. Find and mark hotpluggable
+ * memory in memblock to prevent memblock from allocating hotpluggable
+ * memory for the kernel.
+ */
+ find_hotpluggable_memory();
+#endif
+
/*
* The EFI specification says that boot service code won't be called
* after ExitBootServices(). This is, in fact, a lie.
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 0043e9f..dcbca3e 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -48,6 +48,7 @@
#include <asm/io.h>
#include <asm/uaccess.h>
+#include <asm/setup.h>
#include <acpi/acpi.h>
#include <acpi/acpi_bus.h>
@@ -631,6 +632,66 @@ int __init acpi_verify_table(struct cpio_data *file,
return 0;
}
+#ifdef CONFIG_ACPI_NUMA
+/*******************************************************************************
+ *
+ * FUNCTION: early_acpi_override_srat
+ *
+ * RETURN: Phys addr of SRAT on success, 0 on error.
+ *
+ * DESCRIPTION: Try to get the phys addr of SRAT in initrd.
+ * The ACPI_INITRD_TABLE_OVERRIDE procedure is able to use tables
+ * in initrd file to override the ones provided by firmware. This
+ * function checks if there is a SRAT in initrd at early time. If
+ * so, return the phys addr of the SRAT.
+ *
+ ******************************************************************************/
+phys_addr_t __init early_acpi_override_srat(void)
+{
+ int i;
+ u32 length;
+ long offset;
+ void *ramdisk_vaddr;
+ struct acpi_table_header *table;
+ struct cpio_data file;
+ unsigned long map_step = NR_FIX_BTMAPS << PAGE_SHIFT;
+ phys_addr_t ramdisk_image = get_ramdisk_image();
+ char cpio_path[32] = "kernel/firmware/acpi/";
+
+ if (!ramdisk_image || !get_ramdisk_size())
+ return 0;
+
+ /* Try to find if SRAT is overridden */
+ for (i = 0; i < ACPI_OVERRIDE_TABLES; i++) {
+ ramdisk_vaddr = early_ioremap(ramdisk_image, map_step);
+
+ file = find_cpio_data(cpio_path, ramdisk_vaddr,
+ map_step, &offset);
+ if (!file.data) {
+ early_iounmap(ramdisk_vaddr, map_step);
+ return 0;
+ }
+
+ table = file.data;
+ length = table->length;
+
+ if (acpi_verify_table(&file, cpio_path, ACPI_SIG_SRAT)) {
+ ramdisk_image += offset;
+ early_iounmap(ramdisk_vaddr, map_step);
+ continue;
+ }
+
+ /* Found SRAT */
+ early_iounmap(ramdisk_vaddr, map_step);
+ ramdisk_image = ramdisk_image + offset - length;
+
+ break;
+ }
+
+ return ramdisk_image;
+}
+#endif /* CONFIG_ACPI_NUMA */
+
void __init acpi_initrd_override(void *data, size_t size)
{
int no, table_nr = 0, total_offset = 0;
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index c5e7b2a..bdcb9dd 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -81,11 +81,21 @@ typedef int (*acpi_tbl_entry_handler)(struct acpi_subtable_header *header,
#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
void acpi_initrd_override(void *data, size_t size);
-#else
+
+#ifdef CONFIG_ACPI_NUMA
+phys_addr_t early_acpi_override_srat(void);
+#endif /* CONFIG_ACPI_NUMA */
+
+#else /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */
static inline void acpi_initrd_override(void *data, size_t size)
{
}
-#endif
+
+static inline phys_addr_t early_acpi_override_srat(void)
+{
+ return 0;
+}
+#endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */
char * __acpi_map_table (unsigned long phys_addr, unsigned long size);
void __acpi_unmap_table(char *map, unsigned long size);
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index dd38e62..463efa9 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -104,6 +104,7 @@ extern int __remove_pages(struct zone *zone, unsigned long start_pfn,
/* reasonably generic interface to expand the physical pages in a zone */
extern int __add_pages(int nid, struct zone *zone, unsigned long start_pfn,
unsigned long nr_pages);
+extern void find_hotpluggable_memory(void);
#ifdef CONFIG_NUMA
extern int memory_add_physaddr_to_nid(u64 start);
@@ -181,6 +182,7 @@ static inline void register_page_bootmem_info_node(struct pglist_data *pgdat)
{
}
#endif
+
extern void put_page_bootmem(struct page *page);
extern void get_page_bootmem(unsigned long ingo, struct page *page,
unsigned long type);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index ca1dd3a..2a57888 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -30,6 +30,7 @@
#include <linux/mm_inline.h>
#include <linux/firmware-map.h>
#include <linux/stop_machine.h>
+#include <linux/acpi.h>
#include <asm/tlbflush.h>
@@ -62,7 +63,6 @@ void unlock_memory_hotplug(void)
mutex_unlock(&mem_hotplug_mutex);
}
-
/* add this memory to iomem resource */
static struct resource *register_memory_resource(u64 start, u64 size)
{
@@ -91,6 +91,29 @@ static void release_memory_resource(struct resource *res)
return;
}
+#ifdef CONFIG_ACPI_NUMA
+/**
+ * find_hotpluggable_memory - Find out hotpluggable memory from ACPI SRAT.
+ *
+ * This function did the following:
+ * 1. Try to find if there is a SRAT in initrd file used to override the one
+ * provided by firmware. If so, get its phys addr.
+ * 2. If there is no override SRAT, get the phys addr of the SRAT in firmware.
+ * 3. Parse SRAT, find out which memory is hotpluggable.
+ */
+void __init find_hotpluggable_memory(void)
+{
+ phys_addr_t srat_paddr;
+
+ /* Try to find if SRAT is overridden */
+ srat_paddr = early_acpi_override_srat();
+ if (!srat_paddr)
+ return;
+
+ /* Will parse SRAT and find out hotpluggable memory here */
+}
+#endif /* CONFIG_ACPI_NUMA */
+
#ifdef CONFIG_MEMORY_HOTPLUG_SPARSE
void get_page_bootmem(unsigned long info, struct page *page,
unsigned long type)
--
1.7.1
Since the previous patches split the acpi_gbl_root_table_list initialization
procedure into two steps, install and override, this patch does the "install"
step earlier, right after memblock is ready.
In this way, we are able to find SRAT in firmware earlier. And then, we can
prevent memblock from allocating hotpluggable memory for the kernel.
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
arch/x86/kernel/acpi/boot.c | 30 +++++++++++++++++-------------
arch/x86/kernel/setup.c | 8 +++++++-
include/linux/acpi.h | 1 +
3 files changed, 25 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kernel/acpi/boot.c b/arch/x86/kernel/acpi/boot.c
index ddb0bc1..30daefd 100644
--- a/arch/x86/kernel/acpi/boot.c
+++ b/arch/x86/kernel/acpi/boot.c
@@ -1497,6 +1497,23 @@ static struct dmi_system_id __initdata acpi_dmi_table_late[] = {
{}
};
+void __init early_acpi_boot_table_init(void)
+{
+ dmi_check_system(acpi_dmi_table);
+
+ /*
+ * If acpi_disabled, bail out
+ */
+ if (acpi_disabled)
+ return;
+
+ /*
+ * Initialize the ACPI boot-time table parser.
+ */
+ if (acpi_table_init_firmware())
+ disable_acpi();
+}
+
/*
* acpi_boot_table_init() and acpi_boot_init()
* called from setup_arch(), always.
@@ -1504,9 +1521,6 @@ static struct dmi_system_id __initdata acpi_dmi_table_late[] = {
* 2. enumerates lapics
* 3. enumerates io-apics
*
- * acpi_table_init() is separate to allow reading SRAT without
- * other side effects.
- *
* side effects of acpi_boot_init:
* acpi_lapic = 1 if LAPIC found
* acpi_ioapic = 1 if IOAPIC found
@@ -1518,22 +1532,12 @@ static struct dmi_system_id __initdata acpi_dmi_table_late[] = {
void __init acpi_boot_table_init(void)
{
- dmi_check_system(acpi_dmi_table);
-
/*
* If acpi_disabled, bail out
*/
if (acpi_disabled)
return;
- /*
- * Initialize the ACPI boot-time table parser.
- */
- if (acpi_table_init_firmware()) {
- disable_acpi();
- return;
- }
-
acpi_table_init_override();
acpi_check_multiple_madt();
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f8ec578..fdb5a26 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1074,6 +1074,12 @@ void __init setup_arch(char **cmdline_p)
memblock_x86_fill();
/*
+ * Parse the ACPI tables from firmware for possible boot-time SMP
+ * configuration.
+ */
+ early_acpi_boot_table_init();
+
+ /*
* The EFI specification says that boot service code won't be called
* after ExitBootServices(). This is, in fact, a lie.
*/
@@ -1130,7 +1136,7 @@ void __init setup_arch(char **cmdline_p)
io_delay_init();
/*
- * Parse the ACPI tables for possible boot-time SMP configuration.
+ * Finish parsing the ACPI tables.
*/
acpi_boot_table_init();
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 44a3e5f..c5e7b2a 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -91,6 +91,7 @@ char * __acpi_map_table (unsigned long phys_addr, unsigned long size);
void __acpi_unmap_table(char *map, unsigned long size);
int early_acpi_boot_init(void);
int acpi_boot_init (void);
+void early_acpi_boot_table_init (void);
void acpi_boot_table_init (void);
int acpi_mps_check (void);
int acpi_numa_init (void);
--
1.7.1
The Linux kernel cannot migrate pages used by the kernel. As a result, hotpluggable
memory used by the kernel cannot be hot-removed. To solve this
problem, the basic idea is to prevent memblock from allocating hotpluggable
memory for the kernel early in boot, and to arrange all hotpluggable memory in
ACPI SRAT (System Resource Affinity Table) as ZONE_MOVABLE when initializing
zones.
In the previous patches, we have marked hotpluggable memory regions with
MEMBLOCK_HOTPLUG flag in memblock.memory.
In this patch, we make memblock skip these hotpluggable memory regions in
the default allocation function.
memblock_find_in_range_node()
|-->for_each_free_mem_range_reverse()
|-->__next_free_mem_range_rev()
The above is the only place where __next_free_mem_range_rev() is used. So
skip hotpluggable memory regions when iterating memblock.memory to find
free memory.
In the later patches, a boot option named "movablenode" will be introduced
to enable/disable using SRAT to arrange ZONE_MOVABLE.
NOTE: This check is always done. That is OK because if users do not specify
the movablenode option, the hotpluggable memory will not be marked. So this
check will not skip any memory, which means the kernel will act as before.
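The skip can be modelled outside the kernel as follows. This is a simplified sketch and not the memblock code: struct toy_region, TOY_HOTPLUG and toy_find_top_down() are hypothetical names, and reserved ranges are ignored entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HOTPLUG 0x1u	/* stand-in for MEMBLOCK_HOTPLUG */

struct toy_region {
	uint64_t base;
	uint64_t size;
	unsigned int flags;
};

/*
 * Walk the regions from the highest index down and return the index of
 * the first one that is not hotpluggable and is large enough, or -1.
 */
static int toy_find_top_down(const struct toy_region *r, int n, uint64_t want)
{
	int i;

	assert(r != NULL || n == 0);
	for (i = n - 1; i >= 0; i--) {
		if (r[i].flags & TOY_HOTPLUG)
			continue;	/* skip hotpluggable memory regions */
		if (r[i].size >= want)
			return i;
	}
	return -1;
}
```

When no region carries the flag, the loop behaves exactly like a plain top-down search, which matches the note that the kernel acts as before without the boot option.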
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
mm/memblock.c | 8 ++++++++
1 files changed, 8 insertions(+), 0 deletions(-)
diff --git a/mm/memblock.c b/mm/memblock.c
index ecd8568..3ea4301 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -695,6 +695,10 @@ void __init_memblock __next_free_mem_range(u64 *idx, int nid,
* @out_nid: ptr to int for nid of the range, can be %NULL
*
* Reverse of __next_free_mem_range().
+ *
+ * Linux kernel cannot migrate pages used by itself. Memory hotplug users won't
+ * be able to hot-remove hotpluggable memory used by the kernel. So this
+ * function skip hotpluggable regions when allocating memory for the kernel.
*/
void __init_memblock __next_free_mem_range_rev(u64 *idx, int nid,
phys_addr_t *out_start,
@@ -719,6 +723,10 @@ void __init_memblock __next_free_mem_range_rev(u64 *idx, int nid,
if (nid != MAX_NUMNODES && nid != memblock_get_region_node(m))
continue;
+ /* skip hotpluggable memory regions */
+ if (m->flags & MEMBLOCK_HOTPLUG)
+ continue;
+
/* scan areas before each reservation for intersection */
for ( ; ri >= 0; ri--) {
struct memblock_region *r = &rsv->regions[ri];
--
1.7.1
In the following patches, we need to call get_ramdisk_{image|size}()
to get the initrd file's address and size. So make these two functions
global.
v1 -> v2:
As tj suggested, make these two functions static inline in
arch/x86/include/asm/setup.h.
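For readers unfamiliar with these helpers: each one combines a 32-bit low half and a 32-bit extension field from boot_params into one 64-bit value. A minimal userspace sketch of that arithmetic, with a hypothetical toy_combine() helper standing in for the boot_params accesses:

```c
#include <assert.h>
#include <stdint.h>

/* Combine a low and a high 32-bit half into one 64-bit value. */
static inline uint64_t toy_combine(uint32_t lo, uint32_t hi)
{
	uint64_t v = lo;

	/* Widen before shifting so the high half is not truncated. */
	v |= (uint64_t)hi << 32;
	return v;
}
```

The cast before the shift matters: shifting a 32-bit value left by 32 is undefined in C, so the high half must be widened first.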
Signed-off-by: Tang Chen <[email protected]>
Reviewed-by: Zhang Yanfei <[email protected]>
---
arch/x86/include/asm/setup.h | 21 +++++++++++++++++++++
arch/x86/kernel/setup.c | 18 ------------------
2 files changed, 21 insertions(+), 18 deletions(-)
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index b7bf350..cfdb55d 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -106,6 +106,27 @@ void *extend_brk(size_t size, size_t align);
RESERVE_BRK(name, sizeof(type) * entries)
extern void probe_roms(void);
+
+#ifdef CONFIG_BLK_DEV_INITRD
+static inline u64 __init get_ramdisk_image(void)
+{
+ u64 ramdisk_image = boot_params.hdr.ramdisk_image;
+
+ ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32;
+
+ return ramdisk_image;
+}
+
+static inline u64 __init get_ramdisk_size(void)
+{
+ u64 ramdisk_size = boot_params.hdr.ramdisk_size;
+
+ ramdisk_size |= (u64)boot_params.ext_ramdisk_size << 32;
+
+ return ramdisk_size;
+}
+#endif /* CONFIG_BLK_DEV_INITRD */
+
#ifdef __i386__
void __init i386_start_kernel(void);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fdb5a26..da44353 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -296,24 +296,6 @@ static void __init reserve_brk(void)
}
#ifdef CONFIG_BLK_DEV_INITRD
-
-static u64 __init get_ramdisk_image(void)
-{
- u64 ramdisk_image = boot_params.hdr.ramdisk_image;
-
- ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32;
-
- return ramdisk_image;
-}
-static u64 __init get_ramdisk_size(void)
-{
- u64 ramdisk_size = boot_params.hdr.ramdisk_size;
-
- ramdisk_size |= (u64)boot_params.ext_ramdisk_size << 32;
-
- return ramdisk_size;
-}
-
#define MAX_MAP_CHUNK (NR_FIX_BTMAPS << PAGE_SHIFT)
static void __init relocate_initrd(void)
{
--
1.7.1
On Wednesday, August 07, 2013 06:51:51 PM Tang Chen wrote:
> This patch-set aims to solve some problems at system boot time
> to enhance memory hotplug functionality.
>
> [Background]
>
> The Linux kernel cannot migrate pages used by the kernel because
> of the kernel direct mapping. Since va = pa + PAGE_OFFSET, if the
> physical address is changed, we cannot simply update the kernel
> pagetable. On the contrary, we have to update all the pointers
> pointing to the virtual address, which is very difficult to do.
>
> In order to do memory hotplug, we should prevent the kernel to use
> hotpluggable memory.
>
> In ACPI, there is a table named SRAT(System Resource Affinity Table).
> It contains system NUMA info (CPUs, memory ranges, PXM), and also a
> flag field indicating which memory ranges are hotpluggable.
>
>
> [Problem to be solved]
>
> At the very early time when the system is booting, we use a bootmem
> allocator, named memblock, to allocate memory for the kernel.
> memblock will start to work before the kernel parse SRAT, which
> means memblock won't know which memory is hotpluggable before SRAT
> is parsed.
>
> So at this time, memblock could allocate hotpluggable memory for
> the kernel to use permanently. For example, the kernel may allocate
> pagetables in hotpluggable memory, which cannot be freed when the
> system is up.
>
> So we have to prevent memblock allocating hotpluggable memory for
> the kernel at the early boot time.
>
>
> [Earlier solutions]
>
> We have tried to parse SRAT earlier, before memblock is ready. To
> do this, we also have to do ACPI_INITRD_TABLE_OVERRIDE earlier.
> Otherwise the override tables won't be able to take effect.
>
> This is not that easy to do because memblock is ready before direct
> mapping is setup. So Yinghai split the ACPI_INITRD_TABLE_OVERRIDE
> procedure into two steps: find and copy. Please refer to the
> following patch-set:
> https://lkml.org/lkml/2013/6/13/587
>
> To this solution, tj gave a lot of comments and the following
> suggestions.
>
>
> [Suggestion from tj]
>
> tj mainly gave the following suggestions:
>
> 1. Necessary reordering is OK, but we should not rely on
> reordering to achieve the goal because it makes the kernel
> too fragile.
>
> 2. Memory allocated to kernel for temporary usage is OK because
> it will be freed when the system is up. Doing relocation
> for permanently allocated hotpluggable memory will make the
> kernel more robust.
>
> 3. Need to enhance memblock to discover and complain if any
> hotpluggable memory is allocated to kernel.
>
> After a long thinking, we choose not to do the relocation for
> the following reasons:
>
> 1. It's easy to find out the allocated hotpluggable memory. But
> memblock will merge the adjoined ranges owned by different users
> and used for different purposes. It's hard to find the owners.
>
> 2. Different memory has different way to be relocated. I think one
> function for each kind of memory will make the code too messy.
>
> 3. Pagetable could be in hotpluggable memory. Relocating pagetable
> is too difficult and risky. We have to update all PUD, PMD pages.
> And also, ACPI_INITRD_TABLE_OVERRIDE and parsing SRAT procedures
> are not long after pagetable is initialized. If we relocate the
> pagetable not long after it was initialized, the code will be
> very ugly.
>
>
> [Solution in this patch-set]
>
> In this patch-set, we still do the reordering, but in a new way.
>
> 1. Improve memblock with flags, so that it is able to differentiate
> memory regions for different usage. And also a MEMBLOCK_HOTPLUG
> flag to mark hotpluggable memory.
>
> 2. When memblock is ready (memblock_x86_fill() is called), initialize
> acpi_gbl_root_table_list, fulfill all the ACPI tables' phys addrs.
> Now, we have all the ACPI tables' phys addrs provided by firmware.
>
> 3. Check if there is a SRAT in initrd file used to override the one
> provided by firmware. If so, get its phys addr.
>
> 4. If no override SRAT in initrd, get the phys addr of the SRAT
> provided by firmware.
>
> Now, we have the phys addr of the to be used SRAT, the one in
> initrd or the one in firmware.
>
> 5. Parse only the memory affinities in SRAT, find out all the
> hotpluggable memory regions and mark them in memblock.memory with
> MEMBLOCK_HOTPLUG flag.
>
> 6. The kernel goes through the current path. Any other related parts,
> such as ACPI_INITRD_TABLE_OVERRIDE path, the current ACPI table
> parsing paths, global variable numa_meminfo, and so on, are not
> modified. They work as before.
>
> 7. Make memblock default allocator skip hotpluggable memory.
>
> 8. Introduce movablenode boot option to allow users to enable
> and disable this functionality.
>
>
> In summary, in order to get hotpluggable memory info as early as possible,
> this patch-set only parses memory affinities in SRAT one more time right
> after memblock is ready, and leaves all the other paths untouched. With
> the hotpluggable memory info, we can arrange hotpluggable memory in
> ZONE_MOVABLE to prevent the kernel from using it.
>
> change log v2 RESEND -> v3:
> 1. As Rafael and Lv Zheng suggested, split acpi global root table list
> initialization procedure into two steps: install and override. And
> do the "install" step earlier.
This looks a bit more manageable than before, but please do one more thing:
Please split all of the ACPICA changes out into separate patches and put those
patches in front of everything else.
The reason is we may need to merge them through upstream ACPICA as the first
step (if they are accepted by the ACPICA maintainers).
Thanks,
Rafael
--
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.