One commit that tried to parse SRAT early got reverted before v3.9-rc1:
| commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
| Author: Tang Chen <[email protected]>
| Date: Fri Feb 22 16:33:44 2013 -0800
|
| acpi, memory-hotplug: parse SRAT before memblock is ready
It broke several things, like the acpi table override and the fallback
paths.
This patchset is a clean implementation that parses the numa info early:
1. Keep the acpi table initrd override working by splitting the finding
from the copying.
Finding is done at the head_32.S and head64.c stage:
in head_32.S, the initrd is accessed in 32bit flat mode via phys addresses;
in head64.c, the initrd is accessed via the kernel low mapping address,
with the help of the #PF-set page table.
Copying is done with early_ioremap just after memblock is set up.
2. Keep the fallback paths working: numaq, ACPI, amd_numa and dummy.
Separate initmem_init into two stages:
early_initmem_init only extracts the numa info early into numa_meminfo;
initmem_init keeps the slit and emulation handling.
3. Keep the other old code flow untouched, like relocate_initrd and
initmem_init.
early_initmem_init takes the old init_mem_mapping position;
it calls early_x86_numa_init and init_mem_mapping for every node.
For 64bit, we avoid having a size limit on the initrd, as relocate_initrd
still runs after init_mem_mapping for all memory.
4. The last patch puts the page table on the local node, so that memory
hotplug will be happy.
In short, early_initmem_init parses the numa info early and calls
init_mem_mapping to set up the page table for every node's memory.
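For orientation, a simplified sketch of the resulting setup_arch()
ordering after the whole series is applied (unrelated steps omitted;
not literal code, see the individual patches for the exact hunks):

	void __init setup_arch(char **cmdline_p)
	{
		/* ... e820 setup, memblock fill ... */

		/* tables were already found from head_32.S/head64.c */
		acpi_initrd_override_copy();	/* memblock ready, early_ioremap() */

		acpi_boot_table_init();
		early_acpi_boot_init();

		/* parse numa info, then init_mem_mapping() per node */
		early_initmem_init();

		reserve_initrd();	/* all memory mapped, no size limit */

		/* ... */

		initmem_init();		/* slit and emulation handling */
	}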
The patchset can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm
and it is based on today's Linus tree.
-v2: Address tj's review and split the patches into small ones.
-v3: Add some Acked-by from tj; also stop abusing cpio_data for the
acpi files info.
-v4: Fix one typo found by Tang Chen.
Also added Tested-by from Thomas Renninger and Tony.
Thanks
Yinghai
Yinghai Lu (22):
x86: Change get_ramdisk_image() to global
x86, microcode: Use common get_ramdisk_image()
x86, ACPI, mm: Kill max_low_pfn_mapped
x86, ACPI: Search buffer above 4G in second try for acpi override
tables
x86, ACPI: Increase override tables number limit
x86, ACPI: Split acpi_initrd_override to find/copy two functions
x86, ACPI: Store override acpi tables phys addr in cpio files info
array
x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode
x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c
x86, mm, numa: Move two functions calling on successful path later
x86, mm, numa: Call numa_meminfo_cover_memory() checking early
x86, mm, numa: Move node_map_pfn alignment() to x86
x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment
x86, mm, numa: Set memblock nid later
x86, mm, numa: Move node_possible_map setting later
x86, mm, numa: Move emulation handling down
x86, ACPI, numa, ia64: split SLIT handling out
x86, mm, numa: Add early_initmem_init() stub
x86, mm: Parse numa info early
x86, mm: Add comments for step_size shift
x86, mm: Make init_mem_mapping be able to be called several times
x86, mm, numa: Put pagetable on local node ram for 64bit
arch/ia64/kernel/setup.c | 4 +-
arch/x86/include/asm/acpi.h | 3 +-
arch/x86/include/asm/page_types.h | 2 +-
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/include/asm/setup.h | 9 ++
arch/x86/kernel/head64.c | 2 +
arch/x86/kernel/head_32.S | 4 +
arch/x86/kernel/microcode_intel_early.c | 8 +-
arch/x86/kernel/setup.c | 86 +++++++-----
arch/x86/mm/init.c | 109 ++++++++++-----
arch/x86/mm/numa.c | 240 +++++++++++++++++++++++++-------
arch/x86/mm/numa_emulation.c | 2 +-
arch/x86/mm/numa_internal.h | 2 +
arch/x86/mm/srat.c | 11 +-
drivers/acpi/numa.c | 13 +-
drivers/acpi/osl.c | 138 ++++++++++++------
include/linux/acpi.h | 20 +--
include/linux/mm.h | 3 -
mm/page_alloc.c | 52 +------
19 files changed, 467 insertions(+), 243 deletions(-)
--
1.8.1.4
Now we only search for the buffer for the override acpi tables under 4G.
In some cases, e.g. when the user uses memmap= to exclude all low ram,
we may not find a range for it under 4G.
Do a second try to search above 4G in that case.
Signed-off-by: Yinghai Lu <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: [email protected]
Tested-by: Thomas Renninger <[email protected]>
---
drivers/acpi/osl.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 313d14d..c08cdb6 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -627,6 +627,10 @@ void __init acpi_initrd_override(void *data, size_t size)
/* under 4G at first, then above 4G */
acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
all_tables_size, PAGE_SIZE);
+ if (!acpi_tables_addr)
+ acpi_tables_addr = memblock_find_in_range(0,
+ ~(phys_addr_t)0,
+ all_tables_size, PAGE_SIZE);
if (!acpi_tables_addr) {
WARN_ON(1);
return;
--
1.8.1.4
Move the node_possible_map handling out of numa_check_memblks() to avoid
the side effect of changing it there.
Set it only once, on the successful path, instead of resetting it in
numa_init() every time.
Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/numa.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e2ddcbd..1d5fa08 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -539,12 +539,13 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
static int __init numa_check_memblks(struct numa_meminfo *mi)
{
+ nodemask_t nodes_parsed;
unsigned long pfn_align;
/* Account for nodes with cpus and no memory */
- node_possible_map = numa_nodes_parsed;
- numa_nodemask_from_meminfo(&node_possible_map, mi);
- if (WARN_ON(nodes_empty(node_possible_map)))
+ nodes_parsed = numa_nodes_parsed;
+ numa_nodemask_from_meminfo(&nodes_parsed, mi);
+ if (WARN_ON(nodes_empty(nodes_parsed)))
return -EINVAL;
if (!numa_meminfo_cover_memory(mi))
@@ -596,7 +597,6 @@ static int __init numa_init(int (*init_func)(void))
set_apicid_to_node(i, NUMA_NO_NODE);
nodes_clear(numa_nodes_parsed);
- nodes_clear(node_possible_map);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
numa_reset_distance();
@@ -672,6 +672,9 @@ void __init x86_numa_init(void)
early_x86_numa_init();
+ node_possible_map = numa_nodes_parsed;
+ numa_nodemask_from_meminfo(&node_possible_map, mi);
+
for (i = 0; i < mi->nr_blks; i++) {
struct numa_memblk *mb = &mi->blk[i];
memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
--
1.8.1.4
numa_emulation() needs to allocate a buffer for the new numa_meminfo and
the distance matrix, so move it down.
This also changes the behavior:
before this patch, if the user passed wrong data on the command line, we
would fall back to the next numa probing method or disable numa;
after this patch, if the user passes wrong data on the command line, we
stay with the numa info probed before, like acpi srat or amd_numa.
We need to call numa_check_memblks to reject wrong user input early,
so keep the original numa_meminfo unchanged.
Signed-off-by: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
---
arch/x86/mm/numa.c | 6 +++---
arch/x86/mm/numa_emulation.c | 2 +-
arch/x86/mm/numa_internal.h | 2 ++
3 files changed, 6 insertions(+), 4 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1d5fa08..90fd123 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -537,7 +537,7 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
}
#endif
-static int __init numa_check_memblks(struct numa_meminfo *mi)
+int __init numa_check_memblks(struct numa_meminfo *mi)
{
nodemask_t nodes_parsed;
unsigned long pfn_align;
@@ -607,8 +607,6 @@ static int __init numa_init(int (*init_func)(void))
if (ret < 0)
return ret;
- numa_emulation(&numa_meminfo, numa_distance_cnt);
-
ret = numa_check_memblks(&numa_meminfo);
if (ret < 0)
return ret;
@@ -672,6 +670,8 @@ void __init x86_numa_init(void)
early_x86_numa_init();
+ numa_emulation(&numa_meminfo, numa_distance_cnt);
+
node_possible_map = numa_nodes_parsed;
numa_nodemask_from_meminfo(&node_possible_map, mi);
diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index dbbbb47..5a0433d 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -348,7 +348,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
if (ret < 0)
goto no_emu;
- if (numa_cleanup_meminfo(&ei) < 0) {
+ if (numa_cleanup_meminfo(&ei) < 0 || numa_check_memblks(&ei) < 0) {
pr_warning("NUMA: Warning: constructed meminfo invalid, disabling emulation\n");
goto no_emu;
}
diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h
index ad86ec9..bb2fbcc 100644
--- a/arch/x86/mm/numa_internal.h
+++ b/arch/x86/mm/numa_internal.h
@@ -21,6 +21,8 @@ void __init numa_reset_distance(void);
void __init x86_numa_init(void);
+int __init numa_check_memblks(struct numa_meminfo *mi);
+
#ifdef CONFIG_NUMA_EMU
void __init numa_emulation(struct numa_meminfo *numa_meminfo,
int numa_dist_cnt);
--
1.8.1.4
To parse the SRAT early, we need to move the acpi table probing earlier.
acpi_initrd_table_override runs before the acpi table probing, so we need
to move it earlier too.
Currently acpi_initrd_table_override runs after init_mem_mapping and
relocate_initrd(), so it can scan the initrd and copy the acpi tables
using the kernel virtual address of the initrd.
Copying needs to happen after memblock is ready, because it needs to
allocate a buffer for the new acpi tables.
So we have to split that function into two: find and copy.
Find should be as early as possible; copy should be after memblock is
ready.
Finding can be done in head_32.S and head64.c, just like the early
microcode scanning. In head_32.S we are in 32bit flat mode, so we don't
need to set up a page table to access the initrd. In head64.c, the
#PF-set page table lets us access the initrd via the kernel low mapping
address.
Copying can be done just after memblock is ready and before probing the
acpi tables, and we need early_ioremap to access the source and target
ranges, as init_mem_mapping has not been called yet.
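A minimal sketch of the resulting two-phase flow (simplified; the real
call sites land in head_32.S/head64.c and setup_arch() over this and
the following patches):

	/*
	 * Phase 1 - find: as early as possible, from head_32.S (32bit
	 * flat mode) or head64.c (#PF-set page table). It only records
	 * where the tables live and allocates nothing.
	 */
	acpi_initrd_override_find(initrd, initrd_size);

	/*
	 * Phase 2 - copy: after memblock is ready and before acpi table
	 * probing. The destination comes from memblock_find_in_range();
	 * source and target are mapped with early_ioremap().
	 */
	acpi_initrd_override_copy();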
While a dummy version of acpi_initrd_override() was defined when
!CONFIG_ACPI_INITRD_TABLE_OVERRIDE, the prototype and dummy version
were conditionalized inside CONFIG_ACPI. This forced setup_arch() to
have its own #ifdefs around acpi_initrd_override() as otherwise build
would fail when !CONFIG_ACPI. Move the prototypes and dummy
implementations of the newly split functions below CONFIG_ACPI block
in acpi.h so that we can do away with #ifdefs in its user.
-v2: Split one patch out, according to tj.
Also don't pass table_nr around.
-v3: Add tj's changelog about moving the prototypes below the
#ifdef CONFIG_ACPI block in acpi.h to avoid #ifdefs in setup.c.
Signed-off-by: Yinghai <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Acked-by: Tejun Heo <[email protected]>
Tested-by: Thomas Renninger <[email protected]>
---
arch/x86/kernel/setup.c | 6 +++---
drivers/acpi/osl.c | 18 +++++++++++++-----
include/linux/acpi.h | 16 ++++++++--------
3 files changed, 24 insertions(+), 16 deletions(-)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index e75c6e6..d0cc176 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1092,9 +1092,9 @@ void __init setup_arch(char **cmdline_p)
reserve_initrd();
-#if defined(CONFIG_ACPI) && defined(CONFIG_BLK_DEV_INITRD)
- acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start);
-#endif
+ acpi_initrd_override_find((void *)initrd_start,
+ initrd_end - initrd_start);
+ acpi_initrd_override_copy();
reserve_crashkernel();
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index a5a9346..21714fb 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -572,14 +572,13 @@ static const char * const table_sigs[] = {
#define ACPI_OVERRIDE_TABLES 64
static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
-void __init acpi_initrd_override(void *data, size_t size)
+void __init acpi_initrd_override_find(void *data, size_t size)
{
- int sig, no, table_nr = 0, total_offset = 0;
+ int sig, no, table_nr = 0;
long offset = 0;
struct acpi_table_header *table;
char cpio_path[32] = "kernel/firmware/acpi/";
struct cpio_data file;
- char *p;
if (data == NULL || size == 0)
return;
@@ -620,7 +619,14 @@ void __init acpi_initrd_override(void *data, size_t size)
acpi_initrd_files[table_nr].size = file.size;
table_nr++;
}
- if (table_nr == 0)
+}
+
+void __init acpi_initrd_override_copy(void)
+{
+ int no, total_offset = 0;
+ char *p;
+
+ if (!all_tables_size)
return;
/* under 4G at first, then above 4G */
@@ -652,9 +658,11 @@ void __init acpi_initrd_override(void *data, size_t size)
* tables one time, we will hit the limit. Need to map table
* one by one during copying.
*/
- for (no = 0; no < table_nr; no++) {
+ for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
phys_addr_t size = acpi_initrd_files[no].size;
+ if (!size)
+ break;
p = early_ioremap(acpi_tables_addr + total_offset, size);
memcpy(p, acpi_initrd_files[no].data, size);
early_iounmap(p, size);
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index bcbdd74..1654a241 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -79,14 +79,6 @@ typedef int (*acpi_tbl_table_handler)(struct acpi_table_header *table);
typedef int (*acpi_tbl_entry_handler)(struct acpi_subtable_header *header,
const unsigned long end);
-#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
-void acpi_initrd_override(void *data, size_t size);
-#else
-static inline void acpi_initrd_override(void *data, size_t size)
-{
-}
-#endif
-
char * __acpi_map_table (unsigned long phys_addr, unsigned long size);
void __acpi_unmap_table(char *map, unsigned long size);
int early_acpi_boot_init(void);
@@ -485,6 +477,14 @@ static inline bool acpi_driver_match_device(struct device *dev,
#endif /* !CONFIG_ACPI */
+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void acpi_initrd_override_find(void *data, size_t size);
+void acpi_initrd_override_copy(void);
+#else
+static inline void acpi_initrd_override_find(void *data, size_t size) { }
+static inline void acpi_initrd_override_copy(void) { }
+#endif
+
#ifdef CONFIG_ACPI
void acpi_os_set_prepare_sleep(int (*func)(u8 sleep_state,
u32 pm1a_ctrl, u32 pm1b_ctrl));
--
1.8.1.4
On 32bit we will find the tables by phys address in 32bit flat mode from
head_32.S, because at that time we don't need to set up a page table to
access the initrd.
For copying we can use early_ioremap() with the phys address directly,
before the mem mapping is set up.
To keep 32bit and 64bit consistent, use phys addresses for both.
-v2: Introduce file_pos to save the phys address, instead of abusing
cpio_data, which tj was not happy with.
Signed-off-by: Yinghai Lu <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Tested-by: Thomas Renninger <[email protected]>
---
drivers/acpi/osl.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 21714fb..ee5c531 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -570,7 +570,11 @@ static const char * const table_sigs[] = {
#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)
#define ACPI_OVERRIDE_TABLES 64
-static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
+struct file_pos {
+ phys_addr_t data;
+ phys_addr_t size;
+};
+static struct file_pos __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
void __init acpi_initrd_override_find(void *data, size_t size)
{
@@ -615,7 +619,7 @@ void __init acpi_initrd_override_find(void *data, size_t size)
table->signature, cpio_path, file.name, table->length);
all_tables_size += table->length;
- acpi_initrd_files[table_nr].data = file.data;
+ acpi_initrd_files[table_nr].data = __pa_nodebug(file.data);
acpi_initrd_files[table_nr].size = file.size;
table_nr++;
}
@@ -624,7 +628,7 @@ void __init acpi_initrd_override_find(void *data, size_t size)
void __init acpi_initrd_override_copy(void)
{
int no, total_offset = 0;
- char *p;
+ char *p, *q;
if (!all_tables_size)
return;
@@ -659,12 +663,15 @@ void __init acpi_initrd_override_copy(void)
* one by one during copying.
*/
for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
+ phys_addr_t addr = acpi_initrd_files[no].data;
phys_addr_t size = acpi_initrd_files[no].size;
if (!size)
break;
+ q = early_ioremap(addr, size);
p = early_ioremap(acpi_tables_addr + total_offset, size);
- memcpy(p, acpi_initrd_files[no].data, size);
+ memcpy(p, q, size);
+ early_iounmap(q, size);
early_iounmap(p, size);
total_offset += size;
}
--
1.8.1.4
Currently the number of acpi tables in the initrd is limited to 10, which
is too small. 64 should be good enough, as we have 35 signatures and
could have several SSDTs.
Two problems in the current code prevent us from increasing that limit:
1. The cpio file info array is put on the stack; as every element is 32
bytes, we could run out of stack if we grow that array to 64 entries.
We can move it off the stack, making it a global in the __initdata
section.
2. early_ioremap can only remap 256K at a time. The current code maps
all the tables in one go; if we increase the limit, the total size
could exceed 256K and early_ioremap would fail (for example, 64 tables
averaging just 8K would already be 512K).
We can map the tables one by one during copying, instead of mapping
them all at once.
-v2: According to tj, split this out into a separate patch; also
rename the array to acpi_initrd_files.
-v3: Add some comments about mapping the tables one by one during
copying, per tj.
Signed-off-by: Yinghai <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Acked-by: Tejun Heo <[email protected]>
Tested-by: Thomas Renninger <[email protected]>
---
drivers/acpi/osl.c | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index c08cdb6..a5a9346 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -569,8 +569,8 @@ static const char * const table_sigs[] = {
#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)
-/* Must not increase 10 or needs code modification below */
-#define ACPI_OVERRIDE_TABLES 10
+#define ACPI_OVERRIDE_TABLES 64
+static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
void __init acpi_initrd_override(void *data, size_t size)
{
@@ -579,7 +579,6 @@ void __init acpi_initrd_override(void *data, size_t size)
struct acpi_table_header *table;
char cpio_path[32] = "kernel/firmware/acpi/";
struct cpio_data file;
- struct cpio_data early_initrd_files[ACPI_OVERRIDE_TABLES];
char *p;
if (data == NULL || size == 0)
@@ -617,8 +616,8 @@ void __init acpi_initrd_override(void *data, size_t size)
table->signature, cpio_path, file.name, table->length);
all_tables_size += table->length;
- early_initrd_files[table_nr].data = file.data;
- early_initrd_files[table_nr].size = file.size;
+ acpi_initrd_files[table_nr].data = file.data;
+ acpi_initrd_files[table_nr].size = file.size;
table_nr++;
}
if (table_nr == 0)
@@ -648,14 +647,19 @@ void __init acpi_initrd_override(void *data, size_t size)
memblock_reserve(acpi_tables_addr, acpi_tables_addr + all_tables_size);
arch_reserve_mem_area(acpi_tables_addr, all_tables_size);
- p = early_ioremap(acpi_tables_addr, all_tables_size);
-
+ /*
+ * early_ioremap only can remap 256k one time. If we map all
+ * tables one time, we will hit the limit. Need to map table
+ * one by one during copying.
+ */
for (no = 0; no < table_nr; no++) {
- memcpy(p + total_offset, early_initrd_files[no].data,
- early_initrd_files[no].size);
- total_offset += early_initrd_files[no].size;
+ phys_addr_t size = acpi_initrd_files[no].size;
+
+ p = early_ioremap(acpi_tables_addr + total_offset, size);
+ memcpy(p, acpi_initrd_files[no].data, size);
+ early_iounmap(p, size);
+ total_offset += size;
}
- early_iounmap(p, all_tables_size);
}
#endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */
--
1.8.1.4
For finding on 32bit, it is easy to access the initrd in 32bit flat
mode, as we don't need to set up a page table.
That is done from head_32.S, and the microcode updating already uses
this trick.
acpi_initrd_override_find needs to use phys addresses to access global
variables.
Pass is_phys into the function, as we can not use the address itself to
decide whether it is a phys or a virtual address on 32bit: the boot
loader could load the initrd above max_low_pfn.
Don't call printk, as it uses global variables; delay the printing until
copying.
Change table_sigs to live on the stack instead, as otherwise it is too
messy to convert the string array to phys addresses and still keep the
offset calculation correct.
Its size is about 36x4 bytes, small enough to sit on the stack.
Also remove the "continue" from the INVALID_TABLE macro to make the code
more readable.
Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Tested-by: Thomas Renninger <[email protected]>
---
arch/x86/kernel/setup.c | 2 +-
drivers/acpi/osl.c | 85 ++++++++++++++++++++++++++++++++++---------------
include/linux/acpi.h | 5 +--
3 files changed, 63 insertions(+), 29 deletions(-)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d0cc176..16a703f 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1093,7 +1093,7 @@ void __init setup_arch(char **cmdline_p)
reserve_initrd();
acpi_initrd_override_find((void *)initrd_start,
- initrd_end - initrd_start);
+ initrd_end - initrd_start, false);
acpi_initrd_override_copy();
reserve_crashkernel();
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index ee5c531..cce92a5 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -551,21 +551,9 @@ u8 __init acpi_table_checksum(u8 *buffer, u32 length)
return sum;
}
-/* All but ACPI_SIG_RSDP and ACPI_SIG_FACS: */
-static const char * const table_sigs[] = {
- ACPI_SIG_BERT, ACPI_SIG_CPEP, ACPI_SIG_ECDT, ACPI_SIG_EINJ,
- ACPI_SIG_ERST, ACPI_SIG_HEST, ACPI_SIG_MADT, ACPI_SIG_MSCT,
- ACPI_SIG_SBST, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_ASF,
- ACPI_SIG_BOOT, ACPI_SIG_DBGP, ACPI_SIG_DMAR, ACPI_SIG_HPET,
- ACPI_SIG_IBFT, ACPI_SIG_IVRS, ACPI_SIG_MCFG, ACPI_SIG_MCHI,
- ACPI_SIG_SLIC, ACPI_SIG_SPCR, ACPI_SIG_SPMI, ACPI_SIG_TCPA,
- ACPI_SIG_UEFI, ACPI_SIG_WAET, ACPI_SIG_WDAT, ACPI_SIG_WDDT,
- ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT,
- ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };
-
/* Non-fatal errors: Affected tables/files are ignored */
#define INVALID_TABLE(x, path, name) \
- { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); continue; }
+ do { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); } while (0)
#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)
@@ -576,17 +564,45 @@ struct file_pos {
};
static struct file_pos __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
-void __init acpi_initrd_override_find(void *data, size_t size)
+/*
+ * acpi_initrd_override_find() is called from head_32.S and head64.c.
+ * The head_32.S calling path is in 32bit flat mode, so we can access
+ * the initrd early without setting up a pagetable or relocating the
+ * initrd. For global variable accesses we need to use phys addresses
+ * instead of kernel virtual addresses, and the table_sigs string array
+ * is put on the stack to avoid that conversion.
+ * Also don't call printk, as it uses global variables.
+ */
+void __init acpi_initrd_override_find(void *data, size_t size, bool is_phys)
{
int sig, no, table_nr = 0;
long offset = 0;
struct acpi_table_header *table;
char cpio_path[32] = "kernel/firmware/acpi/";
struct cpio_data file;
+ struct file_pos *files = acpi_initrd_files;
+ int *all_tables_size_p = &all_tables_size;
+
+ /* All but ACPI_SIG_RSDP and ACPI_SIG_FACS: */
+ char *table_sigs[] = {
+ ACPI_SIG_BERT, ACPI_SIG_CPEP, ACPI_SIG_ECDT, ACPI_SIG_EINJ,
+ ACPI_SIG_ERST, ACPI_SIG_HEST, ACPI_SIG_MADT, ACPI_SIG_MSCT,
+ ACPI_SIG_SBST, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_ASF,
+ ACPI_SIG_BOOT, ACPI_SIG_DBGP, ACPI_SIG_DMAR, ACPI_SIG_HPET,
+ ACPI_SIG_IBFT, ACPI_SIG_IVRS, ACPI_SIG_MCFG, ACPI_SIG_MCHI,
+ ACPI_SIG_SLIC, ACPI_SIG_SPCR, ACPI_SIG_SPMI, ACPI_SIG_TCPA,
+ ACPI_SIG_UEFI, ACPI_SIG_WAET, ACPI_SIG_WDAT, ACPI_SIG_WDDT,
+ ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT,
+ ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };
if (data == NULL || size == 0)
return;
+ if (is_phys) {
+ files = (struct file_pos *)__pa_symbol(acpi_initrd_files);
+ all_tables_size_p = (int *)__pa_symbol(&all_tables_size);
+ }
+
for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
file = find_cpio_data(cpio_path, data, size, &offset);
if (!file.data)
@@ -595,9 +611,12 @@ void __init acpi_initrd_override_find(void *data, size_t size)
data += offset;
size -= offset;
- if (file.size < sizeof(struct acpi_table_header))
- INVALID_TABLE("Table smaller than ACPI header",
+ if (file.size < sizeof(struct acpi_table_header)) {
+ if (!is_phys)
+ INVALID_TABLE("Table smaller than ACPI header",
cpio_path, file.name);
+ continue;
+ }
table = file.data;
@@ -605,22 +624,33 @@ void __init acpi_initrd_override_find(void *data, size_t size)
if (!memcmp(table->signature, table_sigs[sig], 4))
break;
- if (!table_sigs[sig])
- INVALID_TABLE("Unknown signature",
+ if (!table_sigs[sig]) {
+ if (!is_phys)
+ INVALID_TABLE("Unknown signature",
cpio_path, file.name);
- if (file.size != table->length)
- INVALID_TABLE("File length does not match table length",
+ continue;
+ }
+ if (file.size != table->length) {
+ if (!is_phys)
+ INVALID_TABLE("File length does not match table length",
cpio_path, file.name);
- if (acpi_table_checksum(file.data, table->length))
- INVALID_TABLE("Bad table checksum",
+ continue;
+ }
+ if (acpi_table_checksum(file.data, table->length)) {
+ if (!is_phys)
+ INVALID_TABLE("Bad table checksum",
cpio_path, file.name);
+ continue;
+ }
- pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
+ if (!is_phys)
+ pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
table->signature, cpio_path, file.name, table->length);
- all_tables_size += table->length;
- acpi_initrd_files[table_nr].data = __pa_nodebug(file.data);
- acpi_initrd_files[table_nr].size = file.size;
+ (*all_tables_size_p) += table->length;
+ files[table_nr].data = is_phys ? (phys_addr_t)file.data :
+ __pa_nodebug(file.data);
+ files[table_nr].size = file.size;
table_nr++;
}
}
@@ -670,6 +700,9 @@ void __init acpi_initrd_override_copy(void)
break;
q = early_ioremap(addr, size);
p = early_ioremap(acpi_tables_addr + total_offset, size);
+ pr_info("%4.4s ACPI table found in initrd [%#010llx-%#010llx]\n",
+ ((struct acpi_table_header *)q)->signature,
+ (u64)addr, (u64)(addr + size - 1));
memcpy(p, q, size);
early_iounmap(q, size);
early_iounmap(p, size);
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 1654a241..4b943e6 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -478,10 +478,11 @@ static inline bool acpi_driver_match_device(struct device *dev,
#endif /* !CONFIG_ACPI */
#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
-void acpi_initrd_override_find(void *data, size_t size);
+void acpi_initrd_override_find(void *data, size_t size, bool is_phys);
void acpi_initrd_override_copy(void);
#else
-static inline void acpi_initrd_override_find(void *data, size_t size) { }
+static inline void acpi_initrd_override_find(void *data, size_t size,
+ bool is_phys) { }
static inline void acpi_initrd_override_copy(void) { }
#endif
--
1.8.1.4
head64.c can use the #PF-handler-set page table to access the initrd
before the init mem mapping and initrd relocating.
head_32.S can use 32bit flat mode to access the initrd before the init
mem mapping and initrd relocating.
That makes 32bit and 64bit more consistent.
-v2: Use inline functions in the header file instead, according to tj.
Also still need to keep the #ifdef around the head_32.S call site to
avoid a compile error.
-v3: Need to move reserve_initrd() down after acpi_initrd_override_copy(),
to make sure we are using the right address.
Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Tested-by: Thomas Renninger <[email protected]>
---
arch/x86/include/asm/setup.h | 6 ++++++
arch/x86/kernel/head64.c | 2 ++
arch/x86/kernel/head_32.S | 4 ++++
arch/x86/kernel/setup.c | 34 ++++++++++++++++++++++++++++++----
4 files changed, 42 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 4f71d48..6f885b7 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -42,6 +42,12 @@ extern void visws_early_detect(void);
static inline void visws_early_detect(void) { }
#endif
+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void x86_acpi_override_find(void);
+#else
+static inline void x86_acpi_override_find(void) { }
+#endif
+
extern unsigned long saved_video_mode;
extern void reserve_standard_io_resources(void);
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index c5e403f..a31bc63 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -174,6 +174,8 @@ void __init x86_64_start_kernel(char * real_mode_data)
if (console_loglevel == 10)
early_printk("Kernel alive\n");
+ x86_acpi_override_find();
+
clear_page(init_level4_pgt);
/* set init_level4_pgt kernel high mapping*/
init_level4_pgt[511] = early_level4_pgt[511];
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 73afd11..ca08f0e 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -149,6 +149,10 @@ ENTRY(startup_32)
call load_ucode_bsp
#endif
+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+ call x86_acpi_override_find
+#endif
+
/*
* Initialize page tables. This creates a PDE and a set of page
* tables, which are located immediately beyond __brk_base. The variable
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 16a703f..2d29bc0 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -424,6 +424,34 @@ static void __init reserve_initrd(void)
}
#endif /* CONFIG_BLK_DEV_INITRD */
+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void __init x86_acpi_override_find(void)
+{
+ unsigned long ramdisk_image, ramdisk_size;
+ unsigned char *p = NULL;
+
+#ifdef CONFIG_X86_32
+ struct boot_params *boot_params_p;
+
+ /*
+ * 32bit is from head_32.S, and it is 32bit flat mode.
+ * So need to use phys address to access global variables.
+ */
+ boot_params_p = (struct boot_params *)__pa_nodebug(&boot_params);
+ ramdisk_image = get_ramdisk_image(boot_params_p);
+ ramdisk_size = get_ramdisk_size(boot_params_p);
+ p = (unsigned char *)ramdisk_image;
+ acpi_initrd_override_find(p, ramdisk_size, true);
+#else
+ ramdisk_image = get_ramdisk_image(&boot_params);
+ ramdisk_size = get_ramdisk_size(&boot_params);
+ if (ramdisk_image)
+ p = __va(ramdisk_image);
+ acpi_initrd_override_find(p, ramdisk_size, false);
+#endif
+}
+#endif
+
static void __init parse_setup_data(void)
{
struct setup_data *data;
@@ -1090,12 +1118,10 @@ void __init setup_arch(char **cmdline_p)
/* Allocate bigger log buffer */
setup_log_buf(1);
- reserve_initrd();
-
- acpi_initrd_override_find((void *)initrd_start,
- initrd_end - initrd_start, false);
acpi_initrd_override_copy();
+ reserve_initrd();
+
reserve_crashkernel();
vsmp_init();
--
1.8.1.4
Parsing of the numa info has now been separated into two functions.
early_initmem_init() only parses the info into numa_meminfo and
numa_nodes_parsed, and still keeps the numaq, acpi_numa, amd_numa,
dummy fallback sequence working.
SLIT and numa emulation handling are still left in initmem_init().
Call early_initmem_init before init_mem_mapping() to prepare for using
the numa info with it.
Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
---
arch/x86/kernel/setup.c | 24 ++++++++++--------------
1 file changed, 10 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d40e16e..6ef3fa2 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1098,13 +1098,21 @@ void __init setup_arch(char **cmdline_p)
trim_platform_memory_ranges();
trim_low_memory_range();
+ /*
+ * Parse the ACPI tables for possible boot-time SMP configuration.
+ */
+ acpi_initrd_override_copy();
+ acpi_boot_table_init();
+ early_acpi_boot_init();
+ early_initmem_init();
init_mem_mapping();
-
+ memblock.current_limit = get_max_mapped();
early_trap_pf_init();
+ reserve_initrd();
+
setup_real_mode();
- memblock.current_limit = get_max_mapped();
dma_contiguous_reserve(0);
/*
@@ -1118,24 +1126,12 @@ void __init setup_arch(char **cmdline_p)
/* Allocate bigger log buffer */
setup_log_buf(1);
- acpi_initrd_override_copy();
-
- reserve_initrd();
-
reserve_crashkernel();
vsmp_init();
io_delay_init();
- /*
- * Parse the ACPI tables for possible boot-time SMP configuration.
- */
- acpi_boot_table_init();
-
- early_acpi_boot_init();
-
- early_initmem_init();
initmem_init();
memblock_find_dma_reserve();
--
1.8.1.4
For the separation, we need to set the memblock nid later, as doing so
can change the memblock array and possibly double the memblock.memory
array, which needs to allocate a buffer.
Only set the memblock nid once, on the successful path.
Also rename numa_register_memblks to numa_check_memblks(), now that the
code setting the memblock nid has been moved out.
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/numa.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index fcaeba9..e2ddcbd 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -537,10 +537,9 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
}
#endif
-static int __init numa_register_memblks(struct numa_meminfo *mi)
+static int __init numa_check_memblks(struct numa_meminfo *mi)
{
unsigned long pfn_align;
- int i;
/* Account for nodes with cpus and no memory */
node_possible_map = numa_nodes_parsed;
@@ -563,11 +562,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
return -EINVAL;
}
- for (i = 0; i < mi->nr_blks; i++) {
- struct numa_memblk *mb = &mi->blk[i];
- memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
- }
-
return 0;
}
@@ -604,7 +598,6 @@ static int __init numa_init(int (*init_func)(void))
nodes_clear(numa_nodes_parsed);
nodes_clear(node_possible_map);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
- WARN_ON(memblock_set_node(0, ULLONG_MAX, MAX_NUMNODES));
numa_reset_distance();
ret = init_func();
@@ -616,7 +609,7 @@ static int __init numa_init(int (*init_func)(void))
numa_emulation(&numa_meminfo, numa_distance_cnt);
- ret = numa_register_memblks(&numa_meminfo);
+ ret = numa_check_memblks(&numa_meminfo);
if (ret < 0)
return ret;
@@ -679,6 +672,11 @@ void __init x86_numa_init(void)
early_x86_numa_init();
+ for (i = 0; i < mi->nr_blks; i++) {
+ struct numa_memblk *mb = &mi->blk[i];
+ memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
+ }
+
/* Finally register nodes. */
for_each_node_mask(nid, node_possible_map) {
u64 start = PFN_PHYS(max_pfn);
--
1.8.1.4
As requested by hpa, add comments explaining why we choose 5 for the
step size shift.
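For reference, a worked example of the resulting growth, assuming the
initial step_size of PMD_SIZE (2M):

	2M -> 2M << 5 = 64M -> 64M << 5 = 2G -> 2G << 5 = 64G

so only a handful of iterations are needed even on very large machines.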
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 28b294f..2754e45 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -385,8 +385,23 @@ static unsigned long __init init_range_memory_mapping(
return mapped_ram_size;
}
-/* (PUD_SHIFT-PMD_SHIFT)/2 */
-#define STEP_SIZE_SHIFT 5
+static unsigned long __init get_new_step_size(unsigned long step_size)
+{
+ /*
+ * Initial mapped size is PMD_SIZE, aka 2M.
+ * We can not set step_size to be PUD_SIZE, aka 1G, yet.
+ * In the worst case, when a 1G range crosses the 1G boundary and
+ * PG_LEVEL_2M is not set, we will need 1+1+512 pages (aka 2M + 8K)
+ * to map that range with PTEs. Use 5 as the shift for now.
+ */
+ unsigned long new_step_size = step_size << 5;
+
+ if (new_step_size > step_size)
+ step_size = new_step_size;
+
+ return step_size;
+}
+
void __init init_mem_mapping(void)
{
unsigned long end, real_end, start, last_start;
@@ -428,7 +443,7 @@ void __init init_mem_mapping(void)
min_pfn_mapped = last_start >> PAGE_SHIFT;
/* only increase step_size after big range get mapped */
if (new_mapped_ram_size > mapped_ram_size)
- step_size <<= STEP_SIZE_SHIFT;
+ step_size = get_new_step_size(step_size);
mapped_ram_size += new_mapped_ram_size;
}
--
1.8.1.4
We can use numa_meminfo directly instead of the memblock nid.
So we can move the setting of the memblock nid down and only do it once,
on the successful path.
-v2: According to tj, separate the moving out into another patch.
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/numa.c | 30 +++++++++++++++++++-----------
1 file changed, 19 insertions(+), 11 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 24155b2..fcaeba9 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -496,14 +496,18 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
* Returns the determined alignment in pfn's. 0 if there is no alignment
* requirement (single node).
*/
-unsigned long __init node_map_pfn_alignment(void)
+#ifdef NODE_NOT_IN_PAGE_FLAGS
+static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
{
unsigned long accl_mask = 0, last_end = 0;
unsigned long start, end, mask;
int last_nid = -1;
int i, nid;
- for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
+ for (i = 0; i < mi->nr_blks; i++) {
+ start = mi->blk[i].start >> PAGE_SHIFT;
+ end = mi->blk[i].end >> PAGE_SHIFT;
+ nid = mi->blk[i].nid;
if (!start || last_nid < 0 || last_nid == nid) {
last_nid = nid;
last_end = end;
@@ -526,10 +530,16 @@ unsigned long __init node_map_pfn_alignment(void)
/* convert mask to number of pages */
return ~accl_mask + 1;
}
+#else
+static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
+{
+ return 0;
+}
+#endif
static int __init numa_register_memblks(struct numa_meminfo *mi)
{
- unsigned long uninitialized_var(pfn_align);
+ unsigned long pfn_align;
int i;
/* Account for nodes with cpus and no memory */
@@ -541,24 +551,22 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
if (!numa_meminfo_cover_memory(mi))
return -EINVAL;
- for (i = 0; i < mi->nr_blks; i++) {
- struct numa_memblk *mb = &mi->blk[i];
- memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
- }
-
/*
* If sections array is gonna be used for pfn -> nid mapping, check
* whether its granularity is fine enough.
*/
-#ifdef NODE_NOT_IN_PAGE_FLAGS
- pfn_align = node_map_pfn_alignment();
+ pfn_align = node_map_pfn_alignment(mi);
if (pfn_align && pfn_align < PAGES_PER_SECTION) {
printk(KERN_WARNING "Node alignment %LuMB < min %LuMB, rejecting NUMA config\n",
PFN_PHYS(pfn_align) >> 20,
PFN_PHYS(PAGES_PER_SECTION) >> 20);
return -EINVAL;
}
-#endif
+
+ for (i = 0; i < mi->nr_blks; i++) {
+ struct numa_memblk *mb = &mi->blk[i];
+ memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
+ }
return 0;
}
--
1.8.1.4
Prepare to put the page table on local nodes.
Move the call of init_mem_mapping to early_initmem_init.
Rework alloc_low_pages to allocate the page table in the following order:
BRK, local node, low range (see the sketch below).
Still only load_cr3 once, as otherwise we would break xen 64bit again.
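A hedged sketch of that fallback order in alloc_low_pages(); the
alloc_from_brk()/alloc_from_range() helpers are placeholders for
illustration, the real logic is in the init.c hunk below:

	/* 1. BRK, while the brk pgt buffer still has room */
	if (pgt_buf_end + num <= pgt_buf_top && can_use_brk_pgt)
		return alloc_from_brk(num);
	/* 2. the node-local range mapped so far */
	if (local_min_pfn_mapped < local_max_pfn_mapped)
		return alloc_from_range(local_min_pfn_mapped,
					local_max_pfn_mapped);
	/* 3. fall back to the low range */
	return alloc_from_range(low_min_pfn_mapped, low_max_pfn_mapped);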
Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
---
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/kernel/setup.c | 1 -
arch/x86/mm/init.c | 88 ++++++++++++++++++++++++++----------------
arch/x86/mm/numa.c | 24 ++++++++++++
4 files changed, 79 insertions(+), 36 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1e67223..868687c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -621,7 +621,7 @@ static inline int pgd_none(pgd_t pgd)
#ifndef __ASSEMBLY__
extern int direct_gbpages;
-void init_mem_mapping(void);
+void init_mem_mapping(unsigned long begin, unsigned long end);
void early_alloc_pgt_buf(void);
/* local pte updates need not use xchg for locking */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 6ef3fa2..67ef4bc 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1105,7 +1105,6 @@ void __init setup_arch(char **cmdline_p)
acpi_boot_table_init();
early_acpi_boot_init();
early_initmem_init();
- init_mem_mapping();
memblock.current_limit = get_max_mapped();
early_trap_pf_init();
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 2754e45..8a03283 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -24,7 +24,10 @@ static unsigned long __initdata pgt_buf_start;
static unsigned long __initdata pgt_buf_end;
static unsigned long __initdata pgt_buf_top;
-static unsigned long min_pfn_mapped;
+static unsigned long low_min_pfn_mapped;
+static unsigned long low_max_pfn_mapped;
+static unsigned long local_min_pfn_mapped;
+static unsigned long local_max_pfn_mapped;
static bool __initdata can_use_brk_pgt = true;
@@ -52,10 +55,17 @@ __ref void *alloc_low_pages(unsigned int num)
if ((pgt_buf_end + num) > pgt_buf_top || !can_use_brk_pgt) {
unsigned long ret;
- if (min_pfn_mapped >= max_pfn_mapped)
- panic("alloc_low_page: ran out of memory");
- ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
- max_pfn_mapped << PAGE_SHIFT,
+ if (local_min_pfn_mapped >= local_max_pfn_mapped) {
+ if (low_min_pfn_mapped >= low_max_pfn_mapped)
+ panic("alloc_low_page: ran out of memory");
+ ret = memblock_find_in_range(
+ low_min_pfn_mapped << PAGE_SHIFT,
+ low_max_pfn_mapped << PAGE_SHIFT,
+ PAGE_SIZE * num , PAGE_SIZE);
+ } else
+ ret = memblock_find_in_range(
+ local_min_pfn_mapped << PAGE_SHIFT,
+ local_max_pfn_mapped << PAGE_SHIFT,
PAGE_SIZE * num , PAGE_SIZE);
if (!ret)
panic("alloc_low_page: can not alloc memory");
@@ -402,60 +412,75 @@ static unsigned long __init get_new_step_size(unsigned long step_size)
return step_size;
}
-void __init init_mem_mapping(void)
+void __init init_mem_mapping(unsigned long begin, unsigned long end)
{
- unsigned long end, real_end, start, last_start;
+ unsigned long real_end, start, last_start;
unsigned long step_size;
unsigned long addr;
unsigned long mapped_ram_size = 0;
unsigned long new_mapped_ram_size;
+ bool is_low = false;
+
+ if (!begin) {
+ probe_page_size_mask();
+ /* the ISA range is always mapped regardless of memory holes */
+ init_memory_mapping(0, ISA_END_ADDRESS);
+ begin = ISA_END_ADDRESS;
+ is_low = true;
+ }
- probe_page_size_mask();
-
-#ifdef CONFIG_X86_64
- end = max_pfn << PAGE_SHIFT;
-#else
- end = max_low_pfn << PAGE_SHIFT;
-#endif
-
- /* the ISA range is always mapped regardless of memory holes */
- init_memory_mapping(0, ISA_END_ADDRESS);
+ if (begin >= end)
+ return;
/* xen has big range in reserved near end of ram, skip it at first.*/
- addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE, PMD_SIZE);
+ addr = memblock_find_in_range(begin, end, PMD_SIZE, PMD_SIZE);
real_end = addr + PMD_SIZE;
/* step_size need to be small so pgt_buf from BRK could cover it */
step_size = PMD_SIZE;
- max_pfn_mapped = 0; /* will get exact value next */
- min_pfn_mapped = real_end >> PAGE_SHIFT;
+ local_max_pfn_mapped = begin >> PAGE_SHIFT;
+ local_min_pfn_mapped = real_end >> PAGE_SHIFT;
last_start = start = real_end;
- while (last_start > ISA_END_ADDRESS) {
+ while (last_start > begin) {
if (last_start > step_size) {
start = round_down(last_start - 1, step_size);
- if (start < ISA_END_ADDRESS)
- start = ISA_END_ADDRESS;
+ if (start < begin)
+ start = begin;
} else
- start = ISA_END_ADDRESS;
+ start = begin;
new_mapped_ram_size = init_range_memory_mapping(start,
last_start);
+ if ((last_start >> PAGE_SHIFT) > local_max_pfn_mapped)
+ local_max_pfn_mapped = last_start >> PAGE_SHIFT;
+ local_min_pfn_mapped = start >> PAGE_SHIFT;
last_start = start;
- min_pfn_mapped = last_start >> PAGE_SHIFT;
/* only increase step_size after big range get mapped */
if (new_mapped_ram_size > mapped_ram_size)
step_size = get_new_step_size(step_size);
mapped_ram_size += new_mapped_ram_size;
}
- if (real_end < end)
+ if (real_end < end) {
init_range_memory_mapping(real_end, end);
+ if ((end >> PAGE_SHIFT) > local_max_pfn_mapped)
+ local_max_pfn_mapped = end >> PAGE_SHIFT;
+ }
+ if (is_low) {
+ low_min_pfn_mapped = local_min_pfn_mapped;
+ low_max_pfn_mapped = local_max_pfn_mapped;
+ }
+}
+
+#ifndef CONFIG_NUMA
+void __init early_initmem_init(void)
+{
#ifdef CONFIG_X86_64
- if (max_pfn > max_low_pfn) {
- /* can we preseve max_low_pfn ?*/
+ init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+ if (max_pfn > max_low_pfn)
max_low_pfn = max_pfn;
- }
#else
+ init_mem_mapping(0, max_low_pfn << PAGE_SHIFT);
early_ioremap_page_table_range_init();
#endif
@@ -464,11 +489,6 @@ void __init init_mem_mapping(void)
early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
}
-
-#ifndef CONFIG_NUMA
-void __init early_initmem_init(void)
-{
-}
#endif
/*
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index c2d4653..d3eb0c9 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -17,8 +17,10 @@
#include <asm/dma.h>
#include <asm/acpi.h>
#include <asm/amd_nb.h>
+#include <asm/tlbflush.h>
#include "numa_internal.h"
+#include "mm_internal.h"
int __initdata numa_off;
nodemask_t numa_nodes_parsed __initdata;
@@ -668,9 +670,31 @@ static void __init early_x86_numa_init(void)
numa_init(dummy_numa_init);
}
+#ifdef CONFIG_X86_64
+static void __init early_x86_numa_init_mapping(void)
+{
+ init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+ if (max_pfn > max_low_pfn)
+ max_low_pfn = max_pfn;
+}
+#else
+static void __init early_x86_numa_init_mapping(void)
+{
+ init_mem_mapping(0, max_low_pfn << PAGE_SHIFT);
+ early_ioremap_page_table_range_init();
+}
+#endif
+
void __init early_initmem_init(void)
{
early_x86_numa_init();
+
+ early_x86_numa_init_mapping();
+
+ load_cr3(swapper_pg_dir);
+ __flush_tlb_all();
+
+ early_memtest(0, max_pfn_mapped<<PAGE_SHIFT);
}
void __init x86_numa_init(void)
--
1.8.1.4
We need to handle the SLIT later, as it needs to allocate a buffer for
the distance matrix. Also, we do not need the SLIT info before
init_mem_mapping, so move the SLIT parsing later.
x86_acpi_numa_init becomes x86_acpi_numa_init_srat/x86_acpi_numa_init_slit.
It should not break ia64: acpi_numa_init is replaced there with
acpi_numa_init_srat/acpi_numa_init_slit/acpi_numa_arch_fixup.
-v2: Change the names to acpi_numa_init_srat/acpi_numa_init_slit,
according to tj.
Remove the numa_reset_distance() call in numa_init(), as we now only
set the distance in the slit handling.
Signed-off-by: Yinghai Lu <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: [email protected]
Tested-by: Tony Luck <[email protected]>
---
arch/ia64/kernel/setup.c | 4 +++-
arch/x86/include/asm/acpi.h | 3 ++-
arch/x86/mm/numa.c | 14 ++++++++++++--
arch/x86/mm/srat.c | 11 +++++++----
drivers/acpi/numa.c | 13 +++++++------
include/linux/acpi.h | 3 ++-
6 files changed, 33 insertions(+), 15 deletions(-)
diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
index 2029cc0..6a2efb5 100644
--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -558,7 +558,9 @@ setup_arch (char **cmdline_p)
acpi_table_init();
early_acpi_boot_init();
# ifdef CONFIG_ACPI_NUMA
- acpi_numa_init();
+ acpi_numa_init_srat();
+ acpi_numa_init_slit();
+ acpi_numa_arch_fixup();
# ifdef CONFIG_ACPI_HOTPLUG_CPU
prefill_possible_map();
# endif
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index b31bf97..651db0b 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -178,7 +178,8 @@ static inline void disable_acpi(void) { }
#ifdef CONFIG_ACPI_NUMA
extern int acpi_numa;
-extern int x86_acpi_numa_init(void);
+int x86_acpi_numa_init_srat(void);
+void x86_acpi_numa_init_slit(void);
#endif /* CONFIG_ACPI_NUMA */
#define acpi_unlazy_tlb(x) leave_mm(x)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 90fd123..182e085 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -598,7 +598,6 @@ static int __init numa_init(int (*init_func)(void))
nodes_clear(numa_nodes_parsed);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
- numa_reset_distance();
ret = init_func();
if (ret < 0)
@@ -636,6 +635,10 @@ static int __init dummy_numa_init(void)
return 0;
}
+#ifdef CONFIG_ACPI_NUMA
+static bool srat_used __initdata;
+#endif
+
/**
* x86_numa_init - Initialize NUMA
*
@@ -651,8 +654,10 @@ static void __init early_x86_numa_init(void)
return;
#endif
#ifdef CONFIG_ACPI_NUMA
- if (!numa_init(x86_acpi_numa_init))
+ if (!numa_init(x86_acpi_numa_init_srat)) {
+ srat_used = true;
return;
+ }
#endif
#ifdef CONFIG_AMD_NUMA
if (!numa_init(amd_numa_init))
@@ -670,6 +675,11 @@ void __init x86_numa_init(void)
early_x86_numa_init();
+#ifdef CONFIG_ACPI_NUMA
+ if (srat_used)
+ x86_acpi_numa_init_slit();
+#endif
+
numa_emulation(&numa_meminfo, numa_distance_cnt);
node_possible_map = numa_nodes_parsed;
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index cdd0da9..443f9ef 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -185,14 +185,17 @@ out_err:
return -1;
}
-void __init acpi_numa_arch_fixup(void) {}
-
-int __init x86_acpi_numa_init(void)
+int __init x86_acpi_numa_init_srat(void)
{
int ret;
- ret = acpi_numa_init();
+ ret = acpi_numa_init_srat();
if (ret < 0)
return ret;
return srat_disabled() ? -EINVAL : 0;
}
+
+void __init x86_acpi_numa_init_slit(void)
+{
+ acpi_numa_init_slit();
+}
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 33e609f..6460db4 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -282,7 +282,7 @@ acpi_table_parse_srat(enum acpi_srat_type id,
handler, max_entries);
}
-int __init acpi_numa_init(void)
+int __init acpi_numa_init_srat(void)
{
int cnt = 0;
@@ -303,11 +303,6 @@ int __init acpi_numa_init(void)
NR_NODE_MEMBLKS);
}
- /* SLIT: System Locality Information Table */
- acpi_table_parse(ACPI_SIG_SLIT, acpi_parse_slit);
-
- acpi_numa_arch_fixup();
-
if (cnt < 0)
return cnt;
else if (!parsed_numa_memblks)
@@ -315,6 +310,12 @@ int __init acpi_numa_init(void)
return 0;
}
+void __init acpi_numa_init_slit(void)
+{
+ /* SLIT: System Locality Information Table */
+ acpi_table_parse(ACPI_SIG_SLIT, acpi_parse_slit);
+}
+
int acpi_get_pxm(acpi_handle h)
{
unsigned long long pxm;
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 4b943e6..4a78235 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -85,7 +85,8 @@ int early_acpi_boot_init(void);
int acpi_boot_init (void);
void acpi_boot_table_init (void);
int acpi_mps_check (void);
-int acpi_numa_init (void);
+int acpi_numa_init_srat(void);
+void acpi_numa_init_slit(void);
int acpi_table_init (void);
int acpi_table_parse(char *id, acpi_tbl_table_handler handler);
--
1.8.1.4
early_initmem_init() calls early_x86_numa_init() to parse the numa info
early. A later patch will call init_mem_mapping for the nodes in it.
Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
---
arch/x86/include/asm/page_types.h | 1 +
arch/x86/kernel/setup.c | 1 +
arch/x86/mm/init.c | 6 ++++++
arch/x86/mm/numa.c | 7 +++++--
4 files changed, 13 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index b012b82..d04dd8c 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -55,6 +55,7 @@ bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
extern unsigned long init_memory_mapping(unsigned long start,
unsigned long end);
+void early_initmem_init(void);
extern void initmem_init(void);
#endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 2d29bc0..d40e16e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1135,6 +1135,7 @@ void __init setup_arch(char **cmdline_p)
early_acpi_boot_init();
+ early_initmem_init();
initmem_init();
memblock_find_dma_reserve();
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index abcc241..28b294f 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -450,6 +450,12 @@ void __init init_mem_mapping(void)
early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
}
+#ifndef CONFIG_NUMA
+void __init early_initmem_init(void)
+{
+}
+#endif
+
/*
* devmem_is_allowed() checks to see if /dev/mem access to a certain address
* is valid. The argument is a physical page number.
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 182e085..c2d4653 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -668,13 +668,16 @@ static void __init early_x86_numa_init(void)
numa_init(dummy_numa_init);
}
+void __init early_initmem_init(void)
+{
+ early_x86_numa_init();
+}
+
void __init x86_numa_init(void)
{
int i, nid;
struct numa_meminfo *mi = &numa_meminfo;
- early_x86_numa_init();
-
#ifdef CONFIG_ACPI_NUMA
if (srat_used)
x86_acpi_numa_init_slit();
--
1.8.1.4
Now we have the arch_pfn_mapped array, and max_low_pfn_mapped should not
be used anymore.
Users should use arch_pfn_mapped or just 1UL<<(32-PAGE_SHIFT) instead.
The only user left is ACPI_INITRD_TABLE_OVERRIDE, and it should not use
max_low_pfn_mapped anyway, as the later access is done with
early_ioremap(). We can change it to search under 4G directly, i.e.
below 1ULL<<32.
-v2: Leave max_low_pfn_mapped alone in the i915 code, according to tj.
Suggested-by: H. Peter Anvin <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: [email protected]
Tested-by: Thomas Renninger <[email protected]>
---
arch/x86/include/asm/page_types.h | 1 -
arch/x86/kernel/setup.c | 4 +---
arch/x86/mm/init.c | 4 ----
drivers/acpi/osl.c | 6 +++---
4 files changed, 4 insertions(+), 11 deletions(-)
diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index 54c9787..b012b82 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -43,7 +43,6 @@
extern int devmem_is_allowed(unsigned long pagenr);
-extern unsigned long max_low_pfn_mapped;
extern unsigned long max_pfn_mapped;
static inline phys_addr_t get_max_mapped(void)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 1629577..e75c6e6 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -113,13 +113,11 @@
#include <asm/prom.h>
/*
- * max_low_pfn_mapped: highest direct mapped pfn under 4GB
- * max_pfn_mapped: highest direct mapped pfn over 4GB
+ * max_pfn_mapped: highest direct mapped pfn
*
* The direct mapping only covers E820_RAM regions, so the ranges and gaps are
* represented by pfn_mapped
*/
-unsigned long max_low_pfn_mapped;
unsigned long max_pfn_mapped;
#ifdef CONFIG_DMI
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 59b7fc4..abcc241 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -313,10 +313,6 @@ static void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
nr_pfn_mapped = clean_sort_range(pfn_mapped, E820_X_MAX);
max_pfn_mapped = max(max_pfn_mapped, end_pfn);
-
- if (start_pfn < (1UL<<(32-PAGE_SHIFT)))
- max_low_pfn_mapped = max(max_low_pfn_mapped,
- min(end_pfn, 1UL<<(32-PAGE_SHIFT)));
}
bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 586e7e9..313d14d 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -624,9 +624,9 @@ void __init acpi_initrd_override(void *data, size_t size)
if (table_nr == 0)
return;
- acpi_tables_addr =
- memblock_find_in_range(0, max_low_pfn_mapped << PAGE_SHIFT,
- all_tables_size, PAGE_SIZE);
+ /* under 4G at first, then above 4G */
+ acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
+ all_tables_size, PAGE_SIZE);
if (!acpi_tables_addr) {
WARN_ON(1);
return;
--
1.8.1.4
If a node with ram is hotpluggable, the local node mem for page tables
and vmemmap should be on that node's ram.
This patch is some kind of refresh of
| commit 1411e0ec3123ae4c4ead6bfc9fe3ee5a3ae5c327
| Date: Mon Dec 27 16:48:17 2010 -0800
|
| x86-64, numa: Put pgtable to local node memory
That was reverted before; we have a reason to reintroduce it now, to
make memory hotplug work.
Call init_mem_mapping in early_initmem_init for every node.
alloc_low_pages will allocate the page table in the following order:
BRK, local node, low range
So the page table will be on the low range or on local nodes.
Consecutive numa_meminfo blocks with the same nid are merged, so each
node gets one init_mem_mapping() call covering its whole range.
Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
---
arch/x86/mm/numa.c | 34 +++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index d3eb0c9..11acdf6 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -673,7 +673,39 @@ static void __init early_x86_numa_init(void)
#ifdef CONFIG_X86_64
static void __init early_x86_numa_init_mapping(void)
{
- init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+ unsigned long last_start = 0, last_end = 0;
+ struct numa_meminfo *mi = &numa_meminfo;
+ unsigned long start, end;
+ int last_nid = -1;
+ int i, nid;
+
+ for (i = 0; i < mi->nr_blks; i++) {
+ nid = mi->blk[i].nid;
+ start = mi->blk[i].start;
+ end = mi->blk[i].end;
+
+ if (last_nid == nid) {
+ last_end = end;
+ continue;
+ }
+
+ /* other nid now */
+ if (last_nid >= 0) {
+ printk(KERN_DEBUG "Node %d: [mem %#016lx-%#016lx]\n",
+ last_nid, last_start, last_end - 1);
+ init_mem_mapping(last_start, last_end);
+ }
+
+ /* for next nid */
+ last_nid = nid;
+ last_start = start;
+ last_end = end;
+ }
+ /* last one */
+ printk(KERN_DEBUG "Node %d: [mem %#016lx-%#016lx]\n",
+ last_nid, last_start, last_end - 1);
+ init_mem_mapping(last_start, last_end);
+
if (max_pfn > max_low_pfn)
max_low_pfn = max_pfn;
}
--
1.8.1.4
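To make the ordering concrete, here is a toy, self-contained sketch of
the BRK, local node, low range fallback described above. The pools and
the helper below are made up purely for illustration; the real logic is
the alloc_low_pages() rework in arch/x86/mm/init.c, not this code.

#include <stdio.h>

/*
 * Toy model of the allocation order described above: BRK first, then
 * RAM on the local node, then the low range. Pool sizes are invented
 * for illustration only.
 */
static unsigned long brk_left = 2;            /* pages left in the BRK area */
static unsigned long node_left[2] = { 0, 3 }; /* free pages per node        */
static unsigned long low_left = 64;           /* pages left under 4G        */

static const char *alloc_pgt_page(int nid)
{
	if (brk_left)       { brk_left--;       return "BRK"; }
	if (node_left[nid]) { node_left[nid]--; return "local node"; }
	if (low_left)       { low_left--;       return "low range"; }
	return "out of memory";
}

int main(void)
{
	int i;

	for (i = 0; i < 6; i++)
		printf("page table page %d for node 1 from: %s\n",
		       i, alloc_pgt_page(1));
	return 0;
}

Built as an ordinary userspace program, it hands out two pages from BRK,
then three from the local node, then falls back to the low range, which
is exactly the order the patch wants for page table pages.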
We need to have NUMA info ready before init_mem_mapping(), so that we
can call init_mem_mapping() per node and can also trim node memory
ranges to a large alignment.
The current NUMA parsing needs to allocate buffers, so it has to run
after init_mem_mapping().
So split NUMA info parsing into two stages; the early stage runs before
init_mem_mapping() and must not need to allocate buffers (a toy model of
this constraint follows the patch below).
In the end we will have early_initmem_init() and initmem_init().
This patch is the first step of that separation.
setup_node_data() and numa_init_array() are only called on the
successful path, so we can move those calls into x86_numa_init(). That
also makes numa_init() smaller and more readable.
-v2: remove the node_online_map clearing in numa_init(), as it is only
set by setup_node_data(), at the end of the successful path.
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/numa.c | 69 ++++++++++++++++++++++++++++++------------------------
1 file changed, 39 insertions(+), 30 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 72fe01e..d545638 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -480,7 +480,7 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
static int __init numa_register_memblks(struct numa_meminfo *mi)
{
unsigned long uninitialized_var(pfn_align);
- int i, nid;
+ int i;
/* Account for nodes with cpus and no memory */
node_possible_map = numa_nodes_parsed;
@@ -509,24 +509,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
if (!numa_meminfo_cover_memory(mi))
return -EINVAL;
- /* Finally register nodes. */
- for_each_node_mask(nid, node_possible_map) {
- u64 start = PFN_PHYS(max_pfn);
- u64 end = 0;
-
- for (i = 0; i < mi->nr_blks; i++) {
- if (nid != mi->blk[i].nid)
- continue;
- start = min(mi->blk[i].start, start);
- end = max(mi->blk[i].end, end);
- }
-
- if (start < end)
- setup_node_data(nid, start, end);
- }
-
- /* Dump memblock with node info and return. */
- memblock_dump_all();
return 0;
}
@@ -562,7 +544,6 @@ static int __init numa_init(int (*init_func)(void))
nodes_clear(numa_nodes_parsed);
nodes_clear(node_possible_map);
- nodes_clear(node_online_map);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
WARN_ON(memblock_set_node(0, ULLONG_MAX, MAX_NUMNODES));
numa_reset_distance();
@@ -580,15 +561,6 @@ static int __init numa_init(int (*init_func)(void))
if (ret < 0)
return ret;
- for (i = 0; i < nr_cpu_ids; i++) {
- int nid = early_cpu_to_node(i);
-
- if (nid == NUMA_NO_NODE)
- continue;
- if (!node_online(nid))
- numa_clear_node(i);
- }
- numa_init_array();
return 0;
}
@@ -621,7 +593,7 @@ static int __init dummy_numa_init(void)
* last fallback is dummy single node config encomapssing whole memory and
* never fails.
*/
-void __init x86_numa_init(void)
+static void __init early_x86_numa_init(void)
{
if (!numa_off) {
#ifdef CONFIG_X86_NUMAQ
@@ -641,6 +613,43 @@ void __init x86_numa_init(void)
numa_init(dummy_numa_init);
}
+void __init x86_numa_init(void)
+{
+ int i, nid;
+ struct numa_meminfo *mi = &numa_meminfo;
+
+ early_x86_numa_init();
+
+ /* Finally register nodes. */
+ for_each_node_mask(nid, node_possible_map) {
+ u64 start = PFN_PHYS(max_pfn);
+ u64 end = 0;
+
+ for (i = 0; i < mi->nr_blks; i++) {
+ if (nid != mi->blk[i].nid)
+ continue;
+ start = min(mi->blk[i].start, start);
+ end = max(mi->blk[i].end, end);
+ }
+
+ if (start < end)
+ setup_node_data(nid, start, end); /* online is set */
+ }
+
+ /* Dump memblock with node info */
+ memblock_dump_all();
+
+ for (i = 0; i < nr_cpu_ids; i++) {
+ int nid = early_cpu_to_node(i);
+
+ if (nid == NUMA_NO_NODE)
+ continue;
+ if (!node_online(nid))
+ numa_clear_node(i);
+ }
+ numa_init_array();
+}
+
static __init int find_near_online_node(int node)
{
int n, val;
--
1.8.1.4
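The constraint driving this split, that the early stage must not
allocate while the late stage may, can be modeled in a few lines.
Everything below is a hypothetical stand-in written for illustration,
not the kernel code:

#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/*
 * Toy model of the two-stage split: stage 1 (early parsing) runs
 * before init_mem_mapping() and must not allocate buffers; stage 2
 * (node registration) runs afterwards and may allocate. All names
 * here are invented stand-ins.
 */
static int allocations_allowed;

static void *buffer_alloc(size_t size)
{
	assert(allocations_allowed);  /* stage 1 must never get here */
	return malloc(size);
}

static void early_numa_parse(void)
{
	printf("stage 1: fill numa_meminfo, no allocations\n");
}

static void register_nodes(void)
{
	void *node_data = buffer_alloc(64); /* like setup_node_data() */

	printf("stage 2: node data at %p\n", node_data);
	free(node_data);
}

int main(void)
{
	early_numa_parse();       /* early_initmem_init() time */
	allocations_allowed = 1;  /* after init_mem_mapping()  */
	register_nodes();         /* initmem_init() time       */
	return 0;
}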
Use the common get_ramdisk_image() to get the ramdisk start physical
address. We need this to get the correct ramdisk address for a 64-bit
bzImage, whose initrd can be loaded above 4G by kexec-tools.
-v2: fix one typo found by Tang Chen
Signed-off-by: Yinghai Lu <[email protected]>
Cc: Fenghua Yu <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Tested-by: Thomas Renninger <[email protected]>
---
arch/x86/kernel/microcode_intel_early.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/microcode_intel_early.c b/arch/x86/kernel/microcode_intel_early.c
index d893e8e..6a0054a 100644
--- a/arch/x86/kernel/microcode_intel_early.c
+++ b/arch/x86/kernel/microcode_intel_early.c
@@ -742,8 +742,8 @@ load_ucode_intel_bsp(void)
struct boot_params *boot_params_p;
boot_params_p = (struct boot_params *)__pa_nodebug(&boot_params);
- ramdisk_image = boot_params_p->hdr.ramdisk_image;
- ramdisk_size = boot_params_p->hdr.ramdisk_size;
+ ramdisk_image = get_ramdisk_image(boot_params_p);
+ ramdisk_size = get_ramdisk_size(boot_params_p);
initrd_start_early = ramdisk_image;
initrd_end_early = initrd_start_early + ramdisk_size;
@@ -752,8 +752,8 @@ load_ucode_intel_bsp(void)
(unsigned long *)__pa_nodebug(&mc_saved_in_initrd),
initrd_start_early, initrd_end_early, &uci);
#else
- ramdisk_image = boot_params.hdr.ramdisk_image;
- ramdisk_size = boot_params.hdr.ramdisk_size;
+ ramdisk_image = get_ramdisk_image(&boot_params);
+ ramdisk_size = get_ramdisk_size(&boot_params);
initrd_start_early = ramdisk_image + PAGE_OFFSET;
initrd_end_early = initrd_start_early + ramdisk_size;
--
1.8.1.4
We need to use get_ramdisk_image() for early microcode updating in
another file, so change it to a global function.
Also make it take a boot_params pointer, as head_32.S needs to access
boot_params via its physical address while in 32-bit flat mode. (A
standalone illustration of the address assembly follows the patch.)
Signed-off-by: Yinghai Lu <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Tested-by: Thomas Renninger <[email protected]>
---
arch/x86/include/asm/setup.h | 3 +++
arch/x86/kernel/setup.c | 28 ++++++++++++++--------------
2 files changed, 17 insertions(+), 14 deletions(-)
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index b7bf350..4f71d48 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -106,6 +106,9 @@ void *extend_brk(size_t size, size_t align);
RESERVE_BRK(name, sizeof(type) * entries)
extern void probe_roms(void);
+u64 get_ramdisk_image(struct boot_params *bp);
+u64 get_ramdisk_size(struct boot_params *bp);
+
#ifdef __i386__
void __init i386_start_kernel(void);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 90d8cc9..1629577 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -300,19 +300,19 @@ static void __init reserve_brk(void)
#ifdef CONFIG_BLK_DEV_INITRD
-static u64 __init get_ramdisk_image(void)
+u64 __init get_ramdisk_image(struct boot_params *bp)
{
- u64 ramdisk_image = boot_params.hdr.ramdisk_image;
+ u64 ramdisk_image = bp->hdr.ramdisk_image;
- ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32;
+ ramdisk_image |= (u64)bp->ext_ramdisk_image << 32;
return ramdisk_image;
}
-static u64 __init get_ramdisk_size(void)
+u64 __init get_ramdisk_size(struct boot_params *bp)
{
- u64 ramdisk_size = boot_params.hdr.ramdisk_size;
+ u64 ramdisk_size = bp->hdr.ramdisk_size;
- ramdisk_size |= (u64)boot_params.ext_ramdisk_size << 32;
+ ramdisk_size |= (u64)bp->ext_ramdisk_size << 32;
return ramdisk_size;
}
@@ -321,8 +321,8 @@ static u64 __init get_ramdisk_size(void)
static void __init relocate_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = get_ramdisk_image();
- u64 ramdisk_size = get_ramdisk_size();
+ u64 ramdisk_image = get_ramdisk_image(&boot_params);
+ u64 ramdisk_size = get_ramdisk_size(&boot_params);
u64 area_size = PAGE_ALIGN(ramdisk_size);
u64 ramdisk_here;
unsigned long slop, clen, mapaddr;
@@ -361,8 +361,8 @@ static void __init relocate_initrd(void)
ramdisk_size -= clen;
}
- ramdisk_image = get_ramdisk_image();
- ramdisk_size = get_ramdisk_size();
+ ramdisk_image = get_ramdisk_image(&boot_params);
+ ramdisk_size = get_ramdisk_size(&boot_params);
printk(KERN_INFO "Move RAMDISK from [mem %#010llx-%#010llx] to"
" [mem %#010llx-%#010llx]\n",
ramdisk_image, ramdisk_image + ramdisk_size - 1,
@@ -372,8 +372,8 @@ static void __init relocate_initrd(void)
static void __init early_reserve_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = get_ramdisk_image();
- u64 ramdisk_size = get_ramdisk_size();
+ u64 ramdisk_image = get_ramdisk_image(&boot_params);
+ u64 ramdisk_size = get_ramdisk_size(&boot_params);
u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);
if (!boot_params.hdr.type_of_loader ||
@@ -385,8 +385,8 @@ static void __init early_reserve_initrd(void)
static void __init reserve_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = get_ramdisk_image();
- u64 ramdisk_size = get_ramdisk_size();
+ u64 ramdisk_image = get_ramdisk_image(&boot_params);
+ u64 ramdisk_size = get_ramdisk_size(&boot_params);
u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);
u64 mapped_size;
--
1.8.1.4
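As a standalone illustration of what these helpers compute: the legacy
setup header carries only the low 32 bits of the ramdisk address, and
the ext_* fields carry bits 63..32. The field values below are made up
for the example:

#include <stdio.h>
#include <stdint.h>

/*
 * Illustration of get_ramdisk_image()'s address assembly. The legacy
 * setup header field holds the low 32 bits; ext_ramdisk_image supplies
 * bits 63..32, which is what lets kexec-tools load an initrd above 4G.
 * The values below are made up for the example.
 */
int main(void)
{
	uint32_t hdr_ramdisk_image = 0x38000000; /* bits 31..0  */
	uint32_t ext_ramdisk_image = 0x00000001; /* bits 63..32 */

	uint64_t ramdisk_image = hdr_ramdisk_image;
	ramdisk_image |= (uint64_t)ext_ramdisk_image << 32;

	printf("ramdisk at %#llx\n", (unsigned long long)ramdisk_image);
	/* prints 0x138000000 -- an address above 4G */
	return 0;
}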
Move node_map_pfn_alignment() to arch/x86/mm, as it has no other users.
A later patch will update it to use numa_meminfo instead of memblock.
(A standalone demo of the mask computation follows the patch.)
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/numa.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/mm.h | 1 -
mm/page_alloc.c | 50 --------------------------------------------------
3 files changed, 50 insertions(+), 51 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index b7173f6..24155b2 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -477,6 +477,56 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
return true;
}
+/**
+ * node_map_pfn_alignment - determine the maximum internode alignment
+ *
+ * This function should be called after node map is populated and sorted.
+ * It calculates the maximum power of two alignment which can distinguish
+ * all the nodes.
+ *
+ * For example, if all nodes are 1GiB and aligned to 1GiB, the return value
+ * would indicate 1GiB alignment with (1 << (30 - PAGE_SHIFT)). If the
+ * nodes are shifted by 256MiB, 256MiB. Note that if only the last node is
+ * shifted, 1GiB is enough and this function will indicate so.
+ *
+ * This is used to test whether pfn -> nid mapping of the chosen memory
+ * model has fine enough granularity to avoid incorrect mapping for the
+ * populated node map.
+ *
+ * Returns the determined alignment in pfn's. 0 if there is no alignment
+ * requirement (single node).
+ */
+unsigned long __init node_map_pfn_alignment(void)
+{
+ unsigned long accl_mask = 0, last_end = 0;
+ unsigned long start, end, mask;
+ int last_nid = -1;
+ int i, nid;
+
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
+ if (!start || last_nid < 0 || last_nid == nid) {
+ last_nid = nid;
+ last_end = end;
+ continue;
+ }
+
+ /*
+ * Start with a mask granular enough to pin-point to the
+ * start pfn and tick off bits one-by-one until it becomes
+ * too coarse to separate the current node from the last.
+ */
+ mask = ~((1 << __ffs(start)) - 1);
+ while (mask && last_end <= (start & (mask << 1)))
+ mask <<= 1;
+
+ /* accumulate all internode masks */
+ accl_mask |= mask;
+ }
+
+ /* convert mask to number of pages */
+ return ~accl_mask + 1;
+}
+
static int __init numa_register_memblks(struct numa_meminfo *mi)
{
unsigned long uninitialized_var(pfn_align);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 192806c..77a71fb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1322,7 +1322,6 @@ extern void free_initmem(void);
* CONFIG_HAVE_MEMBLOCK_NODE_MAP.
*/
extern void free_area_init_nodes(unsigned long *max_zone_pfn);
-unsigned long node_map_pfn_alignment(void);
extern unsigned long absent_pages_in_range(unsigned long start_pfn,
unsigned long end_pfn);
extern void get_pfn_range_for_nid(unsigned int nid,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 580d919..f368db4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4725,56 +4725,6 @@ static inline void setup_nr_node_ids(void)
}
#endif
-/**
- * node_map_pfn_alignment - determine the maximum internode alignment
- *
- * This function should be called after node map is populated and sorted.
- * It calculates the maximum power of two alignment which can distinguish
- * all the nodes.
- *
- * For example, if all nodes are 1GiB and aligned to 1GiB, the return value
- * would indicate 1GiB alignment with (1 << (30 - PAGE_SHIFT)). If the
- * nodes are shifted by 256MiB, 256MiB. Note that if only the last node is
- * shifted, 1GiB is enough and this function will indicate so.
- *
- * This is used to test whether pfn -> nid mapping of the chosen memory
- * model has fine enough granularity to avoid incorrect mapping for the
- * populated node map.
- *
- * Returns the determined alignment in pfn's. 0 if there is no alignment
- * requirement (single node).
- */
-unsigned long __init node_map_pfn_alignment(void)
-{
- unsigned long accl_mask = 0, last_end = 0;
- unsigned long start, end, mask;
- int last_nid = -1;
- int i, nid;
-
- for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
- if (!start || last_nid < 0 || last_nid == nid) {
- last_nid = nid;
- last_end = end;
- continue;
- }
-
- /*
- * Start with a mask granular enough to pin-point to the
- * start pfn and tick off bits one-by-one until it becomes
- * too coarse to separate the current node from the last.
- */
- mask = ~((1 << __ffs(start)) - 1);
- while (mask && last_end <= (start & (mask << 1)))
- mask <<= 1;
-
- /* accumulate all internode masks */
- accl_mask |= mask;
- }
-
- /* convert mask to number of pages */
- return ~accl_mask + 1;
-}
-
/* Find the lowest pfn for a node */
static unsigned long __init find_min_pfn_for_node(int nid)
{
--
1.8.1.4
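For a concrete feel of the mask computation in node_map_pfn_alignment(),
the following self-contained program runs the same loop over a made-up
layout of two 1GiB nodes with 4KiB pages; __builtin_ctzl() stands in for
the kernel's __ffs(), and the block array is invented for the example:

#include <stdio.h>

/*
 * Self-contained rendering of the node_map_pfn_alignment() loop above,
 * run over a made-up layout: node 0 covers pfns [0, 0x40000) and node 1
 * covers [0x40000, 0x80000), i.e. two 1GiB nodes with 4KiB pages.
 */
struct blk { unsigned long start, end; int nid; };

int main(void)
{
	struct blk blks[] = {
		{ 0x00000, 0x40000, 0 },
		{ 0x40000, 0x80000, 1 },
	};
	unsigned long accl_mask = 0, last_end = 0, mask;
	int last_nid = -1;
	unsigned int i;

	for (i = 0; i < sizeof(blks) / sizeof(blks[0]); i++) {
		unsigned long start = blks[i].start, end = blks[i].end;
		int nid = blks[i].nid;

		if (!start || last_nid < 0 || last_nid == nid) {
			last_nid = nid;
			last_end = end;
			continue;
		}

		/* tick mask bits off until it can no longer separate the
		 * current node's start from the previous node's end */
		mask = ~((1UL << __builtin_ctzl(start)) - 1);
		while (mask && last_end <= (start & (mask << 1)))
			mask <<= 1;

		accl_mask |= mask;
	}

	/* expect 0x40000 pfns, i.e. 1GiB alignment with 4KiB pages */
	printf("alignment: %#lx pfns\n", ~accl_mask + 1);
	return 0;
}

Running it prints alignment: 0x40000 pfns, matching the 1GiB case in the
function's comment block.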
For the separation, we need to set the memblock nid later, because
doing so can change the memblock array, and possibly double the
memblock.memory array, which would need to allocate a buffer.
We do not need the nid in memblock to find absent pages, so we can move
the numa_meminfo_cover_memory() check earlier.
We can also make __absent_pages_in_range() static and use
absent_pages_in_range() directly.
Later we can set the memblock nid just once, on the successful path.
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/numa.c | 7 ++++---
include/linux/mm.h | 2 --
mm/page_alloc.c | 2 +-
3 files changed, 5 insertions(+), 6 deletions(-)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index d545638..b7173f6 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -460,7 +460,7 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
u64 s = mi->blk[i].start >> PAGE_SHIFT;
u64 e = mi->blk[i].end >> PAGE_SHIFT;
numaram += e - s;
- numaram -= __absent_pages_in_range(mi->blk[i].nid, s, e);
+ numaram -= absent_pages_in_range(s, e);
if ((s64)numaram < 0)
numaram = 0;
}
@@ -488,6 +488,9 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
if (WARN_ON(nodes_empty(node_possible_map)))
return -EINVAL;
+ if (!numa_meminfo_cover_memory(mi))
+ return -EINVAL;
+
for (i = 0; i < mi->nr_blks; i++) {
struct numa_memblk *mb = &mi->blk[i];
memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
@@ -506,8 +509,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
return -EINVAL;
}
#endif
- if (!numa_meminfo_cover_memory(mi))
- return -EINVAL;
return 0;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e19ff30..192806c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1323,8 +1323,6 @@ extern void free_initmem(void);
*/
extern void free_area_init_nodes(unsigned long *max_zone_pfn);
unsigned long node_map_pfn_alignment(void);
-unsigned long __absent_pages_in_range(int nid, unsigned long start_pfn,
- unsigned long end_pfn);
extern unsigned long absent_pages_in_range(unsigned long start_pfn,
unsigned long end_pfn);
extern void get_pfn_range_for_nid(unsigned int nid,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fcced7..580d919 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4356,7 +4356,7 @@ static unsigned long __meminit zone_spanned_pages_in_node(int nid,
* Return the number of holes in a range on a node. If nid is MAX_NUMNODES,
* then all holes in the requested range will be accounted for.
*/
-unsigned long __meminit __absent_pages_in_range(int nid,
+static unsigned long __meminit __absent_pages_in_range(int nid,
unsigned long range_start_pfn,
unsigned long range_end_pfn)
{
--
1.8.1.4
----- [email protected] wrote:
> Prepare to put page table on local nodes.
>
> Move calling of init_mem_mapping to early_initmem_init.
>
> Rework alloc_low_pages to alloc page table in following order:
> BRK, local node, low range
>
> Still only load_cr3 one time, otherwise we would break xen 64bit
> again.
I have asked you in the previous iteration of the patch to fix that comment.
Please remove it, as it is misleading: the issue with load_cr3 more than
once has been fixed on the Xen platform.
On Thu, Apr 11, 2013 at 6:05 PM, Konrad Wilk <[email protected]> wrote:
>
> ----- [email protected] wrote:
>
>> [commit message quoted above - snipped]
>
> I have asked you in the previous iteration of the patch to fix that comment.
Maybe it is not clear enough.
>
> Please remove it, as it is misleading: the issue with load_cr3 more than
> once has been fixed on the Xen platform.
Peter, can you remove those two lines?
Or do I need to resend -v5?
Thanks
Yinghai
Please send a replacement patch.
Yinghai Lu <[email protected]> wrote:
>[full quote of the previous message - snipped]
--
Sent from my mobile phone. Please excuse brevity and lack of formatting.
Hi Yinghai,
It has been a long time since this patch-set was sent. I think we need to
do something to push it.
In my understanding, this patch-set does 2 things:
1. Parse NUMA info earlier, plus some improvements for
   ACPI_INITRD_TABLE_OVERRIDE. (patch 1 ~ patch 20)
2. Allocate page tables on the local node at boot time. (patch 21 ~ patch 22)
As you know, the current implementation of memory hot-remove is not
based on putting page tables on the local node. If we put page tables on
the local node at boot time, memory hot-remove won't work as it did before.
I agree that this should be fixed. But we have the following two reasons
to push the "parse NUMA info earlier" part first and improve the
performance later:
1. Patch 21 and patch 22 only affect performance, not functionality.
   I think we can make memory hot-remove work in the kernel first, and
   then improve the performance.
2. Besides putting page tables on the local node at boot time, there are
   many other things to do. I'm working on improving the hot-add code to
   allocate page tables and vmemmap on the local node, and improving the
   hot-remove code to support freeing that kind of memory.
So in order to push this patch-set and the memory hot-remove
functionality, shall we divide this patch-set into 2 steps:
1. Push patch 1 ~ patch 20, and I'll push the remaining memory
   hot-remove work together with it.
2. Merge your "putting page tables on the local node" work with the
   performance improvement work I'm doing, and improve the performance.
What do you think?
BTW, I'm testing your patch-set and will send results next week.
I can also help rebase it if you like.
Thanks. :)
Hi Yinghai, all,
I've tested this patch-set with my following patch-set:
[PATCH v1 00/12] Arrange hotpluggable memory in SRAT as ZONE_MOVABLE.
https://lkml.org/lkml/2013/4/19/94
Using ACPI table override, I overrode the SRAT on my box like this:
[ 0.000000] SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
[ 0.000000] SRAT: Node 0 PXM 0 [mem 0x100000000-0x307ffffff]
[ 0.000000] SRAT: Node 1 PXM 2 [mem 0x308000000-0x583ffffff] Hot Pluggable
[ 0.000000] SRAT: Node 2 PXM 3 [mem 0x584000000-0x7ffffffff] Hot Pluggable
We had 3 nodes: node 0 was not hotpluggable; node 1 and node 2 were
hotpluggable.
And memblock reserved page table pages (with flag 0x1) on the local nodes.
......
[ 0.000000] reserved[0xb] [0x00000307ff0000-0x00000307ff1fff], 0x2000 bytes flags: 0x0
[ 0.000000] reserved[0xc] [0x00000307ff2000-0x00000307ffffff], 0xe000 bytes on node 0 flags: 0x1
[ 0.000000] reserved[0xd] [0x00000583ff7000-0x00000583ffffff], 0x9000 bytes on node 1 flags: 0x1
[ 0.000000] reserved[0xe] [0x000007ffff9000-0x000007ffffffff], 0x7000 bytes on node 2 flags: 0x1
And after some bug fixes, memblock can also reserve hotpluggable memory
with flag 0x2.
......
[ 0.000000] reserved[0xb] [0x00000307ff0000-0x00000307ff1fff], 0x2000 bytes flags: 0x0
[ 0.000000] reserved[0xc] [0x00000307ff2000-0x00000307ffffff], 0xe000 bytes on node 0 flags: 0x1
[ 0.000000] reserved[0xd] [0x00000308000000-0x00000583ff6fff], 0x27bff7000 bytes on node 1 flags: 0x2
[ 0.000000] reserved[0xe] [0x00000583ff7000-0x00000583ffffff], 0x9000 bytes on node 1 flags: 0x1
[ 0.000000] reserved[0xf] [0x00000584000000-0x000007ffff7fff], 0x27bff8000 bytes on node 2 flags: 0x2
[ 0.000000] reserved[0x10] [0x000007ffff8000-0x000007ffffffff], 0x8000 bytes on node 2 flags: 0x1
And that memory is freed to the buddy system when memory initialization
finishes.
So the results:
1. We can parse SRAT earlier correctly.
2. We can override tables correctly.
3. We can put pagetable pages in local node.
4. We can prevent memblock from allocating hotpluggable memory.
5. We can arrange ZONE_MOVABLE using SRAT info.
Known problems:
When we put page table pages on the local node, the memory hot-remove
logic won't work. I'm fixing it now. We need to fix the following:
1. Improve hot-remove to support freeing local node page table pages.
2. Improve hot-add to support putting hot-added page table pages on the
   local node.
3. Do the same for vmemmap and page_cgroup pages.
So I suggest separating the job into 2 parts:
1. Push Yinghai's patch 1 ~ patch 20, without putting page tables on the
   local node, and push my work to use SRAT to arrange ZONE_MOVABLE.
   This way, we can enable memory hotplug in the kernel first.
2. Merge patch 21 and patch 22 into the fixing work I am doing now, and
   push them together when finished.
What do you think?
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Thanks. :)
Hi all,
Could anyone give some suggestions on this patch-set?
Thanks.
On 04/30/2013 03:21 PM, Tang Chen wrote:
> [full test results and proposal quoted verbatim - snipped; see the
> mail above]
Hi Yinghai,
On 04/30/2013 03:21 PM, Tang Chen wrote:
> So I suggest separating the job into 2 parts:
> 1. Push Yinghai's patch 1 ~ patch 20, without putting page tables on the
>    local node, and push my work to use SRAT to arrange ZONE_MOVABLE.
>    This way, we can enable memory hotplug in the kernel first.
> 2. Merge patch 21 and patch 22 into the fixing work I am doing now, and
>    push them together when finished.
>
It has been a long time since this mail, and there was no response. I do
think we should move on and push this patch-set. So if you don't mind,
I'll rebase and push the "parse SRAT earlier" part of this patch-set
first.
Since putting page tables on the local node breaks the memory hot-remove
logic for now, I will drop the "put pagetable in local node" parts and
merge that part into the hot-add and hot-remove fix work.
If you have any thoughts on this patch-set, please let me know.
Thanks. :)
On Thu, May 9, 2013 at 1:54 AM, Tang Chen <[email protected]> wrote:
> Hi Yinghai,
>
>
> On 04/30/2013 03:21 PM, Tang Chen wrote:
>> [2-part suggestion quoted above - snipped]
>
> It has been a long time since this mail and there was no response. I do
> think I
> should move on to push this patch-set. So if you don't mind, I'll rebase and
> push "parse SRAT earlier" part of this patch-set first.
>
> Since putting pagetable in local node will destroy memory hot-remove logic
> for now,
> I will drop "put pagetable in local node" parts now, and merge this part
> into
> the hot-add and hot-remove fix work.
no, no, no, please do not do half-done work.
Do it right, and do it clean.
>
> If you have any thoughts on this patch-set, please let me know.
I talked to HPA, and he will put my patchset into tip/x86/mm after v3.10-rc1.
After that we can work on putting the page table on the local node for
the hot-add path.
Thanks
Yinghai
Hi Yinghai,
On 05/10/2013 02:24 AM, Yinghai Lu wrote:
>>>> [2-part suggestion quoted above - snipped]
>
>> no, no, no, please do not do half-done work.
>>
>> Do it right, and do it clean.
>
I'm not saying I want to do it half-way. Putting page tables on the
local node makes the memory hot-remove path unable to work.
Before removing pages, the kernel first offlines them. If the offline
logic fails, hot-remove cannot proceed. Since your patches put a node's
page tables on the local node at boot time, that memory cannot be
offlined, and consequently it cannot be hot-removed.
The minimum unit of memory online/offline is a block. By default, one
block contains one section, which by default is 128MB. So if part of a
block holds page tables and the rest is movable memory, that block
cannot be offlined, and as a result it cannot be removed.
In order to fix it, we have four possible solutions:
1. Reserve the whole block (128MB) so that no one can use the rest of
   the block, and skip it when offlining memory. When all the other
   blocks are offlined, free the page tables and remove all the memory.
   But we may lose some memory this way; 128MB is a bit much to waste.
2. Migrate the movable pages but keep this block online. Although the
   offline operation fails, it is OK to remove the memory.
   But the offline operation will then always fail, and generally
   speaking there are many reasons offlining can fail, so it is hard to
   detect whether it is actually OK to remove the memory.
3. Migrate user pages and mark this block offline, but let the kernel
   keep using the page tables in it.
   But this would change the semantics of "offline"; I'm not sure we can
   do it this way.
4. Do not allocate page tables on the local node when
   CONFIG_MEMORY_HOTREMOVE is enabled. (I do suggest not putting page
   tables on the local node in the memory hot-remove situation.)
What do you think about these four solutions?
I think I need some advice on this problem from the community. Do you
have any idea how to fix it if we put page tables on the local node?
The memory hotplug folks do want to use memory hot-remove, and I think
for now we should use solution 4 above: when CONFIG_MEMORY_HOTREMOVE is
enabled, do not allocate page tables on the local node.
I'm not trying to do it half-way. Once we fix this problem, we can
allocate page tables on the local node again with
CONFIG_MEMORY_HOTREMOVE enabled.
Please do give some advice or feedback.
>>
>> If you have any thinking of this patch-set, please let me know.
>
> Talked to HPA, and he will put my patchset into tip/x86/mm after v3.10-rc1.
>
> after that we can work on put pagetable on local node for hotadd path.
>
The hot-add path is another problem, but I think the hot-remove path is
more urgent now.
Thanks. :)
Hi Yinghai,
What do you think of the problem and solutions in my previous mail,
quoted below? And can we avoid allocating page tables on the local node
when MEMORY_HOTREMOVE is enabled for now, and do it again once the
problem in the hot-remove path is fixed?
Thanks. :)
On 05/13/2013 10:59 AM, Tang Chen wrote:
> [previous mail with the problem and four solutions quoted in full -
> snipped]
On 05/10/2013 02:24 AM, Yinghai Lu wrote:
......
>> If you have any thinking of this patch-set, please let me know.
>
> Talked to HPA, and he will put my patchset into tip/x86/mm after v3.10-rc1.
>
> after that we can work on put pagetable on local node for hotadd path.
>
Hi,
It is Linux v3.10-rc2 now, but I didn't find this patch-set merged into
tip/x86/mm. Was it merged somewhere else, or do we have another plan to
push it?
By the way, I have done some tests on this patch-set, and the test
results have been sent. Please refer to:
https://lkml.org/lkml/2013/4/30/45
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Thanks. :)
Sorry, just have been swamped since -rc1 came out.
-hpa
Hi HPA,
Would you please tell me if this patch-set has been merged into any tree
or branch?
If not, I'll rebase it onto the latest kernel and resend it. I hope the
rebase will help to push this patch-set forward.
Thanks. :)
On 05/22/2013 01:18 PM, H. Peter Anvin wrote:
> Sorry, just have been swamped since -rc1 came out.
>
> -hpa
>
>
Hi Yinghai, HPA,
On 04/12/2013 08:55 AM, Yinghai Lu wrote:
> Now we have arch_pfn_mapped array, and max_low_pfn_mapped should not
> be used anymore.
I'm rebasing this patch-set onto the latest kernel and improving the
comment, but I couldn't find any "arch_pfn_mapped array" in the kernel.
Would you please tell me what the "arch_pfn_mapped array" is?
Is it the "struct range pfn_mapped[E820_X_MAX];" in arch/x86/mm/init.c?
Thanks. :)
>
> User should use arch_pfn_mapped or just 1UL<<(32-PAGE_SHIFT) instead.
>
> The only user is ACPI_INITRD_TABLE_OVERRIDE, and it should not use that,
> as the later accesses use early_ioremap(). We could change it to use
> 1U<<(32-PAGE_SHIFT), aka under 4G.
>
> -v2: Leave alone max_low_pfn_mapped in i915 code according to tj.
>
> Suggested-by: H. Peter Anvin <[email protected]>
> Signed-off-by: Yinghai Lu <[email protected]>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Cc: Jacob Shin <[email protected]>
> Cc: Pekka Enberg <[email protected]>
> Cc: [email protected]
> Tested-by: Thomas Renninger <[email protected]>
> [diffstat and diff snipped - identical to the patch earlier in this
> thread]