2013-03-10 06:46:07

by Yinghai Lu

Subject: [PATCH v2 00/20] x86, ACPI, numa: Parse numa info early

One commit that tried to parse SRAT early was reverted before v3.9-rc1:

| commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
| Author: Tang Chen <[email protected]>
| Date: Fri Feb 22 16:33:44 2013 -0800
|
| acpi, memory-hotplug: parse SRAT before memblock is ready

It broke several things, such as the acpi table override via initrd and the fallback path.

This patchset is a clean implementation that parses numa info early.
1. Keep the acpi table initrd override working by splitting finding from
   copying.
   Finding is done at the head_32.S and head64.c stage:
   in head_32.S, the initrd is accessed in 32bit flat mode via phys addresses;
   in head64.c, the initrd is accessed via the kernel low mapping address,
   with the help of the #PF-set page tables.
   Copying is done with early_ioremap(), just after memblock is set up.
2. Keep the fallback path working: numaq, ACPI, amd_numa and dummy.
   Separate initmem_init into two stages:
   early_initmem_init() only extracts numa info early into numa_meminfo;
   initmem_init() keeps the SLIT and emulation handling.
3. Keep the other old code flow untouched, like relocate_initrd and
   initmem_init.
   early_initmem_init() takes the old init_mem_mapping position;
   it calls early_x86_numa_init() and init_mem_mapping() for every node.
   For 64bit we avoid having a size limit on the initrd, as relocate_initrd
   still runs after init_mem_mapping for all memory.
4. The last patch tries to put page tables on the local node, so that
   memory hotplug will be happy.

In short, early_initmem_init() parses numa info early and calls
init_mem_mapping() to set up page tables for every node's memory.
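
A rough sketch of the resulting boot flow (function names are taken from
the patches below; this is an outline, not the exact code):

        setup_arch()
                acpi_initrd_override_copy();    /* tables found earlier in head_32.S/head64.c */
                acpi_boot_table_init();
                early_acpi_boot_init();
                early_initmem_init();           /* parse numa info, init_mem_mapping() per node */
                ...
                initmem_init();                 /* SLIT and numa emulation handling */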

The patchset can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm

It is based on today's Linus tree.

-v2: Address tj's review comments and split the patches into smaller ones.

Thanks

Yinghai

Yinghai Lu (20):
x86: Change get_ramdisk_image() to global
x86, microcode: Use common get_ramdisk_image()
x86, ACPI, mm: Kill max_low_pfn_mapped
x86, ACPI: Increase override tables number limit
x86, ACPI: Split acpi_initrd_override to find/copy two functions
x86, ACPI: Store override acpi tables phys addr in cpio files info array
x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode
x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c
x86, mm, numa: Move two functions calling on successful path later
x86, mm, numa: Call numa_meminfo_cover_memory() checking early
x86, mm, numa: Move node_map_pfn alignment() to x86
x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment
x86, mm, numa: Set memblock nid later
x86, mm, numa: Move node_possible_map setting later
x86, mm, numa: Move emulation handling down.
x86, ACPI, numa, ia64: split SLIT handling out
x86, mm, numa: Add early_initmem_init() stub
x86, mm: Parse numa info early
x86, mm: Make init_mem_mapping be able to be called several times
x86, mm, numa: Put pagetable on local node ram for 64bit

arch/ia64/kernel/setup.c | 4 +-
arch/x86/include/asm/acpi.h | 3 +-
arch/x86/include/asm/page_types.h | 2 +-
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/include/asm/setup.h | 9 ++
arch/x86/kernel/head64.c | 2 +
arch/x86/kernel/head_32.S | 4 +
arch/x86/kernel/microcode_intel_early.c | 8 +-
arch/x86/kernel/setup.c | 86 ++++++-----
arch/x86/mm/init.c | 88 +++++++-----
arch/x86/mm/numa.c | 240 ++++++++++++++++++++++++-------
arch/x86/mm/numa_emulation.c | 2 +-
arch/x86/mm/numa_internal.h | 2 +
arch/x86/mm/srat.c | 11 +-
drivers/acpi/numa.c | 13 +-
drivers/acpi/osl.c | 134 +++++++++++------
include/linux/acpi.h | 20 +--
include/linux/mm.h | 3 -
mm/page_alloc.c | 52 +------
19 files changed, 445 insertions(+), 240 deletions(-)

--
1.7.10.4


2013-03-10 06:46:08

by Yinghai Lu

Subject: [PATCH v2 01/20] x86: Change get_ramdisk_image() to global

get_ramdisk_image() is needed by the early microcode updating code in
another file, so change it to global.

Also make it take a boot_params pointer, as head_32.S needs to access
boot_params via its phys address while in 32bit flat mode.
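
For illustration, a minimal sketch of an early caller that runs before
paging is set up (this mirrors the microcode path changed in the next
patch, which passes the physical alias of boot_params):

        struct boot_params *bp = (struct boot_params *)__pa_symbol(&boot_params);
        u64 ramdisk_image = get_ramdisk_image(bp);      /* hdr and ext_* fields combined */
        u64 ramdisk_size = get_ramdisk_size(bp);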

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/include/asm/setup.h | 3 +++
arch/x86/kernel/setup.c | 28 ++++++++++++++--------------
2 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index b7bf350..4f71d48 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -106,6 +106,9 @@ void *extend_brk(size_t size, size_t align);
RESERVE_BRK(name, sizeof(type) * entries)

extern void probe_roms(void);
+u64 get_ramdisk_image(struct boot_params *bp);
+u64 get_ramdisk_size(struct boot_params *bp);
+
#ifdef __i386__

void __init i386_start_kernel(void);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 90d8cc9..1629577 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -300,19 +300,19 @@ static void __init reserve_brk(void)

#ifdef CONFIG_BLK_DEV_INITRD

-static u64 __init get_ramdisk_image(void)
+u64 __init get_ramdisk_image(struct boot_params *bp)
{
- u64 ramdisk_image = boot_params.hdr.ramdisk_image;
+ u64 ramdisk_image = bp->hdr.ramdisk_image;

- ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32;
+ ramdisk_image |= (u64)bp->ext_ramdisk_image << 32;

return ramdisk_image;
}
-static u64 __init get_ramdisk_size(void)
+u64 __init get_ramdisk_size(struct boot_params *bp)
{
- u64 ramdisk_size = boot_params.hdr.ramdisk_size;
+ u64 ramdisk_size = bp->hdr.ramdisk_size;

- ramdisk_size |= (u64)boot_params.ext_ramdisk_size << 32;
+ ramdisk_size |= (u64)bp->ext_ramdisk_size << 32;

return ramdisk_size;
}
@@ -321,8 +321,8 @@ static u64 __init get_ramdisk_size(void)
static void __init relocate_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = get_ramdisk_image();
- u64 ramdisk_size = get_ramdisk_size();
+ u64 ramdisk_image = get_ramdisk_image(&boot_params);
+ u64 ramdisk_size = get_ramdisk_size(&boot_params);
u64 area_size = PAGE_ALIGN(ramdisk_size);
u64 ramdisk_here;
unsigned long slop, clen, mapaddr;
@@ -361,8 +361,8 @@ static void __init relocate_initrd(void)
ramdisk_size -= clen;
}

- ramdisk_image = get_ramdisk_image();
- ramdisk_size = get_ramdisk_size();
+ ramdisk_image = get_ramdisk_image(&boot_params);
+ ramdisk_size = get_ramdisk_size(&boot_params);
printk(KERN_INFO "Move RAMDISK from [mem %#010llx-%#010llx] to"
" [mem %#010llx-%#010llx]\n",
ramdisk_image, ramdisk_image + ramdisk_size - 1,
@@ -372,8 +372,8 @@ static void __init relocate_initrd(void)
static void __init early_reserve_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = get_ramdisk_image();
- u64 ramdisk_size = get_ramdisk_size();
+ u64 ramdisk_image = get_ramdisk_image(&boot_params);
+ u64 ramdisk_size = get_ramdisk_size(&boot_params);
u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);

if (!boot_params.hdr.type_of_loader ||
@@ -385,8 +385,8 @@ static void __init early_reserve_initrd(void)
static void __init reserve_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = get_ramdisk_image();
- u64 ramdisk_size = get_ramdisk_size();
+ u64 ramdisk_image = get_ramdisk_image(&boot_params);
+ u64 ramdisk_size = get_ramdisk_size(&boot_params);
u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);
u64 mapped_size;

--
1.7.10.4

2013-03-10 06:46:36

by Yinghai Lu

Subject: [PATCH v2 05/20] x86, ACPI: Split acpi_initrd_override to find/copy two functions

To parse SRAT early, we need to move the acpi table probing earlier.
acpi_initrd_table_override runs before the acpi table probing, so we need
to move it earlier too.

Currently acpi_initrd_table_override runs after init_mem_mapping and
relocate_initrd(), so it can scan the initrd and copy the acpi tables
using the kernel virtual address of the initrd.
Copying needs to happen after memblock is ready, because it needs to
allocate a buffer for the new acpi tables.

So we have to split that function into separate find and copy functions.
Find should run as early as possible; copy should run after memblock is
ready.

Finding can be done in head_32.S and head64.c, just like the early
microcode scanning. In head_32.S we are in 32bit flat mode, so we don't
need to set up page tables to access the initrd. In head64.c, the #PF-set
page tables let us access the initrd via the kernel low mapping address.

Copying can be done just after memblock is ready and before probing the
acpi tables; we need early_ioremap() to access the source and target
ranges, as init_mem_mapping has not been called yet. The intended
ordering is sketched below.

Also move the two function declarations down to avoid an #ifdef in
setup.c: ACPI_INITRD_TABLE_OVERRIDE depends on ACPI and BLK_DEV_INITRD,
so the declarations can move out from under the #ifdef CONFIG_ACPI
protection.
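
A sketch of the intended split (the exact call sites are in the diff
below):

        /* early: just record where the cpio entries live */
        acpi_initrd_override_find((void *)initrd_start, initrd_end - initrd_start);
        ...
        /* after memblock is ready: allocate, early_ioremap() and memcpy() */
        acpi_initrd_override_copy();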

-v2: Split one patch out according to tj's review;
also don't pass table_nr around.

Signed-off-by: Yinghai <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
---
arch/x86/kernel/setup.c | 6 +++---
drivers/acpi/osl.c | 18 +++++++++++++-----
include/linux/acpi.h | 16 ++++++++--------
3 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index e75c6e6..d0cc176 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1092,9 +1092,9 @@ void __init setup_arch(char **cmdline_p)

reserve_initrd();

-#if defined(CONFIG_ACPI) && defined(CONFIG_BLK_DEV_INITRD)
- acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start);
-#endif
+ acpi_initrd_override_find((void *)initrd_start,
+ initrd_end - initrd_start);
+ acpi_initrd_override_copy();

reserve_crashkernel();

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 8aaf721..d66ae0e 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -572,14 +572,13 @@ static const char * const table_sigs[] = {
#define ACPI_OVERRIDE_TABLES 64
static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];

-void __init acpi_initrd_override(void *data, size_t size)
+void __init acpi_initrd_override_find(void *data, size_t size)
{
- int sig, no, table_nr = 0, total_offset = 0;
+ int sig, no, table_nr = 0;
long offset = 0;
struct acpi_table_header *table;
char cpio_path[32] = "kernel/firmware/acpi/";
struct cpio_data file;
- char *p;

if (data == NULL || size == 0)
return;
@@ -620,7 +619,14 @@ void __init acpi_initrd_override(void *data, size_t size)
acpi_initrd_files[table_nr].size = file.size;
table_nr++;
}
- if (table_nr == 0)
+}
+
+void __init acpi_initrd_override_copy(void)
+{
+ int no, total_offset = 0;
+ char *p;
+
+ if (!all_tables_size)
return;

/* under 4G at first, then above 4G */
@@ -647,9 +653,11 @@ void __init acpi_initrd_override(void *data, size_t size)
memblock_reserve(acpi_tables_addr, acpi_tables_addr + all_tables_size);
arch_reserve_mem_area(acpi_tables_addr, all_tables_size);

- for (no = 0; no < table_nr; no++) {
+ for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
phys_addr_t size = acpi_initrd_files[no].size;

+ if (!size)
+ break;
p = early_ioremap(acpi_tables_addr + total_offset, size);
memcpy(p, acpi_initrd_files[no].data, size);
early_iounmap(p, size);
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index bcbdd74..1654a241 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -79,14 +79,6 @@ typedef int (*acpi_tbl_table_handler)(struct acpi_table_header *table);
typedef int (*acpi_tbl_entry_handler)(struct acpi_subtable_header *header,
const unsigned long end);

-#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
-void acpi_initrd_override(void *data, size_t size);
-#else
-static inline void acpi_initrd_override(void *data, size_t size)
-{
-}
-#endif
-
char * __acpi_map_table (unsigned long phys_addr, unsigned long size);
void __acpi_unmap_table(char *map, unsigned long size);
int early_acpi_boot_init(void);
@@ -485,6 +477,14 @@ static inline bool acpi_driver_match_device(struct device *dev,

#endif /* !CONFIG_ACPI */

+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void acpi_initrd_override_find(void *data, size_t size);
+void acpi_initrd_override_copy(void);
+#else
+static inline void acpi_initrd_override_find(void *data, size_t size) { }
+static inline void acpi_initrd_override_copy(void) { }
+#endif
+
#ifdef CONFIG_ACPI
void acpi_os_set_prepare_sleep(int (*func)(u8 sleep_state,
u32 pm1a_ctrl, u32 pm1b_ctrl));
--
1.7.10.4

2013-03-10 06:46:10

by Yinghai Lu

Subject: [PATCH v2 02/20] x86, microcode: Use common get_ramdisk_image()

Use the common get_ramdisk_image() to get the ramdisk start phys address.

We need this to get the correct ramdisk address for a 64bit bzImage whose
initrd has been loaded above 4G by kexec-tools.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Fenghua Yu <[email protected]>
---
arch/x86/kernel/microcode_intel_early.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/microcode_intel_early.c b/arch/x86/kernel/microcode_intel_early.c
index 7890bc8..a8df75f 100644
--- a/arch/x86/kernel/microcode_intel_early.c
+++ b/arch/x86/kernel/microcode_intel_early.c
@@ -742,8 +742,8 @@ load_ucode_intel_bsp(void)
struct boot_params *boot_params_p;

boot_params_p = (struct boot_params *)__pa_symbol(&boot_params);
- ramdisk_image = boot_params_p->hdr.ramdisk_image;
- ramdisk_size = boot_params_p->hdr.ramdisk_size;
+ ramdisk_image = get_ramdisk_image(boot_params_p);
+ ramdisk_size = get_ramdisk_size(boot_params_p);
initrd_start_early = ramdisk_image;
initrd_end_early = initrd_start_early + ramdisk_size;

@@ -752,8 +752,8 @@ load_ucode_intel_bsp(void)
(unsigned long *)__pa_symbol(&mc_saved_in_initrd),
initrd_start_early, initrd_end_early, &uci);
#else
- ramdisk_image = boot_params.hdr.ramdisk_image;
- ramdisk_size = boot_params.hdr.ramdisk_size;
+ ramdisk_image = get_ramdisk_image(&boot_params);
+ ramdisk_size = get_ramdisk_size(&boot_params);
initrd_start_early = ramdisk_image + PAGE_OFFSET;
initrd_end_early = initrd_start_early + ramdisk_size;

--
1.7.10.4

2013-03-10 06:46:41

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH v2 06/20] x86, ACPI: Store override acpi tables phys addr in cpio files info array

On 32bit we will find the tables via phys addresses while in 32bit flat
mode in head_32.S, because at that point we don't need page tables to
access the initrd.

For copying we can use early_ioremap() with the phys address directly,
before the memory mapping is set up.

To keep 32bit and 64bit consistent, use phys addresses for both.
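
A minimal sketch of the resulting pattern (names as in the diff below):

        /* find time: the initrd is reachable, record the phys address */
        acpi_initrd_files[table_nr].data = (void *)__pa(file.data);

        /* copy time: no mem mapping yet, so map source and target explicitly */
        q = early_ioremap(addr, size);
        p = early_ioremap(acpi_tables_addr + total_offset, size);
        memcpy(p, q, size);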

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
---
drivers/acpi/osl.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index d66ae0e..54bcc37 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -615,7 +615,7 @@ void __init acpi_initrd_override_find(void *data, size_t size)
table->signature, cpio_path, file.name, table->length);

all_tables_size += table->length;
- acpi_initrd_files[table_nr].data = file.data;
+ acpi_initrd_files[table_nr].data = (void *)__pa(file.data);
acpi_initrd_files[table_nr].size = file.size;
table_nr++;
}
@@ -624,7 +624,7 @@ void __init acpi_initrd_override_find(void *data, size_t size)
void __init acpi_initrd_override_copy(void)
{
int no, total_offset = 0;
- char *p;
+ char *p, *q;

if (!all_tables_size)
return;
@@ -654,12 +654,20 @@ void __init acpi_initrd_override_copy(void)
arch_reserve_mem_area(acpi_tables_addr, all_tables_size);

for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
+ /*
+ * Have to use unsigned long, otherwise 32bit spits a warning.
+ * It is OK to use unsigned long here, as the bootloader would
+ * not load the initrd above 4G for a 32bit kernel.
+ */
+ unsigned long addr = (unsigned long)acpi_initrd_files[no].data;
phys_addr_t size = acpi_initrd_files[no].size;

if (!size)
break;
+ q = early_ioremap(addr, size);
p = early_ioremap(acpi_tables_addr + total_offset, size);
- memcpy(p, acpi_initrd_files[no].data, size);
+ memcpy(p, q, size);
+ early_iounmap(q, size);
early_iounmap(p, size);
total_offset += size;
}
--
1.7.10.4

2013-03-10 06:46:55

by Yinghai Lu

Subject: [PATCH v2 09/20] x86, mm, numa: Move two functions calling on successful path later

We need to have numa info ready before init_mem_mapping, so we can call
init_mem_mapping per node and can also trim node memory ranges to a big
alignment.

The current numa parsing needs to allocate some buffers and has to be
called after init_mem_mapping.

So try to split numa info parsing into two stages: the early one runs
before init_mem_mapping and must not need to allocate buffers.

At the end we will have early_initmem_init() and initmem_init().

This is the first patch of the separation.

setup_node_data() and numa_init_array() are only called on the successful
path, so we can move the calls to x86_numa_init(). That will also make
numa_init() small and readable.

-v2: remove the online_node_map clearing in numa_init(), as it is only
set in setup_node_data(), at the end of the successful path.
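
The division of labor after this patch, as a sketch:

        numa_init()             /* parsing and sanity checks only */
        x86_numa_init()         /* setup_node_data(), numa_init_array() and
                                   memblock_dump_all() on the successful path */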

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/numa.c | 69 +++++++++++++++++++++++++++++-----------------------
1 file changed, 39 insertions(+), 30 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 72fe01e..d545638 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -480,7 +480,7 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
static int __init numa_register_memblks(struct numa_meminfo *mi)
{
unsigned long uninitialized_var(pfn_align);
- int i, nid;
+ int i;

/* Account for nodes with cpus and no memory */
node_possible_map = numa_nodes_parsed;
@@ -509,24 +509,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
if (!numa_meminfo_cover_memory(mi))
return -EINVAL;

- /* Finally register nodes. */
- for_each_node_mask(nid, node_possible_map) {
- u64 start = PFN_PHYS(max_pfn);
- u64 end = 0;
-
- for (i = 0; i < mi->nr_blks; i++) {
- if (nid != mi->blk[i].nid)
- continue;
- start = min(mi->blk[i].start, start);
- end = max(mi->blk[i].end, end);
- }
-
- if (start < end)
- setup_node_data(nid, start, end);
- }
-
- /* Dump memblock with node info and return. */
- memblock_dump_all();
return 0;
}

@@ -562,7 +544,6 @@ static int __init numa_init(int (*init_func)(void))

nodes_clear(numa_nodes_parsed);
nodes_clear(node_possible_map);
- nodes_clear(node_online_map);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
WARN_ON(memblock_set_node(0, ULLONG_MAX, MAX_NUMNODES));
numa_reset_distance();
@@ -580,15 +561,6 @@ static int __init numa_init(int (*init_func)(void))
if (ret < 0)
return ret;

- for (i = 0; i < nr_cpu_ids; i++) {
- int nid = early_cpu_to_node(i);
-
- if (nid == NUMA_NO_NODE)
- continue;
- if (!node_online(nid))
- numa_clear_node(i);
- }
- numa_init_array();
return 0;
}

@@ -621,7 +593,7 @@ static int __init dummy_numa_init(void)
* last fallback is dummy single node config encomapssing whole memory and
* never fails.
*/
-void __init x86_numa_init(void)
+static void __init early_x86_numa_init(void)
{
if (!numa_off) {
#ifdef CONFIG_X86_NUMAQ
@@ -641,6 +613,43 @@ void __init x86_numa_init(void)
numa_init(dummy_numa_init);
}

+void __init x86_numa_init(void)
+{
+ int i, nid;
+ struct numa_meminfo *mi = &numa_meminfo;
+
+ early_x86_numa_init();
+
+ /* Finally register nodes. */
+ for_each_node_mask(nid, node_possible_map) {
+ u64 start = PFN_PHYS(max_pfn);
+ u64 end = 0;
+
+ for (i = 0; i < mi->nr_blks; i++) {
+ if (nid != mi->blk[i].nid)
+ continue;
+ start = min(mi->blk[i].start, start);
+ end = max(mi->blk[i].end, end);
+ }
+
+ if (start < end)
+ setup_node_data(nid, start, end); /* online is set */
+ }
+
+ /* Dump memblock with node info */
+ memblock_dump_all();
+
+ for (i = 0; i < nr_cpu_ids; i++) {
+ int nid = early_cpu_to_node(i);
+
+ if (nid == NUMA_NO_NODE)
+ continue;
+ if (!node_online(nid))
+ numa_clear_node(i);
+ }
+ numa_init_array();
+}
+
static __init int find_near_online_node(int node)
{
int n, val;
--
1.7.10.4

2013-03-10 06:46:47

by Yinghai Lu

Subject: [PATCH v2 07/20] x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode

For finding on 32bit, it is easy to access the initrd in 32bit flat
mode, as we don't need to set up page tables.

That path starts from head_32.S, and microcode updating already uses this
trick.

We need to change acpi_initrd_override_find to use phys addresses to
access global variables.

Pass is_phys into the function, as we cannot use the address itself to
decide whether it is a phys or virtual address on 32bit: the boot loader
could load the initrd above max_low_pfn.

Don't call printk, as it uses global variables; delay the printing until
copying.

Change table_sigs to live on the stack instead, otherwise it is too messy
to convert the string array to phys addresses and still keep the offset
calculations correct. That array is about 36x4 bytes, small enough to sit
on the stack.

Also remove the "continue" from the macro to make the code more readable.
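
The pattern for touching a global from the flat-mode path, as a sketch
(the diff below applies it to acpi_initrd_files and all_tables_size):

        int *all_tables_size_p = &all_tables_size;

        if (is_phys)    /* called from head_32.S: use the physical alias */
                all_tables_size_p = (int *)__pa_symbol(&all_tables_size);

        *all_tables_size_p += table->length;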

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
---
arch/x86/kernel/setup.c | 2 +-
drivers/acpi/osl.c | 85 ++++++++++++++++++++++++++++++++---------------
include/linux/acpi.h | 5 +--
3 files changed, 63 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d0cc176..16a703f 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1093,7 +1093,7 @@ void __init setup_arch(char **cmdline_p)
reserve_initrd();

acpi_initrd_override_find((void *)initrd_start,
- initrd_end - initrd_start);
+ initrd_end - initrd_start, false);
acpi_initrd_override_copy();

reserve_crashkernel();
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 54bcc37..611ca9b 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -551,38 +551,54 @@ u8 __init acpi_table_checksum(u8 *buffer, u32 length)
return sum;
}

-/* All but ACPI_SIG_RSDP and ACPI_SIG_FACS: */
-static const char * const table_sigs[] = {
- ACPI_SIG_BERT, ACPI_SIG_CPEP, ACPI_SIG_ECDT, ACPI_SIG_EINJ,
- ACPI_SIG_ERST, ACPI_SIG_HEST, ACPI_SIG_MADT, ACPI_SIG_MSCT,
- ACPI_SIG_SBST, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_ASF,
- ACPI_SIG_BOOT, ACPI_SIG_DBGP, ACPI_SIG_DMAR, ACPI_SIG_HPET,
- ACPI_SIG_IBFT, ACPI_SIG_IVRS, ACPI_SIG_MCFG, ACPI_SIG_MCHI,
- ACPI_SIG_SLIC, ACPI_SIG_SPCR, ACPI_SIG_SPMI, ACPI_SIG_TCPA,
- ACPI_SIG_UEFI, ACPI_SIG_WAET, ACPI_SIG_WDAT, ACPI_SIG_WDDT,
- ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT,
- ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };
-
/* Non-fatal errors: Affected tables/files are ignored */
#define INVALID_TABLE(x, path, name) \
- { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); continue; }
+ do { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); } while (0)

#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)

#define ACPI_OVERRIDE_TABLES 64
static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];

-void __init acpi_initrd_override_find(void *data, size_t size)
+/*
+ * acpi_initrd_override_find() is called from head_32.S and head64.c.
+ * The head_32.S calling path runs in 32bit flat mode, so we can access
+ * the initrd early without setting up page tables or relocating the
+ * initrd. To access global variables we need phys addresses instead of
+ * kernel virtual addresses, so the table_sigs string array is put on
+ * the stack to avoid that conversion for it.
+ * Also don't call printk, as it uses global variables.
+ */
+void __init acpi_initrd_override_find(void *data, size_t size, bool is_phys)
{
int sig, no, table_nr = 0;
long offset = 0;
struct acpi_table_header *table;
char cpio_path[32] = "kernel/firmware/acpi/";
struct cpio_data file;
+ struct cpio_data *files = acpi_initrd_files;
+ int *all_tables_size_p = &all_tables_size;
+
+ /* All but ACPI_SIG_RSDP and ACPI_SIG_FACS: */
+ char *table_sigs[] = {
+ ACPI_SIG_BERT, ACPI_SIG_CPEP, ACPI_SIG_ECDT, ACPI_SIG_EINJ,
+ ACPI_SIG_ERST, ACPI_SIG_HEST, ACPI_SIG_MADT, ACPI_SIG_MSCT,
+ ACPI_SIG_SBST, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_ASF,
+ ACPI_SIG_BOOT, ACPI_SIG_DBGP, ACPI_SIG_DMAR, ACPI_SIG_HPET,
+ ACPI_SIG_IBFT, ACPI_SIG_IVRS, ACPI_SIG_MCFG, ACPI_SIG_MCHI,
+ ACPI_SIG_SLIC, ACPI_SIG_SPCR, ACPI_SIG_SPMI, ACPI_SIG_TCPA,
+ ACPI_SIG_UEFI, ACPI_SIG_WAET, ACPI_SIG_WDAT, ACPI_SIG_WDDT,
+ ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT,
+ ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };

if (data == NULL || size == 0)
return;

+ if (is_phys) {
+ files = (struct cpio_data *)__pa_symbol(acpi_initrd_files);
+ all_tables_size_p = (int *)__pa_symbol(&all_tables_size);
+ }
+
for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
file = find_cpio_data(cpio_path, data, size, &offset);
if (!file.data)
@@ -591,9 +607,12 @@ void __init acpi_initrd_override_find(void *data, size_t size)
data += offset;
size -= offset;

- if (file.size < sizeof(struct acpi_table_header))
- INVALID_TABLE("Table smaller than ACPI header",
+ if (file.size < sizeof(struct acpi_table_header)) {
+ if (!is_phys)
+ INVALID_TABLE("Table smaller than ACPI header",
cpio_path, file.name);
+ continue;
+ }

table = file.data;

@@ -601,22 +620,33 @@ void __init acpi_initrd_override_find(void *data, size_t size)
if (!memcmp(table->signature, table_sigs[sig], 4))
break;

- if (!table_sigs[sig])
- INVALID_TABLE("Unknown signature",
+ if (!table_sigs[sig]) {
+ if (!is_phys)
+ INVALID_TABLE("Unknown signature",
cpio_path, file.name);
- if (file.size != table->length)
- INVALID_TABLE("File length does not match table length",
+ continue;
+ }
+ if (file.size != table->length) {
+ if (!is_phys)
+ INVALID_TABLE("File length does not match table length",
cpio_path, file.name);
- if (acpi_table_checksum(file.data, table->length))
- INVALID_TABLE("Bad table checksum",
+ continue;
+ }
+ if (acpi_table_checksum(file.data, table->length)) {
+ if (!is_phys)
+ INVALID_TABLE("Bad table checksum",
cpio_path, file.name);
+ continue;
+ }

- pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
+ if (!is_phys)
+ pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
table->signature, cpio_path, file.name, table->length);

- all_tables_size += table->length;
- acpi_initrd_files[table_nr].data = (void *)__pa(file.data);
- acpi_initrd_files[table_nr].size = file.size;
+ (*all_tables_size_p) += table->length;
+ files[table_nr].data = is_phys ?
+ file.data : (void *)__pa(file.data);
+ files[table_nr].size = file.size;
table_nr++;
}
}
@@ -666,6 +696,9 @@ void __init acpi_initrd_override_copy(void)
break;
q = early_ioremap(addr, size);
p = early_ioremap(acpi_tables_addr + total_offset, size);
+ pr_info("%4.4s ACPI table found in initrd [%#010llx-%#010llx]\n",
+ ((struct acpi_table_header *)q)->signature,
+ (u64)addr, (u64)(addr + size - 1));
memcpy(p, q, size);
early_iounmap(q, size);
early_iounmap(p, size);
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 1654a241..4b943e6 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -478,10 +478,11 @@ static inline bool acpi_driver_match_device(struct device *dev,
#endif /* !CONFIG_ACPI */

#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
-void acpi_initrd_override_find(void *data, size_t size);
+void acpi_initrd_override_find(void *data, size_t size, bool is_phys);
void acpi_initrd_override_copy(void);
#else
-static inline void acpi_initrd_override_find(void *data, size_t size) { }
+static inline void acpi_initrd_override_find(void *data, size_t size,
+ bool is_phys) { }
static inline void acpi_initrd_override_copy(void) { }
#endif

--
1.7.10.4

2013-03-10 06:47:05

by Yinghai Lu

Subject: [PATCH v2 10/20] x86, mm, numa: Call numa_meminfo_cover_memory() checking early

For the separation, we need to set the memblock nid later, as doing so
could change the memblock array, and possibly double the memblock.memory
array, which would need to allocate a buffer.

We do not need the nid in memblock to find out the absent pages, so we
can move the numa_meminfo_cover_memory() check earlier.

We can also change __absent_pages_in_range() to static and use
absent_pages_in_range() directly.

Later we can set the memblock nid only once, on the successful path.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/numa.c | 7 ++++---
include/linux/mm.h | 2 --
mm/page_alloc.c | 2 +-
3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index d545638..b7173f6 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -460,7 +460,7 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
u64 s = mi->blk[i].start >> PAGE_SHIFT;
u64 e = mi->blk[i].end >> PAGE_SHIFT;
numaram += e - s;
- numaram -= __absent_pages_in_range(mi->blk[i].nid, s, e);
+ numaram -= absent_pages_in_range(s, e);
if ((s64)numaram < 0)
numaram = 0;
}
@@ -488,6 +488,9 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
if (WARN_ON(nodes_empty(node_possible_map)))
return -EINVAL;

+ if (!numa_meminfo_cover_memory(mi))
+ return -EINVAL;
+
for (i = 0; i < mi->nr_blks; i++) {
struct numa_memblk *mb = &mi->blk[i];
memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
@@ -506,8 +509,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
return -EINVAL;
}
#endif
- if (!numa_meminfo_cover_memory(mi))
- return -EINVAL;

return 0;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7acc9dc..2ae2050 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1324,8 +1324,6 @@ extern void free_initmem(void);
*/
extern void free_area_init_nodes(unsigned long *max_zone_pfn);
unsigned long node_map_pfn_alignment(void);
-unsigned long __absent_pages_in_range(int nid, unsigned long start_pfn,
- unsigned long end_pfn);
extern unsigned long absent_pages_in_range(unsigned long start_pfn,
unsigned long end_pfn);
extern void get_pfn_range_for_nid(unsigned int nid,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fcced7..580d919 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4356,7 +4356,7 @@ static unsigned long __meminit zone_spanned_pages_in_node(int nid,
* Return the number of holes in a range on a node. If nid is MAX_NUMNODES,
* then all holes in the requested range will be accounted for.
*/
-unsigned long __meminit __absent_pages_in_range(int nid,
+static unsigned long __meminit __absent_pages_in_range(int nid,
unsigned long range_start_pfn,
unsigned long range_end_pfn)
{
--
1.7.10.4

2013-03-10 06:47:17

by Yinghai Lu

Subject: [PATCH v2 12/20] x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment

We can use numa_meminfo directly instead of the memblock nid.

That lets us move setting the memblock nid down and do it only once,
on the successful path.

-v2: according to tj, split the code movement into a separate patch.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/numa.c | 30 +++++++++++++++++++-----------
1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 24155b2..fcaeba9 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -496,14 +496,18 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
* Returns the determined alignment in pfn's. 0 if there is no alignment
* requirement (single node).
*/
-unsigned long __init node_map_pfn_alignment(void)
+#ifdef NODE_NOT_IN_PAGE_FLAGS
+static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
{
unsigned long accl_mask = 0, last_end = 0;
unsigned long start, end, mask;
int last_nid = -1;
int i, nid;

- for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
+ for (i = 0; i < mi->nr_blks; i++) {
+ start = mi->blk[i].start >> PAGE_SHIFT;
+ end = mi->blk[i].end >> PAGE_SHIFT;
+ nid = mi->blk[i].nid;
if (!start || last_nid < 0 || last_nid == nid) {
last_nid = nid;
last_end = end;
@@ -526,10 +530,16 @@ unsigned long __init node_map_pfn_alignment(void)
/* convert mask to number of pages */
return ~accl_mask + 1;
}
+#else
+static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
+{
+ return 0;
+}
+#endif

static int __init numa_register_memblks(struct numa_meminfo *mi)
{
- unsigned long uninitialized_var(pfn_align);
+ unsigned long pfn_align;
int i;

/* Account for nodes with cpus and no memory */
@@ -541,24 +551,22 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
if (!numa_meminfo_cover_memory(mi))
return -EINVAL;

- for (i = 0; i < mi->nr_blks; i++) {
- struct numa_memblk *mb = &mi->blk[i];
- memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
- }
-
/*
* If sections array is gonna be used for pfn -> nid mapping, check
* whether its granularity is fine enough.
*/
-#ifdef NODE_NOT_IN_PAGE_FLAGS
- pfn_align = node_map_pfn_alignment();
+ pfn_align = node_map_pfn_alignment(mi);
if (pfn_align && pfn_align < PAGES_PER_SECTION) {
printk(KERN_WARNING "Node alignment %LuMB < min %LuMB, rejecting NUMA config\n",
PFN_PHYS(pfn_align) >> 20,
PFN_PHYS(PAGES_PER_SECTION) >> 20);
return -EINVAL;
}
-#endif
+
+ for (i = 0; i < mi->nr_blks; i++) {
+ struct numa_memblk *mb = &mi->blk[i];
+ memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
+ }

return 0;
}
--
1.7.10.4

2013-03-10 06:47:24

by Yinghai Lu

Subject: [PATCH v2 13/20] x86, mm, numa: Set memblock nid later

For the separation, we need to set the memblock nid later, as doing so
could change the memblock array, and possibly double the memblock.memory
array, which would need to allocate a buffer.

Only set the memblock nid once, on the successful path.

Also rename numa_register_memblks() to numa_check_memblks(), now that
the code for setting the memblock nid has moved out.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/numa.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index fcaeba9..e2ddcbd 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -537,10 +537,9 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
}
#endif

-static int __init numa_register_memblks(struct numa_meminfo *mi)
+static int __init numa_check_memblks(struct numa_meminfo *mi)
{
unsigned long pfn_align;
- int i;

/* Account for nodes with cpus and no memory */
node_possible_map = numa_nodes_parsed;
@@ -563,11 +562,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
return -EINVAL;
}

- for (i = 0; i < mi->nr_blks; i++) {
- struct numa_memblk *mb = &mi->blk[i];
- memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
- }
-
return 0;
}

@@ -604,7 +598,6 @@ static int __init numa_init(int (*init_func)(void))
nodes_clear(numa_nodes_parsed);
nodes_clear(node_possible_map);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
- WARN_ON(memblock_set_node(0, ULLONG_MAX, MAX_NUMNODES));
numa_reset_distance();

ret = init_func();
@@ -616,7 +609,7 @@ static int __init numa_init(int (*init_func)(void))

numa_emulation(&numa_meminfo, numa_distance_cnt);

- ret = numa_register_memblks(&numa_meminfo);
+ ret = numa_check_memblks(&numa_meminfo);
if (ret < 0)
return ret;

@@ -679,6 +672,11 @@ void __init x86_numa_init(void)

early_x86_numa_init();

+ for (i = 0; i < mi->nr_blks; i++) {
+ struct numa_memblk *mb = &mi->blk[i];
+ memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
+ }
+
/* Finally register nodes. */
for_each_node_mask(nid, node_possible_map) {
u64 start = PFN_PHYS(max_pfn);
--
1.7.10.4

2013-03-10 06:47:36

by Yinghai Lu

Subject: [PATCH v2 14/20] x86, mm, numa: Move node_possible_map setting later

Move the node_possible_map handling out of numa_check_memblks() to avoid
side effects in numa_check_memblks().

Set it only once, on the successful path, instead of resetting it in
numa_init() every time.

Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/numa.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e2ddcbd..1d5fa08 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -539,12 +539,13 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)

static int __init numa_check_memblks(struct numa_meminfo *mi)
{
+ nodemask_t nodes_parsed;
unsigned long pfn_align;

/* Account for nodes with cpus and no memory */
- node_possible_map = numa_nodes_parsed;
- numa_nodemask_from_meminfo(&node_possible_map, mi);
- if (WARN_ON(nodes_empty(node_possible_map)))
+ nodes_parsed = numa_nodes_parsed;
+ numa_nodemask_from_meminfo(&nodes_parsed, mi);
+ if (WARN_ON(nodes_empty(nodes_parsed)))
return -EINVAL;

if (!numa_meminfo_cover_memory(mi))
@@ -596,7 +597,6 @@ static int __init numa_init(int (*init_func)(void))
set_apicid_to_node(i, NUMA_NO_NODE);

nodes_clear(numa_nodes_parsed);
- nodes_clear(node_possible_map);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
numa_reset_distance();

@@ -672,6 +672,9 @@ void __init x86_numa_init(void)

early_x86_numa_init();

+ node_possible_map = numa_nodes_parsed;
+ numa_nodemask_from_meminfo(&node_possible_map, mi);
+
for (i = 0; i < mi->nr_blks; i++) {
struct numa_memblk *mb = &mi->blk[i];
memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
--
1.7.10.4

2013-03-10 06:47:42

by Yinghai Lu

Subject: [PATCH v2 15/20] x86, mm, numa: Move emulation handling down.

Emulation needs to allocate buffers for the new numa_meminfo and the
distance matrix, so move it down.

This also changes the behavior:
before this patch, if the user passed bad data on the command line, we
would fall back to the next numa probing method or to disabling numa;
after this patch, if the user passes bad data on the command line, we
stay with the numa info probed before, such as acpi srat or amd_numa.

We need to call numa_check_memblks to reject bad user input early, so
keep the original numa_meminfo unchanged.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
---
arch/x86/mm/numa.c | 6 +++---
arch/x86/mm/numa_emulation.c | 2 +-
arch/x86/mm/numa_internal.h | 2 ++
3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1d5fa08..90fd123 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -537,7 +537,7 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
}
#endif

-static int __init numa_check_memblks(struct numa_meminfo *mi)
+int __init numa_check_memblks(struct numa_meminfo *mi)
{
nodemask_t nodes_parsed;
unsigned long pfn_align;
@@ -607,8 +607,6 @@ static int __init numa_init(int (*init_func)(void))
if (ret < 0)
return ret;

- numa_emulation(&numa_meminfo, numa_distance_cnt);
-
ret = numa_check_memblks(&numa_meminfo);
if (ret < 0)
return ret;
@@ -672,6 +670,8 @@ void __init x86_numa_init(void)

early_x86_numa_init();

+ numa_emulation(&numa_meminfo, numa_distance_cnt);
+
node_possible_map = numa_nodes_parsed;
numa_nodemask_from_meminfo(&node_possible_map, mi);

diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index dbbbb47..5a0433d 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -348,7 +348,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
if (ret < 0)
goto no_emu;

- if (numa_cleanup_meminfo(&ei) < 0) {
+ if (numa_cleanup_meminfo(&ei) < 0 || numa_check_memblks(&ei) < 0) {
pr_warning("NUMA: Warning: constructed meminfo invalid, disabling emulation\n");
goto no_emu;
}
diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h
index ad86ec9..bb2fbcc 100644
--- a/arch/x86/mm/numa_internal.h
+++ b/arch/x86/mm/numa_internal.h
@@ -21,6 +21,8 @@ void __init numa_reset_distance(void);

void __init x86_numa_init(void);

+int __init numa_check_memblks(struct numa_meminfo *mi);
+
#ifdef CONFIG_NUMA_EMU
void __init numa_emulation(struct numa_meminfo *numa_meminfo,
int numa_dist_cnt);
--
1.7.10.4

2013-03-10 06:47:51

by Yinghai Lu

Subject: [PATCH v2 16/20] x86, ACPI, numa, ia64: split SLIT handling out

We need to handle the SLIT later, as it needs to allocate a buffer for
the distance matrix. Also, we do not need the SLIT info before
init_mem_mapping.

So move the SLIT parsing later; the resulting ordering is sketched below.

x86_acpi_numa_init becomes x86_acpi_numa_init_srat/x86_acpi_numa_init_slit.

This should not break ia64: acpi_numa_init is replaced with
acpi_numa_init_srat/acpi_numa_init_slit/acpi_numa_arch_fixup.

-v2: Change the names to acpi_numa_init_srat/acpi_numa_init_slit
according to tj. Remove the numa_reset_distance() call in numa_init(),
as we now only set distances in the SLIT handling.
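
The resulting ordering, as a sketch:

        early_initmem_init()
                early_x86_numa_init()           /* SRAT parsed here; needs no allocations */
        ...
        x86_numa_init()
                x86_acpi_numa_init_slit()       /* SLIT parsed here; the distance
                                                   matrix buffer can be allocated now */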

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: [email protected]
---
arch/ia64/kernel/setup.c | 4 +++-
arch/x86/include/asm/acpi.h | 3 ++-
arch/x86/mm/numa.c | 14 ++++++++++++--
arch/x86/mm/srat.c | 11 +++++++----
drivers/acpi/numa.c | 13 +++++++------
include/linux/acpi.h | 3 ++-
6 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
index 2029cc0..6a2efb5 100644
--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -558,7 +558,9 @@ setup_arch (char **cmdline_p)
acpi_table_init();
early_acpi_boot_init();
# ifdef CONFIG_ACPI_NUMA
- acpi_numa_init();
+ acpi_numa_init_srat();
+ acpi_numa_init_slit();
+ acpi_numa_arch_fixup();
# ifdef CONFIG_ACPI_HOTPLUG_CPU
prefill_possible_map();
# endif
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index b31bf97..651db0b 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -178,7 +178,8 @@ static inline void disable_acpi(void) { }

#ifdef CONFIG_ACPI_NUMA
extern int acpi_numa;
-extern int x86_acpi_numa_init(void);
+int x86_acpi_numa_init_srat(void);
+void x86_acpi_numa_init_slit(void);
#endif /* CONFIG_ACPI_NUMA */

#define acpi_unlazy_tlb(x) leave_mm(x)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 90fd123..182e085 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -598,7 +598,6 @@ static int __init numa_init(int (*init_func)(void))

nodes_clear(numa_nodes_parsed);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
- numa_reset_distance();

ret = init_func();
if (ret < 0)
@@ -636,6 +635,10 @@ static int __init dummy_numa_init(void)
return 0;
}

+#ifdef CONFIG_ACPI_NUMA
+static bool srat_used __initdata;
+#endif
+
/**
* x86_numa_init - Initialize NUMA
*
@@ -651,8 +654,10 @@ static void __init early_x86_numa_init(void)
return;
#endif
#ifdef CONFIG_ACPI_NUMA
- if (!numa_init(x86_acpi_numa_init))
+ if (!numa_init(x86_acpi_numa_init_srat)) {
+ srat_used = true;
return;
+ }
#endif
#ifdef CONFIG_AMD_NUMA
if (!numa_init(amd_numa_init))
@@ -670,6 +675,11 @@ void __init x86_numa_init(void)

early_x86_numa_init();

+#ifdef CONFIG_ACPI_NUMA
+ if (srat_used)
+ x86_acpi_numa_init_slit();
+#endif
+
numa_emulation(&numa_meminfo, numa_distance_cnt);

node_possible_map = numa_nodes_parsed;
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index cdd0da9..443f9ef 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -185,14 +185,17 @@ out_err:
return -1;
}

-void __init acpi_numa_arch_fixup(void) {}
-
-int __init x86_acpi_numa_init(void)
+int __init x86_acpi_numa_init_srat(void)
{
int ret;

- ret = acpi_numa_init();
+ ret = acpi_numa_init_srat();
if (ret < 0)
return ret;
return srat_disabled() ? -EINVAL : 0;
}
+
+void __init x86_acpi_numa_init_slit(void)
+{
+ acpi_numa_init_slit();
+}
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 33e609f..6460db4 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -282,7 +282,7 @@ acpi_table_parse_srat(enum acpi_srat_type id,
handler, max_entries);
}

-int __init acpi_numa_init(void)
+int __init acpi_numa_init_srat(void)
{
int cnt = 0;

@@ -303,11 +303,6 @@ int __init acpi_numa_init(void)
NR_NODE_MEMBLKS);
}

- /* SLIT: System Locality Information Table */
- acpi_table_parse(ACPI_SIG_SLIT, acpi_parse_slit);
-
- acpi_numa_arch_fixup();
-
if (cnt < 0)
return cnt;
else if (!parsed_numa_memblks)
@@ -315,6 +310,12 @@ int __init acpi_numa_init(void)
return 0;
}

+void __init acpi_numa_init_slit(void)
+{
+ /* SLIT: System Locality Information Table */
+ acpi_table_parse(ACPI_SIG_SLIT, acpi_parse_slit);
+}
+
int acpi_get_pxm(acpi_handle h)
{
unsigned long long pxm;
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 4b943e6..4a78235 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -85,7 +85,8 @@ int early_acpi_boot_init(void);
int acpi_boot_init (void);
void acpi_boot_table_init (void);
int acpi_mps_check (void);
-int acpi_numa_init (void);
+int acpi_numa_init_srat(void);
+void acpi_numa_init_slit(void);

int acpi_table_init (void);
int acpi_table_parse(char *id, acpi_tbl_table_handler handler);
--
1.7.10.4

2013-03-10 06:47:55

by Yinghai Lu

Subject: [PATCH v2 17/20] x86, mm, numa: Add early_initmem_init() stub

early_initmem_init() calls early_x86_numa_init() to parse numa info early.

Later it will also call init_mem_mapping for the nodes.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
---
arch/x86/include/asm/page_types.h | 1 +
arch/x86/kernel/setup.c | 1 +
arch/x86/mm/init.c | 6 ++++++
arch/x86/mm/numa.c | 7 +++++--
4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index b012b82..d04dd8c 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -55,6 +55,7 @@ bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
extern unsigned long init_memory_mapping(unsigned long start,
unsigned long end);

+void early_initmem_init(void);
extern void initmem_init(void);

#endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index b067663..626bc9f 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1135,6 +1135,7 @@ void __init setup_arch(char **cmdline_p)

early_acpi_boot_init();

+ early_initmem_init();
initmem_init();
memblock_find_dma_reserve();

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index abcc241..28b294f 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -450,6 +450,12 @@ void __init init_mem_mapping(void)
early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
}

+#ifndef CONFIG_NUMA
+void __init early_initmem_init(void)
+{
+}
+#endif
+
/*
* devmem_is_allowed() checks to see if /dev/mem access to a certain address
* is valid. The argument is a physical page number.
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 182e085..c2d4653 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -668,13 +668,16 @@ static void __init early_x86_numa_init(void)
numa_init(dummy_numa_init);
}

+void __init early_initmem_init(void)
+{
+ early_x86_numa_init();
+}
+
void __init x86_numa_init(void)
{
int i, nid;
struct numa_meminfo *mi = &numa_meminfo;

- early_x86_numa_init();
-
#ifdef CONFIG_ACPI_NUMA
if (srat_used)
x86_acpi_numa_init_slit();
--
1.7.10.4

2013-03-10 06:48:00

by Yinghai Lu

Subject: [PATCH v2 18/20] x86, mm: Parse numa info early

Parsing of numa info has now been separated into two functions.

early_initmem_init() only parses the info into numa_meminfo and
numa_nodes_parsed, and still keeps the numaq, acpi_numa, amd_numa, dummy
fallback sequence working.

The SLIT and numa emulation handling are still left in initmem_init().

Call early_initmem_init before init_mem_mapping() to prepare for using
the numa info with it.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
---
arch/x86/kernel/setup.c | 24 ++++++++++--------------
1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 626bc9f..86e1ec0 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1098,13 +1098,21 @@ void __init setup_arch(char **cmdline_p)
trim_platform_memory_ranges();
trim_low_memory_range();

+ /*
+ * Parse the ACPI tables for possible boot-time SMP configuration.
+ */
+ acpi_initrd_override_copy();
+ acpi_boot_table_init();
+ early_acpi_boot_init();
+ early_initmem_init();
init_mem_mapping();
-
+ memblock.current_limit = get_max_mapped();
early_trap_pf_init();

+ reserve_initrd();
+
setup_real_mode();

- memblock.current_limit = get_max_mapped();
dma_contiguous_reserve(0);

/*
@@ -1118,24 +1126,12 @@ void __init setup_arch(char **cmdline_p)
/* Allocate bigger log buffer */
setup_log_buf(1);

- reserve_initrd();
-
- acpi_initrd_override_copy();
-
reserve_crashkernel();

vsmp_init();

io_delay_init();

- /*
- * Parse the ACPI tables for possible boot-time SMP configuration.
- */
- acpi_boot_table_init();
-
- early_acpi_boot_init();
-
- early_initmem_init();
initmem_init();
memblock_find_dma_reserve();

--
1.7.10.4

2013-03-10 06:48:10

by Yinghai Lu

Subject: [PATCH v2 19/20] x86, mm: Make init_mem_mapping be able to be called several times

Prepare to put page tables on local nodes.

Move the calling of init_mem_mapping into early_initmem_init.

Rework alloc_low_pages to allocate page tables in the following order:
BRK, local node, low range (see the sketch below).

Still only load_cr3 once, otherwise we would break xen 64bit again.
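
The resulting allocation fallback order, sketched as a comment:

        /*
         * Fallback order in alloc_low_pages() after this patch (sketch):
         *  1. the BRK pgt_buf, while it lasts
         *  2. [local_min_pfn_mapped, local_max_pfn_mapped) - node being mapped
         *  3. [low_min_pfn_mapped, low_max_pfn_mapped) - already-mapped low range
         */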

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
---
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/kernel/setup.c | 1 -
arch/x86/mm/init.c | 88 ++++++++++++++++++++++++----------------
arch/x86/mm/numa.c | 24 +++++++++++
4 files changed, 79 insertions(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1e67223..868687c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -621,7 +621,7 @@ static inline int pgd_none(pgd_t pgd)
#ifndef __ASSEMBLY__

extern int direct_gbpages;
-void init_mem_mapping(void);
+void init_mem_mapping(unsigned long begin, unsigned long end);
void early_alloc_pgt_buf(void);

/* local pte updates need not use xchg for locking */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 86e1ec0..1cdc1a7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1105,7 +1105,6 @@ void __init setup_arch(char **cmdline_p)
acpi_boot_table_init();
early_acpi_boot_init();
early_initmem_init();
- init_mem_mapping();
memblock.current_limit = get_max_mapped();
early_trap_pf_init();

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 28b294f..8d0007a 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -24,7 +24,10 @@ static unsigned long __initdata pgt_buf_start;
static unsigned long __initdata pgt_buf_end;
static unsigned long __initdata pgt_buf_top;

-static unsigned long min_pfn_mapped;
+static unsigned long low_min_pfn_mapped;
+static unsigned long low_max_pfn_mapped;
+static unsigned long local_min_pfn_mapped;
+static unsigned long local_max_pfn_mapped;

static bool __initdata can_use_brk_pgt = true;

@@ -52,10 +55,17 @@ __ref void *alloc_low_pages(unsigned int num)

if ((pgt_buf_end + num) > pgt_buf_top || !can_use_brk_pgt) {
unsigned long ret;
- if (min_pfn_mapped >= max_pfn_mapped)
- panic("alloc_low_page: ran out of memory");
- ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
- max_pfn_mapped << PAGE_SHIFT,
+ if (local_min_pfn_mapped >= local_max_pfn_mapped) {
+ if (low_min_pfn_mapped >= low_max_pfn_mapped)
+ panic("alloc_low_page: ran out of memory");
+ ret = memblock_find_in_range(
+ low_min_pfn_mapped << PAGE_SHIFT,
+ low_max_pfn_mapped << PAGE_SHIFT,
+ PAGE_SIZE * num , PAGE_SIZE);
+ } else
+ ret = memblock_find_in_range(
+ local_min_pfn_mapped << PAGE_SHIFT,
+ local_max_pfn_mapped << PAGE_SHIFT,
PAGE_SIZE * num , PAGE_SIZE);
if (!ret)
panic("alloc_low_page: can not alloc memory");
@@ -387,60 +397,75 @@ static unsigned long __init init_range_memory_mapping(

/* (PUD_SHIFT-PMD_SHIFT)/2 */
#define STEP_SIZE_SHIFT 5
-void __init init_mem_mapping(void)
+void __init init_mem_mapping(unsigned long begin, unsigned long end)
{
- unsigned long end, real_end, start, last_start;
+ unsigned long real_end, start, last_start;
unsigned long step_size;
unsigned long addr;
unsigned long mapped_ram_size = 0;
unsigned long new_mapped_ram_size;
+ bool is_low = false;
+
+ if (!begin) {
+ probe_page_size_mask();
+ /* the ISA range is always mapped regardless of memory holes */
+ init_memory_mapping(0, ISA_END_ADDRESS);
+ begin = ISA_END_ADDRESS;
+ is_low = true;
+ }

- probe_page_size_mask();
-
-#ifdef CONFIG_X86_64
- end = max_pfn << PAGE_SHIFT;
-#else
- end = max_low_pfn << PAGE_SHIFT;
-#endif
-
- /* the ISA range is always mapped regardless of memory holes */
- init_memory_mapping(0, ISA_END_ADDRESS);
+ if (begin >= end)
+ return;

/* xen has big range in reserved near end of ram, skip it at first.*/
- addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE, PMD_SIZE);
+ addr = memblock_find_in_range(begin, end, PMD_SIZE, PMD_SIZE);
real_end = addr + PMD_SIZE;

/* step_size need to be small so pgt_buf from BRK could cover it */
step_size = PMD_SIZE;
- max_pfn_mapped = 0; /* will get exact value next */
- min_pfn_mapped = real_end >> PAGE_SHIFT;
+ local_max_pfn_mapped = begin >> PAGE_SHIFT;
+ local_min_pfn_mapped = real_end >> PAGE_SHIFT;
last_start = start = real_end;
- while (last_start > ISA_END_ADDRESS) {
+ while (last_start > begin) {
if (last_start > step_size) {
start = round_down(last_start - 1, step_size);
- if (start < ISA_END_ADDRESS)
- start = ISA_END_ADDRESS;
+ if (start < begin)
+ start = begin;
} else
- start = ISA_END_ADDRESS;
+ start = begin;
new_mapped_ram_size = init_range_memory_mapping(start,
last_start);
+ if ((last_start >> PAGE_SHIFT) > local_max_pfn_mapped)
+ local_max_pfn_mapped = last_start >> PAGE_SHIFT;
+ local_min_pfn_mapped = start >> PAGE_SHIFT;
last_start = start;
- min_pfn_mapped = last_start >> PAGE_SHIFT;
/* only increase step_size after big range get mapped */
if (new_mapped_ram_size > mapped_ram_size)
step_size <<= STEP_SIZE_SHIFT;
mapped_ram_size += new_mapped_ram_size;
}

- if (real_end < end)
+ if (real_end < end) {
init_range_memory_mapping(real_end, end);
+ if ((end >> PAGE_SHIFT) > local_max_pfn_mapped)
+ local_max_pfn_mapped = end >> PAGE_SHIFT;
+ }

+ if (is_low) {
+ low_min_pfn_mapped = local_min_pfn_mapped;
+ low_max_pfn_mapped = local_max_pfn_mapped;
+ }
+}
+
+#ifndef CONFIG_NUMA
+void __init early_initmem_init(void)
+{
#ifdef CONFIG_X86_64
- if (max_pfn > max_low_pfn) {
- /* can we preseve max_low_pfn ?*/
+ init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+ if (max_pfn > max_low_pfn)
max_low_pfn = max_pfn;
- }
#else
+ init_mem_mapping(0, max_low_pfn << PAGE_SHIFT);
early_ioremap_page_table_range_init();
#endif

@@ -449,11 +474,6 @@ void __init init_mem_mapping(void)

early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
}
-
-#ifndef CONFIG_NUMA
-void __init early_initmem_init(void)
-{
-}
#endif

/*
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index c2d4653..d3eb0c9 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -17,8 +17,10 @@
#include <asm/dma.h>
#include <asm/acpi.h>
#include <asm/amd_nb.h>
+#include <asm/tlbflush.h>

#include "numa_internal.h"
+#include "mm_internal.h"

int __initdata numa_off;
nodemask_t numa_nodes_parsed __initdata;
@@ -668,9 +670,31 @@ static void __init early_x86_numa_init(void)
numa_init(dummy_numa_init);
}

+#ifdef CONFIG_X86_64
+static void __init early_x86_numa_init_mapping(void)
+{
+ init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+ if (max_pfn > max_low_pfn)
+ max_low_pfn = max_pfn;
+}
+#else
+static void __init early_x86_numa_init_mapping(void)
+{
+ init_mem_mapping(0, max_low_pfn << PAGE_SHIFT);
+ early_ioremap_page_table_range_init();
+}
+#endif
+
void __init early_initmem_init(void)
{
early_x86_numa_init();
+
+ early_x86_numa_init_mapping();
+
+ load_cr3(swapper_pg_dir);
+ __flush_tlb_all();
+
+ early_memtest(0, max_pfn_mapped<<PAGE_SHIFT);
}

void __init x86_numa_init(void)
--
1.7.10.4

2013-03-10 06:48:15

by Yinghai Lu

Subject: [PATCH v2 20/20] x86, mm, numa: Put pagetable on local node ram for 64bit

If a node with ram is hotpluggable, the local node memory for page
tables and vmemmap should be on that node's ram.

This patch is a kind of refresh of
| commit 1411e0ec3123ae4c4ead6bfc9fe3ee5a3ae5c327
| Date: Mon Dec 27 16:48:17 2010 -0800
|
| x86-64, numa: Put pgtable to local node memory
which was reverted before.

We have reason to reintroduce it, to make memory hotplug work.

Call init_mem_mapping in early_initmem_init for every node.
alloc_low_pages will allocate page tables in the following order:
BRK, local node, low range.
So page tables will be on the low range or on local nodes.
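
Adjacent numa_meminfo blocks with the same nid are merged first, so
init_mem_mapping() runs once per node range. Hypothetical output of the
new KERN_DEBUG message for a two-node box (the values are made up; the
format is the one in the diff below):

        Node 0: [mem 0x0000000000000000-0x000000103fffffff]
        Node 1: [mem 0x0000001040000000-0x000000207fffffff]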

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
---
arch/x86/mm/numa.c | 34 +++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index d3eb0c9..11acdf6 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -673,7 +673,39 @@ static void __init early_x86_numa_init(void)
#ifdef CONFIG_X86_64
static void __init early_x86_numa_init_mapping(void)
{
- init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+ unsigned long last_start = 0, last_end = 0;
+ struct numa_meminfo *mi = &numa_meminfo;
+ unsigned long start, end;
+ int last_nid = -1;
+ int i, nid;
+
+ for (i = 0; i < mi->nr_blks; i++) {
+ nid = mi->blk[i].nid;
+ start = mi->blk[i].start;
+ end = mi->blk[i].end;
+
+ if (last_nid == nid) {
+ last_end = end;
+ continue;
+ }
+
+ /* other nid now */
+ if (last_nid >= 0) {
+ printk(KERN_DEBUG "Node %d: [mem %#016lx-%#016lx]\n",
+ last_nid, last_start, last_end - 1);
+ init_mem_mapping(last_start, last_end);
+ }
+
+ /* for next nid */
+ last_nid = nid;
+ last_start = start;
+ last_end = end;
+ }
+ /* last one */
+ printk(KERN_DEBUG "Node %d: [mem %#016lx-%#016lx]\n",
+ last_nid, last_start, last_end - 1);
+ init_mem_mapping(last_start, last_end);
+
if (max_pfn > max_low_pfn)
max_low_pfn = max_pfn;
}
--
1.7.10.4

2013-03-10 06:47:15

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH v2 11/20] x86, mm, numa: Move node_map_pfn_alignment() to x86

Move node_map_pfn_alignment() to arch/x86/mm, as it has no other users.

A later patch will update it to use numa_meminfo instead of memblock.
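
As a worked example of the two cases in the function's comment block, the
loop can be exercised standalone; pfn values assume 4 KiB pages so
1 GiB = 0x40000 pfns, and __builtin_ctzl stands in for the kernel's __ffs():

#include <stdio.h>

struct rng { unsigned long start, end; int nid; };

static unsigned long align_of(const struct rng *r, int n)
{
	unsigned long accl_mask = 0, last_end = 0, start, mask;
	int last_nid = -1, i;

	for (i = 0; i < n; i++) {
		start = r[i].start;
		if (!start || last_nid < 0 || last_nid == r[i].nid) {
			last_nid = r[i].nid;
			last_end = r[i].end;
			continue;
		}
		/* finest mask pinpointing 'start', coarsened while it
		 * still separates this node from the previous one */
		mask = ~((1UL << __builtin_ctzl(start)) - 1);
		while (mask && last_end <= (start & (mask << 1)))
			mask <<= 1;
		accl_mask |= mask;
	}
	return ~accl_mask + 1;	/* mask -> number of pages */
}

int main(void)
{
	struct rng aligned[] = { { 0x0, 0x40000, 0 }, { 0x40000, 0x80000, 1 } };
	struct rng shifted[] = { { 0x0, 0x50000, 0 }, { 0x50000, 0x90000, 1 } };

	printf("1GiB-aligned nodes: %#lx pfns\n", align_of(aligned, 2));
	printf("boundary +256MiB:   %#lx pfns\n", align_of(shifted, 2));
	return 0;
}

This prints 0x40000 (1 GiB) for the aligned layout and 0x10000 (256 MiB)
when the internode boundary is shifted by 256 MiB, matching the comment.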

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/numa.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/mm.h | 1 -
mm/page_alloc.c | 50 --------------------------------------------------
3 files changed, 50 insertions(+), 51 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index b7173f6..24155b2 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -477,6 +477,56 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
return true;
}

+/**
+ * node_map_pfn_alignment - determine the maximum internode alignment
+ *
+ * This function should be called after node map is populated and sorted.
+ * It calculates the maximum power of two alignment which can distinguish
+ * all the nodes.
+ *
+ * For example, if all nodes are 1GiB and aligned to 1GiB, the return value
+ * would indicate 1GiB alignment with (1 << (30 - PAGE_SHIFT)). If the
+ * nodes are shifted by 256MiB, 256MiB. Note that if only the last node is
+ * shifted, 1GiB is enough and this function will indicate so.
+ *
+ * This is used to test whether pfn -> nid mapping of the chosen memory
+ * model has fine enough granularity to avoid incorrect mapping for the
+ * populated node map.
+ *
+ * Returns the determined alignment in pfn's. 0 if there is no alignment
+ * requirement (single node).
+ */
+unsigned long __init node_map_pfn_alignment(void)
+{
+ unsigned long accl_mask = 0, last_end = 0;
+ unsigned long start, end, mask;
+ int last_nid = -1;
+ int i, nid;
+
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
+ if (!start || last_nid < 0 || last_nid == nid) {
+ last_nid = nid;
+ last_end = end;
+ continue;
+ }
+
+ /*
+ * Start with a mask granular enough to pin-point to the
+ * start pfn and tick off bits one-by-one until it becomes
+ * too coarse to separate the current node from the last.
+ */
+ mask = ~((1 << __ffs(start)) - 1);
+ while (mask && last_end <= (start & (mask << 1)))
+ mask <<= 1;
+
+ /* accumulate all internode masks */
+ accl_mask |= mask;
+ }
+
+ /* convert mask to number of pages */
+ return ~accl_mask + 1;
+}
+
static int __init numa_register_memblks(struct numa_meminfo *mi)
{
unsigned long uninitialized_var(pfn_align);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 2ae2050..1c79b10 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1323,7 +1323,6 @@ extern void free_initmem(void);
* CONFIG_HAVE_MEMBLOCK_NODE_MAP.
*/
extern void free_area_init_nodes(unsigned long *max_zone_pfn);
-unsigned long node_map_pfn_alignment(void);
extern unsigned long absent_pages_in_range(unsigned long start_pfn,
unsigned long end_pfn);
extern void get_pfn_range_for_nid(unsigned int nid,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 580d919..f368db4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4725,56 +4725,6 @@ static inline void setup_nr_node_ids(void)
}
#endif

-/**
- * node_map_pfn_alignment - determine the maximum internode alignment
- *
- * This function should be called after node map is populated and sorted.
- * It calculates the maximum power of two alignment which can distinguish
- * all the nodes.
- *
- * For example, if all nodes are 1GiB and aligned to 1GiB, the return value
- * would indicate 1GiB alignment with (1 << (30 - PAGE_SHIFT)). If the
- * nodes are shifted by 256MiB, 256MiB. Note that if only the last node is
- * shifted, 1GiB is enough and this function will indicate so.
- *
- * This is used to test whether pfn -> nid mapping of the chosen memory
- * model has fine enough granularity to avoid incorrect mapping for the
- * populated node map.
- *
- * Returns the determined alignment in pfn's. 0 if there is no alignment
- * requirement (single node).
- */
-unsigned long __init node_map_pfn_alignment(void)
-{
- unsigned long accl_mask = 0, last_end = 0;
- unsigned long start, end, mask;
- int last_nid = -1;
- int i, nid;
-
- for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
- if (!start || last_nid < 0 || last_nid == nid) {
- last_nid = nid;
- last_end = end;
- continue;
- }
-
- /*
- * Start with a mask granular enough to pin-point to the
- * start pfn and tick off bits one-by-one until it becomes
- * too coarse to separate the current node from the last.
- */
- mask = ~((1 << __ffs(start)) - 1);
- while (mask && last_end <= (start & (mask << 1)))
- mask <<= 1;
-
- /* accumulate all internode masks */
- accl_mask |= mask;
- }
-
- /* convert mask to number of pages */
- return ~accl_mask + 1;
-}
-
/* Find the lowest pfn for a node */
static unsigned long __init find_min_pfn_for_node(int nid)
{
--
1.7.10.4

2013-03-10 06:50:00

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH v2 08/20] x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c

head64.c can use the page tables set up by the #PF handler to access the
initrd before the init memory mapping is set up and the initrd is relocated.

head_32.S can use 32-bit flat mode to access the initrd before the init
memory mapping is set up and the initrd is relocated.

That makes 32-bit and 64-bit more consistent.

-v2: use an inline function in the header file instead, according to tj.
 also still need to keep the #ifdef in head_32.S to avoid a compile error.
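
The 32-bit flat-mode constraint can be pictured with a toy calculation;
PAGE_OFFSET here is the common 32-bit default and the symbol address is
made up, but on 32-bit __pa_symbol() is essentially this subtraction:

#include <stdio.h>

#define PAGE_OFFSET 0xC0000000UL	/* assumed 32-bit lowmem base */

int main(void)
{
	/* made-up link-time (virtual) address of a kernel global */
	unsigned long boot_params_va = 0xC1A00000UL;

	/* In 32-bit flat mode paging is off and the CPU issues physical
	 * addresses, so a global's virtual address must be rebased
	 * before it can be dereferenced: */
	unsigned long boot_params_pa = boot_params_va - PAGE_OFFSET;

	printf("virt %#lx -> phys %#lx\n", boot_params_va, boot_params_pa);
	return 0;
}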

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
---
arch/x86/include/asm/setup.h | 6 ++++++
arch/x86/kernel/head64.c | 2 ++
arch/x86/kernel/head_32.S | 4 ++++
arch/x86/kernel/setup.c | 30 ++++++++++++++++++++++++++++--
4 files changed, 40 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 4f71d48..6f885b7 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -42,6 +42,12 @@ extern void visws_early_detect(void);
static inline void visws_early_detect(void) { }
#endif

+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void x86_acpi_override_find(void);
+#else
+static inline void x86_acpi_override_find(void) { }
+#endif
+
extern unsigned long saved_video_mode;

extern void reserve_standard_io_resources(void);
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index c5e403f..a31bc63 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -174,6 +174,8 @@ void __init x86_64_start_kernel(char * real_mode_data)
if (console_loglevel == 10)
early_printk("Kernel alive\n");

+ x86_acpi_override_find();
+
clear_page(init_level4_pgt);
/* set init_level4_pgt kernel high mapping*/
init_level4_pgt[511] = early_level4_pgt[511];
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 73afd11..ca08f0e 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -149,6 +149,10 @@ ENTRY(startup_32)
call load_ucode_bsp
#endif

+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+ call x86_acpi_override_find
+#endif
+
/*
* Initialize page tables. This creates a PDE and a set of page
* tables, which are located immediately beyond __brk_base. The variable
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 16a703f..b067663 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -424,6 +424,34 @@ static void __init reserve_initrd(void)
}
#endif /* CONFIG_BLK_DEV_INITRD */

+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void __init x86_acpi_override_find(void)
+{
+ unsigned long ramdisk_image, ramdisk_size;
+ unsigned char *p = NULL;
+
+#ifdef CONFIG_X86_32
+ struct boot_params *boot_params_p;
+
+ /*
+ * 32bit is from head_32.S, and it is 32bit flat mode.
+ * So need to use phys address to access global variables.
+ */
+ boot_params_p = (struct boot_params *)__pa_symbol(&boot_params);
+ ramdisk_image = get_ramdisk_image(boot_params_p);
+ ramdisk_size = get_ramdisk_size(boot_params_p);
+ p = (unsigned char *)ramdisk_image;
+ acpi_initrd_override_find(p, ramdisk_size, true);
+#else
+ ramdisk_image = get_ramdisk_image(&boot_params);
+ ramdisk_size = get_ramdisk_size(&boot_params);
+ if (ramdisk_image)
+ p = __va(ramdisk_image);
+ acpi_initrd_override_find(p, ramdisk_size, false);
+#endif
+}
+#endif
+
static void __init parse_setup_data(void)
{
struct setup_data *data;
@@ -1092,8 +1120,6 @@ void __init setup_arch(char **cmdline_p)

reserve_initrd();

- acpi_initrd_override_find((void *)initrd_start,
- initrd_end - initrd_start, false);
acpi_initrd_override_copy();

reserve_crashkernel();
--
1.7.10.4

2013-03-10 06:46:34

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH v2 03/20] x86, ACPI, mm: Kill max_low_pfn_mapped

Now we have the arch_pfn_mapped array, and max_low_pfn_mapped should not
be used anymore.

Users should use arch_pfn_mapped or just 1UL<<(32-PAGE_SHIFT) instead.

The only user left is ACPI_INITRD_TABLE_OVERRIDE, and it should not use
max_low_pfn_mapped, as the later access is via early_ioremap(). Change it
to try below 4G first, and then above 4G.

-v2: Leave alone max_low_pfn_mapped in i915 code according to tj.

Suggested-by: H. Peter Anvin <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: David Airlie <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: [email protected]
Cc: [email protected]
---
arch/x86/include/asm/page_types.h | 1 -
arch/x86/kernel/setup.c | 4 +---
arch/x86/mm/init.c | 4 ----
drivers/acpi/osl.c | 10 +++++++---
4 files changed, 8 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index 54c9787..b012b82 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -43,7 +43,6 @@

extern int devmem_is_allowed(unsigned long pagenr);

-extern unsigned long max_low_pfn_mapped;
extern unsigned long max_pfn_mapped;

static inline phys_addr_t get_max_mapped(void)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 1629577..e75c6e6 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -113,13 +113,11 @@
#include <asm/prom.h>

/*
- * max_low_pfn_mapped: highest direct mapped pfn under 4GB
- * max_pfn_mapped: highest direct mapped pfn over 4GB
+ * max_pfn_mapped: highest direct mapped pfn
*
* The direct mapping only covers E820_RAM regions, so the ranges and gaps are
* represented by pfn_mapped
*/
-unsigned long max_low_pfn_mapped;
unsigned long max_pfn_mapped;

#ifdef CONFIG_DMI
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 59b7fc4..abcc241 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -313,10 +313,6 @@ static void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
nr_pfn_mapped = clean_sort_range(pfn_mapped, E820_X_MAX);

max_pfn_mapped = max(max_pfn_mapped, end_pfn);
-
- if (start_pfn < (1UL<<(32-PAGE_SHIFT)))
- max_low_pfn_mapped = max(max_low_pfn_mapped,
- min(end_pfn, 1UL<<(32-PAGE_SHIFT)));
}

bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 586e7e9..c08cdb6 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -624,9 +624,13 @@ void __init acpi_initrd_override(void *data, size_t size)
if (table_nr == 0)
return;

- acpi_tables_addr =
- memblock_find_in_range(0, max_low_pfn_mapped << PAGE_SHIFT,
- all_tables_size, PAGE_SIZE);
+ /* under 4G at first, then above 4G */
+ acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
+ all_tables_size, PAGE_SIZE);
+ if (!acpi_tables_addr)
+ acpi_tables_addr = memblock_find_in_range(0,
+ ~(phys_addr_t)0,
+ all_tables_size, PAGE_SIZE);
if (!acpi_tables_addr) {
WARN_ON(1);
return;
--
1.7.10.4

2013-03-10 06:46:32

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH v2 04/20] x86, ACPI: Increase override tables number limit

The number of acpi tables overridable from the initrd is currently limited
to 10, which is too small. 64 should be good enough, as we have 35 sigs
and could have several SSDTs.

Two problems in the current code prevent us from increasing the limit:
1. The cpio file info array is put on the stack; as every element is 32
   bytes, we could run out of stack if we grow that array size to 64.
   We can move it off the stack, making it a global in the __initdata
   section.
2. early_ioremap() can only remap 256 KiB at one time. The current code
   maps all 10 tables in one go. If we increase the limit, the whole size
   could be more than 256 KiB, and early_ioremap() would fail.
   We can map the tables one by one during copying, instead of mapping
   them all at once.
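
A quick back-of-the-envelope check of both limits; the 32-byte figure is
from above, while the 8 KiB average table size is only an assumed example:

#include <stdio.h>

int main(void)
{
	unsigned tables = 64, entry_bytes = 32, avg_kib = 8;

	/* problem 1: cost of keeping the info array on the stack */
	printf("on-stack array: %u bytes\n", tables * entry_bytes);

	/* problem 2: early_ioremap() maps at most 256 KiB at a time;
	 * mapping everything at once fails once the tables add up */
	printf("64 tables x %u KiB = %u KiB (> 256 KiB)\n",
	       avg_kib, tables * avg_kib);
	return 0;
}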

-v2: According to tj, split this out into a separate patch, and
 rename the array to acpi_initrd_files.

Signed-off-by: Yinghai <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
---
drivers/acpi/osl.c | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index c08cdb6..8aaf721 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -569,8 +569,8 @@ static const char * const table_sigs[] = {

#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)

-/* Must not increase 10 or needs code modification below */
-#define ACPI_OVERRIDE_TABLES 10
+#define ACPI_OVERRIDE_TABLES 64
+static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];

void __init acpi_initrd_override(void *data, size_t size)
{
@@ -579,7 +579,6 @@ void __init acpi_initrd_override(void *data, size_t size)
struct acpi_table_header *table;
char cpio_path[32] = "kernel/firmware/acpi/";
struct cpio_data file;
- struct cpio_data early_initrd_files[ACPI_OVERRIDE_TABLES];
char *p;

if (data == NULL || size == 0)
@@ -617,8 +616,8 @@ void __init acpi_initrd_override(void *data, size_t size)
table->signature, cpio_path, file.name, table->length);

all_tables_size += table->length;
- early_initrd_files[table_nr].data = file.data;
- early_initrd_files[table_nr].size = file.size;
+ acpi_initrd_files[table_nr].data = file.data;
+ acpi_initrd_files[table_nr].size = file.size;
table_nr++;
}
if (table_nr == 0)
@@ -648,14 +647,14 @@ void __init acpi_initrd_override(void *data, size_t size)
memblock_reserve(acpi_tables_addr, acpi_tables_addr + all_tables_size);
arch_reserve_mem_area(acpi_tables_addr, all_tables_size);

- p = early_ioremap(acpi_tables_addr, all_tables_size);
-
for (no = 0; no < table_nr; no++) {
- memcpy(p + total_offset, early_initrd_files[no].data,
- early_initrd_files[no].size);
- total_offset += early_initrd_files[no].size;
+ phys_addr_t size = acpi_initrd_files[no].size;
+
+ p = early_ioremap(acpi_tables_addr + total_offset, size);
+ memcpy(p, acpi_initrd_files[no].data, size);
+ early_iounmap(p, size);
+ total_offset += size;
}
- early_iounmap(p, all_tables_size);
}
#endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */

--
1.7.10.4

2013-03-10 10:25:16

by Pekka Enberg

[permalink] [raw]
Subject: Re: [PATCH v2 08/20] x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c

On Sun, Mar 10, 2013 at 8:44 AM, Yinghai Lu <[email protected]> wrote:
> +void __init x86_acpi_override_find(void)
> +{
> + unsigned long ramdisk_image, ramdisk_size;
> + unsigned char *p = NULL;
> +
> +#ifdef CONFIG_X86_32
> + struct boot_params *boot_params_p;
> +
> + /*
> + * 32bit is from head_32.S, and it is 32bit flat mode.
> + * So need to use phys address to access global variables.
> + */
> + boot_params_p = (struct boot_params *)__pa_symbol(&boot_params);
> + ramdisk_image = get_ramdisk_image(boot_params_p);
> + ramdisk_size = get_ramdisk_size(boot_params_p);
> + p = (unsigned char *)ramdisk_image;
> + acpi_initrd_override_find(p, ramdisk_size, true);
> +#else
> + ramdisk_image = get_ramdisk_image(&boot_params);
> + ramdisk_size = get_ramdisk_size(&boot_params);
> + if (ramdisk_image)
> + p = __va(ramdisk_image);
> + acpi_initrd_override_find(p, ramdisk_size, false);
> +#endif
> +}
> +#endif

What is preventing us from making the 64-bit variant also work in flat
mode to make the code consistent and not hiding the differences under
the rug? What am I missing here?

Pekka

2013-03-10 16:47:20

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v2 08/20] x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c

On Sun, Mar 10, 2013 at 3:25 AM, Pekka Enberg <[email protected]> wrote:
>
> What is preventing us from making the 64-bit variant also work in flat
> mode to make the code consistent and not hiding the differences under
> the rug? What am I missing here?

A boot loader can start the kernel directly in 64-bit mode, from
arch/x86/boot/compressed/head_64.S::startup_64.

The initrd can be loaded above 4G by a 64-bit bootloader.

So even if we switched back to 32-bit flat mode, we still could not access
that initrd directly.
Thanks

Yinghai

2013-03-10 17:46:12

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v2 08/20] x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c

There is no 64-bit flat mode. We use a #PF handler to emulate one by creating page tables on the fly.

Yinghai Lu <[email protected]> wrote:

>On Sun, Mar 10, 2013 at 3:25 AM, Pekka Enberg <[email protected]>
>wrote:
>>
>> What is preventing us from making the 64-bit variant also work in
>flat
>> mode to make the code consistent and not hiding the differences under
>> the rug? What am I missing here?
>
>Boot loader could start kernel from 64bit directly from
>from arch/x86/boot/compressed/head_64.s::startup_64.
>
>initrd can be loaded by 64bit bootloader above 4G.
>
>So we even switch back to 32bit flat mode, we still can not access
>those initrd
>directly.
>
>Thanks
>
>Yinghai

--
Sent from my mobile phone. Please excuse brevity and lack of formatting.

2013-03-11 05:47:17

by Tang Chen

[permalink] [raw]
Subject: Re: [PATCH v2 20/20] x86, mm, numa: Put pagetable on local node ram for 64bit

Hi Yinghai,

Please see below. :)

On 03/10/2013 02:44 PM, Yinghai Lu wrote:
> If a node with ram is hotpluggable, local node memory for page tables and
> vmemmap should be on that node's ram.
>
> This patch is some kind of refreshment of
> | commit 1411e0ec3123ae4c4ead6bfc9fe3ee5a3ae5c327
> | Date: Mon Dec 27 16:48:17 2010 -0800
> |
> | x86-64, numa: Put pgtable to local node memory
> That was reverted before.
>
> We have reason to reintroduce it to make memory hotplug work.
>
> Call init_mem_mapping() in early_initmem_init() for every node.
> alloc_low_pages() will allocate page tables in the following order:
> BRK, local node, low range
> So page tables will end up in the low range or on local nodes.
>
> Signed-off-by: Yinghai Lu<[email protected]>
> Cc: Pekka Enberg<[email protected]>
> Cc: Jacob Shin<[email protected]>
> Cc: Konrad Rzeszutek Wilk<[email protected]>
> ---
> arch/x86/mm/numa.c | 34 +++++++++++++++++++++++++++++++++-
> 1 file changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index d3eb0c9..11acdf6 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -673,7 +673,39 @@ static void __init early_x86_numa_init(void)
> #ifdef CONFIG_X86_64
> static void __init early_x86_numa_init_mapping(void)
> {
> - init_mem_mapping(0, max_pfn<< PAGE_SHIFT);
> + unsigned long last_start = 0, last_end = 0;
> + struct numa_meminfo *mi =&numa_meminfo;
> + unsigned long start, end;
> + int last_nid = -1;
> + int i, nid;
> +
> + for (i = 0; i< mi->nr_blks; i++) {
> + nid = mi->blk[i].nid;
> + start = mi->blk[i].start;
> + end = mi->blk[i].end;
> +
> + if (last_nid == nid) {
> + last_end = end;
> + continue;
> + }
> +
> + /* other nid now */
> + if (last_nid>= 0) {
> + printk(KERN_DEBUG "Node %d: [mem %#016lx-%#016lx]\n",
> + last_nid, last_start, last_end - 1);
> + init_mem_mapping(last_start, last_end);

IIUC, we call init_mem_mapping() for each node's ranges. The first time,
local_max_pfn_mapped = begin >> PAGE_SHIFT;
local_min_pfn_mapped = real_end >> PAGE_SHIFT;
which means
local_min_pfn_mapped >= local_max_pfn_mapped
right?

So, the first page allocated by alloc_low_pages() is not on the local node,
right?
Furthermore, the first page of the pagetable is not on the local node, right?

BTW, I'm reading your code and making the necessary hot-add and hot-remove
changes now.

Thanks. :)

> + }
> +
> + /* for next nid */
> + last_nid = nid;
> + last_start = start;
> + last_end = end;
> + }
> + /* last one */
> + printk(KERN_DEBUG "Node %d: [mem %#016lx-%#016lx]\n",
> + last_nid, last_start, last_end - 1);
> + init_mem_mapping(last_start, last_end);
> +
> if (max_pfn> max_low_pfn)
> max_low_pfn = max_pfn;
> }

2013-03-11 06:29:51

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v2 20/20] x86, mm, numa: Put pagetable on local node ram for 64bit

On Sun, Mar 10, 2013 at 10:49 PM, Tang Chen <[email protected]> wrote:
> On 03/10/2013 02:44 PM, Yinghai Lu wrote:
>>
>> Call init_mem_mapping() in early_initmem_init() for every node.
>> alloc_low_pages() will allocate page tables in the following order:
>> BRK, local node, low range
>> So page tables will end up in the low range or on local nodes.
...
> IIUC, we call init_mem_mapping() for each node's ranges. The first time,
> local_max_pfn_mapped = begin >> PAGE_SHIFT;
> local_min_pfn_mapped = real_end >> PAGE_SHIFT;
> which means
> local_min_pfn_mapped >= local_max_pfn_mapped
> right?
>
> So, the first page allocated by alloc_low_pages() is not on the local node,
> right?

It is from the BRK area, which sits inside the kernel image.

> Furthermore, the first page of the pagetable is not on the local node, right?

For the node with start = 0, it is in BRK.

For other nodes, it is from the low range, aka the node with start = 0.

Thanks

Yinghai
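
For readers following along, the window Tang asked about can be traced with
a toy walk of the loop in patch 19; all numbers are invented and plain
addresses are used instead of pfns:

#include <stdio.h>

int main(void)
{
	/* toy top-down walk of init_mem_mapping() */
	unsigned long begin = 0x200, real_end = 0x1000;
	unsigned long last_start = real_end, step = 0x100;
	unsigned long local_min = real_end, local_max = begin;	/* empty */

	/* the first chunk's page tables are allocated while the window
	 * is still empty, so they come from BRK or the low range */
	while (last_start > begin) {
		unsigned long start = begin;

		if (last_start > step) {
			start = (last_start - 1) / step * step; /* round_down */
			if (start < begin)
				start = begin;
		}
		/* ... init_range_memory_mapping(start, last_start) ... */
		if (last_start > local_max)
			local_max = last_start;
		local_min = start;
		printf("mapped [%#lx,%#lx), window now [%#lx,%#lx)\n",
		       start, last_start, local_min, local_max);
		last_start = start;
		step <<= 5;	/* STEP_SIZE_SHIFT */
	}
	return 0;
}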

2013-03-11 13:17:30

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH v2 19/20] x86, mm: Make init_mem_mapping be able to be called several times

On Sat, Mar 09, 2013 at 10:44:46PM -0800, Yinghai Lu wrote:
> Prepare to put page table on local nodes.
>
> Move calling of init_mem_mapping to early_initmem_init.
>
> Rework alloc_low_pages to alloc page table in following order:
> BRK, local node, low range
>
> Still only load_cr3 one time, otherwise we would break xen 64bit again.
>

We could also fix that. Now that the regression storm has passed
and I am able to spend some time on it we could make it a bit more
resistant.

> Signed-off-by: Yinghai Lu <[email protected]>
> Cc: Pekka Enberg <[email protected]>
> Cc: Jacob Shin <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> ---
> arch/x86/include/asm/pgtable.h | 2 +-
> arch/x86/kernel/setup.c | 1 -
> arch/x86/mm/init.c | 88 ++++++++++++++++++++++++----------------
> arch/x86/mm/numa.c | 24 +++++++++++
> 4 files changed, 79 insertions(+), 36 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 1e67223..868687c 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -621,7 +621,7 @@ static inline int pgd_none(pgd_t pgd)
> #ifndef __ASSEMBLY__
>
> extern int direct_gbpages;
> -void init_mem_mapping(void);
> +void init_mem_mapping(unsigned long begin, unsigned long end);
> void early_alloc_pgt_buf(void);
>
> /* local pte updates need not use xchg for locking */
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 86e1ec0..1cdc1a7 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1105,7 +1105,6 @@ void __init setup_arch(char **cmdline_p)
> acpi_boot_table_init();
> early_acpi_boot_init();
> early_initmem_init();
> - init_mem_mapping();
> memblock.current_limit = get_max_mapped();
> early_trap_pf_init();
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 28b294f..8d0007a 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -24,7 +24,10 @@ static unsigned long __initdata pgt_buf_start;
> static unsigned long __initdata pgt_buf_end;
> static unsigned long __initdata pgt_buf_top;
>
> -static unsigned long min_pfn_mapped;
> +static unsigned long low_min_pfn_mapped;
> +static unsigned long low_max_pfn_mapped;
> +static unsigned long local_min_pfn_mapped;
> +static unsigned long local_max_pfn_mapped;
>
> static bool __initdata can_use_brk_pgt = true;
>
> @@ -52,10 +55,17 @@ __ref void *alloc_low_pages(unsigned int num)
>
> if ((pgt_buf_end + num) > pgt_buf_top || !can_use_brk_pgt) {
> unsigned long ret;
> - if (min_pfn_mapped >= max_pfn_mapped)
> - panic("alloc_low_page: ran out of memory");
> - ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
> - max_pfn_mapped << PAGE_SHIFT,
> + if (local_min_pfn_mapped >= local_max_pfn_mapped) {
> + if (low_min_pfn_mapped >= low_max_pfn_mapped)
> + panic("alloc_low_page: ran out of memory");
> + ret = memblock_find_in_range(
> + low_min_pfn_mapped << PAGE_SHIFT,
> + low_max_pfn_mapped << PAGE_SHIFT,
> + PAGE_SIZE * num , PAGE_SIZE);
> + } else
> + ret = memblock_find_in_range(
> + local_min_pfn_mapped << PAGE_SHIFT,
> + local_max_pfn_mapped << PAGE_SHIFT,
> PAGE_SIZE * num , PAGE_SIZE);
> if (!ret)
> panic("alloc_low_page: can not alloc memory");
> @@ -387,60 +397,75 @@ static unsigned long __init init_range_memory_mapping(
>
> /* (PUD_SHIFT-PMD_SHIFT)/2 */
> #define STEP_SIZE_SHIFT 5
> -void __init init_mem_mapping(void)
> +void __init init_mem_mapping(unsigned long begin, unsigned long end)
> {
> - unsigned long end, real_end, start, last_start;
> + unsigned long real_end, start, last_start;
> unsigned long step_size;
> unsigned long addr;
> unsigned long mapped_ram_size = 0;
> unsigned long new_mapped_ram_size;
> + bool is_low = false;
> +
> + if (!begin) {
> + probe_page_size_mask();
> + /* the ISA range is always mapped regardless of memory holes */
> + init_memory_mapping(0, ISA_END_ADDRESS);
> + begin = ISA_END_ADDRESS;
> + is_low = true;
> + }
>
> - probe_page_size_mask();
> -
> -#ifdef CONFIG_X86_64
> - end = max_pfn << PAGE_SHIFT;
> -#else
> - end = max_low_pfn << PAGE_SHIFT;
> -#endif
> -
> - /* the ISA range is always mapped regardless of memory holes */
> - init_memory_mapping(0, ISA_END_ADDRESS);
> + if (begin >= end)
> + return;
>
> /* xen has big range in reserved near end of ram, skip it at first.*/
> - addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE, PMD_SIZE);
> + addr = memblock_find_in_range(begin, end, PMD_SIZE, PMD_SIZE);
> real_end = addr + PMD_SIZE;
>
> /* step_size need to be small so pgt_buf from BRK could cover it */
> step_size = PMD_SIZE;
> - max_pfn_mapped = 0; /* will get exact value next */
> - min_pfn_mapped = real_end >> PAGE_SHIFT;
> + local_max_pfn_mapped = begin >> PAGE_SHIFT;
> + local_min_pfn_mapped = real_end >> PAGE_SHIFT;
> last_start = start = real_end;
> - while (last_start > ISA_END_ADDRESS) {
> + while (last_start > begin) {
> if (last_start > step_size) {
> start = round_down(last_start - 1, step_size);
> - if (start < ISA_END_ADDRESS)
> - start = ISA_END_ADDRESS;
> + if (start < begin)
> + start = begin;
> } else
> - start = ISA_END_ADDRESS;
> + start = begin;
> new_mapped_ram_size = init_range_memory_mapping(start,
> last_start);
> + if ((last_start >> PAGE_SHIFT) > local_max_pfn_mapped)
> + local_max_pfn_mapped = last_start >> PAGE_SHIFT;
> + local_min_pfn_mapped = start >> PAGE_SHIFT;
> last_start = start;
> - min_pfn_mapped = last_start >> PAGE_SHIFT;
> /* only increase step_size after big range get mapped */
> if (new_mapped_ram_size > mapped_ram_size)
> step_size <<= STEP_SIZE_SHIFT;
> mapped_ram_size += new_mapped_ram_size;
> }
>
> - if (real_end < end)
> + if (real_end < end) {
> init_range_memory_mapping(real_end, end);
> + if ((end >> PAGE_SHIFT) > local_max_pfn_mapped)
> + local_max_pfn_mapped = end >> PAGE_SHIFT;
> + }
>
> + if (is_low) {
> + low_min_pfn_mapped = local_min_pfn_mapped;
> + low_max_pfn_mapped = local_max_pfn_mapped;
> + }
> +}
> +
> +#ifndef CONFIG_NUMA
> +void __init early_initmem_init(void)
> +{
> #ifdef CONFIG_X86_64
> - if (max_pfn > max_low_pfn) {
> - /* can we preseve max_low_pfn ?*/
> + init_mem_mapping(0, max_pfn << PAGE_SHIFT);
> + if (max_pfn > max_low_pfn)
> max_low_pfn = max_pfn;
> - }
> #else
> + init_mem_mapping(0, max_low_pfn << PAGE_SHIFT);
> early_ioremap_page_table_range_init();
> #endif
>
> @@ -449,11 +474,6 @@ void __init init_mem_mapping(void)
>
> early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
> }
> -
> -#ifndef CONFIG_NUMA
> -void __init early_initmem_init(void)
> -{
> -}
> #endif
>
> /*
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index c2d4653..d3eb0c9 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -17,8 +17,10 @@
> #include <asm/dma.h>
> #include <asm/acpi.h>
> #include <asm/amd_nb.h>
> +#include <asm/tlbflush.h>
>
> #include "numa_internal.h"
> +#include "mm_internal.h"
>
> int __initdata numa_off;
> nodemask_t numa_nodes_parsed __initdata;
> @@ -668,9 +670,31 @@ static void __init early_x86_numa_init(void)
> numa_init(dummy_numa_init);
> }
>
> +#ifdef CONFIG_X86_64
> +static void __init early_x86_numa_init_mapping(void)
> +{
> + init_mem_mapping(0, max_pfn << PAGE_SHIFT);
> + if (max_pfn > max_low_pfn)
> + max_low_pfn = max_pfn;
> +}
> +#else
> +static void __init early_x86_numa_init_mapping(void)
> +{
> + init_mem_mapping(0, max_low_pfn << PAGE_SHIFT);
> + early_ioremap_page_table_range_init();
> +}
> +#endif
> +
> void __init early_initmem_init(void)
> {
> early_x86_numa_init();
> +
> + early_x86_numa_init_mapping();
> +
> + load_cr3(swapper_pg_dir);
> + __flush_tlb_all();
> +
> + early_memtest(0, max_pfn_mapped<<PAGE_SHIFT);
> }
>
> void __init x86_numa_init(void)
> --
> 1.7.10.4
>

2013-03-11 20:28:07

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v2 19/20] x86, mm: Make init_mem_mapping be able to be called several times

On Mon, Mar 11, 2013 at 6:16 AM, Konrad Rzeszutek Wilk
<[email protected]> wrote:
> On Sat, Mar 09, 2013 at 10:44:46PM -0800, Yinghai Lu wrote:
>> Prepare to put page table on local nodes.
>>
>> Move calling of init_mem_mapping to early_initmem_init.
>>
>> Rework alloc_low_pages to alloc page table in following order:
>> BRK, local node, low range
>>
>> Still only load_cr3 one time, otherwise we would break xen 64bit again.
>>
>
> We could also fix that. Now that the regression storm has passed
> and I am able to spend some time on it we could make it a bit more
> resistant.

Never mind, we should only need to call load_cr3() one time,

as init_memory_mapping() itself flushes the TLB every time on 64-bit.

Thanks

Yinghai

2013-04-04 17:37:08

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 03/20] x86, ACPI, mm: Kill max_low_pfn_mapped

Hello,

On Sat, Mar 09, 2013 at 10:44:30PM -0800, Yinghai Lu wrote:
> Now we have the arch_pfn_mapped array, and max_low_pfn_mapped should not
> be used anymore.
>
> Users should use arch_pfn_mapped or just 1UL<<(32-PAGE_SHIFT) instead.
>
> The only user left is ACPI_INITRD_TABLE_OVERRIDE, and it should not use
> max_low_pfn_mapped, as the later access is via early_ioremap(). Change it
> to try below 4G first, and then above 4G.
...
> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> index 586e7e9..c08cdb6 100644
> --- a/drivers/acpi/osl.c
> +++ b/drivers/acpi/osl.c
> @@ -624,9 +624,13 @@ void __init acpi_initrd_override(void *data, size_t size)
> if (table_nr == 0)
> return;
>
> - acpi_tables_addr =
> - memblock_find_in_range(0, max_low_pfn_mapped << PAGE_SHIFT,
> - all_tables_size, PAGE_SIZE);
> + /* under 4G at first, then above 4G */
> + acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
> + all_tables_size, PAGE_SIZE);
> + if (!acpi_tables_addr)
> + acpi_tables_addr = memblock_find_in_range(0,
> + ~(phys_addr_t)0,
> + all_tables_size, PAGE_SIZE);

So, it's changing the allocation from <=4G to <=4G first and then >4G.
The only explanation given is "as later accessing is using
early_ioremap()", but I can't see why that can be a reason for that.
early_ioremap() doesn't care whether the given physaddr is under 4G or
not, it unconditionally maps it into fixmap, so whether the allocated
address is below or above 4G doesn't make any difference.

Changing the allowed range of the allocation should be a separate
patch. It has some chance of its own breakage and the change itself
isn't really related to this one.

Please try to elaborate the reasoning behind "why", so that readers of
the description don't have to deduce (oh well, guess) your intentions
behind the changes. As much as it would help the readers, it'd also
help you even more as you would have had to explicitly write something
like "the table is accessed with early_ioremap() so the address
doesn't need to be restricted under 4G; however, to avoid unnecessary
remappings, first try <= 4G and then > 4G." Then, you would be
compelled to check whether the statement you explicitly wrote is true,
which isn't in this case and you would also realize that the change
isn't trivial and doesn't really belong with this patch. By not doing
the due diligence, you're offloading what you should have done to
others, which isn't very nice.

I think the descriptions are better in this posting than the last time
but it's still lacking, so please put more effort into describing
the changes and reasoning behind them.

Thanks.

--
tejun

2013-04-04 17:48:38

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 02/20] x86, microcode: Use common get_ramdisk_image()

On Sat, Mar 09, 2013 at 10:44:29PM -0800, Yinghai Lu wrote:
> Use common get_ramdisk_image() to get ramdisk start phys address.
>
> We need this to get correct ramdisk adress for 64bit bzImage that
> initrd can be loaded above 4G by kexec-tools.

Is this a bug fix? Can it actually happen?

> Signed-off-by: Yinghai Lu <[email protected]>
> Cc: Fenghua Yu <[email protected]>

For 01 and 02

Acked-by: Tejun Heo <[email protected]>

Thanks.

--
tejun

2013-04-04 17:50:26

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 04/20] x86, ACPI: Increase override tables number limit

On Sat, Mar 09, 2013 at 10:44:31PM -0800, Yinghai Lu wrote:
> The number of acpi tables overridable from the initrd is currently limited
> to 10, which is too small. 64 should be good enough, as we have 35 sigs
> and could have several SSDTs.
>
> Two problems in the current code prevent us from increasing the limit:
> 1. The cpio file info array is put on the stack; as every element is 32
>    bytes, we could run out of stack if we grow that array size to 64.
>    We can move it off the stack, making it a global in the __initdata
>    section.
> 2. early_ioremap() can only remap 256 KiB at one time. The current code
>    maps all 10 tables in one go. If we increase the limit, the whole size
>    could be more than 256 KiB, and early_ioremap() would fail.
>    We can map the tables one by one during copying, instead of mapping
>    them all at once.
>
> -v2: According to tj, split this out into a separate patch, and
>      rename the array to acpi_initrd_files.
>
> Signed-off-by: Yinghai <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: [email protected]

Acked-by: Tejun Heo <[email protected]>

> @@ -648,14 +647,14 @@ void __init acpi_initrd_override(void *data, size_t size)
> memblock_reserve(acpi_tables_addr, acpi_tables_addr + all_tables_size);
> arch_reserve_mem_area(acpi_tables_addr, all_tables_size);
>
> - p = early_ioremap(acpi_tables_addr, all_tables_size);
> -

It'd be nice to have a brief comment here explaining why we're mapping
each table separately.

> for (no = 0; no < table_nr; no++) {
> - memcpy(p + total_offset, early_initrd_files[no].data,
> - early_initrd_files[no].size);
> - total_offset += early_initrd_files[no].size;
> + phys_addr_t size = acpi_initrd_files[no].size;
> +
> + p = early_ioremap(acpi_tables_addr + total_offset, size);
> + memcpy(p, acpi_initrd_files[no].data, size);
> + early_iounmap(p, size);
> + total_offset += size;
> }
> - early_iounmap(p, all_tables_size);

Thanks.

--
tejun

2013-04-04 17:59:42

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v2 02/20] x86, microcode: Use common get_ramdisk_image()

On Thu, Apr 4, 2013 at 10:48 AM, Tejun Heo <[email protected]> wrote:
> On Sat, Mar 09, 2013 at 10:44:29PM -0800, Yinghai Lu wrote:
>> Use common get_ramdisk_image() to get ramdisk start phys address.
>>
>> We need this to get correct ramdisk adress for 64bit bzImage that
>> initrd can be loaded above 4G by kexec-tools.
>
> Is this a bug fix? Can it actually happen?

Yes, it can happen.
When the second kernel has early microcode updating support, it would
search the wrong place for the ramdisk.

Thanks

Yinghai

2013-04-04 18:03:37

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v2 04/20] x86, ACPI: Increase override tables number limit

On Thu, Apr 4, 2013 at 10:50 AM, Tejun Heo <[email protected]> wrote:
> On Sat, Mar 09, 2013 at 10:44:31PM -0800, Yinghai Lu wrote:

>> @@ -648,14 +647,14 @@ void __init acpi_initrd_override(void *data, size_t size)
>> memblock_reserve(acpi_tables_addr, acpi_tables_addr + all_tables_size);
>> arch_reserve_mem_area(acpi_tables_addr, all_tables_size);
>>
>> - p = early_ioremap(acpi_tables_addr, all_tables_size);
>> -
>
> It'd be nice to have a brief comment here explaining why we're mapping
> each table separately.

ok, will copy lines from changelog to comment.

Thanks

Yinghai

2013-04-04 18:07:32

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 05/20] x86, ACPI: Split acpi_initrd_override to find/copy two functions

On Sat, Mar 09, 2013 at 10:44:32PM -0800, Yinghai Lu wrote:
> To parse srat early, we need to move acpi table probing early.
> acpi_initrd_table_override is before acpi table probing. So we need to
> move it early too.
>
> Current code acpi_initrd_table_override is after init_mem_mapping and
> relocate_initrd(), so it can scan initrd and copy acpi tables with kernel
> virtual address of initrd.
> Copying need to be after memblock is ready, because it need to allocate
> buffer for new acpi tables.
>
> So we have to split that function to find and copy two functions.
> Find should be as early as possible. Copy should be after memblock is ready.
>
> Finding could be done in head_32.S and head64.c, just like microcode
> early scanning. In head_32.S, it is 32bit flat mode, we don't
> need to set page table to access it. In head64.c, #PF set page table
> could help us access initrd with kernel low mapping address.
>
> Copying could be done just after memblock is ready and before probing
> acpi tables, and we need to early_ioremap to access source and target
> range, as init_mem_mapping is not called yet.
>
> Also move down two functions declaration to avoid #ifdef in setup.c
>
> ACPI_INITRD_TABLE_OVERRIDE depends one ACPI and BLK_DEV_INITRD.
> So could move declaration out from #ifdef CONFIG_ACPI protection.

Heh, I couldn't really follow the above. How about something like the
following.

While a dummy version of acpi_initrd_override() was defined when
!CONFIG_ACPI_INITRD_TABLE_OVERRIDE, the prototype and dummy version
were conditionalized inside CONFIG_ACPI. This forced setup_arch() to
have its own #ifdefs around acpi_initrd_override() as otherwise build
would fail when !CONFIG_ACPI. Move the prototypes and dummy
implementations of the newly split functions below CONFIG_ACPI block
in acpi.h so that we can do away with #ifdefs in its user.

Acked-by: Tejun Heo <[email protected]>

Thanks.

--
tejun

2013-04-04 18:20:57

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v2 03/20] x86, ACPI, mm: Kill max_low_pfn_mapped

On Thu, Apr 4, 2013 at 10:36 AM, Tejun Heo <[email protected]> wrote:
> Hello,
>
> On Sat, Mar 09, 2013 at 10:44:30PM -0800, Yinghai Lu wrote:
>> Now we have the arch_pfn_mapped array, and max_low_pfn_mapped should not
>> be used anymore.
>>
>> Users should use arch_pfn_mapped or just 1UL<<(32-PAGE_SHIFT) instead.
>>
>> The only user left is ACPI_INITRD_TABLE_OVERRIDE, and it should not use
>> max_low_pfn_mapped, as the later access is via early_ioremap(). Change it
>> to try below 4G first, and then above 4G.
> ...
>> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
>> index 586e7e9..c08cdb6 100644
>> --- a/drivers/acpi/osl.c
>> +++ b/drivers/acpi/osl.c
>> @@ -624,9 +624,13 @@ void __init acpi_initrd_override(void *data, size_t size)
>> if (table_nr == 0)
>> return;
>>
>> - acpi_tables_addr =
>> - memblock_find_in_range(0, max_low_pfn_mapped << PAGE_SHIFT,
>> - all_tables_size, PAGE_SIZE);
>> + /* under 4G at first, then above 4G */
>> + acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
>> + all_tables_size, PAGE_SIZE);
>> + if (!acpi_tables_addr)
>> + acpi_tables_addr = memblock_find_in_range(0,
>> + ~(phys_addr_t)0,
>> + all_tables_size, PAGE_SIZE);
>
> So, it's changing the allocation from <=4G to <=4G first and then >4G.
> The only explanation given is "as later accessing is using
> early_ioremap()", but I can't see why that can be a reason for that.
> early_ioremap() doesn't care whether the given physaddr is under 4G or
> not, it unconditionally maps it into fixmap, so whether the allocated
> address is below or above 4G doesn't make any difference.
>
> Changing the allowed range of the allocation should be a separate
> patch. It has some chance of its own breakage and the change itself
> isn't really related to this one.

Ok, will separate that "try above 4G" to another patch.

>
> Please try to elaborate the reasoning behind "why", so that readers of
> the description don't have to deduce (oh well, guess) your intentions
> behind the changes. As much as it would help the readers, it'd also
> help you even more as you would have had to explicitly write something
> like "the table is accessed with early_ioremap() so the address
> doesn't need to be restricted under 4G; however, to avoid unnecessary
> remappings, first try <= 4G and then > 4G." Then, you would be
> compelled to check whether the statement you explicitly wrote is true,
> which isn't in this case and you would also realize that the change
> isn't trivial and doesn't really belong with this patch. By not doing
> the due diligence, you're offloading what you should have done to
> others, which isn't very nice.
>
> I think the descriptions are better in this posting than the last time
> but it's still lacking, so, please putfff more effort into describing
> the changes and reasoning behind them.

ok.

Thanks a lot.

Yinghai

2013-04-04 18:27:55

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 06/20] x86, ACPI: Store override acpi tables phys addr in cpio files info array

On Sat, Mar 09, 2013 at 10:44:33PM -0800, Yinghai Lu wrote:
> In 32bit we will find table with phys address during 32bit flat mode
> in head_32.S, because at that time we don't need set page table to
> access initrd.
>
> For copying we could use early_ioremap() with phys directly before mem mapping
> is set.
>
> To keep 32bit and 64bit consistent, use phys_addr for all.
>
> Signed-off-by: Yinghai Lu <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: [email protected]
> ---
> drivers/acpi/osl.c | 14 +++++++++++---
> 1 file changed, 11 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> index d66ae0e..54bcc37 100644
> --- a/drivers/acpi/osl.c
> +++ b/drivers/acpi/osl.c
> @@ -615,7 +615,7 @@ void __init acpi_initrd_override_find(void *data, size_t size)
> table->signature, cpio_path, file.name, table->length);
>
> all_tables_size += table->length;
> - acpi_initrd_files[table_nr].data = file.data;
> + acpi_initrd_files[table_nr].data = (void *)__pa(file.data);
> acpi_initrd_files[table_nr].size = file.size;
> table_nr++;
> }
> @@ -624,7 +624,7 @@ void __init acpi_initrd_override_find(void *data, size_t size)
> void __init acpi_initrd_override_copy(void)
> {
> int no, total_offset = 0;
> - char *p;
> + char *p, *q;
>
> if (!all_tables_size)
> return;
> @@ -654,12 +654,20 @@ void __init acpi_initrd_override_copy(void)
> arch_reserve_mem_area(acpi_tables_addr, all_tables_size);
>
> for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
> + /*
> + * have to use unsigned long, otherwise 32bit spit warning
> + * and it is ok to unsigned long, as bootloader would not
> + * load initrd above 4G for 32bit kernel.
> + */
> + unsigned long addr = (unsigned long)acpi_initrd_files[no].data;

I can't say I like this. It's stuffing phys_addr_t into void *. It
might work okay but the code is a bit misleading / confusing. "void
*" shouldn't contain a physical address. Maybe the alternatives are
uglier, I don't know. If you can think of a reasonable way to not do
this, it would be great.

Thanks.

--
tejun

2013-04-04 18:30:59

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 06/20] x86, ACPI: Store override acpi tables phys addr in cpio files info array

On Thu, Apr 04, 2013 at 11:27:42AM -0700, Tejun Heo wrote:
> > + /*
> > + * have to use unsigned long, otherwise 32bit spit warning
> > + * and it is ok to unsigned long, as bootloader would not
> > + * load initrd above 4G for 32bit kernel.
> > + */
> > + unsigned long addr = (unsigned long)acpi_initrd_files[no].data;
>
> I can't say I like this. It's stuffing phys_addr_t into void *. It
> might work okay but the code is a bit misleading / confusing. "void
> *" shouldn't contain a physical address. Maybe the alternatives are
> uglier, I don't know. If you can think of a reasonable way to not do
> this, it would be great.

Also the comment contradicts with what you wrote in the next patch.

Boot loader could load initrd above max_low_pfn.

Hmmm?

--
tejun

2013-04-04 18:36:05

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v2 07/20] x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode

Hello,

On Sat, Mar 09, 2013 at 10:44:34PM -0800, Yinghai Lu wrote:
> For finding with 32bit, it would be easy to access initrd in 32bit
> flat mode, as we don't need to set page table.
>
> That is from head_32.S, and microcode updating already use this trick.
>
> Need to change acpi_initrd_override_find to use phys to access global
> variables.
>
> Pass is_phys in the function, as we can not use address to decide if it
> is phys or virtual address on 32 bit. Boot loader could load initrd above
> max_low_pfn.
>
> Don't call printk as it uses global variables, so delay print later
> during copying.
>
> Change table_sigs to use stack instead, otherwise it is too messy to change
> string array to phys and still keep offset calculating correct.
> That size is about 36x4 bytes, and it is small to settle in stack.
>
> Also remove "continue" in the macro to make the code more readable.

It'd be nice if the error message can be stored somewhere and then
printed out after the system is in proper address mode if that isn't
too complex to achieve. If it gets too messy, no need to bother.

Thanks.

--
tejun

2013-04-04 19:29:26

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v2 05/20] x86, ACPI: Split acpi_initrd_override to find/copy two functions

On Thu, Apr 4, 2013 at 11:07 AM, Tejun Heo <[email protected]> wrote:
>>
>> Also move down two functions declaration to avoid #ifdef in setup.c
>>
>> ACPI_INITRD_TABLE_OVERRIDE depends one ACPI and BLK_DEV_INITRD.
>> So could move declaration out from #ifdef CONFIG_ACPI protection.
>
> Heh, I couldn't really follow the above. How about something like the
> following.
>
> While a dummy version of acpi_initrd_override() was defined when
> !CONFIG_ACPI_INITRD_TABLE_OVERRIDE, the prototype and dummy version
> were conditionalized inside CONFIG_ACPI. This forced setup_arch() to
> have its own #ifdefs around acpi_initrd_override() as otherwise build
> would fail when !CONFIG_ACPI. Move the prototypes and dummy
> implementations of the newly split functions below CONFIG_ACPI block
> in acpi.h so that we can do away with #ifdefs in its user.

Updated the changelog with your changes.

Thanks

Yinghai

2013-04-04 19:40:21

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v2 06/20] x86, ACPI: Store override acpi tables phys addr in cpio files info array

On Thu, Apr 4, 2013 at 11:30 AM, Tejun Heo <[email protected]> wrote:
> Also the comment contradicts with what you wrote in the next patch.
>
> Boot loader could load initrd above max_low_pfn.

It does not contradict.
This patch says the bootloader would not load the initrd above 4G for a
32-bit kernel, and max_low_pfn is below 4G.

Thanks

Yinghai

2013-04-04 20:03:19

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v2 06/20] x86, ACPI: Store override acpi tables phys addr in cpio files info array

On Thu, Apr 4, 2013 at 11:27 AM, Tejun Heo <[email protected]> wrote:
>> for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
>> + /*
>> + * have to use unsigned long, otherwise 32bit spit warning
>> + * and it is ok to unsigned long, as bootloader would not
>> + * load initrd above 4G for 32bit kernel.
>> + */
>> + unsigned long addr = (unsigned long)acpi_initrd_files[no].data;
>
> I can't say I like this. It's stuffing phys_addr_t into void *. It
> might work okay but the code is a bit misleading / confusing. "void
> *" shouldn't contain a physical address. Maybe the alternatives are
> uglier, I don't know. If you can think of a reasonable way to not do
> this, it would be great.

Please check if you are happy with the attached patch.

-v2: introduce file_pos to save the phys address, instead of abusing
cpio_data, which tj was not happy with.


Attachments:
fix_acpi_override_2.patch (2.33 kB)
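
[For readers without the attachment: presumably the fix stores the physical
address in a dedicated type instead of cpio_data's void *. A guess at the
shape, not the actual patch:]

#include <stdint.h>

typedef uint64_t phys_addr_t;	/* stand-in for the kernel typedef */
#define ACPI_OVERRIDE_TABLES 64

/* hypothetical sketch only -- not the actual attached patch */
static struct file_pos {
	phys_addr_t data;	/* phys addr of the table data in the initrd */
	phys_addr_t size;
} acpi_initrd_files[ACPI_OVERRIDE_TABLES];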

2013-04-04 20:22:59

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v2 07/20] x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode

On Thu, Apr 4, 2013 at 11:35 AM, Tejun Heo <[email protected]> wrote:
>
> It'd be nice if the error message can be stored somewhere and then
> printed out after the system is in proper address mode if that isn't
> too complex to achieve. If it gets too messy, no need to bother.

Maybe not necessary. Later, during copying, another printout is
added for each table that is copied successfully.

@@ -670,6 +700,9 @@ void __init acpi_initrd_override_copy(vo
break;
q = early_ioremap(addr, size);
p = early_ioremap(acpi_tables_addr + total_offset, size);
+ pr_info("%4.4s ACPI table found in initrd [%#010llx-%#010llx]\n",
+ ((struct acpi_table_header *)q)->signature,
+ (u64)addr, (u64)(addr + size - 1));

Thanks

Yinghai

2013-04-04 20:26:54

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v2 08/20] x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c

On 03/10/2013 03:25 AM, Pekka Enberg wrote:
>
> What is preventing us from making the 64-bit variant also work in flat
> mode to make the code consistent and not hiding the differences under
> the rug? What am I missing here?
>

There is no such thing as "flat mode" in 64-bit mode. We use a #PF
handler to emulate it, but we add the normal kernel offset when doing so.

In the 32-bit case the problem is that the kernel offset is not
available while in linear mode. It *could* be created using segment
bases, but that would break Xen, I'm pretty sure, and possibly some
other too-clever environments.

-hpa
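
[To make the #PF trick concrete, here is a userspace toy of the idea. It is
purely a simulation; the real handler lives in arch/x86/kernel/head64.c and
installs actual page table entries for the faulting address:]

#include <stdio.h>
#include <stdbool.h>

#define PMD_MASK (~0x1fffffUL)		/* 2 MiB granularity */

static unsigned long mapped[16];	/* toy "page table" of 2M slots */
static int nmapped;

static bool is_mapped(unsigned long addr)
{
	int i;

	for (i = 0; i < nmapped; i++)
		if (mapped[i] == (addr & PMD_MASK))
			return true;
	return false;
}

/* toy #PF handler: map the 2M region covering the faulting address */
static void toy_pf_handler(unsigned long cr2)
{
	mapped[nmapped++] = cr2 & PMD_MASK;
	printf("#PF at %#lx -> mapped [%#lx,+2M)\n", cr2, cr2 & PMD_MASK);
}

static void touch(unsigned long addr)
{
	if (!is_mapped(addr))
		toy_pf_handler(addr);	/* then the access is retried */
	printf("access %#lx ok\n", addr);
}

int main(void)
{
	touch(0x123456789UL);	/* e.g. an initrd loaded above 4G */
	touch(0x123456f00UL);	/* same 2M region: no second fault */
	return 0;
}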