2013-06-13 13:28:09

by Tang Chen

Subject: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

From: Yinghai Lu <[email protected]>

No offence intended; I am just rebasing and resending Yinghai's patches
to help push this functionality forward faster.
I have also improved the comments in the patch logs.


One commit that tried to parse SRAT early got reverted before v3.9-rc1.

| commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
| Author: Tang Chen <[email protected]>
| Date: Fri Feb 22 16:33:44 2013 -0800
|
| acpi, memory-hotplug: parse SRAT before memblock is ready

It broke several things, such as the ACPI table override and the fallback paths.

This patchset is a clean implementation that parses numa info early.
1. Keep the acpi table initrd override working by splitting finding
from copying.
Finding is done at the head_32.S and head64.c stage:
in head_32.S, the initrd is accessed in 32bit flat mode via its
physical address;
in head64.c, the initrd is accessed via the kernel low mapping
addresses, with the help of the #PF-set page tables.
Copying is done with early_ioremap just after memblock is set up.
2. Keep the fallback paths working: numaq, ACPI, amd_numa and dummy.
Separate initmem_init into two stages:
early_initmem_init only extracts numa info early into numa_meminfo;
initmem_init keeps the SLIT and emulation handling.
3. Keep the rest of the old code flow untouched, like relocate_initrd
and initmem_init.
early_initmem_init takes the old init_mem_mapping position.
It calls early_x86_numa_init and init_mem_mapping for every node.
For 64bit, we avoid a size limit on the initrd, as relocate_initrd
still runs after init_mem_mapping for all memory.
4. The last patch tries to put the page tables on the local node, so
that memory hotplug will be happy.

In short, early_initmem_init parses numa info early and calls
init_mem_mapping to set up page tables for every node's memory.

The patchset can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm

and it is based on today's Linus tree.

-v2: Address tj's review and split patches into smaller ones.
-v3: Add some Acked-by from tj; also stop abusing cpio_data for the acpi files info.
-v4: Fix one typo found by Tang Chen.
Also add Tested-by from Thomas Renninger and Tony.
-v5: Rebase to Linux-3.10.0-rc5 (patches 5 and 21 have been rebased).
Improve comments in the patch logs.
Improve the comment in init_mem_mapping() in patch 21.

Yinghai Lu (22):
x86: Change get_ramdisk_{image|size}() to global
x86, microcode: Use common get_ramdisk_{image|size}()
x86, ACPI, mm: Kill max_low_pfn_mapped
x86, ACPI: Search buffer above 4GB in a second try for acpi initrd
table override
x86, ACPI: Increase acpi initrd override tables number limit
x86, ACPI: Split acpi_initrd_override() into find/copy two steps
x86, ACPI: Store override acpi tables phys addr in cpio files info
array
x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode
x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c
x86, mm, numa: Move two functions calling on successful path later
x86, mm, numa: Call numa_meminfo_cover_memory() checking early
x86, mm, numa: Move node_map_pfn_alignment() to x86
x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment
x86, mm, numa: Set memblock nid later
x86, mm, numa: Move node_possible_map setting later
x86, mm, numa: Move numa emulation handling down.
x86, ACPI, numa, ia64: split SLIT handling out
x86, mm, numa: Add early_initmem_init() stub
x86, mm: Parse numa info earlier
x86, mm: Add comments for step_size shift
x86, mm: Make init_mem_mapping be able to be called several times
x86, mm, numa: Put pagetable on local node ram for 64bit

arch/ia64/kernel/setup.c | 4 +-
arch/x86/include/asm/acpi.h | 3 +-
arch/x86/include/asm/page_types.h | 2 +-
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/include/asm/setup.h | 9 ++
arch/x86/kernel/head64.c | 2 +
arch/x86/kernel/head_32.S | 4 +
arch/x86/kernel/microcode_intel_early.c | 8 +-
arch/x86/kernel/setup.c | 86 +++++++-----
arch/x86/mm/init.c | 121 +++++++++++-----
arch/x86/mm/numa.c | 240 ++++++++++++++++++++++++-------
arch/x86/mm/numa_emulation.c | 2 +-
arch/x86/mm/numa_internal.h | 2 +
arch/x86/mm/srat.c | 11 +-
drivers/acpi/numa.c | 13 +-
drivers/acpi/osl.c | 138 +++++++++++++------
include/linux/acpi.h | 20 ++--
include/linux/mm.h | 3 -
mm/page_alloc.c | 52 +-------
19 files changed, 476 insertions(+), 246 deletions(-)


2013-06-13 13:28:18

by Tang Chen

Subject: [Part1 PATCH v5 08/22] x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode

From: Yinghai Lu <[email protected]>

For the finding procedure, it is easy to access the initrd in 32bit flat
mode, as we don't need to set up page tables. That path starts from
head_32.S, and early microcode updating already uses this trick.

This patch does the following:

1. Change acpi_initrd_override_find to use physical addresses to access
global variables.

2. Pass a bool parameter "is_phys" to acpi_initrd_override_find(),
because on 32bit we cannot tell whether an address is physical or
virtual from the address itself. The boot loader could load the
initrd above max_low_pfn.

3. Put table_sigs[] on the stack; otherwise it is too messy to convert
the string array to physical addresses and still keep the offset
calculation correct. The array is about 36x4 bytes, small enough to
live on the stack.

4. Also rewrite the INVALID_TABLE macro as a do {...} while (0) block
so that it is more readable and composes safely with the surrounding
control flow.

NOTE: Don't call printk in the finding step, as it uses global
variables; printing is delayed until the copying step.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/kernel/setup.c | 2 +-
drivers/acpi/osl.c | 85 ++++++++++++++++++++++++++++++++--------------
include/linux/acpi.h | 5 ++-
3 files changed, 63 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 42f584c..142e042 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1120,7 +1120,7 @@ void __init setup_arch(char **cmdline_p)
reserve_initrd();

acpi_initrd_override_find((void *)initrd_start,
- initrd_end - initrd_start);
+ initrd_end - initrd_start, false);
acpi_initrd_override_copy();

reserve_crashkernel();
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 42f79e3..23578e8 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -551,21 +551,9 @@ u8 __init acpi_table_checksum(u8 *buffer, u32 length)
return sum;
}

-/* All but ACPI_SIG_RSDP and ACPI_SIG_FACS: */
-static const char * const table_sigs[] = {
- ACPI_SIG_BERT, ACPI_SIG_CPEP, ACPI_SIG_ECDT, ACPI_SIG_EINJ,
- ACPI_SIG_ERST, ACPI_SIG_HEST, ACPI_SIG_MADT, ACPI_SIG_MSCT,
- ACPI_SIG_SBST, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_ASF,
- ACPI_SIG_BOOT, ACPI_SIG_DBGP, ACPI_SIG_DMAR, ACPI_SIG_HPET,
- ACPI_SIG_IBFT, ACPI_SIG_IVRS, ACPI_SIG_MCFG, ACPI_SIG_MCHI,
- ACPI_SIG_SLIC, ACPI_SIG_SPCR, ACPI_SIG_SPMI, ACPI_SIG_TCPA,
- ACPI_SIG_UEFI, ACPI_SIG_WAET, ACPI_SIG_WDAT, ACPI_SIG_WDDT,
- ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT,
- ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };
-
/* Non-fatal errors: Affected tables/files are ignored */
#define INVALID_TABLE(x, path, name) \
- { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); continue; }
+ do { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); } while (0)

#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)

@@ -576,17 +564,45 @@ struct file_pos {
};
static struct file_pos __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];

-void __init acpi_initrd_override_find(void *data, size_t size)
+/*
+ * acpi_initrd_override_find() is called from head_32.S and head64.c.
+ * head_32.S calling path is with 32bit flat mode, so we can access
+ * initrd early without setting pagetable or relocating initrd. For
+ * global variables accessing, we need to use phys address instead of
+ * kernel virtual address, try to put table_sigs string array in stack,
+ * so avoid switching for it.
+ * Also don't call printk as it uses global variables.
+ */
+void __init acpi_initrd_override_find(void *data, size_t size, bool is_phys)
{
int sig, no, table_nr = 0;
long offset = 0;
struct acpi_table_header *table;
char cpio_path[32] = "kernel/firmware/acpi/";
struct cpio_data file;
+ struct file_pos *files = acpi_initrd_files;
+ int *all_tables_size_p = &all_tables_size;
+
+ /* All but ACPI_SIG_RSDP and ACPI_SIG_FACS: */
+ char *table_sigs[] = {
+ ACPI_SIG_BERT, ACPI_SIG_CPEP, ACPI_SIG_ECDT, ACPI_SIG_EINJ,
+ ACPI_SIG_ERST, ACPI_SIG_HEST, ACPI_SIG_MADT, ACPI_SIG_MSCT,
+ ACPI_SIG_SBST, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_ASF,
+ ACPI_SIG_BOOT, ACPI_SIG_DBGP, ACPI_SIG_DMAR, ACPI_SIG_HPET,
+ ACPI_SIG_IBFT, ACPI_SIG_IVRS, ACPI_SIG_MCFG, ACPI_SIG_MCHI,
+ ACPI_SIG_SLIC, ACPI_SIG_SPCR, ACPI_SIG_SPMI, ACPI_SIG_TCPA,
+ ACPI_SIG_UEFI, ACPI_SIG_WAET, ACPI_SIG_WDAT, ACPI_SIG_WDDT,
+ ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT,
+ ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };

if (data == NULL || size == 0)
return;

+ if (is_phys) {
+ files = (struct file_pos *)__pa_symbol(acpi_initrd_files);
+ all_tables_size_p = (int *)__pa_symbol(&all_tables_size);
+ }
+
for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
file = find_cpio_data(cpio_path, data, size, &offset);
if (!file.data)
@@ -595,9 +611,12 @@ void __init acpi_initrd_override_find(void *data, size_t size)
data += offset;
size -= offset;

- if (file.size < sizeof(struct acpi_table_header))
- INVALID_TABLE("Table smaller than ACPI header",
+ if (file.size < sizeof(struct acpi_table_header)) {
+ if (!is_phys)
+ INVALID_TABLE("Table smaller than ACPI header",
cpio_path, file.name);
+ continue;
+ }

table = file.data;

@@ -605,22 +624,33 @@ void __init acpi_initrd_override_find(void *data, size_t size)
if (!memcmp(table->signature, table_sigs[sig], 4))
break;

- if (!table_sigs[sig])
- INVALID_TABLE("Unknown signature",
+ if (!table_sigs[sig]) {
+ if (!is_phys)
+ INVALID_TABLE("Unknown signature",
cpio_path, file.name);
- if (file.size != table->length)
- INVALID_TABLE("File length does not match table length",
+ continue;
+ }
+ if (file.size != table->length) {
+ if (!is_phys)
+ INVALID_TABLE("File length does not match table length",
cpio_path, file.name);
- if (acpi_table_checksum(file.data, table->length))
- INVALID_TABLE("Bad table checksum",
+ continue;
+ }
+ if (acpi_table_checksum(file.data, table->length)) {
+ if (!is_phys)
+ INVALID_TABLE("Bad table checksum",
cpio_path, file.name);
+ continue;
+ }

- pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
+ if (!is_phys)
+ pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
table->signature, cpio_path, file.name, table->length);

- all_tables_size += table->length;
- acpi_initrd_files[table_nr].data = __pa_nodebug(file.data);
- acpi_initrd_files[table_nr].size = file.size;
+ (*all_tables_size_p) += table->length;
+ files[table_nr].data = is_phys ? (phys_addr_t)file.data :
+ __pa_nodebug(file.data);
+ files[table_nr].size = file.size;
table_nr++;
}
}
@@ -670,6 +700,9 @@ void __init acpi_initrd_override_copy(void)
break;
q = early_ioremap(addr, size);
p = early_ioremap(acpi_tables_addr + total_offset, size);
+ pr_info("%4.4s ACPI table found in initrd [%#010llx-%#010llx]\n",
+ ((struct acpi_table_header *)q)->signature,
+ (u64)addr, (u64)(addr + size - 1));
memcpy(p, q, size);
early_iounmap(q, size);
early_iounmap(p, size);
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 8dd917b..4e3731b 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -469,10 +469,11 @@ static inline bool acpi_driver_match_device(struct device *dev,
#endif /* !CONFIG_ACPI */

#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
-void acpi_initrd_override_find(void *data, size_t size);
+void acpi_initrd_override_find(void *data, size_t size, bool is_phys);
void acpi_initrd_override_copy(void);
#else
-static inline void acpi_initrd_override_find(void *data, size_t size) { }
+static inline void acpi_initrd_override_find(void *data, size_t size,
+ bool is_phys) { }
static inline void acpi_initrd_override_copy(void) { }
#endif

--
1.7.1

2013-06-13 13:28:38

by Tang Chen

Subject: [Part1 PATCH v5 18/22] x86, mm, numa: Add early_initmem_init() stub

From: Yinghai Lu <[email protected]>

Introduce early_initmem_init(), which calls early_x86_numa_init() and
will be used to parse numa info earlier.

A later patch will make it call init_mem_mapping for all the nodes.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/include/asm/page_types.h | 1 +
arch/x86/kernel/setup.c | 1 +
arch/x86/mm/init.c | 6 ++++++
arch/x86/mm/numa.c | 7 +++++--
4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index b012b82..d04dd8c 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -55,6 +55,7 @@ bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
extern unsigned long init_memory_mapping(unsigned long start,
unsigned long end);

+void early_initmem_init(void);
extern void initmem_init(void);

#endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d11b1b7..301165e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1162,6 +1162,7 @@ void __init setup_arch(char **cmdline_p)

early_acpi_boot_init();

+ early_initmem_init();
initmem_init();
memblock_find_dma_reserve();

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 8554656..3c21f16 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -467,6 +467,12 @@ void __init init_mem_mapping(void)
early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
}

+#ifndef CONFIG_NUMA
+void __init early_initmem_init(void)
+{
+}
+#endif
+
/*
* devmem_is_allowed() checks to see if /dev/mem access to a certain address
* is valid. The argument is a physical page number.
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 630e09f..7d76936 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -665,13 +665,16 @@ static void __init early_x86_numa_init(void)
numa_init(dummy_numa_init);
}

+void __init early_initmem_init(void)
+{
+ early_x86_numa_init();
+}
+
void __init x86_numa_init(void)
{
int i, nid;
struct numa_meminfo *mi = &numa_meminfo;

- early_x86_numa_init();
-
#ifdef CONFIG_ACPI_NUMA
if (srat_used)
x86_acpi_numa_init_slit();
--
1.7.1

2013-06-13 13:28:14

by Tang Chen

Subject: [Part1 PATCH v5 10/22] x86, mm, numa: Move two functions calling on successful path later

From: Yinghai Lu <[email protected]>

We need to have numa info ready before init_mem_mapping(), so that we
can call init_mem_mapping per node, and also trim node memory ranges to
a big alignment.

Currently, parsing numa info needs to allocate some buffers, so it has
to be called after init_mem_mapping. So split the numa info parsing
procedure into two steps:
- The first step will be called before init_mem_mapping, and it
should not need to allocate buffers.
- The second step will contain all the buffer-related code and be
executed later.

At last we will have early_initmem_init() and initmem_init().

This patch implements only the first step.

setup_node_data() and numa_init_array() are only called on the
successful path, so we can move these two calls to x86_numa_init().
That will also make numa_init() smaller and more readable.

-v2: Remove the online_node_map clearing in numa_init(), as it is only
set in setup_node_data(), at the end of the successful path.

Signed-off-by: Yinghai Lu <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 69 +++++++++++++++++++++++++++++----------------------
1 files changed, 39 insertions(+), 30 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index a71c4e2..07ae800 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -477,7 +477,7 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
static int __init numa_register_memblks(struct numa_meminfo *mi)
{
unsigned long uninitialized_var(pfn_align);
- int i, nid;
+ int i;

/* Account for nodes with cpus and no memory */
node_possible_map = numa_nodes_parsed;
@@ -506,24 +506,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
if (!numa_meminfo_cover_memory(mi))
return -EINVAL;

- /* Finally register nodes. */
- for_each_node_mask(nid, node_possible_map) {
- u64 start = PFN_PHYS(max_pfn);
- u64 end = 0;
-
- for (i = 0; i < mi->nr_blks; i++) {
- if (nid != mi->blk[i].nid)
- continue;
- start = min(mi->blk[i].start, start);
- end = max(mi->blk[i].end, end);
- }
-
- if (start < end)
- setup_node_data(nid, start, end);
- }
-
- /* Dump memblock with node info and return. */
- memblock_dump_all();
return 0;
}

@@ -559,7 +541,6 @@ static int __init numa_init(int (*init_func)(void))

nodes_clear(numa_nodes_parsed);
nodes_clear(node_possible_map);
- nodes_clear(node_online_map);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
WARN_ON(memblock_set_node(0, ULLONG_MAX, MAX_NUMNODES));
numa_reset_distance();
@@ -577,15 +558,6 @@ static int __init numa_init(int (*init_func)(void))
if (ret < 0)
return ret;

- for (i = 0; i < nr_cpu_ids; i++) {
- int nid = early_cpu_to_node(i);
-
- if (nid == NUMA_NO_NODE)
- continue;
- if (!node_online(nid))
- numa_clear_node(i);
- }
- numa_init_array();
return 0;
}

@@ -618,7 +590,7 @@ static int __init dummy_numa_init(void)
* last fallback is dummy single node config encomapssing whole memory and
* never fails.
*/
-void __init x86_numa_init(void)
+static void __init early_x86_numa_init(void)
{
if (!numa_off) {
#ifdef CONFIG_X86_NUMAQ
@@ -638,6 +610,43 @@ void __init x86_numa_init(void)
numa_init(dummy_numa_init);
}

+void __init x86_numa_init(void)
+{
+ int i, nid;
+ struct numa_meminfo *mi = &numa_meminfo;
+
+ early_x86_numa_init();
+
+ /* Finally register nodes. */
+ for_each_node_mask(nid, node_possible_map) {
+ u64 start = PFN_PHYS(max_pfn);
+ u64 end = 0;
+
+ for (i = 0; i < mi->nr_blks; i++) {
+ if (nid != mi->blk[i].nid)
+ continue;
+ start = min(mi->blk[i].start, start);
+ end = max(mi->blk[i].end, end);
+ }
+
+ if (start < end)
+ setup_node_data(nid, start, end); /* online is set */
+ }
+
+ /* Dump memblock with node info */
+ memblock_dump_all();
+
+ for (i = 0; i < nr_cpu_ids; i++) {
+ int nid = early_cpu_to_node(i);
+
+ if (nid == NUMA_NO_NODE)
+ continue;
+ if (!node_online(nid))
+ numa_clear_node(i);
+ }
+ numa_init_array();
+}
+
static __init int find_near_online_node(int node)
{
int n, val;
--
1.7.1

2013-06-13 13:31:09

by Tang Chen

Subject: [Part1 PATCH v5 17/22] x86, ACPI, numa, ia64: split SLIT handling out

From: Yinghai Lu <[email protected]>

We need to handle the SLIT later, as it needs to allocate a buffer for
the distance matrix. Also, we do not need SLIT info before
init_mem_mapping(). So move the SLIT parsing procedure later.

x86_acpi_numa_init() is split into x86_acpi_numa_init_srat() and
x86_acpi_numa_init_slit().

This should not break ia64: acpi_numa_init is replaced with
acpi_numa_init_srat/acpi_numa_init_slit/acpi_numa_arch_fixup.

-v2: Change the names to acpi_numa_init_srat/acpi_numa_init_slit
according to tj's suggestion.
Remove the numa_reset_distance() call in numa_init(), as we now only
set distances in the SLIT handling.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: [email protected]
Tested-by: Tony Luck <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/ia64/kernel/setup.c | 4 +++-
arch/x86/include/asm/acpi.h | 3 ++-
arch/x86/mm/numa.c | 14 ++++++++++++--
arch/x86/mm/srat.c | 11 +++++++----
drivers/acpi/numa.c | 13 +++++++------
include/linux/acpi.h | 3 ++-
6 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
index 13bfdd2..5f7db4a 100644
--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -558,7 +558,9 @@ setup_arch (char **cmdline_p)
acpi_table_init();
early_acpi_boot_init();
# ifdef CONFIG_ACPI_NUMA
- acpi_numa_init();
+ acpi_numa_init_srat();
+ acpi_numa_init_slit();
+ acpi_numa_arch_fixup();
# ifdef CONFIG_ACPI_HOTPLUG_CPU
prefill_possible_map();
# endif
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index b31bf97..651db0b 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -178,7 +178,8 @@ static inline void disable_acpi(void) { }

#ifdef CONFIG_ACPI_NUMA
extern int acpi_numa;
-extern int x86_acpi_numa_init(void);
+int x86_acpi_numa_init_srat(void);
+void x86_acpi_numa_init_slit(void);
#endif /* CONFIG_ACPI_NUMA */

#define acpi_unlazy_tlb(x) leave_mm(x)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 3254f22..630e09f 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -595,7 +595,6 @@ static int __init numa_init(int (*init_func)(void))

nodes_clear(numa_nodes_parsed);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
- numa_reset_distance();

ret = init_func();
if (ret < 0)
@@ -633,6 +632,10 @@ static int __init dummy_numa_init(void)
return 0;
}

+#ifdef CONFIG_ACPI_NUMA
+static bool srat_used __initdata;
+#endif
+
/**
* x86_numa_init - Initialize NUMA
*
@@ -648,8 +651,10 @@ static void __init early_x86_numa_init(void)
return;
#endif
#ifdef CONFIG_ACPI_NUMA
- if (!numa_init(x86_acpi_numa_init))
+ if (!numa_init(x86_acpi_numa_init_srat)) {
+ srat_used = true;
return;
+ }
#endif
#ifdef CONFIG_AMD_NUMA
if (!numa_init(amd_numa_init))
@@ -667,6 +672,11 @@ void __init x86_numa_init(void)

early_x86_numa_init();

+#ifdef CONFIG_ACPI_NUMA
+ if (srat_used)
+ x86_acpi_numa_init_slit();
+#endif
+
numa_emulation(&numa_meminfo, numa_distance_cnt);

node_possible_map = numa_nodes_parsed;
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index cdd0da9..443f9ef 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -185,14 +185,17 @@ out_err:
return -1;
}

-void __init acpi_numa_arch_fixup(void) {}
-
-int __init x86_acpi_numa_init(void)
+int __init x86_acpi_numa_init_srat(void)
{
int ret;

- ret = acpi_numa_init();
+ ret = acpi_numa_init_srat();
if (ret < 0)
return ret;
return srat_disabled() ? -EINVAL : 0;
}
+
+void __init x86_acpi_numa_init_slit(void)
+{
+ acpi_numa_init_slit();
+}
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 33e609f..6460db4 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -282,7 +282,7 @@ acpi_table_parse_srat(enum acpi_srat_type id,
handler, max_entries);
}

-int __init acpi_numa_init(void)
+int __init acpi_numa_init_srat(void)
{
int cnt = 0;

@@ -303,11 +303,6 @@ int __init acpi_numa_init(void)
NR_NODE_MEMBLKS);
}

- /* SLIT: System Locality Information Table */
- acpi_table_parse(ACPI_SIG_SLIT, acpi_parse_slit);
-
- acpi_numa_arch_fixup();
-
if (cnt < 0)
return cnt;
else if (!parsed_numa_memblks)
@@ -315,6 +310,12 @@ int __init acpi_numa_init(void)
return 0;
}

+void __init acpi_numa_init_slit(void)
+{
+ /* SLIT: System Locality Information Table */
+ acpi_table_parse(ACPI_SIG_SLIT, acpi_parse_slit);
+}
+
int acpi_get_pxm(acpi_handle h)
{
unsigned long long pxm;
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 4e3731b..92463b5 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -85,7 +85,8 @@ int early_acpi_boot_init(void);
int acpi_boot_init (void);
void acpi_boot_table_init (void);
int acpi_mps_check (void);
-int acpi_numa_init (void);
+int acpi_numa_init_srat(void);
+void acpi_numa_init_slit(void);

int acpi_table_init (void);
int acpi_table_parse(char *id, acpi_tbl_table_handler handler);
--
1.7.1

2013-06-13 13:31:34

by Tang Chen

Subject: [Part1 PATCH v5 06/22] x86, ACPI: Split acpi_initrd_override() into find/copy two steps

From: Yinghai Lu <[email protected]>

To parse the SRAT before memblock starts to work, we need to move the
acpi table probing procedure earlier. But the acpi_initrd_table_override
procedure must be executed before acpi table probing, so we need to
move it earlier too, which means moving it to before memblock starts
to work.

But the acpi_initrd_table_override procedure needs memblock to allocate
a buffer for the ACPI tables. To solve this problem, we split the
acpi_initrd_override() procedure into two steps: finding and copying.
Finding should be done as early as possible; copying should be done
after memblock is ready.

Currently, the acpi_initrd_table_override procedure is executed after
init_mem_mapping() and relocate_initrd(), so it can scan the initrd and
copy acpi tables using the initrd's kernel virtual addresses.

Once we split it into finding and copying steps, it can be done like
the following:

Finding can be done in head_32.S and head64.c, just like the early
microcode scanning. In head_32.S we are in 32bit flat mode, so we don't
need to set up page tables to access the initrd. In head64.c, the
#PF-set page tables let us access the initrd with kernel low mapping
addresses.

Copying needs to be done just after memblock is ready, because it needs
memblock to allocate a buffer for the new acpi tables. It should also
be done before acpi table probing, and we need early_ioremap to access
the source and target ranges, as init_mem_mapping is not called yet.

While a dummy version of acpi_initrd_override() was defined when
!CONFIG_ACPI_INITRD_TABLE_OVERRIDE, the prototype and dummy version
were conditionalized inside CONFIG_ACPI. This forced setup_arch() to
have its own #ifdefs around acpi_initrd_override(), as otherwise the
build would fail when !CONFIG_ACPI. Move the prototypes and dummy
implementations of the newly split functions out of the CONFIG_ACPI
block in acpi.h so that we can throw away the #ifdefs from its users.

-v2: Split one patch out according to tj.
Also don't pass table_nr around.
-v3: Add tj's changelog about moving the declarations down past the
#ifdef CONFIG_ACPI block in acpi.h to avoid #ifdefs in setup.c.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Acked-by: Tejun Heo <[email protected]>
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/kernel/setup.c | 6 +++---
drivers/acpi/osl.c | 18 +++++++++++++-----
include/linux/acpi.h | 16 ++++++++--------
3 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 6ca5f2c..42f584c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1119,9 +1119,9 @@ void __init setup_arch(char **cmdline_p)

reserve_initrd();

-#if defined(CONFIG_ACPI) && defined(CONFIG_BLK_DEV_INITRD)
- acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start);
-#endif
+ acpi_initrd_override_find((void *)initrd_start,
+ initrd_end - initrd_start);
+ acpi_initrd_override_copy();

reserve_crashkernel();

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 53dd490..6ab6c54 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -572,14 +572,13 @@ static const char * const table_sigs[] = {
#define ACPI_OVERRIDE_TABLES 64
static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];

-void __init acpi_initrd_override(void *data, size_t size)
+void __init acpi_initrd_override_find(void *data, size_t size)
{
- int sig, no, table_nr = 0, total_offset = 0;
+ int sig, no, table_nr = 0;
long offset = 0;
struct acpi_table_header *table;
char cpio_path[32] = "kernel/firmware/acpi/";
struct cpio_data file;
- char *p;

if (data == NULL || size == 0)
return;
@@ -620,7 +619,14 @@ void __init acpi_initrd_override(void *data, size_t size)
acpi_initrd_files[table_nr].size = file.size;
table_nr++;
}
- if (table_nr == 0)
+}
+
+void __init acpi_initrd_override_copy(void)
+{
+ int no, total_offset = 0;
+ char *p;
+
+ if (!all_tables_size)
return;

/* under 4G at first, then above 4G */
@@ -652,9 +658,11 @@ void __init acpi_initrd_override(void *data, size_t size)
* tables at one time, we will hit the limit. So we need to map tables
* one by one during copying.
*/
- for (no = 0; no < table_nr; no++) {
+ for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
phys_addr_t size = acpi_initrd_files[no].size;

+ if (!size)
+ break;
p = early_ioremap(acpi_tables_addr + total_offset, size);
memcpy(p, acpi_initrd_files[no].data, size);
early_iounmap(p, size);
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 17b5b59..8dd917b 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -79,14 +79,6 @@ typedef int (*acpi_tbl_table_handler)(struct acpi_table_header *table);
typedef int (*acpi_tbl_entry_handler)(struct acpi_subtable_header *header,
const unsigned long end);

-#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
-void acpi_initrd_override(void *data, size_t size);
-#else
-static inline void acpi_initrd_override(void *data, size_t size)
-{
-}
-#endif
-
char * __acpi_map_table (unsigned long phys_addr, unsigned long size);
void __acpi_unmap_table(char *map, unsigned long size);
int early_acpi_boot_init(void);
@@ -476,6 +468,14 @@ static inline bool acpi_driver_match_device(struct device *dev,

#endif /* !CONFIG_ACPI */

+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void acpi_initrd_override_find(void *data, size_t size);
+void acpi_initrd_override_copy(void);
+#else
+static inline void acpi_initrd_override_find(void *data, size_t size) { }
+static inline void acpi_initrd_override_copy(void) { }
+#endif
+
#ifdef CONFIG_ACPI
void acpi_os_set_prepare_sleep(int (*func)(u8 sleep_state,
u32 pm1a_ctrl, u32 pm1b_ctrl));
--
1.7.1

2013-06-13 13:28:12

by Tang Chen

Subject: [Part1 PATCH v5 01/22] x86: Change get_ramdisk_{image|size}() to global

From: Yinghai Lu <[email protected]>

This patch does two things:
1. Change get_ramdisk_image() and get_ramdisk_size() to global.
2. Make get_ramdisk_image() and get_ramdisk_size() take a
boot_params pointer parameter.

The whole patchset tries to split the ACPI initrd table override
procedure into two steps: finding and copying.
The finding step is done at the head_32.S and head64.c stage, so we
need to call get_ramdisk_image() and get_ramdisk_size() in these
two files.

Also, in head_32.S, boot_params can only be accessed via its physical
address while in 32bit flat mode, so make get_ramdisk_image() and
get_ramdisk_size() take a boot_params pointer, so that we can pass
a physical address to the code in head_32.S.

Signed-off-by: Yinghai Lu <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/include/asm/setup.h | 3 +++
arch/x86/kernel/setup.c | 28 ++++++++++++++--------------
2 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index b7bf350..4f71d48 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -106,6 +106,9 @@ void *extend_brk(size_t size, size_t align);
RESERVE_BRK(name, sizeof(type) * entries)

extern void probe_roms(void);
+u64 get_ramdisk_image(struct boot_params *bp);
+u64 get_ramdisk_size(struct boot_params *bp);
+
#ifdef __i386__

void __init i386_start_kernel(void);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 56f7fcf..66ab495 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -297,19 +297,19 @@ static void __init reserve_brk(void)

#ifdef CONFIG_BLK_DEV_INITRD

-static u64 __init get_ramdisk_image(void)
+u64 __init get_ramdisk_image(struct boot_params *bp)
{
- u64 ramdisk_image = boot_params.hdr.ramdisk_image;
+ u64 ramdisk_image = bp->hdr.ramdisk_image;

- ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32;
+ ramdisk_image |= (u64)bp->ext_ramdisk_image << 32;

return ramdisk_image;
}
-static u64 __init get_ramdisk_size(void)
+u64 __init get_ramdisk_size(struct boot_params *bp)
{
- u64 ramdisk_size = boot_params.hdr.ramdisk_size;
+ u64 ramdisk_size = bp->hdr.ramdisk_size;

- ramdisk_size |= (u64)boot_params.ext_ramdisk_size << 32;
+ ramdisk_size |= (u64)bp->ext_ramdisk_size << 32;

return ramdisk_size;
}
@@ -318,8 +318,8 @@ static u64 __init get_ramdisk_size(void)
static void __init relocate_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = get_ramdisk_image();
- u64 ramdisk_size = get_ramdisk_size();
+ u64 ramdisk_image = get_ramdisk_image(&boot_params);
+ u64 ramdisk_size = get_ramdisk_size(&boot_params);
u64 area_size = PAGE_ALIGN(ramdisk_size);
u64 ramdisk_here;
unsigned long slop, clen, mapaddr;
@@ -358,8 +358,8 @@ static void __init relocate_initrd(void)
ramdisk_size -= clen;
}

- ramdisk_image = get_ramdisk_image();
- ramdisk_size = get_ramdisk_size();
+ ramdisk_image = get_ramdisk_image(&boot_params);
+ ramdisk_size = get_ramdisk_size(&boot_params);
printk(KERN_INFO "Move RAMDISK from [mem %#010llx-%#010llx] to"
" [mem %#010llx-%#010llx]\n",
ramdisk_image, ramdisk_image + ramdisk_size - 1,
@@ -369,8 +369,8 @@ static void __init relocate_initrd(void)
static void __init early_reserve_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = get_ramdisk_image();
- u64 ramdisk_size = get_ramdisk_size();
+ u64 ramdisk_image = get_ramdisk_image(&boot_params);
+ u64 ramdisk_size = get_ramdisk_size(&boot_params);
u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);

if (!boot_params.hdr.type_of_loader ||
@@ -382,8 +382,8 @@ static void __init early_reserve_initrd(void)
static void __init reserve_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = get_ramdisk_image();
- u64 ramdisk_size = get_ramdisk_size();
+ u64 ramdisk_image = get_ramdisk_image(&boot_params);
+ u64 ramdisk_size = get_ramdisk_size(&boot_params);
u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);
u64 mapped_size;

--
1.7.1

2013-06-13 13:31:57

by Tang Chen

[permalink] [raw]
Subject: [Part1 PATCH v5 03/22] x86, ACPI, mm: Kill max_low_pfn_mapped

From: Yinghai Lu <[email protected]>

Now that we have the pfn_mapped[] array, max_low_pfn_mapped should
not be used anymore. Users should use pfn_mapped[] or just
1UL<<(32-PAGE_SHIFT) instead.

The only remaining user of max_low_pfn_mapped is
ACPI_INITRD_TABLE_OVERRIDE. We can change it to use
1UL<<(32-PAGE_SHIFT), i.e. the range under 4G.

Known problem:
There is another user of max_low_pfn_mapped: the i915 device driver.
But that code is commented out by a pair of "#if 0 ... #endif".
It is not clear why the driver developers did that.

-v2: Leave alone max_low_pfn_mapped in i915 code according to tj.

Suggested-by: H. Peter Anvin <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: [email protected]
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/include/asm/page_types.h | 1 -
arch/x86/kernel/setup.c | 4 +---
arch/x86/mm/init.c | 4 ----
drivers/acpi/osl.c | 6 +++---
4 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index 54c9787..b012b82 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -43,7 +43,6 @@

extern int devmem_is_allowed(unsigned long pagenr);

-extern unsigned long max_low_pfn_mapped;
extern unsigned long max_pfn_mapped;

static inline phys_addr_t get_max_mapped(void)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 66ab495..6ca5f2c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -112,13 +112,11 @@
#include <asm/prom.h>

/*
- * max_low_pfn_mapped: highest direct mapped pfn under 4GB
- * max_pfn_mapped: highest direct mapped pfn over 4GB
+ * max_pfn_mapped: highest direct mapped pfn
*
* The direct mapping only covers E820_RAM regions, so the ranges and gaps are
* represented by pfn_mapped
*/
-unsigned long max_low_pfn_mapped;
unsigned long max_pfn_mapped;

#ifdef CONFIG_DMI
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index eaac174..8554656 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -313,10 +313,6 @@ static void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
nr_pfn_mapped = clean_sort_range(pfn_mapped, E820_X_MAX);

max_pfn_mapped = max(max_pfn_mapped, end_pfn);
-
- if (start_pfn < (1UL<<(32-PAGE_SHIFT)))
- max_low_pfn_mapped = max(max_low_pfn_mapped,
- min(end_pfn, 1UL<<(32-PAGE_SHIFT)));
}

bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index e721863..93e3194 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -624,9 +624,9 @@ void __init acpi_initrd_override(void *data, size_t size)
if (table_nr == 0)
return;

- acpi_tables_addr =
- memblock_find_in_range(0, max_low_pfn_mapped << PAGE_SHIFT,
- all_tables_size, PAGE_SIZE);
+ /* under 4G at first, then above 4G */
+ acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
+ all_tables_size, PAGE_SIZE);
if (!acpi_tables_addr) {
WARN_ON(1);
return;
--
1.7.1

2013-06-13 13:31:56

by Tang Chen

[permalink] [raw]
Subject: [Part1 PATCH v5 09/22] x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c

From: Yinghai Lu <[email protected]>

head64.c can use the page table set up by the #PF handler to access
the initrd before the init mem mapping is set up and the initrd is
relocated.

head_32.S can use 32bit flat mode to access the initrd before the
init mem mapping is set up and the initrd is relocated.

This patch introduces x86_acpi_override_find(), called from
head_32.S/head64.c, to replace the direct call to
acpi_initrd_override_find(). This makes 32bit and 64bit more
consistent.

-v2: use an inline function in the header file instead, according to tj.
also still need to keep the #ifdef in head_32.S to avoid a compile error.
-v3: need to move reserve_initrd() down after acpi_initrd_override_copy(),
to make sure we are using the right address.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/include/asm/setup.h | 6 ++++++
arch/x86/kernel/head64.c | 2 ++
arch/x86/kernel/head_32.S | 4 ++++
arch/x86/kernel/setup.c | 34 ++++++++++++++++++++++++++++++----
4 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 4f71d48..6f885b7 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -42,6 +42,12 @@ extern void visws_early_detect(void);
static inline void visws_early_detect(void) { }
#endif

+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void x86_acpi_override_find(void);
+#else
+static inline void x86_acpi_override_find(void) { }
+#endif
+
extern unsigned long saved_video_mode;

extern void reserve_standard_io_resources(void);
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 55b6761..229b281 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -175,6 +175,8 @@ void __init x86_64_start_kernel(char * real_mode_data)
if (console_loglevel == 10)
early_printk("Kernel alive\n");

+ x86_acpi_override_find();
+
clear_page(init_level4_pgt);
/* set init_level4_pgt kernel high mapping*/
init_level4_pgt[511] = early_level4_pgt[511];
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 73afd11..ca08f0e 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -149,6 +149,10 @@ ENTRY(startup_32)
call load_ucode_bsp
#endif

+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+ call x86_acpi_override_find
+#endif
+
/*
* Initialize page tables. This creates a PDE and a set of page
* tables, which are located immediately beyond __brk_base. The variable
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 142e042..d11b1b7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -421,6 +421,34 @@ static void __init reserve_initrd(void)
}
#endif /* CONFIG_BLK_DEV_INITRD */

+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void __init x86_acpi_override_find(void)
+{
+ unsigned long ramdisk_image, ramdisk_size;
+ unsigned char *p = NULL;
+
+#ifdef CONFIG_X86_32
+ struct boot_params *boot_params_p;
+
+ /*
+ * 32bit is from head_32.S, and it is 32bit flat mode.
+ * So need to use phys address to access global variables.
+ */
+ boot_params_p = (struct boot_params *)__pa_nodebug(&boot_params);
+ ramdisk_image = get_ramdisk_image(boot_params_p);
+ ramdisk_size = get_ramdisk_size(boot_params_p);
+ p = (unsigned char *)ramdisk_image;
+ acpi_initrd_override_find(p, ramdisk_size, true);
+#else
+ ramdisk_image = get_ramdisk_image(&boot_params);
+ ramdisk_size = get_ramdisk_size(&boot_params);
+ if (ramdisk_image)
+ p = __va(ramdisk_image);
+ acpi_initrd_override_find(p, ramdisk_size, false);
+#endif
+}
+#endif
+
static void __init parse_setup_data(void)
{
struct setup_data *data;
@@ -1117,12 +1145,10 @@ void __init setup_arch(char **cmdline_p)
/* Allocate bigger log buffer */
setup_log_buf(1);

- reserve_initrd();
-
- acpi_initrd_override_find((void *)initrd_start,
- initrd_end - initrd_start, false);
acpi_initrd_override_copy();

+ reserve_initrd();
+
reserve_crashkernel();

vsmp_init();
--
1.7.1

2013-06-13 13:32:33

by Tang Chen

[permalink] [raw]
Subject: [Part1 PATCH v5 21/22] x86, mm: Make init_mem_mapping be able to be called several times

From: Yinghai Lu <[email protected]>

Prepare to put page tables on local nodes.

Move the calling of init_mem_mapping() into early_initmem_init().

Rework alloc_low_pages() to allocate pagetable pages in the following
order:
BRK, local node, low range

Still call load_cr3() only once, otherwise we would break Xen 64bit
again.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/kernel/setup.c | 1 -
arch/x86/mm/init.c | 100 +++++++++++++++++++++++++---------------
arch/x86/mm/numa.c | 24 ++++++++++
4 files changed, 88 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1e67223..868687c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -621,7 +621,7 @@ static inline int pgd_none(pgd_t pgd)
#ifndef __ASSEMBLY__

extern int direct_gbpages;
-void init_mem_mapping(void);
+void init_mem_mapping(unsigned long begin, unsigned long end);
void early_alloc_pgt_buf(void);

/* local pte updates need not use xchg for locking */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fd0d5be..9ccbd60 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1132,7 +1132,6 @@ void __init setup_arch(char **cmdline_p)
acpi_boot_table_init();
early_acpi_boot_init();
early_initmem_init();
- init_mem_mapping();
memblock.current_limit = get_max_mapped();
early_trap_pf_init();

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 5f38e72..9ff71ff 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -24,7 +24,10 @@ static unsigned long __initdata pgt_buf_start;
static unsigned long __initdata pgt_buf_end;
static unsigned long __initdata pgt_buf_top;

-static unsigned long min_pfn_mapped;
+static unsigned long low_min_pfn_mapped;
+static unsigned long low_max_pfn_mapped;
+static unsigned long local_min_pfn_mapped;
+static unsigned long local_max_pfn_mapped;

static bool __initdata can_use_brk_pgt = true;

@@ -52,10 +55,17 @@ __ref void *alloc_low_pages(unsigned int num)

if ((pgt_buf_end + num) > pgt_buf_top || !can_use_brk_pgt) {
unsigned long ret;
- if (min_pfn_mapped >= max_pfn_mapped)
- panic("alloc_low_page: ran out of memory");
- ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
- max_pfn_mapped << PAGE_SHIFT,
+ if (local_min_pfn_mapped >= local_max_pfn_mapped) {
+ if (low_min_pfn_mapped >= low_max_pfn_mapped)
+ panic("alloc_low_page: ran out of memory");
+ ret = memblock_find_in_range(
+ low_min_pfn_mapped << PAGE_SHIFT,
+ low_max_pfn_mapped << PAGE_SHIFT,
+ PAGE_SIZE * num , PAGE_SIZE);
+ } else
+ ret = memblock_find_in_range(
+ local_min_pfn_mapped << PAGE_SHIFT,
+ local_max_pfn_mapped << PAGE_SHIFT,
PAGE_SIZE * num , PAGE_SIZE);
if (!ret)
panic("alloc_low_page: can not alloc memory");
@@ -412,67 +422,88 @@ static unsigned long __init get_new_step_size(unsigned long step_size)
return step_size;
}

-void __init init_mem_mapping(void)
+void __init init_mem_mapping(unsigned long begin, unsigned long end)
{
- unsigned long end, real_end, start, last_start;
+ unsigned long real_end, start, last_start;
unsigned long step_size;
unsigned long addr;
unsigned long mapped_ram_size = 0;
unsigned long new_mapped_ram_size;
+ bool is_low = false;
+
+ if (!begin) {
+ probe_page_size_mask();
+ /* the ISA range is always mapped regardless of memory holes */
+ init_memory_mapping(0, ISA_END_ADDRESS);
+ begin = ISA_END_ADDRESS;
+ is_low = true;
+ }

- probe_page_size_mask();
-
-#ifdef CONFIG_X86_64
- end = max_pfn << PAGE_SHIFT;
-#else
- end = max_low_pfn << PAGE_SHIFT;
-#endif
-
- /* the ISA range is always mapped regardless of memory holes */
- init_memory_mapping(0, ISA_END_ADDRESS);
+ if (begin >= end)
+ return;

/* xen has big range in reserved near end of ram, skip it at first.*/
- addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE, PMD_SIZE);
+ addr = memblock_find_in_range(begin, end, PMD_SIZE, PMD_SIZE);
real_end = addr + PMD_SIZE;

/* step_size need to be small so pgt_buf from BRK could cover it */
step_size = PMD_SIZE;
- max_pfn_mapped = 0; /* will get exact value next */
- min_pfn_mapped = real_end >> PAGE_SHIFT;
+ local_max_pfn_mapped = begin >> PAGE_SHIFT;
+ local_min_pfn_mapped = real_end >> PAGE_SHIFT;
last_start = start = real_end;

/*
- * We start from the top (end of memory) and go to the bottom.
- * The memblock_find_in_range() gets us a block of RAM from the
- * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
- * for page table.
+ * alloc_low_pages() will allocate pagetable pages in the following
+ * order:
+ * BRK, local node, low range
+ *
+ * That means it will first use up all the BRK memory, then try to get
+ * us a block of RAM from [local_min_pfn_mapped, local_max_pfn_mapped)
+ * used as new pagetable pages. If no memory on the local node has
+ * been mapped, it will allocate memory from
+ * [low_min_pfn_mapped, low_max_pfn_mapped).
*/
- while (last_start > ISA_END_ADDRESS) {
+ while (last_start > begin) {
if (last_start > step_size) {
start = round_down(last_start - 1, step_size);
- if (start < ISA_END_ADDRESS)
- start = ISA_END_ADDRESS;
+ if (start < begin)
+ start = begin;
} else
- start = ISA_END_ADDRESS;
+ start = begin;
new_mapped_ram_size = init_range_memory_mapping(start,
last_start);
+ if ((last_start >> PAGE_SHIFT) > local_max_pfn_mapped)
+ local_max_pfn_mapped = last_start >> PAGE_SHIFT;
+ local_min_pfn_mapped = start >> PAGE_SHIFT;
last_start = start;
- min_pfn_mapped = last_start >> PAGE_SHIFT;
/* only increase step_size after big range get mapped */
if (new_mapped_ram_size > mapped_ram_size)
step_size = get_new_step_size(step_size);
mapped_ram_size += new_mapped_ram_size;
}

- if (real_end < end)
+ if (real_end < end) {
init_range_memory_mapping(real_end, end);
+ if ((end >> PAGE_SHIFT) > local_max_pfn_mapped)
+ local_max_pfn_mapped = end >> PAGE_SHIFT;
+ }

+ if (is_low) {
+ low_min_pfn_mapped = local_min_pfn_mapped;
+ low_max_pfn_mapped = local_max_pfn_mapped;
+ }
+}
+
+#ifndef CONFIG_NUMA
+void __init early_initmem_init(void)
+{
#ifdef CONFIG_X86_64
- if (max_pfn > max_low_pfn) {
- /* can we preseve max_low_pfn ?*/
+ init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+ if (max_pfn > max_low_pfn)
max_low_pfn = max_pfn;
- }
#else
+ init_mem_mapping(0, max_low_pfn << PAGE_SHIFT);
early_ioremap_page_table_range_init();
#endif

@@ -481,11 +512,6 @@ void __init init_mem_mapping(void)

early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
}
-
-#ifndef CONFIG_NUMA
-void __init early_initmem_init(void)
-{
-}
#endif

/*
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 7d76936..9b18ee8 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -17,8 +17,10 @@
#include <asm/dma.h>
#include <asm/acpi.h>
#include <asm/amd_nb.h>
+#include <asm/tlbflush.h>

#include "numa_internal.h"
+#include "mm_internal.h"

int __initdata numa_off;
nodemask_t numa_nodes_parsed __initdata;
@@ -665,9 +667,31 @@ static void __init early_x86_numa_init(void)
numa_init(dummy_numa_init);
}

+#ifdef CONFIG_X86_64
+static void __init early_x86_numa_init_mapping(void)
+{
+ init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+ if (max_pfn > max_low_pfn)
+ max_low_pfn = max_pfn;
+}
+#else
+static void __init early_x86_numa_init_mapping(void)
+{
+ init_mem_mapping(0, max_low_pfn << PAGE_SHIFT);
+ early_ioremap_page_table_range_init();
+}
+#endif
+
void __init early_initmem_init(void)
{
early_x86_numa_init();
+
+ early_x86_numa_init_mapping();
+
+ load_cr3(swapper_pg_dir);
+ __flush_tlb_all();
+
+ early_memtest(0, max_pfn_mapped<<PAGE_SHIFT);
}

void __init x86_numa_init(void)
--
1.7.1

2013-06-13 13:32:56

by Tang Chen

[permalink] [raw]
Subject: [Part1 PATCH v5 07/22] x86, ACPI: Store override acpi tables phys addr in cpio files info array

From: Yinghai Lu <[email protected]>

This patch introduces a file_pos struct to store a physical address
and size, changes acpi_initrd_files[] to the file_pos type, and then
stores the physical addresses of the ACPI tables in
acpi_initrd_files[].

For finding, we locate the ACPI tables via physical addresses during
32bit flat mode in head_32.S, because at that time we do not need to
set up page tables to access the initrd.

For copying, we can use early_ioremap() with the physical address
directly, before the memory mapping is set up.

To keep 32bit and 64bit platforms consistent, use phys_addr for all.

-v2: introduce file_pos to save physaddr instead of abusing cpio_data
which tj is not happy with.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
drivers/acpi/osl.c | 15 +++++++++++----
1 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 6ab6c54..42f79e3 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -570,7 +570,11 @@ static const char * const table_sigs[] = {
#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)

#define ACPI_OVERRIDE_TABLES 64
-static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
+struct file_pos {
+ phys_addr_t data;
+ phys_addr_t size;
+};
+static struct file_pos __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];

void __init acpi_initrd_override_find(void *data, size_t size)
{
@@ -615,7 +619,7 @@ void __init acpi_initrd_override_find(void *data, size_t size)
table->signature, cpio_path, file.name, table->length);

all_tables_size += table->length;
- acpi_initrd_files[table_nr].data = file.data;
+ acpi_initrd_files[table_nr].data = __pa_nodebug(file.data);
acpi_initrd_files[table_nr].size = file.size;
table_nr++;
}
@@ -624,7 +628,7 @@ void __init acpi_initrd_override_find(void *data, size_t size)
void __init acpi_initrd_override_copy(void)
{
int no, total_offset = 0;
- char *p;
+ char *p, *q;

if (!all_tables_size)
return;
@@ -659,12 +663,15 @@ void __init acpi_initrd_override_copy(void)
* one by one during copying.
*/
for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
+ phys_addr_t addr = acpi_initrd_files[no].data;
phys_addr_t size = acpi_initrd_files[no].size;

if (!size)
break;
+ q = early_ioremap(addr, size);
p = early_ioremap(acpi_tables_addr + total_offset, size);
- memcpy(p, acpi_initrd_files[no].data, size);
+ memcpy(p, q, size);
+ early_iounmap(q, size);
early_iounmap(p, size);
total_offset += size;
}
--
1.7.1

2013-06-13 13:33:00

by Tang Chen

[permalink] [raw]
Subject: [Part1 PATCH v5 04/22] x86, ACPI: Search buffer above 4GB in a second try for acpi initrd table override

From: Yinghai Lu <[email protected]>

Currently we only search for a buffer for the new acpi tables in the
initrd under 4GB. In some cases, e.g. when the user uses memmap= to
exclude all low ram, we may not find a range under 4GB. So do a
second try, searching for a buffer above 4GB.

Since later accesses to the tables use early_ioremap(), using memory
above 4GB is OK.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: [email protected]
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
drivers/acpi/osl.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 93e3194..42c48fc 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -627,6 +627,10 @@ void __init acpi_initrd_override(void *data, size_t size)
/* under 4G at first, then above 4G */
acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
all_tables_size, PAGE_SIZE);
+ if (!acpi_tables_addr)
+ acpi_tables_addr = memblock_find_in_range(0,
+ ~(phys_addr_t)0,
+ all_tables_size, PAGE_SIZE);
if (!acpi_tables_addr) {
WARN_ON(1);
return;
--
1.7.1

2013-06-13 13:32:57

by Tang Chen

[permalink] [raw]
Subject: [Part1 PATCH v5 19/22] x86, mm: Parse numa info earlier

From: Yinghai Lu <[email protected]>

Parsing numa info has now been separated into two steps.

early_initmem_init() only parses the info into numa_meminfo and
numa_nodes_parsed. It still keeps the numaq, acpi_numa, amd_numa,
dummy fallback sequence working.

SLIT and numa emulation handling are still left in initmem_init().

Call early_initmem_init() before init_mem_mapping() to prepare for
using the numa info with it.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/kernel/setup.c | 24 ++++++++++--------------
1 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 301165e..fd0d5be 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1125,13 +1125,21 @@ void __init setup_arch(char **cmdline_p)
trim_platform_memory_ranges();
trim_low_memory_range();

+ /*
+ * Parse the ACPI tables for possible boot-time SMP configuration.
+ */
+ acpi_initrd_override_copy();
+ acpi_boot_table_init();
+ early_acpi_boot_init();
+ early_initmem_init();
init_mem_mapping();
-
+ memblock.current_limit = get_max_mapped();
early_trap_pf_init();

+ reserve_initrd();
+
setup_real_mode();

- memblock.current_limit = get_max_mapped();
dma_contiguous_reserve(0);

/*
@@ -1145,24 +1153,12 @@ void __init setup_arch(char **cmdline_p)
/* Allocate bigger log buffer */
setup_log_buf(1);

- acpi_initrd_override_copy();
-
- reserve_initrd();
-
reserve_crashkernel();

vsmp_init();

io_delay_init();

- /*
- * Parse the ACPI tables for possible boot-time SMP configuration.
- */
- acpi_boot_table_init();
-
- early_acpi_boot_init();
-
- early_initmem_init();
initmem_init();
memblock_find_dma_reserve();

--
1.7.1

2013-06-13 13:32:54

by Tang Chen

[permalink] [raw]
Subject: [Part1 PATCH v5 05/22] x86, ACPI: Increase acpi initrd override tables number limit

From: Yinghai Lu <[email protected]>

The current number of acpi tables in the initrd is limited to 10,
which is too small. 64 should be good enough, as we have 35
signatures and could have several SSDTs.

Two problems in the current code prevent us from increasing the
10-table limit:
1. The cpio file info array is put on the stack. As every element is
32 bytes, we could run out of stack space if we increase the array
size to 64.
So move it off the stack, make it global, and put it in the
__initdata section.
2. early_ioremap() can only remap 256KB at one time. The current code
maps all 10 tables at once; if we increase the limit, the total size
could exceed 256KB and early_ioremap() would fail.
So map the tables one by one during copying, instead of mapping
all of them at one time.

-v2: According to tj, split it out to separated patch, also
rename array name to acpi_initrd_files.
-v3: Add some comments about mapping table one by one during copying
per tj.

Signed-off-by: Yinghai <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Acked-by: Tejun Heo <[email protected]>
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
drivers/acpi/osl.c | 26 +++++++++++++++-----------
1 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 42c48fc..53dd490 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -569,8 +569,8 @@ static const char * const table_sigs[] = {

#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)

-/* Must not increase 10 or needs code modification below */
-#define ACPI_OVERRIDE_TABLES 10
+#define ACPI_OVERRIDE_TABLES 64
+static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];

void __init acpi_initrd_override(void *data, size_t size)
{
@@ -579,7 +579,6 @@ void __init acpi_initrd_override(void *data, size_t size)
struct acpi_table_header *table;
char cpio_path[32] = "kernel/firmware/acpi/";
struct cpio_data file;
- struct cpio_data early_initrd_files[ACPI_OVERRIDE_TABLES];
char *p;

if (data == NULL || size == 0)
@@ -617,8 +616,8 @@ void __init acpi_initrd_override(void *data, size_t size)
table->signature, cpio_path, file.name, table->length);

all_tables_size += table->length;
- early_initrd_files[table_nr].data = file.data;
- early_initrd_files[table_nr].size = file.size;
+ acpi_initrd_files[table_nr].data = file.data;
+ acpi_initrd_files[table_nr].size = file.size;
table_nr++;
}
if (table_nr == 0)
@@ -648,14 +647,19 @@ void __init acpi_initrd_override(void *data, size_t size)
memblock_reserve(acpi_tables_addr, all_tables_size);
arch_reserve_mem_area(acpi_tables_addr, all_tables_size);

- p = early_ioremap(acpi_tables_addr, all_tables_size);
-
+ /*
+ * early_ioremap can only remap 256KB at one time. If we map all the
+ * tables at one time, we will hit the limit. So we need to map tables
+ * one by one during copying.
+ */
for (no = 0; no < table_nr; no++) {
- memcpy(p + total_offset, early_initrd_files[no].data,
- early_initrd_files[no].size);
- total_offset += early_initrd_files[no].size;
+ phys_addr_t size = acpi_initrd_files[no].size;
+
+ p = early_ioremap(acpi_tables_addr + total_offset, size);
+ memcpy(p, acpi_initrd_files[no].data, size);
+ early_iounmap(p, size);
+ total_offset += size;
}
- early_iounmap(p, all_tables_size);
}
#endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */

--
1.7.1

2013-06-13 13:32:52

by Tang Chen

[permalink] [raw]
Subject: [Part1 PATCH v5 22/22] x86, mm, numa: Put pagetable on local node ram for 64bit

From: Yinghai Lu <[email protected]>

If a node with ram is hotpluggable, the memory for the local node's
page tables and vmemmap should be on the local node's ram.

This patch is essentially a refresh of
| commit 1411e0ec3123ae4c4ead6bfc9fe3ee5a3ae5c327
| Date: Mon Dec 27 16:48:17 2010 -0800
|
| x86-64, numa: Put pgtable to local node memory
That was reverted before.

We have reason to reintroduce it to improve performance when
using memory hotplug.

Call init_mem_mapping() in early_initmem_init() for each node.
alloc_low_pages() will allocate pagetable pages in the following
order:
BRK, local node, low range

So page tables will be on the low range or on local nodes.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 34 +++++++++++++++++++++++++++++++++-
1 files changed, 33 insertions(+), 1 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 9b18ee8..5adf803 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -670,7 +670,39 @@ static void __init early_x86_numa_init(void)
#ifdef CONFIG_X86_64
static void __init early_x86_numa_init_mapping(void)
{
- init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+ unsigned long last_start = 0, last_end = 0;
+ struct numa_meminfo *mi = &numa_meminfo;
+ unsigned long start, end;
+ int last_nid = -1;
+ int i, nid;
+
+ for (i = 0; i < mi->nr_blks; i++) {
+ nid = mi->blk[i].nid;
+ start = mi->blk[i].start;
+ end = mi->blk[i].end;
+
+ if (last_nid == nid) {
+ last_end = end;
+ continue;
+ }
+
+ /* other nid now */
+ if (last_nid >= 0) {
+ printk(KERN_DEBUG "Node %d: [mem %#016lx-%#016lx]\n",
+ last_nid, last_start, last_end - 1);
+ init_mem_mapping(last_start, last_end);
+ }
+
+ /* for next nid */
+ last_nid = nid;
+ last_start = start;
+ last_end = end;
+ }
+ /* last one */
+ printk(KERN_DEBUG "Node %d: [mem %#016lx-%#016lx]\n",
+ last_nid, last_start, last_end - 1);
+ init_mem_mapping(last_start, last_end);
+
if (max_pfn > max_low_pfn)
max_low_pfn = max_pfn;
}
--
1.7.1

2013-06-13 13:34:37

by Tang Chen

[permalink] [raw]
Subject: [Part1 PATCH v5 16/22] x86, mm, numa: Move numa emulation handling down.

From: Yinghai Lu <[email protected]>

numa_emulation() needs to allocate buffers for the new numa_meminfo
and the distance matrix, so execute it later, in x86_numa_init().

This also changes the behavior:
- before this patch, if the user passed wrong data on the command
line, we would fall back to the next numa probing method or disable
numa.
- after this patch, if the user passes wrong data on the command
line, we stay with the numa info probed earlier, e.g. from ACPI SRAT
or amd_numa.

We need to call numa_check_memblks() to reject wrong user input early,
so that we can keep the original numa_meminfo unchanged.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 6 +++---
arch/x86/mm/numa_emulation.c | 2 +-
arch/x86/mm/numa_internal.h | 2 ++
3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index da2ebab..3254f22 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -534,7 +534,7 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
}
#endif

-static int __init numa_check_memblks(struct numa_meminfo *mi)
+int __init numa_check_memblks(struct numa_meminfo *mi)
{
nodemask_t nodes_parsed;
unsigned long pfn_align;
@@ -604,8 +604,6 @@ static int __init numa_init(int (*init_func)(void))
if (ret < 0)
return ret;

- numa_emulation(&numa_meminfo, numa_distance_cnt);
-
ret = numa_check_memblks(&numa_meminfo);
if (ret < 0)
return ret;
@@ -669,6 +667,8 @@ void __init x86_numa_init(void)

early_x86_numa_init();

+ numa_emulation(&numa_meminfo, numa_distance_cnt);
+
node_possible_map = numa_nodes_parsed;
numa_nodemask_from_meminfo(&node_possible_map, mi);

diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index dbbbb47..5a0433d 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -348,7 +348,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
if (ret < 0)
goto no_emu;

- if (numa_cleanup_meminfo(&ei) < 0) {
+ if (numa_cleanup_meminfo(&ei) < 0 || numa_check_memblks(&ei) < 0) {
pr_warning("NUMA: Warning: constructed meminfo invalid, disabling emulation\n");
goto no_emu;
}
diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h
index ad86ec9..bb2fbcc 100644
--- a/arch/x86/mm/numa_internal.h
+++ b/arch/x86/mm/numa_internal.h
@@ -21,6 +21,8 @@ void __init numa_reset_distance(void);

void __init x86_numa_init(void);

+int __init numa_check_memblks(struct numa_meminfo *mi);
+
#ifdef CONFIG_NUMA_EMU
void __init numa_emulation(struct numa_meminfo *numa_meminfo,
int numa_dist_cnt);
--
1.7.1

2013-06-13 13:28:07

by Tang Chen

Subject: [Part1 PATCH v5 13/22] x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment

From: Yinghai Lu <[email protected]>

We can use numa_meminfo directly instead of the memblock nid in
node_map_pfn_alignment().

This lets us set the memblock nid later, and only once, on the
successful path.

-v2: per tj's review, split the move out into a separate patch.

Signed-off-by: Yinghai Lu <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 30 +++++++++++++++++++-----------
1 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 10c6240..cff565a 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -493,14 +493,18 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
* Returns the determined alignment in pfn's. 0 if there is no alignment
* requirement (single node).
*/
-unsigned long __init node_map_pfn_alignment(void)
+#ifdef NODE_NOT_IN_PAGE_FLAGS
+static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
{
unsigned long accl_mask = 0, last_end = 0;
unsigned long start, end, mask;
int last_nid = -1;
int i, nid;

- for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
+ for (i = 0; i < mi->nr_blks; i++) {
+ start = mi->blk[i].start >> PAGE_SHIFT;
+ end = mi->blk[i].end >> PAGE_SHIFT;
+ nid = mi->blk[i].nid;
if (!start || last_nid < 0 || last_nid == nid) {
last_nid = nid;
last_end = end;
@@ -523,10 +527,16 @@ unsigned long __init node_map_pfn_alignment(void)
/* convert mask to number of pages */
return ~accl_mask + 1;
}
+#else
+static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
+{
+ return 0;
+}
+#endif

static int __init numa_register_memblks(struct numa_meminfo *mi)
{
- unsigned long uninitialized_var(pfn_align);
+ unsigned long pfn_align;
int i;

/* Account for nodes with cpus and no memory */
@@ -538,24 +548,22 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
if (!numa_meminfo_cover_memory(mi))
return -EINVAL;

- for (i = 0; i < mi->nr_blks; i++) {
- struct numa_memblk *mb = &mi->blk[i];
- memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
- }
-
/*
* If sections array is gonna be used for pfn -> nid mapping, check
* whether its granularity is fine enough.
*/
-#ifdef NODE_NOT_IN_PAGE_FLAGS
- pfn_align = node_map_pfn_alignment();
+ pfn_align = node_map_pfn_alignment(mi);
if (pfn_align && pfn_align < PAGES_PER_SECTION) {
printk(KERN_WARNING "Node alignment %LuMB < min %LuMB, rejecting NUMA config\n",
PFN_PHYS(pfn_align) >> 20,
PFN_PHYS(PAGES_PER_SECTION) >> 20);
return -EINVAL;
}
-#endif
+
+ for (i = 0; i < mi->nr_blks; i++) {
+ struct numa_memblk *mb = &mi->blk[i];
+ memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
+ }

return 0;
}
--
1.7.1

2013-06-13 13:34:52

by Tang Chen

Subject: [Part1 PATCH v5 02/22] x86, microcode: Use common get_ramdisk_{image|size}()

From: Yinghai Lu <[email protected]>

Patch 1 changed get_ramdisk_image() and get_ramdisk_size() to
global, so use them here instead of reading the global variable
boot_params directly.

We need this to get the correct ramdisk address for a 64-bit bzImage
whose initrd may be loaded above 4G by kexec-tools.

-v2: fix a typo found by Tang Chen

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Fenghua Yu <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/kernel/microcode_intel_early.c | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/microcode_intel_early.c b/arch/x86/kernel/microcode_intel_early.c
index 2e9e128..54575a9 100644
--- a/arch/x86/kernel/microcode_intel_early.c
+++ b/arch/x86/kernel/microcode_intel_early.c
@@ -743,8 +743,8 @@ load_ucode_intel_bsp(void)
struct boot_params *boot_params_p;

boot_params_p = (struct boot_params *)__pa_nodebug(&boot_params);
- ramdisk_image = boot_params_p->hdr.ramdisk_image;
- ramdisk_size = boot_params_p->hdr.ramdisk_size;
+ ramdisk_image = get_ramdisk_image(boot_params_p);
+ ramdisk_size = get_ramdisk_size(boot_params_p);
initrd_start_early = ramdisk_image;
initrd_end_early = initrd_start_early + ramdisk_size;

@@ -753,8 +753,8 @@ load_ucode_intel_bsp(void)
(unsigned long *)__pa_nodebug(&mc_saved_in_initrd),
initrd_start_early, initrd_end_early, &uci);
#else
- ramdisk_image = boot_params.hdr.ramdisk_image;
- ramdisk_size = boot_params.hdr.ramdisk_size;
+ ramdisk_image = get_ramdisk_image(&boot_params);
+ ramdisk_size = get_ramdisk_size(&boot_params);
initrd_start_early = ramdisk_image + PAGE_OFFSET;
initrd_end_early = initrd_start_early + ramdisk_size;

--
1.7.1

2013-06-13 13:28:04

by Tang Chen

Subject: [Part1 PATCH v5 15/22] x86, mm, numa: Move node_possible_map setting later

From: Yinghai Lu <[email protected]>

Move node_possible_map handling out of numa_check_memblks() to avoid
side effects when changing numa_check_memblks().

Only set node_possible_map once for successful path instead
of resetting in numa_init() every time.

Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 11 +++++++----
1 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e448b6f..da2ebab 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -536,12 +536,13 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)

static int __init numa_check_memblks(struct numa_meminfo *mi)
{
+ nodemask_t nodes_parsed;
unsigned long pfn_align;

/* Account for nodes with cpus and no memory */
- node_possible_map = numa_nodes_parsed;
- numa_nodemask_from_meminfo(&node_possible_map, mi);
- if (WARN_ON(nodes_empty(node_possible_map)))
+ nodes_parsed = numa_nodes_parsed;
+ numa_nodemask_from_meminfo(&nodes_parsed, mi);
+ if (WARN_ON(nodes_empty(nodes_parsed)))
return -EINVAL;

if (!numa_meminfo_cover_memory(mi))
@@ -593,7 +594,6 @@ static int __init numa_init(int (*init_func)(void))
set_apicid_to_node(i, NUMA_NO_NODE);

nodes_clear(numa_nodes_parsed);
- nodes_clear(node_possible_map);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
numa_reset_distance();

@@ -669,6 +669,9 @@ void __init x86_numa_init(void)

early_x86_numa_init();

+ node_possible_map = numa_nodes_parsed;
+ numa_nodemask_from_meminfo(&node_possible_map, mi);
+
for (i = 0; i < mi->nr_blks; i++) {
struct numa_memblk *mb = &mi->blk[i];
memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
--
1.7.1

2013-06-13 13:37:21

by Tang Chen

Subject: [Part1 PATCH v5 12/22] x86, mm, numa: Move node_map_pfn_alignment() to x86

From: Yinghai Lu <[email protected]>

Move node_map_pfn_alignment() to arch/x86/mm, as there is no other
user of it.

A later patch will update it to use numa_meminfo instead of memblock.

Signed-off-by: Yinghai Lu <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/mm.h | 1 -
mm/page_alloc.c | 50 --------------------------------------------------
3 files changed, 50 insertions(+), 51 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1bb565d..10c6240 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -474,6 +474,56 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
return true;
}

+/**
+ * node_map_pfn_alignment - determine the maximum internode alignment
+ *
+ * This function should be called after node map is populated and sorted.
+ * It calculates the maximum power of two alignment which can distinguish
+ * all the nodes.
+ *
+ * For example, if all nodes are 1GiB and aligned to 1GiB, the return value
+ * would indicate 1GiB alignment with (1 << (30 - PAGE_SHIFT)). If the
+ * nodes are shifted by 256MiB, 256MiB. Note that if only the last node is
+ * shifted, 1GiB is enough and this function will indicate so.
+ *
+ * This is used to test whether pfn -> nid mapping of the chosen memory
+ * model has fine enough granularity to avoid incorrect mapping for the
+ * populated node map.
+ *
+ * Returns the determined alignment in pfn's. 0 if there is no alignment
+ * requirement (single node).
+ */
+unsigned long __init node_map_pfn_alignment(void)
+{
+ unsigned long accl_mask = 0, last_end = 0;
+ unsigned long start, end, mask;
+ int last_nid = -1;
+ int i, nid;
+
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
+ if (!start || last_nid < 0 || last_nid == nid) {
+ last_nid = nid;
+ last_end = end;
+ continue;
+ }
+
+ /*
+ * Start with a mask granular enough to pin-point to the
+ * start pfn and tick off bits one-by-one until it becomes
+ * too coarse to separate the current node from the last.
+ */
+ mask = ~((1 << __ffs(start)) - 1);
+ while (mask && last_end <= (start & (mask << 1)))
+ mask <<= 1;
+
+ /* accumulate all internode masks */
+ accl_mask |= mask;
+ }
+
+ /* convert mask to number of pages */
+ return ~accl_mask + 1;
+}
+
static int __init numa_register_memblks(struct numa_meminfo *mi)
{
unsigned long uninitialized_var(pfn_align);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 28e9470..b827743 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1384,7 +1384,6 @@ static inline unsigned long free_initmem_default(int poison)
* CONFIG_HAVE_MEMBLOCK_NODE_MAP.
*/
extern void free_area_init_nodes(unsigned long *max_zone_pfn);
-unsigned long node_map_pfn_alignment(void);
extern unsigned long absent_pages_in_range(unsigned long start_pfn,
unsigned long end_pfn);
extern void get_pfn_range_for_nid(unsigned int nid,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 74e3428..7ba7703 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4762,56 +4762,6 @@ void __init setup_nr_node_ids(void)
}
#endif

-/**
- * node_map_pfn_alignment - determine the maximum internode alignment
- *
- * This function should be called after node map is populated and sorted.
- * It calculates the maximum power of two alignment which can distinguish
- * all the nodes.
- *
- * For example, if all nodes are 1GiB and aligned to 1GiB, the return value
- * would indicate 1GiB alignment with (1 << (30 - PAGE_SHIFT)). If the
- * nodes are shifted by 256MiB, 256MiB. Note that if only the last node is
- * shifted, 1GiB is enough and this function will indicate so.
- *
- * This is used to test whether pfn -> nid mapping of the chosen memory
- * model has fine enough granularity to avoid incorrect mapping for the
- * populated node map.
- *
- * Returns the determined alignment in pfn's. 0 if there is no alignment
- * requirement (single node).
- */
-unsigned long __init node_map_pfn_alignment(void)
-{
- unsigned long accl_mask = 0, last_end = 0;
- unsigned long start, end, mask;
- int last_nid = -1;
- int i, nid;
-
- for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
- if (!start || last_nid < 0 || last_nid == nid) {
- last_nid = nid;
- last_end = end;
- continue;
- }
-
- /*
- * Start with a mask granular enough to pin-point to the
- * start pfn and tick off bits one-by-one until it becomes
- * too coarse to separate the current node from the last.
- */
- mask = ~((1 << __ffs(start)) - 1);
- while (mask && last_end <= (start & (mask << 1)))
- mask <<= 1;
-
- /* accumulate all internode masks */
- accl_mask |= mask;
- }
-
- /* convert mask to number of pages */
- return ~accl_mask + 1;
-}
-
/* Find the lowest pfn for a node */
static unsigned long __init find_min_pfn_for_node(int nid)
{
--
1.7.1

2013-06-13 13:38:08

by Tang Chen

Subject: [Part1 PATCH v5 11/22] x86, mm, numa: Call numa_meminfo_cover_memory() checking early

From: Yinghai Lu <[email protected]>

In order to separate the NUMA info parsing procedure into two steps,
we need to set the memblock nid later, as doing so could change the
memblock array and possibly double the memblock.memory array, which
would need to allocate a buffer.

We do not need the nid in memblock to find absent pages, so we can
move the numa_meminfo_cover_memory() check earlier.

Also change __absent_pages_in_range() to static and use
absent_pages_in_range() directly.

Later we will set memblock nid only once on successful path.

Signed-off-by: Yinghai Lu <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 7 ++++---
include/linux/mm.h | 2 --
mm/page_alloc.c | 2 +-
3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 07ae800..1bb565d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -457,7 +457,7 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
u64 s = mi->blk[i].start >> PAGE_SHIFT;
u64 e = mi->blk[i].end >> PAGE_SHIFT;
numaram += e - s;
- numaram -= __absent_pages_in_range(mi->blk[i].nid, s, e);
+ numaram -= absent_pages_in_range(s, e);
if ((s64)numaram < 0)
numaram = 0;
}
@@ -485,6 +485,9 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
if (WARN_ON(nodes_empty(node_possible_map)))
return -EINVAL;

+ if (!numa_meminfo_cover_memory(mi))
+ return -EINVAL;
+
for (i = 0; i < mi->nr_blks; i++) {
struct numa_memblk *mb = &mi->blk[i];
memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
@@ -503,8 +506,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
return -EINVAL;
}
#endif
- if (!numa_meminfo_cover_memory(mi))
- return -EINVAL;

return 0;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e0c8528..28e9470 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1385,8 +1385,6 @@ static inline unsigned long free_initmem_default(int poison)
*/
extern void free_area_init_nodes(unsigned long *max_zone_pfn);
unsigned long node_map_pfn_alignment(void);
-unsigned long __absent_pages_in_range(int nid, unsigned long start_pfn,
- unsigned long end_pfn);
extern unsigned long absent_pages_in_range(unsigned long start_pfn,
unsigned long end_pfn);
extern void get_pfn_range_for_nid(unsigned int nid,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3edb62..74e3428 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4397,7 +4397,7 @@ static unsigned long __meminit zone_spanned_pages_in_node(int nid,
* Return the number of holes in a range on a node. If nid is MAX_NUMNODES,
* then all holes in the requested range will be accounted for.
*/
-unsigned long __meminit __absent_pages_in_range(int nid,
+static unsigned long __meminit __absent_pages_in_range(int nid,
unsigned long range_start_pfn,
unsigned long range_end_pfn)
{
--
1.7.1

2013-06-13 13:39:31

by Tang Chen

Subject: [Part1 PATCH v5 14/22] x86, mm, numa: Set memblock nid later

From: Yinghai Lu <[email protected]>

In order to separate the NUMA info parsing procedure into two steps,
we need to set the memblock nid later, because doing so could change
the memblock array and possibly double the memblock.memory array,
which would need to allocate a buffer.

Only set the memblock nid once, on the successful path.

Also rename numa_register_memblks() to numa_check_memblks(), now that
the code setting the memblock nid has been moved out.

Signed-off-by: Yinghai Lu <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/mm/numa.c | 16 +++++++---------
1 files changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index cff565a..e448b6f 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -534,10 +534,9 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
}
#endif

-static int __init numa_register_memblks(struct numa_meminfo *mi)
+static int __init numa_check_memblks(struct numa_meminfo *mi)
{
unsigned long pfn_align;
- int i;

/* Account for nodes with cpus and no memory */
node_possible_map = numa_nodes_parsed;
@@ -560,11 +559,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
return -EINVAL;
}

- for (i = 0; i < mi->nr_blks; i++) {
- struct numa_memblk *mb = &mi->blk[i];
- memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
- }
-
return 0;
}

@@ -601,7 +595,6 @@ static int __init numa_init(int (*init_func)(void))
nodes_clear(numa_nodes_parsed);
nodes_clear(node_possible_map);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
- WARN_ON(memblock_set_node(0, ULLONG_MAX, MAX_NUMNODES));
numa_reset_distance();

ret = init_func();
@@ -613,7 +606,7 @@ static int __init numa_init(int (*init_func)(void))

numa_emulation(&numa_meminfo, numa_distance_cnt);

- ret = numa_register_memblks(&numa_meminfo);
+ ret = numa_check_memblks(&numa_meminfo);
if (ret < 0)
return ret;

@@ -676,6 +669,11 @@ void __init x86_numa_init(void)

early_x86_numa_init();

+ for (i = 0; i < mi->nr_blks; i++) {
+ struct numa_memblk *mb = &mi->blk[i];
+ memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
+ }
+
/* Finally register nodes. */
for_each_node_mask(nid, node_possible_map) {
u64 start = PFN_PHYS(max_pfn);
--
1.7.1

2013-06-13 13:41:09

by Tang Chen

Subject: [Part1 PATCH v5 20/22] x86, mm: Add comments for step_size shift

From: Yinghai Lu <[email protected]>

As requested by hpa, add comments explaining why we choose 5 as the
step size shift.

Signed-off-by: Yinghai Lu <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/mm/init.c | 21 ++++++++++++++++++---
1 files changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 3c21f16..5f38e72 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -395,8 +395,23 @@ static unsigned long __init init_range_memory_mapping(
return mapped_ram_size;
}

-/* (PUD_SHIFT-PMD_SHIFT)/2 */
-#define STEP_SIZE_SHIFT 5
+static unsigned long __init get_new_step_size(unsigned long step_size)
+{
+ /*
+ * The initial mapped size is PMD_SIZE, aka 2M.
+ * We can not set step_size to be PUD_SIZE, aka 1G, yet.
+ * In the worst case, when a 1G range crosses a 1G boundary and
+ * PG_LEVEL_2M is not set, we will need 1+1+512 pages (aka 2M + 8K)
+ * to map that 1G range with PTEs. Use 5 as the shift for now.
+ */
+ unsigned long new_step_size = step_size << 5;
+
+ if (new_step_size > step_size)
+ step_size = new_step_size;
+
+ return step_size;
+}
+
void __init init_mem_mapping(void)
{
unsigned long end, real_end, start, last_start;
@@ -445,7 +460,7 @@ void __init init_mem_mapping(void)
min_pfn_mapped = last_start >> PAGE_SHIFT;
/* only increase step_size after big range get mapped */
if (new_mapped_ram_size > mapped_ram_size)
- step_size <<= STEP_SIZE_SHIFT;
+ step_size = get_new_step_size(step_size);
mapped_ram_size += new_mapped_ram_size;
}

--
1.7.1

2013-06-13 18:36:35

by Konrad Rzeszutek Wilk

Subject: Re: [Part1 PATCH v5 21/22] x86, mm: Make init_mem_mapping be able to be called several times

Tang Chen <[email protected]> wrote:

>From: Yinghai Lu <[email protected]>
>
>Prepare to put page table on local nodes.
>
>Move calling of init_mem_mapping() to early_initmem_init().
>
>Rework alloc_low_pages to allocate page table in following order:
> BRK, local node, low range
>
>Still only load_cr3 one time, otherwise we would break xen 64bit again.
>



Sigh.. Can that comment on Xen be removed please. The issue was fixed last release and I believe I already asked to remove that comment as it is not true anymore.
--
Sent from my Android phone. Please excuse my brevity.

2013-06-13 22:47:42

by Yinghai Lu

Subject: Re: [Part1 PATCH v5 21/22] x86, mm: Make init_mem_mapping be able to be called several times

On Thu, Jun 13, 2013 at 11:35 AM, Konrad Rzeszutek Wilk
<[email protected]> wrote:
> Tang Chen <[email protected]> wrote:
>
>>From: Yinghai Lu <[email protected]>
>>
>>Prepare to put page table on local nodes.
>>
>>Move calling of init_mem_mapping() to early_initmem_init().
>>
>>Rework alloc_low_pages to allocate page table in following order:
>> BRK, local node, low range
>>
>>Still only load_cr3 one time, otherwise we would break xen 64bit again.
>>
>
>
>
> Sigh.. Can that comment on Xen be removed please. The issue was fixed last release and I believe I already asked to remove that comment as it is not true anymore.

Sorry about that again, I thought I removed that already.

Yinghai

2013-06-14 05:05:49

by Tang Chen

Subject: Re: [Part1 PATCH v5 21/22] x86, mm: Make init_mem_mapping be able to be called several times

On 06/14/2013 06:47 AM, Yinghai Lu wrote:
> On Thu, Jun 13, 2013 at 11:35 AM, Konrad Rzeszutek Wilk
> <[email protected]> wrote:
>> Tang Chen<[email protected]> wrote:
>>
>>> From: Yinghai Lu<[email protected]>
>>>
>>> Prepare to put page table on local nodes.
>>>
>>> Move calling of init_mem_mapping() to early_initmem_init().
>>>
>>> Rework alloc_low_pages to allocate page table in following order:
>>> BRK, local node, low range
>>>
>>> Still only load_cr3 one time, otherwise we would break xen 64bit again.
>>>
>>
>>
>>
>> Sigh.. Can that comment on Xen be removed please. The issue was fixed last release and I believe I already asked to remove that comment as it is not true anymore.
>
> Sorry about that again, I thought I removed that already.

Sorry I didn't notice that. Will remove it if Yinghai or I resend this
patch-set.

Thanks.

Subject: [tip:x86/mm] x86: Change get_ramdisk_{image|size}() to global

Commit-ID: d9518cb78d6d5ee6b24eb7ee2f4b108ec30e174e
Gitweb: http://git.kernel.org/tip/d9518cb78d6d5ee6b24eb7ee2f4b108ec30e174e
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:02:48 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:03:26 -0700

x86: Change get_ramdisk_{image|size}() to global

This patch does two things:
1. Change get_ramdisk_image() and get_ramdisk_size() to global.
2. Make get_ramdisk_image() and get_ramdisk_size() take a
boot_params pointer parameter.

The whole patch set tries to split the ACPI initrd table override
procedure into two steps: finding and copying.
The finding step is done at the head_32.S and head64.c stage, so we
need to call get_ramdisk_image() and get_ramdisk_size() in these
two files.

Also, in head_32.S, boot_params can only be accessed via its physical
address while in 32-bit flat mode, so make get_ramdisk_image() and
get_ramdisk_size() take a boot_params pointer, allowing the code in
head_32.S to pass a physical address.

Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Acked-by: Tejun Heo <[email protected]>
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/include/asm/setup.h | 3 +++
arch/x86/kernel/setup.c | 28 ++++++++++++++--------------
2 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index b7bf350..4f71d48 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -106,6 +106,9 @@ void *extend_brk(size_t size, size_t align);
RESERVE_BRK(name, sizeof(type) * entries)

extern void probe_roms(void);
+u64 get_ramdisk_image(struct boot_params *bp);
+u64 get_ramdisk_size(struct boot_params *bp);
+
#ifdef __i386__

void __init i386_start_kernel(void);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 56f7fcf..66ab495 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -297,19 +297,19 @@ static void __init reserve_brk(void)

#ifdef CONFIG_BLK_DEV_INITRD

-static u64 __init get_ramdisk_image(void)
+u64 __init get_ramdisk_image(struct boot_params *bp)
{
- u64 ramdisk_image = boot_params.hdr.ramdisk_image;
+ u64 ramdisk_image = bp->hdr.ramdisk_image;

- ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32;
+ ramdisk_image |= (u64)bp->ext_ramdisk_image << 32;

return ramdisk_image;
}
-static u64 __init get_ramdisk_size(void)
+u64 __init get_ramdisk_size(struct boot_params *bp)
{
- u64 ramdisk_size = boot_params.hdr.ramdisk_size;
+ u64 ramdisk_size = bp->hdr.ramdisk_size;

- ramdisk_size |= (u64)boot_params.ext_ramdisk_size << 32;
+ ramdisk_size |= (u64)bp->ext_ramdisk_size << 32;

return ramdisk_size;
}
@@ -318,8 +318,8 @@ static u64 __init get_ramdisk_size(void)
static void __init relocate_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = get_ramdisk_image();
- u64 ramdisk_size = get_ramdisk_size();
+ u64 ramdisk_image = get_ramdisk_image(&boot_params);
+ u64 ramdisk_size = get_ramdisk_size(&boot_params);
u64 area_size = PAGE_ALIGN(ramdisk_size);
u64 ramdisk_here;
unsigned long slop, clen, mapaddr;
@@ -358,8 +358,8 @@ static void __init relocate_initrd(void)
ramdisk_size -= clen;
}

- ramdisk_image = get_ramdisk_image();
- ramdisk_size = get_ramdisk_size();
+ ramdisk_image = get_ramdisk_image(&boot_params);
+ ramdisk_size = get_ramdisk_size(&boot_params);
printk(KERN_INFO "Move RAMDISK from [mem %#010llx-%#010llx] to"
" [mem %#010llx-%#010llx]\n",
ramdisk_image, ramdisk_image + ramdisk_size - 1,
@@ -369,8 +369,8 @@ static void __init relocate_initrd(void)
static void __init early_reserve_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = get_ramdisk_image();
- u64 ramdisk_size = get_ramdisk_size();
+ u64 ramdisk_image = get_ramdisk_image(&boot_params);
+ u64 ramdisk_size = get_ramdisk_size(&boot_params);
u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);

if (!boot_params.hdr.type_of_loader ||
@@ -382,8 +382,8 @@ static void __init early_reserve_initrd(void)
static void __init reserve_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = get_ramdisk_image();
- u64 ramdisk_size = get_ramdisk_size();
+ u64 ramdisk_image = get_ramdisk_image(&boot_params);
+ u64 ramdisk_size = get_ramdisk_size(&boot_params);
u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);
u64 mapped_size;

Subject: [tip:x86/mm] x86, ACPI, mm: Kill max_low_pfn_mapped

Commit-ID: b19feb388cdee35bf991e4977d1936f6d23c75a8
Gitweb: http://git.kernel.org/tip/b19feb388cdee35bf991e4977d1936f6d23c75a8
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:02:50 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:03:37 -0700

x86, ACPI, mm: Kill max_low_pfn_mapped

Now that we have the pfn_mapped[] array, max_low_pfn_mapped should
not be used anymore. Users should use pfn_mapped[] or just
1UL<<(32-PAGE_SHIFT) instead.

The only user of max_low_pfn_mapped is ACPI_INITRD_TABLE_OVERRIDE.
We can change it to use 1UL<<(32-PAGE_SHIFT), i.e. under 4G.

Known problem:
There is another user of max_low_pfn_mapped: the i915 device driver.
But that code is commented out by a pair of "#if 0 ... #endif".
It is not clear why the driver developers did that.

-v2: Leave max_low_pfn_mapped alone in the i915 code, per tj.

Suggested-by: H. Peter Anvin <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: [email protected]
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/include/asm/page_types.h | 1 -
arch/x86/kernel/setup.c | 4 +---
arch/x86/mm/init.c | 4 ----
drivers/acpi/osl.c | 6 +++---
4 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index 54c9787..b012b82 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -43,7 +43,6 @@

extern int devmem_is_allowed(unsigned long pagenr);

-extern unsigned long max_low_pfn_mapped;
extern unsigned long max_pfn_mapped;

static inline phys_addr_t get_max_mapped(void)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 66ab495..6ca5f2c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -112,13 +112,11 @@
#include <asm/prom.h>

/*
- * max_low_pfn_mapped: highest direct mapped pfn under 4GB
- * max_pfn_mapped: highest direct mapped pfn over 4GB
+ * max_pfn_mapped: highest direct mapped pfn
*
* The direct mapping only covers E820_RAM regions, so the ranges and gaps are
* represented by pfn_mapped
*/
-unsigned long max_low_pfn_mapped;
unsigned long max_pfn_mapped;

#ifdef CONFIG_DMI
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index eaac174..8554656 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -313,10 +313,6 @@ static void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
nr_pfn_mapped = clean_sort_range(pfn_mapped, E820_X_MAX);

max_pfn_mapped = max(max_pfn_mapped, end_pfn);
-
- if (start_pfn < (1UL<<(32-PAGE_SHIFT)))
- max_low_pfn_mapped = max(max_low_pfn_mapped,
- min(end_pfn, 1UL<<(32-PAGE_SHIFT)));
}

bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index e721863..93e3194 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -624,9 +624,9 @@ void __init acpi_initrd_override(void *data, size_t size)
if (table_nr == 0)
return;

- acpi_tables_addr =
- memblock_find_in_range(0, max_low_pfn_mapped << PAGE_SHIFT,
- all_tables_size, PAGE_SIZE);
+ /* under 4G at first, then above 4G */
+ acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
+ all_tables_size, PAGE_SIZE);
if (!acpi_tables_addr) {
WARN_ON(1);
return;

Subject: [tip:x86/mm] x86, ACPI: Search buffer above 4GB in a second try for acpi initrd table override

Commit-ID: 5a2c7ccc51a2bc42d96e05dd3d920ef0c09eb730
Gitweb: http://git.kernel.org/tip/5a2c7ccc51a2bc42d96e05dd3d920ef0c09eb730
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:02:51 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:03:49 -0700

x86, ACPI: Search buffer above 4GB in a second try for acpi initrd table override

Currently we only search for a buffer for new ACPI tables from the
initrd under 4GB. In some cases, e.g. when the user passes memmap= to
exclude all low RAM, we may not find a range for it under 4GB. So do
a second try, searching for a buffer above 4GB.

Since later accesses to the tables go through early_ioremap(), using
memory above 4GB is OK.

Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: [email protected]
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
drivers/acpi/osl.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 93e3194..42c48fc 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -627,6 +627,10 @@ void __init acpi_initrd_override(void *data, size_t size)
/* under 4G at first, then above 4G */
acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
all_tables_size, PAGE_SIZE);
+ if (!acpi_tables_addr)
+ acpi_tables_addr = memblock_find_in_range(0,
+ ~(phys_addr_t)0,
+ all_tables_size, PAGE_SIZE);
if (!acpi_tables_addr) {
WARN_ON(1);
return;

Subject: [tip:x86/mm] x86, ACPI: Increase acpi initrd override tables number limit

Commit-ID: 7a309b8608958c40bb7f82ac83532a44b09deae2
Gitweb: http://git.kernel.org/tip/7a309b8608958c40bb7f82ac83532a44b09deae2
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:02:52 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:03:57 -0700

x86, ACPI: Increase acpi initrd override tables number limit

The current number of ACPI tables in the initrd is limited to 10, which
is too small. 64 should be enough, as we have 35 signatures and could
have several SSDTs.

Two problems in the current code prevent us from raising the 10-table limit:
1. The cpio file info array lives on the stack; as each element is 32
bytes, we could run out of stack if we grew the array to 64 entries.
So move it off the stack, make it global, and put it in the
__initdata section.
2. early_ioremap() can only remap 256KB at a time. The current code
maps all 10 tables at once; with a higher limit the total size could
exceed 256KB and early_ioremap() would fail.
So map the tables one by one during copying, instead of mapping them
all at once.
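The one-table-at-a-time copy in problem 2 can be sketched in userspace;
here the per-call mapping is reduced to a bounds check against the
256KB ceiling, since the point is only the chunking scheme
(copy_tables() and struct table are illustrative names, not kernel API):

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

#define MAP_LIMIT (256 * 1024)	/* early_ioremap()'s per-call ceiling */

struct table { const unsigned char *data; size_t size; };

/* Sketch of the chunked copy: instead of mapping the destination once
 * for the total size of all tables (which could exceed MAP_LIMIT),
 * "map" (here: just bounds-check) each table separately while copying. */
static size_t copy_tables(unsigned char *dest, const struct table *t,
                          size_t nr)
{
	size_t off = 0;

	for (size_t i = 0; i < nr; i++) {
		/* each individual mapping stays within the 256KB limit */
		if (t[i].size > MAP_LIMIT)
			return 0;	/* a single table this large would fail */
		memcpy(dest + off, t[i].data, t[i].size);
		off += t[i].size;
	}
	return off;			/* total bytes copied */
}
```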

-v2: Per tj, split this out into a separate patch; also rename the
array to acpi_initrd_files.
-v3: Add comments about mapping the tables one by one during copying,
per tj.

Signed-off-by: Yinghai <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Acked-by: Tejun Heo <[email protected]>
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
drivers/acpi/osl.c | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 42c48fc..53dd490 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -569,8 +569,8 @@ static const char * const table_sigs[] = {

#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)

-/* Must not increase 10 or needs code modification below */
-#define ACPI_OVERRIDE_TABLES 10
+#define ACPI_OVERRIDE_TABLES 64
+static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];

void __init acpi_initrd_override(void *data, size_t size)
{
@@ -579,7 +579,6 @@ void __init acpi_initrd_override(void *data, size_t size)
struct acpi_table_header *table;
char cpio_path[32] = "kernel/firmware/acpi/";
struct cpio_data file;
- struct cpio_data early_initrd_files[ACPI_OVERRIDE_TABLES];
char *p;

if (data == NULL || size == 0)
@@ -617,8 +616,8 @@ void __init acpi_initrd_override(void *data, size_t size)
table->signature, cpio_path, file.name, table->length);

all_tables_size += table->length;
- early_initrd_files[table_nr].data = file.data;
- early_initrd_files[table_nr].size = file.size;
+ acpi_initrd_files[table_nr].data = file.data;
+ acpi_initrd_files[table_nr].size = file.size;
table_nr++;
}
if (table_nr == 0)
@@ -648,14 +647,19 @@ void __init acpi_initrd_override(void *data, size_t size)
memblock_reserve(acpi_tables_addr, all_tables_size);
arch_reserve_mem_area(acpi_tables_addr, all_tables_size);

- p = early_ioremap(acpi_tables_addr, all_tables_size);
-
+ /*
+ * early_ioremap can only remap 256KB at one time. If we map all the
+ * tables at one time, we will hit the limit. So we need to map tables
+ * one by one during copying.
+ */
for (no = 0; no < table_nr; no++) {
- memcpy(p + total_offset, early_initrd_files[no].data,
- early_initrd_files[no].size);
- total_offset += early_initrd_files[no].size;
+ phys_addr_t size = acpi_initrd_files[no].size;
+
+ p = early_ioremap(acpi_tables_addr + total_offset, size);
+ memcpy(p, acpi_initrd_files[no].data, size);
+ early_iounmap(p, size);
+ total_offset += size;
}
- early_iounmap(p, all_tables_size);
}
#endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */

Subject: [tip:x86/mm] x86, ACPI: Store override acpi tables phys addr in cpio files info array

Commit-ID: 8ec3ffdf3921675aeae8e9c2b42be3c0b700f153
Gitweb: http://git.kernel.org/tip/8ec3ffdf3921675aeae8e9c2b42be3c0b700f153
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:02:54 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:04:04 -0700

x86, ACPI: Store override acpi tables phys addr in cpio files info array

This patch introduces a file_pos struct to store physical addresses,
changes acpi_initrd_files[] to that type, and then stores the physical
addresses of the ACPI tables in acpi_initrd_files[].

For finding, we locate the ACPI tables by physical address in 32-bit
flat mode in head_32.S, because at that time we do not need to set up
page tables to access the initrd.

For copying, we can use early_ioremap() with the physical address
directly, before the memory mapping is set up.

To keep 32-bit and 64-bit platforms consistent, use physical addresses
for both.

-v2: Introduce file_pos to save the physical address instead of
abusing cpio_data, which tj was not happy with.

Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
drivers/acpi/osl.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 6ab6c54..42f79e3 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -570,7 +570,11 @@ static const char * const table_sigs[] = {
#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)

#define ACPI_OVERRIDE_TABLES 64
-static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
+struct file_pos {
+ phys_addr_t data;
+ phys_addr_t size;
+};
+static struct file_pos __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];

void __init acpi_initrd_override_find(void *data, size_t size)
{
@@ -615,7 +619,7 @@ void __init acpi_initrd_override_find(void *data, size_t size)
table->signature, cpio_path, file.name, table->length);

all_tables_size += table->length;
- acpi_initrd_files[table_nr].data = file.data;
+ acpi_initrd_files[table_nr].data = __pa_nodebug(file.data);
acpi_initrd_files[table_nr].size = file.size;
table_nr++;
}
@@ -624,7 +628,7 @@ void __init acpi_initrd_override_find(void *data, size_t size)
void __init acpi_initrd_override_copy(void)
{
int no, total_offset = 0;
- char *p;
+ char *p, *q;

if (!all_tables_size)
return;
@@ -659,12 +663,15 @@ void __init acpi_initrd_override_copy(void)
* one by one during copying.
*/
for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
+ phys_addr_t addr = acpi_initrd_files[no].data;
phys_addr_t size = acpi_initrd_files[no].size;

if (!size)
break;
+ q = early_ioremap(addr, size);
p = early_ioremap(acpi_tables_addr + total_offset, size);
- memcpy(p, acpi_initrd_files[no].data, size);
+ memcpy(p, q, size);
+ early_iounmap(q, size);
early_iounmap(p, size);
total_offset += size;
}

Subject: [tip:x86/mm] x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode

Commit-ID: 56cb257fee5a6e381452bc11fe47357b04cd085e
Gitweb: http://git.kernel.org/tip/56cb257fee5a6e381452bc11fe47357b04cd085e
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:02:55 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:04:06 -0700

x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode

For the finding procedure, it is easiest to access the initrd in 32-bit
flat mode, as we don't need to set up page tables. That is done from
head_32.S, and microcode updating already uses this trick.

This patch does the following:

1. Change acpi_initrd_override_find() to use physical addresses to
access global variables.

2. Pass a bool parameter "is_phys" to acpi_initrd_override_find(),
because on 32-bit we cannot tell from the address itself whether it
is physical or virtual. The boot loader could load the initrd above
max_low_pfn.

3. Put table_sigs[] on the stack, as it would be too messy to convert
the string array to physical addresses while keeping the offset
calculation correct. The size is about 36x4 bytes, small enough for
the stack.

4. Also rewrite the INVALID_TABLE macro as a do {...} while (0) loop
so that it is more readable.

NOTE: Don't call printk(), as it uses global variables, so delay
printing until copying.

Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/kernel/setup.c | 2 +-
drivers/acpi/osl.c | 85 ++++++++++++++++++++++++++++++++++---------------
include/linux/acpi.h | 5 +--
3 files changed, 63 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 42f584c..142e042 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1120,7 +1120,7 @@ void __init setup_arch(char **cmdline_p)
reserve_initrd();

acpi_initrd_override_find((void *)initrd_start,
- initrd_end - initrd_start);
+ initrd_end - initrd_start, false);
acpi_initrd_override_copy();

reserve_crashkernel();
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 42f79e3..23578e8 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -551,21 +551,9 @@ u8 __init acpi_table_checksum(u8 *buffer, u32 length)
return sum;
}

-/* All but ACPI_SIG_RSDP and ACPI_SIG_FACS: */
-static const char * const table_sigs[] = {
- ACPI_SIG_BERT, ACPI_SIG_CPEP, ACPI_SIG_ECDT, ACPI_SIG_EINJ,
- ACPI_SIG_ERST, ACPI_SIG_HEST, ACPI_SIG_MADT, ACPI_SIG_MSCT,
- ACPI_SIG_SBST, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_ASF,
- ACPI_SIG_BOOT, ACPI_SIG_DBGP, ACPI_SIG_DMAR, ACPI_SIG_HPET,
- ACPI_SIG_IBFT, ACPI_SIG_IVRS, ACPI_SIG_MCFG, ACPI_SIG_MCHI,
- ACPI_SIG_SLIC, ACPI_SIG_SPCR, ACPI_SIG_SPMI, ACPI_SIG_TCPA,
- ACPI_SIG_UEFI, ACPI_SIG_WAET, ACPI_SIG_WDAT, ACPI_SIG_WDDT,
- ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT,
- ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };
-
/* Non-fatal errors: Affected tables/files are ignored */
#define INVALID_TABLE(x, path, name) \
- { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); continue; }
+ do { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); } while (0)

#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)

@@ -576,17 +564,45 @@ struct file_pos {
};
static struct file_pos __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];

-void __init acpi_initrd_override_find(void *data, size_t size)
+/*
+ * acpi_initrd_override_find() is called from head_32.S and head64.c.
+ * head_32.S calling path is with 32bit flat mode, so we can access
+ * initrd early without setting pagetable or relocating initrd. For
+ * global variables accessing, we need to use phys address instead of
+ * kernel virtual address, try to put table_sigs string array in stack,
+ * so avoid switching for it.
+ * Also don't call printk as it uses global variables.
+ */
+void __init acpi_initrd_override_find(void *data, size_t size, bool is_phys)
{
int sig, no, table_nr = 0;
long offset = 0;
struct acpi_table_header *table;
char cpio_path[32] = "kernel/firmware/acpi/";
struct cpio_data file;
+ struct file_pos *files = acpi_initrd_files;
+ int *all_tables_size_p = &all_tables_size;
+
+ /* All but ACPI_SIG_RSDP and ACPI_SIG_FACS: */
+ char *table_sigs[] = {
+ ACPI_SIG_BERT, ACPI_SIG_CPEP, ACPI_SIG_ECDT, ACPI_SIG_EINJ,
+ ACPI_SIG_ERST, ACPI_SIG_HEST, ACPI_SIG_MADT, ACPI_SIG_MSCT,
+ ACPI_SIG_SBST, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_ASF,
+ ACPI_SIG_BOOT, ACPI_SIG_DBGP, ACPI_SIG_DMAR, ACPI_SIG_HPET,
+ ACPI_SIG_IBFT, ACPI_SIG_IVRS, ACPI_SIG_MCFG, ACPI_SIG_MCHI,
+ ACPI_SIG_SLIC, ACPI_SIG_SPCR, ACPI_SIG_SPMI, ACPI_SIG_TCPA,
+ ACPI_SIG_UEFI, ACPI_SIG_WAET, ACPI_SIG_WDAT, ACPI_SIG_WDDT,
+ ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT,
+ ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };

if (data == NULL || size == 0)
return;

+ if (is_phys) {
+ files = (struct file_pos *)__pa_symbol(acpi_initrd_files);
+ all_tables_size_p = (int *)__pa_symbol(&all_tables_size);
+ }
+
for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
file = find_cpio_data(cpio_path, data, size, &offset);
if (!file.data)
@@ -595,9 +611,12 @@ void __init acpi_initrd_override_find(void *data, size_t size)
data += offset;
size -= offset;

- if (file.size < sizeof(struct acpi_table_header))
- INVALID_TABLE("Table smaller than ACPI header",
+ if (file.size < sizeof(struct acpi_table_header)) {
+ if (!is_phys)
+ INVALID_TABLE("Table smaller than ACPI header",
cpio_path, file.name);
+ continue;
+ }

table = file.data;

@@ -605,22 +624,33 @@ void __init acpi_initrd_override_find(void *data, size_t size)
if (!memcmp(table->signature, table_sigs[sig], 4))
break;

- if (!table_sigs[sig])
- INVALID_TABLE("Unknown signature",
+ if (!table_sigs[sig]) {
+ if (!is_phys)
+ INVALID_TABLE("Unknown signature",
cpio_path, file.name);
- if (file.size != table->length)
- INVALID_TABLE("File length does not match table length",
+ continue;
+ }
+ if (file.size != table->length) {
+ if (!is_phys)
+ INVALID_TABLE("File length does not match table length",
cpio_path, file.name);
- if (acpi_table_checksum(file.data, table->length))
- INVALID_TABLE("Bad table checksum",
+ continue;
+ }
+ if (acpi_table_checksum(file.data, table->length)) {
+ if (!is_phys)
+ INVALID_TABLE("Bad table checksum",
cpio_path, file.name);
+ continue;
+ }

- pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
+ if (!is_phys)
+ pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
table->signature, cpio_path, file.name, table->length);

- all_tables_size += table->length;
- acpi_initrd_files[table_nr].data = __pa_nodebug(file.data);
- acpi_initrd_files[table_nr].size = file.size;
+ (*all_tables_size_p) += table->length;
+ files[table_nr].data = is_phys ? (phys_addr_t)file.data :
+ __pa_nodebug(file.data);
+ files[table_nr].size = file.size;
table_nr++;
}
}
@@ -670,6 +700,9 @@ void __init acpi_initrd_override_copy(void)
break;
q = early_ioremap(addr, size);
p = early_ioremap(acpi_tables_addr + total_offset, size);
+ pr_info("%4.4s ACPI table found in initrd [%#010llx-%#010llx]\n",
+ ((struct acpi_table_header *)q)->signature,
+ (u64)addr, (u64)(addr + size - 1));
memcpy(p, q, size);
early_iounmap(q, size);
early_iounmap(p, size);
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 8dd917b..4e3731b 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -469,10 +469,11 @@ static inline bool acpi_driver_match_device(struct device *dev,
#endif /* !CONFIG_ACPI */

#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
-void acpi_initrd_override_find(void *data, size_t size);
+void acpi_initrd_override_find(void *data, size_t size, bool is_phys);
void acpi_initrd_override_copy(void);
#else
-static inline void acpi_initrd_override_find(void *data, size_t size) { }
+static inline void acpi_initrd_override_find(void *data, size_t size,
+ bool is_phys) { }
static inline void acpi_initrd_override_copy(void) { }
#endif

Subject: [tip:x86/mm] x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c

Commit-ID: 88168dcb255f44892bcf9f6fac6aeb424471ffaa
Gitweb: http://git.kernel.org/tip/88168dcb255f44892bcf9f6fac6aeb424471ffaa
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:02:56 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:04:50 -0700

x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c

head64.c can use the #PF handler's automatic page tables to access the
initrd before the init memory mapping is set up and the initrd is
relocated.

head_32.S can use 32-bit flat mode to access the initrd before the
init memory mapping and initrd relocation.

This patch introduces x86_acpi_override_find(), called from
head_32.S/head64.c, to replace acpi_initrd_override_find(), making
32-bit and 64-bit more consistent.

-v2: Use an inline function in the header file instead, per tj; also
keep the #ifdef for head_32.S to avoid a compile error.
-v3: Move reserve_initrd() down after acpi_initrd_override_copy() to
make sure we use the right address.

Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/include/asm/setup.h | 6 ++++++
arch/x86/kernel/head64.c | 2 ++
arch/x86/kernel/head_32.S | 4 ++++
arch/x86/kernel/setup.c | 34 ++++++++++++++++++++++++++++++----
4 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 4f71d48..6f885b7 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -42,6 +42,12 @@ extern void visws_early_detect(void);
static inline void visws_early_detect(void) { }
#endif

+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void x86_acpi_override_find(void);
+#else
+static inline void x86_acpi_override_find(void) { }
+#endif
+
extern unsigned long saved_video_mode;

extern void reserve_standard_io_resources(void);
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 55b6761..229b281 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -175,6 +175,8 @@ void __init x86_64_start_kernel(char * real_mode_data)
if (console_loglevel == 10)
early_printk("Kernel alive\n");

+ x86_acpi_override_find();
+
clear_page(init_level4_pgt);
/* set init_level4_pgt kernel high mapping*/
init_level4_pgt[511] = early_level4_pgt[511];
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 73afd11..ca08f0e 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -149,6 +149,10 @@ ENTRY(startup_32)
call load_ucode_bsp
#endif

+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+ call x86_acpi_override_find
+#endif
+
/*
* Initialize page tables. This creates a PDE and a set of page
* tables, which are located immediately beyond __brk_base. The variable
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 142e042..d11b1b7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -421,6 +421,34 @@ static void __init reserve_initrd(void)
}
#endif /* CONFIG_BLK_DEV_INITRD */

+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void __init x86_acpi_override_find(void)
+{
+ unsigned long ramdisk_image, ramdisk_size;
+ unsigned char *p = NULL;
+
+#ifdef CONFIG_X86_32
+ struct boot_params *boot_params_p;
+
+ /*
+ * 32bit is from head_32.S, and it is 32bit flat mode.
+ * So need to use phys address to access global variables.
+ */
+ boot_params_p = (struct boot_params *)__pa_nodebug(&boot_params);
+ ramdisk_image = get_ramdisk_image(boot_params_p);
+ ramdisk_size = get_ramdisk_size(boot_params_p);
+ p = (unsigned char *)ramdisk_image;
+ acpi_initrd_override_find(p, ramdisk_size, true);
+#else
+ ramdisk_image = get_ramdisk_image(&boot_params);
+ ramdisk_size = get_ramdisk_size(&boot_params);
+ if (ramdisk_image)
+ p = __va(ramdisk_image);
+ acpi_initrd_override_find(p, ramdisk_size, false);
+#endif
+}
+#endif
+
static void __init parse_setup_data(void)
{
struct setup_data *data;
@@ -1117,12 +1145,10 @@ void __init setup_arch(char **cmdline_p)
/* Allocate bigger log buffer */
setup_log_buf(1);

- reserve_initrd();
-
- acpi_initrd_override_find((void *)initrd_start,
- initrd_end - initrd_start, false);
acpi_initrd_override_copy();

+ reserve_initrd();
+
reserve_crashkernel();

vsmp_init();

Subject: [tip:x86/mm] x86, mm, numa: Call numa_meminfo_cover_memory() checking early

Commit-ID: 3c5d8f9640b0c7c512434d7047c34bab976e1f9a
Gitweb: http://git.kernel.org/tip/3c5d8f9640b0c7c512434d7047c34bab976e1f9a
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:02:58 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:04:56 -0700

x86, mm, numa: Call numa_meminfo_cover_memory() checking early

In order to separate the NUMA info parsing procedure into two steps,
we need to set the memblock nid later, as doing so can modify the
memblock array and possibly double memblock.memory, which requires
allocating a buffer.

We do not need the nid in memblock to find absent pages, so we can
move the numa_meminfo_cover_memory() check earlier.

Also change __absent_pages_in_range() to static and use
absent_pages_in_range() directly.

Later we will set the memblock nid only once, on the successful path.
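The coverage check being moved can be sketched in userspace: sum the
pages of the NUMA blocks that actually fall inside present RAM and
compare against total RAM, with no nid involved. All names here
(present_in(), meminfo_covers(), struct pfn_range, the tolerance
parameter) are illustrative stand-ins, not the kernel's:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct pfn_range { unsigned long start, end; };	/* [start, end) in pages */

/* Pages of [start, end) that fall inside any present-RAM range; this
 * mirrors absent_pages_in_range(), just inverted for simplicity. */
static unsigned long present_in(const struct pfn_range *ram, size_t n,
                                unsigned long start, unsigned long end)
{
	unsigned long pages = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned long s = ram[i].start > start ? ram[i].start : start;
		unsigned long e = ram[i].end < end ? ram[i].end : end;
		if (e > s)
			pages += e - s;
	}
	return pages;
}

/* Sketch of the numa_meminfo_cover_memory() idea: the NUMA blocks must
 * cover (almost) all present RAM; tolerance stands in for the kernel's
 * slack. Note that, as in the patch, no nid is needed for this check. */
static bool meminfo_covers(const struct pfn_range *blk, size_t nblk,
                           const struct pfn_range *ram, size_t nram,
                           unsigned long tolerance)
{
	unsigned long numaram = 0, totalram = 0;

	for (size_t i = 0; i < nblk; i++)
		numaram += present_in(ram, nram, blk[i].start, blk[i].end);
	for (size_t i = 0; i < nram; i++)
		totalram += ram[i].end - ram[i].start;
	return numaram + tolerance >= totalram;
}
```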

Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/mm/numa.c | 7 ++++---
include/linux/mm.h | 2 --
mm/page_alloc.c | 2 +-
3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 07ae800..1bb565d 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -457,7 +457,7 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
u64 s = mi->blk[i].start >> PAGE_SHIFT;
u64 e = mi->blk[i].end >> PAGE_SHIFT;
numaram += e - s;
- numaram -= __absent_pages_in_range(mi->blk[i].nid, s, e);
+ numaram -= absent_pages_in_range(s, e);
if ((s64)numaram < 0)
numaram = 0;
}
@@ -485,6 +485,9 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
if (WARN_ON(nodes_empty(node_possible_map)))
return -EINVAL;

+ if (!numa_meminfo_cover_memory(mi))
+ return -EINVAL;
+
for (i = 0; i < mi->nr_blks; i++) {
struct numa_memblk *mb = &mi->blk[i];
memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
@@ -503,8 +506,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
return -EINVAL;
}
#endif
- if (!numa_meminfo_cover_memory(mi))
- return -EINVAL;

return 0;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e0c8528..28e9470 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1385,8 +1385,6 @@ static inline unsigned long free_initmem_default(int poison)
*/
extern void free_area_init_nodes(unsigned long *max_zone_pfn);
unsigned long node_map_pfn_alignment(void);
-unsigned long __absent_pages_in_range(int nid, unsigned long start_pfn,
- unsigned long end_pfn);
extern unsigned long absent_pages_in_range(unsigned long start_pfn,
unsigned long end_pfn);
extern void get_pfn_range_for_nid(unsigned int nid,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 378a15b..c427f46 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4395,7 +4395,7 @@ static unsigned long __meminit zone_spanned_pages_in_node(int nid,
* Return the number of holes in a range on a node. If nid is MAX_NUMNODES,
* then all holes in the requested range will be accounted for.
*/
-unsigned long __meminit __absent_pages_in_range(int nid,
+static unsigned long __meminit __absent_pages_in_range(int nid,
unsigned long range_start_pfn,
unsigned long range_end_pfn)
{

Subject: [tip:x86/mm] x86, mm, numa: Move node_map_pfn_alignment() to x86

Commit-ID: 076d2bd696f8fc47c881a92dd1e5b203ef556f51
Gitweb: http://git.kernel.org/tip/076d2bd696f8fc47c881a92dd1e5b203ef556f51
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:02:59 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:04:58 -0700

x86, mm, numa: Move node_map_pfn_alignment() to x86

Move node_map_pfn_alignment() to arch/x86/mm as there are no other
users of it.

A later patch will update it to use numa_meminfo instead of memblock.
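For reference, the mask computation the function performs can be
exercised in userspace. This is a simplified port of the loop body
shown in the diff below; struct nid_range and lowest_set_bit() are
stand-ins for memblock's pfn ranges and __ffs():

```c
#include <assert.h>
#include <stddef.h>

struct nid_range { unsigned long start, end; int nid; };  /* pfn ranges */

static unsigned long lowest_set_bit(unsigned long x)	/* like __ffs() */
{
	unsigned long b = 0;

	while (!(x & (1UL << b)))
		b++;
	return b;
}

/* Userspace port of node_map_pfn_alignment(): the maximum power-of-two
 * alignment (in pages) that still distinguishes all nodes; 0 if there
 * is only a single node. */
static unsigned long map_pfn_alignment(const struct nid_range *r, size_t n)
{
	unsigned long accl_mask = 0, last_end = 0, mask;
	int last_nid = -1;

	for (size_t i = 0; i < n; i++) {
		unsigned long start = r[i].start, end = r[i].end;

		if (!start || last_nid < 0 || last_nid == r[i].nid) {
			last_nid = r[i].nid;
			last_end = end;
			continue;
		}

		/* widen the mask until it no longer separates this node
		 * from the previous one */
		mask = ~((1UL << lowest_set_bit(start)) - 1);
		while (mask && last_end <= (start & (mask << 1)))
			mask <<= 1;

		/* accumulate all internode masks */
		accl_mask |= mask;
	}

	/* convert mask to number of pages */
	return ~accl_mask + 1;
}
```

Two adjacent nodes split at pfn 256 yield a 256-page alignment, and a
single node yields 0, matching the function's documented contract.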

Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/mm/numa.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/mm.h | 1 -
mm/page_alloc.c | 50 --------------------------------------------------
3 files changed, 50 insertions(+), 51 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1bb565d..10c6240 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -474,6 +474,56 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
return true;
}

+/**
+ * node_map_pfn_alignment - determine the maximum internode alignment
+ *
+ * This function should be called after node map is populated and sorted.
+ * It calculates the maximum power of two alignment which can distinguish
+ * all the nodes.
+ *
+ * For example, if all nodes are 1GiB and aligned to 1GiB, the return value
+ * would indicate 1GiB alignment with (1 << (30 - PAGE_SHIFT)). If the
+ * nodes are shifted by 256MiB, 256MiB. Note that if only the last node is
+ * shifted, 1GiB is enough and this function will indicate so.
+ *
+ * This is used to test whether pfn -> nid mapping of the chosen memory
+ * model has fine enough granularity to avoid incorrect mapping for the
+ * populated node map.
+ *
+ * Returns the determined alignment in pfn's. 0 if there is no alignment
+ * requirement (single node).
+ */
+unsigned long __init node_map_pfn_alignment(void)
+{
+ unsigned long accl_mask = 0, last_end = 0;
+ unsigned long start, end, mask;
+ int last_nid = -1;
+ int i, nid;
+
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
+ if (!start || last_nid < 0 || last_nid == nid) {
+ last_nid = nid;
+ last_end = end;
+ continue;
+ }
+
+ /*
+ * Start with a mask granular enough to pin-point to the
+ * start pfn and tick off bits one-by-one until it becomes
+ * too coarse to separate the current node from the last.
+ */
+ mask = ~((1 << __ffs(start)) - 1);
+ while (mask && last_end <= (start & (mask << 1)))
+ mask <<= 1;
+
+ /* accumulate all internode masks */
+ accl_mask |= mask;
+ }
+
+ /* convert mask to number of pages */
+ return ~accl_mask + 1;
+}
+
static int __init numa_register_memblks(struct numa_meminfo *mi)
{
unsigned long uninitialized_var(pfn_align);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 28e9470..b827743 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1384,7 +1384,6 @@ static inline unsigned long free_initmem_default(int poison)
* CONFIG_HAVE_MEMBLOCK_NODE_MAP.
*/
extern void free_area_init_nodes(unsigned long *max_zone_pfn);
-unsigned long node_map_pfn_alignment(void);
extern unsigned long absent_pages_in_range(unsigned long start_pfn,
unsigned long end_pfn);
extern void get_pfn_range_for_nid(unsigned int nid,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c427f46..28c4a97 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4760,56 +4760,6 @@ void __init setup_nr_node_ids(void)
}
#endif

-/**
- * node_map_pfn_alignment - determine the maximum internode alignment
- *
- * This function should be called after node map is populated and sorted.
- * It calculates the maximum power of two alignment which can distinguish
- * all the nodes.
- *
- * For example, if all nodes are 1GiB and aligned to 1GiB, the return value
- * would indicate 1GiB alignment with (1 << (30 - PAGE_SHIFT)). If the
- * nodes are shifted by 256MiB, 256MiB. Note that if only the last node is
- * shifted, 1GiB is enough and this function will indicate so.
- *
- * This is used to test whether pfn -> nid mapping of the chosen memory
- * model has fine enough granularity to avoid incorrect mapping for the
- * populated node map.
- *
- * Returns the determined alignment in pfn's. 0 if there is no alignment
- * requirement (single node).
- */
-unsigned long __init node_map_pfn_alignment(void)
-{
- unsigned long accl_mask = 0, last_end = 0;
- unsigned long start, end, mask;
- int last_nid = -1;
- int i, nid;
-
- for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
- if (!start || last_nid < 0 || last_nid == nid) {
- last_nid = nid;
- last_end = end;
- continue;
- }
-
- /*
- * Start with a mask granular enough to pin-point to the
- * start pfn and tick off bits one-by-one until it becomes
- * too coarse to separate the current node from the last.
- */
- mask = ~((1 << __ffs(start)) - 1);
- while (mask && last_end <= (start & (mask << 1)))
- mask <<= 1;
-
- /* accumulate all internode masks */
- accl_mask |= mask;
- }
-
- /* convert mask to number of pages */
- return ~accl_mask + 1;
-}
-
/* Find the lowest pfn for a node */
static unsigned long __init find_min_pfn_for_node(int nid)
{

Subject: [tip:x86/mm] x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment

Commit-ID: 052b6965a153de6c46203c574c5ad3161e829898
Gitweb: http://git.kernel.org/tip/052b6965a153de6c46203c574c5ad3161e829898
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:03:00 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:05:00 -0700

x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment

We can use numa_meminfo directly instead of the memblock nid in
node_map_pfn_alignment().

That lets us set the memblock nid later, and only once, on the
successful path.

-v2: Per tj, split the code movement into a separate patch.

Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/mm/numa.c | 30 +++++++++++++++++++-----------
1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 10c6240..cff565a 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -493,14 +493,18 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
* Returns the determined alignment in pfn's. 0 if there is no alignment
* requirement (single node).
*/
-unsigned long __init node_map_pfn_alignment(void)
+#ifdef NODE_NOT_IN_PAGE_FLAGS
+static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
{
unsigned long accl_mask = 0, last_end = 0;
unsigned long start, end, mask;
int last_nid = -1;
int i, nid;

- for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
+ for (i = 0; i < mi->nr_blks; i++) {
+ start = mi->blk[i].start >> PAGE_SHIFT;
+ end = mi->blk[i].end >> PAGE_SHIFT;
+ nid = mi->blk[i].nid;
if (!start || last_nid < 0 || last_nid == nid) {
last_nid = nid;
last_end = end;
@@ -523,10 +527,16 @@ unsigned long __init node_map_pfn_alignment(void)
/* convert mask to number of pages */
return ~accl_mask + 1;
}
+#else
+static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
+{
+ return 0;
+}
+#endif

static int __init numa_register_memblks(struct numa_meminfo *mi)
{
- unsigned long uninitialized_var(pfn_align);
+ unsigned long pfn_align;
int i;

/* Account for nodes with cpus and no memory */
@@ -538,24 +548,22 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
if (!numa_meminfo_cover_memory(mi))
return -EINVAL;

- for (i = 0; i < mi->nr_blks; i++) {
- struct numa_memblk *mb = &mi->blk[i];
- memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
- }
-
/*
* If sections array is gonna be used for pfn -> nid mapping, check
* whether its granularity is fine enough.
*/
-#ifdef NODE_NOT_IN_PAGE_FLAGS
- pfn_align = node_map_pfn_alignment();
+ pfn_align = node_map_pfn_alignment(mi);
if (pfn_align && pfn_align < PAGES_PER_SECTION) {
printk(KERN_WARNING "Node alignment %LuMB < min %LuMB, rejecting NUMA config\n",
PFN_PHYS(pfn_align) >> 20,
PFN_PHYS(PAGES_PER_SECTION) >> 20);
return -EINVAL;
}
-#endif
+
+ for (i = 0; i < mi->nr_blks; i++) {
+ struct numa_memblk *mb = &mi->blk[i];
+ memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
+ }

return 0;
}

Subject: [tip:x86/mm] x86, mm, numa: Set memblock nid later

Commit-ID: 1b74b2fd7fa0b4b1493a4921eefd560f9ff67963
Gitweb: http://git.kernel.org/tip/1b74b2fd7fa0b4b1493a4921eefd560f9ff67963
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:03:01 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:05:02 -0700

x86, mm, numa: Set memblock nid later

In order to separate the numa info parsing procedure into two steps,
we need to set the memblock nid later: setting it can change the
memblock array, possibly doubling memblock.memory, which requires
allocating a buffer.

Set the memblock nid only once, on the successful path.

Also rename numa_register_memblks() to numa_check_memblks(), now that
the code setting the memblock nid has been moved out.

Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/mm/numa.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index cff565a..e448b6f 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -534,10 +534,9 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
}
#endif

-static int __init numa_register_memblks(struct numa_meminfo *mi)
+static int __init numa_check_memblks(struct numa_meminfo *mi)
{
unsigned long pfn_align;
- int i;

/* Account for nodes with cpus and no memory */
node_possible_map = numa_nodes_parsed;
@@ -560,11 +559,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
return -EINVAL;
}

- for (i = 0; i < mi->nr_blks; i++) {
- struct numa_memblk *mb = &mi->blk[i];
- memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
- }
-
return 0;
}

@@ -601,7 +595,6 @@ static int __init numa_init(int (*init_func)(void))
nodes_clear(numa_nodes_parsed);
nodes_clear(node_possible_map);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
- WARN_ON(memblock_set_node(0, ULLONG_MAX, MAX_NUMNODES));
numa_reset_distance();

ret = init_func();
@@ -613,7 +606,7 @@ static int __init numa_init(int (*init_func)(void))

numa_emulation(&numa_meminfo, numa_distance_cnt);

- ret = numa_register_memblks(&numa_meminfo);
+ ret = numa_check_memblks(&numa_meminfo);
if (ret < 0)
return ret;

@@ -676,6 +669,11 @@ void __init x86_numa_init(void)

early_x86_numa_init();

+ for (i = 0; i < mi->nr_blks; i++) {
+ struct numa_memblk *mb = &mi->blk[i];
+ memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
+ }
+
/* Finally register nodes. */
for_each_node_mask(nid, node_possible_map) {
u64 start = PFN_PHYS(max_pfn);

Subject: [tip:x86/mm] x86, mm, numa: Move node_possible_map setting later

Commit-ID: 052f56f9b1ffa4b1d1fffb7beb43511e0c630305
Gitweb: http://git.kernel.org/tip/052f56f9b1ffa4b1d1fffb7beb43511e0c630305
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:03:02 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:05:03 -0700

x86, mm, numa: Move node_possible_map setting later

Move node_possible_map handling out of numa_check_memblks()
to avoid side effects when numa_check_memblks() is changed.

Set node_possible_map only once, on the successful path, instead
of resetting it in numa_init() every time.

Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/mm/numa.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e448b6f..da2ebab 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -536,12 +536,13 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)

static int __init numa_check_memblks(struct numa_meminfo *mi)
{
+ nodemask_t nodes_parsed;
unsigned long pfn_align;

/* Account for nodes with cpus and no memory */
- node_possible_map = numa_nodes_parsed;
- numa_nodemask_from_meminfo(&node_possible_map, mi);
- if (WARN_ON(nodes_empty(node_possible_map)))
+ nodes_parsed = numa_nodes_parsed;
+ numa_nodemask_from_meminfo(&nodes_parsed, mi);
+ if (WARN_ON(nodes_empty(nodes_parsed)))
return -EINVAL;

if (!numa_meminfo_cover_memory(mi))
@@ -593,7 +594,6 @@ static int __init numa_init(int (*init_func)(void))
set_apicid_to_node(i, NUMA_NO_NODE);

nodes_clear(numa_nodes_parsed);
- nodes_clear(node_possible_map);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
numa_reset_distance();

@@ -669,6 +669,9 @@ void __init x86_numa_init(void)

early_x86_numa_init();

+ node_possible_map = numa_nodes_parsed;
+ numa_nodemask_from_meminfo(&node_possible_map, mi);
+
for (i = 0; i < mi->nr_blks; i++) {
struct numa_memblk *mb = &mi->blk[i];
memblock_set_node(mb->start, mb->end - mb->start, mb->nid);

Subject: [tip:x86/mm] x86, mm, numa: Move numa emulation handling down.

Commit-ID: 1169f9b1e7bfb609264544bf3581f038722eb10a
Gitweb: http://git.kernel.org/tip/1169f9b1e7bfb609264544bf3581f038722eb10a
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:03:03 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:05:05 -0700

x86, mm, numa: Move numa emulation handling down.

numa_emulation() needs to allocate buffers for the new numa_meminfo
and the distance matrix, so execute it later, in x86_numa_init().

This also changes the behavior:
- before this patch, if the user passed bad data on the command
line, we fell back to the next numa probing method or disabled
numa entirely.
- after this patch, if the user passes bad data on the command line,
we keep the numa info probed earlier, e.g. from ACPI SRAT or
amd_numa.

We need to call numa_check_memblks() to reject bad user input early,
so that the original numa_meminfo is left unchanged.

Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: David Rientjes <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/mm/numa.c | 6 +++---
arch/x86/mm/numa_emulation.c | 2 +-
arch/x86/mm/numa_internal.h | 2 ++
3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index da2ebab..3254f22 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -534,7 +534,7 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
}
#endif

-static int __init numa_check_memblks(struct numa_meminfo *mi)
+int __init numa_check_memblks(struct numa_meminfo *mi)
{
nodemask_t nodes_parsed;
unsigned long pfn_align;
@@ -604,8 +604,6 @@ static int __init numa_init(int (*init_func)(void))
if (ret < 0)
return ret;

- numa_emulation(&numa_meminfo, numa_distance_cnt);
-
ret = numa_check_memblks(&numa_meminfo);
if (ret < 0)
return ret;
@@ -669,6 +667,8 @@ void __init x86_numa_init(void)

early_x86_numa_init();

+ numa_emulation(&numa_meminfo, numa_distance_cnt);
+
node_possible_map = numa_nodes_parsed;
numa_nodemask_from_meminfo(&node_possible_map, mi);

diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index dbbbb47..5a0433d 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -348,7 +348,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
if (ret < 0)
goto no_emu;

- if (numa_cleanup_meminfo(&ei) < 0) {
+ if (numa_cleanup_meminfo(&ei) < 0 || numa_check_memblks(&ei) < 0) {
pr_warning("NUMA: Warning: constructed meminfo invalid, disabling emulation\n");
goto no_emu;
}
diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h
index ad86ec9..bb2fbcc 100644
--- a/arch/x86/mm/numa_internal.h
+++ b/arch/x86/mm/numa_internal.h
@@ -21,6 +21,8 @@ void __init numa_reset_distance(void);

void __init x86_numa_init(void);

+int __init numa_check_memblks(struct numa_meminfo *mi);
+
#ifdef CONFIG_NUMA_EMU
void __init numa_emulation(struct numa_meminfo *numa_meminfo,
int numa_dist_cnt);

Subject: [tip:x86/mm] x86, ACPI, numa, ia64: split SLIT handling out

Commit-ID: f8e2d4e7235c816cf0a23aa2d32c57c0d4f8a3f2
Gitweb: http://git.kernel.org/tip/f8e2d4e7235c816cf0a23aa2d32c57c0d4f8a3f2
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:03:04 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:05:08 -0700

x86, ACPI, numa, ia64: split SLIT handling out

We need to handle the SLIT later, as it needs to allocate a buffer for
the distance matrix. Also, we do not need SLIT info before
init_mem_mapping(), so move the SLIT parsing procedure later.

x86_acpi_numa_init() is split into x86_acpi_numa_init_srat() and
x86_acpi_numa_init_slit().

Replacing acpi_numa_init() with
acpi_numa_init_srat()/acpi_numa_init_slit()/acpi_numa_arch_fixup()
should not break ia64.

-v2: Rename to acpi_numa_init_srat/acpi_numa_init_slit per tj.
Remove the reset_numa_distance() call in numa_init(), as we now only
set distances during SLIT handling.

Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: [email protected]
Tested-by: Tony Luck <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/ia64/kernel/setup.c | 4 +++-
arch/x86/include/asm/acpi.h | 3 ++-
arch/x86/mm/numa.c | 14 ++++++++++++--
arch/x86/mm/srat.c | 11 +++++++----
drivers/acpi/numa.c | 13 +++++++------
include/linux/acpi.h | 3 ++-
6 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
index 13bfdd2..5f7db4a 100644
--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -558,7 +558,9 @@ setup_arch (char **cmdline_p)
acpi_table_init();
early_acpi_boot_init();
# ifdef CONFIG_ACPI_NUMA
- acpi_numa_init();
+ acpi_numa_init_srat();
+ acpi_numa_init_slit();
+ acpi_numa_arch_fixup();
# ifdef CONFIG_ACPI_HOTPLUG_CPU
prefill_possible_map();
# endif
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index b31bf97..651db0b 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -178,7 +178,8 @@ static inline void disable_acpi(void) { }

#ifdef CONFIG_ACPI_NUMA
extern int acpi_numa;
-extern int x86_acpi_numa_init(void);
+int x86_acpi_numa_init_srat(void);
+void x86_acpi_numa_init_slit(void);
#endif /* CONFIG_ACPI_NUMA */

#define acpi_unlazy_tlb(x) leave_mm(x)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 3254f22..630e09f 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -595,7 +595,6 @@ static int __init numa_init(int (*init_func)(void))

nodes_clear(numa_nodes_parsed);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
- numa_reset_distance();

ret = init_func();
if (ret < 0)
@@ -633,6 +632,10 @@ static int __init dummy_numa_init(void)
return 0;
}

+#ifdef CONFIG_ACPI_NUMA
+static bool srat_used __initdata;
+#endif
+
/**
* x86_numa_init - Initialize NUMA
*
@@ -648,8 +651,10 @@ static void __init early_x86_numa_init(void)
return;
#endif
#ifdef CONFIG_ACPI_NUMA
- if (!numa_init(x86_acpi_numa_init))
+ if (!numa_init(x86_acpi_numa_init_srat)) {
+ srat_used = true;
return;
+ }
#endif
#ifdef CONFIG_AMD_NUMA
if (!numa_init(amd_numa_init))
@@ -667,6 +672,11 @@ void __init x86_numa_init(void)

early_x86_numa_init();

+#ifdef CONFIG_ACPI_NUMA
+ if (srat_used)
+ x86_acpi_numa_init_slit();
+#endif
+
numa_emulation(&numa_meminfo, numa_distance_cnt);

node_possible_map = numa_nodes_parsed;
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index cdd0da9..443f9ef 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -185,14 +185,17 @@ out_err:
return -1;
}

-void __init acpi_numa_arch_fixup(void) {}
-
-int __init x86_acpi_numa_init(void)
+int __init x86_acpi_numa_init_srat(void)
{
int ret;

- ret = acpi_numa_init();
+ ret = acpi_numa_init_srat();
if (ret < 0)
return ret;
return srat_disabled() ? -EINVAL : 0;
}
+
+void __init x86_acpi_numa_init_slit(void)
+{
+ acpi_numa_init_slit();
+}
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 33e609f..6460db4 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -282,7 +282,7 @@ acpi_table_parse_srat(enum acpi_srat_type id,
handler, max_entries);
}

-int __init acpi_numa_init(void)
+int __init acpi_numa_init_srat(void)
{
int cnt = 0;

@@ -303,11 +303,6 @@ int __init acpi_numa_init(void)
NR_NODE_MEMBLKS);
}

- /* SLIT: System Locality Information Table */
- acpi_table_parse(ACPI_SIG_SLIT, acpi_parse_slit);
-
- acpi_numa_arch_fixup();
-
if (cnt < 0)
return cnt;
else if (!parsed_numa_memblks)
@@ -315,6 +310,12 @@ int __init acpi_numa_init(void)
return 0;
}

+void __init acpi_numa_init_slit(void)
+{
+ /* SLIT: System Locality Information Table */
+ acpi_table_parse(ACPI_SIG_SLIT, acpi_parse_slit);
+}
+
int acpi_get_pxm(acpi_handle h)
{
unsigned long long pxm;
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 4e3731b..92463b5 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -85,7 +85,8 @@ int early_acpi_boot_init(void);
int acpi_boot_init (void);
void acpi_boot_table_init (void);
int acpi_mps_check (void);
-int acpi_numa_init (void);
+int acpi_numa_init_srat(void);
+void acpi_numa_init_slit(void);

int acpi_table_init (void);
int acpi_table_parse(char *id, acpi_tbl_table_handler handler);
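The probing order that this patch preserves in early_x86_numa_init() can be sketched as a toy model; early_numa_probe() and the stub init functions below are hypothetical names, and the return values simulate a machine with AMD NUMA tables but no SRAT:

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Illustrative sketch (not kernel code) of the fallback sequence:
 * each init function either fills numa info (returns 0) or fails
 * (returns negative); SLIT parsing is deferred and only runs later
 * when SRAT was the successful source. */
static bool srat_used;

static int numa_init(int (*init_func)(void))
{
	return init_func();		/* kernel version also resets state */
}

static int acpi_srat_init(void)  { return -1; }	/* pretend: no SRAT */
static int amd_numa_init(void)   { return 0; }	/* pretend: AMD tables */
static int dummy_numa_init(void) { return 0; }	/* always succeeds */

static const char *early_numa_probe(void)
{
	if (!numa_init(acpi_srat_init)) {
		srat_used = true;	/* remember to parse SLIT later */
		return "acpi";
	}
	if (!numa_init(amd_numa_init))
		return "amd";
	numa_init(dummy_numa_init);
	return "dummy";
}
```

Because acpi_srat_init() fails here, the probe falls through to amd_numa_init() and srat_used stays false, so the later x86_acpi_numa_init_slit() call would be skipped.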

Subject: [tip:x86/mm] x86, mm, numa: Add early_initmem_init() stub

Commit-ID: 9c80560654a3fb62ec3b3529ddcf85317537ff85
Gitweb: http://git.kernel.org/tip/9c80560654a3fb62ec3b3529ddcf85317537ff85
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:03:05 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:05:11 -0700

x86, mm, numa: Add early_initmem_init() stub

Introduce early_initmem_init() to call early_x86_numa_init(),
which will be used to parse numa info earlier.

Later patches will have it call init_mem_mapping() for each node.

Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/include/asm/page_types.h | 1 +
arch/x86/kernel/setup.c | 1 +
arch/x86/mm/init.c | 6 ++++++
arch/x86/mm/numa.c | 7 +++++--
4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index b012b82..d04dd8c 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -55,6 +55,7 @@ bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
extern unsigned long init_memory_mapping(unsigned long start,
unsigned long end);

+void early_initmem_init(void);
extern void initmem_init(void);

#endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d11b1b7..301165e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1162,6 +1162,7 @@ void __init setup_arch(char **cmdline_p)

early_acpi_boot_init();

+ early_initmem_init();
initmem_init();
memblock_find_dma_reserve();

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 8554656..3c21f16 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -467,6 +467,12 @@ void __init init_mem_mapping(void)
early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
}

+#ifndef CONFIG_NUMA
+void __init early_initmem_init(void)
+{
+}
+#endif
+
/*
* devmem_is_allowed() checks to see if /dev/mem access to a certain address
* is valid. The argument is a physical page number.
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 630e09f..7d76936 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -665,13 +665,16 @@ static void __init early_x86_numa_init(void)
numa_init(dummy_numa_init);
}

+void __init early_initmem_init(void)
+{
+ early_x86_numa_init();
+}
+
void __init x86_numa_init(void)
{
int i, nid;
struct numa_meminfo *mi = &numa_meminfo;

- early_x86_numa_init();
-
#ifdef CONFIG_ACPI_NUMA
if (srat_used)
x86_acpi_numa_init_slit();

Subject: [tip:x86/mm] x86, mm: Parse numa info earlier

Commit-ID: ca099f2813b5dccf2383784dbcfb9589110bd846
Gitweb: http://git.kernel.org/tip/ca099f2813b5dccf2383784dbcfb9589110bd846
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:03:06 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:05:13 -0700

x86, mm: Parse numa info earlier

Parsing numa info has been separated into two steps now.

early_initmem_init() only parses info into numa_meminfo and
numa_nodes_parsed, and keeps the numaq, acpi_numa, amd_numa, dummy
fallback sequence working.

SLIT and numa emulation handling are left in initmem_init().

Call early_initmem_init() before init_mem_mapping() so that the
latter can use the numa info.

Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/kernel/setup.c | 24 ++++++++++--------------
1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 301165e..fd0d5be 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1125,13 +1125,21 @@ void __init setup_arch(char **cmdline_p)
trim_platform_memory_ranges();
trim_low_memory_range();

+ /*
+ * Parse the ACPI tables for possible boot-time SMP configuration.
+ */
+ acpi_initrd_override_copy();
+ acpi_boot_table_init();
+ early_acpi_boot_init();
+ early_initmem_init();
init_mem_mapping();
-
+ memblock.current_limit = get_max_mapped();
early_trap_pf_init();

+ reserve_initrd();
+
setup_real_mode();

- memblock.current_limit = get_max_mapped();
dma_contiguous_reserve(0);

/*
@@ -1145,24 +1153,12 @@ void __init setup_arch(char **cmdline_p)
/* Allocate bigger log buffer */
setup_log_buf(1);

- acpi_initrd_override_copy();
-
- reserve_initrd();
-
reserve_crashkernel();

vsmp_init();

io_delay_init();

- /*
- * Parse the ACPI tables for possible boot-time SMP configuration.
- */
- acpi_boot_table_init();
-
- early_acpi_boot_init();
-
- early_initmem_init();
initmem_init();
memblock_find_dma_reserve();

Subject: [tip:x86/mm] x86, mm: Add comments for step_size shift

Commit-ID: 7d5a256fc953dd80a4eb9a1870607ec991d23ec2
Gitweb: http://git.kernel.org/tip/7d5a256fc953dd80a4eb9a1870607ec991d23ec2
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:03:07 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:05:32 -0700

x86, mm: Add comments for step_size shift

As requested by hpa, add comments explaining why we chose 5 as the
step size shift.

Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/mm/init.c | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 3c21f16..5f38e72 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -395,8 +395,23 @@ static unsigned long __init init_range_memory_mapping(
return mapped_ram_size;
}

-/* (PUD_SHIFT-PMD_SHIFT)/2 */
-#define STEP_SIZE_SHIFT 5
+static unsigned long __init get_new_step_size(unsigned long step_size)
+{
+ /*
+ * initial mapped size is PMD_SIZE, aka 2M.
+ * We can not set step_size to be PUD_SIZE aka 1G yet.
+ * In worse case, when 1G is cross the 1G boundary, and
+ * PG_LEVEL_2M is not set, we will need 1+1+512 pages (aka 2M + 8k)
+ * to map 1G range with PTE. Use 5 as shift for now.
+ */
+ unsigned long new_step_size = step_size << 5;
+
+ if (new_step_size > step_size)
+ step_size = new_step_size;
+
+ return step_size;
+}
+
void __init init_mem_mapping(void)
{
unsigned long end, real_end, start, last_start;
@@ -445,7 +460,7 @@ void __init init_mem_mapping(void)
min_pfn_mapped = last_start >> PAGE_SHIFT;
/* only increase step_size after big range get mapped */
if (new_mapped_ram_size > mapped_ram_size)
- step_size <<= STEP_SIZE_SHIFT;
+ step_size = get_new_step_size(step_size);
mapped_ram_size += new_mapped_ram_size;
}

Subject: [tip:x86/mm] x86, mm: Make init_mem_mapping be able to be called several times

Commit-ID: ae4ffbb606770c7918e627e36c84b627250b1dbb
Gitweb: http://git.kernel.org/tip/ae4ffbb606770c7918e627e36c84b627250b1dbb
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:03:08 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:05:43 -0700

x86, mm: Make init_mem_mapping be able to be called several times

Prepare to put page tables on local nodes.

Move the call of init_mem_mapping() into early_initmem_init().

Rework alloc_low_pages() to allocate pagetable pages in the following
order: BRK, local node, low range

Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/kernel/setup.c | 1 -
arch/x86/mm/init.c | 100 ++++++++++++++++++++++++++---------------
arch/x86/mm/numa.c | 24 ++++++++++
4 files changed, 88 insertions(+), 39 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1e67223..868687c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -621,7 +621,7 @@ static inline int pgd_none(pgd_t pgd)
#ifndef __ASSEMBLY__

extern int direct_gbpages;
-void init_mem_mapping(void);
+void init_mem_mapping(unsigned long begin, unsigned long end);
void early_alloc_pgt_buf(void);

/* local pte updates need not use xchg for locking */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index fd0d5be..9ccbd60 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1132,7 +1132,6 @@ void __init setup_arch(char **cmdline_p)
acpi_boot_table_init();
early_acpi_boot_init();
early_initmem_init();
- init_mem_mapping();
memblock.current_limit = get_max_mapped();
early_trap_pf_init();

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 5f38e72..9ff71ff 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -24,7 +24,10 @@ static unsigned long __initdata pgt_buf_start;
static unsigned long __initdata pgt_buf_end;
static unsigned long __initdata pgt_buf_top;

-static unsigned long min_pfn_mapped;
+static unsigned long low_min_pfn_mapped;
+static unsigned long low_max_pfn_mapped;
+static unsigned long local_min_pfn_mapped;
+static unsigned long local_max_pfn_mapped;

static bool __initdata can_use_brk_pgt = true;

@@ -52,10 +55,17 @@ __ref void *alloc_low_pages(unsigned int num)

if ((pgt_buf_end + num) > pgt_buf_top || !can_use_brk_pgt) {
unsigned long ret;
- if (min_pfn_mapped >= max_pfn_mapped)
- panic("alloc_low_page: ran out of memory");
- ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
- max_pfn_mapped << PAGE_SHIFT,
+ if (local_min_pfn_mapped >= local_max_pfn_mapped) {
+ if (low_min_pfn_mapped >= low_max_pfn_mapped)
+ panic("alloc_low_page: ran out of memory");
+ ret = memblock_find_in_range(
+ low_min_pfn_mapped << PAGE_SHIFT,
+ low_max_pfn_mapped << PAGE_SHIFT,
+ PAGE_SIZE * num , PAGE_SIZE);
+ } else
+ ret = memblock_find_in_range(
+ local_min_pfn_mapped << PAGE_SHIFT,
+ local_max_pfn_mapped << PAGE_SHIFT,
PAGE_SIZE * num , PAGE_SIZE);
if (!ret)
panic("alloc_low_page: can not alloc memory");
@@ -412,67 +422,88 @@ static unsigned long __init get_new_step_size(unsigned long step_size)
return step_size;
}

-void __init init_mem_mapping(void)
+void __init init_mem_mapping(unsigned long begin, unsigned long end)
{
- unsigned long end, real_end, start, last_start;
+ unsigned long real_end, start, last_start;
unsigned long step_size;
unsigned long addr;
unsigned long mapped_ram_size = 0;
unsigned long new_mapped_ram_size;
+ bool is_low = false;
+
+ if (!begin) {
+ probe_page_size_mask();
+ /* the ISA range is always mapped regardless of memory holes */
+ init_memory_mapping(0, ISA_END_ADDRESS);
+ begin = ISA_END_ADDRESS;
+ is_low = true;
+ }

- probe_page_size_mask();
-
-#ifdef CONFIG_X86_64
- end = max_pfn << PAGE_SHIFT;
-#else
- end = max_low_pfn << PAGE_SHIFT;
-#endif
-
- /* the ISA range is always mapped regardless of memory holes */
- init_memory_mapping(0, ISA_END_ADDRESS);
+ if (begin >= end)
+ return;

/* xen has big range in reserved near end of ram, skip it at first.*/
- addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE, PMD_SIZE);
+ addr = memblock_find_in_range(begin, end, PMD_SIZE, PMD_SIZE);
real_end = addr + PMD_SIZE;

/* step_size need to be small so pgt_buf from BRK could cover it */
step_size = PMD_SIZE;
- max_pfn_mapped = 0; /* will get exact value next */
- min_pfn_mapped = real_end >> PAGE_SHIFT;
+ local_max_pfn_mapped = begin >> PAGE_SHIFT;
+ local_min_pfn_mapped = real_end >> PAGE_SHIFT;
last_start = start = real_end;

/*
- * We start from the top (end of memory) and go to the bottom.
- * The memblock_find_in_range() gets us a block of RAM from the
- * end of RAM in [min_pfn_mapped, max_pfn_mapped) used as new pages
- * for page table.
+ * alloc_low_pages() will allocate pagetable pages in the following
+ * order:
+ * BRK, local node, low range
+ *
+ * That means it will first use up all the BRK memory, then try to get
+ * us a block of RAM from [local_min_pfn_mapped, local_max_pfn_mapped)
+ * used as new pagetable pages. If no memory on the local node has
+ * been mapped, it will allocate memory from
+ * [low_min_pfn_mapped, low_max_pfn_mapped).
*/
- while (last_start > ISA_END_ADDRESS) {
+ while (last_start > begin) {
if (last_start > step_size) {
start = round_down(last_start - 1, step_size);
- if (start < ISA_END_ADDRESS)
- start = ISA_END_ADDRESS;
+ if (start < begin)
+ start = begin;
} else
- start = ISA_END_ADDRESS;
+ start = begin;
new_mapped_ram_size = init_range_memory_mapping(start,
last_start);
+ if ((last_start >> PAGE_SHIFT) > local_max_pfn_mapped)
+ local_max_pfn_mapped = last_start >> PAGE_SHIFT;
+ local_min_pfn_mapped = start >> PAGE_SHIFT;
last_start = start;
- min_pfn_mapped = last_start >> PAGE_SHIFT;
/* only increase step_size after big range get mapped */
if (new_mapped_ram_size > mapped_ram_size)
step_size = get_new_step_size(step_size);
mapped_ram_size += new_mapped_ram_size;
}

- if (real_end < end)
+ if (real_end < end) {
init_range_memory_mapping(real_end, end);
+ if ((end >> PAGE_SHIFT) > local_max_pfn_mapped)
+ local_max_pfn_mapped = end >> PAGE_SHIFT;
+ }

+ if (is_low) {
+ low_min_pfn_mapped = local_min_pfn_mapped;
+ low_max_pfn_mapped = local_max_pfn_mapped;
+ }
+}
+
+#ifndef CONFIG_NUMA
+void __init early_initmem_init(void)
+{
#ifdef CONFIG_X86_64
- if (max_pfn > max_low_pfn) {
- /* can we preseve max_low_pfn ?*/
+ init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+ if (max_pfn > max_low_pfn)
max_low_pfn = max_pfn;
}
#else
+ init_mem_mapping(0, max_low_pfn << PAGE_SHIFT);
early_ioremap_page_table_range_init();
#endif

@@ -481,11 +512,6 @@ void __init init_mem_mapping(void)

early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
}
-
-#ifndef CONFIG_NUMA
-void __init early_initmem_init(void)
-{
-}
#endif

/*
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 7d76936..9b18ee8 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -17,8 +17,10 @@
#include <asm/dma.h>
#include <asm/acpi.h>
#include <asm/amd_nb.h>
+#include <asm/tlbflush.h>

#include "numa_internal.h"
+#include "mm_internal.h"

int __initdata numa_off;
nodemask_t numa_nodes_parsed __initdata;
@@ -665,9 +667,31 @@ static void __init early_x86_numa_init(void)
numa_init(dummy_numa_init);
}

+#ifdef CONFIG_X86_64
+static void __init early_x86_numa_init_mapping(void)
+{
+ init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+ if (max_pfn > max_low_pfn)
+ max_low_pfn = max_pfn;
+}
+#else
+static void __init early_x86_numa_init_mapping(void)
+{
+ init_mem_mapping(0, max_low_pfn << PAGE_SHIFT);
+ early_ioremap_page_table_range_init();
+}
+#endif
+
void __init early_initmem_init(void)
{
early_x86_numa_init();
+
+ early_x86_numa_init_mapping();
+
+ load_cr3(swapper_pg_dir);
+ __flush_tlb_all();
+
+ early_memtest(0, max_pfn_mapped<<PAGE_SHIFT);
}

void __init x86_numa_init(void)

Subject: [tip:x86/mm] x86, mm, numa: Put pagetable on local node ram for 64bit

Commit-ID: 5f02a5e6ca366be44064463f25b6f4cc4468a197
Gitweb: http://git.kernel.org/tip/5f02a5e6ca366be44064463f25b6f4cc4468a197
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:03:09 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:05:49 -0700

x86, mm, numa: Put pagetable on local node ram for 64bit

If a node with ram is hotpluggable, the memory for the local node's
page tables and vmemmap should be on that node's ram.

This patch is some kind of refreshment of
| commit 1411e0ec3123ae4c4ead6bfc9fe3ee5a3ae5c327
| Date: Mon Dec 27 16:48:17 2010 -0800
|
| x86-64, numa: Put pgtable to local node memory
which was reverted before.

We have reason to reintroduce it: it improves performance with
memory hotplug.

Call init_mem_mapping() in early_initmem_init() for each node.
alloc_low_pages() will allocate pagetable pages in the following
order:
BRK, local node, low range

So page tables will end up in the low range or on local nodes.
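The fallback order described above can be sketched as a small policy function; pick_pgt_source() and struct pgt_state are purely illustrative names, not kernel API, and the pfn window fields mimic the low_*/local_* bounds used by alloc_low_pages() in the previous patch:

```c
#include <assert.h>
#include <string.h>

/* Illustrative sketch (not kernel code) of the allocation-order
 * policy: consume the BRK pagetable buffer first, then memory
 * already mapped on the local node, then the low range mapped
 * early, and give up only when all three are exhausted. */
struct pgt_state {
	unsigned long brk_avail;		/* pages left in the BRK buffer */
	unsigned long local_min, local_max;	/* mapped local-node pfn window */
	unsigned long low_min, low_max;		/* mapped low-range pfn window */
};

static const char *pick_pgt_source(struct pgt_state *s, unsigned int num)
{
	if (s->brk_avail >= num) {
		s->brk_avail -= num;
		return "brk";
	}
	if (s->local_min < s->local_max)	/* local node has mapped ram */
		return "local";
	if (s->low_min < s->low_max)		/* fall back to low range */
		return "low";
	return "panic";		/* alloc_low_page: ran out of memory */
}
```

With one BRK page left, the first request is served from BRK, the next from the local-node window, and only after that window collapses does the policy fall back to the low range.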

Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/mm/numa.c | 34 +++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 9b18ee8..5adf803 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -670,7 +670,39 @@ static void __init early_x86_numa_init(void)
#ifdef CONFIG_X86_64
static void __init early_x86_numa_init_mapping(void)
{
- init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+ unsigned long last_start = 0, last_end = 0;
+ struct numa_meminfo *mi = &numa_meminfo;
+ unsigned long start, end;
+ int last_nid = -1;
+ int i, nid;
+
+ for (i = 0; i < mi->nr_blks; i++) {
+ nid = mi->blk[i].nid;
+ start = mi->blk[i].start;
+ end = mi->blk[i].end;
+
+ if (last_nid == nid) {
+ last_end = end;
+ continue;
+ }
+
+ /* other nid now */
+ if (last_nid >= 0) {
+ printk(KERN_DEBUG "Node %d: [mem %#016lx-%#016lx]\n",
+ last_nid, last_start, last_end - 1);
+ init_mem_mapping(last_start, last_end);
+ }
+
+ /* for next nid */
+ last_nid = nid;
+ last_start = start;
+ last_end = end;
+ }
+ /* last one */
+ printk(KERN_DEBUG "Node %d: [mem %#016lx-%#016lx]\n",
+ last_nid, last_start, last_end - 1);
+ init_mem_mapping(last_start, last_end);
+
if (max_pfn > max_low_pfn)
max_low_pfn = max_pfn;
}
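The merge loop added to early_x86_numa_init_mapping() above coalesces consecutive numa_meminfo blocks with the same nid before mapping each run. A stand-alone sketch of that coalescing step (merge_ranges() and struct blk are hypothetical names; the kernel version calls init_mem_mapping() instead of collecting the runs):

```c
#include <assert.h>

/* Simplified model (not kernel code) of the per-node merge loop:
 * consecutive blocks with the same nid are coalesced into a single
 * [start, end) run; each flushed run is what the kernel would hand
 * to init_mem_mapping(). */
struct blk { unsigned long start, end; int nid; };

static int merge_ranges(const struct blk *mi, int nr, struct blk *out)
{
	unsigned long last_start = 0, last_end = 0;
	int last_nid = -1, n = 0;

	for (int i = 0; i < nr; i++) {
		if (mi[i].nid == last_nid) {
			last_end = mi[i].end;	/* extend the current run */
			continue;
		}
		if (last_nid >= 0)		/* other nid now: flush run */
			out[n++] = (struct blk){ last_start, last_end, last_nid };
		last_nid = mi[i].nid;		/* start a run for next nid */
		last_start = mi[i].start;
		last_end = mi[i].end;
	}
	if (last_nid >= 0)			/* flush the last run */
		out[n++] = (struct blk){ last_start, last_end, last_nid };
	return n;
}
```

Two adjacent node-0 blocks followed by a node-1 block collapse into two runs, matching the "last one" flush after the loop in the patch.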

Subject: [tip:x86/mm] x86, mm, numa: Move two functions calling on successful path later

Commit-ID: f5127d18677d45bdd17bb3d34e21c2a3f6b0eef6
Gitweb: http://git.kernel.org/tip/f5127d18677d45bdd17bb3d34e21c2a3f6b0eef6
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:02:57 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:04:53 -0700

x86, mm, numa: Move two functions calling on successful path later

We need to have numa info ready before init_mem_mappingi(), so that we
can call init_mem_mapping per node, and alse trim node memory ranges to
big alignment.

Currently, parsing numa info needs to allocate buffers, so it has to be
called after init_mem_mapping(). So try to split the numa info parsing
procedure into two steps:
- The first step will be called before init_mem_mapping(), and it
must not need to allocate buffers.
- The second step will contain all the buffer-related code and be
executed later.

In the end we will have early_initmem_init() and initmem_init().

This patch implements only the first step.

setup_node_data() and numa_init_array() are only called on the successful
path, so we can move these two calls into x86_numa_init(). That will also
make numa_init() smaller and more readable.

-v2: remove the online_node_map clearing in numa_init(), as it is only
set in setup_node_data() at the end of the successful path.

Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/mm/numa.c | 69 ++++++++++++++++++++++++++++++------------------------
1 file changed, 39 insertions(+), 30 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index a71c4e2..07ae800 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -477,7 +477,7 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
static int __init numa_register_memblks(struct numa_meminfo *mi)
{
unsigned long uninitialized_var(pfn_align);
- int i, nid;
+ int i;

/* Account for nodes with cpus and no memory */
node_possible_map = numa_nodes_parsed;
@@ -506,24 +506,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
if (!numa_meminfo_cover_memory(mi))
return -EINVAL;

- /* Finally register nodes. */
- for_each_node_mask(nid, node_possible_map) {
- u64 start = PFN_PHYS(max_pfn);
- u64 end = 0;
-
- for (i = 0; i < mi->nr_blks; i++) {
- if (nid != mi->blk[i].nid)
- continue;
- start = min(mi->blk[i].start, start);
- end = max(mi->blk[i].end, end);
- }
-
- if (start < end)
- setup_node_data(nid, start, end);
- }
-
- /* Dump memblock with node info and return. */
- memblock_dump_all();
return 0;
}

@@ -559,7 +541,6 @@ static int __init numa_init(int (*init_func)(void))

nodes_clear(numa_nodes_parsed);
nodes_clear(node_possible_map);
- nodes_clear(node_online_map);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
WARN_ON(memblock_set_node(0, ULLONG_MAX, MAX_NUMNODES));
numa_reset_distance();
@@ -577,15 +558,6 @@ static int __init numa_init(int (*init_func)(void))
if (ret < 0)
return ret;

- for (i = 0; i < nr_cpu_ids; i++) {
- int nid = early_cpu_to_node(i);
-
- if (nid == NUMA_NO_NODE)
- continue;
- if (!node_online(nid))
- numa_clear_node(i);
- }
- numa_init_array();
return 0;
}

@@ -618,7 +590,7 @@ static int __init dummy_numa_init(void)
* last fallback is dummy single node config encomapssing whole memory and
* never fails.
*/
-void __init x86_numa_init(void)
+static void __init early_x86_numa_init(void)
{
if (!numa_off) {
#ifdef CONFIG_X86_NUMAQ
@@ -638,6 +610,43 @@ void __init x86_numa_init(void)
numa_init(dummy_numa_init);
}

+void __init x86_numa_init(void)
+{
+ int i, nid;
+ struct numa_meminfo *mi = &numa_meminfo;
+
+ early_x86_numa_init();
+
+ /* Finally register nodes. */
+ for_each_node_mask(nid, node_possible_map) {
+ u64 start = PFN_PHYS(max_pfn);
+ u64 end = 0;
+
+ for (i = 0; i < mi->nr_blks; i++) {
+ if (nid != mi->blk[i].nid)
+ continue;
+ start = min(mi->blk[i].start, start);
+ end = max(mi->blk[i].end, end);
+ }
+
+ if (start < end)
+ setup_node_data(nid, start, end); /* online is set */
+ }
+
+ /* Dump memblock with node info */
+ memblock_dump_all();
+
+ for (i = 0; i < nr_cpu_ids; i++) {
+ int nid = early_cpu_to_node(i);
+
+ if (nid == NUMA_NO_NODE)
+ continue;
+ if (!node_online(nid))
+ numa_clear_node(i);
+ }
+ numa_init_array();
+}
+
static __init int find_near_online_node(int node)
{
int n, val;

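The "Finally register nodes" loop moved into x86_numa_init() above computes, per node, the covering span over all of that node's meminfo blocks. A hedged userspace sketch of that span computation (node_span and struct numa_blk are illustrative names):

```c
#include <assert.h>

struct numa_blk { int nid; unsigned long long start, end; };

/* For one node, the covering [start, end) span is the min start and max
 * end over all of that node's blocks.  start is seeded with the highest
 * physical address (PFN_PHYS(max_pfn) in the kernel) and end with 0, so
 * a node with no memory yields start >= end and the kernel skips
 * setup_node_data() for it.  Returns 1 if the node has memory. */
static int node_span(const struct numa_blk *blk, int nr, int nid,
                     unsigned long long max_phys,
                     unsigned long long *start, unsigned long long *end)
{
    int i;

    *start = max_phys;
    *end = 0;
    for (i = 0; i < nr; i++) {
        if (blk[i].nid != nid)
            continue;
        if (blk[i].start < *start)
            *start = blk[i].start;
        if (blk[i].end > *end)
            *end = blk[i].end;
    }
    return *start < *end;
}
```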
Subject: [tip:x86/mm] x86, ACPI: Split acpi_initrd_override() into find/copy two steps

Commit-ID: 29206daa5831dc5b435a06387fd702875401c6bd
Gitweb: http://git.kernel.org/tip/29206daa5831dc5b435a06387fd702875401c6bd
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:02:53 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:04:01 -0700

x86, ACPI: Split acpi_initrd_override() into find/copy two steps

To parse SRAT before memblock starts to work, we need to move the acpi table
probing procedure earlier. But the acpi_initrd_table_override procedure must
be executed before acpi table probing. So we need to move it earlier too,
which means moving the acpi_initrd_table_override procedure to before memblock
starts to work.

But the acpi_initrd_table_override procedure needs memblock to allocate buffers
for ACPI tables. To solve this problem, we need to split the acpi_initrd_override()
procedure into two steps: finding and copying.
Finding should be done as early as possible; copying should happen after memblock is ready.

Currently, the acpi_initrd_table_override procedure is executed after
init_mem_mapping() and relocate_initrd(), so it can scan the initrd and copy
acpi tables using kernel virtual addresses of the initrd.

Once we split it into finding and copying steps, it could be done as
follows:

Finding could be done in head_32.S and head64.c, just like the early microcode
scanning. In head_32.S, the CPU is in 32bit flat mode, so we don't need to set
up a page table to access the initrd. In head64.c, the #PF-set-up page table
lets us access the initrd through kernel low mapping addresses.

Copying needs to be done just after memblock is ready, because it needs
memblock to allocate buffers for the new acpi tables.
Also it should be done before probing acpi tables, and we need early_ioremap()
to access the source and target ranges, as init_mem_mapping() has not been called yet.

While a dummy version of acpi_initrd_override() was defined when
!CONFIG_ACPI_INITRD_TABLE_OVERRIDE, the prototype and dummy version were
conditionalized inside CONFIG_ACPI. This forced setup_arch() to have its own
#ifdefs around acpi_initrd_override(), as otherwise the build would fail when
!CONFIG_ACPI. Move the prototypes and dummy implementations of the newly
split functions out of the CONFIG_ACPI block in acpi.h so that we can throw
away the #ifdefs from its users.

-v2: Split one patch out according to tj.
Also don't pass table_nr around.
-v3: Add tj's changelog about moving things below the #ifdef in acpi.h to
avoid #ifdefs in setup.c.

Signed-off-by: Yinghai <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Acked-by: Tejun Heo <[email protected]>
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/kernel/setup.c | 6 +++---
drivers/acpi/osl.c | 18 +++++++++++++-----
include/linux/acpi.h | 16 ++++++++--------
3 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 6ca5f2c..42f584c 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1119,9 +1119,9 @@ void __init setup_arch(char **cmdline_p)

reserve_initrd();

-#if defined(CONFIG_ACPI) && defined(CONFIG_BLK_DEV_INITRD)
- acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start);
-#endif
+ acpi_initrd_override_find((void *)initrd_start,
+ initrd_end - initrd_start);
+ acpi_initrd_override_copy();

reserve_crashkernel();

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 53dd490..6ab6c54 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -572,14 +572,13 @@ static const char * const table_sigs[] = {
#define ACPI_OVERRIDE_TABLES 64
static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];

-void __init acpi_initrd_override(void *data, size_t size)
+void __init acpi_initrd_override_find(void *data, size_t size)
{
- int sig, no, table_nr = 0, total_offset = 0;
+ int sig, no, table_nr = 0;
long offset = 0;
struct acpi_table_header *table;
char cpio_path[32] = "kernel/firmware/acpi/";
struct cpio_data file;
- char *p;

if (data == NULL || size == 0)
return;
@@ -620,7 +619,14 @@ void __init acpi_initrd_override(void *data, size_t size)
acpi_initrd_files[table_nr].size = file.size;
table_nr++;
}
- if (table_nr == 0)
+}
+
+void __init acpi_initrd_override_copy(void)
+{
+ int no, total_offset = 0;
+ char *p;
+
+ if (!all_tables_size)
return;

/* under 4G at first, then above 4G */
@@ -652,9 +658,11 @@ void __init acpi_initrd_override(void *data, size_t size)
* tables at one time, we will hit the limit. So we need to map tables
* one by one during copying.
*/
- for (no = 0; no < table_nr; no++) {
+ for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
phys_addr_t size = acpi_initrd_files[no].size;

+ if (!size)
+ break;
p = early_ioremap(acpi_tables_addr + total_offset, size);
memcpy(p, acpi_initrd_files[no].data, size);
early_iounmap(p, size);
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 17b5b59..8dd917b 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -79,14 +79,6 @@ typedef int (*acpi_tbl_table_handler)(struct acpi_table_header *table);
typedef int (*acpi_tbl_entry_handler)(struct acpi_subtable_header *header,
const unsigned long end);

-#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
-void acpi_initrd_override(void *data, size_t size);
-#else
-static inline void acpi_initrd_override(void *data, size_t size)
-{
-}
-#endif
-
char * __acpi_map_table (unsigned long phys_addr, unsigned long size);
void __acpi_unmap_table(char *map, unsigned long size);
int early_acpi_boot_init(void);
@@ -476,6 +468,14 @@ static inline bool acpi_driver_match_device(struct device *dev,

#endif /* !CONFIG_ACPI */

+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void acpi_initrd_override_find(void *data, size_t size);
+void acpi_initrd_override_copy(void);
+#else
+static inline void acpi_initrd_override_find(void *data, size_t size) { }
+static inline void acpi_initrd_override_copy(void) { }
+#endif
+
#ifdef CONFIG_ACPI
void acpi_os_set_prepare_sleep(int (*func)(u8 sleep_state,
u32 pm1a_ctrl, u32 pm1b_ctrl));

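One detail of the split above: the copy step no longer receives table_nr from the find step. Instead it walks the fixed-size acpi_initrd_files[] array and stops at the first empty slot, which is why the loop bound changed to ACPI_OVERRIDE_TABLES with an `if (!size) break;`. A small sketch of that sentinel-terminated iteration (struct cpio_data_like and count_found_tables are illustrative stand-ins):

```c
#include <assert.h>
#include <stddef.h>

#define ACPI_OVERRIDE_TABLES 64

struct cpio_data_like { const void *data; size_t size; };

/* Walk the find step's output array and stop at the first entry with
 * size == 0, as acpi_initrd_override_copy() does after the split.
 * Returns how many tables the copy step would process. */
static int count_found_tables(const struct cpio_data_like *files)
{
    int no;

    for (no = 0; no < ACPI_OVERRIDE_TABLES; no++)
        if (!files[no].size)
            break;
    return no;
}
```

This works because the array is static __initdata and therefore zero-initialized; the find step fills entries densely from index 0.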
Subject: [tip:x86/mm] x86, microcode: Use common get_ramdisk_{image|size}()

Commit-ID: a795ab2d9c2113c63d2c9a0677012db13e746121
Gitweb: http://git.kernel.org/tip/a795ab2d9c2113c63d2c9a0677012db13e746121
Author: Yinghai Lu <[email protected]>
AuthorDate: Thu, 13 Jun 2013 21:02:49 +0800
Committer: H. Peter Anvin <[email protected]>
CommitDate: Fri, 14 Jun 2013 14:03:30 -0700

x86, microcode: Use common get_ramdisk_{image|size}()

In patch 1, we changed get_ramdisk_image() and get_ramdisk_size()
to be global, so we can use them instead of reading the global variable
boot_params directly.

We need this to get the correct ramdisk address for a 64bit bzImage
whose initrd can be loaded above 4G by kexec-tools.

-v2: fix one typo found by Tang Chen

Signed-off-by: Yinghai Lu <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Cc: Fenghua Yu <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
Signed-off-by: H. Peter Anvin <[email protected]>
---
arch/x86/kernel/microcode_intel_early.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/microcode_intel_early.c b/arch/x86/kernel/microcode_intel_early.c
index 2e9e128..54575a9 100644
--- a/arch/x86/kernel/microcode_intel_early.c
+++ b/arch/x86/kernel/microcode_intel_early.c
@@ -743,8 +743,8 @@ load_ucode_intel_bsp(void)
struct boot_params *boot_params_p;

boot_params_p = (struct boot_params *)__pa_nodebug(&boot_params);
- ramdisk_image = boot_params_p->hdr.ramdisk_image;
- ramdisk_size = boot_params_p->hdr.ramdisk_size;
+ ramdisk_image = get_ramdisk_image(boot_params_p);
+ ramdisk_size = get_ramdisk_size(boot_params_p);
initrd_start_early = ramdisk_image;
initrd_end_early = initrd_start_early + ramdisk_size;

@@ -753,8 +753,8 @@ load_ucode_intel_bsp(void)
(unsigned long *)__pa_nodebug(&mc_saved_in_initrd),
initrd_start_early, initrd_end_early, &uci);
#else
- ramdisk_image = boot_params.hdr.ramdisk_image;
- ramdisk_size = boot_params.hdr.ramdisk_size;
+ ramdisk_image = get_ramdisk_image(&boot_params);
+ ramdisk_size = get_ramdisk_size(&boot_params);
initrd_start_early = ramdisk_image + PAGE_OFFSET;
initrd_end_early = initrd_start_early + ramdisk_size;

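The reason get_ramdisk_image() is needed at all: the legacy boot header field for the ramdisk address is only 32 bits, so an initrd loaded above 4G carries its high bits in a separate extended boot_params field (ext_ramdisk_image in struct boot_params), and the helper combines the two. A hedged sketch of that combination (ramdisk_image_addr is an illustrative name):

```c
#include <assert.h>
#include <stdint.h>

/* Assemble the full 64-bit ramdisk physical address from the 32-bit
 * legacy header field (low half) and the extended field (high half),
 * the way get_ramdisk_image() does. */
static uint64_t ramdisk_image_addr(uint32_t hdr_lo, uint32_t ext_hi)
{
    return (uint64_t)hdr_lo | ((uint64_t)ext_hi << 32);
}
```

Reading boot_params.hdr.ramdisk_image directly, as the microcode loader did before this patch, would silently drop the high half for an initrd placed above 4G.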
2013-06-17 21:04:34

by Tejun Heo

Subject: Re: [Part1 PATCH v5 03/22] x86, ACPI, mm: Kill max_low_pfn_mapped

Hello,

On Thu, Jun 13, 2013 at 09:02:50PM +0800, Tang Chen wrote:
> From: Yinghai Lu <[email protected]>
>
> Now we have pfn_mapped[] array, and max_low_pfn_mapped should not
> be used anymore. Users should use pfn_mapped[] or just
> 1UL<<(32-PAGE_SHIFT) instead.
>
> The only user of max_low_pfn_mapped is ACPI_INITRD_TABLE_OVERRIDE.
> We could change to use 1U<<(32_PAGE_SHIFT) with it, aka under 4G.

^ typo

...
> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
> index e721863..93e3194 100644
> --- a/drivers/acpi/osl.c
> +++ b/drivers/acpi/osl.c
> @@ -624,9 +624,9 @@ void __init acpi_initrd_override(void *data, size_t size)
> if (table_nr == 0)
> return;
>
> - acpi_tables_addr =
> - memblock_find_in_range(0, max_low_pfn_mapped << PAGE_SHIFT,
> - all_tables_size, PAGE_SIZE);
> + /* under 4G at first, then above 4G */
> + acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
> + all_tables_size, PAGE_SIZE);

No bigge, but why (1ULL << 32) - 1? Shouldn't it be just 1ULL << 32?
memblock deals with [@start, @end) areas, right?

Other than that,

Acked-by: Tejun Heo <[email protected]>

Thanks.

--
tejun

2013-06-17 21:06:49

by Tejun Heo

Subject: Re: [Part1 PATCH v5 04/22] x86, ACPI: Search buffer above 4GB in a second try for acpi initrd table override

On Thu, Jun 13, 2013 at 09:02:51PM +0800, Tang Chen wrote:
> From: Yinghai Lu <[email protected]>
>
> Now we only search buffer for new acpi tables in initrd under
> 4GB. In some case, like user use memmap to exclude all low ram,
> we may not find range for it under 4GB. So do second try to
> search for buffer above 4GB.
>
> Since later accessing to the tables is using early_ioremap(),

Maybe "later accesses to the tables" would read better?

> using memory above 4GB is OK.
>
> Signed-off-by: Yinghai Lu <[email protected]>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Cc: [email protected]
> Tested-by: Thomas Renninger <[email protected]>
> Reviewed-by: Tang Chen <[email protected]>
> Tested-by: Tang Chen <[email protected]>

Acked-by: Tejun Heo <[email protected]>

Thanks.

--
tejun

2013-06-17 21:14:01

by Yinghai Lu

Subject: Re: [Part1 PATCH v5 03/22] x86, ACPI, mm: Kill max_low_pfn_mapped

On Mon, Jun 17, 2013 at 2:04 PM, Tejun Heo <[email protected]> wrote:
> Hello,
>
> On Thu, Jun 13, 2013 at 09:02:50PM +0800, Tang Chen wrote:
>> From: Yinghai Lu <[email protected]>
>>
>> Now we have pfn_mapped[] array, and max_low_pfn_mapped should not
>> be used anymore. Users should use pfn_mapped[] or just
>> 1UL<<(32-PAGE_SHIFT) instead.
>>
>> The only user of max_low_pfn_mapped is ACPI_INITRD_TABLE_OVERRIDE.
>> We could change to use 1U<<(32_PAGE_SHIFT) with it, aka under 4G.
>
> ^ typo

ok.

>
> ...
>> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
>> index e721863..93e3194 100644
>> --- a/drivers/acpi/osl.c
>> +++ b/drivers/acpi/osl.c
>> @@ -624,9 +624,9 @@ void __init acpi_initrd_override(void *data, size_t size)
>> if (table_nr == 0)
>> return;
>>
>> - acpi_tables_addr =
>> - memblock_find_in_range(0, max_low_pfn_mapped << PAGE_SHIFT,
>> - all_tables_size, PAGE_SIZE);
>> + /* under 4G at first, then above 4G */
>> + acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
>> + all_tables_size, PAGE_SIZE);
>
> No bigge, but why (1ULL << 32) - 1? Shouldn't it be just 1ULL << 32?
> memblock deals with [@start, @end) areas, right?

that is for 32bit, when phys_addr_t is 32bit, in that case
(1ULL<<32) cast to 32bit would be 0.

>
> Other than that,
>
> Acked-by: Tejun Heo <[email protected]>

Thanks

Yinghai

2013-06-17 23:08:18

by Tejun Heo

Subject: Re: [Part1 PATCH v5 03/22] x86, ACPI, mm: Kill max_low_pfn_mapped

On Mon, Jun 17, 2013 at 2:13 PM, Yinghai Lu <[email protected]> wrote:
>> No bigge, but why (1ULL << 32) - 1? Shouldn't it be just 1ULL << 32?
>> memblock deals with [@start, @end) areas, right?
>
> that is for 32bit, when phys_addr_t is 32bit, in that case
> (1ULL<<32) cast to 32bit would be 0.

Right, it'd work the same even after overflowing but yeah, it can be confusing.

Thanks.

--
tejun
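The truncation Yinghai describes can be shown directly: when phys_addr_t is only 32 bits wide, an exclusive limit of 1ULL << 32 collapses to 0, while the inclusive (1ULL << 32) - 1 survives as 0xffffffff. A tiny sketch (phys32 stands in for the implicit conversion to a 32-bit phys_addr_t):

```c
#include <assert.h>
#include <stdint.h>

/* Model the implicit narrowing that happens when a 64-bit constant is
 * passed where a 32-bit phys_addr_t is expected. */
static uint32_t phys32(uint64_t addr)
{
    return (uint32_t)addr;
}
```

Hence the patch passes (1ULL << 32) - 1 to memblock_find_in_range() even though memblock treats the limit as exclusive.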

2013-06-17 23:38:37

by Tejun Heo

Subject: Re: [Part1 PATCH v5 07/22] x86, ACPI: Store override acpi tables phys addr in cpio files info array

On Thu, Jun 13, 2013 at 09:02:54PM +0800, Tang Chen wrote:
> -static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
> +struct file_pos {
> + phys_addr_t data;
> + phys_addr_t size;
> +};

Isn't file_pos too generic as name? Would acpi_initrd_file_pos too
long? Maybe just struct acpi_initrd_file?

Thanks.

--
tejun

2013-06-17 23:40:36

by Yinghai Lu

Subject: Re: [Part1 PATCH v5 07/22] x86, ACPI: Store override acpi tables phys addr in cpio files info array

On Mon, Jun 17, 2013 at 4:38 PM, Tejun Heo <[email protected]> wrote:
> On Thu, Jun 13, 2013 at 09:02:54PM +0800, Tang Chen wrote:
>> -static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
>> +struct file_pos {
>> + phys_addr_t data;
>> + phys_addr_t size;
>> +};
>
> Isn't file_pos too generic as name? Would acpi_initrd_file_pos too
> long? Maybe just struct acpi_initrd_file?

ok, will change to acpi_initrd_file.

Thanks

Yinghai

2013-06-17 23:52:22

by Tejun Heo

Subject: Re: [Part1 PATCH v5 07/22] x86, ACPI: Store override acpi tables phys addr in cpio files info array

On Thu, Jun 13, 2013 at 09:02:54PM +0800, Tang Chen wrote:
> From: Yinghai Lu <[email protected]>
>
> This patch introduces a file_pos struct to store physaddr. And then changes
> acpi_initrd_files[] to file_pos type. Then store physaddr of ACPI tables
> in acpi_initrd_files[].
>
> For finding, we will find ACPI tables with physaddr during 32bit flat mode
> in head_32.S, because at that time we don't need to setup page table to
> access initrd.
>
> For copying, we could use early_ioremap() with physaddr directly before
> memory mapping is set.
>
> To keep 32bit and 64bit platforms consistent, use phys_addr for all.

Also, how about something like the following?

Subject: x86, ACPI: introduce a new struct to store phys_addr of acpi override tables

ACPI initrd override table handling has been recently broken into two
functions - acpi_initrd_override_find() and
acpi_initrd_override_copy(). The former function currently stores the
virtual addresses and sizes of the found override tables in an array
of struct cpio_data for the latter function.

To make NUMA information available earlier during boot,
acpi_initrd_override_find() will be used much earlier - on 32bit, from
head_32.S before linear address translation is set up, which will make
it impossible to use the virtual addresses of the tables.

This patch introduces a new struct - file_pos - which records
phys_addr and size of a memory area, and replaces the cpio_data array
with it so that acpi_initrd_override_find() can record the phys_addrs
of the override tables instead of virtual addresses. This will allow
using the function before the linear address is set up.

acpi_initrd_override_copy() now accesses the override tables using
early_ioremap() on the stored phys_addrs.

--
tejun

2013-06-18 00:07:34

by Tejun Heo

Subject: Re: [Part1 PATCH v5 08/22] x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode

On Thu, Jun 13, 2013 at 09:02:55PM +0800, Tang Chen wrote:
> From: Yinghai Lu <[email protected]>
>
> For finding procedure, it would be easy to access initrd in 32bit flat
> mode, as we don't need to setup page table. That is from head_32.S, and
> microcode updating already use this trick.

It'd be really great if you can give a brief explanation of why this
is happening at the beginning of the commit description so that when
someone lands on this commit later on, [s]he can orient oneself. It
doesn't have to be long. Open with something like,

To make NUMA info available early during boot for memory hotplug
support, acpi_initrd_override_find() needs to be used very early
during boot.

and then continue to describe what's happening. It'll make the commit
a lot more approachable to people who just encountered it.

> This patch does the following:
>
> 1. Change acpi_initrd_override_find to use phys to access global variables.
>
> 2. Pass a bool parameter "is_phys" to acpi_initrd_override_find() because
> we cannot tell if it is a pa or a va through the address itself with
> 32bit. Boot loader could load initrd above max_low_pfn.

Do you mean "from 32bit address boundary"? Maybe "from 4G boundary"
is clearer?

>
> 3. Put table_sigs[] on stack, otherwise it is too messy to change string
> array to physaddr and still keep offset calculating correct. The size is
> about 36x4 bytes, and it is small to settle in stack.
>
> 4. Also rewrite the MACRO INVALID_TABLE to be in a do {...} while(0) loop
> so that it is more readable.

The important part is taking "continue" out of it, right?

> +/*
> + * acpi_initrd_override_find() is called from head_32.S and head64.c.
> + * head_32.S calling path is with 32bit flat mode, so we can access

When called from head_32.S, the CPU is in 32bit flat mode and the
kernel virtual address space isn't available yet.

> + * initrd early without setting pagetable or relocating initrd. For
> + * global variables accessing, we need to use phys address instead of

As initrd is in phys_addr, it can be accessed directly; however,
global variables must be accessed by explicitly obtaining their
physical addresses.

> + * kernel virtual address, try to put table_sigs string array in stack,
> + * so avoid switching for it.

Note that table_sigs array is built on stack to avoid such address
translations while accessing its members.

> + * Also don't call printk as it uses global variables.
> + */
> +void __init acpi_initrd_override_find(void *data, size_t size, bool is_phys)

Thanks.

--
tejun

2013-06-18 00:33:19

by Tejun Heo

Subject: Re: [Part1 PATCH v5 09/22] x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c

On Thu, Jun 13, 2013 at 09:02:56PM +0800, Tang Chen wrote:
> From: Yinghai Lu <[email protected]>

Ditto for the opening. Probably not a must, I suppose, but would be
very nice.

> head64.c could use #PF handler setup page table to access initrd before
> init mem mapping and initrd relocating.
>
> head_32.S could use 32bit flat mode to access initrd before init mem
> mapping initrd relocating.
>
> This patch introduces x86_acpi_override_find(), which is called from
> head_32.S/head64.c, to replace acpi_initrd_override_find(). So that we
> can makes 32bit and 64 bit more consistent.
>
> -v2: use inline function in header file instead according to tj.
> also still need to keep #idef head_32.S to avoid compiling error.
> -v3: need to move down reserve_initrd() after acpi_initrd_override_copy(),
> to make sure we are using right address.
>
> Signed-off-by: Yinghai Lu <[email protected]>
> Cc: Pekka Enberg <[email protected]>
> Cc: Jacob Shin <[email protected]>
> Cc: Rafael J. Wysocki <[email protected]>
> Cc: [email protected]
> Tested-by: Thomas Renninger <[email protected]>
> Reviewed-by: Tang Chen <[email protected]>
> Tested-by: Tang Chen <[email protected]>

Other than that,

Acked-by: Tejun Heo <[email protected]>

Thanks.

--
tejun

2013-06-18 00:53:20

by Tejun Heo

Subject: Re: [Part1 PATCH v5 10/22] x86, mm, numa: Move two functions calling on successful path later

Hello,

Does the subject match the patch content? What two functions? The
patch is separating out the actual registration part so that the
discovery part can happen earlier, right?

> Currently, parsing numa info needs to allocate some buffer and need to be
> called after init_mem_mapping. So try to split parsing numa info procedure
> into two steps:
> - The first step will be called before init_mem_mapping, and it
> should not need allocate buffers.

Document the requirement somewhere in the source code?

> - The second step will cantain all the buffer related code and be
> executed later.
>
> At last we will have early_initmem_init() and initmem_init().

Do you mean "eventually" or "in the end" by "at last"?

> This patch implements only the first step.
>
> setup_node_data() and numa_init_array() are only called for successful
> path, so we can move these two callings to x86_numa_init(). That will also
> make numa_init() smaller and more readable.

I find the description somewhat difficult to follow. :(

> -v2: remove online_node_map clear in numa_init(), as it is only
> set in setup_node_data() at last in successful path.

I don't get this. What prevents specific numa init functions (numaq,
x86_acpi, amd...) from updating node_online_map?

Thanks.

--
tejun

2013-06-18 01:06:04

by Tejun Heo

Subject: Re: [Part1 PATCH v5 11/22] x86, mm, numa: Call numa_meminfo_cover_memory() checking early

On Thu, Jun 13, 2013 at 09:02:58PM +0800, Tang Chen wrote:
> From: Yinghai Lu <[email protected]>
>
> In order to seperate parsing numa info procedure into two steps,
> we need to set memblock nid later, as it could change memblock
> array, and possible doube memblock.memory array which will need
> to allocate buffer.
>
> We do not need to use nid in memblock to find out absent pages.

because...

And please also explain it in the source code with comment including
why the check has to be done early.

> So we can move that numa_meminfo_cover_memory() early.

Maybe "So, we can use the NUMA-unaware absent_pages_in_range() in
numa_meminfo_cover_memory() and call the function before setting nid's
to memblock."

> Also we could change __absent_pages_in_range() to static and use
> absent_pages_in_range() directly.

"As this removes the last user of __absent_pages_in_range(), this
patch also makes the function static."

Thanks.

--
tejun

2013-06-18 01:08:26

by Tejun Heo

Subject: Re: [Part1 PATCH v5 12/22] x86, mm, numa: Move node_map_pfn_alignment() to x86

On Thu, Jun 13, 2013 at 09:02:59PM +0800, Tang Chen wrote:
> From: Yinghai Lu <[email protected]>
>
> Move node_map_pfn_alignment() to arch/x86/mm as there is no
> other user for it.
>
> Will update it to use numa_meminfo instead of memblock.
>
> Signed-off-by: Yinghai Lu <[email protected]>
> Reviewed-by: Tang Chen <[email protected]>
> Tested-by: Tang Chen <[email protected]>

Acked-by: Tejun Heo <[email protected]>

Thanks.

--
tejun

2013-06-18 01:40:51

by Tejun Heo

Subject: Re: [Part1 PATCH v5 13/22] x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment

On Thu, Jun 13, 2013 at 09:03:00PM +0800, Tang Chen wrote:
> From: Yinghai Lu <[email protected]>
>
> We could use numa_meminfo directly instead of memblock nid in
> node_map_pfn_alignment().
>
> So we could do setting memblock nid later and only do it once
> for successful path.
>
> -v2: according to tj, separate moving to another patch.

How about something like,

Subject: x86, mm, NUMA: Use numa_meminfo instead of memblock in node_map_pfn_alignment()

When sparsemem is used and page->flags doesn't have enough space to
carry both the sparsemem section and node ID, NODE_NOT_IN_PAGE_FLAGS
is set and the node is determined from section. This requires that
the NUMA nodes aren't more granular than sparsemem sections.
node_map_pfn_alignment() is used to determine the maximum NUMA
inter-node alignment which can distinguish all nodes to verify the
above condition.

The function currently assumes the NUMA node maps are populated and
sorted and uses for_each_mem_pfn_range() to iterate memory regions.
We want this to happen way earlier to support memory hotplug (maybe
elaborate a bit more here).

This patch updates node_map_pfn_alignment() so that it iterates over
numa_meminfo instead and moves its invocation before memory regions
are registered to memblock and node maps in numa_register_memblks().
This will help memory hotplug (how...) and as a bonus we register
memory regions only if the alignment check succeeds rather than
registering and then failing.

Also, the comment on top of node_map_pfn_alignment() needs to be
updated, right?

Thanks.

--
tejun

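The alignment check Tejun describes can be sketched in userspace: for each boundary between ranges of different nids, find the coarsest power-of-two mask that still separates the two nodes, and accumulate. This is a simplified sketch of node_map_pfn_alignment() iterating a plain array instead of for_each_mem_pfn_range()/numa_meminfo (struct pfn_range and pfn_alignment are illustrative names):

```c
#include <assert.h>

struct pfn_range { unsigned long start, end; int nid; };

/* Returns the maximum inter-node alignment, in pages, that can still
 * distinguish all nodes: sparsemem sections must not be coarser than
 * this when NODE_NOT_IN_PAGE_FLAGS derives the node from the section. */
static unsigned long pfn_alignment(const struct pfn_range *r, int nr)
{
    unsigned long accl_mask = 0, last_end = 0, mask;
    int last_nid = -1, i;

    for (i = 0; i < nr; i++) {
        unsigned long start = r[i].start;

        if (!start || last_nid < 0 || last_nid == r[i].nid) {
            last_nid = r[i].nid;
            last_end = r[i].end;
            continue;
        }
        /* Start with a mask pin-pointing start's lowest set bit, then
         * coarsen it while it can still separate last_end from start. */
        mask = ~((1UL << __builtin_ctzl(start)) - 1);
        while (mask && last_end <= (start & (mask << 1)))
            mask <<= 1;
        accl_mask |= mask;

        last_nid = r[i].nid;
        last_end = r[i].end;
    }
    return ~accl_mask + 1;   /* convert accumulated mask to pages */
}
```

With two nodes meeting at pfn 0x1800, for example, no section larger than 0x800 pages can keep them apart.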
2013-06-18 01:45:27

by Tejun Heo

Subject: Re: [Part1 PATCH v5 14/22] x86, mm, numa: Set memblock nid later

On Thu, Jun 13, 2013 at 09:03:01PM +0800, Tang Chen wrote:
> From: Yinghai Lu <[email protected]>
>
> In order to seperate parsing numa info procedure into two steps,

Short "why" would be nice.

> we need to set memblock nid later because it could change memblock
^
in where?

> array, and possible doube memblock.memory array which will allocate
^
possibly double

> buffer.

which is bad why?

> Only set memblock nid once for successful path.
>
> Also rename numa_register_memblks to numa_check_memblks() after
> moving out code of setting memblock nid.
> @@ -676,6 +669,11 @@ void __init x86_numa_init(void)
>
> early_x86_numa_init();
>
> + for (i = 0; i < mi->nr_blks; i++) {
> + struct numa_memblk *mb = &mi->blk[i];
> + memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
> + }
> +

Can we please have some comments explaining the new ordering
requirements? When reading code, how is one supposed to know that the
ordering of operations is all deliberate and fragile?

Thanks.

--
tejun

2013-06-18 01:58:16

by Tejun Heo

Subject: Re: [Part1 PATCH v5 16/22] x86, mm, numa: Move numa emulation handling down.

On Thu, Jun 13, 2013 at 09:03:03PM +0800, Tang Chen wrote:
> From: Yinghai Lu <[email protected]>
>
> numa_emulation() needs to allocate buffer for new numa_meminfo
> and distance matrix, so execute it later in x86_numa_init().
>
> Also we change the behavoir:
> - before this patch, if user input wrong data in command
> line, it will fall back to next numa probing or disabling
> numa.
> - after this patch, if user input wrong data in command line,
> it will stay with numa info probed from previous probing,
> like ACPI SRAT or amd_numa.
>
> We need to call numa_check_memblks to reject wrong user inputs early
> so that we can keep the original numa_meminfo not changed.

So, this is another very subtle ordering you're adding without any
comment and I'm not sure it even makes sense because the function can
fail after that point.

I'm getting really doubtful about this whole approach of carefully
splitting discovery and registration. It's inherently fragile like
hell and the poor documentation makes it a lot worse. I'm gonna reply
to the head message.

Thanks.

--
tejun

2013-06-18 02:04:07

by Tejun Heo

Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

Hello,

On Thu, Jun 13, 2013 at 09:02:47PM +0800, Tang Chen wrote:
> One commit that tried to parse SRAT early get reverted before v3.9-rc1.
>
> | commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
> | Author: Tang Chen <[email protected]>
> | Date: Fri Feb 22 16:33:44 2013 -0800
> |
> | acpi, memory-hotplug: parse SRAT before memblock is ready
>
> It broke several things, like acpi override and fall back path etc.
>
> This patchset is clean implementation that will parse numa info early.
> 1. keep the acpi table initrd override working by split finding with copying.
> finding is done at head_32.S and head64.c stage,
> in head_32.S, initrd is accessed in 32bit flat mode with phys addr.
> in head64.c, initrd is accessed via kernel low mapping address
> with help of #PF set page table.
> copying is done with early_ioremap just after memblock is setup.
> 2. keep fallback path working. numaq and ACPI and amd_numa and dummy.
> separate initmem_init to two stages.
> early_initmem_init will only extract numa info early into numa_meminfo.
> initmem_init will keep slit and emulation handling.
> 3. keep other old code flow untouched like relocate_initrd and initmem_init.
> early_initmem_init will take old init_mem_mapping position.
> it call early_x86_numa_init and init_mem_mapping for every nodes.
> For 64bit, we avoid having size limit on initrd, as relocate_initrd
> is still after init_mem_mapping for all memory.
> 4. last patch will try to put page table on local node, so that memory
> hotplug will be happy.
>
> In short, early_initmem_init will parse numa info early and call
> init_mem_mapping to set page table for every node's mem.

So, can you please explain why you're doing the above? What are you
trying to achieve in the end and why is this the best approach? This
is all for memory hotplug, right?

I can understand the part where you're move NUMA discovery before
initializations which will get allocated permanent addresses in the
wrong nodes, but trying to do the same with memblock itself is making
the code extremely fragile. It's nasty because there's nothing
apparent which seems to necessitate such ordering. The ordering looks
rather arbitrary but changing the orders will subtly break memory
hotplug support, which is a really bad way to structure the code.

Can't you just move memblock arrays after NUMA init is complete?
That'd be a lot simpler and way more robust than the proposed changes,
no?

Thanks.

--
tejun

2013-06-18 05:44:20

by Tang Chen

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

Hi tj,

On 06/18/2013 10:03 AM, Tejun Heo wrote:
......
>
> So, can you please explain why you're doing the above? What are you
> trying to achieve in the end and why is this the best approach? This
> is all for memory hotplug, right?

Yes, this is all for memory hotplug.

[why]
At early boot time (before parsing SRAT), memblock will allocate memory
for the kernel to use. But that memory could be hotpluggable memory,
because at such an early time we don't know which memory is hotpluggable.
Kernel allocations landing in a hotpluggable range make that range
impossible to hot-remove later. What we are trying to do is prevent
memblock from allocating hotpluggable memory.

[approach]
Parse SRAT earlier before memblock starts to work, because there is a
bit in SRAT specifying which memory is hotpluggable.

I'm not saying this is the best approach. I can also see that this
patch-set touches a lot of boot code. But I think parsing SRAT earlier
is reasonable because this is the only way for now to know which memory
is hotpluggable from firmware.

>
> I can understand the part where you're move NUMA discovery before
> initializations which will get allocated permanent addresses in the
> wrong nodes, but trying to do the same with memblock itself is making
> the code extremely fragile. It's nasty because there's nothing
> apparent which seems to necessitate such ordering. The ordering looks
> rather arbitrary but changing the orders will subtly break memory
> hotplug support, which is a really bad way to structure the code.
>
> Can't you just move memblock arrays after NUMA init is complete?
> That'd be a lot simpler and way more robust than the proposed changes,
> no?

Sorry, I don't quite understand the approach you are suggesting. If we
move memblock arrays, we need to update all the pointers pointing to
the moved memory. How can we do this?

Thanks. :)

2013-06-18 06:22:26

by Yinghai Lu

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 16/22] x86, mm, numa: Move numa emulation handling down.

On Mon, Jun 17, 2013 at 6:58 PM, Tejun Heo <[email protected]> wrote:
> On Thu, Jun 13, 2013 at 09:03:03PM +0800, Tang Chen wrote:
>> From: Yinghai Lu <[email protected]>
>>
>> numa_emulation() needs to allocate buffer for new numa_meminfo
>> and distance matrix, so execute it later in x86_numa_init().
>>
>> Also we change the behavior:
>> - before this patch, if user input wrong data in command
>> line, it will fall back to next numa probing or disabling
>> numa.
>> - after this patch, if user input wrong data in command line,
>> it will stay with numa info probed from previous probing,
>> like ACPI SRAT or amd_numa.
>>
>> We need to call numa_check_memblks to reject wrong user inputs early
>> so that we can keep the original numa_meminfo not changed.
>
> So, this is another very subtle ordering you're adding without any
> comment and I'm not sure it even makes sense because the function can
> fail after that point.

Yes, if it fails, we will stay with the current numa info from firmware.
That looks like the right behavior.

Before this patch, it would fall back to the next numa method: if acpi
srat + user input failed, it would try amd_numa and then apply the user
info on top of that.

>
> I'm getting really doubtful about this whole approach of carefully
> splitting discovery and registration. It's inherently fragile like
> hell and the poor documentation makes it a lot worse. I'm gonna reply
> to the head message.

Maybe the patch alone is not clear enough, but if you look at the final
changed code it should be clearer.

Thanks

Yinghai

2013-06-18 07:13:54

by Yinghai Lu

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 16/22] x86, mm, numa: Move numa emulation handling down.

On Mon, Jun 17, 2013 at 11:22 PM, Yinghai Lu <[email protected]> wrote:
> On Mon, Jun 17, 2013 at 6:58 PM, Tejun Heo <[email protected]> wrote:
>> On Thu, Jun 13, 2013 at 09:03:03PM +0800, Tang Chen wrote:
>>> From: Yinghai Lu <[email protected]>
>>>
>>> numa_emulation() needs to allocate buffer for new numa_meminfo
>>> and distance matrix, so execute it later in x86_numa_init().
>>>
>>> Also we change the behavior:
>>> - before this patch, if user input wrong data in command
>>> line, it will fall back to next numa probing or disabling
>>> numa.
>>> - after this patch, if user input wrong data in command line,
>>> it will stay with numa info probed from previous probing,
>>> like ACPI SRAT or amd_numa.
>>>
>>> We need to call numa_check_memblks to reject wrong user inputs early
>>> so that we can keep the original numa_meminfo not changed.
>>
>> So, this is another very subtle ordering you're adding without any
>> comment and I'm not sure it even makes sense because the function can
>> fail after that point.
>
> Yes, if it fails, we will stay with the current numa info from firmware.
> That looks like the right behavior.
>
> Before this patch, it would fall back to the next numa method: if acpi
> srat + user input failed, it would try amd_numa and then apply the user
> info on top of that.
>
>>
>> I'm getting really doubtful about this whole approach of carefully
>> splitting discovery and registration. It's inherently fragile like
>> hell and the poor documentation makes it a lot worse. I'm gonna reply
>> to the head message.
>
> Maybe the patch alone is not clear enough, but if you look at the final
> changed code it should be clearer.

Updated patches 1-15 according to your review:

git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
for-x86-mm

https://git.kernel.org/cgit/linux/kernel/git/yinghai/linux-yinghai.git/log/?h=for-x86-mm

Yinghai

2013-06-18 17:10:43

by Vasilis Liaskovitis

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

Hi,

On Thu, Jun 13, 2013 at 09:02:47PM +0800, Tang Chen wrote:
> From: Yinghai Lu <[email protected]>
>
> No offence, just rebase and resend the patches from Yinghai to help
> to push this functionality faster.
> Also improve the comments in the patches' log.
>
>
> One commit that tried to parse SRAT early get reverted before v3.9-rc1.
>
> | commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
> | Author: Tang Chen <[email protected]>
> | Date: Fri Feb 22 16:33:44 2013 -0800
> |
> | acpi, memory-hotplug: parse SRAT before memblock is ready
>
> It broke several things, like acpi override and fall back path etc.
>
> This patchset is clean implementation that will parse numa info early.
> 1. keep the acpi table initrd override working by split finding with copying.
> finding is done at head_32.S and head64.c stage,
> in head_32.S, initrd is accessed in 32bit flat mode with phys addr.
> in head64.c, initrd is accessed via kernel low mapping address
> with help of #PF set page table.
> copying is done with early_ioremap just after memblock is setup.
> 2. keep fallback path working. numaq and ACPI and amd_numa and dummy.
> separate initmem_init to two stages.
> early_initmem_init will only extract numa info early into numa_meminfo.
> initmem_init will keep slit and emulation handling.
> 3. keep other old code flow untouched like relocate_initrd and initmem_init.
> early_initmem_init will take old init_mem_mapping position.
> it call early_x86_numa_init and init_mem_mapping for every nodes.
> For 64bit, we avoid having size limit on initrd, as relocate_initrd
> is still after init_mem_mapping for all memory.
> 4. last patch will try to put page table on local node, so that memory
> hotplug will be happy.
>
> In short, early_initmem_init will parse numa info early and call
> init_mem_mapping to set page table for every node's mem.
>
> could be found at:
> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm
>
> and it is based on today's Linus tree.
>

Has this patchset been tested on various numa configs?
I am using linux-next next-20130607 + part1 with qemu/kvm/seabios VMs. The kernel
boots successfully in many numa configs but while trying different memory sizes
for a 2 numa node VM, I noticed that booting does not complete in all cases
(bootup screen appears to hang but there is no output indicating an early panic)

node0 node1 boots
1G 1G yes
1G 2G yes
1G 0.5G yes
3G 2.5G yes
3G 3G yes
4G 0G yes
4G 4G yes
1.5G 1G no
2G 1G no
2G 2G no
2.5G 2G no
2.5G 2.5G no

linux-next next-20130607 boots all of these configs fine.

Looks odd, perhaps I have something wrong in my setup or maybe there is a
seabios/qemu interaction with this patchset. I will update if I find something.

thanks,

- Vasilis

2013-06-18 17:21:36

by Tejun Heo

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

Hey, Tang.

On Tue, Jun 18, 2013 at 01:47:16PM +0800, Tang Chen wrote:
> [approach]
> Parse SRAT earlier before memblock starts to work, because there is a
> bit in SRAT specifying which memory is hotpluggable.
>
> I'm not saying this is the best approach. I can also see that this
> patch-set touches a lot of boot code. But i think parsing SRAT earlier
> is reasonable because this is the only way for now to know which memory
> is hotpluggable from firmware.

Touching a lot of code is not a problem, but it feels like it's trying
to bootstrap itself while walking, and achieves that by carefully
sequencing all operations which may allocate from memblock before NUMA
info is available, without any way to enforce or verify that.

> >Can't you just move memblock arrays after NUMA init is complete?
> >That'd be a lot simpler and way more robust than the proposed changes,
> >no?
>
> Sorry, I don't quite understand the approach you are suggesting. If we
> move memblock arrays, we need to update all the pointers pointing to
> the moved memory. How can we do this ?

So, there are two things involved here - memblock itself and consumers
of memblock, right? I get that the latter shouldn't allocate memory
from memblock before NUMA info is entered into memblock, so please
reorder as necessary *and* make sure memblock complains if something
violates that. Temporary memory areas which are returned are fine.
Just complain if, once boot is complete, there are still memory regions
remaining which were allocated before NUMA info was available. No
need to make booting more painful than it currently is.

As for memblock itself, there's no need to walk carefully around it.
Just let it do its thing and implement
memblock_relocate_to_numa_node_0() or whatever after NUMA information
is available. memblock already does relocate itself whenever it's
expanding the arrays anyway, so implementation should be trivial.

Maybe I'm missing something but having a working memory allocator as
soon as possible is *way* less painful than trying to bootstrap around
it. Allow boot path to allocate memory areas from memblock as soon as
possible but just ensure that none of the ones which may violate the
hotplug requirements is remaining once boot is complete. Temporary
regions won't matter then and the few which need persistent areas can
either be reordered to happen after NUMA init or they can allocate a
new area and move to there after NUMA info is available. Let's please
minimize this walking-and-trying-to-tie-shoestrings-at-the-same-time
thing. It's painful and extremely fragile.

Thanks.

--
tejun

2013-06-18 20:19:16

by Yinghai Lu

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

On Tue, Jun 18, 2013 at 10:10 AM, Vasilis Liaskovitis
<[email protected]> wrote:
>> could be found at:
>> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm
>>
>> and it is based on today's Linus tree.
>>
>
> Has this patchset been tested on various numa configs?
> I am using linux-next next-20130607 + part1 with qemu/kvm/seabios VMs. The kernel
> boots successfully in many numa configs but while trying different memory sizes
> for a 2 numa node VM, I noticed that booting does not complete in all cases
> (bootup screen appears to hang but there is no output indicating an early panic)
>
> node0 node1 boots
> 1G 1G yes
> 1G 2G yes
> 1G 0.5G yes
> 3G 2.5G yes
> 3G 3G yes
> 4G 0G yes
> 4G 4G yes
> 1.5G 1G no
> 2G 1G no
> 2G 2G no
> 2.5G 2G no
> 2.5G 2.5G no
>
> linux-next next-20130607 boots all of these configs fine.
>
> Looks odd, perhaps I have something wrong in my setup or maybe there is a
> seabios/qemu interaction with this patchset. I will update if I find something.

just tried 2g/2g, and it works on qemu-kvm:

early console in setup code
Probing EDD (edd=off to disable)... ok
early console in decompress_kernel
decompress_kernel:
input: [0x2a8e2c2-0x3393991], output: 0x1000000, heap: [0x339b200-0x33a31ff]

Decompressing Linux... xz... Parsing ELF... done.
Booting the kernel.
[ 0.000000] bootconsole [uart0] enabled
[ 0.000000] real_mode_data : phys 0000000000014490
[ 0.000000] real_mode_data : virt ffff880000014490
[ 0.000000] boot_params : init virt ffffffff82f869a0
[ 0.000000] boot_params : phys 0000000002f869a0
[ 0.000000] boot_params : virt ffff880002f869a0
[ 0.000000] boot_command_line : init virt ffffffff82e53020
[ 0.000000] boot_command_line : phys 0000000002e53020
[ 0.000000] boot_command_line : virt ffff880002e53020
[ 0.000000] Kernel Layout:
[ 0.000000] .text: [0x01000000-0x020b8840]
[ 0.000000] .rodata: [0x02200000-0x029d3fff]
[ 0.000000] .data: [0x02a00000-0x02bd4d7f]
[ 0.000000] .init: [0x02bd6000-0x02f71fff]
[ 0.000000] .bss: [0x02f80000-0x03c20fff]
[ 0.000000] .brk: [0x03c21000-0x03c45fff]
[ 0.000000] memblock_reserve: [0x0009f000-0x000fffff] * BIOS reserved
[ 0.000000] Initializing cgroup subsys cpuset
[ 0.000000] Initializing cgroup subsys cpu
[ 0.000000] Initializing cgroup subsys cpuacct
[ 0.000000] Linux version 3.10.0-rc6-yh-01398-ga6660aa-dirty
([email protected]) (gcc version 4.7.2 20130108 [gcc-4_7-branch
revision 195012] (SUSE Linux) ) #1754 SMP Tue Jun 18 13:10:47 PDT 2013
[ 0.000000] memblock_reserve: [0x01000000-0x03c20fff] TEXT DATA BSS
[ 0.000000] memblock_reserve: [0x7dcef000-0x7fffefff] RAMDISK
[ 0.000000] Command line: BOOT_IMAGE=linux debug ignore_loglevel
initcall_debug pci=routeirq ramdisk_size=262144 root=/dev/ram0 rw
ip=dhcp console=uart8250,io,0x3f8,115200 initrd=initrd.img
[ 0.000000] KERNEL supported cpus:
[ 0.000000] Intel GenuineIntel
[ 0.000000] AMD AuthenticAMD
[ 0.000000] Centaur CentaurHauls
[ 0.000000] Physical RAM map:
[ 0.000000] raw: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] raw: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000] raw: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] raw: [mem 0x0000000000100000-0x00000000dfffdfff] usable
[ 0.000000] raw: [mem 0x00000000dfffe000-0x00000000dfffffff] reserved
[ 0.000000] raw: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[ 0.000000] raw: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000] raw: [mem 0x0000000100000000-0x000000011fffffff] usable
[ 0.000000] e820: BIOS-provided physical RAM map (sanitized by setup):
[ 0.000000] BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
[ 0.000000] BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000000100000-0x00000000dfffdfff] usable
[ 0.000000] BIOS-e820: [mem 0x00000000dfffe000-0x00000000dfffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
[ 0.000000] BIOS-e820: [mem 0x0000000100000000-0x000000011fffffff] usable
[ 0.000000] debug: ignoring loglevel setting.
[ 0.000000] NX (Execute Disable) protection: active
[ 0.000000] SMBIOS 2.4 present.
[ 0.000000] DMI: Bochs Bochs, BIOS Bochs 01/01/2011
[ 0.000000] e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
[ 0.000000] e820: remove [mem 0x000a0000-0x000fffff] usable
[ 0.000000] No AGP bridge found
[ 0.000000] e820: last_pfn = 0x120000 max_arch_pfn = 0x400000000
[ 0.000000] MTRR default type: write-back
[ 0.000000] MTRR fixed ranges enabled:
[ 0.000000] 00000-9FFFF write-back
[ 0.000000] A0000-BFFFF uncachable
[ 0.000000] C0000-FFFFF write-protect
[ 0.000000] MTRR variable ranges enabled:
[ 0.000000] 0 [00E0000000-00FFFFFFFF] mask FFE0000000 uncachable
[ 0.000000] 1 disabled
[ 0.000000] 2 disabled
[ 0.000000] 3 disabled
[ 0.000000] 4 disabled
[ 0.000000] 5 disabled
[ 0.000000] 6 disabled
[ 0.000000] 7 disabled
[ 0.000000] PAT not supported by CPU.
[ 0.000000] e820: last_pfn = 0xdfffe max_arch_pfn = 0x400000000
[ 0.000000] found SMP MP-table at [mem 0x000fdae0-0x000fdaef]
mapped at [ffff8800000fdae0]
[ 0.000000] memblock_reserve: [0x000fdae0-0x000fdaef] * MP-table mpf
[ 0.000000] memblock_reserve: [0x000fdaf0-0x000fdbe3] * MP-table mpc
[ 0.000000] memblock_reserve: [0x03c21000-0x03c26fff] BRK
[ 0.000000] MEMBLOCK configuration:
[ 0.000000] memory size = 0xfff9cc00 reserved size = 0x4f98000
[ 0.000000] memory.cnt = 0x3
[ 0.000000] memory[0x0] [0x00001000-0x0009efff], 0x9e000 bytes
[ 0.000000] memory[0x1] [0x00100000-0xdfffdfff], 0xdfefe000 bytes
[ 0.000000] memory[0x2] [0x100000000-0x11fffffff], 0x20000000 bytes
[ 0.000000] reserved.cnt = 0x3
[ 0.000000] reserved[0x0] [0x0009f000-0x000fffff], 0x61000 bytes
[ 0.000000] reserved[0x1] [0x01000000-0x03c26fff], 0x2c27000 bytes
[ 0.000000] reserved[0x2] [0x7dcef000-0x7fffefff], 0x2310000 bytes
[ 0.000000] memblock_reserve: [0x00099000-0x0009efff] TRAMPOLINE
[ 0.000000] Base memory trampoline at [ffff880000099000] 99000 size 24576
[ 0.000000] memblock_reserve: [0x00000000-0x0000ffff] RESERVELOW
[ 0.000000] ACPI: RSDP 00000000000fd8d0 00014 (v00 BOCHS )
[ 0.000000] ACPI: RSDT 00000000dfffe270 00038 (v01 BOCHS BXPCRSDT
00000001 BXPC 00000001)
[ 0.000000] ACPI: FACP 00000000dfffff80 00074 (v01 BOCHS BXPCFACP
00000001 BXPC 00000001)
[ 0.000000] ACPI: DSDT 00000000dfffe2b0 011A9 (v01 BXPC BXDSDT
00000001 INTL 20100528)
[ 0.000000] ACPI: FACS 00000000dfffff40 00040
[ 0.000000] ACPI: SSDT 00000000dffff6e0 00858 (v01 BOCHS BXPCSSDT
00000001 BXPC 00000001)
[ 0.000000] ACPI: APIC 00000000dffff5b0 00090 (v01 BOCHS BXPCAPIC
00000001 BXPC 00000001)
[ 0.000000] ACPI: HPET 00000000dffff570 00038 (v01 BOCHS BXPCHPET
00000001 BXPC 00000001)
[ 0.000000] ACPI: SRAT 00000000dffff460 00110 (v01 BOCHS BXPCSRAT
00000001 BXPC 00000001)
[ 0.000000] ACPI: Local APIC address 0xfee00000
[ 0.000000] SRAT: PXM 0 -> APIC 0x00 -> Node 0
[ 0.000000] SRAT: PXM 0 -> APIC 0x01 -> Node 0
[ 0.000000] SRAT: PXM 1 -> APIC 0x02 -> Node 1
[ 0.000000] SRAT: PXM 1 -> APIC 0x03 -> Node 1
[ 0.000000] SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
[ 0.000000] SRAT: Node 0 PXM 0 [mem 0x00100000-0x7fffffff]
[ 0.000000] SRAT: Node 1 PXM 1 [mem 0x80000000-0xdfffffff]
[ 0.000000] SRAT: Node 1 PXM 1 [mem 0x100000000-0x11fffffff]
[ 0.000000] NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem
0x00100000-0x7fffffff] -> [mem 0x00000000-0x7fffffff]
[ 0.000000] NUMA: Node 1 [mem 0x80000000-0xdfffffff] + [mem
0x100000000-0x11fffffff] -> [mem 0x80000000-0x11fffffff]
[ 0.000000] Node 0: [mem 0x00000000000000-0x0000007fffffff]
[ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[ 0.000000] [mem 0x00000000-0x000fffff] page 4k
[ 0.000000] BRK [0x03c22000, 0x03c22fff] PGTABLE
[ 0.000000] BRK [0x03c23000, 0x03c23fff] PGTABLE
[ 0.000000] BRK [0x03c24000, 0x03c24fff] PGTABLE
[ 0.000000] init_memory_mapping: [mem 0x7da00000-0x7dbfffff]
[ 0.000000] [mem 0x7da00000-0x7dbfffff] page 2M
[ 0.000000] BRK [0x03c25000, 0x03c25fff] PGTABLE
[ 0.000000] init_memory_mapping: [mem 0x7c000000-0x7d9fffff]
[ 0.000000] [mem 0x7c000000-0x7d9fffff] page 2M
[ 0.000000] init_memory_mapping: [mem 0x00100000-0x7bffffff]
[ 0.000000] [mem 0x00100000-0x001fffff] page 4k
[ 0.000000] [mem 0x00200000-0x7bffffff] page 2M
[ 0.000000] init_memory_mapping: [mem 0x7dc00000-0x7fffffff]
[ 0.000000] [mem 0x7dc00000-0x7fffffff] page 2M
[ 0.000000] Node 1: [mem 0x00000080000000-0x0000011fffffff]
[ 0.000000] init_memory_mapping: [mem 0x11fe00000-0x11fffffff]
[ 0.000000] [mem 0x11fe00000-0x11fffffff] page 2M
[ 0.000000] BRK [0x03c26000, 0x03c26fff] PGTABLE
[ 0.000000] init_memory_mapping: [mem 0x11c000000-0x11fdfffff]
[ 0.000000] [mem 0x11c000000-0x11fdfffff] page 2M
[ 0.000000] init_memory_mapping: [mem 0x100000000-0x11bffffff]
[ 0.000000] [mem 0x100000000-0x11bffffff] page 2M
[ 0.000000] init_memory_mapping: [mem 0x80000000-0xdfffdfff]
[ 0.000000] [mem 0x80000000-0xdfdfffff] page 2M
[ 0.000000] [mem 0xdfe00000-0xdfffdfff] page 4k
[ 0.000000] memblock_reserve: [0x11ffff000-0x11fffffff] PGTABLE
[ 0.000000] memblock_reserve: [0x11fffe000-0x11fffefff] PGTABLE
[ 0.000000] memblock_reserve: [0x11fffd000-0x11fffdfff] PGTABLE
[ 0.000000] RAMDISK: [mem 0x7dcef000-0x7fffefff]
[ 0.000000] Initmem setup node 0 [mem 0x00000000-0x7fffffff]
[ 0.000000] memblock_reserve: [0x7dcc8000-0x7dceefff]
[ 0.000000] NODE_DATA [mem 0x7dcc8000-0x7dceefff]
[ 0.000000] Initmem setup node 1 [mem 0x80000000-0x11fffffff]
[ 0.000000] memblock_reserve: [0x11ffd6000-0x11fffcfff]
[ 0.000000] NODE_DATA [mem 0x11ffd6000-0x11fffcfff]
[ 0.000000] MEMBLOCK configuration:
[ 0.000000] memory size = 0xfff9cc00 reserved size = 0x4fff000
[ 0.000000] memory.cnt = 0x4
[ 0.000000] memory[0x0] [0x00001000-0x0009efff], 0x9e000 bytes on node 0
[ 0.000000] memory[0x1] [0x00100000-0x7fffffff], 0x7ff00000
bytes on node 0
[ 0.000000] memory[0x2] [0x80000000-0xdfffdfff], 0x5fffe000
bytes on node 1
[ 0.000000] memory[0x3] [0x100000000-0x11fffffff], 0x20000000
bytes on node 1
[ 0.000000] reserved.cnt = 0x5
[ 0.000000] reserved[0x0] [0x00000000-0x0000ffff], 0x10000 bytes
[ 0.000000] reserved[0x1] [0x00099000-0x000fffff], 0x67000 bytes
[ 0.000000] reserved[0x2] [0x01000000-0x03c26fff], 0x2c27000 bytes
[ 0.000000] reserved[0x3] [0x7dcc8000-0x7fffefff], 0x2337000 bytes
[ 0.000000] reserved[0x4] [0x11ffd6000-0x11fffffff], 0x2a000 bytes
[ 0.000000] memblock_reserve: [0x7ffff000-0x7fffffff] sparse section
[ 0.000000] memblock_reserve: [0x11fbd6000-0x11ffd5fff] usemap_map
[ 0.000000] memblock_reserve: [0x7dcc7e00-0x7dcc7fff] usemap section
[ 0.000000] memblock_reserve: [0x11fbd5e00-0x11fbd5fff] usemap section
[ 0.000000] memblock_reserve: [0x11f7d5e00-0x11fbd5dff] map_map
[ 0.000000] memblock_reserve: [0x7bc00000-0x7dbfffff] vmemmap buf
[ 0.000000] memblock_reserve: [0x7dcc6000-0x7dcc6fff] vmemmap block
[ 0.000000] [ffffea0000000000-ffffea7fffffffff] PGD @
ffff88007dcc6000 on node 0
[ 0.000000] memblock_reserve: [0x7dcc5000-0x7dcc5fff] vmemmap block
[ 0.000000] [ffffea0000000000-ffffea003fffffff] PUD @
ffff88007dcc5000 on node 0
[ 0.000000] memblock_free: [0x7dc00000-0x7dbfffff]
[ 0.000000] memblock_reserve: [0x11d600000-0x11f5fffff] vmemmap buf
[ 0.000000] [ffffea0000000000-ffffea0001ffffff] PMD ->
[ffff88007bc00000-ffff88007dbfffff] on node 0
[ 0.000000] memblock_free: [0x11f600000-0x11f5fffff]
[ 0.000000] [ffffea0002000000-ffffea00047fffff] PMD ->
[ffff88011d600000-ffff88011f5fffff] on node 1
[ 0.000000] memblock_free: [0x11f7d5e00-0x11fbd5dff]
[ 0.000000] memblock_free: [0x11fbd6000-0x11ffd5fff]
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x00001000-0x00ffffff]
[ 0.000000] DMA32 [mem 0x01000000-0xffffffff]
[ 0.000000] Normal [mem 0x100000000-0x11fffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ 0.000000] node 0: [mem 0x00001000-0x0009efff]
[ 0.000000] node 0: [mem 0x00100000-0x7fffffff]
[ 0.000000] node 1: [mem 0x80000000-0xdfffdfff]
[ 0.000000] node 1: [mem 0x100000000-0x11fffffff]
[ 0.000000] start - node_states[2]:
[ 0.000000] On node 0 totalpages: 524190
[ 0.000000] DMA zone: 64 pages used for memmap
[ 0.000000] DMA zone: 21 pages reserved
[ 0.000000] DMA zone: 3998 pages, LIFO batch:0
[ 0.000000] memblock_reserve: [0x7dc6d000-0x7dcc4fff] pgdat
[ 0.000000] DMA32 zone: 8128 pages used for memmap
[ 0.000000] DMA32 zone: 520192 pages, LIFO batch:31
[ 0.000000] memblock_reserve: [0x7dc15000-0x7dc6cfff] pgdat
[ 0.000000] On node 1 totalpages: 524286
[ 0.000000] DMA32 zone: 6144 pages used for memmap
[ 0.000000] DMA32 zone: 393214 pages, LIFO batch:31
[ 0.000000] memblock_reserve: [0x11ff7e000-0x11ffd5fff] pgdat
[ 0.000000] Normal zone: 2048 pages used for memmap
[ 0.000000] Normal zone: 131072 pages, LIFO batch:31
[ 0.000000] memblock_reserve: [0x11ff26000-0x11ff7dfff] pgdat
[ 0.000000] after - node_states[2]: 0-1
[ 0.000000] memblock_reserve: [0x11ff25000-0x11ff25fff] pgtable

2013-06-19 10:05:52

by Vasilis Liaskovitis

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

On Tue, Jun 18, 2013 at 01:19:12PM -0700, Yinghai Lu wrote:
> On Tue, Jun 18, 2013 at 10:10 AM, Vasilis Liaskovitis
> <[email protected]> wrote:
> >> could be found at:
> >> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm
> >>
> >> and it is based on today's Linus tree.
> >>
> >
> > Has this patchset been tested on various numa configs?
> > I am using linux-next next-20130607 + part1 with qemu/kvm/seabios VMs. The kernel
> > boots successfully in many numa configs but while trying different memory sizes
> > for a 2 numa node VM, I noticed that booting does not complete in all cases
> > (bootup screen appears to hang but there is no output indicating an early panic)
> >
> > node0 node1 boots
> > 1G 1G yes
> > 1G 2G yes
> > 1G 0.5G yes
> > 3G 2.5G yes
> > 3G 3G yes
> > 4G 0G yes
> > 4G 4G yes
> > 1.5G 1G no
> > 2G 1G no
> > 2G 2G no
> > 2.5G 2G no
> > 2.5G 2.5G no
> >
> > linux-next next-20130607 boots all of these configs fine.
> >
> > Looks odd, perhaps I have something wrong in my setup or maybe there is a
> > seabios/qemu interaction with this patchset. I will update if I find something.
>
> just tried 2g/2g, and it works on qemu-kvm:

Thanks for testing. If you can also share qemu/seabios versions you use (release
or git commits), that would be helpful.

This is most likely some error in my setup; I'll let you know if I conclude
otherwise.

thanks,

- Vasilis

2013-06-19 21:25:43

by Yinghai Lu

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 16/22] x86, mm, numa: Move numa emulation handling down.

On Mon, Jun 17, 2013 at 11:22 PM, Yinghai Lu <[email protected]> wrote:
> On Mon, Jun 17, 2013 at 6:58 PM, Tejun Heo <[email protected]> wrote:
>> On Thu, Jun 13, 2013 at 09:03:03PM +0800, Tang Chen wrote:
>>> From: Yinghai Lu <[email protected]>
>>>
>>> numa_emulation() needs to allocate buffer for new numa_meminfo
>>> and distance matrix, so execute it later in x86_numa_init().
>>>
>>> Also we change the behavior:
>>> - before this patch, if user input wrong data in command
>>> line, it will fall back to next numa probing or disabling
>>> numa.
>>> - after this patch, if user input wrong data in command line,
>>> it will stay with numa info probed from previous probing,
>>> like ACPI SRAT or amd_numa.
>>>
>>> We need to call numa_check_memblks to reject wrong user inputs early
>>> so that we can keep the original numa_meminfo not changed.
>>
>> So, this is another very subtle ordering you're adding without any
>> comment and I'm not sure it even makes sense because the function can
>> fail after that point.

The new numa_emulation will call numa_check_memblks first, before
touching numa_meminfo.
If it fails, numa_meminfo is not touched, so that should not be a problem.

>
> Yes, if it fail, we will stay with current numa info from firmware.
> That looks like right behavior.
>
> Before this patch, it will fail to next numa way like if acpi srat + user
> input fail, it will try to go with amd_numa then try apply user info.

For the numa emulation failure sequence, I want to double check what the
right sequence should be:

On and before 2.6.38:
emulation ==> acpi ==> amd ==> dummy
so if emulation got wrong input, it would fall back to acpi numa.

From 2.6.39:
acpi (emulation) ==> amd (emulation) ==> dummy (emulation)
if emulation got wrong input, it would fall back to the next numa discovery.

After my patchset it will be:
acpi ==> amd ==> dummy, then emulation.
The new emulation will call numa_check_memblks first, before touching
numa_meminfo, so if emulation fails, numa_meminfo is not touched.

So this change looks like the right change.

Thanks

Yinghai

2013-06-20 05:49:56

by Tang Chen

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

Hi tj, Yinghai,

On 06/19/2013 01:21 AM, Tejun Heo wrote:
> Hey, Tang.
>
> On Tue, Jun 18, 2013 at 01:47:16PM +0800, Tang Chen wrote:
>> [approach]
>> Parse SRAT earlier before memblock starts to work, because there is a
>> bit in SRAT specifying which memory is hotpluggable.
>>
>> I'm not saying this is the best approach. I can also see that this
>> patch-set touches a lot of boot code. But i think parsing SRAT earlier
>> is reasonable because this is the only way for now to know which memory
>> is hotpluggable from firmware.
>
> Touching a lot of code is not a problem but it feels like it's trying
> to bootstrap itself while walking and achieves that by carefully
> sequencing all operations which may allocate from memblock before NUMA
> info is available without any way to enforce or verify that.

Yes, the current implementation has no way to verify if there is
anything violating the hotplug requirement. This is weak and should
be improved.

>
>>> Can't you just move memblock arrays after NUMA init is complete?
>>> That'd be a lot simpler and way more robust than the proposed changes,
>>> no?
>>
>> Sorry, I don't quite understand the approach you are suggesting. If we
>> move memblock arrays, we need to update all the pointers pointing to
>> the moved memory. How can we do this ?
>
> So, there are two things involved here - memblock itself and consumers
> of memblock, right?

Yes.

> I get that the latter shouldn't allocate memory
> from memblock before NUMA info is entered into memblock, so please
> reorder as necessary *and* make sure memblock complains if something
> violates that. Temporary memory areas which are returned are fine.
> Just complain if, once boot is complete, there are still memory regions
> remaining which were allocated before NUMA info was available. No
> need to make booting more painful than it currently is.

I think there are two difficulties with doing the job the way you suggest.

1. It is difficult to tell which memory allocation is temporary and
which one is permanent at the time memblock is allocating it. So we
can only wait until boot is complete and see which remain.
But then we hit the second difficulty.

2. From an entry in memblock.reserved[] alone, we cannot tell why that
memory was allocated, right? So it is difficult to do the relocation.
And if in the future we have to allocate permanent memory for other
new purposes, we have to implement the relocation again and again.
(Not sure if I understand the point correctly. I think there isn't
a generic way to relocate memory used for different purposes.)

If you have also had a look at the Part2 patches, you will see that I
introduced a flags member into memblock to specify different types
of memory, which helps to recognize hotpluggable memory. My idea is
to ensure that memblock will not allocate hotpluggable memory. I
think this is the safest and easiest way to satisfy the hotplug
requirement.

(not finished, please see below)

>
> As for memblock itself, there's no need to walk carefully around it.
> Just let it do its thing and implement
> memblock_relocate_to_numa_node_0() or whatever after NUMA information
> is available. memblock already does relocate itself whenever it's
> expanding the arrays anyway, so implementation should be trivial.

Yes, this is easy.

>
> Maybe I'm missing something but having a working memory allocator as
> soon as possible is *way* less painful than trying to bootstrap around
> it. Allow boot path to allocate memory areas from memblock as soon as
> possible but just ensure that none of the ones which may violate the
> hotplug requirements is remaining once boot is complete. Temporary
> regions won't matter then and the few which need persistent areas can
> either be reordered to happen after NUMA init or they can allocate a
> new area and move to there after NUMA info is available. Let's please
> minimize this walking-and-trying-to-tie-shoestrings-at-the-same-time
> thing. It's painful and extremely fragile.

IIUC, this is what you are worried about:

1. There is no way to ensure that parsing numa info stays early enough
in the future. Someone could start using memblock before SRAT is
parsed.

2. memblock won't complain if anything violates the hotplug requirement.
This is not safe.

So you don't agree to serialize the operations at boot time.

But I think ensuring that memblock won't allocate hotpluggable memory
to the kernel (which is what the current Part2 patches do) is the
safest way to satisfy the memory hotplug requirement. And that is
exactly a working memory allocator at boot time, rather than checking
or relocating after the system boots.

About this patch-set from Yinghai: actually he is doing a job that I
failed to do. And he also included a lot of other things in the
patch-set, such as extending the max number of overridable ACPI
tables, the local node pagetable, and so on.

Maybe doing all these things at the same time looks a little messy.
So, how about we do it this way:

1. Improvements to ACPI_TABLE_OVERRIDE, such as increasing the number
of overridable tables.

2. Move SRAT parsing earlier.

3. Local device pagetable (not local node), which I mentioned in the
Part3 patch-set discussion. I'm now working on it as well.

I'm not trying to do things halfway. I just think smaller patch-sets
may be easier to understand and review.


PS:
More info about local device pagetable:

There could be more than one memory device in a NUMA node. If we
allocate a local node pagetable, the pagetable pages of one memory
device could end up in another memory device. So the memory device
containing the pagetable has to be hot-removed last, which is hard to
handle in the hot-remove path. So maybe a local device pagetable is
more reasonable.

Thanks. :)






2013-06-20 06:17:35

by Tejun Heo

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

Hello, Tang.

On Thu, Jun 20, 2013 at 01:52:50PM +0800, Tang Chen wrote:
> 1. It is difficult to tell which memory allocation is temporary and
> which one is permanent when memblock is allocating memory. So, we
> can only wait till boot is complete, and see which remains.
> But, we have the second difficulty.
>
> 2. In memblock.reserved[], we cannot tell why we allocated this memory
> just from the array item, right? So it is difficult to do the
> relocation. If in the future, we have to allocate permanent memory
> for other new purposes, we have to do the relocation again and again.
> (Not sure if I understand the point correctly. I think there isn't
> a generic way to relocate memory used for different purposes.)

I was suggesting two separate things.

* As the memblock allocator can relocate itself, there's no point in
avoiding setting NUMA node info while parsing and registering NUMA
topology. Just parse and register NUMA info and later tell it to
relocate itself out of the hot-pluggable nodes. A number of patches in
the series are doing this dancing - carefully reordering NUMA
probing. No need to do that. It's a really fragile thing to do.

* Once you get the above out of the way, I don't think there are a lot
of permanent allocations in the way before NUMA is initialized.
Re-order the remaining ones if that's cleaner to do. If that gets
overly messy / fragile, copying them around or freeing and reloading
afterwards could be an option too. There isn't much point in being
super-efficient about ACPI override table. Being cleaner and more
robust is far more important.

As for distinguishing temporary / permanent, it shouldn't be difficult
to make memblock track all allocations before NUMA info becomes online
and then verify that those areas are free by the time boot is
complete. Just mark the reserved areas allocated before NUMA info is
fully available.

> If you also had a look at the Part2 patches, you will see that I
> introduced a flags member into memblock to specify different types
> of memory, which will help to recognize hotpluggable memory. My
> thinking is that ensure memblock will not allocate hotpluggable
> memory. I think this is the most safe and easy way to satisfy hotplug
> requirement.

And you can use exactly the same mechanism to track memory areas which
were allocated before NUMA info was fully available, right?

> So you don't agree to serialize the operations at boot time.

No, I'm not disagreeing that some ordering is necessary. My point is
that things seem to be going too far in that direction. Sure, some
reordering is necessary, but it doesn't have to be this fragile.
Careful reordering isn't the only way to achieve it.

> About this patch-set from Yinghai, actually he is doing a job that I
> failed to do. And he also included a lot of other things in the
> patch-set, such as extend max number of overridable acpi tables, local
> node pagetable, and so on.

Doing multiple things to achieve a goal in a patchset might not be
optimal but is usually okay if properly explained. What's not okay is
not explaining the overall goal, approach and design in the head
message, poor quality of patch description and code documentation.

This part of code is almost inherently fragile and difficult to debug
and patchset like this would degrade the maintainability and I really
don't want to spend hours trying to decipher what the overall approach
is by trying to navigate maze of poorly documented patches only to
find out that some of the basic approaches are not very agreeable. We
could have had this exact discussion way earlier if the head message
properly described what was going on and the review process would have
been much more pleasant for all involved parties.

I don't think it matters whose patches go in how as long as they are
attributed correctly. The end result - what goes in the git tree as
log and code changes - matters, and it needs to be whole lot better.

Thanks.

--
tejun

2013-06-20 18:42:11

by Yinghai Lu

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

On Wed, Jun 19, 2013 at 3:05 AM, Vasilis Liaskovitis
<[email protected]> wrote:
> On Tue, Jun 18, 2013 at 01:19:12PM -0700, Yinghai Lu wrote:
>> On Tue, Jun 18, 2013 at 10:10 AM, Vasilis Liaskovitis
>> <[email protected]> wrote:
>> >> could be found at:
>> >> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm
>> >>
>> >> and it is based on today's Linus tree.
>> >>
>> >
>> > Has this patchset been tested on various numa configs?
>> > I am using linux-next next-20130607 + part1 with qemu/kvm/seabios VMs. The kernel
>> > boots successfully in many numa configs but while trying different memory sizes
>> > for a 2 numa node VM, I noticed that booting does not complete in all cases
>> > (bootup screen appears to hang but there is no output indicating an early panic)
>> >
>> > node0 node1 boots
>> > 1G 1G yes
>> > 1G 2G yes
>> > 1G 0.5G yes
>> > 3G 2.5G yes
>> > 3G 3G yes
>> > 4G 0G yes
>> > 4G 4G yes
>> > 1.5G 1G no
>> > 2G 1G no
>> > 2G 2G no
>> > 2.5G 2G no
>> > 2.5G 2.5G no
>> >
>> > linux-next next-20130607 boots all of these configs fine.
>> >
>> > Looks odd, perhaps I have something wrong in my setup or maybe there is a
>> > seabios/qemu interaction with this patchset. I will update if I find something.
>>
>> just tried 2g/2g, and it works on qemu-kvm:
>
> thanks for testing. If you can also share qemu/seabios versions you use (release
> or git commits), that would be helpful.

QEMU emulator version 1.5.50, Copyright (c) 2003-2008 Fabrice Bellard

it is at:
commit 7387de16d0e4d2988df350926537cd12a8e34206
Merge: b8a75b6 e73fe2b
Author: Anthony Liguori <[email protected]>
Date: Fri Jun 7 08:40:52 2013 -0500

Merge remote-tracking branch 'stefanha/block' into staging

start command:

#for 64bit numa
/usr/local/kvm/bin/qemu-system-x86_64 -L /usr/local/kvm/share/qemu
-enable-kvm -numa node,nodeid=0,cpus=0-1,mem=2048 -numa
node,nodeid=1,cpus=2-3,mem=2048 -smp sockets=2,cores=2,threads=1 -m
4096 -net nic,model=e1000,macaddr=00:1c:25:1c:13:e9 -net user -hda
/home/yhlu/data.dsk -cdrom
/home/yhlu/xx/xx/kernel/tip/linux-2.6/arch/x86/boot/image.iso -boot d
-serial telnet:127.0.0.1:4444,server -monitor stdio


>
> this is most likely some error on my setup, I 'll let you know if I conclude
> otherwise.
>
> thanks,
>
> - Vasilis

2013-06-21 05:21:56

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

On 06/13/2013 06:02 AM, Tang Chen wrote:
> From: Yinghai Lu <[email protected]>
>
> No offence, just rebase and resend the patches from Yinghai to help
> to push this functionality faster.
> Also improve the comments in the patches' log.
>

So we need a new version of this which addresses the build problems and
the feedback from Tejun... and it would be good to get that soon, or
we'll be looking at 3.12.

Since the merge window is approaching quickly, is there a meaningful
subset that is ready now?

-hpa

2013-06-21 06:03:53

by Tang Chen

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

On 06/21/2013 01:19 PM, H. Peter Anvin wrote:
> On 06/13/2013 06:02 AM, Tang Chen wrote:
>> From: Yinghai Lu<[email protected]>
>>
>> No offence, just rebase and resend the patches from Yinghai to help
>> to push this functionality faster.
>> Also improve the comments in the patches' log.
>>
>
> So we need a new version of this which addresses the build problems and
> the feedback from Tejun... and it would be good to get that soon, or
> we'll be looking at 3.12.

Hi hpa,

The build problem has been fixed by Yinghai.

>
> Since the merge window is approaching quickly, is there a meaningful
> subset that is ready now?

I think memory hotplug needs at least the part1 and part2 patches. But
the local node pagetable (patches 21 and 22 in part1) will break the
memory hot-remove path. My part3 intends to fix it, but it seems we
need a local device pagetable to enable single-device hotplug, not
just a local node pagetable.

So, my plan is:

1. Implement arranging hotpluggable memory with SRAT first, following
tj's comments, without the local node pagetable.
(The main work is in part2, and of course it needs some patches
from part1.)
2. Do the local device pagetable work, not local node.
3. Improve memory hotplug to support the local device pagetable.

I'll send a new version patch-set for step 1, hoping we can catch the
merge window. And I think steps 2 and 3 should be done later.


Thanks. :)

2013-06-21 06:11:16

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

On 06/20/2013 11:06 PM, Tang Chen wrote:
>
> Hi hpa,
>
> The build problem has been fixed by Yinghai.
>

Where? I don't see anything that is obviously a fix in my inbox.

What about Tejun's feedback?

-hpa

2013-06-21 06:17:32

by Tang Chen

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

On 06/21/2013 02:10 PM, H. Peter Anvin wrote:
> On 06/20/2013 11:06 PM, Tang Chen wrote:
>>
>> Hi hpa,
>>
>> The build problem has been fixed by Yinghai.
>>
>
> Where? I don't see anything that is obviously a fix in my inbox.
>

Yinghai resent a new version to fix the problem.
https://lkml.org/lkml/2013/6/14/561

> What about Tejun's feedback?

tj's comments came after the latest version. So we need to
restructure the patch-set.

>
> -hpa
>
>

2013-06-21 06:26:38

by Tejun Heo

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

Hello, guys.

On Fri, Jun 21, 2013 at 02:20:28PM +0800, Tang Chen wrote:
> >What about Tejun's feedback?
>
> tj's comments were after the latest version. So we need to
> restructure the patch-set.

Given that it's unlikely to reach actual functionality in this cycle,
it's probably a better idea to aim for the next cycle. I don't think we
wanna rush it. As for my suggestions, I'm not sure how much of it'd
work out and how much better it's gonna make things, but it definitely
seems worth investigating to me. Let's please see how it goes.

Thanks a lot!

--
tejun

2013-06-21 09:16:59

by Tang Chen

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

Hi tj,

On 06/20/2013 02:17 PM, Tejun Heo wrote:
......
>
> I was suggesting two separate things.
>
> * As memblock allocator can relocate itself. There's no point in
> avoiding setting NUMA node while parsing and registering NUMA
> topology. Just parse and register NUMA info and later tell it to
> relocate itself out of hot-pluggable node. A number of patches in
> the series is doing this dancing - carefully reordering NUMA
> probing. No need to do that. It's really fragile thing to do.
>
> * Once you get the above out of the way, I don't think there are a lot
> of permanent allocations in the way before NUMA is initialized.
> Re-order the remaining ones if that's cleaner to do. If that gets
> overly messy / fragile, copying them around or freeing and reloading
> afterwards could be an option too.

The memblock allocator can relocate itself, but it cannot relocate the
memory it allocated for users. There could be pointers pointing into
these memory ranges. If we do the relocation, how do we update these
pointers?

Or do you mean modifying the pagetable? I don't think so.

So would you please tell me more about how to do the relocation?

Thanks. :)

2013-06-21 18:25:17

by Tejun Heo

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

Hey,

On Fri, Jun 21, 2013 at 05:19:48PM +0800, Tang Chen wrote:
> >* As memblock allocator can relocate itself. There's no point in
> > avoiding setting NUMA node while parsing and registering NUMA
> > topology. Just parse and register NUMA info and later tell it to
> > relocate itself out of hot-pluggable node. A number of patches in
> > the series is doing this dancing - carefully reordering NUMA
> > probing. No need to do that. It's really fragile thing to do.
> >
> >* Once you get the above out of the way, I don't think there are a lot
> > of permanent allocations in the way before NUMA is initialized.
> > Re-order the remaining ones if that's cleaner to do. If that gets
> > overly messy / fragile, copying them around or freeing and reloading
> > afterwards could be an option too.
>
> memblock allocator can relocate itself, but it cannot relocate the memory

Hmmm... maybe I wasn't clear but that's the first bullet point above.

> it allocated for users. There could be some pointers pointing to these
> memory ranges. If we do the relocation, how to update these pointers ?

And the second. Can you please list what persistent areas are
allocated before numa info is configured into memblock? There
shouldn't be whole lot. And, again, this type of information should
have been available in the head message so that high-level discussion
could take place right away.

Thanks.

--
tejun

2013-06-21 20:18:40

by Yinghai Lu

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

On Thu, Jun 20, 2013 at 10:19 PM, H. Peter Anvin <[email protected]> wrote:
> On 06/13/2013 06:02 AM, Tang Chen wrote:
>> From: Yinghai Lu <[email protected]>
>>
>> No offence, just rebase and resend the patches from Yinghai to help
>> to push this functionality faster.
>> Also improve the comments in the patches' log.
>>
>
> So we need a new version of this which addresses the build problems and
> the feedback from Tejun... and it would be good to get that soon, or
> we'll be looking at 3.12.
>
> Since the merge window is approaching quickly, is there a meaningful
> subset that is ready now?

Patches 1-9 and 20 in the updated patchset could go into 3.11:
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git
for-x86-mm
https://git.kernel.org/cgit/linux/kernel/git/yinghai/linux-yinghai.git/log/?h=for-x86-mm

They are about moving the acpi_override earlier plus some enhancements.
They have got enough Tested-by and Acked-by tags, including ones from tj.

If you are OK with that, I could resend those 10 patches today.

Thanks

Yinghai

2013-06-24 03:48:57

by Tang Chen

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

On 06/22/2013 02:25 AM, Tejun Heo wrote:
> Hey,
>
> On Fri, Jun 21, 2013 at 05:19:48PM +0800, Tang Chen wrote:
>>> * As memblock allocator can relocate itself. There's no point in
>>> avoiding setting NUMA node while parsing and registering NUMA
>>> topology. Just parse and register NUMA info and later tell it to
>>> relocate itself out of hot-pluggable node. A number of patches in
>>> the series is doing this dancing - carefully reordering NUMA
>>> probing. No need to do that. It's really fragile thing to do.
>>>
>>> * Once you get the above out of the way, I don't think there are a lot
>>> of permanent allocations in the way before NUMA is initialized.
>>> Re-order the remaining ones if that's cleaner to do. If that gets
>>> overly messy / fragile, copying them around or freeing and reloading
>>> afterwards could be an option too.
>>
>> memblock allocator can relocate itself, but it cannot relocate the memory
>
> Hmmm... maybe I wasn't clear but that's the first bullet point above.
>
>> it allocated for users. There could be some pointers pointing to these
>> memory ranges. If we do the relocation, how to update these pointers ?
>
> And the second. Can you please list what persistent areas are
> allocated before numa info is configured into memblock? There

Hi tj,

My box is x86_64, and the memory layout is:
[ 0.000000] SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
[ 0.000000] SRAT: Node 0 PXM 0 [mem 0x100000000-0x307ffffff]
[ 0.000000] SRAT: Node 1 PXM 2 [mem 0x308000000-0x587ffffff] Hot Pluggable
[ 0.000000] SRAT: Node 2 PXM 3 [mem 0x588000000-0x7ffffffff] Hot Pluggable


I marked the ranges reserved by memblock before we parse SRAT with flag
0x4. There are about 14 ranges which are persistent after boot.

[ 0.000000] reserved[0x0] [0x00000000000000-0x0000000000ffff], 0x10000 bytes flags: 0x4
[ 0.000000] reserved[0x1] [0x00000000093000-0x000000000fffff], 0x6d000 bytes flags: 0x4
[ 0.000000] reserved[0x2] [0x00000001000000-0x00000002a9afff], 0x1a9b000 bytes flags: 0x4
[ 0.000000] reserved[0x3] [0x00000030000000-0x00000037ffffff], 0x8000000 bytes flags: 0x4
...
[ 0.000000] reserved[0x5] [0x0000006da81000-0x0000006e46afff], 0x9ea000 bytes flags: 0x4
[ 0.000000] reserved[0x6] [0x0000006ed6a000-0x0000006f246fff], 0x4dd000 bytes flags: 0x4
[ 0.000000] reserved[0x7] [0x0000006f28a000-0x0000006f299fff], 0x10000 bytes flags: 0x4
[ 0.000000] reserved[0x8] [0x0000006f29c000-0x0000006fe91fff], 0xbf6000 bytes flags: 0x4
[ 0.000000] reserved[0x9] [0x00000070e92000-0x00000071d54fff], 0xec3000 bytes flags: 0x4
[ 0.000000] reserved[0xa] [0x00000071d5e000-0x00000072204fff], 0x4a7000 bytes flags: 0x4
[ 0.000000] reserved[0xb] [0x00000072220000-0x0000007222074f], 0x750 bytes flags: 0x4
...
[ 0.000000] reserved[0xd] [0x000000722bc000-0x000000722bc1cf], 0x1d0 bytes flags: 0x4
[ 0.000000] reserved[0xe] [0x00000072bd3000-0x00000076c8ffff], 0x40bd000 bytes flags: 0x4
......
[ 0.000000] reserved[0x134] [0x000007fffdf000-0x000007ffffffff], 0x21000 bytes flags: 0x4


Just for readability:
[0x00000308000000-0x00000587ffffff] Hot Pluggable
[0x00000588000000-0x000007ffffffff] Hot Pluggable

Seeing from the dmesg, only the last one is in a hotpluggable area. I
need to go through the code to find out what it is, and find a way to
relocate it.

But I'm not sure if a box with a different SRAT will have a different result.

I will send more info later.

Thanks. :)


> shouldn't be whole lot. And, again, this type of information should
> have been available in the head message so that high-level discussion
> could take place right away.
>
> Thanks.
>

2013-06-24 07:23:35

by Tang Chen

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

On 06/24/2013 11:51 AM, Tang Chen wrote:
> On 06/22/2013 02:25 AM, Tejun Heo wrote:
>> Hey,
>>
>> On Fri, Jun 21, 2013 at 05:19:48PM +0800, Tang Chen wrote:
>>>> * As memblock allocator can relocate itself. There's no point in
>>>> avoiding setting NUMA node while parsing and registering NUMA
>>>> topology. Just parse and register NUMA info and later tell it to
>>>> relocate itself out of hot-pluggable node. A number of patches in
>>>> the series is doing this dancing - carefully reordering NUMA
>>>> probing. No need to do that. It's really fragile thing to do.
>>>>
>>>> * Once you get the above out of the way, I don't think there are a lot
>>>> of permanent allocations in the way before NUMA is initialized.
>>>> Re-order the remaining ones if that's cleaner to do. If that gets
>>>> overly messy / fragile, copying them around or freeing and reloading
>>>> afterwards could be an option too.
>>>
>>> memblock allocator can relocate itself, but it cannot relocate the
>>> memory
>>
>> Hmmm... maybe I wasn't clear but that's the first bullet point above.
>>
>>> it allocated for users. There could be some pointers pointing to these
>>> memory ranges. If we do the relocation, how to update these pointers ?
>>
>> And the second. Can you please list what persistent areas are
>> allocated before numa info is configured into memblock? There
>
> Hi tj,
>
> My box is x86_64, and the memory layout is:
> [ 0.000000] SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
> [ 0.000000] SRAT: Node 0 PXM 0 [mem 0x100000000-0x307ffffff]
> [ 0.000000] SRAT: Node 1 PXM 2 [mem 0x308000000-0x587ffffff] Hot Pluggable
> [ 0.000000] SRAT: Node 2 PXM 3 [mem 0x588000000-0x7ffffffff] Hot Pluggable
>
>
> I marked ranges reserved by memblock before we parse SRAT with flag 0x4.
> There are about 14 ranges which is persistent after boot.
>
> [ 0.000000] reserved[0x0] [0x00000000000000-0x0000000000ffff], 0x10000
> bytes flags: 0x4
> [ 0.000000] reserved[0x1] [0x00000000093000-0x000000000fffff], 0x6d000
> bytes flags: 0x4
> [ 0.000000] reserved[0x2] [0x00000001000000-0x00000002a9afff], 0x1a9b000
> bytes flags: 0x4
> [ 0.000000] reserved[0x3] [0x00000030000000-0x00000037ffffff], 0x8000000
> bytes flags: 0x4
> ...
> [ 0.000000] reserved[0x5] [0x0000006da81000-0x0000006e46afff], 0x9ea000
> bytes flags: 0x4
> [ 0.000000] reserved[0x6] [0x0000006ed6a000-0x0000006f246fff], 0x4dd000
> bytes flags: 0x4
> [ 0.000000] reserved[0x7] [0x0000006f28a000-0x0000006f299fff], 0x10000
> bytes flags: 0x4
> [ 0.000000] reserved[0x8] [0x0000006f29c000-0x0000006fe91fff], 0xbf6000
> bytes flags: 0x4
> [ 0.000000] reserved[0x9] [0x00000070e92000-0x00000071d54fff], 0xec3000
> bytes flags: 0x4
> [ 0.000000] reserved[0xa] [0x00000071d5e000-0x00000072204fff], 0x4a7000
> bytes flags: 0x4
> [ 0.000000] reserved[0xb] [0x00000072220000-0x0000007222074f], 0x750
> bytes flags: 0x4
> ...
> [ 0.000000] reserved[0xd] [0x000000722bc000-0x000000722bc1cf], 0x1d0
> bytes flags: 0x4
> [ 0.000000] reserved[0xe] [0x00000072bd3000-0x00000076c8ffff], 0x40bd000
> bytes flags: 0x4
> ......
> [ 0.000000] reserved[0x134] [0x000007fffdf000-0x000007ffffffff], 0x21000
> bytes flags: 0x4

This range is allocated by init_mem_mapping() in setup_arch(), which
calls alloc_low_pages() to allocate pagetable pages.

I think if we do the local device pagetable, we can solve this problem
without any relocation.

I will make a patch trying to do this. But I'm not sure if there are any
other relocation problems on other architectures.

But even if not, I still think this could be dangerous if someone
modifies the boot path and allocates some persistent memory before
SRAT is parsed in the future. He has to be aware of memory hotplug
and do the necessary relocation himself.

I'll try to make the patch to achieve this, with comments as full as possible.

Thanks. :)

>
>
> Just for the readability:
> [0x00000308000000-0x00000587ffffff] Hot Pluggable
> [0x00000588000000-0x000007ffffffff] Hot Pluggable
>
> Seeing from the dmesg, only the last one is in hotpluggable area. I need
> to go through the code to find out what it is, and find a way to relocate it.
>
> But I'm not sure if a box with a different SRAT will have different result.
>
> I will send more info later.
>
> Thanks. :)
>
>
>> shouldn't be whole lot. And, again, this type of information should
>> have been available in the head message so that high-level discussion
>> could take place right away.
>>
>> Thanks.
>>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>

2013-06-24 09:43:18

by Gu Zheng

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

On 06/19/2013 01:10 AM, Vasilis Liaskovitis wrote:

> Hi,
>
> On Thu, Jun 13, 2013 at 09:02:47PM +0800, Tang Chen wrote:
>> From: Yinghai Lu <[email protected]>
>>
>> No offence, just rebase and resend the patches from Yinghai to help
>> to push this functionality faster.
>> Also improve the comments in the patches' log.
>>
>>
>> One commit that tried to parse SRAT early get reverted before v3.9-rc1.
>>
>> | commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
>> | Author: Tang Chen <[email protected]>
>> | Date: Fri Feb 22 16:33:44 2013 -0800
>> |
>> | acpi, memory-hotplug: parse SRAT before memblock is ready
>>
>> It broke several things, like acpi override and fall back path etc.
>>
>> This patchset is clean implementation that will parse numa info early.
>> 1. keep the acpi table initrd override working by split finding with copying.
>> finding is done at head_32.S and head64.c stage,
>> in head_32.S, initrd is accessed in 32bit flat mode with phys addr.
>> in head64.c, initrd is accessed via kernel low mapping address
>> with help of #PF set page table.
>> copying is done with early_ioremap just after memblock is setup.
>> 2. keep fallback path working. numaq and ACPI and amd_nmua and dummy.
>> seperate initmem_init to two stages.
>> early_initmem_init will only extract numa info early into numa_meminfo.
>> initmem_init will keep slit and emulation handling.
>> 3. keep other old code flow untouched like relocate_initrd and initmem_init.
>> early_initmem_init will take old init_mem_mapping position.
>> it call early_x86_numa_init and init_mem_mapping for every nodes.
>> For 64bit, we avoid having size limit on initrd, as relocate_initrd
>> is still after init_mem_mapping for all memory.
>> 4. last patch will try to put page table on local node, so that memory
>> hotplug will be happy.
>>
>> In short, early_initmem_init will parse numa info early and call
>> init_mem_mapping to set page table for every nodes's mem.
>>
>> could be found at:
>> git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm
>>
>> and it is based on today's Linus tree.
>>
>
> Has this patchset been tested on various numa configs?
> I am using linux-next next-20130607 + part1 with qemu/kvm/seabios VMs. The kernel
> boots successfully in many numa configs but while trying different memory sizes
> for a 2 numa node VM, I noticed that booting does not complete in all cases
> (bootup screen appears to hang but there is no output indicating an early panic)
>
> node0 node1 boots
> 1G 1G yes
> 1G 2G yes
> 1G 0.5G yes
> 3G 2.5G yes
> 3G 3G yes
> 4G 0G yes
> 4G 4G yes
> 1.5G 1G no
> 2G 1G no
> 2G 2G no
> 2.5G 2G no
> 2.5G 2.5G no
>
> linux-next next-20130607 boots al of these configs fine.
>
> Looks odd, perhaps I have something wrong in my setup or maybe there is a
> seabios/qemu interaction with this patchset. I will update if I find something.

Hi Vasilis,
This patchset works well with all the numa config cases you mentioned, on the latest kernel tree (3.10-rc7), on our box.

Host OS: RHEL 6.4 Beta
qemu-kvm: 0.12.1.2 (Released with RHEL 6.4 Beta)
Guest OS: RHEL 6.3
Guest kernel: 3.10-rc7 + [Part1 PATCH v5] x86, ACPI, numa: Parse numa info earlier
Cmd:

/usr/libexec/qemu-kvm -name rhel_6.3 -S -M rhel6.4.0 -enable-kvm
-m 5120 -smp 4,sockets=4,cores=1,threads=1
-numa node,nodeid=0,cpus=0-1,mem=2560
-numa node,nodeid=1,cpus=2-3,mem=2560
-uuid fa11164c-1a09-280b-eae4-e2c40c631767 -nodefconfig -nodefaults -chardev
socket,id=charmonitor,path=/var/lib/libvirt/qemu/rhel_6.3.monitor,server,nowait
-mon chardev=charmonitor,id=monitor,mode=control -rtc base=utc -no-shutdown
-device piix3-usb-uhci,id=usb,bus=pci.0,addr=0x1.0x2 -drive file=/home/hut-rhel6.3.img,if=none,id=drive-virtio-disk0,format=qcow2,cache=none -device virtio-blk-pci,scsi=off,bus=pci.0,addr=0x4,drive=drive-virtio-disk0,id=virtio-disk0,bootindex=1 -netdev tap,fd=26,id=hostnet0,vhost=on,vhostfd=27 -device virtio-net-pci,netdev=hostnet0,id=net0,mac=52:54:00:28:6e:29,bus=pci.0,addr=0x3 -chardev pty,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -device usb-tablet,id=input0 -vnc 127.0.0.1:0 -vga cirrus -device virtio-balloon-pci,id=balloon0,bus=pci.0,addr=0x5


Result:
node0 node1 boots
1G 1G yes
1G 2G yes
1G 0.5G yes
3G 2.5G yes
3G 3G yes
4G 0G yes
4G 4G yes
1.5G 1G yes
2G 1G yes
2G 2G yes
2.5G 2G yes
2.5G 2.5G yes

Thanks,

Gu


>
> thanks,
>
> - Vasilis
>
>
>

2013-06-24 20:00:07

by Tejun Heo

[permalink] [raw]
Subject: Re: [Part1 PATCH v5 00/22] x86, ACPI, numa: Parse numa info earlier

Hello, Tang.

On Mon, Jun 24, 2013 at 03:26:27PM +0800, Tang Chen wrote:
> >My box is x86_64, and the memory layout is:
> >[ 0.000000] SRAT: Node 0 PXM 0 [mem 0x00000000-0x7fffffff]
> >[ 0.000000] SRAT: Node 0 PXM 0 [mem 0x100000000-0x307ffffff]
> >[ 0.000000] SRAT: Node 1 PXM 2 [mem 0x308000000-0x587ffffff] Hot Pluggable
> >[ 0.000000] SRAT: Node 2 PXM 3 [mem 0x588000000-0x7ffffffff] Hot Pluggable
> >
> >
> >I marked ranges reserved by memblock before we parse SRAT with flag 0x4.
> >There are about 14 ranges which is persistent after boot.

You can also record the caller address or short backtrace with each
allocation (maybe controlled by some debug parameter). It'd be a nice
capability to keep around anyway.

> This range is allocated by init_mem_mapping() in setup_arch(), it calls
> alloc_low_pages() to allocate pagetable pages.
>
> I think if we do the local device pagetable, we can solve this problem
> without any relocation.

Yeah, I really can't think of many places which would allocate a
permanent piece of memory before memblock is fully initialized. Just
in case I wasn't clear, I don't have anything fundamentally against
reordering operations if that's cleaner, but we really should at least
find out what needs to be reordered and have a mechanism to verify and
track them down, and of course if relocating / reloading / whatever is
cleaner and/or more robust, that's what we should do.

> I will make a patch trying to do this. But I'm not sure if there are any
> other relocation problems on other architectures.
>
> But even if not, I still think this could be dangerous if someone modifies
> the boot path and allocates some persistent memory before SRAT parsed in
> the future. He has to be aware of memory hotplug things and do the
> necessary relocation himself.

As I wrote above, I think it'd be nice to have a way to track memblock
allocations. It can be a debug thing but we can just do it by
default, e.g., for allocations before memblock is fully initialized.
It's not like there are a ton of them. Those extra allocations can be
freed on boot completion anyway, so they won't affect NUMA hotplug
either and we'll be able to continuously watch, and thus properly
maintain, the early boot hotplug issue on most configurations whether
they actually support and perform hotplug or not, which will be
multiple times more robust than trying to tweak boot sequence once and
hoping that it doesn't deteriorate over time.

Thanks.

--
tejun