2013-04-04 23:53:36

by Yinghai Lu

Subject: [PATCH v3 00/22] x86, ACPI, numa: Parse numa info early

One commit that tried to parse SRAT early was reverted before v3.9-rc1.

| commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
| Author: Tang Chen <[email protected]>
| Date: Fri Feb 22 16:33:44 2013 -0800
|
| acpi, memory-hotplug: parse SRAT before memblock is ready

It broke several things, such as the acpi override and the fallback path.

This patchset is a clean implementation that parses numa info early.
1. Keep the acpi table initrd override working by splitting finding
from copying.
finding is done at the head_32.S and head64.c stage:
in head_32.S, the initrd is accessed in 32bit flat mode via phys addr;
in head64.c, the initrd is accessed via the kernel low mapping address,
with the help of the #PF-set page table.
copying is done with early_ioremap just after memblock is set up.
2. Keep the fallback path working: numaq, ACPI, amd_numa and dummy.
initmem_init is separated into two stages:
early_initmem_init only extracts numa info early into numa_meminfo;
initmem_init keeps the SLIT and emulation handling.
3. Keep the rest of the old code flow untouched, like relocate_initrd
and initmem_init.
early_initmem_init takes the old init_mem_mapping position;
it calls early_x86_numa_init and init_mem_mapping for every node.
For 64bit, we avoid a size limit on the initrd, as relocate_initrd
still runs after init_mem_mapping for all memory.
4. The last patch puts page tables on the local node, so that memory
hotplug will be happy.

In short, early_initmem_init parses numa info early and calls
init_mem_mapping to set up page tables for every node's memory.

The patchset can be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm

and it is based on today's Linus tree.

-v2: Address tj's review and split patches into smaller ones.
-v3: Add some Acked-by from tj; also stop abusing cpio_data for the acpi files info.

Thanks

Yinghai

Yinghai Lu (22):
x86: Change get_ramdisk_image() to global
x86, microcode: Use common get_ramdisk_image()
x86, ACPI, mm: Kill max_low_pfn_mapped
x86, ACPI: Search buffer above 4G in second try for acpi override
tables
x86, ACPI: Increase override tables number limit
x86, ACPI: Split acpi_initrd_override to find/copy two functions
x86, ACPI: Store override acpi tables phys addr in cpio files info
array
x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode
x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c
x86, mm, numa: Move two functions calling on successful path later
x86, mm, numa: Call numa_meminfo_cover_memory() checking early
x86, mm, numa: Move node_map_pfn_alignment() to x86
x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment
x86, mm, numa: Set memblock nid later
x86, mm, numa: Move node_possible_map setting later
x86, mm, numa: Move emulation handling down.
x86, ACPI, numa, ia64: split SLIT handling out
x86, mm, numa: Add early_initmem_init() stub
x86, mm: Parse numa info early
x86, mm: Add comments for step_size shift
x86, mm: Make init_mem_mapping be able to be called several times
x86, mm, numa: Put pagetable on local node ram for 64bit

arch/ia64/kernel/setup.c | 4 +-
arch/x86/include/asm/acpi.h | 3 +-
arch/x86/include/asm/page_types.h | 2 +-
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/include/asm/setup.h | 9 ++
arch/x86/kernel/head64.c | 2 +
arch/x86/kernel/head_32.S | 4 +
arch/x86/kernel/microcode_intel_early.c | 8 +-
arch/x86/kernel/setup.c | 86 +++++++-----
arch/x86/mm/init.c | 109 ++++++++++-----
arch/x86/mm/numa.c | 240 +++++++++++++++++++++++++-------
arch/x86/mm/numa_emulation.c | 2 +-
arch/x86/mm/numa_internal.h | 2 +
arch/x86/mm/srat.c | 11 +-
drivers/acpi/numa.c | 13 +-
drivers/acpi/osl.c | 138 ++++++++++++------
include/linux/acpi.h | 20 +--
include/linux/mm.h | 3 -
mm/page_alloc.c | 52 +------
19 files changed, 467 insertions(+), 243 deletions(-)

--
1.8.1.4


2013-04-04 23:47:58

by Yinghai Lu

Subject: [PATCH v3 01/22] x86: Change get_ramdisk_image() to global

get_ramdisk_image() is needed for early microcode updating in another
file, so make it global.

Also make it take a boot_params pointer, as head_32.S needs to access
it via a phys address during 32bit flat mode.

Signed-off-by: Yinghai Lu <[email protected]>
Acked-by: Tejun Heo <[email protected]>
---
arch/x86/include/asm/setup.h | 3 +++
arch/x86/kernel/setup.c | 28 ++++++++++++++--------------
2 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index b7bf350..4f71d48 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -106,6 +106,9 @@ void *extend_brk(size_t size, size_t align);
RESERVE_BRK(name, sizeof(type) * entries)

extern void probe_roms(void);
+u64 get_ramdisk_image(struct boot_params *bp);
+u64 get_ramdisk_size(struct boot_params *bp);
+
#ifdef __i386__

void __init i386_start_kernel(void);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 90d8cc9..1629577 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -300,19 +300,19 @@ static void __init reserve_brk(void)

#ifdef CONFIG_BLK_DEV_INITRD

-static u64 __init get_ramdisk_image(void)
+u64 __init get_ramdisk_image(struct boot_params *bp)
{
- u64 ramdisk_image = boot_params.hdr.ramdisk_image;
+ u64 ramdisk_image = bp->hdr.ramdisk_image;

- ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32;
+ ramdisk_image |= (u64)bp->ext_ramdisk_image << 32;

return ramdisk_image;
}
-static u64 __init get_ramdisk_size(void)
+u64 __init get_ramdisk_size(struct boot_params *bp)
{
- u64 ramdisk_size = boot_params.hdr.ramdisk_size;
+ u64 ramdisk_size = bp->hdr.ramdisk_size;

- ramdisk_size |= (u64)boot_params.ext_ramdisk_size << 32;
+ ramdisk_size |= (u64)bp->ext_ramdisk_size << 32;

return ramdisk_size;
}
@@ -321,8 +321,8 @@ static u64 __init get_ramdisk_size(void)
static void __init relocate_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = get_ramdisk_image();
- u64 ramdisk_size = get_ramdisk_size();
+ u64 ramdisk_image = get_ramdisk_image(&boot_params);
+ u64 ramdisk_size = get_ramdisk_size(&boot_params);
u64 area_size = PAGE_ALIGN(ramdisk_size);
u64 ramdisk_here;
unsigned long slop, clen, mapaddr;
@@ -361,8 +361,8 @@ static void __init relocate_initrd(void)
ramdisk_size -= clen;
}

- ramdisk_image = get_ramdisk_image();
- ramdisk_size = get_ramdisk_size();
+ ramdisk_image = get_ramdisk_image(&boot_params);
+ ramdisk_size = get_ramdisk_size(&boot_params);
printk(KERN_INFO "Move RAMDISK from [mem %#010llx-%#010llx] to"
" [mem %#010llx-%#010llx]\n",
ramdisk_image, ramdisk_image + ramdisk_size - 1,
@@ -372,8 +372,8 @@ static void __init relocate_initrd(void)
static void __init early_reserve_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = get_ramdisk_image();
- u64 ramdisk_size = get_ramdisk_size();
+ u64 ramdisk_image = get_ramdisk_image(&boot_params);
+ u64 ramdisk_size = get_ramdisk_size(&boot_params);
u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);

if (!boot_params.hdr.type_of_loader ||
@@ -385,8 +385,8 @@ static void __init early_reserve_initrd(void)
static void __init reserve_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = get_ramdisk_image();
- u64 ramdisk_size = get_ramdisk_size();
+ u64 ramdisk_image = get_ramdisk_image(&boot_params);
+ u64 ramdisk_size = get_ramdisk_size(&boot_params);
u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);
u64 mapped_size;

--
1.8.1.4

2013-04-04 23:48:06

by Yinghai Lu

Subject: [PATCH v3 15/22] x86, mm, numa: Move node_possible_map setting later

Move node_possible_map handling out of numa_check_memblks() to avoid
side effects in numa_check_memblks().

Set it only once on the successful path, instead of resetting it in
numa_init() every time.

Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/numa.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index e2ddcbd..1d5fa08 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -539,12 +539,13 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)

static int __init numa_check_memblks(struct numa_meminfo *mi)
{
+ nodemask_t nodes_parsed;
unsigned long pfn_align;

/* Account for nodes with cpus and no memory */
- node_possible_map = numa_nodes_parsed;
- numa_nodemask_from_meminfo(&node_possible_map, mi);
- if (WARN_ON(nodes_empty(node_possible_map)))
+ nodes_parsed = numa_nodes_parsed;
+ numa_nodemask_from_meminfo(&nodes_parsed, mi);
+ if (WARN_ON(nodes_empty(nodes_parsed)))
return -EINVAL;

if (!numa_meminfo_cover_memory(mi))
@@ -596,7 +597,6 @@ static int __init numa_init(int (*init_func)(void))
set_apicid_to_node(i, NUMA_NO_NODE);

nodes_clear(numa_nodes_parsed);
- nodes_clear(node_possible_map);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
numa_reset_distance();

@@ -672,6 +672,9 @@ void __init x86_numa_init(void)

early_x86_numa_init();

+ node_possible_map = numa_nodes_parsed;
+ numa_nodemask_from_meminfo(&node_possible_map, mi);
+
for (i = 0; i < mi->nr_blks; i++) {
struct numa_memblk *mb = &mi->blk[i];
memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
--
1.8.1.4

2013-04-04 23:48:29

by Yinghai Lu

Subject: [PATCH v3 21/22] x86, mm: Make init_mem_mapping be able to be called several times

Prepare to put page tables on local nodes.

Move the call of init_mem_mapping into early_initmem_init.

Rework alloc_low_pages to allocate page tables in the following order:
BRK, local node, low range

Still only load_cr3 once, otherwise we would break Xen 64bit again.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
---
arch/x86/include/asm/pgtable.h | 2 +-
arch/x86/kernel/setup.c | 1 -
arch/x86/mm/init.c | 88 ++++++++++++++++++++++++++----------------
arch/x86/mm/numa.c | 24 ++++++++++++
4 files changed, 79 insertions(+), 36 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 1e67223..868687c 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -621,7 +621,7 @@ static inline int pgd_none(pgd_t pgd)
#ifndef __ASSEMBLY__

extern int direct_gbpages;
-void init_mem_mapping(void);
+void init_mem_mapping(unsigned long begin, unsigned long end);
void early_alloc_pgt_buf(void);

/* local pte updates need not use xchg for locking */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 6ef3fa2..67ef4bc 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1105,7 +1105,6 @@ void __init setup_arch(char **cmdline_p)
acpi_boot_table_init();
early_acpi_boot_init();
early_initmem_init();
- init_mem_mapping();
memblock.current_limit = get_max_mapped();
early_trap_pf_init();

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 2754e45..8a03283 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -24,7 +24,10 @@ static unsigned long __initdata pgt_buf_start;
static unsigned long __initdata pgt_buf_end;
static unsigned long __initdata pgt_buf_top;

-static unsigned long min_pfn_mapped;
+static unsigned long low_min_pfn_mapped;
+static unsigned long low_max_pfn_mapped;
+static unsigned long local_min_pfn_mapped;
+static unsigned long local_max_pfn_mapped;

static bool __initdata can_use_brk_pgt = true;

@@ -52,10 +55,17 @@ __ref void *alloc_low_pages(unsigned int num)

if ((pgt_buf_end + num) > pgt_buf_top || !can_use_brk_pgt) {
unsigned long ret;
- if (min_pfn_mapped >= max_pfn_mapped)
- panic("alloc_low_page: ran out of memory");
- ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
- max_pfn_mapped << PAGE_SHIFT,
+ if (local_min_pfn_mapped >= local_max_pfn_mapped) {
+ if (low_min_pfn_mapped >= low_max_pfn_mapped)
+ panic("alloc_low_page: ran out of memory");
+ ret = memblock_find_in_range(
+ low_min_pfn_mapped << PAGE_SHIFT,
+ low_max_pfn_mapped << PAGE_SHIFT,
+ PAGE_SIZE * num , PAGE_SIZE);
+ } else
+ ret = memblock_find_in_range(
+ local_min_pfn_mapped << PAGE_SHIFT,
+ local_max_pfn_mapped << PAGE_SHIFT,
PAGE_SIZE * num , PAGE_SIZE);
if (!ret)
panic("alloc_low_page: can not alloc memory");
@@ -402,60 +412,75 @@ static unsigned long __init get_new_step_size(unsigned long step_size)
return step_size;
}

-void __init init_mem_mapping(void)
+void __init init_mem_mapping(unsigned long begin, unsigned long end)
{
- unsigned long end, real_end, start, last_start;
+ unsigned long real_end, start, last_start;
unsigned long step_size;
unsigned long addr;
unsigned long mapped_ram_size = 0;
unsigned long new_mapped_ram_size;
+ bool is_low = false;
+
+ if (!begin) {
+ probe_page_size_mask();
+ /* the ISA range is always mapped regardless of memory holes */
+ init_memory_mapping(0, ISA_END_ADDRESS);
+ begin = ISA_END_ADDRESS;
+ is_low = true;
+ }

- probe_page_size_mask();
-
-#ifdef CONFIG_X86_64
- end = max_pfn << PAGE_SHIFT;
-#else
- end = max_low_pfn << PAGE_SHIFT;
-#endif
-
- /* the ISA range is always mapped regardless of memory holes */
- init_memory_mapping(0, ISA_END_ADDRESS);
+ if (begin >= end)
+ return;

/* xen has big range in reserved near end of ram, skip it at first.*/
- addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE, PMD_SIZE);
+ addr = memblock_find_in_range(begin, end, PMD_SIZE, PMD_SIZE);
real_end = addr + PMD_SIZE;

/* step_size need to be small so pgt_buf from BRK could cover it */
step_size = PMD_SIZE;
- max_pfn_mapped = 0; /* will get exact value next */
- min_pfn_mapped = real_end >> PAGE_SHIFT;
+ local_max_pfn_mapped = begin >> PAGE_SHIFT;
+ local_min_pfn_mapped = real_end >> PAGE_SHIFT;
last_start = start = real_end;
- while (last_start > ISA_END_ADDRESS) {
+ while (last_start > begin) {
if (last_start > step_size) {
start = round_down(last_start - 1, step_size);
- if (start < ISA_END_ADDRESS)
- start = ISA_END_ADDRESS;
+ if (start < begin)
+ start = begin;
} else
- start = ISA_END_ADDRESS;
+ start = begin;
new_mapped_ram_size = init_range_memory_mapping(start,
last_start);
+ if ((last_start >> PAGE_SHIFT) > local_max_pfn_mapped)
+ local_max_pfn_mapped = last_start >> PAGE_SHIFT;
+ local_min_pfn_mapped = start >> PAGE_SHIFT;
last_start = start;
- min_pfn_mapped = last_start >> PAGE_SHIFT;
/* only increase step_size after big range get mapped */
if (new_mapped_ram_size > mapped_ram_size)
step_size = get_new_step_size(step_size);
mapped_ram_size += new_mapped_ram_size;
}

- if (real_end < end)
+ if (real_end < end) {
init_range_memory_mapping(real_end, end);
+ if ((end >> PAGE_SHIFT) > local_max_pfn_mapped)
+ local_max_pfn_mapped = end >> PAGE_SHIFT;
+ }

+ if (is_low) {
+ low_min_pfn_mapped = local_min_pfn_mapped;
+ low_max_pfn_mapped = local_max_pfn_mapped;
+ }
+}
+
+#ifndef CONFIG_NUMA
+void __init early_initmem_init(void)
+{
#ifdef CONFIG_X86_64
- if (max_pfn > max_low_pfn) {
- /* can we preseve max_low_pfn ?*/
+ init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+ if (max_pfn > max_low_pfn)
max_low_pfn = max_pfn;
- }
#else
+ init_mem_mapping(0, max_low_pfn << PAGE_SHIFT);
early_ioremap_page_table_range_init();
#endif

@@ -464,11 +489,6 @@ void __init init_mem_mapping(void)

early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
}
-
-#ifndef CONFIG_NUMA
-void __init early_initmem_init(void)
-{
-}
#endif

/*
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index c2d4653..d3eb0c9 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -17,8 +17,10 @@
#include <asm/dma.h>
#include <asm/acpi.h>
#include <asm/amd_nb.h>
+#include <asm/tlbflush.h>

#include "numa_internal.h"
+#include "mm_internal.h"

int __initdata numa_off;
nodemask_t numa_nodes_parsed __initdata;
@@ -668,9 +670,31 @@ static void __init early_x86_numa_init(void)
numa_init(dummy_numa_init);
}

+#ifdef CONFIG_X86_64
+static void __init early_x86_numa_init_mapping(void)
+{
+ init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+ if (max_pfn > max_low_pfn)
+ max_low_pfn = max_pfn;
+}
+#else
+static void __init early_x86_numa_init_mapping(void)
+{
+ init_mem_mapping(0, max_low_pfn << PAGE_SHIFT);
+ early_ioremap_page_table_range_init();
+}
+#endif
+
void __init early_initmem_init(void)
{
early_x86_numa_init();
+
+ early_x86_numa_init_mapping();
+
+ load_cr3(swapper_pg_dir);
+ __flush_tlb_all();
+
+ early_memtest(0, max_pfn_mapped<<PAGE_SHIFT);
}

void __init x86_numa_init(void)
--
1.8.1.4

2013-04-04 23:48:26

by Yinghai Lu

Subject: [PATCH v3 22/22] x86, mm, numa: Put pagetable on local node ram for 64bit

If a node with ram is hotpluggable, the local node memory for page
tables and vmemmap should be on that node's ram.

This patch is a kind of refresh of
| commit 1411e0ec3123ae4c4ead6bfc9fe3ee5a3ae5c327
| Date: Mon Dec 27 16:48:17 2010 -0800
|
| x86-64, numa: Put pgtable to local node memory
which was reverted before.

We have reason to reintroduce it to make memory hotplug work.

init_mem_mapping is called in early_initmem_init for every node.
alloc_low_pages allocates page tables in the following order:
BRK, local node, low range
So page tables end up in the low range or on local nodes.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Konrad Rzeszutek Wilk <[email protected]>
---
arch/x86/mm/numa.c | 34 +++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index d3eb0c9..11acdf6 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -673,7 +673,39 @@ static void __init early_x86_numa_init(void)
#ifdef CONFIG_X86_64
static void __init early_x86_numa_init_mapping(void)
{
- init_mem_mapping(0, max_pfn << PAGE_SHIFT);
+ unsigned long last_start = 0, last_end = 0;
+ struct numa_meminfo *mi = &numa_meminfo;
+ unsigned long start, end;
+ int last_nid = -1;
+ int i, nid;
+
+ for (i = 0; i < mi->nr_blks; i++) {
+ nid = mi->blk[i].nid;
+ start = mi->blk[i].start;
+ end = mi->blk[i].end;
+
+ if (last_nid == nid) {
+ last_end = end;
+ continue;
+ }
+
+ /* other nid now */
+ if (last_nid >= 0) {
+ printk(KERN_DEBUG "Node %d: [mem %#016lx-%#016lx]\n",
+ last_nid, last_start, last_end - 1);
+ init_mem_mapping(last_start, last_end);
+ }
+
+ /* for next nid */
+ last_nid = nid;
+ last_start = start;
+ last_end = end;
+ }
+ /* last one */
+ printk(KERN_DEBUG "Node %d: [mem %#016lx-%#016lx]\n",
+ last_nid, last_start, last_end - 1);
+ init_mem_mapping(last_start, last_end);
+
if (max_pfn > max_low_pfn)
max_low_pfn = max_pfn;
}
--
1.8.1.4

2013-04-04 23:48:24

by Yinghai Lu

Subject: [PATCH v3 19/22] x86, mm: Parse numa info early

Parsing of numa info has now been separated into two functions.

early_initmem_init() only parses info into numa_meminfo and
numa_nodes_parsed, and still keeps the numaq, acpi_numa, amd_numa,
dummy fallback sequence working.

SLIT and numa emulation handling are still left in initmem_init().

Call early_initmem_init() before init_mem_mapping() to prepare for
using numa info with it.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
---
arch/x86/kernel/setup.c | 24 ++++++++++--------------
1 file changed, 10 insertions(+), 14 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d40e16e..6ef3fa2 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1098,13 +1098,21 @@ void __init setup_arch(char **cmdline_p)
trim_platform_memory_ranges();
trim_low_memory_range();

+ /*
+ * Parse the ACPI tables for possible boot-time SMP configuration.
+ */
+ acpi_initrd_override_copy();
+ acpi_boot_table_init();
+ early_acpi_boot_init();
+ early_initmem_init();
init_mem_mapping();
-
+ memblock.current_limit = get_max_mapped();
early_trap_pf_init();

+ reserve_initrd();
+
setup_real_mode();

- memblock.current_limit = get_max_mapped();
dma_contiguous_reserve(0);

/*
@@ -1118,24 +1126,12 @@ void __init setup_arch(char **cmdline_p)
/* Allocate bigger log buffer */
setup_log_buf(1);

- acpi_initrd_override_copy();
-
- reserve_initrd();
-
reserve_crashkernel();

vsmp_init();

io_delay_init();

- /*
- * Parse the ACPI tables for possible boot-time SMP configuration.
- */
- acpi_boot_table_init();
-
- early_acpi_boot_init();
-
- early_initmem_init();
initmem_init();
memblock_find_dma_reserve();

--
1.8.1.4

2013-04-04 23:48:04

by Yinghai Lu

Subject: [PATCH v3 03/22] x86, ACPI, mm: Kill max_low_pfn_mapped

Now that we have the arch pfn_mapped array, max_low_pfn_mapped should
not be used anymore.

Users should consult the pfn_mapped array or just use
1UL<<(32-PAGE_SHIFT) instead.

The only remaining user is ACPI_INITRD_TABLE_OVERRIDE, and it should
not use max_low_pfn_mapped anyway, as the later accesses go through
early_ioremap(). We can change it to search under 4G instead.

-v2: Leave max_low_pfn_mapped alone in the i915 code, per tj.

Suggested-by: H. Peter Anvin <[email protected]>
Signed-off-by: Yinghai Lu <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: [email protected]
---
arch/x86/include/asm/page_types.h | 1 -
arch/x86/kernel/setup.c | 4 +---
arch/x86/mm/init.c | 4 ----
drivers/acpi/osl.c | 6 +++---
4 files changed, 4 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index 54c9787..b012b82 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -43,7 +43,6 @@

extern int devmem_is_allowed(unsigned long pagenr);

-extern unsigned long max_low_pfn_mapped;
extern unsigned long max_pfn_mapped;

static inline phys_addr_t get_max_mapped(void)
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 1629577..e75c6e6 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -113,13 +113,11 @@
#include <asm/prom.h>

/*
- * max_low_pfn_mapped: highest direct mapped pfn under 4GB
- * max_pfn_mapped: highest direct mapped pfn over 4GB
+ * max_pfn_mapped: highest direct mapped pfn
*
* The direct mapping only covers E820_RAM regions, so the ranges and gaps are
* represented by pfn_mapped
*/
-unsigned long max_low_pfn_mapped;
unsigned long max_pfn_mapped;

#ifdef CONFIG_DMI
diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 59b7fc4..abcc241 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -313,10 +313,6 @@ static void add_pfn_range_mapped(unsigned long start_pfn, unsigned long end_pfn)
nr_pfn_mapped = clean_sort_range(pfn_mapped, E820_X_MAX);

max_pfn_mapped = max(max_pfn_mapped, end_pfn);
-
- if (start_pfn < (1UL<<(32-PAGE_SHIFT)))
- max_low_pfn_mapped = max(max_low_pfn_mapped,
- min(end_pfn, 1UL<<(32-PAGE_SHIFT)));
}

bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn)
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 586e7e9..313d14d 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -624,9 +624,9 @@ void __init acpi_initrd_override(void *data, size_t size)
if (table_nr == 0)
return;

- acpi_tables_addr =
- memblock_find_in_range(0, max_low_pfn_mapped << PAGE_SHIFT,
- all_tables_size, PAGE_SIZE);
+ /* under 4G at first, then above 4G */
+ acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
+ all_tables_size, PAGE_SIZE);
if (!acpi_tables_addr) {
WARN_ON(1);
return;
--
1.8.1.4

2013-04-04 23:49:12

by Yinghai Lu

Subject: [PATCH v3 17/22] x86, ACPI, numa, ia64: split SLIT handling out

We need to handle the SLIT later, as it needs to allocate a buffer for
the distance matrix. Also, we do not need SLIT info before
init_mem_mapping().

So move SLIT parsing later.

x86_acpi_numa_init becomes x86_acpi_numa_init_srat/x86_acpi_numa_init_slit.

This should not break ia64: acpi_numa_init is replaced with
acpi_numa_init_srat/acpi_numa_init_slit/acpi_numa_arch_fixup.

-v2: Change names to acpi_numa_init_srat/acpi_numa_init_slit per tj.
Remove the numa_reset_distance() call in numa_init(), as we now only
set distances in the SLIT handling.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Cc: Tony Luck <[email protected]>
Cc: Fenghua Yu <[email protected]>
Cc: [email protected]
---
arch/ia64/kernel/setup.c | 4 +++-
arch/x86/include/asm/acpi.h | 3 ++-
arch/x86/mm/numa.c | 14 ++++++++++++--
arch/x86/mm/srat.c | 11 +++++++----
drivers/acpi/numa.c | 13 +++++++------
include/linux/acpi.h | 3 ++-
6 files changed, 33 insertions(+), 15 deletions(-)

diff --git a/arch/ia64/kernel/setup.c b/arch/ia64/kernel/setup.c
index 2029cc0..6a2efb5 100644
--- a/arch/ia64/kernel/setup.c
+++ b/arch/ia64/kernel/setup.c
@@ -558,7 +558,9 @@ setup_arch (char **cmdline_p)
acpi_table_init();
early_acpi_boot_init();
# ifdef CONFIG_ACPI_NUMA
- acpi_numa_init();
+ acpi_numa_init_srat();
+ acpi_numa_init_slit();
+ acpi_numa_arch_fixup();
# ifdef CONFIG_ACPI_HOTPLUG_CPU
prefill_possible_map();
# endif
diff --git a/arch/x86/include/asm/acpi.h b/arch/x86/include/asm/acpi.h
index b31bf97..651db0b 100644
--- a/arch/x86/include/asm/acpi.h
+++ b/arch/x86/include/asm/acpi.h
@@ -178,7 +178,8 @@ static inline void disable_acpi(void) { }

#ifdef CONFIG_ACPI_NUMA
extern int acpi_numa;
-extern int x86_acpi_numa_init(void);
+int x86_acpi_numa_init_srat(void);
+void x86_acpi_numa_init_slit(void);
#endif /* CONFIG_ACPI_NUMA */

#define acpi_unlazy_tlb(x) leave_mm(x)
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 90fd123..182e085 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -598,7 +598,6 @@ static int __init numa_init(int (*init_func)(void))

nodes_clear(numa_nodes_parsed);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
- numa_reset_distance();

ret = init_func();
if (ret < 0)
@@ -636,6 +635,10 @@ static int __init dummy_numa_init(void)
return 0;
}

+#ifdef CONFIG_ACPI_NUMA
+static bool srat_used __initdata;
+#endif
+
/**
* x86_numa_init - Initialize NUMA
*
@@ -651,8 +654,10 @@ static void __init early_x86_numa_init(void)
return;
#endif
#ifdef CONFIG_ACPI_NUMA
- if (!numa_init(x86_acpi_numa_init))
+ if (!numa_init(x86_acpi_numa_init_srat)) {
+ srat_used = true;
return;
+ }
#endif
#ifdef CONFIG_AMD_NUMA
if (!numa_init(amd_numa_init))
@@ -670,6 +675,11 @@ void __init x86_numa_init(void)

early_x86_numa_init();

+#ifdef CONFIG_ACPI_NUMA
+ if (srat_used)
+ x86_acpi_numa_init_slit();
+#endif
+
numa_emulation(&numa_meminfo, numa_distance_cnt);

node_possible_map = numa_nodes_parsed;
diff --git a/arch/x86/mm/srat.c b/arch/x86/mm/srat.c
index cdd0da9..443f9ef 100644
--- a/arch/x86/mm/srat.c
+++ b/arch/x86/mm/srat.c
@@ -185,14 +185,17 @@ out_err:
return -1;
}

-void __init acpi_numa_arch_fixup(void) {}
-
-int __init x86_acpi_numa_init(void)
+int __init x86_acpi_numa_init_srat(void)
{
int ret;

- ret = acpi_numa_init();
+ ret = acpi_numa_init_srat();
if (ret < 0)
return ret;
return srat_disabled() ? -EINVAL : 0;
}
+
+void __init x86_acpi_numa_init_slit(void)
+{
+ acpi_numa_init_slit();
+}
diff --git a/drivers/acpi/numa.c b/drivers/acpi/numa.c
index 33e609f..6460db4 100644
--- a/drivers/acpi/numa.c
+++ b/drivers/acpi/numa.c
@@ -282,7 +282,7 @@ acpi_table_parse_srat(enum acpi_srat_type id,
handler, max_entries);
}

-int __init acpi_numa_init(void)
+int __init acpi_numa_init_srat(void)
{
int cnt = 0;

@@ -303,11 +303,6 @@ int __init acpi_numa_init(void)
NR_NODE_MEMBLKS);
}

- /* SLIT: System Locality Information Table */
- acpi_table_parse(ACPI_SIG_SLIT, acpi_parse_slit);
-
- acpi_numa_arch_fixup();
-
if (cnt < 0)
return cnt;
else if (!parsed_numa_memblks)
@@ -315,6 +310,12 @@ int __init acpi_numa_init(void)
return 0;
}

+void __init acpi_numa_init_slit(void)
+{
+ /* SLIT: System Locality Information Table */
+ acpi_table_parse(ACPI_SIG_SLIT, acpi_parse_slit);
+}
+
int acpi_get_pxm(acpi_handle h)
{
unsigned long long pxm;
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 4b943e6..4a78235 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -85,7 +85,8 @@ int early_acpi_boot_init(void);
int acpi_boot_init (void);
void acpi_boot_table_init (void);
int acpi_mps_check (void);
-int acpi_numa_init (void);
+int acpi_numa_init_srat(void);
+void acpi_numa_init_slit(void);

int acpi_table_init (void);
int acpi_table_parse(char *id, acpi_tbl_table_handler handler);
--
1.8.1.4

2013-04-04 23:49:11

by Yinghai Lu

Subject: [PATCH v3 08/22] x86, ACPI: Make acpi_initrd_override_find work with 32bit flat mode

For the finding step on 32bit, it is easiest to access the initrd in
32bit flat mode, as we don't need to set up page tables.

That path starts from head_32.S, and microcode updating already uses
this trick.

acpi_initrd_override_find needs to use phys addresses to access global
variables.

Pass is_phys into the function, as on 32bit we cannot tell from the
address alone whether it is a phys or virtual address: the boot loader
could load the initrd above max_low_pfn.

Don't call printk, as it uses global variables; delay printing until
the copy stage.

Change table_sigs to live on the stack instead; otherwise it is too
messy to convert the string array to phys addresses while keeping the
offset calculation correct. Its size is about 36x4 bytes, small enough
for the stack.

Also remove "continue" from the macro to make the code more readable.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
---
arch/x86/kernel/setup.c | 2 +-
drivers/acpi/osl.c | 85 ++++++++++++++++++++++++++++++++++---------------
include/linux/acpi.h | 5 +--
3 files changed, 63 insertions(+), 29 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index d0cc176..16a703f 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1093,7 +1093,7 @@ void __init setup_arch(char **cmdline_p)
reserve_initrd();

acpi_initrd_override_find((void *)initrd_start,
- initrd_end - initrd_start);
+ initrd_end - initrd_start, false);
acpi_initrd_override_copy();

reserve_crashkernel();
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index ee5c531..cce92a5 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -551,21 +551,9 @@ u8 __init acpi_table_checksum(u8 *buffer, u32 length)
return sum;
}

-/* All but ACPI_SIG_RSDP and ACPI_SIG_FACS: */
-static const char * const table_sigs[] = {
- ACPI_SIG_BERT, ACPI_SIG_CPEP, ACPI_SIG_ECDT, ACPI_SIG_EINJ,
- ACPI_SIG_ERST, ACPI_SIG_HEST, ACPI_SIG_MADT, ACPI_SIG_MSCT,
- ACPI_SIG_SBST, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_ASF,
- ACPI_SIG_BOOT, ACPI_SIG_DBGP, ACPI_SIG_DMAR, ACPI_SIG_HPET,
- ACPI_SIG_IBFT, ACPI_SIG_IVRS, ACPI_SIG_MCFG, ACPI_SIG_MCHI,
- ACPI_SIG_SLIC, ACPI_SIG_SPCR, ACPI_SIG_SPMI, ACPI_SIG_TCPA,
- ACPI_SIG_UEFI, ACPI_SIG_WAET, ACPI_SIG_WDAT, ACPI_SIG_WDDT,
- ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT,
- ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };
-
/* Non-fatal errors: Affected tables/files are ignored */
#define INVALID_TABLE(x, path, name) \
- { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); continue; }
+ do { pr_err("ACPI OVERRIDE: " x " [%s%s]\n", path, name); } while (0)

#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)

@@ -576,17 +564,45 @@ struct file_pos {
};
static struct file_pos __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];

-void __init acpi_initrd_override_find(void *data, size_t size)
+/*
+ * acpi_initrd_override_find() is called from head_32.S and head64.c.
+ * head_32.S calling path is with 32bit flat mode, so we can access
+ * initrd early without setting pagetable or relocating initrd. For
+ * global variables accessing, we need to use phys address instead of
+ * kernel virtual address, try to put table_sigs string array in stack,
+ * so avoid switching for it.
+ * Also don't call printk as it uses global variables.
+ */
+void __init acpi_initrd_override_find(void *data, size_t size, bool is_phys)
{
int sig, no, table_nr = 0;
long offset = 0;
struct acpi_table_header *table;
char cpio_path[32] = "kernel/firmware/acpi/";
struct cpio_data file;
+ struct file_pos *files = acpi_initrd_files;
+ int *all_tables_size_p = &all_tables_size;
+
+ /* All but ACPI_SIG_RSDP and ACPI_SIG_FACS: */
+ char *table_sigs[] = {
+ ACPI_SIG_BERT, ACPI_SIG_CPEP, ACPI_SIG_ECDT, ACPI_SIG_EINJ,
+ ACPI_SIG_ERST, ACPI_SIG_HEST, ACPI_SIG_MADT, ACPI_SIG_MSCT,
+ ACPI_SIG_SBST, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_ASF,
+ ACPI_SIG_BOOT, ACPI_SIG_DBGP, ACPI_SIG_DMAR, ACPI_SIG_HPET,
+ ACPI_SIG_IBFT, ACPI_SIG_IVRS, ACPI_SIG_MCFG, ACPI_SIG_MCHI,
+ ACPI_SIG_SLIC, ACPI_SIG_SPCR, ACPI_SIG_SPMI, ACPI_SIG_TCPA,
+ ACPI_SIG_UEFI, ACPI_SIG_WAET, ACPI_SIG_WDAT, ACPI_SIG_WDDT,
+ ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT,
+ ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };

if (data == NULL || size == 0)
return;

+ if (is_phys) {
+ files = (struct file_pos *)__pa_symbol(acpi_initrd_files);
+ all_tables_size_p = (int *)__pa_symbol(&all_tables_size);
+ }
+
for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
file = find_cpio_data(cpio_path, data, size, &offset);
if (!file.data)
@@ -595,9 +611,12 @@ void __init acpi_initrd_override_find(void *data, size_t size)
data += offset;
size -= offset;

- if (file.size < sizeof(struct acpi_table_header))
- INVALID_TABLE("Table smaller than ACPI header",
+ if (file.size < sizeof(struct acpi_table_header)) {
+ if (!is_phys)
+ INVALID_TABLE("Table smaller than ACPI header",
cpio_path, file.name);
+ continue;
+ }

table = file.data;

@@ -605,22 +624,33 @@ void __init acpi_initrd_override_find(void *data, size_t size)
if (!memcmp(table->signature, table_sigs[sig], 4))
break;

- if (!table_sigs[sig])
- INVALID_TABLE("Unknown signature",
+ if (!table_sigs[sig]) {
+ if (!is_phys)
+ INVALID_TABLE("Unknown signature",
cpio_path, file.name);
- if (file.size != table->length)
- INVALID_TABLE("File length does not match table length",
+ continue;
+ }
+ if (file.size != table->length) {
+ if (!is_phys)
+ INVALID_TABLE("File length does not match table length",
cpio_path, file.name);
- if (acpi_table_checksum(file.data, table->length))
- INVALID_TABLE("Bad table checksum",
+ continue;
+ }
+ if (acpi_table_checksum(file.data, table->length)) {
+ if (!is_phys)
+ INVALID_TABLE("Bad table checksum",
cpio_path, file.name);
+ continue;
+ }

- pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
+ if (!is_phys)
+ pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
table->signature, cpio_path, file.name, table->length);

- all_tables_size += table->length;
- acpi_initrd_files[table_nr].data = __pa_nodebug(file.data);
- acpi_initrd_files[table_nr].size = file.size;
+ (*all_tables_size_p) += table->length;
+ files[table_nr].data = is_phys ? (phys_addr_t)file.data :
+ __pa_nodebug(file.data);
+ files[table_nr].size = file.size;
table_nr++;
}
}
@@ -670,6 +700,9 @@ void __init acpi_initrd_override_copy(void)
break;
q = early_ioremap(addr, size);
p = early_ioremap(acpi_tables_addr + total_offset, size);
+ pr_info("%4.4s ACPI table found in initrd [%#010llx-%#010llx]\n",
+ ((struct acpi_table_header *)q)->signature,
+ (u64)addr, (u64)(addr + size - 1));
memcpy(p, q, size);
early_iounmap(q, size);
early_iounmap(p, size);
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 1654a241..4b943e6 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -478,10 +478,11 @@ static inline bool acpi_driver_match_device(struct device *dev,
#endif /* !CONFIG_ACPI */

#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
-void acpi_initrd_override_find(void *data, size_t size);
+void acpi_initrd_override_find(void *data, size_t size, bool is_phys);
void acpi_initrd_override_copy(void);
#else
-static inline void acpi_initrd_override_find(void *data, size_t size) { }
+static inline void acpi_initrd_override_find(void *data, size_t size,
+ bool is_phys) { }
static inline void acpi_initrd_override_copy(void) { }
#endif

--
1.8.1.4

2013-04-04 23:49:46

by Yinghai Lu

Subject: [PATCH v3 18/22] x86, mm, numa: Add early_initmem_init() stub

early_initmem_init() calls early_x86_numa_init() to parse NUMA info early.

A later patch will make it call init_mem_mapping() for each node.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
---
arch/x86/include/asm/page_types.h | 1 +
arch/x86/kernel/setup.c | 1 +
arch/x86/mm/init.c | 6 ++++++
arch/x86/mm/numa.c | 7 +++++--
4 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/page_types.h b/arch/x86/include/asm/page_types.h
index b012b82..d04dd8c 100644
--- a/arch/x86/include/asm/page_types.h
+++ b/arch/x86/include/asm/page_types.h
@@ -55,6 +55,7 @@ bool pfn_range_is_mapped(unsigned long start_pfn, unsigned long end_pfn);
extern unsigned long init_memory_mapping(unsigned long start,
unsigned long end);

+void early_initmem_init(void);
extern void initmem_init(void);

#endif /* !__ASSEMBLY__ */
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 2d29bc0..d40e16e 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1135,6 +1135,7 @@ void __init setup_arch(char **cmdline_p)

early_acpi_boot_init();

+ early_initmem_init();
initmem_init();
memblock_find_dma_reserve();

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index abcc241..28b294f 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -450,6 +450,12 @@ void __init init_mem_mapping(void)
early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
}

+#ifndef CONFIG_NUMA
+void __init early_initmem_init(void)
+{
+}
+#endif
+
/*
* devmem_is_allowed() checks to see if /dev/mem access to a certain address
* is valid. The argument is a physical page number.
diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 182e085..c2d4653 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -668,13 +668,16 @@ static void __init early_x86_numa_init(void)
numa_init(dummy_numa_init);
}

+void __init early_initmem_init(void)
+{
+ early_x86_numa_init();
+}
+
void __init x86_numa_init(void)
{
int i, nid;
struct numa_meminfo *mi = &numa_meminfo;

- early_x86_numa_init();
-
#ifdef CONFIG_ACPI_NUMA
if (srat_used)
x86_acpi_numa_init_slit();
--
1.8.1.4

2013-04-04 23:49:59

by Yinghai Lu

Subject: [PATCH v3 13/22] x86, mm, numa: Use numa_meminfo to check node_map_pfn alignment

We can use numa_meminfo directly instead of the memblock nid.

That lets us move the memblock_set_node() calls down and do them only
once, on the successful path.

-v2: Per tj, split the move out into a separate patch.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/numa.c | 30 +++++++++++++++++++-----------
1 file changed, 19 insertions(+), 11 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 24155b2..fcaeba9 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -496,14 +496,18 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
* Returns the determined alignment in pfn's. 0 if there is no alignment
* requirement (single node).
*/
-unsigned long __init node_map_pfn_alignment(void)
+#ifdef NODE_NOT_IN_PAGE_FLAGS
+static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
{
unsigned long accl_mask = 0, last_end = 0;
unsigned long start, end, mask;
int last_nid = -1;
int i, nid;

- for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
+ for (i = 0; i < mi->nr_blks; i++) {
+ start = mi->blk[i].start >> PAGE_SHIFT;
+ end = mi->blk[i].end >> PAGE_SHIFT;
+ nid = mi->blk[i].nid;
if (!start || last_nid < 0 || last_nid == nid) {
last_nid = nid;
last_end = end;
@@ -526,10 +530,16 @@ unsigned long __init node_map_pfn_alignment(void)
/* convert mask to number of pages */
return ~accl_mask + 1;
}
+#else
+static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
+{
+ return 0;
+}
+#endif

static int __init numa_register_memblks(struct numa_meminfo *mi)
{
- unsigned long uninitialized_var(pfn_align);
+ unsigned long pfn_align;
int i;

/* Account for nodes with cpus and no memory */
@@ -541,24 +551,22 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
if (!numa_meminfo_cover_memory(mi))
return -EINVAL;

- for (i = 0; i < mi->nr_blks; i++) {
- struct numa_memblk *mb = &mi->blk[i];
- memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
- }
-
/*
* If sections array is gonna be used for pfn -> nid mapping, check
* whether its granularity is fine enough.
*/
-#ifdef NODE_NOT_IN_PAGE_FLAGS
- pfn_align = node_map_pfn_alignment();
+ pfn_align = node_map_pfn_alignment(mi);
if (pfn_align && pfn_align < PAGES_PER_SECTION) {
printk(KERN_WARNING "Node alignment %LuMB < min %LuMB, rejecting NUMA config\n",
PFN_PHYS(pfn_align) >> 20,
PFN_PHYS(PAGES_PER_SECTION) >> 20);
return -EINVAL;
}
-#endif
+
+ for (i = 0; i < mi->nr_blks; i++) {
+ struct numa_memblk *mb = &mi->blk[i];
+ memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
+ }

return 0;
}
--
1.8.1.4

2013-04-04 23:48:01

by Yinghai Lu

Subject: [PATCH v3 20/22] x86, mm: Add comments for step_size shift

As requested by hpa, add a comment explaining why 5 was chosen for the
step size shift.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/init.c | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 28b294f..2754e45 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -385,8 +385,23 @@ static unsigned long __init init_range_memory_mapping(
return mapped_ram_size;
}

-/* (PUD_SHIFT-PMD_SHIFT)/2 */
-#define STEP_SIZE_SHIFT 5
+static unsigned long __init get_new_step_size(unsigned long step_size)
+{
+ /*
+ * The initial mapped size is PMD_SIZE, aka 2M. We cannot set
+ * step_size to PUD_SIZE (1G) yet: in the worst case, when a 1G
+ * range crosses a 1G boundary and PG_LEVEL_2M is not set, we
+ * need 1 + 1 + 512 pages (aka 2M + 8k) of page tables to map
+ * that 1G range with PTEs. Use 5 as the shift for now.
+ */
+ unsigned long new_step_size = step_size << 5;
+
+ if (new_step_size > step_size)
+ step_size = new_step_size;
+
+ return step_size;
+}
+
void __init init_mem_mapping(void)
{
unsigned long end, real_end, start, last_start;
@@ -428,7 +443,7 @@ void __init init_mem_mapping(void)
min_pfn_mapped = last_start >> PAGE_SHIFT;
/* only increase step_size after big range get mapped */
if (new_mapped_ram_size > mapped_ram_size)
- step_size <<= STEP_SIZE_SHIFT;
+ step_size = get_new_step_size(step_size);
mapped_ram_size += new_mapped_ram_size;
}

--
1.8.1.4

2013-04-04 23:50:44

by Yinghai Lu

Subject: [PATCH v3 05/22] x86, ACPI: Increase override tables number limit

The number of ACPI tables overridable from the initrd is currently
limited to 10, which is too small. 64 should be enough, as we have 35
signatures and may carry several SSDTs.

Two problems in the current code prevent us from increasing the limit:
1. The cpio file info array is on the stack; at 32 bytes per element,
   growing it to 64 entries could overflow the stack. Move it off the
   stack and make it a global in the __initdata section.
2. early_ioremap() can only remap 256k at a time. The current code maps
   all 10 tables with a single mapping; with a larger limit the total
   size could exceed 256k and early_ioremap() would fail. Map and copy
   the tables one by one instead of all at once.

-v2: According to tj, split it out to separated patch, also
rename array name to acpi_initrd_files.
-v3: Add some comments about mapping table one by one during copying
per tj.

Signed-off-by: Yinghai <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Acked-by: Tejun Heo <[email protected]>
---
drivers/acpi/osl.c | 26 +++++++++++++++-----------
1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index c08cdb6..a5a9346 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -569,8 +569,8 @@ static const char * const table_sigs[] = {

#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)

-/* Must not increase 10 or needs code modification below */
-#define ACPI_OVERRIDE_TABLES 10
+#define ACPI_OVERRIDE_TABLES 64
+static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];

void __init acpi_initrd_override(void *data, size_t size)
{
@@ -579,7 +579,6 @@ void __init acpi_initrd_override(void *data, size_t size)
struct acpi_table_header *table;
char cpio_path[32] = "kernel/firmware/acpi/";
struct cpio_data file;
- struct cpio_data early_initrd_files[ACPI_OVERRIDE_TABLES];
char *p;

if (data == NULL || size == 0)
@@ -617,8 +616,8 @@ void __init acpi_initrd_override(void *data, size_t size)
table->signature, cpio_path, file.name, table->length);

all_tables_size += table->length;
- early_initrd_files[table_nr].data = file.data;
- early_initrd_files[table_nr].size = file.size;
+ acpi_initrd_files[table_nr].data = file.data;
+ acpi_initrd_files[table_nr].size = file.size;
table_nr++;
}
if (table_nr == 0)
@@ -648,14 +647,19 @@ void __init acpi_initrd_override(void *data, size_t size)
memblock_reserve(acpi_tables_addr, acpi_tables_addr + all_tables_size);
arch_reserve_mem_area(acpi_tables_addr, all_tables_size);

- p = early_ioremap(acpi_tables_addr, all_tables_size);
-
+ /*
+ * early_ioremap only can remap 256k one time. If we map all
+ * tables one time, we will hit the limit. Need to map table
+ * one by one during copying.
+ */
for (no = 0; no < table_nr; no++) {
- memcpy(p + total_offset, early_initrd_files[no].data,
- early_initrd_files[no].size);
- total_offset += early_initrd_files[no].size;
+ phys_addr_t size = acpi_initrd_files[no].size;
+
+ p = early_ioremap(acpi_tables_addr + total_offset, size);
+ memcpy(p, acpi_initrd_files[no].data, size);
+ early_iounmap(p, size);
+ total_offset += size;
}
- early_iounmap(p, all_tables_size);
}
#endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */

--
1.8.1.4

2013-04-04 23:51:05

by Yinghai Lu

Subject: [PATCH v3 06/22] x86, ACPI: Split acpi_initrd_override to find/copy two functions

To parse SRAT early, we need to move ACPI table probing earlier.
acpi_initrd_table_override happens before table probing, so it needs
to move earlier too.

Currently acpi_initrd_table_override runs after init_mem_mapping() and
relocate_initrd(), so it can scan the initrd and copy ACPI tables
through the initrd's kernel virtual address.
Copying must happen after memblock is ready, because it needs to
allocate a buffer for the new ACPI tables.

So split the function into two: find and copy. Finding should happen
as early as possible; copying after memblock is ready.

Finding can be done in head_32.S and head64.c, just like the early
microcode scanning. head_32.S runs in 32-bit flat mode, so we don't
need page tables to access the initrd. In head64.c, the page table set
up by the #PF handler lets us access the initrd through the kernel low
mapping.

Copying can be done right after memblock is ready and before the ACPI
tables are probed; we need early_ioremap() to access the source and
target ranges, as init_mem_mapping() has not been called yet.

While a dummy version of acpi_initrd_override() was defined for
!CONFIG_ACPI_INITRD_TABLE_OVERRIDE, the prototype and dummy version
were conditionalized inside CONFIG_ACPI. This forced setup_arch() to
carry its own #ifdefs around acpi_initrd_override(), as otherwise the
build would fail when !CONFIG_ACPI. Move the prototypes and dummy
implementations of the newly split functions below the CONFIG_ACPI
block in acpi.h so that we can do away with the #ifdefs in its user.

-v2: Split this patch out per tj.
Also don't pass table_nr around.
-v3: Add tj's changelog about moving the declarations below the #ifdef
block in acpi.h to avoid #ifdefs in setup.c.

Signed-off-by: Yinghai <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
Acked-by: Tejun Heo <[email protected]>
---
arch/x86/kernel/setup.c | 6 +++---
drivers/acpi/osl.c | 18 +++++++++++++-----
include/linux/acpi.h | 16 ++++++++--------
3 files changed, 24 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index e75c6e6..d0cc176 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1092,9 +1092,9 @@ void __init setup_arch(char **cmdline_p)

reserve_initrd();

-#if defined(CONFIG_ACPI) && defined(CONFIG_BLK_DEV_INITRD)
- acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start);
-#endif
+ acpi_initrd_override_find((void *)initrd_start,
+ initrd_end - initrd_start);
+ acpi_initrd_override_copy();

reserve_crashkernel();

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index a5a9346..21714fb 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -572,14 +572,13 @@ static const char * const table_sigs[] = {
#define ACPI_OVERRIDE_TABLES 64
static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];

-void __init acpi_initrd_override(void *data, size_t size)
+void __init acpi_initrd_override_find(void *data, size_t size)
{
- int sig, no, table_nr = 0, total_offset = 0;
+ int sig, no, table_nr = 0;
long offset = 0;
struct acpi_table_header *table;
char cpio_path[32] = "kernel/firmware/acpi/";
struct cpio_data file;
- char *p;

if (data == NULL || size == 0)
return;
@@ -620,7 +619,14 @@ void __init acpi_initrd_override(void *data, size_t size)
acpi_initrd_files[table_nr].size = file.size;
table_nr++;
}
- if (table_nr == 0)
+}
+
+void __init acpi_initrd_override_copy(void)
+{
+ int no, total_offset = 0;
+ char *p;
+
+ if (!all_tables_size)
return;

/* under 4G at first, then above 4G */
@@ -652,9 +658,11 @@ void __init acpi_initrd_override(void *data, size_t size)
* tables one time, we will hit the limit. Need to map table
* one by one during copying.
*/
- for (no = 0; no < table_nr; no++) {
+ for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
phys_addr_t size = acpi_initrd_files[no].size;

+ if (!size)
+ break;
p = early_ioremap(acpi_tables_addr + total_offset, size);
memcpy(p, acpi_initrd_files[no].data, size);
early_iounmap(p, size);
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index bcbdd74..1654a241 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -79,14 +79,6 @@ typedef int (*acpi_tbl_table_handler)(struct acpi_table_header *table);
typedef int (*acpi_tbl_entry_handler)(struct acpi_subtable_header *header,
const unsigned long end);

-#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
-void acpi_initrd_override(void *data, size_t size);
-#else
-static inline void acpi_initrd_override(void *data, size_t size)
-{
-}
-#endif
-
char * __acpi_map_table (unsigned long phys_addr, unsigned long size);
void __acpi_unmap_table(char *map, unsigned long size);
int early_acpi_boot_init(void);
@@ -485,6 +477,14 @@ static inline bool acpi_driver_match_device(struct device *dev,

#endif /* !CONFIG_ACPI */

+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void acpi_initrd_override_find(void *data, size_t size);
+void acpi_initrd_override_copy(void);
+#else
+static inline void acpi_initrd_override_find(void *data, size_t size) { }
+static inline void acpi_initrd_override_copy(void) { }
+#endif
+
#ifdef CONFIG_ACPI
void acpi_os_set_prepare_sleep(int (*func)(u8 sleep_state,
u32 pm1a_ctrl, u32 pm1b_ctrl));
--
1.8.1.4

2013-04-04 23:51:09

by Yinghai Lu

Subject: [PATCH v3 09/22] x86, ACPI: Find acpi tables in initrd early from head_32.S/head64.c

head64.c can use the page table set up by the #PF handler to access the
initrd before the memory mapping is initialized and the initrd is
relocated.

head_32.S can use 32-bit flat mode to access the initrd before the
memory mapping is initialized and the initrd is relocated.

That makes 32-bit and 64-bit more consistent.

-v2: Use an inline function in the header file instead, per tj.
Also still need to keep the #ifdef around the head_32.S call to avoid
a compile error.
-v3: Need to move reserve_initrd() down after acpi_initrd_override_copy()
to make sure we are using the right address.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Jacob Shin <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
---
arch/x86/include/asm/setup.h | 6 ++++++
arch/x86/kernel/head64.c | 2 ++
arch/x86/kernel/head_32.S | 4 ++++
arch/x86/kernel/setup.c | 34 ++++++++++++++++++++++++++++++----
4 files changed, 42 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 4f71d48..6f885b7 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -42,6 +42,12 @@ extern void visws_early_detect(void);
static inline void visws_early_detect(void) { }
#endif

+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void x86_acpi_override_find(void);
+#else
+static inline void x86_acpi_override_find(void) { }
+#endif
+
extern unsigned long saved_video_mode;

extern void reserve_standard_io_resources(void);
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index c5e403f..a31bc63 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -174,6 +174,8 @@ void __init x86_64_start_kernel(char * real_mode_data)
if (console_loglevel == 10)
early_printk("Kernel alive\n");

+ x86_acpi_override_find();
+
clear_page(init_level4_pgt);
/* set init_level4_pgt kernel high mapping*/
init_level4_pgt[511] = early_level4_pgt[511];
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 73afd11..ca08f0e 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -149,6 +149,10 @@ ENTRY(startup_32)
call load_ucode_bsp
#endif

+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+ call x86_acpi_override_find
+#endif
+
/*
* Initialize page tables. This creates a PDE and a set of page
* tables, which are located immediately beyond __brk_base. The variable
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 16a703f..2d29bc0 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -424,6 +424,34 @@ static void __init reserve_initrd(void)
}
#endif /* CONFIG_BLK_DEV_INITRD */

+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void __init x86_acpi_override_find(void)
+{
+ unsigned long ramdisk_image, ramdisk_size;
+ unsigned char *p = NULL;
+
+#ifdef CONFIG_X86_32
+ struct boot_params *boot_params_p;
+
+ /*
+ * 32bit is from head_32.S, and it is 32bit flat mode.
+ * So need to use phys address to access global variables.
+ */
+ boot_params_p = (struct boot_params *)__pa_nodebug(&boot_params);
+ ramdisk_image = get_ramdisk_image(boot_params_p);
+ ramdisk_size = get_ramdisk_size(boot_params_p);
+ p = (unsigned char *)ramdisk_image;
+ acpi_initrd_override_find(p, ramdisk_size, true);
+#else
+ ramdisk_image = get_ramdisk_image(&boot_params);
+ ramdisk_size = get_ramdisk_size(&boot_params);
+ if (ramdisk_image)
+ p = __va(ramdisk_image);
+ acpi_initrd_override_find(p, ramdisk_size, false);
+#endif
+}
+#endif
+
static void __init parse_setup_data(void)
{
struct setup_data *data;
@@ -1090,12 +1118,10 @@ void __init setup_arch(char **cmdline_p)
/* Allocate bigger log buffer */
setup_log_buf(1);

- reserve_initrd();
-
- acpi_initrd_override_find((void *)initrd_start,
- initrd_end - initrd_start, false);
acpi_initrd_override_copy();

+ reserve_initrd();
+
reserve_crashkernel();

vsmp_init();
--
1.8.1.4

2013-04-04 23:51:07

by Yinghai Lu

Subject: [PATCH v3 12/22] x86, mm, numa: Move node_map_pfn_alignment() to x86

Move node_map_pfn_alignment() to arch/x86/mm, as it has no other users.

It will be updated to use numa_meminfo instead of memblock.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/numa.c | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/mm.h | 1 -
mm/page_alloc.c | 50 --------------------------------------------------
3 files changed, 50 insertions(+), 51 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index b7173f6..24155b2 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -477,6 +477,56 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
return true;
}

+/**
+ * node_map_pfn_alignment - determine the maximum internode alignment
+ *
+ * This function should be called after node map is populated and sorted.
+ * It calculates the maximum power of two alignment which can distinguish
+ * all the nodes.
+ *
+ * For example, if all nodes are 1GiB and aligned to 1GiB, the return value
+ * would indicate 1GiB alignment with (1 << (30 - PAGE_SHIFT)). If the
+ * nodes are shifted by 256MiB, 256MiB. Note that if only the last node is
+ * shifted, 1GiB is enough and this function will indicate so.
+ *
+ * This is used to test whether pfn -> nid mapping of the chosen memory
+ * model has fine enough granularity to avoid incorrect mapping for the
+ * populated node map.
+ *
+ * Returns the determined alignment in pfn's. 0 if there is no alignment
+ * requirement (single node).
+ */
+unsigned long __init node_map_pfn_alignment(void)
+{
+ unsigned long accl_mask = 0, last_end = 0;
+ unsigned long start, end, mask;
+ int last_nid = -1;
+ int i, nid;
+
+ for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
+ if (!start || last_nid < 0 || last_nid == nid) {
+ last_nid = nid;
+ last_end = end;
+ continue;
+ }
+
+ /*
+ * Start with a mask granular enough to pin-point to the
+ * start pfn and tick off bits one-by-one until it becomes
+ * too coarse to separate the current node from the last.
+ */
+ mask = ~((1 << __ffs(start)) - 1);
+ while (mask && last_end <= (start & (mask << 1)))
+ mask <<= 1;
+
+ /* accumulate all internode masks */
+ accl_mask |= mask;
+ }
+
+ /* convert mask to number of pages */
+ return ~accl_mask + 1;
+}
+
static int __init numa_register_memblks(struct numa_meminfo *mi)
{
unsigned long uninitialized_var(pfn_align);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 192806c..77a71fb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1322,7 +1322,6 @@ extern void free_initmem(void);
* CONFIG_HAVE_MEMBLOCK_NODE_MAP.
*/
extern void free_area_init_nodes(unsigned long *max_zone_pfn);
-unsigned long node_map_pfn_alignment(void);
extern unsigned long absent_pages_in_range(unsigned long start_pfn,
unsigned long end_pfn);
extern void get_pfn_range_for_nid(unsigned int nid,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 580d919..f368db4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4725,56 +4725,6 @@ static inline void setup_nr_node_ids(void)
}
#endif

-/**
- * node_map_pfn_alignment - determine the maximum internode alignment
- *
- * This function should be called after node map is populated and sorted.
- * It calculates the maximum power of two alignment which can distinguish
- * all the nodes.
- *
- * For example, if all nodes are 1GiB and aligned to 1GiB, the return value
- * would indicate 1GiB alignment with (1 << (30 - PAGE_SHIFT)). If the
- * nodes are shifted by 256MiB, 256MiB. Note that if only the last node is
- * shifted, 1GiB is enough and this function will indicate so.
- *
- * This is used to test whether pfn -> nid mapping of the chosen memory
- * model has fine enough granularity to avoid incorrect mapping for the
- * populated node map.
- *
- * Returns the determined alignment in pfn's. 0 if there is no alignment
- * requirement (single node).
- */
-unsigned long __init node_map_pfn_alignment(void)
-{
- unsigned long accl_mask = 0, last_end = 0;
- unsigned long start, end, mask;
- int last_nid = -1;
- int i, nid;
-
- for_each_mem_pfn_range(i, MAX_NUMNODES, &start, &end, &nid) {
- if (!start || last_nid < 0 || last_nid == nid) {
- last_nid = nid;
- last_end = end;
- continue;
- }
-
- /*
- * Start with a mask granular enough to pin-point to the
- * start pfn and tick off bits one-by-one until it becomes
- * too coarse to separate the current node from the last.
- */
- mask = ~((1 << __ffs(start)) - 1);
- while (mask && last_end <= (start & (mask << 1)))
- mask <<= 1;
-
- /* accumulate all internode masks */
- accl_mask |= mask;
- }
-
- /* convert mask to number of pages */
- return ~accl_mask + 1;
-}
-
/* Find the lowest pfn for a node */
static unsigned long __init find_min_pfn_for_node(int nid)
{
--
1.8.1.4

2013-04-04 23:47:57

by Yinghai Lu

Subject: [PATCH v3 02/22] x86, microcode: Use common get_ramdisk_image()

Use the common get_ramdisk_image() to get the ramdisk start physical
address.

We need this to get the correct ramdisk address for a 64-bit bzImage
whose initrd was loaded above 4G by kexec-tools.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Fenghua Yu <[email protected]>
Acked-by: Tejun Heo <[email protected]>
---
arch/x86/kernel/microcode_intel_early.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/microcode_intel_early.c b/arch/x86/kernel/microcode_intel_early.c
index d893e8e..ea57bd8 100644
--- a/arch/x86/kernel/microcode_intel_early.c
+++ b/arch/x86/kernel/microcode_intel_early.c
@@ -742,8 +742,8 @@ load_ucode_intel_bsp(void)
struct boot_params *boot_params_p;

boot_params_p = (struct boot_params *)__pa_nodebug(&boot_params);
- ramdisk_image = boot_params_p->hdr.ramdisk_image;
- ramdisk_size = boot_params_p->hdr.ramdisk_size;
+ ramdisk_image = get_ramdisk_image(boot_params_p);
+ ramdisk_size = get_ramdisk_size(boot_params_p);
initrd_start_early = ramdisk_image;
initrd_end_early = initrd_start_early + ramdisk_size;

@@ -752,8 +752,8 @@ load_ucode_intel_bsp(void)
(unsigned long *)__pa_nodebug(&mc_saved_in_initrd),
initrd_start_early, initrd_end_early, &uci);
#else
- ramdisk_image = boot_params.hdr.ramdisk_image;
- ramdisk_size = boot_params.hdr.ramdisk_size;
+ ramdisk_image = get_ramdisk_image(&boot_params);
+ ramdisk_size = get_ramdisk_size(&boot_params);
initrd_start_early = ramdisk_image + PAGE_OFFSET;
initrd_end_early = initrd_start_early + ramdisk_size;

--
1.8.1.4

2013-04-04 23:52:29

by Yinghai Lu

Subject: [PATCH v3 04/22] x86, ACPI: Search buffer above 4G in second try for acpi override tables

Currently we only search for a buffer for the override ACPI tables
below 4G. In some cases, e.g. when the user passes memmap= to exclude
all low RAM, no suitable range below 4G may exist.

Do a second search above 4G in that case.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: [email protected]
---
drivers/acpi/osl.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 313d14d..c08cdb6 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -627,6 +627,10 @@ void __init acpi_initrd_override(void *data, size_t size)
/* under 4G at first, then above 4G */
acpi_tables_addr = memblock_find_in_range(0, (1ULL<<32) - 1,
all_tables_size, PAGE_SIZE);
+ if (!acpi_tables_addr)
+ acpi_tables_addr = memblock_find_in_range(0,
+ ~(phys_addr_t)0,
+ all_tables_size, PAGE_SIZE);
if (!acpi_tables_addr) {
WARN_ON(1);
return;
--
1.8.1.4

2013-04-04 23:52:27

by Yinghai Lu

Subject: [PATCH v3 07/22] x86, ACPI: Store override acpi tables phys addr in cpio files info array

In 32-bit, we find the tables using physical addresses while in 32-bit
flat mode in head_32.S, because at that point we don't need page tables
to access the initrd.

For copying, we can use early_ioremap() with the physical address
directly, before the memory mapping is set up.

To keep 32-bit and 64-bit consistent, use physical addresses for both.

-v2: Introduce struct file_pos to store the physical address instead of
abusing cpio_data, which tj was not happy with.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Rafael J. Wysocki <[email protected]>
Cc: [email protected]
---
drivers/acpi/osl.c | 15 +++++++++++----
1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 21714fb..ee5c531 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -570,7 +570,11 @@ static const char * const table_sigs[] = {
#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)

#define ACPI_OVERRIDE_TABLES 64
-static struct cpio_data __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];
+struct file_pos {
+ phys_addr_t data;
+ phys_addr_t size;
+};
+static struct file_pos __initdata acpi_initrd_files[ACPI_OVERRIDE_TABLES];

void __init acpi_initrd_override_find(void *data, size_t size)
{
@@ -615,7 +619,7 @@ void __init acpi_initrd_override_find(void *data, size_t size)
table->signature, cpio_path, file.name, table->length);

all_tables_size += table->length;
- acpi_initrd_files[table_nr].data = file.data;
+ acpi_initrd_files[table_nr].data = __pa_nodebug(file.data);
acpi_initrd_files[table_nr].size = file.size;
table_nr++;
}
@@ -624,7 +628,7 @@ void __init acpi_initrd_override_find(void *data, size_t size)
void __init acpi_initrd_override_copy(void)
{
int no, total_offset = 0;
- char *p;
+ char *p, *q;

if (!all_tables_size)
return;
@@ -659,12 +663,15 @@ void __init acpi_initrd_override_copy(void)
* one by one during copying.
*/
for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
+ phys_addr_t addr = acpi_initrd_files[no].data;
phys_addr_t size = acpi_initrd_files[no].size;

if (!size)
break;
+ q = early_ioremap(addr, size);
p = early_ioremap(acpi_tables_addr + total_offset, size);
- memcpy(p, acpi_initrd_files[no].data, size);
+ memcpy(p, q, size);
+ early_iounmap(q, size);
early_iounmap(p, size);
total_offset += size;
}
--
1.8.1.4
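The two-mapping copy loop can be modelled in user space. Here `early_ioremap()`/`early_iounmap()` are stand-ins that simply translate a toy physical address into a pointer into a flat array; only the loop structure mirrors the patch:

```c
#include <assert.h>
#include <string.h>
#include <stddef.h>

typedef unsigned long phys_addr_t;

/* Toy "physical memory" so the mapping helpers can be modelled in
 * user space. */
static unsigned char phys_mem[4096];

static void *early_ioremap(phys_addr_t phys, size_t size)
{
	(void)size;
	return &phys_mem[phys];
}

static void early_iounmap(void *addr, size_t size)
{
	(void)addr;
	(void)size;
}

struct file_pos {
	phys_addr_t data;
	phys_addr_t size;
};

/* The copy loop from the patch: both the source (file in the initrd)
 * and the destination (reserved table area) are known only by phys
 * address, so each side is temporarily mapped for the memcpy. */
static void copy_tables(struct file_pos *files, int nr, phys_addr_t dest)
{
	size_t total_offset = 0;
	int no;

	for (no = 0; no < nr; no++) {
		phys_addr_t addr = files[no].data;
		phys_addr_t size = files[no].size;
		char *p, *q;

		if (!size)
			break;
		q = early_ioremap(addr, size);
		p = early_ioremap(dest + total_offset, size);
		memcpy(p, q, size);
		early_iounmap(q, size);
		early_iounmap(p, size);
		total_offset += size;
	}
}
```

The files end up packed back to back at the destination, matching how the override tables are laid out at acpi_tables_addr.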

2013-04-04 23:52:26

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH v3 16/22] x86, mm, numa: Move emulation handling down.

numa_emulation() needs to allocate buffers for the new numa_meminfo
and the distance matrix, so move it down.

This also changes the behavior:
before this patch, if the user passed bad data on the command line,
we fell back to the next numa probing method or disabled numa.
after this patch, if the user passes bad data on the command line,
we stay with the numa info probed before, such as ACPI SRAT or amd_numa.

We need to call numa_check_memblks() to reject bad user input early,
so the original numa_meminfo is left unchanged.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: David Rientjes <[email protected]>
---
arch/x86/mm/numa.c | 6 +++---
arch/x86/mm/numa_emulation.c | 2 +-
arch/x86/mm/numa_internal.h | 2 ++
3 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 1d5fa08..90fd123 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -537,7 +537,7 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
}
#endif

-static int __init numa_check_memblks(struct numa_meminfo *mi)
+int __init numa_check_memblks(struct numa_meminfo *mi)
{
nodemask_t nodes_parsed;
unsigned long pfn_align;
@@ -607,8 +607,6 @@ static int __init numa_init(int (*init_func)(void))
if (ret < 0)
return ret;

- numa_emulation(&numa_meminfo, numa_distance_cnt);
-
ret = numa_check_memblks(&numa_meminfo);
if (ret < 0)
return ret;
@@ -672,6 +670,8 @@ void __init x86_numa_init(void)

early_x86_numa_init();

+ numa_emulation(&numa_meminfo, numa_distance_cnt);
+
node_possible_map = numa_nodes_parsed;
numa_nodemask_from_meminfo(&node_possible_map, mi);

diff --git a/arch/x86/mm/numa_emulation.c b/arch/x86/mm/numa_emulation.c
index dbbbb47..5a0433d 100644
--- a/arch/x86/mm/numa_emulation.c
+++ b/arch/x86/mm/numa_emulation.c
@@ -348,7 +348,7 @@ void __init numa_emulation(struct numa_meminfo *numa_meminfo, int numa_dist_cnt)
if (ret < 0)
goto no_emu;

- if (numa_cleanup_meminfo(&ei) < 0) {
+ if (numa_cleanup_meminfo(&ei) < 0 || numa_check_memblks(&ei) < 0) {
pr_warning("NUMA: Warning: constructed meminfo invalid, disabling emulation\n");
goto no_emu;
}
diff --git a/arch/x86/mm/numa_internal.h b/arch/x86/mm/numa_internal.h
index ad86ec9..bb2fbcc 100644
--- a/arch/x86/mm/numa_internal.h
+++ b/arch/x86/mm/numa_internal.h
@@ -21,6 +21,8 @@ void __init numa_reset_distance(void);

void __init x86_numa_init(void);

+int __init numa_check_memblks(struct numa_meminfo *mi);
+
#ifdef CONFIG_NUMA_EMU
void __init numa_emulation(struct numa_meminfo *numa_meminfo,
int numa_dist_cnt);
--
1.8.1.4

2013-04-04 23:53:35

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH v3 14/22] x86, mm, numa: Set memblock nid later

For the separation, we need to set the memblock nid later, as doing so
can change the memblock array and possibly double the memblock.memory
array, which would need to allocate a buffer.

Only set the memblock nid once, on the successful path.

Also rename numa_register_memblks() to numa_check_memblks(), now that
the code setting the memblock nid has been moved out.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/numa.c | 16 +++++++---------
1 file changed, 7 insertions(+), 9 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index fcaeba9..e2ddcbd 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -537,10 +537,9 @@ static unsigned long __init node_map_pfn_alignment(struct numa_meminfo *mi)
}
#endif

-static int __init numa_register_memblks(struct numa_meminfo *mi)
+static int __init numa_check_memblks(struct numa_meminfo *mi)
{
unsigned long pfn_align;
- int i;

/* Account for nodes with cpus and no memory */
node_possible_map = numa_nodes_parsed;
@@ -563,11 +562,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
return -EINVAL;
}

- for (i = 0; i < mi->nr_blks; i++) {
- struct numa_memblk *mb = &mi->blk[i];
- memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
- }
-
return 0;
}

@@ -604,7 +598,6 @@ static int __init numa_init(int (*init_func)(void))
nodes_clear(numa_nodes_parsed);
nodes_clear(node_possible_map);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
- WARN_ON(memblock_set_node(0, ULLONG_MAX, MAX_NUMNODES));
numa_reset_distance();

ret = init_func();
@@ -616,7 +609,7 @@ static int __init numa_init(int (*init_func)(void))

numa_emulation(&numa_meminfo, numa_distance_cnt);

- ret = numa_register_memblks(&numa_meminfo);
+ ret = numa_check_memblks(&numa_meminfo);
if (ret < 0)
return ret;

@@ -679,6 +672,11 @@ void __init x86_numa_init(void)

early_x86_numa_init();

+ for (i = 0; i < mi->nr_blks; i++) {
+ struct numa_memblk *mb = &mi->blk[i];
+ memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
+ }
+
/* Finally register nodes. */
for_each_node_mask(nid, node_possible_map) {
u64 start = PFN_PHYS(max_pfn);
--
1.8.1.4

2013-04-04 23:53:33

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH v3 11/22] x86, mm, numa: Call numa_meminfo_cover_memory() checking early

For the separation, we need to set the memblock nid later, as doing so
can change the memblock array and possibly double the memblock.memory
array, which would need to allocate a buffer.

We do not need the nid in memblock to find absent pages, so we can
move the numa_meminfo_cover_memory() check earlier.

This also lets us make __absent_pages_in_range() static and use
absent_pages_in_range() directly.

Later we can set the memblock nid only once, on the successful path.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/numa.c | 7 ++++---
include/linux/mm.h | 2 --
mm/page_alloc.c | 2 +-
3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index d545638..b7173f6 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -460,7 +460,7 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
u64 s = mi->blk[i].start >> PAGE_SHIFT;
u64 e = mi->blk[i].end >> PAGE_SHIFT;
numaram += e - s;
- numaram -= __absent_pages_in_range(mi->blk[i].nid, s, e);
+ numaram -= absent_pages_in_range(s, e);
if ((s64)numaram < 0)
numaram = 0;
}
@@ -488,6 +488,9 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
if (WARN_ON(nodes_empty(node_possible_map)))
return -EINVAL;

+ if (!numa_meminfo_cover_memory(mi))
+ return -EINVAL;
+
for (i = 0; i < mi->nr_blks; i++) {
struct numa_memblk *mb = &mi->blk[i];
memblock_set_node(mb->start, mb->end - mb->start, mb->nid);
@@ -506,8 +509,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
return -EINVAL;
}
#endif
- if (!numa_meminfo_cover_memory(mi))
- return -EINVAL;

return 0;
}
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e19ff30..192806c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1323,8 +1323,6 @@ extern void free_initmem(void);
*/
extern void free_area_init_nodes(unsigned long *max_zone_pfn);
unsigned long node_map_pfn_alignment(void);
-unsigned long __absent_pages_in_range(int nid, unsigned long start_pfn,
- unsigned long end_pfn);
extern unsigned long absent_pages_in_range(unsigned long start_pfn,
unsigned long end_pfn);
extern void get_pfn_range_for_nid(unsigned int nid,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8fcced7..580d919 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4356,7 +4356,7 @@ static unsigned long __meminit zone_spanned_pages_in_node(int nid,
* Return the number of holes in a range on a node. If nid is MAX_NUMNODES,
* then all holes in the requested range will be accounted for.
*/
-unsigned long __meminit __absent_pages_in_range(int nid,
+static unsigned long __meminit __absent_pages_in_range(int nid,
unsigned long range_start_pfn,
unsigned long range_end_pfn)
{
--
1.8.1.4

2013-04-04 23:53:32

by Yinghai Lu

[permalink] [raw]
Subject: [PATCH v3 10/22] x86, mm, numa: Move two functions calling on successful path later

We need numa info ready before init_mem_mapping(), so that we can
call init_mem_mapping() per node and also trim node memory ranges
to a large alignment.

The current numa parsing needs to allocate buffers and has to be
called after init_mem_mapping().

So split numa info parsing into two stages: the early one runs before
init_mem_mapping() and must not need to allocate buffers.

In the end we will have early_initmem_init() and initmem_init().

This patch is the first step of that separation.

setup_node_data() and numa_init_array() are only called on the
successful path, so we can move the calls to x86_numa_init(). That
also makes numa_init() small and readable.

-v2: remove the online_node_map clearing in numa_init(), as it is only
set in setup_node_data(), at the end of the successful path.

Signed-off-by: Yinghai Lu <[email protected]>
---
arch/x86/mm/numa.c | 69 ++++++++++++++++++++++++++++++------------------------
1 file changed, 39 insertions(+), 30 deletions(-)

diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
index 72fe01e..d545638 100644
--- a/arch/x86/mm/numa.c
+++ b/arch/x86/mm/numa.c
@@ -480,7 +480,7 @@ static bool __init numa_meminfo_cover_memory(const struct numa_meminfo *mi)
static int __init numa_register_memblks(struct numa_meminfo *mi)
{
unsigned long uninitialized_var(pfn_align);
- int i, nid;
+ int i;

/* Account for nodes with cpus and no memory */
node_possible_map = numa_nodes_parsed;
@@ -509,24 +509,6 @@ static int __init numa_register_memblks(struct numa_meminfo *mi)
if (!numa_meminfo_cover_memory(mi))
return -EINVAL;

- /* Finally register nodes. */
- for_each_node_mask(nid, node_possible_map) {
- u64 start = PFN_PHYS(max_pfn);
- u64 end = 0;
-
- for (i = 0; i < mi->nr_blks; i++) {
- if (nid != mi->blk[i].nid)
- continue;
- start = min(mi->blk[i].start, start);
- end = max(mi->blk[i].end, end);
- }
-
- if (start < end)
- setup_node_data(nid, start, end);
- }
-
- /* Dump memblock with node info and return. */
- memblock_dump_all();
return 0;
}

@@ -562,7 +544,6 @@ static int __init numa_init(int (*init_func)(void))

nodes_clear(numa_nodes_parsed);
nodes_clear(node_possible_map);
- nodes_clear(node_online_map);
memset(&numa_meminfo, 0, sizeof(numa_meminfo));
WARN_ON(memblock_set_node(0, ULLONG_MAX, MAX_NUMNODES));
numa_reset_distance();
@@ -580,15 +561,6 @@ static int __init numa_init(int (*init_func)(void))
if (ret < 0)
return ret;

- for (i = 0; i < nr_cpu_ids; i++) {
- int nid = early_cpu_to_node(i);
-
- if (nid == NUMA_NO_NODE)
- continue;
- if (!node_online(nid))
- numa_clear_node(i);
- }
- numa_init_array();
return 0;
}

@@ -621,7 +593,7 @@ static int __init dummy_numa_init(void)
* last fallback is dummy single node config encomapssing whole memory and
* never fails.
*/
-void __init x86_numa_init(void)
+static void __init early_x86_numa_init(void)
{
if (!numa_off) {
#ifdef CONFIG_X86_NUMAQ
@@ -641,6 +613,43 @@ void __init x86_numa_init(void)
numa_init(dummy_numa_init);
}

+void __init x86_numa_init(void)
+{
+ int i, nid;
+ struct numa_meminfo *mi = &numa_meminfo;
+
+ early_x86_numa_init();
+
+ /* Finally register nodes. */
+ for_each_node_mask(nid, node_possible_map) {
+ u64 start = PFN_PHYS(max_pfn);
+ u64 end = 0;
+
+ for (i = 0; i < mi->nr_blks; i++) {
+ if (nid != mi->blk[i].nid)
+ continue;
+ start = min(mi->blk[i].start, start);
+ end = max(mi->blk[i].end, end);
+ }
+
+ if (start < end)
+ setup_node_data(nid, start, end); /* online is set */
+ }
+
+ /* Dump memblock with node info */
+ memblock_dump_all();
+
+ for (i = 0; i < nr_cpu_ids; i++) {
+ int nid = early_cpu_to_node(i);
+
+ if (nid == NUMA_NO_NODE)
+ continue;
+ if (!node_online(nid))
+ numa_clear_node(i);
+ }
+ numa_init_array();
+}
+
static __init int find_near_online_node(int node)
{
int n, val;
--
1.8.1.4
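The per-node "shrink-wrap" loop that x86_numa_init() gains here can be sketched as a small standalone function. `node_span()` and `max_phys` are illustrative names, with `max_phys` playing the role of PFN_PHYS(max_pfn) as the initial upper bound:

```c
#include <assert.h>
#include <stdint.h>

/* Simplified numa_meminfo: a flat list of (start, end, nid) blocks,
 * mirroring the structures used by the patch. */
struct numa_memblk { uint64_t start, end; int nid; };
struct numa_meminfo { int nr_blks; struct numa_memblk blk[8]; };

/* For one node, shrink [start, end) around all blocks that belong
 * to it; returns -1 when the node has no memory (the case where the
 * patch skips setup_node_data()). */
static int node_span(const struct numa_meminfo *mi, int nid,
		     uint64_t max_phys, uint64_t *startp, uint64_t *endp)
{
	uint64_t start = max_phys, end = 0;
	int i;

	for (i = 0; i < mi->nr_blks; i++) {
		if (mi->blk[i].nid != nid)
			continue;
		if (mi->blk[i].start < start)
			start = mi->blk[i].start;
		if (mi->blk[i].end > end)
			end = mi->blk[i].end;
	}
	if (start >= end)
		return -1;
	*startp = start;
	*endp = end;
	return 0;
}
```

Note that a node with discontiguous blocks gets one covering span, which is exactly what setup_node_data() is handed in the patch.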

2013-04-05 02:28:14

by Thomas Renninger

[permalink] [raw]
Subject: Re: [PATCH v3 00/22] x86, ACPI, numa: Parse numa info early

On Thursday, April 04, 2013 04:46:04 PM Yinghai Lu wrote:
> One commit that tried to parse SRAT early get reverted before v3.9-rc1.
>
> | commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
> | Author: Tang Chen <[email protected]>
> | Date: Fri Feb 22 16:33:44 2013 -0800
> |
> | acpi, memory-hotplug: parse SRAT before memblock is ready
>
> It broke several things, like acpi override and fall back path etc.
>
> This patchset is clean implementation that will parse numa info early.

I tried ACPI table overriding, but it did not work for me.
Your tree seems to be missing the ACPI initrd override documentation:
Documentation/acpi/initrd_table_override.txt
?
And your tree is 3.6.0-rc6-default+ based, right?

I tried it like that:
mkdir -p kernel/firmware/acpi
cp dsdt.aml kernel/firmware/acpi
find kernel | cpio -H newc --create > /boot/instrumented_initrd
cat /boot/initrd >>/boot/instrumented_initrd

modified /boot/grub/menu.lst and pointed to /boot/instrumented_initrd

-> no override messages in dmesg, no overriding happened at all.

Did I overlook something?

Thomas

2013-04-05 03:09:48

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v3 00/22] x86, ACPI, numa: Parse numa info early

On Thu, Apr 4, 2013 at 7:28 PM, Thomas Renninger <[email protected]> wrote:
> On Thursday, April 04, 2013 04:46:04 PM Yinghai Lu wrote:
>> One commit that tried to parse SRAT early get reverted before v3.9-rc1.
>>
>> | commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
>> | Author: Tang Chen <[email protected]>
>> | Date: Fri Feb 22 16:33:44 2013 -0800
>> |
>> | acpi, memory-hotplug: parse SRAT before memblock is ready
>>
>> It broke several things, like acpi override and fall back path etc.
>>
>> This patchset is clean implementation that will parse numa info early.
>
> I tried acpi table overriding, but it did not work for me.
> In your tree there seem to miss acpi initrd overriding doku:
> Documentation/acpi/initrd_table_override.txt

http://git.kernel.org/cgit/linux/kernel/git/yinghai/linux-yinghai.git/tree/Documentation/acpi/initrd_table_override.txt?h=for-x86-mm

> And your tree is 3.6.0-rc6-default+ based, right?

It is in for-x86-mm branch, should be 3.9-rc5 based.


can you try

git checkout -b for-x86-mm origin/for-x86-mm

Thanks

Yinghai

2013-04-05 10:44:19

by Thomas Renninger

[permalink] [raw]
Subject: Re: [PATCH v3 00/22] x86, ACPI, numa: Parse numa info early

On Thursday, April 04, 2013 08:09:46 PM Yinghai Lu wrote:
> On Thu, Apr 4, 2013 at 7:28 PM, Thomas Renninger <[email protected]> wrote:
> > On Thursday, April 04, 2013 04:46:04 PM Yinghai Lu wrote:
> >> One commit that tried to parse SRAT early get reverted before v3.9-rc1.
> >>
> >> | commit e8d1955258091e4c92d5a975ebd7fd8a98f5d30f
> >> | Author: Tang Chen <[email protected]>
> >> | Date: Fri Feb 22 16:33:44 2013 -0800
> >> |
> >> | acpi, memory-hotplug: parse SRAT before memblock is ready
> >>
> >> It broke several things, like acpi override and fall back path etc.
> >>
> >> This patchset is clean implementation that will parse numa info early.
> >
> > I tried acpi table overriding, but it did not work for me.
> > In your tree there seem to miss acpi initrd overriding doku:
> > Documentation/acpi/initrd_table_override.txt
>
> http://git.kernel.org/cgit/linux/kernel/git/yinghai/linux-yinghai.git/tree/Documentation/acpi/initrd_table_override.txt?h=for-x86-mm
> > And your tree is 3.6.0-rc6-default+ based, right?
>
> It is in for-x86-mm branch, should be 3.9-rc5 based.
>
> http://git.kernel.org/cgit/linux/kernel/git/yinghai/linux-yinghai.git/tree/Documentation/acpi/initrd_table_override.txt?h=for-x86-mm
>
> can you try
>
> git checkout -b for-x86-mm origin/for-x86-mm
Argh, stupid me: I simply did a git clone of what was mentioned before:
could be found at:
git://git.kernel.org/pub/scm/linux/kernel/git/yinghai/linux-yinghai.git for-x86-mm

I doubt I will make it today, so I'll try to give it a test on Monday.

Thanks,

Thomas

2013-04-05 13:40:19

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH v3 21/22] x86, mm: Make init_mem_mapping be able to be called several times

On Thu, Apr 04, 2013 at 04:46:25PM -0700, Yinghai Lu wrote:
> Prepare to put page table on local nodes.
>
> Move calling of init_mem_mapping to early_initmem_init.
>
> Rework alloc_low_pages to alloc page table in following order:
> BRK, local node, low range
>
> Still only load_cr3 one time, otherwise we would break xen 64bit again.

Nope. It should be fixed now.


>
> Signed-off-by: Yinghai Lu <[email protected]>
> Cc: Pekka Enberg <[email protected]>
> Cc: Jacob Shin <[email protected]>
> Cc: Konrad Rzeszutek Wilk <[email protected]>
> ---
> arch/x86/include/asm/pgtable.h | 2 +-
> arch/x86/kernel/setup.c | 1 -
> arch/x86/mm/init.c | 88 ++++++++++++++++++++++++++----------------
> arch/x86/mm/numa.c | 24 ++++++++++++
> 4 files changed, 79 insertions(+), 36 deletions(-)
>
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 1e67223..868687c 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -621,7 +621,7 @@ static inline int pgd_none(pgd_t pgd)
> #ifndef __ASSEMBLY__
>
> extern int direct_gbpages;
> -void init_mem_mapping(void);
> +void init_mem_mapping(unsigned long begin, unsigned long end);
> void early_alloc_pgt_buf(void);
>
> /* local pte updates need not use xchg for locking */
> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> index 6ef3fa2..67ef4bc 100644
> --- a/arch/x86/kernel/setup.c
> +++ b/arch/x86/kernel/setup.c
> @@ -1105,7 +1105,6 @@ void __init setup_arch(char **cmdline_p)
> acpi_boot_table_init();
> early_acpi_boot_init();
> early_initmem_init();
> - init_mem_mapping();
> memblock.current_limit = get_max_mapped();
> early_trap_pf_init();
>
> diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
> index 2754e45..8a03283 100644
> --- a/arch/x86/mm/init.c
> +++ b/arch/x86/mm/init.c
> @@ -24,7 +24,10 @@ static unsigned long __initdata pgt_buf_start;
> static unsigned long __initdata pgt_buf_end;
> static unsigned long __initdata pgt_buf_top;
>
> -static unsigned long min_pfn_mapped;
> +static unsigned long low_min_pfn_mapped;
> +static unsigned long low_max_pfn_mapped;
> +static unsigned long local_min_pfn_mapped;
> +static unsigned long local_max_pfn_mapped;
>
> static bool __initdata can_use_brk_pgt = true;
>
> @@ -52,10 +55,17 @@ __ref void *alloc_low_pages(unsigned int num)
>
> if ((pgt_buf_end + num) > pgt_buf_top || !can_use_brk_pgt) {
> unsigned long ret;
> - if (min_pfn_mapped >= max_pfn_mapped)
> - panic("alloc_low_page: ran out of memory");
> - ret = memblock_find_in_range(min_pfn_mapped << PAGE_SHIFT,
> - max_pfn_mapped << PAGE_SHIFT,
> + if (local_min_pfn_mapped >= local_max_pfn_mapped) {
> + if (low_min_pfn_mapped >= low_max_pfn_mapped)
> + panic("alloc_low_page: ran out of memory");
> + ret = memblock_find_in_range(
> + low_min_pfn_mapped << PAGE_SHIFT,
> + low_max_pfn_mapped << PAGE_SHIFT,
> + PAGE_SIZE * num , PAGE_SIZE);
> + } else
> + ret = memblock_find_in_range(
> + local_min_pfn_mapped << PAGE_SHIFT,
> + local_max_pfn_mapped << PAGE_SHIFT,
> PAGE_SIZE * num , PAGE_SIZE);
> if (!ret)
> panic("alloc_low_page: can not alloc memory");
> @@ -402,60 +412,75 @@ static unsigned long __init get_new_step_size(unsigned long step_size)
> return step_size;
> }
>
> -void __init init_mem_mapping(void)
> +void __init init_mem_mapping(unsigned long begin, unsigned long end)
> {
> - unsigned long end, real_end, start, last_start;
> + unsigned long real_end, start, last_start;
> unsigned long step_size;
> unsigned long addr;
> unsigned long mapped_ram_size = 0;
> unsigned long new_mapped_ram_size;
> + bool is_low = false;
> +
> + if (!begin) {
> + probe_page_size_mask();
> + /* the ISA range is always mapped regardless of memory holes */
> + init_memory_mapping(0, ISA_END_ADDRESS);
> + begin = ISA_END_ADDRESS;
> + is_low = true;
> + }
>
> - probe_page_size_mask();
> -
> -#ifdef CONFIG_X86_64
> - end = max_pfn << PAGE_SHIFT;
> -#else
> - end = max_low_pfn << PAGE_SHIFT;
> -#endif
> -
> - /* the ISA range is always mapped regardless of memory holes */
> - init_memory_mapping(0, ISA_END_ADDRESS);
> + if (begin >= end)
> + return;
>
> /* xen has big range in reserved near end of ram, skip it at first.*/
> - addr = memblock_find_in_range(ISA_END_ADDRESS, end, PMD_SIZE, PMD_SIZE);
> + addr = memblock_find_in_range(begin, end, PMD_SIZE, PMD_SIZE);
> real_end = addr + PMD_SIZE;
>
> /* step_size need to be small so pgt_buf from BRK could cover it */
> step_size = PMD_SIZE;
> - max_pfn_mapped = 0; /* will get exact value next */
> - min_pfn_mapped = real_end >> PAGE_SHIFT;
> + local_max_pfn_mapped = begin >> PAGE_SHIFT;
> + local_min_pfn_mapped = real_end >> PAGE_SHIFT;
> last_start = start = real_end;
> - while (last_start > ISA_END_ADDRESS) {
> + while (last_start > begin) {
> if (last_start > step_size) {
> start = round_down(last_start - 1, step_size);
> - if (start < ISA_END_ADDRESS)
> - start = ISA_END_ADDRESS;
> + if (start < begin)
> + start = begin;
> } else
> - start = ISA_END_ADDRESS;
> + start = begin;
> new_mapped_ram_size = init_range_memory_mapping(start,
> last_start);
> + if ((last_start >> PAGE_SHIFT) > local_max_pfn_mapped)
> + local_max_pfn_mapped = last_start >> PAGE_SHIFT;
> + local_min_pfn_mapped = start >> PAGE_SHIFT;
> last_start = start;
> - min_pfn_mapped = last_start >> PAGE_SHIFT;
> /* only increase step_size after big range get mapped */
> if (new_mapped_ram_size > mapped_ram_size)
> step_size = get_new_step_size(step_size);
> mapped_ram_size += new_mapped_ram_size;
> }
>
> - if (real_end < end)
> + if (real_end < end) {
> init_range_memory_mapping(real_end, end);
> + if ((end >> PAGE_SHIFT) > local_max_pfn_mapped)
> + local_max_pfn_mapped = end >> PAGE_SHIFT;
> + }
>
> + if (is_low) {
> + low_min_pfn_mapped = local_min_pfn_mapped;
> + low_max_pfn_mapped = local_max_pfn_mapped;
> + }
> +}
> +
> +#ifndef CONFIG_NUMA
> +void __init early_initmem_init(void)
> +{
> #ifdef CONFIG_X86_64
> - if (max_pfn > max_low_pfn) {
> - /* can we preseve max_low_pfn ?*/
> + init_mem_mapping(0, max_pfn << PAGE_SHIFT);
> + if (max_pfn > max_low_pfn)
> max_low_pfn = max_pfn;
> - }
> #else
> + init_mem_mapping(0, max_low_pfn << PAGE_SHIFT);
> early_ioremap_page_table_range_init();
> #endif
>
> @@ -464,11 +489,6 @@ void __init init_mem_mapping(void)
>
> early_memtest(0, max_pfn_mapped << PAGE_SHIFT);
> }
> -
> -#ifndef CONFIG_NUMA
> -void __init early_initmem_init(void)
> -{
> -}
> #endif
>
> /*
> diff --git a/arch/x86/mm/numa.c b/arch/x86/mm/numa.c
> index c2d4653..d3eb0c9 100644
> --- a/arch/x86/mm/numa.c
> +++ b/arch/x86/mm/numa.c
> @@ -17,8 +17,10 @@
> #include <asm/dma.h>
> #include <asm/acpi.h>
> #include <asm/amd_nb.h>
> +#include <asm/tlbflush.h>
>
> #include "numa_internal.h"
> +#include "mm_internal.h"
>
> int __initdata numa_off;
> nodemask_t numa_nodes_parsed __initdata;
> @@ -668,9 +670,31 @@ static void __init early_x86_numa_init(void)
> numa_init(dummy_numa_init);
> }
>
> +#ifdef CONFIG_X86_64
> +static void __init early_x86_numa_init_mapping(void)
> +{
> + init_mem_mapping(0, max_pfn << PAGE_SHIFT);
> + if (max_pfn > max_low_pfn)
> + max_low_pfn = max_pfn;
> +}
> +#else
> +static void __init early_x86_numa_init_mapping(void)
> +{
> + init_mem_mapping(0, max_low_pfn << PAGE_SHIFT);
> + early_ioremap_page_table_range_init();
> +}
> +#endif
> +
> void __init early_initmem_init(void)
> {
> early_x86_numa_init();
> +
> + early_x86_numa_init_mapping();
> +
> + load_cr3(swapper_pg_dir);
> + __flush_tlb_all();
> +
> + early_memtest(0, max_pfn_mapped<<PAGE_SHIFT);
> }
>
> void __init x86_numa_init(void)
> --
> 1.8.1.4
>
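The top-down mapping walk quoted above can be modelled as a small user-space function. `grow()` is a stand-in for get_new_step_size(), whose exact formula is not shown in this patch; only the loop structure (map from the top in rounds, and enlarge the step once a round outpaces the running total) mirrors the kernel code:

```c
#include <assert.h>

/* Hypothetical stand-in for get_new_step_size(). */
static unsigned long grow(unsigned long step)
{
	return step * 8;
}

/* Walk [begin, real_end) from the top downward in rounds, logging
 * each round's start address; returns the number of rounds. */
static int map_top_down(unsigned long begin, unsigned long real_end,
			unsigned long step_size,
			unsigned long *rounds_log, int max_rounds)
{
	unsigned long last_start = real_end, start;
	unsigned long mapped = 0, new_mapped;
	int rounds = 0;

	while (last_start > begin && rounds < max_rounds) {
		if (last_start > step_size) {
			/* round_down(last_start - 1, step_size) */
			start = ((last_start - 1) / step_size) * step_size;
			if (start < begin)
				start = begin;
		} else
			start = begin;

		new_mapped = last_start - start;	/* "map" this range */
		rounds_log[rounds++] = start;
		last_start = start;
		/* only increase step_size after a big range got mapped */
		if (new_mapped > mapped)
			step_size = grow(step_size);
		mapped += new_mapped;
	}
	return rounds;
}
```

Starting with a small step keeps the first round coverable by the BRK page-table buffer; once mapped memory can host new page tables, the step grows and the walk finishes in a few large rounds.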

2013-04-05 16:36:29

by Thomas Renninger

[permalink] [raw]
Subject: Re: [PATCH v3 00/22] x86, ACPI, numa: Parse numa info early

On Thursday, April 04, 2013 08:09:46 PM Yinghai Lu wrote:
...
> can you try
>
> git checkout -b for-x86-mm origin/for-x86-mm

That worked out much better :)

I see these changes in the e820 table; the first part is probably unrelated:

BIOS-e820: [mem 0x0000000000000000-0x000000000009bbff] usable
...
BIOS-e820: [mem 0x0000000000100000-0x00000000ba294fff] usable

modified: [mem 0x0000000000000000-0x0000000000000fff] reserved
modified: [mem 0x0000000000001000-0x000000000009bbff] usable
modified: [mem 0x0000000000100000-0x00000000ba27bfff] usable
...
modified: [mem 0x00000000ba27c000-0x00000000ba2947fc] ACPI data
modified: [mem 0x00000000ba2947fd-0x00000000ba294fff] usable

And the ACPI data section where the modified tables are placed
seems to get correctly inserted at:
0x00000000ba27c000-0x00000000ba2947fc

-> 0x187FC == 100,348 bytes
DSDT and FACP (better known as FADT) I passed have a size
of (see dmesg parts below):
0x18709 + 0xF4 bytes = 100,349 bytes.

Ah wait, the 0xba2947fc end is inclusive, so it fits exactly.
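The fit can be checked mechanically (a tiny user-space sketch, not kernel code): the e820 range is inclusive on both ends, so its length is end - start + 1, which must equal the DSDT length plus the FACP length.

```c
#include <assert.h>

/* Length of an e820 range whose end address is inclusive. */
static unsigned long range_len_incl(unsigned long start, unsigned long end)
{
	return end - start + 1;
}
```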

I then see:
DSDT ACPI table found in initrd [0x378f5208-0x3790d910]
FACP ACPI table found in initrd [0x3790d9a0-0x3790da93]
ACPI: RSDP 00000000000f0410 00024 (v02 INTEL)
ACPI: XSDT 00000000bdf24d98 0008C (v01 INTEL ROMLEY 06222004 INTL 20090903)
ACPI: Override [FACP- ROMLEY], this is unsafe: tainting kernel
Disabling lock debugging due to kernel taint
ACPI: FACP 00000000bdf24a98 Physical table override, new table: ffffffffff4af709
ACPI: FACP 00000000ba294709 000F4 (v04 INTEL ROMLEY 06222004 INTL 20121220)
ACPI BIOS Bug: Warning: Invalid length for FADT/Pm1aControlBlock: 32, using default 16 (20130117/tbfadt-649)
ACPI: Override [DSDT- ROMLEY], this is unsafe: tainting kernel
ACPI: DSDT 00000000bdf09018 Physical table override, new table: ffffffffff4af000
ACPI: DSDT 00000000ba27c000 18709 (v02 INTEL ROMLEY 00000021 INTL 20121220)

Later I see my debug string added to the DSDT when the
PCI Routing Table (_PRT) is processed:
[ 9.505419] [ACPI Debug] String [0x0A] "XXXXXXXXXX"

And taking the FADT from /sys/firmware/acpi/tables/FACP:
my:
PM Profile : 04 [Enterprise Server]
changed (as expected) to:
PM Profile : 02 [Mobile]

From the ACPI overriding parts:
Tested-by: Thomas Renninger <[email protected]>

I also went through the override-related patches and, from what I can
judge (certainly not the early memory specific parts; "32bit flat mode"
you call it?), they look fine.

Nice work!

Thomas

2013-04-05 18:10:09

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v3 00/22] x86, ACPI, numa: Parse numa info early

On Fri, Apr 5, 2013 at 9:36 AM, Thomas Renninger <[email protected]> wrote:
>
> From acpi overriding parts:
> Tested-by: Thomas Renninger <[email protected]>

Thanks a lot for testing.

>
> I also went through the override related patches and from
> what I can judge (certainly not the early memory, flat 32 bit memory you call it?
> specific parts), they look fine.

Yes, we call it "32bit flat mode".

Thanks

Yinghai

2013-04-05 21:54:42

by Tony Luck

[permalink] [raw]
Subject: Re: [PATCH v3 17/22] x86, ACPI, numa, ia64: split SLIT handling out

On Thu, Apr 4, 2013 at 4:46 PM, Yinghai Lu <[email protected]> wrote:
> It should not break ia64 by replacing acpi_numa_init with
> acpi_numa_init_srat/acpi_numa_init_slit/acpi_num_arch_fixup.

You are right - it doesn't break ia64. All my test configs still
build. Machines both with and without NUMA still boot and
nothing strange happens.

Tested-by: Tony Luck <[email protected]>

2013-04-05 22:16:50

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v3 17/22] x86, ACPI, numa, ia64: split SLIT handling out

On Fri, Apr 5, 2013 at 2:54 PM, Tony Luck <[email protected]> wrote:
> On Thu, Apr 4, 2013 at 4:46 PM, Yinghai Lu <[email protected]> wrote:
>> It should not break ia64 by replacing acpi_numa_init with
>> acpi_numa_init_srat/acpi_numa_init_slit/acpi_num_arch_fixup.
>
> You are right - it doesn't break ia64. All my test configs still
> build. Machines both with and without NUMA still boot and
> nothing strange happens.
>
> Tested-by: Tony Luck <[email protected]>

Great, thanks a lot for testing them.

Yinghai

2013-04-10 05:31:45

by Tang Chen

[permalink] [raw]
Subject: Re: [PATCH v3 02/22] x86, microcode: Use common get_ramdisk_image()

On 04/05/2013 07:46 AM, Yinghai Lu wrote:
> Use common get_ramdisk_image() to get ramdisk start phys address.
>
> We need this to get correct ramdisk address for 64bit bzImage that
> initrd can be loaded above 4G by kexec-tools.
>
> Signed-off-by: Yinghai Lu<[email protected]>
> Cc: Fenghua Yu<[email protected]>
> Acked-by: Tejun Heo<[email protected]>
> ---
> arch/x86/kernel/microcode_intel_early.c | 8 ++++----
> 1 file changed, 4 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/microcode_intel_early.c b/arch/x86/kernel/microcode_intel_early.c
> index d893e8e..ea57bd8 100644
> --- a/arch/x86/kernel/microcode_intel_early.c
> +++ b/arch/x86/kernel/microcode_intel_early.c
> @@ -742,8 +742,8 @@ load_ucode_intel_bsp(void)
> struct boot_params *boot_params_p;
>
> boot_params_p = (struct boot_params *)__pa_nodebug(&boot_params);
> - ramdisk_image = boot_params_p->hdr.ramdisk_image;
> - ramdisk_size = boot_params_p->hdr.ramdisk_size;
> + ramdisk_image = get_ramdisk_image(boot_params_p);
> + ramdisk_size = get_ramdisk_image(boot_params_p);

Should be get_ramdisk_size(boot_params_p)?

> initrd_start_early = ramdisk_image;
> initrd_end_early = initrd_start_early + ramdisk_size;
>
> @@ -752,8 +752,8 @@ load_ucode_intel_bsp(void)
> (unsigned long *)__pa_nodebug(&mc_saved_in_initrd),
> initrd_start_early, initrd_end_early,&uci);
> #else
> - ramdisk_image = boot_params.hdr.ramdisk_image;
> - ramdisk_size = boot_params.hdr.ramdisk_size;
> + ramdisk_image = get_ramdisk_image(&boot_params);
> + ramdisk_size = get_ramdisk_size(&boot_params);
> initrd_start_early = ramdisk_image + PAGE_OFFSET;
> initrd_end_early = initrd_start_early + ramdisk_size;
>

2013-04-10 07:40:43

by Thomas Renninger

[permalink] [raw]
Subject: Early microcode signing in secure boot environment - Was: x86, microcode: Use common get_ramdisk_image()

Hello,

On Wednesday, April 10, 2013 01:34:33 PM Tang Chen wrote:
> On 04/05/2013 07:46 AM, Yinghai Lu wrote:
> > Use common get_ramdisk_image() to get ramdisk start phys address.
> >
> > We need this to get correct ramdisk address for 64bit bzImage that
> > initrd can be loaded above 4G by kexec-tools.

I don't know whether this question came up when this feature was
submitted (if yes, a pointer would be appreciated).

Is there a concept for how signed microcode can be verified when
applied early, like it is done via the firmware loader?

If not, early microcode loading is not really usable in a secure boot
environment, right?

Thanks,

Thomas

2013-04-10 16:13:32

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH v3 02/22] x86, microcode: Use common get_ramdisk_image()

On Tue, Apr 9, 2013 at 10:34 PM, Tang Chen <[email protected]> wrote:
> On 04/05/2013 07:46 AM, Yinghai Lu wrote:
>> boot_params_p = (struct boot_params *)__pa_nodebug(&boot_params);
>> - ramdisk_image = boot_params_p->hdr.ramdisk_image;
>> - ramdisk_size = boot_params_p->hdr.ramdisk_size;
>> + ramdisk_image = get_ramdisk_image(boot_params_p);
>> + ramdisk_size = get_ramdisk_image(boot_params_p);
>
>
> Should be get_ramdisk_size(boot_params_p)?
>

oh, will update that.

Thanks

Yinghai

2013-04-10 17:47:29

by Fenghua Yu

Subject: RE: Early microcode signing in secure boot environment - Was: x86, microcode: Use common get_ramdisk_image()

> -----Original Message-----
> From: Thomas Renninger [mailto:[email protected]]
> Sent: Wednesday, April 10, 2013 12:41 AM
> Hello,
>
> On Wednesday, April 10, 2013 01:34:33 PM Tang Chen wrote:
> > On 04/05/2013 07:46 AM, Yinghai Lu wrote:
> > > Use common get_ramdisk_image() to get ramdisk start phys address.
> > >
> > > We need this to get correct ramdisk address for 64bit bzImage that
> > > initrd can be loaded above 4G by kexec-tools.
>
> don't know whether this question came up when this feature got
> submitted (if yes a pointer would be appreciated).
>
> Is there a concept how signed microcode can get verified when applied
> early, like it is done via firmware loader?
>
> If not, early microcode loading is not really usable in secure boot
> environment, right?

The microcode is cryptographically authenticated by the CPU itself, so there is no security issue related to early microcode loading.

Thanks.

-Fenghua

2013-04-11 07:31:37

by Thomas Renninger

Subject: Re: Early microcode signing in secure boot environment - Was: x86, microcode: Use common get_ramdisk_image()

On Wednesday, April 10, 2013 05:47:25 PM Yu, Fenghua wrote:
> > -----Original Message-----
> > From: Thomas Renninger [mailto:[email protected]]
> > Sent: Wednesday, April 10, 2013 12:41 AM
> > Hello,
> >
> > On Wednesday, April 10, 2013 01:34:33 PM Tang Chen wrote:
> > > On 04/05/2013 07:46 AM, Yinghai Lu wrote:
> > > > Use common get_ramdisk_image() to get ramdisk start phys address.
> > > >
> > > > We need this to get correct ramdisk address for 64bit bzImage that
> > > > initrd can be loaded above 4G by kexec-tools.
> >
> > don't know whether this question came up when this feature got
> > submitted (if yes a pointer would be appreciated).
> >
> > Is there a concept how signed microcode can get verified when applied
> > early,
> > like it is done via firmware loader?
> >
> > If not, early microcode loading is not really usable in secure boot
> > environment, right?
>
> The microcode is cryptographically authenticated by the CPU itself, so there
> is no security issue related to early microcode loading.

So Intel HW is allowed to authenticate its firmware itself, bypassing the UEFI
certificates...
Does this apply for other vendors as well?
Does this apply to the secure boot specification?

Is this "cryptographically authenticated by the CPU itself" thing documented
somewhere so that security people can double check that it is really
secure?

Just some questions to discuss and think about...

Thomas

2013-04-11 08:29:01

by Fenghua Yu

Subject: RE: Early microcode signing in secure boot environment - Was: x86, microcode: Use common get_ramdisk_image()

> From: Thomas Renninger [mailto:[email protected]]
> Sent: Thursday, April 11, 2013 12:32 AM
> On Wednesday, April 10, 2013 05:47:25 PM Yu, Fenghua wrote:
> > > -----Original Message-----
> > > From: Thomas Renninger [mailto:[email protected]]
> > > Sent: Wednesday, April 10, 2013 12:41 AM
> > > Hello,
> > >
> > > On Wednesday, April 10, 2013 01:34:33 PM Tang Chen wrote:
> > > > On 04/05/2013 07:46 AM, Yinghai Lu wrote:
> > > > > Use common get_ramdisk_image() to get ramdisk start phys address.
> > > > >
> > > > > We need this to get correct ramdisk address for 64bit bzImage that
> > > > > initrd can be loaded above 4G by kexec-tools.
> > >
> > > don't know whether this question came up when this feature got
> > > submitted (if yes a pointer would be appreciated).
> > >
> > > Is there a concept how signed microcode can get verified when applied
> > > early, like it is done via firmware loader?
> > >
> > > If not, early microcode loading is not really usable in secure boot
> > > environment, right?
> >
> > The microcode is cryptographically authenticated by the CPU itself, so
> > there is no security issue related to early microcode loading.
>
> So Intel HW is allowed to authenticate its firmware itself, bypassing
> the UEFI certificates...
> Does this apply for other vendors as well?
This should apply to other vendors as well, as this is defined in the X86 SDM.

> Does this apply to secure boot specification?
Secure boot can add another level of security because the early-loaded
microcode is part of the initrd, and the initrd is measured by secure boot.

>
> Is this "cryptographically authenticated by the CPU itself" thing
> documented somewhere so that security people can double check that it
> is really secure?

The X86 SDM defines that the second part of the microcode update is the encrypted data.

2013-04-11 08:59:31

by Thomas Renninger

Subject: Re: Early microcode signing in secure boot environment - Was: x86, microcode: Use common get_ramdisk_image()

On Thursday, April 11, 2013 08:28:37 AM Yu, Fenghua wrote:
> > From: Thomas Renninger [mailto:[email protected]]
...
> > Does this apply to secure boot specification?
>
> Secure boot can add another level of security because the early loaded
> microcode is part of initrd and initrd is measured by secure boot.
Not sure what is meant by "initrd is measured by secure boot".

Afaik the initrd does not get signed and verified; I cannot imagine how
that could work, as it needs to get regenerated on client systems.

I expect it works like this:
initrd does not need signing as it is not executed itself, it only gets
extracted.
Everything inside the initrd is subject to the secure boot rules:
binaries or whatever data which gets executed with kernel
privileges (or updates firmware) needs verification through
secure boot keys.

> > Is this "cryptographically authenticated by the CPU itself" thing
> > documented
> > somewhere so that security people can double check that it is really
> > secure?
>
> X86 SDM defines that the second part of microcode update is the encrypted
> data.

Again, I doubt it is allowed to bypass UEFI authentication with arbitrary,
vendor-specific authentication checks.

Thomas

2013-04-11 22:53:09

by H. Peter Anvin

Subject: Re: Early microcode signing in secure boot environment - Was: x86, microcode: Use common get_ramdisk_image()

On 04/11/2013 01:59 AM, Thomas Renninger wrote:
>
>>> Is this "cryptographically authenticated by the CPU itself" thing
>>> documented
>>> somewhere so that security people can double check that it is really
>>> secure?
>>
>> X86 SDM defines that the second part of microcode update is the encrypted
>> data.
>
> Again, I doubt it is allowed to bypass UEFI authentication with arbitrary,
> vendor specific authentication checks.
>

What does that even mean in this context?

-hpa

2013-04-11 22:54:53

by H. Peter Anvin

Subject: Re: [PATCH v3 00/22] x86, ACPI, numa: Parse numa info early

Can you post the current version of this set? I see there have been
some changes.

-hpa