2013-08-21 10:20:03

by Tang Chen

[permalink] [raw]
Subject: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

This patch-set aims to move acpi_initrd_override() earlier on x86.
Some of the patches are from Yinghai's patch-set:
https://lkml.org/lkml/2013/6/14/561

The difference between this patch-set and Yinghai's original patch-set are:
1. This patch-set doesn't split acpi_initrd_override(), but call it as a
whole operation at early time.
2. Allocate memory from BRK to store override tables.
(This idea is also from Yinghai.)


[Current state]

The current Linux kernel will initialize acpi tables like the following:

1. Find all acpi override table provided by users in initrd.
(Linux allows users to override acpi tables in firmware, by specifying
their own tables in initrd.)

2. Use acpica code to initialize acpi global root table list and install all
tables into it. If any override tables exists, use it to override the one
provided by firmware.

Then others can parse these tables and get useful info.

Both of the two steps happen after direct mapping page tables are setup.

[Issues]

In the current Linux kernel, the initialization of acpi tables is too late for
new functionalities.

We have some issues about this:

* For memory hotplug, we need ACPI SRAT at early time to be aware of which memory
ranges are hotpluggable, and prevent bootmem allocator from allocating memory
for the kernel. (Kernel pages cannot be hotplugged because )

* As suggested by Yinghai Lu <[email protected]>, we should allocate page tables
in local node. This also needs SRAT before direct mapping page tables are setup.

* As mentioned by Toshi Kani <[email protected]>, ACPI SCPR/DBGP/DBG2 tables
allow the OS to initialize serial console/debug ports at early boot time. The
earlier it can be initialized, the better this feature will be. These tables
are not currently used by Linux due to a licensing issue, but it could be
addressed some time soon.


[What are we doing]

We are trying to initialize acip tables as early as possible. But Linux kernel
allows users to override acpi tables by specifying their own tables in initrd.
So we have to do acpi_initrd_override() earlier first.


[About this patch-set]

This patch-set aims to move acpi_initrd_override() as early as possible on x86.
As suggested by Yinghai, we are trying to do it like this:

On 32bit: do it in head_32.S, before paging is enabled. In this case, we can
access initrd with physical address without page tables.

On 64bit: do it in head_64.c, after paging is enabled but before direct mapping
is setup.

And also, acpi_initrd_override() needs to allocate memory for override tables.
But at such an early time, there is no memory allocator works. So the basic idea
from Yinghai is to use BRK. We will extend BRK 256KB in this patch-set.


Tang Chen (6):
x86, acpi: Move table_sigs[] to stack.
x86, acpi, brk: Extend BRK 256KB to store acpi override tables.
x86, brk: Make extend_brk() available with va/pa.
x86, acpi: Make acpi_initrd_override() available with va or pa.
x86, acpi, brk: Make early_alloc_acpi_override_tables_buf() available
with va/pa.
x86, acpi: Do acpi_initrd_override() earlier in head_32.S/head64.c.

Yinghai Lu (2):
x86: Make get_ramdisk_{image|size}() global.
x86, microcode: Use get_ramdisk_{image|size}() in microcode handling.

arch/x86/include/asm/dmi.h | 2 +-
arch/x86/include/asm/setup.h | 11 +++-
arch/x86/kernel/head64.c | 4 +
arch/x86/kernel/head_32.S | 4 +
arch/x86/kernel/microcode_intel_early.c | 8 +-
arch/x86/kernel/setup.c | 93 ++++++++++++++++------
arch/x86/mm/init.c | 2 +-
arch/x86/xen/enlighten.c | 2 +-
arch/x86/xen/mmu.c | 6 +-
arch/x86/xen/p2m.c | 27 ++++---
drivers/acpi/osl.c | 130 ++++++++++++++++++++-----------
include/linux/acpi.h | 5 +-
12 files changed, 196 insertions(+), 98 deletions(-)


2013-08-21 10:18:15

by Tang Chen

[permalink] [raw]
Subject: [PATCH 3/8] x86, acpi: Move table_sigs[] to stack.

On 64bit, we will do acpi_initrd_override() in x86_64_start_kernel(). This is
after CPU enables paging, but before direct mapping page tables are setup.
This is OK because we have an early page fault handler to help to access data
without direct mapping page tables.

But on 32bit, in order to keep the x86_32 and x86_64 unified code clean, we want
to do acpi_initrd_override() in head_32.S, before CPU enables paging. Without
direct mapping page tables, we need to access data with physical address.

For global variables, if we access them, we are using va. So on 32bit, we have
to convert them into pa. But for a global array, it could be too messy.

So this patch move table_sigs[] to stack. Define it in acpi_initrd_override().
It is no more than 36 pointers, so it is OK to put it on stack.

Originally-From: Yinghai Lu <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
---
drivers/acpi/osl.c | 23 +++++++++++------------
1 files changed, 11 insertions(+), 12 deletions(-)

diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index e7effc1..06996d8 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -551,18 +551,6 @@ u8 __init acpi_table_checksum(u8 *buffer, u32 length)
return sum;
}

-/* All but ACPI_SIG_RSDP and ACPI_SIG_FACS: */
-static const char * const table_sigs[] = {
- ACPI_SIG_BERT, ACPI_SIG_CPEP, ACPI_SIG_ECDT, ACPI_SIG_EINJ,
- ACPI_SIG_ERST, ACPI_SIG_HEST, ACPI_SIG_MADT, ACPI_SIG_MSCT,
- ACPI_SIG_SBST, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_ASF,
- ACPI_SIG_BOOT, ACPI_SIG_DBGP, ACPI_SIG_DMAR, ACPI_SIG_HPET,
- ACPI_SIG_IBFT, ACPI_SIG_IVRS, ACPI_SIG_MCFG, ACPI_SIG_MCHI,
- ACPI_SIG_SLIC, ACPI_SIG_SPCR, ACPI_SIG_SPMI, ACPI_SIG_TCPA,
- ACPI_SIG_UEFI, ACPI_SIG_WAET, ACPI_SIG_WDAT, ACPI_SIG_WDDT,
- ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT,
- ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };
-
#define ACPI_HEADER_SIZE sizeof(struct acpi_table_header)

/* Must not increase 10 or needs code modification below */
@@ -577,6 +565,17 @@ void __init acpi_initrd_override(void *data, size_t size)
struct cpio_data file;
struct cpio_data early_initrd_files[ACPI_OVERRIDE_TABLES];
char *p;
+ /* All but ACPI_SIG_RSDP and ACPI_SIG_FACS: */
+ const char * const table_sigs[] = {
+ ACPI_SIG_BERT, ACPI_SIG_CPEP, ACPI_SIG_ECDT, ACPI_SIG_EINJ,
+ ACPI_SIG_ERST, ACPI_SIG_HEST, ACPI_SIG_MADT, ACPI_SIG_MSCT,
+ ACPI_SIG_SBST, ACPI_SIG_SLIT, ACPI_SIG_SRAT, ACPI_SIG_ASF,
+ ACPI_SIG_BOOT, ACPI_SIG_DBGP, ACPI_SIG_DMAR, ACPI_SIG_HPET,
+ ACPI_SIG_IBFT, ACPI_SIG_IVRS, ACPI_SIG_MCFG, ACPI_SIG_MCHI,
+ ACPI_SIG_SLIC, ACPI_SIG_SPCR, ACPI_SIG_SPMI, ACPI_SIG_TCPA,
+ ACPI_SIG_UEFI, ACPI_SIG_WAET, ACPI_SIG_WDAT, ACPI_SIG_WDDT,
+ ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT,
+ ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };

if (data == NULL || size == 0)
return;
--
1.7.1

2013-08-21 10:18:50

by Tang Chen

[permalink] [raw]
Subject: [PATCH 2/8] x86, microcode: Use get_ramdisk_{image|size}() in microcode handling.

From: Yinghai Lu <[email protected]>

Since we made get_ramdisk_{image|size}() global, use them when we want
to access ramdisk.

Signed-off-by: Yinghai Lu <[email protected]>
Cc: Fenghua Yu <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/kernel/microcode_intel_early.c | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/microcode_intel_early.c b/arch/x86/kernel/microcode_intel_early.c
index 1575deb..4c58a1b 100644
--- a/arch/x86/kernel/microcode_intel_early.c
+++ b/arch/x86/kernel/microcode_intel_early.c
@@ -743,8 +743,8 @@ load_ucode_intel_bsp(void)
struct boot_params *boot_params_p;

boot_params_p = (struct boot_params *)__pa_nodebug(&boot_params);
- ramdisk_image = boot_params_p->hdr.ramdisk_image;
- ramdisk_size = boot_params_p->hdr.ramdisk_size;
+ ramdisk_image = get_ramdisk_image(boot_params_p);
+ ramdisk_size = get_ramdisk_size(boot_params_p);
initrd_start_early = ramdisk_image;
initrd_end_early = initrd_start_early + ramdisk_size;

@@ -753,8 +753,8 @@ load_ucode_intel_bsp(void)
(unsigned long *)__pa_nodebug(&mc_saved_in_initrd),
initrd_start_early, initrd_end_early, &uci);
#else
- ramdisk_image = boot_params.hdr.ramdisk_image;
- ramdisk_size = boot_params.hdr.ramdisk_size;
+ ramdisk_image = get_ramdisk_image(&boot_params);
+ ramdisk_size = get_ramdisk_size(&boot_params);
initrd_start_early = ramdisk_image + PAGE_OFFSET;
initrd_end_early = initrd_start_early + ramdisk_size;

--
1.7.1

2013-08-21 10:19:52

by Tang Chen

[permalink] [raw]
Subject: [PATCH 1/8] x86: Make get_ramdisk_{image|size}() global.

From: Yinghai Lu <[email protected]>

This patch does two things:

1. Make get_ramdisk_image() and get_ramdisk_size() global so that we can use
them in the later patches.

2. In later patches, we are going to call them in head_32.S before paging is
enabled. In that case, we can only use physical address to access global
variable like boot_params. So make them take a boot_params pointer parameter
so that we can pass va or pa to them.

Signed-off-by: Yinghai Lu <[email protected]>
Acked-by: Tejun Heo <[email protected]>
Tested-by: Thomas Renninger <[email protected]>
Reviewed-by: Tang Chen <[email protected]>
Tested-by: Tang Chen <[email protected]>
---
arch/x86/include/asm/setup.h | 3 +++
arch/x86/kernel/setup.c | 28 ++++++++++++++--------------
2 files changed, 17 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index b7bf350..4f71d48 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -106,6 +106,9 @@ void *extend_brk(size_t size, size_t align);
RESERVE_BRK(name, sizeof(type) * entries)

extern void probe_roms(void);
+u64 get_ramdisk_image(struct boot_params *bp);
+u64 get_ramdisk_size(struct boot_params *bp);
+
#ifdef __i386__

void __init i386_start_kernel(void);
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f8ec578..5bfd4c8 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -297,19 +297,19 @@ static void __init reserve_brk(void)

#ifdef CONFIG_BLK_DEV_INITRD

-static u64 __init get_ramdisk_image(void)
+u64 __init get_ramdisk_image(struct boot_params *bp)
{
- u64 ramdisk_image = boot_params.hdr.ramdisk_image;
+ u64 ramdisk_image = bp->hdr.ramdisk_image;

- ramdisk_image |= (u64)boot_params.ext_ramdisk_image << 32;
+ ramdisk_image |= (u64)bp->ext_ramdisk_image << 32;

return ramdisk_image;
}
-static u64 __init get_ramdisk_size(void)
+u64 __init get_ramdisk_size(struct boot_params *bp)
{
- u64 ramdisk_size = boot_params.hdr.ramdisk_size;
+ u64 ramdisk_size = bp->hdr.ramdisk_size;

- ramdisk_size |= (u64)boot_params.ext_ramdisk_size << 32;
+ ramdisk_size |= (u64)bp->ext_ramdisk_size << 32;

return ramdisk_size;
}
@@ -318,8 +318,8 @@ static u64 __init get_ramdisk_size(void)
static void __init relocate_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = get_ramdisk_image();
- u64 ramdisk_size = get_ramdisk_size();
+ u64 ramdisk_image = get_ramdisk_image(&boot_params);
+ u64 ramdisk_size = get_ramdisk_size(&boot_params);
u64 area_size = PAGE_ALIGN(ramdisk_size);
u64 ramdisk_here;
unsigned long slop, clen, mapaddr;
@@ -358,8 +358,8 @@ static void __init relocate_initrd(void)
ramdisk_size -= clen;
}

- ramdisk_image = get_ramdisk_image();
- ramdisk_size = get_ramdisk_size();
+ ramdisk_image = get_ramdisk_image(&boot_params);
+ ramdisk_size = get_ramdisk_size(&boot_params);
printk(KERN_INFO "Move RAMDISK from [mem %#010llx-%#010llx] to"
" [mem %#010llx-%#010llx]\n",
ramdisk_image, ramdisk_image + ramdisk_size - 1,
@@ -369,8 +369,8 @@ static void __init relocate_initrd(void)
static void __init early_reserve_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = get_ramdisk_image();
- u64 ramdisk_size = get_ramdisk_size();
+ u64 ramdisk_image = get_ramdisk_image(&boot_params);
+ u64 ramdisk_size = get_ramdisk_size(&boot_params);
u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);

if (!boot_params.hdr.type_of_loader ||
@@ -382,8 +382,8 @@ static void __init early_reserve_initrd(void)
static void __init reserve_initrd(void)
{
/* Assume only end is not page aligned */
- u64 ramdisk_image = get_ramdisk_image();
- u64 ramdisk_size = get_ramdisk_size();
+ u64 ramdisk_image = get_ramdisk_image(&boot_params);
+ u64 ramdisk_size = get_ramdisk_size(&boot_params);
u64 ramdisk_end = PAGE_ALIGN(ramdisk_image + ramdisk_size);
u64 mapped_size;

--
1.7.1

2013-08-21 10:20:20

by Tang Chen

[permalink] [raw]
Subject: [PATCH 7/8] x86, acpi, brk: Make early_alloc_acpi_override_tables_buf() available with va/pa.

We are using the same trick in previous patch.

Introduce a "bool is_phys" to early_alloc_acpi_override_tables_buf(). When it
is true, convert all golbal variables va to pa, so that we can access them on
32bit before paging is enabled.

NOTE: Do not call printk() on 32bit before paging is enabled
because it will use global variables.

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/kernel/setup.c | 2 +-
drivers/acpi/osl.c | 11 ++++++++---
include/linux/acpi.h | 2 +-
3 files changed, 10 insertions(+), 5 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 1290ea7..5729cd2 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1071,7 +1071,7 @@ void __init setup_arch(char **cmdline_p)

#if defined(CONFIG_ACPI) && defined(CONFIG_BLK_DEV_INITRD)
/* Allocate buffer to store acpi override tables in brk. */
- early_alloc_acpi_override_tables_buf();
+ early_alloc_acpi_override_tables_buf(false);
#endif

/*
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index ccdb5a6..25ba68d 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -560,10 +560,15 @@ u8 __init acpi_table_checksum(u8 *buffer, u32 length)
/* Reserve 256KB in BRK to store acpi override tables */
#define ACPI_OVERRIDE_TABLES_SIZE (256 * 1024)
RESERVE_BRK(acpi_override_tables_alloc, ACPI_OVERRIDE_TABLES_SIZE);
-void __init early_alloc_acpi_override_tables_buf(void)
+void __init early_alloc_acpi_override_tables_buf(bool is_phys)
{
- acpi_tables_addr = __pa(extend_brk(ACPI_OVERRIDE_TABLES_SIZE,
- PAGE_SIZE, false));
+ u64 *acpi_tables_addr_p;
+
+ acpi_tables_addr_p = is_phys ? (u64 *)__pa_nodebug(&acpi_tables_addr) :
+ (u64 *)&acpi_tables_addr;
+
+ *acpi_tables_addr_p = __pa_nodebug(extend_brk(ACPI_OVERRIDE_TABLES_SIZE,
+ PAGE_SIZE, is_phys));
}

/**
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index af4da51..17f2e8e 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -81,7 +81,7 @@ typedef int (*acpi_tbl_entry_handler)(struct acpi_subtable_header *header,

#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
void acpi_initrd_override(void *data, size_t size, bool is_phys);
-void early_alloc_acpi_override_tables_buf(void);
+void early_alloc_acpi_override_tables_buf(bool is_phys);
#else
static inline void acpi_initrd_override(void *data, size_t size, bool is_phys)
{
--
1.7.1

2013-08-21 10:20:27

by Tang Chen

[permalink] [raw]
Subject: [PATCH 5/8] x86, brk: Make extend_brk() available with va/pa.

We are going to do acpi_initrd_override() at very early time:

On 32bit: do it in head_32.S, before paging is enabled. In this case, we can
access initrd with physical address without page tables.

On 64bit: do it in head_64.c, after paging is enabled but before direct mapping
is setup.

On 64bit, we have an early page fault handler to help to access data
with direct mapping page tables. So it is easy to do in head_64.c.

And we need to allocate memory to store override tables. At such an early time,
no memory allocator works. So we can only use BRK.

As mentioned above, on 32bit before paging is enabled, we have to access variables
with pa. So introduce a "bool is_phys" parameter to extend_brk(), and convert va
to pa is it is true.

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/include/asm/dmi.h | 2 +-
arch/x86/include/asm/setup.h | 2 +-
arch/x86/kernel/setup.c | 20 ++++++++++++++------
arch/x86/mm/init.c | 2 +-
arch/x86/xen/enlighten.c | 2 +-
arch/x86/xen/mmu.c | 6 +++---
arch/x86/xen/p2m.c | 27 ++++++++++++++-------------
drivers/acpi/osl.c | 2 +-
8 files changed, 36 insertions(+), 27 deletions(-)

diff --git a/arch/x86/include/asm/dmi.h b/arch/x86/include/asm/dmi.h
index fd8f9e2..3b51d81 100644
--- a/arch/x86/include/asm/dmi.h
+++ b/arch/x86/include/asm/dmi.h
@@ -9,7 +9,7 @@

static __always_inline __init void *dmi_alloc(unsigned len)
{
- return extend_brk(len, sizeof(int));
+ return extend_brk(len, sizeof(int), false);
}

/* Use early IO mappings for DMI because it's initialized early */
diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 4f71d48..96d00da 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -75,7 +75,7 @@ extern struct boot_params boot_params;

/* exceedingly early brk-like allocator */
extern unsigned long _brk_end;
-void *extend_brk(size_t size, size_t align);
+void *extend_brk(size_t size, size_t align, bool is_phys);

/*
* Reserve space in the brk section. The name must be unique within
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 51fcd5d..a189909 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -259,19 +259,27 @@ static inline void __init copy_edd(void)
}
#endif

-void * __init extend_brk(size_t size, size_t align)
+void * __init extend_brk(size_t size, size_t align, bool is_phys)
{
size_t mask = align - 1;
void *ret;
+ unsigned long *brk_start, *brk_end, *brk_limit;

- BUG_ON(_brk_start == 0);
+ brk_start = is_phys ? (unsigned long *)__pa_nodebug(&_brk_start) :
+ (unsigned long *)&_brk_start;
+ brk_end = is_phys ? (unsigned long *)__pa_nodebug(&_brk_end) :
+ (unsigned long *)&_brk_end;
+ brk_limit = is_phys ? (unsigned long *)__pa_nodebug(__brk_limit) :
+ (unsigned long *)__brk_limit;
+
+ BUG_ON(*brk_start == 0);
BUG_ON(align & mask);

- _brk_end = (_brk_end + mask) & ~mask;
- BUG_ON((char *)(_brk_end + size) > __brk_limit);
+ *brk_end = (*brk_end + mask) & ~mask;
+ BUG_ON((char *)(*brk_end + size) > brk_limit);

- ret = (void *)_brk_end;
- _brk_end += size;
+ ret = (void *)(*brk_end);
+ *brk_end += size;

memset(ret, 0, size);

diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
index 2ec29ac..189a9e2 100644
--- a/arch/x86/mm/init.c
+++ b/arch/x86/mm/init.c
@@ -86,7 +86,7 @@ void __init early_alloc_pgt_buf(void)
unsigned long tables = INIT_PGT_BUF_SIZE;
phys_addr_t base;

- base = __pa(extend_brk(tables, PAGE_SIZE));
+ base = __pa(extend_brk(tables, PAGE_SIZE, false));

pgt_buf_start = base >> PAGE_SHIFT;
pgt_buf_end = pgt_buf_start;
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 193097e..2d5a34f 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1629,7 +1629,7 @@ void __ref xen_hvm_init_shared_info(void)

if (!shared_info_page)
shared_info_page = (struct shared_info *)
- extend_brk(PAGE_SIZE, PAGE_SIZE);
+ extend_brk(PAGE_SIZE, PAGE_SIZE, false);
xatp.domid = DOMID_SELF;
xatp.idx = 0;
xatp.space = XENMAPSPACE_shared_info;
diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
index fdc3ba2..573bc50 100644
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -1768,7 +1768,7 @@ static void __init xen_map_identity_early(pmd_t *pmd, unsigned long max_pfn)
unsigned long pfn;

level1_ident_pgt = extend_brk(sizeof(pte_t) * LEVEL1_IDENT_ENTRIES,
- PAGE_SIZE);
+ PAGE_SIZE, false);

ident_pte = 0;
pfn = 0;
@@ -1980,7 +1980,7 @@ static void __init xen_write_cr3_init(unsigned long cr3)
* swapper_pg_dir.
*/
swapper_kernel_pmd =
- extend_brk(sizeof(pmd_t) * PTRS_PER_PMD, PAGE_SIZE);
+ extend_brk(sizeof(pmd_t) * PTRS_PER_PMD, PAGE_SIZE, false);
copy_page(swapper_kernel_pmd, initial_kernel_pmd);
swapper_pg_dir[KERNEL_PGD_BOUNDARY] =
__pgd(__pa(swapper_kernel_pmd) | _PAGE_PRESENT);
@@ -2003,7 +2003,7 @@ void __init xen_setup_kernel_pagetable(pgd_t *pgd, unsigned long max_pfn)
pmd_t *kernel_pmd;

initial_kernel_pmd =
- extend_brk(sizeof(pmd_t) * PTRS_PER_PMD, PAGE_SIZE);
+ extend_brk(sizeof(pmd_t) * PTRS_PER_PMD, PAGE_SIZE, false);

max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->pt_base) +
xen_start_info->nr_pt_frames * PAGE_SIZE +
diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
index 95fb2aa..bbdcf20 100644
--- a/arch/x86/xen/p2m.c
+++ b/arch/x86/xen/p2m.c
@@ -281,13 +281,13 @@ void __ref xen_build_mfn_list_list(void)

/* Pre-initialize p2m_top_mfn to be completely missing */
if (p2m_top_mfn == NULL) {
- p2m_mid_missing_mfn = extend_brk(PAGE_SIZE, PAGE_SIZE);
+ p2m_mid_missing_mfn = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
p2m_mid_mfn_init(p2m_mid_missing_mfn);

- p2m_top_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE);
+ p2m_top_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
p2m_top_mfn_p_init(p2m_top_mfn_p);

- p2m_top_mfn = extend_brk(PAGE_SIZE, PAGE_SIZE);
+ p2m_top_mfn = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
p2m_top_mfn_init(p2m_top_mfn);
} else {
/* Reinitialise, mfn's all change after migration */
@@ -322,7 +322,7 @@ void __ref xen_build_mfn_list_list(void)
* runtime. extend_brk() will BUG if we call
* it too late.
*/
- mid_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE);
+ mid_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
p2m_mid_mfn_init(mid_mfn_p);

p2m_top_mfn_p[topidx] = mid_mfn_p;
@@ -351,16 +351,16 @@ void __init xen_build_dynamic_phys_to_machine(void)

xen_max_p2m_pfn = max_pfn;

- p2m_missing = extend_brk(PAGE_SIZE, PAGE_SIZE);
+ p2m_missing = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
p2m_init(p2m_missing);

- p2m_mid_missing = extend_brk(PAGE_SIZE, PAGE_SIZE);
+ p2m_mid_missing = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
p2m_mid_init(p2m_mid_missing);

- p2m_top = extend_brk(PAGE_SIZE, PAGE_SIZE);
+ p2m_top = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
p2m_top_init(p2m_top);

- p2m_identity = extend_brk(PAGE_SIZE, PAGE_SIZE);
+ p2m_identity = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
p2m_init(p2m_identity);

/*
@@ -373,7 +373,8 @@ void __init xen_build_dynamic_phys_to_machine(void)
unsigned mididx = p2m_mid_index(pfn);

if (p2m_top[topidx] == p2m_mid_missing) {
- unsigned long **mid = extend_brk(PAGE_SIZE, PAGE_SIZE);
+ unsigned long **mid = extend_brk(PAGE_SIZE, PAGE_SIZE,
+ false);
p2m_mid_init(mid);

p2m_top[topidx] = mid;
@@ -609,7 +610,7 @@ static bool __init early_alloc_p2m_middle(unsigned long pfn, bool check_boundary
return false;

/* Boundary cross-over for the edges: */
- p2m = extend_brk(PAGE_SIZE, PAGE_SIZE);
+ p2m = extend_brk(PAGE_SIZE, PAGE_SIZE, false);

p2m_init(p2m);

@@ -635,7 +636,7 @@ static bool __init early_alloc_p2m(unsigned long pfn)
mid = p2m_top[topidx];
mid_mfn_p = p2m_top_mfn_p[topidx];
if (mid == p2m_mid_missing) {
- mid = extend_brk(PAGE_SIZE, PAGE_SIZE);
+ mid = extend_brk(PAGE_SIZE, PAGE_SIZE, false);

p2m_mid_init(mid);

@@ -645,7 +646,7 @@ static bool __init early_alloc_p2m(unsigned long pfn)
}
/* And the save/restore P2M tables.. */
if (mid_mfn_p == p2m_mid_missing_mfn) {
- mid_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE);
+ mid_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
p2m_mid_mfn_init(mid_mfn_p);

p2m_top_mfn_p[topidx] = mid_mfn_p;
@@ -858,7 +859,7 @@ static void __init m2p_override_init(void)
unsigned i;

m2p_overrides = extend_brk(sizeof(*m2p_overrides) * M2P_OVERRIDE_HASH,
- sizeof(unsigned long));
+ sizeof(unsigned long), false);

for (i = 0; i < M2P_OVERRIDE_HASH; i++)
INIT_LIST_HEAD(&m2p_overrides[i]);
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 4c1baa7..dff7fcc 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -563,7 +563,7 @@ RESERVE_BRK(acpi_override_tables_alloc, ACPI_OVERRIDE_TABLES_SIZE);
void __init early_alloc_acpi_override_tables_buf(void)
{
acpi_tables_addr = __pa(extend_brk(ACPI_OVERRIDE_TABLES_SIZE,
- PAGE_SIZE));
+ PAGE_SIZE, false));
}

void __init acpi_initrd_override(void *data, size_t size)
--
1.7.1

2013-08-21 10:20:53

by Tang Chen

[permalink] [raw]
Subject: [PATCH 4/8] x86, acpi, brk: Extend BRK 256KB to store acpi override tables.

When finding acpi override tables in initrd, we need to allocate memory to
store these tables. But at such an early time, we don't have any memory
allocator. The basic idea is to use BRK.

This patch reserves 256KB in BRK, and allocate it to store override tables,
instead of memblock.

This idea is from Yinghai Lu <[email protected]>.

Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/kernel/setup.c | 5 +++++
drivers/acpi/osl.c | 44 ++++++++++++++++++++++----------------------
include/linux/acpi.h | 1 +
3 files changed, 28 insertions(+), 22 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 5bfd4c8..51fcd5d 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1061,6 +1061,11 @@ void __init setup_arch(char **cmdline_p)

early_alloc_pgt_buf();

+#if defined(CONFIG_ACPI) && defined(CONFIG_BLK_DEV_INITRD)
+ /* Allocate buffer to store acpi override tables in brk. */
+ early_alloc_acpi_override_tables_buf();
+#endif
+
/*
* Need to conclude brk, before memblock_x86_fill()
* it could use memblock_find_in_range, could overlap with
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index 06996d8..4c1baa7 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -48,6 +48,7 @@

#include <asm/io.h>
#include <asm/uaccess.h>
+#include <asm/setup.h>

#include <acpi/acpi.h>
#include <acpi/acpi_bus.h>
@@ -556,6 +557,15 @@ u8 __init acpi_table_checksum(u8 *buffer, u32 length)
/* Must not increase 10 or needs code modification below */
#define ACPI_OVERRIDE_TABLES 10

+/* Reserve 256KB in BRK to store acpi override tables */
+#define ACPI_OVERRIDE_TABLES_SIZE (256 * 1024)
+RESERVE_BRK(acpi_override_tables_alloc, ACPI_OVERRIDE_TABLES_SIZE);
+void __init early_alloc_acpi_override_tables_buf(void)
+{
+ acpi_tables_addr = __pa(extend_brk(ACPI_OVERRIDE_TABLES_SIZE,
+ PAGE_SIZE));
+}
+
void __init acpi_initrd_override(void *data, size_t size)
{
int sig, no, table_nr = 0, total_offset = 0;
@@ -619,7 +629,18 @@ void __init acpi_initrd_override(void *data, size_t size)
pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
table->signature, cpio_path, file.name, table->length);

+ /*
+ * If the override tables in cpio file exceeds the BRK buffer,
+ * ignore the current table and go for the next one.
+ */
all_tables_size += table->length;
+ if (all_tables_size > ACPI_OVERRIDE_TABLES_SIZE) {
+ pr_warning("ACPI OVERRIDE: ACPI override tables exceeds buffer size."
+ " Ignoring table %4.4s\n", table->signature);
+ all_tables_size -= table->length;
+ continue;
+ }
+
early_initrd_files[table_nr].data = file.data;
early_initrd_files[table_nr].size = file.size;
table_nr++;
@@ -627,34 +648,13 @@ void __init acpi_initrd_override(void *data, size_t size)
if (table_nr == 0)
return;

- acpi_tables_addr =
- memblock_find_in_range(0, max_low_pfn_mapped << PAGE_SHIFT,
- all_tables_size, PAGE_SIZE);
- if (!acpi_tables_addr) {
- WARN_ON(1);
- return;
- }
- /*
- * Only calling e820_add_reserve does not work and the
- * tables are invalid (memory got used) later.
- * memblock_reserve works as expected and the tables won't get modified.
- * But it's not enough on X86 because ioremap will
- * complain later (used by acpi_os_map_memory) that the pages
- * that should get mapped are not marked "reserved".
- * Both memblock_reserve and e820_add_region (via arch_reserve_mem_area)
- * works fine.
- */
- memblock_reserve(acpi_tables_addr, all_tables_size);
- arch_reserve_mem_area(acpi_tables_addr, all_tables_size);
-
- p = early_ioremap(acpi_tables_addr, all_tables_size);
+ p = __va(acpi_tables_addr);

for (no = 0; no < table_nr; no++) {
memcpy(p + total_offset, early_initrd_files[no].data,
early_initrd_files[no].size);
total_offset += early_initrd_files[no].size;
}
- early_iounmap(p, all_tables_size);
}
#endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */

diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 353ba25..381579e 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -81,6 +81,7 @@ typedef int (*acpi_tbl_entry_handler)(struct acpi_subtable_header *header,

#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
void acpi_initrd_override(void *data, size_t size);
+void early_alloc_acpi_override_tables_buf(void);
#else
static inline void acpi_initrd_override(void *data, size_t size)
{
--
1.7.1

2013-08-21 10:21:53

by Tang Chen

[permalink] [raw]
Subject: [PATCH 8/8] x86, acpi: Do acpi_initrd_override() earlier in head_32.S/head64.c.

Introduce x86_acpi_initrd_override() to do acpi table override job. This function
can be called before or after paging is enabled. On 32bit, it will be called before
paging is enabled. On 64bit, it will be called after paging is enabled but before
direct mapping page tables are setup.

Originally-From: Yinghai Lu <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/include/asm/setup.h | 6 +++++
arch/x86/kernel/head64.c | 4 +++
arch/x86/kernel/head_32.S | 4 +++
arch/x86/kernel/setup.c | 51 ++++++++++++++++++++++++++++++++---------
4 files changed, 54 insertions(+), 11 deletions(-)

diff --git a/arch/x86/include/asm/setup.h b/arch/x86/include/asm/setup.h
index 96d00da..9f32cb4 100644
--- a/arch/x86/include/asm/setup.h
+++ b/arch/x86/include/asm/setup.h
@@ -42,6 +42,12 @@ extern void visws_early_detect(void);
static inline void visws_early_detect(void) { }
#endif

+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+void x86_acpi_initrd_override(void);
+#else
+static inline void x86_acpi_initrd_override(void) { }
+#endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */
+
extern unsigned long saved_video_mode;

extern void reserve_standard_io_resources(void);
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
index 55b6761..88e19b4 100644
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -175,6 +175,10 @@ void __init x86_64_start_kernel(char * real_mode_data)
if (console_loglevel == 10)
early_printk("Kernel alive\n");

+#if defined(CONFIG_ACPI) && defined(CONFIG_BLK_DEV_INITRD)
+ x86_acpi_initrd_override();
+#endif
+
clear_page(init_level4_pgt);
/* set init_level4_pgt kernel high mapping*/
init_level4_pgt[511] = early_level4_pgt[511];
diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 5dd87a8..e04e13b 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -149,6 +149,10 @@ ENTRY(startup_32)
call load_ucode_bsp
#endif

+#if defined(CONFIG_ACPI) && defined(CONFIG_BLK_DEV_INITRD)
+ call x86_acpi_initrd_override
+#endif
+
/*
* Initialize page tables. This creates a PDE and a set of page
* tables, which are located immediately beyond __brk_base. The variable
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 5729cd2..b48a0ff 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -833,7 +833,46 @@ static void __init trim_low_memory_range(void)
{
memblock_reserve(0, ALIGN(reserve_low, PAGE_SIZE));
}
-
+
+#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
+/**
+ * x86_acpi_initrd_override - Find all acpi override tables in initrd, and copy
+ * them to acpi_tables_addr.
+ *
+ * On 32bit platform, this function is call in head_32.S, before paging is
+ * enabled. So we have to use physical address.
+ *
+ * On 64bit platform, this function is call in head_64.c, after paging is
+ * enabled but before direct mapping page tables are set up. Since we have an
+ * early page fault handler on 64bit, so it is OK to use virtual address.
+ */
+void __init x86_acpi_initrd_override(void)
+{
+ unsigned long ramdisk_image, ramdisk_size;
+ void *p = NULL;
+
+#ifdef CONFIG_X86_32
+ struct boot_params *boot_params_p;
+
+ boot_params_p = (struct boot_params *)__pa(&boot_params);
+ ramdisk_image = get_ramdisk_image(boot_params_p);
+ ramdisk_size = get_ramdisk_size(boot_params_p);
+ p = (void *)ramdisk_image;
+
+ early_alloc_acpi_override_tables_buf(true);
+ acpi_initrd_override(p, ramdisk_size, true);
+#else
+ ramdisk_image = get_ramdisk_image(&boot_params);
+ ramdisk_size = get_ramdisk_size(&boot_params);
+ if (ramdisk_image)
+ p = (void *)__va(ramdisk_image);
+
+ early_alloc_acpi_override_tables_buf(false);
+ acpi_initrd_override(p, ramdisk_size, false);
+#endif /* CONFIG_X86_32 */
+}
+#endif /* CONFIG_ACPI_INITRD_TABLE_OVERRIDE */
+
/*
* Determine if we were loaded by an EFI loader. If so, then we have also been
* passed the efi memmap, systab, etc., so we should use these data structures
@@ -1069,11 +1108,6 @@ void __init setup_arch(char **cmdline_p)

early_alloc_pgt_buf();

-#if defined(CONFIG_ACPI) && defined(CONFIG_BLK_DEV_INITRD)
- /* Allocate buffer to store acpi override tables in brk. */
- early_alloc_acpi_override_tables_buf(false);
-#endif
-
/*
* Need to conclude brk, before memblock_x86_fill()
* it could use memblock_find_in_range, could overlap with
@@ -1132,11 +1166,6 @@ void __init setup_arch(char **cmdline_p)

reserve_initrd();

-#if defined(CONFIG_ACPI) && defined(CONFIG_BLK_DEV_INITRD)
- acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start,
- false);
-#endif
-
reserve_crashkernel();

vsmp_init();
--
1.7.1

2013-08-21 10:47:03

by Tang Chen

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hi all,

This patch-set has not been fully tested. I sent them first for you
to review. Please comment if we can agree on this solution.

Thanks.:)

On 08/21/2013 06:15 PM, Tang Chen wrote:
> This patch-set aims to move acpi_initrd_override() earlier on x86.
> Some of the patches are from Yinghai's patch-set:
> https://lkml.org/lkml/2013/6/14/561
>
> The difference between this patch-set and Yinghai's original patch-set are:
> 1. This patch-set doesn't split acpi_initrd_override(), but call it as a
> whole operation at early time.
> 2. Allocate memory from BRK to store override tables.
> (This idea is also from Yinghai.)
>
>
> [Current state]
>
> The current Linux kernel will initialize acpi tables like the following:
>
> 1. Find all acpi override table provided by users in initrd.
> (Linux allows users to override acpi tables in firmware, by specifying
> their own tables in initrd.)
>
> 2. Use acpica code to initialize acpi global root table list and install all
> tables into it. If any override tables exists, use it to override the one
> provided by firmware.
>
> Then others can parse these tables and get useful info.
>
> Both of the two steps happen after direct mapping page tables are setup.
>
> [Issues]
>
> In the current Linux kernel, the initialization of acpi tables is too late for
> new functionalities.
>
> We have some issues about this:
>
> * For memory hotplug, we need ACPI SRAT at early time to be aware of which memory
> ranges are hotpluggable, and prevent bootmem allocator from allocating memory
> for the kernel. (Kernel pages cannot be hotplugged because )
>
> * As suggested by Yinghai Lu<[email protected]>, we should allocate page tables
> in local node. This also needs SRAT before direct mapping page tables are setup.
>
> * As mentioned by Toshi Kani<[email protected]>, ACPI SCPR/DBGP/DBG2 tables
> allow the OS to initialize serial console/debug ports at early boot time. The
> earlier it can be initialized, the better this feature will be. These tables
> are not currently used by Linux due to a licensing issue, but it could be
> addressed some time soon.
>
>
> [What are we doing]
>
> We are trying to initialize acip tables as early as possible. But Linux kernel
> allows users to override acpi tables by specifying their own tables in initrd.
> So we have to do acpi_initrd_override() earlier first.
>
>
> [About this patch-set]
>
> This patch-set aims to move acpi_initrd_override() as early as possible on x86.
> As suggested by Yinghai, we are trying to do it like this:
>
> On 32bit: do it in head_32.S, before paging is enabled. In this case, we can
> access initrd with physical address without page tables.
>
> On 64bit: do it in head_64.c, after paging is enabled but before direct mapping
> is setup.
>
> And also, acpi_initrd_override() needs to allocate memory for override tables.
> But at such an early time, there is no memory allocator works. So the basic idea
> from Yinghai is to use BRK. We will extend BRK 256KB in this patch-set.
>
>
> Tang Chen (6):
> x86, acpi: Move table_sigs[] to stack.
> x86, acpi, brk: Extend BRK 256KB to store acpi override tables.
> x86, brk: Make extend_brk() available with va/pa.
> x86, acpi: Make acpi_initrd_override() available with va or pa.
> x86, acpi, brk: Make early_alloc_acpi_override_tables_buf() available
> with va/pa.
> x86, acpi: Do acpi_initrd_override() earlier in head_32.S/head64.c.
>
> Yinghai Lu (2):
> x86: Make get_ramdisk_{image|size}() global.
> x86, microcode: Use get_ramdisk_{image|size}() in microcode handling.
>
> arch/x86/include/asm/dmi.h | 2 +-
> arch/x86/include/asm/setup.h | 11 +++-
> arch/x86/kernel/head64.c | 4 +
> arch/x86/kernel/head_32.S | 4 +
> arch/x86/kernel/microcode_intel_early.c | 8 +-
> arch/x86/kernel/setup.c | 93 ++++++++++++++++------
> arch/x86/mm/init.c | 2 +-
> arch/x86/xen/enlighten.c | 2 +-
> arch/x86/xen/mmu.c | 6 +-
> arch/x86/xen/p2m.c | 27 ++++---
> drivers/acpi/osl.c | 130 ++++++++++++++++++++-----------
> include/linux/acpi.h | 5 +-
> 12 files changed, 196 insertions(+), 98 deletions(-)
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email:<a href=mailto:"[email protected]"> [email protected]</a>
>

2013-08-21 10:48:08

by Tang Chen

[permalink] [raw]
Subject: [PATCH 6/8] x86, acpi: Make acpi_initrd_override() available with va or pa.

We are using the same trick in previous patch.

Introduce a "bool is_phys" to acpi_initrd_override(). When it
is true, convert all golbal variables va to pa, so that we can
access them on 32bit before paging is enabled.

NOTE: Do not call printk() on 32bit before paging is enabled
because it will use global variables.

Originally-From: Yinghai Lu <[email protected]>
Signed-off-by: Tang Chen <[email protected]>
---
arch/x86/kernel/setup.c | 3 +-
drivers/acpi/osl.c | 68 ++++++++++++++++++++++++++++++++++------------
include/linux/acpi.h | 4 +-
3 files changed, 54 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index a189909..1290ea7 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1133,7 +1133,8 @@ void __init setup_arch(char **cmdline_p)
reserve_initrd();

#if defined(CONFIG_ACPI) && defined(CONFIG_BLK_DEV_INITRD)
- acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start);
+ acpi_initrd_override((void *)initrd_start, initrd_end - initrd_start,
+ false);
#endif

reserve_crashkernel();
diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
index dff7fcc..ccdb5a6 100644
--- a/drivers/acpi/osl.c
+++ b/drivers/acpi/osl.c
@@ -566,7 +566,19 @@ void __init early_alloc_acpi_override_tables_buf(void)
PAGE_SIZE, false));
}

-void __init acpi_initrd_override(void *data, size_t size)
+/**
+ * acpi_initrd_override - Initialize acpi_tables_addr with acpi override tables
+ * @data: cpio file address (va or pa)
+ * @size: size of cpio file
+ * @is_phys: true if @data is pa, false otherwise
+ *
+ * This function will find all acpi override tables provided by initrd, and
+ * store the addresses in acpi_tables_addr.
+ *
+ * This function could be called before paging is enabled. Before paging is
+ * enabled, caller should use physical address, and set @is_phys as true.
+ */
+void __init acpi_initrd_override(void *data, size_t size, bool is_phys)
{
int sig, no, table_nr = 0, total_offset = 0;
long offset = 0;
@@ -586,10 +598,17 @@ void __init acpi_initrd_override(void *data, size_t size)
ACPI_SIG_UEFI, ACPI_SIG_WAET, ACPI_SIG_WDAT, ACPI_SIG_WDDT,
ACPI_SIG_WDRT, ACPI_SIG_DSDT, ACPI_SIG_FADT, ACPI_SIG_PSDT,
ACPI_SIG_RSDT, ACPI_SIG_XSDT, ACPI_SIG_SSDT, NULL };
+ u64 *acpi_tables_addr_p = &acpi_tables_addr;
+ int *all_tables_size_p = &all_tables_size;

if (data == NULL || size == 0)
return;

+ if (is_phys) {
+ acpi_tables_addr_p = (u64 *)__pa_nodebug(&acpi_tables_addr);
+ all_tables_size_p = (int *)__pa_nodebug(&all_tables_size);
+ }
+
for (no = 0; no < ACPI_OVERRIDE_TABLES; no++) {
file = find_cpio_data(cpio_path, data, size, &offset);
if (!file.data)
@@ -599,8 +618,9 @@ void __init acpi_initrd_override(void *data, size_t size)
size -= offset;

if (file.size < sizeof(struct acpi_table_header)) {
- pr_err("ACPI OVERRIDE: Table smaller than ACPI header [%s%s]\n",
- cpio_path, file.name);
+ if (!is_phys)
+ pr_err("ACPI OVERRIDE: Table smaller than ACPI header [%s%s]\n",
+ cpio_path, file.name);
continue;
}

@@ -611,36 +631,48 @@ void __init acpi_initrd_override(void *data, size_t size)
break;

if (!table_sigs[sig]) {
- pr_err("ACPI OVERRIDE: Unknown signature [%s%s]\n",
- cpio_path, file.name);
+ if (!is_phys)
+ pr_err("ACPI OVERRIDE: Unknown signature [%s%s]\n",
+ cpio_path, file.name);
continue;
}
if (file.size != table->length) {
- pr_err("ACPI OVERRIDE: File length does not match table length [%s%s]\n",
- cpio_path, file.name);
+ if (!is_phys)
+ pr_err("ACPI OVERRIDE: File length does not match table length [%s%s]\n",
+ cpio_path, file.name);
continue;
}
if (acpi_table_checksum(file.data, table->length)) {
- pr_err("ACPI OVERRIDE: Bad table checksum [%s%s]\n",
- cpio_path, file.name);
+ if (!is_phys)
+ pr_err("ACPI OVERRIDE: Bad table checksum [%s%s]\n",
+ cpio_path, file.name);
continue;
}

- pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
- table->signature, cpio_path, file.name, table->length);
+ if (!is_phys)
+ pr_info("%4.4s ACPI table found in initrd [%s%s][0x%x]\n",
+ table->signature, cpio_path,
+ file.name, table->length);

/*
* If the override tables in cpio file exceeds the BRK buffer,
* ignore the current table and go for the next one.
*/
- all_tables_size += table->length;
- if (all_tables_size > ACPI_OVERRIDE_TABLES_SIZE) {
- pr_warning("ACPI OVERRIDE: ACPI override tables exceeds buffer size."
- " Ignoring table %4.4s\n", table->signature);
- all_tables_size -= table->length;
+ *all_tables_size_p += table->length;
+ if (*all_tables_size_p > ACPI_OVERRIDE_TABLES_SIZE) {
+ if (!is_phys)
+ pr_warning("ACPI OVERRIDE: ACPI override tables exceeds buffer size."
+ " Ignoring table %4.4s\n",
+ table->signature);
+ *all_tables_size_p -= table->length;
continue;
}

+ /*
+ * file.data is the offset of the table in initrd. If @data is
+ * pa, then we find pa. If @data is va, then we find va. No need
+ * to convert.
+ */
early_initrd_files[table_nr].data = file.data;
early_initrd_files[table_nr].size = file.size;
table_nr++;
@@ -648,8 +680,8 @@ void __init acpi_initrd_override(void *data, size_t size)
if (table_nr == 0)
return;

- p = __va(acpi_tables_addr);
-
+ p = is_phys ? (char *)(*acpi_tables_addr_p) :
+ (char *)__va(*acpi_tables_addr_p);
for (no = 0; no < table_nr; no++) {
memcpy(p + total_offset, early_initrd_files[no].data,
early_initrd_files[no].size);
diff --git a/include/linux/acpi.h b/include/linux/acpi.h
index 381579e..af4da51 100644
--- a/include/linux/acpi.h
+++ b/include/linux/acpi.h
@@ -80,10 +80,10 @@ typedef int (*acpi_tbl_entry_handler)(struct acpi_subtable_header *header,
const unsigned long end);

#ifdef CONFIG_ACPI_INITRD_TABLE_OVERRIDE
-void acpi_initrd_override(void *data, size_t size);
+void acpi_initrd_override(void *data, size_t size, bool is_phys);
void early_alloc_acpi_override_tables_buf(void);
#else
-static inline void acpi_initrd_override(void *data, size_t size)
+static inline void acpi_initrd_override(void *data, size_t size, bool is_phys)
{
}
#endif
--
1.7.1

2013-08-21 12:27:38

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH 5/8] x86, brk: Make extend_brk() available with va/pa.

Tang Chen <[email protected]> wrote:
>We are going to do acpi_initrd_override() at very early time:
>
>On 32bit: do it in head_32.S, before paging is enabled. In this case,
>we can
> access initrd with physical address without page tables.
>
>On 64bit: do it in head_64.c, after paging is enabled but before direct
>mapping
> is setup.
>
> On 64bit, we have an early page fault handler to help to access data
> with direct mapping page tables. So it is easy to do in head_64.c.
>
>And we need to allocate memory to store override tables. At such an
>early time,
>no memory allocator works. So we can only use BRK.
>
>As mentioned above, on 32bit before paging is enabled, we have to
>access variables
>with pa. So introduce a "bool is_phys" parameter to extend_brk(), and
>convert va
>to pa is it is true.

Could you do it differently? Meaning have a global symbol (paging_enabled) which will be used by most of the functions you changed in this patch and the next ones? It would naturally be enabled when paging is on and __va addresses can be used.

That could also be used in the printk case to do a BUG_ON before paging is enabled on 32bit. Or perhaps use a different code path to deal with using __pa address.

?
>
>Signed-off-by: Tang Chen <[email protected]>
>---
> arch/x86/include/asm/dmi.h | 2 +-
> arch/x86/include/asm/setup.h | 2 +-
> arch/x86/kernel/setup.c | 20 ++++++++++++++------
> arch/x86/mm/init.c | 2 +-
> arch/x86/xen/enlighten.c | 2 +-
> arch/x86/xen/mmu.c | 6 +++---
> arch/x86/xen/p2m.c | 27 ++++++++++++++-------------
> drivers/acpi/osl.c | 2 +-
> 8 files changed, 36 insertions(+), 27 deletions(-)
>
>diff --git a/arch/x86/include/asm/dmi.h b/arch/x86/include/asm/dmi.h
>index fd8f9e2..3b51d81 100644
>--- a/arch/x86/include/asm/dmi.h
>+++ b/arch/x86/include/asm/dmi.h
>@@ -9,7 +9,7 @@
>
> static __always_inline __init void *dmi_alloc(unsigned len)
> {
>- return extend_brk(len, sizeof(int));
>+ return extend_brk(len, sizeof(int), false);
> }
>
> /* Use early IO mappings for DMI because it's initialized early */
>diff --git a/arch/x86/include/asm/setup.h
>b/arch/x86/include/asm/setup.h
>index 4f71d48..96d00da 100644
>--- a/arch/x86/include/asm/setup.h
>+++ b/arch/x86/include/asm/setup.h
>@@ -75,7 +75,7 @@ extern struct boot_params boot_params;
>
> /* exceedingly early brk-like allocator */
> extern unsigned long _brk_end;
>-void *extend_brk(size_t size, size_t align);
>+void *extend_brk(size_t size, size_t align, bool is_phys);
>
> /*
> * Reserve space in the brk section. The name must be unique within
>diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>index 51fcd5d..a189909 100644
>--- a/arch/x86/kernel/setup.c
>+++ b/arch/x86/kernel/setup.c
>@@ -259,19 +259,27 @@ static inline void __init copy_edd(void)
> }
> #endif
>
>-void * __init extend_brk(size_t size, size_t align)
>+void * __init extend_brk(size_t size, size_t align, bool is_phys)
> {
> size_t mask = align - 1;
> void *ret;
>+ unsigned long *brk_start, *brk_end, *brk_limit;
>
>- BUG_ON(_brk_start == 0);
>+ brk_start = is_phys ? (unsigned long *)__pa_nodebug(&_brk_start) :
>+ (unsigned long *)&_brk_start;
>+ brk_end = is_phys ? (unsigned long *)__pa_nodebug(&_brk_end) :
>+ (unsigned long *)&_brk_end;
>+ brk_limit = is_phys ? (unsigned long *)__pa_nodebug(__brk_limit) :
>+ (unsigned long *)__brk_limit;
>+
>+ BUG_ON(*brk_start == 0);
> BUG_ON(align & mask);
>
>- _brk_end = (_brk_end + mask) & ~mask;
>- BUG_ON((char *)(_brk_end + size) > __brk_limit);
>+ *brk_end = (*brk_end + mask) & ~mask;
>+ BUG_ON((char *)(*brk_end + size) > brk_limit);
>
>- ret = (void *)_brk_end;
>- _brk_end += size;
>+ ret = (void *)(*brk_end);
>+ *brk_end += size;
>
> memset(ret, 0, size);
>
>diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
>index 2ec29ac..189a9e2 100644
>--- a/arch/x86/mm/init.c
>+++ b/arch/x86/mm/init.c
>@@ -86,7 +86,7 @@ void __init early_alloc_pgt_buf(void)
> unsigned long tables = INIT_PGT_BUF_SIZE;
> phys_addr_t base;
>
>- base = __pa(extend_brk(tables, PAGE_SIZE));
>+ base = __pa(extend_brk(tables, PAGE_SIZE, false));
>
> pgt_buf_start = base >> PAGE_SHIFT;
> pgt_buf_end = pgt_buf_start;
>diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>index 193097e..2d5a34f 100644
>--- a/arch/x86/xen/enlighten.c
>+++ b/arch/x86/xen/enlighten.c
>@@ -1629,7 +1629,7 @@ void __ref xen_hvm_init_shared_info(void)
>
> if (!shared_info_page)
> shared_info_page = (struct shared_info *)
>- extend_brk(PAGE_SIZE, PAGE_SIZE);
>+ extend_brk(PAGE_SIZE, PAGE_SIZE, false);
> xatp.domid = DOMID_SELF;
> xatp.idx = 0;
> xatp.space = XENMAPSPACE_shared_info;
>diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
>index fdc3ba2..573bc50 100644
>--- a/arch/x86/xen/mmu.c
>+++ b/arch/x86/xen/mmu.c
>@@ -1768,7 +1768,7 @@ static void __init xen_map_identity_early(pmd_t
>*pmd, unsigned long max_pfn)
> unsigned long pfn;
>
> level1_ident_pgt = extend_brk(sizeof(pte_t) * LEVEL1_IDENT_ENTRIES,
>- PAGE_SIZE);
>+ PAGE_SIZE, false);
>
> ident_pte = 0;
> pfn = 0;
>@@ -1980,7 +1980,7 @@ static void __init xen_write_cr3_init(unsigned
>long cr3)
> * swapper_pg_dir.
> */
> swapper_kernel_pmd =
>- extend_brk(sizeof(pmd_t) * PTRS_PER_PMD, PAGE_SIZE);
>+ extend_brk(sizeof(pmd_t) * PTRS_PER_PMD, PAGE_SIZE, false);
> copy_page(swapper_kernel_pmd, initial_kernel_pmd);
> swapper_pg_dir[KERNEL_PGD_BOUNDARY] =
> __pgd(__pa(swapper_kernel_pmd) | _PAGE_PRESENT);
>@@ -2003,7 +2003,7 @@ void __init xen_setup_kernel_pagetable(pgd_t
>*pgd, unsigned long max_pfn)
> pmd_t *kernel_pmd;
>
> initial_kernel_pmd =
>- extend_brk(sizeof(pmd_t) * PTRS_PER_PMD, PAGE_SIZE);
>+ extend_brk(sizeof(pmd_t) * PTRS_PER_PMD, PAGE_SIZE, false);
>
> max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->pt_base) +
> xen_start_info->nr_pt_frames * PAGE_SIZE +
>diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
>index 95fb2aa..bbdcf20 100644
>--- a/arch/x86/xen/p2m.c
>+++ b/arch/x86/xen/p2m.c
>@@ -281,13 +281,13 @@ void __ref xen_build_mfn_list_list(void)
>
> /* Pre-initialize p2m_top_mfn to be completely missing */
> if (p2m_top_mfn == NULL) {
>- p2m_mid_missing_mfn = extend_brk(PAGE_SIZE, PAGE_SIZE);
>+ p2m_mid_missing_mfn = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
> p2m_mid_mfn_init(p2m_mid_missing_mfn);
>
>- p2m_top_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE);
>+ p2m_top_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
> p2m_top_mfn_p_init(p2m_top_mfn_p);
>
>- p2m_top_mfn = extend_brk(PAGE_SIZE, PAGE_SIZE);
>+ p2m_top_mfn = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
> p2m_top_mfn_init(p2m_top_mfn);
> } else {
> /* Reinitialise, mfn's all change after migration */
>@@ -322,7 +322,7 @@ void __ref xen_build_mfn_list_list(void)
> * runtime. extend_brk() will BUG if we call
> * it too late.
> */
>- mid_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE);
>+ mid_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
> p2m_mid_mfn_init(mid_mfn_p);
>
> p2m_top_mfn_p[topidx] = mid_mfn_p;
>@@ -351,16 +351,16 @@ void __init
>xen_build_dynamic_phys_to_machine(void)
>
> xen_max_p2m_pfn = max_pfn;
>
>- p2m_missing = extend_brk(PAGE_SIZE, PAGE_SIZE);
>+ p2m_missing = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
> p2m_init(p2m_missing);
>
>- p2m_mid_missing = extend_brk(PAGE_SIZE, PAGE_SIZE);
>+ p2m_mid_missing = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
> p2m_mid_init(p2m_mid_missing);
>
>- p2m_top = extend_brk(PAGE_SIZE, PAGE_SIZE);
>+ p2m_top = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
> p2m_top_init(p2m_top);
>
>- p2m_identity = extend_brk(PAGE_SIZE, PAGE_SIZE);
>+ p2m_identity = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
> p2m_init(p2m_identity);
>
> /*
>@@ -373,7 +373,8 @@ void __init xen_build_dynamic_phys_to_machine(void)
> unsigned mididx = p2m_mid_index(pfn);
>
> if (p2m_top[topidx] == p2m_mid_missing) {
>- unsigned long **mid = extend_brk(PAGE_SIZE, PAGE_SIZE);
>+ unsigned long **mid = extend_brk(PAGE_SIZE, PAGE_SIZE,
>+ false);
> p2m_mid_init(mid);
>
> p2m_top[topidx] = mid;
>@@ -609,7 +610,7 @@ static bool __init early_alloc_p2m_middle(unsigned
>long pfn, bool check_boundary
> return false;
>
> /* Boundary cross-over for the edges: */
>- p2m = extend_brk(PAGE_SIZE, PAGE_SIZE);
>+ p2m = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
>
> p2m_init(p2m);
>
>@@ -635,7 +636,7 @@ static bool __init early_alloc_p2m(unsigned long
>pfn)
> mid = p2m_top[topidx];
> mid_mfn_p = p2m_top_mfn_p[topidx];
> if (mid == p2m_mid_missing) {
>- mid = extend_brk(PAGE_SIZE, PAGE_SIZE);
>+ mid = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
>
> p2m_mid_init(mid);
>
>@@ -645,7 +646,7 @@ static bool __init early_alloc_p2m(unsigned long
>pfn)
> }
> /* And the save/restore P2M tables.. */
> if (mid_mfn_p == p2m_mid_missing_mfn) {
>- mid_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE);
>+ mid_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
> p2m_mid_mfn_init(mid_mfn_p);
>
> p2m_top_mfn_p[topidx] = mid_mfn_p;
>@@ -858,7 +859,7 @@ static void __init m2p_override_init(void)
> unsigned i;
>
> m2p_overrides = extend_brk(sizeof(*m2p_overrides) * M2P_OVERRIDE_HASH,
>- sizeof(unsigned long));
>+ sizeof(unsigned long), false);
>
> for (i = 0; i < M2P_OVERRIDE_HASH; i++)
> INIT_LIST_HEAD(&m2p_overrides[i]);
>diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
>index 4c1baa7..dff7fcc 100644
>--- a/drivers/acpi/osl.c
>+++ b/drivers/acpi/osl.c
>@@ -563,7 +563,7 @@ RESERVE_BRK(acpi_override_tables_alloc,
>ACPI_OVERRIDE_TABLES_SIZE);
> void __init early_alloc_acpi_override_tables_buf(void)
> {
> acpi_tables_addr = __pa(extend_brk(ACPI_OVERRIDE_TABLES_SIZE,
>- PAGE_SIZE));
>+ PAGE_SIZE, false));
> }
>
> void __init acpi_initrd_override(void *data, size_t size)

2013-08-21 12:37:00

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 5/8] x86, brk: Make extend_brk() available with va/pa.

Global symbols are inaccessible in physical mode.

This is incidentally yet another example of "PV/weird platform violence", since in their absence it would be trivial to work around this by using segmentation.

Konrad Rzeszutek Wilk <[email protected]> wrote:
>Tang Chen <[email protected]> wrote:
>>We are going to do acpi_initrd_override() at very early time:
>>
>>On 32bit: do it in head_32.S, before paging is enabled. In this case,
>>we can
>> access initrd with physical address without page tables.
>>
>>On 64bit: do it in head_64.c, after paging is enabled but before
>direct
>>mapping
>> is setup.
>>
>> On 64bit, we have an early page fault handler to help to access
>data
>> with direct mapping page tables. So it is easy to do in
>head_64.c.
>>
>>And we need to allocate memory to store override tables. At such an
>>early time,
>>no memory allocator works. So we can only use BRK.
>>
>>As mentioned above, on 32bit before paging is enabled, we have to
>>access variables
>>with pa. So introduce a "bool is_phys" parameter to extend_brk(), and
>>convert va
>>to pa is it is true.
>
>Could you do it differently? Meaning have a global symbol
>(paging_enabled) which will be used by most of the functions you
>changed in this patch and the next ones? It would naturally be enabled
>when paging is on and __va addresses can be used.
>
>That could also be used in the printk case to do a BUG_ON before paging
>is enabled on 32bit. Or perhaps use a different code path to deal with
>using __pa address.
>
>?
>>
>>Signed-off-by: Tang Chen <[email protected]>
>>---
>> arch/x86/include/asm/dmi.h | 2 +-
>> arch/x86/include/asm/setup.h | 2 +-
>> arch/x86/kernel/setup.c | 20 ++++++++++++++------
>> arch/x86/mm/init.c | 2 +-
>> arch/x86/xen/enlighten.c | 2 +-
>> arch/x86/xen/mmu.c | 6 +++---
>> arch/x86/xen/p2m.c | 27 ++++++++++++++-------------
>> drivers/acpi/osl.c | 2 +-
>> 8 files changed, 36 insertions(+), 27 deletions(-)
>>
>>diff --git a/arch/x86/include/asm/dmi.h b/arch/x86/include/asm/dmi.h
>>index fd8f9e2..3b51d81 100644
>>--- a/arch/x86/include/asm/dmi.h
>>+++ b/arch/x86/include/asm/dmi.h
>>@@ -9,7 +9,7 @@
>>
>> static __always_inline __init void *dmi_alloc(unsigned len)
>> {
>>- return extend_brk(len, sizeof(int));
>>+ return extend_brk(len, sizeof(int), false);
>> }
>>
>> /* Use early IO mappings for DMI because it's initialized early */
>>diff --git a/arch/x86/include/asm/setup.h
>>b/arch/x86/include/asm/setup.h
>>index 4f71d48..96d00da 100644
>>--- a/arch/x86/include/asm/setup.h
>>+++ b/arch/x86/include/asm/setup.h
>>@@ -75,7 +75,7 @@ extern struct boot_params boot_params;
>>
>> /* exceedingly early brk-like allocator */
>> extern unsigned long _brk_end;
>>-void *extend_brk(size_t size, size_t align);
>>+void *extend_brk(size_t size, size_t align, bool is_phys);
>>
>> /*
>> * Reserve space in the brk section. The name must be unique within
>>diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>>index 51fcd5d..a189909 100644
>>--- a/arch/x86/kernel/setup.c
>>+++ b/arch/x86/kernel/setup.c
>>@@ -259,19 +259,27 @@ static inline void __init copy_edd(void)
>> }
>> #endif
>>
>>-void * __init extend_brk(size_t size, size_t align)
>>+void * __init extend_brk(size_t size, size_t align, bool is_phys)
>> {
>> size_t mask = align - 1;
>> void *ret;
>>+ unsigned long *brk_start, *brk_end, *brk_limit;
>>
>>- BUG_ON(_brk_start == 0);
>>+ brk_start = is_phys ? (unsigned long *)__pa_nodebug(&_brk_start) :
>>+ (unsigned long *)&_brk_start;
>>+ brk_end = is_phys ? (unsigned long *)__pa_nodebug(&_brk_end) :
>>+ (unsigned long *)&_brk_end;
>>+ brk_limit = is_phys ? (unsigned long *)__pa_nodebug(__brk_limit) :
>>+ (unsigned long *)__brk_limit;
>>+
>>+ BUG_ON(*brk_start == 0);
>> BUG_ON(align & mask);
>>
>>- _brk_end = (_brk_end + mask) & ~mask;
>>- BUG_ON((char *)(_brk_end + size) > __brk_limit);
>>+ *brk_end = (*brk_end + mask) & ~mask;
>>+ BUG_ON((char *)(*brk_end + size) > brk_limit);
>>
>>- ret = (void *)_brk_end;
>>- _brk_end += size;
>>+ ret = (void *)(*brk_end);
>>+ *brk_end += size;
>>
>> memset(ret, 0, size);
>>
>>diff --git a/arch/x86/mm/init.c b/arch/x86/mm/init.c
>>index 2ec29ac..189a9e2 100644
>>--- a/arch/x86/mm/init.c
>>+++ b/arch/x86/mm/init.c
>>@@ -86,7 +86,7 @@ void __init early_alloc_pgt_buf(void)
>> unsigned long tables = INIT_PGT_BUF_SIZE;
>> phys_addr_t base;
>>
>>- base = __pa(extend_brk(tables, PAGE_SIZE));
>>+ base = __pa(extend_brk(tables, PAGE_SIZE, false));
>>
>> pgt_buf_start = base >> PAGE_SHIFT;
>> pgt_buf_end = pgt_buf_start;
>>diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
>>index 193097e..2d5a34f 100644
>>--- a/arch/x86/xen/enlighten.c
>>+++ b/arch/x86/xen/enlighten.c
>>@@ -1629,7 +1629,7 @@ void __ref xen_hvm_init_shared_info(void)
>>
>> if (!shared_info_page)
>> shared_info_page = (struct shared_info *)
>>- extend_brk(PAGE_SIZE, PAGE_SIZE);
>>+ extend_brk(PAGE_SIZE, PAGE_SIZE, false);
>> xatp.domid = DOMID_SELF;
>> xatp.idx = 0;
>> xatp.space = XENMAPSPACE_shared_info;
>>diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
>>index fdc3ba2..573bc50 100644
>>--- a/arch/x86/xen/mmu.c
>>+++ b/arch/x86/xen/mmu.c
>>@@ -1768,7 +1768,7 @@ static void __init xen_map_identity_early(pmd_t
>>*pmd, unsigned long max_pfn)
>> unsigned long pfn;
>>
>> level1_ident_pgt = extend_brk(sizeof(pte_t) * LEVEL1_IDENT_ENTRIES,
>>- PAGE_SIZE);
>>+ PAGE_SIZE, false);
>>
>> ident_pte = 0;
>> pfn = 0;
>>@@ -1980,7 +1980,7 @@ static void __init xen_write_cr3_init(unsigned
>>long cr3)
>> * swapper_pg_dir.
>> */
>> swapper_kernel_pmd =
>>- extend_brk(sizeof(pmd_t) * PTRS_PER_PMD, PAGE_SIZE);
>>+ extend_brk(sizeof(pmd_t) * PTRS_PER_PMD, PAGE_SIZE, false);
>> copy_page(swapper_kernel_pmd, initial_kernel_pmd);
>> swapper_pg_dir[KERNEL_PGD_BOUNDARY] =
>> __pgd(__pa(swapper_kernel_pmd) | _PAGE_PRESENT);
>>@@ -2003,7 +2003,7 @@ void __init xen_setup_kernel_pagetable(pgd_t
>>*pgd, unsigned long max_pfn)
>> pmd_t *kernel_pmd;
>>
>> initial_kernel_pmd =
>>- extend_brk(sizeof(pmd_t) * PTRS_PER_PMD, PAGE_SIZE);
>>+ extend_brk(sizeof(pmd_t) * PTRS_PER_PMD, PAGE_SIZE, false);
>>
>> max_pfn_mapped = PFN_DOWN(__pa(xen_start_info->pt_base) +
>> xen_start_info->nr_pt_frames * PAGE_SIZE +
>>diff --git a/arch/x86/xen/p2m.c b/arch/x86/xen/p2m.c
>>index 95fb2aa..bbdcf20 100644
>>--- a/arch/x86/xen/p2m.c
>>+++ b/arch/x86/xen/p2m.c
>>@@ -281,13 +281,13 @@ void __ref xen_build_mfn_list_list(void)
>>
>> /* Pre-initialize p2m_top_mfn to be completely missing */
>> if (p2m_top_mfn == NULL) {
>>- p2m_mid_missing_mfn = extend_brk(PAGE_SIZE, PAGE_SIZE);
>>+ p2m_mid_missing_mfn = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
>> p2m_mid_mfn_init(p2m_mid_missing_mfn);
>>
>>- p2m_top_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE);
>>+ p2m_top_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
>> p2m_top_mfn_p_init(p2m_top_mfn_p);
>>
>>- p2m_top_mfn = extend_brk(PAGE_SIZE, PAGE_SIZE);
>>+ p2m_top_mfn = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
>> p2m_top_mfn_init(p2m_top_mfn);
>> } else {
>> /* Reinitialise, mfn's all change after migration */
>>@@ -322,7 +322,7 @@ void __ref xen_build_mfn_list_list(void)
>> * runtime. extend_brk() will BUG if we call
>> * it too late.
>> */
>>- mid_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE);
>>+ mid_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
>> p2m_mid_mfn_init(mid_mfn_p);
>>
>> p2m_top_mfn_p[topidx] = mid_mfn_p;
>>@@ -351,16 +351,16 @@ void __init
>>xen_build_dynamic_phys_to_machine(void)
>>
>> xen_max_p2m_pfn = max_pfn;
>>
>>- p2m_missing = extend_brk(PAGE_SIZE, PAGE_SIZE);
>>+ p2m_missing = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
>> p2m_init(p2m_missing);
>>
>>- p2m_mid_missing = extend_brk(PAGE_SIZE, PAGE_SIZE);
>>+ p2m_mid_missing = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
>> p2m_mid_init(p2m_mid_missing);
>>
>>- p2m_top = extend_brk(PAGE_SIZE, PAGE_SIZE);
>>+ p2m_top = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
>> p2m_top_init(p2m_top);
>>
>>- p2m_identity = extend_brk(PAGE_SIZE, PAGE_SIZE);
>>+ p2m_identity = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
>> p2m_init(p2m_identity);
>>
>> /*
>>@@ -373,7 +373,8 @@ void __init
>xen_build_dynamic_phys_to_machine(void)
>> unsigned mididx = p2m_mid_index(pfn);
>>
>> if (p2m_top[topidx] == p2m_mid_missing) {
>>- unsigned long **mid = extend_brk(PAGE_SIZE, PAGE_SIZE);
>>+ unsigned long **mid = extend_brk(PAGE_SIZE, PAGE_SIZE,
>>+ false);
>> p2m_mid_init(mid);
>>
>> p2m_top[topidx] = mid;
>>@@ -609,7 +610,7 @@ static bool __init early_alloc_p2m_middle(unsigned
>>long pfn, bool check_boundary
>> return false;
>>
>> /* Boundary cross-over for the edges: */
>>- p2m = extend_brk(PAGE_SIZE, PAGE_SIZE);
>>+ p2m = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
>>
>> p2m_init(p2m);
>>
>>@@ -635,7 +636,7 @@ static bool __init early_alloc_p2m(unsigned long
>>pfn)
>> mid = p2m_top[topidx];
>> mid_mfn_p = p2m_top_mfn_p[topidx];
>> if (mid == p2m_mid_missing) {
>>- mid = extend_brk(PAGE_SIZE, PAGE_SIZE);
>>+ mid = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
>>
>> p2m_mid_init(mid);
>>
>>@@ -645,7 +646,7 @@ static bool __init early_alloc_p2m(unsigned long
>>pfn)
>> }
>> /* And the save/restore P2M tables.. */
>> if (mid_mfn_p == p2m_mid_missing_mfn) {
>>- mid_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE);
>>+ mid_mfn_p = extend_brk(PAGE_SIZE, PAGE_SIZE, false);
>> p2m_mid_mfn_init(mid_mfn_p);
>>
>> p2m_top_mfn_p[topidx] = mid_mfn_p;
>>@@ -858,7 +859,7 @@ static void __init m2p_override_init(void)
>> unsigned i;
>>
>> m2p_overrides = extend_brk(sizeof(*m2p_overrides) *
>M2P_OVERRIDE_HASH,
>>- sizeof(unsigned long));
>>+ sizeof(unsigned long), false);
>>
>> for (i = 0; i < M2P_OVERRIDE_HASH; i++)
>> INIT_LIST_HEAD(&m2p_overrides[i]);
>>diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
>>index 4c1baa7..dff7fcc 100644
>>--- a/drivers/acpi/osl.c
>>+++ b/drivers/acpi/osl.c
>>@@ -563,7 +563,7 @@ RESERVE_BRK(acpi_override_tables_alloc,
>>ACPI_OVERRIDE_TABLES_SIZE);
>> void __init early_alloc_acpi_override_tables_buf(void)
>> {
>> acpi_tables_addr = __pa(extend_brk(ACPI_OVERRIDE_TABLES_SIZE,
>>- PAGE_SIZE));
>>+ PAGE_SIZE, false));
>> }
>>
>> void __init acpi_initrd_override(void *data, size_t size)

--
Sent from my mobile phone. Please excuse brevity and lack of formatting.

2013-08-21 13:06:54

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello,

On Wed, Aug 21, 2013 at 06:15:35PM +0800, Tang Chen wrote:
> [What are we doing]
>
> We are trying to initialize acip tables as early as possible. But Linux kernel
> allows users to override acpi tables by specifying their own tables in initrd.
> So we have to do acpi_initrd_override() earlier first.

So, are we now back to making SRAT info as early as possible? What
happened to just co-locating early allocations close to kernel image?
What'd be the benefit of doing this over that?

Thanks.

--
tejun

2013-08-21 14:44:40

by Konrad Rzeszutek Wilk

[permalink] [raw]
Subject: Re: [PATCH 5/8] x86, brk: Make extend_brk() available with va/pa.

On Wed, Aug 21, 2013 at 02:35:36PM +0200, H. Peter Anvin wrote:
> Global symbols are inaccessible in physical mode.

Even if they are embedded in the assembler code and use GLOBAL(paging_enabled) ?

>
> This is incidentally yet another example of "PV/weird platform violence", since in their absence it would be trivial to work around this by using segmentation.

I don't follow why it could not.

Why can't there be a __pa_symbol(paging_enabled) that is used. Won't
that in effect allow you to check the contents of that 'global
constant' even when you don't have paging enabled?

> >>As mentioned above, on 32bit before paging is enabled, we have to
> >>access variables
> >>with pa. So introduce a "bool is_phys" parameter to extend_brk(), and
> >>convert va
> >>to pa is it is true.
> >
> >Could you do it differently? Meaning have a global symbol
> >(paging_enabled) which will be used by most of the functions you
> >changed in this patch and the next ones? It would naturally be enabled
> >when paging is on and __va addresses can be used.
> >
> >That could also be used in the printk case to do a BUG_ON before paging
> >is enabled on 32bit. Or perhaps use a different code path to deal with
> >using __pa address.
> >
> >?

2013-08-21 15:00:59

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hi tejun,

On 08/21/2013 09:06 PM, Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 21, 2013 at 06:15:35PM +0800, Tang Chen wrote:
>> [What are we doing]
>>
>> We are trying to initialize acip tables as early as possible. But Linux kernel
>> allows users to override acpi tables by specifying their own tables in initrd.
>> So we have to do acpi_initrd_override() earlier first.
>
> So, are we now back to making SRAT info as early as possible? What
> happened to just co-locating early allocations close to kernel image?
> What'd be the benefit of doing this over that?

We know you are trying to give the direction to make the change more natural and
robust and very thankful for your comments. We have taken your comments and suggestions
about co-locating early allocations close to kernel image into consideration, but
still we found that not that easy.

In current boot order, before we get the SRAT, we have a big consumer of early
allocations: we are setting up the page table in top-down (The idea was proposed by HPA,
Link: https://lkml.org/lkml/2012/10/4/701). That said, this kind of page table
setup will make the page tables as high as possible in memory, since memory at low
addresses is precious (for stupid DMA devices, for things like kexec/kdump, and so on.)

So if we are trying to make early allocations close to kernel image, we should
rewrite the way we are setting up page table totally. That is not a easy thing
to do.

As for the benefits of the patchset, just as Tang said in this patch,

* For memory hotplug, we need ACPI SRAT at early time to be aware of which memory
ranges are hotpluggable, and tell the kernel to try to stay away from hotpluggable
nodes.

This one is the current requirement of us but may be very helpful for future change:

* As suggested by Yinghai, we should allocate page tables in local node. This also
needs SRAT before direct mapping page tables are setup.

* As mentioned by Toshi Kani <[email protected]>, ACPI SCPR/DBGP/DBG2 tables
allow the OS to initialize serial console/debug ports at early boot time. The
earlier it can be initialized, the better this feature will be. These tables
are not currently used by Linux due to a licensing issue, but it could be
addressed some time soon.

So we decided to firstly make ACPI override earlier and use BRK (this is obviously
near the kernel image range) to store the found ACPI tables.

--
Thanks.
Zhang Yanfei

2013-08-21 15:06:08

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 5/8] x86, brk: Make extend_brk() available with va/pa.



Konrad Rzeszutek Wilk <[email protected]> wrote:
>On Wed, Aug 21, 2013 at 02:35:36PM +0200, H. Peter Anvin wrote:
>> Global symbols are inaccessible in physical mode.
>
>Even if they are embedded in the assembler code and use
>GLOBAL(paging_enabled) ?

Yes, because the address is different in physical mode. Think about it. You could do a *function* like:

paging_enabled:
mov (%esp),%edx
xor %eax,%eax
cmpl $PAGE_OFFSET,%edx
setae %al
ret

>>
>> This is incidentally yet another example of "PV/weird platform
>violence", since in their absence it would be trivial to work around
>this by using segmentation.
>
>I don't follow why it could not.
>
>Why can't there be a __pa_symbol(paging_enabled) that is used. Won't
>that in effect allow you to check the contents of that 'global
>constant' even when you don't have paging enabled?

Yes. But not once paging has been turned on.

>> >>As mentioned above, on 32bit before paging is enabled, we have to
>> >>access variables
>> >>with pa. So introduce a "bool is_phys" parameter to extend_brk(),
>and
>> >>convert va
>> >>to pa is it is true.
>> >
>> >Could you do it differently? Meaning have a global symbol
>> >(paging_enabled) which will be used by most of the functions you
>> >changed in this patch and the next ones? It would naturally be
>enabled
>> >when paging is on and __va addresses can be used.
>> >
>> >That could also be used in the printk case to do a BUG_ON before
>paging
>> >is enabled on 32bit. Or perhaps use a different code path to deal
>with
>> >using __pa address.
>> >
>> >?

--
Sent from my mobile phone. Please excuse brevity and lack of formatting.

2013-08-21 15:36:48

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello,

On Wed, Aug 21, 2013 at 11:00:26PM +0800, Zhang Yanfei wrote:
> In current boot order, before we get the SRAT, we have a big consumer of early
> allocations: we are setting up the page table in top-down (The idea was proposed by HPA,
> Link: https://lkml.org/lkml/2012/10/4/701). That said, this kind of page table
> setup will make the page tables as high as possible in memory, since memory at low
> addresses is precious (for stupid DMA devices, for things like kexec/kdump, and so on.)

With huge mappings, they are fairly small, right? And this whole
thing needs a kernel param anyway at this point, so the allocation
direction can be made dependent on that or huge mapping availability
and, even with 4k mappings, we aren't talking about gigabytes of
memory, are we?

> So if we are trying to make early allocations close to kernel image, we should
> rewrite the way we are setting up page table totally. That is not a easy thing
> to do.

It has been a while since I looked at the code so can you please
elaborate why that is not easy? It's pretty simple conceptually.

> * For memory hotplug, we need ACPI SRAT at early time to be aware of which memory
> ranges are hotpluggable, and tell the kernel to try to stay away from hotpluggable
> nodes.
>
> This one is the current requirement of us but may be very helpful for future change:
>
> * As suggested by Yinghai, we should allocate page tables in local node. This also
> needs SRAT before direct mapping page tables are setup.

Does this even matter for huge mappings?

> * As mentioned by Toshi Kani <[email protected]>, ACPI SCPR/DBGP/DBG2 tables
> allow the OS to initialize serial console/debug ports at early boot time. The
> earlier it can be initialized, the better this feature will be. These tables
> are not currently used by Linux due to a licensing issue, but it could be
> addressed some time soon.
>
> So we decided to firstly make ACPI override earlier and use BRK (this is obviously
> near the kernel image range) to store the found ACPI tables.

I don't know. The whole effort seems way overcomplicated compared to
the benefits it would bring. For NUMA memory hotunplug, what's the
point of doing all this when the kernel doesn't have any control over
where its image is gonna be? Some megabytes at the tail aren't gonna
make a huge difference and if you wanna do this properly, you need to
determine the load address of the kernel considering the node
boundaries and hotpluggability of each node, which has to happen
before the early kernel boot code executes. And if there's a code
piece which does that, that might as well place the kernel image such
that extra allocation afterwards doesn't interfere with memory
hotunplugging.

It looks like a lot of code changes for a mechanism which doesn't seem
all that useful. This code is already too late in boot sequence to be
a proper solution so I don't see the point in pushing the coverage to
the maximum from here. It's kinda silly.

The last point - early init of debug facility - makes some sense but
again how extra coverage are we talking about? The code path between
the two points is fairly short and the change doesn't come free. It
means we add more fragile firmware-specific code path before the
execution environment is stable and get to do things like traveling
the same code paths multiple times in different environments. Doesn't
seem like a win. We want to reach stable execution environment as
soon as possible. Shoving whole more logic before that in the name of
"earlier debugging" doesn't make a lot of sense.

Thanks.

--
tejun

2013-08-21 19:33:15

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

On Wed, 2013-08-21 at 11:36 -0400, Tejun Heo wrote:
> Hello,
>
> On Wed, Aug 21, 2013 at 11:00:26PM +0800, Zhang Yanfei wrote:
> > In current boot order, before we get the SRAT, we have a big consumer of early
> > allocations: we are setting up the page table in top-down (The idea was proposed by HPA,
> > Link: https://lkml.org/lkml/2012/10/4/701). That said, this kind of page table
> > setup will make the page tables as high as possible in memory, since memory at low
> > addresses is precious (for stupid DMA devices, for things like kexec/kdump, and so on.)
>
> With huge mappings, they are fairly small, right? And this whole
> thing needs a kernel param anyway at this point, so the allocation
> direction can be made dependent on that or huge mapping availability
> and, even with 4k mappings, we aren't talking about gigabytes of
> memory, are we?
>
> > So if we are trying to make early allocations close to kernel image, we should
> > rewrite the way we are setting up page table totally. That is not a easy thing
> > to do.
>
> It has been a while since I looked at the code so can you please
> elaborate why that is not easy? It's pretty simple conceptually.
>
> > * For memory hotplug, we need ACPI SRAT at early time to be aware of which memory
> > ranges are hotpluggable, and tell the kernel to try to stay away from hotpluggable
> > nodes.
> >
> > This one is the current requirement of us but may be very helpful for future change:
> >
> > * As suggested by Yinghai, we should allocate page tables in local node. This also
> > needs SRAT before direct mapping page tables are setup.
>
> Does this even matter for huge mappings?
>
> > * As mentioned by Toshi Kani <[email protected]>, ACPI SCPR/DBGP/DBG2 tables
> > allow the OS to initialize serial console/debug ports at early boot time. The
> > earlier it can be initialized, the better this feature will be. These tables
> > are not currently used by Linux due to a licensing issue, but it could be
> > addressed some time soon.
> >
> > So we decided to firstly make ACPI override earlier and use BRK (this is obviously
> > near the kernel image range) to store the found ACPI tables.
>
> I don't know. The whole effort seems way overcomplicated compared to
> the benefits it would bring. For NUMA memory hotunplug, what's the
> point of doing all this when the kernel doesn't have any control over
> where its image is gonna be? Some megabytes at the tail aren't gonna
> make a huge difference and if you wanna do this properly, you need to
> determine the load address of the kernel considering the node
> boundaries and hotpluggability of each node, which has to happen
> before the early kernel boot code executes. And if there's a code
> piece which does that, that might as well place the kernel image such
> that extra allocation afterwards doesn't interfere with memory
> hotunplugging.
>
> It looks like a lot of code changes for a mechanism which doesn't seem
> all that useful. This code is already too late in boot sequence to be
> a proper solution so I don't see the point in pushing the coverage to
> the maximum from here. It's kinda silly.
>
> The last point - early init of debug facility - makes some sense but
> again how extra coverage are we talking about? The code path between
> the two points is fairly short and the change doesn't come free. It
> means we add more fragile firmware-specific code path before the
> execution environment is stable and get to do things like traveling
> the same code paths multiple times in different environments. Doesn't
> seem like a win. We want to reach stable execution environment as
> soon as possible. Shoving whole more logic before that in the name of
> "earlier debugging" doesn't make a lot of sense.

Well, there is reason why we have earlyprintk feature today. So, let's
not debate on this feature now. There was previous attempt to support
this feature with ACPI tables below. As described, it had the same
ordering issue.

https://lkml.org/lkml/2012/10/8/498

There is a basic problem that when we try to use ACPI tables that
extends or replaces legacy interfaces (ex. SRAT extending e820), we hit
this ordering issue because ACPI is not available as early as the legacy
interfaces.

Thanks,
-Toshi

2013-08-21 19:54:21

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello, Toshi.

On Wed, Aug 21, 2013 at 01:31:43PM -0600, Toshi Kani wrote:
> Well, there is reason why we have earlyprintk feature today. So, let's
> not debate on this feature now. There was previous attempt to support

Are you saying the existing earlyprintk automatically justifies
addition of more complex mechanism? The added complex of course
should be traded off against the benefits of gaining ACPI based early
boot. You aren't gonna suggest implementing netconsole based
earlyprintk, right?

> this feature with ACPI tables below. As described, it had the same
> ordering issue.
>
> https://lkml.org/lkml/2012/10/8/498
>
> There is a basic problem that when we try to use ACPI tables that
> extends or replaces legacy interfaces (ex. SRAT extending e820), we hit
> this ordering issue because ACPI is not available as early as the legacy
> interfaces.

Do we even want ACPI parsing and all that that early? Parsing SRAT
early doesn't buy us much and I'm not sure whether adding ACPI
earlyprintk would increase or decrease debuggability during earlyboot.
It adds whole lot more code paths where things can go wrong while the
basic execution environment is unstable. Why do that?

Thanks.

--
tejun

2013-08-21 20:30:59

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello Tejun,

On Wed, 2013-08-21 at 15:54 -0400, Tejun Heo wrote:
> On Wed, Aug 21, 2013 at 01:31:43PM -0600, Toshi Kani wrote:
> > Well, there is reason why we have earlyprintk feature today. So, let's
> > not debate on this feature now. There was previous attempt to support
>
> Are you saying the existing earlyprintk automatically justifies
> addition of more complex mechanism? The added complex of course
> should be traded off against the benefits of gaining ACPI based early
> boot. You aren't gonna suggest implementing netconsole based
> earlyprintk, right?

Platforms vendors (which care Linux) need to support the existing Linux
features. This means that they have to implement legacy interfaces on
x86 until the kernel supports an alternative method. For instance, some
platforms are legacy-free and do not have legacy COM ports. These ACPI
tables were defined so that non-legacy COM ports can be described and
informed to the OS. Without this support, such platforms may have to
emulate the legacy COM ports for Linux, or drop Linux support.

> > this feature with ACPI tables below. As described, it had the same
> > ordering issue.
> >
> > https://lkml.org/lkml/2012/10/8/498
> >
> > There is a basic problem that when we try to use ACPI tables that
> > extends or replaces legacy interfaces (ex. SRAT extending e820), we hit
> > this ordering issue because ACPI is not available as early as the legacy
> > interfaces.
>
> Do we even want ACPI parsing and all that that early? Parsing SRAT
> early doesn't buy us much and I'm not sure whether adding ACPI
> earlyprintk would increase or decrease debuggability during earlyboot.
> It adds whole lot more code paths where things can go wrong while the
> basic execution environment is unstable. Why do that?

I think the kernel boot-up sequence should be designed in such a way
that can support legacy-free and/or NUMA platforms properly.

Thanks,
-Toshi

2013-08-21 20:40:50

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello, Toshi.

On Wed, Aug 21, 2013 at 02:29:28PM -0600, Toshi Kani wrote:
> Platforms vendors (which care Linux) need to support the existing Linux
> features. This means that they have to implement legacy interfaces on
> x86 until the kernel supports an alternative method. For instance, some
> platforms are legacy-free and do not have legacy COM ports. These ACPI
> tables were defined so that non-legacy COM ports can be described and
> informed to the OS. Without this support, such platforms may have to
> emulate the legacy COM ports for Linux, or drop Linux support.

Are you seriously saying that vendors are gonna drop linux support for
lacking ACPI earlyprintk support? Please...

Please take a look at the existing earlyprintk code and how compact
and self-contained they are. If you want to add ACPI earlyprintk, do
similar stuff. Forget about firmware blob override from initrd or
ACPICA. Just implement the bare minimum to get the thing working. Do
not add dependency to large body of code from earlyboot. It's a bad
idea through and through.

> I think the kernel boot-up sequence should be designed in such a way
> that can support legacy-free and/or NUMA platforms properly.

Blanket statements like the above don't mean much. There are many
separate stages of boot and you're talking about one of the very first
stages where we traditionally have always depended upon only the very
bare minimum of the platform both in hardware itself and configuration
information. We've been doing that for *very* good reasons. If you
screw up there, it's mighty tricky to figure out what went wrong
especially on the machines that you can't physically kick. You're now
suggesting to add whole ACPI parsing including overloading from initrd
into that stage with pretty weak rationale.

Seriously, if you want ACPI based earlyprintk, implement it in a
discrete minimal code which is easy to verify and won't get affected
when the rest of ACPI machinery is updated. We really don't want
earlyboot to fail because someone screwed up ACPI or initrd handling.

Thanks.

--
tejun

2013-08-21 22:38:06

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello Tejun,

On Wed, 2013-08-21 at 16:40 -0400, Tejun Heo wrote:
> On Wed, Aug 21, 2013 at 02:29:28PM -0600, Toshi Kani wrote:
> > Platforms vendors (which care Linux) need to support the existing Linux
> > features. This means that they have to implement legacy interfaces on
> > x86 until the kernel supports an alternative method. For instance, some
> > platforms are legacy-free and do not have legacy COM ports. These ACPI
> > tables were defined so that non-legacy COM ports can be described and
> > informed to the OS. Without this support, such platforms may have to
> > emulate the legacy COM ports for Linux, or drop Linux support.
>
> Are you seriously saying that vendors are gonna drop linux support for
> lacking ACPI earlyprintk support? Please...

earlyprintk is an example of the issues. The point is that vendors are
required to support legacy stuff for Linux.

> Please take a look at the existing earlyprintk code and how compact
> and self-contained they are. If you want to add ACPI earlyprintk, do
> similar stuff. Forget about firmware blob override from initrd or
> ACPICA. Just implement the bare minimum to get the thing working. Do
> not add dependency to large body of code from earlyboot. It's a bad
> idea through and through.

I am not saying that ACPI earlyprintk must be available at exactly the
same point. How early it can reasonably be is a subject of discussion.

> > I think the kernel boot-up sequence should be designed in such a way
> > that can support legacy-free and/or NUMA platforms properly.
>
> Blanket statements like the above don't mean much. There are many
> separate stages of boot and you're talking about one of the very first
> stages where we traditionally have always depended upon only the very
> bare minimum of the platform both in hardware itself and configuration
> information. We've been doing that for *very* good reasons. If you
> screw up there, it's mighty tricky to figure out what went wrong
> especially on the machines that you can't physically kick. You're now
> suggesting to add whole ACPI parsing including overloading from initrd
> into that stage with pretty weak rationale.

I agree that ACPI is rather complicated stuff. But in my experience,
the majority complication comes from ACPI namespace and methods, not
from ACPI tables. Do you really think ACPI table init is that risky? I
consider ACPI tables are part of the minimum config info, esp. for
legacy-free platforms.

> Seriously, if you want ACPI based earlyprintk, implement it in a
> discrete minimal code which is easy to verify and won't get affected
> when the rest of ACPI machinery is updated. We really don't want
> earlyboot to fail because someone screwed up ACPI or initrd handling.

earlyprintk is just another example to this SRAT issue. The local page
table is yet another example. My hope here is for us to be able to
utilize ACPI tables properly without hitting this kind of ordering
issues again and again, which requires considerable time & effort to
address.

Thanks,
-Toshi

2013-08-22 03:32:41

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello,

On Wed, Aug 21, 2013 at 04:36:35PM -0600, Toshi Kani wrote:
> I agree that ACPI is rather complicated stuff. But in my experience,
> the majority complication comes from ACPI namespace and methods, not
> from ACPI tables. Do you really think ACPI table init is that risky? I
> consider ACPI tables are part of the minimum config info, esp. for
> legacy-free platforms.

It's just that we're talking about the very first stage of boot. We
really don't do much there and pulling in ACPI code into that stage is
a lot by comparison. If that's gonna happen, it needs pretty strong
justification.

> earlyprintk is just another example to this SRAT issue. The local page
> table is yet another example. My hope here is for us to be able to
> utilize ACPI tables properly without hitting this kind of ordering
> issues again and again, which requires considerable time & effort to
> address.

So, the two things brought up at this point are early parsing of SRAT,
which can't really solve the problem at hand anyway, and earlyprintk
which should be implemented in minimal way which is not activated
unless specifically enabled with earlyprintk boot param. Neither
seems to justify pulling in full ACPI into early boot, right?

Thanks.

--
tejun

2013-08-22 15:53:40

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello Tejun,

On Wed, 2013-08-21 at 23:32 -0400, Tejun Heo wrote:
> On Wed, Aug 21, 2013 at 04:36:35PM -0600, Toshi Kani wrote:
> > I agree that ACPI is rather complicated stuff. But in my experience,
> > the majority complication comes from ACPI namespace and methods, not
> > from ACPI tables. Do you really think ACPI table init is that risky? I
> > consider ACPI tables are part of the minimum config info, esp. for
> > legacy-free platforms.
>
> It's just that we're talking about the very first stage of boot. We
> really don't do much there and pulling in ACPI code into that stage is
> a lot by comparison. If that's gonna happen, it needs pretty strong
> justification.

It moves up the ACPI table init code, which itself is simple. And ACPI
tables are defined to be pursed at early boot-time, which is why they
exist in addition to ACPI namespace/methods. They are similar to EFI
memory table. Firmware publishes tables in one way or the other.

I understand that you are concerned about stability of the ACPI stuff,
which I think is a valid point, but most of (if not all) of the
ACPI-related issues come from ACPI namespace/methods, which is a very
different thing. Please do not mix up those two. The ACPI
namespace/methods stuff remains the same and continues to be initialized
at very late in the boot sequence.

What's making the patchset complicated is acpi_initrd_override(), which
is intended for developers and allows overwriting ACPI bits at their own
risk. This feature won't be used by regular users.

> > earlyprintk is just another example to this SRAT issue. The local page
> > table is yet another example. My hope here is for us to be able to
> > utilize ACPI tables properly without hitting this kind of ordering
> > issues again and again, which requires considerable time & effort to
> > address.
>
> So, the two things brought up at this point are early parsing of SRAT,
> which can't really solve the problem at hand anyway,

If you are referring the issue of kernel image location, it is a
limitation in the current implementation, not a technical limitation. I
know other OS that supports movable memory and puts the kernel image
into a movable memory with SRAT by changing the bootloader.

Also, how do you support local page tables without pursing SRAT early?

> and earlyprintk
> which should be implemented in minimal way which is not activated
> unless specifically enabled with earlyprintk boot param. Neither
> seems to justify pulling in full ACPI into early boot, right?

Initializing page tables on large systems may take a long time, and I do
think that earlyprink needs to be available before that point.

Thanks,
-Toshi

2013-08-22 18:31:43

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello,

On Thu, Aug 22, 2013 at 09:52:09AM -0600, Toshi Kani wrote:
> I understand that you are concerned about stability of the ACPI stuff,
> which I think is a valid point, but most of (if not all) of the
> ACPI-related issues come from ACPI namespace/methods, which is a very
> different thing. Please do not mix up those two. The ACPI

I have no objection to implementing self-conftained earlyprintk
support. If that's all you want to do, please go ahead but do not
pull in initrd override or ACPICA into it.

> namespace/methods stuff remains the same and continues to be initialized
> at very late in the boot sequence.
>
> What's making the patchset complicated is acpi_initrd_override(), which
> is intended for developers and allows overwriting ACPI bits at their own
> risk. This feature won't be used by regular users.

Yeah, please forget about that in earlyboot. It doesn't make any
sense to fiddle with initrd that early during boot.

> If you are referring the issue of kernel image location, it is a
> limitation in the current implementation, not a technical limitation. I
> know other OS that supports movable memory and puts the kernel image
> into a movable memory with SRAT by changing the bootloader.

I'm not saying that problem shouldn't be solved. I'm saying what you
guys are pushing doesn't help solving it at all. It's too late in the
boot process. It needs to be handled either by bootloader or earlier
kernel kexecing the actual one and super-early SRAT doens't help at
all in either case, so what's the point of pulling ACPI code in when
it doesn't contribute to solving the problem properly?

> Also, how do you support local page tables without pursing SRAT early?

Does it even matter with huge mappings? It's gonna be contained in a
single page anyway, right?

> Initializing page tables on large systems may take a long time, and I do
> think that earlyprink needs to be available before that point.

Yeah, sure, implement it in *minimal* way which doesn't affect
anything if not explicitly enabled by kernel param like other
earlyprintks. It doens't make any sense to add dependency to acpi
from early boot for that.

Thanks.

--
tejun

2013-08-22 19:40:27

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello tejun,

On 08/23/2013 02:31 AM, Tejun Heo wrote:
> Hello,
>
> On Thu, Aug 22, 2013 at 09:52:09AM -0600, Toshi Kani wrote:
>> I understand that you are concerned about stability of the ACPI stuff,
>> which I think is a valid point, but most of (if not all) of the
>> ACPI-related issues come from ACPI namespace/methods, which is a very
>> different thing. Please do not mix up those two. The ACPI
>
> I have no objection to implementing self-conftained earlyprintk
> support. If that's all you want to do, please go ahead but do not
> pull in initrd override or ACPICA into it.
>
>> namespace/methods stuff remains the same and continues to be initialized
>> at very late in the boot sequence.
>>
>> What's making the patchset complicated is acpi_initrd_override(), which
>> is intended for developers and allows overwriting ACPI bits at their own
>> risk. This feature won't be used by regular users.
>
> Yeah, please forget about that in earlyboot. It doesn't make any
> sense to fiddle with initrd that early during boot.

What do you mean by "earlyboot"? And also in your previous mail, I am also
a little confused by what you said "the very first stage of boot". Does
this mean the stage we are in head_32 or head64.c?

If so, could we just do something just as Yinghai did before, that is, Split
acpi_override into 2 parts: find and copy. And in "earlyboot", we just do
the find, and I think that is less of risk. Or we can just do ACPI override
earlier in setup_arch(), not pulling this process that early during boot?

Thanks

--
Thanks.
Zhang Yanfei

2013-08-22 19:46:06

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello,

On Fri, Aug 23, 2013 at 03:39:53AM +0800, Zhang Yanfei wrote:
> What do you mean by "earlyboot"? And also in your previous mail, I am also
> a little confused by what you said "the very first stage of boot". Does
> this mean the stage we are in head_32 or head64.c?

Mostly referring to the state where we don't have basic environment
set up yet including page tables.

> If so, could we just do something just as Yinghai did before, that is, Split
> acpi_override into 2 parts: find and copy. And in "earlyboot", we just do
> the find, and I think that is less of risk. Or we can just do ACPI override
> earlier in setup_arch(), not pulling this process that early during boot?

But *WHY*? It doesn't really buy us anything substantial. What are
you trying to achieve here? "Making ACPI info available early" can't
be a goal in itself and the two benefits cited in this thread seem
pretty dubious to me. Why are you guys trying to push this
convolution when it doesn't bring any substantial gain?

Thanks.

--
tejun

2013-08-22 20:13:03

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello Tejun,

On Thu, 2013-08-22 at 14:31 -0400, Tejun Heo wrote:
> On Thu, Aug 22, 2013 at 09:52:09AM -0600, Toshi Kani wrote:
> > I understand that you are concerned about stability of the ACPI stuff,
> > which I think is a valid point, but most of (if not all) of the
> > ACPI-related issues come from ACPI namespace/methods, which is a very
> > different thing. Please do not mix up those two. The ACPI
>
> I have no objection to implementing self-conftained earlyprintk
> support. If that's all you want to do, please go ahead but do not
> pull in initrd override or ACPICA into it.

If you are referring ACPICA as the AML interpreter, right, we do not
move it up as I explained before. We are trying to move up the ACPI
table init code (which is part of ACPICA, but has nothing to do with
AML.)

Note that ia64 also uses ACPI, and calls acpi_table_init() in
setup_arch() before initializing the bootmap in find_memory().

> > namespace/methods stuff remains the same and continues to be initialized
> > at very late in the boot sequence.
> >
> > What's making the patchset complicated is acpi_initrd_override(), which
> > is intended for developers and allows overwriting ACPI bits at their own
> > risk. This feature won't be used by regular users.
>
> Yeah, please forget about that in earlyboot. It doesn't make any
> sense to fiddle with initrd that early during boot.

I think the reason why Tang is working on this stuff again is that his
previous change (which was once accepted) had broken initrd. So, he'd
have to support it this time...

> > If you are referring the issue of kernel image location, it is a
> > limitation in the current implementation, not a technical limitation. I
> > know other OS that supports movable memory and puts the kernel image
> > into a movable memory with SRAT by changing the bootloader.
>
> I'm not saying that problem shouldn't be solved. I'm saying what you
> guys are pushing doesn't help solving it at all. It's too late in the
> boot process. It needs to be handled either by bootloader or earlier
> kernel kexecing the actual one and super-early SRAT doens't help at
> all in either case, so what's the point of pulling ACPI code in when
> it doesn't contribute to solving the problem properly?

It's too late for the kernel image itself, but it prevents allocating
kernel memory from movable ranges after that. I'd say it solves a half
of the issue this time.

> > Also, how do you support local page tables without pursing SRAT early?
>
> Does it even matter with huge mappings? It's gonna be contained in a
> single page anyway, right?

Are the huge mappings always used? We cannot force user programs to use
huge pages, can we?

> > Initializing page tables on large systems may take a long time, and I do
> > think that earlyprink needs to be available before that point.
>
> Yeah, sure, implement it in *minimal* way which doesn't affect
> anything if not explicitly enabled by kernel param like other
> earlyprintks. It doens't make any sense to add dependency to acpi
> from early boot for that.

It makes sense because it needs to obtain the config info from ACPI
tables.

As for the maintainability, I am far more concerned with your suggestion
of having a separate page table init code when SRAT is used. This kind
of divergence is a recipe of breakage.

Thanks,
-Toshi

2013-08-22 20:22:08

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello,

On Thu, Aug 22, 2013 at 02:11:32PM -0600, Toshi Kani wrote:
> It's too late for the kernel image itself, but it prevents allocating
> kernel memory from movable ranges after that. I'd say it solves a half
> of the issue this time.

That works if such half solution eventually leads to the full
solution. This is just a distraction. You are already too late in
the boot sequence. It doesn't even qualify as a half solution. It's
like obsessing about a speck on your shirt without your trousers on.
If you want to solve this, do that from a place where it actually is
solvable.

> > > Also, how do you support local page tables without pursing SRAT early?
> >
> > Does it even matter with huge mappings? It's gonna be contained in a
> > single page anyway, right?
>
> Are the huge mappings always used? We cannot force user programs to use
> huge pages, can we?

Everything is a trade-off. Should we do all this just to support the
off chance someone tries to use memory hotplug on a machine which
doesn't support huge mapping when virtually all CPUs on market
supports it?

> As for the maintainability, I am far more concerned with your suggestion
> of having a separate page table init code when SRAT is used. This kind
> of divergence is a recipe of breakage.

I don't buy that. The only thing which needs to change is the
directionality of allocation and we probably don't even need to do
that if huge mapping is in use.

Thanks.

--
tejun

2013-08-22 20:35:38

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

A bit of addition.

On Thu, Aug 22, 2013 at 04:21:58PM -0400, Tejun Heo wrote:
> That works if such half solution eventually leads to the full
> solution. This is just a distraction. You are already too late in
> the boot sequence. It doesn't even qualify as a half solution. It's
> like obsessing about a speck on your shirt without your trousers on.
> If you want to solve this, do that from a place where it actually is
> solvable.

Seriously, what's the end game here? How do you guys see this
eventually reaching full solution? If you don't see that and this
kinda-sorta-working solution is fine, then that's fine too but we
aren't gonna make a lot of invasive changes for that. If you can at
least envision the full solution, please try to fit this effort into
the bigger picture.

In all possible solutions that I can think of, there needs to be
earlier handling of SRAT informtaion before the kernel proper starts
executing be that either the actual bootloader or earlier kernel
serving as kexec host. If a proper solution needs such processing
earlier anyway, it can set up things so that either the default
booting behavior doesn't harm hotpluggability or feed the necessary
information to the kernel. In both cases, doing ACPI super early in
the booting kernel doesn't buy us anything.

So, then, what the hell are we doing here with all these relocations,
careful double execution of the same code from different execution
contexts, worrying about initrd firmware override even before the
kernel page table is set up? If we're doing all those to just make
the temporary half-assed-anyway solution minutely better, that's just
plain stupid.

Thanks.

--
tejun

2013-08-22 21:08:09

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

On Thu, 2013-08-22 at 16:21 -0400, Tejun Heo wrote:
> On Thu, Aug 22, 2013 at 02:11:32PM -0600, Toshi Kani wrote:
> > It's too late for the kernel image itself, but it prevents allocating
> > kernel memory from movable ranges after that. I'd say it solves a half
> > of the issue this time.
>
> That works if such half solution eventually leads to the full
> solution. This is just a distraction. You are already too late in
> the boot sequence. It doesn't even qualify as a half solution. It's
> like obsessing about a speck on your shirt without your trousers on.
> If you want to solve this, do that from a place where it actually is
> solvable.

Since some node(s) won't be ejectable, this solution is reasonable as
the first step. I do not think it is a distraction. I view your
suggestion as a distraction of supporting local page tables, though.

> > > > Also, how do you support local page tables without pursing SRAT early?
> > >
> > > Does it even matter with huge mappings? It's gonna be contained in a
> > > single page anyway, right?
> >
> > Are the huge mappings always used? We cannot force user programs to use
> > huge pages, can we?
>
> Everything is a trade-off. Should we do all this just to support the
> off chance someone tries to use memory hotplug on a machine which
> doesn't support huge mapping when virtually all CPUs on market
> supports it?

Local page table and memory hotplug are two separate things. That is,
local page tables can be supported on all NUMA platforms without hotplug
support. Are you sure huge mapping will solve everything for all types
of applications, and therefore local page tables won't be needed at all?

> > As for the maintainability, I am far more concerned with your suggestion
> > of having a separate page table init code when SRAT is used. This kind
> > of divergence is a recipe of breakage.
>
> I don't buy that. The only thing which needs to change is the
> directionality of allocation and we probably don't even need to do
> that if huge mapping is in use.

When someone changes the page table init code, who will test it with the
special allocation code?

Thanks,
-Toshi

2013-08-22 21:21:22

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello, Toshi.

On Thu, Aug 22, 2013 at 03:06:38PM -0600, Toshi Kani wrote:
> Since some node(s) won't be ejectable, this solution is reasonable as
> the first step. I do not think it is a distraction. I view your

But does this contribute to reaching the next step? If so, how?
I can't see how and that's why I said this was a distraction.

> suggestion as a distraction of supporting local page tables, though.

Hmmm...

> Local page table and memory hotplug are two separate things. That is,
> local page tables can be supported on all NUMA platforms without hotplug
> support. Are you sure huge mapping will solve everything for all types
> of applications, and therefore local page tables won't be needed at all?

When you throw around terms like "all" and "at all", you can't reach
rational discussion about engineering trade-offs. I was asking you
whether it was reasonable to do per-node page table when most machines
support huge page mappings which makes the whole thing rather
pointless. Of course there will be some niche cases where this might
not be optimal but do you think that would be enough to justify the
added complexity and churn? If you think so, can you please
elaborate?

> When someone changes the page table init code, who will test it with the
> special allocation code?

What are you worrying about? Are you saying that allocating page
table towards top or bottom of memory would be more disruptive and
difficult to debug than pulling in ACPI init and SRAT information into
the process? Am I missing something here?

Thanks.

--
tejun

2013-08-22 22:19:14

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello Tejun,

On Thu, 2013-08-22 at 17:21 -0400, Tejun Heo wrote:
:
> > Local page table and memory hotplug are two separate things. That is,
> > local page tables can be supported on all NUMA platforms without hotplug
> > support. Are you sure huge mapping will solve everything for all types
> > of applications, and therefore local page tables won't be needed at all?
>
> When you throw around terms like "all" and "at all", you can't reach
> rational discussion about engineering trade-offs. I was asking you
> whether it was reasonable to do per-node page table when most machines
> support huge page mappings which makes the whole thing rather
> pointless. Of course there will be some niche cases where this might
> not be optimal but do you think that would be enough to justify the
> added complexity and churn? If you think so, can you please
> elaborate?

I am relatively new to Linux, so I am not a good person to elaborate
this. From my experience on other OS, huge pages helped for the kernel,
but did not necessarily help user applications. It depended on
applications, which were not niche cases. But Linux may be different,
so I asked since you seemed confident. I'd appreciate if you can point
us some data that endorses your statement.

> > When someone changes the page table init code, who will test it with the
> > special allocation code?
>
> What are you worrying about? Are you saying that allocating page
> table towards top or bottom of memory would be more disruptive and
> difficult to debug than pulling in ACPI init and SRAT information into
> the process? Am I missing something here?

My worry is that the code is unlikely tested with the special logic when
someone makes code changes to the page tables. Such code can easily be
broken in future.

To answer your other question/email, I believe Tang's next step is to
support local page tables. This is why we think pursing SRAT earlier is
the right direction.

Thanks,
-Toshi

2013-08-23 13:04:49

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello, Toshi.

On Thu, Aug 22, 2013 at 04:17:41PM -0600, Toshi Kani wrote:
> I am relatively new to Linux, so I am not a good person to elaborate
> this. From my experience on other OS, huge pages helped for the kernel,
> but did not necessarily help user applications. It depended on
> applications, which were not niche cases. But Linux may be different,
> so I asked since you seemed confident. I'd appreciate if you can point
> us some data that endorses your statement.

We are talking about the kernel linear mapping which is created during
early boot, so if it's available and useable there's no reason not to
use it. Exceptions would be earlier processors which didn't do 1G
mappings or e820 maps with a lot of holes. For CPUs used in NUMA
configurations, the former has been history for a bit now. Can't be
sure about the latter but it'd be surprising for that to affect large
amount of memory in the systems that are of interest here. Ooh, that
reminds me that we probably wanna go back to 1G + MTRR mapping under
4G. We're currently creating a lot of mapping holes.

> My worry is that the code is unlikely tested with the special logic when
> someone makes code changes to the page tables. Such code can easily be
> broken in future.

Well, I wouldn't consider flipping the direction of allocation to be
particularly difficult to get right especially when compared to
bringing in ACPI tables into the mix.

> To answer your other question/email, I believe Tang's next step is to
> support local page tables. This is why we think pursing SRAT earlier is
> the right direction.

Given 1G mappings, is that even a worthwhile effort? I'm getting even
more more skeptical.

Thanks.

--
tejun

2013-08-23 13:11:04

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

What is the point of 1G+MTRR? If there are caching differences the TLB will fracture the pages anyway.

Tejun Heo <[email protected]> wrote:
>Hello, Toshi.
>
>On Thu, Aug 22, 2013 at 04:17:41PM -0600, Toshi Kani wrote:
>> I am relatively new to Linux, so I am not a good person to elaborate
>> this. From my experience on other OS, huge pages helped for the
>kernel,
>> but did not necessarily help user applications. It depended on
>> applications, which were not niche cases. But Linux may be
>different,
>> so I asked since you seemed confident. I'd appreciate if you can
>point
>> us some data that endorses your statement.
>
>We are talking about the kernel linear mapping which is created during
>early boot, so if it's available and useable there's no reason not to
>use it. Exceptions would be earlier processors which didn't do 1G
>mappings or e820 maps with a lot of holes. For CPUs used in NUMA
>configurations, the former has been history for a bit now. Can't be
>sure about the latter but it'd be surprising for that to affect large
>amount of memory in the systems that are of interest here. Ooh, that
>reminds me that we probably wanna go back to 1G + MTRR mapping under
>4G. We're currently creating a lot of mapping holes.
>
>> My worry is that the code is unlikely tested with the special logic
>when
>> someone makes code changes to the page tables. Such code can easily
>be
>> broken in future.
>
>Well, I wouldn't consider flipping the direction of allocation to be
>particularly difficult to get right especially when compared to
>bringing in ACPI tables into the mix.
>
>> To answer your other question/email, I believe Tang's next step is to
>> support local page tables. This is why we think pursing SRAT earlier
>is
>> the right direction.
>
>Given 1G mappings, is that even a worthwhile effort? I'm getting even
>more more skeptical.
>
>Thanks.

--
Sent from my mobile phone. Please excuse brevity and lack of formatting.

2013-08-23 14:19:30

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello,

On Fri, Aug 23, 2013 at 03:08:55PM +0200, H. Peter Anvin wrote:
> What is the point of 1G+MTRR? If there are caching differences the
> TLB will fracture the pages anyway.

Ah, right. Consuming less memory / cachelines would still be a small
advantage tho unless creating split TLB from larger mapping is
noticeably less efficient. If the extra logic to do that is small,
which I think it'd be, it'd be a gain at almost no cost.

Thanks.

--
tejun

2013-08-23 14:26:30

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Well... relying on MTRRs is a big cost in complexity and failure modes.

Tejun Heo <[email protected]> wrote:
>Hello,
>
>On Fri, Aug 23, 2013 at 03:08:55PM +0200, H. Peter Anvin wrote:
>> What is the point of 1G+MTRR? If there are caching differences the
>> TLB will fracture the pages anyway.
>
>Ah, right. Consuming less memory / cachelines would still be a small
>advantage tho unless creating split TLB from larger mapping is
>noticeably less efficient. If the extra logic to do that is small,
>which I think it'd be, it'd be a gain at almost no cost.
>
>Thanks.

--
Sent from my mobile phone. Please excuse brevity and lack of formatting.

2013-08-23 14:35:14

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello,

On Fri, Aug 23, 2013 at 04:24:06PM +0200, H. Peter Anvin wrote:
> Well... relying on MTRRs is a big cost in complexity and failure modes.

Yeah, it's true that MTRRs are nasty. On the other hand, we've been
doing that for over a decade and are still doing it anyway if I'm not
mistaken. It probably isn't a big difference but it's still a bit sad
that this is likely causing small performance regression out in the
wild.

Thanks.

--
tejun

2013-08-23 14:57:39

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

On Fri, Aug 23, 2013 at 10:35:07AM -0400, Tejun Heo wrote:
> Yeah, it's true that MTRRs are nasty. On the other hand, we've been
> doing that for over a decade and are still doing it anyway if I'm not
> mistaken. It probably isn't a big difference but it's still a bit sad
> that this is likely causing small performance regression out in the
> wild.

Just went over the processor manual and it doesn't seem like doing the
above would be a good idea.


System Programming Guide, Part 1

11.11.9 Large Page Size Considerations

...
Because the memory type for a large page is cached in the TLB, the
processor can behave in an undefined manner if a large page is mapped
to a region of memory that MTRRs have mapped with multiple memory
types.
...
If a large page maps to a region of memory containing different
MTRR-defined memory types, the PCD and PWT flags in the page-table
entry should be set for the most conservative memory type for that
range. For example, a large page used for memory mapped I/O and
regular memory 11-48 Vol. 3A MEMORY CACHE CONTROL
...

The Pentium 4, Intel Xeon, and P6 family processors provide special
support for the physical memory range from 0 to 4 MBytes,
...
Here, the processor maps the memory range as multiple 4-KByte pages
within the TLB. This operation insures correct behavior at the cost
of performance. To avoid this performance penalty, operating-system
software should reserve the large page option for regions of memory
at addresses greater than or equal to 4 MBytes.

So, yeah, the current behavior seems like the right thing to do.

Thanks.

--
tejun

2013-08-23 16:15:42

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello,

On Fri, 2013-08-23 at 09:04 -0400, Tejun Heo wrote:
> On Thu, Aug 22, 2013 at 04:17:41PM -0600, Toshi Kani wrote:
> > I am relatively new to Linux, so I am not a good person to elaborate
> > this. From my experience on other OS, huge pages helped for the kernel,
> > but did not necessarily help user applications. It depended on
> > applications, which were not niche cases. But Linux may be different,
> > so I asked since you seemed confident. I'd appreciate if you can point
> > us some data that endorses your statement.
>
> We are talking about the kernel linear mapping which is created during
> early boot, so if it's available and useable there's no reason not to
> use it. Exceptions would be earlier processors which didn't do 1G
> mappings or e820 maps with a lot of holes. For CPUs used in NUMA
> configurations, the former has been history for a bit now. Can't be
> sure about the latter but it'd be surprising for that to affect large
> amount of memory in the systems that are of interest here. Ooh, that
> reminds me that we probably wanna go back to 1G + MTRR mapping under
> 4G. We're currently creating a lot of mapping holes.

Thanks for the explanation.

> > My worry is that the code is unlikely tested with the special logic when
> > someone makes code changes to the page tables. Such code can easily be
> > broken in future.
>
> Well, I wouldn't consider flipping the direction of allocation to be
> particularly difficult to get right especially when compared to
> bringing in ACPI tables into the mix.
>
> > To answer your other question/email, I believe Tang's next step is to
> > support local page tables. This is why we think pursing SRAT earlier is
> > the right direction.
>
> Given 1G mappings, is that even a worthwhile effort? I'm getting even
> more more skeptical.

With 1G mappings, I agree that it won't make much difference.

I still think acpi table info should be available earlier, but I do not
think I can convince you on this. This can be religious debate.

Tang, what do you think? Are you OK to try Tejun's suggestion as well?

Thanks,
-Toshi

2013-08-23 16:24:51

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello,

On Fri, Aug 23, 2013 at 10:14:08AM -0600, Toshi Kani wrote:
> I still think acpi table info should be available earlier, but I do not
> think I can convince you on this. This can be religious debate.

I'm curious. If there aren't substantial enough benefits, why would
you still want to pull it earlier when it brings in things like initrd
override and crafting the code carefully so that it's safe to execute
it from different address modes and so on? Please note that x86 is
not ia64. The early environment is completely different not only
technically but also in its diversity and suckiness. It wasn't too
long ago that vendors were screwing up ACPI left and right. It has
been getting better but there's a reason why, for example, we still
consider e820 to be the authoritative information over ACPI.

Thanks.

--
tejun

2013-08-23 16:54:52

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello

On 08/24/2013 12:14 AM, Toshi Kani wrote:
> Hello,
>
> On Fri, 2013-08-23 at 09:04 -0400, Tejun Heo wrote:
>> On Thu, Aug 22, 2013 at 04:17:41PM -0600, Toshi Kani wrote:
>>> I am relatively new to Linux, so I am not a good person to elaborate
>>> this. From my experience on other OS, huge pages helped for the kernel,
>>> but did not necessarily help user applications. It depended on
>>> applications, which were not niche cases. But Linux may be different,
>>> so I asked since you seemed confident. I'd appreciate if you can point
>>> us some data that endorses your statement.
>>
>> We are talking about the kernel linear mapping which is created during
>> early boot, so if it's available and useable there's no reason not to
>> use it. Exceptions would be earlier processors which didn't do 1G
>> mappings or e820 maps with a lot of holes. For CPUs used in NUMA
>> configurations, the former has been history for a bit now. Can't be
>> sure about the latter but it'd be surprising for that to affect large
>> amount of memory in the systems that are of interest here. Ooh, that
>> reminds me that we probably wanna go back to 1G + MTRR mapping under
>> 4G. We're currently creating a lot of mapping holes.
>
> Thanks for the explanation.
>
>>> My worry is that the code is unlikely tested with the special logic when
>>> someone makes code changes to the page tables. Such code can easily be
>>> broken in future.
>>
>> Well, I wouldn't consider flipping the direction of allocation to be
>> particularly difficult to get right especially when compared to
>> bringing in ACPI tables into the mix.
>>
>>> To answer your other question/email, I believe Tang's next step is to
>>> support local page tables. This is why we think pursing SRAT earlier is
>>> the right direction.
>>
>> Given 1G mappings, is that even a worthwhile effort? I'm getting even
>> more more skeptical.
>
> With 1G mappings, I agree that it won't make much difference.
>
> I still think acpi table info should be available earlier, but I do not
> think I can convince you on this. This can be religious debate.
>
> Tang, what do you think? Are you OK to try Tejun's suggestion as well?
>

By saying TJ's suggestion, you mean, we will let memblock to control the
behaviour, that said, we will do early allocations near the kernel image
range before we get the SRAT info?

If so, yeah, we have been working on this direction. By doing this, we may
have two main changes:

1. change some of memblock's APIs to make it have the ability to allocate
memory from low address.
2. setup kernel page table down-top. Concretely, we first map the memory
just after the kernel image to the top, then, we map 0 - kernel image end.

Do you guys think this is reasonable and acceptable?

--
Thanks.
Zhang Yanfei

2013-08-23 17:15:09

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello,

On Fri, 2013-08-23 at 12:24 -0400, Tejun Heo wrote:
> On Fri, Aug 23, 2013 at 10:14:08AM -0600, Toshi Kani wrote:
> > I still think acpi table info should be available earlier, but I do not
> > think I can convince you on this. This can be religious debate.
>
> I'm curious. If there aren't substantial enough benefits, why would
> you still want to pull it earlier when it brings in things like initrd
> override and crafting the code carefully so that it's safe to execute
> it from different address modes and so on? Please note that x86 is
> not ia64. The early environment is completely different not only
> technically but also in its diversity and suckiness. It wasn't too
> long ago that vendors were screwing up ACPI left and right. It has
> been getting better but there's a reason why, for example, we still
> consider e820 to be the authoritative information over ACPI.

Firmware generates tables, and provides them via some interface. Memory
map table can be provided via e820 or EFI memory map. Memory topology
table is provided via ACPI. I agree to prioritize one table over the
other when there is overlap. But in the end, it is the firmware that
generates the tables. Because it is provided via ACPI does not make it
suddenly unreliable. I think table info from e820/EFI/ACPI should be
available at the same time. To me, it makes more sense to use the
hotplug info to initialize memblock than try to find a way to workaround
without it. I think we will continue to be in that way to find a
workaround in this direction.

I came from ia64 background, and am not very familiar with x86. So, you
may be very right about that x86 is different. I also agree that initrd
is making it unnecessarily complicated. We may see some initial issues,
but my hope is that the code gets matured over the time.

Thanks,
-Toshi

2013-08-23 17:30:33

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hi Toshi,

On 08/24/2013 01:13 AM, Toshi Kani wrote:
> Hello,
>
> On Fri, 2013-08-23 at 12:24 -0400, Tejun Heo wrote:
>> On Fri, Aug 23, 2013 at 10:14:08AM -0600, Toshi Kani wrote:
>>> I still think acpi table info should be available earlier, but I do not
>>> think I can convince you on this. This can be religious debate.
>>
>> I'm curious. If there aren't substantial enough benefits, why would
>> you still want to pull it earlier when it brings in things like initrd
>> override and crafting the code carefully so that it's safe to execute
>> it from different address modes and so on? Please note that x86 is
>> not ia64. The early environment is completely different not only
>> technically but also in its diversity and suckiness. It wasn't too
>> long ago that vendors were screwing up ACPI left and right. It has
>> been getting better but there's a reason why, for example, we still
>> consider e820 to be the authoritative information over ACPI.
>
> Firmware generates tables, and provides them via some interface. Memory
> map table can be provided via e820 or EFI memory map. Memory topology
> table is provided via ACPI. I agree to prioritize one table over the
> other when there is overlap. But in the end, it is the firmware that
> generates the tables. Because it is provided via ACPI does not make it
> suddenly unreliable. I think table info from e820/EFI/ACPI should be
> available at the same time. To me, it makes more sense to use the
> hotplug info to initialize memblock than try to find a way to workaround
> without it.

Yeah, agreed. But sigh.... on x86, we have ACPI initrd override, so we still
cannot convince Tj....

I think we will continue to be in that way to find a
> workaround in this direction.
>
> I came from ia64 background, and am not very familiar with x86. So, you
> may be very right about that x86 is different. I also agree that initrd
> is making it unnecessarily complicated. We may see some initial issues,
> but my hope is that the code gets matured over the time.
>
> Thanks,
> -Toshi
>


--
Thanks.
Zhang Yanfei

2013-08-23 18:30:39

by Toshi Kani

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hello Zhang,

On Sat, 2013-08-24 at 00:54 +0800, Zhang Yanfei wrote:
> > Tang, what do you think? Are you OK to try Tejun's suggestion as well?
> >
>
> By saying TJ's suggestion, you mean, we will let memblock to control the
> behaviour, that said, we will do early allocations near the kernel image
> range before we get the SRAT info?

Right.

> If so, yeah, we have been working on this direction.

Great!

> By doing this, we may
> have two main changes:
>
> 1. change some of memblock's APIs to make it have the ability to allocate
> memory from low address.
> 2. setup kernel page table down-top. Concretely, we first map the memory
> just after the kernel image to the top, then, we map 0 - kernel image end.
>
> Do you guys think this is reasonable and acceptable?

Have you also looked at Yinghai's comments below?

http://www.spinics.net/lists/linux-mm/msg61362.html

Thanks,
-Toshi

2013-08-23 20:09:00

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

On Fri, Aug 23, 2013 at 11:25 AM, H. Peter Anvin <[email protected]> wrote:
>
> BRK makes sense as long as you can set a sane O(1) size limit.
>
>>
>>put the acpi override table in BRK, we still need ok from HPA.
>>I have impression that he did not like it, so want to confirm from him.

on 8 sockets system:
-rw-r--r-- 1 root root 3532 Aug 22 10:26 APIC.dat
-rw-r--r-- 1 root root 48 Aug 22 10:26 BDAT.dat
-rw-r--r-- 1 root root 824 Aug 22 10:26 DMAR.dat
-rw-r--r-- 1 root root 83509 Aug 22 10:26 DSDT.dat
-rw-r--r-- 1 root root 244 Aug 22 10:26 FACP.dat
-rw-r--r-- 1 root root 64 Aug 22 10:26 FACS.dat
-rw-r--r-- 1 root root 68 Aug 22 10:26 FPDT.dat
-rw-r--r-- 1 root root 56 Aug 22 10:26 HPET.dat
-rw-r--r-- 1 root root 304 Aug 22 10:26 MCEJ.dat
-rw-r--r-- 1 root root 60 Aug 22 10:26 MCFG.dat
-rw-r--r-- 1 root root 6712 Aug 22 10:26 MPST.dat
-rw-r--r-- 1 root root 232 Aug 22 10:26 MSCT.dat
-rw-r--r-- 1 root root 172 Aug 22 10:26 PCCT.dat
-rw-r--r-- 1 root root 96 Aug 22 10:26 PMCT.dat
-rw-r--r-- 1 root root 48 Aug 22 10:26 RASF.dat
-rw-r--r-- 1 root root 108 Aug 22 10:26 SLIT.dat
-rw-r--r-- 1 root root 80 Aug 22 10:26 SPCR.dat
-rw-r--r-- 1 root root 65 Aug 22 10:26 SPMI.dat
-rw-r--r-- 1 root root 6448 Aug 22 10:26 SRAT.dat
-rw-r--r-- 1 root root 100 Aug 22 10:26 SSDT1.dat
-rw-r--r-- 1 root root 283527 Aug 22 10:26 SSDT2.dat
-rw-r--r-- 1 root root 66 Aug 22 10:26 UEFI.dat
-rw-r--r-- 1 root root 64 Aug 22 10:26 WDDT.dat

assume for 32sockets will have four times bigger with DSDT and SSDT.
(with more pci and cpus)

So we can not have O(1) the size.

Russ, What is ACPI table size on your big machine?

Thanks

Yinghai

2013-08-23 20:30:55

by Russ Anderson

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

On Fri, Aug 23, 2013 at 01:08:56PM -0700, Yinghai Lu wrote:
> On Fri, Aug 23, 2013 at 11:25 AM, H. Peter Anvin <[email protected]> wrote:
> >
> > BRK makes sense as long as you can set a sane O(1) size limit.
> >
> >>
> >>put the acpi override table in BRK, we still need ok from HPA.
> >>I have impression that he did not like it, so want to confirm from him.
>
> on 8 sockets system:
> -rw-r--r-- 1 root root 3532 Aug 22 10:26 APIC.dat
> -rw-r--r-- 1 root root 48 Aug 22 10:26 BDAT.dat
> -rw-r--r-- 1 root root 824 Aug 22 10:26 DMAR.dat
> -rw-r--r-- 1 root root 83509 Aug 22 10:26 DSDT.dat
> -rw-r--r-- 1 root root 244 Aug 22 10:26 FACP.dat
> -rw-r--r-- 1 root root 64 Aug 22 10:26 FACS.dat
> -rw-r--r-- 1 root root 68 Aug 22 10:26 FPDT.dat
> -rw-r--r-- 1 root root 56 Aug 22 10:26 HPET.dat
> -rw-r--r-- 1 root root 304 Aug 22 10:26 MCEJ.dat
> -rw-r--r-- 1 root root 60 Aug 22 10:26 MCFG.dat
> -rw-r--r-- 1 root root 6712 Aug 22 10:26 MPST.dat
> -rw-r--r-- 1 root root 232 Aug 22 10:26 MSCT.dat
> -rw-r--r-- 1 root root 172 Aug 22 10:26 PCCT.dat
> -rw-r--r-- 1 root root 96 Aug 22 10:26 PMCT.dat
> -rw-r--r-- 1 root root 48 Aug 22 10:26 RASF.dat
> -rw-r--r-- 1 root root 108 Aug 22 10:26 SLIT.dat
> -rw-r--r-- 1 root root 80 Aug 22 10:26 SPCR.dat
> -rw-r--r-- 1 root root 65 Aug 22 10:26 SPMI.dat
> -rw-r--r-- 1 root root 6448 Aug 22 10:26 SRAT.dat
> -rw-r--r-- 1 root root 100 Aug 22 10:26 SSDT1.dat
> -rw-r--r-- 1 root root 283527 Aug 22 10:26 SSDT2.dat
> -rw-r--r-- 1 root root 66 Aug 22 10:26 UEFI.dat
> -rw-r--r-- 1 root root 64 Aug 22 10:26 WDDT.dat
>
> assume for 32sockets will have four times bigger with DSDT and SSDT.
> (with more pci and cpus)
>
> So we can not have O(1) the size.
>
> Russ, What is ACPI table size on your big machine?

This is from a 256 socket 32TB system.

Reserving 256MB of memory at 66973408MB for crashkernel (System RAM: 32501719MB)
ACPI: RSDP 000000007ef3d014 00024 (v02 INTEL )
ACPI: XSDT 000000007ef3d120 0007C (v01 INTEL TIANO 00000000 01000013)
ACPI: FACP 000000007ef3a000 000F4 (v04 INTEL TIANO 00000000 MSFT 01000013)
ACPI: DSDT 000000007e6c3000 7F493E (v02 SGI2 UVX 00000002 MSFT 01000013)
ACPI: FACS 000000007d147000 00040
ACPI: UEFI 000000007ef3c000 0012A (v01 INTEL RstScuO 00000000 00000000)
ACPI: UEFI 000000007ef3b000 0005C (v01 INTEL RstScuV 00000000 00000000)
ACPI: HPET 000000007ef39000 00038 (v01 INTEL TIANO 00000001 MSFT 01000013)
ACPI: SSDT 000000007ef33000 05352 (v02 INTEL ROSECITY 00000003 INTL 20070508)
ACPI: SLIT 000000007ef10000 1002C (v01 SGI2 UVX 00000002 MSFT 00000001)
ACPI: APIC 000000007eeee000 10070 (v03 SGI2 UVX 00000002 MSFT 00000001)
ACPI: SRAT 000000007eeb8000 1A830 (v03 SGI2 UVX 00000002 MSFT 00000001)
ACPI: MCFG 000000007d6d4000 0105C (v01 SGI2 UVX 00000002 MSFT 00000001)
ACPI: SPCR 000000007e6c2000 00050 (v01 00000000 00000000)
ACPI: DMAR 000000007d6d3000 0013C (v01 INTEL TIANO 00000001 MSFT 01000013)

--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc [email protected]

2013-08-23 20:48:56

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

On Fri, Aug 23, 2013 at 1:30 PM, Russ Anderson <[email protected]> wrote:
> On Fri, Aug 23, 2013 at 01:08:56PM -0700, Yinghai Lu wrote:
>> On Fri, Aug 23, 2013 at 11:25 AM, H. Peter Anvin <[email protected]> wrote:

>> Russ, What is ACPI table size on your big machine?
>
> This is from a 256 socket 32TB system.
>
> Reserving 256MB of memory at 66973408MB for crashkernel (System RAM: 32501719MB)
> ACPI: RSDP 000000007ef3d014 00024 (v02 INTEL )
> ACPI: XSDT 000000007ef3d120 0007C (v01 INTEL TIANO 00000000 01000013)
> ACPI: FACP 000000007ef3a000 000F4 (v04 INTEL TIANO 00000000 MSFT 01000013)
> ACPI: DSDT 000000007e6c3000 7F493E (v02 SGI2 UVX 00000002 MSFT 01000013)
> ACPI: FACS 000000007d147000 00040
> ACPI: UEFI 000000007ef3c000 0012A (v01 INTEL RstScuO 00000000 00000000)
> ACPI: UEFI 000000007ef3b000 0005C (v01 INTEL RstScuV 00000000 00000000)
> ACPI: HPET 000000007ef39000 00038 (v01 INTEL TIANO 00000001 MSFT 01000013)
> ACPI: SSDT 000000007ef33000 05352 (v02 INTEL ROSECITY 00000003 INTL 20070508)
> ACPI: SLIT 000000007ef10000 1002C (v01 SGI2 UVX 00000002 MSFT 00000001)
> ACPI: APIC 000000007eeee000 10070 (v03 SGI2 UVX 00000002 MSFT 00000001)
> ACPI: SRAT 000000007eeb8000 1A830 (v03 SGI2 UVX 00000002 MSFT 00000001)
> ACPI: MCFG 000000007d6d4000 0105C (v01 SGI2 UVX 00000002 MSFT 00000001)
> ACPI: SPCR 000000007e6c2000 00050 (v01 00000000 00000000)
> ACPI: DMAR 000000007d6d3000 0013C (v01 INTEL TIANO 00000001 MSFT 01000013)
>

so the DSDT is 7F493E, and total is more than 8M.

that will need BRK to be extended 16M?

Thanks

Yinghai

2013-08-23 21:50:59

by chen tang

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hi Yinghai,

2013/8/24 Yinghai Lu <[email protected]>:
> On Fri, Aug 23, 2013 at 1:30 PM, Russ Anderson <[email protected]> wrote:
>> On Fri, Aug 23, 2013 at 01:08:56PM -0700, Yinghai Lu wrote:
>>> On Fri, Aug 23, 2013 at 11:25 AM, H. Peter Anvin <[email protected]> wrote:
>
>>> Russ, What is ACPI table size on your big machine?
>>
>> This is from a 256 socket 32TB system.
>>
>> Reserving 256MB of memory at 66973408MB for crashkernel (System RAM: 32501719MB)
>> ACPI: RSDP 000000007ef3d014 00024 (v02 INTEL )
>> ACPI: XSDT 000000007ef3d120 0007C (v01 INTEL TIANO 00000000 01000013)
>> ACPI: FACP 000000007ef3a000 000F4 (v04 INTEL TIANO 00000000 MSFT 01000013)
>> ACPI: DSDT 000000007e6c3000 7F493E (v02 SGI2 UVX 00000002 MSFT 01000013)
>> ACPI: FACS 000000007d147000 00040
>> ACPI: UEFI 000000007ef3c000 0012A (v01 INTEL RstScuO 00000000 00000000)
>> ACPI: UEFI 000000007ef3b000 0005C (v01 INTEL RstScuV 00000000 00000000)
>> ACPI: HPET 000000007ef39000 00038 (v01 INTEL TIANO 00000001 MSFT 01000013)
>> ACPI: SSDT 000000007ef33000 05352 (v02 INTEL ROSECITY 00000003 INTL 20070508)
>> ACPI: SLIT 000000007ef10000 1002C (v01 SGI2 UVX 00000002 MSFT 00000001)
>> ACPI: APIC 000000007eeee000 10070 (v03 SGI2 UVX 00000002 MSFT 00000001)
>> ACPI: SRAT 000000007eeb8000 1A830 (v03 SGI2 UVX 00000002 MSFT 00000001)
>> ACPI: MCFG 000000007d6d4000 0105C (v01 SGI2 UVX 00000002 MSFT 00000001)
>> ACPI: SPCR 000000007e6c2000 00050 (v01 00000000 00000000)
>> ACPI: DMAR 000000007d6d3000 0013C (v01 INTEL TIANO 00000001 MSFT 01000013)
>>
>
> so the DSDT is 7F493E, and total is more than 8M.
>
> that will need BRK to be extended 16M?
>

Then how about use early_ioremap(), and don't do it that early in
head_32 and head64 ?

Thanks.

2013-08-23 21:52:38

by Moore, Robert

[permalink] [raw]
Subject: RE: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

While we're at it:

Can someone send me the acpidump for this machine? We very much would like to test all of ACPICA with such a large DSDT.

Thanks,
Bob


> -----Original Message-----
> From: chen tang [mailto:[email protected]]
> Sent: Friday, August 23, 2013 2:51 PM
> To: Yinghai Lu
> Cc: Russ Anderson; H. Peter Anvin; Zhang Yanfei; Toshi Kani; Tejun Heo;
> Tang Chen; Konrad Rzeszutek Wilk; Moore, Robert; Zheng, Lv; Rafael J.
> Wysocki; Ingo Molnar; Andrew Morton; Thomas Renninger; Yasuaki Ishimatsu;
> Mel Gorman; Linux Kernel Mailing List
> Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.
>
> Hi Yinghai,
>
> 2013/8/24 Yinghai Lu <[email protected]>:
> > On Fri, Aug 23, 2013 at 1:30 PM, Russ Anderson <[email protected]> wrote:
> >> On Fri, Aug 23, 2013 at 01:08:56PM -0700, Yinghai Lu wrote:
> >>> On Fri, Aug 23, 2013 at 11:25 AM, H. Peter Anvin <[email protected]>
> wrote:
> >
> >>> Russ, What is ACPI table size on your big machine?
> >>
> >> This is from a 256 socket 32TB system.
> >>
> >> Reserving 256MB of memory at 66973408MB for crashkernel (System RAM:
> 32501719MB)
> >> ACPI: RSDP 000000007ef3d014 00024 (v02 INTEL )
> >> ACPI: XSDT 000000007ef3d120 0007C (v01 INTEL TIANO 00000000
> 01000013)
> >> ACPI: FACP 000000007ef3a000 000F4 (v04 INTEL TIANO 00000000 MSFT
> 01000013)
> >> ACPI: DSDT 000000007e6c3000 7F493E (v02 SGI2 UVX 00000002 MSFT
> 01000013)
> >> ACPI: FACS 000000007d147000 00040
> >> ACPI: UEFI 000000007ef3c000 0012A (v01 INTEL RstScuO 00000000
> 00000000)
> >> ACPI: UEFI 000000007ef3b000 0005C (v01 INTEL RstScuV 00000000
> 00000000)
> >> ACPI: HPET 000000007ef39000 00038 (v01 INTEL TIANO 00000001 MSFT
> 01000013)
> >> ACPI: SSDT 000000007ef33000 05352 (v02 INTEL ROSECITY 00000003 INTL
> 20070508)
> >> ACPI: SLIT 000000007ef10000 1002C (v01 SGI2 UVX 00000002 MSFT
> 00000001)
> >> ACPI: APIC 000000007eeee000 10070 (v03 SGI2 UVX 00000002 MSFT
> 00000001)
> >> ACPI: SRAT 000000007eeb8000 1A830 (v03 SGI2 UVX 00000002 MSFT
> 00000001)
> >> ACPI: MCFG 000000007d6d4000 0105C (v01 SGI2 UVX 00000002 MSFT
> 00000001)
> >> ACPI: SPCR 000000007e6c2000 00050 (v01 00000000
> 00000000)
> >> ACPI: DMAR 000000007d6d3000 0013C (v01 INTEL TIANO 00000001 MSFT
> 01000013)
> >>
> >
> > so the DSDT is 7F493E, and total is more than 8M.
> >
> > that will need BRK to be extended 16M?
> >
>
> Then how about use early_ioremap(), and don't do it that early in
> head_32 and head64 ?
>
> Thanks.

2013-08-23 22:05:18

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

On Fri, Aug 23, 2013 at 2:52 PM, Moore, Robert <[email protected]> wrote:
> While we're at it:
>
> Can someone send me the acpidump for this machine? We very much would like to test all of ACPICA with such a large DSDT.

That is Russ.

2013-08-23 22:08:29

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

On Fri, Aug 23, 2013 at 2:50 PM, chen tang <[email protected]> wrote:
>>
>> so the DSDT is 7F493E, and total is more than 8M.
>>
>> that will need BRK to be extended 16M?
>>
>
> Then how about use early_ioremap(), and don't do it that early in
> head_32 and head64 ?

why could early_ioremap() help?

when to use early_ioremap()? what for?

Yinghai

2013-08-23 22:40:38

by chen tang

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

Hi Yinghai,

2013/8/24 Yinghai Lu <[email protected]>:
> On Fri, Aug 23, 2013 at 2:50 PM, chen tang <[email protected]> wrote:
>>>
>>> so the DSDT is 7F493E, and total is more than 8M.
>>>
>>> that will need BRK to be extended 16M?
>>>
>>
>> Then how about use early_ioremap(), and don't do it that early in
>> head_32 and head64 ?
>
> why could early_ioremap() help?
>
> when to use early_ioremap()? what for?
>

In my understanding, acpica framework needs users to copy the override tables
somewhere in the memory. And acpica will get these user specified tables when
installing firmware tables. This is the acpica logic, which cannot be
changed, I think.

So we need to allocate memory. That is why you suggested to use BRK, right ?
And the size seems to be a problem.

So I suggest to use early_ioremap().

1. After paging is enabled, before direct mapping page tables are
setup, we map the
initrd with early_ioremap(). And we are able to access it with va,
even on 32bit.
Then we can find all tables.
2. We still use memblock to allocate memory. Maybe it will be
hotpluggable memory,
but this memory can be freed when all the acpi tables are parsed, right ?

So I want to try early_ioremap(). All these should be done in setup_arch().

Thanks.

2013-08-23 23:04:06

by Yinghai Lu

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

> So we need to allocate memory. That is why you suggested to use BRK, right ?
> And the size seems to be a problem.
>
> So I suggest to use early_ioremap().
>
> 1. After paging is enabled, before direct mapping page tables are
> setup, we map the
> initrd with early_ioremap(). And we are able to access it with va,
> even on 32bit.
> Then we can find all tables.
> 2. We still use memblock to allocate memory. Maybe it will be
> hotpluggable memory,
> but this memory can be freed when all the acpi tables are parsed, right ?
>
> So I want to try early_ioremap(). All these should be done in setup_arch().

no.
cpio search need to take whole range virtual address,
and early_ioremap has size limitation.
you will have to update cpio search to take mapping function.
could be too messy.

Yinghai

2013-08-24 02:41:09

by Russ Anderson

[permalink] [raw]
Subject: Re: [PATCH 0/8] x86, acpi: Move acpi_initrd_override() earlier.

On Fri, Aug 23, 2013 at 01:08:56PM -0700, Yinghai Lu wrote:
> On Fri, Aug 23, 2013 at 11:25 AM, H. Peter Anvin <[email protected]> wrote:
> >
> > BRK makes sense as long as you can set a sane O(1) size limit.
> >
> >>
> >>put the acpi override table in BRK, we still need ok from HPA.
> >>I have impression that he did not like it, so want to confirm from him.
>
> on 8 sockets system:
> -rw-r--r-- 1 root root 3532 Aug 22 10:26 APIC.dat
> -rw-r--r-- 1 root root 48 Aug 22 10:26 BDAT.dat
> -rw-r--r-- 1 root root 824 Aug 22 10:26 DMAR.dat
> -rw-r--r-- 1 root root 83509 Aug 22 10:26 DSDT.dat
> -rw-r--r-- 1 root root 244 Aug 22 10:26 FACP.dat
> -rw-r--r-- 1 root root 64 Aug 22 10:26 FACS.dat
> -rw-r--r-- 1 root root 68 Aug 22 10:26 FPDT.dat
> -rw-r--r-- 1 root root 56 Aug 22 10:26 HPET.dat
> -rw-r--r-- 1 root root 304 Aug 22 10:26 MCEJ.dat
> -rw-r--r-- 1 root root 60 Aug 22 10:26 MCFG.dat
> -rw-r--r-- 1 root root 6712 Aug 22 10:26 MPST.dat
> -rw-r--r-- 1 root root 232 Aug 22 10:26 MSCT.dat
> -rw-r--r-- 1 root root 172 Aug 22 10:26 PCCT.dat
> -rw-r--r-- 1 root root 96 Aug 22 10:26 PMCT.dat
> -rw-r--r-- 1 root root 48 Aug 22 10:26 RASF.dat
> -rw-r--r-- 1 root root 108 Aug 22 10:26 SLIT.dat
> -rw-r--r-- 1 root root 80 Aug 22 10:26 SPCR.dat
> -rw-r--r-- 1 root root 65 Aug 22 10:26 SPMI.dat
> -rw-r--r-- 1 root root 6448 Aug 22 10:26 SRAT.dat
> -rw-r--r-- 1 root root 100 Aug 22 10:26 SSDT1.dat
> -rw-r--r-- 1 root root 283527 Aug 22 10:26 SSDT2.dat
> -rw-r--r-- 1 root root 66 Aug 22 10:26 UEFI.dat
> -rw-r--r-- 1 root root 64 Aug 22 10:26 WDDT.dat
>
> assume for 32sockets will have four times bigger with DSDT and SSDT.
> (with more pci and cpus)
>
> So we can not have O(1) the size.
>
> Russ, What is ACPI table size on your big machine?

This is from a 255 socket, 4080 cpu, 15TB system.

-------------------------------------------------------
-rw-r--r-- 1 root root 65392 Aug 23 21:23 apic.dat
-rw-r--r-- 1 root root 316 Aug 23 21:23 dmar.dat
-rw-r--r-- 1 root root 8309249 Aug 23 21:23 dsdt.dat
-rw-r--r-- 1 root root 244 Aug 23 21:23 facp.dat
-rw-r--r-- 1 root root 64 Aug 23 21:23 facs.dat
-rw-r--r-- 1 root root 56 Aug 23 21:23 hpet.dat
-rw-r--r-- 1 root root 4172 Aug 23 21:23 mcfg.dat
-rw-r--r-- 1 root root 36 Aug 23 21:23 rsdp.dat
-rw-r--r-- 1 root root 80 Aug 23 21:23 rsdt.dat
-rw-r--r-- 1 root root 65069 Aug 23 21:23 slit.dat
-rw-r--r-- 1 root root 80 Aug 23 21:23 spcr.dat
-rw-r--r-- 1 root root 108168 Aug 23 21:23 srat.dat
-rw-r--r-- 1 root root 21330 Aug 23 21:23 ssdt.dat
-rw-r--r-- 1 root root 92 Aug 23 21:23 uefi1.dat
-rw-r--r-- 1 root root 298 Aug 23 21:23 uefi.dat
-rw-r--r-- 1 root root 124 Aug 23 21:23 xsdt.dat
-------------------------------------------------------

--
Russ Anderson, OS RAS/Partitioning Project Lead
SGI - Silicon Graphics Inc [email protected]