2008-01-19 16:09:20

by Ian Campbell

Subject: [PATCH] x86/voyager: Switch voyager memory detection to early_ioremap.

Extracted from an earlier patch by Eric Biederman.

Signed-off-by: Ian Campbell <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
CC: James Bottomley <[email protected]>
Cc: Eric W. Biederman <[email protected]>
---
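
For anyone unfamiliar with the interface: early_ioremap() hands back a
temporary, fixmap-backed mapping of a physical range before paging_init()
has run, and early_iounmap() releases it again. A minimal sketch of the
idiom this patch adopts (hypothetical helper, error handling elided):

	static u32 __init read_bios_word(unsigned long phys)
	{
		u32 *p, val;

		p = early_ioremap(phys, sizeof(*p));	/* temporary mapping */
		val = *p;
		early_iounmap(p, sizeof(*p));		/* balanced unmap */
		return val;
	}
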
arch/x86/mach-voyager/voyager_basic.c | 19 +++++++------------
1 files changed, 7 insertions(+), 12 deletions(-)

diff --git a/arch/x86/mach-voyager/voyager_basic.c b/arch/x86/mach-voyager/voyager_basic.c
index 6a949e4..ed41fd8 100644
--- a/arch/x86/mach-voyager/voyager_basic.c
+++ b/arch/x86/mach-voyager/voyager_basic.c
@@ -110,8 +110,9 @@ typedef struct ClickMap {
} Entry[CLICK_ENTRIES];
} ClickMap_t;

-/* This routine is pretty much an awful hack to read the bios clickmap by
- * mapping it into page 0. There are usually three regions in the map:
+/*
+ * This routine reads the bios clickmap. There are usually three
+ * regions in the map:
* Base Memory
* Extended Memory
* zero length marker for end of map
@@ -125,7 +126,6 @@ int __init voyager_memory_detect(int region, __u32 * start, __u32 * length)
__u8 cmos[4];
ClickMap_t *map;
unsigned long map_addr;
- unsigned long old;

if (region >= CLICK_ENTRIES) {
printk("Voyager: Illegal ClickMap region %d\n", region);
@@ -138,12 +138,8 @@ int __init voyager_memory_detect(int region, __u32 * start, __u32 * length)

map_addr = *(unsigned long *)cmos;

- /* steal page 0 for this */
- old = pg0[0];
- pg0[0] = ((map_addr & PAGE_MASK) | _PAGE_RW | _PAGE_PRESENT);
- local_flush_tlb();
- /* now clear everything out but page 0 */
- map = (ClickMap_t *) (map_addr & (~PAGE_MASK));
+ /* Setup a temporary mapping for the clickmap */
+ map = early_ioremap(map_addr, sizeof(*map));

/* zero length is the end of the clickmap */
if (map->Entry[region].Length != 0) {
@@ -152,9 +148,8 @@ int __init voyager_memory_detect(int region, __u32 * start, __u32 * length)
retval = 1;
}

- /* replace the mapping */
- pg0[0] = old;
- local_flush_tlb();
+ /* undo the mapping */
+ early_iounmap(map, sizeof(*map));
return retval;
}

--
1.5.3.8


2008-01-19 16:09:35

by Ian Campbell

Subject: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Specifically the boot time page tables in a CONFIG_X86_PAE=y enabled
kernel are in PAE format.

early_ioremap is updated to use the standard page table accessors.

Derived from an earlier patch by Eric Biederman.

Signed-off-by: Ian Campbell <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Eric W. Biederman <[email protected]>
---
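
As a reminder of what "native format" means here, these are the two
entry layouts the boot time tables must now match -- an illustration
only, not code from this patch:

	#ifdef CONFIG_X86_PAE
	typedef u64 boot_pgentry_t;		/* hypothetical name */
	#define BOOT_PTRS_PER_TABLE	512	/* 4096 / sizeof(u64) */
	#define BOOT_TOP_LEVEL_PTRS	4	/* the PAE PDPT */
	#else
	typedef u32 boot_pgentry_t;		/* hypothetical name */
	#define BOOT_PTRS_PER_TABLE	1024	/* 4096 / sizeof(u32) */
	#define BOOT_TOP_LEVEL_PTRS	1024	/* single level page directory */
	#endif
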
arch/x86/kernel/head_32.S | 116 +++++++++++++------------------------
arch/x86/kernel/setup_32.c | 4 +
arch/x86/mm/Makefile_32 | 2 +-
arch/x86/mm/early_pgtable_32.c | 125 ++++++++++++++++++++++++++++++++++++++++
arch/x86/mm/init_32.c | 45 --------------
arch/x86/mm/ioremap_32.c | 53 ++++++++++-------
include/asm-x86/page_32.h | 1 -
include/asm-x86/pgtable_32.h | 4 -
8 files changed, 201 insertions(+), 149 deletions(-)
create mode 100644 arch/x86/mm/early_pgtable_32.c

diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index f409fe2..2090aa4 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -33,44 +33,6 @@
#define X86_VENDOR_ID new_cpu_data+CPUINFO_x86_vendor_id

/*
- * This is how much memory *in addition to the memory covered up to
- * and including _end* we need mapped initially.
- * We need:
- * - one bit for each possible page, but only in low memory, which means
- * 2^32/4096/8 = 128K worst case (4G/4G split.)
- * - enough space to map all low memory, which means
- * (2^32/4096) / 1024 pages (worst case, non PAE)
- * (2^32/4096) / 512 + 4 pages (worst case for PAE)
- * - a few pages for allocator use before the kernel pagetable has
- * been set up
- *
- * Modulo rounding, each megabyte assigned here requires a kilobyte of
- * memory, which is currently unreclaimed.
- *
- * This should be a multiple of a page.
- */
-LOW_PAGES = 1<<(32-PAGE_SHIFT_asm)
-
-/*
- * To preserve the DMA pool in PAGEALLOC kernels, we'll allocate
- * pagetables from above the 16MB DMA limit, so we'll have to set
- * up pagetables 16MB more (worst-case):
- */
-#ifdef CONFIG_DEBUG_PAGEALLOC
-LOW_PAGES = LOW_PAGES + 0x1000000
-#endif
-
-#if PTRS_PER_PMD > 1
-PAGE_TABLE_SIZE = (LOW_PAGES / PTRS_PER_PMD) + PTRS_PER_PGD
-#else
-PAGE_TABLE_SIZE = (LOW_PAGES / PTRS_PER_PGD)
-#endif
-BOOTBITMAP_SIZE = LOW_PAGES / 8
-ALLOCATOR_SLOP = 4
-
-INIT_MAP_BEYOND_END = BOOTBITMAP_SIZE + (PAGE_TABLE_SIZE + ALLOCATOR_SLOP)*PAGE_SIZE_asm
-
-/*
* 32-bit kernel entrypoint; only used by the boot CPU. On entry,
* %esi points to the real-mode code as a 32-bit pointer.
* CS and DS must be 4 GB flat segments, but we don't depend on
@@ -160,47 +122,52 @@ num_subarch_entries = (. - subarch_entries) / 4
.previous
#endif /* CONFIG_PARAVIRT */

-/*
- * Initialize page tables. This creates a PDE and a set of page
- * tables, which are located immediately beyond _end. The variable
- * init_pg_tables_end is set up to point to the first "safe" location.
- * Mappings are created both at virtual address 0 (identity mapping)
- * and PAGE_OFFSET for up to _end+sizeof(page tables)+INIT_MAP_BEYOND_END.
- *
- * Warning: don't use %esi or the stack in this code. However, %esp
- * can be used as a GPR if you really need it...
- */
-page_pde_offset = (__PAGE_OFFSET >> 20);
+#define cr4_bits mmu_cr4_features-__PAGE_OFFSET

default_entry:
- movl $(pg0 - __PAGE_OFFSET), %edi
- movl $(swapper_pg_dir - __PAGE_OFFSET), %edx
- movl $0x007, %eax /* 0x007 = PRESENT+RW+USER */
-10:
- leal 0x007(%edi),%ecx /* Create PDE entry */
- movl %ecx,(%edx) /* Store identity PDE entry */
- movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */
- addl $4,%edx
- movl $1024, %ecx
-11:
- stosl
- addl $0x1000,%eax
- loop 11b
- /* End condition: we must map up to and including INIT_MAP_BEYOND_END */
- /* bytes beyond the end of our own page tables; the +0x007 is the attribute bits */
- leal (INIT_MAP_BEYOND_END+0x007)(%edi),%ebp
- cmpl %ebp,%eax
- jb 10b
- movl %edi,(init_pg_tables_end - __PAGE_OFFSET)
-
- /* Do an early initialization of the fixmap area */
- movl $(swapper_pg_dir - __PAGE_OFFSET), %edx
- movl $(swapper_pg_pmd - __PAGE_OFFSET), %eax
- addl $0x67, %eax /* 0x67 == _PAGE_TABLE */
- movl %eax, 4092(%edx)
+ /* Setup the stack */
+ lss stack_start - __PAGE_OFFSET, %esp
+ subl $__PAGE_OFFSET, %esp
+
+ /* Initialize the boot page tables */
+ call early_pgtable_init
+
+ movl cr4_bits,%edx
+ andl %edx,%edx
+ jz 1f
+ movl %cr4,%eax # Turn on paging options (PSE,PAE,..)
+ orl %edx,%eax
+ movl %eax,%cr4
+1:
+#ifdef CONFIG_X86_PAE
+ btl $5, %eax
+ jnc err_no_pae
+#endif

xorl %ebx,%ebx /* This is the boot CPU (BSP) */
jmp 3f
+
+#ifdef CONFIG_X86_PAE
+err_no_pae:
+ /* It is probably too early but we might as well try... */
+#ifdef CONFIG_PRINTK
+ pusha
+ pushl %eax
+ pushl $err_no_pae_msg - __PAGE_OFFSET
+#ifdef CONFIG_EARLY_PRINTK
+ call early_printk - __PAGE_OFFSET
+#else
+ call printk - __PAGE_OFFSET
+#endif
+#endif
+ jmp hlt_loop
+
+err_no_pae_msg:
+ .ascii "cannot execute a PAE-enabled kernel on a PAE-less CPU!"
+ .ascii " (CR4 %lx)\n"
+ .byte 0
+#endif
+
/*
* Non-boot CPU entry point; entered from trampoline.S
* We can't lgdt here, because lgdt itself uses a data segment, but
@@ -237,7 +204,6 @@ ENTRY(startup_32_smp)
* NOTE! We have to correct for the fact that we're
* not yet offset PAGE_OFFSET..
*/
-#define cr4_bits mmu_cr4_features-__PAGE_OFFSET
movl cr4_bits,%edx
andl %edx,%edx
jz 6f
diff --git a/arch/x86/kernel/setup_32.c b/arch/x86/kernel/setup_32.c
index c6f25cb..196c23b 100644
--- a/arch/x86/kernel/setup_32.c
+++ b/arch/x86/kernel/setup_32.c
@@ -153,7 +153,11 @@ struct cpuinfo_x86 new_cpu_data __cpuinitdata = { 0, 0, 0, 0, -1, 1, 0, 0, -1 };
struct cpuinfo_x86 boot_cpu_data __read_mostly = { 0, 0, 0, 0, -1, 1, 0, 0, -1 };
EXPORT_SYMBOL(boot_cpu_data);

+#ifndef CONFIG_X86_PAE
unsigned long mmu_cr4_features;
+#else
+unsigned long mmu_cr4_features = X86_CR4_PAE;
+#endif

/* for MCA, but anyone else can use it if they want */
unsigned int machine_id;
diff --git a/arch/x86/mm/Makefile_32 b/arch/x86/mm/Makefile_32
index 2f69025..1b8c09f 100644
--- a/arch/x86/mm/Makefile_32
+++ b/arch/x86/mm/Makefile_32
@@ -2,7 +2,7 @@
# Makefile for the linux i386-specific parts of the memory manager.
#

-obj-y := init_32.o pgtable_32.o fault_32.o ioremap_32.o extable.o pageattr_32.o mmap.o pat.o ioremap.o
+obj-y := init_32.o pgtable_32.o fault_32.o ioremap_32.o extable.o pageattr_32.o mmap.o pat.o ioremap.o early_pgtable_32.o

obj-$(CONFIG_CPA_DEBUG) += pageattr-test.o
obj-$(CONFIG_NUMA) += discontig_32.o
diff --git a/arch/x86/mm/early_pgtable_32.c b/arch/x86/mm/early_pgtable_32.c
new file mode 100644
index 0000000..dc5d648
--- /dev/null
+++ b/arch/x86/mm/early_pgtable_32.c
@@ -0,0 +1,125 @@
+/*
+ * Construct boot time page tables.
+ */
+
+/*
+ * Since a paravirt guest will never come down this path we want
+ * native style page table accessors here.
+ */
+#undef CONFIG_PARAVIRT
+
+#include <linux/pagemap.h>
+
+#include <asm/setup.h>
+
+/*
+ * This is how much memory *in addition to the memory covered up to
+ * and including _end* we need mapped initially. We need one bit for
+ * each possible page, but only in low memory, which means
+ * 2^32/4096/8 = 128K worst case (4G/4G split.)
+ *
+ * Modulo rounding, each megabyte assigned here requires a kilobyte of
+ * memory, which is currently unreclaimed.
+ *
+ * This should be a multiple of a page.
+ */
+#define INIT_MAP_BEYOND_END (128*1024)
+
+/*
+ * Initialize page tables. This creates a PDE and a set of page
+ * tables, which are located immediately beyond _end. The variable
+ * init_pg_tables_end is set up to point to the first "safe" location.
+ * Mappings are created both at virtual address 0 (identity mapping)
+ * and PAGE_OFFSET for up to _end+sizeof(page tables)+INIT_MAP_BEYOND_END.
+ *
+ * WARNING: This code runs at its physical address, not its virtual address,
+ * with all of physical memory identity mapped, and nothing else mapped.
+ * This means access to global variables must be done very carefully.
+ */
+#define __pavar(X) (*(__typeof__(X) *)__pa_symbol(&(X)))
+
+static inline __init pud_t *early_pud_offset(pgd_t *pgd, unsigned long vaddr)
+{
+ return (pud_t *)(pgd + pgd_index(vaddr));
+}
+
+static inline __init pmd_t *early_pmd_offset(pud_t *pud, unsigned long vaddr)
+{
+#ifndef CONFIG_X86_PAE
+ return (pmd_t *)pud;
+#else
+ return ((pmd_t *)(u32)(pud_val(*pud) & PAGE_MASK))
+ + pmd_index(vaddr);
+#endif
+}
+
+static inline __init pte_t *early_pte_offset(pmd_t *pmd, unsigned long vaddr)
+{
+ return ((pte_t *)(u32)(pmd_val(*pmd) & PAGE_MASK))
+ + pte_index(vaddr);
+}
+
+static inline __init pmd_t *
+early_pmd_alloc(pgd_t *pgd_base, unsigned long vaddr, unsigned long *end)
+{
+ pud_t *pud = early_pud_offset(pgd_base, vaddr);
+
+#ifdef CONFIG_X86_PAE
+ if (!(pud_val(*pud) & _PAGE_PRESENT)) {
+ unsigned long phys = *end;
+ memset((void *)phys, 0, PAGE_SIZE);
+ set_pud(pud, __pud(phys | _PAGE_PRESENT));
+ *end += PAGE_SIZE;
+ }
+#endif
+ return early_pmd_offset(pud, vaddr);
+}
+
+static __init pte_t *
+early_pte_alloc(pgd_t *pgd_base, unsigned long vaddr, unsigned long *end)
+{
+ pmd_t *pmd;
+
+ pmd = early_pmd_alloc(pgd_base, vaddr, end);
+ if (!(pmd_val(*pmd) & _PAGE_PRESENT)) {
+ unsigned long phys = *end;
+ memset((void *)phys, 0, PAGE_SIZE);
+ set_pmd(pmd, __pmd(phys | _PAGE_TABLE));
+ *end += PAGE_SIZE;
+ }
+ return early_pte_offset(pmd, vaddr);
+}
+
+static __init void early_set_pte_phys(pgd_t *pgd_base, unsigned long vaddr,
+ unsigned long phys, unsigned long *end)
+{
+ pte_t *pte;
+ pte = early_pte_alloc(pgd_base, vaddr, end);
+ set_pte(pte, __pte(phys | _PAGE_KERNEL_EXEC));
+}
+
+void __init early_pgtable_init(void)
+{
+ unsigned long addr, end;
+ pgd_t *pgd_base;
+
+ pgd_base = __pavar(swapper_pg_dir);
+ end = __pa_symbol(pg0);
+
+ /* Initialize the directory page */
+ memset(pgd_base, 0, PAGE_SIZE);
+
+ /* Set up the fixmap page table */
+ early_pte_alloc(pgd_base, __pavar(__FIXADDR_TOP), &end);
+
+ /* Set up the initial kernel mapping */
+ for (addr = 0; addr < (end + INIT_MAP_BEYOND_END); addr += PAGE_SIZE)
+ early_set_pte_phys(pgd_base, addr + PAGE_OFFSET, addr, &end);
+
+
+ /* Set up the low identity mappings */
+ clone_pgd_range(pgd_base, pgd_base + USER_PTRS_PER_PGD,
+ min_t(unsigned long, KERNEL_PGD_PTRS, USER_PGD_PTRS));
+
+ __pavar(init_pg_tables_end) = end;
+}
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index cbba769..2f94a3a 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -353,44 +353,11 @@ extern void __init remap_numa_kva(void);

void __init native_pagetable_setup_start(pgd_t *base)
{
-#ifdef CONFIG_X86_PAE
- int i;
-
- /*
- * Init entries of the first-level page table to the
- * zero page, if they haven't already been set up.
- *
- * In a normal native boot, we'll be running on a
- * pagetable rooted in swapper_pg_dir, but not in PAE
- * mode, so this will end up clobbering the mappings
- * for the lower 24Mbytes of the address space,
- * without affecting the kernel address space.
- */
- for (i = 0; i < USER_PTRS_PER_PGD; i++)
- set_pgd(&base[i],
- __pgd(__pa(empty_zero_page) | _PAGE_PRESENT));
-
- /* Make sure kernel address space is empty so that a pagetable
- will be allocated for it. */
- memset(&base[USER_PTRS_PER_PGD], 0,
- KERNEL_PGD_PTRS * sizeof(pgd_t));
-#else
paravirt_alloc_pd(__pa(swapper_pg_dir) >> PAGE_SHIFT);
-#endif
}

void __init native_pagetable_setup_done(pgd_t *base)
{
-#ifdef CONFIG_X86_PAE
- /*
- * Add low memory identity-mappings - SMP needs it when
- * starting up on an AP from real-mode. In the non-PAE
- * case we already have these mappings through head.S.
- * All user-space mappings are explicitly cleared after
- * SMP startup.
- */
- set_pgd(&base[0], base[USER_PTRS_PER_PGD]);
-#endif
}

/*
@@ -559,14 +526,6 @@ void __init paging_init(void)

load_cr3(swapper_pg_dir);

-#ifdef CONFIG_X86_PAE
- /*
- * We will bail out later - printk doesn't work right now so
- * the user would just see a hanging kernel.
- */
- if (cpu_has_pae)
- set_in_cr4(X86_CR4_PAE);
-#endif
__flush_tlb_all();

kmap_init();
@@ -696,10 +655,6 @@ void __init mem_init(void)
BUG_ON((unsigned long)high_memory > VMALLOC_START);
#endif /* double-sanity-check paranoia */

-#ifdef CONFIG_X86_PAE
- if (!cpu_has_pae)
- panic("cannot execute a PAE-enabled kernel on a PAE-less CPU!");
-#endif
if (boot_cpu_data.wp_works_ok < 0)
test_wp_bit();

diff --git a/arch/x86/mm/ioremap_32.c b/arch/x86/mm/ioremap_32.c
index 05a24cd..73a36cd 100644
--- a/arch/x86/mm/ioremap_32.c
+++ b/arch/x86/mm/ioremap_32.c
@@ -226,40 +226,45 @@ static int __init early_ioremap_debug_setup(char *str)
__setup("early_ioremap_debug", early_ioremap_debug_setup);

static __initdata int after_paging_init;
-static __initdata unsigned long bm_pte[1024]
+static __initdata pte_t bm_pte[PAGE_SIZE/sizeof(pte_t)]
__attribute__((aligned(PAGE_SIZE)));

-static inline unsigned long * __init early_ioremap_pgd(unsigned long addr)
+static inline pmd_t * __init early_ioremap_pmd(unsigned long addr)
{
- return (unsigned long *)swapper_pg_dir + ((addr >> 22) & 1023);
+ pgd_t *pgd = &swapper_pg_dir[pgd_index(addr)];
+ pud_t *pud = pud_offset(pgd, addr);
+ pmd_t *pmd = pmd_offset(pud, addr);
+
+ return pmd;
}

-static inline unsigned long * __init early_ioremap_pte(unsigned long addr)
+static inline pte_t * __init early_ioremap_pte(unsigned long addr)
{
- return bm_pte + ((addr >> PAGE_SHIFT) & 1023);
+ return &bm_pte[pte_index(addr)];
}

void __init early_ioremap_init(void)
{
- unsigned long *pgd;
+ pmd_t *pmd;

if (early_ioremap_debug)
printk("early_ioremap_init()\n");

- pgd = early_ioremap_pgd(fix_to_virt(FIX_BTMAP_BEGIN));
- *pgd = __pa(bm_pte) | _PAGE_TABLE;
+ pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
memset(bm_pte, 0, sizeof(bm_pte));
+ set_pmd(pmd, __pmd(__pa(bm_pte) | _PAGE_TABLE));
+
/*
- * The boot-ioremap range spans multiple pgds, for which
+ * The boot-ioremap range spans multiple pmds, for which
* we are not prepared:
*/
- if (pgd != early_ioremap_pgd(fix_to_virt(FIX_BTMAP_END))) {
+ if (pmd != early_ioremap_pmd(fix_to_virt(FIX_BTMAP_END))) {
WARN_ON(1);
- printk("pgd %p != %p\n",
- pgd, early_ioremap_pgd(fix_to_virt(FIX_BTMAP_END)));
- printk("fix_to_virt(FIX_BTMAP_BEGIN): %08lx\n",
+ printk(KERN_WARNING "pmd %p != %p\n",
+ pmd, early_ioremap_pmd(fix_to_virt(FIX_BTMAP_END)));
+ printk(KERN_WARNING "fix_to_virt(FIX_BTMAP_BEGIN): %08lx\n",
fix_to_virt(FIX_BTMAP_BEGIN));
- printk("fix_to_virt(FIX_BTMAP_END): %08lx\n",
+ printk(KERN_WARNING "fix_to_virt(FIX_BTMAP_END): %08lx\n",
fix_to_virt(FIX_BTMAP_END));

printk("FIX_BTMAP_END: %d\n", FIX_BTMAP_END);
@@ -269,27 +274,28 @@ void __init early_ioremap_init(void)

void __init early_ioremap_clear(void)
{
- unsigned long *pgd;
+ pmd_t *pmd;

if (early_ioremap_debug)
printk("early_ioremap_clear()\n");

- pgd = early_ioremap_pgd(fix_to_virt(FIX_BTMAP_BEGIN));
- *pgd = 0;
+ pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
+ pmd_clear(pmd);
__flush_tlb_all();
}

void __init early_ioremap_reset(void)
{
enum fixed_addresses idx;
- unsigned long *pte, phys, addr;
+ unsigned long addr, phys;
+ pte_t *pte;

after_paging_init = 1;
for (idx = FIX_BTMAP_BEGIN; idx <= FIX_BTMAP_END; idx--) {
addr = fix_to_virt(idx);
pte = early_ioremap_pte(addr);
- if (!*pte & _PAGE_PRESENT) {
- phys = *pte & PAGE_MASK;
+ if (!(pte_val(*pte) & _PAGE_PRESENT)) {
+ phys = pte_val(*pte) & PAGE_MASK;
set_fixmap(idx, phys);
}
}
@@ -298,7 +304,8 @@ void __init early_ioremap_reset(void)
static void __init __early_set_fixmap(enum fixed_addresses idx,
unsigned long phys, pgprot_t flags)
{
- unsigned long *pte, addr = __fix_to_virt(idx);
+ unsigned long addr = __fix_to_virt(idx);
+ pte_t *pte;

if (idx >= __end_of_fixed_addresses) {
BUG();
@@ -306,9 +313,9 @@ static void __init __early_set_fixmap(enum fixed_addresses idx,
}
pte = early_ioremap_pte(addr);
if (pgprot_val(flags))
- *pte = (phys & PAGE_MASK) | pgprot_val(flags);
+ set_pte(pte, pfn_pte(phys >> PAGE_SHIFT, flags));
else
- *pte = 0;
+ pte_clear(NULL, addr, pte);
__flush_tlb_one(addr);
}

diff --git a/include/asm-x86/page_32.h b/include/asm-x86/page_32.h
index 11c4b39..8fc0473 100644
--- a/include/asm-x86/page_32.h
+++ b/include/asm-x86/page_32.h
@@ -48,7 +48,6 @@ typedef unsigned long pgprotval_t;
typedef unsigned long phys_addr_t;

typedef union { pteval_t pte, pte_low; } pte_t;
-typedef pte_t boot_pte_t;

#endif /* __ASSEMBLY__ */
#endif /* CONFIG_X86_PAE */
diff --git a/include/asm-x86/pgtable_32.h b/include/asm-x86/pgtable_32.h
index 11c8b73..c07389b 100644
--- a/include/asm-x86/pgtable_32.h
+++ b/include/asm-x86/pgtable_32.h
@@ -55,10 +55,6 @@ int text_address(unsigned long);
#define USER_PGD_PTRS (PAGE_OFFSET >> PGDIR_SHIFT)
#define KERNEL_PGD_PTRS (PTRS_PER_PGD-USER_PGD_PTRS)

-#define TWOLEVEL_PGDIR_SHIFT 22
-#define BOOT_USER_PGD_PTRS (__PAGE_OFFSET >> TWOLEVEL_PGDIR_SHIFT)
-#define BOOT_KERNEL_PGD_PTRS (1024-BOOT_USER_PGD_PTRS)
-
/* Just any arbitrary offset to the start of the vmalloc VM area: the
* current 8MB value just means that there will be a 8MB "hole" after the
* physical memory until the kernel virtual memory starts. That means that
--
1.5.3.8

2008-01-19 23:23:37

by Andi Kleen

Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Ian Campbell <[email protected]> writes:
> +1:
> +#ifdef CONFIG_X86_PAE
> + btl $5, %eax
> + jnc err_no_pae
> +#endif
>
> xorl %ebx,%ebx /* This is the boot CPU (BSP) */
> jmp 3f
> +
> +#ifdef CONFIG_X86_PAE
> +err_no_pae:
> + /* It is probably too early but we might as well try... */

Without a low identity mapping early_printk will not work and printk
definitely not.

> +#ifdef CONFIG_PRINTK

You should do the test in the 16 bit boot code. In fact it should
already do it by testing the CPUID REQUIRED_MASK.
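
Roughly what that test amounts to -- a sketch from memory, not quoted
from arch/x86/boot/cpucheck.c, and the array names are approximate:

	int i;

	for (i = 0; i < NCAPINTS; i++) {
		/* required-but-missing feature words; PAE sits in the
		 * required mask for CONFIG_X86_PAE kernels */
		if (required_mask[i] & ~cpu_flags[i])
			return -1;
	}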

The only way this could be entered is if someone skips the 16bit boot code
by using kexec, but has the wrong flags. I'm not sure how to handle
it there.

> +/*
> + * Since a paravirt guest will never come down this path we want
> + * native style page table accessors here.
> + */
> +#undef CONFIG_PARAVIRT

Seems quite fragile. I'm sure that would hurt later.


> +
> +static inline __init pud_t *early_pud_offset(pgd_t *pgd, unsigned long vaddr)
> +{
> + return (pud_t *)(pgd + pgd_index(vaddr));
> +}
> +
> +static inline __init pmd_t *early_pmd_offset(pud_t *pud, unsigned long vaddr)
> +{
> +#ifndef CONFIG_X86_PAE
> + return (pmd_t *)pud;
> +#else
> + return ((pmd_t *)(u32)(pud_val(*pud) & PAGE_MASK))
> + + pmd_index(vaddr);
> +#endif
> +}
> +
> +static inline __init pte_t *early_pte_offset(pmd_t *pmd, unsigned long vaddr)
> +{
> + return ((pte_t *)(u32)(pmd_val(*pmd) & PAGE_MASK))

That will break if the kernel is > 4GB won't it? Also same for pmd.

Also not handling NX is dubious, although you can probably get away with it there.

> + + pte_index(vaddr);
> +}
> +
> +static inline __init pmd_t *
> +early_pmd_alloc(pgd_t *pgd_base, unsigned long vaddr, unsigned long *end)
> +{
> + pud_t *pud = early_pud_offset(pgd_base, vaddr);
> +
> +#ifdef CONFIG_X86_PAE
> + if (!(pud_val(*pud) & _PAGE_PRESENT)) {


Why not set it in the pgd which is identical? Also the proper test is !pgd_none()



> +{
> + pmd_t *pmd;
> +
> + pmd = early_pmd_alloc(pgd_base, vaddr, end);
> + if (!(pmd_val(*pmd) & _PAGE_PRESENT)) {

!pmd_none()

> + unsigned long phys = *end;
> + memset((void *)phys, 0, PAGE_SIZE);
> + set_pmd(pmd, __pmd(phys | _PAGE_TABLE));
> + *end += PAGE_SIZE;
> + }
> + return early_pte_offset(pmd, vaddr);
> +}
> +
> +static __init void early_set_pte_phys(pgd_t *pgd_base, unsigned long vaddr,
> + unsigned long phys, unsigned long *end)
> +{
> + pte_t *pte;
> + pte = early_pte_alloc(pgd_base, vaddr, end);
> + set_pte(pte, __pte(phys | _PAGE_KERNEL_EXEC));
> +}
> +
> +void __init early_pgtable_init(void)
> +{
> + unsigned long addr, end;
> + pgd_t *pgd_base;
> +
> + pgd_base = __pavar(swapper_pg_dir);
> + end = __pa_symbol(pg0);

Are you sure there will be enough memory here? You might need to use
an e820 allocator similar to x86-64.

The typical problem is running into some other memory used by
someone else.

> {
> enum fixed_addresses idx;
> - unsigned long *pte, phys, addr;
> + unsigned long addr, phys;
> + pte_t *pte;
>
> after_paging_init = 1;
> for (idx = FIX_BTMAP_BEGIN; idx <= FIX_BTMAP_END; idx--) {
> addr = fix_to_virt(idx);
> pte = early_ioremap_pte(addr);
> - if (!*pte & _PAGE_PRESENT) {
> - phys = *pte & PAGE_MASK;
> + if (!(pte_val(*pte) & _PAGE_PRESENT)) {

pte_present(). Ok the old code was wrong too, but no need to do that again.

> set_fixmap(idx, phys);
> }
> }
> @@ -298,7 +304,8 @@ void __init early_ioremap_reset(void)
> static void __init __early_set_fixmap(enum fixed_addresses idx,
> unsigned long phys, pgprot_t flags)
> {
> - unsigned long *pte, addr = __fix_to_virt(idx);
> + unsigned long addr = __fix_to_virt(idx);
> + pte_t *pte;

Unrelated?

-Andi

2008-01-19 23:52:01

by H. Peter Anvin

Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Andi Kleen wrote:
>
> That will break if the kernel is > 4GB won't it? Also same for pmd.
>

The kernel can't be > 4 GB; after all, we're running here with paging
disabled, so inherently we're < 4 GB...

-hpa

2008-01-20 16:45:29

by Ian Campbell

Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.


On Sun, 2008-01-20 at 00:07 +0100, Andi Kleen wrote:
> Ian Campbell <[email protected]> writes:
> > +#ifdef CONFIG_X86_PAE
> > +err_no_pae:
> > + /* It is probably too early but we might as well try... */
>
> Without a low identity mapping early_printk will not work and printk
> definitely not.
>
> > +#ifdef CONFIG_PRINTK
>
> You should do the test in the 16 bit boot code. In fact it should
> already do it by testing the CPUID REQUIRED_MASK.

Indeed it does. I don't have any non-PAE to test it but I turned the
failure case into a simple jmp to hlt_loop since we ought never to get
here in any case.

> > +/*
> > + * Since a paravirt guest will never come down this path we want
> > + * native style page table accessors here.
> > + */
> > +#undef CONFIG_PARAVIRT
>
> Seems quite fragile. I'm sure that would hurt later.

The problem here is that we explicitly want native accessors because
it's too early to use the pv ops since we are still running P==V. A PV
kernel boot will never come down this path -- it is diverted earlier in
head_32.S, so using the native versions is appropriate.

I'll try again to use the native_{make,set}_xxx functions but originally
I found the necessary variants weren't defined in all combinations of
PAE/not and PARAVIRT/not.

FWIW we use the same undef trick under arch/x86/boot too, and this early
start-of-day stuff is fairly similar.

> > +static inline __init pte_t *early_pte_offset(pmd_t *pmd, unsigned long vaddr)
> > +{
> > + return ((pte_t *)(u32)(pmd_val(*pmd) & PAGE_MASK))
>
> That will break if the kernel is > 4GB won't it? Also same for pmd.

As hpa says we can't be above 4G at this point. Probably I can use some
variant of make_pte now though.

> > +static inline __init pmd_t *
> > +early_pmd_alloc(pgd_t *pgd_base, unsigned long vaddr, unsigned long *end)
> > +{
> > + pud_t *pud = early_pud_offset(pgd_base, vaddr);
> > +
> > +#ifdef CONFIG_X86_PAE
> > + if (!(pud_val(*pud) & _PAGE_PRESENT)) {
>
> Why not set it in the pgd which is identical? Also the proper test is !pgd_none()

I was trying to fit in with the native_foo stuff that is available and
happened to be using pud on my last attempt before I switched to the
#undef CONFIG_PARAVIRT approach. I'll switch to pgd if I can get it to
work.

pgd_none (and pud_none) are hardcoded to 0 in the 32 bit case (in
asm-generic/pgtable-nopud.h and asm-generic/pgtable-nopmd.h or
asm-x86/pgtable-3level.h). Presumably this is because at regular runtime
these entries are guaranteed to exist, which isn't true this early at
startup.

In fact since we are always going to need a PMD in the PAE case there is
probably not much wrong with simply unconditionally allocating the pmd
at the start of early_pgtable_init().
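
Something like this, say (untested sketch in the native_{make,set}_xxx
style I mentioned above; with PAE PTRS_PER_PGD is only 4, so it costs at
most four pages):

	#ifdef CONFIG_X86_PAE
		int i;

		/* allocate every PMD page up front rather than testing
		 * _PAGE_PRESENT on each pgd entry as we go */
		for (i = 0; i < PTRS_PER_PGD; i++) {
			memset((void *)end, 0, PAGE_SIZE);
			pgd_base[i] = native_make_pgd(end | _PAGE_PRESENT);
			end += PAGE_SIZE;
		}
	#endif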

> > +{
> > + pmd_t *pmd;
> > +
> > + pmd = early_pmd_alloc(pgd_base, vaddr, end);
> > + if (!(pmd_val(*pmd) & _PAGE_PRESENT)) {
>
> !pmd_none()

done (without the !)

> > +void __init early_pgtable_init(void)
> > +{
> > + unsigned long addr, end;
> > + pgd_t *pgd_base;
> > +
> > + pgd_base = __pavar(swapper_pg_dir);
> > + end = __pa_symbol(pg0);
>
> Are you sure there will be enough memory here? You might need to use
> an e820 allocator similar to x86-64.

True. However, the assembly being replaced makes the same assumptions,
so I don't think that should block this patch; it's a fixup that can be
made later.

> > - if (!*pte & _PAGE_PRESENT) {
> > - phys = *pte & PAGE_MASK;
> > + if (!(pte_val(*pte) & _PAGE_PRESENT)) {
>
> pte_present(). Ok the old code was wrong too, but no need to do that again.

Done.

> > @@ -298,7 +304,8 @@ void __init early_ioremap_reset(void)
> > static void __init __early_set_fixmap(enum fixed_addresses idx,
> > unsigned long phys, pgprot_t flags)
> > {
> > - unsigned long *pte, addr = __fix_to_virt(idx);
> > + unsigned long addr = __fix_to_virt(idx);
> > + pte_t *pte;
>
> Unrelated?

Nope, the return type of early_ioremap_pte() changed unsigned long ->
pte_t and that is what is assigned to pte.

I'll spin another version.

Ian.
--
Ian Campbell

"I go on working for the same reason a hen goes on laying eggs."
-- H. L. Mencken

2008-01-20 17:36:18

by Andi Kleen

Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

On Sun, Jan 20, 2008 at 04:44:50PM +0000, Ian Campbell wrote:
> Indeed it does. I don't have any non-PAE hardware to test on, but I turned the
> failure case into a simple jmp to hlt_loop since we ought never to get
> here in any case.

There are various loaders (kexec, elilo, ...) that skip the 16bit code
and jump directly to 32bit head.S. So in theory those could hit it.
But still having the loop only is probably fine.

> > > + * Since a paravirt guest will never come down this path we want
> > > + * native style page table accessors here.
> > > + */
> > > +#undef CONFIG_PARAVIRT
> >
> > Seems quite fragile. I'm sure that would hurt later.
>
> The problem here is that we explicitly want native accessors because
> it's too early to use the pv ops since we are still running P==V. A PV
> kernel boot will never come down this path -- it is diverted earlier in
> head_32.S, so using the native versions is appropriate.

Then i think it would be cleaner to just open code everything without
any accessors.

> As hpa says we can't be above 4G at this point. Probably I can use some
> variant of make_pte now though.

The 32bit cast still feels unclean. After all the PTE is not 32bit.

-Andi

2008-01-20 18:31:14

by Mika Penttilä

Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.


> + * This is how much memory *in addition to the memory covered up to
> + * and including _end* we need mapped initially. We need one bit for
> + * each possible page, but only in low memory, which means
> + * 2^32/4096/8 = 128K worst case (4G/4G split.)
> + *
> + * Modulo rounding, each megabyte assigned here requires a kilobyte of
> + * memory, which is currently unreclaimed.
> + *
> + * This should be a multiple of a page.
> + */
> +#define INIT_MAP_BEYOND_END (128*1024)
> +
> +/*
>

You have dropped the requirement to map all of low memory (the boot
allocator is used for instance to construct physical mem mapping).
Either you should fix your INIT_MAP_BEYOND_END or make a big comment
telling us why it isn't necessary anymore to map low mem.

--Mika

2008-01-20 18:49:44

by H. Peter Anvin

Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Andi Kleen wrote:
> On Sun, Jan 20, 2008 at 04:44:50PM +0000, Ian Campbell wrote:
>> Indeed it does. I don't have any non-PAE hardware to test on, but I turned the
>> failure case into a simple jmp to hlt_loop since we ought never to get
>> here in any case.
>
> There are various loaders (kexec, elilo, ...) that skip the 16bit code
> and jump directly to 32bit head.S. So in theory those could hit it.
> But still having the loop only is probably fine.
>

It's probably just as well, since we don't really know how to get a
message out in such an environment anyway...

>>>> + * Since a paravirt guest will never come down this path we want
>>>> + * native style page table accessors here.
>>>> + */
>>>> +#undef CONFIG_PARAVIRT
>>> Seems quite fragile. I'm sure that would hurt later.
>> The problem here is that we explicitly want native accessors because
>> it's too early to use the pv ops since we are still running P==V. A PV
>> kernel boot will never come down this path -- it is diverted earlier in
>> head_32.S, so using the native versions is appropriate.
>
> Then i think it would be cleaner to just open code everything without
> any accessors.

I was thinking about this yesterday, and it seems to me that there are
two cleaner options here...

- either we should put in the full machinery to be able to run C code
compiled with -fPIC/-fPIE before paging is enabled. Unfortunately gcc
generates R_386_GOT32 relocations for external references even with
-fPIE, so we'll have to put in some code to adjust the GOT (easy enough
to do.)

As far as the native accessors are concerned, the right thing to do
would be to use the native_ forms thereof, not #undef CONFIG_PARAVIRT.

- alternatively, we recognize that this isn't all that big of a piece of
code and doing it in C really isn't necessary. We can have a small
assembly loop for PAE that matches the small assembly loop we already
have for !PAE.

>> As hpa says we can't be above 4G at this point. Probably I can use some
>> variant of make_pte now though.
>
> The 32bit cast still feels unclean. After all the PTE is not 32bit.

No, but (pte_t *) is 32 bits. To be more "Linuxy" it probably should be
(long) or (unsigned long) though.

-hpa

2008-01-20 18:52:19

by Andi Kleen

Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

> >There are various loaders (kexec, elilo, ...) that skip the 16bit code
> >and jump directly to 32bit head.S. So in theory those could hit it.
> >But still having the loop only is probably fine.
> >
>
> It's probably just as well, since we don't really know how to get a
> message out in such an environment anyway...

It would probably be possible to extend the 32bit protocol with some
way to error out in such a case. On the other hand I'm not sure it's really
worth the considerable work to implement and debug such an addition.

> >>variant of make_pte now though.
> >
> >The 32bit cast still feels unclean. After all the PTE is not 32bit.
>
> No, but (pte_t *) is 32 bits. To be more "Linuxy" it probably should be
> (long) or (unsigned long) though.

That's not 32bit either.

-Andi

2008-01-20 18:55:44

by H. Peter Anvin

Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Andi Kleen wrote:
>>> There are various loaders (kexec, elilo, ...) that skip the 16bit code
>>> and jump directly to 32bit head.S. So in theory those could hit it.
>>> But still having the loop only is probably fine.
>>>
>> It's probably just as well, since we don't really know how to get a
>> message out in such an environment anyway...
>
> It would probably be possible to extend the 32bit protocol with some
> way to error out in such a case. On the other hand I'm not sure it's really
> worth the considerable work to implement and debug such an addition.
>
>>>> variant of make_pte now though.
>>> The 32bit cast still feels unclean. After all the PTE is not 32bit.
>> No, but (pte_t *) is 32 bits. To be more "Linuxy" it probably should be
>> (long) or (unsigned long) though.
>
> That's not 32bit either.

Looked at the subject line?

-hpa

2008-01-21 21:24:20

by Ian Campbell

Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.


On Sun, 2008-01-20 at 20:30 +0200, Mika Penttilä wrote:
> > + * This is how much memory *in addition to the memory covered up to
> > + * and including _end* we need mapped initially. We need one bit for
> > + * each possible page, but only in low memory, which means
> > + * 2^32/4096/8 = 128K worst case (4G/4G split.)
> > + *
> > + * Modulo rounding, each megabyte assigned here requires a kilobyte of
> > + * memory, which is currently unreclaimed.
> > + *
> > + * This should be a multiple of a page.
> > + */
> > +#define INIT_MAP_BEYOND_END (128*1024)
> > +
> > +/*
> >
>
> You have dropped the requirement to map all of low memory (the boot
> allocator is used for instance to construct physical mem mapping).
> Either you should fix your INIT_MAP_BEYOND_END or make a big comment
> telling us why it isn't necessary anymore to map low mem.

I think you are right. The patch ensures that all the initial page
tables themselves have mappings but won't map the additional pages
needed for mapping the rest of lowmem.

However, I think it is no longer necessary to map a whole new 4G worth
of page table pages because the code in kernel_physical_mapping_init now
extends the initial mappings rather than replacing them (see changes to
native_pagetable_setup_start). So now we only need to map 4G worth of
page tables including the initial page tables. That means we only need
to map a fixed set of extra pages rather than the sliding limit
currently used in the patch.

I'm not convinced by the additional 16MB for CONFIG_DEBUG_PAGEALLOC --
we map enough pages for page tables for 4G of lowmem -- adding space for
an extra 16M seems pointless.
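
(Working the numbers from the old comment: 2^32/4096 = 1048576 low
pages, so the bootmem bitmap is 1048576/8 = 128K, and mapping all of
lowmem takes 1048576/512 = 2048 pte pages plus 4 pmd pages with PAE, or
1048576/1024 = 1024 pte pages without -- roughly 8MB or 4MB of page
tables. The DEBUG_PAGEALLOC case then asks for a further 16MB on top of
that.)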

Ian.
--
Ian Campbell

Good-bye. I am leaving because I am bored.
-- George Saunders' dying words

2008-01-21 21:39:23

by H. Peter Anvin

Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Ian Campbell wrote:
> On Sun, 2008-01-20 at 20:30 +0200, Mika Penttilä wrote:
>>>
>> You have dropped the requirement to map all of low memory (the boot
>> allocator is used for instance to construct physical mem mapping).
>> Either you should fix your INIT_MAP_BEYOND_END or make a big comment
>> telling us why it isn't necessary anymore to map low mem.
>
> I think you are right. The patch ensures that all the initial page
> tables themselves have mappings but won't map the additional pages
> needed for mapping the rest of lowmem.
>
> However, I think it is no longer necessary to map a whole new 4G worth
> of page table pages because the code in kernel_physical_mapping_init now
> extends the initial mappings rather than replacing them (see changes to
> native_pagetable_setup_start). So now we only need to map 4G worth of
> page tables including the initial page tables. That means we only need
> to map a fixed set of extra pages rather than the sliding limit
> currently used in the patch.
>

We still need to be able to construct those page tables, which is what
that stuff is about...


> I'm not convinced by the additional 16MB for CONFIG_DEBUG_PAGEALLOC --
> we map enough pages for page tables for 4G of lowmem -- adding space for
> an extra 16M seems pointless.

If so, adjusting the limit should be a separate patch.

Either way, I'm increasingly thinking that setting up the initial page
tables via an assembly loop instead of worrying about the C accessors is
actually cleaner (I prototyped it yesterday, although I still need the
rest of the machinery.)

2008-01-21 21:46:59

by Ian Campbell

Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.


On Mon, 2008-01-21 at 13:38 -0800, H. Peter Anvin wrote:
> Ian Campbell wrote:
> > On Sun, 2008-01-20 at 20:30 +0200, Mika Penttilä wrote:
> >>>
> >> You have dropped the requirement to map all of low memory (the boot
> >> allocator is used for instance to construct physical mem mapping).
> >> Either you should fix your INIT_MAP_BEYOND_END or make a big comment
> >> telling us why it isn't necessary anymore to map low mem.
> >
> > I think you are right. The patch ensures that all the initial page
> > tables themselves have mappings but won't map the additional pages
> > needed for mapping the rest of lowmem.
> >
> > However, I think it is no longer necessary to map a whole new 4G worth
> > of page table pages because the code in kernel_physical_mapping_init now
> > extends the initial mappings rather than replacing them (see changes to
> > native_pagetable_setup_start). So now we only need to map 4G worth of
> > page tables including the initial page tables. That means we only need
> > to map a fixed set of extra pages rather than the sliding limit
> > currently used in the patch.
> >
>
> We still need to be able to construct those page tables, which is what
> that stuff is about...

Yes, my initial patch was wrong. However, with the patch we no longer
throw away the non-PAE initial page tables and replace them with PAE
ones; instead we augment the initial PAE page tables. This means we only
need initial mappings of 4G worth of page tables rather than 4G plus
what is needed for the non-PAE initial page tables.

I don't think I explained that at all well on either attempt...
Hopefully what I mean will be clearer in patch form -- coming in a
second...

> > I'm not convinced by the additional 16MB for CONFIG_DEBUG_PAGEALLOC --
> > we map enough pages for page tables for 4G of lowmem -- adding space for
> > an extra 16M seems pointless.
>
> If so, adjusting the limit should be a separate patch.
>
> Either way, I'm increasingly thinking that setting up the initial page
> tables via an assembly loop instead of worrying about the C accessors is
> actually cleaner (I prototyped it yesterday, although I still need the
> rest of the machinery.)

I'm just preparing to send out a version which uses the native_* way of
doing things; it's not actually as clean as I would like, so I'd be
interested to see the ASM variant.

Ian.
--
Ian Campbell

You shall be rewarded for a dastardly deed.

2008-01-21 22:15:50

by Ian Campbell

Subject: [PATCH] x86_32: Remove unnecessary use of %ebx as the boot cpu flag

Currently in head_32.S there are two ways we test to see if we
are the boot cpu: by looking at %ebx and by looking at the
static variable ready. When changing things around I have
found that it gets tricky to preserve %ebx, so this
patch just switches head.S over to the more reliable
test of always using ready.

Hopefully later we can kill these tests entirely.

Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Ian Campbell <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Mika Penttilä <[email protected]>
---
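
In C terms the test being standardized on looks like this -- an
illustration only; the real code is the assembly below, and the helper
names are stand-ins for the existing 1:/checkCPUtype paths:

	if (!ready)
		clean_bss();		/* boot CPU (BSP), was %ebx == 0 */
	else
		check_cpu_type();	/* secondary CPU (AP), was %ebx != 0 */
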
arch/x86/kernel/head_32.S | 7 +------
1 files changed, 1 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index f409fe2..7b9b256 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -199,7 +199,6 @@ default_entry:
addl $0x67, %eax /* 0x67 == _PAGE_TABLE */
movl %eax, 4092(%edx)

- xorl %ebx,%ebx /* This is the boot CPU (BSP) */
jmp 3f
/*
* Non-boot CPU entry point; entered from trampoline.S
@@ -268,10 +267,6 @@ ENTRY(startup_32_smp)
wrmsr

6:
- /* This is a secondary processor (AP) */
- xorl %ebx,%ebx
- incl %ebx
-
#endif /* CONFIG_SMP */
3:

@@ -297,7 +292,7 @@ ENTRY(startup_32_smp)
popfl

#ifdef CONFIG_SMP
- andl %ebx,%ebx
+ cmpb $0, ready
jz 1f /* Initial CPU cleans BSS */
jmp checkCPUtype
1:
--
1.5.3.8

2008-01-21 22:16:07

by Ian Campbell

Subject: [PATCH] x86_32: Always run the full set of paging state.

I am preparing to convert the boot time page tables to the kernel's
native format. To achieve that I need to enable PAE; enabling PSE
and the no-execute bit would not hurt. So this patch modifies
the boot cpu path to execute all of the kernel's enable code
if and only if we have the proper bits set in mmu_cr4_features.

Signed-off-by: Eric W. Biederman <[email protected]>
Signed-off-by: Ian Campbell <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Mika Penttilä <[email protected]>
---
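
The effect on the boot cpu path, written as C for clarity (an
illustration only -- the real code is the cr4 fragment in head_32.S,
with read_cr4()/write_cr4() standing in for the raw mov instructions):

	unsigned long cr4 = read_cr4();

	/* every CPU, BSP and AP alike, now runs the same enable code */
	cr4 |= mmu_cr4_features;	/* includes X86_CR4_PAE on PAE kernels */
	write_cr4(cr4);
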
arch/x86/kernel/head_32.S | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 7b9b256..a2b6331 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -221,6 +221,8 @@ ENTRY(startup_32_smp)
movl %eax,%es
movl %eax,%fs
movl %eax,%gs
+#endif /* CONFIG_SMP */
+3:

/*
* New page tables may be in 4Mbyte page mode and may
@@ -267,8 +269,6 @@ ENTRY(startup_32_smp)
wrmsr

6:
-#endif /* CONFIG_SMP */
-3:

/*
* Enable paging
--
1.5.3.8

2008-01-21 22:16:27

by Ian Campbell

Subject: [0/3] x86_32: Construct 32 bit boot time page tables in native format.

The following patches change x86_32 to use native format for the boot
time page tables. The first two are taken directly from Eric
Biederman's older patch series
http://marc.info/?l=linux-kernel&m=117794868309910&w=2 and
http://marc.info/?l=linux-kernel&m=117794868819193&w=2 while the final
patch is http://marc.info/?l=linux-kernel&m=117795243926186&w=2
updated for the current tree (much of it was already in, the rest has
been extensively modified).

Changes since v1:
- Grabbed Eric's patches for enabling PAE in head_32.S -- I think
they are cleaner than what I had.
- Use native_* accessors instead of #undef CONFIG_PARAVIRT. This
approach seem cleaner and also allows the code to live in init_32.c
instead of creating a new file.
- Ensure we map enough pages to create a full 4G mapping of
lowmem. This now includes the space for the initial page tables
rather than being in addition to it.
- Continue to map extra pages for CONFIG_DEBUG_PAGEALLOC and some
extra pages for slack, per the original assembly code.

Cheers,
Ian.

Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Mika Penttilä <[email protected]>

2008-01-21 22:16:40

by Ian Campbell

Subject: [PATCH] x86_32: Construct 32 bit boot time page tables in native format.

Specifically the boot time page tables in a CONFIG_X86_PAE=y enabled
kernel are in PAE format.

early_ioremap is updated to use the standard page table accessors.

Derived from an earlier patch by Eric Biederman.

Signed-off-by: Ian Campbell <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Mika Penttilä <[email protected]>
---
arch/x86/kernel/head_32.S | 82 ++----------------
arch/x86/kernel/setup_32.c | 4 +
arch/x86/mm/init_32.c | 196 +++++++++++++++++++++++++++++++----------
arch/x86/mm/ioremap_32.c | 53 +++++++-----
include/asm-x86/page_32.h | 1 -
include/asm-x86/pgtable_32.h | 4 -
6 files changed, 189 insertions(+), 151 deletions(-)

diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index a2b6331..93e165a 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -33,44 +33,6 @@
#define X86_VENDOR_ID new_cpu_data+CPUINFO_x86_vendor_id

/*
- * This is how much memory *in addition to the memory covered up to
- * and including _end* we need mapped initially.
- * We need:
- * - one bit for each possible page, but only in low memory, which means
- * 2^32/4096/8 = 128K worst case (4G/4G split.)
- * - enough space to map all low memory, which means
- * (2^32/4096) / 1024 pages (worst case, non PAE)
- * (2^32/4096) / 512 + 4 pages (worst case for PAE)
- * - a few pages for allocator use before the kernel pagetable has
- * been set up
- *
- * Modulo rounding, each megabyte assigned here requires a kilobyte of
- * memory, which is currently unreclaimed.
- *
- * This should be a multiple of a page.
- */
-LOW_PAGES = 1<<(32-PAGE_SHIFT_asm)
-
-/*
- * To preserve the DMA pool in PAGEALLOC kernels, we'll allocate
- * pagetables from above the 16MB DMA limit, so we'll have to set
- * up pagetables 16MB more (worst-case):
- */
-#ifdef CONFIG_DEBUG_PAGEALLOC
-LOW_PAGES = LOW_PAGES + 0x1000000
-#endif
-
-#if PTRS_PER_PMD > 1
-PAGE_TABLE_SIZE = (LOW_PAGES / PTRS_PER_PMD) + PTRS_PER_PGD
-#else
-PAGE_TABLE_SIZE = (LOW_PAGES / PTRS_PER_PGD)
-#endif
-BOOTBITMAP_SIZE = LOW_PAGES / 8
-ALLOCATOR_SLOP = 4
-
-INIT_MAP_BEYOND_END = BOOTBITMAP_SIZE + (PAGE_TABLE_SIZE + ALLOCATOR_SLOP)*PAGE_SIZE_asm
-
-/*
* 32-bit kernel entrypoint; only used by the boot CPU. On entry,
* %esi points to the real-mode code as a 32-bit pointer.
* CS and DS must be 4 GB flat segments, but we don't depend on
@@ -160,46 +122,16 @@ num_subarch_entries = (. - subarch_entries) / 4
.previous
#endif /* CONFIG_PARAVIRT */

-/*
- * Initialize page tables. This creates a PDE and a set of page
- * tables, which are located immediately beyond _end. The variable
- * init_pg_tables_end is set up to point to the first "safe" location.
- * Mappings are created both at virtual address 0 (identity mapping)
- * and PAGE_OFFSET for up to _end+sizeof(page tables)+INIT_MAP_BEYOND_END.
- *
- * Warning: don't use %esi or the stack in this code. However, %esp
- * can be used as a GPR if you really need it...
- */
-page_pde_offset = (__PAGE_OFFSET >> 20);
-
default_entry:
- movl $(pg0 - __PAGE_OFFSET), %edi
- movl $(swapper_pg_dir - __PAGE_OFFSET), %edx
- movl $0x007, %eax /* 0x007 = PRESENT+RW+USER */
-10:
- leal 0x007(%edi),%ecx /* Create PDE entry */
- movl %ecx,(%edx) /* Store identity PDE entry */
- movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */
- addl $4,%edx
- movl $1024, %ecx
-11:
- stosl
- addl $0x1000,%eax
- loop 11b
- /* End condition: we must map up to and including INIT_MAP_BEYOND_END */
- /* bytes beyond the end of our own page tables; the +0x007 is the attribute bits */
- leal (INIT_MAP_BEYOND_END+0x007)(%edi),%ebp
- cmpl %ebp,%eax
- jb 10b
- movl %edi,(init_pg_tables_end - __PAGE_OFFSET)
-
- /* Do an early initialization of the fixmap area */
- movl $(swapper_pg_dir - __PAGE_OFFSET), %edx
- movl $(swapper_pg_pmd - __PAGE_OFFSET), %eax
- addl $0x67, %eax /* 0x67 == _PAGE_TABLE */
- movl %eax, 4092(%edx)
+ /* Setup the stack */
+ lss stack_start - __PAGE_OFFSET, %esp
+ subl $__PAGE_OFFSET, %esp
+
+ /* Initialize the boot page tables */
+ call early_pgtable_init

jmp 3f
+
/*
* Non-boot CPU entry point; entered from trampoline.S
* We can't lgdt here, because lgdt itself uses a data segment, but
diff --git a/arch/x86/kernel/setup_32.c b/arch/x86/kernel/setup_32.c
index c6f25cb..196c23b 100644
--- a/arch/x86/kernel/setup_32.c
+++ b/arch/x86/kernel/setup_32.c
@@ -153,7 +153,11 @@ struct cpuinfo_x86 new_cpu_data __cpuinitdata = { 0, 0, 0, 0, -1, 1, 0, 0, -1 };
struct cpuinfo_x86 boot_cpu_data __read_mostly = { 0, 0, 0, 0, -1, 1, 0, 0, -1 };
EXPORT_SYMBOL(boot_cpu_data);

+#ifndef CONFIG_X86_PAE
unsigned long mmu_cr4_features;
+#else
+unsigned long mmu_cr4_features = X86_CR4_PAE;
+#endif

/* for MCA, but anyone else can use it if they want */
unsigned int machine_id;
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index cbba769..a8eb443 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -43,6 +43,7 @@
#include <asm/tlbflush.h>
#include <asm/sections.h>
#include <asm/paravirt.h>
+#include <asm/setup.h>

unsigned int __VMALLOC_RESERVE = 128 << 20;

@@ -52,6 +53,151 @@ unsigned long highstart_pfn, highend_pfn;
static int noinline do_test_wp_bit(void);

/*
+ * Initialize page tables. This creates a PDE and a set of page
+ * tables, which are located immediately beyond _end. The variable
+ * init_pg_tables_end is set up to point to the first "safe" location.
+ * Mappings are created both at virtual address 0 (identity mapping)
+ * and PAGE_OFFSET for up to _end+sizeof(page tables)+INIT_MAP_BEYOND_END.
+ *
+ * WARNING: This code runs at its physical address, not its virtual address,
+ * with all of physical memory identity mapped, and nothing else mapped.
+ * This means access to global variables must be done very carefully.
+ */
+#define __pavar(X) (*(__typeof__(X) *)__pa_symbol(&(X)))
+
+static inline __init pgd_t *early_pgd_offset(pgd_t *pgd, unsigned long vaddr)
+{
+ return pgd + pgd_index(vaddr);
+}
+
+static inline __init pmd_t *early_pmd_offset(pgd_t *pgd, unsigned long vaddr)
+{
+#ifndef CONFIG_X86_PAE
+ return (pmd_t *)pgd;
+#else
+ return ((pmd_t *)(unsigned long)(native_pgd_val(*pgd) & PAGE_MASK))
+ + pmd_index(vaddr);
+#endif
+}
+
+static inline __init pte_t *early_pte_offset(pmd_t *pmd, unsigned long vaddr)
+{
+ return ((pte_t *)(unsigned long)(native_pmd_val(*pmd) & PAGE_MASK))
+ + pte_index(vaddr);
+}
+
+static inline __init pmd_t *
+early_pmd_alloc(pgd_t *pgd_base, unsigned long vaddr, unsigned long *end)
+{
+ pgd_t *pgd = early_pgd_offset(pgd_base, vaddr);
+
+ if (!(unsigned long)native_pgd_val(*pgd)) {
+ unsigned long phys = *end;
+ memset((void *)phys, 0, PAGE_SIZE);
+#ifdef CONFIG_X86_PAE
+ *pgd = native_make_pgd(phys | _PAGE_PRESENT);
+#else
+ *pgd = native_make_pgd(phys | _PAGE_TABLE);
+#endif
+ *end += PAGE_SIZE;
+ }
+
+ return early_pmd_offset(pgd, vaddr);
+}
+
+static inline __init pte_t *
+early_pte_alloc(pgd_t *pgd_base, unsigned long vaddr, unsigned long *end)
+{
+ pmd_t *pmd;
+
+ pmd = early_pmd_alloc(pgd_base, vaddr, end);
+#ifdef CONFIG_X86_PAE
+ if (!(unsigned long)native_pmd_val(*pmd)) {
+ unsigned long phys = *end;
+ memset((void *)phys, 0, PAGE_SIZE);
+ native_set_pmd(pmd, native_make_pmd(phys | _PAGE_TABLE));
+ *end += PAGE_SIZE;
+ }
+#endif
+ return early_pte_offset(pmd, vaddr);
+}
+
+static __init void early_set_pte_phys(pgd_t *pgd_base, unsigned long vaddr,
+ unsigned long phys, unsigned long *end)
+{
+ pte_t *pte;
+ pte = early_pte_alloc(pgd_base, vaddr, end);
+ native_set_pte(pte, native_make_pte(phys | _PAGE_KERNEL_EXEC));
+}
+
+/*
+ * To preserve the DMA pool in PAGEALLOC kernels, we'll allocate
+ * pagetables from above the 16MB DMA limit, so we'll have to set
+ * up pagetables 16MB more (worst-case):
+ */
+#ifdef CONFIG_DEBUG_PAGEALLOC
+#define LOW_PAGES ((1<<(32-PAGE_SHIFT)) + 0x1000000)
+#else
+#define LOW_PAGES (1<<(32-PAGE_SHIFT))
+#endif
+
+void __init early_pgtable_init(void)
+{
+ unsigned long addr, end, limit;
+ pgd_t *pgd_base;
+
+ pgd_base = __pavar(swapper_pg_dir);
+ end = __pa_symbol(pg0);
+
+ /*
+ * This is how much memory *in addition to the memory covered up to
+ * and including _end* we need mapped initially.
+ */
+ limit = end;
+
+ /*
+ * - one bit for each possible page, but only in low memory,
+ * which means
+ * 2^32/4096/8 = 128K worst case (4G/4G split.)
+ */
+ limit += LOW_PAGES / 8;
+
+ /*
+ * - enough space to map all low memory, which means
+ * (2^32/4096) / 1024 pages (worst case, non PAE)
+ * (2^32/4096) / 512 + 4 pages (worst case for PAE)
+ */
+#if PTRS_PER_PMD > 1
+ limit += (LOW_PAGES / PTRS_PER_PMD) * PAGE_SIZE;
+ limit += PTRS_PER_PGD * PAGE_SIZE;
+#else
+ limit += (LOW_PAGES / PTRS_PER_PGD) * PAGE_SIZE;
+#endif
+
+ /*
+ * - a few pages for allocator use before the kernel pagetable has
+ * been set up
+ */
+ limit += 4 * PAGE_SIZE;
+
+ /* Initialize the directory page */
+ memset(pgd_base, 0, PAGE_SIZE);
+
+ /* Set up the fixmap page table */
+ early_pte_alloc(pgd_base, __pavar(__FIXADDR_TOP), &end);
+
+ /* Set up the initial kernel mapping */
+ for (addr = 0; addr < limit; addr += PAGE_SIZE)
+ early_set_pte_phys(pgd_base, addr + PAGE_OFFSET, addr, &end);
+
+ /* Set up the low identity mappings */
+ clone_pgd_range(pgd_base, pgd_base + USER_PTRS_PER_PGD,
+ min_t(unsigned long, KERNEL_PGD_PTRS, USER_PGD_PTRS));
+
+ __pavar(init_pg_tables_end) = end;
+}
+
+/*
* Creates a middle page table and puts a pointer to it in the
* given global directory entry. This only returns the gd entry
* in non-PAE compilation mode, since the middle layer is folded.
@@ -353,44 +499,11 @@ extern void __init remap_numa_kva(void);

void __init native_pagetable_setup_start(pgd_t *base)
{
-#ifdef CONFIG_X86_PAE
- int i;
-
- /*
- * Init entries of the first-level page table to the
- * zero page, if they haven't already been set up.
- *
- * In a normal native boot, we'll be running on a
- * pagetable rooted in swapper_pg_dir, but not in PAE
- * mode, so this will end up clobbering the mappings
- * for the lower 24Mbytes of the address space,
- * without affecting the kernel address space.
- */
- for (i = 0; i < USER_PTRS_PER_PGD; i++)
- set_pgd(&base[i],
- __pgd(__pa(empty_zero_page) | _PAGE_PRESENT));
-
- /* Make sure kernel address space is empty so that a pagetable
- will be allocated for it. */
- memset(&base[USER_PTRS_PER_PGD], 0,
- KERNEL_PGD_PTRS * sizeof(pgd_t));
-#else
paravirt_alloc_pd(__pa(swapper_pg_dir) >> PAGE_SHIFT);
-#endif
}

void __init native_pagetable_setup_done(pgd_t *base)
{
-#ifdef CONFIG_X86_PAE
- /*
- * Add low memory identity-mappings - SMP needs it when
- * starting up on an AP from real-mode. In the non-PAE
- * case we already have these mappings through head.S.
- * All user-space mappings are explicitly cleared after
- * SMP startup.
- */
- set_pgd(&base[0], base[USER_PTRS_PER_PGD]);
-#endif
}

/*
@@ -399,9 +512,8 @@ void __init native_pagetable_setup_done(pgd_t *base)
* the boot process.
*
* If we're booting on native hardware, this will be a pagetable
- * constructed in arch/i386/kernel/head.S, and not running in PAE mode
- * (even if we'll end up running in PAE). The root of the pagetable
- * will be swapper_pg_dir.
+ * constructed in arch/x86/kernel/head_32.S. The root of the
+ * pagetable will be swapper_pg_dir.
*
* If we're booting paravirtualized under a hypervisor, then there are
* more options: we may already be running PAE, and the pagetable may
@@ -559,14 +671,6 @@ void __init paging_init(void)

load_cr3(swapper_pg_dir);

-#ifdef CONFIG_X86_PAE
- /*
- * We will bail out later - printk doesn't work right now so
- * the user would just see a hanging kernel.
- */
- if (cpu_has_pae)
- set_in_cr4(X86_CR4_PAE);
-#endif
__flush_tlb_all();

kmap_init();
@@ -696,10 +800,6 @@ void __init mem_init(void)
BUG_ON((unsigned long)high_memory > VMALLOC_START);
#endif /* double-sanity-check paranoia */

-#ifdef CONFIG_X86_PAE
- if (!cpu_has_pae)
- panic("cannot execute a PAE-enabled kernel on a PAE-less CPU!");
-#endif
if (boot_cpu_data.wp_works_ok < 0)
test_wp_bit();

diff --git a/arch/x86/mm/ioremap_32.c b/arch/x86/mm/ioremap_32.c
index 05a24cd..fa8a3ff 100644
--- a/arch/x86/mm/ioremap_32.c
+++ b/arch/x86/mm/ioremap_32.c
@@ -226,40 +226,45 @@ static int __init early_ioremap_debug_setup(char *str)
__setup("early_ioremap_debug", early_ioremap_debug_setup);

static __initdata int after_paging_init;
-static __initdata unsigned long bm_pte[1024]
+static __initdata pte_t bm_pte[PAGE_SIZE/sizeof(pte_t)]
__attribute__((aligned(PAGE_SIZE)));

-static inline unsigned long * __init early_ioremap_pgd(unsigned long addr)
+static inline pmd_t * __init early_ioremap_pmd(unsigned long addr)
{
- return (unsigned long *)swapper_pg_dir + ((addr >> 22) & 1023);
+ pgd_t *pgd = &swapper_pg_dir[pgd_index(addr)];
+ pud_t *pud = pud_offset(pgd, addr);
+ pmd_t *pmd = pmd_offset(pud, addr);
+
+ return pmd;
}

-static inline unsigned long * __init early_ioremap_pte(unsigned long addr)
+static inline pte_t * __init early_ioremap_pte(unsigned long addr)
{
- return bm_pte + ((addr >> PAGE_SHIFT) & 1023);
+ return &bm_pte[pte_index(addr)];
}

void __init early_ioremap_init(void)
{
- unsigned long *pgd;
+ pmd_t *pmd;

if (early_ioremap_debug)
printk("early_ioremap_init()\n");

- pgd = early_ioremap_pgd(fix_to_virt(FIX_BTMAP_BEGIN));
- *pgd = __pa(bm_pte) | _PAGE_TABLE;
+ pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
memset(bm_pte, 0, sizeof(bm_pte));
+ set_pmd(pmd, __pmd(__pa(bm_pte) | _PAGE_TABLE));
+
/*
- * The boot-ioremap range spans multiple pgds, for which
+ * The boot-ioremap range spans multiple pmds, for which
* we are not prepared:
*/
- if (pgd != early_ioremap_pgd(fix_to_virt(FIX_BTMAP_END))) {
+ if (pmd != early_ioremap_pmd(fix_to_virt(FIX_BTMAP_END))) {
WARN_ON(1);
- printk("pgd %p != %p\n",
- pgd, early_ioremap_pgd(fix_to_virt(FIX_BTMAP_END)));
- printk("fix_to_virt(FIX_BTMAP_BEGIN): %08lx\n",
+ printk(KERN_WARNING "pmd %p != %p\n",
+ pmd, early_ioremap_pmd(fix_to_virt(FIX_BTMAP_END)));
+ printk(KERN_WARNING "fix_to_virt(FIX_BTMAP_BEGIN): %08lx\n",
fix_to_virt(FIX_BTMAP_BEGIN));
- printk("fix_to_virt(FIX_BTMAP_END): %08lx\n",
+ printk(KERN_WARNING "fix_to_virt(FIX_BTMAP_END): %08lx\n",
fix_to_virt(FIX_BTMAP_END));

printk("FIX_BTMAP_END: %d\n", FIX_BTMAP_END);
@@ -269,27 +274,28 @@ void __init early_ioremap_init(void)

void __init early_ioremap_clear(void)
{
- unsigned long *pgd;
+ pmd_t *pmd;

if (early_ioremap_debug)
printk("early_ioremap_clear()\n");

- pgd = early_ioremap_pgd(fix_to_virt(FIX_BTMAP_BEGIN));
- *pgd = 0;
+ pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
+ pmd_clear(pmd);
__flush_tlb_all();
}

void __init early_ioremap_reset(void)
{
enum fixed_addresses idx;
- unsigned long *pte, phys, addr;
+ unsigned long addr, phys;
+ pte_t *pte;

after_paging_init = 1;
for (idx = FIX_BTMAP_BEGIN; idx <= FIX_BTMAP_END; idx--) {
addr = fix_to_virt(idx);
pte = early_ioremap_pte(addr);
- if (!*pte & _PAGE_PRESENT) {
- phys = *pte & PAGE_MASK;
+ if (pte_present(*pte)) {
+ phys = pte_val(*pte) & PAGE_MASK;
set_fixmap(idx, phys);
}
}
@@ -298,7 +304,8 @@ void __init early_ioremap_reset(void)
static void __init __early_set_fixmap(enum fixed_addresses idx,
unsigned long phys, pgprot_t flags)
{
- unsigned long *pte, addr = __fix_to_virt(idx);
+ unsigned long addr = __fix_to_virt(idx);
+ pte_t *pte;

if (idx >= __end_of_fixed_addresses) {
BUG();
@@ -306,9 +313,9 @@ static void __init __early_set_fixmap(enum fixed_addresses idx,
}
pte = early_ioremap_pte(addr);
if (pgprot_val(flags))
- *pte = (phys & PAGE_MASK) | pgprot_val(flags);
+ set_pte(pte, pfn_pte(phys >> PAGE_SHIFT, flags));
else
- *pte = 0;
+ pte_clear(NULL, addr, pte);
__flush_tlb_one(addr);
}

diff --git a/include/asm-x86/page_32.h b/include/asm-x86/page_32.h
index 11c4b39..8fc0473 100644
--- a/include/asm-x86/page_32.h
+++ b/include/asm-x86/page_32.h
@@ -48,7 +48,6 @@ typedef unsigned long pgprotval_t;
typedef unsigned long phys_addr_t;

typedef union { pteval_t pte, pte_low; } pte_t;
-typedef pte_t boot_pte_t;

#endif /* __ASSEMBLY__ */
#endif /* CONFIG_X86_PAE */
diff --git a/include/asm-x86/pgtable_32.h b/include/asm-x86/pgtable_32.h
index 11c8b73..c07389b 100644
--- a/include/asm-x86/pgtable_32.h
+++ b/include/asm-x86/pgtable_32.h
@@ -55,10 +55,6 @@ int text_address(unsigned long);
#define USER_PGD_PTRS (PAGE_OFFSET >> PGDIR_SHIFT)
#define KERNEL_PGD_PTRS (PTRS_PER_PGD-USER_PGD_PTRS)

-#define TWOLEVEL_PGDIR_SHIFT 22
-#define BOOT_USER_PGD_PTRS (__PAGE_OFFSET >> TWOLEVEL_PGDIR_SHIFT)
-#define BOOT_KERNEL_PGD_PTRS (1024-BOOT_USER_PGD_PTRS)
-
/* Just any arbitrary offset to the start of the vmalloc VM area: the
* current 8MB value just means that there will be a 8MB "hole" after the
* physical memory until the kernel virtual memory starts. That means that
--
1.5.3.8
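
For readers following the pgd/pmd/pte accessors in the ioremap changes
above, here is a minimal user-space sketch (not kernel code) of how a
32-bit virtual address splits into table indices under the two paging
modes the patch handles; the shifts and masks mirror what pgd_index(),
pmd_offset() and pte_index() compute for the 2-level and 3-level (PAE)
layouts:

#include <stdio.h>

int main(void)
{
	unsigned long addr = 0xc0123456UL;	/* example kernel address */

	/* 2-level (non-PAE): 1024 PGD entries, 1024 PTEs per table */
	printf("non-PAE: pgd=%lu pte=%lu offset=0x%lx\n",
	       (addr >> 22) & 0x3ff, (addr >> 12) & 0x3ff, addr & 0xfff);

	/* 3-level (PAE): 4 PGD (PDPT) entries, 512 PMDs, 512 PTEs */
	printf("PAE:     pgd=%lu pmd=%lu pte=%lu offset=0x%lx\n",
	       (addr >> 30) & 0x3, (addr >> 21) & 0x1ff,
	       (addr >> 12) & 0x1ff, addr & 0xfff);

	return 0;
}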

2008-01-22 02:17:35

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index f409fe2..d1d30db 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -18,6 +18,10 @@
#include <asm/thread_info.h>
#include <asm/asm-offsets.h>
#include <asm/setup.h>
+#include <asm/processor-flags.h>
+
+/* Physical address */
+#define pa(X) ((X) - __PAGE_OFFSET)

/*
* References to members of the new_cpu_data structure.
@@ -79,10 +83,6 @@ INIT_MAP_BEYOND_END = BOOTBITMAP_SIZE + (PAGE_TABLE_SIZE + ALLOCATOR_SLOP)*PAGE_
*/
.section .text.head,"ax",@progbits
ENTRY(startup_32)
- /* check to see if KEEP_SEGMENTS flag is meaningful */
- cmpw $0x207, BP_version(%esi)
- jb 1f
-
/* test KEEP_SEGMENTS flag to see if the bootloader is asking
us to not reload segments */
testb $(1<<6), BP_loadflags(%esi)
@@ -91,7 +91,7 @@ ENTRY(startup_32)
/*
* Set segments to known values.
*/
-1: lgdt boot_gdt_descr - __PAGE_OFFSET
+1: lgdt pa(boot_gdt_descr)
movl $(__BOOT_DS),%eax
movl %eax,%ds
movl %eax,%es
@@ -104,8 +104,8 @@ ENTRY(startup_32)
*/
cld
xorl %eax,%eax
- movl $__bss_start - __PAGE_OFFSET,%edi
- movl $__bss_stop - __PAGE_OFFSET,%ecx
+ movl $pa(__bss_start),%edi
+ movl $pa(__bss_stop),%ecx
subl %edi,%ecx
shrl $2,%ecx
rep ; stosl
@@ -117,31 +117,32 @@ ENTRY(startup_32)
* (kexec on panic case). Hence copy out the parameters before initializing
* page tables.
*/
- movl $(boot_params - __PAGE_OFFSET),%edi
+ movl $pa(boot_params),%edi
movl $(PARAM_SIZE/4),%ecx
cld
rep
movsl
- movl boot_params - __PAGE_OFFSET + NEW_CL_POINTER,%esi
+ movl pa(boot_params) + NEW_CL_POINTER,%esi
andl %esi,%esi
jz 1f # No command line
- movl $(boot_command_line - __PAGE_OFFSET),%edi
+ movl $pa(boot_command_line),%edi
movl $(COMMAND_LINE_SIZE/4),%ecx
rep
movsl
1:

#ifdef CONFIG_PARAVIRT
- cmpw $0x207, (boot_params + BP_version - __PAGE_OFFSET)
+ /* This can only trip for a broken bootloader... */
+ cmpw $0x207, pa(boot_params + BP_version)
jb default_entry

/* Paravirt-compatible boot parameters. Look to see what architecture
we're booting under. */
- movl (boot_params + BP_hardware_subarch - __PAGE_OFFSET), %eax
+ movl pa(boot_params + BP_hardware_subarch), %eax
cmpl $num_subarch_entries, %eax
jae bad_subarch

- movl subarch_entries - __PAGE_OFFSET(,%eax,4), %eax
+ movl pa(subarch_entries)(,%eax,4), %eax
subl $__PAGE_OFFSET, %eax
jmp *%eax

@@ -167,17 +168,74 @@ num_subarch_entries = (. - subarch_entries) / 4
* Mappings are created both at virtual address 0 (identity mapping)
* and PAGE_OFFSET for up to _end+sizeof(page tables)+INIT_MAP_BEYOND_END.
*
- * Warning: don't use %esi or the stack in this code. However, %esp
- * can be used as a GPR if you really need it...
+ * Note that the stack is not yet set up!
*/
-page_pde_offset = (__PAGE_OFFSET >> 20);
+#define PTE_ATTR 0x007 /* PRESENT+RW+USER */
+#define PDE_ATTR 0x067 /* PRESENT+RW+USER+DIRTY+ACCESSED */
+#define PGD_ATTR 0x001 /* PRESENT (no other attributes) */

default_entry:
- movl $(pg0 - __PAGE_OFFSET), %edi
- movl $(swapper_pg_dir - __PAGE_OFFSET), %edx
- movl $0x007, %eax /* 0x007 = PRESENT+RW+USER */
+#ifdef CONFIG_X86_PAE
+ /*
+ * In PAE mode, the kernel PMD is shared, and __PAGE_OFFSET
+ * is guaranteed to be a multiple of 1 GB (the PGD granularity).
+ * Thus, we only need to set up a single PMD here; the identity
+ * mapping is handled by pointing two PGD entries to the PMD.
+ *
+ * Note the upper half of each PMD or PTE is always zero at
+ * this stage.
+ */
+page_pde_offset = (__PAGE_OFFSET >> 27);
+
+ movl %cr4, %eax
+ orl $X86_CR4_PAE, %eax
+ movl %eax, %cr4
+
+ xorl %ebx,%ebx /* %ebx is kept at zero */
+
+ movl $pa(pg0), %edi
+ movl $pa(swapper_pg_pmd), %edx
+ movl $PTE_ATTR, %eax
+10:
+ leal PDE_ATTR(%edi),%ecx /* Create PMD entry */
+ movl %ecx,(%edx) /* Store PMD entry */
+ /* Upper half already zero */
+ addl $8,%edx
+ movl $512,%ecx
+11:
+ stosl
+ xchgl %eax,%ebx
+ stosl
+ xchgl %eax,%ebx
+ addl $0x1000,%eax
+ loop 11b
+
+ /*
+ * End condition: we must map up to and including INIT_MAP_BEYOND_END
+ * bytes beyond the end of our own page tables.
+ */
+ leal (INIT_MAP_BEYOND_END+PTE_ATTR)(%edi),%ebp
+ cmpl %ebp,%eax
+ jb 10b
+ movl %edi,pa(init_pg_tables_end)
+
+ /* Set up the PGD */
+ movl $pa(swapper_pg_pmd)+PGD_ATTR, %eax
+ movl %eax, pa(swapper_pg_dir) /* Identity map */
+ movl %eax, pa(swapper_pg_dir+page_pde_offset) /* Kernel map */
+
+ /* Do early initialization of the fixmap area */
+ movl $pa(swapper_pg_fixmap)+PDE_ATTR,%eax
+ movl %eax,pa(swapper_pg_pmd+0xff8)
+#else /* Not PAE */
+
+page_pde_offset = (__PAGE_OFFSET >> 20);
+
+ movl $pa(pg0), %edi
+ movl $pa(swapper_pg_dir), %edx
+ movl $PTE_ATTR, %eax
10:
- leal 0x007(%edi),%ecx /* Create PDE entry */
+ leal PDE_ATTR(%edi),%ecx /* Create PDE entry */
movl %ecx,(%edx) /* Store identity PDE entry */
movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */
addl $4,%edx
@@ -186,19 +244,20 @@ default_entry:
stosl
addl $0x1000,%eax
loop 11b
- /* End condition: we must map up to and including INIT_MAP_BEYOND_END */
- /* bytes beyond the end of our own page tables; the +0x007 is the attribute bits */
- leal (INIT_MAP_BEYOND_END+0x007)(%edi),%ebp
+ /*
+ * End condition: we must map up to and including INIT_MAP_BEYOND_END
+ * bytes beyond the end of our own page tables; the +0x007 is
+ * the attribute bits
+ */
+ leal (INIT_MAP_BEYOND_END+PTE_ATTR)(%edi),%ebp
cmpl %ebp,%eax
jb 10b
- movl %edi,(init_pg_tables_end - __PAGE_OFFSET)
-
- /* Do an early initialization of the fixmap area */
- movl $(swapper_pg_dir - __PAGE_OFFSET), %edx
- movl $(swapper_pg_pmd - __PAGE_OFFSET), %eax
- addl $0x67, %eax /* 0x67 == _PAGE_TABLE */
- movl %eax, 4092(%edx)
+ movl %edi,pa(init_pg_tables_end)

+ /* Do early initialization of the fixmap area */
+ movl $pa(swapper_pg_fixmap)+PDE_ATTR,%eax
+ movl %eax,pa(swapper_pg_dir+0xffc)
+#endif
xorl %ebx,%ebx /* This is the boot CPU (BSP) */
jmp 3f
/*
@@ -237,7 +296,7 @@ ENTRY(startup_32_smp)
* NOTE! We have to correct for the fact that we're
* not yet offset PAGE_OFFSET..
*/
-#define cr4_bits mmu_cr4_features-__PAGE_OFFSET
+#define cr4_bits pa(mmu_cr4_features)
movl cr4_bits,%edx
andl %edx,%edx
jz 6f
@@ -278,7 +337,7 @@ ENTRY(startup_32_smp)
/*
* Enable paging
*/
- movl $swapper_pg_dir-__PAGE_OFFSET,%eax
+ movl $pa(swapper_pg_dir),%eax
movl %eax,%cr3 /* set the page table pointer.. */
movl %cr0,%eax
orl $0x80000000,%eax
@@ -556,8 +615,12 @@ ENTRY(_stext)
.align PAGE_SIZE_asm
ENTRY(swapper_pg_dir)
.fill 1024,4,0
+#ifdef CONFIG_X86_PAE
ENTRY(swapper_pg_pmd)
.fill 1024,4,0
+#endif
+ENTRY(swapper_pg_fixmap)
+ .fill 1024,4,0
ENTRY(empty_zero_page)
.fill 4096,1,0



2008-01-22 10:05:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.


* H. Peter Anvin <[email protected]> wrote:

> I was thinking about this yesterday, and it seems to me that there are
> two cleaner options here...
>
> - either we should put in the full machinery to be able to run C code
> compiled with -fPIC/-fPIE before paging is enabled. Unfortunately gcc
> generates R_386_GOT32 relocations for external references even with
> -fPIE, so we'll have to put in some code to adjust the GOT (easy
> enough to do.)

i'd _love_ to have this approach instead of the assembly routines. While
'constructing pagetables' might not look like a big deal in isolation -
C is still 10 times more programmable than assembly. Pushing more of the
early boot code into a sane, non-assembly environment will have positive
long-term effects all across.

Ingo
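
A sketch of the GOT adjustment Peter describes might look like the
following; __got_start/__got_end are assumed linker-script symbols
bracketing the GOT and do not exist in the kernel today, so this is an
illustration of the idea rather than proposed code:

extern unsigned long __got_start[], __got_end[];	/* assumed symbols */

static void adjust_got(unsigned long page_offset)
{
	unsigned long *entry;

	/* Each GOT entry holds a link-time virtual address; rebase it
	 * to the physical address it occupies before paging is on. */
	for (entry = __got_start; entry < __got_end; entry++)
		*entry -= page_offset;
}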

2008-01-22 13:47:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86_32: Remove unnecessary use of %ebx as the boot cpu flag


* Ian Campbell <[email protected]> wrote:

> Currently in head_32.S there are two ways we test to see if we are the
> boot cpu. By looking at %ebx and by looking at the static variable
> ready. When changing things around I have found that it gets tricky
> to preserve %ebx. So this patch just switches head.S over to the more
> reliable test of always using ready.
>
> Hopefully later we can kill these tests entirely.

FYI, i've applied this patch and your next patch to x86.git. (still
waiting with the third patch, until your discussion of this topic with
Peter settles down.) Thanks,

Ingo

2008-01-22 16:25:28

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Ingo Molnar wrote:
> * H. Peter Anvin <[email protected]> wrote:
>
>> I was thinking about this yesterday, and it seems to me that there are
>> two cleaner options here...
>>
>> - either we should put in the full machinery to be able to run C code
>> compiled with -fPIC/-fPIE before paging is enabled. Unfortunately gcc
>> generates R_386_GOT32 relocations for external references even with
>> -fPIE, so we'll have to put in some code to adjust the GOT (easy
>> enough to do.)
>
> i'd _love_ to have this approach instead of the assembly routines. While
> 'constructing pagetables' might not look like a big deal in isolation -
> C is still 10 times more programmable than assembly. Pushing more of the
> early boot code into a sane, non-assembly environment will have positive
> long-term effects all across.
>

Yes, but that doesn't mean that this particular task is the right thing
for that job. In particular, the GOT adjustment will be almost the same
size as the whole task.

On the other hand, there is a whole bunch of post-paging code in
head_32.S which doesn't need to be there.

-hpa

2008-01-22 17:37:19

by Ian Campbell

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.


On Mon, 2008-01-21 at 18:16 -0800, H. Peter Anvin wrote:
> Ian Campbell wrote:
> >
> > I'm just preparing to send out a version which uses the native_* way of
> > doing things, its not actually as clean as I would like so I'd be
> > interested to see the ASM variant.
> >
>
> This is the asm version I came up with.

I moderately prefer the C version, even if it is in a restricted
environment where care is needed to access global variables. I like that
it avoids multiple copies of the code, and I also find that the structure
of what's going on is more readily apparent (even to someone who has done
plenty of ASM-mode page table frobbing in the past).

Anyhow, I don't feel all that strongly about it so if the opinion of the
early start of day maintainer(s) is strongly in favour of ASM I'll defer
to that.

> This is only the actual
> assembly part; it doesn't require the (obviously necessary) bootmem
> adjustments.

Do you mean the native_pagetable_setup_start/done changes? I'm a bit
confused by not requiring obviously necessary changes -- I presume you
just mean that those changes are desirable but should be deferred into a
separate patch?

The C way doesn't inherently require those two changes to happen in the
same patch either -- probably worth splitting out if we go that route.

Ian.

--
Ian Campbell
Current Noise: Pelican - Bliss In Concrete

Your life would be very empty if you had nothing to regret.

2008-01-22 18:24:35

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Ian Campbell wrote:
> On Mon, 2008-01-21 at 18:16 -0800, H. Peter Anvin wrote:
>> Ian Campbell wrote:
>>> I'm just preparing to send out a version which uses the native_* way of
>>> doing things, its not actually as clean as I would like so I'd be
>>> interested to see the ASM variant.
>>>
>> This is the asm version I came up with.
>
> I moderately prefer the C version, even if it is in a restricted
> environment where care is needed to access global variables. I like that
> it avoids multiple copies of the code, and I also find that the structure
> of what's going on is more readily apparent (even to someone who has done
> plenty of ASM-mode page table frobbing in the past).
>
> Anyhow, I don't feel all that strongly about it so if the opinion of the
> early start of day maintainer(s) is strongly in favour of ASM I'll defer
> to that.
>

My opinion is that I want it done properly (PIC and all that jazz) or
not at all, and certainly would not want to mix linear and
paging-enabled code in the same file. When it comes to assembly code,
at least people can *see* that there be dragons.

The plus *and* minus of a C version is that it's easier for people to
modify. The plus side of that is that if we really need it, it's a lot
cleaner; the minus side is that it may encourage more code to creep into
the pre-paging code, which would not be a good thing IMO.

-hpa

2008-01-22 19:49:19

by Ian Campbell

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.


On Tue, 2008-01-22 at 10:23 -0800, H. Peter Anvin wrote:
> Ian Campbell wrote:
> > Anyhow, I don't feel all that strongly about it so if the opinion of the
> > early start of day maintainer(s) is strongly in favour of ASM I'll defer
> > to that.
> >
>
> My opinion is that I want it done properly (PIC and all that jazz) or
> not at all, and certainly would not want to mix linear and
> paging-enabled code in the same file. When it comes to assembly code,
> at least people can *see* that there be dragons.
>
> The plus *and* minus of a C version is that it's easier for people to
> modify. The plus side of that is that if we really need it, it's a lot
> cleaner; the minus side is that it may encourage more code to creep into
> the pre-paging code, which would not be a good thing IMO.

Seems reasonable to me. I'll integrate your asm diff with the other
changes and give it a whirl.

Ian.
--
Ian Campbell

Never go to bed mad. Stay up and fight.
-- Phyllis Diller, "Phyllis Diller's Housekeeping Hints"

2008-01-22 20:02:27

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index f409fe2..d6a1e04 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -18,6 +18,10 @@
#include <asm/thread_info.h>
#include <asm/asm-offsets.h>
#include <asm/setup.h>
+#include <asm/processor-flags.h>
+
+/* Physical address */
+#define pa(X) ((X) - __PAGE_OFFSET)

/*
* References to members of the new_cpu_data structure.
@@ -79,10 +83,6 @@ INIT_MAP_BEYOND_END = BOOTBITMAP_SIZE + (PAGE_TABLE_SIZE + ALLOCATOR_SLOP)*PAGE_
*/
.section .text.head,"ax",@progbits
ENTRY(startup_32)
- /* check to see if KEEP_SEGMENTS flag is meaningful */
- cmpw $0x207, BP_version(%esi)
- jb 1f
-
/* test KEEP_SEGMENTS flag to see if the bootloader is asking
us to not reload segments */
testb $(1<<6), BP_loadflags(%esi)
@@ -91,7 +91,7 @@ ENTRY(startup_32)
/*
* Set segments to known values.
*/
-1: lgdt boot_gdt_descr - __PAGE_OFFSET
+1: lgdt pa(boot_gdt_descr)
movl $(__BOOT_DS),%eax
movl %eax,%ds
movl %eax,%es
@@ -104,8 +104,8 @@ ENTRY(startup_32)
*/
cld
xorl %eax,%eax
- movl $__bss_start - __PAGE_OFFSET,%edi
- movl $__bss_stop - __PAGE_OFFSET,%ecx
+ movl $pa(__bss_start),%edi
+ movl $pa(__bss_stop),%ecx
subl %edi,%ecx
shrl $2,%ecx
rep ; stosl
@@ -117,31 +117,32 @@ ENTRY(startup_32)
* (kexec on panic case). Hence copy out the parameters before initializing
* page tables.
*/
- movl $(boot_params - __PAGE_OFFSET),%edi
+ movl $pa(boot_params),%edi
movl $(PARAM_SIZE/4),%ecx
cld
rep
movsl
- movl boot_params - __PAGE_OFFSET + NEW_CL_POINTER,%esi
+ movl pa(boot_params) + NEW_CL_POINTER,%esi
andl %esi,%esi
jz 1f # No command line
- movl $(boot_command_line - __PAGE_OFFSET),%edi
+ movl $pa(boot_command_line),%edi
movl $(COMMAND_LINE_SIZE/4),%ecx
rep
movsl
1:

#ifdef CONFIG_PARAVIRT
- cmpw $0x207, (boot_params + BP_version - __PAGE_OFFSET)
+ /* This can only trip for a broken bootloader... */
+ cmpw $0x207, pa(boot_params + BP_version)
jb default_entry

/* Paravirt-compatible boot parameters. Look to see what architecture
we're booting under. */
- movl (boot_params + BP_hardware_subarch - __PAGE_OFFSET), %eax
+ movl pa(boot_params + BP_hardware_subarch), %eax
cmpl $num_subarch_entries, %eax
jae bad_subarch

- movl subarch_entries - __PAGE_OFFSET(,%eax,4), %eax
+ movl pa(subarch_entries)(,%eax,4), %eax
subl $__PAGE_OFFSET, %eax
jmp *%eax

@@ -167,17 +168,74 @@ num_subarch_entries = (. - subarch_entries) / 4
* Mappings are created both at virtual address 0 (identity mapping)
* and PAGE_OFFSET for up to _end+sizeof(page tables)+INIT_MAP_BEYOND_END.
*
- * Warning: don't use %esi or the stack in this code. However, %esp
- * can be used as a GPR if you really need it...
+ * Note that the stack is not yet set up!
*/
-page_pde_offset = (__PAGE_OFFSET >> 20);
+#define PTE_ATTR 0x007 /* PRESENT+RW+USER */
+#define PDE_ATTR 0x067 /* PRESENT+RW+USER+DIRTY+ACCESSED */
+#define PGD_ATTR 0x001 /* PRESENT (no other attributes) */

default_entry:
- movl $(pg0 - __PAGE_OFFSET), %edi
- movl $(swapper_pg_dir - __PAGE_OFFSET), %edx
- movl $0x007, %eax /* 0x007 = PRESENT+RW+USER */
+#ifdef CONFIG_X86_PAE
+ /*
+ * In PAE mode, the kernel PMD is shared, and __PAGE_OFFSET
+ * is guaranteed to be a multiple of 1 GB (the PGD granularity).
+ * Thus, we only need to set up a single PMD here; the identity
+ * mapping is handled by pointing two PGD entries to the PMD.
+ *
+ * Note the upper half of each PMD or PTE is always zero at
+ * this stage.
+ */
+page_pde_offset = (__PAGE_OFFSET >> 27);
+
+ movl %cr4, %eax
+ orl $X86_CR4_PAE, %eax
+ movl %eax, %cr4
+
+ xorl %ebx,%ebx /* %ebx is kept at zero */
+
+ movl $pa(pg0), %edi
+ movl $pa(swapper_pg_pmd), %edx
+ movl $PTE_ATTR, %eax
+10:
+ leal PDE_ATTR(%edi),%ecx /* Create PMD entry */
+ movl %ecx,(%edx) /* Store PMD entry */
+ /* Upper half already zero */
+ addl $8,%edx
+ movl $512,%ecx
+11:
+ stosl
+ xchgl %eax,%ebx
+ stosl
+ xchgl %eax,%ebx
+ addl $0x1000,%eax
+ loop 11b
+
+ /*
+ * End condition: we must map up to and including INIT_MAP_BEYOND_END
+ * bytes beyond the end of our own page tables.
+ */
+ leal (INIT_MAP_BEYOND_END+PTE_ATTR)(%edi),%ebp
+ cmpl %ebp,%eax
+ jb 10b
+ movl %edi,pa(init_pg_tables_end)
+
+ /* Set up the PGD */
+ movl $pa(swapper_pg_pmd)+PGD_ATTR, %eax
+ movl %eax, pa(swapper_pg_dir) /* Identity map */
+ movl %eax, pa(swapper_pg_dir+page_pde_offset) /* Kernel map */
+
+ /* Do early initialization of the fixmap area */
+ movl $pa(swapper_pg_fixmap)+PDE_ATTR,%eax
+ movl %eax,pa(swapper_pg_pmd+0xff8)
+#else /* Not PAE */
+
+page_pde_offset = (__PAGE_OFFSET >> 20);
+
+ movl $pa(pg0), %edi
+ movl $pa(swapper_pg_dir), %edx
+ movl $PTE_ATTR, %eax
10:
- leal 0x007(%edi),%ecx /* Create PDE entry */
+ leal PDE_ATTR(%edi),%ecx /* Create PDE entry */
movl %ecx,(%edx) /* Store identity PDE entry */
movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */
addl $4,%edx
@@ -186,19 +244,20 @@ default_entry:
stosl
addl $0x1000,%eax
loop 11b
- /* End condition: we must map up to and including INIT_MAP_BEYOND_END */
- /* bytes beyond the end of our own page tables; the +0x007 is the attribute bits */
- leal (INIT_MAP_BEYOND_END+0x007)(%edi),%ebp
+ /*
+ * End condition: we must map up to and including INIT_MAP_BEYOND_END
+ * bytes beyond the end of our own page tables; the +0x007 is
+ * the attribute bits
+ */
+ leal (INIT_MAP_BEYOND_END+PTE_ATTR)(%edi),%ebp
cmpl %ebp,%eax
jb 10b
- movl %edi,(init_pg_tables_end - __PAGE_OFFSET)
-
- /* Do an early initialization of the fixmap area */
- movl $(swapper_pg_dir - __PAGE_OFFSET), %edx
- movl $(swapper_pg_pmd - __PAGE_OFFSET), %eax
- addl $0x67, %eax /* 0x67 == _PAGE_TABLE */
- movl %eax, 4092(%edx)
+ movl %edi,pa(init_pg_tables_end)

+ /* Do early initialization of the fixmap area */
+ movl $pa(swapper_pg_fixmap)+PDE_ATTR,%eax
+ movl %eax,pa(swapper_pg_dir+0xffc)
+#endif
xorl %ebx,%ebx /* This is the boot CPU (BSP) */
jmp 3f
/*
@@ -237,7 +296,7 @@ ENTRY(startup_32_smp)
* NOTE! We have to correct for the fact that we're
* not yet offset PAGE_OFFSET..
*/
-#define cr4_bits mmu_cr4_features-__PAGE_OFFSET
+#define cr4_bits pa(mmu_cr4_features)
movl cr4_bits,%edx
andl %edx,%edx
jz 6f
@@ -278,10 +337,10 @@ ENTRY(startup_32_smp)
/*
* Enable paging
*/
- movl $swapper_pg_dir-__PAGE_OFFSET,%eax
+ movl $pa(swapper_pg_dir),%eax
movl %eax,%cr3 /* set the page table pointer.. */
movl %cr0,%eax
- orl $0x80000000,%eax
+ orl $X86_CR0_PG,%eax
movl %eax,%cr0 /* ..and set paging (PG) bit */
ljmp $__BOOT_CS,$1f /* Clear prefetch and normalize %eip */
1:
@@ -556,8 +615,12 @@ ENTRY(_stext)
.align PAGE_SIZE_asm
ENTRY(swapper_pg_dir)
.fill 1024,4,0
+#ifdef CONFIG_X86_PAE
ENTRY(swapper_pg_pmd)
.fill 1024,4,0
+#endif
+ENTRY(swapper_pg_fixmap)
+ .fill 1024,4,0
ENTRY(empty_zero_page)
.fill 4096,1,0

diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index cbba769..14c6c41 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -43,6 +43,7 @@
#include <asm/tlbflush.h>
#include <asm/sections.h>
#include <asm/paravirt.h>
+#include <asm/setup.h>

unsigned int __VMALLOC_RESERVE = 128 << 20;

@@ -353,44 +354,11 @@ extern void __init remap_numa_kva(void);

void __init native_pagetable_setup_start(pgd_t *base)
{
-#ifdef CONFIG_X86_PAE
- int i;
-
- /*
- * Init entries of the first-level page table to the
- * zero page, if they haven't already been set up.
- *
- * In a normal native boot, we'll be running on a
- * pagetable rooted in swapper_pg_dir, but not in PAE
- * mode, so this will end up clobbering the mappings
- * for the lower 24Mbytes of the address space,
- * without affecting the kernel address space.
- */
- for (i = 0; i < USER_PTRS_PER_PGD; i++)
- set_pgd(&base[i],
- __pgd(__pa(empty_zero_page) | _PAGE_PRESENT));
-
- /* Make sure kernel address space is empty so that a pagetable
- will be allocated for it. */
- memset(&base[USER_PTRS_PER_PGD], 0,
- KERNEL_PGD_PTRS * sizeof(pgd_t));
-#else
paravirt_alloc_pd(__pa(swapper_pg_dir) >> PAGE_SHIFT);
-#endif
}

void __init native_pagetable_setup_done(pgd_t *base)
{
-#ifdef CONFIG_X86_PAE
- /*
- * Add low memory identity-mappings - SMP needs it when
- * starting up on an AP from real-mode. In the non-PAE
- * case we already have these mappings through head.S.
- * All user-space mappings are explicitly cleared after
- * SMP startup.
- */
- set_pgd(&base[0], base[USER_PTRS_PER_PGD]);
-#endif
}

/*
@@ -399,9 +367,8 @@ void __init native_pagetable_setup_done(pgd_t *base)
* the boot process.
*
* If we're booting on native hardware, this will be a pagetable
- * constructed in arch/i386/kernel/head.S, and not running in PAE mode
- * (even if we'll end up running in PAE). The root of the pagetable
- * will be swapper_pg_dir.
+ * constructed in arch/x86/kernel/head_32.S. The root of the
+ * pagetable will be swapper_pg_dir.
*
* If we're booting paravirtualized under a hypervisor, then there are
* more options: we may already be running PAE, and the pagetable may
@@ -559,14 +526,6 @@ void __init paging_init(void)

load_cr3(swapper_pg_dir);

-#ifdef CONFIG_X86_PAE
- /*
- * We will bail out later - printk doesn't work right now so
- * the user would just see a hanging kernel.
- */
- if (cpu_has_pae)
- set_in_cr4(X86_CR4_PAE);
-#endif
__flush_tlb_all();

kmap_init();
@@ -696,10 +655,6 @@ void __init mem_init(void)
BUG_ON((unsigned long)high_memory > VMALLOC_START);
#endif /* double-sanity-check paranoia */

-#ifdef CONFIG_X86_PAE
- if (!cpu_has_pae)
- panic("cannot execute a PAE-enabled kernel on a PAE-less CPU!");
-#endif
if (boot_cpu_data.wp_works_ok < 0)
test_wp_bit();

diff --git a/arch/x86/mm/ioremap_32.c b/arch/x86/mm/ioremap_32.c
index 05a24cd..fa8a3ff 100644
--- a/arch/x86/mm/ioremap_32.c
+++ b/arch/x86/mm/ioremap_32.c
@@ -226,40 +226,45 @@ static int __init early_ioremap_debug_setup(char *str)
__setup("early_ioremap_debug", early_ioremap_debug_setup);

static __initdata int after_paging_init;
-static __initdata unsigned long bm_pte[1024]
+static __initdata pte_t bm_pte[PAGE_SIZE/sizeof(pte_t)]
__attribute__((aligned(PAGE_SIZE)));

-static inline unsigned long * __init early_ioremap_pgd(unsigned long addr)
+static inline pmd_t * __init early_ioremap_pmd(unsigned long addr)
{
- return (unsigned long *)swapper_pg_dir + ((addr >> 22) & 1023);
+ pgd_t *pgd = &swapper_pg_dir[pgd_index(addr)];
+ pud_t *pud = pud_offset(pgd, addr);
+ pmd_t *pmd = pmd_offset(pud, addr);
+
+ return pmd;
}

-static inline unsigned long * __init early_ioremap_pte(unsigned long addr)
+static inline pte_t * __init early_ioremap_pte(unsigned long addr)
{
- return bm_pte + ((addr >> PAGE_SHIFT) & 1023);
+ return &bm_pte[pte_index(addr)];
}

void __init early_ioremap_init(void)
{
- unsigned long *pgd;
+ pmd_t *pmd;

if (early_ioremap_debug)
printk("early_ioremap_init()\n");

- pgd = early_ioremap_pgd(fix_to_virt(FIX_BTMAP_BEGIN));
- *pgd = __pa(bm_pte) | _PAGE_TABLE;
+ pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
memset(bm_pte, 0, sizeof(bm_pte));
+ set_pmd(pmd, __pmd(__pa(bm_pte) | _PAGE_TABLE));
+
/*
- * The boot-ioremap range spans multiple pgds, for which
+ * The boot-ioremap range spans multiple pmds, for which
* we are not prepared:
*/
- if (pgd != early_ioremap_pgd(fix_to_virt(FIX_BTMAP_END))) {
+ if (pmd != early_ioremap_pmd(fix_to_virt(FIX_BTMAP_END))) {
WARN_ON(1);
- printk("pgd %p != %p\n",
- pgd, early_ioremap_pgd(fix_to_virt(FIX_BTMAP_END)));
- printk("fix_to_virt(FIX_BTMAP_BEGIN): %08lx\n",
+ printk(KERN_WARNING "pmd %p != %p\n",
+ pmd, early_ioremap_pmd(fix_to_virt(FIX_BTMAP_END)));
+ printk(KERN_WARNING "fix_to_virt(FIX_BTMAP_BEGIN): %08lx\n",
fix_to_virt(FIX_BTMAP_BEGIN));
- printk("fix_to_virt(FIX_BTMAP_END): %08lx\n",
+ printk(KERN_WARNING "fix_to_virt(FIX_BTMAP_END): %08lx\n",
fix_to_virt(FIX_BTMAP_END));

printk("FIX_BTMAP_END: %d\n", FIX_BTMAP_END);
@@ -269,27 +274,28 @@ void __init early_ioremap_init(void)

void __init early_ioremap_clear(void)
{
- unsigned long *pgd;
+ pmd_t *pmd;

if (early_ioremap_debug)
printk("early_ioremap_clear()\n");

- pgd = early_ioremap_pgd(fix_to_virt(FIX_BTMAP_BEGIN));
- *pgd = 0;
+ pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
+ pmd_clear(pmd);
__flush_tlb_all();
}

void __init early_ioremap_reset(void)
{
enum fixed_addresses idx;
- unsigned long *pte, phys, addr;
+ unsigned long addr, phys;
+ pte_t *pte;

after_paging_init = 1;
for (idx = FIX_BTMAP_BEGIN; idx <= FIX_BTMAP_END; idx--) {
addr = fix_to_virt(idx);
pte = early_ioremap_pte(addr);
- if (!*pte & _PAGE_PRESENT) {
- phys = *pte & PAGE_MASK;
+ if (pte_present(*pte)) {
+ phys = pte_val(*pte) & PAGE_MASK;
set_fixmap(idx, phys);
}
}
@@ -298,7 +304,8 @@ void __init early_ioremap_reset(void)
static void __init __early_set_fixmap(enum fixed_addresses idx,
unsigned long phys, pgprot_t flags)
{
- unsigned long *pte, addr = __fix_to_virt(idx);
+ unsigned long addr = __fix_to_virt(idx);
+ pte_t *pte;

if (idx >= __end_of_fixed_addresses) {
BUG();
@@ -306,9 +313,9 @@ static void __init __early_set_fixmap(enum fixed_addresses idx,
}
pte = early_ioremap_pte(addr);
if (pgprot_val(flags))
- *pte = (phys & PAGE_MASK) | pgprot_val(flags);
+ set_pte(pte, pfn_pte(phys >> PAGE_SHIFT, flags));
else
- *pte = 0;
+ pte_clear(NULL, addr, pte);
__flush_tlb_one(addr);
}

diff --git a/include/asm-x86/page_32.h b/include/asm-x86/page_32.h
index 11c4b39..8fc0473 100644
--- a/include/asm-x86/page_32.h
+++ b/include/asm-x86/page_32.h
@@ -48,7 +48,6 @@ typedef unsigned long pgprotval_t;
typedef unsigned long phys_addr_t;

typedef union { pteval_t pte, pte_low; } pte_t;
-typedef pte_t boot_pte_t;

#endif /* __ASSEMBLY__ */
#endif /* CONFIG_X86_PAE */
diff --git a/include/asm-x86/pgtable_32.h b/include/asm-x86/pgtable_32.h
index 11c8b73..c07389b 100644
--- a/include/asm-x86/pgtable_32.h
+++ b/include/asm-x86/pgtable_32.h
@@ -55,10 +55,6 @@ int text_address(unsigned long);
#define USER_PGD_PTRS (PAGE_OFFSET >> PGDIR_SHIFT)
#define KERNEL_PGD_PTRS (PTRS_PER_PGD-USER_PGD_PTRS)

-#define TWOLEVEL_PGDIR_SHIFT 22
-#define BOOT_USER_PGD_PTRS (__PAGE_OFFSET >> TWOLEVEL_PGDIR_SHIFT)
-#define BOOT_KERNEL_PGD_PTRS (1024-BOOT_USER_PGD_PTRS)
-
/* Just any arbitrary offset to the start of the vmalloc VM area: the
* current 8MB value just means that there will be a 8MB "hole" after the
* physical memory until the kernel virtual memory starts. That means that



2008-01-22 20:36:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.


* H. Peter Anvin <[email protected]> wrote:

>> Seems reasonable to me. I'll integrate your asm diff with the other
>> changes and give it a whirl.
>
> This version boots into userspace on both PAE and !PAE. You want to
> take it from here?

ok, i'll wait for Ian to submit the final (tested) version then. A few
possible complications are: PSE-less boxes, 32-bit PAGEALLOC bootups
with tons of RAM, NX-less boxes and NX-able boxes :)

Ingo

2008-01-22 20:44:18

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Ingo Molnar wrote:
> * H. Peter Anvin <[email protected]> wrote:
>
>>> Seems reasonable to me. I'll integrate your asm diff with the other
>>> changes and give it a whirl.
>> This version boots into userspace on both PAE and !PAE. You want to
>> take it from here?
>
> ok, i'll wait for Ian to submit the final (tested) version then. A few
> possible complications are: PSE-less boxes, 32-bit PAGEALLOC bootups
> with tons of RAM, NX-less boxes and NX-able boxes :)

PSE-less should be less of an issue than making sure we switch to using
large pages where appropriate, and enable the PGE and NX bits where
appropriate.

-hpa

2008-01-22 20:46:16

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.


* H. Peter Anvin <[email protected]> wrote:

>> ok, i'll wait for Ian to submit the final (tested) version then. A
>> few possible complications are: PSE-less boxes, 32-bit PAGEALLOC
>> bootups with tons of RAM, NX-less boxes and NX-able boxes :)
>
> PSE-less should be less of an issue than making sure we switch to
> using large pages where appropriate, and enable the PGE and NX bits
> where appropriate.

yeah - and that would be the right point to enable gigapages as well -
once we have all this stuff consolidated and unified from the ground up.

Ingo

2008-01-22 20:53:06

by Ian Campbell

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.


On Tue, 2008-01-22 at 21:36 +0100, Ingo Molnar wrote:
> * H. Peter Anvin <[email protected]> wrote:
>
> >> Seems reasonable to me. I'll integrate your asm diff with the other
> >> changes and give it a whirl.
> >
> > This version boots into userspace on both PAE and !PAE. You want to
> > take it from here?
>
> ok, i'll wait for Ian to submit the final (tested) version then. A few
> possible complications are: PSE-less boxes, 32-bit PAGEALLOC bootups
> with tons of RAM, NX-less boxes and NX-able boxes :)

I'm not sure I can promise that sort of coverage ;-) Will test on what
hardware I've got available...

Ian.
--
Ian Campbell

Modesty is a vastly overrated virtue.
-- J. K. Galbraith

2008-01-22 21:00:17

by Harvey Harrison

[permalink] [raw]
Subject: [PATCH] x86: make nx_enabled conditional on CONFIG_X86_PAE

nx_enabled can only be set to non-zero when CONFIG_X86_PAE is
set. The only use not currently inside a CONFIG_X86_PAE block
is the definition, the declaration and a conditional unlikely
test in fault_32.c (is_prefetch).

When !CONFIG_X86_PAE, is_prefetch always returns 0 immediately
as nx_enabled is always 0.

When CONFIG_X86_PAE, the test is preserved, but the test against
the cpu model and stepping is deleted; this may not be correct.

Signed-off-by: Harvey Harrison <[email protected]>
---
Ingo, further to your nx vs !nx comment, had this lying around,
needs testing, only affects the CONFIG_X86_PAE case.

arch/x86/mm/fault_32.c | 17 ++++++++---------
arch/x86/mm/fault_64.c | 17 ++++++++---------
arch/x86/mm/init_32.c | 4 +---
include/asm-x86/page_32.h | 2 ++
4 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/arch/x86/mm/fault_32.c b/arch/x86/mm/fault_32.c
index 0bd2417..049c3bb 100644
--- a/arch/x86/mm/fault_32.c
+++ b/arch/x86/mm/fault_32.c
@@ -81,16 +81,15 @@ static int is_prefetch(struct pt_regs *regs, unsigned long addr,
unsigned char *max_instr;

#ifdef CONFIG_X86_32
- if (unlikely(boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
- boot_cpu_data.x86 >= 6)) {
- /* Catch an obscure case of prefetch inside an NX page. */
- if (nx_enabled && (error_code & PF_INSTR))
- return 0;
- } else {
+# ifdef CONFIG_X86_PAE
+ /* If it was an exec fault on NX page, ignore */
+ if (nx_enabled && (error_code & PF_INSTR))
return 0;
- }
-#else
- /* If it was a exec fault ignore */
+# else
+ return 0;
+# endif
+#else /* CONFIG_X86_64 */
+ /* If it was an exec fault on NX page, ignore */
if (error_code & PF_INSTR)
return 0;
#endif
diff --git a/arch/x86/mm/fault_64.c b/arch/x86/mm/fault_64.c
index ccbb8e3..33e8ced 100644
--- a/arch/x86/mm/fault_64.c
+++ b/arch/x86/mm/fault_64.c
@@ -84,16 +84,15 @@ static int is_prefetch(struct pt_regs *regs, unsigned long addr,
unsigned char *max_instr;

#ifdef CONFIG_X86_32
- if (unlikely(boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
- boot_cpu_data.x86 >= 6)) {
- /* Catch an obscure case of prefetch inside an NX page. */
- if (nx_enabled && (error_code & PF_INSTR))
- return 0;
- } else {
+# ifdef CONFIG_X86_PAE
+ /* If it was an exec fault on NX page, ignore */
+ if (nx_enabled && (error_code & PF_INSTR))
return 0;
- }
-#else
- /* If it was a exec fault ignore */
+# else
+ return 0;
+# endif
+#else /* CONFIG_X86_64 */
+ /* If it was an exec fault on NX page, ignore */
if (error_code & PF_INSTR)
return 0;
#endif
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 358d3b9..317cf5d 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -478,13 +478,11 @@ void zap_low_mappings (void)
flush_tlb_all();
}

-int nx_enabled = 0;
-
pteval_t __supported_pte_mask __read_mostly = ~_PAGE_NX;
EXPORT_SYMBOL_GPL(__supported_pte_mask);

#ifdef CONFIG_X86_PAE
-
+int nx_enabled = 0;
static int disable_nx __initdata = 0;

/*
diff --git a/include/asm-x86/page_32.h b/include/asm-x86/page_32.h
index 11c4b39..251f972 100644
--- a/include/asm-x86/page_32.h
+++ b/include/asm-x86/page_32.h
@@ -65,7 +65,9 @@ typedef pte_t boot_pte_t;
#define pfn_valid(pfn) ((pfn) < max_mapnr)
#endif /* CONFIG_FLATMEM */

+#ifdef CONFIG_X86_PAE
extern int nx_enabled;
+#endif

/*
* This much address space is reserved for vmalloc() and iomap()
--
1.5.4.rc3.1118.gf6754c


2008-01-22 21:01:51

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Ian Campbell wrote:
> On Tue, 2008-01-22 at 21:36 +0100, Ingo Molnar wrote:
>> * H. Peter Anvin <[email protected]> wrote:
>>
>>>> Seems reasonable to me. I'll integrate your asm diff with the other
>>>> changes and give it a whirl.
>>> This version boots into userspace on both PAE and !PAE. You want to
>>> take it from here?
>> ok, i'll wait for Ian to submit the final (tested) version then. A few
>> possible complications are: PSE-less boxes, 32-bit PAGEALLOC bootups
>> with tons of RAM, NX-less boxes and NX-able boxes :)
>
> I'm not sure I can promise that sort of coverage ;-) Will test on what
> hardware I've got available...
>

I tend to use simulators (e.g. Qemu) quite a bit. They let you tune
this kind of stuff.

-hpa

2008-01-22 21:05:19

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: make nx_enabled conditional on CONFIG_X86_PAE


* Harvey Harrison <[email protected]> wrote:

> nx_enabled can only be set to non-zero when CONFIG_X86_PAE is set.
> The only use not currently inside a CONFIG_X86_PAE block is the
> definition, the declaration and a conditional unlikely test in
> fault_32.c (is_prefetch).
>
> When !CONFIG_X86_PAE, is_prefetch always returns 0 immediately as
> nx_enabled is always 0.
>
> When CONFIG_X86_PAE, the test is preserved, but the test against the
> cpu model and stepping is deleted; this may not be correct.

thanks, applied.

> Ingo, further to your nx vs !nx comment, had this lying around, needs
> testing, only affects the CONFIG_X86_PAE case.

will keep an eye on it.

How far away are you from unifying fault_32.c and fault_64.c? You
already managed to line up their sizes:

$ wc -l arch/x86/mm/fault_*.c
742 arch/x86/mm/fault_32.c
734 arch/x86/mm/fault_64.c

;-)

and the raw diff between them doesn't look that bad either:

1 file changed, 127 insertions(+), 135 deletions(-)

so we might as well take a shot at that?

Ingo

2008-01-22 21:08:03

by Harvey Harrison

[permalink] [raw]
Subject: Re: [PATCH] x86: make nx_enabled conditional on CONFIG_X86_PAE

On Tue, 2008-01-22 at 13:00 -0800, Harvey Harrison wrote:
> nx_enabled can only be set to non-zero when CONFIG_X86_PAE is
> set. The only use not currently inside a CONFIG_X86_PAE block
> is the definition, the declaration and a conditional unlikely
> test in fault_32.c (is_prefetch).
>
> When !CONFIG_X86_PAE, is_prefetch always returns 0 immediately
> as nx_enabled is always 0.
>
> When CONFIG_X86_PAE, the test is preserved, but the test against
> the cpu model and stepping is deleted; this may not be correct.
>
> Signed-off-by: Harvey Harrison <[email protected]>
> ---

Sorry, I missed the usage in kernel/acpi/wakeup_32.S; that's the only
other user. I don't know that code well enough to comment on the usage
there, but if anybody knows if that could be conditionalized, please
advise.


Harvey

2008-01-22 21:35:22

by Harvey Harrison

[permalink] [raw]
Subject: Re: [PATCH] x86: make nx_enabled conditional on CONFIG_X86_PAE

On Tue, 2008-01-22 at 22:04 +0100, Ingo Molnar wrote:
> * Harvey Harrison <[email protected]> wrote:
>
> > nx_enabled can only be set to non-zero when CONFIG_X86_PAE is set.
> > The only use not currently inside a CONFIG_X86_PAE block is the
> > definition, the declaration and a conditional unlikely test in
> > fault_32.c (is_prefetch).
> >
> > When !CONFIG_X86_PAE, is_prefetch always returns 0 immediately as
> > nx_enabled is always 0.
> >
> > When CONFIG_X86_PAE, the test is preserved, but the test against the
> > cpu model and stepping is deleted; this may not be correct.
>
> thanks, applied.

Hmmm, the extern nx_enabled in page_32.h is already within an
ifndef __ASSEMBLY__ block, so I'm not sure how wakeup_32.S could
be using it. Thoughts?

Harvey

2008-01-22 22:22:06

by Ian Campbell

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.


On Tue, 2008-01-22 at 13:00 -0800, H. Peter Anvin wrote:
> Ian Campbell wrote:
> > On Tue, 2008-01-22 at 21:36 +0100, Ingo Molnar wrote:
> >> * H. Peter Anvin <[email protected]> wrote:
> >>
> >>>> Seems reasonable to me. I'll integrate your asm diff with the other
> >>>> changes and give it a whirl.
> >>> This version boots into userspace on both PAE and !PAE. You want to
> >>> take it from here?
> >> ok, i'll wait for Ian to submit the final (tested) version then. A few
> >> possible complications are: PSE-less boxes, 32-bit PAGEALLOC bootups
> >> with tons of RAM, NX-less boxes and NX-able boxes :)
> >
> > I'm not sure I can promise that sort of coverage ;-) Will test on what
> > hardware I've got available...
> >
>
> I tend to use simulators (e.g. Qemu) quite a bit. They let you tune
> this kind of stuff.

So do I, but I'd never really investigated the option to fiddle with the
CPU type -- very useful though, thanks for the tip!

Ian.

--
Ian Campbell

No one can feel as helpless as the owner of a sick goldfish.

2008-01-23 11:21:31

by Harvey Harrison

[permalink] [raw]
Subject: Re: [PATCH] x86: make nx_enabled conditional on CONFIG_X86_PAE

On Wed, 2008-01-23 at 09:46 +0100, Andi Kleen wrote:
> Harvey Harrison <[email protected]> writes:
>
> > nx_enabled can only be set to non-zero when CONFIG_X86_PAE is
> > set. The only use not currently inside a CONFIG_X86_PAE block
> > is the definition, the declaration and a conditional unlikely
> > test in fault_32.c (is_prefetch).
>
> The variable is pretty useless anyways; it can be probably
> replaced with (__supported_pte_mask & _PAGE_NX). Just make
> sure that the disable option still works, but that should
> be possible with some care.
>
> So if you feel the need to clean up things here convert it
> to using that. That will automatically be optimized away
> on !PAE too because _PAGE_NX is 0 there.
>

Noted, will do.

Harvey
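
A minimal sketch of the conversion Andi suggests, assuming only the
current definitions of __supported_pte_mask and _PAGE_NX (nx_active()
is a hypothetical helper name, not an existing kernel function):

static inline int nx_active(void)
{
	/* On !PAE kernels _PAGE_NX is 0, so the compiler folds this
	 * test away entirely. */
	return (__supported_pte_mask & _PAGE_NX) != 0;
}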

2008-01-23 20:52:43

by Ian Campbell

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.


On Tue, 2008-01-22 at 21:36 +0100, Ingo Molnar wrote:
> * H. Peter Anvin <[email protected]> wrote:
>
> >> Seems reasonable to me. I'll integrate your asm diff with the other
> >> changes and give it a whirl.
> >
> > This version boots into userspace on both PAE and !PAE. You want to
> > take it from here?
>
> ok, i'll wait for Ian to submit the final (tested) version then. A few
> possible complications are: PSE-less boxes, 32-bit PAGEALLOC bootups
> with tons of RAM, NX-less boxes and NX-able boxes :)

FYI, CONFIG_DEBUG_PAGEALLOC+PAE is broken. I'll dig in but it might be
the weekend before I get a chance (there's a beer festival in town ;-)).

Ian.
--
Ian Campbell

Flattery will get you everywhere.

2008-01-24 01:06:27

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Ian Campbell wrote:
> FYI, CONFIG_DEBUG_PAGEALLOC+PAE is broken. I'll dig in but it might be
> the weekend before I get a chance (there's a beer festival in town ;-)).
>

I'm poking around trying to get Xen working again as well; I may end up
fixing it in passing.

At the moment I've got a problem with early_ioremap's bm_pte[] array
ending up hanging around in init's pagetable, which Xen is most unhappy
about.

J

2008-01-24 09:39:36

by Ian Campbell

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.


On Wed, 2008-01-23 at 17:06 -0800, Jeremy Fitzhardinge wrote:
> Ian Campbell wrote:
> > FYI, CONFIG_DEBUG_PAGEALLOC+PAE is broken. I'll dig in but it might be
> > the weekend before I get a chance (there's a beer festival in town ;-)).
> >
>
> I'm poking around trying to get Xen working again as well; I may end up
> fixing it in passing.

Turns out that the initial mapping extends well past the end of the
128M of physical memory when DEBUG_PAGEALLOC is enabled, due to the
extra 16M mapping. The overshoot is enough that it interferes with the
vmalloc area. I see the same problem without the patch on non-PAE if I
reduce the RAM to 64M. There will be ~twice as many PAE boot page
tables, which explains why it triggers at 128M with the patch
(INIT_MAP_BEYOND_END is 70M for !PAE and 138M with PAE when
DEBUG_PAGEALLOC is on).

I don't see the crash with PAE before this patch. I think because the
kernel page tables are wiped in native_pagetable_setup_start which would
have blown away the mappings past the end of physical memory, avoiding
the problem. I've added code to wipe those mappings past max_low_pfn
from the boot page tables in pagetable_setup_start which has solved the
problem and I think is the right way to do it since we don't know the
correct limit of physical RAM in head_32.

Tested patch follows. All four of {pae,!pae}x{paravirt,!paravirt}
+DEBUG_PAGEALLOC boot to userspace on qemu (smp:2 mem:128) using cpu
types 486 (no PSE or PAE or anything), pentium (PSE only), pentium3
(everything but NX) and qemu64 (everything, running in 32 bit mode
despite the name). The obvious exceptions are 486 and pentium, which
didn't boot the PAE versions to userspace -- they correctly bailed
during setup.

I'm not sure how PSE ever comes to be used, though -- an EFI-only thing?
Using the qemu monitor I could see a bunch of NX bits used when NX was
available.

I also booted both {!PAE,PAE}xPARAVIRT+DEBUG_PAGEALLOC on a physical
pentium4 with 4G RAM with and without mem=16M.

> At the moment I've got a problem with early_ioremap's bt_pte[] array
> ending up hanging around in init's pagetable, which Xen is most unhappy
> about.

That sounds a lot like the problem I was having with the patch I sent
you as well, although I never identified where the problematic mapping
was.

Ian.

---
x86_32: Construct 32 bit boot time page tables in native format.

Specifically the boot time page tables in a CONFIG_X86_PAE=y enabled
kernel are in PAE format.

early_ioremap is updated to use the standard page table accessors.

Clear any mappings beyond max_low_pfn from the boot page tables in
native_pagetable_setup_start because the initial mappings can extend
beyond the range of physical memory and into the vmalloc area.

Derived from patches by Eric Biederman and H. Peter Anvin.

Signed-off-by: Ian Campbell <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Eric W. Biederman <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Mika Penttilä <[email protected]>
Cc: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/head_32.S | 132 +++++++++++++++++++++++++++++++-----------
arch/x86/kernel/setup_32.c | 4 +
arch/x86/mm/init_32.c | 70 ++++++++--------------
arch/x86/mm/ioremap_32.c | 53 ++++++++++-------
include/asm-x86/page_32.h | 1 -
include/asm-x86/pgtable_32.h | 4 -
6 files changed, 158 insertions(+), 106 deletions(-)

diff --git a/arch/x86/kernel/head_32.S b/arch/x86/kernel/head_32.S
index 5d8c573..c6af2c0 100644
--- a/arch/x86/kernel/head_32.S
+++ b/arch/x86/kernel/head_32.S
@@ -19,6 +19,10 @@
#include <asm/thread_info.h>
#include <asm/asm-offsets.h>
#include <asm/setup.h>
+#include <asm/processor-flags.h>
+
+/* Physical address */
+#define pa(X) ((X) - __PAGE_OFFSET)
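+/* (Symbols are linked at their virtual address, i.e. physical address
+ * plus __PAGE_OFFSET; until paging is enabled they must be referenced
+ * through pa() to get at the physical address.) */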

/*
* References to members of the new_cpu_data structure.
@@ -80,10 +84,6 @@ INIT_MAP_BEYOND_END = BOOTBITMAP_SIZE + (PAGE_TABLE_SIZE + ALLOCATOR_SLOP)*PAGE_
*/
.section .text.head,"ax",@progbits
ENTRY(startup_32)
- /* check to see if KEEP_SEGMENTS flag is meaningful */
- cmpw $0x207, BP_version(%esi)
- jb 1f
-
/* test KEEP_SEGMENTS flag to see if the bootloader is asking
us to not reload segments */
testb $(1<<6), BP_loadflags(%esi)
@@ -92,7 +92,7 @@ ENTRY(startup_32)
/*
* Set segments to known values.
*/
-1: lgdt boot_gdt_descr - __PAGE_OFFSET
+ lgdt pa(boot_gdt_descr)
movl $(__BOOT_DS),%eax
movl %eax,%ds
movl %eax,%es
@@ -105,8 +105,8 @@ ENTRY(startup_32)
*/
cld
xorl %eax,%eax
- movl $__bss_start - __PAGE_OFFSET,%edi
- movl $__bss_stop - __PAGE_OFFSET,%ecx
+ movl $pa(__bss_start),%edi
+ movl $pa(__bss_stop),%ecx
subl %edi,%ecx
shrl $2,%ecx
rep ; stosl
@@ -118,31 +118,32 @@ ENTRY(startup_32)
* (kexec on panic case). Hence copy out the parameters before initializing
* page tables.
*/
- movl $(boot_params - __PAGE_OFFSET),%edi
+ movl $pa(boot_params),%edi
movl $(PARAM_SIZE/4),%ecx
cld
rep
movsl
- movl boot_params - __PAGE_OFFSET + NEW_CL_POINTER,%esi
+ movl pa(boot_params) + NEW_CL_POINTER,%esi
andl %esi,%esi
jz 1f # No command line
- movl $(boot_command_line - __PAGE_OFFSET),%edi
+ movl $pa(boot_command_line),%edi
movl $(COMMAND_LINE_SIZE/4),%ecx
rep
movsl
1:

#ifdef CONFIG_PARAVIRT
- cmpw $0x207, (boot_params + BP_version - __PAGE_OFFSET)
+ /* This can only trip for a broken bootloader... */
+ cmpw $0x207, pa(boot_params + BP_version)
jb default_entry

/* Paravirt-compatible boot parameters. Look to see what architecture
we're booting under. */
- movl (boot_params + BP_hardware_subarch - __PAGE_OFFSET), %eax
+ movl pa(boot_params + BP_hardware_subarch), %eax
cmpl $num_subarch_entries, %eax
jae bad_subarch

- movl subarch_entries - __PAGE_OFFSET(,%eax,4), %eax
+ movl pa(subarch_entries)(,%eax,4), %eax
subl $__PAGE_OFFSET, %eax
jmp *%eax

@@ -170,17 +171,77 @@ num_subarch_entries = (. - subarch_entries) / 4
* Mappings are created both at virtual address 0 (identity mapping)
* and PAGE_OFFSET for up to _end+sizeof(page tables)+INIT_MAP_BEYOND_END.
*
- * Warning: don't use %esi or the stack in this code. However, %esp
- * can be used as a GPR if you really need it...
+ * Note that the stack is not yet set up!
*/
-page_pde_offset = (__PAGE_OFFSET >> 20);
+#define PTE_ATTR 0x007 /* PRESENT+RW+USER */
+#define PDE_ATTR 0x067 /* PRESENT+RW+USER+DIRTY+ACCESSED */
+#define PGD_ATTR 0x001 /* PRESENT (no other attributes) */
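+/* (PGD_ATTR is PRESENT only because in PAE mode the top-level entries
+ * are PDPT entries, in which the RW/USER attribute bits are reserved.) */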

default_entry:
- movl $(pg0 - __PAGE_OFFSET), %edi
- movl $(swapper_pg_dir - __PAGE_OFFSET), %edx
- movl $0x007, %eax /* 0x007 = PRESENT+RW+USER */
+#ifdef CONFIG_X86_PAE
+ /*
+ * In PAE mode, the kernel PMD is shared, and __PAGE_OFFSET
+ * is guaranteed to be a multiple of 1 GB (the PGD granularity).
+ * Thus, we only need to set up a single PMD here; the identity
+ * mapping is handled by pointing two PGD entries to the PMD.
+ *
+ * Note the upper half of each PMD or PTE is always zero at
+ * this stage.
+ */
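+ /*
+ * (page_pde_offset is a byte offset into the PGD: the entry index is
+ * __PAGE_OFFSET >> 30, times 8 bytes per PAE entry, i.e. >> 27.)
+ */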
+page_pde_offset = (__PAGE_OFFSET >> 27);
+ xorl %ebx,%ebx /* %ebx is kept at zero */
+
+ movl $pa(pg0), %edi
+ movl $pa(swapper_pg_pmd), %edx
+ movl $PTE_ATTR, %eax
10:
- leal 0x007(%edi),%ecx /* Create PDE entry */
+ leal PDE_ATTR(%edi),%ecx /* Create PMD entry */
+ movl %ecx,(%edx) /* Store PMD entry */
+ /* Upper half already zero */
+ addl $8,%edx
+ movl $512,%ecx
+11:
+ stosl
+ xchgl %eax,%ebx
+ stosl
+ xchgl %eax,%ebx
+ addl $0x1000,%eax
+ /*
+ * End condition: we must map up to and including INIT_MAP_BEYOND_END
+ * bytes beyond the end of our own page tables.
+ */
+ leal (INIT_MAP_BEYOND_END+PTE_ATTR)(%edi),%ebp
+ cmpl %ebp,%eax
+ ja 1f
+ loop 11b
+
+ /*
+ * End condition: we must map up to and including INIT_MAP_BEYOND_END
+ * bytes beyond the end of our own page tables.
+ */
+ leal (INIT_MAP_BEYOND_END+PTE_ATTR)(%edi),%ebp
+ cmpl %ebp,%eax
+ jb 10b
+1:
+ movl %edi,pa(init_pg_tables_end)
+
+ /* Set up the PGD */
+ movl $pa(swapper_pg_pmd)+PGD_ATTR, %eax
+ movl %eax, pa(swapper_pg_dir) /* Identity map */
+ movl %eax, pa(swapper_pg_dir+page_pde_offset) /* Kernel map */
+
+ /* Do early initialization of the fixmap area */
+ movl $pa(swapper_pg_fixmap)+PDE_ATTR,%eax
+ movl %eax,pa(swapper_pg_pmd+0xff8)
+#else /* Not PAE */
+
+page_pde_offset = (__PAGE_OFFSET >> 20);
+
+ movl $pa(pg0), %edi
+ movl $pa(swapper_pg_dir), %edx
+ movl $PTE_ATTR, %eax
+10:
+ leal PDE_ATTR(%edi),%ecx /* Create PDE entry */
movl %ecx,(%edx) /* Store identity PDE entry */
movl %ecx,page_pde_offset(%edx) /* Store kernel PDE entry */
addl $4,%edx
@@ -189,19 +250,20 @@ default_entry:
stosl
addl $0x1000,%eax
loop 11b
- /* End condition: we must map up to and including INIT_MAP_BEYOND_END */
- /* bytes beyond the end of our own page tables; the +0x007 is the attribute bits */
- leal (INIT_MAP_BEYOND_END+0x007)(%edi),%ebp
+ /*
+ * End condition: we must map up to and including INIT_MAP_BEYOND_END
+ * bytes beyond the end of our own page tables; the +0x007 is
+ * the attribute bits
+ */
+ leal (INIT_MAP_BEYOND_END+PTE_ATTR)(%edi),%ebp
cmpl %ebp,%eax
jb 10b
- movl %edi,(init_pg_tables_end - __PAGE_OFFSET)
-
- /* Do an early initialization of the fixmap area */
- movl $(swapper_pg_dir - __PAGE_OFFSET), %edx
- movl $(swapper_pg_pmd - __PAGE_OFFSET), %eax
- addl $0x67, %eax /* 0x67 == _PAGE_TABLE */
- movl %eax, 4092(%edx)
+ movl %edi,pa(init_pg_tables_end)

+ /* Do early initialization of the fixmap area */
+ movl $pa(swapper_pg_fixmap)+PDE_ATTR,%eax
+ movl %eax,pa(swapper_pg_dir+0xffc)
+#endif
jmp 3f
/*
* Non-boot CPU entry point; entered from trampoline.S
@@ -241,7 +303,7 @@ ENTRY(startup_32_smp)
* NOTE! We have to correct for the fact that we're
* not yet offset PAGE_OFFSET..
*/
-#define cr4_bits mmu_cr4_features-__PAGE_OFFSET
+#define cr4_bits pa(mmu_cr4_features)
movl cr4_bits,%edx
andl %edx,%edx
jz 6f
@@ -276,10 +338,10 @@ ENTRY(startup_32_smp)
/*
* Enable paging
*/
- movl $swapper_pg_dir-__PAGE_OFFSET,%eax
+ movl $pa(swapper_pg_dir),%eax
movl %eax,%cr3 /* set the page table pointer.. */
movl %cr0,%eax
- orl $0x80000000,%eax
+ orl $X86_CR0_PG,%eax
movl %eax,%cr0 /* ..and set paging (PG) bit */
ljmp $__BOOT_CS,$1f /* Clear prefetch and normalize %eip */
1:
@@ -554,8 +616,12 @@ ENTRY(_stext)
.align PAGE_SIZE_asm
ENTRY(swapper_pg_dir)
.fill 1024,4,0
+#ifdef CONFIG_X86_PAE
ENTRY(swapper_pg_pmd)
.fill 1024,4,0
+#endif
+ENTRY(swapper_pg_fixmap)
+ .fill 1024,4,0
ENTRY(empty_zero_page)
.fill 4096,1,0

diff --git a/arch/x86/kernel/setup_32.c b/arch/x86/kernel/setup_32.c
index 2256338..b4b6652 100644
--- a/arch/x86/kernel/setup_32.c
+++ b/arch/x86/kernel/setup_32.c
@@ -153,7 +153,11 @@ struct cpuinfo_x86 new_cpu_data __cpuinitdata = { 0, 0, 0, 0, -1, 1, 0, 0, -1 };
struct cpuinfo_x86 boot_cpu_data __read_mostly = { 0, 0, 0, 0, -1, 1, 0, 0, -1 };
EXPORT_SYMBOL(boot_cpu_data);

+#ifndef CONFIG_X86_PAE
unsigned long mmu_cr4_features;
+#else
+unsigned long mmu_cr4_features = X86_CR4_PAE;
+#endif

/* for MCA, but anyone else can use it if they want */
unsigned int machine_id;
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index 358d3b9..b382889 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -43,6 +43,7 @@
#include <asm/tlbflush.h>
#include <asm/sections.h>
#include <asm/paravirt.h>
+#include <asm/setup.h>

unsigned int __VMALLOC_RESERVE = 128 << 20;

@@ -343,44 +344,35 @@ extern void __init remap_numa_kva(void);

void __init native_pagetable_setup_start(pgd_t *base)
{
-#ifdef CONFIG_X86_PAE
- int i;
+ unsigned long pfn, va;
+ pgd_t *pgd;
+ pud_t *pud;
+ pmd_t *pmd;
+ pte_t *pte;

/*
- * Init entries of the first-level page table to the
- * zero page, if they haven't already been set up.
- *
- * In a normal native boot, we'll be running on a
- * pagetable rooted in swapper_pg_dir, but not in PAE
- * mode, so this will end up clobbering the mappings
- * for the lower 24Mbytes of the address space,
- * without affecting the kernel address space.
+ * Remove any mappings which extend past the end of physical
+ * memory from the boot time page table.
*/
- for (i = 0; i < USER_PTRS_PER_PGD; i++)
- set_pgd(&base[i],
- __pgd(__pa(empty_zero_page) | _PAGE_PRESENT));
-
- /* Make sure kernel address space is empty so that a pagetable
- will be allocated for it. */
- memset(&base[USER_PTRS_PER_PGD], 0,
- KERNEL_PGD_PTRS * sizeof(pgd_t));
-#else
+ for (pfn = max_low_pfn + 1; pfn < 1<<(32-PAGE_SHIFT); pfn++) {
+ va = PAGE_OFFSET + (pfn<<PAGE_SHIFT);
+ pgd = base + pgd_index(va);
+ if (!pgd_present(*pgd))
+ break;
+ pud = pud_offset(pgd, va);
+ pmd = pmd_offset(pud, va);
+ if (!pmd_present(*pmd))
+ break;
+ pte = pte_offset_kernel(pmd, va);
+ if (!pte_present(*pte))
+ break;
+ pte_clear(NULL, va, pte);
+ }
paravirt_alloc_pd(__pa(swapper_pg_dir) >> PAGE_SHIFT);
-#endif
}

void __init native_pagetable_setup_done(pgd_t *base)
{
-#ifdef CONFIG_X86_PAE
- /*
- * Add low memory identity-mappings - SMP needs it when
- * starting up on an AP from real-mode. In the non-PAE
- * case we already have these mappings through head.S.
- * All user-space mappings are explicitly cleared after
- * SMP startup.
- */
- set_pgd(&base[0], base[USER_PTRS_PER_PGD]);
-#endif
}

/*
@@ -389,9 +381,8 @@ void __init native_pagetable_setup_done(pgd_t *base)
* the boot process.
*
* If we're booting on native hardware, this will be a pagetable
- * constructed in arch/i386/kernel/head.S, and not running in PAE mode
- * (even if we'll end up running in PAE). The root of the pagetable
- * will be swapper_pg_dir.
+ * constructed in arch/x86/kernel/head_32.S. The root of the
+ * pagetable will be swapper_pg_dir.
*
* If we're booting paravirtualized under a hypervisor, then there are
* more options: we may already be running PAE, and the pagetable may
@@ -408,6 +399,7 @@ static void __init pagetable_init (void)
unsigned long vaddr, end;
pgd_t *pgd_base = swapper_pg_dir;

+ printk(KERN_CRIT "%s\n", __FUNCTION__);
paravirt_pagetable_setup_start(pgd_base);

/* Enable PSE if available */
@@ -549,14 +541,6 @@ void __init paging_init(void)

load_cr3(swapper_pg_dir);

-#ifdef CONFIG_X86_PAE
- /*
- * We will bail out later - printk doesn't work right now so
- * the user would just see a hanging kernel.
- */
- if (cpu_has_pae)
- set_in_cr4(X86_CR4_PAE);
-#endif
__flush_tlb_all();

kmap_init();
@@ -686,10 +670,6 @@ void __init mem_init(void)
BUG_ON((unsigned long)high_memory > VMALLOC_START);
#endif /* double-sanity-check paranoia */

-#ifdef CONFIG_X86_PAE
- if (!cpu_has_pae)
- panic("cannot execute a PAE-enabled kernel on a PAE-less CPU!");
-#endif
if (boot_cpu_data.wp_works_ok < 0)
test_wp_bit();

diff --git a/arch/x86/mm/ioremap_32.c b/arch/x86/mm/ioremap_32.c
index 96c3ed2..7827f01 100644
--- a/arch/x86/mm/ioremap_32.c
+++ b/arch/x86/mm/ioremap_32.c
@@ -220,40 +220,45 @@ static int __init early_ioremap_debug_setup(char *str)
__setup("early_ioremap_debug", early_ioremap_debug_setup);

static __initdata int after_paging_init;
-static __initdata unsigned long bm_pte[1024]
+static __initdata pte_t bm_pte[PAGE_SIZE/sizeof(pte_t)]
__attribute__((aligned(PAGE_SIZE)));

-static inline unsigned long * __init early_ioremap_pgd(unsigned long addr)
+static inline pmd_t * __init early_ioremap_pmd(unsigned long addr)
{
- return (unsigned long *)swapper_pg_dir + ((addr >> 22) & 1023);
+ pgd_t *pgd = &swapper_pg_dir[pgd_index(addr)];
+ pud_t *pud = pud_offset(pgd, addr);
+ pmd_t *pmd = pmd_offset(pud, addr);
+
+ return pmd;
}

-static inline unsigned long * __init early_ioremap_pte(unsigned long addr)
+static inline pte_t * __init early_ioremap_pte(unsigned long addr)
{
- return bm_pte + ((addr >> PAGE_SHIFT) & 1023);
+ return &bm_pte[pte_index(addr)];
}

void __init early_ioremap_init(void)
{
- unsigned long *pgd;
+ pmd_t *pmd;

if (early_ioremap_debug)
printk("early_ioremap_init()\n");

- pgd = early_ioremap_pgd(fix_to_virt(FIX_BTMAP_BEGIN));
- *pgd = __pa(bm_pte) | _PAGE_TABLE;
+ pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
memset(bm_pte, 0, sizeof(bm_pte));
+ set_pmd(pmd, __pmd(__pa(bm_pte) | _PAGE_TABLE));
+
/*
- * The boot-ioremap range spans multiple pgds, for which
+ * The boot-ioremap range spans multiple pmds, for which
* we are not prepared:
*/
- if (pgd != early_ioremap_pgd(fix_to_virt(FIX_BTMAP_END))) {
+ if (pmd != early_ioremap_pmd(fix_to_virt(FIX_BTMAP_END))) {
WARN_ON(1);
- printk("pgd %p != %p\n",
- pgd, early_ioremap_pgd(fix_to_virt(FIX_BTMAP_END)));
- printk("fix_to_virt(FIX_BTMAP_BEGIN): %08lx\n",
+ printk(KERN_WARNING "pmd %p != %p\n",
+ pmd, early_ioremap_pmd(fix_to_virt(FIX_BTMAP_END)));
+ printk(KERN_WARNING "fix_to_virt(FIX_BTMAP_BEGIN): %08lx\n",
fix_to_virt(FIX_BTMAP_BEGIN));
- printk("fix_to_virt(FIX_BTMAP_END): %08lx\n",
+ printk(KERN_WARNING "fix_to_virt(FIX_BTMAP_END): %08lx\n",
fix_to_virt(FIX_BTMAP_END));

printk("FIX_BTMAP_END: %d\n", FIX_BTMAP_END);
@@ -263,27 +268,28 @@ void __init early_ioremap_init(void)

void __init early_ioremap_clear(void)
{
- unsigned long *pgd;
+ pmd_t *pmd;

if (early_ioremap_debug)
printk("early_ioremap_clear()\n");

- pgd = early_ioremap_pgd(fix_to_virt(FIX_BTMAP_BEGIN));
- *pgd = 0;
+ pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
+ pmd_clear(pmd);
__flush_tlb_all();
}

void __init early_ioremap_reset(void)
{
enum fixed_addresses idx;
- unsigned long *pte, phys, addr;
+ unsigned long addr, phys;
+ pte_t *pte;

after_paging_init = 1;
for (idx = FIX_BTMAP_BEGIN; idx <= FIX_BTMAP_END; idx--) {
addr = fix_to_virt(idx);
pte = early_ioremap_pte(addr);
- if (!*pte & _PAGE_PRESENT) {
- phys = *pte & PAGE_MASK;
+ if (pte_present(*pte)) {
+ phys = pte_val(*pte) & PAGE_MASK;
set_fixmap(idx, phys);
}
}
@@ -292,7 +298,8 @@ void __init early_ioremap_reset(void)
static void __init __early_set_fixmap(enum fixed_addresses idx,
unsigned long phys, pgprot_t flags)
{
- unsigned long *pte, addr = __fix_to_virt(idx);
+ unsigned long addr = __fix_to_virt(idx);
+ pte_t *pte;

if (idx >= __end_of_fixed_addresses) {
BUG();
@@ -300,9 +307,9 @@ static void __init __early_set_fixmap(enum fixed_addresses idx,
}
pte = early_ioremap_pte(addr);
if (pgprot_val(flags))
- *pte = (phys & PAGE_MASK) | pgprot_val(flags);
+ set_pte(pte, pfn_pte(phys >> PAGE_SHIFT, flags));
else
- *pte = 0;
+ pte_clear(NULL, addr, pte);
__flush_tlb_one(addr);
}

diff --git a/include/asm-x86/page_32.h b/include/asm-x86/page_32.h
index 11c4b39..8fc0473 100644
--- a/include/asm-x86/page_32.h
+++ b/include/asm-x86/page_32.h
@@ -48,7 +48,6 @@ typedef unsigned long pgprotval_t;
typedef unsigned long phys_addr_t;

typedef union { pteval_t pte, pte_low; } pte_t;
-typedef pte_t boot_pte_t;

#endif /* __ASSEMBLY__ */
#endif /* CONFIG_X86_PAE */
diff --git a/include/asm-x86/pgtable_32.h b/include/asm-x86/pgtable_32.h
index 72eb06c..e8a6195 100644
--- a/include/asm-x86/pgtable_32.h
+++ b/include/asm-x86/pgtable_32.h
@@ -55,10 +55,6 @@ void paging_init(void);
#define USER_PGD_PTRS (PAGE_OFFSET >> PGDIR_SHIFT)
#define KERNEL_PGD_PTRS (PTRS_PER_PGD-USER_PGD_PTRS)

-#define TWOLEVEL_PGDIR_SHIFT 22
-#define BOOT_USER_PGD_PTRS (__PAGE_OFFSET >> TWOLEVEL_PGDIR_SHIFT)
-#define BOOT_KERNEL_PGD_PTRS (1024-BOOT_USER_PGD_PTRS)
-
/* Just any arbitrary offset to the start of the vmalloc VM area: the
* current 8MB value just means that there will be a 8MB "hole" after the
* physical memory until the kernel virtual memory starts. That means that
--
1.5.3.8


2008-01-24 22:10:41

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Ian Campbell wrote:
>
> I'm not sure how PSE ever comes to be used, though -- an EFI-only thing?
> Using the qemu monitor I could see a bunch of NX bits used when NX was
> available.
>

This is part of the trickiness with re-using the early pagetables
instead of rebuilding them from scratch - if PSE is available, we have
two options:

- either we build PSE page tables early (which means detecting PSE,
which means if there are any chip-specific CPUID workarounds they have
to be present in the early code), or

- when building the "complete" page tables, coalesce !PSE pagetables
into PSE entries where appropriate.

For PAT to work right, the first chunk probably should *not* be a PSE
page table, which complicates things further. (There is no TLB impact,
since a PSE page table at offset zero, or one that otherwise has an
MTRR conflict, will be broken apart in hardware.) In the former case, it
means splitting it apart later; in the latter case it just means
excluding it from coalescing.
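
Concretely, the coalescing option might look something like the sketch
below. Illustrative only: try_coalesce_pmd() is a made-up name, va is
assumed PMD-aligned, and pfn_pmd()/PAGE_KERNEL_LARGE are the helpers the
large-page mapping code in init_32.c already uses. A real version would
also skip the first chunk and MTRR-conflicting ranges, per the above:

static int __init try_coalesce_pmd(pmd_t *pmd, unsigned long va)
{
	pte_t *pte = pte_offset_kernel(pmd, va);
	unsigned long pfn = pte_pfn(*pte);
	int i;

	/* need PSE, plus 4MB (2MB on PAE) physical alignment */
	if (!cpu_has_pse || (pfn & (PTRS_PER_PTE - 1)))
		return 0;

	/* every slot must be present and physically contiguous */
	for (i = 0; i < PTRS_PER_PTE; i++)
		if (!pte_present(pte[i]) || pte_pfn(pte[i]) != pfn + i)
			return 0;

	set_pmd(pmd, pfn_pmd(pfn, PAGE_KERNEL_LARGE));
	return 1;
}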

In other words, reusing the early page tables isn't all that
straightforward. It may easily be that it's better to build a new set
of page tables from scratch; however, it would *still* be beneficial to
have the early page tables be in the same format as the later one, since
it lets us use the fixmap area, and therefore {bt,early}_ioremap() much
sooner.
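
For instance, with the fixmap working that early, something along these
lines is possible long before ioremap() proper is available
(early_peek() is just a made-up wrapper around the early_ioremap() /
early_iounmap() calls from the patches in this thread):

static void __init early_peek(unsigned long phys)
{
	void *p = early_ioremap(phys, PAGE_SIZE);

	/* ... inspect a firmware table, BIOS area, etc. ... */

	early_iounmap(p, PAGE_SIZE);
}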

-hpa

2008-01-24 22:35:45

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

H. Peter Anvin wrote:
> In other words, reusing the early page tables isn't all that
> straightforward. It may easily be that it's better to build a new set
> of page tables from scratch; however, it would *still* be beneficial
> to have the early page tables be in the same format as the later one,
> since it lets us use the fixmap area, and therefore
> {bt,early}_ioremap() much sooner.

Yes, and it simplifies Xen as it always starts guest domains in the
appropriate pagetable mode and doesn't let the guest change it on the
fly. If early_ioremap depends on non-PAE early pagetables in an
otherwise PAE kernel, we'd need to go to some effort to make sure all
the early_ioremap stuff is skipped (which would be possible but
unpleasant for domU, but very bad in dom0).

J

2008-01-24 22:44:38

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> In other words, reusing the early page tables isn't all that
>> straightforward. It may easily be that it's better to build a new set
>> of page tables from scratch; however, it would *still* be beneficial
>> to have the early page tables be in the same format as the later one,
>> since it lets us use the fixmap area, and therefore
>> {bt,early}_ioremap() much sooner.
>
> Yes, and it simplifies Xen as it always starts guest domains in the
> appropriate pagetable mode and doesn't let the guest change it on the
> fly. If early_ioremap depends on non-PAE early pagetables in an
> otherwise PAE kernel, we'd need to go to some effort to make sure all
> the early_ioremap stuff is skipped (which would be possible but
> unpleasant for domU, but very bad in dom0).
>

Yeah, and it's ugly for the kernel proper, so that bit is a no-brainer.
It's just a matter of hammering out the details.

It doesn't sound from the above that you have any opinion either way
about reusing the initial page tables or creating a new set, as long as
they're in the same format.

-hpa

2008-01-24 22:59:42

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

H. Peter Anvin wrote:
> Yeah, and it's ugly for the kernel proper, so that bit is a
> no-brainer. It's just a matter of hammering out the details.
>
> It doesn't sound from the above that you have any opinion either way
> about reusing the initial page tables or creating a new set, as long
> as they're in the same format.

Right.

Xen provides an initial set of pagetables in the appropriate format, so
what head.S generates is moot. For simplicity I graft the Xen-provided
pagetables into swapper_pg_dir in xen_start_kernel, so it is the
functional equivalent to the head.S pagetable construction.
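
For reference, the graft amounts to adopting the hypervisor-built
tables as the starting kernel pagetable. A two-line sketch, assuming
the xen_start_info->pt_base handoff of this era (the real code has a
bit more bookkeeping):

	pgd_t *pgd = (pgd_t *)xen_start_info->pt_base;

	init_mm.pgd = pgd;	/* use the Xen pagetables to start */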

We also don't (yet) support PSE, so that's a non-issue for us too.

J

2008-01-24 23:12:43

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> Yeah, and it's ugly for the kernel proper, so that bit is a
>> no-brainer. It's just a matter of hammering out the details.
>>
>> It doesn't sound from the above that you have any opinion either way
>> about reusing the initial page tables or creating a new set, as long
>> as they're in the same format.
>
> Right.
>
> Xen provides an initial set of pagetables in the appropriate format, so
> what head.S generates is moot. For simplicity I graft the Xen-provided
> pagetables into swapper_pg_dir in xen_start_kernel, so it is the
> functional equivalent to the head.S pagetable construction.
>
> We also don't (yet) support PSE, so that's a non-issue for us too.
>

While we're mucking around in this area, there is another thing which we
should eventually get around to fixing:

we need a set of page tables with an identity mapping as well as the
kernel mapping, for trampolining (during startup, but also during things
like ACPI suspend/resume.) Right now, we let those be the swapper page
tables, but that's probably not really a good idea, since it can hide bugs.

-hpa

2008-01-24 23:40:48

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

H. Peter Anvin wrote:
> While we're mucking around in this area, there is another thing which
> we should eventually get around to fixing:
>
> we need a set of page tables with an identity mapping as well as the
> kernel mapping, for trampolining (during startup, but also during
> things like ACPI suspend/resume.) Right now, we let those be the
> swapper page tables, but that's probably not really a good idea, since
> it can hide bugs.

So you're suggesting a second system pagetable which has a P=V alias as
well as the normal kernel mapping, used only when we actually need that
alias? Sounds simple enough to arrange.

J

2008-01-24 23:48:38

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> While we're mucking around in this area, there is another thing which
>> we should eventually get around to fixing:
>>
>> we need a set of page tables with an identity mapping as well as the
>> kernel mapping, for trampolining (during startup, but also during
>> things like ACPI suspend/resume.) Right now, we let those be the
>> swapper page tables, but that's probably not really a good idea, since
>> it can hide bugs.
>
> So you're suggesting a second system pagetable which has a P=V alias as
> well as the normal kernel mapping, used only when we actually need that
> alias? Sounds simple enough to arrange.
>

Yes. We'd use it during initialization and at other times when we need
trampolining, but give the swapper something which only has the kernel map.

-hpa

2008-01-24 23:51:54

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

H. Peter Anvin wrote:
> Yes. We'd use it during initialization and at other times when we
> need trampolining, but give the swapper something which only has the
> kernel map.

Hm, though Xen makes it all a bit more complex, as usual. In the PAE
case it wouldn't allow the pmd to be shared, so you'd have to allocate a
new pmd and copy into it. There's probably a way to deal with it within
the existing paravirt hooks...

J

2008-01-24 23:55:52

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> While we're mucking around in this area, there is another thing which
>> we should eventually get around to fixing:
>>
>> we need a set of page tables with an identity mapping as well as the
>> kernel mapping, for trampolining (during startup, but also during
>> things like ACPI suspend/resume.) Right now, we let those be the
>> swapper page tables, but that's probably not really a good idea, since
>> it can hide bugs.
>
> So you're suggesting a second system pagetable which has a P=V alias as
> well as the normal kernel mapping, used only when we actually need that
> alias? Sounds simple enough to arrange.
>

I just looked at the ACPI suspend code, and it looks like it hacks its
own identity map at runtime. Pavel, am I reading that code right?

-hpa

2008-01-25 00:06:47

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> Yes. We'd use it during initialization and at other times when we
>> need trampolining, but give the swapper something which only has the
>> kernel map.
>
> Hm, though Xen makes it all a bit more complex, as usual. In the PAE
> case it wouldn't allow the pmd to be shared, so you'd have to allocate a
> new pmd and copy into it. There's probably a way to deal with it within
> the existing paravirt hooks...
>

Yeah, I'm aware of this particular piece of Xen braindamage, and
although I had some very unkind words to say about it, it mirrors what
we have to do for the !PAE case anyway, so it can be sort of glossed over.

-hpa

2008-01-25 00:12:10

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

H. Peter Anvin wrote:
> Yeah, I'm aware of this particular piece of Xen braindamage, and
> although I had some very unkind words to say about it, it mirrors what
> we have to do for the !PAE case anyway, so it can be sort of glossed
> over.

Sort of. If Xen weren't an issue, then both cases are a matter of
copying a set of entries from one place in the pgd to another.

It would be easy enough to add some code on xen side to look for pmd
aliases when using/pinning a pagetable, and allocate'n'copy a new pmd
page as needed. That way the core code can ignore the issue.

J

2008-01-25 00:20:25

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> Yeah, I'm aware of this particular piece of Xen braindamage, and
>> although I had some very unkind words to say about it, it mirrors what
>> we have to do for the !PAE case anyway, so it can be sort of glossed
>> over.
>
> Sort of. If Xen weren't an issue, then both cases are a matter of
> copying a set of entries from one place in the pgd to another.
>

No, if Xen wasn't an issue there wouldn't be anything to do for the PAE
case at all (since the PGD is trivial.)

Copying PMDs is more or less an analogous case of the !PAE case, once
the allocation is already done. The allocation should be trivial
though, since this would be a one-time thing.

> It would be easy enough to add some code on xen side to look for pmd
> aliases when using/pinning a pagetable, and allocate'n'copy a new pmd
> page as needed. That way the core code can ignore the issue.

As much as I'd rather see Xen fixing this than having it continue to
impact the kernel, I presume it will take some time to flush the broken
hypervisors out?

-hpa

2008-01-25 00:20:52

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

On Thu 2008-01-24 15:51:24, H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
>> H. Peter Anvin wrote:
>>> While we're mucking around in this area, there is another thing which we
>>> should eventually get around to fixing:
>>>
>>> we need a set of page tables with an identity mapping as well as the
>>> kernel mapping, for trampolining (during startup, but also during things
>>> like ACPI suspend/resume.) Right now, we let those be the swapper page
>>> tables, but that's probably not really a good idea, since it can hide
>>> bugs.
>>
>> So you're suggesting a second system pagetable which has a P=V alias as
>> well as the normal kernel mapping, used only when we actually need that
>> alias? Sounds simple enough to arrange.
>>
>
> I just looked at the ACPI suspend code, and it looks like it hacks its own
> identity map at runtime. Pavel, am I reading that code right?

Yes, I think so, I believe we do it on both 32 and 64 bit now.

(It is early here. And I almost got the .c wakeup code to work... it
already sets the mode).
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-01-25 00:31:34

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

H. Peter Anvin wrote:
> No, if Xen wasn't an issue there wouldn't be anything to do for the
> PAE case at all (since the PGD is trivial.)
>
> Copying PMDs is more or less an analogous case of the !PAE case, once
> the allocation is already done. The allocation should be trivial
> though, since this would be a one-time thing.

I think we're in vehement agreement here. In either case, it's just a
matter of something like:

memcpy(pgd, &pgd[USER_PTRS_PER_PGD], sizeof(pgd_t) * KERNEL_PTRS_PER_PGD);


which would work for both PAE and non-PAE.
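
Spelled out with the in-tree macro names (USER_PTRS_PER_PGD,
KERNEL_PGD_PTRS), a separate trampoline table could be built roughly as
below. trampoline_pg_dir and build_trampoline_pgd() are made-up names,
and this is the native case only:

static pgd_t trampoline_pg_dir[PTRS_PER_PGD]
	__attribute__((aligned(PAGE_SIZE)));

static void __init build_trampoline_pgd(void)
{
	int n = USER_PTRS_PER_PGD;

	/* the normal kernel mapping... */
	memcpy(trampoline_pg_dir, swapper_pg_dir,
	       PTRS_PER_PGD * sizeof(pgd_t));

	/*
	 * ...plus as much of it as fits aliased at virtual 0, giving
	 * phys == virt low down: min(user slots, kernel entries), so
	 * a 3G-kernel:1G-user split only gets a 1GB alias.
	 */
	if (n > KERNEL_PGD_PTRS)
		n = KERNEL_PGD_PTRS;
	memcpy(trampoline_pg_dir, &swapper_pg_dir[USER_PTRS_PER_PGD],
	       n * sizeof(pgd_t));
}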

>> It would be easy enough to add some code on xen side to look for pmd
>> aliases when using/pinning a pagetable, and allocate'n'copy a new pmd
>> page as needed. That way the core code can ignore the issue.
>
> As much as I'd rather see Xen fixing this than having it continue to
> impact the kernel, I presume it will take some time to flush the
> broken hypervisors out?

Sorry, I was unclear. I meant in the purely Xen-specific parts of the
kernel (arch/x86/xen). It wouldn't require a hypervisor change.

J

2008-01-25 00:32:23

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Pavel Machek wrote:
>>>
>> I just looked at the ACPI suspend code, and it looks like it hacks its own
>> identity map at runtime. Pavel, am I reading that code right?
>
> Yes, I think so, I believe we do it on both 32 and 64 bit now.
>

So the background to this... we need an identity map to trampoline at
early boot, obviously, but we'd like it to not stick around more than
necessary. We have zap_low_mappings() now but it's not really sufficient.

Secondary SMP processors need these mappings during trampolining --
presumably including CPU hotplug -- and I'm suspecting it might simply
make sense to use a separate set of page tables (with both the identity
and the kernel map) for trampolining and just keep them around. That
way they would be usable for ACPI as well.

> (It is early here. And I almost got the .c wakeup code to work... it
> already sets the mode).

Sweet!

-hpa

2008-01-25 00:41:18

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> No, if Xen wasn't an issue there wouldn't be anything to do for the
>> PAE case at all (since the PGD is trivial.)
>>
>> Copying PMDs is more or less an analogous case of the !PAE case, once
>> the allocation is already done. The allocation should be trivial
>> though, since this would be a one-time thing.
>
> I think we're in vehement agreement here. In either case, it's just a
> matter of something like:
>
> memcpy(pgd, &pgd[USER_PTRS_PER_PGD], sizeof(pgd_t) *
> KERNEL_PTRS_PER_PGD);
>
> which would work for both PAE and non-PAE.
>
>>> It would be easy enough to add some code on xen side to look for pmd
>>> aliases when using/pinning a pagetable, and allocate'n'copy a new pmd
>>> page as needed. That way the core code can ignore the issue.
>>
>> As much as I'd rather see Xen fixing this than having it continue to
>> impact the kernel, I presume it will take some time to flush the
>> broken hypervisors out?
>
> Sorry, I was unclear. I meant in the purely Xen-specific parts of the
> kernel (arch/x86/xen). It wouldn't require a hypervisor change.
>

Oh, that makes that option much more viable and probably preferable.

-hpa

2008-01-25 00:50:07

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

On Friday, 25 of January 2008, H. Peter Anvin wrote:
> Pavel Machek wrote:
> >>>
> >> I just looked at the ACPI suspend code, and it looks like it hacks its own
> >> identity map at runtime. Pavel, am I reading that code right?
> >
> > Yes, I think so, I believe we do it on both 32 and 64 bit now.

For clarity, are you referring to the code in arch/x86/kernel/acpi ?

Rafael

2008-01-25 01:13:35

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Rafael J. Wysocki wrote:
> On Friday, 25 of January 2008, H. Peter Anvin wrote:
>> Pavel Machek wrote:
>>>> I just looked at the ACPI suspend code, and it looks like it hacks its own
>>>> identity map at runtime. Pavel, am I reading that code right?
>>> Yes, I think so, I believe we do it on both 32 and 64 bit now.
>
> For clarity, are you referring to the code in arch/x86/kernel/acpi ?
>

Yes.

-hpa

2008-01-25 02:21:43

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

"H. Peter Anvin" <[email protected]> writes:

> Pavel Machek wrote:
>>>>
>>> I just looked at the ACPI suspend code, and it looks like it hacks its own
>>> identity map at runtime. Pavel, am I reading that code right?
>>
>> Yes, I think so, I believe we do it on both 32 and 64 bit now.
>>
>
> So the background to this... we need an identity map to trampoline at early
> boot, obviously, but we'd like it to not stick around more than necessary. We
> have zap_low_mappings() now but it's not really sufficient.
>
> Secondary SMP processors need these mappings during trampolining --
> presumably including CPU hotplug -- and I'm suspecting it might simply make
> sense to use a separate set of page tables (with both the identity and the
> kernel map) for trampolining and just keep them around. That way they would be
> usable for ACPI as well.

We already do this on the 64bit side. We reuse the kernel and the
identity parts from the core kernel page tables but it is actually
a distinct page table.

x86_64 has not had the identity mappings mapped in any of the
normal page tables since the relocatable kernel support was merged
a while ago.

Only on the 32bit side does this still remain an issue. I don't know
what we can do optimization-wise there. Emulating the 64bit code
and having a dedicated top level pgd (as part of the trampoline)
and then a mapping into it the kernel identity mapping and the kernel
mapping (which are the same on 32bit) should work fairly easily.

It is just a handful of pgd entries, and then in the actual kernel
entry code we reload %cr3 with the appropriate kernel page table
and we should be fine. No need for an explicit zap there at all.
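
That final switch is nothing more than the helpers we already have,
e.g.:

	/* done trampolining: switch to the kernel-only page tables */
	load_cr3(swapper_pg_dir);
	__flush_tlb_all();

after which the identity alias simply isn't reachable any more.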

Eric

2008-01-25 02:30:25

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Eric W. Biederman wrote:
>
> We already do this on the 64bit side. We reuse the kernel and the
> identity parts from the core kernel page tables but it is actually
> a distinct page table.
>
> x86_64 has not had the identity mappings mapped in any of the
> normal page tables since the relocatable kernel support was merged
> a while ago.
>
> Only on the 32bit side does this still remain an issue. I don't know
> what we can do optimization-wise there. Emulating the 64bit code
> and having a dedicated top level pgd (as part of the trampoline)
> and then a mapping into it the kernel identity mapping and the kernel
> mapping (which are the same on 32bit) should work fairly easily.
>
> It is just a handful of pgd entries, and then in the actual kernel
> entry code we reload %cr3 with the appropriate kernel page table
> and we should be fine. No need for an explicit zap there at all.
>

That's pretty much what I figure. The one issue is that on non-PAE
32-bit (or if we actually have to deal with unsharable PMDs on PAE
kernels) the PGD (PMD) kernel mappings at least formally should
really be put in sync. This could be done either by the same code which
keeps the PGDs of various processes in sync already or on demand; I
believe my personal preference would be to have all that in the same
place, since we have to do it anyway, and this is nothing different
except for the offset.
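
In other words, wherever we already write a kernel pgd entry into the
process pagetables, the trampoline tables would get the same entry
twice -- once at the kernel offset and once shifted down for the
identity alias. A hypothetical sketch, reusing the trampoline_pg_dir
name from earlier in the thread:

static void sync_trampoline_pgd(unsigned long kva)
{
	pgd_t ent = swapper_pg_dir[pgd_index(kva)];

	trampoline_pg_dir[pgd_index(kva)] = ent;
	trampoline_pg_dir[pgd_index(kva - PAGE_OFFSET)] = ent;
}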

-hpa

2008-01-25 02:59:59

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

"H. Peter Anvin" <[email protected]> writes:

> Jeremy Fitzhardinge wrote:
>> H. Peter Anvin wrote:
>>> No, if Xen wasn't an issue there wouldn't be anything to do for the PAE case
>>> at all (since the PGD is trivial.)
>>>
>>> Copying PMDs is more or less an analogous case of the !PAE case, once the
>>> allocation is already done. The allocation should be trivial though, since
>>> this would be a one-time thing.
>>
>> I think we're in vehement agreement here. In either case, it's just a matter
>> of something like:
>>
>> memcpy(pgd, &pgd[USER_PTRS_PER_PGD], sizeof(pgd_t) * KERNEL_PTRS_PER_PGD);
>> which would work for both PAE and non-PAE.
>>
>>>> It would be easy enough to add some code on xen side to look for pmd aliases
>>>> when using/pinning a pagetable, and allocate'n'copy a new pmd page as
>>>> needed. That way the core code can ignore the issue.
>>>
>>> As much as I'd rather see Xen fixing this than having it continue to impact
>>> the kernel, I presume it will take some time to flush the broken hypervisors
>>> out?
>>
>> Sorry, I was unclear. I meant in the purely Xen-specific parts of the kernel
>> (arch/x86/xen). It wouldn't require a hypervisor change.
>>
>
> Oh, that makes that option much more viable and probably preferable.

Note. I don't believe we use either trampoline (cpu startup or acpi wakeup)
in the hypervisor case (esp Xen). So we should be able to completely ignore
Xen and do the memcpy of pgd entries.

I expect Xen gives us other cpus already in protected mode (which is overall
the sane thing to do).

Eric

2008-01-25 04:42:17

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Eric W. Biederman wrote:
> Note. I don't believe we use either trampoline (cpu startup or acpi wakeup)
> in the hypervisor case (esp Xen). So we should be able to completely ignore
> Xen and do the memcpy of pgd entries.
>

Indeed. The alias mapping can be set up in
native_pagetable_setup_done() and needn't involve Xen at all.

> I expect Xen gives us other cpus already in protected mode (which is overall
> the sane thing to do).

Quite so.

J

2008-01-25 08:01:31

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

On Thu 2008-01-24 16:27:58, H. Peter Anvin wrote:
> Pavel Machek wrote:
> >>>
> >>I just looked at the ACPI suspend code, and it looks
> >>like it hacks its own identity map at runtime. Pavel,
> >>am I reading that code right?
> >
> >Yes, I think so, I believe we do it on both 32 and 64
> >bit now.
> >
>
> So the background to this... we need an identity map to
> trampoline at early boot, obviously, but we'd like it to
> not stick around more than necessary. We have
> zap_low_mappings() now but it's not really sufficient.
>
> Secondary SMP processors need these mappings during
> trampolining -- presumably including CPU hotplug -- and
> I'm suspecting it might simply make sense to use a
> separate set of page tables (with both the identity and
> the kernel map) for trampolining and just keep them
> around. That way they would be usable for ACPI as well.

That would enable some cleanups, yes.
Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-01-25 11:12:03

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Jeremy Fitzhardinge <[email protected]> writes:

> Eric W. Biederman wrote:
>> Note. I don't believe we use either trampoline (cpu startup or acpi wakeup)
>> in the hypervisor case (esp Xen). So we should be able to completely ignore
>> Xen and do the memcpy of pgd entries.
>>
>
> Indeed. The alias mapping can be set up in native_pagetable_setup_done() and
> needn't involve Xen at all.

Good. Then this case gets easy.

We just need a pgd that has pgd entries that duplicate the kernel pgd entries
at both address 0 and at the normal kernel address.

In 64bit mode we make this part of the trampoline because we need a page table below
4G so that we can point a 32bit %cr3 value at it. We can either use that
technique for the 32bit kernel (and be consistent) or we can have a single
trampoline/wakeup pgd that we use. As all pgd entries must be below 4G in
32bit mode.

Although if we really wanted to be restrictive we could have a much more limited
set of identity page table entries that only map the low 1M, or possibly just
640K.
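
The low-1M variant is small enough to sketch. For !PAE a single
statically allocated page table hanging off the first pgd slot covers
it; low_ident_pte and ident_map_low_1m() are made-up names, and the
usual PAGE_KERNEL_EXEC protections are assumed:

static pte_t low_ident_pte[PTRS_PER_PTE]
	__attribute__((aligned(PAGE_SIZE)));

static void __init ident_map_low_1m(pgd_t *pgd)
{
	unsigned long addr;

	for (addr = 0; addr < (1UL << 20); addr += PAGE_SIZE)
		low_ident_pte[pte_index(addr)] =
			pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_EXEC);

	/* !PAE: the first pgd slot covers 0-4MB */
	set_pgd(pgd, __pgd(__pa(low_ident_pte) | _PAGE_TABLE));
}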

Eric

2008-01-25 22:05:30

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

On Friday, 25 of January 2008, Pavel Machek wrote:
> On Thu 2008-01-24 16:27:58, H. Peter Anvin wrote:
> > Pavel Machek wrote:
> > >>>
> > >>I just looked at the ACPI suspend code, and it looks
> > >>like it hacks its own identity map at runtime. Pavel,
> > >>am I reading that code right?
> > >
> > >Yes, I think so, I believe we do it on both 32 and 64
> > >bit now.
> > >
> >
> > So the background to this... we need an identity map to
> > trampoline at early boot, obviously, but we'd like it to
> > not stick around more than necessary. We have
> > zap_low_mappings() now but it's not really sufficient.
> >
> > Secondary SMP processors need these mappings during
> > trampolining -- presumably including CPU hotplug -- and
> > I'm suspecting it might simply make sense to use a
> > separate set of page tables (with both the identity and
> > the kernel map) for trampolining and just keep them
> > around. That way they would be usable for ACPI as well.
>
> That would enable some cleanups, yes.

Speaking of cleanups, the following one is applicable IMO.

Greetings,
Rafael

---
arch/x86/mm/init_32.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

Index: linux-2.6/arch/x86/mm/init_32.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_32.c
+++ linux-2.6/arch/x86/mm/init_32.c
@@ -444,23 +444,23 @@ static void __init pagetable_init (void)
paravirt_pagetable_setup_done(pgd_base);
}

-#if defined(CONFIG_HIBERNATION) || defined(CONFIG_ACPI)
+#ifdef CONFIG_SUSPEND
/*
* Swap suspend & friends need this for resume because things like the intel-agp
* driver might have split up a kernel 4MB mapping.
*/
-char __nosavedata swsusp_pg_dir[PAGE_SIZE]
+char swsusp_pg_dir[PAGE_SIZE]
__attribute__ ((aligned (PAGE_SIZE)));

static inline void save_pg_dir(void)
{
memcpy(swsusp_pg_dir, swapper_pg_dir, PAGE_SIZE);
}
-#else
+#else /* !CONFIG_SUSPEND */
static inline void save_pg_dir(void)
{
}
-#endif
+#endif /* !CONFIG_SUSPEND */

void zap_low_mappings (void)
{

2008-01-25 22:11:00

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Hi!

> > > >>I just looked at the ACPI suspend code, and it looks
> > > >>like it hacks its own identity map at runtime. Pavel,
> > > >>am I reading that code right?
> > > >
> > > >Yes, I think so, I believe we do it on both 32 and 64
> > > >bit now.
> > > >
> > >
> > > So the background to this... we need an identity map to
> > > trampoline at early boot, obviously, but we'd like it to
> > > not stick around more than necessary. We have
> > > zap_low_mappings() now but it's not really sufficient.
> > >
> > > Secondary SMP processors need these mappings during
> > > trampolining -- presumably including CPU hotplug -- and
> > > I'm suspecting it might simply make sense to use a
> > > separate set of page tables (with both the identity and
> > > the kernel map) for trampolining and just keep them
> > > around. That way they would be usable for ACPI as well.
> >
> > That would enable some cleanups, yes.
>
> Speaking of cleanups, the following one is applicable IMO.

ACK... and BTW ack for that deferred device removal series.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-01-28 15:01:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.


* Rafael J. Wysocki <[email protected]> wrote:

> Speaking of cleanups, the following one is applicable IMO.

> --- linux-2.6.orig/arch/x86/mm/init_32.c
> +++ linux-2.6/arch/x86/mm/init_32.c
> @@ -444,23 +444,23 @@ static void __init pagetable_init (void)
> paravirt_pagetable_setup_done(pgd_base);
> }
>
> -#if defined(CONFIG_HIBERNATION) || defined(CONFIG_ACPI)
> +#ifdef CONFIG_SUSPEND
> /*
> * Swap suspend & friends need this for resume because things like the intel-agp
> * driver might have split up a kernel 4MB mapping.
> */
> -char __nosavedata swsusp_pg_dir[PAGE_SIZE]
> +char swsusp_pg_dir[PAGE_SIZE]

thanks, applied.

Ingo

2008-01-28 15:28:00

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

On Monday, 28 of January 2008, Ingo Molnar wrote:
>
> * Rafael J. Wysocki <[email protected]> wrote:
>
> > Speaking of cleanups, the following one is applicable IMO.
>
> > --- linux-2.6.orig/arch/x86/mm/init_32.c
> > +++ linux-2.6/arch/x86/mm/init_32.c
> > @@ -444,23 +444,23 @@ static void __init pagetable_init (void)
> > paravirt_pagetable_setup_done(pgd_base);
> > }
> >
> > -#if defined(CONFIG_HIBERNATION) || defined(CONFIG_ACPI)
> > +#ifdef CONFIG_SUSPEND
> > /*
> > * Swap suspend & friends need this for resume because things like the intel-agp
> > * driver might have split up a kernel 4MB mapping.
> > */
> > -char __nosavedata swsusp_pg_dir[PAGE_SIZE]
> > +char swsusp_pg_dir[PAGE_SIZE]
>
> thanks, applied.

Thanks. Well, the comment should also be updated as I can see now. I'll send
a separate patch for that.

Rafael

2008-01-28 16:13:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.


> > * driver might have split up a kernel 4MB mapping.
> > */
> > -char __nosavedata swsusp_pg_dir[PAGE_SIZE]
> > +char swsusp_pg_dir[PAGE_SIZE]

hm, random-qa found build breakage with this patch:

arch/x86/kernel/built-in.o: In function `wakeup_start':
: undefined reference to `swsusp_pg_dir'

config attached.

Ingo


Attachments:
config (46.93 kB)

2008-01-28 17:04:25

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

On Monday, 28 of January 2008, Ingo Molnar wrote:
>
> > > * driver might have split up a kernel 4MB mapping.
> > > */
> > > -char __nosavedata swsusp_pg_dir[PAGE_SIZE]
> > > +char swsusp_pg_dir[PAGE_SIZE]
>
> hm, random-qa found build breakage with this patch:
>
> arch/x86/kernel/built-in.o: In function `wakeup_start':
> : undefined reference to `swsusp_pg_dir'
>
> config attached.

I see. CONFIG_HIBERNATION && CONFIG_ACPI -> CONFIG_ACPI_SLEEP
and the Makefile in arch/x86/kernel/acpi/ wants to build wakeup.S, which is
not necessary. Hmm.

We can do a couple of things:
(1) make wakeup_$(BITS).o depend on CONFIG_SUSPEND (alone)
This will build it if CONFIG_SUSPEND is set, but CONFIG_ACPI is not
(still, that's consistent with the change in question).
(2) make wakeup_$(BITS).o depend on CONFIG_SUSPEND and CONFIG_ACPI
(3) define CONFIG_ACPI_SUSPEND depending on ACPI and SUSPEND and
make wakeup_$(BITS).o as well as swsusp_pg_dir depend on that (most
elegant)

Which one do you prefer?

In case you choose (3), please drop the patch and I'll send a new one to Len.

Thanks,
Rafael

2008-01-28 19:40:50

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Hi!

> > > /*
> > > * Swap suspend & friends need this for resume because things like the intel-agp
> > > * driver might have split up a kernel 4MB mapping.
> > > */
> > > -char __nosavedata swsusp_pg_dir[PAGE_SIZE]
> > > +char swsusp_pg_dir[PAGE_SIZE]
> >
> > thanks, applied.

Sorry, this is subtle and I've overlooked it before.

(I thought you were only changing ifdef).

Now you memcpy() over pg_dir when that pgdir is in use during swsusp
resume. Granted, you memcpy() with same data that already are there,
but it may still do some funny effects.

Hmm, but same argument applies to lower levels of paging in 64-bit and
PAE cases, and we still do that memcpy-over-active-pagetables there...
:-(.

Pavel
--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2008-01-28 19:56:19

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Pavel Machek wrote:
> Hi!
>
>>>> /*
>>>> * Swap suspend & friends need this for resume because things like the intel-agp
>>>> * driver might have split up a kernel 4MB mapping.
>>>> */
>>>> -char __nosavedata swsusp_pg_dir[PAGE_SIZE]
>>>> +char swsusp_pg_dir[PAGE_SIZE]
>>> thanks, applied.
>
> Sorry, this is subtle and I've overlooked it before.
>
> (I thought you were only changing ifdef).
>
> Now you memcpy() over pg_dir when that pgdir is in use during swsusp
> resume. Granted, you memcpy() with same data that already are there,
> but it may still do some funny effects.
>
> Hmm, but same argument applies to lower levels of paging in 64-bit and
> PAE cases, and we still do that memcpy-over-active-pagetables there...
> :-(.
>

This really comes down to the concept that we should keep an
identity-mapped page table set around and keep it maintained.
Maintenance should be relatively cheap -- we don't care about the
vmalloc area (but if it's easier to have it, it won't cause any harm),
and we already have to have code to synchronize the PGDs on !PAE and the
PMDs on Xen (although that was supposedly getting fixed). This is
nothing very different than synchronizing yet another PGD[*] offset.

This obviously relates to (and needs to be on top of) the
always-native-pagetables work.

[*] = Almost. There is one exception: for 3 GB kernel:1 GB userspace,
we must ensure that only 1 GB of the kernel area is synced.

-hpa

2008-01-28 20:03:50

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

H. Peter Anvin wrote:
> and we already have to have code to synchronize the PGDs on !PAE and
> the PMDs on Xen (although that was supposedly getting fixed).

No, I don't have any plans there. Xen will continue to require
non-shared kernel pmd, at least for a 32-bit host. I think the point is
that nothing that requires an identity mapping will work under Xen
anyway, so Xen just doesn't care about this case.

J

2008-01-28 20:11:18

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Jeremy Fitzhardinge wrote:
> H. Peter Anvin wrote:
>> and we already have to have code to synchronize the PGDs on !PAE and
>> the PMDs on Xen (although that was supposedly getting fixed).
>
> No, I don't have any plans there. Xen will continue to require
> non-shared kernel pmd, at least for a 32-bit host. I think the point is
> that nothing that requires an identity mapping will work under Xen
> anyway, so Xen just doesn't care about this case.
>

Still makes it a special case, not just for this.

-hpa

2008-01-28 20:29:04

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

On Monday, 28 of January 2008, Pavel Machek wrote:
> Hi!
>
> > > > /*
> > > > * Swap suspend & friends need this for resume because things like the intel-agp
> > > > * driver might have split up a kernel 4MB mapping.
> > > > */
> > > > -char __nosavedata swsusp_pg_dir[PAGE_SIZE]
> > > > +char swsusp_pg_dir[PAGE_SIZE]
> > >
> > > thanks, applied.
>
> Sorry, this is subtle and I've overlooked it before.
>
> (I thought you were only changing ifdef).
>
> Now you memcpy() over pg_dir when that pgdir is in use during swsusp
> resume.

It is not. swsusp hasn't been using swsusp_pg_dir for several months.
Hence, the patch. :-)

> Granted, you memcpy() with same data that already are there,
> but it may still do some funny effects.
>
> Hmm, but same argument applies to lower levels of paging in 64-bit and
> PAE cases, and we still do that memcpy-over-active-pagetables there...
> :-(.

Actually, no. We only do that with the kernel code mapping which should be
safe as long as TLBs are not flushed (and they aren't).

Thanks,
Rafael

2008-01-28 20:32:10

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

On Monday, 28 of January 2008, H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
> > H. Peter Anvin wrote:
> >> and we already have to have code to synchronize the PGDs on !PAE and
> >> the PMDs on Xen (although that was supposedly getting fixed).
> >
> > No, I don't have any plans there. Xen will continue to require
> > non-shared kernel pmd, at least for a 32-bit host. I think the point is
> > that nothing that requires an identity mapping will work under Xen
> > anyway, so Xen just doesn't care about this case.
> >
>
> Still makes it a special case, not just for this.

In fact swsusp creates its own temporary page tables for restoring the last
part of the image. Please have a look at
arch/x86/kernel/suspend_*_64.c and the files in arch/x86/power (most
importantly suspend.c).

Thanks,
Rafael

2008-01-28 20:35:28

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Rafael J. Wysocki wrote:
> On Monday, 28 of January 2008, Pavel Machek wrote:
>> Hi!
>>
>>>>> /*
>>>>> * Swap suspend & friends need this for resume because things like the intel-agp
>>>>> * driver might have split up a kernel 4MB mapping.
>>>>> */
>>>>> -char __nosavedata swsusp_pg_dir[PAGE_SIZE]
>>>>> +char swsusp_pg_dir[PAGE_SIZE]
>>>> thanks, applied.
>> Sorry, this is subtle and I've overlooked it before.
>>
>> (I thought you were only changing ifdef).
>>
>> Now you memcpy() over pg_dir when that pgdir is in use during swsusp
>> resume.
>
> It is not. swsusp hasn't been using swsusp_pg_dir for several months.
> Hence, the patch. :-)
>
>> Granted, you memcpy() with same data that already are there,
>> but it may still do some funny effects.
>>
>> Hmm, but same argument applies to lower levels of paging in 64-bit and
>> PAE cases, and we still do that memcpy-over-active-pagetables there...
>> :-(.
>
> Actually, no. We only do that with the kernel code mapping which should be
> safe as long as TLBs are not flushed (and they aren't).
>

Okay... does that in any way affect using the kernel code mapping
synchronization code to maintain a set of trampoline pagetables?

-hpa

2008-01-28 20:45:13

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Rafael J. Wysocki wrote:
> Actually, no. We only do that with the kernel code mapping which should be
> safe as long as TLBs are not flushed (and they aren't).
>

Er, what? Assuming the TLB will retain some mappings while you
overwrite the pagetable is a highly dubious prospect. Are you copying
the same values over, or something else?

J

2008-01-28 20:52:34

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

On Monday, 28 of January 2008, Jeremy Fitzhardinge wrote:
> Rafael J. Wysocki wrote:
> > Actually, no. We only do that with the kernel code mapping which should be
> > safe as long as TLBs are not flushed (and they aren't).
> >
>
> Er, what? Assuming the TLB will retain some mappings while you
> overwrite the pagetable is a highly dubious prospect. Are you copying
> the same values over, or something else?

As long as a relocatable kernel is not used to restore a non-relocatable one
(or vice versa), we're copying the same values over.

Rafael

2008-01-28 21:01:43

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

On Monday, 28 of January 2008, H. Peter Anvin wrote:
> Rafael J. Wysocki wrote:
> > On Monday, 28 of January 2008, Pavel Machek wrote:
> >> Hi!
> >>
> >>>>> /*
> >>>>> * Swap suspend & friends need this for resume because things like the intel-agp
> >>>>> * driver might have split up a kernel 4MB mapping.
> >>>>> */
> >>>>> -char __nosavedata swsusp_pg_dir[PAGE_SIZE]
> >>>>> +char swsusp_pg_dir[PAGE_SIZE]
> >>>> thanks, applied.
> >> Sorry, this is subtle and I've overlooked it before.
> >>
> >> (I thought you were only changing ifdef).
> >>
> >> Now you memcpy() over pg_dir when that pgdir is in use during swsusp
> >> resume.
> >
> > It is not. swsusp hasn't been using swsusp_pg_dir for several months.
> > Hence, the patch. :-)
> >
> >> Granted, you memcpy() with same data that already are there,
> >> but it may still do some funny effects.
> >>
> >> Hmm, but same argument applies to lower levels of paging in 64-bit and
> >> PAE cases, and we still do that memcpy-over-active-pagetables there...
> >> :-(.
> >
> > Actually, no. We only do that with the kernel code mapping which should be
> > safe as long as TLBs are not flushed (and they aren't).
> >
>
> Okay... does that in any way affect using the kernel code mapping
> synchronization code to maintain a set of trampoline pagetables?

I really don't think so.

Rafael

2008-01-28 21:33:33

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

Rafael J. Wysocki wrote:
> On Monday, 28 of January 2008, Jeremy Fitzhardinge wrote:
>> Rafael J. Wysocki wrote:
>>> Actually, no. We only do that with the kernel code mapping which should be
>>> safe as long as TLBs are not flushed (and they aren't).
>>>
>> Er, what? Assuming the TLB will retain some mappings while you
>> overwrite the pagetable is a highly dubious prospect. Are you copying
>> the same values over, or something else?
>
> As long as a relocatable kernel is not used to restore a non-relocatable one
> (or vice versa), we're copying the same values over.
>

So that case is deliberately considered broken?

-hpa

2008-01-28 22:05:31

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

On Monday, 28 of January 2008, H. Peter Anvin wrote:
> Rafael J. Wysocki wrote:
> > On Monday, 28 of January 2008, Jeremy Fitzhardinge wrote:
> >> Rafael J. Wysocki wrote:
> >>> Actually, no. We only do that with the kernel code mapping which should be
> >>> safe as long as TLBs are not flushed (and they aren't).
> >>>
> >> Er, what? Assuming the TLB will retain some mappings while you
> >> overwrite the pagetable is a highly dubious prospect. Are you copying
> >> the same values over, or something else?
> >
> > As long as a relocatable kernel is not used to restore a non-relocatable one
> > (or vice versa), we're copying the same values over.
> >
>
> So that case is deliberately considered broken?

Not deliberately, but the fix I had caused a regression. It's just a pending
issue.

Rafael

2008-02-01 13:52:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.


* Rafael J. Wysocki <[email protected]> wrote:

> > arch/x86/kernel/built-in.o: In function `wakeup_start':
> > : undefined reference to `swsusp_pg_dir'
> >
> > config attached.
>
> I see. CONFIG_HIBERNATION && CONFIG_ACPI -> CONFIG_ACPI_SLEEP and the
> Makefile in arch/x86/kernel/acpi/ wants to build wakeup.S, which is
> not necessary. Hmm.
>
> We can do a couple of things:
> (1) make wakeup_$(BITS).o depend on CONFIG_SUSPEND (alone)
> This will build it if CONFIG_SUSPEND is set, but CONFIG_ACPI is not
> (still, that's consistent with the change in question).
> (2) make wakeup_$(BITS).o depend on CONFIG_SUSPEND and CONFIG_ACPI
> (3) define CONFIG_ACPI_SUSPEND depending on ACPI and SUSPEND and
> make wakeup_$(BITS).o as well as swsusp_pg_dir depend on that (most
> elegant)
>
> Which one do you prefer?

no strong preference here - pick the one you like best and send a patch
please :-)

Ingo

2008-02-01 14:30:33

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

On Friday, 1 of February 2008, Ingo Molnar wrote:
>
> * Rafael J. Wysocki <[email protected]> wrote:
>
> > > arch/x86/kernel/built-in.o: In function `wakeup_start':
> > > : undefined reference to `swsusp_pg_dir'
> > >
> > > config attached.
> >
> > I see. CONFIG_HIBERNATION && CONFIG_ACPI -> CONFIG_ACPI_SLEEP and the
> > Makefile in arch/x86/kernel/acpi/ wants to build wakeup.S, which is
> > not necessary. Hmm.
> >
> > We can do a couple of things:
> > (1) make wakeup_$(BITS).o depend on CONFIG_SUSPEND (alone)
> > This will build it if CONFIG_SUSPEND is set, but CONFIG_ACPI is not
> > (still, that's consistent with the change in question).
> > (2) make wakeup_$(BITS).o depend on CONFIG_SUSPEND and CONFIG_ACPI
> > (3) define CONFIG_ACPI_SUSPEND depending on ACPI and SUSPEND and
> > make wakeup_$(BITS).o as well as swsusp_pg_dir depend on that (most
> > elegant)
> >
> > Which one do you prefer?
>
> no strong preference here - pick the one you like best and send a patch
> please :-)

Here you go, but I think it falls into the ACPI category.

---
From: Rafael J. Wysocki <[email protected]>

Since hibernation uses its own temporary page tables for restoring
the image kernel, swsusp_pg_dir is only needed for ACPI resume from
RAM. Also, some files under arch/x86/kernel/acpi need only be compiled
if ACPI suspend to RAM is going to be used.

Signed-off-by: Rafael J. Wysocki <[email protected]>
---
arch/x86/kernel/acpi/Makefile | 2 +-
arch/x86/mm/init_32.c | 10 +++++-----
drivers/acpi/Kconfig | 5 +++++
3 files changed, 11 insertions(+), 6 deletions(-)

Index: linux-2.6/arch/x86/mm/init_32.c
===================================================================
--- linux-2.6.orig/arch/x86/mm/init_32.c
+++ linux-2.6/arch/x86/mm/init_32.c
@@ -423,23 +423,23 @@ static void __init pagetable_init(void)
paravirt_pagetable_setup_done(pgd_base);
}

-#if defined(CONFIG_HIBERNATION) || defined(CONFIG_ACPI)
+#ifdef CONFIG_ACPI_SUSPEND
/*
- * Swap suspend & friends need this for resume because things like the intel-agp
+ * ACPI suspend needs this for resume, because things like the intel-agp
* driver might have split up a kernel 4MB mapping.
*/
-char __nosavedata swsusp_pg_dir[PAGE_SIZE]
+char swsusp_pg_dir[PAGE_SIZE]
__attribute__ ((aligned(PAGE_SIZE)));

static inline void save_pg_dir(void)
{
memcpy(swsusp_pg_dir, swapper_pg_dir, PAGE_SIZE);
}
-#else
+#else /* !CONFIG_ACPI_SUSPEND */
static inline void save_pg_dir(void)
{
}
-#endif
+#endif /* !CONFIG_ACPI_SUSPEND */

void zap_low_mappings(void)
{
Index: linux-2.6/drivers/acpi/Kconfig
===================================================================
--- linux-2.6.orig/drivers/acpi/Kconfig
+++ linux-2.6/drivers/acpi/Kconfig
@@ -42,6 +42,11 @@ menuconfig ACPI

if ACPI

+config ACPI_SUSPEND
+ bool
+ depends on SUSPEND
+ default y
+
config ACPI_SLEEP
bool
depends on PM_SLEEP
Index: linux-2.6/arch/x86/kernel/acpi/Makefile
===================================================================
--- linux-2.6.orig/arch/x86/kernel/acpi/Makefile
+++ linux-2.6/arch/x86/kernel/acpi/Makefile
@@ -1,5 +1,5 @@
obj-$(CONFIG_ACPI) += boot.o
-obj-$(CONFIG_ACPI_SLEEP) += sleep.o wakeup_$(BITS).o
+obj-$(CONFIG_ACPI_SUSPEND) += sleep.o wakeup_$(BITS).o

ifneq ($(CONFIG_ACPI_PROCESSOR),)
obj-y += cstate.o processor.o

2008-02-01 14:55:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.


* Rafael J. Wysocki <[email protected]> wrote:

> > no strong preference here - pick the one you like best and send a
> > patch please :-)
>
> Here you go, but I think it falls into the ACPI category.

agreed - Len, would you mind to pick this patch up?

Acked-by: Ingo Molnar <[email protected]>

Ingo

2008-02-01 22:57:38

by Len Brown

[permalink] [raw]
Subject: Re: [PATCH] x86: Construct 32 bit boot time page tables in native format.

On Friday 01 February 2008 09:54, Ingo Molnar wrote:
>
> * Rafael J. Wysocki <[email protected]> wrote:
>
> > > no strong preference here - pick the one you like best and send a
> > > patch please :-)
> >
> > Here you go, but I think it falls into the ACPI category.
>
> agreed - Len, would you mind to pick this patch up?

This won't work as written, because the ACPI code doesn't currently optimize
for the HIBERNATE && !SUSPEND case, and so both code paths are under ACPI_SLEEP.

While some day there may be a justification to make that optimization,
this isn't the day, and this isn't the patch.

So Rafael and I talked about it and decided to go with the simpler
patch below -- which I'll push.

thanks,
-Len

---
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
index da524fb..f2f36f8 100644
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -423,23 +423,23 @@ static void __init pagetable_init(void)
paravirt_pagetable_setup_done(pgd_base);
}

-#if defined(CONFIG_HIBERNATION) || defined(CONFIG_ACPI)
+#ifdef CONFIG_ACPI_SLEEP
/*
- * Swap suspend & friends need this for resume because things like the intel-agp
+ * ACPI suspend needs this for resume, because things like the intel-agp
* driver might have split up a kernel 4MB mapping.
*/
-char __nosavedata swsusp_pg_dir[PAGE_SIZE]
+char swsusp_pg_dir[PAGE_SIZE]
__attribute__ ((aligned(PAGE_SIZE)));

static inline void save_pg_dir(void)
{
memcpy(swsusp_pg_dir, swapper_pg_dir, PAGE_SIZE);
}
-#else
+#else /* !CONFIG_ACPI_SLEEP */
static inline void save_pg_dir(void)
{
}
-#endif
+#endif /* !CONFIG_ACPI_SLEEP */

void zap_low_mappings(void)
{