Hi Ingo,
This series lays the groundwork for 64-bit Xen support. It follows
the usual pattern: a series of general cleanups and improvements,
followed by additions and modifications needed to slide Xen in.
Most of the 64-bit paravirt-ops work has already been done and
integrated for some time, so the changes are relatively minor.
Interesting and potentially hazardous changes in this series are:
"paravirt/x86_64: move __PAGE_OFFSET to leave a space for hypervisor"
This moves __PAGE_OFFSET up from 0xffff810000000000 to
0xffff880000000000. I have no general justification for this: the
specific reason is that Xen claims the first 16 kernel pgd slots for
itself, and we must move the mapping up to make room. In the process
I parameterised the compile-time construction of the initial
pagetables in head_64.S to cope with it.
"x86_64: adjust mapping of physical pagetables to work with Xen"
"x86_64: create small vmemmap mappings if PSE not available"
This rearranges the construction of the physical mapping so that it
works with Xen. This affects three aspects of the code:
1. It can't rely on PSE being present, so it will only use PSE if the
processor actually supports it.
2. It never replaces an existing mapping, so it can just extend the
early boot-provided mappings (either from head_64.S or the Xen domain
builder).
3. It makes sure that any page is iounmapped before attaching it to the
pagetable to avoid having writable aliases of pagetable pages.
The logical structure of the code is more or less unchanged, and still
works fine in the native case.
vmemmap mapping is likewise changed.
"x86_64: PSE no longer a hard requirement."
Because booting under Xen doesn't enable PSE, it's no longer a hard
requirement for the kernel. PSE will still be used wherever possible.
Overall diffstat:
arch/x86/Kconfig | 7 +
arch/x86/ia32/ia32entry.S | 37 +++--
arch/x86/kernel/aperture_64.c | 4
arch/x86/kernel/asm-offsets_32.c | 2
arch/x86/kernel/asm-offsets_64.c | 5
arch/x86/kernel/cpu/common_64.c | 3
arch/x86/kernel/e820.c | 2
arch/x86/kernel/early-quirks.c | 2
arch/x86/kernel/entry_32.S | 8 -
arch/x86/kernel/entry_64.S | 14 +-
arch/x86/kernel/head64.c | 5
arch/x86/kernel/head_64.S | 17 +-
arch/x86/kernel/machine_kexec_64.c | 2
arch/x86/kernel/paravirt.c | 24 +++
arch/x86/kernel/paravirt_patch_32.c | 4
arch/x86/kernel/paravirt_patch_64.c | 9 -
arch/x86/kernel/pci-calgary_64.c | 4
arch/x86/kernel/pci-dma.c | 4
arch/x86/kernel/pci-gart_64.c | 4
arch/x86/kernel/pci-swiotlb_64.c | 2
arch/x86/kernel/process_64.c | 50 +++++--
arch/x86/kernel/setup_64.c | 24 +--
arch/x86/kernel/vmi_32.c | 4
arch/x86/kernel/vsyscall_64.c | 12 -
arch/x86/mm/fault.c | 73 +++--------
arch/x86/mm/init_64.c | 234 +++++++++++++++++++++---------------
arch/x86/mm/ioremap.c | 7 -
arch/x86/mm/k8topology_64.c | 4
arch/x86/mm/numa_64.c | 4
arch/x86/mm/pgtable.c | 176 ++++++++++++++++-----------
arch/x86/mm/srat_64.c | 2
arch/x86/power/hibernate_64.c | 2
arch/x86/xen/enlighten.c | 9 +
include/asm-x86/cmpxchg_64.h | 37 +++++
include/asm-x86/desc_defs.h | 4
include/asm-x86/elf.h | 2
include/asm-x86/fixmap_64.h | 16 ++
include/asm-x86/io.h | 13 ++
include/asm-x86/io_32.h | 12 -
include/asm-x86/irqflags.h | 19 ++
include/asm-x86/mmu_context.h | 32 ++++
include/asm-x86/mmu_context_32.h | 28 ----
include/asm-x86/mmu_context_64.h | 18 --
include/asm-x86/msr.h | 5
include/asm-x86/page_64.h | 11 +
include/asm-x86/paravirt.h | 141 ++++++++++++++++++---
include/asm-x86/pgalloc.h | 4
include/asm-x86/pgtable.h | 20 +++
include/asm-x86/pgtable_32.h | 20 ---
include/asm-x86/pgtable_64.h | 8 -
include/asm-x86/processor.h | 2
include/asm-x86/required-features.h | 2
include/asm-x86/setup.h | 4
include/asm-x86/system.h | 7 -
54 files changed, 734 insertions(+), 431 deletions(-)
Thanks,
J
Add "memory" clobbers to savesegment and loadsegment, since they can
affect memory accesses and we never want the compiler to reorder them
with respect to memory references.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
include/asm-x86/system.h | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/include/asm-x86/system.h b/include/asm-x86/system.h
--- a/include/asm-x86/system.h
+++ b/include/asm-x86/system.h
@@ -157,14 +157,14 @@
"jmp 2b\n" \
".previous\n" \
_ASM_EXTABLE(1b,3b) \
- : :"r" (value), "r" (0))
+ : :"r" (value), "r" (0) : "memory")
/*
* Save a segment register away
*/
#define savesegment(seg, value) \
- asm volatile("mov %%" #seg ",%0":"=rm" (value))
+ asm("mov %%" #seg ",%0":"=rm" (value) : : "memory")
static inline unsigned long get_limit(unsigned long segment)
{
Add hooks which are called at pgd_alloc/free time. The pgd_alloc hook
may return an error code which, if non-zero, causes the pgd allocation
to fail. The hooks may be used to allocate/free auxiliary
per-pgd information.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/paravirt.c | 4 ++++
arch/x86/mm/pgtable.c | 13 ++++++++-----
arch/x86/xen/enlighten.c | 4 ++++
include/asm-x86/paravirt.h | 19 ++++++++++++++++++-
include/asm-x86/pgalloc.h | 4 ++++
5 files changed, 38 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -30,6 +30,7 @@
#include <asm/setup.h>
#include <asm/arch_hooks.h>
#include <asm/time.h>
+#include <asm/pgalloc.h>
#include <asm/irq.h>
#include <asm/delay.h>
#include <asm/fixmap.h>
@@ -369,6 +370,9 @@
.flush_tlb_single = native_flush_tlb_single,
.flush_tlb_others = native_flush_tlb_others,
+ .pgd_alloc = __paravirt_pgd_alloc,
+ .pgd_free = paravirt_nop,
+
.alloc_pte = paravirt_nop,
.alloc_pmd = paravirt_nop,
.alloc_pmd_clone = paravirt_nop,
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -215,13 +215,15 @@
/* so that alloc_pmd can use it */
mm->pgd = pgd;
- if (pgd)
+ if (pgd) {
pgd_ctor(pgd);
- if (pgd && !pgd_prepopulate_pmd(mm, pgd)) {
- pgd_dtor(pgd);
- free_page((unsigned long)pgd);
- pgd = NULL;
+ if (paravirt_pgd_alloc(mm) != 0 ||
+ !pgd_prepopulate_pmd(mm, pgd)) {
+ pgd_dtor(pgd);
+ free_page((unsigned long)pgd);
+ pgd = NULL;
+ }
}
return pgd;
@@ -231,6 +233,7 @@
{
pgd_mop_up_pmds(mm, pgd);
pgd_dtor(pgd);
+ paravirt_pgd_free(mm, pgd);
free_page((unsigned long)pgd);
}
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -45,6 +45,7 @@
#include <asm/pgtable.h>
#include <asm/tlbflush.h>
#include <asm/reboot.h>
+#include <asm/pgalloc.h>
#include "xen-ops.h"
#include "mmu.h"
@@ -1151,6 +1152,9 @@
.pte_update = paravirt_nop,
.pte_update_defer = paravirt_nop,
+ .pgd_alloc = __paravirt_pgd_alloc,
+ .pgd_free = paravirt_nop,
+
.alloc_pte = xen_alloc_pte_init,
.release_pte = xen_release_pte_init,
.alloc_pmd = xen_alloc_pte_init,
diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -220,7 +220,14 @@
void (*flush_tlb_others)(const cpumask_t *cpus, struct mm_struct *mm,
unsigned long va);
- /* Hooks for allocating/releasing pagetable pages */
+ /* Hooks for allocating and freeing a pagetable top-level */
+ int (*pgd_alloc)(struct mm_struct *mm);
+ void (*pgd_free)(struct mm_struct *mm, pgd_t *pgd);
+
+ /*
+ * Hooks for allocating/releasing pagetable pages when they're
+ * attached to a pagetable
+ */
void (*alloc_pte)(struct mm_struct *mm, u32 pfn);
void (*alloc_pmd)(struct mm_struct *mm, u32 pfn);
void (*alloc_pmd_clone)(u32 pfn, u32 clonepfn, u32 start, u32 count);
@@ -926,6 +933,16 @@
PVOP_VCALL3(pv_mmu_ops.flush_tlb_others, &cpumask, mm, va);
}
+static inline int paravirt_pgd_alloc(struct mm_struct *mm)
+{
+ return PVOP_CALL1(int, pv_mmu_ops.pgd_alloc, mm);
+}
+
+static inline void paravirt_pgd_free(struct mm_struct *mm, pgd_t *pgd)
+{
+ PVOP_VCALL2(pv_mmu_ops.pgd_free, mm, pgd);
+}
+
static inline void paravirt_alloc_pte(struct mm_struct *mm, unsigned pfn)
{
PVOP_VCALL2(pv_mmu_ops.alloc_pte, mm, pfn);
diff --git a/include/asm-x86/pgalloc.h b/include/asm-x86/pgalloc.h
--- a/include/asm-x86/pgalloc.h
+++ b/include/asm-x86/pgalloc.h
@@ -5,9 +5,13 @@
#include <linux/mm.h> /* for struct page */
#include <linux/pagemap.h>
+static inline int __paravirt_pgd_alloc(struct mm_struct *mm) { return 0; }
+
#ifdef CONFIG_PARAVIRT
#include <asm/paravirt.h>
#else
+#define paravirt_pgd_alloc(mm) __paravirt_pgd_alloc(mm)
+static inline void paravirt_pgd_free(struct mm_struct *mm, pgd_t *pgd) {}
static inline void paravirt_alloc_pte(struct mm_struct *mm, unsigned long pfn) {}
static inline void paravirt_alloc_pmd(struct mm_struct *mm, unsigned long pfn) {}
static inline void paravirt_alloc_pmd_clone(unsigned long pfn, unsigned long clonepfn,
Split x86_64_start_kernel() into two pieces:
The first essentially cleans up after head_64.S. It clears the
bss, zaps low identity mappings, sets up some early exception
handlers.
The second part preserves the boot data, reserves the kernel's
text/data/bss, pagetables and ramdisk, and then starts the kernel
proper.
This split is so that Xen can call the second part to do the setup it
needs done. It doesn't need any of the first part's setup, because it
doesn't boot via head_64.S, and that work would be redundant or
actively damaging there.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/head64.c | 5 +++++
include/asm-x86/setup.h | 2 ++
2 files changed, 7 insertions(+)
diff --git a/arch/x86/kernel/head64.c b/arch/x86/kernel/head64.c
--- a/arch/x86/kernel/head64.c
+++ b/arch/x86/kernel/head64.c
@@ -108,6 +108,11 @@
early_printk("Kernel really alive\n");
+ x86_64_start_reservations(real_mode_data);
+}
+
+void __init x86_64_start_reservations(char *real_mode_data)
+{
copy_bootdata(__va(real_mode_data));
reserve_early(__pa_symbol(&_text), __pa_symbol(&_end), "TEXT DATA BSS");
diff --git a/include/asm-x86/setup.h b/include/asm-x86/setup.h
--- a/include/asm-x86/setup.h
+++ b/include/asm-x86/setup.h
@@ -63,6 +63,8 @@
#else
void __init x86_64_start_kernel(char *real_mode);
+void __init x86_64_start_reservations(char *real_mode_data);
+
#endif /* __i386__ */
#endif /* _SETUP */
#endif /* __ASSEMBLY__ */
vmalloc_sync_all() is only called from register_die_notifier and
alloc_vm_area. Neither is on any performance-critical paths, so
vmalloc_sync_all() itself is not on any hot paths.
Given that the optimisations in vmalloc_sync_all add a fair amount of
code and complexity, and are fairly hard to evaluate for correctness,
it's better to just remove them to simplify the code rather than worry
about its absolute performance.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/mm/fault.c | 73 ++++++++++++++++-----------------------------------
1 file changed, 24 insertions(+), 49 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -949,14 +949,7 @@
void vmalloc_sync_all(void)
{
#ifdef CONFIG_X86_32
- /*
- * Note that races in the updates of insync and start aren't
- * problematic: insync can only get set bits added, and updates to
- * start are only improving performance (without affecting correctness
- * if undone).
- */
- static DECLARE_BITMAP(insync, PTRS_PER_PGD);
- static unsigned long start = TASK_SIZE;
+ unsigned long start = VMALLOC_START & PGDIR_MASK;
unsigned long address;
if (SHARED_KERNEL_PMD)
@@ -964,56 +957,38 @@
BUILD_BUG_ON(TASK_SIZE & ~PGDIR_MASK);
for (address = start; address >= TASK_SIZE; address += PGDIR_SIZE) {
- if (!test_bit(pgd_index(address), insync)) {
- unsigned long flags;
- struct page *page;
+ unsigned long flags;
+ struct page *page;
- spin_lock_irqsave(&pgd_lock, flags);
- list_for_each_entry(page, &pgd_list, lru) {
- if (!vmalloc_sync_one(page_address(page),
- address))
- break;
- }
- spin_unlock_irqrestore(&pgd_lock, flags);
- if (!page)
- set_bit(pgd_index(address), insync);
+ spin_lock_irqsave(&pgd_lock, flags);
+ list_for_each_entry(page, &pgd_list, lru) {
+ if (!vmalloc_sync_one(page_address(page),
+ address))
+ break;
}
- if (address == start && test_bit(pgd_index(address), insync))
- start = address + PGDIR_SIZE;
+ spin_unlock_irqrestore(&pgd_lock, flags);
}
#else /* CONFIG_X86_64 */
- /*
- * Note that races in the updates of insync and start aren't
- * problematic: insync can only get set bits added, and updates to
- * start are only improving performance (without affecting correctness
- * if undone).
- */
- static DECLARE_BITMAP(insync, PTRS_PER_PGD);
- static unsigned long start = VMALLOC_START & PGDIR_MASK;
+ unsigned long start = VMALLOC_START & PGDIR_MASK;
unsigned long address;
for (address = start; address <= VMALLOC_END; address += PGDIR_SIZE) {
- if (!test_bit(pgd_index(address), insync)) {
- const pgd_t *pgd_ref = pgd_offset_k(address);
- unsigned long flags;
- struct page *page;
+ const pgd_t *pgd_ref = pgd_offset_k(address);
+ unsigned long flags;
+ struct page *page;
- if (pgd_none(*pgd_ref))
- continue;
- spin_lock_irqsave(&pgd_lock, flags);
- list_for_each_entry(page, &pgd_list, lru) {
- pgd_t *pgd;
- pgd = (pgd_t *)page_address(page) + pgd_index(address);
- if (pgd_none(*pgd))
- set_pgd(pgd, *pgd_ref);
- else
- BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
- }
- spin_unlock_irqrestore(&pgd_lock, flags);
- set_bit(pgd_index(address), insync);
+ if (pgd_none(*pgd_ref))
+ continue;
+ spin_lock_irqsave(&pgd_lock, flags);
+ list_for_each_entry(page, &pgd_list, lru) {
+ pgd_t *pgd;
+ pgd = (pgd_t *)page_address(page) + pgd_index(address);
+ if (pgd_none(*pgd))
+ set_pgd(pgd, *pgd_ref);
+ else
+ BUG_ON(pgd_page_vaddr(*pgd) != pgd_page_vaddr(*pgd_ref));
}
- if (address == start)
- start = address + PGDIR_SIZE;
+ spin_unlock_irqrestore(&pgd_lock, flags);
}
#endif
}
Jan Beulich points out that vmalloc_sync_all() assumes that the
kernel's pmd is always expected to be present in the pgd. The current
pgd construction code will add the pgd to the pgd_list before its pmds
have been pre-populated, thereby making it visible to
vmalloc_sync_all().
However, because pgd_prepopulate_pmd also does the allocation, it may
block and cannot be done under spinlock.
The solution is to preallocate the pmds out of the spinlock, then
populate them while holding the pgd_list lock.
This patch also pulls the pmd preallocation and mop-up functions out
to be common, assuming that the compiler will generate no code for
them when PREALLOCATED_PMDS is 0. Also, there's no need for pgd_ctor
to clear the pgd again, since it's allocated as a zeroed page.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Jan Beulich <[email protected]>
---
arch/x86/mm/pgtable.c | 177 +++++++++++++++++++++++++++++--------------------
1 file changed, 105 insertions(+), 72 deletions(-)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -66,12 +66,6 @@
static void pgd_ctor(void *p)
{
pgd_t *pgd = p;
- unsigned long flags;
-
- /* Clear usermode parts of PGD */
- memset(pgd, 0, KERNEL_PGD_BOUNDARY*sizeof(pgd_t));
-
- spin_lock_irqsave(&pgd_lock, flags);
/* If the pgd points to a shared pagetable level (either the
ptes in non-PAE, or shared PMD in PAE), then just copy the
@@ -91,8 +85,6 @@
/* list required to sync kernel mapping updates */
if (!SHARED_KERNEL_PMD)
pgd_list_add(pgd);
-
- spin_unlock_irqrestore(&pgd_lock, flags);
}
static void pgd_dtor(void *pgd)
@@ -120,30 +112,6 @@
#ifdef CONFIG_X86_PAE
/*
- * Mop up any pmd pages which may still be attached to the pgd.
- * Normally they will be freed by munmap/exit_mmap, but any pmd we
- * preallocate which never got a corresponding vma will need to be
- * freed manually.
- */
-static void pgd_mop_up_pmds(struct mm_struct *mm, pgd_t *pgdp)
-{
- int i;
-
- for(i = 0; i < UNSHARED_PTRS_PER_PGD; i++) {
- pgd_t pgd = pgdp[i];
-
- if (pgd_val(pgd) != 0) {
- pmd_t *pmd = (pmd_t *)pgd_page_vaddr(pgd);
-
- pgdp[i] = native_make_pgd(0);
-
- paravirt_release_pmd(pgd_val(pgd) >> PAGE_SHIFT);
- pmd_free(mm, pmd);
- }
- }
-}
-
-/*
* In PAE mode, we need to do a cr3 reload (=tlb flush) when
* updating the top-level pagetable entries to guarantee the
* processor notices the update. Since this is expensive, and
@@ -154,31 +122,7 @@
* not shared between pagetables (!SHARED_KERNEL_PMDS), we allocate
* and initialize the kernel pmds here.
*/
-static int pgd_prepopulate_pmd(struct mm_struct *mm, pgd_t *pgd)
-{
- pud_t *pud;
- unsigned long addr;
- int i;
-
- pud = pud_offset(pgd, 0);
- for (addr = i = 0; i < UNSHARED_PTRS_PER_PGD;
- i++, pud++, addr += PUD_SIZE) {
- pmd_t *pmd = pmd_alloc_one(mm, addr);
-
- if (!pmd) {
- pgd_mop_up_pmds(mm, pgd);
- return 0;
- }
-
- if (i >= KERNEL_PGD_BOUNDARY)
- memcpy(pmd, (pmd_t *)pgd_page_vaddr(swapper_pg_dir[i]),
- sizeof(pmd_t) * PTRS_PER_PMD);
-
- pud_populate(mm, pud, pmd);
- }
-
- return 1;
-}
+#define PREALLOCATED_PMDS UNSHARED_PTRS_PER_PGD
void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd)
{
@@ -198,35 +142,124 @@
write_cr3(read_cr3());
}
#else /* !CONFIG_X86_PAE */
+
/* No need to prepopulate any pagetable entries in non-PAE modes. */
-static int pgd_prepopulate_pmd(struct mm_struct *mm, pgd_t *pgd)
+#define PREALLOCATED_PMDS 0
+
+#endif /* CONFIG_X86_PAE */
+
+static void free_pmds(pmd_t *pmds[])
{
- return 1;
+ int i;
+
+ for(i = 0; i < PREALLOCATED_PMDS; i++)
+ if (pmds[i])
+ free_page((unsigned long)pmds[i]);
}
-static void pgd_mop_up_pmds(struct mm_struct *mm, pgd_t *pgd)
+static int preallocate_pmds(pmd_t *pmds[])
{
+ int i;
+ bool failed = false;
+
+ for(i = 0; i < PREALLOCATED_PMDS; i++) {
+ pmd_t *pmd = (pmd_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
+ if (pmd == NULL)
+ failed = true;
+ pmds[i] = pmd;
+ }
+
+ if (failed) {
+ free_pmds(pmds);
+ return -ENOMEM;
+ }
+
+ return 0;
}
-#endif /* CONFIG_X86_PAE */
+
+/*
+ * Mop up any pmd pages which may still be attached to the pgd.
+ * Normally they will be freed by munmap/exit_mmap, but any pmd we
+ * preallocate which never got a corresponding vma will need to be
+ * freed manually.
+ */
+static void pgd_mop_up_pmds(struct mm_struct *mm, pgd_t *pgdp)
+{
+ int i;
+
+ for(i = 0; i < PREALLOCATED_PMDS; i++) {
+ pgd_t pgd = pgdp[i];
+
+ if (pgd_val(pgd) != 0) {
+ pmd_t *pmd = (pmd_t *)pgd_page_vaddr(pgd);
+
+ pgdp[i] = native_make_pgd(0);
+
+ paravirt_release_pmd(pgd_val(pgd) >> PAGE_SHIFT);
+ pmd_free(mm, pmd);
+ }
+ }
+}
+
+static void pgd_prepopulate_pmd(struct mm_struct *mm, pgd_t *pgd, pmd_t *pmds[])
+{
+ pud_t *pud;
+ unsigned long addr;
+ int i;
+
+ pud = pud_offset(pgd, 0);
+
+ for (addr = i = 0; i < PREALLOCATED_PMDS;
+ i++, pud++, addr += PUD_SIZE) {
+ pmd_t *pmd = pmds[i];
+
+ if (i >= KERNEL_PGD_BOUNDARY)
+ memcpy(pmd, (pmd_t *)pgd_page_vaddr(swapper_pg_dir[i]),
+ sizeof(pmd_t) * PTRS_PER_PMD);
+
+ pud_populate(mm, pud, pmd);
+ }
+}
pgd_t *pgd_alloc(struct mm_struct *mm)
{
- pgd_t *pgd = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+ pgd_t *pgd;
+ pmd_t *pmds[PREALLOCATED_PMDS];
+ unsigned long flags;
- /* so that alloc_pmd can use it */
+ pgd = (pgd_t *)__get_free_page(GFP_KERNEL | __GFP_ZERO);
+
+ if (pgd == NULL)
+ goto out;
+
mm->pgd = pgd;
- if (pgd) {
- pgd_ctor(pgd);
- if (paravirt_pgd_alloc(mm) != 0 ||
- !pgd_prepopulate_pmd(mm, pgd)) {
- pgd_dtor(pgd);
- free_page((unsigned long)pgd);
- pgd = NULL;
- }
- }
+ if (preallocate_pmds(pmds) != 0)
+ goto out_free_pgd;
+
+ if (paravirt_pgd_alloc(mm) != 0)
+ goto out_free_pmds;
+
+ /*
+ * Make sure that pre-populating the pmds is atomic with
+ * respect to anything walking the pgd_list, so that they
+ * never see a partially populated pgd.
+ */
+ spin_lock_irqsave(&pgd_lock, flags);
+
+ pgd_ctor(pgd);
+ pgd_prepopulate_pmd(mm, pgd, pmds);
+
+ spin_unlock_irqrestore(&pgd_lock, flags);
return pgd;
+
+out_free_pmds:
+ free_pmds(pmds);
+out_free_pgd:
+ free_page((unsigned long)pgd);
+out:
+ return NULL;
}
void pgd_free(struct mm_struct *mm, pgd_t *pgd)
Replace end_pfn with num_physpages throughout. Reasons for the
replacement:
- they're semantically identical
- end_pfn is a bad name for a global identifier
- small step towards unification
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Vivek Goyal <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
---
arch/x86/kernel/aperture_64.c | 4 ++--
arch/x86/kernel/e820.c | 2 +-
arch/x86/kernel/early-quirks.c | 2 +-
arch/x86/kernel/machine_kexec_64.c | 2 +-
arch/x86/kernel/pci-calgary_64.c | 4 ++--
arch/x86/kernel/pci-dma.c | 4 ++--
arch/x86/kernel/pci-gart_64.c | 4 ++--
arch/x86/kernel/pci-swiotlb_64.c | 2 +-
arch/x86/kernel/setup_64.c | 22 ++++++++++------------
arch/x86/mm/init_64.c | 14 +++++++-------
arch/x86/mm/k8topology_64.c | 4 ++--
arch/x86/mm/numa_64.c | 4 ++--
arch/x86/mm/srat_64.c | 2 +-
arch/x86/power/hibernate_64.c | 2 +-
include/asm-x86/page_64.h | 3 +--
15 files changed, 36 insertions(+), 39 deletions(-)
diff --git a/arch/x86/kernel/aperture_64.c b/arch/x86/kernel/aperture_64.c
--- a/arch/x86/kernel/aperture_64.c
+++ b/arch/x86/kernel/aperture_64.c
@@ -407,7 +407,7 @@
agp_aper_base == aper_base &&
agp_aper_order == aper_order) {
/* the same between two setting from NB and agp */
- if (!no_iommu && end_pfn > MAX_DMA32_PFN && !printed_gart_size_msg) {
+ if (!no_iommu && num_physpages > MAX_DMA32_PFN && !printed_gart_size_msg) {
printk(KERN_ERR "you are using iommu with agp, but GART size is less than 64M\n");
printk(KERN_ERR "please increase GART size in your BIOS setup\n");
printk(KERN_ERR "if BIOS doesn't have that option, contact your HW vendor!\n");
@@ -448,7 +448,7 @@
/* Got the aperture from the AGP bridge */
} else if (swiotlb && !valid_agp) {
/* Do nothing */
- } else if ((!no_iommu && end_pfn > MAX_DMA32_PFN) ||
+ } else if ((!no_iommu && num_physpages > MAX_DMA32_PFN) ||
force_iommu ||
valid_agp ||
fallback_aper_force) {
diff --git a/arch/x86/kernel/e820.c b/arch/x86/kernel/e820.c
--- a/arch/x86/kernel/e820.c
+++ b/arch/x86/kernel/e820.c
@@ -516,7 +516,7 @@
#ifdef CONFIG_X86_64
if (!found) {
- gapstart = (end_pfn << PAGE_SHIFT) + 1024*1024;
+ gapstart = (num_physpages << PAGE_SHIFT) + 1024*1024;
printk(KERN_ERR "PCI: Warning: Cannot find a gap in the 32bit "
"address range\n"
KERN_ERR "PCI: Unassigned devices with 32bit resource "
diff --git a/arch/x86/kernel/early-quirks.c b/arch/x86/kernel/early-quirks.c
--- a/arch/x86/kernel/early-quirks.c
+++ b/arch/x86/kernel/early-quirks.c
@@ -50,7 +50,7 @@
static void __init via_bugs(int num, int slot, int func)
{
#ifdef CONFIG_GART_IOMMU
- if ((end_pfn > MAX_DMA32_PFN || force_iommu) &&
+ if ((num_physpages > MAX_DMA32_PFN || force_iommu) &&
!gart_iommu_aperture_allowed) {
printk(KERN_INFO
"Looks like a VIA chipset. Disabling IOMMU."
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -112,7 +112,7 @@
{
pgd_t *level4p;
level4p = (pgd_t *)__va(start_pgtable);
- return init_level4_page(image, level4p, 0, end_pfn << PAGE_SHIFT);
+ return init_level4_page(image, level4p, 0, num_physpages << PAGE_SHIFT);
}
static void set_idt(void *newidt, u16 limit)
diff --git a/arch/x86/kernel/pci-calgary_64.c b/arch/x86/kernel/pci-calgary_64.c
--- a/arch/x86/kernel/pci-calgary_64.c
+++ b/arch/x86/kernel/pci-calgary_64.c
@@ -1394,7 +1394,7 @@
return;
}
- specified_table_size = determine_tce_table_size(end_pfn * PAGE_SIZE);
+ specified_table_size = determine_tce_table_size(num_physpages * PAGE_SIZE);
for (bus = 0; bus < MAX_PHB_BUS_NUM; bus++) {
struct calgary_bus_info *info = &bus_info[bus];
@@ -1459,7 +1459,7 @@
if (ret) {
printk(KERN_ERR "PCI-DMA: Calgary init failed %d, "
"falling back to no_iommu\n", ret);
- if (end_pfn > MAX_DMA32_PFN)
+ if (num_physpages > MAX_DMA32_PFN)
printk(KERN_ERR "WARNING more than 4GB of memory, "
"32bit PCI may malfunction.\n");
return ret;
diff --git a/arch/x86/kernel/pci-dma.c b/arch/x86/kernel/pci-dma.c
--- a/arch/x86/kernel/pci-dma.c
+++ b/arch/x86/kernel/pci-dma.c
@@ -74,7 +74,7 @@
void __init dma32_reserve_bootmem(void)
{
unsigned long size, align;
- if (end_pfn <= MAX_DMA32_PFN)
+ if (num_physpages <= MAX_DMA32_PFN)
return;
/*
@@ -93,7 +93,7 @@
static void __init dma32_free_bootmem(void)
{
- if (end_pfn <= MAX_DMA32_PFN)
+ if (num_physpages <= MAX_DMA32_PFN)
return;
if (!dma32_bootmem_ptr)
diff --git a/arch/x86/kernel/pci-gart_64.c b/arch/x86/kernel/pci-gart_64.c
--- a/arch/x86/kernel/pci-gart_64.c
+++ b/arch/x86/kernel/pci-gart_64.c
@@ -752,10 +752,10 @@
return;
if (no_iommu ||
- (!force_iommu && end_pfn <= MAX_DMA32_PFN) ||
+ (!force_iommu && num_physpages <= MAX_DMA32_PFN) ||
!gart_iommu_aperture ||
(no_agp && init_k8_gatt(&info) < 0)) {
- if (end_pfn > MAX_DMA32_PFN) {
+ if (num_physpages > MAX_DMA32_PFN) {
printk(KERN_WARNING "More than 4GB of memory "
"but GART IOMMU not available.\n"
KERN_WARNING "falling back to iommu=soft.\n");
diff --git a/arch/x86/kernel/pci-swiotlb_64.c b/arch/x86/kernel/pci-swiotlb_64.c
--- a/arch/x86/kernel/pci-swiotlb_64.c
+++ b/arch/x86/kernel/pci-swiotlb_64.c
@@ -38,7 +38,7 @@
void __init pci_swiotlb_init(void)
{
/* don't initialize swiotlb if iommu=off (no_iommu=1) */
- if (!iommu_detected && !no_iommu && end_pfn > MAX_DMA32_PFN)
+ if (!iommu_detected && !no_iommu && num_physpages > MAX_DMA32_PFN)
swiotlb = 1;
if (swiotlb_force)
swiotlb = 1;
diff --git a/arch/x86/kernel/setup_64.c b/arch/x86/kernel/setup_64.c
--- a/arch/x86/kernel/setup_64.c
+++ b/arch/x86/kernel/setup_64.c
@@ -161,7 +161,7 @@
unsigned long ramdisk_image = boot_params.hdr.ramdisk_image;
unsigned long ramdisk_size = boot_params.hdr.ramdisk_size;
unsigned long ramdisk_end = ramdisk_image + ramdisk_size;
- unsigned long end_of_mem = end_pfn << PAGE_SHIFT;
+ unsigned long end_of_mem = num_physpages << PAGE_SHIFT;
if (ramdisk_end <= end_of_mem) {
/*
@@ -267,25 +267,23 @@
* partially used pages are not usable - thus
* we are rounding upwards:
*/
- end_pfn = e820_end_of_ram();
+ num_physpages = e820_end_of_ram();
/* pre allocte 4k for mptable mpc */
early_reserve_e820_mpc_new();
/* update e820 for memory not covered by WB MTRRs */
mtrr_bp_init();
- if (mtrr_trim_uncached_memory(end_pfn)) {
+ if (mtrr_trim_uncached_memory(num_physpages)) {
remove_all_active_ranges();
e820_register_active_regions(0, 0, -1UL);
- end_pfn = e820_end_of_ram();
+ num_physpages = e820_end_of_ram();
}
reserve_initrd();
- num_physpages = end_pfn;
-
check_efer();
- max_pfn_mapped = init_memory_mapping(0, (end_pfn << PAGE_SHIFT));
+ max_pfn_mapped = init_memory_mapping(0, (num_physpages << PAGE_SHIFT));
vsmp_init();
@@ -300,9 +298,9 @@
acpi_boot_table_init();
/* How many end-of-memory variables you have, grandma! */
- max_low_pfn = end_pfn;
- max_pfn = end_pfn;
- high_memory = (void *)__va(end_pfn * PAGE_SIZE - 1) + 1;
+ max_low_pfn = num_physpages;
+ max_pfn = num_physpages;
+ high_memory = (void *)__va(num_physpages * PAGE_SIZE - 1) + 1;
/* Remove active ranges so rediscovery with NUMA-awareness happens */
remove_all_active_ranges();
@@ -314,7 +312,7 @@
acpi_numa_init();
#endif
- initmem_init(0, end_pfn);
+ initmem_init(0, num_physpages);
dma32_reserve_bootmem();
@@ -366,7 +364,7 @@
* We trust e820 completely. No explicit ROM probing in memory.
*/
e820_reserve_resources();
- e820_mark_nosave_regions(end_pfn);
+ e820_mark_nosave_regions(num_physpages);
reserve_standard_io_resources();
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -240,7 +240,7 @@
return adr;
}
- if (pfn >= end_pfn)
+ if (pfn >= num_physpages)
panic("alloc_low_page: ran out of memory");
adr = early_ioremap(pfn * PAGE_SIZE, PAGE_SIZE);
@@ -584,9 +584,9 @@
memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
max_zone_pfns[ZONE_DMA] = MAX_DMA_PFN;
max_zone_pfns[ZONE_DMA32] = MAX_DMA32_PFN;
- max_zone_pfns[ZONE_NORMAL] = end_pfn;
+ max_zone_pfns[ZONE_NORMAL] = num_physpages;
- memory_present(0, 0, end_pfn);
+ memory_present(0, 0, num_physpages);
sparse_init();
free_area_init_nodes(max_zone_pfns);
}
@@ -668,8 +668,8 @@
#else
totalram_pages = free_all_bootmem();
#endif
- reservedpages = end_pfn - totalram_pages -
- absent_pages_in_range(0, end_pfn);
+ reservedpages = num_physpages - totalram_pages -
+ absent_pages_in_range(0, num_physpages);
after_bootmem = 1;
codesize = (unsigned long) &_etext - (unsigned long) &_text;
@@ -688,7 +688,7 @@
printk(KERN_INFO "Memory: %luk/%luk available (%ldk kernel code, "
"%ldk reserved, %ldk data, %ldk init)\n",
(unsigned long) nr_free_pages() << (PAGE_SHIFT-10),
- end_pfn << (PAGE_SHIFT-10),
+ num_physpages << (PAGE_SHIFT-10),
codesize >> 10,
reservedpages << (PAGE_SHIFT-10),
datasize >> 10,
@@ -788,7 +788,7 @@
#endif
unsigned long pfn = phys >> PAGE_SHIFT;
- if (pfn >= end_pfn) {
+ if (pfn >= num_physpages) {
/*
* This can happen with kdump kernels when accessing
* firmware tables:
diff --git a/arch/x86/mm/k8topology_64.c b/arch/x86/mm/k8topology_64.c
--- a/arch/x86/mm/k8topology_64.c
+++ b/arch/x86/mm/k8topology_64.c
@@ -143,8 +143,8 @@
limit |= (1<<24)-1;
limit++;
- if (limit > end_pfn << PAGE_SHIFT)
- limit = end_pfn << PAGE_SHIFT;
+ if (limit > num_physpages << PAGE_SHIFT)
+ limit = num_physpages << PAGE_SHIFT;
if (limit <= base)
continue;
diff --git a/arch/x86/mm/numa_64.c b/arch/x86/mm/numa_64.c
--- a/arch/x86/mm/numa_64.c
+++ b/arch/x86/mm/numa_64.c
@@ -86,7 +86,7 @@
addr = 0x8000;
nodemap_size = round_up(sizeof(s16) * memnodemapsize, L1_CACHE_BYTES);
- nodemap_addr = find_e820_area(addr, end_pfn<<PAGE_SHIFT,
+ nodemap_addr = find_e820_area(addr, num_physpages<<PAGE_SHIFT,
nodemap_size, L1_CACHE_BYTES);
if (nodemap_addr == -1UL) {
printk(KERN_ERR
@@ -579,7 +579,7 @@
memset(max_zone_pfns, 0, sizeof(max_zone_pfns));
max_zone_pfns[ZONE_DMA] = MAX_DMA_PFN;
max_zone_pfns[ZONE_DMA32] = MAX_DMA32_PFN;
- max_zone_pfns[ZONE_NORMAL] = end_pfn;
+ max_zone_pfns[ZONE_NORMAL] = num_physpages;
sparse_memory_present_with_active_regions(MAX_NUMNODES);
sparse_init();
diff --git a/arch/x86/mm/srat_64.c b/arch/x86/mm/srat_64.c
--- a/arch/x86/mm/srat_64.c
+++ b/arch/x86/mm/srat_64.c
@@ -299,7 +299,7 @@
pxmram = 0;
}
- e820ram = end_pfn - absent_pages_in_range(0, end_pfn);
+ e820ram = num_physpages - absent_pages_in_range(0, num_physpages);
/* We seem to lose 3 pages somewhere. Allow a bit of slack. */
if ((long)(e820ram - pxmram) >= 1*1024*1024) {
printk(KERN_ERR
diff --git a/arch/x86/power/hibernate_64.c b/arch/x86/power/hibernate_64.c
--- a/arch/x86/power/hibernate_64.c
+++ b/arch/x86/power/hibernate_64.c
@@ -83,7 +83,7 @@
/* Set up the direct mapping from scratch */
start = (unsigned long)pfn_to_kaddr(0);
- end = (unsigned long)pfn_to_kaddr(end_pfn);
+ end = (unsigned long)pfn_to_kaddr(num_physpages);
for (; start < end; start = next) {
pud_t *pud = (pud_t *)get_safe_page(GFP_ATOMIC);
diff --git a/include/asm-x86/page_64.h b/include/asm-x86/page_64.h
--- a/include/asm-x86/page_64.h
+++ b/include/asm-x86/page_64.h
@@ -58,7 +58,6 @@
void clear_page(void *page);
void copy_page(void *to, void *from);
-extern unsigned long end_pfn;
extern unsigned long phys_base;
extern unsigned long __phys_addr(unsigned long);
@@ -87,7 +86,7 @@
#endif /* !__ASSEMBLY__ */
#ifdef CONFIG_FLATMEM
-#define pfn_valid(pfn) ((pfn) < end_pfn)
+#define pfn_valid(pfn) ((pfn) < num_physpages)
#endif
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
include/asm-x86/setup.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/include/asm-x86/setup.h b/include/asm-x86/setup.h
--- a/include/asm-x86/setup.h
+++ b/include/asm-x86/setup.h
@@ -61,6 +61,8 @@
extern unsigned long init_pg_tables_start;
extern unsigned long init_pg_tables_end;
+#else
+void __init x86_64_start_kernel(char *real_mode);
#endif /* __i386__ */
#endif /* _SETUP */
#endif /* __ASSEMBLY__ */
wrmsr is a special instruction which can have arbitrary system-wide
effects. We don't want the compiler to reorder it with respect to
memory operations, so make it a memory barrier.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
include/asm-x86/msr.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/include/asm-x86/msr.h b/include/asm-x86/msr.h
--- a/include/asm-x86/msr.h
+++ b/include/asm-x86/msr.h
@@ -66,7 +66,7 @@
static inline void native_write_msr(unsigned int msr,
unsigned low, unsigned high)
{
- asm volatile("wrmsr" : : "c" (msr), "a"(low), "d" (high));
+ asm volatile("wrmsr" : : "c" (msr), "a"(low), "d" (high) : "memory");
}
static inline int native_write_msr_safe(unsigned int msr,
@@ -81,7 +81,8 @@
_ASM_EXTABLE(2b, 3b)
: "=a" (err)
: "c" (msr), "0" (low), "d" (high),
- "i" (-EFAULT));
+ "i" (-EFAULT)
+ : "memory");
return err;
}
Add sync_cmpxchg to match 32-bit's sync_cmpxchg.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
include/asm-x86/cmpxchg_64.h | 37 +++++++++++++++++++++++++++++++++++++
1 file changed, 37 insertions(+)
diff --git a/include/asm-x86/cmpxchg_64.h b/include/asm-x86/cmpxchg_64.h
--- a/include/asm-x86/cmpxchg_64.h
+++ b/include/asm-x86/cmpxchg_64.h
@@ -93,6 +93,39 @@
return old;
}
+/*
+ * Always use locked operations when touching memory shared with a
+ * hypervisor, since the system may be SMP even if the guest kernel
+ * isn't.
+ */
+static inline unsigned long __sync_cmpxchg(volatile void *ptr,
+ unsigned long old,
+ unsigned long new, int size)
+{
+ unsigned long prev;
+ switch (size) {
+ case 1:
+ asm volatile("lock; cmpxchgb %b1,%2"
+ : "=a"(prev)
+ : "q"(new), "m"(*__xg(ptr)), "0"(old)
+ : "memory");
+ return prev;
+ case 2:
+ asm volatile("lock; cmpxchgw %w1,%2"
+ : "=a"(prev)
+ : "r"(new), "m"(*__xg(ptr)), "0"(old)
+ : "memory");
+ return prev;
+ case 4:
+ asm volatile("lock; cmpxchgl %1,%2"
+ : "=a"(prev)
+ : "r"(new), "m"(*__xg(ptr)), "0"(old)
+ : "memory");
+ return prev;
+ }
+ return old;
+}
+
static inline unsigned long __cmpxchg_local(volatile void *ptr,
unsigned long old,
unsigned long new, int size)
@@ -139,6 +172,10 @@
((__typeof__(*(ptr)))__cmpxchg_local((ptr), (unsigned long)(o), \
(unsigned long)(n), \
sizeof(*(ptr))))
+#define sync_cmpxchg(ptr, o, n) \
+ ((__typeof__(*(ptr)))__sync_cmpxchg((ptr), (unsigned long)(o), \
+ (unsigned long)(n), \
+ sizeof(*(ptr))))
#define cmpxchg64_local(ptr, o, n) \
({ \
BUILD_BUG_ON(sizeof(*(ptr)) != 8); \
This makes a few changes to the construction of the initial
pagetables so that they work better with paravirt_ops/Xen. The main
areas are:
1. Support non-PSE mapping of memory, since Xen doesn't currently
allow 2M pages to be mapped in guests.
2. Make sure that the ioremap alias of each page is dropped before
attaching the new page to the pagetable. This avoids having
writable aliases of pagetable pages.
3. Preserve existing pagetable entries, rather than overwriting. It's
possible that a fair amount of pagetable has already been constructed,
so reuse what's already in place rather than ignoring and overwriting it.
The algorithm relies on the invariant that any page which is part of
the kernel pagetable is itself mapped in the linear memory area. This
way, it can avoid using ioremap on a pagetable page.
The invariant holds because it maps memory from low to high addresses,
and also allocates memory from low to high. Each allocated page can
map at least 2M of address space, so the mapped area will always
progress much faster than the allocated area. It relies on the early
boot code mapping enough pages to get started.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/mm/init_64.c | 94 ++++++++++++++++++++++++++++++++++++++++++-------
arch/x86/mm/ioremap.c | 2 -
2 files changed, 83 insertions(+), 13 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -257,6 +257,43 @@
early_iounmap(adr, PAGE_SIZE);
}
+static void __meminit
+phys_pte_init(pte_t *pte_page, unsigned long addr, unsigned long end)
+{
+ unsigned pages = 0;
+ int i;
+ pte_t *pte = pte_page + pte_index(addr);
+
+ for(i = pte_index(addr); i < PTRS_PER_PTE; i++, addr += PAGE_SIZE, pte++) {
+
+ if (addr >= end) {
+ if (!after_bootmem) {
+ for(; i < PTRS_PER_PTE; i++, pte++)
+ set_pte(pte, __pte(0));
+ }
+ break;
+ }
+
+ if (pte_val(*pte))
+ continue;
+
+ if (0)
+ printk(" pte=%p addr=%lx pte=%016lx\n",
+ pte, addr, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL).pte);
+ set_pte(pte, pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL));
+ pages++;
+ }
+ update_page_count(PG_LEVEL_4K, pages);
+}
+
+static void __meminit
+phys_pte_update(pmd_t *pmd, unsigned long address, unsigned long end)
+{
+ pte_t *pte = (pte_t *)pmd_page_vaddr(*pmd);
+
+ phys_pte_init(pte, address, end);
+}
+
static unsigned long __meminit
phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end)
{
@@ -265,7 +302,9 @@
int i = pmd_index(address);
for (; i < PTRS_PER_PMD; i++, address += PMD_SIZE) {
+ unsigned long pte_phys;
pmd_t *pmd = pmd_page + pmd_index(address);
+ pte_t *pte;
if (address >= end) {
if (!after_bootmem) {
@@ -275,12 +314,23 @@
break;
}
- if (pmd_val(*pmd))
+ if (pmd_val(*pmd)) {
+ phys_pte_update(pmd, address, end);
continue;
+ }
- pages++;
- set_pte((pte_t *)pmd,
- pfn_pte(address >> PAGE_SHIFT, PAGE_KERNEL_LARGE));
+ if (cpu_has_pse) {
+ pages++;
+ set_pte((pte_t *)pmd,
+ pfn_pte(address >> PAGE_SHIFT, PAGE_KERNEL_LARGE));
+ continue;
+ }
+
+ pte = alloc_low_page(&pte_phys);
+ phys_pte_init(pte, address, end);
+ unmap_low_page(pte);
+
+ pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
}
update_page_count(PG_LEVEL_2M, pages);
return address;
@@ -337,11 +387,11 @@
pmd = alloc_low_page(&pmd_phys);
spin_lock(&init_mm.page_table_lock);
+ last_map_addr = phys_pmd_init(pmd, addr, end);
+ unmap_low_page(pmd);
pud_populate(&init_mm, pud, __va(pmd_phys));
- last_map_addr = phys_pmd_init(pmd, addr, end);
spin_unlock(&init_mm.page_table_lock);
- unmap_low_page(pmd);
}
__flush_tlb_all();
update_page_count(PG_LEVEL_1G, pages);
@@ -349,15 +399,29 @@
return last_map_addr >> PAGE_SHIFT;
}
+static unsigned long __meminit
+phys_pud_update(pgd_t *pgd, unsigned long addr, unsigned long end)
+{
+ pud_t *pud;
+
+ pud = (pud_t *)pgd_page_vaddr(*pgd);
+
+ return phys_pud_init(pud, addr, end);
+}
+
static void __init find_early_table_space(unsigned long end)
{
- unsigned long puds, pmds, tables, start;
+ unsigned long puds, tables, start;
puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
tables = round_up(puds * sizeof(pud_t), PAGE_SIZE);
if (!direct_gbpages) {
- pmds = (end + PMD_SIZE - 1) >> PMD_SHIFT;
+ unsigned long pmds = (end + PMD_SIZE - 1) >> PMD_SHIFT;
tables += round_up(pmds * sizeof(pmd_t), PAGE_SIZE);
+ }
+ if (!cpu_has_pse) {
+ unsigned long ptes = (end + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ tables += round_up(ptes * sizeof(pte_t), PAGE_SIZE);
}
/*
@@ -529,19 +593,25 @@
unsigned long pud_phys;
pud_t *pud;
+ next = start + PGDIR_SIZE;
+ if (next > end)
+ next = end;
+
+ if (pgd_val(*pgd)) {
+ last_map_addr = phys_pud_update(pgd, __pa(start), __pa(end));
+ continue;
+ }
+
if (after_bootmem)
pud = pud_offset(pgd, start & PGDIR_MASK);
else
pud = alloc_low_page(&pud_phys);
- next = start + PGDIR_SIZE;
- if (next > end)
- next = end;
last_map_addr = phys_pud_init(pud, __pa(start), __pa(next));
+ unmap_low_page(pud);
if (!after_bootmem)
pgd_populate(&init_mm, pgd_offset_k(start),
__va(pud_phys));
- unmap_low_page(pud);
}
if (!after_bootmem)
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -513,7 +513,7 @@
if (pgprot_val(flags))
set_pte(pte, pfn_pte(phys >> PAGE_SHIFT, flags));
else
- pte_clear(NULL, addr, pte);
+ pte_clear(&init_mm, addr, pte);
__flush_tlb_one(addr);
}
From: Eduardo Habkost <[email protected]>
Use __pgd() in mk_kernel_pgd().
Signed-off-by: Eduardo Habkost <[email protected]>
---
include/asm-x86/pgtable_64.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/asm-x86/pgtable_64.h b/include/asm-x86/pgtable_64.h
--- a/include/asm-x86/pgtable_64.h
+++ b/include/asm-x86/pgtable_64.h
@@ -201,7 +201,7 @@
#define pgd_offset_k(address) (init_level4_pgt + pgd_index((address)))
#define pgd_present(pgd) (pgd_val(pgd) & _PAGE_PRESENT)
static inline int pgd_large(pgd_t pgd) { return 0; }
-#define mk_kernel_pgd(address) ((pgd_t){ (address) | _KERNPG_TABLE })
+#define mk_kernel_pgd(address) __pgd((address) | _KERNPG_TABLE)
/* PUD - Level3 access */
/* to find an entry in a page-table-directory. */
Add gate_offset() and gate_segment() macros, for calculating the offset
and segment from struct gate_struct fields.
[ gate_offset and gate_segment were broken for 32-bit. ]
Signed-off-by: Eduardo Habkost <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
include/asm-x86/desc_defs.h | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/include/asm-x86/desc_defs.h b/include/asm-x86/desc_defs.h
--- a/include/asm-x86/desc_defs.h
+++ b/include/asm-x86/desc_defs.h
@@ -75,10 +75,14 @@
typedef struct gate_struct64 gate_desc;
typedef struct ldttss_desc64 ldt_desc;
typedef struct ldttss_desc64 tss_desc;
+#define gate_offset(g) ((g).offset_low | ((unsigned long)(g).offset_middle << 16) | ((unsigned long)(g).offset_high << 32))
+#define gate_segment(g) ((g).segment)
#else
typedef struct desc_struct gate_desc;
typedef struct desc_struct ldt_desc;
typedef struct desc_struct tss_desc;
+#define gate_offset(g) (((g).b & 0xffff0000) | ((g).a & 0x0000ffff))
+#define gate_segment(g) ((g).a >> 16)
#endif
struct desc_ptr {
A fair amount of asm-x86/mmu_context.h can be unified, including the
activate_mm paravirt hook.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
include/asm-x86/mmu_context.h | 32 ++++++++++++++++++++++++++++++++
include/asm-x86/mmu_context_32.h | 28 ----------------------------
include/asm-x86/mmu_context_64.h | 18 ------------------
3 files changed, 32 insertions(+), 46 deletions(-)
diff --git a/include/asm-x86/mmu_context.h b/include/asm-x86/mmu_context.h
--- a/include/asm-x86/mmu_context.h
+++ b/include/asm-x86/mmu_context.h
@@ -1,5 +1,37 @@
+#ifndef __ASM_X86_MMU_CONTEXT_H
+#define __ASM_X86_MMU_CONTEXT_H
+
+#include <asm/desc.h>
+#include <asm/atomic.h>
+#include <asm/pgalloc.h>
+#include <asm/tlbflush.h>
+#include <asm/paravirt.h>
+#ifndef CONFIG_PARAVIRT
+#include <asm-generic/mm_hooks.h>
+
+static inline void paravirt_activate_mm(struct mm_struct *prev,
+ struct mm_struct *next)
+{
+}
+#endif /* !CONFIG_PARAVIRT */
+
+/*
+ * Used for LDT copy/destruction.
+ */
+int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
+void destroy_context(struct mm_struct *mm);
+
#ifdef CONFIG_X86_32
# include "mmu_context_32.h"
#else
# include "mmu_context_64.h"
#endif
+
+#define activate_mm(prev, next) \
+do { \
+ paravirt_activate_mm((prev), (next)); \
+ switch_mm((prev), (next), NULL); \
+} while (0);
+
+
+#endif /* __ASM_X86_MMU_CONTEXT_H */
diff --git a/include/asm-x86/mmu_context_32.h b/include/asm-x86/mmu_context_32.h
--- a/include/asm-x86/mmu_context_32.h
+++ b/include/asm-x86/mmu_context_32.h
@@ -1,27 +1,5 @@
#ifndef __I386_SCHED_H
#define __I386_SCHED_H
-
-#include <asm/desc.h>
-#include <asm/atomic.h>
-#include <asm/pgalloc.h>
-#include <asm/tlbflush.h>
-#include <asm/paravirt.h>
-#ifndef CONFIG_PARAVIRT
-#include <asm-generic/mm_hooks.h>
-
-static inline void paravirt_activate_mm(struct mm_struct *prev,
- struct mm_struct *next)
-{
-}
-#endif /* !CONFIG_PARAVIRT */
-
-
-/*
- * Used for LDT copy/destruction.
- */
-int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
-void destroy_context(struct mm_struct *mm);
-
static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
@@ -75,10 +53,4 @@
#define deactivate_mm(tsk, mm) \
asm("movl %0,%%gs": :"r" (0));
-#define activate_mm(prev, next) \
-do { \
- paravirt_activate_mm((prev), (next)); \
- switch_mm((prev), (next), NULL); \
-} while (0);
-
#endif
diff --git a/include/asm-x86/mmu_context_64.h b/include/asm-x86/mmu_context_64.h
--- a/include/asm-x86/mmu_context_64.h
+++ b/include/asm-x86/mmu_context_64.h
@@ -1,21 +1,7 @@
#ifndef __X86_64_MMU_CONTEXT_H
#define __X86_64_MMU_CONTEXT_H
-#include <asm/desc.h>
-#include <asm/atomic.h>
-#include <asm/pgalloc.h>
#include <asm/pda.h>
-#include <asm/pgtable.h>
-#include <asm/tlbflush.h>
-#ifndef CONFIG_PARAVIRT
-#include <asm-generic/mm_hooks.h>
-#endif
-
-/*
- * possibly do the LDT unload here?
- */
-int init_new_context(struct task_struct *tsk, struct mm_struct *mm);
-void destroy_context(struct mm_struct *mm);
static inline void enter_lazy_tlb(struct mm_struct *mm, struct task_struct *tsk)
{
@@ -65,8 +51,4 @@
asm volatile("movl %0,%%fs"::"r"(0)); \
} while (0)
-#define activate_mm(prev, next) \
- switch_mm((prev), (next), NULL)
-
-
#endif
pgd_index() is common between 32- and 64-bit, so move it to a common place.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
include/asm-x86/pgtable.h | 20 ++++++++++++++++++++
include/asm-x86/pgtable_32.h | 20 --------------------
include/asm-x86/pgtable_64.h | 3 ---
3 files changed, 20 insertions(+), 23 deletions(-)
diff --git a/include/asm-x86/pgtable.h b/include/asm-x86/pgtable.h
--- a/include/asm-x86/pgtable.h
+++ b/include/asm-x86/pgtable.h
@@ -357,6 +357,26 @@
# include "pgtable_64.h"
#endif
+/*
+ * the pgd page can be thought of an array like this: pgd_t[PTRS_PER_PGD]
+ *
+ * this macro returns the index of the entry in the pgd page which would
+ * control the given virtual address
+ */
+#define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
+
+/*
+ * pgd_offset() returns a (pgd_t *)
+ * pgd_index() is used get the offset into the pgd page's array of pgd_t's;
+ */
+#define pgd_offset(mm, address) ((mm)->pgd + pgd_index((address)))
+/*
+ * a shortcut which implies the use of the kernel's pgd, instead
+ * of a process's
+ */
+#define pgd_offset_k(address) pgd_offset(&init_mm, (address))
+
+
#define KERNEL_PGD_BOUNDARY pgd_index(PAGE_OFFSET)
#define KERNEL_PGD_PTRS (PTRS_PER_PGD - KERNEL_PGD_BOUNDARY)
diff --git a/include/asm-x86/pgtable_32.h b/include/asm-x86/pgtable_32.h
--- a/include/asm-x86/pgtable_32.h
+++ b/include/asm-x86/pgtable_32.h
@@ -119,26 +119,6 @@
*/
#define mk_pte(page, pgprot) pfn_pte(page_to_pfn(page), (pgprot))
-/*
- * the pgd page can be thought of an array like this: pgd_t[PTRS_PER_PGD]
- *
- * this macro returns the index of the entry in the pgd page which would
- * control the given virtual address
- */
-#define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
-#define pgd_index_k(addr) pgd_index((addr))
-
-/*
- * pgd_offset() returns a (pgd_t *)
- * pgd_index() is used get the offset into the pgd page's array of pgd_t's;
- */
-#define pgd_offset(mm, address) ((mm)->pgd + pgd_index((address)))
-
-/*
- * a shortcut which implies the use of the kernel's pgd, instead
- * of a process's
- */
-#define pgd_offset_k(address) pgd_offset(&init_mm, (address))
static inline int pud_large(pud_t pud) { return 0; }
diff --git a/include/asm-x86/pgtable_64.h b/include/asm-x86/pgtable_64.h
--- a/include/asm-x86/pgtable_64.h
+++ b/include/asm-x86/pgtable_64.h
@@ -196,9 +196,6 @@
#define pgd_page_vaddr(pgd) \
((unsigned long)__va((unsigned long)pgd_val((pgd)) & PTE_MASK))
#define pgd_page(pgd) (pfn_to_page(pgd_val((pgd)) >> PAGE_SHIFT))
-#define pgd_index(address) (((address) >> PGDIR_SHIFT) & (PTRS_PER_PGD - 1))
-#define pgd_offset(mm, address) ((mm)->pgd + pgd_index((address)))
-#define pgd_offset_k(address) (init_level4_pgt + pgd_index((address)))
#define pgd_present(pgd) (pgd_val(pgd) & _PAGE_PRESENT)
static inline int pgd_large(pgd_t pgd) { return 0; }
#define mk_kernel_pgd(address) __pgd((address) | _KERNPG_TABLE)
If PSE is not available, then fall back to 4k page mappings for the
vmemmap area.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/mm/init_64.c | 62 +++++++++++++++++++++++++++++++------------------
1 file changed, 40 insertions(+), 22 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -997,7 +997,7 @@
pmd_t *pmd;
for (; addr < end; addr = next) {
- next = pmd_addr_end(addr, end);
+ void *p = NULL;
pgd = vmemmap_pgd_populate(addr, node);
if (!pgd)
@@ -1007,33 +1007,51 @@
if (!pud)
return -ENOMEM;
- pmd = pmd_offset(pud, addr);
- if (pmd_none(*pmd)) {
- pte_t entry;
- void *p;
+ if (!cpu_has_pse) {
+ next = (addr + PAGE_SIZE) & PAGE_MASK;
+ pmd = vmemmap_pmd_populate(pud, addr, node);
- p = vmemmap_alloc_block(PMD_SIZE, node);
+ if (!pmd)
+ return -ENOMEM;
+
+ p = vmemmap_pte_populate(pmd, addr, node);
+
if (!p)
return -ENOMEM;
- entry = pfn_pte(__pa(p) >> PAGE_SHIFT,
- PAGE_KERNEL_LARGE);
- set_pmd(pmd, __pmd(pte_val(entry)));
+ addr_end = addr + PAGE_SIZE;
+ p_end = p + PAGE_SIZE;
+ } else {
+ next = pmd_addr_end(addr, end);
- /* check to see if we have contiguous blocks */
- if (p_end != p || node_start != node) {
- if (p_start)
- printk(KERN_DEBUG " [%lx-%lx] PMD -> [%p-%p] on node %d\n",
- addr_start, addr_end-1, p_start, p_end-1, node_start);
- addr_start = addr;
- node_start = node;
- p_start = p;
- }
- addr_end = addr + PMD_SIZE;
- p_end = p + PMD_SIZE;
- } else {
- vmemmap_verify((pte_t *)pmd, node, addr, next);
+ pmd = pmd_offset(pud, addr);
+ if (pmd_none(*pmd)) {
+ pte_t entry;
+
+ p = vmemmap_alloc_block(PMD_SIZE, node);
+ if (!p)
+ return -ENOMEM;
+
+ entry = pfn_pte(__pa(p) >> PAGE_SHIFT,
+ PAGE_KERNEL_LARGE);
+ set_pmd(pmd, __pmd(pte_val(entry)));
+
+ addr_end = addr + PMD_SIZE;
+ p_end = p + PMD_SIZE;
+
+ /* check to see if we have contiguous blocks */
+ if (p_end != p || node_start != node) {
+ if (p_start)
+ printk(KERN_DEBUG " [%lx-%lx] PMD -> [%p-%p] on node %d\n",
+ addr_start, addr_end-1, p_start, p_end-1, node_start);
+ addr_start = addr;
+ node_start = node;
+ p_start = p;
+ }
+ } else
+ vmemmap_verify((pte_t *)pmd, node, addr, next);
}
+
}
return 0;
}
This is needed when the kernel is running in ring 3, such as under Xen.
x86_64 has a quirk which makes it #GP on iret when SS is a null
descriptor.
This needs to be tested on bare metal to make sure it doesn't cause any
problems. The AMD specs say SS is always ignored (except on iret?).
Signed-off-by: Eduardo Habkost <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/entry_64.S | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -212,7 +212,7 @@
.macro FAKE_STACK_FRAME child_rip
/* push in order ss, rsp, eflags, cs, rip */
xorl %eax, %eax
- pushq %rax /* ss */
+ pushq $__KERNEL_DS /* ss */
CFI_ADJUST_CFA_OFFSET 8
/*CFI_REL_OFFSET ss,0*/
pushq %rax /* rsp */
We must leave lazy mode before switching the %fs and %gs selectors.
(patch should be merged with previous
__switch_to()/arch_leave_lazy_cpu_mode() patch)
Signed-off-by: Eduardo Habkost <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/process_64.c | 9 +++++++++
1 file changed, 9 insertions(+)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -603,6 +603,15 @@
load_TLS(next, cpu);
+ /*
+ * Leave lazy mode, flushing any hypercalls made here.
+ * This must be done before restoring TLS segments so
+ * the GDT and LDT are properly updated, and must be
+ * done before math_state_restore, so the TS bit is up
+ * to date.
+ */
+ arch_leave_lazy_cpu_mode();
+
/*
* Switch FS and GS.
*/
Replace privileged instructions with the corresponding pvops in
ia32entry.S.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/ia32/ia32entry.S | 16 ++++++++--------
1 file changed, 8 insertions(+), 8 deletions(-)
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -98,14 +98,14 @@
CFI_SIGNAL_FRAME
CFI_DEF_CFA rsp,0
CFI_REGISTER rsp,rbp
- swapgs
+ SWAPGS
movq %gs:pda_kernelstack, %rsp
addq $(PDA_STACKOFFSET),%rsp
/*
* No need to follow this irqs on/off section: the syscall
* disabled irqs, here we enable it straight after entry:
*/
- sti
+ ENABLE_INTERRUPTS(CLBR_NONE)
movl %ebp,%ebp /* zero extension */
pushq $__USER32_DS
CFI_ADJUST_CFA_OFFSET 8
@@ -147,7 +147,7 @@
call *ia32_sys_call_table(,%rax,8)
movq %rax,RAX-ARGOFFSET(%rsp)
GET_THREAD_INFO(%r10)
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
testl $_TIF_ALLWORK_MASK,threadinfo_flags(%r10)
jnz int_ret_from_sys_call
@@ -210,7 +210,7 @@
CFI_DEF_CFA rsp,PDA_STACKOFFSET
CFI_REGISTER rip,rcx
/*CFI_REGISTER rflags,r11*/
- swapgs
+ SWAPGS
movl %esp,%r8d
CFI_REGISTER rsp,r8
movq %gs:pda_kernelstack,%rsp
@@ -218,7 +218,7 @@
* No need to follow this irqs on/off section: the syscall
* disabled irqs and here we enable it straight after entry:
*/
- sti
+ ENABLE_INTERRUPTS(CLBR_NONE)
SAVE_ARGS 8,1,1
movl %eax,%eax /* zero extension */
movq %rax,ORIG_RAX-ARGOFFSET(%rsp)
@@ -251,7 +251,7 @@
call *ia32_sys_call_table(,%rax,8)
movq %rax,RAX-ARGOFFSET(%rsp)
GET_THREAD_INFO(%r10)
- cli
+ DISABLE_INTERRUPTS(CLBR_NONE)
TRACE_IRQS_OFF
testl $_TIF_ALLWORK_MASK,threadinfo_flags(%r10)
jnz int_ret_from_sys_call
@@ -319,12 +319,12 @@
/*CFI_REL_OFFSET rflags,EFLAGS-RIP*/
/*CFI_REL_OFFSET cs,CS-RIP*/
CFI_REL_OFFSET rip,RIP-RIP
- swapgs
+ SWAPGS
/*
* No need to follow this irqs on/off section: the syscall
* disabled irqs and here we enable it straight after entry:
*/
- sti
+ ENABLE_INTERRUPTS(CLBR_NONE)
movl %eax,%eax
pushq %rax
CFI_ADJUST_CFA_OFFSET 8
Use write_gdt_entry() to install the special vgetcpu descriptor used by
the vsyscall page.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/vsyscall_64.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/vsyscall_64.c b/arch/x86/kernel/vsyscall_64.c
--- a/arch/x86/kernel/vsyscall_64.c
+++ b/arch/x86/kernel/vsyscall_64.c
@@ -250,7 +250,7 @@
doesn't violate that. We'll find out if it does. */
static void __cpuinit vsyscall_set_cpu(int cpu)
{
- unsigned long *d;
+ unsigned long d;
unsigned long node = 0;
#ifdef CONFIG_NUMA
node = cpu_to_node(cpu);
@@ -261,11 +261,11 @@
/* Store cpu number in limit so that it can be loaded quickly
in user space in vgetcpu.
12 bits for the CPU and 8 bits for the node. */
- d = (unsigned long *)(get_cpu_gdt_table(cpu) + GDT_ENTRY_PER_CPU);
- *d = 0x0f40000000000ULL;
- *d |= cpu;
- *d |= (node & 0xf) << 12;
- *d |= (node >> 4) << 48;
+ d = 0x0f40000000000ULL;
+ d |= cpu;
+ d |= (node & 0xf) << 12;
+ d |= (node >> 4) << 48;
+ write_gdt_entry(get_cpu_gdt_table(cpu), GDT_ENTRY_PER_CPU, &d, DESCTYPE_S);
}
static void __cpuinit cpu_vsyscall_init(void *arg)
It's never safe to call a swapgs pvop while the user stack is still
current; it must be replaced inline. Rather than making a call, the
SWAPGS_UNSAFE_STACK pvop always just emits "swapgs" as a placeholder,
which must either be replaced inline or trap'n'emulated (somehow).
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
include/asm-x86/irqflags.h | 2 +-
include/asm-x86/paravirt.h | 10 ++++++++++
2 files changed, 11 insertions(+), 1 deletion(-)
diff --git a/include/asm-x86/irqflags.h b/include/asm-x86/irqflags.h
--- a/include/asm-x86/irqflags.h
+++ b/include/asm-x86/irqflags.h
@@ -167,6 +167,7 @@
#define INTERRUPT_RETURN_NMI_SAFE NATIVE_INTERRUPT_RETURN_NMI_SAFE
#ifdef CONFIG_X86_64
+#define SWAPGS_UNSAFE_STACK swapgs
#define INTERRUPT_RETURN iretq
#define USERGS_SYSRET64 \
swapgs; \
@@ -241,7 +242,6 @@
* Either way, this is a good way to document that we don't
* have a reliable stack. x86_64 only.
*/
-#define SWAPGS_UNSAFE_STACK swapgs
#define ARCH_LOCKDEP_SYS_EXIT call lockdep_sys_exit_thunk
#define ARCH_LOCKDEP_SYS_EXIT_IRQ \
TRACE_IRQS_ON; \
diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -1529,6 +1529,16 @@
#else /* !CONFIG_X86_32 */
+
+/*
+ * If swapgs is used while the userspace stack is still current,
+ * there's no way to call a pvop. The PV replacement *must* be
+ * inlined, or the swapgs instruction must be trapped and emulated.
+ */
+#define SWAPGS_UNSAFE_STACK \
+ PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_swapgs), CLBR_NONE, \
+ swapgs)
+
#define SWAPGS \
PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_swapgs), CLBR_NONE, \
PV_SAVE_REGS; \
From: Eduardo Habkost <[email protected]>
Set __PAGE_OFFSET to the most negative possible address +
16*PGDIR_SIZE. The gap leaves space for a hypervisor to fit; its size
is more or less arbitrary, but it's what Xen needs.
When booting native, kernel/head_64.S has a set of compile-time
generated pagetables used at boot time. This patch removes their
absolutely hard-coded layout and parameterises it on __PAGE_OFFSET
(and __START_KERNEL_map).
Signed-off-by: Eduardo Habkost <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/head_64.S | 19 +++++++++++++------
include/asm-x86/page_64.h | 8 +++++++-
2 files changed, 20 insertions(+), 7 deletions(-)
diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -31,6 +31,13 @@
* because we need identity-mapped pages.
*
*/
+
+#define pud_index(x) (((x) >> PUD_SHIFT) & (PTRS_PER_PUD-1))
+
+L4_PAGE_OFFSET = pgd_index(__PAGE_OFFSET)
+L3_PAGE_OFFSET = pud_index(__PAGE_OFFSET)
+L4_START_KERNEL = pgd_index(__START_KERNEL_map)
+L3_START_KERNEL = pud_index(__START_KERNEL_map)
.text
.section .text.head
@@ -77,8 +84,8 @@
/* Fixup the physical addresses in the page table
*/
addq %rbp, init_level4_pgt + 0(%rip)
- addq %rbp, init_level4_pgt + (258*8)(%rip)
- addq %rbp, init_level4_pgt + (511*8)(%rip)
+ addq %rbp, init_level4_pgt + (L4_PAGE_OFFSET*8)(%rip)
+ addq %rbp, init_level4_pgt + (L4_START_KERNEL*8)(%rip)
addq %rbp, level3_ident_pgt + 0(%rip)
@@ -338,9 +345,9 @@
*/
NEXT_PAGE(init_level4_pgt)
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .fill 257,8,0
+ .org init_level4_pgt + L4_PAGE_OFFSET*8, 0
.quad level3_ident_pgt - __START_KERNEL_map + _KERNPG_TABLE
- .fill 252,8,0
+ .org init_level4_pgt + L4_START_KERNEL*8, 0
/* (2^48-(2*1024*1024*1024))/(2^39) = 511 */
.quad level3_kernel_pgt - __START_KERNEL_map + _PAGE_TABLE
@@ -349,7 +356,7 @@
.fill 511,8,0
NEXT_PAGE(level3_kernel_pgt)
- .fill 510,8,0
+ .fill L3_START_KERNEL,8,0
/* (2^48-(2*1024*1024*1024)-((2^39)*511))/(2^30) = 510 */
.quad level2_kernel_pgt - __START_KERNEL_map + _KERNPG_TABLE
.quad level2_fixmap_pgt - __START_KERNEL_map + _PAGE_TABLE
diff --git a/include/asm-x86/page_64.h b/include/asm-x86/page_64.h
--- a/include/asm-x86/page_64.h
+++ b/include/asm-x86/page_64.h
@@ -26,7 +26,13 @@
#define PUD_PAGE_SIZE (_AC(1, UL) << PUD_SHIFT)
#define PUD_PAGE_MASK (~(PUD_PAGE_SIZE-1))
-#define __PAGE_OFFSET _AC(0xffff810000000000, UL)
+/*
+ * Set __PAGE_OFFSET to the most negative possible address +
+ * PGDIR_SIZE*16 (pgd slot 272). The gap is to allow a space for a
+ * hypervisor to fit. Choosing 16 slots here is arbitrary, but it's
+ * what Xen requires.
+ */
+#define __PAGE_OFFSET _AC(0xffff880000000000, UL)
#define __PHYSICAL_START CONFIG_PHYSICAL_START
#define __KERNEL_ALIGN 0x200000
Add FIX_PARAVIRT_BOOTMAP to the 64-bit fixmap; this matches 32-bit.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
include/asm-x86/fixmap_64.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/include/asm-x86/fixmap_64.h b/include/asm-x86/fixmap_64.h
--- a/include/asm-x86/fixmap_64.h
+++ b/include/asm-x86/fixmap_64.h
@@ -46,6 +46,9 @@
FIX_EFI_IO_MAP_LAST_PAGE,
FIX_EFI_IO_MAP_FIRST_PAGE = FIX_EFI_IO_MAP_LAST_PAGE
+ MAX_EFI_IO_PAGES - 1,
+#ifdef CONFIG_PARAVIRT
+ FIX_PARAVIRT_BOOTMAP,
+#endif
#ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT
FIX_OHCI1394_BASE,
#endif
Don't conflate sysret and sysexit; they're different instructions with
different semantics, and may be in use at the same time (at least
within the same kernel, depending on whether it's an Intel or AMD
system).
sysexit - just returns to userspace; it does no register restoration of
any kind, and interrupts must be explicitly (and atomically) enabled.
sysret - reloads flags from r11, so there is no need to explicitly
enable interrupts on 64-bit; it is also responsible for restoring the
usermode %gs.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/asm-offsets_32.c | 2 +-
arch/x86/kernel/asm-offsets_64.c | 2 +-
arch/x86/kernel/entry_32.S | 8 ++++----
arch/x86/kernel/entry_64.S | 4 ++--
arch/x86/kernel/paravirt.c | 12 +++++++++---
arch/x86/kernel/paravirt_patch_32.c | 4 ++--
arch/x86/kernel/paravirt_patch_64.c | 4 ++--
arch/x86/kernel/vmi_32.c | 4 ++--
arch/x86/xen/enlighten.c | 2 +-
include/asm-x86/irqflags.h | 4 ++--
include/asm-x86/paravirt.h | 15 ++++++++++-----
11 files changed, 36 insertions(+), 25 deletions(-)
diff --git a/arch/x86/kernel/asm-offsets_32.c b/arch/x86/kernel/asm-offsets_32.c
--- a/arch/x86/kernel/asm-offsets_32.c
+++ b/arch/x86/kernel/asm-offsets_32.c
@@ -112,7 +112,7 @@
OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
OFFSET(PV_CPU_nmi_return, pv_cpu_ops, nmi_return);
- OFFSET(PV_CPU_irq_enable_syscall_ret, pv_cpu_ops, irq_enable_syscall_ret);
+ OFFSET(PV_CPU_irq_enable_sysexit, pv_cpu_ops, irq_enable_sysexit);
OFFSET(PV_CPU_read_cr0, pv_cpu_ops, read_cr0);
#endif
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -63,7 +63,7 @@
OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
OFFSET(PV_CPU_nmi_return, pv_cpu_ops, nmi_return);
- OFFSET(PV_CPU_irq_enable_syscall_ret, pv_cpu_ops, irq_enable_syscall_ret);
+ OFFSET(PV_CPU_usersp_sysret, pv_cpu_ops, usersp_sysret);
OFFSET(PV_CPU_swapgs, pv_cpu_ops, swapgs);
OFFSET(PV_MMU_read_cr2, pv_mmu_ops, read_cr2);
#endif
diff --git a/arch/x86/kernel/entry_32.S b/arch/x86/kernel/entry_32.S
--- a/arch/x86/kernel/entry_32.S
+++ b/arch/x86/kernel/entry_32.S
@@ -59,7 +59,7 @@
* for paravirtualization. The following will never clobber any registers:
* INTERRUPT_RETURN (aka. "iret")
* GET_CR0_INTO_EAX (aka. "movl %cr0, %eax")
- * ENABLE_INTERRUPTS_SYSCALL_RET (aka "sti; sysexit").
+ * ENABLE_INTERRUPTS_SYSEXIT (aka "sti; sysexit").
*
* For DISABLE_INTERRUPTS/ENABLE_INTERRUPTS (aka "cli"/"sti"), you must
* specify what registers can be overwritten (CLBR_NONE, CLBR_EAX/EDX/ECX/ANY).
@@ -376,7 +376,7 @@
xorl %ebp,%ebp
TRACE_IRQS_ON
1: mov PT_FS(%esp), %fs
- ENABLE_INTERRUPTS_SYSCALL_RET
+ ENABLE_INTERRUPTS_SYSEXIT
CFI_ENDPROC
.pushsection .fixup,"ax"
2: movl $0,PT_FS(%esp)
@@ -905,10 +905,10 @@
NATIVE_INTERRUPT_RETURN_NMI_SAFE # Should we deal with popf exception ?
END(native_nmi_return)
-ENTRY(native_irq_enable_syscall_ret)
+ENTRY(native_irq_enable_sysexit)
sti
sysexit
-END(native_irq_enable_syscall_ret)
+END(native_irq_enable_sysexit)
#endif
KPROBE_ENTRY(int3)
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -167,7 +167,7 @@
#endif
#ifdef CONFIG_PARAVIRT
-ENTRY(native_irq_enable_syscall_ret)
+ENTRY(native_usersp_sysret)
movq %gs:pda_oldrsp,%rsp
swapgs
sysretq
@@ -383,7 +383,7 @@
CFI_REGISTER rip,rcx
RESTORE_ARGS 0,-ARG_SKIP,1
/*CFI_REGISTER rflags,r11*/
- ENABLE_INTERRUPTS_SYSCALL_RET
+ USERSP_SYSRET
CFI_RESTORE_STATE
/* Handle reschedules */
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -141,7 +141,8 @@
ret = paravirt_patch_nop();
else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) ||
type == PARAVIRT_PATCH(pv_cpu_ops.nmi_return) ||
- type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_syscall_ret))
+ type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit) ||
+ type == PARAVIRT_PATCH(pv_cpu_ops.usersp_sysret))
/* If operation requires a jmp, then jmp */
ret = paravirt_patch_jmp(insnbuf, opfunc, addr, len);
else
@@ -193,7 +194,8 @@
/* These are in entry.S */
extern void native_iret(void);
extern void native_nmi_return(void);
-extern void native_irq_enable_syscall_ret(void);
+extern void native_irq_enable_sysexit(void);
+extern void native_usersp_sysret(void);
static int __init print_banner(void)
{
@@ -329,7 +331,11 @@
.write_idt_entry = native_write_idt_entry,
.load_sp0 = native_load_sp0,
- .irq_enable_syscall_ret = native_irq_enable_syscall_ret,
+#ifdef CONFIG_X86_32
+ .irq_enable_sysexit = native_irq_enable_sysexit,
+#else
+ .usersp_sysret = native_usersp_sysret,
+#endif
.iret = native_iret,
.nmi_return = native_nmi_return,
.swapgs = native_swapgs,
diff --git a/arch/x86/kernel/paravirt_patch_32.c b/arch/x86/kernel/paravirt_patch_32.c
--- a/arch/x86/kernel/paravirt_patch_32.c
+++ b/arch/x86/kernel/paravirt_patch_32.c
@@ -8,7 +8,7 @@
DEF_NATIVE(pv_cpu_ops, iret, "iret");
DEF_NATIVE(pv_cpu_ops, nmi_return,
__stringify(NATIVE_INTERRUPT_RETURN_NMI_SAFE));
-DEF_NATIVE(pv_cpu_ops, irq_enable_syscall_ret, "sti; sysexit");
+DEF_NATIVE(pv_cpu_ops, irq_enable_sysexit, "sti; sysexit");
DEF_NATIVE(pv_mmu_ops, read_cr2, "mov %cr2, %eax");
DEF_NATIVE(pv_mmu_ops, write_cr3, "mov %eax, %cr3");
DEF_NATIVE(pv_mmu_ops, read_cr3, "mov %cr3, %eax");
@@ -33,7 +33,7 @@
PATCH_SITE(pv_irq_ops, save_fl);
PATCH_SITE(pv_cpu_ops, iret);
PATCH_SITE(pv_cpu_ops, nmi_return);
- PATCH_SITE(pv_cpu_ops, irq_enable_syscall_ret);
+ PATCH_SITE(pv_cpu_ops, irq_enable_sysexit);
PATCH_SITE(pv_mmu_ops, read_cr2);
PATCH_SITE(pv_mmu_ops, read_cr3);
PATCH_SITE(pv_mmu_ops, write_cr3);
diff --git a/arch/x86/kernel/paravirt_patch_64.c b/arch/x86/kernel/paravirt_patch_64.c
--- a/arch/x86/kernel/paravirt_patch_64.c
+++ b/arch/x86/kernel/paravirt_patch_64.c
@@ -18,7 +18,7 @@
DEF_NATIVE(pv_cpu_ops, wbinvd, "wbinvd");
/* the three commands give us more control to how to return from a syscall */
-DEF_NATIVE(pv_cpu_ops, irq_enable_syscall_ret, "movq %gs:" __stringify(pda_oldrsp) ", %rsp; swapgs; sysretq;");
+DEF_NATIVE(pv_cpu_ops, usersp_sysret, "movq %gs:" __stringify(pda_oldrsp) ", %rsp; swapgs; sysretq;");
DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs");
unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
@@ -39,7 +39,7 @@
PATCH_SITE(pv_irq_ops, irq_disable);
PATCH_SITE(pv_cpu_ops, iret);
PATCH_SITE(pv_cpu_ops, nmi_return);
- PATCH_SITE(pv_cpu_ops, irq_enable_syscall_ret);
+ PATCH_SITE(pv_cpu_ops, usersp_sysret);
PATCH_SITE(pv_cpu_ops, swapgs);
PATCH_SITE(pv_mmu_ops, read_cr2);
PATCH_SITE(pv_mmu_ops, read_cr3);
diff --git a/arch/x86/kernel/vmi_32.c b/arch/x86/kernel/vmi_32.c
--- a/arch/x86/kernel/vmi_32.c
+++ b/arch/x86/kernel/vmi_32.c
@@ -153,7 +153,7 @@
return patch_internal(VMI_CALL_IRET, len, insns, ip);
case PARAVIRT_PATCH(pv_cpu_ops.nmi_return):
return patch_internal(VMI_CALL_IRET, len, insns, ip);
- case PARAVIRT_PATCH(pv_cpu_ops.irq_enable_syscall_ret):
+ case PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit):
return patch_internal(VMI_CALL_SYSEXIT, len, insns, ip);
default:
break;
@@ -898,7 +898,7 @@
* the backend. They are performance critical anyway, so requiring
* a patch is not a big problem.
*/
- pv_cpu_ops.irq_enable_syscall_ret = (void *)0xfeedbab0;
+ pv_cpu_ops.irq_enable_sysexit = (void *)0xfeedbab0;
pv_cpu_ops.iret = (void *)0xbadbab0;
#ifdef CONFIG_SMP
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1087,7 +1087,7 @@
.iret = xen_iret,
.nmi_return = xen_iret,
- .irq_enable_syscall_ret = xen_sysexit,
+ .irq_enable_sysexit = xen_sysexit,
.load_tr_desc = paravirt_nop,
.set_ldt = xen_set_ldt,
diff --git a/include/asm-x86/irqflags.h b/include/asm-x86/irqflags.h
--- a/include/asm-x86/irqflags.h
+++ b/include/asm-x86/irqflags.h
@@ -168,13 +168,13 @@
#ifdef CONFIG_X86_64
#define INTERRUPT_RETURN iretq
-#define ENABLE_INTERRUPTS_SYSCALL_RET \
+#define USERSP_SYSRET \
movq %gs:pda_oldrsp, %rsp; \
swapgs; \
sysretq;
#else
#define INTERRUPT_RETURN iret
-#define ENABLE_INTERRUPTS_SYSCALL_RET sti; sysexit
+#define ENABLE_INTERRUPTS_SYSEXIT sti; sysexit
#define GET_CR0_INTO_EAX movl %cr0, %eax
#endif
diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -141,8 +141,9 @@
u64 (*read_pmc)(int counter);
unsigned long long (*read_tscp)(unsigned int *aux);
- /* These three are jmp to, not actually called. */
- void (*irq_enable_syscall_ret)(void);
+ /* These ones are jmp'ed to, not actually called. */
+ void (*irq_enable_sysexit)(void);
+ void (*usersp_sysret)(void);
void (*iret)(void);
void (*nmi_return)(void);
@@ -1485,10 +1486,10 @@
call PARA_INDIRECT(pv_irq_ops+PV_IRQ_irq_enable); \
PV_RESTORE_REGS;)
-#define ENABLE_INTERRUPTS_SYSCALL_RET \
- PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_irq_enable_syscall_ret),\
+#define ENABLE_INTERRUPTS_SYSEXIT \
+ PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_irq_enable_sysexit), \
CLBR_NONE, \
- jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_irq_enable_syscall_ret))
+ jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_irq_enable_sysexit))
#ifdef CONFIG_X86_32
@@ -1509,6 +1510,10 @@
movq %rax, %rcx; \
xorq %rax, %rax;
+#define USERSP_SYSRET \
+ PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_usersp_sysret), \
+ CLBR_NONE, \
+ jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_usersp_sysret))
#endif
#endif /* __ASSEMBLY__ */
We must do this because load_TLS() may need to clear %fs and %gs
in some cases (e.g. under Xen).
Signed-off-by: Eduardo Habkost <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/process_64.c | 17 +++++++++++------
1 file changed, 11 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -579,6 +579,7 @@
*next = &next_p->thread;
int cpu = smp_processor_id();
struct tss_struct *tss = &per_cpu(init_tss, cpu);
+ unsigned fsindex, gsindex;
/* we're going to use this soon, after a few expensive things */
if (next_p->fpu_counter>5)
@@ -601,6 +602,15 @@
if (unlikely(next->ds | prev->ds))
loadsegment(ds, next->ds);
+
+ /* We must save %fs and %gs before load_TLS() because
+ * %fs and %gs may be cleared by load_TLS().
+ *
+ * (e.g. xen_load_tls())
+ */
+ savesegment(fs, fsindex);
+ savesegment(gs, gsindex);
+
load_TLS(next, cpu);
/*
@@ -616,8 +626,6 @@
* Switch FS and GS.
*/
{
- unsigned fsindex;
- savesegment(fs, fsindex);
/* segment register != 0 always requires a reload.
also reload when it has changed.
when prev process used 64bit base always reload
@@ -635,10 +643,7 @@
if (next->fs)
wrmsrl(MSR_FS_BASE, next->fs);
prev->fsindex = fsindex;
- }
- {
- unsigned gsindex;
- savesegment(gs, gsindex);
+
if (unlikely(gsindex | next->gsindex | prev->gs)) {
load_gs_index(next->gsindex);
if (gsindex)
On 32-bit it's best to use a %cs: prefix to access memory where the
other segments may not be set up properly yet. On 64-bit it's best
to use a rip-relative addressing mode. Define PARA_INDIRECT() to
abstract this and generate the proper addressing mode in each case.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
include/asm-x86/paravirt.h | 30 ++++++++++++++++--------------
1 file changed, 16 insertions(+), 14 deletions(-)
diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -1456,55 +1456,57 @@
#define PV_RESTORE_REGS popq %rdx; popq %rcx; popq %rdi; popq %rax
#define PARA_PATCH(struct, off) ((PARAVIRT_PATCH_##struct + (off)) / 8)
#define PARA_SITE(ptype, clobbers, ops) _PVSITE(ptype, clobbers, ops, .quad, 8)
+#define PARA_INDIRECT(addr) *addr(%rip)
#else
#define PV_SAVE_REGS pushl %eax; pushl %edi; pushl %ecx; pushl %edx
#define PV_RESTORE_REGS popl %edx; popl %ecx; popl %edi; popl %eax
#define PARA_PATCH(struct, off) ((PARAVIRT_PATCH_##struct + (off)) / 4)
#define PARA_SITE(ptype, clobbers, ops) _PVSITE(ptype, clobbers, ops, .long, 4)
+#define PARA_INDIRECT(addr) *%cs:addr
#endif
#define INTERRUPT_RETURN \
PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_iret), CLBR_NONE, \
- jmp *%cs:pv_cpu_ops+PV_CPU_iret)
+ jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_iret))
#define INTERRUPT_RETURN_NMI_SAFE \
PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_nmi_return), CLBR_NONE, \
- jmp *%cs:pv_cpu_ops+PV_CPU_nmi_return)
+ jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_nmi_return))
#define DISABLE_INTERRUPTS(clobbers) \
PARA_SITE(PARA_PATCH(pv_irq_ops, PV_IRQ_irq_disable), clobbers, \
- PV_SAVE_REGS; \
- call *%cs:pv_irq_ops+PV_IRQ_irq_disable; \
+ PV_SAVE_REGS; \
+ call PARA_INDIRECT(pv_irq_ops+PV_IRQ_irq_disable); \
PV_RESTORE_REGS;) \
#define ENABLE_INTERRUPTS(clobbers) \
PARA_SITE(PARA_PATCH(pv_irq_ops, PV_IRQ_irq_enable), clobbers, \
- PV_SAVE_REGS; \
- call *%cs:pv_irq_ops+PV_IRQ_irq_enable; \
+ PV_SAVE_REGS; \
+ call PARA_INDIRECT(pv_irq_ops+PV_IRQ_irq_enable); \
PV_RESTORE_REGS;)
#define ENABLE_INTERRUPTS_SYSCALL_RET \
PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_irq_enable_syscall_ret),\
CLBR_NONE, \
- jmp *%cs:pv_cpu_ops+PV_CPU_irq_enable_syscall_ret)
+ jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_irq_enable_syscall_ret))
#ifdef CONFIG_X86_32
-#define GET_CR0_INTO_EAX \
- push %ecx; push %edx; \
- call *pv_cpu_ops+PV_CPU_read_cr0; \
+#define GET_CR0_INTO_EAX \
+ push %ecx; push %edx; \
+ call PARA_INDIRECT(pv_cpu_ops+PV_CPU_read_cr0); \
pop %edx; pop %ecx
#else
#define SWAPGS \
PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_swapgs), CLBR_NONE, \
PV_SAVE_REGS; \
- call *pv_cpu_ops+PV_CPU_swapgs; \
+ call PARA_INDIRECT(pv_cpu_ops+PV_CPU_swapgs); \
PV_RESTORE_REGS \
)
-#define GET_CR2_INTO_RCX \
- call *pv_mmu_ops+PV_MMU_read_cr2; \
- movq %rax, %rcx; \
+#define GET_CR2_INTO_RCX \
+ call PARA_INDIRECT(pv_mmu_ops+PV_MMU_read_cr2); \
+ movq %rax, %rcx; \
xorq %rax, %rax;
#endif
The 32-bit early_ioremap will work equally well for 64-bit, so just use it.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/setup_64.c | 2 +
arch/x86/mm/init_64.c | 52 -------------------------------------------
arch/x86/mm/ioremap.c | 5 +++-
include/asm-x86/fixmap_64.h | 13 ++++++++++
include/asm-x86/io.h | 13 ++++++++++
include/asm-x86/io_32.h | 12 ---------
6 files changed, 32 insertions(+), 65 deletions(-)
diff --git a/arch/x86/kernel/setup_64.c b/arch/x86/kernel/setup_64.c
--- a/arch/x86/kernel/setup_64.c
+++ b/arch/x86/kernel/setup_64.c
@@ -191,6 +191,8 @@
{
printk(KERN_INFO "Command line: %s\n", boot_command_line);
+ early_ioremap_init();
+
ROOT_DEV = old_decode_dev(boot_params.hdr.root_dev);
screen_info = boot_params.screen_info;
edid_info = boot_params.edid_info;
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -255,58 +255,6 @@
return;
early_iounmap(adr, PAGE_SIZE);
-}
-
-/* Must run before zap_low_mappings */
-__meminit void *early_ioremap(unsigned long addr, unsigned long size)
-{
- pmd_t *pmd, *last_pmd;
- unsigned long vaddr;
- int i, pmds;
-
- pmds = ((addr & ~PMD_MASK) + size + ~PMD_MASK) / PMD_SIZE;
- vaddr = __START_KERNEL_map;
- pmd = level2_kernel_pgt;
- last_pmd = level2_kernel_pgt + PTRS_PER_PMD - 1;
-
- for (; pmd <= last_pmd; pmd++, vaddr += PMD_SIZE) {
- for (i = 0; i < pmds; i++) {
- if (pmd_present(pmd[i]))
- goto continue_outer_loop;
- }
- vaddr += addr & ~PMD_MASK;
- addr &= PMD_MASK;
-
- for (i = 0; i < pmds; i++, addr += PMD_SIZE)
- set_pmd(pmd+i, __pmd(addr | __PAGE_KERNEL_LARGE_EXEC));
- __flush_tlb_all();
-
- return (void *)vaddr;
-continue_outer_loop:
- ;
- }
- printk(KERN_ERR "early_ioremap(0x%lx, %lu) failed\n", addr, size);
-
- return NULL;
-}
-
-/*
- * To avoid virtual aliases later:
- */
-__meminit void early_iounmap(void *addr, unsigned long size)
-{
- unsigned long vaddr;
- pmd_t *pmd;
- int i, pmds;
-
- vaddr = (unsigned long)addr;
- pmds = ((vaddr & ~PMD_MASK) + size + ~PMD_MASK) / PMD_SIZE;
- pmd = level2_kernel_pgt + pmd_index(vaddr);
-
- for (i = 0; i < pmds; i++)
- pmd_clear(pmd + i);
-
- __flush_tlb_all();
}
static unsigned long __meminit
diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -409,8 +409,6 @@
return;
}
-#ifdef CONFIG_X86_32
-
int __initdata early_ioremap_debug;
static int __init early_ioremap_debug_setup(char *str)
@@ -511,6 +509,7 @@
return;
}
pte = early_ioremap_pte(addr);
+
if (pgprot_val(flags))
set_pte(pte, pfn_pte(phys >> PAGE_SHIFT, flags));
else
@@ -652,5 +651,3 @@
{
WARN_ON(1);
}
-
-#endif /* CONFIG_X86_32 */
diff --git a/include/asm-x86/fixmap_64.h b/include/asm-x86/fixmap_64.h
--- a/include/asm-x86/fixmap_64.h
+++ b/include/asm-x86/fixmap_64.h
@@ -49,6 +49,19 @@
#ifdef CONFIG_PROVIDE_OHCI1394_DMA_INIT
FIX_OHCI1394_BASE,
#endif
+ __end_of_permanent_fixed_addresses,
+ /*
+ * 256 temporary boot-time mappings, used by early_ioremap(),
+ * before ioremap() is functional.
+ *
+ * We round it up to the next 512 pages boundary so that we
+ * can have a single pgd entry and a single pte table:
+ */
+#define NR_FIX_BTMAPS 64
+#define FIX_BTMAPS_NESTING 4
+ FIX_BTMAP_END = __end_of_permanent_fixed_addresses + 512 -
+ (__end_of_permanent_fixed_addresses & 511),
+ FIX_BTMAP_BEGIN = FIX_BTMAP_END + NR_FIX_BTMAPS*FIX_BTMAPS_NESTING - 1,
__end_of_fixed_addresses
};
diff --git a/include/asm-x86/io.h b/include/asm-x86/io.h
--- a/include/asm-x86/io.h
+++ b/include/asm-x86/io.h
@@ -72,4 +72,17 @@
unsigned long prot_val);
extern void __iomem *ioremap_wc(unsigned long offset, unsigned long size);
+/*
+ * early_ioremap() and early_iounmap() are for temporary early boot-time
+ * mappings, before the real ioremap() is functional.
+ * A boot-time mapping is currently limited to at most 16 pages.
+ */
+extern void early_ioremap_init(void);
+extern void early_ioremap_clear(void);
+extern void early_ioremap_reset(void);
+extern void *early_ioremap(unsigned long offset, unsigned long size);
+extern void early_iounmap(void *addr, unsigned long size);
+extern void __iomem *fix_ioremap(unsigned idx, unsigned long phys);
+
+
#endif /* _ASM_X86_IO_H */
diff --git a/include/asm-x86/io_32.h b/include/asm-x86/io_32.h
--- a/include/asm-x86/io_32.h
+++ b/include/asm-x86/io_32.h
@@ -122,18 +122,6 @@
extern void iounmap(volatile void __iomem *addr);
/*
- * early_ioremap() and early_iounmap() are for temporary early boot-time
- * mappings, before the real ioremap() is functional.
- * A boot-time mapping is currently limited to at most 16 pages.
- */
-extern void early_ioremap_init(void);
-extern void early_ioremap_clear(void);
-extern void early_ioremap_reset(void);
-extern void *early_ioremap(unsigned long offset, unsigned long size);
-extern void early_iounmap(void *addr, unsigned long size);
-extern void __iomem *fix_ioremap(unsigned idx, unsigned long phys);
-
-/*
* ISA I/O bus memory addresses are 1:1 with the physical address.
*/
#define isa_virt_to_bus virt_to_phys
Use the _populate() functions to attach new pages to a pagetable, to
make sure the right paravirt_ops calls get called.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/mm/init_64.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -167,7 +167,7 @@
pud = pud_offset(pgd, vaddr);
if (pud_none(*pud)) {
pmd = (pmd_t *) spp_getpage();
- set_pud(pud, __pud(__pa(pmd) | _KERNPG_TABLE | _PAGE_USER));
+ pud_populate(&init_mm, pud, pmd);
if (pmd != pmd_offset(pud, 0)) {
printk(KERN_ERR "PAGETABLE BUG #01! %p <-> %p\n",
pmd, pmd_offset(pud, 0));
@@ -177,7 +177,7 @@
pmd = pmd_offset(pud, vaddr);
if (pmd_none(*pmd)) {
pte = (pte_t *) spp_getpage();
- set_pmd(pmd, __pmd(__pa(pte) | _KERNPG_TABLE | _PAGE_USER));
+ pmd_populate_kernel(&init_mm, pmd, pte);
if (pte != pte_offset_kernel(pmd, 0)) {
printk(KERN_ERR "PAGETABLE BUG #02!\n");
return;
@@ -389,7 +389,7 @@
pmd = alloc_low_page(&pmd_phys);
spin_lock(&init_mm.page_table_lock);
- set_pud(pud, __pud(pmd_phys | _KERNPG_TABLE));
+ pud_populate(&init_mm, pud, __va(pmd_phys));
last_map_addr = phys_pmd_init(pmd, addr, end);
spin_unlock(&init_mm.page_table_lock);
@@ -591,7 +591,8 @@
next = end;
last_map_addr = phys_pud_init(pud, __pa(start), __pa(next));
if (!after_bootmem)
- set_pgd(pgd_offset_k(start), mk_kernel_pgd(pud_phys));
+ pgd_populate(&init_mm, pgd_offset_k(start),
+ __va(pud_phys));
unmap_low_page(pud);
}
There's no need to combine restoring the user rsp with the sysret
pvop, so split it out. This makes the pvop's semantics closer to the
machine instruction.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/asm-offsets_64.c | 2 +-
arch/x86/kernel/entry_64.S | 6 +++---
arch/x86/kernel/paravirt.c | 6 +++---
arch/x86/kernel/paravirt_patch_64.c | 4 ++--
include/asm-x86/irqflags.h | 3 +--
include/asm-x86/paravirt.h | 8 ++++----
6 files changed, 14 insertions(+), 15 deletions(-)
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -63,7 +63,7 @@
OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
OFFSET(PV_CPU_nmi_return, pv_cpu_ops, nmi_return);
- OFFSET(PV_CPU_usersp_sysret, pv_cpu_ops, usersp_sysret);
+ OFFSET(PV_CPU_usergs_sysret, pv_cpu_ops, usergs_sysret);
OFFSET(PV_CPU_swapgs, pv_cpu_ops, swapgs);
OFFSET(PV_MMU_read_cr2, pv_mmu_ops, read_cr2);
#endif
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -167,8 +167,7 @@
#endif
#ifdef CONFIG_PARAVIRT
-ENTRY(native_usersp_sysret)
- movq %gs:pda_oldrsp,%rsp
+ENTRY(native_usergs_sysret)
swapgs
sysretq
#endif /* CONFIG_PARAVIRT */
@@ -383,7 +382,8 @@
CFI_REGISTER rip,rcx
RESTORE_ARGS 0,-ARG_SKIP,1
/*CFI_REGISTER rflags,r11*/
- USERSP_SYSRET
+ movq %gs:pda_oldrsp, %rsp
+ USERGS_SYSRET
CFI_RESTORE_STATE
/* Handle reschedules */
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -142,7 +142,7 @@
else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) ||
type == PARAVIRT_PATCH(pv_cpu_ops.nmi_return) ||
type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit) ||
- type == PARAVIRT_PATCH(pv_cpu_ops.usersp_sysret))
+ type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret))
/* If operation requires a jmp, then jmp */
ret = paravirt_patch_jmp(insnbuf, opfunc, addr, len);
else
@@ -195,7 +195,7 @@
extern void native_iret(void);
extern void native_nmi_return(void);
extern void native_irq_enable_sysexit(void);
-extern void native_usersp_sysret(void);
+extern void native_usergs_sysret(void);
static int __init print_banner(void)
{
@@ -334,7 +334,7 @@
#ifdef CONFIG_X86_32
.irq_enable_sysexit = native_irq_enable_sysexit,
#else
- .usersp_sysret = native_usersp_sysret,
+ .usergs_sysret = native_usergs_sysret,
#endif
.iret = native_iret,
.nmi_return = native_nmi_return,
diff --git a/arch/x86/kernel/paravirt_patch_64.c b/arch/x86/kernel/paravirt_patch_64.c
--- a/arch/x86/kernel/paravirt_patch_64.c
+++ b/arch/x86/kernel/paravirt_patch_64.c
@@ -18,7 +18,7 @@
DEF_NATIVE(pv_cpu_ops, wbinvd, "wbinvd");
/* the three commands give us more control to how to return from a syscall */
-DEF_NATIVE(pv_cpu_ops, usersp_sysret, "movq %gs:" __stringify(pda_oldrsp) ", %rsp; swapgs; sysretq;");
+DEF_NATIVE(pv_cpu_ops, usergs_sysret, "swapgs; sysretq;");
DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs");
unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
@@ -39,7 +39,7 @@
PATCH_SITE(pv_irq_ops, irq_disable);
PATCH_SITE(pv_cpu_ops, iret);
PATCH_SITE(pv_cpu_ops, nmi_return);
- PATCH_SITE(pv_cpu_ops, usersp_sysret);
+ PATCH_SITE(pv_cpu_ops, usergs_sysret);
PATCH_SITE(pv_cpu_ops, swapgs);
PATCH_SITE(pv_mmu_ops, read_cr2);
PATCH_SITE(pv_mmu_ops, read_cr3);
diff --git a/include/asm-x86/irqflags.h b/include/asm-x86/irqflags.h
--- a/include/asm-x86/irqflags.h
+++ b/include/asm-x86/irqflags.h
@@ -168,8 +168,7 @@
#ifdef CONFIG_X86_64
#define INTERRUPT_RETURN iretq
-#define USERSP_SYSRET \
- movq %gs:pda_oldrsp, %rsp; \
+#define USERGS_SYSRET \
swapgs; \
sysretq;
#else
diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -143,7 +143,7 @@
/* These ones are jmp'ed to, not actually called. */
void (*irq_enable_sysexit)(void);
- void (*usersp_sysret)(void);
+ void (*usergs_sysret)(void);
void (*iret)(void);
void (*nmi_return)(void);
@@ -1510,10 +1510,10 @@
movq %rax, %rcx; \
xorq %rax, %rax;
-#define USERSP_SYSRET \
- PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_usersp_sysret), \
+#define USERGS_SYSRET \
+ PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_usergs_sysret), \
CLBR_NONE, \
- jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_usersp_sysret))
+ jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_usergs_sysret))
#endif
#endif /* __ASSEMBLY__ */
In a 64-bit system, we need separate sysret/sysexit operations to
return to a 32-bit userspace.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/ia32/ia32entry.S | 21 +++++++++---
arch/x86/kernel/asm-offsets_64.c | 4 +-
arch/x86/kernel/entry_64.S | 4 +-
arch/x86/kernel/paravirt.c | 12 ++++---
arch/x86/kernel/paravirt_patch_64.c | 9 +++--
include/asm-x86/irqflags.h | 14 ++++++--
include/asm-x86/paravirt.h | 58 ++++++++++++++++++++++++++++-------
7 files changed, 91 insertions(+), 31 deletions(-)
diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -60,6 +60,19 @@
CFI_UNDEFINED r14
CFI_UNDEFINED r15
.endm
+
+#ifdef CONFIG_PARAVIRT
+ENTRY(native_usergs_sysret32)
+ swapgs
+ sysretl
+ENDPROC(native_usergs_sysret32)
+
+ENTRY(native_irq_enable_sysexit)
+ swapgs
+ sti
+ sysexit
+ENDPROC(native_irq_enable_sysexit)
+#endif
/*
* 32bit SYSENTER instruction entry.
@@ -151,10 +164,7 @@
CFI_ADJUST_CFA_OFFSET -8
CFI_REGISTER rsp,rcx
TRACE_IRQS_ON
- swapgs
- sti /* sti only takes effect after the next instruction */
- /* sysexit */
- .byte 0xf, 0x35
+ ENABLE_INTERRUPTS_SYSEXIT32
sysenter_tracesys:
CFI_RESTORE_STATE
@@ -254,8 +264,7 @@
TRACE_IRQS_ON
movl RSP-ARGOFFSET(%rsp),%esp
CFI_RESTORE rsp
- swapgs
- sysretl
+ USERGS_SYSRET32
cstar_tracesys:
CFI_RESTORE_STATE
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -63,7 +63,9 @@
OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
OFFSET(PV_CPU_nmi_return, pv_cpu_ops, nmi_return);
- OFFSET(PV_CPU_usergs_sysret, pv_cpu_ops, usergs_sysret);
+ OFFSET(PV_CPU_usergs_sysret32, pv_cpu_ops, usergs_sysret32);
+ OFFSET(PV_CPU_usergs_sysret64, pv_cpu_ops, usergs_sysret64);
+ OFFSET(PV_CPU_irq_enable_sysexit, pv_cpu_ops, irq_enable_sysexit);
OFFSET(PV_CPU_swapgs, pv_cpu_ops, swapgs);
OFFSET(PV_MMU_read_cr2, pv_mmu_ops, read_cr2);
#endif
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -167,7 +167,7 @@
#endif
#ifdef CONFIG_PARAVIRT
-ENTRY(native_usergs_sysret)
+ENTRY(native_usergs_sysret64)
swapgs
sysretq
#endif /* CONFIG_PARAVIRT */
@@ -383,7 +383,7 @@
RESTORE_ARGS 0,-ARG_SKIP,1
/*CFI_REGISTER rflags,r11*/
movq %gs:pda_oldrsp, %rsp
- USERGS_SYSRET
+ USERGS_SYSRET64
CFI_RESTORE_STATE
/* Handle reschedules */
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -142,7 +142,8 @@
else if (type == PARAVIRT_PATCH(pv_cpu_ops.iret) ||
type == PARAVIRT_PATCH(pv_cpu_ops.nmi_return) ||
type == PARAVIRT_PATCH(pv_cpu_ops.irq_enable_sysexit) ||
- type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret))
+ type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret32) ||
+ type == PARAVIRT_PATCH(pv_cpu_ops.usergs_sysret64))
/* If operation requires a jmp, then jmp */
ret = paravirt_patch_jmp(insnbuf, opfunc, addr, len);
else
@@ -195,7 +196,8 @@
extern void native_iret(void);
extern void native_nmi_return(void);
extern void native_irq_enable_sysexit(void);
-extern void native_usergs_sysret(void);
+extern void native_usergs_sysret32(void);
+extern void native_usergs_sysret64(void);
static int __init print_banner(void)
{
@@ -331,10 +333,10 @@
.write_idt_entry = native_write_idt_entry,
.load_sp0 = native_load_sp0,
-#ifdef CONFIG_X86_32
.irq_enable_sysexit = native_irq_enable_sysexit,
-#else
- .usergs_sysret = native_usergs_sysret,
+#ifdef CONFIG_X86_64
+ .usergs_sysret32 = native_usergs_sysret32,
+ .usergs_sysret64 = native_usergs_sysret64,
#endif
.iret = native_iret,
.nmi_return = native_nmi_return,
diff --git a/arch/x86/kernel/paravirt_patch_64.c b/arch/x86/kernel/paravirt_patch_64.c
--- a/arch/x86/kernel/paravirt_patch_64.c
+++ b/arch/x86/kernel/paravirt_patch_64.c
@@ -17,8 +17,9 @@
DEF_NATIVE(pv_cpu_ops, clts, "clts");
DEF_NATIVE(pv_cpu_ops, wbinvd, "wbinvd");
-/* the three commands give us more control to how to return from a syscall */
-DEF_NATIVE(pv_cpu_ops, usergs_sysret, "swapgs; sysretq;");
+DEF_NATIVE(pv_cpu_ops, irq_enable_sysexit, "swapgs; sti; sysexit");
+DEF_NATIVE(pv_cpu_ops, usergs_sysret64, "swapgs; sysretq");
+DEF_NATIVE(pv_cpu_ops, usergs_sysret32, "swapgs; sysretl");
DEF_NATIVE(pv_cpu_ops, swapgs, "swapgs");
unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
@@ -39,7 +40,9 @@
PATCH_SITE(pv_irq_ops, irq_disable);
PATCH_SITE(pv_cpu_ops, iret);
PATCH_SITE(pv_cpu_ops, nmi_return);
- PATCH_SITE(pv_cpu_ops, usergs_sysret);
+ PATCH_SITE(pv_cpu_ops, irq_enable_sysexit);
+ PATCH_SITE(pv_cpu_ops, usergs_sysret32);
+ PATCH_SITE(pv_cpu_ops, usergs_sysret64);
PATCH_SITE(pv_cpu_ops, swapgs);
PATCH_SITE(pv_mmu_ops, read_cr2);
PATCH_SITE(pv_mmu_ops, read_cr3);
diff --git a/include/asm-x86/irqflags.h b/include/asm-x86/irqflags.h
--- a/include/asm-x86/irqflags.h
+++ b/include/asm-x86/irqflags.h
@@ -168,9 +168,17 @@
#ifdef CONFIG_X86_64
#define INTERRUPT_RETURN iretq
-#define USERGS_SYSRET \
- swapgs; \
- sysretq;
+#define USERGS_SYSRET64 \
+ swapgs; \
+ sysretq;
+#define USERGS_SYSRET32 \
+ swapgs; \
+ sysretl
+#define ENABLE_INTERRUPTS_SYSEXIT32 \
+ swapgs; \
+ sti; \
+ sysexit
+
#else
#define INTERRUPT_RETURN iret
#define ENABLE_INTERRUPTS_SYSEXIT sti; sysexit
diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -141,10 +141,35 @@
u64 (*read_pmc)(int counter);
unsigned long long (*read_tscp)(unsigned int *aux);
- /* These ones are jmp'ed to, not actually called. */
+ /*
+ * Atomically enable interrupts and return to userspace. This
+ * is only ever used to return to 32-bit processes; in a
+ * 64-bit kernel, it's used for 32-on-64 compat processes, but
+ * never native 64-bit processes. (Jump, not call.)
+ */
void (*irq_enable_sysexit)(void);
- void (*usergs_sysret)(void);
+
+ /*
+ * Switch to usermode gs and return to 64-bit usermode using
+ * sysret. Only used in 64-bit kernels to return to 64-bit
+ * processes. Usermode register state, including %rsp, must
+ * already be restored.
+ */
+ void (*usergs_sysret64)(void);
+
+ /*
+ * Switch to usermode gs and return to 32-bit usermode using
+ * sysret. Used to return to 32-on-64 compat processes.
+ * Other usermode register state, including %esp, must already
+ * be restored.
+ */
+ void (*usergs_sysret32)(void);
+
+ /* Normal iret. Jump to this with the standard iret stack
+ frame set up. */
void (*iret)(void);
+
+ /* Return from NMI. (?) */
void (*nmi_return)(void);
void (*swapgs)(void);
@@ -1486,18 +1511,24 @@
call PARA_INDIRECT(pv_irq_ops+PV_IRQ_irq_enable); \
PV_RESTORE_REGS;)
-#define ENABLE_INTERRUPTS_SYSEXIT \
- PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_irq_enable_sysexit), \
+#define USERGS_SYSRET32 \
+ PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_usergs_sysret32), \
CLBR_NONE, \
- jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_irq_enable_sysexit))
-
+ jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_usergs_sysret32))
#ifdef CONFIG_X86_32
#define GET_CR0_INTO_EAX \
push %ecx; push %edx; \
call PARA_INDIRECT(pv_cpu_ops+PV_CPU_read_cr0); \
pop %edx; pop %ecx
-#else
+
+#define ENABLE_INTERRUPTS_SYSEXIT \
+ PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_irq_enable_sysexit), \
+ CLBR_NONE, \
+ jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_irq_enable_sysexit))
+
+
+#else /* !CONFIG_X86_32 */
#define SWAPGS \
PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_swapgs), CLBR_NONE, \
PV_SAVE_REGS; \
@@ -1510,11 +1541,16 @@
movq %rax, %rcx; \
xorq %rax, %rax;
-#define USERGS_SYSRET \
- PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_usergs_sysret), \
+#define USERGS_SYSRET64 \
+ PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_usergs_sysret64), \
CLBR_NONE, \
- jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_usergs_sysret))
-#endif
+ jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_usergs_sysret64))
+
+#define ENABLE_INTERRUPTS_SYSEXIT32 \
+ PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_irq_enable_sysexit), \
+ CLBR_NONE, \
+ jmp PARA_INDIRECT(pv_cpu_ops+PV_CPU_irq_enable_sysexit))
+#endif /* CONFIG_X86_32 */
#endif /* __ASSEMBLY__ */
#endif /* CONFIG_PARAVIRT */
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
include/asm-x86/pgtable.h | 2 ++
1 file changed, 2 insertions(+)
diff --git a/include/asm-x86/pgtable.h b/include/asm-x86/pgtable.h
--- a/include/asm-x86/pgtable.h
+++ b/include/asm-x86/pgtable.h
@@ -425,6 +425,8 @@
* race with other CPU's that might be updating the dirty
* bit at the same time.
*/
+struct vm_area_struct;
+
#define __HAVE_ARCH_PTEP_SET_ACCESS_FLAGS
extern int ptep_set_access_flags(struct vm_area_struct *vma,
unsigned long address, pte_t *ptep,
Rather than just jumping to 0 when there's a missing operation, raise a BUG.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/Kconfig | 7 +++++++
include/asm-x86/paravirt.h | 8 ++++++++
2 files changed, 15 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -433,6 +433,13 @@
the kernel is theoretically slower and slightly larger.
endif
+
+config PARAVIRT_DEBUG
+ bool "paravirt-ops debugging"
+ depends on PARAVIRT && DEBUG_KERNEL
+ help
+ Enable to debug paravirt_ops internals. Specifically, BUG if
+ a paravirt_op is missing when it is called.
config MEMTEST
bool "Memtest"
diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -460,10 +460,17 @@
#define VEXTRA_CLOBBERS , "rax", "r8", "r9", "r10", "r11"
#endif
+#ifdef CONFIG_PARAVIRT_DEBUG
+#define PVOP_TEST_NULL(op) BUG_ON(op == NULL)
+#else
+#define PVOP_TEST_NULL(op) ((void)op)
+#endif
+
#define __PVOP_CALL(rettype, op, pre, post, ...) \
({ \
rettype __ret; \
PVOP_CALL_ARGS; \
+ PVOP_TEST_NULL(op); \
/* This is 32-bit specific, but is okay in 64-bit */ \
/* since this condition will never hold */ \
if (sizeof(rettype) > sizeof(unsigned long)) { \
@@ -492,6 +499,7 @@
#define __PVOP_VCALL(op, pre, post, ...) \
({ \
PVOP_VCALL_ARGS; \
+ PVOP_TEST_NULL(op); \
asm volatile(pre \
paravirt_alt(PARAVIRT_CALL) \
post \
Because Xen doesn't support PSE mappings in guests, all code which
assumed the presence of PSE has been changed to fall back to smaller
mappings if necessary. As a result, PSE is optional rather than
required (though still used wherever possible).
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
include/asm-x86/required-features.h | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/include/asm-x86/required-features.h b/include/asm-x86/required-features.h
--- a/include/asm-x86/required-features.h
+++ b/include/asm-x86/required-features.h
@@ -42,7 +42,7 @@
#endif
#ifdef CONFIG_X86_64
-#define NEED_PSE (1<<(X86_FEATURE_PSE & 31))
+#define NEED_PSE 0
#define NEED_MSR (1<<(X86_FEATURE_MSR & 31))
#define NEED_PGE (1<<(X86_FEATURE_PGE & 31))
#define NEED_FXSR (1<<(X86_FEATURE_FXSR & 31))
This removes a pile of buggy open-coded implementations of savesegment
and loadsegment.
(They are buggy because they don't have memory barriers to prevent
them from being reordered with respect to memory accesses.)
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/cpu/common_64.c | 3 ++-
arch/x86/kernel/process_64.c | 28 +++++++++++++++-------------
2 files changed, 17 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kernel/cpu/common_64.c b/arch/x86/kernel/cpu/common_64.c
--- a/arch/x86/kernel/cpu/common_64.c
+++ b/arch/x86/kernel/cpu/common_64.c
@@ -480,7 +480,8 @@
struct x8664_pda *pda = cpu_pda(cpu);
/* Setup up data that may be needed in __get_free_pages early */
- asm volatile("movl %0,%%fs ; movl %0,%%gs" :: "r" (0));
+ loadsegment(fs, 0);
+ loadsegment(gs, 0);
/* Memory clobbers used to order PDA accessed */
mb();
wrmsrl(MSR_GS_BASE, pda);
diff --git a/arch/x86/kernel/process_64.c b/arch/x86/kernel/process_64.c
--- a/arch/x86/kernel/process_64.c
+++ b/arch/x86/kernel/process_64.c
@@ -362,10 +362,10 @@
p->thread.fs = me->thread.fs;
p->thread.gs = me->thread.gs;
- asm("mov %%gs,%0" : "=m" (p->thread.gsindex));
- asm("mov %%fs,%0" : "=m" (p->thread.fsindex));
- asm("mov %%es,%0" : "=m" (p->thread.es));
- asm("mov %%ds,%0" : "=m" (p->thread.ds));
+ savesegment(gs, p->thread.gsindex);
+ savesegment(fs, p->thread.fsindex);
+ savesegment(es, p->thread.es);
+ savesegment(ds, p->thread.ds);
if (unlikely(test_tsk_thread_flag(me, TIF_IO_BITMAP))) {
p->thread.io_bitmap_ptr = kmalloc(IO_BITMAP_BYTES, GFP_KERNEL);
@@ -404,7 +404,9 @@
void
start_thread(struct pt_regs *regs, unsigned long new_ip, unsigned long new_sp)
{
- asm volatile("movl %0, %%fs; movl %0, %%es; movl %0, %%ds" :: "r"(0));
+ loadsegment(fs, 0);
+ loadsegment(es, 0);
+ loadsegment(ds, 0);
load_gs_index(0);
regs->ip = new_ip;
regs->sp = new_sp;
@@ -591,11 +593,11 @@
* Switch DS and ES.
* This won't pick up thread selector changes, but I guess that is ok.
*/
- asm volatile("mov %%es,%0" : "=m" (prev->es));
+ savesegment(es, prev->es);
if (unlikely(next->es | prev->es))
loadsegment(es, next->es);
-
- asm volatile ("mov %%ds,%0" : "=m" (prev->ds));
+
+ savesegment(ds, prev->ds);
if (unlikely(next->ds | prev->ds))
loadsegment(ds, next->ds);
@@ -606,7 +608,7 @@
*/
{
unsigned fsindex;
- asm volatile("movl %%fs,%0" : "=r" (fsindex));
+ savesegment(fs, fsindex);
/* segment register != 0 always requires a reload.
also reload when it has changed.
when prev process used 64bit base always reload
@@ -627,7 +629,7 @@
}
{
unsigned gsindex;
- asm volatile("movl %%gs,%0" : "=r" (gsindex));
+ savesegment(gs, gsindex);
if (unlikely(gsindex | next->gsindex | prev->gs)) {
load_gs_index(next->gsindex);
if (gsindex)
@@ -807,7 +809,7 @@
set_32bit_tls(task, FS_TLS, addr);
if (doit) {
load_TLS(&task->thread, cpu);
- asm volatile("movl %0,%%fs" :: "r"(FS_TLS_SEL));
+ loadsegment(fs, FS_TLS_SEL);
}
task->thread.fsindex = FS_TLS_SEL;
task->thread.fs = 0;
@@ -817,7 +819,7 @@
if (doit) {
/* set the selector to 0 to not confuse
__switch_to */
- asm volatile("movl %0,%%fs" :: "r" (0));
+ loadsegment(fs, 0);
ret = checking_wrmsrl(MSR_FS_BASE, addr);
}
}
@@ -840,7 +842,7 @@
if (task->thread.gsindex == GS_TLS_SEL)
base = read_32bit_tls(task, GS_TLS);
else if (doit) {
- asm("movl %%gs,%0" : "=r" (gsindex));
+ savesegment(gs, gsindex);
if (gsindex)
rdmsrl(MSR_KERNEL_GS_BASE, base);
else
64-bit Xen pushes a couple of extra words onto an exception frame.
Add a hook to deal with them.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/asm-offsets_64.c | 1 +
arch/x86/kernel/entry_64.S | 2 ++
arch/x86/kernel/paravirt.c | 3 +++
arch/x86/xen/enlighten.c | 3 +++
include/asm-x86/paravirt.h | 9 +++++++++
include/asm-x86/processor.h | 2 ++
6 files changed, 20 insertions(+)
diff --git a/arch/x86/kernel/asm-offsets_64.c b/arch/x86/kernel/asm-offsets_64.c
--- a/arch/x86/kernel/asm-offsets_64.c
+++ b/arch/x86/kernel/asm-offsets_64.c
@@ -61,6 +61,7 @@
OFFSET(PARAVIRT_PATCH_pv_irq_ops, paravirt_patch_template, pv_irq_ops);
OFFSET(PV_IRQ_irq_disable, pv_irq_ops, irq_disable);
OFFSET(PV_IRQ_irq_enable, pv_irq_ops, irq_enable);
+ OFFSET(PV_IRQ_adjust_exception_frame, pv_irq_ops, adjust_exception_frame);
OFFSET(PV_CPU_iret, pv_cpu_ops, iret);
OFFSET(PV_CPU_nmi_return, pv_cpu_ops, nmi_return);
OFFSET(PV_CPU_usergs_sysret32, pv_cpu_ops, usergs_sysret32);
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -864,6 +864,7 @@
*/
.macro zeroentry sym
INTR_FRAME
+ PARAVIRT_ADJUST_EXCEPTION_FRAME
pushq $0 /* push error code/oldrax */
CFI_ADJUST_CFA_OFFSET 8
pushq %rax /* push real oldrax to the rdi slot */
@@ -876,6 +877,7 @@
.macro errorentry sym
XCPT_FRAME
+ PARAVIRT_ADJUST_EXCEPTION_FRAME
pushq %rax
CFI_ADJUST_CFA_OFFSET 8
CFI_REL_OFFSET rax,0
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -298,6 +298,9 @@
.irq_enable = native_irq_enable,
.safe_halt = native_safe_halt,
.halt = native_halt,
+#ifdef CONFIG_X86_64
+ .adjust_exception_frame = paravirt_nop,
+#endif
};
struct pv_cpu_ops pv_cpu_ops = {
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1121,6 +1121,9 @@
.irq_enable = xen_irq_enable,
.safe_halt = xen_safe_halt,
.halt = xen_halt,
+#ifdef CONFIG_X86_64
+ .adjust_exception_frame = paravirt_nop,
+#endif
};
static const struct pv_apic_ops xen_apic_ops __initdata = {
diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -192,6 +192,10 @@
void (*irq_enable)(void);
void (*safe_halt)(void);
void (*halt)(void);
+
+#ifdef CONFIG_X86_64
+ void (*adjust_exception_frame)(void);
+#endif
};
struct pv_apic_ops {
@@ -1551,6 +1555,11 @@
movq %rax, %rcx; \
xorq %rax, %rax;
+#define PARAVIRT_ADJUST_EXCEPTION_FRAME \
+ PARA_SITE(PARA_PATCH(pv_irq_ops, PV_IRQ_adjust_exception_frame), \
+ CLBR_NONE, \
+ call PARA_INDIRECT(pv_irq_ops+PV_IRQ_adjust_exception_frame))
+
#define USERGS_SYSRET64 \
PARA_SITE(PARA_PATCH(pv_cpu_ops, PV_CPU_usergs_sysret64), \
CLBR_NONE, \
diff --git a/include/asm-x86/processor.h b/include/asm-x86/processor.h
--- a/include/asm-x86/processor.h
+++ b/include/asm-x86/processor.h
@@ -542,6 +542,8 @@
#define set_iopl_mask native_set_iopl_mask
#define SWAPGS swapgs
+
+#define PARAVIRT_ADJUST_EXCEPTION_FRAME /* */
#endif /* CONFIG_PARAVIRT */
/*
From: Eduardo Habkost <[email protected]>
We will need to set a pte on l3_user_pgt. Extract set_pte_vaddr_pud()
from set_pte_vaddr(), that will accept the l3 page table as parameter.
This change should be a no-op for existing code.
Signed-off-by: Eduardo Habkost <[email protected]>
Signed-off-by: Mark McLoughlin <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/mm/init_64.c | 31 ++++++++++++++++++++-----------
include/asm-x86/pgtable_64.h | 3 +++
2 files changed, 23 insertions(+), 11 deletions(-)
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -149,22 +149,13 @@
}
void
-set_pte_vaddr(unsigned long vaddr, pte_t new_pte)
+set_pte_vaddr_pud(pud_t *pud_page, unsigned long vaddr, pte_t new_pte)
{
- pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
- pr_debug("set_pte_vaddr %lx to %lx\n", vaddr, native_pte_val(new_pte));
-
- pgd = pgd_offset_k(vaddr);
- if (pgd_none(*pgd)) {
- printk(KERN_ERR
- "PGD FIXMAP MISSING, it should be setup in head.S!\n");
- return;
- }
- pud = pud_offset(pgd, vaddr);
+ pud = pud_page + pud_index(vaddr);
if (pud_none(*pud)) {
pmd = (pmd_t *) spp_getpage();
pud_populate(&init_mm, pud, pmd);
@@ -195,6 +186,24 @@
* (PGE mappings get flushed as well)
*/
__flush_tlb_one(vaddr);
+}
+
+void
+set_pte_vaddr(unsigned long vaddr, pte_t pteval)
+{
+ pgd_t *pgd;
+ pud_t *pud_page;
+
+ pr_debug("set_pte_vaddr %lx to %lx\n", vaddr, native_pte_val(pteval));
+
+ pgd = pgd_offset_k(vaddr);
+ if (pgd_none(*pgd)) {
+ printk(KERN_ERR
+ "PGD FIXMAP MISSING, it should be setup in head.S!\n");
+ return;
+ }
+ pud_page = (pud_t*)pgd_page_vaddr(*pgd);
+ set_pte_vaddr_pud(pud_page, vaddr, pteval);
}
/*
diff --git a/include/asm-x86/pgtable_64.h b/include/asm-x86/pgtable_64.h
--- a/include/asm-x86/pgtable_64.h
+++ b/include/asm-x86/pgtable_64.h
@@ -69,6 +69,9 @@
#define pud_none(x) (!pud_val(x))
struct mm_struct;
+
+void set_pte_vaddr_pud(pud_t *pud_page, unsigned long vaddr, pte_t new_pte);
+
static inline void native_pte_clear(struct mm_struct *mm, unsigned long addr,
pte_t *ptep)
Signed-off-by: Eduardo Habkost <[email protected]>
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/entry_64.S | 4 ++--
arch/x86/kernel/paravirt.c | 3 +++
include/asm-x86/elf.h | 2 +-
include/asm-x86/paravirt.h | 10 ++++++++++
include/asm-x86/system.h | 3 ++-
5 files changed, 18 insertions(+), 4 deletions(-)
diff --git a/arch/x86/kernel/entry_64.S b/arch/x86/kernel/entry_64.S
--- a/arch/x86/kernel/entry_64.S
+++ b/arch/x86/kernel/entry_64.S
@@ -1080,7 +1080,7 @@
/* Reload gs selector with exception handling */
/* edi: new selector */
-ENTRY(load_gs_index)
+ENTRY(native_load_gs_index)
CFI_STARTPROC
pushf
CFI_ADJUST_CFA_OFFSET 8
@@ -1094,7 +1094,7 @@
CFI_ADJUST_CFA_OFFSET -8
ret
CFI_ENDPROC
-ENDPROC(load_gs_index)
+ENDPROC(native_load_gs_index)
.section __ex_table,"a"
.align 8
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -331,6 +331,9 @@
.store_idt = native_store_idt,
.store_tr = native_store_tr,
.load_tls = native_load_tls,
+#ifdef CONFIG_X86_64
+ .load_gs_index = native_load_gs_index,
+#endif
.write_ldt_entry = native_write_ldt_entry,
.write_gdt_entry = native_write_gdt_entry,
.write_idt_entry = native_write_idt_entry,
diff --git a/include/asm-x86/elf.h b/include/asm-x86/elf.h
--- a/include/asm-x86/elf.h
+++ b/include/asm-x86/elf.h
@@ -83,9 +83,9 @@
(((x)->e_machine == EM_386) || ((x)->e_machine == EM_486))
#include <asm/processor.h>
+#include <asm/system.h>
#ifdef CONFIG_X86_32
-#include <asm/system.h> /* for savesegment */
#include <asm/desc.h>
#define elf_check_arch(x) elf_check_arch_ia32(x)
diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -115,6 +115,9 @@
void (*set_ldt)(const void *desc, unsigned entries);
unsigned long (*store_tr)(void);
void (*load_tls)(struct thread_struct *t, unsigned int cpu);
+#ifdef CONFIG_X86_64
+ void (*load_gs_index)(unsigned int idx);
+#endif
void (*write_ldt_entry)(struct desc_struct *ldt, int entrynum,
const void *desc);
void (*write_gdt_entry)(struct desc_struct *,
@@ -848,6 +851,13 @@
PVOP_VCALL2(pv_cpu_ops.load_tls, t, cpu);
}
+#ifdef CONFIG_X86_64
+static inline void load_gs_index(unsigned int gs)
+{
+ PVOP_VCALL1(pv_cpu_ops.load_gs_index, gs);
+}
+#endif
+
static inline void write_ldt_entry(struct desc_struct *dt, int entry,
const void *desc)
{
diff --git a/include/asm-x86/system.h b/include/asm-x86/system.h
--- a/include/asm-x86/system.h
+++ b/include/asm-x86/system.h
@@ -140,7 +140,7 @@
#define set_base(ldt, base) _set_base(((char *)&(ldt)) , (base))
#define set_limit(ldt, limit) _set_limit(((char *)&(ldt)) , ((limit)-1))
-extern void load_gs_index(unsigned);
+extern void native_load_gs_index(unsigned);
/*
* Load a segment. Fall back on loading the zero
@@ -286,6 +286,7 @@
#ifdef CONFIG_X86_64
#define read_cr8() (native_read_cr8())
#define write_cr8(x) (native_write_cr8(x))
+#define load_gs_index native_load_gs_index
#endif
/* Clear the 'TS' bit */
On Wed, 25 Jun 2008 00:18:59 -0400
Jeremy Fitzhardinge <[email protected]> wrote:
> wrmsr is a special instruction which can have arbitrary system-wide
> effects. We don't want the compiler to reorder it with respect to
> memory operations, so make it a memory barrier.
it's more readable for several of these cases to stick a barrier(); in
front of and after it, to be honest; that makes it more explicit that
these are deliberate compiler barriers rather than "actual" memory
accesses...
--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
* Jeremy Fitzhardinge <[email protected]> wrote:
> Hi Ingo,
>
> This series lays the groundwork for 64-bit Xen support. It follows
> the usual pattern: a series of general cleanups and improvements,
> followed by additions and modifications needed to slide Xen in.
cool stuff :-)
> Most of the 64-bit paravirt-ops work has already been done and
> integrated for some time, so the changes are relatively minor.
>
> Interesting and potentially hazardous changes in this series are:
>
> "paravirt/x86_64: move __PAGE_OFFSET to leave a space for hypervisor"
>
> This moves __PAGE_OFFSET up by 16 GDT slots, from 0xffff810000000000
> to 0xffff880000000000. I have no general justification for this: the
> specific reason is that Xen claims the first 16 kernel GDT slots for
> itself, and we must move up the mapping to make room. In the process
> I parameterised the compile-time construction of the initial
> pagetables in head_64.S to cope with it.
This reduces native kernel max memory support from around 127 TB to
around 120 TB. We also limit the Xen hypervisor to ~7 TB of physical
memory - is that wise in the long run? Sure, current CPUs support 40
physical bits [1 TB] for now so it's all theoretical at this moment.
my guess is that CPU makers will first extend the physical lines all the
way up to 46-47 bits before they are willing to touch the logical model
and extend the virtual space beyond 48 bits (47 bits of that available
to kernel-space in practice - i.e. 128 TB).
So eventually, in a few years, we'll feel some sort of crunch when the #
of physical lines approaches the # of logical bits - just like when
32-bit felt a crunch when physical lines went to 31 and beyond.
> "x86_64: adjust mapping of physical pagetables to work with Xen"
> "x86_64: create small vmemmap mappings if PSE not available"
>
> This rearranges the construction of the physical mapping so that it
> works with Xen. This affects three aspects of the code:
> 1. It can't assume PSE, so it will only use PSE if the processor
> supports it.
> 2. It never replaces an existing mapping, so it can just extend the
> early boot-provided mappings (either from head_64.S or the Xen domain
> builder).
> 3. It makes sure that any page is iounmapped before attaching it to the
> pagetable to avoid having writable aliases of pagetable pages.
>
> The logical structure of the code is more or less unchanged, and still
> works fine in the native case.
>
> vmemmap mapping is likewise changed.
>
> "x86_64: PSE no longer a hard requirement."
>
> Because booting under Xen doesn't set PSE, it's no longer a hard
> requirement for the kernel. PSE will be used wherever possible.
That should be fine too - and probably useful for 64-bit kmemcheck
support as well.
To further increase the symmetry between 64-bit and 32-bit, could you
please also activate the mem=nopentium switch on 64-bit to allow the
forcing of a non-PSE native 64-bit bootup? (Obviously not a good idea
normally, as it wastes 0.1% of RAM and increases PTE related CPU cache
footprint and TLB overhead, but it is useful for debugging.)
a few other risk areas:
- the vmalloc-sync changes. Are you absolutely sure that it does not
matter for performance?
- "The 32-bit early_ioremap will work equally well for 64-bit, so just
use it." Famous last words ;-)
Anyway, that's all theory - i'll try out your patchset in -tip to see
what breaks in practice ;-)
Ingo
* Jeremy Fitzhardinge <[email protected]> wrote:
> Signed-off-by: Eduardo Habkost <[email protected]>
> Signed-off-by: Jeremy Fitzhardinge <[email protected]>
patch logistics detail: the signoff order suggests it's been authored by
Eduardo - but there's no From line to that effect - should i change it
accordingly?
Ingo
Ingo Molnar wrote:
>> "paravirt/x86_64: move __PAGE_OFFSET to leave a space for hypervisor"
>>
>> This moves __PAGE_OFFSET up by 16 GDT slots, from 0xffff810000000000
>> to 0xffff880000000000. I have no general justification for this: the
>> specific reason is that Xen claims the first 16 kernel GDT slots for
>> itself, and we must move up the mapping to make room. In the process
>> I parameterised the compile-time construction of the initial
>> pagetables in head_64.S to cope with it.
>>
>
> This reduces native kernel max memory support from around 127 TB to
> around 120 TB. We also limit the Xen hypervisor to ~7 TB of physical
> memory - is that wise in the long run? Sure, current CPUs support 40
> physical bits [1 TB] for now so it's all theoretical at this moment.
>
> my guess is that CPU makers will first extend the physical lines all the
> way up to 46-47 bits before they are willing to touch the logical model
> and extend the virtual space beyond 48 bits (47 bits of that available
> to kernel-space in practice - i.e. 128 TB).
>
> So eventually, in a few years, we'll feel some sort of crunch when the #
> of physical lines approaches the # of logical bits - just like when
> 32-bit felt a crunch when physical lines went to 31 and beyond.
>
There's no inherent reason why Xen itself needs to be able to have all
memory mapped at once. 32-bit Xen doesn't and can survive quite
happily. It's certainly nice to be able to access anything directly,
but it's just a performance optimisation. In practice, the guest
generally has almost everything interesting mapped anyway, and Xen
maintains a recursive mapping of the pagetable to make its access to the
pagetable very efficient, so it's only when a hypercall is doing
something to an unmapped page that there's an issue.
The main limitation the hole size imposes is the maximum size of the
machine-to-physical map. That uses 8 bytes/page, and reserves 256GB of
space for it, meaning that the current limit is 2^47 bytes - but there's
another 256GB of reserved and unused space next to it, so that could be
easily extended to 2^48 if that really becomes an issue.
> That should be fine too - and probably useful for 64-bit kmemcheck
> support as well.
>
> To further increase the symmetry between 64-bit and 32-bit, could you
> please also activate the mem=nopentium switch on 64-bit to allow the
> forcing of a non-PSE native 64-bit bootup? (Obviously not a good idea
> normally, as it wastes 0.1% of RAM and increases PTE related CPU cache
> footprint and TLB overhead, but it is useful for debugging.)
>
OK. Though it might be an idea to add "nopse" and start deprecating
nopentium.
> a few other risk areas:
>
> - the vmalloc-sync changes. Are you absolutely sure that it does not
> matter for performance?
>
Oh, I didn't mean to include that one. I think it's probably safe (from
both the performance and correctness standpoints), but it's not necessary
for 64-bit Xen.
> - "The 32-bit early_ioremap will work equally well for 64-bit, so just
> use it." Famous last words ;-)
>
> Anyway, that's all theory - i'll try out your patchset in -tip to see
> what breaks in practice ;-)
>
Yep, thanks,
J
Ingo Molnar wrote:
> * Jeremy Fitzhardinge <[email protected]> wrote:
>
>
>> Signed-off-by: Eduardo Habkost <[email protected]>
>> Signed-off-by: Jeremy Fitzhardinge <[email protected]>
>>
>
> patch logistics detail: the signoff order suggests it's been authored by
> Eduardo - but there's no From line to that effect - should i change it
> accordingly?
Yes, it's Eduardo's. Huh, I have the From line here; must have got
stripped off by my script...
J
Jeremy Fitzhardinge <[email protected]> writes:
>
> This moves __PAGE_OFFSET up by 16 GDT slots, from 0xffff810000000000
GDT? PDP?
> to 0xffff880000000000. I have no general justification for this: th
This will significantly decrease the maximum amount of physical
memory supported by Linux longer term.
> "x86_64: PSE no longer a hard requirement."
>
> Because booting under Xen doesn't set PSE, it's no longer a hard
> requirement for the kernel. PSE will be used wherever possible.
Both sound like cases of "let's hack Linux to work around Xen
problems"
-Andi
* Ingo Molnar <[email protected]> wrote:
> Anyway, that's all theory - i'll try out your patchset in -tip to see
> what breaks in practice ;-)
i've put the commits (and a good number of dependent commits) into the
new tip/x86/xen-64bit topic branch.
It quickly broke the build in testing:
include/asm/pgalloc.h: In function ‘paravirt_pgd_free':
include/asm/pgalloc.h:14: error: parameter name omitted
arch/x86/kernel/entry_64.S: In file included from
arch/x86/kernel/traps_64.c:51:include/asm/pgalloc.h: In function ‘paravirt_pgd_free':
include/asm/pgalloc.h:14: error: parameter name omitted
[...]
with this config:
http://redhat.com/~mingo/misc/config-Wed_Jun_25_16_37_51_CEST_2008.bad
this could easily be some integration mistake on my part, so please
double-check the end result.
Merging it into tip/master is a bit tricky, due to various interactions.
This should work fine if you check out the latest tip/master:
git-merge tip/x86/xen-64bit
[ ... fix up the trivial merge conflict ... ]
i've already merged tip/x86/xen-64bit-base topic into master, to make it
easier. (there were a few preconditions for the 64-bit Xen patches which
arent carried in linux-next - such as the nmi-safe changes.)
Ingo
On 25/6/08 13:40, "Andi Kleen" <[email protected]> wrote:
>> to 0xffff880000000000. I have no general justification for this: th
>
> This will significantly decrease the maximum amount of physical
> memory supported by Linux longer term.
What does Linux expect to scale up to? Reserving 16 PML4 entries leaves the
kernel with 120TB of available 'negative' address space. Should be plenty, I
would think.
-- Keir
> What does Linux expect to scale up to? Reserving 16 PML4 entries leaves the
> kernel with 120TB of available 'negative' address space. Should be plenty, I
> would think.
There are already (ok non x86-64) systems shipping today with 10+TB of
addressable memory. 100+TB is not that far away with typical
growth rates. Besides there has to be much more in the negative address
space than just direct mapping.
So far we have always assumed that 64bit Linux can support up to
1/4*max VA memory. With your change that formula would no longer be true.
-Andi
On 25/6/08 20:13, "Andi Kleen" <[email protected]> wrote:
>> What does Linux expect to scale up to? Reserving 16 PML4 entries leaves the
>> kernel with 120TB of available 'negative' address space. Should be plenty, I
>> would think.
>
> There are already (ok non x86-64) systems shipping today with 10+TB of
> addressable memory. 100+TB is not that far away with typical
> growth rates. Besides there has to be much more in the negative address
> space than just direct mapping.
There are obviously no x64 boxes around at the moment with >1TB of regular
shared memory, since no CPUs have more than 40 address lines. 100+TB RAM is
surely years away.
If this is a blocker issue, we could just keep PAGE_OFFSET as it is when Xen
support is not configured into the kernel. Then those who are concerned
about 5% extra headroom at 100TB RAM sizes can configure their kernel
appropriately.
> So far we have always assumed that 64bit Linux can support up to
> 1/4*max VA memory. With your change that formula would no longer be true.
Does the formula have any practical significance?
-- Keir
> There are obviously no x64 boxes around at the moment with >1TB of regular
> shared memory, since no CPUs have more than 40 address lines. 100+TB RAM is
That's actually not true.
> surely years away.
Yes, but why build something non-scalable now that you will have to fix
in a few years? Especially when it comes with "i have no justification"
in the commit log.
> > So far we have always assumed that 64bit Linux can support up to
> > 1/4*max VA memory. With your change that formula would no longer be true.
>
> Does the formula have any practical significance?
Yes, because getting more than 48 bits of VA will be extremely costly
in terms of infrastructure, and assuming continuing growth rates and very
large machines, 46 bits is not all that much.
-Andi
Andi Kleen wrote:
> Jeremy Fitzhardinge <[email protected]> writes:
>
>> This moves __PAGE_OFFSET up by 16 GDT slots, from 0xffff810000000000
>>
>
> GDT? PDP?
>
I meant PGD slots. Or PML4 in x86 terms.
>> to 0xffff880000000000. I have no general justification for this: th
>>
>
> This will significantly decrease the maximum amount of physical
> memory supported by Linux longer term.
>
A bit, but not "significantly". We'd already discussed that if the
amount of physical memory starts approaching 2^48 then we'd hope that the
chips will grow some more virtual bits.
J
Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>
>> Anyway, that's all theory - i'll try out your patchset in -tip to see
>> what breaks in practice ;-)
>>
>
> i've put the commits (and a good number of dependent commits) into the
> new tip/x86/xen-64bit topic branch.
>
> It quickly broke the build in testing:
>
> include/asm/pgalloc.h: In function ‘paravirt_pgd_free':
> include/asm/pgalloc.h:14: error: parameter name omitted
> arch/x86/kernel/entry_64.S: In file included from
> arch/x86/kernel/traps_64.c:51:include/asm/pgalloc.h: In function ‘paravirt_pgd_free':
> include/asm/pgalloc.h:14: error: parameter name omitted
>
No, looks like my fault. The non-PARAVIRT version of
paravirt_pgd_free() is:
static inline void paravirt_pgd_free(struct mm_struct *mm, pgd_t *) {}
but C doesn't like missing parameter names, even if unused.
This should fix it:
diff -r 19b73cc5fdf4 include/asm-x86/pgalloc.h
--- a/include/asm-x86/pgalloc.h Wed Jun 25 11:24:41 2008 -0400
+++ b/include/asm-x86/pgalloc.h Wed Jun 25 13:11:56 2008 -0700
@@ -11,7 +11,7 @@
#include <asm/paravirt.h>
#else
#define paravirt_pgd_alloc(mm) __paravirt_pgd_alloc(mm)
-static inline void paravirt_pgd_free(struct mm_struct *mm, pgd_t *) {}
+static inline void paravirt_pgd_free(struct mm_struct *mm, pgd_t *pgd) {}
static inline void paravirt_alloc_pte(struct mm_struct *mm, unsigned long pfn) {}
static inline void paravirt_alloc_pmd(struct mm_struct *mm, unsigned long pfn) {}
static inline void paravirt_alloc_pmd_clone(unsigned long pfn, unsigned long clonepfn,
Arjan van de Ven wrote:
> it's more readable for several of these cases to stick a barrier(); in
> front and after it to be honest; that makes it more explicit that
> these are deliberate compiler barriers rather than "actual" memory
> access...
>
>
I suppose, though I would be inclined to put the barriers in the wrmsr
macro itself to act as documentation. Either way, I don't think there's
any legitimate reason to let the compiler reorder things around a wrmsr,
and it should be an inherent property of the macro, rather than relying
on ad-hoc barriers where it gets used. After all, that's a fairly
accurate reflection of how the micro-architecture treats wrmsr...
J
On Wed, 25 Jun 2008 14:08:57 -0700
Jeremy Fitzhardinge <[email protected]> wrote:
> Arjan van de Ven wrote:
> > it's more readable for several of these cases to stick a barrier();
> > in front and after it to be honest; that makes it more explicit that
> > these are deliberate compiler barriers rather than "actual" memory
> > access...
> >
> >
>
> I suppose, though I would be inclined to put the barriers in the
> wrmsr macro itself to act as documentation.
yeah I meant like this:
static inline void native_write_msr(unsigned int msr,
unsigned low, unsigned high)
{
barrier();
asm volatile("wrmsr" : : "c" (msr), "a"(low), "d" (high));
barrier();
}
or the same in the thing that calls this.
--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
Arjan van de Ven wrote:
> On Wed, 25 Jun 2008 14:08:57 -0700
> Jeremy Fitzhardinge <[email protected]> wrote:
>
>
>> Arjan van de Ven wrote:
>>
>>> it's more readable for several of these cases to stick a barrier();
>>> in front and after it to be honest; that makes it more explicit that
>>> these are deliberate compiler barriers rather than "actual" memory
>>> access...
>>>
>>>
>>>
>> I suppose, though I would be inclined to put the barriers in the
>> wrmsr macro itself to act as documentation.
>>
>
>
> yeah I meant like this:
>
> static inline void native_write_msr(unsigned int msr,
> unsigned low, unsigned high)
> {
> barrier();
> asm volatile("wrmsr" : : "c" (msr), "a"(low), "d" (high));
> barrier();
> }
>
> or in the same in the thing that calls this.
>
>
OK, we're in vehement agreement then.
J
H. Peter Anvin wrote:
> Arjan van de Ven wrote:
>>>>
>>> I suppose, though I would be inclined to put the barriers in the
>>> wrmsr macro itself to act as documentation.
>>
>>
>> yeah I meant like this:
>>
>> static inline void native_write_msr(unsigned int msr,
>> unsigned low, unsigned high)
>> {
>> barrier();
>> asm volatile("wrmsr" : : "c" (msr), "a"(low), "d" (high));
>> barrier();
>> }
>>
>> or in the same in the thing that calls this.
>>
>
> Actually, I believe the barrier(); before is actually incorrect, since
> it would affect the wrmsr() register arguments rather than the wrmsr
> instruction itself.
How so? What kind of failure do you think might occur? Some effect on how
the wrmsr arguments are evaluated?
barrier() is specifically a compiler optimisation barrier, so the
barrier before would prevent the compiler from moving anything logically
before the wrmsr to afterwards.
That said, making the wrmsr itself a memory clobber may be simpler to
understand, with a comment, rather than separate barriers...
J
Arjan van de Ven wrote:
>>>
>> I suppose, though I would be inclined to put the barriers in the
>> wrmsr macro itself to act as documentation.
>
>
> yeah I meant like this:
>
> static inline void native_write_msr(unsigned int msr,
> unsigned low, unsigned high)
> {
> barrier();
> asm volatile("wrmsr" : : "c" (msr), "a"(low), "d" (high));
> barrier();
> }
>
> or in the same in the thing that calls this.
>
Actually, I believe the barrier(); before is incorrect, since
it would affect the wrmsr() register arguments rather than the wrmsr
instruction itself.
-hpa
Jeremy Fitzhardinge wrote:
>>
>> Actually, I believe the barrier(); before is actually incorrect, since
>> it would affect the wrmsr() register arguments rather than the wrmsr
>> instruction itself.
>
> How so? What kind of failure do think might occur? Some effect on how
> the wrmsr arguments are evaluated?
>
> barrier() is specifically a compiler optimisation barrier, so the
> barrier before would prevent the compiler from moving anything logically
> before the wrmsr to afterwards.
>
The barrier() before prevents the compiler from optimizing the access to
the arguments (before they go into registers), not the actual wrmsr;
this has to do with the ordering of operations around the barrier above.
The barrier *after* does what you just describe.
> That said, making the wrmsr itself a memory clobber may be simpler to
> understand, with a comment, rather than separate barriers...
This should be functionally equivalent to a barrier(); after, and given
that this is clearly a point of confusion *already*, I think the memory
clobber is better.
-hpa
* Jeremy Fitzhardinge <[email protected]> wrote:
>> It quickly broke the build in testing:
>>
>> include/asm/pgalloc.h: In function ‘paravirt_pgd_free':
>> include/asm/pgalloc.h:14: error: parameter name omitted
>> arch/x86/kernel/entry_64.S: In file included from
>> arch/x86/kernel/traps_64.c:51:include/asm/pgalloc.h: In function
>> ‘paravirt_pgd_free':
>> include/asm/pgalloc.h:14: error: parameter name omitted
>>
>
> No, looks like my fault. The non-PARAVIRT version of
> paravirt_pgd_free() is:
>
> static inline void paravirt_pgd_free(struct mm_struct *mm, pgd_t *) {}
>
> but C doesn't like missing parameter names, even if unused.
>
> This should fix it:
that fixed the build but now we've got a boot crash with this config:
time.c: Detected 2010.304 MHz processor.
spurious 8259A interrupt: IRQ7.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: [<0000000000000000>]
PGD 0
Thread overran stack, or stack corrupted
Oops: 0010 [1] SMP
CPU 0
with:
http://redhat.com/~mingo/misc/config-Thu_Jun_26_12_46_46_CEST_2008.bad
i've pushed out the current tip/xen-64bit branch, so that you can see
how things look at the moment, but i cannot put it into tip/master
yet.
Ingo
* Ingo Molnar <[email protected]> wrote:
> that fixed the build but now we've got a boot crash with this config:
plus -tip auto-testing found another build failure with:
http://redhat.com/~mingo/misc/config-Thu_Jun_26_12_46_46_CEST_2008.bad
arch/x86/kernel/entry_64.S: Assembler messages:
arch/x86/kernel/entry_64.S:1201: Error: invalid character '_' in mnemonic
arch/x86/kernel/entry_64.S:1205: Error: invalid character '_' in mnemonic
arch/x86/kernel/entry_64.S:1209: Error: invalid character '_' in mnemonic
arch/x86/kernel/entry_64.S:1213: Error: invalid character '_' in mnemonic
Ingo
Ingo Molnar wrote:
> * Jeremy Fitzhardinge <[email protected]> wrote:
>
>
>>> It quickly broke the build in testing:
>>>
>>> include/asm/pgalloc.h: In function ‘paravirt_pgd_free’:
>>> include/asm/pgalloc.h:14: error: parameter name omitted
>>> arch/x86/kernel/entry_64.S: In file included from
>>> arch/x86/kernel/traps_64.c:51:include/asm/pgalloc.h: In function
>>> ‘paravirt_pgd_free’:
>>> include/asm/pgalloc.h:14: error: parameter name omitted
>>>
>>>
>> No, looks like my fault. The non-PARAVIRT version of
>> paravirt_pgd_free() is:
>>
>> static inline void paravirt_pgd_free(struct mm_struct *mm, pgd_t *) {}
>>
>> but C doesn't like missing parameter names, even if unused.
>>
>> This should fix it:
>>
>
> that fixed the build but now we've got a boot crash with this config:
>
> time.c: Detected 2010.304 MHz processor.
> spurious 8259A interrupt: IRQ7.
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> IP: [<0000000000000000>]
> PGD 0
> Thread overran stack, or stack corrupted
> Oops: 0010 [1] SMP
> CPU 0
>
> with:
>
> http://redhat.com/~mingo/misc/config-Thu_Jun_26_12_46_46_CEST_2008.bad
>
Blerg, a contextless NULL rip. Have you done any bisection on it?
Could you try again with the same config, but with
"CONFIG_PARAVIRT_DEBUG" enabled as well? That will BUG if it turns out
to be trying to call a NULL paravirt-op.
I'll try to repro here anyway.
> i've pushed out the current tip/xen-64bit branch, so that you can see
> how things look at the moment, but i cannot put it into tip/master
> yet.
Yeah, I was expecting things to break somewhere with this lot :/
Could you add this patch? I don't think it will help this case, but
it's a bugfix.
J
Subject: x86_64: use SWAPGS_UNSAFE_STACK in ia32entry.S
Use SWAPGS_UNSAFE_STACK in ia32entry.S in the places where the active
stack is the usermode stack.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/ia32/ia32entry.S | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
===================================================================
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -98,7 +98,7 @@
CFI_SIGNAL_FRAME
CFI_DEF_CFA rsp,0
CFI_REGISTER rsp,rbp
- SWAPGS
+ SWAPGS_UNSAFE_STACK
movq %gs:pda_kernelstack, %rsp
addq $(PDA_STACKOFFSET),%rsp
/*
@@ -210,7 +210,7 @@
CFI_DEF_CFA rsp,PDA_STACKOFFSET
CFI_REGISTER rip,rcx
/*CFI_REGISTER rflags,r11*/
- SWAPGS
+ SWAPGS_UNSAFE_STACK
movl %esp,%r8d
CFI_REGISTER rsp,r8
movq %gs:pda_kernelstack,%rsp
Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>
>> that fixed the build but now we've got a boot crash with this config:
>>
>
> plus -tip auto-testing found another build failure with:
>
> http://redhat.com/~mingo/misc/config-Thu_Jun_26_12_46_46_CEST_2008.bad
>
> arch/x86/kernel/entry_64.S: Assembler messages:
> arch/x86/kernel/entry_64.S:1201: Error: invalid character '_' in mnemonic
> arch/x86/kernel/entry_64.S:1205: Error: invalid character '_' in mnemonic
> arch/x86/kernel/entry_64.S:1209: Error: invalid character '_' in mnemonic
> arch/x86/kernel/entry_64.S:1213: Error: invalid character '_' in mnemonic
>
>
I'm confused. How did this config both crash and not build?
J
Ingo Molnar wrote:
> * Jeremy Fitzhardinge <[email protected]> wrote:
>
>
>>> It quickly broke the build in testing:
>>>
>>> include/asm/pgalloc.h: In function ‘paravirt_pgd_free':
>>> include/asm/pgalloc.h:14: error: parameter name omitted
>>> arch/x86/kernel/entry_64.S: In file included from
>>> arch/x86/kernel/traps_64.c:51:include/asm/pgalloc.h: In function
>>> ‘paravirt_pgd_free':
>>> include/asm/pgalloc.h:14: error: parameter name omitted
>>>
>>>
>> No, looks like my fault. The non-PARAVIRT version of
>> paravirt_pgd_free() is:
>>
>> static inline void paravirt_pgd_free(struct mm_struct *mm, pgd_t *) {}
>>
>> but C doesn't like missing parameter names, even if unused.
>>
>> This should fix it:
>>
>
> that fixed the build but now we've got a boot crash with this config:
>
> time.c: Detected 2010.304 MHz processor.
> spurious 8259A interrupt: IRQ7.
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> IP: [<0000000000000000>]
> PGD 0
> Thread overran stack, or stack corrupted
> Oops: 0010 [1] SMP
> CPU 0
>
What stage during boot? I'm seeing an initrd problem, but that's
relatively late.
J
Ingo Molnar wrote:
> that fixed the build but now we've got a boot crash with this config:
>
> time.c: Detected 2010.304 MHz processor.
> spurious 8259A interrupt: IRQ7.
> BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
> IP: [<0000000000000000>]
> PGD 0
> Thread overran stack, or stack corrupted
> Oops: 0010 [1] SMP
> CPU 0
>
I don't know if this will fix this bug, but it's definitely a bugfix.
It was trashing random pages by overwriting them with pagetables...
Subject: x86_64: memory mapping: don't trash large pmd mapping
Don't trash a large pmd's data when mapping physical memory.
This is a bugfix for "x86_64: adjust mapping of physical pagetables
to work with Xen".
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/mm/init_64.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
===================================================================
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -311,7 +311,8 @@
}
if (pmd_val(*pmd)) {
- phys_pte_update(pmd, address, end);
+ if (!pmd_large(*pmd))
+ phys_pte_update(pmd, address, end);
continue;
}
* Jeremy Fitzhardinge <[email protected]> wrote:
> Ingo Molnar wrote:
>> * Ingo Molnar <[email protected]> wrote:
>>
>>
>>> that fixed the build but now we've got a boot crash with this config:
>>>
>>
>> plus -tip auto-testing found another build failure with:
>>
>> http://redhat.com/~mingo/misc/config-Thu_Jun_26_12_46_46_CEST_2008.bad
>>
>> arch/x86/kernel/entry_64.S: Assembler messages:
>> arch/x86/kernel/entry_64.S:1201: Error: invalid character '_' in mnemonic
>> arch/x86/kernel/entry_64.S:1205: Error: invalid character '_' in mnemonic
>> arch/x86/kernel/entry_64.S:1209: Error: invalid character '_' in mnemonic
>> arch/x86/kernel/entry_64.S:1213: Error: invalid character '_' in mnemonic
>>
>>
>
> I'm confused. How did this config both crash and not build?
i'm testing on multiple systems in parallel, each is running randconfig
kernels. One 64-bit system found a build bug, the other one found a boot
crash.
This can happen if certain configs build fine (but crash), certain
configs dont even build. Each system does a random walk of the config
space.
I've applied your two fixes and i'm re-testing.
Ingo
Ingo Molnar wrote:
> i'm testing on multiple systems in parallel, each is running randconfig
> kernels. One 64-bit system found a build bug, the other one found a boot
> crash.
>
> This can happen if certain configs build fine (but crash), certain
> configs dont even build. Each system does a random walk of the config
> space.
>
Yes, but the URL for both the crash and the build failure pointed to the
same config. Is one of them a mistake?
> I've applied your two fixes and i'm re-testing.
>
Thanks,
J
* Jeremy Fitzhardinge <[email protected]> wrote:
>> plus -tip auto-testing found another build failure with:
>>
>> http://redhat.com/~mingo/misc/config-Thu_Jun_26_12_46_46_CEST_2008.bad
>>
>> arch/x86/kernel/entry_64.S: Assembler messages:
>> arch/x86/kernel/entry_64.S:1201: Error: invalid character '_' in mnemonic
>> arch/x86/kernel/entry_64.S:1205: Error: invalid character '_' in mnemonic
>> arch/x86/kernel/entry_64.S:1209: Error: invalid character '_' in mnemonic
>> arch/x86/kernel/entry_64.S:1213: Error: invalid character '_' in mnemonic
>
> I'm confused. How did this config both crash and not build?
this problem still reproduces.
i've pushed out all fixes into tip/x86/xen-64bit. That branch combined
with the config above still reproduces the build failure above.
Ingo
* Jeremy Fitzhardinge <[email protected]> wrote:
> Ingo Molnar wrote:
>> i'm testing on multiple systems in parallel, each is running randconfig
>> kernels. One 64-bit system found a build bug, the other one found a
>> boot crash.
>>
>> This can happen if certain configs build fine (but crash), certain
>> configs dont even build. Each system does a random walk of the config
>> space.
>
> Yes, but the URL for both the crash and the build failure pointed to
> the same config. Is one of them a mistake?
yeah, i guess so. Right now i only ran into the build failure so there's
hope :) Here's a config that fails to build for sure:
http://redhat.com/~mingo/misc/config-Fri_Jun_27_17_54_32_CEST_2008.bad
note, on 32-bit there's a yet unfixed initrd corruption bug i've
bisected back to:
| 510be56adc4bb9fb229637a8e89268d987263614 is first bad commit
| commit 510be56adc4bb9fb229637a8e89268d987263614
| Author: Yinghai Lu <[email protected]>
| Date: Tue Jun 24 04:10:47 2008 -0700
|
| x86: introduce init_memory_mapping for 32bit
so if you see something like that it's probably not a bug introduced by
your changes. (and maybe you'll see why the above commit is buggy, i
havent figured it out yet)
Ingo
Ingo Molnar wrote:
> yeah, i guess so. Right now i only ran into the build failure so there's
> hope :) Here's a config that fails to build for sure:
>
> http://redhat.com/~mingo/misc/config-Fri_Jun_27_17_54_32_CEST_2008.bad
>
Will look at it shortly.
> note, on 32-bit there's a yet unfixed initrd corruption bug i've
> bisected back to:
>
> | 510be56adc4bb9fb229637a8e89268d987263614 is first bad commit
> | commit 510be56adc4bb9fb229637a8e89268d987263614
> | Author: Yinghai Lu <[email protected]>
> | Date: Tue Jun 24 04:10:47 2008 -0700
> |
> | x86: introduce init_memory_mapping for 32bit
>
> so if you see something like that it's probably not a bug introduced by
> your changes. (and maybe you'll see why the above commit is buggy, i
> havent figured it out yet)
Well, on a non-PSE system find_early_table_space() will not allocate
enough memory for ptes. But I posted the fix for that, and it's likely
you're using PSE anyway. Nothing pops out from a quick re-read, but it
could easily be mis-reserving the ramdisk memory or something.
J
Ingo Molnar wrote:
> this problem still reproduces.
>
> i've pushed out all fixes into tip/x86/xen-64bit. That branch combined
> with the config above still reproduces the build failure above.
>
Subject: x86_64: fix non-paravirt compilation
Make sure SWAPGS and PARAVIRT_ADJUST_EXCEPTION_FRAME are properly
defined when CONFIG_PARAVIRT is off.
Fixes Ingo's build failure:
arch/x86/kernel/entry_64.S: Assembler messages:
arch/x86/kernel/entry_64.S:1201: Error: invalid character '_' in mnemonic
arch/x86/kernel/entry_64.S:1205: Error: invalid character '_' in mnemonic
arch/x86/kernel/entry_64.S:1209: Error: invalid character '_' in mnemonic
arch/x86/kernel/entry_64.S:1213: Error: invalid character '_' in mnemonic
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
include/asm-x86/irqflags.h | 22 +++++++++++++---------
include/asm-x86/processor.h | 3 ---
2 files changed, 13 insertions(+), 12 deletions(-)
===================================================================
--- a/include/asm-x86/irqflags.h
+++ b/include/asm-x86/irqflags.h
@@ -167,7 +167,20 @@
#define INTERRUPT_RETURN_NMI_SAFE NATIVE_INTERRUPT_RETURN_NMI_SAFE
#ifdef CONFIG_X86_64
+#define SWAPGS swapgs
+/*
+ * Currently paravirt can't handle swapgs nicely when we
+ * don't have a stack we can rely on (such as a user space
+ * stack). So we either find a way around these or just fault
+ * and emulate if a guest tries to call swapgs directly.
+ *
+ * Either way, this is a good way to document that we don't
+ * have a reliable stack. x86_64 only.
+ */
#define SWAPGS_UNSAFE_STACK swapgs
+
+#define PARAVIRT_ADJUST_EXCEPTION_FRAME /* */
+
#define INTERRUPT_RETURN iretq
#define USERGS_SYSRET64 \
swapgs; \
@@ -233,15 +246,6 @@
#else
#ifdef CONFIG_X86_64
-/*
- * Currently paravirt can't handle swapgs nicely when we
- * don't have a stack we can rely on (such as a user space
- * stack). So we either find a way around these or just fault
- * and emulate if a guest tries to call swapgs directly.
- *
- * Either way, this is a good way to document that we don't
- * have a reliable stack. x86_64 only.
- */
#define ARCH_LOCKDEP_SYS_EXIT call lockdep_sys_exit_thunk
#define ARCH_LOCKDEP_SYS_EXIT_IRQ \
TRACE_IRQS_ON; \
===================================================================
--- a/include/asm-x86/processor.h
+++ b/include/asm-x86/processor.h
@@ -541,9 +541,6 @@
}
#define set_iopl_mask native_set_iopl_mask
-#define SWAPGS swapgs
-
-#define PARAVIRT_ADJUST_EXCEPTION_FRAME /* */
#endif /* CONFIG_PARAVIRT */
/*
* Jeremy Fitzhardinge <[email protected]> wrote:
> Subject: x86_64: fix non-paravirt compilation
i've put tip/x86/xen-64bit into tip/master briefly and it quickly
triggered this crash on 64-bit x86:
Linux version 2.6.26-rc8-tip-00241-gc6c8cb2-dirty (mingo@dione)
(gcc version 4.2.3) #12303 SMP Sun Jun 29 10:30:01 CEST 2008
Command line: root=/dev/sda6 console=ttyS0,115200 earlyprintk=serial,ttyS0,115200 debug initcall_debug apic=verbose sysrq_always_enabled ignore_loglevel selinux=0
BIOS-provided physical RAM map:
BIOS-e820: 0000000000000000 - 000000000009f800 (usable)
BIOS-e820: 000000000009f800 - 00000000000a0000 (reserved)
BIOS-e820: 00000000000f0000 - 0000000000100000 (reserved)
BIOS-e820: 0000000000100000 - 000000003fff0000 (usable)
BIOS-e820: 000000003fff0000 - 000000003fff3000 (ACPI NVS)
BIOS-e820: 000000003fff3000 - 0000000040000000 (ACPI data)
BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
BIOS-e820: 00000000fec00000 - 0000000100000000 (reserved)
KERNEL supported cpus:
Intel GenuineIntel
AMD AuthenticAMD
Centaur CentaurHauls
console [earlyser0] enabled
debug: ignoring loglevel setting.
Entering add_active_range(0, 0x0, 0x9f) 0 entries of 25600 used
Entering add_active_range(0, 0x100, 0x3fff0) 1 entries of 25600 used
last_pfn = 0x3fff0 max_arch_pfn = 0x3ffffffff
init_memory_mapping
kernel direct mapping tables up to 3fff0000 @ 8000-a000
PANIC: early exception 0e rip 10:ffffffff804b24e2 error 0 cr2 ffffffffff300000
Pid: 0, comm: swapper Not tainted 2.6.26-rc8-tip-00241-gc6c8cb2-dirty #12303
Call Trace:
[<ffffffff80efe196>] early_idt_handler+0x56/0x6a
[<ffffffff804b24e2>] ? __memcpy_fromio+0x12/0x30
[<ffffffff804b24d9>] ? __memcpy_fromio+0x9/0x30
[<ffffffff80f32f27>] dmi_scan_machine+0x57/0x1b0
[<ffffffff80f02c15>] setup_arch+0x3f5/0x5e0
[<ffffffff80efedd5>] start_kernel+0x75/0x350
[<ffffffff80efe289>] x86_64_start_reservations+0x89/0xa0
[<ffffffff80efe397>] x86_64_start_kernel+0xf7/0x100
RIP 0x10
with this config:
http://redhat.com/~mingo/misc/config-Sun_Jun_29_10_29_11_CEST_2008.bad
i've saved the merged 2.6.26-rc8-tip-00241-gc6c8cb2-dirty tree into
tip/tmp.x86.xen-64bit.Sun_Jun_29_10 and pushed it out, so you can test
that exact version.
Ingo
Ingo Molnar wrote:
> with this config:
>
> http://redhat.com/~mingo/misc/config-Sun_Jun_29_10_29_11_CEST_2008.bad
>
> i've saved the merged 2.6.26-rc8-tip-00241-gc6c8cb2-dirty tree into
> tip/tmp.x86.xen-64bit.Sun_Jun_29_10 and pushed it out, so you can test
> that exact version.
Looks like the setup.c unification missed the early_ioremap init from
the early_ioremap unification. Unconditionally call early_ioremap_init().
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
diff -r 5c26177fdf8c arch/x86/kernel/setup.c
--- a/arch/x86/kernel/setup.c Sun Jun 29 16:57:52 2008 -0700
+++ b/arch/x86/kernel/setup.c Sun Jun 29 19:57:00 2008 -0700
@@ -523,11 +523,12 @@
memcpy(&boot_cpu_data, &new_cpu_data, sizeof(new_cpu_data));
pre_setup_arch_hook();
early_cpu_init();
- early_ioremap_init();
reserve_setup_data();
#else
printk(KERN_INFO "Command line: %s\n", boot_command_line);
#endif
+
+ early_ioremap_init();
ROOT_DEV = old_decode_dev(boot_params.hdr.root_dev);
screen_info = boot_params.screen_info;
On Sun, Jun 29, 2008 at 8:02 PM, Jeremy Fitzhardinge <[email protected]> wrote:
> Ingo Molnar wrote:
>>
>> with this config:
>>
>> http://redhat.com/~mingo/misc/config-Sun_Jun_29_10_29_11_CEST_2008.bad
>>
>> i've saved the merged 2.6.26-rc8-tip-00241-gc6c8cb2-dirty tree into
>> tip/tmp.x86.xen-64bit.Sun_Jun_29_10 and pushed it out, so you can test that
>> exact version.
>
> Looks like the setup.c unification missed the early_ioremap init from the
> early_ioremap unification. Unconditionally call early_ioremap_init().
>
> Signed-off-by: Jeremy Fitzhardinge <[email protected]>
>
> diff -r 5c26177fdf8c arch/x86/kernel/setup.c
> --- a/arch/x86/kernel/setup.c Sun Jun 29 16:57:52 2008 -0700
> +++ b/arch/x86/kernel/setup.c Sun Jun 29 19:57:00 2008 -0700
> @@ -523,11 +523,12 @@
> memcpy(&boot_cpu_data, &new_cpu_data, sizeof(new_cpu_data));
> pre_setup_arch_hook();
> early_cpu_init();
> - early_ioremap_init();
> reserve_setup_data();
> #else
> printk(KERN_INFO "Command line: %s\n", boot_command_line);
> #endif
> +
> + early_ioremap_init();
>
> ROOT_DEV = old_decode_dev(boot_params.hdr.root_dev);
> screen_info = boot_params.screen_info;
it could be wrong? do we need that for 64 bit?
YH
Yinghai Lu wrote:
> On Sun, Jun 29, 2008 at 8:02 PM, Jeremy Fitzhardinge <[email protected]> wrote:
>
>> Ingo Molnar wrote:
>>
>>> with this config:
>>>
>>> http://redhat.com/~mingo/misc/config-Sun_Jun_29_10_29_11_CEST_2008.bad
>>>
>>> i've saved the merged 2.6.26-rc8-tip-00241-gc6c8cb2-dirty tree into
>>> tip/tmp.x86.xen-64bit.Sun_Jun_29_10 and pushed it out, so you can test that
>>> exact version.
>>>
>> Looks like the setup.c unification missed the early_ioremap init from the
>> early_ioremap unification. Unconditionally call early_ioremap_init().
>>
>> Signed-off-by: Jeremy Fitzhardinge <[email protected]>
>>
>> diff -r 5c26177fdf8c arch/x86/kernel/setup.c
>> --- a/arch/x86/kernel/setup.c Sun Jun 29 16:57:52 2008 -0700
>> +++ b/arch/x86/kernel/setup.c Sun Jun 29 19:57:00 2008 -0700
>> @@ -523,11 +523,12 @@
>> memcpy(&boot_cpu_data, &new_cpu_data, sizeof(new_cpu_data));
>> pre_setup_arch_hook();
>> early_cpu_init();
>> - early_ioremap_init();
>> reserve_setup_data();
>> #else
>> printk(KERN_INFO "Command line: %s\n", boot_command_line);
>> #endif
>> +
>> + early_ioremap_init();
>>
>> ROOT_DEV = old_decode_dev(boot_params.hdr.root_dev);
>> screen_info = boot_params.screen_info;
>>
>
> it could be wrong? do we need that for 64 bit?
Yes. I unified the early_ioremap implementations by making 64-bit use
the 32-bit one.
J
* Jeremy Fitzhardinge <[email protected]> wrote:
> Ingo Molnar wrote:
>> with this config:
>>
>> http://redhat.com/~mingo/misc/config-Sun_Jun_29_10_29_11_CEST_2008.bad
>>
>> i've saved the merged 2.6.26-rc8-tip-00241-gc6c8cb2-dirty tree into
>> tip/tmp.x86.xen-64bit.Sun_Jun_29_10 and pushed it out, so you can test
>> that exact version.
>
> Looks like the setup.c unification missed the early_ioremap init from
> the early_ioremap unification. Unconditionally call
> early_ioremap_init().
applied to tip/x86/unify-setup - thanks Jeremy.
I've reactivated the x86/xen-64bit branch and i'm testing it currently.
Ingo
* Ingo Molnar <[email protected]> wrote:
>
> * Jeremy Fitzhardinge <[email protected]> wrote:
>
> > Ingo Molnar wrote:
> >> with this config:
> >>
> >> http://redhat.com/~mingo/misc/config-Sun_Jun_29_10_29_11_CEST_2008.bad
> >>
> >> i've saved the merged 2.6.26-rc8-tip-00241-gc6c8cb2-dirty tree into
> >> tip/tmp.x86.xen-64bit.Sun_Jun_29_10 and pushed it out, so you can test
> >> that exact version.
> >
> > Looks like the setup.c unification missed the early_ioremap init from
> > the early_ioremap unification. Unconditionally call
> > early_ioremap_init().
>
> applied to tip/x86/unify-setup - thanks Jeremy.
>
> I've reactivated the x86/xen-64bit branch and i'm testing it currently.
-tip auto-testing found pagetable corruption (CPA self-test failure):
[ 32.956015] CPA self-test:
[ 32.958822] 4k 2048 large 508 gb 0 x 2556[ffff880000000000-ffff88003fe00000] miss 0
[ 32.964000] CPA ffff88001d54e000: bad pte 1d4000e3
[ 32.968000] CPA ffff88001d54e000: unexpected level 2
[ 32.972000] CPA ffff880022c5d000: bad pte 22c000e3
[ 32.976000] CPA ffff880022c5d000: unexpected level 2
[ 32.980000] CPA ffff8800200ce000: bad pte 200000e3
[ 32.984000] CPA ffff8800200ce000: unexpected level 2
[ 32.988000] CPA ffff8800210f0000: bad pte 210000e3
config and full log can be found at:
http://redhat.com/~mingo/misc/config-Mon_Jun_30_11_11_51_CEST_2008.bad
http://redhat.com/~mingo/misc/log-Mon_Jun_30_11_11_51_CEST_2008.bad
i've pushed that tree out into tip/tmp.xen-64bit.Mon_Jun_30_11_11. The
only new item in that tree over a well-tested base is x86/xen-64bit, so
i've taken it out again.
Ingo
Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>
>> * Jeremy Fitzhardinge <[email protected]> wrote:
>>
>>
>>> Ingo Molnar wrote:
>>>
>>>> with this config:
>>>>
>>>> http://redhat.com/~mingo/misc/config-Sun_Jun_29_10_29_11_CEST_2008.bad
>>>>
>>>> i've saved the merged 2.6.26-rc8-tip-00241-gc6c8cb2-dirty tree into
>>>> tip/tmp.x86.xen-64bit.Sun_Jun_29_10 and pushed it out, so you can test
>>>> that exact version.
>>>>
>>> Looks like the setup.c unification missed the early_ioremap init from
>>> the early_ioremap unification. Unconditionally call
>>> early_ioremap_init().
>>>
>> applied to tip/x86/unify-setup - thanks Jeremy.
>>
>> I've reactivated the x86/xen-64bit branch and i'm testing it currently.
>>
>
> -tip auto-testing found pagetable corruption (CPA self-test failure):
>
> [ 32.956015] CPA self-test:
> [ 32.958822] 4k 2048 large 508 gb 0 x 2556[ffff880000000000-ffff88003fe00000] miss 0
> [ 32.964000] CPA ffff88001d54e000: bad pte 1d4000e3
> [ 32.968000] CPA ffff88001d54e000: unexpected level 2
> [ 32.972000] CPA ffff880022c5d000: bad pte 22c000e3
> [ 32.976000] CPA ffff880022c5d000: unexpected level 2
> [ 32.980000] CPA ffff8800200ce000: bad pte 200000e3
> [ 32.984000] CPA ffff8800200ce000: unexpected level 2
> [ 32.988000] CPA ffff8800210f0000: bad pte 210000e3
>
> config and full log can be found at:
>
> http://redhat.com/~mingo/misc/config-Mon_Jun_30_11_11_51_CEST_2008.bad
>
That config doesn't build for me. When I put it in place and do "make
oldconfig" it still asks for lots of config options (which I just set to
default). But when I build it fails with:
CC arch/x86/kernel/asm-offsets.s
In file included from include2/asm/page.h:40,
from include2/asm/pda.h:8,
from include2/asm/current.h:19,
from include2/asm/processor.h:15,
from /home/jeremy/hg/xen/paravirt/linux/include/linux/prefetch.h:14,
from /home/jeremy/hg/xen/paravirt/linux/include/linux/list.h:6,
from /home/jeremy/hg/xen/paravirt/linux/include/linux/module.h:9,
from /home/jeremy/hg/xen/paravirt/linux/include/linux/crypto.h:21,
from /home/jeremy/hg/xen/paravirt/linux/arch/x86/kernel/asm-offsets_64.c:7,
from /home/jeremy/hg/xen/paravirt/linux/arch/x86/kernel/asm-offsets.c:4:
include2/asm/page_64.h:46:2: error: #error "CONFIG_PHYSICAL_START must be a multiple of 2MB"
make[3]: *** [arch/x86/kernel/asm-offsets.s] Error 1
I can fix that, of course, but it doesn't give me confidence I'm testing
what you are...
J
Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>
>> * Jeremy Fitzhardinge <[email protected]> wrote:
>>
>>
>>> Ingo Molnar wrote:
>>>
>>>> with this config:
>>>>
>>>> http://redhat.com/~mingo/misc/config-Sun_Jun_29_10_29_11_CEST_2008.bad
>>>>
>>>> i've saved the merged 2.6.26-rc8-tip-00241-gc6c8cb2-dirty tree into
>>>> tip/tmp.x86.xen-64bit.Sun_Jun_29_10 and pushed it out, so you can test
>>>> that exact version.
>>>>
>>> Looks like the setup.c unification missed the early_ioremap init from
>>> the early_ioremap unification. Unconditionally call
>>> early_ioremap_init().
>>>
>> applied to tip/x86/unify-setup - thanks Jeremy.
>>
>> I've reactivated the x86/xen-64bit branch and i'm testing it currently.
>>
>
> -tip auto-testing found pagetable corruption (CPA self-test failure):
>
> [ 32.956015] CPA self-test:
> [ 32.958822] 4k 2048 large 508 gb 0 x 2556[ffff880000000000-ffff88003fe00000] miss 0
> [ 32.964000] CPA ffff88001d54e000: bad pte 1d4000e3
> [ 32.968000] CPA ffff88001d54e000: unexpected level 2
> [ 32.972000] CPA ffff880022c5d000: bad pte 22c000e3
> [ 32.976000] CPA ffff880022c5d000: unexpected level 2
> [ 32.980000] CPA ffff8800200ce000: bad pte 200000e3
> [ 32.984000] CPA ffff8800200ce000: unexpected level 2
> [ 32.988000] CPA ffff8800210f0000: bad pte 210000e3
>
> config and full log can be found at:
>
> http://redhat.com/~mingo/misc/config-Mon_Jun_30_11_11_51_CEST_2008.bad
>
This config doesn't have CONFIG_DEBUG_KERNEL enabled, let alone
CONFIG_CPA_DEBUG. I've noticed this seems to happen quite a lot:
there's a disconnect between the log file and the config which is
supposed to have built the kernel. Is there a bug in your test
infrastructure?
J
* Jeremy Fitzhardinge <[email protected]> wrote:
>> -tip auto-testing found pagetable corruption (CPA self-test failure):
>>
>> [ 32.956015] CPA self-test:
>> [ 32.958822] 4k 2048 large 508 gb 0 x 2556[ffff880000000000-ffff88003fe00000] miss 0
>> [ 32.964000] CPA ffff88001d54e000: bad pte 1d4000e3
>> [ 32.968000] CPA ffff88001d54e000: unexpected level 2
>> [ 32.972000] CPA ffff880022c5d000: bad pte 22c000e3
>> [ 32.976000] CPA ffff880022c5d000: unexpected level 2
>> [ 32.980000] CPA ffff8800200ce000: bad pte 200000e3
>> [ 32.984000] CPA ffff8800200ce000: unexpected level 2
>> [ 32.988000] CPA ffff8800210f0000: bad pte 210000e3
>>
>> config and full log can be found at:
>>
>> http://redhat.com/~mingo/misc/config-Mon_Jun_30_11_11_51_CEST_2008.bad
>>
>
> This config doesn't have CONFIG_DEBUG_KERNEL enabled, let alone
> CONFIG_CPA_DEBUG. I've noticed this seems to happen quite a lot:
> there's a disconnect between the log file and the config which is
> supposed to have built the kernel. Is there a bug in your test
> infrastructure?
sometimes the kernel preceding the currently built one is the buggy one.
i have them saved away, so the right one should be:
http://redhat.com/~mingo/misc/config-Mon_Jun_30_11_03_04_CEST_2008.bad
Ingo
* Jeremy Fitzhardinge <[email protected]> wrote:
>> config and full log can be found at:
>>
>> http://redhat.com/~mingo/misc/config-Mon_Jun_30_11_11_51_CEST_2008.bad
>>
>
> That config doesn't build for me. When I put it in place and do "make
> oldconfig" it still asks for lots of config options (which I just set
> to default). But when I build it fails with:
try 'make ARCH=i386 oldconfig' - does it work better that way?
> include2/asm/page_64.h:46:2: error: #error "CONFIG_PHYSICAL_START must be a multiple of 2MB"
> make[3]: *** [arch/x86/kernel/asm-offsets.s] Error 1
>
> I can fix that, of course, but it doesn't give me confidence I'm
> testing what you are...
the problem there is that the 32-bit config has:
CONFIG_PHYSICAL_START=0x100000
which the 64-bit make oldconfig picked up, but that start address is not
valid on 64-bit.
Ingo
Ingo Molnar wrote:
> * Jeremy Fitzhardinge <[email protected]> wrote:
>
>
>>> config and full log can be found at:
>>>
>>> http://redhat.com/~mingo/misc/config-Mon_Jun_30_11_11_51_CEST_2008.bad
>>>
>>>
>> That config doesn't build for me. When I put it in place and do "make
>> oldconfig" it still asks for lots of config options (which I just set
>> to default). But when I build it fails with:
>>
>
> try 'make ARCH=i386 oldconfig' - does it work better that way?
>
Er, we're talking about 64-bit here, aren't we? The log messages are
from a 64-bit kernel.
Well, it was the wrong config anyway, which I guess is the source of
this confusion.
(I thought ARCH= to select 32/64 was going away now that the config has
the bitsize config?)
J
* Jeremy Fitzhardinge <[email protected]> wrote:
>> try 'make ARCH=i386 oldconfig' - does it work better that way?
>
> Er, we're talking about 64-bit here, aren't we? The log messages are
> from a 64-bit kernel.
>
> Well, it was the wrong config anyway, which I guess is the source of
> this confusion.
yeah.
> (I thought ARCH= to select 32/64 was going away now that the config
> has the bitsize config?)
yep, correct - but it has to be done carefully - until now people (and
tools) could assume that 'make oldconfig' just creates stuff for their
native host architecture. But i agree in principle.
Ingo
Ingo Molnar wrote:
> -tip auto-testing found pagetable corruption (CPA self-test failure):
>
> [ 32.956015] CPA self-test:
> [ 32.958822] 4k 2048 large 508 gb 0 x 2556[ffff880000000000-ffff88003fe00000] miss 0
> [ 32.964000] CPA ffff88001d54e000: bad pte 1d4000e3
> [ 32.968000] CPA ffff88001d54e000: unexpected level 2
> [ 32.972000] CPA ffff880022c5d000: bad pte 22c000e3
> [ 32.976000] CPA ffff880022c5d000: unexpected level 2
> [ 32.980000] CPA ffff8800200ce000: bad pte 200000e3
> [ 32.984000] CPA ffff8800200ce000: unexpected level 2
> [ 32.988000] CPA ffff8800210f0000: bad pte 210000e3
>
> config and full log can be found at:
>
> http://redhat.com/~mingo/misc/config-Mon_Jun_30_11_11_51_CEST_2008.bad
> http://redhat.com/~mingo/misc/log-Mon_Jun_30_11_11_51_CEST_2008.bad
>
> i've pushed that tree out into tip/tmp.xen-64bit.Mon_Jun_30_11_11. The
> only new item in that tree over a well-tested base is x86/xen-64bit, so
> i've taken it out again.
>
Phew. OK, I've worked this out. Short version is that it's a false
alarm, and there was no real failure here. Long version:
* I changed the code to create the physical mapping pagetables to
reuse any existing mapping rather than replace it. Specifically,
reusing a pud pointed to by the pgd caused this symptom to appear.
* The specific PUD being reused is the one created statically in
head_64.S, which creates an initial 1GB mapping.
* That mapping doesn't have _PAGE_GLOBAL set on it, due to the
inconsistency between __PAGE_* and PAGE_*.
* The CPA test attempts to clear _PAGE_GLOBAL, and then checks to
see that the resulting range is 1) shattered into 4k pages, and 2)
has no _PAGE_GLOBAL.
* However, since it didn't have _PAGE_GLOBAL on that range to start
with, change_page_attr_clear() had nothing to do, and didn't
bother shattering the range,
* resulting in the reported messages.
The simple fix is to set _PAGE_GLOBAL in level2_ident_pgt.
An additional fix would be to make CPA testing more robust by using
some other pagetable bit (one of the unused available-to-software
ones). This would avoid spurious CPA test warnings under Xen, which
uses _PAGE_GLOBAL for its own purposes (i.e., not under guest control).
Also, we should revisit the use of _PAGE_GLOBAL in asm-x86/pgtable.h,
and use it consistently, and drop MAKE_GLOBAL. The first time I
proposed it it caused breakages in the very early CPA code; with luck
that's all fixed now.
Anyway, the simple fix below. I'll put together RFC patches for the
other suggestions. I also split the originating patch into tiny, tiny
bisectable pieces.
Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/head_64.S | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
===================================================================
--- a/arch/x86/kernel/head_64.S
+++ b/arch/x86/kernel/head_64.S
@@ -374,7 +374,7 @@
/* Since I easily can, map the first 1G.
* Don't set NX because code runs from these pages.
*/
- PMDS(0, __PAGE_KERNEL_LARGE_EXEC, PTRS_PER_PMD)
+ PMDS(0, __PAGE_KERNEL_LARGE_EXEC | _PAGE_GLOBAL, PTRS_PER_PMD)
NEXT_PAGE(level2_kernel_pgt)
/*
* Jeremy Fitzhardinge <[email protected]> wrote:
> Ingo Molnar wrote:
>> -tip auto-testing found pagetable corruption (CPA self-test failure):
>>
>> [ 32.956015] CPA self-test:
>> [ 32.958822] 4k 2048 large 508 gb 0 x 2556[ffff880000000000-ffff88003fe00000] miss 0
>> [ 32.964000] CPA ffff88001d54e000: bad pte 1d4000e3
>> [ 32.968000] CPA ffff88001d54e000: unexpected level 2
>> [ 32.972000] CPA ffff880022c5d000: bad pte 22c000e3
>> [ 32.976000] CPA ffff880022c5d000: unexpected level 2
>> [ 32.980000] CPA ffff8800200ce000: bad pte 200000e3
>> [ 32.984000] CPA ffff8800200ce000: unexpected level 2
>> [ 32.988000] CPA ffff8800210f0000: bad pte 210000e3
>>
>> config and full log can be found at:
>>
>> http://redhat.com/~mingo/misc/config-Mon_Jun_30_11_11_51_CEST_2008.bad
>> http://redhat.com/~mingo/misc/log-Mon_Jun_30_11_11_51_CEST_2008.bad
>>
>> i've pushed that tree out into tip/tmp.xen-64bit.Mon_Jun_30_11_11. The
>> only new item in that tree over a well-tested base is x86/xen-64bit, so
>> i've taken it out again.
>>
>
> Phew. OK, I've worked this out. Short version is that it's a false
> alarm, and there was no real failure here. Long version:
>
> * I changed the code to create the physical mapping pagetables to
> reuse any existing mapping rather than replace it. Specifically,
> reusing a pud pointed to by the pgd caused this symptom to appear.
> * The specific PUD being reused is the one created statically in
> head_64.S, which creates an initial 1GB mapping.
> * That mapping doesn't have _PAGE_GLOBAL set on it, due to the
> inconsistency between __PAGE_* and PAGE_*.
> * The CPA test attempts to clear _PAGE_GLOBAL, and then checks to
> see that the resulting range is 1) shattered into 4k pages, and 2)
> has no _PAGE_GLOBAL.
> * However, since it didn't have _PAGE_GLOBAL on that range to start
> with, change_page_attr_clear() had nothing to do, and didn't
> bother shattering the range,
> * resulting in the reported messages.
>
> The simple fix is to set _PAGE_GLOBAL in level2_ident_pgt.
>
> An additional fix would be to make CPA testing more robust by using
> some other pagetable bit (one of the unused available-to-software
> ones). This would avoid spurious CPA test warnings under Xen, which
> uses _PAGE_GLOBAL for its own purposes (ie, not under guest control).
>
> Also, we should revisit the use of _PAGE_GLOBAL in asm-x86/pgtable.h,
> use it consistently, and drop MAKE_GLOBAL. The first time I proposed
> this it caused breakages in the very early CPA code; with luck that's
> all fixed now.
>
> Anyway, the simple fix below. [...]
great - i've applied your fix and re-integrated x86/xen-64bit, it's
under testing now. (no problems so far)
> [...] I'll put together RFC patches for the other suggestions. I also
> split the originating patch into tiny, tiny bisectable pieces.
cool! :)
Ingo
* Ingo Molnar <[email protected]> wrote:
> great - i've applied your fix and re-integrated x86/xen-64bit, it's
> under testing now. (no problems so far)
hm, -tip testing still triggers a 64-bit bootup crash:
[ 0.000000] init_memory_mapping
[ 0.000000] kernel direct mapping tables up to 3fff0000 @ 8000-a000
PANIC: early exception 0e rip 10:ffffffff80418f81 error 0 cr2 ffffffffff300000
[ 0.000000] Pid: 0, comm: swapper Not tainted 2.6.26-rc8-tip #13363
[ 0.000000]
[ 0.000000] Call Trace:
[ 0.000000] [<ffffffff807f088b>] ? init_memory_mapping+0x341/0x56b
[ 0.000000] [<ffffffff80dba19f>] early_idt_handler+0x5f/0x73
[ 0.000000] [<ffffffff80418f81>] ? __memcpy_fromio+0xd/0x1e
[ 0.000000] [<ffffffff80de238a>] dmi_scan_machine+0x41/0x19b
[ 0.000000] [<ffffffff80dbeba8>] setup_arch+0x46d/0x5d8
[ 0.000000] [<ffffffff802896a0>] ? kernel_text_unlock+0x10/0x12
[ 0.000000] [<ffffffff80263b86>] ? raw_notifier_chain_register+0x9/0xb
[ 0.000000] [<ffffffff80dba140>] ? early_idt_handler+0x0/0x73
[ 0.000000] [<ffffffff80dbac5a>] start_kernel+0xf4/0x3b3
[ 0.000000] [<ffffffff80dba140>] ? early_idt_handler+0x0/0x73
[ 0.000000] [<ffffffff80dba2a4>] x86_64_start_reservations+0xa9/0xad
[ 0.000000] [<ffffffff80dba3b8>] x86_64_start_kernel+0x110/0x11f
[ 0.000000]
http://redhat.com/~mingo/misc/crash.log-Tue_Jul__1_10_55_47_CEST_2008.bad
http://redhat.com/~mingo/misc/config-Tue_Jul__1_10_55_47_CEST_2008.bad
Excluding the x86/xen-64bit topic solves the problem.
It triggered on two 64-bit machines so it seems readily reproducible
with that config.
i've pushed the failing tree out to tip/tmp.xen-64bit.Tue_Jul__1_10_55
Ingo
Ingo Molnar wrote:
> http://redhat.com/~mingo/misc/crash.log-Tue_Jul__1_10_55_47_CEST_2008.bad
> http://redhat.com/~mingo/misc/config-Tue_Jul__1_10_55_47_CEST_2008.bad
>
> Excluding the x86/xen-64bit topic solves the problem.
>
> It triggered on two 64-bit machines so it seems readily reproducible
> with that config.
>
> i've pushed the failing tree out to tip/tmp.xen-64bit.Tue_Jul__1_10_55
>
Looks like you lost the other patch to put the early_ioremap_init in the
right place...
J
Ingo Molnar wrote:
> Excluding the x86/xen-64bit topic solves the problem.
>
> It triggered on two 64-bit machines so it seems readily reproducible
> with that config.
>
> i've pushed the failing tree out to tip/tmp.xen-64bit.Tue_Jul__1_10_55
>
The patch to fix this is on tip/x86/unify-setup: "x86: setup_arch() &&
early_ioremap_init()". Logically that patch should probably be in the
xen64 branch, since it's only meaningful with the early_ioremap unification.
J
* Jeremy Fitzhardinge <[email protected]> wrote:
> Ingo Molnar wrote:
>> Excluding the x86/xen-64bit topic solves the problem.
>>
>> It triggered on two 64-bit machines so it seems readily reproducible
>> with that config.
>>
>> i've pushed the failing tree out to tip/tmp.xen-64bit.Tue_Jul__1_10_55
>>
>
> The patch to fix this is on tip/x86/unify-setup: "x86: setup_arch() &&
> early_ioremap_init()". Logically that patch should probably be in the
> xen64 branch, since it's only meaningful with the early_ioremap
> unification.
ah, indeed - it was missing from tip/master due to:
| commit ac998c259605741efcfbd215533b379970ba1d9f
| Author: Ingo Molnar <[email protected]>
| Date: Mon Jun 30 12:01:31 2008 +0200
|
| Revert "x86: setup_arch() && early_ioremap_init()"
|
| This reverts commit 181b3601a1a7d2ac3ace6b23cb3204450a4f9a27.
because that change needed the other changes from xen-64bit.
will retry tomorrow.
Ingo
* Ingo Molnar <[email protected]> wrote:
> * Jeremy Fitzhardinge <[email protected]> wrote:
>
> > Ingo Molnar wrote:
> >> Excluding the x86/xen-64bit topic solves the problem.
> >>
> >> It triggered on two 64-bit machines so it seems readily reproducible
> >> with that config.
> >>
> >> i've pushed the failing tree out to tip/tmp.xen-64bit.Tue_Jul__1_10_55
> >>
> >
> > The patch to fix this is on tip/x86/unify-setup: "x86: setup_arch() &&
> > early_ioremap_init()". Logically that patch should probably be in the
> > xen64 branch, since it's only meaningful with the early_ioremap
> > unification.
>
> ah, indeed - it was missing from tip/master due to:
>
> | commit ac998c259605741efcfbd215533b379970ba1d9f
> | Author: Ingo Molnar <[email protected]>
> | Date: Mon Jun 30 12:01:31 2008 +0200
> |
> | Revert "x86: setup_arch() && early_ioremap_init()"
> |
> | This reverts commit 181b3601a1a7d2ac3ace6b23cb3204450a4f9a27.
>
> because that change needed the other changes from xen-64bit.
>
> will retry tomorrow.
ok, i've re-added x86/xen-64bit and it's looking good in testing so far.
Ingo
Ingo Molnar wrote:
> * Ingo Molnar <[email protected]> wrote:
>
>
>> [...]
>
> ok, i've re-added x86/xen-64bit and it's looking good in testing so far.
>
Great. I'm hoping this stuff will be OK for the next merge, so I'm
primed for fast turnaround bugfixes ;)
Also, I have the series of followup patches to actually implement 64-bit
Xen which have much less impact on the non-Xen parts of the tree. I'll
probably mail them out later today.
Thanks,
J
On Thu, Jul 3, 2008 at 2:10 AM, Ingo Molnar <[email protected]> wrote:
>
> * Ingo Molnar <[email protected]> wrote:
>
>> [...]
>
> ok, i've re-added x86/xen-64bit and it's looking good in testing so far.
>
got
[ffffe20000000000-ffffe27fffffffff] PGD ->ffff88000128a000 on node 0
[ffffe20000000000-ffffe2003fffffff] PUD ->ffff88000128b000 on node 0
[ffffe20000000000-ffffe200003fffff] PMD ->
[ffff880001400000-ffff8800017fffff] on node 0
[ffffe20000200000-ffffe200005fffff] PMD ->
[ffff880001600000-ffff8800019fffff] on node 0
[ffffe20000400000-ffffe200007fffff] PMD ->
[ffff880001800000-ffff880001bfffff] on node 0
[ffffe20000600000-ffffe200009fffff] PMD ->
[ffff880001a00000-ffff880001dfffff] on node 0
[ffffe20000800000-ffffe20000bfffff] PMD ->
[ffff880001c00000-ffff880001ffffff] on node 0
[ffffe20000a00000-ffffe20000dfffff] PMD ->
[ffff880001e00000-ffff8800021fffff] on node 0
[ffffe20000c00000-ffffe20000ffffff] PMD ->
[ffff880002000000-ffff8800023fffff] on node 0
[ffffe20000e00000-ffffe200011fffff] PMD ->
[ffff880002200000-ffff8800025fffff] on node 0
[ffffe20001000000-ffffe200013fffff] PMD ->
[ffff880002400000-ffff8800027fffff] on node 0
[ffffe20001200000-ffffe200015fffff] PMD ->
[ffff880002600000-ffff8800029fffff] on node 0
[ffffe20001400000-ffffe200017fffff] PMD ->
[ffff880002800000-ffff880002bfffff] on node 0
[ffffe20001600000-ffffe200019fffff] PMD ->
[ffff880002a00000-ffff880002dfffff] on node 0
[ffffe20001800000-ffffe20001bfffff] PMD ->
[ffff880002c00000-ffff880002ffffff] on node 0
[ffffe20001a00000-ffffe20001dfffff] PMD ->
[ffff880002e00000-ffff8800031fffff] on node 0
[ffffe20001c00000-ffffe20001ffffff] PMD ->
[ffff880003000000-ffff8800033fffff] on node 0
[ffffe20001e00000-ffffe200021fffff] PMD ->
[ffff880003200000-ffff8800035fffff] on node 0
[ffffe20002000000-ffffe200023fffff] PMD ->
[ffff880003400000-ffff8800037fffff] on node 0
[ffffe20002200000-ffffe200025fffff] PMD ->
[ffff880003600000-ffff8800039fffff] on node 0
[ffffe20002400000-ffffe200027fffff] PMD ->
[ffff880003800000-ffff880003bfffff] on node 0
[ffffe20002600000-ffffe200029fffff] PMD ->
[ffff880003a00000-ffff880003dfffff] on node 0
[ffffe20002800000-ffffe20002bfffff] PMD ->
[ffff880003c00000-ffff880003ffffff] on node 0
[ffffe20002a00000-ffffe20002dfffff] PMD ->
[ffff880003e00000-ffff8800041fffff] on node 0
[ffffe20002c00000-ffffe20002ffffff] PMD ->
[ffff880004000000-ffff8800043fffff] on node 0
[ffffe20002e00000-ffffe200039fffff] PMD ->
[ffff880004200000-ffff8800045fffff] on node 0
[ffffe20003800000-ffffe20003bfffff] PMD ->
[ffff880004400000-ffff8800047fffff] on node 0
[ffffe20003a00000-ffffe20003dfffff] PMD ->
[ffff880004600000-ffff8800049fffff] on node 0
[ffffe20003c00000-ffffe20003ffffff] PMD ->
[ffff880004800000-ffff880004bfffff] on node 0
[ffffe20003e00000-ffffe200041fffff] PMD ->
[ffff880004a00000-ffff880004dfffff] on node 0
[ffffe20004000000-ffffe200043fffff] PMD ->
[ffff880004c00000-ffff880004ffffff] on node 0
[ffffe20004200000-ffffe200045fffff] PMD ->
[ffff880004e00000-ffff8800051fffff] on node 0
[ffffe20004400000-ffffe200047fffff] PMD ->
[ffff880005000000-ffff8800053fffff] on node 0
[ffffe20004600000-ffffe200049fffff] PMD ->
[ffff880005200000-ffff8800055fffff] on node 0
[ffffe20004800000-ffffe20004bfffff] PMD ->
[ffff880005400000-ffff8800057fffff] on node 0
[ffffe20004a00000-ffffe20004dfffff] PMD ->
[ffff880005600000-ffff8800059fffff] on node 0
[ffffe20004c00000-ffffe20004ffffff] PMD ->
[ffff880005800000-ffff880005bfffff] on node 0
[ffffe20004e00000-ffffe200051fffff] PMD ->
[ffff880005a00000-ffff880005dfffff] on node 0
[ffffe20005000000-ffffe200053fffff] PMD ->
[ffff880005c00000-ffff880005ffffff] on node 0
[ffffe20005200000-ffffe200055fffff] PMD ->
[ffff880005e00000-ffff8800061fffff] on node 0
[ffffe20005400000-ffffe200057fffff] PMD ->
[ffff880006000000-ffff8800063fffff] on node 0
[ffffe20005600000-ffffe200059fffff] PMD ->
[ffff880006200000-ffff8800065fffff] on node 0
[ffffe20005800000-ffffe20005bfffff] PMD ->
[ffff880006400000-ffff8800067fffff] on node 0
[ffffe20005a00000-ffffe20005dfffff] PMD ->
[ffff880006600000-ffff8800069fffff] on node 0
[ffffe20005c00000-ffffe20005ffffff] PMD ->
[ffff880006800000-ffff880006bfffff] on node 0
[ffffe20005e00000-ffffe200061fffff] PMD ->
[ffff880006a00000-ffff880006dfffff] on node 0
[ffffe20006000000-ffffe200063fffff] PMD ->
[ffff880006c00000-ffff880006ffffff] on node 0
[ffffe20006200000-ffffe200065fffff] PMD ->
[ffff880006e00000-ffff8800071fffff] on node 0
[ffffe20006400000-ffffe200067fffff] PMD ->
[ffff880007000000-ffff8800073fffff] on node 0
[ffffe20006600000-ffffe200069fffff] PMD ->
[ffff880007200000-ffff8800075fffff] on node 0
[ffffe20006800000-ffffe20006bfffff] PMD ->
[ffff880007400000-ffff8800077fffff] on node 0
[ffffe20006a00000-ffffe20006dfffff] PMD ->
[ffff880007600000-ffff8800079fffff] on node 0
[ffffe20006c00000-ffffe20006ffffff] PMD ->
[ffff880007800000-ffff880007bfffff] on node 0
[ffffe20006e00000-ffffe200071fffff] PMD ->
[ffff880007a00000-ffff880007dfffff] on node 0
[ffffe20007000000-ffffe200073fffff] PMD ->
[ffff880007c00000-ffff880007ffffff] on node 0
[ffffe20007200000-ffffe200075fffff] PMD ->
[ffff880007e00000-ffff8800081fffff] on node 0
[ffffe20007400000-ffffe200077fffff] PMD ->
[ffff880008000000-ffff8800083fffff] on node 0
[ffffe20007600000-ffffe200079fffff] PMD ->
[ffff880008200000-ffff8800085fffff] on node 0
[ffffe200078c0000-ffffe200079fffff] potential offnode page_structs
[ffffe20007800000-ffffe20007bfffff] PMD ->
[ffff880008400000-ffff8802283fffff] on node 0
[ffffe20007a00000-ffffe20007dfffff] PMD ->
[ffff880228200000-ffff8802285fffff] on node 1
[ffffe20007c00000-ffffe20007ffffff] PMD ->
[ffff880228400000-ffff8802287fffff] on node 1
[ffffe20007e00000-ffffe200081fffff] PMD ->
[ffff880228600000-ffff8802289fffff] on node 1
[ffffe20008000000-ffffe200083fffff] PMD ->
[ffff880228800000-ffff880228bfffff] on node 1
[ffffe20008200000-ffffe200085fffff] PMD ->
[ffff880228a00000-ffff880228dfffff] on node 1
[ffffe20008400000-ffffe200087fffff] PMD ->
[ffff880228c00000-ffff880228ffffff] on node 1
[ffffe20008600000-ffffe200089fffff] PMD ->
[ffff880228e00000-ffff8802291fffff] on node 1
[ffffe20008800000-ffffe20008bfffff] PMD ->
[ffff880229000000-ffff8802293fffff] on node 1
[ffffe20008a00000-ffffe20008dfffff] PMD ->
[ffff880229200000-ffff8802295fffff] on node 1
[ffffe20008c00000-ffffe20008ffffff] PMD ->
[ffff880229400000-ffff8802297fffff] on node 1
[ffffe20008e00000-ffffe200091fffff] PMD ->
[ffff880229600000-ffff8802299fffff] on node 1
[ffffe20009000000-ffffe200093fffff] PMD ->
[ffff880229800000-ffff880229bfffff] on node 1
[ffffe20009200000-ffffe200095fffff] PMD ->
[ffff880229a00000-ffff880229dfffff] on node 1
[ffffe20009400000-ffffe200097fffff] PMD ->
[ffff880229c00000-ffff880229ffffff] on node 1
[ffffe20009600000-ffffe200099fffff] PMD ->
[ffff880229e00000-ffff88022a1fffff] on node 1
[ffffe20009800000-ffffe20009bfffff] PMD ->
[ffff88022a000000-ffff88022a3fffff] on node 1
[ffffe20009a00000-ffffe20009dfffff] PMD ->
[ffff88022a200000-ffff88022a5fffff] on node 1
[ffffe20009c00000-ffffe20009ffffff] PMD ->
[ffff88022a400000-ffff88022a7fffff] on node 1
[ffffe20009e00000-ffffe2000a1fffff] PMD ->
[ffff88022a600000-ffff88022a9fffff] on node 1
[ffffe2000a000000-ffffe2000a3fffff] PMD ->
[ffff88022a800000-ffff88022abfffff] on node 1
[ffffe2000a200000-ffffe2000a5fffff] PMD ->
[ffff88022aa00000-ffff88022adfffff] on node 1
[ffffe2000a400000-ffffe2000a7fffff] PMD ->
[ffff88022ac00000-ffff88022affffff] on node 1
[ffffe2000a600000-ffffe2000a9fffff] PMD ->
[ffff88022ae00000-ffff88022b1fffff] on node 1
[ffffe2000a800000-ffffe2000abfffff] PMD ->
[ffff88022b000000-ffff88022b3fffff] on node 1
[ffffe2000aa00000-ffffe2000adfffff] PMD ->
[ffff88022b200000-ffff88022b5fffff] on node 1
[ffffe2000ac00000-ffffe2000affffff] PMD ->
[ffff88022b400000-ffff88022b7fffff] on node 1
[ffffe2000ae00000-ffffe2000b1fffff] PMD ->
[ffff88022b600000-ffff88022b9fffff] on node 1
[ffffe2000b000000-ffffe2000b3fffff] PMD ->
[ffff88022b800000-ffff88022bbfffff] on node 1
[ffffe2000b200000-ffffe2000b5fffff] PMD ->
[ffff88022ba00000-ffff88022bdfffff] on node 1
[ffffe2000b400000-ffffe2000b7fffff] PMD ->
[ffff88022bc00000-ffff88022bffffff] on node 1
[ffffe2000b600000-ffffe2000b9fffff] PMD ->
[ffff88022be00000-ffff88022c1fffff] on node 1
[ffffe2000b800000-ffffe2000bbfffff] PMD ->
[ffff88022c000000-ffff88022c3fffff] on node 1
[ffffe2000ba00000-ffffe2000bdfffff] PMD ->
[ffff88022c200000-ffff88022c5fffff] on node 1
[ffffe2000bc00000-ffffe2000bffffff] PMD ->
[ffff88022c400000-ffff88022c7fffff] on node 1
[ffffe2000be00000-ffffe2000c1fffff] PMD ->
[ffff88022c600000-ffff88022c9fffff] on node 1
[ffffe2000c000000-ffffe2000c3fffff] PMD ->
[ffff88022c800000-ffff88022cbfffff] on node 1
[ffffe2000c200000-ffffe2000c5fffff] PMD ->
[ffff88022ca00000-ffff88022cdfffff] on node 1
[ffffe2000c400000-ffffe2000c7fffff] PMD ->
[ffff88022cc00000-ffff88022cffffff] on node 1
[ffffe2000c600000-ffffe2000c9fffff] PMD ->
[ffff88022ce00000-ffff88022d1fffff] on node 1
[ffffe2000c800000-ffffe2000cbfffff] PMD ->
[ffff88022d000000-ffff88022d3fffff] on node 1
[ffffe2000ca00000-ffffe2000cdfffff] PMD ->
[ffff88022d200000-ffff88022d5fffff] on node 1
[ffffe2000cc00000-ffffe2000cffffff] PMD ->
[ffff88022d400000-ffff88022d7fffff] on node 1
[ffffe2000ce00000-ffffe2000d1fffff] PMD ->
[ffff88022d600000-ffff88022d9fffff] on node 1
[ffffe2000d000000-ffffe2000d3fffff] PMD ->
[ffff88022d800000-ffff88022dbfffff] on node 1
[ffffe2000d200000-ffffe2000d5fffff] PMD ->
[ffff88022da00000-ffff88022ddfffff] on node 1
[ffffe2000d400000-ffffe2000d7fffff] PMD ->
[ffff88022dc00000-ffff88022dffffff] on node 1
[ffffe2000d600000-ffffe2000d9fffff] PMD ->
[ffff88022de00000-ffff88022e1fffff] on node 1
[ffffe2000d800000-ffffe2000dbfffff] PMD ->
[ffff88022e000000-ffff88022e3fffff] on node 1
[ffffe2000da00000-ffffe2000ddfffff] PMD ->
[ffff88022e200000-ffff88022e5fffff] on node 1
[ffffe2000dc00000-ffffe2000dffffff] PMD ->
[ffff88022e400000-ffff88022e7fffff] on node 1
[ffffe2000de00000-ffffe2000e1fffff] PMD ->
[ffff88022e600000-ffff88022e9fffff] on node 1
[ffffe2000e000000-ffffe2000e3fffff] PMD ->
[ffff88022e800000-ffff88022ebfffff] on node 1
[ffffe2000e200000-ffffe2000e5fffff] PMD ->
[ffff88022ea00000-ffff88022edfffff] on node 1
[ffffe2000e400000-ffffe2000e7fffff] PMD ->
[ffff88022ec00000-ffff88022effffff] on node 1
[ffffe2000e600000-ffffe2000e9fffff] PMD ->
[ffff88022ee00000-ffff88022f1fffff] on node 1
[ffffe2000e800000-ffffe2000e9fffff] PMD ->
[ffff88022f000000-ffff88022f1fffff] on node 1
should have
[ffffe20000000000-ffffe27fffffffff] PGD ->ffff8100011ce000 on node 0
[ffffe20000000000-ffffe2003fffffff] PUD ->ffff8100011cf000 on node 0
[ffffe200078c0000-ffffe200079fffff] potential offnode page_structs
[ffffe20000000000-ffffe200079fffff] PMD ->
[ffff810001200000-ffff8100083fffff] on node 0
[ffffe20007a00000-ffffe2000e9fffff] PMD ->
[ffff810228200000-ffff81022f1fffff] on node 1
YH
Yinghai Lu wrote:
> On Thu, Jul 3, 2008 at 2:10 AM, Ingo Molnar <[email protected]> wrote:
>
>> [...]
>
> got
> [ffffe20000000000-ffffe27fffffffff] PGD ->ffff88000128a000 on node 0
> [ffffe20000000000-ffffe2003fffffff] PUD ->ffff88000128b000 on node 0
> [...]
>
> should have
>
> [ffffe20000000000-ffffe27fffffffff] PGD ->ffff8100011ce000 on node 0
> [ffffe20000000000-ffffe2003fffffff] PUD ->ffff8100011cf000 on node 0
> [ffffe200078c0000-ffffe200079fffff] potential offnode page_structs
> [ffffe20000000000-ffffe200079fffff] PMD ->
> [ffff810001200000-ffff8100083fffff] on node 0
> [ffffe20007a00000-ffffe2000e9fffff] PMD ->
> [ffff810228200000-ffff81022f1fffff] on node 1
I haven't seen those messages before. Can you explain what they mean?
J
On Thu, Jul 3, 2008 at 11:25 AM, Jeremy Fitzhardinge <[email protected]> wrote:
> Yinghai Lu wrote:
>>
>> On Thu, Jul 3, 2008 at 2:10 AM, Ingo Molnar <[email protected]> wrote:
>>
>>>
>>> * Ingo Molnar <[email protected]> wrote:
>>>
>>>
>>>>
>>>> * Jeremy Fitzhardinge <[email protected]> wrote:
>>>>
>>>>
>>>>>
>>>>> Ingo Molnar wrote:
>>>>>
>>>>>>
>>>>>> Excluding the x86/xen-64bit topic solves the problem.
>>>>>>
>>>>>> It triggered on two 64-bit machines so it seems readily reproducible
>>>>>> with that config.
>>>>>>
>>>>>> i've pushed the failing tree out to tip/tmp.xen-64bit.Tue_Jul__1_10_55
>>>>>>
>>>>>>
>>>>>
>>>>> The patch to fix this is on tip/x86/unify-setup: "x86: setup_arch() &&
>>>>> early_ioremap_init()". Logically that patch should probably be in the
>>>>> xen64 branch, since it's only meaningful with the early_ioremap
>>>>> unification.
>>>>>
>>>>
>>>> ah, indeed - it was missing from tip/master due to:
>>>>
>>>> | commit ac998c259605741efcfbd215533b379970ba1d9f
>>>> | Author: Ingo Molnar <[email protected]>
>>>> | Date: Mon Jun 30 12:01:31 2008 +0200
>>>> |
>>>> | Revert "x86: setup_arch() && early_ioremap_init()"
>>>> |
>>>> | This reverts commit 181b3601a1a7d2ac3ace6b23cb3204450a4f9a27.
>>>>
>>>> because that change needed the other changes from xen-64bit.
>>>>
>>>> will retry tomorrow.
>>>>
>>>
>>> ok, i've re-added x86/xen-64bit and it's looking good in testing so far.
>>>
>>>
>>
>> got
>> [snip: duplicate vmemmap "got" / "should have" mapping dump, quoted in full upthread]
>
>
> I haven't seen those messages before. Can you explain what they mean?
that is for SPARSEMEM virtual memmap...
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
YH
Yinghai Lu wrote:
> On Thu, Jul 3, 2008 at 11:25 AM, Jeremy Fitzhardinge <[email protected]> wrote:
>
>> Yinghai Lu wrote:
>>
>>> On Thu, Jul 3, 2008 at 2:10 AM, Ingo Molnar <[email protected]> wrote:
>>>
>>>
>>>> * Ingo Molnar <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>>> * Jeremy Fitzhardinge <[email protected]> wrote:
>>>>>
>>>>>
>>>>>
>>>>>> Ingo Molnar wrote:
>>>>>>
>>>>>>
>>>>>>> Excluding the x86/xen-64bit topic solves the problem.
>>>>>>>
>>>>>>> It triggered on two 64-bit machines so it seems readily reproducible
>>>>>>> with that config.
>>>>>>>
>>>>>>> i've pushed the failing tree out to tip/tmp.xen-64bit.Tue_Jul__1_10_55
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>> The patch to fix this is on tip/x86/unify-setup: "x86: setup_arch() &&
>>>>>> early_ioremap_init()". Logically that patch should probably be in the
>>>>>> xen64 branch, since it's only meaningful with the early_ioremap
>>>>>> unification.
>>>>>>
>>>>>>
>>>>> ah, indeed - it was missing from tip/master due to:
>>>>>
>>>>> | commit ac998c259605741efcfbd215533b379970ba1d9f
>>>>> | Author: Ingo Molnar <[email protected]>
>>>>> | Date: Mon Jun 30 12:01:31 2008 +0200
>>>>> |
>>>>> | Revert "x86: setup_arch() && early_ioremap_init()"
>>>>> |
>>>>> | This reverts commit 181b3601a1a7d2ac3ace6b23cb3204450a4f9a27.
>>>>>
>>>>> because that change needed the other changes from xen-64bit.
>>>>>
>>>>> will retry tomorrow.
>>>>>
>>>>>
>>>> ok, i've re-added x86/xen-64bit and it's looking good in testing so far.
>>>>
>>>>
>>>>
>>> got
>>> [snip: duplicate vmemmap "got" / "should have" mapping dump, quoted in full upthread]
>>>
>> I haven't seen those messages before. Can you explain what they mean?
>>
>
> that is for SPARSEMEM virtual memmap...
>
> CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
> CONFIG_SPARSEMEM_VMEMMAP=y
>
I modified the vmemmap code so it would create 4k mappings if PSE isn't
supported. Did I get it wrong? It should have no effect when PSE is
available (which is any time you're not running under Xen).
J
On Thu, Jul 3, 2008 at 11:41 AM, Jeremy Fitzhardinge <[email protected]> wrote:
> Yinghai Lu wrote:
>>
>> On Thu, Jul 3, 2008 at 11:25 AM, Jeremy Fitzhardinge <[email protected]>
>> wrote:
>>
>>>
>>> Yinghai Lu wrote:
>>>
>>>>
>>>> On Thu, Jul 3, 2008 at 2:10 AM, Ingo Molnar <[email protected]> wrote:
>>>>
>>>>
>>>>>
>>>>> * Ingo Molnar <[email protected]> wrote:
>>>>>
>>>>>
>>>>>
>>>>>>
>>>>>> * Jeremy Fitzhardinge <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>>>
>>>>>>> Ingo Molnar wrote:
>>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> Excluding the x86/xen-64bit topic solves the problem.
>>>>>>>>
>>>>>>>> It triggered on two 64-bit machines so it seems readily reproducible
>>>>>>>> with that config.
>>>>>>>>
>>>>>>>> i've pushed the failing tree out to
>>>>>>>> tip/tmp.xen-64bit.Tue_Jul__1_10_55
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>> The patch to fix this is on tip/x86/unify-setup: "x86: setup_arch()
>>>>>>> &&
>>>>>>> early_ioremap_init()". Logically that patch should probably be in
>>>>>>> the
>>>>>>> xen64 branch, since it's only meaningful with the early_ioremap
>>>>>>> unification.
>>>>>>>
>>>>>>>
>>>>>>
>>>>>> ah, indeed - it was missing from tip/master due to:
>>>>>>
>>>>>> | commit ac998c259605741efcfbd215533b379970ba1d9f
>>>>>> | Author: Ingo Molnar <[email protected]>
>>>>>> | Date: Mon Jun 30 12:01:31 2008 +0200
>>>>>> |
>>>>>> | Revert "x86: setup_arch() && early_ioremap_init()"
>>>>>> |
>>>>>> | This reverts commit 181b3601a1a7d2ac3ace6b23cb3204450a4f9a27.
>>>>>>
>>>>>> because that change needed the other changes from xen-64bit.
>>>>>>
>>>>>> will retry tomorrow.
>>>>>>
>>>>>>
>>>>>
>>>>> ok, i've re-added x86/xen-64bit and it's looking good in testing so
>>>>> far.
>>>>>
>>>>>
>>>>>
>>>>
>>>> got
>>>> [ffffe20000000000-ffffe27fffffffff] PGD ->ffff88000128a000 on node 0
>>>> [ffffe20000000000-ffffe2003fffffff] PUD ->ffff88000128b000 on node 0
>>>> [ffffe20000000000-ffffe200003fffff] PMD ->
>>>> [ffff880001400000-ffff8800017fffff] on node 0
>>>> [ffffe20000200000-ffffe200005fffff] PMD ->
>>>> [ffff880001600000-ffff8800019fffff] on node 0
>>>> [ffffe20000400000-ffffe200007fffff] PMD ->
>>>> [ffff880001800000-ffff880001bfffff] on node 0
>>>> [ffffe20000600000-ffffe200009fffff] PMD ->
>>>> [ffff880001a00000-ffff880001dfffff] on node 0
>>>> [ffffe20000800000-ffffe20000bfffff] PMD ->
>>>> [ffff880001c00000-ffff880001ffffff] on node 0
>>>> [ffffe20000a00000-ffffe20000dfffff] PMD ->
>>>> [ffff880001e00000-ffff8800021fffff] on node 0
>>>> [ffffe20000c00000-ffffe20000ffffff] PMD ->
>>>> [ffff880002000000-ffff8800023fffff] on node 0
>>>> [ffffe20000e00000-ffffe200011fffff] PMD ->
>>>> [ffff880002200000-ffff8800025fffff] on node 0
>>>> [ffffe20001000000-ffffe200013fffff] PMD -> [ffff880002400000-ffff8800027fffff] on node 0
>>>> [ffffe20001200000-ffffe200015fffff] PMD -> [ffff880002600000-ffff8800029fffff] on node 0
>>>> [ffffe20001400000-ffffe200017fffff] PMD -> [ffff880002800000-ffff880002bfffff] on node 0
>>>> [ffffe20001600000-ffffe200019fffff] PMD -> [ffff880002a00000-ffff880002dfffff] on node 0
>>>> [ffffe20001800000-ffffe20001bfffff] PMD -> [ffff880002c00000-ffff880002ffffff] on node 0
>>>> [ffffe20001a00000-ffffe20001dfffff] PMD -> [ffff880002e00000-ffff8800031fffff] on node 0
>>>> [ffffe20001c00000-ffffe20001ffffff] PMD -> [ffff880003000000-ffff8800033fffff] on node 0
>>>> [ffffe20001e00000-ffffe200021fffff] PMD -> [ffff880003200000-ffff8800035fffff] on node 0
>>>> [ffffe20002000000-ffffe200023fffff] PMD -> [ffff880003400000-ffff8800037fffff] on node 0
>>>> [ffffe20002200000-ffffe200025fffff] PMD -> [ffff880003600000-ffff8800039fffff] on node 0
>>>> [ffffe20002400000-ffffe200027fffff] PMD -> [ffff880003800000-ffff880003bfffff] on node 0
>>>> [ffffe20002600000-ffffe200029fffff] PMD -> [ffff880003a00000-ffff880003dfffff] on node 0
>>>> [ffffe20002800000-ffffe20002bfffff] PMD -> [ffff880003c00000-ffff880003ffffff] on node 0
>>>> [ffffe20002a00000-ffffe20002dfffff] PMD -> [ffff880003e00000-ffff8800041fffff] on node 0
>>>> [ffffe20002c00000-ffffe20002ffffff] PMD -> [ffff880004000000-ffff8800043fffff] on node 0
>>>> [ffffe20002e00000-ffffe200039fffff] PMD -> [ffff880004200000-ffff8800045fffff] on node 0
>>>> [ffffe20003800000-ffffe20003bfffff] PMD -> [ffff880004400000-ffff8800047fffff] on node 0
>>>> [ffffe20003a00000-ffffe20003dfffff] PMD -> [ffff880004600000-ffff8800049fffff] on node 0
>>>> [ffffe20003c00000-ffffe20003ffffff] PMD -> [ffff880004800000-ffff880004bfffff] on node 0
>>>> [ffffe20003e00000-ffffe200041fffff] PMD -> [ffff880004a00000-ffff880004dfffff] on node 0
>>>> [ffffe20004000000-ffffe200043fffff] PMD -> [ffff880004c00000-ffff880004ffffff] on node 0
>>>> [ffffe20004200000-ffffe200045fffff] PMD -> [ffff880004e00000-ffff8800051fffff] on node 0
>>>> [ffffe20004400000-ffffe200047fffff] PMD -> [ffff880005000000-ffff8800053fffff] on node 0
>>>> [ffffe20004600000-ffffe200049fffff] PMD -> [ffff880005200000-ffff8800055fffff] on node 0
>>>> [ffffe20004800000-ffffe20004bfffff] PMD -> [ffff880005400000-ffff8800057fffff] on node 0
>>>> [ffffe20004a00000-ffffe20004dfffff] PMD -> [ffff880005600000-ffff8800059fffff] on node 0
>>>> [ffffe20004c00000-ffffe20004ffffff] PMD -> [ffff880005800000-ffff880005bfffff] on node 0
>>>> [ffffe20004e00000-ffffe200051fffff] PMD -> [ffff880005a00000-ffff880005dfffff] on node 0
>>>> [ffffe20005000000-ffffe200053fffff] PMD -> [ffff880005c00000-ffff880005ffffff] on node 0
>>>> [ffffe20005200000-ffffe200055fffff] PMD -> [ffff880005e00000-ffff8800061fffff] on node 0
>>>> [ffffe20005400000-ffffe200057fffff] PMD -> [ffff880006000000-ffff8800063fffff] on node 0
>>>> [ffffe20005600000-ffffe200059fffff] PMD -> [ffff880006200000-ffff8800065fffff] on node 0
>>>> [ffffe20005800000-ffffe20005bfffff] PMD -> [ffff880006400000-ffff8800067fffff] on node 0
>>>> [ffffe20005a00000-ffffe20005dfffff] PMD -> [ffff880006600000-ffff8800069fffff] on node 0
>>>> [ffffe20005c00000-ffffe20005ffffff] PMD -> [ffff880006800000-ffff880006bfffff] on node 0
>>>> [ffffe20005e00000-ffffe200061fffff] PMD -> [ffff880006a00000-ffff880006dfffff] on node 0
>>>> [ffffe20006000000-ffffe200063fffff] PMD -> [ffff880006c00000-ffff880006ffffff] on node 0
>>>> [ffffe20006200000-ffffe200065fffff] PMD -> [ffff880006e00000-ffff8800071fffff] on node 0
>>>> [ffffe20006400000-ffffe200067fffff] PMD -> [ffff880007000000-ffff8800073fffff] on node 0
>>>> [ffffe20006600000-ffffe200069fffff] PMD -> [ffff880007200000-ffff8800075fffff] on node 0
>>>> [ffffe20006800000-ffffe20006bfffff] PMD -> [ffff880007400000-ffff8800077fffff] on node 0
>>>> [ffffe20006a00000-ffffe20006dfffff] PMD -> [ffff880007600000-ffff8800079fffff] on node 0
>>>> [ffffe20006c00000-ffffe20006ffffff] PMD -> [ffff880007800000-ffff880007bfffff] on node 0
>>>> [ffffe20006e00000-ffffe200071fffff] PMD -> [ffff880007a00000-ffff880007dfffff] on node 0
>>>> [ffffe20007000000-ffffe200073fffff] PMD -> [ffff880007c00000-ffff880007ffffff] on node 0
>>>> [ffffe20007200000-ffffe200075fffff] PMD -> [ffff880007e00000-ffff8800081fffff] on node 0
>>>> [ffffe20007400000-ffffe200077fffff] PMD -> [ffff880008000000-ffff8800083fffff] on node 0
>>>> [ffffe20007600000-ffffe200079fffff] PMD -> [ffff880008200000-ffff8800085fffff] on node 0
>>>> [ffffe200078c0000-ffffe200079fffff] potential offnode page_structs
>>>> [ffffe20007800000-ffffe20007bfffff] PMD -> [ffff880008400000-ffff8802283fffff] on node 0
>>>> [ffffe20007a00000-ffffe20007dfffff] PMD -> [ffff880228200000-ffff8802285fffff] on node 1
>>>> [ffffe20007c00000-ffffe20007ffffff] PMD -> [ffff880228400000-ffff8802287fffff] on node 1
>>>> [ffffe20007e00000-ffffe200081fffff] PMD -> [ffff880228600000-ffff8802289fffff] on node 1
>>>> [ffffe20008000000-ffffe200083fffff] PMD -> [ffff880228800000-ffff880228bfffff] on node 1
>>>> [ffffe20008200000-ffffe200085fffff] PMD -> [ffff880228a00000-ffff880228dfffff] on node 1
>>>> [ffffe20008400000-ffffe200087fffff] PMD -> [ffff880228c00000-ffff880228ffffff] on node 1
>>>> [ffffe20008600000-ffffe200089fffff] PMD -> [ffff880228e00000-ffff8802291fffff] on node 1
>>>> [ffffe20008800000-ffffe20008bfffff] PMD -> [ffff880229000000-ffff8802293fffff] on node 1
>>>> [ffffe20008a00000-ffffe20008dfffff] PMD -> [ffff880229200000-ffff8802295fffff] on node 1
>>>> [ffffe20008c00000-ffffe20008ffffff] PMD -> [ffff880229400000-ffff8802297fffff] on node 1
>>>> [ffffe20008e00000-ffffe200091fffff] PMD -> [ffff880229600000-ffff8802299fffff] on node 1
>>>> [ffffe20009000000-ffffe200093fffff] PMD -> [ffff880229800000-ffff880229bfffff] on node 1
>>>> [ffffe20009200000-ffffe200095fffff] PMD -> [ffff880229a00000-ffff880229dfffff] on node 1
>>>> [ffffe20009400000-ffffe200097fffff] PMD -> [ffff880229c00000-ffff880229ffffff] on node 1
>>>> [ffffe20009600000-ffffe200099fffff] PMD -> [ffff880229e00000-ffff88022a1fffff] on node 1
>>>> [ffffe20009800000-ffffe20009bfffff] PMD -> [ffff88022a000000-ffff88022a3fffff] on node 1
>>>> [ffffe20009a00000-ffffe20009dfffff] PMD -> [ffff88022a200000-ffff88022a5fffff] on node 1
>>>> [ffffe20009c00000-ffffe20009ffffff] PMD -> [ffff88022a400000-ffff88022a7fffff] on node 1
>>>> [ffffe20009e00000-ffffe2000a1fffff] PMD -> [ffff88022a600000-ffff88022a9fffff] on node 1
>>>> [ffffe2000a000000-ffffe2000a3fffff] PMD -> [ffff88022a800000-ffff88022abfffff] on node 1
>>>> [ffffe2000a200000-ffffe2000a5fffff] PMD -> [ffff88022aa00000-ffff88022adfffff] on node 1
>>>> [ffffe2000a400000-ffffe2000a7fffff] PMD -> [ffff88022ac00000-ffff88022affffff] on node 1
>>>> [ffffe2000a600000-ffffe2000a9fffff] PMD -> [ffff88022ae00000-ffff88022b1fffff] on node 1
>>>> [ffffe2000a800000-ffffe2000abfffff] PMD -> [ffff88022b000000-ffff88022b3fffff] on node 1
>>>> [ffffe2000aa00000-ffffe2000adfffff] PMD -> [ffff88022b200000-ffff88022b5fffff] on node 1
>>>> [ffffe2000ac00000-ffffe2000affffff] PMD -> [ffff88022b400000-ffff88022b7fffff] on node 1
>>>> [ffffe2000ae00000-ffffe2000b1fffff] PMD -> [ffff88022b600000-ffff88022b9fffff] on node 1
>>>> [ffffe2000b000000-ffffe2000b3fffff] PMD -> [ffff88022b800000-ffff88022bbfffff] on node 1
>>>> [ffffe2000b200000-ffffe2000b5fffff] PMD -> [ffff88022ba00000-ffff88022bdfffff] on node 1
>>>> [ffffe2000b400000-ffffe2000b7fffff] PMD -> [ffff88022bc00000-ffff88022bffffff] on node 1
>>>> [ffffe2000b600000-ffffe2000b9fffff] PMD -> [ffff88022be00000-ffff88022c1fffff] on node 1
>>>> [ffffe2000b800000-ffffe2000bbfffff] PMD -> [ffff88022c000000-ffff88022c3fffff] on node 1
>>>> [ffffe2000ba00000-ffffe2000bdfffff] PMD -> [ffff88022c200000-ffff88022c5fffff] on node 1
>>>> [ffffe2000bc00000-ffffe2000bffffff] PMD -> [ffff88022c400000-ffff88022c7fffff] on node 1
>>>> [ffffe2000be00000-ffffe2000c1fffff] PMD -> [ffff88022c600000-ffff88022c9fffff] on node 1
>>>> [ffffe2000c000000-ffffe2000c3fffff] PMD -> [ffff88022c800000-ffff88022cbfffff] on node 1
>>>> [ffffe2000c200000-ffffe2000c5fffff] PMD -> [ffff88022ca00000-ffff88022cdfffff] on node 1
>>>> [ffffe2000c400000-ffffe2000c7fffff] PMD -> [ffff88022cc00000-ffff88022cffffff] on node 1
>>>> [ffffe2000c600000-ffffe2000c9fffff] PMD -> [ffff88022ce00000-ffff88022d1fffff] on node 1
>>>> [ffffe2000c800000-ffffe2000cbfffff] PMD -> [ffff88022d000000-ffff88022d3fffff] on node 1
>>>> [ffffe2000ca00000-ffffe2000cdfffff] PMD -> [ffff88022d200000-ffff88022d5fffff] on node 1
>>>> [ffffe2000cc00000-ffffe2000cffffff] PMD -> [ffff88022d400000-ffff88022d7fffff] on node 1
>>>> [ffffe2000ce00000-ffffe2000d1fffff] PMD -> [ffff88022d600000-ffff88022d9fffff] on node 1
>>>> [ffffe2000d000000-ffffe2000d3fffff] PMD -> [ffff88022d800000-ffff88022dbfffff] on node 1
>>>> [ffffe2000d200000-ffffe2000d5fffff] PMD -> [ffff88022da00000-ffff88022ddfffff] on node 1
>>>> [ffffe2000d400000-ffffe2000d7fffff] PMD -> [ffff88022dc00000-ffff88022dffffff] on node 1
>>>> [ffffe2000d600000-ffffe2000d9fffff] PMD -> [ffff88022de00000-ffff88022e1fffff] on node 1
>>>> [ffffe2000d800000-ffffe2000dbfffff] PMD -> [ffff88022e000000-ffff88022e3fffff] on node 1
>>>> [ffffe2000da00000-ffffe2000ddfffff] PMD -> [ffff88022e200000-ffff88022e5fffff] on node 1
>>>> [ffffe2000dc00000-ffffe2000dffffff] PMD -> [ffff88022e400000-ffff88022e7fffff] on node 1
>>>> [ffffe2000de00000-ffffe2000e1fffff] PMD -> [ffff88022e600000-ffff88022e9fffff] on node 1
>>>> [ffffe2000e000000-ffffe2000e3fffff] PMD -> [ffff88022e800000-ffff88022ebfffff] on node 1
>>>> [ffffe2000e200000-ffffe2000e5fffff] PMD -> [ffff88022ea00000-ffff88022edfffff] on node 1
>>>> [ffffe2000e400000-ffffe2000e7fffff] PMD -> [ffff88022ec00000-ffff88022effffff] on node 1
>>>> [ffffe2000e600000-ffffe2000e9fffff] PMD -> [ffff88022ee00000-ffff88022f1fffff] on node 1
>>>> [ffffe2000e800000-ffffe2000e9fffff] PMD -> [ffff88022f000000-ffff88022f1fffff] on node 1
>>>>
>>>> should have
>>>>
>>>> [ffffe20000000000-ffffe27fffffffff] PGD ->ffff8100011ce000 on node 0
>>>> [ffffe20000000000-ffffe2003fffffff] PUD ->ffff8100011cf000 on node 0
>>>> [ffffe200078c0000-ffffe200079fffff] potential offnode page_structs
>>>> [ffffe20000000000-ffffe200079fffff] PMD -> [ffff810001200000-ffff8100083fffff] on node 0
>>>> [ffffe20007a00000-ffffe2000e9fffff] PMD -> [ffff810228200000-ffff81022f1fffff] on node 1
>>>>
>>>
>>> I haven't seen those messages before. Can you explain what they mean?
>>>
>>
>> That output comes from the SPARSEMEM virtual memmap (vmemmap) code...
>>
>> CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
>> CONFIG_SPARSEMEM_VMEMMAP=y
>>
>
> I modified the vmemmap code so it would create 4k mappings if PSE isn't
> supported. Did I get it wrong? It should have no effect when PSE is
> available (which is any time you're not running under Xen).
>
it could be that the address-contiguity check used for the printout in
vmemmap_populate has some problem...
YH
On Thu, Jul 3, 2008 at 11:51 AM, Yinghai Lu <[email protected]> wrote:
> On Thu, Jul 3, 2008 at 11:41 AM, Jeremy Fitzhardinge <[email protected]> wrote:
..
>> I modified the vmemmap code so it would create 4k mappings if PSE isn't
>> supported. Did I get it wrong? It should have no effect when PSE is
>> available (which is any time you're not running under Xen).
>>
>
> it could be that the address-contiguity check used for the printout in
> vmemmap_populate has some problem...
you moved p_end = p + PMD_SIZE to before the check:
if (p_end != p || node_start != node) {
YH
On Thu, Jul 3, 2008 at 12:19 PM, Yinghai Lu <[email protected]> wrote:
> On Thu, Jul 3, 2008 at 11:51 AM, Yinghai Lu <[email protected]> wrote:
>> On Thu, Jul 3, 2008 at 11:41 AM, Jeremy Fitzhardinge <[email protected]> wrote:
> ..
>>> I modified the vmemmap code so it would create 4k mappings if PSE isn't
>>> supported. Did I get it wrong? It should have no effect when PSE is
>>> available (which is any time you're not running under Xen).
>>>
>>
>> it could be that the address-contiguity check used for the printout in
>> vmemmap_populate has some problem...
>
> you moved p_end = p + PMD_SIZE to before the check:
>
> if (p_end != p || node_start != node) {
Ingo,
please apply the attached patch after Jeremy's Xen pv64 patches.
YH
* Yinghai Lu <[email protected]> wrote:
> >> it could be that the address-contiguity check used for the printout in
> >> vmemmap_populate has some problem...
> >
> > you moved p_end = p + PMD_SIZE to before the check:
> >
> > if (p_end != p || node_start != node) {
>
> Ingo,
>
> please apply the attached patch after Jeremy's Xen pv64 patches.
>
> YH
> [PATCH] x86: fix vmemmap printout check
applied, thanks.
Ingo