2008-01-25 21:52:34

by Jeremy Fitzhardinge

Subject: [PATCH 00 of 11] x86: separate pmd lifetime from pgd

Hi Ingo,

This series addresses various cleanups in pagetable allocation in the
direction of unifying 32/64 bits (that's still a while off yet).

The significant change in here is that I'm separating the lifetime of
a pmd from its pgd in the 32-bit PAE case. This makes it logically
the same as 64-bit pagetable allocation, and overall it simplifies the
code.

The patches are:
- A pure Xen fix I tacked on for convenience
- Use the same pgd_list mechanism for 32 and 64 bits
- Add an mm parameter for paravirt_alloc_pd, for consistency
- Some fixes to early_ioremap to make sure the right paravirt
hooks are called appropriately
- de-macro asm-x86/pgalloc_32.h
- make mm/pgtable_32.c:pgd_ctor a single function
- dynamically allocate pmds rather than always allocating
them with the pgd
- Add Xen bits for dealing with pmd allocation
- Preallocate pmds to avoid excessive tlb flushes
- Allocate and initialize kernel pmds when they're not shared
- Avoid excessive tlb flushes when pulling down pmds.

I've done a number of randconfig test builds to shake out various
configurations on 32 and 64 bits.

One caveat: in order to demacro pgalloc_32.h, I had to rearrange some
headers in asm-generic/tlb.h, as it was including asm/pgalloc.h for no
good reason. As a result, any other file which was expecting to
implicitly pick up asm/pgalloc.h when including an asm/tlb.h header may
get header file problems. I have not done any cross builds to try and
track down any non-x86 fallout from this.

Thanks,
J


2008-01-25 21:42:19

by H. Peter Anvin

Subject: Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()

Jeremy Fitzhardinge wrote:
> PAE mode requires that we reload cr3 in order to guarantee that
> changes to the pgd will be noticed by the processor. This means that
> in principle pud_clear needs to reload cr3 every time. However,
> because reloading cr3 implies a tlb flush, we want to avoid it where
> possible.

It only matters (for a processor which supports PGE) if we actually use
the namespace in between.

> In huge_pmd_unshare, it is followed by flush_tlb_range, which always
> results in a full cr3-reload tlb flush.

This one makes me nervous, as it feels like a side effect of
implementation, and not a guarantee by design.

-hpa

2008-01-25 21:47:29

by Jeremy Fitzhardinge

Subject: [PATCH 04 of 11] x86: fix early_ioremap pagetable ops

Put appropriate pagetable update hooks in so that paravirt knows
what's going on in there.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/mm/ioremap.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -18,6 +18,7 @@
#include <asm/fixmap.h>
#include <asm/pgtable.h>
#include <asm/tlbflush.h>
+#include <asm/pgalloc.h>

#ifdef CONFIG_X86_64

@@ -265,7 +266,7 @@ void __init early_ioremap_init(void)

pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
memset(bm_pte, 0, sizeof(bm_pte));
- set_pmd(pmd, __pmd(__pa(bm_pte) | _PAGE_TABLE));
+ pmd_populate_kernel(&init_mm, pmd, bm_pte);

/*
* The boot-ioremap range spans multiple pmds, for which
@@ -295,6 +296,7 @@ void __init early_ioremap_clear(void)

pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
pmd_clear(pmd);
+ paravirt_release_pt(__pa(bm_pte) >> PAGE_SHIFT);
__flush_tlb_all();
}


2008-01-25 21:47:43

by Jeremy Fitzhardinge

Subject: [PATCH 06 of 11] x86: unify PAE/non-PAE pgd_ctor

The PAE and non-PAE versions of pgd_ctor are more or less identical,
and can be merged into a single function.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: William Irwin <[email protected]>

---
arch/x86/mm/pgtable_32.c | 43 +++++++++++--------------------------------
1 file changed, 11 insertions(+), 32 deletions(-)

diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -224,50 +224,32 @@ static inline void pgd_list_del(pgd_t *p
list_del(&page->lru);
}

+#define UNSHARED_PTRS_PER_PGD \
+ (SHARED_KERNEL_PMD ? USER_PTRS_PER_PGD : PTRS_PER_PGD)

-
-#if (PTRS_PER_PMD == 1)
-/* Non-PAE pgd constructor */
static void pgd_ctor(void *pgd)
{
unsigned long flags;

- /* !PAE, no pagetable sharing */
+ /* Clear usermode parts of PGD */
memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t));

spin_lock_irqsave(&pgd_lock, flags);

- /* must happen under lock */
- clone_pgd_range((pgd_t *)pgd + USER_PTRS_PER_PGD,
- swapper_pg_dir + USER_PTRS_PER_PGD,
- KERNEL_PGD_PTRS);
- paravirt_alloc_pd_clone(__pa(pgd) >> PAGE_SHIFT,
- __pa(swapper_pg_dir) >> PAGE_SHIFT,
- USER_PTRS_PER_PGD,
- KERNEL_PGD_PTRS);
- pgd_list_add(pgd);
- spin_unlock_irqrestore(&pgd_lock, flags);
-}
-#else /* PTRS_PER_PMD > 1 */
-/* PAE pgd constructor */
-static void pgd_ctor(void *pgd)
-{
- /* PAE, kernel PMD may be shared */
-
if (SHARED_KERNEL_PMD) {
+ /* must happen under lock */
clone_pgd_range((pgd_t *)pgd + USER_PTRS_PER_PGD,
swapper_pg_dir + USER_PTRS_PER_PGD,
KERNEL_PGD_PTRS);
- } else {
- unsigned long flags;
+ paravirt_alloc_pd_clone(__pa(pgd) >> PAGE_SHIFT,
+ __pa(swapper_pg_dir) >> PAGE_SHIFT,
+ USER_PTRS_PER_PGD,
+ KERNEL_PGD_PTRS);
+ } else
+ pgd_list_add(pgd);

- memset(pgd, 0, USER_PTRS_PER_PGD*sizeof(pgd_t));
- spin_lock_irqsave(&pgd_lock, flags);
- pgd_list_add(pgd);
- spin_unlock_irqrestore(&pgd_lock, flags);
- }
+ spin_unlock_irqrestore(&pgd_lock, flags);
}
-#endif /* PTRS_PER_PMD */

static void pgd_dtor(void *pgd)
{
@@ -281,9 +263,6 @@ static void pgd_dtor(void *pgd)
pgd_list_del(pgd);
spin_unlock_irqrestore(&pgd_lock, flags);
}
-
-#define UNSHARED_PTRS_PER_PGD \
- (SHARED_KERNEL_PMD ? USER_PTRS_PER_PGD : PTRS_PER_PGD)

/* If we allocate a pmd for part of the kernel address space, then
make sure its initialized with the appropriate kernel mappings.

2008-01-25 21:52:55

by Jeremy Fitzhardinge

Subject: [PATCH 01 of 11] xen: fix mismerge in masking pte flags

Looks like a mismerge/misapply dropped one of the cases of pte flag
masking for Xen. Also, only mask the flags for present ptes.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/xen/mmu.c | 10 ++++++----
1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/x86/xen/mmu.c b/arch/x86/xen/mmu.c
--- a/arch/x86/xen/mmu.c
+++ b/arch/x86/xen/mmu.c
@@ -244,8 +244,10 @@ unsigned long long xen_pgd_val(pgd_t pgd

pte_t xen_make_pte(unsigned long long pte)
{
- if (pte & 1)
+ if (pte & _PAGE_PRESENT) {
pte = phys_to_machine(XPADDR(pte)).maddr;
+ pte &= ~(_PAGE_PCD | _PAGE_PWT);
+ }

return (pte_t){ .pte = pte };
}
@@ -291,10 +293,10 @@ unsigned long xen_pgd_val(pgd_t pgd)

pte_t xen_make_pte(unsigned long pte)
{
- if (pte & _PAGE_PRESENT)
+ if (pte & _PAGE_PRESENT) {
pte = phys_to_machine(XPADDR(pte)).maddr;
-
- pte &= ~(_PAGE_PCD | _PAGE_PWT);
+ pte &= ~(_PAGE_PCD | _PAGE_PWT);
+ }

return (pte_t){ pte };
}

2008-01-25 21:53:22

by Jeremy Fitzhardinge

Subject: [PATCH 03 of 11] x86: add mm parameter to paravirt_alloc_pd

Add mm to paravirt_alloc_pd, partly to make it consistent with
paravirt_alloc_pt, and because later changes will make use of it.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/kernel/vmi_32.c | 2 +-
arch/x86/mm/init_32.c | 4 ++--
arch/x86/mm/pgtable_32.c | 4 +++-
include/asm-x86/paravirt.h | 6 +++---
include/asm-x86/pgalloc_32.h | 3 +--
5 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/arch/x86/kernel/vmi_32.c b/arch/x86/kernel/vmi_32.c
--- a/arch/x86/kernel/vmi_32.c
+++ b/arch/x86/kernel/vmi_32.c
@@ -398,7 +398,7 @@ static void vmi_allocate_pt(struct mm_st
vmi_ops.allocate_page(pfn, VMI_PAGE_L1, 0, 0, 0);
}

-static void vmi_allocate_pd(u32 pfn)
+static void vmi_allocate_pd(struct mm_struct *mm, u32 pfn)
{
/*
* This call comes in very early, before mem_map is setup.
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -67,7 +67,7 @@ static pmd_t * __init one_md_table_init(
if (!(pgd_val(*pgd) & _PAGE_PRESENT)) {
pmd_table = (pmd_t *) alloc_bootmem_low_pages(PAGE_SIZE);

- paravirt_alloc_pd(__pa(pmd_table) >> PAGE_SHIFT);
+ paravirt_alloc_pd(&init_mm, __pa(pmd_table) >> PAGE_SHIFT);
set_pgd(pgd, __pgd(__pa(pmd_table) | _PAGE_PRESENT));
pud = pud_offset(pgd, 0);
BUG_ON(pmd_table != pmd_offset(pud, 0));
@@ -378,7 +378,7 @@ void __init native_pagetable_setup_start

pte_clear(NULL, va, pte);
}
- paravirt_alloc_pd(__pa(swapper_pg_dir) >> PAGE_SHIFT);
+ paravirt_alloc_pd(&init_mm, __pa(base) >> PAGE_SHIFT);
}

void __init native_pagetable_setup_done(pgd_t *base)
diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -321,13 +321,15 @@ pgd_t *pgd_alloc(struct mm_struct *mm)
if (PTRS_PER_PMD == 1 || !pgd)
return pgd;

+ mm->pgd = pgd; /* so that alloc_pd can use it */
+
for (i = 0; i < UNSHARED_PTRS_PER_PGD; ++i) {
pmd_t *pmd = pmd_cache_alloc(i);

if (!pmd)
goto out_oom;

- paravirt_alloc_pd(__pa(pmd) >> PAGE_SHIFT);
+ paravirt_alloc_pd(mm, __pa(pmd) >> PAGE_SHIFT);
set_pgd(&pgd[i], __pgd(1 + __pa(pmd)));
}
return pgd;
diff --git a/include/asm-x86/paravirt.h b/include/asm-x86/paravirt.h
--- a/include/asm-x86/paravirt.h
+++ b/include/asm-x86/paravirt.h
@@ -221,7 +221,7 @@ struct pv_mmu_ops {

/* Hooks for allocating/releasing pagetable pages */
void (*alloc_pt)(struct mm_struct *mm, u32 pfn);
- void (*alloc_pd)(u32 pfn);
+ void (*alloc_pd)(struct mm_struct *mm, u32 pfn);
void (*alloc_pd_clone)(u32 pfn, u32 clonepfn, u32 start, u32 count);
void (*release_pt)(u32 pfn);
void (*release_pd)(u32 pfn);
@@ -903,9 +903,9 @@ static inline void paravirt_release_pt(u
PVOP_VCALL1(pv_mmu_ops.release_pt, pfn);
}

-static inline void paravirt_alloc_pd(unsigned pfn)
+static inline void paravirt_alloc_pd(struct mm_struct *mm, unsigned pfn)
{
- PVOP_VCALL1(pv_mmu_ops.alloc_pd, pfn);
+ PVOP_VCALL2(pv_mmu_ops.alloc_pd, mm, pfn);
}

static inline void paravirt_alloc_pd_clone(unsigned pfn, unsigned clonepfn,
diff --git a/include/asm-x86/pgalloc_32.h b/include/asm-x86/pgalloc_32.h
--- a/include/asm-x86/pgalloc_32.h
+++ b/include/asm-x86/pgalloc_32.h
@@ -8,8 +8,7 @@
#include <asm/paravirt.h>
#else
#define paravirt_alloc_pt(mm, pfn) do { } while (0)
-#define paravirt_alloc_pd(pfn) do { } while (0)
-#define paravirt_alloc_pd(pfn) do { } while (0)
+#define paravirt_alloc_pd(mm, pfn) do { } while (0)
#define paravirt_alloc_pd_clone(pfn, clonepfn, start, count) do { } while (0)
#define paravirt_release_pt(pfn) do { } while (0)
#define paravirt_release_pd(pfn) do { } while (0)

2008-01-25 21:53:44

by Jeremy Fitzhardinge

Subject: [PATCH 02 of 11] x86: use the same pgd_list for PAE and 64-bit

Use a standard list threaded through page->lru for maintaining the pgd
list on PAE. This is the same as 64-bit, and seems saner than using a
non-standard list via page->index.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>

---
arch/x86/mm/fault.c | 10 +++-------
arch/x86/mm/pageattr.c | 2 +-
arch/x86/mm/pgtable_32.c | 19 +++++--------------
include/asm-x86/pgtable.h | 2 ++
include/asm-x86/pgtable_32.h | 2 --
include/asm-x86/pgtable_64.h | 3 ---
6 files changed, 11 insertions(+), 27 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -922,10 +922,8 @@ do_sigbus:
force_sig_info_fault(SIGBUS, BUS_ADRERR, address, tsk);
}

-#ifdef CONFIG_X86_64
DEFINE_SPINLOCK(pgd_lock);
LIST_HEAD(pgd_list);
-#endif

void vmalloc_sync_all(void)
{
@@ -950,13 +948,11 @@ void vmalloc_sync_all(void)
struct page *page;

spin_lock_irqsave(&pgd_lock, flags);
- for (page = pgd_list; page; page =
- (struct page *)page->index)
+ list_for_each_entry(page, &pgd_list, lru) {
if (!vmalloc_sync_one(page_address(page),
- address)) {
- BUG_ON(page != pgd_list);
+ address))
break;
- }
+ }
spin_unlock_irqrestore(&pgd_lock, flags);
if (!page)
set_bit(pgd_index(address), insync);
diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
--- a/arch/x86/mm/pageattr.c
+++ b/arch/x86/mm/pageattr.c
@@ -100,7 +100,7 @@ static void __set_pmd_pte(pte_t *kpte, u
if (!SHARED_KERNEL_PMD) {
struct page *page;

- for (page = pgd_list; page; page = (struct page *)page->index) {
+ list_for_each_entry(page, &pgd_list, lru) {
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -210,27 +210,18 @@ void pmd_ctor(struct kmem_cache *cache,
* vmalloc faults work because attached pagetables are never freed.
* -- wli
*/
-DEFINE_SPINLOCK(pgd_lock);
-struct page *pgd_list;
-
static inline void pgd_list_add(pgd_t *pgd)
{
struct page *page = virt_to_page(pgd);
- page->index = (unsigned long)pgd_list;
- if (pgd_list)
- set_page_private(pgd_list, (unsigned long)&page->index);
- pgd_list = page;
- set_page_private(page, (unsigned long)&pgd_list);
+
+ list_add(&page->lru, &pgd_list);
}

static inline void pgd_list_del(pgd_t *pgd)
{
- struct page *next, **pprev, *page = virt_to_page(pgd);
- next = (struct page *)page->index;
- pprev = (struct page **)page_private(page);
- *pprev = next;
- if (next)
- set_page_private(next, (unsigned long)pprev);
+ struct page *page = virt_to_page(pgd);
+
+ list_del(&page->lru);
}


diff --git a/include/asm-x86/pgtable.h b/include/asm-x86/pgtable.h
--- a/include/asm-x86/pgtable.h
+++ b/include/asm-x86/pgtable.h
@@ -131,6 +131,8 @@ extern unsigned long empty_zero_page[PAG
extern unsigned long empty_zero_page[PAGE_SIZE/sizeof(unsigned long)];
#define ZERO_PAGE(vaddr) (virt_to_page(empty_zero_page))

+extern spinlock_t pgd_lock;
+extern struct list_head pgd_list;

/*
* The following only work if pte_present() is true.
diff --git a/include/asm-x86/pgtable_32.h b/include/asm-x86/pgtable_32.h
--- a/include/asm-x86/pgtable_32.h
+++ b/include/asm-x86/pgtable_32.h
@@ -27,8 +27,6 @@ struct vm_area_struct;

extern pgd_t swapper_pg_dir[1024];
extern struct kmem_cache *pmd_cache;
-extern spinlock_t pgd_lock;
-extern struct page *pgd_list;
void check_pgt_cache(void);

void pmd_ctor(struct kmem_cache *, void *);
diff --git a/include/asm-x86/pgtable_64.h b/include/asm-x86/pgtable_64.h
--- a/include/asm-x86/pgtable_64.h
+++ b/include/asm-x86/pgtable_64.h
@@ -240,9 +240,6 @@ static inline unsigned long pmd_bad(pmd_
#define __pte_to_swp_entry(pte) ((swp_entry_t) { pte_val(pte) })
#define __swp_entry_to_pte(x) ((pte_t) { .pte = (x).val })

-extern spinlock_t pgd_lock;
-extern struct list_head pgd_list;
-
extern int kern_addr_valid(unsigned long addr);

#define io_remap_pfn_range(vma, vaddr, pfn, size, prot) \

2008-01-25 21:54:05

by Jeremy Fitzhardinge

Subject: [PATCH 07 of 11] x86: don't special-case pmd allocations as much

In x86 PAE mode, stop treating pmds as a special case. Previously
they were always allocated and freed with the pgd. This modifies the
code to be the same as 64-bit mode, where they are allocated on
demand.

This is a step on the way to unifying 32/64-bit pagetable allocation
as much as possible.

There is a complicating wart, however. When you install a new
reference to a pmd in the pgd, the processor isn't guaranteed to see
it unless you reload cr3. Since reloading cr3 also has the
side-effect of flushing the tlb, this is an expense that we want to
avoid wherever possible.

This patch simply avoids reloading cr3 unless the update is to the
current pagetable. Later patches will optimise this further.
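
The deferral boils down to this check in pud_populate():

	/* only reload cr3 when we're updating the live pagetable */
	if (mm == current->active_mm)
		write_cr3(read_cr3());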

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: William Irwin <[email protected]>
---
arch/x86/mm/init_32.c | 13 -------
arch/x86/mm/pgtable_32.c | 68 --------------------------------------
include/asm-x86/pgalloc_32.h | 22 ++++++++++--
include/asm-x86/pgtable-3level.h | 39 +++++++++++++++------
include/asm-x86/pgtable_32.h | 3 -
5 files changed, 47 insertions(+), 98 deletions(-)

diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -709,19 +709,6 @@ int arch_add_memory(int nid, u64 start,
}
#endif

-struct kmem_cache *pmd_cache;
-
-void __init pgtable_cache_init(void)
-{
- if (PTRS_PER_PMD > 1) {
- pmd_cache = kmem_cache_create("pmd",
- PTRS_PER_PMD*sizeof(pmd_t),
- PTRS_PER_PMD*sizeof(pmd_t),
- SLAB_PANIC,
- pmd_ctor);
- }
-}
-
/*
* This function cannot be __init, since exceptions don't work in that
* section. Put this after the callers, so that it cannot be inlined.
diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -195,11 +195,6 @@ struct page *pte_alloc_one(struct mm_str
return pte;
}

-void pmd_ctor(struct kmem_cache *cache, void *pmd)
-{
- memset(pmd, 0, PTRS_PER_PMD*sizeof(pmd_t));
-}
-
/*
* List of all pgd's needed for non-PAE so it can invalidate entries
* in both cached and uncached pgd's; not needed for PAE since the
@@ -258,85 +253,22 @@ static void pgd_dtor(void *pgd)
if (SHARED_KERNEL_PMD)
return;

- paravirt_release_pd(__pa(pgd) >> PAGE_SHIFT);
spin_lock_irqsave(&pgd_lock, flags);
pgd_list_del(pgd);
spin_unlock_irqrestore(&pgd_lock, flags);
}

-/* If we allocate a pmd for part of the kernel address space, then
- make sure its initialized with the appropriate kernel mappings.
- Otherwise use a cached zeroed pmd. */
-static pmd_t *pmd_cache_alloc(int idx)
-{
- pmd_t *pmd;
-
- if (idx >= USER_PTRS_PER_PGD) {
- pmd = (pmd_t *)__get_free_page(GFP_KERNEL);
-
- if (pmd)
- memcpy(pmd,
- (void *)pgd_page_vaddr(swapper_pg_dir[idx]),
- sizeof(pmd_t) * PTRS_PER_PMD);
- } else
- pmd = kmem_cache_alloc(pmd_cache, GFP_KERNEL);
-
- return pmd;
-}
-
-static void pmd_cache_free(pmd_t *pmd, int idx)
-{
- if (idx >= USER_PTRS_PER_PGD)
- free_page((unsigned long)pmd);
- else
- kmem_cache_free(pmd_cache, pmd);
-}
-
pgd_t *pgd_alloc(struct mm_struct *mm)
{
- int i;
pgd_t *pgd = quicklist_alloc(0, GFP_KERNEL, pgd_ctor);
-
- if (PTRS_PER_PMD == 1 || !pgd)
- return pgd;

mm->pgd = pgd; /* so that alloc_pd can use it */

- for (i = 0; i < UNSHARED_PTRS_PER_PGD; ++i) {
- pmd_t *pmd = pmd_cache_alloc(i);
-
- if (!pmd)
- goto out_oom;
-
- paravirt_alloc_pd(mm, __pa(pmd) >> PAGE_SHIFT);
- set_pgd(&pgd[i], __pgd(1 + __pa(pmd)));
- }
return pgd;
-
-out_oom:
- for (i--; i >= 0; i--) {
- pgd_t pgdent = pgd[i];
- void* pmd = (void *)__va(pgd_val(pgdent)-1);
- paravirt_release_pd(__pa(pmd) >> PAGE_SHIFT);
- pmd_cache_free(pmd, i);
- }
- quicklist_free(0, pgd_dtor, pgd);
- return NULL;
}

void pgd_free(pgd_t *pgd)
{
- int i;
-
- /* in the PAE case user pgd entries are overwritten before usage */
- if (PTRS_PER_PMD > 1)
- for (i = 0; i < UNSHARED_PTRS_PER_PGD; ++i) {
- pgd_t pgdent = pgd[i];
- void* pmd = (void *)__va(pgd_val(pgdent)-1);
- paravirt_release_pd(__pa(pmd) >> PAGE_SHIFT);
- pmd_cache_free(pmd, i);
- }
- /* in the non-PAE case, free_pgtables() clears user pgd entries */
quicklist_free(0, pgd_dtor, pgd);
}

diff --git a/include/asm-x86/pgalloc_32.h b/include/asm-x86/pgalloc_32.h
--- a/include/asm-x86/pgalloc_32.h
+++ b/include/asm-x86/pgalloc_32.h
@@ -63,21 +63,35 @@ static inline void __pte_free_tlb(struct
*/
static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
{
- BUG();
- return (pmd_t *)2;
+ return (pmd_t *)get_zeroed_page(GFP_KERNEL|__GFP_REPEAT);
}

static inline void pmd_free(pmd_t *pmd)
{
+ BUG_ON((unsigned long)pmd & (PAGE_SIZE-1));
+ free_page((unsigned long)pmd);
}

static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
{
+ paravirt_release_pd(__pa(pmd) >> PAGE_SHIFT);
+ tlb_remove_page(tlb, virt_to_page(pmd));
}

-static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
+static inline void pud_populate(struct mm_struct *mm, pud_t *pudp, pmd_t *pmd)
{
- BUG();
+ paravirt_alloc_pd(mm, __pa(pmd) >> PAGE_SHIFT);
+
+ /* Note: almost everything apart from _PAGE_PRESENT is
+ reserved at the pmd (PDPT) level. */
+ set_pud(pudp, __pud(__pa(pmd) | _PAGE_PRESENT));
+
+ /*
+ * Pentium-II erratum A13: in PAE mode we explicitly have to flush
+ * the TLB via cr3 if the top-level pgd is changed...
+ */
+ if (mm == current->active_mm)
+ write_cr3(read_cr3());
}
#endif /* CONFIG_X86_PAE */

diff --git a/include/asm-x86/pgtable-3level.h b/include/asm-x86/pgtable-3level.h
--- a/include/asm-x86/pgtable-3level.h
+++ b/include/asm-x86/pgtable-3level.h
@@ -15,9 +15,19 @@
#define pgd_ERROR(e) \
printk("%s:%d: bad pgd %p(%016Lx).\n", __FILE__, __LINE__, &(e), pgd_val(e))

-#define pud_none(pud) 0
-#define pud_bad(pud) 0
-#define pud_present(pud) 1
+
+static inline int pud_none(pud_t pud)
+{
+ return pud_val(pud) == 0;
+}
+static inline int pud_bad(pud_t pud)
+{
+ return (pud_val(pud) & ~(PTE_MASK | _KERNPG_TABLE | _PAGE_USER)) != 0;
+}
+static inline int pud_present(pud_t pud)
+{
+ return pud_val(pud) & _PAGE_PRESENT;
+}

/* Rules for using set_pte: the pte being assigned *must* be
* either not present or in a state where the hardware will
@@ -58,7 +68,7 @@ static inline void native_set_pmd(pmd_t
}
static inline void native_set_pud(pud_t *pudp, pud_t pud)
{
- *pudp = pud;
+ set_64bit((unsigned long long *)(pudp),native_pud_val(pud));
}

/*
@@ -81,13 +91,20 @@ static inline void native_pmd_clear(pmd_
*(tmp + 1) = 0;
}

-/*
- * Pentium-II erratum A13: in PAE mode we explicitly have to flush
- * the TLB via cr3 if the top-level pgd is changed...
- * We do not let the generic code free and clear pgd entries due to
- * this erratum.
- */
-static inline void pud_clear (pud_t * pud) { }
+static inline void pud_clear(pud_t *pudp)
+{
+ set_pud(pudp, __pud(0));
+
+ /*
+ * Pentium-II erratum A13: in PAE mode we explicitly have to flush
+ * the TLB via cr3 if the top-level pgd is changed...
+ *
+ * XXX I don't think we need to worry about this here, since
+ * when clearing the pud, the calling code needs to flush the
+ * tlb anyway. But do it now for safety's sake. - jsgf
+ */
+ write_cr3(read_cr3());
+}

#define pud_page(pud) \
((struct page *) __va(pud_val(pud) & PAGE_MASK))
diff --git a/include/asm-x86/pgtable_32.h b/include/asm-x86/pgtable_32.h
--- a/include/asm-x86/pgtable_32.h
+++ b/include/asm-x86/pgtable_32.h
@@ -29,8 +29,7 @@ extern struct kmem_cache *pmd_cache;
extern struct kmem_cache *pmd_cache;
void check_pgt_cache(void);

-void pmd_ctor(struct kmem_cache *, void *);
-void pgtable_cache_init(void);
+static inline void pgtable_cache_init(void) {}
void paging_init(void);



2008-01-25 21:54:35

by Jeremy Fitzhardinge

Subject: [PATCH 10 of 11] x86: allocate and initialize unshared pmds

If SHARED_KERNEL_PMD is false, then we need to allocate and initialize
the kernel pmd. We can easily piggy-back this onto the existing pmd
prepopulation code.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/mm/pgtable_32.c | 14 +++++++++++---
1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -269,7 +269,7 @@ static void pgd_mop_up_pmds(pgd_t *pgdp)
{
int i;

- for(i = 0; i < USER_PTRS_PER_PGD; i++) {
+ for(i = 0; i < UNSHARED_PTRS_PER_PGD; i++) {
pgd_t pgd = pgdp[i];

if (pgd_val(pgd) != 0) {
@@ -289,6 +289,10 @@ static void pgd_mop_up_pmds(pgd_t *pgdp)
* processor notices the update. Since this is expensive, and
* all 4 top-level entries are used almost immediately in a
* new process's life, we just pre-populate them here.
+ *
+ * Also, if we're in a paravirt environment where the kernel pmd is
+ * not shared between pagetables (!SHARED_KERNEL_PMD), we allocate
+ * and initialize the kernel pmds here.
*/
static int pgd_prepopulate_pmd(struct mm_struct *mm, pgd_t *pgd)
{
@@ -297,13 +301,18 @@ static int pgd_prepopulate_pmd(struct mm
int i;

pud = pud_offset(pgd, 0);
- for (addr = i = 0; i < USER_PTRS_PER_PGD; i++, pud++, addr += PUD_SIZE) {
+ for (addr = i = 0; i < UNSHARED_PTRS_PER_PGD;
+ i++, pud++, addr += PUD_SIZE) {
pmd_t *pmd = pmd_alloc_one(mm, addr);

if (!pmd) {
pgd_mop_up_pmds(pgd);
return 0;
}
+
+ if (i >= USER_PTRS_PER_PGD)
+ memcpy(pmd, (pmd_t *)pgd_page_vaddr(swapper_pg_dir[i]),
+ sizeof(pmd_t) * PTRS_PER_PMD);

pud_populate(mm, pud, pmd);
}
@@ -346,4 +355,3 @@ void check_pgt_cache(void)
{
quicklist_trim(0, pgd_dtor, 25, 16);
}
-

2008-01-25 21:54:53

by Jeremy Fitzhardinge

Subject: [PATCH 05 of 11] x86: demacro asm-x86/pgalloc_32.h

Convert macros into inline functions, for better type-checking.

This patch required a little bit of fiddling with headers in order to
make __(pte|pmd)_free_tlb inline rather than macros.
asm-generic/tlb.h includes asm/pgalloc.h, though it doesn't directly
use any pgalloc definitions. I removed this include to avoid an
include cycle, but it may cause secondary compile failures by things
depending on the indirect inclusion; arch/x86/mm/hugetlbpage.c was one
such place; there may be others.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/mm/hugetlbpage.c | 3 +
arch/x86/mm/init_32.c | 1
include/asm-generic/tlb.h | 1
include/asm-x86/pgalloc_32.h | 61 ++++++++++++++++++++++++--------------
include/asm-x86/pgtable-3level.h | 2 -
include/linux/swap.h | 1
6 files changed, 43 insertions(+), 26 deletions(-)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -15,6 +15,7 @@
#include <asm/mman.h>
#include <asm/tlb.h>
#include <asm/tlbflush.h>
+#include <asm/pgalloc.h>

static unsigned long page_table_shareable(struct vm_area_struct *svma,
struct vm_area_struct *vma,
@@ -88,7 +89,7 @@ static void huge_pmd_share(struct mm_str

spin_lock(&mm->page_table_lock);
if (pud_none(*pud))
- pud_populate(mm, pud, (unsigned long) spte & PAGE_MASK);
+ pud_populate(mm, pud, (pmd_t *)((unsigned long)spte & PAGE_MASK));
else
put_page(virt_to_page(spte));
spin_unlock(&mm->page_table_lock);
diff --git a/arch/x86/mm/init_32.c b/arch/x86/mm/init_32.c
--- a/arch/x86/mm/init_32.c
+++ b/arch/x86/mm/init_32.c
@@ -42,6 +42,7 @@
#include <asm/bugs.h>
#include <asm/tlb.h>
#include <asm/tlbflush.h>
+#include <asm/pgalloc.h>
#include <asm/sections.h>
#include <asm/paravirt.h>
#include <asm/setup.h>
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -15,7 +15,6 @@

#include <linux/swap.h>
#include <linux/quicklist.h>
-#include <asm/pgalloc.h>
#include <asm/tlbflush.h>

/*
diff --git a/include/asm-x86/pgalloc_32.h b/include/asm-x86/pgalloc_32.h
--- a/include/asm-x86/pgalloc_32.h
+++ b/include/asm-x86/pgalloc_32.h
@@ -3,6 +3,8 @@

#include <linux/threads.h>
#include <linux/mm.h> /* for struct page */
+#include <asm/tlb.h>
+#include <asm-generic/tlb.h>

#ifdef CONFIG_PARAVIRT
#include <asm/paravirt.h>
@@ -14,19 +16,20 @@
#define paravirt_release_pd(pfn) do { } while (0)
#endif

-#define pmd_populate_kernel(mm, pmd, pte) \
-do { \
- paravirt_alloc_pt(mm, __pa(pte) >> PAGE_SHIFT); \
- set_pmd(pmd, __pmd(_PAGE_TABLE + __pa(pte))); \
-} while (0)
+static inline void pmd_populate_kernel(struct mm_struct *mm,
+ pmd_t *pmd, pte_t *pte)
+{
+ paravirt_alloc_pt(mm, __pa(pte) >> PAGE_SHIFT);
+ set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE));
+}

-#define pmd_populate(mm, pmd, pte) \
-do { \
- paravirt_alloc_pt(mm, page_to_pfn(pte)); \
- set_pmd(pmd, __pmd(_PAGE_TABLE + \
- ((unsigned long long)page_to_pfn(pte) << \
- (unsigned long long) PAGE_SHIFT))); \
-} while (0)
+static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd, struct page *pte)
+{
+ unsigned long pfn = page_to_pfn(pte);
+
+ paravirt_alloc_pt(mm, pfn);
+ set_pmd(pmd, __pmd(((pteval_t)pfn << PAGE_SHIFT) | _PAGE_TABLE));
+}

/*
* Allocate and free page tables.
@@ -48,20 +51,34 @@ static inline void pte_free(struct page
}


-#define __pte_free_tlb(tlb,pte) \
-do { \
- paravirt_release_pt(page_to_pfn(pte)); \
- tlb_remove_page((tlb),(pte)); \
-} while (0)
+static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
+{
+ paravirt_release_pt(page_to_pfn(pte));
+ tlb_remove_page(tlb, pte);
+}

#ifdef CONFIG_X86_PAE
/*
* In the PAE case we free the pmds as part of the pgd.
*/
-#define pmd_alloc_one(mm, addr) ({ BUG(); ((pmd_t *)2); })
-#define pmd_free(x) do { } while (0)
-#define __pmd_free_tlb(tlb,x) do { } while (0)
-#define pud_populate(mm, pmd, pte) BUG()
-#endif
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+{
+ BUG();
+ return (pmd_t *)2;
+}
+
+static inline void pmd_free(pmd_t *pmd)
+{
+}
+
+static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
+{
+}
+
+static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
+{
+ BUG();
+}
+#endif /* CONFIG_X86_PAE */

#endif /* _I386_PGALLOC_H */
diff --git a/include/asm-x86/pgtable-3level.h b/include/asm-x86/pgtable-3level.h
--- a/include/asm-x86/pgtable-3level.h
+++ b/include/asm-x86/pgtable-3level.h
@@ -149,6 +149,4 @@ static inline unsigned long pte_pfn(pte_
#define __pte_to_swp_entry(pte) ((swp_entry_t){ (pte).pte_high })
#define __swp_entry_to_pte(x) ((pte_t){ { .pte_high = (x).val } })

-#define __pmd_free_tlb(tlb, x) do { } while (0)
-
#endif /* _I386_PGTABLE_3LEVEL_H */
diff --git a/include/linux/swap.h b/include/linux/swap.h
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -6,6 +6,7 @@
#include <linux/mmzone.h>
#include <linux/list.h>
#include <linux/sched.h>
+#include <linux/pagemap.h>

#include <asm/atomic.h>
#include <asm/page.h>

2008-01-25 21:55:21

by Jeremy Fitzhardinge

Subject: [PATCH 08 of 11] xen: deal with pmd being allocated/freed

Deal properly with pmd-level pages being allocated and freed
dynamically. We can handle them more or less the same as pte pages.

Also, deal with early_ioremap pagetable manipulations.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
---
arch/x86/xen/enlighten.c | 30 +++++++++++++++++++++++++-----
1 file changed, 25 insertions(+), 5 deletions(-)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -658,6 +658,13 @@ static __init void xen_alloc_pt_init(str
make_lowmem_page_readonly(__va(PFN_PHYS(pfn)));
}

+/* Early release_pt assumes that all pts are pinned, since there's
+ only init_mm and anything attached to that is pinned. */
+static void xen_release_pt_init(u32 pfn)
+{
+ make_lowmem_page_readwrite(__va(PFN_PHYS(pfn)));
+}
+
static void pin_pagetable_pfn(unsigned level, unsigned long pfn)
{
struct mmuext_op op;
@@ -669,7 +676,7 @@ static void pin_pagetable_pfn(unsigned l

/* This needs to make sure the new pte page is pinned iff its being
attached to a pinned pagetable. */
-static void xen_alloc_pt(struct mm_struct *mm, u32 pfn)
+static void xen_alloc_ptpage(struct mm_struct *mm, u32 pfn, unsigned level)
{
struct page *page = pfn_to_page(pfn);

@@ -678,12 +685,22 @@ static void xen_alloc_pt(struct mm_struc

if (!PageHighMem(page)) {
make_lowmem_page_readonly(__va(PFN_PHYS(pfn)));
- pin_pagetable_pfn(MMUEXT_PIN_L1_TABLE, pfn);
+ pin_pagetable_pfn(level, pfn);
} else
/* make sure there are no stray mappings of
this page */
kmap_flush_unused();
}
+}
+
+static void xen_alloc_pt(struct mm_struct *mm, u32 pfn)
+{
+ xen_alloc_ptpage(mm, pfn, MMUEXT_PIN_L1_TABLE);
+}
+
+static void xen_alloc_pd(struct mm_struct *mm, u32 pfn)
+{
+ xen_alloc_ptpage(mm, pfn, MMUEXT_PIN_L2_TABLE);
}

/* This should never happen until we're OK to use struct page */
@@ -788,6 +805,9 @@ static __init void xen_pagetable_setup_d
/* This will work as long as patching hasn't happened yet
(which it hasn't) */
pv_mmu_ops.alloc_pt = xen_alloc_pt;
+ pv_mmu_ops.alloc_pd = xen_alloc_pd;
+ pv_mmu_ops.release_pt = xen_release_pt;
+ pv_mmu_ops.release_pd = xen_release_pt;
pv_mmu_ops.set_pte = xen_set_pte;

if (!xen_feature(XENFEAT_auto_translated_physmap)) {
@@ -1011,10 +1031,10 @@ static const struct pv_mmu_ops xen_mmu_o
.pte_update_defer = paravirt_nop,

.alloc_pt = xen_alloc_pt_init,
- .release_pt = xen_release_pt,
- .alloc_pd = paravirt_nop,
+ .release_pt = xen_release_pt_init,
+ .alloc_pd = xen_alloc_pt_init,
.alloc_pd_clone = paravirt_nop,
- .release_pd = paravirt_nop,
+ .release_pd = xen_release_pt_init,

#ifdef CONFIG_HIGHPTE
.kmap_atomic_pte = xen_kmap_atomic_pte,

2008-01-25 21:55:50

by Jeremy Fitzhardinge

Subject: [PATCH 09 of 11] x86: preallocate pmds at pgd creation time

In PAE mode, an update to the pgd requires a cr3 reload to make sure
the processor notices the changes. Since this also has the
side-effect of flushing the tlb, it's an expensive operation which we
want to avoid where possible.

This patch mitigates the cost of installing the initial set of pmds on
process creation by preallocating them when the pgd is allocated.
This avoids up to three tlb flushes during exec, as it creates the new
process address space while the pagetable is in active use.

The pmds will be freed as part of the normal pagetable teardown in
free_pgtables, which is called in munmap and process exit. However,
free_pgtables will only free parts of the pagetable which actually
contain mappings, so stray pmds may still be attached to the pgd at
pgd_free time. We must mop them up to prevent a memory leak.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: William Irwin <[email protected]>
---
arch/x86/mm/pgtable_32.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 70 insertions(+)

diff --git a/arch/x86/mm/pgtable_32.c b/arch/x86/mm/pgtable_32.c
--- a/arch/x86/mm/pgtable_32.c
+++ b/arch/x86/mm/pgtable_32.c
@@ -258,17 +258,87 @@ static void pgd_dtor(void *pgd)
spin_unlock_irqrestore(&pgd_lock, flags);
}

+#ifdef CONFIG_X86_PAE
+/*
+ * Mop up any pmd pages which may still be attached to the pgd.
+ * Normally they will be freed by munmap/exit_mmap, but any pmd we
+ * preallocate which never got a corresponding vma will need to be
+ * freed manually.
+ */
+static void pgd_mop_up_pmds(pgd_t *pgdp)
+{
+ int i;
+
+ for(i = 0; i < USER_PTRS_PER_PGD; i++) {
+ pgd_t pgd = pgdp[i];
+
+ if (pgd_val(pgd) != 0) {
+ pmd_t *pmd = (pmd_t *)pgd_page_vaddr(pgd);
+
+ pgdp[i] = native_make_pgd(0);
+
+ paravirt_release_pd(pgd_val(pgd) >> PAGE_SHIFT);
+ pmd_free(pmd);
+ }
+ }
+}
+
+/*
+ * In PAE mode, we need to do a cr3 reload (=tlb flush) when
+ * updating the top-level pagetable entries to guarantee the
+ * processor notices the update. Since this is expensive, and
+ * all 4 top-level entries are used almost immediately in a
+ * new process's life, we just pre-populate them here.
+ */
+static int pgd_prepopulate_pmd(struct mm_struct *mm, pgd_t *pgd)
+{
+ pud_t *pud;
+ unsigned long addr;
+ int i;
+
+ pud = pud_offset(pgd, 0);
+ for (addr = i = 0; i < USER_PTRS_PER_PGD; i++, pud++, addr += PUD_SIZE) {
+ pmd_t *pmd = pmd_alloc_one(mm, addr);
+
+ if (!pmd) {
+ pgd_mop_up_pmds(pgd);
+ return 0;
+ }
+
+ pud_populate(mm, pud, pmd);
+ }
+
+ return 1;
+}
+#else /* !CONFIG_X86_PAE */
+/* No need to prepopulate any pagetable entries in non-PAE modes. */
+static int pgd_prepopulate_pmd(struct mm_struct *mm, pgd_t *pgd)
+{
+ return 1;
+}
+
+static void pgd_mop_up_pmds(pgd_t *pgd)
+{
+}
+#endif /* CONFIG_X86_PAE */
+
pgd_t *pgd_alloc(struct mm_struct *mm)
{
pgd_t *pgd = quicklist_alloc(0, GFP_KERNEL, pgd_ctor);

mm->pgd = pgd; /* so that alloc_pd can use it */

+ if (pgd && !pgd_prepopulate_pmd(mm, pgd)) {
+ quicklist_free(0, pgd_dtor, pgd);
+ pgd = NULL;
+ }
+
return pgd;
}

void pgd_free(pgd_t *pgd)
{
+ pgd_mop_up_pmds(pgd);
quicklist_free(0, pgd_dtor, pgd);
}


2008-01-25 21:56:14

by Jeremy Fitzhardinge

Subject: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()

PAE mode requires that we reload cr3 in order to guarantee that
changes to the pgd will be noticed by the processor. This means that
in principle pud_clear needs to reload cr3 every time. However,
because reloading cr3 implies a tlb flush, we want to avoid it where
possible.

pud_clear() is only used in a couple of places:
- in free_pmd_range(), when pulling down a range of process address space, and
- huge_pmd_unshare()

In both cases, the calling code will do a tlb flush anyway, so
there's no need to do it within pud_clear().

In free_pmd_range(), the pud_clear is immediately followed by
pmd_free_tlb(); we can hook that to make the mmu_gather do an
unconditional full flush to make sure cr3 gets reloaded.

In huge_pmd_unshare, it is followed by flush_tlb_range, which always
results in a full cr3-reload tlb flush.

Signed-off-by: Jeremy Fitzhardinge <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Linus Torvalds <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: William Irwin <[email protected]>
---
include/asm-x86/pgalloc_32.h | 7 +++++++
include/asm-x86/pgtable-3level.h | 21 +++++++++++++++------
2 files changed, 22 insertions(+), 6 deletions(-)

diff --git a/include/asm-x86/pgalloc_32.h b/include/asm-x86/pgalloc_32.h
--- a/include/asm-x86/pgalloc_32.h
+++ b/include/asm-x86/pgalloc_32.h
@@ -74,6 +74,13 @@ static inline void pmd_free(pmd_t *pmd)

static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
{
+ /* This is called just after the pmd has been detached from
+ the pgd, which requires a full tlb flush to be recognized
+ by the CPU. Rather than incurring multiple tlb flushes
+ while the address space is being pulled down, make the tlb
+ gathering machinery do a full flush when we're done. */
+ tlb->fullmm = 1;
+
paravirt_release_pd(__pa(pmd) >> PAGE_SHIFT);
tlb_remove_page(tlb, virt_to_page(pmd));
}
diff --git a/include/asm-x86/pgtable-3level.h b/include/asm-x86/pgtable-3level.h
--- a/include/asm-x86/pgtable-3level.h
+++ b/include/asm-x86/pgtable-3level.h
@@ -96,14 +96,23 @@ static inline void pud_clear(pud_t *pudp
set_pud(pudp, __pud(0));

/*
- * Pentium-II erratum A13: in PAE mode we explicitly have to flush
- * the TLB via cr3 if the top-level pgd is changed...
+ * In principle we need to do a cr3 reload here to make sure
+ * the processor recognizes the changed pgd. In practice, all
+ * the places where pud_clear() gets called are followed by
+ * full tlb flushes anyway, so we can defer the cost here.
*
- * XXX I don't think we need to worry about this here, since
- * when clearing the pud, the calling code needs to flush the
- * tlb anyway. But do it now for safety's sake. - jsgf
+ * Specifically:
+ *
+ * mm/memory.c:free_pmd_range() - immediately after the
+ * pud_clear() it does a pmd_free_tlb(). We change the
+ * mmu_gather structure to do a full tlb flush (which has the
+ * effect of reloading cr3) when the pagetable free is
+ * complete.
+ *
+ * arch/x86/mm/hugetlbpage.c:huge_pmd_unshare() - the call to
+ * this is followed by a flush_tlb_range, which on x86 does a
+ * full tlb flush.
*/
- write_cr3(read_cr3());
}

#define pud_page(pud) \

2008-01-25 22:55:11

by Jeremy Fitzhardinge

Subject: Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()

H. Peter Anvin wrote:
> Jeremy Fitzhardinge wrote:
>> PAE mode requires that we reload cr3 in order to guarantee that
>> changes to the pgd will be noticed by the processor. This means that
>> in principle pud_clear needs to reload cr3 every time. However,
>> because reloading cr3 implies a tlb flush, we want to avoid it where
>> possible.
>
> It only matters (for a processor which supports PGE) if we actually
> use the namespace in between.

Yes. And I don't think there's anywhere in the kernel that removes a
pmd and then replaces it in short order. It either means
you're doing a large munmap, or the pagetable is being destroyed (and
therefore isn't currently in cr3 anyway).

>> In huge_pmd_unshare, it is followed by flush_tlb_range, which always
>> results in a full cr3-reload tlb flush.
>
> This one makes me nervous, as it feels like a side effect of
> implementation, and not a guarantee by design.

Yes. Obviously any user of pud_clear must do some kind of tlb flush
before they can expect the effect of their change to be visible. We
just need to make sure that it's the right kind of tlb flush that
satisfies the processor's requirements for updating the pgd. If someone
went and changed x86's flush_tlb_range to use invlpg then there'd be a
problem.

I considered hacks like adding a percpu flag saying "the next tlb flush
of any kind must be a cr3-reloading tlb flush", but that seems a
little... ugly.
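
For concreteness it would look something like this (all the names are
invented; this is not a real patch):

	DEFINE_PER_CPU(int, pgd_needs_cr3_reload);

	/* every tlb flush primitive would need to check the flag first */
	static inline void check_deferred_pgd_flush(void)
	{
		if (get_cpu_var(pgd_needs_cr3_reload)) {
			__get_cpu_var(pgd_needs_cr3_reload) = 0;
			write_cr3(read_cr3()); /* invlpg alone can't refresh the pgd */
		}
		put_cpu_var(pgd_needs_cr3_reload);
	}

Exactly the kind of cross-cutting state I'd rather not introduce.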

Hm. Changing PGE in cr4 is documented as flushing the tlb, but is it
enough to get new pmds recognized? (yes: see below)

Where is the requirement to reload cr3 after a pgd update documented
anyway? The comment in the source refers to "Pentium-II erratum A13",
which I think is wrong. The Pentium II specification update document
#243337-049, Jul 2002, lists A13 as "MCE Due to L2 Parity Error Gives L1
MCACOD.LL"; the only mention of PAE in that document is a bad
interaction with PAT in some steppings.

The only possibly relevant comment I can find in vol3a is:

Older IA-32 processors that implement the PAE mechanism use uncached
accesses when loading page-directory-pointer table entries. This
behavior is model specific and not architectural. More recent IA-32
processors may cache page-directory-pointer table entries.

(I'm not sure if "cache" here is the data cache or the TLB.)

Ah, here it is, in "Pentium® Pro Processor Specification Update",
January 1999, document 242689-035:

8. TLB Flush Necessary After PDPE Change
As described in Section 3.7, “Translation Lookaside Buffers (TLBs),”
in the Pentium® Pro Family Developer's Manual, Volume 3: Operating
System Writer’s Manual, the operating system is required to
invalidate the corresponding entry in the TLB after any change to a
page-directory or page-table entry. However, if the physical address
extension (PAE) feature is enabled to use 36-bit addressing, a new
table is added to the paging hierarchy, called the page directory
pointer table (as per Section 3.8, “Physical Address Extension”). If
an entry is changed in this table (to point to another page
directory), the TLBs must then be flushed by writing to CR3.

Alright, here's the definitive document: TLBs, Paging-Structure Caches, and
Their Invalidation
(http://www.intel.com/design/processor/applnots/317080.pdf), in which
section 8.1 confirms that you need to either reload cr3 or write one of
the other control registers in a way which triggers a cache flush:

The processor does not maintain a PDP cache as described in Section
4. The processor always caches information from the four
page-directory-pointer-table entries. These entries are not cached
at the time of address translation. Instead, they are always cached
as part of the execution of the following instructions:

* A MOV to CR3 that occurs with IA32_EFER.LMA = 0 and CR4.PAE = 1.
* A MOV to CR4 that results in CR4.PAE = 1, that occurs with
IA32_EFER.LMA = 0 and CR0.PG = 1, and that modifies at least
one of CR4.PAE, CR4.PGE, and CR4.PSE.
* A MOV to CR0 that modifies CR0.PG and that occurs with
IA32_EFER.LMA = 0 and CR4.PAE = 1.
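
In kernel terms the first of those is the familiar idiom we've been
throwing around in this thread:

	/* a MOV to CR3 forces the processor to re-read the four PDPTEs */
	write_cr3(read_cr3());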

I'll put together a patch with a pointer to the proper documentation...

J

2008-01-25 23:38:58

by Keir Fraser

Subject: Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()

On 25/1/08 22:54, "Jeremy Fitzhardinge" <[email protected]> wrote:

> The only possibly relevant comment I can find in vol3a is:
>
> Older IA-32 processors that implement the PAE mechanism use uncached
> accesses when loading page-directory-pointer table entries. This
> behavior is model specific and not architectural. More recent IA-32
> processors may cache page-directory-pointer table entries.

Go read the Intel application note "TLBs, Paging-Structure Caches, and Their
Invalidation" at http://www.intel.com/design/processor/applnots/317080.pdf

Section 8.1 explains about the PDPTR cache in 32-bit PAE mode, which can
only be refreshed by appropriate tickling of CR0, CR3 or CR4.

It is also important to note that *any* valid page directory entry at *any*
level in the page-table hierarchy can become cached at *any* time. Basically
TLB lookup is performed as a longest-prefix match on the linear address to
skip as many levels in a page-table walk as possible (where a walk is
needed, because there is no full-length match on the linear address). So, if
you modify a directory entry from present to not-present, or change the page
directory that a valid pde points to, you probably need to flush the pde
caching structure. One piece of good news is that all pde caches are flushed
by any arbitrary INVLPG.
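
(In Linux terms a single-page flush is __flush_tlb_one(), which on a
CPU with INVLPG comes down to roughly:

	asm volatile("invlpg (%0)" ::"r" (addr) : "memory");

so any one of those suffices to clear the pde caches.)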

-- Keir

2008-01-25 23:44:46

by Jeremy Fitzhardinge

Subject: Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()

Keir Fraser wrote:
> Go read the Intel application note "TLBs, Paging-Structure Caches, and Their
> Invalidation" at http://www.intel.com/design/processor/applnots/317080.pdf
>
> Section 8.1 explains about the PDPTR cache in 32-bit PAE mode, which can
> only be refreshed by appropriate tickling of CR0, CR3 or CR4.
>

Yeah, I found that document, and mentioned it a little lower in the mail ;)

> It is also important to note that *any* valid page directory entry at *any*
> level in the page-table hierarchy can become cached at *any* time. Basically
> TLB lookup is performed as a longest-prefix match on the linear address to
> skip as many levels in a page-table walk as possible (where a walk is
> needed, because there is no full-length match on the linear address). So, if
> you modify a directory entry from present to not-present, or change the page
> directory that a valid pde points to, you probably need to flush the pde
> caching structure. One piece of good news is that all pde caches are flushed
> by any arbitrary INVLPG.
>

Hm, but then chapter 10 goes and makes things confusing with
"Alternative INVLPG Behavior"; but I guess if software needs to
explicitly enable this behaviour in a yet-to-be-determined way, it's OK...

Is there any guide about the tradeoff of when to use invlpg vs flushing
the whole tlb? 1 page? 10? 90% of the tlb?

J

2008-01-26 00:12:08

by Ingo Molnar

Subject: Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()


* Jeremy Fitzhardinge <[email protected]> wrote:

> Is there any guide about the tradeoff of when to use invlpg vs
> flushing the whole tlb? 1 page? 10? 90% of the tlb?

i made measurements some time ago and INVLPG was quite uniformly slow on
all important CPU types - on the order of 100+ cycles. It's probably
microcode. With a cr3 flush being on the order of 200-300 cycles (plus
any add-on TLB miss costs - but those are amortized quite well as long
as the pagetables are well cached - which they usually are on today's
2MB-ish L2 caches), the high cost of INVLPG rarely makes it worthwhile
for anything more than a few pages.

so INVLPG makes sense for pagetable-fault-related single-address
flushes, but it rarely makes sense for range flushes. (and that's how
Linux uses it)
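
fwiw, this sort of thing is easy to measure with a crude rdtscll()
loop in a test module - a sketch only, not the actual harness:

	unsigned long long t0, t1;
	unsigned long addr = (unsigned long)&t0; /* any mapped address will do */
	int i;

	rdtscll(t0);
	for (i = 0; i < 1000; i++)
		__flush_tlb_one(addr);	/* or write_cr3(read_cr3()) */
	rdtscll(t1);
	printk("invlpg: avg %llu cycles\n", (t1 - t0) / 1000);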

Ingo

2008-01-26 00:15:27

by H. Peter Anvin

Subject: Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()

Keir Fraser wrote:
> On 25/1/08 22:54, "Jeremy Fitzhardinge" <[email protected]> wrote:
>
>> The only possibly relevant comment I can find in vol3a is:
>>
>> Older IA-32 processors that implement the PAE mechanism use uncached
>> accesses when loading page-directory-pointer table entries. This
>> behavior is model specific and not architectural. More recent IA-32
>> processors may cache page-directory-pointer table entries.
>
> Go read the Intel application note "TLBs, Paging-Structure Caches, and Their
> Invalidation" at http://www.intel.com/design/processor/applnots/317080.pdf
>
> Section 8.1 explains about the PDPTR cache in 32-bit PAE mode, which can
> only be refreshed by appropriate tickling of CR0, CR3 or CR4.
>
> It is also important to note that *any* valid page directory entry at *any*
> level in the page-table hierarchy can become cached at *any* time. Basically
> TLB lookup is performed as a longest-prefix match on the linear address to
> skip as many levels in a page-table walk as possible (where a walk is
> needed, because there is no full-length match on the linear address). So, if
> you modify a directory entry from present to not-present, or change the page
> directory that a valid pde points to, you probably need to flush the pde
> caching structure. One piece of good news is that all pde caches are flushed
> by any arbitrary INVLPG.
>

Actually, it's trickier than that. The PDPTR, just like the segments,
isn't a real cache, and isn't invalidated by INVLPG. This means you
can't go from less permissive to more permissive, which is normally
permitted in the x86. The PDPTR should really be thought of as an
extended cr3 with four entries (this is also how it would be typically
implemented in hardware) rather than as a part of the paging structure
per se.

We do NOT want to frob %cr4 unless we actually need to clear all the
global pages.

The stuff in chapter 10 sounds like they're flagging for a revised
INVLPG instruction or mode which would fix some of the extremely serious
defects in INVLPG that were introduced by haphazard semantics from the P5
and early P6 days.

In general, we should only assume that INVLPG invalidates the hierarchy
immediately above the page it names, and not rely on any side effects.
That's basically sane design anyway.

Now, all of this reminds me of something somewhat messy: if we share the
kernel page tables for trampoline page tables, as discussed elsewhere,
we HAVE to do a complete, all-tlb-including-global-pages flush after
use, since the kernel pages are global and otherwise will stick around.
Unlike the permission bits, there are no G enable bits on the higher
levels, only on the PTEs themselves.

-hpa

2008-01-26 00:24:49

by H. Peter Anvin

Subject: Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()

Ingo Molnar wrote:
> * Jeremy Fitzhardinge <[email protected]> wrote:
>
>> Is there any guide about the tradeoff of when to use invlpg vs
>> flushing the whole tlb? 1 page? 10? 90% of the tlb?
>
> i made measurements some time ago and INVLPG was quite uniformly slow on
> all important CPU types - on the order of 100+ cycles. It's probably
> microcode. With a cr3 flush being on the order of 200-300 cycles (plus
> any add-on TLB miss costs - but those are amortized quite well as long
> as the pagetables are well cached - which they usually are on today's
> 2MB-ish L2 caches), the high cost of INVLPG rarely makes it worthwhile
> for anything more than a few pages.
>
> so INVLPG makes sense for pagetable-fault-related single-address
> flushes, but it rarely makes sense for range flushes. (and that's how
> Linux uses it)
>

Incidentally, as far as I can tell, the main reason INVLPG is so slow
is its painful behaviour with regard to large pages which may have
been split by hardware.

-hpa

2008-01-26 00:58:01

by Jeremy Fitzhardinge

Subject: Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()

H. Peter Anvin wrote:
> Keir Fraser wrote:
>> On 25/1/08 22:54, "Jeremy Fitzhardinge" <[email protected]> wrote:
>>
>>> The only possibly relevant comment I can find in vol3a is:
>>>
>>> Older IA-32 processors that implement the PAE mechanism use uncached
>>> accesses when loading page-directory-pointer table entries. This
>>> behavior is model specific and not architectural. More recent IA-32
>>> processors may cache page-directory-pointer table entries.
>>
>> Go read the Intel application note "TLBs, Paging-Structure Caches, and
>> Their Invalidation" at
>> http://www.intel.com/design/processor/applnots/317080.pdf
>>
>> Section 8.1 explains about the PDPTR cache in 32-bit PAE mode, which can
>> only be refreshed by appropriate tickling of CR0, CR3 or CR4.
>>
>> It is also important to note that *any* valid page directory entry at
>> *any* level in the page-table hierarchy can become cached at *any*
>> time. Basically TLB lookup is performed as a longest-prefix match on
>> the linear address to skip as many levels in a page-table walk as
>> possible (where a walk is needed, because there is no full-length match
>> on the linear address). So, if you modify a directory entry from
>> present to not-present, or change the page directory that a valid pde
>> points to, you probably need to flush the pde caching structure. One
>> piece of good news is that all pde caches are flushed by any arbitrary
>> INVLPG.
>>
>
> Actually, it's trickier than that. The PDPTR, just like the segments,
> isn't a real cache, and isn't invalidated by INVLPG. This means you
> can't go from less permissive to more permissive, which is normally
> permitted in the x86. The PDPTR should really be thought of as an
> extended cr3 with four entries (this is also how it would be typically
> implemented in hardware) rather than as a part of the paging structure
> per se.

Yeah, that's basically what 8.1 says. PAE doesn't follow the normal TLB
rules for the top level, though they reserve the right to make it behave
properly (as it would if you graft a PAE pagetable into a full 64-bit
pagetable).


> Now, all of this reminds me of something somewhat messy: if we share
> the kernel page tables for trampoline page tables, as discussed
> elsewhere, we HAVE to do a complete, all-tlb-including-global-pages
> flush after use, since the kernel pages are global and otherwise will
>> stick around. Unlike the permission bits, there are no G enable
>> bits on the higher levels, only on the PTEs themselves.

That wouldn't happen too often though, would it? The identity mapping is
only interested in a 1:1 view on RAM, and that's not going to change at
all? Does the TLB cache PAT attributes? Do you need to do a global
flush after changing a PTE's PAT bits to make sure that all that PTE's
mappings have a consistent view on memory?

J

2008-01-26 01:13:49

by H. Peter Anvin

Subject: Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()

Jeremy Fitzhardinge wrote:
>
>> Now, all of this reminds me of something somewhat messy: if we share
>> the kernel page tables for trampoline page tables, as discussed
>> elsewhere, we HAVE to do a complete, all-tlb-including-global-pages
>> flush after use, since the kernel pages are global and otherwise will
>> stick around. Unlike the permission bits, there are no G enable
>> bits on the higher levels, only on the PTEs themselves.
>
> That wouldn't happen to often though, would it. The identity mapping is
> only interested in a 1:1 view on RAM, and that's not going to change at
> all? Does the TLB cache PAT attributes? Do you need to do a global
> flush after changing a PTE's PAT bits to make sure that all that PTE's
> mappings have a consistent view on memory?
>

You do need to flush *that page* globally, yes.

As far as flushing after using the trampoline pagetables, we're talking
about rare, expensive events here like suspend to ram.

-hpa

2008-01-26 05:57:22

by Andi Kleen

Subject: Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()

On Saturday 26 January 2008 01:11:28 Ingo Molnar wrote:
> (plus
> any add-on TLB miss costs - but those are amortized quite well as long
> as the pagetables are well cached - which they usually are on today's
> 2MB-ish L2 caches),

Did you measure the cost of that amortizing too?

My guess is that especially with TLBs getting larger and larger the
cost of full CR3 flushes is rising.

> so INVLPG makes sense for pagetable-fault-related single-address
> flushes, but it rarely makes sense for range flushes. (and that's how
> Linux uses it)

I think it would be an interesting experiment to switch flush_tlb_range()
over to INVLPG if the length is below some threshold and see if there
are visible effects in macro benchmarks. The main problem
would be to determine the right threshold -- it would likely be CPU dependent.
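
Something along these lines (threshold value and naming made up, purely
to illustrate the shape of the experiment):

	#define FLUSH_RANGE_INVLPG_MAX	8	/* would need per-CPU tuning */

	static void flush_tlb_range_threshold(struct vm_area_struct *vma,
					      unsigned long start,
					      unsigned long end)
	{
		if (((end - start) >> PAGE_SHIFT) <= FLUSH_RANGE_INVLPG_MAX) {
			unsigned long addr;

			for (addr = start; addr < end; addr += PAGE_SIZE)
				__flush_tlb_one(addr);	/* one INVLPG per page */
		} else
			flush_tlb_mm(vma->vm_mm);	/* full cr3-reload flush */
	}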

-Andi

2008-01-26 06:08:18

by H. Peter Anvin

Subject: Re: [PATCH 11 of 11] x86: defer cr3 reload when doing pud_clear()

Andi Kleen wrote:
>
>> so INVLPG makes sense for pagetable-fault-related single-address
>> flushes, but it rarely makes sense for range flushes. (and that's how
>> Linux uses it)
>
> I think it would be an interesting experiment to switch flush_tlb_range()
> over to INVLPG if the length is below some threshold and see if there
> are visible effects in macro benchmarks. The main problem
> would be to determine the right threshold -- it would likely be CPU-dependent.
>

It would be an interesting experiment. Odds are pretty good that the
cutover is roughly linear in the TLB size.

-hpa

2008-01-28 15:18:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00 of 11] x86: separate pmd lifetime from pgd


* Jeremy Fitzhardinge <[email protected]> wrote:

> This series addresses various cleanups in pagetable allocation in the
> direction of unifying 32/64 bits (that's still a while off yet).

hm, i tried this, and got an early crash:

[ 29.389844] VFS: Mounted root (ext3 filesystem) readonly.
[ 29.389872] debug: unmapping init memory c0b03000..c0b6f000
[ 29.440139] PM: Adding info for No Bus:vcs1
[ 29.463676] khelper used greatest stack depth: 2404 bytes left
[ 29.467238] PM: Adding info for No Bus:vcsa1
[ 29.541785] PANIC: double fault, gdt at c1d16000 [255 bytes]
[ 29.541785] double fault, tss at c1d19100
[ 29.541785] eip = c011fa95, esp = c3bf6000
[ 29.541785] eax = c3bf6010, ebx = c0b6fc08, ecx = 0000007b, edx = 00000000
[ 29.541785] esi = f76a7df4, edi = c011fa90

i think it's one of your patches :) Bisecting it down to the right one
now. Config attached.

Ingo



2008-01-28 15:39:31

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 00 of 11] x86: separate pmd lifetime from pgd

Ingo Molnar wrote:
> * Jeremy Fitzhardinge <[email protected]> wrote:
>
>
>> This series addresses various cleanups in pagetable allocation in the
>> direction of unifying 32/64 bits (that's still a while off yet).
>>
>
> hm, i tried this, and got an early crash:
>
> [ 29.389844] VFS: Mounted root (ext3 filesystem) readonly.
> [ 29.389872] debug: unmapping init memory c0b03000..c0b6f000
> [ 29.440139] PM: Adding info for No Bus:vcs1
> [ 29.463676] khelper used greatest stack depth: 2404 bytes left
> [ 29.467238] PM: Adding info for No Bus:vcsa1
> [ 29.541785] PANIC: double fault, gdt at c1d16000 [255 bytes]
> [ 29.541785] double fault, tss at c1d19100
> [ 29.541785] eip = c011fa95, esp = c3bf6000
> [ 29.541785] eax = c3bf6010, ebx = c0b6fc08, ecx = 0000007b, edx = 00000000
> [ 29.541785] esi = f76a7df4, edi = c011fa90
>
> i think it's one of your patches :) Bisecting it down to the right one
> now.

Wouldn't surprise me. Given that it's a non-PAE config, most of the
patches won't be in play, but perhaps I screwed up copying the kernel
pagetable entries into the pgd somehow.

J

2008-01-28 15:42:22

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00 of 11] x86: separate pmd lifetime from pgd


* Ingo Molnar <[email protected]> wrote:

> hm, i tried this, and got an early crash:
>
> [ 29.389844] VFS: Mounted root (ext3 filesystem) readonly.
> [ 29.389872] debug: unmapping init memory c0b03000..c0b6f000
> [ 29.440139] PM: Adding info for No Bus:vcs1
> [ 29.463676] khelper used greatest stack depth: 2404 bytes left
> [ 29.467238] PM: Adding info for No Bus:vcsa1
> [ 29.541785] PANIC: double fault, gdt at c1d16000 [255 bytes]
> [ 29.541785] double fault, tss at c1d19100
> [ 29.541785] eip = c011fa95, esp = c3bf6000
> [ 29.541785] eax = c3bf6010, ebx = c0b6fc08, ecx = 0000007b, edx = 00000000
> [ 29.541785] esi = f76a7df4, edi = c011fa90
>
> i think it's one of your patches :) Bisecting it down to the right one
> now. Config attached.

and after a session of bisection, the winner patch is:

Subject: x86: unify PAE/non-PAE pgd_ctor

which is a tad unexpected, given the relatively harmless nature of the
patch. (but then again, nothing is really harmless in PAE land.)

btw., this is not fair i think: your patch was apparently caught by the
new debugging helper that announces itself here:

> [ 29.389872] debug: unmapping init memory c0b03000..c0b6f000

note the close proximity of c0b6f000 and ebx = c0b6fc08. [ I regularly
come up with nasty tricks and debugging helpers like that to catch
bad patches off-guard. You have been warned! ;-) ]

Ingo

2008-01-28 15:48:04

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 00 of 11] x86: separate pmd lifetime from pgd


* Ingo Molnar <[email protected]> wrote:

> and after a session of bisection, the winner patch is:
>
> Subject: x86: unify PAE/non-PAE pgd_ctor
>
> which is a tad unexpected, given the relatively harmless nature of the
> patch. (but then again, nothing is really harmless in PAE land.)

ok, i merged up your series with this patch removed (it was possible
with a few manual fixups). That way the problem .config boots fine.

Ingo

2008-01-28 16:20:59

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 00 of 11] x86: separate pmd lifetime from pgd

Ingo Molnar wrote:
> and after a session of bisection, the winner patch is:
>
> Subject: x86: unify PAE/non-PAE pgd_ctor
>
> which is a tad unexpected, given the relatively harmless nature of the
> patch. (but then again, nothing is really harmless in PAE land.)
>

Oh, well, good. At least my off-the-cuff diagnosis was right. I must have
overlooked some detail in that merge.

> btw., this is not fair i think: your patch was apparently caught by the
> new debugging helper that announces itself here:
>
>
>> [ 29.389872] debug: unmapping init memory c0b03000..c0b6f000
>>
>
> note the close proximity of c0b6f000 and ebx = c0b6fc08. [ I regularly
> come up with nasty tricks and debugging helpers like that to catch
> bad patches off-guard. You have been warned! ;-) ]
>

Hm, perhaps, but it could just as easily be coincidence. The place where
initmem is freed is close to where it first needs to rely on a
non-init_mm pagetable. I presume that message means that c0b6f000 was
*not* freed.

J

2008-01-31 19:02:05

by Ian Campbell

[permalink] [raw]
Subject: Re: [PATCH 04 of 11] x86: fix early_ioremap pagetable ops


On Fri, 2008-01-25 at 13:23 -0800, Jeremy Fitzhardinge wrote:
> Put appropriate pagetable update hooks in so that paravirt knows
> what's going on in there.
>
> Signed-off-by: Jeremy Fitzhardinge <[email protected]>
> ---
> arch/x86/mm/ioremap.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> --- a/arch/x86/mm/ioremap.c
> +++ b/arch/x86/mm/ioremap.c
> @@ -18,6 +18,7 @@
> #include <asm/fixmap.h>
> #include <asm/pgtable.h>
> #include <asm/tlbflush.h>
> +#include <asm/pgalloc.h>
>
> #ifdef CONFIG_X86_64
>
> @@ -265,7 +266,7 @@ void __init early_ioremap_init(void)
>
> pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
> memset(bm_pte, 0, sizeof(bm_pte));
> - set_pmd(pmd, __pmd(__pa(bm_pte) | _PAGE_TABLE));
> + pmd_populate_kernel(&init_mm, pmd, bm_pte);
>
> /*
> * The boot-ioremap range spans multiple pmds, for which
> @@ -295,6 +296,7 @@ void __init early_ioremap_clear(void)
>
> pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
> pmd_clear(pmd);
> + paravirt_release_pt(__pa(bm_pte) >> PAGE_SHIFT);
> __flush_tlb_all();
> }

This seems to have ended up in f6df72e71eba621b2f5c49b3a763116fac748f6e
as:
+ paravirt_release_pt(__pa(pmd) >> PAGE_SHIFT);

and the pmd_populate_kernel hunk is missing altogether.

---
From bfa2a08064a269dd7906ed5f60e436360e1360e7 Mon Sep 17 00:00:00 2001
From: Ian Campbell <[email protected]>
Date: Thu, 31 Jan 2008 18:56:06 +0000
Subject: [PATCH] x86: fix early_ioremap pagetable ops for paravirt.

Some important parts of f6df72e71eba621b2f5c49b3a763116fac748f6e got dropped
along the way, reintroduce them.

Signed-off-by: Ian Campbell <[email protected]>
---
arch/x86/mm/ioremap.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
index ed4208e..93d931e 100644
--- a/arch/x86/mm/ioremap.c
+++ b/arch/x86/mm/ioremap.c
@@ -302,7 +302,7 @@ void __init early_ioremap_init(void)

pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
memset(bm_pte, 0, sizeof(bm_pte));
- set_pmd(pmd, __pmd(__pa(bm_pte) | _PAGE_TABLE));
+ pmd_populate_kernel(&init_mm, pmd, bm_pte);

/*
* The boot-ioremap range spans multiple pmds, for which
@@ -332,7 +332,7 @@ void __init early_ioremap_clear(void)

pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
pmd_clear(pmd);
- paravirt_release_pt(__pa(pmd) >> PAGE_SHIFT);
+ paravirt_release_pt(__pa(bm_pte) >> PAGE_SHIFT);
__flush_tlb_all();
}

--
1.5.3.8



--
Ian Campbell

This fortune would be seven words long if it were six words shorter.
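
For context on why both hunks matter: pmd_populate_kernel() differs
from a bare set_pmd() in that it also fires the paravirt
pagetable-allocation hook, which the paravirt_release_pt() call in
early_ioremap_clear() later balances. Roughly, on 32-bit (a sketch of
the pgalloc code of that era, not an exact quote):

#define pmd_populate_kernel(mm, pmd, pte)			\
do {								\
	/* tell the paravirt backend that the page holding	\
	 * 'pte' is now in use as a pagetable page... */	\
	paravirt_alloc_pt(mm, __pa(pte) >> PAGE_SHIFT);		\
	/* ...then install it, as a bare set_pmd() would */	\
	set_pmd(pmd, __pmd(__pa(pte) | _PAGE_TABLE));		\
} while (0)

Releasing __pa(pmd) rather than __pa(bm_pte) would report the wrong
page as no longer being a pagetable, hence the second hunk of the fix.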

2008-01-31 19:53:19

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 04 of 11] x86: fix early_ioremap pagetable ops

Ian Campbell wrote:
>
> This seems to have ended up in f6df72e71eba621b2f5c49b3a763116fac748f6e
> as:
> + paravirt_release_pt(__pa(pmd) >> PAGE_SHIFT);
>
> and the pmd_populate_kernel hunk is missing altogether.
>
> ---
> From bfa2a08064a269dd7906ed5f60e436360e1360e7 Mon Sep 17 00:00:00 2001
> From: Ian Campbell <[email protected]>
> Date: Thu, 31 Jan 2008 18:56:06 +0000
> Subject: [PATCH] x86: fix early_ioremap pagetable ops for paravirt.
>
> Some important parts of f6df72e71eba621b2f5c49b3a763116fac748f6e got dropped
> along the way, reintroduce them.
>

Yep.

Acked-by: Jeremy Fitzhardinge <[email protected]>

> Signed-off-by: Ian Campbell <[email protected]>
> ---
> arch/x86/mm/ioremap.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/mm/ioremap.c b/arch/x86/mm/ioremap.c
> index ed4208e..93d931e 100644
> --- a/arch/x86/mm/ioremap.c
> +++ b/arch/x86/mm/ioremap.c
> @@ -302,7 +302,7 @@ void __init early_ioremap_init(void)
>
> pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
> memset(bm_pte, 0, sizeof(bm_pte));
> - set_pmd(pmd, __pmd(__pa(bm_pte) | _PAGE_TABLE));
> + pmd_populate_kernel(&init_mm, pmd, bm_pte);
>
> /*
> * The boot-ioremap range spans multiple pmds, for which
> @@ -332,7 +332,7 @@ void __init early_ioremap_clear(void)
>
> pmd = early_ioremap_pmd(fix_to_virt(FIX_BTMAP_BEGIN));
> pmd_clear(pmd);
> - paravirt_release_pt(__pa(pmd) >> PAGE_SHIFT);
> + paravirt_release_pt(__pa(bm_pte) >> PAGE_SHIFT);
> __flush_tlb_all();
> }
>
>

2008-01-31 20:38:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 04 of 11] x86: fix early_ioremap pagetable ops


* Ian Campbell <[email protected]> wrote:

> Some important parts of f6df72e71eba621b2f5c49b3a763116fac748f6e got
> dropped along the way, reintroduce them.

thanks, applied. AFAICS it should only affect paravirt, not the native
kernel, right?

Ingo
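
(Jeremy confirms below. The reason: with CONFIG_PARAVIRT=n these hooks
compile away to nothing -- roughly, in the headers of that era:

#ifndef CONFIG_PARAVIRT
#define paravirt_alloc_pt(mm, pfn)	do { } while (0)
#define paravirt_release_pt(pfn)	do { } while (0)
#endif

so on native hardware pmd_populate_kernel() reduces to the old
set_pmd() and the release hook vanishes entirely.)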

2008-01-31 20:41:44

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH 04 of 11] x86: fix early_ioremap pagetable ops

Ingo Molnar wrote:
> * Ian Campbell <[email protected]> wrote:
>
>
>> Some important parts of f6df72e71eba621b2f5c49b3a763116fac748f6e got
>> dropped along the way, reintroduce them.
>>
>
> thanks, applied. AFAICS it should only affect paravirt, not the native
> kernel, right?
>

Correct.

J