2008-01-03 15:24:25

by Andi Kleen

Subject: [PATCH CPA] [1/28] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels


No need to make them 64bit there.
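
For illustration, a minimal sketch of what typeof_field() resolves to here
(assuming the non-PAE layout, where pgprot_t wraps a single unsigned long;
under PAE it wraps an unsigned long long, so nothing shrinks there):

/* sketch only, not part of the patch */
typedef struct { unsigned long pgprot; } pgprot_t;

#define typeof_field(t, m) typeof(((t *)0)->m)

/* typeof_field(pgprot_t, pgprot) is then plain 'unsigned long', so
 * __PAGE_KERNEL automatically follows whatever width pgprot_t has. */
typeof_field(pgprot_t, pgprot) __PAGE_KERNEL;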

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/init_32.c | 4 ++--
include/asm-x86/pgtable_32.h | 3 ++-
2 files changed, 4 insertions(+), 3 deletions(-)

Index: linux/arch/x86/mm/init_32.c
===================================================================
--- linux.orig/arch/x86/mm/init_32.c
+++ linux/arch/x86/mm/init_32.c
@@ -326,9 +326,9 @@ static void __init set_highmem_pages_ini
#define set_highmem_pages_init(bad_ppro) do { } while (0)
#endif /* CONFIG_HIGHMEM */

-unsigned long long __PAGE_KERNEL = _PAGE_KERNEL;
+typeof_field(pgprot_t, pgprot) __PAGE_KERNEL = _PAGE_KERNEL;
EXPORT_SYMBOL(__PAGE_KERNEL);
-unsigned long long __PAGE_KERNEL_EXEC = _PAGE_KERNEL_EXEC;
+typeof_field(pgprot_t, pgprot) __PAGE_KERNEL_EXEC = _PAGE_KERNEL_EXEC;

#ifdef CONFIG_NUMA
extern void __init remap_numa_kva(void);
Index: linux/include/asm-x86/pgtable_32.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_32.h
+++ linux/include/asm-x86/pgtable_32.h
@@ -153,7 +153,8 @@ void paging_init(void);
#define _PAGE_KERNEL_EXEC \
(_PAGE_PRESENT | _PAGE_RW | _PAGE_DIRTY | _PAGE_ACCESSED)

-extern unsigned long long __PAGE_KERNEL, __PAGE_KERNEL_EXEC;
+#define typeof_field(t, m) typeof(((t *)0)->m)
+extern typeof_field(pgprot_t, pgprot) __PAGE_KERNEL, __PAGE_KERNEL_EXEC;
#define __PAGE_KERNEL_RO (__PAGE_KERNEL & ~_PAGE_RW)
#define __PAGE_KERNEL_RX (__PAGE_KERNEL_EXEC & ~_PAGE_RW)
#define __PAGE_KERNEL_NOCACHE (__PAGE_KERNEL | _PAGE_PCD | _PAGE_PWT)


2008-01-03 15:24:42

by Andi Kleen

Subject: [PATCH CPA] [3/28] Add pte accessors for the global bit


Needed for some test code.
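
A rough usage sketch of the new accessors (pfn is a placeholder value,
not something from the patch):

pte_t pte = pfn_pte(pfn, PAGE_KERNEL);

if (pte_global(pte))                    /* normally set for kernel mappings
                                           on CPUs with PGE enabled */
        pte = pte_clrglobal(pte);       /* entry is now dropped on CR3 loads */
/* ... and later ... */
pte = pte_mkglobal(pte);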

Signed-off-by: Andi Kleen <[email protected]>

---
include/asm-x86/pgtable_32.h | 3 +++
include/asm-x86/pgtable_64.h | 3 +++
2 files changed, 6 insertions(+)

Index: linux/include/asm-x86/pgtable_64.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_64.h
+++ linux/include/asm-x86/pgtable_64.h
@@ -275,6 +275,7 @@ static inline int pte_young(pte_t pte)
static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_RW; }
static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
static inline int pte_huge(pte_t pte) { return pte_val(pte) & _PAGE_PSE; }
+static inline int pte_global(pte_t pte) { return pte_val(pte) & _PAGE_GLOBAL; }

static inline pte_t pte_mkclean(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_DIRTY)); return pte; }
static inline pte_t pte_mkold(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_ACCESSED)); return pte; }
@@ -285,6 +286,8 @@ static inline pte_t pte_mkyoung(pte_t pt
static inline pte_t pte_mkwrite(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_RW)); return pte; }
static inline pte_t pte_mkhuge(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_PSE)); return pte; }
static inline pte_t pte_clrhuge(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_PSE)); return pte; }
+static inline pte_t pte_mkglobal(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) | _PAGE_GLOBAL)); return pte; }
+static inline pte_t pte_clrglobal(pte_t pte) { set_pte(&pte, __pte(pte_val(pte) & ~_PAGE_GLOBAL)); return pte; }

struct vm_area_struct;

Index: linux/include/asm-x86/pgtable_32.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_32.h
+++ linux/include/asm-x86/pgtable_32.h
@@ -220,6 +220,7 @@ static inline int pte_dirty(pte_t pte)
static inline int pte_young(pte_t pte) { return (pte).pte_low & _PAGE_ACCESSED; }
static inline int pte_write(pte_t pte) { return (pte).pte_low & _PAGE_RW; }
static inline int pte_huge(pte_t pte) { return (pte).pte_low & _PAGE_PSE; }
+static inline int pte_global(pte_t pte) { return (pte).pte_low & _PAGE_GLOBAL; }

/*
* The following only works if pte_present() is not true.
@@ -233,6 +234,8 @@ static inline pte_t pte_mkdirty(pte_t pt
static inline pte_t pte_mkyoung(pte_t pte) { (pte).pte_low |= _PAGE_ACCESSED; return pte; }
static inline pte_t pte_mkwrite(pte_t pte) { (pte).pte_low |= _PAGE_RW; return pte; }
static inline pte_t pte_mkhuge(pte_t pte) { (pte).pte_low |= _PAGE_PSE; return pte; }
+static inline pte_t pte_mkglobal(pte_t pte) { (pte).pte_low |= _PAGE_GLOBAL; return pte; }
+static inline pte_t pte_clrglobal(pte_t pte) { (pte).pte_low &= ~_PAGE_GLOBAL; return pte; }

#ifdef CONFIG_X86_PAE
# include <asm/pgtable-3level.h>

2008-01-03 15:24:56

by Andi Kleen

Subject: [PATCH CPA] [4/28] Add pte_clrhuge on i386


64bit had it already.

Signed-off-by: Andi Kleen <[email protected]>

---
include/asm-x86/pgtable_32.h | 1 +
1 file changed, 1 insertion(+)

Index: linux/include/asm-x86/pgtable_32.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_32.h
+++ linux/include/asm-x86/pgtable_32.h
@@ -234,6 +234,7 @@ static inline pte_t pte_mkdirty(pte_t pt
static inline pte_t pte_mkyoung(pte_t pte) { (pte).pte_low |= _PAGE_ACCESSED; return pte; }
static inline pte_t pte_mkwrite(pte_t pte) { (pte).pte_low |= _PAGE_RW; return pte; }
static inline pte_t pte_mkhuge(pte_t pte) { (pte).pte_low |= _PAGE_PSE; return pte; }
+static inline pte_t pte_clrhuge(pte_t pte) { (pte).pte_low &= ~_PAGE_PSE; return pte; }
static inline pte_t pte_mkglobal(pte_t pte) { (pte).pte_low |= _PAGE_GLOBAL; return pte; }
static inline pte_t pte_clrglobal(pte_t pte) { (pte).pte_low &= ~_PAGE_GLOBAL; return pte; }

2008-01-03 15:25:21

by Andi Kleen

Subject: [PATCH CPA] [5/28] Add pte_pgprot on i386


64bit already had it.

Needed for later patches.
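
The main consumer later in the series is the large page split path, roughly
(a sketch following the existing 64bit code):

/* recover the protection bits of an existing large pte so they can be
 * reused as the reference protections for the split 4k mappings */
pgprot_t ref_prot = pte_pgprot(pte_clrhuge(*kpte));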

Signed-off-by: Andi Kleen <[email protected]>

---
include/asm-x86/pgtable-2level.h | 2 ++
include/asm-x86/pgtable-3level.h | 2 ++
2 files changed, 4 insertions(+)

Index: linux/include/asm-x86/pgtable-2level.h
===================================================================
--- linux.orig/include/asm-x86/pgtable-2level.h
+++ linux/include/asm-x86/pgtable-2level.h
@@ -76,6 +76,8 @@ static inline int pte_exec_kernel(pte_t
#define pgoff_to_pte(off) \
((pte_t) { (((off) & 0x1f) << 1) + (((off) >> 5) << 8) + _PAGE_FILE })

+#define pte_pgprot(x) __pgprot((x).pte_low & 0xfff)
+
/* Encode and de-code a swap entry */
#define __swp_type(x) (((x).val >> 1) & 0x1f)
#define __swp_offset(x) ((x).val >> 8)
Index: linux/include/asm-x86/pgtable-3level.h
===================================================================
--- linux.orig/include/asm-x86/pgtable-3level.h
+++ linux/include/asm-x86/pgtable-3level.h
@@ -172,6 +172,8 @@ static inline pmd_t pfn_pmd(unsigned lon
pgprot_val(pgprot)) & __supported_pte_mask);
}

+#define pte_pgprot(x) __pgprot(pte_val(x) & (0xfff | _PAGE_NX))
+
/*
* Bits 0, 6 and 7 are taken in the low part of the pte,
* put the 32 bits of offset into the high part.

2008-01-03 15:25:40

by Andi Kleen

Subject: [PATCH CPA] [6/28] CPA: Undo white space changes


Undo random white space changes. This reverts commit ddb53b5735793a19dc17bcd98b050f672f28f1ea.

I simply don't have the nerves to port a 20+ patch series to the
reformatted version. And the patch series changes most of those lines
anyway and drops the trailing white space there.

And since this was a nop, losing it for now isn't a problem.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 151 ++++++++++++++++++++--------------------------
arch/x86/mm/pageattr_64.c | 143 ++++++++++++++++++-------------------------
2 files changed, 130 insertions(+), 164 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -1,29 +1,28 @@
-/*
- * Copyright 2002 Andi Kleen, SuSE Labs.
+/*
+ * Copyright 2002 Andi Kleen, SuSE Labs.
* Thanks to Ben LaHaise for precious feedback.
- */
+ */

+#include <linux/mm.h>
+#include <linux/sched.h>
#include <linux/highmem.h>
#include <linux/module.h>
-#include <linux/sched.h>
#include <linux/slab.h>
-#include <linux/mm.h>
-
+#include <asm/uaccess.h>
#include <asm/processor.h>
#include <asm/tlbflush.h>
-#include <asm/sections.h>
-#include <asm/uaccess.h>
#include <asm/pgalloc.h>
+#include <asm/sections.h>

static DEFINE_SPINLOCK(cpa_lock);
static struct list_head df_list = LIST_HEAD_INIT(df_list);

-pte_t *lookup_address(unsigned long address)
-{
+
+pte_t *lookup_address(unsigned long address)
+{
pgd_t *pgd = pgd_offset_k(address);
pud_t *pud;
pmd_t *pmd;
-
if (pgd_none(*pgd))
return NULL;
pud = pud_offset(pgd, address);
@@ -34,22 +33,21 @@ pte_t *lookup_address(unsigned long addr
return NULL;
if (pmd_large(*pmd))
return (pte_t *)pmd;
+ return pte_offset_kernel(pmd, address);
+}

- return pte_offset_kernel(pmd, address);
-}
-
-static struct page *
-split_large_page(unsigned long address, pgprot_t prot, pgprot_t ref_prot)
-{
+static struct page *split_large_page(unsigned long address, pgprot_t prot,
+ pgprot_t ref_prot)
+{
+ int i;
unsigned long addr;
struct page *base;
pte_t *pbase;
- int i;

spin_unlock_irq(&cpa_lock);
base = alloc_pages(GFP_KERNEL, 0);
spin_lock_irq(&cpa_lock);
- if (!base)
+ if (!base)
return NULL;

/*
@@ -60,24 +58,22 @@ split_large_page(unsigned long address,
page_private(base) = 0;

address = __pa(address);
- addr = address & LARGE_PAGE_MASK;
+ addr = address & LARGE_PAGE_MASK;
pbase = (pte_t *)page_address(base);
paravirt_alloc_pt(&init_mm, page_to_pfn(base));
-
for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
- set_pte(&pbase[i], pfn_pte(addr >> PAGE_SHIFT,
- addr == address ? prot : ref_prot));
+ set_pte(&pbase[i], pfn_pte(addr >> PAGE_SHIFT,
+ addr == address ? prot : ref_prot));
}
return base;
-}
+}

static void cache_flush_page(struct page *p)
-{
- void *addr = page_address(p);
+{
+ void *adr = page_address(p);
int i;
-
for (i = 0; i < PAGE_SIZE; i += boot_cpu_data.x86_clflush_size)
- clflush(addr + i);
+ clflush(adr+i);
}

static void flush_kernel_map(void *arg)
@@ -87,27 +83,23 @@ static void flush_kernel_map(void *arg)

/* High level code is not ready for clflush yet */
if (0 && cpu_has_clflush) {
- list_for_each_entry(p, lh, lru)
+ list_for_each_entry (p, lh, lru)
cache_flush_page(p);
- } else {
- if (boot_cpu_data.x86_model >= 4)
- wbinvd();
- }
+ } else if (boot_cpu_data.x86_model >= 4)
+ wbinvd();

- /*
- * Flush all to work around Errata in early athlons regarding
- * large page flushing.
+ /* Flush all to work around Errata in early athlons regarding
+ * large page flushing.
*/
- __flush_tlb_all();
+ __flush_tlb_all();
}

-static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
-{
- unsigned long flags;
+static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
+{
struct page *page;
+ unsigned long flags;

- /* change init_mm */
- set_pte_atomic(kpte, pte);
+ set_pte_atomic(kpte, pte); /* change init_mm */
if (SHARED_KERNEL_PMD)
return;

@@ -116,7 +108,6 @@ static void set_pmd_pte(pte_t *kpte, uns
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-
pgd = (pgd_t *)page_address(page) + pgd_index(address);
pud = pud_offset(pgd, address);
pmd = pmd_offset(pud, address);
@@ -125,9 +116,9 @@ static void set_pmd_pte(pte_t *kpte, uns
spin_unlock_irqrestore(&pgd_lock, flags);
}

-/*
- * No more special protections in this 2/4MB area - revert to a large
- * page again.
+/*
+ * No more special protections in this 2/4MB area - revert to a
+ * large page again.
*/
static inline void revert_page(struct page *kpte_page, unsigned long address)
{
@@ -151,11 +142,12 @@ static inline void save_page(struct page
list_add(&kpte_page->lru, &df_list);
}

-static int __change_page_attr(struct page *page, pgprot_t prot)
-{
- struct page *kpte_page;
+static int
+__change_page_attr(struct page *page, pgprot_t prot)
+{
+ pte_t *kpte;
unsigned long address;
- pte_t *kpte;
+ struct page *kpte_page;

BUG_ON(PageHighMem(page));
address = (unsigned long)page_address(page);
@@ -163,17 +155,16 @@ static int __change_page_attr(struct pag
kpte = lookup_address(address);
if (!kpte)
return -EINVAL;
-
kpte_page = virt_to_page(kpte);
BUG_ON(PageLRU(kpte_page));
BUG_ON(PageCompound(kpte_page));

- if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
+ if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
if (!pte_huge(*kpte)) {
- set_pte_atomic(kpte, mk_pte(page, prot));
+ set_pte_atomic(kpte, mk_pte(page, prot));
} else {
- struct page *split;
pgprot_t ref_prot;
+ struct page *split;

ref_prot =
((address & LARGE_PAGE_MASK) < (unsigned long)&_etext)
@@ -181,19 +172,16 @@ static int __change_page_attr(struct pag
split = split_large_page(address, prot, ref_prot);
if (!split)
return -ENOMEM;
-
- set_pmd_pte(kpte, address, mk_pte(split, ref_prot));
+ set_pmd_pte(kpte,address,mk_pte(split, ref_prot));
kpte_page = split;
}
page_private(kpte_page)++;
- } else {
- if (!pte_huge(*kpte)) {
- set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
- BUG_ON(page_private(kpte_page) == 0);
- page_private(kpte_page)--;
- } else
- BUG();
- }
+ } else if (!pte_huge(*kpte)) {
+ set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
+ BUG_ON(page_private(kpte_page) == 0);
+ page_private(kpte_page)--;
+ } else
+ BUG();

/*
* If the pte was reserved, it means it was created at boot
@@ -209,7 +197,7 @@ static int __change_page_attr(struct pag
}
}
return 0;
-}
+}

static inline void flush_map(struct list_head *l)
{
@@ -223,33 +211,32 @@ static inline void flush_map(struct list
* than write-back somewhere - some CPUs do not like it when mappings with
* different caching policies exist. This changes the page attributes of the
* in kernel linear mapping too.
- *
+ *
* The caller needs to ensure that there are no conflicting mappings elsewhere.
* This function only deals with the kernel linear map.
- *
+ *
* Caller must call global_flush_tlb() after this.
*/
int change_page_attr(struct page *page, int numpages, pgprot_t prot)
{
+ int err = 0;
+ int i;
unsigned long flags;
- int err = 0, i;

spin_lock_irqsave(&cpa_lock, flags);
- for (i = 0; i < numpages; i++, page++) {
+ for (i = 0; i < numpages; i++, page++) {
err = __change_page_attr(page, prot);
- if (err)
- break;
- }
+ if (err)
+ break;
+ }
spin_unlock_irqrestore(&cpa_lock, flags);
-
return err;
}
-EXPORT_SYMBOL(change_page_attr);

void global_flush_tlb(void)
{
- struct page *pg, *next;
struct list_head l;
+ struct page *pg, *next;

BUG_ON(irqs_disabled());

@@ -266,28 +253,26 @@ void global_flush_tlb(void)
__free_page(pg);
}
}
-EXPORT_SYMBOL(global_flush_tlb);

#ifdef CONFIG_DEBUG_PAGEALLOC
void kernel_map_pages(struct page *page, int numpages, int enable)
{
if (PageHighMem(page))
return;
- if (!enable) {
+ if (!enable)
debug_check_no_locks_freed(page_address(page),
numpages * PAGE_SIZE);
- }

- /*
- * the return value is ignored - the calls cannot fail,
+ /* the return value is ignored - the calls cannot fail,
* large pages are disabled at boot time.
*/
change_page_attr(page, numpages, enable ? PAGE_KERNEL : __pgprot(0));
-
- /*
- * we should perform an IPI and flush all tlbs,
+ /* we should perform an IPI and flush all tlbs,
* but that can deadlock->flush only current cpu.
*/
__flush_tlb_all();
}
#endif
+
+EXPORT_SYMBOL(change_page_attr);
+EXPORT_SYMBOL(global_flush_tlb);
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -1,54 +1,48 @@
-/*
- * Copyright 2002 Andi Kleen, SuSE Labs.
+/*
+ * Copyright 2002 Andi Kleen, SuSE Labs.
* Thanks to Ben LaHaise for precious feedback.
- */
+ */

+#include <linux/mm.h>
+#include <linux/sched.h>
#include <linux/highmem.h>
#include <linux/module.h>
-#include <linux/sched.h>
#include <linux/slab.h>
-#include <linux/mm.h>
-
+#include <asm/uaccess.h>
#include <asm/processor.h>
#include <asm/tlbflush.h>
-#include <asm/uaccess.h>
#include <asm/io.h>

pte_t *lookup_address(unsigned long address)
-{
+{
pgd_t *pgd = pgd_offset_k(address);
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
-
if (pgd_none(*pgd))
return NULL;
pud = pud_offset(pgd, address);
if (!pud_present(*pud))
- return NULL;
+ return NULL;
pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
- return NULL;
+ return NULL;
if (pmd_large(*pmd))
return (pte_t *)pmd;
-
pte = pte_offset_kernel(pmd, address);
if (pte && !pte_present(*pte))
- pte = NULL;
-
+ pte = NULL;
return pte;
-}
+}

-static struct page *
-split_large_page(unsigned long address, pgprot_t prot, pgprot_t ref_prot)
-{
+static struct page *split_large_page(unsigned long address, pgprot_t prot,
+ pgprot_t ref_prot)
+{
+ int i;
unsigned long addr;
- struct page *base;
+ struct page *base = alloc_pages(GFP_KERNEL, 0);
pte_t *pbase;
- int i;
-
- base = alloc_pages(GFP_KERNEL, 0);
- if (!base)
+ if (!base)
return NULL;
/*
* page_private is used to track the number of entries in
@@ -58,21 +52,20 @@ split_large_page(unsigned long address,
page_private(base) = 0;

address = __pa(address);
- addr = address & LARGE_PAGE_MASK;
+ addr = address & LARGE_PAGE_MASK;
pbase = (pte_t *)page_address(base);
for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
- pbase[i] = pfn_pte(addr >> PAGE_SHIFT,
+ pbase[i] = pfn_pte(addr >> PAGE_SHIFT,
addr == address ? prot : ref_prot);
}
return base;
-}
+}

-void clflush_cache_range(void *addr, int size)
+void clflush_cache_range(void *adr, int size)
{
int i;
-
for (i = 0; i < size; i += boot_cpu_data.x86_clflush_size)
- clflush(addr+i);
+ clflush(adr+i);
}

static void flush_kernel_map(void *arg)
@@ -83,20 +76,17 @@ static void flush_kernel_map(void *arg)
/* When clflush is available always use it because it is
much cheaper than WBINVD. */
/* clflush is still broken. Disable for now. */
- if (1 || !cpu_has_clflush) {
+ if (1 || !cpu_has_clflush)
asm volatile("wbinvd" ::: "memory");
- } else {
- list_for_each_entry(pg, l, lru) {
- void *addr = page_address(pg);
-
- clflush_cache_range(addr, PAGE_SIZE);
- }
+ else list_for_each_entry(pg, l, lru) {
+ void *adr = page_address(pg);
+ clflush_cache_range(adr, PAGE_SIZE);
}
__flush_tlb_all();
}

static inline void flush_map(struct list_head *l)
-{
+{
on_each_cpu(flush_kernel_map, l, 1, 1);
}

@@ -108,56 +98,52 @@ static inline void save_page(struct page
list_add(&fpage->lru, &deferred_pages);
}

-/*
+/*
* No more special protections in this 2/4MB area - revert to a
- * large page again.
+ * large page again.
*/
static void revert_page(unsigned long address, pgprot_t ref_prot)
{
- unsigned long pfn;
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t large_pte;
+ unsigned long pfn;

pgd = pgd_offset_k(address);
BUG_ON(pgd_none(*pgd));
- pud = pud_offset(pgd, address);
+ pud = pud_offset(pgd,address);
BUG_ON(pud_none(*pud));
pmd = pmd_offset(pud, address);
BUG_ON(pmd_val(*pmd) & _PAGE_PSE);
pfn = (__pa(address) & LARGE_PAGE_MASK) >> PAGE_SHIFT;
large_pte = pfn_pte(pfn, ref_prot);
large_pte = pte_mkhuge(large_pte);
-
set_pte((pte_t *)pmd, large_pte);
-}
+}

static int
__change_page_attr(unsigned long address, unsigned long pfn, pgprot_t prot,
- pgprot_t ref_prot)
-{
+ pgprot_t ref_prot)
+{
+ pte_t *kpte;
struct page *kpte_page;
pgprot_t ref_prot2;
- pte_t *kpte;

kpte = lookup_address(address);
- if (!kpte)
- return 0;
-
+ if (!kpte) return 0;
kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK);
BUG_ON(PageLRU(kpte_page));
BUG_ON(PageCompound(kpte_page));
- if (pgprot_val(prot) != pgprot_val(ref_prot)) {
+ if (pgprot_val(prot) != pgprot_val(ref_prot)) {
if (!pte_huge(*kpte)) {
set_pte(kpte, pfn_pte(pfn, prot));
} else {
- /*
+ /*
* split_large_page will take the reference for this
* change_page_attr on the split page.
- */
+ */
struct page *split;
-
ref_prot2 = pte_pgprot(pte_clrhuge(*kpte));
split = split_large_page(address, prot, ref_prot2);
if (!split)
@@ -167,23 +153,21 @@ __change_page_attr(unsigned long address
kpte_page = split;
}
page_private(kpte_page)++;
- } else {
- if (!pte_huge(*kpte)) {
- set_pte(kpte, pfn_pte(pfn, ref_prot));
- BUG_ON(page_private(kpte_page) == 0);
- page_private(kpte_page)--;
- } else
- BUG();
- }
+ } else if (!pte_huge(*kpte)) {
+ set_pte(kpte, pfn_pte(pfn, ref_prot));
+ BUG_ON(page_private(kpte_page) == 0);
+ page_private(kpte_page)--;
+ } else
+ BUG();

/* on x86-64 the direct mapping set at boot is not using 4k pages */
- BUG_ON(PageReserved(kpte_page));
+ BUG_ON(PageReserved(kpte_page));

save_page(kpte_page);
if (page_private(kpte_page) == 0)
revert_page(address, ref_prot);
return 0;
-}
+}

/*
* Change the page attributes of an page in the linear mapping.
@@ -192,19 +176,19 @@ __change_page_attr(unsigned long address
* than write-back somewhere - some CPUs do not like it when mappings with
* different caching policies exist. This changes the page attributes of the
* in kernel linear mapping too.
- *
+ *
* The caller needs to ensure that there are no conflicting mappings elsewhere.
* This function only deals with the kernel linear map.
- *
+ *
* Caller must call global_flush_tlb() after this.
*/
int change_page_attr_addr(unsigned long address, int numpages, pgprot_t prot)
{
- int err = 0, kernel_map = 0, i;
-
- if (address >= __START_KERNEL_map &&
- address < __START_KERNEL_map + KERNEL_TEXT_SIZE) {
+ int err = 0, kernel_map = 0;
+ int i;

+ if (address >= __START_KERNEL_map
+ && address < __START_KERNEL_map + KERNEL_TEXT_SIZE) {
address = (unsigned long)__va(__pa(address));
kernel_map = 1;
}
@@ -214,8 +198,7 @@ int change_page_attr_addr(unsigned long
unsigned long pfn = __pa(address) >> PAGE_SHIFT;

if (!kernel_map || pte_present(pfn_pte(0, prot))) {
- err = __change_page_attr(address, pfn, prot,
- PAGE_KERNEL);
+ err = __change_page_attr(address, pfn, prot, PAGE_KERNEL);
if (err)
break;
}
@@ -224,16 +207,14 @@ int change_page_attr_addr(unsigned long
if (__pa(address) < KERNEL_TEXT_SIZE) {
unsigned long addr2;
pgprot_t prot2;
-
addr2 = __START_KERNEL_map + __pa(address);
/* Make sure the kernel mappings stay executable */
prot2 = pte_pgprot(pte_mkexec(pfn_pte(0, prot)));
err = __change_page_attr(addr2, pfn, prot2,
PAGE_KERNEL_EXEC);
- }
- }
- up_write(&init_mm.mmap_sem);
-
+ }
+ }
+ up_write(&init_mm.mmap_sem);
return err;
}

@@ -241,13 +222,11 @@ int change_page_attr_addr(unsigned long
int change_page_attr(struct page *page, int numpages, pgprot_t prot)
{
unsigned long addr = (unsigned long)page_address(page);
-
return change_page_attr_addr(addr, numpages, prot);
}
-EXPORT_SYMBOL(change_page_attr);

void global_flush_tlb(void)
-{
+{
struct page *pg, *next;
struct list_head l;

@@ -269,6 +248,8 @@ void global_flush_tlb(void)
continue;
ClearPagePrivate(pg);
__free_page(pg);
- }
-}
+ }
+}
+
+EXPORT_SYMBOL(change_page_attr);
EXPORT_SYMBOL(global_flush_tlb);

2008-01-03 15:25:56

by Andi Kleen

Subject: [PATCH CPA] [7/28] Extract page table dumping code from i386 fault handler into dump_pagetable()


Similar to x86-64. This is useful in other situations where we want
the page table dumped too.

Besides, anything that makes the i386 do_page_fault() shorter is good.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/fault_32.c | 72 ++++++++++++++++++++++++++-----------------------
1 file changed, 39 insertions(+), 33 deletions(-)

Index: linux/arch/x86/mm/fault_32.c
===================================================================
--- linux.orig/arch/x86/mm/fault_32.c
+++ linux/arch/x86/mm/fault_32.c
@@ -28,6 +28,44 @@
#include <asm/desc.h>
#include <asm/segment.h>

+void dump_pagetable(unsigned long address)
+{
+ typeof(pte_val(__pte(0))) page;
+
+ page = read_cr3();
+ page = ((__typeof__(page) *) __va(page))[address >> PGDIR_SHIFT];
+#ifdef CONFIG_X86_PAE
+ printk("*pdpt = %016Lx ", page);
+ if ((page >> PAGE_SHIFT) < max_low_pfn
+ && page & _PAGE_PRESENT) {
+ page &= PAGE_MASK;
+ page = ((__typeof__(page) *) __va(page))[(address >> PMD_SHIFT)
+ & (PTRS_PER_PMD - 1)];
+ printk(KERN_CONT "*pde = %016Lx ", page);
+ page &= ~_PAGE_NX;
+ }
+#else
+ printk("*pde = %08lx ", page);
+#endif
+
+ /*
+ * We must not directly access the pte in the highpte
+ * case if the page table is located in highmem.
+ * And let's rather not kmap-atomic the pte, just in case
+ * it's allocated already.
+ */
+ if ((page >> PAGE_SHIFT) < max_low_pfn
+ && (page & _PAGE_PRESENT)
+ && !(page & _PAGE_PSE)) {
+ page &= PAGE_MASK;
+ page = ((__typeof__(page) *) __va(page))[(address >> PAGE_SHIFT)
+ & (PTRS_PER_PTE - 1)];
+ printk("*pte = %0*Lx ", sizeof(page)*2, (u64)page);
+ }
+
+ printk("\n");
+}
+
/*
* Page fault error code bits
* bit 0 == 0 means no page found, 1 means protection fault
@@ -602,7 +640,6 @@ no_context:
bust_spinlocks(1);

if (oops_may_print()) {
- __typeof__(pte_val(__pte(0))) page;

#ifdef CONFIG_X86_PAE
if (error_code & 16) {
@@ -623,38 +660,7 @@ no_context:
printk(" at virtual address %08lx\n", address);
printk(KERN_ALERT "printing ip: %08lx ", regs->ip);

- page = read_cr3();
- page = ((__typeof__(page) *) __va(page))[address >> PGDIR_SHIFT];
-#ifdef CONFIG_X86_PAE
- printk("*pdpt = %016Lx ", page);
- if ((page >> PAGE_SHIFT) < max_low_pfn
- && page & _PAGE_PRESENT) {
- page &= PAGE_MASK;
- page = ((__typeof__(page) *) __va(page))[(address >> PMD_SHIFT)
- & (PTRS_PER_PMD - 1)];
- printk(KERN_CONT "*pde = %016Lx ", page);
- page &= ~_PAGE_NX;
- }
-#else
- printk("*pde = %08lx ", page);
-#endif
-
- /*
- * We must not directly access the pte in the highpte
- * case if the page table is located in highmem.
- * And let's rather not kmap-atomic the pte, just in case
- * it's allocated already.
- */
- if ((page >> PAGE_SHIFT) < max_low_pfn
- && (page & _PAGE_PRESENT)
- && !(page & _PAGE_PSE)) {
- page &= PAGE_MASK;
- page = ((__typeof__(page) *) __va(page))[(address >> PAGE_SHIFT)
- & (PTRS_PER_PTE - 1)];
- printk("*pte = %0*Lx ", sizeof(page)*2, (u64)page);
- }
-
- printk("\n");
+ dump_pagetable(address);
}

tsk->thread.cr2 = address;

2008-01-03 15:26:17

by Andi Kleen

Subject: [PATCH CPA] [11/28] CPA: Change 32bit back to init_mm semaphore locking


Now that debug pagealloc uses a separate function, it is better to change
the standard change_page_attr() back to init_mm semaphore locking, like 64bit.
Various things are simpler when sleeping is allowed.
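
The practical win is that allocations inside c_p_a() may now sleep under the
semaphore, so the unlock/relock dance around alloc_pages() goes away.
Roughly (a sketch of the resulting pattern):

down_write(&init_mm.mmap_sem);          /* sleeping lock, replaces cpa_lock */
base = alloc_pages(GFP_KERNEL, 0);      /* may sleep - fine under a semaphore */
/* ... modify the page tables ... */
up_write(&init_mm.mmap_sem);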

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -14,10 +14,9 @@
#include <asm/pgalloc.h>
#include <asm/sections.h>

-static DEFINE_SPINLOCK(cpa_lock);
+/* Protected by init_mm.mmap_sem */
static struct list_head df_list = LIST_HEAD_INIT(df_list);

-
pte_t *lookup_address(unsigned long address, int *level)
{
pgd_t *pgd = pgd_offset_k(address);
@@ -46,9 +45,7 @@ static struct page *split_large_page(uns
struct page *base;
pte_t *pbase;

- spin_unlock_irq(&cpa_lock);
base = alloc_pages(GFP_KERNEL, 0);
- spin_lock_irq(&cpa_lock);
if (!base)
return NULL;

@@ -224,15 +221,14 @@ int change_page_attr(struct page *page,
{
int err = 0;
int i;
- unsigned long flags;

- spin_lock_irqsave(&cpa_lock, flags);
+ down_write(&init_mm.mmap_sem);
for (i = 0; i < numpages; i++, page++) {
err = __change_page_attr(page, prot);
if (err)
break;
}
- spin_unlock_irqrestore(&cpa_lock, flags);
+ up_write(&init_mm.mmap_sem);
return err;
}

@@ -243,9 +239,9 @@ void global_flush_tlb(void)

BUG_ON(irqs_disabled());

- spin_lock_irq(&cpa_lock);
+ down_write(&init_mm.mmap_sem);
list_replace_init(&df_list, &l);
- spin_unlock_irq(&cpa_lock);
+ up_write(&init_mm.mmap_sem);
flush_map(&l);
list_for_each_entry_safe(pg, next, &l, lru) {
list_del(&pg->lru);

2008-01-03 15:26:35

by Andi Kleen

Subject: [PATCH CPA] [12/28] CPA: CLFLUSH support in change_page_attr()


Queue individual data pages for flushing with CLFLUSH in change_page_attr(),
instead of doing global WBINVDs. WBINVD is a very painful operation
for the CPU and quite slow too. Worse, it is not interruptible and
can cause long latencies on hypervisors on older Intel VT systems.

CLFLUSH on the other hand only flushes the cache lines that actually need
to be flushed and, since it works in smaller chunks, is more preemptible.

To do this c_p_a() needs to save the addresses to be flushed for
global_flush_tlb() later. This is done using a separate data structure, not
struct page, because page->lru is often already in use, and there is no
struct page at all for memory holes.

Also the flushes are done in FIFO order now, not LIFO.
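
In outline, the bookkeeping added for this (names as in the patch, slightly
condensed and commented):

struct flush {                          /* one record per changed page */
        struct list_head l;
        unsigned long addr;
};

static void set_tlb_flush(unsigned long address)
{
        struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);

        if (!f) {
                full_flush = 1;         /* no memory: fall back to WBINVD
                                           plus a full TLB flush */
                return;
        }
        f->addr = address;
        list_add_tail(&f->l, &flush_pages);     /* tail add => FIFO order */
}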

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 78 ++++++++++++++++++++++++++++++++++------------
arch/x86/mm/pageattr_64.c | 77 ++++++++++++++++++++++++++++++++++-----------
2 files changed, 118 insertions(+), 37 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -13,6 +13,11 @@
#include <asm/tlbflush.h>
#include <asm/io.h>

+struct flush {
+ struct list_head l;
+ unsigned long addr;
+};
+
pte_t *lookup_address(unsigned long address, int *level)
{
pgd_t *pgd = pgd_offset_k(address);
@@ -63,6 +68,11 @@ static struct page *split_large_page(uns
return base;
}

+struct flush_arg {
+ int full_flush;
+ struct list_head l;
+};
+
void clflush_cache_range(void *adr, int size)
{
int i;
@@ -72,27 +82,27 @@ void clflush_cache_range(void *adr, int

static void flush_kernel_map(void *arg)
{
- struct list_head *l = (struct list_head *)arg;
- struct page *pg;
+ struct flush_arg *a = (struct flush_arg *)arg;
+ struct flush *f;
+
+ if (!cpu_has_clflush)
+ a->full_flush = 1;

/* When clflush is available always use it because it is
much cheaper than WBINVD. */
- /* clflush is still broken. Disable for now. */
- if (1 || !cpu_has_clflush)
+ if (a->full_flush)
asm volatile("wbinvd" ::: "memory");
- else list_for_each_entry(pg, l, lru) {
- void *adr = page_address(pg);
- clflush_cache_range(adr, PAGE_SIZE);
+ list_for_each_entry(f, &a->l, l) {
+ if (!a->full_flush)
+ clflush_cache_range((void *)f->addr, PAGE_SIZE);
}
__flush_tlb_all();
}

-static inline void flush_map(struct list_head *l)
-{
- on_each_cpu(flush_kernel_map, l, 1, 1);
-}
-
-static LIST_HEAD(deferred_pages); /* protected by init_mm.mmap_sem */
+/* both protected by init_mm.mmap_sem */
+static int full_flush;
+static LIST_HEAD(deferred_pages);
+static LIST_HEAD(flush_pages);

static inline void save_page(struct page *fpage)
{
@@ -124,6 +134,25 @@ static void revert_page(unsigned long ad
set_pte((pte_t *)pmd, large_pte);
}

+/*
+ * Mark the address for flushing later in global_tlb_flush().
+ *
+ * Other parts of the kernel are already in a feeding frenzy over the various
+ * struct page fields. Instead of trying to compete allocate a separate
+ * data structure to keep track of the flush. This has the added bonus that it will
+ * work for MMIO holes without mem_map too.
+ */
+static void set_tlb_flush(unsigned long address)
+{
+ struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);
+ if (!f) {
+ full_flush = 1;
+ return;
+ }
+ f->addr = address;
+ list_add_tail(&f->l, &flush_pages);
+}
+
static int
__change_page_attr(unsigned long address, unsigned long pfn, pgprot_t prot,
pgprot_t ref_prot)
@@ -136,8 +165,11 @@ __change_page_attr(unsigned long address
kpte = lookup_address(address, &level);
if (!kpte) return 0;
kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK);
- BUG_ON(PageLRU(kpte_page));
BUG_ON(PageCompound(kpte_page));
+ BUG_ON(PageLRU(kpte_page));
+
+ set_tlb_flush(address);
+
if (pgprot_val(prot) != pgprot_val(ref_prot)) {
if (!pte_huge(*kpte)) {
set_pte(kpte, pfn_pte(pfn, prot));
@@ -231,7 +263,9 @@ int change_page_attr(struct page *page,
void global_flush_tlb(void)
{
struct page *pg, *next;
- struct list_head l;
+ struct flush *f, *fnext;
+ struct flush_arg arg;
+ struct list_head free_pages;

/*
* Write-protect the semaphore, to exclude two contexts
@@ -239,12 +273,19 @@ void global_flush_tlb(void)
* exclude new additions to the deferred_pages list:
*/
down_write(&init_mm.mmap_sem);
- list_replace_init(&deferred_pages, &l);
+ arg.full_flush = full_flush;
+ full_flush = 0;
+ list_replace_init(&flush_pages, &arg.l);
+ list_replace_init(&deferred_pages, &free_pages);
up_write(&init_mm.mmap_sem);

- flush_map(&l);
+ on_each_cpu(flush_kernel_map, &arg, 1, 1);
+
+ list_for_each_entry_safe(f, fnext, &arg.l, l) {
+ kfree(f);
+ }

- list_for_each_entry_safe(pg, next, &l, lru) {
+ list_for_each_entry_safe(pg, next, &free_pages, lru) {
list_del(&pg->lru);
clear_bit(PG_arch_1, &pg->flags);
if (page_private(pg) != 0)
Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -15,8 +15,15 @@
#include <asm/sections.h>

/* Protected by init_mm.mmap_sem */
+/* Variables protected by cpa_lock */
+static int full_flush;
static struct list_head df_list = LIST_HEAD_INIT(df_list);
+static LIST_HEAD(flush_pages);

+struct flush {
+ struct list_head l;
+ unsigned long addr;
+};
pte_t *lookup_address(unsigned long address, int *level)
{
pgd_t *pgd = pgd_offset_k(address);
@@ -67,25 +74,31 @@ static struct page *split_large_page(uns
return base;
}

-static void cache_flush_page(struct page *p)
+struct flush_arg {
+ int full_flush;
+ struct list_head l;
+};
+
+void clflush_cache_range(void *adr, int size)
{
- void *adr = page_address(p);
int i;
- for (i = 0; i < PAGE_SIZE; i += boot_cpu_data.x86_clflush_size)
+ for (i = 0; i < size; i += boot_cpu_data.x86_clflush_size)
clflush(adr+i);
}

static void flush_kernel_map(void *arg)
{
- struct list_head *lh = (struct list_head *)arg;
- struct page *p;
+ struct flush_arg *a = (struct flush_arg *)arg;
+ struct flush *f;

- /* High level code is not ready for clflush yet */
- if (0 && cpu_has_clflush) {
- list_for_each_entry (p, lh, lru)
- cache_flush_page(p);
- } else if (boot_cpu_data.x86_model >= 4)
+ if (!cpu_has_clflush)
+ a->full_flush = 1;
+ if (a->full_flush && boot_cpu_data.x86_model >= 4)
wbinvd();
+ list_for_each_entry(f, &a->l, l) {
+ if (!a->full_flush)
+ clflush_cache_range((void *)f->addr, PAGE_SIZE);
+ }

/* Flush all to work around Errata in early athlons regarding
* large page flushing.
@@ -141,6 +154,25 @@ static inline void save_page(struct page
list_add(&kpte_page->lru, &df_list);
}

+/*
+ * Mark the address for flushing later in global_tlb_flush().
+ *
+ * Other parts of the kernel are already in a feeding frenzy over the various
+ * struct page fields. Instead of trying to compete allocate a separate
+ * data structure to keep track of the flush. This has the added bonus that it will
+ * work for MMIO holes without mem_map too.
+ */
+static void set_tlb_flush(unsigned long address)
+{
+ struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);
+ if (!f) {
+ full_flush = 1;
+ return;
+ }
+ f->addr = address;
+ list_add_tail(&f->l, &flush_pages);
+}
+
static int
__change_page_attr(struct page *page, pgprot_t prot)
{
@@ -159,6 +191,8 @@ __change_page_attr(struct page *page, pg
BUG_ON(PageLRU(kpte_page));
BUG_ON(PageCompound(kpte_page));

+ set_tlb_flush(address);
+
if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
if (!pte_huge(*kpte)) {
set_pte_atomic(kpte, mk_pte(page, prot));
@@ -199,11 +233,6 @@ __change_page_attr(struct page *page, pg
return 0;
}

-static inline void flush_map(struct list_head *l)
-{
- on_each_cpu(flush_kernel_map, l, 1, 1);
-}
-
/*
* Change the page attributes of an page in the linear mapping.
*
@@ -234,16 +263,27 @@ int change_page_attr(struct page *page,

void global_flush_tlb(void)
{
- struct list_head l;
+ struct flush_arg arg;
struct page *pg, *next;
+ struct flush *f, *fnext;
+ struct list_head free_pages;

BUG_ON(irqs_disabled());

down_write(&init_mm.mmap_sem);
- list_replace_init(&df_list, &l);
+ arg.full_flush = full_flush;
+ full_flush = 0;
+ list_replace_init(&flush_pages, &arg.l);
+ list_replace_init(&df_list, &free_pages);
up_write(&init_mm.mmap_sem);
- flush_map(&l);
- list_for_each_entry_safe(pg, next, &l, lru) {
+
+ on_each_cpu(flush_kernel_map, &arg, 1, 1);
+
+ list_for_each_entry_safe(f, fnext, &arg.l, l) {
+ kfree(f);
+ }
+
+ list_for_each_entry_safe(pg, next, &free_pages, lru) {
list_del(&pg->lru);
clear_bit(PG_arch_1, &pg->flags);
if (PageReserved(pg) || !cpu_has_pse || page_private(pg) != 0)

2008-01-03 15:26:47

by Andi Kleen

Subject: [PATCH CPA] [13/28] CPA: Use macros to modify the PG_arch_1 page flags in change_page_attr


Instead of open coding the bit accesses, use standard-style
*PageDeferred* macros.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 14 ++++++++++----
arch/x86/mm/pageattr_64.c | 11 +++++++++--
2 files changed, 19 insertions(+), 6 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -18,6 +18,13 @@ struct flush {
unsigned long addr;
};

+#define PG_deferred PG_arch_1
+
+#define PageDeferred(p) test_bit(PG_deferred, &(p)->flags)
+#define SetPageDeferred(p) set_bit(PG_deferred, &(p)->flags)
+#define ClearPageDeferred(p) clear_bit(PG_deferred, &(p)->flags)
+#define TestSetPageDeferred(p) test_and_set_bit(PG_deferred, &(p)->flags)
+
pte_t *lookup_address(unsigned long address, int *level)
{
pgd_t *pgd = pgd_offset_k(address);
@@ -106,7 +113,7 @@ static LIST_HEAD(flush_pages);

static inline void save_page(struct page *fpage)
{
- if (!test_and_set_bit(PG_arch_1, &fpage->flags))
+ if (!TestSetPageDeferred(fpage))
list_add(&fpage->lru, &deferred_pages);
}

@@ -287,7 +294,7 @@ void global_flush_tlb(void)

list_for_each_entry_safe(pg, next, &free_pages, lru) {
list_del(&pg->lru);
- clear_bit(PG_arch_1, &pg->flags);
+ ClearPageDeferred(pg);
if (page_private(pg) != 0)
continue;
ClearPagePrivate(pg);
Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -14,8 +14,14 @@
#include <asm/pgalloc.h>
#include <asm/sections.h>

-/* Protected by init_mm.mmap_sem */
-/* Variables protected by cpa_lock */
+#define PG_deferred PG_arch_1
+
+#define PageDeferred(p) test_bit(PG_deferred, &(p)->flags)
+#define SetPageDeferred(p) set_bit(PG_deferred, &(p)->flags)
+#define ClearPageDeferred(p) clear_bit(PG_deferred, &(p)->flags)
+#define TestSetPageDeferred(p) test_and_set_bit(PG_deferred, &(p)->flags)
+
+/* Variables protected by init_mm.mmap_sem */
static int full_flush;
static struct list_head df_list = LIST_HEAD_INIT(df_list);
static LIST_HEAD(flush_pages);
@@ -150,7 +156,7 @@ static inline void revert_page(struct pa

static inline void save_page(struct page *kpte_page)
{
- if (!test_and_set_bit(PG_arch_1, &kpte_page->flags))
+ if (!TestSetPageDeferred(kpte_page))
list_add(&kpte_page->lru, &df_list);
}

@@ -285,7 +291,7 @@ void global_flush_tlb(void)

list_for_each_entry_safe(pg, next, &free_pages, lru) {
list_del(&pg->lru);
- clear_bit(PG_arch_1, &pg->flags);
+ ClearPageDeferred(pg);
if (PageReserved(pg) || !cpu_has_pse || page_private(pg) != 0)
continue;
ClearPagePrivate(pg);

2008-01-03 15:27:07

by Andi Kleen

Subject: [PATCH CPA] [14/28] CPA: Use page granuality TLB flushing in change_page_attr


With the infrastructure added for CLFLUSH it is possible to flush the TLB
only for the actually changed pages in change_page_attr().

Take care of the old Athlon K7 errata regarding large page flushing in the 32bit version.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 15 ++++++++-------
arch/x86/mm/pageattr_64.c | 10 +++++-----
2 files changed, 13 insertions(+), 12 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -97,19 +97,20 @@ static void flush_kernel_map(void *arg)
struct flush_arg *a = (struct flush_arg *)arg;
struct flush *f;

- if (!cpu_has_clflush)
- a->full_flush = 1;
- if (a->full_flush && boot_cpu_data.x86_model >= 4)
+ if ((!cpu_has_clflush || a->full_flush) && boot_cpu_data.x86_model >= 4)
wbinvd();
list_for_each_entry(f, &a->l, l) {
if (!a->full_flush)
clflush_cache_range((void *)f->addr, PAGE_SIZE);
+ if (!a->full_flush)
+ __flush_tlb_one(f->addr);
}

- /* Flush all to work around Errata in early athlons regarding
- * large page flushing.
- */
- __flush_tlb_all();
+ /* Handle errata with flushing large pages in early Athlons */
+ if (a->full_flush ||
+ (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
+ boot_cpu_data.x86 == 7))
+ __flush_tlb_all();
}

static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -92,18 +92,18 @@ static void flush_kernel_map(void *arg)
struct flush_arg *a = (struct flush_arg *)arg;
struct flush *f;

- if (!cpu_has_clflush)
- a->full_flush = 1;
-
/* When clflush is available always use it because it is
much cheaper than WBINVD. */
- if (a->full_flush)
+ if (a->full_flush || !cpu_has_clflush)
asm volatile("wbinvd" ::: "memory");
list_for_each_entry(f, &a->l, l) {
if (!a->full_flush)
clflush_cache_range((void *)f->addr, PAGE_SIZE);
+ if (!a->full_flush)
+ __flush_tlb_one(f->addr);
}
- __flush_tlb_all();
+ if (a->full_flush)
+ __flush_tlb_all();
}

/* both protected by init_mm.mmap_sem */

2008-01-03 15:27:34

by Andi Kleen

Subject: [PATCH CPA] [15/28] CPA: Don't flush the caches when the CPU supports self-snoop


When the self-snoop CPUID bit is set, change_page_attr() only needs to flush
the TLBs, not the caches.

The description of self-snoop in the Intel manuals is a bit vague
but I got confirmation that this is what SS really means.

This should improve c_p_a() performance significantly on newer
Intel CPUs.
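
For reference, SS is CPUID leaf 1, EDX bit 27 (X86_FEATURE_SELFSNOOP); with
the new cpu_has_ss test the cache flush policy becomes roughly (a simplified
sketch of flush_kernel_map(), not the literal code):

if (cpu_has_ss)
        ;                                       /* self-snoop: the TLB flush alone is enough */
else if (cpu_has_clflush && !full_flush)
        clflush_cache_range(addr, PAGE_SIZE);   /* flush only the touched lines */
else
        wbinvd();                               /* last resort: flush everything */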

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 5 +++--
arch/x86/mm/pageattr_64.c | 4 ++--
include/asm-x86/cpufeature.h | 1 +
3 files changed, 6 insertions(+), 4 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -97,10 +97,11 @@ static void flush_kernel_map(void *arg)
struct flush_arg *a = (struct flush_arg *)arg;
struct flush *f;

- if ((!cpu_has_clflush || a->full_flush) && boot_cpu_data.x86_model >= 4)
+ if ((!cpu_has_clflush || a->full_flush) && boot_cpu_data.x86_model >= 4 &&
+ !cpu_has_ss)
wbinvd();
list_for_each_entry(f, &a->l, l) {
- if (!a->full_flush)
+ if (!a->full_flush && !cpu_has_ss)
clflush_cache_range((void *)f->addr, PAGE_SIZE);
if (!a->full_flush)
__flush_tlb_one(f->addr);
Index: linux/include/asm-x86/cpufeature.h
===================================================================
--- linux.orig/include/asm-x86/cpufeature.h
+++ linux/include/asm-x86/cpufeature.h
@@ -167,6 +167,7 @@
#define cpu_has_pebs boot_cpu_has(X86_FEATURE_PEBS)
#define cpu_has_clflush boot_cpu_has(X86_FEATURE_CLFLSH)
#define cpu_has_bts boot_cpu_has(X86_FEATURE_BTS)
+#define cpu_has_ss boot_cpu_has(X86_FEATURE_SELFSNOOP)

#if defined(CONFIG_X86_INVLPG) || defined(CONFIG_X86_64)
# define cpu_has_invlpg 1
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -94,10 +94,10 @@ static void flush_kernel_map(void *arg)

/* When clflush is available always use it because it is
much cheaper than WBINVD. */
- if (a->full_flush || !cpu_has_clflush)
+ if ((a->full_flush || !cpu_has_clflush) && !cpu_has_ss)
asm volatile("wbinvd" ::: "memory");
list_for_each_entry(f, &a->l, l) {
- if (!a->full_flush)
+ if (!a->full_flush && !cpu_has_ss)
clflush_cache_range((void *)f->addr, PAGE_SIZE);
if (!a->full_flush)
__flush_tlb_one(f->addr);

2008-01-03 15:27:50

by Andi Kleen

Subject: [PATCH CPA] [16/28] CPA: Use wbinvd() macro instead of inline assembly in 64bit c_p_a()


Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_64.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -95,7 +95,7 @@ static void flush_kernel_map(void *arg)
/* When clflush is available always use it because it is
much cheaper than WBINVD. */
if ((a->full_flush || !cpu_has_clflush) && !cpu_has_ss)
- asm volatile("wbinvd" ::: "memory");
+ wbinvd();
list_for_each_entry(f, &a->l, l) {
if (!a->full_flush && !cpu_has_ss)
clflush_cache_range((void *)f->addr, PAGE_SIZE);

2008-01-03 15:28:13

by Andi Kleen

Subject: [PATCH CPA] [10/28] CPA: Change kernel_map_pages to not use c_p_a()


CONFIG_DEBUG_PAGEALLOC uses change_page_attr() to map/unmap pages in order
to catch stray kernel accesses. But standard c_p_a() does a lot of unnecessary
work for this simple case with pre-split mappings.

Change kernel_map_pages() to just access the page table directly, which
is simpler and faster.

I also fixed it to use INVLPG if available.

This is required for changes to c_p_a() later that make it use kmalloc. Without
this we would risk infinite recursion. Also in general things are easier when
sleeping is allowed.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 34 ++++++++++++++++++++++++----------
1 file changed, 24 insertions(+), 10 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -258,22 +258,36 @@ void global_flush_tlb(void)
}

#ifdef CONFIG_DEBUG_PAGEALLOC
+/* Map or unmap pages in the kernel direct mapping for kernel debugging. */
void kernel_map_pages(struct page *page, int numpages, int enable)
{
+ unsigned long addr;
+ int i;
+
if (PageHighMem(page))
return;
+ addr = (unsigned long)page_address(page);
if (!enable)
- debug_check_no_locks_freed(page_address(page),
- numpages * PAGE_SIZE);
+ debug_check_no_locks_freed((void *)addr, numpages * PAGE_SIZE);
+
+ /* Bootup has forced 4K pages so this is very simple */
+
+ for (i = 0; i < numpages; i++, addr += PAGE_SIZE, page++) {
+ int level;
+ pte_t *pte = lookup_address(addr, &level);

- /* the return value is ignored - the calls cannot fail,
- * large pages are disabled at boot time.
- */
- change_page_attr(page, numpages, enable ? PAGE_KERNEL : __pgprot(0));
- /* we should perform an IPI and flush all tlbs,
- * but that can deadlock->flush only current cpu.
- */
- __flush_tlb_all();
+ BUG_ON(level != 3);
+ if (enable) {
+ set_pte_atomic(pte, mk_pte(page, PAGE_KERNEL));
+ /*
+ * We should perform an IPI and flush all tlbs,
+ * but that can deadlock->flush only current cpu.
+ */
+ __flush_tlb_one(addr);
+ } else {
+ kpte_clear_flush(pte, addr);
+ }
+ }
}
#endif

2008-01-03 15:28:35

by Andi Kleen

Subject: [PATCH CPA] [9/28] CPA: Add simple self test at boot


Since change_page_attr() is tricky code, it is good to have some regression
test code. This patch maps and unmaps some random pages in the direct mapping
at boot and then dumps the state.

One disadvantage is that the results currently have to be checked
manually; it would be better if it had a global passed/failed message.

Add it with a CONFIG option.

Optional patch, but I find it useful.
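
The test only builds when the new option is selected, e.g. (assuming a
matching CPA_DEBUG Kconfig entry, which is not part of this patch):

CONFIG_CPA_DEBUG=y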

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/Makefile_32 | 1
arch/x86/mm/Makefile_64 | 1
arch/x86/mm/pageattr-test.c | 187 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 189 insertions(+)

Index: linux/arch/x86/mm/Makefile_64
===================================================================
--- linux.orig/arch/x86/mm/Makefile_64
+++ linux/arch/x86/mm/Makefile_64
@@ -7,3 +7,4 @@ obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpag
obj-$(CONFIG_NUMA) += numa_64.o
obj-$(CONFIG_K8_NUMA) += k8topology_64.o
obj-$(CONFIG_ACPI_NUMA) += srat_64.o
+obj-$(CONFIG_CPA_DEBUG) += pageattr-test.o
Index: linux/arch/x86/mm/pageattr-test.c
===================================================================
--- /dev/null
+++ linux/arch/x86/mm/pageattr-test.c
@@ -0,0 +1,187 @@
+/*
+ * self test for change_page_attr.
+ *
+ * Clears the global bit on random pages in the direct mapping, then reverts and
+ * compares page tables forwards and afterwards.
+ */
+
+#include <linux/mm.h>
+#include <linux/random.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/bootmem.h>
+#include <asm/cacheflush.h>
+#include <asm/pgtable.h>
+#include <asm/kdebug.h>
+
+enum {
+ NTEST = 25,
+#ifdef CONFIG_X86_64
+ LOWEST_LEVEL = 4,
+ LPS = (1 << PMD_SHIFT),
+#elif defined(CONFIG_X86_PAE)
+ LOWEST_LEVEL = 3,
+ LPS = (1 << PMD_SHIFT),
+#else
+ LOWEST_LEVEL = 3, /* lookup_address lies here */
+ LPS = (1 << 22),
+#endif
+ GPS = (1<<30)
+};
+
+#ifdef CONFIG_X86_64
+#include <asm/proto.h>
+#define max_mapped end_pfn_map
+#else
+#define max_mapped max_low_pfn
+#endif
+
+struct split_state {
+ long lpg, gpg, spg, nx;
+};
+
+static __init int print_split(struct split_state *s)
+{
+ int printed = 0;
+ long i, expected, missed = 0;
+ s->lpg = s->gpg = s->spg = s->nx = 0;
+ for (i = 0; i < max_mapped; ) {
+ int level;
+ pte_t *pte;
+ unsigned long adr = (unsigned long)__va(i << PAGE_SHIFT);
+
+ pte = lookup_address(adr, &level);
+ if (!pte) {
+ if (!printed) {
+ dump_pagetable(adr);
+ printk("CPA %lx no pte level %d\n", adr, level);
+ printed = 1;
+ }
+ missed++;
+ i++;
+ continue;
+ }
+
+ if (level == 2 && sizeof(long) == 8) {
+ s->gpg++;
+ i += GPS/PAGE_SIZE;
+ } else if (level != LOWEST_LEVEL) {
+ s->lpg++;
+ i += LPS/PAGE_SIZE;
+ } else {
+ s->spg++;
+ i++;
+ }
+ if (pte_val(*pte) & _PAGE_NX)
+ s->nx++;
+ }
+ printk("CPA direct mapping small %lu large %lu giga %lu nx %lu missed %lu\n",
+ s->spg, s->lpg, s->gpg, s->nx, missed);
+ expected = (s->gpg*GPS + s->lpg*LPS)/PAGE_SIZE + s->spg + missed;
+ if (expected != max_mapped) {
+ printk("CPA max_mapped %lu but expected %lu\n",
+ max_mapped, expected);
+ return 1;
+ }
+ return 0;
+}
+
+static __init int state_same(struct split_state *a, struct split_state *b)
+{
+ return a->lpg == b->lpg && a->gpg == b->gpg && a->spg == b->spg &&
+ a->nx == b->nx;
+}
+
+static unsigned long addr[NTEST] __initdata;
+static unsigned len[NTEST] __initdata;
+
+/* Change the global bit on random pages in the direct mapping */
+static __init int exercise_pageattr(void)
+{
+ int i;
+ pte_t *pte;
+ int level;
+ int err;
+ struct split_state sa, sb, sc;
+ int failed = 0;
+
+ printk("CPA exercising pageattr\n");
+ failed += print_split(&sa);
+ srandom32(100);
+ for (i = 0; i < NTEST; i++) {
+ unsigned long pfn = random32() % max_mapped;
+ addr[i] = (unsigned long)__va(pfn << PAGE_SHIFT);
+ len[i] = min_t(unsigned long, random32() % 100, max_mapped - pfn);
+ pte = lookup_address(addr[i], &level);
+ if (!pte) {
+ printk("lookup of %lx failed\n", addr[i]);
+ failed++;
+ continue;
+ }
+ if (pgprot_val(pte_pgprot(*pte)) == 0) {
+ addr[i] = 0;
+ continue;
+ }
+
+ err = change_page_attr(virt_to_page(addr[i]), len[i],
+ pte_pgprot(pte_clrhuge(pte_clrglobal(*pte))));
+ if (err < 0) {
+ printk("cpa %d failed %d\n", i, err);
+ failed++;
+ }
+
+ pte = lookup_address(addr[i], &level);
+ if (!pte || pte_global(*pte) || pte_huge(*pte)) {
+ printk("%lx: bad pte %Lx\n", addr[i],
+ pte ? (u64)pte_val(*pte) : 0ULL);
+ failed++;
+ }
+ if (level != LOWEST_LEVEL) {
+ printk("%lx: unexpected level %d\n", addr[i], level);
+ failed++;
+ }
+
+ }
+ global_flush_tlb();
+
+ failed += print_split(&sb);
+
+ printk("CPA reverting everything\n");
+ for (i = 0; i < NTEST; i++) {
+ if (!addr[i])
+ continue;
+ pte = lookup_address(addr[i], &level);
+ if (!pte) {
+ printk("lookup of %lx failed\n", addr[i]);
+ failed++;
+ continue;
+ }
+ err = change_page_attr(virt_to_page(addr[i]), len[i],
+ pte_pgprot(pte_mkglobal(*pte)));
+ if (err < 0) {
+ printk("cpa2 failed: %d\n", err);
+ failed++;
+ }
+ pte = lookup_address(addr[i], &level);
+ if (!pte || !pte_global(*pte)) {
+ printk("%lx: bad pte %Lx\n", addr[i],
+ pte ? (u64)pte_val(*pte) : 0ULL);
+ failed++;
+ }
+
+ }
+ global_flush_tlb();
+
+ failed += print_split(&sc);
+ if (!state_same(&sa, &sc))
+ failed++;
+
+ if (failed)
+ printk("CPA selftests NOT PASSED. Please report.\n");
+ else
+ printk("CPA selftests PASSED\n");
+
+ return 0;
+}
+
+module_init(exercise_pageattr);
Index: linux/arch/x86/mm/Makefile_32
===================================================================
--- linux.orig/arch/x86/mm/Makefile_32
+++ linux/arch/x86/mm/Makefile_32
@@ -8,3 +8,4 @@ obj-$(CONFIG_NUMA) += discontig_32.o
obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
obj-$(CONFIG_HIGHMEM) += highmem_32.o
obj-$(CONFIG_BOOT_IOREMAP) += boot_ioremap_32.o
+obj-$(CONFIG_CPA_DEBUG) += pageattr-test.o

2008-01-03 15:28:52

by Andi Kleen

Subject: [PATCH CPA] [17/28] CPA: Reorder TLB / cache flushes to follow Intel recommendation


Intel recommends flushing the TLBs first and the caches second on
caching attribute changes. c_p_a() previously did it the other way
round. Reorder that.

The procedure is still not fully compliant with the Intel documentation
because Intel recommends an all-CPU synchronization step between
the TLB flushes and the cache flushes.

However, on all new Intel CPUs this is now meaningless anyway because
they support Self-Snoop and can skip the cache flush step entirely.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 13 ++++++++++---
arch/x86/mm/pageattr_64.c | 14 ++++++++++----
2 files changed, 20 insertions(+), 7 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -97,9 +97,6 @@ static void flush_kernel_map(void *arg)
struct flush_arg *a = (struct flush_arg *)arg;
struct flush *f;

- if ((!cpu_has_clflush || a->full_flush) && boot_cpu_data.x86_model >= 4 &&
- !cpu_has_ss)
- wbinvd();
list_for_each_entry(f, &a->l, l) {
if (!a->full_flush && !cpu_has_ss)
clflush_cache_range((void *)f->addr, PAGE_SIZE);
@@ -112,6 +109,16 @@ static void flush_kernel_map(void *arg)
(boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
boot_cpu_data.x86 == 7))
__flush_tlb_all();
+
+ /*
+ * RED-PEN: Intel documentation ask for a CPU synchronization step
+ * here and in the loop. But it is moot on Self-Snoop CPUs anyways.
+ */
+
+ if ((!cpu_has_clflush || a->full_flush) &&
+ !cpu_has_ss && boot_cpu_data.x86_model >= 4)
+ wbinvd();
+
}

static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -94,16 +94,22 @@ static void flush_kernel_map(void *arg)

/* When clflush is available always use it because it is
much cheaper than WBINVD. */
- if ((a->full_flush || !cpu_has_clflush) && !cpu_has_ss)
- wbinvd();
list_for_each_entry(f, &a->l, l) {
- if (!a->full_flush && !cpu_has_ss)
- clflush_cache_range((void *)f->addr, PAGE_SIZE);
if (!a->full_flush)
__flush_tlb_one(f->addr);
+ if (!a->full_flush && !cpu_has_ss)
+ clflush_cache_range((void *)f->addr, PAGE_SIZE);
}
if (a->full_flush)
__flush_tlb_all();
+
+ /*
+ * RED-PEN: Intel documentation ask for a CPU synchronization step
+ * here and in the loop. But it is moot on Self-Snoop CPUs anyways.
+ */
+
+ if ((!cpu_has_clflush || a->full_flush) && !cpu_has_ss)
+ wbinvd();
}

/* both protected by init_mm.mmap_sem */

2008-01-03 15:29:14

by Andi Kleen

Subject: [PATCH CPA] [8/28] CPA: Return the page table level in lookup_address()


Needed for the next change.

And change all the callers.
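
Callers then look roughly like this (a sketch; the level numbering used here
is 2 = large pmd / 3 = 4k pte on 32bit and 3 = large pmd / 4 = 4k pte on
64bit; inspect() is a placeholder, not a kernel function):

int level;
pte_t *pte = lookup_address(address, &level);

if (pte && level == 3)          /* 32bit: a real 4k pte, not a large pmd */
        inspect(pte);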

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/fault_32.c | 3 ++-
arch/x86/mm/init_32.c | 3 ++-
arch/x86/mm/pageattr_32.c | 7 +++++--
arch/x86/mm/pageattr_64.c | 7 +++++--
arch/x86/xen/mmu.c | 9 ++++++---
include/asm-x86/pgtable_32.h | 2 +-
include/asm-x86/pgtable_64.h | 2 +-
7 files changed, 22 insertions(+), 11 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -13,7 +13,7 @@
#include <asm/tlbflush.h>
#include <asm/io.h>

-pte_t *lookup_address(unsigned long address)
+pte_t *lookup_address(unsigned long address, int *level)
{
pgd_t *pgd = pgd_offset_k(address);
pud_t *pud;
@@ -27,8 +27,10 @@ pte_t *lookup_address(unsigned long addr
pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
return NULL;
+ *level = 3;
if (pmd_large(*pmd))
return (pte_t *)pmd;
+ *level = 4;
pte = pte_offset_kernel(pmd, address);
if (pte && !pte_present(*pte))
pte = NULL;
@@ -129,8 +131,9 @@ __change_page_attr(unsigned long address
pte_t *kpte;
struct page *kpte_page;
pgprot_t ref_prot2;
+ int level;

- kpte = lookup_address(address);
+ kpte = lookup_address(address, &level);
if (!kpte) return 0;
kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK);
BUG_ON(PageLRU(kpte_page));
Index: linux/include/asm-x86/pgtable_64.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_64.h
+++ linux/include/asm-x86/pgtable_64.h
@@ -409,7 +409,7 @@ extern struct list_head pgd_list;

extern int kern_addr_valid(unsigned long addr);

-pte_t *lookup_address(unsigned long addr);
+pte_t *lookup_address(unsigned long addr, int *level);

#define io_remap_pfn_range(vma, vaddr, pfn, size, prot) \
remap_pfn_range(vma, vaddr, pfn, size, prot)
Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -18,7 +18,7 @@ static DEFINE_SPINLOCK(cpa_lock);
static struct list_head df_list = LIST_HEAD_INIT(df_list);


-pte_t *lookup_address(unsigned long address)
+pte_t *lookup_address(unsigned long address, int *level)
{
pgd_t *pgd = pgd_offset_k(address);
pud_t *pud;
@@ -31,8 +31,10 @@ pte_t *lookup_address(unsigned long addr
pmd = pmd_offset(pud, address);
if (pmd_none(*pmd))
return NULL;
+ *level = 2;
if (pmd_large(*pmd))
return (pte_t *)pmd;
+ *level = 3;
return pte_offset_kernel(pmd, address);
}

@@ -148,11 +150,12 @@ __change_page_attr(struct page *page, pg
pte_t *kpte;
unsigned long address;
struct page *kpte_page;
+ int level;

BUG_ON(PageHighMem(page));
address = (unsigned long)page_address(page);

- kpte = lookup_address(address);
+ kpte = lookup_address(address, &level);
if (!kpte)
return -EINVAL;
kpte_page = virt_to_page(kpte);
Index: linux/include/asm-x86/pgtable_32.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_32.h
+++ linux/include/asm-x86/pgtable_32.h
@@ -443,7 +443,7 @@ static inline pte_t pte_modify(pte_t pte
* NOTE: the return type is pte_t but if the pmd is PSE then we return it
* as a pte too.
*/
-extern pte_t *lookup_address(unsigned long address);
+extern pte_t *lookup_address(unsigned long address, int *level);

/*
* Make a given kernel text page executable/non-executable.
Index: linux/arch/x86/mm/fault_32.c
===================================================================
--- linux.orig/arch/x86/mm/fault_32.c
+++ linux/arch/x86/mm/fault_32.c
@@ -643,7 +643,8 @@ no_context:

#ifdef CONFIG_X86_PAE
if (error_code & 16) {
- pte_t *pte = lookup_address(address);
+ int level;
+ pte_t *pte = lookup_address(address, &level);

if (pte && pte_present(*pte) && !pte_exec_kernel(*pte))
printk(KERN_CRIT "kernel tried to execute "
Index: linux/arch/x86/mm/init_32.c
===================================================================
--- linux.orig/arch/x86/mm/init_32.c
+++ linux/arch/x86/mm/init_32.c
@@ -528,11 +528,12 @@ int __init set_kernel_exec(unsigned long
{
pte_t *pte;
int ret = 1;
+ int level;

if (!nx_enabled)
goto out;

- pte = lookup_address(vaddr);
+ pte = lookup_address(vaddr, &level);
BUG_ON(!pte);

if (!pte_exec_kernel(*pte))
Index: linux/arch/x86/xen/mmu.c
===================================================================
--- linux.orig/arch/x86/xen/mmu.c
+++ linux/arch/x86/xen/mmu.c
@@ -58,7 +58,8 @@

xmaddr_t arbitrary_virt_to_machine(unsigned long address)
{
- pte_t *pte = lookup_address(address);
+ int level;
+ pte_t *pte = lookup_address(address, &level);
unsigned offset = address & PAGE_MASK;

BUG_ON(pte == NULL);
@@ -70,8 +71,9 @@ void make_lowmem_page_readonly(void *vad
{
pte_t *pte, ptev;
unsigned long address = (unsigned long)vaddr;
+ int level;

- pte = lookup_address(address);
+ pte = lookup_address(address, &level);
BUG_ON(pte == NULL);

ptev = pte_wrprotect(*pte);
@@ -84,8 +86,9 @@ void make_lowmem_page_readwrite(void *va
{
pte_t *pte, ptev;
unsigned long address = (unsigned long)vaddr;
+ int level;

- pte = lookup_address(address);
+ pte = lookup_address(address, &level);
BUG_ON(pte == NULL);

ptev = pte_mkwrite(*pte);

2008-01-03 15:29:34

by Andi Kleen

[permalink] [raw]
Subject: [PATCH CPA] [18/28] CPA: Make change_page_attr() more robust against use of PAT bits


Use the page table level instead of the PSE bit to check whether the PTE
refers to a 4K page. This makes the code more robust when the PAT bit
is used, because on 4K pages the PAT bit sits in the same position as the
PSE bit.
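
In other words bit 7 is overloaded, so testing it through pte_huge()
becomes ambiguous once PAT is programmed. A small sketch, using the bit
definitions added in the next patch:

	/*
	 * Bit 7 of a page table entry means different things by level:
	 *   large (pmd-level) entry : _PAGE_PSE
	 *   4K pte                  : _PAGE_PAT
	 * pte_huge() tests bit 7, so it would also fire on a 4K pte whose
	 * PAT bit happens to be set. The level returned by lookup_address()
	 * does not have that ambiguity:
	 */
	if (level == 4)			/* 4K pte on 64bit (3 on 32bit) */
		set_pte(kpte, pfn_pte(pfn, prot));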

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 4 ++--
arch/x86/mm/pageattr_64.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -209,7 +209,7 @@ __change_page_attr(struct page *page, pg
set_tlb_flush(address);

if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
- if (!pte_huge(*kpte)) {
+ if (level == 3) {
set_pte_atomic(kpte, mk_pte(page, prot));
} else {
pgprot_t ref_prot;
@@ -225,7 +225,7 @@ __change_page_attr(struct page *page, pg
kpte_page = split;
}
page_private(kpte_page)++;
- } else if (!pte_huge(*kpte)) {
+ } else if (level == 3) {
set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
BUG_ON(page_private(kpte_page) == 0);
page_private(kpte_page)--;
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -184,7 +184,7 @@ __change_page_attr(unsigned long address
set_tlb_flush(address);

if (pgprot_val(prot) != pgprot_val(ref_prot)) {
- if (!pte_huge(*kpte)) {
+ if (level == 4) {
set_pte(kpte, pfn_pte(pfn, prot));
} else {
/*
@@ -201,7 +201,7 @@ __change_page_attr(unsigned long address
kpte_page = split;
}
page_private(kpte_page)++;
- } else if (!pte_huge(*kpte)) {
+ } else if (level == 4) {
set_pte(kpte, pfn_pte(pfn, ref_prot));
BUG_ON(page_private(kpte_page) == 0);
page_private(kpte_page)--;

2008-01-03 15:29:50

by Andi Kleen

[permalink] [raw]
Subject: [PATCH CPA] [19/28] CPA: Limit cache flushing to pages that really change caching


Previously change_page_attr() always flushed caches, even for
pages where only a non-caching-related attribute changed (like RO for
read/write protection).

This changes the flush code to only flush the cache when the
caching attributes actually change.

I made some effort to already handle reprogrammed PAT bits, although this
is not strictly needed by the core kernel right now (but that will
probably change soon).

This will only make a difference on AMD CPUs or older Intel CPUs,
because all newer Intel CPUs support Self-Snoop and do not require
this cache flushing anyway.

Another advantage of this patch is that it prevents recursive
slab calls with slab debugging enabled.
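
In caller terms the new decision condenses to the following (names as
added by the hunks below):

	int cache = cache_attr_changed(*kpte, prot, level);

	/* cache == 0: only a protection bit changed (e.g. _PAGE_RW),
	 *	       flushing the TLB entry for this address is enough.
	 * cache == 1: PCD/PWT/PAT changed, so the cache lines backing
	 *	       the page have to be flushed as well. */
	set_tlb_flush(address, cache);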

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 43 +++++++++++++++++++++++++++++++++----------
arch/x86/mm/pageattr_64.c | 42 ++++++++++++++++++++++++++++++++++--------
include/asm-x86/pgtable_32.h | 8 ++++++++
include/asm-x86/pgtable_64.h | 8 ++++++++
4 files changed, 83 insertions(+), 18 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -13,7 +13,10 @@
#include <asm/tlbflush.h>
#include <asm/io.h>

+enum flush_mode { FLUSH_NONE, FLUSH_CACHE, FLUSH_TLB };
+
struct flush {
+ enum flush_mode mode;
struct list_head l;
unsigned long addr;
};
@@ -76,7 +79,7 @@ static struct page *split_large_page(uns
}

struct flush_arg {
- int full_flush;
+ enum flush_mode full_flush;
struct list_head l;
};

@@ -91,14 +94,19 @@ static void flush_kernel_map(void *arg)
{
struct flush_arg *a = (struct flush_arg *)arg;
struct flush *f;
+ int cache_flush = a->full_flush == FLUSH_CACHE;

/* When clflush is available always use it because it is
much cheaper than WBINVD. */
list_for_each_entry(f, &a->l, l) {
if (!a->full_flush)
__flush_tlb_one(f->addr);
- if (!a->full_flush && !cpu_has_ss)
- clflush_cache_range((void *)f->addr, PAGE_SIZE);
+ if (f->mode == FLUSH_CACHE && !cpu_has_ss) {
+ if (cpu_has_clflush)
+ clflush_cache_range((void *)f->addr, PAGE_SIZE);
+ else
+ cache_flush++;
+ }
}
if (a->full_flush)
__flush_tlb_all();
@@ -108,12 +116,12 @@ static void flush_kernel_map(void *arg)
* here and in the loop. But it is moot on Self-Snoop CPUs anyways.
*/

- if ((!cpu_has_clflush || a->full_flush) && !cpu_has_ss)
+ if (cache_flush > 0 && !cpu_has_ss)
wbinvd();
}

/* both protected by init_mm.mmap_sem */
-static int full_flush;
+static enum flush_mode full_flush;
static LIST_HEAD(deferred_pages);
static LIST_HEAD(flush_pages);

@@ -155,17 +163,35 @@ static void revert_page(unsigned long ad
* data structure to keep track of the flush. This has the added bonus that it will
* work for MMIO holes without mem_map too.
*/
-static void set_tlb_flush(unsigned long address)
+static void set_tlb_flush(unsigned long address, int cache)
{
+ enum flush_mode mode = cache ? FLUSH_CACHE : FLUSH_TLB;
struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);
if (!f) {
- full_flush = 1;
+ if (full_flush < mode)
+ full_flush = mode;
return;
}
+
f->addr = address;
+ f->mode = mode;
list_add_tail(&f->l, &flush_pages);
}

+static unsigned short pat_bit[5] = {
+ [4] = _PAGE_PAT,
+ [3] = _PAGE_PAT_LARGE,
+};
+
+static int cache_attr_changed(pte_t pte, pgprot_t prot, int level)
+{
+ unsigned a = pte_val(pte) & (_PAGE_PCD|_PAGE_PWT);
+ /* Correct for PAT bit being in different position on large pages */
+ if (pte_val(pte) & pat_bit[level])
+ a |= _PAGE_PAT;
+ return a != (pgprot_val(prot) & _PAGE_CACHE);
+}
+
static int
__change_page_attr(unsigned long address, unsigned long pfn, pgprot_t prot,
pgprot_t ref_prot)
@@ -181,7 +207,7 @@ __change_page_attr(unsigned long address
BUG_ON(PageCompound(kpte_page));
BUG_ON(PageLRU(kpte_page));

- set_tlb_flush(address);
+ set_tlb_flush(address, cache_attr_changed(*kpte, prot, level));

if (pgprot_val(prot) != pgprot_val(ref_prot)) {
if (level == 4) {
Index: linux/include/asm-x86/pgtable_64.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_64.h
+++ linux/include/asm-x86/pgtable_64.h
@@ -150,10 +150,15 @@ static inline pte_t ptep_get_and_clear_f
#define _PAGE_BIT_ACCESSED 5
#define _PAGE_BIT_DIRTY 6
#define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page */
+#define _PAGE_BIT_PAT 7 /* Only on 4kb pages */
+#define _PAGE_BIT_PAT_LARGE 12 /* Only on large pages */
#define _PAGE_BIT_FILE 6
#define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */
#define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */

+/* Needs special handling for large pages */
+#define _PAGE_CACHE (_PAGE_PCD|_PAGE_PWT|_PAGE_PAT)
+
#define _PAGE_PRESENT (_AC(1, UL)<<_PAGE_BIT_PRESENT)
#define _PAGE_RW (_AC(1, UL)<<_PAGE_BIT_RW)
#define _PAGE_USER (_AC(1, UL)<<_PAGE_BIT_USER)
@@ -171,6 +176,9 @@ static inline pte_t ptep_get_and_clear_f
#define _PAGE_PROTNONE 0x080 /* If not present */
#define _PAGE_NX (_AC(1, UL)<<_PAGE_BIT_NX)

+#define _PAGE_PAT (_AC(1,UL) << _PAGE_BIT_PAT)
+#define _PAGE_PAT_LARGE (_AC(1,UL) << _PAGE_BIT_PAT_LARGE)
+
#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED | _PAGE_DIRTY)
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -21,12 +21,15 @@
#define ClearPageDeferred(p) clear_bit(PG_deferred, &(p)->flags)
#define TestSetPageDeferred(p) test_and_set_bit(PG_deferred, &(p)->flags)

+enum flush_mode { FLUSH_NONE, FLUSH_CACHE, FLUSH_TLB };
+
/* Variables protected by init_mm.mmap_sem */
-static int full_flush;
+static enum flush_mode full_flush;
static struct list_head df_list = LIST_HEAD_INIT(df_list);
static LIST_HEAD(flush_pages);

struct flush {
+ enum flush_mode mode;
struct list_head l;
unsigned long addr;
};
@@ -81,7 +84,7 @@ static struct page *split_large_page(uns
}

struct flush_arg {
- int full_flush;
+ enum flush_mode full_flush;
struct list_head l;
};

@@ -96,12 +99,17 @@ static void flush_kernel_map(void *arg)
{
struct flush_arg *a = (struct flush_arg *)arg;
struct flush *f;
+ int cache_flush = a->full_flush == FLUSH_CACHE;

list_for_each_entry(f, &a->l, l) {
- if (!a->full_flush && !cpu_has_ss)
- clflush_cache_range((void *)f->addr, PAGE_SIZE);
if (!a->full_flush)
__flush_tlb_one(f->addr);
+ if (f->mode == FLUSH_CACHE && !cpu_has_ss) {
+ if (cpu_has_clflush)
+ clflush_cache_range((void *)f->addr, PAGE_SIZE);
+ else
+ cache_flush++;
+ }
}

/* Handle errata with flushing large pages in early Athlons */
@@ -115,10 +123,8 @@ static void flush_kernel_map(void *arg)
* here and in the loop. But it is moot on Self-Snoop CPUs anyways.
*/

- if ((!cpu_has_clflush || a->full_flush) &&
- !cpu_has_ss && boot_cpu_data.x86_model >= 4)
+ if (cache_flush > 0 && !cpu_has_ss && boot_cpu_data.x86_model >= 4)
wbinvd();
-
}

static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
@@ -169,6 +175,20 @@ static inline void save_page(struct page
list_add(&kpte_page->lru, &df_list);
}

+static unsigned short pat_bit[4] = {
+ [3] = _PAGE_PAT,
+ [2] = _PAGE_PAT_LARGE,
+};
+
+static int cache_attr_changed(pte_t pte, pgprot_t prot, int level)
+{
+ unsigned long a = pte_val(pte) & (_PAGE_PCD|_PAGE_PWT);
+ /* Correct for PAT bit being in different position on large pages */
+ if (pte_val(pte) & pat_bit[level])
+ a |= _PAGE_PAT;
+ return a != (pgprot_val(prot) & _PAGE_CACHE);
+}
+
/*
* Mark the address for flushing later in global_tlb_flush().
*
@@ -177,14 +197,17 @@ static inline void save_page(struct page
* data structure to keep track of the flush. This has the added bonus that it will
* work for MMIO holes without mem_map too.
*/
-static void set_tlb_flush(unsigned long address)
+static void set_tlb_flush(unsigned long address, int cache)
{
+ enum flush_mode mode = cache ? FLUSH_CACHE : FLUSH_TLB;
struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);
if (!f) {
- full_flush = 1;
+ if (full_flush < mode)
+ full_flush = mode;
return;
}
f->addr = address;
+ f->mode = mode;
list_add_tail(&f->l, &flush_pages);
}

@@ -206,7 +229,7 @@ __change_page_attr(struct page *page, pg
BUG_ON(PageLRU(kpte_page));
BUG_ON(PageCompound(kpte_page));

- set_tlb_flush(address);
+ set_tlb_flush(address, cache_attr_changed(*kpte, prot, level));

if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
if (level == 3) {
Index: linux/include/asm-x86/pgtable_32.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_32.h
+++ linux/include/asm-x86/pgtable_32.h
@@ -97,6 +97,8 @@ void paging_init(void);
#define _PAGE_BIT_ACCESSED 5
#define _PAGE_BIT_DIRTY 6
#define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page, Pentium+, if present.. */
+#define _PAGE_BIT_PAT 7 /* only on 4K pages */
+#define _PAGE_BIT_PAT_LARGE 12 /* only on large pages */
#define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */
#define _PAGE_BIT_UNUSED1 9 /* available for programmer */
#define _PAGE_BIT_UNUSED2 10
@@ -126,6 +128,12 @@ void paging_init(void);
#define _PAGE_NX 0
#endif

+#define _PAGE_PAT (1U << _PAGE_BIT_PAT)
+#define _PAGE_PAT_LARGE (1U << _PAGE_BIT_PAT_LARGE)
+
+/* Needs special handling for large pages */
+#define _PAGE_CACHE (_PAGE_PWT|_PAGE_PCD|_PAGE_PAT)
+
#define _PAGE_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_USER | _PAGE_ACCESSED | _PAGE_DIRTY)
#define _KERNPG_TABLE (_PAGE_PRESENT | _PAGE_RW | _PAGE_ACCESSED | _PAGE_DIRTY)
#define _PAGE_CHG_MASK (PTE_MASK | _PAGE_ACCESSED | _PAGE_DIRTY)

2008-01-03 15:30:12

by Andi Kleen

[permalink] [raw]
Subject: [PATCH CPA] [20/28] CPA: Fix inaccurate comments in 64bit change_page_attr()


No code changes.
Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_64.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -120,7 +120,7 @@ static void flush_kernel_map(void *arg)
wbinvd();
}

-/* both protected by init_mm.mmap_sem */
+/* All protected by init_mm.mmap_sem */
static enum flush_mode full_flush;
static LIST_HEAD(deferred_pages);
static LIST_HEAD(flush_pages);
@@ -132,7 +132,7 @@ static inline void save_page(struct page
}

/*
- * No more special protections in this 2/4MB area - revert to a
+ * No more special protections in this 2MB area - revert to a
* large page again.
*/
static void revert_page(unsigned long address, pgprot_t ref_prot)

2008-01-03 15:30:35

by Andi Kleen

[permalink] [raw]
Subject: [PATCH CPA] [21/28] CPA: Dump pagetable when inconsistency is detected


When c_p_a() detects an inconsistency in the kernel page tables
it BUGs. When this happens, dump the page table first to avoid one
bug-reporting round trip.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 11 ++++++++++-
arch/x86/mm/pageattr_64.c | 11 ++++++++++-
2 files changed, 20 insertions(+), 2 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -12,6 +12,7 @@
#include <asm/processor.h>
#include <asm/tlbflush.h>
#include <asm/io.h>
+#include <asm/kdebug.h>

enum flush_mode { FLUSH_NONE, FLUSH_CACHE, FLUSH_TLB };

@@ -231,8 +232,16 @@ __change_page_attr(unsigned long address
set_pte(kpte, pfn_pte(pfn, ref_prot));
BUG_ON(page_private(kpte_page) == 0);
page_private(kpte_page)--;
- } else
+ } else {
+ /*
+ * When you're here you either did set the same page to PAGE_KERNEL
+ * two times in a row or the page table reference counting is broken again.
+ * To catch the later bug for now (sorry)
+ */
+ printk(KERN_ERR "address %lx\n", address);
+ dump_pagetable(address);
BUG();
+ }

/* on x86-64 the direct mapping set at boot is not using 4k pages */
BUG_ON(PageReserved(kpte_page));
Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -13,6 +13,7 @@
#include <asm/tlbflush.h>
#include <asm/pgalloc.h>
#include <asm/sections.h>
+#include <asm/kdebug.h>

#define PG_deferred PG_arch_1

@@ -252,8 +253,16 @@ __change_page_attr(struct page *page, pg
set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
BUG_ON(page_private(kpte_page) == 0);
page_private(kpte_page)--;
- } else
+ } else {
+ /*
+ * When you're here you either did set the same page to PAGE_KERNEL
+ * two times in a row or the page table reference counting is broken again.
+ * To catch the later bug for now (sorry)
+ */
+ printk(KERN_ERR "address %lx\n", address);
+ dump_pagetable(address);
BUG();
+ }

/*
* If the pte was reserved, it means it was created at boot

2008-01-03 15:30:51

by Andi Kleen

[permalink] [raw]
Subject: [PATCH CPA] [22/28] CPA: Only queue actually unused page table pages for freeing


With the separate data structure added for flushing earlier, save_page()
now only needs to be called on pte pages that have already been reverted.

Also move all freeing checks into the caller.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 4 +---
arch/x86/mm/pageattr_64.c | 7 +++----
2 files changed, 4 insertions(+), 7 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -270,9 +270,9 @@ __change_page_attr(struct page *page, pg
* replace it with a largepage.
*/

- save_page(kpte_page);
if (!PageReserved(kpte_page)) {
if (cpu_has_pse && (page_private(kpte_page) == 0)) {
+ save_page(kpte_page);
paravirt_release_pt(page_to_pfn(kpte_page));
revert_page(kpte_page, address);
}
@@ -333,8 +333,6 @@ void global_flush_tlb(void)
list_for_each_entry_safe(pg, next, &free_pages, lru) {
list_del(&pg->lru);
ClearPageDeferred(pg);
- if (PageReserved(pg) || !cpu_has_pse || page_private(pg) != 0)
- continue;
ClearPagePrivate(pg);
__free_page(pg);
}
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -246,9 +246,10 @@ __change_page_attr(unsigned long address
/* on x86-64 the direct mapping set at boot is not using 4k pages */
BUG_ON(PageReserved(kpte_page));

- save_page(kpte_page);
- if (page_private(kpte_page) == 0)
+ if (page_private(kpte_page) == 0) {
+ save_page(kpte_page);
revert_page(address, ref_prot);
+ }
return 0;
}

@@ -336,8 +337,6 @@ void global_flush_tlb(void)
list_for_each_entry_safe(pg, next, &free_pages, lru) {
list_del(&pg->lru);
ClearPageDeferred(pg);
- if (page_private(pg) != 0)
- continue;
ClearPagePrivate(pg);
__free_page(pg);
}

2008-01-03 15:31:08

by Andi Kleen

[permalink] [raw]
Subject: [PATCH CPA] [23/28] CPA: Remove unnecessary masking of address


virt_to_page() does not care about the bits below page granularity,
so don't mask them.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_64.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -204,7 +204,7 @@ __change_page_attr(unsigned long address

kpte = lookup_address(address, &level);
if (!kpte) return 0;
- kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK);
+ kpte_page = virt_to_page(kpte);
BUG_ON(PageCompound(kpte_page));
BUG_ON(PageLRU(kpte_page));

2008-01-03 15:31:30

by Andi Kleen

[permalink] [raw]
Subject: [PATCH CPA] [24/28] CPA: Only unmap kernel init pages in text mapping when CONFIG_DEBUG_RODATA is set


Otherwise the kernel will likely always run with 4K pages instead of 2MB pages,
which is costly in terms of TLBs.

Also optimize it a little by using only a single change_page_attr() call.
This is particularly useful when debugging is enabled inside it, because it
spams the logs much less.

Cc: [email protected]

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/init_64.c | 8 +++++---
1 file changed, 5 insertions(+), 3 deletions(-)

Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -546,13 +546,15 @@ void free_init_pages(char *what, unsigne
init_page_count(virt_to_page(addr));
memset((void *)(addr & ~(PAGE_SIZE-1)),
POISON_FREE_INITMEM, PAGE_SIZE);
- if (addr >= __START_KERNEL_map)
- change_page_attr_addr(addr, 1, __pgprot(0));
free_page(addr);
totalram_pages++;
}
- if (addr > __START_KERNEL_map)
+#ifdef CONFIG_DEBUG_RODATA
+ if (begin >= __START_KERNEL_map) {
+ change_page_attr_addr(begin, (end - begin)/PAGE_SIZE, __pgprot(0));
global_flush_tlb();
+ }
+#endif
}

void free_initmem(void)

2008-01-03 15:31:45

by Andi Kleen

[permalink] [raw]
Subject: [PATCH CPA] [25/28] CPA: Always do full TLB flush when splitting large pages


Various CPUs have errata when using INVLPG to flush large pages;
this includes Intel Penryn (AV2) and AMD K7 (#16 in Athlon 4).
While these trigger only in specific circumstances, it is still
a little risky and it's not clear the kernel can avoid them all.

Avoid this can of worms by always flushing the full TLB (but
not the full cache) when splitting a large page. This should
not be that expensive anyway, and the initial splitting should
hopefully be infrequent.

This also removes the previously hard-coded workaround for the K7
Athlon on 32bit.
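
The workaround itself boils down to one extra step in set_tlb_flush(),
shown here condensed from the hunks below:

	/*
	 * When we are about to split a large page, never rely on INVLPG
	 * for the old large translation - escalate the pending flush to
	 * a full TLB flush instead.
	 */
	if (large && full_flush < FLUSH_TLB)
		full_flush = FLUSH_TLB;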

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 17 +++++++++++------
arch/x86/mm/pageattr_64.c | 11 +++++++++--
2 files changed, 20 insertions(+), 8 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -113,10 +113,7 @@ static void flush_kernel_map(void *arg)
}
}

- /* Handle errata with flushing large pages in early Athlons */
- if (a->full_flush ||
- (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
- boot_cpu_data.x86 == 7))
+ if (a->full_flush)
__flush_tlb_all();

/*
@@ -198,7 +195,7 @@ static int cache_attr_changed(pte_t pte,
* data structure to keep track of the flush. This has the added bonus that it will
* work for MMIO holes without mem_map too.
*/
-static void set_tlb_flush(unsigned long address, int cache)
+static void set_tlb_flush(unsigned long address, int cache, int large)
{
enum flush_mode mode = cache ? FLUSH_CACHE : FLUSH_TLB;
struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);
@@ -210,6 +207,13 @@ static void set_tlb_flush(unsigned long
f->addr = address;
f->mode = mode;
list_add_tail(&f->l, &flush_pages);
+
+ /*
+ * Work around large page INVLPG bugs in early K7 and in Penryn.
+ * When we split a large page always do a full TLB flush.
+ */
+ if (large && full_flush < FLUSH_TLB)
+ full_flush = FLUSH_TLB;
}

static int
@@ -230,7 +234,8 @@ __change_page_attr(struct page *page, pg
BUG_ON(PageLRU(kpte_page));
BUG_ON(PageCompound(kpte_page));

- set_tlb_flush(address, cache_attr_changed(*kpte, prot, level));
+ set_tlb_flush(address, cache_attr_changed(*kpte, prot, level),
+ level < 3);

if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
if (level == 3) {
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -164,7 +164,7 @@ static void revert_page(unsigned long ad
* data structure to keep track of the flush. This has the added bonus that it will
* work for MMIO holes without mem_map too.
*/
-static void set_tlb_flush(unsigned long address, int cache)
+static void set_tlb_flush(unsigned long address, int cache, int large)
{
enum flush_mode mode = cache ? FLUSH_CACHE : FLUSH_TLB;
struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);
@@ -177,6 +177,13 @@ static void set_tlb_flush(unsigned long
f->addr = address;
f->mode = mode;
list_add_tail(&f->l, &flush_pages);
+
+ /*
+ * Work around large page INVLPG bugs in early K7 and in Penryn.
+ * When we split a large page always do a full TLB flush.
+ */
+ if (large && full_flush < FLUSH_TLB)
+ full_flush = FLUSH_TLB;
}

static unsigned short pat_bit[5] = {
@@ -208,7 +215,7 @@ __change_page_attr(unsigned long address
BUG_ON(PageCompound(kpte_page));
BUG_ON(PageLRU(kpte_page));

- set_tlb_flush(address, cache_attr_changed(*kpte, prot, level));
+ set_tlb_flush(address, cache_attr_changed(*kpte, prot, level), level < 4);

if (pgprot_val(prot) != pgprot_val(ref_prot)) {
if (level == 4) {

2008-01-03 15:32:04

by Andi Kleen

[permalink] [raw]
Subject: [PATCH CPA] [26/28] CPA: Fix reference counting when changing already changed pages


When changing a page that already has non-standard attributes, don't
increase the reference count again. And when changing a page back,
only decrease the ref count if the old attributes were non-standard.
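
The invariant, as restored by the hunks below, is that page_private() of
the kpte_page counts how many 4K ptes in that page table page currently
carry non-standard protections; only transitions across the
standard/non-standard boundary may touch the count. Condensed sketch:

	/* oldprot  = current protections of the pte
	 * ref_prot = standard protections for this address
	 *	      (PAGE_KERNEL, or PAGE_KERNEL_EXEC for text)
	 *
	 * standard     -> non-standard : count++
	 * non-standard -> non-standard : count unchanged (was ++ before)
	 * non-standard -> standard     : count--
	 * standard     -> standard     : caller bug, BUG()
	 */
	if (pgprot_val(prot) != pgprot_val(ref_prot)) {
		if (pgprot_val(oldprot) == pgprot_val(ref_prot))
			page_private(kpte_page)++;
	} else if (pgprot_val(oldprot) != pgprot_val(ref_prot)) {
		BUG_ON(page_private(kpte_page) <= 0);
		page_private(kpte_page)--;
	}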

Cc: [email protected]
Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 36 +++++++++++++++++++-----------------
arch/x86/mm/pageattr_64.c | 13 +++++++++----
2 files changed, 28 insertions(+), 21 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -206,12 +206,13 @@ __change_page_attr(unsigned long address
{
pte_t *kpte;
struct page *kpte_page;
- pgprot_t ref_prot2;
+ pgprot_t ref_prot2, oldprot;
int level;

kpte = lookup_address(address, &level);
if (!kpte) return 0;
kpte_page = virt_to_page(kpte);
+ oldprot = pte_pgprot(*kpte);
BUG_ON(PageCompound(kpte_page));
BUG_ON(PageLRU(kpte_page));

@@ -219,6 +220,8 @@ __change_page_attr(unsigned long address

if (pgprot_val(prot) != pgprot_val(ref_prot)) {
if (level == 4) {
+ if (pgprot_val(oldprot) == pgprot_val(ref_prot))
+ page_private(kpte_page)++;
set_pte(kpte, pfn_pte(pfn, prot));
} else {
/*
@@ -233,12 +236,14 @@ __change_page_attr(unsigned long address
pgprot_val(ref_prot2) &= ~_PAGE_NX;
set_pte(kpte, mk_pte(split, ref_prot2));
kpte_page = split;
+ page_private(kpte_page)++;
}
- page_private(kpte_page)++;
} else if (level == 4) {
+ if (pgprot_val(oldprot) != pgprot_val(ref_prot)) {
+ BUG_ON(page_private(kpte_page) <= 0);
+ page_private(kpte_page)--;
+ }
set_pte(kpte, pfn_pte(pfn, ref_prot));
- BUG_ON(page_private(kpte_page) == 0);
- page_private(kpte_page)--;
} else {
/*
* When you're here you either did set the same page to PAGE_KERNEL
Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -151,20 +151,16 @@ static void set_pmd_pte(pte_t *kpte, uns
* No more special protections in this 2/4MB area - revert to a
* large page again.
*/
-static inline void revert_page(struct page *kpte_page, unsigned long address)
+static void
+revert_page(struct page *kpte_page, unsigned long address, pgprot_t ref_prot)
{
- pgprot_t ref_prot;
pte_t *linear;

- ref_prot =
- ((address & LARGE_PAGE_MASK) < (unsigned long)&_etext)
- ? PAGE_KERNEL_LARGE_EXEC : PAGE_KERNEL_LARGE;
-
linear = (pte_t *)
pmd_offset(pud_offset(pgd_offset_k(address), address), address);
set_pmd_pte(linear, address,
- pfn_pte((__pa(address) & LARGE_PAGE_MASK) >> PAGE_SHIFT,
- ref_prot));
+ pte_mkhuge(pfn_pte((__pa(address) & LARGE_PAGE_MASK) >> PAGE_SHIFT,
+ ref_prot)));
}

static inline void save_page(struct page *kpte_page)
@@ -223,6 +219,8 @@ __change_page_attr(struct page *page, pg
unsigned long address;
struct page *kpte_page;
int level;
+ pgprot_t oldprot;
+ pgprot_t ref_prot = PAGE_KERNEL;

BUG_ON(PageHighMem(page));
address = (unsigned long)page_address(page);
@@ -230,6 +228,8 @@ __change_page_attr(struct page *page, pg
kpte = lookup_address(address, &level);
if (!kpte)
return -EINVAL;
+
+ oldprot = pte_pgprot(*kpte);
kpte_page = virt_to_page(kpte);
BUG_ON(PageLRU(kpte_page));
BUG_ON(PageCompound(kpte_page));
@@ -237,27 +237,29 @@ __change_page_attr(struct page *page, pg
set_tlb_flush(address, cache_attr_changed(*kpte, prot, level),
level < 3);

+ if ((address & LARGE_PAGE_MASK) < (unsigned long)&_etext)
+ ref_prot = PAGE_KERNEL_EXEC;
+
if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
if (level == 3) {
+ if (pgprot_val(oldprot) == pgprot_val(ref_prot))
+ page_private(kpte_page)++;
set_pte_atomic(kpte, mk_pte(page, prot));
} else {
- pgprot_t ref_prot;
struct page *split;
-
- ref_prot =
- ((address & LARGE_PAGE_MASK) < (unsigned long)&_etext)
- ? PAGE_KERNEL_EXEC : PAGE_KERNEL;
split = split_large_page(address, prot, ref_prot);
if (!split)
return -ENOMEM;
set_pmd_pte(kpte,address,mk_pte(split, ref_prot));
kpte_page = split;
+ page_private(kpte_page)++;
}
- page_private(kpte_page)++;
} else if (level == 3) {
+ if (pgprot_val(oldprot) != pgprot_val(ref_prot)) {
+ BUG_ON(page_private(kpte_page) <= 0);
+ page_private(kpte_page)--;
+ }
set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
- BUG_ON(page_private(kpte_page) == 0);
- page_private(kpte_page)--;
} else {
/*
* When you're here you either did set the same page to PAGE_KERNEL
@@ -279,7 +281,7 @@ __change_page_attr(struct page *page, pg
if (cpu_has_pse && (page_private(kpte_page) == 0)) {
save_page(kpte_page);
paravirt_release_pt(page_to_pfn(kpte_page));
- revert_page(kpte_page, address);
+ revert_page(kpte_page, address, ref_prot);
}
}
return 0;

2008-01-03 15:32:28

by Andi Kleen

[permalink] [raw]
Subject: [PATCH CPA] [27/28] CPA: Change comments of external interfaces to kerneldoc format


Also clarify the descriptions a bit.

Only done for 64bit, but the interfaces are identical on 32bit, so
kerneldoc should merge them (?)

Signed-off-by: Andi Kleen <[email protected]>

---
Documentation/DocBook/kernel-api.tmpl | 8 +++++
arch/x86/mm/pageattr_64.c | 46 +++++++++++++++++++++++++---------
2 files changed, 42 insertions(+), 12 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -265,19 +265,19 @@ __change_page_attr(unsigned long address
return 0;
}

-/*
- * Change the page attributes of an page in the linear mapping.
- *
- * This should be used when a page is mapped with a different caching policy
- * than write-back somewhere - some CPUs do not like it when mappings with
- * different caching policies exist. This changes the page attributes of the
- * in kernel linear mapping too.
+/**
+ * change_page_attr_addr - Change page table attributes in linear mapping
+ * @address: Virtual address in linear mapping.
+ * @numpages: Number of pages to change
+ * @prot: New page table attribute (PAGE_*)
*
- * The caller needs to ensure that there are no conflicting mappings elsewhere.
- * This function only deals with the kernel linear map.
- *
- * Caller must call global_flush_tlb() after this.
+ * Change page attributes of a page in the direct mapping. This is a variant
+ * of change_page_attr() that also works on memory holes that do not have
+ * mem_map entry (pfn_valid() is false).
+ *
+ * See change_page_attr() documentation for more details.
*/
+
int change_page_attr_addr(unsigned long address, int numpages, pgprot_t prot)
{
int err = 0, kernel_map = 0;
@@ -314,13 +314,35 @@ int change_page_attr_addr(unsigned long
return err;
}

-/* Don't call this for MMIO areas that may not have a mem_map entry */
+/**
+ * change_page_attr - Change page table attributes in the linear mapping.
+ * @page: First page to change
+ * @numpages: Number of pages to change
+ * @prot: New protection/caching type (PAGE_*)
+ *
+ * Returns 0 on success, otherwise a negated errno.
+ *
+ * This should be used when a page is mapped with a different caching policy
+ * than write-back somewhere - some CPUs do not like it when mappings with
+ * different caching policies exist. This changes the page attributes of the
+ * in kernel linear mapping too.
+ *
+ * Caller must call global_flush_tlb() later to make the changes active.
+ *
+ * The caller needs to ensure that there are no conflicting mappings elsewhere
+ * (e.g. in user space) * This function only deals with the kernel linear map.
+ *
+ * For MMIO areas without mem_map use change_page_attr_addr() instead.
+ */
int change_page_attr(struct page *page, int numpages, pgprot_t prot)
{
unsigned long addr = (unsigned long)page_address(page);
return change_page_attr_addr(addr, numpages, prot);
}

+/**
+ * global_flush_tlb - Flush linear mappings changed by change_page_attr()
+ */
void global_flush_tlb(void)
{
struct page *pg, *next;
Index: linux/Documentation/DocBook/kernel-api.tmpl
===================================================================
--- linux.orig/Documentation/DocBook/kernel-api.tmpl
+++ linux/Documentation/DocBook/kernel-api.tmpl
@@ -726,4 +726,12 @@ X!Idrivers/video/console/fonts.c
!Ffs/pipe.c
</chapter>

+ <chapter id="pageattr">
+ <title>Kernel direct mapping paging attributes</title>
+ <para>
+ Changing the paging attributes of kernel direct mapping pages
+ on x86.
+ </para>
+!Farch/x86/mm/pageattr_64.c
+ </chapter>
</book>

2008-01-03 15:32:45

by Andi Kleen

[permalink] [raw]
Subject: [PATCH CPA] [28/28] Make kernel_text test match boot mapping initialization


The boot-time direct mapping initialization used a different test than
c_p_a() to check whether a page is part of the kernel text mapping. Fix
that. Also round up to a large page boundary to be safe.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -183,6 +183,12 @@ static int cache_attr_changed(pte_t pte,
return a != (pgprot_val(prot) & _PAGE_CACHE);
}

+static int text_address(unsigned long addr)
+{
+ unsigned long end = ((unsigned long)__init_end & LARGE_PAGE_MASK);
+ return addr < end + LARGE_PAGE_SIZE;
+}
+
/*
* Mark the address for flushing later in global_tlb_flush().
*
@@ -237,7 +243,7 @@ __change_page_attr(struct page *page, pg
set_tlb_flush(address, cache_attr_changed(*kpte, prot, level),
level < 3);

- if ((address & LARGE_PAGE_MASK) < (unsigned long)&_etext)
+ if (text_address(address))
ref_prot = PAGE_KERNEL_EXEC;

if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {

2008-01-03 15:33:03

by Andi Kleen

[permalink] [raw]
Subject: [PATCH CPA] [2/28] CPA: Do a simple self test at boot


When CONFIG_DEBUG_RODATA is enabled, undo the read-only mapping and then
redo it. This gives some simple test coverage for change_page_attr().

Optional patch, but I find it useful.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/Kconfig.debug | 5 +++++
arch/x86/mm/init_32.c | 22 ++++++++++++++++++++++
arch/x86/mm/init_64.c | 10 ++++++++++
3 files changed, 37 insertions(+)

Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -596,6 +596,16 @@ void mark_rodata_ro(void)
* of who is the culprit.
*/
global_flush_tlb();
+
+#ifdef CONFIG_CPA_DEBUG
+ printk("Testing CPA: undo %lx-%lx\n", start, end);
+ change_page_attr_addr(start, (end - start) >> PAGE_SHIFT, PAGE_KERNEL);
+ global_flush_tlb();
+
+ printk("Testing CPA: again\n");
+ change_page_attr_addr(start, (end - start) >> PAGE_SHIFT, PAGE_KERNEL_RO);
+ global_flush_tlb();
+#endif
}
#endif

Index: linux/arch/x86/mm/init_32.c
===================================================================
--- linux.orig/arch/x86/mm/init_32.c
+++ linux/arch/x86/mm/init_32.c
@@ -792,6 +792,18 @@ void mark_rodata_ro(void)
change_page_attr(virt_to_page(start),
size >> PAGE_SHIFT, PAGE_KERNEL_RX);
printk("Write protecting the kernel text: %luk\n", size >> 10);
+
+#ifdef CONFIG_CPA_DEBUG
+ global_flush_tlb();
+
+ printk("Testing CPA: undo %lx-%lx\n", start, start+size);
+ change_page_attr(virt_to_page(start), size>>PAGE_SHIFT, PAGE_KERNEL_EXEC);
+ global_flush_tlb();
+
+ printk("Testing CPA: write protecting again\n");
+ change_page_attr(virt_to_page(start),size>>PAGE_SHIFT,PAGE_KERNEL_RX);
+ global_flush_tlb();
+#endif
}
#endif
start += size;
@@ -808,6 +820,16 @@ void mark_rodata_ro(void)
* of who is the culprit.
*/
global_flush_tlb();
+
+#ifdef CONFIG_CPA_DEBUG
+ printk("Testing CPA: undo %lx-%lx\n", start, start + size);
+ change_page_attr(virt_to_page(start), size >> PAGE_SHIFT, PAGE_KERNEL);
+ global_flush_tlb();
+
+ printk("Testing CPA: write protecting again\n");
+ change_page_attr(virt_to_page(start), size >> PAGE_SHIFT, PAGE_KERNEL_RO);
+ global_flush_tlb();
+#endif
}
#endif

Index: linux/arch/x86/Kconfig.debug
===================================================================
--- linux.orig/arch/x86/Kconfig.debug
+++ linux/arch/x86/Kconfig.debug
@@ -185,4 +185,9 @@ config DEFAULT_IO_DELAY_TYPE
default IO_DELAY_TYPE_NONE
endif

+config CPA_DEBUG
+ bool "CPA self test code"
+ help
+ Do change_page_attr self tests at boot.
+
endmenu

2008-01-05 06:49:06

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: [PATCH CPA] [4/28] Add pte_clrhuge on i386

Andi Kleen wrote:
> 64bit had it already
>

I'm unifying pgtable.h at the moment. Will post soon.

J

2008-01-10 09:31:58

by Ingo Molnar

[permalink] [raw]
Subject: Re: CPA patchset


Andi,

finally managed to get the time to review your CPA patchset, and i
fundamentally agree with most of the detail changes done in it. But here
are a few structural high-level observations:

- firstly, there's no rationale given. So we'll change ioremap()/etc.
from doing a cflush-range instruction instead of a WBINVD. But why?
WBINVD isnt particular fast (takes a few msecs), but why is that a
problem? Drivers dont do high-frequency ioremap-ing. It's typically
only done at driver/device startup and that's it. Whether module load
time takes 1254 msecs instead of 1250 msecs is no big deal.

- secondly, obviously doing a 'flush some' instead of a 'flush all'
operation is risky. There's not many ways we can get the 'flush all'
WBINVD instruction wrong (as long as we do it on all cpus, etc.). But
with this specific range flush we've got all the risks of accidentally
not flushing _enough_. Especially if some boundary of a mapped area is
imprecise. Accidentally leaving aliases around in the cache is
asking for trouble as it's very hard to debug and the operations here
(ioremap) are typically done only once per system bootup. So we'll add
_fundamental_ fragility to an already historically fragile and
bug-ridden piece of code. That does not sound too smart to me and you
do not analyze these concerns in your submission.

- the bugfixes and cleanups to pgattr.c i like very much - but shouldnt
they come first in the series, instead of being mixed into it?

so i'd suggest for you to reshape this ontop of the PAT patchset done by
Venki and Suresh, with the cflush stuff kept at the very end of the
series, and even that should be boot and .config configurable, with
default off - at least initially. The cflush performance advantages seem
dubious at best, and they bring in very real regression dangers.

Ingo

2008-01-10 09:44:17

by Dave Airlie

[permalink] [raw]
Subject: Re: CPA patchset

>
> finally managed to get the time to review your CPA patchset, and i
> fundamentally agree with most of the detail changes done in it. But here
> are a few structural high-level observations:
>
> - firstly, there's no rationale given. So we'll change ioremap()/etc.
> from doing a cflush-range instruction instead of a WBINVD. But why?
> WBINVD isnt particular fast (takes a few msecs), but why is that a
> problem? Drivers dont do high-frequency ioremap-ing. It's typically
> only done at driver/device startup and that's it. Whether module load
> time takes 1254 msecs instead of 1250 msecs is no big deal.

read graphics drivers, even though I think we may avoid the whole path
if we can and end up doing some of this in the drivers
when they know more about the situation so can avoid safeties..

but I still see this being used in AGP a fair bit at some point on
some drivers..

Dave.

2008-01-10 09:53:48

by Andi Kleen

[permalink] [raw]
Subject: Re: CPA patchset

On Thu, Jan 10, 2008 at 10:31:26AM +0100, Ingo Molnar wrote:
>
> Andi,
>
> finally managed to get the time to review your CPA patchset, and i
> fundamentally agree with most of the detail changes done in it. But here
> are a few structural high-level observations:

I have a few changes and will post a updated version shortly.
There's also unfortunately still one outstanding bug (triggered by
some recent changes) on 32bit that makes the self test suite not pass
currently.

> - firstly, there's no rationale given. So we'll change ioremap()/etc.
> from doing a cflush-range instruction instead of a WBINVD. But why?

- WBINVD is a very nasty operation. I was talking to some CPU people
and they really recommended to get rid of it as far as possible.
Stopping the CPU for msecs is just wrong and there are apparently
even some theoretical live lock situations.
- It is not interruptible in earlier VT versions and messes up
real time in the hypervisor. Some people were doing KVM on rt
kernels and had latency spikes from that.

I'll add that to the changelog.

> WBINVD isnt particular fast (takes a few msecs), but why is that a
> problem? Drivers dont do high-frequency ioremap-ing. It's typically

Actually graphics drivers can do higher frequency allocation of WC
memory (with PAT) support.

> only done at driver/device startup and that's it. Whether module load
> time takes 1254 msecs instead of 1250 msecs is no big deal.
>
> - secondly, obviously doing a 'flush some' instead of a 'flush all'
> operation is risky. There's not many ways we can get the 'flush all'

Yes, that is why it took so long. But it's eventually needed
to make PAT actually useful for once.

> WBINVD instruction wrong (as long as we do it on all cpus, etc.). But
> with this specific range flush we've got all the risks of accidentally
> not flushing _enough_. Especially if some boundary of a mapped area is
> imprecise.

Not sure what you mean here? If someone doesn't pass in the correct
area then not everything will be uncached in the first place.
Using something as cached that should be uncached is usually
noticed because it tends to be quite slow or corrupt data.

I don't think you have thought that concern through.

> asking for trouble as it's very hard to debug and the operations here
> (ioremap) are typically done only once per system bootup. So we'll add

change_page_attr() is used for more than just ioremap (e.g. a common
case is mapping memory into AGP apertures which still happens in
many graphics drivers). I also expect it will be used even more
in the future when PAT usage will become more wide spread.

> - the bugfixes and cleanups to pgattr.c i like very much - but shouldnt
> they come first in the series, instead of being mixed into it?

It all depends on each other; I don't think any of the changes can
be easily reordered.

-Andi

2008-01-10 09:55:31

by Andi Kleen

[permalink] [raw]
Subject: Re: CPA patchset

On Thu, Jan 10, 2008 at 07:44:03PM +1000, Dave Airlie wrote:
> >
> > finally managed to get the time to review your CPA patchset, and i
> > fundamentally agree with most of the detail changes done in it. But here
> > are a few structural high-level observations:
> >
> > - firstly, there's no rationale given. So we'll change ioremap()/etc.
> > from doing a cflush-range instruction instead of a WBINVD. But why?
> > WBINVD isnt particular fast (takes a few msecs), but why is that a
> > problem? Drivers dont do high-frequency ioremap-ing. It's typically
> > only done at driver/device startup and that's it. Whether module load
> > time takes 1254 msecs instead of 1250 msecs is no big deal.
>
> read graphics drivers, even though I think we may avoid the whole path

You mean avoid change_page_attr() ?

> if we can and end up doing some of this in the drivers
> when they know more about the situation so can avoid safeties..

Please explain, but it sounds very dubious.

> but I still see this being used in AGP a fair bit at some point on
> some drivers..

Well GARTs and those are widely used even without AGP busses.

-Andi

2008-01-10 10:03:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: CPA patchset


* Dave Airlie <[email protected]> wrote:

> > - firstly, there's no rationale given. So we'll change ioremap()/etc.
> > from doing a cflush-range instruction instead of a WBINVD. But why?
> > WBINVD isnt particular fast (takes a few msecs), but why is that a
> > problem? Drivers dont do high-frequency ioremap-ing. It's typically
> > only done at driver/device startup and that's it. Whether module load
> > time takes 1254 msecs instead of 1250 msecs is no big deal.
>
> read graphics drivers, even though I think we may avoid the whole path
> if we can and end up doing some of this in the drivers when they know
> more about the situation so can avoid safeties..

by 'read graphics drivers' do you mean direct framebuffer access? In any
case, a driver (or even userspace) can use cflush just fine, it's an
unprivileged instruction. But this is about the ioremap() implementation
using cflush instead of WBINVD, and that is a slowpath and is up to the
kernel anyway - i'm not aware of many high-frequency ioremap() users, so
robustness concerns control the policy here. So could you please explain
your point in more detail?

Ingo

2008-01-10 10:05:30

by Ingo Molnar

[permalink] [raw]
Subject: Re: CPA patchset


* Andi Kleen <[email protected]> wrote:

> > WBINVD isnt particular fast (takes a few msecs), but why is that a
> > problem? Drivers dont do high-frequency ioremap-ing. It's
> > typically only done at driver/device startup and that's it.
>
> Actually graphics drivers can do higher frequency allocation of WC
> memory (with PAT) support.

but that's not too smart: why dont they use WB plus cflush instead?
ioremap() will always be slow and we dont care about stupid uses of it.

Ingo

2008-01-10 10:07:28

by Andi Kleen

[permalink] [raw]
Subject: Re: CPA patchset

On Thu, Jan 10, 2008 at 11:04:43AM +0100, Ingo Molnar wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > > WBINVD isnt particular fast (takes a few msecs), but why is that a
> > > problem? Drivers dont do high-frequency ioremap-ing. It's
> > > typically only done at driver/device startup and that's it.
> >
> > Actually graphics drivers can do higher frequency allocation of WC
> > memory (with PAT) support.
>
> but that's not too smart: why dont they use WB plus cflush instead?

Because they need to access it WC for performance.

> ioremap() will always be slow and we dont care about stupid uses of it.

Uninterruptible msec latencies are too slow even for slow path
initialization code imho.

-Andi

2008-01-10 10:20:43

by Dave Airlie

[permalink] [raw]
Subject: Re: CPA patchset

On Jan 10, 2008 7:55 PM, Andi Kleen <[email protected]> wrote:
> On Thu, Jan 10, 2008 at 07:44:03PM +1000, Dave Airlie wrote:
> > >
> > > finally managed to get the time to review your CPA patchset, and i
> > > fundamentally agree with most of the detail changes done in it. But here
> > > are a few structural high-level observations:
> > >
> > > - firstly, there's no rationale given. So we'll change ioremap()/etc.
> > > from doing a cflush-range instruction instead of a WBINVD. But why?
> > > WBINVD isnt particular fast (takes a few msecs), but why is that a
> > > problem? Drivers dont do high-frequency ioremap-ing. It's typically
> > > only done at driver/device startup and that's it. Whether module load
> > > time takes 1254 msecs instead of 1250 msecs is no big deal.
> >
> > read graphics drivers, even though I think we may avoid the whole path
>
> You mean avoid change_page_attr() ?

yes in some cases...

>
> > if we can and end up doing some of this in the drivers
> > when they know more about the situation so can avoid safeties..
>
> Please explain, but it sounds very dubious.

So on Intel 9xx hardware we know how to flush the whole pipe from chip
to chipset to RAM, so when in the Intel
driver we can leave pages mapped cached, and just flush all stages
before having the GPU access them.

This is only possible as long as we know all the parts involved, for
example on AMD we have problems with that
over-eager prefetching so for drivers on AMD chipsets we have to do
something else more than likely using change_page_attr.

>
> Well GARTs and those are widely used even without AGP busses.

Yes mainly we have used this scheme with Intel GARTs (also the only
driver we have done with the new GPU system) by working closely with
Intel engineers.

Dave.

2008-01-10 10:44:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: CPA patchset


* Andi Kleen <[email protected]> wrote:

> There's also unfortunately still one outstanding bug (triggered by
> some recent changes) on 32bit that makes the self test suite not pass
> currently.

ok. I'd expect there to be a whole lot of bugs triggered in this area.
We have to shape it in a way that puts the most useful bits to the head
and the riskiest bits to the end, so that we can do reverts in a
sensible and least destructive way - if the need arises.

> > - firstly, there's no rationale given. So we'll change ioremap()/etc.
> > from doing a cflush-range instruction instead of a WBINVD. But why?
>
> - WBINVD is a very nasty operation. I was talking to some CPU people
> and they really recommended to get rid of it as far as possible.
> Stopping the CPU for msecs is just wrong and there are apparently even
> some theoretical live lock situations. - It is not interruptible in
> earlier VT versions and messes up real time in the hypervisor. Some
> people were doing KVM on rt kernels and had latency spikes from that.

ok, i thought you might have had some higher rationale too. Frankly,
WBINVD wont ever go away from the x86 instruction set, and while i am
very symphathetic to rt and latency issues, it's still a risky change in
a historically fragile area of code.

What is very real though are the hard limitations of MTRRs. So i'd
rather first like to see a clean PAT approach (which all other modern
OSs have already migrated to in the past 10 years), with all the
structural cleanups and bugfixes you did as well, which would allow us
to phase out MTRR use (of the DRM drivers, etc.), and _then_ layer an
(optional) cflush approach basically as the final step. Right now cflush
is mixed inextricably into the CPA patchset. WBINVD latency is really
the last of our problems here.

> > WBINVD isnt particular fast (takes a few msecs), but why is that a
> > problem? Drivers dont do high-frequency ioremap-ing. It's
> > typically
>
> Actually graphics drivers can do higher frequency allocation of WC
> memory (with PAT) support.

but that is plain stupid. If it can be WC mapped, why not map it WB and
use cflush when updating memory? That gives cache-coherency on demand
_and_ true read side caching - so it's faster as well. Modern GX cards
can take that just fine. The best of both worlds. Unconditional WC is
way overrated.

> > - secondly, obviously doing a 'flush some' instead of a 'flush all'
> > operation is risky. There's not many ways we can get the 'flush
> > all'
>
> Yes, that is why it took so long. But it's eventually needed to make
> PAT actually useful for once.

as i see it, the utility of PAT comes from going away from MTRRs, and
the resulting flexibility it gives system designers to shape memory and
IO BARs.

> > WBINVD instruction wrong (as long as we do it on all cpus, etc.).
> > But with this specific range flush we've got all the risks of
> > accidentally not flushing _enough_. Especially if some boundary of
> > a mapped area is imprecise.
>
> Not sure what you mean here? If someone doesn't pass in the correct
> area then not everything will be uncached in the first place. Using
> something as cached that should be uncached is usually noticed because
> it tends to be quite slow or corrupt data.

yes, 'quite slow or corrupt data' is what i meant:

> > asking for trouble as it's very hard to debug and the operations
^^^^^^^^^^^^^^^^^^
> > here (ioremap) are typically done only once per system bootup.
> > [...]
>
> change_page_attr() is used for more than just ioremap (e.g. a common
> case is mapping memory into AGP apertures which still happens in many
> graphics drivers). I also expect it will be used even more in the
> future when PAT usage will become more wide spread.

i expect usage to not change in pattern. ioremap()ping on the fly is not
too smart.

[ quoting this from the top: ]

> > finally managed to get the time to review your CPA patchset, and i
> > fundamentally agree with most of the detail changes done in it. But
> > here are a few structural high-level observations:
>
> I have a few changes and will post a updated version shortly.

thanks. I've test-merged the PAT patchset from Venki/Suresh to have a
look, and to me the most logical way to layer these changes would be:

- PAT
- non-cflush bits of CPA
- cflush bits of CPA
- GBPAGES

hm?

Ingo

2008-01-10 10:50:27

by Andi Kleen

[permalink] [raw]
Subject: Re: CPA patchset

On Thu, Jan 10, 2008 at 08:20:26PM +1000, Dave Airlie wrote:
> This is only possible as long as we know all the parts involved, for
> example on AMD we have problems with that
> over-eager prefetching so for drivers on AMD chipsets we have to do
> something else more than likely using change_page_attr.

What over eager prefetching? AFAIK AMD chips don't do any prefetching
that Intel chips don't do. In fact P4 does more aggressive speculation
than any AMD chip.

One problem I see is that the CPU (both AMD and Intel) can follow random
addresses even from speculative execution and then there is a window between
your CLFLUSH and eventual use where the data could be forced back
into cache behind your back. Then eventually the cache gets flushed
and overwrites the data in memory.

The only way to prevent that is to unmap or remap to uncached.

Might happen only rarely of course, but it can. We've had reports
over the years of bugs which were tracked down to such cases. Each
was a bitch to debug, usually keeping people busy for a long time.
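
[ To make the window concrete, an illustrative C sketch - clflush_range()
  and start_device_dma() are made-up stand-ins, not real interfaces: ]

#include <stddef.h>

#define CACHELINE 64			/* assumed line size */

static inline void clflush_range(volatile void *p, size_t len)
{
	volatile char *c = p;
	size_t off;

	for (off = 0; off < len; off += CACHELINE)
		asm volatile("clflush %0" : "+m" (c[off]));
	asm volatile("mfence" ::: "memory");
}

/* Hypothetical device kick-off, standing in for a real DMA start. */
static void start_device_dma(volatile void *buf) { (void)buf; }

static void hand_buffer_to_device(volatile void *buf, size_t len)
{
	/* 1. Flush the buffer the device is about to own. */
	clflush_range(buf, len);

	/*
	 * 2. Window: while the mapping is still WB, a speculative load
	 *    (mispredicted branch, hardware prefetch) can pull one of
	 *    these lines back into the cache right here.
	 */

	/* 3. The device writes the buffer behind the CPU's back. */
	start_device_dma(buf);

	/*
	 * 4. If the stale line from step 2 is later evicted, its
	 *    writeback overwrites the device's data in memory.
	 *    Remapping the range UC/WC closes the window; CLFLUSH
	 *    alone cannot.
	 */
}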

>
> >
> > Well GARTs and those are widely used even without AGP busses.
>
> Yes mainly we have used this scheme with Intel GARTs (also the only
> driver we have done with the new GPU system) by working closely with
> Intel engineers..

Have you discussed the case above?

Also BTW even if you can solve it for Intel GPUs there are others too
who really rely on WC/UC.

-Andi

2008-01-10 10:57:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: CPA patchset


* Andi Kleen <[email protected]> wrote:

> > > > WBINVD isnt particular fast (takes a few msecs), but why is
> > > > that a problem? Drivers dont do high-frequency ioremap-ing.
> > > > It's typically only done at driver/device startup and that's
> > > > it.
> > >
> > > Actually graphics drivers can do higher frequency allocation of WC
> > > memory (with PAT) support.
> >
> > but that's not too smart: why dont they use WB plus cflush instead?
>
> Because they need to access it WC for performance.

I think you have it fundamentally backwards: the best for performance is
WB + cflush. What would WC offer for performance that cflush cannot do?

also, it's irrelevant to change_page_attr() call frequency. Just map in
everything from the card and use it. In graphics, if you remap anything
on the fly and it's not a slowpath you've lost the performance game even
before you began it.

Ingo

2008-01-10 11:07:21

by Andi Kleen

[permalink] [raw]
Subject: Re: CPA patchset

On Thu, Jan 10, 2008 at 11:43:51AM +0100, Ingo Molnar wrote:
> > > - firstly, there's no rationale given. So we'll change ioremap()/etc.
> > > from doing a cflush-range instruction instead of a WBINVD. But why?
> >
> > - WBINVD is a very nasty operation. I was talking to some CPU people
> > and they really recommended to get rid of it as far as possible.
> > Stopping the CPU for msecs is just wrong and there are apparently even
> > some theoretical live lock situations. - It is not interruptible in
> > earlier VT versions and messes up real time in the hypervisor. Some
> > people were doing KVM on rt kernels and had latency spikes from that.
>
> ok, i thought you might have had some higher rationale too. Frankly,
> WBINVD wont ever go away from the x86 instruction set, and while i am

We can (and should) eliminate it from normal operation at least. The one during
reboot has to be kept unfortunately.

> very symphathetic to rt and latency issues, it's still a risky change in
> a historically fragile area of code.

It is, but the resulting code is significantly more efficient and does
various other things in better ways (the explicit list allows various
other optimizations). It's how c_p_a should have been written from day one.


> What is very real though are the hard limitations of MTRRs. So i'd
> rather first like to see a clean PAT approach (which all other modern

That's mostly orthogonal. Don't know why you bring it up now?

Anyways more efficient c_p_a() makes PAT usage easier.

> structural cleanups and bugfixes you did as well, which would allow us
> to phase out MTRR use (of the DRM drivers, etc.), and _then_ layer an
> (optional) cflush approach basically as the final step. Right now cflush
> is mixed inextricably into the CPA patchset. WBINVD latency is really

You could always disable it. Just set full_flush = 1.
Anyways it would be possible to disable it more fully, but at least
the individual flush infrastructure is used by other code too and is
useful there.
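
[ Not the actual patch code - just a kernel-style sketch of the shape of
  the flush decision being discussed; full_flush is the knob mentioned
  above, the deferred page list and the function names are stand-ins: ]

#include <linux/list.h>
#include <linux/mm.h>
#include <asm/processor.h>

/* Set to 1 to fall back to the old flush-everything behaviour. */
static int full_flush;

static void cpa_flush_page(void *addr)
{
	int i;

	/* Flush every cache line of the 4K page whose attributes changed. */
	for (i = 0; i < PAGE_SIZE; i += boot_cpu_data.x86_clflush_size)
		asm volatile("clflush (%0)" :: "r" (addr + i) : "memory");
}

static void cpa_flush(struct list_head *deferred_pages)
{
	struct page *pg, *next;

	if (full_flush) {
		/* old behaviour: flush all caches on this CPU */
		asm volatile("wbinvd" ::: "memory");
		return;
	}

	list_for_each_entry_safe(pg, next, deferred_pages, lru)
		cpa_flush_page(page_address(pg));

	/* A TLB flush of the affected range would still follow here. */
}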


> the last of our problems here.

I disagree on that.

>
> > > WBINVD isnt particular fast (takes a few msecs), but why is that a
> > > problem? Drivers dont do high-frequency ioremap-ing. It's
> > > typically
> >
> > Actually graphics drivers can do higher frequency allocation of WC
> > memory (with PAT) support.
>
> but that is plain stupid. If it can be WC mapped, why not map it WB and
> use cflush when updating memory?

Not reliable due to the speculative, out-of-order nature of x86 CPUs.

See my other mail to Dave explaining one scenario where it can go
wrong.

> > > all'
> >
> > Yes, that is why it took so long. But it's eventually needed to make
> > PAT actually useful for once.
>
> as i see it, the utility of PAT comes from going away from MTRRs, and
> the resulting flexibility it gives system designers to shape memory and
> IO BARs.

MTRRs will not go away, just used less.

One advantage of PAT is that you can set any page in memory efficiently
to WC or UC without requiring hacks like a hardware aperture. But to
make good use of that, this operation needs to be reasonably efficient.
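
[ A sketch of what that could look like from a driver's point of view;
  set_pages_wc()/set_pages_wb() are hypothetical helper names here,
  standing in for whatever interface the PAT work ends up exporting: ]

#include <linux/gfp.h>
#include <linux/mm.h>

/* Hypothetical PAT-based helpers, not an existing interface. */
int set_pages_wc(struct page *pg, int numpages);
int set_pages_wb(struct page *pg, int numpages);

/*
 * Allocate ordinary RAM and make it write-combining for a device,
 * with no MTRR and no hardware aperture involved - the attribute
 * lives purely in the PTEs via PAT.
 */
static struct page *alloc_wc_buffer(int order)
{
	struct page *pg = alloc_pages(GFP_KERNEL, order);

	if (!pg)
		return NULL;
	if (set_pages_wc(pg, 1 << order)) {
		__free_pages(pg, order);
		return NULL;
	}
	return pg;
}

static void free_wc_buffer(struct page *pg, int order)
{
	set_pages_wb(pg, 1 << order);	/* restore normal cacheability */
	__free_pages(pg, order);
}
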
>
> > > WBINVD instruction wrong (as long as we do it on all cpus, etc.).
> > > But with this specific range flush we've got all the risks of
> > > accidentally not flushing _enough_. Especially if some boundary of
> > > a mapped area is imprecisely.
> >
> > Not sure what you mean here? If someone doesn't pass in the correct
> > area then not everything will be uncached in the first place. Using
> > something as cached that should be uncached is usually noticed because
> > it tends to be quite slow or corrupt data.
>
> yes, 'quite slow or corrupt data' is what i meant:

That will already happen without my patchkit. No change to the positive
or the negative.

>
> > > asking for trouble as it's very hard to debug and the operations
> ^^^^^^^^^^^^^^^^^^
> > > here (ioremap) are typically done only once per system bootup.
> > > [...]
> >
> > change_page_attr() is used for more than just ioremap (e.g. a common
> > case is mapping memory into AGP apertures which still happens in many
> > graphics drivers). I also expect it will be used even more in the
> > future when PAT usage will become more wide spread.
>
> i expect usage to not change in pattern. ioremap()ping on the fly is not
> too smart.

Once PAT is in, people will use it more, I expect. And contrary to your
claims it is already used for more than just ioremap. Please grep
for it.

Actually in DEBUG_PAGEALLOC kernels it is currently heavily used,
but I eliminated that.

>
> [ quoting this from the top: ]
>
> > > finally managed to get the time to review your CPA patchset, and i
> > > fundamentally agree with most of the detail changes done in it. But
> > > here are a few structural high-level observations:
> >
> > I have a few changes and will post a updated version shortly.
>
> thanks. I've test-merged the PAT patchset from Venki/Suresh to have a
> look, and to me the most logical way to layer these changes would be:
>
> - PAT
> - non-cflush bits of CPA
> - cflush bits of CPA

I would prefer not to reorder it significantly inside itself.

I took pains to make it all in small bisectable steps, but redoing
all that would be very painful and time consuming for me. Also it is
currently relatively logical in the way each patch builds on the previous
one, and a dumb separation into clflush vs. no-clflush wouldn't help with that.

A merge after PAT is fine for me if PAT happens soon (iirc there were
only minor changes to cpa anyways), although it's a pity to delay
some of the fixes for so long. But I haven't seen a PAT update recently
and at least the original patch needed much more work. So I think
you should only delay it for PAT if that happens soon.
I had actually hoped that CPA would be ready for .25, but I have my doubts
about PAT making that.

One possibility if you're worried about CLFLUSH so much would be to
merge the whole series but keep CLFLUSH disabled. Don't think it would
improve things much though, and the only way to get it shaken out is wider
testing anyways.


-Andi

2008-01-10 11:12:58

by Andi Kleen

[permalink] [raw]
Subject: Re: CPA patchset

On Thu, Jan 10, 2008 at 11:57:26AM +0100, Ingo Molnar wrote:
>
> > > > > WBINVD isnt particular fast (takes a few msecs), but why is
> > > > > that a problem? Drivers dont do high-frequency ioremap-ing.
> > > > > It's typically only done at driver/device startup and that's
> > > > > it.
> > > >
> > > > Actually graphics drivers can do higher frequency allocation of WC
> > > > memory (with PAT) support.
> > >
> > > but that's not too smart: why dont they use WB plus cflush instead?
> >
> > Because they need to access it WC for performance.
>
> I think you have it fundamentally backwards: the best for performance is
> WB + cflush. What would WC offer for performance that cflush cannot do?

Cached requires the cache line to be read first before you can write it.

WC on the other hand does not allocate a cache line and just dumps
the data into a special write combining buffer. It was invented originally
because reads from AGP were incredibly slow.

And it's raceless regarding the caching protocol (assuming you flush
the caches and TLBs correctly).

Another typical problem is that if something
is uncached then you can't have it in any other caches, because if that
cache eventually flushes it will corrupt the data.
That can happen with remapping apertures, for example, which remap data
behind the CPU's back.

CLFLUSH is really only a hint but it cannot be used if UC is needed
for correctness.

> also, it's irrelevant to change_page_attr() call frequency. Just map in
> everything from the card and use it. In graphics, if you remap anything
> on the fly and it's not a slowpath you've lost the performance game even
> before you began it.

The typical case would be lots of user space DRI clients supplying
their own buffers on the fly. There's not really a fixed pool
in this case, but it all varies dynamically. In some scenarios
that could happen quite often.

-Andi

2008-01-10 12:22:48

by Ingo Molnar

[permalink] [raw]
Subject: Re: CPA patchset


* Andi Kleen <[email protected]> wrote:

> > What is very real though are the hard limitations of MTRRs. So i'd
> > rather first like to see a clean PAT approach (which all other
> > modern OSs have already migrated to in the past 10 years)
>
> That's mostly orthogonal. Don't know why you bring it up now?

because the PAT (Page Attribute Table support) patchset and the CPA
(change_page_attr()) patchset are not orthogonal at all - as their
name already signals: because they change the implementation/effects of
the same interface(s). [just at different levels].

Both patchsets change how the kernel pagetable caching is handled. PAT
changes the kernel pte details and unshackles us from MTRR reliance and
thus solves real problems on real boxes:

55 files changed, 1288 insertions(+), 237 deletions(-)

CPA moves change_page_attr() from wbinvd flushing to cflush flushing, for
a speedup/latency-win, plus a whole bunch of intermingled fixes and
improvements to page attribute modification:

26 files changed, 882 insertions(+), 423 deletions(-)

so in terms of risk management, the "perfect patch order" is:

- minimal set of correctness fixes to the highlevel cpa code.

- ( then any provably NOP cleanups to pave the way. )

- then change the lowlevel pte code (PAT) to reduce/eliminate the need
to have runtime MTRR use

- then structural improvements/cleanups of the highlevel cpa code

- then the cflush (optional) performance feature ontop of it.

- then gigabyte-largepages/TLBs support [new CPU feature that further
complicates page-attribute management]

All in an easy-to-revert fashion. We _will_ regress here, and this stuff
is very hard to debug.

Ingo

2008-01-10 12:39:51

by Andi Kleen

[permalink] [raw]
Subject: Re: CPA patchset

On Thu, Jan 10, 2008 at 01:22:04PM +0100, Ingo Molnar wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > > What is very real though are the hard limitations of MTRRs. So i'd
> > > rather first like to see a clean PAT approach (which all other
> > > modern OSs have already migrated to in the past 10 years)
> >
> > That's mostly orthogonal. Don't know why you bring it up now?
>
> because the PAT (Page Attribute Table support) patchset and the CPA
> (change_page_attr()) patchset are not orthogonal at all - as their
> name already signals: because they change the implementation/effects of
> the same interface(s). [just at different levels].
>
> Both patchsets change how the kernel pagetable caching is handled. PAT
> changes the kernel pte details and unshackles us from MTRR reliance and
> thus solves real problems on real boxes:
>
> 55 files changed, 1288 insertions(+), 237 deletions(-)
>
> CPA moves change_page_attr() from wbinvd flushing to cflush flushing, for
> a speedup/latency-win, plus a whole bunch of intermingled fixes and
> improvements to page attribute modification:
>
> 26 files changed, 882 insertions(+), 423 deletions(-)
>
> so in terms of risk management, the "perfect patch order" is:
>
> - minimal set of correctness fixes to the highlevel cpa code.

The two clear bug fix patches are refcount and flush order.

refcnt could be moved earlier; flush order would be quite painful
because there are quite a lot of patches dependent on it.

I could move ref count earlier, but I would prefer not to because
of the significant work it would be for me.

Since it is all already bisectable I'm also not sure what debugging
advantages the reordering would bring.

It's already bisectable (I have not booted all intermediate steps, but
several of them, and I believe they all compile) and split into small
pieces for that reason.

If it's really broken it would need to be reverted and then the ref count
stuff would go too. But I hope that won't be needed.

And even losing the reference count fixes wouldn't be catastrophic -- in the
worst case you lose some minor performance because kernel mappings are
unnecessarily split to 4K pages, but it's not a correctness fix.

So while the reordering would be possible, it would imho not
bring much advantage, and I must admit I'm not too motivated
to do another time-consuming reordering for a relatively weak reason.

If it was a 5 patch series I would probably not complain too much
about this, but it's 25+ patches.

> - ( then any provably NOP cleanups to pave the way. )
>
> - then change the lowlevel pte code (PAT) to reduce/eliminate the need
> to have runtime MTRR use

That's a completely different area really. Most of the PAT code
has nothing to do with change_page_attr(). Please think that through
again after reading both patchkits.

As far as I can see making it wait for PAT would mean delaying
it for a longer time which would be a pity.


> - then structural improvements/cleanups of the highlevel cpa code
>
> - then the cflush (optional) performance feature ontop of it.

There are actually a lot more performance features in there (self snoop,
minimal TLB flushing, some other stuff). Most of it is related to
that in fact.

>
> - then gigabyte-largepages/TLBs support [new CPU feature that further
> complicates page-attribute management]

That's already at the end.

>
> All in an easy-to-revert fashion. We _will_ regress here, and this stuff

That's already the case.

-Andi

2008-01-11 07:20:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: CPA patchset


* Andi Kleen <[email protected]> wrote:

> > > > but that's not too smart: why dont they use WB plus cflush
> > > > instead?
> > >
> > > Because they need to access it WC for performance.
> >
> > I think you have it fundamentally backwards: the best for
> > performance is WB + cflush. What would WC offer for performance that
> > cflush cannot do?
>
> Cached requires the cache line to be read first before you can write
> it.

nonsense, and you should know it. It is perfectly possible to construct
fully written cachelines, without reading the cacheline first. MOVDQ is
SSE1 so on basically in every CPU today - and it is 16 byte aligned and
can generate full cacheline writes, _without_ filling in the cacheline
first. Bulk ops (string ops, etc.) will do full cacheline writes too,
without filling in the cacheline. Especially with high performance 3D
ops we do _NOT_ need any funky reads from anywhere because 3D software
can stream a lot of writes out: we construct a full frame or a portion
of a frame, or upload vertices or shader scripts, textures, etc.

( also, _even_ when there is a cache fill pending on for a partially
written cacheline, that might go on in parallel and it is not
necessarily holding up the CPU unless it has an actual data dependency
on that. )
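
[ For concreteness, a userspace sketch of the "full cacheline write"
  constructs being discussed - four aligned 16-byte SSE stores per
  64-byte line, plus the string-op variant; whether these actually avoid
  the read-for-ownership on a given CPU is exactly what the follow-ups
  below argue about: ]

#include <string.h>
#include <xmmintrin.h>

/*
 * Write one full 64-byte cache line with four aligned 16-byte SSE
 * stores (MOVAPS here); 'line' is assumed to be 64-byte aligned.
 */
static void write_line_sse(float *line, __m128 v)
{
	_mm_store_ps(line,      v);
	_mm_store_ps(line + 4,  v);
	_mm_store_ps(line + 8,  v);
	_mm_store_ps(line + 12, v);
}

/* The bulk/string-op variant: with "fast strings" enabled the CPU can
 * turn this into whole-cacheline stores as well. */
static void write_lines_string(void *dst, int c, size_t len)
{
	memset(dst, c, len);	/* typically compiled to rep stos */
}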

but that's totally beside the point anyway. WC or WB accesses, if a 3D
app or a driver does high-freq change_page_attr() calls, it will _lose_
the performance game:

> > also, it's irrelevant to change_page_attr() call frequency. Just map
> > in everything from the card and use it. In graphics, if you remap
> > anything on the fly and it's not a slowpath you've lost the
> > performance game even before you began it.
>
> The typical case would be lots of user space DRI clients supplying
> their own buffers on the fly. There's not really a fixed pool in this
> case, but it all varies dynamically. In some scenarios that could
> happen quite often.

in what scenarios? Please give me in-tree examples of such high-freq
change_page_attr() cases, where the driver authors would like to call it
with high frequency but are unable to do it and see performance problems
due to the WBINVD.

Ingo

2008-01-11 07:34:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: CPA patchset


* Ingo Molnar <[email protected]> wrote:

> > > I think you have it fundamentally backwards: the best for
> > > performance is WB + cflush. What would WC offer for performance
> > > that cflush cannot do?
> >
> > Cached requires the cache line to be read first before you can write
> > it.
>
> nonsense, and you should know it. It is perfectly possible to
> construct fully written cachelines, without reading the cacheline
> first. MOVDQ is SSE1 so on basically in every CPU today - and it is 16
> byte aligned and can generate full cacheline writes, _without_ filling
> in the cacheline first. Bulk ops (string ops, etc.) will do full
> cacheline writes too, without filling in the cacheline. Especially
> with high performance 3D ops we do _NOT_ need any funky reads from
> anywhere because 3D software can stream a lot of writes out: we
> construct a full frame or a portion of a frame, or upload vertices or
> shader scripts, textures, etc.
>
> ( also, _even_ when there is a cache fill pending on for a partially
> written cacheline, that might go on in parallel and it is not
> necessarily holding up the CPU unless it has an actual data dependency
> on that. )

btw., that's not to suggest that WC does not have its place - but it's
not where you think it is.

Write-Combining can be very useful for devices that are behind a slow or
a high-latency transport, such as PCI, and which are mapped UnCached
today. Such as a networking card's (or storage card's) descriptor rings. New
entries in the TX ring are typically constructed on the CPU side without
knowing anything about their contents (and hence no need to read them,
the drivers typically shadow the ring on the host side already, it is
required for good UC performance today), so the new descriptors can be
streamed out efficiently with WC, instead of generating lots of small UC
accesses. But these descriptor rings have constant size, they have a
fixed place in the aperture and are ioremap()ed once during device init
and that's it. No high-frequency change_page_attributes() usage is done
or desired here.
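
[ A kernel-style sketch of that descriptor-ring pattern, assuming an
  ioremap_wc()-style WC mapping of the ring - the descriptor layout, the
  flag and the helper names are made up for illustration: ]

#include <linux/types.h>
#include <linux/io.h>

#define DESC_READY 0x1			/* made-up "descriptor valid" flag */

struct tx_desc {			/* made-up descriptor layout */
	u32 buf_lo;
	u32 buf_hi;
	u32 len;
	u32 flags;
};

static void post_tx(struct tx_desc __iomem *ring, void __iomem *doorbell,
		    unsigned int idx, u64 buf, u32 len)
{
	struct tx_desc __iomem *d = &ring[idx];

	/*
	 * On a WC mapping these four stores can collapse into a single
	 * burst out of the write-combining buffer, instead of four
	 * separate slow UC transactions.
	 */
	writel((u32)buf, &d->buf_lo);
	writel((u32)(buf >> 32), &d->buf_hi);
	writel(len, &d->len);
	writel(DESC_READY, &d->flags);

	/* sfence drains the WC buffers (kernel code would use wmb()). */
	asm volatile("sfence" ::: "memory");
	writel(idx, doorbell);
}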

Ingo

2008-01-11 11:26:22

by Andi Kleen

[permalink] [raw]
Subject: Re: CPA patchset

> It is perfectly possible to construct
> fully written cachelines, without reading the cacheline first. MOVDQ is

Only if you write an aligned full 64 (or 128) byte area, and even then you
can have occasional reads which can be either painfully slow or even incorrect.

> but that's totally beside the point anyway. WC or WB accesses, if a 3D
> app or a driver does high-freq change_page_attr() calls, it will _lose_
> the performance game:

Yes, high frequency as in doing it in fast paths is not a good idea, but
reasonably low frequency (as in acceptable process exit latencies for
example) is something to aim for. Right now with WBINVD and other problems
it is too slow.

> > > in everything from the card and use it. In graphics, if you remap
> > > anything on the fly and it's not a slowpath you've lost the
> > > performance game even before you began it.
> >
> > The typical case would be lots of user space DRI clients supplying
> > their own buffers on the fly. There's not really a fixed pool in this
> > case, but it all varies dynamically. In some scenarios that could
> > happen quite often.
>
> in what scenarios? Please give me in-tree examples of such high-freq
> change_page_attr() cases, where the driver authors would like to call it
> with high frequency but are unable to do it and see performance problems
> due to the WBINVD.

Some workloads do regular mapping into the GART aperture, but it is
not too critical yet.

But it is not too widely used because it is too slow; I've got
requests from various parties over the years for a more efficient c_p_a().
It's a chicken-and-egg problem -- you're asking for users, but the users
don't use it yet because it's too slow.

-Andi

2008-01-11 11:29:07

by Andi Kleen

[permalink] [raw]
Subject: Re: CPA patchset

> Write-Combining can be very useful for devices that are behind a slow or
> a high-latency transport, such as PCI, and which are mapped UnCached

That is what I wrote! If you meant the same we must have been
spectacularly miscommunicating.

-Andi

2008-01-11 17:02:56

by dean gaudet

[permalink] [raw]
Subject: Re: CPA patchset

On Fri, 11 Jan 2008, Ingo Molnar wrote:

> * Andi Kleen <[email protected]> wrote:
>
> > Cached requires the cache line to be read first before you can write
> > it.
>
> nonsense, and you should know it. It is perfectly possible to construct
> fully written cachelines, without reading the cacheline first. MOVDQ is
> SSE1 so on basically in every CPU today - and it is 16 byte aligned and
> can generate full cacheline writes, _without_ filling in the cacheline
> first.

did you mean to write MOVNTPS above?


> Bulk ops (string ops, etc.) will do full cacheline writes too,
> without filling in the cacheline.

on intel with fast strings enabled yes. mind you intel gives hints in
the documentation these operations don't respect coherence... and i
asked about this when they posted their memory ordering paper but got no
response.

-dean

2008-01-11 17:19:29

by Arjan van de Ven

[permalink] [raw]
Subject: Re: CPA patchset

On Fri, 11 Jan 2008 09:02:46 -0800 (PST)
dean gaudet <[email protected]> wrote:

> > Bulk ops (string ops, etc.) will do full cacheline writes too,
> > without filling in the cacheline.
>
> on intel with fast strings enabled yes. mind you intel gives hints in
> the documentation these operations don't respect coherence... and i
> asked about this when they posted their memory ordering paper but got
> no response.

I know there's an answer on the way; it just needs to get checked by
architects for cpus we shipped in like the last 10 years ;)

--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-01-11 17:57:14

by dean gaudet

[permalink] [raw]
Subject: Re: CPA patchset

On Fri, 11 Jan 2008, dean gaudet wrote:

> On Fri, 11 Jan 2008, Ingo Molnar wrote:
>
> > * Andi Kleen <[email protected]> wrote:
> >
> > > Cached requires the cache line to be read first before you can write
> > > it.
> >
> > nonsense, and you should know it. It is perfectly possible to construct
> > fully written cachelines, without reading the cacheline first. MOVDQ is
> > SSE1 so on basically in every CPU today - and it is 16 byte aligned and
> > can generate full cacheline writes, _without_ filling in the cacheline
> > first.
>
> did you mean to write MOVNTPS above?

btw in case you were thinking a normal store to WB rather than a
non-temporal store... i ran a microbenchmark streaming stores to every 16
bytes of a 16MiB region aligned to 4096 bytes on a xeon 53xx series CPU
(4MiB L2) + 5000X northbridge and the avg latency of MOVNTPS is 12 cycles
whereas the avg latency of MOVAPS is 20 cycles.

the inner loop is unrolled 16 times so there are literally 4 cache lines
worth of stores being stuffed into the store queue as fast as possible...
and there's no coalescing for normal stores even on this modern CPU.

i'm certain i'll see the same thing on AMD... it's a very hard thing to do
in hardware without the non-temporal hint.
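
[ For readers who want to try something similar: a rough userspace sketch
  of such a microbenchmark (not dean's actual code) - the 16MiB buffer,
  4096-byte alignment and one store per 16 bytes follow the description
  above, everything else is an assumption: ]

#define _POSIX_C_SOURCE 200112L		/* for posix_memalign() */
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <xmmintrin.h>

#define BUF_BYTES (16UL << 20)		/* 16 MiB, as in the test above */

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;
	asm volatile("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
	float *buf;
	size_t i, n = BUF_BYTES / sizeof(float);
	__m128 v = _mm_set1_ps(1.0f);
	uint64_t t0, t1, t2;

	if (posix_memalign((void **)&buf, 4096, BUF_BYTES))
		return 1;

	t0 = rdtsc();
	for (i = 0; i < n; i += 4)		/* one store per 16 bytes */
		_mm_store_ps(buf + i, v);	/* MOVAPS: normal store to WB */
	t1 = rdtsc();
	for (i = 0; i < n; i += 4)
		_mm_stream_ps(buf + i, v);	/* MOVNTPS: non-temporal store */
	_mm_sfence();
	t2 = rdtsc();

	printf("movaps : %.1f cycles/store\n", (double)(t1 - t0) / (n / 4));
	printf("movntps: %.1f cycles/store\n", (double)(t2 - t1) / (n / 4));
	free(buf);
	return 0;
}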

-dean