2008-01-14 22:16:49

by Andi Kleen

Subject: [PATCH] [0/31] Great change_page_attr patch series v2


Lots of improvements to change_page_attr(). Make it a lot more
efficient and fix various bugs.

Changes against the earlier version:

- Fixed some checkpatch.pl complaints
- Improved the self test suite
- Fixed more reference bugs
- Fixed NX handling on 32bit
- Removed some useless code there
- A few other random fixes

-Andi


2008-01-14 22:17:16

by Andi Kleen

Subject: [PATCH] [1/31] Shrink __PAGE_KERNEL/__PAGE_KERNEL_EXEC on non PAE kernels


No need to make it 64bit there.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/init_32.c | 4 ++--
include/asm-x86/pgtable.h | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)

Index: linux/arch/x86/mm/init_32.c
===================================================================
--- linux.orig/arch/x86/mm/init_32.c
+++ linux/arch/x86/mm/init_32.c
@@ -326,9 +326,9 @@ static void __init set_highmem_pages_ini
#define set_highmem_pages_init(bad_ppro) do { } while (0)
#endif /* CONFIG_HIGHMEM */

-unsigned long long __PAGE_KERNEL = _PAGE_KERNEL;
+pteval_t __PAGE_KERNEL = _PAGE_KERNEL;
EXPORT_SYMBOL(__PAGE_KERNEL);
-unsigned long long __PAGE_KERNEL_EXEC = _PAGE_KERNEL_EXEC;
+pteval_t __PAGE_KERNEL_EXEC = _PAGE_KERNEL_EXEC;

#ifdef CONFIG_NUMA
extern void __init remap_numa_kva(void);
Index: linux/include/asm-x86/pgtable.h
===================================================================
--- linux.orig/include/asm-x86/pgtable.h
+++ linux/include/asm-x86/pgtable.h
@@ -64,7 +64,7 @@
#define _PAGE_KERNEL (_PAGE_KERNEL_EXEC | _PAGE_NX)

#ifndef __ASSEMBLY__
-extern unsigned long long __PAGE_KERNEL, __PAGE_KERNEL_EXEC;
+extern pteval_t __PAGE_KERNEL, __PAGE_KERNEL_EXEC;
#endif /* __ASSEMBLY__ */
#else
#define __PAGE_KERNEL_EXEC \

2008-01-14 22:17:34

by Andi Kleen

Subject: [PATCH] [2/31] CPA: Do a simple self test at boot


When CONFIG_DEBUG_RODATA is enabled, undo the RO mapping and then redo it.
This gives some simple testing of change_page_attr().

Optional patch, but I find it useful.
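
For reference, the calling convention the test exercises (condensed from the
init_32.c hunk below) is: change the attributes, then flush, for both the undo
and the redo step:

        /* Revert the write protection, then re-apply it; the caller must
         * call global_flush_tlb() after each change_page_attr() batch. */
        change_page_attr(virt_to_page(start), size >> PAGE_SHIFT, PAGE_KERNEL);
        global_flush_tlb();

        change_page_attr(virt_to_page(start), size >> PAGE_SHIFT, PAGE_KERNEL_RO);
        global_flush_tlb();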

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/Kconfig.debug | 5 +++++
arch/x86/mm/init_32.c | 26 ++++++++++++++++++++++++++
arch/x86/mm/init_64.c | 10 ++++++++++
3 files changed, 41 insertions(+)

Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -603,6 +603,16 @@ void mark_rodata_ro(void)
* of who is the culprit.
*/
global_flush_tlb();
+
+#ifdef CONFIG_CPA_DEBUG
+ printk("Testing CPA: undo %lx-%lx\n", start, end);
+ change_page_attr_addr(start, (end - start) >> PAGE_SHIFT, PAGE_KERNEL);
+ global_flush_tlb();
+
+ printk("Testing CPA: again\n");
+ change_page_attr_addr(start, (end - start) >> PAGE_SHIFT, PAGE_KERNEL_RO);
+ global_flush_tlb();
+#endif
}
#endif

Index: linux/arch/x86/mm/init_32.c
===================================================================
--- linux.orig/arch/x86/mm/init_32.c
+++ linux/arch/x86/mm/init_32.c
@@ -793,6 +793,20 @@ void mark_rodata_ro(void)
change_page_attr(virt_to_page(start),
size >> PAGE_SHIFT, PAGE_KERNEL_RX);
printk("Write protecting the kernel text: %luk\n", size >> 10);
+
+#ifdef CONFIG_CPA_DEBUG
+ global_flush_tlb();
+
+ printk("Testing CPA: Reverting %lx-%lx\n", start, start+size);
+ change_page_attr(virt_to_page(start), size>>PAGE_SHIFT,
+ PAGE_KERNEL_EXEC);
+ global_flush_tlb();
+
+ printk("Testing CPA: write protecting again\n");
+ change_page_attr(virt_to_page(start), size>>PAGE_SHIFT,
+ PAGE_KERNEL_RX);
+ global_flush_tlb();
+#endif
}
#endif
start += size;
@@ -809,6 +823,18 @@ void mark_rodata_ro(void)
* of who is the culprit.
*/
global_flush_tlb();
+
+#ifdef CONFIG_CPA_DEBUG
+ printk("Testing CPA: undo %lx-%lx\n", start, start + size);
+ change_page_attr(virt_to_page(start), size >> PAGE_SHIFT,
+ PAGE_KERNEL);
+ global_flush_tlb();
+
+ printk("Testing CPA: write protecting again\n");
+ change_page_attr(virt_to_page(start), size >> PAGE_SHIFT,
+ PAGE_KERNEL_RO);
+ global_flush_tlb();
+#endif
}
#endif

Index: linux/arch/x86/Kconfig.debug
===================================================================
--- linux.orig/arch/x86/Kconfig.debug
+++ linux/arch/x86/Kconfig.debug
@@ -192,4 +192,9 @@ config DEBUG_BOOT_PARAMS
help
This option will cause struct boot_params to be exported via debugfs.

+config CPA_DEBUG
+ bool "CPA self test code"
+ help
+ Do change_page_attr self tests at boot.
+
endmenu

2008-01-14 22:17:48

by Andi Kleen

Subject: [PATCH] [3/31] Add pte accessors for the global bit


Needed for some test code.

Signed-off-by: Andi Kleen <[email protected]>

---
include/asm-x86/pgtable.h | 3 +++
1 file changed, 3 insertions(+)

Index: linux/include/asm-x86/pgtable.h
===================================================================
--- linux.orig/include/asm-x86/pgtable.h
+++ linux/include/asm-x86/pgtable.h
@@ -134,6 +134,7 @@ static inline int pte_young(pte_t pte)
static inline int pte_write(pte_t pte) { return pte_val(pte) & _PAGE_RW; }
static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
static inline int pte_huge(pte_t pte) { return pte_val(pte) & _PAGE_PSE; }
+static inline int pte_global(pte_t pte) { return pte_val(pte) & _PAGE_GLOBAL; }

static inline int pmd_large(pmd_t pte) {
return (pmd_val(pte) & (_PAGE_PSE|_PAGE_PRESENT)) ==
@@ -149,6 +150,8 @@ static inline pte_t pte_mkyoung(pte_t pt
static inline pte_t pte_mkwrite(pte_t pte) { return __pte(pte_val(pte) | _PAGE_RW); }
static inline pte_t pte_mkhuge(pte_t pte) { return __pte(pte_val(pte) | _PAGE_PSE); }
static inline pte_t pte_clrhuge(pte_t pte) { return __pte(pte_val(pte) & ~_PAGE_PSE); }
+static inline pte_t pte_mkglobal(pte_t pte) { return __pte(pte_val(pte) | _PAGE_GLOBAL); }
+static inline pte_t pte_clrglobal(pte_t pte) { return __pte(pte_val(pte) & ~_PAGE_GLOBAL); }

extern pteval_t __supported_pte_mask;

2008-01-14 22:18:08

by Andi Kleen

Subject: [PATCH] [4/31] Add pte_pgprot on i386


64bit already had it.

Needed for later patches.

Signed-off-by: Andi Kleen <[email protected]>

---
include/asm-x86/pgtable-2level.h | 2 ++
include/asm-x86/pgtable-3level.h | 2 ++
2 files changed, 4 insertions(+)

Index: linux/include/asm-x86/pgtable-2level.h
===================================================================
--- linux.orig/include/asm-x86/pgtable-2level.h
+++ linux/include/asm-x86/pgtable-2level.h
@@ -75,6 +75,8 @@ static inline int pte_exec_kernel(pte_t
#define pgoff_to_pte(off) \
((pte_t) { .pte_low = (((off) & 0x1f) << 1) + (((off) >> 5) << 8) + _PAGE_FILE })

+#define pte_pgprot(x) __pgprot((x).pte_low & 0xfff)
+
/* Encode and de-code a swap entry */
#define __swp_type(x) (((x).val >> 1) & 0x1f)
#define __swp_offset(x) ((x).val >> 8)
Index: linux/include/asm-x86/pgtable-3level.h
===================================================================
--- linux.orig/include/asm-x86/pgtable-3level.h
+++ linux/include/asm-x86/pgtable-3level.h
@@ -142,6 +142,8 @@ static inline unsigned long pte_pfn(pte_
return pte_val(pte) >> PAGE_SHIFT;
}

+#define pte_pgprot(x) __pgprot(pte_val(x) & (0xfff | _PAGE_NX))
+
/*
* Bits 0, 6 and 7 are taken in the low part of the pte,
* put the 32 bits of offset into the high part.

2008-01-14 22:18:25

by Andi Kleen

Subject: [PATCH] [5/31] Don't drop NX bit in pte modifier functions for 32bit


The pte_* modifier functions that cleared bits dropped the NX bit on 32bit
PAE because the mask arithmetic was done in 32bit types, but NX is in bit 63.
Fix that by adding appropriate casts so that the arithmetic happens as
long long on PAE kernels.
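
To illustrate the hazard with a standalone userspace sketch (made-up values,
not the kernel's actual flag definitions; it assumes a 32bit build where
unsigned long is 32 bits wide):

        #include <stdio.h>

        typedef unsigned long long pteval_t;    /* 64bit pte value, as on PAE */

        int main(void)
        {
                pteval_t pte = (1ULL << 63) | 0x063;    /* NX set plus some low flags */
                unsigned long flag = 0x040;             /* stand-in for a _PAGE_* constant */

                /* The 32bit complement zero-extends and wipes bits 32-63, NX included. */
                pteval_t lost = pte & ~flag;
                /* Computing the complement in pteval_t keeps the high half intact. */
                pteval_t kept = pte & ~(pteval_t)flag;

                printf("lost=%#llx kept=%#llx\n", lost, kept);
                return 0;
        }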

I decided to just use 64bit arithmetic instead of open coding like
pte_modify() because gcc should generate good enough code for that now.

While this looks in theory like a .24 candidate, it might trigger some
subtle latent bugs, so it's better to delay it to .25 for more testing.

Signed-off-by: Andi Kleen <[email protected]>

---
include/asm-x86/pgtable.h | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

Index: linux/include/asm-x86/pgtable.h
===================================================================
--- linux.orig/include/asm-x86/pgtable.h
+++ linux/include/asm-x86/pgtable.h
@@ -141,17 +141,17 @@ static inline int pmd_large(pmd_t pte) {
(_PAGE_PSE|_PAGE_PRESENT);
}

-static inline pte_t pte_mkclean(pte_t pte) { return __pte(pte_val(pte) & ~_PAGE_DIRTY); }
-static inline pte_t pte_mkold(pte_t pte) { return __pte(pte_val(pte) & ~_PAGE_ACCESSED); }
-static inline pte_t pte_wrprotect(pte_t pte) { return __pte(pte_val(pte) & ~_PAGE_RW); }
-static inline pte_t pte_mkexec(pte_t pte) { return __pte(pte_val(pte) & ~_PAGE_NX); }
+static inline pte_t pte_mkclean(pte_t pte) { return __pte(pte_val(pte) & ~(pteval_t)_PAGE_DIRTY); }
+static inline pte_t pte_mkold(pte_t pte) { return __pte(pte_val(pte) & ~(pteval_t)_PAGE_ACCESSED); }
+static inline pte_t pte_wrprotect(pte_t pte) { return __pte(pte_val(pte) & ~(pteval_t)_PAGE_RW); }
+static inline pte_t pte_mkexec(pte_t pte) { return __pte(pte_val(pte) & ~(pteval_t)_PAGE_NX); }
static inline pte_t pte_mkdirty(pte_t pte) { return __pte(pte_val(pte) | _PAGE_DIRTY); }
static inline pte_t pte_mkyoung(pte_t pte) { return __pte(pte_val(pte) | _PAGE_ACCESSED); }
static inline pte_t pte_mkwrite(pte_t pte) { return __pte(pte_val(pte) | _PAGE_RW); }
static inline pte_t pte_mkhuge(pte_t pte) { return __pte(pte_val(pte) | _PAGE_PSE); }
-static inline pte_t pte_clrhuge(pte_t pte) { return __pte(pte_val(pte) & ~_PAGE_PSE); }
+static inline pte_t pte_clrhuge(pte_t pte) { return __pte(pte_val(pte) & ~(pteval_t)_PAGE_PSE); }
static inline pte_t pte_mkglobal(pte_t pte) { return __pte(pte_val(pte) | _PAGE_GLOBAL); }
-static inline pte_t pte_clrglobal(pte_t pte) { return __pte(pte_val(pte) & ~_PAGE_GLOBAL); }
+static inline pte_t pte_clrglobal(pte_t pte) { return __pte(pte_val(pte) & ~(pteval_t)_PAGE_GLOBAL); }

extern pteval_t __supported_pte_mask;

2008-01-14 22:18:40

by Andi Kleen

Subject: [PATCH] [6/31] CPA: Undo white space changes


Undo random white space changes. This reverts commit
ddb53b5735793a19dc17bcd98b050f672f28f1ea.

I simply don't have the nerves to port a 20+ patch series to the
reformatted version. And the patch series changes most of those lines
anyway and drops the trailing white space there.

And since the reformatting was a nop, losing it for now isn't a problem.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 149 ++++++++++++++++++++--------------------------
arch/x86/mm/pageattr_64.c | 137 ++++++++++++++++++------------------------
2 files changed, 126 insertions(+), 160 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -1,29 +1,28 @@
-/*
- * Copyright 2002 Andi Kleen, SuSE Labs.
+/*
+ * Copyright 2002 Andi Kleen, SuSE Labs.
* Thanks to Ben LaHaise for precious feedback.
- */
+ */

+#include <linux/mm.h>
+#include <linux/sched.h>
#include <linux/highmem.h>
#include <linux/module.h>
-#include <linux/sched.h>
#include <linux/slab.h>
-#include <linux/mm.h>
-
+#include <asm/uaccess.h>
#include <asm/processor.h>
#include <asm/tlbflush.h>
-#include <asm/sections.h>
-#include <asm/uaccess.h>
#include <asm/pgalloc.h>
+#include <asm/sections.h>

static DEFINE_SPINLOCK(cpa_lock);
static struct list_head df_list = LIST_HEAD_INIT(df_list);

-pte_t *lookup_address(unsigned long address)
-{
+
+pte_t *lookup_address(unsigned long address)
+{
pgd_t *pgd = pgd_offset_k(address);
pud_t *pud;
pmd_t *pmd;
-
if (pgd_none(*pgd))
return NULL;
pud = pud_offset(pgd, address);
@@ -34,22 +33,21 @@ pte_t *lookup_address(unsigned long addr
return NULL;
if (pmd_large(*pmd))
return (pte_t *)pmd;
-
return pte_offset_kernel(pmd, address);
-}
+}

-static struct page *
-split_large_page(unsigned long address, pgprot_t prot, pgprot_t ref_prot)
-{
+static struct page *split_large_page(unsigned long address, pgprot_t prot,
+ pgprot_t ref_prot)
+{
+ int i;
unsigned long addr;
struct page *base;
pte_t *pbase;
- int i;

spin_unlock_irq(&cpa_lock);
base = alloc_pages(GFP_KERNEL, 0);
spin_lock_irq(&cpa_lock);
- if (!base)
+ if (!base)
return NULL;

/*
@@ -60,24 +58,22 @@ split_large_page(unsigned long address,
page_private(base) = 0;

address = __pa(address);
- addr = address & LARGE_PAGE_MASK;
+ addr = address & LARGE_PAGE_MASK;
pbase = (pte_t *)page_address(base);
paravirt_alloc_pt(&init_mm, page_to_pfn(base));
-
for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
- set_pte(&pbase[i], pfn_pte(addr >> PAGE_SHIFT,
- addr == address ? prot : ref_prot));
+ set_pte(&pbase[i], pfn_pte(addr >> PAGE_SHIFT,
+ addr == address ? prot : ref_prot));
}
return base;
-}
+}

static void cache_flush_page(struct page *p)
-{
- void *addr = page_address(p);
+{
+ void *adr = page_address(p);
int i;
-
for (i = 0; i < PAGE_SIZE; i += boot_cpu_data.x86_clflush_size)
- clflush(addr + i);
+ clflush(adr+i);
}

static void flush_kernel_map(void *arg)
@@ -87,27 +83,23 @@ static void flush_kernel_map(void *arg)

/* High level code is not ready for clflush yet */
if (0 && cpu_has_clflush) {
- list_for_each_entry(p, lh, lru)
+ list_for_each_entry (p, lh, lru)
cache_flush_page(p);
- } else {
- if (boot_cpu_data.x86_model >= 4)
- wbinvd();
- }
+ } else if (boot_cpu_data.x86_model >= 4)
+ wbinvd();

- /*
- * Flush all to work around Errata in early athlons regarding
- * large page flushing.
+ /* Flush all to work around Errata in early athlons regarding
+ * large page flushing.
*/
- __flush_tlb_all();
+ __flush_tlb_all();
}

-static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
-{
- unsigned long flags;
+static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
+{
struct page *page;
+ unsigned long flags;

- /* change init_mm */
- set_pte_atomic(kpte, pte);
+ set_pte_atomic(kpte, pte); /* change init_mm */
if (SHARED_KERNEL_PMD)
return;

@@ -116,7 +108,6 @@ static void set_pmd_pte(pte_t *kpte, uns
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
-
pgd = (pgd_t *)page_address(page) + pgd_index(address);
pud = pud_offset(pgd, address);
pmd = pmd_offset(pud, address);
@@ -125,9 +116,9 @@ static void set_pmd_pte(pte_t *kpte, uns
spin_unlock_irqrestore(&pgd_lock, flags);
}

-/*
- * No more special protections in this 2/4MB area - revert to a large
- * page again.
+/*
+ * No more special protections in this 2/4MB area - revert to a
+ * large page again.
*/
static inline void revert_page(struct page *kpte_page, unsigned long address)
{
@@ -151,11 +142,12 @@ static inline void save_page(struct page
list_add(&kpte_page->lru, &df_list);
}

-static int __change_page_attr(struct page *page, pgprot_t prot)
-{
- struct page *kpte_page;
+static int
+__change_page_attr(struct page *page, pgprot_t prot)
+{
+ pte_t *kpte;
unsigned long address;
- pte_t *kpte;
+ struct page *kpte_page;

BUG_ON(PageHighMem(page));
address = (unsigned long)page_address(page);
@@ -163,17 +155,16 @@ static int __change_page_attr(struct pag
kpte = lookup_address(address);
if (!kpte)
return -EINVAL;
-
kpte_page = virt_to_page(kpte);
BUG_ON(PageLRU(kpte_page));
BUG_ON(PageCompound(kpte_page));

- if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
+ if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
if (!pte_huge(*kpte)) {
- set_pte_atomic(kpte, mk_pte(page, prot));
+ set_pte_atomic(kpte, mk_pte(page, prot));
} else {
- struct page *split;
pgprot_t ref_prot;
+ struct page *split;

ref_prot =
((address & LARGE_PAGE_MASK) < (unsigned long)&_etext)
@@ -181,19 +172,16 @@ static int __change_page_attr(struct pag
split = split_large_page(address, prot, ref_prot);
if (!split)
return -ENOMEM;
-
- set_pmd_pte(kpte, address, mk_pte(split, ref_prot));
+ set_pmd_pte(kpte,address,mk_pte(split, ref_prot));
kpte_page = split;
}
page_private(kpte_page)++;
- } else {
- if (!pte_huge(*kpte)) {
- set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
- BUG_ON(page_private(kpte_page) == 0);
- page_private(kpte_page)--;
- } else
- BUG();
- }
+ } else if (!pte_huge(*kpte)) {
+ set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
+ BUG_ON(page_private(kpte_page) == 0);
+ page_private(kpte_page)--;
+ } else
+ BUG();

/*
* If the pte was reserved, it means it was created at boot
@@ -209,7 +197,7 @@ static int __change_page_attr(struct pag
}
}
return 0;
-}
+}

static inline void flush_map(struct list_head *l)
{
@@ -223,33 +211,32 @@ static inline void flush_map(struct list
* than write-back somewhere - some CPUs do not like it when mappings with
* different caching policies exist. This changes the page attributes of the
* in kernel linear mapping too.
- *
+ *
* The caller needs to ensure that there are no conflicting mappings elsewhere.
* This function only deals with the kernel linear map.
- *
+ *
* Caller must call global_flush_tlb() after this.
*/
int change_page_attr(struct page *page, int numpages, pgprot_t prot)
{
+ int err = 0;
+ int i;
unsigned long flags;
- int err = 0, i;

spin_lock_irqsave(&cpa_lock, flags);
- for (i = 0; i < numpages; i++, page++) {
+ for (i = 0; i < numpages; i++, page++) {
err = __change_page_attr(page, prot);
- if (err)
- break;
- }
+ if (err)
+ break;
+ }
spin_unlock_irqrestore(&cpa_lock, flags);
-
return err;
}
-EXPORT_SYMBOL(change_page_attr);

void global_flush_tlb(void)
{
- struct page *pg, *next;
struct list_head l;
+ struct page *pg, *next;

BUG_ON(irqs_disabled());

@@ -266,28 +253,26 @@ void global_flush_tlb(void)
__free_page(pg);
}
}
-EXPORT_SYMBOL(global_flush_tlb);

#ifdef CONFIG_DEBUG_PAGEALLOC
void kernel_map_pages(struct page *page, int numpages, int enable)
{
if (PageHighMem(page))
return;
- if (!enable) {
+ if (!enable)
debug_check_no_locks_freed(page_address(page),
numpages * PAGE_SIZE);
- }

- /*
- * the return value is ignored - the calls cannot fail,
+ /* the return value is ignored - the calls cannot fail,
* large pages are disabled at boot time.
*/
change_page_attr(page, numpages, enable ? PAGE_KERNEL : __pgprot(0));
-
- /*
- * we should perform an IPI and flush all tlbs,
+ /* we should perform an IPI and flush all tlbs,
* but that can deadlock->flush only current cpu.
*/
__flush_tlb_all();
}
#endif
+
+EXPORT_SYMBOL(change_page_attr);
+EXPORT_SYMBOL(global_flush_tlb);
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -1,54 +1,48 @@
-/*
- * Copyright 2002 Andi Kleen, SuSE Labs.
+/*
+ * Copyright 2002 Andi Kleen, SuSE Labs.
* Thanks to Ben LaHaise for precious feedback.
- */
+ */

+#include <linux/mm.h>
+#include <linux/sched.h>
#include <linux/highmem.h>
#include <linux/module.h>
-#include <linux/sched.h>
#include <linux/slab.h>
-#include <linux/mm.h>
-
+#include <asm/uaccess.h>
#include <asm/processor.h>
#include <asm/tlbflush.h>
-#include <asm/uaccess.h>
#include <asm/io.h>

pte_t *lookup_address(unsigned long address)
-{
+{
pgd_t *pgd = pgd_offset_k(address);
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
-
if (pgd_none(*pgd))
return NULL;
pud = pud_offset(pgd, address);
if (!pud_present(*pud))
- return NULL;
+ return NULL;
pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
- return NULL;
+ return NULL;
if (pmd_large(*pmd))
return (pte_t *)pmd;
-
pte = pte_offset_kernel(pmd, address);
if (pte && !pte_present(*pte))
- pte = NULL;
-
+ pte = NULL;
return pte;
-}
+}

-static struct page *
-split_large_page(unsigned long address, pgprot_t prot, pgprot_t ref_prot)
-{
+static struct page *split_large_page(unsigned long address, pgprot_t prot,
+ pgprot_t ref_prot)
+{
+ int i;
unsigned long addr;
- struct page *base;
+ struct page *base = alloc_pages(GFP_KERNEL, 0);
pte_t *pbase;
- int i;
-
- base = alloc_pages(GFP_KERNEL, 0);
- if (!base)
+ if (!base)
return NULL;
/*
* page_private is used to track the number of entries in
@@ -58,21 +52,20 @@ split_large_page(unsigned long address,
page_private(base) = 0;

address = __pa(address);
- addr = address & LARGE_PAGE_MASK;
+ addr = address & LARGE_PAGE_MASK;
pbase = (pte_t *)page_address(base);
for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE) {
- pbase[i] = pfn_pte(addr >> PAGE_SHIFT,
+ pbase[i] = pfn_pte(addr >> PAGE_SHIFT,
addr == address ? prot : ref_prot);
}
return base;
-}
+}

-void clflush_cache_range(void *addr, int size)
+void clflush_cache_range(void *adr, int size)
{
int i;
-
for (i = 0; i < size; i += boot_cpu_data.x86_clflush_size)
- clflush(addr+i);
+ clflush(adr+i);
}

static void flush_kernel_map(void *arg)
@@ -83,20 +76,17 @@ static void flush_kernel_map(void *arg)
/* When clflush is available always use it because it is
much cheaper than WBINVD. */
/* clflush is still broken. Disable for now. */
- if (1 || !cpu_has_clflush) {
+ if (1 || !cpu_has_clflush)
asm volatile("wbinvd" ::: "memory");
- } else {
- list_for_each_entry(pg, l, lru) {
- void *addr = page_address(pg);
-
- clflush_cache_range(addr, PAGE_SIZE);
- }
+ else list_for_each_entry(pg, l, lru) {
+ void *adr = page_address(pg);
+ clflush_cache_range(adr, PAGE_SIZE);
}
__flush_tlb_all();
}

static inline void flush_map(struct list_head *l)
-{
+{
on_each_cpu(flush_kernel_map, l, 1, 1);
}

@@ -108,47 +98,44 @@ static inline void save_page(struct page
list_add(&fpage->lru, &deferred_pages);
}

-/*
+/*
* No more special protections in this 2/4MB area - revert to a
- * large page again.
+ * large page again.
*/
static void revert_page(unsigned long address, pgprot_t ref_prot)
{
- unsigned long pfn;
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t large_pte;
+ unsigned long pfn;

pgd = pgd_offset_k(address);
BUG_ON(pgd_none(*pgd));
- pud = pud_offset(pgd, address);
+ pud = pud_offset(pgd,address);
BUG_ON(pud_none(*pud));
pmd = pmd_offset(pud, address);
BUG_ON(pmd_val(*pmd) & _PAGE_PSE);
pfn = (__pa(address) & LARGE_PAGE_MASK) >> PAGE_SHIFT;
large_pte = pfn_pte(pfn, ref_prot);
large_pte = pte_mkhuge(large_pte);
-
set_pte((pte_t *)pmd, large_pte);
-}
+}

static int
__change_page_attr(unsigned long address, unsigned long pfn, pgprot_t prot,
- pgprot_t ref_prot)
-{
+ pgprot_t ref_prot)
+{
+ pte_t *kpte;
struct page *kpte_page;
pgprot_t ref_prot2;
- pte_t *kpte;

kpte = lookup_address(address);
- if (!kpte)
- return 0;
-
+ if (!kpte) return 0;
kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK);
BUG_ON(PageLRU(kpte_page));
BUG_ON(PageCompound(kpte_page));
- if (pgprot_val(prot) != pgprot_val(ref_prot)) {
+ if (pgprot_val(prot) != pgprot_val(ref_prot)) {
if (!pte_huge(*kpte)) {
set_pte(kpte, pfn_pte(pfn, prot));
} else {
@@ -157,7 +144,6 @@ __change_page_attr(unsigned long address
* change_page_attr on the split page.
*/
struct page *split;
-
ref_prot2 = pte_pgprot(pte_clrhuge(*kpte));
split = split_large_page(address, prot, ref_prot2);
if (!split)
@@ -167,14 +153,12 @@ __change_page_attr(unsigned long address
kpte_page = split;
}
page_private(kpte_page)++;
- } else {
- if (!pte_huge(*kpte)) {
- set_pte(kpte, pfn_pte(pfn, ref_prot));
- BUG_ON(page_private(kpte_page) == 0);
- page_private(kpte_page)--;
- } else
- BUG();
- }
+ } else if (!pte_huge(*kpte)) {
+ set_pte(kpte, pfn_pte(pfn, ref_prot));
+ BUG_ON(page_private(kpte_page) == 0);
+ page_private(kpte_page)--;
+ } else
+ BUG();

/* on x86-64 the direct mapping set at boot is not using 4k pages */
BUG_ON(PageReserved(kpte_page));
@@ -183,7 +167,7 @@ __change_page_attr(unsigned long address
if (page_private(kpte_page) == 0)
revert_page(address, ref_prot);
return 0;
-}
+}

/*
* Change the page attributes of an page in the linear mapping.
@@ -192,19 +176,19 @@ __change_page_attr(unsigned long address
* than write-back somewhere - some CPUs do not like it when mappings with
* different caching policies exist. This changes the page attributes of the
* in kernel linear mapping too.
- *
+ *
* The caller needs to ensure that there are no conflicting mappings elsewhere.
* This function only deals with the kernel linear map.
- *
+ *
* Caller must call global_flush_tlb() after this.
*/
int change_page_attr_addr(unsigned long address, int numpages, pgprot_t prot)
{
- int err = 0, kernel_map = 0, i;
-
- if (address >= __START_KERNEL_map &&
- address < __START_KERNEL_map + KERNEL_TEXT_SIZE) {
+ int err = 0, kernel_map = 0;
+ int i;

+ if (address >= __START_KERNEL_map
+ && address < __START_KERNEL_map + KERNEL_TEXT_SIZE) {
address = (unsigned long)__va(__pa(address));
kernel_map = 1;
}
@@ -214,8 +198,7 @@ int change_page_attr_addr(unsigned long
unsigned long pfn = __pa(address) >> PAGE_SHIFT;

if (!kernel_map || pte_present(pfn_pte(0, prot))) {
- err = __change_page_attr(address, pfn, prot,
- PAGE_KERNEL);
+ err = __change_page_attr(address, pfn, prot, PAGE_KERNEL);
if (err)
break;
}
@@ -224,16 +207,14 @@ int change_page_attr_addr(unsigned long
if (__pa(address) < KERNEL_TEXT_SIZE) {
unsigned long addr2;
pgprot_t prot2;
-
addr2 = __START_KERNEL_map + __pa(address);
/* Make sure the kernel mappings stay executable */
prot2 = pte_pgprot(pte_mkexec(pfn_pte(0, prot)));
err = __change_page_attr(addr2, pfn, prot2,
PAGE_KERNEL_EXEC);
- }
- }
- up_write(&init_mm.mmap_sem);
-
+ }
+ }
+ up_write(&init_mm.mmap_sem);
return err;
}

@@ -241,13 +222,11 @@ int change_page_attr_addr(unsigned long
int change_page_attr(struct page *page, int numpages, pgprot_t prot)
{
unsigned long addr = (unsigned long)page_address(page);
-
return change_page_attr_addr(addr, numpages, prot);
}
-EXPORT_SYMBOL(change_page_attr);

void global_flush_tlb(void)
-{
+{
struct page *pg, *next;
struct list_head l;

@@ -269,6 +248,8 @@ void global_flush_tlb(void)
continue;
ClearPagePrivate(pg);
__free_page(pg);
- }
-}
+ }
+}
+
+EXPORT_SYMBOL(change_page_attr);
EXPORT_SYMBOL(global_flush_tlb);

2008-01-14 22:18:56

by Andi Kleen

Subject: [PATCH] [7/31] Extract page table dumping code from i386 fault handler into dump_pagetable()


Similar to x86-64. This is useful in other situations where we want
the page table dumped too.

Besides, anything that makes i386 do_page_fault() shorter is good.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/fault_32.c | 72 ++++++++++++++++++++++++++-----------------------
1 file changed, 39 insertions(+), 33 deletions(-)

Index: linux/arch/x86/mm/fault_32.c
===================================================================
--- linux.orig/arch/x86/mm/fault_32.c
+++ linux/arch/x86/mm/fault_32.c
@@ -28,6 +28,44 @@
#include <asm/desc.h>
#include <asm/segment.h>

+void dump_pagetable(unsigned long address)
+{
+ typeof(pte_val(__pte(0))) page;
+
+ page = read_cr3();
+ page = ((__typeof__(page) *) __va(page))[address >> PGDIR_SHIFT];
+#ifdef CONFIG_X86_PAE
+ printk("*pdpt = %016Lx ", page);
+ if ((page >> PAGE_SHIFT) < max_low_pfn
+ && page & _PAGE_PRESENT) {
+ page &= PAGE_MASK;
+ page = ((__typeof__(page) *) __va(page))[(address >> PMD_SHIFT)
+ & (PTRS_PER_PMD - 1)];
+ printk(KERN_CONT "*pde = %016Lx ", page);
+ page &= ~_PAGE_NX;
+ }
+#else
+ printk("*pde = %08lx ", page);
+#endif
+
+ /*
+ * We must not directly access the pte in the highpte
+ * case if the page table is located in highmem.
+ * And let's rather not kmap-atomic the pte, just in case
+ * it's allocated already.
+ */
+ if ((page >> PAGE_SHIFT) < max_low_pfn
+ && (page & _PAGE_PRESENT)
+ && !(page & _PAGE_PSE)) {
+ page &= PAGE_MASK;
+ page = ((__typeof__(page) *) __va(page))[(address >> PAGE_SHIFT)
+ & (PTRS_PER_PTE - 1)];
+ printk("*pte = %0*Lx ", sizeof(page)*2, (u64)page);
+ }
+
+ printk("\n");
+}
+
/*
* Page fault error code bits
* bit 0 == 0 means no page found, 1 means protection fault
@@ -530,7 +568,6 @@ no_context:
bust_spinlocks(1);

if (oops_may_print()) {
- __typeof__(pte_val(__pte(0))) page;

#ifdef CONFIG_X86_PAE
if (error_code & PF_INSTR) {
@@ -551,38 +588,7 @@ no_context:
printk(" at virtual address %08lx\n", address);
printk(KERN_ALERT "printing ip: %08lx ", regs->ip);

- page = read_cr3();
- page = ((__typeof__(page) *) __va(page))[address >> PGDIR_SHIFT];
-#ifdef CONFIG_X86_PAE
- printk("*pdpt = %016Lx ", page);
- if ((page >> PAGE_SHIFT) < max_low_pfn
- && page & _PAGE_PRESENT) {
- page &= PAGE_MASK;
- page = ((__typeof__(page) *) __va(page))[(address >> PMD_SHIFT)
- & (PTRS_PER_PMD - 1)];
- printk(KERN_CONT "*pde = %016Lx ", page);
- page &= ~_PAGE_NX;
- }
-#else
- printk("*pde = %08lx ", page);
-#endif
-
- /*
- * We must not directly access the pte in the highpte
- * case if the page table is located in highmem.
- * And let's rather not kmap-atomic the pte, just in case
- * it's allocated already.
- */
- if ((page >> PAGE_SHIFT) < max_low_pfn
- && (page & _PAGE_PRESENT)
- && !(page & _PAGE_PSE)) {
- page &= PAGE_MASK;
- page = ((__typeof__(page) *) __va(page))[(address >> PAGE_SHIFT)
- & (PTRS_PER_PTE - 1)];
- printk("*pte = %0*Lx ", sizeof(page)*2, (u64)page);
- }
-
- printk("\n");
+ dump_pagetable(address);
}

tsk->thread.cr2 = address;

2008-01-14 22:19:21

by Andi Kleen

Subject: [PATCH] [8/31] CPA: Return the page table level in lookup_address()


Needed for the next change.

And change all the callers.
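
The new calling convention, condensed from the hunks below (the extra output
parameter reports the page table level: 2/3 on 32bit and 3/4 on 64bit, i.e.
large page vs. 4k page):

        int level;
        pte_t *pte = lookup_address(address, &level);   /* NULL if not mapped */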

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/fault_32.c | 3 ++-
arch/x86/mm/init_32.c | 3 ++-
arch/x86/mm/pageattr_32.c | 7 +++++--
arch/x86/mm/pageattr_64.c | 7 +++++--
arch/x86/xen/mmu.c | 9 ++++++---
include/asm-x86/pgtable_32.h | 2 +-
include/asm-x86/pgtable_64.h | 2 +-
7 files changed, 22 insertions(+), 11 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -13,7 +13,7 @@
#include <asm/tlbflush.h>
#include <asm/io.h>

-pte_t *lookup_address(unsigned long address)
+pte_t *lookup_address(unsigned long address, int *level)
{
pgd_t *pgd = pgd_offset_k(address);
pud_t *pud;
@@ -27,8 +27,10 @@ pte_t *lookup_address(unsigned long addr
pmd = pmd_offset(pud, address);
if (!pmd_present(*pmd))
return NULL;
+ *level = 3;
if (pmd_large(*pmd))
return (pte_t *)pmd;
+ *level = 4;
pte = pte_offset_kernel(pmd, address);
if (pte && !pte_present(*pte))
pte = NULL;
@@ -129,8 +131,9 @@ __change_page_attr(unsigned long address
pte_t *kpte;
struct page *kpte_page;
pgprot_t ref_prot2;
+ int level;

- kpte = lookup_address(address);
+ kpte = lookup_address(address, &level);
if (!kpte) return 0;
kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK);
BUG_ON(PageLRU(kpte_page));
Index: linux/include/asm-x86/pgtable_64.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_64.h
+++ linux/include/asm-x86/pgtable_64.h
@@ -240,7 +240,7 @@ extern struct list_head pgd_list;

extern int kern_addr_valid(unsigned long addr);

-pte_t *lookup_address(unsigned long addr);
+pte_t *lookup_address(unsigned long addr, int *level);

#define io_remap_pfn_range(vma, vaddr, pfn, size, prot) \
remap_pfn_range(vma, vaddr, pfn, size, prot)
Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -18,7 +18,7 @@ static DEFINE_SPINLOCK(cpa_lock);
static struct list_head df_list = LIST_HEAD_INIT(df_list);


-pte_t *lookup_address(unsigned long address)
+pte_t *lookup_address(unsigned long address, int *level)
{
pgd_t *pgd = pgd_offset_k(address);
pud_t *pud;
@@ -31,8 +31,10 @@ pte_t *lookup_address(unsigned long addr
pmd = pmd_offset(pud, address);
if (pmd_none(*pmd))
return NULL;
+ *level = 2;
if (pmd_large(*pmd))
return (pte_t *)pmd;
+ *level = 3;
return pte_offset_kernel(pmd, address);
}

@@ -148,11 +150,12 @@ __change_page_attr(struct page *page, pg
pte_t *kpte;
unsigned long address;
struct page *kpte_page;
+ int level;

BUG_ON(PageHighMem(page));
address = (unsigned long)page_address(page);

- kpte = lookup_address(address);
+ kpte = lookup_address(address, &level);
if (!kpte)
return -EINVAL;
kpte_page = virt_to_page(kpte);
Index: linux/include/asm-x86/pgtable_32.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_32.h
+++ linux/include/asm-x86/pgtable_32.h
@@ -182,7 +182,7 @@ static inline void clone_pgd_range(pgd_t
* NOTE: the return type is pte_t but if the pmd is PSE then we return it
* as a pte too.
*/
-extern pte_t *lookup_address(unsigned long address);
+extern pte_t *lookup_address(unsigned long address, int *level);

/*
* Make a given kernel text page executable/non-executable.
Index: linux/arch/x86/mm/fault_32.c
===================================================================
--- linux.orig/arch/x86/mm/fault_32.c
+++ linux/arch/x86/mm/fault_32.c
@@ -571,7 +571,8 @@ no_context:

#ifdef CONFIG_X86_PAE
if (error_code & PF_INSTR) {
- pte_t *pte = lookup_address(address);
+ int level;
+ pte_t *pte = lookup_address(address, &level);

if (pte && pte_present(*pte) && !pte_exec_kernel(*pte))
printk(KERN_CRIT "kernel tried to execute "
Index: linux/arch/x86/mm/init_32.c
===================================================================
--- linux.orig/arch/x86/mm/init_32.c
+++ linux/arch/x86/mm/init_32.c
@@ -529,11 +529,12 @@ int __init set_kernel_exec(unsigned long
{
pte_t *pte;
int ret = 1;
+ int level;

if (!nx_enabled)
goto out;

- pte = lookup_address(vaddr);
+ pte = lookup_address(vaddr, &level);
BUG_ON(!pte);

if (!pte_exec_kernel(*pte))
Index: linux/arch/x86/xen/mmu.c
===================================================================
--- linux.orig/arch/x86/xen/mmu.c
+++ linux/arch/x86/xen/mmu.c
@@ -58,7 +58,8 @@

xmaddr_t arbitrary_virt_to_machine(unsigned long address)
{
- pte_t *pte = lookup_address(address);
+ int level;
+ pte_t *pte = lookup_address(address, &level);
unsigned offset = address & PAGE_MASK;

BUG_ON(pte == NULL);
@@ -70,8 +71,9 @@ void make_lowmem_page_readonly(void *vad
{
pte_t *pte, ptev;
unsigned long address = (unsigned long)vaddr;
+ int level;

- pte = lookup_address(address);
+ pte = lookup_address(address, &level);
BUG_ON(pte == NULL);

ptev = pte_wrprotect(*pte);
@@ -84,8 +86,9 @@ void make_lowmem_page_readwrite(void *va
{
pte_t *pte, ptev;
unsigned long address = (unsigned long)vaddr;
+ int level;

- pte = lookup_address(address);
+ pte = lookup_address(address, &level);
BUG_ON(pte == NULL);

ptev = pte_mkwrite(*pte);

2008-01-14 22:19:36

by Andi Kleen

Subject: [PATCH] [9/31] CPA: Add simple self test at boot


Since change_page_attr() is tricky code, it is good to have some regression
test code. This patch maps and unmaps some random pages in the direct mapping
at boot, then dumps the state and does some simple sanity checks.

Add it with a CONFIG option.

Optional patch, but I find it useful.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/Makefile_32 | 1
arch/x86/mm/Makefile_64 | 1
arch/x86/mm/pageattr-test.c | 233 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 235 insertions(+)

Index: linux/arch/x86/mm/Makefile_64
===================================================================
--- linux.orig/arch/x86/mm/Makefile_64
+++ linux/arch/x86/mm/Makefile_64
@@ -7,3 +7,4 @@ obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpag
obj-$(CONFIG_NUMA) += numa_64.o
obj-$(CONFIG_K8_NUMA) += k8topology_64.o
obj-$(CONFIG_ACPI_NUMA) += srat_64.o
+obj-$(CONFIG_CPA_DEBUG) += pageattr-test.o
Index: linux/arch/x86/mm/pageattr-test.c
===================================================================
--- /dev/null
+++ linux/arch/x86/mm/pageattr-test.c
@@ -0,0 +1,233 @@
+/*
+ * self test for change_page_attr.
+ *
+ * Clears the global bit on random pages in the direct mapping, then reverts
+ * and compares page tables forwards and afterwards.
+ */
+
+#include <linux/mm.h>
+#include <linux/random.h>
+#include <linux/kernel.h>
+#include <linux/init.h>
+#include <linux/bootmem.h>
+#include <asm/cacheflush.h>
+#include <asm/pgtable.h>
+#include <asm/kdebug.h>
+
+enum {
+ NTEST = 400,
+#ifdef CONFIG_X86_64
+ LOWEST_LEVEL = 4,
+ LPS = (1 << PMD_SHIFT),
+#elif defined(CONFIG_X86_PAE)
+ LOWEST_LEVEL = 3,
+ LPS = (1 << PMD_SHIFT),
+#else
+ LOWEST_LEVEL = 3, /* lookup_address lies here */
+ LPS = (1 << 22),
+#endif
+ GPS = (1<<30)
+};
+
+#ifdef CONFIG_X86_64
+#include <asm/proto.h>
+#define max_mapped end_pfn_map
+#else
+#define max_mapped max_low_pfn
+#endif
+
+struct split_state {
+ long lpg, gpg, spg, exec;
+ long min_exec, max_exec;
+};
+
+static __init int print_split(struct split_state *s)
+{
+ int printed = 0;
+ long i, expected, missed = 0;
+ int err = 0;
+
+ s->lpg = s->gpg = s->spg = s->exec = 0;
+ s->min_exec = ~0UL;
+ s->max_exec = 0;
+ for (i = 0; i < max_mapped; ) {
+ int level;
+ pte_t *pte;
+ unsigned long adr = (unsigned long)__va(i << PAGE_SHIFT);
+
+ pte = lookup_address(adr, &level);
+ if (!pte) {
+ if (!printed) {
+ dump_pagetable(adr);
+ printk("CPA %lx no pte level %d\n", adr, level);
+ printed = 1;
+ }
+ missed++;
+ i++;
+ continue;
+ }
+
+ if (level == 2 && sizeof(long) == 8) {
+ s->gpg++;
+ i += GPS/PAGE_SIZE;
+ } else if (level != LOWEST_LEVEL) {
+ if (!(pte_val(*pte) & _PAGE_PSE)) {
+ printk("%lx level %d but not PSE %Lx\n",
+ adr, level, (u64)pte_val(*pte));
+ err = 1;
+ }
+ s->lpg++;
+ i += LPS/PAGE_SIZE;
+ } else {
+ s->spg++;
+ i++;
+ }
+ if (!(pte_val(*pte) & _PAGE_NX)) {
+ s->exec++;
+ if (adr < s->min_exec)
+ s->min_exec = adr;
+ if (adr > s->max_exec)
+ s->max_exec = adr;
+ }
+ }
+ printk("CPA mapping 4k %lu large %lu gb %lu x %lu[%lx-%lx] miss %lu\n",
+ s->spg, s->lpg, s->gpg, s->exec,
+ s->min_exec != ~0UL ? s->min_exec : 0, s->max_exec, missed);
+ expected = (s->gpg*GPS + s->lpg*LPS)/PAGE_SIZE + s->spg + missed;
+ if (expected != i) {
+ printk("CPA max_mapped %lu but expected %lu\n",
+ max_mapped, expected);
+ return 1;
+ }
+ return err;
+}
+
+static __init int state_same(struct split_state *a, struct split_state *b)
+{
+ return a->lpg == b->lpg && a->gpg == b->gpg && a->spg == b->spg &&
+ a->exec == b->exec;
+}
+
+static unsigned long addr[NTEST] __initdata;
+static unsigned len[NTEST] __initdata;
+
+/* Change the global bit on random pages in the direct mapping */
+static __init int exercise_pageattr(void)
+{
+ int i, k;
+ pte_t *pte, pte0;
+ int level;
+ int err;
+ struct split_state sa, sb, sc;
+ int failed = 0;
+ unsigned long *bm;
+
+ printk("CPA exercising pageattr\n");
+
+ bm = vmalloc((max_mapped + 7) / 8);
+ if (!bm) {
+ printk("CPA Cannot vmalloc bitmap\n");
+ return -ENOMEM;
+ }
+ memset(bm, 0, (max_mapped + 7) / 8);
+
+ failed += print_split(&sa);
+ srandom32(100);
+ for (i = 0; i < NTEST; i++) {
+ unsigned long pfn = random32() % max_mapped;
+ addr[i] = (unsigned long)__va(pfn << PAGE_SHIFT);
+ len[i] = random32() % 100;
+ len[i] = min_t(unsigned long, len[i], max_mapped - pfn - 1);
+ if (len[i] == 0)
+ len[i] = 1;
+
+ pte = NULL;
+ pte0 = pfn_pte(0, __pgprot(0)); /* shut gcc up */
+ for (k = 0; k < len[i]; k++) {
+ pte = lookup_address(addr[i] + k*PAGE_SIZE, &level);
+ if (!pte || pgprot_val(pte_pgprot(*pte)) == 0) {
+ addr[i] = 0;
+ break;
+ }
+ if (k == 0)
+ pte0 = *pte;
+ else if (pgprot_val(pte_pgprot(*pte)) !=
+ pgprot_val(pte_pgprot(pte0))) {
+ len[i] = k;
+ break;
+ }
+ if (test_bit(pfn + k, bm)) {
+ len[i] = k;
+ break;
+ }
+ __set_bit(pfn + k, bm);
+ }
+ if (!addr[i] || !pte || !k) {
+ addr[i] = 0;
+ continue;
+ }
+
+ err = change_page_attr(virt_to_page(addr[i]), len[i],
+ pte_pgprot(pte_clrhuge(pte_clrglobal(pte0))));
+ if (err < 0) {
+ printk("CPA %d failed %d\n", i, err);
+ failed++;
+ }
+
+ pte = lookup_address(addr[i], &level);
+ if (!pte || pte_global(*pte) || pte_huge(*pte)) {
+ printk("CPA %lx: bad pte %Lx\n", addr[i],
+ pte ? (u64)pte_val(*pte) : 0ULL);
+ failed++;
+ }
+ if (level != LOWEST_LEVEL) {
+ printk("CPA %lx: unexpected level %d\n", addr[i],
+ level);
+ failed++;
+ }
+
+ }
+ vfree(bm);
+ global_flush_tlb();
+
+ failed += print_split(&sb);
+
+ printk("CPA reverting everything\n");
+ for (i = 0; i < NTEST; i++) {
+ if (!addr[i])
+ continue;
+ pte = lookup_address(addr[i], &level);
+ if (!pte) {
+ printk("CPA lookup of %lx failed\n", addr[i]);
+ failed++;
+ continue;
+ }
+ err = change_page_attr(virt_to_page(addr[i]), len[i],
+ pte_pgprot(pte_mkglobal(*pte)));
+ if (err < 0) {
+ printk("CPA reverting failed: %d\n", err);
+ failed++;
+ }
+ pte = lookup_address(addr[i], &level);
+ if (!pte || !pte_global(*pte)) {
+ printk("CPA %lx: bad pte after revert %Lx\n", addr[i],
+ pte ? (u64)pte_val(*pte) : 0ULL);
+ failed++;
+ }
+
+ }
+ global_flush_tlb();
+
+ failed += print_split(&sc);
+ if (!state_same(&sa, &sc))
+ failed++;
+
+ if (failed)
+ printk("CPA selftests NOT PASSED. Please report.\n");
+ else
+ printk("CPA selftests PASSED\n");
+
+ return 0;
+}
+
+module_init(exercise_pageattr);
Index: linux/arch/x86/mm/Makefile_32
===================================================================
--- linux.orig/arch/x86/mm/Makefile_32
+++ linux/arch/x86/mm/Makefile_32
@@ -8,3 +8,4 @@ obj-$(CONFIG_NUMA) += discontig_32.o
obj-$(CONFIG_HUGETLB_PAGE) += hugetlbpage.o
obj-$(CONFIG_HIGHMEM) += highmem_32.o
obj-$(CONFIG_BOOT_IOREMAP) += boot_ioremap_32.o
+obj-$(CONFIG_CPA_DEBUG) += pageattr-test.o

2008-01-14 22:19:53

by Andi Kleen

Subject: [PATCH] [10/31] CPA: Change kernel_map_pages to not use c_p_a()


CONFIG_DEBUG_PAGEALLOC uses change_page_attr() to map and unmap pages in
order to catch stray references to freed kernel pages. But standard c_p_a()
does a lot of unnecessary work for this simple case with pre-split mappings.

Change kernel_map_pages() to just access the page table directly, which is
simpler and faster.

I also fixed it to use INVLPG if available.

This is required for later changes that make c_p_a() use kmalloc(). Without
this we would risk infinite recursion. Also, in general things are easier
when sleeping is allowed.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 34 ++++++++++++++++++++++++----------
1 file changed, 24 insertions(+), 10 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -258,22 +258,36 @@ void global_flush_tlb(void)
}

#ifdef CONFIG_DEBUG_PAGEALLOC
+/* Map or unmap pages in the kernel direct mapping for kernel debugging. */
void kernel_map_pages(struct page *page, int numpages, int enable)
{
+ unsigned long addr;
+ int i;
+
if (PageHighMem(page))
return;
+ addr = (unsigned long)page_address(page);
if (!enable)
- debug_check_no_locks_freed(page_address(page),
- numpages * PAGE_SIZE);
+ debug_check_no_locks_freed((void *)addr, numpages * PAGE_SIZE);
+
+ /* Bootup has forced 4K pages so this is very simple */
+
+ for (i = 0; i < numpages; i++, addr += PAGE_SIZE, page++) {
+ int level;
+ pte_t *pte = lookup_address(addr, &level);

- /* the return value is ignored - the calls cannot fail,
- * large pages are disabled at boot time.
- */
- change_page_attr(page, numpages, enable ? PAGE_KERNEL : __pgprot(0));
- /* we should perform an IPI and flush all tlbs,
- * but that can deadlock->flush only current cpu.
- */
- __flush_tlb_all();
+ BUG_ON(level != 3);
+ if (enable) {
+ set_pte_atomic(pte, mk_pte(page, PAGE_KERNEL));
+ /*
+ * We should perform an IPI and flush all tlbs,
+ * but that can deadlock->flush only current cpu.
+ */
+ __flush_tlb_one(addr);
+ } else {
+ kpte_clear_flush(pte, addr);
+ }
+ }
}
#endif

2008-01-14 22:20:43

by Andi Kleen

Subject: [PATCH] [11/31] CPA: Change 32bit back to init_mm semaphore locking


Now that debug pagealloc uses a separate function, it is better to change
the standard change_page_attr() back to init_mm semaphore locking, like 64bit.
Various things are simpler when sleeping is allowed.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 14 +++++---------
1 file changed, 5 insertions(+), 9 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -14,10 +14,9 @@
#include <asm/pgalloc.h>
#include <asm/sections.h>

-static DEFINE_SPINLOCK(cpa_lock);
+/* Protected by init_mm.mmap_sem */
static struct list_head df_list = LIST_HEAD_INIT(df_list);

-
pte_t *lookup_address(unsigned long address, int *level)
{
pgd_t *pgd = pgd_offset_k(address);
@@ -46,9 +45,7 @@ static struct page *split_large_page(uns
struct page *base;
pte_t *pbase;

- spin_unlock_irq(&cpa_lock);
base = alloc_pages(GFP_KERNEL, 0);
- spin_lock_irq(&cpa_lock);
if (!base)
return NULL;

@@ -224,15 +221,14 @@ int change_page_attr(struct page *page,
{
int err = 0;
int i;
- unsigned long flags;

- spin_lock_irqsave(&cpa_lock, flags);
+ down_write(&init_mm.mmap_sem);
for (i = 0; i < numpages; i++, page++) {
err = __change_page_attr(page, prot);
if (err)
break;
}
- spin_unlock_irqrestore(&cpa_lock, flags);
+ up_write(&init_mm.mmap_sem);
return err;
}

@@ -243,9 +239,9 @@ void global_flush_tlb(void)

BUG_ON(irqs_disabled());

- spin_lock_irq(&cpa_lock);
+ down_write(&init_mm.mmap_sem);
list_replace_init(&df_list, &l);
- spin_unlock_irq(&cpa_lock);
+ up_write(&init_mm.mmap_sem);
flush_map(&l);
list_for_each_entry_safe(pg, next, &l, lru) {
list_del(&pg->lru);

2008-01-14 22:21:03

by Andi Kleen

Subject: [PATCH] [12/31] CPA: CLFLUSH support in change_page_attr()


Queue individual data pages for flushing with CLFLUSH in change_page_attr(),
instead of doing global WBINVDs. WBINVD is a very painful operation for the
CPU (it can take msecs) and quite slow too. Worse, it is not interruptible
and can cause long latencies on hypervisors on older Intel VT systems.

CLFLUSH, on the other hand, only flushes the cache lines that actually need
to be flushed, and since it works in smaller chunks it is more preemptible.

To do this, c_p_a() needs to save the addresses to be flushed for
global_flush_tlb() later. This is done using a separate data structure, not
struct page, because page->lru is often in use, or not there at all for
memory holes.

Also the flushes are done in FIFO order now, not LIFO.
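
The bookkeeping, condensed from the hunks below (an allocation failure simply
degrades to a full flush):

        struct flush {
                struct list_head l;
                unsigned long addr;
        };

        /* Mark one address for flushing later in global_flush_tlb(). */
        static void set_tlb_flush(unsigned long address)
        {
                struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);
                if (!f) {
                        full_flush = 1;         /* fall back to WBINVD + full TLB flush */
                        return;
                }
                f->addr = address;
                list_add_tail(&f->l, &flush_pages);     /* FIFO ordering */
        }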

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 78 ++++++++++++++++++++++++++++++++++------------
arch/x86/mm/pageattr_64.c | 77 ++++++++++++++++++++++++++++++++++-----------
2 files changed, 118 insertions(+), 37 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -13,6 +13,11 @@
#include <asm/tlbflush.h>
#include <asm/io.h>

+struct flush {
+ struct list_head l;
+ unsigned long addr;
+};
+
pte_t *lookup_address(unsigned long address, int *level)
{
pgd_t *pgd = pgd_offset_k(address);
@@ -63,6 +68,11 @@ static struct page *split_large_page(uns
return base;
}

+struct flush_arg {
+ int full_flush;
+ struct list_head l;
+};
+
void clflush_cache_range(void *adr, int size)
{
int i;
@@ -72,27 +82,27 @@ void clflush_cache_range(void *adr, int

static void flush_kernel_map(void *arg)
{
- struct list_head *l = (struct list_head *)arg;
- struct page *pg;
+ struct flush_arg *a = (struct flush_arg *)arg;
+ struct flush *f;
+
+ if (!cpu_has_clflush)
+ a->full_flush = 1;

/* When clflush is available always use it because it is
much cheaper than WBINVD. */
- /* clflush is still broken. Disable for now. */
- if (1 || !cpu_has_clflush)
+ if (a->full_flush)
asm volatile("wbinvd" ::: "memory");
- else list_for_each_entry(pg, l, lru) {
- void *adr = page_address(pg);
- clflush_cache_range(adr, PAGE_SIZE);
+ list_for_each_entry(f, &a->l, l) {
+ if (!a->full_flush)
+ clflush_cache_range((void *)f->addr, PAGE_SIZE);
}
__flush_tlb_all();
}

-static inline void flush_map(struct list_head *l)
-{
- on_each_cpu(flush_kernel_map, l, 1, 1);
-}
-
-static LIST_HEAD(deferred_pages); /* protected by init_mm.mmap_sem */
+/* both protected by init_mm.mmap_sem */
+static int full_flush;
+static LIST_HEAD(deferred_pages);
+static LIST_HEAD(flush_pages);

static inline void save_page(struct page *fpage)
{
@@ -124,6 +134,25 @@ static void revert_page(unsigned long ad
set_pte((pte_t *)pmd, large_pte);
}

+/*
+ * Mark the address for flushing later in global_tlb_flush().
+ *
+ * Other parts of the kernel are already in a feeding frenzy over the various
+ * struct page fields. Instead of trying to compete allocate a separate
+ * data structure to keep track of the flush. This has the added bonus that
+ * it will work for MMIO holes without mem_map too.
+ */
+static void set_tlb_flush(unsigned long address)
+{
+ struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);
+ if (!f) {
+ full_flush = 1;
+ return;
+ }
+ f->addr = address;
+ list_add_tail(&f->l, &flush_pages);
+}
+
static int
__change_page_attr(unsigned long address, unsigned long pfn, pgprot_t prot,
pgprot_t ref_prot)
@@ -136,8 +165,11 @@ __change_page_attr(unsigned long address
kpte = lookup_address(address, &level);
if (!kpte) return 0;
kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK);
- BUG_ON(PageLRU(kpte_page));
BUG_ON(PageCompound(kpte_page));
+ BUG_ON(PageLRU(kpte_page));
+
+ set_tlb_flush(address);
+
if (pgprot_val(prot) != pgprot_val(ref_prot)) {
if (!pte_huge(*kpte)) {
set_pte(kpte, pfn_pte(pfn, prot));
@@ -231,7 +263,9 @@ int change_page_attr(struct page *page,
void global_flush_tlb(void)
{
struct page *pg, *next;
- struct list_head l;
+ struct flush *f, *fnext;
+ struct flush_arg arg;
+ struct list_head free_pages;

/*
* Write-protect the semaphore, to exclude two contexts
@@ -239,12 +273,19 @@ void global_flush_tlb(void)
* exclude new additions to the deferred_pages list:
*/
down_write(&init_mm.mmap_sem);
- list_replace_init(&deferred_pages, &l);
+ arg.full_flush = full_flush;
+ full_flush = 0;
+ list_replace_init(&flush_pages, &arg.l);
+ list_replace_init(&deferred_pages, &free_pages);
up_write(&init_mm.mmap_sem);

- flush_map(&l);
+ on_each_cpu(flush_kernel_map, &arg, 1, 1);
+
+ list_for_each_entry_safe(f, fnext, &arg.l, l) {
+ kfree(f);
+ }

- list_for_each_entry_safe(pg, next, &l, lru) {
+ list_for_each_entry_safe(pg, next, &free_pages, lru) {
list_del(&pg->lru);
clear_bit(PG_arch_1, &pg->flags);
if (page_private(pg) != 0)
Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -15,8 +15,15 @@
#include <asm/sections.h>

/* Protected by init_mm.mmap_sem */
+/* Variables protected by cpa_lock */
+static int full_flush;
static struct list_head df_list = LIST_HEAD_INIT(df_list);
+static LIST_HEAD(flush_pages);

+struct flush {
+ struct list_head l;
+ unsigned long addr;
+};
pte_t *lookup_address(unsigned long address, int *level)
{
pgd_t *pgd = pgd_offset_k(address);
@@ -67,25 +74,31 @@ static struct page *split_large_page(uns
return base;
}

-static void cache_flush_page(struct page *p)
+struct flush_arg {
+ int full_flush;
+ struct list_head l;
+};
+
+void clflush_cache_range(void *adr, int size)
{
- void *adr = page_address(p);
int i;
- for (i = 0; i < PAGE_SIZE; i += boot_cpu_data.x86_clflush_size)
+ for (i = 0; i < size; i += boot_cpu_data.x86_clflush_size)
clflush(adr+i);
}

static void flush_kernel_map(void *arg)
{
- struct list_head *lh = (struct list_head *)arg;
- struct page *p;
+ struct flush_arg *a = (struct flush_arg *)arg;
+ struct flush *f;

- /* High level code is not ready for clflush yet */
- if (0 && cpu_has_clflush) {
- list_for_each_entry (p, lh, lru)
- cache_flush_page(p);
- } else if (boot_cpu_data.x86_model >= 4)
+ if (!cpu_has_clflush)
+ a->full_flush = 1;
+ if (a->full_flush && boot_cpu_data.x86_model >= 4)
wbinvd();
+ list_for_each_entry(f, &a->l, l) {
+ if (!a->full_flush)
+ clflush_cache_range((void *)f->addr, PAGE_SIZE);
+ }

/* Flush all to work around Errata in early athlons regarding
* large page flushing.
@@ -141,6 +154,25 @@ static inline void save_page(struct page
list_add(&kpte_page->lru, &df_list);
}

+/*
+ * Mark the address for flushing later in global_tlb_flush().
+ *
+ * Other parts of the kernel are already in a feeding frenzy over the various
+ * struct page fields. Instead of trying to compete allocate a separate
+ * data structure to keep track of the flush. This has the added bonus that
+ * it will work for MMIO holes without mem_map too.
+ */
+static void set_tlb_flush(unsigned long address)
+{
+ struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);
+ if (!f) {
+ full_flush = 1;
+ return;
+ }
+ f->addr = address;
+ list_add_tail(&f->l, &flush_pages);
+}
+
static int
__change_page_attr(struct page *page, pgprot_t prot)
{
@@ -159,6 +191,8 @@ __change_page_attr(struct page *page, pg
BUG_ON(PageLRU(kpte_page));
BUG_ON(PageCompound(kpte_page));

+ set_tlb_flush(address);
+
if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
if (!pte_huge(*kpte)) {
set_pte_atomic(kpte, mk_pte(page, prot));
@@ -199,11 +233,6 @@ __change_page_attr(struct page *page, pg
return 0;
}

-static inline void flush_map(struct list_head *l)
-{
- on_each_cpu(flush_kernel_map, l, 1, 1);
-}
-
/*
* Change the page attributes of an page in the linear mapping.
*
@@ -234,16 +263,27 @@ int change_page_attr(struct page *page,

void global_flush_tlb(void)
{
- struct list_head l;
+ struct flush_arg arg;
struct page *pg, *next;
+ struct flush *f, *fnext;
+ struct list_head free_pages;

BUG_ON(irqs_disabled());

down_write(&init_mm.mmap_sem);
- list_replace_init(&df_list, &l);
+ arg.full_flush = full_flush;
+ full_flush = 0;
+ list_replace_init(&flush_pages, &arg.l);
+ list_replace_init(&df_list, &free_pages);
up_write(&init_mm.mmap_sem);
- flush_map(&l);
- list_for_each_entry_safe(pg, next, &l, lru) {
+
+ on_each_cpu(flush_kernel_map, &arg, 1, 1);
+
+ list_for_each_entry_safe(f, fnext, &arg.l, l) {
+ kfree(f);
+ }
+
+ list_for_each_entry_safe(pg, next, &free_pages, lru) {
list_del(&pg->lru);
clear_bit(PG_arch_1, &pg->flags);
if (PageReserved(pg) || !cpu_has_pse || page_private(pg) != 0)

2008-01-14 22:21:21

by Andi Kleen

Subject: [PATCH] [13/31] CPA: Use macros to modify the PG_arch_1 page flags in change_page_attr


Instead of open coding the bit accesses, use standard-style *PageDeferred*
macros.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 14 ++++++++++----
arch/x86/mm/pageattr_64.c | 11 +++++++++--
2 files changed, 19 insertions(+), 6 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -18,6 +18,13 @@ struct flush {
unsigned long addr;
};

+#define PG_deferred PG_arch_1
+
+#define PageDeferred(p) test_bit(PG_deferred, &(p)->flags)
+#define SetPageDeferred(p) set_bit(PG_deferred, &(p)->flags)
+#define ClearPageDeferred(p) clear_bit(PG_deferred, &(p)->flags)
+#define TestSetPageDeferred(p) test_and_set_bit(PG_deferred, &(p)->flags)
+
pte_t *lookup_address(unsigned long address, int *level)
{
pgd_t *pgd = pgd_offset_k(address);
@@ -106,7 +113,7 @@ static LIST_HEAD(flush_pages);

static inline void save_page(struct page *fpage)
{
- if (!test_and_set_bit(PG_arch_1, &fpage->flags))
+ if (!TestSetPageDeferred(fpage))
list_add(&fpage->lru, &deferred_pages);
}

@@ -287,7 +294,7 @@ void global_flush_tlb(void)

list_for_each_entry_safe(pg, next, &free_pages, lru) {
list_del(&pg->lru);
- clear_bit(PG_arch_1, &pg->flags);
+ ClearPageDeferred(pg);
if (page_private(pg) != 0)
continue;
ClearPagePrivate(pg);
Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -14,8 +14,14 @@
#include <asm/pgalloc.h>
#include <asm/sections.h>

-/* Protected by init_mm.mmap_sem */
-/* Variables protected by cpa_lock */
+#define PG_deferred PG_arch_1
+
+#define PageDeferred(p) test_bit(PG_deferred, &(p)->flags)
+#define SetPageDeferred(p) set_bit(PG_deferred, &(p)->flags)
+#define ClearPageDeferred(p) clear_bit(PG_deferred, &(p)->flags)
+#define TestSetPageDeferred(p) test_and_set_bit(PG_deferred, &(p)->flags)
+
+/* Variables protected by init_mm.mmap_sem */
static int full_flush;
static struct list_head df_list = LIST_HEAD_INIT(df_list);
static LIST_HEAD(flush_pages);
@@ -150,7 +156,7 @@ static inline void revert_page(struct pa

static inline void save_page(struct page *kpte_page)
{
- if (!test_and_set_bit(PG_arch_1, &kpte_page->flags))
+ if (!TestSetPageDeferred(kpte_page))
list_add(&kpte_page->lru, &df_list);
}

@@ -285,7 +291,7 @@ void global_flush_tlb(void)

list_for_each_entry_safe(pg, next, &free_pages, lru) {
list_del(&pg->lru);
- clear_bit(PG_arch_1, &pg->flags);
+ ClearPageDeferred(pg);
if (PageReserved(pg) || !cpu_has_pse || page_private(pg) != 0)
continue;
ClearPagePrivate(pg);

2008-01-14 22:21:39

by Andi Kleen

Subject: [PATCH] [14/31] CPA: Use page granuality TLB flushing in change_page_attr


With the infrastructure added for CLFLUSH it is possible
to TLB flush only the actually changed pages in change_page_attr().

Take care of the old Athlon K7 erratum on the 32bit version.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 15 ++++++++-------
arch/x86/mm/pageattr_64.c | 10 +++++-----
2 files changed, 13 insertions(+), 12 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -97,19 +97,20 @@ static void flush_kernel_map(void *arg)
struct flush_arg *a = (struct flush_arg *)arg;
struct flush *f;

- if (!cpu_has_clflush)
- a->full_flush = 1;
- if (a->full_flush && boot_cpu_data.x86_model >= 4)
+ if ((!cpu_has_clflush || a->full_flush) && boot_cpu_data.x86_model >= 4)
wbinvd();
list_for_each_entry(f, &a->l, l) {
if (!a->full_flush)
clflush_cache_range((void *)f->addr, PAGE_SIZE);
+ if (!a->full_flush)
+ __flush_tlb_one(f->addr);
}

- /* Flush all to work around Errata in early athlons regarding
- * large page flushing.
- */
- __flush_tlb_all();
+ /* Handle errata with flushing large pages in early Athlons */
+ if (a->full_flush ||
+ (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
+ boot_cpu_data.x86 == 7))
+ __flush_tlb_all();
}

static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -92,18 +92,18 @@ static void flush_kernel_map(void *arg)
struct flush_arg *a = (struct flush_arg *)arg;
struct flush *f;

- if (!cpu_has_clflush)
- a->full_flush = 1;
-
/* When clflush is available always use it because it is
much cheaper than WBINVD. */
- if (a->full_flush)
+ if (a->full_flush || !cpu_has_clflush)
asm volatile("wbinvd" ::: "memory");
list_for_each_entry(f, &a->l, l) {
if (!a->full_flush)
clflush_cache_range((void *)f->addr, PAGE_SIZE);
+ if (!a->full_flush)
+ __flush_tlb_one(f->addr);
}
- __flush_tlb_all();
+ if (a->full_flush)
+ __flush_tlb_all();
}

/* both protected by init_mm.mmap_sem */

2008-01-14 22:21:54

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [16/31] CPA: Use wbinvd() macro instead of inline assembly in 64bit c_p_a()


Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_64.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -95,7 +95,7 @@ static void flush_kernel_map(void *arg)
/* When clflush is available always use it because it is
much cheaper than WBINVD. */
if ((a->full_flush || !cpu_has_clflush) && !cpu_has_ss)
- asm volatile("wbinvd" ::: "memory");
+ wbinvd();
list_for_each_entry(f, &a->l, l) {
if (!a->full_flush && !cpu_has_ss)
clflush_cache_range((void *)f->addr, PAGE_SIZE);

2008-01-14 22:22:22

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [17/31] CPA: Reorder TLB / cache flushes to follow Intel recommendation


Intel recommends flushing the TLBs first and the caches second
on caching attribute changes. c_p_a() previously did it the
other way round. Reorder that.

The procedure is still not fully compliant with the Intel documentation,
because Intel recommends an all-CPU synchronization step between
the TLB flushes and the cache flushes.

However, on all newer Intel CPUs this is moot anyway,
because they support Self-Snoop and can skip the cache flush
step entirely.
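
For illustration, the resulting ordering in flush_kernel_map() roughly
follows this shape (a hedged sketch with stubbed-out primitives and
made-up addresses, not the actual kernel routine):

#include <stdio.h>

/* Stubs standing in for the real flush primitives. */
static void invlpg(unsigned long addr)       { printf("INVLPG %#lx\n", addr); }
static void clflush_page(unsigned long addr) { printf("CLFLUSH %#lx\n", addr); }
static void flush_tlb_all(void)              { printf("full TLB flush\n"); }
static void wbinvd(void)                     { printf("WBINVD\n"); }

static void flush_sketch(const unsigned long *addrs, int n,
			 int full_flush, int has_clflush, int has_ss)
{
	int i;

	/* 1. TLB flushes first: per page, or all at once. */
	for (i = 0; i < n; i++) {
		if (!full_flush)
			invlpg(addrs[i]);
		if (!full_flush && has_clflush && !has_ss)
			clflush_page(addrs[i]);
	}
	if (full_flush)
		flush_tlb_all();

	/* 2. The big-hammer cache flush, when needed at all, comes last. */
	if ((full_flush || !has_clflush) && !has_ss)
		wbinvd();
}

int main(void)
{
	unsigned long pages[] = { 0x100000UL, 0x101000UL };

	flush_sketch(pages, 2, 0, 1, 0);	/* per-page INVLPG + CLFLUSH */
	return 0;
}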

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 13 ++++++++++---
arch/x86/mm/pageattr_64.c | 14 ++++++++++----
2 files changed, 20 insertions(+), 7 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -97,9 +97,6 @@ static void flush_kernel_map(void *arg)
struct flush_arg *a = (struct flush_arg *)arg;
struct flush *f;

- if ((!cpu_has_clflush || a->full_flush) && boot_cpu_data.x86_model >= 4 &&
- !cpu_has_ss)
- wbinvd();
list_for_each_entry(f, &a->l, l) {
if (!a->full_flush && !cpu_has_ss)
clflush_cache_range((void *)f->addr, PAGE_SIZE);
@@ -112,6 +109,16 @@ static void flush_kernel_map(void *arg)
(boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
boot_cpu_data.x86 == 7))
__flush_tlb_all();
+
+ /*
+ * RED-PEN: Intel documentation ask for a CPU synchronization step
+ * here and in the loop. But it is moot on Self-Snoop CPUs anyways.
+ */
+
+ if ((!cpu_has_clflush || a->full_flush) &&
+ !cpu_has_ss && boot_cpu_data.x86_model >= 4)
+ wbinvd();
+
}

static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -94,16 +94,22 @@ static void flush_kernel_map(void *arg)

/* When clflush is available always use it because it is
much cheaper than WBINVD. */
- if ((a->full_flush || !cpu_has_clflush) && !cpu_has_ss)
- wbinvd();
list_for_each_entry(f, &a->l, l) {
- if (!a->full_flush && !cpu_has_ss)
- clflush_cache_range((void *)f->addr, PAGE_SIZE);
if (!a->full_flush)
__flush_tlb_one(f->addr);
+ if (!a->full_flush && !cpu_has_ss)
+ clflush_cache_range((void *)f->addr, PAGE_SIZE);
}
if (a->full_flush)
__flush_tlb_all();
+
+ /*
+ * RED-PEN: Intel documentation ask for a CPU synchronization step
+ * here and in the loop. But it is moot on Self-Snoop CPUs anyways.
+ */
+
+ if ((!cpu_has_clflush || a->full_flush) && !cpu_has_ss)
+ wbinvd();
}

/* both protected by init_mm.mmap_sem */

2008-01-14 22:22:44

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [15/31] CPA: Don't flush the caches when the CPU supports self-snoop


When the self-snoop CPUID bit is set, change_page_attr() only needs to flush
the TLBs, not the caches.

The description of self-snoop in the Intel manuals is a bit vague
but I got confirmation that this is what SS really means.

This should improve c_p_a() performance significantly on newer
Intel CPUs.

Note: the line > 80 characters will be modified again in a followup
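
For reference, the self-snoop capability is reported by CPUID leaf 1 in
EDX bit 27. A minimal userspace check (purely illustrative, using GCC's
<cpuid.h>; not kernel code) would be:

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
	unsigned int eax, ebx, ecx, edx;

	if (!__get_cpuid(1, &eax, &ebx, &ecx, &edx))
		return 1;
	/* EDX bit 27 is the SS (self-snoop) feature flag. */
	printf("self-snoop: %s\n", (edx & (1u << 27)) ? "yes" : "no");
	return 0;
}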

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 5 +++--
arch/x86/mm/pageattr_64.c | 4 ++--
include/asm-x86/cpufeature.h | 1 +
3 files changed, 6 insertions(+), 4 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -97,10 +97,11 @@ static void flush_kernel_map(void *arg)
struct flush_arg *a = (struct flush_arg *)arg;
struct flush *f;

- if ((!cpu_has_clflush || a->full_flush) && boot_cpu_data.x86_model >= 4)
+ if ((!cpu_has_clflush || a->full_flush) && boot_cpu_data.x86_model >= 4 &&
+ !cpu_has_ss)
wbinvd();
list_for_each_entry(f, &a->l, l) {
- if (!a->full_flush)
+ if (!a->full_flush && !cpu_has_ss)
clflush_cache_range((void *)f->addr, PAGE_SIZE);
if (!a->full_flush)
__flush_tlb_one(f->addr);
Index: linux/include/asm-x86/cpufeature.h
===================================================================
--- linux.orig/include/asm-x86/cpufeature.h
+++ linux/include/asm-x86/cpufeature.h
@@ -167,6 +167,7 @@
#define cpu_has_pebs boot_cpu_has(X86_FEATURE_PEBS)
#define cpu_has_clflush boot_cpu_has(X86_FEATURE_CLFLSH)
#define cpu_has_bts boot_cpu_has(X86_FEATURE_BTS)
+#define cpu_has_ss boot_cpu_has(X86_FEATURE_SELFSNOOP)

#if defined(CONFIG_X86_INVLPG) || defined(CONFIG_X86_64)
# define cpu_has_invlpg 1
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -94,10 +94,10 @@ static void flush_kernel_map(void *arg)

/* When clflush is available always use it because it is
much cheaper than WBINVD. */
- if (a->full_flush || !cpu_has_clflush)
+ if ((a->full_flush || !cpu_has_clflush) && !cpu_has_ss)
asm volatile("wbinvd" ::: "memory");
list_for_each_entry(f, &a->l, l) {
- if (!a->full_flush)
+ if (!a->full_flush && !cpu_has_ss)
clflush_cache_range((void *)f->addr, PAGE_SIZE);
if (!a->full_flush)
__flush_tlb_one(f->addr);

2008-01-14 22:23:06

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [18/31] CPA: Make change_page_attr() more robust against use of PAT bits


Use the page table level instead of the PSE bit to check whether the PTE
is for a 4K page or not. This makes the code more robust when the PAT
bit is used, because on 4K pages the PAT bit sits in the same position as the
PSE bit.
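
The aliasing being guarded against is simply that bit 7 means PSE in a
PMD but PAT in a 4K PTE. A tiny standalone illustration (bit values
only, made-up PTE):

#include <stdio.h>

#define _PAGE_BIT_PSE 7		/* large page, in PMD/PUD entries */
#define _PAGE_BIT_PAT 7		/* PAT selector, in 4K PTE entries */

int main(void)
{
	unsigned long pte_4k_with_pat = 1UL << _PAGE_BIT_PAT;

	/* A pte_huge()-style bit test on a 4K PTE misfires once PAT is used: */
	printf("looks huge? %s\n",
	       (pte_4k_with_pat & (1UL << _PAGE_BIT_PSE)) ? "yes" : "no");
	/* Checking the page table level instead is unambiguous. */
	return 0;
}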

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 4 ++--
arch/x86/mm/pageattr_64.c | 4 ++--
2 files changed, 4 insertions(+), 4 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -209,7 +209,7 @@ __change_page_attr(struct page *page, pg
set_tlb_flush(address);

if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
- if (!pte_huge(*kpte)) {
+ if (level == 3) {
set_pte_atomic(kpte, mk_pte(page, prot));
} else {
pgprot_t ref_prot;
@@ -225,7 +225,7 @@ __change_page_attr(struct page *page, pg
kpte_page = split;
}
page_private(kpte_page)++;
- } else if (!pte_huge(*kpte)) {
+ } else if (level == 3) {
set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
BUG_ON(page_private(kpte_page) == 0);
page_private(kpte_page)--;
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -184,7 +184,7 @@ __change_page_attr(unsigned long address
set_tlb_flush(address);

if (pgprot_val(prot) != pgprot_val(ref_prot)) {
- if (!pte_huge(*kpte)) {
+ if (level == 4) {
set_pte(kpte, pfn_pte(pfn, prot));
} else {
/*
@@ -201,7 +201,7 @@ __change_page_attr(unsigned long address
kpte_page = split;
}
page_private(kpte_page)++;
- } else if (!pte_huge(*kpte)) {
+ } else if (level == 4) {
set_pte(kpte, pfn_pte(pfn, ref_prot));
BUG_ON(page_private(kpte_page) == 0);
page_private(kpte_page)--;

2008-01-14 22:23:45

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [19/31] CPA: Limit cache flushing to pages that really change caching


Previously change_page_attr() always flushed the caches, even for
pages where only a non-caching-related attribute changed (like RO for
read/write protection).

This changes the flush code to only flush the cache when the
caching attributes actually change.

I made some effort to already handle reprogrammed PAT bits, although this
is not strictly needed right now by the core kernel (but that will
probably change soon).

This will only make a difference on AMD CPUs or older Intel CPUs,
because all newer Intel CPUs support "self-snoop" and do not require
this cache flushing anyway.

Another advantage of this patch is that it prevents recursive
slab calls with slab debugging.
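
A hedged, standalone rendition of the comparison this patch introduces
(bit positions as in the patch; level 4 is a 4K PTE and level 3 a 2MB
PMD on 64bit; the example values are made up):

#include <stdio.h>

#define _PAGE_PWT	(1UL << 3)
#define _PAGE_PCD	(1UL << 4)
#define _PAGE_PAT	(1UL << 7)	/* 4K pages */
#define _PAGE_PAT_LARGE	(1UL << 12)	/* large pages */
#define _PAGE_CACHE	(_PAGE_PCD|_PAGE_PWT|_PAGE_PAT)

static const unsigned long pat_bit[5] = {
	[4] = _PAGE_PAT,
	[3] = _PAGE_PAT_LARGE,
};

static int cache_attr_changed(unsigned long pte, unsigned long prot, int level)
{
	unsigned long a = pte & (_PAGE_PCD|_PAGE_PWT);

	/* Normalize the PAT bit, which sits elsewhere on large pages. */
	if (pte & pat_bit[level])
		a |= _PAGE_PAT;
	return a != (prot & _PAGE_CACHE);
}

int main(void)
{
	/* Only protection changes: caching identical -> TLB flush only. */
	printf("%d\n", cache_attr_changed(0, 0, 4));		/* 0 */
	/* New prot sets PCD (uncached): caching differs -> cache flush too. */
	printf("%d\n", cache_attr_changed(0, _PAGE_PCD, 4));	/* 1 */
	return 0;
}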

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 43 +++++++++++++++++++++++++++++++++----------
arch/x86/mm/pageattr_64.c | 42 ++++++++++++++++++++++++++++++++++--------
include/asm-x86/pgtable.h | 8 ++++++++
3 files changed, 75 insertions(+), 18 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -13,7 +13,10 @@
#include <asm/tlbflush.h>
#include <asm/io.h>

+enum flush_mode { FLUSH_NONE, FLUSH_CACHE, FLUSH_TLB };
+
struct flush {
+ enum flush_mode mode;
struct list_head l;
unsigned long addr;
};
@@ -76,7 +79,7 @@ static struct page *split_large_page(uns
}

struct flush_arg {
- int full_flush;
+ enum flush_mode full_flush;
struct list_head l;
};

@@ -91,14 +94,19 @@ static void flush_kernel_map(void *arg)
{
struct flush_arg *a = (struct flush_arg *)arg;
struct flush *f;
+ int cache_flush = a->full_flush == FLUSH_CACHE;

/* When clflush is available always use it because it is
much cheaper than WBINVD. */
list_for_each_entry(f, &a->l, l) {
if (!a->full_flush)
__flush_tlb_one(f->addr);
- if (!a->full_flush && !cpu_has_ss)
- clflush_cache_range((void *)f->addr, PAGE_SIZE);
+ if (f->mode == FLUSH_CACHE && !cpu_has_ss) {
+ if (cpu_has_clflush)
+ clflush_cache_range((void *)f->addr, PAGE_SIZE);
+ else
+ cache_flush++;
+ }
}
if (a->full_flush)
__flush_tlb_all();
@@ -108,12 +116,12 @@ static void flush_kernel_map(void *arg)
* here and in the loop. But it is moot on Self-Snoop CPUs anyways.
*/

- if ((!cpu_has_clflush || a->full_flush) && !cpu_has_ss)
+ if (cache_flush > 0 && !cpu_has_ss)
wbinvd();
}

/* both protected by init_mm.mmap_sem */
-static int full_flush;
+static enum flush_mode full_flush;
static LIST_HEAD(deferred_pages);
static LIST_HEAD(flush_pages);

@@ -155,17 +163,35 @@ static void revert_page(unsigned long ad
* data structure to keep track of the flush. This has the added bonus that
* it will work for MMIO holes without mem_map too.
*/
-static void set_tlb_flush(unsigned long address)
+static void set_tlb_flush(unsigned long address, int cache)
{
+ enum flush_mode mode = cache ? FLUSH_CACHE : FLUSH_TLB;
struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);
if (!f) {
- full_flush = 1;
+ if (full_flush < mode)
+ full_flush = mode;
return;
}
+
f->addr = address;
+ f->mode = mode;
list_add_tail(&f->l, &flush_pages);
}

+static unsigned short pat_bit[5] = {
+ [4] = _PAGE_PAT,
+ [3] = _PAGE_PAT_LARGE,
+};
+
+static int cache_attr_changed(pte_t pte, pgprot_t prot, int level)
+{
+ unsigned a = pte_val(pte) & (_PAGE_PCD|_PAGE_PWT);
+ /* Correct for PAT bit being in different position on large pages */
+ if (pte_val(pte) & pat_bit[level])
+ a |= _PAGE_PAT;
+ return a != (pgprot_val(prot) & _PAGE_CACHE);
+}
+
static int
__change_page_attr(unsigned long address, unsigned long pfn, pgprot_t prot,
pgprot_t ref_prot)
@@ -181,7 +207,7 @@ __change_page_attr(unsigned long address
BUG_ON(PageCompound(kpte_page));
BUG_ON(PageLRU(kpte_page));

- set_tlb_flush(address);
+ set_tlb_flush(address, cache_attr_changed(*kpte, prot, level));

if (pgprot_val(prot) != pgprot_val(ref_prot)) {
if (level == 4) {
Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -21,12 +21,15 @@
#define ClearPageDeferred(p) clear_bit(PG_deferred, &(p)->flags)
#define TestSetPageDeferred(p) test_and_set_bit(PG_deferred, &(p)->flags)

+enum flush_mode { FLUSH_NONE, FLUSH_CACHE, FLUSH_TLB };
+
/* Variables protected by init_mm.mmap_sem */
-static int full_flush;
+static enum flush_mode full_flush;
static struct list_head df_list = LIST_HEAD_INIT(df_list);
static LIST_HEAD(flush_pages);

struct flush {
+ enum flush_mode mode;
struct list_head l;
unsigned long addr;
};
@@ -81,7 +84,7 @@ static struct page *split_large_page(uns
}

struct flush_arg {
- int full_flush;
+ enum flush_mode full_flush;
struct list_head l;
};

@@ -96,12 +99,17 @@ static void flush_kernel_map(void *arg)
{
struct flush_arg *a = (struct flush_arg *)arg;
struct flush *f;
+ int cache_flush = a->full_flush == FLUSH_CACHE;

list_for_each_entry(f, &a->l, l) {
- if (!a->full_flush && !cpu_has_ss)
- clflush_cache_range((void *)f->addr, PAGE_SIZE);
if (!a->full_flush)
__flush_tlb_one(f->addr);
+ if (f->mode == FLUSH_CACHE && !cpu_has_ss) {
+ if (cpu_has_clflush)
+ clflush_cache_range((void *)f->addr, PAGE_SIZE);
+ else
+ cache_flush++;
+ }
}

/* Handle errata with flushing large pages in early Athlons */
@@ -115,10 +123,8 @@ static void flush_kernel_map(void *arg)
* here and in the loop. But it is moot on Self-Snoop CPUs anyways.
*/

- if ((!cpu_has_clflush || a->full_flush) &&
- !cpu_has_ss && boot_cpu_data.x86_model >= 4)
+ if (cache_flush > 0 && !cpu_has_ss && boot_cpu_data.x86_model >= 4)
wbinvd();
-
}

static void set_pmd_pte(pte_t *kpte, unsigned long address, pte_t pte)
@@ -169,6 +175,20 @@ static inline void save_page(struct page
list_add(&kpte_page->lru, &df_list);
}

+static unsigned short pat_bit[4] = {
+ [3] = _PAGE_PAT,
+ [2] = _PAGE_PAT_LARGE,
+};
+
+static int cache_attr_changed(pte_t pte, pgprot_t prot, int level)
+{
+ unsigned long a = pte_val(pte) & (_PAGE_PCD|_PAGE_PWT);
+ /* Correct for PAT bit being in different position on large pages */
+ if (pte_val(pte) & pat_bit[level])
+ a |= _PAGE_PAT;
+ return a != (pgprot_val(prot) & _PAGE_CACHE);
+}
+
/*
* Mark the address for flushing later in global_tlb_flush().
*
@@ -177,14 +197,17 @@ static inline void save_page(struct page
* data structure to keep track of the flush. This has the added bonus that
* it will work for MMIO holes without mem_map too.
*/
-static void set_tlb_flush(unsigned long address)
+static void set_tlb_flush(unsigned long address, int cache)
{
+ enum flush_mode mode = cache ? FLUSH_CACHE : FLUSH_TLB;
struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);
if (!f) {
- full_flush = 1;
+ if (full_flush < mode)
+ full_flush = mode;
return;
}
f->addr = address;
+ f->mode = mode;
list_add_tail(&f->l, &flush_pages);
}

@@ -206,7 +229,7 @@ __change_page_attr(struct page *page, pg
BUG_ON(PageLRU(kpte_page));
BUG_ON(PageCompound(kpte_page));

- set_tlb_flush(address);
+ set_tlb_flush(address, cache_attr_changed(*kpte, prot, level));

if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
if (level == 3) {
Index: linux/include/asm-x86/pgtable.h
===================================================================
--- linux.orig/include/asm-x86/pgtable.h
+++ linux/include/asm-x86/pgtable.h
@@ -13,12 +13,17 @@
#define _PAGE_BIT_DIRTY 6
#define _PAGE_BIT_FILE 6
#define _PAGE_BIT_PSE 7 /* 4 MB (or 2MB) page */
+#define _PAGE_BIT_PAT 7 /* Only on 4kb pages */
+#define _PAGE_BIT_PAT_LARGE 12 /* Only on large pages */
#define _PAGE_BIT_GLOBAL 8 /* Global TLB entry PPro+ */
#define _PAGE_BIT_UNUSED1 9 /* available for programmer */
#define _PAGE_BIT_UNUSED2 10
#define _PAGE_BIT_UNUSED3 11
#define _PAGE_BIT_NX 63 /* No execute: only valid after cpuid check */

+/* Needs special handling for large pages */
+#define _PAGE_CACHE (_PAGE_PCD|_PAGE_PWT|_PAGE_PAT)
+
#define _PAGE_PRESENT (_AC(1, UL)<<_PAGE_BIT_PRESENT)
#define _PAGE_RW (_AC(1, UL)<<_PAGE_BIT_RW)
#define _PAGE_USER (_AC(1, UL)<<_PAGE_BIT_USER)
@@ -32,6 +37,9 @@
#define _PAGE_UNUSED2 (_AC(1, UL)<<_PAGE_BIT_UNUSED2)
#define _PAGE_UNUSED3 (_AC(1, UL)<<_PAGE_BIT_UNUSED3)

+#define _PAGE_PAT (_AC(1, UL) << _PAGE_BIT_PAT)
+#define _PAGE_PAT_LARGE (_AC(1, UL) << _PAGE_BIT_PAT_LARGE)
+
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
#define _PAGE_NX (_AC(1, ULL) << _PAGE_BIT_NX)
#else

2008-01-14 22:24:03

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [21/31] CPA: Dump pagetable when inconsistency is detected


When c_p_a() detects an inconsistency in the kernel page tables
it BUGs. When this happens, dump the page table first to avoid one
bug reporting round trip.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 11 ++++++++++-
arch/x86/mm/pageattr_64.c | 11 ++++++++++-
2 files changed, 20 insertions(+), 2 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -12,6 +12,7 @@
#include <asm/processor.h>
#include <asm/tlbflush.h>
#include <asm/io.h>
+#include <asm/kdebug.h>

enum flush_mode { FLUSH_NONE, FLUSH_CACHE, FLUSH_TLB };

@@ -231,8 +232,16 @@ __change_page_attr(unsigned long address
set_pte(kpte, pfn_pte(pfn, ref_prot));
BUG_ON(page_private(kpte_page) == 0);
page_private(kpte_page)--;
- } else
+ } else {
+ /*
+ * When you're here you either set the same page to PAGE_KERNEL
+ * two times in a row or the page table reference counting is
+ * broken again. To catch the later bug for now (sorry)
+ */
+ printk(KERN_ERR "address %lx\n", address);
+ dump_pagetable(address);
BUG();
+ }

/* on x86-64 the direct mapping set at boot is not using 4k pages */
BUG_ON(PageReserved(kpte_page));
Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -13,6 +13,7 @@
#include <asm/tlbflush.h>
#include <asm/pgalloc.h>
#include <asm/sections.h>
+#include <asm/kdebug.h>

#define PG_deferred PG_arch_1

@@ -252,8 +253,16 @@ __change_page_attr(struct page *page, pg
set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
BUG_ON(page_private(kpte_page) == 0);
page_private(kpte_page)--;
- } else
+ } else {
+ /*
+ * When you're here you either set the same page to PAGE_KERNEL
+ * two times in a row or the page table reference counting is
+ * broken again. To catch the later bug for now (sorry)
+ */
+ printk(KERN_ERR "address %lx\n", address);
+ dump_pagetable(address);
BUG();
+ }

/*
* If the pte was reserved, it means it was created at boot

2008-01-14 22:24:27

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [20/31] CPA: Fix inaccurate comments in 64bit change_page_attr()


No code changes.
Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_64.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -120,7 +120,7 @@ static void flush_kernel_map(void *arg)
wbinvd();
}

-/* both protected by init_mm.mmap_sem */
+/* All protected by init_mm.mmap_sem */
static enum flush_mode full_flush;
static LIST_HEAD(deferred_pages);
static LIST_HEAD(flush_pages);
@@ -132,7 +132,7 @@ static inline void save_page(struct page
}

/*
- * No more special protections in this 2/4MB area - revert to a
+ * No more special protections in this 2MB area - revert to a
* large page again.
*/
static void revert_page(unsigned long address, pgprot_t ref_prot)

2008-01-14 22:24:42

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [22/31] CPA: Only queue actually unused page table pages for freeing


With the separate flush data structure added earlier, save_page() now only
needs to be called on pte pages that have already been reverted.

Also move all freeing checks into the caller.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 4 +---
arch/x86/mm/pageattr_64.c | 7 +++----
2 files changed, 4 insertions(+), 7 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -270,9 +270,9 @@ __change_page_attr(struct page *page, pg
* replace it with a largepage.
*/

- save_page(kpte_page);
if (!PageReserved(kpte_page)) {
if (cpu_has_pse && (page_private(kpte_page) == 0)) {
+ save_page(kpte_page);
paravirt_release_pt(page_to_pfn(kpte_page));
revert_page(kpte_page, address);
}
@@ -333,8 +333,6 @@ void global_flush_tlb(void)
list_for_each_entry_safe(pg, next, &free_pages, lru) {
list_del(&pg->lru);
ClearPageDeferred(pg);
- if (PageReserved(pg) || !cpu_has_pse || page_private(pg) != 0)
- continue;
ClearPagePrivate(pg);
__free_page(pg);
}
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -246,9 +246,10 @@ __change_page_attr(unsigned long address
/* on x86-64 the direct mapping set at boot is not using 4k pages */
BUG_ON(PageReserved(kpte_page));

- save_page(kpte_page);
- if (page_private(kpte_page) == 0)
+ if (page_private(kpte_page) == 0) {
+ save_page(kpte_page);
revert_page(address, ref_prot);
+ }
return 0;
}

@@ -336,8 +337,6 @@ void global_flush_tlb(void)
list_for_each_entry_safe(pg, next, &free_pages, lru) {
list_del(&pg->lru);
ClearPageDeferred(pg);
- if (page_private(pg) != 0)
- continue;
ClearPagePrivate(pg);
__free_page(pg);
}

2008-01-14 22:24:59

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [23/31] CPA: Remove unnecessary masking of address


virt_to_page() does not care about the bits below page granularity,
so don't mask them.
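
The dropped masking is a no-op as far as the page frame number is
concerned, which is all virt_to_page() looks at. A two-line
demonstration with a made-up address:

#include <stdio.h>

#define PAGE_SHIFT 12
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PAGE_MASK  (~(PAGE_SIZE - 1))

int main(void)
{
	unsigned long kpte = 0xc0123ab8UL;	/* arbitrary example address */

	/* The low 12 bits never reach the pfn, masked or not. */
	printf("%d\n", (kpte & PAGE_MASK) >> PAGE_SHIFT == kpte >> PAGE_SHIFT);
	return 0;
}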

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_64.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -204,7 +204,7 @@ __change_page_attr(unsigned long address

kpte = lookup_address(address, &level);
if (!kpte) return 0;
- kpte_page = virt_to_page(((unsigned long)kpte) & PAGE_MASK);
+ kpte_page = virt_to_page(kpte);
BUG_ON(PageCompound(kpte_page));
BUG_ON(PageLRU(kpte_page));

2008-01-14 22:25:36

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [24/31] CPA: Only unmap kernel init pages in text mapping when CONFIG_DEBUG_RODATA is set


Otherwise the kernel will likely always run with 4K pages instead of 2MB pages,
which is costly in terms of TLBs.

Also optimize it a little bit by using only a single change_page_attr() call.
This is particularly useful if debugging is enabled inside it, because it spams
the logs much less.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/init_64.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)

Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -553,13 +553,16 @@ void free_init_pages(char *what, unsigne
init_page_count(virt_to_page(addr));
memset((void *)(addr & ~(PAGE_SIZE-1)),
POISON_FREE_INITMEM, PAGE_SIZE);
- if (addr >= __START_KERNEL_map)
- change_page_attr_addr(addr, 1, __pgprot(0));
free_page(addr);
totalram_pages++;
}
- if (addr > __START_KERNEL_map)
+#ifdef CONFIG_DEBUG_RODATA
+ if (begin >= __START_KERNEL_map) {
+ change_page_attr_addr(begin, (end - begin)/PAGE_SIZE,
+ __pgprot(0));
global_flush_tlb();
+ }
+#endif
}

void free_initmem(void)

2008-01-14 22:25:55

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [25/31] CPA: Always do full TLB flush when splitting large pages


Various CPUs have errata when using INVLPG to flush large pages.
This includes Intel Penryn (AV2) and AMD K7 (#16 in Athlon 4).
While these happen only in specific circumstances, it is still
a little risky and it's not clear the kernel can avoid them all.

Avoid this can of worms by always flushing the full TLB (but
not the full cache) when splitting a large page. This should
not be that expensive anyway, and the initial splitting should
hopefully be infrequent.

This also removes the previously hard-coded workaround for the K7
Athlon on 32bit.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 17 +++++++++++------
arch/x86/mm/pageattr_64.c | 12 ++++++++++--
2 files changed, 21 insertions(+), 8 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -113,10 +113,7 @@ static void flush_kernel_map(void *arg)
}
}

- /* Handle errata with flushing large pages in early Athlons */
- if (a->full_flush ||
- (boot_cpu_data.x86_vendor == X86_VENDOR_AMD &&
- boot_cpu_data.x86 == 7))
+ if (a->full_flush)
__flush_tlb_all();

/*
@@ -198,7 +195,7 @@ static int cache_attr_changed(pte_t pte,
* data structure to keep track of the flush. This has the added bonus that
* it will work for MMIO holes without mem_map too.
*/
-static void set_tlb_flush(unsigned long address, int cache)
+static void set_tlb_flush(unsigned long address, int cache, int large)
{
enum flush_mode mode = cache ? FLUSH_CACHE : FLUSH_TLB;
struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);
@@ -210,6 +207,13 @@ static void set_tlb_flush(unsigned long
f->addr = address;
f->mode = mode;
list_add_tail(&f->l, &flush_pages);
+
+ /*
+ * Work around large page INVLPG bugs in early K7 and in Penryn.
+ * When we split a large page always do a full TLB flush.
+ */
+ if (large && full_flush < FLUSH_TLB)
+ full_flush = FLUSH_TLB;
}

static int
@@ -230,7 +234,8 @@ __change_page_attr(struct page *page, pg
BUG_ON(PageLRU(kpte_page));
BUG_ON(PageCompound(kpte_page));

- set_tlb_flush(address, cache_attr_changed(*kpte, prot, level));
+ set_tlb_flush(address, cache_attr_changed(*kpte, prot, level),
+ level < 3);

if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
if (level == 3) {
Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -164,7 +164,7 @@ static void revert_page(unsigned long ad
* data structure to keep track of the flush. This has the added bonus that
* it will work for MMIO holes without mem_map too.
*/
-static void set_tlb_flush(unsigned long address, int cache)
+static void set_tlb_flush(unsigned long address, int cache, int large)
{
enum flush_mode mode = cache ? FLUSH_CACHE : FLUSH_TLB;
struct flush *f = kmalloc(sizeof(struct flush), GFP_KERNEL);
@@ -177,6 +177,13 @@ static void set_tlb_flush(unsigned long
f->addr = address;
f->mode = mode;
list_add_tail(&f->l, &flush_pages);
+
+ /*
+ * Work around large page INVLPG bugs in early K7 and in Penryn.
+ * When we split a large page always do a full TLB flush.
+ */
+ if (large && full_flush < FLUSH_TLB)
+ full_flush = FLUSH_TLB;
}

static unsigned short pat_bit[5] = {
@@ -208,7 +215,8 @@ __change_page_attr(unsigned long address
BUG_ON(PageCompound(kpte_page));
BUG_ON(PageLRU(kpte_page));

- set_tlb_flush(address, cache_attr_changed(*kpte, prot, level));
+ set_tlb_flush(address, cache_attr_changed(*kpte, prot, level),
+ level < 4);

if (pgprot_val(prot) != pgprot_val(ref_prot)) {
if (level == 4) {

2008-01-14 22:26:21

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [26/31] CPA: Fix reference counting when changing already changed pages


When changing a page that has already been modified to non-standard attributes,
don't change the reference count again. And when changing a page back,
only decrease the ref count if the old attributes were non-standard.
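
The counting rule being enforced can be stated in a few lines (a sketch
only, with pgprot values reduced to plain integers and the BUG() paths
left out):

#include <assert.h>
#include <stdio.h>

/*
 * page_private() of the pte page counts the 4K PTEs that currently carry
 * non-standard attributes.  Bump it only on a standard -> non-standard
 * transition, drop it only on a non-standard -> standard transition.
 */
static void change_one_pte(int *count, unsigned long oldprot,
			   unsigned long newprot, unsigned long ref_prot)
{
	if (newprot != ref_prot) {
		if (oldprot == ref_prot)
			(*count)++;	/* became non-standard */
	} else if (oldprot != ref_prot) {
		assert(*count > 0);
		(*count)--;		/* reverted to standard */
	}
	/* non-standard -> non-standard leaves the count alone */
}

int main(void)
{
	int count = 0;

	change_one_pte(&count, 0, 1, 0);	/* standard -> RO:   count 1 */
	change_one_pte(&count, 1, 1, 0);	/* RO -> RO again:   still 1 */
	change_one_pte(&count, 1, 0, 0);	/* back to standard: count 0 */
	printf("%d\n", count);
	return 0;
}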

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 44 +++++++++++++++++++++++++-------------------
arch/x86/mm/pageattr_64.c | 16 ++++++++++++----
include/asm-x86/pgtable.h | 2 ++
3 files changed, 39 insertions(+), 23 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -206,20 +206,26 @@ __change_page_attr(unsigned long address
{
pte_t *kpte;
struct page *kpte_page;
- pgprot_t ref_prot2;
+ pgprot_t ref_prot2, oldprot;
int level;

kpte = lookup_address(address, &level);
if (!kpte) return 0;
kpte_page = virt_to_page(kpte);
+ oldprot = pte_pgprot(*kpte);
BUG_ON(PageCompound(kpte_page));
BUG_ON(PageLRU(kpte_page));

set_tlb_flush(address, cache_attr_changed(*kpte, prot, level),
level < 4);

+ ref_prot = canon_pgprot(ref_prot);
+ prot = canon_pgprot(prot);
+
if (pgprot_val(prot) != pgprot_val(ref_prot)) {
if (level == 4) {
+ if (pgprot_val(oldprot) == pgprot_val(ref_prot))
+ page_private(kpte_page)++;
set_pte(kpte, pfn_pte(pfn, prot));
} else {
/*
@@ -234,12 +240,14 @@ __change_page_attr(unsigned long address
pgprot_val(ref_prot2) &= ~_PAGE_NX;
set_pte(kpte, mk_pte(split, ref_prot2));
kpte_page = split;
+ page_private(kpte_page)++;
}
- page_private(kpte_page)++;
} else if (level == 4) {
+ if (pgprot_val(oldprot) != pgprot_val(ref_prot)) {
+ BUG_ON(page_private(kpte_page) <= 0);
+ page_private(kpte_page)--;
+ }
set_pte(kpte, pfn_pte(pfn, ref_prot));
- BUG_ON(page_private(kpte_page) == 0);
- page_private(kpte_page)--;
} else {
/*
* When you're here you either set the same page to PAGE_KERNEL
Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -151,20 +151,17 @@ static void set_pmd_pte(pte_t *kpte, uns
* No more special protections in this 2/4MB area - revert to a
* large page again.
*/
-static inline void revert_page(struct page *kpte_page, unsigned long address)
+static void
+revert_page(struct page *kpte_page, unsigned long address, pgprot_t ref_prot)
{
- pgprot_t ref_prot;
pte_t *linear;

- ref_prot =
- ((address & LARGE_PAGE_MASK) < (unsigned long)&_etext)
- ? PAGE_KERNEL_LARGE_EXEC : PAGE_KERNEL_LARGE;
-
linear = (pte_t *)
pmd_offset(pud_offset(pgd_offset_k(address), address), address);
set_pmd_pte(linear, address,
- pfn_pte((__pa(address) & LARGE_PAGE_MASK) >> PAGE_SHIFT,
- ref_prot));
+ pte_mkhuge(pfn_pte((__pa(address) & LARGE_PAGE_MASK)
+ >> PAGE_SHIFT,
+ ref_prot)));
}

static inline void save_page(struct page *kpte_page)
@@ -223,6 +220,8 @@ __change_page_attr(struct page *page, pg
unsigned long address;
struct page *kpte_page;
int level;
+ pgprot_t oldprot;
+ pgprot_t ref_prot = PAGE_KERNEL;

BUG_ON(PageHighMem(page));
address = (unsigned long)page_address(page);
@@ -230,6 +229,8 @@ __change_page_attr(struct page *page, pg
kpte = lookup_address(address, &level);
if (!kpte)
return -EINVAL;
+
+ oldprot = pte_pgprot(*kpte);
kpte_page = virt_to_page(kpte);
BUG_ON(PageLRU(kpte_page));
BUG_ON(PageCompound(kpte_page));
@@ -237,27 +238,32 @@ __change_page_attr(struct page *page, pg
set_tlb_flush(address, cache_attr_changed(*kpte, prot, level),
level < 3);

- if (pgprot_val(prot) != pgprot_val(PAGE_KERNEL)) {
+ if ((address & LARGE_PAGE_MASK) < (unsigned long)&_etext)
+ ref_prot = PAGE_KERNEL_EXEC;
+
+ ref_prot = canon_pgprot(ref_prot);
+ prot = canon_pgprot(prot);
+
+ if (pgprot_val(prot) != pgprot_val(ref_prot)) {
if (level == 3) {
+ if (pgprot_val(oldprot) == pgprot_val(ref_prot))
+ page_private(kpte_page)++;
set_pte_atomic(kpte, mk_pte(page, prot));
} else {
- pgprot_t ref_prot;
struct page *split;
-
- ref_prot =
- ((address & LARGE_PAGE_MASK) < (unsigned long)&_etext)
- ? PAGE_KERNEL_EXEC : PAGE_KERNEL;
split = split_large_page(address, prot, ref_prot);
if (!split)
return -ENOMEM;
set_pmd_pte(kpte,address,mk_pte(split, ref_prot));
kpte_page = split;
+ page_private(kpte_page)++;
}
- page_private(kpte_page)++;
} else if (level == 3) {
- set_pte_atomic(kpte, mk_pte(page, PAGE_KERNEL));
- BUG_ON(page_private(kpte_page) == 0);
- page_private(kpte_page)--;
+ if (pgprot_val(oldprot) != pgprot_val(ref_prot)) {
+ BUG_ON(page_private(kpte_page) <= 0);
+ page_private(kpte_page)--;
+ }
+ set_pte_atomic(kpte, mk_pte(page, ref_prot));
} else {
/*
* When you're here you either set the same page to PAGE_KERNEL
@@ -279,7 +285,7 @@ __change_page_attr(struct page *page, pg
if (cpu_has_pse && (page_private(kpte_page) == 0)) {
save_page(kpte_page);
paravirt_release_pt(page_to_pfn(kpte_page));
- revert_page(kpte_page, address);
+ revert_page(kpte_page, address, ref_prot);
}
}
return 0;
Index: linux/include/asm-x86/pgtable.h
===================================================================
--- linux.orig/include/asm-x86/pgtable.h
+++ linux/include/asm-x86/pgtable.h
@@ -192,6 +192,8 @@ static inline pte_t pte_modify(pte_t pte
return __pte(val);
}

+#define canon_pgprot(p) __pgprot(pgprot_val(p) & __supported_pte_mask)
+
#ifdef CONFIG_PARAVIRT
#include <asm/paravirt.h>
#else /* !CONFIG_PARAVIRT */

2008-01-14 22:26:55

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [27/31] CPA: Change comments of external interfaces to kerneldoc format


And clarify description a bit.

Only for 64bit, but the interfaces are identical for 32bit and kerneldoc should
merge them (?)

Signed-off-by: Andi Kleen <[email protected]>

---
Documentation/DocBook/kernel-api.tmpl | 8 +++++
arch/x86/mm/pageattr_64.c | 46 +++++++++++++++++++++++++---------
2 files changed, 42 insertions(+), 12 deletions(-)

Index: linux/arch/x86/mm/pageattr_64.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_64.c
+++ linux/arch/x86/mm/pageattr_64.c
@@ -269,19 +269,19 @@ __change_page_attr(unsigned long address
return 0;
}

-/*
- * Change the page attributes of an page in the linear mapping.
- *
- * This should be used when a page is mapped with a different caching policy
- * than write-back somewhere - some CPUs do not like it when mappings with
- * different caching policies exist. This changes the page attributes of the
- * in kernel linear mapping too.
+/**
+ * change_page_attr_addr - Change page table attributes in linear mapping
+ * @address: Virtual address in linear mapping.
+ * @numpages: Number of pages to change
+ * @prot: New page table attribute (PAGE_*)
*
- * The caller needs to ensure that there are no conflicting mappings elsewhere.
- * This function only deals with the kernel linear map.
- *
- * Caller must call global_flush_tlb() after this.
+ * Change page attributes of a page in the direct mapping. This is a variant
+ * of change_page_attr() that also works on memory holes that do not have
+ * mem_map entry (pfn_valid() is false).
+ *
+ * See change_page_attr() documentation for more details.
*/
+
int change_page_attr_addr(unsigned long address, int numpages, pgprot_t prot)
{
int err = 0, kernel_map = 0;
@@ -318,13 +318,35 @@ int change_page_attr_addr(unsigned long
return err;
}

-/* Don't call this for MMIO areas that may not have a mem_map entry */
+/**
+ * change_page_attr - Change page table attributes in the linear mapping.
+ * @page: First page to change
+ * @numpages: Number of pages to change
+ * @prot: New protection/caching type (PAGE_*)
+ *
+ * Returns 0 on success, otherwise a negated errno.
+ *
+ * This should be used when a page is mapped with a different caching policy
+ * than write-back somewhere - some CPUs do not like it when mappings with
+ * different caching policies exist. This changes the page attributes of the
+ * in kernel linear mapping too.
+ *
+ * Caller must call global_flush_tlb() later to make the changes active.
+ *
+ * The caller needs to ensure that there are no conflicting mappings elsewhere
+ * (e.g. in user space) * This function only deals with the kernel linear map.
+ *
+ * For MMIO areas without mem_map use change_page_attr_addr() instead.
+ */
int change_page_attr(struct page *page, int numpages, pgprot_t prot)
{
unsigned long addr = (unsigned long)page_address(page);
return change_page_attr_addr(addr, numpages, prot);
}

+/**
+ * global_flush_tlb - Flush linear mappings changed by change_page_attr()
+ */
void global_flush_tlb(void)
{
struct page *pg, *next;
Index: linux/Documentation/DocBook/kernel-api.tmpl
===================================================================
--- linux.orig/Documentation/DocBook/kernel-api.tmpl
+++ linux/Documentation/DocBook/kernel-api.tmpl
@@ -726,4 +726,12 @@ X!Idrivers/video/console/fonts.c
!Ffs/pipe.c
</chapter>

+ <chapter id="pageattr">
+ <title>Kernel direct mapping paging attributes</title>
+ <para>
+ Changing the paging attributes of kernel direct mapping pages
+ on x86.
+ </para>
+!Farch/x86/mm/pageattr_64.c
+ </chapter>
</book>

2008-01-14 22:27:21

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [28/31] CPA: Make kernel_text test match boot mapping initialization


The boot direct mapping initialization used a different test than c_p_a()
to check whether a page is part of the kernel text mapping. Make them use
a common function.

Also round up to the large page size to be sure, and check against the
beginning of the kernel text to handle highly loaded kernels better.

This gives a small semantic change of NX always applying to 2MB areas
on !PSE && NX systems, but that's an obscure case even considering
DEBUG_PAGEALLOC.
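
The rounding mentioned above amounts to aligning both ends of the
_text.._init_end range to the large page boundary. A standalone
illustration (4MB large pages and made-up addresses assumed):

#include <stdio.h>

#define LARGE_PAGE_SIZE (1UL << 22)		/* 4MB with !PAE */
#define LARGE_PAGE_MASK (~(LARGE_PAGE_SIZE - 1))

static int text_address(unsigned long addr, unsigned long text,
			unsigned long init_end)
{
	unsigned long start = text & LARGE_PAGE_MASK;
	unsigned long end = init_end & LARGE_PAGE_MASK;

	/* Anything inside the large pages covering _text.._init_end counts. */
	return addr >= start && addr < end + LARGE_PAGE_SIZE;
}

int main(void)
{
	unsigned long text = 0xc0100000UL;	/* hypothetical _text */
	unsigned long init_end = 0xc0500000UL;	/* hypothetical __init_end */

	printf("%d\n", text_address(0xc0400000UL, text, init_end));	/* 1 */
	printf("%d\n", text_address(0xc0900000UL, text, init_end));	/* 0 */
	return 0;
}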

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/init_32.c | 16 ++--------------
arch/x86/mm/pageattr_32.c | 9 ++++++++-
include/asm-x86/pgtable_32.h | 2 +-
3 files changed, 11 insertions(+), 16 deletions(-)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -184,6 +184,13 @@ static int cache_attr_changed(pte_t pte,
return a != (pgprot_val(prot) & _PAGE_CACHE);
}

+int text_address(unsigned long addr)
+{
+ unsigned long start = ((unsigned long)&_text) & LARGE_PAGE_MASK;
+ unsigned long end = ((unsigned long)&__init_end) & LARGE_PAGE_MASK;
+ return addr >= start && addr < end + LARGE_PAGE_SIZE;
+}
+
/*
* Mark the address for flushing later in global_tlb_flush().
*
@@ -238,7 +245,7 @@ __change_page_attr(struct page *page, pg
set_tlb_flush(address, cache_attr_changed(*kpte, prot, level),
level < 3);

- if ((address & LARGE_PAGE_MASK) < (unsigned long)&_etext)
+ if (text_address(address))
ref_prot = PAGE_KERNEL_EXEC;

ref_prot = canon_pgprot(ref_prot);
Index: linux/arch/x86/mm/init_32.c
===================================================================
--- linux.orig/arch/x86/mm/init_32.c
+++ linux/arch/x86/mm/init_32.c
@@ -136,13 +136,6 @@ static void __init page_table_range_init
}
}

-static inline int is_kernel_text(unsigned long addr)
-{
- if (addr >= PAGE_OFFSET && addr <= (unsigned long)__init_end)
- return 1;
- return 0;
-}
-
/*
* This maps the physical memory to kernel virtual address space, a total
* of max_low_pfn pages, by creating page tables starting from address
@@ -172,14 +165,9 @@ static void __init kernel_physical_mappi
/* Map with big pages if possible, otherwise
create normal page tables. */
if (cpu_has_pse) {
- unsigned int address2;
pgprot_t prot = PAGE_KERNEL_LARGE;

- address2 = (pfn + PTRS_PER_PTE - 1) * PAGE_SIZE +
- PAGE_OFFSET + PAGE_SIZE-1;
-
- if (is_kernel_text(address) ||
- is_kernel_text(address2))
+ if (text_address(address))
prot = PAGE_KERNEL_LARGE_EXEC;

set_pmd(pmd, pfn_pmd(pfn, prot));
@@ -193,7 +181,7 @@ static void __init kernel_physical_mappi
pte++, pfn++, pte_ofs++, address += PAGE_SIZE) {
pgprot_t prot = PAGE_KERNEL;

- if (is_kernel_text(address))
+ if (text_address(address))
prot = PAGE_KERNEL_EXEC;

set_pte(pte, pfn_pte(pfn, prot));
Index: linux/include/asm-x86/pgtable_32.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_32.h
+++ linux/include/asm-x86/pgtable_32.h
@@ -34,7 +34,7 @@ void check_pgt_cache(void);
void pmd_ctor(struct kmem_cache *, void *);
void pgtable_cache_init(void);
void paging_init(void);
-
+int text_address(unsigned long);

/*
* The Linux x86 paging architecture is 'compile-time dual-mode', it

2008-01-14 22:27:38

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [29/31] CPA: Add a BUG_ON checking for someone setting the kernel text NX


Someone setting NX on the kernel text tends to result in nasty failures
and triple faults, so BUG_ON early for that.

Does not cover __inittext.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr_32.c | 8 ++++++++
1 file changed, 8 insertions(+)

Index: linux/arch/x86/mm/pageattr_32.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr_32.c
+++ linux/arch/x86/mm/pageattr_32.c
@@ -242,6 +242,14 @@ __change_page_attr(struct page *page, pg
BUG_ON(PageLRU(kpte_page));
BUG_ON(PageCompound(kpte_page));

+ /*
+ * Better fail early if someone sets the kernel text to NX.
+ * Does not cover __inittext
+ */
+ BUG_ON(address >= (unsigned long)&_text &&
+ address < (unsigned long)&_etext &&
+ (pgprot_val(prot) & _PAGE_NX));
+
set_tlb_flush(address, cache_attr_changed(*kpte, prot, level),
level < 3);

2008-01-14 22:27:53

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [30/31] Remove set_kernel_exec


The SMP trampoline always runs in real mode, so making it executable
in the page tables doesn't make much sense because it executes
before page tables are set up. That was the only user of
set_kernel_exec(). Remove set_kernel_exec().

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/kernel/smpboot_32.c | 11 -----------
arch/x86/mm/init_32.c | 30 ------------------------------
include/asm-x86/pgtable_32.h | 12 ------------
3 files changed, 53 deletions(-)

Index: linux/arch/x86/mm/init_32.c
===================================================================
--- linux.orig/arch/x86/mm/init_32.c
+++ linux/arch/x86/mm/init_32.c
@@ -508,36 +508,6 @@ static void __init set_nx(void)
}
}
}
-
-/*
- * Enables/disables executability of a given kernel page and
- * returns the previous setting.
- */
-int __init set_kernel_exec(unsigned long vaddr, int enable)
-{
- pte_t *pte;
- int ret = 1;
- int level;
-
- if (!nx_enabled)
- goto out;
-
- pte = lookup_address(vaddr, &level);
- BUG_ON(!pte);
-
- if (!pte_exec_kernel(*pte))
- ret = 0;
-
- if (enable)
- pte->pte_high &= ~(1 << (_PAGE_BIT_NX - 32));
- else
- pte->pte_high |= 1 << (_PAGE_BIT_NX - 32);
- pte_update_defer(&init_mm, vaddr, pte);
- __flush_tlb_all();
-out:
- return ret;
-}
-
#endif

/*
Index: linux/include/asm-x86/pgtable_32.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_32.h
+++ linux/include/asm-x86/pgtable_32.h
@@ -184,18 +184,6 @@ static inline void clone_pgd_range(pgd_t
*/
extern pte_t *lookup_address(unsigned long address, int *level);

-/*
- * Make a given kernel text page executable/non-executable.
- * Returns the previous executability setting of that page (which
- * is used to restore the previous state). Used by the SMP bootup code.
- * NOTE: this is an __init function for security reasons.
- */
-#ifdef CONFIG_X86_PAE
- extern int set_kernel_exec(unsigned long vaddr, int enable);
-#else
- static inline int set_kernel_exec(unsigned long vaddr, int enable) { return 0;}
-#endif
-
#if defined(CONFIG_HIGHPTE)
#define pte_offset_map(dir, address) \
((pte_t *)kmap_atomic_pte(pmd_page(*(dir)),KM_PTE0) + pte_index(address))
Index: linux/arch/x86/kernel/smpboot_32.c
===================================================================
--- linux.orig/arch/x86/kernel/smpboot_32.c
+++ linux/arch/x86/kernel/smpboot_32.c
@@ -112,7 +112,6 @@ u8 apicid_2_node[MAX_APICID];
extern const unsigned char trampoline_data [];
extern const unsigned char trampoline_end [];
static unsigned char *trampoline_base;
-static int trampoline_exec;

static void map_cpu_to_logical_apicid(void);

@@ -144,10 +143,6 @@ void __init smp_alloc_memory(void)
*/
if (__pa(trampoline_base) >= 0x9F000)
BUG();
- /*
- * Make the SMP trampoline executable:
- */
- trampoline_exec = set_kernel_exec((unsigned long)trampoline_base, 1);
}

/*
@@ -1295,12 +1290,6 @@ void __init native_smp_cpus_done(unsigne
setup_ioapic_dest();
#endif
zap_low_mappings();
-#ifndef CONFIG_HOTPLUG_CPU
- /*
- * Disable executability of the SMP trampoline:
- */
- set_kernel_exec((unsigned long)trampoline_base, trampoline_exec);
-#endif
}

void __init smp_intr_init(void)

2008-01-14 22:28:17

by Andi Kleen

[permalink] [raw]
Subject: [PATCH] [31/31] Clean up pte_exec


- Rename it from pte_exec_kernel() to pte_exec(). There is nothing
kernel specific in there.
- Move it into the common file, because _PAGE_NX is 0 on !PAE and
pte_exec() will then always evaluate to true.
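
The !PAE point can be seen directly: with _PAGE_NX defined as 0 the
common test degenerates to a constant 1, so no 2-level special case is
needed (illustrative values only):

#include <stdio.h>

#define _PAGE_NX_PAE	(1ULL << 63)
#define _PAGE_NX_NOPAE	0ULL		/* no NX bit without PAE */

static int pte_exec(unsigned long long pteval, unsigned long long nx_bit)
{
	return !(pteval & nx_bit);
}

int main(void)
{
	unsigned long long nx_pte = _PAGE_NX_PAE | 0x1;	/* present + NX */

	printf("PAE,  NX set: exec=%d\n", pte_exec(nx_pte, _PAGE_NX_PAE));	/* 0 */
	printf("!PAE, same:   exec=%d\n", pte_exec(nx_pte, _PAGE_NX_NOPAE));	/* 1 */
	return 0;
}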

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/fault_32.c | 2 +-
include/asm-x86/pgtable-2level.h | 8 --------
include/asm-x86/pgtable-3level.h | 8 --------
include/asm-x86/pgtable.h | 1 +
4 files changed, 2 insertions(+), 17 deletions(-)

Index: linux/include/asm-x86/pgtable-2level.h
===================================================================
--- linux.orig/include/asm-x86/pgtable-2level.h
+++ linux/include/asm-x86/pgtable-2level.h
@@ -56,14 +56,6 @@ static inline pte_t native_ptep_get_and_
#define pte_pfn(x) (pte_val(x) >> PAGE_SHIFT)

/*
- * All present pages are kernel-executable:
- */
-static inline int pte_exec_kernel(pte_t pte)
-{
- return 1;
-}
-
-/*
* Bits 0, 6 and 7 are taken, split up the 29 bits of offset
* into this range:
*/
Index: linux/include/asm-x86/pgtable.h
===================================================================
--- linux.orig/include/asm-x86/pgtable.h
+++ linux/include/asm-x86/pgtable.h
@@ -143,6 +143,7 @@ static inline int pte_write(pte_t pte)
static inline int pte_file(pte_t pte) { return pte_val(pte) & _PAGE_FILE; }
static inline int pte_huge(pte_t pte) { return pte_val(pte) & _PAGE_PSE; }
static inline int pte_global(pte_t pte) { return pte_val(pte) & _PAGE_GLOBAL; }
+static inline int pte_exec(pte_t pte) { return !(pte_val(pte) & _PAGE_NX); }

static inline int pmd_large(pmd_t pte) {
return (pmd_val(pte) & (_PAGE_PSE|_PAGE_PRESENT)) ==
Index: linux/include/asm-x86/pgtable-3level.h
===================================================================
--- linux.orig/include/asm-x86/pgtable-3level.h
+++ linux/include/asm-x86/pgtable-3level.h
@@ -19,14 +19,6 @@
#define pud_bad(pud) 0
#define pud_present(pud) 1

-/*
- * All present pages with !NX bit are kernel-executable:
- */
-static inline int pte_exec_kernel(pte_t pte)
-{
- return !(pte_val(pte) & _PAGE_NX);
-}
-
/* Rules for using set_pte: the pte being assigned *must* be
* either not present or in a state where the hardware will
* not attempt to update the pte. In places where this is
Index: linux/arch/x86/mm/fault_32.c
===================================================================
--- linux.orig/arch/x86/mm/fault_32.c
+++ linux/arch/x86/mm/fault_32.c
@@ -574,7 +574,7 @@ no_context:
int level;
pte_t *pte = lookup_address(address, &level);

- if (pte && pte_present(*pte) && !pte_exec_kernel(*pte))
+ if (pte && pte_present(*pte) && !pte_exec(*pte))
printk(KERN_CRIT "kernel tried to execute "
"NX-protected page - exploit attempt? "
"(uid: %d)\n", current->uid);

2008-01-14 22:52:19

by Randy Dunlap

[permalink] [raw]
Subject: Re: [PATCH] [27/31] CPA: Change comments of external interfaces to kerneldoc format

On Mon, 14 Jan 2008 23:17:00 +0100 (CET) Andi Kleen wrote:

>
> And clarify description a bit.
>
> Only for 64bit, but the interfaces are identical for 32bit and kerneldoc should
> merge them (?)

I doubt that kernel-doc will merge them.

Normally we just import (like with !I, !E, !F, etc.) one file that
contains the kernel-doc API notations.
Would that work for this case?


> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> Documentation/DocBook/kernel-api.tmpl | 8 +++++
> arch/x86/mm/pageattr_64.c | 46 +++++++++++++++++++++++++---------
> 2 files changed, 42 insertions(+), 12 deletions(-)
>
> Index: linux/arch/x86/mm/pageattr_64.c
> ===================================================================
> --- linux.orig/arch/x86/mm/pageattr_64.c
> +++ linux/arch/x86/mm/pageattr_64.c
> @@ -269,19 +269,19 @@ __change_page_attr(unsigned long address
> return 0;
> }
>
> -/*
> - * Change the page attributes of an page in the linear mapping.
> - *
> - * This should be used when a page is mapped with a different caching policy
> - * than write-back somewhere - some CPUs do not like it when mappings with
> - * different caching policies exist. This changes the page attributes of the
> - * in kernel linear mapping too.
> +/**
> + * change_page_attr_addr - Change page table attributes in linear mapping
> + * @address: Virtual address in linear mapping.
> + * @numpages: Number of pages to change
> + * @prot: New page table attribute (PAGE_*)
> *
> - * The caller needs to ensure that there are no conflicting mappings elsewhere.
> - * This function only deals with the kernel linear map.
> - *
> - * Caller must call global_flush_tlb() after this.
> + * Change page attributes of a page in the direct mapping. This is a variant
> + * of change_page_attr() that also works on memory holes that do not have
> + * mem_map entry (pfn_valid() is false).
> + *
> + * See change_page_attr() documentation for more details.
> */
> +
> int change_page_attr_addr(unsigned long address, int numpages, pgprot_t prot)
> {
> int err = 0, kernel_map = 0;
> @@ -318,13 +318,35 @@ int change_page_attr_addr(unsigned long
> return err;
> }
>
> -/* Don't call this for MMIO areas that may not have a mem_map entry */
> +/**
> + * change_page_attr - Change page table attributes in the linear mapping.
> + * @page: First page to change
> + * @numpages: Number of pages to change
> + * @prot: New protection/caching type (PAGE_*)
> + *
> + * Returns 0 on success, otherwise a negated errno.
> + *
> + * This should be used when a page is mapped with a different caching policy
> + * than write-back somewhere - some CPUs do not like it when mappings with
> + * different caching policies exist. This changes the page attributes of the
> + * in kernel linear mapping too.
> + *
> + * Caller must call global_flush_tlb() later to make the changes active.
> + *
> + * The caller needs to ensure that there are no conflicting mappings elsewhere
> + * (e.g. in user space) * This function only deals with the kernel linear map.
> + *
> + * For MMIO areas without mem_map use change_page_attr_addr() instead.
> + */
> int change_page_attr(struct page *page, int numpages, pgprot_t prot)
> {
> unsigned long addr = (unsigned long)page_address(page);
> return change_page_attr_addr(addr, numpages, prot);
> }
>
> +/**
> + * global_flush_tlb - Flush linear mappings changed by change_page_attr()
> + */
> void global_flush_tlb(void)
> {
> struct page *pg, *next;
> Index: linux/Documentation/DocBook/kernel-api.tmpl
> ===================================================================
> --- linux.orig/Documentation/DocBook/kernel-api.tmpl
> +++ linux/Documentation/DocBook/kernel-api.tmpl
> @@ -726,4 +726,12 @@ X!Idrivers/video/console/fonts.c
> !Ffs/pipe.c
> </chapter>
>
> + <chapter id="pageattr">
> + <title>Kernel direct mapping paging attributes</title>
> + <para>
> + Changing the paging attributes of kernel direct mapping pages
> + on x86.
> + </para>
> +!Farch/x86/mm/pageattr_64.c
> + </chapter>
> </book>
> --

---
~Randy

2008-01-15 00:49:56

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [27/31] CPA: Change comments of external interfaces to kerneldoc format

On Monday 14 January 2008 23:50:54 Randy Dunlap wrote:
> On Mon, 14 Jan 2008 23:17:00 +0100 (CET) Andi Kleen wrote:
>
> >
> > And clarify description a bit.
> >
> > Only for 64bit, but the interfaces are identical for 32bit and kerneldoc should
> > merge them (?)
>
> I doubt that kernel-doc will merge them.
>
> Normally we just import (like with !I, !E, !F, etc.) one file that
> contains the kernel-doc API notations.
> Would that work for this case?

Yes it should. The interface is identical except that 64bit has
one more entry point (but Venki was planning to add that one to 32bit too).

-Andi

2008-01-15 08:40:23

by Jan Beulich

[permalink] [raw]
Subject: Re: [PATCH] [12/31] CPA: CLFLUSH support in change_page_attr()

>- /* clflush is still broken. Disable for now. */
>- if (1 || !cpu_has_clflush)
>+ if (a->full_flush)
> asm volatile("wbinvd" ::: "memory");
>- else list_for_each_entry(pg, l, lru) {
>- void *adr = page_address(pg);
>- clflush_cache_range(adr, PAGE_SIZE);
>+ list_for_each_entry(f, &a->l, l) {
>+ if (!a->full_flush)

This if() looks redundant (could also be avoided in the 32-bit variant, but
isn't redundant there at present). Also, is there no
wbinvd() on 64bit?

Jan

2008-01-15 08:46:25

by Jan Beulich

[permalink] [raw]
Subject: Re: [PATCH] [19/31] CPA: Limit cache flushing to pages that really change caching

>+static unsigned short pat_bit[5] = {

Could be const and perhaps even local to cache_attr_changed()...

Jan

2008-01-15 08:47:48

by Harvey Harrison

[permalink] [raw]
Subject: Re: [PATCH] [2/31] CPA: Do a simple self test at boot

On Mon, 2008-01-14 at 23:16 +0100, Andi Kleen wrote:
> When CONFIG_DEBUG_RODATA is enabled undo the ro mapping and redo it again.
> This gives some simple testing for change_page_attr()
>
> Optional patch, but I find it useful.
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> arch/x86/Kconfig.debug | 5 +++++
> arch/x86/mm/init_32.c | 26 ++++++++++++++++++++++++++
> arch/x86/mm/init_64.c | 10 ++++++++++
> 3 files changed, 41 insertions(+)
>
> Index: linux/arch/x86/mm/init_64.c
> ===================================================================
> --- linux.orig/arch/x86/mm/init_64.c
> +++ linux/arch/x86/mm/init_64.c
> @@ -603,6 +603,16 @@ void mark_rodata_ro(void)
> * of who is the culprit.
> */
> global_flush_tlb();

global_flush_tlb outside CONFIG_CPA_DEBUG here.

> +
> +#ifdef CONFIG_CPA_DEBUG
> + printk("Testing CPA: undo %lx-%lx\n", start, end);
> + change_page_attr_addr(start, (end - start) >> PAGE_SHIFT, PAGE_KERNEL);
> + global_flush_tlb();
> +
> + printk("Testing CPA: again\n");
> + change_page_attr_addr(start, (end - start) >> PAGE_SHIFT, PAGE_KERNEL_RO);
> + global_flush_tlb();
> +#endif
> }
> #endif
>
> Index: linux/arch/x86/mm/init_32.c
> ===================================================================
> --- linux.orig/arch/x86/mm/init_32.c
> +++ linux/arch/x86/mm/init_32.c
> @@ -793,6 +793,20 @@ void mark_rodata_ro(void)
> change_page_attr(virt_to_page(start),
> size >> PAGE_SHIFT, PAGE_KERNEL_RX);
> printk("Write protecting the kernel text: %luk\n", size >> 10);
> +
> +#ifdef CONFIG_CPA_DEBUG
> + global_flush_tlb();
> +

global_flush_tlb inside CONFIG_CPA_DEBUG here.

> + printk("Testing CPA: Reverting %lx-%lx\n", start, start+size);
> + change_page_attr(virt_to_page(start), size>>PAGE_SHIFT,
> + PAGE_KERNEL_EXEC);
> + global_flush_tlb();
> +
> + printk("Testing CPA: write protecting again\n");
> + change_page_attr(virt_to_page(start), size>>PAGE_SHIFT,
> + PAGE_KERNEL_RX);
> + global_flush_tlb();
> +#endif
> }
> #endif
> start += size;
> @@ -809,6 +823,18 @@ void mark_rodata_ro(void)
> * of who is the culprit.
> */
> global_flush_tlb();

global_flush_tlb outside CONFIG_CPA_DEBUG here. Intentional?

> +
> +#ifdef CONFIG_CPA_DEBUG
> + printk("Testing CPA: undo %lx-%lx\n", start, start + size);
> + change_page_attr(virt_to_page(start), size >> PAGE_SHIFT,
> + PAGE_KERNEL);
> + global_flush_tlb();
> +
> + printk("Testing CPA: write protecting again\n");
> + change_page_attr(virt_to_page(start), size >> PAGE_SHIFT,
> + PAGE_KERNEL_RO);
> + global_flush_tlb();
> +#endif

Cheers,

Harvey

2008-01-15 08:56:25

by Harvey Harrison

[permalink] [raw]
Subject: Re: [PATCH] [7/31] Extract page table dumping code from i386 fault handler into dump_pagetable()

On Mon, 2008-01-14 at 23:16 +0100, Andi Kleen wrote:
> Similar to x86-64. This is useful in other situations where we want
> the page table dumped too.
>
> Besides anything that makes i386 do_page_fault shorter is good.
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> arch/x86/mm/fault_32.c | 72 ++++++++++++++++++++++++++-----------------------
> 1 file changed, 39 insertions(+), 33 deletions(-)
>
> Index: linux/arch/x86/mm/fault_32.c
> ===================================================================
> --- linux.orig/arch/x86/mm/fault_32.c
> +++ linux/arch/x86/mm/fault_32.c
> @@ -28,6 +28,44 @@
> #include <asm/desc.h>
> #include <asm/segment.h>
>
> +void dump_pagetable(unsigned long address)

static?

> +{
> + typeof(pte_val(__pte(0))) page;

Is there any type that could be picked that would be nicer than
sprinkling ((__typeof__(page) *), typeof(pte_val(__pte(0))) etc.
through here? I know it's just moving the code out to another
function; just wondering if you had any better ideas that someone
could follow up on.

Maybe another helper, printk_page(), could help here.

> +
> + page = read_cr3();
> + page = ((__typeof__(page) *) __va(page))[address >> PGDIR_SHIFT];
> +#ifdef CONFIG_X86_PAE
> + printk("*pdpt = %016Lx ", page);
> + if ((page >> PAGE_SHIFT) < max_low_pfn
> + && page & _PAGE_PRESENT) {
> + page &= PAGE_MASK;
> + page = ((__typeof__(page) *) __va(page))[(address >> PMD_SHIFT)
> + & (PTRS_PER_PMD - 1)];
> + printk(KERN_CONT "*pde = %016Lx ", page);
> + page &= ~_PAGE_NX;
> + }
> +#else
> + printk("*pde = %08lx ", page);
> +#endif
> +
> + /*
> + * We must not directly access the pte in the highpte
> + * case if the page table is located in highmem.
> + * And let's rather not kmap-atomic the pte, just in case
> + * it's allocated already.
> + */
> + if ((page >> PAGE_SHIFT) < max_low_pfn
> + && (page & _PAGE_PRESENT)
> + && !(page & _PAGE_PSE)) {
> + page &= PAGE_MASK;
> + page = ((__typeof__(page) *) __va(page))[(address >> PAGE_SHIFT)
> + & (PTRS_PER_PTE - 1)];
> + printk("*pte = %0*Lx ", sizeof(page)*2, (u64)page);
> + }
> +
> + printk("\n");
> +}
> +

Cheers,

Harvey
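
A rough sketch of the printk_page() helper suggested above (hypothetical name
and signature; it assumes the pteval_t type mentioned later in the thread and
only centralizes the printing, not the __typeof__ casts):

static void printk_page(const char *name, u64 entry)
{
	/* width follows the native page table entry size (8 or 16 hex digits) */
	printk(KERN_CONT "*%s = %0*Lx ", name,
	       (int)(2 * sizeof(pteval_t)), entry);
}

Usage in dump_pagetable() would then be printk_page("pdpt", page),
printk_page("pde", page), printk_page("pte", page).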

2008-01-15 09:05:46

by Jan Beulich

[permalink] [raw]
Subject: Re: [PATCH] [26/31] CPA: Fix reference counting when changing already changed pages

>+ ref_prot = canon_pgprot(ref_prot);
>+ prot = canon_pgprot(prot);
>+
> if (pgprot_val(prot) != pgprot_val(ref_prot)) {
>...
> } else if (level == 4) {
>...
> } else {
> /*
> * When you're here you either set the same page to PAGE_KERNEL

Doesn't this change require modifying the BUG() here into a BUG_ON() so
that it doesn't trigger if pgprot_val(prot) == pgprot_val(ref_prot) and
level != 4?

>+#define canon_pgprot(p) __pgprot(pgprot_val(p) & __supported_pte_mask)

While I remember you stated the inverse, I continue to think that it'd be
safer to mask out the accessed and dirty flags for the comparisons this
macro is being used for.

Jan
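
What the second suggestion amounts to, as a sketch (whether masking A/D is
safe for every caller is exactly the open question here):

#define canon_pgprot(p)	__pgprot(pgprot_val(p) & __supported_pte_mask & \
				 ~(_PAGE_ACCESSED | _PAGE_DIRTY))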

2008-01-15 09:11:38

by Jan Beulich

[permalink] [raw]
Subject: Re: [PATCH] [0/31] Great change_page_attr patch series v2

>>> Andi Kleen <[email protected]> 14.01.08 23:16 >>>
>
>Lots of improvements to change_page_attr(). Make it a lot more
>efficient and fix various bugs.
>
>Changes against earlier version
>
>- Fixed some checkpatch.pl complaints
>- Improved self test suite
>- Fixed more reference bugs
>- Fix NX handling on 32bit
>- Remove some useless code there
>- Few other random fixes

The one concept that I'm missing (but that I can easily produce a follow-up
patch for, as I had this in my c_p_a() changes) is the tracking and adjusting
of the reference protection for a large page range that got fully converted
to another type (namely relevant for .rodata if it exceeds 2/4 Mb), allowing
to use a large page mapping in this case even for non-default mappings.

Apart from the other comments, the whole series

Acked-by: Jan Beulich <[email protected]>

2008-01-15 09:30:06

by Harvey Harrison

[permalink] [raw]
Subject: Re: [PATCH] [13/31] CPA: Use macros to modify the PG_arch_1 page flags in change_page_attr

On Mon, 2008-01-14 at 23:16 +0100, Andi Kleen wrote:
> Instead of open coding the bit accesses uses standard style
> *PageDeferred* macros.
>

Any reason these couldn't be static inlines in a shared header?

asm/page.h perhaps?

Cheers,

Harvey
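
A sketch of the static-inline alternative, assuming the flag in question is
PG_arch_1 and the names follow the patch's *PageDeferred* convention (the
patch's own macro definitions are not shown in this thread):

#include <linux/mm_types.h>
#include <linux/page-flags.h>
#include <linux/bitops.h>

/* PG_arch_1 reused as a "flush deferred" marker by change_page_attr() */
static inline int PageDeferred(struct page *page)
{
	return test_bit(PG_arch_1, &page->flags);
}

static inline void SetPageDeferred(struct page *page)
{
	set_bit(PG_arch_1, &page->flags);
}

static inline void ClearPageDeferred(struct page *page)
{
	clear_bit(PG_arch_1, &page->flags);
}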

2008-01-15 10:00:51

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [7/31] Extract page table dumping code from i386 fault handler into dump_pagetable()


> > +void dump_pagetable(unsigned long address)
>
> static?

This is used by other files too with future patches. Also,
in general it's useful for debugging stuff - I often put calls
to it into debugging patches.

> > +{
> > + typeof(pte_val(__pte(0))) page;
>
> Is there any type that could be picked that would be nicer than
> sprinkling ((__typeof__(page) *), typeof(pte_val(__pte(0))) etc
> through here, I know it's just moving the code out to another
> function, just wondering if you had any better ideas that someone
> could follow-up on.

It could be pteval_t now, which Jeremy recently introduced. But for pure
code movement I don't want to do any improvements.

-Andi
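
With pteval_t, the declaration and casts in the moved code would collapse to
roughly the following fragment (a sketch only, assuming pteval_t is wide
enough for both the PAE and non-PAE cases as intended):

	pteval_t page;

	page = read_cr3();
	page = ((pteval_t *)__va(page))[address >> PGDIR_SHIFT];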

2008-01-15 10:01:10

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [2/31] CPA: Do a simple self test at boot


> global_flush_tlb outside CONFIG_CPA_DEBUG here. Intentional?

Yes, that's all intentional. Without CPA_DEBUG there is only a single flush;
with CPA_DEBUG we flush more often simply to catch bugs.

-Andi

2008-01-15 10:01:26

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [12/31] CPA: CLFLUSH support in change_page_attr()

On Tuesday 15 January 2008 09:40:27 Jan Beulich wrote:
> >- /* clflush is still broken. Disable for now. */
> >- if (1 || !cpu_has_clflush)
> >+ if (a->full_flush)
> > asm volatile("wbinvd" ::: "memory");
> >- else list_for_each_entry(pg, l, lru) {
> >- void *adr = page_address(pg);
> >- clflush_cache_range(adr, PAGE_SIZE);
> >+ list_for_each_entry(f, &a->l, l) {
> >+ if (!a->full_flush)
>
> This if() looks redundant (could also be avoided in the 32-bit variant, but
> isn't redundant there at present). Also, is there no
> wbinvd() on 64bit?

That's all done in a later patch.

The transformation steps are not always ideal, but in the end the code
is ok I think.

-Andi

The final result of the series for 32bit's flush_kernel_map is:

static void flush_kernel_map(void *arg)
{
- struct list_head *lh = (struct list_head *)arg;
- struct page *p;
+ struct flush_arg *a = (struct flush_arg *)arg;
+ struct flush *f;
+ int cache_flush = a->full_flush == FLUSH_CACHE;
+
+ list_for_each_entry(f, &a->l, l) {
+ if (!a->full_flush)
+ __flush_tlb_one(f->addr);
+ if (f->mode == FLUSH_CACHE && !cpu_has_ss) {
+ if (cpu_has_clflush)
+ clflush_cache_range((void *)f->addr, PAGE_SIZE);
+ else
+ cache_flush++;
+ }
+ }

- /* High level code is not ready for clflush yet */
- if (0 && cpu_has_clflush) {
- list_for_each_entry (p, lh, lru)
- cache_flush_page(p);
- } else if (boot_cpu_data.x86_model >= 4)
- wbinvd();
+ if (a->full_flush)
+ __flush_tlb_all();

- /* Flush all to work around Errata in early athlons regarding
- * large page flushing.
+ /*
+ * RED-PEN: Intel documentation ask for a CPU synchronization step
+ * here and in the loop. But it is moot on Self-Snoop CPUs anyways.
*/
- __flush_tlb_all();
+
+ if (cache_flush > 0 && !cpu_has_ss && boot_cpu_data.x86_model >= 4)
+ wbinvd();
}
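
Assembled from the + lines above, the resulting function reads roughly as
follows (reconstructed from the diff, not checked against the final tree):

static void flush_kernel_map(void *arg)
{
	struct flush_arg *a = (struct flush_arg *)arg;
	struct flush *f;
	int cache_flush = a->full_flush == FLUSH_CACHE;

	list_for_each_entry(f, &a->l, l) {
		if (!a->full_flush)
			__flush_tlb_one(f->addr);
		if (f->mode == FLUSH_CACHE && !cpu_has_ss) {
			if (cpu_has_clflush)
				clflush_cache_range((void *)f->addr, PAGE_SIZE);
			else
				cache_flush++;
		}
	}

	if (a->full_flush)
		__flush_tlb_all();

	/*
	 * RED-PEN: Intel documentation asks for a CPU synchronization step
	 * here and in the loop. But it is moot on Self-Snoop CPUs anyway.
	 */

	if (cache_flush > 0 && !cpu_has_ss && boot_cpu_data.x86_model >= 4)
		wbinvd();
}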

2008-01-15 10:05:27

by Harvey Harrison

[permalink] [raw]
Subject: Re: [PATCH] [7/31] Extract page table dumping code from i386 fault handler into dump_pagetable()

On Tue, 2008-01-15 at 11:00 +0100, Andi Kleen wrote:
> > > +void dump_pagetable(unsigned long address)
> >
> > static?
>
> This is used by other files too with future patches. Also
> in general it's useful for debugging stuff - i often put calls
> to it into debugging patches.
>
> > > +{
> > > + typeof(pte_val(__pte(0))) page;
> >
> > Is there any type that could be picked that would be nicer than
> > sprinkling ((__typeof__(page) *), typeof(pte_val(__pte(0))) etc
> > through here, I know it's just moving the code out to another
> > function, just wondering if you had any better ideas that someone
> > could follow-up on.
>
> It could be pteval_t now which Jeremy recently introduced. But for pure
> code movement I don't want to do any improvements.

Sure, I had a very similar patch but was looking for a better way to
address this, I'll keep an eye on this and Jeremy's patch and follow
up when his stuff has settled a bit more.

Harvey

2008-01-15 10:08:28

by Harvey Harrison

[permalink] [raw]
Subject: Re: [PATCH] [2/31] CPA: Do a simple self test at boot

On Tue, 2008-01-15 at 10:59 +0100, Andi Kleen wrote:
> > global_flush_tlb outside CONFIG_CPA_DEBUG here. Intentional?
>
> Yes that's all intentional. Without CPA_DEBUG there is only a single flush,
> with CPA_DEBUG we flush more often simply to catch bugs.
>

OK, it just looked a bit fishy with the different grouping, just making
sure.

Harvey

2008-01-15 10:09:27

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [13/31] CPA: Use macros to modify the PG_arch_1 page flags in change_page_attr

On Tuesday 15 January 2008 10:29:51 Harvey Harrison wrote:
> On Mon, 2008-01-14 at 23:16 +0100, Andi Kleen wrote:
> > Instead of open coding the bit accesses uses standard style
> > *PageDeferred* macros.
> >
>
> Any reason these couldn't be static inlines in a shared header?

The whole usage of that bit is private to the file

-Andi

2008-01-15 10:09:40

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [26/31] CPA: Fix reference counting when changing already changed pages

On Tuesday 15 January 2008 10:05:44 Jan Beulich wrote:
> >+ ref_prot = canon_pgprot(ref_prot);
> >+ prot = canon_pgprot(prot);
> >+
> > if (pgprot_val(prot) != pgprot_val(ref_prot)) {
> >...
> > } else if (level == 4) {
> >...
> > } else {
> > /*
> > * When you're here you either set the same page to PAGE_KERNEL
>
> Doesn't this change require modifying the BUG() here into a BUG_ON() so
> that it doesn't trigger if pgprot_val(prot) == pgprot_val(ref_prot) and
> level != 4?

I addressed this in the comment

+ /*
+ * When you're here you either set the same page to PAGE_KERNEL
+ * two times in a row or the page table reference counting is
+ * broken again. To catch the later bug for now (sorry)
+ */

Do you think it's important to handle? The function already has too many
special cases, and setting something several times in a row to PAGE_KERNEL
is usually a bug in the caller anyway (or a cpa bug).

>
> >+#define canon_pgprot(p) __pgprot(pgprot_val(p) & __supported_pte_mask)
>
> While I remember you stated the inverse,

I changed my mind regarding not doing this. It was true when I said it originally.

> I continue to think that it'd be
> safer to mask out the accessed and dirty flags for the comparisons this
> macro is being used for.

A and D should always be set for _PAGE_KERNEL and for kernel mappings in general.
The problem would only occur if someone made up custom page flags without those.
I didn't think that usage was worth supporting. Perhaps it would be a good idea
to add a WARN_ON for this case, though.

Thanks for the review.

-Andi
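
The WARN_ON mentioned above could be as simple as the following sketch
(where exactly it would live inside change_page_attr() is left open here):

	/* kernel mappings are expected to carry Accessed and Dirty */
	WARN_ON((pgprot_val(prot) & (_PAGE_ACCESSED | _PAGE_DIRTY)) !=
		(_PAGE_ACCESSED | _PAGE_DIRTY));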

2008-01-15 10:09:55

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [0/31] Great change_page_attr patch series v2

On Tuesday 15 January 2008 10:11:44 Jan Beulich wrote:
> >>> Andi Kleen <[email protected]> 14.01.08 23:16 >>>
> >
> >Lots of improvements to change_page_attr(). Make it a lot more
> >efficient and fix various bugs.
> >
> >Changes against earlier version
> >
> >- Fixed some checkpatch.pl complaints
> >- Improved self test suite
> >- Fixed more reference bugs
> >- Fix NX handling on 32bit
> >- Remove some useless code there
> >- Few other random fixes
>
> The one concept that I'm missing (but that I can easily produce a follow-up
> patch for, as I had this in my c_p_a() changes) is the tracking and adjusting
> of the reference protection for a large page range that got fully converted
> to another type (namely relevant for .rodata if it exceeds 2/4 Mb), allowing
> to use a large page mapping in this case even for non-default mappings.

Ah -- I got rid of that by changing the rodata code to not do this
except for the debugging case:

>>
CPA: Only unmap kernel init pages in text mapping when CONFIG_DEBUG_RODATA is set

Otherwise the kernel will likely always run with 4K pages instead of 2MB pages,
which is costly in terms of TLBs.
<<

But you're right that would be a useful feature. But wouldn't it require
aligning rodata to 2MB in the vmlinux to be really effective?

>
> Apart from the other comments, the whole series
>
> Acked-by: Jan Beulich <[email protected]>

Thanks.

-Andi


2008-01-15 10:15:17

by Harvey Harrison

[permalink] [raw]
Subject: Re: [PATCH] [13/31] CPA: Use macros to modify the PG_arch_1 page flags in change_page_attr

On Tue, 2008-01-15 at 11:06 +0100, Andi Kleen wrote:
> On Tuesday 15 January 2008 10:29:51 Harvey Harrison wrote:
> > On Mon, 2008-01-14 at 23:16 +0100, Andi Kleen wrote:
> > > Instead of open coding the bit accesses uses standard style
> > > *PageDeferred* macros.
> > >
> >
> > Any reason these couldn't be static inlines in a shared header?
>
> The whole usage of that bit is private to the file

True if pageattr_32|64.c get merged (not something I have looked at yet),
but wouldn't a static inline in the files then be better for the
typechecking?

Harvey

2008-01-15 10:25:35

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [13/31] CPA: Use macros to modify the PG_arch_1 page flags in change_page_attr

On Tuesday 15 January 2008 11:15:02 Harvey Harrison wrote:
> On Tue, 2008-01-15 at 11:06 +0100, Andi Kleen wrote:
> > On Tuesday 15 January 2008 10:29:51 Harvey Harrison wrote:
> > > On Mon, 2008-01-14 at 23:16 +0100, Andi Kleen wrote:
> > > > Instead of open coding the bit accesses uses standard style
> > > > *PageDeferred* macros.
> > > >
> > >
> > > Any reason these couldn't be static inlines in a shared header?
> >
> > The whole usage of that bit is private to the file
>
> True if pageattr_32|64.c get merged, not one I have looked at yet,
> but wouldn't a static inline in the files then be better for the
> typechecking?

If you pass in something other than a struct page * the compiler
will already complain. Also, all the other Page* accessors in page-flags.h are macros.

-Andi

2008-01-15 10:37:33

by Harvey Harrison

[permalink] [raw]
Subject: Re: [PATCH] [9/31] CPA: Add simple self test at boot

On Mon, 2008-01-14 at 23:16 +0100, Andi Kleen wrote:
> Since change_page_attr() is tricky code it is good to have some regression
> test code. This patch maps and unmaps some random pages in the direct mapping
> at boot and then dumps the state and does some simple sanity checks.
>
> Add it with a CONFIG option.
>
> Optional patch, but I find it useful.

Has it been decided if a new /test directory is going to be
created for this kind of smoketest?

Harvey

2008-01-15 11:55:18

by Jan Beulich

[permalink] [raw]
Subject: Re: [PATCH] [0/31] Great change_page_attr patch series v2

>> The one concept that I'm missing (but that I can easily produce a follow-up
>> patch for, as I had this in my c_p_a() changes) is the tracking and adjusting
>> of the reference protection for a large page range that got fully converted
>> to another type (namely relevant for .rodata if it exceeds 2/4 Mb), allowing
>> to use a large page mapping in this case even for non-default mappings.
>
>Ah -- i got rid of that by changing the rodata code to not do this
>except for the debugging case
>
>>>
>CPA: Only unmap kernel init pages in text mapping when CONFIG_DEBUG_RODATA is set
>
>Otherwise the kernel will likely always run with 4K pages instead of 2MB pages,
>which is costly in terms of TLBs.
><<
>
>But you're right that would be an useful feature. But wouldn't it require
>aligning rodata to 2MB in the vmlinux to be really effective?

Yes, that would be desirable then (and .data should be at a 2/4 Mb
boundary for this, too).

Jan

2008-01-15 12:00:02

by Jan Beulich

[permalink] [raw]
Subject: Re: [PATCH] [26/31] CPA: Fix reference counting when changing already changed pages

>>> Andi Kleen <[email protected]> 15.01.08 11:04 >>>
>On Tuesday 15 January 2008 10:05:44 Jan Beulich wrote:
>> >+ ref_prot = canon_pgprot(ref_prot);
>> >+ prot = canon_pgprot(prot);
>> >+
>> > if (pgprot_val(prot) != pgprot_val(ref_prot)) {
>> >...
>> > } else if (level == 4) {
>> >...
>> > } else {
>> > /*
>> > * When you're here you either set the same page to PAGE_KERNEL
>>
>> Doesn't this change require modifying the BUG() here into a BUG_ON() so
>> that it doesn't trigger if pgprot_val(prot) == pgprot_val(ref_prot) and
>> level != 4?
>
>I addressed this in the comment
>
>+ /*
>+ * When you're here you either set the same page to PAGE_KERNEL
>+ * two times in a row or the page table reference counting is
>+ * broken again. To catch the later bug for now (sorry)
>+ */
>
>Do you think it's important to handle? The function already has too many
>special cases and setting something several times in a row to PAGE_KERNEL
>is usually a bug in the caller anyways (or a cpa bug)

It definitely is when making ref_prot variable (as discussed in an earlier
reply regarding a different patch), but I think it's even inconsistent given
the possible presence/absence of _PAGE_NX.

Jan

2008-01-15 12:43:30

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [0/31] Great change_page_attr patch series v2


> >But you're right that would be an useful feature. But wouldn't it require
> >aligning rodata to 2MB in the vmlinux to be really effective?
>
> Yes, that would be desirable then (and .data should be at a 2/4 Mb
> boundary for this, too).

Yes, rather .rodata/.text together and .data/.bss separately
aligned to 2MB.

But the problem is it would be quite wasteful -- even on a big
defconfig kernel data+bss is far less than 2MB and with 4MB
pages it would be even worse.

I'm not sure that would be worth the advantage of write protection for
non-debugging kernels, and for debugging kernels just using 4K pages is fine.

There's also the issue of where to put the initcode/initdata.

-Andi

2008-01-15 12:58:57

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH] [4/31] Add pte_pgprot on i386

Hi,

Andi Kleen <[email protected]> writes:

[...]
> --- linux.orig/include/asm-x86/pgtable-2level.h
> +++ linux/include/asm-x86/pgtable-2level.h
> @@ -75,6 +75,8 @@ static inline int pte_exec_kernel(pte_t
> #define pgoff_to_pte(off) \
> ((pte_t) { .pte_low = (((off) & 0x1f) << 1) + (((off) >> 5) << 8) + _PAGE_FILE })
>
> +#define pte_pgprot(x) __pgprot((x).pte_low & 0xfff)

[...]

> --- linux.orig/include/asm-x86/pgtable-3level.h
> +++ linux/include/asm-x86/pgtable-3level.h
> @@ -142,6 +142,8 @@ static inline unsigned long pte_pfn(pte_
> return pte_val(pte) >> PAGE_SHIFT;
> }
>
> +#define pte_pgprot(x) __pgprot(pte_val(x) & (0xfff | _PAGE_NX))

Could we use pte_val() in both cases?

Hannes
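
The 2-level definition rewritten that way would be (a sketch of the
suggestion; on 2-level page tables pte_val() is just pte_low, so the result
is unchanged):

#define pte_pgprot(x)	__pgprot(pte_val(x) & 0xfff)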