2008-01-29 05:06:42

by Andi Kleen

Subject: [PATCH] [0/9] Latest GBPAGES patchkit for 2.6.25


This patchkit implements support for the 1GB pages of AMD Fam10h CPUs
in the kernel direct mapping.

Changes compared to previous versions:
- Ported to latest change_page_attr
- kexec now works again
- Ported to latest git-x86
- Minor cleanups.

I believe this patchkit is ready for the 2.6.25 merge.

-Andi


2008-01-29 05:06:57

by Andi Kleen

Subject: [PATCH] [1/9] Handle kernel near memory hole in clear_kernel_mapping


This fixes a long-standing, obscure problem in the relocatable kernel. The
AMD GART driver needs to unmap part of the GART in the kernel direct mapping to
prevent cache corruption. With the relocatable kernel it is in theory possible
that the separate kernel text mapping straddles that area too.

Normally this should not happen because the GART aperture tends to be >= 2GB,
and the kernel is usually not loaded that high, but it is possible in theory.

Teach clear_kernel_mapping() about this case.

This will become more important once the kernel mapping uses 1GB pages.

Cc: [email protected]
Cc: [email protected]

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/init_64.c | 25 ++++++++++++++++++++++++-
1 file changed, 24 insertions(+), 1 deletion(-)

Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -438,7 +438,8 @@ void __init paging_init(void)
* address and size must be aligned to 2MB boundaries.
* Does nothing when the mapping doesn't exist.
*/
-void __init clear_kernel_mapping(unsigned long address, unsigned long size)
+static void __init
+__clear_kernel_mapping(unsigned long address, unsigned long size)
{
unsigned long end = address + size;

@@ -475,6 +476,28 @@ void __init clear_kernel_mapping(unsigne
__flush_tlb_all();
}

+#define overlaps(as, ae, bs, be) ((ae) >= (bs) && (as) <= (be))
+
+void __init clear_kernel_mapping(unsigned long address, unsigned long size)
+{
+ int sh = PMD_SHIFT;
+ unsigned long kernel = __pa(__START_KERNEL_map);
+
+ /*
+ * Note that we cannot unmap the kernel itself because the unmapped
+ * holes here are always at least 2MB aligned.
+ * This just applies to the trailing areas of the 40MB kernel mapping.
+ */
+ if (overlaps(kernel >> sh, (kernel + KERNEL_TEXT_SIZE) >> sh,
+ __pa(address) >> sh, __pa(address + size) >> sh)) {
+ printk(KERN_WARNING
+ "Kernel mapping at %lx within 2MB of memory hole\n",
+ kernel);
+ __clear_kernel_mapping(__START_KERNEL_map+__pa(address), size);
+ }
+ __clear_kernel_mapping(address, size);
+}
+
/*
* Memory hotplug specific functions
*/

2008-01-29 05:07:35

by Andi Kleen

Subject: [PATCH] [2/9] GBPAGES: Add feature macros for the gbpages cpuid bit


Signed-off-by: Andi Kleen <[email protected]>

---
include/asm-x86/cpufeature.h | 2 ++
1 file changed, 2 insertions(+)

Index: linux/include/asm-x86/cpufeature.h
===================================================================
--- linux.orig/include/asm-x86/cpufeature.h
+++ linux/include/asm-x86/cpufeature.h
@@ -49,6 +49,7 @@
#define X86_FEATURE_MP (1*32+19) /* MP Capable. */
#define X86_FEATURE_NX (1*32+20) /* Execute Disable */
#define X86_FEATURE_MMXEXT (1*32+22) /* AMD MMX extensions */
+#define X86_FEATURE_GBPAGES (1*32+26) /* GB pages */
#define X86_FEATURE_RDTSCP (1*32+27) /* RDTSCP */
#define X86_FEATURE_LM (1*32+29) /* Long Mode (x86-64) */
#define X86_FEATURE_3DNOWEXT (1*32+30) /* AMD 3DNow! extensions */
@@ -175,6 +176,7 @@
#define cpu_has_pebs boot_cpu_has(X86_FEATURE_PEBS)
#define cpu_has_clflush boot_cpu_has(X86_FEATURE_CLFLSH)
#define cpu_has_bts boot_cpu_has(X86_FEATURE_BTS)
+#define cpu_has_gbpages boot_cpu_has(X86_FEATURE_GBPAGES)

#if defined(CONFIG_X86_INVLPG) || defined(CONFIG_X86_64)
# define cpu_has_invlpg 1

2008-01-29 05:08:19

by Andi Kleen

Subject: [PATCH] [3/9] GBPAGES: Split LARGE_PAGE_SIZE/MASK into PUD_PAGE_SIZE/PMD_PAGE_SIZE


Split the existing LARGE_PAGE_SIZE/MASK macro into two new macros
PUD_PAGE_SIZE/MASK and PMD_PAGE_SIZE/MASK.

Fix up all callers to use the new names.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/boot/compressed/head_64.S | 8 ++++----
arch/x86/kernel/head_64.S | 4 ++--
arch/x86/kernel/pci-gart_64.c | 2 +-
arch/x86/mm/init_64.c | 6 +++---
arch/x86/mm/pageattr.c | 2 +-
include/asm-x86/page.h | 4 ++--
include/asm-x86/page_32.h | 4 ++++
include/asm-x86/page_64.h | 3 +++
8 files changed, 20 insertions(+), 13 deletions(-)

Index: linux/include/asm-x86/page_64.h
===================================================================
--- linux.orig/include/asm-x86/page_64.h
+++ linux/include/asm-x86/page_64.h
@@ -23,6 +23,9 @@
#define MCE_STACK 5
#define N_EXCEPTION_STACKS 5 /* hw limit: 7 */

+#define PUD_PAGE_SIZE (_AC(1, UL) << PUD_SHIFT)
+#define PUD_PAGE_MASK (~(PUD_PAGE_SIZE-1))
+
#define __PAGE_OFFSET _AC(0xffff810000000000, UL)

#define __PHYSICAL_START CONFIG_PHYSICAL_START
Index: linux/arch/x86/boot/compressed/head_64.S
===================================================================
--- linux.orig/arch/x86/boot/compressed/head_64.S
+++ linux/arch/x86/boot/compressed/head_64.S
@@ -80,8 +80,8 @@ startup_32:

#ifdef CONFIG_RELOCATABLE
movl %ebp, %ebx
- addl $(LARGE_PAGE_SIZE -1), %ebx
- andl $LARGE_PAGE_MASK, %ebx
+ addl $(PMD_PAGE_SIZE -1), %ebx
+ andl $PMD_PAGE_MASK, %ebx
#else
movl $CONFIG_PHYSICAL_START, %ebx
#endif
@@ -220,8 +220,8 @@ ENTRY(startup_64)
/* Start with the delta to where the kernel will run at. */
#ifdef CONFIG_RELOCATABLE
leaq startup_32(%rip) /* - $startup_32 */, %rbp
- addq $(LARGE_PAGE_SIZE - 1), %rbp
- andq $LARGE_PAGE_MASK, %rbp
+ addq $(PMD_PAGE_SIZE - 1), %rbp
+ andq $PMD_PAGE_MASK, %rbp
movq %rbp, %rbx
#else
movq $CONFIG_PHYSICAL_START, %rbp
Index: linux/arch/x86/kernel/pci-gart_64.c
===================================================================
--- linux.orig/arch/x86/kernel/pci-gart_64.c
+++ linux/arch/x86/kernel/pci-gart_64.c
@@ -501,7 +501,7 @@ static __init unsigned long check_iommu_
}

a = aper + iommu_size;
- iommu_size -= round_up(a, LARGE_PAGE_SIZE) - a;
+ iommu_size -= round_up(a, PMD_PAGE_SIZE) - a;

if (iommu_size < 64*1024*1024) {
printk(KERN_WARNING
Index: linux/arch/x86/kernel/head_64.S
===================================================================
--- linux.orig/arch/x86/kernel/head_64.S
+++ linux/arch/x86/kernel/head_64.S
@@ -63,7 +63,7 @@ startup_64:

/* Is the address not 2M aligned? */
movq %rbp, %rax
- andl $~LARGE_PAGE_MASK, %eax
+ andl $~PMD_PAGE_MASK, %eax
testl %eax, %eax
jnz bad_address

@@ -88,7 +88,7 @@ startup_64:

/* Add an Identity mapping if I am above 1G */
leaq _text(%rip), %rdi
- andq $LARGE_PAGE_MASK, %rdi
+ andq $PMD_PAGE_MASK, %rdi

movq %rdi, %rax
shrq $PUD_SHIFT, %rax
Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -443,10 +443,10 @@ __clear_kernel_mapping(unsigned long add
{
unsigned long end = address + size;

- BUG_ON(address & ~LARGE_PAGE_MASK);
- BUG_ON(size & ~LARGE_PAGE_MASK);
+ BUG_ON(address & ~PMD_PAGE_MASK);
+ BUG_ON(size & ~PMD_PAGE_MASK);

- for (; address < end; address += LARGE_PAGE_SIZE) {
+ for (; address < end; address += PMD_PAGE_SIZE) {
pgd_t *pgd = pgd_offset_k(address);
pud_t *pud;
pmd_t *pmd;
Index: linux/include/asm-x86/page_32.h
===================================================================
--- linux.orig/include/asm-x86/page_32.h
+++ linux/include/asm-x86/page_32.h
@@ -13,6 +13,10 @@
*/
#define __PAGE_OFFSET _AC(CONFIG_PAGE_OFFSET, UL)

+/* Eventually 32bit should be moved over to the new names too */
+#define LARGE_PAGE_SIZE PMD_PAGE_SIZE
+#define LARGE_PAGE_MASK PMD_PAGE_MASK
+
#ifdef CONFIG_X86_PAE
#define __PHYSICAL_MASK_SHIFT 36
#define __VIRTUAL_MASK_SHIFT 32
Index: linux/include/asm-x86/page.h
===================================================================
--- linux.orig/include/asm-x86/page.h
+++ linux/include/asm-x86/page.h
@@ -13,8 +13,8 @@
#define PHYSICAL_PAGE_MASK (PAGE_MASK & __PHYSICAL_MASK)
#define PTE_MASK (_AT(long, PHYSICAL_PAGE_MASK))

-#define LARGE_PAGE_SIZE (_AC(1,UL) << PMD_SHIFT)
-#define LARGE_PAGE_MASK (~(LARGE_PAGE_SIZE-1))
+#define PMD_PAGE_SIZE (_AC(1, UL) << PMD_SHIFT)
+#define PMD_PAGE_MASK (~(PMD_PAGE_SIZE-1))

#define HPAGE_SHIFT PMD_SHIFT
#define HPAGE_SIZE (_AC(1,UL) << HPAGE_SHIFT)
Index: linux/arch/x86/mm/pageattr.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr.c
+++ linux/arch/x86/mm/pageattr.c
@@ -218,7 +218,7 @@ static int split_large_page(pte_t *kpte,
}

address = __pa(address);
- addr = address & LARGE_PAGE_MASK;
+ addr = address & PMD_PAGE_MASK;
pbase = (pte_t *)page_address(base);
#ifdef CONFIG_X86_32
paravirt_alloc_pt(&init_mm, page_to_pfn(base));

2008-01-29 05:08:40

by Andi Kleen

Subject: [PATCH] [4/9] Add pgtable accessor functions for GB pages


Signed-off-by: Andi Kleen <[email protected]>

---
include/asm-x86/pgtable_32.h | 2 ++
include/asm-x86/pgtable_64.h | 6 ++++++
2 files changed, 8 insertions(+)

Index: linux/include/asm-x86/pgtable_64.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_64.h
+++ linux/include/asm-x86/pgtable_64.h
@@ -199,6 +199,12 @@ static inline unsigned long pmd_bad(pmd_
#define pud_offset(pgd, address) ((pud_t *) pgd_page_vaddr(*(pgd)) + pud_index(address))
#define pud_present(pud) (pud_val(pud) & _PAGE_PRESENT)

+static inline int pud_large(pud_t pte)
+{
+ return (pud_val(pte) & (_PAGE_PSE|_PAGE_PRESENT)) ==
+ (_PAGE_PSE|_PAGE_PRESENT);
+}
+
/* PMD - Level 2 access */
#define pmd_page_vaddr(pmd) ((unsigned long) __va(pmd_val(pmd) & PTE_MASK))
#define pmd_page(pmd) (pfn_to_page(pmd_val(pmd) >> PAGE_SHIFT))
Index: linux/include/asm-x86/pgtable_32.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_32.h
+++ linux/include/asm-x86/pgtable_32.h
@@ -148,6 +148,8 @@ static inline void clone_pgd_range(pgd_t
*/
#define pgd_offset_k(address) pgd_offset(&init_mm, address)

+static inline int pud_large(pud_t pud) { return 0; }
+
/*
* the pmd page can be thought of an array like this: pmd_t[PTRS_PER_PMD]
*

2008-01-29 05:08:57

by Andi Kleen

Subject: [PATCH] [5/9] GBPAGES: Support gbpages in pagetable dump


Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/fault.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

Index: linux/arch/x86/mm/fault.c
===================================================================
--- linux.orig/arch/x86/mm/fault.c
+++ linux/arch/x86/mm/fault.c
@@ -240,7 +240,8 @@ void dump_pagetable(unsigned long addres
pud = pud_offset(pgd, address);
if (bad_address(pud)) goto bad;
printk("PUD %lx ", pud_val(*pud));
- if (!pud_present(*pud)) goto ret;
+ if (!pud_present(*pud) || pud_large(*pud))
+ goto ret;

pmd = pmd_offset(pud, address);
if (bad_address(pmd)) goto bad;

2008-01-29 05:09:29

by Andi Kleen

Subject: [PATCH] [6/9] GBPAGES: Add gbpages support to lookup_address


Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr.c | 5 +++++
include/asm-x86/pgtable.h | 1 +
2 files changed, 6 insertions(+)

Index: linux/arch/x86/mm/pageattr.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr.c
+++ linux/arch/x86/mm/pageattr.c
@@ -152,9 +152,14 @@ pte_t *lookup_address(unsigned long addr

if (pgd_none(*pgd))
return NULL;
+
+ *level = PG_LEVEL_1G;
pud = pud_offset(pgd, address);
if (pud_none(*pud))
return NULL;
+ if (pud_large(*pud))
+ return (pte_t *)pud;
+
pmd = pmd_offset(pud, address);
if (pmd_none(*pmd))
return NULL;
Index: linux/include/asm-x86/pgtable.h
===================================================================
--- linux.orig/include/asm-x86/pgtable.h
+++ linux/include/asm-x86/pgtable.h
@@ -242,6 +242,7 @@ enum {
PG_LEVEL_NONE,
PG_LEVEL_4K,
PG_LEVEL_2M,
+ PG_LEVEL_1G,
};

/*

2008-01-29 05:09:47

by Andi Kleen

Subject: [PATCH] [7/9] Add an option to disable direct mapping gbpages and a global variable


Signed-off-by: Andi Kleen <[email protected]>

---
Documentation/x86_64/boot-options.txt | 3 +++
arch/x86/mm/init_64.c | 12 ++++++++++++
include/asm-x86/pgtable_64.h | 2 ++
3 files changed, 17 insertions(+)

Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -53,6 +53,18 @@ static unsigned long dma_reserve __initd

DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);

+int direct_gbpages;
+
+static int __init parse_direct_gbpages(char *arg)
+{
+ if (!strcmp(arg, "off")) {
+ direct_gbpages = -1;
+ return 0;
+ }
+ return -1;
+}
+early_param("direct_gbpages", parse_direct_gbpages);
+
/*
* NOTE: pagetable_init alloc all the fixmap pagetables contiguous on the
* physical space so we can cache the place of the first one and move
Index: linux/include/asm-x86/pgtable_64.h
===================================================================
--- linux.orig/include/asm-x86/pgtable_64.h
+++ linux/include/asm-x86/pgtable_64.h
@@ -239,6 +239,8 @@ static inline int pud_large(pud_t pte)

#define update_mmu_cache(vma,address,pte) do { } while (0)

+extern int direct_gbpages;
+
/* Encode and de-code a swap entry */
#define __swp_type(x) (((x).val >> 1) & 0x3f)
#define __swp_offset(x) ((x).val >> 8)
Index: linux/Documentation/x86_64/boot-options.txt
===================================================================
--- linux.orig/Documentation/x86_64/boot-options.txt
+++ linux/Documentation/x86_64/boot-options.txt
@@ -307,3 +307,6 @@ Debugging
stuck (default)

Miscellaneous
+
+ direct_gbpages=off
+ Do not use GB pages for kernel direct mapping.

2008-01-29 05:10:03

by Andi Kleen

Subject: [PATCH] [8/9] GBPAGES: Implement gbpages support in change_page_attr()


Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

Index: linux/arch/x86/mm/pageattr.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr.c
+++ linux/arch/x86/mm/pageattr.c
@@ -203,6 +203,7 @@ static int split_large_page(pte_t *kpte,
pte_t *pbase, *tmp;
struct page *base;
int i, level;
+ unsigned long ps;

#ifdef CONFIG_DEBUG_PAGEALLOC
gfp_flags = GFP_ATOMIC;
@@ -224,12 +225,22 @@ static int split_large_page(pte_t *kpte,

address = __pa(address);
addr = address & PMD_PAGE_MASK;
+
+ ps = PAGE_SIZE;
+#ifdef CONFIG_X86_64
+ if (level == PG_LEVEL_1G) {
+ ps = PMD_PAGE_SIZE;
+ pgprot_val(ref_prot) |= _PAGE_PSE;
+ addr &= PUD_PAGE_MASK;
+ }
+#endif
+
pbase = (pte_t *)page_address(base);
#ifdef CONFIG_X86_32
paravirt_alloc_pt(&init_mm, page_to_pfn(base));
#endif

- for (i = 0; i < PTRS_PER_PTE; i++, addr += PAGE_SIZE)
+ for (i = 0; i < PTRS_PER_PTE; i++, addr += ps)
set_pte(&pbase[i], pfn_pte(addr >> PAGE_SHIFT, ref_prot));

/*

2008-01-29 05:10:33

by Andi Kleen

Subject: [PATCH] [9/9] GBPAGES: Do kernel direct mapping at boot using GB pages


This should decrease TLB pressure because the kernel will take
fewer TLB faults for its own data accesses.

This is only done for 64bit because i386 does not support GB pages.

This only applies to the data portion of the direct mapping; the
kernel text mapping stays with 2MB pages because the AMD Fam10h
microarchitecture does not support GB ITLBs and AMD recommends
against using GB mappings for code.

Can be disabled with direct_gbpages=off

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/init_64.c | 64 ++++++++++++++++++++++++++++++++++++++++++--------
1 file changed, 55 insertions(+), 9 deletions(-)

Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -279,13 +279,20 @@ __meminit void early_iounmap(void *addr,
__flush_tlb_all();
}

+static unsigned long direct_entry(unsigned long paddr)
+{
+ unsigned long entry;
+ entry = __PAGE_KERNEL_LARGE|paddr;
+ entry &= __supported_pte_mask;
+ return entry;
+}
+
static void __meminit
phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end)
{
int i = pmd_index(address);

for (; i < PTRS_PER_PMD; i++, address += PMD_SIZE) {
- unsigned long entry;
pmd_t *pmd = pmd_page + pmd_index(address);

if (address >= end) {
@@ -299,9 +306,7 @@ phys_pmd_init(pmd_t *pmd_page, unsigned
if (pmd_val(*pmd))
continue;

- entry = __PAGE_KERNEL_LARGE|_PAGE_GLOBAL|address;
- entry &= __supported_pte_mask;
- set_pmd(pmd, __pmd(entry));
+ set_pmd(pmd, __pmd(direct_entry(address)));
}
}

@@ -335,7 +340,13 @@ phys_pud_init(pud_t *pud_page, unsigned
}

if (pud_val(*pud)) {
- phys_pmd_update(pud, addr, end);
+ if (!pud_large(*pud))
+ phys_pmd_update(pud, addr, end);
+ continue;
+ }
+
+ if (direct_gbpages > 0) {
+ set_pud(pud, __pud(direct_entry(addr)));
continue;
}

@@ -356,9 +367,11 @@ static void __init find_early_table_spac
unsigned long puds, pmds, tables, start;

puds = (end + PUD_SIZE - 1) >> PUD_SHIFT;
- pmds = (end + PMD_SIZE - 1) >> PMD_SHIFT;
- tables = round_up(puds * sizeof(pud_t), PAGE_SIZE) +
- round_up(pmds * sizeof(pmd_t), PAGE_SIZE);
+ tables = round_up(puds * sizeof(pud_t), PAGE_SIZE);
+ if (!direct_gbpages) {
+ pmds = (end + PMD_SIZE - 1) >> PMD_SHIFT;
+ tables += round_up(pmds * sizeof(pmd_t), PAGE_SIZE);
+ }

/*
* RED-PEN putting page tables only on node 0 could
@@ -378,6 +391,20 @@ static void __init find_early_table_spac
(table_start << PAGE_SHIFT) + tables);
}

+static void init_gbpages(void)
+{
+#ifdef CONFIG_DEBUG_PAGEALLOC
+ /* debug pagealloc causes too much recursion with gbpages */
+ if (direct_gbpages == 0)
+ return;
+#endif
+ if (direct_gbpages >= 0 && cpu_has_gbpages) {
+ printk(KERN_INFO "Using GB pages for direct mapping\n");
+ direct_gbpages = 1;
+ } else
+ direct_gbpages = 0;
+}
+
/*
* Setup the direct mapping of the physical memory at PAGE_OFFSET.
* This runs before bootmem is initialized and gets pages directly from
@@ -396,8 +423,10 @@ void __init_refok init_memory_mapping(un
* memory mapped. Unfortunately this is done currently before the
* nodes are discovered.
*/
- if (!after_bootmem)
+ if (!after_bootmem) {
+ init_gbpages();
find_early_table_space(end);
+ }

start = (unsigned long)__va(start);
end = (unsigned long)__va(end);
@@ -444,6 +473,21 @@ void __init paging_init(void)
}
#endif

+static void split_gb_page(pud_t *pud, unsigned long paddr)
+{
+ int i;
+ pmd_t *pmd;
+ struct page *p = alloc_page(GFP_KERNEL);
+ if (!p)
+ return;
+
+ paddr &= PUD_PAGE_MASK;
+ pmd = page_address(p);
+ for (i = 0; i < PTRS_PER_PTE; i++, paddr += PMD_PAGE_SIZE)
+ pmd[i] = __pmd(direct_entry(paddr));
+ pud_populate(NULL, pud, pmd);
+}
+
/*
* Unmap a kernel mapping if it exists. This is useful to avoid
* prefetches from the CPU leading to inconsistent cache lines.
@@ -467,6 +511,8 @@ __clear_kernel_mapping(unsigned long add
continue;

pud = pud_offset(pgd, address);
+ if (pud_large(*pud))
+ split_gb_page(pud, __pa(address));
if (pud_none(*pud))
continue;

2008-01-31 15:49:09

by Thomas Gleixner

Subject: Re: [PATCH] [1/9] Handle kernel near memory hole in clear_kernel_mapping

On Tue, 29 Jan 2008, Andi Kleen wrote:
> This fixes a long-standing, obscure problem in the relocatable kernel. The
> AMD GART driver needs to unmap part of the GART in the kernel direct mapping to
> prevent cache corruption. With the relocatable kernel it is in theory possible
> that the separate kernel text mapping straddles that area too.
>
> Normally this should not happen because the GART aperture tends to be >= 2GB,
> and the kernel is usually not loaded that high, but it is possible in theory.

This smells fishy.

>
....
>
> +#define overlaps(as, ae, bs, be) ((ae) >= (bs) && (as) <= (be))

inline function please and a bit more intuitive arrangement of
arguments.
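
For illustration (not part of the original mail), an inline form with the
same argument order might look like this; the name and placement are
whatever one prefers:

static inline int overlaps(unsigned long as, unsigned long ae,
			   unsigned long bs, unsigned long be)
{
	/* true if the inclusive ranges [as, ae] and [bs, be] intersect */
	return ae >= bs && as <= be;
}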

> +void __init clear_kernel_mapping(unsigned long address, unsigned long size)
> +{
> + int sh = PMD_SHIFT;
> + unsigned long kernel = __pa(__START_KERNEL_map);
> +
> + /*
> + * Note that we cannot unmap the kernel itself because the unmapped
> + * holes here are always at least 2MB aligned.

That's not enforced. The unmap code just does not split pages.

> + * This just applies to the trailing areas of the 40MB kernel mapping.

How is it ensured that this only affects the end of the 40MB mapping?

> + */
> + if (overlaps(kernel >> sh, (kernel + KERNEL_TEXT_SIZE) >> sh,
> + __pa(address) >> sh, __pa(address + size) >> sh)) {

This checks:

(kernel_end + 1) >= gart_start && kernel_start <= gart_end

One off error: kernel + KERNEL_TEXT_SIZE
needs to be: kernel + KERNEL_TEXT_SIZE - 1

Also, there is no sanity check whether the area is inside the real kernel text.

> + printk(KERN_WARNING
> + "Kernel mapping at %lx within 2MB of memory hole\n",
> + kernel);


> + __clear_kernel_mapping(__START_KERNEL_map+__pa(address), size);

Doh! This is unmapping the wrong place. According to __phys_addr():

paddr = vaddr - __START_KERNEL_map + phys_base;

Transformation:

vaddr = paddr + __START_KERNEL_map - phys_base;
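
Plugging in the arguments used by the patch, the corrected call would
presumably have to be something like this (illustrative only, derived
from the transformation above):

	__clear_kernel_mapping(__pa(address) + __START_KERNEL_map - phys_base,
			       size);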

Sigh. The clear_kernel_mapping() functionality is just another broken
reimplementation of CPA.

We really should use CPA and have a function to mark entries not
present. Actually the DEBUG_PAGEALLOC code has one already. There are
sanity checks in the CPA code as well, which can be extended to cover
the GART unmap in a safe way.

Thanks,
tglx

2008-01-31 15:57:44

by Thomas Gleixner

Subject: Re: [PATCH] [3/9] GBPAGES: Split LARGE_PAGE_SIZE/MASK into PUD_PAGE_SIZE/PMD_PAGE_SIZE

On Tue, 29 Jan 2008, Andi Kleen wrote:
> Split the existing LARGE_PAGE_SIZE/MASK macro into two new macros
> PUD_PAGE_SIZE/MASK and PMD_PAGE_SIZE/MASK.

This is not splitting anything. It adds PUD_PAGE_xxx and renames
LARGE_PAGE_XXX to PMD_PAGE_XXX.

Please split the patch into one, which does the rename and one which
adds the new defines.

>
> +/* Eventually 32bit should be moved over to the new names too */
> +#define LARGE_PAGE_SIZE PMD_PAGE_SIZE
> +#define LARGE_PAGE_MASK PMD_PAGE_MASK

Grep shows that there is no remaining user of LARGE_PAGE_XXX, so this
is useless. Even if there were users, we'd better fix them right
away.

Thanks,
tglx

2008-01-31 16:02:48

by Thomas Gleixner

Subject: Re: [PATCH] [6/9] GBPAGES: Add gbpages support to lookup_address

On Tue, 29 Jan 2008, Andi Kleen wrote:

>
> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> arch/x86/mm/pageattr.c | 5 +++++
> include/asm-x86/pgtable.h | 1 +
> 2 files changed, 6 insertions(+)
>
> Index: linux/arch/x86/mm/pageattr.c
> ===================================================================
> --- linux.orig/arch/x86/mm/pageattr.c
> +++ linux/arch/x86/mm/pageattr.c
> @@ -152,9 +152,14 @@ pte_t *lookup_address(unsigned long addr
>
> if (pgd_none(*pgd))
> return NULL;
> +
> + *level = PG_LEVEL_1G;
> pud = pud_offset(pgd, address);
> if (pud_none(*pud))
> return NULL;

Please move the level update here, so it is consistent with PMD.
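
I.e. something along these lines (an illustrative rearrangement, not a
tested patch):

	pud = pud_offset(pgd, address);
	if (pud_none(*pud))
		return NULL;

	*level = PG_LEVEL_1G;
	if (pud_large(*pud))
		return (pte_t *)pud;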

Thanks,

tglx

2008-01-31 16:13:06

by Thomas Gleixner

Subject: Re: [PATCH] [7/9] Add an option to disable direct mapping gbpages and a global variable

On Tue, 29 Jan 2008, Andi Kleen wrote:
> +int direct_gbpages;
> +
> +static int __init parse_direct_gbpages(char *arg)
> +{
> + if (!strcmp(arg, "off")) {
> + direct_gbpages = -1;
> + return 0;
> + }
> + return -1;
> +}
> +early_param("direct_gbpages", parse_direct_gbpages);

Can we please use simple boolean logic ? This 0, -1, 1 magic is more
than confusing.

Thanks,
tglx

2008-01-31 16:18:17

by Thomas Gleixner

Subject: Re: [PATCH] [9/9] GBPAGES: Do kernel direct mapping at boot using GB pages

On Tue, 29 Jan 2008, Andi Kleen wrote:
>
> +static unsigned long direct_entry(unsigned long paddr)

Please use a more sensible function name. This one has no association
with the functionality at all.

> +{
> + unsigned long entry;

New line please

> + entry = __PAGE_KERNEL_LARGE|paddr;
> + entry &= __supported_pte_mask;
> + return entry;
> +}

> +static void init_gbpages(void)
> +{
> +#ifdef CONFIG_DEBUG_PAGEALLOC
> + /* debug pagealloc causes too much recursion with gbpages */
> + if (direct_gbpages == 0)
> + return;
> +#endif
> + if (direct_gbpages >= 0 && cpu_has_gbpages) {
> + printk(KERN_INFO "Using GB pages for direct mapping\n");
> + direct_gbpages = 1;
> + } else
> + direct_gbpages = 0;
> +}

Please use simple boolean logic. gbpages are either enabled or disabled.

> +static void split_gb_page(pud_t *pud, unsigned long paddr)
> +{
> + int i;
> + pmd_t *pmd;
> + struct page *p = alloc_page(GFP_KERNEL);
> + if (!p)
> + return;
> +
> + paddr &= PUD_PAGE_MASK;
> + pmd = page_address(p);
> + for (i = 0; i < PTRS_PER_PTE; i++, paddr += PMD_PAGE_SIZE)
> + pmd[i] = __pmd(direct_entry(paddr));
> + pud_populate(NULL, pud, pmd);
> +}
> +
> /*
> * Unmap a kernel mapping if it exists. This is useful to avoid
> * prefetches from the CPU leading to inconsistent cache lines.
> @@ -467,6 +511,8 @@ __clear_kernel_mapping(unsigned long add
> continue;
>
> pud = pud_offset(pgd, address);
> + if (pud_large(*pud))
> + split_gb_page(pud, __pa(address));
> if (pud_none(*pud))
> continue;

As I said before, this needs to use CPA and not implement another variant.

Thanks,
tglx

2008-01-31 16:22:31

by Andi Kleen

Subject: Re: [PATCH] [1/9] Handle kernel near memory hole in clear_kernel_mapping

> > +#define overlaps(as, ae, bs, be) ((ae) >= (bs) && (as) <= (be))
>
> inline function please and a bit more intuitive arrangement of
> arguments.

Which one do you prefer? (to be honest the current one is "intuitive"
to me)


>
> > +void __init clear_kernel_mapping(unsigned long address, unsigned long size)
> > +{
> > + int sh = PMD_SHIFT;
> > + unsigned long kernel = __pa(__START_KERNEL_map);
> > +
> > + /*
> > + * Note that we cannot unmap the kernel itself because the unmapped
> > + * holes here are always at least 2MB aligned.
>
> That's not enforced. The unmap code just does not split pages.

It is -- there are BUG_ONs for this in __clear_kernel_mapping

>
> > + * This just applies to the trailing areas of the 40MB kernel mapping.
>
> How is it ensured that this only affects the end of the 40MB mapping?

It is enforced in the callers (actually there is only a single caller --
the GART code) by never calling it on a range that overlaps the kernel
itself. Granted, that could be checked too, but it would probably be
overkill for an internal function.

>
> > + */
> > + if (overlaps(kernel >> sh, (kernel + KERNEL_TEXT_SIZE) >> sh,
> > + __pa(address) >> sh, __pa(address + size) >> sh)) {
>
> This checks:
>
> (kernel_end + 1) >= gart_start && kernel_start <= gart_end
>
> One off error: kernel + KERNEL_TEXT_SIZE
> needs to be: kernel + KERNEL_TEXT_SIZE - 1

Ok.

>
> Also, there is no sanity check whether the area is inside the real kernel text.

Hmm, I can add one, but if that happens the caller is probably seriously
confused and will likely cause other problems anyway.

I don't think it can happen for the GART code, which is currently
the only caller.

>
> > + printk(KERN_WARNING
> > + "Kernel mapping at %lx within 2MB of memory hole\n",
> > + kernel);
>
>
> > + __clear_kernel_mapping(__START_KERNEL_map+__pa(address), size);
>
> Doh! This is unmapping the wrong place. According to __phys_addr():
>
> paddr = vaddr - __START_KERNEL_map + phys_base;

Hmm, true -- that will only affect relocatable kernels, but for those
it's wrong. Given that the patch was supposed to fix a case
that only happens with relocatable kernels, that's quite ironic :)

Actually, thinking about it again, you can just drop it for now.
It is orthogonal to gbpages. I think I added it to the series
when I was planning to do the kernel mapping with GB pages, but that
turned out to be a bad idea anyway.

Thanks for the review,

-Andi

2008-01-31 16:24:20

by Andi Kleen

Subject: Re: [PATCH] [7/9] Add an option to disable direct mapping gbpages and a global variable

On Thu, Jan 31, 2008 at 05:12:35PM +0100, Thomas Gleixner wrote:
> On Tue, 29 Jan 2008, Andi Kleen wrote:
> > +int direct_gbpages;
> > +
> > +static int __init parse_direct_gbpages(char *arg)
> > +{
> > + if (!strcmp(arg, "off")) {
> > + direct_gbpages = -1;
> > + return 0;
> > + }
> > + return -1;
> > +}
> > +early_param("direct_gbpages", parse_direct_gbpages);
>
> Can we please use simple boolean logic ? This 0, -1, 1 magic is more
> than confusing.

I did it this way to allow forcing gbpages on later. This actually
makes sense together with debug pagealloc, which right now forces
them off.

The only bit missing for that is a check for direct_gbpages=on

-Andi

2008-01-31 16:38:22

by Andi Kleen

Subject: Re: [PATCH] [9/9] GBPAGES: Do kernel direct mapping at boot using GB pages

On Thu, Jan 31, 2008 at 05:17:41PM +0100, Thomas Gleixner wrote:
> On Tue, 29 Jan 2008, Andi Kleen wrote:
> >
> > +static unsigned long direct_entry(unsigned long paddr)
>
> Please use a more sensible function name. This one has no association
> with the functionality at all.

Can you suggest one? I honestly cannot think of a better one.
It's an "entry" in the "direct" mapping.

> > + if (direct_gbpages >= 0 && cpu_has_gbpages) {
> > + printk(KERN_INFO "Using GB pages for direct mapping\n");
> > + direct_gbpages = 1;
> > + } else
> > + direct_gbpages = 0;
> > +}
>
> Please use simple boolean logic. gbpages are either enabled or disabled.

It's a fairly standard, widely used idiom for command line options because
it allows both a default and forcing. E.g. there can be cases where the
kernel disables it, but then it makes sense for the command line option
to override that. There is already one case in the kernel that
disables it.

> > * prefetches from the CPU leading to inconsistent cache lines.
> > @@ -467,6 +511,8 @@ __clear_kernel_mapping(unsigned long add
> > continue;
> >
> > pud = pud_offset(pgd, address);
> > + if (pud_large(*pud))
> > + split_gb_page(pud, __pa(address));
> > if (pud_none(*pud))
> > continue;
>
> As I said before, this needs to use CPA and not implement another variant.

Ok, I can switch GART over to CPA, but that will make the patch
much more intrusive and a little riskier. Is that ok for you,
and will it still be considered a .25 candidate?

-Andi

2008-01-31 17:01:15

by Thomas Gleixner

Subject: Re: [PATCH] [7/9] Add an option to disable direct mapping gbpages and a global variable

On Thu, 31 Jan 2008, Andi Kleen wrote:

> On Thu, Jan 31, 2008 at 05:12:35PM +0100, Thomas Gleixner wrote:
> > On Tue, 29 Jan 2008, Andi Kleen wrote:
> > > +int direct_gbpages;
> > > +
> > > +static int __init parse_direct_gbpages(char *arg)
> > > +{
> > > + if (!strcmp(arg, "off")) {
> > > + direct_gbpages = -1;
> > > + return 0;
> > > + }
> > > + return -1;
> > > +}
> > > +early_param("direct_gbpages", parse_direct_gbpages);
> >
> > Can we please use simple boolean logic ? This 0, -1, 1 magic is more
> > than confusing.
>
> I did it this way to later allow forcing gbpages. This actually
> makes sense together with debug pagealloc which right now forces
> them to be off.
>
> The only bit missing for that is a check for direct_gbpages=on

Fair enough. But please use some sensible constants for
that. Otherwise it's just annoying to figure out what 0, -1, 1 and
later on 2 mean.
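
Something like this, for example (the names here are made up, the values
match what the patch currently uses):

	enum {
		GBPAGES_OFF	= -1,	/* disabled on the command line */
		GBPAGES_AUTO	= 0,	/* default, decided by init_gbpages() */
		GBPAGES_ON	= 1,	/* GB pages used for the direct mapping */
		/* a later GBPAGES_FORCE = 2 could override e.g. DEBUG_PAGEALLOC */
	};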

Thanks,
tglx

2008-01-31 17:10:38

by Thomas Gleixner

Subject: Re: [PATCH] [9/9] GBPAGES: Do kernel direct mapping at boot using GB pages

On Thu, 31 Jan 2008, Andi Kleen wrote:

> On Thu, Jan 31, 2008 at 05:17:41PM +0100, Thomas Gleixner wrote:
> > On Tue, 29 Jan 2008, Andi Kleen wrote:
> > >
> > > +static unsigned long direct_entry(unsigned long paddr)
> >
> > Please use a more sensible function name. This one has no association
> > with the functionality at all.
>
> Can you suggest one? I honestly cannot think of a better one.
> It's an "entry" in the "direct" mapping.

Doh, yes :) It just did not trigger.

> > > + if (direct_gbpages >= 0 && cpu_has_gbpages) {
> > > + printk(KERN_INFO "Using GB pages for direct mapping\n");
> > > + direct_gbpages = 1;
> > > + } else
> > > + direct_gbpages = 0;
> > > +}
> >
> > Please use simple boolean logic. gbpages are either enabled or disabled.
>
> It's a fairly standard, widely used idiom for command line options because
> it allows both a default and forcing. E.g. there can be cases where the
> kernel disables it, but then it makes sense for the command line option
> to override that. There is already one case in the kernel that
> disables it.

Right, but please use some constants. I was tripping over the

if (direct_gbpages > 0)

somewhere else in the code and it would have been simpler to see
something like:

if (direct_gbpages == GBPAGES_ENABLED)

> > > * prefetches from the CPU leading to inconsistent cache lines.
> > > @@ -467,6 +511,8 @@ __clear_kernel_mapping(unsigned long add
> > > continue;
> > >
> > > pud = pud_offset(pgd, address);
> > > + if (pud_large(*pud))
> > > + split_gb_page(pud, __pa(address));
> > > if (pud_none(*pud))
> > > continue;
> >
> > As I said before, this needs to use CPA and not implement another variant.
>
> Ok, I can switch GART over to CPA, but that will make the patch
> much more intrusive and a little riskier. Is that ok for you
> and will it be still considered .25 candidate?

We have the code in place already and we need to shake out any
remaining problems anyway. So using the CPA code would be preferred as
it adds another user. Can you give it a try please ?

Thanks,
tglx

2008-01-31 17:39:52

by Andi Kleen

Subject: Re: [PATCH] [9/9] GBPAGES: Do kernel direct mapping at boot using GB pages

On Thu, Jan 31, 2008 at 06:10:02PM +0100, Thomas Gleixner wrote:
> On Thu, 31 Jan 2008, Andi Kleen wrote:
>
> > On Thu, Jan 31, 2008 at 05:17:41PM +0100, Thomas Gleixner wrote:
> > > On Tue, 29 Jan 2008, Andi Kleen wrote:
> > > >
> > > > +static unsigned long direct_entry(unsigned long paddr)
> > >
> > > Please use a more sensible function name. This one has no association
> > > with the functionality at all.
> >
> > Can you suggest one? I honestly cannot think of a better one.
> > Its's a "entry" in the "direct" mapping.
>
> Doh, yes :) It just did not trigger.

I got rid of it now by switching over to the standard pte primitives
(with the usual casts). It's not prettier, but at least it's more
consistent.

I hope I didn't break the paravirt case though; we'll see.
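
For reference, at the pud level the standard-primitive form presumably
looks roughly like this (a sketch of the shape, not quoted from the
actual repost):

	set_pte((pte_t *)pud,
		pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_LARGE));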

-Andi