2008-02-11 09:35:30

by Andi Kleen

Subject: [PATCH] [0/8] Various kernel mapping bug fixes


Relative to git-x86#mm (05afd57b4438d4650d93e6f54f465b0cdd2d9ea8)

The first bug fix is the most important one -- it fixes a boot hang
on gbpages systems -- the rest is for random issues I found while looking
at the code. Several are related to EFI (see caveats in the descriptions);
EFI currently looks very broken without these patches, but the fixes
definitely need more testing.

-Andi


2008-02-11 09:34:46

by Andi Kleen

Subject: [PATCH] [2/8] CPA: Flush the caches when setting pages not present


e.g. the AMD64 pci-gart code sets pages not present to prevent potential
cache coherency problems. When doing this it is probably safer to flush the
caches too. So consider clearing of the present bit as a cache flush indicator.

Note that debug pagealloc also regularly marks pages not present, but it
won't trigger this because it calls directly into a lower level function.

I have not actually seen failures from this, but it seems safer
to do it this way.
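
As an illustration only (not part of the patch), this is roughly how the
flush decision would be consumed downstream, assuming the cpa_flush_all()/
cpa_flush_range() helpers in the pageattr.c of this era. Note that once
_PAGE_PRESENT is cleared the range can no longer be reached by clflush,
so a full WBINVD-based flush is the only safe choice:

	/* Sketch: choosing the flush primitive after cache_attr() */
	cache = cache_attr(mask_set, mask_clr);
	if (cache) {
		if (pgprot_val(mask_clr) & _PAGE_PRESENT)
			cpa_flush_all(1);	/* wbinvd; works without a mapping */
		else
			cpa_flush_range(addr, numpages, 1); /* clflush per line */
	}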

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr.c | 13 ++++++++++---
1 file changed, 10 insertions(+), 3 deletions(-)

Index: linux/arch/x86/mm/pageattr.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr.c
+++ linux/arch/x86/mm/pageattr.c
@@ -670,9 +670,16 @@ static int __change_page_attr_set_clr(st
return 0;
}

-static inline int cache_attr(pgprot_t attr)
+static inline int cache_attr(pgprot_t set, pgprot_t clr)
{
- return pgprot_val(attr) &
+ /*
+ * Clearing pages is usually done for cache coherency reasons
+ * (except for pagealloc debug, but that doesn't call this anyways)
+ * It's safer to flush the caches in this case too.
+ */
+ if (pgprot_val(clr) & _PAGE_PRESENT)
+ return 1;
+ return pgprot_val(set) &
(_PAGE_PAT | _PAGE_PAT_LARGE | _PAGE_PWT | _PAGE_PCD);
}

@@ -709,7 +716,7 @@ static int change_page_attr_set_clr(unsi
* No need to flush, when we did not set any of the caching
* attributes:
*/
- cache = cache_attr(mask_set);
+ cache = cache_attr(mask_set, mask_clr);

/*
* On success we use clflush, when the CPU supports it to

2008-02-11 09:35:01

by Andi Kleen

Subject: [PATCH] [1/8] CPA: Fix gbpages support in try_preserve_large_page


Use correct page sizes and masks for GB pages in try_preserve_large_page()

This prevents a boot hang on a GB capable system with CONFIG_DIRECT_GBPAGES
enabled.
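
For reference, the constants involved (my reading of the x86-64 page
table headers, not part of the patch):

	/*
	 * PMD_PAGE_SIZE = 2MB,  PMD_PAGE_MASK = ~(2MB - 1)
	 * PUD_PAGE_SIZE = 1GB,  PUD_PAGE_MASK = ~(1GB - 1)
	 *
	 * With the PMD values, the PG_LEVEL_1G case ran the preserve check
	 * over only a 2MB window of the 1GB page, with the boot hang as
	 * the result.
	 */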

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

Index: linux/arch/x86/mm/pageattr.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr.c
+++ linux/arch/x86/mm/pageattr.c
@@ -278,8 +278,8 @@ try_preserve_large_page(pte_t *kpte, uns
break;
#ifdef CONFIG_X86_64
case PG_LEVEL_1G:
- psize = PMD_PAGE_SIZE;
- pmask = PMD_PAGE_MASK;
+ psize = PUD_PAGE_SIZE;
+ pmask = PUD_PAGE_MASK;
break;
#endif
default:

2008-02-11 09:35:46

by Andi Kleen

Subject: [PATCH] [4/8] CPA: Fix set_memory_x for ioremap


EFI currently calls set_memory_x() on potentially ioremapped addresses.

This is problematic for several reasons:

- The cpa code internally calls __pa on it which does not work for remapped
addresses and will give some random result.
- cpa will try to change all potential aliases (like the kernel mapping
on x86-64), but that is not needed for NX because the caller does only needs its
specific virtual address executable. There is no requirement in the x86 architecture
for nx bits to be coherent between mapping aliases. Also with the
previous problem of __pa returning a wrong address it would likely try to change
some random other page if you're unlucky and the random result would
match the kernel text range.

There would be several possible ways to fix this:
- Simply don't set the NX bit in the original ioremap and drop
set_memory_x and add an ioremap_exec(). That would be my preferred solution,
but unfortunately it has been dismissed before.
- Drop all __pas and always use the physical address derived
from the looked up PTE. This would need some significant restructuring
and would only fix the first problem above, not the second.
- Special case NX clearing to not change any aliases. I chose this one
because it happens to fix both problems, so it is both a fix
and an optimization.

This implies that it's still not safe to call the other set_memory_* variants
(anything except set_memory_x) on ioremapped/vmalloced/module addresses.

I don't have an EFI system so this is untested.
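
To illustrate the first problem (a sketch with a made-up MMIO address,
not from the patch): __pa() only does linear arithmetic that is valid for
the direct and kernel mappings, so an ioremap cookie comes out as garbage:

	void __iomem *v = ioremap(0xfed00000, PAGE_SIZE); /* hypothetical MMIO */
	unsigned long phys = __pa((unsigned long)v);      /* NOT 0xfed00000 */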

Cc: [email protected]

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

Index: linux/arch/x86/mm/pageattr.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr.c
+++ linux/arch/x86/mm/pageattr.c
@@ -28,6 +28,7 @@ struct cpa_data {
pgprot_t mask_clr;
int numpages;
int flushtlb;
+ int addronly;
};

static inline int
@@ -602,7 +603,7 @@ static int change_page_attr_addr(struct
* fixup the low mapping first. __va() returns the virtual
* address in the linear mapping:
*/
- if (within(address, HIGH_MAP_START, HIGH_MAP_END))
+ if (within(address, HIGH_MAP_START, HIGH_MAP_END) && !cpa->addronly)
address = (unsigned long) __va(phys_addr);
#endif

@@ -615,7 +616,7 @@ static int change_page_attr_addr(struct
* If the physical address is inside the kernel map, we need
* to touch the high mapped kernel as well:
*/
- if (within(phys_addr, 0, KERNEL_TEXT_SIZE)) {
+ if (!cpa->addronly && within(phys_addr, 0, KERNEL_TEXT_SIZE)) {
/*
* Calc the high mapping address. See __phys_addr()
* for the non obvious details.
@@ -695,6 +696,8 @@ static int change_page_attr_set_clr(unsi
cpa.mask_set = mask_set;
cpa.mask_clr = mask_clr;
cpa.flushtlb = 0;
+ cpa.addronly = !pgprot_val(mask_set) &&
+ pgprot_val(mask_clr) == _PAGE_NX;

ret = __change_page_attr_set_clr(&cpa);

2008-02-11 09:36:00

by Andi Kleen

Subject: [PATCH] [3/8] CPA: Test the correct mapping alias on x86-64


static_protections previously would test against the x86-64 kernel mapping
twice. First against the unchanged symbol directly from the linker (which
always points into the kernel mapping) and then again it would manually
relocate the address into the kernel mapping and test again.

This patch changes the second test to instead check the direct mapping
(low) alias virtual addresses, which was probably intended in the first place.

Simply use __pa and __va for that.

This also allows dropping an #ifdef because the code works on both 32-bit and 64-bit.
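
A worked example may help (addresses illustrative, assuming the
2.6.24-era 64-bit layout):

	/*
	 * _text             = 0xffffffff80200000  (kernel mapping alias)
	 * __pa(_text)       = 0x0000000000200000  (physical address)
	 * __va(__pa(_text)) = 0xffff810000200000  (direct map (low) alias)
	 *
	 * The first within() test covers the kernel mapping alias; the
	 * second test now covers the direct map alias instead of testing
	 * the kernel mapping twice.
	 */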

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/pageattr.c | 22 +++++++---------------
1 file changed, 7 insertions(+), 15 deletions(-)

Index: linux/arch/x86/mm/pageattr.c
===================================================================
--- linux.orig/arch/x86/mm/pageattr.c
+++ linux/arch/x86/mm/pageattr.c
@@ -128,17 +128,9 @@ static void cpa_flush_range(unsigned lon
#define HIGH_MAP_START __START_KERNEL_map
#define HIGH_MAP_END (__START_KERNEL_map + KERNEL_TEXT_SIZE)

-
-/*
- * Converts a virtual address to a X86-64 highmap address
- */
-static unsigned long virt_to_highmap(void *address)
+static unsigned long virt_to_direct(void *address)
{
-#ifdef CONFIG_X86_64
- return __pa((unsigned long)address) + HIGH_MAP_START - phys_base;
-#else
- return (unsigned long)address;
-#endif
+ return (unsigned long)__va(__pa(address));
}

/*
@@ -165,9 +157,9 @@ static inline pgprot_t static_protection
if (within(address, (unsigned long)_text, (unsigned long)_etext))
pgprot_val(forbidden) |= _PAGE_NX;
/*
- * Do the same for the x86-64 high kernel mapping
+ * Do the same for the direct map alias of the x86-64 text mapping
*/
- if (within(address, virt_to_highmap(_text), virt_to_highmap(_etext)))
+ if (within(address, virt_to_direct(_text), virt_to_direct(_etext)))
pgprot_val(forbidden) |= _PAGE_NX;

/* The .rodata section needs to be read-only */
@@ -175,10 +167,10 @@ static inline pgprot_t static_protection
(unsigned long)__end_rodata))
pgprot_val(forbidden) |= _PAGE_RW;
/*
- * Do the same for the x86-64 high kernel mapping
+ * Do the same for the direct map alias of the x86-64 rodata mapping
*/
- if (within(address, virt_to_highmap(__start_rodata),
- virt_to_highmap(__end_rodata)))
+ if (within(address, virt_to_direct(__start_rodata),
+ virt_to_direct(__end_rodata)))
pgprot_val(forbidden) |= _PAGE_RW;

prot = __pgprot(pgprot_val(prot) & ~pgprot_val(forbidden));

2008-02-11 09:36:24

by Andi Kleen

Subject: [PATCH] [5/8] Fix logic error in 64bit memory hotadd


The memory hotadd code assumes that the pud always exists already, but
that might not be true. Allocate it if it isn't there.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/init_64.c | 13 +++++++++----
1 file changed, 9 insertions(+), 4 deletions(-)

Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -432,17 +432,22 @@ void __init_refok init_memory_mapping(un
unsigned long pud_phys;
pud_t *pud;

- if (after_bootmem)
+ if (after_bootmem) {
pud = pud_offset(pgd, start & PGDIR_MASK);
- else
+ if (!pud_present(*pud)) {
+ pud = (pud_t *)get_zeroed_page(GFP_ATOMIC);
+ if (!pud)
+ panic("Out of memory");
+ }
+ pud_phys = __pa(pud);
+ } else
pud = alloc_low_page(&pud_phys);

next = start + PGDIR_SIZE;
if (next > end)
next = end;
phys_pud_init(pud, __pa(start), __pa(next));
- if (!after_bootmem)
- set_pgd(pgd_offset_k(start), mk_kernel_pgd(pud_phys));
+ set_pgd(pgd_offset_k(start), mk_kernel_pgd(pud_phys));
unmap_low_page(pud);
}

2008-02-11 09:36:38

by Andi Kleen

Subject: [PATCH] [6/8] Account overlapped mappings in end_pfn_map


When end_pfn is not aligned to 2MB (or 1GB) then the kernel might
map more memory than end_pfn. Account this in end_pfn_mapped.

This is needed for other code that needs to know the true
mapping alias state (in this case EFI).

But it's also more correct in general.
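
A worked example of the misaccounting (numbers made up):

	/*
	 * Say end_pfn = 0x10001 (256MB + 4KB, not 2MB aligned).  The 2MB
	 * mapping that covers pfn 0x10000 spans pfns 0x10000..0x101ff, so
	 * the kernel really maps up to pfn 0x10200.  Code that compares
	 * against the mapped limit to choose between __va() and ioremap()
	 * (as EFI does) must see 0x10200, not 0x10001.
	 */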

Cc: [email protected]

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/kernel/setup_64.c | 4 +++-
arch/x86/mm/init_64.c | 33 +++++++++++++++++++++++----------
include/asm-x86/proto.h | 2 +-
3 files changed, 27 insertions(+), 12 deletions(-)

Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -287,7 +287,7 @@ __meminit void early_iounmap(void *addr,
__flush_tlb_all();
}

-static void __meminit
+static unsigned long __meminit
phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end)
{
int i = pmd_index(address);
@@ -309,21 +309,25 @@ phys_pmd_init(pmd_t *pmd_page, unsigned
set_pte((pte_t *)pmd,
pfn_pte(address >> PAGE_SHIFT, PAGE_KERNEL_LARGE));
}
+ return address;
}

-static void __meminit
+static unsigned long __meminit
phys_pmd_update(pud_t *pud, unsigned long address, unsigned long end)
{
+ unsigned long r;
pmd_t *pmd = pmd_offset(pud, 0);
spin_lock(&init_mm.page_table_lock);
- phys_pmd_init(pmd, address, end);
+ r = phys_pmd_init(pmd, address, end);
spin_unlock(&init_mm.page_table_lock);
__flush_tlb_all();
+ return r;
}

-static void __meminit
+static unsigned long __meminit
phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end)
{
+ unsigned long true_end = end;
int i = pud_index(addr);

for (; i < PTRS_PER_PUD; i++, addr = (addr & PUD_MASK) + PUD_SIZE) {
@@ -342,13 +346,14 @@ phys_pud_init(pud_t *pud_page, unsigned

if (pud_val(*pud)) {
if (!pud_large(*pud))
- phys_pmd_update(pud, addr, end);
+ true_end = phys_pmd_update(pud, addr, end);
continue;
}

if (direct_gbpages) {
set_pte((pte_t *)pud,
pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_LARGE));
+ true_end = (addr & PUD_MASK) + PUD_SIZE;
continue;
}

@@ -356,12 +361,14 @@ phys_pud_init(pud_t *pud_page, unsigned

spin_lock(&init_mm.page_table_lock);
set_pud(pud, __pud(pmd_phys | _KERNPG_TABLE));
- phys_pmd_init(pmd, addr, end);
+ true_end = phys_pmd_init(pmd, addr, end);
spin_unlock(&init_mm.page_table_lock);

unmap_low_page(pmd);
}
__flush_tlb_all();
+
+ return true_end >> PAGE_SHIFT;
}

static void __init find_early_table_space(unsigned long end)
@@ -406,9 +413,10 @@ static void __init init_gbpages(void)
* This runs before bootmem is initialized and gets pages directly from
* the physical memory. To access them they are temporarily mapped.
*/
-void __init_refok init_memory_mapping(unsigned long start, unsigned long end)
+unsigned long __init_refok
+init_memory_mapping(unsigned long start, unsigned long end)
{
- unsigned long next;
+ unsigned long next, true_end = end;

pr_debug("init_memory_mapping\n");

@@ -446,7 +454,7 @@ void __init_refok init_memory_mapping(un
next = start + PGDIR_SIZE;
if (next > end)
next = end;
- phys_pud_init(pud, __pa(start), __pa(next));
+ true_end = phys_pud_init(pud, __pa(start), __pa(next));
set_pgd(pgd_offset_k(start), mk_kernel_pgd(pud_phys));
unmap_low_page(pud);
}
@@ -458,6 +466,8 @@ void __init_refok init_memory_mapping(un
if (!after_bootmem)
reserve_early(table_start << PAGE_SHIFT,
table_end << PAGE_SHIFT, "PGTABLE");
+
+ return true_end;
}

#ifndef CONFIG_NUMA
@@ -499,9 +509,12 @@ int arch_add_memory(int nid, u64 start,
struct zone *zone = pgdat->node_zones + ZONE_NORMAL;
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
+ unsigned long r;
int ret;

- init_memory_mapping(start, start + size-1);
+ r = init_memory_mapping(start, start + size-1);
+ if (r > end_pfn_map)
+ end_pfn_map = r;

ret = __add_pages(zone, start_pfn, nr_pages);
WARN_ON(1);
Index: linux/include/asm-x86/proto.h
===================================================================
--- linux.orig/include/asm-x86/proto.h
+++ linux/include/asm-x86/proto.h
@@ -7,7 +7,8 @@

extern void early_idt_handler(void);

-extern void init_memory_mapping(unsigned long start, unsigned long end);
+extern unsigned long init_memory_mapping(unsigned long start,
+ unsigned long end);

extern void system_call(void);
extern void syscall_init(void);
Index: linux/arch/x86/kernel/setup_64.c
===================================================================
--- linux.orig/arch/x86/kernel/setup_64.c
+++ linux/arch/x86/kernel/setup_64.c
@@ -341,7 +341,7 @@ void __init setup_arch(char **cmdline_p)

check_efer();

- init_memory_mapping(0, (end_pfn_map << PAGE_SHIFT));
+ end_pfn_map = init_memory_mapping(0, (end_pfn_map << PAGE_SHIFT));
if (efi_enabled)
efi_init();

2008-02-11 09:36:54

by Andi Kleen

Subject: [PATCH] [7/8] Implement true end_pfn_mapped for 32bit


Even on 32bit 2MB pages can map more memory than is in the true
max_low_pfn if end_pfn is not highmem and not aligned to 2MB.
Add an end_pfn_map similar to x86-64 that accounts for this
fact. This is important for code that really needs to know about
all mapping aliases. Needed for followup patches (in this case EFI).

Cc: [email protected]

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/mm/init_32.c | 5 +++++
include/asm-x86/page.h | 4 +++-
include/asm-x86/page_64.h | 1 -
3 files changed, 8 insertions(+), 2 deletions(-)

Index: linux/arch/x86/mm/init_32.c
===================================================================
--- linux.orig/arch/x86/mm/init_32.c
+++ linux/arch/x86/mm/init_32.c
@@ -50,6 +50,8 @@

unsigned int __VMALLOC_RESERVE = 128 << 20;

+unsigned long end_pfn_map;
+
DEFINE_PER_CPU(struct mmu_gather, mmu_gathers);
unsigned long highstart_pfn, highend_pfn;

@@ -193,6 +195,7 @@ static void __init kernel_physical_mappi
set_pmd(pmd, pfn_pmd(pfn, prot));

pfn += PTRS_PER_PTE;
+ end_pfn_map = pfn;
continue;
}
pte = one_page_table_init(pmd);
@@ -207,6 +210,7 @@ static void __init kernel_physical_mappi

set_pte(pte, pfn_pte(pfn, prot));
}
+ end_pfn_map = pfn;
}
}
}
Index: linux/include/asm-x86/page.h
===================================================================
--- linux.orig/include/asm-x86/page.h
+++ linux/include/asm-x86/page.h
@@ -36,7 +36,7 @@
#define max_pfn_mapped end_pfn_map
#else
#include <asm/page_32.h>
-#define max_pfn_mapped max_low_pfn
+#define max_pfn_mapped end_pfn_map
#endif /* CONFIG_X86_64 */

#define PAGE_OFFSET ((unsigned long)__PAGE_OFFSET)
@@ -51,6 +51,8 @@
extern int page_is_ram(unsigned long pagenr);
extern int devmem_is_allowed(unsigned long pagenr);

+extern unsigned long end_pfn_map;
+
struct page;

static void inline clear_user_page(void *page, unsigned long vaddr,
Index: linux/include/asm-x86/page_64.h
===================================================================
--- linux.orig/include/asm-x86/page_64.h
+++ linux/include/asm-x86/page_64.h
@@ -55,7 +55,6 @@ void clear_page(void *page);
void copy_page(void *to, void *from);

extern unsigned long end_pfn;
-extern unsigned long end_pfn_map;
extern unsigned long phys_base;

extern unsigned long __phys_addr(unsigned long);

2008-02-11 09:37:16

by Andi Kleen

Subject: [PATCH] [8/8] RFC: Fix some EFI problems


From code review, the EFI memory map handling has a couple of problems:

- The test for _WB memory was reversed so it would set cacheable memory
to uncached
- It would always set a wrong uninitialized zero address to uncached
(so I suspect it always set the first few pages in phys memory to uncached,
that is why it may have gone unnoticed)
- It would call set_memory_x() on a fixmap address, which it doesn't
handle correctly.
- Some other problems I commented in the code (but was unable to solve
for now)

I changed the ioremaps to set the correct caching attributes
and also corrected the ordering so it looks roughly correct now.

This is an RFC, because I don't have an EFI system to test.

Cc: [email protected]

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/kernel/efi.c | 14 ++++++++------
arch/x86/kernel/efi_64.c | 6 ++++--
include/asm-x86/efi.h | 5 +++--
3 files changed, 15 insertions(+), 10 deletions(-)

Index: linux/arch/x86/kernel/efi.c
===================================================================
--- linux.orig/arch/x86/kernel/efi.c
+++ linux/arch/x86/kernel/efi.c
@@ -423,13 +423,15 @@ void __init efi_enter_virtual_mode(void)
size = md->num_pages << EFI_PAGE_SHIFT;
end = md->phys_addr + size;

- if ((end >> PAGE_SHIFT) <= max_pfn_mapped)
+ /* RED-PEN does not handle overlapped areas */
+ if ((end >> PAGE_SHIFT) <= max_pfn_mapped) {
va = __va(md->phys_addr);
- else
- va = efi_ioremap(md->phys_addr, size);
-
- if (md->attribute & EFI_MEMORY_WB)
- set_memory_uc(md->virt_addr, size);
+ /* RED-PEN spec and ia64 have a lot more flags */
+ if (!(md->attribute & EFI_MEMORY_WB))
+ set_memory_uc(md->virt_addr, size);
+ } else
+ va = efi_ioremap(md->phys_addr, size,
+ !!(md->attribute & EFI_MEMORY_WB));

md->virt_addr = (u64) (unsigned long) va;

Index: linux/arch/x86/kernel/efi_64.c
===================================================================
--- linux.orig/arch/x86/kernel/efi_64.c
+++ linux/arch/x86/kernel/efi_64.c
@@ -109,7 +109,8 @@ void __init efi_reserve_bootmem(void)
memmap.nr_map * memmap.desc_size);
}

-void __iomem * __init efi_ioremap(unsigned long phys_addr, unsigned long size)
+void __iomem * __init efi_ioremap(unsigned long phys_addr, unsigned long size,
+ int cache)
{
static unsigned pages_mapped;
unsigned i, pages;
@@ -124,7 +125,8 @@ void __iomem * __init efi_ioremap(unsign

for (i = 0; i < pages; i++) {
__set_fixmap(FIX_EFI_IO_MAP_FIRST_PAGE - pages_mapped,
- phys_addr, PAGE_KERNEL);
+ phys_addr,
+ cache ? PAGE_KERNEL : PAGE_KERNEL_NOCACHE);
phys_addr += PAGE_SIZE;
pages_mapped++;
}
Index: linux/include/asm-x86/efi.h
===================================================================
--- linux.orig/include/asm-x86/efi.h
+++ linux/include/asm-x86/efi.h
@@ -33,7 +33,8 @@ extern unsigned long asmlinkage efi_call
#define efi_call_virt6(f, a1, a2, a3, a4, a5, a6) \
efi_call_virt(f, a1, a2, a3, a4, a5, a6)

-#define efi_ioremap(addr, size) ioremap_cache(addr, size)
+#define efi_ioremap(addr, size, cache) \
+ (cache ? ioremap_cache(addr, size) : ioremap_nocache(addr, size))

#else /* !CONFIG_X86_32 */

@@ -86,7 +87,7 @@ extern u64 efi_call6(void *fp, u64 arg1,
efi_call6((void *)(efi.systab->runtime->f), (u64)(a1), (u64)(a2), \
(u64)(a3), (u64)(a4), (u64)(a5), (u64)(a6))

-extern void *efi_ioremap(unsigned long addr, unsigned long size);
+extern void *efi_ioremap(unsigned long addr, unsigned long size, int cache);

#endif /* CONFIG_X86_32 */

2008-02-11 09:45:51

by Thomas Gleixner

Subject: Re: [PATCH] [1/8] CPA: Fix gbpages support in try_preserve_large_page

On Mon, 11 Feb 2008, Andi Kleen wrote:

>
> Use correct page sizes and masks for GB pages in try_preserve_large_page()
>
> This prevents a boot hang on a GB capable system with CONFIG_DIRECT_GBPAGES
> enabled.

Doh, yes. Applied.

Thanks,
tglx

> Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> arch/x86/mm/pageattr.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> Index: linux/arch/x86/mm/pageattr.c
> ===================================================================
> --- linux.orig/arch/x86/mm/pageattr.c
> +++ linux/arch/x86/mm/pageattr.c
> @@ -278,8 +278,8 @@ try_preserve_large_page(pte_t *kpte, uns
> break;
> #ifdef CONFIG_X86_64
> case PG_LEVEL_1G:
> - psize = PMD_PAGE_SIZE;
> - pmask = PMD_PAGE_MASK;
> + psize = PUD_PAGE_SIZE;
> + pmask = PUD_PAGE_MASK;
> break;
> #endif
> default:
>

2008-02-11 10:12:33

by Ingo Molnar

Subject: Re: [PATCH] [1/8] CPA: Fix gbpages support in try_preserve_large_page


* Thomas Gleixner <[email protected]> wrote:

> On Mon, 11 Feb 2008, Andi Kleen wrote:
>
> > Use correct page sizes and masks for GB pages in
> > try_preserve_large_page()
> >
> > This prevents a boot hang on a GB capable system with
> > CONFIG_DIRECT_GBPAGES enabled.
>
> Doh, yes. Applied.

btw., v2.6.25-rc1 is not affected by this. Gbpages support is a "still
cooking" feature only available in x86.git#mm, destined for .26, and
even there it's default disabled and only available on Barcelona class
hardware. (Some of these bits are upstream already to make maintenance
easier, but they are inactive code.)

Ingo

2008-02-11 11:00:52

by Ingo Molnar

Subject: Re: [PATCH] [2/8] CPA: Flush the caches when setting pages not present


* Andi Kleen <[email protected]> wrote:

> e.g. the AMD64 pci-gart code sets pages not present to prevent
> potential cache coherency problems. When doing this it is probably
> safer to flush the caches too. So consider clearing of the present bit
> as a cache flush indicator.

uhm, but why? "Probably safer" is not the right kind of technical
argument ;-)

The PCI-GART quirk which we use on some systems unmaps pages _not_
because of coherency problems (any cache coherency problems would be
triggerable before we unmap them anyway), but due to _page aliasing_
problems: these pages might be in our general e820 map so we must
disable them.

( in any case, the CPA code does not deal with cache attribute coherency
- that is the topic of the upcoming PAT feature. We still largely rely on
MTRRs to get cache coherency right. )

furthermore, your patch does not even do what you claim it does, because
it is an elaborate NOP on most modern CPUs:

> +static inline int cache_attr(pgprot_t set, pgprot_t clr)
> {
> - return pgprot_val(attr) &
> + /*
> + * Clearing pages is usually done for cache coherency reasons
> + * (except for pagealloc debug, but that doesn't call this anyways)
> + * It's safer to flush the caches in this case too.
> + */
> + if (pgprot_val(clr) & _PAGE_PRESENT)
> + return 1;

as clflush does not work on not present pages...

Ingo

2008-02-11 11:01:50

by Andi Kleen

Subject: Re: [PATCH] [1/8] CPA: Fix gbpages support in try_preserve_large_page

Ingo Molnar <[email protected]> writes:
>
> btw., v2.6.25-rc1 is not affected by this. Gbpages support is a "still
> cooking" feature only available in x86.git#mm, destined for .26, and
> even there it's default disabled and only available on Barcelona class
> hardware. (Some of these bits are upstream already to make maintenance
> easier, but they are inactive code.)

This means it will be tested for ~6 months. Don't you think that's a little
excessive for such a simple patch?

% git show 6dc6ffff706db6c70a98253e7dc7011e4343df6c | diffstat
init_64.c | 29 ++++++++++++++++++++++++-----
1 files changed, 24 insertions(+), 5 deletions(-)

One option would also be to merge it into .25 and mark the
CONFIG option EXPERIMENTAL.

-Andi

2008-02-11 11:49:36

by Ingo Molnar

Subject: Re: [PATCH] [3/8] CPA: Test the correct mapping alias on x86-64


* Andi Kleen <[email protected]> wrote:

> static_protections previously would test against the x86-64 kernel
> mapping twice. First against the unchanged symbol directly from the
> linker (which always points into the kernel mapping) and then again it
> would manually relocate the address into the kernel mapping and test
> again.
>
> This patch changes the second test to instead check the direct
> mapping (low) alias virtual addresses, which was probably intended in
> the first place.
>
> Simply use __pa and __va for that.

thanks, applied.

( the practical implications of this are low because we do not utilize
the low direct aliases for execution. It needs to be fixed
nevertheless (will be needed for PAT later on anyway) and your cleanup
and #ifdef reduction is nice to have as well. )

Ingo

2008-02-11 12:28:18

by Andi Kleen

Subject: Re: [PATCH] [2/8] CPA: Flush the caches when setting pages not present

On Monday 11 February 2008 12:00:25 Ingo Molnar wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > e.g. the AMD64 pci-gart code sets pages not present to prevent
> > potential cache coherency problems. When doing this it is probably
> > safer to flush the caches too. So consider clearing of the present bit
> > as a cache flush indicator.
>
> uhm, but why? "Probably safer" is not the right kind of technical
> argument ;-)

It is safer in my opinion, but I have not seen a concrete bug triggered by
it so I qualified it.

Also it is quite common to qualify technical arguments this way BTW.

> The PCI-GART quirk which we use on some systems unmaps pages _not_
> because of coherency problems (any cache coherency problems would be
> triggerable before we unmap them anyway),

The GART is unmapped to avoid a cache coherency problem. AGP
normally requires cache flushes and changing to uncached to map
pages into the GART. Otherwise you could have cache incoherence
between the original page and the remapped page.
Since it would be very expensive to call cpa on each pci_map_*
to turn the pages uncached, the IOMMU part of the GART is unmapped instead
so that the CPU only ever sees the original memory, never the remapped
part.
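
A sketch of the trade-off described here (illustrative, not the actual
pci-gart code):

	/*
	 * Alternative (a): keep the CPU mapping of the aperture and fix up
	 * attributes on every DMA mapping -- a cpa call plus cache flush
	 * each time, far too expensive for pci_map_single():
	 *
	 *	set_memory_uc((unsigned long)page_address(page), npages);
	 *	... DMA ...
	 *	set_memory_wb((unsigned long)page_address(page), npages);
	 *
	 * Alternative (b), what pci-gart does: clear the CPU mapping of the
	 * aperture once at init time, so the CPU can never see the remapped
	 * alias at all and no per-mapping attribute work is needed.
	 */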

> but due to _page aliasing_
> problems: these pages might be in our general e820 map so we must
> disable them.
>
> ( in any case, the CPA code does not deal with cache attribute coherency
> - that is the topic of the upcoming PAT feature. We still largely rely on
> MTRRs to get cache coherency right. )

If it didn't deal with cache coherency, surely it wouldn't need to flush
caches ...

I started pageattr.c originally to deal with a concrete cache coherency
bug (AGP cache corruption between the GART and the original page on Athlon K7).
This is so that when a page is remapped in the GART it is changed in the
direct mapping to uncached. Without that there was a reproducible cache corruption
on some systems which was very hard to track down and debug.

Even with all your changes it still does that.

All the other uses like DEBUG_RODATA or DEBUG_PAGEALLOC only came later.


> furthermore, your patch does not even do what you claim it does, because
> it is an elaborate NOP on most modern CPUs:
>
> > +static inline int cache_attr(pgprot_t set, pgprot_t clr)
> > {
> > - return pgprot_val(attr) &
> > + /*
> > + * Clearing pages is usually done for cache coherency reasons
> > + * (except for pagealloc debug, but that doesn't call this anyways)
> > + * It's safer to flush the caches in this case too.
> > + */
> > + if (pgprot_val(clr) & _PAGE_PRESENT)
> > + return 1;
>
> as clflush does not work on not present pages...

Hmm, that's true. It needs to force WBINVD for this case.
I'll revise. Thanks.

-Andi

2008-02-11 12:27:41

by Ingo Molnar

Subject: Re: [PATCH] [4/8] CPA: Fix set_memory_x for ioremap


* Andi Kleen <[email protected]> wrote:

> EFI currently calls set_memory_x() on potentially ioremapped
> addresses.
>
> This is problematic for several reasons:
>
> - The cpa code internally calls __pa on it which does not work for
> remapped addresses and will give some random result.

Wrong. We do call __pa() on vmalloc ranges (which is a known
uncleanliness that we intend to fix), but contrary to your claim the
result is not a "random result". On 64-bit it's guaranteed to have a value
above ~66 TB and hence fails all the filters later on, so it
has zero practical relevance at the moment. On 32-bit we transform it
down to somewhere around 1GB - where we check it against the BIOS range
filters - which again cannot trigger. But I do agree that it's unclean
and needs fixing up.

Detailed analysis on 64-bit: we call __pa() here:

static int change_page_attr_addr(struct cpa_data *cpa)
...
unsigned long phys_addr = __pa(address);

which for vmalloc area virtual addresses will indeed yield some really
high (and invalid) physical address. That address will never trigger
this check:

if (within(address, HIGH_MAP_START, HIGH_MAP_END))
address = (unsigned long) __va(phys_addr);

or this check:

if (within(phys_addr, 0, KERNEL_TEXT_SIZE)) {

so we'll never actually _use_ that phys_addr.
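
The arithmetic behind the ~66 TB figure, assuming the 2.6.24-era 64-bit
layout:

	/*
	 * PAGE_OFFSET   = 0xffff810000000000  (base of the direct mapping)
	 * VMALLOC_START = 0xffffc20000000000
	 *
	 * __pa(v) for a vmalloc address is essentially v - PAGE_OFFSET:
	 *   0xffffc20000000000 - 0xffff810000000000 = 0x410000000000
	 * which is 65 * 2^40, i.e. ~65 TB, above any real physical address.
	 */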

> - cpa will try to change all potential aliases (like the kernel
> mapping on x86-64), but that is not needed for NX because the caller
> only needs its specific virtual address executable. There is no
> requirement in the x86 architecture for NX bits to be coherent between
> mapping aliases. Also, with the previous problem of __pa returning a
> wrong address it would likely try to change some random other page if
> you're unlucky and the random result would match the kernel text
> range.

wrong. That "random other page" is guaranteed to be above 66 TB
physical.

Anyway, i agree that it's ugly and unintuitive and it's on our clean-up
list. But your patch is not a good cleanup because it just hides the
underlying weakness.

Ingo

2008-02-11 12:45:36

by Andi Kleen

Subject: Re: [PATCH] [4/8] CPA: Fix set_memory_x for ioremap


> Wrong. We do call __pa() on vmalloc ranges (which is a known
> uncleanliness that we intend to fix),

AFAIK nobody actually does currently. Although I expect sooner
or later someone will try, since __ioremap() lost its pgprot argument
that made it so powerful. Best would probably be to stick
in some BUG()s just to catch that.

> but contrary to your claim the
> result is not "random result". On 64-bit it's guaranteed to have a value
> above ~66 TB on 64-bit and hence fails all the filters later on so it
> has zero practical relevance at the moment.

Note that 64bit EFI passes in a fixmap address (they just call
it efi_ioremap). Fixmaps are in the kernel mapping, which __pa() handles,
so this gives a low number, likely somewhere in memory,
and might well trigger.
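
A sketch of the arithmetic (simplified from the 64-bit __phys_addr() of
that time):

	/*
	 * Simplified __phys_addr():
	 *
	 *	if (x >= __START_KERNEL_map)	(kernel mapping, incl. fixmap)
	 *		return x - __START_KERNEL_map + phys_base;
	 *	return x - PAGE_OFFSET;		(direct mapping)
	 *
	 * A fixmap address lies above __START_KERNEL_map
	 * (0xffffffff80000000), so unlike a vmalloc address it takes the
	 * first branch and yields a plausible low "physical" address that
	 * can pass the later within() filters.
	 */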

> On 32-bit we transform it
> down to somewhere around 1GB - where we check it against the BIOS range
> filters - which again cannot trigger. But I do agree that it's unclean
> and needs fixing up.

Are you sure about this for all possible __PAGE_OFFSET values? e.g. consider
a 1:3 split. Also there are always relocated kernels, which might be loaded
quite high.

>
> static int change_page_attr_addr(struct cpa_data *cpa)
> ...
> unsigned long phys_addr = __pa(address);
>
> which for vmalloc area virtual addresses will indeed yield some really
> high (and invalid) physical address. That address will never trigger
> this check:
>
> if (within(address, HIGH_MAP_START, HIGH_MAP_END))
> address = (unsigned long) __va(phys_addr);

That doesn't check phys_addr at all?

> or this check:
>
> if (within(phys_addr, 0, KERNEL_TEXT_SIZE)) {
>
> so we'll never actuall _use_ that phys_addr.



> and it's on our clean-up
> list. But your patch is not a good cleanup because it just hides the
> underlying weakness.

I never claimed it was a cleanup. It's a fix and an optimization
(don't do unnecessary coherency between the direct mapping and other
mappings when clearing X -- this means some innocent pages in the
direct mapping won't get split).

Anyway, even if you don't want to fix this clear bug, I would ask you
to still consider the optimization independently.

-Andi

2008-02-11 12:46:51

by Ingo Molnar

Subject: Re: [PATCH] [5/8] Fix logic error in 64bit memory hotadd


* Andi Kleen <[email protected]> wrote:

> The memory hotadd code assumes that the pud always exists already, but
> that might not be true. Allocate it if it isn't there.

ok, this seems like an ancient memory-hotplug bug. Does anyone even
use memory hotplug currently? Did you find this bug via review, or did
it trigger in practice?

Also, your fix, while it solves a real bug we want to fix, is not quite
right for upstream integration yet. I can see 3 immediate problems with
it:

> + if (!pud_present(*pud)) {
> + pud = (pud_t *)get_zeroed_page(GFP_ATOMIC);

the GFP_ATOMIC here can fail.

The proper solution is to instead extend init_memory_mapping() with a
gfp_t parameter and pass in GFP_ATOMIC from the early init code (where
we must not schedule and where GFP_ATOMIC will succeed anyway), but do a
GFP_KERNEL from arch_add_memory().

> + if (!pud)
> + panic("Out of memory");

the panic() here is very rude to the user in the hotplug usecase.

The proper solution is to extend init_memory_mapping() with a return
value, and to check in the caller. arch_add_memory() obviously does not
want to panic(), it wants to return -ENOMEM to mm/memory_hotplug.c.
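
A hypothetical sketch of the suggested shape (combining both points, not
an actual patch):

	/*
	 * int init_memory_mapping(unsigned long start, unsigned long end,
	 *			   gfp_t gfp);
	 *
	 * early boot:
	 *	init_memory_mapping(0, end, GFP_ATOMIC);
	 *	(a failure here may still panic; there is nothing to
	 *	return to)
	 *
	 * arch_add_memory():
	 *	err = init_memory_mapping(start, start + size - 1,
	 *				  GFP_KERNEL);
	 *	if (err)
	 *		return err;	(propagates -ENOMEM)
	 */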

and a small style nit while changing this code:

> + } else
> pud = alloc_low_page(&pud_phys);

please add curly braces to the 'else' branch too. (we generally prefer
symmetrical curly braces) Thanks,

Ingo

2008-02-11 13:05:23

by Andi Kleen

Subject: Re: [PATCH] [5/8] Fix logic error in 64bit memory hotadd

On Monday 11 February 2008 13:46:25 Ingo Molnar wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > The memory hotadd code assumes that the pud always exists already, but
> > that might not be true. Allocate it if it isn't there.
>
> ok, this seems like an ancient memory-hotplug bug.

Yes.

> Does anyone even
> use memory hotplug currently?

I don't know.

> Did you find this bug via review, or did
> it trigger in practice?

Review.

>
> Also, your fix, while it solves a real bug we want to fix, is not quite
> right for upstream integration yet. I can see 3 immediate problems with
> it:
>
> > + if (!pud_present(*pud)) {
> > + pud = (pud_t *)get_zeroed_page(GFP_ATOMIC);
>
> the GFP_ATOMIC here can fail.

The memory hotplug code already uses GFP_ATOMIC elsewhere (spp_getpage)

> The proper solution is to instead extend init_memory_mapping() with a
> gfp_t parameter and pass in GFP_ATOMIC from the early init code (where
> we must not schedule and where GFP_ATOMIC will succeed anyway), but do a
> GFP_KERNEL from arch_add_memory().

The existing code already does GFP_ATOMIC. I admit I haven't double
checked why it does that (didn't read the complete path) but I assume
it takes a spin lock somewhere.

If there is no lock, doing a general cleanup of all of them would probably
make sense. But it would be orthogonal to my patch and I don't think
it's needed to fix this concrete bug.

The gfp argument is not needed though, because this
case can already be distinguished by checking after_bootmem.

> The proper solution is to extend init_memory_mapping() with a return
> value, and to check in the caller. arch_add_memory() obviously does not
> want to panic(), it wants to return -ENOMEM to mm/memory_hotplug.c.

The existing code already panics elsewhere (spp_getpage); I just copied
that.

So in summary, the panic & GFP_ATOMIC use is not good (I agree), but it's
not worse than what was in there before.

-Andi

2008-02-11 13:09:18

by Ingo Molnar

Subject: Re: [PATCH] [6/8] Account overlapped mappings in end_pfn_map


* Andi Kleen <[email protected]> wrote:

> When end_pfn is not aligned to 2MB (or 1GB) then the kernel might map
> more memory than end_pfn. Account this in end_pfn_mapped.

can you see any practical relevance? Your patch description only deals
with the mechanical details of the change instead of analyzing the most
important thing: relevance and impact analysis. That makes it harder to
add your patches and easier to miss a good fix accidentally. It also
makes it quite a bit harder to trust your patches.

at a quick glance the relevance is to EFI only, in
efi_enter_virtual_mode(). max_pfn_mapped is used as a differentiator
between __va() and efi_ioremap(). AFAICS EFI will typically have its
runtime code not right after the end of physical memory.

Nevertheless, i do agree that the max_pfn_mapped/end_pfn_map limit needs
to be sharpened to reflect reality (for later PAT support).

your patch is also a bit unclean:

> -static void __meminit
> +static unsigned long __meminit
> phys_pmd_update(pud_t *pud, unsigned long address, unsigned long end)
> {
> + unsigned long r;
> pmd_t *pmd = pmd_offset(pud, 0);
> spin_lock(&init_mm.page_table_lock);
> - phys_pmd_init(pmd, address, end);
> + r = phys_pmd_init(pmd, address, end);
> spin_unlock(&init_mm.page_table_lock);
> __flush_tlb_all();
> + return r;
> }

a meaningless "r" variable. Use "true_end" instead.

and here's another type of variable naming problem:

> +unsigned long __init_refok
> +init_memory_mapping(unsigned long start, unsigned long end)
> {
> - unsigned long next;
> + unsigned long next, true_end = end;

that "true_end" is already a _pfn_, so name it accordingly:
true_end_pfn. Because we switch between pfn and addr types it's
important to point out where it's pfn and where it's a linear/physical
address.

> struct zone *zone = pgdat->node_zones + ZONE_NORMAL;
> unsigned long start_pfn = start >> PAGE_SHIFT;
> unsigned long nr_pages = size >> PAGE_SHIFT;
> + unsigned long r;
> int ret;
>
> - init_memory_mapping(start, start + size-1);
> + r = init_memory_mapping(start, start + size-1);
> + if (r > end_pfn_map)
> + end_pfn_map = r;

rename "r" to "true_end_pfn".

Ingo

2008-02-11 13:27:30

by Andi Kleen

Subject: Re: [PATCH] [6/8] Account overlapped mappings in end_pfn_map

On Monday 11 February 2008 14:08:43 Ingo Molnar wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > When end_pfn is not aligned to 2MB (or 1GB) then the kernel might map
> > more memory than end_pfn. Account this in end_pfn_mapped.
>
> can you see any practical relevance?

Yes EFI needs to know this to decide if it should ioremap or not.

> Your patch description only deals
> with the mechanical details of the change instead of analyzing the most
> important thing: relevance and impact analysis.

I repeat:

"This is needed for other code that needs to know the true
mapping alias state (in this case EFI)."

and

"This is important for code that really needs to know about
all mapping aliases. Needed for followup patches (in this case EFI)"

> That makes it harder to
> add your patches and easier to miss a good fix accidentally. It also
> makes it quite a bit harder to trust your patches.
>
> at a quick glance the relevance is to EFI only, in
> efi_enter_virtual_mode(). max_pfn_mapped is used as a differentiator
> between __va() and efi_ioremap(). AFAICS EFI will typically have its
> runtime code not right after the end of physical memory.
>
> Nevertheless, i do agree that the max_pfn_mapped/end_pfn_map limit needs
> to be sharpened to reflect reality (for later PAT support).
>
> your patch is also a bit unclean:

Ok patch with hungarized variables appended.

-Andi

---

Account overlapped mappings in end_pfn_map

When end_pfn is not aligned to 2MB (or 1GB) then the kernel might
map more memory than end_pfn. Account this in end_pfn_mapped.

This is needed for other code that needs to know the true
mapping alias state (in this case EFI).

But it's also more correct in general.

Cc: [email protected]

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/kernel/setup_64.c | 4 +++-
arch/x86/mm/init_64.c | 33 +++++++++++++++++++++++----------
include/asm-x86/proto.h | 2 +-
3 files changed, 27 insertions(+), 12 deletions(-)

Index: linux/arch/x86/mm/init_64.c
===================================================================
--- linux.orig/arch/x86/mm/init_64.c
+++ linux/arch/x86/mm/init_64.c
@@ -287,7 +287,7 @@ __meminit void early_iounmap(void *addr,
__flush_tlb_all();
}

-static void __meminit
+static unsigned long __meminit
phys_pmd_init(pmd_t *pmd_page, unsigned long address, unsigned long end)
{
int i = pmd_index(address);
@@ -309,21 +309,25 @@ phys_pmd_init(pmd_t *pmd_page, unsigned
set_pte((pte_t *)pmd,
pfn_pte(address >> PAGE_SHIFT, PAGE_KERNEL_LARGE));
}
+ return address;
}

-static void __meminit
+static unsigned long __meminit
phys_pmd_update(pud_t *pud, unsigned long address, unsigned long end)
{
+ unsigned long true_end;
pmd_t *pmd = pmd_offset(pud, 0);
spin_lock(&init_mm.page_table_lock);
- phys_pmd_init(pmd, address, end);
+ true_end = phys_pmd_init(pmd, address, end);
spin_unlock(&init_mm.page_table_lock);
__flush_tlb_all();
+ return true_end;
}

-static void __meminit
+static unsigned long __meminit
phys_pud_init(pud_t *pud_page, unsigned long addr, unsigned long end)
{
+ unsigned long true_end = end;
int i = pud_index(addr);

for (; i < PTRS_PER_PUD; i++, addr = (addr & PUD_MASK) + PUD_SIZE) {
@@ -342,13 +346,14 @@ phys_pud_init(pud_t *pud_page, unsigned

if (pud_val(*pud)) {
if (!pud_large(*pud))
- phys_pmd_update(pud, addr, end);
+ true_end = phys_pmd_update(pud, addr, end);
continue;
}

if (direct_gbpages) {
set_pte((pte_t *)pud,
pfn_pte(addr >> PAGE_SHIFT, PAGE_KERNEL_LARGE));
+ true_end = (addr & PUD_MASK) + PUD_SIZE;
continue;
}

@@ -356,12 +361,14 @@ phys_pud_init(pud_t *pud_page, unsigned

spin_lock(&init_mm.page_table_lock);
set_pud(pud, __pud(pmd_phys | _KERNPG_TABLE));
- phys_pmd_init(pmd, addr, end);
+ true_end = phys_pmd_init(pmd, addr, end);
spin_unlock(&init_mm.page_table_lock);

unmap_low_page(pmd);
}
__flush_tlb_all();
+
+ return true_end >> PAGE_SHIFT;
}

static void __init find_early_table_space(unsigned long end)
@@ -406,9 +413,10 @@ static void __init init_gbpages(void)
* This runs before bootmem is initialized and gets pages directly from
* the physical memory. To access them they are temporarily mapped.
*/
-void __init_refok init_memory_mapping(unsigned long start, unsigned long end)
+unsigned long __init_refok
+init_memory_mapping(unsigned long start, unsigned long end)
{
- unsigned long next;
+ unsigned long next, true_end = end;

pr_debug("init_memory_mapping\n");

@@ -446,7 +454,7 @@ void __init_refok init_memory_mapping(un
next = start + PGDIR_SIZE;
if (next > end)
next = end;
- phys_pud_init(pud, __pa(start), __pa(next));
+ true_end = phys_pud_init(pud, __pa(start), __pa(next));
set_pgd(pgd_offset_k(start), mk_kernel_pgd(pud_phys));
unmap_low_page(pud);
}
@@ -458,6 +466,8 @@ void __init_refok init_memory_mapping(un
if (!after_bootmem)
reserve_early(table_start << PAGE_SHIFT,
table_end << PAGE_SHIFT, "PGTABLE");
+
+ return true_end;
}

#ifndef CONFIG_NUMA
@@ -499,9 +509,12 @@ int arch_add_memory(int nid, u64 start,
struct zone *zone = pgdat->node_zones + ZONE_NORMAL;
unsigned long start_pfn = start >> PAGE_SHIFT;
unsigned long nr_pages = size >> PAGE_SHIFT;
+ unsigned long true_end_pfn;
int ret;

- init_memory_mapping(start, start + size-1);
+ true_end_pfn = init_memory_mapping(start, start + size-1);
+ if (true_end_pfn > end_pfn_map)
+ end_pfn_map = true_end_pfn;

ret = __add_pages(zone, start_pfn, nr_pages);
WARN_ON(1);
Index: linux/include/asm-x86/proto.h
===================================================================
--- linux.orig/include/asm-x86/proto.h
+++ linux/include/asm-x86/proto.h
@@ -7,7 +7,8 @@

extern void early_idt_handler(void);

-extern void init_memory_mapping(unsigned long start, unsigned long end);
+extern unsigned long init_memory_mapping(unsigned long start,
+ unsigned long end);

extern void system_call(void);
extern void syscall_init(void);
Index: linux/arch/x86/kernel/setup_64.c
===================================================================
--- linux.orig/arch/x86/kernel/setup_64.c
+++ linux/arch/x86/kernel/setup_64.c
@@ -341,7 +341,7 @@ void __init setup_arch(char **cmdline_p)

check_efer();

- init_memory_mapping(0, (end_pfn_map << PAGE_SHIFT));
+ end_pfn_map = init_memory_mapping(0, (end_pfn_map << PAGE_SHIFT));
if (efi_enabled)
efi_init();

2008-02-11 13:33:56

by Ingo Molnar

Subject: Re: [PATCH] [5/8] Fix logic error in 64bit memory hotadd


* Andi Kleen <[email protected]> wrote:

> > Also, your fix, while it solves a real bug we want to fix, is not quite
> > right for upstream integration yet. I can see 3 immediate problems with
> > it:
> >
> > > + if (!pud_present(*pud)) {
> > > + pud = (pud_t *)get_zeroed_page(GFP_ATOMIC);
> >
> > the GFP_ATOMIC here can fail.
>
> The memory hotplug code already uses GFP_ATOMIC elsewhere
> (spp_getpage)

wrong. The _x86_ memory hotplug code uses GFP_ATOMIC elsewhere.
The generic memory hotplug code does not.

and the x86 memory hotplug code uses GFP_ATOMIC and panic() elsewhere
because:

> The existing code already panics elsewhere (spp_getpage); I just
> copied that.

and you had nothing to do with that "existing code"? git-log reveals
that the GFP_ATOMIC and panic()-ing patch was added 2 years ago and was
signed off by you:

commit 44df75e629106efcada087cead6c3f33ed6bcc60
Author: Matt Tolentino <[email protected]>
Date: Tue Jan 17 07:03:41 2006 +0100

[PATCH] x86_64: add x86-64 support for memory hot-add

[...]
Signed-off-by: Andi Kleen <[email protected]>

We (like most upstream kernel subsystems) generally do not accept
patches into arch/x86 that spread a buggy implementation detail
further. Please submit a patch that cleans up the mess. Thanks,

Ingo

2008-02-11 13:44:38

by Andi Kleen

Subject: Re: [PATCH] [5/8] Fix logic error in 64bit memory hotadd

On Monday 11 February 2008 14:33:33 Ingo Molnar wrote:
>
> * Andi Kleen <[email protected]> wrote:
>
> > > Also, your fix, while it solves a real bug we want to fix, is not quite
> > > right for upstream integration yet. I can see 3 immediate problems with
> > > it:
> > >
> > > > + if (!pud_present(*pud)) {
> > > > + pud = (pud_t *)get_zeroed_page(GFP_ATOMIC);
> > >
> > > the GFP_ATOMIC here can fail.
> >
> > The memory hotplug code already uses GFP_ATOMIC elsewhere
> > (spp_getpage)
>
> wrong. The _x86_ memory hotplug code uses GFP_ATOMIC elsewhere.
> The generic memory hotplug code does not.

To be honest, I'm a little tired of how you attempt to misinterpret every
word I write. Was it not clear from the context which code
was meant?

>
> and the x86 memory hotplug code uses GFP_ATOMIC and panic() elsewhere
> because:

I see it's all my fault.

>
> > The existing code already panics elsewhere (spp_getpage); i just
> > copied that.
>
> and you had nothing to do with that "existing code"? git-log reveals
> that the GFP_ATOMIC and panic()-ing patch was added 2 years ago and was
> signed off by you:

Should I point out all the unclean and buggy code you ever signed
off? @) In .25-rc1 alone there is enough of that.

>
> commit 44df75e629106efcada087cead6c3f33ed6bcc60
> Author: Matt Tolentino <[email protected]>
> Date: Tue Jan 17 07:03:41 2006 +0100
>
> [PATCH] x86_64: add x86-64 support for memory hot-add
>
> [...]
> Signed-off-by: Andi Kleen <[email protected]>
>
> We (like most upstream kernel subsystems) generally do not accept
> patches into arch/x86 that spreads a buggy implementation detail
> further. Please submit a patch that cleans up the mess. Thanks,

Ok I withdraw the patch under these circumstances. I'm not your coding
slave and I don't feel strongly enough about the hotplug case to
put much more work into this.

-Andi

2008-02-11 13:55:57

by Ingo Molnar

Subject: Re: [PATCH] [6/8] Account overlapped mappings in end_pfn_map


* Andi Kleen <[email protected]> wrote:

> > your patch is also a bit unclean:
>
> Ok patch with hungarized variables appended.

My comments have nothing to do with "hungarized variables". You used
clearly unclean and ambiguous coding constructs like:

+static unsigned long __meminit
phys_pmd_update(pud_t *pud, unsigned long address, unsigned long end)
{
+ unsigned long r;
pmd_t *pmd = pmd_offset(pud, 0);
spin_lock(&init_mm.page_table_lock);
- phys_pmd_init(pmd, address, end);
+ r = phys_pmd_init(pmd, address, end);
spin_unlock(&init_mm.page_table_lock);
__flush_tlb_all();
+ return r;
}

the "r" name is totally unintuitive and the reader of the code is given
absolutely no idea what happens here.

The cleanliness rules about descriptive variable names are obvious,
necessary and well respected throughout the kernel, by all other kernel
contributors i know. For years you were allowed to merge such patches.
You should really (re-)learn and accept the rules that all other kernel
contributors have been following for years.

Ingo

2008-02-11 14:16:58

by Peter Zijlstra

Subject: Re: [PATCH] [6/8] Account overlapped mappings in end_pfn_map


On Mon, 2008-02-11 at 14:27 +0100, Andi Kleen wrote:

> Ok patch with hungarized variables appended.

> -static void __meminit
> +static unsigned long __meminit
> phys_pmd_update(pud_t *pud, unsigned long address, unsigned long end)
> {
> + unsigned long true_end;
> pmd_t *pmd = pmd_offset(pud, 0);
> spin_lock(&init_mm.page_table_lock);
> - phys_pmd_init(pmd, address, end);
> + true_end = phys_pmd_init(pmd, address, end);
> spin_unlock(&init_mm.page_table_lock);
> __flush_tlb_all();
> + return true_end;
> }

Just for the record, Hungarian notation would have it like:

ulTrueEnd

http://en.wikipedia.org/wiki/Hungarian_notation

And the kernel doesn't do that, to wit (from Documentation/CodingStyle):

Linus Torvalds (against systems Hungarian): Encoding the type of a
function into the name (so-called Hungarian notation) is brain damaged -
the compiler knows the types anyway and can check those, and it only
confuses the programmer.


2008-02-11 14:24:22

by Andi Kleen

Subject: Re: [PATCH] [6/8] Account overlapped mappings in end_pfn_map

On Monday 11 February 2008 15:16:31 Peter Zijlstra wrote:
>
> On Mon, 2008-02-11 at 14:27 +0100, Andi Kleen wrote:
>
> > Ok patch with hungarized variables appended.
>
> > -static void __meminit
> > +static unsigned long __meminit
> > phys_pmd_update(pud_t *pud, unsigned long address, unsigned long end)
> > {
> > + unsigned long true_end;
> > pmd_t *pmd = pmd_offset(pud, 0);
> > spin_lock(&init_mm.page_table_lock);
> > - phys_pmd_init(pmd, address, end);
> > + true_end = phys_pmd_init(pmd, address, end);
> > spin_unlock(&init_mm.page_table_lock);
> > __flush_tlb_all();
> > + return true_end;
> > }
>
> Just for the record, Hungarian notation would have it like:
>
> ulTrueEnd

_pfn is a variant of Hungarian notation (just a postfix instead of a prefix);
that is why I referred to it.

Admittedly it was a little unfair pun with Ingo, but he really
asked for it in this case :-)


> http://en.wikipedia.org/wiki/Hungarian_notation
>
> And the kernel doesn't do that, to wit (from Documentation/CodingStyle):
>
> Linus Torvalds (against systems Hungarian): Encoding the type of a
> function into the name (so-called Hungarian notation) is brain damaged -
> the compiler knows the types anyway and can check those, and it only
> confuses the programmer.

xxx_pfn is exactly that.

BTW, CodingStyle also recommends using short variable names inside
functions. Ingo actually asked me to violate that:

"
LOCAL variable names should be short, and to the point. If you have
some random integer loop counter, it should probably be called "i".
Calling it "loop_counter" is non-productive, if there is no chance of it
being mis-understood. Similarly, "tmp" can be just about any type of
variable that is used to hold a temporary value.
"

I used r for result in this case, which conforms 100% to the coding style.

-Andi (trying to exit this thread)

2008-02-11 14:41:15

by Sam Ravnborg

Subject: Re: [PATCH] [6/8] Account overlapped mappings in end_pfn_map

>
> _pfn is variant of hungarian notation (just postfix instead of prefix);
> that is why I referred to it.
>
> Admittedly it was a little unfair pun with Ingo, but he really
> asked for it in this case :-)

I thought it was a good joke! It made me smile.

Sam

2008-02-11 15:13:37

by Arjan van de Ven

Subject: Re: [PATCH] [6/8] Account overlapped mappings in end_pfn_map

On Mon, 11 Feb 2008 10:34:34 +0100 (CET)
Andi Kleen <[email protected]> wrote:

>
> When end_pfn is not aligned to 2MB (or 1GB) then the kernel might
> map more memory than end_pfn. Account this in end_pfn_mapped.

we need to fix this btw; mapping memory that doesn't really exist
is a rather nasty trap with cpu prefetching and the like.


--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-02-12 10:36:22

by Yasunori Goto

Subject: Re: [PATCH] [5/8] Fix logic error in 64bit memory hotadd

Hi Ingo-san.

> > Does anyone even
> > use memory hotplug currently?
>
> I don't know.

IBM's powerpc boxes can do memory hot-add/remove by dynamic partitioning.
And our Fujitsu servers have a memory hot-add feature (IA-64).
So, they are concrete users of memory hotplug.

In x86, the E8500 chipset has a memory-hotplug feature.
(I found a data sheet on the Intel site.)
http://download.intel.com/design/chipsets/e8500/datashts/30674501.pdf
(6.3.8 IMI Hot-Plug)

So, it depends on how many servers use it, I think.

Thanks.

--
Yasunori Goto

2008-02-12 11:21:09

by Andi Kleen

Subject: Re: [PATCH] [5/8] Fix logic error in 64bit memory hotadd

On Tuesday 12 February 2008 11:35:22 Yasunori Goto wrote:
> Hi Ingo-san.
>
> > > Does anyone even
> > > use memory hotplug currently?
> >
> > I don't know.
>
> IBM's powerpc boxes can do memory hot-add/remove by dynamic partitioning.
> And our Fujitsu servers have a memory hot-add feature (IA-64).
> So, they are concrete users of memory hotplug.
>
> In x86, the E8500 chipset has a memory-hotplug feature.
> (I found a data sheet on the Intel site.)
> http://download.intel.com/design/chipsets/e8500/datashts/30674501.pdf
> (6.3.8 IMI Hot-Plug)
>
> So, it depends on how many servers use it, I think.

With the pud logic error I found (which it seems will now stay in the tree
for the foreseeable future) I don't think x86-64 hotadds of more
than 1GB have ever worked with the sparsemem method.

Older trees (before 2.6.24) also had a different method that
preallocated everything based on the SRAT. That one should have worked
for this case.

-Andi

2008-02-12 19:40:21

by Thomas Gleixner

Subject: Re: [PATCH] [7/8] Implement true end_pfn_mapped for 32bit

On Mon, 11 Feb 2008, Andi Kleen wrote:
> Even on 32bit 2MB pages can map more memory than is in the true
> max_low_pfn if end_pfn is not highmem and not aligned to 2MB.
> Add an end_pfn_map similar to x86-64 that accounts for this
> fact. This is important for code that really needs to know about
> all mapping aliases. Needed for followup patches (in this case EFI).

It's not only necessary for followup patches, it is a question of
general correctness.

> @@ -36,7 +36,7 @@
> #define max_pfn_mapped end_pfn_map
> #else
> #include <asm/page_32.h>
> -#define max_pfn_mapped max_low_pfn
> +#define max_pfn_mapped end_pfn_map

We can nuke either max_pfn_mapped or end_pfn_map completely. I don't
care about which one, but keeping both makes no sense at all.

Thanks,

tglx

2008-02-12 19:50:21

by Andi Kleen

Subject: Re: [PATCH] [7/8] Implement true end_pfn_mapped for 32bit

On Tuesday 12 February 2008 20:39:48 Thomas Gleixner wrote:
> On Mon, 11 Feb 2008, Andi Kleen wrote:
> > Even on 32bit 2MB pages can map more memory than is in the true
> > max_low_pfn if end_pfn is not highmem and not aligned to 2MB.
> > Add an end_pfn_map similar to x86-64 that accounts for this
> > fact. This is important for code that really needs to know about
> > all mapping aliases. Needed for followup patches (in this case EFI).
>
> It's not only necessary for followup patches, it is a question of
> general correctness.

True. ioremap already requires it.
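
As a hedged sketch of why (simplified, not the actual ioremap code):
ioremap has to know the true end of the kernel mapping to detect when a
physical range is also covered by the direct mapping, because that
alias needs matching cache attributes.

        /*
         * Sketch only: a physical range below the true mapped end is
         * also reachable through the kernel direct mapping, so its
         * cache attributes must be fixed up there as well.
         */
        static int covered_by_direct_mapping(unsigned long phys_addr,
                                             unsigned long size)
        {
                unsigned long last_pfn = (phys_addr + size - 1) >> PAGE_SHIFT;

                return last_pfn < max_pfn_mapped;
        }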

> > @@ -36,7 +36,7 @@
> > #define max_pfn_mapped end_pfn_map
> > #else
> > #include <asm/page_32.h>
> > -#define max_pfn_mapped max_low_pfn
> > +#define max_pfn_mapped end_pfn_map
>
> We can nuke either max_pfn_mapped or end_pfn_map completely. I don't
> care which one, but keeping both makes no sense at all.

I didn't want to bundle such a cleanup into the bug fix
because my experience is that you usually reject that categorically.

I can send the removal of max_pfn_mapped as a follow up patch
if you apply this one.
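
For illustration only, the follow-up could be as small as dropping the
alias (assumed direction, keeping end_pfn_map; it could equally go the
other way):

        -#define max_pfn_mapped end_pfn_map

plus converting the remaining max_pfn_mapped users over to end_pfn_map.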

-Andi

2008-02-12 20:04:35

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] [8/8] RFC: Fix some EFI problems

On Mon, 11 Feb 2008, Andi Kleen wrote:

> From code review the EFI memory map handling has a couple of problems:
>
> - The test for _WB memory was reversed, so it would set cacheable memory
> to uncached
> - It would always set an uninitialized (zero) address to uncached
> (so I suspect it always set the first few pages in phys memory to uncached,
> which is why it may have gone unnoticed)
> - It would call set_memory_x() on a fixmap address that it doesn't
> handle correctly.
> - Some other problems I commented on in the code (but was unable to solve
> for now)
>
> I changed the ioremaps to set the correct caching attributes
> and also corrected the ordering so it looks roughly correct now.

The only effective change is:

- if (md->attribute & EFI_MEMORY_WB)
+ if (!(md->attribute & EFI_MEMORY_WB))

I appreciate that you noticed the reverse logic, which I messed up
when I fixed up rejects.

I pulled this out as it is a real fix. The rest of this patch is just
turning code in circles for nothing, because it is functionally
completely irrelevant whether one does:

        if ((end >> PAGE_SHIFT) <= max_pfn_mapped)
                va = __va(md->phys_addr);
        else
                va = efi_ioremap(md->phys_addr, size);

        if (!(md->attribute & EFI_MEMORY_WB))
                set_memory_uc(md->virt_addr, size);
or

        if ((end >> PAGE_SHIFT) <= max_pfn_mapped) {
                va = __va(md->phys_addr);

                if (!(md->attribute & EFI_MEMORY_WB))
                        set_memory_uc(md->virt_addr, size);
        } else
                va = efi_ioremap(md->phys_addr, size,
                                 !!(md->attribute & EFI_MEMORY_WB));

And you just copied the real bug in that logic as well:

set_memory_uc(md->virt_addr, size);
------------------------^^^^^^^^

which is initialized a couple of lines down.

md->virt_addr = (u64) (unsigned long) va;

The reordering/optimizing needs to be a separate patch.

Please keep bugfixes and other changes separate.

> + /* RED-PEN does not handle overlapped areas */

Can you please use CHECKME/FIXME, which is used everywhere else? No
need to invent an extra marker.

Thanks,

tglx

2008-02-12 20:26:08

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] [7/8] Implement true end_pfn_mapped for 32bit

On Tue, 12 Feb 2008, Andi Kleen wrote:
> > > @@ -36,7 +36,7 @@
> > > #define max_pfn_mapped end_pfn_map
> > > #else
> > > #include <asm/page_32.h>
> > > -#define max_pfn_mapped max_low_pfn
> > > +#define max_pfn_mapped end_pfn_map
> >
> > We can nuke either max_pfn_mapped or end_pfn_map completely. I don't
> > care which one, but keeping both makes no sense at all.
>
> I didn't want to bundle such a cleanup into the bug fix
> because my experience is that you usually reject that categorically.

True, but I'd apply two separate patches.

> I can send the removal of max_pfn_mapped as a follow up patch
> if you apply this one.

Too late. It was so obvious that it screamed to be fixed. I applied
the initial patch already and cleaned it up myself. :)

Thanks,

tglx

2008-02-12 20:28:24

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [8/8] RFC: Fix some EFI problems

On Tuesday 12 February 2008 21:04:06 Thomas Gleixner wrote:

>
> And you just copied the real bug in that logic as well:
>
> set_memory_uc(md->virt_addr, size);

Oops, you're right. I wanted to fix that, but didn't. OK, I'll put on
my brown paper bag tonight when I go out.

> ------------------------^^^^^^^^
>
> which is initialized a couple of lines down.
>
> md->virt_addr = (u64) (unsigned long) va;
>
> The reordering/optimizing needs to be a separate patch.

What optimizing? It wasn't intended to be an optimization.
It fixes a bug.

Not doing set_memory_uc() on efi_ioremap()'s output is necessary
because set_memory_uc() doesn't work on the fixmap, which is what
efi_ioremap() returns.

(see previous mails on that topic -- I fixed the 'x' case,
but fixing "uc" is too hard IMHO)

So I fixed efi_ioremap() instead to set the correct caching
mode directly. That is OK because there can be no overlap
with the direct mapping, so there are no aliases to fix up.
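
Roughly, the approach looks like this (a simplified sketch with an
assumed signature and no bounds checking; see the actual patch for the
real code):

        /*
         * Sketch: take the desired cachability as an argument and set
         * the fixmap PTEs with the right attribute up front, so no
         * set_memory_uc() call on the fixmap address is needed later.
         * pages_mapped is the running fixmap allocation count.
         */
        static unsigned pages_mapped;

        void __iomem * __init efi_ioremap(unsigned long phys_addr,
                                          unsigned long size, int cacheable)
        {
                pgprot_t prot = cacheable ? PAGE_KERNEL : PAGE_KERNEL_NOCACHE;
                unsigned long pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
                unsigned long i;

                for (i = 0; i < pages; i++, phys_addr += PAGE_SIZE)
                        __set_fixmap(FIX_EFI_IO_MAP_FIRST_PAGE -
                                     pages_mapped - i, phys_addr, prot);

                pages_mapped += pages;
                return (void __iomem *)__fix_to_virt(FIX_EFI_IO_MAP_FIRST_PAGE -
                                                     (pages_mapped - pages));
        }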


> Please keep bugfixes and other changes separate.
>
> > + /* RED-PEN does not handle overlapped areas */
>
> Can you please use CHECKME/FIXME, which is used everywhere else? No need to
> invent an extra marker.

I've always used RED-PEN:

% grep -r RED-PEN arch/x86/* | wc -l
12
%

It originally comes from network code I hacked on a long time ago,
although most of those markers got lost over time (only 2 left, sniff).

Sorry, I don't want to change this now, and I doubt it will really
cause a problem for anybody.

I'll send an updated patch with the va thing fixed.

-Andi

2008-02-12 20:49:21

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] [8/8] RFC: Fix some EFI problems

On Tue, 12 Feb 2008, Andi Kleen wrote:

> On Tuesday 12 February 2008 21:04:06 Thomas Gleixner wrote:
>
> >
> > And you just copied the real bug in that logic as well:
> >
> > set_memory_uc(md->virt_addr, size);
>
> Oops, you're right. I wanted to fix that, but didn't. OK, I'll put on
> my brown paper bag tonight when I go out.
>
> > ------------------------^^^^^^^^
> >
> > which is initialized a couple of lines down.
> >
> > md->virt_addr = (u64) (unsigned long) va;
> >
> > The reordering/optimizing needs to be a separate patch.
>
> What optimizing? It wasn't intended to be an optimization.
> It fixes a bug.

No, it does not. Please go back and read my mail.

The code had exactly two bugs:

1) the logic of checking EFI_MEMORY_WB was wrong
2) the uninitialized variable

The fix is:

arch/x86/kernel/efi.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

Index: linux-2.6/arch/x86/kernel/efi.c
===================================================================
--- linux-2.6.orig/arch/x86/kernel/efi.c
+++ linux-2.6/arch/x86/kernel/efi.c
@@ -428,9 +428,6 @@ void __init efi_enter_virtual_mode(void)
 		else
 			va = efi_ioremap(md->phys_addr, size);
 
-		if (md->attribute & EFI_MEMORY_WB)
-			set_memory_uc(md->virt_addr, size);
-
 		md->virt_addr = (u64) (unsigned long) va;
 
 		if (!va) {
@@ -439,6 +436,9 @@ void __init efi_enter_virtual_mode(void)
 			continue;
 		}
 
+		if (!(md->attribute & EFI_MEMORY_WB))
+			set_memory_uc(md->virt_addr, size);
+
 		systab = (u64) (unsigned long) efi_phys.systab;
 		if (md->phys_addr <= systab && systab < end) {
 			systab += md->virt_addr - md->phys_addr;

The reordering of code is completely irrelevant. It can be done, but
in a separate patch.

Thanks,

tglx

2008-02-13 11:04:45

by Andi Kleen

[permalink] [raw]
Subject: Re: [PATCH] [8/8] RFC: Fix some EFI problems

Thomas Gleixner wrote:
> On Tue, 12 Feb 2008, Andi Kleen wrote:
>
>> On Tuesday 12 February 2008 21:04:06 Thomas Gleixner wrote:
>>
>>> And you just copied the real bug in that logic as well:
>>>
>>> set_memory_uc(md->virt_addr, size);
>> Oops, you're right. I wanted to fix that, but didn't. OK, I'll put on
>> my brown paper bag tonight when I go out.
>>
>>> ------------------------^^^^^^^^
>>>
>>> which is initialized a couple of lines down.
>>>
>>> md->virt_addr = (u64) (unsigned long) va;
>>>
>>> The reordering/optimizing needs to be a separate patch.
>> What optimizing? It wasn't intended to be an optimization.
>> It fixes a bug.
>
> No, it does not. Please go back and read my mail.

Let me describe the problem again:

- efi_ioremap on 64bit returns a fixmap address:

        void __iomem * __init efi_ioremap(unsigned long phys_addr,
                                          unsigned long size)
        {
                ...
                return (void __iomem *)__fix_to_virt(FIX_EFI_IO_MAP_FIRST_PAGE -
                                                     (pages_mapped - pages));
        }
- __fix_to_virt is
        (FIXADDR_TOP - ((x) << PAGE_SHIFT))
  where x is a small integer, <30 or so.
- Fixmap is
        #define VSYSCALL_END (-2UL << 20)
        #define FIXADDR_TOP (VSYSCALL_END - PAGE_SIZE)
  which gives 0xffffffffffdff000 for the top fixmap; the EFI fixmap
  is only a few pages below.
- You pass that into set_memory_uc()
- That eventually calls __pa() on it several times
(in static_protections and in change_page_attr_addr for 64bit to
check for the kernel mapping)
- __pa calls __phys_addr, which does

        unsigned long __phys_addr(unsigned long x)
        {
                if (x >= __START_KERNEL_map)
                        return x - __START_KERNEL_map + phys_base;
                return x - PAGE_OFFSET;
        }
- Now __START_KERNEL_map is 0xffffffff80000000.
- That ends up with

        x = 0xffffffffffdff000 - smallnumber * PAGE_SIZE

        if (x >= 0xffffffff80000000)    /* evaluates to true */
                return x - 0xffffffff80000000 + phys_base;
- This ends up with some fictional number in cpa (but likely one that
looks like a valid physical address) that has nothing to do with the
address actually mapped at the fixmap slot
- cpa() then does weird things to random unrelated memory, or might
clear RW if you're really unlucky
- I think on 32bit with a real ioremap it's also not completely kosher
with respect to __PAGE_OFFSET (but I have not double-checked that
step by step)
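
To make the arithmetic concrete, here is a small standalone user-space
sketch (constants assumed as above; phys_base taken as 0 for a
non-relocated kernel):

        #include <stdio.h>

        #define __START_KERNEL_map  0xffffffff80000000UL
        #define PAGE_OFFSET         0xffff810000000000UL  /* assumed value */
        #define FIXADDR_TOP         0xffffffffffdff000UL  /* (-2UL << 20) - 4096 */
        #define PAGE_SIZE           4096UL

        static unsigned long phys_base;         /* 0: non-relocated kernel */

        static unsigned long __phys_addr(unsigned long x)
        {
                if (x >= __START_KERNEL_map)
                        return x - __START_KERNEL_map + phys_base;
                return x - PAGE_OFFSET;
        }

        int main(void)
        {
                unsigned long fixmap_va = FIXADDR_TOP - 5 * PAGE_SIZE;

                /* prints 0x7fdfa000: a plausible-looking but bogus
                 * "physical address" unrelated to the EFI mapping */
                printf("%#lx\n", __phys_addr(fixmap_va));
                return 0;
        }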

This is why I avoided calling set_memory_uc() on the fixmap
and instead changed the code to set the correct PAT
attribute in the fixmap directly.

I believe the full original change or some Thomasized variant of it
is still needed.

-Andi