2006-02-22 00:14:34

by David Gibson

[permalink] [raw]
Subject: IA64 non-contiguous memory space bugs

Quite some time ago, I found (by inspection) an ia64 specific bug
related to its non-contiguous user address space. I've never done
anything about it, because I don't have an ia64 to test on - but
somebody should fix it. Recently I've spotted another possible bug,
also by inspection - I don't know enough about ia64 to tell if it's a
real problem or not.

First bug (confirmed many months ago by Chris Wedgwood) - you can get
weird effects if you attempt to mmap() something into one of the
address space gaps. The ia64 outer wrapper for mmap2() tries to
prevent it, but doesn't do a good enough job, it's still possible
indirectly with shmat() and maybe mremap(). Basic trouble is that
most of the checks applied by the generic code assume that everything
between 0 and TASK_SIZE is valid. Patch below addresses this, or did
- quite likely it's bitrotted since I wrote it.

Second problem is in the hugepage logic in free_pgtables()
(mm/memory.c). As far as I can tell it's complete crap, and only
works by accident, for different accidental reasons on ppc64 and ia64,
the only archs that have a non-trivial is_hugepage_only_range().
Except that I'm not sure it does entirely work by accident on ia64:
suppose a process has a hugepage mapping that begins some way after
the beginning of the hugepage address range. Before
hugetlb_free_pgd_range() gets called on that area, it will be called
on the next normal page VMA down - but with an end address at the
beginning of the hugepage VMA and so extending into the hugepage
address range. I don't really understand the ia64 pagetable mapping
stuff well enough to tell if that's dangerous or not.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

Index: working-2.6/mm/mremap.c
===================================================================
--- working-2.6.orig/mm/mremap.c 2004-08-30 10:23:00.000000000 +1000
+++ working-2.6/mm/mremap.c 2004-09-24 15:02:18.169776624 +1000
@@ -274,7 +274,7 @@
if (!(flags & MREMAP_MAYMOVE))
goto out;

- if (new_len > TASK_SIZE || new_addr > TASK_SIZE - new_len)
+ if (! GOOD_TASK_VM_RANGE(new_addr, new_len))
goto out;

/* Check if the location we're moving into overlaps the
@@ -351,6 +351,8 @@
!((flags & MREMAP_FIXED) && (addr != new_addr)) &&
(old_len != new_len || !(flags & MREMAP_MAYMOVE))) {
unsigned long max_addr = TASK_SIZE;
+ if (max_addr > REGION_MAX(vma->vm_start))
+ max_addr = REGION_MAX(vma->vm_start);
if (vma->vm_next)
max_addr = vma->vm_next->vm_start;
/* can we just expand the current mapping? */
Index: working-2.6/mm/mmap.c
===================================================================
--- working-2.6.orig/mm/mmap.c 2004-09-07 10:38:00.000000000 +1000
+++ working-2.6/mm/mmap.c 2004-09-24 15:02:18.170776472 +1000
@@ -1209,7 +1209,7 @@
if (flags & MAP_FIXED) {
unsigned long ret;

- if (addr > TASK_SIZE - len)
+ if (! GOOD_TASK_VM_RANGE(addr, len))
return -ENOMEM;
if (addr & ~PAGE_MASK)
return -EINVAL;
@@ -1658,7 +1658,7 @@
unsigned long end;
struct vm_area_struct *mpnt, *prev, *last;

- if ((start & ~PAGE_MASK) || start > TASK_SIZE || len > TASK_SIZE-start)
+ if ((start & ~PAGE_MASK) || !GOOD_TASK_VM_RANGE(start, len))
return -EINVAL;

if ((len = PAGE_ALIGN(len)) == 0)
@@ -1749,7 +1749,7 @@
if (!len)
return addr;

- if ((addr + len) > TASK_SIZE || (addr + len) < addr)
+ if (! GOOD_TASK_VM_RANGE(addr, len))
return -EINVAL;

/*
Index: working-2.6/include/asm-ia64/page.h
===================================================================
--- working-2.6.orig/include/asm-ia64/page.h 2004-09-10 13:26:27.000000000 +1000
+++ working-2.6/include/asm-ia64/page.h 2004-09-24 15:02:18.171776320 +1000
@@ -123,6 +123,8 @@
#define REGION_SIZE REGION_NUMBER(1)
#define REGION_KERNEL 7

+#define REGION_MAX(x) ((REGION_NUMBER(x)<<61) + RGN_MAP_LIMIT)
+
#ifdef CONFIG_HUGETLB_PAGE
# define htlbpage_to_page(x) ((REGION_NUMBER(x) << 61) \
| (REGION_OFFSET(x) >> (HPAGE_SHIFT-PAGE_SHIFT)))
Index: working-2.6/include/linux/mm.h
===================================================================
--- working-2.6.orig/include/linux/mm.h 2004-09-07 10:38:00.000000000 +1000
+++ working-2.6/include/linux/mm.h 2004-09-24 15:02:18.172776168 +1000
@@ -41,6 +41,13 @@
#define MM_VM_SIZE(mm) TASK_SIZE
#endif

+#ifndef REGION_MAX
+#define REGION_MAX(addr) TASK_SIZE
+#endif
+
+#define GOOD_TASK_VM_RANGE(addr, len) \
+ ( ((addr)+(len) >= (addr)) || ((addr)+(len) <= TASK_SIZE) \
+ || ((addr)+(len) <= REGION_MAX(addr)) )
/*
* Linux kernel virtual memory manager primitives.
* The idea being to have a "virtual" mm in the same way
Index: working-2.6/fs/binfmt_elf.c
===================================================================
--- working-2.6.orig/fs/binfmt_elf.c 2004-09-24 10:14:10.000000000 +1000
+++ working-2.6/fs/binfmt_elf.c 2004-09-24 15:02:18.174775864 +1000
@@ -81,7 +81,8 @@
.min_coredump = ELF_EXEC_PAGESIZE
};

-#define BAD_ADDR(x) ((unsigned long)(x) > TASK_SIZE)
+#define BAD_ADDR(x) (((unsigned long)(x) > TASK_SIZE) \
+ || ((unsigned long)(x) >= REGION_MAX((unsigned long)x)) )

static int set_brk(unsigned long start, unsigned long end)
{
@@ -370,8 +371,8 @@
* <= p_memsize so it is only necessary to check p_memsz.
*/
k = load_addr + eppnt->p_vaddr;
- if (k > TASK_SIZE || eppnt->p_filesz > eppnt->p_memsz ||
- eppnt->p_memsz > TASK_SIZE || TASK_SIZE - eppnt->p_memsz < k) {
+ if ((eppnt->p_filesz > eppnt->p_memsz) ||
+ ! GOOD_TASK_VM_RANGE(k, eppnt->p_memsz)) {
error = -ENOMEM;
goto out_close;
}
@@ -798,9 +799,8 @@
* allowed task size. Note that p_filesz must always be
* <= p_memsz so it is only necessary to check p_memsz.
*/
- if (k > TASK_SIZE || elf_ppnt->p_filesz > elf_ppnt->p_memsz ||
- elf_ppnt->p_memsz > TASK_SIZE ||
- TASK_SIZE - elf_ppnt->p_memsz < k) {
+ if (elf_ppnt->p_filesz > elf_ppnt->p_memsz ||
+ ! GOOD_TASK_VM_RANGE(k, elf_ppnt->p_memsz)) {
/* set_brk can never work. Avoid overflows. */
send_sig(SIGKILL, current, 0);
goto out_free_dentry;
Index: working-2.6/arch/ia64/kernel/sys_ia64.c
===================================================================
--- working-2.6.orig/arch/ia64/kernel/sys_ia64.c 2004-08-09 09:51:26.000000000 +1000
+++ working-2.6/arch/ia64/kernel/sys_ia64.c 2004-09-24 15:02:18.175775712 +1000
@@ -211,17 +211,6 @@
goto out;
}

- /*
- * Don't permit mappings into unmapped space, the virtual page table of a region,
- * or across a region boundary. Note: RGN_MAP_LIMIT is equal to 2^n-PAGE_SIZE
- * (for some integer n <= 61) and len > 0.
- */
- roff = REGION_OFFSET(addr);
- if ((len > RGN_MAP_LIMIT) || (roff > (RGN_MAP_LIMIT - len))) {
- addr = -EINVAL;
- goto out;
- }
-
down_write(&current->mm->mmap_sem);
addr = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
up_write(&current->mm->mmap_sem);


----- End forwarded message -----

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson


2006-02-22 00:39:38

by David Miller

[permalink] [raw]
Subject: Re: IA64 non-contiguous memory space bugs

From: David Gibson <[email protected]>
Date: Wed, 22 Feb 2006 11:13:59 +1100

> Quite some time ago, I found (by inspection) an ia64 specific bug
> related to its non-contiguous user address space. I've never done
> anything about it, because I don't have an ia64 to test on - but
> somebody should fix it. Recently I've spotted another possible bug,
> also by inspection - I don't know enough about ia64 to tell if it's a
> real problem or not.

Good catch David, I'll need to add similar checks on sparc64 as
we have a 64-bit virtual address space hole on several processors
there.

2006-02-22 01:32:09

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: IA64 non-contiguous memory space bugs

David Gibson wrote on Tuesday, February 21, 2006 4:14 PM
> Second problem is in the hugepage logic in free_pgtables()
> (mm/memory.c). As far as I can tell it's complete crap, and only
> works by accident, for different accidental reasons on ppc64 and ia64,
> the only archs that have a non-trivial is_hugepage_only_range().
> Except that I'm not sure it does entirely work by accident on ia64:
> suppose a process has a hugepage mapping that begins some way after
> the beginning of the hugepage address range. Before
> hugetlb_free_pgd_range() gets called on that area, it will be called
> on the next normal page VMA down - but with an end address at the
> beginning of the hugepage VMA and so extending into the hugepage
> address range. I don't really understand the ia64 pagetable mapping
> stuff well enough to tell if that's dangerous or not.

I don't see any problem in the ia64 code. The start and end address is
what the vma specified. Floor and ceiling is just a hint for free_pgtables()
to free any left over page tables between vma holes (to prev and next).
As far as I can tell, the code looks fine.

- Ken

2006-02-22 01:51:56

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: IA64 non-contiguous memory space bugs

David Gibson wrote on Tuesday, February 21, 2006 4:14 PM
> Second problem is in the hugepage logic in free_pgtables()
> (mm/memory.c). As far as I can tell it's complete crap, and only
> works by accident, for different accidental reasons on ppc64 and ia64,
> the only archs that have a non-trivial is_hugepage_only_range().
> Except that I'm not sure it does entirely work by accident on ia64:
> suppose a process has a hugepage mapping that begins some way after
> the beginning of the hugepage address range. Before
> hugetlb_free_pgd_range() gets called on that area, it will be called
> on the next normal page VMA down - but with an end address at the
> beginning of the hugepage VMA and so extending into the hugepage
> address range. I don't really understand the ia64 pagetable mapping
> stuff well enough to tell if that's dangerous or not.

Chen, Kenneth W wrote on Tuesday, February 21, 2006 5:32 PM
> I don't see any problem in the ia64 code. The start and end address is
> what the vma specified. Floor and ceiling is just a hint for free_pgtables()
> to free any left over page tables between vma holes (to prev and next).
> As far as I can tell, the code looks fine.


free_pgtables() has partial crap that the check of is_hugepage_only_range()
should be done on the entire vma range, not just the first hugetlb page.
Though, it's not possible to have a hugetlb vma while having normal page
instantiated inside that vma. So the bug is mostly phantom. For
correctness, it should be fixed.


--- linux-2.6.16-rc4/mm/memory.c.orig 2006-02-21 18:33:32.427186571 -0800
+++ linux-2.6.16-rc4/mm/memory.c 2006-02-21 18:37:22.414488441 -0800
@@ -270,6 +270,7 @@ void free_pgtables(struct mmu_gather **t
while (vma) {
struct vm_area_struct *next = vma->vm_next;
unsigned long addr = vma->vm_start;
+ unsigned long end = vma->vm_end;

/*
* Hide vma from rmap and vmtruncate before freeing pgtables
@@ -277,8 +278,8 @@ void free_pgtables(struct mmu_gather **t
anon_vma_unlink(vma);
unlink_file_vma(vma);

- if (is_hugepage_only_range(vma->vm_mm, addr, HPAGE_SIZE)) {
- hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
+ if (is_hugepage_only_range(vma->vm_mm, addr, end - addr)) {
+ hugetlb_free_pgd_range(tlb, addr, end
floor, next? next->vm_start: ceiling);
} else {
/*




2006-02-22 02:26:36

by David Gibson

[permalink] [raw]
Subject: Re: IA64 non-contiguous memory space bugs

On Tue, Feb 21, 2006 at 05:51:52PM -0800, Chen, Kenneth W wrote:
> David Gibson wrote on Tuesday, February 21, 2006 4:14 PM
> > Second problem is in the hugepage logic in free_pgtables()
> > (mm/memory.c). As far as I can tell it's complete crap, and only
> > works by accident, for different accidental reasons on ppc64 and ia64,
> > the only archs that have a non-trivial is_hugepage_only_range().
> > Except that I'm not sure it does entirely work by accident on ia64:
> > suppose a process has a hugepage mapping that begins some way after
> > the beginning of the hugepage address range. Before
> > hugetlb_free_pgd_range() gets called on that area, it will be called
> > on the next normal page VMA down - but with an end address at the
> > beginning of the hugepage VMA and so extending into the hugepage
> > address range. I don't really understand the ia64 pagetable mapping
> > stuff well enough to tell if that's dangerous or not.
>
> Chen, Kenneth W wrote on Tuesday, February 21, 2006 5:32 PM
> > I don't see any problem in the ia64 code. The start and end address is
> > what the vma specified. Floor and ceiling is just a hint for free_pgtables()
> > to free any left over page tables between vma holes (to prev and next).
> > As far as I can tell, the code looks fine.
>
>
> free_pgtables() has partial crap that the check of is_hugepage_only_range()
> should be done on the entire vma range, not just the first hugetlb page.
> Though, it's not possible to have a hugetlb vma while having normal page
> instantiated inside that vma. So the bug is mostly phantom. For
> correctness, it should be fixed.

Actually, from ppc64's point of view, the problem with the test is
that the whole vma could be *less* than HPAGE_SIZE - we don't test
that the address is aligned before checking is_hugepage_only_range().
We thus can call hugetlb_free_pgd_range() on normal page VMAs - which
we only get away with because the ppc64 hugetlb_free_pgd_range() is
(so far) an alias for the normal free_pgd_range().

Your patch below is insufficient, because there's a second test of
is_hugepage_only_range() further down. However, instead of tweaking
the tested ranges, I think what we really want to do is check for
is_vm_hugetlb_page() instead.

I was worried before, but now that you point out it's the 'end'
address which really matters, not the ceiling, that might be
sufficient. Um.. except that hugepages, unlike normal pages these
days don't necessarily clean up all possible pagetable pages on
unmap... crud. Still the patch below ought to be an improvement.

Index: working-2.6/mm/memory.c
===================================================================
--- working-2.6.orig/mm/memory.c 2006-02-22 10:42:14.000000000 +1100
+++ working-2.6/mm/memory.c 2006-02-22 13:22:07.000000000 +1100
@@ -277,7 +277,7 @@ void free_pgtables(struct mmu_gather **t
anon_vma_unlink(vma);
unlink_file_vma(vma);

- if (is_hugepage_only_range(vma->vm_mm, addr, HPAGE_SIZE)) {
+ if (is_vm_hugetlb_page(vma)) {
hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
floor, next? next->vm_start: ceiling);
} else {
@@ -285,8 +285,7 @@ void free_pgtables(struct mmu_gather **t
* Optimization: gather nearby vmas into one call down
*/
while (next && next->vm_start <= vma->vm_end + PMD_SIZE
- && !is_hugepage_only_range(vma->vm_mm, next->vm_start,
- HPAGE_SIZE)) {
+ && !is_vm_hugetlb_page(vma->vm_mm)) {
vma = next;
next = vma->vm_next;
anon_vma_unlink(vma);


--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

2006-02-22 02:26:32

by David Gibson

[permalink] [raw]
Subject: Re: IA64 non-contiguous memory space bugs

On Tue, Feb 21, 2006 at 05:31:59PM -0800, Chen, Kenneth W wrote:
> David Gibson wrote on Tuesday, February 21, 2006 4:14 PM
> > Second problem is in the hugepage logic in free_pgtables()
> > (mm/memory.c). As far as I can tell it's complete crap, and only
> > works by accident, for different accidental reasons on ppc64 and ia64,
> > the only archs that have a non-trivial is_hugepage_only_range().
> > Except that I'm not sure it does entirely work by accident on ia64:
> > suppose a process has a hugepage mapping that begins some way after
> > the beginning of the hugepage address range. Before
> > hugetlb_free_pgd_range() gets called on that area, it will be called
> > on the next normal page VMA down - but with an end address at the
> > beginning of the hugepage VMA and so extending into the hugepage
> > address range. I don't really understand the ia64 pagetable mapping
> > stuff well enough to tell if that's dangerous or not.
>
> I don't see any problem in the ia64 code. The start and end address is
> what the vma specified. Floor and ceiling is just a hint for free_pgtables()
> to free any left over page tables between vma holes (to prev and next).
> As far as I can tell, the code looks fine.

Ah, yes, I see now. free_pgd_range() only iterates through to end,
not ceiling so it should be fine.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

2006-02-22 02:26:34

by David Gibson

[permalink] [raw]
Subject: Re: IA64 non-contiguous memory space bugs

On Tue, Feb 21, 2006 at 04:39:35PM -0800, David S. Miller wrote:
> From: David Gibson <[email protected]>
> Date: Wed, 22 Feb 2006 11:13:59 +1100
>
> > Quite some time ago, I found (by inspection) an ia64 specific bug
> > related to its non-contiguous user address space. I've never done
> > anything about it, because I don't have an ia64 to test on - but
> > somebody should fix it. Recently I've spotted another possible bug,
> > also by inspection - I don't know enough about ia64 to tell if it's a
> > real problem or not.
>
> Good catch David, I'll need to add similar checks on sparc64 as
> we have a 64-bit virtual address space hole on several processors
> there.

Ok, that patch adds all the checks in generic code anyway, so all you
should need for sparc64 is an appropriate REGION_MAX() macro.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

2006-02-22 02:45:32

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: IA64 non-contiguous memory space bugs

David Gibson wrote on Tuesday, February 21, 2006 6:26 PM
> Chen, Kenneth W wrote on Tuesday, February 21, 2006 5:32 PM
> > free_pgtables() has partial crap that the check of is_hugepage_only_range()
> > should be done on the entire vma range, not just the first hugetlb page.
> > Though, it's not possible to have a hugetlb vma while having normal page
> > instantiated inside that vma. So the bug is mostly phantom. For
> > correctness, it should be fixed.
>
> Actually, from ppc64's point of view, the problem with the test is
> that the whole vma could be *less* than HPAGE_SIZE - we don't test
> that the address is aligned before checking is_hugepage_only_range().
> We thus can call hugetlb_free_pgd_range() on normal page VMAs - which
> we only get away with because the ppc64 hugetlb_free_pgd_range() is
> (so far) an alias for the normal free_pgd_range().
>
> Your patch below is insufficient, because there's a second test of
> is_hugepage_only_range() further down. However, instead of tweaking
> the tested ranges, I think what we really want to do is check for
> is_vm_hugetlb_page() instead.
>
> I was worried before, but now that you point out it's the 'end'
> address which really matters, not the ceiling, that might be
> sufficient. Um.. except that hugepages, unlike normal pages these
> days don't necessarily clean up all possible pagetable pages on
> unmap... crud. Still the patch below ought to be an improvement.


I agree. It's an improvement with your patch.

For ia64:
Acked-by: Ken Chen <[email protected]>



> Index: working-2.6/mm/memory.c
> ===================================================================
> --- working-2.6.orig/mm/memory.c 2006-02-22 10:42:14.000000000 +1100
> +++ working-2.6/mm/memory.c 2006-02-22 13:22:07.000000000 +1100
> @@ -277,7 +277,7 @@ void free_pgtables(struct mmu_gather **t
> anon_vma_unlink(vma);
> unlink_file_vma(vma);
>
> - if (is_hugepage_only_range(vma->vm_mm, addr, HPAGE_SIZE)) {
> + if (is_vm_hugetlb_page(vma)) {
> hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
> floor, next? next->vm_start: ceiling);
> } else {
> @@ -285,8 +285,7 @@ void free_pgtables(struct mmu_gather **t
> * Optimization: gather nearby vmas into one call down
> */
> while (next && next->vm_start <= vma->vm_end + PMD_SIZE
> - && !is_hugepage_only_range(vma->vm_mm, next->vm_start,
> - HPAGE_SIZE)) {
> + && !is_vm_hugetlb_page(vma->vm_mm)) {
> vma = next;
> next = vma->vm_next;
> anon_vma_unlink(vma);

2006-02-22 02:53:36

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: IA64 non-contiguous memory space bugs

David Gibson wrote on Tuesday, February 21, 2006 4:14 PM
> First bug (confirmed many months ago by Chris Wedgwood) - you can get
> weird effects if you attempt to mmap() something into one of the
> address space gaps. The ia64 outer wrapper for mmap2() tries to
> prevent it, but doesn't do a good enough job, it's still possible
> indirectly with shmat() and maybe mremap(). Basic trouble is that
> most of the checks applied by the generic code assume that everything
> between 0 and TASK_SIZE is valid.

Ha ha ha.

On ia64, the low level tlb fault handler (vhpt_miss and nested_dtlb_miss)
checks that all unused address bits (between REGION_NUMBER and PGDIR_SHIFT)
should be all zero. If they are not zero, it will fall into page fault
handler and in there, ia64 should just send SEGV instead of happily hand
over a page. Buggy buggy....

- Ken

2006-02-22 03:01:25

by Zhang, Yanmin

[permalink] [raw]
Subject: RE: IA64 non-contiguous memory space bugs

>>-----Original Message-----
>>From: [email protected] [mailto:[email protected]] On Behalf Of 'David Gibson'
>>Sent: 2006??2??22?? 10:26
>>To: Chen, Kenneth W
>>Cc: [email protected]; [email protected]
>>Subject: Re: IA64 non-contiguous memory space bugs
>>
>>Your patch below is insufficient, because there's a second test of
>>is_hugepage_only_range() further down. However, instead of tweaking
>>the tested ranges, I think what we really want to do is check for
>>is_vm_hugetlb_page() instead.
>>
>>I was worried before, but now that you point out it's the 'end'
>>address which really matters, not the ceiling, that might be
>>sufficient. Um.. except that hugepages, unlike normal pages these
>>days don't necessarily clean up all possible pagetable pages on
>>unmap... crud. Still the patch below ought to be an improvement.
>>
>>Index: working-2.6/mm/memory.c
>>===================================================================
>>--- working-2.6.orig/mm/memory.c 2006-02-22 10:42:14.000000000 +1100
>>+++ working-2.6/mm/memory.c 2006-02-22 13:22:07.000000000 +1100
>>@@ -277,7 +277,7 @@ void free_pgtables(struct mmu_gather **t
>> anon_vma_unlink(vma);
>> unlink_file_vma(vma);
>>
>>- if (is_hugepage_only_range(vma->vm_mm, addr, HPAGE_SIZE)) {
>>+ if (is_vm_hugetlb_page(vma)) {
>> hugetlb_free_pgd_range(tlb, addr, vma->vm_end,
>> floor, next? next->vm_start: ceiling);
>> } else {
>>@@ -285,8 +285,7 @@ void free_pgtables(struct mmu_gather **t
>> * Optimization: gather nearby vmas into one call down
>> */
>> while (next && next->vm_start <= vma->vm_end + PMD_SIZE
>>- && !is_hugepage_only_range(vma->vm_mm, next->vm_start,
>>- HPAGE_SIZE)) {
>>+ && !is_vm_hugetlb_page(vma->vm_mm)) {
is_vm_hugetlb_page(vma->vm_mm) should be is_vm_hugetlb_page(vma)?

2006-02-22 03:55:21

by David Mosberger-Tang

[permalink] [raw]
Subject: Re: IA64 non-contiguous memory space bugs

I'm only following this superficially, but keep in mind that a vm-area
MUST NEVER cross a hole.

--david

On 2/21/06, Chen, Kenneth W <[email protected]> wrote:
> David Gibson wrote on Tuesday, February 21, 2006 4:14 PM
> > First bug (confirmed many months ago by Chris Wedgwood) - you can get
> > weird effects if you attempt to mmap() something into one of the
> > address space gaps. The ia64 outer wrapper for mmap2() tries to
> > prevent it, but doesn't do a good enough job, it's still possible
> > indirectly with shmat() and maybe mremap(). Basic trouble is that
> > most of the checks applied by the generic code assume that everything
> > between 0 and TASK_SIZE is valid.
>
> Ha ha ha.
>
> On ia64, the low level tlb fault handler (vhpt_miss and nested_dtlb_miss)
> checks that all unused address bits (between REGION_NUMBER and PGDIR_SHIFT)
> should be all zero. If they are not zero, it will fall into page fault
> handler and in there, ia64 should just send SEGV instead of happily hand
> over a page. Buggy buggy....
>
> - Ken
>
> -
> To unsubscribe from this list: send the line "unsubscribe linux-ia64" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>


--
Mosberger Consulting LLC, http://www.mosberger-consulting.com/

2006-02-22 04:03:45

by David Gibson

[permalink] [raw]
Subject: Re: IA64 non-contiguous memory space bugs

On Tue, Feb 21, 2006 at 08:55:15PM -0700, David Mosberger-Tang wrote:
> I'm only following this superficially, but keep in mind that a vm-area
> MUST NEVER cross a hole.

Yes.. but it can, that's precisely the bug.

> On 2/21/06, Chen, Kenneth W <[email protected]> wrote:
> > David Gibson wrote on Tuesday, February 21, 2006 4:14 PM
> > > First bug (confirmed many months ago by Chris Wedgwood) - you can get
> > > weird effects if you attempt to mmap() something into one of the
> > > address space gaps. The ia64 outer wrapper for mmap2() tries to
> > > prevent it, but doesn't do a good enough job, it's still possible
> > > indirectly with shmat() and maybe mremap(). Basic trouble is that
> > > most of the checks applied by the generic code assume that everything
> > > between 0 and TASK_SIZE is valid.
> >
> > Ha ha ha.
> >
> > On ia64, the low level tlb fault handler (vhpt_miss and nested_dtlb_miss)
> > checks that all unused address bits (between REGION_NUMBER and PGDIR_SHIFT)
> > should be all zero. If they are not zero, it will fall into page fault
> > handler and in there, ia64 should just send SEGV instead of happily hand
> > over a page. Buggy buggy....

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

2006-02-22 16:20:01

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: IA64 non-contiguous memory space bugs

David Gibson wrote on Tuesday, February 21, 2006 4:14 PM
> --- working-2.6.orig/include/linux/mm.h 2004-09-07 10:38:00.000000000 +1000
> +++ working-2.6/include/linux/mm.h 2004-09-24 15:02:18.172776168 +1000
> @@ -41,6 +41,13 @@
> #define MM_VM_SIZE(mm) TASK_SIZE
> #endif
>
> +#ifndef REGION_MAX
> +#define REGION_MAX(addr) TASK_SIZE
> +#endif
> +
> +#define GOOD_TASK_VM_RANGE(addr, len) \
> + ( ((addr)+(len) >= (addr)) || ((addr)+(len) <= TASK_SIZE) \
> + || ((addr)+(len) <= REGION_MAX(addr)) )

Looks like the logic is reversed, should be && instead of ||.

- Ken

2006-02-22 16:34:47

by Hugh Dickins

[permalink] [raw]
Subject: Re: IA64 non-contiguous memory space bugs

On Wed, 22 Feb 2006, 'David Gibson' wrote:
> On Tue, Feb 21, 2006 at 05:51:52PM -0800, Chen, Kenneth W wrote:
> >
> > free_pgtables() has partial crap that the check of is_hugepage_only_range()
> > should be done on the entire vma range, not just the first hugetlb page.

We're testing whether this vma falls in a hugepage_only_range. It's
important to check the size (end) of the vma when setting it up; but
when tearing down it is fair to assume that it was set up correctly,
so unnecessary to check its size (end).

I used HPAGE_SIZE rather than 0 because one can imagine an implementation
of is_hugepage_only_range which would go wrong with 0, or with a fraction
of HPAGE_SIZE. Perhaps you'd be happier to add an is_hugepage_only_addr
macro which hides that size arg to is_hugepage_only_range.

> > Though, it's not possible to have a hugetlb vma while having normal page
> > instantiated inside that vma.

That is, if the setup does its checks correctly, you're not allowed
to have a normal vma in (or spanning) a hugepage_only_range.

> > So the bug is mostly phantom. For correctness, it should be fixed.

I believe the bug is non-existent, and therefore needs no fix.

> Actually, from ppc64's point of view, the problem with the test is
> that the whole vma could be *less* than HPAGE_SIZE - we don't test
> that the address is aligned before checking is_hugepage_only_range().
> We thus can call hugetlb_free_pgd_range() on normal page VMAs - which
> we only get away with because the ppc64 hugetlb_free_pgd_range() is
> (so far) an alias for the normal free_pgd_range().

Are you saying that on ppc64, you can put non-hugepage vmas into a
hugepage_only_range? If so, why is it called a hugepage_only_range?
Or, are you saying that your hugepage_only_range is smaller than
HPAGE_SIZE, so no hugepages can fit in it? Or that you have
hugepage vma start addresses not aligned to HPAGE_SIZE?
None of that makes sense to me.

> Your patch below is insufficient, because there's a second test of
> is_hugepage_only_range() further down. However, instead of tweaking
> the tested ranges, I think what we really want to do is check for
> is_vm_hugetlb_page() instead.

No, that looks cleaner, but it's wrong. hugetlb_free_pgd_range does
something useful on those architectures which have a not-always-false
is_hugepage_only_range (ia64 and powerpc alone): it's paired with it.
(Though as you've noticed, powerpc only does the usual free_pgd_range.)

hugetlb_free_pgd_range does nothing on most architectures, even those
(i386 etc) which have a not-always-false is_vm_hugetlb_page: we do
want to free_pgd_range on those. So using is_vm_hugetlb_page instead
of is_hugepage_only_range is wrong for them. Though I guess you could
change their hugetlb_free_pgd_range definitions to free_pgd_range, and
then use is_vm_hugetlb_page as you did, that would work too (though
with less combining of vmas in that loop, so not an improvement).

So far as I know, there's nothing to be fixed at the free_pgtables end
(apart from an utterly unrelated latency issue). But, without having
studied your first patch, I've little doubt that you have identified
significant oversights when bounds checking the vma when it's set up.

Hugh

2006-02-22 23:26:55

by David Gibson

[permalink] [raw]
Subject: Re: IA64 non-contiguous memory space bugs

On Wed, Feb 22, 2006 at 08:19:51AM -0800, Chen, Kenneth W wrote:
> David Gibson wrote on Tuesday, February 21, 2006 4:14 PM
> > --- working-2.6.orig/include/linux/mm.h 2004-09-07 10:38:00.000000000 +1000
> > +++ working-2.6/include/linux/mm.h 2004-09-24 15:02:18.172776168 +1000
> > @@ -41,6 +41,13 @@
> > #define MM_VM_SIZE(mm) TASK_SIZE
> > #endif
> >
> > +#ifndef REGION_MAX
> > +#define REGION_MAX(addr) TASK_SIZE
> > +#endif
> > +
> > +#define GOOD_TASK_VM_RANGE(addr, len) \
> > + ( ((addr)+(len) >= (addr)) || ((addr)+(len) <= TASK_SIZE) \
> > + || ((addr)+(len) <= REGION_MAX(addr)) )
>
> Looks like the logic is reversed, should be && instead of ||.

Ah, yes indeed. Looks like I screwed up my de Morgan's when I changed
it from BAD_TASK_VM_RANGE(). This and another reject fixed in the
version below. Mind you it's been a long time since I wrote this, so
it should probably be checked for any new places where the tests need
to be changed. I'm probably not the best person to do that, since I
don't have an ia64 machine to test on.

Index: working-2.6/mm/mremap.c
===================================================================
--- working-2.6.orig/mm/mremap.c 2006-01-16 13:02:29.000000000 +1100
+++ working-2.6/mm/mremap.c 2006-02-23 09:54:52.000000000 +1100
@@ -278,7 +278,7 @@ unsigned long do_mremap(unsigned long ad
if (!(flags & MREMAP_MAYMOVE))
goto out;

- if (new_len > TASK_SIZE || new_addr > TASK_SIZE - new_len)
+ if (! GOOD_TASK_VM_RANGE(new_addr, new_len))
goto out;

/* Check if the location we're moving into overlaps the
@@ -355,6 +355,8 @@ unsigned long do_mremap(unsigned long ad
!((flags & MREMAP_FIXED) && (addr != new_addr)) &&
(old_len != new_len || !(flags & MREMAP_MAYMOVE))) {
unsigned long max_addr = TASK_SIZE;
+ if (max_addr > REGION_MAX(vma->vm_start))
+ max_addr = REGION_MAX(vma->vm_start);
if (vma->vm_next)
max_addr = vma->vm_next->vm_start;
/* can we just expand the current mapping? */
Index: working-2.6/mm/mmap.c
===================================================================
--- working-2.6.orig/mm/mmap.c 2006-02-23 09:54:43.000000000 +1100
+++ working-2.6/mm/mmap.c 2006-02-23 10:00:08.000000000 +1100
@@ -1345,7 +1345,7 @@ get_unmapped_area(struct file *file, uns
return addr;
}

- if (addr > TASK_SIZE - len)
+ if (! GOOD_TASK_VM_RANGE(addr, len))
return -ENOMEM;
if (addr & ~PAGE_MASK)
return -EINVAL;
@@ -1757,7 +1757,7 @@ int do_munmap(struct mm_struct *mm, unsi
unsigned long end;
struct vm_area_struct *vma, *prev, *last;

- if ((start & ~PAGE_MASK) || start > TASK_SIZE || len > TASK_SIZE-start)
+ if ((start & ~PAGE_MASK) || !GOOD_TASK_VM_RANGE(start, len))
return -EINVAL;

if ((len = PAGE_ALIGN(len)) == 0)
@@ -1851,7 +1851,7 @@ unsigned long do_brk(unsigned long addr,
if (!len)
return addr;

- if ((addr + len) > TASK_SIZE || (addr + len) < addr)
+ if (! GOOD_TASK_VM_RANGE(addr, len))
return -EINVAL;

/*
Index: working-2.6/include/asm-ia64/page.h
===================================================================
--- working-2.6.orig/include/asm-ia64/page.h 2006-01-16 13:02:29.000000000 +1100
+++ working-2.6/include/asm-ia64/page.h 2006-02-23 09:54:52.000000000 +1100
@@ -142,6 +142,8 @@ typedef union ia64_va {
#define REGION_NUMBER(x) ({ia64_va _v; _v.l = (long) (x); _v.f.reg;})
#define REGION_OFFSET(x) ({ia64_va _v; _v.l = (long) (x); _v.f.off;})

+#define REGION_MAX(x) ((REGION_NUMBER(x)<<61) + RGN_MAP_LIMIT)
+
#ifdef CONFIG_HUGETLB_PAGE
# define htlbpage_to_page(x) (((unsigned long) REGION_NUMBER(x) << 61) \
| (REGION_OFFSET(x) >> (HPAGE_SHIFT-PAGE_SHIFT)))
Index: working-2.6/include/linux/mm.h
===================================================================
--- working-2.6.orig/include/linux/mm.h 2006-02-20 10:55:12.000000000 +1100
+++ working-2.6/include/linux/mm.h 2006-02-23 10:01:51.000000000 +1100
@@ -41,6 +41,13 @@ extern int sysctl_legacy_va_layout;

#define nth_page(page,n) pfn_to_page(page_to_pfn((page)) + (n))

+#ifndef REGION_MAX
+#define REGION_MAX(addr) TASK_SIZE
+#endif
+
+#define GOOD_TASK_VM_RANGE(addr, len) \
+ ( ((addr)+(len) >= (addr)) && ((addr)+(len) <= TASK_SIZE) \
+ && ((addr)+(len) <= REGION_MAX(addr)) )
/*
* Linux kernel virtual memory manager primitives.
* The idea being to have a "virtual" mm in the same way
Index: working-2.6/fs/binfmt_elf.c
===================================================================
--- working-2.6.orig/fs/binfmt_elf.c 2006-01-16 13:08:42.000000000 +1100
+++ working-2.6/fs/binfmt_elf.c 2006-02-23 09:54:52.000000000 +1100
@@ -86,7 +86,8 @@ static struct linux_binfmt elf_format =
.min_coredump = ELF_EXEC_PAGESIZE
};

-#define BAD_ADDR(x) ((unsigned long)(x) > TASK_SIZE)
+#define BAD_ADDR(x) (((unsigned long)(x) > TASK_SIZE) \
+ || ((unsigned long)(x) >= REGION_MAX((unsigned long)x)) )

static int set_brk(unsigned long start, unsigned long end)
{
@@ -389,8 +390,8 @@ static unsigned long load_elf_interp(str
* <= p_memsize so it is only necessary to check p_memsz.
*/
k = load_addr + eppnt->p_vaddr;
- if (k > TASK_SIZE || eppnt->p_filesz > eppnt->p_memsz ||
- eppnt->p_memsz > TASK_SIZE || TASK_SIZE - eppnt->p_memsz < k) {
+ if ((eppnt->p_filesz > eppnt->p_memsz) ||
+ ! GOOD_TASK_VM_RANGE(k, eppnt->p_memsz)) {
error = -ENOMEM;
goto out_close;
}
@@ -871,9 +872,8 @@ static int load_elf_binary(struct linux_
* allowed task size. Note that p_filesz must always be
* <= p_memsz so it is only necessary to check p_memsz.
*/
- if (k > TASK_SIZE || elf_ppnt->p_filesz > elf_ppnt->p_memsz ||
- elf_ppnt->p_memsz > TASK_SIZE ||
- TASK_SIZE - elf_ppnt->p_memsz < k) {
+ if (elf_ppnt->p_filesz > elf_ppnt->p_memsz ||
+ ! GOOD_TASK_VM_RANGE(k, elf_ppnt->p_memsz)) {
/* set_brk can never work. Avoid overflows. */
send_sig(SIGKILL, current, 0);
goto out_free_dentry;
Index: working-2.6/arch/ia64/kernel/sys_ia64.c
===================================================================
--- working-2.6.orig/arch/ia64/kernel/sys_ia64.c 2006-01-16 13:02:16.000000000 +1100
+++ working-2.6/arch/ia64/kernel/sys_ia64.c 2006-02-23 09:54:52.000000000 +1100
@@ -189,17 +189,6 @@ do_mmap2 (unsigned long addr, unsigned l
goto out;
}

- /*
- * Don't permit mappings into unmapped space, the virtual page table of a region,
- * or across a region boundary. Note: RGN_MAP_LIMIT is equal to 2^n-PAGE_SIZE
- * (for some integer n <= 61) and len > 0.
- */
- roff = REGION_OFFSET(addr);
- if ((len > RGN_MAP_LIMIT) || (roff > (RGN_MAP_LIMIT - len))) {
- addr = -EINVAL;
- goto out;
- }
-
down_write(&current->mm->mmap_sem);
addr = do_mmap_pgoff(file, addr, len, prot, flags, pgoff);
up_write(&current->mm->mmap_sem);


--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

2006-02-22 23:50:36

by David Gibson

[permalink] [raw]
Subject: Re: IA64 non-contiguous memory space bugs

On Wed, Feb 22, 2006 at 04:35:34PM +0000, Hugh Dickins wrote:
> On Wed, 22 Feb 2006, 'David Gibson' wrote:
> > On Tue, Feb 21, 2006 at 05:51:52PM -0800, Chen, Kenneth W wrote:
> > >
> > > free_pgtables() has partial crap that the check of is_hugepage_only_range()
> > > should be done on the entire vma range, not just the first hugetlb page.
>
> We're testing whether this vma falls in a hugepage_only_range. It's
> important to check the size (end) of the vma when setting it up; but
> when tearing down it is fair to assume that it was set up correctly,
> so unnecessary to check its size (end).
>
> I used HPAGE_SIZE rather than 0 because one can imagine an implementation
> of is_hugepage_only_range which would go wrong with 0, or with a fraction
> of HPAGE_SIZE. Perhaps you'd be happier to add an is_hugepage_only_addr
> macro which hides that size arg to is_hugepage_only_range.

(Aside: is_hugepage_only_range() isn't about telling where huge pages
can go, it's about telling where normal pages can't go. As such it
must for it's primary callsite on the MAP_FIXED path in
get_unmapped_area() work with parameters that aren't HPAGE_SIZE
aligned).

> > > Though, it's not possible to have a hugetlb vma while having normal page
> > > instantiated inside that vma.
>
> That is, if the setup does its checks correctly, you're not allowed
> to have a normal vma in (or spanning) a hugepage_only_range.
>
> > > So the bug is mostly phantom. For correctness, it should be fixed.
>
> I believe the bug is non-existent, and therefore needs no fix.

The bug is real alright, I've watched it call hugetlb_free_pgd_range()
for a normal page VMA on powerpc.

> > Actually, from ppc64's point of view, the problem with the test is
> > that the whole vma could be *less* than HPAGE_SIZE - we don't test
> > that the address is aligned before checking is_hugepage_only_range().
> > We thus can call hugetlb_free_pgd_range() on normal page VMAs - which
> > we only get away with because the ppc64 hugetlb_free_pgd_range() is
> > (so far) an alias for the normal free_pgd_range().
>
> Are you saying that on ppc64, you can put non-hugepage vmas into a
> hugepage_only_range? If so, why is it called a hugepage_only_range?
> Or, are you saying that your hugepage_only_range is smaller than
> HPAGE_SIZE, so no hugepages can fit in it? Or that you have
> hugepage vma start addresses not aligned to HPAGE_SIZE?
> None of that makes sense to me.

None of the above. However, is_hugepage_only_range() does not need to
be called on a hugepage aligned range (and is not here), and returns
true (and must do so) if the given range intersects a hugepage only
area, not only if it lies entirely within a hugepage only area.

Consider a HPAGE_SIZE hugepage VMA starting at 4GB, and a normal page
VMA starting at (4GB-PAGE_SIZE). This situation is possible on
powerpc, and is_hugepage_only_range(4GB-PAGE_SIZE, HPAGE_SIZE) will
(and must) return true. Therefore the free_pgtables() logic will call
hugetlb_free_pgd_range() across the normal page VMA.

> > Your patch below is insufficient, because there's a second test of
> > is_hugepage_only_range() further down. However, instead of tweaking
> > the tested ranges, I think what we really want to do is check for
> > is_vm_hugetlb_page() instead.
>
> No, that looks cleaner, but it's wrong. hugetlb_free_pgd_range does
> something useful on those architectures which have a not-always-false
> is_hugepage_only_range (ia64 and powerpc alone): it's paired with it.
> (Though as you've noticed, powerpc only does the usual free_pgd_range.)
>
> hugetlb_free_pgd_range does nothing on most architectures, even those
> (i386 etc) which have a not-always-false is_vm_hugetlb_page: we do
> want to free_pgd_range on those. So using is_vm_hugetlb_page instead
> of is_hugepage_only_range is wrong for them. Though I guess you could
> change their hugetlb_free_pgd_range definitions to free_pgd_range, and
> then use is_vm_hugetlb_page as you did, that would work too (though
> with less combining of vmas in that loop, so not an improvement).

Yes, I realised that was wrong shortly after posting. In fact it's
wrong in just the same way that is_hugepage_only_range() is wrong for
powerpc right now - which we work around becuse
hugetlb_free_pgd_range() is identical to free_pgd_range().

I can see two ways of fixing this. The quick, hacky fix is to use
is_vm_hugetlb_page(), and work around the problems by having
hugetlb_free_pgd_range() be identical to free_pgd_range() in most
cases. Fixing it more cleanly will need a new callback that actually
encodes this "need special pagetable freeing" concept.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

2006-02-23 20:13:02

by Hugh Dickins

[permalink] [raw]
Subject: Re: IA64 non-contiguous memory space bugs

On Thu, 23 Feb 2006, 'David Gibson' wrote:
>
> Consider a HPAGE_SIZE hugepage VMA starting at 4GB, and a normal page
> VMA starting at (4GB-PAGE_SIZE). This situation is possible on
> powerpc, and is_hugepage_only_range(4GB-PAGE_SIZE, HPAGE_SIZE) will
> (and must) return true. Therefore the free_pgtables() logic will call
> hugetlb_free_pgd_range() across the normal page VMA.

Thanks for your patience, I eventually got it. Although (amused to
observe my own incomprehension) I couldn't actually understand your
explanation at all, realized it myself overnight, read again what
you'd written, and then found that you had explained it very well.

Yes, I was wrong to use HPAGE_SIZE in that way in free_pgtables,
and it ought to go to the trouble of testing the real end-addr
(if we keep using is_hugepage_only_range there at all). Though
it's nothing urgent while your hugetlb_free_pgd_range happens to
be the same as your free_pgd_range, right? Is that changing soon?

May I plead the extenuating circumstance, that the powerpc
is_hugepage_only_range means something quite different from the ia64?
The ia64 one means "within a hugepage-only range" but the powerpc one
means "overlaps a hugepage-only range"; I don't know which came first,
and is_hugepage_only_range isn't very descriptive of either (though
matches the ia64 case much better).

(That is, I think from the "touch" naming, and from your description,
that the powerpc one means "overlaps". After a few minutes, I gave
up trying to decipher exactly what LOW_ESID_MASK and HTLB_AREA_MASK
end up doing, and take your superior knowledge on trust.)

While is_hugepage_only_range means different things to different
architectures, I guess it'd best be avoided in common code. That use
in get_unmapped_area: powerpc gets it right, but ia64 gets it wrong?
But I didn't notice a change to that line (or the ia64 implementaton
thereof) in your original patch.

> I can see two ways of fixing this. The quick, hacky fix is to use
> is_vm_hugetlb_page(), and work around the problems by having
> hugetlb_free_pgd_range() be identical to free_pgd_range() in most
> cases.

I don't see that as hacky. I did point out that is_vm_hugetlb_page
will miss out on some coalescence, but that can't be a big deal for
what are already huge areas (the optimization was intended for many
tiny adjacent areas).

Hugh

2006-02-24 00:12:21

by David Gibson

[permalink] [raw]
Subject: Re: IA64 non-contiguous memory space bugs

On Thu, Feb 23, 2006 at 08:13:38PM +0000, Hugh Dickins wrote:
> On Thu, 23 Feb 2006, 'David Gibson' wrote:
> >
> > Consider a HPAGE_SIZE hugepage VMA starting at 4GB, and a normal page
> > VMA starting at (4GB-PAGE_SIZE). This situation is possible on
> > powerpc, and is_hugepage_only_range(4GB-PAGE_SIZE, HPAGE_SIZE) will
> > (and must) return true. Therefore the free_pgtables() logic will call
> > hugetlb_free_pgd_range() across the normal page VMA.
>
> Thanks for your patience, I eventually got it. Although (amused to
> observe my own incomprehension) I couldn't actually understand your
> explanation at all, realized it myself overnight, read again what
> you'd written, and then found that you had explained it very well.
>
> Yes, I was wrong to use HPAGE_SIZE in that way in free_pgtables,
> and it ought to go to the trouble of testing the real end-addr
> (if we keep using is_hugepage_only_range there at all). Though
> it's nothing urgent while your hugetlb_free_pgd_range happens to
> be the same as your free_pgd_range, right? Is that changing soon?

Maybe. At the moment ppc64 wastes substantial memory, particularly
with 64K base page size because we store 16 hugepage PTEs in a 64k
pagetable page. I'd like to fix that by using a different pagetable
allocator for the hugepages, and that will require a different
hugetlb_free_pgd_range().

> May I plead the extenuating circumstance, that the powerpc
> is_hugepage_only_range means something quite different from the ia64?
> The ia64 one means "within a hugepage-only range" but the powerpc one
> means "overlaps a hugepage-only range"; I don't know which came first,
> and is_hugepage_only_range isn't very descriptive of either (though
> matches the ia64 case much better).

Well.. in fact on ia64 the two meanings are equivalent for
"reasonable" inputs (anything that's a valid user VM range at all).
In general the semantics need to be "overlaps" because any overlapping
range is unsuitable for a normalpage VMA, which is the real purpose of
this test.

The name "is_hugepage_only_range" is mine, I'm afraid, replacing an
even less descriptive name which I've now forgotten. I can't think of
anything unequivocally clearer off the top of my head. Well, that's
not ludicrously verbose anyway
(is_range_unsuitable_for_normal_vma_by_reason_of_hugepage_areas()?)

> (That is, I think from the "touch" naming, and from your description,
> that the powerpc one means "overlaps". After a few minutes, I gave
> up trying to decipher exactly what LOW_ESID_MASK and HTLB_AREA_MASK
> end up doing, and take your superior knowledge on trust.)

You're correct, it's "overlap" semantics (now anyway, it wasn't at one
stage, and that was a bug). It's also complicated on powerpc because
the hugepage exclusive range is not fixed - when you make a hugepage
mapping chunks of the address space (for this mm) are switched over to
be hugepage dedicated - granularity is at 256M chunks below 4GB and
1TB chunks above that, two bitmaps in the mm_context record which
"low" and "high" areas are set aside for hugepages.

> While is_hugepage_only_range means different things to different
> architectures, I guess it'd best be avoided in common code. That use
> in get_unmapped_area: powerpc gets it right, but ia64 gets it wrong?
> But I didn't notice a change to that line (or the ia64 implementaton
> thereof) in your original patch.

It doesn't really mean different things - "touches a hugepage
exclusive area" is the correct semantic, the ia64 implementation
doesn't quite encode that, but is equivalent for valid address
ranges. (though I wonder if that's another bug associated with by
task-region-max patch, without that patch invalid address ranges can
slip through, so maybe it's possible on ia64 to create a normalpage VM
with its start in the address space gap and its end in the hugepage
region, ouch).

> > I can see two ways of fixing this. The quick, hacky fix is to use
> > is_vm_hugetlb_page(), and work around the problems by having
> > hugetlb_free_pgd_range() be identical to free_pgd_range() in most
> > cases.
>
> I don't see that as hacky. I did point out that is_vm_hugetlb_page
> will miss out on some coalescence, but that can't be a big deal for
> what are already huge areas (the optimization was intended for many
> tiny adjacent areas).

Very well, I'll look at coding up such a fix.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson

2006-02-24 01:14:21

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: IA64 non-contiguous memory space bugs

David Gibson wrote on Thursday, February 23, 2006 4:12 PM
> It doesn't really mean different things - "touches a hugepage
> exclusive area" is the correct semantic, the ia64 implementation
> doesn't quite encode that, but is equivalent for valid address
> ranges. (though I wonder if that's another bug associated with by
> task-region-max patch, without that patch invalid address ranges can
> slip through, so maybe it's possible on ia64 to create a normalpage VM
> with its start in the address space gap and its end in the hugepage
> region, ouch).

This is getting complicated that my little brain hurts. There has been
so many iterations that the semantic is ambiguous. If the semantic is
decided to be "overlap", then

It will break arch/ia64/mm/hugetlbpage.c:hugetlb_free_pgd_range():

if (is_hugepage_only_range(tlb->mm, floor, HPAGE_SIZE))
floor = htlbpage_to_page(floor);
if (is_hugepage_only_range(tlb->mm, ceiling, HPAGE_SIZE))
ceiling = htlbpage_to_page(ceiling);

And it will break horribly bad because hugetlbpage_to_page does
"magic" address transformation.

- Ken

2006-02-24 02:47:26

by David Gibson

[permalink] [raw]
Subject: Re: IA64 non-contiguous memory space bugs

On Thu, Feb 23, 2006 at 05:14:03PM -0800, Chen, Kenneth W wrote:
> David Gibson wrote on Thursday, February 23, 2006 4:12 PM
> > It doesn't really mean different things - "touches a hugepage
> > exclusive area" is the correct semantic, the ia64 implementation
> > doesn't quite encode that, but is equivalent for valid address
> > ranges. (though I wonder if that's another bug associated with by
> > task-region-max patch, without that patch invalid address ranges can
> > slip through, so maybe it's possible on ia64 to create a normalpage VM
> > with its start in the address space gap and its end in the hugepage
> > region, ouch).
>
> This is getting complicated that my little brain hurts. There has been
> so many iterations that the semantic is ambiguous. If the semantic is
> decided to be "overlap", then

The semantic is "! is this range ok for a normalpage VMA", so that we
can do that check on the MAP_FIXED path. That implies "overlap" -
except that if you assume it's passed a valid user address range in
the first place, then just checking the region is sufficient on ia64.

--
David Gibson | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
| _way_ _around_!
http://www.ozlabs.org/~dgibson