2012-11-05 22:47:35

by Michel Lespinasse

Subject: [PATCH 00/16] mm: use augmented rbtrees for finding unmapped areas

Earlier this year, Rik proposed using augmented rbtrees to optimize
our search for a suitable unmapped area during mmap(). This prompted
my work on improving the augmented rbtree code. Rik doesn't seem to
have time to follow up on it at the moment, so I'm sending this
series to revive the idea.

These changes are against v3.7-rc4. I have not converted all applicable
architectures yet, but we don't necessarily need to get them all onboard
at once - the series is fully bisectable and additional architectures
can be added later on. I am confident enough in my tests for patches 1-8;
however, the second half of the series basically didn't get tested as
I don't have access to all the relevant architectures.

Change log since the previous (RFC) send:
- Added a fix for a validate_mm() bug, noticed by Sasha Levin and
figured out by Bob Liu, which sometimes caused a NULL pointer
dereference when running with CONFIG_DEBUG_VM_RB=y
- Fixed generic and x86_64 arch_get_unmapped_area_topdown to avoid
allocating new areas at addr=0 as suggested by Rik van Riel
- Converted more architectures to use the new vm_unmapped_area()
search function
- Converted hugetlbfs (generic / i386 / sparc64 / tile) to use the new
vm_unmapped_area() search function as well.

In this resend, I have kept Rik's Reviewed-by tags from the original
RFC submission for patches that haven't been updated other than applying
his suggestions.

Patch 1 is the validate_mm() fix from Bob Liu (+ fixed-the-fix from me :)

Patch 2 augments the VMA rbtree with a new rb_subtree_gap field,
indicating the length of the largest gap immediately preceding any
VMAs in a subtree.
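
For reference, the maintained invariant is essentially the following
(a sketch of the vma_compute_subtree_gap() helper that patch 2
introduces; see that patch for the authoritative version):

static unsigned long vma_compute_subtree_gap(struct vm_area_struct *vma)
{
	unsigned long max, subtree_gap;

	/* gap between this vma and its predecessor in address order */
	max = vma->vm_start - (vma->vm_prev ? vma->vm_prev->vm_end : 0);
	/* fold in the largest gap cached in either child subtree */
	if (vma->vm_rb.rb_left) {
		subtree_gap = rb_entry(vma->vm_rb.rb_left,
				struct vm_area_struct, vm_rb)->rb_subtree_gap;
		if (subtree_gap > max)
			max = subtree_gap;
	}
	if (vma->vm_rb.rb_right) {
		subtree_gap = rb_entry(vma->vm_rb.rb_right,
				struct vm_area_struct, vm_rb)->rb_subtree_gap;
		if (subtree_gap > max)
			max = subtree_gap;
	}
	return max;
}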

Patch 3 adds new checks to CONFIG_DEBUG_VM_RB to verify the above
information is correctly maintained.

Patch 4 rearranges the vm_area_struct layout so that rbtree searches only
need data that is contained in the first cacheline (this one is from
Rik's original patch series)
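
Roughly, the intent is the following field grouping (an illustration of
the idea, not the exact vm_area_struct definition):

struct vm_area_struct {
	/* Hot fields: everything an rbtree address search touches,
	 * kept together in the first cacheline. */
	unsigned long vm_start;		/* our start address within vm_mm */
	unsigned long vm_end;		/* first byte after our end address */
	struct vm_area_struct *vm_next, *vm_prev;	/* sorted VMA list */
	struct rb_node vm_rb;
	unsigned long rb_subtree_gap;	/* largest free gap in this subtree */
	/* colder fields (vm_mm, vm_page_prot, vm_flags, ...) follow */
};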

Patch 5 adds a generic vm_unmapped_area() search function, which
allows searching for an address range of any desired length, within
[low; high) address constraints, with any desired alignment.
The generic arch_get_unmapped_area[_topdown] functions are also converted
to use this.
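
With that, a converted caller boils down to something like this
(compare the x86_64 and mips conversions later in the series):

	struct vm_unmapped_area_info info;

	info.flags = 0;			/* 0 = bottom-up search */
	info.length = len;
	info.low_limit = TASK_UNMAPPED_BASE;	/* gap must fit in [low; high) */
	info.high_limit = TASK_SIZE;
	info.align_mask = 0;		/* no alignment constraint */
	info.align_offset = 0;
	return vm_unmapped_area(&info);	/* an address, or -ENOMEM */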

Patch 6 converts the x86_64 arch_get_unmapped_area[_topdown] functions
to use vm_unmapped_area() as well.

Patch 7 fixes cache coloring on x86_64, as suggested by Rik in his
previous series.

Patches 8 and 9 convert the generic and i386 hugetlbfs code to use
vm_unmapped_area()

Patches 10-16 convert extra architectures to use vm_unmapped_area()

I'm happy that this series removes more code than it adds, as calling
vm_unmapped_area() with the desired arguments is much shorter than
duplicating the brute force algorithm all over the place. There is
still a bit of repetition between various implementations of
arch_get_unmapped_area[_topdown] functions that could probably be
simplified somehow, but I feel we can keep that for a later step...

Michel Lespinasse (15):
mm: add anon_vma_lock to validate_mm()
mm: augment vma rbtree with rb_subtree_gap
mm: check rb_subtree_gap correctness
mm: vm_unmapped_area() lookup function
mm: use vm_unmapped_area() on x86_64 architecture
mm: fix cache coloring on x86_64 architecture
mm: use vm_unmapped_area() in hugetlbfs
mm: use vm_unmapped_area() in hugetlbfs on i386 architecture
mm: use vm_unmapped_area() on mips architecture
mm: use vm_unmapped_area() on arm architecture
mm: use vm_unmapped_area() on sh architecture
mm: use vm_unmapped_area() on sparc64 architecture
mm: use vm_unmapped_area() in hugetlbfs on sparc64 architecture
mm: use vm_unmapped_area() on sparc32 architecture
mm: use vm_unmapped_area() in hugetlbfs on tile architecture

Rik van Riel (1):
mm: rearrange vm_area_struct for fewer cache misses

arch/arm/mm/mmap.c | 119 ++--------
arch/mips/mm/mmap.c | 99 ++-------
arch/sh/mm/mmap.c | 126 ++---------
arch/sparc/kernel/sys_sparc_32.c | 24 +--
arch/sparc/kernel/sys_sparc_64.c | 132 +++---------
arch/sparc/mm/hugetlbpage.c | 123 +++--------
arch/tile/mm/hugetlbpage.c | 139 ++----------
arch/x86/include/asm/elf.h | 6 +-
arch/x86/kernel/sys_x86_64.c | 151 +++----------
arch/x86/mm/hugetlbpage.c | 130 ++---------
arch/x86/vdso/vma.c | 2 +-
fs/hugetlbfs/inode.c | 42 +---
include/linux/mm.h | 31 +++
include/linux/mm_types.h | 19 ++-
mm/mmap.c | 452 +++++++++++++++++++++++++++++---------
15 files changed, 616 insertions(+), 979 deletions(-)

--
1.7.7.3


2012-11-05 22:47:41

by Michel Lespinasse

Subject: [PATCH 03/16] mm: check rb_subtree_gap correctness

When CONFIG_DEBUG_VM_RB is enabled, check that rb_subtree_gap is
correctly set for every vma and that mm->highest_vm_end is also correct.

Also add an explicit 'bug' variable to track whether browse_rb()
detected any invalid condition.

Signed-off-by: Michel Lespinasse <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>

---
mm/mmap.c | 24 ++++++++++++++++--------
1 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 2de8bcfe859d..769a09cc71af 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -365,7 +365,7 @@ static void vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
#ifdef CONFIG_DEBUG_VM_RB
static int browse_rb(struct rb_root *root)
{
- int i = 0, j;
+ int i = 0, j, bug = 0;
struct rb_node *nd, *pn = NULL;
unsigned long prev = 0, pend = 0;

@@ -373,29 +373,33 @@ static int browse_rb(struct rb_root *root)
struct vm_area_struct *vma;
vma = rb_entry(nd, struct vm_area_struct, vm_rb);
if (vma->vm_start < prev)
- printk("vm_start %lx prev %lx\n", vma->vm_start, prev), i = -1;
+ printk("vm_start %lx prev %lx\n", vma->vm_start, prev), bug = 1;
if (vma->vm_start < pend)
- printk("vm_start %lx pend %lx\n", vma->vm_start, pend);
+ printk("vm_start %lx pend %lx\n", vma->vm_start, pend), bug = 1;
if (vma->vm_start > vma->vm_end)
- printk("vm_end %lx < vm_start %lx\n", vma->vm_end, vma->vm_start);
+ printk("vm_end %lx < vm_start %lx\n", vma->vm_end, vma->vm_start), bug = 1;
+ if (vma->rb_subtree_gap != vma_compute_subtree_gap(vma))
+ printk("free gap %lx, correct %lx\n",
+ vma->rb_subtree_gap,
+ vma_compute_subtree_gap(vma)), bug = 1;
i++;
pn = nd;
prev = vma->vm_start;
pend = vma->vm_end;
}
j = 0;
- for (nd = pn; nd; nd = rb_prev(nd)) {
+ for (nd = pn; nd; nd = rb_prev(nd))
j++;
- }
if (i != j)
- printk("backwards %d, forwards %d\n", j, i), i = 0;
- return i;
+ printk("backwards %d, forwards %d\n", j, i), bug = 1;
+ return bug ? -1 : i;
}

void validate_mm(struct mm_struct *mm)
{
int bug = 0;
int i = 0;
+ unsigned long highest_address = 0;
struct vm_area_struct *vma = mm->mmap;
while (vma) {
struct anon_vma_chain *avc;
@@ -403,11 +407,15 @@ void validate_mm(struct mm_struct *mm)
list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
anon_vma_interval_tree_verify(avc);
anon_vma_unlock(vma->anon_vma);
+ highest_address = vma->vm_end;
vma = vma->vm_next;
i++;
}
if (i != mm->map_count)
printk("map_count %d vm_next %d\n", mm->map_count, i), bug = 1;
+ if (highest_address != mm->highest_vm_end)
+ printk("mm->highest_vm_end %lx, found %lx\n",
+ mm->highest_vm_end, highest_address), bug = 1;
i = browse_rb(&mm->mm_rb);
if (i != mm->map_count)
printk("map_count %d rb %d\n", mm->map_count, i), bug = 1;
--
1.7.7.3

2012-11-05 22:47:52

by Michel Lespinasse

Subject: [PATCH 10/16] mm: use vm_unmapped_area() on mips architecture

Update the mips arch_get_unmapped_area[_topdown] functions to make
use of vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse <[email protected]>

---
arch/mips/mm/mmap.c | 99 +++++++++------------------------------------------
1 files changed, 17 insertions(+), 82 deletions(-)

diff --git a/arch/mips/mm/mmap.c b/arch/mips/mm/mmap.c
index 302d779d5b0d..e4b54b233e17 100644
--- a/arch/mips/mm/mmap.c
+++ b/arch/mips/mm/mmap.c
@@ -71,6 +71,7 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp,
struct vm_area_struct *vma;
unsigned long addr = addr0;
int do_color_align;
+ struct vm_unmapped_area_info info;

if (unlikely(len > TASK_SIZE))
return -ENOMEM;
@@ -107,97 +108,31 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp,
return addr;
}

- if (dir == UP) {
- addr = mm->mmap_base;
- if (do_color_align)
- addr = COLOUR_ALIGN(addr, pgoff);
- else
- addr = PAGE_ALIGN(addr);
+ info.length = len;
+ info.align_mask = do_color_align ? (PAGE_MASK & shm_align_mask) : 0;
+ info.align_offset = pgoff << PAGE_SHIFT;

- for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) {
- /* At this point: (!vma || addr < vma->vm_end). */
- if (TASK_SIZE - len < addr)
- return -ENOMEM;
- if (!vma || addr + len <= vma->vm_start)
- return addr;
- addr = vma->vm_end;
- if (do_color_align)
- addr = COLOUR_ALIGN(addr, pgoff);
- }
- } else {
- /* check if free_area_cache is useful for us */
- if (len <= mm->cached_hole_size) {
- mm->cached_hole_size = 0;
- mm->free_area_cache = mm->mmap_base;
- }
+ if (dir == DOWN) {
+ info.flags = VM_UNMAPPED_AREA_TOPDOWN;
+ info.low_limit = PAGE_SIZE;
+ info.high_limit = mm->mmap_base;
+ addr = vm_unmapped_area(&info);
+
+ if (!(addr & ~PAGE_MASK))
+ return addr;

- /*
- * either no address requested, or the mapping can't fit into
- * the requested address hole
- */
- addr = mm->free_area_cache;
- if (do_color_align) {
- unsigned long base =
- COLOUR_ALIGN_DOWN(addr - len, pgoff);
- addr = base + len;
- }
-
- /* make sure it can fit in the remaining address space */
- if (likely(addr > len)) {
- vma = find_vma(mm, addr - len);
- if (!vma || addr <= vma->vm_start) {
- /* cache the address as a hint for next time */
- return mm->free_area_cache = addr - len;
- }
- }
-
- if (unlikely(mm->mmap_base < len))
- goto bottomup;
-
- addr = mm->mmap_base - len;
- if (do_color_align)
- addr = COLOUR_ALIGN_DOWN(addr, pgoff);
-
- do {
- /*
- * Lookup failure means no vma is above this address,
- * else if new region fits below vma->vm_start,
- * return with success:
- */
- vma = find_vma(mm, addr);
- if (likely(!vma || addr + len <= vma->vm_start)) {
- /* cache the address as a hint for next time */
- return mm->free_area_cache = addr;
- }
-
- /* remember the largest hole we saw so far */
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
-
- /* try just below the current vma->vm_start */
- addr = vma->vm_start - len;
- if (do_color_align)
- addr = COLOUR_ALIGN_DOWN(addr, pgoff);
- } while (likely(len < vma->vm_start));
-
-bottomup:
/*
* A failed mmap() very likely causes application failure,
* so fall back to the bottom-up function here. This scenario
* can happen with large stack limits and large mmap()
* allocations.
*/
- mm->cached_hole_size = ~0UL;
- mm->free_area_cache = TASK_UNMAPPED_BASE;
- addr = arch_get_unmapped_area(filp, addr0, len, pgoff, flags);
- /*
- * Restore the topdown base:
- */
- mm->free_area_cache = mm->mmap_base;
- mm->cached_hole_size = ~0UL;
-
- return addr;
}
+
+ info.flags = 0;
+ info.low_limit = mm->mmap_base;
+ info.high_limit = TASK_SIZE;
+ return vm_unmapped_area(&info);
}

unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr0,
--
1.7.7.3

2012-11-05 22:47:57

by Michel Lespinasse

Subject: [PATCH 12/16] mm: use vm_unmapped_area() on sh architecture

Update the sh arch_get_unmapped_area[_topdown] functions to make
use of vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse <[email protected]>

---
arch/sh/mm/mmap.c | 126 ++++++++++-------------------------------------------
1 files changed, 24 insertions(+), 102 deletions(-)

diff --git a/arch/sh/mm/mmap.c b/arch/sh/mm/mmap.c
index afeb710ec5c3..acb3b8f71908 100644
--- a/arch/sh/mm/mmap.c
+++ b/arch/sh/mm/mmap.c
@@ -49,6 +49,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
struct vm_area_struct *vma;
unsigned long start_addr;
int do_colour_align;
+ struct vm_unmapped_area_info info;

if (flags & MAP_FIXED) {
/* We do not accept a shared mapping if it would violate
@@ -79,47 +80,13 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
return addr;
}

- if (len > mm->cached_hole_size) {
- start_addr = addr = mm->free_area_cache;
- } else {
- mm->cached_hole_size = 0;
- start_addr = addr = TASK_UNMAPPED_BASE;
- }
-
-full_search:
- if (do_colour_align)
- addr = COLOUR_ALIGN(addr, pgoff);
- else
- addr = PAGE_ALIGN(mm->free_area_cache);
-
- for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
- /* At this point: (!vma || addr < vma->vm_end). */
- if (unlikely(TASK_SIZE - len < addr)) {
- /*
- * Start a new search - just in case we missed
- * some holes.
- */
- if (start_addr != TASK_UNMAPPED_BASE) {
- start_addr = addr = TASK_UNMAPPED_BASE;
- mm->cached_hole_size = 0;
- goto full_search;
- }
- return -ENOMEM;
- }
- if (likely(!vma || addr + len <= vma->vm_start)) {
- /*
- * Remember the place where we stopped the search:
- */
- mm->free_area_cache = addr + len;
- return addr;
- }
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
-
- addr = vma->vm_end;
- if (do_colour_align)
- addr = COLOUR_ALIGN(addr, pgoff);
- }
+ info.flags = 0;
+ info.length = len;
+ info.low_limit = TASK_UNMAPPED_BASE;
+ info.high_limit = TASK_SIZE;
+ info.align_mask = do_colour_align ? (PAGE_MASK & shm_align_mask) : 0;
+ info.align_offset = pgoff << PAGE_SHIFT;
+ return vm_unmapped_area(&info);
}

unsigned long
@@ -131,6 +98,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
struct mm_struct *mm = current->mm;
unsigned long addr = addr0;
int do_colour_align;
+ struct vm_unmapped_area_info info;

if (flags & MAP_FIXED) {
/* We do not accept a shared mapping if it would violate
@@ -162,73 +130,27 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
return addr;
}

- /* check if free_area_cache is useful for us */
- if (len <= mm->cached_hole_size) {
- mm->cached_hole_size = 0;
- mm->free_area_cache = mm->mmap_base;
- }
-
- /* either no address requested or can't fit in requested address hole */
- addr = mm->free_area_cache;
- if (do_colour_align) {
- unsigned long base = COLOUR_ALIGN_DOWN(addr-len, pgoff);
-
- addr = base + len;
- }
-
- /* make sure it can fit in the remaining address space */
- if (likely(addr > len)) {
- vma = find_vma(mm, addr-len);
- if (!vma || addr <= vma->vm_start) {
- /* remember the address as a hint for next time */
- return (mm->free_area_cache = addr-len);
- }
- }
-
- if (unlikely(mm->mmap_base < len))
- goto bottomup;
-
- addr = mm->mmap_base-len;
- if (do_colour_align)
- addr = COLOUR_ALIGN_DOWN(addr, pgoff);
-
- do {
- /*
- * Lookup failure means no vma is above this address,
- * else if new region fits below vma->vm_start,
- * return with success:
- */
- vma = find_vma(mm, addr);
- if (likely(!vma || addr+len <= vma->vm_start)) {
- /* remember the address as a hint for next time */
- return (mm->free_area_cache = addr);
- }
-
- /* remember the largest hole we saw so far */
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
-
- /* try just below the current vma->vm_start */
- addr = vma->vm_start-len;
- if (do_colour_align)
- addr = COLOUR_ALIGN_DOWN(addr, pgoff);
- } while (likely(len < vma->vm_start));
-
-bottomup:
+ info.flags = VM_UNMAPPED_AREA_TOPDOWN;
+ info.length = len;
+ info.low_limit = PAGE_SIZE;
+ info.high_limit = mm->mmap_base;
+ info.align_mask = do_colour_align ? (PAGE_MASK & shm_align_mask) : 0;
+ info.align_offset = pgoff << PAGE_SHIFT;
+ addr = vm_unmapped_area(&info);
+
/*
* A failed mmap() very likely causes application failure,
* so fall back to the bottom-up function here. This scenario
* can happen with large stack limits and large mmap()
* allocations.
*/
- mm->cached_hole_size = ~0UL;
- mm->free_area_cache = TASK_UNMAPPED_BASE;
- addr = arch_get_unmapped_area(filp, addr0, len, pgoff, flags);
- /*
- * Restore the topdown base:
- */
- mm->free_area_cache = mm->mmap_base;
- mm->cached_hole_size = ~0UL;
+ if (addr & ~PAGE_MASK) {
+ VM_BUG_ON(addr != -ENOMEM);
+ info.flags = 0;
+ info.low_limit = TASK_UNMAPPED_BASE;
+ info.high_limit = TASK_SIZE;
+ addr = vm_unmapped_area(&info);
+ }

return addr;
}
--
1.7.7.3

2012-11-05 22:48:01

by Michel Lespinasse

Subject: [PATCH 15/16] mm: use vm_unmapped_area() on sparc32 architecture

Update the sparc32 arch_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse <[email protected]>

---
arch/sparc/kernel/sys_sparc_32.c | 24 +++++++++---------------
1 files changed, 9 insertions(+), 15 deletions(-)

diff --git a/arch/sparc/kernel/sys_sparc_32.c b/arch/sparc/kernel/sys_sparc_32.c
index 0c9b31b22e07..653899849b27 100644
--- a/arch/sparc/kernel/sys_sparc_32.c
+++ b/arch/sparc/kernel/sys_sparc_32.c
@@ -39,6 +39,7 @@ asmlinkage unsigned long sys_getpagesize(void)
unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsigned long len, unsigned long pgoff, unsigned long flags)
{
struct vm_area_struct * vmm;
+ struct vm_unmapped_area_info info;

if (flags & MAP_FIXED) {
/* We do not accept a shared mapping if it would violate
@@ -56,21 +57,14 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsi
if (!addr)
addr = TASK_UNMAPPED_BASE;

- if (flags & MAP_SHARED)
- addr = COLOUR_ALIGN(addr);
- else
- addr = PAGE_ALIGN(addr);
-
- for (vmm = find_vma(current->mm, addr); ; vmm = vmm->vm_next) {
- /* At this point: (!vmm || addr < vmm->vm_end). */
- if (TASK_SIZE - PAGE_SIZE - len < addr)
- return -ENOMEM;
- if (!vmm || addr + len <= vmm->vm_start)
- return addr;
- addr = vmm->vm_end;
- if (flags & MAP_SHARED)
- addr = COLOUR_ALIGN(addr);
- }
+ info.flags = 0;
+ info.length = len;
+ info.low_limit = addr;
+ info.high_limit = TASK_SIZE;
+ info.align_mask = (flags & MAP_SHARED) ?
+ (PAGE_MASK & (SHMLBA - 1)) : 0;
+ info.align_offset = pgoff << PAGE_SHIFT;
+ return vm_unmapped_area(&info);
}

/*
--
1.7.7.3

2012-11-05 22:48:21

by Michel Lespinasse

Subject: [PATCH 16/16] mm: use vm_unmapped_area() in hugetlbfs on tile architecture

Update the tile hugetlb_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse <[email protected]>

---
arch/tile/mm/hugetlbpage.c | 139 ++++++++------------------------------------
1 files changed, 25 insertions(+), 114 deletions(-)

diff --git a/arch/tile/mm/hugetlbpage.c b/arch/tile/mm/hugetlbpage.c
index 812e2d037972..6f74cce053e1 100644
--- a/arch/tile/mm/hugetlbpage.c
+++ b/arch/tile/mm/hugetlbpage.c
@@ -231,42 +231,15 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
unsigned long pgoff, unsigned long flags)
{
struct hstate *h = hstate_file(file);
- struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma;
- unsigned long start_addr;
-
- if (len > mm->cached_hole_size) {
- start_addr = mm->free_area_cache;
- } else {
- start_addr = TASK_UNMAPPED_BASE;
- mm->cached_hole_size = 0;
- }
-
-full_search:
- addr = ALIGN(start_addr, huge_page_size(h));
-
- for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
- /* At this point: (!vma || addr < vma->vm_end). */
- if (TASK_SIZE - len < addr) {
- /*
- * Start a new search - just in case we missed
- * some holes.
- */
- if (start_addr != TASK_UNMAPPED_BASE) {
- start_addr = TASK_UNMAPPED_BASE;
- mm->cached_hole_size = 0;
- goto full_search;
- }
- return -ENOMEM;
- }
- if (!vma || addr + len <= vma->vm_start) {
- mm->free_area_cache = addr + len;
- return addr;
- }
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
- addr = ALIGN(vma->vm_end, huge_page_size(h));
- }
+ struct vm_unmapped_area_info info;
+
+ info.flags = 0;
+ info.length = len;
+ info.low_limit = TASK_UNMAPPED_BASE;
+ info.high_limit = TASK_SIZE;
+ info.align_mask = PAGE_MASK & ~huge_page_mask(h);
+ info.align_offset = 0;
+ return vm_unmapped_area(&info);
}

static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
@@ -274,92 +247,30 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
unsigned long pgoff, unsigned long flags)
{
struct hstate *h = hstate_file(file);
- struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma, *prev_vma;
- unsigned long base = mm->mmap_base, addr = addr0;
- unsigned long largest_hole = mm->cached_hole_size;
- int first_time = 1;
-
- /* don't allow allocations above current base */
- if (mm->free_area_cache > base)
- mm->free_area_cache = base;
-
- if (len <= largest_hole) {
- largest_hole = 0;
- mm->free_area_cache = base;
- }
-try_again:
- /* make sure it can fit in the remaining address space */
- if (mm->free_area_cache < len)
- goto fail;
-
- /* either no address requested or can't fit in requested address hole */
- addr = (mm->free_area_cache - len) & huge_page_mask(h);
- do {
- /*
- * Lookup failure means no vma is above this address,
- * i.e. return with success:
- */
- vma = find_vma_prev(mm, addr, &prev_vma);
- if (!vma) {
- return addr;
- break;
- }
-
- /*
- * new region fits between prev_vma->vm_end and
- * vma->vm_start, use it:
- */
- if (addr + len <= vma->vm_start &&
- (!prev_vma || (addr >= prev_vma->vm_end))) {
- /* remember the address as a hint for next time */
- mm->cached_hole_size = largest_hole;
- mm->free_area_cache = addr;
- return addr;
- } else {
- /* pull free_area_cache down to the first hole */
- if (mm->free_area_cache == vma->vm_end) {
- mm->free_area_cache = vma->vm_start;
- mm->cached_hole_size = largest_hole;
- }
- }
+ struct vm_unmapped_area_info info;
+ unsigned long addr;

- /* remember the largest hole we saw so far */
- if (addr + largest_hole < vma->vm_start)
- largest_hole = vma->vm_start - addr;
+ info.flags = VM_UNMAPPED_AREA_TOPDOWN;
+ info.length = len;
+ info.low_limit = PAGE_SIZE;
+ info.high_limit = current->mm->mmap_base;
+ info.align_mask = PAGE_MASK & ~huge_page_mask(h);
+ info.align_offset = 0;
+ addr = vm_unmapped_area(&info);

- /* try just below the current vma->vm_start */
- addr = (vma->vm_start - len) & huge_page_mask(h);
-
- } while (len <= vma->vm_start);
-
-fail:
- /*
- * if hint left us with no space for the requested
- * mapping then try again:
- */
- if (first_time) {
- mm->free_area_cache = base;
- largest_hole = 0;
- first_time = 0;
- goto try_again;
- }
/*
* A failed mmap() very likely causes application failure,
* so fall back to the bottom-up function here. This scenario
* can happen with large stack limits and large mmap()
* allocations.
*/
- mm->free_area_cache = TASK_UNMAPPED_BASE;
- mm->cached_hole_size = ~0UL;
- addr = hugetlb_get_unmapped_area_bottomup(file, addr0,
- len, pgoff, flags);
-
- /*
- * Restore the topdown base:
- */
- mm->free_area_cache = base;
- mm->cached_hole_size = ~0UL;
+ if (addr & ~PAGE_MASK) {
+ VM_BUG_ON(addr != -ENOMEM);
+ info.flags = 0;
+ info.low_limit = TASK_UNMAPPED_BASE;
+ info.high_limit = TASK_SIZE;
+ addr = vm_unmapped_area(&info);
+ }

return addr;
}
--
1.7.7.3

2012-11-05 22:48:45

by Michel Lespinasse

Subject: [PATCH 14/16] mm: use vm_unmapped_area() in hugetlbfs on sparc64 architecture

Update the sparc64 hugetlb_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse <[email protected]>

---
arch/sparc/mm/hugetlbpage.c | 123 ++++++++++--------------------------------
1 files changed, 30 insertions(+), 93 deletions(-)

diff --git a/arch/sparc/mm/hugetlbpage.c b/arch/sparc/mm/hugetlbpage.c
index f76f83d5ac63..42e5dba6cb26 100644
--- a/arch/sparc/mm/hugetlbpage.c
+++ b/arch/sparc/mm/hugetlbpage.c
@@ -30,55 +30,28 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *filp,
unsigned long pgoff,
unsigned long flags)
{
- struct mm_struct *mm = current->mm;
- struct vm_area_struct * vma;
unsigned long task_size = TASK_SIZE;
- unsigned long start_addr;
+ struct vm_unmapped_area_info info;

if (test_thread_flag(TIF_32BIT))
task_size = STACK_TOP32;
- if (unlikely(len >= VA_EXCLUDE_START))
- return -ENOMEM;

- if (len > mm->cached_hole_size) {
- start_addr = addr = mm->free_area_cache;
- } else {
- start_addr = addr = TASK_UNMAPPED_BASE;
- mm->cached_hole_size = 0;
+ info.flags = 0;
+ info.length = len;
+ info.low_limit = TASK_UNMAPPED_BASE;
+ info.high_limit = min(task_size, VA_EXCLUDE_START);
+ info.align_mask = PAGE_MASK & ~HPAGE_MASK;
+ info.align_offset = 0;
+ addr = vm_unmapped_area(&info);
+
+ if ((addr & ~PAGE_MASK) && task_size > VA_EXCLUDE_END) {
+ VM_BUG_ON(addr != -ENOMEM);
+ info.low_limit = VA_EXCLUDE_END;
+ info.high_limit = task_size;
+ addr = vm_unmapped_area(&info);
}

- task_size -= len;
-
-full_search:
- addr = ALIGN(addr, HPAGE_SIZE);
-
- for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
- /* At this point: (!vma || addr < vma->vm_end). */
- if (addr < VA_EXCLUDE_START &&
- (addr + len) >= VA_EXCLUDE_START) {
- addr = VA_EXCLUDE_END;
- vma = find_vma(mm, VA_EXCLUDE_END);
- }
- if (unlikely(task_size < addr)) {
- if (start_addr != TASK_UNMAPPED_BASE) {
- start_addr = addr = TASK_UNMAPPED_BASE;
- mm->cached_hole_size = 0;
- goto full_search;
- }
- return -ENOMEM;
- }
- if (likely(!vma || addr + len <= vma->vm_start)) {
- /*
- * Remember the place where we stopped the search:
- */
- mm->free_area_cache = addr + len;
- return addr;
- }
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
-
- addr = ALIGN(vma->vm_end, HPAGE_SIZE);
- }
+ return addr;
}

static unsigned long
@@ -90,68 +63,32 @@ hugetlb_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
unsigned long addr = addr0;
+ struct vm_unmapped_area_info info;

/* This should only ever run for 32-bit processes. */
BUG_ON(!test_thread_flag(TIF_32BIT));

- /* check if free_area_cache is useful for us */
- if (len <= mm->cached_hole_size) {
- mm->cached_hole_size = 0;
- mm->free_area_cache = mm->mmap_base;
- }
-
- /* either no address requested or can't fit in requested address hole */
- addr = mm->free_area_cache & HPAGE_MASK;
-
- /* make sure it can fit in the remaining address space */
- if (likely(addr > len)) {
- vma = find_vma(mm, addr-len);
- if (!vma || addr <= vma->vm_start) {
- /* remember the address as a hint for next time */
- return (mm->free_area_cache = addr-len);
- }
- }
-
- if (unlikely(mm->mmap_base < len))
- goto bottomup;
-
- addr = (mm->mmap_base-len) & HPAGE_MASK;
-
- do {
- /*
- * Lookup failure means no vma is above this address,
- * else if new region fits below vma->vm_start,
- * return with success:
- */
- vma = find_vma(mm, addr);
- if (likely(!vma || addr+len <= vma->vm_start)) {
- /* remember the address as a hint for next time */
- return (mm->free_area_cache = addr);
- }
+ info.flags = VM_UNMAPPED_AREA_TOPDOWN;
+ info.length = len;
+ info.low_limit = PAGE_SIZE;
+ info.high_limit = mm->mmap_base;
+ info.align_mask = PAGE_MASK & ~HPAGE_MASK;
+ info.align_offset = 0;
+ addr = vm_unmapped_area(&info);

- /* remember the largest hole we saw so far */
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
-
- /* try just below the current vma->vm_start */
- addr = (vma->vm_start-len) & HPAGE_MASK;
- } while (likely(len < vma->vm_start));
-
-bottomup:
/*
* A failed mmap() very likely causes application failure,
* so fall back to the bottom-up function here. This scenario
* can happen with large stack limits and large mmap()
* allocations.
*/
- mm->cached_hole_size = ~0UL;
- mm->free_area_cache = TASK_UNMAPPED_BASE;
- addr = arch_get_unmapped_area(filp, addr0, len, pgoff, flags);
- /*
- * Restore the topdown base:
- */
- mm->free_area_cache = mm->mmap_base;
- mm->cached_hole_size = ~0UL;
+ if (addr & ~PAGE_MASK) {
+ VM_BUG_ON(addr != -ENOMEM);
+ info.flags = 0;
+ info.low_limit = TASK_UNMAPPED_BASE;
+ info.high_limit = STACK_TOP32;
+ addr = vm_unmapped_area(&info);
+ }

return addr;
}
--
1.7.7.3

2012-11-05 22:48:47

by Michel Lespinasse

Subject: [PATCH 13/16] mm: use vm_unmapped_area() on sparc64 architecture

Update the sparc64 arch_get_unmapped_area[_topdown] functions to make
use of vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse <[email protected]>

---
arch/sparc/kernel/sys_sparc_64.c | 132 +++++++++-----------------------------
1 files changed, 30 insertions(+), 102 deletions(-)

diff --git a/arch/sparc/kernel/sys_sparc_64.c b/arch/sparc/kernel/sys_sparc_64.c
index 11c6c9603e71..2999253ff38a 100644
--- a/arch/sparc/kernel/sys_sparc_64.c
+++ b/arch/sparc/kernel/sys_sparc_64.c
@@ -102,6 +102,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsi
unsigned long task_size = TASK_SIZE;
unsigned long start_addr;
int do_color_align;
+ struct vm_unmapped_area_info info;

if (flags & MAP_FIXED) {
/* We do not accept a shared mapping if it would violate
@@ -134,50 +135,22 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr, unsi
return addr;
}

- if (len > mm->cached_hole_size) {
- start_addr = addr = mm->free_area_cache;
- } else {
- start_addr = addr = TASK_UNMAPPED_BASE;
- mm->cached_hole_size = 0;
+ info.flags = 0;
+ info.length = len;
+ info.low_limit = TASK_UNMAPPED_BASE;
+ info.high_limit = min(task_size, VA_EXCLUDE_START);
+ info.align_mask = do_color_align ? (PAGE_MASK & (SHMLBA - 1)) : 0;
+ info.align_offset = pgoff << PAGE_SHIFT;
+ addr = vm_unmapped_area(&info);
+
+ if ((addr & ~PAGE_MASK) && task_size > VA_EXCLUDE_END) {
+ VM_BUG_ON(addr != -ENOMEM);
+ info.low_limit = VA_EXCLUDE_END;
+ info.high_limit = task_size;
+ addr = vm_unmapped_area(&info);
}

- task_size -= len;
-
-full_search:
- if (do_color_align)
- addr = COLOUR_ALIGN(addr, pgoff);
- else
- addr = PAGE_ALIGN(addr);
-
- for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
- /* At this point: (!vma || addr < vma->vm_end). */
- if (addr < VA_EXCLUDE_START &&
- (addr + len) >= VA_EXCLUDE_START) {
- addr = VA_EXCLUDE_END;
- vma = find_vma(mm, VA_EXCLUDE_END);
- }
- if (unlikely(task_size < addr)) {
- if (start_addr != TASK_UNMAPPED_BASE) {
- start_addr = addr = TASK_UNMAPPED_BASE;
- mm->cached_hole_size = 0;
- goto full_search;
- }
- return -ENOMEM;
- }
- if (likely(!vma || addr + len <= vma->vm_start)) {
- /*
- * Remember the place where we stopped the search:
- */
- mm->free_area_cache = addr + len;
- return addr;
- }
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
-
- addr = vma->vm_end;
- if (do_color_align)
- addr = COLOUR_ALIGN(addr, pgoff);
- }
+ return addr;
}

unsigned long
@@ -190,6 +163,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
unsigned long task_size = STACK_TOP32;
unsigned long addr = addr0;
int do_color_align;
+ struct vm_unmapped_area_info info;

/* This should only ever run for 32-bit processes. */
BUG_ON(!test_thread_flag(TIF_32BIT));
@@ -224,73 +198,27 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
return addr;
}

- /* check if free_area_cache is useful for us */
- if (len <= mm->cached_hole_size) {
- mm->cached_hole_size = 0;
- mm->free_area_cache = mm->mmap_base;
- }
-
- /* either no address requested or can't fit in requested address hole */
- addr = mm->free_area_cache;
- if (do_color_align) {
- unsigned long base = COLOUR_ALIGN_DOWN(addr-len, pgoff);
-
- addr = base + len;
- }
+ info.flags = VM_UNMAPPED_AREA_TOPDOWN;
+ info.length = len;
+ info.low_limit = PAGE_SIZE;
+ info.high_limit = mm->mmap_base;
+ info.align_mask = do_color_align ? (PAGE_MASK & (SHMLBA - 1)) : 0;
+ info.align_offset = pgoff << PAGE_SHIFT;
+ addr = vm_unmapped_area(&info);

- /* make sure it can fit in the remaining address space */
- if (likely(addr > len)) {
- vma = find_vma(mm, addr-len);
- if (!vma || addr <= vma->vm_start) {
- /* remember the address as a hint for next time */
- return (mm->free_area_cache = addr-len);
- }
- }
-
- if (unlikely(mm->mmap_base < len))
- goto bottomup;
-
- addr = mm->mmap_base-len;
- if (do_color_align)
- addr = COLOUR_ALIGN_DOWN(addr, pgoff);
-
- do {
- /*
- * Lookup failure means no vma is above this address,
- * else if new region fits below vma->vm_start,
- * return with success:
- */
- vma = find_vma(mm, addr);
- if (likely(!vma || addr+len <= vma->vm_start)) {
- /* remember the address as a hint for next time */
- return (mm->free_area_cache = addr);
- }
-
- /* remember the largest hole we saw so far */
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
-
- /* try just below the current vma->vm_start */
- addr = vma->vm_start-len;
- if (do_color_align)
- addr = COLOUR_ALIGN_DOWN(addr, pgoff);
- } while (likely(len < vma->vm_start));
-
-bottomup:
/*
* A failed mmap() very likely causes application failure,
* so fall back to the bottom-up function here. This scenario
* can happen with large stack limits and large mmap()
* allocations.
*/
- mm->cached_hole_size = ~0UL;
- mm->free_area_cache = TASK_UNMAPPED_BASE;
- addr = arch_get_unmapped_area(filp, addr0, len, pgoff, flags);
- /*
- * Restore the topdown base:
- */
- mm->free_area_cache = mm->mmap_base;
- mm->cached_hole_size = ~0UL;
+ if (addr & ~PAGE_MASK) {
+ VM_BUG_ON(addr != -ENOMEM);
+ info.flags = 0;
+ info.low_limit = TASK_UNMAPPED_BASE;
+ info.high_limit = STACK_TOP32;
+ addr = vm_unmapped_area(&info);
+ }

return addr;
}
--
1.7.7.3

2012-11-05 22:47:49

by Michel Lespinasse

Subject: [PATCH 08/16] mm: use vm_unmapped_area() in hugetlbfs

Update the hugetlb_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse <[email protected]>

---
fs/hugetlbfs/inode.c | 42 ++++++++----------------------------------
1 files changed, 8 insertions(+), 34 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index c5bc355d8243..14bc0c1fb4fb 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -151,8 +151,8 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
- unsigned long start_addr;
struct hstate *h = hstate_file(file);
+ struct vm_unmapped_area_info info;

if (len & ~huge_page_mask(h))
return -EINVAL;
@@ -173,39 +173,13 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
return addr;
}

- if (len > mm->cached_hole_size)
- start_addr = mm->free_area_cache;
- else {
- start_addr = TASK_UNMAPPED_BASE;
- mm->cached_hole_size = 0;
- }
-
-full_search:
- addr = ALIGN(start_addr, huge_page_size(h));
-
- for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
- /* At this point: (!vma || addr < vma->vm_end). */
- if (TASK_SIZE - len < addr) {
- /*
- * Start a new search - just in case we missed
- * some holes.
- */
- if (start_addr != TASK_UNMAPPED_BASE) {
- start_addr = TASK_UNMAPPED_BASE;
- mm->cached_hole_size = 0;
- goto full_search;
- }
- return -ENOMEM;
- }
-
- if (!vma || addr + len <= vma->vm_start) {
- mm->free_area_cache = addr + len;
- return addr;
- }
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
- addr = ALIGN(vma->vm_end, huge_page_size(h));
- }
+ info.flags = 0;
+ info.length = len;
+ info.low_limit = TASK_UNMAPPED_BASE;
+ info.high_limit = TASK_SIZE;
+ info.align_mask = PAGE_MASK & ~huge_page_mask(h);
+ info.align_offset = 0;
+ return vm_unmapped_area(&info);
}
#endif

--
1.7.7.3

2012-11-05 22:49:52

by Michel Lespinasse

Subject: [PATCH 11/16] mm: use vm_unmapped_area() on arm architecture

Update the arm arch_get_unmapped_area[_topdown] functions to make
use of vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse <[email protected]>

---
arch/arm/mm/mmap.c | 119 ++++++++++------------------------------------------
1 files changed, 23 insertions(+), 96 deletions(-)

diff --git a/arch/arm/mm/mmap.c b/arch/arm/mm/mmap.c
index ce8cb1970d7a..267e63f58098 100644
--- a/arch/arm/mm/mmap.c
+++ b/arch/arm/mm/mmap.c
@@ -72,6 +72,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
unsigned long start_addr;
int do_align = 0;
int aliasing = cache_is_vipt_aliasing();
+ struct vm_unmapped_area_info info;

/*
* We only need to do colour alignment if either the I or D
@@ -104,46 +105,14 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
(!vma || addr + len <= vma->vm_start))
return addr;
}
- if (len > mm->cached_hole_size) {
- start_addr = addr = mm->free_area_cache;
- } else {
- start_addr = addr = mm->mmap_base;
- mm->cached_hole_size = 0;
- }

-full_search:
- if (do_align)
- addr = COLOUR_ALIGN(addr, pgoff);
- else
- addr = PAGE_ALIGN(addr);
-
- for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
- /* At this point: (!vma || addr < vma->vm_end). */
- if (TASK_SIZE - len < addr) {
- /*
- * Start a new search - just in case we missed
- * some holes.
- */
- if (start_addr != TASK_UNMAPPED_BASE) {
- start_addr = addr = TASK_UNMAPPED_BASE;
- mm->cached_hole_size = 0;
- goto full_search;
- }
- return -ENOMEM;
- }
- if (!vma || addr + len <= vma->vm_start) {
- /*
- * Remember the place where we stopped the search:
- */
- mm->free_area_cache = addr + len;
- return addr;
- }
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
- addr = vma->vm_end;
- if (do_align)
- addr = COLOUR_ALIGN(addr, pgoff);
- }
+ info.flags = 0;
+ info.length = len;
+ info.low_limit = mm->mmap_base;
+ info.high_limit = TASK_SIZE;
+ info.align_mask = do_align ? (PAGE_MASK & (SHMLBA - 1)) : 0;
+ info.align_offset = pgoff << PAGE_SHIFT;
+ return vm_unmapped_area(&info);
}

unsigned long
@@ -156,6 +125,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
unsigned long addr = addr0;
int do_align = 0;
int aliasing = cache_is_vipt_aliasing();
+ struct vm_unmapped_area_info info;

/*
* We only need to do colour alignment if either the I or D
@@ -187,70 +157,27 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
return addr;
}

- /* check if free_area_cache is useful for us */
- if (len <= mm->cached_hole_size) {
- mm->cached_hole_size = 0;
- mm->free_area_cache = mm->mmap_base;
- }
-
- /* either no address requested or can't fit in requested address hole */
- addr = mm->free_area_cache;
- if (do_align) {
- unsigned long base = COLOUR_ALIGN_DOWN(addr - len, pgoff);
- addr = base + len;
- }
-
- /* make sure it can fit in the remaining address space */
- if (addr > len) {
- vma = find_vma(mm, addr-len);
- if (!vma || addr <= vma->vm_start)
- /* remember the address as a hint for next time */
- return (mm->free_area_cache = addr-len);
- }
+ info.flags = VM_UNMAPPED_AREA_TOPDOWN;
+ info.length = len;
+ info.low_limit = PAGE_SIZE;
+ info.high_limit = mm->mmap_base;
+ info.align_mask = do_align ? (PAGE_MASK & (SHMLBA - 1)) : 0;
+ info.align_offset = pgoff << PAGE_SHIFT;
+ addr = vm_unmapped_area(&info);

- if (mm->mmap_base < len)
- goto bottomup;
-
- addr = mm->mmap_base - len;
- if (do_align)
- addr = COLOUR_ALIGN_DOWN(addr, pgoff);
-
- do {
- /*
- * Lookup failure means no vma is above this address,
- * else if new region fits below vma->vm_start,
- * return with success:
- */
- vma = find_vma(mm, addr);
- if (!vma || addr+len <= vma->vm_start)
- /* remember the address as a hint for next time */
- return (mm->free_area_cache = addr);
-
- /* remember the largest hole we saw so far */
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
-
- /* try just below the current vma->vm_start */
- addr = vma->vm_start - len;
- if (do_align)
- addr = COLOUR_ALIGN_DOWN(addr, pgoff);
- } while (len < vma->vm_start);
-
-bottomup:
/*
* A failed mmap() very likely causes application failure,
* so fall back to the bottom-up function here. This scenario
* can happen with large stack limits and large mmap()
* allocations.
*/
- mm->cached_hole_size = ~0UL;
- mm->free_area_cache = TASK_UNMAPPED_BASE;
- addr = arch_get_unmapped_area(filp, addr0, len, pgoff, flags);
- /*
- * Restore the topdown base:
- */
- mm->free_area_cache = mm->mmap_base;
- mm->cached_hole_size = ~0UL;
+ if (addr & ~PAGE_MASK) {
+ VM_BUG_ON(addr != -ENOMEM);
+ info.flags = 0;
+ info.low_limit = mm->mmap_base;
+ info.high_limit = TASK_SIZE;
+ addr = vm_unmapped_area(&info);
+ }

return addr;
}
--
1.7.7.3

2012-11-05 22:47:45

by Michel Lespinasse

Subject: [PATCH 06/16] mm: use vm_unmapped_area() on x86_64 architecture

Update the x86_64 arch_get_unmapped_area[_topdown] functions to make
use of vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>

---
arch/x86/include/asm/elf.h | 6 +-
arch/x86/kernel/sys_x86_64.c | 151 ++++++++---------------------------------
arch/x86/vdso/vma.c | 2 +-
3 files changed, 33 insertions(+), 126 deletions(-)

diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 5939f44fe0c0..9c999c1674fa 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -354,12 +354,10 @@ static inline int mmap_is_ia32(void)
return 0;
}

-/* The first two values are special, do not change. See align_addr() */
+/* Do not change the values. See get_align_mask() */
enum align_flags {
ALIGN_VA_32 = BIT(0),
ALIGN_VA_64 = BIT(1),
- ALIGN_VDSO = BIT(2),
- ALIGN_TOPDOWN = BIT(3),
};

struct va_alignment {
@@ -368,5 +366,5 @@ struct va_alignment {
} ____cacheline_aligned;

extern struct va_alignment va_align;
-extern unsigned long align_addr(unsigned long, struct file *, enum align_flags);
+extern unsigned long align_vdso_addr(unsigned long);
#endif /* _ASM_X86_ELF_H */
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index b4d3c3927dd8..f00d006d60fd 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -21,37 +21,23 @@

/*
* Align a virtual address to avoid aliasing in the I$ on AMD F15h.
- *
- * @flags denotes the allocation direction - bottomup or topdown -
- * or vDSO; see call sites below.
*/
-unsigned long align_addr(unsigned long addr, struct file *filp,
- enum align_flags flags)
+static unsigned long get_align_mask(void)
{
- unsigned long tmp_addr;
-
/* handle 32- and 64-bit case with a single conditional */
if (va_align.flags < 0 || !(va_align.flags & (2 - mmap_is_ia32())))
- return addr;
+ return 0;

if (!(current->flags & PF_RANDOMIZE))
- return addr;
-
- if (!((flags & ALIGN_VDSO) || filp))
- return addr;
-
- tmp_addr = addr;
-
- /*
- * We need an address which is <= than the original
- * one only when in topdown direction.
- */
- if (!(flags & ALIGN_TOPDOWN))
- tmp_addr += va_align.mask;
+ return 0;

- tmp_addr &= ~va_align.mask;
+ return va_align.mask;
+}

- return tmp_addr;
+unsigned long align_vdso_addr(unsigned long addr)
+{
+ unsigned long align_mask = get_align_mask();
+ return (addr + align_mask) & ~align_mask;
}

static int __init control_va_addr_alignment(char *str)
@@ -126,7 +112,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
- unsigned long start_addr;
+ struct vm_unmapped_area_info info;
unsigned long begin, end;

if (flags & MAP_FIXED)
@@ -144,50 +130,16 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
(!vma || addr + len <= vma->vm_start))
return addr;
}
- if (((flags & MAP_32BIT) || test_thread_flag(TIF_ADDR32))
- && len <= mm->cached_hole_size) {
- mm->cached_hole_size = 0;
- mm->free_area_cache = begin;
- }
- addr = mm->free_area_cache;
- if (addr < begin)
- addr = begin;
- start_addr = addr;
-
-full_search:
-
- addr = align_addr(addr, filp, 0);
-
- for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
- /* At this point: (!vma || addr < vma->vm_end). */
- if (end - len < addr) {
- /*
- * Start a new search - just in case we missed
- * some holes.
- */
- if (start_addr != begin) {
- start_addr = addr = begin;
- mm->cached_hole_size = 0;
- goto full_search;
- }
- return -ENOMEM;
- }
- if (!vma || addr + len <= vma->vm_start) {
- /*
- * Remember the place where we stopped the search:
- */
- mm->free_area_cache = addr + len;
- return addr;
- }
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;

- addr = vma->vm_end;
- addr = align_addr(addr, filp, 0);
- }
+ info.flags = 0;
+ info.length = len;
+ info.low_limit = begin;
+ info.high_limit = end;
+ info.align_mask = filp ? get_align_mask() : 0;
+ info.align_offset = 0;
+ return vm_unmapped_area(&info);
}

-
unsigned long
arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
const unsigned long len, const unsigned long pgoff,
@@ -195,7 +147,8 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
{
struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
- unsigned long addr = addr0, start_addr;
+ unsigned long addr = addr0;
+ struct vm_unmapped_area_info info;

/* requested length too big for entire address space */
if (len > TASK_SIZE)
@@ -217,51 +170,16 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
return addr;
}

- /* check if free_area_cache is useful for us */
- if (len <= mm->cached_hole_size) {
- mm->cached_hole_size = 0;
- mm->free_area_cache = mm->mmap_base;
- }
-
-try_again:
- /* either no address requested or can't fit in requested address hole */
- start_addr = addr = mm->free_area_cache;
-
- if (addr < len)
- goto fail;
-
- addr -= len;
- do {
- addr = align_addr(addr, filp, ALIGN_TOPDOWN);
-
- /*
- * Lookup failure means no vma is above this address,
- * else if new region fits below vma->vm_start,
- * return with success:
- */
- vma = find_vma(mm, addr);
- if (!vma || addr+len <= vma->vm_start)
- /* remember the address as a hint for next time */
- return mm->free_area_cache = addr;
-
- /* remember the largest hole we saw so far */
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
-
- /* try just below the current vma->vm_start */
- addr = vma->vm_start-len;
- } while (len < vma->vm_start);
-
-fail:
- /*
- * if hint left us with no space for the requested
- * mapping then try again:
- */
- if (start_addr != mm->mmap_base) {
- mm->free_area_cache = mm->mmap_base;
- mm->cached_hole_size = 0;
- goto try_again;
- }
+ info.flags = VM_UNMAPPED_AREA_TOPDOWN;
+ info.length = len;
+ info.low_limit = PAGE_SIZE;
+ info.high_limit = mm->mmap_base;
+ info.align_mask = filp ? get_align_mask() : 0;
+ info.align_offset = 0;
+ addr = vm_unmapped_area(&info);
+ if (!(addr & ~PAGE_MASK))
+ return addr;
+ VM_BUG_ON(addr != -ENOMEM);

bottomup:
/*
@@ -270,14 +188,5 @@ bottomup:
* can happen with large stack limits and large mmap()
* allocations.
*/
- mm->cached_hole_size = ~0UL;
- mm->free_area_cache = TASK_UNMAPPED_BASE;
- addr = arch_get_unmapped_area(filp, addr0, len, pgoff, flags);
- /*
- * Restore the topdown base:
- */
- mm->free_area_cache = mm->mmap_base;
- mm->cached_hole_size = ~0UL;
-
- return addr;
+ return arch_get_unmapped_area(filp, addr0, len, pgoff, flags);
}
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 00aaf047b39f..431e87544411 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -141,7 +141,7 @@ static unsigned long vdso_addr(unsigned long start, unsigned len)
* unaligned here as a result of stack start randomization.
*/
addr = PAGE_ALIGN(addr);
- addr = align_addr(addr, NULL, ALIGN_VDSO);
+ addr = align_vdso_addr(addr);

return addr;
}
--
1.7.7.3

2012-11-05 22:50:29

by Michel Lespinasse

Subject: [PATCH 09/16] mm: use vm_unmapped_area() in hugetlbfs on i386 architecture

Update the i386 hugetlb_get_unmapped_area function to make use of
vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse <[email protected]>

---
arch/x86/mm/hugetlbpage.c | 130 +++++++++------------------------------------
1 files changed, 25 insertions(+), 105 deletions(-)

diff --git a/arch/x86/mm/hugetlbpage.c b/arch/x86/mm/hugetlbpage.c
index 937bff5cdaa7..c00c4a4cd564 100644
--- a/arch/x86/mm/hugetlbpage.c
+++ b/arch/x86/mm/hugetlbpage.c
@@ -274,42 +274,15 @@ static unsigned long hugetlb_get_unmapped_area_bottomup(struct file *file,
unsigned long pgoff, unsigned long flags)
{
struct hstate *h = hstate_file(file);
- struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma;
- unsigned long start_addr;
-
- if (len > mm->cached_hole_size) {
- start_addr = mm->free_area_cache;
- } else {
- start_addr = TASK_UNMAPPED_BASE;
- mm->cached_hole_size = 0;
- }
-
-full_search:
- addr = ALIGN(start_addr, huge_page_size(h));
-
- for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
- /* At this point: (!vma || addr < vma->vm_end). */
- if (TASK_SIZE - len < addr) {
- /*
- * Start a new search - just in case we missed
- * some holes.
- */
- if (start_addr != TASK_UNMAPPED_BASE) {
- start_addr = TASK_UNMAPPED_BASE;
- mm->cached_hole_size = 0;
- goto full_search;
- }
- return -ENOMEM;
- }
- if (!vma || addr + len <= vma->vm_start) {
- mm->free_area_cache = addr + len;
- return addr;
- }
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
- addr = ALIGN(vma->vm_end, huge_page_size(h));
- }
+ struct vm_unmapped_area_info info;
+
+ info.flags = 0;
+ info.length = len;
+ info.low_limit = TASK_UNMAPPED_BASE;
+ info.high_limit = TASK_SIZE;
+ info.align_mask = PAGE_MASK & ~huge_page_mask(h);
+ info.align_offset = 0;
+ return vm_unmapped_area(&info);
}

static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
@@ -317,83 +290,30 @@ static unsigned long hugetlb_get_unmapped_area_topdown(struct file *file,
unsigned long pgoff, unsigned long flags)
{
struct hstate *h = hstate_file(file);
- struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma;
- unsigned long base = mm->mmap_base;
- unsigned long addr = addr0;
- unsigned long largest_hole = mm->cached_hole_size;
- unsigned long start_addr;
-
- /* don't allow allocations above current base */
- if (mm->free_area_cache > base)
- mm->free_area_cache = base;
-
- if (len <= largest_hole) {
- largest_hole = 0;
- mm->free_area_cache = base;
- }
-try_again:
- start_addr = mm->free_area_cache;
-
- /* make sure it can fit in the remaining address space */
- if (mm->free_area_cache < len)
- goto fail;
-
- /* either no address requested or can't fit in requested address hole */
- addr = (mm->free_area_cache - len) & huge_page_mask(h);
- do {
- /*
- * Lookup failure means no vma is above this address,
- * i.e. return with success:
- */
- vma = find_vma(mm, addr);
- if (!vma)
- return addr;
+ struct vm_unmapped_area_info info;
+ unsigned long addr;

- if (addr + len <= vma->vm_start) {
- /* remember the address as a hint for next time */
- mm->cached_hole_size = largest_hole;
- return (mm->free_area_cache = addr);
- } else if (mm->free_area_cache == vma->vm_end) {
- /* pull free_area_cache down to the first hole */
- mm->free_area_cache = vma->vm_start;
- mm->cached_hole_size = largest_hole;
- }
+ info.flags = VM_UNMAPPED_AREA_TOPDOWN;
+ info.length = len;
+ info.low_limit = PAGE_SIZE;
+ info.high_limit = current->mm->mmap_base;
+ info.align_mask = PAGE_MASK & ~huge_page_mask(h);
+ info.align_offset = 0;
+ addr = vm_unmapped_area(&info);

- /* remember the largest hole we saw so far */
- if (addr + largest_hole < vma->vm_start)
- largest_hole = vma->vm_start - addr;
-
- /* try just below the current vma->vm_start */
- addr = (vma->vm_start - len) & huge_page_mask(h);
- } while (len <= vma->vm_start);
-
-fail:
- /*
- * if hint left us with no space for the requested
- * mapping then try again:
- */
- if (start_addr != base) {
- mm->free_area_cache = base;
- largest_hole = 0;
- goto try_again;
- }
/*
* A failed mmap() very likely causes application failure,
* so fall back to the bottom-up function here. This scenario
* can happen with large stack limits and large mmap()
* allocations.
*/
- mm->free_area_cache = TASK_UNMAPPED_BASE;
- mm->cached_hole_size = ~0UL;
- addr = hugetlb_get_unmapped_area_bottomup(file, addr0,
- len, pgoff, flags);
-
- /*
- * Restore the topdown base:
- */
- mm->free_area_cache = base;
- mm->cached_hole_size = ~0UL;
+ if (addr & ~PAGE_MASK) {
+ VM_BUG_ON(addr != -ENOMEM);
+ info.flags = 0;
+ info.low_limit = TASK_UNMAPPED_BASE;
+ info.high_limit = TASK_SIZE;
+ addr = vm_unmapped_area(&info);
+ }

return addr;
}
--
1.7.7.3

2012-11-05 22:51:25

by Michel Lespinasse

Subject: [PATCH 07/16] mm: fix cache coloring on x86_64 architecture

Fix the x86-64 cache alignment code to take pgoff into account.
Use the x86 and MIPS cache alignment code as the basis for a generic
cache alignment function.

The old x86 code will always align the mmap to aliasing boundaries,
even if the program mmaps the file with a non-zero pgoff.

If program A mmaps the file with pgoff 0 and program B mmaps the
file with pgoff 1, the old code would align the mmaps, resulting in
misaligned pages:

A: 0123
B: 123

After this patch, they are aligned so the pages line up:

A: 0123
B:  123
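
To make the arithmetic concrete (illustrative numbers, not taken from
the patch): assume a 64KB aliasing boundary and 4KB pages, so that
align_mask = 0xffff & PAGE_MASK = 0xf000. vm_unmapped_area() picks an
address satisfying

	(addr & align_mask) == (align_offset & align_mask)

and with align_offset = pgoff << PAGE_SHIFT this gives:

	A (pgoff 0): addr_A & 0xf000 == 0x0000
	B (pgoff 1): addr_B & 0xf000 == 0x1000

File page 1 is then mapped at addr_A + 0x1000 in A and at addr_B in B;
the two addresses share their low bits, i.e. the same cache color.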

Signed-off-by: Michel Lespinasse <[email protected]>
Proposed-by: Rik van Riel <[email protected]>

---
arch/x86/kernel/sys_x86_64.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index f00d006d60fd..97ef74b88e0f 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -136,7 +136,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
info.low_limit = begin;
info.high_limit = end;
info.align_mask = filp ? get_align_mask() : 0;
- info.align_offset = 0;
+ info.align_offset = pgoff << PAGE_SHIFT;
return vm_unmapped_area(&info);
}

@@ -175,7 +175,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
info.low_limit = PAGE_SIZE;
info.high_limit = mm->mmap_base;
info.align_mask = filp ? get_align_mask() : 0;
- info.align_offset = 0;
+ info.align_offset = pgoff << PAGE_SHIFT;
addr = vm_unmapped_area(&info);
if (!(addr & ~PAGE_MASK))
return addr;
--
1.7.7.3

2012-11-05 22:51:43

by Michel Lespinasse

Subject: [PATCH 05/16] mm: vm_unmapped_area() lookup function

Implement vm_unmapped_area() using the rb_subtree_gap and highest_vm_end
information to look up suitable virtual address space gaps.

struct vm_unmapped_area_info is used to define the desired allocation
request:
- lowest or highest possible address matching the remaining constraints
- desired gap length
- low/high address limits that the gap must fit into
- alignment mask and offset

Also update the generic arch_get_unmapped_area[_topdown] functions to
make use of vm_unmapped_area() instead of implementing a brute force search.

Signed-off-by: Michel Lespinasse <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>

---
include/linux/mm.h | 31 +++++
mm/mmap.c | 312 +++++++++++++++++++++++++++++++++++++---------------
2 files changed, 253 insertions(+), 90 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index fa0680402738..441dfe8ea24d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1456,6 +1456,37 @@ extern unsigned long vm_mmap(struct file *, unsigned long,
unsigned long, unsigned long,
unsigned long, unsigned long);

+struct vm_unmapped_area_info {
+#define VM_UNMAPPED_AREA_TOPDOWN 1
+ unsigned long flags;
+ unsigned long length;
+ unsigned long low_limit;
+ unsigned long high_limit;
+ unsigned long align_mask;
+ unsigned long align_offset;
+};
+
+extern unsigned long unmapped_area(struct vm_unmapped_area_info *info);
+extern unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info);
+
+/*
+ * Search for an unmapped address range.
+ *
+ * We are looking for a range that:
+ * - does not intersect with any VMA;
+ * - is contained within the [low_limit, high_limit) interval;
+ * - is at least the desired size.
+ * - satisfies (begin_addr & align_mask) == (align_offset & align_mask)
+ */
+static inline unsigned long
+vm_unmapped_area(struct vm_unmapped_area_info *info)
+{
+ if (!(info->flags & VM_UNMAPPED_AREA_TOPDOWN))
+ return unmapped_area(info);
+ else
+ return unmapped_area_topdown(info);
+}
+
/* truncate.c */
extern void truncate_inode_pages(struct address_space *, loff_t);
extern void truncate_inode_pages_range(struct address_space *,
diff --git a/mm/mmap.c b/mm/mmap.c
index 769a09cc71af..aa748988cd54 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1496,6 +1496,206 @@ unacct_error:
return error;
}

+unsigned long unmapped_area(struct vm_unmapped_area_info *info)
+{
+ /*
+ * We implement the search by looking for an rbtree node that
+ * immediately follows a suitable gap. That is,
+ * - gap_start = vma->vm_prev->vm_end <= info->high_limit - length;
+ * - gap_end = vma->vm_start >= info->low_limit + length;
+ * - gap_end - gap_start >= length
+ */
+
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+ unsigned long length, low_limit, high_limit, gap_start, gap_end;
+
+ /* Adjust search length to account for worst case alignment overhead */
+ length = info->length + info->align_mask;
+ if (length < info->length)
+ return -ENOMEM;
+
+ /* Adjust search limits by the desired length */
+ if (info->high_limit < length)
+ return -ENOMEM;
+ high_limit = info->high_limit - length;
+
+ if (info->low_limit > high_limit)
+ return -ENOMEM;
+ low_limit = info->low_limit + length;
+
+ /* Check if rbtree root looks promising */
+ if (RB_EMPTY_ROOT(&mm->mm_rb))
+ goto check_highest;
+ vma = rb_entry(mm->mm_rb.rb_node, struct vm_area_struct, vm_rb);
+ if (vma->rb_subtree_gap < length)
+ goto check_highest;
+
+ while (true) {
+ /* Visit left subtree if it looks promising */
+ gap_end = vma->vm_start;
+ if (gap_end >= low_limit && vma->vm_rb.rb_left) {
+ struct vm_area_struct *left =
+ rb_entry(vma->vm_rb.rb_left,
+ struct vm_area_struct, vm_rb);
+ if (left->rb_subtree_gap >= length) {
+ vma = left;
+ continue;
+ }
+ }
+
+ gap_start = vma->vm_prev ? vma->vm_prev->vm_end : 0;
+ check_current:
+ /* Check if current node has a suitable gap */
+ if (gap_start > high_limit)
+ return -ENOMEM;
+ if (gap_end >= low_limit && gap_end - gap_start >= length)
+ goto found;
+
+ /* Visit right subtree if it looks promising */
+ if (vma->vm_rb.rb_right) {
+ struct vm_area_struct *right =
+ rb_entry(vma->vm_rb.rb_right,
+ struct vm_area_struct, vm_rb);
+ if (right->rb_subtree_gap >= length) {
+ vma = right;
+ continue;
+ }
+ }
+
+ /* Go back up the rbtree to find next candidate node */
+ while (true) {
+ struct rb_node *prev = &vma->vm_rb;
+ if (!rb_parent(prev))
+ goto check_highest;
+ vma = rb_entry(rb_parent(prev),
+ struct vm_area_struct, vm_rb);
+ if (prev == vma->vm_rb.rb_left) {
+ gap_start = vma->vm_prev->vm_end;
+ gap_end = vma->vm_start;
+ goto check_current;
+ }
+ }
+ }
+
+check_highest:
+ /* Check highest gap, which does not precede any rbtree node */
+ gap_start = mm->highest_vm_end;
+ gap_end = ULONG_MAX; /* Only for VM_BUG_ON below */
+ if (gap_start > high_limit)
+ return -ENOMEM;
+
+found:
+ /* We found a suitable gap. Clip it with the original low_limit. */
+ if (gap_start < info->low_limit)
+ gap_start = info->low_limit;
+
+ /* Adjust gap address to the desired alignment */
+ gap_start += (info->align_offset - gap_start) & info->align_mask;
+
+ VM_BUG_ON(gap_start + info->length > info->high_limit);
+ VM_BUG_ON(gap_start + info->length > gap_end);
+ return gap_start;
+}
+
+unsigned long unmapped_area_topdown(struct vm_unmapped_area_info *info)
+{
+ struct mm_struct *mm = current->mm;
+ struct vm_area_struct *vma;
+ unsigned long length, low_limit, high_limit, gap_start, gap_end;
+
+ /* Adjust search length to account for worst case alignment overhead */
+ length = info->length + info->align_mask;
+ if (length < info->length)
+ return -ENOMEM;
+
+ /*
+ * Adjust search limits by the desired length.
+ * See implementation comment at top of unmapped_area().
+ */
+ gap_end = info->high_limit;
+ if (gap_end < length)
+ return -ENOMEM;
+ high_limit = gap_end - length;
+
+ if (info->low_limit > high_limit)
+ return -ENOMEM;
+ low_limit = info->low_limit + length;
+
+ /* Check highest gap, which does not precede any rbtree node */
+ gap_start = mm->highest_vm_end;
+ if (gap_start <= high_limit)
+ goto found_highest;
+
+ /* Check if rbtree root looks promising */
+ if (RB_EMPTY_ROOT(&mm->mm_rb))
+ return -ENOMEM;
+ vma = rb_entry(mm->mm_rb.rb_node, struct vm_area_struct, vm_rb);
+ if (vma->rb_subtree_gap < length)
+ return -ENOMEM;
+
+ while (true) {
+ /* Visit right subtree if it looks promising */
+ gap_start = vma->vm_prev ? vma->vm_prev->vm_end : 0;
+ if (gap_start <= high_limit && vma->vm_rb.rb_right) {
+ struct vm_area_struct *right =
+ rb_entry(vma->vm_rb.rb_right,
+ struct vm_area_struct, vm_rb);
+ if (right->rb_subtree_gap >= length) {
+ vma = right;
+ continue;
+ }
+ }
+
+ check_current:
+ /* Check if current node has a suitable gap */
+ gap_end = vma->vm_start;
+ if (gap_end < low_limit)
+ return -ENOMEM;
+ if (gap_start <= high_limit && gap_end - gap_start >= length)
+ goto found;
+
+ /* Visit left subtree if it looks promising */
+ if (vma->vm_rb.rb_left) {
+ struct vm_area_struct *left =
+ rb_entry(vma->vm_rb.rb_left,
+ struct vm_area_struct, vm_rb);
+ if (left->rb_subtree_gap >= length) {
+ vma = left;
+ continue;
+ }
+ }
+
+ /* Go back up the rbtree to find next candidate node */
+ while (true) {
+ struct rb_node *prev = &vma->vm_rb;
+ if (!rb_parent(prev))
+ return -ENOMEM;
+ vma = rb_entry(rb_parent(prev),
+ struct vm_area_struct, vm_rb);
+ if (prev == vma->vm_rb.rb_right) {
+ gap_start = vma->vm_prev ?
+ vma->vm_prev->vm_end : 0;
+ goto check_current;
+ }
+ }
+ }
+
+found:
+ /* We found a suitable gap. Clip it with the original high_limit. */
+ if (gap_end > info->high_limit)
+ gap_end = info->high_limit;
+
+found_highest:
+ /* Compute highest gap address at the desired alignment */
+ gap_end -= info->length;
+ gap_end -= (gap_end - info->align_offset) & info->align_mask;
+
+ VM_BUG_ON(gap_end < info->low_limit);
+ VM_BUG_ON(gap_end < gap_start);
+ return gap_end;
+}
+
/* Get an address range which is currently unmapped.
* For shmat() with addr=0.
*
@@ -1514,7 +1714,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
- unsigned long start_addr;
+ struct vm_unmapped_area_info info;

if (len > TASK_SIZE)
return -ENOMEM;
@@ -1529,40 +1729,13 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
(!vma || addr + len <= vma->vm_start))
return addr;
}
- if (len > mm->cached_hole_size) {
- start_addr = addr = mm->free_area_cache;
- } else {
- start_addr = addr = TASK_UNMAPPED_BASE;
- mm->cached_hole_size = 0;
- }

-full_search:
- for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
- /* At this point: (!vma || addr < vma->vm_end). */
- if (TASK_SIZE - len < addr) {
- /*
- * Start a new search - just in case we missed
- * some holes.
- */
- if (start_addr != TASK_UNMAPPED_BASE) {
- addr = TASK_UNMAPPED_BASE;
- start_addr = addr;
- mm->cached_hole_size = 0;
- goto full_search;
- }
- return -ENOMEM;
- }
- if (!vma || addr + len <= vma->vm_start) {
- /*
- * Remember the place where we stopped the search:
- */
- mm->free_area_cache = addr + len;
- return addr;
- }
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
- addr = vma->vm_end;
- }
+ info.flags = 0;
+ info.length = len;
+ info.low_limit = TASK_UNMAPPED_BASE;
+ info.high_limit = TASK_SIZE;
+ info.align_mask = 0;
+ return vm_unmapped_area(&info);
}
#endif

@@ -1587,7 +1760,8 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
{
struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
- unsigned long addr = addr0, start_addr;
+ unsigned long addr = addr0;
+ struct vm_unmapped_area_info info;

/* requested length too big for entire address space */
if (len > TASK_SIZE)
@@ -1605,53 +1779,12 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
return addr;
}

- /* check if free_area_cache is useful for us */
- if (len <= mm->cached_hole_size) {
- mm->cached_hole_size = 0;
- mm->free_area_cache = mm->mmap_base;
- }
-
-try_again:
- /* either no address requested or can't fit in requested address hole */
- start_addr = addr = mm->free_area_cache;
-
- if (addr < len)
- goto fail;
-
- addr -= len;
- do {
- /*
- * Lookup failure means no vma is above this address,
- * else if new region fits below vma->vm_start,
- * return with success:
- */
- vma = find_vma(mm, addr);
- if (!vma || addr+len <= vma->vm_start)
- /* remember the address as a hint for next time */
- return (mm->free_area_cache = addr);
-
- /* remember the largest hole we saw so far */
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
-
- /* try just below the current vma->vm_start */
- addr = vma->vm_start-len;
- } while (len < vma->vm_start);
-
-fail:
- /*
- * if hint left us with no space for the requested
- * mapping then try again:
- *
- * Note: this is different with the case of bottomup
- * which does the fully line-search, but we use find_vma
- * here that causes some holes skipped.
- */
- if (start_addr != mm->mmap_base) {
- mm->free_area_cache = mm->mmap_base;
- mm->cached_hole_size = 0;
- goto try_again;
- }
+ info.flags = VM_UNMAPPED_AREA_TOPDOWN;
+ info.length = len;
+ info.low_limit = PAGE_SIZE;
+ info.high_limit = mm->mmap_base;
+ info.align_mask = 0;
+ addr = vm_unmapped_area(&info);

/*
* A failed mmap() very likely causes application failure,
@@ -1659,14 +1792,13 @@ fail:
* can happen with large stack limits and large mmap()
* allocations.
*/
- mm->cached_hole_size = ~0UL;
- mm->free_area_cache = TASK_UNMAPPED_BASE;
- addr = arch_get_unmapped_area(filp, addr0, len, pgoff, flags);
- /*
- * Restore the topdown base:
- */
- mm->free_area_cache = mm->mmap_base;
- mm->cached_hole_size = ~0UL;
+ if (addr & ~PAGE_MASK) {
+ VM_BUG_ON(addr != -ENOMEM);
+ info.flags = 0;
+ info.low_limit = TASK_UNMAPPED_BASE;
+ info.high_limit = TASK_SIZE;
+ addr = vm_unmapped_area(&info);
+ }

return addr;
}
--
1.7.7.3

2012-11-05 22:47:39

by Michel Lespinasse

[permalink] [raw]
Subject: [PATCH 02/16] mm: augment vma rbtree with rb_subtree_gap

Define vma->rb_subtree_gap as the largest gap between any vma in the
subtree rooted at that vma and its predecessor. Or, for a recursive
definition, vma->rb_subtree_gap is the max of:
- vma->vm_start - vma->vm_prev->vm_end
- the rb_subtree_gap fields of the vmas pointed to by vma->vm_rb.rb_left
  and vma->vm_rb.rb_right
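
As a toy illustration of the recursive definition (a standalone
userspace sketch using a simplified node type, not the kernel's
vm_area_struct):

#include <assert.h>
#include <stdio.h>

struct toy_vma {
	unsigned long vm_start, vm_end;		/* [vm_start, vm_end) */
	struct toy_vma *prev, *left, *right;	/* vm_prev + rbtree children */
	unsigned long subtree_gap;
};

/* Same recursion as vma_compute_subtree_gap() in the patch below. */
static unsigned long gap(struct toy_vma *v)
{
	unsigned long max = v->vm_start - (v->prev ? v->prev->vm_end : 0);

	if (v->left && v->left->subtree_gap > max)
		max = v->left->subtree_gap;
	if (v->right && v->right->subtree_gap > max)
		max = v->right->subtree_gap;
	return v->subtree_gap = max;
}

int main(void)
{
	struct toy_vma a = { 0x10000, 0x20000 };		/* gap 0x10000 */
	struct toy_vma b = { 0x30000, 0x38000, .prev = &a };	/* gap 0x10000 */
	struct toy_vma c = { 0x50000, 0x60000, .prev = &b };	/* gap 0x18000 */

	gap(&a); gap(&c);		/* leaves first, bottom-up */
	b.left = &a; b.right = &c;	/* b is the rbtree root */
	assert(gap(&b) == 0x18000);	/* max of the three gaps */
	printf("root rb_subtree_gap = %#lx\n", b.subtree_gap);
	return 0;
}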

This will allow get_unmapped_area_* to find a free area of the right
size in O(log(N)) time, instead of potentially having to do a linear
walk across all the VMAs.

Also define mm->highest_vm_end as the vm_end field of the highest vma,
so that we can easily check if the following gap is suitable.

This does have the potential to make unmapping VMAs more expensive,
especially for processes with very large numbers of VMAs, where the
VMA rbtree can grow quite deep.

Signed-off-by: Michel Lespinasse <[email protected]>
Reviewed-by: Rik van Riel <[email protected]>

---
include/linux/mm_types.h | 9 ++++
mm/mmap.c | 114 ++++++++++++++++++++++++++++++++++++++++++----
2 files changed, 114 insertions(+), 9 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 31f8a3af7d94..94fa52b28ee8 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -238,6 +238,14 @@ struct vm_area_struct {
struct rb_node vm_rb;

/*
+ * Largest free memory gap in bytes to the left of this VMA.
+ * Either between this VMA and vma->vm_prev, or between one of the
+ * VMAs below us in the VMA rbtree and its ->vm_prev. This helps
+ * get_unmapped_area find a free area of the right size.
+ */
+ unsigned long rb_subtree_gap;
+
+ /*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap interval tree, or
* linkage of vma in the address_space->i_mmap_nonlinear list.
@@ -322,6 +330,7 @@ struct mm_struct {
unsigned long task_size; /* size of task vm space */
unsigned long cached_hole_size; /* if non-zero, the largest hole below free_area_cache */
unsigned long free_area_cache; /* first hole of size cached_hole_size or larger */
+ unsigned long highest_vm_end; /* highest vma end address */
pgd_t * pgd;
atomic_t mm_users; /* How many users with user space? */
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
diff --git a/mm/mmap.c b/mm/mmap.c
index cf18502e75c6..2de8bcfe859d 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -31,6 +31,7 @@
#include <linux/audit.h>
#include <linux/khugepaged.h>
#include <linux/uprobes.h>
+#include <linux/rbtree_augmented.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -297,6 +298,70 @@ out:
return retval;
}

+static long vma_compute_subtree_gap(struct vm_area_struct *vma)
+{
+ unsigned long max, subtree_gap;
+ max = vma->vm_start;
+ if (vma->vm_prev)
+ max -= vma->vm_prev->vm_end;
+ if (vma->vm_rb.rb_left) {
+ subtree_gap = rb_entry(vma->vm_rb.rb_left,
+ struct vm_area_struct, vm_rb)->rb_subtree_gap;
+ if (subtree_gap > max)
+ max = subtree_gap;
+ }
+ if (vma->vm_rb.rb_right) {
+ subtree_gap = rb_entry(vma->vm_rb.rb_right,
+ struct vm_area_struct, vm_rb)->rb_subtree_gap;
+ if (subtree_gap > max)
+ max = subtree_gap;
+ }
+ return max;
+}
+
+RB_DECLARE_CALLBACKS(static, vma_gap_callbacks, struct vm_area_struct, vm_rb,
+ unsigned long, rb_subtree_gap, vma_compute_subtree_gap)
+
+/*
+ * Update augmented rbtree rb_subtree_gap values after vma->vm_start or
+ * vma->vm_prev->vm_end values changed, without modifying the vma's position
+ * in the rbtree.
+ */
+static void vma_gap_update(struct vm_area_struct *vma)
+{
+ /*
+ * As it turns out, RB_DECLARE_CALLBACKS() already created a callback
+ * function that does exactly what we want.
+ */
+ vma_gap_callbacks_propagate(&vma->vm_rb, NULL);
+}
+
+static inline void vma_rb_insert(struct vm_area_struct *vma,
+ struct rb_root *root)
+{
+ /*
+ * vma->vm_prev wasn't known when we followed the rbtree to find the
+ * correct insertion point for that vma. As a result, we could not
+ * update the rb_subtree_gap values of vma's rbtree parents on the way down.
+ * So, we first insert the vma with a zero rb_subtree_gap value
+ * (to be consistent with what we did on the way down), and then
+ * immediately update the gap to the correct value.
+ */
+ vma->rb_subtree_gap = 0;
+ rb_insert_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
+ vma_gap_update(vma);
+}
+
+static void vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
+{
+ /*
+ * Note rb_erase_augmented is a fairly large inline function,
+ * so make sure we instantiate it only once with our desired
+ * augmented rbtree callbacks.
+ */
+ rb_erase_augmented(&vma->vm_rb, root, &vma_gap_callbacks);
+}
+
#ifdef CONFIG_DEBUG_VM_RB
static int browse_rb(struct rb_root *root)
{
@@ -422,7 +487,11 @@ void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
struct rb_node **rb_link, struct rb_node *rb_parent)
{
rb_link_node(&vma->vm_rb, rb_parent, rb_link);
- rb_insert_color(&vma->vm_rb, &mm->mm_rb);
+ vma_rb_insert(vma, &mm->mm_rb);
+ if (vma->vm_next)
+ vma_gap_update(vma->vm_next);
+ else
+ mm->highest_vm_end = vma->vm_end;
}

static void __vma_link_file(struct vm_area_struct *vma)
@@ -503,7 +572,7 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
prev->vm_next = next;
if (next)
next->vm_prev = prev;
- rb_erase(&vma->vm_rb, &mm->mm_rb);
+ vma_rb_erase(vma, &mm->mm_rb);
if (mm->mmap_cache == vma)
mm->mmap_cache = prev;
}
@@ -525,6 +594,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
struct rb_root *root = NULL;
struct anon_vma *anon_vma = NULL;
struct file *file = vma->vm_file;
+ bool start_changed = false, end_changed = false;
long adjust_next = 0;
int remove_next = 0;

@@ -615,8 +685,14 @@ again: remove_next = 1 + (end > next->vm_end);
vma_interval_tree_remove(next, root);
}

- vma->vm_start = start;
- vma->vm_end = end;
+ if (start != vma->vm_start) {
+ vma->vm_start = start;
+ start_changed = true;
+ }
+ if (end != vma->vm_end) {
+ vma->vm_end = end;
+ end_changed = true;
+ }
vma->vm_pgoff = pgoff;
if (adjust_next) {
next->vm_start += adjust_next << PAGE_SHIFT;
@@ -645,6 +721,15 @@ again: remove_next = 1 + (end > next->vm_end);
* (it may either follow vma or precede it).
*/
__insert_vm_struct(mm, insert);
+ } else {
+ if (start_changed)
+ vma_gap_update(vma);
+ if (end_changed) {
+ if (!next)
+ mm->highest_vm_end = end;
+ else if (!adjust_next)
+ vma_gap_update(next);
+ }
}

if (anon_vma) {
@@ -678,10 +763,13 @@ again: remove_next = 1 + (end > next->vm_end);
* we must remove another next too. It would clutter
* up the code too much to do both in one go.
*/
- if (remove_next == 2) {
- next = vma->vm_next;
+ next = vma->vm_next;
+ if (remove_next == 2)
goto again;
- }
+ else if (next)
+ vma_gap_update(next);
+ else
+ mm->highest_vm_end = end;
}
if (insert && file)
uprobe_mmap(insert);
@@ -1783,6 +1871,10 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
anon_vma_interval_tree_pre_update_vma(vma);
vma->vm_end = address;
anon_vma_interval_tree_post_update_vma(vma);
+ if (vma->vm_next)
+ vma_gap_update(vma->vm_next);
+ else
+ mm->highest_vm_end = address;
perf_event_mmap(vma);
}
}
@@ -1837,6 +1929,7 @@ int expand_downwards(struct vm_area_struct *vma,
vma->vm_start = address;
vma->vm_pgoff -= grow;
anon_vma_interval_tree_post_update_vma(vma);
+ vma_gap_update(vma);
perf_event_mmap(vma);
}
}
@@ -1959,14 +2052,17 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
insertion_point = (prev ? &prev->vm_next : &mm->mmap);
vma->vm_prev = NULL;
do {
- rb_erase(&vma->vm_rb, &mm->mm_rb);
+ vma_rb_erase(vma, &mm->mm_rb);
mm->map_count--;
tail_vma = vma;
vma = vma->vm_next;
} while (vma && vma->vm_start < end);
*insertion_point = vma;
- if (vma)
+ if (vma) {
vma->vm_prev = prev;
+ vma_gap_update(vma);
+ } else
+ mm->highest_vm_end = prev ? prev->vm_end : 0;
tail_vma->vm_next = NULL;
if (mm->unmap_area == arch_unmap_area)
addr = prev ? prev->vm_end : mm->mmap_base;
--
1.7.7.3

2012-11-05 22:52:04

by Michel Lespinasse

[permalink] [raw]
Subject: [PATCH 04/16] mm: rearrange vm_area_struct for fewer cache misses

From: Rik van Riel <[email protected]>

The kernel walks the VMA rbtree in various places, including
the page fault path. However, the vm_rb node spanned two
cache lines on 64 bit systems with 64 byte cache lines (most
x86 systems).

Rearrange vm_area_struct a little, so all the information we
need to do a VMA tree walk is in the first cache line.
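
The intent can be sanity-checked with a compile-time assertion along
these lines (a hypothetical userspace mirror of the new layout,
assuming 64-bit pointers and a 24-byte rb_node; not part of the patch):

#include <stddef.h>

struct toy_layout {
	/* First cache line: everything a VMA tree walk dereferences. */
	unsigned long vm_start;
	unsigned long vm_end;
	struct toy_layout *vm_next, *vm_prev;
	struct { void *p[3]; } vm_rb;		/* stand-in for struct rb_node */
	unsigned long rb_subtree_gap;

	/* Second cache line starts here. */
	void *vm_mm;
	unsigned long vm_page_prot;		/* stand-in for pgprot_t */
	unsigned long vm_flags;
};

/* 2 + 2 + 3 + 1 = 8 words, i.e. exactly 64 bytes before vm_mm. */
_Static_assert(offsetof(struct toy_layout, vm_mm) <= 64,
	       "tree-walk fields must fit in the first cache line");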

Signed-off-by: Michel Lespinasse <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>

---
include/linux/mm_types.h | 12 ++++++++----
1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 94fa52b28ee8..528da4abf8ee 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -224,7 +224,8 @@ struct vm_region {
* library, the executable area etc).
*/
struct vm_area_struct {
- struct mm_struct * vm_mm; /* The address space we belong to. */
+ /* The first cache line has the info for VMA tree walking. */
+
unsigned long vm_start; /* Our start address within vm_mm. */
unsigned long vm_end; /* The first byte after our end address
within vm_mm. */
@@ -232,9 +233,6 @@ struct vm_area_struct {
/* linked list of VM areas per task, sorted by address */
struct vm_area_struct *vm_next, *vm_prev;

- pgprot_t vm_page_prot; /* Access permissions of this VMA. */
- unsigned long vm_flags; /* Flags, see mm.h. */
-
struct rb_node vm_rb;

/*
@@ -245,6 +243,12 @@ struct vm_area_struct {
*/
unsigned long rb_subtree_gap;

+ /* Second cache line starts here. */
+
+ struct mm_struct * vm_mm; /* The address space we belong to. */
+ pgprot_t vm_page_prot; /* Access permissions of this VMA. */
+ unsigned long vm_flags; /* Flags, see mm.h. */
+
/*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap interval tree, or
--
1.7.7.3

2012-11-05 22:52:51

by Michel Lespinasse

[permalink] [raw]
Subject: [PATCH 01/16] mm: add anon_vma_lock to validate_mm()

Iterating vma->anon_vma_chain without holding anon_vma_lock may cause a NULL
pointer dereference in anon_vma_interval_tree_verify(), because a node in the
chain might have been removed.

[ 1523.657950] BUG: unable to handle kernel paging request at fffffffffffffff0
[ 1523.660022] IP: [<ffffffff8122c29c>] anon_vma_interval_tree_verify+0xc/0xa0
[ 1523.660022] PGD 4e28067 PUD 4e29067 PMD 0
[ 1523.675725] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 1523.750066] CPU 0
[ 1523.750066] Pid: 9050, comm: trinity-child64 Tainted: G W 3.7.0-rc2-next-20121025-sasha-00001-g673f98e-dirty #77
[ 1523.750066] RIP: 0010:[<ffffffff8122c29c>] [<ffffffff8122c29c>] anon_vma_interval_tree_verify+0xc/0xa0
[ 1523.750066] RSP: 0018:ffff880045f81d48 EFLAGS: 00010296
[ 1523.750066] RAX: 0000000000000000 RBX: fffffffffffffff0 RCX: 0000000000000000
[ 1523.750066] RDX: 0000000000000000 RSI: 0000000000000001 RDI: fffffffffffffff0
[ 1523.750066] RBP: ffff880045f81d58 R08: 0000000000000000 R09: 0000000000000f14
[ 1523.750066] R10: 0000000000000f12 R11: 0000000000000000 R12: ffff8800096c8d70
[ 1523.750066] R13: ffff8800096c8d00 R14: 0000000000000000 R15: ffff8800095b45e0
[ 1523.750066] FS: 00007f7a923f3700(0000) GS:ffff880013600000(0000) knlGS:0000000000000000
[ 1523.750066] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1523.750066] CR2: fffffffffffffff0 CR3: 000000000969d000 CR4: 00000000000406f0
[ 1523.750066] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 1523.750066] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 1523.750066] Process trinity-child64 (pid: 9050, threadinfo ffff880045f80000, task ffff880048eb0000)
[ 1523.750066] Stack:
[ 1523.750066] ffff88000d7533f0 fffffffffffffff0 ffff880045f81da8 ffffffff812361d8
[ 1523.750066] ffff880045f81d98 ffff880048ee9000 ffff8800095b4580 ffff8800095b4580
[ 1523.750066] ffff88001d1cdb00 ffff8800095b45f0 ffff880022a4d630 ffff8800095b45e0
[ 1523.750066] Call Trace:
[ 1523.750066] [<ffffffff812361d8>] validate_mm+0x58/0x1e0
[ 1523.750066] [<ffffffff81236aa5>] vma_adjust+0x635/0x6b0
[ 1523.750066] [<ffffffff81236c81>] __split_vma.isra.22+0x161/0x220
[ 1523.750066] [<ffffffff81237934>] split_vma+0x24/0x30
[ 1523.750066] [<ffffffff8122ce6a>] sys_madvise+0x5da/0x7b0
[ 1523.750066] [<ffffffff811cd14c>] ? rcu_eqs_exit+0x9c/0xb0
[ 1523.750066] [<ffffffff811802cd>] ? trace_hardirqs_on+0xd/0x10
[ 1523.750066] [<ffffffff83aee198>] tracesys+0xe1/0xe6
[ 1523.750066] Code: 4c 09 ff 48 39 ce 77 9e f3 c3 0f 1f 44 00 00 31 c0 c3 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 55 48 89 e5 53
48 89 fb 48 83 ec 08 <48> 8b 17 48 8b 8a 90 00 00 00 48 39 4f 40 74 34 80 3d f7 1f 5c
[ 1523.750066] RIP [<ffffffff8122c29c>] anon_vma_interval_tree_verify+0xc/0xa0
[ 1523.750066] RSP <ffff880045f81d48>
[ 1523.750066] CR2: fffffffffffffff0
[ 1523.750066] ---[ end trace e35e5fa49072faf9 ]---

Reported-by: Sasha Levin <[email protected]>
Figured-out-by: Bob Liu <[email protected]>
Signed-off-by: Michel Lespinasse <[email protected]>

---
mm/mmap.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 2d942353d681..9a796c41e7d9 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -334,8 +334,10 @@ void validate_mm(struct mm_struct *mm)
struct vm_area_struct *vma = mm->mmap;
while (vma) {
struct anon_vma_chain *avc;
+ vma_lock_anon_vma(vma);
list_for_each_entry(avc, &vma->anon_vma_chain, same_vma)
anon_vma_interval_tree_verify(avc);
+ vma_unlock_anon_vma(vma);
vma = vma->vm_next;
i++;
}
--
1.7.7.3

2012-11-05 23:04:33

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 01/16] mm: add anon_vma_lock to validate_mm()

On 11/05/2012 05:46 PM, Michel Lespinasse wrote:
> Iterating vma->anon_vma_chain without holding anon_vma_lock may cause a NULL
> pointer dereference in anon_vma_interval_tree_verify(), because a node in the
> chain might have been removed.
>
> [ oops trace snipped; quoted in full in the patch mail above ]
>
> Reported-by: Sasha Levin <[email protected]>
> Figured-out-by: Bob Liu <[email protected]>
> Signed-off-by: Michel Lespinasse <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2012-11-05 23:05:11

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 09/16] mm: use vm_unmapped_area() in hugetlbfs on i386 architecture

On 11/05/2012 05:47 PM, Michel Lespinasse wrote:
> Update the i386 hugetlb_get_unmapped_area function to make use of
> vm_unmapped_area() instead of implementing a brute force search.
>
> Signed-off-by: Michel Lespinasse <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2012-11-05 23:31:31

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 08/16] mm: use vm_unmapped_area() in hugetlbfs

On 11/05/2012 05:47 PM, Michel Lespinasse wrote:
> Update the hugetlb_get_unmapped_area function to make use of
> vm_unmapped_area() instead of implementing a brute force search.
>
> Signed-off-by: Michel Lespinasse <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2012-11-05 23:32:33

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 10/16] mm: use vm_unmapped_area() on mips architecture

On 11/05/2012 05:47 PM, Michel Lespinasse wrote:
> Update the mips arch_get_unmapped_area[_topdown] functions to make
> use of vm_unmapped_area() instead of implementing a brute force search.
>
> Signed-off-by: Michel Lespinasse <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2012-11-05 23:32:46

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 12/16] mm: use vm_unmapped_area() on sh architecture

On 11/05/2012 05:47 PM, Michel Lespinasse wrote:
> Update the sh arch_get_unmapped_area[_topdown] functions to make
> use of vm_unmapped_area() instead of implementing a brute force search.
>
> Signed-off-by: Michel Lespinasse <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2012-11-05 23:33:30

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 11/16] mm: use vm_unmapped_area() on arm architecture

On 11/05/2012 05:47 PM, Michel Lespinasse wrote:
> Update the arm arch_get_unmapped_area[_topdown] functions to make
> use of vm_unmapped_area() instead of implementing a brute force search.
>
> Signed-off-by: Michel Lespinasse <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2012-11-05 23:34:05

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 13/16] mm: use vm_unmapped_area() on sparc64 architecture

On 11/05/2012 05:47 PM, Michel Lespinasse wrote:
> Update the sparc64 arch_get_unmapped_area[_topdown] functions to make
> use of vm_unmapped_area() instead of implementing a brute force search.
>
> Signed-off-by: Michel Lespinasse <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2012-11-05 23:34:38

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 14/16] mm: use vm_unmapped_area() in hugetlbfs on sparc64 architecture

On 11/05/2012 05:47 PM, Michel Lespinasse wrote:
> Update the sparc64 hugetlb_get_unmapped_area function to make use of
> vm_unmapped_area() instead of implementing a brute force search.
>
> Signed-off-by: Michel Lespinasse <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2012-11-05 23:34:56

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 15/16] mm: use vm_unmapped_area() on sparc32 architecture

On 11/05/2012 05:47 PM, Michel Lespinasse wrote:
> Update the sparc32 arch_get_unmapped_area function to make use of
> vm_unmapped_area() instead of implementing a brute force search.
>
> Signed-off-by: Michel Lespinasse <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2012-11-05 23:35:23

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 16/16] mm: use vm_unmapped_area() in hugetlbfs on tile architecture

On 11/05/2012 05:47 PM, Michel Lespinasse wrote:
> Update the tile hugetlb_get_unmapped_area function to make use of
> vm_unmapped_area() instead of implementing a brute force search.
>
> Signed-off-by: Michel Lespinasse <[email protected]>

Reviewed-by: Rik van Riel <[email protected]>

2012-11-06 01:25:08

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 15/16] mm: use vm_unmapped_area() on sparc32 architecture

From: Michel Lespinasse <[email protected]>
Date: Mon, 5 Nov 2012 14:47:12 -0800

> Update the sparc32 arch_get_unmapped_area function to make use of
> vm_unmapped_area() instead of implementing a brute force search.
>
> Signed-off-by: Michel Lespinasse <[email protected]>

Hmmm...

> - if (flags & MAP_SHARED)
> - addr = COLOUR_ALIGN(addr);
> - else
> - addr = PAGE_ALIGN(addr);

What part of vm_unmapped_area() is going to duplicate this special
aligning logic we need on sparc?

2012-11-06 03:13:22

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [PATCH 15/16] mm: use vm_unmapped_area() on sparc32 architecture

On Mon, Nov 5, 2012 at 5:25 PM, David Miller <[email protected]> wrote:
> From: Michel Lespinasse <[email protected]>
> Date: Mon, 5 Nov 2012 14:47:12 -0800
>
>> Update the sparc32 arch_get_unmapped_area function to make use of
>> vm_unmapped_area() instead of implementing a brute force search.
>>
>> Signed-off-by: Michel Lespinasse <[email protected]>
>
> Hmmm...
>
>> - if (flags & MAP_SHARED)
>> - addr = COLOUR_ALIGN(addr);
>> - else
>> - addr = PAGE_ALIGN(addr);
>
> What part of vm_unmapped_area() is going to duplicate this special
> aligning logic we need on sparc?

The idea there is that you can specify the desired alignment mask and
offset using info.align_mask and info.align_offset.

Now, I just noticed that the old code actually always uses an
alignment offset of 0 instead of basing it on pgoff. I'm not sure why
that is, but it looks like this may be an issue?
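
In terms of the new interface, the sparc coloring rule would map onto
the two fields roughly like this (a sketch, not the actual sparc32
patch text):

static unsigned long
sparc32_sketch(unsigned long len, unsigned long pgoff, unsigned long flags)
{
	struct vm_unmapped_area_info info;

	info.flags = 0;
	info.length = len;
	info.low_limit = TASK_UNMAPPED_BASE;
	info.high_limit = TASK_SIZE;
	/* Shared mappings must sit on a cache-color boundary... */
	info.align_mask = (flags & MAP_SHARED) ? (SHMLBA - 1) : 0;
	/* ...offset by the file position, which the old code left at 0. */
	info.align_offset = pgoff << PAGE_SHIFT;
	return vm_unmapped_area(&info);
}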

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

2012-11-06 07:28:00

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 15/16] mm: use vm_unmapped_area() on sparc32 architecture

On 11/05/2012 08:25 PM, David Miller wrote:
> From: Michel Lespinasse <[email protected]>
> Date: Mon, 5 Nov 2012 14:47:12 -0800
>
>> Update the sparc32 arch_get_unmapped_area function to make use of
>> vm_unmapped_area() instead of implementing a brute force search.
>>
>> Signed-off-by: Michel Lespinasse <[email protected]>
>
> Hmmm...
>
>> - if (flags & MAP_SHARED)
>> - addr = COLOUR_ALIGN(addr);
>> - else
>> - addr = PAGE_ALIGN(addr);
>
> What part of vm_unmapped_area() is going to duplicate this special
> aligning logic we need on sparc?
>

That would be this part:

+found:
+ /* We found a suitable gap. Clip it with the original low_limit. */
+ if (gap_start < info->low_limit)
+ gap_start = info->low_limit;
+
+ /* Adjust gap address to the desired alignment */
+ gap_start += (info->align_offset - gap_start) & info->align_mask;
+
+ VM_BUG_ON(gap_start + info->length > info->high_limit);
+ VM_BUG_ON(gap_start + info->length > gap_end);
+ return gap_start;
+}
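
To see the adjustment at work with made-up numbers (a standalone
sketch; the SHMLBA-style value is illustrative):

#include <assert.h>

int main(void)
{
	unsigned long align_mask = 0x4000 - 1;	/* 16KB color boundary */
	unsigned long align_offset = 0x1000;	/* pgoff 1 << PAGE_SHIFT */
	unsigned long gap_start = 0x20002000;

	/* Round gap_start up to the next address congruent to
	 * align_offset modulo the color size, as unmapped_area() does. */
	gap_start += (align_offset - gap_start) & align_mask;
	assert(gap_start == 0x20005000);
	assert((gap_start & align_mask) == (align_offset & align_mask));
	return 0;
}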

2012-11-06 08:01:03

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [PATCH 01/16] mm: add anon_vma_lock to validate_mm()

Adding Sasha and Bob, whom I forgot to CC in the original message.

On Mon, Nov 5, 2012 at 3:06 PM, Rik van Riel <[email protected]> wrote:
> On 11/05/2012 05:46 PM, Michel Lespinasse wrote:
>>
>> Iterating vma->anon_vma_chain without holding anon_vma_lock may cause a
>> NULL pointer dereference in anon_vma_interval_tree_verify(), because a
>> node in the chain might have been removed.
>>
>> [ oops trace snipped; quoted in full in the patch mail above ]
>>
>> Reported-by: Sasha Levin <[email protected]>
>> Figured-out-by: Bob Liu <[email protected]>
>> Signed-off-by: Michel Lespinasse <[email protected]>
>
> Reviewed-by: Rik van Riel <[email protected]>

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

2012-11-06 17:41:39

by David Miller

[permalink] [raw]
Subject: Re: [PATCH 15/16] mm: use vm_unmapped_area() on sparc32 architecture

From: Rik van Riel <[email protected]>
Date: Tue, 06 Nov 2012 02:30:07 -0500

> On 11/05/2012 08:25 PM, David Miller wrote:
>> From: Michel Lespinasse <[email protected]>
>> Date: Mon, 5 Nov 2012 14:47:12 -0800
>>
>>> Update the sparc32 arch_get_unmapped_area function to make use of
>>> vm_unmapped_area() instead of implementing a brute force search.
>>>
>>> Signed-off-by: Michel Lespinasse <[email protected]>
>>
>> Hmmm...
>>
>>> - if (flags & MAP_SHARED)
>>> - addr = COLOUR_ALIGN(addr);
>>> - else
>>> - addr = PAGE_ALIGN(addr);
>>
>> What part of vm_unmapped_area() is going to duplicate this special
>> aligning logic we need on sparc?
>>
>
> That would be this part:
>
> +found:
> + /* We found a suitable gap. Clip it with the original low_limit. */
> + if (gap_start < info->low_limit)
> + gap_start = info->low_limit;
> +
> + /* Adjust gap address to the desired alignment */
> + gap_start += (info->align_offset - gap_start) & info->align_mask;
> +
> + VM_BUG_ON(gap_start + info->length > info->high_limit);
> + VM_BUG_ON(gap_start + info->length > gap_end);
> + return gap_start;
> +}

Ok, now I understand. Works for me:

Acked-by: David S. Miller <[email protected]>

2012-11-06 22:11:43

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 00/16] mm: use augmented rbtrees for finding unmapped areas

On Mon, 5 Nov 2012 14:46:57 -0800
Michel Lespinasse <[email protected]> wrote:

> Earlier this year, Rik proposed using augmented rbtrees to optimize
> our search for a suitable unmapped area during mmap(). This prompted
> my work on improving the augmented rbtree code. Rik doesn't seem to
> have time to follow up on his idea at this time, so I'm sending this
> series to revive the idea.

Well, the key word here is "optimize". Some quantitative testing
results would be nice, please!

People do occasionally see nasty meltdowns in the get_unmapped_area()
vicinity. There was one case 2-3 years ago which was just ghastly, but
I can't find the email (it's on linux-mm somewhere). This one might be
another case:
http://lkml.indiana.edu/hypermail/linux/kernel/1101.1/00896.html

If you can demonstrate that this patchset fixes some or all of the bad
search complexity scenarios then that's quite a win?

> These changes are against v3.7-rc4. I have not converted all applicable
> architectures yet, but we don't necessarily need to get them all onboard
> at once - the series is fully bisectable and additional architectures
> can be added later on. I am confident enough in my tests for patches 1-8;
> however the second half of the series basically didn't get tested as
> I don't have access to all the relevant architectures.

Yes, I'll try to get these into -next so that the thousand monkeys at
least give us some compilation coverage testing. Hopefully the
relevant arch maintainers will find time to perform a runtime test.

> Patch 1 is the validate_mm() fix from Bob Liu (+ fixed-the-fix from me :)

I grabbed this one separately, as a post-3.6 fix.

2012-11-06 22:38:19

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 03/16] mm: check rb_subtree_gap correctness

On Mon, 5 Nov 2012 14:47:00 -0800
Michel Lespinasse <[email protected]> wrote:

> When CONFIG_DEBUG_VM_RB is enabled, check that rb_subtree_gap is
> correctly set for every vma and that mm->highest_vm_end is also correct.
>
> Also add an explicit 'bug' variable to track if browse_rb() detected any
> invalid condition.
>
> ...
>
> @@ -365,7 +365,7 @@ static void vma_rb_erase(struct vm_area_struct *vma, struct rb_root *root)
> #ifdef CONFIG_DEBUG_VM_RB
> static int browse_rb(struct rb_root *root)
> {
> - int i = 0, j;
> + int i = 0, j, bug = 0;
> struct rb_node *nd, *pn = NULL;
> unsigned long prev = 0, pend = 0;
>
> @@ -373,29 +373,33 @@ static int browse_rb(struct rb_root *root)
> struct vm_area_struct *vma;
> vma = rb_entry(nd, struct vm_area_struct, vm_rb);
> if (vma->vm_start < prev)
> - printk("vm_start %lx prev %lx\n", vma->vm_start, prev), i = -1;
> + printk("vm_start %lx prev %lx\n", vma->vm_start, prev), bug = 1;
> if (vma->vm_start < pend)
> - printk("vm_start %lx pend %lx\n", vma->vm_start, pend);
> + printk("vm_start %lx pend %lx\n", vma->vm_start, pend), bug = 1;
> if (vma->vm_start > vma->vm_end)
> - printk("vm_end %lx < vm_start %lx\n", vma->vm_end, vma->vm_start);
> + printk("vm_end %lx < vm_start %lx\n", vma->vm_end, vma->vm_start), bug = 1;
> + if (vma->rb_subtree_gap != vma_compute_subtree_gap(vma))
> + printk("free gap %lx, correct %lx\n",
> + vma->rb_subtree_gap,
> + vma_compute_subtree_gap(vma)), bug = 1;

OK, now who did that. Whoever it was: stop it or you'll have your
kernel license revoked!

--- a/mm/mmap.c~mm-check-rb_subtree_gap-correctness-fix
+++ a/mm/mmap.c
@@ -372,16 +372,25 @@ static int browse_rb(struct rb_root *roo
for (nd = rb_first(root); nd; nd = rb_next(nd)) {
struct vm_area_struct *vma;
vma = rb_entry(nd, struct vm_area_struct, vm_rb);
- if (vma->vm_start < prev)
- printk("vm_start %lx prev %lx\n", vma->vm_start, prev), bug = 1;
- if (vma->vm_start < pend)
- printk("vm_start %lx pend %lx\n", vma->vm_start, pend), bug = 1;
- if (vma->vm_start > vma->vm_end)
- printk("vm_end %lx < vm_start %lx\n", vma->vm_end, vma->vm_start), bug = 1;
- if (vma->rb_subtree_gap != vma_compute_subtree_gap(vma))
+ if (vma->vm_start < prev) {
+ printk("vm_start %lx prev %lx\n", vma->vm_start, prev);
+ bug = 1;
+ }
+ if (vma->vm_start < pend) {
+ printk("vm_start %lx pend %lx\n", vma->vm_start, pend);
+ bug = 1;
+ }
+ if (vma->vm_start > vma->vm_end) {
+ printk("vm_end %lx < vm_start %lx\n",
+ vma->vm_end, vma->vm_start);
+ bug = 1;
+ }
+ if (vma->rb_subtree_gap != vma_compute_subtree_gap(vma)) {
printk("free gap %lx, correct %lx\n",
vma->rb_subtree_gap,
- vma_compute_subtree_gap(vma)), bug = 1;
+ vma_compute_subtree_gap(vma));
+ bug = 1;
+ }
i++;
pn = nd;
prev = vma->vm_start;
@@ -390,8 +399,10 @@ static int browse_rb(struct rb_root *roo
j = 0;
for (nd = pn; nd; nd = rb_prev(nd))
j++;
- if (i != j)
- printk("backwards %d, forwards %d\n", j, i), bug = 1;
+ if (i != j) {
+ printk("backwards %d, forwards %d\n", j, i);
+ bug = 1;
+ }
return bug ? -1 : i;
}

@@ -411,14 +422,20 @@ void validate_mm(struct mm_struct *mm)
vma = vma->vm_next;
i++;
}
- if (i != mm->map_count)
- printk("map_count %d vm_next %d\n", mm->map_count, i), bug = 1;
- if (highest_address != mm->highest_vm_end)
+ if (i != mm->map_count) {
+ printk("map_count %d vm_next %d\n", mm->map_count, i);
+ bug = 1;
+ }
+ if (highest_address != mm->highest_vm_end) {
printk("mm->highest_vm_end %lx, found %lx\n",
- mm->highest_vm_end, highest_address), bug = 1;
+ mm->highest_vm_end, highest_address);
+ bug = 1;
+ }
i = browse_rb(&mm->mm_rb);
- if (i != mm->map_count)
- printk("map_count %d rb %d\n", mm->map_count, i), bug = 1;
+ if (i != mm->map_count) {
+ printk("map_count %d rb %d\n", mm->map_count, i);
+ bug = 1;
+ }
BUG_ON(bug);
}
#else

2012-11-06 22:38:24

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 07/16] mm: fix cache coloring on x86_64 architecture

On Mon, 5 Nov 2012 14:47:04 -0800
Michel Lespinasse <[email protected]> wrote:

> Fix the x86-64 cache alignment code to take pgoff into account.
> Use the x86 and MIPS cache alignment code as the basis for a generic
> cache alignment function.
>
> The old x86 code will always align the mmap to aliasing boundaries,
> even if the program mmaps the file with a non-zero pgoff.
>
> If program A mmaps the file with pgoff 0 and program B mmaps the file
> with pgoff 1, the old code would align both mmaps, resulting in
> misaligned pages:
>
> A: 0123
> B: 123
>
> After this patch, they are aligned so the pages line up:
>
> A: 0123
> B:  123

We have a bit of a history of fiddling with coloring and finding that
the changes made at best no improvement. Or at least, that's my
perhaps faulty memory of it.

This one needs pretty careful testing, please.

2012-11-06 22:38:30

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 09/16] mm: use vm_unmapped_area() in hugetlbfs on i386 architecture

On Mon, 5 Nov 2012 14:47:06 -0800
Michel Lespinasse <[email protected]> wrote:

> Update the i386 hugetlb_get_unmapped_area function to make use of
> vm_unmapped_area() instead of implementing a brute force search.

The x86_64 coloring "fix" wasn't copied into i386?

2012-11-06 22:38:33

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 10/16] mm: use vm_unmapped_area() on mips architecture

On Mon, 5 Nov 2012 14:47:07 -0800
Michel Lespinasse <[email protected]> wrote:

> Update the mips arch_get_unmapped_area[_topdown] functions to make
> use of vm_unmapped_area() instead of implementing a brute force search.
>

Are the changes to the coloring equivalent to what was there before?
It's unobvious..

COLOUR_ALIGN_DOWN() is now unused and should be removed?

2012-11-06 22:38:39

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 11/16] mm: use vm_unmapped_area() on arm architecture

On Mon, 5 Nov 2012 14:47:08 -0800
Michel Lespinasse <[email protected]> wrote:

> Update the arm arch_get_unmapped_area[_topdown] functions to make
> use of vm_unmapped_area() instead of implementing a brute force search.

Again,

--- a/arch/arm/mm/mmap.c~mm-use-vm_unmapped_area-on-arm-architecture-fix
+++ a/arch/arm/mm/mmap.c
@@ -11,18 +11,6 @@
#include <linux/random.h>
#include <asm/cachetype.h>

-static inline unsigned long COLOUR_ALIGN_DOWN(unsigned long addr,
- unsigned long pgoff)
-{
- unsigned long base = addr & ~(SHMLBA-1);
- unsigned long off = (pgoff << PAGE_SHIFT) & (SHMLBA-1);
-
- if (base + off <= addr)
- return base + off;
-
- return base - off;
-}
-
#define COLOUR_ALIGN(addr,pgoff) \
((((addr)+SHMLBA-1)&~(SHMLBA-1)) + \
(((pgoff)<<PAGE_SHIFT) & (SHMLBA-1)))
_

2012-11-06 22:46:00

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 09/16] mm: use vm_unmapped_area() in hugetlbfs on i386 architecture

On 11/06/2012 05:38 PM, Andrew Morton wrote:
> On Mon, 5 Nov 2012 14:47:06 -0800
> Michel Lespinasse <[email protected]> wrote:
>
>> Update the i386 hugetlb_get_unmapped_area function to make use of
>> vm_unmapped_area() instead of implementing a brute force search.
>
> The x86_64 coloring "fix" wasn't copied into i386?

Only certain 64 bit AMD CPUs have that issue at all.

On x86, page coloring is really not much of an issue.

All the x86-64 patch does is make the x86-64 page
coloring code behave the same way page coloring
does on MIPS, SPARC, ARM, PA-RISC and others...

2012-11-09 14:13:56

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH 03/16] mm: check rb_subtree_gap correctness

On 11/05/2012 05:47 PM, Michel Lespinasse wrote:
> When CONFIG_DEBUG_VM_RB is enabled, check that rb_subtree_gap is
> correctly set for every vma and that mm->highest_vm_end is also correct.
>
> Also add an explicit 'bug' variable to track if browse_rb() detected any
> invalid condition.
>
> Signed-off-by: Michel Lespinasse <[email protected]>
> Reviewed-by: Rik van Riel <[email protected]>
>
> ---

Hi all,

While fuzzing with trinity inside a KVM tools (lkvm) guest, using today's -next
kernel, I'm getting these:


[ 117.007714] free gap 7fba0dd1c000, correct 7fba0dcfb000
[ 117.019773] map_count 750 rb -1
[ 117.028362] ------------[ cut here ]------------
[ 117.029813] kernel BUG at mm/mmap.c:439!
[ 117.031024] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 117.032933] Dumping ftrace buffer:
[ 117.033972] (ftrace buffer empty)
[ 117.035085] CPU 4
[ 117.035676] Pid: 6859, comm: trinity-child46 Tainted: G W 3.7.0-rc4-next-20121109-sasha-00013-g9407f3c #124
[ 117.038217] RIP: 0010:[<ffffffff81236687>] [<ffffffff81236687>] validate_mm+0x297/0x2c0
[ 117.041056] RSP: 0018:ffff880016a4fdf8 EFLAGS: 00010296
[ 117.041056] RAX: 0000000000000013 RBX: 00000000ffffffff RCX: 0000000000000006
[ 117.041056] RDX: 0000000000005270 RSI: ffff880024120910 RDI: 0000000000000286
[ 117.052131] RBP: ffff880016a4fe48 R08: 0000000000000000 R09: 0000000000000000
[ 117.052131] R10: 0000000000000001 R11: 0000000000000000 R12: 00000000000002ee
[ 117.052131] R13: 00007fffea1fc000 R14: ffff88002412c000 R15: 0000000000000000
[ 117.052131] FS: 00007fba129db700(0000) GS:ffff880063600000(0000) knlGS:0000000000000000
[ 117.052131] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 117.052131] CR2: 0000000003323288 CR3: 00000000169b2000 CR4: 00000000000406e0
[ 117.052131] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 117.052131] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[ 117.052131] Process trinity-child46 (pid: 6859, threadinfo ffff880016a4e000, task ffff880024120000)
[ 117.052131] Stack:
[ 117.052131] ffffffff8489e201 ffffffff81235aa0 ffff88000885cac8 0000000100000000
[ 117.052131] ffffffff812361b9 ffff88002412c000 ffff88000885cac8 ffff88000885cdc8
[ 117.052131] ffff88000885cdd0 ffff88002412c000 ffff880016a4fe98 ffffffff812367b4
[ 117.052131] Call Trace:
[ 117.052131] [<ffffffff81235aa0>] ? vma_compute_subtree_gap+0x40/0x40
[ 117.052131] [<ffffffff812361b9>] ? vma_gap_update+0x19/0x30
[ 117.052131] [<ffffffff812367b4>] vma_link+0x94/0xe0
[ 117.052131] [<ffffffff812386c4>] do_brk+0x2c4/0x380
[ 117.052131] [<ffffffff812387bf>] ? sys_brk+0x3f/0x190
[ 117.052131] [<ffffffff812388ce>] sys_brk+0x14e/0x190
[ 117.052131] [<ffffffff83be2618>] tracesys+0xe1/0xe6
[ 117.052131] Code: d8 41 8b 76 60 39 de 74 1b 89 da 48 c7 c7 c6 d9 89 84 31 c0 e8 01 76 94 02 eb 10 66 0f 1f 84 00 00 00 00 00
8b 45 c8 85 c0 74 18 <0f> 0b 4c 8d 48 e0 48 8b 70 e0 31 db c7 45 cc 00 00 00 00 e9 f4
[ 117.052131] RIP [<ffffffff81236687>] validate_mm+0x297/0x2c0
[ 117.052131] RSP <ffff880016a4fdf8>
[ 117.136092] ---[ end trace 5ce250e0bf6d040c ]---

Note that they are very easy to reproduce.

Also, I see that lots of the code there has a local variable named 'bug' that's
tracking whether we should BUG() later on. Why does it work that way, instead
of calling BUG() immediately?


Thanks,
Sasha

2012-11-09 20:06:29

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 03/16] mm: check rb_subtree_gap correctness

On Fri, 9 Nov 2012, Sasha Levin wrote:
> On 11/05/2012 05:47 PM, Michel Lespinasse wrote:
> > When CONFIG_DEBUG_VM_RB is enabled, check that rb_subtree_gap is
> > correctly set for every vma and that mm->highest_vm_end is also correct.
> >
> > Also add an explicit 'bug' variable to track if browse_rb() detected any
> > invalid condition.
> >
> > Signed-off-by: Michel Lespinasse <[email protected]>
> > Reviewed-by: Rik van Riel <[email protected]>
> >
> > ---
>
> Hi all,
>
> While fuzzing with trinity inside a KVM tools (lkvm) guest, using today's -next
> kernel, I'm getting these:
>
>
> [ 117.007714] free gap 7fba0dd1c000, correct 7fba0dcfb000
> [ 117.019773] map_count 750 rb -1
> [ 117.028362] ------------[ cut here ]------------
> [ 117.029813] kernel BUG at mm/mmap.c:439!
> [ rest of oops snipped; quoted in full in the report above ]
>
> Note that they are very easy to reproduce.
>
> Also, I see that lots of the code there has a local variable named 'bug' thats tracking
> whether we should BUG() later on. Why does it work that way and the BUG() isn't immediate?

3.7.0-rc4-mm1 BUGged on mm/mmap.c:439 as soon as I tried to rebuild
that kernel with Alan's tty/vt/fb patch included, no fuzzing required.

free_gap 55551d077000, correct 55551ccd2000 in my case.

It should only be affecting the minority with CONFIG_DEBUG_VM_RB=y.
I've put #if 0 around the rb_subtree_gap checking block in browse_rb(),
and running okay so far with that - but not yet done much with it.

Hugh

2012-11-12 11:55:22

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [PATCH 03/16] mm: check rb_subtree_gap correctness

On Fri, Nov 9, 2012 at 6:13 AM, Sasha Levin <[email protected]> wrote:
> While fuzzing with trinity inside a KVM tools (lkvm) guest, using today's -next
> kernel, I'm getting these:
>
> [ 117.007714] free gap 7fba0dd1c000, correct 7fba0dcfb000
> [ 117.019773] map_count 750 rb -1
> [ 117.028362] ------------[ cut here ]------------
> [ 117.029813] kernel BUG at mm/mmap.c:439!
>
> Note that they are very easy to reproduce.

Thanks for the report. I had trouble reproducing this on Friday, but
after Hugh came up with an easy test case I think I have it figured
out. I sent out a proposed fix as "[PATCH 0/3] fix missing
rb_subtree_gap updates on vma insert/erase". Let's follow up the
discussion there if necessary.

Cheers,

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.