2012-06-21 21:57:34

by Rik van Riel

Subject: [PATCH -mm v2 00/11] mm: scalable and unified arch_get_unmapped_area

A long time ago, we decided to limit the number of VMAs per
process to 64k. As it turns out, there actually are programs
using tens of thousands of VMAs.

The linear search in arch_get_unmapped_area and
arch_get_unmapped_area_topdown can be a real issue for
those programs.

This patch series aims to fix the scalability issue by
tracking the size of each free hole in the VMA rbtree,
propagating the free hole info up the tree.
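
To illustrate the idea, here is a simplified userspace sketch (the
struct and function names are made up for illustration; this is not
the code from these patches). Every node caches the largest free gap
anywhere in its subtree, so a search can skip whole subtrees and find
a large enough gap in O(log(N)) steps:

/* Toy version of the augmented-tree idea; not the patch code. */
struct toy_vma {
	unsigned long vm_start, vm_end;
	unsigned long free_gap;			/* largest gap in this subtree */
	struct toy_vma *left, *right, *prev;	/* prev = next lower VMA */
};

/* Gap between this VMA and the one immediately below it. */
static unsigned long own_gap(struct toy_vma *vma)
{
	return vma->prev ? vma->vm_start - vma->prev->vm_end : vma->vm_start;
}

/* Recompute the cached value after an insert or erase touched this node. */
static void update_free_gap(struct toy_vma *vma)
{
	unsigned long gap = own_gap(vma);

	if (vma->left && vma->left->free_gap > gap)
		gap = vma->left->free_gap;
	if (vma->right && vma->right->free_gap > gap)
		gap = vma->right->free_gap;
	vma->free_gap = gap;
}

/* Return the start of the lowest gap of at least len bytes, 0 if none. */
static unsigned long find_lowest_gap(struct toy_vma *node, unsigned long len)
{
	while (node) {
		if (node->left && node->left->free_gap >= len)
			node = node->left;	/* an even lower gap exists */
		else if (own_gap(node) >= len)
			return node->prev ? node->prev->vm_end : 0;
		else
			node = node->right;	/* gap must be at higher addresses */
	}
	return 0;	/* no gap between existing VMAs is big enough */
}

On insert and erase, update_free_gap() is run for every node on the
path back up to the root; that is roughly what the augmented rbtree
callbacks in patch 01 do for the real VMA tree.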

Another major goal is to put the bulk of the necessary
arch_get_unmapped_area(_topdown) functionality into one
set of functions, so we can eliminate the large custom
per-architecture implementations and keep only a few much
smaller architecture-specific helpers instead.

In this version I have only gotten rid of the x86, ARM, SH
and MIPS arch-specific code, and am already showing a
fairly promising diffstat:

arch/arm/include/asm/pgtable.h | 6
arch/arm/mm/init.c | 4
arch/arm/mm/mmap.c | 217 ------------------
arch/mips/include/asm/page.h | 2
arch/mips/include/asm/pgtable.h | 7
arch/mips/mm/mmap.c | 177 --------------
arch/sh/include/asm/pgtable.h | 4
arch/sh/mm/mmap.c | 219 ------------------
arch/x86/include/asm/elf.h | 3
arch/x86/include/asm/pgtable_64.h | 4
arch/x86/kernel/sys_x86_64.c | 200 ++--------------
arch/x86/vdso/vma.c | 2
include/linux/mm_types.h | 19 +
include/linux/rbtree.h | 12 +
include/linux/sched.h | 13 +
lib/rbtree.c | 46 +++
mm/internal.h | 5
mm/mmap.c | 449 +++++++++++++++++++++++++++++---------
18 files changed, 478 insertions(+), 911 deletions(-)

v2: address reviewers' comments
optimize propagating info up the VMA tree (30% faster at frag test)
add SH architecture

TODO:
- eliminate arch-specific functions for more architectures
- integrate hugetlbfs alignment (with Andi Kleen's patch?)

Performance

Testing performance with a benchmark that allocates tens
of thousands of VMAs, unmaps them, and mmaps some more in
a loop shows promising results.

Vanilla 3.4 kernel:
$ ./agua_frag_test_64
..........

Min Time (ms): 6
Avg. Time (ms): 294.0000
Max Time (ms): 609
Std Dev (ms): 113.1664
Standard deviation exceeds 10

With -v2 patches:
$ ./agua_frag_test_64
..........

Min Time (ms): 12
Avg. Time (ms): 31.0000
Max Time (ms): 42
Std Dev (ms): 3.3648
All checks pass

The total run time of the test goes down by about a
factor of 5. More importantly, the worst-case performance
of the loop (which is what really hurt some applications)
has gone down by about a factor of 14.


2012-06-21 21:57:51

by Rik van Riel

Subject: [PATCH -mm v2 06/11] mm: arbitrary address ranges for arch_get_unmapped_area

Allow each architecture to specify the address range that
arch_get_unmapped_area(_topdown) can use for an allocation.

On x86-64, this is used to implement MAP_32BIT semantics.

On PPC and IA64, allocations using certain page sizes need to be
restricted to certain virtual address ranges. This callback could
be used to implement such address restrictions with minimal hassle.
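
To illustrate, a hypothetical architecture that had to keep some
mappings below 4GB could implement the callback roughly like this
(MAP_LOW4G is a made-up flag and the whole function is a sketch, not
part of this series; the fallback branches mirror the generic and
x86-64 versions in this patch):

/* In the architecture's header: #define HAVE_ARCH_GET_ADDRESS_RANGE */
void arch_get_address_range(unsigned long flags, unsigned long *begin,
		unsigned long *end, enum mmap_allocation_direction direction)
{
	if (flags & MAP_LOW4G) {
		/* Hypothetical: this mapping must sit below 4GB. */
		*begin = PAGE_SIZE;
		*end = min(TASK_SIZE, 1UL << 32);
	} else if (direction == ALLOC_UP) {
		*begin = TASK_UNMAPPED_BASE;
		*end = TASK_SIZE;
	} else /* direction == ALLOC_DOWN */ {
		*begin = 0;
		*end = current->mm->mmap_base;
	}
}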

Signed-off-by: Rik van Riel <[email protected]>
---
arch/mips/mm/mmap.c | 8 ++----
arch/x86/include/asm/pgtable_64.h | 1 +
arch/x86/kernel/sys_x86_64.c | 11 ++++++---
include/linux/sched.h | 7 ++++++
mm/mmap.c | 38 ++++++++++++++++++++++++++++++++++--
5 files changed, 53 insertions(+), 12 deletions(-)

diff --git a/arch/mips/mm/mmap.c b/arch/mips/mm/mmap.c
index 302d779..3f8af17 100644
--- a/arch/mips/mm/mmap.c
+++ b/arch/mips/mm/mmap.c
@@ -61,8 +61,6 @@ static inline unsigned long COLOUR_ALIGN_DOWN(unsigned long addr,
((((addr) + shm_align_mask) & ~shm_align_mask) + \
(((pgoff) << PAGE_SHIFT) & shm_align_mask))

-enum mmap_allocation_direction {UP, DOWN};
-
static unsigned long arch_get_unmapped_area_common(struct file *filp,
unsigned long addr0, unsigned long len, unsigned long pgoff,
unsigned long flags, enum mmap_allocation_direction dir)
@@ -107,7 +105,7 @@ static unsigned long arch_get_unmapped_area_common(struct file *filp,
return addr;
}

- if (dir == UP) {
+ if (dir == ALLOC_UP) {
addr = mm->mmap_base;
if (do_color_align)
addr = COLOUR_ALIGN(addr, pgoff);
@@ -204,7 +202,7 @@ unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr0,
unsigned long len, unsigned long pgoff, unsigned long flags)
{
return arch_get_unmapped_area_common(filp,
- addr0, len, pgoff, flags, UP);
+ addr0, len, pgoff, flags, ALLOC_UP);
}

/*
@@ -216,7 +214,7 @@ unsigned long arch_get_unmapped_area_topdown(struct file *filp,
unsigned long flags)
{
return arch_get_unmapped_area_common(filp,
- addr0, len, pgoff, flags, DOWN);
+ addr0, len, pgoff, flags, ALLOC_DOWN);
}

void arch_pick_mmap_layout(struct mm_struct *mm)
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 975f709..8af36f6 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -169,6 +169,7 @@ extern void cleanup_highmap(void);

#define HAVE_ARCH_UNMAPPED_AREA
#define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
+#define HAVE_ARCH_GET_ADDRESS_RANGE

#define pgtable_cache_init() do { } while (0)
#define check_pgt_cache() do { } while (0)
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index b4d3c39..2595a5e 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -95,8 +95,8 @@ SYSCALL_DEFINE6(mmap, unsigned long, addr, unsigned long, len,
return error;
}

-static void find_start_end(unsigned long flags, unsigned long *begin,
- unsigned long *end)
+void arch_get_address_range(unsigned long flags, unsigned long *begin,
+ unsigned long *end, enum mmap_allocation_direction direction)
{
if (!test_thread_flag(TIF_ADDR32) && (flags & MAP_32BIT)) {
unsigned long new_begin;
@@ -114,9 +114,12 @@ static void find_start_end(unsigned long flags, unsigned long *begin,
if (new_begin)
*begin = new_begin;
}
- } else {
+ } else if (direction == ALLOC_UP) {
*begin = TASK_UNMAPPED_BASE;
*end = TASK_SIZE;
+ } else /* direction == ALLOC_DOWN */ {
+ *begin = 0;
+ *end = current->mm->mmap_base;
}
}

@@ -132,7 +135,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
if (flags & MAP_FIXED)
return addr;

- find_start_end(flags, &begin, &end);
+ arch_get_address_range(flags, &begin, &end, ALLOC_UP);

if (len > end)
return -ENOMEM;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4059c0f..fc76318 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -388,7 +388,14 @@ extern int sysctl_max_map_count;
#include <linux/aio.h>

#ifdef CONFIG_MMU
+enum mmap_allocation_direction {
+ ALLOC_UP,
+ ALLOC_DOWN
+};
extern void arch_pick_mmap_layout(struct mm_struct *mm);
+extern void
+arch_get_address_range(unsigned long flags, unsigned long *begin,
+ unsigned long *end, enum mmap_allocation_direction direction);
extern unsigned long
arch_get_unmapped_area(struct file *, unsigned long, unsigned long,
unsigned long, unsigned long);
diff --git a/mm/mmap.c b/mm/mmap.c
index e501b4f..2420951 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1482,6 +1482,20 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
return error;
}

+#ifndef HAVE_ARCH_GET_ADDRESS_RANGE
+void arch_get_address_range(unsigned long flags, unsigned long *begin,
+ unsigned long *end, enum mmap_allocation_direction direction)
+{
+ if (direction == ALLOC_UP) {
+ *begin = TASK_UNMAPPED_BASE;
+ *end = TASK_SIZE;
+ } else /* direction == ALLOC_DOWN */ {
+ *begin = 0;
+ *end = current->mm->mmap_base;
+ }
+}
+#endif
+
/* Get an address range which is currently unmapped.
* For shmat() with addr=0.
*
@@ -1501,7 +1515,9 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = NULL;
struct rb_node *rb_node;
- unsigned long lower_limit = TASK_UNMAPPED_BASE;
+ unsigned long lower_limit, upper_limit;
+
+ arch_get_address_range(flags, &lower_limit, &upper_limit, ALLOC_UP);

if (len > TASK_SIZE)
return -ENOMEM;
@@ -1546,6 +1562,13 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
rb_node = rb_node->rb_left;
continue;
}
+
+ /* We have gone too far right, and can not go left. */
+ if (vma->vm_end + len > upper_limit) {
+ if (!addr)
+ return -ENOMEM;
+ goto found_addr;
+ }
}

if (!found_here && node_free_gap(rb_node->rb_right) >= len) {
@@ -1612,7 +1635,9 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
struct mm_struct *mm = current->mm;
unsigned long addr = addr0;
struct rb_node *rb_node = NULL;
- unsigned long upper_limit = mm->mmap_base;
+ unsigned long lower_limit, upper_limit;
+
+ arch_get_address_range(flags, &lower_limit, &upper_limit, ALLOC_DOWN);

/* requested length too big for entire address space */
if (len > TASK_SIZE)
@@ -1631,7 +1656,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
}

/* requested length too big; prevent integer underflow below */
- if (len > upper_limit)
+ if (len > upper_limit - lower_limit)
return -ENOMEM;

/*
@@ -1668,6 +1693,13 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
}
}

+ /* We have gone too far left, and can not go right. */
+ if (vma->vm_start < lower_limit + len) {
+ if (!addr)
+ return -ENOMEM;
+ goto found_addr;
+ }
+
if (!found_here && node_free_gap(rb_node->rb_left) >= len) {
/* Last known gap is to the right of this subtree. */
rb_node = rb_node->rb_left;
--
1.7.7.6

2012-06-21 21:57:56

by Rik van Riel

Subject: [PATCH -mm v2 01/11] mm: track free size between VMAs in VMA rbtree

Track the size of free areas between VMAs in the VMA rbtree.

This will allow get_unmapped_area_* to find a free area of the
right size in O(log(N)) time, instead of potentially having to
do a linear walk across all the VMAs.

This does have the potential to make unmapping VMAs more expensive,
especially for processes with very large numbers of VMAs, where the
VMA rbtree can grow quite deep.

Signed-off-by: Rik van Riel <[email protected]>
---
include/linux/mm_types.h | 8 ++++
include/linux/rbtree.h | 8 ++++
mm/internal.h | 5 +++
mm/mmap.c | 84 ++++++++++++++++++++++++++++++++++++++++++++--
4 files changed, 102 insertions(+), 3 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index dad95bd..9fc0291 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -213,6 +213,14 @@ struct vm_area_struct {
struct rb_node vm_rb;

/*
+ * Largest free memory gap in bytes to the left of this VMA.
+ * Either between this VMA and vma->vm_prev, or between one of the
+ * VMAs below us in the VMA rbtree and its ->vm_prev. This helps
+ * get_unmapped_area find a free area of the right size.
+ */
+ unsigned long free_gap;
+
+ /*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap prio tree, or
* linkage to the list of like vmas hanging off its node, or
diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index 033b507..661288d 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -155,6 +155,14 @@ extern struct rb_node *rb_augment_erase_begin(struct rb_node *node);
extern void rb_augment_erase_end(struct rb_node *node,
rb_augment_f func, void *data);

+static inline void rb_augment_erase(struct rb_node *node, struct rb_root *root,
+ rb_augment_f func, void *data)
+{
+ struct rb_node *deepest = rb_augment_erase_begin(node);
+ rb_erase(node, root);
+ rb_augment_erase_end(deepest, func, data);
+}
+
/* Find logical next and previous nodes in a tree */
extern struct rb_node *rb_next(const struct rb_node *);
extern struct rb_node *rb_prev(const struct rb_node *);
diff --git a/mm/internal.h b/mm/internal.h
index 2ba87fb..f59f97a 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -159,6 +159,11 @@ static inline void munlock_vma_pages_all(struct vm_area_struct *vma)
munlock_vma_pages_range(vma, vma->vm_start, vma->vm_end);
}

+static inline struct vm_area_struct *rb_to_vma(struct rb_node *node)
+{
+ return container_of(node, struct vm_area_struct, vm_rb);
+}
+
/*
* Called only in fault path via page_evictable() for a new page
* to determine if it's being mapped into a LOCKED vma.
diff --git a/mm/mmap.c b/mm/mmap.c
index 3edfcdf..dd6edcd 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -206,6 +206,60 @@ static void __remove_shared_vm_struct(struct vm_area_struct *vma,
}

/*
+ * largest_free_gap - returns the largest free gap "nearby"
+ *
+ * This function is called when propagating information on the
+ * free spaces between VMAs "up" the VMA rbtree. It returns the
+ * largest of:
+ *
+ * 1. The gap between this VMA and vma->vm_prev.
+ * 2. The largest gap below us and to our left in the rbtree.
+ * 3. The largest gap below us and to our right in the rbtree.
+ */
+static unsigned long largest_free_gap(struct rb_node *node)
+{
+ struct vm_area_struct *vma, *prev, *left = NULL, *right = NULL;
+ unsigned long largest = 0;
+
+ if (node->rb_left)
+ left = rb_to_vma(node->rb_left);
+ if (node->rb_right)
+ right = rb_to_vma(node->rb_right);
+
+ /* Calculate the free gap size between us and the VMA to our left. */
+ vma = rb_to_vma(node);
+ prev = vma->vm_prev;
+
+ if (prev)
+ largest = vma->vm_start - prev->vm_end;
+ else
+ largest = vma->vm_start;
+
+ /* We propagate the largest of our own, or our children's free gaps. */
+ if (left)
+ largest = max(largest, left->free_gap);
+ if (right)
+ largest = max(largest, right->free_gap);
+
+ return largest;
+}
+
+static void vma_rb_augment_cb(struct rb_node *node, void *__unused)
+{
+ struct vm_area_struct *vma = rb_to_vma(node);
+ vma->free_gap = largest_free_gap(node);
+}
+
+/*
+ * Use the augmented rbtree code to propagate info on the largest
+ * free gap between VMAs up the VMA rbtree.
+ */
+static void adjust_free_gap(struct vm_area_struct *vma)
+{
+ rb_augment_erase_end(&vma->vm_rb, vma_rb_augment_cb, NULL);
+}
+
+/*
* Unlink a file-based vm structure from its prio_tree, to hide
* vma from rmap and vmtruncate before freeing its page tables.
*/
@@ -342,6 +396,8 @@ void validate_mm(struct mm_struct *mm)
int i = 0;
struct vm_area_struct *tmp = mm->mmap;
while (tmp) {
+ if (tmp->free_gap != largest_free_gap(&tmp->vm_rb))
+ printk("free space %lx, correct %lx\n", tmp->free_gap, largest_free_gap(&tmp->vm_rb)), bug = 1;
tmp = tmp->vm_next;
i++;
}
@@ -398,6 +454,10 @@ void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
{
rb_link_node(&vma->vm_rb, rb_parent, rb_link);
rb_insert_color(&vma->vm_rb, &mm->mm_rb);
+ adjust_free_gap(vma);
+ /* Propagate the new free gap between next and us up the tree. */
+ if (vma->vm_next)
+ adjust_free_gap(vma->vm_next);
}

static void __vma_link_file(struct vm_area_struct *vma)
@@ -475,9 +535,12 @@ __vma_unlink(struct mm_struct *mm, struct vm_area_struct *vma,
struct vm_area_struct *next = vma->vm_next;

prev->vm_next = next;
- if (next)
+ if (next) {
next->vm_prev = prev;
- rb_erase(&vma->vm_rb, &mm->mm_rb);
+ adjust_free_gap(next);
+ }
+ rb_augment_erase(&vma->vm_rb, &mm->mm_rb, vma_rb_augment_cb, NULL);
+
if (mm->mmap_cache == vma)
mm->mmap_cache = prev;
}
@@ -657,6 +720,15 @@ again: remove_next = 1 + (end > next->vm_end);
if (insert && file)
uprobe_mmap(insert);

+ /* Adjust the rb tree for changes in the free gaps between VMAs. */
+ adjust_free_gap(vma);
+ if (insert)
+ adjust_free_gap(insert);
+ if (vma->vm_next && vma->vm_next != insert)
+ adjust_free_gap(vma->vm_next);
+ if (insert && insert->vm_next && insert->vm_next != vma)
+ adjust_free_gap(insert->vm_next);
+
validate_mm(mm);

return 0;
@@ -1760,6 +1832,8 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
error = acct_stack_growth(vma, size, grow);
if (!error) {
vma->vm_end = address;
+ if (vma->vm_next)
+ adjust_free_gap(vma->vm_next);
perf_event_mmap(vma);
}
}
@@ -1811,6 +1885,7 @@ int expand_downwards(struct vm_area_struct *vma,
if (!error) {
vma->vm_start = address;
vma->vm_pgoff -= grow;
+ adjust_free_gap(vma);
perf_event_mmap(vma);
}
}
@@ -1933,7 +2008,8 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
insertion_point = (prev ? &prev->vm_next : &mm->mmap);
vma->vm_prev = NULL;
do {
- rb_erase(&vma->vm_rb, &mm->mm_rb);
+ rb_augment_erase(&vma->vm_rb, &mm->mm_rb,
+ vma_rb_augment_cb, NULL);
mm->map_count--;
tail_vma = vma;
vma = vma->vm_next;
@@ -1941,6 +2017,8 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
*insertion_point = vma;
if (vma)
vma->vm_prev = prev;
+ if (vma)
+ rb_augment_erase_end(&vma->vm_rb, vma_rb_augment_cb, NULL);
tail_vma->vm_next = NULL;
if (mm->unmap_area == arch_unmap_area)
addr = prev ? prev->vm_end : mm->mmap_base;
--
1.7.7.6

2012-06-21 21:57:58

by Rik van Riel

Subject: [PATCH -mm v2 03/11] mm: vma_adjust: only call adjust_free_gap when needed

When vma_adjust inserts or removes VMAs, __insert_vm_struct
and __vma_unlink already take care of keeping the free gap information
in the VMA rbtree up to date.

vma_adjust only needs to call adjust_free_gap itself when a VMA
boundary shifts without an insertion or removal on that side of vma.

Signed-off-by: Rik van Riel <[email protected]>
---
mm/mmap.c | 29 ++++++++++++++++++++---------
1 files changed, 20 insertions(+), 9 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index dd6edcd..95f66c5 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -562,6 +562,7 @@ int vma_adjust(struct vm_area_struct *vma, unsigned long start,
struct prio_tree_root *root = NULL;
struct anon_vma *anon_vma = NULL;
struct file *file = vma->vm_file;
+ bool start_changed = false, end_changed = false;
long adjust_next = 0;
int remove_next = 0;

@@ -651,8 +652,14 @@ again: remove_next = 1 + (end > next->vm_end);
vma_prio_tree_remove(next, root);
}

- vma->vm_start = start;
- vma->vm_end = end;
+ if (start != vma->vm_start) {
+ vma->vm_start = start;
+ start_changed = true;
+ }
+ if (end != vma->vm_end) {
+ vma->vm_end = end;
+ end_changed = true;
+ }
vma->vm_pgoff = pgoff;
if (adjust_next) {
next->vm_start += adjust_next << PAGE_SHIFT;
@@ -720,14 +727,18 @@ again: remove_next = 1 + (end > next->vm_end);
if (insert && file)
uprobe_mmap(insert);

- /* Adjust the rb tree for changes in the free gaps between VMAs. */
- adjust_free_gap(vma);
- if (insert)
- adjust_free_gap(insert);
- if (vma->vm_next && vma->vm_next != insert)
+ /*
+ * When inserting or removing VMAs, __insert_vm_struct or __vma_unlink
+ * have already taken care of updating the free gap information on
+ * that side of vma.
+ * Only call adjust_free_gap if VMA boundaries changed without
+ * insertion or removal on that side of vma.
+ */
+ if (start_changed && (!vma->vm_prev || vma->vm_prev != insert))
+ adjust_free_gap(vma);
+ if (end_changed && !remove_next && vma->vm_next &&
+ vma->vm_next != insert)
adjust_free_gap(vma->vm_next);
- if (insert && insert->vm_next && insert->vm_next != vma)
- adjust_free_gap(insert->vm_next);

validate_mm(mm);

--
1.7.7.6

2012-06-21 21:58:05

by Rik van Riel

Subject: [PATCH -mm v2 11/11] mm: remove SH arch_get_unmapped_area functions

Remove the SH special variants of arch_get_unmapped_area since
the generic functions should now be able to handle everything.

Paul, does anything in NOMMU SH need shm_align_mask?

Untested because I have no SH hardware.

Signed-off-by: Rik van Riel <[email protected]>
Cc: Paul Mundt <[email protected]>
Cc: Magnus Damm <[email protected]>
Cc: [email protected]
---
arch/sh/include/asm/pgtable.h | 4 -
arch/sh/mm/mmap.c | 219 +----------------------------------------
2 files changed, 2 insertions(+), 221 deletions(-)

diff --git a/arch/sh/include/asm/pgtable.h b/arch/sh/include/asm/pgtable.h
index 9210e93..03f5312 100644
--- a/arch/sh/include/asm/pgtable.h
+++ b/arch/sh/include/asm/pgtable.h
@@ -155,10 +155,6 @@ extern void paging_init(void);
extern void page_table_range_init(unsigned long start, unsigned long end,
pgd_t *pgd);

-/* arch/sh/mm/mmap.c */
-#define HAVE_ARCH_UNMAPPED_AREA
-#define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
-
#define __HAVE_ARCH_PTE_SPECIAL

#include <asm-generic/pgtable.h>
diff --git a/arch/sh/mm/mmap.c b/arch/sh/mm/mmap.c
index afeb710..12f5edd 100644
--- a/arch/sh/mm/mmap.c
+++ b/arch/sh/mm/mmap.c
@@ -14,225 +14,10 @@
#include <asm/page.h>
#include <asm/processor.h>

+#ifndef CONFIG_MMU
unsigned long shm_align_mask = PAGE_SIZE - 1; /* Sane caches */
EXPORT_SYMBOL(shm_align_mask);
-
-#ifdef CONFIG_MMU
-/*
- * To avoid cache aliases, we map the shared page with same color.
- */
-static inline unsigned long COLOUR_ALIGN(unsigned long addr,
- unsigned long pgoff)
-{
- unsigned long base = (addr + shm_align_mask) & ~shm_align_mask;
- unsigned long off = (pgoff << PAGE_SHIFT) & shm_align_mask;
-
- return base + off;
-}
-
-static inline unsigned long COLOUR_ALIGN_DOWN(unsigned long addr,
- unsigned long pgoff)
-{
- unsigned long base = addr & ~shm_align_mask;
- unsigned long off = (pgoff << PAGE_SHIFT) & shm_align_mask;
-
- if (base + off <= addr)
- return base + off;
-
- return base - off;
-}
-
-unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr,
- unsigned long len, unsigned long pgoff, unsigned long flags)
-{
- struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma;
- unsigned long start_addr;
- int do_colour_align;
-
- if (flags & MAP_FIXED) {
- /* We do not accept a shared mapping if it would violate
- * cache aliasing constraints.
- */
- if ((flags & MAP_SHARED) &&
- ((addr - (pgoff << PAGE_SHIFT)) & shm_align_mask))
- return -EINVAL;
- return addr;
- }
-
- if (unlikely(len > TASK_SIZE))
- return -ENOMEM;
-
- do_colour_align = 0;
- if (filp || (flags & MAP_SHARED))
- do_colour_align = 1;
-
- if (addr) {
- if (do_colour_align)
- addr = COLOUR_ALIGN(addr, pgoff);
- else
- addr = PAGE_ALIGN(addr);
-
- vma = find_vma(mm, addr);
- if (TASK_SIZE - len >= addr &&
- (!vma || addr + len <= vma->vm_start))
- return addr;
- }
-
- if (len > mm->cached_hole_size) {
- start_addr = addr = mm->free_area_cache;
- } else {
- mm->cached_hole_size = 0;
- start_addr = addr = TASK_UNMAPPED_BASE;
- }
-
-full_search:
- if (do_colour_align)
- addr = COLOUR_ALIGN(addr, pgoff);
- else
- addr = PAGE_ALIGN(mm->free_area_cache);
-
- for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
- /* At this point: (!vma || addr < vma->vm_end). */
- if (unlikely(TASK_SIZE - len < addr)) {
- /*
- * Start a new search - just in case we missed
- * some holes.
- */
- if (start_addr != TASK_UNMAPPED_BASE) {
- start_addr = addr = TASK_UNMAPPED_BASE;
- mm->cached_hole_size = 0;
- goto full_search;
- }
- return -ENOMEM;
- }
- if (likely(!vma || addr + len <= vma->vm_start)) {
- /*
- * Remember the place where we stopped the search:
- */
- mm->free_area_cache = addr + len;
- return addr;
- }
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
-
- addr = vma->vm_end;
- if (do_colour_align)
- addr = COLOUR_ALIGN(addr, pgoff);
- }
-}
-
-unsigned long
-arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
- const unsigned long len, const unsigned long pgoff,
- const unsigned long flags)
-{
- struct vm_area_struct *vma;
- struct mm_struct *mm = current->mm;
- unsigned long addr = addr0;
- int do_colour_align;
-
- if (flags & MAP_FIXED) {
- /* We do not accept a shared mapping if it would violate
- * cache aliasing constraints.
- */
- if ((flags & MAP_SHARED) &&
- ((addr - (pgoff << PAGE_SHIFT)) & shm_align_mask))
- return -EINVAL;
- return addr;
- }
-
- if (unlikely(len > TASK_SIZE))
- return -ENOMEM;
-
- do_colour_align = 0;
- if (filp || (flags & MAP_SHARED))
- do_colour_align = 1;
-
- /* requesting a specific address */
- if (addr) {
- if (do_colour_align)
- addr = COLOUR_ALIGN(addr, pgoff);
- else
- addr = PAGE_ALIGN(addr);
-
- vma = find_vma(mm, addr);
- if (TASK_SIZE - len >= addr &&
- (!vma || addr + len <= vma->vm_start))
- return addr;
- }
-
- /* check if free_area_cache is useful for us */
- if (len <= mm->cached_hole_size) {
- mm->cached_hole_size = 0;
- mm->free_area_cache = mm->mmap_base;
- }
-
- /* either no address requested or can't fit in requested address hole */
- addr = mm->free_area_cache;
- if (do_colour_align) {
- unsigned long base = COLOUR_ALIGN_DOWN(addr-len, pgoff);
-
- addr = base + len;
- }
-
- /* make sure it can fit in the remaining address space */
- if (likely(addr > len)) {
- vma = find_vma(mm, addr-len);
- if (!vma || addr <= vma->vm_start) {
- /* remember the address as a hint for next time */
- return (mm->free_area_cache = addr-len);
- }
- }
-
- if (unlikely(mm->mmap_base < len))
- goto bottomup;
-
- addr = mm->mmap_base-len;
- if (do_colour_align)
- addr = COLOUR_ALIGN_DOWN(addr, pgoff);
-
- do {
- /*
- * Lookup failure means no vma is above this address,
- * else if new region fits below vma->vm_start,
- * return with success:
- */
- vma = find_vma(mm, addr);
- if (likely(!vma || addr+len <= vma->vm_start)) {
- /* remember the address as a hint for next time */
- return (mm->free_area_cache = addr);
- }
-
- /* remember the largest hole we saw so far */
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
-
- /* try just below the current vma->vm_start */
- addr = vma->vm_start-len;
- if (do_colour_align)
- addr = COLOUR_ALIGN_DOWN(addr, pgoff);
- } while (likely(len < vma->vm_start));
-
-bottomup:
- /*
- * A failed mmap() very likely causes application failure,
- * so fall back to the bottom-up function here. This scenario
- * can happen with large stack limits and large mmap()
- * allocations.
- */
- mm->cached_hole_size = ~0UL;
- mm->free_area_cache = TASK_UNMAPPED_BASE;
- addr = arch_get_unmapped_area(filp, addr0, len, pgoff, flags);
- /*
- * Restore the topdown base:
- */
- mm->free_area_cache = mm->mmap_base;
- mm->cached_hole_size = ~0UL;
-
- return addr;
-}
-#endif /* CONFIG_MMU */
+#endif

/*
* You really shouldn't be using read() or write() on /dev/mem. This
--
1.7.7.6

2012-06-21 21:58:10

by Rik van Riel

Subject: [PATCH -mm v2 10/11] mm: remove ARM arch_get_unmapped_area functions

Remove the ARM special variants of arch_get_unmapped_area since the
generic functions should now be able to handle everything.

ARM only needs page colouring if cache_is_vipt_aliasing; leave
shm_align_mask at PAGE_SIZE-1 unless we need colouring.

Untested because I have no ARM hardware.

Cc: Russell King <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
arch/arm/include/asm/pgtable.h | 6 -
arch/arm/mm/init.c | 4 +
arch/arm/mm/mmap.c | 217 ----------------------------------------
3 files changed, 4 insertions(+), 223 deletions(-)

diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index f66626d..6754183 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -296,12 +296,6 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
#include <asm-generic/pgtable.h>

/*
- * We provide our own arch_get_unmapped_area to cope with VIPT caches.
- */
-#define HAVE_ARCH_UNMAPPED_AREA
-#define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
-
-/*
* remap a physical page `pfn' of size `size' with page protection `prot'
* into virtual address `from'
*/
diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index f54d592..abaf862 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -600,6 +600,10 @@ void __init mem_init(void)
extern u32 itcm_end;
#endif

+ /* Tell the page colouring code what we need. */
+ if (cache_is_vipt_aliasing())
+ shm_align_mask = SHMLBA - 1;
+
max_mapnr = pfn_to_page(max_pfn + PHYS_PFN_OFFSET) - mem_map;

/* this will put all unused low memory onto the freelists */
diff --git a/arch/arm/mm/mmap.c b/arch/arm/mm/mmap.c
index ce8cb19..d90969b 100644
--- a/arch/arm/mm/mmap.c
+++ b/arch/arm/mm/mmap.c
@@ -11,22 +11,6 @@
#include <linux/random.h>
#include <asm/cachetype.h>

-static inline unsigned long COLOUR_ALIGN_DOWN(unsigned long addr,
- unsigned long pgoff)
-{
- unsigned long base = addr & ~(SHMLBA-1);
- unsigned long off = (pgoff << PAGE_SHIFT) & (SHMLBA-1);
-
- if (base + off <= addr)
- return base + off;
-
- return base - off;
-}
-
-#define COLOUR_ALIGN(addr,pgoff) \
- ((((addr)+SHMLBA-1)&~(SHMLBA-1)) + \
- (((pgoff)<<PAGE_SHIFT) & (SHMLBA-1)))
-
/* gap between mmap and stack */
#define MIN_GAP (128*1024*1024UL)
#define MAX_GAP ((TASK_SIZE)/6*5)
@@ -54,207 +38,6 @@ static unsigned long mmap_base(unsigned long rnd)
return PAGE_ALIGN(TASK_SIZE - gap - rnd);
}

-/*
- * We need to ensure that shared mappings are correctly aligned to
- * avoid aliasing issues with VIPT caches. We need to ensure that
- * a specific page of an object is always mapped at a multiple of
- * SHMLBA bytes.
- *
- * We unconditionally provide this function for all cases, however
- * in the VIVT case, we optimise out the alignment rules.
- */
-unsigned long
-arch_get_unmapped_area(struct file *filp, unsigned long addr,
- unsigned long len, unsigned long pgoff, unsigned long flags)
-{
- struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma;
- unsigned long start_addr;
- int do_align = 0;
- int aliasing = cache_is_vipt_aliasing();
-
- /*
- * We only need to do colour alignment if either the I or D
- * caches alias.
- */
- if (aliasing)
- do_align = filp || (flags & MAP_SHARED);
-
- /*
- * We enforce the MAP_FIXED case.
- */
- if (flags & MAP_FIXED) {
- if (aliasing && flags & MAP_SHARED &&
- (addr - (pgoff << PAGE_SHIFT)) & (SHMLBA - 1))
- return -EINVAL;
- return addr;
- }
-
- if (len > TASK_SIZE)
- return -ENOMEM;
-
- if (addr) {
- if (do_align)
- addr = COLOUR_ALIGN(addr, pgoff);
- else
- addr = PAGE_ALIGN(addr);
-
- vma = find_vma(mm, addr);
- if (TASK_SIZE - len >= addr &&
- (!vma || addr + len <= vma->vm_start))
- return addr;
- }
- if (len > mm->cached_hole_size) {
- start_addr = addr = mm->free_area_cache;
- } else {
- start_addr = addr = mm->mmap_base;
- mm->cached_hole_size = 0;
- }
-
-full_search:
- if (do_align)
- addr = COLOUR_ALIGN(addr, pgoff);
- else
- addr = PAGE_ALIGN(addr);
-
- for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
- /* At this point: (!vma || addr < vma->vm_end). */
- if (TASK_SIZE - len < addr) {
- /*
- * Start a new search - just in case we missed
- * some holes.
- */
- if (start_addr != TASK_UNMAPPED_BASE) {
- start_addr = addr = TASK_UNMAPPED_BASE;
- mm->cached_hole_size = 0;
- goto full_search;
- }
- return -ENOMEM;
- }
- if (!vma || addr + len <= vma->vm_start) {
- /*
- * Remember the place where we stopped the search:
- */
- mm->free_area_cache = addr + len;
- return addr;
- }
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
- addr = vma->vm_end;
- if (do_align)
- addr = COLOUR_ALIGN(addr, pgoff);
- }
-}
-
-unsigned long
-arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
- const unsigned long len, const unsigned long pgoff,
- const unsigned long flags)
-{
- struct vm_area_struct *vma;
- struct mm_struct *mm = current->mm;
- unsigned long addr = addr0;
- int do_align = 0;
- int aliasing = cache_is_vipt_aliasing();
-
- /*
- * We only need to do colour alignment if either the I or D
- * caches alias.
- */
- if (aliasing)
- do_align = filp || (flags & MAP_SHARED);
-
- /* requested length too big for entire address space */
- if (len > TASK_SIZE)
- return -ENOMEM;
-
- if (flags & MAP_FIXED) {
- if (aliasing && flags & MAP_SHARED &&
- (addr - (pgoff << PAGE_SHIFT)) & (SHMLBA - 1))
- return -EINVAL;
- return addr;
- }
-
- /* requesting a specific address */
- if (addr) {
- if (do_align)
- addr = COLOUR_ALIGN(addr, pgoff);
- else
- addr = PAGE_ALIGN(addr);
- vma = find_vma(mm, addr);
- if (TASK_SIZE - len >= addr &&
- (!vma || addr + len <= vma->vm_start))
- return addr;
- }
-
- /* check if free_area_cache is useful for us */
- if (len <= mm->cached_hole_size) {
- mm->cached_hole_size = 0;
- mm->free_area_cache = mm->mmap_base;
- }
-
- /* either no address requested or can't fit in requested address hole */
- addr = mm->free_area_cache;
- if (do_align) {
- unsigned long base = COLOUR_ALIGN_DOWN(addr - len, pgoff);
- addr = base + len;
- }
-
- /* make sure it can fit in the remaining address space */
- if (addr > len) {
- vma = find_vma(mm, addr-len);
- if (!vma || addr <= vma->vm_start)
- /* remember the address as a hint for next time */
- return (mm->free_area_cache = addr-len);
- }
-
- if (mm->mmap_base < len)
- goto bottomup;
-
- addr = mm->mmap_base - len;
- if (do_align)
- addr = COLOUR_ALIGN_DOWN(addr, pgoff);
-
- do {
- /*
- * Lookup failure means no vma is above this address,
- * else if new region fits below vma->vm_start,
- * return with success:
- */
- vma = find_vma(mm, addr);
- if (!vma || addr+len <= vma->vm_start)
- /* remember the address as a hint for next time */
- return (mm->free_area_cache = addr);
-
- /* remember the largest hole we saw so far */
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
-
- /* try just below the current vma->vm_start */
- addr = vma->vm_start - len;
- if (do_align)
- addr = COLOUR_ALIGN_DOWN(addr, pgoff);
- } while (len < vma->vm_start);
-
-bottomup:
- /*
- * A failed mmap() very likely causes application failure,
- * so fall back to the bottom-up function here. This scenario
- * can happen with large stack limits and large mmap()
- * allocations.
- */
- mm->cached_hole_size = ~0UL;
- mm->free_area_cache = TASK_UNMAPPED_BASE;
- addr = arch_get_unmapped_area(filp, addr0, len, pgoff, flags);
- /*
- * Restore the topdown base:
- */
- mm->free_area_cache = mm->mmap_base;
- mm->cached_hole_size = ~0UL;
-
- return addr;
-}
-
void arch_pick_mmap_layout(struct mm_struct *mm)
{
unsigned long random_factor = 0UL;
--
1.7.7.6

2012-06-21 21:58:17

by Rik van Riel

Subject: [PATCH -mm v2 09/11] mm: remove MIPS arch_get_unmapped_area code

Remove all the MIPS-specific arch_get_unmapped_area(_topdown) and
page colouring code, now that the generic code should be able to
handle things.

Untested, because I do not have any MIPS systems.

Cc: Ralf Baechle <[email protected]>
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Rik van Riel <[email protected]>
---
arch/mips/include/asm/pgtable.h | 8 --
arch/mips/mm/mmap.c | 175 ---------------------------------------
2 files changed, 0 insertions(+), 183 deletions(-)

diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index f133a4c..5f9c49a 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -410,14 +410,6 @@ int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
#endif

/*
- * We provide our own get_unmapped area to cope with the virtual aliasing
- * constraints placed on us by the cache architecture.
- */
-#define HAVE_ARCH_UNMAPPED_AREA
-#define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
-#define HAVE_ARCH_ALIGN_ADDR
-
-/*
* No page table caches to initialise
*/
#define pgtable_cache_init() do { } while (0)
diff --git a/arch/mips/mm/mmap.c b/arch/mips/mm/mmap.c
index 3f8af17..ac342bd 100644
--- a/arch/mips/mm/mmap.c
+++ b/arch/mips/mm/mmap.c
@@ -15,9 +15,6 @@
#include <linux/random.h>
#include <linux/sched.h>

-unsigned long shm_align_mask = PAGE_SIZE - 1; /* Sane caches */
-EXPORT_SYMBOL(shm_align_mask);
-
/* gap between mmap and stack */
#define MIN_GAP (128*1024*1024UL)
#define MAX_GAP ((TASK_SIZE)/6*5)
@@ -45,178 +42,6 @@ static unsigned long mmap_base(unsigned long rnd)
return PAGE_ALIGN(TASK_SIZE - gap - rnd);
}

-static inline unsigned long COLOUR_ALIGN_DOWN(unsigned long addr,
- unsigned long pgoff)
-{
- unsigned long base = addr & ~shm_align_mask;
- unsigned long off = (pgoff << PAGE_SHIFT) & shm_align_mask;
-
- if (base + off <= addr)
- return base + off;
-
- return base - off;
-}
-
-#define COLOUR_ALIGN(addr, pgoff) \
- ((((addr) + shm_align_mask) & ~shm_align_mask) + \
- (((pgoff) << PAGE_SHIFT) & shm_align_mask))
-
-static unsigned long arch_get_unmapped_area_common(struct file *filp,
- unsigned long addr0, unsigned long len, unsigned long pgoff,
- unsigned long flags, enum mmap_allocation_direction dir)
-{
- struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma;
- unsigned long addr = addr0;
- int do_color_align;
-
- if (unlikely(len > TASK_SIZE))
- return -ENOMEM;
-
- if (flags & MAP_FIXED) {
- /* Even MAP_FIXED mappings must reside within TASK_SIZE */
- if (TASK_SIZE - len < addr)
- return -EINVAL;
-
- /*
- * We do not accept a shared mapping if it would violate
- * cache aliasing constraints.
- */
- if ((flags & MAP_SHARED) &&
- ((addr - (pgoff << PAGE_SHIFT)) & shm_align_mask))
- return -EINVAL;
- return addr;
- }
-
- do_color_align = 0;
- if (filp || (flags & MAP_SHARED))
- do_color_align = 1;
-
- /* requesting a specific address */
- if (addr) {
- if (do_color_align)
- addr = COLOUR_ALIGN(addr, pgoff);
- else
- addr = PAGE_ALIGN(addr);
-
- vma = find_vma(mm, addr);
- if (TASK_SIZE - len >= addr &&
- (!vma || addr + len <= vma->vm_start))
- return addr;
- }
-
- if (dir == ALLOC_UP) {
- addr = mm->mmap_base;
- if (do_color_align)
- addr = COLOUR_ALIGN(addr, pgoff);
- else
- addr = PAGE_ALIGN(addr);
-
- for (vma = find_vma(current->mm, addr); ; vma = vma->vm_next) {
- /* At this point: (!vma || addr < vma->vm_end). */
- if (TASK_SIZE - len < addr)
- return -ENOMEM;
- if (!vma || addr + len <= vma->vm_start)
- return addr;
- addr = vma->vm_end;
- if (do_color_align)
- addr = COLOUR_ALIGN(addr, pgoff);
- }
- } else {
- /* check if free_area_cache is useful for us */
- if (len <= mm->cached_hole_size) {
- mm->cached_hole_size = 0;
- mm->free_area_cache = mm->mmap_base;
- }
-
- /*
- * either no address requested, or the mapping can't fit into
- * the requested address hole
- */
- addr = mm->free_area_cache;
- if (do_color_align) {
- unsigned long base =
- COLOUR_ALIGN_DOWN(addr - len, pgoff);
- addr = base + len;
- }
-
- /* make sure it can fit in the remaining address space */
- if (likely(addr > len)) {
- vma = find_vma(mm, addr - len);
- if (!vma || addr <= vma->vm_start) {
- /* cache the address as a hint for next time */
- return mm->free_area_cache = addr - len;
- }
- }
-
- if (unlikely(mm->mmap_base < len))
- goto bottomup;
-
- addr = mm->mmap_base - len;
- if (do_color_align)
- addr = COLOUR_ALIGN_DOWN(addr, pgoff);
-
- do {
- /*
- * Lookup failure means no vma is above this address,
- * else if new region fits below vma->vm_start,
- * return with success:
- */
- vma = find_vma(mm, addr);
- if (likely(!vma || addr + len <= vma->vm_start)) {
- /* cache the address as a hint for next time */
- return mm->free_area_cache = addr;
- }
-
- /* remember the largest hole we saw so far */
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
-
- /* try just below the current vma->vm_start */
- addr = vma->vm_start - len;
- if (do_color_align)
- addr = COLOUR_ALIGN_DOWN(addr, pgoff);
- } while (likely(len < vma->vm_start));
-
-bottomup:
- /*
- * A failed mmap() very likely causes application failure,
- * so fall back to the bottom-up function here. This scenario
- * can happen with large stack limits and large mmap()
- * allocations.
- */
- mm->cached_hole_size = ~0UL;
- mm->free_area_cache = TASK_UNMAPPED_BASE;
- addr = arch_get_unmapped_area(filp, addr0, len, pgoff, flags);
- /*
- * Restore the topdown base:
- */
- mm->free_area_cache = mm->mmap_base;
- mm->cached_hole_size = ~0UL;
-
- return addr;
- }
-}
-
-unsigned long arch_get_unmapped_area(struct file *filp, unsigned long addr0,
- unsigned long len, unsigned long pgoff, unsigned long flags)
-{
- return arch_get_unmapped_area_common(filp,
- addr0, len, pgoff, flags, ALLOC_UP);
-}
-
-/*
- * There is no need to export this but sched.h declares the function as
- * extern so making it static here results in an error.
- */
-unsigned long arch_get_unmapped_area_topdown(struct file *filp,
- unsigned long addr0, unsigned long len, unsigned long pgoff,
- unsigned long flags)
-{
- return arch_get_unmapped_area_common(filp,
- addr0, len, pgoff, flags, ALLOC_DOWN);
-}
-
void arch_pick_mmap_layout(struct mm_struct *mm)
{
unsigned long random_factor = 0UL;
--
1.7.7.6

2012-06-21 21:58:03

by Rik van Riel

Subject: [PATCH -mm v2 02/11] mm: rearrange vm_area_struct for fewer cache misses

The kernel walks the VMA rbtree in various places, including
the page fault path. However, the vm_rb node spanned two
cache lines on 64-bit systems with 64-byte cache lines (most
x86 systems).

Rearrange vm_area_struct a little, so all the information we
need to do a VMA tree walk is in the first cache line.

Signed-off-by: Rik van Riel <[email protected]>
---
include/linux/mm_types.h | 12 ++++++++----
1 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 9fc0291..23bd1e2 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -199,7 +199,8 @@ struct vm_region {
* library, the executable area etc).
*/
struct vm_area_struct {
- struct mm_struct * vm_mm; /* The address space we belong to. */
+ /* The first cache line has the info for VMA tree walking. */
+
unsigned long vm_start; /* Our start address within vm_mm. */
unsigned long vm_end; /* The first byte after our end address
within vm_mm. */
@@ -207,9 +208,6 @@ struct vm_area_struct {
/* linked list of VM areas per task, sorted by address */
struct vm_area_struct *vm_next, *vm_prev;

- pgprot_t vm_page_prot; /* Access permissions of this VMA. */
- unsigned long vm_flags; /* Flags, see mm.h. */
-
struct rb_node vm_rb;

/*
@@ -220,6 +218,12 @@ struct vm_area_struct {
*/
unsigned long free_gap;

+ /* Second cache line starts here. */
+
+ struct mm_struct * vm_mm; /* The address space we belong to. */
+ pgprot_t vm_page_prot; /* Access permissions of this VMA. */
+ unsigned long vm_flags; /* Flags, see mm.h. */
+
/*
* For areas with an address space and backing store,
* linkage into the address_space->i_mmap prio tree, or
--
1.7.7.6

2012-06-21 21:59:08

by Rik van Riel

Subject: [PATCH -mm v2 05/11] mm: get unmapped area from VMA tree

Change the generic implementations of arch_get_unmapped_area(_topdown)
to use the free space info in the VMA rbtree. This makes it possible
to find free address space in O(log(N)) complexity.

For bottom-up allocations, we pick the lowest hole that is large
enough for our allocation. For topdown allocations, we pick the
highest hole of sufficient size.

For topdown allocations, we need to keep track of the highest
mapped VMA address, because it might be below mm->mmap_base,
and we only keep track of free space to the left of each VMA
in the VMA tree. It is tempting to try and keep track of
the free space to the right of each VMA when running in
topdown mode, but that gets us into trouble when running on
x86, where a process can switch direction in the middle of
execve.

We have to leave mm->free_area_cache and mm->cached_hole_size
in place for now, because the architecture-specific versions still
use them.
the new code, we can remove these variables from mm_struct.

Signed-off-by: Rik van Riel <[email protected]>
---
include/linux/mm_types.h | 1 +
mm/mmap.c | 241 +++++++++++++++++++++++++++++-----------------
2 files changed, 155 insertions(+), 87 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 23bd1e2..6102c27 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -312,6 +312,7 @@ struct mm_struct {
unsigned long task_size; /* size of task vm space */
unsigned long cached_hole_size; /* if non-zero, the largest hole below free_area_cache */
unsigned long free_area_cache; /* first hole of size cached_hole_size or larger */
+ unsigned long highest_vm_end; /* highest vma end address */
pgd_t * pgd;
atomic_t mm_users; /* How many users with user space? */
atomic_t mm_count; /* How many references to "struct mm_struct" (users count as 1) */
diff --git a/mm/mmap.c b/mm/mmap.c
index 95f66c5..e501b4f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -4,6 +4,7 @@
* Written by obz.
*
* Address space accounting code <[email protected]>
+ * Rbtree get_unmapped_area Copyright (C) 2012 Rik van Riel
*/

#include <linux/slab.h>
@@ -259,6 +260,17 @@ static void adjust_free_gap(struct vm_area_struct *vma)
rb_augment_erase_end(&vma->vm_rb, vma_rb_augment_cb, NULL);
}

+static unsigned long node_free_gap(struct rb_node *node)
+{
+ struct vm_area_struct *vma;
+
+ if (!node)
+ return 0;
+
+ vma = rb_to_vma(node);
+ return vma->free_gap;
+}
+
/*
* Unlink a file-based vm structure from its prio_tree, to hide
* vma from rmap and vmtruncate before freeing its page tables.
@@ -395,12 +407,16 @@ void validate_mm(struct mm_struct *mm)
int bug = 0;
int i = 0;
struct vm_area_struct *tmp = mm->mmap;
+ unsigned long highest_address = 0;
while (tmp) {
if (tmp->free_gap != largest_free_gap(&tmp->vm_rb))
printk("free space %lx, correct %lx\n", tmp->free_gap, largest_free_gap(&tmp->vm_rb)), bug = 1;
+ highest_address = tmp->vm_end;
tmp = tmp->vm_next;
i++;
}
+ if (highest_address != mm->highest_vm_end)
+ printk("mm->highest_vm_end %lx, found %lx\n", mm->highest_vm_end, highest_address), bug = 1;
if (i != mm->map_count)
printk("map_count %d vm_next %d\n", mm->map_count, i), bug = 1;
i = browse_rb(&mm->mm_rb);
@@ -458,6 +474,9 @@ void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
/* Propagate the new free gap between next and us up the tree. */
if (vma->vm_next)
adjust_free_gap(vma->vm_next);
+ else
+ /* This is the VMA with the highest address. */
+ mm->highest_vm_end = vma->vm_end;
}

static void __vma_link_file(struct vm_area_struct *vma)
@@ -661,6 +680,8 @@ again: remove_next = 1 + (end > next->vm_end);
end_changed = true;
}
vma->vm_pgoff = pgoff;
+ if (!next)
+ mm->highest_vm_end = end;
if (adjust_next) {
next->vm_start += adjust_next << PAGE_SHIFT;
next->vm_pgoff += adjust_next;
@@ -1478,8 +1499,9 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
unsigned long len, unsigned long pgoff, unsigned long flags)
{
struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma;
- unsigned long start_addr;
+ struct vm_area_struct *vma = NULL;
+ struct rb_node *rb_node;
+ unsigned long lower_limit = TASK_UNMAPPED_BASE;

if (len > TASK_SIZE)
return -ENOMEM;
@@ -1494,40 +1516,76 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
(!vma || addr + len <= vma->vm_start))
return addr;
}
- if (len > mm->cached_hole_size) {
- start_addr = addr = mm->free_area_cache;
- } else {
- start_addr = addr = TASK_UNMAPPED_BASE;
- mm->cached_hole_size = 0;
- }

-full_search:
- for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
- /* At this point: (!vma || addr < vma->vm_end). */
- if (TASK_SIZE - len < addr) {
- /*
- * Start a new search - just in case we missed
- * some holes.
- */
- if (start_addr != TASK_UNMAPPED_BASE) {
- addr = TASK_UNMAPPED_BASE;
- start_addr = addr;
- mm->cached_hole_size = 0;
- goto full_search;
+ /* Find the left-most free area of sufficient size. */
+ for (addr = 0, rb_node = mm->mm_rb.rb_node; rb_node; ) {
+ unsigned long vma_start;
+ bool found_here = false;
+
+ vma = rb_to_vma(rb_node);
+
+ if (vma->vm_start > len) {
+ if (!vma->vm_prev) {
+ /* This is the left-most VMA. */
+ if (vma->vm_start - len >= lower_limit) {
+ addr = lower_limit;
+ goto found_addr;
+ }
+ } else {
+ /* Is this gap large enough? Remember it. */
+ vma_start = max(vma->vm_prev->vm_end, lower_limit);
+ if (vma->vm_start - len >= vma_start) {
+ addr = vma_start;
+ found_here = true;
+ }
+ }
+
+ /* Go left if it looks promising. */
+ if (node_free_gap(rb_node->rb_left) >= len &&
+ vma->vm_start - len >= lower_limit) {
+ rb_node = rb_node->rb_left;
+ continue;
}
- return -ENOMEM;
}
- if (!vma || addr + len <= vma->vm_start) {
- /*
- * Remember the place where we stopped the search:
- */
- mm->free_area_cache = addr + len;
- return addr;
+
+ if (!found_here && node_free_gap(rb_node->rb_right) >= len) {
+ /* Last known gap is to the right of this subtree. */
+ rb_node = rb_node->rb_right;
+ continue;
+ } else if (!addr) {
+ rb_node = rb_find_next_uncle(rb_node);
+ continue;
}
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
- addr = vma->vm_end;
+
+ /* This is the left-most gap. */
+ goto found_addr;
}
+
+ /*
+ * There is not enough space to the left of any VMA.
+ * Check the far right-hand side of the VMA tree.
+ */
+ rb_node = mm->mm_rb.rb_node;
+ while (rb_node->rb_right)
+ rb_node = rb_node->rb_right;
+ vma = rb_to_vma(rb_node);
+ addr = vma->vm_end;
+
+ /*
+ * The right-most VMA ends below the lower limit. Can only happen
+ * if a binary personality loads the stack below the executable.
+ */
+ if (addr < lower_limit)
+ addr = lower_limit;
+
+ found_addr:
+ if (TASK_SIZE - len < addr)
+ return -ENOMEM;
+
+ /* This "free area" was not really free. Tree corrupted? */
+ VM_BUG_ON(find_vma_intersection(mm, addr, addr+len));
+
+ return addr;
}
#endif

@@ -1550,9 +1608,11 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
const unsigned long len, const unsigned long pgoff,
const unsigned long flags)
{
- struct vm_area_struct *vma;
+ struct vm_area_struct *vma = NULL;
struct mm_struct *mm = current->mm;
- unsigned long addr = addr0, start_addr;
+ unsigned long addr = addr0;
+ struct rb_node *rb_node = NULL;
+ unsigned long upper_limit = mm->mmap_base;

/* requested length too big for entire address space */
if (len > TASK_SIZE)
@@ -1570,68 +1630,65 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
return addr;
}

- /* check if free_area_cache is useful for us */
- if (len <= mm->cached_hole_size) {
- mm->cached_hole_size = 0;
- mm->free_area_cache = mm->mmap_base;
- }
+ /* requested length too big; prevent integer underflow below */
+ if (len > upper_limit)
+ return -ENOMEM;

-try_again:
- /* either no address requested or can't fit in requested address hole */
- start_addr = addr = mm->free_area_cache;
+ /*
+ * Does the highest VMA end far enough below the upper limit
+ * of our search space?
+ */
+ if (upper_limit - len > mm->highest_vm_end) {
+ addr = upper_limit - len;
+ goto found_addr;
+ }

- if (addr < len)
- goto fail;
+ /* Find the right-most free area of sufficient size. */
+ for (addr = 0, rb_node = mm->mm_rb.rb_node; rb_node; ) {
+ unsigned long vma_start;
+ bool found_here = false;

- addr -= len;
- do {
- /*
- * Lookup failure means no vma is above this address,
- * else if new region fits below vma->vm_start,
- * return with success:
- */
- vma = find_vma(mm, addr);
- if (!vma || addr+len <= vma->vm_start)
- /* remember the address as a hint for next time */
- return (mm->free_area_cache = addr);
+ vma = rb_to_vma(rb_node);
+
+ /* Is this gap large enough? Remember it. */
+ vma_start = min(vma->vm_start, upper_limit);
+ if (vma_start > len) {
+ if (!vma->vm_prev ||
+ (vma_start - len >= vma->vm_prev->vm_end)) {
+ addr = vma_start - len;
+ found_here = true;
+ }
+ }

- /* remember the largest hole we saw so far */
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
+ /* Go right if it looks promising. */
+ if (node_free_gap(rb_node->rb_right) >= len) {
+ if (upper_limit - len > vma->vm_end) {
+ rb_node = rb_node->rb_right;
+ continue;
+ }
+ }

- /* try just below the current vma->vm_start */
- addr = vma->vm_start-len;
- } while (len < vma->vm_start);
+ if (!found_here && node_free_gap(rb_node->rb_left) >= len) {
+ /* Last known gap is to the right of this subtree. */
+ rb_node = rb_node->rb_left;
+ continue;
+ } else if (!addr) {
+ rb_node = rb_find_prev_uncle(rb_node);
+ continue;
+ }

-fail:
- /*
- * if hint left us with no space for the requested
- * mapping then try again:
- *
- * Note: this is different with the case of bottomup
- * which does the fully line-search, but we use find_vma
- * here that causes some holes skipped.
- */
- if (start_addr != mm->mmap_base) {
- mm->free_area_cache = mm->mmap_base;
- mm->cached_hole_size = 0;
- goto try_again;
+ /* This is the right-most gap. */
+ goto found_addr;
}

- /*
- * A failed mmap() very likely causes application failure,
- * so fall back to the bottom-up function here. This scenario
- * can happen with large stack limits and large mmap()
- * allocations.
- */
- mm->cached_hole_size = ~0UL;
- mm->free_area_cache = TASK_UNMAPPED_BASE;
- addr = arch_get_unmapped_area(filp, addr0, len, pgoff, flags);
- /*
- * Restore the topdown base:
- */
- mm->free_area_cache = mm->mmap_base;
- mm->cached_hole_size = ~0UL;
+ return -ENOMEM;
+
+ found_addr:
+ if (TASK_SIZE - len < addr)
+ return -ENOMEM;
+
+ /* This "free area" was not really free. Tree corrupted? */
+ VM_BUG_ON(find_vma_intersection(mm, addr, addr+len));

return addr;
}
@@ -1845,6 +1902,9 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
vma->vm_end = address;
if (vma->vm_next)
adjust_free_gap(vma->vm_next);
+ else
+ vma->vm_mm->highest_vm_end =
+ vma->vm_end;
perf_event_mmap(vma);
}
}
@@ -2028,6 +2088,13 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
*insertion_point = vma;
if (vma)
vma->vm_prev = prev;
+ else {
+ /* We just unmapped the highest VMA. */
+ if (prev)
+ mm->highest_vm_end = prev->vm_end;
+ else
+ mm->highest_vm_end = 0;
+ }
if (vma)
rb_augment_erase_end(&vma->vm_rb, vma_rb_augment_cb, NULL);
tail_vma->vm_next = NULL;
--
1.7.7.6

2012-06-21 21:59:34

by Rik van Riel

Subject: [PATCH -mm v2 07/11] mm: make cache alignment code generic

Fix the x86-64 cache alignment code to take pgoff into account.
Use the x86 and MIPS cache alignment code as the basis for a generic
cache alignment function.

Teach the generic arch_get_unmapped_area(_topdown) code to call the
cache alignment code.

Make sure that ALIGN_DOWN always aligns down, and ends up at the
right page colour.

The old x86 code will always align the mmap to aliasing boundaries,
even if the program mmaps the file with a non-zero pgoff.

If program A mmaps the file with pgoff 0 and program B mmaps the
same file with pgoff 1, the old code would align both mmaps to the
aliasing boundary, resulting in misaligned pages:

A: 0123
B: 123

After this patch, they are aligned so the shared pages line up:

A: 0123
B:  123
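
To make the pgoff handling concrete, here is a small standalone
sketch of the colour alignment arithmetic. The 16KB aliasing
boundary and the example addresses are made-up values; the formula
is the same one the MIPS/SH COLOUR_ALIGN macros elsewhere in this
thread use:

#include <stdio.h>

#define PAGE_SHIFT	12
#define SHM_ALIGN_MASK	0x3fffUL	/* 16KB aliasing boundary - 1 */

/* Align up to the aliasing boundary, then add the colour offset
 * that corresponds to pgoff. */
static unsigned long colour_align(unsigned long addr, unsigned long pgoff)
{
	return ((addr + SHM_ALIGN_MASK) & ~SHM_ALIGN_MASK) +
	       ((pgoff << PAGE_SHIFT) & SHM_ALIGN_MASK);
}

int main(void)
{
	/* Program A maps the file with pgoff 0, program B with pgoff 1. */
	unsigned long a = colour_align(0x10000, 0);	/* -> 0x10000 */
	unsigned long b = colour_align(0x30000, 1);	/* -> 0x31000 */

	/* File page 1 lives at a + one page in A's mapping, and at b
	 * (the start) in B's; both print colour 0x1000. */
	printf("A page 1 colour: %#lx\n", (a + (1UL << PAGE_SHIFT)) & SHM_ALIGN_MASK);
	printf("B page 1 colour: %#lx\n", b & SHM_ALIGN_MASK);
	return 0;
}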

Signed-off-by: Rik van Riel <[email protected]>
---
arch/mips/include/asm/page.h | 2 -
arch/mips/include/asm/pgtable.h | 1 +
arch/x86/include/asm/elf.h | 3 -
arch/x86/include/asm/pgtable_64.h | 1 +
arch/x86/kernel/sys_x86_64.c | 39 +++++++++++-----
arch/x86/vdso/vma.c | 2 +-
include/linux/sched.h | 8 +++-
mm/mmap.c | 91 ++++++++++++++++++++++++++++++++-----
8 files changed, 115 insertions(+), 32 deletions(-)

diff --git a/arch/mips/include/asm/page.h b/arch/mips/include/asm/page.h
index da9bd7d..459cc25 100644
--- a/arch/mips/include/asm/page.h
+++ b/arch/mips/include/asm/page.h
@@ -63,8 +63,6 @@ extern void build_copy_page(void);
extern void clear_page(void * page);
extern void copy_page(void * to, void * from);

-extern unsigned long shm_align_mask;
-
static inline unsigned long pages_do_alias(unsigned long addr1,
unsigned long addr2)
{
diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index b2202a6..f133a4c 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -415,6 +415,7 @@ int phys_mem_access_prot_allowed(struct file *file, unsigned long pfn,
*/
#define HAVE_ARCH_UNMAPPED_AREA
#define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
+#define HAVE_ARCH_ALIGN_ADDR

/*
* No page table caches to initialise
diff --git a/arch/x86/include/asm/elf.h b/arch/x86/include/asm/elf.h
index 5939f44..dc2d0bf 100644
--- a/arch/x86/include/asm/elf.h
+++ b/arch/x86/include/asm/elf.h
@@ -358,8 +358,6 @@ static inline int mmap_is_ia32(void)
enum align_flags {
ALIGN_VA_32 = BIT(0),
ALIGN_VA_64 = BIT(1),
- ALIGN_VDSO = BIT(2),
- ALIGN_TOPDOWN = BIT(3),
};

struct va_alignment {
@@ -368,5 +366,4 @@ struct va_alignment {
} ____cacheline_aligned;

extern struct va_alignment va_align;
-extern unsigned long align_addr(unsigned long, struct file *, enum align_flags);
#endif /* _ASM_X86_ELF_H */
diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 8af36f6..8408ccd 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -170,6 +170,7 @@ extern void cleanup_highmap(void);
#define HAVE_ARCH_UNMAPPED_AREA
#define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
#define HAVE_ARCH_GET_ADDRESS_RANGE
+#define HAVE_ARCH_ALIGN_ADDR

#define pgtable_cache_init() do { } while (0)
#define check_pgt_cache() do { } while (0)
diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index 2595a5e..c059c19 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -25,31 +25,44 @@
* @flags denotes the allocation direction - bottomup or topdown -
* or vDSO; see call sites below.
*/
-unsigned long align_addr(unsigned long addr, struct file *filp,
- enum align_flags flags)
+unsigned long arch_align_addr(unsigned long addr, struct file *filp,
+ unsigned long pgoff, unsigned long flags,
+ enum mmap_allocation_direction direction)
{
- unsigned long tmp_addr;
+ unsigned long tmp_addr = PAGE_ALIGN(addr);

/* handle 32- and 64-bit case with a single conditional */
if (va_align.flags < 0 || !(va_align.flags & (2 - mmap_is_ia32())))
- return addr;
+ return tmp_addr;

- if (!(current->flags & PF_RANDOMIZE))
- return addr;
+ /* Always allow MAP_FIXED. Colouring is a performance thing only. */
+ if (flags & MAP_FIXED)
+ return tmp_addr;

- if (!((flags & ALIGN_VDSO) || filp))
- return addr;
+ if (!(current->flags & PF_RANDOMIZE))
+ return tmp_addr;

- tmp_addr = addr;
+ if (!(filp || direction == ALLOC_VDSO))
+ return tmp_addr;

/*
* We need an address which is <= than the original
* one only when in topdown direction.
*/
- if (!(flags & ALIGN_TOPDOWN))
+ if (direction == ALLOC_UP)
tmp_addr += va_align.mask;

tmp_addr &= ~va_align.mask;
+ tmp_addr += ((pgoff << PAGE_SHIFT) & va_align.mask);
+
+ /*
+ * When aligning down, make sure we did not accidentally go up.
+ * The caller will check for underflow.
+ */
+ if (direction == ALLOC_DOWN && tmp_addr > addr) {
+ tmp_addr -= va_align.mask;
+ tmp_addr &= ~va_align.mask;
+ }

return tmp_addr;
}
@@ -159,7 +172,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,

full_search:

- addr = align_addr(addr, filp, 0);
+ addr = arch_align_addr(addr, filp, pgoff, flags, ALLOC_UP);

for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
/* At this point: (!vma || addr < vma->vm_end). */
@@ -186,7 +199,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
mm->cached_hole_size = vma->vm_start - addr;

addr = vma->vm_end;
- addr = align_addr(addr, filp, 0);
+ addr = arch_align_addr(addr, filp, pgoff, flags, ALLOC_UP);
}
}

@@ -235,7 +248,7 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,

addr -= len;
do {
- addr = align_addr(addr, filp, ALIGN_TOPDOWN);
+ addr = arch_align_addr(addr, filp, pgoff, flags, ALLOC_DOWN);

/*
* Lookup failure means no vma is above this address,
diff --git a/arch/x86/vdso/vma.c b/arch/x86/vdso/vma.c
index 00aaf04..83e0355 100644
--- a/arch/x86/vdso/vma.c
+++ b/arch/x86/vdso/vma.c
@@ -141,7 +141,7 @@ static unsigned long vdso_addr(unsigned long start, unsigned len)
* unaligned here as a result of stack start randomization.
*/
addr = PAGE_ALIGN(addr);
- addr = align_addr(addr, NULL, ALIGN_VDSO);
+ addr = arch_align_addr(addr, NULL, 0, 0, ALLOC_VDSO);

return addr;
}
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fc76318..18f9326 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -390,12 +390,18 @@ extern int sysctl_max_map_count;
#ifdef CONFIG_MMU
enum mmap_allocation_direction {
ALLOC_UP,
- ALLOC_DOWN
+ ALLOC_DOWN,
+ ALLOC_VDSO,
};
extern void arch_pick_mmap_layout(struct mm_struct *mm);
extern void
arch_get_address_range(unsigned long flags, unsigned long *begin,
unsigned long *end, enum mmap_allocation_direction direction);
+extern unsigned long shm_align_mask;
+extern unsigned long
+arch_align_addr(unsigned long addr, struct file *filp,
+ unsigned long pgoff, unsigned long flags,
+ enum mmap_allocation_direction direction);
extern unsigned long
arch_get_unmapped_area(struct file *, unsigned long, unsigned long,
unsigned long, unsigned long);
diff --git a/mm/mmap.c b/mm/mmap.c
index 2420951..3da19f8 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1482,6 +1482,51 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
return error;
}

+#ifndef HAVE_ARCH_ALIGN_ADDR
+/* Each architecture is responsible for setting this to the required value. */
+unsigned long shm_align_mask = PAGE_SIZE - 1;
+EXPORT_SYMBOL(shm_align_mask);
+
+unsigned long arch_align_addr(unsigned long addr, struct file *filp,
+ unsigned long pgoff, unsigned long flags,
+ enum mmap_allocation_direction direction)
+{
+ unsigned long tmp_addr = PAGE_ALIGN(addr);
+
+ if (shm_align_mask <= PAGE_SIZE)
+ return tmp_addr;
+
+ /* Allow MAP_FIXED without MAP_SHARED at any address. */
+ if ((flags & (MAP_FIXED|MAP_SHARED)) == MAP_FIXED)
+ return tmp_addr;
+
+ /* Enforce page colouring for any file or MAP_SHARED mapping. */
+ if (!(filp || (flags & MAP_SHARED)))
+ return tmp_addr;
+
+ /*
+ * We need an address which is <= than the original
+ * one only when in topdown direction.
+ */
+ if (direction == ALLOC_UP)
+ tmp_addr += shm_align_mask;
+
+ tmp_addr &= ~shm_align_mask;
+ tmp_addr += ((pgoff << PAGE_SHIFT) & shm_align_mask);
+
+ /*
+ * When aligning down, make sure we did not accidentally go up.
+ * The caller will check for underflow.
+ */
+ if (direction == ALLOC_DOWN && tmp_addr > addr) {
+ tmp_addr -= shm_align_mask;
+ tmp_addr &= ~shm_align_mask;
+ }
+
+ return tmp_addr;
+}
+#endif
+
#ifndef HAVE_ARCH_GET_ADDRESS_RANGE
void arch_get_address_range(unsigned long flags, unsigned long *begin,
unsigned long *end, enum mmap_allocation_direction direction)
@@ -1515,18 +1560,22 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma = NULL;
struct rb_node *rb_node;
- unsigned long lower_limit, upper_limit;
+ unsigned long lower_limit, upper_limit, tmp_addr;

arch_get_address_range(flags, &lower_limit, &upper_limit, ALLOC_UP);

if (len > TASK_SIZE)
return -ENOMEM;

- if (flags & MAP_FIXED)
+ if (flags & MAP_FIXED) {
+ tmp_addr = arch_align_addr(addr, filp, pgoff, flags, ALLOC_UP);
+ if (tmp_addr != PAGE_ALIGN(addr))
+ return -EINVAL;
return addr;
+ }

if (addr) {
- addr = PAGE_ALIGN(addr);
+ addr = arch_align_addr(addr, filp, pgoff, flags, ALLOC_UP);
vma = find_vma(mm, addr);
if (TASK_SIZE - len >= addr &&
(!vma || addr + len <= vma->vm_start))
@@ -1535,7 +1584,7 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,

/* Find the left-most free area of sufficient size. */
for (addr = 0, rb_node = mm->mm_rb.rb_node; rb_node; ) {
- unsigned long vma_start;
+ unsigned long vma_start, tmp_addr;
bool found_here = false;

vma = rb_to_vma(rb_node);
@@ -1543,13 +1592,17 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
if (vma->vm_start > len) {
if (!vma->vm_prev) {
/* This is the left-most VMA. */
- if (vma->vm_start - len >= lower_limit) {
- addr = lower_limit;
+ tmp_addr = arch_align_addr(lower_limit, filp,
+ pgoff, flags, ALLOC_UP);
+ if (vma->vm_start - len >= tmp_addr) {
+ addr = tmp_addr;
goto found_addr;
}
} else {
/* Is this gap large enough? Remember it. */
vma_start = max(vma->vm_prev->vm_end, lower_limit);
+ vma_start = arch_align_addr(vma_start, filp,
+ pgoff, flags, ALLOC_UP);
if (vma->vm_start - len >= vma_start) {
addr = vma_start;
found_here = true;
@@ -1601,6 +1654,8 @@ arch_get_unmapped_area(struct file *filp, unsigned long addr,
if (addr < lower_limit)
addr = lower_limit;

+ addr = arch_align_addr(addr, filp, pgoff, flags, ALLOC_UP);
+
found_addr:
if (TASK_SIZE - len < addr)
return -ENOMEM;
@@ -1643,12 +1698,17 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
if (len > TASK_SIZE)
return -ENOMEM;

- if (flags & MAP_FIXED)
+ if (flags & MAP_FIXED) {
+ unsigned long tmp_addr;
+ tmp_addr = arch_align_addr(addr, filp, pgoff, flags, ALLOC_DOWN);
+ if (tmp_addr != PAGE_ALIGN(addr))
+ return -EINVAL;
return addr;
+ }

/* requesting a specific address */
if (addr) {
- addr = PAGE_ALIGN(addr);
+ addr = arch_align_addr(addr, filp, pgoff, flags, ALLOC_DOWN);
vma = find_vma(mm, addr);
if (TASK_SIZE - len >= addr &&
(!vma || addr + len <= vma->vm_start))
@@ -1665,7 +1725,9 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
*/
if (upper_limit - len > mm->highest_vm_end) {
addr = upper_limit - len;
- goto found_addr;
+ addr = arch_align_addr(addr, filp, pgoff, flags, ALLOC_DOWN);
+ if (addr >= mm->highest_vm_end)
+ goto found_addr;
}

/* Find the right-most free area of sufficient size. */
@@ -1678,9 +1740,14 @@ arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
/* Is this gap large enough? Remember it. */
vma_start = min(vma->vm_start, upper_limit);
if (vma_start > len) {
- if (!vma->vm_prev ||
- (vma_start - len >= vma->vm_prev->vm_end)) {
- addr = vma_start - len;
+ unsigned long tmp_addr = vma_start - len;
+ tmp_addr = arch_align_addr(tmp_addr, filp,
+ pgoff, flags, ALLOC_DOWN);
+ /* No underflow? Does it still fit the hole? */
+ if (tmp_addr && tmp_addr <= vma_start - len &&
+ (!vma->vm_prev ||
+ tmp_addr >= vma->vm_prev->vm_end)) {
+ addr = tmp_addr;
found_here = true;
}
}
--
1.7.7.6

2012-06-21 21:57:55

by Rik van Riel

[permalink] [raw]
Subject: [PATCH -mm v2 04/11] rbtree: add helpers to find nearest uncle node

It is useful to search an augmented rbtree based on the augmented
data, i.e. not using the sort key as the primary search criterion.
However, we may still need to limit our search to a sub-part of the
whole tree, using the sort key to bound where we can search.

In that case, we may need to stop searching in one part of the tree,
and continue the search at the nearest (great-?)uncle node in a particular
direction.

Add helper functions to find the nearest uncle node.
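
As an illustration only (not part of this patch), a bounded gap search
could fall back to successive uncles roughly like this, with free_gap
and rb_to_vma() assumed from earlier patches in this series:

static struct rb_node *resume_search_right(struct rb_node *node,
					   unsigned long len)
{
	/* Walk to successive uncles until one of their subtrees can fit us. */
	while ((node = rb_find_next_uncle(node)) != NULL) {
		if (rb_to_vma(node)->free_gap >= len)
			return node;	/* descend into this subtree next */
	}
	return NULL;			/* no suitable gap to our right */
}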

Signed-off-by: Rik van Riel <[email protected]>
---
include/linux/rbtree.h | 4 ++++
lib/rbtree.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 50 insertions(+), 0 deletions(-)

diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index 661288d..a74b74b 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -169,6 +169,10 @@ extern struct rb_node *rb_prev(const struct rb_node *);
extern struct rb_node *rb_first(const struct rb_root *);
extern struct rb_node *rb_last(const struct rb_root *);

+/* Find the prev or next uncle of a node in the desired direction. */
+extern struct rb_node *rb_find_prev_uncle(struct rb_node *);
+extern struct rb_node *rb_find_next_uncle(struct rb_node *);
+
/* Fast replacement of a single node without remove/rebalance/add/rebalance */
extern void rb_replace_node(struct rb_node *victim, struct rb_node *new,
struct rb_root *root);
diff --git a/lib/rbtree.c b/lib/rbtree.c
index d417556..08c16d8 100644
--- a/lib/rbtree.c
+++ b/lib/rbtree.c
@@ -437,6 +437,52 @@ struct rb_node *rb_prev(const struct rb_node *node)
}
EXPORT_SYMBOL(rb_prev);

+/*
+ * rb_find_{prev,next}_uncle - Find the nearest "uncle" node in a direction
+ *
+ * An "uncle" node is a sibling node of a parent or grandparent. These
+ * functions walk up the tree to the nearest uncle of this node in the
+ * desired direction.
+ *
+ *      G
+ *     / \
+ *    P   U
+ *     \
+ *      N
+ * This is necessary when searching for something in a bounded subset
+ * of an augmented rbtree, when the primary search criterion is the
+ * augmented data, and not the sort key.
+ */
+struct rb_node *rb_find_prev_uncle(struct rb_node *node)
+{
+ struct rb_node *prev;
+
+ while ((prev = node) && (node = rb_parent(node))) {
+ if (prev == node->rb_left)
+ continue;
+
+ if (node->rb_left)
+ return node->rb_left;
+ }
+
+ return NULL;
+}
+
+struct rb_node *rb_find_next_uncle(struct rb_node *node)
+{
+ struct rb_node *prev;
+
+ while ((prev = node) && (node = rb_parent(node))) {
+ if (prev == node->rb_right)
+ continue;
+
+ if (node->rb_right)
+ return node->rb_right;
+ }
+
+ return NULL;
+}
+
void rb_replace_node(struct rb_node *victim, struct rb_node *new,
struct rb_root *root)
{
--
1.7.7.6

2012-06-21 22:00:32

by Rik van Riel

[permalink] [raw]
Subject: [PATCH -mm v2 08/11] mm: remove x86 arch_get_unmapped_area(_topdown)

The generic arch_get_unmapped_area(_topdown) should now be able
to do everything x86 needs. Remove the x86 specific functions.

TODO: make the hugetlbfs arch_get_unmapped_area call the generic
code with proper alignment info.

Cc: Andi Kleen <[email protected]>
Signed-off-by: Rik van Riel <[email protected]>
---
arch/x86/include/asm/pgtable_64.h | 2 -
arch/x86/kernel/sys_x86_64.c | 162 -------------------------------------
2 files changed, 0 insertions(+), 164 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_64.h b/arch/x86/include/asm/pgtable_64.h
index 8408ccd..0ff6500 100644
--- a/arch/x86/include/asm/pgtable_64.h
+++ b/arch/x86/include/asm/pgtable_64.h
@@ -167,8 +167,6 @@ static inline int pgd_large(pgd_t pgd) { return 0; }
extern int kern_addr_valid(unsigned long addr);
extern void cleanup_highmap(void);

-#define HAVE_ARCH_UNMAPPED_AREA
-#define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
#define HAVE_ARCH_GET_ADDRESS_RANGE
#define HAVE_ARCH_ALIGN_ADDR

diff --git a/arch/x86/kernel/sys_x86_64.c b/arch/x86/kernel/sys_x86_64.c
index c059c19..ee03b18 100644
--- a/arch/x86/kernel/sys_x86_64.c
+++ b/arch/x86/kernel/sys_x86_64.c
@@ -135,165 +135,3 @@ void arch_get_address_range(unsigned long flags, unsigned long *begin,
*end = current->mm->mmap_base;
}
}
-
-unsigned long
-arch_get_unmapped_area(struct file *filp, unsigned long addr,
- unsigned long len, unsigned long pgoff, unsigned long flags)
-{
- struct mm_struct *mm = current->mm;
- struct vm_area_struct *vma;
- unsigned long start_addr;
- unsigned long begin, end;
-
- if (flags & MAP_FIXED)
- return addr;
-
- arch_get_address_range(flags, &begin, &end, ALLOC_UP);
-
- if (len > end)
- return -ENOMEM;
-
- if (addr) {
- addr = PAGE_ALIGN(addr);
- vma = find_vma(mm, addr);
- if (end - len >= addr &&
- (!vma || addr + len <= vma->vm_start))
- return addr;
- }
- if (((flags & MAP_32BIT) || test_thread_flag(TIF_ADDR32))
- && len <= mm->cached_hole_size) {
- mm->cached_hole_size = 0;
- mm->free_area_cache = begin;
- }
- addr = mm->free_area_cache;
- if (addr < begin)
- addr = begin;
- start_addr = addr;
-
-full_search:
-
- addr = arch_align_addr(addr, filp, pgoff, flags, ALLOC_UP);
-
- for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
- /* At this point: (!vma || addr < vma->vm_end). */
- if (end - len < addr) {
- /*
- * Start a new search - just in case we missed
- * some holes.
- */
- if (start_addr != begin) {
- start_addr = addr = begin;
- mm->cached_hole_size = 0;
- goto full_search;
- }
- return -ENOMEM;
- }
- if (!vma || addr + len <= vma->vm_start) {
- /*
- * Remember the place where we stopped the search:
- */
- mm->free_area_cache = addr + len;
- return addr;
- }
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
-
- addr = vma->vm_end;
- addr = arch_align_addr(addr, filp, pgoff, flags, ALLOC_UP);
- }
-}
-
-
-unsigned long
-arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
- const unsigned long len, const unsigned long pgoff,
- const unsigned long flags)
-{
- struct vm_area_struct *vma;
- struct mm_struct *mm = current->mm;
- unsigned long addr = addr0, start_addr;
-
- /* requested length too big for entire address space */
- if (len > TASK_SIZE)
- return -ENOMEM;
-
- if (flags & MAP_FIXED)
- return addr;
-
- /* for MAP_32BIT mappings we force the legact mmap base */
- if (!test_thread_flag(TIF_ADDR32) && (flags & MAP_32BIT))
- goto bottomup;
-
- /* requesting a specific address */
- if (addr) {
- addr = PAGE_ALIGN(addr);
- vma = find_vma(mm, addr);
- if (TASK_SIZE - len >= addr &&
- (!vma || addr + len <= vma->vm_start))
- return addr;
- }
-
- /* check if free_area_cache is useful for us */
- if (len <= mm->cached_hole_size) {
- mm->cached_hole_size = 0;
- mm->free_area_cache = mm->mmap_base;
- }
-
-try_again:
- /* either no address requested or can't fit in requested address hole */
- start_addr = addr = mm->free_area_cache;
-
- if (addr < len)
- goto fail;
-
- addr -= len;
- do {
- addr = arch_align_addr(addr, filp, pgoff, flags, ALLOC_DOWN);
-
- /*
- * Lookup failure means no vma is above this address,
- * else if new region fits below vma->vm_start,
- * return with success:
- */
- vma = find_vma(mm, addr);
- if (!vma || addr+len <= vma->vm_start)
- /* remember the address as a hint for next time */
- return mm->free_area_cache = addr;
-
- /* remember the largest hole we saw so far */
- if (addr + mm->cached_hole_size < vma->vm_start)
- mm->cached_hole_size = vma->vm_start - addr;
-
- /* try just below the current vma->vm_start */
- addr = vma->vm_start-len;
- } while (len < vma->vm_start);
-
-fail:
- /*
- * if hint left us with no space for the requested
- * mapping then try again:
- */
- if (start_addr != mm->mmap_base) {
- mm->free_area_cache = mm->mmap_base;
- mm->cached_hole_size = 0;
- goto try_again;
- }
-
-bottomup:
- /*
- * A failed mmap() very likely causes application failure,
- * so fall back to the bottom-up function here. This scenario
- * can happen with large stack limits and large mmap()
- * allocations.
- */
- mm->cached_hole_size = ~0UL;
- mm->free_area_cache = TASK_UNMAPPED_BASE;
- addr = arch_get_unmapped_area(filp, addr0, len, pgoff, flags);
- /*
- * Restore the topdown base:
- */
- mm->free_area_cache = mm->mmap_base;
- mm->cached_hole_size = ~0UL;
-
- return addr;
-}
--
1.7.7.6

2012-06-22 09:50:08

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -mm v2 04/11] rbtree: add helpers to find nearest uncle node

On Thu, 2012-06-21 at 17:57 -0400, Rik van Riel wrote:
> It is useful to search an augmented rbtree based on the augmented
> data, i.e. not using the sort key as the primary search criterion.
> However, we may still need to limit our search to a sub-part of the
> whole tree, using the sort key to bound where we can search.
>
> In that case, we may need to stop searching in one part of the tree,
> and continue the search at the nearest (great-?)uncle node in a particular
> direction.
>
> Add helper functions to find the nearest uncle node.

I don't think we need these at all. In fact, I cannot prove your lookup
function is O(log n), since the uncle might not have a suitable
max gap size, so you might need to find yet another uncle, and so on.

2012-06-22 09:58:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -mm v2 01/11] mm: track free size between VMAs in VMA rbtree

On Thu, 2012-06-21 at 17:57 -0400, Rik van Riel wrote:
> +/*
> + * Use the augmented rbtree code to propagate info on the largest
> + * free gap between VMAs up the VMA rbtree.
> + */
> +static void adjust_free_gap(struct vm_area_struct *vma)
> +{
> + rb_augment_erase_end(&vma->vm_rb, vma_rb_augment_cb, NULL);
> +}

I was more thinking along the lines of:

/*
* Abuse rb_augment_erase_end() to propagate a modification up
* the tree by pretending the modified node is the deepest node
* still in the tree.
*/


Alternatively, we could add rb_augment_mod() or somesuch.

2012-06-22 09:59:09

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -mm v2 01/11] mm: track free size between VMAs in VMA rbtree

On Thu, 2012-06-21 at 17:57 -0400, Rik van Riel wrote:
> @@ -1941,6 +2017,8 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
> *insertion_point = vma;
> if (vma)
> vma->vm_prev = prev;
> + if (vma)
> + rb_augment_erase_end(&vma->vm_rb, vma_rb_augment_cb, NULL);

Shouldn't that be adjust_free_gap()? There is after all no actual erase
happening.

2012-06-22 10:03:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -mm v2 01/11] mm: track free size between VMAs in VMA rbtree

On Thu, 2012-06-21 at 17:57 -0400, Rik van Riel wrote:
> +static unsigned long largest_free_gap(struct rb_node *node)
> +{
> + struct vm_area_struct *vma, *prev, *left = NULL, *right = NULL;
> + unsigned long largest = 0;
> +
> + if (node->rb_left)
> + left = rb_to_vma(node->rb_left);
> + if (node->rb_right)
> + right = rb_to_vma(node->rb_right);
> +
> + /* Calculate the free gap size between us and the VMA to our left. */
> + vma = rb_to_vma(node);
> + prev = vma->vm_prev;
> +
> + if (prev)
> + largest = vma->vm_start - prev->vm_end;
> + else
> + largest = vma->vm_start;
> +
> + /* We propagate the largest of our own, or our children's free gaps. */
> + if (left)
> + largest = max(largest, left->free_gap);
> + if (right)
> + largest = max(largest, right->free_gap);
> +
> + return largest;
> +}

If you introduce helpers like:

static inline struct vm_area_struct *vma_of(struct rb_node *node)
{
	return container_of(node, struct vm_area_struct, vm_rb);
}

static inline unsigned long max_gap_of(struct rb_node *node)
{
	return vma_of(node)->free_gap;
}

static unsigned long gap_of(struct rb_node *node)
{
	struct vm_area_struct *vma = vma_of(node);

	if (!vma->vm_prev)
		return vma->vm_start;

	return vma->vm_start - vma->vm_prev->vm_end;
}

You can write your largest free gap as:

unsigned long largest_gap(struct rb_node *node)
{
	unsigned long gap = gap_of(node);

	if (node->rb_left)
		gap = max(gap, max_gap_of(node->rb_left));
	if (node->rb_right)
		gap = max(gap, max_gap_of(node->rb_right));

	return gap;
}

And as shown, you can re-used those {max_,}gap_of() function in the
lookup function in the next patch.

2012-06-22 14:12:36

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH -mm v2 01/11] mm: track free size between VMAs in VMA rbtree

On 06/22/2012 05:58 AM, Peter Zijlstra wrote:
> On Thu, 2012-06-21 at 17:57 -0400, Rik van Riel wrote:
>> @@ -1941,6 +2017,8 @@ detach_vmas_to_be_unmapped(struct mm_struct *mm, struct vm_area_struct *vma,
>> *insertion_point = vma;
>> if (vma)
>> vma->vm_prev = prev;
>> + if (vma)
>> + rb_augment_erase_end(&vma->vm_rb, vma_rb_augment_cb, NULL);
>
> Shouldn't that be adjust_free_gap()? There is after all no actual erase
> happening.

You are right. I will fix this and also adjust the comment
above adjust_free_gap().

I am still trying to wrap my brain around your alternative
search algorithm, not sure if/how it can be combined with
arbitrary address limits and alignment...

--
All rights reversed

2012-06-22 14:14:55

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -mm v2 01/11] mm: track free size between VMAs in VMA rbtree

On Fri, 2012-06-22 at 10:11 -0400, Rik van Riel wrote:
>
> I am still trying to wrap my brain around your alternative
> search algorithm, not sure if/how it can be combined with
> arbitrary address limits and alignment...

for alignment we can do: len += align - 1;

Will indeed need to ponder the range thing ;-)

2012-06-22 14:26:36

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH -mm v2 01/11] mm: track free size between VMAs in VMA rbtree

On 06/22/2012 10:13 AM, Peter Zijlstra wrote:
> On Fri, 2012-06-22 at 10:11 -0400, Rik van Riel wrote:
>>
>> I am still trying to wrap my brain around your alternative
>> search algorithm, not sure if/how it can be combined with
>> arbitrary address limits and alignment...
>
> for alignment we can do: len += align - 1;

We could, but that might lead us to returning -ENOMEM
when we actually have memory available.

When you consider architectures like HPPA, which use
a pretty large alignment, but align everything the same,
chances are pretty much every freed hole will have the
right alignment...

> Will indeed need to ponder the range thing ;-)


--
All rights reversed

2012-06-22 14:29:44

by John Stoffel

[permalink] [raw]
Subject: Re: [PATCH -mm v2 00/11] mm: scalable and unified arch_get_unmapped_area

>>>>> "Rik" == Rik van Riel <[email protected]> writes:

Rik> A long time ago, we decided to limit the number of VMAs per
Rik> process to 64k. As it turns out, there actually are programs
Rik> using tens of thousands of VMAs.


Rik> Performance

Rik> Testing performance with a benchmark that allocates tens
Rik> of thousands of VMAs, unmaps them and mmaps them some more
Rik> in a loop, shows promising results.

How are the numbers for applications which only map a few VMAs? Is
there any impact there?

John

2012-06-22 14:38:23

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -mm v2 01/11] mm: track free size between VMAs in VMA rbtree

On Fri, 2012-06-22 at 10:25 -0400, Rik van Riel wrote:
> On 06/22/2012 10:13 AM, Peter Zijlstra wrote:
> > On Fri, 2012-06-22 at 10:11 -0400, Rik van Riel wrote:
> >>
> >> I am still trying to wrap my brain around your alternative
> >> search algorithm, not sure if/how it can be combined with
> >> arbitrary address limits and alignment...
> >
> > for alignment we can do: len += align - 1;
>
> We could, but that might lead us to returning -ENOMEM
> when we actually have memory available.
>
> When you consider architectures like HPPA, which use
> a pretty large alignment, but align everything the same,
> chances are pretty much every freed hole will have the
> right alignment...

Well, if you don't your gap heap is next to useless and you'll revert to
simply walking all gaps until you find a suitable one.

I really worry about this search function of yours, its complexity is
very non obvious.

2012-06-22 15:02:33

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH -mm v2 00/11] mm: scalable and unified arch_get_unmapped_area

On Thu, Jun 21, 2012 at 05:57:04PM -0400, Rik van Riel wrote:
> A long time ago, we decided to limit the number of VMAs per
> process to 64k. As it turns out, there actually are programs
> using tens of thousands of VMAs.
>
> The linear search in arch_get_unmapped_area and
> arch_get_unmapped_area_topdown can be a real issue for
> those programs.
>
> This patch series aims to fix the scalability issue by
> tracking the size of each free hole in the VMA rbtree,
> propagating the free hole info up the tree.
>
> Another major goal is to put the bulk of the necessary
> arch_get_unmapped_area(_topdown) functionality into one
> set of functions, so we can eliminate the custom large
> functions per architecture, sticking to a few much smaller
> architecture specific functions instead.
>
> In this version I have only gotten rid of the x86, ARM, SH
> and MIPS arch-specific code, and am already showing a
> fairly promising diffstat:
>
> arch/arm/include/asm/pgtable.h | 6
> arch/arm/mm/init.c | 4
> arch/arm/mm/mmap.c | 217 ------------------
> arch/mips/include/asm/page.h | 2
> arch/mips/include/asm/pgtable.h | 7
> arch/mips/mm/mmap.c | 177 --------------
> arch/sh/include/asm/pgtable.h | 4
> arch/sh/mm/mmap.c | 219 ------------------
> arch/x86/include/asm/elf.h | 3
> arch/x86/include/asm/pgtable_64.h | 4
> arch/x86/kernel/sys_x86_64.c | 200 ++--------------
> arch/x86/vdso/vma.c | 2
> include/linux/mm_types.h | 19 +
> include/linux/rbtree.h | 12 +
> include/linux/sched.h | 13 +
> lib/rbtree.c | 46 +++
> mm/internal.h | 5
> mm/mmap.c | 449 +++++++++++++++++++++++++++++---------
> 18 files changed, 478 insertions(+), 911 deletions(-)
>
> v2: address reviewers' comments
> optimize propagating info up the VMA tree (30% faster at frag test)

Here is a comparison of running the anti-bench on all three kernels
(an updated version of the test; I botched the initial one, but it
still yielded useful results, albeit from testing another aspect).

First, repeated unmaps and remaps of one VMA in the midst of a few
thousand other VMAs. v2 did not improve over v1, unfortunately:

innerremap-next innerremap-agua-v1 innerremap-agua-v2
Elapsed time 4.99 ( +0.00%) 12.66 (+128.05%) 12.55 (+126.21%)
Elapsed time (stddev) 0.06 ( +0.00%) 0.73 ( +63.43%) 0.54 ( +45.51%)
User time 0.41 ( +0.00%) 0.57 ( +10.66%) 0.47 ( +3.63%)
User time (stddev) 0.02 ( +0.00%) 0.06 ( +3.34%) 0.07 ( +4.26%)
System time 4.57 ( +0.00%) 12.09 (+134.86%) 12.08 (+134.68%)
System time (stddev) 0.06 ( +0.00%) 0.69 ( +59.38%) 0.50 ( +41.05%)

The vma_adjust() optimizations for the case where vmas were split or
merged without changing adjacent holes seemed to improve for repeated
mprotect-splitting instead of remapping of the VMA in v2:

innermprot-next innermprot-agua-v1 innermprot-agua-v2
Elapsed time 8.02 ( +0.00%) 18.84 (+119.93%) 13.10 ( +56.32%)
Elapsed time (stddev) 0.77 ( +0.00%) 1.15 ( +21.25%) 0.79 ( +0.62%)
User time 3.92 ( +0.00%) 3.95 ( +0.59%) 4.09 ( +3.50%)
User time (stddev) 0.80 ( +0.00%) 0.69 ( -6.14%) 0.81 ( +0.34%)
System time 4.10 ( +0.00%) 14.89 (+211.44%) 9.01 ( +96.18%)
System time (stddev) 0.11 ( +0.00%) 0.90 ( +71.61%) 0.32 ( +19.00%)

The kernbench result did not measurably change from v1 to v2:

1x4kernbench-3.5.0-rc3-next-20120619 1x4kernbench-3.5.0-rc3-next-20120619-00007-g594e750 1x4kernbench-3.5.0-rc3-next-20120619-00011-g09982c8
Elapsed time 273.95 ( +0.00%) 274.11 ( +0.06%) 274.69 ( +0.27%)
Elapsed time (stddev) 0.23 ( +0.00%) 0.20 ( -2.81%) 0.30 ( +5.87%)
User time 463.38 ( +0.00%) 463.13 ( -0.05%) 463.78 ( +0.09%)
User time (stddev) 0.16 ( +0.00%) 0.23 ( +6.66%) 0.32 ( +14.15%)
System time 49.36 ( +0.00%) 50.16 ( +1.60%) 50.07 ( +1.42%)
System time (stddev) 0.24 ( +0.00%) 0.26 ( +1.43%) 0.27 ( +2.89%)

---

#include <sys/mman.h>
#include <unistd.h>

int main(int ac, char **av)
{
	int orig_write;
	unsigned int i;
	int write = 0;
	char *map;

	/*
	 * Populate the address space with thousands of VMAs; alternating
	 * protections keep adjacent anonymous mappings from being merged.
	 */
	for (i = 0; i < 4096; i++)
		mmap(NULL, 2 << 12, (write ^= 1) ? PROT_WRITE : PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0);
	map = mmap(NULL, 2 << 12, (orig_write = write ^= 1) ? PROT_WRITE : PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0);
	sbrk(0);
	for (i = 0; i < 8192; i++)
		mmap(NULL, 2 << 12, (write ^= 1) ? PROT_WRITE : PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0);
	sbrk(0);

	/* Repeatedly modify the one VMA sitting in the middle of the crowd. */
	for (i = 0; i < (1UL << 23); i++) {
#if 1
		mprotect(map, 1 << 12, PROT_NONE);
		mprotect(map, 1 << 12, orig_write ? PROT_WRITE : PROT_READ);
#else
		munmap(map, 2 << 12);
		map = mmap(NULL, 2 << 12, orig_write ? PROT_WRITE : PROT_READ, MAP_PRIVATE | MAP_ANON, -1, 0);
#endif
	}
	return 0;
}

2012-06-22 15:41:46

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH -mm v2 01/11] mm: track free size between VMAs in VMA rbtree

On 06/22/2012 10:37 AM, Peter Zijlstra wrote:
> On Fri, 2012-06-22 at 10:25 -0400, Rik van Riel wrote:
>> On 06/22/2012 10:13 AM, Peter Zijlstra wrote:
>>> On Fri, 2012-06-22 at 10:11 -0400, Rik van Riel wrote:
>>>>
>>>> I am still trying to wrap my brain around your alternative
>>>> search algorithm, not sure if/how it can be combined with
>>>> arbitrary address limits and alignment...
>>>
>>> for alignment we can do: len += align - 1;
>>
>> We could, but that might lead us to returning -ENOMEM
>> when we actually have memory available.
>>
>> When you consider architectures like HPPA, which use
>> a pretty large alignment, but align everything the same,
>> chances are pretty much every freed hole will have the
>> right alignment...
>
> Well, if you don't your gap heap is next to useless and you'll revert to
> simply walking all gaps until you find a suitable one.

I could see how that might potentially be a problem,
especially when we have a small allocation with large
alignment constraints, eg. HPPA cache alignment.

> I really worry about this search function of yours, its complexity is
> very non obvious.

Let me try implementing your algorithm with arbitrary
address constraints and alignment/colouring.

Basically, we need to remember if the allocation failed
due to bad alignment. If it did, we add shm_align_mask
to the allocation length, and try a second search.

This should result in at worst two whole tree traversals
and one partial traversal. Less on sane architectures,
or for non-MAP_SHARED allocations.
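
Roughly, as a sketch only, with find_aligned_gap() and find_gap()
standing in for the rbtree gap search in this series:

static unsigned long get_coloured_area(unsigned long len, unsigned long pgoff)
{
	unsigned long colour = (pgoff << PAGE_SHIFT) & shm_align_mask;
	unsigned long addr;

	/* First pass: look for a hole of the exact size at the right colour. */
	addr = find_aligned_gap(len, pgoff);
	if (!IS_ERR_VALUE(addr))
		return addr;

	/* Second pass: a hole of len + mask always contains a coloured range. */
	addr = find_gap(len + shm_align_mask);
	if (IS_ERR_VALUE(addr))
		return -ENOMEM;

	/* Lowest address >= addr with the requested page colour. */
	return addr + ((colour - addr) & shm_align_mask);
}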

--
All rights reversed.

2012-06-22 21:47:17

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH -mm v2 00/11] mm: scalable and unified arch_get_unmapped_area

On Fri, 22 Jun 2012 10:24:58 -0400
"John Stoffel" <[email protected]> wrote:

> >>>>> "Rik" == Rik van Riel <[email protected]> writes:
>
> Rik> A long time ago, we decided to limit the number of VMAs per
> Rik> process to 64k. As it turns out, there actually are programs
> Rik> using tens of thousands of VMAs.
>
>
> Rik> Performance
>
> Rik> Testing performance with a benchmark that allocates tens
> Rik> of thousands of VMAs, unmaps them and mmaps them some more
> Rik> in a loop, shows promising results.
>
> How are the numbers for applications which only map a few VMAs? Is
> there any impact there?
>

Johannes did a test for that: https://lkml.org/lkml/2012/6/22/219

Some regression with such a workload is unavoidable, I expect. We have
to work out whether the pros outweigh the cons. This involves handwaving.

2012-06-22 22:32:11

by Russell King - ARM Linux

[permalink] [raw]
Subject: Re: [PATCH -mm v2 10/11] mm: remove ARM arch_get_unmapped_area functions

On Thu, Jun 21, 2012 at 05:57:14PM -0400, Rik van Riel wrote:
> Remove the ARM special variants of arch_get_unmapped_area since the
> generic functions should now be able to handle everything.
>
> ARM only needs page colouring if cache_is_vipt_aliasing; leave
> shm_align_mask at PAGE_SIZE-1 unless we need colouring.
>
> Untested because I have no ARM hardware.

And I'll need other bits of the patch set to be able to test this for you...

> Cc: Russell King <[email protected]>
> Signed-off-by: Rik van Riel <[email protected]>
> ---
> arch/arm/include/asm/pgtable.h | 6 -
> arch/arm/mm/init.c | 4 +
> arch/arm/mm/mmap.c | 217 ----------------------------------------
> 3 files changed, 4 insertions(+), 223 deletions(-)
>
> diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
> index f66626d..6754183 100644
> --- a/arch/arm/include/asm/pgtable.h
> +++ b/arch/arm/include/asm/pgtable.h
> @@ -296,12 +296,6 @@ static inline pte_t pte_modify(pte_t pte, pgprot_t newprot)
> #include <asm-generic/pgtable.h>
>
> /*
> - * We provide our own arch_get_unmapped_area to cope with VIPT caches.
> - */
> -#define HAVE_ARCH_UNMAPPED_AREA
> -#define HAVE_ARCH_UNMAPPED_AREA_TOPDOWN
> -
> -/*
> * remap a physical page `pfn' of size `size' with page protection `prot'
> * into virtual address `from'
> */
> diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
> index f54d592..abaf862 100644
> --- a/arch/arm/mm/init.c
> +++ b/arch/arm/mm/init.c
> @@ -600,6 +600,10 @@ void __init mem_init(void)
> extern u32 itcm_end;
> #endif
>
> + /* Tell the page colouring code what we need. */
> + if (cache_is_vipt_aliasing())
> + shm_align_mask = SHMLBA - 1;
> +
> max_mapnr = pfn_to_page(max_pfn + PHYS_PFN_OFFSET) - mem_map;
>
> /* this will put all unused low memory onto the freelists */
> diff --git a/arch/arm/mm/mmap.c b/arch/arm/mm/mmap.c
> index ce8cb19..d90969b 100644
> --- a/arch/arm/mm/mmap.c
> +++ b/arch/arm/mm/mmap.c
> @@ -11,22 +11,6 @@
> #include <linux/random.h>
> #include <asm/cachetype.h>
>
> -static inline unsigned long COLOUR_ALIGN_DOWN(unsigned long addr,
> - unsigned long pgoff)
> -{
> - unsigned long base = addr & ~(SHMLBA-1);
> - unsigned long off = (pgoff << PAGE_SHIFT) & (SHMLBA-1);
> -
> - if (base + off <= addr)
> - return base + off;
> -
> - return base - off;
> -}
> -
> -#define COLOUR_ALIGN(addr,pgoff) \
> - ((((addr)+SHMLBA-1)&~(SHMLBA-1)) + \
> - (((pgoff)<<PAGE_SHIFT) & (SHMLBA-1)))
> -
> /* gap between mmap and stack */
> #define MIN_GAP (128*1024*1024UL)
> #define MAX_GAP ((TASK_SIZE)/6*5)
> @@ -54,207 +38,6 @@ static unsigned long mmap_base(unsigned long rnd)
> return PAGE_ALIGN(TASK_SIZE - gap - rnd);
> }
>
> -/*
> - * We need to ensure that shared mappings are correctly aligned to
> - * avoid aliasing issues with VIPT caches. We need to ensure that
> - * a specific page of an object is always mapped at a multiple of
> - * SHMLBA bytes.
> - *
> - * We unconditionally provide this function for all cases, however
> - * in the VIVT case, we optimise out the alignment rules.
> - */
> -unsigned long
> -arch_get_unmapped_area(struct file *filp, unsigned long addr,
> - unsigned long len, unsigned long pgoff, unsigned long flags)
> -{
> - struct mm_struct *mm = current->mm;
> - struct vm_area_struct *vma;
> - unsigned long start_addr;
> - int do_align = 0;
> - int aliasing = cache_is_vipt_aliasing();
> -
> - /*
> - * We only need to do colour alignment if either the I or D
> - * caches alias.
> - */
> - if (aliasing)
> - do_align = filp || (flags & MAP_SHARED);
> -
> - /*
> - * We enforce the MAP_FIXED case.
> - */
> - if (flags & MAP_FIXED) {
> - if (aliasing && flags & MAP_SHARED &&
> - (addr - (pgoff << PAGE_SHIFT)) & (SHMLBA - 1))
> - return -EINVAL;
> - return addr;
> - }
> -
> - if (len > TASK_SIZE)
> - return -ENOMEM;
> -
> - if (addr) {
> - if (do_align)
> - addr = COLOUR_ALIGN(addr, pgoff);
> - else
> - addr = PAGE_ALIGN(addr);
> -
> - vma = find_vma(mm, addr);
> - if (TASK_SIZE - len >= addr &&
> - (!vma || addr + len <= vma->vm_start))
> - return addr;
> - }
> - if (len > mm->cached_hole_size) {
> - start_addr = addr = mm->free_area_cache;
> - } else {
> - start_addr = addr = mm->mmap_base;
> - mm->cached_hole_size = 0;
> - }
> -
> -full_search:
> - if (do_align)
> - addr = COLOUR_ALIGN(addr, pgoff);
> - else
> - addr = PAGE_ALIGN(addr);
> -
> - for (vma = find_vma(mm, addr); ; vma = vma->vm_next) {
> - /* At this point: (!vma || addr < vma->vm_end). */
> - if (TASK_SIZE - len < addr) {
> - /*
> - * Start a new search - just in case we missed
> - * some holes.
> - */
> - if (start_addr != TASK_UNMAPPED_BASE) {
> - start_addr = addr = TASK_UNMAPPED_BASE;
> - mm->cached_hole_size = 0;
> - goto full_search;
> - }
> - return -ENOMEM;
> - }
> - if (!vma || addr + len <= vma->vm_start) {
> - /*
> - * Remember the place where we stopped the search:
> - */
> - mm->free_area_cache = addr + len;
> - return addr;
> - }
> - if (addr + mm->cached_hole_size < vma->vm_start)
> - mm->cached_hole_size = vma->vm_start - addr;
> - addr = vma->vm_end;
> - if (do_align)
> - addr = COLOUR_ALIGN(addr, pgoff);
> - }
> -}
> -
> -unsigned long
> -arch_get_unmapped_area_topdown(struct file *filp, const unsigned long addr0,
> - const unsigned long len, const unsigned long pgoff,
> - const unsigned long flags)
> -{
> - struct vm_area_struct *vma;
> - struct mm_struct *mm = current->mm;
> - unsigned long addr = addr0;
> - int do_align = 0;
> - int aliasing = cache_is_vipt_aliasing();
> -
> - /*
> - * We only need to do colour alignment if either the I or D
> - * caches alias.
> - */
> - if (aliasing)
> - do_align = filp || (flags & MAP_SHARED);
> -
> - /* requested length too big for entire address space */
> - if (len > TASK_SIZE)
> - return -ENOMEM;
> -
> - if (flags & MAP_FIXED) {
> - if (aliasing && flags & MAP_SHARED &&
> - (addr - (pgoff << PAGE_SHIFT)) & (SHMLBA - 1))
> - return -EINVAL;
> - return addr;
> - }
> -
> - /* requesting a specific address */
> - if (addr) {
> - if (do_align)
> - addr = COLOUR_ALIGN(addr, pgoff);
> - else
> - addr = PAGE_ALIGN(addr);
> - vma = find_vma(mm, addr);
> - if (TASK_SIZE - len >= addr &&
> - (!vma || addr + len <= vma->vm_start))
> - return addr;
> - }
> -
> - /* check if free_area_cache is useful for us */
> - if (len <= mm->cached_hole_size) {
> - mm->cached_hole_size = 0;
> - mm->free_area_cache = mm->mmap_base;
> - }
> -
> - /* either no address requested or can't fit in requested address hole */
> - addr = mm->free_area_cache;
> - if (do_align) {
> - unsigned long base = COLOUR_ALIGN_DOWN(addr - len, pgoff);
> - addr = base + len;
> - }
> -
> - /* make sure it can fit in the remaining address space */
> - if (addr > len) {
> - vma = find_vma(mm, addr-len);
> - if (!vma || addr <= vma->vm_start)
> - /* remember the address as a hint for next time */
> - return (mm->free_area_cache = addr-len);
> - }
> -
> - if (mm->mmap_base < len)
> - goto bottomup;
> -
> - addr = mm->mmap_base - len;
> - if (do_align)
> - addr = COLOUR_ALIGN_DOWN(addr, pgoff);
> -
> - do {
> - /*
> - * Lookup failure means no vma is above this address,
> - * else if new region fits below vma->vm_start,
> - * return with success:
> - */
> - vma = find_vma(mm, addr);
> - if (!vma || addr+len <= vma->vm_start)
> - /* remember the address as a hint for next time */
> - return (mm->free_area_cache = addr);
> -
> - /* remember the largest hole we saw so far */
> - if (addr + mm->cached_hole_size < vma->vm_start)
> - mm->cached_hole_size = vma->vm_start - addr;
> -
> - /* try just below the current vma->vm_start */
> - addr = vma->vm_start - len;
> - if (do_align)
> - addr = COLOUR_ALIGN_DOWN(addr, pgoff);
> - } while (len < vma->vm_start);
> -
> -bottomup:
> - /*
> - * A failed mmap() very likely causes application failure,
> - * so fall back to the bottom-up function here. This scenario
> - * can happen with large stack limits and large mmap()
> - * allocations.
> - */
> - mm->cached_hole_size = ~0UL;
> - mm->free_area_cache = TASK_UNMAPPED_BASE;
> - addr = arch_get_unmapped_area(filp, addr0, len, pgoff, flags);
> - /*
> - * Restore the topdown base:
> - */
> - mm->free_area_cache = mm->mmap_base;
> - mm->cached_hole_size = ~0UL;
> -
> - return addr;
> -}
> -
> void arch_pick_mmap_layout(struct mm_struct *mm)
> {
> unsigned long random_factor = 0UL;
> --
> 1.7.7.6
>

2012-06-23 16:08:23

by John Stoffel

[permalink] [raw]
Subject: Re: [PATCH -mm v2 00/11] mm: scalable and unified arch_get_unmapped_area

>>>>> "Andrew" == Andrew Morton <[email protected]> writes:

Andrew> On Fri, 22 Jun 2012 10:24:58 -0400
Andrew> "John Stoffel" <[email protected]> wrote:

>> >>>>> "Rik" == Rik van Riel <[email protected]> writes:
>>
Rik> A long time ago, we decided to limit the number of VMAs per
Rik> process to 64k. As it turns out, there actually are programs
Rik> using tens of thousands of VMAs.
>>
>>
Rik> Performance
>>
Rik> Testing performance with a benchmark that allocates tens
Rik> of thousands of VMAs, unmaps them and mmaps them some more
Rik> in a loop, shows promising results.
>>
>> How are the numbers for applications which only map a few VMAs? Is
>> there any impact there?
>>

Andrew> Johannes did a test for that: https://lkml.org/lkml/2012/6/22/219

I don't see that in his results. But maybe (probably) I don't
understand what types of applications this change is supposed to
help. I guess I worry that this will just keep slowing down other
apps.

His tests seemed to be for just one VMA remapped with thousands in
use. Or am I missing the fact that all VMAs are in the same pool?

Andrew> Some regression with such a workload is unavoidable, I expect.
Andrew> We have to work out whether the pros outweigh the cons. This
Andrew> involves handwaving.

Yup, it does. Proof by vigorous handwaving is a time honored
tradition.

And I do see that the numbers aren't that much poorer, I just keep
thinking that if we can speed up the corner case, can't we also speed
up the normal case with a better algorithm or data structure?

John

2012-06-23 17:53:51

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH -mm v2 10/11] mm: remove ARM arch_get_unmapped_area functions

On Fri, Jun 22, 2012 at 11:27:56PM +0100, Russell King - ARM Linux wrote:
> On Thu, Jun 21, 2012 at 05:57:14PM -0400, Rik van Riel wrote:
> > Remove the ARM special variants of arch_get_unmapped_area since the
> > generic functions should now be able to handle everything.
> >
> > ARM only needs page colouring if cache_is_vipt_aliasing; leave
> > shm_align_mask at PAGE_SIZE-1 unless we need colouring.
> >
> > Untested because I have no ARM hardware.
>
> And I'll need other bits of the patch set to be able to test this for you...

I also prefer getting 10 mails to make sense of that one change in a
series over getting that one without context. Subjects are usually
obvious enough to quickly find out which stuff was meant for you.

2012-06-25 02:12:51

by Paul Mundt

[permalink] [raw]
Subject: Re: [PATCH -mm v2 11/11] mm: remove SH arch_get_unmapped_area functions

On Thu, Jun 21, 2012 at 05:57:15PM -0400, Rik van Riel wrote:
> Remove the SH special variants of arch_get_unmapped_area since
> the generic functions should now be able to handle everything.
>
> Paul, does anything in NOMMU SH need shm_align_mask?
>
> Untested because I have no SH hardware.
>
> Signed-off-by: Rik van Riel <[email protected]>
> Cc: Paul Mundt <[email protected]>
> Cc: Magnus Damm <[email protected]>
> Cc: [email protected]

We don't particularly need it for the nommu case, it's just using the
default PAGE_SIZE case there. The primary reason for having it defined
is so we can use the same cache alias checking and d-cache purging code
on parts that can operate with or without the MMU enabled.

So it would be nice to have the variable generally accessible regardless
of CONFIG_MMU setting, rather than having to have a private definition
for the nommu case.

2012-06-25 19:30:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -mm v2 01/11] mm: track free size between VMAs in VMA rbtree

On Fri, 2012-06-22 at 11:41 -0400, Rik van Riel wrote:
> Let me try implementing your algorithm with arbitrary
> address constraints and alignment/colouring.

Right, so the best I could come up with for a range search is
O((log n)^2), it does need another pointer in the vma though :/

Instead of storing the single right sub-tree pointer, you store a single
linked list of right sub-tree pointers using that extra vma member.

Then when no gap was found on the left downward path, try each
successive right sub-tree (bottom-up per the LIFO single linked list and
top-down) and do a right-path and left subtree search for those.

So you get a log n walk, with a log n walk for each right sub-tree,
giving a 1/2 * log n * log n aka O((log n)^2).

If you do the search without right limit, the first right subtree search
is sufficient and you'll revert back to O(log n).



I've also thought about the update cost and I think I can make the
vma_adjust case cheaper if you keep the max(vm_end) as second
augmentation, this does add another word to the vma though.

Using this max(vm_end) of the subtree you can do a rb_augment_path()
variant over a specified range (the range that gets modified by
vma_adjust etc..) in O(m log n) worst time, but much better on average.

You still do the m iteration on the range, but you stop the path upwards
whenever the subtree max(vm_end) covers the given range end. Except for
the very last of m, at which point you'll go all the way up.

This should avoid many of the duplicate path traversals the naive
implementation does.


Sadly we've just added two words to the vma and are 2 words over the
cacheline size for the fields used in the update.

This too could be fixed by removing vm_{prev,next} and making
rb_{prev,next} O(1) instead of O(log n) as they are now. Like:
http://en.wikipedia.org/wiki/Threaded_binary_tree


I haven't had any good ideas on the alignment thing though, I keep
getting back to O(n) or worse if you want a guarantee you find a hole if
you have it.

The thing you propose, the double search, once for len, and once for len
+align-1 doesn't guarantee you'll find a hole. All holes of len might be
mis-aligned but the len+align-1 search might overlook a hole of suitable
size and alignment, you'd have to search the entire range: [len, len
+align-1], and that's somewhat silly.

2012-06-25 21:54:05

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH -mm v2 01/11] mm: track free size between VMAs in VMA rbtree

On 06/25/2012 03:29 PM, Peter Zijlstra wrote:
> On Fri, 2012-06-22 at 11:41 -0400, Rik van Riel wrote:
>> Let me try implementing your algorithm with arbitrary
>> address constraints and alignment/colouring.
>
> Right, so the best I could come up with for a range search is
> O((log n)^2), it does need another pointer in the vma though :/
>
> Instead of storing the single right sub-tree pointer, you store a single
> linked list of right sub-tree pointers using that extra vma member.
>
> Then when no gap was found on the left downward path, try each
> successive right sub-tree (bottom-up per the LIFO single linked list and
> top-down) and do a right-path and left subtree search for those.
>
> So you get a log n walk, with a log n walk for each right sub-tree,
> giving a 1/2 * log n * log n aka O((log n)^2).
>
> If you do the search without right limit, the first right subtree search
> is sufficient and you'll revert back to O(log n).

That is essentially what my current code does.

> I've also thought about the update cost and I think I can make the
> vma_adjust case cheaper if you keep the max(vm_end) as second
> augmentation, this does add another word to the vma though.
>
> Using this max(vm_end) of the subtree you can do a rb_augment_path()
> variant over a specified range (the range that gets modified by
> vma_adjust etc..) in O(m log n) worst time, but much better on average.
>
> You still do the m iteration on the range, but you stop the path upwards
> whenever the subtree max(vm_end) covers the given range end. Except for
> the very last of m, at which point you'll go all the way up.
>
> This should avoid many of the duplicate path traversals the naive
> implementation does.

One of the benefits of an rbtree is that an insertion
or deletion is accompanied by at most 2 rotations.

These rotations, if there are two, will happen next
to each other, meaning at most 5 nodes will have new
child nodes.

Furthermore, the parent of the area where rotation
happened still has the same set of child nodes as
before (plus or minus the inserted or deleted node).

If we call the augmentation function from inside the
tree manipulation code, we may be able to save on a
lot of walking simply by having it bail out once the
augmented value in a node equals the "new" value.

The downside? This makes the rbtree code somewhat more
complex, vs. the brute force walk up the tree the current
augmented rbtree code does.
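
Something like this minimal sketch of the bail-out walk (not what the
current code does; largest_free_gap(), rb_to_vma() and the free_gap
field are the ones from patch 1):

static void propagate_free_gap(struct rb_node *node)
{
	while (node) {
		struct vm_area_struct *vma = rb_to_vma(node);
		unsigned long gap = largest_free_gap(node);

		if (vma->free_gap == gap)
			break;		/* nothing above us changes either */

		vma->free_gap = gap;
		node = rb_parent(node);
	}
}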

> I haven't had any good ideas on the alignment thing though, I keep
> getting back to O(n) or worse if you want a guarantee you find a hole if
> you have it.
>
> The thing you propose, the double search, once for len, and once for len
> +align-1 doesn't guarantee you'll find a hole. All holes of len might be
> mis-aligned but the len+align-1 search might overlook a hole of suitable
> size and alignment, you'd have to search the entire range: [len, len
> +align-1], and that's somewhat silly.

This may still be good enough.

Programs that exhaust their virtual address space to the
point of not being able to find a properly aligned memory
hole should be recompiled as 64 bit systems, or run on
hardware that does not have such requirements (eg. x86).

Also, the memory alignment requirement on AMD Bulldozer,
ARM and MIPS seems to be quite small. On the order of a
handful of pages (4 pages on AMD & ARM, I assume modern
MIPS is fairly sane too).

2012-06-26 08:32:20

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -mm v2 01/11] mm: track free size between VMAs in VMA rbtree

On Mon, 2012-06-25 at 17:52 -0400, Rik van Riel wrote:
> The downside? This makes the rbtree code somewhat more
> complex, vs. the brute force walk up the tree the current
> augmented rbtree code does.

Something like that should be in the git history of that code. See
b945d6b2554d55 ("rbtree: Undo augmented trees performance damage and
regression").

I removed that because it adds overhead to the scheduler fast paths, but
if we can all agree to move lib/rbtree.c into inlines in
include/linux/rbtree.h (possibly utilizing Daniel Santos' magic) then we
could do this again.

Anyway, doing the updates in the insertion/deletion might speed up
those, but you still have the regular modifications what don't do
insert/delete to think about.

If you look at your patch 1, __vma_unlink has an adjust_free_gap() right
next to the rb_augment_erase(), vma_adjust() has 3 adjust_free_gap()
calls right next to each other.

All these will do an entire path walk back to the root. I would think we
could save quite a bit of updating by not having them all walk back to
the root. No point in re-computing the top levels if you know the next
update will change them again anyway.

2012-06-26 08:38:10

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -mm v2 01/11] mm: track free size between VMAs in VMA rbtree

On Mon, 2012-06-25 at 17:52 -0400, Rik van Riel wrote:
> > The thing you propose, the double search, once for len, and once for len
> > +align-1 doesn't guarantee you'll find a hole. All holes of len might be
> > mis-aligned but the len+align-1 search might overlook a hole of suitable
> > size and alignment, you'd have to search the entire range: [len, len
> > +align-1], and that's somewhat silly.
>
> This may still be good enough.

OK, as long as this is clearly mentioned in a comment near there.

It just annoys me that I cannot come up with anything better :-)

2012-06-26 13:07:18

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH -mm v2 01/11] mm: track free size between VMAs in VMA rbtree

On 06/26/2012 04:31 AM, Peter Zijlstra wrote:

> If you look at your patch 1, __vma_unlink has an adjust_free_gap() right
> next to the rb_augment_erase(), vma_adjust() has 3 adjust_free_gap()
> calls right next to each other.
>
> All these will do an entire path walk back to the root. I would think we
> could save quite a bit of updating by not having them all walk back to
> the root. No point in re-computing the top levels if you know the next
> update will change them again anyway.

The problem is, unless we look at the augmented data at
rotate time, we do not know when it is safe to stop
iterating up the tree.

--
All rights reversed

2012-06-26 13:46:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -mm v2 01/11] mm: track free size between VMAs in VMA rbtree

On Tue, 2012-06-26 at 09:05 -0400, Rik van Riel wrote:
> On 06/26/2012 04:31 AM, Peter Zijlstra wrote:
>
> > If you look at your patch 1, __vma_unlink has an adjust_free_gap() right
> > next to the rb_augment_erase(), vma_adjust() has 3 adjust_free_gap()
> > calls right next to each other.
> >
> > All these will do an entire path walk back to the root. I would think we
> > could save quite a bit of updating by not having them all walk back to
> > the root. No point in re-computing the top levels if you know the next
> > update will change them again anyway.
>
> The problem is, unless we look at the augmented data at
> rotate time, we do not know when it is safe to stop
> iterating up the tree.

argh,.. you're using adjust_vma_gap() for insertions instead of
rb_augment_insert().

I was going on the premise that you're doing updates for augmented data
without modifying the tree structure and that doing insert/delete will
keep the stuff up-to-date.

So now I'm not sure why you do if (insert) adjust_free_gap(insert),
since __insert_vm_struct(mm, insert) -> __vma_link() -> __vma_link_rb()
already does an augment update.

2012-06-26 15:51:55

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH -mm v2 01/11] mm: track free size between VMAs in VMA rbtree

On 06/26/2012 09:45 AM, Peter Zijlstra wrote:
> On Tue, 2012-06-26 at 09:05 -0400, Rik van Riel wrote:
>> On 06/26/2012 04:31 AM, Peter Zijlstra wrote:
>>
>>> If you look at your patch 1, __vma_unlink has an adjust_free_gap() right
>>> next to the rb_augment_erase(), vma_adjust() has 3 adjust_free_gap()
>>> calls right next to each other.
>>>
>>> All these will do an entire path walk back to the root. I would think we
>>> could save quite a bit of updating by not having them all walk back to
>>> the root. No point in re-computing the top levels if you know the next
>>> update will change them again anyway.
>>
>> The problem is, unless we look at the augmented data at
>> rotate time, we do not know when it is safe to stop
>> iterating up the tree.
>
> argh,.. you're using adjust_vma_gap() for insertions instead of
> rb_augment_insert().
>
> I was going on the premise that you're doing updates for augmented data
> without modifying the tree structure and that doing insert/delete will
> keep the stuff up-to-date.
>
> So now I'm not sure why you do if (insert) adjust_free_gap(insert),
> since __insert_vm_struct(mm, insert) -> __vma_link() -> __vma_link_rb()
> already does an augment update.

I have fixed that in patch 3/11 of this series,
which I kept separate for this round of submission
to make it easier for reviewers to see what I
changed there.

However, doing an insert or delete changes the
gap size for the _next_ vma, and potentially a
change in the maximum gap size for the parent
node, so both insert and delete cause two tree
walks :(

2012-06-27 12:28:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -mm v2 01/11] mm: track free size between VMAs in VMA rbtree

On Tue, 2012-06-26 at 11:49 -0400, Rik van Riel wrote:
>
> However, doing an insert or delete changes the
> gap size for the _next_ vma, and potentially a
> change in the maximum gap size for the parent
> node, so both insert and delete cause two tree
> walks :(

Right,.. don't have anything smart for that :/

I guess there's nothing to it but create a number of variants of
rb_insert/rb_erase, possibly using Daniel's 'template' stuff so we don't
actually have to maintain multiple copies of the code.

Maybe something simple like:

static void __always_inline
__rb_insert(struct rb_node *node, struct rb_root *root, rb_augment_f func, bool threaded)
{
	/* all the fancy code */
}

void rb_insert(struct rb_node *node, struct rb_root *root)
{
	__rb_insert(node, root, NULL, false);
}

void rb_insert_threaded(struct rb_node *node, struct rb_root *root)
{
	__rb_insert(node, root, NULL, true);
}

void rb_insert_augment(struct rb_node *node, struct rb_root *root, rb_augment_f func)
{
	__rb_insert(node, root, func, false);
}

void rb_insert_augment_threaded(struct rb_node *node, struct rb_root *root, rb_augment_f func)
{
	__rb_insert(node, root, func, true);
}

Would do, except it wouldn't be able to inline the augment function. For
that to happen we'd need to move __rb_insert() and the
rb_insert_augment*() variants into rbtree.h.

But it would create clean variants without augmentation/threading
without too much duplicate code.
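
For illustration, a rough sketch of the header-side arrangement (the names
__rb_insert_inline() and vma_rb_insert() are made up here, not an existing
API): the core loop lives in rbtree.h as __always_inline, and each user
supplies a trivial wrapper so its augment callback can be inlined too.

        /* rbtree.h (sketch) */
        static __always_inline void
        __rb_insert_inline(struct rb_node *node, struct rb_root *root,
                           rb_augment_f func, bool threaded)
        {
                /* all the fancy code, calling func around rebalancing */
        }

        /* mm/mmap.c (sketch): the callback is visible, so gcc can inline it */
        static void vma_rb_insert(struct rb_node *node, struct rb_root *root)
        {
                __rb_insert_inline(node, root, vma_rb_augment_cb, false);
        }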


BTW, is there a reason rb_link_node() and rb_insert_color() are separate
functions? They seem to always be used together in sequence.
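
For example, the usual pairing, as in the __vma_link_rb() hunk quoted in a
later reply:

        rb_link_node(&vma->vm_rb, rb_parent, rb_link);  /* hook node into place */
        rb_insert_color(&vma->vm_rb, &mm->mm_rb);       /* rebalance and recolor */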

2012-06-29 23:46:45

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [PATCH -mm v2 01/11] mm: track free size between VMAs in VMA rbtree

On Thu, Jun 21, 2012 at 05:57:05PM -0400, Rik van Riel wrote:
> /*
> + * largest_free_gap - returns the largest free gap "nearby"
> + *
> + * This function is called when propagating information on the
> + * free spaces between VMAs "up" the VMA rbtree. It returns the
> + * largest of:
> + *
> + * 1. The gap between this VMA and vma->vm_prev.
> + * 2. The largest gap below us and to our left in the rbtree.
> + * 3. The largest gap below us and to our right in the rbtree.
> + */
> +static unsigned long largest_free_gap(struct rb_node *node)
> +{
> + struct vm_area_struct *vma, *prev, *left = NULL, *right = NULL;
> + unsigned long largest = 0;
> +
> + if (node->rb_left)
> + left = rb_to_vma(node->rb_left);
> + if (node->rb_right)
> + right = rb_to_vma(node->rb_right);
> +
> + /* Calculate the free gap size between us and the VMA to our left. */
> + vma = rb_to_vma(node);
> + prev = vma->vm_prev;
> +
> + if (prev)
> + largest = vma->vm_start - prev->vm_end;
> + else
> + largest = vma->vm_start;
> +
> + /* We propagate the largest of our own, or our children's free gaps. */
> + if (left)
> + largest = max(largest, left->free_gap);
> + if (right)
> + largest = max(largest, right->free_gap);
> +
> + return largest;
> +}

I second PeterZ's suggestion of having a helper function

static unsigned long free_gap(struct rb_node *node)
{
        struct vm_area_struct *vma = rb_to_vma(node);
        unsigned long gap = vma->vm_start;
        if (vma->vm_prev)
                gap -= vma->vm_prev->vm_end;
        return gap;
}

Which you can then use to simplify the largest_free_gap() implementation.
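
With that helper, the propagation step could shrink to something like this
(just a sketch against the hunk quoted above, reusing the same rb_to_vma()
and ->free_gap from the patch):

        static unsigned long largest_free_gap(struct rb_node *node)
        {
                unsigned long largest = free_gap(node);

                if (node->rb_left)
                        largest = max(largest, rb_to_vma(node->rb_left)->free_gap);
                if (node->rb_right)
                        largest = max(largest, rb_to_vma(node->rb_right)->free_gap);

                return largest;
        }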

> +/*
> + * Use the augmented rbtree code to propagate info on the largest
> + * free gap between VMAs up the VMA rbtree.
> + */
> +static void adjust_free_gap(struct vm_area_struct *vma)
> +{
> + rb_augment_erase_end(&vma->vm_rb, vma_rb_augment_cb, NULL);
> +}
> @@ -398,6 +454,10 @@ void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
> {
> rb_link_node(&vma->vm_rb, rb_parent, rb_link);
> rb_insert_color(&vma->vm_rb, &mm->mm_rb);
> + adjust_free_gap(vma);
> + /* Propagate the new free gap between next and us up the tree. */
> + if (vma->vm_next)
> + adjust_free_gap(vma->vm_next);
> }

So this will work, and may be fine for a first implementation. However,
the augmented rbtree support really seems inadequate here. What we
would want is for adjust_free_gap to adjust the node's free_gap as
well as its parents, and *stop* when it reaches a node that already
has the desired free_gap instead of going all the way to the root as
it does now. But, to do that we would also need rb_insert_color() to
adjust free_gap as needed when doing tree rotations, and it doesn't
have the necessary support there.

Basically, I think lib/rbtree.c should provide augmented rbtree support
in the form of (versions of) rb_insert_color() and rb_erase() being able to
call back to adjust the augmented node information around tree rotations,
instead of using (conservative, overkill) loops to adjust the augmented
node information after the fact in all places that might have been affected
by such rotations. Doing it after the fact is necessarily overkill because
it visits O(log N) nodes, while the number of rotations that might have
occurred is only O(1).
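
As a rough sketch of what such an interface could look like (the names here
are purely illustrative, not an existing lib/rbtree.c API), the rebalancing
code would invoke per-tree callbacks around each rotation, so only the O(1)
rotated nodes get their augmented data recomputed:

        struct rb_augment_callbacks {
                /* recompute augmented data after 'new' takes over from 'old'
                 * as the root of a rotated subtree */
                void (*rotate)(struct rb_node *old, struct rb_node *new);
                /* push a change upward from 'node', stopping early once a
                 * node's augmented value is already correct */
                void (*propagate)(struct rb_node *node, struct rb_node *stop);
        };

        extern void rb_insert_augmented(struct rb_node *node, struct rb_root *root,
                                        const struct rb_augment_callbacks *aug);
        extern void rb_erase_augmented(struct rb_node *node, struct rb_root *root,
                                       const struct rb_augment_callbacks *aug);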

I'm interested in this stuff, please CC me if you do a v3 :)

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

2012-06-30 01:33:23

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [PATCH -mm v2 05/11] mm: get unmapped area from VMA tree

On Thu, Jun 21, 2012 at 05:57:09PM -0400, Rik van Riel wrote:
> + if (!vma->vm_prev) {
> + /* This is the left-most VMA. */
> + if (vma->vm_start - len >= lower_limit) {
> + addr = lower_limit;
> + goto found_addr;
> + }
> + } else {
> + /* Is this gap large enough? Remember it. */
> + vma_start = max(vma->vm_prev->vm_end, lower_limit);
> + if (vma->vm_start - len >= vma_start) {
> + addr = vma_start;
> + found_here = true;
> + }
> + }

You could unify these two cases:

vma_start = lower_limit;
if (vma->vm_prev && vma->vm_prev->vm_end > vma_start)
        vma_start = vma->vm_prev->vm_end;
if (vma->vm_start - len >= vma_start) {
        addr = vma_start;
        found_here = true;
}

You don't need the goto found_addr; the search won't be going left as there
is no node there and it won't be going right as found_here is true.

We may also be able to dispense with found_here and replace it with a special
value (either NULL or something not page aligned) for addr.

> + if (!found_here && node_free_gap(rb_node->rb_right) >= len) {
> + /* Last known gap is to the right of this subtree. */
> + rb_node = rb_node->rb_right;
> + continue;
> + } else if (!addr) {
> + rb_node = rb_find_next_uncle(rb_node);
> + continue;
> }

Looks like you're already using my suggestion of using !addr to indicate
we haven't found a suitable gap yet :)

I don't think we want to visit just any uncle though - we want to visit one
that has a large enough free gap somewhere in its subtree.

So, maybe:

if (!found_here) { // or if(!addr) or whatever
        struct rb_node *rb_prev = NULL;
        do {
                if (rb_node != rb_prev &&
                    node_free_gap(rb_node->rb_right) >= len) {
                        rb_node = rb_node->rb_right;
                        break;
                }
                rb_prev = rb_node;
                rb_node = rb_parent(rb_node);
        } while (rb_node);
        continue;
}

> + /* This is the left-most gap. */
> + goto found_addr;
> }
> +
> + /*
> + * There is not enough space to the left of any VMA.
> + * Check the far right-hand side of the VMA tree.
> + */
> + rb_node = mm->mm_rb.rb_node;
> + while (rb_node->rb_right)
> + rb_node = rb_node->rb_right;
> + vma = rb_to_vma(rb_node);
> + addr = vma->vm_end;
> +
> + /*
> + * The right-most VMA ends below the lower limit. Can only happen
> + * if a binary personality loads the stack below the executable.
> + */
> + if (addr < lower_limit)
> + addr = lower_limit;
> +
> + found_addr:
> + if (TASK_SIZE - len < addr)
> + return -ENOMEM;

I'm confused - if we come from 'goto found_addr', we found a gap to the
left of an existing vma; aren't we guaranteed that this gap ends to the
left of TASK_SIZE too, since the existing vma's vm_start should be
less than TASK_SIZE?

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

2012-06-30 02:22:45

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [PATCH -mm v2 07/11] mm: make cache alignment code generic

On Thu, Jun 21, 2012 at 05:57:11PM -0400, Rik van Riel wrote:
> /* Is this gap large enough? Remember it. */
> vma_start = max(vma->vm_prev->vm_end, lower_limit);
> + vma_start = arch_align_addr(vma_start, filp,
> + pgoff, flags, ALLOC_UP);
> if (vma->vm_start - len >= vma_start) {
> addr = vma_start;
> found_here = true;


So, right there you're losing the benefit of O(log N) allocations on these
vmas that require alignment. The rbtree lets you quickly find a gap of
the desired size, but you may visit any number of such gaps without ever
finding one that is still large enough after alignment.

I wonder if one could go with a two-stage process:

1- figure out what gap size would guarantee a successful, aligned allocation.
basically it's desired size + desired alignment - PAGE_SIZE. See if you
can find a gap of that size, and carve your aligned allocation into it
if possible.

2- if that failed, look for all gaps of at least the desired size,
as you are proposing, and see if any of them is aligned enough for
your requirements.

This would possibly cause a bit more virtual address space fragmentation,
but I think this should still work?
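
Roughly (a sketch only: find_gap() is a stand-in name for the gap search,
and 'align' stands for whatever alignment the cache colouring requires):

        /*
         * Stage 1: any gap of len + align - PAGE_SIZE bytes is guaranteed
         * to contain a suitably aligned block of len bytes, so the existing
         * exact-size search can be reused unchanged.
         */
        addr = find_gap(mm, len + align - PAGE_SIZE, lower_limit);
        if (!IS_ERR_VALUE(addr))
                return ALIGN(addr, align);

        /*
         * Stage 2: fall back to walking gaps of at least len bytes and
         * checking each against arch_align_addr(), as the patch does now.
         */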

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

2012-06-30 02:42:33

by Michel Lespinasse

[permalink] [raw]
Subject: Re: [PATCH -mm v2 05/11] mm: get unmapped area from VMA tree

On Thu, Jun 21, 2012 at 2:57 PM, Rik van Riel <[email protected]> wrote:
> For topdown allocations, we need to keep track of the highest
> mapped VMA address, because it might be below mm->mmap_base,
> and we only keep track of free space to the left of each VMA
> in the VMA tree.  It is tempting to try and keep track of
> the free space to the right of each VMA when running in
> topdown mode, but that gets us into trouble when running on
> x86, where a process can switch direction in the middle of
> execve.

Just a random thought - one way to handle this could be to always have
some sentinel VMA at the end of the address space. Some architectures
might already have it in the form of the vsyscall page, or we could
always have some other sentinel right above the last usable user page?

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.