Initial work on preallocations showed no performance regression during
testing, but recently some users (both on [1] and off [android] list)
have reported that preallocating the worst-case number of nodes causes
a slowdown. This patch set reduces the number of allocations in a few
ways.
Most munmap() operations remove a single VMA, so leverage the fact that
the maple tree can place a single pointer at range 0 - 0 without
allocating. This is done by changing how the VMAs to be removed are
indexed in the side tree: by a count starting at 0 rather than by their
address range (sketched below).
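As a rough illustration of the idea (a sketch only, not the exact hunk
from patch 0003; names follow the existing do_vmi_align_munmap() code
and may differ in detail), the gather loop stores each detached VMA at
the running count, so the common single-VMA case lands in the root
pointer of the side tree with no node allocation:

	for_each_vma_range(*vmi, next, end) {
		/* index by count, not by the VMA's address range */
		mas_set(&mas_detach, count);
		error = mas_store_gfp(&mas_detach, next, GFP_KERNEL);
		if (error)
			goto munmap_gather_failed;
		count++;
	}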
Re-introduce the entry argument to mas_preallocate() so that a more
intelligent guess of the node count can be made.
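For reference, the prototype after that change should look roughly like
the following (a sketch; see the patch for the exact declaration):

	/* entry: the value about to be stored, used to refine the guess */
	int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp);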
Implement the more intelligent guess of the node count, although there
is more work to be done.
During development of v2 of this patch set, I also noticed that the
number of nodes being allocated for a rebalance was beyond what could
possibly be needed. This is addressed in patch 0008.
Patches are in the following order:
0001-0002: Testing framework for benchmarking some operations
0003: Reduction of maple node allocation in the side tree
0004: Small cleanup of do_vmi_align_munmap()
0005-0014: mas_preallocate() calculation change
0015: Change the vma iterator order
Changes since v2:
- No longer moving the unmap_vmas() definition - Thanks Mike Kravetz
- Rebase on top of stack changes in do_vmi_align_munmap()
v2: https://lore.kernel.org/linux-mm/[email protected]/
v1: https://lore.kernel.org/lkml/[email protected]/
Liam R. Howlett (15):
maple_tree: Add benchmarking for mas_for_each
maple_tree: Add benchmarking for mas_prev()
mm: Change do_vmi_align_munmap() tracking of VMAs to remove
mm: Remove prev check from do_vmi_align_munmap()
maple_tree: Introduce __mas_set_range()
mm: Remove re-walk from mmap_region()
maple_tree: Re-introduce entry to mas_preallocate() arguments
maple_tree: Adjust node allocation on mas_rebalance()
mm: Use vma_iter_clear_gfp() in nommu
mm: Set up vma iterator for vma_iter_prealloc() calls
maple_tree: Move mas_wr_end_piv() below mas_wr_extend_null()
maple_tree: Update mas_preallocate() testing
maple_tree: Refine mas_preallocate() node calculations
maple_tree: Reduce resets during store setup
mm/mmap: Change vma iteration order in do_vmi_align_munmap()
fs/exec.c | 1 +
include/linux/maple_tree.h | 23 ++++-
include/linux/mm.h | 4 +-
lib/maple_tree.c | 117 +++++++++++++++------
lib/test_maple_tree.c | 76 ++++++++++++++
mm/internal.h | 36 +++++--
mm/memory.c | 16 ++-
mm/mmap.c | 170 +++++++++++++++++--------------
mm/nommu.c | 45 ++++----
tools/testing/radix-tree/maple.c | 59 ++++++-----
10 files changed, 359 insertions(+), 188 deletions(-)
--
2.39.2
Using vma_iter_set() will reset the tree and cause a re-walk. Use
vma_iter_config() to set the write to a subset of the range. Change the
file case to also use vma_iter_config() so that the end is correctly
set.
Signed-off-by: Liam R. Howlett <[email protected]>
---
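Condensed, the change amounts to the following (illustration only; the
hunks below are authoritative):

	/* Before: reset to MAS_START, so the next write re-walks the tree */
	vma_iter_set(&vmi, addr);

	/*
	 * After: keep the current walk and only narrow the write to
	 * [addr, end); end is exclusive, hence "last = end - 1" in
	 * __mas_set_range().
	 */
	vma_iter_config(&vmi, addr, end);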
mm/internal.h | 8 ++++++++
mm/mmap.c | 15 ++++++++++-----
2 files changed, 18 insertions(+), 5 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 7d11ebe5d11c..c5ba08f55deb 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1041,6 +1041,14 @@ static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma)
return !(vma->vm_flags & VM_SOFTDIRTY);
}
+static inline void vma_iter_config(struct vma_iterator *vmi,
+ unsigned long index, unsigned long last)
+{
+ MAS_BUG_ON(&vmi->mas, vmi->mas.node != MAS_START &&
+ (vmi->mas.index > index || vmi->mas.last < index));
+ __mas_set_range(&vmi->mas, index, last - 1);
+}
+
/*
* VMA Iterator functions shared between nommu and mmap
*/
diff --git a/mm/mmap.c b/mm/mmap.c
index d4056d1de7fa..f518e4c70a7b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2675,8 +2675,11 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
next = vma_next(&vmi);
prev = vma_prev(&vmi);
- if (vm_flags & VM_SPECIAL)
+ if (vm_flags & VM_SPECIAL) {
+ if (prev)
+ vma_iter_next_range(&vmi);
goto cannot_expand;
+ }
/* Attempt to expand an old mapping */
/* Check next */
@@ -2697,6 +2700,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
merge_start = prev->vm_start;
vma = prev;
vm_pgoff = prev->vm_pgoff;
+ } else if (prev) {
+ vma_iter_next_range(&vmi);
}
@@ -2707,9 +2712,9 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
goto expanded;
}
+ if (vma == prev)
+ vma_iter_set(&vmi, addr);
cannot_expand:
- if (prev)
- vma_iter_next_range(&vmi);
/*
* Determine the object being mapped and call the appropriate
@@ -2722,7 +2727,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
goto unacct_error;
}
- vma_iter_set(&vmi, addr);
+ vma_iter_config(&vmi, addr, end);
vma->vm_start = addr;
vma->vm_end = end;
vm_flags_init(vma, vm_flags);
@@ -2749,7 +2754,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if (WARN_ON((addr != vma->vm_start)))
goto close_and_free_vma;
- vma_iter_set(&vmi, addr);
+ vma_iter_config(&vmi, addr, end);
/*
* If vm_flags changed after call_mmap(), we should try merge
* vma again as we may succeed this time.
--
2.39.2
mas_rebalance() is called to rebalance an insufficient node into a
single node or two sufficient nodes. The preallocation estimate is
always too high in this case, as the height of the tree will never grow
and there is no possibility of a three-way split, so revise the node
allocation count.
Signed-off-by: Liam R. Howlett <[email protected]>
---
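A quick sanity check of the revised bound (editor's arithmetic, not
part of the patch): for empty_count = 1, the old estimate of
1 + empty_count * 3 reserved 4 nodes, while empty_count * 2 - 1 reserves
1; for empty_count = 2 the estimates are 7 vs 3. This follows the
reasoning above that each level combines into at most two nodes and the
height never grows, so roughly two nodes per level suffice.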
lib/maple_tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 0d7e30c7d999..494f884ef17f 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -3136,7 +3136,7 @@ static inline int mas_rebalance(struct ma_state *mas,
* tries to combine the data in the same way. If one node contains the
* entire range of the tree, then that node is used as a new root node.
*/
- mas_node_count(mas, 1 + empty_count * 3);
+ mas_node_count(mas, empty_count * 2 - 1);
if (mas_is_err(mas))
return 0;
--
2.39.2
Move the definition of vma_iter_clear_gfp() from mmap.c to internal.h so
it can be used in the nommu code. This will reduce node preallocations
in nommu.
Signed-off-by: Liam R. Howlett <[email protected]>
---
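The nommu shrink path then boils down to the following (copied in
spirit from the diff below): a single allocating store of NULL replaces
the vma_iter_prealloc() plus vma_iter_clear() pair, so no nodes are
reserved up front:

	if (from > vma->vm_start) {
		if (vma_iter_clear_gfp(vmi, from, vma->vm_end, GFP_KERNEL))
			return -ENOMEM;
		vma->vm_end = from;
	} else {
		if (vma_iter_clear_gfp(vmi, vma->vm_start, to, GFP_KERNEL))
			return -ENOMEM;
		vma->vm_start = to;
	}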
mm/internal.h | 12 ++++++++++++
mm/mmap.c | 12 ------------
mm/nommu.c | 12 ++++--------
3 files changed, 16 insertions(+), 20 deletions(-)
diff --git a/mm/internal.h b/mm/internal.h
index 65f646c2ccf3..2f35c0ef7b7f 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1064,6 +1064,18 @@ static inline void vma_iter_clear(struct vma_iterator *vmi,
mas_store_prealloc(&vmi->mas, NULL);
}
+static inline int vma_iter_clear_gfp(struct vma_iterator *vmi,
+ unsigned long start, unsigned long end, gfp_t gfp)
+{
+ vmi->mas.index = start;
+ vmi->mas.last = end - 1;
+ mas_store_gfp(&vmi->mas, NULL, gfp);
+ if (unlikely(mas_is_err(&vmi->mas)))
+ return -ENOMEM;
+
+ return 0;
+}
+
static inline struct vm_area_struct *vma_iter_load(struct vma_iterator *vmi)
{
return mas_walk(&vmi->mas);
diff --git a/mm/mmap.c b/mm/mmap.c
index 625d7411d5a0..d09f11a9ad85 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -154,18 +154,6 @@ static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi,
return mas_prev(&vmi->mas, min);
}
-static inline int vma_iter_clear_gfp(struct vma_iterator *vmi,
- unsigned long start, unsigned long end, gfp_t gfp)
-{
- vmi->mas.index = start;
- vmi->mas.last = end - 1;
- mas_store_gfp(&vmi->mas, NULL, gfp);
- if (unlikely(mas_is_err(&vmi->mas)))
- return -ENOMEM;
-
- return 0;
-}
-
/*
* check_brk_limits() - Use platform specific check of range & verify mlock
* limits.
diff --git a/mm/nommu.c b/mm/nommu.c
index 9826f6101a05..418cc0669c1f 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1396,17 +1396,13 @@ static int vmi_shrink_vma(struct vma_iterator *vmi,
/* adjust the VMA's pointers, which may reposition it in the MM's tree
* and list */
- if (vma_iter_prealloc(vmi)) {
- pr_warn("Allocation of vma tree for process %d failed\n",
- current->pid);
- return -ENOMEM;
- }
-
if (from > vma->vm_start) {
- vma_iter_clear(vmi, from, vma->vm_end);
+ if (vma_iter_clear_gfp(vmi, from, vma->vm_end, GFP_KERNEL))
+ return -ENOMEM;
vma->vm_end = from;
} else {
- vma_iter_clear(vmi, vma->vm_start, to);
+ if (vma_iter_clear_gfp(vmi, vma->vm_start, to, GFP_KERNEL))
+ return -ENOMEM;
vma->vm_start = to;
}
--
2.39.2
By delaying the setting of prev/next VMA until after the write of NULL,
the probability of the prev/next VMA already being in the CPU cache is
significantly increased, especially for larger munmap operations. It
also means that prev/next will be loaded closer to when they are used.
This requires changing the loop type when gathering the VMAs that will
be freed.
Since prev will be set later in the function, it is better to reverse
the splitting direction of the start VMA (modify the new_below argument
to __split_vma).
Using vma_iter_prev_range() to walk back to the correct location in the
tree will, for the most part, mean walking within the CPU cache.
Usually, this is two steps vs a node reset and a tree re-walk.
Signed-off-by: Liam R. Howlett <[email protected]>
---
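Condensed, the reordered tail of do_vmi_align_munmap() looks roughly
like this (a sketch; the hunks below are authoritative):

	/* Walk back to start; usually a couple of steps within cached nodes */
	while (vma_iter_addr(vmi) > start)
		vma_iter_prev_range(vmi);

	error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
	if (error)
		goto clear_tree_failed;

	/* Only now load prev/next, when they are about to be used */
	prev = vma_iter_prev_range(vmi);
	next = vma_next(vmi);
	if (next)
		vma_iter_prev_range(vmi);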
mm/mmap.c | 27 +++++++++++++--------------
1 file changed, 13 insertions(+), 14 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index 58f7b7038e4c..f11c0d663deb 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2451,20 +2451,17 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
goto map_count_exceeded;
- error = __split_vma(vmi, vma, start, 0);
+ error = __split_vma(vmi, vma, start, 1);
if (error)
goto start_split_failed;
-
- vma = vma_iter_load(vmi);
}
- prev = vma_prev(vmi);
-
/*
* Detach a range of VMAs from the mm. Using next as a temp variable as
* it is always overwritten.
*/
- for_each_vma_range(*vmi, next, end) {
+ next = vma;
+ do {
/* Does it split the end? */
if (next->vm_end > end) {
error = __split_vma(vmi, next, end, 0);
@@ -2500,13 +2497,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
BUG_ON(next->vm_start < start);
BUG_ON(next->vm_start > end);
#endif
- }
-
- if (vma_iter_end(vmi) > end)
- next = vma_iter_load(vmi);
-
- if (!next)
- next = vma_next(vmi);
+ } for_each_vma_range(*vmi, next, end);
#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
/* Make sure no VMAs are about to be lost. */
@@ -2527,7 +2518,10 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
BUG_ON(count != test_count);
}
#endif
- vma_iter_set(vmi, start);
+
+ while (vma_iter_addr(vmi) > start)
+ vma_iter_prev_range(vmi);
+
error = vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL);
if (error)
goto clear_tree_failed;
@@ -2538,6 +2532,11 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (unlock)
mmap_write_downgrade(mm);
+ prev = vma_iter_prev_range(vmi);
+ next = vma_next(vmi);
+ if (next)
+ vma_iter_prev_range(vmi);
+
/*
* We can free page tables without write-locking mmap_lock because VMAs
* were isolated before we downgraded mmap_lock.
--
2.39.2
mas_preallocate() may walk partially down the tree before finding that
a split or spanning store is needed. When the write occurs, relax the
logic on resetting the walk so that partial walks will not restart, but
walks that have gone too far (a store that reaches beyond the current
node) are restarted.
Signed-off-by: Liam R. Howlett <[email protected]>
---
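The relaxed rule, condensed from the diff below (illustration only,
omitting the NULL-store leaf edge case handled in the hunk):

	if (mas_is_paused(mas) || mas_is_none(mas))
		mas_reset(mas);		/* stale state: restart the walk */
	else if (mas->last > mas->max)
		mas_reset(mas);		/* write spans past this node: re-walk */
	/* otherwise keep the partial walk left by mas_preallocate() */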
lib/maple_tree.c | 37 ++++++++++++++++++++++++++-----------
1 file changed, 26 insertions(+), 11 deletions(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 4a111785360f..a3d602cfd030 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5424,19 +5424,34 @@ static inline void mte_destroy_walk(struct maple_enode *enode,
static void mas_wr_store_setup(struct ma_wr_state *wr_mas)
{
+ if (mas_is_start(wr_mas->mas))
+ return;
+
if (unlikely(mas_is_paused(wr_mas->mas)))
- mas_reset(wr_mas->mas);
+ goto reset;
- if (!mas_is_start(wr_mas->mas)) {
- if (mas_is_none(wr_mas->mas)) {
- mas_reset(wr_mas->mas);
- } else {
- wr_mas->r_max = wr_mas->mas->max;
- wr_mas->type = mte_node_type(wr_mas->mas->node);
- if (mas_is_span_wr(wr_mas))
- mas_reset(wr_mas->mas);
- }
- }
+ if (unlikely(mas_is_none(wr_mas->mas)))
+ goto reset;
+
+ /*
+ * A less strict version of mas_is_span_wr() where we allow spanning
+ * writes within this node. This is to stop partial walks in
+ * mas_prealloc() from being reset.
+ */
+ if (wr_mas->mas->last > wr_mas->mas->max)
+ goto reset;
+
+ if (wr_mas->entry)
+ return;
+
+ if (mte_is_leaf(wr_mas->mas->node) &&
+ wr_mas->mas->last == wr_mas->mas->max)
+ goto reset;
+
+ return;
+
+reset:
+ mas_reset(wr_mas->mas);
}
/* Interface */
--
2.39.2
@akpm
On Mon, Jul 24, 2023 at 8:31 PM Liam R. Howlett <[email protected]> wrote:
> Since prev will be set later in the function, it is better to reverse
> the splitting direction of the start VMA (modify the new_below argument
> to __split_vma).
It might be a good idea to reorder "mm: always lock new vma before
inserting into vma tree" before this patch.
If you apply this patch without "mm: always lock new vma before
inserting into vma tree", I think move_vma(), when called with a start
address in the middle of a VMA, will behave like this:
- vma_start_write() [lock the VMA to be moved]
- move_page_tables() [moves page table entries]
- do_vmi_munmap()
- do_vmi_align_munmap()
- __split_vma()
- creates a new VMA **covering the moved range** that is **not locked**
- stores the new VMA in the VMA tree **without locking it** [1]
- new VMA is locked and removed again [2]
[...]
So after the page tables in the region have already been moved, I
believe there will be a brief window (between [1] and [2]) where page
faults in the region can happen again, which could probably cause new
page tables and PTEs to be created in the region again in that window.
(This can't happen in Linus' current tree because the new VMA created
by __split_vma() only covers the range that is not being moved.)
Though I guess that's not going to lead to anything bad, since
do_vmi_munmap() anyway cleans up PTEs and page tables in the region?
So maybe it's not that important.
On Mon, 14 Aug 2023 17:43:39 +0200 Jann Horn <[email protected]> wrote:
> @akpm
>
> On Mon, Jul 24, 2023 at 8:31 PM Liam R. Howlett <[email protected]> wrote:
> > Since prev will be set later in the function, it is better to reverse
> > the splitting direction of the start VMA (modify the new_below argument
> > to __split_vma).
>
> It might be a good idea to reorder "mm: always lock new vma before
> inserting into vma tree" before this patch.
>
> If you apply this patch without "mm: always lock new vma before
> inserting into vma tree", I think move_vma(), when called with a start
> address in the middle of a VMA, will behave like this:
>
> - vma_start_write() [lock the VMA to be moved]
> - move_page_tables() [moves page table entries]
> - do_vmi_munmap()
> - do_vmi_align_munmap()
> - __split_vma()
> - creates a new VMA **covering the moved range** that is **not locked**
> - stores the new VMA in the VMA tree **without locking it** [1]
> - new VMA is locked and removed again [2]
> [...]
>
> So after the page tables in the region have already been moved, I
> believe there will be a brief window (between [1] and [2]) where page
> faults in the region can happen again, which could probably cause new
> page tables and PTEs to be created in the region again in that window.
> (This can't happen in Linus' current tree because the new VMA created
> by __split_vma() only covers the range that is not being moved.)
>
> Though I guess that's not going to lead to anything bad, since
> do_vmi_munmap() anyway cleans up PTEs and page tables in the region?
> So maybe it's not that important.
Thanks. I'd of course prefer not to rebuild mm-stable. If this ends
up being a hard-to-hit issue during git-bisect searches, I think we can
live with that.
* Jann Horn <[email protected]> [230814 11:44]:
> @akpm
>
> On Mon, Jul 24, 2023 at 8:31 PM Liam R. Howlett <[email protected]> wrote:
> > Since prev will be set later in the function, it is better to reverse
> > the splitting direction of the start VMA (modify the new_below argument
> > to __split_vma).
>
> It might be a good idea to reorder "mm: always lock new vma before
> inserting into vma tree" before this patch.
>
> If you apply this patch without "mm: always lock new vma before
> inserting into vma tree", I think move_vma(), when called with a start
> address in the middle of a VMA, will behave like this:
>
> - vma_start_write() [lock the VMA to be moved]
> - move_page_tables() [moves page table entries]
> - do_vmi_munmap()
> - do_vmi_align_munmap()
> - __split_vma()
> - creates a new VMA **covering the moved range** that is **not locked**
> - stores the new VMA in the VMA tree **without locking it** [1]
> - new VMA is locked and removed again [2]
> [...]
>
> So after the page tables in the region have already been moved, I
> believe there will be a brief window (between [1] and [2]) where page
> faults in the region can happen again, which could probably cause new
> page tables and PTEs to be created in the region again in that window.
> (This can't happen in Linus' current tree because the new VMA created
> by __split_vma() only covers the range that is not being moved.)
Ah, so my reversing of which VMA to keep to the first split call opens a
window where the VMA being removed is not locked. Good catch.
>
> Though I guess that's not going to lead to anything bad, since
> do_vmi_munmap() anyway cleans up PTEs and page tables in the region?
> So maybe it's not that important.
do_vmi_munmap() will clean up PTEs from the end of the previous VMA to
the start of the next (even over areas without VMA coverage, because
there are platforms that need this, apparently).
I don't have any objections to the ordering, nor do I see an issue
resulting from having it this way... except for maybe lockdep, so maybe
we should change the ordering of the patch sets just to be safe?
In fact, should we add another check somewhere to ensure we do generate
the warning? Perhaps to remove_mt() to avoid the exit path hitting it?
Thanks,
Liam