2023-06-01 02:18:39

by Liam R. Howlett

Subject: [PATCH 00/14] Reduce preallocations for maple tree

Initial work on preallocations showed no performance regression during
testing, but recently some users (both on [1] and off [android] list)
have reported that preallocating the worst-case number of nodes has
caused a slowdown. This patch set addresses the number of allocations
in a few ways.

Most munmap() operations remove a single VMA, so leverage the fact that
the maple tree can place a single pointer at range 0 - 0 without
allocating. This is done by changing the index used in the 'sidetree'.
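
For illustration, the side tree store in do_vmi_align_munmap() then
takes roughly the following shape (a sketch condensed from patch 4, not
verbatim):

	/* Detached VMAs are stored by count rather than by address range,
	 * so the common single-VMA munmap() writes a root pointer at
	 * index 0 and needs no node allocation.
	 */
	mas_set(&mas_detach, count);
	error = munmap_sidetree(next, &mas_detach);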

Re-introduce the entry argument to mas_preallocate() so that a more
intelligent guess of the node count can be made.

Patches are in the following order:
0001-0002: Testing framework for benchmarking some operations
0003-0004: Reduction of maple node allocation in sidetree
0005: Small cleanup of do_vmi_align_munmap()
0006-0013: mas_preallocate() calculation change
0014: Change the vma iterator order

[1] https://lore.kernel.org/linux-mm/[email protected]/

Liam R. Howlett (14):
maple_tree: Add benchmarking for mas_for_each
maple_tree: Add benchmarking for mas_prev()
mm: Move unmap_vmas() declaration to internal header
mm: Change do_vmi_align_munmap() side tree index
mm: Remove prev check from do_vmi_align_munmap()
maple_tree: Introduce __mas_set_range()
mm: Remove re-walk from mmap_region()
maple_tree: Re-introduce entry to mas_preallocate() arguments
mm: Use vma_iter_clear_gfp() in nommu
mm: Set up vma iterator for vma_iter_prealloc() calls
maple_tree: Move mas_wr_end_piv() below mas_wr_extend_null()
maple_tree: Update mas_preallocate() testing
maple_tree: Refine mas_preallocate() node calculations
mm/mmap: Change vma iteration order in do_vmi_align_munmap()

fs/exec.c | 1 +
include/linux/maple_tree.h | 23 ++++-
include/linux/mm.h | 4 -
lib/maple_tree.c | 78 ++++++++++----
lib/test_maple_tree.c | 74 +++++++++++++
mm/internal.h | 40 ++++++--
mm/memory.c | 16 ++-
mm/mmap.c | 171 ++++++++++++++++---------------
mm/nommu.c | 45 ++++----
tools/testing/radix-tree/maple.c | 59 ++++++-----
10 files changed, 331 insertions(+), 180 deletions(-)

--
2.39.2



2023-06-01 02:18:40

by Liam R. Howlett

Subject: [PATCH 08/14] maple_tree: Re-introduce entry to mas_preallocate() arguments

The current preallocation strategy is to preallocate for the absolute
worst-case tree modification. The entry (or NULL) being written is
needed to determine how many nodes are actually required for the store.
Start by adding the argument to the mas_preallocate() definition.

Signed-off-by: Liam R. Howlett <[email protected]>
---
include/linux/maple_tree.h | 2 +-
lib/maple_tree.c | 3 ++-
mm/internal.h | 2 +-
mm/mmap.c | 4 ++--
tools/testing/radix-tree/maple.c | 32 ++++++++++++++++----------------
5 files changed, 22 insertions(+), 21 deletions(-)

diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index ed50d8f459e6..e18ecbefc7f7 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -458,7 +458,7 @@ void *mas_find(struct ma_state *mas, unsigned long max);
void *mas_find_range(struct ma_state *mas, unsigned long max);
void *mas_find_rev(struct ma_state *mas, unsigned long min);
void *mas_find_range_rev(struct ma_state *mas, unsigned long max);
-int mas_preallocate(struct ma_state *mas, gfp_t gfp);
+int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp);
bool mas_is_err(struct ma_state *mas);

bool mas_nomem(struct ma_state *mas, gfp_t gfp);
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index bfffbb7cab26..34eccddb0b47 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5535,11 +5535,12 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
/**
* mas_preallocate() - Preallocate enough nodes for a store operation
* @mas: The maple state
+ * @entry: The entry that will be stored
* @gfp: The GFP_FLAGS to use for allocations.
*
* Return: 0 on success, -ENOMEM if memory could not be allocated.
*/
-int mas_preallocate(struct ma_state *mas, gfp_t gfp)
+int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
{
int ret;

diff --git a/mm/internal.h b/mm/internal.h
index cdf06f680d6e..2691deca9699 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1047,7 +1047,7 @@ static inline void vma_iter_config(struct vma_iterator *vmi,
*/
static inline int vma_iter_prealloc(struct vma_iterator *vmi)
{
- return mas_preallocate(&vmi->mas, GFP_KERNEL);
+ return mas_preallocate(&vmi->mas, NULL, GFP_KERNEL);
}

static inline void vma_iter_clear(struct vma_iterator *vmi,
diff --git a/mm/mmap.c b/mm/mmap.c
index 8b3e58d6ac40..75b2a86e1faa 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1982,7 +1982,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
/* Check that both stack segments have the same anon_vma? */
}

- if (mas_preallocate(&mas, GFP_KERNEL))
+ if (mas_preallocate(&mas, vma, GFP_KERNEL))
return -ENOMEM;

/* We must make sure the anon_vma is allocated. */
@@ -2064,7 +2064,7 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
return -ENOMEM;
}

- if (mas_preallocate(&mas, GFP_KERNEL))
+ if (mas_preallocate(&mas, vma, GFP_KERNEL))
return -ENOMEM;

/* We must make sure the anon_vma is allocated. */
diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index 03539d86cdf0..cfadc4b75d51 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -35383,7 +35383,7 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
for (i = 0; i <= max; i++)
mtree_test_store_range(mt, i * 10, i * 10 + 5, &i);

- MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0);
+ MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
MT_BUG_ON(mt, allocated == 0);
@@ -35392,18 +35392,18 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
allocated = mas_allocated(&mas);
MT_BUG_ON(mt, allocated != 0);

- MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0);
+ MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
MT_BUG_ON(mt, allocated == 0);
MT_BUG_ON(mt, allocated != 1 + height * 3);
- MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0);
+ MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
mas_destroy(&mas);
allocated = mas_allocated(&mas);
MT_BUG_ON(mt, allocated != 0);


- MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0);
+ MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
MT_BUG_ON(mt, allocated == 0);
@@ -35412,26 +35412,26 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1);
mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);
- MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0);
+ MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
mas_destroy(&mas);
allocated = mas_allocated(&mas);
MT_BUG_ON(mt, allocated != 0);

- MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0);
+ MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
MT_BUG_ON(mt, allocated == 0);
MT_BUG_ON(mt, allocated != 1 + height * 3);
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1);
- MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0);
+ MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
mas_destroy(&mas);
allocated = mas_allocated(&mas);
MT_BUG_ON(mt, allocated != 0);
mn->parent = ma_parent_ptr(mn);
ma_free_rcu(mn);

- MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0);
+ MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
MT_BUG_ON(mt, allocated == 0);
@@ -35440,12 +35440,12 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1);
mas_push_node(&mas, mn);
MT_BUG_ON(mt, mas_allocated(&mas) != allocated);
- MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0);
+ MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
mas_destroy(&mas);
allocated = mas_allocated(&mas);
MT_BUG_ON(mt, allocated != 0);

- MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0);
+ MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
MT_BUG_ON(mt, allocated == 0);
@@ -35453,21 +35453,21 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
mas_store_prealloc(&mas, ptr);
MT_BUG_ON(mt, mas_allocated(&mas) != 0);

- MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0);
+ MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
MT_BUG_ON(mt, allocated == 0);
MT_BUG_ON(mt, allocated != 1 + height * 3);
mas_store_prealloc(&mas, ptr);
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
- MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0);
+ MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
MT_BUG_ON(mt, allocated == 0);
MT_BUG_ON(mt, allocated != 1 + height * 3);
mas_store_prealloc(&mas, ptr);

- MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0);
+ MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
MT_BUG_ON(mt, allocated == 0);
@@ -35475,14 +35475,14 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
mas_store_prealloc(&mas, ptr);
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
mt_set_non_kernel(1);
- MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL & GFP_NOWAIT) == 0);
+ MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL & GFP_NOWAIT) == 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
MT_BUG_ON(mt, allocated != 0);
mas_destroy(&mas);


- MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL) != 0);
+ MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
MT_BUG_ON(mt, allocated == 0);
@@ -35490,7 +35490,7 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
mas_store_prealloc(&mas, ptr);
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
mt_set_non_kernel(1);
- MT_BUG_ON(mt, mas_preallocate(&mas, GFP_KERNEL & GFP_NOWAIT) == 0);
+ MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL & GFP_NOWAIT) == 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
MT_BUG_ON(mt, allocated != 0);
--
2.39.2


2023-06-01 02:18:41

by Liam R. Howlett

Subject: [PATCH 11/14] maple_tree: Move mas_wr_end_piv() below mas_wr_extend_null()

Relocate mas_wr_end_piv() below mas_wr_extend_null(). Extending the
NULL may affect the end pivot value, so call mas_wr_extend_null() from
within mas_wr_end_piv() to keep the two operations together.

Signed-off-by: Liam R. Howlett <[email protected]>
---
lib/maple_tree.c | 31 +++++++++++++++----------------
1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 34eccddb0b47..adf662bc413e 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -4197,18 +4197,6 @@ static inline bool mas_wr_slot_store(struct ma_wr_state *wr_mas)
return true;
}

-static inline void mas_wr_end_piv(struct ma_wr_state *wr_mas)
-{
- while ((wr_mas->offset_end < wr_mas->node_end) &&
- (wr_mas->mas->last > wr_mas->pivots[wr_mas->offset_end]))
- wr_mas->offset_end++;
-
- if (wr_mas->offset_end < wr_mas->node_end)
- wr_mas->end_piv = wr_mas->pivots[wr_mas->offset_end];
- else
- wr_mas->end_piv = wr_mas->mas->max;
-}
-
static inline void mas_wr_extend_null(struct ma_wr_state *wr_mas)
{
struct ma_state *mas = wr_mas->mas;
@@ -4245,6 +4233,21 @@ static inline void mas_wr_extend_null(struct ma_wr_state *wr_mas)
}
}

+static inline void mas_wr_end_piv(struct ma_wr_state *wr_mas)
+{
+ while ((wr_mas->offset_end < wr_mas->node_end) &&
+ (wr_mas->mas->last > wr_mas->pivots[wr_mas->offset_end]))
+ wr_mas->offset_end++;
+
+ if (wr_mas->offset_end < wr_mas->node_end)
+ wr_mas->end_piv = wr_mas->pivots[wr_mas->offset_end];
+ else
+ wr_mas->end_piv = wr_mas->mas->max;
+
+ if (!wr_mas->entry)
+ mas_wr_extend_null(wr_mas);
+}
+
static inline unsigned char mas_wr_new_end(struct ma_wr_state *wr_mas)
{
struct ma_state *mas = wr_mas->mas;
@@ -4377,10 +4380,6 @@ static inline void *mas_wr_store_entry(struct ma_wr_state *wr_mas)

/* At this point, we are at the leaf node that needs to be altered. */
mas_wr_end_piv(wr_mas);
-
- if (!wr_mas->entry)
- mas_wr_extend_null(wr_mas);
-
/* New root for a single pointer */
if (unlikely(!mas->index && mas->last == ULONG_MAX)) {
mas_new_root(mas, wr_mas->entry);
--
2.39.2


2023-06-01 02:19:11

by Liam R. Howlett

Subject: [PATCH 10/14] mm: Set up vma iterator for vma_iter_prealloc() calls

Set the correct limits for vma_iter_prealloc() calls so that the maple
tree can be smarter about how many nodes are needed.
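
The common pattern after this change is to set the iterator to the
exact write range before preallocating, for example (a sketch of the
typical call site; see the diffs below for the real ones):

	vma_iter_config(&vmi, vma->vm_start, vma->vm_end);
	if (vma_iter_prealloc(&vmi, vma))
		return -ENOMEM;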

Signed-off-by: Liam R. Howlett <[email protected]>
---
fs/exec.c | 1 +
mm/internal.h | 18 +++++-------
mm/mmap.c | 81 +++++++++++++++++++++++++++++++--------------------
mm/nommu.c | 33 +++++++++------------
4 files changed, 72 insertions(+), 61 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 25c65b64544b..dc0ba74ebb74 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -697,6 +697,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
if (vma != vma_next(&vmi))
return -EFAULT;

+ vma_iter_prev_range(&vmi);
/*
* cover the whole range: [new_start, old_end)
*/
diff --git a/mm/internal.h b/mm/internal.h
index d78fd0fafa3b..531b2e95146c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1045,23 +1045,21 @@ static inline void vma_iter_config(struct vma_iterator *vmi,
/*
* VMA Iterator functions shared between nommu and mmap
*/
-static inline int vma_iter_prealloc(struct vma_iterator *vmi)
+static inline int vma_iter_prealloc(struct vma_iterator *vmi,
+ struct vm_area_struct *vma)
{
- return mas_preallocate(&vmi->mas, NULL, GFP_KERNEL);
+ return mas_preallocate(&vmi->mas, vma, GFP_KERNEL);
}

-static inline void vma_iter_clear(struct vma_iterator *vmi,
- unsigned long start, unsigned long end)
+static inline void vma_iter_clear(struct vma_iterator *vmi)
{
- mas_set_range(&vmi->mas, start, end - 1);
mas_store_prealloc(&vmi->mas, NULL);
}

static inline int vma_iter_clear_gfp(struct vma_iterator *vmi,
unsigned long start, unsigned long end, gfp_t gfp)
{
- vmi->mas.index = start;
- vmi->mas.last = end - 1;
+ __mas_set_range(&vmi->mas, start, end - 1);
mas_store_gfp(&vmi->mas, NULL, gfp);
if (unlikely(mas_is_err(&vmi->mas)))
return -ENOMEM;
@@ -1098,8 +1096,7 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
vma_iter_invalidate(vmi);

- vmi->mas.index = vma->vm_start;
- vmi->mas.last = vma->vm_end - 1;
+ __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
mas_store_prealloc(&vmi->mas, vma);
}

@@ -1110,8 +1107,7 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
vma_iter_invalidate(vmi);

- vmi->mas.index = vma->vm_start;
- vmi->mas.last = vma->vm_end - 1;
+ __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
mas_store_gfp(&vmi->mas, vma, gfp);
if (unlikely(mas_is_err(&vmi->mas)))
return -ENOMEM;
diff --git a/mm/mmap.c b/mm/mmap.c
index 22c71dff762b..eaebcc8f60d2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -411,7 +411,8 @@ static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
VMA_ITERATOR(vmi, mm, 0);
struct address_space *mapping = NULL;

- if (vma_iter_prealloc(&vmi))
+ vma_iter_config(&vmi, vma->vm_start, vma->vm_end);
+ if (vma_iter_prealloc(&vmi, vma))
return -ENOMEM;

if (vma->vm_file) {
@@ -664,19 +665,16 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
/* Only handles expanding */
VM_WARN_ON(vma->vm_start < start || vma->vm_end > end);

- if (vma_iter_prealloc(vmi))
+ /* Note: vma iterator must be pointing to 'start' */
+ vma_iter_config(vmi, start, end);
+ if (vma_iter_prealloc(vmi, vma))
goto nomem;

vma_prepare(&vp);
vma_adjust_trans_huge(vma, start, end, 0);
- /* VMA iterator points to previous, so set to start if necessary */
- if (vma_iter_addr(vmi) != start)
- vma_iter_set(vmi, start);
-
vma->vm_start = start;
vma->vm_end = end;
vma->vm_pgoff = pgoff;
- /* Note: mas must be pointing to the expanding VMA */
vma_iter_store(vmi, vma);

vma_complete(&vp, vmi, vma->vm_mm);
@@ -703,19 +701,19 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,

WARN_ON((vma->vm_start != start) && (vma->vm_end != end));

- if (vma_iter_prealloc(vmi))
+ if (vma->vm_start < start)
+ vma_iter_config(vmi, vma->vm_start, start);
+ else
+ vma_iter_config(vmi, end, vma->vm_end);
+
+ if (vma_iter_prealloc(vmi, NULL))
return -ENOMEM;

init_vma_prep(&vp, vma);
vma_prepare(&vp);
vma_adjust_trans_huge(vma, start, end, 0);

- if (vma->vm_start < start)
- vma_iter_clear(vmi, vma->vm_start, start);
-
- if (vma->vm_end > end)
- vma_iter_clear(vmi, end, vma->vm_end);
-
+ vma_iter_clear(vmi);
vma->vm_start = start;
vma->vm_end = end;
vma->vm_pgoff = pgoff;
@@ -991,7 +989,17 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
if (err)
return NULL;

- if (vma_iter_prealloc(vmi))
+ if (vma_start < vma->vm_start || vma_end > vma->vm_end)
+ vma_expanded = true;
+
+ if (vma_expanded) {
+ vma_iter_config(vmi, vma_start, vma_end);
+ } else {
+ vma_iter_config(vmi, adjust->vm_start + adj_start,
+ adjust->vm_end);
+ }
+
+ if (vma_iter_prealloc(vmi, vma))
return NULL;

init_multi_vma_prep(&vp, vma, adjust, remove, remove2);
@@ -1000,8 +1008,6 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,

vma_prepare(&vp);
vma_adjust_trans_huge(vma, vma_start, vma_end, adj_start);
- if (vma_start < vma->vm_start || vma_end > vma->vm_end)
- vma_expanded = true;

vma->vm_start = vma_start;
vma->vm_end = vma_end;
@@ -1945,7 +1951,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
struct vm_area_struct *next;
unsigned long gap_addr;
int error = 0;
- MA_STATE(mas, &mm->mm_mt, 0, 0);
+ MA_STATE(mas, &mm->mm_mt, vma->vm_start, address);

if (!(vma->vm_flags & VM_GROWSUP))
return -EFAULT;
@@ -1970,6 +1976,10 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
/* Check that both stack segments have the same anon_vma? */
}

+ if (next)
+ mas_prev_range(&mas, address);
+
+ __mas_set_range(&mas, vma->vm_start, address - 1);
if (mas_preallocate(&mas, vma, GFP_KERNEL))
return -ENOMEM;

@@ -2013,7 +2023,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
anon_vma_interval_tree_pre_update_vma(vma);
vma->vm_end = address;
/* Overwrite old entry in mtree. */
- mas_set_range(&mas, vma->vm_start, address - 1);
mas_store_prealloc(&mas, vma);
anon_vma_interval_tree_post_update_vma(vma);
spin_unlock(&mm->page_table_lock);
@@ -2052,6 +2061,10 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
return -ENOMEM;
}

+ if (prev)
+ mas_next_range(&mas, vma->vm_start);
+
+ __mas_set_range(&mas, address, vma->vm_end - 1);
if (mas_preallocate(&mas, vma, GFP_KERNEL))
return -ENOMEM;

@@ -2096,7 +2109,6 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
vma->vm_start = address;
vma->vm_pgoff -= grow;
/* Overwrite old entry in mtree. */
- mas_set_range(&mas, address, vma->vm_end - 1);
mas_store_prealloc(&mas, vma);
anon_vma_interval_tree_post_update_vma(vma);
spin_unlock(&mm->page_table_lock);
@@ -2255,10 +2267,6 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (!new)
return -ENOMEM;

- err = -ENOMEM;
- if (vma_iter_prealloc(vmi))
- goto out_free_vma;
-
if (new_below) {
new->vm_end = addr;
} else {
@@ -2266,6 +2274,11 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
new->vm_pgoff += ((addr - vma->vm_start) >> PAGE_SHIFT);
}

+ err = -ENOMEM;
+ vma_iter_config(vmi, new->vm_start, new->vm_end);
+ if (vma_iter_prealloc(vmi, new))
+ goto out_free_vma;
+
err = vma_dup_policy(vma, new);
if (err)
goto out_free_vmi;
@@ -2600,8 +2613,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,

next = vma_next(&vmi);
prev = vma_prev(&vmi);
- if (vm_flags & VM_SPECIAL)
+ if (vm_flags & VM_SPECIAL) {
+ if (prev)
+ vma_iter_next_range(&vmi);
+
goto cannot_expand;
+ }

/* Attempt to expand an old mapping */
/* Check next */
@@ -2611,6 +2628,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
merge_end = next->vm_end;
vma = next;
vm_pgoff = next->vm_pgoff - pglen;
+
}

/* Check prev */
@@ -2622,9 +2640,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
merge_start = prev->vm_start;
vma = prev;
vm_pgoff = prev->vm_pgoff;
+ } else if (prev) {
+ vma_iter_next_range(&vmi);
}

-
/* Actually expand, if possible */
if (vma &&
!vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
@@ -2633,9 +2652,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
}

cannot_expand:
- if (prev)
- vma_iter_next_range(&vmi);
-
/*
* Determine the object being mapped and call the appropriate
* specific mapper. the address has already been validated, but
@@ -2721,7 +2737,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
goto close_and_free_vma;

error = -ENOMEM;
- if (vma_iter_prealloc(&vmi))
+ if (vma_iter_prealloc(&vmi, vma))
goto close_and_free_vma;

if (vma->vm_file)
@@ -2994,7 +3010,8 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (vma && vma->vm_end == addr && !vma_policy(vma) &&
can_vma_merge_after(vma, flags, NULL, NULL,
addr >> PAGE_SHIFT, NULL_VM_UFFD_CTX, NULL)) {
- if (vma_iter_prealloc(vmi))
+ vma_iter_config(vmi, vma->vm_start, addr + len);
+ if (vma_iter_prealloc(vmi, vma))
goto unacct_fail;

init_vma_prep(&vp, vma);
@@ -3009,6 +3026,8 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
goto out;
}

+ if (vma)
+ vma_iter_next_range(vmi);
/* create a vma struct for an anonymous mapping */
vma = vm_area_alloc(mm);
if (!vma)
diff --git a/mm/nommu.c b/mm/nommu.c
index a764b86b132a..a96b889cc17e 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -583,7 +583,8 @@ static int delete_vma_from_mm(struct vm_area_struct *vma)
{
VMA_ITERATOR(vmi, vma->vm_mm, vma->vm_start);

- if (vma_iter_prealloc(&vmi)) {
+ vma_iter_config(&vmi, vma->vm_start, vma->vm_end);
+ if (vma_iter_prealloc(&vmi, vma)) {
pr_warn("Allocation of vma tree for process %d failed\n",
current->pid);
return -ENOMEM;
@@ -591,7 +592,7 @@ static int delete_vma_from_mm(struct vm_area_struct *vma)
cleanup_vma_from_mm(vma);

/* remove from the MM's tree and list */
- vma_iter_clear(&vmi, vma->vm_start, vma->vm_end);
+ vma_iter_clear(&vmi);
return 0;
}
/*
@@ -1041,9 +1042,6 @@ unsigned long do_mmap(struct file *file,
if (!vma)
goto error_getting_vma;

- if (vma_iter_prealloc(&vmi))
- goto error_vma_iter_prealloc;
-
region->vm_usage = 1;
region->vm_flags = vm_flags;
region->vm_pgoff = pgoff;
@@ -1185,6 +1183,10 @@ unsigned long do_mmap(struct file *file,

share:
BUG_ON(!vma->vm_region);
+ vma_iter_config(&vmi, vma->vm_start, vma->vm_end);
+ if (vma_iter_prealloc(&vmi, vma))
+ goto error_just_free;
+
setup_vma_to_mm(vma, current->mm);
current->mm->map_count++;
/* add the VMA to the tree */
@@ -1231,14 +1233,6 @@ unsigned long do_mmap(struct file *file,
len, current->pid);
show_free_areas(0, NULL);
return -ENOMEM;
-
-error_vma_iter_prealloc:
- kmem_cache_free(vm_region_jar, region);
- vm_area_free(vma);
- pr_warn("Allocation of vma tree for process %d failed\n", current->pid);
- show_free_areas(0, NULL);
- return -ENOMEM;
-
}

unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
@@ -1323,12 +1317,6 @@ int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (!new)
goto err_vma_dup;

- if (vma_iter_prealloc(vmi)) {
- pr_warn("Allocation of vma tree for process %d failed\n",
- current->pid);
- goto err_vmi_preallocate;
- }
-
/* most fields are the same, copy all, and then fixup */
*region = *vma->vm_region;
new->vm_region = region;
@@ -1342,6 +1330,13 @@ int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
region->vm_pgoff = new->vm_pgoff += npages;
}

+ vma_iter_config(vmi, new->vm_start, new->vm_end);
+ if (vma_iter_prealloc(vmi, vma)) {
+ pr_warn("Allocation of vma tree for process %d failed\n",
+ current->pid);
+ goto err_vmi_preallocate;
+ }
+
if (new->vm_ops && new->vm_ops->open)
new->vm_ops->open(new);

--
2.39.2


2023-06-01 02:22:18

by Liam R. Howlett

Subject: [PATCH 07/14] mm: Remove re-walk from mmap_region()

Using vma_iter_set() will reset the maple state and cause a re-walk of
the tree. Use vma_iter_config() to set the write to a sub-set of the
range instead. Change the file-backed case to also use
vma_iter_config() so that the end of the range is correctly set.
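
The difference at the call sites is roughly the following (sketch, not
verbatim):

	/* before: resets to MAS_START and forces a re-walk */
	vma_iter_set(&vmi, addr);

	/* after: keeps the current position and sets the write range */
	vma_iter_config(&vmi, addr, end);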

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/internal.h | 8 ++++++++
mm/mmap.c | 4 ++--
2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 24437f11d3c2..cdf06f680d6e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1034,6 +1034,14 @@ static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma)
return !(vma->vm_flags & VM_SOFTDIRTY);
}

+static inline void vma_iter_config(struct vma_iterator *vmi,
+ unsigned long index, unsigned long last)
+{
+ MAS_BUG_ON(&vmi->mas, vmi->mas.node != MAS_START &&
+ (vmi->mas.index > index || vmi->mas.last < index));
+ __mas_set_range(&vmi->mas, index, last - 1);
+}
+
/*
* VMA Iterator functions shared between nommu and mmap
*/
diff --git a/mm/mmap.c b/mm/mmap.c
index fd3f505b40cc..8b3e58d6ac40 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2659,7 +2659,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
goto unacct_error;
}

- vma_iter_set(&vmi, addr);
+ vma_iter_config(&vmi, addr, end);
vma->vm_start = addr;
vma->vm_end = end;
vm_flags_init(vma, vm_flags);
@@ -2686,7 +2686,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if (WARN_ON((addr != vma->vm_start)))
goto close_and_free_vma;

- vma_iter_set(&vmi, addr);
+ vma_iter_config(&vmi, addr, end);
/*
* If vm_flags changed after call_mmap(), we should try merge
* vma again as we may succeed this time.
--
2.39.2


2023-06-01 02:24:41

by Liam R. Howlett

Subject: [PATCH 01/14] maple_tree: Add benchmarking for mas_for_each

Add a way to test the speed of mas_for_each() to the testing code.

Signed-off-by: Liam R. Howlett <[email protected]>
---
lib/test_maple_tree.c | 38 ++++++++++++++++++++++++++++++++++++++
1 file changed, 38 insertions(+)

diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
index 9939be34e516..3dbf99c3f2b1 100644
--- a/lib/test_maple_tree.c
+++ b/lib/test_maple_tree.c
@@ -44,6 +44,7 @@ atomic_t maple_tree_tests_passed;
/* #define BENCH_WALK */
/* #define BENCH_MT_FOR_EACH */
/* #define BENCH_FORK */
+/* #define BENCH_MAS_FOR_EACH */

#ifdef __KERNEL__
#define mt_set_non_kernel(x) do {} while (0)
@@ -1705,6 +1706,36 @@ static noinline void __init bench_mt_for_each(struct maple_tree *mt)
}
#endif

+#if defined(BENCH_MAS_FOR_EACH)
+static noinline void __init bench_mas_for_each(struct maple_tree *mt)
+{
+ int i, count = 1000000;
+ unsigned long max = 2500;
+ void *entry;
+ MA_STATE(mas, mt, 0, 0);
+
+ for (i = 0; i < max; i += 5) {
+ int gap = 4;
+ if (i % 30 == 0)
+ gap = 3;
+ mtree_store_range(mt, i, i + gap, xa_mk_value(i), GFP_KERNEL);
+ }
+
+ rcu_read_lock();
+ for (i = 0; i < count; i++) {
+ unsigned long j = 0;
+
+ mas_for_each(&mas, entry, max) {
+ MT_BUG_ON(mt, entry != xa_mk_value(j));
+ j += 5;
+ }
+ mas_set(&mas, 0);
+ }
+ rcu_read_unlock();
+
+}
+#endif
+
/* check_forking - simulate the kernel forking sequence with the tree. */
static noinline void __init check_forking(struct maple_tree *mt)
{
@@ -3430,6 +3461,13 @@ static int __init maple_tree_seed(void)
mtree_destroy(&tree);
goto skip;
#endif
+#if defined(BENCH_MAS_FOR_EACH)
+#define BENCH
+ mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
+ bench_mas_for_each(&tree);
+ mtree_destroy(&tree);
+ goto skip;
+#endif

mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
check_iteration(&tree);
--
2.39.2


2023-06-01 02:26:11

by Liam R. Howlett

Subject: [PATCH 14/14] mm/mmap: Change vma iteration order in do_vmi_align_munmap()

By delaying the setting of prev/next VMA until after the write of NULL,
the probability of the prev/next VMA already being in the CPU cache is
significantly increased, especially for larger munmap operations. It
also means that prev/next will be loaded closer to when they are used.

As a consequence, the for_each() loop over the range must be changed to
a do {} ... for_each() construct when writing to the side tree.

Since prev will be set later in the function, it is better to reverse
the splitting direction of the start VMA (modify the new_below argument
to __split_vma).

Using vma_iter_prev_range() to walk back to the correct location in the
tree will, for the most part, mean walking within the CPU cache.
Usually, this is two steps versus a node reset and a full tree re-walk.
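
The detach loop then takes roughly this shape (a sketch condensed from
the diff below):

	next = vma;
	do {
		/* split at 'end' if needed, then move 'next' to the side tree */
	} for_each_vma_range(*vmi, next, end);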

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mmap.c | 27 ++++++++++++---------------
1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index eaebcc8f60d2..429e314bd134 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2395,20 +2395,17 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
goto map_count_exceeded;

- error = __split_vma(vmi, vma, start, 0);
+ error = __split_vma(vmi, vma, start, 1);
if (error)
goto start_split_failed;
-
- vma = vma_iter_load(vmi);
}

- prev = vma_prev(vmi);
-
/*
* Detach a range of VMAs from the mm. Using next as a temp variable as
* it is always overwritten.
*/
- for_each_vma_range(*vmi, next, end) {
+ next = vma;
+ do {
/* Does it split the end? */
if (next->vm_end > end) {
error = __split_vma(vmi, next, end, 0);
@@ -2440,13 +2437,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
BUG_ON(next->vm_start < start);
BUG_ON(next->vm_start > end);
#endif
- }
-
- if (vma_iter_end(vmi) > end)
- next = vma_iter_load(vmi);
-
- if (!next)
- next = vma_next(vmi);
+ } for_each_vma_range(*vmi, next, end);

#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
/* Make sure no VMAs are about to be lost. */
@@ -2467,12 +2458,18 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
BUG_ON(count != test_count);
}
#endif
- /* Point of no return */
- vma_iter_set(vmi, start);
+ while (vma_iter_addr(vmi) > start)
+ vma_iter_prev_range(vmi);
+
if (vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL))
return -ENOMEM;

mm->map_count -= count;
+ prev = vma_iter_prev_range(vmi);
+ next = vma_next(vmi);
+ if (next)
+ vma_iter_prev_range(vmi);
+
/*
* Do not downgrade mmap_lock if we are next to VM_GROWSDOWN or
* VM_GROWSUP VMA. Such VMAs can change their size under
--
2.39.2


2023-06-01 02:27:48

by Liam R. Howlett

Subject: [PATCH 05/14] mm: Remove prev check from do_vmi_align_munmap()

If prev does not exist, the vma iterator will be set to MAS_NONE, which
is treated as MAS_START when mas_next() or mas_find() is used. In this
case, the next caller will be the vma iterator, which uses mas_find()
under the hood and will now do what the user expects.

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mmap.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 8e5563668b18..fd3f505b40cc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2402,8 +2402,6 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
}

prev = vma_prev(vmi);
- if (unlikely((!prev)))
- vma_iter_set(vmi, start);

/*
* Detach a range of VMAs from the mm. Using next as a temp variable as
--
2.39.2


2023-06-01 02:29:31

by Liam R. Howlett

Subject: [PATCH 04/14] mm: Change do_vmi_align_munmap() side tree index

The majority of munmap() calls remove a single VMA. The maple tree can
store a single entry at index 0 as a pointer without any node
allocations. Change do_vmi_align_munmap() to store the VMAs being
munmap()'ed into a tree indexed by the count. This leverages the
ability to store the first entry without a node allocation.

Storing the entries by count rather than by the VMA start and end means
changing the functions which iterate over the entries. Update
unmap_vmas() and free_pgtables() to take a maple state and a tree end
address to support this functionality.

Passing through the same maple state to unmap_vmas() and free_pgtables()
means the state needs to be reset between calls. This happens in the
static unmap_region() and exit_mmap().
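
For example, unmap_region() now does roughly the following with the
shared state (sketch condensed from the diff below):

	unsigned long mt_start = mas->index;

	unmap_vmas(&tlb, mas, vma, start, end, tree_end, mm_wr_locked);
	mas_set(mas, mt_start);	/* reset before the second walk */
	free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
		      next ? next->vm_start : USER_PGTABLES_CEILING,
		      mm_wr_locked);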

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/internal.h | 8 ++++----
mm/memory.c | 16 +++++++---------
mm/mmap.c | 41 ++++++++++++++++++++++++-----------------
3 files changed, 35 insertions(+), 30 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 9b665c4e5fc0..24437f11d3c2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -103,7 +103,7 @@ bool __folio_end_writeback(struct folio *folio);
void deactivate_file_folio(struct folio *folio);
void folio_activate(struct folio *folio);

-void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
+void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
struct vm_area_struct *start_vma, unsigned long floor,
unsigned long ceiling, bool mm_wr_locked);
void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
@@ -1099,9 +1099,9 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
return 0;
}

-void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
- struct vm_area_struct *vma, unsigned long start_addr,
- unsigned long end_addr, bool mm_wr_locked);
+void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
+ struct vm_area_struct *start_vma, unsigned long start,
+ unsigned long end, unsigned long tree_end, bool mm_wr_locked);

/*
* VMA lock generalization
diff --git a/mm/memory.c b/mm/memory.c
index 8358f3b853f2..6737263e17e3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -360,12 +360,10 @@ void free_pgd_range(struct mmu_gather *tlb,
} while (pgd++, addr = next, addr != end);
}

-void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
+void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
struct vm_area_struct *vma, unsigned long floor,
unsigned long ceiling, bool mm_wr_locked)
{
- MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
-
do {
unsigned long addr = vma->vm_start;
struct vm_area_struct *next;
@@ -374,7 +372,7 @@ void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
* Note: USER_PGTABLES_CEILING may be passed as ceiling and may
* be 0. This will underflow and is okay.
*/
- next = mas_find(&mas, ceiling - 1);
+ next = mas_find(mas, ceiling - 1);

/*
* Hide vma from rmap and truncate_pagecache before freeing
@@ -395,7 +393,7 @@ void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
while (next && next->vm_start <= vma->vm_end + PMD_SIZE
&& !is_vm_hugetlb_page(next)) {
vma = next;
- next = mas_find(&mas, ceiling - 1);
+ next = mas_find(mas, ceiling - 1);
if (mm_wr_locked)
vma_start_write(vma);
unlink_anon_vmas(vma);
@@ -1700,9 +1698,10 @@ static void unmap_single_vma(struct mmu_gather *tlb,
* ensure that any thus-far unmapped pages are flushed before unmap_vmas()
* drops the lock and schedules.
*/
-void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
+void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
struct vm_area_struct *vma, unsigned long start_addr,
- unsigned long end_addr, bool mm_wr_locked)
+ unsigned long end_addr, unsigned long tree_end,
+ bool mm_wr_locked)
{
struct mmu_notifier_range range;
struct zap_details details = {
@@ -1710,7 +1709,6 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
/* Careful - we need to zap private pages too! */
.even_cows = true,
};
- MA_STATE(mas, mt, vma->vm_end, vma->vm_end);

mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma->vm_mm,
start_addr, end_addr);
@@ -1718,7 +1716,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
do {
unmap_single_vma(tlb, vma, start_addr, end_addr, &details,
mm_wr_locked);
- } while ((vma = mas_find(&mas, end_addr - 1)) != NULL);
+ } while ((vma = mas_find(mas, tree_end - 1)) != NULL);
mmu_notifier_invalidate_range_end(&range);
}

diff --git a/mm/mmap.c b/mm/mmap.c
index 1503a7bdb192..8e5563668b18 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -76,10 +76,10 @@ int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS;
static bool ignore_rlimit_data;
core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);

-static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
+static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
struct vm_area_struct *vma, struct vm_area_struct *prev,
struct vm_area_struct *next, unsigned long start,
- unsigned long end, bool mm_wr_locked);
+ unsigned long end, unsigned long tree_end, bool mm_wr_locked);

static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
{
@@ -2221,18 +2221,20 @@ static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas)
*
* Called with the mm semaphore held.
*/
-static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
+static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
struct vm_area_struct *vma, struct vm_area_struct *prev,
- struct vm_area_struct *next,
- unsigned long start, unsigned long end, bool mm_wr_locked)
+ struct vm_area_struct *next, unsigned long start,
+ unsigned long end, unsigned long tree_end, bool mm_wr_locked)
{
struct mmu_gather tlb;
+ unsigned long mt_start = mas->index;

lru_add_drain();
tlb_gather_mmu(&tlb, mm);
update_hiwater_rss(mm);
- unmap_vmas(&tlb, mt, vma, start, end, mm_wr_locked);
- free_pgtables(&tlb, mt, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
+ unmap_vmas(&tlb, mas, vma, start, end, tree_end, mm_wr_locked);
+ mas_set(mas, mt_start);
+ free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
next ? next->vm_start : USER_PGTABLES_CEILING,
mm_wr_locked);
tlb_finish_mmu(&tlb);
@@ -2338,7 +2340,6 @@ static inline int munmap_sidetree(struct vm_area_struct *vma,
struct ma_state *mas_detach)
{
vma_start_write(vma);
- mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
return -ENOMEM;

@@ -2415,6 +2416,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (error)
goto end_split_failed;
}
+ mas_set(&mas_detach, count);
error = munmap_sidetree(next, &mas_detach);
if (error)
goto munmap_sidetree_failed;
@@ -2450,17 +2452,17 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
/* Make sure no VMAs are about to be lost. */
{
- MA_STATE(test, &mt_detach, start, end - 1);
+ MA_STATE(test, &mt_detach, 0, 0);
struct vm_area_struct *vma_mas, *vma_test;
int test_count = 0;

vma_iter_set(vmi, start);
rcu_read_lock();
- vma_test = mas_find(&test, end - 1);
+ vma_test = mas_find(&test, count - 1);
for_each_vma_range(*vmi, vma_mas, end) {
BUG_ON(vma_mas != vma_test);
test_count++;
- vma_test = mas_next(&test, end - 1);
+ vma_test = mas_next(&test, count - 1);
}
rcu_read_unlock();
BUG_ON(count != test_count);
@@ -2490,9 +2492,11 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
* We can free page tables without write-locking mmap_lock because VMAs
* were isolated before we downgraded mmap_lock.
*/
- unmap_region(mm, &mt_detach, vma, prev, next, start, end, !downgrade);
+ mas_set(&mas_detach, 1);
+ unmap_region(mm, &mas_detach, vma, prev, next, start, end, count,
+ !downgrade);
/* Statistics and freeing VMAs */
- mas_set(&mas_detach, start);
+ mas_set(&mas_detach, 0);
remove_mt(mm, &mas_detach);
__mt_destroy(&mt_detach);

@@ -2800,9 +2804,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
fput(vma->vm_file);
vma->vm_file = NULL;

+ vma_iter_set(&vmi, vma->vm_end);
/* Undo any partial mapping done by a device driver. */
- unmap_region(mm, &mm->mm_mt, vma, prev, next, vma->vm_start,
- vma->vm_end, true);
+ unmap_region(mm, &vmi.mas, vma, prev, next, vma->vm_start,
+ vma->vm_end, vma->vm_end, true);
}
if (file && (vm_flags & VM_SHARED))
mapping_unmap_writable(file->f_mapping);
@@ -3131,7 +3136,7 @@ void exit_mmap(struct mm_struct *mm)
tlb_gather_mmu_fullmm(&tlb, mm);
/* update_hiwater_rss(mm) here? but nobody should be looking */
/* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
- unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX, false);
+ unmap_vmas(&tlb, &mas, vma, 0, ULONG_MAX, ULONG_MAX, false);
mmap_read_unlock(mm);

/*
@@ -3141,7 +3146,8 @@ void exit_mmap(struct mm_struct *mm)
set_bit(MMF_OOM_SKIP, &mm->flags);
mmap_write_lock(mm);
mt_clear_in_rcu(&mm->mm_mt);
- free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
+ mas_set(&mas, vma->vm_end);
+ free_pgtables(&tlb, &mas, vma, FIRST_USER_ADDRESS,
USER_PGTABLES_CEILING, true);
tlb_finish_mmu(&tlb);

@@ -3150,6 +3156,7 @@ void exit_mmap(struct mm_struct *mm)
* enabled, without holding any MM locks besides the unreachable
* mmap_write_lock.
*/
+ mas_set(&mas, vma->vm_end);
do {
if (vma->vm_flags & VM_ACCOUNT)
nr_accounted += vma_pages(vma);
--
2.39.2


2023-06-01 02:30:57

by Liam R. Howlett

Subject: [PATCH 03/14] mm: Move unmap_vmas() declaration to internal header

unmap_vmas() is not used outside of the core mm code. Move the
declaration to mm/internal.h.

Signed-off-by: Liam R. Howlett <[email protected]>
---
include/linux/mm.h | 4 ----
mm/internal.h | 4 ++++
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 66032f0d515c..209fe13dc8a2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2302,10 +2302,6 @@ static inline void zap_vma_pages(struct vm_area_struct *vma)
zap_page_range_single(vma, vma->vm_start,
vma->vm_end - vma->vm_start, NULL);
}
-void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
- struct vm_area_struct *start_vma, unsigned long start,
- unsigned long end, bool mm_wr_locked);
-
struct mmu_notifier_range;

void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
diff --git a/mm/internal.h b/mm/internal.h
index f45f5eb4514f..9b665c4e5fc0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1099,6 +1099,10 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
return 0;
}

+void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
+ struct vm_area_struct *vma, unsigned long start_addr,
+ unsigned long end_addr, bool mm_wr_locked);
+
/*
* VMA lock generalization
*/
--
2.39.2


2023-06-01 02:32:24

by Liam R. Howlett

Subject: [PATCH 13/14] maple_tree: Refine mas_preallocate() node calculations

Calculate the number of nodes based on the pending write action instead
of assuming the worst case.

This addresses a performance regression seen on platforms that have
longer allocation timing.
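
The intended usage is unchanged: preallocate for the pending write,
then store without a possible allocation failure (sketch of a typical
caller; 'index', 'last' and 'entry' are placeholders):

	mas_set_range(&mas, index, last);
	if (mas_preallocate(&mas, entry, GFP_KERNEL))
		return -ENOMEM;
	/* The store itself can no longer fail on allocation. */
	mas_store_prealloc(&mas, entry);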

Signed-off-by: Liam R. Howlett <[email protected]>
---
lib/maple_tree.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 43 insertions(+), 1 deletion(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index adf662bc413e..5ea211c3f186 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5541,9 +5541,51 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
*/
int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
{
+ MA_WR_STATE(wr_mas, mas, entry);
+ unsigned char node_size;
+ int request = 1;
int ret;

- mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp);
+
+ if (unlikely(!mas->index && mas->last == ULONG_MAX))
+ goto ask_now;
+
+ mas_wr_store_setup(&wr_mas);
+ wr_mas.content = mas_start(mas);
+ /* Root expand */
+ if (unlikely(mas_is_none(mas) || mas_is_ptr(mas)))
+ goto ask_now;
+
+ if (unlikely(!mas_wr_walk(&wr_mas))) {
+ /* Spanning store, use worst case for now */
+ request = 1 + mas_mt_height(mas) * 3;
+ goto ask_now;
+ }
+
+ /* At this point, we are at the leaf node that needs to be altered. */
+ /* Exact fit, no nodes needed. */
+ if (wr_mas.r_min == mas->index && wr_mas.r_max == mas->last)
+ return 0;
+
+ mas_wr_end_piv(&wr_mas);
+ node_size = mas_wr_new_end(&wr_mas);
+ if (node_size >= mt_slots[wr_mas.type]) {
+ /* Split, worst case for now. */
+ request = 1 + mas_mt_height(mas) * 2;
+ goto ask_now;
+ }
+
+ /* New root needs a single node */
+ if (unlikely(mte_is_root(mas->node)))
+ goto ask_now;
+
+ /* Potential spanning rebalance collapsing a node, use worst-case */
+ if (node_size - 1 <= mt_min_slots[wr_mas.type])
+ request = mas_mt_height(mas) * 2 - 1;
+
+ /* node store, slot store needs one node */
+ask_now:
+ mas_node_count_gfp(mas, request, gfp);
mas->mas_flags |= MA_STATE_PREALLOC;
if (likely(!mas_is_err(mas)))
return 0;
--
2.39.2


2023-06-01 02:33:23

by Liam R. Howlett

Subject: [PATCH 12/14] maple_tree: Update mas_preallocate() testing

Signed-off-by: Liam R. Howlett <[email protected]>
---
tools/testing/radix-tree/maple.c | 27 ++++++++++++++++-----------
1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index cfadc4b75d51..14d89ba186ee 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -35383,6 +35383,8 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
for (i = 0; i <= max; i++)
mtree_test_store_range(mt, i * 10, i * 10 + 5, &i);

+ /* Spanning store */
+ mas_set_range(&mas, 470, 500);
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
@@ -35406,7 +35408,6 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
- MT_BUG_ON(mt, allocated == 0);
MT_BUG_ON(mt, allocated != 1 + height * 3);
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1);
@@ -35420,7 +35421,6 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
- MT_BUG_ON(mt, allocated == 0);
MT_BUG_ON(mt, allocated != 1 + height * 3);
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1);
@@ -35434,7 +35434,6 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
- MT_BUG_ON(mt, allocated == 0);
MT_BUG_ON(mt, allocated != 1 + height * 3);
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1);
@@ -35448,33 +35447,37 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
- MT_BUG_ON(mt, allocated == 0);
MT_BUG_ON(mt, allocated != 1 + height * 3);
mas_store_prealloc(&mas, ptr);
MT_BUG_ON(mt, mas_allocated(&mas) != 0);

+ /* Slot store does not need allocations */
+ mas_set_range(&mas, 6, 9);
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
- height = mas_mt_height(&mas);
- MT_BUG_ON(mt, allocated == 0);
- MT_BUG_ON(mt, allocated != 1 + height * 3);
+ MT_BUG_ON(mt, allocated != 0);
mas_store_prealloc(&mas, ptr);
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
+
+ mas_set_range(&mas, 6, 10);
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
- MT_BUG_ON(mt, allocated == 0);
- MT_BUG_ON(mt, allocated != 1 + height * 3);
+ MT_BUG_ON(mt, allocated != 1);
mas_store_prealloc(&mas, ptr);
+ MT_BUG_ON(mt, mas_allocated(&mas) != 0);

+ /* Split */
+ mas_set_range(&mas, 54, 54);
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
- MT_BUG_ON(mt, allocated == 0);
- MT_BUG_ON(mt, allocated != 1 + height * 3);
+ MT_BUG_ON(mt, allocated != 1 + height * 2);
mas_store_prealloc(&mas, ptr);
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
mt_set_non_kernel(1);
+ /* Spanning store */
+ mas_set_range(&mas, 1, 100);
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL & GFP_NOWAIT) == 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
@@ -35482,6 +35485,7 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
mas_destroy(&mas);


+ /* Spanning store */
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
@@ -35489,6 +35493,7 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
MT_BUG_ON(mt, allocated != 1 + height * 3);
mas_store_prealloc(&mas, ptr);
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
+ mas_set_range(&mas, 0, 200);
mt_set_non_kernel(1);
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL & GFP_NOWAIT) == 0);
allocated = mas_allocated(&mas);
--
2.39.2


2023-06-01 02:36:36

by Liam R. Howlett

Subject: [PATCH 09/14] mm: Use vma_iter_clear_gfp() in nommu

Move the definition of vma_iter_clear_gfp() from mmap.c to internal.h so
it can be used in the nommu code. This will reduce node preallocations
in nommu.

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/internal.h | 12 ++++++++++++
mm/mmap.c | 12 ------------
mm/nommu.c | 12 ++++--------
3 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 2691deca9699..d78fd0fafa3b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1057,6 +1057,18 @@ static inline void vma_iter_clear(struct vma_iterator *vmi,
mas_store_prealloc(&vmi->mas, NULL);
}

+static inline int vma_iter_clear_gfp(struct vma_iterator *vmi,
+ unsigned long start, unsigned long end, gfp_t gfp)
+{
+ vmi->mas.index = start;
+ vmi->mas.last = end - 1;
+ mas_store_gfp(&vmi->mas, NULL, gfp);
+ if (unlikely(mas_is_err(&vmi->mas)))
+ return -ENOMEM;
+
+ return 0;
+}
+
static inline struct vm_area_struct *vma_iter_load(struct vma_iterator *vmi)
{
return mas_walk(&vmi->mas);
diff --git a/mm/mmap.c b/mm/mmap.c
index 75b2a86e1faa..22c71dff762b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -154,18 +154,6 @@ static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi,
return mas_prev(&vmi->mas, min);
}

-static inline int vma_iter_clear_gfp(struct vma_iterator *vmi,
- unsigned long start, unsigned long end, gfp_t gfp)
-{
- vmi->mas.index = start;
- vmi->mas.last = end - 1;
- mas_store_gfp(&vmi->mas, NULL, gfp);
- if (unlikely(mas_is_err(&vmi->mas)))
- return -ENOMEM;
-
- return 0;
-}
-
/*
* check_brk_limits() - Use platform specific check of range & verify mlock
* limits.
diff --git a/mm/nommu.c b/mm/nommu.c
index f670d9979a26..a764b86b132a 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1383,17 +1383,13 @@ static int vmi_shrink_vma(struct vma_iterator *vmi,

/* adjust the VMA's pointers, which may reposition it in the MM's tree
* and list */
- if (vma_iter_prealloc(vmi)) {
- pr_warn("Allocation of vma tree for process %d failed\n",
- current->pid);
- return -ENOMEM;
- }
-
if (from > vma->vm_start) {
- vma_iter_clear(vmi, from, vma->vm_end);
+ if (vma_iter_clear_gfp(vmi, from, vma->vm_end, GFP_KERNEL))
+ return -ENOMEM;
vma->vm_end = from;
} else {
- vma_iter_clear(vmi, vma->vm_start, to);
+ if (vma_iter_clear_gfp(vmi, vma->vm_start, to, GFP_KERNEL))
+ return -ENOMEM;
vma->vm_start = to;
}

--
2.39.2


2023-06-01 02:39:11

by Liam R. Howlett

Subject: [PATCH 02/14] maple_tree: Add benchmarking for mas_prev()

Signed-off-by: Liam R. Howlett <[email protected]>
---
lib/test_maple_tree.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)

diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
index 3dbf99c3f2b1..15d7b7bce7d6 100644
--- a/lib/test_maple_tree.c
+++ b/lib/test_maple_tree.c
@@ -45,6 +45,7 @@ atomic_t maple_tree_tests_passed;
/* #define BENCH_MT_FOR_EACH */
/* #define BENCH_FORK */
/* #define BENCH_MAS_FOR_EACH */
+/* #define BENCH_MAS_PREV */

#ifdef __KERNEL__
#define mt_set_non_kernel(x) do {} while (0)
@@ -1735,7 +1736,35 @@ static noinline void __init bench_mas_for_each(struct maple_tree *mt)

}
#endif
+#if defined(BENCH_MAS_PREV)
+static noinline void __init bench_mas_prev(struct maple_tree *mt)
+{
+ int i, count = 1000000;
+ unsigned long max = 2500;
+ void *entry;
+ MA_STATE(mas, mt, 0, 0);
+
+ for (i = 0; i < max; i += 5) {
+ int gap = 4;
+ if (i % 30 == 0)
+ gap = 3;
+ mtree_store_range(mt, i, i + gap, xa_mk_value(i), GFP_KERNEL);
+ }

+ rcu_read_lock();
+ for (i = 0; i < count; i++) {
+ unsigned long j = 2495;
+
+ mas_set(&mas, ULONG_MAX);
+ while ((entry = mas_prev(&mas, 0)) != NULL) {
+ MT_BUG_ON(mt, entry != xa_mk_value(j));
+ j -= 5;
+ }
+ }
+ rcu_read_unlock();
+
+}
+#endif
/* check_forking - simulate the kernel forking sequence with the tree. */
static noinline void __init check_forking(struct maple_tree *mt)
{
@@ -3468,6 +3497,13 @@ static int __init maple_tree_seed(void)
mtree_destroy(&tree);
goto skip;
#endif
+#if defined(BENCH_MAS_PREV)
+#define BENCH
+ mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
+ bench_mas_prev(&tree);
+ mtree_destroy(&tree);
+ goto skip;
+#endif

mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
check_iteration(&tree);
--
2.39.2


2023-06-01 02:40:33

by Liam R. Howlett

Subject: [PATCH 06/14] maple_tree: Introduce __mas_set_range()

mas_set_range() resets the node to MAS_START, which will cause a re-walk
of the tree to the range. This is unnecessary when the maple state is
already at the correct location of the write. Add a function that only
sets the range to avoid unnecessary re-walking of the tree.
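
A typical use, when the state already points at the range being
overwritten (sketch based on the vma iterator callers later in this
series):

	__mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
	mas_store_prealloc(&vmi->mas, vma);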

Signed-off-by: Liam R. Howlett <[email protected]>
---
include/linux/maple_tree.h | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index 295548cca8b3..ed50d8f459e6 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -531,6 +531,22 @@ static inline void mas_reset(struct ma_state *mas)
*/
#define mas_for_each(__mas, __entry, __max) \
while (((__entry) = mas_find((__mas), (__max))) != NULL)
+/**
+ * __mas_set_range() - Set up Maple Tree operation state to a sub-range of the
+ * current location.
+ * @mas: Maple Tree operation state.
+ * @start: New start of range in the Maple Tree.
+ * @last: New end of range in the Maple Tree.
+ *
+ * set the internal maple state values to a sub-range.
+ * Please use mas_set_range() if you do not know where you are in the tree.
+ */
+static inline void __mas_set_range(struct ma_state *mas, unsigned long start,
+ unsigned long last)
+{
+ mas->index = start;
+ mas->last = last;
+}

/**
* mas_set_range() - Set up Maple Tree operation state for a different index.
@@ -545,9 +561,8 @@ static inline void mas_reset(struct ma_state *mas)
static inline
void mas_set_range(struct ma_state *mas, unsigned long start, unsigned long last)
{
- mas->index = start;
- mas->last = last;
- mas->node = MAS_START;
+ __mas_set_range(mas, start, last);
+ mas->node = MAS_START;
}

/**
--
2.39.2


2023-06-02 08:22:33

by Yin, Fengwei

Subject: Re: [PATCH 00/14] Reduce preallocations for maple tree

Hi Liam,

On 6/1/2023 10:15 AM, Liam R. Howlett wrote:
> Initial work on preallocations showed no regression in performance
> during testing, but recently some users (both on [1] and off [android]
> list) have reported that preallocating the worst-case number of nodes
> has caused some slow down. This patch set addresses the number of
> allocations in a few ways.
>
> During munmap() most munmap() operations will remove a single VMA, so
> leverage the fact that the maple tree can place a single pointer at
> range 0 - 0 without allocating. This is done by changing the index in
> the 'sidetree'.
>
> Re-introduce the entry argument to mas_preallocate() so that a more
> intelligent guess of the node count can be made.
>
> Patches are in the following order:
> 0001-0002: Testing framework for benchmarking some operations
> 0003-0004: Reduction of maple node allocation in sidetree
> 0005: Small cleanup of do_vmi_align_munmap()
> 0006-0013: mas_preallocate() calculation change
> 0014: Change the vma iterator order
I ran the AIM:page_test on an IceLake 48C/96T + 192G RAM platform with
this patchset.

The result shows a slight improvement:
Base (next-20230602):
503880
Base with this patchset:
519501

But they are far from the none-regression result (commit 7be1c1a3c7b1):
718080


Some other information I collected:
With Base, the mas_alloc_nodes are always hit with request: 7.
With this patchset, the request are 1 or 5.
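(A request of 7 matches the old worst-case formula of 1 + mas_mt_height(mas) * 3
for a tree of height 2, so presumably the mm's maple tree is two levels deep
during this test.)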

I suppose this is the reason for the improvement from 503880 to 519501.

With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags() never triggered
a mas_alloc_nodes() call. Thanks.


Regards
Yin, Fengwei

>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>
> Liam R. Howlett (14):
> maple_tree: Add benchmarking for mas_for_each
> maple_tree: Add benchmarking for mas_prev()
> mm: Move unmap_vmas() declaration to internal header
> mm: Change do_vmi_align_munmap() side tree index
> mm: Remove prev check from do_vmi_align_munmap()
> maple_tree: Introduce __mas_set_range()
> mm: Remove re-walk from mmap_region()
> maple_tree: Re-introduce entry to mas_preallocate() arguments
> mm: Use vma_iter_clear_gfp() in nommu
> mm: Set up vma iterator for vma_iter_prealloc() calls
> maple_tree: Move mas_wr_end_piv() below mas_wr_extend_null()
> maple_tree: Update mas_preallocate() testing
> maple_tree: Refine mas_preallocate() node calculations
> mm/mmap: Change vma iteration order in do_vmi_align_munmap()
>
> fs/exec.c | 1 +
> include/linux/maple_tree.h | 23 ++++-
> include/linux/mm.h | 4 -
> lib/maple_tree.c | 78 ++++++++++----
> lib/test_maple_tree.c | 74 +++++++++++++
> mm/internal.h | 40 ++++++--
> mm/memory.c | 16 ++-
> mm/mmap.c | 171 ++++++++++++++++---------------
> mm/nommu.c | 45 ++++----
> tools/testing/radix-tree/maple.c | 59 ++++++-----
> 10 files changed, 331 insertions(+), 180 deletions(-)
>

2023-06-02 11:26:36

by Peng Zhang

[permalink] [raw]
Subject: Re: [PATCH 13/14] maple_tree: Refine mas_preallocate() node calculations



On 2023/6/1 10:16, Liam R. Howlett wrote:
> Calculate the number of nodes based on the pending write action instead
> of assuming the worst case.
>
> This addresses a performance regression introduced in platforms that
> have longer allocation timing.
>
> Signed-off-by: Liam R. Howlett <[email protected]>
> ---
> lib/maple_tree.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 43 insertions(+), 1 deletion(-)
>
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index adf662bc413e..5ea211c3f186 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -5541,9 +5541,51 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
> */
> int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
> {
> + MA_WR_STATE(wr_mas, mas, entry);
> + unsigned char node_size;
> + int request = 1;
If either mas_wr_append() or mas_wr_slot_store() succeeds,
or the write is a simple replacement, we don't need to make any allocations.
> int ret;
>
> - mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp);
> +
> + if (unlikely(!mas->index && mas->last == ULONG_MAX))
> + goto ask_now;
> +
> + mas_wr_store_setup(&wr_mas);
> + wr_mas.content = mas_start(mas);
> + /* Root expand */
> + if (unlikely(mas_is_none(mas) || mas_is_ptr(mas)))
> + goto ask_now;
> +
> + if (unlikely(!mas_wr_walk(&wr_mas))) {
> + /* Spanning store, use worst case for now */
> + request = 1 + mas_mt_height(mas) * 3;
> + goto ask_now;
> + }
> +
> + /* At this point, we are at the leaf node that needs to be altered. */
> + /* Exact fit, no nodes needed. */
> + if (wr_mas.r_min == mas->index && wr_mas.r_max == mas->last)
> + return 0;
> +
> + mas_wr_end_piv(&wr_mas);
> + node_size = mas_wr_new_end(&wr_mas);
> + if (node_size >= mt_slots[wr_mas.type]) {
> + /* Split, worst case for now. */
> + request = 1 + mas_mt_height(mas) * 2;
> + goto ask_now;
> + }
> +
> + /* New root needs a singe node */
> + if (unlikely(mte_is_root(mas->node)))
> + goto ask_now;
> +
> + /* Potential spanning rebalance collapsing a node, use worst-case */
> + if (node_size - 1 <= mt_min_slots[wr_mas.type])
> + request = mas_mt_height(mas) * 2 - 1;
> +
> + /* node store, slot store needs one node */
> +ask_now:
> + mas_node_count_gfp(mas, request, gfp);
> mas->mas_flags |= MA_STATE_PREALLOC;
> if (likely(!mas_is_err(mas)))
> return 0;

2023-06-02 19:06:28

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 00/14] Reduce preallocations for maple tree

* Yin, Fengwei <[email protected]> [230602 04:11]:
> Hi Liam,
>
> On 6/1/2023 10:15 AM, Liam R. Howlett wrote:
> > Initial work on preallocations showed no regression in performance
> > during testing, but recently some users (both on [1] and off [android]
> > list) have reported that preallocating the worst-case number of nodes
> > has caused some slow down. This patch set addresses the number of
> > allocations in a few ways.
> >
> > During munmap() most munmap() operations will remove a single VMA, so
> > leverage the fact that the maple tree can place a single pointer at
> > range 0 - 0 without allocating. This is done by changing the index in
> > the 'sidetree'.
> >
> > Re-introduce the entry argument to mas_preallocate() so that a more
> > intelligent guess of the node count can be made.
> >
> > Patches are in the following order:
> > 0001-0002: Testing framework for benchmarking some operations
> > 0003-0004: Reduction of maple node allocation in sidetree
> > 0005: Small cleanup of do_vmi_align_munmap()
> > 0006-0013: mas_preallocate() calculation change
> > 0014: Change the vma iterator order
> I did run The AIM:page_test on an IceLake 48C/96T + 192G RAM platform with
> this patchset.
>
> The result has a little bit improvement:
> Base (next-20230602):
> 503880
> Base with this patchset:
> 519501
>
> But they are far from the none-regression result (commit 7be1c1a3c7b1):
> 718080
>
>
> Some other information I collected:
> With Base, the mas_alloc_nodes are always hit with request: 7.
> With this patchset, the request are 1 or 5.
>
> I suppose this is the reason for improvement from 503880 to 519501.
>
> With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags never triggered
> mas_alloc_nodes() call. Thanks.

Thanks for retesting. I've not been able to see the regression myself.
Are you running in a VM of sorts? Android and some cloud VMs seem to
see this, but I do not see it in kvm or on the server I test on.

I am still looking to reduce/reverse the regression and a reproducer on
my end would help.

>
>
> Regards
> Yin, Fengwei
>
> >
> > [1] https://lore.kernel.org/linux-mm/[email protected]/
> >
> > Liam R. Howlett (14):
> > maple_tree: Add benchmarking for mas_for_each
> > maple_tree: Add benchmarking for mas_prev()
> > mm: Move unmap_vmas() declaration to internal header
> > mm: Change do_vmi_align_munmap() side tree index
> > mm: Remove prev check from do_vmi_align_munmap()
> > maple_tree: Introduce __mas_set_range()
> > mm: Remove re-walk from mmap_region()
> > maple_tree: Re-introduce entry to mas_preallocate() arguments
> > mm: Use vma_iter_clear_gfp() in nommu
> > mm: Set up vma iterator for vma_iter_prealloc() calls
> > maple_tree: Move mas_wr_end_piv() below mas_wr_extend_null()
> > maple_tree: Update mas_preallocate() testing
> > maple_tree: Refine mas_preallocate() node calculations
> > mm/mmap: Change vma iteration order in do_vmi_align_munmap()
> >
> > fs/exec.c | 1 +
> > include/linux/maple_tree.h | 23 ++++-
> > include/linux/mm.h | 4 -
> > lib/maple_tree.c | 78 ++++++++++----
> > lib/test_maple_tree.c | 74 +++++++++++++
> > mm/internal.h | 40 ++++++--
> > mm/memory.c | 16 ++-
> > mm/mmap.c | 171 ++++++++++++++++---------------
> > mm/nommu.c | 45 ++++----
> > tools/testing/radix-tree/maple.c | 59 ++++++-----
> > 10 files changed, 331 insertions(+), 180 deletions(-)
> >

2023-06-02 19:26:05

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 13/14] maple_tree: Refine mas_preallocate() node calculations

* Peng Zhang <[email protected]> [230602 06:37]:
>
>
> On 2023/6/1 10:16, Liam R. Howlett wrote:
> > Calculate the number of nodes based on the pending write action instead
> > of assuming the worst case.
> >
> > This addresses a performance regression introduced in platforms that
> > have longer allocation timing.
> >
> > Signed-off-by: Liam R. Howlett <[email protected]>
> > ---
> > lib/maple_tree.c | 44 +++++++++++++++++++++++++++++++++++++++++++-
> > 1 file changed, 43 insertions(+), 1 deletion(-)
> >
> > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > index adf662bc413e..5ea211c3f186 100644
> > --- a/lib/maple_tree.c
> > +++ b/lib/maple_tree.c
> > @@ -5541,9 +5541,51 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
> > */
> > int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
> > {
> > + MA_WR_STATE(wr_mas, mas, entry);
> > + unsigned char node_size;
> > + int request = 1;
> If either mas_wr_append() or mas_wr_slot_store() succeeds,
> or does a simple replacement, we don't need to make any allocations.

Thanks for looking at this.

See below

> > int ret;
> > - mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp);
> > +
> > + if (unlikely(!mas->index && mas->last == ULONG_MAX))
> > + goto ask_now;
> > +
> > + mas_wr_store_setup(&wr_mas);
> > + wr_mas.content = mas_start(mas);
> > + /* Root expand */
> > + if (unlikely(mas_is_none(mas) || mas_is_ptr(mas)))
> > + goto ask_now;
> > +
> > + if (unlikely(!mas_wr_walk(&wr_mas))) {
> > + /* Spanning store, use worst case for now */
> > + request = 1 + mas_mt_height(mas) * 3;
> > + goto ask_now;
> > + }
> > +
> > + /* At this point, we are at the leaf node that needs to be altered. */
> > + /* Exact fit, no nodes needed. */
> > + if (wr_mas.r_min == mas->index && wr_mas.r_max == mas->last)
> > + return 0;

The exact fit (simple replacement) case is covered here.

> > +
> > + mas_wr_end_piv(&wr_mas);
> > + node_size = mas_wr_new_end(&wr_mas);

> > + if (node_size >= mt_slots[wr_mas.type]) {
> > + /* Split, worst case for now. */
> > + request = 1 + mas_mt_height(mas) * 2;
> > + goto ask_now;
> > + }

I will look at adding the mas_wr_append() and mas_wr_slot_store()
checks as well. I suspect it will need to live around here.

I'm debating a function that returns the type of store and saves it in
mas (wr_mas does not live beyond this function). The mas store type
could be cleared unless mas_store_prealloc() is used. This would avoid
duplicating this work on the actual write operation.
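
Roughly the shape I have in mind (only a sketch, none of these names
exist yet - mas->store_type, mas_wr_store_type() and the enum are all
hypothetical):

	enum ma_store_type {
		wr_invalid = 0,		/* not classified, do the full setup */
		wr_exact_fit,		/* simple replacement, no allocation */
		wr_append,		/* mas_wr_append() path */
		wr_slot_store,		/* mas_wr_slot_store() path */
		wr_node_store,
		wr_split_store,
		wr_spanning_store,
	};

	/* mas_preallocate() would classify the write once ... */
	mas->store_type = mas_wr_store_type(&wr_mas);

	/*
	 * ... mas_store_prealloc() would act on mas->store_type instead of
	 * re-deciding, and every other store path would clear it back to
	 * wr_invalid so a stale classification is never trusted.
	 */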

> > +
> > + /* New root needs a singe node */
> > + if (unlikely(mte_is_root(mas->node)))
> > + goto ask_now;
> > +
> > + /* Potential spanning rebalance collapsing a node, use worst-case */
> > + if (node_size - 1 <= mt_min_slots[wr_mas.type])
> > + request = mas_mt_height(mas) * 2 - 1;
> > +
> > + /* node store, slot store needs one node */
> > +ask_now:
> > + mas_node_count_gfp(mas, request, gfp);
> > mas->mas_flags |= MA_STATE_PREALLOC;
> > if (likely(!mas_is_err(mas)))
> > return 0;

2023-06-04 12:14:29

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH 00/14] Reduce preallocations for maple tree

Hi Liam,

On 6/3/2023 2:55 AM, Liam R. Howlett wrote:
> * Yin, Fengwei <[email protected]> [230602 04:11]:
>> Hi Liam,
>>
>> On 6/1/2023 10:15 AM, Liam R. Howlett wrote:
>>> Initial work on preallocations showed no regression in performance
>>> during testing, but recently some users (both on [1] and off [android]
>>> list) have reported that preallocating the worst-case number of nodes
>>> has caused some slow down. This patch set addresses the number of
>>> allocations in a few ways.
>>>
>>> During munmap() most munmap() operations will remove a single VMA, so
>>> leverage the fact that the maple tree can place a single pointer at
>>> range 0 - 0 without allocating. This is done by changing the index in
>>> the 'sidetree'.
>>>
>>> Re-introduce the entry argument to mas_preallocate() so that a more
>>> intelligent guess of the node count can be made.
>>>
>>> Patches are in the following order:
>>> 0001-0002: Testing framework for benchmarking some operations
>>> 0003-0004: Reduction of maple node allocation in sidetree
>>> 0005: Small cleanup of do_vmi_align_munmap()
>>> 0006-0013: mas_preallocate() calculation change
>>> 0014: Change the vma iterator order
>> I did run The AIM:page_test on an IceLake 48C/96T + 192G RAM platform with
>> this patchset.
>>
>> The result has a little bit improvement:
>> Base (next-20230602):
>> 503880
>> Base with this patchset:
>> 519501
>>
>> But they are far from the none-regression result (commit 7be1c1a3c7b1):
>> 718080
>>
>>
>> Some other information I collected:
>> With Base, the mas_alloc_nodes are always hit with request: 7.
>> With this patchset, the request are 1 or 5.
>>
>> I suppose this is the reason for improvement from 503880 to 519501.
>>
>> With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags never triggered
>> mas_alloc_nodes() call. Thanks.
>
> Thanks for retesting. I've not been able to see the regression myself.
> Are you running in a VM of sorts? Android and some cloud VMs seem to
I didn't run it in a VM. I ran it in a native environment.

> see this, but I do not in kvm or the server I test on.
>
> I am still looking to reduce/reverse the regression and a reproducer on
> my end would help.

The test is the page_test of AIM9. You can get the AIM9 test suite from:
http://nchc.dl.sourceforge.net/project/aimbench/aim-suite9

After building it, you will see the singleuser app.

It needs a text file named s9workfile to define the test case. The s9workfile
I am using has the following content:

# @(#) s9workfile:1.2 1/22/96 00:00:00
# AIM Independent Resource Benchmark - Suite IX Workfile
FILESIZE: 5M
page_test

Then you can run the test with:
./singleuser -nl

It will ask some configuration questions and then run the real test.

One thing that needs to be taken care of is that create-clo.c has this line:
newbrk = sbrk(-4096 * 16);

It should be updated to:
intptr_t inc = -4096 * 16;
newbrk = sbrk(inc);

Otherwise, -4096 * 16 is treated as a 32-bit value and the call ends up
extending the brk to around 4G. If there is not enough RAM, the sbrk()
call will fail.
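
To illustrate why (this is my reading, not taken from the AIM9 sources):
-4096 * 16 is an int, -65536, bit pattern 0xffff0000. If sbrk() ends up
being called without a proper prototype in scope, that 32-bit value is
not converted to the 64-bit intptr_t parameter and the implementation
can see 0x00000000ffff0000, i.e. an increment of roughly 4G. Going
through an intptr_t variable always passes a full-width -65536:

	#include <stdint.h>
	#include <stdio.h>
	#include <unistd.h>

	void shrink_brk(void)
	{
		intptr_t inc = -4096 * 16;	/* full-width -65536 */

		if (sbrk(inc) == (void *)-1)	/* shrink the heap by 64K */
			perror("sbrk");
	}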

If you meet any issues running the test, just ping me. Thanks.


Regards
Yin, Fengwei

>
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>>>
>>> Liam R. Howlett (14):
>>> maple_tree: Add benchmarking for mas_for_each
>>> maple_tree: Add benchmarking for mas_prev()
>>> mm: Move unmap_vmas() declaration to internal header
>>> mm: Change do_vmi_align_munmap() side tree index
>>> mm: Remove prev check from do_vmi_align_munmap()
>>> maple_tree: Introduce __mas_set_range()
>>> mm: Remove re-walk from mmap_region()
>>> maple_tree: Re-introduce entry to mas_preallocate() arguments
>>> mm: Use vma_iter_clear_gfp() in nommu
>>> mm: Set up vma iterator for vma_iter_prealloc() calls
>>> maple_tree: Move mas_wr_end_piv() below mas_wr_extend_null()
>>> maple_tree: Update mas_preallocate() testing
>>> maple_tree: Refine mas_preallocate() node calculations
>>> mm/mmap: Change vma iteration order in do_vmi_align_munmap()
>>>
>>> fs/exec.c | 1 +
>>> include/linux/maple_tree.h | 23 ++++-
>>> include/linux/mm.h | 4 -
>>> lib/maple_tree.c | 78 ++++++++++----
>>> lib/test_maple_tree.c | 74 +++++++++++++
>>> mm/internal.h | 40 ++++++--
>>> mm/memory.c | 16 ++-
>>> mm/mmap.c | 171 ++++++++++++++++---------------
>>> mm/nommu.c | 45 ++++----
>>> tools/testing/radix-tree/maple.c | 59 ++++++-----
>>> 10 files changed, 331 insertions(+), 180 deletions(-)
>>>

2023-06-05 03:48:26

by Peng Zhang

[permalink] [raw]
Subject: Re: [PATCH 00/14] Reduce preallocations for maple tree



On 2023/6/2 16:10, Yin, Fengwei wrote:
> Hi Liam,
>
> On 6/1/2023 10:15 AM, Liam R. Howlett wrote:
>> Initial work on preallocations showed no regression in performance
>> during testing, but recently some users (both on [1] and off [android]
>> list) have reported that preallocating the worst-case number of nodes
>> has caused some slow down. This patch set addresses the number of
>> allocations in a few ways.
>>
>> During munmap() most munmap() operations will remove a single VMA, so
>> leverage the fact that the maple tree can place a single pointer at
>> range 0 - 0 without allocating. This is done by changing the index in
>> the 'sidetree'.
>>
>> Re-introduce the entry argument to mas_preallocate() so that a more
>> intelligent guess of the node count can be made.
>>
>> Patches are in the following order:
>> 0001-0002: Testing framework for benchmarking some operations
>> 0003-0004: Reduction of maple node allocation in sidetree
>> 0005: Small cleanup of do_vmi_align_munmap()
>> 0006-0013: mas_preallocate() calculation change
>> 0014: Change the vma iterator order
> I did run The AIM:page_test on an IceLake 48C/96T + 192G RAM platform with
> this patchset.
>
> The result has a little bit improvement:
> Base (next-20230602):
> 503880
> Base with this patchset:
> 519501
>
> But they are far from the none-regression result (commit 7be1c1a3c7b1):
> 718080
>
>
> Some other information I collected:
> With Base, the mas_alloc_nodes are always hit with request: 7.
> With this patchset, the request are 1 or 5.
>
> I suppose this is the reason for improvement from 503880 to 519501.
>
> With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags never triggered
> mas_alloc_nodes() call. Thanks.
Hi Fengwei,

I think it may be related to the inaccurate number of nodes allocated
by the pre-allocation. I slightly modified the pre-allocation in this
patchset, but I don't know if it helps. It would be great if you could
test it and help pinpoint the cause. Below is the diff, which can be
applied on top of this patchset.

Thanks,
Peng

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 5ea211c3f186..e67bf2744384 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5575,9 +5575,11 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
 		goto ask_now;
 	}
 
-	/* New root needs a singe node */
-	if (unlikely(mte_is_root(mas->node)))
-		goto ask_now;
+	if ((node_size == wr_mas.node_end + 1 &&
+	     mas->offset == wr_mas.node_end) ||
+	    (node_size == wr_mas.node_end &&
+	     wr_mas.offset_end - mas->offset == 1))
+		return 0;
 
 	/* Potential spanning rebalance collapsing a node, use worst-case */
 	if (node_size - 1 <= mt_min_slots[wr_mas.type])
@@ -5590,7 +5592,6 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
 	if (likely(!mas_is_err(mas)))
 		return 0;
 
-	mas_set_alloc_req(mas, 0);
 	ret = xa_err(mas->node);
 	mas_reset(mas);
 	mas_destroy(mas);
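
The two conditions are intended to catch the mas_wr_append() and
mas_wr_slot_store() cases, which should not need any new nodes.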


>
>
> Regards
> Yin, Fengwei
>
>>
>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>>
>> Liam R. Howlett (14):
>> maple_tree: Add benchmarking for mas_for_each
>> maple_tree: Add benchmarking for mas_prev()
>> mm: Move unmap_vmas() declaration to internal header
>> mm: Change do_vmi_align_munmap() side tree index
>> mm: Remove prev check from do_vmi_align_munmap()
>> maple_tree: Introduce __mas_set_range()
>> mm: Remove re-walk from mmap_region()
>> maple_tree: Re-introduce entry to mas_preallocate() arguments
>> mm: Use vma_iter_clear_gfp() in nommu
>> mm: Set up vma iterator for vma_iter_prealloc() calls
>> maple_tree: Move mas_wr_end_piv() below mas_wr_extend_null()
>> maple_tree: Update mas_preallocate() testing
>> maple_tree: Refine mas_preallocate() node calculations
>> mm/mmap: Change vma iteration order in do_vmi_align_munmap()
>>
>> fs/exec.c | 1 +
>> include/linux/maple_tree.h | 23 ++++-
>> include/linux/mm.h | 4 -
>> lib/maple_tree.c | 78 ++++++++++----
>> lib/test_maple_tree.c | 74 +++++++++++++
>> mm/internal.h | 40 ++++++--
>> mm/memory.c | 16 ++-
>> mm/mmap.c | 171 ++++++++++++++++---------------
>> mm/nommu.c | 45 ++++----
>> tools/testing/radix-tree/maple.c | 59 ++++++-----
>> 10 files changed, 331 insertions(+), 180 deletions(-)
>>

2023-06-05 05:01:03

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH 00/14] Reduce preallocations for maple tree

Hi Peng,

On 6/5/23 11:28, Peng Zhang wrote:
>
>
> On 2023/6/2 16:10, Yin, Fengwei wrote:
>> Hi Liam,
>>
>> On 6/1/2023 10:15 AM, Liam R. Howlett wrote:
>>> Initial work on preallocations showed no regression in performance
>>> during testing, but recently some users (both on [1] and off [android]
>>> list) have reported that preallocating the worst-case number of nodes
>>> has caused some slow down.  This patch set addresses the number of
>>> allocations in a few ways.
>>>
>>> During munmap() most munmap() operations will remove a single VMA, so
>>> leverage the fact that the maple tree can place a single pointer at
>>> range 0 - 0 without allocating.  This is done by changing the index in
>>> the 'sidetree'.
>>>
>>> Re-introduce the entry argument to mas_preallocate() so that a more
>>> intelligent guess of the node count can be made.
>>>
>>> Patches are in the following order:
>>> 0001-0002: Testing framework for benchmarking some operations
>>> 0003-0004: Reduction of maple node allocation in sidetree
>>> 0005:      Small cleanup of do_vmi_align_munmap()
>>> 0006-0013: mas_preallocate() calculation change
>>> 0014:      Change the vma iterator order
>> I did run The AIM:page_test on an IceLake 48C/96T + 192G RAM platform with
>> this patchset.
>>
>> The result has a little bit improvement:
>> Base (next-20230602):
>>    503880
>> Base with this patchset:
>>    519501
>>
>> But they are far from the none-regression result (commit 7be1c1a3c7b1):
>>    718080
>>
>>
>> Some other information I collected:
>> With Base, the mas_alloc_nodes are always hit with request: 7.
>> With this patchset, the request are 1 or 5.
>>
>> I suppose this is the reason for improvement from 503880 to 519501.
>>
>> With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags never triggered
>> mas_alloc_nodes() call. Thanks.
> Hi Fengwei,
>
> I think it may be related to the inaccurate number of nodes allocated
> in the pre-allocation. I slightly modified the pre-allocation in this
> patchset, but I don't know if it works. It would be great if you could
> help test it, and help pinpoint the cause. Below is the diff, which can
> be applied based on this pachset.
I tried the patch; it eliminates the mas_alloc_nodes() calls during
the test. But the benchmark result only improved a little bit:
529040

It is still much less than the non-regression result. I will also
double-check the non-regression result.


Regards
Yin, Fengwei

>
> Thanks,
> Peng
>
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 5ea211c3f186..e67bf2744384 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -5575,9 +5575,11 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>          goto ask_now;
>      }
>
> -    /* New root needs a singe node */
> -    if (unlikely(mte_is_root(mas->node)))
> -        goto ask_now;
> +    if ((node_size == wr_mas.node_end + 1 &&
> +         mas->offset == wr_mas.node_end) ||
> +        (node_size == wr_mas.node_end &&
> +         wr_mas.offset_end - mas->offset == 1))
> +        return 0;
>
>      /* Potential spanning rebalance collapsing a node, use worst-case */
>      if (node_size  - 1 <= mt_min_slots[wr_mas.type])
> @@ -5590,7 +5592,6 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>      if (likely(!mas_is_err(mas)))
>          return 0;
>
> -    mas_set_alloc_req(mas, 0);
>      ret = xa_err(mas->node);
>      mas_reset(mas);
>      mas_destroy(mas);
>
>
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>>>
>>> Liam R. Howlett (14):
>>>    maple_tree: Add benchmarking for mas_for_each
>>>    maple_tree: Add benchmarking for mas_prev()
>>>    mm: Move unmap_vmas() declaration to internal header
>>>    mm: Change do_vmi_align_munmap() side tree index
>>>    mm: Remove prev check from do_vmi_align_munmap()
>>>    maple_tree: Introduce __mas_set_range()
>>>    mm: Remove re-walk from mmap_region()
>>>    maple_tree: Re-introduce entry to mas_preallocate() arguments
>>>    mm: Use vma_iter_clear_gfp() in nommu
>>>    mm: Set up vma iterator for vma_iter_prealloc() calls
>>>    maple_tree: Move mas_wr_end_piv() below mas_wr_extend_null()
>>>    maple_tree: Update mas_preallocate() testing
>>>    maple_tree: Refine mas_preallocate() node calculations
>>>    mm/mmap: Change vma iteration order in do_vmi_align_munmap()
>>>
>>>   fs/exec.c                        |   1 +
>>>   include/linux/maple_tree.h       |  23 ++++-
>>>   include/linux/mm.h               |   4 -
>>>   lib/maple_tree.c                 |  78 ++++++++++----
>>>   lib/test_maple_tree.c            |  74 +++++++++++++
>>>   mm/internal.h                    |  40 ++++++--
>>>   mm/memory.c                      |  16 ++-
>>>   mm/mmap.c                        | 171 ++++++++++++++++---------------
>>>   mm/nommu.c                       |  45 ++++----
>>>   tools/testing/radix-tree/maple.c |  59 ++++++-----
>>>   10 files changed, 331 insertions(+), 180 deletions(-)
>>>

2023-06-05 06:20:15

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH 00/14] Reduce preallocations for maple tree



On 6/5/2023 12:41 PM, Yin Fengwei wrote:
> Hi Peng,
>
> On 6/5/23 11:28, Peng Zhang wrote:
>>
>>
>> On 2023/6/2 16:10, Yin, Fengwei wrote:
>>> Hi Liam,
>>>
>>> On 6/1/2023 10:15 AM, Liam R. Howlett wrote:
>>>> Initial work on preallocations showed no regression in performance
>>>> during testing, but recently some users (both on [1] and off [android]
>>>> list) have reported that preallocating the worst-case number of nodes
>>>> has caused some slow down.  This patch set addresses the number of
>>>> allocations in a few ways.
>>>>
>>>> During munmap() most munmap() operations will remove a single VMA, so
>>>> leverage the fact that the maple tree can place a single pointer at
>>>> range 0 - 0 without allocating.  This is done by changing the index in
>>>> the 'sidetree'.
>>>>
>>>> Re-introduce the entry argument to mas_preallocate() so that a more
>>>> intelligent guess of the node count can be made.
>>>>
>>>> Patches are in the following order:
>>>> 0001-0002: Testing framework for benchmarking some operations
>>>> 0003-0004: Reduction of maple node allocation in sidetree
>>>> 0005:      Small cleanup of do_vmi_align_munmap()
>>>> 0006-0013: mas_preallocate() calculation change
>>>> 0014:      Change the vma iterator order
>>> I did run The AIM:page_test on an IceLake 48C/96T + 192G RAM platform with
>>> this patchset.
>>>
>>> The result has a little bit improvement:
>>> Base (next-20230602):
>>>    503880
>>> Base with this patchset:
>>>    519501
>>>
>>> But they are far from the none-regression result (commit 7be1c1a3c7b1):
>>>    718080
>>>
>>>
>>> Some other information I collected:
>>> With Base, the mas_alloc_nodes are always hit with request: 7.
>>> With this patchset, the request are 1 or 5.
>>>
>>> I suppose this is the reason for improvement from 503880 to 519501.
>>>
>>> With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags never triggered
>>> mas_alloc_nodes() call. Thanks.
>> Hi Fengwei,
>>
>> I think it may be related to the inaccurate number of nodes allocated
>> in the pre-allocation. I slightly modified the pre-allocation in this
>> patchset, but I don't know if it works. It would be great if you could
>> help test it, and help pinpoint the cause. Below is the diff, which can
>> be applied based on this pachset.
> I tried the patch, it could eliminate the call of mas_alloc_nodes() during
> the test. But the result of benchmark got a little bit improvement:
> 529040
>
> But it's still much less than none-regression result. I will also double
> confirm the none-regression result.
Just noticed that commit f5715584af95 gives validate_mm() two implementations
selected by CONFIG_DEBUG_VM instead of CONFIG_DEBUG_VM_MAPLE_TREE. I have
CONFIG_DEBUG_VM but not CONFIG_DEBUG_VM_MAPLE_TREE defined. So it's not an
apples-to-apples comparison.


I disabled CONFIG_DEBUG_VM, re-ran the test, and got:
Before preallocation change (7be1c1a3c7b1):
770100
After preallocation change (28c5609fb236):
680000
With Liam's fix:
702100
plus Peng's fix:
725900


Regards
Yin, Fengwei

>
>
> Regards
> Yin, Fengwei
>
>>
>> Thanks,
>> Peng
>>
>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> index 5ea211c3f186..e67bf2744384 100644
>> --- a/lib/maple_tree.c
>> +++ b/lib/maple_tree.c
>> @@ -5575,9 +5575,11 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>>          goto ask_now;
>>      }
>>
>> -    /* New root needs a singe node */
>> -    if (unlikely(mte_is_root(mas->node)))
>> -        goto ask_now;
>> +    if ((node_size == wr_mas.node_end + 1 &&
>> +         mas->offset == wr_mas.node_end) ||
>> +        (node_size == wr_mas.node_end &&
>> +         wr_mas.offset_end - mas->offset == 1))
>> +        return 0;
>>
>>      /* Potential spanning rebalance collapsing a node, use worst-case */
>>      if (node_size  - 1 <= mt_min_slots[wr_mas.type])
>> @@ -5590,7 +5592,6 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>>      if (likely(!mas_is_err(mas)))
>>          return 0;
>>
>> -    mas_set_alloc_req(mas, 0);
>>      ret = xa_err(mas->node);
>>      mas_reset(mas);
>>      mas_destroy(mas);
>>
>>
>>>
>>>
>>> Regards
>>> Yin, Fengwei
>>>
>>>>
>>>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>>>>
>>>> Liam R. Howlett (14):
>>>>    maple_tree: Add benchmarking for mas_for_each
>>>>    maple_tree: Add benchmarking for mas_prev()
>>>>    mm: Move unmap_vmas() declaration to internal header
>>>>    mm: Change do_vmi_align_munmap() side tree index
>>>>    mm: Remove prev check from do_vmi_align_munmap()
>>>>    maple_tree: Introduce __mas_set_range()
>>>>    mm: Remove re-walk from mmap_region()
>>>>    maple_tree: Re-introduce entry to mas_preallocate() arguments
>>>>    mm: Use vma_iter_clear_gfp() in nommu
>>>>    mm: Set up vma iterator for vma_iter_prealloc() calls
>>>>    maple_tree: Move mas_wr_end_piv() below mas_wr_extend_null()
>>>>    maple_tree: Update mas_preallocate() testing
>>>>    maple_tree: Refine mas_preallocate() node calculations
>>>>    mm/mmap: Change vma iteration order in do_vmi_align_munmap()
>>>>
>>>>   fs/exec.c                        |   1 +
>>>>   include/linux/maple_tree.h       |  23 ++++-
>>>>   include/linux/mm.h               |   4 -
>>>>   lib/maple_tree.c                 |  78 ++++++++++----
>>>>   lib/test_maple_tree.c            |  74 +++++++++++++
>>>>   mm/internal.h                    |  40 ++++++--
>>>>   mm/memory.c                      |  16 ++-
>>>>   mm/mmap.c                        | 171 ++++++++++++++++---------------
>>>>   mm/nommu.c                       |  45 ++++----
>>>>   tools/testing/radix-tree/maple.c |  59 ++++++-----
>>>>   10 files changed, 331 insertions(+), 180 deletions(-)
>>>>

2023-06-05 08:21:16

by Peng Zhang

[permalink] [raw]
Subject: Re: [PATCH 00/14] Reduce preallocations for maple tree



On 2023/6/5 14:18, Yin, Fengwei wrote:
>
>
> On 6/5/2023 12:41 PM, Yin Fengwei wrote:
>> Hi Peng,
>>
>> On 6/5/23 11:28, Peng Zhang wrote:
>>>
>>>
>>> On 2023/6/2 16:10, Yin, Fengwei wrote:
>>>> Hi Liam,
>>>>
>>>> On 6/1/2023 10:15 AM, Liam R. Howlett wrote:
>>>>> Initial work on preallocations showed no regression in performance
>>>>> during testing, but recently some users (both on [1] and off [android]
>>>>> list) have reported that preallocating the worst-case number of nodes
>>>>> has caused some slow down.  This patch set addresses the number of
>>>>> allocations in a few ways.
>>>>>
>>>>> During munmap() most munmap() operations will remove a single VMA, so
>>>>> leverage the fact that the maple tree can place a single pointer at
>>>>> range 0 - 0 without allocating.  This is done by changing the index in
>>>>> the 'sidetree'.
>>>>>
>>>>> Re-introduce the entry argument to mas_preallocate() so that a more
>>>>> intelligent guess of the node count can be made.
>>>>>
>>>>> Patches are in the following order:
>>>>> 0001-0002: Testing framework for benchmarking some operations
>>>>> 0003-0004: Reduction of maple node allocation in sidetree
>>>>> 0005:      Small cleanup of do_vmi_align_munmap()
>>>>> 0006-0013: mas_preallocate() calculation change
>>>>> 0014:      Change the vma iterator order
>>>> I did run The AIM:page_test on an IceLake 48C/96T + 192G RAM platform with
>>>> this patchset.
>>>>
>>>> The result has a little bit improvement:
>>>> Base (next-20230602):
>>>>    503880
>>>> Base with this patchset:
>>>>    519501
>>>>
>>>> But they are far from the none-regression result (commit 7be1c1a3c7b1):
>>>>    718080
>>>>
>>>>
>>>> Some other information I collected:
>>>> With Base, the mas_alloc_nodes are always hit with request: 7.
>>>> With this patchset, the request are 1 or 5.
>>>>
>>>> I suppose this is the reason for improvement from 503880 to 519501.
>>>>
>>>> With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags never triggered
>>>> mas_alloc_nodes() call. Thanks.
>>> Hi Fengwei,
>>>
>>> I think it may be related to the inaccurate number of nodes allocated
>>> in the pre-allocation. I slightly modified the pre-allocation in this
>>> patchset, but I don't know if it works. It would be great if you could
>>> help test it, and help pinpoint the cause. Below is the diff, which can
>>> be applied based on this pachset.
>> I tried the patch, it could eliminate the call of mas_alloc_nodes() during
>> the test. But the result of benchmark got a little bit improvement:
>> 529040
>>
>> But it's still much less than none-regression result. I will also double
>> confirm the none-regression result.
> Just noticed that the commit f5715584af95 make validate_mm() two implementation
> based on CONFIG_DEBUG_VM instead of CONFIG_DEBUG_VM_MAPPLE_TREE). I have
> CONFIG_DEBUG_VM but not CONFIG_DEBUG_VM_MAPPLE_TREE defined. So it's not an
> apple to apple.
>
>
> I disable CONFIG_DEBUG_VM and re-run the test and got:
> Before preallocation change (7be1c1a3c7b1):
> 770100
> After preallocation change (28c5609fb236):
> 680000
> With liam's fix:
> 702100
> plus Peng's fix:
> 725900
Thank you for your test; it seems the performance
regression is not so large now.
>
>
> Regards
> Yin, Fengwei
>
>>
>>
>> Regards
>> Yin, Fengwei
>>
>>>
>>> Thanks,
>>> Peng
>>>
>>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>>> index 5ea211c3f186..e67bf2744384 100644
>>> --- a/lib/maple_tree.c
>>> +++ b/lib/maple_tree.c
>>> @@ -5575,9 +5575,11 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>>>          goto ask_now;
>>>      }
>>>
>>> -    /* New root needs a singe node */
>>> -    if (unlikely(mte_is_root(mas->node)))
>>> -        goto ask_now;
>>> +    if ((node_size == wr_mas.node_end + 1 &&
>>> +         mas->offset == wr_mas.node_end) ||
>>> +        (node_size == wr_mas.node_end &&
>>> +         wr_mas.offset_end - mas->offset == 1))
>>> +        return 0;
>>>
>>>      /* Potential spanning rebalance collapsing a node, use worst-case */
>>>      if (node_size  - 1 <= mt_min_slots[wr_mas.type])
>>> @@ -5590,7 +5592,6 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>>>      if (likely(!mas_is_err(mas)))
>>>          return 0;
>>>
>>> -    mas_set_alloc_req(mas, 0);
>>>      ret = xa_err(mas->node);
>>>      mas_reset(mas);
>>>      mas_destroy(mas);
>>>
>>>
>>>>
>>>>
>>>> Regards
>>>> Yin, Fengwei
>>>>
>>>>>
>>>>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>>>>>
>>>>> Liam R. Howlett (14):
>>>>>    maple_tree: Add benchmarking for mas_for_each
>>>>>    maple_tree: Add benchmarking for mas_prev()
>>>>>    mm: Move unmap_vmas() declaration to internal header
>>>>>    mm: Change do_vmi_align_munmap() side tree index
>>>>>    mm: Remove prev check from do_vmi_align_munmap()
>>>>>    maple_tree: Introduce __mas_set_range()
>>>>>    mm: Remove re-walk from mmap_region()
>>>>>    maple_tree: Re-introduce entry to mas_preallocate() arguments
>>>>>    mm: Use vma_iter_clear_gfp() in nommu
>>>>>    mm: Set up vma iterator for vma_iter_prealloc() calls
>>>>>    maple_tree: Move mas_wr_end_piv() below mas_wr_extend_null()
>>>>>    maple_tree: Update mas_preallocate() testing
>>>>>    maple_tree: Refine mas_preallocate() node calculations
>>>>>    mm/mmap: Change vma iteration order in do_vmi_align_munmap()
>>>>>
>>>>>   fs/exec.c                        |   1 +
>>>>>   include/linux/maple_tree.h       |  23 ++++-
>>>>>   include/linux/mm.h               |   4 -
>>>>>   lib/maple_tree.c                 |  78 ++++++++++----
>>>>>   lib/test_maple_tree.c            |  74 +++++++++++++
>>>>>   mm/internal.h                    |  40 ++++++--
>>>>>   mm/memory.c                      |  16 ++-
>>>>>   mm/mmap.c                        | 171 ++++++++++++++++---------------
>>>>>   mm/nommu.c                       |  45 ++++----
>>>>>   tools/testing/radix-tree/maple.c |  59 ++++++-----
>>>>>   10 files changed, 331 insertions(+), 180 deletions(-)
>>>>>

2023-06-05 14:13:22

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 00/14] Reduce preallocations for maple tree

* Peng Zhang <[email protected]> [230605 03:59]:
>
>
> On 2023/6/5 14:18, Yin, Fengwei wrote:
> >
> >
> > On 6/5/2023 12:41 PM, Yin Fengwei wrote:
> > > Hi Peng,
> > >
> > > On 6/5/23 11:28, Peng Zhang wrote:
> > > >
> > > >
> > > > On 2023/6/2 16:10, Yin, Fengwei wrote:
> > > > > Hi Liam,
> > > > >
> > > > > On 6/1/2023 10:15 AM, Liam R. Howlett wrote:
> > > > > > Initial work on preallocations showed no regression in performance
> > > > > > during testing, but recently some users (both on [1] and off [android]
> > > > > > list) have reported that preallocating the worst-case number of nodes
> > > > > > has caused some slow down.  This patch set addresses the number of
> > > > > > allocations in a few ways.
> > > > > >
> > > > > > During munmap() most munmap() operations will remove a single VMA, so
> > > > > > leverage the fact that the maple tree can place a single pointer at
> > > > > > range 0 - 0 without allocating.  This is done by changing the index in
> > > > > > the 'sidetree'.
> > > > > >
> > > > > > Re-introduce the entry argument to mas_preallocate() so that a more
> > > > > > intelligent guess of the node count can be made.
> > > > > >
> > > > > > Patches are in the following order:
> > > > > > 0001-0002: Testing framework for benchmarking some operations
> > > > > > 0003-0004: Reduction of maple node allocation in sidetree
> > > > > > 0005:      Small cleanup of do_vmi_align_munmap()
> > > > > > 0006-0013: mas_preallocate() calculation change
> > > > > > 0014:      Change the vma iterator order
> > > > > I did run The AIM:page_test on an IceLake 48C/96T + 192G RAM platform with
> > > > > this patchset.
> > > > >
> > > > > The result has a little bit improvement:
> > > > > Base (next-20230602):
> > > > >    503880
> > > > > Base with this patchset:
> > > > >    519501
> > > > >
> > > > > But they are far from the none-regression result (commit 7be1c1a3c7b1):
> > > > >    718080
> > > > >
> > > > >
> > > > > Some other information I collected:
> > > > > With Base, the mas_alloc_nodes are always hit with request: 7.
> > > > > With this patchset, the request are 1 or 5.
> > > > >
> > > > > I suppose this is the reason for improvement from 503880 to 519501.
> > > > >
> > > > > With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags never triggered
> > > > > mas_alloc_nodes() call. Thanks.
> > > > Hi Fengwei,
> > > >
> > > > I think it may be related to the inaccurate number of nodes allocated
> > > > in the pre-allocation. I slightly modified the pre-allocation in this
> > > > patchset, but I don't know if it works. It would be great if you could
> > > > help test it, and help pinpoint the cause. Below is the diff, which can
> > > > be applied based on this pachset.
> > > I tried the patch, it could eliminate the call of mas_alloc_nodes() during
> > > the test. But the result of benchmark got a little bit improvement:
> > > 529040
> > >
> > > But it's still much less than none-regression result. I will also double
> > > confirm the none-regression result.
> > Just noticed that the commit f5715584af95 make validate_mm() two implementation
> > based on CONFIG_DEBUG_VM instead of CONFIG_DEBUG_VM_MAPPLE_TREE). I have
> > CONFIG_DEBUG_VM but not CONFIG_DEBUG_VM_MAPPLE_TREE defined. So it's not an
> > apple to apple.

You mean "mm: update validate_mm() to use vma iterator" here I guess. I
have it as a different commit id in my branch.

I 'restored' some of the checking because I was able to work around not
having the mt_dump() definition with the vma iterator. I'm now
wondering how widely CONFIG_DEBUG_VM is used and if I should not
have added these extra checks.

> >
> >
> > I disable CONFIG_DEBUG_VM and re-run the test and got:
> > Before preallocation change (7be1c1a3c7b1):
> > 770100
> > After preallocation change (28c5609fb236):
> > 680000
> > With liam's fix:
> > 702100
> > plus Peng's fix:
> > 725900
> Thank you for your test, now it seems that the performance
> regression is not so much.

We are also too strict on the reset during mas_store_prealloc() checking
for a spanning write. I have a fix for this for v2 of the patch set,
although I suspect it will not make a huge difference.

> >
> >
> > Regards
> > Yin, Fengwei
> >
> > >
> > >
> > > Regards
> > > Yin, Fengwei
> > >
> > > >
> > > > Thanks,
> > > > Peng
> > > >
> > > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > > > index 5ea211c3f186..e67bf2744384 100644
> > > > --- a/lib/maple_tree.c
> > > > +++ b/lib/maple_tree.c
> > > > @@ -5575,9 +5575,11 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
> > > >          goto ask_now;
> > > >      }
> > > >
> > > > -    /* New root needs a singe node */
> > > > -    if (unlikely(mte_is_root(mas->node)))
> > > > -        goto ask_now;

Why did you drop this? If we are creating a new root we will only need
one node.

> > > > +    if ((node_size == wr_mas.node_end + 1 &&
> > > > +         mas->offset == wr_mas.node_end) ||
> > > > +        (node_size == wr_mas.node_end &&
> > > > +         wr_mas.offset_end - mas->offset == 1))
> > > > +        return 0;

I will add this to v2 as well, or something similar.

> > > >
> > > >      /* Potential spanning rebalance collapsing a node, use worst-case */
> > > >      if (node_size  - 1 <= mt_min_slots[wr_mas.type])
> > > > @@ -5590,7 +5592,6 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
> > > >      if (likely(!mas_is_err(mas)))
> > > >          return 0;
> > > >
> > > > -    mas_set_alloc_req(mas, 0);

Why did you drop this? It seems like a worth while cleanup on failure.

> > > >      ret = xa_err(mas->node);
> > > >      mas_reset(mas);
> > > >      mas_destroy(mas);
> > > >
> > > >
> > > > >
> > > > >
> > > > > Regards
> > > > > Yin, Fengwei
> > > > >
> > > > > >
> > > > > > [1] https://lore.kernel.org/linux-mm/[email protected]/
> > > > > >
> > > > > > Liam R. Howlett (14):
> > > > > >    maple_tree: Add benchmarking for mas_for_each
> > > > > >    maple_tree: Add benchmarking for mas_prev()
> > > > > >    mm: Move unmap_vmas() declaration to internal header
> > > > > >    mm: Change do_vmi_align_munmap() side tree index
> > > > > >    mm: Remove prev check from do_vmi_align_munmap()
> > > > > >    maple_tree: Introduce __mas_set_range()
> > > > > >    mm: Remove re-walk from mmap_region()
> > > > > >    maple_tree: Re-introduce entry to mas_preallocate() arguments
> > > > > >    mm: Use vma_iter_clear_gfp() in nommu
> > > > > >    mm: Set up vma iterator for vma_iter_prealloc() calls
> > > > > >    maple_tree: Move mas_wr_end_piv() below mas_wr_extend_null()
> > > > > >    maple_tree: Update mas_preallocate() testing
> > > > > >    maple_tree: Refine mas_preallocate() node calculations
> > > > > >    mm/mmap: Change vma iteration order in do_vmi_align_munmap()
> > > > > >
> > > > > >   fs/exec.c                        |   1 +
> > > > > >   include/linux/maple_tree.h       |  23 ++++-
> > > > > >   include/linux/mm.h               |   4 -
> > > > > >   lib/maple_tree.c                 |  78 ++++++++++----
> > > > > >   lib/test_maple_tree.c            |  74 +++++++++++++
> > > > > >   mm/internal.h                    |  40 ++++++--
> > > > > >   mm/memory.c                      |  16 ++-
> > > > > >   mm/mmap.c                        | 171 ++++++++++++++++---------------
> > > > > >   mm/nommu.c                       |  45 ++++----
> > > > > >   tools/testing/radix-tree/maple.c |  59 ++++++-----
> > > > > >   10 files changed, 331 insertions(+), 180 deletions(-)
> > > > > >

2023-06-05 14:50:11

by Peng Zhang

[permalink] [raw]
Subject: Re: [PATCH 00/14] Reduce preallocations for maple tree



On 2023/6/5 22:03, Liam R. Howlett wrote:
> * Peng Zhang <[email protected]> [230605 03:59]:
>>
>>
>> On 2023/6/5 14:18, Yin, Fengwei wrote:
>>>
>>>
>>> On 6/5/2023 12:41 PM, Yin Fengwei wrote:
>>>> Hi Peng,
>>>>
>>>> On 6/5/23 11:28, Peng Zhang wrote:
>>>>>
>>>>>
>>>>> On 2023/6/2 16:10, Yin, Fengwei wrote:
>>>>>> Hi Liam,
>>>>>>
>>>>>> On 6/1/2023 10:15 AM, Liam R. Howlett wrote:
>>>>>>> Initial work on preallocations showed no regression in performance
>>>>>>> during testing, but recently some users (both on [1] and off [android]
>>>>>>> list) have reported that preallocating the worst-case number of nodes
>>>>>>> has caused some slow down.  This patch set addresses the number of
>>>>>>> allocations in a few ways.
>>>>>>>
>>>>>>> During munmap() most munmap() operations will remove a single VMA, so
>>>>>>> leverage the fact that the maple tree can place a single pointer at
>>>>>>> range 0 - 0 without allocating.  This is done by changing the index in
>>>>>>> the 'sidetree'.
>>>>>>>
>>>>>>> Re-introduce the entry argument to mas_preallocate() so that a more
>>>>>>> intelligent guess of the node count can be made.
>>>>>>>
>>>>>>> Patches are in the following order:
>>>>>>> 0001-0002: Testing framework for benchmarking some operations
>>>>>>> 0003-0004: Reduction of maple node allocation in sidetree
>>>>>>> 0005:      Small cleanup of do_vmi_align_munmap()
>>>>>>> 0006-0013: mas_preallocate() calculation change
>>>>>>> 0014:      Change the vma iterator order
>>>>>> I did run The AIM:page_test on an IceLake 48C/96T + 192G RAM platform with
>>>>>> this patchset.
>>>>>>
>>>>>> The result has a little bit improvement:
>>>>>> Base (next-20230602):
>>>>>>    503880
>>>>>> Base with this patchset:
>>>>>>    519501
>>>>>>
>>>>>> But they are far from the none-regression result (commit 7be1c1a3c7b1):
>>>>>>    718080
>>>>>>
>>>>>>
>>>>>> Some other information I collected:
>>>>>> With Base, the mas_alloc_nodes are always hit with request: 7.
>>>>>> With this patchset, the request are 1 or 5.
>>>>>>
>>>>>> I suppose this is the reason for improvement from 503880 to 519501.
>>>>>>
>>>>>> With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags never triggered
>>>>>> mas_alloc_nodes() call. Thanks.
>>>>> Hi Fengwei,
>>>>>
>>>>> I think it may be related to the inaccurate number of nodes allocated
>>>>> in the pre-allocation. I slightly modified the pre-allocation in this
>>>>> patchset, but I don't know if it works. It would be great if you could
>>>>> help test it, and help pinpoint the cause. Below is the diff, which can
>>>>> be applied based on this pachset.
>>>> I tried the patch, it could eliminate the call of mas_alloc_nodes() during
>>>> the test. But the result of benchmark got a little bit improvement:
>>>> 529040
>>>>
>>>> But it's still much less than none-regression result. I will also double
>>>> confirm the none-regression result.
>>> Just noticed that the commit f5715584af95 make validate_mm() two implementation
>>> based on CONFIG_DEBUG_VM instead of CONFIG_DEBUG_VM_MAPPLE_TREE). I have
>>> CONFIG_DEBUG_VM but not CONFIG_DEBUG_VM_MAPPLE_TREE defined. So it's not an
>>> apple to apple.
>
> You mean "mm: update validate_mm() to use vma iterator" here I guess. I
> have it as a different commit id in my branch.
>
> I 'restored' some of the checking because I was able to work around not
> having the mt_dump() definition with the vma iterator. I'm now
> wondering how wide spread CONFIG_DEBUG_VM is used and if I should not
> have added these extra checks.
>
>>>
>>>
>>> I disable CONFIG_DEBUG_VM and re-run the test and got:
>>> Before preallocation change (7be1c1a3c7b1):
>>> 770100
>>> After preallocation change (28c5609fb236):
>>> 680000
>>> With liam's fix:
>>> 702100
>>> plus Peng's fix:
>>> 725900
>> Thank you for your test, now it seems that the performance
>> regression is not so much.
>
> We are also too strict on the reset during mas_store_prealloc() checking
> for a spanning write. I have a fix for this for v2 of the patch set,
> although I suspect it will not make a huge difference.
>
>>>
>>>
>>> Regards
>>> Yin, Fengwei
>>>
>>>>
>>>>
>>>> Regards
>>>> Yin, Fengwei
>>>>
>>>>>
>>>>> Thanks,
>>>>> Peng
>>>>>
>>>>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>>>>> index 5ea211c3f186..e67bf2744384 100644
>>>>> --- a/lib/maple_tree.c
>>>>> +++ b/lib/maple_tree.c
>>>>> @@ -5575,9 +5575,11 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>>>>>          goto ask_now;
>>>>>      }
>>>>>
>>>>> -    /* New root needs a singe node */
>>>>> -    if (unlikely(mte_is_root(mas->node)))
>>>>> -        goto ask_now;
>
> Why did you drop this? If we are creating a new root we will only need
> one node.
The code below handles the root case perfectly;
we don't need the additional check.
if (node_size - 1 <= mt_min_slots[wr_mas.type])
request = mas_mt_height(mas) * 2 - 1;
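For example, with a tree of height 1 the fall-through either trips the
minimum-occupancy check and requests 1 * 2 - 1 = 1 node, or keeps the
default request of 1, which is the same single node the dropped check
would have asked for.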
>
>>>>> +    if ((node_size == wr_mas.node_end + 1 &&
>>>>> +         mas->offset == wr_mas.node_end) ||
>>>>> +        (node_size == wr_mas.node_end &&
>>>>> +         wr_mas.offset_end - mas->offset == 1))
>>>>> +        return 0;
>
> I will add this to v2 as well, or something similar.
>
>>>>>
>>>>>      /* Potential spanning rebalance collapsing a node, use worst-case */
>>>>>      if (node_size  - 1 <= mt_min_slots[wr_mas.type])
>>>>> @@ -5590,7 +5592,6 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>>>>>      if (likely(!mas_is_err(mas)))
>>>>>          return 0;
>>>>>
>>>>> -    mas_set_alloc_req(mas, 0);
>
> Why did you drop this? It seems like a worth while cleanup on failure.
Because we will clear it in mas_node_count_gfp()->mas_alloc_nodes().
>
>>>>>      ret = xa_err(mas->node);
>>>>>      mas_reset(mas);
>>>>>      mas_destroy(mas);
>>>>>
>>>>>
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Yin, Fengwei
>>>>>>
>>>>>>>
>>>>>>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>>>>>>>
>>>>>>> Liam R. Howlett (14):
>>>>>>>    maple_tree: Add benchmarking for mas_for_each
>>>>>>>    maple_tree: Add benchmarking for mas_prev()
>>>>>>>    mm: Move unmap_vmas() declaration to internal header
>>>>>>>    mm: Change do_vmi_align_munmap() side tree index
>>>>>>>    mm: Remove prev check from do_vmi_align_munmap()
>>>>>>>    maple_tree: Introduce __mas_set_range()
>>>>>>>    mm: Remove re-walk from mmap_region()
>>>>>>>    maple_tree: Re-introduce entry to mas_preallocate() arguments
>>>>>>>    mm: Use vma_iter_clear_gfp() in nommu
>>>>>>>    mm: Set up vma iterator for vma_iter_prealloc() calls
>>>>>>>    maple_tree: Move mas_wr_end_piv() below mas_wr_extend_null()
>>>>>>>    maple_tree: Update mas_preallocate() testing
>>>>>>>    maple_tree: Refine mas_preallocate() node calculations
>>>>>>>    mm/mmap: Change vma iteration order in do_vmi_align_munmap()
>>>>>>>
>>>>>>>   fs/exec.c                        |   1 +
>>>>>>>   include/linux/maple_tree.h       |  23 ++++-
>>>>>>>   include/linux/mm.h               |   4 -
>>>>>>>   lib/maple_tree.c                 |  78 ++++++++++----
>>>>>>>   lib/test_maple_tree.c            |  74 +++++++++++++
>>>>>>>   mm/internal.h                    |  40 ++++++--
>>>>>>>   mm/memory.c                      |  16 ++-
>>>>>>>   mm/mmap.c                        | 171 ++++++++++++++++---------------
>>>>>>>   mm/nommu.c                       |  45 ++++----
>>>>>>>   tools/testing/radix-tree/maple.c |  59 ++++++-----
>>>>>>>   10 files changed, 331 insertions(+), 180 deletions(-)
>>>>>>>

2023-06-05 14:56:01

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 00/14] Reduce preallocations for maple tree

* Peng Zhang <[email protected]> [230605 10:27]:
>
>
> On 2023/6/5 22:03, Liam R. Howlett wrote:
> > * Peng Zhang <[email protected]> [230605 03:59]:
> > >
> > >
> > > On 2023/6/5 14:18, Yin, Fengwei wrote:
> > > >
> > > >
> > > > On 6/5/2023 12:41 PM, Yin Fengwei wrote:
> > > > > Hi Peng,
> > > > >
> > > > > On 6/5/23 11:28, Peng Zhang wrote:
> > > > > >
> > > > > >
> > > > > > On 2023/6/2 16:10, Yin, Fengwei wrote:
> > > > > > > Hi Liam,
> > > > > > >
> > > > > > > On 6/1/2023 10:15 AM, Liam R. Howlett wrote:
> > > > > > > > Initial work on preallocations showed no regression in performance
> > > > > > > > during testing, but recently some users (both on [1] and off [android]
> > > > > > > > list) have reported that preallocating the worst-case number of nodes
> > > > > > > > has caused some slow down.  This patch set addresses the number of
> > > > > > > > allocations in a few ways.
> > > > > > > >
> > > > > > > > During munmap() most munmap() operations will remove a single VMA, so
> > > > > > > > leverage the fact that the maple tree can place a single pointer at
> > > > > > > > range 0 - 0 without allocating.  This is done by changing the index in
> > > > > > > > the 'sidetree'.
> > > > > > > >
> > > > > > > > Re-introduce the entry argument to mas_preallocate() so that a more
> > > > > > > > intelligent guess of the node count can be made.
> > > > > > > >
> > > > > > > > Patches are in the following order:
> > > > > > > > 0001-0002: Testing framework for benchmarking some operations
> > > > > > > > 0003-0004: Reduction of maple node allocation in sidetree
> > > > > > > > 0005:      Small cleanup of do_vmi_align_munmap()
> > > > > > > > 0006-0013: mas_preallocate() calculation change
> > > > > > > > 0014:      Change the vma iterator order
> > > > > > > I did run The AIM:page_test on an IceLake 48C/96T + 192G RAM platform with
> > > > > > > this patchset.
> > > > > > >
> > > > > > > The result has a little bit improvement:
> > > > > > > Base (next-20230602):
> > > > > > >    503880
> > > > > > > Base with this patchset:
> > > > > > >    519501
> > > > > > >
> > > > > > > But they are far from the none-regression result (commit 7be1c1a3c7b1):
> > > > > > >    718080
> > > > > > >
> > > > > > >
> > > > > > > Some other information I collected:
> > > > > > > With Base, the mas_alloc_nodes are always hit with request: 7.
> > > > > > > With this patchset, the request are 1 or 5.
> > > > > > >
> > > > > > > I suppose this is the reason for improvement from 503880 to 519501.
> > > > > > >
> > > > > > > With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags never triggered
> > > > > > > mas_alloc_nodes() call. Thanks.
> > > > > > Hi Fengwei,
> > > > > >
> > > > > > I think it may be related to the inaccurate number of nodes allocated
> > > > > > in the pre-allocation. I slightly modified the pre-allocation in this
> > > > > > patchset, but I don't know if it works. It would be great if you could
> > > > > > help test it, and help pinpoint the cause. Below is the diff, which can
> > > > > > be applied based on this pachset.
> > > > > I tried the patch, it could eliminate the call of mas_alloc_nodes() during
> > > > > the test. But the result of benchmark got a little bit improvement:
> > > > > 529040
> > > > >
> > > > > But it's still much less than none-regression result. I will also double
> > > > > confirm the none-regression result.
> > > > Just noticed that the commit f5715584af95 make validate_mm() two implementation
> > > > based on CONFIG_DEBUG_VM instead of CONFIG_DEBUG_VM_MAPPLE_TREE). I have
> > > > CONFIG_DEBUG_VM but not CONFIG_DEBUG_VM_MAPPLE_TREE defined. So it's not an
> > > > apple to apple.
> >
> > You mean "mm: update validate_mm() to use vma iterator" here I guess. I
> > have it as a different commit id in my branch.
> >
> > I 'restored' some of the checking because I was able to work around not
> > having the mt_dump() definition with the vma iterator. I'm now
> > wondering how wide spread CONFIG_DEBUG_VM is used and if I should not
> > have added these extra checks.
> >
> > > >
> > > >
> > > > I disable CONFIG_DEBUG_VM and re-run the test and got:
> > > > Before preallocation change (7be1c1a3c7b1):
> > > > 770100
> > > > After preallocation change (28c5609fb236):
> > > > 680000
> > > > With liam's fix:
> > > > 702100
> > > > plus Peng's fix:
> > > > 725900
> > > Thank you for your test, now it seems that the performance
> > > regression is not so much.
> >
> > We are also too strict on the reset during mas_store_prealloc() checking
> > for a spanning write. I have a fix for this for v2 of the patch set,
> > although I suspect it will not make a huge difference.
> >
> > > >
> > > >
> > > > Regards
> > > > Yin, Fengwei
> > > >
> > > > >
> > > > >
> > > > > Regards
> > > > > Yin, Fengwei
> > > > >
> > > > > >
> > > > > > Thanks,
> > > > > > Peng
> > > > > >
> > > > > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > > > > > index 5ea211c3f186..e67bf2744384 100644
> > > > > > --- a/lib/maple_tree.c
> > > > > > +++ b/lib/maple_tree.c
> > > > > > @@ -5575,9 +5575,11 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
> > > > > >          goto ask_now;
> > > > > >      }
> > > > > >
> > > > > > -    /* New root needs a singe node */
> > > > > > -    if (unlikely(mte_is_root(mas->node)))
> > > > > > -        goto ask_now;
> >
> > Why did you drop this? If we are creating a new root we will only need
> > one node.
> The code below handles the root case perfectly;
> we don't need additional checks:
> if (node_size - 1 <= mt_min_slots[wr_mas.type])
> request = mas_mt_height(mas) * 2 - 1;

Unless we have the minimum for a node, in which case we will fall
through and ask for one node. This works and is rare enough, so I'll
drop it. Thanks.
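
For reference, the request-count logic being debated condenses to the
following (paraphrased from the diff quoted earlier in this thread; the
comments are editorial and the rest of mas_preallocate() is omitted):

    /* An append or a single-slot overwrite fits the existing node, so
     * no allocation is needed at all. */
    if ((node_size == wr_mas.node_end + 1 &&
         mas->offset == wr_mas.node_end) ||
        (node_size == wr_mas.node_end &&
         wr_mas.offset_end - mas->offset == 1))
        return 0;

    /* The write could leave the node at or below the minimum, so a
     * rebalance may collapse nodes; request the worst case for the
     * current tree height. */
    if (node_size - 1 <= mt_min_slots[wr_mas.type])
        request = mas_mt_height(mas) * 2 - 1;

    /* Everything else, including a new root, falls through and asks
     * for a single node. */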

> >
> > > > > > +    if ((node_size == wr_mas.node_end + 1 &&
> > > > > > +         mas->offset == wr_mas.node_end) ||
> > > > > > +        (node_size == wr_mas.node_end &&
> > > > > > +         wr_mas.offset_end - mas->offset == 1))
> > > > > > +        return 0;
> >
> > I will add this to v2 as well, or something similar.
> >
> > > > > >
> > > > > >      /* Potential spanning rebalance collapsing a node, use worst-case */
> > > > > >      if (node_size  - 1 <= mt_min_slots[wr_mas.type])
> > > > > > @@ -5590,7 +5592,6 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
> > > > > >      if (likely(!mas_is_err(mas)))
> > > > > >          return 0;
> > > > > >
> > > > > > -    mas_set_alloc_req(mas, 0);
> >
> > Why did you drop this? It seems like a worthwhile cleanup on failure.
> Because we will clear it in mas_node_count_gfp()->mas_alloc_nodes().

On failure we set the alloc request to the remainder of what was not
allocated.
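
A condensed view of the failure path in question (reconstructed from the
diff context quoted in this thread, with editorial comments; not an
exact copy of the upstream function):

    if (mas_is_err(mas)) {
        /* A failed request leaves the shortfall recorded as the
         * pending allocation request; clear it before tearing the
         * state down. */
        mas_set_alloc_req(mas, 0);
        ret = xa_err(mas->node);   /* errno encoded in the ma_state */
        mas_reset(mas);
        mas_destroy(mas);          /* free any nodes already allocated */
        return ret;
    }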

> >
> > > > > >      ret = xa_err(mas->node);
> > > > > >      mas_reset(mas);
> > > > > >      mas_destroy(mas);

2023-06-05 15:17:05

by Peng Zhang

[permalink] [raw]
Subject: Re: [PATCH 00/14] Reduce preallocations for maple tree



On 2023/6/5 22:44, Liam R. Howlett wrote:
> * Peng Zhang <[email protected]> [230605 10:27]:
>>
>>
>> On 2023/6/5 22:03, Liam R. Howlett wrote:
>>> * Peng Zhang <[email protected]> [230605 03:59]:
>>>>
>>>>
>>>> On 2023/6/5 14:18, Yin, Fengwei wrote:
>>>>>
>>>>>
>>>>> On 6/5/2023 12:41 PM, Yin Fengwei wrote:
>>>>>> Hi Peng,
>>>>>>
>>>>>> On 6/5/23 11:28, Peng Zhang wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2023/6/2 16:10, Yin, Fengwei wrote:
>>>>>>>> Hi Liam,
>>>>>>>>
>>>>>>>> On 6/1/2023 10:15 AM, Liam R. Howlett wrote:
>>>>>>>>> Initial work on preallocations showed no regression in performance
>>>>>>>>> during testing, but recently some users (both on [1] and off [android]
>>>>>>>>> list) have reported that preallocating the worst-case number of nodes
>>>>>>>>> has caused some slow down.  This patch set addresses the number of
>>>>>>>>> allocations in a few ways.
>>>>>>>>>
>>>>>>>>> During munmap() most munmap() operations will remove a single VMA, so
>>>>>>>>> leverage the fact that the maple tree can place a single pointer at
>>>>>>>>> range 0 - 0 without allocating.  This is done by changing the index in
>>>>>>>>> the 'sidetree'.
>>>>>>>>>
>>>>>>>>> Re-introduce the entry argument to mas_preallocate() so that a more
>>>>>>>>> intelligent guess of the node count can be made.
>>>>>>>>>
>>>>>>>>> Patches are in the following order:
>>>>>>>>> 0001-0002: Testing framework for benchmarking some operations
>>>>>>>>> 0003-0004: Reduction of maple node allocation in sidetree
>>>>>>>>> 0005:      Small cleanup of do_vmi_align_munmap()
>>>>>>>>> 0006-0013: mas_preallocate() calculation change
>>>>>>>>> 0014:      Change the vma iterator order
>>>>>>>> I did run The AIM:page_test on an IceLake 48C/96T + 192G RAM platform with
>>>>>>>> this patchset.
>>>>>>>>
>>>>>>>> The result has a little bit improvement:
>>>>>>>> Base (next-20230602):
>>>>>>>>    503880
>>>>>>>> Base with this patchset:
>>>>>>>>    519501
>>>>>>>>
>>>>>>>> But they are far from the none-regression result (commit 7be1c1a3c7b1):
>>>>>>>>    718080
>>>>>>>>
>>>>>>>>
>>>>>>>> Some other information I collected:
>>>>>>>> With Base, the mas_alloc_nodes are always hit with request: 7.
>>>>>>>> With this patchset, the request are 1 or 5.
>>>>>>>>
>>>>>>>> I suppose this is the reason for improvement from 503880 to 519501.
>>>>>>>>
>>>>>>>> With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags never triggered
>>>>>>>> mas_alloc_nodes() call. Thanks.
>>>>>>> Hi Fengwei,
>>>>>>>
>>>>>>> I think it may be related to the inaccurate number of nodes allocated
>>>>>>> in the pre-allocation. I slightly modified the pre-allocation in this
>>>>>>> patchset, but I don't know if it works. It would be great if you could
>>>>>>> help test it, and help pinpoint the cause. Below is the diff, which can
>>>>>>> be applied based on this pachset.
>>>>>> I tried the patch, it could eliminate the call of mas_alloc_nodes() during
>>>>>> the test. But the result of benchmark got a little bit improvement:
>>>>>> 529040
>>>>>>
>>>>>> But it's still much less than none-regression result. I will also double
>>>>>> confirm the none-regression result.
>>>>> Just noticed that the commit f5715584af95 make validate_mm() two implementation
>>>>> based on CONFIG_DEBUG_VM instead of CONFIG_DEBUG_VM_MAPPLE_TREE). I have
>>>>> CONFIG_DEBUG_VM but not CONFIG_DEBUG_VM_MAPPLE_TREE defined. So it's not an
>>>>> apple to apple.
>>>
>>> You mean "mm: update validate_mm() to use vma iterator" here I guess. I
>>> have it as a different commit id in my branch.
>>>
>>> I 'restored' some of the checking because I was able to work around not
>>> having the mt_dump() definition with the vma iterator. I'm now
>>> wondering how wide spread CONFIG_DEBUG_VM is used and if I should not
>>> have added these extra checks.
>>>
>>>>>
>>>>>
>>>>> I disable CONFIG_DEBUG_VM and re-run the test and got:
>>>>> Before preallocation change (7be1c1a3c7b1):
>>>>> 770100
>>>>> After preallocation change (28c5609fb236):
>>>>> 680000
>>>>> With liam's fix:
>>>>> 702100
>>>>> plus Peng's fix:
>>>>> 725900
>>>> Thank you for your test, now it seems that the performance
>>>> regression is not so much.
>>>
>>> We are also too strict on the reset during mas_store_prealloc() checking
>>> for a spanning write. I have a fix for this for v2 of the patch set,
>>> although I suspect it will not make a huge difference.
>>>
>>>>>
>>>>>
>>>>> Regards
>>>>> Yin, Fengwei
>>>>>
>>>>>>
>>>>>>
>>>>>> Regards
>>>>>> Yin, Fengwei
>>>>>>
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Peng
>>>>>>>
>>>>>>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>>>>>>> index 5ea211c3f186..e67bf2744384 100644
>>>>>>> --- a/lib/maple_tree.c
>>>>>>> +++ b/lib/maple_tree.c
>>>>>>> @@ -5575,9 +5575,11 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>>>>>>>          goto ask_now;
>>>>>>>      }
>>>>>>>
>>>>>>> -    /* New root needs a singe node */
>>>>>>> -    if (unlikely(mte_is_root(mas->node)))
>>>>>>> -        goto ask_now;
>>>
>>> Why did you drop this? If we are creating a new root we will only need
>>> one node.
>> The code below handles the root case perfectly,
>> we don't need additional checks.
>> if (node_size - 1 <= mt_min_slots[wr_mas.type])
>> request = mas_mt_height(mas) * 2 - 1;
>
> Unless we have the minimum for a node, in which case we will fall
> through and ask for one node. This works and is rare enough so I'll
> drop it. Thanks.
>
>>>
>>>>>>> +    if ((node_size == wr_mas.node_end + 1 &&
>>>>>>> +         mas->offset == wr_mas.node_end) ||
>>>>>>> +        (node_size == wr_mas.node_end &&
>>>>>>> +         wr_mas.offset_end - mas->offset == 1))
>>>>>>> +        return 0;
>>>
>>> I will add this to v2 as well, or something similar.
>>>
>>>>>>>
>>>>>>>      /* Potential spanning rebalance collapsing a node, use worst-case */
>>>>>>>      if (node_size  - 1 <= mt_min_slots[wr_mas.type])
>>>>>>> @@ -5590,7 +5592,6 @@ int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>>>>>>>      if (likely(!mas_is_err(mas)))
>>>>>>>          return 0;
>>>>>>>
>>>>>>> -    mas_set_alloc_req(mas, 0);
>>>
>>> Why did you drop this? It seems like a worthwhile cleanup on failure.
>> Because we will clear it in mas_node_count_gfp()->mas_alloc_nodes().
>
> On failure we set the alloc request to the remainder of what was not
> allocated.
Yes, you are right.

2023-06-06 01:47:37

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH 00/14] Reduce preallocations for maple tree



On 6/5/23 22:03, Liam R. Howlett wrote:
> * Peng Zhang <[email protected]> [230605 03:59]:
>>
>>
>> On 2023/6/5 14:18, Yin, Fengwei wrote:
>>>
>>>
>>> On 6/5/2023 12:41 PM, Yin Fengwei wrote:
>>>> Hi Peng,
>>>>
>>>> On 6/5/23 11:28, Peng Zhang wrote:
>>>>>
>>>>>
>>>>> On 2023/6/2 16:10, Yin, Fengwei wrote:
>>>>>> Hi Liam,
>>>>>>
>>>>>> On 6/1/2023 10:15 AM, Liam R. Howlett wrote:
>>>>>>> Initial work on preallocations showed no regression in performance
>>>>>>> during testing, but recently some users (both on [1] and off [android]
>>>>>>> list) have reported that preallocating the worst-case number of nodes
>>>>>>> has caused some slow down.  This patch set addresses the number of
>>>>>>> allocations in a few ways.
>>>>>>>
>>>>>>> During munmap() most munmap() operations will remove a single VMA, so
>>>>>>> leverage the fact that the maple tree can place a single pointer at
>>>>>>> range 0 - 0 without allocating.  This is done by changing the index in
>>>>>>> the 'sidetree'.
>>>>>>>
>>>>>>> Re-introduce the entry argument to mas_preallocate() so that a more
>>>>>>> intelligent guess of the node count can be made.
>>>>>>>
>>>>>>> Patches are in the following order:
>>>>>>> 0001-0002: Testing framework for benchmarking some operations
>>>>>>> 0003-0004: Reduction of maple node allocation in sidetree
>>>>>>> 0005:      Small cleanup of do_vmi_align_munmap()
>>>>>>> 0006-0013: mas_preallocate() calculation change
>>>>>>> 0014:      Change the vma iterator order
>>>>>> I did run The AIM:page_test on an IceLake 48C/96T + 192G RAM platform with
>>>>>> this patchset.
>>>>>>
>>>>>> The result has a little bit improvement:
>>>>>> Base (next-20230602):
>>>>>>    503880
>>>>>> Base with this patchset:
>>>>>>    519501
>>>>>>
>>>>>> But they are far from the none-regression result (commit 7be1c1a3c7b1):
>>>>>>    718080
>>>>>>
>>>>>>
>>>>>> Some other information I collected:
>>>>>> With Base, the mas_alloc_nodes are always hit with request: 7.
>>>>>> With this patchset, the request are 1 or 5.
>>>>>>
>>>>>> I suppose this is the reason for improvement from 503880 to 519501.
>>>>>>
>>>>>> With commit 7be1c1a3c7b1, mas_store_gfp() in do_brk_flags never triggered
>>>>>> mas_alloc_nodes() call. Thanks.
>>>>> Hi Fengwei,
>>>>>
>>>>> I think it may be related to the inaccurate number of nodes allocated
>>>>> in the pre-allocation. I slightly modified the pre-allocation in this
>>>>> patchset, but I don't know if it works. It would be great if you could
>>>>> help test it, and help pinpoint the cause. Below is the diff, which can
>>>>> be applied based on this pachset.
>>>> I tried the patch, it could eliminate the call of mas_alloc_nodes() during
>>>> the test. But the result of benchmark got a little bit improvement:
>>>> 529040
>>>>
>>>> But it's still much less than none-regression result. I will also double
>>>> confirm the none-regression result.
>>> Just noticed that the commit f5715584af95 make validate_mm() two implementation
>>> based on CONFIG_DEBUG_VM instead of CONFIG_DEBUG_VM_MAPPLE_TREE). I have
>>> CONFIG_DEBUG_VM but not CONFIG_DEBUG_VM_MAPPLE_TREE defined. So it's not an
>>> apple to apple.
>
> You mean "mm: update validate_mm() to use vma iterator" here I guess. I
> have it as a different commit id in my branch.
Yes. The commit id f5715584af95 was from next-20230602. I will include the
subject of the patch in the future to avoid such confusion.

>
> I 'restored' some of the checking because I was able to work around not
> having the mt_dump() definition with the vma iterator. I'm now
> wondering how wide spread CONFIG_DEBUG_VM is used and if I should not
> have added these extra checks.
No worries here. The problem is in my local config. I enabled CONFIG_DEBUG_VM
to capture VM asserts/warnings. It should be disabled for performance testing;
LKP does disable it for performance checks.


Regards
Yin, Fengwei


2023-06-06 03:21:39

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH 00/14] Reduce preallocations for maple tree

Hi Hugh,

On 6/6/2023 10:41 AM, Hugh Dickins wrote:
> On Mon, 5 Jun 2023, Liam R. Howlett wrote:
>>
>> You mean "mm: update validate_mm() to use vma iterator" here I guess. I
>> have it as a different commit id in my branch.
>>
>> I 'restored' some of the checking because I was able to work around not
>> having the mt_dump() definition with the vma iterator. I'm now
>> wondering how wide spread CONFIG_DEBUG_VM is used and if I should not
>> have added these extra checks.
>
> Most CONFIG_DEBUG_VM checks are quite cheap, mostly VM_BUG_ONs for
Indeed. I had CONFIG_DEBUG_VM enabled and didn't see any surprises in the perf report.


> easily checked conditions. If validate_mm() is still the kind of thing
> it used to be, checking through every vma on every mmap operation, please
> don't bring that into CONFIG_DEBUG_VM - it distorts performance too much,
> so always used to be under a separate CONFIG_DEBUG_VM_RB instead.
So does this mean CONFIG_DEBUG_VM is allowed to be enabled for performance
testing? Thanks.


Regards
Yin, Fengwei

>
> Hugh

2023-06-06 03:23:48

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 00/14] Reduce preallocations for maple tree

On Mon, 5 Jun 2023, Liam R. Howlett wrote:
>
> You mean "mm: update validate_mm() to use vma iterator" here I guess. I
> have it as a different commit id in my branch.
>
> I 'restored' some of the checking because I was able to work around not
> having the mt_dump() definition with the vma iterator. I'm now
> wondering how wide spread CONFIG_DEBUG_VM is used and if I should not
> have added these extra checks.

Most CONFIG_DEBUG_VM checks are quite cheap, mostly VM_BUG_ONs for
easily checked conditions. If validate_mm() is still the kind of thing
it used to be, checking through every vma on every mmap operation, please
don't bring that into CONFIG_DEBUG_VM - it distorts performance too much,
so it always used to be under a separate CONFIG_DEBUG_VM_RB instead.
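
A simplified sketch of the two kinds of checks (illustration only;
cheap_check()/expensive_check() are made-up names, not kernel code):

    #include <linux/mm.h>
    #include <linux/mmdebug.h>

    /* Typical CONFIG_DEBUG_VM check: one O(1) assertion at a call site. */
    static void cheap_check(struct vm_area_struct *vma)
    {
        VM_BUG_ON_VMA(vma->vm_start >= vma->vm_end, vma);
    }

    /* validate_mm()-style check: walks every VMA on every mmap-path
     * operation, which is what distorts performance. */
    static void expensive_check(struct mm_struct *mm)
    {
        struct vm_area_struct *vma;
        VMA_ITERATOR(vmi, mm, 0);

        for_each_vma(vmi, vma)
            VM_BUG_ON_VMA(vma->vm_start >= vma->vm_end, vma);
    }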

Hugh

2023-06-06 03:25:23

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 00/14] Reduce preallocations for maple tree

On Tue, 6 Jun 2023, Yin, Fengwei wrote:
> On 6/6/2023 10:41 AM, Hugh Dickins wrote:
> > On Mon, 5 Jun 2023, Liam R. Howlett wrote:
> >>
> >> You mean "mm: update validate_mm() to use vma iterator" here I guess. I
> >> have it as a different commit id in my branch.
> >>
> >> I 'restored' some of the checking because I was able to work around not
> >> having the mt_dump() definition with the vma iterator. I'm now
> >> wondering how wide spread CONFIG_DEBUG_VM is used and if I should not
> >> have added these extra checks.
> >
> > Most CONFIG_DEBUG_VM checks are quite cheap, mostly VM_BUG_ONs for
> Indeed. I had CONFIG_DEBUG_VM enabled and didn't see surprise perf report.
>
>
> > easily checked conditions. If validate_mm() is still the kind of thing
> > it used to be, checking through every vma on every mmap operation, please
> > don't bring that into CONFIG_DEBUG_VM - it distorts performance too much,
> > so always used to be under a separate CONFIG_DEBUG_VM_RB instead.
> So does this mean CONFIG_DEBUG_VM is allowed to be enabled for performance
> testing? Thanks.

I was going to say:
No, I did not mean that: I just meant that even developers not doing
strict performance testing still like to keep a rough eye on performance
changes; and historically CONFIG_DEBUG_VM has not distorted very much.

But then I wonder about certain distros which (wrongly or rightly) turn
CONFIG_DEBUG_VM on: I expect they do performance testing on their kernels.

Hugh

2023-06-06 03:37:11

by Yin, Fengwei

[permalink] [raw]
Subject: Re: [PATCH 00/14] Reduce preallocations for maple tree



On 6/6/2023 11:08 AM, Hugh Dickins wrote:
> On Tue, 6 Jun 2023, Yin, Fengwei wrote:
>> On 6/6/2023 10:41 AM, Hugh Dickins wrote:
>>> On Mon, 5 Jun 2023, Liam R. Howlett wrote:
>>>>
>>>> You mean "mm: update validate_mm() to use vma iterator" here I guess. I
>>>> have it as a different commit id in my branch.
>>>>
>>>> I 'restored' some of the checking because I was able to work around not
>>>> having the mt_dump() definition with the vma iterator. I'm now
>>>> wondering how wide spread CONFIG_DEBUG_VM is used and if I should not
>>>> have added these extra checks.
>>>
>>> Most CONFIG_DEBUG_VM checks are quite cheap, mostly VM_BUG_ONs for
>> Indeed. I had CONFIG_DEBUG_VM enabled and didn't see surprise perf report.
>>
>>
>>> easily checked conditions. If validate_mm() is still the kind of thing
>>> it used to be, checking through every vma on every mmap operation, please
>>> don't bring that into CONFIG_DEBUG_VM - it distorts performance too much,
>>> so always used to be under a separate CONFIG_DEBUG_VM_RB instead.
>> So does this mean CONFIG_DEBUG_VM is allowed to be enabled for performance
>> testing? Thanks.
>
> I was going to say:
> No, I did not mean that: I just meant that even developers not doing
> strict performance testing still like to keep a rough eye on performance
> changes; and historically CONFIG_DEBUG_VM has not distorted very much.
>
> But then I wonder about certain distros which (wrongly or rightly) turn
> CONFIG_DEBUG_VM on: I expect they do performance testing on their kernels.
Fair enough. Thanks for the explanation.

Regards
Yin, Fengwei

>
> Hugh

2023-06-06 16:32:29

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 00/14] Reduce preallocations for maple tree

* Yin, Fengwei <[email protected]> [230605 23:11]:
>
>
> On 6/6/2023 11:08 AM, Hugh Dickins wrote:
> > On Tue, 6 Jun 2023, Yin, Fengwei wrote:
> >> On 6/6/2023 10:41 AM, Hugh Dickins wrote:
> >>> On Mon, 5 Jun 2023, Liam R. Howlett wrote:
> >>>>
> >>>> You mean "mm: update validate_mm() to use vma iterator" here I guess. I
> >>>> have it as a different commit id in my branch.
> >>>>
> >>>> I 'restored' some of the checking because I was able to work around not
> >>>> having the mt_dump() definition with the vma iterator. I'm now
> >>>> wondering how wide spread CONFIG_DEBUG_VM is used and if I should not
> >>>> have added these extra checks.
> >>>
> >>> Most CONFIG_DEBUG_VM checks are quite cheap, mostly VM_BUG_ONs for
> >> Indeed. I had CONFIG_DEBUG_VM enabled and didn't see surprise perf report.
> >>
> >>
> >>> easily checked conditions. If validate_mm() is still the kind of thing
> >>> it used to be, checking through every vma on every mmap operation, please
> >>> don't bring that into CONFIG_DEBUG_VM - it distorts performance too much,
> >>> so always used to be under a separate CONFIG_DEBUG_VM_RB instead.

Okay, I will update my patch to use CONFIG_DEBUG_VM_MAPLE_TREE for
validate_mm().
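
A minimal sketch of what that gating could look like (illustrative only;
the exact declarations are an assumption, not taken from the patch):

    #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
    /* full consistency walk over every VMA and the maple tree */
    extern void validate_mm(struct mm_struct *mm);
    #else
    #define validate_mm(mm) do { } while (0)
    #endif

With this, plain CONFIG_DEBUG_VM builds keep only the cheap assertions.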

> >> So does this mean CONFIG_DEBUG_VM is allowed to be enabled for performance
> >> testing? Thanks.
> >
> > I was going to say:
> > No, I did not mean that: I just meant that even developers not doing
> > strict performance testing still like to keep a rough eye on performance
> > changes; and historically CONFIG_DEBUG_VM has not distorted very much.
> >
> > But then I wonder about certain distros which (wrongly or rightly) turn
> > CONFIG_DEBUG_VM on: I expect they do performance testing on their kernels.
> Fair enough. Thanks for the explanation.
>

Thanks for looking at this, everyone.

Regards,
Liam