2023-06-12 20:47:25

by Liam R. Howlett

Subject: [PATCH v2 00/16] Reduce preallocations for maple tree

Initial work on preallocations showed no performance regression during
testing, but recently some users (both on [1] and off [android] list)
have reported that preallocating the worst-case number of nodes has
caused a slowdown.  This patch set addresses the number of allocations
in a few ways.

Most munmap() operations remove a single VMA, so leverage the fact
that the maple tree can place a single pointer at range 0 - 0 without
allocating.  This is done by changing the index in the 'sidetree'.
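
The idea can be sketched as follows (a simplified illustration in the
style of the series, not code taken from it):

```c
/* Detached VMAs are stored at index 0, 1, 2, ... rather than at their
 * address ranges.  The first store, at range 0 - 0, fits in the root
 * pointer of the maple tree and needs no node allocation.
 */
struct maple_tree mt_detach;
MA_STATE(mas_detach, &mt_detach, 0, 0);
unsigned long count = 0;

mt_init_flags(&mt_detach, MT_FLAGS_LOCK_EXTERN);
/* for each VMA being unmapped: */
mas_set(&mas_detach, count++);
if (mas_store_gfp(&mas_detach, vma, GFP_KERNEL))
	goto nomem;
```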

Re-introduce the entry argument to mas_preallocate() so that a more
intelligent guess of the node count can be made.
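
With the entry argument restored, callers can pass the value that is
about to be stored, so the allocator can distinguish cheap writes (an
append or slot store) from worst-case ones.  A rough sketch of the
resulting calling pattern (illustration only, not a hunk from the
series):

```c
/* The previous signature, without an entry, had to assume the worst
 * case.  Passing the entry and the exact range lets mas_preallocate()
 * estimate the node count from the write that will actually follow.
 */
__mas_set_range(&mas, vma->vm_start, vma->vm_end - 1);
if (mas_preallocate(&mas, vma, GFP_KERNEL))
	return -ENOMEM;
mas_store_prealloc(&mas, vma);	/* guaranteed to succeed */
```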

During development of v2 of this patch set, I also noticed that the
number of nodes being allocated for a rebalance was beyond what could
possibly be needed. This is addressed in patch 0008.

Patches are in the following order:
0001-0002: Testing framework for benchmarking some operations
0003-0004: Reduction of maple node allocation in sidetree
0005: Small cleanup of do_vmi_align_munmap()
0006-0015: mas_preallocate() calculation change
0016: Change the vma iterator order

Changes since v1:
- Reduced preallocations for append and slot store - Thanks Peng Zhang
- Added patch to reduce node allocations for mas_rebalance() (patch 0008)
- Reduced resets during store setup to avoid duplicate walking (patch 0015)

v1: https://lore.kernel.org/lkml/[email protected]/

[1] https://lore.kernel.org/linux-mm/[email protected]/

Liam R. Howlett (16):
maple_tree: Add benchmarking for mas_for_each
maple_tree: Add benchmarking for mas_prev()
mm: Move unmap_vmas() declaration to internal header
mm: Change do_vmi_align_munmap() side tree index
mm: Remove prev check from do_vmi_align_munmap()
maple_tree: Introduce __mas_set_range()
mm: Remove re-walk from mmap_region()
maple_tree: Adjust node allocation on mas_rebalance()
maple_tree: Re-introduce entry to mas_preallocate() arguments
mm: Use vma_iter_clear_gfp() in nommu
mm: Set up vma iterator for vma_iter_prealloc() calls
maple_tree: Move mas_wr_end_piv() below mas_wr_extend_null()
maple_tree: Update mas_preallocate() testing
maple_tree: Refine mas_preallocate() node calculations
maple_tree: Reduce resets during store setup
mm/mmap: Change vma iteration order in do_vmi_align_munmap()

fs/exec.c | 1 +
include/linux/maple_tree.h | 23 ++++-
include/linux/mm.h | 4 -
lib/maple_tree.c | 121 ++++++++++++++++------
lib/test_maple_tree.c | 74 +++++++++++++
mm/internal.h | 40 ++++++--
mm/memory.c | 19 ++--
mm/mmap.c | 171 ++++++++++++++++---------------
mm/nommu.c | 45 ++++----
tools/testing/radix-tree/maple.c | 59 ++++++-----
10 files changed, 364 insertions(+), 193 deletions(-)

--
2.39.2



2023-06-12 20:47:28

by Liam R. Howlett

Subject: [PATCH v2 04/16] mm: Change do_vmi_align_munmap() side tree index

The majority of munmap() calls operate on a single VMA.  The maple
tree can store a single entry at index 0, with a size of 1, as a
pointer and avoid any allocations.  Change do_vmi_align_munmap() to
store the VMAs being munmap()'ed into a tree indexed by the count.
This leverages the ability to store the first entry without a node
allocation.

Storing the entries in a tree indexed by the count, rather than by the
VMA start and end, means changing the functions which iterate over the
entries.  Update unmap_vmas() and free_pgtables() to take a maple
state and a tree end address to support this functionality.

Passing through the same maple state to unmap_vmas() and free_pgtables()
means the state needs to be reset between calls. This happens in the
static unmap_region() and exit_mmap().
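
Sketched, the reset looks like this, mirroring the unmap_region() hunk
in this patch (floor/ceiling stand in for the prev/next based limits):

```c
/* unmap_vmas() advances the shared maple state, so remember where it
 * started and rewind before free_pgtables() walks the same entries.
 */
unsigned long mt_start = mas->index;

unmap_vmas(&tlb, mas, vma, start, end, tree_end, mm_wr_locked);
mas_set(mas, mt_start);
free_pgtables(&tlb, mas, vma, floor, ceiling, mm_wr_locked);
```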

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/internal.h | 8 ++++----
mm/memory.c | 19 +++++++++----------
mm/mmap.c | 41 ++++++++++++++++++++++++-----------------
3 files changed, 37 insertions(+), 31 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 9b665c4e5fc0..24437f11d3c2 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -103,7 +103,7 @@ bool __folio_end_writeback(struct folio *folio);
void deactivate_file_folio(struct folio *folio);
void folio_activate(struct folio *folio);

-void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
+void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
struct vm_area_struct *start_vma, unsigned long floor,
unsigned long ceiling, bool mm_wr_locked);
void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte);
@@ -1099,9 +1099,9 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
return 0;
}

-void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
- struct vm_area_struct *vma, unsigned long start_addr,
- unsigned long end_addr, bool mm_wr_locked);
+void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
+ struct vm_area_struct *start_vma, unsigned long start,
+ unsigned long end, unsigned long tree_end, bool mm_wr_locked);

/*
* VMA lock generalization
diff --git a/mm/memory.c b/mm/memory.c
index 8358f3b853f2..fa343b8dad4b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -360,12 +360,10 @@ void free_pgd_range(struct mmu_gather *tlb,
} while (pgd++, addr = next, addr != end);
}

-void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
+void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
struct vm_area_struct *vma, unsigned long floor,
unsigned long ceiling, bool mm_wr_locked)
{
- MA_STATE(mas, mt, vma->vm_end, vma->vm_end);
-
do {
unsigned long addr = vma->vm_start;
struct vm_area_struct *next;
@@ -374,7 +372,7 @@ void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
* Note: USER_PGTABLES_CEILING may be passed as ceiling and may
* be 0. This will underflow and is okay.
*/
- next = mas_find(&mas, ceiling - 1);
+ next = mas_find(mas, ceiling - 1);

/*
* Hide vma from rmap and truncate_pagecache before freeing
@@ -395,7 +393,7 @@ void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt,
while (next && next->vm_start <= vma->vm_end + PMD_SIZE
&& !is_vm_hugetlb_page(next)) {
vma = next;
- next = mas_find(&mas, ceiling - 1);
+ next = mas_find(mas, ceiling - 1);
if (mm_wr_locked)
vma_start_write(vma);
unlink_anon_vmas(vma);
@@ -1684,10 +1682,11 @@ static void unmap_single_vma(struct mmu_gather *tlb,
/**
* unmap_vmas - unmap a range of memory covered by a list of vma's
* @tlb: address of the caller's struct mmu_gather
- * @mt: the maple tree
+ * @mas: The maple state
* @vma: the starting vma
* @start_addr: virtual address at which to start unmapping
* @end_addr: virtual address at which to end unmapping
+ * @tree_end: The end address to search in the maple tree
*
* Unmap all pages in the vma list.
*
@@ -1700,9 +1699,10 @@ static void unmap_single_vma(struct mmu_gather *tlb,
* ensure that any thus-far unmapped pages are flushed before unmap_vmas()
* drops the lock and schedules.
*/
-void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
+void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
struct vm_area_struct *vma, unsigned long start_addr,
- unsigned long end_addr, bool mm_wr_locked)
+ unsigned long end_addr, unsigned long tree_end,
+ bool mm_wr_locked)
{
struct mmu_notifier_range range;
struct zap_details details = {
@@ -1710,7 +1710,6 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
/* Careful - we need to zap private pages too! */
.even_cows = true,
};
- MA_STATE(mas, mt, vma->vm_end, vma->vm_end);

mmu_notifier_range_init(&range, MMU_NOTIFY_UNMAP, 0, vma->vm_mm,
start_addr, end_addr);
@@ -1718,7 +1717,7 @@ void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
do {
unmap_single_vma(tlb, vma, start_addr, end_addr, &details,
mm_wr_locked);
- } while ((vma = mas_find(&mas, end_addr - 1)) != NULL);
+ } while ((vma = mas_find(mas, tree_end - 1)) != NULL);
mmu_notifier_invalidate_range_end(&range);
}

diff --git a/mm/mmap.c b/mm/mmap.c
index 1503a7bdb192..8e5563668b18 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -76,10 +76,10 @@ int mmap_rnd_compat_bits __read_mostly = CONFIG_ARCH_MMAP_RND_COMPAT_BITS;
static bool ignore_rlimit_data;
core_param(ignore_rlimit_data, ignore_rlimit_data, bool, 0644);

-static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
+static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
struct vm_area_struct *vma, struct vm_area_struct *prev,
struct vm_area_struct *next, unsigned long start,
- unsigned long end, bool mm_wr_locked);
+ unsigned long end, unsigned long tree_end, bool mm_wr_locked);

static pgprot_t vm_pgprot_modify(pgprot_t oldprot, unsigned long vm_flags)
{
@@ -2221,18 +2221,20 @@ static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas)
*
* Called with the mm semaphore held.
*/
-static void unmap_region(struct mm_struct *mm, struct maple_tree *mt,
+static void unmap_region(struct mm_struct *mm, struct ma_state *mas,
struct vm_area_struct *vma, struct vm_area_struct *prev,
- struct vm_area_struct *next,
- unsigned long start, unsigned long end, bool mm_wr_locked)
+ struct vm_area_struct *next, unsigned long start,
+ unsigned long end, unsigned long tree_end, bool mm_wr_locked)
{
struct mmu_gather tlb;
+ unsigned long mt_start = mas->index;

lru_add_drain();
tlb_gather_mmu(&tlb, mm);
update_hiwater_rss(mm);
- unmap_vmas(&tlb, mt, vma, start, end, mm_wr_locked);
- free_pgtables(&tlb, mt, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
+ unmap_vmas(&tlb, mas, vma, start, end, tree_end, mm_wr_locked);
+ mas_set(mas, mt_start);
+ free_pgtables(&tlb, mas, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS,
next ? next->vm_start : USER_PGTABLES_CEILING,
mm_wr_locked);
tlb_finish_mmu(&tlb);
@@ -2338,7 +2340,6 @@ static inline int munmap_sidetree(struct vm_area_struct *vma,
struct ma_state *mas_detach)
{
vma_start_write(vma);
- mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1);
if (mas_store_gfp(mas_detach, vma, GFP_KERNEL))
return -ENOMEM;

@@ -2415,6 +2416,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (error)
goto end_split_failed;
}
+ mas_set(&mas_detach, count);
error = munmap_sidetree(next, &mas_detach);
if (error)
goto munmap_sidetree_failed;
@@ -2450,17 +2452,17 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
/* Make sure no VMAs are about to be lost. */
{
- MA_STATE(test, &mt_detach, start, end - 1);
+ MA_STATE(test, &mt_detach, 0, 0);
struct vm_area_struct *vma_mas, *vma_test;
int test_count = 0;

vma_iter_set(vmi, start);
rcu_read_lock();
- vma_test = mas_find(&test, end - 1);
+ vma_test = mas_find(&test, count - 1);
for_each_vma_range(*vmi, vma_mas, end) {
BUG_ON(vma_mas != vma_test);
test_count++;
- vma_test = mas_next(&test, end - 1);
+ vma_test = mas_next(&test, count - 1);
}
rcu_read_unlock();
BUG_ON(count != test_count);
@@ -2490,9 +2492,11 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
* We can free page tables without write-locking mmap_lock because VMAs
* were isolated before we downgraded mmap_lock.
*/
- unmap_region(mm, &mt_detach, vma, prev, next, start, end, !downgrade);
+ mas_set(&mas_detach, 1);
+ unmap_region(mm, &mas_detach, vma, prev, next, start, end, count,
+ !downgrade);
/* Statistics and freeing VMAs */
- mas_set(&mas_detach, start);
+ mas_set(&mas_detach, 0);
remove_mt(mm, &mas_detach);
__mt_destroy(&mt_detach);

@@ -2800,9 +2804,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
fput(vma->vm_file);
vma->vm_file = NULL;

+ vma_iter_set(&vmi, vma->vm_end);
/* Undo any partial mapping done by a device driver. */
- unmap_region(mm, &mm->mm_mt, vma, prev, next, vma->vm_start,
- vma->vm_end, true);
+ unmap_region(mm, &vmi.mas, vma, prev, next, vma->vm_start,
+ vma->vm_end, vma->vm_end, true);
}
if (file && (vm_flags & VM_SHARED))
mapping_unmap_writable(file->f_mapping);
@@ -3131,7 +3136,7 @@ void exit_mmap(struct mm_struct *mm)
tlb_gather_mmu_fullmm(&tlb, mm);
/* update_hiwater_rss(mm) here? but nobody should be looking */
/* Use ULONG_MAX here to ensure all VMAs in the mm are unmapped */
- unmap_vmas(&tlb, &mm->mm_mt, vma, 0, ULONG_MAX, false);
+ unmap_vmas(&tlb, &mas, vma, 0, ULONG_MAX, ULONG_MAX, false);
mmap_read_unlock(mm);

/*
@@ -3141,7 +3146,8 @@ void exit_mmap(struct mm_struct *mm)
set_bit(MMF_OOM_SKIP, &mm->flags);
mmap_write_lock(mm);
mt_clear_in_rcu(&mm->mm_mt);
- free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS,
+ mas_set(&mas, vma->vm_end);
+ free_pgtables(&tlb, &mas, vma, FIRST_USER_ADDRESS,
USER_PGTABLES_CEILING, true);
tlb_finish_mmu(&tlb);

@@ -3150,6 +3156,7 @@ void exit_mmap(struct mm_struct *mm)
* enabled, without holding any MM locks besides the unreachable
* mmap_write_lock.
*/
+ mas_set(&mas, vma->vm_end);
do {
if (vma->vm_flags & VM_ACCOUNT)
nr_accounted += vma_pages(vma);
--
2.39.2


2023-06-12 20:48:16

by Liam R. Howlett

Subject: [PATCH v2 05/16] mm: Remove prev check from do_vmi_align_munmap()

If prev does not exist, the vma iterator will be set to MAS_NONE,
which is treated as MAS_START when mas_next() or mas_find() is used.
In this case, the next caller will be the vma iterator, which uses
mas_find() under the hood and will now do what the user expects.
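
The iterator behaviour being relied on, sketched (not a hunk from the
patch):

```c
/* vma_prev() leaves the state as MAS_NONE when there is no previous
 * VMA.  The next mas_find()-based call treats MAS_NONE as MAS_START
 * and restarts the walk, so an explicit reset to 'start' is not
 * needed.
 */
prev = vma_prev(vmi);		/* may set the state to MAS_NONE */
next = vma_find(vmi, end);	/* mas_find() restarts from the top */
```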

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mmap.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 8e5563668b18..fd3f505b40cc 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2402,8 +2402,6 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
}

prev = vma_prev(vmi);
- if (unlikely((!prev)))
- vma_iter_set(vmi, start);

/*
* Detach a range of VMAs from the mm. Using next as a temp variable as
--
2.39.2


2023-06-12 20:48:24

by Liam R. Howlett

Subject: [PATCH v2 11/16] mm: Set up vma iterator for vma_iter_prealloc() calls

Set the correct limits for vma_iter_prealloc() calls so that the maple
tree can be smarter about how many nodes are needed.
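
The convention used throughout the diff pairs a range configuration
with the preallocation, for example:

```c
/* Tell the iterator the exact range and entry of the upcoming store
 * before preallocating, so mas_preallocate() can size the request.
 */
vma_iter_config(&vmi, vma->vm_start, vma->vm_end);
if (vma_iter_prealloc(&vmi, vma))
	return -ENOMEM;
/* ... set up the VMA ... */
vma_iter_store(&vmi, vma);	/* consumes the preallocated nodes */
```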

Signed-off-by: Liam R. Howlett <[email protected]>
---
fs/exec.c | 1 +
mm/internal.h | 18 +++++-------
mm/mmap.c | 81 +++++++++++++++++++++++++++++++--------------------
mm/nommu.c | 33 +++++++++------------
4 files changed, 72 insertions(+), 61 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 25c65b64544b..dc0ba74ebb74 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -697,6 +697,7 @@ static int shift_arg_pages(struct vm_area_struct *vma, unsigned long shift)
if (vma != vma_next(&vmi))
return -EFAULT;

+ vma_iter_prev_range(&vmi);
/*
* cover the whole range: [new_start, old_end)
*/
diff --git a/mm/internal.h b/mm/internal.h
index d78fd0fafa3b..531b2e95146c 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1045,23 +1045,21 @@ static inline void vma_iter_config(struct vma_iterator *vmi,
/*
* VMA Iterator functions shared between nommu and mmap
*/
-static inline int vma_iter_prealloc(struct vma_iterator *vmi)
+static inline int vma_iter_prealloc(struct vma_iterator *vmi,
+ struct vm_area_struct *vma)
{
- return mas_preallocate(&vmi->mas, NULL, GFP_KERNEL);
+ return mas_preallocate(&vmi->mas, vma, GFP_KERNEL);
}

-static inline void vma_iter_clear(struct vma_iterator *vmi,
- unsigned long start, unsigned long end)
+static inline void vma_iter_clear(struct vma_iterator *vmi)
{
- mas_set_range(&vmi->mas, start, end - 1);
mas_store_prealloc(&vmi->mas, NULL);
}

static inline int vma_iter_clear_gfp(struct vma_iterator *vmi,
unsigned long start, unsigned long end, gfp_t gfp)
{
- vmi->mas.index = start;
- vmi->mas.last = end - 1;
+ __mas_set_range(&vmi->mas, start, end - 1);
mas_store_gfp(&vmi->mas, NULL, gfp);
if (unlikely(mas_is_err(&vmi->mas)))
return -ENOMEM;
@@ -1098,8 +1096,7 @@ static inline void vma_iter_store(struct vma_iterator *vmi,
((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
vma_iter_invalidate(vmi);

- vmi->mas.index = vma->vm_start;
- vmi->mas.last = vma->vm_end - 1;
+ __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
mas_store_prealloc(&vmi->mas, vma);
}

@@ -1110,8 +1107,7 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
((vmi->mas.index > vma->vm_start) || (vmi->mas.last < vma->vm_start)))
vma_iter_invalidate(vmi);

- vmi->mas.index = vma->vm_start;
- vmi->mas.last = vma->vm_end - 1;
+ __mas_set_range(&vmi->mas, vma->vm_start, vma->vm_end - 1);
mas_store_gfp(&vmi->mas, vma, gfp);
if (unlikely(mas_is_err(&vmi->mas)))
return -ENOMEM;
diff --git a/mm/mmap.c b/mm/mmap.c
index 22c71dff762b..eaebcc8f60d2 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -411,7 +411,8 @@ static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
VMA_ITERATOR(vmi, mm, 0);
struct address_space *mapping = NULL;

- if (vma_iter_prealloc(&vmi))
+ vma_iter_config(&vmi, vma->vm_start, vma->vm_end);
+ if (vma_iter_prealloc(&vmi, vma))
return -ENOMEM;

if (vma->vm_file) {
@@ -664,19 +665,16 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
/* Only handles expanding */
VM_WARN_ON(vma->vm_start < start || vma->vm_end > end);

- if (vma_iter_prealloc(vmi))
+ /* Note: vma iterator must be pointing to 'start' */
+ vma_iter_config(vmi, start, end);
+ if (vma_iter_prealloc(vmi, vma))
goto nomem;

vma_prepare(&vp);
vma_adjust_trans_huge(vma, start, end, 0);
- /* VMA iterator points to previous, so set to start if necessary */
- if (vma_iter_addr(vmi) != start)
- vma_iter_set(vmi, start);
-
vma->vm_start = start;
vma->vm_end = end;
vma->vm_pgoff = pgoff;
- /* Note: mas must be pointing to the expanding VMA */
vma_iter_store(vmi, vma);

vma_complete(&vp, vmi, vma->vm_mm);
@@ -703,19 +701,19 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,

WARN_ON((vma->vm_start != start) && (vma->vm_end != end));

- if (vma_iter_prealloc(vmi))
+ if (vma->vm_start < start)
+ vma_iter_config(vmi, vma->vm_start, start);
+ else
+ vma_iter_config(vmi, end, vma->vm_end);
+
+ if (vma_iter_prealloc(vmi, NULL))
return -ENOMEM;

init_vma_prep(&vp, vma);
vma_prepare(&vp);
vma_adjust_trans_huge(vma, start, end, 0);

- if (vma->vm_start < start)
- vma_iter_clear(vmi, vma->vm_start, start);
-
- if (vma->vm_end > end)
- vma_iter_clear(vmi, end, vma->vm_end);
-
+ vma_iter_clear(vmi);
vma->vm_start = start;
vma->vm_end = end;
vma->vm_pgoff = pgoff;
@@ -991,7 +989,17 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
if (err)
return NULL;

- if (vma_iter_prealloc(vmi))
+ if (vma_start < vma->vm_start || vma_end > vma->vm_end)
+ vma_expanded = true;
+
+ if (vma_expanded) {
+ vma_iter_config(vmi, vma_start, vma_end);
+ } else {
+ vma_iter_config(vmi, adjust->vm_start + adj_start,
+ adjust->vm_end);
+ }
+
+ if (vma_iter_prealloc(vmi, vma))
return NULL;

init_multi_vma_prep(&vp, vma, adjust, remove, remove2);
@@ -1000,8 +1008,6 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,

vma_prepare(&vp);
vma_adjust_trans_huge(vma, vma_start, vma_end, adj_start);
- if (vma_start < vma->vm_start || vma_end > vma->vm_end)
- vma_expanded = true;

vma->vm_start = vma_start;
vma->vm_end = vma_end;
@@ -1945,7 +1951,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
struct vm_area_struct *next;
unsigned long gap_addr;
int error = 0;
- MA_STATE(mas, &mm->mm_mt, 0, 0);
+ MA_STATE(mas, &mm->mm_mt, vma->vm_start, address);

if (!(vma->vm_flags & VM_GROWSUP))
return -EFAULT;
@@ -1970,6 +1976,10 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
/* Check that both stack segments have the same anon_vma? */
}

+ if (next)
+ mas_prev_range(&mas, address);
+
+ __mas_set_range(&mas, vma->vm_start, address - 1);
if (mas_preallocate(&mas, vma, GFP_KERNEL))
return -ENOMEM;

@@ -2013,7 +2023,6 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
anon_vma_interval_tree_pre_update_vma(vma);
vma->vm_end = address;
/* Overwrite old entry in mtree. */
- mas_set_range(&mas, vma->vm_start, address - 1);
mas_store_prealloc(&mas, vma);
anon_vma_interval_tree_post_update_vma(vma);
spin_unlock(&mm->page_table_lock);
@@ -2052,6 +2061,10 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
return -ENOMEM;
}

+ if (prev)
+ mas_next_range(&mas, vma->vm_start);
+
+ __mas_set_range(&mas, address, vma->vm_end - 1);
if (mas_preallocate(&mas, vma, GFP_KERNEL))
return -ENOMEM;

@@ -2096,7 +2109,6 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address)
vma->vm_start = address;
vma->vm_pgoff -= grow;
/* Overwrite old entry in mtree. */
- mas_set_range(&mas, address, vma->vm_end - 1);
mas_store_prealloc(&mas, vma);
anon_vma_interval_tree_post_update_vma(vma);
spin_unlock(&mm->page_table_lock);
@@ -2255,10 +2267,6 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (!new)
return -ENOMEM;

- err = -ENOMEM;
- if (vma_iter_prealloc(vmi))
- goto out_free_vma;
-
if (new_below) {
new->vm_end = addr;
} else {
@@ -2266,6 +2274,11 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
new->vm_pgoff += ((addr - vma->vm_start) >> PAGE_SHIFT);
}

+ err = -ENOMEM;
+ vma_iter_config(vmi, new->vm_start, new->vm_end);
+ if (vma_iter_prealloc(vmi, new))
+ goto out_free_vma;
+
err = vma_dup_policy(vma, new);
if (err)
goto out_free_vmi;
@@ -2600,8 +2613,12 @@ unsigned long mmap_region(struct file *file, unsigned long addr,

next = vma_next(&vmi);
prev = vma_prev(&vmi);
- if (vm_flags & VM_SPECIAL)
+ if (vm_flags & VM_SPECIAL) {
+ if (prev)
+ vma_iter_next_range(&vmi);
+
goto cannot_expand;
+ }

/* Attempt to expand an old mapping */
/* Check next */
@@ -2611,6 +2628,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
merge_end = next->vm_end;
vma = next;
vm_pgoff = next->vm_pgoff - pglen;
+
}

/* Check prev */
@@ -2622,9 +2640,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
merge_start = prev->vm_start;
vma = prev;
vm_pgoff = prev->vm_pgoff;
+ } else if (prev) {
+ vma_iter_next_range(&vmi);
}

-
/* Actually expand, if possible */
if (vma &&
!vma_expand(&vmi, vma, merge_start, merge_end, vm_pgoff, next)) {
@@ -2633,9 +2652,6 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
}

cannot_expand:
- if (prev)
- vma_iter_next_range(&vmi);
-
/*
* Determine the object being mapped and call the appropriate
* specific mapper. the address has already been validated, but
@@ -2721,7 +2737,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
goto close_and_free_vma;

error = -ENOMEM;
- if (vma_iter_prealloc(&vmi))
+ if (vma_iter_prealloc(&vmi, vma))
goto close_and_free_vma;

if (vma->vm_file)
@@ -2994,7 +3010,8 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (vma && vma->vm_end == addr && !vma_policy(vma) &&
can_vma_merge_after(vma, flags, NULL, NULL,
addr >> PAGE_SHIFT, NULL_VM_UFFD_CTX, NULL)) {
- if (vma_iter_prealloc(vmi))
+ vma_iter_config(vmi, vma->vm_start, addr + len);
+ if (vma_iter_prealloc(vmi, vma))
goto unacct_fail;

init_vma_prep(&vp, vma);
@@ -3009,6 +3026,8 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
goto out;
}

+ if (vma)
+ vma_iter_next_range(vmi);
/* create a vma struct for an anonymous mapping */
vma = vm_area_alloc(mm);
if (!vma)
diff --git a/mm/nommu.c b/mm/nommu.c
index a764b86b132a..a96b889cc17e 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -583,7 +583,8 @@ static int delete_vma_from_mm(struct vm_area_struct *vma)
{
VMA_ITERATOR(vmi, vma->vm_mm, vma->vm_start);

- if (vma_iter_prealloc(&vmi)) {
+ vma_iter_config(&vmi, vma->vm_start, vma->vm_end);
+ if (vma_iter_prealloc(&vmi, vma)) {
pr_warn("Allocation of vma tree for process %d failed\n",
current->pid);
return -ENOMEM;
@@ -591,7 +592,7 @@ static int delete_vma_from_mm(struct vm_area_struct *vma)
cleanup_vma_from_mm(vma);

/* remove from the MM's tree and list */
- vma_iter_clear(&vmi, vma->vm_start, vma->vm_end);
+ vma_iter_clear(&vmi);
return 0;
}
/*
@@ -1041,9 +1042,6 @@ unsigned long do_mmap(struct file *file,
if (!vma)
goto error_getting_vma;

- if (vma_iter_prealloc(&vmi))
- goto error_vma_iter_prealloc;
-
region->vm_usage = 1;
region->vm_flags = vm_flags;
region->vm_pgoff = pgoff;
@@ -1185,6 +1183,10 @@ unsigned long do_mmap(struct file *file,

share:
BUG_ON(!vma->vm_region);
+ vma_iter_config(&vmi, vma->vm_start, vma->vm_end);
+ if (vma_iter_prealloc(&vmi, vma))
+ goto error_just_free;
+
setup_vma_to_mm(vma, current->mm);
current->mm->map_count++;
/* add the VMA to the tree */
@@ -1231,14 +1233,6 @@ unsigned long do_mmap(struct file *file,
len, current->pid);
show_free_areas(0, NULL);
return -ENOMEM;
-
-error_vma_iter_prealloc:
- kmem_cache_free(vm_region_jar, region);
- vm_area_free(vma);
- pr_warn("Allocation of vma tree for process %d failed\n", current->pid);
- show_free_areas(0, NULL);
- return -ENOMEM;
-
}

unsigned long ksys_mmap_pgoff(unsigned long addr, unsigned long len,
@@ -1323,12 +1317,6 @@ int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (!new)
goto err_vma_dup;

- if (vma_iter_prealloc(vmi)) {
- pr_warn("Allocation of vma tree for process %d failed\n",
- current->pid);
- goto err_vmi_preallocate;
- }
-
/* most fields are the same, copy all, and then fixup */
*region = *vma->vm_region;
new->vm_region = region;
@@ -1342,6 +1330,13 @@ int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
region->vm_pgoff = new->vm_pgoff += npages;
}

+ vma_iter_config(vmi, new->vm_start, new->vm_end);
+ if (vma_iter_prealloc(vmi, vma)) {
+ pr_warn("Allocation of vma tree for process %d failed\n",
+ current->pid);
+ goto err_vmi_preallocate;
+ }
+
if (new->vm_ops && new->vm_ops->open)
new->vm_ops->open(new);

--
2.39.2


2023-06-12 20:48:28

by Liam R. Howlett

Subject: [PATCH v2 03/16] mm: Move unmap_vmas() declaration to internal header

unmap_vmas() is not used outside of the core mm code.  Move the
declaration to mm/internal.h.

Signed-off-by: Liam R. Howlett <[email protected]>
---
include/linux/mm.h | 4 ----
mm/internal.h | 4 ++++
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 66032f0d515c..209fe13dc8a2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2302,10 +2302,6 @@ static inline void zap_vma_pages(struct vm_area_struct *vma)
zap_page_range_single(vma, vma->vm_start,
vma->vm_end - vma->vm_start, NULL);
}
-void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
- struct vm_area_struct *start_vma, unsigned long start,
- unsigned long end, bool mm_wr_locked);
-
struct mmu_notifier_range;

void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
diff --git a/mm/internal.h b/mm/internal.h
index f45f5eb4514f..9b665c4e5fc0 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1099,6 +1099,10 @@ static inline int vma_iter_store_gfp(struct vma_iterator *vmi,
return 0;
}

+void unmap_vmas(struct mmu_gather *tlb, struct maple_tree *mt,
+ struct vm_area_struct *vma, unsigned long start_addr,
+ unsigned long end_addr, bool mm_wr_locked);
+
/*
* VMA lock generalization
*/
--
2.39.2


2023-06-12 20:48:42

by Liam R. Howlett

Subject: [PATCH v2 02/16] maple_tree: Add benchmarking for mas_prev()

Signed-off-by: Liam R. Howlett <[email protected]>
---
lib/test_maple_tree.c | 36 ++++++++++++++++++++++++++++++++++++
1 file changed, 36 insertions(+)

diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
index 3dbf99c3f2b1..15d7b7bce7d6 100644
--- a/lib/test_maple_tree.c
+++ b/lib/test_maple_tree.c
@@ -45,6 +45,7 @@ atomic_t maple_tree_tests_passed;
/* #define BENCH_MT_FOR_EACH */
/* #define BENCH_FORK */
/* #define BENCH_MAS_FOR_EACH */
+/* #define BENCH_MAS_PREV */

#ifdef __KERNEL__
#define mt_set_non_kernel(x) do {} while (0)
@@ -1735,7 +1736,35 @@ static noinline void __init bench_mas_for_each(struct maple_tree *mt)

}
#endif
+#if defined(BENCH_MAS_PREV)
+static noinline void __init bench_mas_prev(struct maple_tree *mt)
+{
+ int i, count = 1000000;
+ unsigned long max = 2500;
+ void *entry;
+ MA_STATE(mas, mt, 0, 0);
+
+ for (i = 0; i < max; i += 5) {
+ int gap = 4;
+ if (i % 30 == 0)
+ gap = 3;
+ mtree_store_range(mt, i, i + gap, xa_mk_value(i), GFP_KERNEL);
+ }

+ rcu_read_lock();
+ for (i = 0; i < count; i++) {
+ unsigned long j = 2495;
+
+ mas_set(&mas, ULONG_MAX);
+ while ((entry = mas_prev(&mas, 0)) != NULL) {
+ MT_BUG_ON(mt, entry != xa_mk_value(j));
+ j -= 5;
+ }
+ }
+ rcu_read_unlock();
+
+}
+#endif
/* check_forking - simulate the kernel forking sequence with the tree. */
static noinline void __init check_forking(struct maple_tree *mt)
{
@@ -3468,6 +3497,13 @@ static int __init maple_tree_seed(void)
mtree_destroy(&tree);
goto skip;
#endif
+#if defined(BENCH_MAS_PREV)
+#define BENCH
+ mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
+ bench_mas_prev(&tree);
+ mtree_destroy(&tree);
+ goto skip;
+#endif

mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
check_iteration(&tree);
--
2.39.2


2023-06-12 20:48:43

by Liam R. Howlett

Subject: [PATCH v2 08/16] maple_tree: Adjust node allocation on mas_rebalance()

mas_rebalance() is called to rebalance an insufficient node into a
single node or two sufficient nodes.  The preallocation estimate is
always too high in this case, as the height of the tree will never
grow and there is no possibility of a three-way split, so revise the
node allocation count.

Signed-off-by: Liam R. Howlett <[email protected]>
---
lib/maple_tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index bfffbb7cab26..815c64f4c370 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -3166,7 +3166,7 @@ static inline int mas_rebalance(struct ma_state *mas,
* tries to combine the data in the same way. If one node contains the
* entire range of the tree, then that node is used as a new root node.
*/
- mas_node_count(mas, 1 + empty_count * 3);
+ mas_node_count(mas, empty_count * 2 - 1);
if (mas_is_err(mas))
return 0;

--
2.39.2


2023-06-12 20:48:59

by Liam R. Howlett

Subject: [PATCH v2 13/16] maple_tree: Update mas_preallocate() testing

Signed-off-by: Liam R. Howlett <[email protected]>
---
tools/testing/radix-tree/maple.c | 27 ++++++++++++++++-----------
1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index cfadc4b75d51..c42033172276 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -35383,6 +35383,8 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
for (i = 0; i <= max; i++)
mtree_test_store_range(mt, i * 10, i * 10 + 5, &i);

+ /* Spanning store */
+ mas_set_range(&mas, 470, 500);
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
@@ -35406,7 +35408,6 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
- MT_BUG_ON(mt, allocated == 0);
MT_BUG_ON(mt, allocated != 1 + height * 3);
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1);
@@ -35420,7 +35421,6 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
- MT_BUG_ON(mt, allocated == 0);
MT_BUG_ON(mt, allocated != 1 + height * 3);
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1);
@@ -35434,7 +35434,6 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
- MT_BUG_ON(mt, allocated == 0);
MT_BUG_ON(mt, allocated != 1 + height * 3);
mn = mas_pop_node(&mas);
MT_BUG_ON(mt, mas_allocated(&mas) != allocated - 1);
@@ -35448,33 +35447,37 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
- MT_BUG_ON(mt, allocated == 0);
MT_BUG_ON(mt, allocated != 1 + height * 3);
mas_store_prealloc(&mas, ptr);
MT_BUG_ON(mt, mas_allocated(&mas) != 0);

+ /* Slot store does not need allocations */
+ mas_set_range(&mas, 6, 9);
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
- height = mas_mt_height(&mas);
- MT_BUG_ON(mt, allocated == 0);
- MT_BUG_ON(mt, allocated != 1 + height * 3);
+ MT_BUG_ON(mt, allocated != 0);
mas_store_prealloc(&mas, ptr);
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
+
+ mas_set_range(&mas, 6, 10);
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
- MT_BUG_ON(mt, allocated == 0);
- MT_BUG_ON(mt, allocated != 1 + height * 3);
+ MT_BUG_ON(mt, allocated != 0);
mas_store_prealloc(&mas, ptr);
+ MT_BUG_ON(mt, mas_allocated(&mas) != 0);

+ /* Split */
+ mas_set_range(&mas, 54, 54);
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
- MT_BUG_ON(mt, allocated == 0);
- MT_BUG_ON(mt, allocated != 1 + height * 3);
+ MT_BUG_ON(mt, allocated != 1 + height * 2);
mas_store_prealloc(&mas, ptr);
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
mt_set_non_kernel(1);
+ /* Spanning store */
+ mas_set_range(&mas, 1, 100);
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL & GFP_NOWAIT) == 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
@@ -35482,6 +35485,7 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
mas_destroy(&mas);


+ /* Spanning store */
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL) != 0);
allocated = mas_allocated(&mas);
height = mas_mt_height(&mas);
@@ -35489,6 +35493,7 @@ static noinline void __init check_prealloc(struct maple_tree *mt)
MT_BUG_ON(mt, allocated != 1 + height * 3);
mas_store_prealloc(&mas, ptr);
MT_BUG_ON(mt, mas_allocated(&mas) != 0);
+ mas_set_range(&mas, 0, 200);
mt_set_non_kernel(1);
MT_BUG_ON(mt, mas_preallocate(&mas, ptr, GFP_KERNEL & GFP_NOWAIT) == 0);
allocated = mas_allocated(&mas);
--
2.39.2


2023-06-12 20:49:09

by Liam R. Howlett

Subject: [PATCH v2 07/16] mm: Remove re-walk from mmap_region()

Using vma_iter_set() will reset the tree and cause a re-walk. Use
vma_iter_config() to set the write to a sub-set of the range. Change
the file case to also use vma_iter_config() so that the end is correctly
set.

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/internal.h | 8 ++++++++
mm/mmap.c | 4 ++--
2 files changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 24437f11d3c2..cdf06f680d6e 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1034,6 +1034,14 @@ static inline bool vma_soft_dirty_enabled(struct vm_area_struct *vma)
return !(vma->vm_flags & VM_SOFTDIRTY);
}

+static inline void vma_iter_config(struct vma_iterator *vmi,
+ unsigned long index, unsigned long last)
+{
+ MAS_BUG_ON(&vmi->mas, vmi->mas.node != MAS_START &&
+ (vmi->mas.index > index || vmi->mas.last < index));
+ __mas_set_range(&vmi->mas, index, last - 1);
+}
+
/*
* VMA Iterator functions shared between nommu and mmap
*/
diff --git a/mm/mmap.c b/mm/mmap.c
index fd3f505b40cc..8b3e58d6ac40 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2659,7 +2659,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
goto unacct_error;
}

- vma_iter_set(&vmi, addr);
+ vma_iter_config(&vmi, addr, end);
vma->vm_start = addr;
vma->vm_end = end;
vm_flags_init(vma, vm_flags);
@@ -2686,7 +2686,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
if (WARN_ON((addr != vma->vm_start)))
goto close_and_free_vma;

- vma_iter_set(&vmi, addr);
+ vma_iter_config(&vmi, addr, end);
/*
* If vm_flags changed after call_mmap(), we should try merge
* vma again as we may succeed this time.
--
2.39.2


2023-06-12 20:49:11

by Liam R. Howlett

Subject: [PATCH v2 10/16] mm: Use vma_iter_clear_gfp() in nommu

Move the definition of vma_iter_clear_gfp() from mmap.c to internal.h so
it can be used in the nommu code. This will reduce node preallocations
in nommu.

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/internal.h | 12 ++++++++++++
mm/mmap.c | 12 ------------
mm/nommu.c | 12 ++++--------
3 files changed, 16 insertions(+), 20 deletions(-)

diff --git a/mm/internal.h b/mm/internal.h
index 2691deca9699..d78fd0fafa3b 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -1057,6 +1057,18 @@ static inline void vma_iter_clear(struct vma_iterator *vmi,
mas_store_prealloc(&vmi->mas, NULL);
}

+static inline int vma_iter_clear_gfp(struct vma_iterator *vmi,
+ unsigned long start, unsigned long end, gfp_t gfp)
+{
+ vmi->mas.index = start;
+ vmi->mas.last = end - 1;
+ mas_store_gfp(&vmi->mas, NULL, gfp);
+ if (unlikely(mas_is_err(&vmi->mas)))
+ return -ENOMEM;
+
+ return 0;
+}
+
static inline struct vm_area_struct *vma_iter_load(struct vma_iterator *vmi)
{
return mas_walk(&vmi->mas);
diff --git a/mm/mmap.c b/mm/mmap.c
index 75b2a86e1faa..22c71dff762b 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -154,18 +154,6 @@ static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi,
return mas_prev(&vmi->mas, min);
}

-static inline int vma_iter_clear_gfp(struct vma_iterator *vmi,
- unsigned long start, unsigned long end, gfp_t gfp)
-{
- vmi->mas.index = start;
- vmi->mas.last = end - 1;
- mas_store_gfp(&vmi->mas, NULL, gfp);
- if (unlikely(mas_is_err(&vmi->mas)))
- return -ENOMEM;
-
- return 0;
-}
-
/*
* check_brk_limits() - Use platform specific check of range & verify mlock
* limits.
diff --git a/mm/nommu.c b/mm/nommu.c
index f670d9979a26..a764b86b132a 100644
--- a/mm/nommu.c
+++ b/mm/nommu.c
@@ -1383,17 +1383,13 @@ static int vmi_shrink_vma(struct vma_iterator *vmi,

/* adjust the VMA's pointers, which may reposition it in the MM's tree
* and list */
- if (vma_iter_prealloc(vmi)) {
- pr_warn("Allocation of vma tree for process %d failed\n",
- current->pid);
- return -ENOMEM;
- }
-
if (from > vma->vm_start) {
- vma_iter_clear(vmi, from, vma->vm_end);
+ if (vma_iter_clear_gfp(vmi, from, vma->vm_end, GFP_KERNEL))
+ return -ENOMEM;
vma->vm_end = from;
} else {
- vma_iter_clear(vmi, vma->vm_start, to);
+ if (vma_iter_clear_gfp(vmi, vma->vm_start, to, GFP_KERNEL))
+ return -ENOMEM;
vma->vm_start = to;
}

--
2.39.2


2023-06-12 21:01:26

by Liam R. Howlett

Subject: [PATCH v2 15/16] maple_tree: Reduce resets during store setup

mas_preallocate() may walk partially down the tree before finding that a
split or spanning store is needed. When the write occurs, relax the
reset logic so that partial walks will not restart, but walks that have
gone too far (a store that affects beyond the current node) will be
restarted.

Signed-off-by: Liam R. Howlett <[email protected]>
---
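
The relaxed reset rules can be summarized as a pure predicate. The
sketch below is a hypothetical stand-alone model (struct setup_model and
its fields are illustrative stand-ins for the real ma_state/ma_wr_state,
not maple-tree API); it mirrors the order of the checks in the hunk
below.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the state examined by mas_wr_store_setup(). */
struct setup_model {
	bool is_start;		/* walk has not begun yet */
	bool is_paused;		/* state was paused and is stale */
	bool is_none;		/* state points at nothing */
	bool storing_null;	/* the pending write stores NULL */
	bool leaf;		/* state currently points at a leaf */
	unsigned long last;	/* end of the pending write */
	unsigned long max;	/* maximum index of the current node */
};

/* True when the maple state should be reset before the write. */
static bool model_needs_reset(const struct setup_model *s)
{
	if (s->is_start)
		return false;		/* nothing to reuse or reset */
	if (s->is_paused || s->is_none)
		return true;
	if (s->last > s->max)
		return true;		/* write reaches beyond this node */
	if (!s->storing_null)
		return false;		/* partial walk can be reused */
	/* A NULL store ending exactly at a leaf boundary may extend. */
	if (s->leaf && s->last == s->max)
		return true;
	return false;
}
```

A partial walk whose write stays inside the current node (last <= max)
is now reused instead of being thrown away, which is the duplicate
walking this patch set avoids.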
lib/maple_tree.c | 37 ++++++++++++++++++++++++++-----------
1 file changed, 26 insertions(+), 11 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 7ac5b5457603..279b871f21a6 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5430,19 +5430,34 @@ static inline void mte_destroy_walk(struct maple_enode *enode,

static void mas_wr_store_setup(struct ma_wr_state *wr_mas)
{
+ if (mas_is_start(wr_mas->mas))
+ return;
+
if (unlikely(mas_is_paused(wr_mas->mas)))
- mas_reset(wr_mas->mas);
+ goto reset;

- if (!mas_is_start(wr_mas->mas)) {
- if (mas_is_none(wr_mas->mas)) {
- mas_reset(wr_mas->mas);
- } else {
- wr_mas->r_max = wr_mas->mas->max;
- wr_mas->type = mte_node_type(wr_mas->mas->node);
- if (mas_is_span_wr(wr_mas))
- mas_reset(wr_mas->mas);
- }
- }
+ if (unlikely(mas_is_none(wr_mas->mas)))
+ goto reset;
+
+ /*
+ * A less strict version of mas_is_span_wr() where we allow spanning
+ * writes within this node. This is to stop partial walks in
+ * mas_prealloc() from being reset.
+ */
+ if (wr_mas->mas->last > wr_mas->mas->max)
+ goto reset;
+
+ if (wr_mas->entry)
+ return;
+
+ if (mte_is_leaf(wr_mas->mas->node) &&
+ wr_mas->mas->last == wr_mas->mas->max)
+ goto reset;
+
+ return;
+
+reset:
+ mas_reset(wr_mas->mas);
}

/* Interface */
--
2.39.2


2023-06-12 21:04:08

by Liam R. Howlett

Subject: [PATCH v2 14/16] maple_tree: Refine mas_preallocate() node calculations

Calculate the number of nodes based on the pending write action instead
of assuming the worst case.

This addresses a performance regression observed on platforms that
have longer allocation timing.

Signed-off-by: Liam R. Howlett <[email protected]>
---
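
As a rough, self-contained illustration of the sizing logic this patch
introduces, the sketch below re-implements the decision tree with plain
values. struct wr_model and all of its fields are hypothetical stand-ins
for the real ma_wr_state/ma_state data, not maple-tree API; the
thresholds mirror the hunk below.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the write state inspected by the patch. */
struct wr_model {
	int height;	/* mas_mt_height(): current tree height */
	int slots;	/* mt_slots[] for the leaf node type */
	int min_slots;	/* mt_min_slots[] for the leaf node type */
	int new_end;	/* mas_wr_new_end(): leaf size after the write */
	bool spanning;	/* write crosses a node boundary */
	bool exact_fit;	/* write exactly covers an existing range */
	bool slot_store; /* write replaces adjacent slots in place */
	bool append;	/* write lands just past the last slot */
};

/* Number of nodes to request, following the patched mas_preallocate(). */
static int model_prealloc_request(const struct wr_model *w)
{
	if (w->spanning)
		return 1 + w->height * 3;	/* spanning store, worst case */
	if (w->exact_fit)
		return 0;			/* overwrite in place */
	if (w->slot_store)
		return 0;			/* slot store reuses the leaf */
	if (w->new_end >= w->slots)
		return 1 + w->height * 2;	/* split, worst case */
	if (w->append)
		return 0;			/* append fits in the leaf */
	if (w->new_end - 1 <= w->min_slots)
		return w->height * 2 - 1;	/* possible rebalance */
	return 1;				/* plain node store */
}
```

For example, a spanning store into a height-3 tree still requests the
old worst case of 10 nodes, while an exact-fit overwrite requests none.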
lib/maple_tree.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 47 insertions(+), 1 deletion(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 048d6413a114..7ac5b5457603 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5541,9 +5541,55 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
*/
int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
{
+ MA_WR_STATE(wr_mas, mas, entry);
+ unsigned char node_size;
+ int request = 1;
int ret;

- mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp);
+
+ if (unlikely(!mas->index && mas->last == ULONG_MAX))
+ goto ask_now;
+
+ mas_wr_store_setup(&wr_mas);
+ wr_mas.content = mas_start(mas);
+ /* Root expand */
+ if (unlikely(mas_is_none(mas) || mas_is_ptr(mas)))
+ goto ask_now;
+
+ if (unlikely(!mas_wr_walk(&wr_mas))) {
+ /* Spanning store, use worst case for now */
+ request = 1 + mas_mt_height(mas) * 3;
+ goto ask_now;
+ }
+
+ /* At this point, we are at the leaf node that needs to be altered. */
+ /* Exact fit, no nodes needed. */
+ if (wr_mas.r_min == mas->index && wr_mas.r_max == mas->last)
+ return 0;
+
+ mas_wr_end_piv(&wr_mas);
+ node_size = mas_wr_new_end(&wr_mas);
+ /* Slot store can avoid using any nodes */
+ if (node_size == wr_mas.node_end && wr_mas.offset_end - mas->offset == 1)
+ return 0;
+
+ if (node_size >= mt_slots[wr_mas.type]) {
+ /* Split, worst case for now. */
+ request = 1 + mas_mt_height(mas) * 2;
+ goto ask_now;
+ }
+
+ /* Appending does not need any nodes */
+ if (node_size == wr_mas.node_end + 1 && mas->offset == wr_mas.node_end)
+ return 0;
+
+ /* Potential spanning rebalance collapsing a node, use worst-case */
+ if (node_size - 1 <= mt_min_slots[wr_mas.type])
+ request = mas_mt_height(mas) * 2 - 1;
+
+ /* node store needs one node */
+ask_now:
+ mas_node_count_gfp(mas, request, gfp);
mas->mas_flags |= MA_STATE_PREALLOC;
if (likely(!mas_is_err(mas)))
return 0;
--
2.39.2


2023-06-12 21:05:30

by Liam R. Howlett

Subject: [PATCH v2 16/16] mm/mmap: Change vma iteration order in do_vmi_align_munmap()

By delaying the setting of prev/next VMA until after the write of NULL,
the probability of the prev/next VMA already being in the CPU cache is
significantly increased, especially for larger munmap operations. It
also means that prev/next will be loaded closer to when they are used.

This has the consequence of needing to change the for_each() to a do {}
for_each() when writing to the side tree.

Since prev will be set later in the function, it is better to reverse
the splitting direction of the start VMA (modify the new_below argument
to __split_vma).

Using vma_iter_prev_range() to walk back to the correct location in
the tree will, for the most part, mean walking within the CPU cache.
Usually, this is two steps vs a node reset and a tree re-walk.

Signed-off-by: Liam R. Howlett <[email protected]>
---
mm/mmap.c | 27 ++++++++++++---------------
1 file changed, 12 insertions(+), 15 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index eaebcc8f60d2..429e314bd134 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2395,20 +2395,17 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
if (end < vma->vm_end && mm->map_count >= sysctl_max_map_count)
goto map_count_exceeded;

- error = __split_vma(vmi, vma, start, 0);
+ error = __split_vma(vmi, vma, start, 1);
if (error)
goto start_split_failed;
-
- vma = vma_iter_load(vmi);
}

- prev = vma_prev(vmi);
-
/*
* Detach a range of VMAs from the mm. Using next as a temp variable as
* it is always overwritten.
*/
- for_each_vma_range(*vmi, next, end) {
+ next = vma;
+ do {
/* Does it split the end? */
if (next->vm_end > end) {
error = __split_vma(vmi, next, end, 0);
@@ -2440,13 +2437,7 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
BUG_ON(next->vm_start < start);
BUG_ON(next->vm_start > end);
#endif
- }
-
- if (vma_iter_end(vmi) > end)
- next = vma_iter_load(vmi);
-
- if (!next)
- next = vma_next(vmi);
+ } for_each_vma_range(*vmi, next, end);

#if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
/* Make sure no VMAs are about to be lost. */
@@ -2467,12 +2458,18 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
BUG_ON(count != test_count);
}
#endif
- /* Point of no return */
- vma_iter_set(vmi, start);
+ while (vma_iter_addr(vmi) > start)
+ vma_iter_prev_range(vmi);
+
if (vma_iter_clear_gfp(vmi, start, end, GFP_KERNEL))
return -ENOMEM;

mm->map_count -= count;
+ prev = vma_iter_prev_range(vmi);
+ next = vma_next(vmi);
+ if (next)
+ vma_iter_prev_range(vmi);
+
/*
* Do not downgrade mmap_lock if we are next to VM_GROWSDOWN or
* VM_GROWSUP VMA. Such VMAs can change their size under
--
2.39.2


2023-06-15 08:57:04

by Yin, Fengwei

Subject: Re: [PATCH v2 00/16] Reduce preallocations for maple tree

Hi Liam,

On 6/13/2023 4:39 AM, Liam R. Howlett wrote:
> Initial work on preallocations showed no regression in performance
> during testing, but recently some users (both on [1] and off [android]
> list) have reported that preallocating the worst-case number of nodes
> has caused some slow down. This patch set addresses the number of
> allocations in a few ways.
>
> During munmap() most munmap() operations will remove a single VMA, so
> leverage the fact that the maple tree can place a single pointer at
> range 0 - 0 without allocating. This is done by changing the index in
> the 'sidetree'.
>
> Re-introduce the entry argument to mas_preallocate() so that a more
> intelligent guess of the node count can be made.
>
> During development of v2 of this patch set, I also noticed that the
> number of nodes being allocated for a rebalance was beyond what could
> possibly be needed. This is addressed in patch 0008.
>
> Patches are in the following order:
> 0001-0002: Testing framework for benchmarking some operations
> 0003-0004: Reduction of maple node allocation in sidetree
> 0005: Small cleanup of do_vmi_align_munmap()
> 0006-0015: mas_preallocate() calculation change
> 0016: Change the vma iterator order
>
> Changes since v1:
> - Reduced preallocations for append and slot store - Thanks Peng Zhang
> - Added patch to reduce node allocations for mas_rebalance() (patch 0008)
> - Reduced resets during store setup to avoid duplicate walking (patch 0015)

AIM9.page_test in my local env:
before-preallocation: 763812
preallocation: 676600
preallocation fixup v2 (this patchset): 734060

For this specific test, the v2 patch works perfectly. Thanks.

Regards
Yin, Fengwei

>
> v1: https://lore.kernel.org/lkml/[email protected]/
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/
>
> Liam R. Howlett (16):
> maple_tree: Add benchmarking for mas_for_each
> maple_tree: Add benchmarking for mas_prev()
> mm: Move unmap_vmas() declaration to internal header
> mm: Change do_vmi_align_munmap() side tree index
> mm: Remove prev check from do_vmi_align_munmap()
> maple_tree: Introduce __mas_set_range()
> mm: Remove re-walk from mmap_region()
> maple_tree: Adjust node allocation on mas_rebalance()
> maple_tree: Re-introduce entry to mas_preallocate() arguments
> mm: Use vma_iter_clear_gfp() in nommu
> mm: Set up vma iterator for vma_iter_prealloc() calls
> maple_tree: Move mas_wr_end_piv() below mas_wr_extend_null()
> maple_tree: Update mas_preallocate() testing
> maple_tree: Refine mas_preallocate() node calculations
> maple_tree: Reduce resets during store setup
> mm/mmap: Change vma iteration order in do_vmi_align_munmap()
>
> fs/exec.c | 1 +
> include/linux/maple_tree.h | 23 ++++-
> include/linux/mm.h | 4 -
> lib/maple_tree.c | 121 ++++++++++++++++------
> lib/test_maple_tree.c | 74 +++++++++++++
> mm/internal.h | 40 ++++++--
> mm/memory.c | 19 ++--
> mm/mmap.c | 171 ++++++++++++++++---------------
> mm/nommu.c | 45 ++++----
> tools/testing/radix-tree/maple.c | 59 ++++++-----
> 10 files changed, 364 insertions(+), 193 deletions(-)
>

2023-06-20 12:44:39

by Sergey Senozhatsky

Subject: Re: [PATCH v2 04/16] mm: Change do_vmi_align_munmap() side tree index

Hello Liam,

On (23/06/12 16:39), Liam R. Howlett wrote:
[..]
> @@ -2450,17 +2452,17 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> /* Make sure no VMAs are about to be lost. */
> {
> - MA_STATE(test, &mt_detach, start, end - 1);
> + MA_STATE(test, &mt_detach, 0, 0);
> struct vm_area_struct *vma_mas, *vma_test;
> int test_count = 0;
>
> vma_iter_set(vmi, start);
> rcu_read_lock();
> - vma_test = mas_find(&test, end - 1);
> + vma_test = mas_find(&test, count - 1);
> for_each_vma_range(*vmi, vma_mas, end) {
> BUG_ON(vma_mas != vma_test);
> test_count++;
> - vma_test = mas_next(&test, end - 1);
> + vma_test = mas_next(&test, count - 1);
> }
> rcu_read_unlock();
> BUG_ON(count != test_count);

Something isn't quite working, I'm hitting BUG_ON(vma_mas != vma_test)

[ 8.156437] ------------[ cut here ]------------
[ 8.160473] kernel BUG at mm/mmap.c:2439!
[ 8.163894] invalid opcode: 0000 [#1] PREEMPT SMP PTI

RIP: 0010:do_vmi_align_munmap+0x489/0x8a0

[ 8.207867] Call Trace:
[ 8.208463] <TASK>
[ 8.209018] ? die+0x32/0x80
[ 8.209709] ? do_trap+0xd2/0x100
[ 8.210520] ? do_vmi_align_munmap+0x489/0x8a0
[ 8.211576] ? do_vmi_align_munmap+0x489/0x8a0
[ 8.212639] ? do_error_trap+0x94/0x110
[ 8.213549] ? do_vmi_align_munmap+0x489/0x8a0
[ 8.214581] ? exc_invalid_op+0x49/0x60
[ 8.215494] ? do_vmi_align_munmap+0x489/0x8a0
[ 8.216576] ? asm_exc_invalid_op+0x16/0x20
[ 8.217562] ? do_vmi_align_munmap+0x489/0x8a0
[ 8.218626] do_vmi_munmap+0xc7/0x120
[ 8.219494] __vm_munmap+0xaa/0x1c0
[ 8.220370] __x64_sys_munmap+0x17/0x20
[ 8.221275] do_syscall_64+0x34/0x80
[ 8.222165] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 8.223359] RIP: 0033:0x7fdb0e2fca97
[ 8.224224] Code: ff ff ff ff c3 66 0f 1f 44 00 00 f7 d8 89 05 20 28 01 00 48 c7 c0 ff ff ff ff c3 0f 1f 84 00 00 00 00 00 b8 0b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8d 0d f9 27 01 00 f7 d8 89 01 48 83
[ 8.228432] RSP: 002b:00007ffd15458348 EFLAGS: 00000202 ORIG_RAX: 000000000000000b
[ 8.230175] RAX: ffffffffffffffda RBX: 0000562081a347b0 RCX: 00007fdb0e2fca97
[ 8.231833] RDX: 0000000000000002 RSI: 0000000000004010 RDI: 00007fdb0e2d5000
[ 8.233513] RBP: 00007ffd154584d0 R08: 0000000000000000 R09: 0000562081a3fd30
[ 8.235178] R10: 0000562081a3fd18 R11: 0000000000000202 R12: 00007ffd15458388
[ 8.236861] R13: 00007ffd15458438 R14: 00007ffd15458370 R15: 0000562081a347b0

2023-06-20 13:17:54

by Liam R. Howlett

Subject: Re: [PATCH v2 04/16] mm: Change do_vmi_align_munmap() side tree index

* Sergey Senozhatsky <[email protected]> [230620 08:26]:
> Hello Liam,
>
> On (23/06/12 16:39), Liam R. Howlett wrote:
> [..]
> > @@ -2450,17 +2452,17 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> > /* Make sure no VMAs are about to be lost. */
> > {
> > - MA_STATE(test, &mt_detach, start, end - 1);
> > + MA_STATE(test, &mt_detach, 0, 0);
> > struct vm_area_struct *vma_mas, *vma_test;
> > int test_count = 0;
> >
> > vma_iter_set(vmi, start);
> > rcu_read_lock();
> > - vma_test = mas_find(&test, end - 1);
> > + vma_test = mas_find(&test, count - 1);
> > for_each_vma_range(*vmi, vma_mas, end) {
> > BUG_ON(vma_mas != vma_test);
> > test_count++;
> > - vma_test = mas_next(&test, end - 1);
> > + vma_test = mas_next(&test, count - 1);
> > }
> > rcu_read_unlock();
> > BUG_ON(count != test_count);
>
> Something isn't quite working, I'm hitting BUG_ON(vma_mas != vma_test)

Is this with next by any chance? There's a merge conflict which I'll
have to fix, but I won't be getting to it in time so the patches will
not make this merge window.

>
> [ 8.156437] ------------[ cut here ]------------
> [ 8.160473] kernel BUG at mm/mmap.c:2439!
> [ 8.163894] invalid opcode: 0000 [#1] PREEMPT SMP PTI
>
> RIP: 0010:do_vmi_align_munmap+0x489/0x8a0
>
> [ 8.207867] Call Trace:
> [ 8.208463] <TASK>
> [ 8.209018] ? die+0x32/0x80
> [ 8.209709] ? do_trap+0xd2/0x100
> [ 8.210520] ? do_vmi_align_munmap+0x489/0x8a0
> [ 8.211576] ? do_vmi_align_munmap+0x489/0x8a0
> [ 8.212639] ? do_error_trap+0x94/0x110
> [ 8.213549] ? do_vmi_align_munmap+0x489/0x8a0
> [ 8.214581] ? exc_invalid_op+0x49/0x60
> [ 8.215494] ? do_vmi_align_munmap+0x489/0x8a0
> [ 8.216576] ? asm_exc_invalid_op+0x16/0x20
> [ 8.217562] ? do_vmi_align_munmap+0x489/0x8a0
> [ 8.218626] do_vmi_munmap+0xc7/0x120
> [ 8.219494] __vm_munmap+0xaa/0x1c0
> [ 8.220370] __x64_sys_munmap+0x17/0x20
> [ 8.221275] do_syscall_64+0x34/0x80
> [ 8.222165] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [ 8.223359] RIP: 0033:0x7fdb0e2fca97
> [ 8.224224] Code: ff ff ff ff c3 66 0f 1f 44 00 00 f7 d8 89 05 20 28 01 00 48 c7 c0 ff ff ff ff c3 0f 1f 84 00 00 00 00 00 b8 0b 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8d 0d f9 27 01 00 f7 d8 89 01 48 83
> [ 8.228432] RSP: 002b:00007ffd15458348 EFLAGS: 00000202 ORIG_RAX: 000000000000000b
> [ 8.230175] RAX: ffffffffffffffda RBX: 0000562081a347b0 RCX: 00007fdb0e2fca97
> [ 8.231833] RDX: 0000000000000002 RSI: 0000000000004010 RDI: 00007fdb0e2d5000
> [ 8.233513] RBP: 00007ffd154584d0 R08: 0000000000000000 R09: 0000562081a3fd30
> [ 8.235178] R10: 0000562081a3fd18 R11: 0000000000000202 R12: 00007ffd15458388
> [ 8.236861] R13: 00007ffd15458438 R14: 00007ffd15458370 R15: 0000562081a347b0

2023-06-21 00:38:57

by Sergey Senozhatsky

Subject: Re: [PATCH v2 04/16] mm: Change do_vmi_align_munmap() side tree index

On (23/06/20 09:04), Liam R. Howlett wrote:
> > On (23/06/12 16:39), Liam R. Howlett wrote:
> > [..]
> > > @@ -2450,17 +2452,17 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma,
> > > #if defined(CONFIG_DEBUG_VM_MAPLE_TREE)
> > > /* Make sure no VMAs are about to be lost. */
> > > {
> > > - MA_STATE(test, &mt_detach, start, end - 1);
> > > + MA_STATE(test, &mt_detach, 0, 0);
> > > struct vm_area_struct *vma_mas, *vma_test;
> > > int test_count = 0;
> > >
> > > vma_iter_set(vmi, start);
> > > rcu_read_lock();
> > > - vma_test = mas_find(&test, end - 1);
> > > + vma_test = mas_find(&test, count - 1);
> > > for_each_vma_range(*vmi, vma_mas, end) {
> > > BUG_ON(vma_mas != vma_test);
> > > test_count++;
> > > - vma_test = mas_next(&test, end - 1);
> > > + vma_test = mas_next(&test, count - 1);
> > > }
> > > rcu_read_unlock();
> > > BUG_ON(count != test_count);
> >
> > Something isn't quite working, I'm hitting BUG_ON(vma_mas != vma_test)
>
> Is this with next by any chance?

Oh yes, linux-next

2023-06-22 17:06:13

by Danilo Krummrich

Subject: Re: [PATCH v2 14/16] maple_tree: Refine mas_preallocate() node calculations

On 6/12/23 22:39, Liam R. Howlett wrote:
> Calculate the number of nodes based on the pending write action instead
> of assuming the worst case.

Liam already gave me a heads-up on this patch, which I already replied
to [1].

However, I think it might make sense to also reply to this patch directly.

For a mas_preallocate() that calculates the actually required number of
nodes instead of assuming the worst case, it is required, if my
understanding is correct, to ensure that the tree does not change
between calling mas_preallocate() and mas_store_prealloc().

In DRM however, more specifically the DRM GPUVA Manager [2], we do have
the case that we are not able to ensure this:

Jobs to create GPU mappings can be submitted by userspace, are queued up
by the kernel and are processed asynchronously in dma-fence signalling
critical paths, e.g. by using the drm_gpu_scheduler. Hence, we must be
able to allocate the worst-case amount of nodes, since at the time a job
is submitted we can't predict the state the maple tree keeping track of
mappings has once a mapping is inserted in the (asynchronous) dma-fence
signalling critical path.

A more detailed explanation can be found in [1].

Could we keep a separate function for allocating the worst case amount
of nodes additionally to this optimization? E.g. something like
mas_preallocate_worst_case() or mas_preallocate_unlocked() (since I
guess the new one requires the maple tree to be kept locked in order not
to change)?

[1]
https://lore.kernel.org/nouveau/[email protected]/T/#ma326e200b1de1e3c9df4e9fcb3bf243061fee8b5

[2]
https://lore.kernel.org/linux-mm/[email protected]/T/#m47ab82310f87793d0f0cc1825a316eb30ad5b653

- Danilo

>
> This addresses a performance regression introduced in platforms that
> have longer allocation timing.
>
> Signed-off-by: Liam R. Howlett <[email protected]>
> ---
> lib/maple_tree.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 47 insertions(+), 1 deletion(-)
>
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 048d6413a114..7ac5b5457603 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -5541,9 +5541,55 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
> */
> int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
> {
> + MA_WR_STATE(wr_mas, mas, entry);
> + unsigned char node_size;
> + int request = 1;
> int ret;
>
> - mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp);
> +
> + if (unlikely(!mas->index && mas->last == ULONG_MAX))
> + goto ask_now;
> +
> + mas_wr_store_setup(&wr_mas);
> + wr_mas.content = mas_start(mas);
> + /* Root expand */
> + if (unlikely(mas_is_none(mas) || mas_is_ptr(mas)))
> + goto ask_now;
> +
> + if (unlikely(!mas_wr_walk(&wr_mas))) {
> + /* Spanning store, use worst case for now */
> + request = 1 + mas_mt_height(mas) * 3;
> + goto ask_now;
> + }
> +
> + /* At this point, we are at the leaf node that needs to be altered. */
> + /* Exact fit, no nodes needed. */
> + if (wr_mas.r_min == mas->index && wr_mas.r_max == mas->last)
> + return 0;
> +
> + mas_wr_end_piv(&wr_mas);
> + node_size = mas_wr_new_end(&wr_mas);
> + /* Slot store can avoid using any nodes */
> + if (node_size == wr_mas.node_end && wr_mas.offset_end - mas->offset == 1)
> + return 0;
> +
> + if (node_size >= mt_slots[wr_mas.type]) {
> + /* Split, worst case for now. */
> + request = 1 + mas_mt_height(mas) * 2;
> + goto ask_now;
> + }
> +
> + /* Appending does not need any nodes */
> + if (node_size == wr_mas.node_end + 1 && mas->offset == wr_mas.node_end)
> + return 0;
> +
> + /* Potential spanning rebalance collapsing a node, use worst-case */
> + if (node_size - 1 <= mt_min_slots[wr_mas.type])
> + request = mas_mt_height(mas) * 2 - 1;
> +
> + /* node store needs one node */
> +ask_now:
> + mas_node_count_gfp(mas, request, gfp);
> mas->mas_flags |= MA_STATE_PREALLOC;
> if (likely(!mas_is_err(mas)))
> return 0;


2023-06-25 04:04:41

by Peng Zhang

Subject: Re: [PATCH v2 14/16] maple_tree: Refine mas_preallocate() node calculations



On 2023/6/23 00:41, Danilo Krummrich wrote:
> On 6/12/23 22:39, Liam R. Howlett wrote:
>> Calculate the number of nodes based on the pending write action instead
>> of assuming the worst case.
>
> Liam already gave me a heads-up on this patch, which I already replied
> to [1].
>
> However, I think it might make sense to also reply to this patch directly.
>
> For a mas_preallocate() calculating the actual required nodes to be
> allocated instead of assuming the worst to work, it is required to
> ensure that the tree does not change between calling mas_preallocate()
> and mas_store_prealloc() if my understanding is correct.
>
> In DRM however, more specifically the DRM GPUVA Manager [2], we do have
> the case that we are not able to ensure this:
>
> Jobs to create GPU mappings can be submitted by userspace, are queued up
> by the kernel and are processed asynchronously in dma-fence signalling
> critical paths, e.g. by using the drm_gpu_scheduler. Hence, we must be
> able to allocate the worst case amount of node, since at the time a job
> is submitted we can't predict the state the maple tree keeping track of
> mappings has once a mapping is inserted in the (asynchronous) dma-fence
> signalling critical path.
>
> A more detailed explanation can be found in [1].
>
> Could we keep a separate function for allocating the worst case amount
> of nodes additionally to this optimization? E.g. something like
> mas_preallocate_worst_case() or mas_preallocate_unlocked() (since I
> guess the new one requires the maple tree to be kept locked in order not
> to change)?
Hi Danilo,

Your understanding seems incorrect. Even with the previously unoptimized
mas_preallocate(), the maple tree must not be modified between calls to
mas_preallocate() and mas_store_prealloc(). The calculation of the
number of pre-allocated nodes depends on the structure of the maple
tree. In the unoptimized mas_preallocate(), it depends on the height of
the tree. If the maple tree is modified before mas_store_prealloc() and
the height of the tree changes, the number of pre-allocated nodes is
inaccurate.

Regards,
Peng

>
> [1]
> https://lore.kernel.org/nouveau/[email protected]/T/#ma326e200b1de1e3c9df4e9fcb3bf243061fee8b5
>
> [2]
> https://lore.kernel.org/linux-mm/[email protected]/T/#m47ab82310f87793d0f0cc1825a316eb30ad5b653
>
> - Danilo
>
>>
>> This addresses a performance regression introduced in platforms that
>> have longer allocation timing.
>>
>> Signed-off-by: Liam R. Howlett <[email protected]>
>> ---
>>   lib/maple_tree.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
>>   1 file changed, 47 insertions(+), 1 deletion(-)
>>
>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> index 048d6413a114..7ac5b5457603 100644
>> --- a/lib/maple_tree.c
>> +++ b/lib/maple_tree.c
>> @@ -5541,9 +5541,55 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
>>    */
>>   int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>>   {
>> +    MA_WR_STATE(wr_mas, mas, entry);
>> +    unsigned char node_size;
>> +    int request = 1;
>>       int ret;
>> -    mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp);
>> +
>> +    if (unlikely(!mas->index && mas->last == ULONG_MAX))
>> +        goto ask_now;
>> +
>> +    mas_wr_store_setup(&wr_mas);
>> +    wr_mas.content = mas_start(mas);
>> +    /* Root expand */
>> +    if (unlikely(mas_is_none(mas) || mas_is_ptr(mas)))
>> +        goto ask_now;
>> +
>> +    if (unlikely(!mas_wr_walk(&wr_mas))) {
>> +        /* Spanning store, use worst case for now */
>> +        request = 1 + mas_mt_height(mas) * 3;
>> +        goto ask_now;
>> +    }
>> +
>> +    /* At this point, we are at the leaf node that needs to be
>> altered. */
>> +    /* Exact fit, no nodes needed. */
>> +    if (wr_mas.r_min == mas->index && wr_mas.r_max == mas->last)
>> +        return 0;
>> +
>> +    mas_wr_end_piv(&wr_mas);
>> +    node_size = mas_wr_new_end(&wr_mas);
>> +    /* Slot store can avoid using any nodes */
>> +    if (node_size == wr_mas.node_end && wr_mas.offset_end -
>> mas->offset == 1)
>> +        return 0;
>> +
>> +    if (node_size >= mt_slots[wr_mas.type]) {
>> +        /* Split, worst case for now. */
>> +        request = 1 + mas_mt_height(mas) * 2;
>> +        goto ask_now;
>> +    }
>> +
>> +    /* Appending does not need any nodes */
>> +    if (node_size == wr_mas.node_end + 1 && mas->offset ==
>> wr_mas.node_end)
>> +        return 0;
>> +
>> +    /* Potential spanning rebalance collapsing a node, use worst-case */
>> +    if (node_size  - 1 <= mt_min_slots[wr_mas.type])
>> +        request = mas_mt_height(mas) * 2 - 1;
>> +
>> +    /* node store needs one node */
>> +ask_now:
>> +    mas_node_count_gfp(mas, request, gfp);
>>       mas->mas_flags |= MA_STATE_PREALLOC;
>>       if (likely(!mas_is_err(mas)))
>>           return 0;
>
>
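For readers following along, the calling convention served by the diff above looks roughly like this. This is a hypothetical caller sketch, not part of the patch; the `mm` and `vma` names are illustrative:

```c
/* Hypothetical caller sketch: pairing the reworked mas_preallocate()
 * (which now takes the entry so the node count can be estimated from
 * the pending write) with mas_store_prealloc().
 */
MA_STATE(mas, &mm->mm_mt, vma->vm_start, vma->vm_end - 1);

if (mas_preallocate(&mas, vma, GFP_KERNEL))
	return -ENOMEM;		/* allocation failed, nothing to undo */

/* The tree must not change between these two calls, or the node count
 * estimated above may no longer be sufficient.
 */
mas_store_prealloc(&mas, vma);
```

Note that the estimate depends on the tree's current shape, which is what the locking discussion below turns on.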

2023-06-26 03:12:03

by Danilo Krummrich

Subject: Re: [PATCH v2 14/16] maple_tree: Refine mas_preallocate() node calculations

Hi Peng,

On 6/25/23 05:28, Peng Zhang wrote:
>
>
> 在 2023/6/23 00:41, Danilo Krummrich 写道:
>> On 6/12/23 22:39, Liam R. Howlett wrote:
>>> Calculate the number of nodes based on the pending write action instead
>>> of assuming the worst case.
>>
>> Liam already gave me a heads-up on this patch, which I already replied
>> to [1].
>>
>> However, I think it might make sense to also reply to this patch
>> directly.
>>
>> For a mas_preallocate() calculating the actual required nodes to be
>> allocated instead of assuming the worst to work, it is required to
>> ensure that the tree does not change between calling mas_preallocate()
>> and mas_store_prealloc() if my understanding is correct.
>>
>> In DRM however, more specifically the DRM GPUVA Manager [2], we do
>> have the case that we are not able to ensure this:
>>
>> Jobs to create GPU mappings can be submitted by userspace, are queued
>> up by the kernel and are processed asynchronously in dma-fence
>> signalling critical paths, e.g. by using the drm_gpu_scheduler. Hence,
>> we must be able to allocate the worst case amount of nodes, since at
>> the time a job is submitted we can't predict the state the maple tree
>> keeping track of mappings has once a mapping is inserted in the
>> (asynchronous) dma-fence signalling critical path.
>>
>> A more detailed explanation can be found in [1].
>>
>> Could we keep a separate function for allocating the worst case amount
>> of nodes additionally to this optimization? E.g. something like
>> mas_preallocate_worst_case() or mas_preallocate_unlocked() (since I
>> guess the new one requires the maple tree to be kept locked in order
>> not to change)?
> Hi Danilo,
>
> Your understanding seems incorrect. Even with previously unoptimized
> mas_preallocate(), the maple tree cannot be modified between calls to
> mas_preallocate() and mas_store_prealloc(). The calculation of the
> number of pre-allocated nodes depends on the structure of the maple
> tree. In the unoptimized mas_preallocate(), it depends on the height of
> the tree. If the maple tree is modified before mas_store_prealloc() and
> the height of the tree changes, the number of pre-allocated nodes is
> inaccurate.

Thanks for pointing this out!

First of all, it's probably fair to say "naive me"; it totally makes
sense that the tree height is needed - it's a b-tree.

On the other hand, unless I miss something (and if so, please let me
know), something is bogus with the API then.

While the documentation of the Advanced API of the maple tree explicitly
claims that the user of the API is responsible for locking, this should
be limited to the bounds set by the maple tree implementation. Which
means, the user must decide for either the internal (spin-) lock or an
external lock (which possibly goes away in the future) and acquire and
release it according to the rules maple tree enforces through lockdep
checks.

Let's say one picks the internal lock. How is one supposed to ensure the
tree isn't modified using the internal lock with mas_preallocate()?

Besides that, I think the documentation should definitely mention this
limitation and give some guidance for the locking.

Currently, from an API perspective, I can't see how anyone not familiar
with the implementation details would be able to recognize this limitation.

In terms of the GPUVA manager, unfortunately, it seems like I need to
drop the maple tree and go back to using an rb-tree, since it seems
there is no sane way of doing a worst-case pre-allocation that does not
suffer from this limitation.

- Danilo

>
> Regards,
> Peng
>
>>
>> [1]
>> https://lore.kernel.org/nouveau/[email protected]/T/#ma326e200b1de1e3c9df4e9fcb3bf243061fee8b5
>>
>> [2]
>> https://lore.kernel.org/linux-mm/[email protected]/T/#m47ab82310f87793d0f0cc1825a316eb30ad5b653
>>
>> - Danilo
>>
>>>
>>> This addresses a performance regression introduced in platforms that
>>> have longer allocation timing.
>>>
>>> Signed-off-by: Liam R. Howlett <[email protected]>
>>> ---
>>>   lib/maple_tree.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
>>>   1 file changed, 47 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>>> index 048d6413a114..7ac5b5457603 100644
>>> --- a/lib/maple_tree.c
>>> +++ b/lib/maple_tree.c
>>> @@ -5541,9 +5541,55 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
>>>    */
>>>   int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>>>   {
>>> +    MA_WR_STATE(wr_mas, mas, entry);
>>> +    unsigned char node_size;
>>> +    int request = 1;
>>>       int ret;
>>> -    mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp);
>>> +
>>> +    if (unlikely(!mas->index && mas->last == ULONG_MAX))
>>> +        goto ask_now;
>>> +
>>> +    mas_wr_store_setup(&wr_mas);
>>> +    wr_mas.content = mas_start(mas);
>>> +    /* Root expand */
>>> +    if (unlikely(mas_is_none(mas) || mas_is_ptr(mas)))
>>> +        goto ask_now;
>>> +
>>> +    if (unlikely(!mas_wr_walk(&wr_mas))) {
>>> +        /* Spanning store, use worst case for now */
>>> +        request = 1 + mas_mt_height(mas) * 3;
>>> +        goto ask_now;
>>> +    }
>>> +
>>> +    /* At this point, we are at the leaf node that needs to be altered. */
>>> +    /* Exact fit, no nodes needed. */
>>> +    if (wr_mas.r_min == mas->index && wr_mas.r_max == mas->last)
>>> +        return 0;
>>> +
>>> +    mas_wr_end_piv(&wr_mas);
>>> +    node_size = mas_wr_new_end(&wr_mas);
>>> +    /* Slot store can avoid using any nodes */
>>> +    if (node_size == wr_mas.node_end && wr_mas.offset_end - mas->offset == 1)
>>> +        return 0;
>>> +
>>> +    if (node_size >= mt_slots[wr_mas.type]) {
>>> +        /* Split, worst case for now. */
>>> +        request = 1 + mas_mt_height(mas) * 2;
>>> +        goto ask_now;
>>> +    }
>>> +
>>> +    /* Appending does not need any nodes */
>>> +    if (node_size == wr_mas.node_end + 1 && mas->offset == wr_mas.node_end)
>>> +        return 0;
>>> +
>>> +    /* Potential spanning rebalance collapsing a node, use worst-case */
>>> +    if (node_size - 1 <= mt_min_slots[wr_mas.type])
>>> +        request = mas_mt_height(mas) * 2 - 1;
>>> +
>>> +    /* node store needs one node */
>>> +ask_now:
>>> +    mas_node_count_gfp(mas, request, gfp);
>>>       mas->mas_flags |= MA_STATE_PREALLOC;
>>>       if (likely(!mas_is_err(mas)))
>>>           return 0;
>>
>>
>


2023-06-26 13:37:50

by Matthew Wilcox

Subject: Re: [PATCH v2 14/16] maple_tree: Refine mas_preallocate() node calculations

On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote:
> On the other hand, unless I miss something (and if so, please let me know),
> something is bogus with the API then.
>
> While the documentation of the Advanced API of the maple tree explicitly
> claims that the user of the API is responsible for locking, this should be
> limited to the bounds set by the maple tree implementation. Which means, the
> user must decide for either the internal (spin-) lock or an external lock
> (which possibly goes away in the future) and acquire and release it
> according to the rules maple tree enforces through lockdep checks.
>
> Let's say one picks the internal lock. How is one supposed to ensure the
> tree isn't modified using the internal lock with mas_preallocate()?
>
> Besides that, I think the documentation should definitely mention this
> limitation and give some guidance for the locking.
>
> Currently, from an API perspective, I can't see how anyone not familiar with
> the implementation details would be able to recognize this limitation.
>
> In terms of the GPUVA manager, unfortunately, it seems like I need to drop
> the maple tree and go back to using a rb-tree, since it seems there is no
> sane way doing a worst-case pre-allocation that does not suffer from this
> limitation.

I haven't been paying much attention here (too many other things going
on), but something's wrong.

First, you shouldn't need to preallocate. Preallocation is only there
for really gnarly cases. The way this is *supposed* to work is that
the store walks down to the leaf, attempts to insert into that leaf
and tries to allocate new nodes with __GFP_NOWAIT. If that fails,
it drops the spinlock, allocates with the gfp flags you've specified,
then rewalks the tree to retry the store, this time with allocated
nodes in its back pocket so that the store will succeed.
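As a rough sketch of the pattern described above (assuming the internal lock is used; `tree`, `index`, `last` and `entry` are placeholders, not names from this thread):

```c
/* No preallocation: mas_store_gfp() handles the __GFP_NOWAIT attempt,
 * the drop-lock-allocate-with-caller-flags fallback, and the rewalk
 * internally.  The caller only sees the final result.
 */
MA_STATE(mas, &tree, index, last);
int err;

mas_lock(&mas);
err = mas_store_gfp(&mas, entry, GFP_KERNEL);
mas_unlock(&mas);
if (err)
	return err;
```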

2023-06-26 14:14:44

by Peng Zhang

Subject: Re: [PATCH v2 14/16] maple_tree: Refine mas_preallocate() node calculations



在 2023/6/26 08:38, Danilo Krummrich 写道:
> Hi Peng,
>
> On 6/25/23 05:28, Peng Zhang wrote:
>>
>>
>> 在 2023/6/23 00:41, Danilo Krummrich 写道:
>>> On 6/12/23 22:39, Liam R. Howlett wrote:
>>>> Calculate the number of nodes based on the pending write action instead
>>>> of assuming the worst case.
>>>
>>> Liam already gave me a heads-up on this patch, which I already
>>> replied to [1].
>>>
>>> However, I think it might make sense to also reply to this patch
>>> directly.
>>>
>>> For a mas_preallocate() calculating the actual required nodes to be
>>> allocated instead of assuming the worst to work, it is required to
>>> ensure that the tree does not change between calling
>>> mas_preallocate() and mas_store_prealloc() if my understanding is
>>> correct.
>>>
>>> In DRM however, more specifically the DRM GPUVA Manager [2], we do
>>> have the case that we are not able to ensure this:
>>>
>>> Jobs to create GPU mappings can be submitted by userspace, are queued
>>> up by the kernel and are processed asynchronously in dma-fence
>>> signalling critical paths, e.g. by using the drm_gpu_scheduler.
>>> Hence, we must be able to allocate the worst case amount of nodes,
>>> since at the time a job is submitted we can't predict the state the
>>> maple tree keeping track of mappings has once a mapping is inserted
>>> in the (asynchronous) dma-fence signalling critical path.
>>>
>>> A more detailed explanation can be found in [1].
>>>
>>> Could we keep a separate function for allocating the worst case
>>> amount of nodes additionally to this optimization? E.g. something
>>> like mas_preallocate_worst_case() or mas_preallocate_unlocked()
>>> (since I guess the new one requires the maple tree to be kept locked
>>> in order not to change)?
>> Hi Danilo,
>>
>> Your understanding seems incorrect. Even with previously unoptimized
>> mas_preallocate(), the maple tree cannot be modified between calls to
>> mas_preallocate() and mas_store_prealloc(). The calculation of the
>> number of pre-allocated nodes depends on the structure of the maple
>> tree. In the unoptimized mas_preallocate(), it depends on the height of
>> the tree. If the maple tree is modified before mas_store_prealloc() and
>> the height of the tree changes, the number of pre-allocated nodes is
>> inaccurate.
>
> Thanks for pointing this out!
>
> First of all, it's probably fair to say "naive me", it totally makes
> sense the tree height is needed - it's a b-tree.
>
> On the other hand, unless I miss something (and if so, please let me
> know), something is bogus with the API then.
>
> While the documentation of the Advanced API of the maple tree explicitly
> claims that the user of the API is responsible for locking, this should
> be limited to the bounds set by the maple tree implementation. Which
> means, the user must decide for either the internal (spin-) lock or an
> external lock (which possibly goes away in the future) and acquire and
> release it according to the rules maple tree enforces through lockdep
> checks.
>
> Let's say one picks the internal lock. How is one supposed to ensure the
> tree isn't modified using the internal lock with mas_preallocate()?
>
> Besides that, I think the documentation should definitely mention this
> limitation and give some guidance for the locking.
Yes, the documentation of maple tree is not detailed and complete.
>
> Currently, from an API perspective, I can't see how anyone not familiar
> with the implementation details would be able to recognize this limitation.
>
> In terms of the GPUVA manager, unfortunately, it seems like I need to
> drop the maple tree and go back to using a rb-tree, since it seems there
> is no sane way doing a worst-case pre-allocation that does not suffer
> from this limitation.
I also think preallocation may not be necessary, and I agree with what
Matthew said. Preallocation should be reserved for the cases that
genuinely require it. If preallocation is used but the number of
preallocated nodes turns out to be insufficient because the tree was
modified in the meantime, GFP_NOWAIT will be used for memory allocation
during the tree modification, and the user may not notice that some of
the nodes did not come from the preallocation.

>
> - Danilo
>
>>
>> Regards,
>> Peng
>>
>>>
>>> [1]
>>> https://lore.kernel.org/nouveau/[email protected]/T/#ma326e200b1de1e3c9df4e9fcb3bf243061fee8b5
>>>
>>> [2]
>>> https://lore.kernel.org/linux-mm/[email protected]/T/#m47ab82310f87793d0f0cc1825a316eb30ad5b653
>>>
>>> - Danilo
>>>
>>>>
>>>> This addresses a performance regression introduced in platforms that
>>>> have longer allocation timing.
>>>>
>>>> Signed-off-by: Liam R. Howlett <[email protected]>
>>>> ---
>>>>   lib/maple_tree.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
>>>>   1 file changed, 47 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>>>> index 048d6413a114..7ac5b5457603 100644
>>>> --- a/lib/maple_tree.c
>>>> +++ b/lib/maple_tree.c
>>>> @@ -5541,9 +5541,55 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
>>>>    */
>>>>   int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>>>>   {
>>>> +    MA_WR_STATE(wr_mas, mas, entry);
>>>> +    unsigned char node_size;
>>>> +    int request = 1;
>>>>       int ret;
>>>> -    mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp);
>>>> +
>>>> +    if (unlikely(!mas->index && mas->last == ULONG_MAX))
>>>> +        goto ask_now;
>>>> +
>>>> +    mas_wr_store_setup(&wr_mas);
>>>> +    wr_mas.content = mas_start(mas);
>>>> +    /* Root expand */
>>>> +    if (unlikely(mas_is_none(mas) || mas_is_ptr(mas)))
>>>> +        goto ask_now;
>>>> +
>>>> +    if (unlikely(!mas_wr_walk(&wr_mas))) {
>>>> +        /* Spanning store, use worst case for now */
>>>> +        request = 1 + mas_mt_height(mas) * 3;
>>>> +        goto ask_now;
>>>> +    }
>>>> +
>>>> +    /* At this point, we are at the leaf node that needs to be altered. */
>>>> +    /* Exact fit, no nodes needed. */
>>>> +    if (wr_mas.r_min == mas->index && wr_mas.r_max == mas->last)
>>>> +        return 0;
>>>> +
>>>> +    mas_wr_end_piv(&wr_mas);
>>>> +    node_size = mas_wr_new_end(&wr_mas);
>>>> +    /* Slot store can avoid using any nodes */
>>>> +    if (node_size == wr_mas.node_end && wr_mas.offset_end - mas->offset == 1)
>>>> +        return 0;
>>>> +
>>>> +    if (node_size >= mt_slots[wr_mas.type]) {
>>>> +        /* Split, worst case for now. */
>>>> +        request = 1 + mas_mt_height(mas) * 2;
>>>> +        goto ask_now;
>>>> +    }
>>>> +
>>>> +    /* Appending does not need any nodes */
>>>> +    if (node_size == wr_mas.node_end + 1 && mas->offset == wr_mas.node_end)
>>>> +        return 0;
>>>> +
>>>> +    /* Potential spanning rebalance collapsing a node, use worst-case */
>>>> +    if (node_size - 1 <= mt_min_slots[wr_mas.type])
>>>> +        request = mas_mt_height(mas) * 2 - 1;
>>>> +
>>>> +    /* node store needs one node */
>>>> +ask_now:
>>>> +    mas_node_count_gfp(mas, request, gfp);
>>>>       mas->mas_flags |= MA_STATE_PREALLOC;
>>>>       if (likely(!mas_is_err(mas)))
>>>>           return 0;
>>>
>>>
>>
>

2023-06-26 14:40:42

by Danilo Krummrich

Subject: Re: [PATCH v2 14/16] maple_tree: Refine mas_preallocate() node calculations

On 6/26/23 15:19, Matthew Wilcox wrote:
> On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote:
>> On the other hand, unless I miss something (and if so, please let me know),
>> something is bogus with the API then.
>>
>> While the documentation of the Advanced API of the maple tree explicitly
>> claims that the user of the API is responsible for locking, this should be
>> limited to the bounds set by the maple tree implementation. Which means, the
>> user must decide for either the internal (spin-) lock or an external lock
>> (which possibly goes away in the future) and acquire and release it
>> according to the rules maple tree enforces through lockdep checks.
>>
>> Let's say one picks the internal lock. How is one supposed to ensure the
>> tree isn't modified using the internal lock with mas_preallocate()?
>>
>> Besides that, I think the documentation should definitely mention this
>> limitation and give some guidance for the locking.
>>
>> Currently, from an API perspective, I can't see how anyone not familiar with
>> the implementation details would be able to recognize this limitation.
>>
>> In terms of the GPUVA manager, unfortunately, it seems like I need to drop
>> the maple tree and go back to using a rb-tree, since it seems there is no
>> sane way doing a worst-case pre-allocation that does not suffer from this
>> limitation.
>
> I haven't been paying much attention here (too many other things going
> on), but something's wrong.
>
> First, you shouldn't need to preallocate. Preallocation is only there

Unfortunately, I think we really have a case where we have to. Typically
GPU mappings are created in a dma-fence signalling critical path and
that is where such mappings need to be added to the maple tree. Hence,
we can't do any sleeping allocations there.

> for really gnarly cases. The way this is *supposed* to work is that
> the store walks down to the leaf, attempts to insert into that leaf
> and tries to allocate new nodes with __GFP_NOWAIT. If that fails,
> it drops the spinlock, allocates with the gfp flags you've specified,
> then rewalks the tree to retry the store, this time with allocated
> nodes in its back pocket so that the store will succeed.

You are talking about mas_store_gfp() here, right? And I guess, if the
tree has changed while the spinlock was dropped and even more nodes are
needed, it just retries until it succeeds?

But what about mas_preallocate()? What happens if the tree changed in
between mas_preallocate() and mas_store_prealloc()? Does the latter one
fall back to __GFP_NOWAIT in such a case? I guess not, since
mas_store_prealloc() has a void return type, and __GFP_NOWAIT could fail
as well.

So, how to use the internal spinlock for mas_preallocate() and
mas_store_prealloc() to ensure the tree can't change?


2023-06-26 15:07:01

by Matthew Wilcox

Subject: Re: [PATCH v2 14/16] maple_tree: Refine mas_preallocate() node calculations

On Mon, Jun 26, 2023 at 04:27:54PM +0200, Danilo Krummrich wrote:
> On 6/26/23 15:19, Matthew Wilcox wrote:
> > On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote:
> > > On the other hand, unless I miss something (and if so, please let me know),
> > > something is bogus with the API then.
> > >
> > > While the documentation of the Advanced API of the maple tree explicitly
> > > claims that the user of the API is responsible for locking, this should be
> > > limited to the bounds set by the maple tree implementation. Which means, the
> > > user must decide for either the internal (spin-) lock or an external lock
> > > (which possibly goes away in the future) and acquire and release it
> > > according to the rules maple tree enforces through lockdep checks.
> > >
> > > Let's say one picks the internal lock. How is one supposed to ensure the
> > > tree isn't modified using the internal lock with mas_preallocate()?
> > >
> > > Besides that, I think the documentation should definitely mention this
> > > limitation and give some guidance for the locking.
> > >
> > > Currently, from an API perspective, I can't see how anyone not familiar with
> > > the implementation details would be able to recognize this limitation.
> > >
> > > In terms of the GPUVA manager, unfortunately, it seems like I need to drop
> > > the maple tree and go back to using a rb-tree, since it seems there is no
> > > sane way doing a worst-case pre-allocation that does not suffer from this
> > > limitation.
> >
> > I haven't been paying much attention here (too many other things going
> > on), but something's wrong.
> >
> > First, you shouldn't need to preallocate. Preallocation is only there
>
> Unfortunately, I think we really have a case where we have to. Typically GPU
> mappings are created in a dma-fence signalling critical path and that is
> where such mappings need to be added to the maple tree. Hence, we can't do
> any sleeping allocations there.

OK, so there are various ways to handle this, depending on what's
appropriate for your case.

The simplest is to use GFP_ATOMIC. Essentially, you're saying to the MM
layer "This is too hard, let me tap into the emergency reserves". It's
mildly frowned upon, so let's see if we can do better.

If you know where the allocation needs to be stored, but want it to act as
NULL until the time is right, you can store a ZERO entry. That will read
as NULL until you store to it. A pure overwriting store will not cause
any memory allocation since all the implementation has to do is change
a pointer. The XArray wraps this up nicely behind an xa_reserve() API.
As you're discovering, the Maple Tree API isn't fully baked yet.
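The XArray reservation pattern referred to above looks roughly like this (a sketch under the assumption that the slot index is known at reservation time; `xa`, `index` and `mapping` are placeholders):

```c
/* Sleepable context (e.g. at job-submission time): reserve the slot.
 * The reserved entry reads back as NULL until it is filled.
 */
if (xa_reserve(&xa, index, GFP_KERNEL))
	return -ENOMEM;

/* Non-sleeping path (e.g. dma-fence signalling): a pure overwrite of
 * the reserved slot, which should not need to allocate.
 */
xa_store(&xa, index, mapping, GFP_ATOMIC);
```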


2023-06-26 15:11:51

by Peng Zhang

Subject: Re: [PATCH v2 14/16] maple_tree: Refine mas_preallocate() node calculations



在 2023/6/26 22:27, Danilo Krummrich 写道:
> On 6/26/23 15:19, Matthew Wilcox wrote:
>> On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote:
>>> On the other hand, unless I miss something (and if so, please let me
>>> know),
>>> something is bogus with the API then.
>>>
>>> While the documentation of the Advanced API of the maple tree explicitly
>>> claims that the user of the API is responsible for locking, this
>>> should be
>>> limited to the bounds set by the maple tree implementation. Which
>>> means, the
>>> user must decide for either the internal (spin-) lock or an external
>>> lock
>>> (which possibly goes away in the future) and acquire and release it
>>> according to the rules maple tree enforces through lockdep checks.
>>>
>>> Let's say one picks the internal lock. How is one supposed to ensure the
>>> tree isn't modified using the internal lock with mas_preallocate()?
>>>
>>> Besides that, I think the documentation should definitely mention this
>>> limitation and give some guidance for the locking.
>>>
>>> Currently, from an API perspective, I can't see how anyone not
>>> familiar with
>>> the implementation details would be able to recognize this limitation.
>>>
>>> In terms of the GPUVA manager, unfortunately, it seems like I need to
>>> drop
>>> the maple tree and go back to using a rb-tree, since it seems there
>>> is no
>>> sane way doing a worst-case pre-allocation that does not suffer from
>>> this
>>> limitation.
>>
>> I haven't been paying much attention here (too many other things going
>> on), but something's wrong.
>>
>> First, you shouldn't need to preallocate.  Preallocation is only there
>
> Unfortunately, I think we really have a case where we have to. Typically
> GPU mappings are created in a dma-fence signalling critical path and
> that is where such mappings need to be added to the maple tree. Hence,
> we can't do any sleeping allocations there.
>
>> for really gnarly cases.  The way this is *supposed* to work is that
>> the store walks down to the leaf, attempts to insert into that leaf
>> and tries to allocate new nodes with __GFP_NOWAIT.  If that fails,
>> it drops the spinlock, allocates with the gfp flags you've specified,
>> then rewalks the tree to retry the store, this time with allocated
>> nodes in its back pocket so that the store will succeed.
>
> You are talking about mas_store_gfp() here, right? And I guess, if the
> tree has changed while the spinlock was dropped and even more nodes are
> needed it just retries until it succeeds?
>
> But what about mas_preallocate()? What happens if the tree changed in
> between mas_preallocate() and mas_store_prealloc()? Does the latter one
> fall back to __GFP_NOWAIT in such a case? I guess not, since
> mas_store_prealloc() has a void return type, and __GFP_NOWAIT could fail
> as well.
mas_store_prealloc() will fall back to __GFP_NOWAIT and issue a warning.
If __GFP_NOWAIT allocation fails, BUG_ON() in mas_store_prealloc() will
be triggered.

>
> So, how to use the internal spinlock for mas_preallocate() and
> mas_store_prealloc() to ensure the tree can't change?
>

2023-06-26 15:12:52

by Danilo Krummrich

Subject: Re: [PATCH v2 14/16] maple_tree: Refine mas_preallocate() node calculations

On 6/26/23 16:08, Peng Zhang wrote:
>
>
> 在 2023/6/26 08:38, Danilo Krummrich 写道:
>> Hi Peng,
>>
>> On 6/25/23 05:28, Peng Zhang wrote:
>>>
>>>
>>> 在 2023/6/23 00:41, Danilo Krummrich 写道:
>>>> On 6/12/23 22:39, Liam R. Howlett wrote:
>>>>> Calculate the number of nodes based on the pending write action
>>>>> instead
>>>>> of assuming the worst case.
>>>>
>>>> Liam already gave me a heads-up on this patch, which I already
>>>> replied to [1].
>>>>
>>>> However, I think it might make sense to also reply to this patch
>>>> directly.
>>>>
>>>> For a mas_preallocate() calculating the actual required nodes to be
>>>> allocated instead of assuming the worst to work, it is required to
>>>> ensure that the tree does not change between calling
>>>> mas_preallocate() and mas_store_prealloc() if my understanding is
>>>> correct.
>>>>
>>>> In DRM however, more specifically the DRM GPUVA Manager [2], we do
>>>> have the case that we are not able to ensure this:
>>>>
>>>> Jobs to create GPU mappings can be submitted by userspace, are
>>>> queued up by the kernel and are processed asynchronously in
>>>> dma-fence signalling critical paths, e.g. by using the
>>>> drm_gpu_scheduler. Hence, we must be able to allocate the worst case
>>>> amount of nodes, since at the time a job is submitted we can't
>>>> predict the state the maple tree keeping track of mappings has once
>>>> a mapping is inserted in the (asynchronous) dma-fence signalling
>>>> critical path.
>>>>
>>>> A more detailed explanation can be found in [1].
>>>>
>>>> Could we keep a separate function for allocating the worst case
>>>> amount of nodes additionally to this optimization? E.g. something
>>>> like mas_preallocate_worst_case() or mas_preallocate_unlocked()
>>>> (since I guess the new one requires the maple tree to be kept locked
>>>> in order not to change)?
>>> Hi Danilo,
>>>
>>> Your understanding seems incorrect. Even with previously unoptimized
>>> mas_preallocate(), the maple tree cannot be modified between calls to
>>> mas_preallocate() and mas_store_prealloc(). The calculation of the
>>> number of pre-allocated nodes depends on the structure of the maple
>>> tree. In the unoptimized mas_preallocate(), it depends on the height of
>>> the tree. If the maple tree is modified before mas_store_prealloc() and
>>> the height of the tree changes, the number of pre-allocated nodes is
>>> inaccurate.
>>
>> Thanks for pointing this out!
>>
>> First of all, it's probably fair to say "naive me", it totally makes
>> sense the tree height is needed - it's a b-tree.
>>
>> On the other hand, unless I miss something (and if so, please let me
>> know), something is bogus with the API then.
>>
>> While the documentation of the Advanced API of the maple tree
>> explicitly claims that the user of the API is responsible for locking,
>> this should be limited to the bounds set by the maple tree
>> implementation. Which means, the user must decide for either the
>> internal (spin-) lock or an external lock (which possibly goes away in
>> the future) and acquire and release it according to the rules maple
>> tree enforces through lockdep checks.
>>
>> Let's say one picks the internal lock. How is one supposed to ensure
>> the tree isn't modified using the internal lock with mas_preallocate()?
>>
>> Besides that, I think the documentation should definitely mention this
>> limitation and give some guidance for the locking.
> Yes, the documentation of maple tree is not detailed and complete.
>>
>> Currently, from an API perspective, I can't see how anyone not
>> familiar with the implementation details would be able to recognize
>> this limitation.
>>
>> In terms of the GPUVA manager, unfortunately, it seems like I need to
>> drop the maple tree and go back to using a rb-tree, since it seems
>> there is no sane way doing a worst-case pre-allocation that does not
>> suffer from this limitation.
> I also think preallocation may not be necessary, and I agree with what
> Matthew said. Preallocation should be used in some cases where
> preallocation has to be used. If preallocation is used, but the number
> of preallocated nodes is insufficient because the tree is modified
> midway, GFP_NOWAIT will be used for memory allocation during the tree
> modification process, and the user may not notice that more nodes are
> not from preallocation.

Please see my reply to Matthew. :)

- Danilo

>
>>
>> - Danilo
>>
>>>
>>> Regards,
>>> Peng
>>>
>>>>
>>>> [1]
>>>> https://lore.kernel.org/nouveau/[email protected]/T/#ma326e200b1de1e3c9df4e9fcb3bf243061fee8b5
>>>>
>>>> [2]
>>>> https://lore.kernel.org/linux-mm/[email protected]/T/#m47ab82310f87793d0f0cc1825a316eb30ad5b653
>>>>
>>>> - Danilo
>>>>
>>>>>
>>>>> This addresses a performance regression introduced in platforms that
>>>>> have longer allocation timing.
>>>>>
>>>>> Signed-off-by: Liam R. Howlett <[email protected]>
>>>>> ---
>>>>>   lib/maple_tree.c | 48 +++++++++++++++++++++++++++++++++++++++++++++++-
>>>>>   1 file changed, 47 insertions(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>>>>> index 048d6413a114..7ac5b5457603 100644
>>>>> --- a/lib/maple_tree.c
>>>>> +++ b/lib/maple_tree.c
>>>>> @@ -5541,9 +5541,55 @@ EXPORT_SYMBOL_GPL(mas_store_prealloc);
>>>>>    */
>>>>>   int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp)
>>>>>   {
>>>>> +    MA_WR_STATE(wr_mas, mas, entry);
>>>>> +    unsigned char node_size;
>>>>> +    int request = 1;
>>>>>       int ret;
>>>>> -    mas_node_count_gfp(mas, 1 + mas_mt_height(mas) * 3, gfp);
>>>>> +
>>>>> +    if (unlikely(!mas->index && mas->last == ULONG_MAX))
>>>>> +        goto ask_now;
>>>>> +
>>>>> +    mas_wr_store_setup(&wr_mas);
>>>>> +    wr_mas.content = mas_start(mas);
>>>>> +    /* Root expand */
>>>>> +    if (unlikely(mas_is_none(mas) || mas_is_ptr(mas)))
>>>>> +        goto ask_now;
>>>>> +
>>>>> +    if (unlikely(!mas_wr_walk(&wr_mas))) {
>>>>> +        /* Spanning store, use worst case for now */
>>>>> +        request = 1 + mas_mt_height(mas) * 3;
>>>>> +        goto ask_now;
>>>>> +    }
>>>>> +
>>>>> +    /* At this point, we are at the leaf node that needs to be altered. */
>>>>> +    /* Exact fit, no nodes needed. */
>>>>> +    if (wr_mas.r_min == mas->index && wr_mas.r_max == mas->last)
>>>>> +        return 0;
>>>>> +
>>>>> +    mas_wr_end_piv(&wr_mas);
>>>>> +    node_size = mas_wr_new_end(&wr_mas);
>>>>> +    /* Slot store can avoid using any nodes */
>>>>> +    if (node_size == wr_mas.node_end && wr_mas.offset_end - mas->offset == 1)
>>>>> +        return 0;
>>>>> +
>>>>> +    if (node_size >= mt_slots[wr_mas.type]) {
>>>>> +        /* Split, worst case for now. */
>>>>> +        request = 1 + mas_mt_height(mas) * 2;
>>>>> +        goto ask_now;
>>>>> +    }
>>>>> +
>>>>> +    /* Appending does not need any nodes */
>>>>> +    if (node_size == wr_mas.node_end + 1 && mas->offset == wr_mas.node_end)
>>>>> +        return 0;
>>>>> +
>>>>> +    /* Potential spanning rebalance collapsing a node, use worst-case */
>>>>> +    if (node_size  - 1 <= mt_min_slots[wr_mas.type])
>>>>> +        request = mas_mt_height(mas) * 2 - 1;
>>>>> +
>>>>> +    /* node store needs one node */
>>>>> +ask_now:
>>>>> +    mas_node_count_gfp(mas, request, gfp);
>>>>>       mas->mas_flags |= MA_STATE_PREALLOC;
>>>>>       if (likely(!mas_is_err(mas)))
>>>>>           return 0;
>>>>
>>>>
>>>
>>
>
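For readers following the control flow, the request counts chosen by the hunk above can be modelled as a small userspace function. The struct fields and constants below are illustrative stand-ins for the kernel's ma_wr_state / mt_slots values, not real maple tree API:

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative stand-in for the write state consulted by the patch. */
struct wr_model {
	bool spanning;     /* store spans multiple nodes (mas_wr_walk failed) */
	bool exact_fit;    /* r_min == index && r_max == last */
	int node_size;     /* leaf occupancy after the store (mas_wr_new_end) */
	int node_end;      /* current end offset of the leaf */
	int offset_delta;  /* wr_mas.offset_end - mas->offset */
	int offset;        /* mas->offset */
	int slots;         /* mt_slots[type] */
	int min_slots;     /* mt_min_slots[type] */
	int height;        /* mas_mt_height(mas) */
};

/* Number of nodes to preallocate; 0 means the store needs none. */
static int prealloc_request(const struct wr_model *w)
{
	if (w->spanning)                  /* spanning store: worst case */
		return 1 + w->height * 3;
	if (w->exact_fit)                 /* exact fit: no nodes needed */
		return 0;
	if (w->node_size == w->node_end && w->offset_delta == 1)
		return 0;                 /* slot store reuses the leaf */
	if (w->node_size >= w->slots)     /* split: worst case for now */
		return 1 + w->height * 2;
	if (w->node_size == w->node_end + 1 && w->offset == w->node_end)
		return 0;                 /* append needs no new nodes */
	if (w->node_size - 1 <= w->min_slots)
		return w->height * 2 - 1; /* potential rebalance */
	return 1;                         /* plain node store */
}
```

The patch's root-expand and full-range (0 - ULONG_MAX) paths also request a single node, which coincides with the final default here.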


2023-06-26 19:00:46

by Danilo Krummrich

Subject: Re: [PATCH v2 14/16] maple_tree: Refine mas_preallocate() node calculations

On 6/26/23 16:49, Peng Zhang wrote:
>
>
> On 2023/6/26 22:27, Danilo Krummrich wrote:
>> On 6/26/23 15:19, Matthew Wilcox wrote:
>>> On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote:
>>>> On the other hand, unless I miss something (and if so, please let me
>>>> know),
>>>> something is bogus with the API then.
>>>>
>>>> While the documentation of the Advanced API of the maple tree
>>>> explicitly
>>>> claims that the user of the API is responsible for locking, this
>>>> should be
>>>> limited to the bounds set by the maple tree implementation. Which
>>>> means, the
>>>> user must decide for either the internal (spin-) lock or an external
>>>> lock
>>>> (which possibly goes away in the future) and acquire and release it
>>>> according to the rules maple tree enforces through lockdep checks.
>>>>
>>>> Let's say one picks the internal lock. How is one supposed to ensure
>>>> the
>>>> tree isn't modified using the internal lock with mas_preallocate()?
>>>>
>>>> Besides that, I think the documentation should definitely mention this
>>>> limitation and give some guidance for the locking.
>>>>
>>>> Currently, from an API perspective, I can't see how anyone not
>>>> familiar with
>>>> the implementation details would be able to recognize this limitation.
>>>>
>>>> In terms of the GPUVA manager, unfortunately, it seems like I need
>>>> to drop
>>>> the maple tree and go back to using a rb-tree, since it seems there
>>>> is no
>>>> sane way doing a worst-case pre-allocation that does not suffer from
>>>> this
>>>> limitation.
>>>
>>> I haven't been paying much attention here (too many other things going
>>> on), but something's wrong.
>>>
>>> First, you shouldn't need to preallocate.  Preallocation is only there
>>
>> Unfortunately, I think we really have a case where we have to.
>> Typically GPU mappings are created in a dma-fence signalling critical
>> path and that is where such mappings need to be added to the maple
>> tree. Hence, we can't do any sleeping allocations there.
>>
>>> for really gnarly cases.  The way this is *supposed* to work is that
>>> the store walks down to the leaf, attempts to insert into that leaf
>>> and tries to allocate new nodes with __GFP_NOWAIT.  If that fails,
>>> it drops the spinlock, allocates with the gfp flags you've specified,
>>> then rewalks the tree to retry the store, this time with allocated
>>> nodes in its back pocket so that the store will succeed.
>>
>> You are talking about mas_store_gfp() here, right? And I guess, if the
>> tree has changed while the spinlock was dropped and even more nodes
>> are needed it just retries until it succeeds?
>>
>> But what about mas_preallocate()? What happens if the tree changed in
>> between mas_preallocate() and mas_store_prealloc()? Does the latter
>> one fall back to __GFP_NOWAIT in such a case? I guess not, since
>> mas_store_prealloc() has a void return type, and __GFP_NOWAIT could
>> fail as well.
> mas_store_prealloc() will fall back to __GFP_NOWAIT and issue a warning.
> If __GFP_NOWAIT allocation fails, BUG_ON() in mas_store_prealloc() will
> be triggered.

Ok, so this is an absolute last resort and surely should not be relied on.

I think the maple tree should either strictly enforce (through locking
policy) that this can never happen or, if it is legitimate API-wise not
to lock these two calls against each other, return an error code rather
than issue a warning and, even worse, call BUG_ON() when it can't fix
things up.

- Danilo

>
>>
>> So, how to use the internal spinlock for mas_preallocate() and
>> mas_store_prealloc() to ensure the tree can't change?
>>
>


2023-06-26 19:15:49

by Danilo Krummrich

Subject: Re: [PATCH v2 14/16] maple_tree: Refine mas_preallocate() node calculations

On 6/26/23 16:52, Matthew Wilcox wrote:
> On Mon, Jun 26, 2023 at 04:27:54PM +0200, Danilo Krummrich wrote:
>> On 6/26/23 15:19, Matthew Wilcox wrote:
>>> On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote:
>>>> [...]
>>>
>>> I haven't been paying much attention here (too many other things going
>>> on), but something's wrong.
>>>
>>> First, you shouldn't need to preallocate. Preallocation is only there
>>
>> Unfortunately, I think we really have a case where we have to. Typically GPU
>> mappings are created in a dma-fence signalling critical path and that is
>> where such mappings need to be added to the maple tree. Hence, we can't do
>> any sleeping allocations there.
>
> OK, so there are various ways to handle this, depending on what's
> appropriate for your case.
>
> The simplest is to use GFP_ATOMIC. Essentially, you're saying to the MM
> layer "This is too hard, let me tap into the emergency reserves". It's
> mildly frowned upon, so let's see if we can do better.
>
> If you know where the allocation needs to be stored, but want it to act as
> NULL until the time is right, you can store a ZERO entry. That will read
> as NULL until you store to it. A pure overwriting store will not cause
> any memory allocation since all the implementation has to do is change
> a pointer. The XArray wraps this up nicely behind an xa_reserve() API.
> As you're discovering, the Maple Tree API isn't fully baked yet.
>

Unfortunately, GFP_ATOMIC seems to be the only option. I think storing
entries in advance would not work. Typically userspace submits a job to
the kernel issuing one or multiple requests to map and unmap memory in
an ioctl. Such a job is then put into a queue and processed
asynchronously in a dma-fence signalling critical section. Hence, at the
time we'd store entries in advance, we could have an arbitrary number of
pending jobs potentially still messing with the same address space region.

So, the only way to go seems to be to use mas_store_gfp() with
GFP_ATOMIC directly in the fence signalling critical path. I guess
mas_store_gfp() does not BUG_ON() if it can't get atomic pages?

Also, I just saw that the tree is limited in its height
(MAPLE_HEIGHT_MAX). Do you think it could be a sane alternative to
pre-allocate with MAPLE_HEIGHT_MAX rather than to rely on atomic pages?
Or maybe a compromise of pre-allocating just a couple of nodes and then
relying on atomic pages for the rest?

FYI, we're talking about a magnitude of hundreds of thousands of entries
to be stored in the tree.

- Danilo


2023-06-27 02:38:41

by Liam R. Howlett

Subject: Re: [PATCH v2 14/16] maple_tree: Refine mas_preallocate() node calculations

* Danilo Krummrich <[email protected]> [230626 14:37]:
> On 6/26/23 16:52, Matthew Wilcox wrote:
> > On Mon, Jun 26, 2023 at 04:27:54PM +0200, Danilo Krummrich wrote:
> > > On 6/26/23 15:19, Matthew Wilcox wrote:
> > > > On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote:
> > > > > [...]
> > > >
> > > > I haven't been paying much attention here (too many other things going
> > > > on), but something's wrong.
> > > >
> > > > First, you shouldn't need to preallocate. Preallocation is only there
> > >
> > > Unfortunately, I think we really have a case where we have to. Typically GPU
> > > mappings are created in a dma-fence signalling critical path and that is
> > > where such mappings need to be added to the maple tree. Hence, we can't do
> > > any sleeping allocations there.
> >
> > OK, so there are various ways to handle this, depending on what's
> > appropriate for your case.
> >
> > The simplest is to use GFP_ATOMIC. Essentially, you're saying to the MM
> > layer "This is too hard, let me tap into the emergency reserves". It's
> > mildly frowned upon, so let's see if we can do better.
> >
> > If you know where the allocation needs to be stored, but want it to act as
> > NULL until the time is right, you can store a ZERO entry. That will read
> > as NULL until you store to it. A pure overwriting store will not cause
> > any memory allocation since all the implementation has to do is change
> > a pointer. The XArray wraps this up nicely behind an xa_reserve() API.
> > As you're discovering, the Maple Tree API isn't fully baked yet.
> >
>
> Unfortunately, GFP_ATOMIC seems to be the only option. I think storing
> entries in advance would not work. Typically userspace submits a job to the
> kernel issuing one or multiple requests to map and unmap memory in an ioctl.
> Such a job is then put into a queue and processed asynchronously in a
> dma-fence signalling critical section. Hence, at the time we'd store entries
> in advance, we could have an arbitrary number of pending jobs potentially
> still messing with the same address space region.

What I think you are saying is that you have a number of requests
flooding in, which may overwrite the same areas, but are queued up to be
written after they are queued. These operations look to be kept in
order according to the code in nouveau_job_submit[1]. Is this correct?

So then, your issue isn't that you don't know where they will land, but
that you don't know if the area that you reserved is already split into
other areas? For instance, before the range 5-10 is backed by whatever
happens in the fence, it may have already become 5-6 & 8-10 because of
something that came after (from userspace) but hasn't yet been processed
by the kernel and will live at 7? So you can't write 5-10 right away
because you can't be sure 5-10 is going to exist once you reach the
kernel fence code that stores the entry?

Is my understanding of your issue correct?

Oh, and I guess the queued requests would have to remain ordered between
threads or whatever is on the other side? I mean, you can't have two
threads firing different things into the kernel at the same region
because I would think the results would be unpredictable?

Can these overlapping entries partially overlap one region and another?
That is, can you have three in-flight writes that do something like:
store 1-10, store 10-20, store 5-15?

How stable of an output is needed? Does each kernel write need to be
100% correct or is there a point where the userspace updates stop and
only then it is needed to be stable?

>
> So, the only way to go seems to be to use mas_store_gfp() with GFP_ATOMIC
> directly in the fence signalling critical path. I guess mas_store_gfp() does
> not BUG_ON() if it can't get atomic pages?
>
> Also, I just saw that the tree is limited in its height (MAPLE_HEIGHT_MAX).
> Do you think it could be a sane alternative to pre-allocate with
> MAPLE_HEIGHT_MAX rather than to rely on atomic pages? Or maybe a compromise
> of pre-allocating just a couple of nodes and then rely on atomic pages for
> the rest?
>
> FYI, we're talking about a magnitude of hundreds of thousands of entries to
> be stored in the tree.
>

Since you are not tracking gaps, you will get 16 entries per node. The
maximum height is 31, so that would be 16^31 entries; assuming a gap
between each entry (the worst case), you can cut that in half. To
ensure you can successfully allocate storage for a new entry, you'd
need to allocate 30 * 3 + 1, or 91 nodes, which is 6 pages. That'll be
highly wasteful, as almost all of these would be freed, and sometimes
all of them.

You estimate less than 1M entries; that would never go over 6 levels
(8.3M entries in the worst case). 5 levels would get you 500K in the
worst case, but realistically you'll be in 5 levels almost always. So,
5 * 3 + 1 = 16 nodes, which just fits in a single page, assuming 4k
pages.

[1] https://lore.kernel.org/linux-mm/[email protected]/T/#Z2e.:..:20230620004217.4700-4-dakr::40redhat.com:1drivers:gpu:drm:drm_gem.c
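A quick userspace check of the arithmetic above, assuming 256-byte maple nodes, 16 slots per node, and 4k pages (note that 5 * 3 + 1 is 16 nodes, which fits exactly in one page):

```c
#include <assert.h>
#include <stdint.h>

#define NODE_SIZE	256u	/* sizeof(struct maple_node) */
#define PAGE_SIZE_4K	4096u

/* Pages needed to hold a given number of preallocated nodes. */
static unsigned int pages_for(unsigned int nodes)
{
	return (nodes * NODE_SIZE + PAGE_SIZE_4K - 1) / PAGE_SIZE_4K;
}

/* Worst-case entry capacity: 16 slots per level, halved because a
 * gap between each entry burns every other slot. */
static uint64_t worst_case_entries(unsigned int height)
{
	uint64_t slots = 1;
	for (unsigned int i = 0; i < height; i++)
		slots *= 16;
	return slots / 2;
}
```

This reproduces the 91-node/6-page worst case and the ~8.3M (6 levels) and ~500K (5 levels) capacity estimates.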

2023-06-27 15:32:22

by Danilo Krummrich

Subject: Re: [PATCH v2 14/16] maple_tree: Refine mas_preallocate() node calculations

Hi Liam,

On 6/27/23 03:58, Liam R. Howlett wrote:
> * Danilo Krummrich <[email protected]> [230626 14:37]:
>> On 6/26/23 16:52, Matthew Wilcox wrote:
>>> [...]
>>
>> Unfortunately, GFP_ATOMIC seems to be the only option. I think storing
>> entries in advance would not work. Typically userspace submits a job to the
>> kernel issuing one or multiple requests to map and unmap memory in an ioctl.
>> Such a job is then put into a queue and processed asynchronously in a
>> dma-fence signalling critical section. Hence, at the time we'd store entries
>> in advance, we could have an arbitrary number of pending jobs potentially
>> still messing with the same address space region.
>
> What I think you are saying is that you have a number of requests
> flooding in, which may overwrite the same areas, but are queued up to be
> written after they are queued. These operations look to be kept in
> order according to the code in nouveau_job_submit[1]. Is this correct?

That's all correct.

(Although Nouveau isn't a good example in this case. Some aspects of it
do and some aspects of it do not apply to the problem we're discussing
here.)

>
> So then, your issue isn't that you don't know where they will land, but
> don't know if the area that you reserved is already split into other
> areas? For instance, before the range 5-10 is backed by whatever
> happens in the fence, it may have already become 5-6 & 8-10 by something
> that came after (from userspace) but hasn't been processed by the
> kernel that will live at 7? So you can't write 5-10 right away because
> you can't be sure 5-10 is going to exist once you reach the kernel fence
> code that stores the entry?
>
> Is my understanding of your issue correct?

Yes, it is.

However, the problem already starts while trying to reserve an area. In
order to satisfy a user request, such a request is broken down into
operations such as unmap mappings which are in the way entirely, remap
mappings which intersect with the requested mapping and finally map the
requested mapping. The execution of such a sequence must appear atomic
and hence be locked accordingly. When trying to reserve an area we'd
need to take that lock. But since this lock would be used in the
dma-fence signalling critical path as well we'd not be allowed to do
sleeping allocations while holding this lock.

Potentially, this could be solved with a retry loop though. Drop the
lock while allocating, take it again and check whether we still got
enough nodes allocated. Analogous to what the maple tree does in
mas_store_gfp(), I guess.
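The retry loop sketched in the previous paragraph could look roughly like this; the pthread mutex and the nodes_needed() helper are hypothetical stand-ins for the GPUVA manager's lock and for re-walking the tree to recount the required nodes:

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t va_lock = PTHREAD_MUTEX_INITIALIZER;

/* Hypothetical: re-walk the tree and report how many nodes the
 * pending store would need right now. */
static int nodes_needed(void)
{
	return 3;
}

static int reserve_with_retry(void)
{
	int allocated = 0;

	pthread_mutex_lock(&va_lock);
	for (;;) {
		int need = nodes_needed();

		if (allocated >= need)
			break;            /* reservation holds: store cannot fail */
		pthread_mutex_unlock(&va_lock);
		allocated = need;         /* sleeping allocation happens here */
		pthread_mutex_lock(&va_lock);
		/* The tree may have changed while unlocked: re-check. */
	}
	/* ... perform the store using the preallocated nodes ... */
	pthread_mutex_unlock(&va_lock);
	return allocated;
}
```

In the uncontended path the lock is dropped exactly once for the sleeping allocation, analogous to the unlock/allocate/relock/re-check pattern of mas_store_gfp().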

>
> Oh, and I guess the queued requests would have to remain ordered between
> threads or whatever is on the other side? I mean, you can't have two
> threads firing different things into the kernel at the same region
> because I would think the results would be unpredictable?

Once jobs are queued up in the kernel, they remain ordered.

However, user threads could concurrently push jobs to the kernel
altering the same region of the address space - it just would not make
any sense for userspace to do that.

In general userspace is responsible for the semantics of the address
space. The kernel basically just takes any (valid) request and makes it
happen. It also assures waiting and signalling of fences which might be
bound to certain jobs and obviously keeps track of the VA space to be
able to clean things up once a client disappears.

>
> Can these overlapping entries partially overlap one region and another?
> That is, can you have three in-flight writes that do something like:
> store 1-10, store 10-20, store 5-15?

Absolutely, yes.

>
> How stable of an output is needed? Does each kernel write need to be
> 100% correct or is there a point where the userspace updates stop and
> only then it is needed to be stable?

It needs to be 100% correct all the time. The reason is that, as
mentioned above, every job can carry in- and out-fences, such that
userspace can order these jobs against the execution of shaders.

This is also why there could be jobs queued up, where all of them apply
changes to the same region within the VA space, since there might be
shader executions (or just memory copies) ordered right between them.

- Danilo



2023-06-27 17:02:01

by Liam R. Howlett

Subject: Re: [PATCH v2 14/16] maple_tree: Refine mas_preallocate() node calculations

* Danilo Krummrich <[email protected]> [230627 10:58]:
> Hi Liam,
>
> On 6/27/23 03:58, Liam R. Howlett wrote:
> > * Danilo Krummrich <[email protected]> [230626 14:37]:
> > > [...]
> >
> > What I think you are saying is that you have a number of requests
> > flooding in, which may overwrite the same areas, but are queued up to be
> > written after they are queued. These operations look to be kept in
> > order according to the code in nouveau_job_submit[1]. Is this correct?
>
> That's all correct.
>
> (Although Nouveau isn't a good example in this case. Some aspects of it do
> and some aspects of it do not apply to the problem we're discussing here.)
>
> >
> > So then, your issue isn't that you don't know where they will land, but
> > don't know if the area that you reserved is already split into other
> > areas? For instance, before the range 5-10 is backed by whatever
> > happens in the fence, it may have already become 5-6 & 8-10 by something
> > that came after (from userspace) but hasn't been processed by the
> > kernel that will live at 7? So you can't write 5-10 right away because
> > you can't be sure 5-10 is going to exist once you reach the kernel fence
> > code that stores the entry?
> >
> > Is my understanding of your issue correct?
>
> Yes, it is.
>
> However, the problem already starts while trying to reserve an area. In
> order to satisfy a user request, such a request is broken down into
> operations such as unmap mappings which are in the way entirely, remap
> mappings which intersect with the requested mapping and finally map the
> requested mapping. The execution of such a sequence must appear atomic and
> hence be locked accordingly. When trying to reserve an area we'd need to
> take that lock. But since this lock would be used in the dma-fence
> signalling critical path as well we'd not be allowed to do sleeping
> allocations while holding this lock.
>
> Potentially, this could be solved with a retry loop though. Drop the lock
> while allocating, take it again and check whether we still got enough nodes
> allocated. Analogous to what the maple tree does in mas_store_gfp(), I
> guess.
>
> >
> > Oh, and I guess the queued requests would have to remain ordered between
> > threads or whatever is on the other side? I mean, you can't have two
> > threads firing different things into the kernel at the same region
> > because I would think the results would be unpredictable?
>
> Once a job is queued up in the kernel they remain ordered.
>
> However, user threads could concurrently push jobs to the kernel altering
> the same region of the address space - it just would not make any sense for
> userspace to do that.
>
> In general userspace is responsible for the semantics of the address space.
> The kernel basically just takes any (valid) request and make it happen. It
> also assures waiting and signalling of fences which might be bound to
> certain jobs and obviously keeps track of the VA space to be able to clean
> things up once a client disappears.
>
> >
> > Can these overlapping entries partially overlap one region and another?
> > That is, can you have three in-flight writes that do something like:
> > store 1-10, store 10-20, store 5-15?
>
> Absolutely, yes.
>
> >
> > How stable of an output is needed? Does each kernel write need to be
> > 100% correct or is there a point where the userspace updates stop and
> > only then it is needed to be stable?
>
> It needs to be 100% correct all the time. The reason is that, as mentioned
> above, every job can carry in- and out-fences, such that userspace can order
> these jobs against the execution of shaders.

But each job is split into parts, so the fences surround these groups of
operations?

Since ordering is kept, you must reach a point before entering the
fences where you could call mas_preallocate() to ensure enough nodes
exist to install the new mapping, after which no other operations will
be happening. I guess what you are saying is that each fence has more
than one tree operation?

As long as you are not mapping more than one range, this should be
possible in a single write and thus a single preallocation. You can do
this by not actually writing unmaps/remaps to the tree within the fence.
Once the out-fence is reached, the operation looks atomic.

Reading your patch, it is not clear this is accurate for VM_BIND of
asynchronous syncobjs. Is the fence spanning multiple syncobjs with
various ranges to map? Or are these things the split-up tasks of
unmap/remap, etc that will eventually boil down to what appears to be a
single write?

>
> This is also why there could be jobs queued up, where all of them apply
> changes to the same region within the VA space, since there might be shader
> executions (or just memory copies) ordered right between them.
>
> - Danilo
>
> >
> > >
> > > So, the only way to go seems to be to use mas_store_gfp() with GFP_ATOMIC
> > > directly in the fence signalling critical path. I guess mas_store_gfp() does
> > > not BUG_ON() if it can't get atomic pages?
> > >
> > > Also, I just saw that the tree is limited in its height (MAPLE_HEIGHT_MAX).
> > > Do you think it could be a sane alternative to pre-allocate with
> > > MAPLE_HEIGHT_MAX rather than to rely on atomic pages? Or maybe a compromise
> > > of pre-allocating just a couple of nodes and then rely on atomic pages for
> > > the rest?
> > >
> > > FYI, we're talking about a magnitude of hundreds of thousands of entries to
> > > be stored in the tree.
> > >
> >
> > Since you are not tracking gaps, you will get 16 entries per node. The
> > maximum height is 31, so that would be 16^31; assuming a gap between
> > each entry (the worst case), you can cut that in half. To ensure you can
> > successfully allocate storage for a new entry, you'd need to allocate
> > 30 * 3 + 1, or 91 nodes, which is 6 pages. That'll be highly wasteful,
> > as almost all of these would be freed, and sometimes all of them.
> >
> > You estimate less than 1M entries, which would never go over 6 levels (8.3M
> > entries in the worst case). 5 levels would get you 500K in the worst
> > case, but realistically you'll be in the 5 levels almost always. So,
> > 5 * 3 + 1 = 16 nodes, or exactly 1 page, assuming 4k pages.
> >
> > [1] https://lore.kernel.org/linux-mm/[email protected]/T/#Z2e.:..:20230620004217.4700-4-dakr::40redhat.com:1drivers:gpu:drm:drm_gem.c
> >
>

2023-06-27 22:45:55

by Danilo Krummrich

[permalink] [raw]
Subject: Re: [PATCH v2 14/16] maple_tree: Refine mas_preallocate() node calculations

On 6/27/23 18:11, Liam R. Howlett wrote:
> * Danilo Krummrich <[email protected]> [230627 10:58]:
>> Hi Liam,
>>
>> On 6/27/23 03:58, Liam R. Howlett wrote:
>>> * Danilo Krummrich <[email protected]> [230626 14:37]:
>>>> On 6/26/23 16:52, Matthew Wilcox wrote:
>>>>> On Mon, Jun 26, 2023 at 04:27:54PM +0200, Danilo Krummrich wrote:
>>>>>> On 6/26/23 15:19, Matthew Wilcox wrote:
>>>>>>> On Mon, Jun 26, 2023 at 02:38:06AM +0200, Danilo Krummrich wrote:
>>>>>>>> On the other hand, unless I miss something (and if so, please let me know),
>>>>>>>> something is bogus with the API then.
>>>>>>>>
>>>>>>>> While the documentation of the Advanced API of the maple tree explicitly
>>>>>>>> claims that the user of the API is responsible for locking, this should be
>>>>>>>> limited to the bounds set by the maple tree implementation. Which means, the
>>>>>>>> user must decide for either the internal (spin-) lock or an external lock
>>>>>>>> (which possibly goes away in the future) and acquire and release it
>>>>>>>> according to the rules maple tree enforces through lockdep checks.
>>>>>>>>
>>>>>>>> Let's say one picks the internal lock. How is one supposed to ensure the
>>>>>>>> tree isn't modified using the internal lock with mas_preallocate()?
>>>>>>>>
>>>>>>>> Besides that, I think the documentation should definitely mention this
>>>>>>>> limitation and give some guidance for the locking.
>>>>>>>>
>>>>>>>> Currently, from an API perspective, I can't see how anyone not familiar with
>>>>>>>> the implementation details would be able to recognize this limitation.
>>>>>>>>
>>>>>>>> In terms of the GPUVA manager, unfortunately, it seems like I need to drop
>>>>>>>> the maple tree and go back to using a rb-tree, since it seems there is no
>>>>>>>> sane way doing a worst-case pre-allocation that does not suffer from this
>>>>>>>> limitation.
>>>>>>>
>>>>>>> I haven't been paying much attention here (too many other things going
>>>>>>> on), but something's wrong.
>>>>>>>
>>>>>>> First, you shouldn't need to preallocate. Preallocation is only there
>>>>>>
>>>>>> Unfortunately, I think we really have a case where we have to. Typically GPU
>>>>>> mappings are created in a dma-fence signalling critical path and that is
>>>>>> where such mappings need to be added to the maple tree. Hence, we can't do
>>>>>> any sleeping allocations there.
>>>>>
>>>>> OK, so there are various ways to handle this, depending on what's
>>>>> appropriate for your case.
>>>>>
>>>>> The simplest is to use GFP_ATOMIC. Essentially, you're saying to the MM
>>>>> layer "This is too hard, let me tap into the emergency reserves". It's
>>>>> mildly frowned upon, so let's see if we can do better.
>>>>>
>>>>> If you know where the allocation needs to be stored, but want it to act as
>>>>> NULL until the time is right, you can store a ZERO entry. That will read
>>>>> as NULL until you store to it. A pure overwriting store will not cause
>>>>> any memory allocation since all the implementation has to do is change
>>>>> a pointer. The XArray wraps this up nicely behind an xa_reserve() API.
>>>>> As you're discovering, the Maple Tree API isn't fully baked yet.
>>>>>
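As an aside, the reserve-then-overwrite idea described above can be
modelled in a few lines. This is a toy illustration of the semantics
(the XArray wraps the real thing behind xa_reserve()), not actual
kernel code; all names here are made up:

```c
#include <stddef.h>

/*
 * Toy model of "reserve now, store later": a reserved slot holds a
 * sentinel that reads back as NULL, so the eventual store is a pure
 * pointer overwrite and needs no allocation.  Illustrative only --
 * not the real XArray implementation.
 */
static void *zero_entry_sentinel;
#define ZERO_ENTRY ((void *)&zero_entry_sentinel)

static void *slots[64];			/* stand-in for the tree */

static void reserve(unsigned long index)
{
	slots[index] = ZERO_ENTRY;	/* allocation would happen here */
}

static void *load(unsigned long index)
{
	void *entry = slots[index];

	return entry == ZERO_ENTRY ? NULL : entry;	/* reads as NULL */
}

static void store(unsigned long index, void *ptr)
{
	slots[index] = ptr;		/* pure overwrite, no allocation */
}
```

With this, reserve(5) makes load(5) return NULL until store(5, ptr)
flips the pointer, which is the property that makes the later store
safe in an atomic context.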
>>>>
>>>> Unfortunately, GFP_ATOMIC seems to be the only option. I think storing
>>>> entries in advance would not work. Typically userspace submits a job to the
>>>> kernel issuing one or multiple requests to map and unmap memory in an ioctl.
>>>> Such a job is then put into a queue and processed asynchronously in a
>>>> dma-fence signalling critical section. Hence, at the time we'd store entries
>>>> in advance, we could have an arbitrary amount of pending jobs potentially
>>>> still messing with the same address space region.
>>>
>>> What I think you are saying is that you have a number of requests
>>> flooding in, which may overwrite the same areas, and are queued up to be
>>> written later. These operations look to be kept in order according to
>>> the code in nouveau_job_submit[1]. Is this correct?
>>
>> That's all correct.
>>
>> (Although Nouveau isn't a good example in this case. Some aspects of it do
>> and some aspects of it do not apply to the problem we're discussing here.)
>>
>>>
>>> So then, your issue isn't that you don't know where they will land, but
>>> that you don't know if the area that you reserved is already split into
>>> other areas? For instance, before the range 5-10 is backed by whatever
>>> happens in the fence, it may have already become 5-6 & 8-10 by something
>>> that came after (from userspace) but hasn't been processed by the
>>> kernel, and that will live at 7? So you can't write 5-10 right away
>>> because you can't be sure 5-10 is going to exist once you reach the
>>> kernel fence code that stores the entry?
>>>
>>> Is my understanding of your issue correct?
>>
>> Yes, it is.
>>
>> However, the problem already starts while trying to reserve an area. In
>> order to satisfy a user request, such a request is broken down into
>> operations such as unmap mappings which are in the way entirely, remap
>> mappings which intersect with the requested mapping and finally map the
>> requested mapping. The execution of such a sequence must appear atomic and
>> hence be locked accordingly. When trying to reserve an area we'd need to
>> take that lock. But since this lock would be used in the dma-fence
>> signalling critical path as well we'd not be allowed to do sleeping
>> allocations while holding this lock.
>>
>> Potentially, this could be solved with a retry loop though. Drop the lock
>> while allocating, take it again and check whether we still got enough nodes
>> allocated. Analogous to what the maple tree does in mas_store_gfp(), I
>> guess.
>>
>>>
>>> Oh, and I guess the queued requests would have to remain ordered between
>>> threads or whatever is on the other side? I mean, you can't have two
>>> threads firing different things into the kernel at the same region
>>> because I would think the results would be unpredictable?
>>
>> Once jobs are queued up in the kernel they remain ordered.
>>
>> However, user threads could concurrently push jobs to the kernel altering
>> the same region of the address space - it just would not make any sense for
>> userspace to do that.
>>
>> In general userspace is responsible for the semantics of the address space.
>> The kernel basically just takes any (valid) request and makes it happen. It
>> also ensures waiting and signalling of fences which might be bound to
>> certain jobs and obviously keeps track of the VA space to be able to clean
>> things up once a client disappears.
>>
>>>
>>> Can these overlapping entries partially overlap one region and another?
>>> That is, can you have three in-flight writes that do something like:
>>> store 1-10, store 10-20, store 5-15?
>>
>> Absolutely, yes.
>>
>>>
>>> How stable of an output is needed? Does each kernel write need to be
>>> 100% correct or is there a point where the userspace updates stop and
>>> only then it is needed to be stable?
>>
>> It needs to be 100% correct all the time. The reason is that, as mentioned
>> above, every job can carry in- and out-fences, such that userspace can order
>> these jobs against the execution of shaders.
>
> But each job is split into parts, so the fences surround these groups of
> operations?

Yes, each job can have multiple requests to map or unmap something and
each of them gets broken down into operations to make them happen. The
fences are per job.

>
> Since ordering is kept, there must be a point, before entering the
> fences, where mas_preallocate() could be called to ensure enough nodes
> exist to install the new mapping, and then no other operations will be
> happening. I guess what you are saying is that each fence has more than
> one tree operation?
>

I guess you assume that in the asynchronous path, where jobs are fetched
from the queue for execution, there must be a point in time before we
enter the fence signalling critical path. This is not the case.

The fence signalling critical path is entered once the corresponding
out-fence is published to userspace and hence becomes visible to
userspace. This happens when the job submit ioctl() (which is where the
job is queued up for execution) returns. Hence, all jobs in the queue
have potentially entered their fence signalling critical path before
they could even have been fetched from the queue.

The job submit ioctl() is where we would need to call mas_preallocate(),
but this is also where we could have concurrent modifications to the
tree from previously submitted jobs that have been fetched from the
queue for asynchronous execution.
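A possible shape for the retry pattern mentioned earlier (drop the lock
to allocate, retake it, re-check) is sketched below. All names are
hypothetical stand-ins, not a real maple tree or driver API, and the
sleeping allocation itself is elided:

```c
#include <pthread.h>

/*
 * Hypothetical retry loop: sleeping allocations must happen with the
 * lock dropped, and the node count must be re-checked after retaking
 * the lock, since concurrent jobs may have consumed nodes meanwhile.
 */
static pthread_mutex_t va_lock = PTHREAD_MUTEX_INITIALIZER;
static int pool_nodes;			/* preallocated nodes available */

static int nodes_required(void)
{
	return 16;			/* stand-in worst-case estimate */
}

/* Returns with va_lock held and enough nodes preallocated. */
static void reserve_nodes(void)
{
	pthread_mutex_lock(&va_lock);
	while (pool_nodes < nodes_required()) {
		int want = nodes_required() - pool_nodes;

		/* Can't sleep under the lock: drop it to allocate. */
		pthread_mutex_unlock(&va_lock);
		/* ... sleeping allocation of 'want' nodes here ... */
		pthread_mutex_lock(&va_lock);
		pool_nodes += want;
		/* Loop condition re-checks: others may have taken nodes. */
	}
}
```

The re-check in the loop condition is what makes this analogous to the
internal retry in mas_store_gfp() that the quoted text refers to.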

> As long as you are not mapping more than one range, this should be possible
> in a single write and thus a single preallocation. You can do this by
> not actually writing the unmaps/remaps to the tree within the fence. Once
> the out-fence is reached, the operation looks atomic.

As mentioned above there can be an arbitrary amount of map and unmap
requests per job.

Also, we can have cases where we, for instance, have a mapping A[0,10]
either already in the tree (or pending in the job queue) and a user
requests B[4,7].

The expected result would be: A'[0,4], B[4,7] and A''[7,10]. Hence, one
job can even cause multiple writes per single map request.

At the time the job asking to map B is submitted, A might not even be in
the tree yet.

>
> Reading your patch, it is not clear this is accurate for VM_BIND of
> asynchronous syncobjs. Does the fence span multiple syncobjs with
> various ranges to map? Or are these things the split-up tasks of
> unmap/remap, etc, that will eventually boil down to what appears to be a
> single write?
>

The VM_BIND ioctl() is the ioctl() mentioned above to submit (bind)
jobs. Userspace can pass syncobjs together with a job: syncobjs
containing fences to wait for before executing the job, and syncobjs to
install the job's out-fence to, so that userspace can either wait on
them or pass them into another kernel interface to synchronize them
against something else.

Otherwise, as mentioned above, each job can have multiple requests to
map or unmap something, and each of them gets broken down into
operations to make such a request happen, which themselves can be
re-maps, unmaps and maps. For instance, an unmap request for a given
range gets broken down into re-maps of the mappings it intersects and
unmaps of the mappings it entirely spans.

Thanks,
Danilo

>>
>> This is also why there could be jobs queued up, where all of them apply
>> changes to the same region within the VA space, since there might be shader
>> executions (or just memory copies) ordered right between them.
>>
>> - Danilo
>>
>>>
>>>>
>>>> So, the only way to go seems to be to use mas_store_gfp() with GFP_ATOMIC
>>>> directly in the fence signalling critical path. I guess mas_store_gfp() does
>>>> not BUG_ON() if it can't get atomic pages?
>>>>
>>>> Also, I just saw that the tree is limited in its height (MAPLE_HEIGHT_MAX).
>>>> Do you think it could be a sane alternative to pre-allocate with
>>>> MAPLE_HEIGHT_MAX rather than to rely on atomic pages? Or maybe a compromise
>>>> of pre-allocating just a couple of nodes and then rely on atomic pages for
>>>> the rest?
>>>>
>>>> FYI, we're talking about a magnitude of hundreds of thousands of entries to
>>>> be stored in the tree.
>>>>
>>>
>>> Since you are not tracking gaps, you will get 16 entries per node. The
>>> maximum height is 31, so that would be 16^31; assuming a gap between
>>> each entry (the worst case), you can cut that in half. To ensure you can
>>> successfully allocate storage for a new entry, you'd need to allocate
>>> 30 * 3 + 1, or 91 nodes, which is 6 pages. That'll be highly wasteful,
>>> as almost all of these would be freed, and sometimes all of them.
>>>
>>> You estimate less than 1M entries, which would never go over 6 levels (8.3M
>>> entries in the worst case). 5 levels would get you 500K in the worst
>>> case, but realistically you'll be in the 5 levels almost always. So,
>>> 5 * 3 + 1 = 16 nodes, or exactly 1 page, assuming 4k pages.
>>>
>>> [1] https://lore.kernel.org/linux-mm/[email protected]/T/#Z2e.:..:20230620004217.4700-4-dakr::40redhat.com:1drivers:gpu:drm:drm_gem.c
>>>
>>
>