2023-07-26 08:51:54

by Peng Zhang

Subject: [PATCH 00/11] Introduce mt_dup() to improve the performance of fork()

A few weeks ago, Liam and I discussed "Fork & Dup tree + Delete DONT_COPY" in
[1]. Thanks to Liam for all the useful information provided there.

I didn't use the scheme of linking the leaf nodes, nor the scheme of building
the tree in BFS order. In the end I build the new tree in DFS (preorder)
order. I implemented this algorithm and it turns out to be very efficient.
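
For illustration, duplication in DFS order works roughly like the sketch
below. This is only a standalone sketch on a generic n-ary tree (the struct
and helper names here are made up); the real __mt_dup() in patch [04/11] is
iterative, works on maple nodes and handles allocation failure:

#include <stdlib.h>
#include <string.h>

#define MAX_SLOTS 16

struct node {
        struct node *parent;
        void *slot[MAX_SLOTS];          /* child nodes, or entries in a leaf */
        unsigned long pivot[MAX_SLOTS];
        int nr;                         /* number of used slots */
        int leaf;
};

/*
 * Duplicate the subtree rooted at @src in preorder (DFS): copy the whole
 * node in one go, then descend into each child and fix up its parent.
 */
static struct node *dup_subtree(const struct node *src, struct node *parent)
{
        struct node *dst = malloc(sizeof(*dst));

        if (!dst)
                return NULL;
        memcpy(dst, src, sizeof(*dst)); /* pivots, metadata and leaf entries */
        dst->parent = parent;
        if (!src->leaf) {
                for (int i = 0; i < src->nr; i++)
                        dst->slot[i] = dup_subtree(src->slot[i], dst);
        }
        return dst;
}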

Using bench_forking() in lib/test_maple_tree.c to test in user space, the
performance numbers for duplicating the maple tree are as follows:

before: 13.52s
after: 2.60s

These numbers are the time taken to duplicate 80,000 maple trees, each with
134 entries. They do not include the time consumed by mt_validate() and
mtree_destroy(). The time consumed is reduced by 80.77%
((13.52 - 2.60) / 13.52).

The performance improvement of fork() can be summarized as follows:
With 23 VMAs, performance improves by about 3%, with 223 VMAs, performance
improves by about 15%, and with 4023 VMAs, performance improves by about 30%.
See patch[11/11] for details.

In addition, with Liam's consent, I would like to assist him in maintaining
the maple tree. I will continue to contribute to the development of the maple
tree in the future.

The layout of these patches:
001 - 003: Introduce some internal functions to facilitate the
implementation of mt_dup().
004 - 005: Introduce __mt_dup() and mt_dup(), and their tests (a short
usage sketch follows this list).
006: Introduce mas_replace_entry() to efficiently replace an entry.
007 - 009: Follow-up work for the new interfaces: documentation update and
test-suite adjustments.
010: Add myself as co-maintainer for maple tree.
011: Use __mt_dup() to duplicate maple tree in dup_mmap().
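
As a quick reference, using mt_dup() looks roughly like the sketch below
(based on the tests in patch [05/11]; "old" stands for an already populated
tree and error handling is reduced to a minimum):

DEFINE_MTREE(new);
int ret;

/* mt_dup() takes the internal ma_lock itself. */
ret = mt_dup(&old, &new, GFP_KERNEL);
if (ret)
        return ret;     /* -ENOMEM on allocation failure */
/* ... use the duplicate ... */
mtree_destroy(&new);

__mt_dup() is the same operation for callers that handle the locking
externally, as dup_mmap() does in patch [11/11].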

[1] https://lore.kernel.org/lkml/[email protected]/

Peng Zhang (11):
maple_tree: Introduce ma_nonleaf_data_end{_nocheck}()
maple_tree: Validate MAPLE_ENODE and ma_nonleaf_data_end()
maple_tree: Add some helper functions
maple_tree: Introduce interfaces __mt_dup() and mt_dup()
maple_tree: Add test for mt_dup()
maple_tree: Introduce mas_replace_entry() to directly replace an entry
maple_tree: Update the documentation of maple tree
maple_tree: Skip other tests when BENCH is enabled
maple_tree: Update check_forking() and bench_forking()
MAINTAINERS: Add co-maintainer for maple tree
fork: Use __mt_dup() to duplicate maple tree in dup_mmap()

Documentation/core-api/maple_tree.rst | 10 +
MAINTAINERS | 1 +
include/linux/maple_tree.h | 4 +
kernel/fork.c | 35 ++-
lib/maple_tree.c | 389 ++++++++++++++++++++++++--
lib/test_maple_tree.c | 67 ++---
mm/mmap.c | 14 +-
tools/testing/radix-tree/maple.c | 204 ++++++++++++++
8 files changed, 658 insertions(+), 66 deletions(-)

--
2.20.1



2023-07-26 08:56:47

by Peng Zhang

Subject: [PATCH 05/11] maple_tree: Add test for mt_dup()

Add test for mt_dup().

Signed-off-by: Peng Zhang <[email protected]>
---
tools/testing/radix-tree/maple.c | 202 +++++++++++++++++++++++++++++++
1 file changed, 202 insertions(+)

diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index e5da1cad70ba..3052e899e5df 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -35857,6 +35857,204 @@ static noinline void __init check_locky(struct maple_tree *mt)
mt_clear_in_rcu(mt);
}

+/*
+ * Compare two nodes and return 0 if they are the same, non-zero otherwise.
+ */
+static int __init compare_node(struct maple_enode *enode_a,
+ struct maple_enode *enode_b)
+{
+ struct maple_node *node_a, *node_b;
+ struct maple_node a, b;
+ void **slots_a, **slots_b; /* Do not use the rcu tag. */
+ enum maple_type type;
+ int i;
+
+ if (((unsigned long)enode_a & MAPLE_NODE_MASK) !=
+ ((unsigned long)enode_b & MAPLE_NODE_MASK)) {
+ pr_err("The lower 8 bits of enode are different.\n");
+ return -1;
+ }
+
+ type = mte_node_type(enode_a);
+ node_a = mte_to_node(enode_a);
+ node_b = mte_to_node(enode_b);
+ a = *node_a;
+ b = *node_b;
+
+ /* Do not compare addresses. */
+ if (ma_is_root(node_a) || ma_is_root(node_b)) {
+ a.parent = (struct maple_pnode *)((unsigned long)a.parent &
+ MA_ROOT_PARENT);
+ b.parent = (struct maple_pnode *)((unsigned long)b.parent &
+ MA_ROOT_PARENT);
+ } else {
+ a.parent = (struct maple_pnode *)((unsigned long)a.parent &
+ MAPLE_NODE_MASK);
+ b.parent = (struct maple_pnode *)((unsigned long)b.parent &
+ MAPLE_NODE_MASK);
+ }
+
+ if (a.parent != b.parent) {
+ pr_err("The lower 8 bits of parents are different. %p %p\n",
+ a.parent, b.parent);
+ return -1;
+ }
+
+ /*
+ * If it is a leaf node, the slots do not contain the node address, and
+ * no special processing of slots is required.
+ */
+ if (ma_is_leaf(type))
+ goto cmp;
+
+ slots_a = ma_slots(&a, type);
+ slots_b = ma_slots(&b, type);
+
+ for (i = 0; i < mt_slots[type]; i++) {
+ if (!slots_a[i] && !slots_b[i])
+ break;
+
+ if (!slots_a[i] || !slots_b[i]) {
+ pr_err("The number of slots is different.\n");
+ return -1;
+ }
+
+ /* Do not compare addresses in slots. */
+ ((unsigned long *)slots_a)[i] &= MAPLE_NODE_MASK;
+ ((unsigned long *)slots_b)[i] &= MAPLE_NODE_MASK;
+ }
+
+cmp:
+ /*
+ * Compare all contents of two nodes, including parent (except address),
+ * slots (except address), pivots, gaps and metadata.
+ */
+ return memcmp(&a, &b, sizeof(struct maple_node));
+}
+
+/*
+ * Compare two trees and return 0 if they are the same, non-zero otherwise.
+ */
+static int __init compare_tree(struct maple_tree *mt_a, struct maple_tree *mt_b)
+{
+ MA_STATE(mas_a, mt_a, 0, 0);
+ MA_STATE(mas_b, mt_b, 0, 0);
+
+ if (mt_a->ma_flags != mt_b->ma_flags) {
+ pr_err("The flags of the two trees are different.\n");
+ return -1;
+ }
+
+ mas_dfs_preorder(&mas_a);
+ mas_dfs_preorder(&mas_b);
+
+ if (mas_is_ptr(&mas_a) || mas_is_ptr(&mas_b)) {
+ if (!(mas_is_ptr(&mas_a) && mas_is_ptr(&mas_b))) {
+ pr_err("One is MAS_ROOT and the other is not.\n");
+ return -1;
+ }
+ return 0;
+ }
+
+ while (!mas_is_none(&mas_a) || !mas_is_none(&mas_b)) {
+
+ if (mas_is_none(&mas_a) || mas_is_none(&mas_b)) {
+ pr_err("One is MAS_NONE and the other is not.\n");
+ return -1;
+ }
+
+ if (mas_a.min != mas_b.min ||
+ mas_a.max != mas_b.max) {
+ pr_err("mas->min, mas->max do not match.\n");
+ return -1;
+ }
+
+ if (compare_node(mas_a.node, mas_b.node)) {
+ pr_err("The contents of nodes %p and %p are different.\n",
+ mas_a.node, mas_b.node);
+ mt_dump(mt_a, mt_dump_dec);
+ mt_dump(mt_b, mt_dump_dec);
+ return -1;
+ }
+
+ mas_dfs_preorder(&mas_a);
+ mas_dfs_preorder(&mas_b);
+ }
+
+ return 0;
+}
+
+static noinline void __init check_mt_dup(struct maple_tree *mt)
+{
+ DEFINE_MTREE(new);
+ int i, j, ret, count = 0;
+
+ /* Stored in the root pointer. */
+ mt_init_flags(&tree, 0);
+ mtree_store_range(&tree, 0, 0, xa_mk_value(0), GFP_KERNEL);
+ mt_dup(&tree, &new, GFP_KERNEL);
+ mt_validate(&new);
+ if (compare_tree(&tree, &new))
+ MT_BUG_ON(&new, 1);
+
+ mtree_destroy(&tree);
+ mtree_destroy(&new);
+
+ for (i = 0; i < 1000; i += 3) {
+ if (i & 1)
+ mt_init_flags(&tree, 0);
+ else
+ mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
+
+ for (j = 0; j < i; j++) {
+ mtree_store_range(&tree, j * 10, j * 10 + 5,
+ xa_mk_value(j), GFP_KERNEL);
+ }
+
+ ret = mt_dup(&tree, &new, GFP_KERNEL);
+ MT_BUG_ON(&new, ret != 0);
+ mt_validate(&new);
+ if (compare_tree(&tree, &new))
+ MT_BUG_ON(&new, 1);
+
+ mtree_destroy(&tree);
+ mtree_destroy(&new);
+ }
+
+ /* Test memory allocation failed. */
+ for (i = 0; i < 1000; i += 3) {
+ if (i & 1)
+ mt_init_flags(&tree, 0);
+ else
+ mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
+
+ for (j = 0; j < i; j++) {
+ mtree_store_range(&tree, j * 10, j * 10 + 5,
+ xa_mk_value(j), GFP_KERNEL);
+ }
+
+ mt_set_non_kernel(50);
+ ret = mt_dup(&tree, &new, GFP_NOWAIT);
+ mt_set_non_kernel(0);
+ if (ret != 0) {
+ MT_BUG_ON(&new, ret != -ENOMEM);
+ count++;
+ mtree_destroy(&tree);
+ continue;
+ }
+
+ mt_validate(&new);
+ if (compare_tree(&tree, &new))
+ MT_BUG_ON(&new, 1);
+
+ mtree_destroy(&tree);
+ mtree_destroy(&new);
+ }
+
+ /* pr_info("mt_dup() fail %d times\n", count); */
+ BUG_ON(!count);
+}
+
extern void test_kmem_cache_bulk(void);

void farmer_tests(void)
@@ -35904,6 +36102,10 @@ void farmer_tests(void)
check_null_expand(&tree);
mtree_destroy(&tree);

+ mt_init_flags(&tree, 0);
+ check_mt_dup(&tree);
+ mtree_destroy(&tree);
+
/* RCU testing */
mt_init_flags(&tree, 0);
check_erase_testset(&tree);
--
2.20.1


2023-07-26 09:00:30

by Peng Zhang

Subject: [PATCH 07/11] maple_tree: Update the documentation of maple tree

Document the newly introduced mt_dup() and mas_replace_entry().

Signed-off-by: Peng Zhang <[email protected]>
---
Documentation/core-api/maple_tree.rst | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/Documentation/core-api/maple_tree.rst b/Documentation/core-api/maple_tree.rst
index 45defcf15da7..a4fa991277c7 100644
--- a/Documentation/core-api/maple_tree.rst
+++ b/Documentation/core-api/maple_tree.rst
@@ -71,6 +71,11 @@ return -EEXIST if the range is not empty.

You can search for an entry from an index upwards by using mt_find().

+If you want to duplicate a tree, you can use mt_dup(). It will build a new tree
+that is exactly the same as the source tree, and it uses an efficient
+implementation, so it is much faster than traversing the source tree and
+inserting into the new tree one by one.
+
You can walk each entry within a range by calling mt_for_each(). You must
provide a temporary variable to store a cursor. If you want to walk each
element of the tree then ``0`` and ``ULONG_MAX`` may be used as the range. If
@@ -115,6 +120,7 @@ Takes ma_lock internally:
* mtree_destroy()
* mt_set_in_rcu()
* mt_clear_in_rcu()
+ * mt_dup()

If you want to take advantage of the internal lock to protect the data
structures that you are storing in the Maple Tree, you can call mtree_lock()
@@ -155,6 +161,10 @@ You can set entries using mas_store(). mas_store() will overwrite any entry
with the new entry and return the first existing entry that is overwritten.
The range is passed in as members of the maple state: index and last.

+If you have located an entry using something like mas_find(), and want to
+replace this entry, you can use mas_replace_entry(), which is more efficient
+than mas_store*().
+
You can use mas_erase() to erase an entire range by setting index and
last of the maple state to the desired range to erase. This will erase
the first range that is found in that range, set the maple state index
--
2.20.1


2023-07-26 09:01:09

by Peng Zhang

Subject: [PATCH 01/11] maple_tree: Introduce ma_nonleaf_data_end{_nocheck}()

Introduce ma_nonleaf_data_end{_nocheck}() to get the data end of non-leaf
nodes without knowing the node's maximum value, so that walking up the tree
just to learn the maximum can be avoided.

The principle is to introduce a MAPLE_ENODE bit to mark an ENODE. This bit
cannot be set by metadata, so we can distinguish whether the last slot holds
an ENODE or metadata.

The nocheck version avoids lockdep complaints in some scenarios where no
locks are held.

Signed-off-by: Peng Zhang <[email protected]>
---
lib/maple_tree.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 68 insertions(+), 2 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index a3d602cfd030..98e4fdf6f4b9 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -310,12 +310,19 @@ static inline void mte_set_node_dead(struct maple_enode *mn)
#define MAPLE_ENODE_TYPE_SHIFT 0x03
/* Bit 2 means a NULL somewhere below */
#define MAPLE_ENODE_NULL 0x04
+/* Bit 7 means this is an ENODE, instead of metadata */
+#define MAPLE_ENODE 0x80
+
+static inline bool slot_is_mte(unsigned long slot)
+{
+ return slot & MAPLE_ENODE;
+}

static inline struct maple_enode *mt_mk_node(const struct maple_node *node,
enum maple_type type)
{
- return (void *)((unsigned long)node |
- (type << MAPLE_ENODE_TYPE_SHIFT) | MAPLE_ENODE_NULL);
+ return (void *)((unsigned long)node | (type << MAPLE_ENODE_TYPE_SHIFT) |
+ MAPLE_ENODE_NULL | MAPLE_ENODE);
}

static inline void *mte_mk_root(const struct maple_enode *node)
@@ -1411,6 +1418,65 @@ static inline struct maple_enode *mas_start(struct ma_state *mas)
return NULL;
}

+/*
+ * ma_nonleaf_data_end() - Find the end of the data in a non-leaf node.
+ * @mt: The maple tree
+ * @node: The maple node
+ * @type: The maple node type
+ *
+ * Uses metadata to find the end of the data when possible without knowing the
+ * node maximum.
+ *
+ * Return: The zero indexed last slot with child.
+ */
+static inline unsigned char ma_nonleaf_data_end(struct maple_tree *mt,
+ struct maple_node *node,
+ enum maple_type type)
+{
+ void __rcu **slots;
+ unsigned long slot;
+
+ slots = ma_slots(node, type);
+ slot = (unsigned long)mt_slot(mt, slots, mt_pivots[type]);
+ if (unlikely(slot_is_mte(slot)))
+ return mt_pivots[type];
+
+ return ma_meta_end(node, type);
+}
+
+/*
+ * ma_nonleaf_data_end_nocheck() - Find the end of the data in a non-leaf node.
+ * @node: The maple node
+ * @type: The maple node type
+ *
+ * Uses metadata to find the end of the data when possible without knowing the
+ * node maximum. This is the version of ma_nonleaf_data_end() that does not
+ * check for lock held. This particular version is designed to avoid lockdep
+ * complaining in some scenarios.
+ *
+ * Return: The zero indexed last slot with child.
+ */
+static inline unsigned char ma_nonleaf_data_end_nocheck(struct maple_node *node,
+ enum maple_type type)
+{
+ void __rcu **slots;
+ unsigned long slot;
+
+ slots = ma_slots(node, type);
+ slot = (unsigned long)rcu_dereference_raw(slots[mt_pivots[type]]);
+ if (unlikely(slot_is_mte(slot)))
+ return mt_pivots[type];
+
+ return ma_meta_end(node, type);
+}
+
+/* See ma_nonleaf_data_end() */
+static inline unsigned char mte_nonleaf_data_end(struct maple_tree *mt,
+ struct maple_enode *enode)
+{
+ return ma_nonleaf_data_end(mt, mte_to_node(enode), mte_node_type(enode));
+}
+
/*
* ma_data_end() - Find the end of the data in a node.
* @node: The maple node
--
2.20.1


2023-07-26 09:03:05

by Peng Zhang

Subject: [PATCH 06/11] maple_tree: Introduce mas_replace_entry() to directly replace an entry

If the maple state has located a specific entry, it may be necessary to
replace that entry, so introduce mas_replace_entry() to do this.
mas_replace_entry() is more efficient than mas_store*() because it skips
many unnecessary checks.

This function should ideally be inline, but that would require moving more
functions into the header file, so I have not done it for now.
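
Typical usage looks roughly like the sketch below (mm->mm_mt is only an
example tree, new_entry stands for the caller's replacement value, and error
handling is omitted):

MA_STATE(mas, &mm->mm_mt, 0, 0);
void *entry;

mas_lock(&mas);
entry = mas_find(&mas, ULONG_MAX);
if (entry)      /* an existing, non-NULL entry was located */
        mas_replace_entry(&mas, new_entry);
mas_unlock(&mas);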

Signed-off-by: Peng Zhang <[email protected]>
---
include/linux/maple_tree.h | 1 +
lib/maple_tree.c | 25 +++++++++++++++++++++++++
2 files changed, 26 insertions(+)

diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index 229fe78e4c89..a05e9827d761 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -462,6 +462,7 @@ struct ma_wr_state {

void *mas_walk(struct ma_state *mas);
void *mas_store(struct ma_state *mas, void *entry);
+void mas_replace_entry(struct ma_state *mas, void *entry);
void *mas_erase(struct ma_state *mas);
int mas_store_gfp(struct ma_state *mas, void *entry, gfp_t gfp);
void mas_store_prealloc(struct ma_state *mas, void *entry);
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index efac6761ae37..d58572666a00 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5600,6 +5600,31 @@ void *mas_store(struct ma_state *mas, void *entry)
}
EXPORT_SYMBOL_GPL(mas_store);

+/**
+ * mas_replace_entry() - Replace an entry that already exists in the maple tree
+ * @mas: The maple state
+ * @entry: The entry to store
+ *
+ * Please note that mas must already locate an existing entry, and the new entry
+ * must not be NULL. If these two points cannot be guaranteed, please use
+ * mas_store*() instead, otherwise it will cause an internal error in the maple
+ * tree. This function does not need to allocate memory, so it must succeed.
+ */
+void mas_replace_entry(struct ma_state *mas, void *entry)
+{
+ void __rcu **slots;
+
+#ifdef CONFIG_DEBUG_MAPLE_TREE
+ MAS_WARN_ON(mas, !mte_is_leaf(mas->node));
+ MAS_WARN_ON(mas, !entry);
+ MAS_WARN_ON(mas, mas->offset >= mt_slots[mte_node_type(mas->node)]);
+#endif
+
+ slots = ma_slots(mte_to_node(mas->node), mte_node_type(mas->node));
+ rcu_assign_pointer(slots[mas->offset], entry);
+}
+EXPORT_SYMBOL_GPL(mas_replace_entry);
+
/**
* mas_store_gfp() - Store a value into the tree.
* @mas: The maple state
--
2.20.1


2023-07-26 09:05:26

by Peng Zhang

Subject: [PATCH 11/11] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()

Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
directly modify the VMA entries in the new maple tree, which gives better
performance. dup_mmap() is used by fork(), so this patch optimizes fork().
The optimization effect is proportional to the number of VMAs.

With this method, the optimization in "maple_tree: add a fast path case in
mas_wr_slot_store()" [1] no longer has an effect here, but it is still a
worthwhile optimization of the maple tree in its own right.

The UnixBench test suite [2] includes 'spawn', which exercises fork().
'spawn' only creates 23 VMAs by default, so I tweaked the benchmark code a
bit to use mmap() to control the number of VMAs, so that performance can be
measured under different VMA counts.

Insert code like below into 'spawn':
for (int i = 0; i < 200; ++i) {
        size_t size = 10 * getpagesize();
        void *addr;

        if (i & 1) {
                addr = mmap(NULL, size, PROT_READ,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        } else {
                addr = mmap(NULL, size, PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        }
        if (addr == MAP_FAILED)
                ...
}

Based on next-20230721, 'spawn' was run with 23, 223, and 4023 VMAs, 4 runs
of 30 seconds each. The numbers below are the number of successful fork()
calls in 30s (average of the best 3 out of 4 runs). As a comparison, I also
reverted [1] on top of next-20230725 and tested that as well. To ensure the
reliability of the results, these tests were run on a physical machine.

                 23VMAs       223VMAs      4023VMAs
revert [1]:      159104.00    73316.33     6787.00
                 (+0.77%)     (+0.42%)     (+0.28%)
next-20230721:   160321.67    73624.67     6806.33
                 (+2.77%)     (+15.42%)    (+29.86%)
apply this:      164751.67    84980.33     8838.67

(Each percentage line is the gain of the numbers row below it over the
numbers row above it.)

It can be seen that the performance improvement is proportional to
the number of VMAs. With 23 VMAs, performance improves by about 3%,
with 223 VMAs, performance improves by about 15%, and with 4023 VMAs,
performance improves by about 30%.

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://github.com/kdlucas/byte-unixbench/tree/master

Signed-off-by: Peng Zhang <[email protected]>
---
kernel/fork.c | 35 +++++++++++++++++++++++++++--------
mm/mmap.c | 14 ++++++++++++--
2 files changed, 39 insertions(+), 10 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index f81149739eb9..ef80025b62d6 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
int retval;
unsigned long charge = 0;
LIST_HEAD(uf);
- VMA_ITERATOR(old_vmi, oldmm, 0);
VMA_ITERATOR(vmi, mm, 0);

uprobe_start_dup_mmap();
@@ -678,17 +677,40 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
goto out;
khugepaged_fork(mm, oldmm);

- retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
- if (retval)
+ /* Use __mt_dup() to efficiently build an identical maple tree. */
+ retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
+ if (unlikely(retval))
goto out;

mt_clear_in_rcu(vmi.mas.tree);
- for_each_vma(old_vmi, mpnt) {
+ for_each_vma(vmi, mpnt) {
struct file *file;

vma_start_write(mpnt);
if (mpnt->vm_flags & VM_DONTCOPY) {
vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
+
+ /*
+ * Since the new tree is exactly the same as the old one,
+ * we need to remove the unneeded VMAs.
+ */
+ mas_store(&vmi.mas, NULL);
+
+ /*
+ * Even removing an entry may require memory allocation,
+ * and if removal fails, we use XA_ZERO_ENTRY to mark
+ * from which VMA it failed. The case of encountering
+ * XA_ZERO_ENTRY will be handled in exit_mmap().
+ */
+ if (unlikely(mas_is_err(&vmi.mas))) {
+ retval = xa_err(vmi.mas.node);
+ mas_reset(&vmi.mas);
+ if (mas_find(&vmi.mas, ULONG_MAX))
+ mas_replace_entry(&vmi.mas,
+ XA_ZERO_ENTRY);
+ goto loop_out;
+ }
+
continue;
}
charge = 0;
@@ -750,8 +772,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
hugetlb_dup_vma_private(tmp);

/* Link the vma into the MT */
- if (vma_iter_bulk_store(&vmi, tmp))
- goto fail_nomem_vmi_store;
+ mas_replace_entry(&vmi.mas, tmp);

mm->map_count++;
if (!(tmp->vm_flags & VM_WIPEONFORK))
@@ -778,8 +799,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
uprobe_end_dup_mmap();
return retval;

-fail_nomem_vmi_store:
- unlink_anon_vmas(tmp);
fail_nomem_anon_vma_fork:
mpol_put(vma_policy(tmp));
fail_nomem_policy:
diff --git a/mm/mmap.c b/mm/mmap.c
index bc91d91261ab..5bfba2fb0e39 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3184,7 +3184,11 @@ void exit_mmap(struct mm_struct *mm)
arch_exit_mmap(mm);

vma = mas_find(&mas, ULONG_MAX);
- if (!vma) {
+ /*
+ * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
+ * xa_is_zero(vma) may be true.
+ */
+ if (!vma || xa_is_zero(vma)) {
/* Can happen if dup_mmap() received an OOM */
mmap_read_unlock(mm);
return;
@@ -3222,7 +3226,13 @@ void exit_mmap(struct mm_struct *mm)
remove_vma(vma, true);
count++;
cond_resched();
- } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
+ vma = mas_find(&mas, ULONG_MAX);
+ /*
+ * If xa_is_zero(vma) is true, it means that subsequent VMAs
+ * donot need to be removed. Can happen if dup_mmap() fails to
+ * remove a VMA marked VM_DONTCOPY.
+ */
+ } while (vma != NULL && !xa_is_zero(vma));

BUG_ON(count != mm->map_count);

--
2.20.1


2023-07-26 09:06:50

by Peng Zhang

Subject: [PATCH 09/11] maple_tree: Update check_forking() and bench_forking()

Update check_forking() and bench_forking() to use __mt_dup() to duplicate
the maple tree.

Signed-off-by: Peng Zhang <[email protected]>
---
lib/test_maple_tree.c | 59 ++++++++++++++++++-------------------------
1 file changed, 25 insertions(+), 34 deletions(-)

diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
index 0ec0c6a7c0b5..bbdac08927c6 100644
--- a/lib/test_maple_tree.c
+++ b/lib/test_maple_tree.c
@@ -1837,7 +1837,7 @@ static noinline void __init check_forking(struct maple_tree *mt)
{

struct maple_tree newmt;
- int i, nr_entries = 134;
+ int i, nr_entries = 134, ret;
void *val;
MA_STATE(mas, mt, 0, 0);
MA_STATE(newmas, mt, 0, 0);
@@ -1847,26 +1847,22 @@ static noinline void __init check_forking(struct maple_tree *mt)
xa_mk_value(i), GFP_KERNEL);

mt_set_non_kernel(99999);
- mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE);
- newmas.tree = &newmt;
- mas_reset(&newmas);
- mas_reset(&mas);
- mas_lock(&newmas);
- mas.index = 0;
- mas.last = 0;
- if (mas_expected_entries(&newmas, nr_entries)) {
+ mas_lock(&mas);
+
+ ret = __mt_dup(mt, &newmt, GFP_NOWAIT | __GFP_NOWARN);
+ if (ret) {
pr_err("OOM!");
BUG_ON(1);
}
- rcu_read_lock();
- mas_for_each(&mas, val, ULONG_MAX) {
- newmas.index = mas.index;
- newmas.last = mas.last;
- mas_store(&newmas, val);
+
+ mas_set(&newmas, 0);
+ mas_for_each(&newmas, val, ULONG_MAX) {
+ mas_replace_entry(&newmas, val);
}
- rcu_read_unlock();
+
+ mas_unlock(&mas);
+
mas_destroy(&newmas);
- mas_unlock(&newmas);
mt_validate(&newmt);
mt_set_non_kernel(0);
mtree_destroy(&newmt);
@@ -1974,9 +1970,8 @@ static noinline void __init check_mas_store_gfp(struct maple_tree *mt)
#if defined(BENCH_FORK)
static noinline void __init bench_forking(struct maple_tree *mt)
{
-
struct maple_tree newmt;
- int i, nr_entries = 134, nr_fork = 80000;
+ int i, nr_entries = 134, nr_fork = 80000, ret;
void *val;
MA_STATE(mas, mt, 0, 0);
MA_STATE(newmas, mt, 0, 0);
@@ -1987,26 +1982,22 @@ static noinline void __init bench_forking(struct maple_tree *mt)

for (i = 0; i < nr_fork; i++) {
mt_set_non_kernel(99999);
- mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE);
- newmas.tree = &newmt;
- mas_reset(&newmas);
- mas_reset(&mas);
- mas.index = 0;
- mas.last = 0;
- rcu_read_lock();
- mas_lock(&newmas);
- if (mas_expected_entries(&newmas, nr_entries)) {
- printk("OOM!");
+
+ mas_lock(&mas);
+ ret = __mt_dup(mt, &newmt, GFP_NOWAIT | __GFP_NOWARN);
+ if (ret) {
+ pr_err("OOM!");
BUG_ON(1);
}
- mas_for_each(&mas, val, ULONG_MAX) {
- newmas.index = mas.index;
- newmas.last = mas.last;
- mas_store(&newmas, val);
+
+ mas_set(&newmas, 0);
+ mas_for_each(&newmas, val, ULONG_MAX) {
+ mas_replace_entry(&newmas, val);
}
+
+ mas_unlock(&mas);
+
mas_destroy(&newmas);
- mas_unlock(&newmas);
- rcu_read_unlock();
mt_validate(&newmt);
mt_set_non_kernel(0);
mtree_destroy(&newmt);
--
2.20.1


2023-07-26 09:19:36

by Peng Zhang

Subject: [PATCH 03/11] maple_tree: Add some helper functions

Add some helper functions that take a maple node as their parameter instead
of a maple enode. These functions will be used later.

Signed-off-by: Peng Zhang <[email protected]>
---
lib/maple_tree.c | 71 +++++++++++++++++++++++++++++++++++++-----------
1 file changed, 55 insertions(+), 16 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index e0e9a87bdb43..da3a2fb405c0 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -164,6 +164,11 @@ static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
}

+static inline void mt_free_one(struct maple_node *node)
+{
+ kmem_cache_free(maple_node_cache, node);
+}
+
static inline void mt_free_bulk(size_t size, void __rcu **nodes)
{
kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
@@ -432,18 +437,18 @@ static inline unsigned long mte_parent_slot_mask(unsigned long parent)
}

/*
- * mas_parent_type() - Return the maple_type of the parent from the stored
- * parent type.
- * @mas: The maple state
- * @enode: The maple_enode to extract the parent's enum
+ * ma_parent_type() - Return the maple_type of the parent from the stored parent
+ * type.
+ * @mt: The maple tree
+ * @node: The maple_node to extract the parent's enum
* Return: The node->parent maple_type
*/
static inline
-enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
+enum maple_type ma_parent_type(struct maple_tree *mt, struct maple_node *node)
{
unsigned long p_type;

- p_type = (unsigned long)mte_to_node(enode)->parent;
+ p_type = (unsigned long)node->parent;
if (WARN_ON(p_type & MAPLE_PARENT_ROOT))
return 0;

@@ -451,7 +456,7 @@ enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
p_type &= ~mte_parent_slot_mask(p_type);
switch (p_type) {
case MAPLE_PARENT_RANGE64: /* or MAPLE_PARENT_ARANGE64 */
- if (mt_is_alloc(mas->tree))
+ if (mt_is_alloc(mt))
return maple_arange_64;
return maple_range_64;
}
@@ -459,6 +464,19 @@ enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
return 0;
}

+/*
+ * mas_parent_type() - Return the maple_type of the parent from the stored
+ * parent type.
+ * @mas: The maple state
+ * @enode: The maple_enode to extract the parent's enum
+ * Return: The node->parent maple_type
+ */
+static inline
+enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
+{
+ return ma_parent_type(mas->tree, mte_to_node(enode));
+}
+
/*
* mas_set_parent() - Set the parent node and encode the slot
* @enode: The encoded maple node.
@@ -499,14 +517,14 @@ void mas_set_parent(struct ma_state *mas, struct maple_enode *enode,
}

/*
- * mte_parent_slot() - get the parent slot of @enode.
- * @enode: The encoded maple node.
+ * ma_parent_slot() - get the parent slot of @node.
+ * @node: The maple node.
*
- * Return: The slot in the parent node where @enode resides.
+ * Return: The slot in the parent node where @node resides.
*/
-static inline unsigned int mte_parent_slot(const struct maple_enode *enode)
+static inline unsigned int ma_parent_slot(const struct maple_node *node)
{
- unsigned long val = (unsigned long)mte_to_node(enode)->parent;
+ unsigned long val = (unsigned long)node->parent;

if (val & MA_ROOT_PARENT)
return 0;
@@ -519,15 +537,36 @@ static inline unsigned int mte_parent_slot(const struct maple_enode *enode)
}

/*
- * mte_parent() - Get the parent of @node.
- * @node: The encoded maple node.
+ * mte_parent_slot() - get the parent slot of @enode.
+ * @enode: The encoded maple node.
+ *
+ * Return: The slot in the parent node where @enode resides.
+ */
+static inline unsigned int mte_parent_slot(const struct maple_enode *enode)
+{
+ return ma_parent_slot(mte_to_node(enode));
+}
+
+/*
+ * ma_parent() - Get the parent of @node.
+ * @node: The maple node.
+ *
+ * Return: The parent maple node.
+ */
+static inline struct maple_node *ma_parent(const struct maple_node *node)
+{
+ return (void *)((unsigned long)(node->parent) & ~MAPLE_NODE_MASK);
+}
+
+/*
+ * mte_parent() - Get the parent of @enode.
+ * @enode: The encoded maple node.
*
* Return: The parent maple node.
*/
static inline struct maple_node *mte_parent(const struct maple_enode *enode)
{
- return (void *)((unsigned long)
- (mte_to_node(enode)->parent) & ~MAPLE_NODE_MASK);
+ return ma_parent(mte_to_node(enode));
}

/*
--
2.20.1


2023-07-26 09:24:48

by Peng Zhang

Subject: [PATCH 10/11] MAINTAINERS: Add co-maintainer for maple tree

Add myself as co-maintainer for maple tree. I would like to assist
Liam R. Howlett in maintaining maple tree. I will continue to contribute
to the development of maple tree in the future.

Signed-off-by: Peng Zhang <[email protected]>
---
MAINTAINERS | 1 +
1 file changed, 1 insertion(+)

diff --git a/MAINTAINERS b/MAINTAINERS
index ddc71b815791..8cfedd492509 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -12526,6 +12526,7 @@ F: net/mctp/

MAPLE TREE
M: Liam R. Howlett <[email protected]>
+M: Peng Zhang <[email protected]>
L: [email protected]
S: Supported
F: Documentation/core-api/maple_tree.rst
--
2.20.1


2023-07-26 15:04:25

by Liam R. Howlett

Subject: Re: [PATCH 01/11] maple_tree: Introduce ma_nonleaf_data_end{_nocheck}()

* Peng Zhang <[email protected]> [230726 04:10]:
> Introduce ma_nonleaf_data_end{_nocheck}() to get the data end of
> non-leaf nodes without knowing the maximum value of nodes, so that any
> ascending can be avoided even if the maximum value of nodes is not known.
>
> The principle is that we introduce MAPLE_ENODE to mark an ENODE, which
> cannot be used by metadata, so we can distinguish whether it is ENODE or
> metadata.
>
> The nocheck version is to avoid lockdep complaining in some scenarios
> where no locks are held.
>
> Signed-off-by: Peng Zhang <[email protected]>
> ---
> lib/maple_tree.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 68 insertions(+), 2 deletions(-)
>
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index a3d602cfd030..98e4fdf6f4b9 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -310,12 +310,19 @@ static inline void mte_set_node_dead(struct maple_enode *mn)
> #define MAPLE_ENODE_TYPE_SHIFT 0x03
> /* Bit 2 means a NULL somewhere below */
> #define MAPLE_ENODE_NULL 0x04
> +/* Bit 7 means this is an ENODE, instead of metadata */
> +#define MAPLE_ENODE 0x80

We were saving this bit for more node types. I don't want to use this
bit for this reason since you could have done BFS to duplicate the tree
using the existing way to find the node end.

Bits are highly valuable and this is the only remaining bit. I had
thought about using this in Feb 2021 to see if there was metadata or
not, but figured a way around it (using the max trick) and thus saved
this bit for potential expansion of node types.

> +
> +static inline bool slot_is_mte(unsigned long slot)
> +{
> + return slot & MAPLE_ENODE;
> +}
>
> static inline struct maple_enode *mt_mk_node(const struct maple_node *node,
> enum maple_type type)
> {
> - return (void *)((unsigned long)node |
> - (type << MAPLE_ENODE_TYPE_SHIFT) | MAPLE_ENODE_NULL);
> + return (void *)((unsigned long)node | (type << MAPLE_ENODE_TYPE_SHIFT) |
> + MAPLE_ENODE_NULL | MAPLE_ENODE);
> }
>
> static inline void *mte_mk_root(const struct maple_enode *node)
> @@ -1411,6 +1418,65 @@ static inline struct maple_enode *mas_start(struct ma_state *mas)
> return NULL;
> }
>
> +/*
> + * ma_nonleaf_data_end() - Find the end of the data in a non-leaf node.
> + * @mt: The maple tree
> + * @node: The maple node
> + * @type: The maple node type
> + *
> + * Uses metadata to find the end of the data when possible without knowing the
> + * node maximum.
> + *
> + * Return: The zero indexed last slot with child.
> + */
> +static inline unsigned char ma_nonleaf_data_end(struct maple_tree *mt,
> + struct maple_node *node,
> + enum maple_type type)
> +{
> + void __rcu **slots;
> + unsigned long slot;
> +
> + slots = ma_slots(node, type);
> + slot = (unsigned long)mt_slot(mt, slots, mt_pivots[type]);
> + if (unlikely(slot_is_mte(slot)))
> + return mt_pivots[type];
> +
> + return ma_meta_end(node, type);
> +}
> +
> +/*
> + * ma_nonleaf_data_end_nocheck() - Find the end of the data in a non-leaf node.
> + * @node: The maple node
> + * @type: The maple node type
> + *
> + * Uses metadata to find the end of the data when possible without knowing the
> + * node maximum. This is the version of ma_nonleaf_data_end() that does not
> + * check for lock held. This particular version is designed to avoid lockdep
> + * complaining in some scenarios.
> + *
> + * Return: The zero indexed last slot with child.
> + */
> +static inline unsigned char ma_nonleaf_data_end_nocheck(struct maple_node *node,
> + enum maple_type type)
> +{
> + void __rcu **slots;
> + unsigned long slot;
> +
> + slots = ma_slots(node, type);
> + slot = (unsigned long)rcu_dereference_raw(slots[mt_pivots[type]]);
> + if (unlikely(slot_is_mte(slot)))
> + return mt_pivots[type];
> +
> + return ma_meta_end(node, type);
> +}
> +
> +/* See ma_nonleaf_data_end() */
> +static inline unsigned char mte_nonleaf_data_end(struct maple_tree *mt,
> + struct maple_enode *enode)
> +{
> + return ma_nonleaf_data_end(mt, mte_to_node(enode), mte_node_type(enode));
> +}
> +
> /*
> * ma_data_end() - Find the end of the data in a node.
> * @node: The maple node
> --
> 2.20.1
>
>

2023-07-26 15:37:26

by Liam R. Howlett

Subject: Re: [PATCH 03/11] maple_tree: Add some helper functions

* Peng Zhang <[email protected]> [230726 04:10]:
> Add some helper functions so that their parameters are maple node
> instead of maple enode, these functions will be used later.
>
> Signed-off-by: Peng Zhang <[email protected]>
> ---
> lib/maple_tree.c | 71 +++++++++++++++++++++++++++++++++++++-----------
> 1 file changed, 55 insertions(+), 16 deletions(-)
>
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index e0e9a87bdb43..da3a2fb405c0 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -164,6 +164,11 @@ static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
> return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
> }
>
> +static inline void mt_free_one(struct maple_node *node)
> +{
> + kmem_cache_free(maple_node_cache, node);
> +}
> +

There is a place in mas_destroy() that could use this if it is added.

> static inline void mt_free_bulk(size_t size, void __rcu **nodes)
> {
> kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
> @@ -432,18 +437,18 @@ static inline unsigned long mte_parent_slot_mask(unsigned long parent)
> }
>
> /*
> - * mas_parent_type() - Return the maple_type of the parent from the stored
> - * parent type.
> - * @mas: The maple state
> - * @enode: The maple_enode to extract the parent's enum
> + * ma_parent_type() - Return the maple_type of the parent from the stored parent
> + * type.
> + * @mt: The maple tree
> + * @node: The maple_node to extract the parent's enum
> * Return: The node->parent maple_type
> */
> static inline
> -enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
> +enum maple_type ma_parent_type(struct maple_tree *mt, struct maple_node *node)

I was trying to keep ma_* prefix to mean the first argument is
maple_node and mt_* to mean maple_tree. I wasn't entirely successful
with this and I do see why you want to use ma_, but maybe reverse the
arguments here?

> {
> unsigned long p_type;
>
> - p_type = (unsigned long)mte_to_node(enode)->parent;
> + p_type = (unsigned long)node->parent;
> if (WARN_ON(p_type & MAPLE_PARENT_ROOT))
> return 0;
>
> @@ -451,7 +456,7 @@ enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
> p_type &= ~mte_parent_slot_mask(p_type);
> switch (p_type) {
> case MAPLE_PARENT_RANGE64: /* or MAPLE_PARENT_ARANGE64 */
> - if (mt_is_alloc(mas->tree))
> + if (mt_is_alloc(mt))
> return maple_arange_64;
> return maple_range_64;
> }
> @@ -459,6 +464,19 @@ enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
> return 0;
> }
>
> +/*
> + * mas_parent_type() - Return the maple_type of the parent from the stored
> + * parent type.
> + * @mas: The maple state
> + * @enode: The maple_enode to extract the parent's enum
> + * Return: The node->parent maple_type
> + */
> +static inline
> +enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
> +{
> + return ma_parent_type(mas->tree, mte_to_node(enode));
> +}
> +
> /*
> * mas_set_parent() - Set the parent node and encode the slot
> * @enode: The encoded maple node.
> @@ -499,14 +517,14 @@ void mas_set_parent(struct ma_state *mas, struct maple_enode *enode,
> }
>
> /*
> - * mte_parent_slot() - get the parent slot of @enode.
> - * @enode: The encoded maple node.
> + * ma_parent_slot() - get the parent slot of @node.
> + * @node: The maple node.
> *
> - * Return: The slot in the parent node where @enode resides.
> + * Return: The slot in the parent node where @node resides.
> */
> -static inline unsigned int mte_parent_slot(const struct maple_enode *enode)
> +static inline unsigned int ma_parent_slot(const struct maple_node *node)
> {
> - unsigned long val = (unsigned long)mte_to_node(enode)->parent;
> + unsigned long val = (unsigned long)node->parent;
>
> if (val & MA_ROOT_PARENT)
> return 0;
> @@ -519,15 +537,36 @@ static inline unsigned int mte_parent_slot(const struct maple_enode *enode)
> }
>
> /*
> - * mte_parent() - Get the parent of @node.
> - * @node: The encoded maple node.
> + * mte_parent_slot() - get the parent slot of @enode.
> + * @enode: The encoded maple node.
> + *
> + * Return: The slot in the parent node where @enode resides.
> + */
> +static inline unsigned int mte_parent_slot(const struct maple_enode *enode)
> +{
> + return ma_parent_slot(mte_to_node(enode));
> +}
> +
> +/*
> + * ma_parent() - Get the parent of @node.
> + * @node: The maple node.
> + *
> + * Return: The parent maple node.
> + */
> +static inline struct maple_node *ma_parent(const struct maple_node *node)

I had a lot of these helpers before, but they eventually became used so
little that I dropped them.

> +{
> + return (void *)((unsigned long)(node->parent) & ~MAPLE_NODE_MASK);
> +}
> +
> +/*
> + * mte_parent() - Get the parent of @enode.
> + * @enode: The encoded maple node.
> *
> * Return: The parent maple node.
> */
> static inline struct maple_node *mte_parent(const struct maple_enode *enode)
> {
> - return (void *)((unsigned long)
> - (mte_to_node(enode)->parent) & ~MAPLE_NODE_MASK);
> + return ma_parent(mte_to_node(enode));
> }
>
> /*
> --
> 2.20.1
>

2023-07-26 16:08:00

by Matthew Wilcox

Subject: Re: [PATCH 03/11] maple_tree: Add some helper functions

On Wed, Jul 26, 2023 at 11:02:52AM -0400, Liam R. Howlett wrote:
> * Peng Zhang <[email protected]> [230726 04:10]:
> > static inline
> > -enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
> > +enum maple_type ma_parent_type(struct maple_tree *mt, struct maple_node *node)
>
> I was trying to keep ma_* prefix to mean the first argument is
> maple_node and mt_* to mean maple_tree. I wasn't entirely successful
> with this and I do see why you want to use ma_, but maybe reverse the
> arguments here?

I think your first idea is better. Usually we prefer to order the
arguments by "containing thing" to "contained thing". So always use
(struct address_space *, struct folio *), for example. Or (struct
mm_struct *, struct vm_area_struct *).


2023-07-26 16:54:10

by Liam R. Howlett

Subject: Re: [PATCH 10/11] MAINTAINERS: Add co-maintainer for maple tree

* Peng Zhang <[email protected]> [230726 04:10]:
> Add myself as co-maintainer for maple tree. I would like to assist
> Liam R. Howlett in maintaining maple tree. I will continue to contribute
> to the development of maple tree in the future.

Sorry, but no.

I appreciate the patches, bug fixes, and code review but there is no
need for another maintainer for the tree at this time.

Thank you,
Liam

>
> Signed-off-by: Peng Zhang <[email protected]>
> ---
> MAINTAINERS | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ddc71b815791..8cfedd492509 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -12526,6 +12526,7 @@ F: net/mctp/
>
> MAPLE TREE
> M: Liam R. Howlett <[email protected]>
> +M: Peng Zhang <[email protected]>
> L: [email protected]
> S: Supported
> F: Documentation/core-api/maple_tree.rst
> --
> 2.20.1
>

2023-07-26 17:09:33

by Liam R. Howlett

Subject: Re: [PATCH 05/11] maple_tree: Add test for mt_dup()

* Peng Zhang <[email protected]> [230726 04:10]:
> Add test for mt_dup().
>
> Signed-off-by: Peng Zhang <[email protected]>
> ---
> tools/testing/radix-tree/maple.c | 202 +++++++++++++++++++++++++++++++
> 1 file changed, 202 insertions(+)
>
> diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
> index e5da1cad70ba..3052e899e5df 100644
> --- a/tools/testing/radix-tree/maple.c
> +++ b/tools/testing/radix-tree/maple.c
> @@ -35857,6 +35857,204 @@ static noinline void __init check_locky(struct maple_tree *mt)
> mt_clear_in_rcu(mt);
> }
>
> +/*
> + * Compare two nodes and return 0 if they are the same, non-zero otherwise.
> + */
> +static int __init compare_node(struct maple_enode *enode_a,
> + struct maple_enode *enode_b)
> +{
> + struct maple_node *node_a, *node_b;
> + struct maple_node a, b;
> + void **slots_a, **slots_b; /* Do not use the rcu tag. */
> + enum maple_type type;
> + int i;
> +
> + if (((unsigned long)enode_a & MAPLE_NODE_MASK) !=
> + ((unsigned long)enode_b & MAPLE_NODE_MASK)) {
> + pr_err("The lower 8 bits of enode are different.\n");
> + return -1;
> + }
> +
> + type = mte_node_type(enode_a);
> + node_a = mte_to_node(enode_a);
> + node_b = mte_to_node(enode_b);
> + a = *node_a;
> + b = *node_b;
> +
> + /* Do not compare addresses. */
> + if (ma_is_root(node_a) || ma_is_root(node_b)) {
> + a.parent = (struct maple_pnode *)((unsigned long)a.parent &
> + MA_ROOT_PARENT);
> + b.parent = (struct maple_pnode *)((unsigned long)b.parent &
> + MA_ROOT_PARENT);
> + } else {
> + a.parent = (struct maple_pnode *)((unsigned long)a.parent &
> + MAPLE_NODE_MASK);
> + b.parent = (struct maple_pnode *)((unsigned long)b.parent &
> + MAPLE_NODE_MASK);
> + }
> +
> + if (a.parent != b.parent) {
> + pr_err("The lower 8 bits of parents are different. %p %p\n",
> + a.parent, b.parent);
> + return -1;
> + }
> +
> + /*
> + * If it is a leaf node, the slots do not contain the node address, and
> + * no special processing of slots is required.
> + */
> + if (ma_is_leaf(type))
> + goto cmp;
> +
> + slots_a = ma_slots(&a, type);
> + slots_b = ma_slots(&b, type);
> +
> + for (i = 0; i < mt_slots[type]; i++) {
> + if (!slots_a[i] && !slots_b[i])
> + break;
> +
> + if (!slots_a[i] || !slots_b[i]) {
> + pr_err("The number of slots is different.\n");
> + return -1;
> + }
> +
> + /* Do not compare addresses in slots. */
> + ((unsigned long *)slots_a)[i] &= MAPLE_NODE_MASK;
> + ((unsigned long *)slots_b)[i] &= MAPLE_NODE_MASK;
> + }
> +
> +cmp:
> + /*
> + * Compare all contents of two nodes, including parent (except address),
> + * slots (except address), pivots, gaps and metadata.
> + */
> + return memcmp(&a, &b, sizeof(struct maple_node));
> +}
> +
> +/*
> + * Compare two trees and return 0 if they are the same, non-zero otherwise.
> + */
> +static int __init compare_tree(struct maple_tree *mt_a, struct maple_tree *mt_b)
> +{
> + MA_STATE(mas_a, mt_a, 0, 0);
> + MA_STATE(mas_b, mt_b, 0, 0);
> +
> + if (mt_a->ma_flags != mt_b->ma_flags) {
> + pr_err("The flags of the two trees are different.\n");
> + return -1;
> + }
> +
> + mas_dfs_preorder(&mas_a);
> + mas_dfs_preorder(&mas_b);
> +
> + if (mas_is_ptr(&mas_a) || mas_is_ptr(&mas_b)) {
> + if (!(mas_is_ptr(&mas_a) && mas_is_ptr(&mas_b))) {
> + pr_err("One is MAS_ROOT and the other is not.\n");
> + return -1;
> + }
> + return 0;
> + }
> +
> + while (!mas_is_none(&mas_a) || !mas_is_none(&mas_b)) {
> +
> + if (mas_is_none(&mas_a) || mas_is_none(&mas_b)) {
> + pr_err("One is MAS_NONE and the other is not.\n");
> + return -1;
> + }
> +
> + if (mas_a.min != mas_b.min ||
> + mas_a.max != mas_b.max) {
> + pr_err("mas->min, mas->max do not match.\n");
> + return -1;
> + }
> +
> + if (compare_node(mas_a.node, mas_b.node)) {
> + pr_err("The contents of nodes %p and %p are different.\n",
> + mas_a.node, mas_b.node);
> + mt_dump(mt_a, mt_dump_dec);
> + mt_dump(mt_b, mt_dump_dec);
> + return -1;
> + }
> +
> + mas_dfs_preorder(&mas_a);
> + mas_dfs_preorder(&mas_b);
> + }
> +
> + return 0;
> +}
> +
> +static noinline void __init check_mt_dup(struct maple_tree *mt)
> +{
> + DEFINE_MTREE(new);
> + int i, j, ret, count = 0;
> +
> + /* stored in the root pointer*/
> + mt_init_flags(&tree, 0);
> + mtree_store_range(&tree, 0, 0, xa_mk_value(0), GFP_KERNEL);
> + mt_dup(&tree, &new, GFP_KERNEL);
> + mt_validate(&new);
> + if (compare_tree(&tree, &new))
> + MT_BUG_ON(&new, 1);
> +
> + mtree_destroy(&tree);
> + mtree_destroy(&new);
> +
> + for (i = 0; i < 1000; i += 3) {
> + if (i & 1)
> + mt_init_flags(&tree, 0);
> + else
> + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
> +
> + for (j = 0; j < i; j++) {
> + mtree_store_range(&tree, j * 10, j * 10 + 5,
> + xa_mk_value(j), GFP_KERNEL);
> + }

Storing in this way is probably not checking a full tree. I think it's
important to check the full tree/full nodes since you have changes to
detect the metadata.

> +
> + ret = mt_dup(&tree, &new, GFP_KERNEL);
> + MT_BUG_ON(&new, ret != 0);
> + mt_validate(&new);
> + if (compare_tree(&tree, &new))
> + MT_BUG_ON(&new, 1);
> +
> + mtree_destroy(&tree);
> + mtree_destroy(&new);
> + }
> +
> + /* Test memory allocation failed. */
> + for (i = 0; i < 1000; i += 3) {
> + if (i & 1)
> + mt_init_flags(&tree, 0);
> + else
> + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
> +
> + for (j = 0; j < i; j++) {
> + mtree_store_range(&tree, j * 10, j * 10 + 5,
> + xa_mk_value(j), GFP_KERNEL);
> + }
> +
> + mt_set_non_kernel(50);

It may be worthwhile allowing more/less than 50 allocations.

> + ret = mt_dup(&tree, &new, GFP_NOWAIT);
> + mt_set_non_kernel(0);
> + if (ret != 0) {
> + MT_BUG_ON(&new, ret != -ENOMEM);
> + count++;
> + mtree_destroy(&tree);
> + continue;
> + }
> +
> + mt_validate(&new);
> + if (compare_tree(&tree, &new))
> + MT_BUG_ON(&new, 1);
> +
> + mtree_destroy(&tree);
> + mtree_destroy(&new);
> + }
> +
> + /* pr_info("mt_dup() fail %d times\n", count); */
> + BUG_ON(!count);
> +}
> +
> extern void test_kmem_cache_bulk(void);
>
> void farmer_tests(void)
> @@ -35904,6 +36102,10 @@ void farmer_tests(void)
> check_null_expand(&tree);
> mtree_destroy(&tree);
>
> + mt_init_flags(&tree, 0);
> + check_mt_dup(&tree);
> + mtree_destroy(&tree);
> +
> /* RCU testing */
> mt_init_flags(&tree, 0);
> check_erase_testset(&tree);
> --
> 2.20.1
>
>

2023-07-26 17:12:06

by Liam R. Howlett

Subject: Re: [PATCH 06/11] maple_tree: Introduce mas_replace_entry() to directly replace an entry

* Peng Zhang <[email protected]> [230726 04:10]:
> If mas has located a specific entry, it may be need to replace this
> entry, so introduce mas_replace_entry() to do this. mas_replace_entry()
> will be more efficient than mas_store*() because it doesn't do many
> unnecessary checks.
>
> This function should be inline, but more functions need to be moved to
> the header file, so I didn't do it for the time being.

I am really nervous having no checks here. I get that this could be
used for duplicating the tree more efficiently, but having a function
that just swaps a value in is very dangerous - especially since it is
decoupled from the tree duplication code.

>
> Signed-off-by: Peng Zhang <[email protected]>
> ---
> include/linux/maple_tree.h | 1 +
> lib/maple_tree.c | 25 +++++++++++++++++++++++++
> 2 files changed, 26 insertions(+)
>
> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> index 229fe78e4c89..a05e9827d761 100644
> --- a/include/linux/maple_tree.h
> +++ b/include/linux/maple_tree.h
> @@ -462,6 +462,7 @@ struct ma_wr_state {
>
> void *mas_walk(struct ma_state *mas);
> void *mas_store(struct ma_state *mas, void *entry);
> +void mas_replace_entry(struct ma_state *mas, void *entry);
> void *mas_erase(struct ma_state *mas);
> int mas_store_gfp(struct ma_state *mas, void *entry, gfp_t gfp);
> void mas_store_prealloc(struct ma_state *mas, void *entry);
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index efac6761ae37..d58572666a00 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -5600,6 +5600,31 @@ void *mas_store(struct ma_state *mas, void *entry)
> }
> EXPORT_SYMBOL_GPL(mas_store);
>
> +/**
> + * mas_replace_entry() - Replace an entry that already exists in the maple tree
> + * @mas: The maple state
> + * @entry: The entry to store
> + *
> + * Please note that mas must already locate an existing entry, and the new entry
> + * must not be NULL. If these two points cannot be guaranteed, please use
> + * mas_store*() instead, otherwise it will cause an internal error in the maple
> + * tree. This function does not need to allocate memory, so it must succeed.
> + */
> +void mas_replace_entry(struct ma_state *mas, void *entry)
> +{
> + void __rcu **slots;
> +
> +#ifdef CONFIG_DEBUG_MAPLE_TREE
> + MAS_WARN_ON(mas, !mte_is_leaf(mas->node));
> + MAS_WARN_ON(mas, !entry);
> + MAS_WARN_ON(mas, mas->offset >= mt_slots[mte_node_type(mas->node)]);
> +#endif
> +
> + slots = ma_slots(mte_to_node(mas->node), mte_node_type(mas->node));
> + rcu_assign_pointer(slots[mas->offset], entry);
> +}
> +EXPORT_SYMBOL_GPL(mas_replace_entry);
> +
> /**
> * mas_store_gfp() - Store a value into the tree.
> * @mas: The maple state
> --
> 2.20.1
>

2023-07-26 17:27:12

by Liam R. Howlett

Subject: Re: [PATCH 11/11] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()

* Peng Zhang <[email protected]> [230726 04:10]:
> Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
> directly modify the entries of VMAs in the new maple tree, which can
> get better performance. dup_mmap() is used by fork(), so this patch
> optimizes fork(). The optimization effect is proportional to the number
> of VMAs.
>
> Due to the introduction of this method, the optimization in
> (maple_tree: add a fast path case in mas_wr_slot_store())[1] no longer
> has an effect here, but it is also an optimization of the maple tree.
>
> There is a unixbench test suite[2] where 'spawn' is used to test fork().
> 'spawn' only has 23 VMAs by default, so I tweaked the benchmark code a
> bit to use mmap() to control the number of VMAs. Therefore, the
> performance under different numbers of VMAs can be measured.
>
> Insert code like below into 'spawn':
> for (int i = 0; i < 200; ++i) {
> size_t size = 10 * getpagesize();
> void *addr;
>
> if (i & 1) {
> addr = mmap(NULL, size, PROT_READ,
> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> } else {
> addr = mmap(NULL, size, PROT_WRITE,
> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
> }
> if (addr == MAP_FAILED)
> ...
> }
>
> Based on next-20230721, use 'spawn' under 23, 203, and 4023 VMAs, test
> 4 times in 30 seconds each time, and get the following numbers. These
> numbers are the number of fork() successes in 30s (average of the best
> 3 out of 4). By the way, based on next-20230725, I reverted [1], and
> tested it together as a comparison. In order to ensure the reliability
> of the test results, these tests were run on a physical machine.
>
> 23VMAs 223VMAs 4023VMAs
> revert [1]: 159104.00 73316.33 6787.00

You can probably remove the revert benchmark from this since there is no
reason to revert the previous change. The change is worthwhile on its own,
so it's better to make the numbers clearer by showing results with and
without this series.

>
> +0.77% +0.42% +0.28%
> next-20230721: 160321.67 73624.67 6806.33
>
> +2.77% +15.42% +29.86%
> apply this: 164751.67 84980.33 8838.67

What is the difference between using this patch with mas_replace_entry()
and mas_store_entry()?

>
> It can be seen that the performance improvement is proportional to
> the number of VMAs. With 23 VMAs, performance improves by about 3%,
> with 223 VMAs, performance improves by about 15%, and with 4023 VMAs,
> performance improves by about 30%.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://github.com/kdlucas/byte-unixbench/tree/master
>
> Signed-off-by: Peng Zhang <[email protected]>
> ---
> kernel/fork.c | 35 +++++++++++++++++++++++++++--------
> mm/mmap.c | 14 ++++++++++++--
> 2 files changed, 39 insertions(+), 10 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index f81149739eb9..ef80025b62d6 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> int retval;
> unsigned long charge = 0;
> LIST_HEAD(uf);
> - VMA_ITERATOR(old_vmi, oldmm, 0);
> VMA_ITERATOR(vmi, mm, 0);
>
> uprobe_start_dup_mmap();
> @@ -678,17 +677,40 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> goto out;
> khugepaged_fork(mm, oldmm);
>
> - retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
> - if (retval)
> + /* Use __mt_dup() to efficiently build an identical maple tree. */
> + retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
> + if (unlikely(retval))
> goto out;
>
> mt_clear_in_rcu(vmi.mas.tree);
> - for_each_vma(old_vmi, mpnt) {
> + for_each_vma(vmi, mpnt) {
> struct file *file;
>
> vma_start_write(mpnt);
> if (mpnt->vm_flags & VM_DONTCOPY) {
> vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> +
> + /*
> + * Since the new tree is exactly the same as the old one,
> + * we need to remove the unneeded VMAs.
> + */
> + mas_store(&vmi.mas, NULL);
> +
> + /*
> + * Even removing an entry may require memory allocation,
> + * and if removal fails, we use XA_ZERO_ENTRY to mark
> + * from which VMA it failed. The case of encountering
> + * XA_ZERO_ENTRY will be handled in exit_mmap().
> + */
> + if (unlikely(mas_is_err(&vmi.mas))) {
> + retval = xa_err(vmi.mas.node);
> + mas_reset(&vmi.mas);
> + if (mas_find(&vmi.mas, ULONG_MAX))
> + mas_replace_entry(&vmi.mas,
> + XA_ZERO_ENTRY);
> + goto loop_out;
> + }
> +
> continue;
> }
> charge = 0;
> @@ -750,8 +772,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> hugetlb_dup_vma_private(tmp);
>
> /* Link the vma into the MT */
> - if (vma_iter_bulk_store(&vmi, tmp))
> - goto fail_nomem_vmi_store;
> + mas_replace_entry(&vmi.mas, tmp);
>
> mm->map_count++;
> if (!(tmp->vm_flags & VM_WIPEONFORK))
> @@ -778,8 +799,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> uprobe_end_dup_mmap();
> return retval;
>
> -fail_nomem_vmi_store:
> - unlink_anon_vmas(tmp);
> fail_nomem_anon_vma_fork:
> mpol_put(vma_policy(tmp));
> fail_nomem_policy:
> diff --git a/mm/mmap.c b/mm/mmap.c
> index bc91d91261ab..5bfba2fb0e39 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3184,7 +3184,11 @@ void exit_mmap(struct mm_struct *mm)
> arch_exit_mmap(mm);
>
> vma = mas_find(&mas, ULONG_MAX);
> - if (!vma) {
> + /*
> + * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
> + * xa_is_zero(vma) may be true.
> + */
> + if (!vma || xa_is_zero(vma)) {
> /* Can happen if dup_mmap() received an OOM */
> mmap_read_unlock(mm);
> return;
> @@ -3222,7 +3226,13 @@ void exit_mmap(struct mm_struct *mm)
> remove_vma(vma, true);
> count++;
> cond_resched();
> - } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> + vma = mas_find(&mas, ULONG_MAX);
> + /*
> + * If xa_is_zero(vma) is true, it means that subsequent VMAs
> + * donot need to be removed. Can happen if dup_mmap() fails to
> + * remove a VMA marked VM_DONTCOPY.
> + */
> + } while (vma != NULL && !xa_is_zero(vma));
>
> BUG_ON(count != mm->map_count);
>
> --
> 2.20.1
>

2023-07-31 10:48:17

by Peng Zhang

Subject: Re: [PATCH 01/11] maple_tree: Introduce ma_nonleaf_data_end{_nocheck}()



On 2023/7/26 22:58, Liam R. Howlett wrote:
> * Peng Zhang <[email protected]> [230726 04:10]:
>> Introduce ma_nonleaf_data_end{_nocheck}() to get the data end of
>> non-leaf nodes without knowing the maximum value of nodes, so that any
>> ascending can be avoided even if the maximum value of nodes is not known.
>>
>> The principle is that we introduce MAPLE_ENODE to mark an ENODE, which
>> cannot be used by metadata, so we can distinguish whether it is ENODE or
>> metadata.
>>
>> The nocheck version is to avoid lockdep complaining in some scenarios
>> where no locks are held.
>>
>> Signed-off-by: Peng Zhang <[email protected]>
>> ---
>> lib/maple_tree.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 68 insertions(+), 2 deletions(-)
>>
>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> index a3d602cfd030..98e4fdf6f4b9 100644
>> --- a/lib/maple_tree.c
>> +++ b/lib/maple_tree.c
>> @@ -310,12 +310,19 @@ static inline void mte_set_node_dead(struct maple_enode *mn)
>> #define MAPLE_ENODE_TYPE_SHIFT 0x03
>> /* Bit 2 means a NULL somewhere below */
>> #define MAPLE_ENODE_NULL 0x04
>> +/* Bit 7 means this is an ENODE, instead of metadata */
>> +#define MAPLE_ENODE 0x80
>
> We were saving this bit for more node types. I don't want to use this
> bit for this reason since you could have done BFS to duplicate the tree
> using the existing way to find the node end.
We have reserved 4 bits for the node type. I don't think there will be
more than 16 node types going forward.

Even the DFS that has been implemented can use the existing way to get
the data end. I didn't use it because when walking up the tree, we don't
know the maximum value of the node, and the continuous upward walk will
introduce more overhead, which is what mas_ascend() does. Doing BFS
cannot avoid this problem either.

The reason I don't do BFS is that it has more overhead than DFS. If you
think of a tree as a graph, doing DFS will only walk each edge twice.
What about BFS? Since we can't use queues, we can only emulate BFS,
which additionally has to do something like what mas_next_node() does, and that
introduces more overhead than DFS. Considering only the layer of leaf
nodes, it needs to walk each edge twice. So the overhead of doing BFS is
more than DFS.
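
To make the edge-counting argument concrete, here is a simplified userspace
sketch of the DFS copy (illustrative only - the real implementation is not
recursive, and struct node, NSLOTS and the helpers here are stand-ins, not
maple tree code):

#include <stdlib.h>
#include <string.h>

#define NSLOTS 16

struct node {
        struct node *parent;
        struct node *slot[NSLOTS];      /* NULL-terminated child list */
        int is_leaf;
};

/* Copy a node, then recurse into each child: every parent/child edge is
 * crossed exactly twice, once going down and once coming back up. */
static struct node *dfs_dup(const struct node *old, struct node *parent)
{
        struct node *new = malloc(sizeof(*new));
        int i;

        if (!new)
                return NULL;            /* error handling elided in this sketch */

        memcpy(new, old, sizeof(*new));
        new->parent = parent;

        if (old->is_leaf)
                return new;             /* leaves hold entries, not children */

        for (i = 0; i < NSLOTS && old->slot[i]; i++)
                new->slot[i] = dfs_dup(old->slot[i], new);

        return new;
}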
>
> Bits are highly valuable and this is the only remaining bit. I had
> thought about using this in Feb 2021 to see if there was metadata or
> not, but figured a way around it (using the max trick) and thus saved
> this bit for potential expansion of node types.
I thought of another way to get the maximum value of a node without
doing any extra upward walk. When doing DFS, we can use a stack to save
the maximum value of ancestor nodes. The stack size can be set to
MAPLE_HEIGHT_MAX. In this way, this bit can be reserved, and there is no
need to do a loop like mas_ascend() in order to get the maximum value.
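
Expressed as a fragment (illustrative only; child_max() stands in for
reading the child's pivot, or inheriting the parent's max for the last
slot):

unsigned long max_stack[MAPLE_HEIGHT_MAX];
unsigned char depth = 0;
unsigned long cur_max = ULONG_MAX;          /* the root covers the full range */

/* when descending into child @slot of the current node: */
max_stack[depth++] = cur_max;
cur_max = child_max(node, slot, cur_max);

/* when ascending back to the parent: */
cur_max = max_stack[--depth];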
>
>> +
>> +static inline bool slot_is_mte(unsigned long slot)
>> +{
>> + return slot & MAPLE_ENODE;
>> +}
>>
>> static inline struct maple_enode *mt_mk_node(const struct maple_node *node,
>> enum maple_type type)
>> {
>> - return (void *)((unsigned long)node |
>> - (type << MAPLE_ENODE_TYPE_SHIFT) | MAPLE_ENODE_NULL);
>> + return (void *)((unsigned long)node | (type << MAPLE_ENODE_TYPE_SHIFT) |
>> + MAPLE_ENODE_NULL | MAPLE_ENODE);
>> }
>>
>> static inline void *mte_mk_root(const struct maple_enode *node)
>> @@ -1411,6 +1418,65 @@ static inline struct maple_enode *mas_start(struct ma_state *mas)
>> return NULL;
>> }
>>
>> +/*
>> + * ma_nonleaf_data_end() - Find the end of the data in a non-leaf node.
>> + * @mt: The maple tree
>> + * @node: The maple node
>> + * @type: The maple node type
>> + *
>> + * Uses metadata to find the end of the data when possible without knowing the
>> + * node maximum.
>> + *
>> + * Return: The zero indexed last slot with child.
>> + */
>> +static inline unsigned char ma_nonleaf_data_end(struct maple_tree *mt,
>> + struct maple_node *node,
>> + enum maple_type type)
>> +{
>> + void __rcu **slots;
>> + unsigned long slot;
>> +
>> + slots = ma_slots(node, type);
>> + slot = (unsigned long)mt_slot(mt, slots, mt_pivots[type]);
>> + if (unlikely(slot_is_mte(slot)))
>> + return mt_pivots[type];
>> +
>> + return ma_meta_end(node, type);
>> +}
>> +
>> +/*
>> + * ma_nonleaf_data_end_nocheck() - Find the end of the data in a non-leaf node.
>> + * @node: The maple node
>> + * @type: The maple node type
>> + *
>> + * Uses metadata to find the end of the data when possible without knowing the
>> + * node maximum. This is the version of ma_nonleaf_data_end() that does not
>> + * check for lock held. This particular version is designed to avoid lockdep
>> + * complaining in some scenarios.
>> + *
>> + * Return: The zero indexed last slot with child.
>> + */
>> +static inline unsigned char ma_nonleaf_data_end_nocheck(struct maple_node *node,
>> + enum maple_type type)
>> +{
>> + void __rcu **slots;
>> + unsigned long slot;
>> +
>> + slots = ma_slots(node, type);
>> + slot = (unsigned long)rcu_dereference_raw(slots[mt_pivots[type]]);
>> + if (unlikely(slot_is_mte(slot)))
>> + return mt_pivots[type];
>> +
>> + return ma_meta_end(node, type);
>> +}
>> +
>> +/* See ma_nonleaf_data_end() */
>> +static inline unsigned char mte_nonleaf_data_end(struct maple_tree *mt,
>> + struct maple_enode *enode)
>> +{
>> + return ma_nonleaf_data_end(mt, mte_to_node(enode), mte_node_type(enode));
>> +}
>> +
>> /*
>> * ma_data_end() - Find the end of the data in a node.
>> * @node: The maple node
>> --
>> 2.20.1
>>
>>

2023-07-31 13:04:44

by Peng Zhang

[permalink] [raw]
Subject: Re: [PATCH 03/11] maple_tree: Add some helper functions



On 2023/7/26 23:08, Matthew Wilcox wrote:
> On Wed, Jul 26, 2023 at 11:02:52AM -0400, Liam R. Howlett wrote:
>> * Peng Zhang <[email protected]> [230726 04:10]:
>>> static inline
>>> -enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
>>> +enum maple_type ma_parent_type(struct maple_tree *mt, struct maple_node *node)
>>
>> I was trying to keep ma_* prefix to mean the first argument is
>> maple_node and mt_* to mean maple_tree. I wasn't entirely successful
>> with this and I do see why you want to use ma_, but maybe reverse the
>> arguments here?
>
> I think your first idea is better. Usually we prefer to order the
> arguments by "containing thing" to "contained thing". So always use
> (struct address_space *, struct folio *), for example. Or (struct
> mm_struct *, struct vm_area_struct *).
There are disagreements here, so how should we decide? In any case, I don't
know whether the new version will still have this helper.
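
For reference, the two orderings under discussion look roughly like this
(signatures only, just to illustrate the disagreement; they would not both
exist at once):

    /* as in the patch, and as Matthew prefers: container first */
    enum maple_type ma_parent_type(struct maple_tree *mt, struct maple_node *node);

    /* following the ma_* prefix convention Liam describes: node first */
    enum maple_type ma_parent_type(struct maple_node *node, struct maple_tree *mt);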
>

2023-07-31 13:05:58

by Peng Zhang

[permalink] [raw]
Subject: Re: [PATCH 06/11] maple_tree: Introduce mas_replace_entry() to directly replace an entry



On 2023/7/27 00:08, Liam R. Howlett wrote:
> * Peng Zhang <[email protected]> [230726 04:10]:
>> If mas has located a specific entry, it may be need to replace this
>> entry, so introduce mas_replace_entry() to do this. mas_replace_entry()
>> will be more efficient than mas_store*() because it doesn't do many
>> unnecessary checks.
>>
>> This function should be inline, but more functions need to be moved to
>> the header file, so I didn't do it for the time being.
>
> I am really nervous having no checks here. I get that this could be
> used for duplicating the tree more efficiently, but having a function
> that just swaps a value in is very dangerous - especially since it is
> decoupled from the tree duplication code.
I've thought about this, and I feel this is something the user should
guarantee. If the user is not sure whether it is safe to use, mas_store()
can be used instead. And we should provide this interface because it has
better performance.
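
To make the intended contract concrete, usage would look something like this
(a minimal sketch only; mt, index and new_entry are placeholders, and locking
and allocation handling are left out):

MA_STATE(mas, mt, index, index);
void *old;

old = mas_walk(&mas);                  /* position the mas on the existing entry */
if (old)                               /* contract: an entry exists, new one is non-NULL */
        mas_replace_entry(&mas, new_entry);
else
        mas_store(&mas, new_entry);    /* otherwise fall back to the general store path */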
>
>>
>> Signed-off-by: Peng Zhang <[email protected]>
>> ---
>> include/linux/maple_tree.h | 1 +
>> lib/maple_tree.c | 25 +++++++++++++++++++++++++
>> 2 files changed, 26 insertions(+)
>>
>> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
>> index 229fe78e4c89..a05e9827d761 100644
>> --- a/include/linux/maple_tree.h
>> +++ b/include/linux/maple_tree.h
>> @@ -462,6 +462,7 @@ struct ma_wr_state {
>>
>> void *mas_walk(struct ma_state *mas);
>> void *mas_store(struct ma_state *mas, void *entry);
>> +void mas_replace_entry(struct ma_state *mas, void *entry);
>> void *mas_erase(struct ma_state *mas);
>> int mas_store_gfp(struct ma_state *mas, void *entry, gfp_t gfp);
>> void mas_store_prealloc(struct ma_state *mas, void *entry);
>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> index efac6761ae37..d58572666a00 100644
>> --- a/lib/maple_tree.c
>> +++ b/lib/maple_tree.c
>> @@ -5600,6 +5600,31 @@ void *mas_store(struct ma_state *mas, void *entry)
>> }
>> EXPORT_SYMBOL_GPL(mas_store);
>>
>> +/**
>> + * mas_replace_entry() - Replace an entry that already exists in the maple tree
>> + * @mas: The maple state
>> + * @entry: The entry to store
>> + *
>> + * Please note that mas must already locate an existing entry, and the new entry
>> + * must not be NULL. If these two points cannot be guaranteed, please use
>> + * mas_store*() instead, otherwise it will cause an internal error in the maple
>> + * tree. This function does not need to allocate memory, so it must succeed.
>> + */
>> +void mas_replace_entry(struct ma_state *mas, void *entry)
>> +{
>> + void __rcu **slots;
>> +
>> +#ifdef CONFIG_DEBUG_MAPLE_TREE
>> + MAS_WARN_ON(mas, !mte_is_leaf(mas->node));
>> + MAS_WARN_ON(mas, !entry);
>> + MAS_WARN_ON(mas, mas->offset >= mt_slots[mte_node_type(mas->node)]);
>> +#endif
>> +
>> + slots = ma_slots(mte_to_node(mas->node), mte_node_type(mas->node));
>> + rcu_assign_pointer(slots[mas->offset], entry);
>> +}
>> +EXPORT_SYMBOL_GPL(mas_replace_entry);
>> +
>> /**
>> * mas_store_gfp() - Store a value into the tree.
>> * @mas: The maple state
>> --
>> 2.20.1
>>

2023-07-31 13:18:32

by Peng Zhang

[permalink] [raw]
Subject: Re: [PATCH 05/11] maple_tree: Add test for mt_dup()



On 2023/7/27 00:06, Liam R. Howlett wrote:
> * Peng Zhang <[email protected]> [230726 04:10]:
>> Add test for mt_dup().
>>
>> Signed-off-by: Peng Zhang <[email protected]>
>> ---
>> tools/testing/radix-tree/maple.c | 202 +++++++++++++++++++++++++++++++
>> 1 file changed, 202 insertions(+)
>>
>> diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
>> index e5da1cad70ba..3052e899e5df 100644
>> --- a/tools/testing/radix-tree/maple.c
>> +++ b/tools/testing/radix-tree/maple.c
>> @@ -35857,6 +35857,204 @@ static noinline void __init check_locky(struct maple_tree *mt)
>> mt_clear_in_rcu(mt);
>> }
>>
>> +/*
>> + * Compare two nodes and return 0 if they are the same, non-zero otherwise.
>> + */
>> +static int __init compare_node(struct maple_enode *enode_a,
>> + struct maple_enode *enode_b)
>> +{
>> + struct maple_node *node_a, *node_b;
>> + struct maple_node a, b;
>> + void **slots_a, **slots_b; /* Do not use the rcu tag. */
>> + enum maple_type type;
>> + int i;
>> +
>> + if (((unsigned long)enode_a & MAPLE_NODE_MASK) !=
>> + ((unsigned long)enode_b & MAPLE_NODE_MASK)) {
>> + pr_err("The lower 8 bits of enode are different.\n");
>> + return -1;
>> + }
>> +
>> + type = mte_node_type(enode_a);
>> + node_a = mte_to_node(enode_a);
>> + node_b = mte_to_node(enode_b);
>> + a = *node_a;
>> + b = *node_b;
>> +
>> + /* Do not compare addresses. */
>> + if (ma_is_root(node_a) || ma_is_root(node_b)) {
>> + a.parent = (struct maple_pnode *)((unsigned long)a.parent &
>> + MA_ROOT_PARENT);
>> + b.parent = (struct maple_pnode *)((unsigned long)b.parent &
>> + MA_ROOT_PARENT);
>> + } else {
>> + a.parent = (struct maple_pnode *)((unsigned long)a.parent &
>> + MAPLE_NODE_MASK);
>> + b.parent = (struct maple_pnode *)((unsigned long)b.parent &
>> + MAPLE_NODE_MASK);
>> + }
>> +
>> + if (a.parent != b.parent) {
>> + pr_err("The lower 8 bits of parents are different. %p %p\n",
>> + a.parent, b.parent);
>> + return -1;
>> + }
>> +
>> + /*
>> + * If it is a leaf node, the slots do not contain the node address, and
>> + * no special processing of slots is required.
>> + */
>> + if (ma_is_leaf(type))
>> + goto cmp;
>> +
>> + slots_a = ma_slots(&a, type);
>> + slots_b = ma_slots(&b, type);
>> +
>> + for (i = 0; i < mt_slots[type]; i++) {
>> + if (!slots_a[i] && !slots_b[i])
>> + break;
>> +
>> + if (!slots_a[i] || !slots_b[i]) {
>> + pr_err("The number of slots is different.\n");
>> + return -1;
>> + }
>> +
>> + /* Do not compare addresses in slots. */
>> + ((unsigned long *)slots_a)[i] &= MAPLE_NODE_MASK;
>> + ((unsigned long *)slots_b)[i] &= MAPLE_NODE_MASK;
>> + }
>> +
>> +cmp:
>> + /*
>> + * Compare all contents of two nodes, including parent (except address),
>> + * slots (except address), pivots, gaps and metadata.
>> + */
>> + return memcmp(&a, &b, sizeof(struct maple_node));
>> +}
>> +
>> +/*
>> + * Compare two trees and return 0 if they are the same, non-zero otherwise.
>> + */
>> +static int __init compare_tree(struct maple_tree *mt_a, struct maple_tree *mt_b)
>> +{
>> + MA_STATE(mas_a, mt_a, 0, 0);
>> + MA_STATE(mas_b, mt_b, 0, 0);
>> +
>> + if (mt_a->ma_flags != mt_b->ma_flags) {
>> + pr_err("The flags of the two trees are different.\n");
>> + return -1;
>> + }
>> +
>> + mas_dfs_preorder(&mas_a);
>> + mas_dfs_preorder(&mas_b);
>> +
>> + if (mas_is_ptr(&mas_a) || mas_is_ptr(&mas_b)) {
>> + if (!(mas_is_ptr(&mas_a) && mas_is_ptr(&mas_b))) {
>> + pr_err("One is MAS_ROOT and the other is not.\n");
>> + return -1;
>> + }
>> + return 0;
>> + }
>> +
>> + while (!mas_is_none(&mas_a) || !mas_is_none(&mas_b)) {
>> +
>> + if (mas_is_none(&mas_a) || mas_is_none(&mas_b)) {
>> + pr_err("One is MAS_NONE and the other is not.\n");
>> + return -1;
>> + }
>> +
>> + if (mas_a.min != mas_b.min ||
>> + mas_a.max != mas_b.max) {
>> + pr_err("mas->min, mas->max do not match.\n");
>> + return -1;
>> + }
>> +
>> + if (compare_node(mas_a.node, mas_b.node)) {
>> + pr_err("The contents of nodes %p and %p are different.\n",
>> + mas_a.node, mas_b.node);
>> + mt_dump(mt_a, mt_dump_dec);
>> + mt_dump(mt_b, mt_dump_dec);
>> + return -1;
>> + }
>> +
>> + mas_dfs_preorder(&mas_a);
>> + mas_dfs_preorder(&mas_b);
>> + }
>> +
>> + return 0;
>> +}
>> +
>> +static noinline void __init check_mt_dup(struct maple_tree *mt)
>> +{
>> + DEFINE_MTREE(new);
>> + int i, j, ret, count = 0;
>> +
>> + /* stored in the root pointer */
>> + mt_init_flags(&tree, 0);
>> + mtree_store_range(&tree, 0, 0, xa_mk_value(0), GFP_KERNEL);
>> + mt_dup(&tree, &new, GFP_KERNEL);
>> + mt_validate(&new);
>> + if (compare_tree(&tree, &new))
>> + MT_BUG_ON(&new, 1);
>> +
>> + mtree_destroy(&tree);
>> + mtree_destroy(&new);
>> +
>> + for (i = 0; i < 1000; i += 3) {
>> + if (i & 1)
>> + mt_init_flags(&tree, 0);
>> + else
>> + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
>> +
>> + for (j = 0; j < i; j++) {
>> + mtree_store_range(&tree, j * 10, j * 10 + 5,
>> + xa_mk_value(j), GFP_KERNEL);
>> + }
>
> Storing in this way is probably not checking a full tree. I think it's
> important to check the full tree/full nodes since you have changes to
> detect the metadata.
I probably won't change the way I check metadata. But is there a way to
construct a full tree? All I can think of is to write new code to
construct a full tree.
>
>> +
>> + ret = mt_dup(&tree, &new, GFP_KERNEL);
>> + MT_BUG_ON(&new, ret != 0);
>> + mt_validate(&new);
>> + if (compare_tree(&tree, &new))
>> + MT_BUG_ON(&new, 1);
>> +
>> + mtree_destroy(&tree);
>> + mtree_destroy(&new);
>> + }
>> +
>> + /* Test memory allocation failed. */
>> + for (i = 0; i < 1000; i += 3) {
>> + if (i & 1)
>> + mt_init_flags(&tree, 0);
>> + else
>> + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
>> +
>> + for (j = 0; j < i; j++) {
>> + mtree_store_range(&tree, j * 10, j * 10 + 5,
>> + xa_mk_value(j), GFP_KERNEL);
>> + }
>> +
>> + mt_set_non_kernel(50);
>
> It may be worth while allowing more/less than 50 allocations.
Actually, I have used other values before. I haven't settled on a good
value yet; a random number in a suitable range might also work well.

>
>> + ret = mt_dup(&tree, &new, GFP_NOWAIT);
>> + mt_set_non_kernel(0);
>> + if (ret != 0) {
>> + MT_BUG_ON(&new, ret != -ENOMEM);
>> + count++;
>> + mtree_destroy(&tree);
>> + continue;
>> + }
>> +
>> + mt_validate(&new);
>> + if (compare_tree(&tree, &new))
>> + MT_BUG_ON(&new, 1);
>> +
>> + mtree_destroy(&tree);
>> + mtree_destroy(&new);
>> + }
>> +
>> + /* pr_info("mt_dup() fail %d times\n", count); */
>> + BUG_ON(!count);
>> +}
>> +
>> extern void test_kmem_cache_bulk(void);
>>
>> void farmer_tests(void)
>> @@ -35904,6 +36102,10 @@ void farmer_tests(void)
>> check_null_expand(&tree);
>> mtree_destroy(&tree);
>>
>> + mt_init_flags(&tree, 0);
>> + check_mt_dup(&tree);
>> + mtree_destroy(&tree);
>> +
>> /* RCU testing */
>> mt_init_flags(&tree, 0);
>> check_erase_testset(&tree);
>> --
>> 2.20.1
>>
>>

2023-07-31 13:27:04

by Peng Zhang

[permalink] [raw]
Subject: Re: [PATCH 10/11] MAINTAINERS: Add co-maintainer for maple tree



On 2023/7/27 00:39, Liam R. Howlett wrote:
> * Peng Zhang <[email protected]> [230726 04:10]:
>> Add myself as co-maintainer for maple tree. I would like to assist
>> Liam R. Howlett in maintaining maple tree. I will continue to contribute
>> to the development of maple tree in the future.
>
> Sorry, but no.
>
> I appreciate the patches, bug fixes, and code review but there is no
> need for another maintainer for the tree at this time.
Could I be added as a reviewer here instead? That would make it easier for
me to review the code and to see the problems reported by the community;
at the moment I usually don't receive the maple tree issues people report.
It would have no effect on your maintenance of the tree.
>
> Thank you,
> Liam
>
>>
>> Signed-off-by: Peng Zhang <[email protected]>
>> ---
>> MAINTAINERS | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/MAINTAINERS b/MAINTAINERS
>> index ddc71b815791..8cfedd492509 100644
>> --- a/MAINTAINERS
>> +++ b/MAINTAINERS
>> @@ -12526,6 +12526,7 @@ F: net/mctp/
>>
>> MAPLE TREE
>> M: Liam R. Howlett <[email protected]>
>> +M: Peng Zhang <[email protected]>
>> L: [email protected]
>> S: Supported
>> F: Documentation/core-api/maple_tree.rst
>> --
>> 2.20.1
>>

2023-07-31 13:51:46

by Peng Zhang

[permalink] [raw]
Subject: Re: [PATCH 03/11] maple_tree: Add some helper functions



On 2023/7/26 23:02, Liam R. Howlett wrote:
> * Peng Zhang <[email protected]> [230726 04:10]:
>> Add some helper functions so that their parameters are maple node
>> instead of maple enode, these functions will be used later.
>>
>> Signed-off-by: Peng Zhang <[email protected]>
>> ---
>> lib/maple_tree.c | 71 +++++++++++++++++++++++++++++++++++++-----------
>> 1 file changed, 55 insertions(+), 16 deletions(-)
>>
>> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> index e0e9a87bdb43..da3a2fb405c0 100644
>> --- a/lib/maple_tree.c
>> +++ b/lib/maple_tree.c
>> @@ -164,6 +164,11 @@ static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
>> return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
>> }
>>
>> +static inline void mt_free_one(struct maple_node *node)
>> +{
>> + kmem_cache_free(maple_node_cache, node);
>> +}
>> +
>
> There is a place in mas_destroy() that could use this if it is added.
I will make changes accordingly. It's not done here because it doesn't
seem to be relevant to the theme of this patchset.
>
>> static inline void mt_free_bulk(size_t size, void __rcu **nodes)
>> {
>> kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
>> @@ -432,18 +437,18 @@ static inline unsigned long mte_parent_slot_mask(unsigned long parent)
>> }
>>
>> /*
>> - * mas_parent_type() - Return the maple_type of the parent from the stored
>> - * parent type.
>> - * @mas: The maple state
>> - * @enode: The maple_enode to extract the parent's enum
>> + * ma_parent_type() - Return the maple_type of the parent from the stored parent
>> + * type.
>> + * @mt: The maple tree
>> + * @node: The maple_node to extract the parent's enum
>> * Return: The node->parent maple_type
>> */
>> static inline
>> -enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
>> +enum maple_type ma_parent_type(struct maple_tree *mt, struct maple_node *node)
>
> I was trying to keep ma_* prefix to mean the first argument is
> maple_node and mt_* to mean maple_tree. I wasn't entirely successful
> with this and I do see why you want to use ma_, but maybe reverse the
> arguments here?
I just think it is redundant to construct a maple enode from node->parent
only to fit the parameters of the mte_* helpers. The ma_* variants are
introduced to avoid that pointless construction.

>
>> {
>> unsigned long p_type;
>>
>> - p_type = (unsigned long)mte_to_node(enode)->parent;
>> + p_type = (unsigned long)node->parent;
>> if (WARN_ON(p_type & MAPLE_PARENT_ROOT))
>> return 0;
>>
>> @@ -451,7 +456,7 @@ enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
>> p_type &= ~mte_parent_slot_mask(p_type);
>> switch (p_type) {
>> case MAPLE_PARENT_RANGE64: /* or MAPLE_PARENT_ARANGE64 */
>> - if (mt_is_alloc(mas->tree))
>> + if (mt_is_alloc(mt))
>> return maple_arange_64;
>> return maple_range_64;
>> }
>> @@ -459,6 +464,19 @@ enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
>> return 0;
>> }
>>
>> +/*
>> + * mas_parent_type() - Return the maple_type of the parent from the stored
>> + * parent type.
>> + * @mas: The maple state
>> + * @enode: The maple_enode to extract the parent's enum
>> + * Return: The node->parent maple_type
>> + */
>> +static inline
>> +enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
>> +{
>> + return ma_parent_type(mas->tree, mte_to_node(enode));
>> +}
>> +
>> /*
>> * mas_set_parent() - Set the parent node and encode the slot
>> * @enode: The encoded maple node.
>> @@ -499,14 +517,14 @@ void mas_set_parent(struct ma_state *mas, struct maple_enode *enode,
>> }
>>
>> /*
>> - * mte_parent_slot() - get the parent slot of @enode.
>> - * @enode: The encoded maple node.
>> + * ma_parent_slot() - get the parent slot of @node.
>> + * @node: The maple node.
>> *
>> - * Return: The slot in the parent node where @enode resides.
>> + * Return: The slot in the parent node where @node resides.
>> */
>> -static inline unsigned int mte_parent_slot(const struct maple_enode *enode)
>> +static inline unsigned int ma_parent_slot(const struct maple_node *node)
>> {
>> - unsigned long val = (unsigned long)mte_to_node(enode)->parent;
>> + unsigned long val = (unsigned long)node->parent;
>>
>> if (val & MA_ROOT_PARENT)
>> return 0;
>> @@ -519,15 +537,36 @@ static inline unsigned int mte_parent_slot(const struct maple_enode *enode)
>> }
>>
>> /*
>> - * mte_parent() - Get the parent of @node.
>> - * @node: The encoded maple node.
>> + * mte_parent_slot() - get the parent slot of @enode.
>> + * @enode: The encoded maple node.
>> + *
>> + * Return: The slot in the parent node where @enode resides.
>> + */
>> +static inline unsigned int mte_parent_slot(const struct maple_enode *enode)
>> +{
>> + return ma_parent_slot(mte_to_node(enode));
>> +}
>> +
>> +/*
>> + * ma_parent() - Get the parent of @node.
>> + * @node: The maple node.
>> + *
>> + * Return: The parent maple node.
>> + */
>> +static inline struct maple_node *ma_parent(const struct maple_node *node)
>
> I had a lot of these helpers before, but they eventually became used so
> little that I dropped them.
This was just to avoid constructing a maple enode. It's not really a
problem.
>
>> +{
>> + return (void *)((unsigned long)(node->parent) & ~MAPLE_NODE_MASK);
>> +}
>> +
>> +/*
>> + * mte_parent() - Get the parent of @enode.
>> + * @enode: The encoded maple node.
>> *
>> * Return: The parent maple node.
>> */
>> static inline struct maple_node *mte_parent(const struct maple_enode *enode)
>> {
>> - return (void *)((unsigned long)
>> - (mte_to_node(enode)->parent) & ~MAPLE_NODE_MASK);
>> + return ma_parent(mte_to_node(enode));
>> }
>>
>> /*
>> --
>> 2.20.1
>>

2023-07-31 14:26:25

by Peng Zhang

[permalink] [raw]
Subject: Re: [PATCH 11/11] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()



On 2023/7/27 01:06, Liam R. Howlett wrote:
> * Peng Zhang <[email protected]> [230726 04:10]:
>> Use __mt_dup() to duplicate the old maple tree in dup_mmap(), and then
>> directly modify the entries of VMAs in the new maple tree, which can
>> get better performance. dup_mmap() is used by fork(), so this patch
>> optimizes fork(). The optimization effect is proportional to the number
>> of VMAs.
>>
>> Due to the introduction of this method, the optimization in
>> (maple_tree: add a fast path case in mas_wr_slot_store())[1] no longer
>> has an effect here, but it is also an optimization of the maple tree.
>>
>> There is a unixbench test suite[2] where 'spawn' is used to test fork().
>> 'spawn' only has 23 VMAs by default, so I tweaked the benchmark code a
>> bit to use mmap() to control the number of VMAs. Therefore, the
>> performance under different numbers of VMAs can be measured.
>>
>> Insert code like below into 'spawn':
>> for (int i = 0; i < 200; ++i) {
>> size_t size = 10 * getpagesize();
>> void *addr;
>>
>> if (i & 1) {
>> addr = mmap(NULL, size, PROT_READ,
>> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> } else {
>> addr = mmap(NULL, size, PROT_WRITE,
>> MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
>> }
>> if (addr == MAP_FAILED)
>> ...
>> }
>>
>> Based on next-20230721, use 'spawn' under 23, 203, and 4023 VMAs, test
>> 4 times in 30 seconds each time, and get the following numbers. These
>> numbers are the number of fork() successes in 30s (average of the best
>> 3 out of 4). By the way, based on next-20230725, I reverted [1], and
>> tested it together as a comparison. In order to ensure the reliability
>> of the test results, these tests were run on a physical machine.
>>
>> 23VMAs 223VMAs 4023VMAs
>> revert [1]: 159104.00 73316.33 6787.00
>
> You can probably remove the revert benchmark from this since there is no
> reason to revert the previous change. The change is worthwhile on its
> own, so it's better to have the numbers more clear by having with and
> without this series.
I will remove it.
>
>>
>> +0.77% +0.42% +0.28%
>> next-20230721: 160321.67 73624.67 6806.33
>>
>> +2.77% +15.42% +29.86%
>> apply this: 164751.67 84980.33 8838.67
>
> What is the difference between using this patch with mas_replace_entry()
> and mas_store_entry()?
I haven't tested and compared them yet; I will do so when I have time.
The comparison could be done by simulating fork() in user space.
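
Roughly, the userspace comparison could look like this (only a sketch on top
of the tools/testing/radix-tree harness; old_tree and nr_runs are assumed to
be set up beforehand, and timing and error handling are left out):

for (int i = 0; i < nr_runs; i++) {
        DEFINE_MTREE(new_tree);
        MA_STATE(mas, &new_tree, 0, 0);
        void *entry;

        __mt_dup(&old_tree, &new_tree, GFP_KERNEL);

        /* overwrite every entry once, with either API */
        mtree_lock(&new_tree);
        mas_for_each(&mas, entry, ULONG_MAX)
                mas_replace_entry(&mas, xa_mk_value(i));    /* vs. mas_store() */
        mtree_unlock(&new_tree);

        mtree_destroy(&new_tree);
}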
>
>>
>> It can be seen that the performance improvement is proportional to
>> the number of VMAs. With 23 VMAs, performance improves by about 3%,
>> with 223 VMAs, performance improves by about 15%, and with 4023 VMAs,
>> performance improves by about 30%.
>>
>> [1] https://lore.kernel.org/lkml/[email protected]/
>> [2] https://github.com/kdlucas/byte-unixbench/tree/master
>>
>> Signed-off-by: Peng Zhang <[email protected]>
>> ---
>> kernel/fork.c | 35 +++++++++++++++++++++++++++--------
>> mm/mmap.c | 14 ++++++++++++--
>> 2 files changed, 39 insertions(+), 10 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index f81149739eb9..ef80025b62d6 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>> int retval;
>> unsigned long charge = 0;
>> LIST_HEAD(uf);
>> - VMA_ITERATOR(old_vmi, oldmm, 0);
>> VMA_ITERATOR(vmi, mm, 0);
>>
>> uprobe_start_dup_mmap();
>> @@ -678,17 +677,40 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>> goto out;
>> khugepaged_fork(mm, oldmm);
>>
>> - retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
>> - if (retval)
>> + /* Use __mt_dup() to efficiently build an identical maple tree. */
>> + retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_NOWAIT | __GFP_NOWARN);
>> + if (unlikely(retval))
>> goto out;
>>
>> mt_clear_in_rcu(vmi.mas.tree);
>> - for_each_vma(old_vmi, mpnt) {
>> + for_each_vma(vmi, mpnt) {
>> struct file *file;
>>
>> vma_start_write(mpnt);
>> if (mpnt->vm_flags & VM_DONTCOPY) {
>> vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
>> +
>> + /*
>> + * Since the new tree is exactly the same as the old one,
>> + * we need to remove the unneeded VMAs.
>> + */
>> + mas_store(&vmi.mas, NULL);
>> +
>> + /*
>> + * Even removing an entry may require memory allocation,
>> + * and if removal fails, we use XA_ZERO_ENTRY to mark
>> + * from which VMA it failed. The case of encountering
>> + * XA_ZERO_ENTRY will be handled in exit_mmap().
>> + */
>> + if (unlikely(mas_is_err(&vmi.mas))) {
>> + retval = xa_err(vmi.mas.node);
>> + mas_reset(&vmi.mas);
>> + if (mas_find(&vmi.mas, ULONG_MAX))
>> + mas_replace_entry(&vmi.mas,
>> + XA_ZERO_ENTRY);
>> + goto loop_out;
>> + }
>> +
>> continue;
>> }
>> charge = 0;
>> @@ -750,8 +772,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>> hugetlb_dup_vma_private(tmp);
>>
>> /* Link the vma into the MT */
>> - if (vma_iter_bulk_store(&vmi, tmp))
>> - goto fail_nomem_vmi_store;
>> + mas_replace_entry(&vmi.mas, tmp);
>>
>> mm->map_count++;
>> if (!(tmp->vm_flags & VM_WIPEONFORK))
>> @@ -778,8 +799,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>> uprobe_end_dup_mmap();
>> return retval;
>>
>> -fail_nomem_vmi_store:
>> - unlink_anon_vmas(tmp);
>> fail_nomem_anon_vma_fork:
>> mpol_put(vma_policy(tmp));
>> fail_nomem_policy:
>> diff --git a/mm/mmap.c b/mm/mmap.c
>> index bc91d91261ab..5bfba2fb0e39 100644
>> --- a/mm/mmap.c
>> +++ b/mm/mmap.c
>> @@ -3184,7 +3184,11 @@ void exit_mmap(struct mm_struct *mm)
>> arch_exit_mmap(mm);
>>
>> vma = mas_find(&mas, ULONG_MAX);
>> - if (!vma) {
>> + /*
>> + * If dup_mmap() fails to remove a VMA marked VM_DONTCOPY,
>> + * xa_is_zero(vma) may be true.
>> + */
>> + if (!vma || xa_is_zero(vma)) {
>> /* Can happen if dup_mmap() received an OOM */
>> mmap_read_unlock(mm);
>> return;
>> @@ -3222,7 +3226,13 @@ void exit_mmap(struct mm_struct *mm)
>> remove_vma(vma, true);
>> count++;
>> cond_resched();
>> - } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
>> + vma = mas_find(&mas, ULONG_MAX);
>> + /*
>> + * If xa_is_zero(vma) is true, it means that subsequent VMAs
>> + * do not need to be removed. Can happen if dup_mmap() fails to
>> + * remove a VMA marked VM_DONTCOPY.
>> + */
>> + } while (vma != NULL && !xa_is_zero(vma));
>>
>> BUG_ON(count != mm->map_count);
>>
>> --
>> 2.20.1
>>

2023-07-31 16:32:53

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 01/11] maple_tree: Introduce ma_nonleaf_data_end{_nocheck}()

* Peng Zhang <[email protected]> [230731 05:53]:
>
>
> On 2023/7/26 22:58, Liam R. Howlett wrote:
> > * Peng Zhang <[email protected]> [230726 04:10]:
> > > Introduce ma_nonleaf_data_end{_nocheck}() to get the data end of
> > > non-leaf nodes without knowing the maximum value of nodes, so that any
> > > ascending can be avoided even if the maximum value of nodes is not known.
> > >
> > > The principle is that we introduce MAPLE_ENODE to mark an ENODE, which
> > > cannot be used by metadata, so we can distinguish whether it is ENODE or
> > > metadata.
> > >
> > > The nocheck version is to avoid lockdep complaining in some scenarios
> > > where no locks are held.
> > >
> > > Signed-off-by: Peng Zhang <[email protected]>
> > > ---
> > > lib/maple_tree.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++--
> > > 1 file changed, 68 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > > index a3d602cfd030..98e4fdf6f4b9 100644
> > > --- a/lib/maple_tree.c
> > > +++ b/lib/maple_tree.c
> > > @@ -310,12 +310,19 @@ static inline void mte_set_node_dead(struct maple_enode *mn)
> > > #define MAPLE_ENODE_TYPE_SHIFT 0x03
> > > /* Bit 2 means a NULL somewhere below */
> > > #define MAPLE_ENODE_NULL 0x04
> > > +/* Bit 7 means this is an ENODE, instead of metadata */
> > > +#define MAPLE_ENODE 0x80
> >
> > We were saving this bit for more node types. I don't want to use this
> > bit for this reason since you could have done BFS to duplicate the tree
> > using the existing way to find the node end.
> We have reserved 4 bits for the node type. I don't think there will be
> more than 16 node types going forward.

We aren't using this bit for the metadata validation. We have plans for
the tree to be used elsewhere and I'd like to keep this bit in case we
need it.

>
> Even the DFS that has been implemented can use the existing way to get
> the data end.

Please try using it and see the impact on your performance testing.
Think about how often a node is full and how often a parent node is
full, etc, etc. As I brought up during the review of your testing code,
you actually have not filled the trees during construction. So there
will not be more than one walk up, and I'm not even sure you will get
even one.

The nodes are also going to be in CPU cache, so even when you do need to
walk up multiple times, it's not going to be a huge performance
consideration.

Furthermore, the maple allocation nodes have metadata stored outside of
the data - there is extra space that we are using in those nodes always.
So for fork, all this is to make walking up to the parent of full leaves
faster - but it's not used for leaves so there is no benefit in that
case.

> I didn't use it because when walking up the tree, we don't
> know the maximum value of the node, and the continuous upward walk will
> introduce more overhead, which is what mas_ascend() does. Doing BFS
> cannot avoid this problem either.
>
> The reason I don't do BFS is that it has more overhead than DFS. If you
> think of a tree as a graph, doing DFS will only walk each edge twice.
> What about BFS? Since we can't use queues, we can only emulate BFS,
> which additionally has to do something like what mas_next_node() does, and that
> introduces more overhead than DFS. Considering only the layer of leaf
> nodes, it needs to walk each edge twice. So the overhead of doing BFS is
> more than DFS.

Okay, thanks.

> >
> > Bits are highly valuable and this is the only remaining bit. I had
> > thought about using this in Feb 2021 to see if there was metadata or
> > not, but figured a way around it (using the max trick) and thus saved
> > this bit for potential expansion of node types.
> I thought of another way to get the maximum value of a node without
> doing any extra upward walk. When doing DFS, we can use a stack to save
> the maximum value of ancestor nodes. The stack size can be set to
> MAPLE_HEIGHT_MAX. In this way, this bit can be reserved, and there is no
> need to do a loop like mas_ascend() in order to get the maximum value.

You want an array of 31 unsigned longs to keep track of the max of this
node? Please don't do that.

> >
> > > +
> > > +static inline bool slot_is_mte(unsigned long slot)
> > > +{
> > > + return slot & MAPLE_ENODE;
> > > +}
> > > static inline struct maple_enode *mt_mk_node(const struct maple_node *node,
> > > enum maple_type type)
> > > {
> > > - return (void *)((unsigned long)node |
> > > - (type << MAPLE_ENODE_TYPE_SHIFT) | MAPLE_ENODE_NULL);
> > > + return (void *)((unsigned long)node | (type << MAPLE_ENODE_TYPE_SHIFT) |
> > > + MAPLE_ENODE_NULL | MAPLE_ENODE);
> > > }
> > > static inline void *mte_mk_root(const struct maple_enode *node)
> > > @@ -1411,6 +1418,65 @@ static inline struct maple_enode *mas_start(struct ma_state *mas)
> > > return NULL;
> > > }
> > > +/*
> > > + * ma_nonleaf_data_end() - Find the end of the data in a non-leaf node.
> > > + * @mt: The maple tree
> > > + * @node: The maple node
> > > + * @type: The maple node type
> > > + *
> > > + * Uses metadata to find the end of the data when possible without knowing the
> > > + * node maximum.
> > > + *
> > > + * Return: The zero indexed last slot with child.
> > > + */
> > > +static inline unsigned char ma_nonleaf_data_end(struct maple_tree *mt,
> > > + struct maple_node *node,
> > > + enum maple_type type)
> > > +{
> > > + void __rcu **slots;
> > > + unsigned long slot;
> > > +
> > > + slots = ma_slots(node, type);
> > > + slot = (unsigned long)mt_slot(mt, slots, mt_pivots[type]);
> > > + if (unlikely(slot_is_mte(slot)))
> > > + return mt_pivots[type];
> > > +
> > > + return ma_meta_end(node, type);
> > > +}
> > > +
> > > +/*
> > > + * ma_nonleaf_data_end_nocheck() - Find the end of the data in a non-leaf node.
> > > + * @node: The maple node
> > > + * @type: The maple node type
> > > + *
> > > + * Uses metadata to find the end of the data when possible without knowing the
> > > + * node maximum. This is the version of ma_nonleaf_data_end() that does not
> > > + * check for lock held. This particular version is designed to avoid lockdep
> > > + * complaining in some scenarios.
> > > + *
> > > + * Return: The zero indexed last slot with child.
> > > + */
> > > +static inline unsigned char ma_nonleaf_data_end_nocheck(struct maple_node *node,
> > > + enum maple_type type)
> > > +{
> > > + void __rcu **slots;
> > > + unsigned long slot;
> > > +
> > > + slots = ma_slots(node, type);
> > > + slot = (unsigned long)rcu_dereference_raw(slots[mt_pivots[type]]);
> > > + if (unlikely(slot_is_mte(slot)))
> > > + return mt_pivots[type];
> > > +
> > > + return ma_meta_end(node, type);
> > > +}
> > > +
> > > +/* See ma_nonleaf_data_end() */
> > > +static inline unsigned char mte_nonleaf_data_end(struct maple_tree *mt,
> > > + struct maple_enode *enode)
> > > +{
> > > + return ma_nonleaf_data_end(mt, mte_to_node(enode), mte_node_type(enode));
> > > +}
> > > +
> > > /*
> > > * ma_data_end() - Find the end of the data in a node.
> > > * @node: The maple node
> > > --
> > > 2.20.1
> > >
> > >

2023-07-31 17:32:04

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 05/11] maple_tree: Add test for mt_dup()

* Peng Zhang <[email protected]> [230731 08:32]:
>
>
> On 2023/7/27 00:06, Liam R. Howlett wrote:
> > * Peng Zhang <[email protected]> [230726 04:10]:
> > > Add test for mt_dup().
> > >
> > > Signed-off-by: Peng Zhang <[email protected]>
> > > ---
> > > tools/testing/radix-tree/maple.c | 202 +++++++++++++++++++++++++++++++
> > > 1 file changed, 202 insertions(+)
> > >
> > > diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
> > > index e5da1cad70ba..3052e899e5df 100644
> > > --- a/tools/testing/radix-tree/maple.c
> > > +++ b/tools/testing/radix-tree/maple.c
> > > @@ -35857,6 +35857,204 @@ static noinline void __init check_locky(struct maple_tree *mt)
> > > mt_clear_in_rcu(mt);
> > > }
> > > +/*
> > > + * Compare two nodes and return 0 if they are the same, non-zero otherwise.
> > > + */
> > > +static int __init compare_node(struct maple_enode *enode_a,
> > > + struct maple_enode *enode_b)
> > > +{
> > > + struct maple_node *node_a, *node_b;
> > > + struct maple_node a, b;
> > > + void **slots_a, **slots_b; /* Do not use the rcu tag. */
> > > + enum maple_type type;
> > > + int i;
> > > +
> > > + if (((unsigned long)enode_a & MAPLE_NODE_MASK) !=
> > > + ((unsigned long)enode_b & MAPLE_NODE_MASK)) {
> > > + pr_err("The lower 8 bits of enode are different.\n");
> > > + return -1;
> > > + }
> > > +
> > > + type = mte_node_type(enode_a);
> > > + node_a = mte_to_node(enode_a);
> > > + node_b = mte_to_node(enode_b);
> > > + a = *node_a;
> > > + b = *node_b;
> > > +
> > > + /* Do not compare addresses. */
> > > + if (ma_is_root(node_a) || ma_is_root(node_b)) {
> > > + a.parent = (struct maple_pnode *)((unsigned long)a.parent &
> > > + MA_ROOT_PARENT);
> > > + b.parent = (struct maple_pnode *)((unsigned long)b.parent &
> > > + MA_ROOT_PARENT);
> > > + } else {
> > > + a.parent = (struct maple_pnode *)((unsigned long)a.parent &
> > > + MAPLE_NODE_MASK);
> > > + b.parent = (struct maple_pnode *)((unsigned long)b.parent &
> > > + MAPLE_NODE_MASK);
> > > + }
> > > +
> > > + if (a.parent != b.parent) {
> > > + pr_err("The lower 8 bits of parents are different. %p %p\n",
> > > + a.parent, b.parent);
> > > + return -1;
> > > + }
> > > +
> > > + /*
> > > + * If it is a leaf node, the slots do not contain the node address, and
> > > + * no special processing of slots is required.
> > > + */
> > > + if (ma_is_leaf(type))
> > > + goto cmp;
> > > +
> > > + slots_a = ma_slots(&a, type);
> > > + slots_b = ma_slots(&b, type);
> > > +
> > > + for (i = 0; i < mt_slots[type]; i++) {
> > > + if (!slots_a[i] && !slots_b[i])
> > > + break;
> > > +
> > > + if (!slots_a[i] || !slots_b[i]) {
> > > + pr_err("The number of slots is different.\n");
> > > + return -1;
> > > + }
> > > +
> > > + /* Do not compare addresses in slots. */
> > > + ((unsigned long *)slots_a)[i] &= MAPLE_NODE_MASK;
> > > + ((unsigned long *)slots_b)[i] &= MAPLE_NODE_MASK;
> > > + }
> > > +
> > > +cmp:
> > > + /*
> > > + * Compare all contents of two nodes, including parent (except address),
> > > + * slots (except address), pivots, gaps and metadata.
> > > + */
> > > + return memcmp(&a, &b, sizeof(struct maple_node));
> > > +}
> > > +
> > > +/*
> > > + * Compare two trees and return 0 if they are the same, non-zero otherwise.
> > > + */
> > > +static int __init compare_tree(struct maple_tree *mt_a, struct maple_tree *mt_b)
> > > +{
> > > + MA_STATE(mas_a, mt_a, 0, 0);
> > > + MA_STATE(mas_b, mt_b, 0, 0);
> > > +
> > > + if (mt_a->ma_flags != mt_b->ma_flags) {
> > > + pr_err("The flags of the two trees are different.\n");
> > > + return -1;
> > > + }
> > > +
> > > + mas_dfs_preorder(&mas_a);
> > > + mas_dfs_preorder(&mas_b);
> > > +
> > > + if (mas_is_ptr(&mas_a) || mas_is_ptr(&mas_b)) {
> > > + if (!(mas_is_ptr(&mas_a) && mas_is_ptr(&mas_b))) {
> > > + pr_err("One is MAS_ROOT and the other is not.\n");
> > > + return -1;
> > > + }
> > > + return 0;
> > > + }
> > > +
> > > + while (!mas_is_none(&mas_a) || !mas_is_none(&mas_b)) {
> > > +
> > > + if (mas_is_none(&mas_a) || mas_is_none(&mas_b)) {
> > > + pr_err("One is MAS_NONE and the other is not.\n");
> > > + return -1;
> > > + }
> > > +
> > > + if (mas_a.min != mas_b.min ||
> > > + mas_a.max != mas_b.max) {
> > > + pr_err("mas->min, mas->max do not match.\n");
> > > + return -1;
> > > + }
> > > +
> > > + if (compare_node(mas_a.node, mas_b.node)) {
> > > + pr_err("The contents of nodes %p and %p are different.\n",
> > > + mas_a.node, mas_b.node);
> > > + mt_dump(mt_a, mt_dump_dec);
> > > + mt_dump(mt_b, mt_dump_dec);
> > > + return -1;
> > > + }
> > > +
> > > + mas_dfs_preorder(&mas_a);
> > > + mas_dfs_preorder(&mas_b);
> > > + }
> > > +
> > > + return 0;
> > > +}
> > > +
> > > +static noinline void __init check_mt_dup(struct maple_tree *mt)
> > > +{
> > > + DEFINE_MTREE(new);
> > > + int i, j, ret, count = 0;
> > > +
> > > + /* stored in the root pointer */
> > > + mt_init_flags(&tree, 0);
> > > + mtree_store_range(&tree, 0, 0, xa_mk_value(0), GFP_KERNEL);
> > > + mt_dup(&tree, &new, GFP_KERNEL);
> > > + mt_validate(&new);
> > > + if (compare_tree(&tree, &new))
> > > + MT_BUG_ON(&new, 1);
> > > +
> > > + mtree_destroy(&tree);
> > > + mtree_destroy(&new);
> > > +
> > > + for (i = 0; i < 1000; i += 3) {
> > > + if (i & 1)
> > > + mt_init_flags(&tree, 0);
> > > + else
> > > + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
> > > +
> > > + for (j = 0; j < i; j++) {
> > > + mtree_store_range(&tree, j * 10, j * 10 + 5,
> > > + xa_mk_value(j), GFP_KERNEL);
> > > + }
> >
> > Storing in this way is probably not checking a full tree. I think it's
> > important to check the full tree/full nodes since you have changes to
> > detect the metadata.
> I probably won't change the way I check metadata.

What I am telling you is that you haven't tested your new code for the
metadata of full nodes with this testcase, or have I missed something?
If it's not tested here, are there other testcases that cover the new
code?

> But is there a way to
> construct a full tree? All I can think of is to write new code to
> construct a full tree.

Normally, what I do is create a tree in a loop like you have done above
and then store entries over a portion of existing ranges to fill out the
nodes until they are full. check_ranges() in lib/test_maple_tree.c
might be of help.
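
Untested, but something along these lines on top of the loop you already
have (the exact ranges would need tuning, and mt_dump() will tell you
whether the nodes really end up full):

/* first pass: the same sort of loop you already have */
for (j = 0; j < i; j++)
        mtree_store_range(&tree, j * 10, j * 10 + 9,
                          xa_mk_value(j), GFP_KERNEL);

/* second pass: split every existing range so the leaves fill up */
for (j = 0; j < i; j++)
        mtree_store_range(&tree, j * 10 + 4, j * 10 + 6,
                          xa_mk_value(j + i), GFP_KERNEL);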

> >
> > > +
> > > + ret = mt_dup(&tree, &new, GFP_KERNEL);
> > > + MT_BUG_ON(&new, ret != 0);
> > > + mt_validate(&new);
> > > + if (compare_tree(&tree, &new))
> > > + MT_BUG_ON(&new, 1);
> > > +
> > > + mtree_destroy(&tree);
> > > + mtree_destroy(&new);
> > > + }
> > > +
> > > + /* Test memory allocation failed. */
> > > + for (i = 0; i < 1000; i += 3) {
> > > + if (i & 1)
> > > + mt_init_flags(&tree, 0);
> > > + else
> > > + mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
> > > +
> > > + for (j = 0; j < i; j++) {
> > > + mtree_store_range(&tree, j * 10, j * 10 + 5,
> > > + xa_mk_value(j), GFP_KERNEL);
> > > + }
> > > +
> > > + mt_set_non_kernel(50);
> >
> > It may be worth while allowing more/less than 50 allocations.
> Actually, I have used other values before. I haven't settled on a good
> value yet; a random number in a suitable range might also work well.

Random numbers are difficult to recreate, so it might be best to limit
that to the userspace tools/testing/radix-tree/maple.c tests and print
the random number for reproducibility.
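
For example (userspace only, untested; how the value is generated doesn't
matter much as long as it is printed):

unsigned int limit = rand() % 100;

pr_info("mt_dup() OOM test: allowing %u allocations\n", limit);
mt_set_non_kernel(limit);
ret = mt_dup(&tree, &new, GFP_NOWAIT);
mt_set_non_kernel(0);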

>
> >
> > > + ret = mt_dup(&tree, &new, GFP_NOWAIT);
> > > + mt_set_non_kernel(0);
> > > + if (ret != 0) {
> > > + MT_BUG_ON(&new, ret != -ENOMEM);
> > > + count++;
> > > + mtree_destroy(&tree);
> > > + continue;
> > > + }
> > > +
> > > + mt_validate(&new);
> > > + if (compare_tree(&tree, &new))
> > > + MT_BUG_ON(&new, 1);
> > > +
> > > + mtree_destroy(&tree);
> > > + mtree_destroy(&new);
> > > + }
> > > +
> > > + /* pr_info("mt_dup() fail %d times\n", count); */
> > > + BUG_ON(!count);
> > > +}
> > > +
> > > extern void test_kmem_cache_bulk(void);
> > > void farmer_tests(void)
> > > @@ -35904,6 +36102,10 @@ void farmer_tests(void)
> > > check_null_expand(&tree);
> > > mtree_destroy(&tree);
> > > + mt_init_flags(&tree, 0);
> > > + check_mt_dup(&tree);
> > > + mtree_destroy(&tree);
> > > +
> > > /* RCU testing */
> > > mt_init_flags(&tree, 0);
> > > check_erase_testset(&tree);
> > > --
> > > 2.20.1
> > >
> > >

2023-07-31 18:51:57

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 06/11] maple_tree: Introduce mas_replace_entry() to directly replace an entry

* Peng Zhang <[email protected]> [230731 08:39]:
>
>
> On 2023/7/27 00:08, Liam R. Howlett wrote:
> > * Peng Zhang <[email protected]> [230726 04:10]:
> > > If mas has located a specific entry, it may be necessary to replace this
> > > entry, so introduce mas_replace_entry() to do this. mas_replace_entry()
> > > will be more efficient than mas_store*() because it doesn't do many
> > > unnecessary checks.
> > >
> > > This function should be inline, but more functions need to be moved to
> > > the header file, so I didn't do it for the time being.
> >
> > I am really nervous having no checks here. I get that this could be
> > used for duplicating the tree more efficiently, but having a function
> > that just swaps a value in is very dangerous - especially since it is
> > decoupled from the tree duplication code.
> I've thought about this, and I feel this is something the user should
> guarantee. If the user is not sure whether it is safe to use, mas_store()
> can be used instead.

Documentation often isn't up to date, and it is even more rarely read.
mas_replace_entry() gives no hint that it requires the mas to be in a
specific state. This is not acceptable.

The description of the function also doesn't say anything about a
requirement of the maple state, just that it replaces an already
existing entry. You have to read the notes to find out that 'mas must
already locate an existing entry'.

> And we should provide this interface
> because it has better performance.

How much better is the performance? There's always a trade off but
without numbers, this is hard to justify.

> >
> > >
> > > Signed-off-by: Peng Zhang <[email protected]>
> > > ---
> > > include/linux/maple_tree.h | 1 +
> > > lib/maple_tree.c | 25 +++++++++++++++++++++++++
> > > 2 files changed, 26 insertions(+)
> > >
> > > diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> > > index 229fe78e4c89..a05e9827d761 100644
> > > --- a/include/linux/maple_tree.h
> > > +++ b/include/linux/maple_tree.h
> > > @@ -462,6 +462,7 @@ struct ma_wr_state {
> > > void *mas_walk(struct ma_state *mas);
> > > void *mas_store(struct ma_state *mas, void *entry);
> > > +void mas_replace_entry(struct ma_state *mas, void *entry);
> > > void *mas_erase(struct ma_state *mas);
> > > int mas_store_gfp(struct ma_state *mas, void *entry, gfp_t gfp);
> > > void mas_store_prealloc(struct ma_state *mas, void *entry);
> > > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > > index efac6761ae37..d58572666a00 100644
> > > --- a/lib/maple_tree.c
> > > +++ b/lib/maple_tree.c
> > > @@ -5600,6 +5600,31 @@ void *mas_store(struct ma_state *mas, void *entry)
> > > }
> > > EXPORT_SYMBOL_GPL(mas_store);
> > > +/**
> > > + * mas_replace_entry() - Replace an entry that already exists in the maple tree
> > > + * @mas: The maple state
> > > + * @entry: The entry to store
> > > + *
> > > + * Please note that mas must already locate an existing entry, and the new entry
> > > + * must not be NULL. If these two points cannot be guaranteed, please use
> > > + * mas_store*() instead, otherwise it will cause an internal error in the maple
> > > + * tree. This function does not need to allocate memory, so it must succeed.
> > > + */
> > > +void mas_replace_entry(struct ma_state *mas, void *entry)
> > > +{
> > > + void __rcu **slots;
> > > +
> > > +#ifdef CONFIG_DEBUG_MAPLE_TREE
> > > + MAS_WARN_ON(mas, !mte_is_leaf(mas->node));
> > > + MAS_WARN_ON(mas, !entry);
> > > + MAS_WARN_ON(mas, mas->offset >= mt_slots[mte_node_type(mas->node)]);
> > > +#endif
> > > +
> > > + slots = ma_slots(mte_to_node(mas->node), mte_node_type(mas->node));
> > > + rcu_assign_pointer(slots[mas->offset], entry);
> > > +}
> > > +EXPORT_SYMBOL_GPL(mas_replace_entry);
> > > +
> > > /**
> > > * mas_store_gfp() - Store a value into the tree.
> > > * @mas: The maple state
> > > --
> > > 2.20.1
> > >

2023-07-31 21:26:09

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 10/11] MAINTAINERS: Add co-maintainer for maple tree

* Peng Zhang <[email protected]> [230731 08:55]:
>
>
> On 2023/7/27 00:39, Liam R. Howlett wrote:
> > * Peng Zhang <[email protected]> [230726 04:10]:
> > > Add myself as co-maintainer for maple tree. I would like to assist
> > > Liam R. Howlett in maintaining maple tree. I will continue to contribute
> > > to the development of maple tree in the future.
> >
> > Sorry, but no.
> >
> > I appreciate the patches, bug fixes, and code review but there is no
> > need for another maintainer for the tree at this time.

> Could I be added as a reviewer here instead? That would make it easier for
> me to review the code and to see the problems reported by the community;
> at the moment I usually don't receive the maple tree issues people report.
> It would have no effect on your maintenance of the tree.

Although you are on the path to becoming a reviewer in the MAINTAINERS
file, that file is not a mailing list and shouldn't be treated as such.

You should receive all maple tree issues reported by the community by
using the mailing lists.

You do not need to have an entry in that file to review patches and to
be added as a reviewer of a particular patch; just keep doing what you
have done and send out R-b responses.

I am happy to get reviews by you and will be sure to pick up any R-b
from you for the commits between revisions. Please also review other
people's patches on the mailing lists.

Thanks,
Liam

> >
> > >
> > > Signed-off-by: Peng Zhang <[email protected]>
> > > ---
> > > MAINTAINERS | 1 +
> > > 1 file changed, 1 insertion(+)
> > >
> > > diff --git a/MAINTAINERS b/MAINTAINERS
> > > index ddc71b815791..8cfedd492509 100644
> > > --- a/MAINTAINERS
> > > +++ b/MAINTAINERS
> > > @@ -12526,6 +12526,7 @@ F: net/mctp/
> > > MAPLE TREE
> > > M: Liam R. Howlett <[email protected]>
> > > +M: Peng Zhang <[email protected]>
> > > L: [email protected]
> > > S: Supported
> > > F: Documentation/core-api/maple_tree.rst
> > > --
> > > 2.20.1
> > >

2023-08-11 18:05:09

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH 03/11] maple_tree: Add some helper functions

* Peng Zhang <[email protected]> [230731 07:45]:
>
>
> On 2023/7/26 23:08, Matthew Wilcox wrote:
> > On Wed, Jul 26, 2023 at 11:02:52AM -0400, Liam R. Howlett wrote:
> > > * Peng Zhang <[email protected]> [230726 04:10]:
> > > > static inline
> > > > -enum maple_type mas_parent_type(struct ma_state *mas, struct maple_enode *enode)
> > > > +enum maple_type ma_parent_type(struct maple_tree *mt, struct maple_node *node)
> > >
> > > I was trying to keep ma_* prefix to mean the first argument is
> > > maple_node and mt_* to mean maple_tree. I wasn't entirely successful
> > > with this and I do see why you want to use ma_, but maybe reverse the
> > > arguments here?
> >
> > I think your first idea is better. Usually we prefer to order the
> > arguments by "containing thing" to "contained thing". So always use
> > (struct address_space *, struct folio *), for example. Or (struct
> > mm_struct *, struct vm_area_struct *).
> There are disagreements here, so how should we decide? In any case, I don't
> know whether the new version will still have this helper.

Please keep the maple tree as the first argument and use
mt_parent_type as the name... if you still need it.