2023-10-16 03:23:47

by Peng Zhang

Subject: [PATCH v5 00/10] Introduce __mt_dup() to improve the performance of fork()

Hi all,

This series introduces __mt_dup() to improve the performance of fork(). During
the duplication process of mmap, all VMAs are traversed and inserted one by one
into the new maple tree, causing the maple tree to be rebalanced multiple times.
Balancing the maple tree is a costly operation. To duplicate VMAs more
efficiently, mtree_dup() and __mt_dup() are introduced for the maple tree. They
can efficiently duplicate a maple tree.
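
As an illustration, here is a minimal usage sketch (illustrative caller
code only, not part of this series; note that the source tree must have
the same attributes as the new, empty tree):

	#include <linux/maple_tree.h>

	/* Duplicate @src (flags 0, matching DEFINE_MTREE()) into @dst. */
	static int dup_example(struct maple_tree *src)
	{
		DEFINE_MTREE(dst);
		int ret;

		/* mtree_dup() takes the locks of both trees internally. */
		ret = mtree_dup(src, &dst, GFP_KERNEL);
		if (ret)
			return ret;	/* -ENOMEM or -EINVAL */

		/* ... use dst ... */
		mtree_destroy(&dst);
		return 0;
	}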

Here are some algorithmic details about {mtree,__mt}_dup(). We perform a DFS
pre-order traversal of all nodes in the source maple tree. During this process,
we fully copy the nodes from the source tree to the new tree. This involves
memory allocation: whenever a non-leaf node is encountered, all of its child
nodes are allocated at once.

Some previous discussions can be found in [1]. For a more detailed
analysis of the algorithm, please refer to the commit logs of patch [3/10]
and patch [10/10].

There is a "spawn" test in byte-unixbench[2], which can be used to test
the performance of fork(). I modified it slightly to make it work with
different numbers of VMAs.

Below are the test results. The first row shows the number of VMAs.
The second and third rows show the number of fork() calls per ten seconds,
corresponding to next-20231006 and this patchset, respectively; the fourth
row shows the relative improvement. The test results were obtained with
CPU binding to avoid scheduler load balancing that could cause unstable
results. There are still some fluctuations in the test results, but they
are consistently better than the original performance.

21 121 221 421 821 1621 3221 6421 12821 25621 51221
112100 76261 54227 34035 20195 11112 6017 3161 1606 802 393
114558 83067 65008 45824 28751 16072 8922 4747 2436 1233 599
2.19% 8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42%

Thanks to Liam for the review.

Changes since v4:
- Change the handling of dup_mmap() failure: handle it in
exit_mmap().
- Update check_forking() and bench_forking().
- Add the corresponding copyright statement.

Peng Zhang (10):
maple_tree: Add mt_free_one() and mt_attr() helpers
maple_tree: Introduce {mtree,mas}_lock_nested()
maple_tree: Introduce interfaces __mt_dup() and mtree_dup()
radix tree test suite: Align kmem_cache_alloc_bulk() with kernel
behavior.
maple_tree: Add test for mtree_dup()
maple_tree: Update the documentation of maple tree
maple_tree: Skip other tests when BENCH is enabled
maple_tree: Update check_forking() and bench_forking()
maple_tree: Preserve the tree attributes when destroying maple tree
fork: Use __mt_dup() to duplicate maple tree in dup_mmap()

Documentation/core-api/maple_tree.rst | 4 +
include/linux/maple_tree.h | 7 +
kernel/fork.c | 39 ++-
lib/maple_tree.c | 304 ++++++++++++++++++++-
lib/test_maple_tree.c | 123 +++++----
mm/memory.c | 7 +-
mm/mmap.c | 9 +-
tools/include/linux/rwsem.h | 4 +
tools/include/linux/spinlock.h | 1 +
tools/testing/radix-tree/linux.c | 45 +++-
tools/testing/radix-tree/maple.c | 363 ++++++++++++++++++++++++++
11 files changed, 815 insertions(+), 91 deletions(-)

--
2.20.1


2023-10-16 03:23:48

by Peng Zhang

Subject: [PATCH v5 03/10] maple_tree: Introduce interfaces __mt_dup() and mtree_dup()

Introduce interfaces __mt_dup() and mtree_dup(), which are used to
duplicate a maple tree. They duplicate a maple tree in Depth-First
Search (DFS) pre-order traversal. memcpy() is used to copy nodes from the
source tree, and new child nodes are allocated for all non-leaf nodes. The
new node is exactly the same as the source node except for the addresses
stored in it. This is faster than traversing all elements in the source
tree and inserting them one by one into the new tree. The time complexity
of these two functions is O(n).

The difference between __mt_dup() and mtree_dup() is that mtree_dup()
handles locks internally.
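
For __mt_dup(), the caller manages the locks. A minimal sketch of the
expected locking order (mirroring what mtree_dup() does internally;
mtree_lock_nested() comes from an earlier patch in this series):

	/* Illustrative only: duplicate @src into @new with caller locking. */
	static int dup_locked_example(struct maple_tree *src,
				      struct maple_tree *new)
	{
		int ret;

		mtree_lock(new);
		mtree_lock_nested(src, SINGLE_DEPTH_NESTING);
		ret = __mt_dup(src, new, GFP_KERNEL);
		mtree_unlock(src);
		mtree_unlock(new);
		return ret;
	}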

Analysis of the average time complexity of this algorithm:

For simplicity, let's assume that the maximum branching factor of all
non-leaf nodes is 16 (in allocation mode, it is 10), and the tree is a
full tree.

Under the given conditions, if there is a maple tree with n elements,
the number of its leaves is n/16. From bottom to top, the number of
nodes in each level is 1/16 of the number of nodes in the level below.
So the total number of nodes in the entire tree is given by the sum of
n/16 + n/16^2 + n/16^3 + ... + 1. This is a geometric series, and it has
log(n) terms with base 16. According to the formula for the sum of a
geometric series, the sum of this series can be calculated as (n-1)/15.
Each node has only one parent node pointer, which can be considered as
an edge. In total, there are (n-1)/15-1 edges.
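
As a quick sanity check of these counts under the same assumptions, take
n = 16^3 = 4096 elements:

	leaves:       4096/16      = 256
	second level: 256/16       = 16
	root:                        1
	total nodes:  256 + 16 + 1 = 273 = (4096-1)/15
	edges:        273 - 1      = 272 = (4096-1)/15 - 1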

This algorithm consists of two operations:

1. Traversing all nodes in DFS order.
2. For each node, making a copy and performing necessary modifications
to create a new node.

For the first part, the DFS traversal visits each edge twice. Let
T(descend) represent the cost of taking one step downwards, and
T(ascend) represent the cost of taking one step upwards. Both of them
are constants (although mas_ascend() contains a loop and so may not
strictly be constant, we ignore that here and treat it as one). So the
time spent on the first part can be represented as
((n-1)/15-1) * (T(ascend) + T(descend)).

For the second part, each node will be copied, and the cost of copying a
node is denoted as T(copy_node). For each non-leaf node, it is necessary
to reallocate all child nodes, and the cost of this operation is denoted
as T(dup_alloc). The behavior behind memory allocation is complex and
not specific to the maple tree operation. Here, we assume that the time
required for a single allocation is constant. Since the size of a node
is fixed, both of these symbols are also constants. We can calculate
that the time spent on the second part is
((n-1)/15) * T(copy_node) + ((n-1)/15 - n/16) * T(dup_alloc).

Adding both parts together, the total time spent by the algorithm can be
represented as:

((n-1)/15) * (T(ascend) + T(descend) + T(copy_node) + T(dup_alloc)) -
n/16 * T(dup_alloc) - (T(ascend) + T(descend))

Let C1 = T(ascend) + T(descend) + T(copy_node) + T(dup_alloc)
Let C2 = T(dup_alloc)
Let C3 = T(ascend) + T(descend)

Finally, the expression can be simplified as:
((16 * C1 - 15 * C2) / (15 * 16)) * n - (C1 / 15 + C3).

This is a linear function, so the average time complexity is O(n).

Signed-off-by: Peng Zhang <[email protected]>
---
include/linux/maple_tree.h | 3 +
lib/maple_tree.c | 290 +++++++++++++++++++++++++++++++++++++
2 files changed, 293 insertions(+)

diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index f91dbc7fe091..a452dd8a1e5c 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -329,6 +329,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
void *entry, gfp_t gfp);
void *mtree_erase(struct maple_tree *mt, unsigned long index);

+int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
+int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
+
void mtree_destroy(struct maple_tree *mt);
void __mt_destroy(struct maple_tree *mt);

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index ca7039633844..6e0ad83f14e3 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -4,6 +4,10 @@
* Copyright (c) 2018-2022 Oracle Corporation
* Authors: Liam R. Howlett <[email protected]>
* Matthew Wilcox <[email protected]>
+ *
+ * Algorithm for duplicating Maple Tree
+ * Copyright (c) 2023 ByteDance
+ * Author: Peng Zhang <[email protected]>
*/

/*
@@ -6475,6 +6479,292 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
}
EXPORT_SYMBOL(mtree_erase);

+/*
+ * mas_dup_free() - Free an incomplete duplication of a tree.
+ * @mas: The maple state of an incomplete tree.
+ *
+ * The parameter @mas->node indicates the node where the allocation failed.
+ * This function frees all nodes starting from @mas->node in the reverse
+ * order of mas_dup_build(). There is no need to hold the source tree lock
+ * at this time.
+ */
+static void mas_dup_free(struct ma_state *mas)
+{
+ struct maple_node *node;
+ enum maple_type type;
+ void __rcu **slots;
+ unsigned char count, i;
+
+ /* Maybe the first node allocation failed. */
+ if (mas_is_none(mas))
+ return;
+
+ while (!mte_is_root(mas->node)) {
+ mas_ascend(mas);
+
+ if (mas->offset) {
+ mas->offset--;
+ do {
+ mas_descend(mas);
+ mas->offset = mas_data_end(mas);
+ } while (!mte_is_leaf(mas->node));
+
+ mas_ascend(mas);
+ }
+
+ node = mte_to_node(mas->node);
+ type = mte_node_type(mas->node);
+ slots = ma_slots(node, type);
+ count = mas_data_end(mas) + 1;
+ for (i = 0; i < count; i++)
+ ((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
+
+ mt_free_bulk(count, slots);
+ }
+
+ node = mte_to_node(mas->node);
+ mt_free_one(node);
+}
+
+/*
+ * mas_copy_node() - Copy a maple node and replace the parent.
+ * @mas: The maple state of source tree.
+ * @new_mas: The maple state of new tree.
+ * @parent: The parent of the new node.
+ *
+ * Copy @mas->node to @new_mas->node, set @parent to be the parent of
+ * @new_mas->node. If memory allocation fails, @mas is set to -ENOMEM.
+ */
+static inline void mas_copy_node(struct ma_state *mas, struct ma_state *new_mas,
+ struct maple_pnode *parent)
+{
+ struct maple_node *node = mte_to_node(mas->node);
+ struct maple_node *new_node = mte_to_node(new_mas->node);
+ unsigned long val;
+
+ /* Copy the node completely. */
+ memcpy(new_node, node, sizeof(struct maple_node));
+
+ /* Update the parent node pointer. */
+ val = (unsigned long)node->parent & MAPLE_NODE_MASK;
+ new_node->parent = ma_parent_ptr(val | (unsigned long)parent);
+}
+
+/*
+ * mas_dup_alloc() - Allocate child nodes for a maple node.
+ * @mas: The maple state of source tree.
+ * @new_mas: The maple state of new tree.
+ * @gfp: The GFP_FLAGS to use for allocations.
+ *
+ * This function allocates child nodes for @new_mas->node during the duplication
+ * process. If memory allocation fails, @mas is set to -ENOMEM.
+ */
+static inline void mas_dup_alloc(struct ma_state *mas, struct ma_state *new_mas,
+ gfp_t gfp)
+{
+ struct maple_node *node = mte_to_node(mas->node);
+ struct maple_node *new_node = mte_to_node(new_mas->node);
+ enum maple_type type;
+ unsigned char request, count, i;
+ void __rcu **slots;
+ void __rcu **new_slots;
+ unsigned long val;
+
+ /* Allocate memory for child nodes. */
+ type = mte_node_type(mas->node);
+ new_slots = ma_slots(new_node, type);
+ request = mas_data_end(mas) + 1;
+ count = mt_alloc_bulk(gfp, request, (void **)new_slots);
+ if (unlikely(count < request)) {
+ if (count)
+ mt_free_bulk(count, new_slots);
+
+ memset(new_slots, 0, request * sizeof(void *));
+ mas_set_err(mas, -ENOMEM);
+ return;
+ }
+
+ /* Restore node type information in slots. */
+ slots = ma_slots(node, type);
+ for (i = 0; i < count; i++) {
+ val = (unsigned long)mt_slot_locked(mas->tree, slots, i);
+ val &= MAPLE_NODE_MASK;
+ ((unsigned long *)new_slots)[i] |= val;
+ }
+}
+
+/*
+ * mas_dup_build() - Build a new maple tree from a source tree
+ * @mas: The maple state of source tree, need to be in MAS_START state.
+ * @new_mas: The maple state of new tree, need to be in MAS_START state.
+ * @gfp: The GFP_FLAGS to use for allocations.
+ *
+ * This function builds a new tree in DFS preorder. If the memory allocation
+ * fails, the error code -ENOMEM will be set in @mas, and @new_mas points to the
+ * last node. mas_dup_free() will free the incomplete duplication of a tree.
+ *
+ * Note that the attributes of the two trees need to be exactly the same, and the
+ * new tree needs to be empty, otherwise -EINVAL will be set in @mas.
+ */
+static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
+ gfp_t gfp)
+{
+ struct maple_node *node;
+ struct maple_pnode *parent = NULL;
+ struct maple_enode *root;
+ enum maple_type type;
+
+ if (unlikely(mt_attr(mas->tree) != mt_attr(new_mas->tree)) ||
+ unlikely(!mtree_empty(new_mas->tree))) {
+ mas_set_err(mas, -EINVAL);
+ return;
+ }
+
+ mas_start(mas);
+ if (mas_is_ptr(mas) || mas_is_none(mas)) {
+ root = mt_root_locked(mas->tree);
+ goto set_new_tree;
+ }
+
+ node = mt_alloc_one(gfp);
+ if (!node) {
+ new_mas->node = MAS_NONE;
+ mas_set_err(mas, -ENOMEM);
+ return;
+ }
+
+ type = mte_node_type(mas->node);
+ root = mt_mk_node(node, type);
+ new_mas->node = root;
+ new_mas->min = 0;
+ new_mas->max = ULONG_MAX;
+ root = mte_mk_root(root);
+
+ while (1) {
+ mas_copy_node(mas, new_mas, parent);
+
+ if (!mte_is_leaf(mas->node)) {
+ /* Only allocate child nodes for non-leaf nodes. */
+ mas_dup_alloc(mas, new_mas, gfp);
+ if (unlikely(mas_is_err(mas)))
+ return;
+ } else {
+ /*
+ * This is the last leaf node and duplication is
+ * completed.
+ */
+ if (mas->max == ULONG_MAX)
+ goto done;
+
+ /* This is not the last leaf node and needs to go up. */
+ do {
+ mas_ascend(mas);
+ mas_ascend(new_mas);
+ } while (mas->offset == mas_data_end(mas));
+
+ /* Move to the next subtree. */
+ mas->offset++;
+ new_mas->offset++;
+ }
+
+ mas_descend(mas);
+ parent = ma_parent_ptr(mte_to_node(new_mas->node));
+ mas_descend(new_mas);
+ mas->offset = 0;
+ new_mas->offset = 0;
+ }
+done:
+ /* Specially handle the parent of the root node. */
+ mte_to_node(root)->parent = ma_parent_ptr(mas_tree_parent(new_mas));
+set_new_tree:
+ /* Make them the same height */
+ new_mas->tree->ma_flags = mas->tree->ma_flags;
+ rcu_assign_pointer(new_mas->tree->ma_root, root);
+}
+
+/**
+ * __mt_dup(): Duplicate an entire maple tree
+ * @mt: The source maple tree
+ * @new: The new maple tree
+ * @gfp: The GFP_FLAGS to use for allocations
+ *
+ * This function duplicates a maple tree in Depth-First Search (DFS) pre-order
+ * traversal. It uses memcpy() to copy nodes in the source tree and allocates
+ * new child nodes in non-leaf nodes. The new node is exactly the same as the
+ * source node except for all the addresses stored in it. It will be faster than
+ * traversing all elements in the source tree and inserting them one by one into
+ * the new tree.
+ * The user needs to ensure that the attributes of the source tree and the new
+ * tree are the same, and the new tree needs to be an empty tree, otherwise
+ * -EINVAL will be returned.
+ * Note that the user needs to manually lock the source tree and the new tree.
+ *
+ * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL if
+ * the attributes of the two trees are different or the new tree is not an
+ * empty tree.
+ */
+int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
+{
+ int ret = 0;
+ MA_STATE(mas, mt, 0, 0);
+ MA_STATE(new_mas, new, 0, 0);
+
+ mas_dup_build(&mas, &new_mas, gfp);
+
+ if (unlikely(mas_is_err(&mas))) {
+ ret = xa_err(mas.node);
+ if (ret == -ENOMEM)
+ mas_dup_free(&new_mas);
+ }
+
+ return ret;
+}
+EXPORT_SYMBOL(__mt_dup);
+
+/**
+ * mtree_dup(): Duplicate an entire maple tree
+ * @mt: The source maple tree
+ * @new: The new maple tree
+ * @gfp: The GFP_FLAGS to use for allocations
+ *
+ * This function duplicates a maple tree in Depth-First Search (DFS) pre-order
+ * traversal. It uses memcpy() to copy nodes in the source tree and allocates
+ * new child nodes in non-leaf nodes. The new node is exactly the same as the
+ * source node except for all the addresses stored in it. It will be faster than
+ * traversing all elements in the source tree and inserting them one by one into
+ * the new tree.
+ * The user needs to ensure that the attributes of the source tree and the new
+ * tree are the same, and the new tree needs to be an empty tree, otherwise
+ * -EINVAL will be returned.
+ *
+ * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL if
+ * the attributes of the two trees are different or the new tree is not an
+ * empty tree.
+ */
+int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
+{
+ int ret = 0;
+ MA_STATE(mas, mt, 0, 0);
+ MA_STATE(new_mas, new, 0, 0);
+
+ mas_lock(&new_mas);
+ mas_lock_nested(&mas, SINGLE_DEPTH_NESTING);
+
+ mas_dup_build(&mas, &new_mas, gfp);
+ mas_unlock(&mas);
+
+ if (unlikely(mas_is_err(&mas))) {
+ ret = xa_err(mas.node);
+ if (ret == -ENOMEM)
+ mas_dup_free(&new_mas);
+ }
+
+ mas_unlock(&new_mas);
+
+ return ret;
+}
+EXPORT_SYMBOL(mtree_dup);
+
/**
* __mt_destroy() - Walk and free all nodes of a locked maple tree.
* @mt: The maple tree
--
2.20.1

2023-10-16 03:24:08

by Peng Zhang

Subject: [PATCH v5 05/10] maple_tree: Add test for mtree_dup()

Add a test for mtree_dup().
It works by duplicating different maple trees and then comparing the two
trees. This includes tests for duplicating full trees and for memory
allocation failures at different nodes.

Signed-off-by: Peng Zhang <[email protected]>
---
tools/testing/radix-tree/maple.c | 361 +++++++++++++++++++++++++++++++
1 file changed, 361 insertions(+)

diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index e5da1cad70ba..12b3390e9591 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -35857,6 +35857,363 @@ static noinline void __init check_locky(struct maple_tree *mt)
mt_clear_in_rcu(mt);
}

+/*
+ * Compares two nodes except for the addresses stored in the nodes.
+ * Returns zero if they are the same, otherwise returns non-zero.
+ */
+static int __init compare_node(struct maple_enode *enode_a,
+ struct maple_enode *enode_b)
+{
+ struct maple_node *node_a, *node_b;
+ struct maple_node a, b;
+ void **slots_a, **slots_b; /* Do not use the rcu tag. */
+ enum maple_type type;
+ int i;
+
+ if (((unsigned long)enode_a & MAPLE_NODE_MASK) !=
+ ((unsigned long)enode_b & MAPLE_NODE_MASK)) {
+ pr_err("The lower 8 bits of enode are different.\n");
+ return -1;
+ }
+
+ type = mte_node_type(enode_a);
+ node_a = mte_to_node(enode_a);
+ node_b = mte_to_node(enode_b);
+ a = *node_a;
+ b = *node_b;
+
+ /* Do not compare addresses. */
+ if (ma_is_root(node_a) || ma_is_root(node_b)) {
+ a.parent = (struct maple_pnode *)((unsigned long)a.parent &
+ MA_ROOT_PARENT);
+ b.parent = (struct maple_pnode *)((unsigned long)b.parent &
+ MA_ROOT_PARENT);
+ } else {
+ a.parent = (struct maple_pnode *)((unsigned long)a.parent &
+ MAPLE_NODE_MASK);
+ b.parent = (struct maple_pnode *)((unsigned long)b.parent &
+ MAPLE_NODE_MASK);
+ }
+
+ if (a.parent != b.parent) {
+ pr_err("The lower 8 bits of parents are different. %p %p\n",
+ a.parent, b.parent);
+ return -1;
+ }
+
+ /*
+ * If it is a leaf node, the slots do not contain the node address, and
+ * no special processing of slots is required.
+ */
+ if (ma_is_leaf(type))
+ goto cmp;
+
+ slots_a = ma_slots(&a, type);
+ slots_b = ma_slots(&b, type);
+
+ for (i = 0; i < mt_slots[type]; i++) {
+ if (!slots_a[i] && !slots_b[i])
+ break;
+
+ if (!slots_a[i] || !slots_b[i]) {
+ pr_err("The number of slots is different.\n");
+ return -1;
+ }
+
+ /* Do not compare addresses in slots. */
+ ((unsigned long *)slots_a)[i] &= MAPLE_NODE_MASK;
+ ((unsigned long *)slots_b)[i] &= MAPLE_NODE_MASK;
+ }
+
+cmp:
+ /*
+ * Compare all contents of two nodes, including parent (except address),
+ * slots (except address), pivots, gaps and metadata.
+ */
+ return memcmp(&a, &b, sizeof(struct maple_node));
+}
+
+/*
+ * Compare two trees and return 0 if they are the same, non-zero otherwise.
+ */
+static int __init compare_tree(struct maple_tree *mt_a, struct maple_tree *mt_b)
+{
+ MA_STATE(mas_a, mt_a, 0, 0);
+ MA_STATE(mas_b, mt_b, 0, 0);
+
+ if (mt_a->ma_flags != mt_b->ma_flags) {
+ pr_err("The flags of the two trees are different.\n");
+ return -1;
+ }
+
+ mas_dfs_preorder(&mas_a);
+ mas_dfs_preorder(&mas_b);
+
+ if (mas_is_ptr(&mas_a) || mas_is_ptr(&mas_b)) {
+ if (!(mas_is_ptr(&mas_a) && mas_is_ptr(&mas_b))) {
+ pr_err("One is MAS_ROOT and the other is not.\n");
+ return -1;
+ }
+ return 0;
+ }
+
+ while (!mas_is_none(&mas_a) || !mas_is_none(&mas_b)) {
+
+ if (mas_is_none(&mas_a) || mas_is_none(&mas_b)) {
+ pr_err("One is MAS_NONE and the other is not.\n");
+ return -1;
+ }
+
+ if (mas_a.min != mas_b.min ||
+ mas_a.max != mas_b.max) {
+ pr_err("mas->min, mas->max do not match.\n");
+ return -1;
+ }
+
+ if (compare_node(mas_a.node, mas_b.node)) {
+ pr_err("The contents of nodes %p and %p are different.\n",
+ mas_a.node, mas_b.node);
+ mt_dump(mt_a, mt_dump_dec);
+ mt_dump(mt_b, mt_dump_dec);
+ return -1;
+ }
+
+ mas_dfs_preorder(&mas_a);
+ mas_dfs_preorder(&mas_b);
+ }
+
+ return 0;
+}
+
+static __init void mas_subtree_max_range(struct ma_state *mas)
+{
+ unsigned long limit = mas->max;
+ MA_STATE(newmas, mas->tree, 0, 0);
+ void *entry;
+
+ mas_for_each(mas, entry, limit) {
+ if (mas->last - mas->index >=
+ newmas.last - newmas.index) {
+ newmas = *mas;
+ }
+ }
+
+ *mas = newmas;
+}
+
+/*
+ * build_full_tree() - Build a full tree.
+ * @mt: The tree to build.
+ * @flags: Use @flags to build the tree.
+ * @height: The height of the tree to build.
+ *
+ * Build a tree with full leaf nodes and internal nodes. Note that the height
+ * should not exceed 3, otherwise it will take a long time to build.
+ * Return: zero if the build is successful, non-zero if it fails.
+ */
+static __init int build_full_tree(struct maple_tree *mt, unsigned int flags,
+ int height)
+{
+ MA_STATE(mas, mt, 0, 0);
+ unsigned long step;
+ int ret = 0, cnt = 1;
+ enum maple_type type;
+
+ mt_init_flags(mt, flags);
+ mtree_insert_range(mt, 0, ULONG_MAX, xa_mk_value(5), GFP_KERNEL);
+
+ mtree_lock(mt);
+
+ while (1) {
+ mas_set(&mas, 0);
+ if (mt_height(mt) < height) {
+ mas.max = ULONG_MAX;
+ goto store;
+ }
+
+ while (1) {
+ mas_dfs_preorder(&mas);
+ if (mas_is_none(&mas))
+ goto unlock;
+
+ type = mte_node_type(mas.node);
+ if (mas_data_end(&mas) + 1 < mt_slots[type]) {
+ mas_set(&mas, mas.min);
+ goto store;
+ }
+ }
+store:
+ mas_subtree_max_range(&mas);
+ step = mas.last - mas.index;
+ if (step < 1) {
+ ret = -1;
+ goto unlock;
+ }
+
+ step /= 2;
+ mas.last = mas.index + step;
+ mas_store_gfp(&mas, xa_mk_value(5),
+ GFP_KERNEL);
+ ++cnt;
+ }
+unlock:
+ mtree_unlock(mt);
+
+ MT_BUG_ON(mt, mt_height(mt) != height);
+ /* pr_info("height:%u number of elements:%d\n", mt_height(mt), cnt); */
+ return ret;
+}
+
+static noinline void __init check_mtree_dup(struct maple_tree *mt)
+{
+ DEFINE_MTREE(new);
+ int i, j, ret, count = 0;
+ unsigned int rand_seed = 17, rand;
+
+ /* store a value at [0, 0] */
+ mt_init_flags(mt, 0);
+ mtree_store_range(mt, 0, 0, xa_mk_value(0), GFP_KERNEL);
+ ret = mtree_dup(mt, &new, GFP_KERNEL);
+ MT_BUG_ON(&new, ret);
+ mt_validate(&new);
+ if (compare_tree(mt, &new))
+ MT_BUG_ON(&new, 1);
+
+ mtree_destroy(mt);
+ mtree_destroy(&new);
+
+ /* The two trees have different attributes. */
+ mt_init_flags(mt, 0);
+ mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
+ ret = mtree_dup(mt, &new, GFP_KERNEL);
+ MT_BUG_ON(&new, ret != -EINVAL);
+ mtree_destroy(mt);
+ mtree_destroy(&new);
+
+ /* The new tree is not empty */
+ mt_init_flags(mt, 0);
+ mt_init_flags(&new, 0);
+ mtree_store(&new, 5, xa_mk_value(5), GFP_KERNEL);
+ ret = mtree_dup(mt, &new, GFP_KERNEL);
+ MT_BUG_ON(&new, ret != -EINVAL);
+ mtree_destroy(mt);
+ mtree_destroy(&new);
+
+ /* Test for duplicating full trees. */
+ for (i = 1; i <= 3; i++) {
+ ret = build_full_tree(mt, 0, i);
+ MT_BUG_ON(mt, ret);
+ mt_init_flags(&new, 0);
+
+ ret = mtree_dup(mt, &new, GFP_KERNEL);
+ MT_BUG_ON(&new, ret);
+ mt_validate(&new);
+ if (compare_tree(mt, &new))
+ MT_BUG_ON(&new, 1);
+
+ mtree_destroy(mt);
+ mtree_destroy(&new);
+ }
+
+ for (i = 1; i <= 3; i++) {
+ ret = build_full_tree(mt, MT_FLAGS_ALLOC_RANGE, i);
+ MT_BUG_ON(mt, ret);
+ mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
+
+ ret = mtree_dup(mt, &new, GFP_KERNEL);
+ MT_BUG_ON(&new, ret);
+ mt_validate(&new);
+ if (compare_tree(mt, &new))
+ MT_BUG_ON(&new, 1);
+
+ mtree_destroy(mt);
+ mtree_destroy(&new);
+ }
+
+ /* Test for normal duplicating. */
+ for (i = 0; i < 1000; i += 3) {
+ if (i & 1) {
+ mt_init_flags(mt, 0);
+ mt_init_flags(&new, 0);
+ } else {
+ mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
+ mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
+ }
+
+ for (j = 0; j < i; j++) {
+ mtree_store_range(mt, j * 10, j * 10 + 5,
+ xa_mk_value(j), GFP_KERNEL);
+ }
+
+ ret = mtree_dup(mt, &new, GFP_KERNEL);
+ MT_BUG_ON(&new, ret);
+ mt_validate(&new);
+ if (compare_tree(mt, &new))
+ MT_BUG_ON(&new, 1);
+
+ mtree_destroy(mt);
+ mtree_destroy(&new);
+ }
+
+ /* Test memory allocation failure. */
+ mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
+ for (i = 0; i < 30; i += 3) {
+ mtree_store_range(mt, i * 10, i * 10 + 5,
+ xa_mk_value(i), GFP_KERNEL);
+ }
+
+ /* Failed at the first node. */
+ mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
+ mt_set_non_kernel(0);
+ ret = mtree_dup(mt, &new, GFP_NOWAIT);
+ mt_set_non_kernel(0);
+ MT_BUG_ON(&new, ret != -ENOMEM);
+ mtree_destroy(mt);
+ mtree_destroy(&new);
+
+ /* Random maple tree fails at a random node. */
+ for (i = 0; i < 1000; i += 3) {
+ if (i & 1) {
+ mt_init_flags(mt, 0);
+ mt_init_flags(&new, 0);
+ } else {
+ mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
+ mt_init_flags(&new, MT_FLAGS_ALLOC_RANGE);
+ }
+
+ for (j = 0; j < i; j++) {
+ mtree_store_range(mt, j * 10, j * 10 + 5,
+ xa_mk_value(j), GFP_KERNEL);
+ }
+ /*
+ * The rand() library function is not used, so we can generate
+ * the same random numbers on any platform.
+ */
+ rand_seed = rand_seed * 1103515245 + 12345;
+ rand = rand_seed / 65536 % 128;
+ mt_set_non_kernel(rand);
+
+ ret = mtree_dup(mt, &new, GFP_NOWAIT);
+ mt_set_non_kernel(0);
+ if (ret != 0) {
+ MT_BUG_ON(&new, ret != -ENOMEM);
+ count++;
+ mtree_destroy(mt);
+ continue;
+ }
+
+ mt_validate(&new);
+ if (compare_tree(mt, &new))
+ MT_BUG_ON(&new, 1);
+
+ mtree_destroy(mt);
+ mtree_destroy(&new);
+ }
+
+ /* pr_info("mtree_dup() fail %d times\n", count); */
+ BUG_ON(!count);
+}
+
extern void test_kmem_cache_bulk(void);

void farmer_tests(void)
@@ -35904,6 +36261,10 @@ void farmer_tests(void)
check_null_expand(&tree);
mtree_destroy(&tree);

+ mt_init_flags(&tree, 0);
+ check_mtree_dup(&tree);
+ mtree_destroy(&tree);
+
/* RCU testing */
mt_init_flags(&tree, 0);
check_erase_testset(&tree);
--
2.20.1

2023-10-16 03:24:12

by Peng Zhang

Subject: [PATCH v5 06/10] maple_tree: Update the documentation of maple tree

Add documentation for the new interface mtree_dup().

Signed-off-by: Peng Zhang <[email protected]>
---
Documentation/core-api/maple_tree.rst | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/Documentation/core-api/maple_tree.rst b/Documentation/core-api/maple_tree.rst
index 45defcf15da7..285e2d2b21ae 100644
--- a/Documentation/core-api/maple_tree.rst
+++ b/Documentation/core-api/maple_tree.rst
@@ -81,6 +81,9 @@ section.
Sometimes it is necessary to ensure the next call to store to a maple tree does
not allocate memory, please see :ref:`maple-tree-advanced-api` for this use case.

+You can use mtree_dup() to duplicate an entire maple tree. It is more
+efficient than inserting all elements one by one into a new tree.
+
Finally, you can remove all entries from a maple tree by calling
mtree_destroy(). If the maple tree entries are pointers, you may wish to free
the entries first.
@@ -112,6 +115,7 @@ Takes ma_lock internally:
* mtree_insert()
* mtree_insert_range()
* mtree_erase()
+ * mtree_dup()
* mtree_destroy()
* mt_set_in_rcu()
* mt_clear_in_rcu()
--
2.20.1

2023-10-16 03:24:24

by Peng Zhang

Subject: [PATCH v5 09/10] maple_tree: Preserve the tree attributes when destroying maple tree

When destroying a maple tree, preserve its attributes and then turn it
into an empty tree. This allows it to be reused without needing to be
reinitialized.

Signed-off-by: Peng Zhang <[email protected]>
---
lib/maple_tree.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 6e0ad83f14e3..9b5e5580b252 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -6779,7 +6779,7 @@ void __mt_destroy(struct maple_tree *mt)
if (xa_is_node(root))
mte_destroy_walk(root, mt);

- mt->ma_flags = 0;
+ mt->ma_flags = mt_attr(mt);
}
EXPORT_SYMBOL_GPL(__mt_destroy);

--
2.20.1

2023-10-16 03:24:48

by Peng Zhang

Subject: [PATCH v5 07/10] maple_tree: Skip other tests when BENCH is enabled

Skip other tests when BENCH is enabled so that performance can be
measured in user space.

Signed-off-by: Peng Zhang <[email protected]>
---
lib/test_maple_tree.c | 8 ++++----
tools/testing/radix-tree/maple.c | 2 ++
2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
index 464eeb90d5ad..de470950714f 100644
--- a/lib/test_maple_tree.c
+++ b/lib/test_maple_tree.c
@@ -3585,10 +3585,6 @@ static int __init maple_tree_seed(void)

pr_info("\nTEST STARTING\n\n");

- mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
- check_root_expand(&tree);
- mtree_destroy(&tree);
-
#if defined(BENCH_SLOT_STORE)
#define BENCH
mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
@@ -3646,6 +3642,10 @@ static int __init maple_tree_seed(void)
goto skip;
#endif

+ mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
+ check_root_expand(&tree);
+ mtree_destroy(&tree);
+
mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
check_iteration(&tree);
mtree_destroy(&tree);
diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c
index 12b3390e9591..cb5358674521 100644
--- a/tools/testing/radix-tree/maple.c
+++ b/tools/testing/radix-tree/maple.c
@@ -36299,7 +36299,9 @@ void farmer_tests(void)

void maple_tree_tests(void)
{
+#if !defined(BENCH)
farmer_tests();
+#endif
maple_tree_seed();
maple_tree_harvest();
}
--
2.20.1

2023-10-16 03:26:23

by Peng Zhang

Subject: [PATCH v5 10/10] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()

In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then
directly replacing the entries of VMAs in the new maple tree can result
in better performance. __mt_dup() uses DFS pre-order to duplicate the
maple tree, so it is efficient.

The average time complexity of __mt_dup() is O(n), where n is the number
of VMAs. The proof of the time complexity is provided in the commit log
that introduces __mt_dup(). After duplicating the maple tree, each element
is traversed and replaced (ignoring the cases of deletion, which are rare).
Since it is only a replacement operation for each element, this process is
also O(n).
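
A condensed sketch of this duplicate-then-replace pattern (names follow
the diff below; dup_one_vma() is a hypothetical stand-in for the VMA copy
logic in dup_mmap()):

	/* Duplicate the whole tree once, O(n), instead of n insertions. */
	retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
	if (retval)
		goto out;

	for_each_vma(vmi, mpnt) {
		struct vm_area_struct *tmp;

		tmp = dup_one_vma(mm, mpnt);	/* hypothetical helper */
		/* In-place replacement: no allocation needed, cannot fail. */
		mas_store(&vmi.mas, tmp);
	}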

Analyzing the exact time complexity of the previous algorithm is
challenging because each insertion can involve appending to a node, pushing
data to adjacent nodes, or even splitting nodes. The frequency of each
action is difficult to calculate. The worst-case scenario for a single
insertion is when the tree undergoes splitting at every level. If we
consider each insertion as the worst-case scenario, we can determine that
the upper bound of the time complexity is O(n*log(n)), although this is a
loose upper bound. However, based on the test data, it appears that the
actual time complexity is likely to be O(n).

As the entire maple tree is duplicated using __mt_dup(), if dup_mmap()
fails, there will be a portion of VMAs that have not been duplicated in
the maple tree. To handle this, we mark the failure point with
XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, stop
releasing VMAs that have not been duplicated after this point.
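
A condensed sketch of the failure-marker protocol (drawn from the checks
added across kernel/fork.c, mm/memory.c and mm/mmap.c in the diff below):

	/* Producer, dup_mmap() failure path: mark where duplication stopped. */
	mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
	mas_store(&vmi.mas, XA_ZERO_ENTRY);

	/* Consumers, e.g. exit_mmap(): treat the marker as the end of tree. */
	vma = mas_find(&mas, ULONG_MAX);
	while (vma && likely(!xa_is_zero(vma))) {
		/* ... release @vma ... */
		vma = mas_find(&mas, ULONG_MAX);
	}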

There is a "spawn" test in byte-unixbench[1], which can be used to test
the performance of fork(). I modified it slightly to make it work with
different numbers of VMAs.

Below are the test results. The first row shows the number of VMAs.
The second and third rows show the number of fork() calls per ten seconds,
corresponding to next-20231006 and this patchset, respectively; the fourth
row shows the relative improvement. The test results were obtained with
CPU binding to avoid scheduler load balancing that could cause unstable
results. There are still some fluctuations in the test results, but they
are consistently better than the original performance.

21 121 221 421 821 1621 3221 6421 12821 25621 51221
112100 76261 54227 34035 20195 11112 6017 3161 1606 802 393
114558 83067 65008 45824 28751 16072 8922 4747 2436 1233 599
2.19% 8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42%

[1] https://github.com/kdlucas/byte-unixbench/tree/master

Signed-off-by: Peng Zhang <[email protected]>
---
kernel/fork.c | 39 ++++++++++++++++++++++++++++-----------
mm/memory.c | 7 ++++++-
mm/mmap.c | 9 ++++++---
3 files changed, 40 insertions(+), 15 deletions(-)

diff --git a/kernel/fork.c b/kernel/fork.c
index 0ff2e0cd4109..0be15501e52e 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
int retval;
unsigned long charge = 0;
LIST_HEAD(uf);
- VMA_ITERATOR(old_vmi, oldmm, 0);
VMA_ITERATOR(vmi, mm, 0);

uprobe_start_dup_mmap();
@@ -678,16 +677,21 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
goto out;
khugepaged_fork(mm, oldmm);

- retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
- if (retval)
+ /* Use __mt_dup() to efficiently build an identical maple tree. */
+ retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
+ if (unlikely(retval))
goto out;

mt_clear_in_rcu(vmi.mas.tree);
- for_each_vma(old_vmi, mpnt) {
+ for_each_vma(vmi, mpnt) {
struct file *file;

vma_start_write(mpnt);
if (mpnt->vm_flags & VM_DONTCOPY) {
+ retval = mas_store_gfp(&vmi.mas, NULL, GFP_KERNEL);
+ if (retval)
+ goto loop_out;
+
vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
continue;
}
@@ -749,9 +753,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
if (is_vm_hugetlb_page(tmp))
hugetlb_dup_vma_private(tmp);

- /* Link the vma into the MT */
- if (vma_iter_bulk_store(&vmi, tmp))
- goto fail_nomem_vmi_store;
+ /*
+ * Link the vma into the MT. After using __mt_dup(), memory
+ * allocation is not necessary here, so it cannot fail.
+ */
+ mas_store(&vmi.mas, tmp);

mm->map_count++;
if (!(tmp->vm_flags & VM_WIPEONFORK))
@@ -760,15 +766,28 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
if (tmp->vm_ops && tmp->vm_ops->open)
tmp->vm_ops->open(tmp);

- if (retval)
+ if (retval) {
+ mpnt = vma_next(&vmi);
goto loop_out;
+ }
}
/* a new mm has just been created */
retval = arch_dup_mmap(oldmm, mm);
loop_out:
vma_iter_free(&vmi);
- if (!retval)
+ if (!retval) {
mt_set_in_rcu(vmi.mas.tree);
+ } else if (mpnt) {
+ /*
+ * The entire maple tree has already been duplicated. If the
+ * mmap duplication fails, mark the failure point with
+ * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered,
+ * stop releasing VMAs that have not been duplicated after this
+ * point.
+ */
+ mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
+ mas_store(&vmi.mas, XA_ZERO_ENTRY);
+ }
out:
mmap_write_unlock(mm);
flush_tlb_mm(oldmm);
@@ -778,8 +797,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
uprobe_end_dup_mmap();
return retval;

-fail_nomem_vmi_store:
- unlink_anon_vmas(tmp);
fail_nomem_anon_vma_fork:
mpol_put(vma_policy(tmp));
fail_nomem_policy:
diff --git a/mm/memory.c b/mm/memory.c
index b320af6466cc..ea48bd4b1feb 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -374,6 +374,8 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
* be 0. This will underflow and is okay.
*/
next = mas_find(mas, ceiling - 1);
+ if (unlikely(xa_is_zero(next)))
+ next = NULL;

/*
* Hide vma from rmap and truncate_pagecache before freeing
@@ -395,6 +397,8 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
&& !is_vm_hugetlb_page(next)) {
vma = next;
next = mas_find(mas, ceiling - 1);
+ if (unlikely(xa_is_zero(next)))
+ next = NULL;
if (mm_wr_locked)
vma_start_write(vma);
unlink_anon_vmas(vma);
@@ -1743,7 +1747,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
unmap_single_vma(tlb, vma, start, end, &details,
mm_wr_locked);
hugetlb_zap_end(vma, &details);
- } while ((vma = mas_find(mas, tree_end - 1)) != NULL);
+ vma = mas_find(mas, tree_end - 1);
+ } while (vma && likely(!xa_is_zero(vma)));
mmu_notifier_invalidate_range_end(&range);
}

diff --git a/mm/mmap.c b/mm/mmap.c
index 1855a2d84200..12ce17863e62 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3213,10 +3213,11 @@ void exit_mmap(struct mm_struct *mm)
arch_exit_mmap(mm);

vma = mas_find(&mas, ULONG_MAX);
- if (!vma) {
+ if (!vma || unlikely(xa_is_zero(vma))) {
/* Can happen if dup_mmap() received an OOM */
mmap_read_unlock(mm);
- return;
+ mmap_write_lock(mm);
+ goto destroy;
}

lru_add_drain();
@@ -3251,11 +3252,13 @@ void exit_mmap(struct mm_struct *mm)
remove_vma(vma, true);
count++;
cond_resched();
- } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
+ vma = mas_find(&mas, ULONG_MAX);
+ } while (vma && likely(!xa_is_zero(vma)));

BUG_ON(count != mm->map_count);

trace_exit_mmap(mm);
+destroy:
__mt_destroy(&mm->mm_mt);
mmap_write_unlock(mm);
vm_unacct_memory(nr_accounted);
--
2.20.1

2023-10-16 03:26:31

by Peng Zhang

Subject: [PATCH v5 08/10] maple_tree: Update check_forking() and bench_forking()

Update check_forking() and bench_forking() to use __mt_dup() to
duplicate the maple tree.

Signed-off-by: Peng Zhang <[email protected]>
---
lib/test_maple_tree.c | 117 ++++++++++++++++++------------------
tools/include/linux/rwsem.h | 4 ++
2 files changed, 62 insertions(+), 59 deletions(-)

diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
index de470950714f..3e4597fb49d3 100644
--- a/lib/test_maple_tree.c
+++ b/lib/test_maple_tree.c
@@ -1834,47 +1834,48 @@ static noinline void __init bench_mas_prev(struct maple_tree *mt)
}
#endif
/* check_forking - simulate the kernel forking sequence with the tree. */
-static noinline void __init check_forking(struct maple_tree *mt)
+static noinline void __init check_forking(void)
{
-
- struct maple_tree newmt;
- int i, nr_entries = 134;
+ struct maple_tree mt, newmt;
+ int i, nr_entries = 134, ret;
void *val;
- MA_STATE(mas, mt, 0, 0);
- MA_STATE(newmas, mt, 0, 0);
- struct rw_semaphore newmt_lock;
+ MA_STATE(mas, &mt, 0, 0);
+ MA_STATE(newmas, &newmt, 0, 0);
+ struct rw_semaphore mt_lock, newmt_lock;

+ init_rwsem(&mt_lock);
init_rwsem(&newmt_lock);

- for (i = 0; i <= nr_entries; i++)
- mtree_store_range(mt, i*10, i*10 + 5,
- xa_mk_value(i), GFP_KERNEL);
+ mt_init_flags(&mt, MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN);
+ mt_set_external_lock(&mt, &mt_lock);

- mt_set_non_kernel(99999);
mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN);
mt_set_external_lock(&newmt, &newmt_lock);
- newmas.tree = &newmt;
- mas_reset(&newmas);
- mas_reset(&mas);
- down_write(&newmt_lock);
- mas.index = 0;
- mas.last = 0;
- if (mas_expected_entries(&newmas, nr_entries)) {
+
+ down_write(&mt_lock);
+ for (i = 0; i <= nr_entries; i++) {
+ mas_set_range(&mas, i*10, i*10 + 5);
+ mas_store_gfp(&mas, xa_mk_value(i), GFP_KERNEL);
+ }
+
+ down_write_nested(&newmt_lock, SINGLE_DEPTH_NESTING);
+ ret = __mt_dup(&mt, &newmt, GFP_KERNEL);
+ if (ret) {
pr_err("OOM!");
BUG_ON(1);
}
- rcu_read_lock();
- mas_for_each(&mas, val, ULONG_MAX) {
- newmas.index = mas.index;
- newmas.last = mas.last;
+
+ mas_set(&newmas, 0);
+ mas_for_each(&newmas, val, ULONG_MAX)
mas_store(&newmas, val);
- }
- rcu_read_unlock();
+
mas_destroy(&newmas);
+ mas_destroy(&mas);
mt_validate(&newmt);
- mt_set_non_kernel(0);
__mt_destroy(&newmt);
+ __mt_destroy(&mt);
up_write(&newmt_lock);
+ up_write(&mt_lock);
}

static noinline void __init check_iteration(struct maple_tree *mt)
@@ -1977,49 +1978,51 @@ static noinline void __init check_mas_store_gfp(struct maple_tree *mt)
}

#if defined(BENCH_FORK)
-static noinline void __init bench_forking(struct maple_tree *mt)
+static noinline void __init bench_forking(void)
{
-
- struct maple_tree newmt;
- int i, nr_entries = 134, nr_fork = 80000;
+ struct maple_tree mt, newmt;
+ int i, nr_entries = 134, nr_fork = 80000, ret;
void *val;
- MA_STATE(mas, mt, 0, 0);
- MA_STATE(newmas, mt, 0, 0);
- struct rw_semaphore newmt_lock;
+ MA_STATE(mas, &mt, 0, 0);
+ MA_STATE(newmas, &newmt, 0, 0);
+ struct rw_semaphore mt_lock, newmt_lock;

+ init_rwsem(&mt_lock);
init_rwsem(&newmt_lock);
- mt_set_external_lock(&newmt, &newmt_lock);

- for (i = 0; i <= nr_entries; i++)
- mtree_store_range(mt, i*10, i*10 + 5,
- xa_mk_value(i), GFP_KERNEL);
+ mt_init_flags(&mt, MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN);
+ mt_set_external_lock(&mt, &mt_lock);
+
+ down_write(&mt_lock);
+ for (i = 0; i <= nr_entries; i++) {
+ mas_set_range(&mas, i*10, i*10 + 5);
+ mas_store_gfp(&mas, xa_mk_value(i), GFP_KERNEL);
+ }

for (i = 0; i < nr_fork; i++) {
- mt_set_non_kernel(99999);
- mt_init_flags(&newmt, MT_FLAGS_ALLOC_RANGE);
- newmas.tree = &newmt;
- mas_reset(&newmas);
- mas_reset(&mas);
- mas.index = 0;
- mas.last = 0;
- rcu_read_lock();
- down_write(&newmt_lock);
- if (mas_expected_entries(&newmas, nr_entries)) {
- printk("OOM!");
+ mt_init_flags(&newmt,
+ MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN);
+ mt_set_external_lock(&newmt, &newmt_lock);
+
+ down_write_nested(&newmt_lock, SINGLE_DEPTH_NESTING);
+ ret = __mt_dup(&mt, &newmt, GFP_KERNEL);
+ if (ret) {
+ pr_err("OOM!");
BUG_ON(1);
}
- mas_for_each(&mas, val, ULONG_MAX) {
- newmas.index = mas.index;
- newmas.last = mas.last;
+
+ mas_set(&newmas, 0);
+ mas_for_each(&newmas, val, ULONG_MAX)
mas_store(&newmas, val);
- }
+
mas_destroy(&newmas);
- rcu_read_unlock();
mt_validate(&newmt);
- mt_set_non_kernel(0);
__mt_destroy(&newmt);
up_write(&newmt_lock);
}
+ mas_destroy(&mas);
+ __mt_destroy(&mt);
+ up_write(&mt_lock);
}
#endif

@@ -3615,9 +3618,7 @@ static int __init maple_tree_seed(void)
#endif
#if defined(BENCH_FORK)
#define BENCH
- mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
- bench_forking(&tree);
- mtree_destroy(&tree);
+ bench_forking();
goto skip;
#endif
#if defined(BENCH_MT_FOR_EACH)
@@ -3650,9 +3651,7 @@ static int __init maple_tree_seed(void)
check_iteration(&tree);
mtree_destroy(&tree);

- mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
- check_forking(&tree);
- mtree_destroy(&tree);
+ check_forking();

mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
check_mas_store_gfp(&tree);
diff --git a/tools/include/linux/rwsem.h b/tools/include/linux/rwsem.h
index 83971b3cbfce..f8bffd4a987c 100644
--- a/tools/include/linux/rwsem.h
+++ b/tools/include/linux/rwsem.h
@@ -37,4 +37,8 @@ static inline int up_write(struct rw_semaphore *sem)
{
return pthread_rwlock_unlock(&sem->lock);
}
+
+#define down_read_nested(sem, subclass) down_read(sem)
+#define down_write_nested(sem, subclass) down_write(sem)
+
#endif /* _TOOLS_RWSEM_H */
--
2.20.1

2023-10-16 03:41:19

by Peng Zhang

Subject: Re: [PATCH v5 00/10] Introduce __mt_dup() to improve the performance of fork()



On 2023/10/16 11:22, Peng Zhang wrote:
> Hi all,
>
> This series introduces __mt_dup() to improve the performance of fork(). During
> the duplication process of mmap, all VMAs are traversed and inserted one by one
> into the new maple tree, causing the maple tree to be rebalanced multiple times.
> Balancing the maple tree is a costly operation. To duplicate VMAs more
> efficiently, mtree_dup() and __mt_dup() are introduced for the maple tree. They
> can efficiently duplicate a maple tree.
>
> Here are some algorithmic details about {mtree,__mt}_dup(). We perform a DFS
> pre-order traversal of all nodes in the source maple tree. During this process,
> we fully copy the nodes from the source tree to the new tree. This involves
> memory allocation, and when encountering a new node, if it is a non-leaf node,
> all its child nodes are allocated at once.
>
> Some previous discussions can be referred to as [1]. For a more detailed
> analysis of the algorithm, please refer to the logs for patch [3/10] and patch
> [10/10]
>
> There is a "spawn" in byte-unixbench[2], which can be used to test the
> performance of fork(). I modified it slightly to make it work with
> different number of VMAs.
>
> Below are the test results. The first row shows the number of VMAs.
> The second and third rows show the number of fork() calls per ten seconds,
> corresponding to next-20231006 and the this patchset, respectively. The
> test results were obtained with CPU binding to avoid scheduler load
> balancing that could cause unstable results. There are still some
> fluctuations in the test results, but at least they are better than the
> original performance.
>
> 21 121 221 421 821 1621 3221 6421 12821 25621 51221
> 112100 76261 54227 34035 20195 11112 6017 3161 1606 802 393
> 114558 83067 65008 45824 28751 16072 8922 4747 2436 1233 599
> 2.19% 8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42%
>
> Thanks for Liam's review.
>
> Changes since v4:
> - Change the handling method for the failure of dup_mmap(). Handle it in
> exit_mmap().
> - Update check_forking() and bench_forking().
> - Add the corresponding copyright statement.
>
I apologize for forgetting to include all the links while editing the cover letter.
Here they are:
[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://github.com/kdlucas/byte-unixbench/tree/master

v1: https://lore.kernel.org/lkml/[email protected]/
v2: https://lore.kernel.org/lkml/[email protected]/
v3: https://lore.kernel.org/lkml/[email protected]/
v4: https://lore.kernel.org/lkml/[email protected]/

> Peng Zhang (10):
> maple_tree: Add mt_free_one() and mt_attr() helpers
> maple_tree: Introduce {mtree,mas}_lock_nested()
> maple_tree: Introduce interfaces __mt_dup() and mtree_dup()
> radix tree test suite: Align kmem_cache_alloc_bulk() with kernel
> behavior.
> maple_tree: Add test for mtree_dup()
> maple_tree: Update the documentation of maple tree
> maple_tree: Skip other tests when BENCH is enabled
> maple_tree: Update check_forking() and bench_forking()
> maple_tree: Preserve the tree attributes when destroying maple tree
> fork: Use __mt_dup() to duplicate maple tree in dup_mmap()
>
> Documentation/core-api/maple_tree.rst | 4 +
> include/linux/maple_tree.h | 7 +
> kernel/fork.c | 39 ++-
> lib/maple_tree.c | 304 ++++++++++++++++++++-
> lib/test_maple_tree.c | 123 +++++----
> mm/memory.c | 7 +-
> mm/mmap.c | 9 +-
> tools/include/linux/rwsem.h | 4 +
> tools/include/linux/spinlock.h | 1 +
> tools/testing/radix-tree/linux.c | 45 +++-
> tools/testing/radix-tree/maple.c | 363 ++++++++++++++++++++++++++
> 11 files changed, 815 insertions(+), 91 deletions(-)
>

2023-10-16 14:11:27

by Matthew Wilcox

Subject: Re: [PATCH v5 03/10] maple_tree: Introduce interfaces __mt_dup() and mtree_dup()

On Mon, Oct 16, 2023 at 11:22:19AM +0800, Peng Zhang wrote:
> +++ b/lib/maple_tree.c
> @@ -4,6 +4,10 @@
> * Copyright (c) 2018-2022 Oracle Corporation
> * Authors: Liam R. Howlett <[email protected]>
> * Matthew Wilcox <[email protected]>
> + *
> + * Algorithm for duplicating Maple Tree
> + * Copyright (c) 2023 ByteDance
> + * Author: Peng Zhang <[email protected]>

You can't copyright an algorithm. You can copyright the
_implementation_ of an algorithm. You have a significant chunk of code
in this file, so adding your copyright is reasonable (although not
legally required, AIUI). Just leave out this line:

+ * Algorithm for duplicating Maple Tree

> +/**
> + * __mt_dup(): Duplicate an entire maple tree
> + * @mt: The source maple tree
> + * @new: The new maple tree
> + * @gfp: The GFP_FLAGS to use for allocations
> + *
> + * This function duplicates a maple tree in Depth-First Search (DFS) pre-order
> + * traversal. It uses memcopy() to copy nodes in the source tree and allocate

memcpy()?

> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> +{
> + int ret = 0;
> + MA_STATE(mas, mt, 0, 0);
> + MA_STATE(new_mas, new, 0, 0);
> +
> + mas_dup_build(&mas, &new_mas, gfp);
> +
> + if (unlikely(mas_is_err(&mas))) {
> + ret = xa_err(mas.node);
> + if (ret == -ENOMEM)
> + mas_dup_free(&new_mas);
> + }
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(__mt_dup);

Why does it need to be exported?

2023-10-17 02:45:36

by Peng Zhang

Subject: Re: [PATCH v5 03/10] maple_tree: Introduce interfaces __mt_dup() and mtree_dup()



On 2023/10/16 22:10, Matthew Wilcox wrote:
> On Mon, Oct 16, 2023 at 11:22:19AM +0800, Peng Zhang wrote:
>> +++ b/lib/maple_tree.c
>> @@ -4,6 +4,10 @@
>> * Copyright (c) 2018-2022 Oracle Corporation
>> * Authors: Liam R. Howlett <[email protected]>
>> * Matthew Wilcox <[email protected]>
>> + *
>> + * Algorithm for duplicating Maple Tree
>> + * Copyright (c) 2023 ByteDance
>> + * Author: Peng Zhang <[email protected]>
>
> You can't copyright an algorithm. You can copyright the
> _implementation_ of an algorithm. You have a significant chunk of code
> in this file, so adding your copyright is reasonable (although not
> legally required, AIUI). Just leave out this line:
>
> + * Algorithm for duplicating Maple Tree
Okay, I will make the correction, thank you.
>
>> +/**
>> + * __mt_dup(): Duplicate an entire maple tree
>> + * @mt: The source maple tree
>> + * @new: The new maple tree
>> + * @gfp: The GFP_FLAGS to use for allocations
>> + *
>> + * This function duplicates a maple tree in Depth-First Search (DFS) pre-order
>> + * traversal. It uses memcopy() to copy nodes in the source tree and allocate
>
> memcpy()?
Yes, thank you for pointing this out.
>
>> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
>> +{
>> + int ret = 0;
>> + MA_STATE(mas, mt, 0, 0);
>> + MA_STATE(new_mas, new, 0, 0);
>> +
>> + mas_dup_build(&mas, &new_mas, gfp);
>> +
>> + if (unlikely(mas_is_err(&mas))) {
>> + ret = xa_err(mas.node);
>> + if (ret == -ENOMEM)
>> + mas_dup_free(&new_mas);
>> + }
>> +
>> + return ret;
>> +}
>> +EXPORT_SYMBOL(__mt_dup);
>
> Why does it need to be exported?
I consider __mt_dup() a general interface for the maple tree; it may be
used by some modules in the future.
>
>

2023-10-17 13:51:30

by Liam R. Howlett

Subject: Re: [PATCH v5 10/10] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()

* Peng Zhang <[email protected]> [231015 23:23]:
> In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then
> directly replacing the entries of VMAs in the new maple tree can result
> in better performance. __mt_dup() uses DFS pre-order to duplicate the
> maple tree, so it is efficient.
>
> The average time complexity of __mt_dup() is O(n), where n is the number
> of VMAs. The proof of the time complexity is provided in the commit log
> that introduces __mt_dup(). After duplicating the maple tree, each element
> is traversed and replaced (ignoring the cases of deletion, which are rare).
> Since it is only a replacement operation for each element, this process is
> also O(n).
>
> Analyzing the exact time complexity of the previous algorithm is
> challenging because each insertion can involve appending to a node, pushing
> data to adjacent nodes, or even splitting nodes. The frequency of each
> action is difficult to calculate. The worst-case scenario for a single
> insertion is when the tree undergoes splitting at every level. If we
> consider each insertion as the worst-case scenario, we can determine that
> the upper bound of the time complexity is O(n*log(n)), although this is a
> loose upper bound. However, based on the test data, it appears that the
> actual time complexity is likely to be O(n).
>
> As the entire maple tree is duplicated using __mt_dup(), if dup_mmap()
> fails, there will be a portion of VMAs that have not been duplicated in
> the maple tree. To handle this, we mark the failure point with
> XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, stop
> releasing VMAs that have not been duplicated after this point.
>
> There is a "spawn" in byte-unixbench[1], which can be used to test the
> performance of fork(). I modified it slightly to make it work with
> different number of VMAs.
>
> Below are the test results. The first row shows the number of VMAs.
> The second and third rows show the number of fork() calls per ten seconds,
> corresponding to next-20231006 and the this patchset, respectively. The
> test results were obtained with CPU binding to avoid scheduler load
> balancing that could cause unstable results. There are still some
> fluctuations in the test results, but at least they are better than the
> original performance.
>
> 21 121 221 421 821 1621 3221 6421 12821 25621 51221
> 112100 76261 54227 34035 20195 11112 6017 3161 1606 802 393
> 114558 83067 65008 45824 28751 16072 8922 4747 2436 1233 599
> 2.19% 8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42%
>
> [1] https://github.com/kdlucas/byte-unixbench/tree/master
>
> Signed-off-by: Peng Zhang <[email protected]>
> ---
> kernel/fork.c | 39 ++++++++++++++++++++++++++++-----------
> mm/memory.c | 7 ++++++-
> mm/mmap.c | 9 ++++++---
> 3 files changed, 40 insertions(+), 15 deletions(-)
>
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 0ff2e0cd4109..0be15501e52e 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> int retval;
> unsigned long charge = 0;
> LIST_HEAD(uf);
> - VMA_ITERATOR(old_vmi, oldmm, 0);
> VMA_ITERATOR(vmi, mm, 0);
>
> uprobe_start_dup_mmap();
> @@ -678,16 +677,21 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> goto out;
> khugepaged_fork(mm, oldmm);
>
> - retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
> - if (retval)
> + /* Use __mt_dup() to efficiently build an identical maple tree. */
> + retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
> + if (unlikely(retval))
> goto out;
>
> mt_clear_in_rcu(vmi.mas.tree);
> - for_each_vma(old_vmi, mpnt) {
> + for_each_vma(vmi, mpnt) {
> struct file *file;
>
> vma_start_write(mpnt);
> if (mpnt->vm_flags & VM_DONTCOPY) {
> + retval = mas_store_gfp(&vmi.mas, NULL, GFP_KERNEL);

vma_iter_clear_gfp() exists, but needs to be relocated from internal.h
to mm.h to be used here.

> + if (retval)
> + goto loop_out;
> +
> vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt));
> continue;
> }
> @@ -749,9 +753,11 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> if (is_vm_hugetlb_page(tmp))
> hugetlb_dup_vma_private(tmp);
>
> - /* Link the vma into the MT */
> - if (vma_iter_bulk_store(&vmi, tmp))
> - goto fail_nomem_vmi_store;
> + /*
> + * Link the vma into the MT. After using __mt_dup(), memory
> + * allocation is not necessary here, so it cannot fail.
> + */

Allocations didn't happen here with the bulk store either, and the
vma_iter_bulk_store() does a mas_store() - see include/linux/mm.h

The vma iterator is used when possible for type safety.

> + mas_store(&vmi.mas, tmp);
>
> mm->map_count++;
> if (!(tmp->vm_flags & VM_WIPEONFORK))
> @@ -760,15 +766,28 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> if (tmp->vm_ops && tmp->vm_ops->open)
> tmp->vm_ops->open(tmp);
>
> - if (retval)
> + if (retval) {
> + mpnt = vma_next(&vmi);
> goto loop_out;
> + }
> }
> /* a new mm has just been created */
> retval = arch_dup_mmap(oldmm, mm);
> loop_out:
> vma_iter_free(&vmi);
> - if (!retval)
> + if (!retval) {
> mt_set_in_rcu(vmi.mas.tree);
> + } else if (mpnt) {
> + /*
> + * The entire maple tree has already been duplicated. If the
> + * mmap duplication fails, mark the failure point with
> + * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered,
> + * stop releasing VMAs that have not been duplicated after this
> + * point.
> + */
> + mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
> + mas_store(&vmi.mas, XA_ZERO_ENTRY);

vma_iter_clear() exists but uses the preallocation call. I really don't
want mas_ calls where unnecessary, but this seems like a special case.
We have a vma iterator here so it's messy.

> + }
> out:
> mmap_write_unlock(mm);
> flush_tlb_mm(oldmm);
> @@ -778,8 +797,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
> uprobe_end_dup_mmap();
> return retval;
>
> -fail_nomem_vmi_store:
> - unlink_anon_vmas(tmp);
> fail_nomem_anon_vma_fork:
> mpol_put(vma_policy(tmp));
> fail_nomem_policy:
> diff --git a/mm/memory.c b/mm/memory.c
> index b320af6466cc..ea48bd4b1feb 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -374,6 +374,8 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> * be 0. This will underflow and is okay.
> */
> next = mas_find(mas, ceiling - 1);
> + if (unlikely(xa_is_zero(next)))
> + next = NULL;
>
> /*
> * Hide vma from rmap and truncate_pagecache before freeing
> @@ -395,6 +397,8 @@ void free_pgtables(struct mmu_gather *tlb, struct ma_state *mas,
> && !is_vm_hugetlb_page(next)) {
> vma = next;
> next = mas_find(mas, ceiling - 1);
> + if (unlikely(xa_is_zero(next)))
> + next = NULL;
> if (mm_wr_locked)
> vma_start_write(vma);
> unlink_anon_vmas(vma);
> @@ -1743,7 +1747,8 @@ void unmap_vmas(struct mmu_gather *tlb, struct ma_state *mas,
> unmap_single_vma(tlb, vma, start, end, &details,
> mm_wr_locked);
> hugetlb_zap_end(vma, &details);
> - } while ((vma = mas_find(mas, tree_end - 1)) != NULL);
> + vma = mas_find(mas, tree_end - 1);
> + } while (vma && likely(!xa_is_zero(vma)));
> mmu_notifier_invalidate_range_end(&range);
> }
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 1855a2d84200..12ce17863e62 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3213,10 +3213,11 @@ void exit_mmap(struct mm_struct *mm)
> arch_exit_mmap(mm);
>
> vma = mas_find(&mas, ULONG_MAX);
> - if (!vma) {
> + if (!vma || unlikely(xa_is_zero(vma))) {
> /* Can happen if dup_mmap() received an OOM */
> mmap_read_unlock(mm);
> - return;
> + mmap_write_lock(mm);
> + goto destroy;
> }
>
> lru_add_drain();
> @@ -3251,11 +3252,13 @@ void exit_mmap(struct mm_struct *mm)
> remove_vma(vma, true);
> count++;
> cond_resched();
> - } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL);
> + vma = mas_find(&mas, ULONG_MAX);
> + } while (vma && likely(!xa_is_zero(vma)));
>
> BUG_ON(count != mm->map_count);
>
> trace_exit_mmap(mm);
> +destroy:
> __mt_destroy(&mm->mm_mt);
> mmap_write_unlock(mm);
> vm_unacct_memory(nr_accounted);
> --
> 2.20.1
>

2023-10-17 13:58:36

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH v5 03/10] maple_tree: Introduce interfaces __mt_dup() and mtree_dup()

* Peng Zhang <[email protected]> [231015 23:23]:
> Introduce interfaces __mt_dup() and mtree_dup(), which are used to
> duplicate a maple tree. They duplicate a maple tree in Depth-First
> Search (DFS) pre-order traversal. They use memcpy() to copy nodes in the
> source tree and allocate new child nodes in non-leaf nodes. The new node
> is exactly the same as the source node except for all the addresses
> stored in it. It will be faster than traversing all elements in the
> source tree and inserting them one by one into the new tree. The time
> complexity of these two functions is O(n).
>
> The difference between __mt_dup() and mtree_dup() is that mtree_dup()
> handles locks internally.
>
> Analysis of the average time complexity of this algorithm:
>
> For simplicity, let's assume that the maximum branching factor of all
> non-leaf nodes is 16 (in allocation mode, it is 10), and the tree is a
> full tree.
>
> Under the given conditions, if there is a maple tree with n elements,
> the number of its leaves is n/16. From bottom to top, the number of
> nodes in each level is 1/16 of the number of nodes in the level below.
> So the total number of nodes in the entire tree is given by the sum of
> n/16 + n/16^2 + n/16^3 + ... + 1. This is a geometric series, and it has
> log(n) terms with base 16. According to the formula for the sum of a
> geometric series, the sum of this series can be calculated as (n-1)/15.
> Each node has only one parent node pointer, which can be considered as
> an edge. In total, there are (n-1)/15-1 edges.
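
As a quick consistency check on the node count (same full-tree,
branching-factor-16 assumption), in LaTeX form:

	\sum_{k=1}^{m} \frac{n}{16^k}
	    = \frac{n \left(1 - 16^{-m}\right)}{15},
	\qquad m = \log_{16} n,\ 16^{-m} = \frac{1}{n}
	\;\Longrightarrow\; \frac{n-1}{15}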
>
> This algorithm consists of two operations:
>
> 1. Traversing all nodes in DFS order.
> 2. For each node, making a copy and performing necessary modifications
> to create a new node.
>
> For the first part, DFS traversal will visit each edge twice. Let
> T(ascend) represent the cost of taking one step upwards, and
> T(descend) represent the cost of taking one step downwards. Both of
> them are constants (although mas_ascend() contains a loop and may not
> strictly be, we ignore that here and treat it as one). So the time
> spent on the first part can be represented as
> ((n-1)/15-1) * (T(ascend) + T(descend)).
>
> For the second part, each node will be copied, and the cost of copying a
> node is denoted as T(copy_node). For each non-leaf node, it is necessary
> to reallocate all child nodes, and the cost of this operation is denoted
> as T(dup_alloc). The behavior behind memory allocation is complex and
> not specific to the maple tree operation. Here, we assume that the time
> required for a single allocation is constant. Since the size of a node
> is fixed, both of these symbols are also constants. We can calculate
> that the time spent on the second part is
> ((n-1)/15) * T(copy_node) + ((n-1)/15 - n/16) * T(dup_alloc).
>
> Adding both parts together, the total time spent by the algorithm can be
> represented as:
>
> ((n-1)/15) * (T(ascend) + T(descend) + T(copy_node) + T(dup_alloc)) -
> n/16 * T(dup_alloc) - (T(ascend) + T(descend))
>
> Let C1 = T(ascend) + T(descend) + T(copy_node) + T(dup_alloc)
> Let C2 = T(dup_alloc)
> Let C3 = T(ascend) + T(descend)
>
> Finally, the expression can be simplified as:
> ((16 * C1 - 15 * C2) / (15 * 16)) * n - (C1 / 15 + C3).
>
> This is a linear function, so the average time complexity is O(n).
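
Expanding the two fractions confirms the simplified linear form quoted
above:

	\frac{n-1}{15} C_1 - \frac{n}{16} C_2 - C_3
	    = \left(\frac{C_1}{15} - \frac{C_2}{16}\right) n
	      - \left(\frac{C_1}{15} + C_3\right)
	    = \frac{16 C_1 - 15 C_2}{15 \cdot 16}\, n
	      - \left(\frac{C_1}{15} + C_3\right)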
>
> Signed-off-by: Peng Zhang <[email protected]>
> ---
> include/linux/maple_tree.h | 3 +
> lib/maple_tree.c | 290 +++++++++++++++++++++++++++++++++++++
> 2 files changed, 293 insertions(+)
>
> diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
> index f91dbc7fe091..a452dd8a1e5c 100644
> --- a/include/linux/maple_tree.h
> +++ b/include/linux/maple_tree.h
> @@ -329,6 +329,9 @@ int mtree_store(struct maple_tree *mt, unsigned long index,
> void *entry, gfp_t gfp);
> void *mtree_erase(struct maple_tree *mt, unsigned long index);
>
> +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp);
> +
> void mtree_destroy(struct maple_tree *mt);
> void __mt_destroy(struct maple_tree *mt);
>
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index ca7039633844..6e0ad83f14e3 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -4,6 +4,10 @@
> * Copyright (c) 2018-2022 Oracle Corporation
> * Authors: Liam R. Howlett <[email protected]>
> * Matthew Wilcox <[email protected]>
> + *
> + * Algorithm for duplicating Maple Tree
> + * Copyright (c) 2023 ByteDance
> + * Author: Peng Zhang <[email protected]>
> */
>
> /*
> @@ -6475,6 +6479,292 @@ void *mtree_erase(struct maple_tree *mt, unsigned long index)
> }
> EXPORT_SYMBOL(mtree_erase);
>
> +/*
> + * mas_dup_free() - Free an incomplete duplication of a tree.
> + * @mas: The maple state of an incomplete tree.
> + *
> + * The parameter @mas->node passed in indicates that the allocation failed on
> + * this node. This function frees all nodes starting from @mas->node in the
> + * reverse order of mas_dup_build(). There is no need to hold the source tree
> + * lock at this time.
> + */
> +static void mas_dup_free(struct ma_state *mas)
> +{
> + struct maple_node *node;
> + enum maple_type type;
> + void __rcu **slots;
> + unsigned char count, i;
> +
> + /* Maybe the first node allocation failed. */
> + if (mas_is_none(mas))
> + return;
> +
> + while (!mte_is_root(mas->node)) {
> + mas_ascend(mas);
> +

Please watch the extra whitespace. There are a few in this patch.

> + if (mas->offset) {
> + mas->offset--;
> + do {
> + mas_descend(mas);
> + mas->offset = mas_data_end(mas);
> + } while (!mte_is_leaf(mas->node));
> +
> + mas_ascend(mas);
> + }
> +
> + node = mte_to_node(mas->node);
> + type = mte_node_type(mas->node);
> + slots = ma_slots(node, type);
> + count = mas_data_end(mas) + 1;
> + for (i = 0; i < count; i++)
> + ((unsigned long *)slots)[i] &= ~MAPLE_NODE_MASK;
> +
> + mt_free_bulk(count, slots);
> + }
> +
> + node = mte_to_node(mas->node);
> + mt_free_one(node);
> +}
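
The mask stripping in the loop above undoes the maple tree's encoded
node scheme: a slot stores the child pointer with the child's node type
packed into its low bits. A standalone illustration (enode_to_raw() is
a made-up name; this is in effect what mte_to_node() does):

	static inline struct maple_node *enode_to_raw(struct maple_enode *enode)
	{
		/* Clear the type bits to recover the raw slab pointer. */
		return (struct maple_node *)((unsigned long)enode &
					     ~MAPLE_NODE_MASK);
	}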
> +
> +/*
> + * mas_copy_node() - Copy a maple node and replace the parent.
> + * @mas: The maple state of source tree.
> + * @new_mas: The maple state of new tree.
> + * @parent: The parent of the new node.
> + *
> + * Copy @mas->node to @new_mas->node, set @parent to be the parent of
> + * @new_mas->node. If memory allocation fails, @mas is set to -ENOMEM.
> + */
> +static inline void mas_copy_node(struct ma_state *mas, struct ma_state *new_mas,
> + struct maple_pnode *parent)
> +{
> + struct maple_node *node = mte_to_node(mas->node);
> + struct maple_node *new_node = mte_to_node(new_mas->node);
> + unsigned long val;
> +
> + /* Copy the node completely. */
> + memcpy(new_node, node, sizeof(struct maple_node));
> +
> + /* Update the parent node pointer. */
> + val = (unsigned long)node->parent & MAPLE_NODE_MASK;
> + new_node->parent = ma_parent_ptr(val | (unsigned long)parent);
> +}
> +
> +/*
> + * mas_dup_alloc() - Allocate child nodes for a maple node.
> + * @mas: The maple state of source tree.
> + * @new_mas: The maple state of new tree.
> + * @gfp: The GFP_FLAGS to use for allocations.
> + *
> + * This function allocates child nodes for @new_mas->node during the duplication
> + * process. If memory allocation fails, @mas is set to -ENOMEM.
> + */
> +static inline void mas_dup_alloc(struct ma_state *mas, struct ma_state *new_mas,
> + gfp_t gfp)
> +{
> + struct maple_node *node = mte_to_node(mas->node);
> + struct maple_node *new_node = mte_to_node(new_mas->node);
> + enum maple_type type;
> + unsigned char request, count, i;
> + void __rcu **slots;
> + void __rcu **new_slots;
> + unsigned long val;
> +
> + /* Allocate memory for child nodes. */
> + type = mte_node_type(mas->node);
> + new_slots = ma_slots(new_node, type);
> + request = mas_data_end(mas) + 1;
> + count = mt_alloc_bulk(gfp, request, (void **)new_slots);
> + if (unlikely(count < request)) {
> + if (count)
> + mt_free_bulk(count, new_slots);

We were dropping this mt_free_bulk() call as discussed in [1]. Did I
miss something?

> +
> + memset(new_slots, 0, request * sizeof(void *));
> + mas_set_err(mas, -ENOMEM);
> + return;
> + }
> +
> + /* Restore node type information in slots. */
> + slots = ma_slots(node, type);
> + for (i = 0; i < count; i++) {
> + val = (unsigned long)mt_slot_locked(mas->tree, slots, i);
> + val &= MAPLE_NODE_MASK;
> + ((unsigned long *)new_slots)[i] |= val;
> + }
> +}
> +
> +/*
> + * mas_dup_build() - Build a new maple tree from a source tree
> + * @mas: The maple state of source tree, need to be in MAS_START state.
> + * @new_mas: The maple state of new tree, need to be in MAS_START state.
> + * @gfp: The GFP_FLAGS to use for allocations.
> + *
> + * This function builds a new tree in DFS preorder. If the memory allocation
> + * fails, the error code -ENOMEM will be set in @mas, and @new_mas points to the
> + * last node. mas_dup_free() will free the incomplete duplication of a tree.
> + *
> + * Note that the attributes of the two trees need to be exactly the same, and the
> + * new tree needs to be empty, otherwise -EINVAL will be set in @mas.
> + */
> +static inline void mas_dup_build(struct ma_state *mas, struct ma_state *new_mas,
> + gfp_t gfp)
> +{
> + struct maple_node *node;
> + struct maple_pnode *parent = NULL;
> + struct maple_enode *root;
> + enum maple_type type;
> +
> + if (unlikely(mt_attr(mas->tree) != mt_attr(new_mas->tree)) ||
> + unlikely(!mtree_empty(new_mas->tree))) {
> + mas_set_err(mas, -EINVAL);
> + return;
> + }
> +
> + mas_start(mas);
> + if (mas_is_ptr(mas) || mas_is_none(mas)) {
> + root = mt_root_locked(mas->tree);

mas_start(mas) would return the root entry if it's a pointer and NULL if
the tree is empty, so this can be written:
root = mas_start(mas);
if (mas_is_ptr(mas) || mas_is_none(mas))
	goto set_new_tree;


> + goto set_new_tree;
> + }
> +
> + node = mt_alloc_one(gfp);
> + if (!node) {
> + new_mas->node = MAS_NONE;
> + mas_set_err(mas, -ENOMEM);
> + return;
> + }
> +
> + type = mte_node_type(mas->node);
> + root = mt_mk_node(node, type);
> + new_mas->node = root;
> + new_mas->min = 0;
> + new_mas->max = ULONG_MAX;
> + root = mte_mk_root(root);
> +
> + while (1) {
> + mas_copy_node(mas, new_mas, parent);
> +
> + if (!mte_is_leaf(mas->node)) {
> + /* Only allocate child nodes for non-leaf nodes. */
> + mas_dup_alloc(mas, new_mas, gfp);
> + if (unlikely(mas_is_err(mas)))
> + return;
> + } else {
> + /*
> + * This is the last leaf node and duplication is
> + * completed.
> + */
> + if (mas->max == ULONG_MAX)
> + goto done;
> +
> + /* This is not the last leaf node and needs to go up. */
> + do {
> + mas_ascend(mas);
> + mas_ascend(new_mas);
> + } while (mas->offset == mas_data_end(mas));
> +
> + /* Move to the next subtree. */
> + mas->offset++;
> + new_mas->offset++;
> + }
> +
> + mas_descend(mas);
> + parent = ma_parent_ptr(mte_to_node(new_mas->node));
> + mas_descend(new_mas);
> + mas->offset = 0;
> + new_mas->offset = 0;
> + }
> +done:
> + /* Specially handle the parent of the root node. */
> + mte_to_node(root)->parent = ma_parent_ptr(mas_tree_parent(new_mas));
> +set_new_tree:
> + /* Make them the same height */
> + new_mas->tree->ma_flags = mas->tree->ma_flags;
> + rcu_assign_pointer(new_mas->tree->ma_root, root);
> +}
> +
> +/**
> + * __mt_dup(): Duplicate an entire maple tree
> + * @mt: The source maple tree
> + * @new: The new maple tree
> + * @gfp: The GFP_FLAGS to use for allocations
> + *
> + * This function duplicates a maple tree in Depth-First Search (DFS) pre-order
> + * traversal. It uses memcpy() to copy nodes in the source tree and allocates
> + * new child nodes in non-leaf nodes. The new node is exactly the same as the
> + * source node except for all the addresses stored in it. It will be faster than
> + * traversing all elements in the source tree and inserting them one by one into
> + * the new tree.
> + * The user needs to ensure that the attributes of the source tree and the new
> + * tree are the same, and the new tree needs to be an empty tree, otherwise
> + * -EINVAL will be returned.
> + * Note that the user needs to manually lock the source tree and the new tree.
> + *
> + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL if
> + * the attributes of the two trees are different or the new tree is not an empty
> + * tree.
> + */
> +int __mt_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> +{
> + int ret = 0;
> + MA_STATE(mas, mt, 0, 0);
> + MA_STATE(new_mas, new, 0, 0);
> +
> + mas_dup_build(&mas, &new_mas, gfp);
> +
> + if (unlikely(mas_is_err(&mas))) {
> + ret = xa_err(mas.node);
> + if (ret == -ENOMEM)
> + mas_dup_free(&new_mas);
> + }
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(__mt_dup);
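
As a usage note, a minimal caller sketch for this external-locking
variant (hypothetical code, assuming the mtree_lock_nested() helper
introduced earlier in this series; the lock order mirrors what
mtree_dup() does internally):

	static int dup_tree(struct maple_tree *mt, struct maple_tree *new)
	{
		int ret;

		/* @new must be empty and share @mt's attributes. */
		mtree_lock(new);
		mtree_lock_nested(mt, SINGLE_DEPTH_NESTING);
		ret = __mt_dup(mt, new, GFP_KERNEL);
		mtree_unlock(mt);
		mtree_unlock(new);

		return ret;	/* 0, -ENOMEM or -EINVAL */
	}

With mtree_dup() below, the same duplication needs no locking by the
caller.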
> +
> +/**
> + * mtree_dup(): Duplicate an entire maple tree
> + * @mt: The source maple tree
> + * @new: The new maple tree
> + * @gfp: The GFP_FLAGS to use for allocations
> + *
> + * This function duplicates a maple tree in Depth-First Search (DFS) pre-order
> + * traversal. It uses memcpy() to copy nodes in the source tree and allocates
> + * new child nodes in non-leaf nodes. The new node is exactly the same as the
> + * source node except for all the addresses stored in it. It will be faster than
> + * traversing all elements in the source tree and inserting them one by one into
> + * the new tree.
> + * The user needs to ensure that the attributes of the source tree and the new
> + * tree are the same, and the new tree needs to be an empty tree, otherwise
> + * -EINVAL will be returned.
> + *
> + * Return: 0 on success, -ENOMEM if memory could not be allocated, -EINVAL if
> + * the attributes of the two trees are different or the new tree is not an empty
> + * tree.
> + */
> +int mtree_dup(struct maple_tree *mt, struct maple_tree *new, gfp_t gfp)
> +{
> + int ret = 0;
> + MA_STATE(mas, mt, 0, 0);
> + MA_STATE(new_mas, new, 0, 0);
> +
> + mas_lock(&new_mas);
> + mas_lock_nested(&mas, SINGLE_DEPTH_NESTING);
> +
> + mas_dup_build(&mas, &new_mas, gfp);
> + mas_unlock(&mas);
> +
> + if (unlikely(mas_is_err(&mas))) {
> + ret = xa_err(mas.node);
> + if (ret == -ENOMEM)
> + mas_dup_free(&new_mas);
> + }
> +
> + mas_unlock(&new_mas);
> +
> + return ret;
> +}
> +EXPORT_SYMBOL(mtree_dup);
> +
> /**
> * __mt_destroy() - Walk and free all nodes of a locked maple tree.
> * @mt: The maple tree
> --
> 2.20.1
>

[1]. https://lore.kernel.org/lkml/20231004142500.gz2552r74aiphl4z@revolver/

Thanks,
Liam

2023-10-24 08:41:29

by Peng Zhang

[permalink] [raw]
Subject: Re: [PATCH v5 03/10] maple_tree: Introduce interfaces __mt_dup() and mtree_dup()



On 2023/10/17 21:57, Liam R. Howlett wrote:
> * Peng Zhang <[email protected]> [231015 23:23]:
>> + while (!mte_is_root(mas->node)) {
>> + mas_ascend(mas);
>> +
>
> Please watch the extra whitespace. There are a few in this patch.
Done in v6, thank you.
>> + /* Allocate memory for child nodes. */
>> + type = mte_node_type(mas->node);
>> + new_slots = ma_slots(new_node, type);
>> + request = mas_data_end(mas) + 1;
>> + count = mt_alloc_bulk(gfp, request, (void **)new_slots);
>> + if (unlikely(count < request)) {
>> + if (count)
>> + mt_free_bulk(count, new_slots);
>
> We were dropping this mt_free_bulk() call as discussed in [1]. Did I
> miss something?
It seems that I misunderstood earlier; I thought it needed to be kept.
It has been deleted in v6, thank you.
>> +
>> + mas_start(mas);
>> + if (mas_is_ptr(mas) || mas_is_none(mas)) {
>> + root = mt_root_locked(mas->tree);
>
> mas_start(mas) would return the root entry if it's a pointer and NULL if
> the tree is empty, so this can be written:
> root = mas_start(mas);
> if (mas_is_ptr(mas) || mas_is_none(mas))
> 	goto set_new_tree;
Done in v6, thank you.

2023-10-24 08:46:35

by Peng Zhang

[permalink] [raw]
Subject: Re: [PATCH v5 10/10] fork: Use __mt_dup() to duplicate maple tree in dup_mmap()



On 2023/10/17 21:50, Liam R. Howlett wrote:
> * Peng Zhang <[email protected]> [231015 23:23]:
>> In dup_mmap(), using __mt_dup() to duplicate the old maple tree and then
>> directly replacing the entries of VMAs in the new maple tree can result
>> in better performance. __mt_dup() uses DFS pre-order to duplicate the
>> maple tree, so it is efficient.
>>
>> The average time complexity of __mt_dup() is O(n), where n is the number
>> of VMAs. The proof of the time complexity is provided in the commit log
>> that introduces __mt_dup(). After duplicating the maple tree, each element
>> is traversed and replaced (ignoring the cases of deletion, which are rare).
>> Since it is only a replacement operation for each element, this process is
>> also O(n).
>>
>> Analyzing the exact time complexity of the previous algorithm is
>> challenging because each insertion can involve appending to a node, pushing
>> data to adjacent nodes, or even splitting nodes. The frequency of each
>> action is difficult to calculate. The worst-case scenario for a single
>> insertion is when the tree undergoes splitting at every level. If we
>> consider each insertion as the worst-case scenario, we can determine that
>> the upper bound of the time complexity is O(n*log(n)), although this is a
>> loose upper bound. However, based on the test data, it appears that the
>> actual time complexity is likely to be O(n).
>>
>> As the entire maple tree is duplicated using __mt_dup(), if dup_mmap()
>> fails, there will be a portion of VMAs that have not been duplicated in
>> the maple tree. To handle this, we mark the failure point with
>> XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered, stop
>> releasing VMAs that have not been duplicated after this point.
>>
>> There is a "spawn" in byte-unixbench[1], which can be used to test the
>> performance of fork(). I modified it slightly to make it work with
>> different number of VMAs.
>>
>> Below are the test results. The first row shows the number of VMAs.
>> The second and third rows show the number of fork() calls per ten seconds,
>> corresponding to next-20231006 and this patchset, respectively. The
>> test results were obtained with CPU binding to avoid scheduler load
>> balancing that could cause unstable results. There are still some
>> fluctuations in the test results, but at least they are better than the
>> original performance.
>>
>> 21 121 221 421 821 1621 3221 6421 12821 25621 51221
>> 112100 76261 54227 34035 20195 11112 6017 3161 1606 802 393
>> 114558 83067 65008 45824 28751 16072 8922 4747 2436 1233 599
>> 2.19% 8.92% 19.88% 34.64% 42.37% 44.64% 48.28% 50.17% 51.68% 53.74% 52.42%
>>
>> [1] https://github.com/kdlucas/byte-unixbench/tree/master
>>
>> Signed-off-by: Peng Zhang <[email protected]>
>> ---
>> kernel/fork.c | 39 ++++++++++++++++++++++++++++-----------
>> mm/memory.c | 7 ++++++-
>> mm/mmap.c | 9 ++++++---
>> 3 files changed, 40 insertions(+), 15 deletions(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 0ff2e0cd4109..0be15501e52e 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -650,7 +650,6 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>> int retval;
>> unsigned long charge = 0;
>> LIST_HEAD(uf);
>> - VMA_ITERATOR(old_vmi, oldmm, 0);
>> VMA_ITERATOR(vmi, mm, 0);
>>
>> uprobe_start_dup_mmap();
>> @@ -678,16 +677,21 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm,
>> goto out;
>> khugepaged_fork(mm, oldmm);
>>
>> - retval = vma_iter_bulk_alloc(&vmi, oldmm->map_count);
>> - if (retval)
>> + /* Use __mt_dup() to efficiently build an identical maple tree. */
>> + retval = __mt_dup(&oldmm->mm_mt, &mm->mm_mt, GFP_KERNEL);
>> + if (unlikely(retval))
>> goto out;
>>
>> mt_clear_in_rcu(vmi.mas.tree);
>> - for_each_vma(old_vmi, mpnt) {
>> + for_each_vma(vmi, mpnt) {
>> struct file *file;
>>
>> vma_start_write(mpnt);
>> if (mpnt->vm_flags & VM_DONTCOPY) {
>> + retval = mas_store_gfp(&vmi.mas, NULL, GFP_KERNEL);
>
> vma_iter_clear_gfp() exists, but needs to be relocated from internal.h
> to mm.h to be used here.
Done in v6, thanks.
>
>> - /* Link the vma into the MT */
>> - if (vma_iter_bulk_store(&vmi, tmp))
>> - goto fail_nomem_vmi_store;
>> + /*
>> + * Link the vma into the MT. After using __mt_dup(), memory
>> + * allocation is not necessary here, so it cannot fail.
>> + */
>
> Allocations didn't happen here with the bulk store either, and the
> vma_iter_bulk_store() does a mas_store() - see include/linux/mm.h
>
> The vma iterator is used when possible for type safety.
Done in v6, thanks.
>
>> loop_out:
>> vma_iter_free(&vmi);
>> - if (!retval)
>> + if (!retval) {
>> mt_set_in_rcu(vmi.mas.tree);
>> + } else if (mpnt) {
>> + /*
>> + * The entire maple tree has already been duplicated. If the
>> + * mmap duplication fails, mark the failure point with
>> + * XA_ZERO_ENTRY. In exit_mmap(), if this marker is encountered,
>> + * stop releasing VMAs that have not been duplicated after this
>> + * point.
>> + */
>> + mas_set_range(&vmi.mas, mpnt->vm_start, mpnt->vm_end - 1);
>> + mas_store(&vmi.mas, XA_ZERO_ENTRY);
>
> vma_iter_clear() exists but uses the preallocation call. I really don't
> want mas_ calls where unnecessary, but this seems like a special case.
> We have a vma iterator here so it's messy.
If the mas_ interface is not used, it may be necessary to add another
vma_iter_ interface to do this.