2024-02-13 21:37:38

by Chuck Lever

Subject: [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities

In an effort to address slab fragmentation issues reported a few
months ago, I've replaced the use of xarrays for the directory
offset map in "simple" file systems (including tmpfs).

This patch set passes functional testing and is ready for code
review. But I don't have the facilities to re-run the performance
tests that identified the regression. We expect the performance of
this implementation will need additional improvement.

Thanks to Liam Howlett for helping me get this working.

---

Chuck Lever (6):
libfs: Rename "so_ctx"
libfs: Define a minimum directory offset
libfs: Add simple_offset_empty()
maple_tree: Add mtree_alloc_cyclic()
libfs: Convert simple directory offsets to use a Maple Tree
libfs: Re-arrange locking in offset_iterate_dir()

Liam R. Howlett (1):
test_maple_tree: testing the cyclic allocation


fs/libfs.c | 125 +++++++++++++++++++++++--------------
include/linux/fs.h | 6 +-
include/linux/maple_tree.h | 7 +++
lib/maple_tree.c | 93 +++++++++++++++++++++++++++
lib/test_maple_tree.c | 44 +++++++++++++
mm/shmem.c | 4 +-
6 files changed, 227 insertions(+), 52 deletions(-)

--
Chuck Lever



2024-02-13 21:39:07

by Chuck Lever

Subject: [PATCH RFC 4/7] maple_tree: Add mtree_alloc_cyclic()

From: Chuck Lever <[email protected]>

I need a cyclic allocator for the simple_offset implementation in
fs/libfs.c.
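
For review purposes, a minimal sketch of the intended call pattern follows
(illustrative only; the identifiers are made up, and the real call site is
added by the libfs conversion patch later in this series):

#include <linux/maple_tree.h>
#include <linux/gfp.h>

/* Illustrative only: a cyclic ID allocator built on the new helper. */
static struct maple_tree my_ids;
static unsigned long my_next_id;

static void my_ids_init(void)
{
	/* The tree must be initialized with MT_FLAGS_ALLOC_RANGE. */
	mt_init_flags(&my_ids, MT_FLAGS_ALLOC_RANGE);
	my_next_id = 2;		/* reserve IDs 0 and 1, as simple_offset does */
}

static int my_id_alloc(void *item, unsigned long *id_out)
{
	/*
	 * mtree_alloc_cyclic() returns 0 on success, 1 if the ID space
	 * wrapped before this allocation, or a negative errno.
	 */
	return mtree_alloc_cyclic(&my_ids, id_out, item, 2, ULONG_MAX,
				  &my_next_id, GFP_KERNEL);
}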

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/maple_tree.h | 7 +++
lib/maple_tree.c | 93 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 100 insertions(+)

diff --git a/include/linux/maple_tree.h b/include/linux/maple_tree.h
index b3d63123b945..a53ad4dabd7e 100644
--- a/include/linux/maple_tree.h
+++ b/include/linux/maple_tree.h
@@ -171,6 +171,7 @@ enum maple_type {
#define MT_FLAGS_LOCK_IRQ 0x100
#define MT_FLAGS_LOCK_BH 0x200
#define MT_FLAGS_LOCK_EXTERN 0x300
+#define MT_FLAGS_ALLOC_WRAPPED 0x0800

#define MAPLE_HEIGHT_MAX 31

@@ -319,6 +320,9 @@ int mtree_insert_range(struct maple_tree *mt, unsigned long first,
int mtree_alloc_range(struct maple_tree *mt, unsigned long *startp,
void *entry, unsigned long size, unsigned long min,
unsigned long max, gfp_t gfp);
+int mtree_alloc_cyclic(struct maple_tree *mt, unsigned long *startp,
+ void *entry, unsigned long range_lo, unsigned long range_hi,
+ unsigned long *next, gfp_t gfp);
int mtree_alloc_rrange(struct maple_tree *mt, unsigned long *startp,
void *entry, unsigned long size, unsigned long min,
unsigned long max, gfp_t gfp);
@@ -499,6 +503,9 @@ void *mas_find_range(struct ma_state *mas, unsigned long max);
void *mas_find_rev(struct ma_state *mas, unsigned long min);
void *mas_find_range_rev(struct ma_state *mas, unsigned long max);
int mas_preallocate(struct ma_state *mas, void *entry, gfp_t gfp);
+int mas_alloc_cyclic(struct ma_state *mas, unsigned long *startp,
+ void *entry, unsigned long range_lo, unsigned long range_hi,
+ unsigned long *next, gfp_t gfp);

bool mas_nomem(struct ma_state *mas, gfp_t gfp);
void mas_pause(struct ma_state *mas);
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 6f241bb38799..af0970288727 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -4290,6 +4290,56 @@ static inline void *mas_insert(struct ma_state *mas, void *entry)

}

+/**
+ * mas_alloc_cyclic() - Internal call to find somewhere to store an entry
+ * @mas: The maple state.
+ * @startp: Pointer to ID.
+ * @range_lo: Lower bound of range to search.
+ * @range_hi: Upper bound of range to search.
+ * @entry: The entry to store.
+ * @next: Pointer to next ID to allocate.
+ * @gfp: The GFP_FLAGS to use for allocations.
+ *
+ * Return: 0 if the allocation succeeded without wrapping, 1 if the
+ * allocation succeeded after wrapping, or -EBUSY if there are no
+ * free entries.
+ */
+int mas_alloc_cyclic(struct ma_state *mas, unsigned long *startp,
+ void *entry, unsigned long range_lo, unsigned long range_hi,
+ unsigned long *next, gfp_t gfp)
+{
+ unsigned long min = range_lo;
+ int ret = 0;
+
+ range_lo = max(min, *next);
+ ret = mas_empty_area(mas, range_lo, range_hi, 1);
+ if ((mas->tree->ma_flags & MT_FLAGS_ALLOC_WRAPPED) && ret == 0) {
+ mas->tree->ma_flags &= ~MT_FLAGS_ALLOC_WRAPPED;
+ ret = 1;
+ }
+ if (ret < 0 && range_lo > min) {
+ ret = mas_empty_area(mas, min, range_hi, 1);
+ if (ret == 0)
+ ret = 1;
+ }
+ if (ret < 0)
+ return ret;
+
+ do {
+ mas_insert(mas, entry);
+ } while (mas_nomem(mas, gfp));
+ if (mas_is_err(mas))
+ return xa_err(mas->node);
+
+ *startp = mas->index;
+ *next = *startp + 1;
+ if (*next == 0)
+ mas->tree->ma_flags |= MT_FLAGS_ALLOC_WRAPPED;
+
+ return ret;
+}
+EXPORT_SYMBOL(mas_alloc_cyclic);
+
static __always_inline void mas_rewalk(struct ma_state *mas, unsigned long index)
{
retry:
@@ -6443,6 +6493,49 @@ int mtree_alloc_range(struct maple_tree *mt, unsigned long *startp,
}
EXPORT_SYMBOL(mtree_alloc_range);

+/**
+ * mtree_alloc_cyclic() - Find somewhere to store this entry in the tree.
+ * @mt: The maple tree.
+ * @startp: Pointer to ID.
+ * @range_lo: Lower bound of range to search.
+ * @range_hi: Upper bound of range to search.
+ * @entry: The entry to store.
+ * @next: Pointer to next ID to allocate.
+ * @gfp: The GFP_FLAGS to use for allocations.
+ *
+ * Finds an empty entry in @mt after @next, stores the new index into
+ * the @id pointer, stores the entry at that index, then updates @next.
+ *
+ * @mt must be initialized with the MT_FLAGS_ALLOC_RANGE flag.
+ *
+ * Context: Any context. Takes and releases the mt.lock. May sleep if
+ * the @gfp flags permit.
+ *
+ * Return: 0 if the allocation succeeded without wrapping, 1 if the
+ * allocation succeeded after wrapping, -ENOMEM if memory could not be
+ * allocated, -EINVAL if @mt cannot be used, or -EBUSY if there are no
+ * free entries.
+ */
+int mtree_alloc_cyclic(struct maple_tree *mt, unsigned long *startp,
+ void *entry, unsigned long range_lo, unsigned long range_hi,
+ unsigned long *next, gfp_t gfp)
+{
+ int ret;
+
+ MA_STATE(mas, mt, 0, 0);
+
+ if (!mt_is_alloc(mt))
+ return -EINVAL;
+ if (WARN_ON_ONCE(mt_is_reserved(entry)))
+ return -EINVAL;
+ mtree_lock(mt);
+ ret = mas_alloc_cyclic(&mas, startp, entry, range_lo, range_hi,
+ next, gfp);
+ mtree_unlock(mt);
+ return ret;
+}
+EXPORT_SYMBOL(mtree_alloc_cyclic);
+
int mtree_alloc_rrange(struct maple_tree *mt, unsigned long *startp,
void *entry, unsigned long size, unsigned long min,
unsigned long max, gfp_t gfp)



2024-02-13 21:40:06

by Chuck Lever

Subject: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()

From: Chuck Lever <[email protected]>

Liam says that, unlike with xarray, once the RCU read lock is
released ma_state is not safe to re-use for the next mas_find() call.
But the RCU read lock has to be released on each loop iteration so
that dput() can be called safely.

Thus we are forced to walk the offset tree with fresh state for each
directory entry. mt_find() can do this for us, though it might be a
little less efficient than maintaining ma_state locally.

Since offset_iterate_dir() doesn't build ma_state locally any more,
there's no longer a strong need for offset_find_next(). Clean up by
rolling these two helpers together.
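
To illustrate the hazard for reviewers (a sketch only, not part of the
change; "unsafe_next" and "safe_next" are made-up names):

#include <linux/maple_tree.h>
#include <linux/dcache.h>
#include <linux/rcupdate.h>

/*
 * Unsafe pattern: the ma_state caches pointers into tree nodes, and those
 * nodes are only guaranteed to stay valid while the RCU read lock is held.
 */
static struct dentry *unsafe_next(struct ma_state *mas)
{
	struct dentry *child;

	rcu_read_lock();
	child = mas_find(mas, ULONG_MAX);	/* mas now caches a node pointer */
	rcu_read_unlock();			/* the cached node may be freed or moved */
	return child;				/* a later mas_find(mas, ...) is unsafe */
}

/* Pattern used below instead: mt_find() restarts from the root on each call. */
static struct dentry *safe_next(struct maple_tree *mt, unsigned long *pos)
{
	return mt_find(mt, pos, ULONG_MAX);
}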

Signed-off-by: Chuck Lever <[email protected]>
---
fs/libfs.c | 39 +++++++++++++++++----------------------
1 file changed, 17 insertions(+), 22 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index f073e9aeb2bf..6e01fde1cf95 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -436,23 +436,6 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
}

-static struct dentry *offset_find_next(struct ma_state *mas)
-{
- struct dentry *child, *found = NULL;
-
- rcu_read_lock();
- child = mas_find(mas, ULONG_MAX);
- if (!child)
- goto out;
- spin_lock(&child->d_lock);
- if (simple_positive(child))
- found = dget_dlock(child);
- spin_unlock(&child->d_lock);
-out:
- rcu_read_unlock();
- return found;
-}
-
static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
{
unsigned long offset = dentry2offset(dentry);
@@ -465,13 +448,22 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
{
struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
- MA_STATE(mas, &octx->mt, ctx->pos, ctx->pos);
- struct dentry *dentry;
+ struct dentry *dentry, *found;
+ unsigned long offset;

+ offset = ctx->pos;
while (true) {
- dentry = offset_find_next(&mas);
+ found = mt_find(&octx->mt, &offset, ULONG_MAX);
+ if (!found)
+ goto out_noent;
+
+ dentry = NULL;
+ spin_lock(&found->d_lock);
+ if (simple_positive(found))
+ dentry = dget_dlock(found);
+ spin_unlock(&found->d_lock);
if (!dentry)
- return ERR_PTR(-ENOENT);
+ goto out_noent;

if (!offset_dir_emit(ctx, dentry)) {
dput(dentry);
@@ -479,9 +471,12 @@ static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
}

dput(dentry);
- ctx->pos = mas.index + 1;
+ ctx->pos = offset;
}
return NULL;
+
+out_noent:
+ return ERR_PTR(-ENOENT);
}

/**



2024-02-13 21:44:39

by Chuck Lever

Subject: [PATCH RFC 3/7] libfs: Add simple_offset_empty()

From: Chuck Lever <[email protected]>

For simple filesystems that use directory offset mapping, rely
strictly on the directory offset map to tell when a directory has
no children.

After this patch is applied, the emptiness test holds only the RCU
read lock when the directory being tested has no children.

In addition, this adds another layer of confirmation that
simple_offset_add/remove() are working as expected.

Signed-off-by: Chuck Lever <[email protected]>
---
fs/libfs.c | 32 ++++++++++++++++++++++++++++++++
include/linux/fs.h | 1 +
mm/shmem.c | 4 ++--
3 files changed, 35 insertions(+), 2 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index a38af72f4719..3cf773950f93 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -313,6 +313,38 @@ void simple_offset_remove(struct offset_ctx *octx, struct dentry *dentry)
offset_set(dentry, 0);
}

+/**
+ * simple_offset_empty - Check if a dentry can be unlinked
+ * @dentry: dentry to be tested
+ *
+ * Returns 0 if @dentry is a non-empty directory; otherwise returns 1.
+ */
+int simple_offset_empty(struct dentry *dentry)
+{
+ struct inode *inode = d_inode(dentry);
+ struct offset_ctx *octx;
+ struct dentry *child;
+ unsigned long index;
+ int ret = 1;
+
+ if (!inode || !S_ISDIR(inode->i_mode))
+ return ret;
+
+ index = 2;
+ octx = inode->i_op->get_offset_ctx(inode);
+ xa_for_each(&octx->xa, index, child) {
+ spin_lock(&child->d_lock);
+ if (simple_positive(child)) {
+ spin_unlock(&child->d_lock);
+ ret = 0;
+ break;
+ }
+ spin_unlock(&child->d_lock);
+ }
+
+ return ret;
+}
+
/**
* simple_offset_rename_exchange - exchange rename with directory offsets
* @old_dir: parent of dentry being moved
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ed5966a70495..03d141809a2c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3267,6 +3267,7 @@ struct offset_ctx {
void simple_offset_init(struct offset_ctx *octx);
int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry);
void simple_offset_remove(struct offset_ctx *octx, struct dentry *dentry);
+int simple_offset_empty(struct dentry *dentry);
int simple_offset_rename_exchange(struct inode *old_dir,
struct dentry *old_dentry,
struct inode *new_dir,
diff --git a/mm/shmem.c b/mm/shmem.c
index d7c84ff62186..6fed524343cb 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -3374,7 +3374,7 @@ static int shmem_unlink(struct inode *dir, struct dentry *dentry)

static int shmem_rmdir(struct inode *dir, struct dentry *dentry)
{
- if (!simple_empty(dentry))
+ if (!simple_offset_empty(dentry))
return -ENOTEMPTY;

drop_nlink(d_inode(dentry));
@@ -3431,7 +3431,7 @@ static int shmem_rename2(struct mnt_idmap *idmap,
return simple_offset_rename_exchange(old_dir, old_dentry,
new_dir, new_dentry);

- if (!simple_empty(new_dentry))
+ if (!simple_offset_empty(new_dentry))
return -ENOTEMPTY;

if (flags & RENAME_WHITEOUT) {



2024-02-13 21:45:22

by Chuck Lever

Subject: [PATCH RFC 2/7] libfs: Define a minimum directory offset

From: Chuck Lever <[email protected]>

This value is used in several places, so make it a symbolic
constant.

Signed-off-by: Chuck Lever <[email protected]>
---
fs/libfs.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index bfbe1a8c5d2d..a38af72f4719 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -240,6 +240,11 @@ const struct inode_operations simple_dir_inode_operations = {
};
EXPORT_SYMBOL(simple_dir_inode_operations);

+/* 0 is '.', 1 is '..', so always start with offset 2 or more */
+enum {
+ DIR_OFFSET_MIN = 2,
+};
+
static void offset_set(struct dentry *dentry, u32 offset)
{
dentry->d_fsdata = (void *)((uintptr_t)(offset));
@@ -261,9 +266,7 @@ void simple_offset_init(struct offset_ctx *octx)
{
xa_init_flags(&octx->xa, XA_FLAGS_ALLOC1);
lockdep_set_class(&octx->xa.xa_lock, &simple_offset_xa_lock);
-
- /* 0 is '.', 1 is '..', so always start with offset 2 */
- octx->next_offset = 2;
+ octx->next_offset = DIR_OFFSET_MIN;
}

/**
@@ -276,7 +279,7 @@ void simple_offset_init(struct offset_ctx *octx)
*/
int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
{
- static const struct xa_limit limit = XA_LIMIT(2, U32_MAX);
+ static const struct xa_limit limit = XA_LIMIT(DIR_OFFSET_MIN, U32_MAX);
u32 offset;
int ret;

@@ -481,7 +484,7 @@ static int offset_readdir(struct file *file, struct dir_context *ctx)
return 0;

/* In this case, ->private_data is protected by f_pos_lock */
- if (ctx->pos == 2)
+ if (ctx->pos == DIR_OFFSET_MIN)
file->private_data = NULL;
else if (file->private_data == ERR_PTR(-ENOENT))
return 0;



2024-02-13 21:45:53

by Chuck Lever

Subject: [PATCH RFC 5/7] test_maple_tree: testing the cyclic allocation

From: Liam R. Howlett <[email protected]>

This tests the interactions between cyclic allocation, the maple state
index and last, and overflow.

Signed-off-by: Liam R. Howlett <[email protected]>
Signed-off-by: Chuck Lever <[email protected]>
---
lib/test_maple_tree.c | 44 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 44 insertions(+)

diff --git a/lib/test_maple_tree.c b/lib/test_maple_tree.c
index 29185ac5c727..399380db449c 100644
--- a/lib/test_maple_tree.c
+++ b/lib/test_maple_tree.c
@@ -3599,6 +3599,45 @@ static noinline void __init check_state_handling(struct maple_tree *mt)
mas_unlock(&mas);
}

+static noinline void __init alloc_cyclic_testing(struct maple_tree *mt)
+{
+ unsigned long location;
+ unsigned long next;
+ int ret = 0;
+ MA_STATE(mas, mt, 0, 0);
+
+ next = 0;
+ mtree_lock(mt);
+ for (int i = 0; i < 100; i++) {
+ mas_alloc_cyclic(&mas, &location, mt, 2, ULONG_MAX, &next, GFP_KERNEL);
+ MAS_BUG_ON(&mas, i != location - 2);
+ MAS_BUG_ON(&mas, mas.index != location);
+ MAS_BUG_ON(&mas, mas.last != location);
+ MAS_BUG_ON(&mas, i != next - 3);
+ }
+
+ mtree_unlock(mt);
+ mtree_destroy(mt);
+ next = 0;
+ mt_init_flags(mt, MT_FLAGS_ALLOC_RANGE);
+ for (int i = 0; i < 100; i++) {
+ mtree_alloc_cyclic(mt, &location, mt, 2, ULONG_MAX, &next, GFP_KERNEL);
+ MT_BUG_ON(mt, i != location - 2);
+ MT_BUG_ON(mt, i != next - 3);
+ MT_BUG_ON(mt, mtree_load(mt, location) != mt);
+ }
+
+ mtree_destroy(mt);
+ /* Overflow test */
+ next = ULONG_MAX - 1;
+ ret = mtree_alloc_cyclic(mt, &location, mt, 2, ULONG_MAX, &next, GFP_KERNEL);
+ MT_BUG_ON(mt, ret != 0);
+ ret = mtree_alloc_cyclic(mt, &location, mt, 2, ULONG_MAX, &next, GFP_KERNEL);
+ MT_BUG_ON(mt, ret != 0);
+ ret = mtree_alloc_cyclic(mt, &location, mt, 2, ULONG_MAX, &next, GFP_KERNEL);
+ MT_BUG_ON(mt, ret != 1);
+}
+
static DEFINE_MTREE(tree);
static int __init maple_tree_seed(void)
{
@@ -3880,6 +3919,11 @@ static int __init maple_tree_seed(void)
check_state_handling(&tree);
mtree_destroy(&tree);

+ mt_init_flags(&tree, MT_FLAGS_ALLOC_RANGE);
+ alloc_cyclic_testing(&tree);
+ mtree_destroy(&tree);
+
+
#if defined(BENCH)
skip:
#endif



2024-02-13 21:55:20

by Chuck Lever

Subject: [PATCH RFC 1/7] libfs: Rename "so_ctx"

From: Chuck Lever <[email protected]>

Most instances of "so_ctx" were renamed before the simple offset
work was merged, but a few were missed.

Signed-off-by: Chuck Lever <[email protected]>
---
fs/libfs.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index eec6031b0155..bfbe1a8c5d2d 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -271,7 +271,7 @@ void simple_offset_init(struct offset_ctx *octx)
* @octx: directory offset ctx to be updated
* @dentry: new dentry being added
*
- * Returns zero on success. @so_ctx and the dentry offset are updated.
+ * Returns zero on success. @octx and the dentry's offset are updated.
* Otherwise, a negative errno value is returned.
*/
int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
@@ -430,8 +430,8 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)

static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
{
- struct offset_ctx *so_ctx = inode->i_op->get_offset_ctx(inode);
- XA_STATE(xas, &so_ctx->xa, ctx->pos);
+ struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
+ XA_STATE(xas, &octx->xa, ctx->pos);
struct dentry *dentry;

while (true) {



2024-02-13 22:01:36

by Chuck Lever

Subject: [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree

From: Chuck Lever <[email protected]>

Test robot reports:
> kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
>
> commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

Feng Tang further clarifies that:
> ... the new simple_offset_add()
> called by shmem_mknod() brings extra cost related with slab,
> specifically the 'radix_tree_node', which cause the regression.

Willy's analysis is that, over time, the test workload causes
xa_alloc_cyclic() to fragment the underlying SLAB cache.

This patch replaces the offset_ctx's xarray with a Maple Tree in the
hope that Maple Tree's dense node mode will handle this scenario
more scalably.

In addition, we can widen the directory offset to an unsigned long
everywhere.

Suggested-by: Matthew Wilcox <[email protected]>
Reported-by: kernel test robot <[email protected]>
Closes: https://lore.kernel.org/oe-lkp/[email protected]
Signed-off-by: Chuck Lever <[email protected]>
---
fs/libfs.c | 53 ++++++++++++++++++++++++++--------------------------
include/linux/fs.h | 5 +++--
2 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/fs/libfs.c b/fs/libfs.c
index 3cf773950f93..f073e9aeb2bf 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -245,17 +245,17 @@ enum {
DIR_OFFSET_MIN = 2,
};

-static void offset_set(struct dentry *dentry, u32 offset)
+static void offset_set(struct dentry *dentry, unsigned long offset)
{
- dentry->d_fsdata = (void *)((uintptr_t)(offset));
+ dentry->d_fsdata = (void *)offset;
}

-static u32 dentry2offset(struct dentry *dentry)
+static unsigned long dentry2offset(struct dentry *dentry)
{
- return (u32)((uintptr_t)(dentry->d_fsdata));
+ return (unsigned long)dentry->d_fsdata;
}

-static struct lock_class_key simple_offset_xa_lock;
+static struct lock_class_key simple_offset_lock_class;

/**
* simple_offset_init - initialize an offset_ctx
@@ -264,8 +264,8 @@ static struct lock_class_key simple_offset_xa_lock;
*/
void simple_offset_init(struct offset_ctx *octx)
{
- xa_init_flags(&octx->xa, XA_FLAGS_ALLOC1);
- lockdep_set_class(&octx->xa.xa_lock, &simple_offset_xa_lock);
+ mt_init_flags(&octx->mt, MT_FLAGS_ALLOC_RANGE);
+ lockdep_set_class(&octx->mt.ma_lock, &simple_offset_lock_class);
octx->next_offset = DIR_OFFSET_MIN;
}

@@ -279,15 +279,14 @@ void simple_offset_init(struct offset_ctx *octx)
*/
int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
{
- static const struct xa_limit limit = XA_LIMIT(DIR_OFFSET_MIN, U32_MAX);
- u32 offset;
+ unsigned long offset;
int ret;

if (dentry2offset(dentry) != 0)
return -EBUSY;

- ret = xa_alloc_cyclic(&octx->xa, &offset, dentry, limit,
- &octx->next_offset, GFP_KERNEL);
+ ret = mtree_alloc_cyclic(&octx->mt, &offset, dentry, DIR_OFFSET_MIN,
+ ULONG_MAX, &octx->next_offset, GFP_KERNEL);
if (ret < 0)
return ret;

@@ -303,13 +302,13 @@ int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
*/
void simple_offset_remove(struct offset_ctx *octx, struct dentry *dentry)
{
- u32 offset;
+ unsigned long offset;

offset = dentry2offset(dentry);
if (offset == 0)
return;

- xa_erase(&octx->xa, offset);
+ mtree_erase(&octx->mt, offset);
offset_set(dentry, 0);
}

@@ -330,9 +329,9 @@ int simple_offset_empty(struct dentry *dentry)
if (!inode || !S_ISDIR(inode->i_mode))
return ret;

- index = 2;
+ index = DIR_OFFSET_MIN;
octx = inode->i_op->get_offset_ctx(inode);
- xa_for_each(&octx->xa, index, child) {
+ mt_for_each(&octx->mt, child, index, ULONG_MAX) {
spin_lock(&child->d_lock);
if (simple_positive(child)) {
spin_unlock(&child->d_lock);
@@ -362,8 +361,8 @@ int simple_offset_rename_exchange(struct inode *old_dir,
{
struct offset_ctx *old_ctx = old_dir->i_op->get_offset_ctx(old_dir);
struct offset_ctx *new_ctx = new_dir->i_op->get_offset_ctx(new_dir);
- u32 old_index = dentry2offset(old_dentry);
- u32 new_index = dentry2offset(new_dentry);
+ unsigned long old_index = dentry2offset(old_dentry);
+ unsigned long new_index = dentry2offset(new_dentry);
int ret;

simple_offset_remove(old_ctx, old_dentry);
@@ -389,9 +388,9 @@ int simple_offset_rename_exchange(struct inode *old_dir,

out_restore:
offset_set(old_dentry, old_index);
- xa_store(&old_ctx->xa, old_index, old_dentry, GFP_KERNEL);
+ mtree_store(&old_ctx->mt, old_index, old_dentry, GFP_KERNEL);
offset_set(new_dentry, new_index);
- xa_store(&new_ctx->xa, new_index, new_dentry, GFP_KERNEL);
+ mtree_store(&new_ctx->mt, new_index, new_dentry, GFP_KERNEL);
return ret;
}

@@ -404,7 +403,7 @@ int simple_offset_rename_exchange(struct inode *old_dir,
*/
void simple_offset_destroy(struct offset_ctx *octx)
{
- xa_destroy(&octx->xa);
+ mtree_destroy(&octx->mt);
}

/**
@@ -434,15 +433,15 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)

/* In this case, ->private_data is protected by f_pos_lock */
file->private_data = NULL;
- return vfs_setpos(file, offset, U32_MAX);
+ return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
}

-static struct dentry *offset_find_next(struct xa_state *xas)
+static struct dentry *offset_find_next(struct ma_state *mas)
{
struct dentry *child, *found = NULL;

rcu_read_lock();
- child = xas_next_entry(xas, U32_MAX);
+ child = mas_find(mas, ULONG_MAX);
if (!child)
goto out;
spin_lock(&child->d_lock);
@@ -456,7 +455,7 @@ static struct dentry *offset_find_next(struct xa_state *xas)

static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
{
- u32 offset = dentry2offset(dentry);
+ unsigned long offset = dentry2offset(dentry);
struct inode *inode = d_inode(dentry);

return ctx->actor(ctx, dentry->d_name.name, dentry->d_name.len, offset,
@@ -466,11 +465,11 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
{
struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
- XA_STATE(xas, &octx->xa, ctx->pos);
+ MA_STATE(mas, &octx->mt, ctx->pos, ctx->pos);
struct dentry *dentry;

while (true) {
- dentry = offset_find_next(&xas);
+ dentry = offset_find_next(&mas);
if (!dentry)
return ERR_PTR(-ENOENT);

@@ -480,7 +479,7 @@ static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
}

dput(dentry);
- ctx->pos = xas.xa_index + 1;
+ ctx->pos = mas.index + 1;
}
return NULL;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 03d141809a2c..55144c12ee0f 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -43,6 +43,7 @@
#include <linux/cred.h>
#include <linux/mnt_idmapping.h>
#include <linux/slab.h>
+#include <linux/maple_tree.h>

#include <asm/byteorder.h>
#include <uapi/linux/fs.h>
@@ -3260,8 +3261,8 @@ extern ssize_t simple_write_to_buffer(void *to, size_t available, loff_t *ppos,
const void __user *from, size_t count);

struct offset_ctx {
- struct xarray xa;
- u32 next_offset;
+ struct maple_tree mt;
+ unsigned long next_offset;
};

void simple_offset_init(struct offset_ctx *octx);



2024-02-13 22:15:05

by Chuck Lever

Subject: Re: [PATCH RFC 0/7] Use Maple Trees for simple_offset utilities



> On Feb 13, 2024, at 4:37 PM, Chuck Lever <[email protected]> wrote:
>
> In an effort to address slab fragmentation issues reported a few
> months ago, I've replaced the use of xarrays for the directory
> offset map in "simple" file systems (including tmpfs).
>
> This patch set passes functional testing and is ready for code
> review. But I don't have the facilities to re-run the performance
> tests that identified the regression. We expect the performance of
> this implementation will need additional improvement.
>
> Thanks to Liam Howlett for helping me get this working.

And note the patches are also available from:

https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git

in the simple-offset-maple branch.


> ---
>
> Chuck Lever (6):
> libfs: Rename "so_ctx"
> libfs: Define a minimum directory offset
> libfs: Add simple_offset_empty()
> maple_tree: Add mtree_alloc_cyclic()
> libfs: Convert simple directory offsets to use a Maple Tree
> libfs: Re-arrange locking in offset_iterate_dir()
>
> Liam R. Howlett (1):
> test_maple_tree: testing the cyclic allocation
>
>
> fs/libfs.c | 125 +++++++++++++++++++++++--------------
> include/linux/fs.h | 6 +-
> include/linux/maple_tree.h | 7 +++
> lib/maple_tree.c | 93 +++++++++++++++++++++++++++
> lib/test_maple_tree.c | 44 +++++++++++++
> mm/shmem.c | 4 +-
> 6 files changed, 227 insertions(+), 52 deletions(-)
>
> --
> Chuck Lever
>
>

--
Chuck Lever


2024-02-15 12:43:06

by Jan Kara

Subject: Re: [PATCH RFC 1/7] libfs: Rename "so_ctx"

On Tue 13-02-24 16:37:25, Chuck Lever wrote:
> From: Chuck Lever <[email protected]>
>
> Most instances of "so_ctx" were renamed before the simple offset
> work was merged, but a few were missed.
>
> Signed-off-by: Chuck Lever <[email protected]>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/libfs.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> diff --git a/fs/libfs.c b/fs/libfs.c
> index eec6031b0155..bfbe1a8c5d2d 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -271,7 +271,7 @@ void simple_offset_init(struct offset_ctx *octx)
> * @octx: directory offset ctx to be updated
> * @dentry: new dentry being added
> *
> - * Returns zero on success. @so_ctx and the dentry offset are updated.
> + * Returns zero on success. @octx and the dentry's offset are updated.
> * Otherwise, a negative errno value is returned.
> */
> int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
> @@ -430,8 +430,8 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
>
> static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
> {
> - struct offset_ctx *so_ctx = inode->i_op->get_offset_ctx(inode);
> - XA_STATE(xas, &so_ctx->xa, ctx->pos);
> + struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
> + XA_STATE(xas, &octx->xa, ctx->pos);
> struct dentry *dentry;
>
> while (true) {
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-02-15 12:47:30

by Jan Kara

Subject: Re: [PATCH RFC 2/7] libfs: Define a minimum directory offset

On Tue 13-02-24 16:37:32, Chuck Lever wrote:
> From: Chuck Lever <[email protected]>
>
> This value is used in several places, so make it a symbolic
> constant.
>
> Signed-off-by: Chuck Lever <[email protected]>

Looks good. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/libfs.c | 13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/fs/libfs.c b/fs/libfs.c
> index bfbe1a8c5d2d..a38af72f4719 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -240,6 +240,11 @@ const struct inode_operations simple_dir_inode_operations = {
> };
> EXPORT_SYMBOL(simple_dir_inode_operations);
>
> +/* 0 is '.', 1 is '..', so always start with offset 2 or more */
> +enum {
> + DIR_OFFSET_MIN = 2,
> +};
> +
> static void offset_set(struct dentry *dentry, u32 offset)
> {
> dentry->d_fsdata = (void *)((uintptr_t)(offset));
> @@ -261,9 +266,7 @@ void simple_offset_init(struct offset_ctx *octx)
> {
> xa_init_flags(&octx->xa, XA_FLAGS_ALLOC1);
> lockdep_set_class(&octx->xa.xa_lock, &simple_offset_xa_lock);
> -
> - /* 0 is '.', 1 is '..', so always start with offset 2 */
> - octx->next_offset = 2;
> + octx->next_offset = DIR_OFFSET_MIN;
> }
>
> /**
> @@ -276,7 +279,7 @@ void simple_offset_init(struct offset_ctx *octx)
> */
> int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry)
> {
> - static const struct xa_limit limit = XA_LIMIT(2, U32_MAX);
> + static const struct xa_limit limit = XA_LIMIT(DIR_OFFSET_MIN, U32_MAX);
> u32 offset;
> int ret;
>
> @@ -481,7 +484,7 @@ static int offset_readdir(struct file *file, struct dir_context *ctx)
> return 0;
>
> /* In this case, ->private_data is protected by f_pos_lock */
> - if (ctx->pos == 2)
> + if (ctx->pos == DIR_OFFSET_MIN)
> file->private_data = NULL;
> else if (file->private_data == ERR_PTR(-ENOENT))
> return 0;
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-02-15 13:16:58

by Jan Kara

Subject: Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()

On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> From: Chuck Lever <[email protected]>
>
> Liam says that, unlike with xarray, once the RCU read lock is
> released ma_state is not safe to re-use for the next mas_find() call.
> But the RCU read lock has to be released on each loop iteration so
> that dput() can be called safely.
>
> Thus we are forced to walk the offset tree with fresh state for each
> directory entry. mt_find() can do this for us, though it might be a
> little less efficient than maintaining ma_state locally.
>
> Since offset_iterate_dir() doesn't build ma_state locally any more,
> there's no longer a strong need for offset_find_next(). Clean up by
> rolling these two helpers together.
>
> Signed-off-by: Chuck Lever <[email protected]>

Well, in general I think even xas_next_entry() is not safe to use how
offset_find_next() was using it. Once you drop rcu_read_lock(),
xas->xa_node could go stale. But since you're holding inode->i_rwsem when
using offset_find_next() you should be protected from concurrent
modifications of the mapping (whatever the underlying data structure is) -
that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
maple tree? Am I missing something?

Honza

> ---
> fs/libfs.c | 39 +++++++++++++++++----------------------
> 1 file changed, 17 insertions(+), 22 deletions(-)
>
> diff --git a/fs/libfs.c b/fs/libfs.c
> index f073e9aeb2bf..6e01fde1cf95 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -436,23 +436,6 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
> }
>
> -static struct dentry *offset_find_next(struct ma_state *mas)
> -{
> - struct dentry *child, *found = NULL;
> -
> - rcu_read_lock();
> - child = mas_find(mas, ULONG_MAX);
> - if (!child)
> - goto out;
> - spin_lock(&child->d_lock);
> - if (simple_positive(child))
> - found = dget_dlock(child);
> - spin_unlock(&child->d_lock);
> -out:
> - rcu_read_unlock();
> - return found;
> -}
> -
> static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
> {
> unsigned long offset = dentry2offset(dentry);
> @@ -465,13 +448,22 @@ static bool offset_dir_emit(struct dir_context *ctx, struct dentry *dentry)
> static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
> {
> struct offset_ctx *octx = inode->i_op->get_offset_ctx(inode);
> - MA_STATE(mas, &octx->mt, ctx->pos, ctx->pos);
> - struct dentry *dentry;
> + struct dentry *dentry, *found;
> + unsigned long offset;
>
> + offset = ctx->pos;
> while (true) {
> - dentry = offset_find_next(&mas);
> + found = mt_find(&octx->mt, &offset, ULONG_MAX);
> + if (!found)
> + goto out_noent;
> +
> + dentry = NULL;
> + spin_lock(&found->d_lock);
> + if (simple_positive(found))
> + dentry = dget_dlock(found);
> + spin_unlock(&found->d_lock);
> if (!dentry)
> - return ERR_PTR(-ENOENT);
> + goto out_noent;
>
> if (!offset_dir_emit(ctx, dentry)) {
> dput(dentry);
> @@ -479,9 +471,12 @@ static void *offset_iterate_dir(struct inode *inode, struct dir_context *ctx)
> }
>
> dput(dentry);
> - ctx->pos = mas.index + 1;
> + ctx->pos = offset;
> }
> return NULL;
> +
> +out_noent:
> + return ERR_PTR(-ENOENT);
> }
>
> /**
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-02-15 13:39:58

by Jan Kara

Subject: Re: [PATCH RFC 3/7] libfs: Add simple_offset_empty()

On Tue 13-02-24 16:37:39, Chuck Lever wrote:
> From: Chuck Lever <[email protected]>
>
> For simple filesystems that use directory offset mapping, rely
> strictly on the directory offset map to tell when a directory has
> no children.
>
> After this patch is applied, the emptiness test holds only the RCU
> read lock when the directory being tested has no children.
>
> In addition, this adds another layer of confirmation that
> simple_offset_add/remove() are working as expected.
>
> Signed-off-by: Chuck Lever <[email protected]>

Makes sense. Feel free to add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/libfs.c | 32 ++++++++++++++++++++++++++++++++
> include/linux/fs.h | 1 +
> mm/shmem.c | 4 ++--
> 3 files changed, 35 insertions(+), 2 deletions(-)
>
> diff --git a/fs/libfs.c b/fs/libfs.c
> index a38af72f4719..3cf773950f93 100644
> --- a/fs/libfs.c
> +++ b/fs/libfs.c
> @@ -313,6 +313,38 @@ void simple_offset_remove(struct offset_ctx *octx, struct dentry *dentry)
> offset_set(dentry, 0);
> }
>
> +/**
> + * simple_offset_empty - Check if a dentry can be unlinked
> + * @dentry: dentry to be tested
> + *
> + * Returns 0 if @dentry is a non-empty directory; otherwise returns 1.
> + */
> +int simple_offset_empty(struct dentry *dentry)
> +{
> + struct inode *inode = d_inode(dentry);
> + struct offset_ctx *octx;
> + struct dentry *child;
> + unsigned long index;
> + int ret = 1;
> +
> + if (!inode || !S_ISDIR(inode->i_mode))
> + return ret;
> +
> + index = 2;
> + octx = inode->i_op->get_offset_ctx(inode);
> + xa_for_each(&octx->xa, index, child) {
> + spin_lock(&child->d_lock);
> + if (simple_positive(child)) {
> + spin_unlock(&child->d_lock);
> + ret = 0;
> + break;
> + }
> + spin_unlock(&child->d_lock);
> + }
> +
> + return ret;
> +}
> +
> /**
> * simple_offset_rename_exchange - exchange rename with directory offsets
> * @old_dir: parent of dentry being moved
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ed5966a70495..03d141809a2c 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3267,6 +3267,7 @@ struct offset_ctx {
> void simple_offset_init(struct offset_ctx *octx);
> int simple_offset_add(struct offset_ctx *octx, struct dentry *dentry);
> void simple_offset_remove(struct offset_ctx *octx, struct dentry *dentry);
> +int simple_offset_empty(struct dentry *dentry);
> int simple_offset_rename_exchange(struct inode *old_dir,
> struct dentry *old_dentry,
> struct inode *new_dir,
> diff --git a/mm/shmem.c b/mm/shmem.c
> index d7c84ff62186..6fed524343cb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -3374,7 +3374,7 @@ static int shmem_unlink(struct inode *dir, struct dentry *dentry)
>
> static int shmem_rmdir(struct inode *dir, struct dentry *dentry)
> {
> - if (!simple_empty(dentry))
> + if (!simple_offset_empty(dentry))
> return -ENOTEMPTY;
>
> drop_nlink(d_inode(dentry));
> @@ -3431,7 +3431,7 @@ static int shmem_rename2(struct mnt_idmap *idmap,
> return simple_offset_rename_exchange(old_dir, old_dentry,
> new_dir, new_dentry);
>
> - if (!simple_empty(new_dentry))
> + if (!simple_offset_empty(new_dentry))
> return -ENOTEMPTY;
>
> if (flags & RENAME_WHITEOUT) {
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-02-15 13:50:26

by Jan Kara

Subject: Re: [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree

On Tue 13-02-24 16:38:01, Chuck Lever wrote:
> From: Chuck Lever <[email protected]>
>
> Test robot reports:
> > kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
> >
> > commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> Feng Tang further clarifies that:
> > ... the new simple_offset_add()
> > called by shmem_mknod() brings extra cost related with slab,
> > specifically the 'radix_tree_node', which cause the regression.
>
> Willy's analysis is that, over time, the test workload causes
> xa_alloc_cyclic() to fragment the underlying SLAB cache.
>
> This patch replaces the offset_ctx's xarray with a Maple Tree in the
> hope that Maple Tree's dense node mode will handle this scenario
> more scalably.
>
> In addition, we can widen the directory offset to an unsigned long
> everywhere.
>
> Suggested-by: Matthew Wilcox <[email protected]>
> Reported-by: kernel test robot <[email protected]>
> Closes: https://lore.kernel.org/oe-lkp/[email protected]
> Signed-off-by: Chuck Lever <[email protected]>

OK, but this will need the performance numbers. Otherwise we have no idea
whether this is worth it or not. Maybe you can ask Oliver Sang? Usually
0-day guys are quite helpful.

> @@ -330,9 +329,9 @@ int simple_offset_empty(struct dentry *dentry)
> if (!inode || !S_ISDIR(inode->i_mode))
> return ret;
>
> - index = 2;
> + index = DIR_OFFSET_MIN;

This bit should go into the simple_offset_empty() patch...

> @@ -434,15 +433,15 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
>
> /* In this case, ->private_data is protected by f_pos_lock */
> file->private_data = NULL;
> - return vfs_setpos(file, offset, U32_MAX);
> + return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
^^^
Why this? It is ULONG_MAX << PAGE_SHIFT on 32-bit so that doesn't seem
quite right? Why not use ULONG_MAX here directly?

Otherwise the patch looks good to me.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-02-15 14:12:05

by Jan Kara

Subject: Re: [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree

On Thu 15-02-24 08:45:33, Chuck Lever wrote:
> On Thu, Feb 15, 2024 at 02:06:01PM +0100, Jan Kara wrote:
> > On Tue 13-02-24 16:38:01, Chuck Lever wrote:
> > > From: Chuck Lever <[email protected]>
> > >
> > > Test robot reports:
> > > > kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
> > > >
> > > > commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> > > Feng Tang further clarifies that:
> > > > ... the new simple_offset_add()
> > > > called by shmem_mknod() brings extra cost related with slab,
> > > > specifically the 'radix_tree_node', which cause the regression.
> > >
> > > Willy's analysis is that, over time, the test workload causes
> > > xa_alloc_cyclic() to fragment the underlying SLAB cache.
> > >
> > > This patch replaces the offset_ctx's xarray with a Maple Tree in the
> > > hope that Maple Tree's dense node mode will handle this scenario
> > > more scalably.
> > >
> > > In addition, we can widen the directory offset to an unsigned long
> > > everywhere.
> > >
> > > Suggested-by: Matthew Wilcox <[email protected]>
> > > Reported-by: kernel test robot <[email protected]>
> > > Closes: https://lore.kernel.org/oe-lkp/[email protected]
> > > Signed-off-by: Chuck Lever <[email protected]>
> >
> > OK, but this will need the performance numbers.
>
> Yes, I totally concur. The point of this posting was to get some
> early review and start the ball rolling.
>
> Actually we expect roughly the same performance numbers now. "Dense
> node" support in Maple Tree is supposed to be the real win, but
> I'm not sure it's ready yet.
>
>
> > Otherwise we have no idea
> > whether this is worth it or not. Maybe you can ask Oliver Sang? Usually
> > 0-day guys are quite helpful.
>
> Oliver and Feng were copied on this series.
>
>
> > > @@ -330,9 +329,9 @@ int simple_offset_empty(struct dentry *dentry)
> > > if (!inode || !S_ISDIR(inode->i_mode))
> > > return ret;
> > >
> > > - index = 2;
> > > + index = DIR_OFFSET_MIN;
> >
> > This bit should go into the simple_offset_empty() patch...
> >
> > > @@ -434,15 +433,15 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> > >
> > > /* In this case, ->private_data is protected by f_pos_lock */
> > > file->private_data = NULL;
> > > - return vfs_setpos(file, offset, U32_MAX);
> > > + return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
> > ^^^
> > Why this? It is ULONG_MAX << PAGE_SHIFT on 32-bit so that doesn't seem
> > quite right? Why not use ULONG_MAX here directly?
>
> I initially changed U32_MAX to ULONG_MAX, but for some reason, the
> length checking in vfs_setpos() fails. There is probably a sign
> extension thing happening here that I don't understand.

Right. loff_t is signed (long long). So I think you should make 'offset'
a long instead of an unsigned long and allow values 0..LONG_MAX? Then you
can pass LONG_MAX here. You potentially lose half of the usable offsets
on 32-bit userspace with 64-bit file offsets, but who cares I guess?
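
For illustration, a tiny userspace sketch of the sign issue (assuming
64-bit longs and vfs_setpos()'s "offset > maxsize" comparison):

#include <stdio.h>

int main(void)
{
	/* ULONG_MAX converted to a signed 64-bit type wraps to -1 */
	long long maxsize = (long long)(unsigned long)-1;
	long long offset = 42;

	printf("maxsize = %lld\n", maxsize);		/* -1 on 64-bit */
	printf("rejected = %d\n", offset > maxsize);	/* 1: every positive offset fails */
	return 0;
}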

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-02-15 14:18:49

by Chuck Lever

Subject: Re: [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree

On Thu, Feb 15, 2024 at 02:06:01PM +0100, Jan Kara wrote:
> On Tue 13-02-24 16:38:01, Chuck Lever wrote:
> > From: Chuck Lever <[email protected]>
> >
> > Test robot reports:
> > > kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
> > >
> > > commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > Feng Tang further clarifies that:
> > > ... the new simple_offset_add()
> > > called by shmem_mknod() brings extra cost related with slab,
> > > specifically the 'radix_tree_node', which cause the regression.
> >
> > Willy's analysis is that, over time, the test workload causes
> > xa_alloc_cyclic() to fragment the underlying SLAB cache.
> >
> > This patch replaces the offset_ctx's xarray with a Maple Tree in the
> > hope that Maple Tree's dense node mode will handle this scenario
> > more scalably.
> >
> > In addition, we can widen the directory offset to an unsigned long
> > everywhere.
> >
> > Suggested-by: Matthew Wilcox <[email protected]>
> > Reported-by: kernel test robot <[email protected]>
> > Closes: https://lore.kernel.org/oe-lkp/[email protected]
> > Signed-off-by: Chuck Lever <[email protected]>
>
> OK, but this will need the performance numbers.

Yes, I totally concur. The point of this posting was to get some
early review and start the ball rolling.

Actually we expect roughly the same performance numbers now. "Dense
node" support in Maple Tree is supposed to be the real win, but
I'm not sure it's ready yet.


> Otherwise we have no idea
> whether this is worth it or not. Maybe you can ask Oliver Sang? Usually
> 0-day guys are quite helpful.

Oliver and Feng were copied on this series.


> > @@ -330,9 +329,9 @@ int simple_offset_empty(struct dentry *dentry)
> > if (!inode || !S_ISDIR(inode->i_mode))
> > return ret;
> >
> > - index = 2;
> > + index = DIR_OFFSET_MIN;
>
> This bit should go into the simple_offset_empty() patch...
>
> > @@ -434,15 +433,15 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> >
> > /* In this case, ->private_data is protected by f_pos_lock */
> > file->private_data = NULL;
> > - return vfs_setpos(file, offset, U32_MAX);
> > + return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
> ^^^
> Why this? It is ULONG_MAX << PAGE_SHIFT on 32-bit so that doesn't seem
> quite right? Why not use ULONG_MAX here directly?

I initially changed U32_MAX to ULONG_MAX, but for some reason, the
length checking in vfs_setpos() fails. There is probably a sign
extension thing happening here that I don't understand.


> Otherwise the patch looks good to me.

As always, thank you for your review.


--
Chuck Lever

2024-02-15 17:00:59

by Liam R. Howlett

Subject: Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()

* Jan Kara <[email protected]> [240215 08:16]:
> On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > From: Chuck Lever <[email protected]>
> >
> > Liam says that, unlike with xarray, once the RCU read lock is
> > released ma_state is not safe to re-use for the next mas_find() call.
> > But the RCU read lock has to be released on each loop iteration so
> > that dput() can be called safely.
> >
> > Thus we are forced to walk the offset tree with fresh state for each
> > directory entry. mt_find() can do this for us, though it might be a
> > little less efficient than maintaining ma_state locally.
> >
> > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > there's no longer a strong need for offset_find_next(). Clean up by
> > rolling these two helpers together.
> >
> > Signed-off-by: Chuck Lever <[email protected]>
>
> Well, in general I think even xas_next_entry() is not safe to use how
> offset_find_next() was using it. Once you drop rcu_read_lock(),
> xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> using offset_find_next() you should be protected from concurrent
> modifications of the mapping (whatever the underlying data structure is) -
> that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> maple tree? Am I missing something?

If you are stopping, you should be pausing the iteration. Although this
works today, it's not how it should be used because if we make changes
(ie: compaction requires movement of data), then you may end up with a
UAF issue. We'd have no way of knowing you are depending on the tree
structure to remain consistent.

IOW the inode->i_rwsem is protecting writes of data but not the
structure holding the data.

This is true for both xarray and maple tree.
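
As a sketch of what "pausing the iteration" looks like with the maple
tree API (illustrative only; "paused_next" is a made-up helper):

#include <linux/maple_tree.h>
#include <linux/dcache.h>
#include <linux/rcupdate.h>

/*
 * Pause the state before dropping the RCU read lock, so the next
 * mas_find() revalidates from the tree root instead of trusting
 * cached node pointers.
 */
static struct dentry *paused_next(struct ma_state *mas)
{
	struct dentry *child;

	rcu_read_lock();
	child = mas_find(mas, ULONG_MAX);
	if (child)
		mas_pause(mas);
	rcu_read_unlock();
	return child;
}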

Thanks,
Liam


2024-02-15 17:16:37

by Jan Kara

Subject: Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()

On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> * Jan Kara <[email protected]> [240215 08:16]:
> > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > From: Chuck Lever <[email protected]>
> > >
> > > Liam says that, unlike with xarray, once the RCU read lock is
> > > released ma_state is not safe to re-use for the next mas_find() call.
> > > But the RCU read lock has to be released on each loop iteration so
> > > that dput() can be called safely.
> > >
> > > Thus we are forced to walk the offset tree with fresh state for each
> > > directory entry. mt_find() can do this for us, though it might be a
> > > little less efficient than maintaining ma_state locally.
> > >
> > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > there's no longer a strong need for offset_find_next(). Clean up by
> > > rolling these two helpers together.
> > >
> > > Signed-off-by: Chuck Lever <[email protected]>
> >
> > Well, in general I think even xas_next_entry() is not safe to use how
> > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > using offset_find_next() you should be protected from concurrent
> > modifications of the mapping (whatever the underlying data structure is) -
> > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > maple tree? Am I missing something?
>
> If you are stopping, you should be pausing the iteration. Although this
> works today, it's not how it should be used because if we make changes
> (ie: compaction requires movement of data), then you may end up with a
> UAF issue. We'd have no way of knowing you are depending on the tree
> structure to remain consistent.

I see. But we have versions of these structures that have locking external
to the structure itself, don't we? Then how do you imagine serializing the
background operations like compaction? As much as I agree your argument is
"theoretically clean", it seems a bit like a trap and there are definitely
xarray users that are going to be broken by this (e.g.
tag_pages_for_writeback())...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-02-15 18:06:12

by Chuck Lever

Subject: Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()

On Thu, Feb 15, 2024 at 12:00:08PM -0500, Liam R. Howlett wrote:
> * Jan Kara <[email protected]> [240215 08:16]:
> > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > From: Chuck Lever <[email protected]>
> > >
> > > Liam says that, unlike with xarray, once the RCU read lock is
> > > released ma_state is not safe to re-use for the next mas_find() call.
> > > But the RCU read lock has to be released on each loop iteration so
> > > that dput() can be called safely.
> > >
> > > Thus we are forced to walk the offset tree with fresh state for each
> > > directory entry. mt_find() can do this for us, though it might be a
> > > little less efficient than maintaining ma_state locally.
> > >
> > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > there's no longer a strong need for offset_find_next(). Clean up by
> > > rolling these two helpers together.
> > >
> > > Signed-off-by: Chuck Lever <[email protected]>
> >
> > Well, in general I think even xas_next_entry() is not safe to use how
> > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > using offset_find_next() you should be protected from concurrent
> > modifications of the mapping (whatever the underlying data structure is) -
> > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > maple tree? Am I missing something?
>
> If you are stopping, you should be pausing the iteration. Although this
> works today, it's not how it should be used because if we make changes
> (ie: compaction requires movement of data), then you may end up with a
> UAF issue. We'd have no way of knowing you are depending on the tree
> structure to remain consistent.
>
> IOW the inode->i_rwsem is protecting writes of data but not the
> structure holding the data.
>
> This is true for both xarray and maple tree.

Would it be appropriate to reorder this series so 7/7 comes before
the transition to use Maple Tree?


--
Chuck Lever

2024-02-15 21:08:18

by Liam R. Howlett

Subject: Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()

* Jan Kara <[email protected]> [240215 12:16]:
> On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> > * Jan Kara <[email protected]> [240215 08:16]:
> > > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > > From: Chuck Lever <[email protected]>
> > > >
> > > > Liam says that, unlike with xarray, once the RCU read lock is
> > > > released ma_state is not safe to re-use for the next mas_find() call.
> > > > But the RCU read lock has to be released on each loop iteration so
> > > > that dput() can be called safely.
> > > >
> > > > Thus we are forced to walk the offset tree with fresh state for each
> > > > directory entry. mt_find() can do this for us, though it might be a
> > > > little less efficient than maintaining ma_state locally.
> > > >
> > > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > > there's no longer a strong need for offset_find_next(). Clean up by
> > > > rolling these two helpers together.
> > > >
> > > > Signed-off-by: Chuck Lever <[email protected]>
> > >
> > > Well, in general I think even xas_next_entry() is not safe to use how
> > > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > > using offset_find_next() you should be protected from concurrent
> > > modifications of the mapping (whatever the underlying data structure is) -
> > > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > > maple tree? Am I missing something?
> >
> > If you are stopping, you should be pausing the iteration. Although this
> > works today, it's not how it should be used because if we make changes
> > (ie: compaction requires movement of data), then you may end up with a
> > UAF issue. We'd have no way of knowing you are depending on the tree
> > structure to remain consistent.
>
> I see. But we have versions of these structures that have locking external
> to the structure itself, don't we?

Ah, I do have them - but I don't want to propagate their use, as the dream
is that they can be removed.


> Then how do you imagine serializing the
> background operations like compaction? As much as I agree your argument is
> "theoretically clean", it seems a bit like a trap and there are definitely
> xarray users that are going to be broken by this (e.g.
> tag_pages_for_writeback())...

I'm not sure I follow the trap logic. There are locks for the data
structure that need to be followed for reading (rcu) and writing
(spinlock for the maple tree). If you don't correctly lock the data
structure then you really are setting yourself up for potential issues
in the future.

The documentation outlines the limitations on how and when to lock. I'm
not familiar with the xarray users, but it does check for locking with
lockdep; the way this code is written bypasses that lockdep checking,
since the locks are taken and dropped without the proper scope.

If you feel like this is a trap, then maybe we need to figure out a new
plan to detect incorrect use?

Looking through tag_pages_for_writeback(), it does what is necessary to
keep a safe state: before it unlocks, it calls xas_pause(). We have the
same on the maple tree: mas_pause(). This restarts the next operation
from the root of the tree (the root can also change), to ensure that it
is safe.

If you have other examples you think are unsafe then I can have a look
at them as well.

You can make the existing code safe by also calling xas_pause() before
the rcu lock is dropped, but that is essentially what Chuck has done in
the maple tree conversion by using mt_find().
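
As a sketch, that xas_pause() fix to the original xarray helper would
look something like this (illustrative only, based on the pre-conversion
offset_find_next(); not a proposed patch):

#include <linux/xarray.h>
#include <linux/dcache.h>
#include <linux/rcupdate.h>
#include <linux/spinlock.h>

static struct dentry *offset_find_next_paused(struct xa_state *xas)
{
	struct dentry *child, *found = NULL;

	rcu_read_lock();
	child = xas_next_entry(xas, U32_MAX);
	if (!child)
		goto out;
	spin_lock(&child->d_lock);
	if (simple_positive(child))
		found = dget_dlock(child);
	spin_unlock(&child->d_lock);
out:
	xas_pause(xas);		/* keep the state safe to reuse after unlocking */
	rcu_read_unlock();
	return found;
}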

Regarding compaction, I would expect the write lock to be taken to avoid
any writes happening while compaction occurs. Readers would use rcu
locking to ensure they return either the old or new value. During the
write critical section, other writers would be in a
"mas_pause()/xas_pause()" state - so once they continue, they will
re-start the walk to the next element in the new tree.

If external locks are used, then compaction would be sub-optimal as it
may unnecessarily hold up readers, or partial write work (before the
point of a store into the maple tree).

Thanks,
Liam



2024-02-15 21:09:26

by Liam R. Howlett

Subject: Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()

* Chuck Lever <[email protected]> [240215 12:40]:
> On Thu, Feb 15, 2024 at 12:00:08PM -0500, Liam R. Howlett wrote:
> > * Jan Kara <[email protected]> [240215 08:16]:
> > > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > > From: Chuck Lever <[email protected]>
> > > >
> > > > Liam says that, unlike with xarray, once the RCU read lock is
> > > > released ma_state is not safe to re-use for the next mas_find() call.
> > > > But the RCU read lock has to be released on each loop iteration so
> > > > that dput() can be called safely.
> > > >
> > > > Thus we are forced to walk the offset tree with fresh state for each
> > > > directory entry. mt_find() can do this for us, though it might be a
> > > > little less efficient than maintaining ma_state locally.
> > > >
> > > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > > there's no longer a strong need for offset_find_next(). Clean up by
> > > > rolling these two helpers together.
> > > >
> > > > Signed-off-by: Chuck Lever <[email protected]>
> > >
> > > Well, in general I think even xas_next_entry() is not safe to use how
> > > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > > using offset_find_next() you should be protected from concurrent
> > > modifications of the mapping (whatever the underlying data structure is) -
> > > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > > maple tree? Am I missing something?
> >
> > If you are stopping, you should be pausing the iteration. Although this
> > works today, it's not how it should be used because if we make changes
> > (ie: compaction requires movement of data), then you may end up with a
> > UAF issue. We'd have no way of knowing you are depending on the tree
> > structure to remain consistent.
> >
> > IOW the inode->i_rwsem is protecting writes of data but not the
> > structure holding the data.
> >
> > This is true for both xarray and maple tree.
>
> Would it be appropriate to reorder this series so 7/7 comes before
> the transition to use Maple Tree?

I think it would, yes.

Thanks,
Liam

2024-02-16 10:16:09

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()

On Thu 15-02-24 16:07:42, Liam R. Howlett wrote:
> * Jan Kara <[email protected]> [240215 12:16]:
> > On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> > > * Jan Kara <[email protected]> [240215 08:16]:
> > > > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > > > From: Chuck Lever <[email protected]>
> > > > >
> > > > > Liam says that, unlike with xarray, once the RCU read lock is
> > > > > released ma_state is not safe to re-use for the next mas_find() call.
> > > > > But the RCU read lock has to be released on each loop iteration so
> > > > > that dput() can be called safely.
> > > > >
> > > > > Thus we are forced to walk the offset tree with fresh state for each
> > > > > directory entry. mt_find() can do this for us, though it might be a
> > > > > little less efficient than maintaining ma_state locally.
> > > > >
> > > > > Since offset_iterate_dir() doesn't build ma_state locally any more,
> > > > > there's no longer a strong need for offset_find_next(). Clean up by
> > > > > rolling these two helpers together.
> > > > >
> > > > > Signed-off-by: Chuck Lever <[email protected]>
> > > >
> > > > Well, in general I think even xas_next_entry() is not safe to use how
> > > > offset_find_next() was using it. Once you drop rcu_read_lock(),
> > > > xas->xa_node could go stale. But since you're holding inode->i_rwsem when
> > > > using offset_find_next() you should be protected from concurrent
> > > > modifications of the mapping (whatever the underlying data structure is) -
> > > > that's what makes xas_next_entry() safe AFAIU. Isn't that enough for the
> > > > maple tree? Am I missing something?
> > >
> > > If you are stopping, you should be pausing the iteration. Although this
> > > works today, it's not how it should be used because if we make changes
> > > (ie: compaction requires movement of data), then you may end up with a
> > > UAF issue. We'd have no way of knowing you are depending on the tree
> > > structure to remain consistent.
> >
> > I see. But we have versions of these structures that have locking external
> > to the structure itself, don't we?
>
> Ah, I do have them - but I don't want to propagate their use, as the dream
> is that they can be removed.
>
>
> > Then how do you imagine serializing the
> > background operations like compaction? As much as I agree your argument is
> > "theoretically clean", it seems a bit like a trap and there are definitely
> > xarray users that are going to be broken by this (e.g.
> > tag_pages_for_writeback())...
>
> I'm not sure I follow the trap logic. There are locks for the data
> structure that need to be followed for reading (rcu) and writing
> (spinlock for the maple tree). If you don't correctly lock the data
> structure then you really are setting yourself up for potential issues
> in the future.
>
> The limitations are outlined in the documentation as to how and when to
> lock. I'm not familiar with the xarray users, but it does check for
> locking with lockdep, but the way this is written bypasses the lockdep
> checking as the locks are taken and dropped without the proper scope.
>
> If you feel like this is a trap, then maybe we need to figure out a new
> plan to detect incorrect use?

OK, I was a bit imprecise. What I wanted to say is that this is a shift in
the paradigm in the sense that previously, we mostly had (and still have)
data structure APIs (lists, rb-trees, radix-tree, now xarray) that were
guaranteeing that unless you call into the function to mutate the data
structure it stays intact. Now maple trees are shifting more in a direction
of black-box API where you cannot assume what happens inside. Which is fine
but then we have e.g. these iterators which do not quite follow this
black-box design and you have to remember subtle details like calling
"mas_pause()" before unlocking which is IMHO error-prone. Ideally, users of
the black-box API shouldn't be exposed to the details of the internal
locking at all (but then the performance suffers so I understand why you do
things this way). Second to this ideal variant would be if we could detect
we unlocked the lock without calling xas_pause() and warn on that. Or maybe
xas_unlock*() should be calling xas_pause() automagically and we'd have
similar helpers for RCU to do the magic for you?
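
For illustration only, the "automagic" variant might be a trivial
wrapper along these lines (hypothetical, not an existing xarray API):

	/* Hypothetical helper: always park the iteration state before
	 * the lock protecting the xarray is dropped. */
	static inline void xas_pause_unlock(struct xa_state *xas)
	{
		xas_pause(xas);
		xas_unlock(xas);
	}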

> Looking through tag_pages_for_writeback(), it does what is necessary to
> keep a safe state - before it unlocks it calls xas_pause(). We have the
> same on maple tree; mas_pause(). This will restart the next operation
> from the root of the tree (the root can also change), to ensure that it
> is safe.

OK, I've missed the xas_pause(). Thanks for correcting me.

> If you have other examples you think are unsafe then I can have a look
> at them as well.

I'm currently not aware of any but I'll let you know if I find some.
Missing xas/mas_pause() seems really easy.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2024-02-16 15:15:24

by Christian Brauner

[permalink] [raw]
Subject: Re: [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree

On Thu, Feb 15, 2024 at 08:45:33AM -0500, Chuck Lever wrote:
> On Thu, Feb 15, 2024 at 02:06:01PM +0100, Jan Kara wrote:
> > On Tue 13-02-24 16:38:01, Chuck Lever wrote:
> > > From: Chuck Lever <[email protected]>
> > >
> > > Test robot reports:
> > > > kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
> > > >
> > > > commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> > > Feng Tang further clarifies that:
> > > > ... the new simple_offset_add()
> > > > called by shmem_mknod() brings extra cost related with slab,
> > > > specifically the 'radix_tree_node', which cause the regression.
> > >
> > > Willy's analysis is that, over time, the test workload causes
> > > xa_alloc_cyclic() to fragment the underlying SLAB cache.
> > >
> > > This patch replaces the offset_ctx's xarray with a Maple Tree in the
> > > hope that Maple Tree's dense node mode will handle this scenario
> > > more scalably.
> > >
> > > In addition, we can widen the directory offset to an unsigned long
> > > everywhere.
> > >
> > > Suggested-by: Matthew Wilcox <[email protected]>
> > > Reported-by: kernel test robot <[email protected]>
> > > Closes: https://lore.kernel.org/oe-lkp/[email protected]
> > > Signed-off-by: Chuck Lever <[email protected]>
> >
> > OK, but this will need the performance numbers.
>
> Yes, I totally concur. The point of this posting was to get some
> early review and start the ball rolling.
>
> Actually we expect roughly the same performance numbers now. "Dense
> node" support in Maple Tree is supposed to be the real win, but
> I'm not sure it's ready yet.

I keep repeating this, but we need a better way to request performance
tests for specific series/branches. Maybe I can add a vfs.perf branch
where we can put patches that we suspect have a positive/negative perf
impact, and the perf bot can pull that in. I know that's a fuzzy
boundary, but for stuff like this, where we already know there's a perf
impact that matters to us, it would really help.

Because it is royally annoying to get a perf regression report after a
patch has already been in -next for a long time and the merge window is
coming up, or we have already merged that stuff.

2024-02-16 15:59:35

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()

On Fri, Feb 16, 2024 at 11:15:46AM +0100, Jan Kara wrote:
> On Thu 15-02-24 16:07:42, Liam R. Howlett wrote:
> > The limitations are outlined in the documentation as to how and when to
> > lock. I'm not familiar with the xarray users, but it does check for
> > locking with lockdep, but the way this is written bypasses the lockdep
> > checking as the locks are taken and dropped without the proper scope.
> >
> > If you feel like this is a trap, then maybe we need to figure out a new
> > plan to detect incorrect use?
>
> OK, I was a bit imprecise. What I wanted to say is that this is a shift in
> the paradigm in the sense that previously, we mostly had (and still have)
> data structure APIs (lists, rb-trees, radix-tree, now xarray) that were
> guaranteeing that unless you call into the function to mutate the data
> structure it stays intact.

hm, no. The radix tree never guaranteed that to you; I just documented
that it wasn't guaranteed for the XArray.

> Now maple trees are shifting more in a direction
> of black-box API where you cannot assume what happens inside. Which is fine
> but then we have e.g. these iterators which do not quite follow this
> black-box design and you have to remember subtle details like calling
> "mas_pause()" before unlocking which is IMHO error-prone. Ideally, users of
> the black-box API shouldn't be exposed to the details of the internal
> locking at all (but then the performance suffers so I understand why you do
> things this way). Second to this ideal variant would be if we could detect
> we unlocked the lock without calling xas_pause() and warn on that. Or maybe
> xas_unlock*() should be calling xas_pause() automagically and we'd have
> similar helpers for RCU to do the magic for you?

If you're unlocking the lock that protects a data structure while still
using that data structure, you should always be aware that you're doing
something very dangerous! It's no different from calling inode_unlock()
inside a filesystem. Sure, you can do it, but you'd better be ready to
deal with the consequences.

The question is, do we want to be able to defragment slabs or not?
My thinking is "yes", for objects where we can ensure there are no
current users (at least after an RCU grace period), we want to be able
to move them. That does impose certain costs (and subtleties), but just
like fast-GUP and lockless page-cache, I think it's worth doing.

Of course, we don't have slab defragmentation yet, so we're not getting
any benefit from this. The most recent attempt was in 2019:
https://lore.kernel.org/linux-mm/[email protected]/
but there were earlier attempts in 2017:
https://lore.kernel.org/linux-mm/[email protected]/
and 2008:
https://lore.kernel.org/linux-mm/[email protected]/

so I have to believe it's something we want, just haven't been able to
push across the "merge this now" line.

2024-02-16 16:42:32

by Liam R. Howlett

[permalink] [raw]
Subject: Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()

* Jan Kara <[email protected]> [240216 05:15]:
> On Thu 15-02-24 16:07:42, Liam R. Howlett wrote:
> > * Jan Kara <[email protected]> [240215 12:16]:
> > > On Thu 15-02-24 12:00:08, Liam R. Howlett wrote:
> > > > * Jan Kara <[email protected]> [240215 08:16]:
> > > > > On Tue 13-02-24 16:38:08, Chuck Lever wrote:
> > > > > > From: Chuck Lever <[email protected]>
> > > > > >
> > > > > > Liam says that, unlike with xarray, once the RCU read lock is
> > > > > > released ma_state is not safe to re-use for the next mas_find() call.
> > > > > > But the RCU read lock has to be released on each loop iteration so
> > > > > > that dput() can be called safely.
> > > > > >

..

> >
> > > Then how do you imagine serializing the
> > > background operations like compaction? As much as I agree your argument is
> > > "theoretically clean", it seems a bit like a trap and there are definitely
> > > xarray users that are going to be broken by this (e.g.
> > > tag_pages_for_writeback())...
> >
> > I'm not sure I follow the trap logic. There are locks for the data
> > structure that need to be followed for reading (rcu) and writing
> > (spinlock for the maple tree). If you don't correctly lock the data
> > structure then you really are setting yourself up for potential issues
> > in the future.
> >
> > The limitations are outlined in the documentation as to how and when to
> > lock. I'm not familiar with the xarray users, but it does check for
> > locking with lockdep, but the way this is written bypasses the lockdep
> > checking as the locks are taken and dropped without the proper scope.
> >
> > If you feel like this is a trap, then maybe we need to figure out a new
> > plan to detect incorrect use?
>
> OK, I was a bit imprecise. What I wanted to say is that this is a shift in
> the paradigm in the sense that previously, we mostly had (and still have)
> data structure APIs (lists, rb-trees, radix-tree, now xarray) that were
> guaranteeing that unless you call into the function to mutate the data
> structure it stays intact. Now maple trees are shifting more in a direction
> of black-box API where you cannot assume what happens inside. Which is fine
> but then we have e.g. these iterators which do not quite follow this
> black-box design and you have to remember subtle details like calling
> "mas_pause()" before unlocking which is IMHO error-prone. Ideally, users of
> the black-box API shouldn't be exposed to the details of the internal
> locking at all (but then the performance suffers so I understand why you do
> things this way). Second to this ideal variant would be if we could detect
> we unlocked the lock without calling xas_pause() and warn on that. Or maybe
> xas_unlock*() should be calling xas_pause() automagically and we'd have
> similar helpers for RCU to do the magic for you?
>
..

Fair enough. You can think of the radix-tree and xarray as black boxes
as well; not everyone knows what's going on in there. The rbtree and
list are embedded in your own data structure and you have to do a lot of
work for your operations.

Right now, it is the way you have described; it's a data structure API.
It will work this way, but you lose the really neat and performant part
of the tree. If we do this right, we can compact at the same time as
reading data. We cannot do that when depending on the locks you use
today.

You don't have to use the mas_pause() functions; there are less
error-prone methods, such as the mt_find() that Chuck used here. If you
want to do really neat stuff, though, you should be looking at the mas_*
interface. And yes, we totally hand you enough rope to hang yourself.
I'm not sure we can have an advanced interface without doing this.

Using the correct calls will allow us to transition smoothly to a model
where you don't depend on the data remaining in the same place in the
tree (or xarray).
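
For illustration, a minimal sketch of the mt_find()-style loop
(placeholder tree and process_entry() names, not the libfs code
itself):

	unsigned long index = 0;
	void *entry;

	/*
	 * mt_find() handles its own RCU locking and advances *index past
	 * the entry it returns, so each call re-walks from the root and
	 * no ma_state has to survive between iterations.  Whatever pins
	 * the stored objects (e.g. a reference count) is still up to the
	 * caller.
	 */
	while ((entry = mt_find(&tree, &index, ULONG_MAX)) != NULL)
		process_entry(entry);	/* may sleep */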

>
> > If you have other examples you think are unsafe then I can have a look
> > at them as well.
>
> I'm currently not aware of any but I'll let you know if I find some.
> Missing xas/mas_pause() seems really easy.

What if we convert the rcu_read_lock() to a mas_read_lock() or
xas_read_lock() and we can check to ensure the state isn't being locked
without being in the 'parked' state (paused or otherwise)?

mas_read_lock(struct ma_state *mas) {
        assert(!mas_active(mas));
        rcu_read_lock();
}

Would that be a reasonable resolution to your concern? Unfortunately,
what was done with the locking in this case would not be detected with
this change unless the rcu_read_lock() was replaced. IOW, people could
still use the rcu_read_lock() and skip the detection.

Doing the same in the mas_unlock() doesn't make as much sense since that
may be called without the intent to reuse the state at all. So we'd be
doing more work than necessary at the end of each loop or use.

Thanks,
Liam



2024-02-18 02:03:18

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree

hi, Chuck Lever,

On Thu, Feb 15, 2024 at 08:45:33AM -0500, Chuck Lever wrote:
> On Thu, Feb 15, 2024 at 02:06:01PM +0100, Jan Kara wrote:
> > On Tue 13-02-24 16:38:01, Chuck Lever wrote:
> > > From: Chuck Lever <[email protected]>
> > >
> > > Test robot reports:
> > > > kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
> > > >
> > > > commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> > > Feng Tang further clarifies that:
> > > > ... the new simple_offset_add()
> > > > called by shmem_mknod() brings extra cost related with slab,
> > > > specifically the 'radix_tree_node', which cause the regression.
> > >
> > > Willy's analysis is that, over time, the test workload causes
> > > xa_alloc_cyclic() to fragment the underlying SLAB cache.
> > >
> > > This patch replaces the offset_ctx's xarray with a Maple Tree in the
> > > hope that Maple Tree's dense node mode will handle this scenario
> > > more scalably.
> > >
> > > In addition, we can widen the directory offset to an unsigned long
> > > everywhere.
> > >
> > > Suggested-by: Matthew Wilcox <[email protected]>
> > > Reported-by: kernel test robot <[email protected]>
> > > Closes: https://lore.kernel.org/oe-lkp/[email protected]
> > > Signed-off-by: Chuck Lever <[email protected]>
> >
> > OK, but this will need the performance numbers.
>
> Yes, I totally concur. The point of this posting was to get some
> early review and start the ball rolling.
>
> Actually we expect roughly the same performance numbers now. "Dense
> node" support in Maple Tree is supposed to be the real win, but
> I'm not sure it's ready yet.
>
>
> > Otherwise we have no idea
> > whether this is worth it or not. Maybe you can ask Oliver Sang? Usually
> > 0-day guys are quite helpful.
>
> Oliver and Feng were copied on this series.

we were on holiday last week; now we are back.

I noticed there is v2 for this patch set
https://lore.kernel.org/all/170820145616.6328.12620992971699079156.stgit@91.116.238.104.host.secureserver.net/

and you also put it in a branch:
https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
"simple-offset-maple" branch.

we will test aim9 performance based on this branch. Thanks

>
>
> > > @@ -330,9 +329,9 @@ int simple_offset_empty(struct dentry *dentry)
> > > if (!inode || !S_ISDIR(inode->i_mode))
> > > return ret;
> > >
> > > - index = 2;
> > > + index = DIR_OFFSET_MIN;
> >
> > This bit should go into the simple_offset_empty() patch...
> >
> > > @@ -434,15 +433,15 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> > >
> > > /* In this case, ->private_data is protected by f_pos_lock */
> > > file->private_data = NULL;
> > > - return vfs_setpos(file, offset, U32_MAX);
> > > + return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
> > ^^^
> > Why this? It is ULONG_MAX << PAGE_SHIFT on 32-bit so that doesn't seem
> > quite right? Why not use ULONG_MAX here directly?
>
> I initially changed U32_MAX to ULONG_MAX, but for some reason, the
> length checking in vfs_setpos() fails. There is probably a sign
> extension thing happening here that I don't understand.
>
>
> > Otherwise the patch looks good to me.
>
> As always, thank you for your review.
>
>
> --
> Chuck Lever

2024-02-18 15:57:47

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree

On Sun, Feb 18, 2024 at 10:02:37AM +0800, Oliver Sang wrote:
> hi, Chuck Lever,
>
> On Thu, Feb 15, 2024 at 08:45:33AM -0500, Chuck Lever wrote:
> > On Thu, Feb 15, 2024 at 02:06:01PM +0100, Jan Kara wrote:
> > > On Tue 13-02-24 16:38:01, Chuck Lever wrote:
> > > > From: Chuck Lever <[email protected]>
> > > >
> > > > Test robot reports:
> > > > > kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
> > > > >
> > > > > commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > >
> > > > Feng Tang further clarifies that:
> > > > > ... the new simple_offset_add()
> > > > > called by shmem_mknod() brings extra cost related with slab,
> > > > > specifically the 'radix_tree_node', which cause the regression.
> > > >
> > > > Willy's analysis is that, over time, the test workload causes
> > > > xa_alloc_cyclic() to fragment the underlying SLAB cache.
> > > >
> > > > This patch replaces the offset_ctx's xarray with a Maple Tree in the
> > > > hope that Maple Tree's dense node mode will handle this scenario
> > > > more scalably.
> > > >
> > > > In addition, we can widen the directory offset to an unsigned long
> > > > everywhere.
> > > >
> > > > Suggested-by: Matthew Wilcox <[email protected]>
> > > > Reported-by: kernel test robot <[email protected]>
> > > > Closes: https://lore.kernel.org/oe-lkp/[email protected]
> > > > Signed-off-by: Chuck Lever <[email protected]>
> > >
> > > OK, but this will need the performance numbers.
> >
> > Yes, I totally concur. The point of this posting was to get some
> > early review and start the ball rolling.
> >
> > Actually we expect roughly the same performance numbers now. "Dense
> > node" support in Maple Tree is supposed to be the real win, but
> > I'm not sure it's ready yet.
> >
> >
> > > Otherwise we have no idea
> > > whether this is worth it or not. Maybe you can ask Oliver Sang? Usually
> > > 0-day guys are quite helpful.
> >
> > Oliver and Feng were copied on this series.
>
> we were on holiday last week; now we are back.
>
> I noticed there is v2 for this patch set
> https://lore.kernel.org/all/170820145616.6328.12620992971699079156.stgit@91.116.238.104.host.secureserver.net/
>
> and you also put it in a branch:
> https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
> "simple-offset-maple" branch.
>
> we will test aim9 performance based on this branch. Thanks

Very much appreciated!


> > > > @@ -330,9 +329,9 @@ int simple_offset_empty(struct dentry *dentry)
> > > > if (!inode || !S_ISDIR(inode->i_mode))
> > > > return ret;
> > > >
> > > > - index = 2;
> > > > + index = DIR_OFFSET_MIN;
> > >
> > > This bit should go into the simple_offset_empty() patch...
> > >
> > > > @@ -434,15 +433,15 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> > > >
> > > > /* In this case, ->private_data is protected by f_pos_lock */
> > > > file->private_data = NULL;
> > > > - return vfs_setpos(file, offset, U32_MAX);
> > > > + return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
> > > ^^^
> > > Why this? It is ULONG_MAX << PAGE_SHIFT on 32-bit so that doesn't seem
> > > quite right? Why not use ULONG_MAX here directly?
> >
> > I initially changed U32_MAX to ULONG_MAX, but for some reason, the
> > length checking in vfs_setpos() fails. There is probably a sign
> > extension thing happening here that I don't understand.
> >
> >
> > > Otherwise the patch looks good to me.
> >
> > As always, thank you for your review.
> >
> >
> > --
> > Chuck Lever

--
Chuck Lever

2024-02-19 06:00:50

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH RFC 6/7] libfs: Convert simple directory offsets to use a Maple Tree

hi, Chuck Lever,

On Sun, Feb 18, 2024 at 10:57:07AM -0500, Chuck Lever wrote:
> On Sun, Feb 18, 2024 at 10:02:37AM +0800, Oliver Sang wrote:
> > hi, Chuck Lever,
> >
> > On Thu, Feb 15, 2024 at 08:45:33AM -0500, Chuck Lever wrote:
> > > On Thu, Feb 15, 2024 at 02:06:01PM +0100, Jan Kara wrote:
> > > > On Tue 13-02-24 16:38:01, Chuck Lever wrote:
> > > > > From: Chuck Lever <[email protected]>
> > > > >
> > > > > Test robot reports:
> > > > > > kernel test robot noticed a -19.0% regression of aim9.disk_src.ops_per_sec on:
> > > > > >
> > > > > > commit: a2e459555c5f9da3e619b7e47a63f98574dc75f1 ("shmem: stable directory offsets")
> > > > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > > > >
> > > > > Feng Tang further clarifies that:
> > > > > > ... the new simple_offset_add()
> > > > > > called by shmem_mknod() brings extra cost related with slab,
> > > > > > specifically the 'radix_tree_node', which cause the regression.
> > > > >
> > > > > Willy's analysis is that, over time, the test workload causes
> > > > > xa_alloc_cyclic() to fragment the underlying SLAB cache.
> > > > >
> > > > > This patch replaces the offset_ctx's xarray with a Maple Tree in the
> > > > > hope that Maple Tree's dense node mode will handle this scenario
> > > > > more scalably.
> > > > >
> > > > > In addition, we can widen the directory offset to an unsigned long
> > > > > everywhere.
> > > > >
> > > > > Suggested-by: Matthew Wilcox <[email protected]>
> > > > > Reported-by: kernel test robot <[email protected]>
> > > > > Closes: https://lore.kernel.org/oe-lkp/[email protected]
> > > > > Signed-off-by: Chuck Lever <[email protected]>
> > > >
> > > > OK, but this will need the performance numbers.
> > >
> > > Yes, I totally concur. The point of this posting was to get some
> > > early review and start the ball rolling.
> > >
> > > Actually we expect roughly the same performance numbers now. "Dense
> > > node" support in Maple Tree is supposed to be the real win, but
> > > I'm not sure it's ready yet.
> > >
> > >
> > > > Otherwise we have no idea
> > > > whether this is worth it or not. Maybe you can ask Oliver Sang? Usually
> > > > 0-day guys are quite helpful.
> > >
> > > Oliver and Feng were copied on this series.
> >
> > we were on holiday last week; now we are back.
> >
> > I noticed there is v2 for this patch set
> > https://lore.kernel.org/all/170820145616.6328.12620992971699079156.stgit@91.116.238.104.host.secureserver.net/
> >
> > and you also put it in a branch:
> > https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git
> > "simple-offset-maple" branch.
> >
> > we will test aim9 performance based on this branch. Thanks
>
> Very much appreciated!

always our pleasure!

we've already sent out a report [1] to you for commit a616bc6667 in the new
branch.
we saw an 11.8% improvement of aim9.disk_src.ops_per_sec on it compared to its
parent.

so the regression we saw on a2e459555c is roughly half recovered.

since the performance for a2e459555c, v6.8-rc4 and f3f24869a1 (the parent of
a616bc6667) is very similar, I omitted the results for v6.8-rc4 and f3f24869a1
from the tables below for brevity. if you want the full table, please let me
know. Thanks!


summary for aim9.disk_src.ops_per_sec:

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
gcc-12/performance/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-ivb-2ep1/disk_src/aim9/300s

commit:
23a31d8764 ("shmem: Refactor shmem_symlink()")
a2e459555c ("shmem: stable directory offsets")
a616bc6667 ("libfs: Convert simple directory offsets to use a Maple Tree")


23a31d87645c6527 a2e459555c5f9da3e619b7e47a6 a616bc666748063733c62e15ea4
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
202424 -19.0% 163868 -9.3% 183678 aim9.disk_src.ops_per_sec


full data:

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/tbox_group/test/testcase/testtime:
gcc-12/performance/x86_64-rhel-8.3/debian-11.1-x86_64-20220510.cgz/lkp-ivb-2ep1/disk_src/aim9/300s

commit:
23a31d8764 ("shmem: Refactor shmem_symlink()")
a2e459555c ("shmem: stable directory offsets")
a616bc6667 ("libfs: Convert simple directory offsets to use a Maple Tree")


23a31d87645c6527 a2e459555c5f9da3e619b7e47a6 a616bc666748063733c62e15ea4
---------------- --------------------------- ---------------------------
%stddev %change %stddev %change %stddev
\ | \ | \
1404 +0.6% 1412 +3.3% 1450 boot-time.idle
0.26 ? 9% +0.1 0.36 ? 2% -0.1 0.20 ? 4% mpstat.cpu.all.soft%
0.61 -0.1 0.52 -0.0 0.58 mpstat.cpu.all.usr%
1.00 +0.0% 1.00 +19.5% 1.19 ? 2% vmstat.procs.r
1525 ? 6% -5.6% 1440 +9.4% 1668 ? 4% vmstat.system.cs
52670 +5.0% 55323 ? 3% +12.7% 59381 ? 3% turbostat.C1
0.02 +0.0 0.02 +0.0 0.03 ? 10% turbostat.C1%
12949 ? 9% -1.2% 12795 ? 6% -19.6% 10412 ? 6% turbostat.POLL
115468 ? 24% +11.9% 129258 ? 7% +16.5% 134545 ? 5% numa-meminfo.node0.AnonPages.max
2015 ? 12% -7.5% 1864 ? 5% +30.3% 2624 ? 10% numa-meminfo.node0.PageTables
4795 ? 30% +1.1% 4846 ? 37% +117.0% 10405 ? 21% numa-meminfo.node0.Shmem
6442 ? 5% -0.6% 6401 ? 4% -19.6% 5180 ? 7% numa-meminfo.node1.KernelStack
13731 ? 92% -88.7% 1546 ? 6% +161.6% 35915 ? 23% time.involuntary_context_switches
94.83 -4.2% 90.83 +1.2% 96.00 time.percent_of_cpu_this_job_got
211.64 +0.5% 212.70 +4.0% 220.06 time.system_time
73.62 -17.6% 60.69 -6.4% 68.94 time.user_time
202424 -19.0% 163868 -9.3% 183678 aim9.disk_src.ops_per_sec
13731 ? 92% -88.7% 1546 ? 6% +161.6% 35915 ? 23% aim9.time.involuntary_context_switches
94.83 -4.2% 90.83 +1.2% 96.00 aim9.time.percent_of_cpu_this_job_got
211.64 +0.5% 212.70 +4.0% 220.06 aim9.time.system_time
73.62 -17.6% 60.69 -6.4% 68.94 aim9.time.user_time
174558 ? 4% -1.0% 172852 ? 7% -19.7% 140084 ? 7% meminfo.DirectMap4k
94166 +6.6% 100388 -14.4% 80579 meminfo.KReclaimable
12941 +0.4% 12989 -14.4% 11078 meminfo.KernelStack
3769 +0.2% 3775 +31.5% 4955 meminfo.PageTables
79298 +0.0% 79298 -70.6% 23298 meminfo.Percpu
94166 +6.6% 100388 -14.4% 80579 meminfo.SReclaimable
204209 +2.9% 210111 -9.6% 184661 meminfo.Slab
503.33 ? 12% -7.5% 465.67 ? 5% +30.4% 656.39 ? 10% numa-vmstat.node0.nr_page_table_pages
1198 ? 30% +1.1% 1211 ? 37% +117.0% 2601 ? 21% numa-vmstat.node0.nr_shmem
220.33 +0.2% 220.83 ? 2% -97.3% 6.00 ?100% numa-vmstat.node1.nr_active_file
167.00 +0.2% 167.33 ? 2% -87.7% 20.48 ?100% numa-vmstat.node1.nr_inactive_file
6443 ? 5% -0.6% 6405 ? 4% -19.5% 5184 ? 7% numa-vmstat.node1.nr_kernel_stack
220.33 +0.2% 220.83 ? 2% -97.3% 6.00 ?100% numa-vmstat.node1.nr_zone_active_file
167.00 +0.2% 167.33 ? 2% -87.7% 20.48 ?100% numa-vmstat.node1.nr_zone_inactive_file
0.04 ? 25% +4.1% 0.05 ? 33% -40.1% 0.03 ? 34% perf-sched.sch_delay.avg.ms.syslog_print.do_syslog.kmsg_read.vfs_read
0.01 ? 4% +1.6% 0.01 ? 8% +136.9% 0.02 ? 29% perf-sched.sch_delay.avg.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
0.16 ? 10% -18.9% 0.13 ? 12% -16.4% 0.13 ? 15% perf-sched.sch_delay.max.ms.pipe_read.vfs_read.ksys_read.do_syscall_64
0.14 ? 15% -6.0% 0.14 ? 20% -30.4% 0.10 ? 26% perf-sched.sch_delay.max.ms.schedule_timeout.__wait_for_common.wait_for_completion_state.kernel_clone
0.04 ? 7% +1802.4% 0.78 ?115% +7258.6% 3.00 ? 56% perf-sched.sch_delay.max.ms.smpboot_thread_fn.kthread.ret_from_fork.ret_from_fork_asm
0.05 ? 40% -14.5% 0.04 ? 31% +6155.4% 3.09 perf-sched.sch_delay.max.ms.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
0.01 ? 8% +25.5% 0.01 ? 21% +117.0% 0.02 ? 30% perf-sched.total_sch_delay.average.ms
0.16 ? 11% +1204.9% 2.12 ? 90% +2225.8% 3.77 ? 10% perf-sched.total_sch_delay.max.ms
58.50 ? 28% -4.8% 55.67 ? 14% -23.9% 44.50 ? 4% perf-sched.wait_and_delay.count.schedule_hrtimeout_range_clock.usleep_range_state.ipmi_thread.kthread
277.83 ? 3% +10.3% 306.50 ? 5% +17.9% 327.62 ? 4% perf-sched.wait_and_delay.count.worker_thread.kthread.ret_from_fork.ret_from_fork_asm
0.10 ? 75% -47.1% 0.05 ? 86% -100.0% 0.00 perf-sched.wait_time.avg.ms.__cond_resched.dput.path_put.vfs_statx.vfs_fstatat
0.10 ? 72% -10.8% 0.09 ? 67% -100.0% 0.00 perf-sched.wait_time.avg.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
2.00 ? 27% -4.8% 1.90 ? 14% -23.7% 1.53 ? 4% perf-sched.wait_time.avg.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
0.16 ? 81% +20.1% 0.19 ?104% -100.0% 0.00 perf-sched.wait_time.max.ms.__cond_resched.dput.path_put.vfs_statx.vfs_fstatat
0.22 ? 77% +14.2% 0.25 ? 54% -100.0% 0.00 perf-sched.wait_time.max.ms.exit_to_user_mode_loop.exit_to_user_mode_prepare.irqentry_exit_to_user_mode.asm_sysvec_apic_timer_interrupt
7.32 ? 43% -10.3% 6.57 ? 20% -33.2% 4.89 ? 6% perf-sched.wait_time.max.ms.schedule_hrtimeout_range_clock.do_select.core_sys_select.kern_select
220.33 +0.2% 220.83 ? 2% -94.6% 12.00 proc-vmstat.nr_active_file
676766 -0.0% 676706 +4.1% 704377 proc-vmstat.nr_file_pages
167.00 +0.2% 167.33 ? 2% -75.5% 40.98 proc-vmstat.nr_inactive_file
12941 +0.4% 12988 -14.4% 11083 proc-vmstat.nr_kernel_stack
942.50 +0.1% 943.50 +31.4% 1238 proc-vmstat.nr_page_table_pages
7915 ? 3% -0.8% 7855 +8.6% 8599 proc-vmstat.nr_shmem
23541 +6.5% 25074 -14.4% 20144 proc-vmstat.nr_slab_reclaimable
27499 -0.3% 27422 -5.4% 26016 proc-vmstat.nr_slab_unreclaimable
668461 +0.0% 668461 +4.2% 696493 proc-vmstat.nr_unevictable
220.33 +0.2% 220.83 ? 2% -94.6% 12.00 proc-vmstat.nr_zone_active_file
167.00 +0.2% 167.33 ? 2% -75.5% 40.98 proc-vmstat.nr_zone_inactive_file
668461 +0.0% 668461 +4.2% 696493 proc-vmstat.nr_zone_unevictable
722.17 ? 71% -34.7% 471.67 ?136% -98.3% 12.62 ?129% proc-vmstat.numa_hint_faults_local
1437319 ? 24% +377.6% 6864201 -47.8% 750673 ? 7% proc-vmstat.numa_hit
1387016 ? 25% +391.4% 6815486 -49.5% 700947 ? 7% proc-vmstat.numa_local
50329 -0.0% 50324 -1.2% 49704 proc-vmstat.numa_other
4864362 ? 34% +453.6% 26931180 -66.1% 1648373 ? 17% proc-vmstat.pgalloc_normal
4835960 ? 34% +455.4% 26856610 -66.3% 1628178 ? 18% proc-vmstat.pgfree
11.21 +23.7% 13.87 -81.8% 2.04 perf-stat.i.MPKI
7.223e+08 -4.4% 6.907e+08 -3.9% 6.94e+08 perf-stat.i.branch-instructions
2.67 +0.2 2.88 +0.0 2.70 perf-stat.i.branch-miss-rate%
19988363 +2.8% 20539702 -3.1% 19363031 perf-stat.i.branch-misses
17.36 -2.8 14.59 +0.4 17.77 perf-stat.i.cache-miss-rate%
40733859 +19.5% 48659982 -1.9% 39962840 perf-stat.i.cache-references
1482 ? 7% -5.7% 1398 +9.6% 1623 ? 5% perf-stat.i.context-switches
1.76 +3.5% 1.82 +5.1% 1.85 perf-stat.i.cpi
55.21 +5.4% 58.21 ? 2% +0.4% 55.45 perf-stat.i.cpu-migrations
16524721 ? 8% -4.8% 15726404 ? 4% -13.1% 14367627 ? 4% perf-stat.i.dTLB-load-misses
1.01e+09 -3.8% 9.719e+08 -4.4% 9.659e+08 perf-stat.i.dTLB-loads
0.26 ? 4% -0.0 0.23 ? 3% -0.0 0.25 ? 3% perf-stat.i.dTLB-store-miss-rate%
2166022 ? 4% -6.9% 2015917 ? 3% -7.0% 2014037 ? 3% perf-stat.i.dTLB-store-misses
8.503e+08 +5.5% 8.968e+08 -3.5% 8.205e+08 perf-stat.i.dTLB-stores
69.22 ? 4% +6.4 75.60 +14.4 83.60 ? 3% perf-stat.i.iTLB-load-miss-rate%
709457 ? 5% -5.4% 670950 ? 2% +133.6% 1657233 perf-stat.i.iTLB-load-misses
316455 ? 12% -31.6% 216531 ? 3% +3.5% 327592 ? 19% perf-stat.i.iTLB-loads
3.722e+09 -3.1% 3.608e+09 -4.6% 3.553e+09 perf-stat.i.instructions
5243 ? 5% +2.2% 5357 ? 2% -59.1% 2142 perf-stat.i.instructions-per-iTLB-miss
0.57 -3.3% 0.55 -4.8% 0.54 perf-stat.i.ipc
865.04 -10.4% 775.02 ? 3% -2.5% 843.12 perf-stat.i.metric.K/sec
53.84 -0.4% 53.61 -4.0% 51.71 perf-stat.i.metric.M/sec
47.51 -2.1 45.37 +0.7 48.17 perf-stat.i.node-load-miss-rate%
88195 ? 3% +5.2% 92745 ? 4% +16.4% 102647 ? 6% perf-stat.i.node-load-misses
106705 ? 3% +14.8% 122490 ? 5% +13.2% 120774 ? 5% perf-stat.i.node-loads
107169 ? 4% +29.0% 138208 ? 7% +7.5% 115217 ? 5% perf-stat.i.node-stores
10.94 +23.3% 13.49 -81.8% 1.99 perf-stat.overall.MPKI
2.77 +0.2 2.97 +0.0 2.79 perf-stat.overall.branch-miss-rate%
17.28 -2.7 14.56 +0.4 17.67 perf-stat.overall.cache-miss-rate%
1.73 +3.4% 1.79 +5.0% 1.82 perf-stat.overall.cpi
0.25 ? 4% -0.0 0.22 ? 3% -0.0 0.24 ? 3% perf-stat.overall.dTLB-store-miss-rate%
69.20 ? 4% +6.4 75.60 +14.4 83.58 ? 3% perf-stat.overall.iTLB-load-miss-rate%
5260 ? 5% +2.3% 5380 ? 2% -59.2% 2144 perf-stat.overall.instructions-per-iTLB-miss
0.58 -3.2% 0.56 -4.7% 0.55 perf-stat.overall.ipc
45.25 -2.2 43.10 +0.7 45.93 perf-stat.overall.node-load-miss-rate%
7.199e+08 -4.4% 6.883e+08 -3.9% 6.917e+08 perf-stat.ps.branch-instructions
19919808 +2.8% 20469001 -3.1% 19299968 perf-stat.ps.branch-misses
40597326 +19.5% 48497201 -1.9% 39829580 perf-stat.ps.cache-references
1477 ? 7% -5.7% 1393 +9.6% 1618 ? 5% perf-stat.ps.context-switches
55.06 +5.4% 58.03 ? 2% +0.5% 55.32 perf-stat.ps.cpu-migrations
16469488 ? 8% -4.8% 15673772 ? 4% -13.1% 14319828 ? 4% perf-stat.ps.dTLB-load-misses
1.007e+09 -3.8% 9.686e+08 -4.4% 9.627e+08 perf-stat.ps.dTLB-loads
2158768 ? 4% -6.9% 2009174 ? 3% -7.0% 2007326 ? 3% perf-stat.ps.dTLB-store-misses
8.475e+08 +5.5% 8.937e+08 -3.5% 8.178e+08 perf-stat.ps.dTLB-stores
707081 ? 5% -5.4% 668703 ? 2% +133.6% 1651678 perf-stat.ps.iTLB-load-misses
315394 ? 12% -31.6% 215816 ? 3% +3.5% 326463 ? 19% perf-stat.ps.iTLB-loads
3.71e+09 -3.1% 3.595e+09 -4.5% 3.541e+09 perf-stat.ps.instructions
87895 ? 3% +5.2% 92424 ? 4% +16.4% 102341 ? 6% perf-stat.ps.node-load-misses
106351 ? 3% +14.8% 122083 ? 5% +13.2% 120439 ? 5% perf-stat.ps.node-loads
106728 ? 4% +29.1% 137740 ? 7% +7.6% 114824 ? 5% perf-stat.ps.node-stores
1.117e+12 -3.0% 1.084e+12 -4.5% 1.067e+12 perf-stat.total.instructions
10.55 ? 95% -100.0% 0.00 -100.0% 0.00 sched_debug.cfs_rq:/.MIN_vruntime.avg
506.30 ? 95% -100.0% 0.00 -100.0% 0.00 sched_debug.cfs_rq:/.MIN_vruntime.max
0.00 +0.0% 0.00 -100.0% 0.00 sched_debug.cfs_rq:/.MIN_vruntime.min
1.08 ? 7% -7.7% 1.00 -46.2% 0.58 ? 27% sched_debug.cfs_rq:/.h_nr_running.max
538959 ? 24% -23.2% 414090 -58.3% 224671 ? 31% sched_debug.cfs_rq:/.load.max
130191 ? 14% -13.3% 112846 ? 6% -56.6% 56505 ? 39% sched_debug.cfs_rq:/.load.stddev
10.56 ? 95% -100.0% 0.00 -100.0% 0.00 sched_debug.cfs_rq:/.max_vruntime.avg
506.64 ? 95% -100.0% 0.00 -100.0% 0.00 sched_debug.cfs_rq:/.max_vruntime.max
0.00 +0.0% 0.00 -100.0% 0.00 sched_debug.cfs_rq:/.max_vruntime.min
116849 ? 27% -51.2% 56995 ? 20% -66.7% 38966 ? 39% sched_debug.cfs_rq:/.min_vruntime.max
1248 ? 14% +9.8% 1370 ? 15% -29.8% 876.86 ? 16% sched_debug.cfs_rq:/.min_vruntime.min
20484 ? 22% -37.0% 12910 ? 15% -60.7% 8059 ? 32% sched_debug.cfs_rq:/.min_vruntime.stddev
1.08 ? 7% -7.7% 1.00 -46.2% 0.58 ? 27% sched_debug.cfs_rq:/.nr_running.max
1223 ?191% -897.4% -9754 -100.0% 0.00 sched_debug.cfs_rq:/.spread0.avg
107969 ? 29% -65.3% 37448 ? 39% -100.0% 0.00 sched_debug.cfs_rq:/.spread0.max
-7628 +138.2% -18173 -100.0% 0.00 sched_debug.cfs_rq:/.spread0.min
20484 ? 22% -37.0% 12910 ? 15% -100.0% 0.00 sched_debug.cfs_rq:/.spread0.stddev
29.84 ? 19% -1.1% 29.52 ? 14% -100.0% 0.00 sched_debug.cfs_rq:/.util_est_enqueued.avg
569.91 ? 11% +4.5% 595.58 -100.0% 0.00 sched_debug.cfs_rq:/.util_est_enqueued.max
109.06 ? 13% +3.0% 112.32 ? 5% -100.0% 0.00 sched_debug.cfs_rq:/.util_est_enqueued.stddev
910320 +0.0% 910388 -44.2% 507736 ? 28% sched_debug.cpu.avg_idle.avg
1052715 ? 5% -3.8% 1012910 -45.8% 570219 ? 28% sched_debug.cpu.avg_idle.max
175426 ? 8% +2.8% 180422 ? 6% -42.5% 100839 ? 20% sched_debug.cpu.clock.avg
175428 ? 8% +2.8% 180424 ? 6% -42.5% 100840 ? 20% sched_debug.cpu.clock.max
175424 ? 8% +2.8% 180421 ? 6% -42.5% 100838 ? 20% sched_debug.cpu.clock.min
170979 ? 8% +2.8% 175682 ? 6% -42.5% 98373 ? 20% sched_debug.cpu.clock_task.avg
173398 ? 8% +2.6% 177937 ? 6% -42.6% 99568 ? 20% sched_debug.cpu.clock_task.max
165789 ? 8% +3.2% 171014 ? 6% -42.4% 95513 ? 20% sched_debug.cpu.clock_task.min
2178 ? 8% -13.1% 1893 ? 18% -49.7% 1094 ? 33% sched_debug.cpu.clock_task.stddev
7443 ? 4% +1.7% 7566 ? 3% -43.0% 4246 ? 24% sched_debug.cpu.curr->pid.max
1409 ? 2% +1.1% 1426 ? 2% -42.0% 818.46 ? 25% sched_debug.cpu.curr->pid.stddev
502082 +0.0% 502206 -43.9% 281834 ? 29% sched_debug.cpu.max_idle_balance_cost.avg
567767 ? 5% -3.3% 548996 ? 2% -46.6% 303439 ? 27% sched_debug.cpu.max_idle_balance_cost.max
4294 +0.0% 4294 -43.7% 2415 ? 29% sched_debug.cpu.next_balance.avg
4294 +0.0% 4294 -43.7% 2415 ? 29% sched_debug.cpu.next_balance.max
4294 +0.0% 4294 -43.7% 2415 ? 29% sched_debug.cpu.next_balance.min
0.29 ? 3% -1.2% 0.28 -43.0% 0.16 ? 29% sched_debug.cpu.nr_running.stddev
8212 ? 7% -2.2% 8032 ? 4% -40.7% 4867 ? 20% sched_debug.cpu.nr_switches.avg
55209 ? 14% -21.8% 43154 ? 14% -57.0% 23755 ? 20% sched_debug.cpu.nr_switches.max
1272 ? 23% +10.2% 1402 ? 8% -39.1% 775.51 ? 29% sched_debug.cpu.nr_switches.min
9805 ? 13% -13.7% 8459 ? 8% -50.8% 4825 ? 20% sched_debug.cpu.nr_switches.stddev
-15.73 -14.1% -13.52 -64.8% -5.53 sched_debug.cpu.nr_uninterruptible.min
6.08 ? 27% -1.0% 6.02 ? 13% -53.7% 2.82 ? 24% sched_debug.cpu.nr_uninterruptible.stddev
175425 ? 8% +2.8% 180421 ? 6% -42.5% 100838 ? 20% sched_debug.cpu_clk
4.295e+09 +0.0% 4.295e+09 -43.7% 2.416e+09 ? 29% sched_debug.jiffies
174815 ? 8% +2.9% 179811 ? 6% -42.5% 100505 ? 20% sched_debug.ktime
175972 ? 8% +2.8% 180955 ? 6% -42.5% 101150 ? 20% sched_debug.sched_clk
58611259 +0.0% 58611259 -94.0% 3508734 ? 29% sched_debug.sysctl_sched.sysctl_sched_features
0.75 +0.0% 0.75 -100.0% 0.00 sched_debug.sysctl_sched.sysctl_sched_idle_min_granularity
24.00 +0.0% 24.00 -100.0% 0.00 sched_debug.sysctl_sched.sysctl_sched_latency
3.00 +0.0% 3.00 -100.0% 0.00 sched_debug.sysctl_sched.sysctl_sched_min_granularity
4.00 +0.0% 4.00 -100.0% 0.00 sched_debug.sysctl_sched.sysctl_sched_wakeup_granularity
0.26 ?100% -0.3 0.00 +1.2 1.44 ? 9% perf-profile.calltrace.cycles-pp.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
2.08 ? 26% -0.2 1.87 ? 12% -1.4 0.63 ? 11% perf-profile.calltrace.cycles-pp.syscall_exit_to_user_mode.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
0.00 +0.0 0.00 +0.7 0.67 ? 11% perf-profile.calltrace.cycles-pp.rcu_sched_clock_irq.update_process_times.tick_sched_handle.tick_nohz_highres_handler.__hrtimer_run_queues
0.00 +0.0 0.00 +0.7 0.71 ? 18% perf-profile.calltrace.cycles-pp.rebalance_domains.__do_softirq.irq_exit_rcu.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt
0.00 +0.0 0.00 +0.8 0.80 ? 14% perf-profile.calltrace.cycles-pp.getname_flags.__do_sys_newstat.do_syscall_64.entry_SYSCALL_64_after_hwframe.__xstat64
0.00 +0.0 0.00 +0.8 0.84 ? 12% perf-profile.calltrace.cycles-pp.__fput.__x64_sys_close.do_syscall_64.entry_SYSCALL_64_after_hwframe.__close
0.00 +0.0 0.00 +1.0 0.98 ? 13% perf-profile.calltrace.cycles-pp.mas_alloc_cyclic.mtree_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open
0.00 +0.0 0.00 +1.1 1.10 ? 14% perf-profile.calltrace.cycles-pp.mtree_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups
0.00 +0.0 0.00 +1.2 1.20 ? 13% perf-profile.calltrace.cycles-pp.mas_erase.mtree_erase.simple_offset_remove.shmem_unlink.vfs_unlink
0.00 +0.0 0.00 +1.3 1.34 ? 15% perf-profile.calltrace.cycles-pp.link_path_walk.path_lookupat.filename_lookup.vfs_statx.__do_sys_newstat
0.00 +0.0 0.00 +1.4 1.35 ? 12% perf-profile.calltrace.cycles-pp.mtree_erase.simple_offset_remove.shmem_unlink.vfs_unlink.do_unlinkat
0.00 +0.0 0.00 +1.6 1.56 ? 18% perf-profile.calltrace.cycles-pp.__do_softirq.irq_exit_rcu.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state
0.00 +0.0 0.00 +1.7 1.73 ? 6% perf-profile.calltrace.cycles-pp.scheduler_tick.update_process_times.tick_sched_handle.tick_nohz_highres_handler.__hrtimer_run_queues
0.00 +0.0 0.00 +1.8 1.80 ? 16% perf-profile.calltrace.cycles-pp.irq_exit_rcu.sysvec_apic_timer_interrupt.asm_sysvec_apic_timer_interrupt.cpuidle_enter_state.cpuidle_enter
0.00 +0.0 0.00 +2.0 2.03 ? 15% perf-profile.calltrace.cycles-pp.path_lookupat.filename_lookup.vfs_statx.__do_sys_newstat.do_syscall_64
0.00 +0.0 0.00 +2.2 2.16 ? 15% perf-profile.calltrace.cycles-pp.filename_lookup.vfs_statx.__do_sys_newstat.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +0.0 0.00 +2.9 2.94 ? 14% perf-profile.calltrace.cycles-pp.vfs_statx.__do_sys_newstat.do_syscall_64.entry_SYSCALL_64_after_hwframe.__xstat64
0.00 +0.0 0.00 +3.2 3.19 ? 8% perf-profile.calltrace.cycles-pp.update_process_times.tick_sched_handle.tick_nohz_highres_handler.__hrtimer_run_queues.hrtimer_interrupt
0.00 +0.0 0.00 +3.4 3.35 ? 8% perf-profile.calltrace.cycles-pp.tick_sched_handle.tick_nohz_highres_handler.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt
0.00 +0.0 0.00 +3.9 3.88 ? 8% perf-profile.calltrace.cycles-pp.tick_nohz_highres_handler.__hrtimer_run_queues.hrtimer_interrupt.__sysvec_apic_timer_interrupt.sysvec_apic_timer_interrupt
0.00 +0.8 0.75 ? 12% +0.0 0.00 perf-profile.calltrace.cycles-pp.__call_rcu_common.xas_store.__xa_erase.xa_erase.simple_offset_remove
0.00 +0.8 0.78 ? 34% +0.0 0.00 perf-profile.calltrace.cycles-pp.___slab_alloc.kmem_cache_alloc_lru.xas_alloc.xas_create.xas_store
0.00 +0.8 0.83 ? 29% +0.0 0.00 perf-profile.calltrace.cycles-pp.allocate_slab.___slab_alloc.kmem_cache_alloc_lru.xas_alloc.xas_expand
0.00 +0.9 0.92 ? 26% +0.0 0.00 perf-profile.calltrace.cycles-pp.___slab_alloc.kmem_cache_alloc_lru.xas_alloc.xas_expand.xas_create
0.00 +1.0 0.99 ? 27% +0.0 0.00 perf-profile.calltrace.cycles-pp.shuffle_freelist.allocate_slab.___slab_alloc.kmem_cache_alloc_lru.xas_alloc
0.00 +1.0 1.04 ? 28% +0.0 0.00 perf-profile.calltrace.cycles-pp.kmem_cache_alloc_lru.xas_alloc.xas_create.xas_store.__xa_alloc
0.00 +1.1 1.11 ? 26% +0.0 0.00 perf-profile.calltrace.cycles-pp.xas_alloc.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic
1.51 ? 24% +1.2 2.73 ? 10% +1.2 2.75 ? 10% perf-profile.calltrace.cycles-pp.vfs_unlink.do_unlinkat.__x64_sys_unlink.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +1.2 1.24 ? 20% +0.0 0.00 perf-profile.calltrace.cycles-pp.kmem_cache_alloc_lru.xas_alloc.xas_expand.xas_create.xas_store
0.00 +1.3 1.27 ? 10% +0.0 0.00 perf-profile.calltrace.cycles-pp.xas_store.__xa_erase.xa_erase.simple_offset_remove.shmem_unlink
0.00 +1.3 1.30 ? 10% +0.0 0.00 perf-profile.calltrace.cycles-pp.__xa_erase.xa_erase.simple_offset_remove.shmem_unlink.vfs_unlink
0.00 +1.3 1.33 ? 19% +0.0 0.00 perf-profile.calltrace.cycles-pp.xas_alloc.xas_expand.xas_create.xas_store.__xa_alloc
0.00 +1.4 1.36 ? 10% +0.0 0.00 perf-profile.calltrace.cycles-pp.xa_erase.simple_offset_remove.shmem_unlink.vfs_unlink.do_unlinkat
0.00 +1.4 1.37 ? 10% +1.4 1.37 ? 12% perf-profile.calltrace.cycles-pp.simple_offset_remove.shmem_unlink.vfs_unlink.do_unlinkat.__x64_sys_unlink
0.00 +1.5 1.51 ? 17% +0.0 0.00 perf-profile.calltrace.cycles-pp.xas_expand.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic
0.00 +1.6 1.62 ? 12% +1.6 1.65 ? 12% perf-profile.calltrace.cycles-pp.shmem_unlink.vfs_unlink.do_unlinkat.__x64_sys_unlink.do_syscall_64
0.00 +2.8 2.80 ? 13% +0.0 0.00 perf-profile.calltrace.cycles-pp.xas_create.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add
0.00 +2.9 2.94 ? 13% +0.0 0.00 perf-profile.calltrace.cycles-pp.xas_store.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod
5.38 ? 24% +3.1 8.51 ? 11% +0.6 5.95 ? 11% perf-profile.calltrace.cycles-pp.lookup_open.open_last_lookups.path_openat.do_filp_open.do_sys_openat2
6.08 ? 24% +3.2 9.24 ? 12% +0.5 6.59 ? 10% perf-profile.calltrace.cycles-pp.open_last_lookups.path_openat.do_filp_open.do_sys_openat2.__x64_sys_creat
0.00 +3.2 3.20 ? 13% +0.0 0.00 perf-profile.calltrace.cycles-pp.__xa_alloc.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open
0.00 +3.2 3.24 ? 13% +0.0 0.00 perf-profile.calltrace.cycles-pp.__xa_alloc_cyclic.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups
0.00 +3.4 3.36 ? 14% +1.2 1.16 ? 13% perf-profile.calltrace.cycles-pp.simple_offset_add.shmem_mknod.lookup_open.open_last_lookups.path_openat
2.78 ? 25% +3.4 6.17 ? 12% +0.9 3.69 ? 12% perf-profile.calltrace.cycles-pp.shmem_mknod.lookup_open.open_last_lookups.path_openat.do_filp_open
0.16 ? 30% -0.1 0.08 ? 20% -0.0 0.13 ? 42% perf-profile.children.cycles-pp.map_id_up
0.22 ? 18% -0.0 0.16 ? 17% -0.1 0.12 ? 8% perf-profile.children.cycles-pp._raw_spin_lock_irq
0.47 ? 17% -0.0 0.43 ? 16% +1.0 1.49 ? 10% perf-profile.children.cycles-pp.__x64_sys_close
0.02 ?142% -0.0 0.00 +0.1 0.08 ? 22% perf-profile.children.cycles-pp._find_next_zero_bit
0.01 ?223% -0.0 0.00 +2.0 1.97 ? 15% perf-profile.children.cycles-pp.irq_exit_rcu
0.00 +0.0 0.00 +0.1 0.08 ? 30% perf-profile.children.cycles-pp.__wake_up
0.00 +0.0 0.00 +0.1 0.08 ? 21% perf-profile.children.cycles-pp.should_we_balance
0.00 +0.0 0.00 +0.1 0.09 ? 34% perf-profile.children.cycles-pp.amd_clear_divider
0.00 +0.0 0.00 +0.1 0.10 ? 36% perf-profile.children.cycles-pp.apparmor_current_getsecid_subj
0.00 +0.0 0.00 +0.1 0.10 ? 32% perf-profile.children.cycles-pp.filp_flush
0.00 +0.0 0.00 +0.1 0.10 ? 27% perf-profile.children.cycles-pp.mas_wr_end_piv
0.00 +0.0 0.00 +0.1 0.12 ? 27% perf-profile.children.cycles-pp.mnt_get_write_access
0.00 +0.0 0.00 +0.1 0.12 ? 23% perf-profile.children.cycles-pp.file_close_fd
0.00 +0.0 0.00 +0.1 0.13 ? 30% perf-profile.children.cycles-pp.security_current_getsecid_subj
0.00 +0.0 0.00 +0.1 0.13 ? 18% perf-profile.children.cycles-pp.native_apic_mem_eoi
0.00 +0.0 0.00 +0.2 0.17 ? 22% perf-profile.children.cycles-pp.mas_leaf_max_gap
0.00 +0.0 0.00 +0.2 0.18 ? 24% perf-profile.children.cycles-pp.mtree_range_walk
0.00 +0.0 0.00 +0.2 0.20 ? 19% perf-profile.children.cycles-pp.inode_set_ctime_current
0.00 +0.0 0.00 +0.2 0.24 ? 14% perf-profile.children.cycles-pp.ima_file_check
0.00 +0.0 0.00 +0.2 0.24 ? 22% perf-profile.children.cycles-pp.mas_anode_descend
0.00 +0.0 0.00 +0.3 0.26 ? 18% perf-profile.children.cycles-pp.lockref_put_return
0.00 +0.0 0.00 +0.3 0.29 ? 16% perf-profile.children.cycles-pp.mas_wr_walk
0.00 +0.0 0.00 +0.3 0.31 ? 23% perf-profile.children.cycles-pp.mas_update_gap
0.00 +0.0 0.00 +0.3 0.32 ? 17% perf-profile.children.cycles-pp.mas_wr_append
0.00 +0.0 0.00 +0.4 0.37 ? 15% perf-profile.children.cycles-pp.mas_empty_area
0.00 +0.0 0.00 +0.5 0.47 ? 18% perf-profile.children.cycles-pp.mas_wr_node_store
0.00 +0.0 0.00 +0.5 0.53 ? 18% perf-profile.children.cycles-pp.__memcg_slab_post_alloc_hook
0.00 +0.0 0.00 +0.7 0.65 ? 12% perf-profile.children.cycles-pp.__memcg_slab_pre_alloc_hook
0.00 +0.0 0.00 +0.7 0.65 ? 28% perf-profile.children.cycles-pp.__memcg_slab_free_hook
0.00 +0.0 0.00 +1.0 0.99 ? 13% perf-profile.children.cycles-pp.mas_alloc_cyclic
0.00 +0.0 0.00 +1.1 1.11 ? 14% perf-profile.children.cycles-pp.mtree_alloc_cyclic
0.00 +0.0 0.00 +1.2 1.21 ? 14% perf-profile.children.cycles-pp.mas_erase
0.00 +0.0 0.00 +1.4 1.35 ? 12% perf-profile.children.cycles-pp.mtree_erase
0.00 +0.0 0.00 +1.8 1.78 ? 9% perf-profile.children.cycles-pp.entry_SYSCALL_64
0.00 +0.0 0.00 +4.1 4.06 ? 8% perf-profile.children.cycles-pp.tick_nohz_highres_handler
0.02 ?146% +0.1 0.08 ? 13% +0.0 0.06 ? 81% perf-profile.children.cycles-pp.shmem_is_huge
0.02 ?141% +0.1 0.09 ? 16% -0.0 0.00 perf-profile.children.cycles-pp.__list_del_entry_valid
0.00 +0.1 0.08 ? 11% +0.0 0.00 perf-profile.children.cycles-pp.free_unref_page
0.00 +0.1 0.08 ? 13% +0.1 0.08 ? 45% perf-profile.children.cycles-pp.shmem_destroy_inode
0.04 ?101% +0.1 0.14 ? 25% +0.0 0.05 ? 65% perf-profile.children.cycles-pp.rcu_nocb_try_bypass
0.00 +0.1 0.12 ? 27% +0.0 0.00 perf-profile.children.cycles-pp.xas_find_marked
0.02 ?144% +0.1 0.16 ? 14% -0.0 0.00 perf-profile.children.cycles-pp.__unfreeze_partials
0.03 ?106% +0.2 0.19 ? 26% -0.0 0.03 ?136% perf-profile.children.cycles-pp.xas_descend
0.01 ?223% +0.2 0.17 ? 15% -0.0 0.00 perf-profile.children.cycles-pp.get_page_from_freelist
0.11 ? 22% +0.2 0.29 ? 16% -0.0 0.08 ? 30% perf-profile.children.cycles-pp.rcu_segcblist_enqueue
0.02 ?146% +0.2 0.24 ? 13% -0.0 0.01 ?174% perf-profile.children.cycles-pp.__alloc_pages
0.36 ? 79% +0.6 0.98 ? 15% -0.0 0.31 ? 43% perf-profile.children.cycles-pp.__slab_free
0.50 ? 26% +0.7 1.23 ? 14% -0.2 0.31 ? 19% perf-profile.children.cycles-pp.__call_rcu_common
0.00 +0.8 0.82 ? 13% +0.0 0.00 perf-profile.children.cycles-pp.radix_tree_node_rcu_free
0.00 +1.1 1.14 ? 17% +0.0 0.00 perf-profile.children.cycles-pp.radix_tree_node_ctor
0.16 ? 86% +1.2 1.38 ? 16% -0.1 0.02 ?174% perf-profile.children.cycles-pp.setup_object
1.52 ? 25% +1.2 2.75 ? 10% +1.2 2.76 ? 11% perf-profile.children.cycles-pp.vfs_unlink
0.36 ? 22% +1.3 1.63 ? 12% +1.3 1.65 ? 12% perf-profile.children.cycles-pp.shmem_unlink
0.00 +1.3 1.30 ? 10% +0.0 0.00 perf-profile.children.cycles-pp.__xa_erase
0.20 ? 79% +1.3 1.53 ? 15% -0.2 0.02 ?173% perf-profile.children.cycles-pp.shuffle_freelist
0.00 +1.4 1.36 ? 10% +0.0 0.00 perf-profile.children.cycles-pp.xa_erase
0.00 +1.4 1.38 ? 10% +1.4 1.37 ? 12% perf-profile.children.cycles-pp.simple_offset_remove
0.00 +1.5 1.51 ? 17% +0.0 0.00 perf-profile.children.cycles-pp.xas_expand
0.26 ? 78% +1.6 1.87 ? 13% -0.2 0.05 ? 68% perf-profile.children.cycles-pp.allocate_slab
0.40 ? 49% +1.7 2.10 ? 13% -0.3 0.14 ? 28% perf-profile.children.cycles-pp.___slab_alloc
1.30 ? 85% +2.1 3.42 ? 12% -0.1 1.15 ? 41% perf-profile.children.cycles-pp.rcu_do_batch
1.56 ? 27% +2.4 3.93 ? 11% -0.2 1.41 ? 12% perf-profile.children.cycles-pp.kmem_cache_alloc_lru
0.00 +2.4 2.44 ? 12% +0.0 0.00 perf-profile.children.cycles-pp.xas_alloc
2.66 ? 13% +2.5 5.14 ? 5% -2.7 0.00 perf-profile.children.cycles-pp.__irq_exit_rcu
11.16 ? 10% +2.7 13.88 ? 8% +0.6 11.72 ? 8% perf-profile.children.cycles-pp.sysvec_apic_timer_interrupt
11.77 ± 10% +2.7 14.49 ± 8% +0.6 12.40 ± 8% perf-profile.children.cycles-pp.asm_sysvec_apic_timer_interrupt
0.00 +2.8 2.82 ± 13% +0.0 0.00 perf-profile.children.cycles-pp.xas_create
5.40 ± 24% +3.1 8.52 ± 11% +0.6 5.97 ± 11% perf-profile.children.cycles-pp.lookup_open
6.12 ± 24% +3.1 9.27 ± 12% +0.5 6.62 ± 10% perf-profile.children.cycles-pp.open_last_lookups
0.00 +3.2 3.22 ± 13% +0.0 0.00 perf-profile.children.cycles-pp.__xa_alloc
0.00 +3.2 3.24 ± 13% +0.0 0.00 perf-profile.children.cycles-pp.__xa_alloc_cyclic
0.00 +3.4 3.36 ± 14% +1.2 1.16 ± 13% perf-profile.children.cycles-pp.simple_offset_add
2.78 ± 25% +3.4 6.18 ± 12% +0.9 3.70 ± 12% perf-profile.children.cycles-pp.shmem_mknod
0.00 +4.2 4.24 ± 12% +0.0 0.00 perf-profile.children.cycles-pp.xas_store
0.14 ± 27% -0.1 0.08 ± 21% -0.0 0.12 ± 42% perf-profile.self.cycles-pp.map_id_up
0.09 ± 18% -0.0 0.06 ± 52% +0.1 0.14 ± 12% perf-profile.self.cycles-pp.obj_cgroup_charge
0.18 ± 22% -0.0 0.15 ± 17% -0.1 0.10 ± 9% perf-profile.self.cycles-pp._raw_spin_lock_irq
0.02 ±141% -0.0 0.00 +0.1 0.08 ± 22% perf-profile.self.cycles-pp._find_next_zero_bit
0.16 ± 24% -0.0 0.16 ± 24% -0.1 0.07 ± 64% perf-profile.self.cycles-pp.__sysvec_apic_timer_interrupt
0.02 ±146% +0.0 0.02 ±146% +0.1 0.08 ± 18% perf-profile.self.cycles-pp.shmem_mknod
0.00 +0.0 0.00 +0.1 0.09 ± 36% perf-profile.self.cycles-pp.irq_exit_rcu
0.00 +0.0 0.00 +0.1 0.09 ± 28% perf-profile.self.cycles-pp.tick_nohz_highres_handler
0.00 +0.0 0.00 +0.1 0.09 ± 36% perf-profile.self.cycles-pp.apparmor_current_getsecid_subj
0.00 +0.0 0.00 +0.1 0.09 ± 30% perf-profile.self.cycles-pp.mtree_erase
0.00 +0.0 0.00 +0.1 0.10 ± 26% perf-profile.self.cycles-pp.mtree_alloc_cyclic
0.00 +0.0 0.00 +0.1 0.10 ± 27% perf-profile.self.cycles-pp.mas_wr_end_piv
0.00 +0.0 0.00 +0.1 0.12 ± 28% perf-profile.self.cycles-pp.mnt_get_write_access
0.00 +0.0 0.00 +0.1 0.12 ± 29% perf-profile.self.cycles-pp.inode_set_ctime_current
0.00 +0.0 0.00 +0.1 0.12 ± 38% perf-profile.self.cycles-pp.mas_empty_area
0.00 +0.0 0.00 +0.1 0.13 ± 18% perf-profile.self.cycles-pp.native_apic_mem_eoi
0.00 +0.0 0.00 +0.1 0.14 ± 38% perf-profile.self.cycles-pp.mas_update_gap
0.00 +0.0 0.00 +0.1 0.14 ± 20% perf-profile.self.cycles-pp.mas_wr_append
0.00 +0.0 0.00 +0.2 0.16 ± 23% perf-profile.self.cycles-pp.mas_leaf_max_gap
0.00 +0.0 0.00 +0.2 0.18 ± 24% perf-profile.self.cycles-pp.mtree_range_walk
0.00 +0.0 0.00 +0.2 0.18 ± 29% perf-profile.self.cycles-pp.mas_alloc_cyclic
0.00 +0.0 0.00 +0.2 0.20 ± 14% perf-profile.self.cycles-pp.__memcg_slab_pre_alloc_hook
0.00 +0.0 0.00 +0.2 0.22 ± 32% perf-profile.self.cycles-pp.mas_erase
0.00 +0.0 0.00 +0.2 0.24 ± 35% perf-profile.self.cycles-pp.__memcg_slab_free_hook
0.00 +0.0 0.00 +0.2 0.24 ± 22% perf-profile.self.cycles-pp.mas_anode_descend
0.00 +0.0 0.00 +0.3 0.26 ± 17% perf-profile.self.cycles-pp.lockref_put_return
0.00 +0.0 0.00 +0.3 0.27 ± 16% perf-profile.self.cycles-pp.mas_wr_walk
0.00 +0.0 0.00 +0.3 0.34 ± 20% perf-profile.self.cycles-pp.mas_wr_node_store
0.00 +0.0 0.00 +0.4 0.35 ± 20% perf-profile.self.cycles-pp.__memcg_slab_post_alloc_hook
0.00 +0.0 0.00 +1.6 1.59 ± 8% perf-profile.self.cycles-pp.entry_SYSCALL_64
0.05 ± 85% +0.0 0.06 ± 47% +0.1 0.15 ± 54% perf-profile.self.cycles-pp.check_heap_object
0.05 ± 72% +0.0 0.06 ± 81% +0.0 0.10 ± 15% perf-profile.self.cycles-pp.call_cpuidle
0.04 ±105% +0.0 0.04 ± 75% +0.1 0.10 ± 33% perf-profile.self.cycles-pp.putname
0.00 +0.1 0.06 ± 24% +0.0 0.05 ± 66% perf-profile.self.cycles-pp.shmem_destroy_inode
0.00 +0.1 0.07 ± 8% +0.0 0.00 perf-profile.self.cycles-pp.__xa_alloc
0.02 ±146% +0.1 0.11 ± 28% +0.0 0.02 ±133% perf-profile.self.cycles-pp.rcu_nocb_try_bypass
0.01 ±223% +0.1 0.10 ± 28% -0.0 0.00 perf-profile.self.cycles-pp.shuffle_freelist
0.00 +0.1 0.11 ± 40% +0.0 0.00 perf-profile.self.cycles-pp.xas_create
0.00 +0.1 0.12 ± 27% +0.0 0.00 perf-profile.self.cycles-pp.xas_find_marked
0.00 +0.1 0.14 ± 18% +0.0 0.00 perf-profile.self.cycles-pp.xas_alloc
0.03 ±103% +0.1 0.17 ± 29% -0.0 0.03 ±136% perf-profile.self.cycles-pp.xas_descend
0.00 +0.2 0.16 ± 23% +0.0 0.00 perf-profile.self.cycles-pp.xas_expand
0.10 ± 22% +0.2 0.27 ± 16% -0.0 0.06 ± 65% perf-profile.self.cycles-pp.rcu_segcblist_enqueue
0.92 ± 35% +0.3 1.22 ± 11% -0.3 0.59 ± 15% perf-profile.self.cycles-pp.kmem_cache_free
0.00 +0.4 0.36 ± 16% +0.0 0.00 perf-profile.self.cycles-pp.xas_store
0.32 ± 30% +0.4 0.71 ± 12% -0.1 0.18 ± 23% perf-profile.self.cycles-pp.__call_rcu_common
0.18 ± 27% +0.5 0.65 ± 8% +0.1 0.26 ± 21% perf-profile.self.cycles-pp.kmem_cache_alloc_lru
0.36 ± 79% +0.6 0.96 ± 15% -0.0 0.31 ± 42% perf-profile.self.cycles-pp.__slab_free
0.00 +0.8 0.80 ± 14% +0.0 0.00 perf-profile.self.cycles-pp.radix_tree_node_rcu_free
0.00 +1.0 1.01 ± 16% +0.0 0.00 perf-profile.self.cycles-pp.radix_tree_node_ctor



[1] https://lore.kernel.org/all/[email protected]/

>
>
> > > > > @@ -330,9 +329,9 @@ int simple_offset_empty(struct dentry *dentry)
> > > > > if (!inode || !S_ISDIR(inode->i_mode))
> > > > > return ret;
> > > > >
> > > > > - index = 2;
> > > > > + index = DIR_OFFSET_MIN;
> > > >
> > > > This bit should go into the simple_offset_empty() patch...
> > > >
> > > > > @@ -434,15 +433,15 @@ static loff_t offset_dir_llseek(struct file *file, loff_t offset, int whence)
> > > > >
> > > > > /* In this case, ->private_data is protected by f_pos_lock */
> > > > > file->private_data = NULL;
> > > > > - return vfs_setpos(file, offset, U32_MAX);
> > > > > + return vfs_setpos(file, offset, MAX_LFS_FILESIZE);
> > > > ^^^
> > > > Why this? It is ULONG_MAX << PAGE_SHIFT on 32-bit so that doesn't seem
> > > > quite right? Why not use ULONG_MAX here directly?
> > >
> > > I initially changed U32_MAX to ULONG_MAX, but for some reason, the
> > > length checking in vfs_setpos() fails. There is probably a sign
> > > extension thing happening here that I don't understand.
> > >
> > >
> > > > Otherwise the patch looks good to me.
> > >
> > > As always, thank you for your review.
> > >
> > >
> > > --
> > > Chuck Lever
>
> --
> Chuck Lever
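
A note on the vfs_setpos() exchange above: the failure Chuck saw with
ULONG_MAX is most likely the implicit conversion to the signed loff_t
maxsize parameter. On 64-bit, an all-ones unsigned long converts to -1,
so the "offset > maxsize" length check rejects every positive offset.
A minimal standalone sketch (userspace stand-in for the kernel types,
not kernel code; assumes an LP64 target and the usual two's-complement
conversion):

	#include <limits.h>
	#include <stdio.h>

	int main(void)
	{
		/* long long stands in for the kernel's loff_t */
		long long offset = 42;		/* any legal, positive file offset */
		long long maxsize = ULONG_MAX;	/* converts to -1 on LP64 */

		/* mirrors the length check that vfs_setpos() applies */
		if (offset > maxsize)
			printf("rejected: maxsize became %lld\n", maxsize);
		else
			printf("accepted\n");
		return 0;
	}

With MAX_LFS_FILESIZE the bound stays a positive loff_t (LLONG_MAX on
64-bit), so the check behaves as intended; the 32-bit definition,
(loff_t)ULONG_MAX << PAGE_SHIFT, is what prompted Jan's question.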

2024-02-19 18:06:59

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH RFC 7/7] libfs: Re-arrange locking in offset_iterate_dir()

On Fri 16-02-24 11:33:18, Liam R. Howlett wrote:
> * Jan Kara <[email protected]> [240216 05:15]:
> > > If you have other examples you think are unsafe then I can have a look
> > > at them as well.
> >
> > I'm currently not aware of any but I'll let you know if I find some.
> > Missing xas/mas_pause() seems really easy.
>
> What if we convert the rcu_read_lock() to a mas_read_lock() or
> xas_read_lock(), which can check that the state isn't being locked
> without first being in the 'parked' state (paused or otherwise)?
>
> static inline void mas_read_lock(struct ma_state *mas)
> {
> 	assert(!mas_active(mas));
> 	rcu_read_lock();
> }
>
> Would that be a reasonable resolution to your concern? Unfortunately,
> what was done with the locking in this case would not be detected with
> this change unless the rcu_read_lock() was replaced. IOW, people could
> still use the rcu_read_lock() and skip the detection.

Yes, I guess this is still better than nothing.

> Doing the same in the mas_unlock() doesn't make as much sense since that
> may be called without the intent to reuse the state at all. So we'd be
> doing more work than necessary at the end of each loop or use.

Yes, understood.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR
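
A rough sketch of what the wrapper proposed above could look like, and
the usage pattern it is meant to police. This is illustrative only:
mas_read_lock() is not an existing maple tree API, mas_is_active() is
assumed as the "parked state" predicate, and WARN_ON_ONCE() stands in
for the assert:

	#include <linux/maple_tree.h>
	#include <linux/rcupdate.h>

	/* Hypothetical helper, not a current maple tree API. */
	static inline void mas_read_lock(struct ma_state *mas)
	{
		/* Catch a state carried across an RCU gap without mas_pause(). */
		WARN_ON_ONCE(mas_is_active(mas));
		rcu_read_lock();
	}

	/* The pattern being policed, similar to offset_iterate_dir(): */
	static void *example_walk(struct ma_state *mas)
	{
		void *entry;

		mas_read_lock(mas);
		entry = mas_find(mas, ULONG_MAX);
		if (entry)
			mas_pause(mas);	/* park the state before dropping RCU */
		rcu_read_unlock();

		/* ... sleepable work on entry would go here ... */

		mas_read_lock(mas);	/* warns if mas_pause() was skipped */
		entry = mas_find(mas, ULONG_MAX);
		rcu_read_unlock();

		return entry;
	}

The second mas_read_lock() is where a skipped mas_pause() would trip
the warning, which is the class of mistake discussed above.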