2021-06-07 10:13:06

by Ian Kent

Subject: [PATCH v5 0/6] kernfs: proposed locking and concurrency improvement

There have been a few instances of contention on the kernfs_mutex during
path walks, a case on very large IBM systems seen by myself, a report by
Brice Goglin and followed up by Fox Chen, and I've since seen a couple
of other reports by CoreOS users.

The common thread is a large number of kernfs path walks leading to
slowness of path walks due to kernfs_mutex contention.

The problem is that changes to the VFS over time have increased
its concurrency capabilities to an extent that kernfs's use of a mutex
is no longer appropriate. There's also a less common problem of walks
for non-existent paths causing contention when there are quite a few
of them.

This patch series is relatively straightforward.

All it does is add the ability to take advantage of VFS negative dentry
caching to avoid needless dentry alloc/free cycles for lookups of paths
that don't exist, and change the kernfs_mutex to a read/write semaphore.

The patch that tried to stay in VFS rcu-walk mode during path walks has
been dropped for two reasons. First, it doesn't actually give much
improvement and, second, if there's a place where mistakes could go
unnoticed it would be in that path. This makes the patch series simpler
to review and reduces the likelihood of problems going unnoticed and
popping up later.

Changes since v4:
- fixed kernfs_active() naming.
- added back kernfs_node revision patch to use for negative dentry
validation.
- minor updates to patch descriptions.

Changes since v3:
- remove unneeded indirection when referencing the super block.
- check if inode attribute update is actually needed.

Changes since v2:
- actually fix the inode attribute update locking.
- drop the patch that tried to stay in rcu-walk mode.
- drop the use a revision to identify if a directory has changed patch.

Changes since v1:
- fix locking in .permission() and .getattr() by re-factoring the
attribute handling code.
---

Ian Kent (6):
kernfs: move revalidate to be near lookup
kernfs: add a revision to identify directory node changes
kernfs: use VFS negative dentry caching
kernfs: switch kernfs to use an rwsem
kernfs: use i_lock to protect concurrent inode updates
kernfs: add kernfs_need_inode_refresh()


fs/kernfs/dir.c | 155 ++++++++++++++++++++----------------
fs/kernfs/file.c | 4 +-
fs/kernfs/inode.c | 45 +++++++++--
fs/kernfs/kernfs-internal.h | 29 ++++++-
fs/kernfs/mount.c | 12 +--
fs/kernfs/symlink.c | 4 +-
include/linux/kernfs.h | 7 +-
7 files changed, 166 insertions(+), 90 deletions(-)

--
Ian


2021-06-07 10:13:50

by Ian Kent

Subject: [PATCH v5 1/6] kernfs: move revalidate to be near lookup

While the dentry operation kernfs_dop_revalidate() is grouped with
dentry type functions it also has a strong affinity to the inode
operation ->lookup().

It makes sense to locate this function near to kernfs_iop_lookup()
because we will be adding VFS negative dentry caching to reduce path
lookup overhead for non-existent paths.

There's no functional change from this patch.

Signed-off-by: Ian Kent <[email protected]>
---
fs/kernfs/dir.c | 86 ++++++++++++++++++++++++++++---------------------------
1 file changed, 43 insertions(+), 43 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 7e0e62deab53c..33166ec90a112 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -548,49 +548,6 @@ void kernfs_put(struct kernfs_node *kn)
}
EXPORT_SYMBOL_GPL(kernfs_put);

-static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
-{
- struct kernfs_node *kn;
-
- if (flags & LOOKUP_RCU)
- return -ECHILD;
-
- /* Always perform fresh lookup for negatives */
- if (d_really_is_negative(dentry))
- goto out_bad_unlocked;
-
- kn = kernfs_dentry_node(dentry);
- mutex_lock(&kernfs_mutex);
-
- /* The kernfs node has been deactivated */
- if (!kernfs_active(kn))
- goto out_bad;
-
- /* The kernfs node has been moved? */
- if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
- goto out_bad;
-
- /* The kernfs node has been renamed */
- if (strcmp(dentry->d_name.name, kn->name) != 0)
- goto out_bad;
-
- /* The kernfs node has been moved to a different namespace */
- if (kn->parent && kernfs_ns_enabled(kn->parent) &&
- kernfs_info(dentry->d_sb)->ns != kn->ns)
- goto out_bad;
-
- mutex_unlock(&kernfs_mutex);
- return 1;
-out_bad:
- mutex_unlock(&kernfs_mutex);
-out_bad_unlocked:
- return 0;
-}
-
-const struct dentry_operations kernfs_dops = {
- .d_revalidate = kernfs_dop_revalidate,
-};
-
/**
* kernfs_node_from_dentry - determine kernfs_node associated with a dentry
* @dentry: the dentry in question
@@ -1073,6 +1030,49 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
return ERR_PTR(rc);
}

+static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
+{
+ struct kernfs_node *kn;
+
+ if (flags & LOOKUP_RCU)
+ return -ECHILD;
+
+ /* Always perform fresh lookup for negatives */
+ if (d_really_is_negative(dentry))
+ goto out_bad_unlocked;
+
+ kn = kernfs_dentry_node(dentry);
+ mutex_lock(&kernfs_mutex);
+
+ /* The kernfs node has been deactivated */
+ if (!kernfs_active(kn))
+ goto out_bad;
+
+ /* The kernfs node has been moved? */
+ if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
+ goto out_bad;
+
+ /* The kernfs node has been renamed */
+ if (strcmp(dentry->d_name.name, kn->name) != 0)
+ goto out_bad;
+
+ /* The kernfs node has been moved to a different namespace */
+ if (kn->parent && kernfs_ns_enabled(kn->parent) &&
+ kernfs_info(dentry->d_sb)->ns != kn->ns)
+ goto out_bad;
+
+ mutex_unlock(&kernfs_mutex);
+ return 1;
+out_bad:
+ mutex_unlock(&kernfs_mutex);
+out_bad_unlocked:
+ return 0;
+}
+
+const struct dentry_operations kernfs_dops = {
+ .d_revalidate = kernfs_dop_revalidate,
+};
+
static struct dentry *kernfs_iop_lookup(struct inode *dir,
struct dentry *dentry,
unsigned int flags)


2021-06-07 10:14:19

by Ian Kent

Subject: [PATCH v5 2/6] kernfs: add a revision to identify directory node changes

Add a revision counter to kernfs directory nodes so it can be used
to detect if a directory node has changed.

There's an assumption that sizeof(unsigned long) <= sizeof(pointer)
on all architectures and as far as I know that assumption holds.

So adding a revision counter to the struct kernfs_elem_dir variant of
the kernfs_node type union won't increase the size of the kernfs_node
struct. This is because struct kernfs_elem_dir is at least
sizeof(pointer) smaller than the largest union variant. It's tempting
to make the revision counter a u64 but that would increase the size of
kernfs_node on archs where sizeof(pointer) is smaller than the revision
counter.

Signed-off-by: Ian Kent <[email protected]>
---
fs/kernfs/dir.c | 8 ++++++++
fs/kernfs/kernfs-internal.h | 24 ++++++++++++++++++++++++
include/linux/kernfs.h | 5 +++++
3 files changed, 37 insertions(+)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 33166ec90a112..b88432c48851f 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -372,6 +372,7 @@ static int kernfs_link_sibling(struct kernfs_node *kn)
/* successfully added, account subdir number */
if (kernfs_type(kn) == KERNFS_DIR)
kn->parent->dir.subdirs++;
+ kernfs_inc_rev(kn->parent);

return 0;
}
@@ -394,6 +395,7 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)

if (kernfs_type(kn) == KERNFS_DIR)
kn->parent->dir.subdirs--;
+ kernfs_inc_rev(kn->parent);

rb_erase(&kn->rb, &kn->parent->dir.children);
RB_CLEAR_NODE(&kn->rb);
@@ -1105,6 +1107,12 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,

/* instantiate and hash dentry */
ret = d_splice_alias(inode, dentry);
+ if (!IS_ERR(ret)) {
+ if (unlikely(ret))
+ kernfs_set_rev(parent, ret);
+ else
+ kernfs_set_rev(parent, dentry);
+ }
out_unlock:
mutex_unlock(&kernfs_mutex);
return ret;
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index ccc3b44f6306f..1536002584fc4 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -81,6 +81,30 @@ static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
return d_inode(dentry)->i_private;
}

+static inline void kernfs_set_rev(struct kernfs_node *kn,
+ struct dentry *dentry)
+{
+ if (kernfs_type(kn) == KERNFS_DIR)
+ dentry->d_time = kn->dir.rev;
+}
+
+static inline void kernfs_inc_rev(struct kernfs_node *kn)
+{
+ if (kernfs_type(kn) == KERNFS_DIR)
+ kn->dir.rev++;
+}
+
+static inline bool kernfs_dir_changed(struct kernfs_node *kn,
+ struct dentry *dentry)
+{
+ if (kernfs_type(kn) == KERNFS_DIR) {
+ /* Not really a time but it does what's needed */
+ if (time_after(kn->dir.rev, dentry->d_time))
+ return true;
+ }
+ return false;
+}
+
extern const struct super_operations kernfs_sops;
extern struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;

diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 9e8ca8743c268..7947acb1163d7 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -98,6 +98,11 @@ struct kernfs_elem_dir {
* better directly in kernfs_node but is here to save space.
*/
struct kernfs_root *root;
+ /*
+ * Monotonic revision counter, used to identify if a directory
+ * node has changed during revalidation.
+ */
+ unsigned long rev;
};

struct kernfs_elem_symlink {


2021-06-07 10:15:02

by Ian Kent

Subject: [PATCH v5 3/6] kernfs: use VFS negative dentry caching

If there are many lookups for non-existent paths these negative lookups
can lead to a lot of overhead during path walks.

The VFS allows dentries to be created as negative and hashed, and caches
them so they can be used to reduce the fairly high overhead alloc/free
cycle that occurs during these lookups.

Use the kernfs node parent revision to identify if a change has been
made to the containing directory so that the negative dentry can be
discarded and the lookup redone.

Signed-off-by: Ian Kent <[email protected]>
---
fs/kernfs/dir.c | 53 +++++++++++++++++++++++++++++++----------------------
1 file changed, 31 insertions(+), 22 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index b88432c48851f..5ae95e8d1aea1 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1039,13 +1039,32 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
if (flags & LOOKUP_RCU)
return -ECHILD;

- /* Always perform fresh lookup for negatives */
- if (d_really_is_negative(dentry))
- goto out_bad_unlocked;
-
kn = kernfs_dentry_node(dentry);
mutex_lock(&kernfs_mutex);

+ /* Negative hashed dentry? */
+ if (!kn) {
+ struct dentry *d_parent = dget_parent(dentry);
+ struct kernfs_node *parent;
+
+ /* If the kernfs parent node has changed discard and
+ * proceed to ->lookup.
+ */
+ parent = kernfs_dentry_node(d_parent);
+ if (parent) {
+ if (kernfs_dir_changed(parent, dentry)) {
+ dput(d_parent);
+ goto out_bad;
+ }
+ }
+ dput(d_parent);
+
+ /* The kernfs node doesn't exist, leave the dentry
+ * negative and return success.
+ */
+ goto out;
+ }
+
/* The kernfs node has been deactivated */
if (!kernfs_active(kn))
goto out_bad;
@@ -1062,12 +1081,11 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
if (kn->parent && kernfs_ns_enabled(kn->parent) &&
kernfs_info(dentry->d_sb)->ns != kn->ns)
goto out_bad;
-
+out:
mutex_unlock(&kernfs_mutex);
return 1;
out_bad:
mutex_unlock(&kernfs_mutex);
-out_bad_unlocked:
return 0;
}

@@ -1082,30 +1100,21 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
struct dentry *ret;
struct kernfs_node *parent = dir->i_private;
struct kernfs_node *kn;
- struct inode *inode;
+ struct inode *inode = NULL;
const void *ns = NULL;

mutex_lock(&kernfs_mutex);
-
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dir->i_sb)->ns;

kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
-
- /* no such entry */
- if (!kn || !kernfs_active(kn)) {
- ret = NULL;
- goto out_unlock;
- }
-
/* attach dentry and inode */
- inode = kernfs_get_inode(dir->i_sb, kn);
- if (!inode) {
- ret = ERR_PTR(-ENOMEM);
- goto out_unlock;
+ if (kn && kernfs_active(kn)) {
+ inode = kernfs_get_inode(dir->i_sb, kn);
+ if (!inode)
+ inode = ERR_PTR(-ENOMEM);
}
-
- /* instantiate and hash dentry */
+ /* instantiate and hash (possibly negative) dentry */
ret = d_splice_alias(inode, dentry);
if (!IS_ERR(ret)) {
if (unlikely(ret))
@@ -1113,8 +1122,8 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
else
kernfs_set_rev(parent, dentry);
}
- out_unlock:
mutex_unlock(&kernfs_mutex);
+
return ret;
}



2021-06-07 10:15:32

by Ian Kent

Subject: [PATCH v5 4/6] kernfs: switch kernfs to use an rwsem

The kernfs global lock restricts the ability to perform kernfs node
lookup operations in parallel during path walks.

Change the kernfs mutex to an rwsem so that, when opportunity arises,
node searches can be done in parallel with path walk lookups.

Signed-off-by: Ian Kent <[email protected]>
---
fs/kernfs/dir.c | 94 ++++++++++++++++++++++---------------------
fs/kernfs/file.c | 4 +-
fs/kernfs/inode.c | 16 ++++---
fs/kernfs/kernfs-internal.h | 5 +-
fs/kernfs/mount.c | 12 +++--
fs/kernfs/symlink.c | 4 +-
include/linux/kernfs.h | 2 -
7 files changed, 69 insertions(+), 68 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 5ae95e8d1aea1..cb0f77568dd41 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -17,7 +17,7 @@

#include "kernfs-internal.h"

-DEFINE_MUTEX(kernfs_mutex);
+DECLARE_RWSEM(kernfs_rwsem);
static DEFINE_SPINLOCK(kernfs_rename_lock); /* kn->parent and ->name */
static char kernfs_pr_cont_buf[PATH_MAX]; /* protected by rename_lock */
static DEFINE_SPINLOCK(kernfs_idr_lock); /* root->ino_idr */
@@ -26,7 +26,7 @@ static DEFINE_SPINLOCK(kernfs_idr_lock); /* root->ino_idr */

static bool kernfs_active(struct kernfs_node *kn)
{
- lockdep_assert_held(&kernfs_mutex);
+ lockdep_assert_held(&kernfs_rwsem);
return atomic_read(&kn->active) >= 0;
}

@@ -340,7 +340,7 @@ static int kernfs_sd_compare(const struct kernfs_node *left,
* @kn->parent->dir.children.
*
* Locking:
- * mutex_lock(kernfs_mutex)
+ * kernfs_rwsem held exclusive
*
* RETURNS:
* 0 on success, -EEXIST on failure.
@@ -386,7 +386,7 @@ static int kernfs_link_sibling(struct kernfs_node *kn)
* removed, %false if @kn wasn't on the rbtree.
*
* Locking:
- * mutex_lock(kernfs_mutex)
+ * kernfs_rwsem held exclusive
*/
static bool kernfs_unlink_sibling(struct kernfs_node *kn)
{
@@ -457,14 +457,14 @@ void kernfs_put_active(struct kernfs_node *kn)
* return after draining is complete.
*/
static void kernfs_drain(struct kernfs_node *kn)
- __releases(&kernfs_mutex) __acquires(&kernfs_mutex)
+ __releases(&kernfs_rwsem) __acquires(&kernfs_rwsem)
{
struct kernfs_root *root = kernfs_root(kn);

- lockdep_assert_held(&kernfs_mutex);
+ lockdep_assert_held_write(&kernfs_rwsem);
WARN_ON_ONCE(kernfs_active(kn));

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);

if (kernfs_lockdep(kn)) {
rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
@@ -483,7 +483,7 @@ static void kernfs_drain(struct kernfs_node *kn)

kernfs_drain_open_files(kn);

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
}

/**
@@ -722,7 +722,7 @@ int kernfs_add_one(struct kernfs_node *kn)
bool has_ns;
int ret;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);

ret = -EINVAL;
has_ns = kernfs_ns_enabled(parent);
@@ -753,7 +753,7 @@ int kernfs_add_one(struct kernfs_node *kn)
ps_iattr->ia_mtime = ps_iattr->ia_ctime;
}

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);

/*
* Activate the new node unless CREATE_DEACTIVATED is requested.
@@ -767,7 +767,7 @@ int kernfs_add_one(struct kernfs_node *kn)
return 0;

out_unlock:
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
return ret;
}

@@ -788,7 +788,7 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
bool has_ns = kernfs_ns_enabled(parent);
unsigned int hash;

- lockdep_assert_held(&kernfs_mutex);
+ lockdep_assert_held(&kernfs_rwsem);

if (has_ns != (bool)ns) {
WARN(1, KERN_WARNING "kernfs: ns %s in '%s' for '%s'\n",
@@ -820,7 +820,7 @@ static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
size_t len;
char *p, *name;

- lockdep_assert_held(&kernfs_mutex);
+ lockdep_assert_held_read(&kernfs_rwsem);

/* grab kernfs_rename_lock to piggy back on kernfs_pr_cont_buf */
spin_lock_irq(&kernfs_rename_lock);
@@ -860,10 +860,10 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
{
struct kernfs_node *kn;

- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);
kn = kernfs_find_ns(parent, name, ns);
kernfs_get(kn);
- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);

return kn;
}
@@ -884,10 +884,10 @@ struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
{
struct kernfs_node *kn;

- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);
kn = kernfs_walk_ns(parent, path, ns);
kernfs_get(kn);
- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);

return kn;
}
@@ -1040,7 +1040,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
return -ECHILD;

kn = kernfs_dentry_node(dentry);
- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);

/* Negative hashed dentry? */
if (!kn) {
@@ -1082,10 +1082,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
kernfs_info(dentry->d_sb)->ns != kn->ns)
goto out_bad;
out:
- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);
return 1;
out_bad:
- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);
return 0;
}

@@ -1103,7 +1103,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
struct inode *inode = NULL;
const void *ns = NULL;

- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dir->i_sb)->ns;

@@ -1122,7 +1122,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
else
kernfs_set_rev(parent, dentry);
}
- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);

return ret;
}
@@ -1244,7 +1244,7 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
{
struct rb_node *rbn;

- lockdep_assert_held(&kernfs_mutex);
+ lockdep_assert_held_write(&kernfs_rwsem);

/* if first iteration, visit leftmost descendant which may be root */
if (!pos)
@@ -1280,7 +1280,7 @@ void kernfs_activate(struct kernfs_node *kn)
{
struct kernfs_node *pos;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);

pos = NULL;
while ((pos = kernfs_next_descendant_post(pos, kn))) {
@@ -1294,14 +1294,14 @@ void kernfs_activate(struct kernfs_node *kn)
pos->flags |= KERNFS_ACTIVATED;
}

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
}

static void __kernfs_remove(struct kernfs_node *kn)
{
struct kernfs_node *pos;

- lockdep_assert_held(&kernfs_mutex);
+ lockdep_assert_held_write(&kernfs_rwsem);

/*
* Short-circuit if non-root @kn has already finished removal.
@@ -1324,7 +1324,7 @@ static void __kernfs_remove(struct kernfs_node *kn)
pos = kernfs_leftmost_descendant(kn);

/*
- * kernfs_drain() drops kernfs_mutex temporarily and @pos's
+ * kernfs_drain() drops kernfs_rwsem temporarily and @pos's
* base ref could have been put by someone else by the time
* the function returns. Make sure it doesn't go away
* underneath us.
@@ -1371,9 +1371,9 @@ static void __kernfs_remove(struct kernfs_node *kn)
*/
void kernfs_remove(struct kernfs_node *kn)
{
- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
__kernfs_remove(kn);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
}

/**
@@ -1460,17 +1460,17 @@ bool kernfs_remove_self(struct kernfs_node *kn)
{
bool ret;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
kernfs_break_active_protection(kn);

/*
* SUICIDAL is used to arbitrate among competing invocations. Only
* the first one will actually perform removal. When the removal
* is complete, SUICIDED is set and the active ref is restored
- * while holding kernfs_mutex. The ones which lost arbitration
- * waits for SUICDED && drained which can happen only after the
- * enclosing kernfs operation which executed the winning instance
- * of kernfs_remove_self() finished.
+ * while kernfs_rwsem is held exclusive. The ones which lost
+ * arbitration wait for SUICIDED && drained which can happen only
+ * after the enclosing kernfs operation which executed the winning
+ * instance of kernfs_remove_self() finished.
*/
if (!(kn->flags & KERNFS_SUICIDAL)) {
kn->flags |= KERNFS_SUICIDAL;
@@ -1488,9 +1488,9 @@ bool kernfs_remove_self(struct kernfs_node *kn)
atomic_read(&kn->active) == KN_DEACTIVATED_BIAS)
break;

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
schedule();
- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
}
finish_wait(waitq, &wait);
WARN_ON_ONCE(!RB_EMPTY_NODE(&kn->rb));
@@ -1498,12 +1498,12 @@ bool kernfs_remove_self(struct kernfs_node *kn)
}

/*
- * This must be done while holding kernfs_mutex; otherwise, waiting
- * for SUICIDED && deactivated could finish prematurely.
+ * This must be done while kernfs_rwsem held exclusive; otherwise,
+ * waiting for SUICIDED && deactivated could finish prematurely.
*/
kernfs_unbreak_active_protection(kn);

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
return ret;
}

@@ -1527,13 +1527,13 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
return -ENOENT;
}

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);

kn = kernfs_find_ns(parent, name, ns);
if (kn)
__kernfs_remove(kn);

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);

if (kn)
return 0;
@@ -1559,7 +1559,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
if (!kn->parent)
return -EINVAL;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);

error = -ENOENT;
if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
@@ -1613,7 +1613,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,

error = 0;
out:
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
return error;
}

@@ -1688,7 +1688,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)

if (!dir_emit_dots(file, ctx))
return 0;
- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);

if (kernfs_ns_enabled(parent))
ns = kernfs_info(dentry->d_sb)->ns;
@@ -1705,12 +1705,12 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
file->private_data = pos;
kernfs_get(pos);

- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);
if (!dir_emit(ctx, name, len, ino, type))
return 0;
- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);
}
- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);
file->private_data = NULL;
ctx->pos = INT_MAX;
return 0;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index c757193121475..60e2a86c535eb 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -860,7 +860,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
spin_unlock_irq(&kernfs_notify_lock);

/* kick fsnotify */
- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);

list_for_each_entry(info, &kernfs_root(kn)->supers, node) {
struct kernfs_node *parent;
@@ -898,7 +898,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
iput(inode);
}

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
kernfs_put(kn);
goto repeat;
}
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index d73950fc3d57d..3b01e9e61f14e 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -106,9 +106,9 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
{
int ret;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
ret = __kernfs_setattr(kn, iattr);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
return ret;
}

@@ -122,7 +122,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
if (!kn)
return -EINVAL;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
error = setattr_prepare(&init_user_ns, dentry, iattr);
if (error)
goto out;
@@ -135,7 +135,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
setattr_copy(&init_user_ns, inode, iattr);

out:
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
return error;
}

@@ -191,9 +191,9 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
struct inode *inode = d_inode(path->dentry);
struct kernfs_node *kn = inode->i_private;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
kernfs_refresh_inode(kn, inode);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);

generic_fillattr(&init_user_ns, inode, stat);
return 0;
@@ -284,9 +284,9 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,

kn = inode->i_private;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
kernfs_refresh_inode(kn, inode);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);

return generic_permission(&init_user_ns, inode, mask);
}
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 1536002584fc4..ec81c391a20dc 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -13,6 +13,7 @@
#include <linux/lockdep.h>
#include <linux/fs.h>
#include <linux/mutex.h>
+#include <linux/rwsem.h>
#include <linux/xattr.h>

#include <linux/kernfs.h>
@@ -69,7 +70,7 @@ struct kernfs_super_info {
*/
const void *ns;

- /* anchored at kernfs_root->supers, protected by kernfs_mutex */
+ /* anchored at kernfs_root->supers, protected by kernfs_rwsem */
struct list_head node;
};
#define kernfs_info(SB) ((struct kernfs_super_info *)(SB->s_fs_info))
@@ -126,7 +127,7 @@ int __kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr);
/*
* dir.c
*/
-extern struct mutex kernfs_mutex;
+extern struct rw_semaphore kernfs_rwsem;
extern const struct dentry_operations kernfs_dops;
extern const struct file_operations kernfs_dir_fops;
extern const struct inode_operations kernfs_dir_iops;
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 9dc7e7a64e10f..baa4155ba2edf 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -255,9 +255,9 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k
sb->s_shrink.seeks = 0;

/* get root inode, initialize and unlock it */
- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
inode = kernfs_get_inode(sb, info->root->kn);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
if (!inode) {
pr_debug("kernfs: could not get root inode\n");
return -ENOMEM;
@@ -344,9 +344,9 @@ int kernfs_get_tree(struct fs_context *fc)
}
sb->s_flags |= SB_ACTIVE;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
list_add(&info->node, &info->root->supers);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
}

fc->root = dget(sb->s_root);
@@ -372,9 +372,9 @@ void kernfs_kill_sb(struct super_block *sb)
{
struct kernfs_super_info *info = kernfs_info(sb);

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
list_del(&info->node);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);

/*
* Remove the superblock from fs_supers/s_instances
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 5432883d819f2..c8f8e41b84110 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -116,9 +116,9 @@ static int kernfs_getlink(struct inode *inode, char *path)
struct kernfs_node *target = kn->symlink.target_kn;
int error;

- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);
error = kernfs_get_target_path(parent, target, path);
- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);

return error;
}
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 7947acb1163d7..229c4e96a3c39 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -193,7 +193,7 @@ struct kernfs_root {
u32 id_highbits;
struct kernfs_syscall_ops *syscall_ops;

- /* list of kernfs_super_info of this root, protected by kernfs_mutex */
+ /* list of kernfs_super_info of this root, protected by kernfs_rwsem */
struct list_head supers;

wait_queue_head_t deactivate_waitq;


2021-06-07 10:17:13

by Ian Kent

Subject: [PATCH v5 5/6] kernfs: use i_lock to protect concurrent inode updates

The inode operations .permission() and .getattr() use the kernfs node
write lock but all that's needed is to keep the rb tree stable while
updating the inode attributes as well as protecting the update itself
against concurrent changes.

And .permission() is called frequently during path walks and can cause
quite a bit of contention between kernfs node operations and path
walks when the number of concurrent walks is high.

To change kernfs_iop_getattr() and kernfs_iop_permission() to take
the rw sem read lock instead of the write lock an additional lock is
needed to protect against multiple processes concurrently updating
the inode attributes and link count in kernfs_refresh_inode().

The inode i_lock seems like the sensible thing to use to protect these
inode attribute updates so use it in kernfs_refresh_inode().

The last hunk in the patch, applied to kernfs_fill_super(), is possibly
not needed, but the lock was taken there originally and I prefer to
continue to take it so the rb tree is held stable during the call to
kernfs_refresh_inode() made by kernfs_get_inode().

Signed-off-by: Ian Kent <[email protected]>
---
fs/kernfs/inode.c | 10 ++++++----
fs/kernfs/mount.c | 4 ++--
2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 3b01e9e61f14e..6728ecd81eb37 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
{
struct kernfs_iattrs *attrs = kn->iattr;

+ spin_lock(&inode->i_lock);
inode->i_mode = kn->mode;
if (attrs)
/*
@@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)

if (kernfs_type(kn) == KERNFS_DIR)
set_nlink(inode, kn->dir.subdirs + 2);
+ spin_unlock(&inode->i_lock);
}

int kernfs_iop_getattr(struct user_namespace *mnt_userns,
@@ -191,9 +193,9 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
struct inode *inode = d_inode(path->dentry);
struct kernfs_node *kn = inode->i_private;

- down_write(&kernfs_rwsem);
+ down_read(&kernfs_rwsem);
kernfs_refresh_inode(kn, inode);
- up_write(&kernfs_rwsem);
+ up_read(&kernfs_rwsem);

generic_fillattr(&init_user_ns, inode, stat);
return 0;
@@ -284,9 +286,9 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,

kn = inode->i_private;

- down_write(&kernfs_rwsem);
+ down_read(&kernfs_rwsem);
kernfs_refresh_inode(kn, inode);
- up_write(&kernfs_rwsem);
+ up_read(&kernfs_rwsem);

return generic_permission(&init_user_ns, inode, mask);
}
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index baa4155ba2edf..f2f909d09f522 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -255,9 +255,9 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k
sb->s_shrink.seeks = 0;

/* get root inode, initialize and unlock it */
- down_write(&kernfs_rwsem);
+ down_read(&kernfs_rwsem);
inode = kernfs_get_inode(sb, info->root->kn);
- up_write(&kernfs_rwsem);
+ up_read(&kernfs_rwsem);
if (!inode) {
pr_debug("kernfs: could not get root inode\n");
return -ENOMEM;


2021-06-07 10:17:30

by Ian Kent

Subject: [PATCH v5 6/6] kernfs: add kernfs_need_inode_refresh()

Now that the kernfs_rwsem read lock is held for kernfs_refresh_inode()
and the i_lock is taken to protect inode updates, some contention can be
introduced when .permission() is called with concurrent path walks in
progress.

Since .permission() is called frequently during path walks it's worth
checking if the update is actually needed before taking the lock and
performing the update.

Signed-off-by: Ian Kent <[email protected]>
---
fs/kernfs/inode.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 6728ecd81eb37..67fb1289c51dc 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -158,6 +158,30 @@ static inline void set_default_inode_attr(struct inode *inode, umode_t mode)
inode->i_ctime = current_time(inode);
}

+static bool kernfs_need_inode_refresh(struct kernfs_node *kn,
+ struct inode *inode,
+ struct kernfs_iattrs *attrs)
+{
+ if (kernfs_type(kn) == KERNFS_DIR) {
+ if (inode->i_nlink != kn->dir.subdirs + 2)
+ return true;
+ }
+
+ if (inode->i_mode != kn->mode)
+ return true;
+
+ if (attrs) {
+ if (!timespec64_equal(&inode->i_atime, &attrs->ia_atime) ||
+ !timespec64_equal(&inode->i_mtime, &attrs->ia_mtime) ||
+ !timespec64_equal(&inode->i_ctime, &attrs->ia_ctime) ||
+ !uid_eq(inode->i_uid, attrs->ia_uid) ||
+ !gid_eq(inode->i_gid, attrs->ia_gid))
+ return true;
+ }
+
+ return false;
+}
+
static inline void set_inode_attr(struct inode *inode,
struct kernfs_iattrs *attrs)
{
@@ -172,6 +196,9 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
{
struct kernfs_iattrs *attrs = kn->iattr;

+ if (!kernfs_need_inode_refresh(kn, inode, attrs))
+ return;
+
spin_lock(&inode->i_lock);
inode->i_mode = kn->mode;
if (attrs)


2021-06-07 10:26:23

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH v5 0/6] kernfs: proposed locking and concurrency improvement

On Mon, 2021-06-07 at 18:11 +0800, Ian Kent wrote:
> There have been a few instances of contention on the kernfs_mutex
> during
> path walks, a case on very large IBM systems seen by myself, a report
> by
> Brice Goglin and followed up by Fox Chen, and I've since seen a couple
> of other reports by CoreOS users.

The contention problems that I've seen show large numbers of processes
blocking in the functions ->d_revalidate() and ->permission() when a
lot of path walks are being done concurrently.

I've also seen contention on d_alloc_parallel() when there are a lot of
path walks for a file path that doesn't exist. For this latter case I
saw a much smaller proportion of processes blocking but enough to get
my attention.

I used Fox Chen's benchmark repo (slightly modified) to perform tests
to demonstrate the contention. The program basically runs a number of
pthread threads (equal to the number of cpus) and each thread opens,
reads and then closes a file or files (depending on the test case) in a
loop, here 10000 times.

I performed two tests, one with distinct file paths for half the number
of cpus, ie. 8 files for a 16 cpu machine, and another using the same
non-existent file path to simulate many lookups for a path that doesn't
exist.

The former test compares general path lookup behaviour between the
kernfs mutex and the rwsem while the latter compares lookup behaviour
when VFS negative dentry caching is used.

I looked at the case of having only the negative dentry patches
applied but it was uninteresting as the contention moved from
d_alloc_parallel() to ->d_revalidate() and ->permission(). So the
rwsem patches are needed for the negative dentry patches to be
useful.

I captured perf call graphs for each of the test runs that show where
contention is occurring. For the missing file case d_alloc_parallel()
dominates the call graph and for the files case ->d_revalidate() and
->permission() dominate the call graph.

1) Run with 8 distinct sysfs file paths on 16 cpu machine, perf graphs
in base-files-cpu-16-perf.txt and patched-files-cpu-16-perf.txt.

Base (5.8.18-100.fc31, unpatched)
---------------------------------
single: total 37.802725ms per 3.780272us
concur: total 888.934870ms per 88.893487us CPU 5
concur: total 893.396079ms per 89.339608us CPU 3
concur: total 897.952652ms per 89.795265us CPU 8
concur: total 908.120647ms per 90.812065us CPU 13
concur: total 909.288507ms per 90.928851us CPU 2
concur: total 936.016942ms per 93.601694us CPU 10
concur: total 937.143611ms per 93.714361us CPU 15
concur: total 946.569582ms per 94.656958us CPU 6
concur: total 946.859803ms per 94.685980us CPU 11
concur: total 951.888699ms per 95.188870us CPU 12
concur: total 952.930595ms per 95.293059us CPU 14
concur: total 953.105172ms per 95.310517us CPU 9
concur: total 953.983792ms per 95.398379us CPU 1
concur: total 954.019331ms per 95.401933us CPU 7
concur: total 954.314661ms per 95.431466us CPU 4
concur: total 953.315950ms per 95.331595us CPU 0
times: 10000 threads: 16 cpus: 16

patched (5.12.2)
----------------
single: total 44.351311ms per 4.435131us
concur: total 205.454229ms per 20.545423us CPU 10
concur: total 206.481337ms per 20.648134us CPU 2
concur: total 209.061697ms per 20.906170us CPU 9
concur: total 209.081926ms per 20.908193us CPU 7
concur: total 209.813371ms per 20.981337us CPU 15
concur: total 210.762667ms per 21.076267us CPU 8
concur: total 211.073960ms per 21.107396us CPU 5
concur: total 211.788792ms per 21.178879us CPU 3
concur: total 212.029698ms per 21.202970us CPU 14
concur: total 212.951390ms per 21.295139us CPU 6
concur: total 212.994193ms per 21.299419us CPU 0
concur: total 205.059406ms per 20.505941us CPU 1
concur: total 214.761330ms per 21.476133us CPU 4
concur: total 210.278140ms per 21.027814us CPU 13
concur: total 215.120326ms per 21.512033us CPU 12
concur: total 215.308288ms per 21.530829us CPU 11
times: 10000 threads: 16 cpus: 16

2) Run with a single sysfs file path on 16 cpu machine, perf graphs in
base-missing-cpu-16-perf.txt and patched-missing-cpu-16-perf.txt.

Base (5.8.18-100.fc31, unpatched)
---------------------------------
single: total 23.870708ms per 2.387071us
concur: total 796.504874ms per 79.650487us CPU 3
concur: total 806.306131ms per 80.630613us CPU 11
concur: total 808.494954ms per 80.849495us CPU 6
concur: total 813.103969ms per 81.310397us CPU 1
concur: total 813.407996ms per 81.340800us CPU 8
concur: total 813.427143ms per 81.342714us CPU 9
concur: total 815.892622ms per 81.589262us CPU 2
concur: total 816.133378ms per 81.613338us CPU 4
concur: total 817.189601ms per 81.718960us CPU 14
concur: total 818.323855ms per 81.832386us CPU 13
concur: total 820.115479ms per 82.011548us CPU 15
concur: total 821.024798ms per 82.102480us CPU 7
concur: total 826.135994ms per 82.613599us CPU 12
concur: total 826.315963ms per 82.631596us CPU 0
concur: total 829.141106ms per 82.914111us CPU 10
concur: total 830.058310ms per 83.005831us CPU 5
times: 10000 threads: 16 cpus: 16

patched (5.12.2)
----------------
single: total 21.414203ms per 2.141420us
concur: total 231.474574ms per 23.147457us CPU 15
concur: total 233.030232ms per 23.303023us CPU 11
concur: total 235.226442ms per 23.522644us CPU 5
concur: total 236.084628ms per 23.608463us CPU 9
concur: total 236.635558ms per 23.663556us CPU 10
concur: total 237.156850ms per 23.715685us CPU 2
concur: total 237.260609ms per 23.726061us CPU 3
concur: total 237.577515ms per 23.757752us CPU 12
concur: total 237.605650ms per 23.760565us CPU 1
concur: total 237.746644ms per 23.774664us CPU 8
concur: total 238.417997ms per 23.841800us CPU 0
concur: total 238.725191ms per 23.872519us CPU 4
concur: total 240.301641ms per 24.030164us CPU 14
concur: total 240.570763ms per 24.057076us CPU 13
concur: total 240.758979ms per 24.075898us CPU 6
concur: total 241.211006ms per 24.121101us CPU 7
times: 10000 threads: 16 cpus: 16

3) Run with 24 distinct sysfs file paths on 48 cpu machine, perf graphs
in base-files-cpu-48-perf.txt and patched-files-cpu-48-perf.txt.

Base (5.12.2, unpatched)
------------------------
single: total 122.827400ms per 12.282740us
concur: total 5306.902134ms per 530.690213us CPU 35
concur: total 5630.720717ms per 563.072072us CPU 46
concur: total 5638.448405ms per 563.844841us CPU 42
concur: total 5642.860083ms per 564.286008us CPU 34
concur: total 5651.030648ms per 565.103065us CPU 20
concur: total 5657.526181ms per 565.752618us CPU 31
concur: total 5658.140447ms per 565.814045us CPU 23
concur: total 5659.691758ms per 565.969176us CPU 19
concur: total 5668.248013ms per 566.824801us CPU 21
concur: total 5669.774274ms per 566.977427us CPU 22
concur: total 5685.258360ms per 568.525836us CPU 30
concur: total 5685.799738ms per 568.579974us CPU 32
concur: total 5689.631849ms per 568.963185us CPU 18
concur: total 5696.818593ms per 569.681859us CPU 44
concur: total 5698.618608ms per 569.861861us CPU 33
concur: total 5698.794859ms per 569.879486us CPU 45
concur: total 5770.686184ms per 577.068618us CPU 28
concur: total 5778.892695ms per 577.889270us CPU 27
concur: total 5784.709119ms per 578.470912us CPU 29
concur: total 5788.893840ms per 578.889384us CPU 24
concur: total 5789.576181ms per 578.957618us CPU 25
concur: total 5798.722220ms per 579.872222us CPU 26
concur: total 5822.426684ms per 582.242668us CPU 36
concur: total 5826.460510ms per 582.646051us CPU 38
concur: total 5831.715090ms per 583.171509us CPU 14
concur: total 5831.966863ms per 583.196686us CPU 41
concur: total 5833.488179ms per 583.348818us CPU 37
concur: total 5835.039815ms per 583.503982us CPU 40
concur: total 5837.073842ms per 583.707384us CPU 39
concur: total 5838.603686ms per 583.860369us CPU 16
concur: total 5841.427760ms per 584.142776us CPU 13
concur: total 5844.173463ms per 584.417346us CPU 17
concur: total 5844.526500ms per 584.452650us CPU 12
concur: total 5844.543912ms per 584.454391us CPU 15
concur: total 5856.646296ms per 585.664630us CPU 43
concur: total 5882.959009ms per 588.295901us CPU 4
concur: total 5885.522053ms per 588.552205us CPU 47
concur: total 5886.485513ms per 588.648551us CPU 9
concur: total 5889.596333ms per 588.959633us CPU 7
concur: total 5891.098216ms per 589.109822us CPU 8
concur: total 5893.823953ms per 589.382395us CPU 6
concur: total 5894.175035ms per 589.417504us CPU 10
concur: total 5894.333983ms per 589.433398us CPU 5
concur: total 5894.339733ms per 589.433973us CPU 11
concur: total 5894.780552ms per 589.478055us CPU 2
concur: total 5894.902495ms per 589.490250us CPU 3
concur: total 5895.138875ms per 589.513888us CPU 1
concur: total 5895.751332ms per 589.575133us CPU 0
times: 10000 threads: 48 cpus: 48

patched (5.12.2)
----------------
single: total 113.291645ms per 11.329165us
concur: total 1593.959049ms per 159.395905us CPU 13
concur: total 1597.518495ms per 159.751850us CPU 3
concur: total 1597.658208ms per 159.765821us CPU 6
concur: total 1600.019094ms per 160.001909us CPU 25
concur: total 1601.089351ms per 160.108935us CPU 23
concur: total 1601.469009ms per 160.146901us CPU 26
concur: total 1602.896466ms per 160.289647us CPU 30
concur: total 1603.235130ms per 160.323513us CPU 1
concur: total 1603.366164ms per 160.336616us CPU 28
concur: total 1604.441214ms per 160.444121us CPU 2
concur: total 1604.688351ms per 160.468835us CPU 36
concur: total 1605.739458ms per 160.573946us CPU 8
concur: total 1606.069951ms per 160.606995us CPU 31
concur: total 1606.332397ms per 160.633240us CPU 22
concur: total 1608.634998ms per 160.863500us CPU 11
concur: total 1608.698868ms per 160.869887us CPU 5
concur: total 1609.072888ms per 160.907289us CPU 43
concur: total 1609.780952ms per 160.978095us CPU 41
concur: total 1610.214802ms per 161.021480us CPU 12
concur: total 1610.618660ms per 161.061866us CPU 16
concur: total 1610.885785ms per 161.088578us CPU 27
concur: total 1611.576231ms per 161.157623us CPU 10
concur: total 1612.083975ms per 161.208398us CPU 38
concur: total 1612.677333ms per 161.267733us CPU 45
concur: total 1612.698645ms per 161.269865us CPU 44
concur: total 1612.887981ms per 161.288798us CPU 18
concur: total 1612.808693ms per 161.280869us CPU 4
concur: total 1612.844263ms per 161.284426us CPU 35
concur: total 1612.760745ms per 161.276075us CPU 40
concur: total 1613.220738ms per 161.322074us CPU 17
concur: total 1613.249031ms per 161.324903us CPU 29
concur: total 1613.270812ms per 161.327081us CPU 20
concur: total 1613.325711ms per 161.332571us CPU 24
concur: total 1613.499246ms per 161.349925us CPU 21
concur: total 1613.347917ms per 161.334792us CPU 42
concur: total 1613.416651ms per 161.341665us CPU 15
concur: total 1613.742291ms per 161.374229us CPU 46
concur: total 1613.809087ms per 161.380909us CPU 32
concur: total 1613.329478ms per 161.332948us CPU 19
concur: total 1613.783009ms per 161.378301us CPU 9
concur: total 1613.626390ms per 161.362639us CPU 39
concur: total 1614.077897ms per 161.407790us CPU 34
concur: total 1614.094290ms per 161.409429us CPU 7
concur: total 1614.754743ms per 161.475474us CPU 14
concur: total 1614.958943ms per 161.495894us CPU 0
concur: total 1616.025304ms per 161.602530us CPU 37
concur: total 1617.808550ms per 161.780855us CPU 47
concur: total 1630.682246ms per 163.068225us CPU 33
times: 10000 threads: 48 cpus: 48

4) Run with a single sysfs file path on 48 cpu machine, perf graphs in
base-missing-cpu-48-perf.txt and patched-missing-cpu-48-perf.txt.

Base (5.12.2, unpatched)
------------------------
single: total 87.107970ms per 8.710797us
concur: total 15072.702249ms per 1507.270225us CPU 24
concur: total 15184.463418ms per 1518.446342us CPU 26
concur: total 15263.917735ms per 1526.391773us CPU 28
concur: total 15617.042833ms per 1561.704283us CPU 25
concur: total 15660.599769ms per 1566.059977us CPU 27
concur: total 16134.873816ms per 1613.487382us CPU 29
concur: total 16195.713672ms per 1619.571367us CPU 11
concur: total 17182.571407ms per 1718.257141us CPU 10
concur: total 17462.398666ms per 1746.239867us CPU 9
concur: total 17813.014094ms per 1781.301409us CPU 8
concur: total 18436.488514ms per 1843.648851us CPU 6
concur: total 18996.550399ms per 1899.655040us CPU 7
concur: total 21721.021674ms per 2172.102167us CPU 41
concur: total 21986.614285ms per 2198.661429us CPU 17
concur: total 22216.364478ms per 2221.636448us CPU 23
concur: total 22369.110429ms per 2236.911043us CPU 5
concur: total 22526.643861ms per 2252.664386us CPU 35
concur: total 22540.326825ms per 2254.032682us CPU 40
concur: total 22560.761109ms per 2256.076111us CPU 30
concur: total 22774.376673ms per 2277.437667us CPU 33
concur: total 22779.411375ms per 2277.941137us CPU 31
concur: total 22844.223722ms per 2284.422372us CPU 16
concur: total 22868.684174ms per 2286.868417us CPU 34
concur: total 22926.039600ms per 2292.603960us CPU 32
concur: total 22956.189714ms per 2295.618971us CPU 38
concur: total 23002.988812ms per 2300.298881us CPU 22
concur: total 23010.128228ms per 2301.012823us CPU 36
concur: total 23013.737650ms per 2301.373765us CPU 4
concur: total 23023.545614ms per 2302.354561us CPU 39
concur: total 23120.483176ms per 2312.048318us CPU 15
concur: total 23150.576516ms per 2315.057652us CPU 37
concur: total 23240.196530ms per 2324.019653us CPU 14
concur: total 23255.002167ms per 2325.500217us CPU 21
concur: total 23255.595018ms per 2325.559502us CPU 0
concur: total 23258.182221ms per 2325.818222us CPU 3
concur: total 23264.494553ms per 2326.449455us CPU 12
concur: total 23281.848036ms per 2328.184804us CPU 13
concur: total 23307.939070ms per 2330.793907us CPU 47
concur: total 23315.311150ms per 2331.531115us CPU 46
concur: total 23328.394731ms per 2332.839473us CPU 2
concur: total 23329.879007ms per 2332.987901us CPU 20
concur: total 23351.592451ms per 2335.159245us CPU 19
concur: total 23350.752868ms per 2335.075287us CPU 1
concur: total 23356.438116ms per 2335.643812us CPU 45
concur: total 23356.853217ms per 2335.685322us CPU 42
concur: total 23357.738390ms per 2335.773839us CPU 44
concur: total 23360.540952ms per 2336.054095us CPU 43
concur: total 23360.577828ms per 2336.057783us CPU 18
times: 10000 threads: 48 cpus: 48

patched (5.12.2)
----------------

single: total 115.004971ms per 11.500497us
concur: total 1534.106287ms per 153.410629us CPU 15
concur: total 1584.741497ms per 158.474150us CPU 34
concur: total 1588.227774ms per 158.822777us CPU 3
concur: total 1590.944855ms per 159.094485us CPU 27
concur: total 1593.252406ms per 159.325241us CPU 21
concur: total 1594.347841ms per 159.434784us CPU 44
concur: total 1594.519690ms per 159.451969us CPU 43
concur: total 1594.651516ms per 159.465152us CPU 11
concur: total 1595.516558ms per 159.551656us CPU 12
concur: total 1596.826634ms per 159.682663us CPU 22
concur: total 1598.825527ms per 159.882553us CPU 6
concur: total 1598.914890ms per 159.891489us CPU 33
concur: total 1599.541434ms per 159.954143us CPU 28
concur: total 1600.537643ms per 160.053764us CPU 38
concur: total 1602.424304ms per 160.242430us CPU 47
concur: total 1602.725873ms per 160.272587us CPU 30
concur: total 1602.759128ms per 160.275913us CPU 14
concur: total 1603.849343ms per 160.384934us CPU 29
concur: total 1605.117369ms per 160.511737us CPU 35
concur: total 1605.473411ms per 160.547341us CPU 19
concur: total 1606.013413ms per 160.601341us CPU 13
concur: total 1606.068654ms per 160.606865us CPU 4
concur: total 1606.209860ms per 160.620986us CPU 23
concur: total 1606.923183ms per 160.692318us CPU 1
concur: total 1607.064867ms per 160.706487us CPU 40
concur: total 1607.121558ms per 160.712156us CPU 20
concur: total 1610.107603ms per 161.010760us CPU 10
concur: total 1610.140915ms per 161.014092us CPU 5
concur: total 1610.636352ms per 161.063635us CPU 24
concur: total 1612.699753ms per 161.269975us CPU 46
concur: total 1612.879734ms per 161.287973us CPU 42
concur: total 1613.176326ms per 161.317633us CPU 2
concur: total 1613.415669ms per 161.341567us CPU 8
concur: total 1613.811312ms per 161.381131us CPU 25
concur: total 1613.923411ms per 161.392341us CPU 41
concur: total 1613.966209ms per 161.396621us CPU 31
concur: total 1614.947228ms per 161.494723us CPU 17
concur: total 1615.337781ms per 161.533778us CPU 37
concur: total 1615.835025ms per 161.583502us CPU 32
concur: total 1615.982666ms per 161.598267us CPU 39
concur: total 1616.335216ms per 161.633522us CPU 45
concur: total 1616.340457ms per 161.634046us CPU 36
concur: total 1616.387235ms per 161.638723us CPU 16
concur: total 1617.248832ms per 161.724883us CPU 9
concur: total 1617.354503ms per 161.735450us CPU 0
concur: total 1617.455505ms per 161.745550us CPU 18
concur: total 1618.290721ms per 161.829072us CPU 26
concur: total 1630.338637ms per 163.033864us CPU 7
times: 10000 threads: 48 cpus: 48

2021-06-07 10:34:01

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH v5 0/6] kernfs: proposed locking and concurrency improvement (the missing perf attachments)


And I thought, I should add these attachments first and sure enough ...

On Mon, 2021-06-07 at 18:24 +0800, Ian Kent wrote:
> [quoted message trimmed]
> concur: total 1605.117369ms per 160.511737us  CPU 35
> concur: total 1605.473411ms per 160.547341us  CPU 19
> concur: total 1606.013413ms per 160.601341us  CPU 13
> concur: total 1606.068654ms per 160.606865us  CPU 4
> concur: total 1606.209860ms per 160.620986us  CPU 23
> concur: total 1606.923183ms per 160.692318us  CPU 1
> concur: total 1607.064867ms per 160.706487us  CPU 40
> concur: total 1607.121558ms per 160.712156us  CPU 20
> concur: total 1610.107603ms per 161.010760us  CPU 10
> concur: total 1610.140915ms per 161.014092us  CPU 5
> concur: total 1610.636352ms per 161.063635us  CPU 24
> concur: total 1612.699753ms per 161.269975us  CPU 46
> concur: total 1612.879734ms per 161.287973us  CPU 42
> concur: total 1613.176326ms per 161.317633us  CPU 2
> concur: total 1613.415669ms per 161.341567us  CPU 8
> concur: total 1613.811312ms per 161.381131us  CPU 25
> concur: total 1613.923411ms per 161.392341us  CPU 41
> concur: total 1613.966209ms per 161.396621us  CPU 31
> concur: total 1614.947228ms per 161.494723us  CPU 17
> concur: total 1615.337781ms per 161.533778us  CPU 37
> concur: total 1615.835025ms per 161.583502us  CPU 32
> concur: total 1615.982666ms per 161.598267us  CPU 39
> concur: total 1616.335216ms per 161.633522us  CPU 45
> concur: total 1616.340457ms per 161.634046us  CPU 36
> concur: total 1616.387235ms per 161.638723us  CPU 16
> concur: total 1617.248832ms per 161.724883us  CPU 9
> concur: total 1617.354503ms per 161.735450us  CPU 0
> concur: total 1617.455505ms per 161.745550us  CPU 18
> concur: total 1618.290721ms per 161.829072us  CPU 26
> concur: total 1630.338637ms per 163.033864us  CPU 7
> times: 10000 threads: 48 cpus: 48


Attachments:
base-files-cpu-16-perf.txt (141.11 kB)
base-files-cpu-48-perf.txt (394.61 kB)
base-missing-cpu-16-perf.txt (164.83 kB)
base-missing-cpus-48-perf.txt (454.13 kB)
patched-files-cpu-16-perf.txt (252.64 kB)
patched-files-cpu-48-perf.txt (354.38 kB)
patched-missing-cpu-16-perf.txt (167.25 kB)
patched-missing-cpu-48-perf.txt (299.78 kB)

2021-06-07 17:56:33

by Eric W. Biederman

Subject: Re: [PATCH v5 2/6] kernfs: add a revision to identify directory node changes

Ian Kent <[email protected]> writes:

> Add a revision counter to kernfs directory nodes so it can be used
> to detect if a directory node has changed.
>
> There's an assumption that sizeof(unsigned long) <= sizeof(pointer)
> on all architectures and as far as I know that assumption holds.
>
> So adding a revision counter to the struct kernfs_elem_dir variant of
> the kernfs_node type union won't increase the size of the kernfs_node
> struct. This is because struct kernfs_elem_dir is at least
> sizeof(pointer) smaller than the largest union variant. It's tempting
> to make the revision counter a u64 but that would increase the size of
> kernfs_node on archs where sizeof(pointer) is smaller than the revision
> counter.
>
> Signed-off-by: Ian Kent <[email protected]>
> ---
> fs/kernfs/dir.c | 8 ++++++++
> fs/kernfs/kernfs-internal.h | 24 ++++++++++++++++++++++++
> include/linux/kernfs.h | 5 +++++
> 3 files changed, 37 insertions(+)
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 33166ec90a112..b88432c48851f 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -372,6 +372,7 @@ static int kernfs_link_sibling(struct kernfs_node *kn)
> /* successfully added, account subdir number */
> if (kernfs_type(kn) == KERNFS_DIR)
> kn->parent->dir.subdirs++;
> + kernfs_inc_rev(kn->parent);
>
> return 0;
> }
> @@ -394,6 +395,7 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
>
> if (kernfs_type(kn) == KERNFS_DIR)
> kn->parent->dir.subdirs--;
> + kernfs_inc_rev(kn->parent);
>
> rb_erase(&kn->rb, &kn->parent->dir.children);
> RB_CLEAR_NODE(&kn->rb);
> @@ -1105,6 +1107,12 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
>
> /* instantiate and hash dentry */
> ret = d_splice_alias(inode, dentry);
> + if (!IS_ERR(ret)) {
> + if (unlikely(ret))
> + kernfs_set_rev(parent, ret);
> + else
> + kernfs_set_rev(parent, dentry);

Do we care about d_time on non-NULL dentries?

For d_splice_alias to return a different dentry implies
that the dentry was non-NULL.

I am wondering if having a guarantee that d_time never changes could
help simplify the implementation. For it never to change it would seem to
make sense to call kernfs_set_rev before d_splice_alias on dentry, and
simply not worry about it after d_splice_alias.

> + }
> out_unlock:
> mutex_unlock(&kernfs_mutex);
> return ret;
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index ccc3b44f6306f..1536002584fc4 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -81,6 +81,30 @@ static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
> return d_inode(dentry)->i_private;
> }
>
> +static inline void kernfs_set_rev(struct kernfs_node *kn,
> + struct dentry *dentry)
> +{
> + if (kernfs_type(kn) == KERNFS_DIR)
> + dentry->d_time = kn->dir.rev;
> +}
> +
> +static inline void kernfs_inc_rev(struct kernfs_node *kn)
> +{
> + if (kernfs_type(kn) == KERNFS_DIR)
> + kn->dir.rev++;
> +}
> +
> +static inline bool kernfs_dir_changed(struct kernfs_node *kn,
> + struct dentry *dentry)
> +{
> + if (kernfs_type(kn) == KERNFS_DIR) {
> + /* Not really a time but it does what's needed */
> + if (time_after(kn->dir.rev, dentry->d_time))
> + return true;

Why not simply make this:
if (kn->dir.rev != dentry->d_time)
return true;

I don't see what is gained by not counting as changed something in the
wrong half of the values.

> + }
> + return false;
> +}
> +
> extern const struct super_operations kernfs_sops;
> extern struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;
>
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 9e8ca8743c268..7947acb1163d7 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -98,6 +98,11 @@ struct kernfs_elem_dir {
> * better directly in kernfs_node but is here to save space.
> */
> struct kernfs_root *root;
> + /*
> + * Monotonic revision counter, used to identify if a directory
> + * node has changed during revalidation.
> + */
> + unsigned long rev;
> };
>
> struct kernfs_elem_symlink {

Eric

2021-06-07 18:30:29

by Eric W. Biederman

Subject: Re: [PATCH v5 3/6] kernfs: use VFS negative dentry caching

Ian Kent <[email protected]> writes:

> If there are many lookups for non-existent paths these negative lookups
> can lead to a lot of overhead during path walks.
>
> The VFS allows dentries to be created as negative and hashed, and caches
> them so they can be used to reduce the fairly high overhead alloc/free
> cycle that occurs during these lookups.
>
> Use the kernfs node parent revision to identify if a change has been
> made to the containing directory so that the negative dentry can be
> discarded and the lookup redone.
>
> Signed-off-by: Ian Kent <[email protected]>
> ---
> fs/kernfs/dir.c | 53 +++++++++++++++++++++++++++++++----------------------
> 1 file changed, 31 insertions(+), 22 deletions(-)
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index b88432c48851f..5ae95e8d1aea1 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -1039,13 +1039,32 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
> if (flags & LOOKUP_RCU)
> return -ECHILD;
>
> - /* Always perform fresh lookup for negatives */
> - if (d_really_is_negative(dentry))
> - goto out_bad_unlocked;
> -
> kn = kernfs_dentry_node(dentry);
> mutex_lock(&kernfs_mutex);
>
> + /* Negative hashed dentry? */
> + if (!kn) {
> + struct dentry *d_parent = dget_parent(dentry);
> + struct kernfs_node *parent;
> +
> + /* If the kernfs parent node has changed discard and
> + * proceed to ->lookup.
> + */
> + parent = kernfs_dentry_node(d_parent);
> + if (parent) {
> + if (kernfs_dir_changed(parent, dentry)) {
> + dput(d_parent);
> + goto out_bad;
> + }
> + }
> + dput(d_parent);
> +
> + /* The kernfs node doesn't exist, leave the dentry
> + * negative and return success.
> + */
> + goto out;
> + }

What part of this new negative hashed dentry check needs the
kernfs_mutex?

I guess it is the reading of kn->dir.rev.

Since all you are doing is comparing if two fields are equal it
really should not matter. Maybe somewhere there needs to be a
sprinkling of primitives like READ_ONCE.

It just seems like such a waste to put all of that under kernfs_mutex
on the off chance kn->dir.rev will change while it is being read.

Eric

2021-06-08 01:29:19

by Ian Kent

Subject: Re: [PATCH v5 2/6] kernfs: add a revision to identify directory node changes

On Mon, 2021-06-07 at 12:53 -0500, Eric W. Biederman wrote:
> Ian Kent <[email protected]> writes:
>
> > Add a revision counter to kernfs directory nodes so it can be used
> > to detect if a directory node has changed.
> >
> > There's an assumption that sizeof(unsigned long) <= sizeof(pointer)
> > on all architectures and as far as I know that assumption holds.
> >
> > So adding a revision counter to the struct kernfs_elem_dir variant
> > of
> > the kernfs_node type union won't increase the size of the
> > kernfs_node
> > struct. This is because struct kernfs_elem_dir is at least
> > sizeof(pointer) smaller than the largest union variant. It's
> > tempting
> > to make the revision counter a u64 but that would increase the size
> > of
> > kernfs_node on archs where sizeof(pointer) is smaller than the
> > revision
> > counter.
> >
> > Signed-off-by: Ian Kent <[email protected]>
> > ---
> >  fs/kernfs/dir.c             |    8 ++++++++
> >  fs/kernfs/kernfs-internal.h |   24 ++++++++++++++++++++++++
> >  include/linux/kernfs.h      |    5 +++++
> >  3 files changed, 37 insertions(+)
> >
> > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > index 33166ec90a112..b88432c48851f 100644
> > --- a/fs/kernfs/dir.c
> > +++ b/fs/kernfs/dir.c
> > @@ -372,6 +372,7 @@ static int kernfs_link_sibling(struct
> > kernfs_node *kn)
> >         /* successfully added, account subdir number */
> >         if (kernfs_type(kn) == KERNFS_DIR)
> >                 kn->parent->dir.subdirs++;
> > +       kernfs_inc_rev(kn->parent);
> >  
> >         return 0;
> >  }
> > @@ -394,6 +395,7 @@ static bool kernfs_unlink_sibling(struct
> > kernfs_node *kn)
> >  
> >         if (kernfs_type(kn) == KERNFS_DIR)
> >                 kn->parent->dir.subdirs--;
> > +       kernfs_inc_rev(kn->parent);
> >  
> >         rb_erase(&kn->rb, &kn->parent->dir.children);
> >         RB_CLEAR_NODE(&kn->rb);
> > @@ -1105,6 +1107,12 @@ static struct dentry
> > *kernfs_iop_lookup(struct inode *dir,
> >  
> >         /* instantiate and hash dentry */
> >         ret = d_splice_alias(inode, dentry);
> > +       if (!IS_ERR(ret)) {
> > +               if (unlikely(ret))
> > +                       kernfs_set_rev(parent, ret);
> > +               else
> > +                       kernfs_set_rev(parent, dentry);
>
> Do we care about d_time on non-NULL dentries?

Would we ever need to use it to avoid a search in any other cases?

Probably not ... those export ops mean that some dentries might
not have d_time set.

Maybe it's best to put a comment in about only using it for
negative dentries and set it unconditionally in ->lookup() as
you describe.

>
> For d_splice_alias to return a different dentry implies
> that the dentry was non-NULL.
>
> I am wondering if having a guarantee that d_time never changes could
> help simplify the implementation.  For it never to change it would seem to
> make sense to call kernfs_set_rev before d_splice_alias on dentry,
> and
> simply not worry about it after d_splice_alias.

Yes, I was tempted to do that.

>
> > +       }
> >   out_unlock:
> >         mutex_unlock(&kernfs_mutex);
> >         return ret;
> > diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-
> > internal.h
> > index ccc3b44f6306f..1536002584fc4 100644
> > --- a/fs/kernfs/kernfs-internal.h
> > +++ b/fs/kernfs/kernfs-internal.h
> > @@ -81,6 +81,30 @@ static inline struct kernfs_node
> > *kernfs_dentry_node(struct dentry *dentry)
> >         return d_inode(dentry)->i_private;
> >  }
> >  
> > +static inline void kernfs_set_rev(struct kernfs_node *kn,
> > +                                 struct dentry *dentry)
> > +{
> > +       if (kernfs_type(kn) == KERNFS_DIR)
> > +               dentry->d_time = kn->dir.rev;
> > +}
> > +
> > +static inline void kernfs_inc_rev(struct kernfs_node *kn)
> > +{
> > +       if (kernfs_type(kn) == KERNFS_DIR)
> > +               kn->dir.rev++;
> > +}
> > +
> > +static inline bool kernfs_dir_changed(struct kernfs_node *kn,
> > +                                     struct dentry *dentry)
> > +{
> > +       if (kernfs_type(kn) == KERNFS_DIR) {
> > +               /* Not really a time but it does what's needed */
> > +               if (time_after(kn->dir.rev, dentry->d_time))
> > +                       return true;
>
> Why not simply make this:
>                 if (kn->dir.rev != dentry->d_time)
>                         return true;
>
> I don't see what is gained by not counting as changed something in
> the
> wrong half of the values.

Yes, it was like that originally and really shouldn't make
any difference. I'll change it back.

Ian
>
> > +       }
> > +       return false;
> > +}
> > +
> >  extern const struct super_operations kernfs_sops;
> >  extern struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;
> >  
> > diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> > index 9e8ca8743c268..7947acb1163d7 100644
> > --- a/include/linux/kernfs.h
> > +++ b/include/linux/kernfs.h
> > @@ -98,6 +98,11 @@ struct kernfs_elem_dir {
> >          * better directly in kernfs_node but is here to save
> > space.
> >          */
> >         struct kernfs_root      *root;
> > +       /*
> > +        * Monotonic revision counter, used to identify if a
> > directory
> > +        * node has changed during revalidation.
> > +        */
> > +       unsigned long rev;
> >  };
> >  
> >  struct kernfs_elem_symlink {
>
> Eric


2021-06-08 01:58:39

by Ian Kent

Subject: Re: [PATCH v5 3/6] kernfs: use VFS negative dentry caching

On Mon, 2021-06-07 at 13:27 -0500, Eric W. Biederman wrote:
> Ian Kent <[email protected]> writes:
>
> > If there are many lookups for non-existent paths these negative
> > lookups
> > can lead to a lot of overhead during path walks.
> >
> > The VFS allows dentries to be created as negative and hashed, and
> > caches
> > them so they can be used to reduce the fairly high overhead
> > alloc/free
> > cycle that occurs during these lookups.
> >
> > Use the kernfs node parent revision to identify if a change has
> > been
> > made to the containing directory so that the negative dentry can be
> > discarded and the lookup redone.
> >
> > Signed-off-by: Ian Kent <[email protected]>
> > ---
> >  fs/kernfs/dir.c |   53 +++++++++++++++++++++++++++++++------------
> > ----------
> >  1 file changed, 31 insertions(+), 22 deletions(-)
> >
> > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > index b88432c48851f..5ae95e8d1aea1 100644
> > --- a/fs/kernfs/dir.c
> > +++ b/fs/kernfs/dir.c
> > @@ -1039,13 +1039,32 @@ static int kernfs_dop_revalidate(struct
> > dentry *dentry, unsigned int flags)
> >         if (flags & LOOKUP_RCU)
> >                 return -ECHILD;
> >  
> > -       /* Always perform fresh lookup for negatives */
> > -       if (d_really_is_negative(dentry))
> > -               goto out_bad_unlocked;
> > -
> >         kn = kernfs_dentry_node(dentry);
> >         mutex_lock(&kernfs_mutex);
> >  
> > +       /* Negative hashed dentry? */
> > +       if (!kn) {
> > +               struct dentry *d_parent = dget_parent(dentry);
> > +               struct kernfs_node *parent;
> > +
> > +               /* If the kernfs parent node has changed discard
> > and
> > +                * proceed to ->lookup.
> > +                */
> > +               parent = kernfs_dentry_node(d_parent);
> > +               if (parent) {
> > +                       if (kernfs_dir_changed(parent, dentry)) {
> > +                               dput(d_parent);
> > +                               goto out_bad;
> > +                       }
> > +               }
> > +               dput(d_parent);
> > +
> > +               /* The kernfs node doesn't exist, leave the dentry
> > +                * negative and return success.
> > +                */
> > +               goto out;
> > +       }
>
> What part of this new negative hashed dentry check needs the
> kernfs_mutex?
>
> I guess it is the reading of kn->dir.rev.

I have an irresistible urge to keep the rb tree stable when
accessing it. It was probably not necessary most of the times
I did it; IIUC even a rebalance leaves the node address
unchanged, so it should be just removals and moves to worry
about.

>
> Since all you are doing is comparing if two fields are equal it
> really should not matter.  Maybe somewhere there needs to be a
> sprinkling of primitives like READ_ONCE.

There is one case that looks tricky: rename will call ->rename()
and do the move a bit later. Thinking about it, a READ_ONCE might
be needed even now, but taking the rwsem is probably enough.

Not sure about that one?

Moving this out from under the rwsem would be good to do.

Ian
>
> It just seems like such a waste to put all of that under kernfs_mutex
> on the off chance kn->dir.rev will change while it is being read.
>
> Eric