2021-06-15 10:27:04

by Ian Kent

[permalink] [raw]
Subject: [PATCH v7 0/6] kernfs: proposed locking and concurrency improvement

There have been a few instances of contention on the kernfs_mutex during
path walks: a case on very large IBM systems seen by myself, a report by
Brice Goglin followed up by Fox Chen, and a couple of other reports I've
since seen from CoreOS users.

The common thread is a large number of concurrent kernfs path walks
being slowed down by kernfs_mutex contention.

The problem is that changes to the VFS over time have increased its
concurrency capabilities to an extent that kernfs's use of a mutex is
no longer appropriate. There's also a less common issue of walks for
non-existent paths causing contention when there are many of them.

This patch series is relatively straightforward.

All it does is add the ability to take advantage of VFS negative dentry
caching to avoid needless dentry alloc/free cycles for lookups of paths
that don't exist and change the kernfs_mutex to a read/write semaphore.

The patch that tried to stay in VFS rcu-walk mode during path walks has
been dropped for two reasons. First, it doesn't actually give very much
improvement and, second, if there's a place where mistakes could go
unnoticed it would be in that path. This makes the patch series simpler
to review and reduces the likelihood of problems going unnoticed and
popping up later.

Changes since v6:
- ensure negative dentry as rename target gets invalidated.
- don't bother checking if node is a directory in revision helpers.
- don't use dget_parent(), just use dentry d_lock to ensure parent
is stable.
- fix kernfs_iop_getattr() and kernfs_iop_permission() locking.
- drop the "add kernfs_need_inode_refresh()" patch, it can't be used.

Changes since v5:
- change kernfs_dir_changed() comparison.
- move negative dentry out from under kernfs node lock in revalidate.
- only set d_time for negative dentries.
- add patch to move d_splice_alias() out from under kernfs node lock
in lookup.

Changes since v4:
- fixed kernfs_active() naming.
- added back kernfs_node revision patch to use for negative dentry
validation.
- minor updates to patch descriptions.

Changes since v3:
- remove unneeded indirection when referencing the super block.
- check if inode attribute update is actually needed.

Changes since v2:
- actually fix the inode attribute update locking.
- drop the patch that tried to stay in rcu-walk mode.
- drop the use a revision to identify if a directory has changed patch.

Changes since v1:
- fix locking in .permission() and .getattr() by re-factoring the
attribute handling code.
---

Ian Kent (6):
kernfs: move revalidate to be near lookup
kernfs: add a revision to identify directory node changes
kernfs: use VFS negative dentry caching
kernfs: switch kernfs to use an rwsem
kernfs: use i_lock to protect concurrent inode updates
kernfs: dont call d_splice_alias() under kernfs node lock


fs/kernfs/dir.c | 151 ++++++++++++++++++++----------------
fs/kernfs/file.c | 4 +-
fs/kernfs/inode.c | 26 ++++---
fs/kernfs/kernfs-internal.h | 24 +++++-
fs/kernfs/mount.c | 12 +--
fs/kernfs/symlink.c | 4 +-
include/linux/kernfs.h | 7 +-
7 files changed, 136 insertions(+), 92 deletions(-)

--
Ian


2021-06-15 10:27:27

by Ian Kent

[permalink] [raw]
Subject: [PATCH v7 1/6] kernfs: move revalidate to be near lookup

While the dentry operation kernfs_dop_revalidate() is grouped with
dentry type functions it also has a strong affinity to the inode
operation ->lookup().

It makes sense to locate this function near to kernfs_iop_lookup()
because we will be adding VFS negative dentry caching to reduce path
lookup overhead for non-existent paths.

There's no functional change from this patch.

Signed-off-by: Ian Kent <[email protected]>
Reviewed-by: Miklos Szeredi <[email protected]>
---
fs/kernfs/dir.c | 86 ++++++++++++++++++++++++++++---------------------------
1 file changed, 43 insertions(+), 43 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 7e0e62deab53c..33166ec90a112 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -548,49 +548,6 @@ void kernfs_put(struct kernfs_node *kn)
}
EXPORT_SYMBOL_GPL(kernfs_put);

-static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
-{
- struct kernfs_node *kn;
-
- if (flags & LOOKUP_RCU)
- return -ECHILD;
-
- /* Always perform fresh lookup for negatives */
- if (d_really_is_negative(dentry))
- goto out_bad_unlocked;
-
- kn = kernfs_dentry_node(dentry);
- mutex_lock(&kernfs_mutex);
-
- /* The kernfs node has been deactivated */
- if (!kernfs_active(kn))
- goto out_bad;
-
- /* The kernfs node has been moved? */
- if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
- goto out_bad;
-
- /* The kernfs node has been renamed */
- if (strcmp(dentry->d_name.name, kn->name) != 0)
- goto out_bad;
-
- /* The kernfs node has been moved to a different namespace */
- if (kn->parent && kernfs_ns_enabled(kn->parent) &&
- kernfs_info(dentry->d_sb)->ns != kn->ns)
- goto out_bad;
-
- mutex_unlock(&kernfs_mutex);
- return 1;
-out_bad:
- mutex_unlock(&kernfs_mutex);
-out_bad_unlocked:
- return 0;
-}
-
-const struct dentry_operations kernfs_dops = {
- .d_revalidate = kernfs_dop_revalidate,
-};
-
/**
* kernfs_node_from_dentry - determine kernfs_node associated with a dentry
* @dentry: the dentry in question
@@ -1073,6 +1030,49 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
return ERR_PTR(rc);
}

+static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
+{
+ struct kernfs_node *kn;
+
+ if (flags & LOOKUP_RCU)
+ return -ECHILD;
+
+ /* Always perform fresh lookup for negatives */
+ if (d_really_is_negative(dentry))
+ goto out_bad_unlocked;
+
+ kn = kernfs_dentry_node(dentry);
+ mutex_lock(&kernfs_mutex);
+
+ /* The kernfs node has been deactivated */
+ if (!kernfs_active(kn))
+ goto out_bad;
+
+ /* The kernfs node has been moved? */
+ if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
+ goto out_bad;
+
+ /* The kernfs node has been renamed */
+ if (strcmp(dentry->d_name.name, kn->name) != 0)
+ goto out_bad;
+
+ /* The kernfs node has been moved to a different namespace */
+ if (kn->parent && kernfs_ns_enabled(kn->parent) &&
+ kernfs_info(dentry->d_sb)->ns != kn->ns)
+ goto out_bad;
+
+ mutex_unlock(&kernfs_mutex);
+ return 1;
+out_bad:
+ mutex_unlock(&kernfs_mutex);
+out_bad_unlocked:
+ return 0;
+}
+
+const struct dentry_operations kernfs_dops = {
+ .d_revalidate = kernfs_dop_revalidate,
+};
+
static struct dentry *kernfs_iop_lookup(struct inode *dir,
struct dentry *dentry,
unsigned int flags)


2021-06-15 10:28:41

by Ian Kent

[permalink] [raw]
Subject: [PATCH v7 2/6] kernfs: add a revision to identify directory node changes

Add a revision counter to kernfs directory nodes so it can be used
to detect if a directory node has changed during negative dentry
revalidation.

There's an assumption that sizeof(unsigned long) <= sizeof(pointer)
on all architectures and as far as I know that assumption holds.

So adding a revision counter to the struct kernfs_elem_dir variant of
the kernfs_node type union won't increase the size of the kernfs_node
struct. This is because struct kernfs_elem_dir is at least
sizeof(pointer) smaller than the largest union variant. It's tempting
to make the revision counter a u64 but that would increase the size of
kernfs_node on archs where sizeof(pointer) is smaller than the revision
counter.
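The scheme described above can be sketched in a few lines of userspace C. This is an illustrative model only, not the kernel code: the `dir_node` and `cached_dentry` structs are hypothetical stand-ins for `kernfs_elem_dir` (which holds the counter) and the dentry (whose `d_time` field caches a copy), and the helper names mirror the `kernfs_inc_rev()`/`kernfs_set_rev()`/`kernfs_dir_changed()` helpers added by this patch.

```c
#include <assert.h>
#include <stdbool.h>

/* Userspace model of the kernfs directory revision scheme.  The parent
 * directory carries a monotonic counter; a negative dentry caches the
 * counter value seen at lookup time.  A later mismatch means the
 * directory changed and the negative dentry can no longer be trusted. */
struct dir_node {
	unsigned long rev;	/* stands in for kernfs_elem_dir.rev */
};

struct cached_dentry {
	unsigned long d_time;	/* stands in for dentry->d_time */
};

/* Bump on create/unlink/rename under the directory (kernfs_inc_rev()). */
static void dir_inc_rev(struct dir_node *parent)
{
	parent->rev++;
}

/* Record the revision a negative dentry was created at (kernfs_set_rev()). */
static void dentry_set_rev(struct cached_dentry *d, const struct dir_node *parent)
{
	d->d_time = parent->rev;
}

/* Has the directory changed since this dentry was cached? */
static bool dir_changed(const struct dir_node *parent, const struct cached_dentry *d)
{
	return parent->rev != d->d_time;
}
```

A negative dentry validated this way stays usable across repeated failed lookups until something is actually added to, removed from, or renamed within its parent directory.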

Signed-off-by: Ian Kent <[email protected]>
---
fs/kernfs/dir.c | 3 +++
fs/kernfs/kernfs-internal.h | 19 +++++++++++++++++++
include/linux/kernfs.h | 5 +++++
3 files changed, 27 insertions(+)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 33166ec90a112..48704c5b6a072 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -372,6 +372,7 @@ static int kernfs_link_sibling(struct kernfs_node *kn)
/* successfully added, account subdir number */
if (kernfs_type(kn) == KERNFS_DIR)
kn->parent->dir.subdirs++;
+ kernfs_inc_rev(kn->parent);

return 0;
}
@@ -394,6 +395,7 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)

if (kernfs_type(kn) == KERNFS_DIR)
kn->parent->dir.subdirs--;
+ kernfs_inc_rev(kn->parent);

rb_erase(&kn->rb, &kn->parent->dir.children);
RB_CLEAR_NODE(&kn->rb);
@@ -1573,6 +1575,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
*/
kernfs_unlink_sibling(kn);
kernfs_get(new_parent);
+ kernfs_inc_rev(new_parent);

/* rename_lock protects ->parent and ->name accessors */
spin_lock_irq(&kernfs_rename_lock);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index ccc3b44f6306f..6a38897b4d02c 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -81,6 +81,25 @@ static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
return d_inode(dentry)->i_private;
}

+static inline void kernfs_set_rev(struct kernfs_node *parent,
+ struct dentry *dentry)
+{
+ dentry->d_time = parent->dir.rev;
+}
+
+static inline void kernfs_inc_rev(struct kernfs_node *parent)
+{
+ parent->dir.rev++;
+}
+
+static inline bool kernfs_dir_changed(struct kernfs_node *parent,
+ struct dentry *dentry)
+{
+ if (parent->dir.rev != dentry->d_time)
+ return true;
+ return false;
+}
+
extern const struct super_operations kernfs_sops;
extern struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;

diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 9e8ca8743c268..d68b4ad095737 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -98,6 +98,11 @@ struct kernfs_elem_dir {
* better directly in kernfs_node but is here to save space.
*/
struct kernfs_root *root;
+ /*
+ * Monotonic revision counter, used to identify if a directory
+ * node has changed during negative dentry revalidation.
+ */
+ unsigned long rev;
};

struct kernfs_elem_symlink {


2021-06-15 10:29:19

by Ian Kent

[permalink] [raw]
Subject: [PATCH v7 6/6] kernfs: dont call d_splice_alias() under kernfs node lock

The call to d_splice_alias() in kernfs_iop_lookup() doesn't depend on
any kernfs node so there's no reason to hold the kernfs node lock when
calling it.

Signed-off-by: Ian Kent <[email protected]>
Reviewed-by: Miklos Szeredi <[email protected]>
---
fs/kernfs/dir.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index a5080fb1dfc05..723348607033c 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1097,7 +1097,6 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
struct dentry *dentry,
unsigned int flags)
{
- struct dentry *ret;
struct kernfs_node *parent = dir->i_private;
struct kernfs_node *kn;
struct inode *inode = NULL;
@@ -1117,11 +1116,10 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
/* Needed only for negative dentry validation */
if (!inode)
kernfs_set_rev(parent, dentry);
- /* instantiate and hash (possibly negative) dentry */
- ret = d_splice_alias(inode, dentry);
up_read(&kernfs_rwsem);

- return ret;
+ /* instantiate and hash (possibly negative) dentry */
+ return d_splice_alias(inode, dentry);
}

static int kernfs_iop_mkdir(struct user_namespace *mnt_userns,


2021-06-15 10:29:50

by Ian Kent

[permalink] [raw]
Subject: [PATCH v7 5/6] kernfs: use i_lock to protect concurrent inode updates

The inode operations .permission() and .getattr() use the kernfs node
write lock but all that's needed is the read lock to protect against
partial updates of these kernfs node fields which are all done under
the write lock.

And .permission() is called frequently during path walks and can cause
quite a bit of contention between kernfs node operations and path
walks when the number of concurrent walks is high.

To change kernfs_iop_getattr() and kernfs_iop_permission() to take
the rw sem read lock instead of the write lock an additional lock is
needed to protect against multiple processes concurrently updating
the inode attributes and link count in kernfs_refresh_inode().

The inode i_lock seems like the sensible thing to use to protect these
inode attribute updates so use it in kernfs_refresh_inode().

The last hunk in the patch, applied to kernfs_fill_super(), is possibly
not needed, but the lock was taken there originally. I prefer to
continue taking it to protect against a partial update of the source
kernfs fields during the call to kernfs_refresh_inode() made by
kernfs_get_inode().

Signed-off-by: Ian Kent <[email protected]>
Reviewed-by: Miklos Szeredi <[email protected]>
---
fs/kernfs/inode.c | 18 ++++++++++++------
fs/kernfs/mount.c | 4 ++--
2 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 3b01e9e61f14e..6592938511909 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -191,11 +191,13 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
struct inode *inode = d_inode(path->dentry);
struct kernfs_node *kn = inode->i_private;

- down_write(&kernfs_rwsem);
+ down_read(&kernfs_rwsem);
+ spin_lock(&inode->i_lock);
kernfs_refresh_inode(kn, inode);
- up_write(&kernfs_rwsem);
-
generic_fillattr(&init_user_ns, inode, stat);
+ spin_unlock(&inode->i_lock);
+ up_read(&kernfs_rwsem);
+
return 0;
}

@@ -278,17 +280,21 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
struct inode *inode, int mask)
{
struct kernfs_node *kn;
+ int ret;

if (mask & MAY_NOT_BLOCK)
return -ECHILD;

kn = inode->i_private;

- down_write(&kernfs_rwsem);
+ down_read(&kernfs_rwsem);
+ spin_lock(&inode->i_lock);
kernfs_refresh_inode(kn, inode);
- up_write(&kernfs_rwsem);
+ ret = generic_permission(&init_user_ns, inode, mask);
+ spin_unlock(&inode->i_lock);
+ up_read(&kernfs_rwsem);

- return generic_permission(&init_user_ns, inode, mask);
+ return ret;
}

int kernfs_xattr_get(struct kernfs_node *kn, const char *name,
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index baa4155ba2edf..f2f909d09f522 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -255,9 +255,9 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k
sb->s_shrink.seeks = 0;

/* get root inode, initialize and unlock it */
- down_write(&kernfs_rwsem);
+ down_read(&kernfs_rwsem);
inode = kernfs_get_inode(sb, info->root->kn);
- up_write(&kernfs_rwsem);
+ up_read(&kernfs_rwsem);
if (!inode) {
pr_debug("kernfs: could not get root inode\n");
return -ENOMEM;


2021-06-15 10:30:23

by Ian Kent

[permalink] [raw]
Subject: [PATCH v7 4/6] kernfs: switch kernfs to use an rwsem

The kernfs global lock restricts the ability to perform kernfs node
lookup operations in parallel during path walks.

Change the kernfs mutex to an rwsem so that, when opportunity arises,
node searches can be done in parallel with path walk lookups.

Signed-off-by: Ian Kent <[email protected]>
Reviewed-by: Miklos Szeredi <[email protected]>
---
fs/kernfs/dir.c | 94 ++++++++++++++++++++++---------------------
fs/kernfs/file.c | 4 +-
fs/kernfs/inode.c | 16 ++++---
fs/kernfs/kernfs-internal.h | 5 +-
fs/kernfs/mount.c | 12 +++--
fs/kernfs/symlink.c | 4 +-
include/linux/kernfs.h | 2 -
7 files changed, 69 insertions(+), 68 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index f9352c9e90581..a5080fb1dfc05 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -17,7 +17,7 @@

#include "kernfs-internal.h"

-DEFINE_MUTEX(kernfs_mutex);
+DECLARE_RWSEM(kernfs_rwsem);
static DEFINE_SPINLOCK(kernfs_rename_lock); /* kn->parent and ->name */
static char kernfs_pr_cont_buf[PATH_MAX]; /* protected by rename_lock */
static DEFINE_SPINLOCK(kernfs_idr_lock); /* root->ino_idr */
@@ -26,7 +26,7 @@ static DEFINE_SPINLOCK(kernfs_idr_lock); /* root->ino_idr */

static bool kernfs_active(struct kernfs_node *kn)
{
- lockdep_assert_held(&kernfs_mutex);
+ lockdep_assert_held(&kernfs_rwsem);
return atomic_read(&kn->active) >= 0;
}

@@ -340,7 +340,7 @@ static int kernfs_sd_compare(const struct kernfs_node *left,
* @kn->parent->dir.children.
*
* Locking:
- * mutex_lock(kernfs_mutex)
+ * kernfs_rwsem held exclusive
*
* RETURNS:
* 0 on susccess -EEXIST on failure.
@@ -386,7 +386,7 @@ static int kernfs_link_sibling(struct kernfs_node *kn)
* removed, %false if @kn wasn't on the rbtree.
*
* Locking:
- * mutex_lock(kernfs_mutex)
+ * kernfs_rwsem held exclusive
*/
static bool kernfs_unlink_sibling(struct kernfs_node *kn)
{
@@ -457,14 +457,14 @@ void kernfs_put_active(struct kernfs_node *kn)
* return after draining is complete.
*/
static void kernfs_drain(struct kernfs_node *kn)
- __releases(&kernfs_mutex) __acquires(&kernfs_mutex)
+ __releases(&kernfs_rwsem) __acquires(&kernfs_rwsem)
{
struct kernfs_root *root = kernfs_root(kn);

- lockdep_assert_held(&kernfs_mutex);
+ lockdep_assert_held_write(&kernfs_rwsem);
WARN_ON_ONCE(kernfs_active(kn));

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);

if (kernfs_lockdep(kn)) {
rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
@@ -483,7 +483,7 @@ static void kernfs_drain(struct kernfs_node *kn)

kernfs_drain_open_files(kn);

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
}

/**
@@ -722,7 +722,7 @@ int kernfs_add_one(struct kernfs_node *kn)
bool has_ns;
int ret;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);

ret = -EINVAL;
has_ns = kernfs_ns_enabled(parent);
@@ -753,7 +753,7 @@ int kernfs_add_one(struct kernfs_node *kn)
ps_iattr->ia_mtime = ps_iattr->ia_ctime;
}

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);

/*
* Activate the new node unless CREATE_DEACTIVATED is requested.
@@ -767,7 +767,7 @@ int kernfs_add_one(struct kernfs_node *kn)
return 0;

out_unlock:
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
return ret;
}

@@ -788,7 +788,7 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
bool has_ns = kernfs_ns_enabled(parent);
unsigned int hash;

- lockdep_assert_held(&kernfs_mutex);
+ lockdep_assert_held(&kernfs_rwsem);

if (has_ns != (bool)ns) {
WARN(1, KERN_WARNING "kernfs: ns %s in '%s' for '%s'\n",
@@ -820,7 +820,7 @@ static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
size_t len;
char *p, *name;

- lockdep_assert_held(&kernfs_mutex);
+ lockdep_assert_held_read(&kernfs_rwsem);

/* grab kernfs_rename_lock to piggy back on kernfs_pr_cont_buf */
spin_lock_irq(&kernfs_rename_lock);
@@ -860,10 +860,10 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
{
struct kernfs_node *kn;

- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);
kn = kernfs_find_ns(parent, name, ns);
kernfs_get(kn);
- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);

return kn;
}
@@ -884,10 +884,10 @@ struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
{
struct kernfs_node *kn;

- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);
kn = kernfs_walk_ns(parent, path, ns);
kernfs_get(kn);
- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);

return kn;
}
@@ -1063,7 +1063,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
}

kn = kernfs_dentry_node(dentry);
- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);

/* The kernfs node has been deactivated */
if (!kernfs_active(kn))
@@ -1082,10 +1082,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
kernfs_info(dentry->d_sb)->ns != kn->ns)
goto out_bad;

- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);
return 1;
out_bad:
- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);
return 0;
}

@@ -1103,7 +1103,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
struct inode *inode = NULL;
const void *ns = NULL;

- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dir->i_sb)->ns;

@@ -1119,7 +1119,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
kernfs_set_rev(parent, dentry);
/* instantiate and hash (possibly negative) dentry */
ret = d_splice_alias(inode, dentry);
- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);

return ret;
}
@@ -1241,7 +1241,7 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
{
struct rb_node *rbn;

- lockdep_assert_held(&kernfs_mutex);
+ lockdep_assert_held_write(&kernfs_rwsem);

/* if first iteration, visit leftmost descendant which may be root */
if (!pos)
@@ -1277,7 +1277,7 @@ void kernfs_activate(struct kernfs_node *kn)
{
struct kernfs_node *pos;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);

pos = NULL;
while ((pos = kernfs_next_descendant_post(pos, kn))) {
@@ -1291,14 +1291,14 @@ void kernfs_activate(struct kernfs_node *kn)
pos->flags |= KERNFS_ACTIVATED;
}

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
}

static void __kernfs_remove(struct kernfs_node *kn)
{
struct kernfs_node *pos;

- lockdep_assert_held(&kernfs_mutex);
+ lockdep_assert_held_write(&kernfs_rwsem);

/*
* Short-circuit if non-root @kn has already finished removal.
@@ -1321,7 +1321,7 @@ static void __kernfs_remove(struct kernfs_node *kn)
pos = kernfs_leftmost_descendant(kn);

/*
- * kernfs_drain() drops kernfs_mutex temporarily and @pos's
+ * kernfs_drain() drops kernfs_rwsem temporarily and @pos's
* base ref could have been put by someone else by the time
* the function returns. Make sure it doesn't go away
* underneath us.
@@ -1368,9 +1368,9 @@ static void __kernfs_remove(struct kernfs_node *kn)
*/
void kernfs_remove(struct kernfs_node *kn)
{
- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
__kernfs_remove(kn);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
}

/**
@@ -1457,17 +1457,17 @@ bool kernfs_remove_self(struct kernfs_node *kn)
{
bool ret;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
kernfs_break_active_protection(kn);

/*
* SUICIDAL is used to arbitrate among competing invocations. Only
* the first one will actually perform removal. When the removal
* is complete, SUICIDED is set and the active ref is restored
- * while holding kernfs_mutex. The ones which lost arbitration
- * waits for SUICDED && drained which can happen only after the
- * enclosing kernfs operation which executed the winning instance
- * of kernfs_remove_self() finished.
+ * while kernfs_rwsem for held exclusive. The ones which lost
+ * arbitration waits for SUICIDED && drained which can happen only
+ * after the enclosing kernfs operation which executed the winning
+ * instance of kernfs_remove_self() finished.
*/
if (!(kn->flags & KERNFS_SUICIDAL)) {
kn->flags |= KERNFS_SUICIDAL;
@@ -1485,9 +1485,9 @@ bool kernfs_remove_self(struct kernfs_node *kn)
atomic_read(&kn->active) == KN_DEACTIVATED_BIAS)
break;

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
schedule();
- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
}
finish_wait(waitq, &wait);
WARN_ON_ONCE(!RB_EMPTY_NODE(&kn->rb));
@@ -1495,12 +1495,12 @@ bool kernfs_remove_self(struct kernfs_node *kn)
}

/*
- * This must be done while holding kernfs_mutex; otherwise, waiting
- * for SUICIDED && deactivated could finish prematurely.
+ * This must be done while kernfs_rwsem held exclusive; otherwise,
+ * waiting for SUICIDED && deactivated could finish prematurely.
*/
kernfs_unbreak_active_protection(kn);

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
return ret;
}

@@ -1524,13 +1524,13 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
return -ENOENT;
}

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);

kn = kernfs_find_ns(parent, name, ns);
if (kn)
__kernfs_remove(kn);

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);

if (kn)
return 0;
@@ -1556,7 +1556,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
if (!kn->parent)
return -EINVAL;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);

error = -ENOENT;
if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
@@ -1611,7 +1611,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,

error = 0;
out:
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
return error;
}

@@ -1686,7 +1686,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)

if (!dir_emit_dots(file, ctx))
return 0;
- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);

if (kernfs_ns_enabled(parent))
ns = kernfs_info(dentry->d_sb)->ns;
@@ -1703,12 +1703,12 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
file->private_data = pos;
kernfs_get(pos);

- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);
if (!dir_emit(ctx, name, len, ino, type))
return 0;
- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);
}
- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);
file->private_data = NULL;
ctx->pos = INT_MAX;
return 0;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index c757193121475..60e2a86c535eb 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -860,7 +860,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
spin_unlock_irq(&kernfs_notify_lock);

/* kick fsnotify */
- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);

list_for_each_entry(info, &kernfs_root(kn)->supers, node) {
struct kernfs_node *parent;
@@ -898,7 +898,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
iput(inode);
}

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
kernfs_put(kn);
goto repeat;
}
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index d73950fc3d57d..3b01e9e61f14e 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -106,9 +106,9 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
{
int ret;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
ret = __kernfs_setattr(kn, iattr);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
return ret;
}

@@ -122,7 +122,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
if (!kn)
return -EINVAL;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
error = setattr_prepare(&init_user_ns, dentry, iattr);
if (error)
goto out;
@@ -135,7 +135,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
setattr_copy(&init_user_ns, inode, iattr);

out:
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
return error;
}

@@ -191,9 +191,9 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
struct inode *inode = d_inode(path->dentry);
struct kernfs_node *kn = inode->i_private;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
kernfs_refresh_inode(kn, inode);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);

generic_fillattr(&init_user_ns, inode, stat);
return 0;
@@ -284,9 +284,9 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,

kn = inode->i_private;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
kernfs_refresh_inode(kn, inode);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);

return generic_permission(&init_user_ns, inode, mask);
}
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 6a38897b4d02c..eaaa2e1607e8c 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -13,6 +13,7 @@
#include <linux/lockdep.h>
#include <linux/fs.h>
#include <linux/mutex.h>
+#include <linux/rwsem.h>
#include <linux/xattr.h>

#include <linux/kernfs.h>
@@ -69,7 +70,7 @@ struct kernfs_super_info {
*/
const void *ns;

- /* anchored at kernfs_root->supers, protected by kernfs_mutex */
+ /* anchored at kernfs_root->supers, protected by kernfs_rwsem */
struct list_head node;
};
#define kernfs_info(SB) ((struct kernfs_super_info *)(SB->s_fs_info))
@@ -121,7 +122,7 @@ int __kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr);
/*
* dir.c
*/
-extern struct mutex kernfs_mutex;
+extern struct rw_semaphore kernfs_rwsem;
extern const struct dentry_operations kernfs_dops;
extern const struct file_operations kernfs_dir_fops;
extern const struct inode_operations kernfs_dir_iops;
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 9dc7e7a64e10f..baa4155ba2edf 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -255,9 +255,9 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k
sb->s_shrink.seeks = 0;

/* get root inode, initialize and unlock it */
- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
inode = kernfs_get_inode(sb, info->root->kn);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
if (!inode) {
pr_debug("kernfs: could not get root inode\n");
return -ENOMEM;
@@ -344,9 +344,9 @@ int kernfs_get_tree(struct fs_context *fc)
}
sb->s_flags |= SB_ACTIVE;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
list_add(&info->node, &info->root->supers);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
}

fc->root = dget(sb->s_root);
@@ -372,9 +372,9 @@ void kernfs_kill_sb(struct super_block *sb)
{
struct kernfs_super_info *info = kernfs_info(sb);

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
list_del(&info->node);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);

/*
* Remove the superblock from fs_supers/s_instances
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 5432883d819f2..c8f8e41b84110 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -116,9 +116,9 @@ static int kernfs_getlink(struct inode *inode, char *path)
struct kernfs_node *target = kn->symlink.target_kn;
int error;

- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);
error = kernfs_get_target_path(parent, target, path);
- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);

return error;
}
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index d68b4ad095737..1093abf7c28cc 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -193,7 +193,7 @@ struct kernfs_root {
u32 id_highbits;
struct kernfs_syscall_ops *syscall_ops;

- /* list of kernfs_super_info of this root, protected by kernfs_mutex */
+ /* list of kernfs_super_info of this root, protected by kernfs_rwsem */
struct list_head supers;

wait_queue_head_t deactivate_waitq;


2021-06-15 10:30:34

by Ian Kent

[permalink] [raw]
Subject: [PATCH v7 3/6] kernfs: use VFS negative dentry caching

If there are many lookups for non-existent paths these negative lookups
can lead to a lot of overhead during path walks.

The VFS allows dentries to be created as negative and hashed, and caches
them so they can be used to reduce the fairly high overhead alloc/free
cycle that occurs during these lookups.

Use the kernfs node parent revision to identify if a change has been
made to the containing directory so that the negative dentry can be
discarded and the lookup redone.

Signed-off-by: Ian Kent <[email protected]>
---
fs/kernfs/dir.c | 52 ++++++++++++++++++++++++++++++++--------------------
1 file changed, 32 insertions(+), 20 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 48704c5b6a072..f9352c9e90581 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1039,9 +1039,28 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
if (flags & LOOKUP_RCU)
return -ECHILD;

- /* Always perform fresh lookup for negatives */
- if (d_really_is_negative(dentry))
- goto out_bad_unlocked;
+ /* Negative hashed dentry? */
+ if (d_really_is_negative(dentry)) {
+ struct kernfs_node *parent;
+
+ /* If the kernfs parent node has changed discard and
+ * proceed to ->lookup.
+ */
+ spin_lock(&dentry->d_lock);
+ parent = kernfs_dentry_node(dentry->d_parent);
+ if (parent) {
+ if (kernfs_dir_changed(parent, dentry)) {
+ spin_unlock(&dentry->d_lock);
+ return 0;
+ }
+ }
+ spin_unlock(&dentry->d_lock);
+
+ /* The kernfs parent node hasn't changed, leave the
+ * dentry negative and return success.
+ */
+ return 1;
+ }

kn = kernfs_dentry_node(dentry);
mutex_lock(&kernfs_mutex);
@@ -1067,7 +1086,6 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
return 1;
out_bad:
mutex_unlock(&kernfs_mutex);
-out_bad_unlocked:
return 0;
}

@@ -1082,33 +1100,27 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
struct dentry *ret;
struct kernfs_node *parent = dir->i_private;
struct kernfs_node *kn;
- struct inode *inode;
+ struct inode *inode = NULL;
const void *ns = NULL;

mutex_lock(&kernfs_mutex);
-
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dir->i_sb)->ns;

kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
-
- /* no such entry */
- if (!kn || !kernfs_active(kn)) {
- ret = NULL;
- goto out_unlock;
- }
-
/* attach dentry and inode */
- inode = kernfs_get_inode(dir->i_sb, kn);
- if (!inode) {
- ret = ERR_PTR(-ENOMEM);
- goto out_unlock;
+ if (kn && kernfs_active(kn)) {
+ inode = kernfs_get_inode(dir->i_sb, kn);
+ if (!inode)
+ inode = ERR_PTR(-ENOMEM);
}
-
- /* instantiate and hash dentry */
+ /* Needed only for negative dentry validation */
+ if (!inode)
+ kernfs_set_rev(parent, dentry);
+ /* instantiate and hash (possibly negative) dentry */
ret = d_splice_alias(inode, dentry);
- out_unlock:
mutex_unlock(&kernfs_mutex);
+
return ret;
}



2021-06-15 10:40:48

by Ian Kent

Subject: Re: [PATCH v7 0/6] kernfs: proposed locking and concurrency improvement

Hi,

Here are benchmark test runs against patch set v7.
The locking change has made a bit of a difference but the results are
not much different from previous runs.

1) Run with 43 distinct sysfs file paths on an 86 CPU machine, perf graphs
in base-files-cpu-86-perf.txt and patched-files-cpu-86-perf.txt.

Base (5.12.2, unpatched)
------------------------
single: total 37.537578ms per 3.753758us
concur: total 4796.665718ms per 479.666572us CPU 65
concur: total 4800.220672ms per 480.022067us CPU 22
concur: total 4828.959645ms per 482.895964us CPU 43
concur: total 4848.001744ms per 484.800174us CPU 0
concur: total 4859.682650ms per 485.968265us CPU 31
concur: total 4865.884472ms per 486.588447us CPU 67
concur: total 4876.658191ms per 487.665819us CPU 30
concur: total 4877.914506ms per 487.791451us CPU 72
concur: total 4880.408600ms per 488.040860us CPU 24
concur: total 4884.697273ms per 488.469727us CPU 29
concur: total 4886.702702ms per 488.670270us CPU 34
concur: total 4887.505107ms per 488.750511us CPU 25
concur: total 4889.623774ms per 488.962377us CPU 69
concur: total 4894.103764ms per 489.410376us CPU 23
concur: total 4903.002239ms per 490.300224us CPU 81
concur: total 4906.873196ms per 490.687320us CPU 82
concur: total 4907.692111ms per 490.769211us CPU 68
concur: total 4908.913453ms per 490.891345us CPU 74
concur: total 4909.243539ms per 490.924354us CPU 73
concur: total 4916.646302ms per 491.664630us CPU 78
concur: total 4916.962985ms per 491.696299us CPU 27
concur: total 4917.448436ms per 491.744844us CPU 75
concur: total 4920.916201ms per 492.091620us CPU 70
concur: total 4922.719147ms per 492.271915us CPU 84
concur: total 4922.971086ms per 492.297109us CPU 26
concur: total 4923.825391ms per 492.382539us CPU 64
concur: total 4924.360433ms per 492.436043us CPU 80
concur: total 4927.506957ms per 492.750696us CPU 71
concur: total 4928.302267ms per 492.830227us CPU 85
concur: total 4928.619595ms per 492.861959us CPU 38
concur: total 4929.667220ms per 492.966722us CPU 41
concur: total 4929.783827ms per 492.978383us CPU 39
concur: total 4929.820893ms per 492.982089us CPU 77
concur: total 4929.869480ms per 492.986948us CPU 36
concur: total 4931.021352ms per 493.102135us CPU 76
concur: total 4931.837530ms per 493.183753us CPU 28
concur: total 4932.697037ms per 493.269704us CPU 5
concur: total 4932.854097ms per 493.285410us CPU 35
concur: total 4933.293678ms per 493.329368us CPU 49
concur: total 4933.634401ms per 493.363440us CPU 79
concur: total 4933.826684ms per 493.382668us CPU 40
concur: total 4933.903000ms per 493.390300us CPU 51
concur: total 4935.303136ms per 493.530314us CPU 21
concur: total 4935.369630ms per 493.536963us CPU 44
concur: total 4935.510909ms per 493.551091us CPU 32
concur: total 4935.723570ms per 493.572357us CPU 33
concur: total 4936.612243ms per 493.661224us CPU 42
concur: total 4936.901434ms per 493.690143us CPU 37
concur: total 4937.042596ms per 493.704260us CPU 48
concur: total 4937.138242ms per 493.713824us CPU 83
concur: total 4937.141487ms per 493.714149us CPU 47
concur: total 4937.326596ms per 493.732660us CPU 1
concur: total 4938.003521ms per 493.800352us CPU 61
concur: total 4939.302626ms per 493.930263us CPU 8
concur: total 4939.499698ms per 493.949970us CPU 18
concur: total 4940.621054ms per 494.062105us CPU 9
concur: total 4941.746319ms per 494.174632us CPU 50
concur: total 4941.781419ms per 494.178142us CPU 6
concur: total 4943.453629ms per 494.345363us CPU 45
concur: total 4943.543980ms per 494.354398us CPU 59
concur: total 4943.976473ms per 494.397647us CPU 52
concur: total 4944.397692ms per 494.439769us CPU 3
concur: total 4944.411995ms per 494.441200us CPU 7
concur: total 4944.431336ms per 494.443134us CPU 4
concur: total 4944.968336ms per 494.496834us CPU 46
concur: total 4945.390966ms per 494.539097us CPU 17
concur: total 4948.021887ms per 494.802189us CPU 10
concur: total 4948.039212ms per 494.803921us CPU 2
concur: total 4948.123469ms per 494.812347us CPU 16
concur: total 4948.180674ms per 494.818067us CPU 60
concur: total 4948.294916ms per 494.829492us CPU 20
concur: total 4948.585694ms per 494.858569us CPU 54
concur: total 4948.905652ms per 494.890565us CPU 55
concur: total 4948.958036ms per 494.895804us CPU 53
concur: total 4949.346910ms per 494.934691us CPU 15
concur: total 4949.456923ms per 494.945692us CPU 12
concur: total 4949.469902ms per 494.946990us CPU 19
concur: total 4949.518404ms per 494.951840us CPU 57
concur: total 4949.541110ms per 494.954111us CPU 63
concur: total 4949.622395ms per 494.962239us CPU 14
concur: total 4949.718346ms per 494.971835us CPU 56
concur: total 4950.237632ms per 495.023763us CPU 58
concur: total 4950.356322ms per 495.035632us CPU 13
concur: total 4950.452444ms per 495.045244us CPU 62
concur: total 4951.023635ms per 495.102363us CPU 11
concur: total 4954.519297ms per 495.451930us CPU 66
times: 10000 threads: 86 cpus: 86

patched (5.12.2)
----------------
single: total 40.276214ms per 4.027621us
concur: total 1688.316678ms per 168.831668us CPU 62
concur: total 1689.238692ms per 168.923869us CPU 10
concur: total 1690.603833ms per 169.060383us CPU 12
concur: total 1691.032655ms per 169.103265us CPU 61
concur: total 1691.472857ms per 169.147286us CPU 16
concur: total 1693.003579ms per 169.300358us CPU 58
concur: total 1694.125776ms per 169.412578us CPU 3
concur: total 1694.490338ms per 169.449034us CPU 46
concur: total 1694.602632ms per 169.460263us CPU 56
concur: total 1695.086541ms per 169.508654us CPU 44
concur: total 1695.554273ms per 169.555427us CPU 59
concur: total 1695.667461ms per 169.566746us CPU 19
concur: total 1695.690350ms per 169.569035us CPU 57
concur: total 1696.535623ms per 169.653562us CPU 60
concur: total 1696.690512ms per 169.669051us CPU 14
concur: total 1697.387634ms per 169.738763us CPU 18
concur: total 1698.343510ms per 169.834351us CPU 13
concur: total 1698.681114ms per 169.868111us CPU 6
concur: total 1698.836863ms per 169.883686us CPU 55
concur: total 1699.167141ms per 169.916714us CPU 15
concur: total 1699.676994ms per 169.967699us CPU 43
concur: total 1700.591012ms per 170.059101us CPU 2
concur: total 1700.934749ms per 170.093475us CPU 4
concur: total 1701.024945ms per 170.102495us CPU 51
concur: total 1701.226732ms per 170.122673us CPU 45
concur: total 1701.660419ms per 170.166042us CPU 49
concur: total 1701.678262ms per 170.167826us CPU 11
concur: total 1701.686480ms per 170.168648us CPU 25
concur: total 1701.844570ms per 170.184457us CPU 24
concur: total 1702.128359ms per 170.212836us CPU 21
concur: total 1702.610757ms per 170.261076us CPU 17
concur: total 1702.619791ms per 170.261979us CPU 48
concur: total 1702.791868ms per 170.279187us CPU 50
concur: total 1703.042799ms per 170.304280us CPU 68
concur: total 1703.183104ms per 170.318310us CPU 53
concur: total 1703.507021ms per 170.350702us CPU 54
concur: total 1704.226830ms per 170.422683us CPU 7
concur: total 1704.812777ms per 170.481278us CPU 66
concur: total 1704.930442ms per 170.493044us CPU 47
concur: total 1705.165520ms per 170.516552us CPU 20
concur: total 1705.177711ms per 170.517771us CPU 78
concur: total 1705.419976ms per 170.541998us CPU 1
concur: total 1705.803125ms per 170.580312us CPU 83
concur: total 1705.833619ms per 170.583362us CPU 76
concur: total 1705.890534ms per 170.589053us CPU 77
concur: total 1706.080503ms per 170.608050us CPU 0
concur: total 1706.463447ms per 170.646345us CPU 5
concur: total 1706.510373ms per 170.651037us CPU 67
concur: total 1706.530695ms per 170.653070us CPU 39
concur: total 1706.567113ms per 170.656711us CPU 85
concur: total 1706.574973ms per 170.657497us CPU 63
concur: total 1707.226897ms per 170.722690us CPU 22
concur: total 1707.361517ms per 170.736152us CPU 8
concur: total 1707.628961ms per 170.762896us CPU 72
concur: total 1707.759502ms per 170.775950us CPU 27
concur: total 1707.896622ms per 170.789662us CPU 73
concur: total 1707.955414ms per 170.795541us CPU 41
concur: total 1708.020238ms per 170.802024us CPU 37
concur: total 1708.212862ms per 170.821286us CPU 36
concur: total 1708.525288ms per 170.852529us CPU 23
concur: total 1708.600291ms per 170.860029us CPU 79
concur: total 1708.701518ms per 170.870152us CPU 29
concur: total 1708.810124ms per 170.881012us CPU 32
concur: total 1708.838772ms per 170.883877us CPU 84
concur: total 1708.879647ms per 170.887965us CPU 69
concur: total 1708.972612ms per 170.897261us CPU 40
concur: total 1708.985762ms per 170.898576us CPU 71
concur: total 1709.034237ms per 170.903424us CPU 65
concur: total 1709.074527ms per 170.907453us CPU 38
concur: total 1709.142994ms per 170.914299us CPU 28
concur: total 1709.160142ms per 170.916014us CPU 34
concur: total 1709.204206ms per 170.920421us CPU 26
concur: total 1709.228029ms per 170.922803us CPU 33
concur: total 1709.248037ms per 170.924804us CPU 42
concur: total 1709.326463ms per 170.932646us CPU 70
concur: total 1709.542704ms per 170.954270us CPU 35
concur: total 1709.920521ms per 170.992052us CPU 30
concur: total 1709.924509ms per 170.992451us CPU 80
concur: total 1712.591681ms per 171.259168us CPU 82
concur: total 1712.606558ms per 171.260656us CPU 64
concur: total 1712.649220ms per 171.264922us CPU 75
concur: total 1715.379022ms per 171.537902us CPU 52
concur: total 1715.620116ms per 171.562012us CPU 9
concur: total 1722.125475ms per 172.212547us CPU 31
concur: total 1723.295082ms per 172.329508us CPU 74
concur: total 1723.751586ms per 172.375159us CPU 81
times: 10000 threads: 86 cpus: 86

2) Run with a single non-existent sysfs file path (the negative-lookup
case) on an 86 CPU machine, perf graphs in base-missing-cpu-86-perf.txt
and patched-missing-cpu-86-perf.txt.

Base (5.12.2, unpatched)
------------------------
single: total 25.602588ms per 2.560259us
concur: total 7671.072803ms per 767.107280us CPU 22
concur: total 7676.655038ms per 767.665504us CPU 73
concur: total 7694.228695ms per 769.422870us CPU 23
concur: total 7804.388342ms per 780.438834us CPU 71
concur: total 7809.647276ms per 780.964728us CPU 70
concur: total 7902.296857ms per 790.229686us CPU 72
concur: total 7965.755246ms per 796.575525us CPU 24
concur: total 7994.052914ms per 799.405291us CPU 30
concur: total 8012.053435ms per 801.205343us CPU 65
concur: total 8046.237160ms per 804.623716us CPU 66
concur: total 8166.590955ms per 816.659096us CPU 28
concur: total 8188.160332ms per 818.816033us CPU 68
concur: total 8223.517444ms per 822.351744us CPU 29
concur: total 8316.801754ms per 831.680175us CPU 27
concur: total 8339.528925ms per 833.952892us CPU 21
concur: total 8356.378000ms per 835.637800us CPU 64
concur: total 8442.124512ms per 844.212451us CPU 83
concur: total 8601.109595ms per 860.110960us CPU 31
concur: total 8716.851975ms per 871.685197us CPU 67
concur: total 8719.306654ms per 871.930665us CPU 25
concur: total 8745.745263ms per 874.574526us CPU 40
concur: total 8781.849696ms per 878.184970us CPU 74
concur: total 8966.314242ms per 896.631424us CPU 69
concur: total 8974.189988ms per 897.418999us CPU 34
concur: total 9205.936603ms per 920.593660us CPU 77
concur: total 9209.980188ms per 920.998019us CPU 26
concur: total 9312.050816ms per 931.205082us CPU 39
concur: total 9337.596881ms per 933.759688us CPU 80
concur: total 9367.971342ms per 936.797134us CPU 82
concur: total 9425.680953ms per 942.568095us CPU 76
concur: total 9445.991013ms per 944.599101us CPU 81
concur: total 9494.789554ms per 949.478955us CPU 85
concur: total 9497.486906ms per 949.748691us CPU 78
concur: total 9575.471616ms per 957.547162us CPU 84
concur: total 9583.711012ms per 958.371101us CPU 41
concur: total 9603.143505ms per 960.314351us CPU 37
concur: total 9659.819786ms per 965.981979us CPU 33
concur: total 9660.746579ms per 966.074658us CPU 79
concur: total 9672.471465ms per 967.247146us CPU 32
concur: total 9673.978144ms per 967.397814us CPU 75
concur: total 9690.083225ms per 969.008322us CPU 38
concur: total 9718.957048ms per 971.895705us CPU 42
concur: total 9729.803443ms per 972.980344us CPU 35
concur: total 9736.271248ms per 973.627125us CPU 36
concur: total 9846.191441ms per 984.619144us CPU 6
concur: total 9858.151722ms per 985.815172us CPU 51
concur: total 9871.108661ms per 987.110866us CPU 45
concur: total 9876.793398ms per 987.679340us CPU 44
concur: total 9877.221388ms per 987.722139us CPU 43
concur: total 9881.354352ms per 988.135435us CPU 7
concur: total 9883.292263ms per 988.329226us CPU 48
concur: total 9893.701193ms per 989.370119us CPU 1
concur: total 9893.797272ms per 989.379727us CPU 50
concur: total 9893.950988ms per 989.395099us CPU 2
concur: total 9898.109350ms per 989.810935us CPU 8
concur: total 9899.264544ms per 989.926454us CPU 49
concur: total 9914.472096ms per 991.447210us CPU 61
concur: total 9918.795033ms per 991.879503us CPU 0
concur: total 9929.303538ms per 992.930354us CPU 3
concur: total 9930.259153ms per 993.025915us CPU 9
concur: total 9931.281750ms per 993.128175us CPU 46
concur: total 9933.713011ms per 993.371301us CPU 5
concur: total 9939.194669ms per 993.919467us CPU 18
concur: total 9943.118290ms per 994.311829us CPU 52
concur: total 9943.595571ms per 994.359557us CPU 47
concur: total 9945.999798ms per 994.599980us CPU 55
concur: total 9951.944744ms per 995.194474us CPU 12
concur: total 9959.544673ms per 995.954467us CPU 54
concur: total 9963.678139ms per 996.367814us CPU 62
concur: total 9963.679840ms per 996.367984us CPU 4
concur: total 9963.968986ms per 996.396899us CPU 53
concur: total 9964.919007ms per 996.491901us CPU 60
concur: total 9971.413697ms per 997.141370us CPU 59
concur: total 9973.607985ms per 997.360798us CPU 63
concur: total 9974.159664ms per 997.415966us CPU 16
concur: total 9974.520640ms per 997.452064us CPU 14
concur: total 9974.931335ms per 997.493134us CPU 17
concur: total 9974.962836ms per 997.496284us CPU 57
concur: total 9975.494649ms per 997.549465us CPU 11
concur: total 9976.683023ms per 997.668302us CPU 10
concur: total 9976.757297ms per 997.675730us CPU 56
concur: total 9977.161874ms per 997.716187us CPU 19
concur: total 9977.290610ms per 997.729061us CPU 20
concur: total 9978.145954ms per 997.814595us CPU 13
concur: total 9978.815314ms per 997.881531us CPU 58
concur: total 9979.288565ms per 997.928856us CPU 15
times: 10000 threads: 86 cpus: 86

patched (5.12.2)
----------------
single: total 19.675378ms per 1.967538us
concur: total 1267.301099ms per 126.730110us CPU 57
concur: total 1267.676731ms per 126.767673us CPU 58
concur: total 1268.688528ms per 126.868853us CPU 14
concur: total 1269.572284ms per 126.957228us CPU 56
concur: total 1273.534346ms per 127.353435us CPU 13
concur: total 1273.615124ms per 127.361512us CPU 15
concur: total 1275.602230ms per 127.560223us CPU 62
concur: total 1275.658943ms per 127.565894us CPU 19
concur: total 1275.789693ms per 127.578969us CPU 20
concur: total 1276.271081ms per 127.627108us CPU 63
concur: total 1277.278733ms per 127.727873us CPU 61
concur: total 1278.056971ms per 127.805697us CPU 17
concur: total 1278.083690ms per 127.808369us CPU 18
concur: total 1278.446545ms per 127.844654us CPU 16
concur: total 1279.284518ms per 127.928452us CPU 60
concur: total 1279.931614ms per 127.993161us CPU 59
concur: total 1282.073914ms per 128.207391us CPU 12
concur: total 1282.711558ms per 128.271156us CPU 55
concur: total 1283.261985ms per 128.326199us CPU 1
concur: total 1283.325370ms per 128.332537us CPU 53
concur: total 1283.571147ms per 128.357115us CPU 10
concur: total 1284.082425ms per 128.408243us CPU 44
concur: total 1284.714728ms per 128.471473us CPU 11
concur: total 1284.829587ms per 128.482959us CPU 54
concur: total 1288.556189ms per 128.855619us CPU 45
concur: total 1289.402348ms per 128.940235us CPU 47
concur: total 1289.483215ms per 128.948322us CPU 50
concur: total 1289.517709ms per 128.951771us CPU 7
concur: total 1289.576335ms per 128.957633us CPU 2
concur: total 1289.865537ms per 128.986554us CPU 49
concur: total 1290.402884ms per 129.040288us CPU 4
concur: total 1290.635575ms per 129.063558us CPU 51
concur: total 1290.960194ms per 129.096019us CPU 8
concur: total 1291.037710ms per 129.103771us CPU 5
concur: total 1291.164969ms per 129.116497us CPU 43
concur: total 1291.269961ms per 129.126996us CPU 48
concur: total 1291.328620ms per 129.132862us CPU 3
concur: total 1291.512777ms per 129.151278us CPU 6
concur: total 1292.126213ms per 129.212621us CPU 46
concur: total 1292.459740ms per 129.245974us CPU 0
concur: total 1319.990689ms per 131.999069us CPU 36
concur: total 1320.534768ms per 132.053477us CPU 79
concur: total 1321.478834ms per 132.147883us CPU 52
concur: total 1321.688430ms per 132.168843us CPU 9
concur: total 1322.367295ms per 132.236729us CPU 41
concur: total 1322.774626ms per 132.277463us CPU 84
concur: total 1322.858841ms per 132.285884us CPU 37
concur: total 1323.004662ms per 132.300466us CPU 80
concur: total 1323.509692ms per 132.350969us CPU 35
concur: total 1324.168083ms per 132.416808us CPU 78
concur: total 1324.349236ms per 132.434924us CPU 40
concur: total 1324.390289ms per 132.439029us CPU 83
concur: total 1324.582900ms per 132.458290us CPU 39
concur: total 1324.827884ms per 132.482788us CPU 82
concur: total 1325.039595ms per 132.503960us CPU 38
concur: total 1325.178815ms per 132.517881us CPU 81
concur: total 1325.988735ms per 132.598873us CPU 76
concur: total 1326.019984ms per 132.601998us CPU 77
concur: total 1326.032461ms per 132.603246us CPU 34
concur: total 1326.344117ms per 132.634412us CPU 32
concur: total 1326.633867ms per 132.663387us CPU 75
concur: total 1326.717049ms per 132.671705us CPU 23
concur: total 1326.808905ms per 132.680891us CPU 33
concur: total 1327.019730ms per 132.701973us CPU 66
concur: total 1327.332877ms per 132.733288us CPU 25
concur: total 1327.359688ms per 132.735969us CPU 68
concur: total 1327.583548ms per 132.758355us CPU 26
concur: total 1327.881813ms per 132.788181us CPU 69
concur: total 1333.479850ms per 133.347985us CPU 67
concur: total 1333.685477ms per 133.368548us CPU 65
concur: total 1334.340429ms per 133.434043us CPU 70
concur: total 1334.459344ms per 133.445934us CPU 73
concur: total 1334.721884ms per 133.472188us CPU 24
concur: total 1329.700073ms per 132.970007us CPU 29
concur: total 1329.780331ms per 132.978033us CPU 72
concur: total 1334.903260ms per 133.490326us CPU 30
concur: total 1334.930622ms per 133.493062us CPU 22
concur: total 1334.987178ms per 133.498718us CPU 27
concur: total 1335.053335ms per 133.505334us CPU 21
concur: total 1335.098157ms per 133.509816us CPU 71
concur: total 1335.130695ms per 133.513070us CPU 28
concur: total 1335.131987ms per 133.513199us CPU 42
concur: total 1335.168902ms per 133.516890us CPU 64
concur: total 1343.899304ms per 134.389930us CPU 74
concur: total 1343.952212ms per 134.395221us CPU 31
concur: total 1351.058043ms per 135.105804us CPU 85
times: 10000 threads: 86 cpus: 86


Attachments:
base-files-cpus-86-perf.txt (425.23 kB)
base-missing-cpus-86-perf.txt (389.04 kB)
patched-files-cpus-86-perf.txt (440.52 kB)
patched-missing-cpus-86-perf.txt (540.09 kB)

2021-06-15 15:09:30

by Greg Kroah-Hartman

Subject: Re: [PATCH v7 1/6] kernfs: move revalidate to be near lookup

On Tue, Jun 15, 2021 at 06:25:53PM +0800, Ian Kent wrote:
> While the dentry operation kernfs_dop_revalidate() is grouped with
> dentry type functions it also has a strong affinity to the inode
> operation ->lookup().
>
> It makes sense to locate this function near to kernfs_iop_lookup()
> because we will be adding VFS negative dentry caching to reduce path
> lookup overhead for non-existent paths.
>
> There's no functional change from this patch.
>
> Signed-off-by: Ian Kent <[email protected]>
> Reviewed-by: Miklos Szeredi <[email protected]>
> ---
> fs/kernfs/dir.c | 86 ++++++++++++++++++++++++++++---------------------------
> 1 file changed, 43 insertions(+), 43 deletions(-)

As everyone has agreed this patch does nothing wrong, I've applied this
one for now while everyone reviews the other ones :)

thanks,

greg k-h

2021-06-16 09:13:06

by Miklos Szeredi

Subject: Re: [PATCH v7 2/6] kernfs: add a revision to identify directory node changes

On Tue, 15 Jun 2021 at 12:26, Ian Kent <[email protected]> wrote:
>
> Add a revision counter to kernfs directory nodes so it can be used
> to detect if a directory node has changed during negative dentry
> revalidation.
>
> There's an assumption that sizeof(unsigned long) <= sizeof(pointer)
> on all architectures and as far as I know that assumption holds.
>
> So adding a revision counter to the struct kernfs_elem_dir variant of
> the kernfs_node type union won't increase the size of the kernfs_node
> struct. This is because struct kernfs_elem_dir is at least
> sizeof(pointer) smaller than the largest union variant. It's tempting
> to make the revision counter a u64 but that would increase the size of
> kernfs_node on archs where sizeof(pointer) is smaller than the revision
> counter.
>
> Signed-off-by: Ian Kent <[email protected]>

Reviewed-by: Miklos Szeredi <[email protected]>

2021-06-16 09:14:29

by Miklos Szeredi

Subject: Re: [PATCH v7 3/6] kernfs: use VFS negative dentry caching

On Tue, 15 Jun 2021 at 12:26, Ian Kent <[email protected]> wrote:
>
> If there are many lookups for non-existent paths these negative lookups
> can lead to a lot of overhead during path walks.
>
> The VFS allows dentries to be created as negative and hashed, and caches
> them so they can be used to reduce the fairly high overhead alloc/free
> cycle that occurs during these lookups.
>
> Use the kernfs node parent revision to identify if a change has been
> made to the containing directory so that the negative dentry can be
> discarded and the lookup redone.
>
> Signed-off-by: Ian Kent <[email protected]>

Reviewed-by: Miklos Szeredi <[email protected]>