2021-06-09 17:13:06

by Ian Kent

[permalink] [raw]
Subject: [PATCH v6 0/7] kernfs: proposed locking and concurrency improvement

There have been a few instances of contention on the kernfs_mutex during
path walks, a case on very large IBM systems seen by myself, a report by
Brice Goglin and followed up by Fox Chen, and I've since seen a couple
of other reports by CoreOS users.

The common thread is a large number of kernfs path walks leading to
slowness of path walks due to kernfs_mutex contention.

The problem being that changes to the VFS over some time have increased
it's concurrency capabilities to an extent that kernfs's use of a mutex
is no longer appropriate. There's also an issue of walks for non-existent
paths causing contention if there are quite a few of them which is a less
common problem.

This patch series is relatively straight forward.

All it does is add the ability to take advantage of VFS negative dentry
caching to avoid needless dentry alloc/free cycles for lookups of paths
that don't exit and change the kernfs_mutex to a read/write semaphore.

The patch that tried to stay in VFS rcu-walk mode during path walks has
been dropped for two reasons. First, it doesn't actually give very much
improvement and, second, if there's a place where mistakes could go
unnoticed it would be in that path. This makes the patch series simpler
to review and reduces the likelihood of problems going unnoticed and
popping up later.

Changes since v5:
- change kernfs_dir_changed() comparison.
- move negative dentry out from under kernfs node lock in revalidate.
- only set d_time for negative dentries.
- add patch to move d_splice_alias() out from under kernfs node lock
in lookup.

Changes since v4:
- fixed kernfs_active() naming.
- added back kernfs_node revision patch to use for negative dentry
validation.
- minor updates to patch descriptions.

Changes since v3:
- remove unneeded indirection when referencing the super block.
- check if inode attribute update is actually needed.

Changes since v2:
- actually fix the inode attribute update locking.
- drop the patch that tried to stay in rcu-walk mode.
- drop the use a revision to identify if a directory has changed patch.

Changes since v1:
- fix locking in .permission() and .gated() by re-factoring the
attribute handling code.
---

Ian Kent (7):
kernfs: move revalidate to be near lookup
kernfs: add a revision to identify directory node changes
kernfs: use VFS negative dentry caching
kernfs: switch kernfs to use an rwsem
kernfs: use i_lock to protect concurrent inode updates
kernfs: add kernfs_need_inode_refresh()
kernfs: dont call d_splice_alias() under kernfs node lock


fs/kernfs/dir.c | 150 +++++++++++++++++++-----------------
fs/kernfs/file.c | 4 +-
fs/kernfs/inode.c | 45 +++++++++--
fs/kernfs/kernfs-internal.h | 28 ++++++-
fs/kernfs/mount.c | 12 +--
fs/kernfs/symlink.c | 4 +-
include/linux/kernfs.h | 7 +-
7 files changed, 160 insertions(+), 90 deletions(-)

--
Ian


2021-06-09 17:13:08

by Ian Kent

[permalink] [raw]
Subject: [PATCH v6 1/7] kernfs: move revalidate to be near lookup

While the dentry operation kernfs_dop_revalidate() is grouped with
dentry type functions it also has a strong affinity to the inode
operation ->lookup().

It makes sense to locate this function near to kernfs_iop_lookup()
because we will be adding VFS negative dentry caching to reduce path
lookup overhead for non-existent paths.

There's no functional change from this patch.

Signed-off-by: Ian Kent <[email protected]>
---
fs/kernfs/dir.c | 86 ++++++++++++++++++++++++++++---------------------------
1 file changed, 43 insertions(+), 43 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 7e0e62deab53c..33166ec90a112 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -548,49 +548,6 @@ void kernfs_put(struct kernfs_node *kn)
}
EXPORT_SYMBOL_GPL(kernfs_put);

-static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
-{
- struct kernfs_node *kn;
-
- if (flags & LOOKUP_RCU)
- return -ECHILD;
-
- /* Always perform fresh lookup for negatives */
- if (d_really_is_negative(dentry))
- goto out_bad_unlocked;
-
- kn = kernfs_dentry_node(dentry);
- mutex_lock(&kernfs_mutex);
-
- /* The kernfs node has been deactivated */
- if (!kernfs_active(kn))
- goto out_bad;
-
- /* The kernfs node has been moved? */
- if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
- goto out_bad;
-
- /* The kernfs node has been renamed */
- if (strcmp(dentry->d_name.name, kn->name) != 0)
- goto out_bad;
-
- /* The kernfs node has been moved to a different namespace */
- if (kn->parent && kernfs_ns_enabled(kn->parent) &&
- kernfs_info(dentry->d_sb)->ns != kn->ns)
- goto out_bad;
-
- mutex_unlock(&kernfs_mutex);
- return 1;
-out_bad:
- mutex_unlock(&kernfs_mutex);
-out_bad_unlocked:
- return 0;
-}
-
-const struct dentry_operations kernfs_dops = {
- .d_revalidate = kernfs_dop_revalidate,
-};
-
/**
* kernfs_node_from_dentry - determine kernfs_node associated with a dentry
* @dentry: the dentry in question
@@ -1073,6 +1030,49 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
return ERR_PTR(rc);
}

+static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
+{
+ struct kernfs_node *kn;
+
+ if (flags & LOOKUP_RCU)
+ return -ECHILD;
+
+ /* Always perform fresh lookup for negatives */
+ if (d_really_is_negative(dentry))
+ goto out_bad_unlocked;
+
+ kn = kernfs_dentry_node(dentry);
+ mutex_lock(&kernfs_mutex);
+
+ /* The kernfs node has been deactivated */
+ if (!kernfs_active(kn))
+ goto out_bad;
+
+ /* The kernfs node has been moved? */
+ if (kernfs_dentry_node(dentry->d_parent) != kn->parent)
+ goto out_bad;
+
+ /* The kernfs node has been renamed */
+ if (strcmp(dentry->d_name.name, kn->name) != 0)
+ goto out_bad;
+
+ /* The kernfs node has been moved to a different namespace */
+ if (kn->parent && kernfs_ns_enabled(kn->parent) &&
+ kernfs_info(dentry->d_sb)->ns != kn->ns)
+ goto out_bad;
+
+ mutex_unlock(&kernfs_mutex);
+ return 1;
+out_bad:
+ mutex_unlock(&kernfs_mutex);
+out_bad_unlocked:
+ return 0;
+}
+
+const struct dentry_operations kernfs_dops = {
+ .d_revalidate = kernfs_dop_revalidate,
+};
+
static struct dentry *kernfs_iop_lookup(struct inode *dir,
struct dentry *dentry,
unsigned int flags)


2021-06-09 17:13:17

by Ian Kent

[permalink] [raw]
Subject: [PATCH v6 4/7] kernfs: switch kernfs to use an rwsem

The kernfs global lock restricts the ability to perform kernfs node
lookup operations in parallel during path walks.

Change the kernfs mutex to an rwsem so that, when opportunity arises,
node searches can be done in parallel with path walk lookups.

Signed-off-by: Ian Kent <[email protected]>
---
fs/kernfs/dir.c | 94 ++++++++++++++++++++++---------------------
fs/kernfs/file.c | 4 +-
fs/kernfs/inode.c | 16 ++++---
fs/kernfs/kernfs-internal.h | 5 +-
fs/kernfs/mount.c | 12 +++--
fs/kernfs/symlink.c | 4 +-
include/linux/kernfs.h | 2 -
7 files changed, 69 insertions(+), 68 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 4f037456a8e17..195561c08439a 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -17,7 +17,7 @@

#include "kernfs-internal.h"

-DEFINE_MUTEX(kernfs_mutex);
+DECLARE_RWSEM(kernfs_rwsem);
static DEFINE_SPINLOCK(kernfs_rename_lock); /* kn->parent and ->name */
static char kernfs_pr_cont_buf[PATH_MAX]; /* protected by rename_lock */
static DEFINE_SPINLOCK(kernfs_idr_lock); /* root->ino_idr */
@@ -26,7 +26,7 @@ static DEFINE_SPINLOCK(kernfs_idr_lock); /* root->ino_idr */

static bool kernfs_active(struct kernfs_node *kn)
{
- lockdep_assert_held(&kernfs_mutex);
+ lockdep_assert_held(&kernfs_rwsem);
return atomic_read(&kn->active) >= 0;
}

@@ -340,7 +340,7 @@ static int kernfs_sd_compare(const struct kernfs_node *left,
* @kn->parent->dir.children.
*
* Locking:
- * mutex_lock(kernfs_mutex)
+ * kernfs_rwsem held exclusive
*
* RETURNS:
* 0 on susccess -EEXIST on failure.
@@ -386,7 +386,7 @@ static int kernfs_link_sibling(struct kernfs_node *kn)
* removed, %false if @kn wasn't on the rbtree.
*
* Locking:
- * mutex_lock(kernfs_mutex)
+ * kernfs_rwsem held exclusive
*/
static bool kernfs_unlink_sibling(struct kernfs_node *kn)
{
@@ -457,14 +457,14 @@ void kernfs_put_active(struct kernfs_node *kn)
* return after draining is complete.
*/
static void kernfs_drain(struct kernfs_node *kn)
- __releases(&kernfs_mutex) __acquires(&kernfs_mutex)
+ __releases(&kernfs_rwsem) __acquires(&kernfs_rwsem)
{
struct kernfs_root *root = kernfs_root(kn);

- lockdep_assert_held(&kernfs_mutex);
+ lockdep_assert_held_write(&kernfs_rwsem);
WARN_ON_ONCE(kernfs_active(kn));

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);

if (kernfs_lockdep(kn)) {
rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
@@ -483,7 +483,7 @@ static void kernfs_drain(struct kernfs_node *kn)

kernfs_drain_open_files(kn);

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
}

/**
@@ -722,7 +722,7 @@ int kernfs_add_one(struct kernfs_node *kn)
bool has_ns;
int ret;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);

ret = -EINVAL;
has_ns = kernfs_ns_enabled(parent);
@@ -753,7 +753,7 @@ int kernfs_add_one(struct kernfs_node *kn)
ps_iattr->ia_mtime = ps_iattr->ia_ctime;
}

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);

/*
* Activate the new node unless CREATE_DEACTIVATED is requested.
@@ -767,7 +767,7 @@ int kernfs_add_one(struct kernfs_node *kn)
return 0;

out_unlock:
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
return ret;
}

@@ -788,7 +788,7 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
bool has_ns = kernfs_ns_enabled(parent);
unsigned int hash;

- lockdep_assert_held(&kernfs_mutex);
+ lockdep_assert_held(&kernfs_rwsem);

if (has_ns != (bool)ns) {
WARN(1, KERN_WARNING "kernfs: ns %s in '%s' for '%s'\n",
@@ -820,7 +820,7 @@ static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
size_t len;
char *p, *name;

- lockdep_assert_held(&kernfs_mutex);
+ lockdep_assert_held_read(&kernfs_rwsem);

/* grab kernfs_rename_lock to piggy back on kernfs_pr_cont_buf */
spin_lock_irq(&kernfs_rename_lock);
@@ -860,10 +860,10 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
{
struct kernfs_node *kn;

- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);
kn = kernfs_find_ns(parent, name, ns);
kernfs_get(kn);
- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);

return kn;
}
@@ -884,10 +884,10 @@ struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
{
struct kernfs_node *kn;

- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);
kn = kernfs_walk_ns(parent, path, ns);
kernfs_get(kn);
- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);

return kn;
}
@@ -1063,7 +1063,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
}

kn = kernfs_dentry_node(dentry);
- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);

/* The kernfs node has been deactivated */
if (!kernfs_active(kn))
@@ -1082,10 +1082,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
kernfs_info(dentry->d_sb)->ns != kn->ns)
goto out_bad;

- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);
return 1;
out_bad:
- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);
return 0;
}

@@ -1103,7 +1103,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
struct inode *inode = NULL;
const void *ns = NULL;

- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dir->i_sb)->ns;

@@ -1119,7 +1119,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
kernfs_set_rev(parent, dentry);
/* instantiate and hash (possibly negative) dentry */
ret = d_splice_alias(inode, dentry);
- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);

return ret;
}
@@ -1241,7 +1241,7 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
{
struct rb_node *rbn;

- lockdep_assert_held(&kernfs_mutex);
+ lockdep_assert_held_write(&kernfs_rwsem);

/* if first iteration, visit leftmost descendant which may be root */
if (!pos)
@@ -1277,7 +1277,7 @@ void kernfs_activate(struct kernfs_node *kn)
{
struct kernfs_node *pos;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);

pos = NULL;
while ((pos = kernfs_next_descendant_post(pos, kn))) {
@@ -1291,14 +1291,14 @@ void kernfs_activate(struct kernfs_node *kn)
pos->flags |= KERNFS_ACTIVATED;
}

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
}

static void __kernfs_remove(struct kernfs_node *kn)
{
struct kernfs_node *pos;

- lockdep_assert_held(&kernfs_mutex);
+ lockdep_assert_held_write(&kernfs_rwsem);

/*
* Short-circuit if non-root @kn has already finished removal.
@@ -1321,7 +1321,7 @@ static void __kernfs_remove(struct kernfs_node *kn)
pos = kernfs_leftmost_descendant(kn);

/*
- * kernfs_drain() drops kernfs_mutex temporarily and @pos's
+ * kernfs_drain() drops kernfs_rwsem temporarily and @pos's
* base ref could have been put by someone else by the time
* the function returns. Make sure it doesn't go away
* underneath us.
@@ -1368,9 +1368,9 @@ static void __kernfs_remove(struct kernfs_node *kn)
*/
void kernfs_remove(struct kernfs_node *kn)
{
- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
__kernfs_remove(kn);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
}

/**
@@ -1457,17 +1457,17 @@ bool kernfs_remove_self(struct kernfs_node *kn)
{
bool ret;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
kernfs_break_active_protection(kn);

/*
* SUICIDAL is used to arbitrate among competing invocations. Only
* the first one will actually perform removal. When the removal
* is complete, SUICIDED is set and the active ref is restored
- * while holding kernfs_mutex. The ones which lost arbitration
- * waits for SUICDED && drained which can happen only after the
- * enclosing kernfs operation which executed the winning instance
- * of kernfs_remove_self() finished.
+ * while kernfs_rwsem for held exclusive. The ones which lost
+ * arbitration waits for SUICIDED && drained which can happen only
+ * after the enclosing kernfs operation which executed the winning
+ * instance of kernfs_remove_self() finished.
*/
if (!(kn->flags & KERNFS_SUICIDAL)) {
kn->flags |= KERNFS_SUICIDAL;
@@ -1485,9 +1485,9 @@ bool kernfs_remove_self(struct kernfs_node *kn)
atomic_read(&kn->active) == KN_DEACTIVATED_BIAS)
break;

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
schedule();
- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
}
finish_wait(waitq, &wait);
WARN_ON_ONCE(!RB_EMPTY_NODE(&kn->rb));
@@ -1495,12 +1495,12 @@ bool kernfs_remove_self(struct kernfs_node *kn)
}

/*
- * This must be done while holding kernfs_mutex; otherwise, waiting
- * for SUICIDED && deactivated could finish prematurely.
+ * This must be done while kernfs_rwsem held exclusive; otherwise,
+ * waiting for SUICIDED && deactivated could finish prematurely.
*/
kernfs_unbreak_active_protection(kn);

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
return ret;
}

@@ -1524,13 +1524,13 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
return -ENOENT;
}

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);

kn = kernfs_find_ns(parent, name, ns);
if (kn)
__kernfs_remove(kn);

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);

if (kn)
return 0;
@@ -1556,7 +1556,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
if (!kn->parent)
return -EINVAL;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);

error = -ENOENT;
if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
@@ -1610,7 +1610,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,

error = 0;
out:
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
return error;
}

@@ -1685,7 +1685,7 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)

if (!dir_emit_dots(file, ctx))
return 0;
- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);

if (kernfs_ns_enabled(parent))
ns = kernfs_info(dentry->d_sb)->ns;
@@ -1702,12 +1702,12 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
file->private_data = pos;
kernfs_get(pos);

- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);
if (!dir_emit(ctx, name, len, ino, type))
return 0;
- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);
}
- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);
file->private_data = NULL;
ctx->pos = INT_MAX;
return 0;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index c757193121475..60e2a86c535eb 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -860,7 +860,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
spin_unlock_irq(&kernfs_notify_lock);

/* kick fsnotify */
- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);

list_for_each_entry(info, &kernfs_root(kn)->supers, node) {
struct kernfs_node *parent;
@@ -898,7 +898,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
iput(inode);
}

- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
kernfs_put(kn);
goto repeat;
}
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index d73950fc3d57d..3b01e9e61f14e 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -106,9 +106,9 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
{
int ret;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
ret = __kernfs_setattr(kn, iattr);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
return ret;
}

@@ -122,7 +122,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
if (!kn)
return -EINVAL;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
error = setattr_prepare(&init_user_ns, dentry, iattr);
if (error)
goto out;
@@ -135,7 +135,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
setattr_copy(&init_user_ns, inode, iattr);

out:
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
return error;
}

@@ -191,9 +191,9 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
struct inode *inode = d_inode(path->dentry);
struct kernfs_node *kn = inode->i_private;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
kernfs_refresh_inode(kn, inode);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);

generic_fillattr(&init_user_ns, inode, stat);
return 0;
@@ -284,9 +284,9 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,

kn = inode->i_private;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
kernfs_refresh_inode(kn, inode);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);

return generic_permission(&init_user_ns, inode, mask);
}
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index b4e7579e04799..8a067609f63ba 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -13,6 +13,7 @@
#include <linux/lockdep.h>
#include <linux/fs.h>
#include <linux/mutex.h>
+#include <linux/rwsem.h>
#include <linux/xattr.h>

#include <linux/kernfs.h>
@@ -69,7 +70,7 @@ struct kernfs_super_info {
*/
const void *ns;

- /* anchored at kernfs_root->supers, protected by kernfs_mutex */
+ /* anchored at kernfs_root->supers, protected by kernfs_rwsem */
struct list_head node;
};
#define kernfs_info(SB) ((struct kernfs_super_info *)(SB->s_fs_info))
@@ -125,7 +126,7 @@ int __kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr);
/*
* dir.c
*/
-extern struct mutex kernfs_mutex;
+extern struct rw_semaphore kernfs_rwsem;
extern const struct dentry_operations kernfs_dops;
extern const struct file_operations kernfs_dir_fops;
extern const struct inode_operations kernfs_dir_iops;
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 9dc7e7a64e10f..baa4155ba2edf 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -255,9 +255,9 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k
sb->s_shrink.seeks = 0;

/* get root inode, initialize and unlock it */
- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
inode = kernfs_get_inode(sb, info->root->kn);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
if (!inode) {
pr_debug("kernfs: could not get root inode\n");
return -ENOMEM;
@@ -344,9 +344,9 @@ int kernfs_get_tree(struct fs_context *fc)
}
sb->s_flags |= SB_ACTIVE;

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
list_add(&info->node, &info->root->supers);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);
}

fc->root = dget(sb->s_root);
@@ -372,9 +372,9 @@ void kernfs_kill_sb(struct super_block *sb)
{
struct kernfs_super_info *info = kernfs_info(sb);

- mutex_lock(&kernfs_mutex);
+ down_write(&kernfs_rwsem);
list_del(&info->node);
- mutex_unlock(&kernfs_mutex);
+ up_write(&kernfs_rwsem);

/*
* Remove the superblock from fs_supers/s_instances
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 5432883d819f2..c8f8e41b84110 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -116,9 +116,9 @@ static int kernfs_getlink(struct inode *inode, char *path)
struct kernfs_node *target = kn->symlink.target_kn;
int error;

- mutex_lock(&kernfs_mutex);
+ down_read(&kernfs_rwsem);
error = kernfs_get_target_path(parent, target, path);
- mutex_unlock(&kernfs_mutex);
+ up_read(&kernfs_rwsem);

return error;
}
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index d7e0160fce6df..6df8cec4af51d 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -193,7 +193,7 @@ struct kernfs_root {
u32 id_highbits;
struct kernfs_syscall_ops *syscall_ops;

- /* list of kernfs_super_info of this root, protected by kernfs_mutex */
+ /* list of kernfs_super_info of this root, protected by kernfs_rwsem */
struct list_head supers;

wait_queue_head_t deactivate_waitq;


2021-06-09 17:13:18

by Ian Kent

[permalink] [raw]
Subject: [PATCH v6 5/7] kernfs: use i_lock to protect concurrent inode updates

The inode operations .permission() and .getattr() use the kernfs node
write lock but all that's needed is to keep the rb tree stable while
updating the inode attributes as well as protecting the update itself
against concurrent changes.

And .permission() is called frequently during path walks and can cause
quite a bit of contention between kernfs node operations and path
walks when the number of concurrent walks is high.

To change kernfs_iop_getattr() and kernfs_iop_permission() to take
the rw sem read lock instead of the write lock an additional lock is
needed to protect against multiple processes concurrently updating
the inode attributes and link count in kernfs_refresh_inode().

The inode i_lock seems like the sensible thing to use to protect these
inode attribute updates so use it in kernfs_refresh_inode().

The last hunk in the patch, applied to kernfs_fill_super(), is possibly
not needed but taking the lock was present originally and I prefer to
continue to take it so the rb tree is held stable during the call to
kernfs_refresh_inode() made by kernfs_get_inode().

Signed-off-by: Ian Kent <[email protected]>
---
fs/kernfs/inode.c | 10 ++++++----
fs/kernfs/mount.c | 4 ++--
2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 3b01e9e61f14e..6728ecd81eb37 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
{
struct kernfs_iattrs *attrs = kn->iattr;

+ spin_lock(&inode->i_lock);
inode->i_mode = kn->mode;
if (attrs)
/*
@@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)

if (kernfs_type(kn) == KERNFS_DIR)
set_nlink(inode, kn->dir.subdirs + 2);
+ spin_unlock(&inode->i_lock);
}

int kernfs_iop_getattr(struct user_namespace *mnt_userns,
@@ -191,9 +193,9 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
struct inode *inode = d_inode(path->dentry);
struct kernfs_node *kn = inode->i_private;

- down_write(&kernfs_rwsem);
+ down_read(&kernfs_rwsem);
kernfs_refresh_inode(kn, inode);
- up_write(&kernfs_rwsem);
+ up_read(&kernfs_rwsem);

generic_fillattr(&init_user_ns, inode, stat);
return 0;
@@ -284,9 +286,9 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,

kn = inode->i_private;

- down_write(&kernfs_rwsem);
+ down_read(&kernfs_rwsem);
kernfs_refresh_inode(kn, inode);
- up_write(&kernfs_rwsem);
+ up_read(&kernfs_rwsem);

return generic_permission(&init_user_ns, inode, mask);
}
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index baa4155ba2edf..f2f909d09f522 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -255,9 +255,9 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k
sb->s_shrink.seeks = 0;

/* get root inode, initialize and unlock it */
- down_write(&kernfs_rwsem);
+ down_read(&kernfs_rwsem);
inode = kernfs_get_inode(sb, info->root->kn);
- up_write(&kernfs_rwsem);
+ up_read(&kernfs_rwsem);
if (!inode) {
pr_debug("kernfs: could not get root inode\n");
return -ENOMEM;


2021-06-09 17:13:23

by Ian Kent

[permalink] [raw]
Subject: [PATCH v6 6/7] kernfs: add kernfs_need_inode_refresh()

Now the kernfs_rwsem read lock is held for kernfs_refresh_inode() and
the i_lock taken to protect inode updates there can be some contention
introduced when .permission() is called with concurrent path walks in
progress.

Since .permission() is called frequently during path walks it's worth
checking if the update is actually needed before taking the lock and
performing the update.

Signed-off-by: Ian Kent <[email protected]>
---
fs/kernfs/inode.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 6728ecd81eb37..67fb1289c51dc 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -158,6 +158,30 @@ static inline void set_default_inode_attr(struct inode *inode, umode_t mode)
inode->i_ctime = current_time(inode);
}

+static bool kernfs_need_inode_refresh(struct kernfs_node *kn,
+ struct inode *inode,
+ struct kernfs_iattrs *attrs)
+{
+ if (kernfs_type(kn) == KERNFS_DIR) {
+ if (inode->i_nlink != kn->dir.subdirs + 2)
+ return true;
+ }
+
+ if (inode->i_mode != kn->mode)
+ return true;
+
+ if (attrs) {
+ if (!timespec64_equal(&inode->i_atime, &attrs->ia_atime) ||
+ !timespec64_equal(&inode->i_mtime, &attrs->ia_mtime) ||
+ !timespec64_equal(&inode->i_ctime, &attrs->ia_ctime) ||
+ !uid_eq(inode->i_uid, attrs->ia_uid) ||
+ !gid_eq(inode->i_gid, attrs->ia_gid))
+ return true;
+ }
+
+ return false;
+}
+
static inline void set_inode_attr(struct inode *inode,
struct kernfs_iattrs *attrs)
{
@@ -172,6 +196,9 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
{
struct kernfs_iattrs *attrs = kn->iattr;

+ if (!kernfs_need_inode_refresh(kn, inode, attrs))
+ return;
+
spin_lock(&inode->i_lock);
inode->i_mode = kn->mode;
if (attrs)


2021-06-09 17:14:22

by Ian Kent

[permalink] [raw]
Subject: [PATCH v6 2/7] kernfs: add a revision to identify directory node changes

Add a revision counter to kernfs directory nodes so it can be used
to detect if a directory node has changed during negative dentry
revalidation.

There's an assumption that sizeof(unsigned long) <= sizeof(pointer)
on all architectures and as far as I know that assumption holds.

So adding a revision counter to the struct kernfs_elem_dir variant of
the kernfs_node type union won't increase the size of the kernfs_node
struct. This is because struct kernfs_elem_dir is at least
sizeof(pointer) smaller than the largest union variant. It's tempting
to make the revision counter a u64 but that would increase the size of
kernfs_node on archs where sizeof(pointer) is smaller than the revision
counter.

Signed-off-by: Ian Kent <[email protected]>
---
fs/kernfs/dir.c | 2 ++
fs/kernfs/kernfs-internal.h | 23 +++++++++++++++++++++++
include/linux/kernfs.h | 5 +++++
3 files changed, 30 insertions(+)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 33166ec90a112..b3d1bc0f317d0 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -372,6 +372,7 @@ static int kernfs_link_sibling(struct kernfs_node *kn)
/* successfully added, account subdir number */
if (kernfs_type(kn) == KERNFS_DIR)
kn->parent->dir.subdirs++;
+ kernfs_inc_rev(kn->parent);

return 0;
}
@@ -394,6 +395,7 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)

if (kernfs_type(kn) == KERNFS_DIR)
kn->parent->dir.subdirs--;
+ kernfs_inc_rev(kn->parent);

rb_erase(&kn->rb, &kn->parent->dir.children);
RB_CLEAR_NODE(&kn->rb);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index ccc3b44f6306f..b4e7579e04799 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -81,6 +81,29 @@ static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
return d_inode(dentry)->i_private;
}

+static inline void kernfs_set_rev(struct kernfs_node *kn,
+ struct dentry *dentry)
+{
+ if (kernfs_type(kn) == KERNFS_DIR)
+ dentry->d_time = kn->dir.rev;
+}
+
+static inline void kernfs_inc_rev(struct kernfs_node *kn)
+{
+ if (kernfs_type(kn) == KERNFS_DIR)
+ kn->dir.rev++;
+}
+
+static inline bool kernfs_dir_changed(struct kernfs_node *kn,
+ struct dentry *dentry)
+{
+ if (kernfs_type(kn) == KERNFS_DIR) {
+ if (kn->dir.rev != dentry->d_time)
+ return true;
+ }
+ return false;
+}
+
extern const struct super_operations kernfs_sops;
extern struct kmem_cache *kernfs_node_cache, *kernfs_iattrs_cache;

diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 9e8ca8743c268..d7e0160fce6df 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -98,6 +98,11 @@ struct kernfs_elem_dir {
* better directly in kernfs_node but is here to save space.
*/
struct kernfs_root *root;
+ /*
+ * Monotonic revision counter, used to identify if a directory
+ * node has changed during negative dentry revalidation.
+ */
+ unsigned long rev;
};

struct kernfs_elem_symlink {


2021-06-09 17:15:04

by Ian Kent

[permalink] [raw]
Subject: [PATCH v6 7/7] kernfs: dont call d_splice_alias() under kernfs node lock

The call to d_splice_alias() in kernfs_iop_lookup() doesn't depend on
any kernfs node so there's no reason to hold the kernfs node lock when
calling it.

Signed-off-by: Ian Kent <[email protected]>
---
fs/kernfs/dir.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 195561c08439a..a5820a8c139a2 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1097,7 +1097,6 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
struct dentry *dentry,
unsigned int flags)
{
- struct dentry *ret;
struct kernfs_node *parent = dir->i_private;
struct kernfs_node *kn;
struct inode *inode = NULL;
@@ -1117,11 +1116,10 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
/* Needed only for negative dentry validation */
if (!inode)
kernfs_set_rev(parent, dentry);
- /* instantiate and hash (possibly negative) dentry */
- ret = d_splice_alias(inode, dentry);
up_read(&kernfs_rwsem);

- return ret;
+ /* instantiate and hash (possibly negative) dentry */
+ return d_splice_alias(inode, dentry);
}

static int kernfs_iop_mkdir(struct user_namespace *mnt_userns,


2021-06-09 17:16:14

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH v6 0/7] kernfs: proposed locking and concurrency improvement

Hi all,

I couldn't get a working system with more cpus so I've only
got results for my 16 cpu system.

1) Run with 8 distinct sysfs file paths on 16 cpu machine, perf graphs
in base-files-cpu-16-perf.txt and patched-files-cpu-16-perf.txt.

Base (5.12.2, unpatched)
------------------------
single: total 44.158890ms per 4.415889us
concur: total 759.684873ms per 75.968487us CPU 8
concur: total 883.026965ms per 88.302696us CPU 2
concur: total 891.909766ms per 89.190977us CPU 4
concur: total 897.459477ms per 89.745948us CPU 13
concur: total 906.879726ms per 90.687973us CPU 6
concur: total 909.682410ms per 90.968241us CPU 11
concur: total 913.444779ms per 91.344478us CPU 14
concur: total 913.736983ms per 91.373698us CPU 10
concur: total 918.075936ms per 91.807594us CPU 3
concur: total 919.471392ms per 91.947139us CPU 12
concur: total 919.670827ms per 91.967083us CPU 7
concur: total 920.037084ms per 92.003708us CPU 15
concur: total 920.629499ms per 92.062950us CPU 0
concur: total 920.668293ms per 92.066829us CPU 1
concur: total 920.742206ms per 92.074221us CPU 9
concur: total 920.857279ms per 92.085728us CPU 5
times: 10000 threads: 16 cpus: 16

patched (5.12.2)
----------------
single: total 44.802810ms per 4.480281us
concur: total 207.606721ms per 20.760672us CPU 0
concur: total 208.033458ms per 20.803346us CPU 7
concur: total 208.040271ms per 20.804027us CPU 10
concur: total 208.407953ms per 20.840795us CPU 6
concur: total 208.447839ms per 20.844784us CPU 8
concur: total 208.442189ms per 20.844219us CPU 15
concur: total 208.491044ms per 20.849104us CPU 12
concur: total 209.035376ms per 20.903538us CPU 13
concur: total 209.236987ms per 20.923699us CPU 5
concur: total 209.272867ms per 20.927287us CPU 4
concur: total 209.383602ms per 20.938360us CPU 2
concur: total 209.712447ms per 20.971245us CPU 11
concur: total 210.486235ms per 21.048623us CPU 9
concur: total 213.338587ms per 21.333859us CPU 14
concur: total 214.711534ms per 21.471153us CPU 3
concur: total 215.163929ms per 21.516393us CPU 1
times: 10000 threads: 16 cpus: 16

2) Run with a single sysfs file path on 16 cpu machine, perf graphs in
base-missing-cpu-16-perf.txt and patched-missing-cpu-16-perf.txt.

Base (5.12.2, unpatched)
------------------------
single: total 28.576170ms per 2.857617us
concur: total 793.285548ms per 79.328555us CPU 9
concur: total 806.402939ms per 80.640294us CPU 12
concur: total 827.931022ms per 82.793102us CPU 11
concur: total 839.489379ms per 83.948938us CPU 0
concur: total 841.252407ms per 84.125241us CPU 2
concur: total 843.533057ms per 84.353306us CPU 6
concur: total 844.030949ms per 84.403095us CPU 4
concur: total 845.267140ms per 84.526714us CPU 5
concur: total 850.316442ms per 85.031644us CPU 7
concur: total 851.884324ms per 85.188432us CPU 15
concur: total 861.188703ms per 86.118870us CPU 13
concur: total 865.995968ms per 86.599597us CPU 8
concur: total 868.057703ms per 86.805770us CPU 3
concur: total 868.599150ms per 86.859915us CPU 1
concur: total 866.567102ms per 86.656710us CPU 14
concur: total 871.594502ms per 87.159450us CPU 10
times: 10000 threads: 16 cpus: 16

patched (5.12.2)
----------------
single: total 21.335142ms per 2.133514us
concur: total 223.727132ms per 22.372713us CPU 12
concur: total 224.061497ms per 22.406150us CPU 10
concur: total 227.070350ms per 22.707035us CPU 2
concur: total 229.940510ms per 22.994051us CPU 8
concur: total 230.853379ms per 23.085338us CPU 6
concur: total 230.928784ms per 23.092878us CPU 3
concur: total 231.158093ms per 23.115809us CPU 15
concur: total 229.597176ms per 22.959718us CPU 13
concur: total 231.967832ms per 23.196783us CPU 7
concur: total 232.725323ms per 23.272532us CPU 9
concur: total 233.141065ms per 23.314107us CPU 4
concur: total 234.486371ms per 23.448637us CPU 11
concur: total 234.658243ms per 23.465824us CPU 14
concur: total 235.053843ms per 23.505384us CPU 5
concur: total 236.137191ms per 23.613719us CPU 0
concur: total 240.090098ms per 24.009010us CPU 1
times: 10000 threads: 16 cpus: 16


Attachments:
base-files-cpu-16-perf.txt (150.44 kB)
base-missing-cpu-16-perf.txt (778.15 kB)
patched-files-cpu-16-perf.txt (240.39 kB)
patched-missing-cpu-16-perf.txt (180.63 kB)
Download all attachments

2021-06-11 12:50:46

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v6 1/7] kernfs: move revalidate to be near lookup

On Wed, 9 Jun 2021 at 10:49, Ian Kent <[email protected]> wrote:
>
> While the dentry operation kernfs_dop_revalidate() is grouped with
> dentry type functions it also has a strong affinity to the inode
> operation ->lookup().
>
> It makes sense to locate this function near to kernfs_iop_lookup()
> because we will be adding VFS negative dentry caching to reduce path
> lookup overhead for non-existent paths.
>
> There's no functional change from this patch.
>
> Signed-off-by: Ian Kent <[email protected]>

Reviewed-by: Miklos Szeredi <[email protected]>

2021-06-11 12:52:18

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v6 2/7] kernfs: add a revision to identify directory node changes

On Wed, 9 Jun 2021 at 10:50, Ian Kent <[email protected]> wrote:
>
> Add a revision counter to kernfs directory nodes so it can be used
> to detect if a directory node has changed during negative dentry
> revalidation.
>
> There's an assumption that sizeof(unsigned long) <= sizeof(pointer)
> on all architectures and as far as I know that assumption holds.
>
> So adding a revision counter to the struct kernfs_elem_dir variant of
> the kernfs_node type union won't increase the size of the kernfs_node
> struct. This is because struct kernfs_elem_dir is at least
> sizeof(pointer) smaller than the largest union variant. It's tempting
> to make the revision counter a u64 but that would increase the size of
> kernfs_node on archs where sizeof(pointer) is smaller than the revision
> counter.
>
> Signed-off-by: Ian Kent <[email protected]>
> ---
> fs/kernfs/dir.c | 2 ++
> fs/kernfs/kernfs-internal.h | 23 +++++++++++++++++++++++
> include/linux/kernfs.h | 5 +++++
> 3 files changed, 30 insertions(+)
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 33166ec90a112..b3d1bc0f317d0 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -372,6 +372,7 @@ static int kernfs_link_sibling(struct kernfs_node *kn)
> /* successfully added, account subdir number */
> if (kernfs_type(kn) == KERNFS_DIR)
> kn->parent->dir.subdirs++;
> + kernfs_inc_rev(kn->parent);
>
> return 0;
> }
> @@ -394,6 +395,7 @@ static bool kernfs_unlink_sibling(struct kernfs_node *kn)
>
> if (kernfs_type(kn) == KERNFS_DIR)
> kn->parent->dir.subdirs--;
> + kernfs_inc_rev(kn->parent);
>
> rb_erase(&kn->rb, &kn->parent->dir.children);
> RB_CLEAR_NODE(&kn->rb);
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index ccc3b44f6306f..b4e7579e04799 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -81,6 +81,29 @@ static inline struct kernfs_node *kernfs_dentry_node(struct dentry *dentry)
> return d_inode(dentry)->i_private;
> }
>
> +static inline void kernfs_set_rev(struct kernfs_node *kn,
> + struct dentry *dentry)
> +{
> + if (kernfs_type(kn) == KERNFS_DIR)
> + dentry->d_time = kn->dir.rev;
> +}
> +
> +static inline void kernfs_inc_rev(struct kernfs_node *kn)
> +{
> + if (kernfs_type(kn) == KERNFS_DIR)
> + kn->dir.rev++;
> +}
> +
> +static inline bool kernfs_dir_changed(struct kernfs_node *kn,
> + struct dentry *dentry)
> +{
> + if (kernfs_type(kn) == KERNFS_DIR) {

Aren't these always be called on a KERNFS_DIR node?

You could just reduce that to a WARN_ON, or remove the conditions
altogether then.

Thanks,
Miklos

2021-06-11 12:58:35

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH v6 2/7] kernfs: add a revision to identify directory node changes

On Fri, 2021-06-11 at 14:49 +0200, Miklos Szeredi wrote:
> On Wed, 9 Jun 2021 at 10:50, Ian Kent <[email protected]> wrote:
> >
> > Add a revision counter to kernfs directory nodes so it can be used
> > to detect if a directory node has changed during negative dentry
> > revalidation.
> >
> > There's an assumption that sizeof(unsigned long) <= sizeof(pointer)
> > on all architectures and as far as I know that assumption holds.
> >
> > So adding a revision counter to the struct kernfs_elem_dir variant
> > of
> > the kernfs_node type union won't increase the size of the
> > kernfs_node
> > struct. This is because struct kernfs_elem_dir is at least
> > sizeof(pointer) smaller than the largest union variant. It's
> > tempting
> > to make the revision counter a u64 but that would increase the size
> > of
> > kernfs_node on archs where sizeof(pointer) is smaller than the
> > revision
> > counter.
> >
> > Signed-off-by: Ian Kent <[email protected]>
> > ---
> >  fs/kernfs/dir.c             |    2 ++
> >  fs/kernfs/kernfs-internal.h |   23 +++++++++++++++++++++++
> >  include/linux/kernfs.h      |    5 +++++
> >  3 files changed, 30 insertions(+)
> >
> > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > index 33166ec90a112..b3d1bc0f317d0 100644
> > --- a/fs/kernfs/dir.c
> > +++ b/fs/kernfs/dir.c
> > @@ -372,6 +372,7 @@ static int kernfs_link_sibling(struct
> > kernfs_node *kn)
> >         /* successfully added, account subdir number */
> >         if (kernfs_type(kn) == KERNFS_DIR)
> >                 kn->parent->dir.subdirs++;
> > +       kernfs_inc_rev(kn->parent);
> >
> >         return 0;
> >  }
> > @@ -394,6 +395,7 @@ static bool kernfs_unlink_sibling(struct
> > kernfs_node *kn)
> >
> >         if (kernfs_type(kn) == KERNFS_DIR)
> >                 kn->parent->dir.subdirs--;
> > +       kernfs_inc_rev(kn->parent);
> >
> >         rb_erase(&kn->rb, &kn->parent->dir.children);
> >         RB_CLEAR_NODE(&kn->rb);
> > diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-
> > internal.h
> > index ccc3b44f6306f..b4e7579e04799 100644
> > --- a/fs/kernfs/kernfs-internal.h
> > +++ b/fs/kernfs/kernfs-internal.h
> > @@ -81,6 +81,29 @@ static inline struct kernfs_node
> > *kernfs_dentry_node(struct dentry *dentry)
> >         return d_inode(dentry)->i_private;
> >  }
> >
> > +static inline void kernfs_set_rev(struct kernfs_node *kn,
> > +                                 struct dentry *dentry)
> > +{
> > +       if (kernfs_type(kn) == KERNFS_DIR)
> > +               dentry->d_time = kn->dir.rev;
> > +}
> > +
> > +static inline void kernfs_inc_rev(struct kernfs_node *kn)
> > +{
> > +       if (kernfs_type(kn) == KERNFS_DIR)
> > +               kn->dir.rev++;
> > +}
> > +
> > +static inline bool kernfs_dir_changed(struct kernfs_node *kn,
> > +                                     struct dentry *dentry)
> > +{
> > +       if (kernfs_type(kn) == KERNFS_DIR) {
>
> Aren't these always be called on a KERNFS_DIR node?

Yes they are.

>
> You could just reduce that to a WARN_ON, or remove the conditions
> altogether then.

I was tempted to not use the check, a WARN_ON sounds better than
removing the check, I'll do that in a v7.

Thanks
Ian

2021-06-11 13:13:33

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v6 2/7] kernfs: add a revision to identify directory node changes

On Fri, Jun 11, 2021 at 08:56:18PM +0800, Ian Kent wrote:
> On Fri, 2021-06-11 at 14:49 +0200, Miklos Szeredi wrote:
> > On Wed, 9 Jun 2021 at 10:50, Ian Kent <[email protected]> wrote:
> > >
> > > Add a revision counter to kernfs directory nodes so it can be used
> > > to detect if a directory node has changed during negative dentry
> > > revalidation.
> > >
> > > There's an assumption that sizeof(unsigned long) <= sizeof(pointer)
> > > on all architectures and as far as I know that assumption holds.
> > >
> > > So adding a revision counter to the struct kernfs_elem_dir variant
> > > of
> > > the kernfs_node type union won't increase the size of the
> > > kernfs_node
> > > struct. This is because struct kernfs_elem_dir is at least
> > > sizeof(pointer) smaller than the largest union variant. It's
> > > tempting
> > > to make the revision counter a u64 but that would increase the size
> > > of
> > > kernfs_node on archs where sizeof(pointer) is smaller than the
> > > revision
> > > counter.
> > >
> > > Signed-off-by: Ian Kent <[email protected]>
> > > ---
> > > ?fs/kernfs/dir.c???????????? |??? 2 ++
> > > ?fs/kernfs/kernfs-internal.h |?? 23 +++++++++++++++++++++++
> > > ?include/linux/kernfs.h????? |??? 5 +++++
> > > ?3 files changed, 30 insertions(+)
> > >
> > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > > index 33166ec90a112..b3d1bc0f317d0 100644
> > > --- a/fs/kernfs/dir.c
> > > +++ b/fs/kernfs/dir.c
> > > @@ -372,6 +372,7 @@ static int kernfs_link_sibling(struct
> > > kernfs_node *kn)
> > > ??????? /* successfully added, account subdir number */
> > > ??????? if (kernfs_type(kn) == KERNFS_DIR)
> > > ??????????????? kn->parent->dir.subdirs++;
> > > +?????? kernfs_inc_rev(kn->parent);
> > >
> > > ??????? return 0;
> > > ?}
> > > @@ -394,6 +395,7 @@ static bool kernfs_unlink_sibling(struct
> > > kernfs_node *kn)
> > >
> > > ??????? if (kernfs_type(kn) == KERNFS_DIR)
> > > ??????????????? kn->parent->dir.subdirs--;
> > > +?????? kernfs_inc_rev(kn->parent);
> > >
> > > ??????? rb_erase(&kn->rb, &kn->parent->dir.children);
> > > ??????? RB_CLEAR_NODE(&kn->rb);
> > > diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-
> > > internal.h
> > > index ccc3b44f6306f..b4e7579e04799 100644
> > > --- a/fs/kernfs/kernfs-internal.h
> > > +++ b/fs/kernfs/kernfs-internal.h
> > > @@ -81,6 +81,29 @@ static inline struct kernfs_node
> > > *kernfs_dentry_node(struct dentry *dentry)
> > > ??????? return d_inode(dentry)->i_private;
> > > ?}
> > >
> > > +static inline void kernfs_set_rev(struct kernfs_node *kn,
> > > +???????????????????????????????? struct dentry *dentry)
> > > +{
> > > +?????? if (kernfs_type(kn) == KERNFS_DIR)
> > > +?????????????? dentry->d_time = kn->dir.rev;
> > > +}
> > > +
> > > +static inline void kernfs_inc_rev(struct kernfs_node *kn)
> > > +{
> > > +?????? if (kernfs_type(kn) == KERNFS_DIR)
> > > +?????????????? kn->dir.rev++;
> > > +}
> > > +
> > > +static inline bool kernfs_dir_changed(struct kernfs_node *kn,
> > > +???????????????????????????????????? struct dentry *dentry)
> > > +{
> > > +?????? if (kernfs_type(kn) == KERNFS_DIR) {
> >
> > Aren't these always be called on a KERNFS_DIR node?
>
> Yes they are.
>
> >
> > You could just reduce that to a WARN_ON, or remove the conditions
> > altogether then.
>
> I was tempted to not use the check, a WARN_ON sounds better than
> removing the check, I'll do that in a v7.

No, WARN_ON is not ok, as systems will crash if panic-on-warn is set.

If these are impossible to hit, great, let's not check this and we can
just drop the code. If they can be hit, then the above code is correct
and it should stay.

thanks,

greg k-h

2021-06-11 13:14:15

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v6 5/7] kernfs: use i_lock to protect concurrent inode updates

On Wed, 9 Jun 2021 at 10:52, Ian Kent <[email protected]> wrote:
>
> The inode operations .permission() and .getattr() use the kernfs node
> write lock but all that's needed is to keep the rb tree stable while
> updating the inode attributes as well as protecting the update itself
> against concurrent changes.
>
> And .permission() is called frequently during path walks and can cause
> quite a bit of contention between kernfs node operations and path
> walks when the number of concurrent walks is high.
>
> To change kernfs_iop_getattr() and kernfs_iop_permission() to take
> the rw sem read lock instead of the write lock an additional lock is
> needed to protect against multiple processes concurrently updating
> the inode attributes and link count in kernfs_refresh_inode().
>
> The inode i_lock seems like the sensible thing to use to protect these
> inode attribute updates so use it in kernfs_refresh_inode().
>
> The last hunk in the patch, applied to kernfs_fill_super(), is possibly
> not needed but taking the lock was present originally and I prefer to
> continue to take it so the rb tree is held stable during the call to
> kernfs_refresh_inode() made by kernfs_get_inode().
>
> Signed-off-by: Ian Kent <[email protected]>

Reviewed-by: Miklos Szeredi <[email protected]>

2021-06-11 13:14:34

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v6 4/7] kernfs: switch kernfs to use an rwsem

On Wed, 9 Jun 2021 at 10:51, Ian Kent <[email protected]> wrote:
>
> The kernfs global lock restricts the ability to perform kernfs node
> lookup operations in parallel during path walks.
>
> Change the kernfs mutex to an rwsem so that, when opportunity arises,
> node searches can be done in parallel with path walk lookups.
>
> Signed-off-by: Ian Kent <[email protected]>

Reviewed-by: Miklos Szeredi <[email protected]>

2021-06-11 13:15:40

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v6 6/7] kernfs: add kernfs_need_inode_refresh()

On Wed, 9 Jun 2021 at 10:52, Ian Kent <[email protected]> wrote:
>
> Now the kernfs_rwsem read lock is held for kernfs_refresh_inode() and
> the i_lock taken to protect inode updates there can be some contention
> introduced when .permission() is called with concurrent path walks in
> progress.
>
> Since .permission() is called frequently during path walks it's worth
> checking if the update is actually needed before taking the lock and
> performing the update.
>
> Signed-off-by: Ian Kent <[email protected]>

Reviewed-by: Miklos Szeredi <[email protected]>

2021-06-11 13:17:03

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH v6 7/7] kernfs: dont call d_splice_alias() under kernfs node lock

On Wed, 9 Jun 2021 at 10:52, Ian Kent <[email protected]> wrote:
>
> The call to d_splice_alias() in kernfs_iop_lookup() doesn't depend on
> any kernfs node so there's no reason to hold the kernfs node lock when
> calling it.
>
> Signed-off-by: Ian Kent <[email protected]>

Reviewed-by: Miklos Szeredi <[email protected]>

2021-06-11 13:35:40

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH v6 2/7] kernfs: add a revision to identify directory node changes

On Fri, 2021-06-11 at 15:11 +0200, Greg Kroah-Hartman wrote:
> On Fri, Jun 11, 2021 at 08:56:18PM +0800, Ian Kent wrote:
> > On Fri, 2021-06-11 at 14:49 +0200, Miklos Szeredi wrote:
> > > On Wed, 9 Jun 2021 at 10:50, Ian Kent <[email protected]> wrote:
> > > >
> > > > Add a revision counter to kernfs directory nodes so it can be
> > > > used
> > > > to detect if a directory node has changed during negative
> > > > dentry
> > > > revalidation.
> > > >
> > > > There's an assumption that sizeof(unsigned long) <=
> > > > sizeof(pointer)
> > > > on all architectures and as far as I know that assumption
> > > > holds.
> > > >
> > > > So adding a revision counter to the struct kernfs_elem_dir
> > > > variant
> > > > of
> > > > the kernfs_node type union won't increase the size of the
> > > > kernfs_node
> > > > struct. This is because struct kernfs_elem_dir is at least
> > > > sizeof(pointer) smaller than the largest union variant. It's
> > > > tempting
> > > > to make the revision counter a u64 but that would increase the
> > > > size
> > > > of
> > > > kernfs_node on archs where sizeof(pointer) is smaller than the
> > > > revision
> > > > counter.
> > > >
> > > > Signed-off-by: Ian Kent <[email protected]>
> > > > ---
> > > >  fs/kernfs/dir.c             |    2 ++
> > > >  fs/kernfs/kernfs-internal.h |   23 +++++++++++++++++++++++
> > > >  include/linux/kernfs.h      |    5 +++++
> > > >  3 files changed, 30 insertions(+)
> > > >
> > > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > > > index 33166ec90a112..b3d1bc0f317d0 100644
> > > > --- a/fs/kernfs/dir.c
> > > > +++ b/fs/kernfs/dir.c
> > > > @@ -372,6 +372,7 @@ static int kernfs_link_sibling(struct
> > > > kernfs_node *kn)
> > > >         /* successfully added, account subdir number */
> > > >         if (kernfs_type(kn) == KERNFS_DIR)
> > > >                 kn->parent->dir.subdirs++;
> > > > +       kernfs_inc_rev(kn->parent);
> > > >
> > > >         return 0;
> > > >  }
> > > > @@ -394,6 +395,7 @@ static bool kernfs_unlink_sibling(struct
> > > > kernfs_node *kn)
> > > >
> > > >         if (kernfs_type(kn) == KERNFS_DIR)
> > > >                 kn->parent->dir.subdirs--;
> > > > +       kernfs_inc_rev(kn->parent);
> > > >
> > > >         rb_erase(&kn->rb, &kn->parent->dir.children);
> > > >         RB_CLEAR_NODE(&kn->rb);
> > > > diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-
> > > > internal.h
> > > > index ccc3b44f6306f..b4e7579e04799 100644
> > > > --- a/fs/kernfs/kernfs-internal.h
> > > > +++ b/fs/kernfs/kernfs-internal.h
> > > > @@ -81,6 +81,29 @@ static inline struct kernfs_node
> > > > *kernfs_dentry_node(struct dentry *dentry)
> > > >         return d_inode(dentry)->i_private;
> > > >  }
> > > >
> > > > +static inline void kernfs_set_rev(struct kernfs_node *kn,
> > > > +                                 struct dentry *dentry)
> > > > +{
> > > > +       if (kernfs_type(kn) == KERNFS_DIR)
> > > > +               dentry->d_time = kn->dir.rev;
> > > > +}
> > > > +
> > > > +static inline void kernfs_inc_rev(struct kernfs_node *kn)
> > > > +{
> > > > +       if (kernfs_type(kn) == KERNFS_DIR)
> > > > +               kn->dir.rev++;
> > > > +}
> > > > +
> > > > +static inline bool kernfs_dir_changed(struct kernfs_node *kn,
> > > > +                                     struct dentry *dentry)
> > > > +{
> > > > +       if (kernfs_type(kn) == KERNFS_DIR) {
> > >
> > > Aren't these always be called on a KERNFS_DIR node?
> >
> > Yes they are.
> >
> > >
> > > You could just reduce that to a WARN_ON, or remove the conditions
> > > altogether then.
> >
> > I was tempted to not use the check, a WARN_ON sounds better than
> > removing the check, I'll do that in a v7.
>
> No, WARN_ON is not ok, as systems will crash if panic-on-warn is set.

Thanks Greg, understood.

>
> If these are impossible to hit, great, let's not check this and we
> can
> just drop the code.  If they can be hit, then the above code is
> correct
> and it should stay.

It's a programming mistake to call these on a non-directory node.

I can remove the check but do you think there's any value in passing
the node and updating it's parent to avoid possible misuse?

Ian

2021-06-11 14:06:52

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH v6 2/7] kernfs: add a revision to identify directory node changes

On Fri, Jun 11, 2021 at 09:31:36PM +0800, Ian Kent wrote:
> On Fri, 2021-06-11 at 15:11 +0200, Greg Kroah-Hartman wrote:
> > On Fri, Jun 11, 2021 at 08:56:18PM +0800, Ian Kent wrote:
> > > On Fri, 2021-06-11 at 14:49 +0200, Miklos Szeredi wrote:
> > > > On Wed, 9 Jun 2021 at 10:50, Ian Kent <[email protected]> wrote:
> > > > >
> > > > > Add a revision counter to kernfs directory nodes so it can be
> > > > > used
> > > > > to detect if a directory node has changed during negative
> > > > > dentry
> > > > > revalidation.
> > > > >
> > > > > There's an assumption that sizeof(unsigned long) <=
> > > > > sizeof(pointer)
> > > > > on all architectures and as far as I know that assumption
> > > > > holds.
> > > > >
> > > > > So adding a revision counter to the struct kernfs_elem_dir
> > > > > variant
> > > > > of
> > > > > the kernfs_node type union won't increase the size of the
> > > > > kernfs_node
> > > > > struct. This is because struct kernfs_elem_dir is at least
> > > > > sizeof(pointer) smaller than the largest union variant. It's
> > > > > tempting
> > > > > to make the revision counter a u64 but that would increase the
> > > > > size
> > > > > of
> > > > > kernfs_node on archs where sizeof(pointer) is smaller than the
> > > > > revision
> > > > > counter.
> > > > >
> > > > > Signed-off-by: Ian Kent <[email protected]>
> > > > > ---
> > > > > ?fs/kernfs/dir.c???????????? |??? 2 ++
> > > > > ?fs/kernfs/kernfs-internal.h |?? 23 +++++++++++++++++++++++
> > > > > ?include/linux/kernfs.h????? |??? 5 +++++
> > > > > ?3 files changed, 30 insertions(+)
> > > > >
> > > > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > > > > index 33166ec90a112..b3d1bc0f317d0 100644
> > > > > --- a/fs/kernfs/dir.c
> > > > > +++ b/fs/kernfs/dir.c
> > > > > @@ -372,6 +372,7 @@ static int kernfs_link_sibling(struct
> > > > > kernfs_node *kn)
> > > > > ??????? /* successfully added, account subdir number */
> > > > > ??????? if (kernfs_type(kn) == KERNFS_DIR)
> > > > > ??????????????? kn->parent->dir.subdirs++;
> > > > > +?????? kernfs_inc_rev(kn->parent);
> > > > >
> > > > > ??????? return 0;
> > > > > ?}
> > > > > @@ -394,6 +395,7 @@ static bool kernfs_unlink_sibling(struct
> > > > > kernfs_node *kn)
> > > > >
> > > > > ??????? if (kernfs_type(kn) == KERNFS_DIR)
> > > > > ??????????????? kn->parent->dir.subdirs--;
> > > > > +?????? kernfs_inc_rev(kn->parent);
> > > > >
> > > > > ??????? rb_erase(&kn->rb, &kn->parent->dir.children);
> > > > > ??????? RB_CLEAR_NODE(&kn->rb);
> > > > > diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-
> > > > > internal.h
> > > > > index ccc3b44f6306f..b4e7579e04799 100644
> > > > > --- a/fs/kernfs/kernfs-internal.h
> > > > > +++ b/fs/kernfs/kernfs-internal.h
> > > > > @@ -81,6 +81,29 @@ static inline struct kernfs_node
> > > > > *kernfs_dentry_node(struct dentry *dentry)
> > > > > ??????? return d_inode(dentry)->i_private;
> > > > > ?}
> > > > >
> > > > > +static inline void kernfs_set_rev(struct kernfs_node *kn,
> > > > > +???????????????????????????????? struct dentry *dentry)
> > > > > +{
> > > > > +?????? if (kernfs_type(kn) == KERNFS_DIR)
> > > > > +?????????????? dentry->d_time = kn->dir.rev;
> > > > > +}
> > > > > +
> > > > > +static inline void kernfs_inc_rev(struct kernfs_node *kn)
> > > > > +{
> > > > > +?????? if (kernfs_type(kn) == KERNFS_DIR)
> > > > > +?????????????? kn->dir.rev++;
> > > > > +}
> > > > > +
> > > > > +static inline bool kernfs_dir_changed(struct kernfs_node *kn,
> > > > > +???????????????????????????????????? struct dentry *dentry)
> > > > > +{
> > > > > +?????? if (kernfs_type(kn) == KERNFS_DIR) {
> > > >
> > > > Aren't these always be called on a KERNFS_DIR node?
> > >
> > > Yes they are.
> > >
> > > >
> > > > You could just reduce that to a WARN_ON, or remove the conditions
> > > > altogether then.
> > >
> > > I was tempted to not use the check, a WARN_ON sounds better than
> > > removing the check, I'll do that in a v7.
> >
> > No, WARN_ON is not ok, as systems will crash if panic-on-warn is set.
>
> Thanks Greg, understood.
>
> >
> > If these are impossible to hit, great, let's not check this and we
> > can
> > just drop the code.? If they can be hit, then the above code is
> > correct
> > and it should stay.
>
> It's a programming mistake to call these on a non-directory node.
>
> I can remove the check but do you think there's any value in passing
> the node and updating it's parent to avoid possible misuse?

I do not understand the question here, sorry. It's a static function,
you control the callers, who can "misuse" it?

thanks,

greg k-h

2021-06-11 14:20:28

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH v6 2/7] kernfs: add a revision to identify directory node changes

On Fri, 2021-06-11 at 16:05 +0200, Greg Kroah-Hartman wrote:
> On Fri, Jun 11, 2021 at 09:31:36PM +0800, Ian Kent wrote:
> > On Fri, 2021-06-11 at 15:11 +0200, Greg Kroah-Hartman wrote:
> > > On Fri, Jun 11, 2021 at 08:56:18PM +0800, Ian Kent wrote:
> > > > On Fri, 2021-06-11 at 14:49 +0200, Miklos Szeredi wrote:
> > > > > On Wed, 9 Jun 2021 at 10:50, Ian Kent <[email protected]>
> > > > > wrote:
> > > > > >
> > > > > > Add a revision counter to kernfs directory nodes so it can
> > > > > > be
> > > > > > used
> > > > > > to detect if a directory node has changed during negative
> > > > > > dentry
> > > > > > revalidation.
> > > > > >
> > > > > > There's an assumption that sizeof(unsigned long) <=
> > > > > > sizeof(pointer)
> > > > > > on all architectures and as far as I know that assumption
> > > > > > holds.
> > > > > >
> > > > > > So adding a revision counter to the struct kernfs_elem_dir
> > > > > > variant
> > > > > > of
> > > > > > the kernfs_node type union won't increase the size of the
> > > > > > kernfs_node
> > > > > > struct. This is because struct kernfs_elem_dir is at least
> > > > > > sizeof(pointer) smaller than the largest union variant.
> > > > > > It's
> > > > > > tempting
> > > > > > to make the revision counter a u64 but that would increase
> > > > > > the
> > > > > > size
> > > > > > of
> > > > > > kernfs_node on archs where sizeof(pointer) is smaller than
> > > > > > the
> > > > > > revision
> > > > > > counter.
> > > > > >
> > > > > > Signed-off-by: Ian Kent <[email protected]>
> > > > > > ---
> > > > > >  fs/kernfs/dir.c             |    2 ++
> > > > > >  fs/kernfs/kernfs-internal.h |   23 +++++++++++++++++++++++
> > > > > >  include/linux/kernfs.h      |    5 +++++
> > > > > >  3 files changed, 30 insertions(+)
> > > > > >
> > > > > > diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> > > > > > index 33166ec90a112..b3d1bc0f317d0 100644
> > > > > > --- a/fs/kernfs/dir.c
> > > > > > +++ b/fs/kernfs/dir.c
> > > > > > @@ -372,6 +372,7 @@ static int kernfs_link_sibling(struct
> > > > > > kernfs_node *kn)
> > > > > >         /* successfully added, account subdir number */
> > > > > >         if (kernfs_type(kn) == KERNFS_DIR)
> > > > > >                 kn->parent->dir.subdirs++;
> > > > > > +       kernfs_inc_rev(kn->parent);
> > > > > >
> > > > > >         return 0;
> > > > > >  }
> > > > > > @@ -394,6 +395,7 @@ static bool
> > > > > > kernfs_unlink_sibling(struct
> > > > > > kernfs_node *kn)
> > > > > >
> > > > > >         if (kernfs_type(kn) == KERNFS_DIR)
> > > > > >                 kn->parent->dir.subdirs--;
> > > > > > +       kernfs_inc_rev(kn->parent);
> > > > > >
> > > > > >         rb_erase(&kn->rb, &kn->parent->dir.children);
> > > > > >         RB_CLEAR_NODE(&kn->rb);
> > > > > > diff --git a/fs/kernfs/kernfs-internal.h
> > > > > > b/fs/kernfs/kernfs-
> > > > > > internal.h
> > > > > > index ccc3b44f6306f..b4e7579e04799 100644
> > > > > > --- a/fs/kernfs/kernfs-internal.h
> > > > > > +++ b/fs/kernfs/kernfs-internal.h
> > > > > > @@ -81,6 +81,29 @@ static inline struct kernfs_node
> > > > > > *kernfs_dentry_node(struct dentry *dentry)
> > > > > >         return d_inode(dentry)->i_private;
> > > > > >  }
> > > > > >
> > > > > > +static inline void kernfs_set_rev(struct kernfs_node *kn,
> > > > > > +                                 struct dentry *dentry)
> > > > > > +{
> > > > > > +       if (kernfs_type(kn) == KERNFS_DIR)
> > > > > > +               dentry->d_time = kn->dir.rev;
> > > > > > +}
> > > > > > +
> > > > > > +static inline void kernfs_inc_rev(struct kernfs_node *kn)
> > > > > > +{
> > > > > > +       if (kernfs_type(kn) == KERNFS_DIR)
> > > > > > +               kn->dir.rev++;
> > > > > > +}
> > > > > > +
> > > > > > +static inline bool kernfs_dir_changed(struct kernfs_node
> > > > > > *kn,
> > > > > > +                                     struct dentry
> > > > > > *dentry)
> > > > > > +{
> > > > > > +       if (kernfs_type(kn) == KERNFS_DIR) {
> > > > >
> > > > > Aren't these always be called on a KERNFS_DIR node?
> > > >
> > > > Yes they are.
> > > >
> > > > >
> > > > > You could just reduce that to a WARN_ON, or remove the
> > > > > conditions
> > > > > altogether then.
> > > >
> > > > I was tempted to not use the check, a WARN_ON sounds better
> > > > than
> > > > removing the check, I'll do that in a v7.
> > >
> > > No, WARN_ON is not ok, as systems will crash if panic-on-warn is
> > > set.
> >
> > Thanks Greg, understood.
> >
> > >
> > > If these are impossible to hit, great, let's not check this and
> > > we
> > > can
> > > just drop the code.  If they can be hit, then the above code is
> > > correct
> > > and it should stay.
> >
> > It's a programming mistake to call these on a non-directory node.
> >
> > I can remove the check but do you think there's any value in
> > passing
> > the node and updating it's parent to avoid possible misuse?
>
> I do not understand the question here, sorry.  It's a static
> function,
> you control the callers, who can "misuse" it?

Yes, I'll drop the test and name the argument parent to make it
clear for readers.


Ian

2021-06-12 01:26:46

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v6 4/7] kernfs: switch kernfs to use an rwsem

On Wed, Jun 09, 2021 at 04:50:52PM +0800, Ian Kent wrote:
> The kernfs global lock restricts the ability to perform kernfs node
> lookup operations in parallel during path walks.
>
> Change the kernfs mutex to an rwsem so that, when opportunity arises,
> node searches can be done in parallel with path walk lookups.

> diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
> index 5432883d819f2..c8f8e41b84110 100644
> --- a/fs/kernfs/symlink.c
> +++ b/fs/kernfs/symlink.c
> @@ -116,9 +116,9 @@ static int kernfs_getlink(struct inode *inode, char *path)
> struct kernfs_node *target = kn->symlink.target_kn;
> int error;
>
> - mutex_lock(&kernfs_mutex);
> + down_read(&kernfs_rwsem);
> error = kernfs_get_target_path(parent, target, path);
> - mutex_unlock(&kernfs_mutex);
> + up_read(&kernfs_rwsem);

Unrelated to this patchset, two notes from reading through that area:
1) parent is fetched outside of rwsem. Unstable, IOW.
2) kernfs_get_target_path() is an atrocity. On *any* symlink you
get an arseload of ../ (up to kernfs root), followed by into whatever
directory we want. Even if the target is in the same directory.
Think what happens if you mount --bind a subtree that contains both the
symlink and its destination. And try to follow that symlink.
It really ought to generate the minimal relative pathname.
And it's not hard to do:
calculate the depth of source
calculate the depth of destination
walk up from the deeper one until we get to the depth of the
shallower one.
walk up from both in tandem until two paths converge.
Now we have the LCA of those nodes and can use the to generate the relative
pathname.

2021-06-12 01:48:50

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v6 5/7] kernfs: use i_lock to protect concurrent inode updates

On Wed, Jun 09, 2021 at 04:51:22PM +0800, Ian Kent wrote:
> The inode operations .permission() and .getattr() use the kernfs node
> write lock but all that's needed is to keep the rb tree stable while
> updating the inode attributes as well as protecting the update itself
> against concurrent changes.

Huh? Where does it access the rbtree at all? Confused...

> diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> index 3b01e9e61f14e..6728ecd81eb37 100644
> --- a/fs/kernfs/inode.c
> +++ b/fs/kernfs/inode.c
> @@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
> {
> struct kernfs_iattrs *attrs = kn->iattr;
>
> + spin_lock(&inode->i_lock);
> inode->i_mode = kn->mode;
> if (attrs)
> /*
> @@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct kernfs_node *kn, struct inode *inode)
>
> if (kernfs_type(kn) == KERNFS_DIR)
> set_nlink(inode, kn->dir.subdirs + 2);
> + spin_unlock(&inode->i_lock);
> }

Even more so - just what are you serializing here? That code synchronizes inode
metadata with those in kernfs_node. Suppose you've got two threads doing
->permission(); the first one gets through kernfs_refresh_inode() and goes into
generic_permission(). No locks are held, so kernfs_refresh_inode() from another
thread can run in parallel with generic_permission().

If that's not a problem, why two kernfs_refresh_inode() done in parallel would
be a problem?

Thread 1:
permission
done refresh, all locks released now
Thread 2:
change metadata in kernfs_node
Thread 2:
permission
goes into refresh, copying metadata into inode
Thread 1:
generic_permission()
No locks in common between the last two operations, so
we generic_permission() might see partially updated metadata.
Either we don't give a fuck (in which case I don't understand
what purpose does that ->i_lock serve) *or* we need the exclusion
to cover a wider area.

2021-06-13 01:34:58

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH v6 5/7] kernfs: use i_lock to protect concurrent inode updates

On Sat, 2021-06-12 at 01:45 +0000, Al Viro wrote:
> On Wed, Jun 09, 2021 at 04:51:22PM +0800, Ian Kent wrote:
> > The inode operations .permission() and .getattr() use the kernfs
> > node
> > write lock but all that's needed is to keep the rb tree stable
> > while
> > updating the inode attributes as well as protecting the update
> > itself
> > against concurrent changes.
>
> Huh?  Where does it access the rbtree at all?  Confused...

That description's wrong, I'll fix that.

>
> > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> > index 3b01e9e61f14e..6728ecd81eb37 100644
> > --- a/fs/kernfs/inode.c
> > +++ b/fs/kernfs/inode.c
> > @@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct
> > kernfs_node *kn, struct inode *inode)
> >  {
> >         struct kernfs_iattrs *attrs = kn->iattr;
> >  
> > +       spin_lock(&inode->i_lock);
> >         inode->i_mode = kn->mode;
> >         if (attrs)
> >                 /*
> > @@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct
> > kernfs_node *kn, struct inode *inode)
> >  
> >         if (kernfs_type(kn) == KERNFS_DIR)
> >                 set_nlink(inode, kn->dir.subdirs + 2);
> > +       spin_unlock(&inode->i_lock);
> >  }
>
> Even more so - just what are you serializing here?  That code
> synchronizes inode
> metadata with those in kernfs_node.  Suppose you've got two threads
> doing
> ->permission(); the first one gets through kernfs_refresh_inode() and
> goes into
> generic_permission().  No locks are held, so kernfs_refresh_inode()
> from another
> thread can run in parallel with generic_permission().
>
> If that's not a problem, why two kernfs_refresh_inode() done in
> parallel would
> be a problem?
>
> Thread 1:
>         permission
>                 done refresh, all locks released now
> Thread 2:
>         change metadata in kernfs_node
> Thread 2:
>         permission
>                 goes into refresh, copying metadata into inode
> Thread 1:
>                 generic_permission()
> No locks in common between the last two operations, so
> we generic_permission() might see partially updated metadata.
> Either we don't give a fuck (in which case I don't understand
> what purpose does that ->i_lock serve) *or* we need the exclusion
> to cover a wider area.


2021-06-14 01:36:30

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH v6 5/7] kernfs: use i_lock to protect concurrent inode updates

On Sat, 2021-06-12 at 01:45 +0000, Al Viro wrote:
> On Wed, Jun 09, 2021 at 04:51:22PM +0800, Ian Kent wrote:
> > The inode operations .permission() and .getattr() use the kernfs
> > node
> > write lock but all that's needed is to keep the rb tree stable
> > while
> > updating the inode attributes as well as protecting the update
> > itself
> > against concurrent changes.
>
> Huh?  Where does it access the rbtree at all?  Confused...
>
> > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> > index 3b01e9e61f14e..6728ecd81eb37 100644
> > --- a/fs/kernfs/inode.c
> > +++ b/fs/kernfs/inode.c
> > @@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct
> > kernfs_node *kn, struct inode *inode)
> >  {
> >         struct kernfs_iattrs *attrs = kn->iattr;
> >  
> > +       spin_lock(&inode->i_lock);
> >         inode->i_mode = kn->mode;
> >         if (attrs)
> >                 /*
> > @@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct
> > kernfs_node *kn, struct inode *inode)
> >  
> >         if (kernfs_type(kn) == KERNFS_DIR)
> >                 set_nlink(inode, kn->dir.subdirs + 2);
> > +       spin_unlock(&inode->i_lock);
> >  }
>
> Even more so - just what are you serializing here?  That code
> synchronizes inode
> metadata with those in kernfs_node.  Suppose you've got two threads
> doing
> ->permission(); the first one gets through kernfs_refresh_inode() and
> goes into
> generic_permission().  No locks are held, so kernfs_refresh_inode()
> from another
> thread can run in parallel with generic_permission().
>
> If that's not a problem, why two kernfs_refresh_inode() done in
> parallel would
> be a problem?
>
> Thread 1:
>         permission
>                 done refresh, all locks released now
> Thread 2:
>         change metadata in kernfs_node
> Thread 2:
>         permission
>                 goes into refresh, copying metadata into inode
> Thread 1:
>                 generic_permission()
> No locks in common between the last two operations, so
> we generic_permission() might see partially updated metadata.
> Either we don't give a fuck (in which case I don't understand
> what purpose does that ->i_lock serve) *or* we need the exclusion
> to cover a wider area.

This didn't occur to me, obviously.

It seems to me this can happen with the original code too although
using a mutex might reduce the likelihood of it happening.

Still ->permission() is meant to be a read-only function so the VFS
shouldn't need to care about it.

Do you have any suggestions on how to handle this.
Perhaps the only way is to ensure the inode is updated only in
functions that are expected to do this.

Ian

2021-06-14 06:56:52

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH v6 5/7] kernfs: use i_lock to protect concurrent inode updates

On Mon, 2021-06-14 at 09:32 +0800, Ian Kent wrote:
> On Sat, 2021-06-12 at 01:45 +0000, Al Viro wrote:
> > On Wed, Jun 09, 2021 at 04:51:22PM +0800, Ian Kent wrote:
> > > The inode operations .permission() and .getattr() use the kernfs
> > > node
> > > write lock but all that's needed is to keep the rb tree stable
> > > while
> > > updating the inode attributes as well as protecting the update
> > > itself
> > > against concurrent changes.
> >
> > Huh?  Where does it access the rbtree at all?  Confused...
> >
> > > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> > > index 3b01e9e61f14e..6728ecd81eb37 100644
> > > --- a/fs/kernfs/inode.c
> > > +++ b/fs/kernfs/inode.c
> > > @@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct
> > > kernfs_node *kn, struct inode *inode)
> > >  {
> > >         struct kernfs_iattrs *attrs = kn->iattr;
> > >  
> > > +       spin_lock(&inode->i_lock);
> > >         inode->i_mode = kn->mode;
> > >         if (attrs)
> > >                 /*
> > > @@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct
> > > kernfs_node *kn, struct inode *inode)
> > >  
> > >         if (kernfs_type(kn) == KERNFS_DIR)
> > >                 set_nlink(inode, kn->dir.subdirs + 2);
> > > +       spin_unlock(&inode->i_lock);
> > >  }
> >
> > Even more so - just what are you serializing here?  That code
> > synchronizes inode
> > metadata with those in kernfs_node.  Suppose you've got two threads
> > doing
> > ->permission(); the first one gets through kernfs_refresh_inode()
> > and
> > goes into
> > generic_permission().  No locks are held, so kernfs_refresh_inode()
> > from another
> > thread can run in parallel with generic_permission().
> >
> > If that's not a problem, why two kernfs_refresh_inode() done in
> > parallel would
> > be a problem?
> >
> > Thread 1:
> >         permission
> >                 done refresh, all locks released now
> > Thread 2:
> >         change metadata in kernfs_node
> > Thread 2:
> >         permission
> >                 goes into refresh, copying metadata into inode
> > Thread 1:
> >                 generic_permission()
> > No locks in common between the last two operations, so
> > we generic_permission() might see partially updated metadata.
> > Either we don't give a fuck (in which case I don't understand
> > what purpose does that ->i_lock serve) *or* we need the exclusion
> > to cover a wider area.
>
> This didn't occur to me, obviously.
>
> It seems to me this can happen with the original code too although
> using a mutex might reduce the likelihood of it happening.
>
> Still ->permission() is meant to be a read-only function so the VFS
> shouldn't need to care about it.
>
> Do you have any suggestions on how to handle this.
> Perhaps the only way is to ensure the inode is updated only in
> functions that are expected to do this.

IIRC Greg and Tejun weren't averse to adding a field to the
struct kernfs_iattrs, but there were concerns about increasing
memory usage.

Because of this I think the best way to handle this would be to
broaden the scope of the i_lock to cover the generic calls in
kernfs_iop_getattr() and kernfs_iop_permission(). The only other
call to kernfs_refresh_inode() is at inode initialization and
then only for I_NEW inodes so that should be ok. Also both
generic_permission() and generic_fillattr() are reading from the
inode so not likely to need to take the i_lock any time soon (is
this a reasonable assumption Al?).

Do you think this is a sensible way to go Al?

Ian

2021-06-14 07:20:22

by Ian Kent

[permalink] [raw]
Subject: Re: [PATCH v6 5/7] kernfs: use i_lock to protect concurrent inode updates

On Mon, 2021-06-14 at 14:52 +0800, Ian Kent wrote:
> On Mon, 2021-06-14 at 09:32 +0800, Ian Kent wrote:
> > On Sat, 2021-06-12 at 01:45 +0000, Al Viro wrote:
> > > On Wed, Jun 09, 2021 at 04:51:22PM +0800, Ian Kent wrote:
> > > > The inode operations .permission() and .getattr() use the
> > > > kernfs
> > > > node
> > > > write lock but all that's needed is to keep the rb tree stable
> > > > while
> > > > updating the inode attributes as well as protecting the update
> > > > itself
> > > > against concurrent changes.
> > >
> > > Huh?  Where does it access the rbtree at all?  Confused...
> > >
> > > > diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> > > > index 3b01e9e61f14e..6728ecd81eb37 100644
> > > > --- a/fs/kernfs/inode.c
> > > > +++ b/fs/kernfs/inode.c
> > > > @@ -172,6 +172,7 @@ static void kernfs_refresh_inode(struct
> > > > kernfs_node *kn, struct inode *inode)
> > > >  {
> > > >         struct kernfs_iattrs *attrs = kn->iattr;
> > > >  
> > > > +       spin_lock(&inode->i_lock);
> > > >         inode->i_mode = kn->mode;
> > > >         if (attrs)
> > > >                 /*
> > > > @@ -182,6 +183,7 @@ static void kernfs_refresh_inode(struct
> > > > kernfs_node *kn, struct inode *inode)
> > > >  
> > > >         if (kernfs_type(kn) == KERNFS_DIR)
> > > >                 set_nlink(inode, kn->dir.subdirs + 2);
> > > > +       spin_unlock(&inode->i_lock);
> > > >  }
> > >
> > > Even more so - just what are you serializing here?  That code
> > > synchronizes inode
> > > metadata with those in kernfs_node.  Suppose you've got two
> > > threads
> > > doing
> > > ->permission(); the first one gets through kernfs_refresh_inode()
> > > and
> > > goes into
> > > generic_permission().  No locks are held, so
> > > kernfs_refresh_inode()
> > > from another
> > > thread can run in parallel with generic_permission().
> > >
> > > If that's not a problem, why two kernfs_refresh_inode() done in
> > > parallel would
> > > be a problem?
> > >
> > > Thread 1:
> > >         permission
> > >                 done refresh, all locks released now
> > > Thread 2:
> > >         change metadata in kernfs_node
> > > Thread 2:
> > >         permission
> > >                 goes into refresh, copying metadata into inode
> > > Thread 1:
> > >                 generic_permission()
> > > No locks in common between the last two operations, so
> > > we generic_permission() might see partially updated metadata.
> > > Either we don't give a fuck (in which case I don't understand
> > > what purpose does that ->i_lock serve) *or* we need the exclusion
> > > to cover a wider area.
> >
> > This didn't occur to me, obviously.
> >
> > It seems to me this can happen with the original code too although
> > using a mutex might reduce the likelihood of it happening.
> >
> > Still ->permission() is meant to be a read-only function so the VFS
> > shouldn't need to care about it.
> >
> > Do you have any suggestions on how to handle this.
> > Perhaps the only way is to ensure the inode is updated only in
> > functions that are expected to do this.
>
> IIRC Greg and Tejun weren't averse to adding a field to the
> struct kernfs_iattrs, but there were concerns about increasing
> memory usage.
>
> Because of this I think the best way to handle this would be to
> broaden the scope of the i_lock to cover the generic calls in
> kernfs_iop_getattr() and kernfs_iop_permission(). The only other
> call to kernfs_refresh_inode() is at inode initialization and
> then only for I_NEW inodes so that should be ok. Also both
> generic_permission() and generic_fillattr() are reading from the
> inode so not likely to need to take the i_lock any time soon (is
> this a reasonable assumption Al?).
>
> Do you think this is a sensible way to go Al?

Unless of course we don't care about taking a lock here at all,
Greg, Tejun?


Ian