2022-03-17 08:29:16

by Imran Khan

Subject: [RESEND PATCH v7 0/8] kernfs: Introduce interface to access global kernfs_open_file_mutex.

Reduce contention around global locks used in kernfs.

The current kernfs design uses three global locks to synchronize various
operations. There are a few other global locks as well, but those are used
for specific cases and hence don't cause severe contention.

The three main global locks used in kernfs are the following (a rough
sketch of their declarations follows the list):

1. A global mutex, kernfs_open_file_mutex, to protect the list of
kernfs_open_file instances corresponding to a sysfs attribute.

2. A global spinlock, kernfs_open_node_lock, to protect
kernfs_node->attr.open, which points to the kernfs_open_node instance
corresponding to a kernfs_node.

3. A global per-fs rw_semaphore, root->kernfs_rwsem, to synchronize most
of the other operations, such as adding, removing or renaming a file
or directory.
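
For reference, this is roughly how the three locks appear in the unpatched
tree (a simplified sketch; only the relevant declarations are shown):

	/* fs/kernfs/file.c */
	static DEFINE_SPINLOCK(kernfs_open_node_lock);	/* protects kn->attr.open */
	static DEFINE_MUTEX(kernfs_open_file_mutex);	/* protects kernfs_open_node->files */

	/* include/linux/kernfs.h */
	struct kernfs_root {
		/* ... */
		struct rw_semaphore kernfs_rwsem;	/* add/remove/rename etc. */
		/* ... */
	};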

Since these locks are global, they can cause contention when multiple
(for example, a few hundred) applications try to access sysfs (or another
kernfs-based file system) in parallel, even if the applications are
accessing different and unrelated files.

For example, on a system with 384 cores, if I run 200 instances of an
application that mostly executes the following loop:

for (int loop = 0; loop < 100; loop++)
{
	for (int port_num = 1; port_num < 2; port_num++)
	{
		for (int gid_index = 0; gid_index < 254; gid_index++)
		{
			char ret_buf[64];
			char gid_file_path[1024];

			int ret_fd;
			ssize_t ret_rd;
			unsigned int saved_errno;

			memset(ret_buf, 0, sizeof(ret_buf));
			memset(gid_file_path, 0, sizeof(gid_file_path));

			snprintf(gid_file_path, sizeof(gid_file_path),
				 "/sys/class/infiniband/%s/ports/%d/gids/%d",
				 dev_name,
				 port_num,
				 gid_index);

			ret_fd = open(gid_file_path, O_RDONLY | O_CLOEXEC);
			if (ret_fd < 0)
			{
				printf("Failed to open %s\n", gid_file_path);
				continue;
			}

			/* Read the GID */
			ret_rd = read(ret_fd, ret_buf, 40);

			if (ret_rd == -1)
			{
				saved_errno = errno;	/* capture errno right away */
				printf("Failed to read from file %s, errno: %u\n",
				       gid_file_path, saved_errno);

				close(ret_fd);	/* don't leak the fd on the error path */
				continue;
			}

			close(ret_fd);
		}
	}
}

I can see contention around the above mentioned locks as follows:

- 54.07% 53.60% showgids [kernel.kallsyms] [k] osq_lock
- 53.60% __libc_start_main
- 32.29% __GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
path_openat
vfs_open
do_dentry_open
kernfs_fop_open
mutex_lock
- __mutex_lock_slowpath
- 32.23% __mutex_lock.isra.5
osq_lock
- 21.31% __GI___libc_close
entry_SYSCALL_64_after_hwframe
do_syscall_64
exit_to_usermode_loop
task_work_run
____fput
__fput
kernfs_fop_release
kernfs_put_open_node.isra.8
mutex_lock
- __mutex_lock_slowpath
- 21.28% __mutex_lock.isra.5
osq_lock

- 10.49% 10.39% showgids [kernel.kallsyms] [k] down_read
10.39% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 9.72% link_path_walk
- 5.21% inode_permission
- __inode_permission
- 5.21% kernfs_iop_permission
down_read
- 4.08% walk_component
lookup_fast
- d_revalidate.part.24
- 4.08% kernfs_dop_revalidate

- 7.48% 7.41% showgids [kernel.kallsyms] [k] up_read
7.41% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 7.01% link_path_walk
- 4.12% inode_permission
- __inode_permission
- 4.12% kernfs_iop_permission
up_read
- 2.61% walk_component
lookup_fast
- d_revalidate.part.24
- 2.61% kernfs_dop_revalidate

Moreover, this run of 200 application instances takes 32-34 seconds to
complete.

This patch set reduces the above mentioned contention by replacing these
global locks with hashed locks.
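
The hashing itself is simple: a kernfs_node's address is hashed into a
fixed-size array of locks, so unrelated nodes usually map to different
locks. For example, the open_file_mutex lookup introduced in PATCH-2
looks like this:

	static inline struct mutex *kernfs_open_file_mutex_ptr(struct kernfs_node *kn)
	{
		int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);

		return &kernfs_locks->open_file_mutex[idx].lock;
	}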

For example, with the patched kernel on the same test setup, we no longer
see contention around osq_lock (i.e. kernfs_open_file_mutex), and
contention around the per-fs kernfs_rwsem has reduced significantly as
well.
This can be seen in the following perf snippet:

- 1.66% 1.65% showgids [kernel.kallsyms] [k] down_read
1.65% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 1.62% link_path_walk
- 0.98% inode_permission
- __inode_permission
+ 0.98% kernfs_iop_permission
- 0.52% walk_component
lookup_fast
- d_revalidate.part.24
- 0.52% kernfs_dop_revalidate

- 1.12% 1.11% showgids [kernel.kallsyms] [k] up_read
1.11% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 1.11% link_path_walk
- 0.69% inode_permission
- __inode_permission
- 0.69% kernfs_iop_permission
up_read

Moreover, the test execution time has reduced from 32-34 seconds to 18-19
seconds.

The patches in this patchset introduce the following changes:

PATCH-1: Introduce interface to access global kernfs_open_file_mutex.

PATCH-2: Replace global kernfs_open_file_mutex with hashed mutexes.

PATCH-3: Introduce interface to access kernfs_open_node_lock.

PATCH-4: Replace global kernfs_open_node_lock with hashed spinlocks.

PATCH-5: Use a per-fs rwsem to protect per-fs list of kernfs_super_info.

PATCH-6: Introduce interface to access per-fs rwsem.

PATCH-7: Replace per-fs rwsem with hashed rwsems.

PATCH-8: Add a document to describe hashed locks used in kernfs.

------------------------------------------------------------------

I did not receive any feedback for v7 of the patchset. Resending
v7 after rebasing on tag next-20220315 of linux-next.

Changes since v6:
- Addressed review comments from Tejun
- Make locking helpers compact and remove unlock_parent from node.

- Addressed review comments from Al Viro
- Make multi node lock helpers non-inline
- Account for change of parent while waiting on semaphores during
rename operation
- Add a document to describe hashed locks introduced in this patch
and to provide proof of correctness

- Fix for some issues that I observed while preparing lock correctness
document
- Lock both parent and target while looking for symlink
- Introduce a per-fs mutex to synchronize lookup and removal of a node
- Avoid locking parent in remove_self, because only the first instance
does actual removal so other invocations of remove_self don't need
to lock the parent

- Remove refcounting from lock helpers
- Refcounting was present in lock helpers for cases where an execution
path acquires a node's rwsem and then deletes the node. For such cases
the release helpers should not try to acquire the semaphore of this
already freed node. Refcounting ensured that the release helpers could
get a still-existing node.
I have modified the locking helpers such that a helper that acquires a
rwsem returns its address, which can later be used by the release
helpers.

Changes since v5:
- Addressed review comments from Greg
- Clean up duplicate code.
- Addressed review comments from Tejun
- Explain how the current values of NR_KERNFS_LOCKS were obtained for
different values of NR_CPUS.
- Use single hash table for locks in place of per-fs hash table
for locks.
- Move introduction of supers_rwsem to a separate patch.
- Separate interface and usage part of hashed rwsem implementation
into 2 patches.
- Use address of rwsems to determine locking order in case of
nested locking. This approach avoids ABBA deadlock possibility.
- Change some #define macros into enum, with proper prefix.
- Taking a cue from Tejun's feedback about separating the hashed rwsem
interface and usage into 2 patches, I have also divided the patch that
introduced the hashed mutexes and spinlocks into separate patches.
- Rebase on linux-next tag: next-20220211

Changes since v4:
- Removed compilation warnings reported by the 0-day bot.

Changes since v3:
- Make open_file_mutex and open_node_lock per-fs.
- Replace per-fs rwsem with per-fs hashed rwsem.
(Suggested by Tejun Heo <[email protected]>)

Imran Khan (8):
kernfs: Introduce interface to access global kernfs_open_file_mutex.
kernfs: Replace global kernfs_open_file_mutex with hashed mutexes.
kernfs: Introduce interface to access kernfs_open_node_lock.
kernfs: Replace global kernfs_open_node_lock with hashed spinlocks.
kernfs: Use a per-fs rwsem to protect per-fs list of
kernfs_super_info.
kernfs: Introduce interface to access per-fs rwsem.
kernfs: Replace per-fs rwsem with hashed rwsems.
kernfs: Add a document to describe hashed locks used in kernfs.

.../filesystems/kernfs-hashed-locks.rst | 245 +++++++++++++++++
fs/kernfs/Makefile | 2 +-
fs/kernfs/dir.c | 212 +++++++++-----
fs/kernfs/file.c | 66 +++--
fs/kernfs/inode.c | 48 +++-
fs/kernfs/kernfs-internal.c | 259 ++++++++++++++++++
fs/kernfs/kernfs-internal.h | 162 ++++++++++-
fs/kernfs/mount.c | 30 +-
fs/kernfs/symlink.c | 13 +-
include/linux/kernfs.h | 71 ++++-
10 files changed, 982 insertions(+), 126 deletions(-)
create mode 100644 Documentation/filesystems/kernfs-hashed-locks.rst
create mode 100644 fs/kernfs/kernfs-internal.c


base-commit: a32cd981a6da2373c093d471ee4405a915e217d5
--
2.30.2


2022-03-17 08:53:53

by Imran Khan

Subject: [RESEND PATCH v7 7/8] kernfs: Replace per-fs rwsem with hashed rwsems.

Having a single rwsem to synchronize all operations across a kernfs-based
file system (cgroup, sysfs etc.) does not scale well. The contention
around this single rwsem becomes more apparent in large scale systems with
a few hundred CPUs, where most of the CPUs have running tasks that are
opening, accessing or closing sysfs files at any point in time.

Using hashed rwsems in place of a per-fs rwsem can significantly reduce
contention around the per-fs rwsem and hence provide better scalability.
Moreover, as these hashed rwsems are not part of kernfs_node objects, we
will not see any significant change in memory utilization of kernfs-based
file systems like sysfs, cgroupfs etc.

Modify the interface introduced in the previous patch to make use of
hashed rwsems. Just like the earlier change, use the kernfs_node address
as the hashing key. Since we are getting rid of the per-fs lock, in
certain cases we may need to acquire locks corresponding to multiple
nodes; in such cases of nested locking, the locks are taken in ascending
order of their addresses. Introduce helpers to acquire rwsems
corresponding to multiple nodes for such cases; a simplified sketch of the
ordering rule follows.
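
A minimal two-node sketch of the ordering rule (the helper name here is
illustrative; the real helpers in this patch additionally record the
acquired rwsems in a token so the matching release helpers know what to
unlock):

	/*
	 * Acquire the hashed rwsems of two nodes for writing. A duplicate
	 * rwsem is taken only once; distinct rwsems are taken in ascending
	 * address order, so two racing paths can never deadlock ABBA-style.
	 */
	static void kernfs_down_write_pair_sketch(struct kernfs_node *kn1,
						  struct kernfs_node *kn2)
	{
		struct rw_semaphore *a = kernfs_rwsem_ptr(kn1);
		struct rw_semaphore *b = kernfs_rwsem_ptr(kn2);

		if (a == b) {		/* both nodes hash to the same rwsem */
			down_write_nested(a, 0);
			return;
		}
		if (a > b)
			swap(a, b);	/* lower address first */
		down_write_nested(a, 0);
		down_write_nested(b, 1);
	}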

For operations that involve finding the node first and then operating on
it (for example, operations involving find_and_get_ns), acquiring the
rwsem of the node being searched for is not possible up front. Such
operations need to make sure that a concurrent remove does not free the
found node. Introduce a per-fs mutex that can be used to synchronize these
operations against parallel removal of the involved node, as in the
pattern sketched below.
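
This is the find-then-pin pattern as used by kernfs_find_and_get_ns() in
this patch:

	mutex_lock(&root->kernfs_rm_mutex);	/* blocks a parallel __kernfs_remove() */
	rwsem = kernfs_down_read(parent);
	kn = kernfs_find_ns(parent, name, ns);
	kernfs_get(kn);				/* pin the node before dropping the locks */
	kernfs_up_read(rwsem);
	mutex_unlock(&root->kernfs_rm_mutex);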

Replacing the global mutex and spinlock with hashed ones (as done in the
previous changes) and the per-fs rwsem with hashed rwsems (as done in this
change) reduces contention around kernfs and results in better performance
numbers.

For example, on a system with 384 cores, if I run 200 instances of an
application that mostly executes the following loop:

for (int loop = 0; loop < 100; loop++)
{
	for (int port_num = 1; port_num < 2; port_num++)
	{
		for (int gid_index = 0; gid_index < 254; gid_index++)
		{
			char ret_buf[64];
			char gid_file_path[1024];

			int ret_fd;
			ssize_t ret_rd;
			unsigned int saved_errno;

			memset(ret_buf, 0, sizeof(ret_buf));
			memset(gid_file_path, 0, sizeof(gid_file_path));

			snprintf(gid_file_path, sizeof(gid_file_path),
				 "/sys/class/infiniband/%s/ports/%d/gids/%d",
				 dev_name,
				 port_num,
				 gid_index);

			ret_fd = open(gid_file_path, O_RDONLY | O_CLOEXEC);
			if (ret_fd < 0)
			{
				printf("Failed to open %s\n", gid_file_path);
				continue;
			}

			/* Read the GID */
			ret_rd = read(ret_fd, ret_buf, 40);

			if (ret_rd == -1)
			{
				saved_errno = errno;	/* capture errno right away */
				printf("Failed to read from file %s, errno: %u\n",
				       gid_file_path, saved_errno);

				close(ret_fd);	/* don't leak the fd on the error path */
				continue;
			}

			close(ret_fd);
		}
	}
}

I can see contention around the above mentioned locks as follows:

- 54.07% 53.60% showgids [kernel.kallsyms] [k] osq_lock
- 53.60% __libc_start_main
- 32.29% __GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
path_openat
vfs_open
do_dentry_open
kernfs_fop_open
mutex_lock
- __mutex_lock_slowpath
- 32.23% __mutex_lock.isra.5
osq_lock
- 21.31% __GI___libc_close
entry_SYSCALL_64_after_hwframe
do_syscall_64
exit_to_usermode_loop
task_work_run
____fput
__fput
kernfs_fop_release
kernfs_put_open_node.isra.8
mutex_lock
- __mutex_lock_slowpath
- 21.28% __mutex_lock.isra.5
osq_lock

- 10.49% 10.39% showgids [kernel.kallsyms] [k] down_read
10.39% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 9.72% link_path_walk
- 5.21% inode_permission
- __inode_permission
- 5.21% kernfs_iop_permission
down_read
- 4.08% walk_component
lookup_fast
- d_revalidate.part.24
- 4.08% kernfs_dop_revalidate

- 7.48% 7.41% showgids [kernel.kallsyms] [k] up_read
7.41% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 7.01% link_path_walk
- 4.12% inode_permission
- __inode_permission
- 4.12% kernfs_iop_permission
up_read
- 2.61% walk_component
lookup_fast
- d_revalidate.part.24
- 2.61% kernfs_dop_revalidate

Moreover, this run of 200 application instances takes 32-34 seconds to
complete.

With the patched kernel on the same test setup, we no longer see
contention around osq_lock (i.e. kernfs_open_file_mutex), and contention
around the per-fs kernfs_rwsem has reduced significantly as well.
This can be seen in the following perf snippet:

- 1.66% 1.65% showgids [kernel.kallsyms] [k] down_read
1.65% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 1.62% link_path_walk
- 0.98% inode_permission
- __inode_permission
+ 0.98% kernfs_iop_permission
- 0.52% walk_component
lookup_fast
- d_revalidate.part.24
- 0.52% kernfs_dop_revalidate

- 1.12% 1.11% showgids [kernel.kallsyms] [k] up_read
1.11% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 1.11% link_path_walk
- 0.69% inode_permission
- __inode_permission
- 0.69% kernfs_iop_permission
up_read

Moreover, the test execution time has reduced from 32-34 seconds to 18-19
seconds.

Signed-off-by: Imran Khan <[email protected]>
---
fs/kernfs/Makefile | 2 +-
fs/kernfs/dir.c | 133 ++++++++++++++----
fs/kernfs/inode.c | 20 +++
fs/kernfs/kernfs-internal.c | 259 ++++++++++++++++++++++++++++++++++++
fs/kernfs/kernfs-internal.h | 46 ++++++-
fs/kernfs/mount.c | 1 +
fs/kernfs/symlink.c | 13 +-
include/linux/kernfs.h | 3 +-
8 files changed, 444 insertions(+), 33 deletions(-)
create mode 100644 fs/kernfs/kernfs-internal.c

diff --git a/fs/kernfs/Makefile b/fs/kernfs/Makefile
index 4ca54ff54c98..778da6b118e9 100644
--- a/fs/kernfs/Makefile
+++ b/fs/kernfs/Makefile
@@ -3,4 +3,4 @@
# Makefile for the kernfs pseudo filesystem
#

-obj-y := mount.o inode.o dir.o file.o symlink.o
+obj-y := mount.o inode.o dir.o file.o symlink.o kernfs-internal.o
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 5dce041b10c3..1b28d99ff1c3 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -25,7 +25,6 @@ static DEFINE_SPINLOCK(kernfs_idr_lock); /* root->ino_idr */

static bool kernfs_active(struct kernfs_node *kn)
{
- kernfs_rwsem_assert_held(kn);
return atomic_read(&kn->active) >= 0;
}

@@ -450,26 +449,24 @@ void kernfs_put_active(struct kernfs_node *kn)
/**
* kernfs_drain - drain kernfs_node
* @kn: kernfs_node to drain
+ * @anc: ancestor of kernfs_node to drain
*
* Drain existing usages and nuke all existing mmaps of @kn. Mutiple
* removers may invoke this function concurrently on @kn and all will
* return after draining is complete.
*/
-static void kernfs_drain(struct kernfs_node *kn)
- __releases(&kernfs_root(kn)->kernfs_rwsem)
- __acquires(&kernfs_root(kn)->kernfs_rwsem)
+static void kernfs_drain(struct kernfs_node *kn, struct kernfs_node *anc)
+ __releases(kernfs_rwsem_ptr(anc))
+ __acquires(kernfs_rwsem_ptr(anc))
{
struct kernfs_root *root = kernfs_root(kn);

- /**
- * kn has the same root as its ancestor, so it can be used to get
- * per-fs rwsem.
- */
- struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+ struct rw_semaphore *rwsem;

- kernfs_rwsem_assert_held_write(kn);
+ kernfs_rwsem_assert_held_write(anc);
WARN_ON_ONCE(kernfs_active(kn));

+ rwsem = kernfs_rwsem_ptr(anc);
kernfs_up_write(rwsem);

if (kernfs_lockdep(kn)) {
@@ -489,7 +486,7 @@ static void kernfs_drain(struct kernfs_node *kn)

kernfs_drain_open_files(kn);

- kernfs_down_write(kn);
+ kernfs_down_write(anc);
}

/**
@@ -729,6 +726,11 @@ int kernfs_add_one(struct kernfs_node *kn)
bool has_ns;
int ret;

+ /**
+ * The node being added is not active at this point of time and may
+ * be activated later depending on CREATE_DEACTIVATED flag. So at
+ * this point of time just locking the parent is enough.
+ */
rwsem = kernfs_down_write(parent);

ret = -EINVAL;
@@ -867,11 +869,20 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
{
struct kernfs_node *kn;
struct rw_semaphore *rwsem;
+ struct kernfs_root *root = kernfs_root(parent);

+ /**
+ * We don't have address of kernfs_node (that is being searched)
+ * yet. Acquiring root->kernfs_rm_mutex and releasing it after
+ * pinning the found kernfs_node, ensures that found kernfs_node
+ * will not disappear due to a parallel remove operation.
+ */
+ mutex_lock(&root->kernfs_rm_mutex);
rwsem = kernfs_down_read(parent);
kn = kernfs_find_ns(parent, name, ns);
kernfs_get(kn);
kernfs_up_read(rwsem);
+ mutex_unlock(&root->kernfs_rm_mutex);

return kn;
}
@@ -892,11 +903,20 @@ struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
{
struct kernfs_node *kn;
struct rw_semaphore *rwsem;
+ struct kernfs_root *root = kernfs_root(parent);

+ /**
+ * We don't have address of kernfs_node (that is being searched)
+ * yet. Acquiring root->kernfs_rm_mutex and releasing it after
+ * pinning the found kernfs_node, ensures that found kernfs_node
+ * will not disappear due to a parallel remove operation.
+ */
+ mutex_lock(&root->kernfs_rm_mutex);
rwsem = kernfs_down_read(parent);
kn = kernfs_walk_ns(parent, path, ns);
kernfs_get(kn);
kernfs_up_read(rwsem);
+ mutex_unlock(&root->kernfs_rm_mutex);

return kn;
}
@@ -921,9 +941,9 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
return ERR_PTR(-ENOMEM);

idr_init(&root->ino_idr);
- init_rwsem(&root->kernfs_rwsem);
INIT_LIST_HEAD(&root->supers);
init_rwsem(&root->supers_rwsem);
+ mutex_init(&root->kernfs_rm_mutex);

/*
* On 64bit ino setups, id is ino. On 32bit, low 32bits are ino.
@@ -1093,6 +1113,11 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
}

kn = kernfs_dentry_node(dentry);
+ /**
+ * For dentry revalidation just acquiring kernfs_node's rwsem for
+ * reading should be enough. If a competing rename or remove wins
+ * one of the checks below will fail.
+ */
rwsem = kernfs_down_read(kn);

/* The kernfs node has been deactivated */
@@ -1132,24 +1157,35 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
struct inode *inode = NULL;
const void *ns = NULL;
struct rw_semaphore *rwsem;
+ struct kernfs_root *root = kernfs_root(parent);

+ /**
+ * We don't have address of kernfs_node (that is being searched)
+ * yet. So take root->kernfs_rm_mutex to avoid parallel removal of
+ * found kernfs_node.
+ */
+ mutex_lock(&root->kernfs_rm_mutex);
rwsem = kernfs_down_read(parent);
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dir->i_sb)->ns;

kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
+ kernfs_up_read(rwsem);
/* attach dentry and inode */
if (kn) {
/* Inactive nodes are invisible to the VFS so don't
* create a negative.
*/
+ rwsem = kernfs_down_read(kn);
if (!kernfs_active(kn)) {
kernfs_up_read(rwsem);
+ mutex_unlock(&root->kernfs_rm_mutex);
return NULL;
}
inode = kernfs_get_inode(dir->i_sb, kn);
if (!inode)
inode = ERR_PTR(-ENOMEM);
+ kernfs_up_read(rwsem);
}
/*
* Needed for negative dentry validation.
@@ -1157,9 +1193,11 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
* or transforms from positive dentry in dentry_unlink_inode()
* called from vfs_rmdir().
*/
+ rwsem = kernfs_down_read(parent);
if (!IS_ERR(inode))
kernfs_set_rev(parent, dentry);
kernfs_up_read(rwsem);
+ mutex_unlock(&root->kernfs_rm_mutex);

/* instantiate and hash (possibly negative) dentry */
return d_splice_alias(inode, dentry);
@@ -1339,27 +1377,40 @@ void kernfs_activate(struct kernfs_node *kn)
static void __kernfs_remove(struct kernfs_node *kn)
{
struct kernfs_node *pos;
+ struct rw_semaphore *rwsem;
+ struct kernfs_root *root;

- kernfs_rwsem_assert_held_write(kn);
+ if (!kn)
+ return;
+
+ root = kernfs_root(kn);

/*
* Short-circuit if non-root @kn has already finished removal.
* This is for kernfs_remove_self() which plays with active ref
* after removal.
*/
- if (!kn || (kn->parent && RB_EMPTY_NODE(&kn->rb)))
+ mutex_lock(&root->kernfs_rm_mutex);
+ rwsem = kernfs_down_write(kn);
+ if (kn->parent && RB_EMPTY_NODE(&kn->rb)) {
+ kernfs_up_write(rwsem);
+ mutex_unlock(&root->kernfs_rm_mutex);
return;
+ }

pr_debug("kernfs %s: removing\n", kn->name);

/* prevent any new usage under @kn by deactivating all nodes */
pos = NULL;
+
while ((pos = kernfs_next_descendant_post(pos, kn)))
if (kernfs_active(pos))
atomic_add(KN_DEACTIVATED_BIAS, &pos->active);
+ kernfs_up_write(rwsem);

/* deactivate and unlink the subtree node-by-node */
do {
+ rwsem = kernfs_down_write(kn);
pos = kernfs_leftmost_descendant(kn);

/*
@@ -1377,10 +1428,25 @@ static void __kernfs_remove(struct kernfs_node *kn)
* error paths without worrying about draining.
*/
if (kn->flags & KERNFS_ACTIVATED)
- kernfs_drain(pos);
+ kernfs_drain(pos, kn);
else
WARN_ON_ONCE(atomic_read(&kn->active) != KN_DEACTIVATED_BIAS);

+ kernfs_up_write(rwsem);
+
+ /**
+ * By now node and all of its descendants have been deactivated
+ * Once a descendant has been drained, acquire its parent's lock
+ * and unlink it from parent's children rb tree.
+ * We drop kn's lock before acquiring pos->parent's lock to avoid
+ * deadlock that will happen if pos->parent and kn hash to same lock.
+ * Dropping kn's lock should be safe because it is in deactivated state.
+ * Further root->kernfs_rm_mutex ensures that we will not have
+ * concurrent instances of __kernfs_remove
+ */
+ if (pos->parent)
+ rwsem = kernfs_down_write(pos->parent);
+
/*
* kernfs_unlink_sibling() succeeds once per node. Use it
* to decide who's responsible for cleanups.
@@ -1398,8 +1464,12 @@ static void __kernfs_remove(struct kernfs_node *kn)
kernfs_put(pos);
}

+ if (pos->parent)
+ kernfs_up_write(rwsem);
kernfs_put(pos);
} while (pos != kn);
+
+ mutex_unlock(&root->kernfs_rm_mutex);
}

/**
@@ -1410,11 +1480,7 @@ static void __kernfs_remove(struct kernfs_node *kn)
*/
void kernfs_remove(struct kernfs_node *kn)
{
- struct rw_semaphore *rwsem;
-
- rwsem = kernfs_down_write(kn);
__kernfs_remove(kn);
- kernfs_up_write(rwsem);
}

/**
@@ -1516,9 +1582,11 @@ bool kernfs_remove_self(struct kernfs_node *kn)
*/
if (!(kn->flags & KERNFS_SUICIDAL)) {
kn->flags |= KERNFS_SUICIDAL;
+ kernfs_up_write(rwsem);
__kernfs_remove(kn);
kn->flags |= KERNFS_SUICIDED;
ret = true;
+ rwsem = kernfs_down_write(kn);
} else {
wait_queue_head_t *waitq = &kernfs_root(kn)->deactivate_waitq;
DEFINE_WAIT(wait);
@@ -1572,11 +1640,17 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,

rwsem = kernfs_down_write(parent);

+ /**
+ * Since the node being searched will be removed eventually,
+ * we don't need to take root->kernfs_rm_mutex.
+ * Even if a parallel remove succeeds, the subsequent __kernfs_remove
+ * will detect it and bail-out early.
+ */
kn = kernfs_find_ns(parent, name, ns);
- if (kn)
- __kernfs_remove(kn);

kernfs_up_write(rwsem);
+ if (kn)
+ __kernfs_remove(kn);

if (kn)
return 0;
@@ -1596,14 +1670,24 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
{
struct kernfs_node *old_parent;
const char *old_name = NULL;
- struct rw_semaphore *rwsem;
+ struct kernfs_rwsem_token token;
int error;

/* can't move or rename root */
if (!kn->parent)
return -EINVAL;

- rwsem = kernfs_down_write(kn);
+ old_parent = kn->parent;
+ kernfs_get(old_parent);
+ kernfs_down_write_triple_nodes(kn, old_parent, new_parent, &token);
+ while (old_parent != kn->parent) {
+ kernfs_put(old_parent);
+ kernfs_up_write_triple_nodes(kn, old_parent, new_parent, &token);
+ old_parent = kn->parent;
+ kernfs_get(old_parent);
+ kernfs_down_write_triple_nodes(kn, old_parent, new_parent, &token);
+ }
+ kernfs_put(old_parent);

error = -ENOENT;
if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
@@ -1638,7 +1722,6 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
/* rename_lock protects ->parent and ->name accessors */
spin_lock_irq(&kernfs_rename_lock);

- old_parent = kn->parent;
kn->parent = new_parent;

kn->ns = new_ns;
@@ -1657,7 +1740,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,

error = 0;
out:
- kernfs_up_write(rwsem);
+ kernfs_up_write_triple_nodes(kn, new_parent, old_parent, &token);
return error;
}

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index efe5ae98abf4..36a40b08b97f 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -101,6 +101,12 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
int ret;
struct rw_semaphore *rwsem;

+ /**
+ * Since we are only modifying the inode attribute, we just need
+ * to lock involved node. Operations that add or remove a node
+ * acquire parent's lock before changing the inode attributes, so
+ * such operations are also in sync with this interface.
+ */
rwsem = kernfs_down_write(kn);
ret = __kernfs_setattr(kn, iattr);
kernfs_up_write(rwsem);
@@ -118,6 +124,12 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
if (!kn)
return -EINVAL;

+ /**
+ * Since we are only modifying the inode attribute, we just need
+ * to lock involved node. Operations that add or remove a node
+ * acquire parent's lock before changing the inode attributes, so
+ * such operations are also in sync with .setattr backend.
+ */
rwsem = kernfs_down_write(kn);
error = setattr_prepare(&init_user_ns, dentry, iattr);
if (error)
@@ -188,6 +200,10 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
struct kernfs_node *kn = inode->i_private;
struct rw_semaphore *rwsem;

+ /**
+ * As we are only reading ->iattr, acquiring kn's rwsem for
+ * reading is enough.
+ */
rwsem = kernfs_down_read(kn);
spin_lock(&inode->i_lock);
kernfs_refresh_inode(kn, inode);
@@ -285,6 +301,10 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,

kn = inode->i_private;

+ /**
+ * As we are only reading ->iattr, acquiring kn's rwsem for
+ * reading is enough.
+ */
rwsem = kernfs_down_read(kn);
spin_lock(&inode->i_lock);
kernfs_refresh_inode(kn, inode);
diff --git a/fs/kernfs/kernfs-internal.c b/fs/kernfs/kernfs-internal.c
new file mode 100644
index 000000000000..80d7d64532fe
--- /dev/null
+++ b/fs/kernfs/kernfs-internal.c
@@ -0,0 +1,259 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * This file provides internal helpers for kernfs.
+ */
+
+#include "kernfs-internal.h"
+
+static void kernfs_swap_rwsems(struct rw_semaphore **array, int i, int j)
+{
+ struct rw_semaphore *tmp;
+
+ tmp = array[i];
+ array[i] = array[j];
+ array[j] = tmp;
+}
+
+static void kernfs_sort_rwsems(struct kernfs_rwsem_token *token)
+{
+ struct rw_semaphore **array = &token->rwsems[0];
+
+ if (token->count == 2) {
+ if (array[0] == array[1])
+ token->count = 1;
+ else if (array[0] > array[1])
+ kernfs_swap_rwsems(array, 0, 1);
+ } else {
+ if (array[0] == array[1] && array[0] == array[2])
+ token->count = 1;
+ else {
+ if (array[0] > array[1])
+ kernfs_swap_rwsems(array, 0, 1);
+
+ if (array[0] > array[2])
+ kernfs_swap_rwsems(array, 0, 2);
+
+ if (array[1] > array[2])
+ kernfs_swap_rwsems(array, 1, 2);
+
+ if (array[0] == array[1] || array[1] == array[2])
+ token->count = 2;
+ }
+ }
+}
+
+/**
+ * kernfs_down_write_double_nodes() - take hashed rwsem for 2 nodes
+ *
+ * @kn1: first kernfs_node of interest
+ * @kn2: second kernfs_node of interest
+ * @token: token to pass unlocking information to caller
+ *
+ * Acquire hashed rwsem for 2 nodes. Some operation may need to acquire
+ * hashed rwsems for 2 nodes (for example for a node and its parent).
+ * This function can be used in such cases.
+ *
+ * Return: void
+ */
+void kernfs_down_write_double_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_rwsem_token *token)
+{
+ struct rw_semaphore **array = &token->rwsems[0];
+
+ array[0] = kernfs_rwsem_ptr(kn1);
+ array[1] = kernfs_rwsem_ptr(kn2);
+ token->count = 2;
+
+ kernfs_sort_rwsems(token);
+
+ if (token->count == 1) {
+ /* Both nodes hash to same rwsem */
+ down_write_nested(array[0], 0);
+ } else {
+ /* Both nodes hash to different rwsems */
+ down_write_nested(array[0], 0);
+ down_write_nested(array[1], 1);
+ }
+}
+
+/**
+ * kernfs_up_write_double_nodes - release hashed rwsem for 2 nodes
+ *
+ * @kn1: first kernfs_node of interest
+ * @kn2: second kernfs_node of interest
+ * @token: token to indicate unlocking information
+ * ->rwsems is a sorted list of rwsem addresses
+ * ->count contains number of unique locks
+ *
+ * Release hashed rwsems for 2 nodes
+ *
+ * Return: void
+ */
+void kernfs_up_write_double_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_rwsem_token *token)
+{
+ struct rw_semaphore **array = &token->rwsems[0];
+
+ if (token->count == 1) {
+ /* Both nodes hash to same rwsem */
+ up_write(array[0]);
+ } else {
+ /* Both nodes hash to different rwsems */
+ up_write(array[0]);
+ up_write(array[1]);
+ }
+}
+
+/**
+ * kernfs_down_read_double_nodes() - take hashed rwsem for 2 nodes
+ *
+ * @kn1: first kernfs_node of interest
+ * @kn2: second kernfs_node of interest
+ * @token: token to pass unlocking information to caller
+ *
+ * Acquire hashed rwsem for 2 nodes. Some operation may need to acquire
+ * hashed rwsems for 2 nodes (for example for a node and its parent).
+ * This function can be used in such cases.
+ *
+ * Return: void
+ */
+void kernfs_down_read_double_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_rwsem_token *token)
+{
+ struct rw_semaphore **array = &token->rwsems[0];
+
+ array[0] = kernfs_rwsem_ptr(kn1);
+ array[1] = kernfs_rwsem_ptr(kn2);
+ token->count = 2;
+
+ kernfs_sort_rwsems(token);
+
+ if (token->count == 1) {
+ /* Both nodes hash to same rwsem */
+ down_read_nested(array[0], 0);
+ } else {
+ /* Both nodes hash to different rwsems */
+ down_read_nested(array[0], 0);
+ down_read_nested(array[1], 1);
+ }
+}
+
+/**
+ * kernfs_up_read_double_nodes - release hashed rwsem for 2 nodes
+ *
+ * @kn1: first kernfs_node of interest
+ * @kn2: second kernfs_node of interest
+ * @token: token to indicate unlocking information
+ * ->rwsems is a sorted list of rwsem addresses
+ * ->count contains number of unique locks
+ *
+ * Release hashed rwsems for 2 nodes
+ *
+ * Return: void
+ */
+void kernfs_up_read_double_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_rwsem_token *token)
+{
+ struct rw_semaphore **array = &token->rwsems[0];
+
+ if (token->count == 1) {
+ /* Both nodes hash to same rwsem */
+ up_read(array[0]);
+ } else {
+ /* Both nodes hash to different rwsems */
+ up_read(array[0]);
+ up_read(array[1]);
+ }
+}
+
+/**
+ * kernfs_down_write_triple_nodes() - take hashed rwsem for 3 nodes
+ *
+ * @kn1: first kernfs_node of interest
+ * @kn2: second kernfs_node of interest
+ * @kn3: third kernfs_node of interest
+ * @token: token to pass unlocking information to caller
+ *
+ * Acquire hashed rwsem for 3 nodes. Some operation may need to acquire
+ * hashed rwsems for 3 nodes (for example rename operation needs to
+ * acquire rwsem corresponding to node, its current parent and its future
+ * parent). This function can be used in such cases.
+ *
+ * Return: void
+ */
+void kernfs_down_write_triple_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_node *kn3,
+ struct kernfs_rwsem_token *token)
+{
+ struct rw_semaphore **array = &token->rwsems[0];
+
+ array[0] = kernfs_rwsem_ptr(kn1);
+ array[1] = kernfs_rwsem_ptr(kn2);
+ array[2] = kernfs_rwsem_ptr(kn3);
+ token->count = 3;
+
+ kernfs_sort_rwsems(token);
+
+ if (token->count == 1) {
+ /* All 3 nodes hash to same rwsem */
+ down_write_nested(array[0], 0);
+ } else if (token->count == 2) {
+ /**
+ * Two nodes hash to same rwsem, and these
+ * will occupy consecutive places in array after
+ * sorting.
+ */
+ down_write_nested(array[0], 0);
+ down_write_nested(array[2], 1);
+ } else {
+ /* All 3 nodes hash to different rwsems */
+ down_write_nested(array[0], 0);
+ down_write_nested(array[1], 1);
+ down_write_nested(array[2], 2);
+ }
+}
+
+/**
+ * kernfs_up_write_triple_nodes - release hashed rwsem for 3 nodes
+ *
+ * @kn1: first kernfs_node of interest
+ * @kn2: second kernfs_node of interest
+ * @kn3: third kernfs_node of interest
+ * @token: token to indicate unlocking information
+ * ->rwsems is a sorted list of rwsem addresses
+ * ->count contains number of unique locks
+ *
+ * Release hashed rwsems for 3 nodes
+ *
+ * Return: void
+ */
+void kernfs_up_write_triple_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_node *kn3,
+ struct kernfs_rwsem_token *token)
+{
+ struct rw_semaphore **array = &token->rwsems[0];
+
+ if (token->count == 1) {
+ /* All 3 nodes hash to same rwsem */
+ up_write(array[0]);
+ } else if (token->count == 2) {
+ /**
+ * Two nodes hash to same rwsem, and these
+ * will occupy consecutive places in array after
+ * sorting.
+ */
+ up_write(array[0]);
+ up_write(array[2]);
+ } else {
+ /* All 3 nodes hash to different rwsems */
+ up_write(array[0]);
+ up_write(array[1]);
+ up_write(array[2]);
+ }
+}
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 8b24bc7e36c2..188e5680066c 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -19,6 +19,20 @@
#include <linux/kernfs.h>
#include <linux/fs_context.h>

+/**
+ * Token for nested locking interfaces.
+ *
+ * rwsems: array of rwsems to acquire
+ * count: has 2 uses
+ * As input argument it specifies size of ->rwsems array
+ * As return argument it specifies number of unique rwsems
+ * present in ->rwsems array
+ */
+struct kernfs_rwsem_token {
+ struct rw_semaphore *rwsems[3];
+ int count;
+};
+
struct kernfs_iattrs {
kuid_t ia_uid;
kgid_t ia_gid;
@@ -46,8 +60,8 @@ struct kernfs_root {
struct list_head supers;

wait_queue_head_t deactivate_waitq;
- struct rw_semaphore kernfs_rwsem;
struct rw_semaphore supers_rwsem;
+ struct mutex kernfs_rm_mutex;
};

/* +1 to avoid triggering overflow warning when negating it */
@@ -208,9 +222,9 @@ static inline spinlock_t *kernfs_open_node_spinlock(struct kernfs_node *kn)

static inline struct rw_semaphore *kernfs_rwsem_ptr(struct kernfs_node *kn)
{
- struct kernfs_root *root = kernfs_root(kn);
+ int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);

- return &root->kernfs_rwsem;
+ return &kernfs_locks->kernfs_rwsem[idx];
}

static inline void kernfs_rwsem_assert_held(struct kernfs_node *kn)
@@ -284,4 +298,30 @@ static inline void kernfs_up_read(struct rw_semaphore *rwsem)
up_read(rwsem);
}

+
+void kernfs_down_write_double_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_rwsem_token *token);
+
+void kernfs_up_write_double_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_rwsem_token *token);
+
+void kernfs_down_read_double_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_rwsem_token *token);
+
+void kernfs_up_read_double_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_rwsem_token *token);
+
+void kernfs_down_write_triple_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_node *kn3,
+ struct kernfs_rwsem_token *token);
+
+void kernfs_up_write_triple_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_node *kn3,
+ struct kernfs_rwsem_token *token);
#endif /* __KERNFS_INTERNAL_H */
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index f88dc4e26ffb..f2b3d981b42d 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -398,6 +398,7 @@ void __init kernfs_lock_init(void)
for (count = 0; count < NR_KERNFS_LOCKS; count++) {
mutex_init(&kernfs_locks->open_file_mutex[count].lock);
spin_lock_init(&kernfs_locks->open_node_locks[count].lock);
+ init_rwsem(&kernfs_locks->kernfs_rwsem[count]);
}
}

diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 9d4103602554..cbdd1be5f0a8 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -113,12 +113,19 @@ static int kernfs_getlink(struct inode *inode, char *path)
struct kernfs_node *kn = inode->i_private;
struct kernfs_node *parent = kn->parent;
struct kernfs_node *target = kn->symlink.target_kn;
- struct rw_semaphore *rwsem;
+ struct kernfs_rwsem_token token;
int error;

- rwsem = kernfs_down_read(parent);
+ /**
+ * Lock both parent and target, to avoid their movement
+ * or removal in the middle of path construction.
+ * If a competing remove or rename for parent or target
+ * wins, it will be reflected in result returned from
+ * kernfs_get_target_path.
+ */
+ kernfs_down_read_double_nodes(target, parent, &token);
error = kernfs_get_target_path(parent, target, path);
- kernfs_up_read(rwsem);
+ kernfs_up_read_double_nodes(target, parent, &token);

return error;
}
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 9f0926f553fc..46e71ae940d9 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -97,6 +97,7 @@ struct kernfs_open_node_lock {
struct kernfs_global_locks {
struct kernfs_open_file_mutex open_file_mutex[NR_KERNFS_LOCKS];
struct kernfs_open_node_lock open_node_locks[NR_KERNFS_LOCKS];
+ struct rw_semaphore kernfs_rwsem[NR_KERNFS_LOCKS];
};

enum kernfs_node_type {
@@ -266,8 +267,8 @@ struct kernfs_root {
struct list_head supers;

wait_queue_head_t deactivate_waitq;
- struct rw_semaphore kernfs_rwsem;
struct rw_semaphore supers_rwsem;
+ struct mutex kernfs_rm_mutex;
};
#endif

--
2.30.2

2022-03-17 08:55:12

by Imran Khan

Subject: [RESEND PATCH v7 1/8] kernfs: Introduce interface to access global kernfs_open_file_mutex.

This allows changing the underlying mutex locking without needing to
change the users of the lock. For example, the next patch modifies this
interface to use hashed mutexes in place of the single global
kernfs_open_file_mutex.
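
The resulting calling convention, as used throughout the diff below (a
minimal sketch): the helper takes the (for now, global) mutex and returns
a pointer to it, so callers unlock exactly what they locked.

	struct mutex *mutex;

	mutex = kernfs_open_file_mutex_lock(kn);
	/* ... manipulate kn->attr.open and the open-file list ... */
	mutex_unlock(mutex);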

Signed-off-by: Imran Khan <[email protected]>
---
fs/kernfs/file.c | 26 +++++++++++++++-----------
fs/kernfs/kernfs-internal.h | 18 ++++++++++++++++++
2 files changed, 33 insertions(+), 11 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 7aefaca876a0..99793c32abc3 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -30,7 +30,7 @@
* kernfs_open_node->files, which is protected by kernfs_open_file_mutex.
*/
static DEFINE_SPINLOCK(kernfs_open_node_lock);
-static DEFINE_MUTEX(kernfs_open_file_mutex);
+DEFINE_MUTEX(kernfs_open_file_mutex);

struct kernfs_open_node {
atomic_t refcnt;
@@ -519,9 +519,10 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
struct kernfs_open_file *of)
{
struct kernfs_open_node *on, *new_on = NULL;
+ struct mutex *mutex = NULL;

retry:
- mutex_lock(&kernfs_open_file_mutex);
+ mutex = kernfs_open_file_mutex_lock(kn);
spin_lock_irq(&kernfs_open_node_lock);

if (!kn->attr.open && new_on) {
@@ -536,7 +537,7 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
}

spin_unlock_irq(&kernfs_open_node_lock);
- mutex_unlock(&kernfs_open_file_mutex);
+ mutex_unlock(mutex);

if (on) {
kfree(new_on);
@@ -570,9 +571,10 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
struct kernfs_open_file *of)
{
struct kernfs_open_node *on = kn->attr.open;
+ struct mutex *mutex = NULL;
unsigned long flags;

- mutex_lock(&kernfs_open_file_mutex);
+ mutex = kernfs_open_file_mutex_lock(kn);
spin_lock_irqsave(&kernfs_open_node_lock, flags);

if (of)
@@ -584,7 +586,7 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
on = NULL;

spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
- mutex_unlock(&kernfs_open_file_mutex);
+ mutex_unlock(mutex);

kfree(on);
}
@@ -724,11 +726,11 @@ static void kernfs_release_file(struct kernfs_node *kn,
/*
* @of is guaranteed to have no other file operations in flight and
* we just want to synchronize release and drain paths.
- * @kernfs_open_file_mutex is enough. @of->mutex can't be used
+ * @open_file_mutex is enough. @of->mutex can't be used
* here because drain path may be called from places which can
* cause circular dependency.
*/
- lockdep_assert_held(&kernfs_open_file_mutex);
+ lockdep_assert_held(kernfs_open_file_mutex_ptr(kn));

if (!of->released) {
/*
@@ -745,11 +747,12 @@ static int kernfs_fop_release(struct inode *inode, struct file *filp)
{
struct kernfs_node *kn = inode->i_private;
struct kernfs_open_file *of = kernfs_of(filp);
+ struct mutex *lock = NULL;

if (kn->flags & KERNFS_HAS_RELEASE) {
- mutex_lock(&kernfs_open_file_mutex);
+ lock = kernfs_open_file_mutex_lock(kn);
kernfs_release_file(kn, of);
- mutex_unlock(&kernfs_open_file_mutex);
+ mutex_unlock(lock);
}

kernfs_put_open_node(kn, of);
@@ -764,6 +767,7 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
{
struct kernfs_open_node *on;
struct kernfs_open_file *of;
+ struct mutex *mutex = NULL;

if (!(kn->flags & (KERNFS_HAS_MMAP | KERNFS_HAS_RELEASE)))
return;
@@ -776,7 +780,7 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
if (!on)
return;

- mutex_lock(&kernfs_open_file_mutex);
+ mutex = kernfs_open_file_mutex_lock(kn);

list_for_each_entry(of, &on->files, list) {
struct inode *inode = file_inode(of->file);
@@ -788,7 +792,7 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
kernfs_release_file(kn, of);
}

- mutex_unlock(&kernfs_open_file_mutex);
+ mutex_unlock(mutex);

kernfs_put_open_node(kn, NULL);
}
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index eeaa779b929c..df00a5f5a367 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -164,4 +164,22 @@ void kernfs_drain_open_files(struct kernfs_node *kn);
*/
extern const struct inode_operations kernfs_symlink_iops;

+extern struct mutex kernfs_open_file_mutex;
+
+static inline struct mutex *kernfs_open_file_mutex_ptr(struct kernfs_node *kn)
+{
+ return &kernfs_open_file_mutex;
+}
+
+static inline struct mutex *kernfs_open_file_mutex_lock(struct kernfs_node *kn)
+{
+ struct mutex *lock;
+
+ lock = kernfs_open_file_mutex_ptr(kn);
+
+ mutex_lock(lock);
+
+ return lock;
+}
+
#endif /* __KERNFS_INTERNAL_H */
--
2.30.2

2022-03-17 09:01:59

by Imran Khan

Subject: [RESEND PATCH v7 4/8] kernfs: Replace global kernfs_open_node_lock with hashed spinlocks.

In the current kernfs design a single spinlock, kernfs_open_node_lock,
protects kernfs_node->attr.open, i.e. the kernfs_open_node instances
corresponding to sysfs attributes. So separate tasks, which are opening or
closing separate sysfs files, can contend on this spinlock. The contention
is more apparent in large scale systems with a few hundred CPUs, where
most of the CPUs have running tasks that are opening, accessing or closing
sysfs files at any point in time.

Using hashed spinlocks in place of a single global spinlock can reduce
contention around the global spinlock and hence provide better
scalability. Moreover, as these hashed spinlocks are not part of
kernfs_node objects, we will not see any significant change in memory
utilization of kernfs-based file systems like sysfs, cgroupfs etc.

Modify the interface introduced in the previous patch to make use of
hashed spinlocks. Use the kernfs_node address as the hashing key; note the
cache-line padding of the lock wrappers, shown below.
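
Each hashed lock is padded to its own cache line with
____cacheline_aligned_in_smp, so unrelated locks in the array don't bounce
the same cache line between CPUs (excerpted from the diff below):

	struct kernfs_open_node_lock {
		spinlock_t lock;
	} ____cacheline_aligned_in_smp;

	struct kernfs_global_locks {
		struct kernfs_open_file_mutex open_file_mutex[NR_KERNFS_LOCKS];
		struct kernfs_open_node_lock open_node_locks[NR_KERNFS_LOCKS];
	};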

Signed-off-by: Imran Khan <[email protected]>
---
fs/kernfs/file.c | 9 ---------
fs/kernfs/kernfs-internal.h | 6 +++---
fs/kernfs/mount.c | 4 +++-
include/linux/kernfs.h | 10 +++++++++-
4 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 1658bfa048df..95426df9f030 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -18,15 +18,6 @@

#include "kernfs-internal.h"

-/*
- * kernfs_node->attr.open points to kernfs_open_node. attr.open is
- * protected by kernfs_open_node_lock.
- *
- * filp->private_data points to seq_file whose ->private points to
- * kernfs_open_file.
- */
-DEFINE_SPINLOCK(kernfs_open_node_lock);
-
struct kernfs_open_node {
atomic_t refcnt;
atomic_t event;
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 60404a93c28a..25c3329bd60e 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -169,8 +169,6 @@ extern const struct inode_operations kernfs_symlink_iops;
*/
extern struct kernfs_global_locks *kernfs_locks;

-extern spinlock_t kernfs_open_node_lock;
-
static inline struct mutex *kernfs_open_file_mutex_ptr(struct kernfs_node *kn)
{
int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
@@ -191,7 +189,9 @@ static inline struct mutex *kernfs_open_file_mutex_lock(struct kernfs_node *kn)

static inline spinlock_t *kernfs_open_node_spinlock_ptr(struct kernfs_node *kn)
{
- return &kernfs_open_node_lock;
+ int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
+
+ return &kernfs_locks->open_node_locks[idx].lock;
}

static inline spinlock_t *kernfs_open_node_spinlock(struct kernfs_node *kn)
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index fa3fa22c95b2..809b738739b1 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -395,8 +395,10 @@ void __init kernfs_lock_init(void)
kernfs_locks = kmalloc(sizeof(struct kernfs_global_locks), GFP_KERNEL);
WARN_ON(!kernfs_locks);

- for (count = 0; count < NR_KERNFS_LOCKS; count++)
+ for (count = 0; count < NR_KERNFS_LOCKS; count++) {
mutex_init(&kernfs_locks->open_file_mutex[count].lock);
+ spin_lock_init(&kernfs_locks->open_node_locks[count].lock);
+ }
}

void __init kernfs_init(void)
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 1de54f4bdcc5..e82e57c007e9 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -19,6 +19,7 @@
#include <linux/wait.h>
#include <linux/rwsem.h>
#include <linux/cache.h>
+#include <linux/spinlock.h>

struct file;
struct dentry;
@@ -75,20 +76,27 @@ struct kernfs_iattrs;
* kernfs_open_file.
* kernfs_open_files are chained at kernfs_open_node->files, which is
* protected by kernfs_open_file_mutex.lock.
+ *
+ * kernfs_node->attr.open points to kernfs_open_node. attr.open is
+ * protected by kernfs_open_node_lock.lock.
*/

struct kernfs_open_file_mutex {
struct mutex lock;
} ____cacheline_aligned_in_smp;

+struct kernfs_open_node_lock {
+ spinlock_t lock;
+} ____cacheline_aligned_in_smp;
+
/*
* To reduce possible contention in sysfs access, arising due to single
* locks, use an array of locks and use kernfs_node object address as
* hash keys to get the index of these locks.
*/
-
struct kernfs_global_locks {
struct kernfs_open_file_mutex open_file_mutex[NR_KERNFS_LOCKS];
+ struct kernfs_open_node_lock open_node_locks[NR_KERNFS_LOCKS];
};

enum kernfs_node_type {
--
2.30.2

2022-03-17 09:14:16

by Imran Khan

Subject: [RESEND PATCH v7 3/8] kernfs: Introduce interface to access kernfs_open_node_lock.

Having an interface allows changing the underlying locking mechanism
without needing to change the users of the lock. For example, the next
patch modifies this interface to make use of hashed spinlocks in place of
the global kernfs_open_node_lock.

Signed-off-by: Imran Khan <[email protected]>
---
fs/kernfs/file.c | 23 ++++++++++++++---------
fs/kernfs/kernfs-internal.h | 18 ++++++++++++++++++
2 files changed, 32 insertions(+), 9 deletions(-)

diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 8996b00568c3..1658bfa048df 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -25,7 +25,7 @@
* filp->private_data points to seq_file whose ->private points to
* kernfs_open_file.
*/
-static DEFINE_SPINLOCK(kernfs_open_node_lock);
+DEFINE_SPINLOCK(kernfs_open_node_lock);

struct kernfs_open_node {
atomic_t refcnt;
@@ -515,10 +515,11 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
{
struct kernfs_open_node *on, *new_on = NULL;
struct mutex *mutex = NULL;
+ spinlock_t *lock = NULL;

retry:
mutex = kernfs_open_file_mutex_lock(kn);
- spin_lock_irq(&kernfs_open_node_lock);
+ lock = kernfs_open_node_spinlock(kn);

if (!kn->attr.open && new_on) {
kn->attr.open = new_on;
@@ -531,7 +532,7 @@ static int kernfs_get_open_node(struct kernfs_node *kn,
list_add_tail(&of->list, &on->files);
}

- spin_unlock_irq(&kernfs_open_node_lock);
+ spin_unlock_irq(lock);
mutex_unlock(mutex);

if (on) {
@@ -567,10 +568,13 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
{
struct kernfs_open_node *on = kn->attr.open;
struct mutex *mutex = NULL;
+ spinlock_t *lock = NULL;
unsigned long flags;

mutex = kernfs_open_file_mutex_lock(kn);
- spin_lock_irqsave(&kernfs_open_node_lock, flags);
+ lock = kernfs_open_node_spinlock_ptr(kn);
+
+ spin_lock_irqsave(lock, flags);

if (of)
list_del(&of->list);
@@ -580,7 +584,7 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
else
on = NULL;

- spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
+ spin_unlock_irqrestore(lock, flags);
mutex_unlock(mutex);

kfree(on);
@@ -763,15 +767,16 @@ void kernfs_drain_open_files(struct kernfs_node *kn)
struct kernfs_open_node *on;
struct kernfs_open_file *of;
struct mutex *mutex = NULL;
+ spinlock_t *lock = NULL;

if (!(kn->flags & (KERNFS_HAS_MMAP | KERNFS_HAS_RELEASE)))
return;

- spin_lock_irq(&kernfs_open_node_lock);
+ lock = kernfs_open_node_spinlock(kn);
on = kn->attr.open;
if (on)
atomic_inc(&on->refcnt);
- spin_unlock_irq(&kernfs_open_node_lock);
+ spin_unlock_irq(lock);
if (!on)
return;

@@ -916,13 +921,13 @@ void kernfs_notify(struct kernfs_node *kn)
return;

/* kick poll immediately */
- spin_lock_irqsave(&kernfs_open_node_lock, flags);
+ spin_lock_irqsave(kernfs_open_node_spinlock_ptr(kn), flags);
on = kn->attr.open;
if (on) {
atomic_inc(&on->event);
wake_up_interruptible(&on->poll);
}
- spin_unlock_irqrestore(&kernfs_open_node_lock, flags);
+ spin_unlock_irqrestore(kernfs_open_node_spinlock_ptr(kn), flags);

/* schedule work to kick fsnotify */
spin_lock_irqsave(&kernfs_notify_lock, flags);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 4ab696fb2040..60404a93c28a 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -169,6 +169,8 @@ extern const struct inode_operations kernfs_symlink_iops;
*/
extern struct kernfs_global_locks *kernfs_locks;

+extern spinlock_t kernfs_open_node_lock;
+
static inline struct mutex *kernfs_open_file_mutex_ptr(struct kernfs_node *kn)
{
int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
@@ -187,4 +189,20 @@ static inline struct mutex *kernfs_open_file_mutex_lock(struct kernfs_node *kn)
return lock;
}

+static inline spinlock_t *kernfs_open_node_spinlock_ptr(struct kernfs_node *kn)
+{
+ return &kernfs_open_node_lock;
+}
+
+static inline spinlock_t *kernfs_open_node_spinlock(struct kernfs_node *kn)
+{
+ spinlock_t *lock;
+
+ lock = kernfs_open_node_spinlock_ptr(kn);
+
+ spin_lock_irq(lock);
+
+ return lock;
+}
+
#endif /* __KERNFS_INTERNAL_H */
--
2.30.2

2022-03-17 10:05:32

by Imran Khan

Subject: [RESEND PATCH v7 6/8] kernfs: Introduce interface to access per-fs rwsem.

The per-fs rwsem is used across kernfs for synchronization purposes.
Having an interface to access it not only avoids code duplication, it can
also help in changing the underlying locking mechanism without needing to
change the lock users. For example, the next patch modifies this interface
to make use of hashed rwsems in place of the per-fs rwsem. A sketch of the
accessors' likely shape follows.
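
At this stage the helpers still resolve to the single per-fs rwsem; their
exact bodies live in the kernfs-internal.h hunk of this patch, but they
look roughly like this (a sketch, showing only the down_read variant):

	static inline struct rw_semaphore *kernfs_rwsem_ptr(struct kernfs_node *kn)
	{
		struct kernfs_root *root = kernfs_root(kn);

		return &root->kernfs_rwsem;	/* per-fs now, hashed in the next patch */
	}

	static inline struct rw_semaphore *kernfs_down_read(struct kernfs_node *kn)
	{
		struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);

		down_read(rwsem);
		return rwsem;	/* caller hands this back to kernfs_up_read() */
	}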

Signed-off-by: Imran Khan <[email protected]>
---
fs/kernfs/dir.c | 114 ++++++++++++++++++------------------
fs/kernfs/file.c | 5 +-
fs/kernfs/inode.c | 26 ++++----
fs/kernfs/kernfs-internal.h | 78 ++++++++++++++++++++++++
fs/kernfs/mount.c | 6 +-
fs/kernfs/symlink.c | 6 +-
6 files changed, 156 insertions(+), 79 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 17b438498c0b..5dce041b10c3 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -25,7 +25,7 @@ static DEFINE_SPINLOCK(kernfs_idr_lock); /* root->ino_idr */

static bool kernfs_active(struct kernfs_node *kn)
{
- lockdep_assert_held(&kernfs_root(kn)->kernfs_rwsem);
+ kernfs_rwsem_assert_held(kn);
return atomic_read(&kn->active) >= 0;
}

@@ -461,10 +461,16 @@ static void kernfs_drain(struct kernfs_node *kn)
{
struct kernfs_root *root = kernfs_root(kn);

- lockdep_assert_held_write(&root->kernfs_rwsem);
+ /**
+ * kn has the same root as its ancestor, so it can be used to get
+ * per-fs rwsem.
+ */
+ struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+
+ kernfs_rwsem_assert_held_write(kn);
WARN_ON_ONCE(kernfs_active(kn));

- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);

if (kernfs_lockdep(kn)) {
rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
@@ -483,7 +489,7 @@ static void kernfs_drain(struct kernfs_node *kn)

kernfs_drain_open_files(kn);

- down_write(&root->kernfs_rwsem);
+ kernfs_down_write(kn);
}

/**
@@ -718,12 +724,12 @@ struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
int kernfs_add_one(struct kernfs_node *kn)
{
struct kernfs_node *parent = kn->parent;
- struct kernfs_root *root = kernfs_root(parent);
struct kernfs_iattrs *ps_iattr;
+ struct rw_semaphore *rwsem;
bool has_ns;
int ret;

- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(parent);

ret = -EINVAL;
has_ns = kernfs_ns_enabled(parent);
@@ -754,7 +760,7 @@ int kernfs_add_one(struct kernfs_node *kn)
ps_iattr->ia_mtime = ps_iattr->ia_ctime;
}

- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);

/*
* Activate the new node unless CREATE_DEACTIVATED is requested.
@@ -768,7 +774,7 @@ int kernfs_add_one(struct kernfs_node *kn)
return 0;

out_unlock:
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return ret;
}

@@ -789,7 +795,7 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
bool has_ns = kernfs_ns_enabled(parent);
unsigned int hash;

- lockdep_assert_held(&kernfs_root(parent)->kernfs_rwsem);
+ kernfs_rwsem_assert_held(parent);

if (has_ns != (bool)ns) {
WARN(1, KERN_WARNING "kernfs: ns %s in '%s' for '%s'\n",
@@ -821,7 +827,7 @@ static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
size_t len;
char *p, *name;

- lockdep_assert_held_read(&kernfs_root(parent)->kernfs_rwsem);
+ kernfs_rwsem_assert_held_read(parent);

/* grab kernfs_rename_lock to piggy back on kernfs_pr_cont_buf */
spin_lock_irq(&kernfs_rename_lock);
@@ -860,12 +866,12 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
const char *name, const void *ns)
{
struct kernfs_node *kn;
- struct kernfs_root *root = kernfs_root(parent);
+ struct rw_semaphore *rwsem;

- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
kn = kernfs_find_ns(parent, name, ns);
kernfs_get(kn);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);

return kn;
}
@@ -885,12 +891,12 @@ struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
const char *path, const void *ns)
{
struct kernfs_node *kn;
- struct kernfs_root *root = kernfs_root(parent);
+ struct rw_semaphore *rwsem;

- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
kn = kernfs_walk_ns(parent, path, ns);
kernfs_get(kn);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);

return kn;
}
@@ -1055,7 +1061,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
{
struct kernfs_node *kn;
- struct kernfs_root *root;
+ struct rw_semaphore *rwsem;

if (flags & LOOKUP_RCU)
return -ECHILD;
@@ -1071,13 +1077,12 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
parent = kernfs_dentry_node(dentry->d_parent);
if (parent) {
spin_unlock(&dentry->d_lock);
- root = kernfs_root(parent);
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
if (kernfs_dir_changed(parent, dentry)) {
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return 0;
}
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
} else
spin_unlock(&dentry->d_lock);

@@ -1088,8 +1093,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
}

kn = kernfs_dentry_node(dentry);
- root = kernfs_root(kn);
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(kn);

/* The kernfs node has been deactivated */
if (!kernfs_active(kn))
@@ -1108,10 +1112,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
kernfs_info(dentry->d_sb)->ns != kn->ns)
goto out_bad;

- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return 1;
out_bad:
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return 0;
}

@@ -1125,12 +1129,11 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
{
struct kernfs_node *parent = dir->i_private;
struct kernfs_node *kn;
- struct kernfs_root *root;
struct inode *inode = NULL;
const void *ns = NULL;
+ struct rw_semaphore *rwsem;

- root = kernfs_root(parent);
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dir->i_sb)->ns;

@@ -1141,7 +1144,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
* create a negative.
*/
if (!kernfs_active(kn)) {
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return NULL;
}
inode = kernfs_get_inode(dir->i_sb, kn);
@@ -1156,7 +1159,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
*/
if (!IS_ERR(inode))
kernfs_set_rev(parent, dentry);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);

/* instantiate and hash (possibly negative) dentry */
return d_splice_alias(inode, dentry);
@@ -1279,7 +1282,7 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
{
struct rb_node *rbn;

- lockdep_assert_held_write(&kernfs_root(root)->kernfs_rwsem);
+ kernfs_rwsem_assert_held_write(root);

/* if first iteration, visit leftmost descendant which may be root */
if (!pos)
@@ -1314,9 +1317,9 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
void kernfs_activate(struct kernfs_node *kn)
{
struct kernfs_node *pos;
- struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;

- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);

pos = NULL;
while ((pos = kernfs_next_descendant_post(pos, kn))) {
@@ -1330,14 +1333,14 @@ void kernfs_activate(struct kernfs_node *kn)
pos->flags |= KERNFS_ACTIVATED;
}

- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
}

static void __kernfs_remove(struct kernfs_node *kn)
{
struct kernfs_node *pos;

- lockdep_assert_held_write(&kernfs_root(kn)->kernfs_rwsem);
+ kernfs_rwsem_assert_held_write(kn);

/*
* Short-circuit if non-root @kn has already finished removal.
@@ -1407,11 +1410,11 @@ static void __kernfs_remove(struct kernfs_node *kn)
*/
void kernfs_remove(struct kernfs_node *kn)
{
- struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;

- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
__kernfs_remove(kn);
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
}

/**
@@ -1497,9 +1500,9 @@ void kernfs_unbreak_active_protection(struct kernfs_node *kn)
bool kernfs_remove_self(struct kernfs_node *kn)
{
bool ret;
- struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;

- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
kernfs_break_active_protection(kn);

/*
@@ -1527,9 +1530,9 @@ bool kernfs_remove_self(struct kernfs_node *kn)
atomic_read(&kn->active) == KN_DEACTIVATED_BIAS)
break;

- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
schedule();
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
}
finish_wait(waitq, &wait);
WARN_ON_ONCE(!RB_EMPTY_NODE(&kn->rb));
@@ -1542,7 +1545,7 @@ bool kernfs_remove_self(struct kernfs_node *kn)
*/
kernfs_unbreak_active_protection(kn);

- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return ret;
}

@@ -1559,7 +1562,7 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
const void *ns)
{
struct kernfs_node *kn;
- struct kernfs_root *root;
+ struct rw_semaphore *rwsem;

if (!parent) {
WARN(1, KERN_WARNING "kernfs: can not remove '%s', no directory\n",
@@ -1567,14 +1570,13 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
return -ENOENT;
}

- root = kernfs_root(parent);
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(parent);

kn = kernfs_find_ns(parent, name, ns);
if (kn)
__kernfs_remove(kn);

- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);

if (kn)
return 0;
@@ -1593,16 +1595,15 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
const char *new_name, const void *new_ns)
{
struct kernfs_node *old_parent;
- struct kernfs_root *root;
const char *old_name = NULL;
+ struct rw_semaphore *rwsem;
int error;

/* can't move or rename root */
if (!kn->parent)
return -EINVAL;

- root = kernfs_root(kn);
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);

error = -ENOENT;
if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
@@ -1656,7 +1657,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,

error = 0;
out:
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return error;
}

@@ -1727,14 +1728,13 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
struct dentry *dentry = file->f_path.dentry;
struct kernfs_node *parent = kernfs_dentry_node(dentry);
struct kernfs_node *pos = file->private_data;
- struct kernfs_root *root;
const void *ns = NULL;
+ struct rw_semaphore *rwsem;

if (!dir_emit_dots(file, ctx))
return 0;

- root = kernfs_root(parent);
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);

if (kernfs_ns_enabled(parent))
ns = kernfs_info(dentry->d_sb)->ns;
@@ -1751,12 +1751,12 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
file->private_data = pos;
kernfs_get(pos);

- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
if (!dir_emit(ctx, name, len, ino, type))
return 0;
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
}
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
file->private_data = NULL;
ctx->pos = INT_MAX;
return 0;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 07003d47343d..f46c25fb789f 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -838,6 +838,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
struct kernfs_node *kn;
struct kernfs_super_info *info;
struct kernfs_root *root;
+ struct rw_semaphore *rwsem;
repeat:
/* pop one off the notify_list */
spin_lock_irq(&kernfs_notify_lock);
@@ -852,7 +853,7 @@ static void kernfs_notify_workfn(struct work_struct *work)

root = kernfs_root(kn);
/* kick fsnotify */
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);

down_write(&root->supers_rwsem);
list_for_each_entry(info, &kernfs_root(kn)->supers, node) {
@@ -892,7 +893,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
}
up_write(&root->supers_rwsem);

- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
kernfs_put(kn);
goto repeat;
}
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 3d783d80f5da..efe5ae98abf4 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -99,11 +99,11 @@ int __kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
{
int ret;
- struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;

- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
ret = __kernfs_setattr(kn, iattr);
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return ret;
}

@@ -112,14 +112,13 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
{
struct inode *inode = d_inode(dentry);
struct kernfs_node *kn = inode->i_private;
- struct kernfs_root *root;
+ struct rw_semaphore *rwsem;
int error;

if (!kn)
return -EINVAL;

- root = kernfs_root(kn);
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
error = setattr_prepare(&init_user_ns, dentry, iattr);
if (error)
goto out;
@@ -132,7 +131,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
setattr_copy(&init_user_ns, inode, iattr);

out:
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return error;
}

@@ -187,14 +186,14 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
{
struct inode *inode = d_inode(path->dentry);
struct kernfs_node *kn = inode->i_private;
- struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;

- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(kn);
spin_lock(&inode->i_lock);
kernfs_refresh_inode(kn, inode);
generic_fillattr(&init_user_ns, inode, stat);
spin_unlock(&inode->i_lock);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);

return 0;
}
@@ -277,22 +276,21 @@ void kernfs_evict_inode(struct inode *inode)
int kernfs_iop_permission(struct user_namespace *mnt_userns,
struct inode *inode, int mask)
{
+ struct rw_semaphore *rwsem;
struct kernfs_node *kn;
- struct kernfs_root *root;
int ret;

if (mask & MAY_NOT_BLOCK)
return -ECHILD;

kn = inode->i_private;
- root = kernfs_root(kn);

- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(kn);
spin_lock(&inode->i_lock);
kernfs_refresh_inode(kn, inode);
ret = generic_permission(&init_user_ns, inode, mask);
spin_unlock(&inode->i_lock);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);

return ret;
}
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 6e6398a72578..8b24bc7e36c2 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -206,4 +206,82 @@ static inline spinlock_t *kernfs_open_node_spinlock(struct kernfs_node *kn)
return lock;
}

+static inline struct rw_semaphore *kernfs_rwsem_ptr(struct kernfs_node *kn)
+{
+ struct kernfs_root *root = kernfs_root(kn);
+
+ return &root->kernfs_rwsem;
+}
+
+static inline void kernfs_rwsem_assert_held(struct kernfs_node *kn)
+{
+ lockdep_assert_held(kernfs_rwsem_ptr(kn));
+}
+
+static inline void kernfs_rwsem_assert_held_write(struct kernfs_node *kn)
+{
+ lockdep_assert_held_write(kernfs_rwsem_ptr(kn));
+}
+
+static inline void kernfs_rwsem_assert_held_read(struct kernfs_node *kn)
+{
+ lockdep_assert_held_read(kernfs_rwsem_ptr(kn));
+}
+
+/**
+ * kernfs_down_write() - Acquire kernfs rwsem
+ *
+ * @kn: kernfs_node for which rwsem needs to be taken
+ *
+ * Return: pointer to acquired rwsem
+ */
+static inline struct rw_semaphore *kernfs_down_write(struct kernfs_node *kn)
+{
+ struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+
+ down_write(rwsem);
+
+ return rwsem;
+}
+
+/**
+ * kernfs_up_write() - Release kernfs rwsem
+ *
+ * @rwsem: address of rwsem to release
+ *
+ * Return: void
+ */
+static inline void kernfs_up_write(struct rw_semaphore *rwsem)
+{
+ up_write(rwsem);
+}
+
+/**
+ * kernfs_down_read() - Acquire kernfs rwsem
+ *
+ * @kn: kernfs_node for which rwsem needs to be taken
+ *
+ * Return: pointer to acquired rwsem
+ */
+static inline struct rw_semaphore *kernfs_down_read(struct kernfs_node *kn)
+{
+ struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+
+ down_read(rwsem);
+
+ return rwsem;
+}
+
+/**
+ * kernfs_up_read() - Release kernfs rwsem
+ *
+ * @rwsem: address of rwsem to release
+ *
+ * Return: void
+ */
+static inline void kernfs_up_read(struct rw_semaphore *rwsem)
+{
+ up_read(rwsem);
+}
+
#endif /* __KERNFS_INTERNAL_H */
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index d35142226c34..f88dc4e26ffb 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -237,9 +237,9 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *kfc)
{
struct kernfs_super_info *info = kernfs_info(sb);
- struct kernfs_root *kf_root = kfc->root;
struct inode *inode;
struct dentry *root;
+ struct rw_semaphore *rwsem;

info->sb = sb;
/* Userspace would break if executables or devices appear on sysfs */
@@ -257,9 +257,9 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k
sb->s_shrink.seeks = 0;

/* get root inode, initialize and unlock it */
- down_read(&kf_root->kernfs_rwsem);
+ rwsem = kernfs_down_read(info->root->kn);
inode = kernfs_get_inode(sb, info->root->kn);
- up_read(&kf_root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
if (!inode) {
pr_debug("kernfs: could not get root inode\n");
return -ENOMEM;
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 0ab13824822f..9d4103602554 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -113,12 +113,12 @@ static int kernfs_getlink(struct inode *inode, char *path)
struct kernfs_node *kn = inode->i_private;
struct kernfs_node *parent = kn->parent;
struct kernfs_node *target = kn->symlink.target_kn;
- struct kernfs_root *root = kernfs_root(parent);
+ struct rw_semaphore *rwsem;
int error;

- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
error = kernfs_get_target_path(parent, target, path);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);

return error;
}
--
2.30.2
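
For context on where this interface is headed: patch 7/8 of this series
("kernfs: Replace per-fs rwsem with hashed rwsems", discussed below) keeps
the kernfs_down_read()/kernfs_up_read() callers unchanged and only swaps
the body of kernfs_rwsem_ptr() for a hashed lookup. A minimal sketch of
what such a mapping could look like, assuming hash_ptr() from
<linux/hash.h>; the table size and the kernfs_locks name here are
illustrative, not taken from the posted patch:

    #include <linux/hash.h>

    #define NR_KERNFS_LOCK_BITS 5
    #define NR_KERNFS_LOCKS     (1 << NR_KERNFS_LOCK_BITS)

    /* one global table of rwsems, indexed by a hash of the node pointer */
    struct kernfs_global_locks {
        struct rw_semaphore kernfs_rwsem[NR_KERNFS_LOCKS];
    };

    static struct kernfs_global_locks *kernfs_locks;

    static inline struct rw_semaphore *kernfs_rwsem_ptr(struct kernfs_node *kn)
    {
        int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);

        return &kernfs_locks->kernfs_rwsem[idx];
    }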

2022-03-17 22:00:23

by Al Viro

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 1/8] kernfs: Introduce interface to access global kernfs_open_file_mutex.

On Thu, Mar 17, 2022 at 06:26:05PM +1100, Imran Khan wrote:

> @@ -570,9 +571,10 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
> struct kernfs_open_file *of)
> {
> struct kernfs_open_node *on = kn->attr.open;
> + struct mutex *mutex = NULL;
> unsigned long flags;
>
> - mutex_lock(&kernfs_open_file_mutex);
> + mutex = kernfs_open_file_mutex_lock(kn);
> spin_lock_irqsave(&kernfs_open_node_lock, flags);

Can that ever be reached with local interrupts disabled? I mean, what is
that spin_lock_irqsave() about?

> @@ -745,11 +747,12 @@ static int kernfs_fop_release(struct inode *inode, struct file *filp)
> {
> struct kernfs_node *kn = inode->i_private;
> struct kernfs_open_file *of = kernfs_of(filp);
> + struct mutex *lock = NULL;
>
> if (kn->flags & KERNFS_HAS_RELEASE) {
> - mutex_lock(&kernfs_open_file_mutex);
> + lock = kernfs_open_file_mutex_lock(kn);
> kernfs_release_file(kn, of);
> - mutex_unlock(&kernfs_open_file_mutex);
> + mutex_unlock(lock);

Careful - you are about to remove the existing exclusion between *all*
->release() instances, same node or not.

In particular, if some driver had them manipulate a driver-local list of
some kind, relying upon the kernfs to provide the exclusion, it'd break
as soon as you turn that thing into per-node (or hashed) mutex.

It's _probably_ safe, seeing that the one and only instance of ->release()
in the entire tree (cgroup_file_release()) is rather limited in what
it's doing, and while it calls a submethod (cftype.release()) there's only
a couple of instances of that (cgroup_procs_release() and
cgroup_pressure_release(), both in kernel/cgroup/cgroup.c). Neither
seems to rely upon the global exclusion.

However, that's a change of rules and it needs to be documented as such.

Incidentally, what's the point of having kernfs_open_node->refcnt
atomic_t? All users are under kernfs_open_node_lock... AFAICS,
it's simply "->files is non-empty or something is in
kernfs_drain_open_files() for the node in question", so I'm not
even sure we want a counter there...

Note that kernfs_drain_open_files() can't overlap with
kernfs_fops_open() adding to the list of files (and we seriously
rely upon that - you don't want ops->release() called while in
the middle of ops->open()). kernfs_fops_open() starts with
grabbing an active reference; kernfs_drain_open_files() is
not called until we had
* prevented new active references being grabbed and
* waited for all active references to be dropped.

So kernfs_drain_open_files() can do the following:
1) optimistically check for ->attr.open being NULL;
bugger off if it is. We know that nobody could be currently
trying to add anything to it, mutex or no mutex.
2) grab the mutex
3) recheck ->attr.open; it might have become NULL.
If it had, unlock and bugger off.
4) walk the list, doing unmaps/releases.
5) unlock and bugger off.
The only thing doing removals from the list is
kernfs_put_open_node() and it grabs that mutex.
So it can't get to the "remove from list, free the container
of list head" until we are through.

IOW, there's no reason to hold a reference to kernfs_open_node
in kernfs_drain_open_files() at all. And that makes ->refcnt
completely useless - kernfs_put_open_node() should do
    list_del(&of->list);
    if (list_empty(&on->files))
        kn->attr.open = NULL;
    else
        on = NULL;
and to hell with refcounting.

As the matter of fact, we can do even better - make freeing
that thing rcu-delayed, use rcu_assign_pointer() for stores,
rcu_dereference() for loads and have kernfs_notify() do
    rcu_read_lock();
    on = rcu_dereference(kn->attr.open);
    if (on) {
        atomic_inc(&on->event);
        wake_up_interruptible(&on->poll);
    }
    rcu_read_unlock();
and kernfs_open_node_lock becomes useless - all places that
grab it are under kernfs_open_file_mutex.
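
Putting those pieces together, a minimal sketch of the refcount-free
kernfs_put_open_node() described above, assuming a struct rcu_head member
is added to struct kernfs_open_node (an illustration of the suggestion,
not code from the series):

    static void kernfs_put_open_node(struct kernfs_node *kn,
                                     struct kernfs_open_file *of)
    {
        struct kernfs_open_node *on = kn->attr.open;

        mutex_lock(&kernfs_open_file_mutex);
        list_del(&of->list);
        if (list_empty(&on->files))
            rcu_assign_pointer(kn->attr.open, NULL);
        else
            on = NULL;    /* other files still open, keep it */
        mutex_unlock(&kernfs_open_file_mutex);

        if (on)
            kfree_rcu(on, rcu_head);    /* rcu_head: assumed new member */
    }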

2022-03-18 03:27:01

by Al Viro

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 7/8] kernfs: Replace per-fs rwsem with hashed rwsems.

On Thu, Mar 17, 2022 at 06:26:11PM +1100, Imran Khan wrote:

> diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
> index 9d4103602554..cbdd1be5f0a8 100644
> --- a/fs/kernfs/symlink.c
> +++ b/fs/kernfs/symlink.c
> @@ -113,12 +113,19 @@ static int kernfs_getlink(struct inode *inode, char *path)
> struct kernfs_node *kn = inode->i_private;
> struct kernfs_node *parent = kn->parent;
> struct kernfs_node *target = kn->symlink.target_kn;
> - struct rw_semaphore *rwsem;
> + struct kernfs_rwsem_token token;
> int error;
>
> - rwsem = kernfs_down_read(parent);
> + /**
> + * Lock both parent and target, to avoid their movement
> + * or removal in the middle of path construction.
> + * If a competing remove or rename for parent or target
> + * wins, it will be reflected in result returned from
> + * kernfs_get_target_path.
> + */
> + kernfs_down_read_double_nodes(target, parent, &token);
> error = kernfs_get_target_path(parent, target, path);
> - kernfs_up_read(rwsem);
> + kernfs_up_read_double_nodes(target, parent, &token);
>
> return error;
> }

No. Read through the kernfs_get_target_path(). Why would locking these
two specific nodes be sufficient for anything useful? That code relies
upon ->parent of *many* nodes being stable. Which is not going to be
guaranteed by anything of that sort.

And it's not just "we might get garbage if we race" - it's "we might
walk into kfree'd object and proceed to walk the pointer chain".

Or have this loop
    kn = target;
    while (kn->parent && kn != base) {
        len += strlen(kn->name) + 1;
        kn = kn->parent;
    }
see the names that are not identical to what we see in
    kn = target;
    while (kn->parent && kn != base) {
        int slen = strlen(kn->name);

        len -= slen;
        memcpy(s + len, kn->name, slen);
        if (len)
            s[--len] = '/';

        kn = kn->parent;
    }
done later in the same function. With obvious unpleasant effects.
Or a different set of nodes, for that matter.

This code really depends upon the tree being stable. No renames of
any sort allowed during that thing.
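
For reference, the kernfs_down_read_double_nodes() helper quoted above is
not shown in this excerpt. With hashed rwsems, any such multi-node helper
has to order lock acquisition, since two nodes may hash either to the same
rwsem or to two different ones. A rough sketch, with the signature
simplified relative to the token-based API in the patch:

    static void kernfs_down_read_double_nodes(struct kernfs_node *kn1,
                                              struct kernfs_node *kn2)
    {
        struct rw_semaphore *r1 = kernfs_rwsem_ptr(kn1);
        struct rw_semaphore *r2 = kernfs_rwsem_ptr(kn2);

        if (r1 == r2) {
            down_read(r1);              /* same bucket: take it once */
        } else if (r1 < r2) {           /* order by address: no ABBA */
            down_read(r1);
            down_read_nested(r2, SINGLE_DEPTH_NESTING);
        } else {
            down_read(r2);
            down_read_nested(r1, SINGLE_DEPTH_NESTING);
        }
    }

As Al points out, though, locking two specific nodes this way still leaves
every other ->parent on the path unprotected.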

2022-03-19 12:48:56

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 1/8] kernfs: Introduce interface to access global kernfs_open_file_mutex.

Imran Khan <[email protected]> writes:

> This allows to change underlying mutex locking, without needing to change
> the users of the lock. For example next patch modifies this interface to
> use hashed mutexes in place of a single global kernfs_open_file_mutex.
>
> Signed-off-by: Imran Khan <[email protected]>
> ---
> fs/kernfs/file.c | 26 +++++++++++++++-----------
> fs/kernfs/kernfs-internal.h | 18 ++++++++++++++++++
> 2 files changed, 33 insertions(+), 11 deletions(-)
>
> diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
> index 7aefaca876a0..99793c32abc3 100644
> --- a/fs/kernfs/file.c
> +++ b/fs/kernfs/file.c
> @@ -30,7 +30,7 @@
> * kernfs_open_node->files, which is protected by kernfs_open_file_mutex.
> */
> static DEFINE_SPINLOCK(kernfs_open_node_lock);
> -static DEFINE_MUTEX(kernfs_open_file_mutex);
> +DEFINE_MUTEX(kernfs_open_file_mutex);
^^^^^????

Why when you want things more localized are you making a lock more
global?
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index eeaa779b929c..df00a5f5a367 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -164,4 +164,22 @@ void kernfs_drain_open_files(struct kernfs_node *kn);
> */
> extern const struct inode_operations kernfs_symlink_iops;
>
> +extern struct mutex kernfs_open_file_mutex;
> +
> +static inline struct mutex *kernfs_open_file_mutex_ptr(struct kernfs_node *kn)
> +{
> + return &kernfs_open_file_mutex;
> +}
> +
> +static inline struct mutex *kernfs_open_file_mutex_lock(struct kernfs_node *kn)
> +{
> + struct mutex *lock;
> +
> + lock = kernfs_open_file_mutex_ptr(kn);
> +
> + mutex_lock(lock);
> +
> + return lock;
> +}

Why don't these functions live in fs/kernfs/file.c?

Eric

2022-03-21 10:24:23

by Imran Khan

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 7/8] kernfs: Replace per-fs rwsem with hashed rwsems.

Hello Al,
Thanks again for reviewing this.

On 18/3/22 11:07 am, Al Viro wrote:
> On Thu, Mar 17, 2022 at 06:26:11PM +1100, Imran Khan wrote:
>
>> diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
>> index 9d4103602554..cbdd1be5f0a8 100644
>> --- a/fs/kernfs/symlink.c
>> +++ b/fs/kernfs/symlink.c
>> @@ -113,12 +113,19 @@ static int kernfs_getlink(struct inode *inode, char *path)
>> struct kernfs_node *kn = inode->i_private;
>> struct kernfs_node *parent = kn->parent;
>> struct kernfs_node *target = kn->symlink.target_kn;
>> - struct rw_semaphore *rwsem;
>> + struct kernfs_rwsem_token token;
>> int error;
>>
>> - rwsem = kernfs_down_read(parent);
>> + /**
>> + * Lock both parent and target, to avoid their movement
>> + * or removal in the middle of path construction.
>> + * If a competing remove or rename for parent or target
>> + * wins, it will be reflected in result returned from
>> + * kernfs_get_target_path.
>> + */
>> + kernfs_down_read_double_nodes(target, parent, &token);
>> error = kernfs_get_target_path(parent, target, path);
>> - kernfs_up_read(rwsem);
>> + kernfs_up_read_double_nodes(target, parent, &token);
>>
>> return error;
>> }
>
> No. Read through the kernfs_get_target_path(). Why would locking these
> two specific nodes be sufficient for anything useful? That code relies
> upon ->parent of *many* nodes being stable. Which is not going to be
> guaranteed by anything of that sort.
>
> And it's not just "we might get garbage if we race" - it's "we might
> walk into kfree'd object and proceed to walk the pointer chain".
>
> Or have this loop
>     kn = target;
>     while (kn->parent && kn != base) {
>         len += strlen(kn->name) + 1;
>         kn = kn->parent;
>     }
> see the names that are not identical to what we see in
>     kn = target;
>     while (kn->parent && kn != base) {
>         int slen = strlen(kn->name);
>
>         len -= slen;
>         memcpy(s + len, kn->name, slen);
>         if (len)
>             s[--len] = '/';
>
>         kn = kn->parent;
>     }
> done later in the same function. With obvious unpleasant effects.
> Or a different set of nodes, for that matter.
>
> This code really depends upon the tree being stable. No renames of
> any sort allowed during that thing.

Yes. My earlier approach is wrong.

This patch set has also introduced a per-fs mutex (kernfs_rm_mutex),
which should fix the problem of an inconsistent tree view as far as
kernfs_get_path is concerned.
Acquiring kernfs_rm_mutex before invoking kernfs_get_path in
kernfs_getlink will ensure that kernfs_get_path gets a consistent
view of ->parent for all nodes from root to target. This is because
holding kernfs_rm_mutex ensures that __kernfs_remove does not remove
any kernfs_node (or the parent of a kernfs_node). Further, it ensures
that kernfs_rename_ns does not move any kernfs_node. So far I have not
used the per-fs mutex in kernfs_rename_ns but I can make this change in
the next version. So the following change on top of the current patch
set should fix this issue of ->parent changing in the middle of
kernfs_get_path.


diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 1b28d99ff1c3..8095dcdd437c 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -1672,11 +1672,13 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
const char *old_name = NULL;
struct kernfs_rwsem_token token;
int error;
+ struct kernfs_root *root = kernfs_root(kn);

/* can't move or rename root */
if (!kn->parent)
return -EINVAL;

+ mutex_lock(&root->kernfs_rm_mutex);
old_parent = kn->parent;
kernfs_get(old_parent);
kernfs_down_write_triple_nodes(kn, old_parent, new_parent, &token);
@@ -1741,6 +1743,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
error = 0;
out:
kernfs_up_write_triple_nodes(kn, new_parent, old_parent, &token);
+ mutex_unlock(&root->kernfs_rm_mutex);
return error;
}

diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index cbdd1be5f0a8..805543d7a1f2 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -113,19 +113,22 @@ static int kernfs_getlink(struct inode *inode, char *path)
struct kernfs_node *kn = inode->i_private;
struct kernfs_node *parent = kn->parent;
struct kernfs_node *target = kn->symlink.target_kn;
- struct kernfs_rwsem_token token;
+ struct kernfs_root *root;
int error;

+ root = kernfs_root(kn);
+
/**
- * Lock both parent and target, to avoid their movement
- * or removal in the middle of path construction.
- * If a competing remove or rename for parent or target
- * wins, it will be reflected in result returned from
- * kernfs_get_target_path.
+ * Acquire kernfs_rm_mutex to ensure that kernfs_get_path
+ * sees correct ->parent for all nodes.
+ * We need to make sure that during kernfs_get_path parent
+ * of any node from target to root does not change. Acquiring
+ * kernfs_rm_mutex ensures that there are no concurrent remove
+ * or rename operations.
*/
- kernfs_down_read_double_nodes(target, parent, &token);
+ mutex_lock(&root->kernfs_rm_mutex);
error = kernfs_get_target_path(parent, target, path);
- kernfs_up_read_double_nodes(target, parent, &token);
+ mutex_unlock(&root->kernfs_rm_mutex);

return error;
}

Could you please let me know if you see some issues with this approach ?

Thanks
-- Imran


2022-03-21 11:52:29

by Al Viro

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 7/8] kernfs: Replace per-fs rwsem with hashed rwsems.

On Mon, Mar 21, 2022 at 12:57:07PM +1100, Imran Khan wrote:

> Yes. My earlier approach is wrong.
>
> This patch set has also introduced a per-fs mutex (kernfs_rm_mutex)
> which should fix the problem of inconsistent tree view as far as
> kernfs_get_path is concerned.
> Acquiring kernfs_rm_mutex before invoking kernfs_get_path in
> kernfs_getlink will ensure that kernfs_get_path will get a consistent
> view of ->parent of nodes from root to target. This is because acquiring
> kernfs_rm_mutex will ensure that __kernfs_remove does not remove any
> kernfs_node(or parent of kernfs_node). Further it ensures that
> kernfs_rename_ns does not move any kernfs_node. So far I have not used
> per-fs mutex in kernfs_rename_ns but I can make this change in next
> version. So following change on top of current patch set should fix
> this issue of ->parent change in the middle of kernfs_get_path.

I think it's a massive overkill. Look at kernfs_get_target_path() -
nothing in it is blocking. And you already have kernfs_rename_lock,
stabilizing the tree topology. Turn it into rwlock if you wish,
with that thing being a reader and existing users - writers.
And don't bother with further scaling, until and unless you see a real
contention on it.
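
Concretely, the conversion suggested here might look like the sketch
below: kernfs_rename_lock becomes an rwlock, kernfs_getlink() takes it
shared, and the existing users that modify ->parent/->name become
writers. Illustrative only; callers that may run with interrupts
disabled would need the _irqsave variants:

    /* was: static DEFINE_SPINLOCK(kernfs_rename_lock); */
    DEFINE_RWLOCK(kernfs_rename_lock);

    static int kernfs_getlink(struct inode *inode, char *path)
    {
        struct kernfs_node *kn = inode->i_private;
        struct kernfs_node *parent = kn->parent;
        struct kernfs_node *target = kn->symlink.target_kn;
        int error;

        /* reader: nothing in kernfs_get_target_path() blocks, and
         * the tree topology (->parent, ->name) is stable while held */
        read_lock(&kernfs_rename_lock);
        error = kernfs_get_target_path(parent, target, path);
        read_unlock(&kernfs_rename_lock);

        return error;
    }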

2022-03-21 22:04:18

by Imran Khan

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 1/8] kernfs: Introduce interface to access global kernfs_open_file_mutex.

Hello Eric,

Thanks for reviewing this.

On 19/3/22 4:10 am, Eric W. Biederman wrote:
> Imran Khan <[email protected]> writes:
>
[...]
>> static DEFINE_SPINLOCK(kernfs_open_node_lock);
>> -static DEFINE_MUTEX(kernfs_open_file_mutex);
>> +DEFINE_MUTEX(kernfs_open_file_mutex);
> ^^^^^????
>
> Why when you want things more localized are you making a lock more
> global?

The idea was to keep the interface in kernfs-internal.h. But since this
mutex is used only in kernfs/file.c, I can keep the mutex static and
move the interfaces that access this mutex into kernfs/file.c. This
will cover your other query as well.

>> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
>> index eeaa779b929c..df00a5f5a367 100644
>> --- a/fs/kernfs/kernfs-internal.h
>> +++ b/fs/kernfs/kernfs-internal.h
>> @@ -164,4 +164,22 @@ void kernfs_drain_open_files(struct kernfs_node *kn);
>> */
>> extern const struct inode_operations kernfs_symlink_iops;
>>
>> +extern struct mutex kernfs_open_file_mutex;
>> +
>> +static inline struct mutex *kernfs_open_file_mutex_ptr(struct kernfs_node *kn)
>> +{
>> + return &kernfs_open_file_mutex;
>> +}
>> +
>> +static inline struct mutex *kernfs_open_file_mutex_lock(struct kernfs_node *kn)
>> +{
>> + struct mutex *lock;
>> +
>> + lock = kernfs_open_file_mutex_ptr(kn);
>> +
>> + mutex_lock(lock);
>> +
>> + return lock;
>> +}
>
> Why don't these functions live in fs/kern/fs/file.c
>

I can put these in kernfs/file.c. Please see earlier comment.

Thanks,
-- Imran

2022-03-21 22:42:10

by Tejun Heo

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 7/8] kernfs: Replace per-fs rwsem with hashed rwsems.

Hello,

On Mon, Mar 21, 2022 at 05:55:53PM +0000, Al Viro wrote:
> Why bother with rwsem, when we don't need anything blocking under it?
> DEFINE_RWLOCK instead of DEFINE_SPINLOCK and don't make it static.

Oh I mean, in case the common readers get way too hot, percpu_rwsem is a
relatively easy way to shift the burden from the readers to the writers. I
doubt we'll need that.

> kernfs_walk_ns() - this is fucking insane; on the surface, it needs to
> be exclusive due to the use of the same static buffer. It uses that
> buffer to generate a pathname, *THEN* walks over it with strsep().
> That's an... interesting approach, for the lack of other printable
> terms - we walk the chain of ancestors, concatenating their names
> into a buffer and separating those names with slashes, then we walk
> that buffer, searching for slashes... WTF?

It takes the @parent to walk string @path from. Where does it generate the
pathname?

> kernfs_rename_ns() - exclusive; that's where the tree topology gets
> changed.

This is the only true writer and it shouldn't be difficult to convert the
others to read lock w/ e.g. dynamic allocations or percpu buffers.

> So we can just turn that spinlock into rwlock, replace the existing
> uses with read_lock()/read_unlock() in kernfs_{name,path_from_node,get_parent}
> and with write_lock()/write_unlock() in the rest of fs/kernfs/dir.c,
> make it non-static, put extern into kernfs-internal.h and there you
> go...
>
> Wait a sec; what happens if e.g. kernfs_path_from_node() races with
> __kernfs_remove()? We do _not_ clear ->parent, but we do drop references
> that used to pin what it used to point to, unless I'm misreading that
> code... Or is it somehow prevented by drain-related logics? Seeing
> that it seems to be possible to have kernfs_path_from_node() called from
> an interrupt context, that could be delicate...

kernfs_remove() is akin to freeing of the node and all its descendants. The
caller shouldn't be racing that against any other operations in the subtree.

Thanks.

--
tejun

2022-03-21 22:54:29

by Al Viro

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 7/8] kernfs: Replace per-fs rwsem with hashed rwsems.

On Mon, Mar 21, 2022 at 06:46:53AM -1000, Tejun Heo wrote:
> On Mon, Mar 21, 2022 at 07:29:45AM +0000, Al Viro wrote:
> ...
> > stabilizing the tree topology. Turn it into rwlock if you wish,
> > with that thing being a reader and existing users - writers.
> > And don't bother with further scaling, until and unless you see a real
> > contention on it.
>
> Given how rare these renames are, in the (unlikely) case the rename rwsem
> becomes a problem, we should probably just switch it to a percpu_rwsem.

Why bother with rwsem, when we don't need anything blocking under it?
DEFINE_RWLOCK instead of DEFINE_SPINLOCK and don't make it static.

Again, we already have a spinlock protecting ->parent and ->name.
Existing users:

kernfs_name() - can be shared.
kernfs_path_from_node() - can be shared.

pr_cont_kernfs_name() - exclusive, since that thing works into a static buffer.
pr_cont_kernfs_path() - exclusive, same reasons.

kernfs_get_parent() - can be shared, but its callers need to be reviewed;
that's the prime breeding ground for rename races.

kernfs_walk_ns() - this is fucking insane; on the surface, it needs to
be exclusive due to the use of the same static buffer. It uses that
buffer to generate a pathname, *THEN* walks over it with strsep().
That's an... interesting approach, for the lack of other printable
terms - we walk the chain of ancestors, concatenating their names
into a buffer and separating those names with slashes, then we walk
that buffer, searching for slashes... WTF?

kernfs_rename_ns() - exclusive; that's where the tree topology gets
changed.

So we can just turn that spinlock into rwlock, replace the existing
uses with read_lock()/read_unlock() in kernfs_{name,path_from_node,get_parent}
and with write_lock()/write_unlock() in the rest of fs/kernfs/dir.c,
make it non-static, put extern into kernfs-internal.h and there you
go...

Wait a sec; what happens if e.g. kernfs_path_from_node() races with
__kernfs_remove()? We do _not_ clear ->parent, but we do drop references
that used to pin what it used to point to, unless I'm misreading that
code... Or is it somehow prevented by drain-related logics? Seeing
that it seems to be possible to have kernfs_path_from_node() called from
an interrupt context, that could be delicate...

2022-03-21 23:33:10

by Tejun Heo

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 7/8] kernfs: Replace per-fs rwsem with hashed rwsems.

On Mon, Mar 21, 2022 at 07:29:45AM +0000, Al Viro wrote:
...
> stabilizing the tree topology. Turn it into rwlock if you wish,
> with that thing being a reader and existing users - writers.
> And don't bother with further scaling, until and unless you see a real
> contention on it.

Given how rare these renames are, in the (unlikely) case the rename rwsem
becomes a problem, we should probably just switch it to a percpu_rwsem.

Thanks.

--
tejun

2022-03-22 03:08:02

by Al Viro

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 7/8] kernfs: Replace per-fs rwsem with hashed rwsems.

On Mon, Mar 21, 2022 at 09:20:06AM -1000, Tejun Heo wrote:
> Hello,
>
> On Mon, Mar 21, 2022 at 05:55:53PM +0000, Al Viro wrote:
> > Why bother with rwsem, when we don't need anything blocking under it?
> > DEFINE_RWLOCK instead of DEFINE_SPINLOCK and don't make it static.
>
> Oh I mean, in case the common readers get way too hot, percpu_rwsem is a
> relatively easy way to shift the burden from the readers to the writers. I
> doubt we'll need that.
>
> > kernfs_walk_ns() - this is fucking insane; on the surface, it needs to
> > be exclusive due to the use of the same static buffer. It uses that
> > buffer to generate a pathname, *THEN* walks over it with strsep().
> > That's an... interesting approach, for the lack of other printable
> > terms - we walk the chain of ancestors, concatenating their names
> > into a buffer and separating those names with slashes, then we walk
> > that buffer, searching for slashes... WTF?
>
> It takes the @parent to walk string @path from. Where does it generate the
> pathname?

Sorry, misread that thing - the reason it copies the damn thing at all is
the use of strsep(). Yecch... Rule of the thumb regarding strsep() use,
be it in kernel or in the userland: don't. It's almost never the right
primitive to use.

Lookups should use qstr; it has both the length and place for hash.
Switch kernfs_find_ns() to that (and lift the calculation of length
into the callers that do not have it - note that kernfs_iop_lookup()
does) and you don't need the strsep() shite (or copying) anymore.

That would allow for kernfs_walk_ns() to take kernfs_rename_lock shared.

HOWEVER, that's not the only lock needed there and this patchset is
broken in that respect - it locks the starting node, then walks the
path. Complete with lookups in rbtrees of children in the descendants
of that node and those are *not* locked.

> > Wait a sec; what happens if e.g. kernfs_path_from_node() races with
> > __kernfs_remove()? We do _not_ clear ->parent, but we do drop references
> > that used to pin what it used to point to, unless I'm misreading that
> > code... Or is it somehow prevented by drain-related logics? Seeing
> > that it seems to be possible to have kernfs_path_from_node() called from
> > an interrupt context, that could be delicate...
>
> kernfs_remove() is akin to freeing of the node and all its descendants. The
> caller shouldn't be racing that against any other operations in the subtree.

That's interesting... My impression had been that some of these functions
could be called from interrupt contexts (judging by the spin_lock_irqsave()
in there). What kind of async contexts those are, and what do you use to
make sure they don't leak into overlap with kernfs_remove()?

In particular, if you ever use those from RCU callbacks, you need to make
sure that you have a grace period somewhere; the wait_event() you have in
kernfs_drain() won't do it.

I've tried to track the callchains that could lead there, but it gets
hairy fast.

2022-03-23 03:19:02

by Tejun Heo

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 7/8] kernfs: Replace per-fs rwsem with hashed rwsems.

Hello,

On Tue, Mar 22, 2022 at 02:40:54AM +0000, Al Viro wrote:
> Sorry, misread that thing - the reason it copies the damn thing at all is
> the use of strsep(). Yecch... Rule of the thumb regarding strsep() use,
> be it in kernel or in the userland: don't. It's almost never the right
> primitive to use.

Yeah, it's ugly. I was being lazy on top of the existing interface.

> Lookups should use qstr; it has both the length and place for hash.
> Switch kernfs_find_ns() to that (and lift the calculation of length
> into the callers that do not have it - note that kernfs_iop_lookup()
> does) and you don't need the strsep() shite (or copying) anymore.

and yeah this would be cleaner.

> That would allow for kernfs_walk_ns() to take kernfs_rename_lock shared.
>
> HOWEVER, that's not the only lock needed there and this patchset is
> broken in that respect - it locks the starting node, then walks the
> path. Complete with lookups in rbtrees of children in the descendants
> of that node and those are *not* locked.
>
> > > Wait a sec; what happens if e.g. kernfs_path_from_node() races with
> > > __kernfs_remove()? We do _not_ clear ->parent, but we do drop references
> > > that used to pin what it used to point to, unless I'm misreading that
> > > code... Or is it somehow prevented by drain-related logics? Seeing
> > > that it seems to be possible to have kernfs_path_from_node() called from
> > > an interrupt context, that could be delicate...
> >
> > kernfs_remove() is akin to freeing of the node and all its descendants. The
> > caller shouldn't be racing that against any other operations in the subtree.
>
> That's interesting... My impression had been that some of these functions
> could be called from interrupt contexts (judging by the spin_lock_irqsave()
> in there). What kind of async contexts those are, and what do you use to
> make sure they don't leak into overlap with kernfs_remove()?

The spin_lock_irqsave()'s are there because they're often used when printing
messages which can happen from any context. e.g. cpuset ends up calling into
them to print current's cgroup under rcu_read_lock(), iocost to print
warning message under an irq-safe lock. In both and similar cases, the
caller knows that the cgroup is accessible which in turn guarantees that the
kernfs node hasn't been deleted.

> In particular, if you ever use those from RCU callbacks, you need to make
> sure that you have a grace period somewhere; the wait_event() you have in
> kernfs_drain() won't do it.

I don't know of cases where they are called from rcu callbacks although
there are cases where they're used from rcu criticial sections. That said,
at least in cgroup's case, cgroup destruction path has a grace period and
that's how a rcu critical section is safe to deref a cgroup and the
kernfs_node inside it.

Thanks.

--
tejun

2022-03-24 02:03:09

by Tejun Heo

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 7/8] kernfs: Replace per-fs rwsem with hashed rwsems.

On Tue, Mar 22, 2022 at 08:26:54PM +0000, Al Viro wrote:
> The only thing that matter wrt rcu_read_lock() is that we can't block there;
> there are tons of plain spin_lock() calls done in those conditions. And
> rcu_read_lock() doesn't disable interrupts, so spin_lock_irq() is usable
> under it. Now, holding another spinlock with spin_lock_irq{,save}() *does*
> prohibit the use of spin_lock_irq() - there you can use only spin_lock()
> or spin_lock_irqsave().

Yeah, I was just listing different cases to make the point that these
functions don't know what context they might get called in.

> The callchains that prohibit spin_lock() do exist - for example, there's
> pr_cont_kernfs_path <- pr_cont_cgroup_path <- transfer_surpluses <- ioc_timer_fn.

Yeap.

> Out of curiosity, what guarantees that kernfs_remove() won't do
> fun things to ancestors of iocg_to_blkg(iocg)->blkcg->css.cgroup for some
> iocg in ioc->active_iocgs, until after ioc_rqos_exit(ioc) has finished
> del_timer_sync()?

Ah, okay, I was wrong before when I said that kernfs_remove() is like free.
It puts the base reference but as long as anybody has a ref on a
kernfs_node, the node itself and all its parents are pinned. For iocost,
ioc_pd_free() which is called from __blkg_release() ensures that the iocg is
off the active list. __blkg_release() puts its css reference before calling
ioc_pd_free(). If this were the last css of the cgroup, this can trigger the
destruction of cgroup, so the order doesn't seem right - we should call
blkg_free() first and then put the references. However, given that both css
and cgroup release paths involve a RCU grace period and __blkg_release() is
called from rcu callback, I don't think the race window actually exists.
I'll still send a patch to reorder the puts.

Thanks.

--
tejun

2022-03-24 21:30:03

by Al Viro

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 7/8] kernfs: Replace per-fs rwsem with hashed rwsems.

On Tue, Mar 22, 2022 at 07:08:58AM -1000, Tejun Heo wrote:

> > That's interesting... My impression had been that some of these functions
> > could be called from interrupt contexts (judging by the spin_lock_irqsave()
> > in there). What kind of async contexts those are, and what do you use to
> > make sure they don't leak into overlap with kernfs_remove()?
>
> The spin_lock_irqsave()'s are there because they're often used when printing
> messages which can happen from any context. e.g. cpuset ends up calling into
> them to print current's cgroup under rcu_read_lock(), iocost to print
> warning message under an irq-safe lock. In both and similar cases, the
> caller knows that the cgroup is accessible which in turn guarantees that the
> kernfs node hasn't been deleted.

Wait a sec. Choice of spin_lock_irqsave() vs. spin_lock_irq() is affected by
having it called with interrupts disabled; choice of either vs. spin_lock()
is not - that's needed only if you might end up taking the spinlock in question
from an interrupt handler. "Under rcu_read_lock()" is irrelevant here...

The point of spin_lock_irq/spin_lock_irqsave is the prevention of
spin_lock(&LOCK); // locked
take an interrupt, enter interrupt handler and there run into
spin_lock(&LOCK); // and we spin forever
If there's no users in interrupt contexts, we are just fine with plain
spin_lock().

The only thing that matter wrt rcu_read_lock() is that we can't block there;
there are tons of plain spin_lock() calls done in those conditions. And
rcu_read_lock() doesn't disable interrupts, so spin_lock_irq() is usable
under it. Now, holding another spinlock with spin_lock_irq{,save}() *does*
prohibit the use of spin_lock_irq() - there you can use only spin_lock()
or spin_lock_irqsave().

The callchains that prohibit spin_lock() do exist - for example, there's
pr_cont_kernfs_path <- pr_cont_cgroup_path <- transfer_surpluses <- ioc_timer_fn.

Out of curiosity, what guarantees that kernfs_remove() won't do
fun things to ancestors of iocg_to_blkg(iocg)->blkcg->css.cgroup for some
iocg in ioc->active_iocgs, until after ioc_rqos_exit(ioc) has finished
del_timer_sync()?

2022-03-28 11:50:35

by Imran Khan

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 7/8] kernfs: Replace per-fs rwsem with hashed rwsems.

Hello Al,

On 22/3/22 1:40 pm, Al Viro wrote:
[...]
>
> Sorry, misread that thing - the reason it copies the damn thing at all is
> the use of strsep(). Yecch... Rule of the thumb regarding strsep() use,
> be it in kernel or in the userland: don't. It's almost never the right
> primitive to use.
>
> Lookups should use qstr; it has both the length and place for hash.
> Switch kernfs_find_ns() to that (and lift the calculation of length
> into the callers that do not have it - note that kernfs_iop_lookup()
> does) and you don't need the strsep() shite (or copying) anymore.
>

Regarding using qstr in kernfs_find_ns, do you mean that I should replace
->name and ->hash with a ->qstr in kernfs_node and further modify
kernfs_name_compare to make use of qstr.name and qstr.hash?

Also the suggestion about removing buffer copying from kernfs_walk_ns is
not clear to me, because kernfs_walk_ns invokes kernfs_find_ns with
individual path components, so we need some mechanism to separate path
components.

Sorry if I have missed or misunderstood something in your suggestion.

Thanks,
-- Imran

2022-03-28 20:15:03

by Tejun Heo

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 7/8] kernfs: Replace per-fs rwsem with hashed rwsems.

On Mon, Mar 28, 2022 at 11:15:30AM +1100, Imran Khan wrote:
> Hello Al,
>
> On 22/3/22 1:40 pm, Al Viro wrote:
> [...]
> >
> > Sorry, misread that thing - the reason it copies the damn thing at all is
> > the use of strsep(). Yecch... Rule of the thumb regarding strsep() use,
> > be it in kernel or in the userland: don't. It's almost never the right
> > primitive to use.
> >
> > Lookups should use qstr; it has both the length and place for hash.
> > Switch kernfs_find_ns() to that (and lift the calculation of length
> > into the callers that do not have it - note that kernfs_iop_lookup()
> > does) and you don't need the strsep() shite (or copying) anymore.
> >
>
> Regarding using qstr in kernfs_find_ns, do you mean that I should replace
> ->name and ->hash with a ->qstr in kernfs_node and further modify
> kernfs_name_compare to make use of qstr.name and qstr.hash?
>
> Also the suggestion about removing buffer copying from kernfs_walk_ns is
> not clear to me, because kernfs_walk_ns invokes kernfs_find_ns with
> individual path components, so we need some mechanism to separate path
> components.
>
> Sorry if I have missed or misunderstood something in your suggestion.

qstr encodes the length of the string and doesn't need the terminating '\0',
so we can just walk a qstr over the passed-in path instead of copying and
overwriting '\0'. This is rather separate and we can revisit this later tho.

Thanks.

--
tejun
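
To make that concrete, a strsep()-free walk could take a qstr view into
the caller's string one component at a time. In this sketch,
kernfs_find_ns_q() stands in for the assumed qstr-taking variant of
kernfs_find_ns(); illustrative only:

    static struct kernfs_node *kernfs_walk_ns_sketch(struct kernfs_node *parent,
                                                     const char *path,
                                                     const void *ns)
    {
        while (parent && *path) {
            const char *end = strchrnul(path, '/');
            struct qstr q = QSTR_INIT(path, end - path);

            if (q.len)    /* tolerate "a//b" */
                parent = kernfs_find_ns_q(parent, &q, ns);
            path = *end ? end + 1 : end;
        }
        return parent;
    }

No copy into kernfs_pr_cont_buf and no '\0' overwriting, so no exclusive
lock is needed for the buffer's sake.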

2022-03-30 11:57:34

by Imran Khan

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 7/8] kernfs: Replace per-fs rwsem with hashed rwsems.

Hello Al, Hello Tejun,

On 22/3/22 1:40 pm, Al Viro wrote:
> On Mon, Mar 21, 2022 at 09:20:06AM -1000, Tejun Heo wrote:
>> Hello,
>>
>> On Mon, Mar 21, 2022 at 05:55:53PM +0000, Al Viro wrote:
>>> Why bother with rwsem, when we don't need anything blocking under it?
>>> DEFINE_RWLOCK instead of DEFINE_SPINLOCK and don't make it static.
>>
>> Oh I mean, in case the common readers get way too hot, percpu_rwsem is a
>> relatively easy way to shift the burden from the readers to the writers. I
>> doubt we'll need that.
>>
>>> kernfs_walk_ns() - this is fucking insane; on the surface, it needs to
>>> be exclusive due to the use of the same static buffer. It uses that
>>> buffer to generate a pathname, *THEN* walks over it with strsep().
>>> That's an... interesting approach, for the lack of other printable
>>> terms - we walk the chain of ancestors, concatenating their names
>>> into a buffer and separating those names with slashes, then we walk
>>> that buffer, searching for slashes... WTF?
>>
>> It takes the @parent to walk string @path from. Where does it generate the
>> pathname?
>
> Sorry, misread that thing - the reason it copies the damn thing at all is
> the use of strsep(). Yecch... Rule of the thumb regarding strsep() use,
> be it in kernel or in the userland: don't. It's almost never the right
> primitive to use.
>
> Lookups should use qstr; it has both the length and place for hash.
> Switch kernfs_find_ns() to that (and lift the calculation of length
> into the callers that do not have it - note that kernfs_iop_lookup()
> does) and you don't need the strsep() shite (or copying) anymore.
>
> That would allow for kernfs_walk_ns() to take kernfs_rename_lock shared.
>
> HOWEVER, that's not the only lock needed there and this patchset is
> broken in that respect - it locks the starting node, then walks the
> path. Complete with lookups in rbtrees of children in the descendants
> of that node and those are *not* locked.
>
Yes. This was wrong. I have tried to fix it by dropping the lock of the
previous parent and taking the lock of the current parent before each
invocation of kernfs_find_ns from kernfs_walk_ns. However this does not
look feasible because we are already under a spinlock (kernfs_rename_lock).
This limitation will still be there even after changing
kernfs_rename_lock to a read-write lock.
I have thought of ways to fix this but have not yet found a solution.
I am checking further, but in the meantime if you have some suggestions
please let me know.

Thanks
-- Imran

2022-04-05 06:23:32

by Imran Khan

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 1/8] kernfs: Introduce interface to access global kernfs_open_file_mutex.

Hello Al,

On 18/3/22 8:34 am, Al Viro wrote:
> On Thu, Mar 17, 2022 at 06:26:05PM +1100, Imran Khan wrote:
>
>> @@ -570,9 +571,10 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
>> struct kernfs_open_file *of)
[...]

> As the matter of fact, we can do even better - make freeing
> that thing rcu-delayed, use rcu_assign_pointer() for stores,
> rcu_dereference() for loads and have kernfs_notify() do
>     rcu_read_lock();
>     on = rcu_dereference(kn->attr.open);
>     if (on) {
>         atomic_inc(&on->event);
>         wake_up_interruptible(&on->poll);
>     }
>     rcu_read_unlock();
> and kernfs_open_node_lock becomes useless - all places that
> grab it are under kernfs_open_file_mutex.

There are some issues in freeing ->attr.open under an RCU callback. There
are some users of ->attr.open that can block and hence can't operate
under rcu_read_lock. For example, in kernfs_drain_open_files we are
traversing the list of open files corresponding to ->attr.open and
unmapping those as well. The unmapping operation can block in
i_mmap_lock_write. So even after removing the refcnt we will still need
kernfs_open_node_lock.

Thanks,
-- Imran


2022-04-06 09:33:53

by Al Viro

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 1/8] kernfs: Introduce interface to access global kernfs_open_file_mutex.

On Tue, Apr 05, 2022 at 03:36:03PM +1000, Imran Khan wrote:
> Hello Al,
>
> On 18/3/22 8:34 am, Al Viro wrote:
> > On Thu, Mar 17, 2022 at 06:26:05PM +1100, Imran Khan wrote:
> >
> >> @@ -570,9 +571,10 @@ static void kernfs_put_open_node(struct kernfs_node *kn,
> >> struct kernfs_open_file *of)
> [...]
>
> > As the matter of fact, we can do even better - make freeing
> > that thing rcu-delayed, use rcu_assign_pointer() for stores,
> > rcu_dereference() for loads and have kernfs_notify() do
> >     rcu_read_lock();
> >     on = rcu_dereference(kn->attr.open);
> >     if (on) {
> >         atomic_inc(&on->event);
> >         wake_up_interruptible(&on->poll);
> >     }
> >     rcu_read_unlock();
> > and kernfs_open_node_lock becomes useless - all places that
> > grab it are under kernfs_open_file_mutex.
>
> There are some issues in freeing ->attr.open under RCU callback.

Such as?

> There
> are some users of ->attr.open that can block and hence can't operate
> under rcu_read_lock. For example in kernfs_drain_open_files we are
> traversing list of open files corresponding to ->attr.open and unmapping
> those as well. The unmapping operation can block in i_mmap_lock_write.

Yes.

> So even after removing refcnt we will still need kernfs_open_node_lock.

What for? Again, have kernfs_drain_open_files() do this:
{
    struct kernfs_open_node *on;
    struct kernfs_open_file *of;

    if (!(kn->flags & (KERNFS_HAS_MMAP | KERNFS_HAS_RELEASE)))
        return;
    if (rcu_dereference(kn->attr.open) == NULL)
        return;
    mutex_lock(&kernfs_open_file_mutex);
    // now ->attr.open is stable (all stores are under kernfs_open_file_mutex)
    on = rcu_dereference(kn->attr.open);
    if (!on) {
        mutex_unlock(&kernfs_open_file_mutex);
        return;
    }
    // on->files contents is stable
    list_for_each_entry(of, &on->files, list) {
        struct inode *inode = file_inode(of->file);

        if (kn->flags & KERNFS_HAS_MMAP)
            unmap_mapping_range(inode->i_mapping, 0, 0, 1);

        if (kn->flags & KERNFS_HAS_RELEASE)
            kernfs_release_file(kn, of);
    }
    mutex_unlock(&kernfs_open_file_mutex);
}

What's the problem? The caller has already guaranteed that no additions will
happen. Once we'd grabbed kernfs_open_file_mutex, we know that
* kn->attr.open value won't change until we drop the mutex
* nothing gets removed from kn->attr.open->files until we drop the mutex
so we can bloody well walk that list, blocking as much as we want.

We don't need rcu_read_lock() there - we are already holding the mutex used
by writers for exclusion among themselves. RCU *allows* lockless readers,
it doesn't require all readers to be such. kernfs_notify() can be made
lockless, this one can't and that's fine.

BTW, speaking of kernfs_notify() - can calls of that come from NMI handlers?
If not, I'd consider using llist for kernfs_notify_list...
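
Sketched out, an llist-based kernfs_notify_list might look roughly like
the following; attr.notify_next is an assumed new llist_node member, the
fsnotify body is elided, and a real conversion would also have to prevent
the same node being queued twice (illustrative only):

    static LLIST_HEAD(kernfs_notify_list);

    static void kernfs_notify_workfn(struct work_struct *work)
    {
        struct llist_node *batch = llist_del_all(&kernfs_notify_list);
        struct kernfs_node *kn, *tmp;

        llist_for_each_entry_safe(kn, tmp, batch, attr.notify_next) {
            /* ... kick fsnotify under the rwsem, as before ... */
            kernfs_put(kn);
        }
    }
    static DECLARE_WORK(kernfs_notify_work, kernfs_notify_workfn);

    void kernfs_notify(struct kernfs_node *kn)
    {
        /* ... flag checks as before ... */
        kernfs_get(kn);
        /* llist_add() returns true if the list was empty, so the
         * work item gets scheduled exactly once per batch */
        if (llist_add(&kn->attr.notify_next, &kernfs_notify_list))
            schedule_work(&kernfs_notify_work);
    }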

2022-04-06 15:27:45

by Imran Khan

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 1/8] kernfs: Introduce interface to access global kernfs_open_file_mutex.

Hello Al,

On 6/4/22 12:24 am, Al Viro wrote:
[...]
>
> What for? Again, have kernfs_drain_open_files() do this:
> {
> struct kernfs_open_node *on;
> struct kernfs_open_file *of;
>
> if (!(kn->flags & (KERNFS_HAS_MMAP | KERNFS_HAS_RELEASE)))
> return;
> if (rcu_dereference(kn->attr.open) == NULL)
> return;
> mutex_lock(&kernfs_open_file_mutex);
> // now ->attr.open is stable (all stores are under kernfs_open_file_mutex)
> on = rcu_dereference(kn->attr.open);
> if (!on) {
> mutex_unlock(&kernfs_open_file_mutex);
> return;
> }
> // on->files contents is stable
> list_for_each_entry(of, &on->files, list) {
> struct inode *inode = file_inode(of->file);
>
> if (kn->flags & KERNFS_HAS_MMAP)
> unmap_mapping_range(inode->i_mapping, 0, 0, 1);
>
> if (kn->flags & KERNFS_HAS_RELEASE)
> kernfs_release_file(kn, of);
> }
> mutex_unlock(&kernfs_open_file_mutex);
> }
>

I did something similar in [1], except that I was traversing
on->files under rcu_read_lock and this was a source of confusion.

> What's the problem? The caller has already guaranteed that no additions will
> happen. Once we'd grabbed kernfs_open_file_mutex, we know that
> * kn->attr.open value won't change until we drop the mutex
> * nothing gets removed from kn->attr.open->files until we drop the mutex
> so we can bloody well walk that list, blocking as much as we want.
>
> We don't need rcu_read_lock() there - we are already holding the mutex used
> by writers for exclusion among themselves. RCU *allows* lockless readers,
> it doesn't require all readers to be such. kernfs_notify() can be made
> lockless, this one can't and that's fine.
>

Thanks for explaining this. I missed the exclusiveness being provided by
kernfs_open_file_mutex in this case.

> BTW, speaking of kernfs_notify() - can calls of that come from NMI handlers?
> If not, I'd consider using llist for kernfs_notify_list...

I see it gets invoked from 3 places only: cgroup_file_notify,
sysfs_notify and sysfs_notify_dirent. So kernfs_notify should not be
getting invoked in NMI context. I will make the llist transition in the
next version.

Thanks,
-- Imran

[1]
https://lore.kernel.org/lkml/[email protected]/

2022-04-06 17:09:57

by Al Viro

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 1/8] kernfs: Introduce interface to access global kernfs_open_file_mutex.

On Wed, Apr 06, 2022 at 02:54:19PM +1000, Imran Khan wrote:

> > BTW, speaking of kernfs_notify() - can calls of that come from NMI handlers?
> > If not, I'd consider using llist for kernfs_notify_list...
>
> I see it gets invoked from 3 places only: cgroup_file_notify,
> sysfs_notify and sysfs_notify_dirent. So kernfs_notify should not be
> getting invoked in NMI context. I will make the llist transition in next
> version.

Er... Are you sure neither of those is ever called from something that is
called from .... from NMI?

It might never happen, but there are plenty of callchains leading to that
thing and no obvious obstacles for some of those to come from NMI context;
I don't see it documented anywhere either.

Tejun, could you comment on that one?

2022-04-06 17:18:51

by Tejun Heo

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 1/8] kernfs: Introduce interface to access global kernfs_open_file_mutex.

Hello,

On Wed, Apr 06, 2022 at 02:54:44PM +0000, Al Viro wrote:
> On Wed, Apr 06, 2022 at 02:54:19PM +1000, Imran Khan wrote:
>
> > > BTW, speaking of kernfs_notify() - can calls of that come from NMI handlers?
> > > If not, I'd consider using llist for kernfs_notify_list...
> >
> > I see it gets invoked from only 3 places: cgroup_file_notify,
> > sysfs_notify and sysfs_notify_dirent. So kernfs_notify should not be
> > getting invoked in NMI context. I will make the llist transition in the
> > next version.
>
> Er... Are you sure neither of those is ever called from something that is
> called from .... from NMI?
>
> It might never happen, but there's a plenty of callchains leading to that
> thing and no obvious obstacles for some of those to come from NMI context;
> I don't see it documented anywhere either.
>
> Tejun, could you comment on that one?

I don't know of any case off the top of my head and, as expected, all the
common cases don't involve NMI. If we're worried about being called from
NMIs, I'd go for just adding a WARN_ON_ONCE(in_nmi()).
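
A sketch of that guard; its placement at the top of kernfs_notify() is
an assumption for illustration:

void kernfs_notify(struct kernfs_node *kn)
{
	/* Complain loudly, but only once, if this is ever reached from
	 * NMI context - nothing downstream is audited for NMI safety. */
	if (WARN_ON_ONCE(in_nmi()))
		return;

	/* ... queue the notification as before ... */
}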

Thanks.

--
tejun

2022-04-14 14:14:18

by Imran Khan

[permalink] [raw]
Subject: Re: [RESEND PATCH v7 1/8] kernfs: Introduce interface to access global kernfs_open_file_mutex.

Hello Al, Hello Tejun,

I have sent v8 of the patchset at [1]. I have incorporated your
suggestions and have also addressed the issue of not locking the correct
nodes during kernfs_walk_ns.
I have not yet made the changes to make kernfs_find_ns use qstr because
this part is not clear to me. My understanding is that kernfs_find_ns
looks for a node of a given name under a parent, so we need a buffer
in kernfs_walk_ns to hold the full path and then use strsep to take each
path component and look it up under the parent (the node obtained during
the previous iteration). I am surely missing something from your
suggestion about using qstr and removing strsep, but I am not sure what.
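
In concrete terms, the loop described above would be something like the
sketch below; kernfs_find_ns is the existing per-component lookup, the
rest (function name, buffer handling) is illustrative:

/* Walk 'path' one component at a time under 'parent'.  strsep()
 * modifies the string it parses, so the caller must pass a writable
 * scratch copy of the full path. */
static struct kernfs_node *kernfs_walk_sketch(struct kernfs_node *parent,
					      char *path, const void *ns)
{
	struct kernfs_node *kn = parent;
	char *name;

	while ((name = strsep(&path, "/")) != NULL) {
		if (!*name)			/* tolerate "//" etc. */
			continue;
		kn = kernfs_find_ns(kn, name, ns);
		if (!kn)
			return NULL;		/* component not found */
	}
	return kn;
}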

Could you please have a look at the current version and let me know your
feedback?

Thanks,
-- Imran

[1]
https://lore.kernel.org/lkml/[email protected]/

On 6/4/22 2:54 pm, Imran Khan wrote:
> Hello Al,
>
> On 6/4/22 12:24 am, Al Viro wrote:
> [...]
>>
>> What for? Again, have kernfs_drain_open_files() do this:
>> {
>> struct kernfs_open_node *on;
>> struct kernfs_open_file *of;
>>
>> if (!(kn->flags & (KERNFS_HAS_MMAP | KERNFS_HAS_RELEASE)))
>> return;
>> if (rcu_dereference(kn->attr.open) == NULL)
>> return;
>> mutex_lock(&kernfs_open_file_mutex);
>> // now ->attr.open is stable (all stores are under kernfs_open_file_mutex)
>> on = rcu_dereference(kn->attr.open);
>> if (!on) {
>> mutex_unlock(&kernfs_open_file_mutex);
>> return;
>> }
>> // on->files contents is stable
>> list_for_each_entry(of, &on->files, list) {
>> struct inode *inode = file_inode(of->file);
>>
>> if (kn->flags & KERNFS_HAS_MMAP)
>> unmap_mapping_range(inode->i_mapping, 0, 0, 1);
>>
>> if (kn->flags & KERNFS_HAS_RELEASE)
>> kernfs_release_file(kn, of);
>> }
>> mutex_unlock(&kernfs_open_file_mutex);
>> }
>>
>
> I did something similar in [1], except that I was traversing
> on->files under rcu_read_lock() and this was a source of confusion.
>
>> What's the problem? The caller has already guaranteed that no additions will
>> happen. Once we'd grabbed kernfs_open_file_mutex, we know that
>> * kn->attr.open value won't change until we drop the mutex
>> * nothing gets removed from kn->attr.open->files until we drop the mutex
>> so we can bloody well walk that list, blocking as much as we want.
>>
>> We don't need rcu_read_lock() there - we are already holding the mutex used
>> by writers for exclusion among themselves. RCU *allows* lockless readers,
>> it doesn't require all readers to be such. kernfs_notify() can be made
>> lockless, this one can't and that's fine.
>>
>
> Thanks for explaining this. I missed the mutual exclusion provided by
> kernfs_open_file_mutex in this case.
>
>> BTW, speaking of kernfs_notify() - can calls of that come from NMI handlers?
>> If not, I'd consider using llist for kernfs_notify_list...
>
> I see it gets invoked from only 3 places: cgroup_file_notify,
> sysfs_notify and sysfs_notify_dirent. So kernfs_notify should not be
> getting invoked in NMI context. I will make the llist transition in the
> next version.
>
> Thanks,
> -- Imran
>
> [1]
> https://lore.kernel.org/lkml/[email protected]/