2022-08-10 11:14:48

by Imran Khan

[permalink] [raw]
Subject: [RESEND PATCH 0/5] kernfs: Use a per-fs rwsem to protect per-fs list of kernfs_super_info

This patchset contains second subset of patches (after addressing review
comments) discussed at [1]. As per [2], the patches of [1] were divided
into 2 parts so that they can be reviewed feature by feature.
Patches of [2] are now in linux-next. So this patchset is for resuming
review (after addressing latest review comments) of rest of the patches
of [1].

The patches in this change set are as follows:

PATCH-1: Use a per-fs rwsem to protect per-fs list of kernfs_super_info.

PATCH-2: Change kernfs_rename_lock into a read-write lock.

PATCH-3: Introduce interface to access per-fs rwsem.

PATCH-4: Replace per-fs rwsem with hashed rwsems.

PATCH-5: Add a document to describe hashed locks used in kernfs.


Original cover letter
---------------------------------------------------------------
Reduce contention around global locks used in kernfs.

The current kernfs design makes use of 3 global locks to synchronize
various operations. There are a few other global locks as well but
they are used for specific cases and hence don't cause severe contention.

The above mentioned 3 main global locks used in kernfs are:

1. A global mutex, kernfs_open_file_mutex, to protect the list of
kernfs_open_file instances correspondng to a sysfs attribute.

2. A global spinlock, kernfs_open_node_lock, to protect
kernfs_node->attr.open which points to kernfs_open_node instance
corresponding to a kernfs_node.

3. A global per-fs rw semaphore, root->kernfs_rwsem, to synchronize most
of the other operations like adding, removing, renaming etc. of a file
or directory.

Since these locks are global, they can cause contention when multiple
(for example few hundred) applications try to access sysfs (or other kernfs
based file system) files in parallel, even if the applications are
accessing different and unrelated files.

For example on a system with 384 cores, if I run 200 instances of an
application which is mostly executing the following loop:

for (int loop = 0; loop <100 ; loop++)
{
for (int port_num = 1; port_num < 2; port_num++)
{
for (int gid_index = 0; gid_index < 254; gid_index++ )
{
char ret_buf[64], ret_buf_lo[64];
char gid_file_path[1024];

int ret_len;
int ret_fd;
ssize_t ret_rd;

ub4 i, saved_errno;

memset(ret_buf, 0, sizeof(ret_buf));
memset(gid_file_path, 0, sizeof(gid_file_path));

ret_len = snprintf(gid_file_path, sizeof(gid_file_path),
"/sys/class/infiniband/%s/ports/%d/gids/%d",
dev_name,
port_num,
gid_index);

ret_fd = open(gid_file_path, O_RDONLY | O_CLOEXEC);
if (ret_fd < 0)
{
printf("Failed to open %s\n", gid_file_path);
continue;
}

/* Read the GID */
ret_rd = read(ret_fd, ret_buf, 40);

if (ret_rd == -1)
{
printf("Failed to read from file %s, errno: %u\n",
gid_file_path, saved_errno);

continue;
}

close(ret_fd);
}
}

I can see contention around above mentioned locks as follows:

- 54.07% 53.60% showgids [kernel.kallsyms] [k] osq_lock
- 53.60% __libc_start_main
- 32.29% __GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
path_openat
vfs_open
do_dentry_open
kernfs_fop_open
mutex_lock
- __mutex_lock_slowpath
- 32.23% __mutex_lock.isra.5
osq_lock
- 21.31% __GI___libc_close
entry_SYSCALL_64_after_hwframe
do_syscall_64
exit_to_usermode_loop
task_work_run
____fput
__fput
kernfs_fop_release
kernfs_put_open_node.isra.8
mutex_lock
- __mutex_lock_slowpath
- 21.28% __mutex_lock.isra.5
osq_lock

- 10.49% 10.39% showgids [kernel.kallsyms] [k] down_read
10.39% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 9.72% link_path_walk
- 5.21% inode_permission
- __inode_permission
- 5.21% kernfs_iop_permission
down_read
- 4.08% walk_component
lookup_fast
- d_revalidate.part.24
- 4.08% kernfs_dop_revalidate

- 7.48% 7.41% showgids [kernel.kallsyms] [k] up_read
7.41% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 7.01% link_path_walk
- 4.12% inode_permission
- __inode_permission
- 4.12% kernfs_iop_permission
up_read
- 2.61% walk_component
lookup_fast
- d_revalidate.part.24
- 2.61% kernfs_dop_revalidate

Moreover this run of 200 application isntances takes 32-34 secs. to
complete.

This patch set is reducing the above mentioned contention by replacing
these global locks with hashed locks.

For example with the patched kernel and on the same test setup, we no
longer see contention around osq_lock (i.e kernfs_open_file_mutex) and also
contention around per-fs kernfs_rwsem has reduced significantly as well.
This can be seen in the following perf snippet:

- 1.66% 1.65% showgids [kernel.kallsyms] [k] down_read
1.65% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 1.62% link_path_walk
- 0.98% inode_permission
- __inode_permission
+ 0.98% kernfs_iop_permission
- 0.52% walk_component
lookup_fast
- d_revalidate.part.24
- 0.52% kernfs_dop_revalidate

- 1.12% 1.11% showgids [kernel.kallsyms] [k] up_read
1.11% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 1.11% link_path_walk
- 0.69% inode_permission
- __inode_permission
- 0.69% kernfs_iop_permission
up_read

Moreover the test execution time has reduced from 32-34 secs to 15-16 secs.

The patches of this patchset introduce following changes:

PATCH-1: Remove reference counting from kernfs_open_node.

PATCH-2: Make kernfs_open_node->attr.open RCU protected.

PATCH-3: Change kernfs_notify_list to llist.

PATCH-4: Introduce interface to access kernfs_open_file_mutex.

PATCH-5: Replace global kernfs_open_file_mutex with hashed mutexes.

PATCH-6: Use a per-fs rwsem to protect per-fs list of kernfs_super_info.

PATCH-7: Change kernfs_rename_lock into a read-write lock.

PATCH-8: Introduce interface to access per-fs rwsem.

PATCH-9: Replace per-fs rwsem with hashed rwsems.

PATCH-10: Add a document to describe hashed locks used in kernfs.

------------------------------------------------------------------

I did not receive any feedback earlier so resending
after rebasing on tag next-20220810 of linux-next.

Imran Khan (5):
kernfs: Use a per-fs rwsem to protect per-fs list of
kernfs_super_info.
kernfs: Change kernfs_rename_lock into a read-write lock.
kernfs: Introduce interface to access per-fs rwsem.
kernfs: Replace per-fs rwsem with hashed rwsems.
kernfs: Add a document to describe hashed locks used in kernfs.

.../filesystems/kernfs-hashed-locks.rst | 214 ++++++++++++++
fs/kernfs/Makefile | 2 +-
fs/kernfs/dir.c | 269 ++++++++++++------
fs/kernfs/file.c | 7 +-
fs/kernfs/inode.c | 48 +++-
fs/kernfs/kernfs-internal.c | 259 +++++++++++++++++
fs/kernfs/kernfs-internal.h | 122 +++++++-
fs/kernfs/mount.c | 23 +-
fs/kernfs/symlink.c | 13 +-
include/linux/kernfs.h | 1 +
10 files changed, 845 insertions(+), 113 deletions(-)
create mode 100644 Documentation/filesystems/kernfs-hashed-locks.rst
create mode 100644 fs/kernfs/kernfs-internal.c


base-commit: bc6c6584ffb27b62e19ea89553b22b4cad1abaca
--
2.30.2


2022-08-10 11:15:28

by Imran Khan

[permalink] [raw]
Subject: [RESEND PATCH 5/5] kernfs: Add a document to describe hashed locks used in kernfs.

This document describes usage and proof of various hashed locks
introduced in this patch set.

Signed-off-by: Imran Khan <[email protected]>
---
.../filesystems/kernfs-hashed-locks.rst | 214 ++++++++++++++++++
1 file changed, 214 insertions(+)
create mode 100644 Documentation/filesystems/kernfs-hashed-locks.rst

diff --git a/Documentation/filesystems/kernfs-hashed-locks.rst b/Documentation/filesystems/kernfs-hashed-locks.rst
new file mode 100644
index 000000000000..a000e6fcf78c
--- /dev/null
+++ b/Documentation/filesystems/kernfs-hashed-locks.rst
@@ -0,0 +1,214 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+===================
+kernfs hashed locks
+===================
+
+kernfs uses following hashed locks
+
+1. Hashed mutexes
+2. Hashed rwsem
+
+In certain cases hashed rwsem needs to work in conjunction with a per-fs mutex
+(Described further below).So this document describes this mutex as well.
+
+A kernfs_global_locks object (defined below) provides hashed mutexes,
+hashed spinlocks and hashed rwsems.
+
+ struct kernfs_global_locks {
+ struct mutex open_file_mutex[NR_KERNFS_LOCKS];
+ struct rw_semaphore kernfs_rwsem[NR_KERNFS_LOCKS];
+ };
+
+For all hashed locks address of a kernfs_node object acts as hashing key.
+
+For the remainder of this document a node means a kernfs_node object. The
+node can refer to a file, directory or symlink of a kernfs based file system.
+Also a node's mutex or rwsem refers to hashed mutex, or hashed rwsem
+corresponding to the node. It does not mean any locking construct embedded in
+the kernfs_node itself.
+
+What is protected by hashed locks
+=================================
+
+(1) There's one kernfs_open_file for each open file and all kernfs_open_file
+ instances corresponding to a kernfs_node are maintained in a list.
+ hashed mutexes or kernfs_global_locks.open_file_mutex[index] protects
+ this list.
+
+(2) Hashed rwsems or kernfs_global_locks.kernfs_rwsem[index] protects node's
+ state and synchronizes operations that change state of a node or depend on
+ the state of a node.
+
+(3) per-fs mutex (mentioned earlier) provides synchronization between lookup
+ and remove operations. It also protects against topology change.
+ While looking for a node we will not have address of corresponding node
+ so we can't acquire node's rwsem right from the beginning.
+ On the other hand a parallel remove operation for the same node can acquire
+ corresponding rwsem and go ahead with node removal. So it may happen that
+ search operation for the node finds and returns it but before it can be
+ pinned or used, the remove operation, that was going on in parallel, removes
+ the node and hence makes its any future use wrong.
+ per-fs mutex ensures that for competing search and remove operations only
+ one proceeds at a time and since object returned by search is pinned before
+ releasing the per-fs mutex, it will be available for subsequent usage.
+
+ This per-fs mutex also protects against topology change during path walks.
+ During path walks we need to acquire and release rwsems corresponding to
+ directories so that these directories don't move and their children RB tree
+ does not change. Since these rwsems can't be taken under a spinlock,
+ kernfs_rename_lock can't be used and needed protection against topology
+ change is provided by per-fs mutex.
+
+Lock usage and proof
+=======================
+
+(1) Hashed mutexes
+
+ Since hashed mutexes protect the list of kernfs_open_file instances
+ corresponding to a kernfs_node, ->open and ->release backends of
+ file_operations need to acquire hashed mutex corresponding to kernfs_node.
+ Also when a kernfs_node is removed, all of its kernfs_open_file instances
+ are drained after deactivating the node. This drain operation acquires
+ hashed mutex to traverse list of kernfs_open_file instances.
+ So addition (via ->open), deletion (via ->release) and traversal
+ (during kernfs_drain) of kernfs_open_file list occurs in a synchronous
+ manner.
+
+(2) Hashed rwsems
+
+ 3.1. A node's rwsem protects its state and needs to be acquired to:
+ 3.1.a. Remove the node
+ 3.1.b. Move the node
+ 3.1.c. Travers or modify a node's children RB tree (for
+ directories), i.e to add/remove files/subdirectories
+ within/from a directory.
+ 3.1.d. Modify or access node's inode attributes
+
+ 3.2. Hashed rwsems are used in following operations:
+
+ 3.2.a. Addition of a new node
+
+ While adding a new kernfs_node under a kernfs directory
+ kernfs_add_one acquires directory node's rwsem for
+ writing. Clause 3.1.a ensures that directory exists
+ throughout the operation. Clause 3.1.c ensures proper
+ updation of children rb tree (i.e ->dir.children).
+ Clause 3.1.d ensures correct modification of inode
+ attribute to reflect timestamp of this operation.
+ If the directory gets removed while waiting for semaphore,
+ the subsequent checks in kernfs_add_one will fail resulting
+ in early bail out from kernfs_add_one.
+
+ 3.2.b. Removal of a node
+
+ Removal of a node involves recursive removal of all of its
+ descendants as well. per-fs mutex (i.e kernfs_rm_mutex) avoids
+ concurrent node removals even if the nodes are different.
+
+ At first node's rwsem is acquired. Clause 3.1.c avoids parallel
+ modification of descendant tree and while holding this rwsem
+ each of the descendants are deactivated.
+
+ Once a descendant has been deactivated and drained, its parent's
+ rwsem is taken. Clause 3.1.c ensures proper unlinking of this
+ descendant from its siblings. Clause 3.1.d ensures that parent's
+ inode attributes are correctly updated to record time stamp of
+ removal.
+
+ 3.2.c. Movement of a node
+
+ Moving or renaming a node (kernfs_rename_ns) acquires rwsem for
+ node and its old and new parents. Clauses 3.1.b and 3.1.c avoid
+ concurrent move operations for the same node.
+ Also if old parent of a node changes while waiting for rwsem,
+ the acquisition of rwsem for 3 involved nodes is attempted
+ again. It is always ensured that as far as old parent is
+ concerned, rwsem corresponding to current parent is acquired.
+
+ 3.2.d. Reading a directory
+
+ For diectory reading kernfs_fop_readdir acquires directory
+ node's rwsem for reading. Clause 3.1.c ensures a consistent view
+ of children RB tree.
+ As far as directroy being read is concerned, if it gets removed
+ while waiting for semaphore, the for loop that iterates through
+ children will be ineffective. So for this operation acquiring
+ directory node's rwsem for reading is enough.
+
+ 3.2.e. Dentry revalidation
+
+ A dentry revalidation (kernfs_dop_revalidate) can happen for a
+ negative or for a normal dentry.
+ For negative dentries we just need to check parent change, so in
+ this case acquiring parent kernfs_node's rwsem for reading is
+ enough.
+ For a normal dentry acquiring node's rwsem for reading is enough
+ (Clause 3.1.a and 3.1.b).
+ If node gets removed while waiting for the lock subsequent checks
+ in kernfs_dop_revalidate will fail and kernfs_dop_revalidate will
+ exit early.
+
+ 3.2.f. kernfs_node lookup
+
+ While searching for a node under a given parent
+ (kernfs_find_and_get_ns, kernfs_walk_and_get_ns) rwsem of parent
+ node is acquired for reading. Clause 3.1.c ensures a consistent
+ view of parent's children RB tree. To avoid parallel removal of
+ found node before it gets pinned, these operation make use of
+ per-fs mutex (kernfs_rm_mutex) as explained earlier.
+ This per-fs mutex is also taken during kernfs_node removal
+ (__kernfs_remove).
+
+ If the node being searched gets removed while waiting for the
+ mutex or rwsem, the subsequent kernfs_find_ns or kernfs_walk_ns
+ will fail.
+
+ 3.2.g. kenfs_node's inode lookup
+
+ Looking up for inode instances via kernfs_iop_lookup involves
+ node lookup. So locks acquired are same as ones required in 3.2.f.
+ Also once node lookup is complete parent's rwsem is released and
+ rwsem of found node is acquired to get corresponding inode.
+ Since we are operating under per-fs kernfs_rm_mutex the found node
+ will not disappear in the middle.
+
+ 3.2.h. Updating or reading inode attribute
+
+ Interfaces that change inode attributes(i.e kernfs_setattr and
+ kernfs_iop_setattr) acquire node's rwsem for writing.
+ If the kernfs_node gets removed while waiting for the semaphore
+ the subsequent __kernfs_setattr will fail.
+ From 3.2.a and 3.2.b we know that updates due to addition or
+ removal of nodes will not happen in parallel.
+ So just locking the kernfs_node in these cases is enough to
+ guarantee correct modification of inode attributes.
+ Similarly the interfaces that read inode attributes
+ (i.e kernfs_iop_getattr, kernfs_iop_permission) just need to
+ acquire involved node's rwsem for reading.
+
+ 3.2.i. kernfs file event generation
+
+ kernfs_notify pins involved node before scheduling
+ kernfs_notify_work and kernfs_notify_workfn acquires node's
+ rwsem. Clauses in 3.1 ensure a consistent view of node state
+ throughout execution of work handler.
+
+ 3.2.j. mount
+
+ kernfs_fill_super, invoked during mount operation, acquires root
+ node's rwsem. During mount process there can't be other execution
+ contexts trying to move or delete the node so just locking the
+ involved node(i.e the root node) is enough.
+
+ 3.2.k. while activating a node
+
+ For a node that started as deactivated, kernfs_activate
+ activates the node. In this case acquiring node's rwsem is
+ enough. Since the node is not active yet any parallel removal
+ that wins the race for rwsem will skip this node and its
+ descendents. Also user space can't see a deactivated node so we
+ don't have any parallel access emanating from their as well.
+
+ 3.3 For operations that involve locking multiple nodes at the same time
+ locks are acquired in order of their addresses.
--
2.30.2

2022-08-10 11:15:55

by Imran Khan

[permalink] [raw]
Subject: [RESEND PATCH 3/5] kernfs: Introduce interface to access per-fs rwsem.

per-fs rwsem is used across kernfs for synchronization purposes.
Having an interface to access it not only avoids code duplication, it
can also help in changing the underlying locking mechanism without needing
to change the lock users. For example next patch modifies this interface
to make use of hashed rwsems in place of per-fs rwsem.

Signed-off-by: Imran Khan <[email protected]>
---
fs/kernfs/dir.c | 119 ++++++++++++++++++------------------
fs/kernfs/file.c | 5 +-
fs/kernfs/inode.c | 26 ++++----
fs/kernfs/kernfs-internal.h | 78 +++++++++++++++++++++++
fs/kernfs/mount.c | 6 +-
fs/kernfs/symlink.c | 6 +-
6 files changed, 159 insertions(+), 81 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index d2a0b4acd073..73f4ebc1464e 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -33,7 +33,7 @@ static DEFINE_SPINLOCK(kernfs_idr_lock); /* root->ino_idr */

static bool kernfs_active(struct kernfs_node *kn)
{
- lockdep_assert_held(&kernfs_root(kn)->kernfs_rwsem);
+ kernfs_rwsem_assert_held(kn);
return atomic_read(&kn->active) >= 0;
}

@@ -467,12 +467,20 @@ static void kernfs_drain(struct kernfs_node *kn)
__releases(&kernfs_root(kn)->kernfs_rwsem)
__acquires(&kernfs_root(kn)->kernfs_rwsem)
{
+ struct rw_semaphore *rwsem;
struct kernfs_root *root = kernfs_root(kn);

- lockdep_assert_held_write(&root->kernfs_rwsem);
+ /**
+ * kn has the same root as its ancestor, so it can be used to get
+ * per-fs rwsem.
+ */
+ rwsem = kernfs_rwsem_ptr(kn);
+
+ kernfs_rwsem_assert_held_write(kn);
+
WARN_ON_ONCE(kernfs_active(kn));

- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);

if (kernfs_lockdep(kn)) {
rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
@@ -491,7 +499,7 @@ static void kernfs_drain(struct kernfs_node *kn)

kernfs_drain_open_files(kn);

- down_write(&root->kernfs_rwsem);
+ kernfs_down_write(kn);
}

/**
@@ -726,12 +734,12 @@ struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
int kernfs_add_one(struct kernfs_node *kn)
{
struct kernfs_node *parent = kn->parent;
- struct kernfs_root *root = kernfs_root(parent);
struct kernfs_iattrs *ps_iattr;
+ struct rw_semaphore *rwsem;
bool has_ns;
int ret;

- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(parent);

ret = -EINVAL;
has_ns = kernfs_ns_enabled(parent);
@@ -762,7 +770,7 @@ int kernfs_add_one(struct kernfs_node *kn)
ps_iattr->ia_mtime = ps_iattr->ia_ctime;
}

- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);

/*
* Activate the new node unless CREATE_DEACTIVATED is requested.
@@ -776,7 +784,7 @@ int kernfs_add_one(struct kernfs_node *kn)
return 0;

out_unlock:
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return ret;
}

@@ -797,7 +805,7 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
bool has_ns = kernfs_ns_enabled(parent);
unsigned int hash;

- lockdep_assert_held(&kernfs_root(parent)->kernfs_rwsem);
+ kernfs_rwsem_assert_held(parent);

if (has_ns != (bool)ns) {
WARN(1, KERN_WARNING "kernfs: ns %s in '%s' for '%s'\n",
@@ -829,7 +837,7 @@ static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
size_t len;
char *p, *name;

- lockdep_assert_held_read(&kernfs_root(parent)->kernfs_rwsem);
+ kernfs_rwsem_assert_held_read(parent);

spin_lock_irq(&kernfs_pr_cont_lock);

@@ -867,12 +875,12 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
const char *name, const void *ns)
{
struct kernfs_node *kn;
- struct kernfs_root *root = kernfs_root(parent);
+ struct rw_semaphore *rwsem;

- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
kn = kernfs_find_ns(parent, name, ns);
kernfs_get(kn);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);

return kn;
}
@@ -892,12 +900,12 @@ struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
const char *path, const void *ns)
{
struct kernfs_node *kn;
- struct kernfs_root *root = kernfs_root(parent);
+ struct rw_semaphore *rwsem;

- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
kn = kernfs_walk_ns(parent, path, ns);
kernfs_get(kn);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);

return kn;
}
@@ -1062,7 +1070,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
{
struct kernfs_node *kn;
- struct kernfs_root *root;
+ struct rw_semaphore *rwsem;

if (flags & LOOKUP_RCU)
return -ECHILD;
@@ -1078,13 +1086,12 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
parent = kernfs_dentry_node(dentry->d_parent);
if (parent) {
spin_unlock(&dentry->d_lock);
- root = kernfs_root(parent);
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
if (kernfs_dir_changed(parent, dentry)) {
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return 0;
}
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
} else
spin_unlock(&dentry->d_lock);

@@ -1095,8 +1102,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
}

kn = kernfs_dentry_node(dentry);
- root = kernfs_root(kn);
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(kn);

/* The kernfs node has been deactivated */
if (!kernfs_active(kn))
@@ -1115,10 +1121,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
kernfs_info(dentry->d_sb)->ns != kn->ns)
goto out_bad;

- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return 1;
out_bad:
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return 0;
}

@@ -1132,12 +1138,11 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
{
struct kernfs_node *parent = dir->i_private;
struct kernfs_node *kn;
- struct kernfs_root *root;
struct inode *inode = NULL;
const void *ns = NULL;
+ struct rw_semaphore *rwsem;

- root = kernfs_root(parent);
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dir->i_sb)->ns;

@@ -1148,7 +1153,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
* create a negative.
*/
if (!kernfs_active(kn)) {
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return NULL;
}
inode = kernfs_get_inode(dir->i_sb, kn);
@@ -1163,7 +1168,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
*/
if (!IS_ERR(inode))
kernfs_set_rev(parent, dentry);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);

/* instantiate and hash (possibly negative) dentry */
return d_splice_alias(inode, dentry);
@@ -1286,7 +1291,7 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
{
struct rb_node *rbn;

- lockdep_assert_held_write(&kernfs_root(root)->kernfs_rwsem);
+ kernfs_rwsem_assert_held_write(root);

/* if first iteration, visit leftmost descendant which may be root */
if (!pos)
@@ -1321,9 +1326,9 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
void kernfs_activate(struct kernfs_node *kn)
{
struct kernfs_node *pos;
- struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;

- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);

pos = NULL;
while ((pos = kernfs_next_descendant_post(pos, kn))) {
@@ -1337,7 +1342,7 @@ void kernfs_activate(struct kernfs_node *kn)
pos->flags |= KERNFS_ACTIVATED;
}

- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
}

static void __kernfs_remove(struct kernfs_node *kn)
@@ -1348,7 +1353,7 @@ static void __kernfs_remove(struct kernfs_node *kn)
if (!kn)
return;

- lockdep_assert_held_write(&kernfs_root(kn)->kernfs_rwsem);
+ kernfs_rwsem_assert_held_write(kn);

/*
* This is for kernfs_remove_self() which plays with active ref
@@ -1417,16 +1422,14 @@ static void __kernfs_remove(struct kernfs_node *kn)
*/
void kernfs_remove(struct kernfs_node *kn)
{
- struct kernfs_root *root;
+ struct rw_semaphore *rwsem;

if (!kn)
return;

- root = kernfs_root(kn);
-
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
__kernfs_remove(kn);
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
}

/**
@@ -1512,9 +1515,9 @@ void kernfs_unbreak_active_protection(struct kernfs_node *kn)
bool kernfs_remove_self(struct kernfs_node *kn)
{
bool ret;
- struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;

- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
kernfs_break_active_protection(kn);

/*
@@ -1542,9 +1545,9 @@ bool kernfs_remove_self(struct kernfs_node *kn)
atomic_read(&kn->active) == KN_DEACTIVATED_BIAS)
break;

- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
schedule();
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
}
finish_wait(waitq, &wait);
WARN_ON_ONCE(!RB_EMPTY_NODE(&kn->rb));
@@ -1557,7 +1560,7 @@ bool kernfs_remove_self(struct kernfs_node *kn)
*/
kernfs_unbreak_active_protection(kn);

- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return ret;
}

@@ -1574,7 +1577,7 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
const void *ns)
{
struct kernfs_node *kn;
- struct kernfs_root *root;
+ struct rw_semaphore *rwsem;

if (!parent) {
WARN(1, KERN_WARNING "kernfs: can not remove '%s', no directory\n",
@@ -1582,14 +1585,14 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
return -ENOENT;
}

- root = kernfs_root(parent);
- down_write(&root->kernfs_rwsem);
+
+ rwsem = kernfs_down_write(parent);

kn = kernfs_find_ns(parent, name, ns);
if (kn)
__kernfs_remove(kn);

- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);

if (kn)
return 0;
@@ -1608,16 +1611,15 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
const char *new_name, const void *new_ns)
{
struct kernfs_node *old_parent;
- struct kernfs_root *root;
const char *old_name = NULL;
+ struct rw_semaphore *rwsem;
int error;

/* can't move or rename root */
if (!kn->parent)
return -EINVAL;

- root = kernfs_root(kn);
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);

error = -ENOENT;
if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
@@ -1671,7 +1673,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,

error = 0;
out:
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return error;
}

@@ -1742,14 +1744,13 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
struct dentry *dentry = file->f_path.dentry;
struct kernfs_node *parent = kernfs_dentry_node(dentry);
struct kernfs_node *pos = file->private_data;
- struct kernfs_root *root;
const void *ns = NULL;
+ struct rw_semaphore *rwsem;

if (!dir_emit_dots(file, ctx))
return 0;

- root = kernfs_root(parent);
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);

if (kernfs_ns_enabled(parent))
ns = kernfs_info(dentry->d_sb)->ns;
@@ -1766,12 +1767,12 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
file->private_data = pos;
kernfs_get(pos);

- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
if (!dir_emit(ctx, name, len, ino, type))
return 0;
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
}
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
file->private_data = NULL;
ctx->pos = INT_MAX;
return 0;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 812165d8ab4f..669619e01be2 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -911,6 +911,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
struct kernfs_node *kn;
struct kernfs_super_info *info;
struct kernfs_root *root;
+ struct rw_semaphore *rwsem;
repeat:
/* pop one off the notify_list */
spin_lock_irq(&kernfs_notify_lock);
@@ -925,7 +926,7 @@ static void kernfs_notify_workfn(struct work_struct *work)

root = kernfs_root(kn);
/* kick fsnotify */
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);

down_write(&root->supers_rwsem);
list_for_each_entry(info, &kernfs_root(kn)->supers, node) {
@@ -965,7 +966,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
}
up_write(&root->supers_rwsem);

- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
kernfs_put(kn);
goto repeat;
}
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 3d783d80f5da..efe5ae98abf4 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -99,11 +99,11 @@ int __kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
{
int ret;
- struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;

- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
ret = __kernfs_setattr(kn, iattr);
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return ret;
}

@@ -112,14 +112,13 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
{
struct inode *inode = d_inode(dentry);
struct kernfs_node *kn = inode->i_private;
- struct kernfs_root *root;
+ struct rw_semaphore *rwsem;
int error;

if (!kn)
return -EINVAL;

- root = kernfs_root(kn);
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
error = setattr_prepare(&init_user_ns, dentry, iattr);
if (error)
goto out;
@@ -132,7 +131,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
setattr_copy(&init_user_ns, inode, iattr);

out:
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return error;
}

@@ -187,14 +186,14 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
{
struct inode *inode = d_inode(path->dentry);
struct kernfs_node *kn = inode->i_private;
- struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;

- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(kn);
spin_lock(&inode->i_lock);
kernfs_refresh_inode(kn, inode);
generic_fillattr(&init_user_ns, inode, stat);
spin_unlock(&inode->i_lock);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);

return 0;
}
@@ -277,22 +276,21 @@ void kernfs_evict_inode(struct inode *inode)
int kernfs_iop_permission(struct user_namespace *mnt_userns,
struct inode *inode, int mask)
{
+ struct rw_semaphore *rwsem;
struct kernfs_node *kn;
- struct kernfs_root *root;
int ret;

if (mask & MAY_NOT_BLOCK)
return -ECHILD;

kn = inode->i_private;
- root = kernfs_root(kn);

- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(kn);
spin_lock(&inode->i_lock);
kernfs_refresh_inode(kn, inode);
ret = generic_permission(&init_user_ns, inode, mask);
spin_unlock(&inode->i_lock);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);

return ret;
}
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 3cd17c100d10..0babc3dc4f4a 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -169,4 +169,82 @@ extern const struct inode_operations kernfs_symlink_iops;
* kernfs locks
*/
extern struct kernfs_global_locks *kernfs_locks;
+
+static inline struct rw_semaphore *kernfs_rwsem_ptr(struct kernfs_node *kn)
+{
+ struct kernfs_root *root = kernfs_root(kn);
+
+ return &root->kernfs_rwsem;
+}
+
+static inline void kernfs_rwsem_assert_held(struct kernfs_node *kn)
+{
+ lockdep_assert_held(kernfs_rwsem_ptr(kn));
+}
+
+static inline void kernfs_rwsem_assert_held_write(struct kernfs_node *kn)
+{
+ lockdep_assert_held_write(kernfs_rwsem_ptr(kn));
+}
+
+static inline void kernfs_rwsem_assert_held_read(struct kernfs_node *kn)
+{
+ lockdep_assert_held_read(kernfs_rwsem_ptr(kn));
+}
+
+/**
+ * kernfs_down_write() - Acquire kernfs rwsem
+ *
+ * @kn: kernfs_node for which rwsem needs to be taken
+ *
+ * Return: pointer to acquired rwsem
+ */
+static inline struct rw_semaphore *kernfs_down_write(struct kernfs_node *kn)
+{
+ struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+
+ down_write(rwsem);
+
+ return rwsem;
+}
+
+/**
+ * kernfs_up_write - Release kernfs rwsem
+ *
+ * @rwsem: address of rwsem to release
+ *
+ * Return: void
+ */
+static inline void kernfs_up_write(struct rw_semaphore *rwsem)
+{
+ up_write(rwsem);
+}
+
+/**
+ * kernfs_down_read() - Acquire kernfs rwsem
+ *
+ * @kn: kernfs_node for which rwsem needs to be taken
+ *
+ * Return: pointer to acquired rwsem
+ */
+static inline struct rw_semaphore *kernfs_down_read(struct kernfs_node *kn)
+{
+ struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+
+ down_read(rwsem);
+
+ return rwsem;
+}
+
+/**
+ * kernfs_up_read - Release kernfs rwsem
+ *
+ * @rwsem: address of rwsem to release
+ *
+ * Return: void
+ */
+static inline void kernfs_up_read(struct rw_semaphore *rwsem)
+{
+ up_read(rwsem);
+}
#endif /* __KERNFS_INTERNAL_H */
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index d2be1c304715..3c5334b74f36 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -237,9 +237,9 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *kfc)
{
struct kernfs_super_info *info = kernfs_info(sb);
- struct kernfs_root *kf_root = kfc->root;
struct inode *inode;
struct dentry *root;
+ struct rw_semaphore *rwsem;

info->sb = sb;
/* Userspace would break if executables or devices appear on sysfs */
@@ -257,9 +257,9 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k
sb->s_shrink.seeks = 0;

/* get root inode, initialize and unlock it */
- down_read(&kf_root->kernfs_rwsem);
+ rwsem = kernfs_down_read(info->root->kn);
inode = kernfs_get_inode(sb, info->root->kn);
- up_read(&kf_root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
if (!inode) {
pr_debug("kernfs: could not get root inode\n");
return -ENOMEM;
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 0ab13824822f..9d4103602554 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -113,12 +113,12 @@ static int kernfs_getlink(struct inode *inode, char *path)
struct kernfs_node *kn = inode->i_private;
struct kernfs_node *parent = kn->parent;
struct kernfs_node *target = kn->symlink.target_kn;
- struct kernfs_root *root = kernfs_root(parent);
+ struct rw_semaphore *rwsem;
int error;

- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
error = kernfs_get_target_path(parent, target, path);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);

return error;
}
--
2.30.2

2022-08-10 11:16:54

by Imran Khan

[permalink] [raw]
Subject: [RESEND PATCH 4/5] kernfs: Replace per-fs rwsem with hashed rwsems.

Having a single rwsem to synchronize all operations across a kernfs
based file system (cgroup, sysfs etc.) does not scale well. The contention
around this single rwsem becomes more apparent in large scale systems with
few hundred CPUs where most of the CPUs have running tasks that are
opening, accessing or closing sysfs files at any point of time.

Using hashed rwsems in place of a per-fs rwsem, can significantly reduce
contention around per-fs rwsem and hence provide better scalability.
Moreover as these hashed rwsems are not part of kernfs_node objects we will
not see any singnificant change in memory utilization of kernfs based file
systems like sysfs, cgroupfs etc.

Modify interface introduced in previous patch to make use of hashed rwsems.
Just like earlier change use kernfs_node address as hashing key. Since we
are getting rid of per-fs lock, in certain cases we may need to acquire
locks corresponding to multiple nodes and in such cases of nested locking,
locks are taken in order of their addresses. Introduce helpers to acquire
rwsems corresponding to multiple nodes for such cases.

For operations that involve finding the node first and then operating on it
(for example operations involving find_and_get_ns), acquiring rwsem for the
node being searched is not possible. Such operations need to make sure that
a concurrent remove does not remove the found node. Introduce a per-fs
mutex (kernfs_topo_mutex) that can be used to synchronize these operations
against parallel removal/renaming of involved node.

Also replace usage of kernfs_pr_cont_buf with another global buffer in
kernfs_walk_ns. This is because kernfs_pr_cont_buf is protected by a
spinlock but now kernfs_walk_ns needs to acquire hashed rwsem corresponding
to nodes further down in the path and this can't be done under spinlock.
Having another buffer to piggyback the path in kernfs_walk_ns and protecting
this with kernfs_topo_mutex (mentioned earlier) would avoid need of spinlock
and also ensure that there is no topology change.

Replacing global mutex and spinlocks with hashed ones (as mentioned in
previous changes) and global rwsem with hashed rwsem (as done in this
change) reduces contention around kernfs and results in better performance
numbers.

For example on a system with 384 cores, if I run 200 instances of an
application which is mostly executing the following loop:

for (int loop = 0; loop <100 ; loop++)
{
for (int port_num = 1; port_num < 2; port_num++)
{
for (int gid_index = 0; gid_index < 254; gid_index++ )
{
char ret_buf[64], ret_buf_lo[64];
char gid_file_path[1024];

int ret_len;
int ret_fd;
ssize_t ret_rd;

ub4 i, saved_errno;

memset(ret_buf, 0, sizeof(ret_buf));
memset(gid_file_path, 0, sizeof(gid_file_path));

ret_len = snprintf(gid_file_path, sizeof(gid_file_path),
"/sys/class/infiniband/%s/ports/%d/gids/%d",
dev_name,
port_num,
gid_index);

ret_fd = open(gid_file_path, O_RDONLY | O_CLOEXEC);
if (ret_fd < 0)
{
printf("Failed to open %s\n", gid_file_path);
continue;
}

/* Read the GID */
ret_rd = read(ret_fd, ret_buf, 40);

if (ret_rd == -1)
{
printf("Failed to read from file %s, errno: %u\n",
gid_file_path, saved_errno);

continue;
}

close(ret_fd);
}
}

I can see contention around above mentioned locks as follows:

- 54.07% 53.60% showgids [kernel.kallsyms] [k] osq_lock
- 53.60% __libc_start_main
- 32.29% __GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
path_openat
vfs_open
do_dentry_open
kernfs_fop_open
mutex_lock
- __mutex_lock_slowpath
- 32.23% __mutex_lock.isra.5
osq_lock
- 21.31% __GI___libc_close
entry_SYSCALL_64_after_hwframe
do_syscall_64
exit_to_usermode_loop
task_work_run
____fput
__fput
kernfs_fop_release
kernfs_put_open_node.isra.8
mutex_lock
- __mutex_lock_slowpath
- 21.28% __mutex_lock.isra.5
osq_lock

- 10.49% 10.39% showgids [kernel.kallsyms] [k] down_read
10.39% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 9.72% link_path_walk
- 5.21% inode_permission
- __inode_permission
- 5.21% kernfs_iop_permission
down_read
- 4.08% walk_component
lookup_fast
- d_revalidate.part.24
- 4.08% kernfs_dop_revalidate

- 7.48% 7.41% showgids [kernel.kallsyms] [k] up_read
7.41% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 7.01% link_path_walk
- 4.12% inode_permission
- __inode_permission
- 4.12% kernfs_iop_permission
up_read
- 2.61% walk_component
lookup_fast
- d_revalidate.part.24
- 2.61% kernfs_dop_revalidate

Moreover this run of 200 application isntances takes 32-34 secs. to
complete.

With the patched kernel and on the same test setup, we no longer see
contention around osq_lock (i.e kernfs_open_file_mutex) and also
contention around per-fs kernfs_rwsem has reduced significantly as well.
This can be seen in the following perf snippet:

- 1.66% 1.65% showgids [kernel.kallsyms] [k] down_read
1.65% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 1.62% link_path_walk
- 0.98% inode_permission
- __inode_permission
+ 0.98% kernfs_iop_permission
- 0.52% walk_component
lookup_fast
- d_revalidate.part.24
- 0.52% kernfs_dop_revalidate

- 1.12% 1.11% showgids [kernel.kallsyms] [k] up_read
1.11% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 1.11% link_path_walk
- 0.69% inode_permission
- __inode_permission
- 0.69% kernfs_iop_permission
up_read

Moreover the test execution time has reduced from 32-34 secs to 18-19 secs.

Signed-off-by: Imran Khan <[email protected]>
---
fs/kernfs/Makefile | 2 +-
fs/kernfs/dir.c | 161 +++++++++++++++++-----
fs/kernfs/inode.c | 20 +++
fs/kernfs/kernfs-internal.c | 259 ++++++++++++++++++++++++++++++++++++
fs/kernfs/kernfs-internal.h | 47 ++++++-
fs/kernfs/mount.c | 9 ++
fs/kernfs/symlink.c | 13 +-
include/linux/kernfs.h | 1 +
8 files changed, 474 insertions(+), 38 deletions(-)
create mode 100644 fs/kernfs/kernfs-internal.c

diff --git a/fs/kernfs/Makefile b/fs/kernfs/Makefile
index 4ca54ff54c98..778da6b118e9 100644
--- a/fs/kernfs/Makefile
+++ b/fs/kernfs/Makefile
@@ -3,4 +3,4 @@
# Makefile for the kernfs pseudo filesystem
#

-obj-y := mount.o inode.o dir.o file.o symlink.o
+obj-y := mount.o inode.o dir.o file.o symlink.o kernfs-internal.o
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 73f4ebc1464e..7d02c3dd2c20 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -17,7 +17,7 @@

#include "kernfs-internal.h"

-static DEFINE_RWLOCK(kernfs_rename_lock); /* kn->parent and ->name */
+DEFINE_RWLOCK(kernfs_rename_lock); /* kn->parent and ->name */
/*
* Don't use rename_lock to piggy back on pr_cont_buf. We don't want to
* call pr_cont() while holding rename_lock. Because sometimes pr_cont()
@@ -27,13 +27,13 @@ static DEFINE_RWLOCK(kernfs_rename_lock); /* kn->parent and ->name */
*/
static DEFINE_SPINLOCK(kernfs_pr_cont_lock);
static char kernfs_pr_cont_buf[PATH_MAX]; /* protected by pr_cont_lock */
+static char kernfs_path_buf[PATH_MAX]; /* protected by kernfs_topo_mutex */
static DEFINE_SPINLOCK(kernfs_idr_lock); /* root->ino_idr */

#define rb_to_kn(X) rb_entry((X), struct kernfs_node, rb)

static bool kernfs_active(struct kernfs_node *kn)
{
- kernfs_rwsem_assert_held(kn);
return atomic_read(&kn->active) >= 0;
}

@@ -458,14 +458,15 @@ void kernfs_put_active(struct kernfs_node *kn)
/**
* kernfs_drain - drain kernfs_node
* @kn: kernfs_node to drain
+ * @anc: ancestor of kernfs_node to drain
*
* Drain existing usages and nuke all existing mmaps of @kn. Mutiple
* removers may invoke this function concurrently on @kn and all will
* return after draining is complete.
*/
-static void kernfs_drain(struct kernfs_node *kn)
- __releases(&kernfs_root(kn)->kernfs_rwsem)
- __acquires(&kernfs_root(kn)->kernfs_rwsem)
+static void kernfs_drain(struct kernfs_node *kn, struct kernfs_node *anc)
+ __releases(kernfs_rwsem_ptr(anc))
+ __acquires(kernfs_rwsem_ptr(anc))
{
struct rw_semaphore *rwsem;
struct kernfs_root *root = kernfs_root(kn);
@@ -476,10 +477,11 @@ static void kernfs_drain(struct kernfs_node *kn)
*/
rwsem = kernfs_rwsem_ptr(kn);

- kernfs_rwsem_assert_held_write(kn);
+ kernfs_rwsem_assert_held_write(anc);

WARN_ON_ONCE(kernfs_active(kn));

+ rwsem = kernfs_rwsem_ptr(anc);
kernfs_up_write(rwsem);

if (kernfs_lockdep(kn)) {
@@ -499,7 +501,7 @@ static void kernfs_drain(struct kernfs_node *kn)

kernfs_drain_open_files(kn);

- kernfs_down_write(kn);
+ kernfs_down_write(anc);
}

/**
@@ -739,6 +741,11 @@ int kernfs_add_one(struct kernfs_node *kn)
bool has_ns;
int ret;

+ /**
+ * The node being added is not active at this point of time and may
+ * be activated later depending on CREATE_DEACTIVATED flag. So at
+ * this point of time just locking the parent is enough.
+ */
rwsem = kernfs_down_write(parent);

ret = -EINVAL;
@@ -836,28 +843,35 @@ static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
{
size_t len;
char *p, *name;
+ struct rw_semaphore *rwsem;

kernfs_rwsem_assert_held_read(parent);

- spin_lock_irq(&kernfs_pr_cont_lock);
+ lockdep_assert_held(&kernfs_root(parent)->kernfs_topo_mutex);

- len = strlcpy(kernfs_pr_cont_buf, path, sizeof(kernfs_pr_cont_buf));
+ /* Caller has kernfs_topo_mutex so topology will not change */
+ p = kernfs_path_buf;
+ len = strlcpy(p, path, PATH_MAX);

- if (len >= sizeof(kernfs_pr_cont_buf)) {
- spin_unlock_irq(&kernfs_pr_cont_lock);
+ if (len >= PATH_MAX) {
+ kfree(p);
return NULL;
}

- p = kernfs_pr_cont_buf;
-
+ rwsem = kernfs_rwsem_ptr(parent);
while ((name = strsep(&p, "/")) && parent) {
if (*name == '\0')
continue;
parent = kernfs_find_ns(parent, name, ns);
+ /*
+ * Release rwsem for node whose child RB tree has been
+ * traversed.
+ */
+ kernfs_up_read(rwsem);
+ if (parent) /* Acquire rwsem before traversing child RB tree */
+ rwsem = kernfs_down_read(parent);
}

- spin_unlock_irq(&kernfs_pr_cont_lock);
-
return parent;
}

@@ -876,11 +890,20 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
{
struct kernfs_node *kn;
struct rw_semaphore *rwsem;
+ struct kernfs_root *root = kernfs_root(parent);

+ /**
+ * We don't have address of kernfs_node (that is being searched)
+ * yet. Acquiring root->kernfs_topo_mutex and releasing it after
+ * pinning the found kernfs_node, ensures that found kernfs_node
+ * will not disappear due to a parallel remove operation.
+ */
+ mutex_lock(&root->kernfs_topo_mutex);
rwsem = kernfs_down_read(parent);
kn = kernfs_find_ns(parent, name, ns);
kernfs_get(kn);
kernfs_up_read(rwsem);
+ mutex_unlock(&root->kernfs_topo_mutex);

return kn;
}
@@ -901,11 +924,26 @@ struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
{
struct kernfs_node *kn;
struct rw_semaphore *rwsem;
+ struct kernfs_root *root = kernfs_root(parent);

+ /**
+ * We don't have address of kernfs_node (that is being searched)
+ * yet. Acquiring root->kernfs_topo_mutex and releasing it after
+ * pinning the found kernfs_node, ensures that found kernfs_node
+ * will not disappear due to a parallel remove operation.
+ */
+ mutex_lock(&root->kernfs_topo_mutex);
rwsem = kernfs_down_read(parent);
kn = kernfs_walk_ns(parent, path, ns);
kernfs_get(kn);
- kernfs_up_read(rwsem);
+ if (kn)
+ /* Release lock taken under kernfs_walk_ns */
+ kernfs_up_read(kernfs_rwsem_ptr(kn));
+ else
+ /* Release parent lock because walk_ns bailed out early */
+ kernfs_up_read(rwsem);
+
+ mutex_unlock(&root->kernfs_topo_mutex);

return kn;
}
@@ -930,9 +968,9 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
return ERR_PTR(-ENOMEM);

idr_init(&root->ino_idr);
- init_rwsem(&root->kernfs_rwsem);
INIT_LIST_HEAD(&root->supers);
init_rwsem(&root->supers_rwsem);
+ mutex_init(&root->kernfs_topo_mutex);

/*
* On 64bit ino setups, id is ino. On 32bit, low 32bits are ino.
@@ -1102,6 +1140,11 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
}

kn = kernfs_dentry_node(dentry);
+ /**
+ * For dentry revalidation just acquiring kernfs_node's rwsem for
+ * reading should be enough. If a competing rename or remove wins
+ * one of the checks below will fail.
+ */
rwsem = kernfs_down_read(kn);

/* The kernfs node has been deactivated */
@@ -1141,24 +1184,35 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
struct inode *inode = NULL;
const void *ns = NULL;
struct rw_semaphore *rwsem;
+ struct kernfs_root *root = kernfs_root(parent);

+ /**
+ * We don't have address of kernfs_node (that is being searched)
+ * yet. So take root->kernfs_topo_mutex to avoid parallel removal of
+ * found kernfs_node.
+ */
+ mutex_lock(&root->kernfs_topo_mutex);
rwsem = kernfs_down_read(parent);
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dir->i_sb)->ns;

kn = kernfs_find_ns(parent, dentry->d_name.name, ns);
+ kernfs_up_read(rwsem);
/* attach dentry and inode */
if (kn) {
/* Inactive nodes are invisible to the VFS so don't
* create a negative.
*/
+ rwsem = kernfs_down_read(kn);
if (!kernfs_active(kn)) {
kernfs_up_read(rwsem);
+ mutex_unlock(&root->kernfs_topo_mutex);
return NULL;
}
inode = kernfs_get_inode(dir->i_sb, kn);
if (!inode)
inode = ERR_PTR(-ENOMEM);
+ kernfs_up_read(rwsem);
}
/*
* Needed for negative dentry validation.
@@ -1166,9 +1220,11 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
* or transforms from positive dentry in dentry_unlink_inode()
* called from vfs_rmdir().
*/
+ rwsem = kernfs_down_read(parent);
if (!IS_ERR(inode))
kernfs_set_rev(parent, dentry);
kernfs_up_read(rwsem);
+ mutex_unlock(&root->kernfs_topo_mutex);

/* instantiate and hash (possibly negative) dentry */
return d_splice_alias(inode, dentry);
@@ -1348,19 +1404,26 @@ void kernfs_activate(struct kernfs_node *kn)
static void __kernfs_remove(struct kernfs_node *kn)
{
struct kernfs_node *pos;
+ struct rw_semaphore *rwsem;
+ struct kernfs_root *root;

/* Short-circuit if non-root @kn has already finished removal. */
if (!kn)
return;

- kernfs_rwsem_assert_held_write(kn);

+ root = kernfs_root(kn);
/*
* This is for kernfs_remove_self() which plays with active ref
* after removal.
*/
- if (kn->parent && RB_EMPTY_NODE(&kn->rb))
+ mutex_lock(&root->kernfs_topo_mutex);
+ rwsem = kernfs_down_write(kn);
+ if (kn->parent && RB_EMPTY_NODE(&kn->rb)) {
+ kernfs_up_write(rwsem);
+ mutex_unlock(&root->kernfs_topo_mutex);
return;
+ }

pr_debug("kernfs %s: removing\n", kn->name);

@@ -1370,8 +1433,11 @@ static void __kernfs_remove(struct kernfs_node *kn)
if (kernfs_active(pos))
atomic_add(KN_DEACTIVATED_BIAS, &pos->active);

+ kernfs_up_write(rwsem);
+
/* deactivate and unlink the subtree node-by-node */
do {
+ rwsem = kernfs_down_write(kn);
pos = kernfs_leftmost_descendant(kn);

/*
@@ -1389,10 +1455,25 @@ static void __kernfs_remove(struct kernfs_node *kn)
* error paths without worrying about draining.
*/
if (kn->flags & KERNFS_ACTIVATED)
- kernfs_drain(pos);
+ kernfs_drain(pos, kn);
else
WARN_ON_ONCE(atomic_read(&kn->active) != KN_DEACTIVATED_BIAS);

+ kernfs_up_write(rwsem);
+
+ /**
+ * By now node and all of its descendants have been deactivated
+ * Once a descendant has been drained, acquire its parent's lock
+ * and unlink it from parent's children rb tree.
+ * We drop kn's lock before acquiring pos->parent's lock to avoid
+ * deadlock that will happen if pos->parent and kn hash to same lock.
+ * Dropping kn's lock should be safe because it is in deactived state.
+ * Further root->kernfs_topo_mutex ensures that we will not have
+ * concurrent instances of __kernfs_remove
+ */
+ if (pos->parent)
+ rwsem = kernfs_down_write(pos->parent);
+
/*
* kernfs_unlink_sibling() succeeds once per node. Use it
* to decide who's responsible for cleanups.
@@ -1410,8 +1491,12 @@ static void __kernfs_remove(struct kernfs_node *kn)
kernfs_put(pos);
}

+ if (pos->parent)
+ kernfs_up_write(rwsem);
kernfs_put(pos);
} while (pos != kn);
+
+ mutex_unlock(&root->kernfs_topo_mutex);
}

/**
@@ -1422,14 +1507,10 @@ static void __kernfs_remove(struct kernfs_node *kn)
*/
void kernfs_remove(struct kernfs_node *kn)
{
- struct rw_semaphore *rwsem;
-
if (!kn)
return;

- rwsem = kernfs_down_write(kn);
__kernfs_remove(kn);
- kernfs_up_write(rwsem);
}

/**
@@ -1531,9 +1612,11 @@ bool kernfs_remove_self(struct kernfs_node *kn)
*/
if (!(kn->flags & KERNFS_SUICIDAL)) {
kn->flags |= KERNFS_SUICIDAL;
+ kernfs_up_write(rwsem);
__kernfs_remove(kn);
kn->flags |= KERNFS_SUICIDED;
ret = true;
+ rwsem = kernfs_down_write(kn);
} else {
wait_queue_head_t *waitq = &kernfs_root(kn)->deactivate_waitq;
DEFINE_WAIT(wait);
@@ -1588,11 +1671,17 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,

rwsem = kernfs_down_write(parent);

+ /**
+ * Since the node being searched will be removed eventually,
+ * we don't need to take root->kernfs_topo_mutex.
+ * Even if a parallel remove succeeds, the subsequent __kernfs_remove
+ * will detect it and bail-out early.
+ */
kn = kernfs_find_ns(parent, name, ns);
- if (kn)
- __kernfs_remove(kn);

kernfs_up_write(rwsem);
+ if (kn)
+ __kernfs_remove(kn);

if (kn)
return 0;
@@ -1612,14 +1701,26 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
{
struct kernfs_node *old_parent;
const char *old_name = NULL;
- struct rw_semaphore *rwsem;
+ struct kernfs_rwsem_token token;
int error;
+ struct kernfs_root *root = kernfs_root(kn);

/* can't move or rename root */
if (!kn->parent)
return -EINVAL;

- rwsem = kernfs_down_write(kn);
+ mutex_lock(&root->kernfs_topo_mutex);
+ old_parent = kn->parent;
+ kernfs_get(old_parent);
+ kernfs_down_write_triple_nodes(kn, old_parent, new_parent, &token);
+ while (old_parent != kn->parent) {
+ kernfs_put(old_parent);
+ kernfs_up_write_triple_nodes(kn, old_parent, new_parent, &token);
+ old_parent = kn->parent;
+ kernfs_get(old_parent);
+ kernfs_down_write_triple_nodes(kn, old_parent, new_parent, &token);
+ }
+ kernfs_put(old_parent);

error = -ENOENT;
if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
@@ -1654,7 +1755,6 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
/* rename_lock protects ->parent and ->name accessors */
write_lock_irq(&kernfs_rename_lock);

- old_parent = kn->parent;
kn->parent = new_parent;

kn->ns = new_ns;
@@ -1673,7 +1773,8 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,

error = 0;
out:
- kernfs_up_write(rwsem);
+ mutex_unlock(&root->kernfs_topo_mutex);
+ kernfs_up_write_triple_nodes(kn, new_parent, old_parent, &token);
return error;
}

diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index efe5ae98abf4..36a40b08b97f 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -101,6 +101,12 @@ int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
int ret;
struct rw_semaphore *rwsem;

+ /**
+ * Since we are only modifying the inode attribute, we just need
+ * to lock involved node. Operations that add or remove a node
+ * acquire parent's lock before changing the inode attributes, so
+ * such operations are also in sync with this interface.
+ */
rwsem = kernfs_down_write(kn);
ret = __kernfs_setattr(kn, iattr);
kernfs_up_write(rwsem);
@@ -118,6 +124,12 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
if (!kn)
return -EINVAL;

+ /**
+ * Since we are only modifying the inode attribute, we just need
+ * to lock involved node. Operations that add or remove a node
+ * acquire parent's lock before changing the inode attributes, so
+ * such operations are also in sync with .setattr backend.
+ */
rwsem = kernfs_down_write(kn);
error = setattr_prepare(&init_user_ns, dentry, iattr);
if (error)
@@ -188,6 +200,10 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
struct kernfs_node *kn = inode->i_private;
struct rw_semaphore *rwsem;

+ /**
+ * As we are only reading ->iattr, acquiring kn's rwsem for
+ * reading is enough.
+ */
rwsem = kernfs_down_read(kn);
spin_lock(&inode->i_lock);
kernfs_refresh_inode(kn, inode);
@@ -285,6 +301,10 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,

kn = inode->i_private;

+ /**
+ * As we are only reading ->iattr, acquiring kn's rwsem for
+ * reading is enough.
+ */
rwsem = kernfs_down_read(kn);
spin_lock(&inode->i_lock);
kernfs_refresh_inode(kn, inode);
diff --git a/fs/kernfs/kernfs-internal.c b/fs/kernfs/kernfs-internal.c
new file mode 100644
index 000000000000..80d7d64532fe
--- /dev/null
+++ b/fs/kernfs/kernfs-internal.c
@@ -0,0 +1,259 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * This file provides inernal helpers for kernfs.
+ */
+
+#include "kernfs-internal.h"
+
+static void kernfs_swap_rwsems(struct rw_semaphore **array, int i, int j)
+{
+ struct rw_semaphore *tmp;
+
+ tmp = array[i];
+ array[i] = array[j];
+ array[j] = tmp;
+}
+
+static void kernfs_sort_rwsems(struct kernfs_rwsem_token *token)
+{
+ struct rw_semaphore **array = &token->rwsems[0];
+
+ if (token->count == 2) {
+ if (array[0] == array[1])
+ token->count = 1;
+ else if (array[0] > array[1])
+ kernfs_swap_rwsems(array, 0, 1);
+ } else {
+ if (array[0] == array[1] && array[0] == array[2])
+ token->count = 1;
+ else {
+ if (array[0] > array[1])
+ kernfs_swap_rwsems(array, 0, 1);
+
+ if (array[0] > array[2])
+ kernfs_swap_rwsems(array, 0, 2);
+
+ if (array[1] > array[2])
+ kernfs_swap_rwsems(array, 1, 2);
+
+ if (array[0] == array[1] || array[1] == array[2])
+ token->count = 2;
+ }
+ }
+}
+
+/**
+ * kernfs_down_write_double_nodes() - take hashed rwsem for 2 nodes
+ *
+ * @kn1: first kernfs_node of interest
+ * @kn2: second kernfs_node of interest
+ * @token: token to pass unlocking information to caller
+ *
+ * Acquire hashed rwsem for 2 nodes. Some operation may need to acquire
+ * hashed rwsems for 2 nodes (for example for a node and its parent).
+ * This function can be used in such cases.
+ *
+ * Return: void
+ */
+void kernfs_down_write_double_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_rwsem_token *token)
+{
+ struct rw_semaphore **array = &token->rwsems[0];
+
+ array[0] = kernfs_rwsem_ptr(kn1);
+ array[1] = kernfs_rwsem_ptr(kn2);
+ token->count = 2;
+
+ kernfs_sort_rwsems(token);
+
+ if (token->count == 1) {
+ /* Both nodes hash to same rwsem */
+ down_write_nested(array[0], 0);
+ } else {
+ /* Both nodes hash to different rwsems */
+ down_write_nested(array[0], 0);
+ down_write_nested(array[1], 1);
+ }
+}
+
+/**
+ * kernfs_up_write_double_nodes - release hashed rwsem for 2 nodes
+ *
+ * @kn1: first kernfs_node of interest
+ * @kn2: second kernfs_node of interest
+ * @token: token to indicate unlocking information
+ * ->rwsems is a sorted list of rwsem addresses
+ * ->count contains number of unique locks
+ *
+ * Release hashed rwsems for 2 nodes
+ *
+ * Return: void
+ */
+void kernfs_up_write_double_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_rwsem_token *token)
+{
+ struct rw_semaphore **array = &token->rwsems[0];
+
+ if (token->count == 1) {
+ /* Both nodes hash to same rwsem */
+ up_write(array[0]);
+ } else {
+ /* Both nodes hashe to different rwsems */
+ up_write(array[0]);
+ up_write(array[1]);
+ }
+}
+
+/**
+ * kernfs_down_read_double_nodes() - take hashed rwsem for 2 nodes
+ *
+ * @kn1: first kernfs_node of interest
+ * @kn2: second kernfs_node of interest
+ * @token: token to pass unlocking information to caller
+ *
+ * Acquire hashed rwsem for 2 nodes. Some operation may need to acquire
+ * hashed rwsems for 2 nodes (for example for a node and its parent).
+ * This function can be used in such cases.
+ *
+ * Return: void
+ */
+void kernfs_down_read_double_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_rwsem_token *token)
+{
+ struct rw_semaphore **array = &token->rwsems[0];
+
+ array[0] = kernfs_rwsem_ptr(kn1);
+ array[1] = kernfs_rwsem_ptr(kn2);
+ token->count = 2;
+
+ kernfs_sort_rwsems(token);
+
+ if (token->count == 1) {
+ /* Both nodes hash to same rwsem */
+ down_read_nested(array[0], 0);
+ } else {
+ /* Both nodes hash to different rwsems */
+ down_read_nested(array[0], 0);
+ down_read_nested(array[1], 1);
+ }
+}
+
+/**
+ * kernfs_up_read_double_nodes - release hashed rwsem for 2 nodes
+ *
+ * @kn1: first kernfs_node of interest
+ * @kn2: second kernfs_node of interest
+ * @token: token to indicate unlocking information
+ * ->rwsems is a sorted list of rwsem addresses
+ * ->count contains number of unique locks
+ *
+ * Release hashed rwsems for 2 nodes
+ *
+ * Return: void
+ */
+void kernfs_up_read_double_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_rwsem_token *token)
+{
+ struct rw_semaphore **array = &token->rwsems[0];
+
+ if (token->count == 1) {
+ /* Both nodes hash to same rwsem */
+ up_read(array[0]);
+ } else {
+ /* Both nodes hashe to different rwsems */
+ up_read(array[0]);
+ up_read(array[1]);
+ }
+}
+
+/**
+ * kernfs_down_write_triple_nodes() - take hashed rwsem for 3 nodes
+ *
+ * @kn1: first kernfs_node of interest
+ * @kn2: second kernfs_node of interest
+ * @kn3: third kernfs_node of interest
+ * @token: token to pass unlocking information to caller
+ *
+ * Acquire hashed rwsem for 3 nodes. Some operation may need to acquire
+ * hashed rwsems for 3 nodes (for example rename operation needs to
+ * acquire rwsem corresponding to node, its current parent and its future
+ * parent). This function can be used in such cases.
+ *
+ * Return: void
+ */
+void kernfs_down_write_triple_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_node *kn3,
+ struct kernfs_rwsem_token *token)
+{
+ struct rw_semaphore **array = &token->rwsems[0];
+
+ array[0] = kernfs_rwsem_ptr(kn1);
+ array[1] = kernfs_rwsem_ptr(kn2);
+ array[2] = kernfs_rwsem_ptr(kn3);
+ token->count = 3;
+
+ kernfs_sort_rwsems(token);
+
+ if (token->count == 1) {
+ /* All 3 nodes hash to same rwsem */
+ down_write_nested(array[0], 0);
+ } else if (token->count == 2) {
+ /**
+ * Two nodes hash to same rwsem, and these
+ * will occupy consecutive places in array after
+ * sorting.
+ */
+ down_write_nested(array[0], 0);
+ down_write_nested(array[2], 1);
+ } else {
+ /* All 3 nodes hashe to different rwsems */
+ down_write_nested(array[0], 0);
+ down_write_nested(array[1], 1);
+ down_write_nested(array[2], 2);
+ }
+}
+
+/**
+ * kernfs_up_write_triple_nodes - release hashed rwsem for 3 nodes
+ *
+ * @kn1: first kernfs_node of interest
+ * @kn2: second kernfs_node of interest
+ * @kn3: third kernfs_node of interest
+ * @token: token to indicate unlocking information
+ * ->rwsems is a sorted list of rwsem addresses
+ * ->count contains number of unique locks
+ *
+ * Release hashed rwsems for 3 nodes
+ *
+ * Return: void
+ */
+void kernfs_up_write_triple_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_node *kn3,
+ struct kernfs_rwsem_token *token)
+{
+ struct rw_semaphore **array = &token->rwsems[0];
+
+ if (token->count == 1) {
+ /* All 3 nodes hash to same rwsem */
+ up_write(array[0]);
+ } else if (token->count == 2) {
+ /**
+ * Two nodes hash to same rwsem, and these
+ * will occupy consecutive places in array after
+ * sorting.
+ */
+ up_write(array[0]);
+ up_write(array[2]);
+ } else {
+ /* All 3 nodes hashe to different rwsems */
+ up_write(array[0]);
+ up_write(array[1]);
+ up_write(array[2]);
+ }
+}
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 0babc3dc4f4a..8dc99875da32 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -19,6 +19,20 @@
#include <linux/kernfs.h>
#include <linux/fs_context.h>

+/**
+ * Token for nested locking interfaces.
+ *
+ * rwsems: array of rwsems to acquire
+ * count: has 2 uses
+ * As input argument it specifies size of ->rwsems array
+ * As return argument it specifies number of unique rwsems
+ * present in ->rwsems array
+ */
+struct kernfs_rwsem_token {
+ struct rw_semaphore *rwsems[3];
+ int count;
+};
+
struct kernfs_iattrs {
kuid_t ia_uid;
kgid_t ia_gid;
@@ -46,8 +60,8 @@ struct kernfs_root {
struct list_head supers;

wait_queue_head_t deactivate_waitq;
- struct rw_semaphore kernfs_rwsem;
struct rw_semaphore supers_rwsem;
+ struct mutex kernfs_topo_mutex;
};

/* +1 to avoid triggering overflow warning when negating it */
@@ -169,12 +183,13 @@ extern const struct inode_operations kernfs_symlink_iops;
* kernfs locks
*/
extern struct kernfs_global_locks *kernfs_locks;
+extern rwlock_t kernfs_rename_lock;

static inline struct rw_semaphore *kernfs_rwsem_ptr(struct kernfs_node *kn)
{
- struct kernfs_root *root = kernfs_root(kn);
+ int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);

- return &root->kernfs_rwsem;
+ return &kernfs_locks->kernfs_rwsem[idx];
}

static inline void kernfs_rwsem_assert_held(struct kernfs_node *kn)
@@ -247,4 +262,30 @@ static inline void kernfs_up_read(struct rw_semaphore *rwsem)
{
up_read(rwsem);
}
+
+void kernfs_down_write_double_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_rwsem_token *token);
+
+void kernfs_up_write_double_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_rwsem_token *token);
+
+void kernfs_down_read_double_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_rwsem_token *token);
+
+void kernfs_up_read_double_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_rwsem_token *token);
+
+void kernfs_down_write_triple_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_node *kn3,
+ struct kernfs_rwsem_token *token);
+
+void kernfs_up_write_triple_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2,
+ struct kernfs_node *kn3,
+ struct kernfs_rwsem_token *token);
#endif /* __KERNFS_INTERNAL_H */
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 3c5334b74f36..b9b8cc2c16fd 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -388,6 +388,14 @@ void kernfs_kill_sb(struct super_block *sb)
kfree(info);
}

+static void __init kernfs_rwsem_init(void)
+{
+ int count;
+
+ for (count = 0; count < NR_KERNFS_LOCKS; count++)
+ init_rwsem(&kernfs_locks->kernfs_rwsem[count]);
+}
+
static void __init kernfs_mutex_init(void)
{
int count;
@@ -402,6 +410,7 @@ static void __init kernfs_lock_init(void)
WARN_ON(!kernfs_locks);

kernfs_mutex_init();
+ kernfs_rwsem_init();
}

void __init kernfs_init(void)
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 9d4103602554..d71aa73acec8 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -113,12 +113,17 @@ static int kernfs_getlink(struct inode *inode, char *path)
struct kernfs_node *kn = inode->i_private;
struct kernfs_node *parent = kn->parent;
struct kernfs_node *target = kn->symlink.target_kn;
- struct rw_semaphore *rwsem;
int error;
-
- rwsem = kernfs_down_read(parent);
+ unsigned long flags;
+
+ /**
+ * kernfs_get_target_path needs that all nodes in the path don't
+ * undergo a parent change in the middle of it. Since ->parent
+ * change happens under kernfs_rename_lock, acquire the same.
+ */
+ read_lock_irqsave(&kernfs_rename_lock, flags);
error = kernfs_get_target_path(parent, target, path);
- kernfs_up_read(rwsem);
+ read_unlock_irqrestore(&kernfs_rename_lock, flags);

return error;
}
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 367044d7708c..7d9de9aee102 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -89,6 +89,7 @@ struct kernfs_iattrs;
*/
struct kernfs_global_locks {
struct mutex open_file_mutex[NR_KERNFS_LOCKS];
+ struct rw_semaphore kernfs_rwsem[NR_KERNFS_LOCKS];
};

enum kernfs_node_type {
--
2.30.2

2022-08-10 11:17:09

by Imran Khan

[permalink] [raw]
Subject: [RESEND PATCH 2/5] kernfs: Change kernfs_rename_lock into a read-write lock.

kernfs_rename_lock protects a node's ->parent and thus kernfs topology.
Thus it can be used in cases that rely on a stable kernfs topology.
Change it to a read-write lock for better scalability.

Suggested by: Al Viro <[email protected]>
Signed-off-by: Imran Khan <[email protected]>
---
fs/kernfs/dir.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 45e1882bd51f..d2a0b4acd073 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -17,7 +17,7 @@

#include "kernfs-internal.h"

-static DEFINE_SPINLOCK(kernfs_rename_lock); /* kn->parent and ->name */
+static DEFINE_RWLOCK(kernfs_rename_lock); /* kn->parent and ->name */
/*
* Don't use rename_lock to piggy back on pr_cont_buf. We don't want to
* call pr_cont() while holding rename_lock. Because sometimes pr_cont()
@@ -192,9 +192,9 @@ int kernfs_name(struct kernfs_node *kn, char *buf, size_t buflen)
unsigned long flags;
int ret;

- spin_lock_irqsave(&kernfs_rename_lock, flags);
+ read_lock_irqsave(&kernfs_rename_lock, flags);
ret = kernfs_name_locked(kn, buf, buflen);
- spin_unlock_irqrestore(&kernfs_rename_lock, flags);
+ read_unlock_irqrestore(&kernfs_rename_lock, flags);
return ret;
}

@@ -220,9 +220,9 @@ int kernfs_path_from_node(struct kernfs_node *to, struct kernfs_node *from,
unsigned long flags;
int ret;

- spin_lock_irqsave(&kernfs_rename_lock, flags);
+ read_lock_irqsave(&kernfs_rename_lock, flags);
ret = kernfs_path_from_node_locked(to, from, buf, buflen);
- spin_unlock_irqrestore(&kernfs_rename_lock, flags);
+ read_unlock_irqrestore(&kernfs_rename_lock, flags);
return ret;
}
EXPORT_SYMBOL_GPL(kernfs_path_from_node);
@@ -288,10 +288,10 @@ struct kernfs_node *kernfs_get_parent(struct kernfs_node *kn)
struct kernfs_node *parent;
unsigned long flags;

- spin_lock_irqsave(&kernfs_rename_lock, flags);
+ read_lock_irqsave(&kernfs_rename_lock, flags);
parent = kn->parent;
kernfs_get(parent);
- spin_unlock_irqrestore(&kernfs_rename_lock, flags);
+ read_unlock_irqrestore(&kernfs_rename_lock, flags);

return parent;
}
@@ -1650,7 +1650,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
kernfs_get(new_parent);

/* rename_lock protects ->parent and ->name accessors */
- spin_lock_irq(&kernfs_rename_lock);
+ write_lock_irq(&kernfs_rename_lock);

old_parent = kn->parent;
kn->parent = new_parent;
@@ -1661,7 +1661,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
kn->name = new_name;
}

- spin_unlock_irq(&kernfs_rename_lock);
+ write_unlock_irq(&kernfs_rename_lock);

kn->hash = kernfs_name_hash(kn->name, kn->ns);
kernfs_link_sibling(kn);
--
2.30.2

2022-08-10 11:35:13

by Imran Khan

[permalink] [raw]
Subject: [RESEND PATCH 1/5] kernfs: Use a per-fs rwsem to protect per-fs list of kernfs_super_info.

Right now per-fs kernfs_rwsem protects list of kernfs_super_info instances
for a kernfs_root. Since kernfs_rwsem is used to synchronize several other
operations across kernfs and since most of these operations don't impact
kernfs_super_info, we can use a separate per-fs rwsem to synchronize access
to list of kernfs_super_info.
This helps in reducing contention around kernfs_rwsem and also allows
operations that change/access list of kernfs_super_info to proceed without
contending for kernfs_rwsem.

Signed-off-by: Imran Khan <[email protected]>
---
fs/kernfs/dir.c | 1 +
fs/kernfs/file.c | 2 ++
fs/kernfs/kernfs-internal.h | 1 +
fs/kernfs/mount.c | 8 ++++----
4 files changed, 8 insertions(+), 4 deletions(-)

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 1cc88ba6de90..45e1882bd51f 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -924,6 +924,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
idr_init(&root->ino_idr);
init_rwsem(&root->kernfs_rwsem);
INIT_LIST_HEAD(&root->supers);
+ init_rwsem(&root->supers_rwsem);

/*
* On 64bit ino setups, id is ino. On 32bit, low 32bits are ino.
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index b3ec34386b43..812165d8ab4f 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -927,6 +927,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
/* kick fsnotify */
down_write(&root->kernfs_rwsem);

+ down_write(&root->supers_rwsem);
list_for_each_entry(info, &kernfs_root(kn)->supers, node) {
struct kernfs_node *parent;
struct inode *p_inode = NULL;
@@ -962,6 +963,7 @@ static void kernfs_notify_workfn(struct work_struct *work)

iput(inode);
}
+ up_write(&root->supers_rwsem);

up_write(&root->kernfs_rwsem);
kernfs_put(kn);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 3ae214d02d44..3cd17c100d10 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -47,6 +47,7 @@ struct kernfs_root {

wait_queue_head_t deactivate_waitq;
struct rw_semaphore kernfs_rwsem;
+ struct rw_semaphore supers_rwsem;
};

/* +1 to avoid triggering overflow warning when negating it */
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index d0859f72d2d6..d2be1c304715 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -347,9 +347,9 @@ int kernfs_get_tree(struct fs_context *fc)
}
sb->s_flags |= SB_ACTIVE;

- down_write(&root->kernfs_rwsem);
+ down_write(&root->supers_rwsem);
list_add(&info->node, &info->root->supers);
- up_write(&root->kernfs_rwsem);
+ up_write(&root->supers_rwsem);
}

fc->root = dget(sb->s_root);
@@ -376,9 +376,9 @@ void kernfs_kill_sb(struct super_block *sb)
struct kernfs_super_info *info = kernfs_info(sb);
struct kernfs_root *root = info->root;

- down_write(&root->kernfs_rwsem);
+ down_write(&root->supers_rwsem);
list_del(&info->node);
- up_write(&root->kernfs_rwsem);
+ up_write(&root->supers_rwsem);

/*
* Remove the superblock from fs_supers/s_instances
--
2.30.2

2022-08-16 22:55:42

by Tejun Heo

[permalink] [raw]
Subject: Re: [RESEND PATCH 4/5] kernfs: Replace per-fs rwsem with hashed rwsems.

Hello,

I'm bandwidth constrained right now and can't really shepherd this patchset,
so I'm not gonna ack or nack the series. That said, here are my thoughts
after glancing through it:

* I find the returning-with-rwsem-held interface and usage odd. We return
with locks held all the time, so that part in itself is fine but how it's
used in the proposed patch is pretty alien.

* I don't understand why the topo_mutex is needed. What is its relationship
with rename_lock?

* Can't the double/triple lock helpers loop over the sorted list instead of
if'ing each case?

Thanks.

--
tejun

2022-11-13 17:47:08

by Imran Khan

[permalink] [raw]
Subject: Re: [RESEND PATCH 4/5] kernfs: Replace per-fs rwsem with hashed rwsems.

Hello Tejun,

I am very sorry for replying late, I was away from work for most of
last couple of months. Also apologies for this lengthy reply but
I want to share some data here from recent experiments to see if
some alternative approach can be used here.

> * I find the returning-with-rwsem-held interface and usage odd. We return
> with locks held all the time, so that part in itself is fine but how it's
> used in the proposed patch is pretty alien.
>

The interface for acquiring and releasing hashed rwsem has been
kept similar to interface for acquiring and releasing hashed mutex (which
was done in part-1 of this work at [1].
Could you please clarify which aspect of its usage you find alien here?
Probably there is some scope of improvement there.

> * I don't understand why the topo_mutex is needed. What is its relationship
> with rename_lock?
>

topo_mutex is providing synchronization between path walk/lookup and remove
operations.
While looking for a node we will not have address of corresponding node
so we can't acquire node's rwsem right from the beginning.
On the other hand a parallel remove operation for the same node can acquire
corresponding rwsem and go ahead with node removal. So it may happen that
search operation for the node finds and returns it but before it can be
pinned or used, the remove operation, that was going on in parallel, removes
the node and hence makes its any future use wrong.
topo_mutex ensures that for competing search and remove operations only
one proceeds at a time and since object returned by search is pinned before
releasing the topo_mutex, it will be available for subsequent usage.

I understand that kernfs_rename_lock protects against parent change but it does
not prevent against node getting delinked from its sibling list or getting removed.

Further during path walks we need to acquire and release rwsems corresponding to
directories so that these directories don't move and their children RB tree
do not change. Since these rwsems can't be taken under a spinlock,
kernfs_rename_lock can't be used and needed protection against topology change
is provided by topo_mutex.

In current scheme of things global kernfs_rwsem automatically ensures that
amongst move, removal and addition of kernfs_node, only one proceeds at time. So
operations like kernfs_walk_ns can complete without any node getting
added/moved/removed in between.
With the approach of hashed kernfs_rwsem, topo_mutex is being used for similar
use cases where we need to ensure that no node gets added/moved/removed during
an operation.

Ideally I would like to avoid topo_mutex because this too is a global lock but I
am not sure if for all kernfs operations we will always be able to get and hold
hashed kernfs_rwsem for all involved nodes in a consistent manner.

These were my reasons for using topo_mutex with hashed kernfs_rwsem. Please let
me know if you still find use of topo_mutex problematic

> * Can't the double/triple lock helpers loop over the sorted list instead of
> if'ing each case?
>
Could you please elaborate this query? Do you mean list of nodes in a path or
list of sibling nodes ?As per my understanding even with triple locks there is
no way to ensure all involved nodes remain consistent.


Below is reasoning and data for my experiments with other approaches.

Since hashed kernfs_rwsem approach has been encountering problems in addressing
some corner cases, I am thinking if some alternative approach can be taken here
to keep kernfs_rwsem global, but replace its usage at some places with
alternative global/hashed rwsems.

For example from the current kernfs code we can see that most of the contention
towards kernfs_rwsem is observed in down_read/up_read emanating from
kernfs_iop_permission and kernfs_dop_revalidate:

- 39.16% 38.98% showgids [kernel.kallsyms] [k] down_read
38.98% __libc_start_main
__open_nocancel
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 36.54% link_path_walk
- 20.23% inode_permission
- __inode_permission
- 20.22% kernfs_iop_permission
down_read
- 15.06% walk_component
lookup_fast
d_revalidate.part.24
kernfs_dop_revalidate
down_read
- 1.25% kernfs_iop_get_link
down_read
- 1.25% may_open
inode_permission
__inode_permission
kernfs_iop_permission
down_read
- 1.20% lookup_fast
d_revalidate.part.24
kernfs_dop_revalidate
down_read
- 28.96% 28.83% showgids [kernel.kallsyms] [k] up_read
28.83% __libc_start_main
__open_nocancel
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 28.42% link_path_walk
- 18.09% inode_permission
- __inode_permission
- 18.07% kernfs_iop_permission
up_read
- 9.08% walk_component
lookup_fast
- d_revalidate.part.24
- 9.08% kernfs_dop_revalidate
up_read
- 1.25% kernfs_iop_get_link

In the above snippet down_read/up_read of kernfs_rwsem is taking ~68% of CPU. We
also know that cache line bouncing for kernfs_rwsem is major contributor towards
this overhead because as per [2], changing kernfs_rwsem to a per-cpu
kernfs_rwsem reduced this to a large extent.

Now kernfs_iop_permission is taking kernfs_rwsem to access inode attributes
which are not accessed in kernfs_dop_revalidate (the other path contending for
kernfs_rwsem). So if we use a separate rwsem for protecting inode attributes we
can see that contention towards kernfs_rwsem greatly reduces. For example using
a global rwsem (kernfs_iattr_rwsem) to protect inode attributes as per following
patch:

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 6acd9c3d4cff..f185427131f9 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -757,11 +757,13 @@ int kernfs_add_one(struct kernfs_node *kn)
goto out_unlock;

/* Update timestamps on the parent */
+ down_write(&root->kernfs_iattr_rwsem);
ps_iattr = parent->iattr;
if (ps_iattr) {
ktime_get_real_ts64(&ps_iattr->ia_ctime);
ps_iattr->ia_mtime = ps_iattr->ia_ctime;
}
+ up_write(&root->kernfs_iattr_rwsem);

up_write(&root->kernfs_rwsem);

@@ -1442,10 +1444,12 @@ static void __kernfs_remove(struct kernfs_node *kn)
pos->parent ? pos->parent->iattr : NULL;

/* update timestamps on the parent */
+ down_write(&root->kernfs_iattr_rwsem);
if (ps_iattr) {
ktime_get_real_ts64(&ps_iattr->ia_ctime);
ps_iattr->ia_mtime = ps_iattr->ia_ctime;
}
+ up_write(&root->kernfs_iattr_rwsem);

kernfs_put(pos);
}
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 74f3453f4639..1b8bffc6d2d3 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -101,9 +101,9 @@ int kernfs_setattr(struct kernfs_node *kn, const struct
iattr *iattr)
int ret;
struct kernfs_root *root = kernfs_root(kn);

- down_write(&root->kernfs_rwsem);
+ down_write(&root->kernfs_iattr_rwsem);
ret = __kernfs_setattr(kn, iattr);
- up_write(&root->kernfs_rwsem);
+ up_write(&root->kernfs_iattr_rwsem);
return ret;
}

@@ -119,7 +119,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns,
struct dentry *dentry,
return -EINVAL;

root = kernfs_root(kn);
- down_write(&root->kernfs_rwsem);
+ down_write(&root->kernfs_iattr_rwsem);
error = setattr_prepare(&init_user_ns, dentry, iattr);
if (error)
goto out;
@@ -132,7 +132,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns,
struct dentry *dentry,
setattr_copy(&init_user_ns, inode, iattr);

out:
- up_write(&root->kernfs_rwsem);
+ up_write(&root->kernfs_iattr_rwsem);
return error;
}
@@ -189,10 +189,10 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
struct kernfs_node *kn = inode->i_private;
struct kernfs_root *root = kernfs_root(kn);

- down_read(&root->kernfs_rwsem);
+ down_read(&root->kernfs_iattr_rwsem);
kernfs_refresh_inode(kn, inode);
generic_fillattr(&init_user_ns, inode, stat);
- up_read(&root->kernfs_rwsem);
+ up_read(&root->kernfs_iattr_rwsem);

return 0;
}

@@ -285,10 +285,10 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
kn = inode->i_private;
root = kernfs_root(kn);

- down_read(&root->kernfs_rwsem);
+ down_read(&root->kernfs_iattr_rwsem);
kernfs_refresh_inode(kn, inode);
ret = generic_permission(&init_user_ns, inode, mask);
- up_read(&root->kernfs_rwsem);
+ up_read(&root->kernfs_iattr_rwsem);

return ret;
}

diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index fc5821effd97..4620b74f44b0 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -47,6 +47,7 @@ struct kernfs_root {

wait_queue_head_t deactivate_waitq;
struct rw_semaphore kernfs_rwsem;
+ struct rw_semaphore kernfs_iattr_rwsem;
};



greatly reduces the CPU usage seen earlier in down_read/up_read:


- 13.08% 13.02% showgids [kernel.kallsyms] [k] down_read
13.02% __libc_start_main
__open_nocancel
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 12.18% link_path_walk
- 9.44% inode_permission
- __inode_permission
- 9.43% kernfs_iop_permission
down_read
- 2.53% walk_component
lookup_fast
d_revalidate.part.24
kernfs_dop_revalidate
down_read
+ 0.62% may_open
- 13.03% 12.97% showgids [kernel.kallsyms] [k] up_read
12.97% __libc_start_main
__open_nocancel
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 12.86% link_path_walk
- 11.55% inode_permission
- __inode_permission
- 11.54% kernfs_iop_permission
up_read
- 1.06% walk_component
lookup_fast
- d_revalidate.part.24
- 1.06% kernfs_dop_revalidate

As can be seen down_read/up_read CPU usage is ~26% (compared to ~68% of default
case).

Further using a hashed rwsem for protecting inode attributes as per following patch:

diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 6acd9c3d4cff..dfc0d2167d86 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -734,6 +734,7 @@ int kernfs_add_one(struct kernfs_node *kn)
struct kernfs_iattrs *ps_iattr;
bool has_ns;
int ret;
+ struct rw_semaphore *rwsem;

down_write(&root->kernfs_rwsem);

@@ -757,11 +758,13 @@ int kernfs_add_one(struct kernfs_node *kn)
goto out_unlock;

/* Update timestamps on the parent */
+ rwsem = kernfs_iattr_down_write(kn);
ps_iattr = parent->iattr;
if (ps_iattr) {
ktime_get_real_ts64(&ps_iattr->ia_ctime);
ps_iattr->ia_mtime = ps_iattr->ia_ctime;
}
+ kernfs_iattr_up_write(rwsem, kn);

up_write(&root->kernfs_rwsem);

@@ -1443,8 +1446,10 @@ static void __kernfs_remove(struct kernfs_node *kn)

/* update timestamps on the parent */
if (ps_iattr) {
+ struct rw_semaphore *rwsem =
kernfs_iattr_down_write(pos->parent);
ktime_get_real_ts64(&ps_iattr->ia_ctime);
ps_iattr->ia_mtime = ps_iattr->ia_ctime;
+ kernfs_iattr_up_write(rwsem, kn);
}

kernfs_put(pos);
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 74f3453f4639..2b3cd5a9464f 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -99,11 +99,12 @@ int __kernfs_setattr(struct kernfs_node *kn, const struct
iattr *iattr)
int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
{
int ret;
+ struct rw_semaphore *rwsem;
struct kernfs_root *root = kernfs_root(kn);

- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_iattr_down_write(kn);
ret = __kernfs_setattr(kn, iattr);
- up_write(&root->kernfs_rwsem);
+ kernfs_iattr_up_write(rwsem, kn);
return ret;
}

@@ -114,12 +115,13 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns,
struct dentry *dentry,
struct kernfs_node *kn = inode->i_private;
struct kernfs_root *root;
int error;
+ struct rw_semaphore *rwsem;

if (!kn)
return -EINVAL;

root = kernfs_root(kn);
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_iattr_down_write(kn);
error = setattr_prepare(&init_user_ns, dentry, iattr);
if (error)
goto out;
@@ -132,7 +134,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns,
struct dentry *dentry,
setattr_copy(&init_user_ns, inode, iattr);

out:
- up_write(&root->kernfs_rwsem);
+ kernfs_iattr_up_write(rwsem, kn);
return error;
}

@@ -188,11 +190,12 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
struct inode *inode = d_inode(path->dentry);
struct kernfs_node *kn = inode->i_private;
struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;

- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_iattr_down_read(kn);
kernfs_refresh_inode(kn, inode);
generic_fillattr(&init_user_ns, inode, stat);
- up_read(&root->kernfs_rwsem);
+ kernfs_iattr_up_read(rwsem, kn);

return 0;
}
@@ -278,6 +281,7 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
struct kernfs_node *kn;
struct kernfs_root *root;
int ret;
+ struct rw_semaphore *rwsem;

if (mask & MAY_NOT_BLOCK)
return -ECHILD;
@@ -285,10 +289,10 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
kn = inode->i_private;
root = kernfs_root(kn);

- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_iattr_down_read(kn);
kernfs_refresh_inode(kn, inode);
ret = generic_permission(&init_user_ns, inode, mask);
- up_read(&root->kernfs_rwsem);
+ kernfs_iattr_up_read(rwsem, kn);

return ret;
}
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index fc5821effd97..bd1ecd126395 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -169,4 +169,53 @@ extern const struct inode_operations kernfs_symlink_iops;
* kernfs locks
*/
extern struct kernfs_global_locks *kernfs_locks;
+
+static inline struct rw_semaphore *kernfs_iattr_rwsem_ptr(struct kernfs_node *kn)
+{
+ int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
+
+ return &kernfs_locks->iattr_rwsem[idx];
+}
+
+static inline struct rw_semaphore *kernfs_iattr_down_write(struct kernfs_node *kn)
+{
+ struct rw_semaphore *rwsem;
+
+ kernfs_get(kn);
+
+ rwsem = kernfs_iattr_rwsem_ptr(kn);
+
+ down_write(rwsem);
+
+ return rwsem;
+}
+
+static inline void kernfs_iattr_up_write(struct rw_semaphore *rwsem,
+ struct kernfs_node *kn)
+{
+ up_write(rwsem);
+
+ kernfs_put(kn);
+}
+
+
+static inline struct rw_semaphore *kernfs_iattr_down_read(struct kernfs_node *kn)
+{
+ struct rw_semaphore *rwsem;
+
+ kernfs_get(kn);
+
+ rwsem = kernfs_iattr_rwsem_ptr(kn);
+
+ down_read(rwsem);
+
+ return rwsem;
+}
+
+static inline void kernfs_iattr_up_read(struct rw_semaphore *rwsem,
+ struct kernfs_node *kn)
+{
+ up_read(rwsem);
+
+ kernfs_put(kn);
+}
#endif /* __KERNFS_INTERNAL_H */
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index d0859f72d2d6..f282e5d65d04 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -392,8 +392,10 @@ static void __init kernfs_mutex_init(void)
{
int count;

- for (count = 0; count < NR_KERNFS_LOCKS; count++)
+ for (count = 0; count < NR_KERNFS_LOCKS; count++) {
mutex_init(&kernfs_locks->open_file_mutex[count]);
+ init_rwsem(&kernfs_locks->iattr_rwsem[count]);
+ }
}

static void __init kernfs_lock_init(void)
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 73f5c120def8..fcbf5e7c921c 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -89,6 +89,7 @@ struct kernfs_iattrs;
*/
struct kernfs_global_locks {
struct mutex open_file_mutex[NR_KERNFS_LOCKS];
+ struct rw_semaphore iattr_rwsem[NR_KERNFS_LOCKS];
};

enum kernfs_node_type {


gives further improvement in CPU usage:

- 8.26% 8.22% showgids [kernel.kallsyms] [k] down_read
8.19% __libc_start_main
__open_nocancel
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 7.59% link_path_walk
- 6.66% walk_component
lookup_fast
- d_revalidate.part.24
- 6.66% kernfs_dop_revalidate
down_read
- 0.71% kernfs_iop_get_link
down_read
- 0.58% lookup_fast
d_revalidate.part.24
kernfs_dop_revalidate
down_read
- 7.44% 7.41% showgids [kernel.kallsyms] [k] up_read
7.39% __libc_start_main
__open_nocancel
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 7.36% link_path_walk
- 6.45% walk_component
lookup_fast
d_revalidate.part.24
kernfs_dop_revalidate
up_read

In above snippet CPU usage in down_read/up_read has gone down to ~16%

So do you think that rather than replacing global kernfs_rwsem with a hashed one
, any of the above mentioned 2 patches (1. Use a global rwsem for protecting
inode attributes or 2. Use a hashed rwsem for protecting inode attributes)
can be used. These patches are not breaking code paths involving multiple nodes
that currently use global kernfs_rwsem.
With hashed kernfs_rwsem I guess there will always be a risk of some corner case
getting missed.

If any of these approaches are acceptable, I can send the patch for review along
with other changes of this series

Thanks again for reviews,
-- Imran

[1]: https://lore.kernel.org/all/[email protected]/
[2]: https://lore.kernel.org/all/[email protected]/

2023-01-04 09:21:50

by Imran Khan

[permalink] [raw]
Subject: Re: [RESEND PATCH 4/5] kernfs: Replace per-fs rwsem with hashed rwsems.

Hello Tejun

On 14/11/2022 4:30 am, Imran Khan wrote:

[...]
>
> Below is reasoning and data for my experiments with other approaches.
>
> Since hashed kernfs_rwsem approach has been encountering problems in addressing
> some corner cases, I am thinking if some alternative approach can be taken here
> to keep kernfs_rwsem global, but replace its usage at some places with
> alternative global/hashed rwsems.
>
> For example from the current kernfs code we can see that most of the contention
> towards kernfs_rwsem is observed in down_read/up_read emanating from
> kernfs_iop_permission and kernfs_dop_revalidate:
>
> - 39.16% 38.98% showgids [kernel.kallsyms] [k] down_read
> 38.98% __libc_start_main
> __open_nocancel
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> sys_open
> do_sys_open
> do_filp_open
> - path_openat
> - 36.54% link_path_walk
> - 20.23% inode_permission
> - __inode_permission
> - 20.22% kernfs_iop_permission
> down_read
> - 15.06% walk_component
> lookup_fast
> d_revalidate.part.24
> kernfs_dop_revalidate
> down_read
> - 1.25% kernfs_iop_get_link
> down_read
> - 1.25% may_open
> inode_permission
> __inode_permission
> kernfs_iop_permission
> down_read
> - 1.20% lookup_fast
> d_revalidate.part.24
> kernfs_dop_revalidate
> down_read
> - 28.96% 28.83% showgids [kernel.kallsyms] [k] up_read
> 28.83% __libc_start_main
> __open_nocancel
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> sys_open
> do_sys_open
> do_filp_open
> - path_openat
> - 28.42% link_path_walk
> - 18.09% inode_permission
> - __inode_permission
> - 18.07% kernfs_iop_permission
> up_read
> - 9.08% walk_component
> lookup_fast
> - d_revalidate.part.24
> - 9.08% kernfs_dop_revalidate
> up_read
> - 1.25% kernfs_iop_get_link
>
> In the above snippet down_read/up_read of kernfs_rwsem is taking ~68% of CPU. We
> also know that cache line bouncing for kernfs_rwsem is major contributor towards
> this overhead because as per [2], changing kernfs_rwsem to a per-cpu
> kernfs_rwsem reduced this to a large extent.
>
> Now kernfs_iop_permission is taking kernfs_rwsem to access inode attributes
> which are not accessed in kernfs_dop_revalidate (the other path contending for
> kernfs_rwsem). So if we use a separate rwsem for protecting inode attributes we
> can see that contention towards kernfs_rwsem greatly reduces. For example using
> a global rwsem (kernfs_iattr_rwsem) to protect inode attributes as per following
> patch:
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 6acd9c3d4cff..f185427131f9 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -757,11 +757,13 @@ int kernfs_add_one(struct kernfs_node *kn)
> goto out_unlock;
>
> /* Update timestamps on the parent */
> + down_write(&root->kernfs_iattr_rwsem);
> ps_iattr = parent->iattr;
> if (ps_iattr) {
> ktime_get_real_ts64(&ps_iattr->ia_ctime);
> ps_iattr->ia_mtime = ps_iattr->ia_ctime;
> }
> + up_write(&root->kernfs_iattr_rwsem);
>
> up_write(&root->kernfs_rwsem);
>
> @@ -1442,10 +1444,12 @@ static void __kernfs_remove(struct kernfs_node *kn)
> pos->parent ? pos->parent->iattr : NULL;
>
> /* update timestamps on the parent */
> + down_write(&root->kernfs_iattr_rwsem);
> if (ps_iattr) {
> ktime_get_real_ts64(&ps_iattr->ia_ctime);
> ps_iattr->ia_mtime = ps_iattr->ia_ctime;
> }
> + up_write(&root->kernfs_iattr_rwsem);
>
> kernfs_put(pos);
> }
> diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> index 74f3453f4639..1b8bffc6d2d3 100644
> --- a/fs/kernfs/inode.c
> +++ b/fs/kernfs/inode.c
> @@ -101,9 +101,9 @@ int kernfs_setattr(struct kernfs_node *kn, const struct
> iattr *iattr)
> int ret;
> struct kernfs_root *root = kernfs_root(kn);
>
> - down_write(&root->kernfs_rwsem);
> + down_write(&root->kernfs_iattr_rwsem);
> ret = __kernfs_setattr(kn, iattr);
> - up_write(&root->kernfs_rwsem);
> + up_write(&root->kernfs_iattr_rwsem);
> return ret;
> }
>
> @@ -119,7 +119,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns,
> struct dentry *dentry,
> return -EINVAL;
>
> root = kernfs_root(kn);
> - down_write(&root->kernfs_rwsem);
> + down_write(&root->kernfs_iattr_rwsem);
> error = setattr_prepare(&init_user_ns, dentry, iattr);
> if (error)
> goto out;
> @@ -132,7 +132,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns,
> struct dentry *dentry,
> setattr_copy(&init_user_ns, inode, iattr);
>
> out:
> - up_write(&root->kernfs_rwsem);
> + up_write(&root->kernfs_iattr_rwsem);
> return error;
> }
> @@ -189,10 +189,10 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
> struct kernfs_node *kn = inode->i_private;
> struct kernfs_root *root = kernfs_root(kn);
>
> - down_read(&root->kernfs_rwsem);
> + down_read(&root->kernfs_iattr_rwsem);
> kernfs_refresh_inode(kn, inode);
> generic_fillattr(&init_user_ns, inode, stat);
> - up_read(&root->kernfs_rwsem);
> + up_read(&root->kernfs_iattr_rwsem);
>
> return 0;
> }
>
> @@ -285,10 +285,10 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
> kn = inode->i_private;
> root = kernfs_root(kn);
>
> - down_read(&root->kernfs_rwsem);
> + down_read(&root->kernfs_iattr_rwsem);
> kernfs_refresh_inode(kn, inode);
> ret = generic_permission(&init_user_ns, inode, mask);
> - up_read(&root->kernfs_rwsem);
> + up_read(&root->kernfs_iattr_rwsem);
>
> return ret;
> }
>
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index fc5821effd97..4620b74f44b0 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -47,6 +47,7 @@ struct kernfs_root {
>
> wait_queue_head_t deactivate_waitq;
> struct rw_semaphore kernfs_rwsem;
> + struct rw_semaphore kernfs_iattr_rwsem;
> };
>
>
>
> greatly reduces the CPU usage seen earlier in down_read/up_read:
>
>
> - 13.08% 13.02% showgids [kernel.kallsyms] [k] down_read
> 13.02% __libc_start_main
> __open_nocancel
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> sys_open
> do_sys_open
> do_filp_open
> - path_openat
> - 12.18% link_path_walk
> - 9.44% inode_permission
> - __inode_permission
> - 9.43% kernfs_iop_permission
> down_read
> - 2.53% walk_component
> lookup_fast
> d_revalidate.part.24
> kernfs_dop_revalidate
> down_read
> + 0.62% may_open
> - 13.03% 12.97% showgids [kernel.kallsyms] [k] up_read
> 12.97% __libc_start_main
> __open_nocancel
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> sys_open
> do_sys_open
> do_filp_open
> - path_openat
> - 12.86% link_path_walk
> - 11.55% inode_permission
> - __inode_permission
> - 11.54% kernfs_iop_permission
> up_read
> - 1.06% walk_component
> lookup_fast
> - d_revalidate.part.24
> - 1.06% kernfs_dop_revalidate
>
> As can be seen down_read/up_read CPU usage is ~26% (compared to ~68% of default
> case).
>
> Further using a hashed rwsem for protecting inode attributes as per following patch:
>
> diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
> index 6acd9c3d4cff..dfc0d2167d86 100644
> --- a/fs/kernfs/dir.c
> +++ b/fs/kernfs/dir.c
> @@ -734,6 +734,7 @@ int kernfs_add_one(struct kernfs_node *kn)
> struct kernfs_iattrs *ps_iattr;
> bool has_ns;
> int ret;
> + struct rw_semaphore *rwsem;
>
> down_write(&root->kernfs_rwsem);
>
> @@ -757,11 +758,13 @@ int kernfs_add_one(struct kernfs_node *kn)
> goto out_unlock;
>
> /* Update timestamps on the parent */
> + rwsem = kernfs_iattr_down_write(kn);
> ps_iattr = parent->iattr;
> if (ps_iattr) {
> ktime_get_real_ts64(&ps_iattr->ia_ctime);
> ps_iattr->ia_mtime = ps_iattr->ia_ctime;
> }
> + kernfs_iattr_up_write(rwsem, kn);
>
> up_write(&root->kernfs_rwsem);
>
> @@ -1443,8 +1446,10 @@ static void __kernfs_remove(struct kernfs_node *kn)
>
> /* update timestamps on the parent */
> if (ps_iattr) {
> + struct rw_semaphore *rwsem =
> kernfs_iattr_down_write(pos->parent);
> ktime_get_real_ts64(&ps_iattr->ia_ctime);
> ps_iattr->ia_mtime = ps_iattr->ia_ctime;
> + kernfs_iattr_up_write(rwsem, kn);
> }
>
> kernfs_put(pos);
> diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
> index 74f3453f4639..2b3cd5a9464f 100644
> --- a/fs/kernfs/inode.c
> +++ b/fs/kernfs/inode.c
> @@ -99,11 +99,12 @@ int __kernfs_setattr(struct kernfs_node *kn, const struct
> iattr *iattr)
> int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
> {
> int ret;
> + struct rw_semaphore *rwsem;
> struct kernfs_root *root = kernfs_root(kn);
>
> - down_write(&root->kernfs_rwsem);
> + rwsem = kernfs_iattr_down_write(kn);
> ret = __kernfs_setattr(kn, iattr);
> - up_write(&root->kernfs_rwsem);
> + kernfs_iattr_up_write(rwsem, kn);
> return ret;
> }
>
> @@ -114,12 +115,13 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns,
> struct dentry *dentry,
> struct kernfs_node *kn = inode->i_private;
> struct kernfs_root *root;
> int error;
> + struct rw_semaphore *rwsem;
>
> if (!kn)
> return -EINVAL;
>
> root = kernfs_root(kn);
> - down_write(&root->kernfs_rwsem);
> + rwsem = kernfs_iattr_down_write(kn);
> error = setattr_prepare(&init_user_ns, dentry, iattr);
> if (error)
> goto out;
> @@ -132,7 +134,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns,
> struct dentry *dentry,
> setattr_copy(&init_user_ns, inode, iattr);
>
> out:
> - up_write(&root->kernfs_rwsem);
> + kernfs_iattr_up_write(rwsem, kn);
> return error;
> }
>
> @@ -188,11 +190,12 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
> struct inode *inode = d_inode(path->dentry);
> struct kernfs_node *kn = inode->i_private;
> struct kernfs_root *root = kernfs_root(kn);
> + struct rw_semaphore *rwsem;
>
> - down_read(&root->kernfs_rwsem);
> + rwsem = kernfs_iattr_down_read(kn);
> kernfs_refresh_inode(kn, inode);
> generic_fillattr(&init_user_ns, inode, stat);
> - up_read(&root->kernfs_rwsem);
> + kernfs_iattr_up_read(rwsem, kn);
>
> return 0;
> }
> @@ -278,6 +281,7 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
> struct kernfs_node *kn;
> struct kernfs_root *root;
> int ret;
> + struct rw_semaphore *rwsem;
>
> if (mask & MAY_NOT_BLOCK)
> return -ECHILD;
> @@ -285,10 +289,10 @@ int kernfs_iop_permission(struct user_namespace *mnt_userns,
> kn = inode->i_private;
> root = kernfs_root(kn);
>
> - down_read(&root->kernfs_rwsem);
> + rwsem = kernfs_iattr_down_read(kn);
> kernfs_refresh_inode(kn, inode);
> ret = generic_permission(&init_user_ns, inode, mask);
> - up_read(&root->kernfs_rwsem);
> + kernfs_iattr_up_read(rwsem, kn);
>
> return ret;
> }
> diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
> index fc5821effd97..bd1ecd126395 100644
> --- a/fs/kernfs/kernfs-internal.h
> +++ b/fs/kernfs/kernfs-internal.h
> @@ -169,4 +169,53 @@ extern const struct inode_operations kernfs_symlink_iops;
> * kernfs locks
> */
> extern struct kernfs_global_locks *kernfs_locks;
> +
> +static inline struct rw_semaphore *kernfs_iattr_rwsem_ptr(struct kernfs_node *kn)
> +{
> + int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
> +
> + return &kernfs_locks->iattr_rwsem[idx];
> +}
> +
> +static inline struct rw_semaphore *kernfs_iattr_down_write(struct kernfs_node *kn)
> +{
> + struct rw_semaphore *rwsem;
> +
> + kernfs_get(kn);
> +
> + rwsem = kernfs_iattr_rwsem_ptr(kn);
> +
> + down_write(rwsem);
> +
> + return rwsem;
> +}
> +
> +static inline void kernfs_iattr_up_write(struct rw_semaphore *rwsem,
> + struct kernfs_node *kn)
> +{
> + up_write(rwsem);
> +
> + kernfs_put(kn);
> +}
> +
> +
> +static inline struct rw_semaphore *kernfs_iattr_down_read(struct kernfs_node *kn)
> +{
> + struct rw_semaphore *rwsem;
> +
> + kernfs_get(kn);
> +
> + rwsem = kernfs_iattr_rwsem_ptr(kn);
> +
> + down_read(rwsem);
> +
> + return rwsem;
> +}
> +
> +static inline void kernfs_iattr_up_read(struct rw_semaphore *rwsem,
> + struct kernfs_node *kn)
> +{
> + up_read(rwsem);
> +
> + kernfs_put(kn);
> +}
> #endif /* __KERNFS_INTERNAL_H */
> diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
> index d0859f72d2d6..f282e5d65d04 100644
> --- a/fs/kernfs/mount.c
> +++ b/fs/kernfs/mount.c
> @@ -392,8 +392,10 @@ static void __init kernfs_mutex_init(void)
> {
> int count;
>
> - for (count = 0; count < NR_KERNFS_LOCKS; count++)
> + for (count = 0; count < NR_KERNFS_LOCKS; count++) {
> mutex_init(&kernfs_locks->open_file_mutex[count]);
> + init_rwsem(&kernfs_locks->iattr_rwsem[count]);
> + }
> }
>
> static void __init kernfs_lock_init(void)
> diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
> index 73f5c120def8..fcbf5e7c921c 100644
> --- a/include/linux/kernfs.h
> +++ b/include/linux/kernfs.h
> @@ -89,6 +89,7 @@ struct kernfs_iattrs;
> */
> struct kernfs_global_locks {
> struct mutex open_file_mutex[NR_KERNFS_LOCKS];
> + struct rw_semaphore iattr_rwsem[NR_KERNFS_LOCKS];
> };
>
> enum kernfs_node_type {
>
>
> gives further improvement in CPU usage:
>
> - 8.26% 8.22% showgids [kernel.kallsyms] [k] down_read
> 8.19% __libc_start_main
> __open_nocancel
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> sys_open
> do_sys_open
> do_filp_open
> - path_openat
> - 7.59% link_path_walk
> - 6.66% walk_component
> lookup_fast
> - d_revalidate.part.24
> - 6.66% kernfs_dop_revalidate
> down_read
> - 0.71% kernfs_iop_get_link
> down_read
> - 0.58% lookup_fast
> d_revalidate.part.24
> kernfs_dop_revalidate
> down_read
> - 7.44% 7.41% showgids [kernel.kallsyms] [k] up_read
> 7.39% __libc_start_main
> __open_nocancel
> entry_SYSCALL_64_after_hwframe
> do_syscall_64
> sys_open
> do_sys_open
> do_filp_open
> - path_openat
> - 7.36% link_path_walk
> - 6.45% walk_component
> lookup_fast
> d_revalidate.part.24
> kernfs_dop_revalidate
> up_read
>
> In above snippet CPU usage in down_read/up_read has gone down to ~16%
>
> So do you think that rather than replacing global kernfs_rwsem with a hashed one
> , any of the above mentioned 2 patches (1. Use a global rwsem for protecting
> inode attributes or 2. Use a hashed rwsem for protecting inode attributes)
> can be used. These patches are not breaking code paths involving multiple nodes
> that currently use global kernfs_rwsem.
> With hashed kernfs_rwsem I guess there will always be a risk of some corner case
> getting missed.
>
> If any of these approaches are acceptable, I can send the patch for review along
> with other changes of this series
>
Could you please share your feedback about the above mentioned 2 approaches to
reduce contention around global kernfs_rwsem ? Also the first 2 patches of this
series [1] are not dealing specifically with with kernfs_rwsem, so could you
please share your feedback about those 2 as well.

Thanks,
-- Imran

[1]: https://lore.kernel.org/lkml/[email protected]/