Reduce contention around global locks used in kernfs.
The current kernfs design makes use of 3 global locks to synchronize
various operations. There are a few other global locks as well but
they are used for specific cases and hence don't cause severe contention.
The above mentioned 3 main global locks used in kernfs are:
1. A global mutex, kernfs_open_file_mutex, to protect the list of
kernfs_open_file instances correspondng to a sysfs attribute.
2. A global spinlock, kernfs_open_node_lock, to protect
kernfs_node->attr.open which points to kernfs_open_node instance
corresponding to a kernfs_node.
3. A global per-fs rw semaphore, root->kernfs_rwsem, to synchronize most
of the other operations like adding, removing, renaming etc. of a file
or directory.
Since these locks are global, they can cause contention when multiple
(for example few hundred) applications try to access sysfs (or other kernfs
based file system) files in parallel, even if the applications are
accessing different and unrelated files.
For example on a system with 384 cores, if I run 200 instances of an
application which is mostly executing the following loop:
for (int loop = 0; loop <100 ; loop++)
{
for (int port_num = 1; port_num < 2; port_num++)
{
for (int gid_index = 0; gid_index < 254; gid_index++ )
{
char ret_buf[64], ret_buf_lo[64];
char gid_file_path[1024];
int ret_len;
int ret_fd;
ssize_t ret_rd;
ub4 i, saved_errno;
memset(ret_buf, 0, sizeof(ret_buf));
memset(gid_file_path, 0, sizeof(gid_file_path));
ret_len = snprintf(gid_file_path, sizeof(gid_file_path),
"/sys/class/infiniband/%s/ports/%d/gids/%d",
dev_name,
port_num,
gid_index);
ret_fd = open(gid_file_path, O_RDONLY | O_CLOEXEC);
if (ret_fd < 0)
{
printf("Failed to open %s\n", gid_file_path);
continue;
}
/* Read the GID */
ret_rd = read(ret_fd, ret_buf, 40);
if (ret_rd == -1)
{
printf("Failed to read from file %s, errno: %u\n",
gid_file_path, saved_errno);
continue;
}
close(ret_fd);
}
}
I can see contention around above mentioned locks as follows:
- 54.07% 53.60% showgids [kernel.kallsyms] [k] osq_lock
- 53.60% __libc_start_main
- 32.29% __GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
path_openat
vfs_open
do_dentry_open
kernfs_fop_open
mutex_lock
- __mutex_lock_slowpath
- 32.23% __mutex_lock.isra.5
osq_lock
- 21.31% __GI___libc_close
entry_SYSCALL_64_after_hwframe
do_syscall_64
exit_to_usermode_loop
task_work_run
____fput
__fput
kernfs_fop_release
kernfs_put_open_node.isra.8
mutex_lock
- __mutex_lock_slowpath
- 21.28% __mutex_lock.isra.5
osq_lock
- 10.49% 10.39% showgids [kernel.kallsyms] [k] down_read
10.39% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 9.72% link_path_walk
- 5.21% inode_permission
- __inode_permission
- 5.21% kernfs_iop_permission
down_read
- 4.08% walk_component
lookup_fast
- d_revalidate.part.24
- 4.08% kernfs_dop_revalidate
- 7.48% 7.41% showgids [kernel.kallsyms] [k] up_read
7.41% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 7.01% link_path_walk
- 4.12% inode_permission
- __inode_permission
- 4.12% kernfs_iop_permission
up_read
- 2.61% walk_component
lookup_fast
- d_revalidate.part.24
- 2.61% kernfs_dop_revalidate
Moreover this run of 200 application isntances takes 32-34 secs. to
complete.
This patch set is reducing the above mentioned contention by replacing
these global locks with hashed locks.
For example with the patched kernel and on the same test setup, we no
longer see contention around osq_lock (i.e kernfs_open_file_mutex) and also
contention around per-fs kernfs_rwsem has reduced significantly as well.
This can be seen in the following perf snippet:
- 1.66% 1.65% showgids [kernel.kallsyms] [k] down_read
1.65% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 1.62% link_path_walk
- 0.98% inode_permission
- __inode_permission
+ 0.98% kernfs_iop_permission
- 0.52% walk_component
lookup_fast
- d_revalidate.part.24
- 0.52% kernfs_dop_revalidate
- 1.12% 1.11% showgids [kernel.kallsyms] [k] up_read
1.11% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 1.11% link_path_walk
- 0.69% inode_permission
- __inode_permission
- 0.69% kernfs_iop_permission
up_read
Moreover the test execution time has reduced from 32-34 secs to 18-19 secs.
The patches of this patchset introduce following changes:
PATCH-1: Introduce interface to access global kernfs_open_file_mutex.
PATCH-2: Replace global kernfs_open_file_mutex with hashed mutexes.
PATCH-3: Introduce interface to access kernfs_open_node_lock.
PATCH-4: Replace global kernfs_open_node_lock with hashed spinlocks.
PATCH-5: Use a per-fs rwsem to protect per-fs list of kernfs_super_info.
PATCH-6: Introduce interface to access per-fs rwsem.
PATCH-7: Replace per-fs rwsem with hashed rwsems.
PATCH-8: Add a document to describe hashed locks used in kernfs.
------------------------------------------------------------------
Changes since v6:
- Addressed review comments from Tejun
- Make locking helpers compact and remove unlock_parent from node.
- Addressed review comments from Al Viro
- Make multi node lock helpers non-inline
- Account for change of parent while waiting on semaphores during
rename operation
- Add a document to describe hashed locks introduced in this patch
and to provide proof of correctness
- Fix for some issues that I observed while preparing lock correctness
document
- Lock both parent and target while looking for symlink
- Introduce a per-fs mutex to synchronize lookup and removal of a node
- Avoid locking parent in remove_self, because only the first instance
does actual removal so other invocations of remove_self don't need
to lock the parent
- Remove refcounting from lock helpers
- Refcounting was present in lock helpers for cases where an execution
path acquires node's rwsem and then deletes the node. For such cases
release helpers should not try to acquire semaphore for this already
freed node. Refcounting was ensuring that release helpers could get
an still existing node.
I have modified locking helpers such that helpers that acquire rwsem
returns its address which can later be used by release helpers.
Changes since v5:
- Addressed review comments from Greg
- Clean up duplicate code.
- Addressed review comments from Tejun
- Explain how current value of NR_KERNFS_LOCKS were obtained for
different values of NR_CPUS.
- Use single hash table for locks in place of per-fs hash table
for locks.
- Move introduction of supers_rwsem to a separate patch.
- Separate interface and usage part of hashed rwsem implementation
into 2 patches.
- Use address of rwsems to determine locking order in case of
nested locking. This approach avoids ABBA deadlock possibility.
- Change some #define macros into enum, with proper prefix.
- Taking a cue from Tejun's feedback about separating hashed rwsem
interface and usage into 2 patches, I have also divided the patch
that introduced hashed mutex and spinlock, into separate patches.
- Rebase on linux-next tag: next-20220211
Changes since v4:
- Removed compilation warnings reported by the 0-day bot.
Changes since v3:
- Make open_file_mutex and open_node_lock per-fs.
- Replace per-fs rwsem with per-fs hashed rwsem.
(Suggested by Tejun Heo <[email protected]>)
Imran Khan (8):
kernfs: Introduce interface to access global kernfs_open_file_mutex.
kernfs: Replace global kernfs_open_file_mutex with hashed mutexes.
kernfs: Introduce interface to access kernfs_open_node_lock.
kernfs: Replace global kernfs_open_node_lock with hashed spinlocks.
kernfs: Use a per-fs rwsem to protect per-fs list of
kernfs_super_info.
kernfs: Introduce interface to access per-fs rwsem.
kernfs: Replace per-fs rwsem with hashed rwsems.
kernfs: Add a document to describe hashed locks used in kernfs.
.../filesystems/kernfs-hashed-locks.rst | 245 +++++++++++++++++
fs/kernfs/Makefile | 2 +-
fs/kernfs/dir.c | 212 +++++++++-----
fs/kernfs/file.c | 66 +++--
fs/kernfs/inode.c | 48 +++-
fs/kernfs/kernfs-internal.c | 259 ++++++++++++++++++
fs/kernfs/kernfs-internal.h | 159 +++++++++++
fs/kernfs/mount.c | 30 +-
fs/kernfs/symlink.c | 13 +-
include/linux/kernfs.h | 71 ++++-
10 files changed, 980 insertions(+), 125 deletions(-)
create mode 100644 Documentation/filesystems/kernfs-hashed-locks.rst
create mode 100644 fs/kernfs/kernfs-internal.c
base-commit: 196d330d7fb1e7cc0d85641c89ce4602cb36f12e
--
2.30.2
This document describes usage and proof of various hashed locks
introduced in this patch set
Signed-off-by: Imran Khan <[email protected]>
---
.../filesystems/kernfs-hashed-locks.rst | 245 ++++++++++++++++++
1 file changed, 245 insertions(+)
create mode 100644 Documentation/filesystems/kernfs-hashed-locks.rst
diff --git a/Documentation/filesystems/kernfs-hashed-locks.rst b/Documentation/filesystems/kernfs-hashed-locks.rst
new file mode 100644
index 0000000000000..2ffa579ee1e3b
--- /dev/null
+++ b/Documentation/filesystems/kernfs-hashed-locks.rst
@@ -0,0 +1,245 @@
+.. SPDX-License-Identifier: GPL-2.0-only
+
+===================
+kernfs hashed locks
+===================
+
+kernfs uses following hashed locks
+
+1. Hashed mutexes
+2. Hashed spinlock
+3. Hashed rwsem
+
+In certain cases hashed rwsem needs to work in conjunction with a per-fs mutex
+(Described further below).So this document describes this mutex as well.
+
+A kernfs_global_locks object (defined below) provides hashed mutexes,
+hashed spinlocks and hashed rwsems.
+
+ struct kernfs_global_locks {
+ struct kernfs_open_file_mutex open_file_mutex[NR_KERNFS_LOCKS];
+ struct kernfs_open_node_lock open_node_locks[NR_KERNFS_LOCKS];
+ struct rw_semaphore kernfs_rwsem[NR_KERNFS_LOCKS];
+ };
+
+The hashed mutexes and spinlocks are encapsulated in kernfs_open_file_mutex and
+kernfs_open_node_lock respectively as shown below:
+
+struct kernfs_open_file_mutex {
+ struct mutex lock;
+} ____cacheline_aligned_in_smp;
+
+struct kernfs_open_node_lock {
+ spinlock_t lock;
+} ____cacheline_aligned_in_smp;
+
+
+For all hashed locks address of a kernfs_node object acts as hashing key.
+
+For the remainder of this document a node means a kernfs_node object. The
+node can refer to a file, directory or symlink of a kernfs based file system.
+Also a node's mutex, spinlock or rwsem refers to hashed mutex, hashed spinlock
+or hashed rwsem corresponding to the node.
+It does not mean any locking construct embedded in the kernfs_node itself.
+
+What is protected by hashed locks
+=================================
+
+(1) There's one kernfs_open_file for each open file and all kernfs_open_file
+ instances corresponding to a kernfs_node are maintained in a list.
+ hashed mutexes or kernfs_global_locks.open_file_mutex[index].lock protects
+ this list.
+
+(2) For each kernfs file that has been opened there is one instance of
+ kernfs_open_node and kernfs_node->attr.open points to it.
+ hashed spinlocks or kernfs_global_locks.open_node_locks[index].lock protects
+ ->attr.open.
+
+(3) Hashed rwsems or kernfs_global_locks.kernfs_rwsem[index] protects node's
+ state and synchronizes operations that change state of a node or depend on
+ the state of a node.
+
+(4) per-fs mutex (mentioned earlier) provides synchronization between lookup
+ and remove operations.
+ While looking for a node we will not have address of corresponding node
+ so we can't acquire node's rwsem right from the beginning.
+ On the other hand a parallel remove operation for the same node can acquire
+ corresponding rwsem and go ahead with node removal. So it may happen that
+ search operation for the node finds and returns it but before it can be
+ pinned or used, the remove operation, that was going on in parallel, removes
+ the node and hence makes its any future use wrong.
+ per-fs mutex ensures that for competing search and remove operations only
+ one proceeds at a time and since object returned by search is pinned before
+ releasing the per-fs mutex, it will be available for subsequent usage.
+
+
+Lock usage and proof
+=======================
+
+(1) Hashed mutexes
+
+ Since hashed mutexes protect the list of kernfs_open_file instances
+ corresponding to a kernfs_node, ->open and ->release backends of
+ file_operations need to acquire hashed mutex corresponding to kernfs_node.
+ Also when a kernfs_node is removed, all of its kernfs_open_file instances
+ are drained after deactivating the node. This drain operation acquires
+ hashed mutex to traverse list of kernfs_open_file instances.
+ So addition (via ->open), deletion (via ->release) and traversal
+ (during kernfs_drain) of kernfs_open_file list occurs in a synchronous
+ manner.
+
+(2) Hashed spinlocks
+
+ As hashed spinlocks protect ->attr.open, ->open and ->release backends of
+ file operations need to acquire spinlock corresponding the kernfs_node so
+ that kernfs_open_node instances can be properly refcounted and freed when
+ this refcount reaches 0.
+
+ file events notifier uses ->poll backend of kernfs_open_node instance, so
+ it also needs node's spinlock before accessing ->attr.open.
+
+(3) Hashed rwsems
+
+ 3.1. A node's rwsem protects its state and needs to be acquired to:
+ 3.1.a. Remove the node
+ 3.1.b. Move the node
+ 3.1.c. Travers or modify a node's children RB tree (for
+ directories), i.e to add/remove files/subdirectories
+ within/from a directory.
+ 3.1.d. Modify or access node's inode attributes
+
+ 3.2. Hashed rwsems are used in following operations:
+
+ 3.2.a. Addition of a new node
+
+ While adding a new kernfs_node under a kernfs directory
+ kernfs_add_one acquires directory node's rwsem for
+ writing. Clause 3.1.a ensures that directory exists
+ throughout the operation. Clause 3.1.c ensures proper
+ updation of children rb tree (i.e ->dir.children).
+ Clause 3.1.d ensures correct modification of inode
+ attribute to reflect timestamp of this operation.
+ If the directory gets removed while waiting for semaphore,
+ the subsequent checks in kernfs_add_one will fail resulting
+ in early bail out from kernfs_add_one.
+
+ 3.2.b. Removal of a node
+
+ Removal of a node involves recursive removal of all of its
+ descendants as well. per-fs mutex (i.e kernfs_rm_mutex) avoids
+ concurrent node removals even if the nodes are different.
+
+ At first node's rwsem is acquired. Clause 3.1.c avoids parallel
+ modification of descendant tree and while holding this rwsem
+ each of the descendants are deactivated.
+
+ Once a descendant has been deactivated and drained, its parent's
+ rwsem is taken. Clause 3.1.c ensures proper unlinking of this
+ descendant from its siblings. Clause 3.1.d ensures that parent's
+ inode attributes are correctly updated to record time stamp of
+ removal.
+
+ 3.2.c. Movement of a node
+
+ Moving or renaming a node (kernfs_rename_ns) acquires rwsem for
+ node and its old and new parents. Clauses 3.1.b and 3.1.c avoid
+ concurrent move operations for the same node.
+ Also if old parent of a node changes while waiting for rwsem,
+ the acquisition of rwsem for 3 involved nodes is attempted
+ again. It is always ensured that as far as old parent is
+ concerned, rwsem corresponding to current parent is acquired.
+
+ 3.2.d. Reading a directory
+
+ For diectory reading kernfs_fop_readdir acquires directory
+ node's rwsem for reading. Clause 3.1.c ensures a consistent view
+ of children RB tree.
+ As far as directroy being read is concerned, if it gets removed
+ while waiting for semaphore, the for loop that iterates through
+ children will be ineffective. So for this operation acquiring
+ directory node's rwsem for reading is enough.
+
+ 3.2.e. Dentry revalidation
+
+ A dentry revalidation (kernfs_dop_revalidate) can happen for a
+ negative or for a normal dentry.
+ For negative dentries we just need to check parent change, so in
+ this case acquiring parent kernfs_node's rwsem for reading is
+ enough.
+ For a normal dentry acquiring node's rwsem for reading is enough
+ (Clause 3.1.a and 3.1.b).
+ If node gets removed while waiting for the lock subsequent checks
+ in kernfs_dop_revalidate will fail and kernfs_dop_revalidate will
+ exit early.
+
+ 3.2.f. kernfs_node lookup
+
+ While searching for a node under a given parent
+ (kernfs_find_and_get_ns, kernfs_walk_and_get_ns) rwsem of parent
+ node is acquired for reading. Clause 3.1.c ensures a consistent
+ view of parent's children RB tree. To avoid parallel removal of
+ found node before it gets pinned, these operation make use of
+ per-fs mutex (kernfs_rm_mutex) as explained earlier.
+ This per-fs mutex is also taken during kernfs_node removal
+ (__kernfs_remove).
+
+ If the node being searched gets removed while waiting for the
+ mutex or rwsem, the subsequent kernfs_find_ns or kernfs_walk_ns
+ will fail.
+
+ 3.2.g. kenfs_node's inode lookup
+
+ Looking up for inode instances via kernfs_iop_lookup involves
+ node lookup. So locks acquired are same as ones required in 3.2.f.
+ Also once node lookup is complete parent's rwsem is released and
+ rwsem of found node is acquired to get corresponding inode.
+ Since we are operating under per-fs kernfs_rm_mutex the found node
+ will not disappear in the middle.
+
+ 3.2.h. Updating or reading inode attribute
+
+ Interfaces that change inode attributes(i.e kernfs_setattr and
+ kernfs_iop_setattr) acquire node's rwsem for writing.
+ If the kernfs_node gets removed while waiting for the semaphore
+ the subsequent __kernfs_setattr will fail.
+ From 3.2.a and 3.2.b we know that updates due to addition or
+ removal of nodes will not happen in parallel.
+ So just locking the kernfs_node in these cases is enough to
+ guarantee correct modification of inode attributes.
+ Similarly the interfaces that read inode attributes
+ (i.e kernfs_iop_getattr, kernfs_iop_permission) just need to
+ acquire involved node's rwsem for reading.
+
+ 3.2.i. kernfs file event generation
+
+ kernfs_notify pins involved node before scheduling
+ kernfs_notify_work and kernfs_notify_workfn acquires node's
+ rwsem. Clauses in 3.1 ensure a consistent view of node state
+ throughout execution of work handler.
+
+ 3.2.j. mount
+
+ kernfs_fill_super, invoked during mount operation, acquires root
+ node's rwsem. During mount process there can't be other execution
+ contexts trying to move or delete the node so just locking the
+ involved node(i.e the root node) is enough.
+
+ 3.2.k. getting symlink for a kernfs file
+
+ kernfs_getlink locks both parent and target nodes. Clauses
+ 3.1.a and 3.1.b avoid movement/removal of parent and target
+ nodes. If parent or target gets moved or removed while
+ kernfs_getlink was waiting for rwsem, the subsequent
+ kernfs_get_target_path will return error.
+
+ 3.2.l. while activating a node
+
+ For a node that started as deactivated, kernfs_activate
+ activates the node. In this case acquiring node's rwsem is
+ enough. Since the node is not active yet any parallel removal
+ that wins the race for rwsem will skip this node and its
+ descendents. Also user space can't see a deactivated node so we
+ don't have any parallel access emanating from their as well.
+
+ 3.3 For operations that involve locking multiple nodes at the same time
+ locks are acquired in order of their addresses.
--
2.30.2
per-fs rwsem is used across kernfs for synchronization purposes.
Having an interface to access it not only avoids code duplication, it
can also help in changing the underlying locking mechanism without needing
to change the lock users. For example next patch modifies this interface
to make use of hashed rwsems in place of per-fs rwsem.
Signed-off-by: Imran Khan <[email protected]>
---
fs/kernfs/dir.c | 114 ++++++++++++++++++------------------
fs/kernfs/file.c | 5 +-
fs/kernfs/inode.c | 26 ++++----
fs/kernfs/kernfs-internal.h | 78 ++++++++++++++++++++++++
fs/kernfs/mount.c | 6 +-
fs/kernfs/symlink.c | 6 +-
6 files changed, 156 insertions(+), 79 deletions(-)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index dc769301ac96b..8f22b2735755f 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -25,7 +25,7 @@ static DEFINE_SPINLOCK(kernfs_idr_lock); /* root->ino_idr */
static bool kernfs_active(struct kernfs_node *kn)
{
- lockdep_assert_held(&kernfs_root(kn)->kernfs_rwsem);
+ kernfs_rwsem_assert_held(kn);
return atomic_read(&kn->active) >= 0;
}
@@ -461,10 +461,16 @@ static void kernfs_drain(struct kernfs_node *kn)
{
struct kernfs_root *root = kernfs_root(kn);
- lockdep_assert_held_write(&root->kernfs_rwsem);
+ /**
+ * kn has the same root as its ancestor, so it can be used to get
+ * per-fs rwsem.
+ */
+ struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+
+ kernfs_rwsem_assert_held_write(kn);
WARN_ON_ONCE(kernfs_active(kn));
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
if (kernfs_lockdep(kn)) {
rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
@@ -483,7 +489,7 @@ static void kernfs_drain(struct kernfs_node *kn)
kernfs_drain_open_files(kn);
- down_write(&root->kernfs_rwsem);
+ kernfs_down_write(kn);
}
/**
@@ -718,12 +724,12 @@ struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
int kernfs_add_one(struct kernfs_node *kn)
{
struct kernfs_node *parent = kn->parent;
- struct kernfs_root *root = kernfs_root(parent);
struct kernfs_iattrs *ps_iattr;
+ struct rw_semaphore *rwsem;
bool has_ns;
int ret;
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(parent);
ret = -EINVAL;
has_ns = kernfs_ns_enabled(parent);
@@ -754,7 +760,7 @@ int kernfs_add_one(struct kernfs_node *kn)
ps_iattr->ia_mtime = ps_iattr->ia_ctime;
}
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
/*
* Activate the new node unless CREATE_DEACTIVATED is requested.
@@ -768,7 +774,7 @@ int kernfs_add_one(struct kernfs_node *kn)
return 0;
out_unlock:
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return ret;
}
@@ -789,7 +795,7 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
bool has_ns = kernfs_ns_enabled(parent);
unsigned int hash;
- lockdep_assert_held(&kernfs_root(parent)->kernfs_rwsem);
+ kernfs_rwsem_assert_held(parent);
if (has_ns != (bool)ns) {
WARN(1, KERN_WARNING "kernfs: ns %s in '%s' for '%s'\n",
@@ -821,7 +827,7 @@ static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
size_t len;
char *p, *name;
- lockdep_assert_held_read(&kernfs_root(parent)->kernfs_rwsem);
+ kernfs_rwsem_assert_held_read(parent);
/* grab kernfs_rename_lock to piggy back on kernfs_pr_cont_buf */
spin_lock_irq(&kernfs_rename_lock);
@@ -860,12 +866,12 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
const char *name, const void *ns)
{
struct kernfs_node *kn;
- struct kernfs_root *root = kernfs_root(parent);
+ struct rw_semaphore *rwsem;
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
kn = kernfs_find_ns(parent, name, ns);
kernfs_get(kn);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return kn;
}
@@ -885,12 +891,12 @@ struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
const char *path, const void *ns)
{
struct kernfs_node *kn;
- struct kernfs_root *root = kernfs_root(parent);
+ struct rw_semaphore *rwsem;
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
kn = kernfs_walk_ns(parent, path, ns);
kernfs_get(kn);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return kn;
}
@@ -1046,7 +1052,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
{
struct kernfs_node *kn;
- struct kernfs_root *root;
+ struct rw_semaphore *rwsem;
if (flags & LOOKUP_RCU)
return -ECHILD;
@@ -1062,13 +1068,12 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
parent = kernfs_dentry_node(dentry->d_parent);
if (parent) {
spin_unlock(&dentry->d_lock);
- root = kernfs_root(parent);
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
if (kernfs_dir_changed(parent, dentry)) {
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return 0;
}
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
} else
spin_unlock(&dentry->d_lock);
@@ -1079,8 +1084,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
}
kn = kernfs_dentry_node(dentry);
- root = kernfs_root(kn);
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(kn);
/* The kernfs node has been deactivated */
if (!kernfs_active(kn))
@@ -1099,10 +1103,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
kernfs_info(dentry->d_sb)->ns != kn->ns)
goto out_bad;
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return 1;
out_bad:
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return 0;
}
@@ -1116,12 +1120,11 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
{
struct kernfs_node *parent = dir->i_private;
struct kernfs_node *kn;
- struct kernfs_root *root;
struct inode *inode = NULL;
const void *ns = NULL;
+ struct rw_semaphore *rwsem;
- root = kernfs_root(parent);
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dir->i_sb)->ns;
@@ -1132,7 +1135,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
* create a negative.
*/
if (!kernfs_active(kn)) {
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return NULL;
}
inode = kernfs_get_inode(dir->i_sb, kn);
@@ -1147,7 +1150,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
*/
if (!IS_ERR(inode))
kernfs_set_rev(parent, dentry);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
/* instantiate and hash (possibly negative) dentry */
return d_splice_alias(inode, dentry);
@@ -1270,7 +1273,7 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
{
struct rb_node *rbn;
- lockdep_assert_held_write(&kernfs_root(root)->kernfs_rwsem);
+ kernfs_rwsem_assert_held_write(root);
/* if first iteration, visit leftmost descendant which may be root */
if (!pos)
@@ -1305,9 +1308,9 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
void kernfs_activate(struct kernfs_node *kn)
{
struct kernfs_node *pos;
- struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
pos = NULL;
while ((pos = kernfs_next_descendant_post(pos, kn))) {
@@ -1321,14 +1324,14 @@ void kernfs_activate(struct kernfs_node *kn)
pos->flags |= KERNFS_ACTIVATED;
}
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
}
static void __kernfs_remove(struct kernfs_node *kn)
{
struct kernfs_node *pos;
- lockdep_assert_held_write(&kernfs_root(kn)->kernfs_rwsem);
+ kernfs_rwsem_assert_held_write(kn);
/*
* Short-circuit if non-root @kn has already finished removal.
@@ -1398,11 +1401,11 @@ static void __kernfs_remove(struct kernfs_node *kn)
*/
void kernfs_remove(struct kernfs_node *kn)
{
- struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
__kernfs_remove(kn);
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
}
/**
@@ -1488,9 +1491,9 @@ void kernfs_unbreak_active_protection(struct kernfs_node *kn)
bool kernfs_remove_self(struct kernfs_node *kn)
{
bool ret;
- struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
kernfs_break_active_protection(kn);
/*
@@ -1518,9 +1521,9 @@ bool kernfs_remove_self(struct kernfs_node *kn)
atomic_read(&kn->active) == KN_DEACTIVATED_BIAS)
break;
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
schedule();
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
}
finish_wait(waitq, &wait);
WARN_ON_ONCE(!RB_EMPTY_NODE(&kn->rb));
@@ -1533,7 +1536,7 @@ bool kernfs_remove_self(struct kernfs_node *kn)
*/
kernfs_unbreak_active_protection(kn);
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return ret;
}
@@ -1550,7 +1553,7 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
const void *ns)
{
struct kernfs_node *kn;
- struct kernfs_root *root;
+ struct rw_semaphore *rwsem;
if (!parent) {
WARN(1, KERN_WARNING "kernfs: can not remove '%s', no directory\n",
@@ -1558,14 +1561,13 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
return -ENOENT;
}
- root = kernfs_root(parent);
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(parent);
kn = kernfs_find_ns(parent, name, ns);
if (kn)
__kernfs_remove(kn);
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
if (kn)
return 0;
@@ -1584,16 +1586,15 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
const char *new_name, const void *new_ns)
{
struct kernfs_node *old_parent;
- struct kernfs_root *root;
const char *old_name = NULL;
+ struct rw_semaphore *rwsem;
int error;
/* can't move or rename root */
if (!kn->parent)
return -EINVAL;
- root = kernfs_root(kn);
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
error = -ENOENT;
if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
@@ -1647,7 +1648,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
error = 0;
out:
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return error;
}
@@ -1718,14 +1719,13 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
struct dentry *dentry = file->f_path.dentry;
struct kernfs_node *parent = kernfs_dentry_node(dentry);
struct kernfs_node *pos = file->private_data;
- struct kernfs_root *root;
const void *ns = NULL;
+ struct rw_semaphore *rwsem;
if (!dir_emit_dots(file, ctx))
return 0;
- root = kernfs_root(parent);
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dentry->d_sb)->ns;
@@ -1742,12 +1742,12 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
file->private_data = pos;
kernfs_get(pos);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
if (!dir_emit(ctx, name, len, ino, type))
return 0;
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
}
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
file->private_data = NULL;
ctx->pos = INT_MAX;
return 0;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 07003d47343d7..f46c25fb789fb 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -838,6 +838,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
struct kernfs_node *kn;
struct kernfs_super_info *info;
struct kernfs_root *root;
+ struct rw_semaphore *rwsem;
repeat:
/* pop one off the notify_list */
spin_lock_irq(&kernfs_notify_lock);
@@ -852,7 +853,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
root = kernfs_root(kn);
/* kick fsnotify */
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
down_write(&root->supers_rwsem);
list_for_each_entry(info, &kernfs_root(kn)->supers, node) {
@@ -892,7 +893,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
}
up_write(&root->supers_rwsem);
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
kernfs_put(kn);
goto repeat;
}
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 3d783d80f5daa..efe5ae98abf46 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -99,11 +99,11 @@ int __kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
{
int ret;
- struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
ret = __kernfs_setattr(kn, iattr);
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return ret;
}
@@ -112,14 +112,13 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
{
struct inode *inode = d_inode(dentry);
struct kernfs_node *kn = inode->i_private;
- struct kernfs_root *root;
+ struct rw_semaphore *rwsem;
int error;
if (!kn)
return -EINVAL;
- root = kernfs_root(kn);
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
error = setattr_prepare(&init_user_ns, dentry, iattr);
if (error)
goto out;
@@ -132,7 +131,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
setattr_copy(&init_user_ns, inode, iattr);
out:
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return error;
}
@@ -187,14 +186,14 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
{
struct inode *inode = d_inode(path->dentry);
struct kernfs_node *kn = inode->i_private;
- struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(kn);
spin_lock(&inode->i_lock);
kernfs_refresh_inode(kn, inode);
generic_fillattr(&init_user_ns, inode, stat);
spin_unlock(&inode->i_lock);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return 0;
}
@@ -277,22 +276,21 @@ void kernfs_evict_inode(struct inode *inode)
int kernfs_iop_permission(struct user_namespace *mnt_userns,
struct inode *inode, int mask)
{
+ struct rw_semaphore *rwsem;
struct kernfs_node *kn;
- struct kernfs_root *root;
int ret;
if (mask & MAY_NOT_BLOCK)
return -ECHILD;
kn = inode->i_private;
- root = kernfs_root(kn);
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(kn);
spin_lock(&inode->i_lock);
kernfs_refresh_inode(kn, inode);
ret = generic_permission(&init_user_ns, inode, mask);
spin_unlock(&inode->i_lock);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return ret;
}
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 64e9cca66d436..bb934949d5eb5 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -188,4 +188,82 @@ static inline spinlock_t *kernfs_open_node_spinlock(struct kernfs_node *kn)
return lock;
}
+static inline struct rw_semaphore *kernfs_rwsem_ptr(struct kernfs_node *kn)
+{
+ struct kernfs_root *root = kernfs_root(kn);
+
+ return &root->kernfs_rwsem;
+}
+
+static inline void kernfs_rwsem_assert_held(struct kernfs_node *kn)
+{
+ lockdep_assert_held(kernfs_rwsem_ptr(kn));
+}
+
+static inline void kernfs_rwsem_assert_held_write(struct kernfs_node *kn)
+{
+ lockdep_assert_held_write(kernfs_rwsem_ptr(kn));
+}
+
+static inline void kernfs_rwsem_assert_held_read(struct kernfs_node *kn)
+{
+ lockdep_assert_held_read(kernfs_rwsem_ptr(kn));
+}
+
+/**
+ * kernfs_down_write() - Acquire kernfs rwsem
+ *
+ * @kn: kernfs_node for which rwsem needs to be taken
+ *
+ * Return: pointer to acquired rwsem
+ */
+static inline struct rw_semaphore *kernfs_down_write(struct kernfs_node *kn)
+{
+ struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+
+ down_write(rwsem);
+
+ return rwsem;
+}
+
+/**
+ * kernfs_up_write - Release kernfs rwsem
+ *
+ * @rwsem: address of rwsem to release
+ *
+ * Return: void
+ */
+static inline void kernfs_up_write(struct rw_semaphore *rwsem)
+{
+ up_write(rwsem);
+}
+
+/**
+ * kernfs_down_read() - Acquire kernfs rwsem
+ *
+ * @kn: kernfs_node for which rwsem needs to be taken
+ *
+ * Return: pointer to acquired rwsem
+ */
+static inline struct rw_semaphore *kernfs_down_read(struct kernfs_node *kn)
+{
+ struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+
+ down_read(rwsem);
+
+ return rwsem;
+}
+
+/**
+ * kernfs_up_read - Release kernfs rwsem
+ *
+ * @rwsem: address of rwsem to release
+ *
+ * Return: void
+ */
+static inline void kernfs_up_read(struct rw_semaphore *rwsem)
+{
+ up_read(rwsem);
+}
+
#endif /* __KERNFS_INTERNAL_H */
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index d35142226c340..f88dc4e26ffb5 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -237,9 +237,9 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *kfc)
{
struct kernfs_super_info *info = kernfs_info(sb);
- struct kernfs_root *kf_root = kfc->root;
struct inode *inode;
struct dentry *root;
+ struct rw_semaphore *rwsem;
info->sb = sb;
/* Userspace would break if executables or devices appear on sysfs */
@@ -257,9 +257,9 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k
sb->s_shrink.seeks = 0;
/* get root inode, initialize and unlock it */
- down_read(&kf_root->kernfs_rwsem);
+ rwsem = kernfs_down_read(info->root->kn);
inode = kernfs_get_inode(sb, info->root->kn);
- up_read(&kf_root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
if (!inode) {
pr_debug("kernfs: could not get root inode\n");
return -ENOMEM;
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 0ab13824822f7..9d41036025547 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -113,12 +113,12 @@ static int kernfs_getlink(struct inode *inode, char *path)
struct kernfs_node *kn = inode->i_private;
struct kernfs_node *parent = kn->parent;
struct kernfs_node *target = kn->symlink.target_kn;
- struct kernfs_root *root = kernfs_root(parent);
+ struct rw_semaphore *rwsem;
int error;
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
error = kernfs_get_target_path(parent, target, path);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return error;
}
--
2.30.2
Hello Al and Hello Tejun,
On 25/2/22 4:21 pm, Imran Khan wrote:
> Reduce contention around global locks used in kernfs.
>
> The current kernfs design makes use of 3 global locks to synchronize
> various operations. There are a few other global locks as well but
> they are used for specific cases and hence don't cause severe contention.
>
> The above mentioned 3 main global locks used in kernfs are:
>
> 1. A global mutex, kernfs_open_file_mutex, to protect the list of
> kernfs_open_file instances correspondng to a sysfs attribute.
>
> 2. A global spinlock, kernfs_open_node_lock, to protect
> kernfs_node->attr.open which points to kernfs_open_node instance
> corresponding to a kernfs_node.
>
> 3. A global per-fs rw semaphore, root->kernfs_rwsem, to synchronize most
> of the other operations like adding, removing, renaming etc. of a file
> or directory.
>
[...]
>
> ------------------------------------------------------------------
Could you please review v7 of this change set and let me know your feedback?
> Changes since v6:
> - Addressed review comments from Tejun
> - Make locking helpers compact and remove unlock_parent from node.
>
> - Addressed review comments from Al Viro
> - Make multi node lock helpers non-inline
> - Account for change of parent while waiting on semaphores during
> rename operation
> - Add a document to describe hashed locks introduced in this patch
> and to provide proof of correctness
>
> - Fix for some issues that I observed while preparing lock correctness
> document
> - Lock both parent and target while looking for symlink
> - Introduce a per-fs mutex to synchronize lookup and removal of a node
> - Avoid locking parent in remove_self, because only the first instance
> does actual removal so other invocations of remove_self don't need
> to lock the parent
>
> - Remove refcounting from lock helpers
> - Refcounting was present in lock helpers for cases where an execution
> path acquires node's rwsem and then deletes the node. For such cases
> release helpers should not try to acquire semaphore for this already
> freed node. Refcounting was ensuring that release helpers could get
> an still existing node.
> I have modified locking helpers such that helpers that acquire rwsem
> returns its address which can later be used by release helpers.
>
[...]
> Imran Khan (8):
> kernfs: Introduce interface to access global kernfs_open_file_mutex.
> kernfs: Replace global kernfs_open_file_mutex with hashed mutexes.
> kernfs: Introduce interface to access kernfs_open_node_lock.
> kernfs: Replace global kernfs_open_node_lock with hashed spinlocks.
> kernfs: Use a per-fs rwsem to protect per-fs list of
> kernfs_super_info.
> kernfs: Introduce interface to access per-fs rwsem.
> kernfs: Replace per-fs rwsem with hashed rwsems.
> kernfs: Add a document to describe hashed locks used in kernfs.
>
> .../filesystems/kernfs-hashed-locks.rst | 245 +++++++++++++++++
> fs/kernfs/Makefile | 2 +-
> fs/kernfs/dir.c | 212 +++++++++-----
> fs/kernfs/file.c | 66 +++--
> fs/kernfs/inode.c | 48 +++-
> fs/kernfs/kernfs-internal.c | 259 ++++++++++++++++++
> fs/kernfs/kernfs-internal.h | 159 +++++++++++
> fs/kernfs/mount.c | 30 +-
> fs/kernfs/symlink.c | 13 +-
> include/linux/kernfs.h | 71 ++++-
> 10 files changed, 980 insertions(+), 125 deletions(-)
> create mode 100644 Documentation/filesystems/kernfs-hashed-locks.rst
> create mode 100644 fs/kernfs/kernfs-internal.c
>
>
> base-commit: 196d330d7fb1e7cc0d85641c89ce4602cb36f12e
Thanks,
-- Imran