Reduce contention around global locks used in kernfs.
The current kernfs design makes use of 3 global locks to synchronize
various operations. There are a few other global locks as well but
they are used for specific cases and hence don't cause severe contention.
The above mentioned 3 main global locks used in kernfs are:
1. A global mutex, kernfs_open_file_mutex, to protect the list of
kernfs_open_file instances correspondng to a sysfs attribute.
2. A global spinlock, kernfs_open_node_lock, to protect
kernfs_node->attr.open which points to kernfs_open_node instance
corresponding to a kernfs_node.
3. A global per-fs rw semaphore, root->kernfs_rwsem, to synchronize most
of the other operations like adding, removing, renaming etc. of a file
or directory.
Since these locks are global, they can cause contention when multiple
(for example few hundred) applications try to access sysfs (or other kernfs
based file system) files in parallel, even if the applications are
accessing different and unrelated files.
For example on a system with 384 cores, if I run 200 instances of an
application which is mostly executing the following loop:
for (int loop = 0; loop <100 ; loop++)
{
for (int port_num = 1; port_num < 2; port_num++)
{
for (int gid_index = 0; gid_index < 254; gid_index++ )
{
char ret_buf[64], ret_buf_lo[64];
char gid_file_path[1024];
int ret_len;
int ret_fd;
ssize_t ret_rd;
ub4 i, saved_errno;
memset(ret_buf, 0, sizeof(ret_buf));
memset(gid_file_path, 0, sizeof(gid_file_path));
ret_len = snprintf(gid_file_path, sizeof(gid_file_path),
"/sys/class/infiniband/%s/ports/%d/gids/%d",
dev_name,
port_num,
gid_index);
ret_fd = open(gid_file_path, O_RDONLY | O_CLOEXEC);
if (ret_fd < 0)
{
printf("Failed to open %s\n", gid_file_path);
continue;
}
/* Read the GID */
ret_rd = read(ret_fd, ret_buf, 40);
if (ret_rd == -1)
{
printf("Failed to read from file %s, errno: %u\n",
gid_file_path, saved_errno);
continue;
}
close(ret_fd);
}
}
I can see contention around above mentioned locks as follows:
- 54.07% 53.60% showgids [kernel.kallsyms] [k] osq_lock
- 53.60% __libc_start_main
- 32.29% __GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
path_openat
vfs_open
do_dentry_open
kernfs_fop_open
mutex_lock
- __mutex_lock_slowpath
- 32.23% __mutex_lock.isra.5
osq_lock
- 21.31% __GI___libc_close
entry_SYSCALL_64_after_hwframe
do_syscall_64
exit_to_usermode_loop
task_work_run
____fput
__fput
kernfs_fop_release
kernfs_put_open_node.isra.8
mutex_lock
- __mutex_lock_slowpath
- 21.28% __mutex_lock.isra.5
osq_lock
- 10.49% 10.39% showgids [kernel.kallsyms] [k] down_read
10.39% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 9.72% link_path_walk
- 5.21% inode_permission
- __inode_permission
- 5.21% kernfs_iop_permission
down_read
- 4.08% walk_component
lookup_fast
- d_revalidate.part.24
- 4.08% kernfs_dop_revalidate
- 7.48% 7.41% showgids [kernel.kallsyms] [k] up_read
7.41% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 7.01% link_path_walk
- 4.12% inode_permission
- __inode_permission
- 4.12% kernfs_iop_permission
up_read
- 2.61% walk_component
lookup_fast
- d_revalidate.part.24
- 2.61% kernfs_dop_revalidate
Moreover this run of 200 application isntances takes 32-34 secs. to
complete.
This patch set is reducing the above mentioned contention by replacing
these global locks with hashed locks.
For example with the patched kernel and on the same test setup, we no
longer see contention around osq_lock (i.e kernfs_open_file_mutex) and also
contention around per-fs kernfs_rwsem has reduced significantly as well.
This can be seen in the following perf snippet:
- 1.66% 1.65% showgids [kernel.kallsyms] [k] down_read
1.65% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 1.62% link_path_walk
- 0.98% inode_permission
- __inode_permission
+ 0.98% kernfs_iop_permission
- 0.52% walk_component
lookup_fast
- d_revalidate.part.24
- 0.52% kernfs_dop_revalidate
- 1.12% 1.11% showgids [kernel.kallsyms] [k] up_read
1.11% __libc_start_main
__GI___libc_open
entry_SYSCALL_64_after_hwframe
do_syscall_64
sys_open
do_sys_open
do_filp_open
- path_openat
- 1.11% link_path_walk
- 0.69% inode_permission
- __inode_permission
- 0.69% kernfs_iop_permission
up_read
Moreover the test execution time has reduced from 32-34 secs to 15-16 secs.
The patches of this patchset introduce following changes:
PATCH-1: Remove reference counting from kernfs_open_node.
PATCH-2: Make kernfs_open_node->attr.open RCU protected.
PATCH-3: Change kernfs_notify_list to llist.
PATCH-4: Introduce interface to access kernfs_open_file_mutex.
PATCH-5: Replace global kernfs_open_file_mutex with hashed mutexes.
PATCH-6: Use a per-fs rwsem to protect per-fs list of kernfs_super_info.
PATCH-7: Change kernfs_rename_lock into a read-write lock.
PATCH-8: Introduce interface to access per-fs rwsem.
PATCH-9: Replace per-fs rwsem with hashed rwsems.
PATCH-10: Add a document to describe hashed locks used in kernfs.
------------------------------------------------------------------
Changes since v7:
- Addressed review comments from Al Viro
- Remove reference counting for kernfs_open_node
- Remove kernfs_open_node_lock and make ->attr.open RCU protected
- Change kernfs_rename_lock to a read-write lock
- Lock involved nodes before invoking kernfs_find_ns from kernfs_walk_ns
- Change kernfs_notify_list to llist
- Addressed review comments from Eric
- Move interface to access kernfs_open_file_mutex to file.c
- Update document about hashed locks
- Rebase on tag next-20220407
Changes since v6:
- Addressed review comments from Tejun
- Make locking helpers compact and remove unlock_parent from node.
- Addressed review comments from Al Viro
- Make multi node lock helpers non-inline
- Account for change of parent while waiting on semaphores during
rename operation
- Add a document to describe hashed locks introduced in this patch
and to provide proof of correctness
- Fix for some issues that I observed while preparing lock correctness
document
- Lock both parent and target while looking for symlink
- Introduce a per-fs mutex to synchronize lookup and removal of a node
- Avoid locking parent in remove_self, because only the first instance
does actual removal so other invocations of remove_self don't need
to lock the parent
- Remove refcounting from lock helpers
- Refcounting was present in lock helpers for cases where an execution
path acquires node's rwsem and then deletes the node. For such cases
release helpers should not try to acquire semaphore for this already
freed node. Refcounting was ensuring that release helpers could get
an still existing node.
I have modified locking helpers such that helpers that acquire rwsem
returns its address which can later be used by release helpers.
Changes since v5:
- Addressed review comments from Greg
- Clean up duplicate code.
- Addressed review comments from Tejun
- Explain how current value of NR_KERNFS_LOCKS were obtained for
different values of NR_CPUS.
- Use single hash table for locks in place of per-fs hash table
for locks.
- Move introduction of supers_rwsem to a separate patch.
- Separate interface and usage part of hashed rwsem implementation
into 2 patches.
- Use address of rwsems to determine locking order in case of
nested locking. This approach avoids ABBA deadlock possibility.
- Change some #define macros into enum, with proper prefix.
- Taking a cue from Tejun's feedback about separating hashed rwsem
interface and usage into 2 patches, I have also divided the patch
that introduced hashed mutex and spinlock, into separate patches.
- Rebase on linux-next tag: next-20220211
Changes since v4:
- Removed compilation warnings reported by the 0-day bot.
Changes since v3:
- Make open_file_mutex and open_node_lock per-fs.
- Replace per-fs rwsem with per-fs hashed rwsem.
(Suggested by Tejun Heo <[email protected]>)
Imran Khan (10):
kernfs: Remove reference counting for kernfs_open_node.
kernfs: make ->attr.open RCU protected.
kernfs: Change kernfs_notify_list to llist.
kernfs: Introduce interface to access global kernfs_open_file_mutex.
kernfs: Replace global kernfs_open_file_mutex with hashed mutexes.
kernfs: Use a per-fs rwsem to protect per-fs list of
kernfs_super_info.
kernfs: Change kernfs_rename_lock into a read-write lock.
kernfs: Introduce interface to access per-fs rwsem.
kernfs: Replace per-fs rwsem with hashed rwsems.
kernfs: Add a document to describe hashed locks used in kernfs.
.../filesystems/kernfs-hashed-locks.rst | 214 ++++++++++++++
fs/kernfs/Makefile | 2 +-
fs/kernfs/dir.c | 272 ++++++++++++------
fs/kernfs/file.c | 228 ++++++++-------
fs/kernfs/inode.c | 48 +++-
fs/kernfs/kernfs-internal.c | 259 +++++++++++++++++
fs/kernfs/kernfs-internal.h | 127 +++++++-
fs/kernfs/mount.c | 30 +-
fs/kernfs/symlink.c | 11 +-
include/linux/kernfs.h | 62 +++-
10 files changed, 1041 insertions(+), 212 deletions(-)
create mode 100644 Documentation/filesystems/kernfs-hashed-locks.rst
create mode 100644 fs/kernfs/kernfs-internal.c
base-commit: 2e9a9857569ec27e64d2ddd01294bbe3c736acb1
--
2.30.2
per-fs rwsem is used across kernfs for synchronization purposes.
Having an interface to access it not only avoids code duplication, it
can also help in changing the underlying locking mechanism without needing
to change the lock users. For example next patch modifies this interface
to make use of hashed rwsems in place of per-fs rwsem.
Signed-off-by: Imran Khan <[email protected]>
---
fs/kernfs/dir.c | 114 ++++++++++++++++++------------------
fs/kernfs/file.c | 5 +-
fs/kernfs/inode.c | 26 ++++----
fs/kernfs/kernfs-internal.h | 78 ++++++++++++++++++++++++
fs/kernfs/mount.c | 6 +-
fs/kernfs/symlink.c | 6 +-
6 files changed, 156 insertions(+), 79 deletions(-)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 8e8c8b2c350d..f8520d842b39 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -25,7 +25,7 @@ static DEFINE_SPINLOCK(kernfs_idr_lock); /* root->ino_idr */
static bool kernfs_active(struct kernfs_node *kn)
{
- lockdep_assert_held(&kernfs_root(kn)->kernfs_rwsem);
+ kernfs_rwsem_assert_held(kn);
return atomic_read(&kn->active) >= 0;
}
@@ -461,10 +461,16 @@ static void kernfs_drain(struct kernfs_node *kn)
{
struct kernfs_root *root = kernfs_root(kn);
- lockdep_assert_held_write(&root->kernfs_rwsem);
+ /**
+ * kn has the same root as its ancestor, so it can be used to get
+ * per-fs rwsem.
+ */
+ struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+
+ kernfs_rwsem_assert_held_write(kn);
WARN_ON_ONCE(kernfs_active(kn));
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
if (kernfs_lockdep(kn)) {
rwsem_acquire(&kn->dep_map, 0, 0, _RET_IP_);
@@ -483,7 +489,7 @@ static void kernfs_drain(struct kernfs_node *kn)
kernfs_drain_open_files(kn);
- down_write(&root->kernfs_rwsem);
+ kernfs_down_write(kn);
}
/**
@@ -718,12 +724,12 @@ struct kernfs_node *kernfs_find_and_get_node_by_id(struct kernfs_root *root,
int kernfs_add_one(struct kernfs_node *kn)
{
struct kernfs_node *parent = kn->parent;
- struct kernfs_root *root = kernfs_root(parent);
struct kernfs_iattrs *ps_iattr;
+ struct rw_semaphore *rwsem;
bool has_ns;
int ret;
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(parent);
ret = -EINVAL;
has_ns = kernfs_ns_enabled(parent);
@@ -754,7 +760,7 @@ int kernfs_add_one(struct kernfs_node *kn)
ps_iattr->ia_mtime = ps_iattr->ia_ctime;
}
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
/*
* Activate the new node unless CREATE_DEACTIVATED is requested.
@@ -768,7 +774,7 @@ int kernfs_add_one(struct kernfs_node *kn)
return 0;
out_unlock:
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return ret;
}
@@ -789,7 +795,7 @@ static struct kernfs_node *kernfs_find_ns(struct kernfs_node *parent,
bool has_ns = kernfs_ns_enabled(parent);
unsigned int hash;
- lockdep_assert_held(&kernfs_root(parent)->kernfs_rwsem);
+ kernfs_rwsem_assert_held(parent);
if (has_ns != (bool)ns) {
WARN(1, KERN_WARNING "kernfs: ns %s in '%s' for '%s'\n",
@@ -821,7 +827,7 @@ static struct kernfs_node *kernfs_walk_ns(struct kernfs_node *parent,
size_t len;
char *p, *name;
- lockdep_assert_held_read(&kernfs_root(parent)->kernfs_rwsem);
+ kernfs_rwsem_assert_held_read(parent);
/* grab kernfs_rename_lock to piggy back on kernfs_pr_cont_buf */
write_lock_irq(&kernfs_rename_lock);
@@ -860,12 +866,12 @@ struct kernfs_node *kernfs_find_and_get_ns(struct kernfs_node *parent,
const char *name, const void *ns)
{
struct kernfs_node *kn;
- struct kernfs_root *root = kernfs_root(parent);
+ struct rw_semaphore *rwsem;
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
kn = kernfs_find_ns(parent, name, ns);
kernfs_get(kn);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return kn;
}
@@ -885,12 +891,12 @@ struct kernfs_node *kernfs_walk_and_get_ns(struct kernfs_node *parent,
const char *path, const void *ns)
{
struct kernfs_node *kn;
- struct kernfs_root *root = kernfs_root(parent);
+ struct rw_semaphore *rwsem;
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
kn = kernfs_walk_ns(parent, path, ns);
kernfs_get(kn);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return kn;
}
@@ -1055,7 +1061,7 @@ struct kernfs_node *kernfs_create_empty_dir(struct kernfs_node *parent,
static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
{
struct kernfs_node *kn;
- struct kernfs_root *root;
+ struct rw_semaphore *rwsem;
if (flags & LOOKUP_RCU)
return -ECHILD;
@@ -1071,13 +1077,12 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
parent = kernfs_dentry_node(dentry->d_parent);
if (parent) {
spin_unlock(&dentry->d_lock);
- root = kernfs_root(parent);
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
if (kernfs_dir_changed(parent, dentry)) {
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return 0;
}
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
} else
spin_unlock(&dentry->d_lock);
@@ -1088,8 +1093,7 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
}
kn = kernfs_dentry_node(dentry);
- root = kernfs_root(kn);
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(kn);
/* The kernfs node has been deactivated */
if (!kernfs_active(kn))
@@ -1108,10 +1112,10 @@ static int kernfs_dop_revalidate(struct dentry *dentry, unsigned int flags)
kernfs_info(dentry->d_sb)->ns != kn->ns)
goto out_bad;
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return 1;
out_bad:
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return 0;
}
@@ -1125,12 +1129,11 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
{
struct kernfs_node *parent = dir->i_private;
struct kernfs_node *kn;
- struct kernfs_root *root;
struct inode *inode = NULL;
const void *ns = NULL;
+ struct rw_semaphore *rwsem;
- root = kernfs_root(parent);
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dir->i_sb)->ns;
@@ -1141,7 +1144,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
* create a negative.
*/
if (!kernfs_active(kn)) {
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return NULL;
}
inode = kernfs_get_inode(dir->i_sb, kn);
@@ -1156,7 +1159,7 @@ static struct dentry *kernfs_iop_lookup(struct inode *dir,
*/
if (!IS_ERR(inode))
kernfs_set_rev(parent, dentry);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
/* instantiate and hash (possibly negative) dentry */
return d_splice_alias(inode, dentry);
@@ -1279,7 +1282,7 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
{
struct rb_node *rbn;
- lockdep_assert_held_write(&kernfs_root(root)->kernfs_rwsem);
+ kernfs_rwsem_assert_held_write(root);
/* if first iteration, visit leftmost descendant which may be root */
if (!pos)
@@ -1314,9 +1317,9 @@ static struct kernfs_node *kernfs_next_descendant_post(struct kernfs_node *pos,
void kernfs_activate(struct kernfs_node *kn)
{
struct kernfs_node *pos;
- struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
pos = NULL;
while ((pos = kernfs_next_descendant_post(pos, kn))) {
@@ -1330,14 +1333,14 @@ void kernfs_activate(struct kernfs_node *kn)
pos->flags |= KERNFS_ACTIVATED;
}
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
}
static void __kernfs_remove(struct kernfs_node *kn)
{
struct kernfs_node *pos;
- lockdep_assert_held_write(&kernfs_root(kn)->kernfs_rwsem);
+ kernfs_rwsem_assert_held_write(kn);
/*
* Short-circuit if non-root @kn has already finished removal.
@@ -1407,11 +1410,11 @@ static void __kernfs_remove(struct kernfs_node *kn)
*/
void kernfs_remove(struct kernfs_node *kn)
{
- struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
__kernfs_remove(kn);
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
}
/**
@@ -1497,9 +1500,9 @@ void kernfs_unbreak_active_protection(struct kernfs_node *kn)
bool kernfs_remove_self(struct kernfs_node *kn)
{
bool ret;
- struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
kernfs_break_active_protection(kn);
/*
@@ -1527,9 +1530,9 @@ bool kernfs_remove_self(struct kernfs_node *kn)
atomic_read(&kn->active) == KN_DEACTIVATED_BIAS)
break;
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
schedule();
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
}
finish_wait(waitq, &wait);
WARN_ON_ONCE(!RB_EMPTY_NODE(&kn->rb));
@@ -1542,7 +1545,7 @@ bool kernfs_remove_self(struct kernfs_node *kn)
*/
kernfs_unbreak_active_protection(kn);
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return ret;
}
@@ -1559,7 +1562,7 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
const void *ns)
{
struct kernfs_node *kn;
- struct kernfs_root *root;
+ struct rw_semaphore *rwsem;
if (!parent) {
WARN(1, KERN_WARNING "kernfs: can not remove '%s', no directory\n",
@@ -1567,14 +1570,13 @@ int kernfs_remove_by_name_ns(struct kernfs_node *parent, const char *name,
return -ENOENT;
}
- root = kernfs_root(parent);
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(parent);
kn = kernfs_find_ns(parent, name, ns);
if (kn)
__kernfs_remove(kn);
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
if (kn)
return 0;
@@ -1593,16 +1595,15 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
const char *new_name, const void *new_ns)
{
struct kernfs_node *old_parent;
- struct kernfs_root *root;
const char *old_name = NULL;
+ struct rw_semaphore *rwsem;
int error;
/* can't move or rename root */
if (!kn->parent)
return -EINVAL;
- root = kernfs_root(kn);
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
error = -ENOENT;
if (!kernfs_active(kn) || !kernfs_active(new_parent) ||
@@ -1656,7 +1657,7 @@ int kernfs_rename_ns(struct kernfs_node *kn, struct kernfs_node *new_parent,
error = 0;
out:
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return error;
}
@@ -1727,14 +1728,13 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
struct dentry *dentry = file->f_path.dentry;
struct kernfs_node *parent = kernfs_dentry_node(dentry);
struct kernfs_node *pos = file->private_data;
- struct kernfs_root *root;
const void *ns = NULL;
+ struct rw_semaphore *rwsem;
if (!dir_emit_dots(file, ctx))
return 0;
- root = kernfs_root(parent);
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
if (kernfs_ns_enabled(parent))
ns = kernfs_info(dentry->d_sb)->ns;
@@ -1751,12 +1751,12 @@ static int kernfs_fop_readdir(struct file *file, struct dir_context *ctx)
file->private_data = pos;
kernfs_get(pos);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
if (!dir_emit(ctx, name, len, ino, type))
return 0;
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
}
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
file->private_data = NULL;
ctx->pos = INT_MAX;
return 0;
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 0bffe5d0f510..03700388baa9 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -869,6 +869,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
struct kernfs_root *root;
struct llist_node *free;
struct kernfs_elem_attr *attr;
+ struct rw_semaphore *rwsem;
repeat:
/**
* pop one off the notify_list.
@@ -888,7 +889,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
kn = attribute_to_node(attr, struct kernfs_node, attr);
root = kernfs_root(kn);
/* kick fsnotify */
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
down_write(&root->supers_rwsem);
list_for_each_entry(info, &kernfs_root(kn)->supers, node) {
@@ -928,7 +929,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
}
up_write(&root->supers_rwsem);
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
kernfs_put(kn);
goto repeat;
}
diff --git a/fs/kernfs/inode.c b/fs/kernfs/inode.c
index 3d783d80f5da..efe5ae98abf4 100644
--- a/fs/kernfs/inode.c
+++ b/fs/kernfs/inode.c
@@ -99,11 +99,11 @@ int __kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
int kernfs_setattr(struct kernfs_node *kn, const struct iattr *iattr)
{
int ret;
- struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
ret = __kernfs_setattr(kn, iattr);
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return ret;
}
@@ -112,14 +112,13 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
{
struct inode *inode = d_inode(dentry);
struct kernfs_node *kn = inode->i_private;
- struct kernfs_root *root;
+ struct rw_semaphore *rwsem;
int error;
if (!kn)
return -EINVAL;
- root = kernfs_root(kn);
- down_write(&root->kernfs_rwsem);
+ rwsem = kernfs_down_write(kn);
error = setattr_prepare(&init_user_ns, dentry, iattr);
if (error)
goto out;
@@ -132,7 +131,7 @@ int kernfs_iop_setattr(struct user_namespace *mnt_userns, struct dentry *dentry,
setattr_copy(&init_user_ns, inode, iattr);
out:
- up_write(&root->kernfs_rwsem);
+ kernfs_up_write(rwsem);
return error;
}
@@ -187,14 +186,14 @@ int kernfs_iop_getattr(struct user_namespace *mnt_userns,
{
struct inode *inode = d_inode(path->dentry);
struct kernfs_node *kn = inode->i_private;
- struct kernfs_root *root = kernfs_root(kn);
+ struct rw_semaphore *rwsem;
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(kn);
spin_lock(&inode->i_lock);
kernfs_refresh_inode(kn, inode);
generic_fillattr(&init_user_ns, inode, stat);
spin_unlock(&inode->i_lock);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return 0;
}
@@ -277,22 +276,21 @@ void kernfs_evict_inode(struct inode *inode)
int kernfs_iop_permission(struct user_namespace *mnt_userns,
struct inode *inode, int mask)
{
+ struct rw_semaphore *rwsem;
struct kernfs_node *kn;
- struct kernfs_root *root;
int ret;
if (mask & MAY_NOT_BLOCK)
return -ECHILD;
kn = inode->i_private;
- root = kernfs_root(kn);
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(kn);
spin_lock(&inode->i_lock);
kernfs_refresh_inode(kn, inode);
ret = generic_permission(&init_user_ns, inode, mask);
spin_unlock(&inode->i_lock);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return ret;
}
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 82c6b16645bc..0c49cf57f80f 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -165,4 +165,82 @@ void kernfs_drain_open_files(struct kernfs_node *kn);
*/
extern const struct inode_operations kernfs_symlink_iops;
+static inline struct rw_semaphore *kernfs_rwsem_ptr(struct kernfs_node *kn)
+{
+ struct kernfs_root *root = kernfs_root(kn);
+
+ return &root->kernfs_rwsem;
+}
+
+static inline void kernfs_rwsem_assert_held(struct kernfs_node *kn)
+{
+ lockdep_assert_held(kernfs_rwsem_ptr(kn));
+}
+
+static inline void kernfs_rwsem_assert_held_write(struct kernfs_node *kn)
+{
+ lockdep_assert_held_write(kernfs_rwsem_ptr(kn));
+}
+
+static inline void kernfs_rwsem_assert_held_read(struct kernfs_node *kn)
+{
+ lockdep_assert_held_read(kernfs_rwsem_ptr(kn));
+}
+
+/**
+ * kernfs_down_write() - Acquire kernfs rwsem
+ *
+ * @kn: kernfs_node for which rwsem needs to be taken
+ *
+ * Return: pointer to acquired rwsem
+ */
+static inline struct rw_semaphore *kernfs_down_write(struct kernfs_node *kn)
+{
+ struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+
+ down_write(rwsem);
+
+ return rwsem;
+}
+
+/**
+ * kernfs_up_write - Release kernfs rwsem
+ *
+ * @rwsem: address of rwsem to release
+ *
+ * Return: void
+ */
+static inline void kernfs_up_write(struct rw_semaphore *rwsem)
+{
+ up_write(rwsem);
+}
+
+/**
+ * kernfs_down_read() - Acquire kernfs rwsem
+ *
+ * @kn: kernfs_node for which rwsem needs to be taken
+ *
+ * Return: pointer to acquired rwsem
+ */
+static inline struct rw_semaphore *kernfs_down_read(struct kernfs_node *kn)
+{
+ struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+
+ down_read(rwsem);
+
+ return rwsem;
+}
+
+/**
+ * kernfs_up_read - Release kernfs rwsem
+ *
+ * @rwsem: address of rwsem to release
+ *
+ * Return: void
+ */
+static inline void kernfs_up_read(struct rw_semaphore *rwsem)
+{
+ up_read(rwsem);
+}
+
#endif /* __KERNFS_INTERNAL_H */
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index 1ac36b2a89ab..0e872824b7db 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -237,9 +237,9 @@ struct dentry *kernfs_node_dentry(struct kernfs_node *kn,
static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *kfc)
{
struct kernfs_super_info *info = kernfs_info(sb);
- struct kernfs_root *kf_root = kfc->root;
struct inode *inode;
struct dentry *root;
+ struct rw_semaphore *rwsem;
info->sb = sb;
/* Userspace would break if executables or devices appear on sysfs */
@@ -257,9 +257,9 @@ static int kernfs_fill_super(struct super_block *sb, struct kernfs_fs_context *k
sb->s_shrink.seeks = 0;
/* get root inode, initialize and unlock it */
- down_read(&kf_root->kernfs_rwsem);
+ rwsem = kernfs_down_read(info->root->kn);
inode = kernfs_get_inode(sb, info->root->kn);
- up_read(&kf_root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
if (!inode) {
pr_debug("kernfs: could not get root inode\n");
return -ENOMEM;
diff --git a/fs/kernfs/symlink.c b/fs/kernfs/symlink.c
index 0ab13824822f..9d4103602554 100644
--- a/fs/kernfs/symlink.c
+++ b/fs/kernfs/symlink.c
@@ -113,12 +113,12 @@ static int kernfs_getlink(struct inode *inode, char *path)
struct kernfs_node *kn = inode->i_private;
struct kernfs_node *parent = kn->parent;
struct kernfs_node *target = kn->symlink.target_kn;
- struct kernfs_root *root = kernfs_root(parent);
+ struct rw_semaphore *rwsem;
int error;
- down_read(&root->kernfs_rwsem);
+ rwsem = kernfs_down_read(parent);
error = kernfs_get_target_path(parent, target, path);
- up_read(&root->kernfs_rwsem);
+ kernfs_up_read(rwsem);
return error;
}
--
2.30.2
Right now per-fs kernfs_rwsem protects list of kernfs_super_info instances
for a kernfs_root. Since kernfs_rwsem is used to synchronize several other
operations across kernfs and since most of these operations don't impact
kernfs_super_info, we can use a separate per-fs rwsem to synchronize access
to list of kernfs_super_info.
This helps in reducing contention around kernfs_rwsem and also allows
operations that change/access list of kernfs_super_info to proceed without
contending for kernfs_rwsem.
Signed-off-by: Imran Khan <[email protected]>
---
fs/kernfs/dir.c | 1 +
fs/kernfs/file.c | 2 ++
fs/kernfs/kernfs-internal.h | 1 +
fs/kernfs/mount.c | 8 ++++----
4 files changed, 8 insertions(+), 4 deletions(-)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index 61a8edc4ba8b..17b438498c0b 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -917,6 +917,7 @@ struct kernfs_root *kernfs_create_root(struct kernfs_syscall_ops *scops,
idr_init(&root->ino_idr);
init_rwsem(&root->kernfs_rwsem);
INIT_LIST_HEAD(&root->supers);
+ init_rwsem(&root->supers_rwsem);
/*
* On 64bit ino setups, id is ino. On 32bit, low 32bits are ino.
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index 0946ab341ce4..0bffe5d0f510 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -890,6 +890,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
/* kick fsnotify */
down_write(&root->kernfs_rwsem);
+ down_write(&root->supers_rwsem);
list_for_each_entry(info, &kernfs_root(kn)->supers, node) {
struct kernfs_node *parent;
struct inode *p_inode = NULL;
@@ -925,6 +926,7 @@ static void kernfs_notify_workfn(struct work_struct *work)
iput(inode);
}
+ up_write(&root->supers_rwsem);
up_write(&root->kernfs_rwsem);
kernfs_put(kn);
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index eeaa779b929c..82c6b16645bc 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -47,6 +47,7 @@ struct kernfs_root {
wait_queue_head_t deactivate_waitq;
struct rw_semaphore kernfs_rwsem;
+ struct rw_semaphore supers_rwsem;
};
/* +1 to avoid triggering overflow warning when negating it */
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index a64a04efc9be..1ac36b2a89ab 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -347,9 +347,9 @@ int kernfs_get_tree(struct fs_context *fc)
}
sb->s_flags |= SB_ACTIVE;
- down_write(&root->kernfs_rwsem);
+ down_write(&root->supers_rwsem);
list_add(&info->node, &info->root->supers);
- up_write(&root->kernfs_rwsem);
+ up_write(&root->supers_rwsem);
}
fc->root = dget(sb->s_root);
@@ -376,9 +376,9 @@ void kernfs_kill_sb(struct super_block *sb)
struct kernfs_super_info *info = kernfs_info(sb);
struct kernfs_root *root = info->root;
- down_write(&root->kernfs_rwsem);
+ down_write(&root->supers_rwsem);
list_del(&info->node);
- up_write(&root->kernfs_rwsem);
+ up_write(&root->supers_rwsem);
/*
* Remove the superblock from fs_supers/s_instances
--
2.30.2
At present kernfs_notify_list is implemented as a singly linked
list of kernfs_node(s), where last element points to itself and
value of ->attr.next tells if node is present on the list or not.
Both addition and deletion to list happen under kernfs_notify_lock.
Change kernfs_notify_list to llist so that addition to list can heppen
locklessly. We still need kernfs_notify_lock for consumers (kernfs_notify\
_workfn) because there can be multiple concurrent work items.
Also with this approach we don't check if a kernfs_node is already present
in the kernfs_notify_list i.e whether ->attr.next == NULL but that should
be fine as it will allow serial processing of all events submitted for the
same node while the node was still in event notification list. With earlier
approach as long as a node was in the event notification list i.e ->attr.next
!= NULL the subsequent events were ignored.
Suggested by: Al Viro <[email protected]>
Signed-off-by: Imran Khan <[email protected]>
---
fs/kernfs/file.c | 49 ++++++++++++++++++++++--------------------
include/linux/kernfs.h | 2 +-
2 files changed, 27 insertions(+), 24 deletions(-)
diff --git a/fs/kernfs/file.c b/fs/kernfs/file.c
index bc393dcf4efa..c89220dcfdc1 100644
--- a/fs/kernfs/file.c
+++ b/fs/kernfs/file.c
@@ -38,18 +38,17 @@ struct kernfs_open_node {
struct list_head files; /* goes through kernfs_open_file.list */
};
-/*
- * kernfs_notify() may be called from any context and bounces notifications
- * through a work item. To minimize space overhead in kernfs_node, the
- * pending queue is implemented as a singly linked list of kernfs_nodes.
- * The list is terminated with the self pointer so that whether a
- * kernfs_node is on the list or not can be determined by testing the next
- * pointer for NULL.
+/**
+ * attribute_to_node - get kernfs_node object corresponding to a kernfs attribute
+ * @ptr: &struct kernfs_elem_attr
+ * @type: struct kernfs_node
+ * @member: name of member (i.e attr)
*/
-#define KERNFS_NOTIFY_EOL ((void *)&kernfs_notify_list)
+#define attribute_to_node(ptr, type, member) \
+ container_of(ptr, type, member)
static DEFINE_SPINLOCK(kernfs_notify_lock);
-static struct kernfs_node *kernfs_notify_list = KERNFS_NOTIFY_EOL;
+static LLIST_HEAD(kernfs_notify_list);
static struct kernfs_open_file *kernfs_of(struct file *file)
{
@@ -846,18 +845,25 @@ static void kernfs_notify_workfn(struct work_struct *work)
struct kernfs_node *kn;
struct kernfs_super_info *info;
struct kernfs_root *root;
+ struct llist_node *free;
+ struct kernfs_elem_attr *attr;
repeat:
- /* pop one off the notify_list */
+ /**
+ * pop one off the notify_list.
+ * There can be multiple concurrent work items.
+ * Use kernfs_notify_lock to synchronize between multipl consumers.
+ */
spin_lock_irq(&kernfs_notify_lock);
- kn = kernfs_notify_list;
- if (kn == KERNFS_NOTIFY_EOL) {
+ if (llist_empty(&kernfs_notify_list)) {
spin_unlock_irq(&kernfs_notify_lock);
return;
}
- kernfs_notify_list = kn->attr.notify_next;
- kn->attr.notify_next = NULL;
+
+ free = llist_del_first(&kernfs_notify_list);
spin_unlock_irq(&kernfs_notify_lock);
+ attr = llist_entry(free, struct kernfs_elem_attr, notify_next);
+ kn = attribute_to_node(attr, struct kernfs_node, attr);
root = kernfs_root(kn);
/* kick fsnotify */
down_write(&root->kernfs_rwsem);
@@ -913,12 +919,14 @@ static void kernfs_notify_workfn(struct work_struct *work)
void kernfs_notify(struct kernfs_node *kn)
{
static DECLARE_WORK(kernfs_notify_work, kernfs_notify_workfn);
- unsigned long flags;
struct kernfs_open_node *on;
if (WARN_ON(kernfs_type(kn) != KERNFS_FILE))
return;
+ /* Because we are using llist for kernfs_notify_list */
+ WARN_ON_ONCE(in_nmi());
+
/* kick poll immediately */
rcu_read_lock();
on = rcu_dereference(kn->attr.open);
@@ -928,14 +936,9 @@ void kernfs_notify(struct kernfs_node *kn)
}
rcu_read_unlock();
/* schedule work to kick fsnotify */
- spin_lock_irqsave(&kernfs_notify_lock, flags);
- if (!kn->attr.notify_next) {
- kernfs_get(kn);
- kn->attr.notify_next = kernfs_notify_list;
- kernfs_notify_list = kn;
- schedule_work(&kernfs_notify_work);
- }
- spin_unlock_irqrestore(&kernfs_notify_lock, flags);
+ kernfs_get(kn);
+ llist_add(&kn->attr.notify_next, &kernfs_notify_list);
+ schedule_work(&kernfs_notify_work);
}
EXPORT_SYMBOL_GPL(kernfs_notify);
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 13f54f078a52..2dd9c8df0f4f 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -116,7 +116,7 @@ struct kernfs_elem_attr {
const struct kernfs_ops *ops;
struct kernfs_open_node __rcu *open;
loff_t size;
- struct kernfs_node *notify_next; /* for kernfs_notify() */
+ struct llist_node notify_next; /* for kernfs_notify() */
};
/*
--
2.30.2
Hello, Imran.
I took a look to patch 5 and it looks like there's still some work left to
do. Maybe it'd be easier to concentrate on the first two parts - the notify
list conversion and open_file_mutex conversion first and then get to the
rest later?
Thanks.
--
tejun
Hello Tejun,
On 23/4/22 3:03 am, Tejun Heo wrote:
> Hello, Imran.
>
> I took a look to patch 5 and it looks like there's still some work left to
> do. Maybe it'd be easier to concentrate on the first two parts - the notify
> list conversion and open_file_mutex conversion first and then get to the
> rest later?
>
Thanks again for reviewing this and I agree with your suggestion. So far
most of the concerns have been around usage of kernfs_rwsem and those
can be addressed independent of first 5 (even first 7 )changes here.
Just one question in this regard, should I send the new patch set
(addressing open_file_mutex and list conversion) as a separate patch set
or should I sent it as v9 of ongoing change set. I guess first option is
better but thought of confirming once before proceeding.
Thanks
-- Imran
> Thanks.
>
Hello,
On Sat, Apr 23, 2022 at 06:49:49PM +1000, Imran Khan wrote:
> Thanks again for reviewing this and I agree with your suggestion. So far
> most of the concerns have been around usage of kernfs_rwsem and those
> can be addressed independent of first 5 (even first 7 )changes here.
> Just one question in this regard, should I send the new patch set
> (addressing open_file_mutex and list conversion) as a separate patch set
> or should I sent it as v9 of ongoing change set. I guess first option is
> better but thought of confirming once before proceeding.
Either way is okay but I like the first one too. Let's break it up to
smaller pieces so that we can make progress piece by piece.
Thanks.
--
tejun
Hello Tejun,
On 26/4/22 3:21 am, Tejun Heo wrote:
> Hello,
>
> On Sat, Apr 23, 2022 at 06:49:49PM +1000, Imran Khan wrote:
>> Thanks again for reviewing this and I agree with your suggestion. So far
>> most of the concerns have been around usage of kernfs_rwsem and those
>> can be addressed independent of first 5 (even first 7 )changes here.
>> Just one question in this regard, should I send the new patch set
>> (addressing open_file_mutex and list conversion) as a separate patch set
>> or should I sent it as v9 of ongoing change set. I guess first option is
>> better but thought of confirming once before proceeding.
>
> Either way is okay but I like the first one too. Let's break it up to
> smaller pieces so that we can make progress piece by piece.
>
I have sent first 5 patches after addressing your review comments. This
patch set can be seen at [1].
Could you please have a look and let me know if it looks okay.
Thanks
-- Imran
[1]:
https://lore.kernel.org/lkml/[email protected]/