Having a single rwsem to synchronize all operations across a kernfs-based
file system (cgroup, sysfs, etc.) does not scale well. Contention around
this single rwsem becomes more apparent in large systems with a few
hundred CPUs, where most of the CPUs have running tasks that are
opening, accessing or closing sysfs files at any given time.
Using hashed rwsems in place of a per-fs rwsem can significantly reduce
contention around the per-fs rwsem and hence provide better scalability.
Moreover, as these hashed rwsems are not part of kernfs_node objects, we
will not see any significant change in the memory utilization of
kernfs-based file systems like sysfs, cgroupfs, etc.
This patch introduces hashed rwsems that can be used in place of the
per-fs rwsem. It also provides interfaces for using the hashed rwsems,
as well as interfaces for lockdep annotation. The next patch makes use
of these interfaces and replaces the per-fs rwsem with hashed ones.
Signed-off-by: Imran Khan <[email protected]>
---
fs/kernfs/dir.c | 1 -
fs/kernfs/kernfs-internal.h | 560 ++++++++++++++++++++++++++++++++++++
fs/kernfs/mount.c | 1 +
include/linux/kernfs.h | 2 +
4 files changed, 563 insertions(+), 1 deletion(-)
diff --git a/fs/kernfs/dir.c b/fs/kernfs/dir.c
index dc769301ac96b..0dac58f8091c9 100644
--- a/fs/kernfs/dir.c
+++ b/fs/kernfs/dir.c
@@ -25,7 +25,6 @@ static DEFINE_SPINLOCK(kernfs_idr_lock); /* root->ino_idr */
static bool kernfs_active(struct kernfs_node *kn)
{
- lockdep_assert_held(&kernfs_root(kn)->kernfs_rwsem);
return atomic_read(&kn->active) >= 0;
}
diff --git a/fs/kernfs/kernfs-internal.h b/fs/kernfs/kernfs-internal.h
index 593395f325a18..ba89de378f240 100644
--- a/fs/kernfs/kernfs-internal.h
+++ b/fs/kernfs/kernfs-internal.h
@@ -19,6 +19,17 @@
#include <linux/kernfs.h>
#include <linux/fs_context.h>
+/*
+ * kernfs_rwsem locking pattern:
+ *
+ * KERNFS_RWSEM_LOCK_SELF: lock kernfs_node only.
+ * KERNFS_RWSEM_LOCK_SELF_AND_PARENT: lock kernfs_node and its parent.
+ */
+enum kernfs_rwsem_lock_pattern {
+ KERNFS_RWSEM_LOCK_SELF,
+ KERNFS_RWSEM_LOCK_SELF_AND_PARENT
+};
+
struct kernfs_iattrs {
kuid_t ia_uid;
kgid_t ia_gid;
@@ -190,4 +201,553 @@ kernfs_open_node_spinlock(struct kernfs_node *kn)
return lock;
}
+static inline struct rw_semaphore *kernfs_rwsem_ptr(struct kernfs_node *kn)
+{
+ int idx = hash_ptr(kn, NR_KERNFS_LOCK_BITS);
+
+ return &kernfs_locks->kernfs_rwsem[idx];
+}
+
+static inline void kernfs_rwsem_assert_held(struct kernfs_node *kn)
+{
+ lockdep_assert_held(kernfs_rwsem_ptr(kn));
+}
+
+static inline void kernfs_rwsem_assert_held_write(struct kernfs_node *kn)
+{
+ lockdep_assert_held_write(kernfs_rwsem_ptr(kn));
+}
+
+static inline void kernfs_rwsem_assert_held_read(struct kernfs_node *kn)
+{
+ lockdep_assert_held_read(kernfs_rwsem_ptr(kn));
+}
+
+/**
+ * down_write_kernfs_rwsem_for_two_nodes() - Acquire hashed rwsem for 2 nodes
+ *
+ * @kn1: kernfs_node for which hashed rwsem needs to be taken
+ * @kn2: kernfs_node for which hashed rwsem needs to be taken
+ *
+ * In certain cases we need to acquire hashed rwsems for 2 nodes that don't
+ * have a parent-child relationship. This is one of the cases of nested locking
+ * involving hashed rwsems, and the rwsem with the lower address is acquired first.
+ */
+static inline void down_write_kernfs_rwsem_for_two_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2)
+{
+ struct rw_semaphore *rwsem1 = kernfs_rwsem_ptr(kn1);
+ struct rw_semaphore *rwsem2 = kernfs_rwsem_ptr(kn2);
+
+ if (rwsem1 == rwsem2)
+ down_write_nested(rwsem1, 0);
+ else {
+ if (rwsem1 < rwsem2) {
+ down_write_nested(rwsem1, 0);
+ down_write_nested(rwsem2, 1);
+ } else {
+ down_write_nested(rwsem2, 0);
+ down_write_nested(rwsem1, 1);
+ }
+ }
+ kernfs_get(kn1);
+ kernfs_get(kn2);
+}
+
+/**
+ * up_write_kernfs_rwsem_for_two_nodes() - Release hashed rwsem for 2 nodes
+ *
+ * @kn1: kernfs_node for which hashed rwsem needs to be released
+ * @kn2: kernfs_node for which hashed rwsem needs to be released
+ *
+ * In case of nested locking, rwsem with higher address is released first.
+ */
+static inline void up_write_kernfs_rwsem_for_two_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2)
+{
+ struct rw_semaphore *rwsem1 = kernfs_rwsem_ptr(kn1);
+ struct rw_semaphore *rwsem2 = kernfs_rwsem_ptr(kn2);
+
+ if (rwsem1 == rwsem2)
+ up_write(rwsem1);
+ else {
+ if (rwsem1 > rwsem2) {
+ up_write(rwsem1);
+ up_write(rwsem2);
+ } else {
+ up_write(rwsem2);
+ up_write(rwsem1);
+ }
+ }
+
+ kernfs_put(kn1);
+ kernfs_put(kn2);
+}
+
+/**
+ * down_read_kernfs_rwsem_for_two_nodes() - Acquire hashed rwsem for 2 nodes
+ *
+ * @kn1: kernfs_node for which hashed rwsem needs to be taken
+ * @kn2: kernfs_node for which hashed rwsem needs to be taken
+ *
+ * In certain cases we need to acquire hashed rwsems for 2 nodes that don't
+ * have a parent-child relationship. This is one of the cases of nested locking
+ * involving hashed rwsems, and the rwsem with the lower address is acquired first.
+ */
+static inline void down_read_kernfs_rwsem_for_two_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2)
+{
+ struct rw_semaphore *rwsem1 = kernfs_rwsem_ptr(kn1);
+ struct rw_semaphore *rwsem2 = kernfs_rwsem_ptr(kn2);
+
+ if (rwsem1 == rwsem2)
+ down_read_nested(rwsem1, 0);
+ else {
+ if (rwsem1 < rwsem2) {
+ down_read_nested(rwsem1, 0);
+ down_read_nested(rwsem2, 1);
+ } else {
+ down_read_nested(rwsem2, 0);
+ down_read_nested(rwsem1, 1);
+ }
+ }
+ kernfs_get(kn1);
+ kernfs_get(kn2);
+}
+
+/**
+ * up_read_kernfs_rwsem_for_two_nodes() - Release hashed rwsem for 2 nodes
+ *
+ * @kn1: kernfs_node for which hashed rwsem needs to be released
+ * @kn2: kernfs_node for which hashed rwsem needs to be released
+ *
+ * In case of nested locking, rwsem with higher address is released first.
+ */
+static inline void up_read_kernfs_rwsem_for_two_nodes(struct kernfs_node *kn1,
+ struct kernfs_node *kn2)
+{
+ struct rw_semaphore *rwsem1 = kernfs_rwsem_ptr(kn1);
+ struct rw_semaphore *rwsem2 = kernfs_rwsem_ptr(kn2);
+
+ if (rwsem1 == rwsem2)
+ up_read(rwsem1);
+ else {
+ if (rwsem1 > rwsem2) {
+ up_read(rwsem1);
+ up_read(rwsem2);
+ } else {
+ up_read(rwsem2);
+ up_read(rwsem1);
+ }
+ }
+
+ kernfs_put(kn1);
+ kernfs_put(kn2);
+}
+
+/**
+ * down_write_kernfs_rwsem() - Acquire hashed rwsem
+ *
+ * @kn: kernfs_node for which hashed rwsem needs to be taken
+ * @ptrn: locking pattern, i.e. whether to lock only the given node or the
+ * node and its parent as well
+ *
+ * In case of nested locking, rwsem with lower address is acquired first.
+ *
+ * Return: void
+ */
+static inline void down_write_kernfs_rwsem(struct kernfs_node *kn,
+ enum kernfs_rwsem_lock_pattern ptrn)
+{
+ struct rw_semaphore *p_rwsem = NULL;
+ struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+ int lock_parent = 0;
+
+ if (ptrn == KERNFS_RWSEM_LOCK_SELF_AND_PARENT && kn->parent)
+ lock_parent = 1;
+
+ if (lock_parent)
+ p_rwsem = kernfs_rwsem_ptr(kn->parent);
+
+ if (!lock_parent || rwsem == p_rwsem) {
+ down_write_nested(rwsem, 0);
+ kernfs_get(kn);
+ kn->unlock_parent = 0;
+ } else {
+ /*
+ * In case of nested locking, locks are taken in order of their
+ * addresses: the lock with the lower address is taken first,
+ * followed by the one with the higher address.
+ */
+ if (rwsem < p_rwsem) {
+ down_write_nested(rwsem, 0);
+ down_write_nested(p_rwsem, 1);
+ } else {
+ down_write_nested(p_rwsem, 0);
+ down_write_nested(rwsem, 1);
+ }
+ kernfs_get(kn);
+ kernfs_get(kn->parent);
+ kn->unlock_parent = 1;
+ }
+}
+
+/**
+ * up_write_kernfs_rwsem() - Release hashed rwsem
+ *
+ * @kn: kernfs_node for which hashed rwsem was taken
+ *
+ * In case of nested locking, rwsem with higher address is released first.
+ *
+ * Return: void
+ */
+static inline void up_write_kernfs_rwsem(struct kernfs_node *kn)
+{
+ struct rw_semaphore *p_rwsem = NULL;
+ struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+
+ if (kn->unlock_parent) {
+ kn->unlock_parent = 0;
+ p_rwsem = kernfs_rwsem_ptr(kn->parent);
+ if (rwsem > p_rwsem) {
+ up_write(rwsem);
+ up_write(p_rwsem);
+ } else {
+ up_write(p_rwsem);
+ up_write(rwsem);
+ }
+ kernfs_put(kn->parent);
+ } else
+ up_write(rwsem);
+
+ kernfs_put(kn);
+}
+
+/**
+ * down_read_kernfs_rwsem() - Acquire hashed rwsem
+ *
+ * @kn: kernfs_node for which hashed rwsem needs to be taken
+ * @ptrn: locking pattern, i.e. whether to lock only the given node or the
+ * node and its parent as well
+ *
+ * In case of nested locking, rwsem with lower address is acquired first.
+ *
+ * Return: void
+ */
+static inline void down_read_kernfs_rwsem(struct kernfs_node *kn,
+ enum kernfs_rwsem_lock_pattern ptrn)
+{
+ struct rw_semaphore *p_rwsem = NULL;
+ struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+ int lock_parent = 0;
+
+ if (ptrn == KERNFS_RWSEM_LOCK_SELF_AND_PARENT && kn->parent)
+ lock_parent = 1;
+
+ if (lock_parent)
+ p_rwsem = kernfs_rwsem_ptr(kn->parent);
+
+ if (!lock_parent || rwsem == p_rwsem) {
+ down_read_nested(rwsem, 0);
+ kernfs_get(kn);
+ kn->unlock_parent = 0;
+ } else {
+ /*
+ * In case of nested locking, locks are taken in order of their
+ * addresses: the lock with the lower address is taken first,
+ * followed by the one with the higher address.
+ */
+ if (rwsem < p_rwsem) {
+ down_read_nested(rwsem, 0);
+ down_read_nested(p_rwsem, 1);
+ } else {
+ down_read_nested(p_rwsem, 0);
+ down_read_nested(rwsem, 1);
+ }
+ kernfs_get(kn);
+ kernfs_get(kn->parent);
+ kn->unlock_parent = 1;
+ }
+}
+
+/**
+ * up_read_kernfs_rwsem() - Release hashed rwsem
+ *
+ * @kn: kernfs_node for which hashed rwsem was taken
+ *
+ * In case of nested locking, rwsem with higher address is released first.
+ *
+ * Return: void
+ */
+static inline void up_read_kernfs_rwsem(struct kernfs_node *kn)
+{
+ struct rw_semaphore *p_rwsem = NULL;
+ struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
+
+ if (kn->unlock_parent) {
+ kn->unlock_parent = 0;
+ p_rwsem = kernfs_rwsem_ptr(kn->parent);
+ if (rwsem > p_rwsem) {
+ up_read(rwsem);
+ up_read(p_rwsem);
+ } else {
+ up_read(p_rwsem);
+ up_read(rwsem);
+ }
+ kernfs_put(kn->parent);
+ } else
+ up_read(rwsem);
+
+ kernfs_put(kn);
+}
+
+static inline void kernfs_swap_rwsems(struct rw_semaphore **array, int i, int j)
+{
+ struct rw_semaphore *tmp;
+
+ tmp = array[i];
+ array[i] = array[j];
+ array[j] = tmp;
+}
+
+static inline void kernfs_sort_rwsems(struct rw_semaphore **array)
+{
+ if (array[0] > array[1])
+ kernfs_swap_rwsems(array, 0, 1);
+
+ if (array[0] > array[2])
+ kernfs_swap_rwsems(array, 0, 2);
+
+ if (array[1] > array[2])
+ kernfs_swap_rwsems(array, 1, 2);
+}
+
+/**
+ * down_write_kernfs_rwsem_rename_ns() - take hashed rwsem during
+ * rename or similar operations.
+ *
+ * @kn: kernfs_node of interest
+ * @current_parent: current parent of kernfs_node of interest
+ * @new_parent: about to become new parent of kernfs_node
+ *
+ * During rename or similar operations the parent of a node changes, so
+ * the parent seen when taking a node's (or its parent's) hashed rwsem
+ * may differ from the parent seen when releasing it.
+ * This function separately takes the locks corresponding to the node
+ * and to its current and future parents (if needed).
+ *
+ * Return: void
+ */
+static inline void down_write_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
+ struct kernfs_node *current_parent,
+ struct kernfs_node *new_parent)
+{
+ struct rw_semaphore *array[3];
+
+ array[0] = kernfs_rwsem_ptr(kn);
+ array[1] = kernfs_rwsem_ptr(current_parent);
+ array[2] = kernfs_rwsem_ptr(new_parent);
+
+ if (array[0] == array[1] && array[0] == array[2]) {
+ /* All 3 nodes hash to same rwsem */
+ down_write_nested(array[0], 0);
+ } else {
+ /*
+ * Not all 3 nodes hash to the same rwsem, so sort the
+ * array.
+ */
+ kernfs_sort_rwsems(array);
+
+ if (array[0] == array[1] || array[1] == array[2]) {
+ /*
+ * Two of the nodes hash to the same rwsem, and they
+ * will occupy consecutive places in the array after
+ * sorting.
+ */
+ down_write_nested(array[0], 0);
+ down_write_nested(array[2], 1);
+ } else {
+ /* All 3 nodes hash to different rwsems */
+ down_write_nested(array[0], 0);
+ down_write_nested(array[1], 1);
+ down_write_nested(array[2], 2);
+ }
+ }
+
+ kernfs_get(kn);
+ kernfs_get(current_parent);
+ kernfs_get(new_parent);
+}
+
+/**
+ * up_write_kernfs_rwsem_rename_ns() - release hashed rwsem during
+ * rename or similar operations.
+ *
+ * @kn: kernfs_node of interest
+ * @current_parent: current parent of kernfs_node of interest
+ * @old_parent: old parent of kernfs_node
+ *
+ * During rename or similar operations the parent of a node changes, so
+ * the parent seen when taking a node's (or its parent's) hashed rwsem
+ * may differ from the parent seen when releasing it.
+ * This function separately releases the locks corresponding to the node
+ * and to its current and old parents (if needed).
+ *
+ * Return: void
+ */
+static inline void up_write_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
+ struct kernfs_node *current_parent,
+ struct kernfs_node *old_parent)
+{
+ struct rw_semaphore *array[3];
+
+ array[0] = kernfs_rwsem_ptr(kn);
+ array[1] = kernfs_rwsem_ptr(current_parent);
+ array[2] = kernfs_rwsem_ptr(old_parent);
+
+ if (array[0] == array[1] && array[0] == array[2]) {
+ /* All 3 nodes hash to same rwsem */
+ up_write(array[0]);
+ } else {
+ /*
+ * Not all 3 nodes hash to the same rwsem, so sort the
+ * array.
+ */
+ kernfs_sort_rwsems(array);
+
+ if (array[0] == array[1] || array[1] == array[2]) {
+ /*
+ * Two of the nodes hash to the same rwsem, and they
+ * will occupy consecutive places in the array after
+ * sorting.
+ */
+ up_write(array[2]);
+ up_write(array[0]);
+ } else {
+ /* All 3 nodes hash to different rwsems */
+ up_write(array[2]);
+ up_write(array[1]);
+ up_write(array[0]);
+ }
+ }
+
+ kernfs_put(old_parent);
+ kernfs_put(current_parent);
+ kernfs_put(kn);
+}
+
+/**
+ * down_read_kernfs_rwsem_rename_ns() - take hashed rwsem during
+ * rename or similar operations.
+ *
+ * @kn: kernfs_node of interest
+ * @current_parent: current parent of kernfs_node of interest
+ * @new_parent: about to become new parent of kernfs_node
+ *
+ * During rename or similar operations the parent of a node changes, so
+ * the parent seen when taking a node's (or its parent's) hashed rwsem
+ * may differ from the parent seen when releasing it.
+ * This function separately takes the locks corresponding to the node
+ * and to its current and future parents (if needed).
+ *
+ * Return: void
+ */
+static inline void down_read_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
+ struct kernfs_node *current_parent,
+ struct kernfs_node *new_parent)
+{
+ struct rw_semaphore *array[3];
+
+ array[0] = kernfs_rwsem_ptr(kn);
+ array[1] = kernfs_rwsem_ptr(current_parent);
+ array[2] = kernfs_rwsem_ptr(new_parent);
+
+ if (array[0] == array[1] && array[0] == array[2]) {
+ /* All 3 nodes hash to same rwsem */
+ down_read_nested(array[0], 0);
+ } else {
+ /*
+ * Not all 3 nodes hash to the same rwsem, so sort the
+ * array.
+ */
+ kernfs_sort_rwsems(array);
+
+ if (array[0] == array[1] || array[1] == array[2]) {
+ /*
+ * Two of the nodes hash to the same rwsem, and they
+ * will occupy consecutive places in the array after
+ * sorting.
+ */
+ down_read_nested(array[0], 0);
+ down_read_nested(array[2], 1);
+ } else {
+ /* All 3 nodes hash to different rwsems */
+ down_read_nested(array[0], 0);
+ down_read_nested(array[1], 1);
+ down_read_nested(array[2], 2);
+ }
+ }
+
+ kernfs_get(kn);
+ kernfs_get(current_parent);
+ kernfs_get(new_parent);
+}
+
+/**
+ * up_read_kernfs_rwsem_rename_ns() - release hashed rwsem during
+ * rename or similar operations.
+ *
+ * @kn: kernfs_node of interest
+ * @current_parent: current parent of kernfs_node of interest
+ * @old_parent: old parent of kernfs_node
+ *
+ * During rename or similar operations the parent of a node changes, so
+ * the parent seen when taking a node's (or its parent's) hashed rwsem
+ * may differ from the parent seen when releasing it.
+ * This function separately releases the locks corresponding to the node
+ * and to its current and old parents (if needed).
+ *
+ * Return: void
+ */
+static inline void up_read_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
+ struct kernfs_node *current_parent,
+ struct kernfs_node *old_parent)
+{
+ struct rw_semaphore *array[3];
+
+ array[0] = kernfs_rwsem_ptr(kn);
+ array[1] = kernfs_rwsem_ptr(current_parent);
+ array[2] = kernfs_rwsem_ptr(old_parent);
+
+ if (array[0] == array[1] && array[0] == array[2]) {
+ /* All 3 nodes hash to same rwsem */
+ up_read(array[0]);
+ } else {
+ /*
+ * Not all 3 nodes hash to the same rwsem, so sort the
+ * array.
+ */
+ kernfs_sort_rwsems(array);
+
+ if (array[0] == array[1] || array[1] == array[2]) {
+ /*
+ * Two of the nodes hash to the same rwsem, and they
+ * will occupy consecutive places in the array after
+ * sorting.
+ */
+ up_read(array[2]);
+ up_read(array[0]);
+ } else {
+ /* All 3 nodes hash to different rwsems */
+ up_read(array[2]);
+ up_read(array[1]);
+ up_read(array[0]);
+ }
+ }
+
+ kernfs_put(old_parent);
+ kernfs_put(current_parent);
+ kernfs_put(kn);
+}
+
#endif /* __KERNFS_INTERNAL_H */
diff --git a/fs/kernfs/mount.c b/fs/kernfs/mount.c
index d35142226c340..d28f8a3eeb215 100644
--- a/fs/kernfs/mount.c
+++ b/fs/kernfs/mount.c
@@ -398,6 +398,7 @@ void __init kernfs_lock_init(void)
for (count = 0; count < NR_KERNFS_LOCKS; count++) {
mutex_init(&kernfs_locks->open_file_mutex[count].lock);
spin_lock_init(&kernfs_locks->open_node_locks[count].lock);
+ init_rwsem(&kernfs_locks->kernfs_rwsem[count]);
}
}
diff --git a/include/linux/kernfs.h b/include/linux/kernfs.h
index 84653c609a5c0..8451f153b2291 100644
--- a/include/linux/kernfs.h
+++ b/include/linux/kernfs.h
@@ -96,6 +96,7 @@ struct kernfs_open_node_lock {
struct kernfs_global_locks {
struct kernfs_open_file_mutex open_file_mutex[NR_KERNFS_LOCKS];
struct kernfs_open_node_lock open_node_locks[NR_KERNFS_LOCKS];
+ struct rw_semaphore kernfs_rwsem[NR_KERNFS_LOCKS];
};
enum kernfs_node_type {
@@ -206,6 +207,7 @@ struct kernfs_node {
*/
struct kernfs_node *parent;
const char *name;
+ u8 unlock_parent; /* release parent's rwsem */
struct rb_node rb;
--
2.30.2
On Mon, Feb 14, 2022 at 11:03:21PM +1100, Imran Khan wrote:
> +static inline void down_write_kernfs_rwsem_for_two_nodes(struct kernfs_node *kn1,
> + struct kernfs_node *kn2)
> +{
> + struct rw_semaphore *rwsem1 = kernfs_rwsem_ptr(kn1);
> + struct rw_semaphore *rwsem2 = kernfs_rwsem_ptr(kn2);
> +
> + if (rwsem1 == rwsem2)
> + down_write_nested(rwsem1, 0);
> + else {
> + if (rwsem1 < rwsem2) {
> + down_write_nested(rwsem1, 0);
> + down_write_nested(rwsem2, 1);
> + } else {
> + down_write_nested(rwsem2, 0);
> + down_write_nested(rwsem1, 1);
> + }
> + }
> + kernfs_get(kn1);
> + kernfs_get(kn2);
What's the last bit for? Do you expect the caller to grab that,
then drop references, then pass the same references to up_write_.....?
Explain, please. Is there any caller that would *not* have the matching
up_write_... in the same function, BTW?
Just in case - "defensive programming" is not a good answer (to just about any
question, TBH)...
> +static inline void down_write_kernfs_rwsem(struct kernfs_node *kn,
> + enum kernfs_rwsem_lock_pattern ptrn)
> +{
> + struct rw_semaphore *p_rwsem = NULL;
> + struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
> + int lock_parent = 0;
> +
> + if (ptrn == KERNFS_RWSEM_LOCK_SELF_AND_PARENT && kn->parent)
> + lock_parent = 1;
> +
> + if (lock_parent)
> + p_rwsem = kernfs_rwsem_ptr(kn->parent);
Er... And what's to prevent ->parent changing right under you? For that
matter, why bother with that at all - what's wrong with your "lock two
nodes" primitive? AFAICS, your current series has only one caller passing
that flag, so...
Note that ->unlock_parent thing is also dubious - the matching up_write_...
for that sole caller is in the same function, so it could bloody well
choose between single-up and double-up on its own. Would make the whole
thing simpler...
> +/**
> + * down_read_kernfs_rwsem() - Acquire hashed rwsem
> + *
> + * @kn: kernfs_node for which hashed rwsem needs to be taken
> + * @ptrn: locking pattern i.e whether to lock only given node or to lock
> + * node and its parent as well
> + *
> + * In case of nested locking, rwsem with lower address is acquired first.
> + *
> + * Return: void
> + */
... and this one, AFAICS, doesn't need ptrn at all - no callers that would ask to
lock parent.
> +static inline void kernfs_swap_rwsems(struct rw_semaphore **array, int i, int j)
> +{
> + struct rw_semaphore *tmp;
> +
> + tmp = array[i];
> + array[i] = array[j];
> + array[j] = tmp;
> +}
> +
> +static inline void kernfs_sort_rwsems(struct rw_semaphore **array)
> +{
> + if (array[0] > array[1])
> + kernfs_swap_rwsems(array, 0, 1);
> +
> + if (array[0] > array[2])
> + kernfs_swap_rwsems(array, 0, 2);
> +
> + if (array[1] > array[2])
> + kernfs_swap_rwsems(array, 1, 2);
> +}
> +
> +/**
> + * down_write_kernfs_rwsem_rename_ns() - take hashed rwsem during
> + * rename or similar operations.
> + *
> + * @kn: kernfs_node of interest
> + * @current_parent: current parent of kernfs_node of interest
> + * @new_parent: about to become new parent of kernfs_node
> + *
> + * During rename or similar operations the parent of a node changes,
> + * and this means we will see different parents of a kernfs_node at
> + * the time of taking and releasing its or its parent's hashed rwsem.
> + * This function separately takes locks corresponding to node, and
> + * corresponding to its current and future parents (if needed).
> + *
> + * Return: void
> + */
> +static inline void down_write_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
> + struct kernfs_node *current_parent,
> + struct kernfs_node *new_parent)
> +{
> + struct rw_semaphore *array[3];
> +
> + array[0] = kernfs_rwsem_ptr(kn);
> + array[1] = kernfs_rwsem_ptr(current_parent);
> + array[2] = kernfs_rwsem_ptr(new_parent);
> +
> + if (array[0] == array[1] && array[0] == array[2]) {
> + /* All 3 nodes hash to same rwsem */
> + down_write_nested(array[0], 0);
> + } else {
> + /**
> + * All 3 nodes are not hashing to the same rwsem, so sort the
> + * array.
> + */
> + kernfs_sort_rwsems(array);
> +
> + if (array[0] == array[1] || array[1] == array[2]) {
> + /**
> + * Two nodes hash to same rwsem, and these
> + * will occupy consecutive places in array after
> + * sorting.
> + */
> + down_write_nested(array[0], 0);
> + down_write_nested(array[2], 1);
> + } else {
> + /* All 3 nodes hashe to different rwsems */
> + down_write_nested(array[0], 0);
> + down_write_nested(array[1], 1);
> + down_write_nested(array[2], 2);
> + }
> + }
> +
> + kernfs_get(kn);
> + kernfs_get(current_parent);
> + kernfs_get(new_parent);
> +}
Holy shit... And _that_ is supposed to be inlined? Complete with sorting the
array, etc.?
<looks at the callers in the next patch>
- root = kernfs_root(kn);
- down_write(&root->kernfs_rwsem);
+ down_write_kernfs_rwsem_rename_ns(kn, kn->parent, new_parent);
This is rename. The very thing that changes kn->parent. Just what would
stop *ANOTHER* rename from modifying said kn->parent while you'd been
sleeping waiting for that rwsem? It's not even a narrow race window...
I could believe that similar question about stability of ->parent in
case of KERNFS_RWSEM_LOCK_SELF_AND_PARENT handling might be dealt with
by something along the lines of "nothing is supposed to rename a node
while it's fed to kernfs_remove_self(), which is the only user of that",
but here we definitely *can* have racing renames.
> +/**
> + * down_read_kernfs_rwsem_rename_ns() - take hashed rwsem during
> + * rename or similar operations.
> + *
> + * @kn: kernfs_node of interest
> + * @current_parent: current parent of kernfs_node of interest
> + * @new_parent: about to become new parent of kernfs_node
> + *
> + * During rename or similar operations the parent of a node changes,
> + * and this means we will see different parents of a kernfs_node at
> + * the time of taking and releasing its or its parent's hashed rwsem.
> + * This function separately takes locks corresponding to node, and
> + * corresponding to its current and future parents (if needed).
> + *
> + * Return: void
> + */
> +static inline void down_read_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
> + struct kernfs_node *current_parent,
> + struct kernfs_node *new_parent)
<tries to guess what use could *that* possibly have>
<fails>
<looks for callers (in a different mail, damnit)>
<none>
...
TBH, this is rather uninspiring. I can easily believe that you've got
contention visibly reduced on a benchmark. That'd only be interesting
if we had a clear proof of correctness. And you don't even bother to
discuss your locking rules anywhere I can see in the series, cover letter
included...
I agree that sysfs locking had always been an atrocity - "serialize
everything, we can't be arsed to think of that" approach had been there
from the day one. However, that's precisely the reason why replacement
needs a very careful analysis. For all I know, you might've done just
that, but you are asking reviewers to reproduce that work independently.
Or to hope that you've gotten it right and nod through the entire thing.
Pardon me, but I know how easy it is to miss something while doing that
kind of work - been there, done that.
*That* is the part that needs review.
Hello Al Viro,
Thanks for reviewing this.
On 16/2/22 12:46 pm, Al Viro wrote:
> On Mon, Feb 14, 2022 at 11:03:21PM +1100, Imran Khan wrote:
>
>> +static inline void down_write_kernfs_rwsem_for_two_nodes(struct kernfs_node *kn1,
>> + struct kernfs_node *kn2)
>> +{
>> + struct rw_semaphore *rwsem1 = kernfs_rwsem_ptr(kn1);
>> + struct rw_semaphore *rwsem2 = kernfs_rwsem_ptr(kn2);
>> +
>> + if (rwsem1 == rwsem2)
>> + down_write_nested(rwsem1, 0);
>> + else {
>> + if (rwsem1 < rwsem2) {
>> + down_write_nested(rwsem1, 0);
>> + down_write_nested(rwsem2, 1);
>> + } else {
>> + down_write_nested(rwsem2, 0);
>> + down_write_nested(rwsem1, 1);
>> + }
>> + }
>> + kernfs_get(kn1);
>> + kernfs_get(kn2);
>
> What's the last bit for? Do you expect the caller to grab that,
> then drop references, then pass the same references to up_write_.....?
> Explain, please. Is there any caller that would *not* have the matching
> up_write_... in the same function, BTW?
Every caller of down_write_ has a matching up_write_. I am grabbing the
reference for cases where the node gets deleted between down_write_ and
up_write_. For example, currently kernfs_remove invokes
down_write/up_write on the per-fs rwsem. This change uses the rwsem
corresponding to the kernfs_node being removed, so we may get a situation
where kernfs_remove --> __kernfs_remove --> kernfs_put results in
deletion of the kernfs_node, but the subsequent up_write_ from
kernfs_remove ends up using the freed kernfs_node.
>
> Just in case - "defensive programming" is not a good answer (to just about any
> question, TBH)...
>
>> +static inline void down_write_kernfs_rwsem(struct kernfs_node *kn,
>> + enum kernfs_rwsem_lock_pattern ptrn)
>> +{
>> + struct rw_semaphore *p_rwsem = NULL;
>> + struct rw_semaphore *rwsem = kernfs_rwsem_ptr(kn);
>> + int lock_parent = 0;
>> +
>> + if (ptrn == KERNFS_RWSEM_LOCK_SELF_AND_PARENT && kn->parent)
>> + lock_parent = 1;
>> +
>> + if (lock_parent)
>> + p_rwsem = kernfs_rwsem_ptr(kn->parent);
>
> Er... And what's to prevent ->parent changing right under you? For that
> matter, why bother with that at all - what's wrong with your "lock two
> nodes" primitive? AFAICS, your current series has only one caller passing
> that flag, so...
>
Agree. The "lock two nodes" primitive could have been used here to make
sure that I work with the same parent. I will make this change in the next
version.
> Note that ->unlock_parent thing is also dubious - the matching up_write_...
> for that sole caller is in the same function, so it could bloody well
> choose between single-up and double-up on its own. Would make the whole
> thing simpler...
Tejun raised a similar concern in his last feedback, and I am working
towards making use of the "double node locking" primitives for cases where
both the parent and the node need locking.
>
>> +/**
>> + * down_read_kernfs_rwsem() - Acquire hashed rwsem
>> + *
>> + * @kn: kernfs_node for which hashed rwsem needs to be taken
>> + * @ptrn: locking pattern i.e whether to lock only given node or to lock
>> + * node and its parent as well
>> + *
>> + * In case of nested locking, rwsem with lower address is acquired first.
>> + *
>> + * Return: void
>> + */
>
> ... and this one, AFAICS, doesn't need ptrn at all - no callers that would ask to
> lock parent.
Sure. I will get rid of this in next version.
>
>> +static inline void kernfs_swap_rwsems(struct rw_semaphore **array, int i, int j)
[...]
>> +}
>> +
>> +/**
>> + * down_write_kernfs_rwsem_rename_ns() - take hashed rwsem during
>> + * rename or similar operations.
>> + *
>> + * @kn: kernfs_node of interest
>> + * @current_parent: current parent of kernfs_node of interest
>> + * @new_parent: about to become new parent of kernfs_node
>> + *
>> + * During rename or similar operations the parent of a node changes,
>> + * and this means we will see different parents of a kernfs_node at
>> + * the time of taking and releasing its or its parent's hashed rwsem.
>> + * This function separately takes locks corresponding to node, and
>> + * corresponding to its current and future parents (if needed).
>> + *
>> + * Return: void
>> + */
>> +static inline void down_write_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
>> + struct kernfs_node *current_parent,
>> + struct kernfs_node *new_parent)
>> +{
>> + struct rw_semaphore *array[3];
>> +
>> + array[0] = kernfs_rwsem_ptr(kn);
>> + array[1] = kernfs_rwsem_ptr(current_parent);
>> + array[2] = kernfs_rwsem_ptr(new_parent);
>> +
>> + if (array[0] == array[1] && array[0] == array[2]) {
>> + /* All 3 nodes hash to same rwsem */
>> + down_write_nested(array[0], 0);
>> + } else {
>> + /**
>> + * All 3 nodes are not hashing to the same rwsem, so sort the
>> + * array.
>> + */
>> + kernfs_sort_rwsems(array);
>> +
>> + if (array[0] == array[1] || array[1] == array[2]) {
>> + /**
>> + * Two nodes hash to same rwsem, and these
>> + * will occupy consecutive places in array after
>> + * sorting.
>> + */
>> + down_write_nested(array[0], 0);
>> + down_write_nested(array[2], 1);
>> + } else {
>> + /* All 3 nodes hashe to different rwsems */
>> + down_write_nested(array[0], 0);
>> + down_write_nested(array[1], 1);
>> + down_write_nested(array[2], 2);
>> + }
>> + }
>> +
>> + kernfs_get(kn);
>> + kernfs_get(current_parent);
>> + kernfs_get(new_parent);
>> +}
>
> Holy shit... And _that_ is supposed to be inlined? Complete with sorting the
> array, etc.?
>
Sorry about this. I will correct it in the next version. I have received
some suggestions from Tejun about how to make these interfaces more
compact, and I will make these multi-node interfaces non-inline as well.
> <looks at the callers in the next patch>
>
> - root = kernfs_root(kn);
> - down_write(&root->kernfs_rwsem);
> + down_write_kernfs_rwsem_rename_ns(kn, kn->parent, new_parent);
>
> This is rename. The very thing that changes kn->parent. Just what would
> stop *ANOTHER* rename from modifying said kn->parent while you'd been
> sleeping waiting for that rwsem? It's not even a narrow race window...
>
> I could believe that similar question about stability of ->parent in
> case of KERNFS_RWSEM_LOCK_SELF_AND_PARENT handling might be dealt with
> by something along the lines of "nothing is supposed to rename a node
> while it's fed to kernfs_remove_self(), which is the only user of that",
> but here we definitely *can* have racing renames.
>
Agree. I missed the possibility of the parent changing while waiting for
the rwsem. Could you please let me know if the following approach is
acceptable in this case:
1. Note kn->parent.
2. down_write_
3. After acquiring the rwsems, check whether the current kn->parent is the
same as the one noted earlier. If so, proceed; otherwise release the
rwsems and invoke down_write_ again as per the current kn->parent.
Once we have taken the rwsems and found that kn->parent did not change
during the wait, we can proceed safely until the corresponding up_write_.
Regarding kernfs_remove_self, yes, that was the idea, but as mentioned
earlier I will use the "double node locking" primitives in the next
version.
>> +/**
>> + * down_read_kernfs_rwsem_rename_ns() - take hashed rwsem during
>> + * rename or similar operations.
>> + *
>> + * @kn: kernfs_node of interest
>> + * @current_parent: current parent of kernfs_node of interest
>> + * @new_parent: about to become new parent of kernfs_node
>> + *
>> + * During rename or similar operations the parent of a node changes,
>> + * and this means we will see different parents of a kernfs_node at
>> + * the time of taking and releasing its or its parent's hashed rwsem.
>> + * This function separately takes locks corresponding to node, and
>> + * corresponding to its current and future parents (if needed).
>> + *
>> + * Return: void
>> + */
>> +static inline void down_read_kernfs_rwsem_rename_ns(struct kernfs_node *kn,
>> + struct kernfs_node *current_parent,
>> + struct kernfs_node *new_parent)
>
> <tries to guess what use could *that* possibly have>
> <fails>
> <looks for callers (in a different mail, damnit)>
> <none>
> ...
>
> TBH, this is rather uninspiring. I can easily believe that you've got
> contention visibly reduced on a benchmark. That'd only be interesting
> if we had a clear proof of correctness. And you don't even bother to
> discuss your locking rules anywhere I can see in the series, cover letter
> included...
>
I have run my tests with lockdep enabled as well. Could you please let
me know if there is something that can be done as proof of correctness.
As sanity tests for the patches, my current tests include the LTP sysfs
tests, CPU hotplugging, and cgroup access/creation/removal. I am also
booting Oracle Linux, which involves cgroupfs access (via systemd). If
you could suggest some other tests, I can execute those as well.
Also, regarding the locking rules, I am not sure where to document them.
If I put accompanying comments at all the places where hashed rwsems are
taken, explaining why the lock corresponding to a node or to its parent
is being taken, will that be enough as a description of the locking
rules?
> I agree that sysfs locking had always been an atrocity - "serialize
> everything, we can't be arsed to think of that" approach had been there
> from the day one. However, that's precisely the reason why replacement
> needs a very careful analysis. For all I know, you might've done just
> that, but you are asking reviewers to reproduce that work independently.
> Or to hope that you've gotten it right and nod through the entire thing.
> Pardon me, but I know how easy it is to miss something while doing that
> kind of work - been there, done that.
I understand your concern, but I am not asking reviewers to reproduce
this work independently :). If I can get to know what can or should be
tested further in this regard, I can do those tests.
Since the start of this work, I have also been running my tests with
lockdep and KASAN enabled, and those tests have flagged some issues,
which I have addressed.
I will certainly address your review comments in the next version, but
in the meantime, if you have some suggestions regarding testing or the
description of locking rules, please let me know.
Thanks a lot for having a look.
-- Imran
On Wed, Feb 16, 2022 at 03:57:09PM +1100, Imran Khan wrote:
> Agreed. I missed the possibility of the parent changing while waiting on
> the rwsem. Could you please let me know if the following approach is
> acceptable in this case:
>
> 1. Note kn->parent
> 2. down_write_
> 3. After acquiring the rwsems, check whether the current kn->parent is the
> same as the one noted earlier; if so, proceed, otherwise drop the rwsems
> and invoke down_write_ again as per the current kn->parent.
>
> Once we have taken the rwsems and found that kn->parent did not change
> during the wait, we can proceed safely until the corresponding up_write_.
Maybe... TBH, kernfs_rename_ns() calling conventions suck; "move this
(wherever it is) there under that name", especially combined with
topology-modifying moves, makes life very unpleasant for callers who
want to use it safely. And I'm not at all sure they manage (or care to)
use it safely...
There are very few sources of cross-directory moves in the entire system.
One is cross-directory cgroup rename(2) (already serialized on per-fs basis
on VFS level), another is device_move(). Which is uncommon (5 callers
in the entire system, one in drivers/media/usb/pvrusb2/, another in
drivers/s390/cio, 3 more in net/bluetooth) and AFAICS unsafe wrt e.g.
kobject_get_path(). Not for kernfs data structures - unsafe use of
kobject->parent pointers. I could be wrong - that's just from some grepping
around, but it looks like a use-after-free race in the best case.
So changes of kn->parent can be considered a very cold path. Just make sure
you pin the damn thing, so it won't disappear from under you while you
are grabbing the locks.
> I have run my tests with lockdep enabled as well. Could you please let
> me know if there is something that can be done as proof of correctness.
> As sanity tests for the patches, my current tests include the LTP sysfs
> tests, CPU hotplugging, and cgroup access/creation/removal. I am also
> booting Oracle Linux, which involves cgroupfs access (via systemd). If
> you could suggest some other tests, I can execute those as well.
>
> Also, regarding the locking rules, I am not sure where to document them.
> If I put accompanying comments at all the places where hashed rwsems are
> taken, explaining why the lock corresponding to a node or to its parent
> is being taken, will that be enough as a description of the locking
> rules?
A separate text, along the lines of "foo->bar is modified only when we
are holding this, this and that; any of those is sufficient to stabilize
it. Locking order is such-and-such. Such-and-such predicate is guaranteed
to be true when you acquire this lock and must be true again by the time
you drop it.", etc. Some of that might go into the comments somewhere
in source (e.g. when it's about data structures dealt with in only one
file, or in the header where the data structures are declared), some -
in a separate text, perhaps in fs/kernfs/, perhaps in Documentation/filesystems/.
And, of course, there's the cover letter of series and commit messages.
> > I agree that sysfs locking had always been an atrocity - "serialize
> > everything, we can't be arsed to think of that" approach had been there
> > from the day one. However, that's precisely the reason why replacement
> > needs a very careful analysis. For all I know, you might've done just
> > that, but you are asking reviewers to reproduce that work independently.
> > Or to hope that you've gotten it right and nod through the entire thing.
> > Pardon me, but I know how easy it is to miss something while doing that
> > kind of work - been there, done that.
>
> I understand your concern, but I am not asking reviewers to reproduce
> this work independently :). If I can get to know what can or should be
> tested further in this regard, I can do those tests.
> Since the start of this work, I have also been running my tests with
> lockdep and KASAN enabled, and those tests have flagged some issues,
> which I have addressed.
>
> I will certainly address your review comments in the next version, but
> in the meantime, if you have some suggestions regarding testing or the
> description of locking rules, please let me know.
Tests are useful, but if we can't reason about the behaviour of code...
On Fri, Feb 18, 2022 at 03:25:31AM +0000, Al Viro wrote:
> There are very few sources of cross-directory moves in the entire system.
> One is cross-directory cgroup rename(2) (already serialized on per-fs basis
> on VFS level), another is device_move(). Which is uncommon (5 callers
FWIW, cgroup rename(2) doesn't allow changing the parent.
Thanks.
--
tejun
Hello Al,
Thanks for the feedback and suggestions. I have addressed these in v7 of
the patchset at [1].
On 18/2/22 2:25 pm, Al Viro wrote:
> On Wed, Feb 16, 2022 at 03:57:09PM +1100, Imran Khan wrote:
>
[...]
>
> Maybe... TBH, kernfs_rename_ns() calling conventions suck; "move this
> (wherever it is) there under that name", especially combined with
> topology-modifying moves, makes life very unpleasant for callers who
> want to use it safely. And I'm not at all sure they manage (or care to)
> use it safely...
>
> There are very few sources of cross-directory moves in the entire system.
> One is cross-directory cgroup rename(2) (already serialized on per-fs basis
> on VFS level), another is device_move(). Which is uncommon (5 callers
> in the entire system, one in drivers/media/usb/pvrusb2/, another in
> drivers/s390/cio, 3 more in net/bluetooth) and AFAICS unsafe wrt e.g.
> kobject_get_path(). Not for kernfs data structures - unsafe use of
> kobject->parent pointers. I could be wrong - that's just from some grepping
> around, but it looks like a use-after-free race in the best case.
>
> So changes of kn->parent can be considered a very cold path. Just make sure
> you pin the damn thing, so it won't disappear from under you while you
> are grabbing the locks.
>
Done.
>> I have run my tests with lockdep enabled as well. Could you please let
>> me know if there is something that can be done as proof of correctness.
>> As sanity tests for the patches, my current tests include the LTP sysfs
>> tests, CPU hotplugging, and cgroup access/creation/removal. I am also
>> booting Oracle Linux, which involves cgroupfs access (via systemd). If
>> you could suggest some other tests, I can execute those as well.
>>
>> Also, regarding the locking rules, I am not sure where to document them.
>> If I put accompanying comments at all the places where hashed rwsems are
>> taken, explaining why the lock corresponding to a node or to its parent
>> is being taken, will that be enough as a description of the locking
>> rules?
>
> A separate text, along the lines of "foo->bar is modified only when we
> are holding this, this and that; any of those is sufficient to stabilize
> it. Locking order is such-and-such. Such-and-such predicate is guaranteed
> to be true when you acquire this lock and must be true again by the time
> you drop it.", etc. Some of that might go into the comments somewhere
> in source (e.g. when it's about data structures dealt with in only one
> file, or in the header where the data structures are declared), some -
> in a separate text, perhaps in fs/kernfs/, perhaps in Documentation/filesystems/.
>
> And, of course, there's the cover letter of series and commit messages.
>
I have added a document describing the usage and proof of correctness of
the hashed locks. This should highlight whether I have misunderstood
something or overlooked some scenarios while making these changes. I will
await your feedback on this.
Thanks
-- Imran
[1]:
https://lore.kernel.org/lkml/[email protected]/