2023-12-06 06:07:23

by Dave Chinner

[permalink] [raw]
Subject: [PATCH 0/11] vfs: inode cache scalability improvements

We all know that the global inode_hash_lock and the per-fs global
sb->s_inode_list_lock locks are contention points in filesystem workloads
that stream inodes through memory, so it's about time we addressed these
limitations.

The first part of the patchset address the sb->s_inode_list_lock.
This was done a long time ago by Waiman Long by converting the
global linked list to a per-cpu linked list - those infrastructure
patches are pretty much unchanged from when Waiman first wrote them,
and as such the still carry the RVB that Jan Kara gave for them. I
have no idea if the problem that Waiman was trying to solve still
exists, but that's largely irrelevant because there are other
problems that I can easily reproduce.

That is, once at ~16 threads trying to instantiate or tear down
inodes at the same time in a filesystem, the sb->s_inode_list_lock
becomes a single point of contention. Adding an inode to the inode
cache requires adding it to the sb->s_inodes list, and removing an
inode from the cache requires removing it from the sb->s_inodes
list. That's two exclusive lock operations per inode we cycle
through the inode cache.

This creates a hard limit on the number of inodes we can cycle
through memory in a single filesystem. It tops out at around
600-700,000 inodes per second on XFS, and at that point we see
catastrophic cacheline contention breakdown and nothing goes any
faster. We can easily burn hundreds of CPUs on the sb->s_inodes list
operations, yet we still can only get 600-700k inodes/s through the
cache.

Converting the sb->s_inodes list to a dlist gets rid of this single
contention point and makes the sb->s_inodes list operations
disappear from the profiles completely. Prior to this change, at 32
threads XFS could pull 12.8 million inodes into cache in ~20s
(that's ~600k inodes/s - sound familiar?). With this change, those
12.8 million inodes are pulled into cache in ~10s. That's double the
rate at which XFS can pull inodes into memory from the
filesystem....

I'm talking about XFS here, because none of the other filesystem
actually stress the sb->s_inode_list_lock at all. They all hit
catastrophic cacheline contention on the inode_hash_lock long before
they get anywhere near the sb->s_inodes limits. For ext4 and
bcachefs, the inode_hash_lock becomes a limiting factor at 8
threads. btrfs hits internal namespace tree contention limits at 2
threads, so it's not even stressing the inode_hash_lock unless
highly threaded workloads are manually sharded across subvolumes.

So to bring the rest of the filesystems up, we need to fix the
inode_hash_lock contention problems. This patchset replaces the
global inode_hash_lock with the same lock-per-chain implementation
that the dentry cache uses. i.e. hash-bl lists. This is more complex
than the dentry cache implementation, however, because we nest spin
locks inside the inode_hash_lock. This conversion means we nest spin
locks inside bit spin locks in the inode cache.

Whilst this doesn't sound particularly problematic, the issue arises
on CONFIG_PREEMPT_RT kernels, where spinlocks are converted to
sleeping locks. We can't place sleeping locks inside spinning bit
locks, and that's exactly what happens if we use hash-bl lists in
the inode cache and then turn on CONFIG_PREEMPT_RT.

The other downside to converting to hash-bl is that we lose lockdep
coverage of the inode hash table - lockdep does not track bit locks
at all.

Both of these issues can be solved the same way: whenever either of
these two config options are turned on, we change the hash-bl
implementation from using a bit spin lock on th elowest bit of the
chain head pointer to using as dedicated spinlock per chain. This
trades off performance and memory footprint for configurations where
correctness is more important than performance, but allows optimal
implementations of hash-bl lists when performance is the primary
concern.

In making this conversion, we make all hash-bl implementations safe
for PREEMPT_RT usage and gain lockdep coverage of all hash-bl lists.
It also pointed out that several hash-bl list users did not actually
initialise the hash list heads correctly - they elided the
initialisation and only got away with it because they allocated
zeroed memory and the hash list head would still work from empty.
This needed fixing for lockdep....

The result of this conversion is that inode cache lookup heavy
workloads such as filesystem traversals and inode creation/removal
no longer contend on the inode_hash_lock to stream inodes through
the inode cache. This results in big performance improvements at
thread counts of >= 8.

I've run this through fstests with lockdep enabled on ext4 and XFS
without discovering any issues except for dm-snapshot needing
lockdep nesting annotations for list-bl locks. I've run a bunch of
"will-it-scale" like tests across XFS, ext4, bcachefs and btrfs, and
the raw table results for 6.7-rc4 are below.

The tests runs a fixed number of files per thread, so as the thread
count increases we should see runtimes stay constant if scalability
is perfect. I'm not caring about operation rates, I'm not caring
about which filesystems are faster, all I'm looking at is whether
the scalability of individual filesytsems improves with the changes.

base: vanilla 6.7-rc4 kernel
scale: 6.7-rc4 plus this patchset

Filesystem Files Threads Create Walk chmod Unlink
base scale base scale base scale base scale
xfs 400000 1 11.217 10.477 11.621 11.570 14.980 14.797 18.314 18.248
xfs 800000 2 12.313 11.470 11.673 11.158 15.271 14.782 19.413 18.533
xfs 1600000 4 14.130 13.246 9.665 9.444 14.794 13.710 19.582 17.015
xfs 3200000 8 16.654 16.108 10.622 9.275 15.854 14.575 20.679 19.237
xfs 6400000 16 17.587 18.151 12.148 9.508 16.655 17.691 26.044 21.865
xfs 12800000 32 20.833 21.491 20.518 10.308 23.614 19.334 42.572 27.404

All of the operations that require directory traversal show
significant improvements at 16 or more threads on XFS. This is
entirely from the sb->s_inodes modifications.

Filesystem Files Threads Create Walk chmod Unlink
base scale base scale base scale base scale
ext4 400000 1 9.344 9.394 7.691 7.847 9.188 9.212 11.340 12.517
ext4 800000 2 10.445 10.375 7.923 7.358 10.158 10.114 14.366 14.093
ext4 1600000 4 11.008 10.948 8.152 7.530 11.140 11.150 18.093 17.153
ext4 3200000 8 23.348 12.134 13.090 7.871 15.437 12.824 30.806 31.968
ext4 6400000 16 17.343 29.112 24.602 9.540 31.034 22.057 60.538 57.636
ext4 12800000 32 40.125 44.638 49.536 19.314 63.183 38.905 138.688 138.632

Walk on ext4 shows major improvements at 8 threads and above, as
does the recursive chmod. This largely comes from the inode hash
lock removal, but the full scalability improvements are not realised
until the sb->s_inodes changes are added as well.

Note that unlink doesn't scale or improve because the mkfs.ext4
binary in debian unstable does not support the orphan file option
and so it is completely bottlenecked on orphan list scalability
issues.

Filesystem Files Threads Create Walk chmod Unlink
base scale base scale base scale base scale
bcachefs 400000 1 16.999 17.193 6.546 6.355 13.973 13.024 28.890 19.014
bcachefs 800000 2 20.133 19.597 8.003 7.276 22.042 20.070 28.959 29.141
bcachefs 1600000 4 22.836 23.764 9.097 8.506 58.827 56.108 38.955 37.435
bcachefs 3200000 8 27.932 27.545 11.752 10.015 192.802 185.834 64.402 77.188
bcachefs 6400000 16 32.389 32.021 24.614 13.989 409.873 408.098 243.757 249.220
bcachefs 12800000 32 39.809 40.221 49.179 25.982 879.721 821.497 501.357 485.781

bcachefs walk shows major improvements at 16 threads and above, but
chmod and unlink are drowned by internal contention problems.

Filesystem Files Threads Create Walk chmod Unlink
base scale base scale base scale base scale
btrfs 400000 1 10.307 10.228 12.597 12.104 14.744 14.030 24.171 24.273
btrfs 800000 2 15.956 14.018 19.693 17.180 24.859 20.872 59.338 48.725
btrfs 1600000 4 22.441 20.951 32.855 29.013 37.975 33.575 140.601 125.305
btrfs 3200000 8 34.157 32.693 55.066 56.726 66.676 64.661 343.379 325.816
btrfs 6400000 16 60.847 59.123 90.097 89.340 116.994 114.280 683.244 681.953
btrfs 12800000 32 122.525 118.510 118.036 125.761 206.122 212.102 1612.940 1629.689

There's little point in doing scalability testing on plain btrfs -
it is entirely bottlenecked on internal algorithms long before
anything in the VFS becomes a scalability limitation.

Filesystem Files Threads Create Walk chmod Unlink
base scale base scale base scale base scale
btrfs-svol 400000 1 10.417 9.830 12.011 12.154 14.894 14.913 24.157 23.447
btrfs-svol 800000 2 12.079 11.681 12.596 12.208 16.535 15.310 28.031 26.412
btrfs-svol 1600000 4 15.219 15.074 12.711 10.735 18.079 16.948 34.330 31.949
btrfs-svol 3200000 8 23.140 21.307 14.706 10.934 22.580 21.907 53.183 52.129
btrfs-svol 6400000 16 40.657 40.226 26.062 11.471 34.058 33.333 101.133 99.504
btrfs-svol 12800000 32 81.516 79.412 50.320 12.406 65.691 58.229 193.847 200.050

Once btrfs is running with a sharded namespace (i.e. subvol per
thread) we results very similar in nature to bcachefs - walk
improves dramatically at high thread counts, but nothing else
changes as all the scalability limitations are internal to the
filesystem.

I have tested to 64 threads, but there's not a lot extra to add. The
XFs walk was done in 14.1s, so scalability is falling off but I
haven't spent any time looking at it in detail because there's just
so much other internal stuff to fix up before the rest of this
benchmark scales to 64 threads on XFS....

Git tree containing this series can be pulled from:

https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git vfs-scale

-Dave.



2023-12-06 06:07:36

by Dave Chinner

[permalink] [raw]
Subject: [PATCH 01/11] lib/dlock-list: Distributed and lock-protected lists

From: Waiman Long <[email protected]>

Linked list is used everywhere in the Linux kernel. However, if many
threads are trying to add or delete entries into the same linked list,
it can create a performance bottleneck.

This patch introduces a new list APIs that provide a set of distributed
lists (one per CPU), each of which is protected by its own spinlock.
To the callers, however, the set of lists acts like a single
consolidated list. This allows list entries insertion and deletion
operations to happen in parallel instead of being serialized with a
global list and lock.

List entry insertion is strictly per cpu. List deletion, however, can
happen in a cpu other than the one that did the insertion. So we still
need lock to protect the list. Because of that, there may still be
a small amount of contention when deletion is being done.

A new header file include/linux/dlock-list.h will be added with the
associated dlock_list_head and dlock_list_node structures. The following
functions are provided to manage the per-cpu list:

1. int alloc_dlock_list_heads(struct dlock_list_heads *dlist)
2. void free_dlock_list_heads(struct dlock_list_heads *dlist)
3. void dlock_list_add(struct dlock_list_node *node,
struct dlock_list_heads *dlist)
4. void dlock_list_del(struct dlock_list *node)

Iteration of all the list entries within a dlock list array
is done by calling either the dlist_for_each_entry() or
dlist_for_each_entry_safe() macros. They correspond to the
list_for_each_entry() and list_for_each_entry_safe() macros
respectively. The iteration states are keep in a dlock_list_iter
structure that is passed to the iteration macros.

Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
include/linux/dlock-list.h | 242 +++++++++++++++++++++++++++++++++++++
lib/Makefile | 2 +-
lib/dlock-list.c | 234 +++++++++++++++++++++++++++++++++++
3 files changed, 477 insertions(+), 1 deletion(-)
create mode 100644 include/linux/dlock-list.h
create mode 100644 lib/dlock-list.c

diff --git a/include/linux/dlock-list.h b/include/linux/dlock-list.h
new file mode 100644
index 000000000000..327cb9edc7e3
--- /dev/null
+++ b/include/linux/dlock-list.h
@@ -0,0 +1,242 @@
+/*
+ * Distributed and locked list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ * (C) Copyright 2017-2018 Red Hat, Inc.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#ifndef __LINUX_DLOCK_LIST_H
+#define __LINUX_DLOCK_LIST_H
+
+#include <linux/spinlock.h>
+#include <linux/list.h>
+
+/*
+ * include/linux/dlock-list.h
+ *
+ * The dlock_list_head structure contains the spinlock. It is cacheline
+ * aligned to reduce contention among different CPUs. The other
+ * dlock_list_node structures contains a pointer to the head entry instead.
+ */
+struct dlock_list_head {
+ struct list_head list;
+ spinlock_t lock;
+} ____cacheline_aligned_in_smp;
+
+struct dlock_list_heads {
+ struct dlock_list_head *heads;
+};
+
+/*
+ * dlock list node data structure
+ */
+struct dlock_list_node {
+ struct list_head list;
+ struct dlock_list_head *head;
+};
+
+/*
+ * dlock list iteration state
+ *
+ * This is an opaque data structure that may change. Users of this structure
+ * should not access the structure members directly other than using the
+ * helper functions and macros provided in this header file.
+ */
+struct dlock_list_iter {
+ int index;
+ struct dlock_list_head *head, *entry;
+};
+
+#define DLOCK_LIST_ITER_INIT(dlist) \
+ { \
+ .index = -1, \
+ .head = (dlist)->heads, \
+ }
+
+#define DEFINE_DLOCK_LIST_ITER(s, heads) \
+ struct dlock_list_iter s = DLOCK_LIST_ITER_INIT(heads)
+
+static inline void init_dlock_list_iter(struct dlock_list_iter *iter,
+ struct dlock_list_heads *heads)
+{
+ *iter = (struct dlock_list_iter)DLOCK_LIST_ITER_INIT(heads);
+}
+
+#define DLOCK_LIST_NODE_INIT(name) \
+ { \
+ .list = LIST_HEAD_INIT(name) \
+ }
+
+static inline void init_dlock_list_node(struct dlock_list_node *node)
+{
+ *node = (struct dlock_list_node)DLOCK_LIST_NODE_INIT(node->list);
+}
+
+/**
+ * dlock_list_unlock - unlock the spinlock that protects the current list
+ * @iter: Pointer to the dlock list iterator structure
+ */
+static inline void dlock_list_unlock(struct dlock_list_iter *iter)
+{
+ spin_unlock(&iter->entry->lock);
+}
+
+/**
+ * dlock_list_relock - lock the spinlock that protects the current list
+ * @iter: Pointer to the dlock list iterator structure
+ */
+static inline void dlock_list_relock(struct dlock_list_iter *iter)
+{
+ spin_lock(&iter->entry->lock);
+}
+
+/*
+ * Allocation and freeing of dlock list
+ */
+extern int __alloc_dlock_list_heads(struct dlock_list_heads *dlist,
+ struct lock_class_key *key);
+extern void free_dlock_list_heads(struct dlock_list_heads *dlist);
+
+/**
+ * alloc_dlock_list_head - Initialize and allocate the list of head entries.
+ * @dlist : Pointer to the dlock_list_heads structure to be initialized
+ * Return : 0 if successful, -ENOMEM if memory allocation error
+ */
+#define alloc_dlock_list_heads(dlist) \
+({ \
+ static struct lock_class_key _key; \
+ __alloc_dlock_list_heads(dlist, &_key); \
+})
+
+/*
+ * Check if a dlock list is empty or not.
+ */
+extern bool dlock_lists_empty(struct dlock_list_heads *dlist);
+
+/*
+ * The dlock list addition and deletion functions here are not irq-safe.
+ * Special irq-safe variants will have to be added if we need them.
+ */
+extern void dlock_lists_add(struct dlock_list_node *node,
+ struct dlock_list_heads *dlist);
+extern void dlock_lists_del(struct dlock_list_node *node);
+
+/*
+ * Find the first entry of the next available list.
+ */
+extern struct dlock_list_node *
+__dlock_list_next_list(struct dlock_list_iter *iter);
+
+/**
+ * __dlock_list_next_entry - Iterate to the next entry of the dlock list
+ * @curr : Pointer to the current dlock_list_node structure
+ * @iter : Pointer to the dlock list iterator structure
+ * Return: Pointer to the next entry or NULL if all the entries are iterated
+ *
+ * The iterator has to be properly initialized before calling this function.
+ */
+static inline struct dlock_list_node *
+__dlock_list_next_entry(struct dlock_list_node *curr,
+ struct dlock_list_iter *iter)
+{
+ /*
+ * Find next entry
+ */
+ if (curr)
+ curr = list_next_entry(curr, list);
+
+ if (!curr || (&curr->list == &iter->entry->list)) {
+ /*
+ * The current list has been exhausted, try the next available
+ * list.
+ */
+ curr = __dlock_list_next_list(iter);
+ }
+
+ return curr; /* Continue the iteration */
+}
+
+/**
+ * _dlock_list_next_list_entry - get first element from next list in iterator
+ * @iter : The dlock list iterator.
+ * @pos : A variable of the struct that is embedded in.
+ * @member: The name of the dlock_list_node within the struct.
+ * Return : Pointer to first entry or NULL if all the lists are iterated.
+ */
+#define _dlock_list_next_list_entry(iter, pos, member) \
+ ({ \
+ struct dlock_list_node *_n; \
+ _n = __dlock_list_next_entry(NULL, iter); \
+ _n ? list_entry(_n, typeof(*pos), member) : NULL; \
+ })
+
+/**
+ * _dlock_list_next_entry - iterate to the next entry of the list
+ * @pos : The type * to cursor
+ * @iter : The dlock list iterator.
+ * @member: The name of the dlock_list_node within the struct.
+ * Return : Pointer to the next entry or NULL if all the entries are iterated.
+ *
+ * Note that pos can't be NULL.
+ */
+#define _dlock_list_next_entry(pos, iter, member) \
+ ({ \
+ struct dlock_list_node *_n; \
+ _n = __dlock_list_next_entry(&(pos)->member, iter); \
+ _n ? list_entry(_n, typeof(*(pos)), member) : NULL; \
+ })
+
+/**
+ * dlist_for_each_entry - iterate over the dlock list
+ * @pos : Type * to use as a loop cursor
+ * @iter : The dlock list iterator
+ * @member: The name of the dlock_list_node within the struct
+ *
+ * This iteration macro isn't safe with respect to list entry removal, but
+ * it can correctly iterate newly added entries right after the current one.
+ * This iteration function is designed to be used in a while loop.
+ */
+#define dlist_for_each_entry(pos, iter, member) \
+ for (pos = _dlock_list_next_list_entry(iter, pos, member); \
+ pos != NULL; \
+ pos = _dlock_list_next_entry(pos, iter, member))
+
+/**
+ * dlist_for_each_entry_safe - iterate over the dlock list & safe over removal
+ * @pos : Type * to use as a loop cursor
+ * @n : Another type * to use as temporary storage
+ * @iter : The dlock list iterator
+ * @member: The name of the dlock_list_node within the struct
+ *
+ * This iteration macro is safe with respect to list entry removal.
+ * However, it cannot correctly iterate newly added entries right after the
+ * current one.
+ *
+ * The call to __dlock_list_next_list() is deferred until the next entry
+ * is being iterated to avoid use-after-unlock problem.
+ */
+#define dlist_for_each_entry_safe(pos, n, iter, member) \
+ for (pos = NULL; \
+ ({ \
+ if (!pos || \
+ (&(pos)->member.list == &(iter)->entry->list)) \
+ pos = _dlock_list_next_list_entry(iter, pos, \
+ member); \
+ if (pos) \
+ n = list_next_entry(pos, member.list); \
+ pos; \
+ }); \
+ pos = n)
+
+#endif /* __LINUX_DLOCK_LIST_H */
diff --git a/lib/Makefile b/lib/Makefile
index 6b09731d8e61..73d84b569f1e 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -48,7 +48,7 @@ obj-y += bcd.o sort.o parser.o debug_locks.o random32.o \
bsearch.o find_bit.o llist.o lwq.o memweight.o kfifo.o \
percpu-refcount.o rhashtable.o base64.o \
once.o refcount.o rcuref.o usercopy.o errseq.o bucket_locks.o \
- generic-radix-tree.o bitmap-str.o
+ generic-radix-tree.o bitmap-str.o dlock-list.o
obj-$(CONFIG_STRING_SELFTEST) += test_string.o
obj-y += string_helpers.o
obj-$(CONFIG_TEST_STRING_HELPERS) += test-string_helpers.o
diff --git a/lib/dlock-list.c b/lib/dlock-list.c
new file mode 100644
index 000000000000..f64ea4cc5e79
--- /dev/null
+++ b/lib/dlock-list.c
@@ -0,0 +1,234 @@
+/*
+ * Distributed and locked list
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * (C) Copyright 2016 Hewlett-Packard Enterprise Development LP
+ * (C) Copyright 2017-2018 Red Hat, Inc.
+ *
+ * Authors: Waiman Long <[email protected]>
+ */
+#include <linux/dlock-list.h>
+#include <linux/lockdep.h>
+#include <linux/slab.h>
+#include <linux/cpumask.h>
+
+/*
+ * The distributed and locked list is a distributed set of lists each of
+ * which is protected by its own spinlock, but acts like a single
+ * consolidated list to the callers. For scaling purpose, the number of
+ * lists used is equal to the number of possible CPUs in the system to
+ * minimize contention.
+ *
+ * However, it is possible that individual CPU numbers may be equal to
+ * or greater than the number of possible CPUs when there are holes in
+ * the CPU number list. As a result, we need to map the CPU number to a
+ * list index.
+ */
+static DEFINE_PER_CPU_READ_MOSTLY(int, cpu2idx);
+
+/*
+ * Initialize cpu2idx mapping table
+ *
+ * It is possible that a dlock-list can be allocated before the cpu2idx is
+ * initialized. In this case, all the cpus are mapped to the first entry
+ * before initialization.
+ *
+ */
+static int __init cpu2idx_init(void)
+{
+ int idx, cpu;
+
+ idx = 0;
+ for_each_possible_cpu(cpu)
+ per_cpu(cpu2idx, cpu) = idx++;
+ return 0;
+}
+postcore_initcall(cpu2idx_init);
+
+/**
+ * __alloc_dlock_list_heads - Initialize and allocate the list of head entries
+ * @dlist: Pointer to the dlock_list_heads structure to be initialized
+ * @key : The lock class key to be used for lockdep
+ * Return: 0 if successful, -ENOMEM if memory allocation error
+ *
+ * This function does not allocate the dlock_list_heads structure itself. The
+ * callers will have to do their own memory allocation, if necessary. However,
+ * this allows embedding the dlock_list_heads structure directly into other
+ * structures.
+ *
+ * Dynamically allocated locks need to have their own special lock class
+ * to avoid lockdep warning.
+ */
+int __alloc_dlock_list_heads(struct dlock_list_heads *dlist,
+ struct lock_class_key *key)
+{
+ int idx;
+
+ dlist->heads = kcalloc(nr_cpu_ids, sizeof(struct dlock_list_head),
+ GFP_KERNEL);
+
+ if (!dlist->heads)
+ return -ENOMEM;
+
+ for (idx = 0; idx < nr_cpu_ids; idx++) {
+ struct dlock_list_head *head = &dlist->heads[idx];
+
+ INIT_LIST_HEAD(&head->list);
+ head->lock = __SPIN_LOCK_UNLOCKED(&head->lock);
+ lockdep_set_class(&head->lock, key);
+ }
+ return 0;
+}
+EXPORT_SYMBOL(__alloc_dlock_list_heads);
+
+/**
+ * free_dlock_list_heads - Free all the heads entries of the dlock list
+ * @dlist: Pointer of the dlock_list_heads structure to be freed
+ *
+ * This function doesn't free the dlock_list_heads structure itself. So
+ * the caller will have to do it, if necessary.
+ */
+void free_dlock_list_heads(struct dlock_list_heads *dlist)
+{
+ kfree(dlist->heads);
+ dlist->heads = NULL;
+}
+EXPORT_SYMBOL(free_dlock_list_heads);
+
+/**
+ * dlock_lists_empty - Check if all the dlock lists are empty
+ * @dlist: Pointer to the dlock_list_heads structure
+ * Return: true if list is empty, false otherwise.
+ *
+ * This can be a pretty expensive function call. If this function is required
+ * in a performance critical path, we may have to maintain a global count
+ * of the list entries in the global dlock_list_heads structure instead.
+ */
+bool dlock_lists_empty(struct dlock_list_heads *dlist)
+{
+ int idx;
+
+ for (idx = 0; idx < nr_cpu_ids; idx++)
+ if (!list_empty(&dlist->heads[idx].list))
+ return false;
+ return true;
+}
+EXPORT_SYMBOL(dlock_lists_empty);
+
+/**
+ * dlock_lists_add - Adds a node to the given dlock list
+ * @node : The node to be added
+ * @dlist: The dlock list where the node is to be added
+ *
+ * List selection is based on the CPU being used when the dlock_list_add()
+ * function is called. However, deletion may be done by a different CPU.
+ */
+void dlock_lists_add(struct dlock_list_node *node,
+ struct dlock_list_heads *dlist)
+{
+ struct dlock_list_head *head = &dlist->heads[this_cpu_read(cpu2idx)];
+
+ /*
+ * There is no need to disable preemption
+ */
+ spin_lock(&head->lock);
+ WRITE_ONCE(node->head, head);
+ list_add(&node->list, &head->list);
+ spin_unlock(&head->lock);
+}
+EXPORT_SYMBOL(dlock_lists_add);
+
+/**
+ * dlock_lists_del - Delete a node from a dlock list
+ * @node : The node to be deleted
+ *
+ * We need to check the lock pointer again after taking the lock to guard
+ * against concurrent deletion of the same node. If the lock pointer changes
+ * (becomes NULL or to a different one), we assume that the deletion was done
+ * elsewhere. A warning will be printed if this happens as it is likely to be
+ * a bug.
+ */
+void dlock_lists_del(struct dlock_list_node *node)
+{
+ struct dlock_list_head *head;
+ bool retry;
+
+ do {
+ head = READ_ONCE(node->head);
+ if (WARN_ONCE(!head, "%s: node 0x%lx has no associated head\n",
+ __func__, (unsigned long)node))
+ return;
+
+ spin_lock(&head->lock);
+ if (likely(head == READ_ONCE(node->head))) {
+ list_del_init(&node->list);
+ WRITE_ONCE(node->head, NULL);
+ retry = false;
+ } else {
+ /*
+ * The lock has somehow changed. Retry again if it is
+ * not NULL. Otherwise, just ignore the delete
+ * operation.
+ */
+ retry = (READ_ONCE(node->head) != NULL);
+ }
+ spin_unlock(&head->lock);
+ } while (retry);
+}
+EXPORT_SYMBOL(dlock_lists_del);
+
+/**
+ * __dlock_list_next_list: Find the first entry of the next available list
+ * @dlist: Pointer to the dlock_list_heads structure
+ * @iter : Pointer to the dlock list iterator structure
+ * Return: true if the entry is found, false if all the lists exhausted
+ *
+ * The information about the next available list will be put into the iterator.
+ */
+struct dlock_list_node *__dlock_list_next_list(struct dlock_list_iter *iter)
+{
+ struct dlock_list_node *next;
+ struct dlock_list_head *head;
+
+restart:
+ if (iter->entry) {
+ spin_unlock(&iter->entry->lock);
+ iter->entry = NULL;
+ }
+
+next_list:
+ /*
+ * Try next list
+ */
+ if (++iter->index >= nr_cpu_ids)
+ return NULL; /* All the entries iterated */
+
+ if (list_empty(&iter->head[iter->index].list))
+ goto next_list;
+
+ head = iter->entry = &iter->head[iter->index];
+ spin_lock(&head->lock);
+ /*
+ * There is a slight chance that the list may become empty just
+ * before the lock is acquired. So an additional check is
+ * needed to make sure that a valid node will be returned.
+ */
+ if (list_empty(&head->list))
+ goto restart;
+
+ next = list_entry(head->list.next, struct dlock_list_node,
+ list);
+ WARN_ON_ONCE(next->head != head);
+
+ return next;
+}
+EXPORT_SYMBOL(__dlock_list_next_list);
--
2.42.0

2023-12-06 06:07:51

by Dave Chinner

[permalink] [raw]
Subject: [PATCH 03/11] vfs: Use dlock list for superblock's inode list

From: Waiman Long <[email protected]>

[dchinner: original commit message preserved]

When many threads are trying to add or delete inode to or from
a superblock's s_inodes list, spinlock contention on the list can
become a performance bottleneck.

This patch changes the s_inodes field to become a dlock list which
is a distributed set of lists with per-list spinlocks. As a result,
the following superblock inode list (sb->s_inodes) iteration functions
in vfs are also being modified:

1. iterate_bdevs()
2. drop_pagecache_sb()
3. evict_inodes()
4. invalidate_inodes()
5. fsnotify_unmount_inodes()
6. add_dquot_ref()
7. remove_dquot_ref()

With an exit microbenchmark that creates a large number of threads,
attachs many inodes to them in procfs and then exits. The runtimes of
that microbenchmark with various number of threads before and after
the patch on a 4-socket Intel E7-8867 v3 system (64 cores, 128 threads)
on a 4.19-rc3 based kernel were as follows:

# of threads Elapsed/Sys Time Elapsed/Sys Time Speedup
Unpatched Kernel Patched Kernel
------------ ---------------- ---------------- -------
1000 59.17s/123m09.8s 18.90s/24m44.5s 3.13
1200 73.20s/151m24.1s 27.54s/50m05.3s 2.66
1400 102.04s/212m00.9s 36.75s/68m26.7s 2.78
1600 131.13s/272m52.4s 50.16s/94m23.7s 2.61

[dchinner: forward port, add new inode list traversals, etc]
[dchinner: scalability results on current TOT XFS]

With 400k inodes per thread concurrent directory traversal workload,
scalability improves at >=16 threads on 6.7-rc4 on XFS. We only test
XFS here as it is the only filesystem that demonstrates sufficient
internal scalability for the superblock inode list to be a
scalability bottleneck.

Table contains test runtime in seconds; perfect scalability is
demonstrated by the runtime staying constant as thread count goes up.

Threads 6.4-rc7 patched
------- ------- -------
2 11.673 11.158
4 9.665 9.444
8 10.622 9.275
16 12.148 9.508
32 20.518 10.308

Unpatched kernel profile at 32 threads:

- 95.45% vfs_fstatat
- 95.00% vfs_statx
- 91.00% filename_lookup
- 90.90% path_lookupat
- 90.40% walk_component
- 89.05% lookup_slow
- 88.95% __lookup_slow
- 86.38% xfs_vn_lookup
- 84.05% xfs_lookup
- 78.82% xfs_iget
- 72.58% xfs_setup_inode
- 72.54% inode_sb_list_add
- 71.12% _raw_spin_lock
- 71.09% do_raw_spin_lock
- 68.85% __pv_queued_spin_lock_slowpath

Patched kernel profile at 32 threads - the biggest single point of
contention is now the dentry cache LRU via dput():

- 21.59% 0.25% [kernel] [k] dput
- 21.34% dput
- 19.93% retain_dentry
- d_lru_add
- 19.82% list_lru_add
- 14.62% _raw_spin_lock
- 14.47% do_raw_spin_lock
10.89% __pv_queued_spin_lock_slowpath
1.78% __list_add_valid_or_report
- 0.81% _raw_spin_unlock
- do_raw_spin_unlock
0.77% __raw_callee_save___pv_queued_spin_unlock
- 0.79% _raw_spin_unlock
- 0.78% do_raw_spin_unlock
0.67% __raw_callee_save___pv_queued_spin_unlock

Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
---
block/bdev.c | 24 ++++++++----------------
fs/drop_caches.c | 9 ++++-----
fs/gfs2/ops_fstype.c | 21 +++++++++++----------
fs/inode.c | 37 ++++++++++++++++---------------------
fs/notify/fsnotify.c | 12 ++++++------
fs/quota/dquot.c | 22 ++++++----------------
fs/super.c | 13 +++++++------
include/linux/fs.h | 8 ++++----
security/landlock/fs.c | 25 ++++++-------------------
9 files changed, 68 insertions(+), 103 deletions(-)

diff --git a/block/bdev.c b/block/bdev.c
index 750aec178b6a..07135fd6fda4 100644
--- a/block/bdev.c
+++ b/block/bdev.c
@@ -437,11 +437,11 @@ long nr_blockdev_pages(void)
{
struct inode *inode;
long ret = 0;
+ DEFINE_DLOCK_LIST_ITER(iter, &blockdev_superblock->s_inodes);

- spin_lock(&blockdev_superblock->s_inode_list_lock);
- list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list)
+ dlist_for_each_entry(inode, &iter, i_sb_list) {
ret += inode->i_mapping->nrpages;
- spin_unlock(&blockdev_superblock->s_inode_list_lock);
+ }

return ret;
}
@@ -1032,9 +1032,9 @@ EXPORT_SYMBOL_GPL(bdev_mark_dead);
void sync_bdevs(bool wait)
{
struct inode *inode, *old_inode = NULL;
+ DEFINE_DLOCK_LIST_ITER(iter, &blockdev_superblock->s_inodes);

- spin_lock(&blockdev_superblock->s_inode_list_lock);
- list_for_each_entry(inode, &blockdev_superblock->s_inodes, i_sb_list) {
+ dlist_for_each_entry(inode, &iter, i_sb_list) {
struct address_space *mapping = inode->i_mapping;
struct block_device *bdev;

@@ -1046,15 +1046,8 @@ void sync_bdevs(bool wait)
}
__iget(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&blockdev_superblock->s_inode_list_lock);
- /*
- * We hold a reference to 'inode' so it couldn't have been
- * removed from s_inodes list while we dropped the
- * s_inode_list_lock We cannot iput the inode now as we can
- * be holding the last reference and we cannot iput it under
- * s_inode_list_lock. So we keep the reference and iput it
- * later.
- */
+ dlock_list_unlock(&iter);
+
iput(old_inode);
old_inode = inode;
bdev = I_BDEV(inode);
@@ -1075,9 +1068,8 @@ void sync_bdevs(bool wait)
}
mutex_unlock(&bdev->bd_disk->open_mutex);

- spin_lock(&blockdev_superblock->s_inode_list_lock);
+ dlock_list_relock(&iter);
}
- spin_unlock(&blockdev_superblock->s_inode_list_lock);
iput(old_inode);
}

diff --git a/fs/drop_caches.c b/fs/drop_caches.c
index b9575957a7c2..3596d0a7c0da 100644
--- a/fs/drop_caches.c
+++ b/fs/drop_caches.c
@@ -19,9 +19,9 @@ int sysctl_drop_caches;
static void drop_pagecache_sb(struct super_block *sb, void *unused)
{
struct inode *inode, *toput_inode = NULL;
+ DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);

- spin_lock(&sb->s_inode_list_lock);
- list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ dlist_for_each_entry(inode, &iter, i_sb_list) {
spin_lock(&inode->i_lock);
/*
* We must skip inodes in unusual state. We may also skip
@@ -35,16 +35,15 @@ static void drop_pagecache_sb(struct super_block *sb, void *unused)
}
__iget(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&sb->s_inode_list_lock);
+ dlock_list_unlock(&iter);

invalidate_mapping_pages(inode->i_mapping, 0, -1);
iput(toput_inode);
toput_inode = inode;

cond_resched();
- spin_lock(&sb->s_inode_list_lock);
+ dlock_list_relock(&iter);
}
- spin_unlock(&sb->s_inode_list_lock);
iput(toput_inode);
}

diff --git a/fs/gfs2/ops_fstype.c b/fs/gfs2/ops_fstype.c
index b108c5d26839..1105710482e7 100644
--- a/fs/gfs2/ops_fstype.c
+++ b/fs/gfs2/ops_fstype.c
@@ -1738,22 +1738,24 @@ static int gfs2_meta_init_fs_context(struct fs_context *fc)
* attempt will time out. Since inodes are evicted sequentially, this can add
* up quickly.
*
- * Function evict_inodes() tries to keep the s_inode_list_lock list locked over
- * a long time, which prevents other inodes from being evicted concurrently.
- * This precludes the cooperative behavior we are looking for. This special
- * version of evict_inodes() avoids that.
- *
* Modeled after drop_pagecache_sb().
+ *
+ * XXX(dgc): this is particularly awful. With the dlist for inodes, concurrent
+ * access to the inode list can occur and evict_inodes() will drop the per-cpu
+ * list lock if the CPU needs rescheduling. Hence if this exists just because
+ * evict_inodes() holds the s_inode_list_lock for long periods preventing
+ * concurrent inode eviction work from being done, this can probably go away
+ * entirely now.
*/
static void gfs2_evict_inodes(struct super_block *sb)
{
struct inode *inode, *toput_inode = NULL;
struct gfs2_sbd *sdp = sb->s_fs_info;
+ DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);

set_bit(SDF_EVICTING, &sdp->sd_flags);

- spin_lock(&sb->s_inode_list_lock);
- list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ dlist_for_each_entry(inode, &iter, i_sb_list) {
spin_lock(&inode->i_lock);
if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) &&
!need_resched()) {
@@ -1762,15 +1764,14 @@ static void gfs2_evict_inodes(struct super_block *sb)
}
atomic_inc(&inode->i_count);
spin_unlock(&inode->i_lock);
- spin_unlock(&sb->s_inode_list_lock);
+ dlock_list_unlock(&iter);

iput(toput_inode);
toput_inode = inode;

cond_resched();
- spin_lock(&sb->s_inode_list_lock);
+ dlock_list_relock(&iter);
}
- spin_unlock(&sb->s_inode_list_lock);
iput(toput_inode);
}

diff --git a/fs/inode.c b/fs/inode.c
index 17c50a75514f..3426691fa305 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -30,7 +30,7 @@
* inode->i_state, inode->i_hash, __iget(), inode->i_io_list
* Inode LRU list locks protect:
* inode->i_sb->s_inode_lru, inode->i_lru
- * inode->i_sb->s_inode_list_lock protects:
+ * inode->i_sb->s_inodes->head->lock protects:
* inode->i_sb->s_inodes, inode->i_sb_list
* bdi->wb.list_lock protects:
* bdi->wb.b_{dirty,io,more_io,dirty_time}, inode->i_io_list
@@ -39,7 +39,7 @@
*
* Lock ordering:
*
- * inode->i_sb->s_inode_list_lock
+ * inode->i_sb->s_inodes->head->lock
* inode->i_lock
* Inode LRU list locks
*
@@ -47,7 +47,7 @@
* inode->i_lock
*
* inode_hash_lock
- * inode->i_sb->s_inode_list_lock
+ * inode->i_sb->s_inodes->head->lock
* inode->i_lock
*
* iunique_lock
@@ -423,7 +423,7 @@ void inode_init_once(struct inode *inode)
INIT_LIST_HEAD(&inode->i_io_list);
INIT_LIST_HEAD(&inode->i_wb_list);
INIT_LIST_HEAD(&inode->i_lru);
- INIT_LIST_HEAD(&inode->i_sb_list);
+ init_dlock_list_node(&inode->i_sb_list);
__address_space_init_once(&inode->i_data);
i_size_ordered_init(inode);
}
@@ -492,19 +492,14 @@ static void inode_lru_list_del(struct inode *inode)
*/
void inode_sb_list_add(struct inode *inode)
{
- spin_lock(&inode->i_sb->s_inode_list_lock);
- list_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
- spin_unlock(&inode->i_sb->s_inode_list_lock);
+ dlock_lists_add(&inode->i_sb_list, &inode->i_sb->s_inodes);
}
EXPORT_SYMBOL_GPL(inode_sb_list_add);

static inline void inode_sb_list_del(struct inode *inode)
{
- if (!list_empty(&inode->i_sb_list)) {
- spin_lock(&inode->i_sb->s_inode_list_lock);
- list_del_init(&inode->i_sb_list);
- spin_unlock(&inode->i_sb->s_inode_list_lock);
- }
+ if (!list_empty(&inode->i_sb_list.list))
+ dlock_lists_del(&inode->i_sb_list);
}

static unsigned long hash(struct super_block *sb, unsigned long hashval)
@@ -713,11 +708,12 @@ static void dispose_list(struct list_head *head)
void evict_inodes(struct super_block *sb)
{
struct inode *inode;
+ struct dlock_list_iter iter;
LIST_HEAD(dispose);

again:
- spin_lock(&sb->s_inode_list_lock);
- list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ init_dlock_list_iter(&iter, &sb->s_inodes);
+ dlist_for_each_entry(inode, &iter, i_sb_list) {
if (atomic_read(&inode->i_count))
continue;

@@ -738,13 +734,12 @@ void evict_inodes(struct super_block *sb)
* bit so we don't livelock.
*/
if (need_resched()) {
- spin_unlock(&sb->s_inode_list_lock);
+ dlock_list_unlock(&iter);
cond_resched();
dispose_list(&dispose);
goto again;
}
}
- spin_unlock(&sb->s_inode_list_lock);

dispose_list(&dispose);
}
@@ -759,11 +754,12 @@ EXPORT_SYMBOL_GPL(evict_inodes);
void invalidate_inodes(struct super_block *sb)
{
struct inode *inode;
+ struct dlock_list_iter iter;
LIST_HEAD(dispose);

again:
- spin_lock(&sb->s_inode_list_lock);
- list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ init_dlock_list_iter(&iter, &sb->s_inodes);
+ dlist_for_each_entry(inode, &iter, i_sb_list) {
spin_lock(&inode->i_lock);
if (inode->i_state & (I_NEW | I_FREEING | I_WILL_FREE)) {
spin_unlock(&inode->i_lock);
@@ -779,13 +775,12 @@ void invalidate_inodes(struct super_block *sb)
spin_unlock(&inode->i_lock);
list_add(&inode->i_lru, &dispose);
if (need_resched()) {
- spin_unlock(&sb->s_inode_list_lock);
+ dlock_list_unlock(&iter);
cond_resched();
dispose_list(&dispose);
goto again;
}
}
- spin_unlock(&sb->s_inode_list_lock);

dispose_list(&dispose);
}
@@ -1232,7 +1227,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
* Add inode to the sb list if it's not already. It has I_NEW at this
* point, so it should be safe to test i_sb_list locklessly.
*/
- if (list_empty(&inode->i_sb_list))
+ if (list_empty(&inode->i_sb_list.list))
inode_sb_list_add(inode);
unlock:
spin_unlock(&inode_hash_lock);
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 7974e91ffe13..15e3769e76f5 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -33,14 +33,15 @@ void __fsnotify_vfsmount_delete(struct vfsmount *mnt)
* @sb: superblock being unmounted.
*
* Called during unmount with no locks held, so needs to be safe against
- * concurrent modifiers. We temporarily drop sb->s_inode_list_lock and CAN block.
+ * concurrent modifiers. We temporarily drop sb->s_inodes list lock and CAN
+ * block.
*/
static void fsnotify_unmount_inodes(struct super_block *sb)
{
struct inode *inode, *iput_inode = NULL;
+ DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);

- spin_lock(&sb->s_inode_list_lock);
- list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ dlist_for_each_entry(inode, &iter, i_sb_list) {
/*
* We cannot __iget() an inode in state I_FREEING,
* I_WILL_FREE, or I_NEW which is fine because by that point
@@ -68,7 +69,7 @@ static void fsnotify_unmount_inodes(struct super_block *sb)

__iget(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&sb->s_inode_list_lock);
+ dlock_list_unlock(&iter);

iput(iput_inode);

@@ -80,9 +81,8 @@ static void fsnotify_unmount_inodes(struct super_block *sb)
iput_inode = inode;

cond_resched();
- spin_lock(&sb->s_inode_list_lock);
+ dlock_list_relock(&iter);
}
- spin_unlock(&sb->s_inode_list_lock);

iput(iput_inode);
}
diff --git a/fs/quota/dquot.c b/fs/quota/dquot.c
index 58b5de081b57..e873dcbe6feb 100644
--- a/fs/quota/dquot.c
+++ b/fs/quota/dquot.c
@@ -1024,13 +1024,13 @@ static int dqinit_needed(struct inode *inode, int type)
static int add_dquot_ref(struct super_block *sb, int type)
{
struct inode *inode, *old_inode = NULL;
+ DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);
#ifdef CONFIG_QUOTA_DEBUG
int reserved = 0;
#endif
int err = 0;

- spin_lock(&sb->s_inode_list_lock);
- list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ dlist_for_each_entry(inode, &iter, i_sb_list) {
spin_lock(&inode->i_lock);
if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
!atomic_read(&inode->i_writecount) ||
@@ -1040,7 +1040,7 @@ static int add_dquot_ref(struct super_block *sb, int type)
}
__iget(inode);
spin_unlock(&inode->i_lock);
- spin_unlock(&sb->s_inode_list_lock);
+ dlock_list_unlock(&iter);

#ifdef CONFIG_QUOTA_DEBUG
if (unlikely(inode_get_rsv_space(inode) > 0))
@@ -1053,19 +1053,10 @@ static int add_dquot_ref(struct super_block *sb, int type)
goto out;
}

- /*
- * We hold a reference to 'inode' so it couldn't have been
- * removed from s_inodes list while we dropped the
- * s_inode_list_lock. We cannot iput the inode now as we can be
- * holding the last reference and we cannot iput it under
- * s_inode_list_lock. So we keep the reference and iput it
- * later.
- */
old_inode = inode;
cond_resched();
- spin_lock(&sb->s_inode_list_lock);
+ dlock_list_relock(&iter);
}
- spin_unlock(&sb->s_inode_list_lock);
iput(old_inode);
out:
#ifdef CONFIG_QUOTA_DEBUG
@@ -1081,12 +1072,12 @@ static int add_dquot_ref(struct super_block *sb, int type)
static void remove_dquot_ref(struct super_block *sb, int type)
{
struct inode *inode;
+ DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);
#ifdef CONFIG_QUOTA_DEBUG
int reserved = 0;
#endif

- spin_lock(&sb->s_inode_list_lock);
- list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ dlist_for_each_entry(inode, &iter, i_sb_list) {
/*
* We have to scan also I_NEW inodes because they can already
* have quota pointer initialized. Luckily, we need to touch
@@ -1108,7 +1099,6 @@ static void remove_dquot_ref(struct super_block *sb, int type)
}
spin_unlock(&dq_data_lock);
}
- spin_unlock(&sb->s_inode_list_lock);
#ifdef CONFIG_QUOTA_DEBUG
if (reserved) {
printk(KERN_WARNING "VFS (%s): Writes happened after quota"
diff --git a/fs/super.c b/fs/super.c
index 076392396e72..61c19e3f06d8 100644
--- a/fs/super.c
+++ b/fs/super.c
@@ -303,6 +303,7 @@ static void destroy_unused_super(struct super_block *s)
super_unlock_excl(s);
list_lru_destroy(&s->s_dentry_lru);
list_lru_destroy(&s->s_inode_lru);
+ free_dlock_list_heads(&s->s_inodes);
security_sb_free(s);
put_user_ns(s->s_user_ns);
kfree(s->s_subtype);
@@ -367,8 +368,6 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
INIT_HLIST_NODE(&s->s_instances);
INIT_HLIST_BL_HEAD(&s->s_roots);
mutex_init(&s->s_sync_lock);
- INIT_LIST_HEAD(&s->s_inodes);
- spin_lock_init(&s->s_inode_list_lock);
INIT_LIST_HEAD(&s->s_inodes_wb);
spin_lock_init(&s->s_inode_wblist_lock);

@@ -383,6 +382,9 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags,
s->s_time_min = TIME64_MIN;
s->s_time_max = TIME64_MAX;

+ if (alloc_dlock_list_heads(&s->s_inodes))
+ goto fail;
+
s->s_shrink = shrinker_alloc(SHRINKER_NUMA_AWARE | SHRINKER_MEMCG_AWARE,
"sb-%s", type->name);
if (!s->s_shrink)
@@ -695,7 +697,7 @@ void generic_shutdown_super(struct super_block *sb)
if (sop->put_super)
sop->put_super(sb);

- if (CHECK_DATA_CORRUPTION(!list_empty(&sb->s_inodes),
+ if (CHECK_DATA_CORRUPTION(!dlock_lists_empty(&sb->s_inodes),
"VFS: Busy inodes after unmount of %s (%s)",
sb->s_id, sb->s_type->name)) {
/*
@@ -704,14 +706,13 @@ void generic_shutdown_super(struct super_block *sb)
* iput_final() or such crashes cleanly.
*/
struct inode *inode;
+ DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);

- spin_lock(&sb->s_inode_list_lock);
- list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ dlist_for_each_entry(inode, &iter, i_sb_list) {
inode->i_op = VFS_PTR_POISON;
inode->i_sb = VFS_PTR_POISON;
inode->i_mapping = VFS_PTR_POISON;
}
- spin_unlock(&sb->s_inode_list_lock);
}
}
/*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 98b7a7a8c42e..bb35591733f1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -43,6 +43,7 @@
#include <linux/cred.h>
#include <linux/mnt_idmapping.h>
#include <linux/slab.h>
+#include <linux/dlock-list.h>

#include <asm/byteorder.h>
#include <uapi/linux/fs.h>
@@ -702,7 +703,7 @@ struct inode {
u16 i_wb_frn_history;
#endif
struct list_head i_lru; /* inode LRU list */
- struct list_head i_sb_list;
+ struct dlock_list_node i_sb_list;
struct list_head i_wb_list; /* backing dev writeback list */
union {
struct hlist_head i_dentry;
@@ -1315,9 +1316,8 @@ struct super_block {
*/
int s_stack_depth;

- /* s_inode_list_lock protects s_inodes */
- spinlock_t s_inode_list_lock ____cacheline_aligned_in_smp;
- struct list_head s_inodes; /* all inodes */
+ /* The internal per-list locks protect s_inodes */
+ struct dlock_list_heads s_inodes; /* all inodes */

spinlock_t s_inode_wblist_lock;
struct list_head s_inodes_wb; /* writeback inodes */
diff --git a/security/landlock/fs.c b/security/landlock/fs.c
index bc7c126deea2..4269d9938c09 100644
--- a/security/landlock/fs.c
+++ b/security/landlock/fs.c
@@ -844,12 +844,12 @@ static void hook_inode_free_security(struct inode *const inode)
static void hook_sb_delete(struct super_block *const sb)
{
struct inode *inode, *prev_inode = NULL;
+ DEFINE_DLOCK_LIST_ITER(iter, &sb->s_inodes);

if (!landlock_initialized)
return;

- spin_lock(&sb->s_inode_list_lock);
- list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
+ dlist_for_each_entry(inode, &iter, i_sb_list) {
struct landlock_object *object;

/* Only handles referenced inodes. */
@@ -883,6 +883,7 @@ static void hook_sb_delete(struct super_block *const sb)
/* Keeps a reference to this inode until the next loop walk. */
__iget(inode);
spin_unlock(&inode->i_lock);
+ dlock_list_unlock(&iter);

/*
* If there is no concurrent release_inode() ongoing, then we
@@ -917,25 +918,11 @@ static void hook_sb_delete(struct super_block *const sb)
rcu_read_unlock();
}

- if (prev_inode) {
- /*
- * At this point, we still own the __iget() reference
- * that we just set in this loop walk. Therefore we
- * can drop the list lock and know that the inode won't
- * disappear from under us until the next loop walk.
- */
- spin_unlock(&sb->s_inode_list_lock);
- /*
- * We can now actually put the inode reference from the
- * previous loop walk, which is not needed anymore.
- */
- iput(prev_inode);
- cond_resched();
- spin_lock(&sb->s_inode_list_lock);
- }
+ iput(prev_inode);
prev_inode = inode;
+ cond_resched();
+ dlock_list_relock(&iter);
}
- spin_unlock(&sb->s_inode_list_lock);

/* Puts the inode reference from the last loop walk, if any. */
if (prev_inode)
--
2.42.0

2023-12-06 06:07:57

by Dave Chinner

[permalink] [raw]
Subject: [PATCH 10/11] list_bl: don't use bit locks for PREEMPT_RT or lockdep

From: Dave Chinner <[email protected]>

hash-bl nests spinlocks inside the bit locks. This causes problems
for CONFIG_PREEMPT_RT which converts spin locks to sleeping locks,
and we're not allowed to sleep while holding a spinning lock.

Further, lockdep does not support bit locks, so we lose lockdep
coverage of the inode hash table with the hash-bl conversion.

To enable these configs to work, add an external per-chain spinlock
to the hlist_bl_head() and add helpers to use this instead of the
bit spinlock when preempt_rt or lockdep are enabled.

This converts all users of hlist-bl to use the external spinlock in
these situations, so we also gain lockdep coverage of things like
the dentry cache hash table with this change.

Signed-off-by: Dave Chinner <[email protected]>
---
include/linux/list_bl.h | 126 ++++++++++++++++++++++++++++---------
include/linux/rculist_bl.h | 13 ++++
2 files changed, 110 insertions(+), 29 deletions(-)

diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
index 8ee2bf5af131..990ad8e24e0b 100644
--- a/include/linux/list_bl.h
+++ b/include/linux/list_bl.h
@@ -4,14 +4,27 @@

#include <linux/list.h>
#include <linux/bit_spinlock.h>
+#include <linux/spinlock.h>

/*
* Special version of lists, where head of the list has a lock in the lowest
* bit. This is useful for scalable hash tables without increasing memory
* footprint overhead.
*
- * For modification operations, the 0 bit of hlist_bl_head->first
- * pointer must be set.
+ * Whilst the general use of bit spin locking is considered safe, PREEMPT_RT
+ * introduces a problem with nesting spin locks inside bit locks: spin locks
+ * become sleeping locks, and we can't sleep inside spinning locks such as bit
+ * locks. However, for RTPREEMPT, performance is less of an issue than
+ * correctness, so we trade off the memory and cache footprint of a spinlock per
+ * list so the list locks are converted to sleeping locks and work correctly
+ * with PREEMPT_RT kernels.
+ *
+ * An added advantage of this is that we can use the same trick when lockdep is
+ * enabled (again, performance doesn't matter) and gain lockdep coverage of all
+ * the hash-bl operations.
+ *
+ * For modification operations when using pure bit locking, the 0 bit of
+ * hlist_bl_head->first pointer must be set.
*
* With some small modifications, this can easily be adapted to store several
* arbitrary bits (not just a single lock bit), if the need arises to store
@@ -30,16 +43,21 @@
#define LIST_BL_BUG_ON(x)
#endif

+#undef LIST_BL_USE_SPINLOCKS
+#if defined(CONFIG_PREEMPT_RT) || defined(CONFIG_LOCKDEP)
+#define LIST_BL_USE_SPINLOCKS 1
+#endif

struct hlist_bl_head {
struct hlist_bl_node *first;
+#ifdef LIST_BL_USE_SPINLOCKS
+ spinlock_t lock;
+#endif
};

struct hlist_bl_node {
struct hlist_bl_node *next, **pprev;
};
-#define INIT_HLIST_BL_HEAD(ptr) \
- ((ptr)->first = NULL)

static inline void INIT_HLIST_BL_NODE(struct hlist_bl_node *h)
{
@@ -54,6 +72,69 @@ static inline bool hlist_bl_unhashed(const struct hlist_bl_node *h)
return !h->pprev;
}

+#ifdef LIST_BL_USE_SPINLOCKS
+#define INIT_HLIST_BL_HEAD(ptr) do { \
+ (ptr)->first = NULL; \
+ spin_lock_init(&(ptr)->lock); \
+} while (0)
+
+static inline void hlist_bl_lock(struct hlist_bl_head *b)
+{
+ spin_lock(&b->lock);
+}
+
+static inline void hlist_bl_unlock(struct hlist_bl_head *b)
+{
+ spin_unlock(&b->lock);
+}
+
+static inline bool hlist_bl_is_locked(struct hlist_bl_head *b)
+{
+ return spin_is_locked(&b->lock);
+}
+
+static inline struct hlist_bl_node *hlist_bl_first(struct hlist_bl_head *h)
+{
+ return h->first;
+}
+
+static inline void hlist_bl_set_first(struct hlist_bl_head *h,
+ struct hlist_bl_node *n)
+{
+ h->first = n;
+}
+
+static inline void hlist_bl_set_before(struct hlist_bl_node **pprev,
+ struct hlist_bl_node *n)
+{
+ WRITE_ONCE(*pprev, n);
+}
+
+static inline bool hlist_bl_empty(const struct hlist_bl_head *h)
+{
+ return !READ_ONCE(h->first);
+}
+
+#else /* !LIST_BL_USE_SPINLOCKS */
+
+#define INIT_HLIST_BL_HEAD(ptr) \
+ ((ptr)->first = NULL)
+
+static inline void hlist_bl_lock(struct hlist_bl_head *b)
+{
+ bit_spin_lock(0, (unsigned long *)b);
+}
+
+static inline void hlist_bl_unlock(struct hlist_bl_head *b)
+{
+ __bit_spin_unlock(0, (unsigned long *)b);
+}
+
+static inline bool hlist_bl_is_locked(struct hlist_bl_head *b)
+{
+ return bit_spin_is_locked(0, (unsigned long *)b);
+}
+
static inline struct hlist_bl_node *hlist_bl_first(struct hlist_bl_head *h)
{
return (struct hlist_bl_node *)
@@ -69,11 +150,21 @@ static inline void hlist_bl_set_first(struct hlist_bl_head *h,
h->first = (struct hlist_bl_node *)((unsigned long)n | LIST_BL_LOCKMASK);
}

+static inline void hlist_bl_set_before(struct hlist_bl_node **pprev,
+ struct hlist_bl_node *n)
+{
+ WRITE_ONCE(*pprev,
+ (struct hlist_bl_node *)
+ ((uintptr_t)n | ((uintptr_t)*pprev & LIST_BL_LOCKMASK)));
+}
+
static inline bool hlist_bl_empty(const struct hlist_bl_head *h)
{
return !((unsigned long)READ_ONCE(h->first) & ~LIST_BL_LOCKMASK);
}

+#endif /* LIST_BL_USE_SPINLOCKS */
+
static inline void hlist_bl_add_head(struct hlist_bl_node *n,
struct hlist_bl_head *h)
{
@@ -94,11 +185,7 @@ static inline void hlist_bl_add_before(struct hlist_bl_node *n,
n->pprev = pprev;
n->next = next;
next->pprev = &n->next;
-
- /* pprev may be `first`, so be careful not to lose the lock bit */
- WRITE_ONCE(*pprev,
- (struct hlist_bl_node *)
- ((uintptr_t)n | ((uintptr_t)*pprev & LIST_BL_LOCKMASK)));
+ hlist_bl_set_before(pprev, n);
}

static inline void hlist_bl_add_behind(struct hlist_bl_node *n,
@@ -119,11 +206,7 @@ static inline void __hlist_bl_del(struct hlist_bl_node *n)

LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);

- /* pprev may be `first`, so be careful not to lose the lock bit */
- WRITE_ONCE(*pprev,
- (struct hlist_bl_node *)
- ((unsigned long)next |
- ((unsigned long)*pprev & LIST_BL_LOCKMASK)));
+ hlist_bl_set_before(pprev, next);
if (next)
next->pprev = pprev;
}
@@ -165,21 +248,6 @@ static inline bool hlist_bl_fake(struct hlist_bl_node *n)
return n->pprev == &n->next;
}

-static inline void hlist_bl_lock(struct hlist_bl_head *b)
-{
- bit_spin_lock(0, (unsigned long *)b);
-}
-
-static inline void hlist_bl_unlock(struct hlist_bl_head *b)
-{
- __bit_spin_unlock(0, (unsigned long *)b);
-}
-
-static inline bool hlist_bl_is_locked(struct hlist_bl_head *b)
-{
- return bit_spin_is_locked(0, (unsigned long *)b);
-}
-
/**
* hlist_bl_for_each_entry - iterate over list of given type
* @tpos: the type * to use as a loop cursor.
diff --git a/include/linux/rculist_bl.h b/include/linux/rculist_bl.h
index 0b952d06eb0b..2d5eb5153121 100644
--- a/include/linux/rculist_bl.h
+++ b/include/linux/rculist_bl.h
@@ -8,6 +8,18 @@
#include <linux/list_bl.h>
#include <linux/rcupdate.h>

+#ifdef LIST_BL_USE_SPINLOCKS
+static inline void hlist_bl_set_first_rcu(struct hlist_bl_head *h,
+ struct hlist_bl_node *n)
+{
+ rcu_assign_pointer(h->first, n);
+}
+
+static inline struct hlist_bl_node *hlist_bl_first_rcu(struct hlist_bl_head *h)
+{
+ return rcu_dereference_check(h->first, hlist_bl_is_locked(h));
+}
+#else /* !LIST_BL_USE_SPINLOCKS */
static inline void hlist_bl_set_first_rcu(struct hlist_bl_head *h,
struct hlist_bl_node *n)
{
@@ -23,6 +35,7 @@ static inline struct hlist_bl_node *hlist_bl_first_rcu(struct hlist_bl_head *h)
return (struct hlist_bl_node *)
((unsigned long)rcu_dereference_check(h->first, hlist_bl_is_locked(h)) & ~LIST_BL_LOCKMASK);
}
+#endif /* LIST_BL_USE_SPINLOCKS */

/**
* hlist_bl_del_rcu - deletes entry from hash list without re-initialization
--
2.42.0

2023-12-06 06:26:11

by Dave Chinner

[permalink] [raw]
Subject: [PATCH 08/11] vfs: inode cache conversion to hash-bl

From: Dave Chinner <[email protected]>

Scalability of the global inode_hash_lock really sucks for
filesystems that use the vfs inode cache (i.e. everything but XFS).

Profiles of a 32-way concurrent sharded directory walk (no contended
directories) on a couple of different filesystems. All numbers from
a 6.7-rc4 kernel.

Bcachefs:

- 98.78% vfs_statx
- 97.74% filename_lookup
- 97.70% path_lookupat
- 97.54% walk_component
- 97.06% lookup_slow
- 97.03% __lookup_slow
- 96.21% bch2_lookup
- 91.87% bch2_vfs_inode_get
- 84.10% iget5_locked
- 44.09% ilookup5
- 43.50% _raw_spin_lock
- 43.49% do_raw_spin_lock
42.75% __pv_queued_spin_lock_slowpath
- 39.06% inode_insert5
- 38.46% _raw_spin_lock
- 38.46% do_raw_spin_lock
37.51% __pv_queued_spin_lock_slowpath

ext4:

- 93.75% vfs_statx
- 92.39% filename_lookup
- 92.34% path_lookupat
- 92.09% walk_component
- 91.48% lookup_slow
- 91.43% __lookup_slow
- 90.18% ext4_lookup
- 84.84% __ext4_iget
- 83.67% iget_locked
- 81.24% _raw_spin_lock
- 81.23% do_raw_spin_lock
- 78.90% __pv_queued_spin_lock_slowpath


Both bcachefs and ext4 demonstrate poor scaling at >=8 threads on
concurrent lookup or create workloads.

Hence convert the inode hash table to a RCU-aware hash-bl table just
like the dentry cache. Note that we need to store a pointer to the
hlist_bl_head the inode has been added to in the inode so that when
it comes to unhash the inode we know what list to lock. We need to
do this because, unlike the dentry cache, the hash value that is
used to hash the inode is not generated from the inode itself. i.e.
filesystems can provide this themselves so we have to either store
the hashval or the hlist head pointer in the inode to be able to
find the right list head for removal...

Concurrent walk of 400k files per thread with varying thread count
in seconds is as follows. Perfect scaling is an unchanged walk time
as thread count increases.

ext4 bcachefs
threads vanilla patched vanilla patched
2 7.923 7.358 8.003 7.276
4 8.152 7.530 9.097 8.506
8 13.090 7.871 11.752 10.015
16 24.602 9.540 24.614 13.989
32 49.536 19.314 49.179 25.982

The big wins here are at >= 8 threads, with both filesytsems now
being limited by internal filesystem algorithms, not the VFS inode
cache scalability.

Ext4 contention moves to the buffer cache on directory block
lookups:

- 66.45% 0.44% [kernel] [k] __ext4_read_dirblock
- 66.01% __ext4_read_dirblock
- 66.01% ext4_bread
- ext4_getblk
- 64.77% bdev_getblk
- 64.69% __find_get_block
- 63.01% _raw_spin_lock
- 62.96% do_raw_spin_lock
59.21% __pv_queued_spin_lock_slowpath

bcachefs contention moves to internal btree traversal locks.

- 95.37% __lookup_slow
- 93.95% bch2_lookup
- 82.57% bch2_vfs_inode_get
- 65.44% bch2_inode_find_by_inum_trans
- 65.41% bch2_inode_peek_nowarn
- 64.60% bch2_btree_iter_peek_slot
- 64.55% bch2_btree_path_traverse_one
- bch2_btree_path_traverse_cached
- 63.02% bch2_btree_path_traverse_cached_slowpath
- 56.60% mutex_lock
- 55.29% __mutex_lock_slowpath
- 55.25% __mutex_lock
50.29% osq_lock
1.84% __raw_callee_save___kvm_vcpu_is_preempted
0.54% mutex_spin_on_owner

Signed-off-by: Dave Chinner <[email protected]>
---
fs/inode.c | 200 ++++++++++++++++++++++++++++-----------------
include/linux/fs.h | 9 +-
2 files changed, 132 insertions(+), 77 deletions(-)

diff --git a/fs/inode.c b/fs/inode.c
index fead81550cf4..3eb9c4e5b279 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -56,8 +56,7 @@

static unsigned int i_hash_mask __ro_after_init;
static unsigned int i_hash_shift __ro_after_init;
-static struct hlist_head *inode_hashtable __ro_after_init;
-static __cacheline_aligned_in_smp DEFINE_SPINLOCK(inode_hash_lock);
+static struct hlist_bl_head *inode_hashtable __ro_after_init;

static unsigned long hash(struct super_block *sb, unsigned long hashval)
{
@@ -69,7 +68,7 @@ static unsigned long hash(struct super_block *sb, unsigned long hashval)
return tmp & i_hash_mask;
}

-static inline struct hlist_head *i_hash_head(struct super_block *sb,
+static inline struct hlist_bl_head *i_hash_head(struct super_block *sb,
unsigned int hashval)
{
return inode_hashtable + hash(sb, hashval);
@@ -434,7 +433,7 @@ EXPORT_SYMBOL(address_space_init_once);
void inode_init_once(struct inode *inode)
{
memset(inode, 0, sizeof(*inode));
- INIT_HLIST_NODE(&inode->i_hash);
+ INIT_HLIST_BL_NODE(&inode->i_hash);
INIT_LIST_HEAD(&inode->i_devices);
INIT_LIST_HEAD(&inode->i_io_list);
INIT_LIST_HEAD(&inode->i_wb_list);
@@ -518,6 +517,17 @@ static inline void inode_sb_list_del(struct inode *inode)
dlock_lists_del(&inode->i_sb_list);
}

+/*
+ * Ensure that we store the hash head in the inode when we insert the inode into
+ * the hlist_bl_head...
+ */
+static inline void
+__insert_inode_hash_head(struct inode *inode, struct hlist_bl_head *b)
+{
+ hlist_bl_add_head_rcu(&inode->i_hash, b);
+ inode->i_hash_head = b;
+}
+
/**
* __insert_inode_hash - hash an inode
* @inode: unhashed inode
@@ -528,13 +538,13 @@ static inline void inode_sb_list_del(struct inode *inode)
*/
void __insert_inode_hash(struct inode *inode, unsigned long hashval)
{
- struct hlist_head *b = inode_hashtable + hash(inode->i_sb, hashval);
+ struct hlist_bl_head *b = i_hash_head(inode->i_sb, hashval);

- spin_lock(&inode_hash_lock);
+ hlist_bl_lock(b);
spin_lock(&inode->i_lock);
- hlist_add_head_rcu(&inode->i_hash, b);
+ __insert_inode_hash_head(inode, b);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);
}
EXPORT_SYMBOL(__insert_inode_hash);

@@ -546,11 +556,44 @@ EXPORT_SYMBOL(__insert_inode_hash);
*/
void __remove_inode_hash(struct inode *inode)
{
- spin_lock(&inode_hash_lock);
- spin_lock(&inode->i_lock);
- hlist_del_init_rcu(&inode->i_hash);
- spin_unlock(&inode->i_lock);
- spin_unlock(&inode_hash_lock);
+ struct hlist_bl_head *b = inode->i_hash_head;
+
+ /*
+ * There are some callers that come through here without synchronisation
+ * and potentially with multiple references to the inode. Hence we have
+ * to handle the case that we might race with a remove and insert to a
+ * different list. Coda, in particular, seems to have a userspace API
+ * that can directly trigger "unhash/rehash to different list" behaviour
+ * without any serialisation at all.
+ *
+ * Hence we have to handle the situation where the inode->i_hash_head
+ * might point to a different list than what we expect, indicating that
+ * we raced with another unhash and potentially a new insertion. This
+ * means we have to retest the head once we have everything locked up
+ * and loop again if it doesn't match.
+ */
+ while (b) {
+ hlist_bl_lock(b);
+ spin_lock(&inode->i_lock);
+ if (b != inode->i_hash_head) {
+ hlist_bl_unlock(b);
+ b = inode->i_hash_head;
+ spin_unlock(&inode->i_lock);
+ continue;
+ }
+ /*
+ * Need to set the pprev pointer to NULL after list removal so
+ * that both RCU traversals and hlist_bl_unhashed() work
+ * correctly at this point.
+ */
+ hlist_bl_del_rcu(&inode->i_hash);
+ inode->i_hash.pprev = NULL;
+ inode->i_hash_head = NULL;
+ spin_unlock(&inode->i_lock);
+ hlist_bl_unlock(b);
+ break;
+ }
+
}
EXPORT_SYMBOL(__remove_inode_hash);

@@ -886,26 +929,28 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
return freed;
}

-static void __wait_on_freeing_inode(struct inode *inode);
+static void __wait_on_freeing_inode(struct hlist_bl_head *b,
+ struct inode *inode);
/*
* Called with the inode lock held.
*/
static struct inode *find_inode(struct super_block *sb,
- struct hlist_head *head,
+ struct hlist_bl_head *b,
int (*test)(struct inode *, void *),
void *data)
{
+ struct hlist_bl_node *node;
struct inode *inode = NULL;

repeat:
- hlist_for_each_entry(inode, head, i_hash) {
+ hlist_bl_for_each_entry(inode, node, b, i_hash) {
if (inode->i_sb != sb)
continue;
if (!test(inode, data))
continue;
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
- __wait_on_freeing_inode(inode);
+ __wait_on_freeing_inode(b, inode);
goto repeat;
}
if (unlikely(inode->i_state & I_CREATING)) {
@@ -924,19 +969,20 @@ static struct inode *find_inode(struct super_block *sb,
* iget_locked for details.
*/
static struct inode *find_inode_fast(struct super_block *sb,
- struct hlist_head *head, unsigned long ino)
+ struct hlist_bl_head *b, unsigned long ino)
{
+ struct hlist_bl_node *node;
struct inode *inode = NULL;

repeat:
- hlist_for_each_entry(inode, head, i_hash) {
+ hlist_bl_for_each_entry(inode, node, b, i_hash) {
if (inode->i_ino != ino)
continue;
if (inode->i_sb != sb)
continue;
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
- __wait_on_freeing_inode(inode);
+ __wait_on_freeing_inode(b, inode);
goto repeat;
}
if (unlikely(inode->i_state & I_CREATING)) {
@@ -1186,25 +1232,25 @@ EXPORT_SYMBOL(unlock_two_nondirectories);
* return it locked, hashed, and with the I_NEW flag set. The file system gets
* to fill it in before unlocking it via unlock_new_inode().
*
- * Note both @test and @set are called with the inode_hash_lock held, so can't
- * sleep.
+ * Note both @test and @set are called with the inode hash chain lock held,
+ * so can't sleep.
*/
struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
int (*test)(struct inode *, void *),
int (*set)(struct inode *, void *), void *data)
{
- struct hlist_head *head = i_hash_head(inode->i_sb, hashval);
+ struct hlist_bl_head *b = i_hash_head(inode->i_sb, hashval);
struct inode *old;

again:
- spin_lock(&inode_hash_lock);
- old = find_inode(inode->i_sb, head, test, data);
+ hlist_bl_lock(b);
+ old = find_inode(inode->i_sb, b, test, data);
if (unlikely(old)) {
/*
* Uhhuh, somebody else created the same inode under us.
* Use the old inode instead of the preallocated one.
*/
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);
if (IS_ERR(old))
return NULL;
wait_on_inode(old);
@@ -1226,7 +1272,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
*/
spin_lock(&inode->i_lock);
inode->i_state |= I_NEW;
- hlist_add_head_rcu(&inode->i_hash, head);
+ __insert_inode_hash_head(inode, b);
spin_unlock(&inode->i_lock);

/*
@@ -1236,7 +1282,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
if (list_empty(&inode->i_sb_list.list))
inode_sb_list_add(inode);
unlock:
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);

return inode;
}
@@ -1297,12 +1343,12 @@ EXPORT_SYMBOL(iget5_locked);
*/
struct inode *iget_locked(struct super_block *sb, unsigned long ino)
{
- struct hlist_head *head = i_hash_head(sb, ino);
+ struct hlist_bl_head *b = i_hash_head(sb, ino);
struct inode *inode;
again:
- spin_lock(&inode_hash_lock);
- inode = find_inode_fast(sb, head, ino);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_lock(b);
+ inode = find_inode_fast(sb, b, ino);
+ hlist_bl_unlock(b);
if (inode) {
if (IS_ERR(inode))
return NULL;
@@ -1318,17 +1364,17 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
if (inode) {
struct inode *old;

- spin_lock(&inode_hash_lock);
+ hlist_bl_lock(b);
/* We released the lock, so.. */
- old = find_inode_fast(sb, head, ino);
+ old = find_inode_fast(sb, b, ino);
if (!old) {
inode->i_ino = ino;
spin_lock(&inode->i_lock);
inode->i_state = I_NEW;
- hlist_add_head_rcu(&inode->i_hash, head);
+ __insert_inode_hash_head(inode, b);
spin_unlock(&inode->i_lock);
inode_sb_list_add(inode);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);

/* Return the locked inode with I_NEW set, the
* caller is responsible for filling in the contents
@@ -1341,7 +1387,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
* us. Use the old inode instead of the one we just
* allocated.
*/
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);
destroy_inode(inode);
if (IS_ERR(old))
return NULL;
@@ -1365,10 +1411,11 @@ EXPORT_SYMBOL(iget_locked);
*/
static int test_inode_iunique(struct super_block *sb, unsigned long ino)
{
- struct hlist_head *b = i_hash_head(sb, ino);
+ struct hlist_bl_head *b = i_hash_head(sb, ino);
+ struct hlist_bl_node *node;
struct inode *inode;

- hlist_for_each_entry_rcu(inode, b, i_hash) {
+ hlist_bl_for_each_entry_rcu(inode, node, b, i_hash) {
if (inode->i_ino == ino && inode->i_sb == sb)
return 0;
}
@@ -1452,12 +1499,12 @@ EXPORT_SYMBOL(igrab);
struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
{
- struct hlist_head *head = i_hash_head(sb, hashval);
+ struct hlist_bl_head *b = i_hash_head(sb, hashval);
struct inode *inode;

- spin_lock(&inode_hash_lock);
- inode = find_inode(sb, head, test, data);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_lock(b);
+ inode = find_inode(sb, b, test, data);
+ hlist_bl_unlock(b);

return IS_ERR(inode) ? NULL : inode;
}
@@ -1507,12 +1554,12 @@ EXPORT_SYMBOL(ilookup5);
*/
struct inode *ilookup(struct super_block *sb, unsigned long ino)
{
- struct hlist_head *head = i_hash_head(sb, ino);
+ struct hlist_bl_head *b = i_hash_head(sb, ino);
struct inode *inode;
again:
- spin_lock(&inode_hash_lock);
- inode = find_inode_fast(sb, head, ino);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_lock(b);
+ inode = find_inode_fast(sb, b, ino);
+ hlist_bl_unlock(b);

if (inode) {
if (IS_ERR(inode))
@@ -1556,12 +1603,13 @@ struct inode *find_inode_nowait(struct super_block *sb,
void *),
void *data)
{
- struct hlist_head *head = i_hash_head(sb, hashval);
+ struct hlist_bl_head *b = i_hash_head(sb, hashval);
+ struct hlist_bl_node *node;
struct inode *inode, *ret_inode = NULL;
int mval;

- spin_lock(&inode_hash_lock);
- hlist_for_each_entry(inode, head, i_hash) {
+ hlist_bl_lock(b);
+ hlist_bl_for_each_entry(inode, node, b, i_hash) {
if (inode->i_sb != sb)
continue;
mval = match(inode, hashval, data);
@@ -1572,7 +1620,7 @@ struct inode *find_inode_nowait(struct super_block *sb,
goto out;
}
out:
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);
return ret_inode;
}
EXPORT_SYMBOL(find_inode_nowait);
@@ -1601,13 +1649,14 @@ EXPORT_SYMBOL(find_inode_nowait);
struct inode *find_inode_rcu(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data)
{
- struct hlist_head *head = i_hash_head(sb, hashval);
+ struct hlist_bl_head *b = i_hash_head(sb, hashval);
+ struct hlist_bl_node *node;
struct inode *inode;

RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
"suspicious find_inode_rcu() usage");

- hlist_for_each_entry_rcu(inode, head, i_hash) {
+ hlist_bl_for_each_entry_rcu(inode, node, b, i_hash) {
if (inode->i_sb == sb &&
!(READ_ONCE(inode->i_state) & (I_FREEING | I_WILL_FREE)) &&
test(inode, data))
@@ -1639,13 +1688,14 @@ EXPORT_SYMBOL(find_inode_rcu);
struct inode *find_inode_by_ino_rcu(struct super_block *sb,
unsigned long ino)
{
- struct hlist_head *head = i_hash_head(sb, ino);
+ struct hlist_bl_head *b = i_hash_head(sb, ino);
+ struct hlist_bl_node *node;
struct inode *inode;

RCU_LOCKDEP_WARN(!rcu_read_lock_held(),
"suspicious find_inode_by_ino_rcu() usage");

- hlist_for_each_entry_rcu(inode, head, i_hash) {
+ hlist_bl_for_each_entry_rcu(inode, node, b, i_hash) {
if (inode->i_ino == ino &&
inode->i_sb == sb &&
!(READ_ONCE(inode->i_state) & (I_FREEING | I_WILL_FREE)))
@@ -1659,39 +1709,42 @@ int insert_inode_locked(struct inode *inode)
{
struct super_block *sb = inode->i_sb;
ino_t ino = inode->i_ino;
- struct hlist_head *head = i_hash_head(sb, ino);
+ struct hlist_bl_head *b = i_hash_head(sb, ino);

while (1) {
- struct inode *old = NULL;
- spin_lock(&inode_hash_lock);
- hlist_for_each_entry(old, head, i_hash) {
- if (old->i_ino != ino)
+ struct hlist_bl_node *node;
+ struct inode *old = NULL, *t;
+
+ hlist_bl_lock(b);
+ hlist_bl_for_each_entry(t, node, b, i_hash) {
+ if (t->i_ino != ino)
continue;
- if (old->i_sb != sb)
+ if (t->i_sb != sb)
continue;
- spin_lock(&old->i_lock);
- if (old->i_state & (I_FREEING|I_WILL_FREE)) {
- spin_unlock(&old->i_lock);
+ spin_lock(&t->i_lock);
+ if (t->i_state & (I_FREEING|I_WILL_FREE)) {
+ spin_unlock(&t->i_lock);
continue;
}
+ old = t;
break;
}
if (likely(!old)) {
spin_lock(&inode->i_lock);
inode->i_state |= I_NEW | I_CREATING;
- hlist_add_head_rcu(&inode->i_hash, head);
+ __insert_inode_hash_head(inode, b);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);
return 0;
}
if (unlikely(old->i_state & I_CREATING)) {
spin_unlock(&old->i_lock);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);
return -EBUSY;
}
__iget(old);
spin_unlock(&old->i_lock);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);
wait_on_inode(old);
if (unlikely(!inode_unhashed(old))) {
iput(old);
@@ -2271,17 +2324,18 @@ EXPORT_SYMBOL(inode_needs_sync);
* wake_up_bit(&inode->i_state, __I_NEW) after removing from the hash list
* will DTRT.
*/
-static void __wait_on_freeing_inode(struct inode *inode)
+static void __wait_on_freeing_inode(struct hlist_bl_head *b,
+ struct inode *inode)
{
wait_queue_head_t *wq;
DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
wq = bit_waitqueue(&inode->i_state, __I_NEW);
prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_hash_lock);
+ hlist_bl_unlock(b);
schedule();
finish_wait(wq, &wait.wq_entry);
- spin_lock(&inode_hash_lock);
+ hlist_bl_lock(b);
}

static __initdata unsigned long ihash_entries;
@@ -2307,7 +2361,7 @@ void __init inode_init_early(void)

inode_hashtable =
alloc_large_system_hash("Inode-cache",
- sizeof(struct hlist_head),
+ sizeof(struct hlist_bl_head),
ihash_entries,
14,
HASH_EARLY | HASH_ZERO,
@@ -2333,7 +2387,7 @@ void __init inode_init(void)

inode_hashtable =
alloc_large_system_hash("Inode-cache",
- sizeof(struct hlist_head),
+ sizeof(struct hlist_bl_head),
ihash_entries,
14,
HASH_ZERO,
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bb35591733f1..0ef1b72340c7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -692,7 +692,8 @@ struct inode {
unsigned long dirtied_when; /* jiffies of first dirtying */
unsigned long dirtied_time_when;

- struct hlist_node i_hash;
+ struct hlist_bl_node i_hash;
+ struct hlist_bl_head *i_hash_head;
struct list_head i_io_list; /* backing dev IO list */
#ifdef CONFIG_CGROUP_WRITEBACK
struct bdi_writeback *i_wb; /* the associated cgroup wb */
@@ -758,7 +759,7 @@ static inline unsigned int i_blocksize(const struct inode *node)

static inline int inode_unhashed(struct inode *inode)
{
- return hlist_unhashed(&inode->i_hash);
+ return hlist_bl_unhashed(&inode->i_hash);
}

/*
@@ -769,7 +770,7 @@ static inline int inode_unhashed(struct inode *inode)
*/
static inline void inode_fake_hash(struct inode *inode)
{
- hlist_add_fake(&inode->i_hash);
+ hlist_bl_add_fake(&inode->i_hash);
}

/*
@@ -2946,7 +2947,7 @@ static inline void insert_inode_hash(struct inode *inode)
extern void __remove_inode_hash(struct inode *);
static inline void remove_inode_hash(struct inode *inode)
{
- if (!inode_unhashed(inode) && !hlist_fake(&inode->i_hash))
+ if (!inode_unhashed(inode) && !hlist_bl_fake(&inode->i_hash))
__remove_inode_hash(inode);
}

--
2.42.0

2023-12-07 02:27:22

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 01/11] lib/dlock-list: Distributed and lock-protected lists

On Wed, Dec 06, 2023 at 05:05:30PM +1100, Dave Chinner wrote:

> +static inline struct dlock_list_node *
> +__dlock_list_next_entry(struct dlock_list_node *curr,
> + struct dlock_list_iter *iter)
> +{
> + /*
> + * Find next entry
> + */
> + if (curr)
> + curr = list_next_entry(curr, list);
> +
> + if (!curr || (&curr->list == &iter->entry->list)) {

Hmm... hlist, perhaps? I mean, that way the thing becomes
if (curr)
curr = hlist_entry_safe(curr->node.next,
struct dlock_list_node, node);
if (!curr)
curr = __dlock_list_next_list(iter);
return curr;

BTW, does anybody have objections against

#define hlist_first_entry(head, type, member)
hlist_entry_safe((head)->first, type, member)

#define hlist_next_entry(pos, member)
hlist_entry_safe((pos)->member.next, typeof(*pos), member)

added in list.h?

> +static int __init cpu2idx_init(void)
> +{
> + int idx, cpu;
> +
> + idx = 0;
> + for_each_possible_cpu(cpu)
> + per_cpu(cpu2idx, cpu) = idx++;
> + return 0;
> +}
> +postcore_initcall(cpu2idx_init);

Is it early enough? Feels like that ought to be done from smp_init() or
right after it...

> +/**
> + * dlock_lists_empty - Check if all the dlock lists are empty
> + * @dlist: Pointer to the dlock_list_heads structure
> + * Return: true if list is empty, false otherwise.
> + *
> + * This can be a pretty expensive function call. If this function is required
> + * in a performance critical path, we may have to maintain a global count
> + * of the list entries in the global dlock_list_heads structure instead.
> + */
> +bool dlock_lists_empty(struct dlock_list_heads *dlist)
> +{
> + int idx;
> +
> + for (idx = 0; idx < nr_cpu_ids; idx++)
> + if (!list_empty(&dlist->heads[idx].list))
> + return false;
> + return true;
> +}

Umm... How would one use it, anyway? You'd need to stop all insertions
first, wouldn't you?

> + */
> +struct dlock_list_node *__dlock_list_next_list(struct dlock_list_iter *iter)
> +{
> + struct dlock_list_node *next;
> + struct dlock_list_head *head;
> +
> +restart:
> + if (iter->entry) {
> + spin_unlock(&iter->entry->lock);
> + iter->entry = NULL;
> + }
> +
> +next_list:
> + /*
> + * Try next list
> + */
> + if (++iter->index >= nr_cpu_ids)
> + return NULL; /* All the entries iterated */
> +
> + if (list_empty(&iter->head[iter->index].list))
> + goto next_list;
> +
> + head = iter->entry = &iter->head[iter->index];
> + spin_lock(&head->lock);
> + /*
> + * There is a slight chance that the list may become empty just
> + * before the lock is acquired. So an additional check is
> + * needed to make sure that a valid node will be returned.
> + */
> + if (list_empty(&head->list))
> + goto restart;
> +
> + next = list_entry(head->list.next, struct dlock_list_node,
> + list);
> + WARN_ON_ONCE(next->head != head);
> +
> + return next;
> +}

Perhaps something like

if (iter->entry) {
spin_unlock(&iter->entry->lock);
iter->entry = NULL;
}
while (++iter->index < nr_cpu_ids) {
struct dlock_list_head *head = &iter->head[iter->index];

if (list_empty(head->list))
continue;

spin_lock(&head->lock);
// recheck under lock
if (unlikely(list_empty(&head->list))) {
spin_unlock(&head->lock);
continue;
}

iter->entry = head;
next = list_first_entry(&head->list,
struct dlock_list_node, list);
WARN_ON_ONCE(next->head != head);
return next;
}
return NULL;

2023-12-07 02:40:42

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 03/11] vfs: Use dlock list for superblock's inode list

On Wed, Dec 06, 2023 at 05:05:32PM +1100, Dave Chinner wrote:

> @@ -303,6 +303,7 @@ static void destroy_unused_super(struct super_block *s)
> super_unlock_excl(s);
> list_lru_destroy(&s->s_dentry_lru);
> list_lru_destroy(&s->s_inode_lru);
> + free_dlock_list_heads(&s->s_inodes);
> security_sb_free(s);
> put_user_ns(s->s_user_ns);
> kfree(s->s_subtype);

Umm... Who's going to do that on normal umount?

2023-12-07 04:17:17

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 10/11] list_bl: don't use bit locks for PREEMPT_RT or lockdep

On Wed, Dec 06, 2023 at 05:05:39PM +1100, Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> hash-bl nests spinlocks inside the bit locks. This causes problems
> for CONFIG_PREEMPT_RT which converts spin locks to sleeping locks,
> and we're not allowed to sleep while holding a spinning lock.
>
> Further, lockdep does not support bit locks, so we lose lockdep
> coverage of the inode hash table with the hash-bl conversion.
>
> To enable these configs to work, add an external per-chain spinlock
> to the hlist_bl_head() and add helpers to use this instead of the
> bit spinlock when preempt_rt or lockdep are enabled.
>
> This converts all users of hlist-bl to use the external spinlock in
> these situations, so we also gain lockdep coverage of things like
> the dentry cache hash table with this change.
>
> Signed-off-by: Dave Chinner <[email protected]>

Sleepable bit locks can be done with wait_on_bit(), is that worth
considering for PREEMPT_RT? Or are the other features of real locks
important there?

(not a request for the current patchset, just perhaps a note for future
work)

Reviewed-by: Kent Overstreet <[email protected]>

> ---
> include/linux/list_bl.h | 126 ++++++++++++++++++++++++++++---------
> include/linux/rculist_bl.h | 13 ++++
> 2 files changed, 110 insertions(+), 29 deletions(-)
>
> diff --git a/include/linux/list_bl.h b/include/linux/list_bl.h
> index 8ee2bf5af131..990ad8e24e0b 100644
> --- a/include/linux/list_bl.h
> +++ b/include/linux/list_bl.h
> @@ -4,14 +4,27 @@
>
> #include <linux/list.h>
> #include <linux/bit_spinlock.h>
> +#include <linux/spinlock.h>
>
> /*
> * Special version of lists, where head of the list has a lock in the lowest
> * bit. This is useful for scalable hash tables without increasing memory
> * footprint overhead.
> *
> - * For modification operations, the 0 bit of hlist_bl_head->first
> - * pointer must be set.
> + * Whilst the general use of bit spin locking is considered safe, PREEMPT_RT
> + * introduces a problem with nesting spin locks inside bit locks: spin locks
> + * become sleeping locks, and we can't sleep inside spinning locks such as bit
> + * locks. However, for RTPREEMPT, performance is less of an issue than
> + * correctness, so we trade off the memory and cache footprint of a spinlock per
> + * list so the list locks are converted to sleeping locks and work correctly
> + * with PREEMPT_RT kernels.
> + *
> + * An added advantage of this is that we can use the same trick when lockdep is
> + * enabled (again, performance doesn't matter) and gain lockdep coverage of all
> + * the hash-bl operations.
> + *
> + * For modification operations when using pure bit locking, the 0 bit of
> + * hlist_bl_head->first pointer must be set.
> *
> * With some small modifications, this can easily be adapted to store several
> * arbitrary bits (not just a single lock bit), if the need arises to store
> @@ -30,16 +43,21 @@
> #define LIST_BL_BUG_ON(x)
> #endif
>
> +#undef LIST_BL_USE_SPINLOCKS
> +#if defined(CONFIG_PREEMPT_RT) || defined(CONFIG_LOCKDEP)
> +#define LIST_BL_USE_SPINLOCKS 1
> +#endif
>
> struct hlist_bl_head {
> struct hlist_bl_node *first;
> +#ifdef LIST_BL_USE_SPINLOCKS
> + spinlock_t lock;
> +#endif
> };
>
> struct hlist_bl_node {
> struct hlist_bl_node *next, **pprev;
> };
> -#define INIT_HLIST_BL_HEAD(ptr) \
> - ((ptr)->first = NULL)
>
> static inline void INIT_HLIST_BL_NODE(struct hlist_bl_node *h)
> {
> @@ -54,6 +72,69 @@ static inline bool hlist_bl_unhashed(const struct hlist_bl_node *h)
> return !h->pprev;
> }
>
> +#ifdef LIST_BL_USE_SPINLOCKS
> +#define INIT_HLIST_BL_HEAD(ptr) do { \
> + (ptr)->first = NULL; \
> + spin_lock_init(&(ptr)->lock); \
> +} while (0)
> +
> +static inline void hlist_bl_lock(struct hlist_bl_head *b)
> +{
> + spin_lock(&b->lock);
> +}
> +
> +static inline void hlist_bl_unlock(struct hlist_bl_head *b)
> +{
> + spin_unlock(&b->lock);
> +}
> +
> +static inline bool hlist_bl_is_locked(struct hlist_bl_head *b)
> +{
> + return spin_is_locked(&b->lock);
> +}
> +
> +static inline struct hlist_bl_node *hlist_bl_first(struct hlist_bl_head *h)
> +{
> + return h->first;
> +}
> +
> +static inline void hlist_bl_set_first(struct hlist_bl_head *h,
> + struct hlist_bl_node *n)
> +{
> + h->first = n;
> +}
> +
> +static inline void hlist_bl_set_before(struct hlist_bl_node **pprev,
> + struct hlist_bl_node *n)
> +{
> + WRITE_ONCE(*pprev, n);
> +}
> +
> +static inline bool hlist_bl_empty(const struct hlist_bl_head *h)
> +{
> + return !READ_ONCE(h->first);
> +}
> +
> +#else /* !LIST_BL_USE_SPINLOCKS */
> +
> +#define INIT_HLIST_BL_HEAD(ptr) \
> + ((ptr)->first = NULL)
> +
> +static inline void hlist_bl_lock(struct hlist_bl_head *b)
> +{
> + bit_spin_lock(0, (unsigned long *)b);
> +}
> +
> +static inline void hlist_bl_unlock(struct hlist_bl_head *b)
> +{
> + __bit_spin_unlock(0, (unsigned long *)b);
> +}
> +
> +static inline bool hlist_bl_is_locked(struct hlist_bl_head *b)
> +{
> + return bit_spin_is_locked(0, (unsigned long *)b);
> +}
> +
> static inline struct hlist_bl_node *hlist_bl_first(struct hlist_bl_head *h)
> {
> return (struct hlist_bl_node *)
> @@ -69,11 +150,21 @@ static inline void hlist_bl_set_first(struct hlist_bl_head *h,
> h->first = (struct hlist_bl_node *)((unsigned long)n | LIST_BL_LOCKMASK);
> }
>
> +static inline void hlist_bl_set_before(struct hlist_bl_node **pprev,
> + struct hlist_bl_node *n)
> +{
> + WRITE_ONCE(*pprev,
> + (struct hlist_bl_node *)
> + ((uintptr_t)n | ((uintptr_t)*pprev & LIST_BL_LOCKMASK)));
> +}
> +
> static inline bool hlist_bl_empty(const struct hlist_bl_head *h)
> {
> return !((unsigned long)READ_ONCE(h->first) & ~LIST_BL_LOCKMASK);
> }
>
> +#endif /* LIST_BL_USE_SPINLOCKS */
> +
> static inline void hlist_bl_add_head(struct hlist_bl_node *n,
> struct hlist_bl_head *h)
> {
> @@ -94,11 +185,7 @@ static inline void hlist_bl_add_before(struct hlist_bl_node *n,
> n->pprev = pprev;
> n->next = next;
> next->pprev = &n->next;
> -
> - /* pprev may be `first`, so be careful not to lose the lock bit */
> - WRITE_ONCE(*pprev,
> - (struct hlist_bl_node *)
> - ((uintptr_t)n | ((uintptr_t)*pprev & LIST_BL_LOCKMASK)));
> + hlist_bl_set_before(pprev, n);
> }
>
> static inline void hlist_bl_add_behind(struct hlist_bl_node *n,
> @@ -119,11 +206,7 @@ static inline void __hlist_bl_del(struct hlist_bl_node *n)
>
> LIST_BL_BUG_ON((unsigned long)n & LIST_BL_LOCKMASK);
>
> - /* pprev may be `first`, so be careful not to lose the lock bit */
> - WRITE_ONCE(*pprev,
> - (struct hlist_bl_node *)
> - ((unsigned long)next |
> - ((unsigned long)*pprev & LIST_BL_LOCKMASK)));
> + hlist_bl_set_before(pprev, next);
> if (next)
> next->pprev = pprev;
> }
> @@ -165,21 +248,6 @@ static inline bool hlist_bl_fake(struct hlist_bl_node *n)
> return n->pprev == &n->next;
> }
>
> -static inline void hlist_bl_lock(struct hlist_bl_head *b)
> -{
> - bit_spin_lock(0, (unsigned long *)b);
> -}
> -
> -static inline void hlist_bl_unlock(struct hlist_bl_head *b)
> -{
> - __bit_spin_unlock(0, (unsigned long *)b);
> -}
> -
> -static inline bool hlist_bl_is_locked(struct hlist_bl_head *b)
> -{
> - return bit_spin_is_locked(0, (unsigned long *)b);
> -}
> -
> /**
> * hlist_bl_for_each_entry - iterate over list of given type
> * @tpos: the type * to use as a loop cursor.
> diff --git a/include/linux/rculist_bl.h b/include/linux/rculist_bl.h
> index 0b952d06eb0b..2d5eb5153121 100644
> --- a/include/linux/rculist_bl.h
> +++ b/include/linux/rculist_bl.h
> @@ -8,6 +8,18 @@
> #include <linux/list_bl.h>
> #include <linux/rcupdate.h>
>
> +#ifdef LIST_BL_USE_SPINLOCKS
> +static inline void hlist_bl_set_first_rcu(struct hlist_bl_head *h,
> + struct hlist_bl_node *n)
> +{
> + rcu_assign_pointer(h->first, n);
> +}
> +
> +static inline struct hlist_bl_node *hlist_bl_first_rcu(struct hlist_bl_head *h)
> +{
> + return rcu_dereference_check(h->first, hlist_bl_is_locked(h));
> +}
> +#else /* !LIST_BL_USE_SPINLOCKS */
> static inline void hlist_bl_set_first_rcu(struct hlist_bl_head *h,
> struct hlist_bl_node *n)
> {
> @@ -23,6 +35,7 @@ static inline struct hlist_bl_node *hlist_bl_first_rcu(struct hlist_bl_head *h)
> return (struct hlist_bl_node *)
> ((unsigned long)rcu_dereference_check(h->first, hlist_bl_is_locked(h)) & ~LIST_BL_LOCKMASK);
> }
> +#endif /* LIST_BL_USE_SPINLOCKS */
>
> /**
> * hlist_bl_del_rcu - deletes entry from hash list without re-initialization
> --
> 2.42.0
>

2023-12-07 04:42:16

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 10/11] list_bl: don't use bit locks for PREEMPT_RT or lockdep

On Wed, Dec 06, 2023 at 11:16:50PM -0500, Kent Overstreet wrote:
> On Wed, Dec 06, 2023 at 05:05:39PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <[email protected]>
> >
> > hash-bl nests spinlocks inside the bit locks. This causes problems
> > for CONFIG_PREEMPT_RT which converts spin locks to sleeping locks,
> > and we're not allowed to sleep while holding a spinning lock.
> >
> > Further, lockdep does not support bit locks, so we lose lockdep
> > coverage of the inode hash table with the hash-bl conversion.
> >
> > To enable these configs to work, add an external per-chain spinlock
> > to the hlist_bl_head() and add helpers to use this instead of the
> > bit spinlock when preempt_rt or lockdep are enabled.
> >
> > This converts all users of hlist-bl to use the external spinlock in
> > these situations, so we also gain lockdep coverage of things like
> > the dentry cache hash table with this change.
> >
> > Signed-off-by: Dave Chinner <[email protected]>
>
> Sleepable bit locks can be done with wait_on_bit(), is that worth
> considering for PREEMPT_RT? Or are the other features of real locks
> important there?

I think wait_on_bit() is not scalable. It hashes down to one of 256
shared struct wait_queue_heads which have thundering herd
behaviours, and it requires the locker to always run
prepare_to_wait() and finish_wait(). This means there is at least
one spinlock_irqsave()/unlock pair needed, sometimes two, just to
get an uncontended sleeping bit lock.

So as a fast path operation that requires lock scalability, it's
going to be better to use a straight spinlock that doesn't require
irq safety as it's far less expensive than a sleeping bit lock.

Whether CONFIG_PREEMPT_RT changes that equation at all is not at all
clear to me, and so I'll leave that consideration to RT people if
they see a need to address it. In the mean time, we need to use an
external spinlock for lockdep validation so it really doesn't make
any sense at all to add a third locking variant with completely
different semantics just for PREEMPT_RT...

-Dave.
--
Dave Chinner
[email protected]

2023-12-07 04:59:03

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 08/11] vfs: inode cache conversion to hash-bl

On Wed, Dec 06, 2023 at 05:05:37PM +1100, Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> Scalability of the global inode_hash_lock really sucks for
> filesystems that use the vfs inode cache (i.e. everything but XFS).

Ages ago, we talked about (and I attempted, but ended up swearing at
inode lifetime rules) - conversion to rhashtable instead, which I still
believe would be preferable since that code is fully lockless (and
resizeable, of course). But it turned out to be a much bigger project...

But IIRC the bulk of the work was going to be "clean up inode
refcounting/lifetime rules into something sane/modern" - maybe we could
leave some breadcrumbs/comments in fs/inode.c for what that would take,
if/when someone else is sufficiently motivated?

> threads vanilla patched vanilla patched
> 2 7.923 7.358 8.003 7.276
> 4 8.152 7.530 9.097 8.506
> 8 13.090 7.871 11.752 10.015
> 16 24.602 9.540 24.614 13.989
> 32 49.536 19.314 49.179 25.982

nice

> The big wins here are at >= 8 threads, with both filesytsems now
> being limited by internal filesystem algorithms, not the VFS inode
> cache scalability.
>
> Ext4 contention moves to the buffer cache on directory block
> lookups:
>
> - 66.45% 0.44% [kernel] [k] __ext4_read_dirblock
> - 66.01% __ext4_read_dirblock
> - 66.01% ext4_bread
> - ext4_getblk
> - 64.77% bdev_getblk
> - 64.69% __find_get_block
> - 63.01% _raw_spin_lock
> - 62.96% do_raw_spin_lock
> 59.21% __pv_queued_spin_lock_slowpath
>
> bcachefs contention moves to internal btree traversal locks.
>
> - 95.37% __lookup_slow
> - 93.95% bch2_lookup
> - 82.57% bch2_vfs_inode_get
> - 65.44% bch2_inode_find_by_inum_trans
> - 65.41% bch2_inode_peek_nowarn
> - 64.60% bch2_btree_iter_peek_slot
> - 64.55% bch2_btree_path_traverse_one
> - bch2_btree_path_traverse_cached
> - 63.02% bch2_btree_path_traverse_cached_slowpath
> - 56.60% mutex_lock

dlist-lock ought to be perfect for solving this one

Reviewed-by: Kent Overstreet <[email protected]>

2023-12-07 04:59:26

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 03/11] vfs: Use dlock list for superblock's inode list

On Thu, Dec 07, 2023 at 02:40:24AM +0000, Al Viro wrote:
> On Wed, Dec 06, 2023 at 05:05:32PM +1100, Dave Chinner wrote:
>
> > @@ -303,6 +303,7 @@ static void destroy_unused_super(struct super_block *s)
> > super_unlock_excl(s);
> > list_lru_destroy(&s->s_dentry_lru);
> > list_lru_destroy(&s->s_inode_lru);
> > + free_dlock_list_heads(&s->s_inodes);
> > security_sb_free(s);
> > put_user_ns(s->s_user_ns);
> > kfree(s->s_subtype);
>
> Umm... Who's going to do that on normal umount?

Huh. So neither KASAN nor kmemleak has told me that s->s-inodes was
being leaked. I'm guessing a rebase sometime in the past silently
dropped a critical hunk from deactivate_locked_super() in the bit
bucket, but as nothing since whenever that happened has failed or
flagged a memory leak I didn't notice.

Such great tools we have......

-Dave.
--
Dave Chinner
[email protected]

2023-12-07 05:04:11

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 03/11] vfs: Use dlock list for superblock's inode list

On Thu, Dec 07, 2023 at 03:59:10PM +1100, Dave Chinner wrote:
> On Thu, Dec 07, 2023 at 02:40:24AM +0000, Al Viro wrote:
> > On Wed, Dec 06, 2023 at 05:05:32PM +1100, Dave Chinner wrote:
> >
> > > @@ -303,6 +303,7 @@ static void destroy_unused_super(struct super_block *s)
> > > super_unlock_excl(s);
> > > list_lru_destroy(&s->s_dentry_lru);
> > > list_lru_destroy(&s->s_inode_lru);
> > > + free_dlock_list_heads(&s->s_inodes);
> > > security_sb_free(s);
> > > put_user_ns(s->s_user_ns);
> > > kfree(s->s_subtype);
> >
> > Umm... Who's going to do that on normal umount?
>
> Huh. So neither KASAN nor kmemleak has told me that s->s-inodes was
> being leaked. I'm guessing a rebase sometime in the past silently
> dropped a critical hunk from deactivate_locked_super() in the bit
> bucket, but as nothing since whenever that happened has failed or
> flagged a memory leak I didn't notice.
>
> Such great tools we have......

kmemleak has always seemed flakey to me (as one would expect, it's
difficult code to test). If we can ever get memory allocation profiling
merged - it won't be a direct replacement but it'll be more reliable.

2023-12-07 06:03:35

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 08/11] vfs: inode cache conversion to hash-bl

On Wed, Dec 06, 2023 at 11:58:44PM -0500, Kent Overstreet wrote:
> On Wed, Dec 06, 2023 at 05:05:37PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <[email protected]>
> >
> > Scalability of the global inode_hash_lock really sucks for
> > filesystems that use the vfs inode cache (i.e. everything but XFS).
>
> Ages ago, we talked about (and I attempted, but ended up swearing at
> inode lifetime rules) - conversion to rhashtable instead, which I still
> believe would be preferable since that code is fully lockless (and
> resizeable, of course). But it turned out to be a much bigger project...

I don't think that the size of the has table is a big issue at the
moment. We already have RCU lookups for the inode cache
(find_inode_rcu() and find_inode_by_ino_rcu()) even before this
patchset, so we don't need rhashtable for that.

We still have to prevent duplicate inodes from being added to the cache
due to racing inserts, so I think we still need some form of
serialisation on the "lookup miss+insert" side. I've not thought
about it further than that - the hash-bl removes the existing
VFS contention points and the limitations move to
filesystem internal algorithms once again.

So until the filesystems can scale to much larger thread counts and
put the pressure back on the VFS inode cache scalability, I
don't see any need to try to do anything more complex or smarter...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-12-07 06:42:43

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 08/11] vfs: inode cache conversion to hash-bl

On Wed, Dec 06, 2023 at 05:05:37PM +1100, Dave Chinner wrote:

> + /*
> + * There are some callers that come through here without synchronisation
> + * and potentially with multiple references to the inode. Hence we have
> + * to handle the case that we might race with a remove and insert to a
> + * different list. Coda, in particular, seems to have a userspace API
> + * that can directly trigger "unhash/rehash to different list" behaviour
> + * without any serialisation at all.
> + *
> + * Hence we have to handle the situation where the inode->i_hash_head
> + * might point to a different list than what we expect, indicating that
> + * we raced with another unhash and potentially a new insertion. This
> + * means we have to retest the head once we have everything locked up
> + * and loop again if it doesn't match.
> + */

coda_replace_fid() is an old headache, but it's thankfully unique - nobody else
does that kind of shit (just rechecked).

Note that coda_replace_fid() is not going to have the sucker racing with
removal from another source, and I'm 100% sure that they really want
some serialization for handling those requests.

remove_inode_hash() is misused there - "in the middle of hash key change"
is not the same state as "unhashed".

Any races between insert and unhash are bugs, not something to support.

2023-12-07 17:09:02

by Kent Overstreet

[permalink] [raw]
Subject: Re: [PATCH 0/11] vfs: inode cache scalability improvements

On Wed, Dec 06, 2023 at 05:05:29PM +1100, Dave Chinner wrote:
...o
> Git tree containing this series can be pulled from:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs.git vfs-scale

For the series:

Tested-by: Kent Overstreet <[email protected]>