I think the appropriate blurb which needs to land here also needs to be
in the commit message for the first patch, so here it is, copy-pasted
with some modifications at the end:
[quote]
Instantiating a new inode normally takes the global inode hash lock
twice:
1. once to check if it happens to already be present
2. once to add it to the hash
The back-to-back lock/unlock pattern is known to degrade performance
significantly, which is further exacerbated if the hash is heavily
populated (long chains to walk, extending hold time). Arguably hash
sizing and hashing algo need to be revisited, but that's beyond the
scope of this patch.
A long-term fix would introduce finer-grained locking. Attempts have
been made several times, most recently in [1], but the effort appears
stalled.
A simpler idea which solves the majority of the problem, and which may
be good enough for the time being, is to use RCU for the initial lookup.
Basic RCU support is already present in the hash. This being a temporary
measure, I tried to keep the change as small as possible.
iget_locked consumers (notably ext4) get away without any changes
because the inode comparison method is built-in.
iget5_locked and ilookup5_nowait consumers pass a custom callback. Since
the removal of locking adds a new problem (the inode can be changing),
it's not safe to assume all filesystems happen to cope. Thus
iget5_locked_rcu and ilookup5_nowait_rcu get added, requiring manual
conversion.
In order to reduce code duplication, find_inode and find_inode_fast grow
an argument indicating whether the inode hash lock is held, which is
passed down should sleeping be necessary. They always rcu_read_lock,
which is redundant but harmless. Doing it conditionally reduces
readability for no real gain that I can see. RCU-like restrictions were
already put on callbacks due to the hash spinlock being held.
There is a real cache-busting workload scanning millions of files in
parallel (it's a backup server thing), where the initial lookup is
guaranteed to fail, resulting in the 2 lock acquires.
Implemented below is a synthetic benchmark which provides the same
behavior. [I shall note the workload is not running on Linux; instead it
was causing trouble elsewhere. The benchmark below was used while
addressing said problems and was found to adequately represent the real
workload.]
Total real time fluctuates by 1-2s.
With 20 threads each walking a dedicated 1000 dirs * 1000 files
directory tree to stat(2) on a 32 core + 24GB RAM vm:
[/quote]
Specific results:
ext4 (needed mkfs.ext4 -N 24000000):
before: 3.77s user 890.90s system 1939% cpu 46.118 total
after: 3.24s user 397.73s system 1858% cpu 21.581 total (-53%)
btrfs (s/iget5_locked/iget5_locked_rcu in fs/btrfs/inode.c):
before: 3.54s user 892.30s system 1966% cpu 45.549 total
after: 3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)
btrfs bottlenecks itself on its own locks here.
Benchmark can be found here: https://people.freebsd.org/~mjg/fstree.tgz
Previously I indicated I wanted to patch bcachefs, but I ran into bugs
with memory reclaim. To my understanding, addressing them is a WIP. More
importantly though, I brought up the patch with Kent and we had a little
back and forth. We both agree the entire inode hash situation needs
significant changes, but neither of us is signing up. :] The bottom line
though is that his fs is probably going to get a local rhashtable, so
that fs is out of the picture as far as this patch goes.
[1] https://lore.kernel.org/all/[email protected]/
v2:
- add argument lists to new routines
- assert the inode hash lock is not held as applicable
- real btrfs patch included
Mateusz Guzik (2):
vfs: add rcu-based find_inode variants for iget ops
btrfs: use iget5_locked_rcu
fs/btrfs/inode.c | 2 +-
fs/inode.c | 119 ++++++++++++++++++++++++++++++++++++++-------
include/linux/fs.h | 10 +++-
3 files changed, 112 insertions(+), 19 deletions(-)
--
2.43.0
Instantiating a new inode normally takes the global inode hash lock
twice:
1. once to check if it happens to already be present
2. once to add it to the hash
The back-to-back lock/unlock pattern is known to degrade performance
significantly, which is further exacerbated if the hash is heavily
populated (long chains to walk, extending hold time). Arguably hash
sizing and hashing algo need to be revisited, but that's beyond the
scope of this patch.
A long-term fix would introduce finer-grained locking. Attempts have
been made several times, most recently in [1], but the effort appears
stalled.
A simpler idea which solves the majority of the problem, and which may
be good enough for the time being, is to use RCU for the initial lookup.
Basic RCU support is already present in the hash. This being a temporary
measure, I tried to keep the change as small as possible.
iget_locked consumers (notably ext4) get away without any changes
because the inode comparison method is built-in.
iget5_locked and ilookup5_nowait consumers pass a custom callback. Since
the removal of locking adds a new problem (the inode can be changing),
it's not safe to assume all filesystems happen to cope. Thus
iget5_locked_rcu and ilookup5_nowait_rcu get added, requiring manual
conversion.
In order to reduce code duplication, find_inode and find_inode_fast grow
an argument indicating whether the inode hash lock is held, which is
passed down should sleeping be necessary. They always rcu_read_lock,
which is redundant but harmless. Doing it conditionally reduces
readability for no real gain that I can see. RCU-like restrictions were
already put on callbacks due to the hash spinlock being held.
There is a real cache-busting workload scanning millions of files in
parallel (it's a backup server thing), where the initial lookup is
guaranteed to fail, resulting in the 2 lock acquires.
Implemented below is a synthetic benchmark which provides the same
behavior. [I shall note the workload is not running on Linux; instead it
was causing trouble elsewhere. The benchmark below was used while
addressing said problems and was found to adequately represent the real
workload.]
Total real time fluctuates by 1-2s.
With 20 threads each walking a dedicated 1000 dirs * 1000 files
directory tree to stat(2) on a 32 core + 24GB RAM vm:
ext4 (needed mkfs.ext4 -N 24000000):
before: 3.77s user 890.90s system 1939% cpu 46.118 total
after: 3.24s user 397.73s system 1858% cpu 21.581 total (-53%)
Benchmark can be found here: https://people.freebsd.org/~mjg/fstree.tgz
[1] https://lore.kernel.org/all/[email protected]/
Signed-off-by: Mateusz Guzik <[email protected]>
---
fs/inode.c | 119 ++++++++++++++++++++++++++++++++++++++-------
include/linux/fs.h | 10 +++-
2 files changed, 111 insertions(+), 18 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index 3a41f83a4ba5..149adf8ab0ea 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -886,36 +886,45 @@ long prune_icache_sb(struct super_block *sb, struct shrink_control *sc)
return freed;
}
-static void __wait_on_freeing_inode(struct inode *inode);
+static void __wait_on_freeing_inode(struct inode *inode, bool locked);
/*
* Called with the inode lock held.
*/
static struct inode *find_inode(struct super_block *sb,
struct hlist_head *head,
int (*test)(struct inode *, void *),
- void *data)
+ void *data, bool locked)
{
struct inode *inode = NULL;
+ if (locked)
+ lockdep_assert_held(&inode_hash_lock);
+ else
+ lockdep_assert_not_held(&inode_hash_lock);
+
+ rcu_read_lock();
repeat:
- hlist_for_each_entry(inode, head, i_hash) {
+ hlist_for_each_entry_rcu(inode, head, i_hash) {
if (inode->i_sb != sb)
continue;
if (!test(inode, data))
continue;
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
- __wait_on_freeing_inode(inode);
+ __wait_on_freeing_inode(inode, locked);
goto repeat;
}
if (unlikely(inode->i_state & I_CREATING)) {
spin_unlock(&inode->i_lock);
+ rcu_read_unlock();
return ERR_PTR(-ESTALE);
}
__iget(inode);
spin_unlock(&inode->i_lock);
+ rcu_read_unlock();
return inode;
}
+ rcu_read_unlock();
return NULL;
}
@@ -924,29 +933,39 @@ static struct inode *find_inode(struct super_block *sb,
* iget_locked for details.
*/
static struct inode *find_inode_fast(struct super_block *sb,
- struct hlist_head *head, unsigned long ino)
+ struct hlist_head *head, unsigned long ino,
+ bool locked)
{
struct inode *inode = NULL;
+ if (locked)
+ lockdep_assert_held(&inode_hash_lock);
+ else
+ lockdep_assert_not_held(&inode_hash_lock);
+
+ rcu_read_lock();
repeat:
- hlist_for_each_entry(inode, head, i_hash) {
+ hlist_for_each_entry_rcu(inode, head, i_hash) {
if (inode->i_ino != ino)
continue;
if (inode->i_sb != sb)
continue;
spin_lock(&inode->i_lock);
if (inode->i_state & (I_FREEING|I_WILL_FREE)) {
- __wait_on_freeing_inode(inode);
+ __wait_on_freeing_inode(inode, locked);
goto repeat;
}
if (unlikely(inode->i_state & I_CREATING)) {
spin_unlock(&inode->i_lock);
+ rcu_read_unlock();
return ERR_PTR(-ESTALE);
}
__iget(inode);
spin_unlock(&inode->i_lock);
+ rcu_read_unlock();
return inode;
}
+ rcu_read_unlock();
return NULL;
}
@@ -1161,7 +1180,7 @@ struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
again:
spin_lock(&inode_hash_lock);
- old = find_inode(inode->i_sb, head, test, data);
+ old = find_inode(inode->i_sb, head, test, data, true);
if (unlikely(old)) {
/*
* Uhhuh, somebody else created the same inode under us.
@@ -1245,6 +1264,48 @@ struct inode *iget5_locked(struct super_block *sb, unsigned long hashval,
}
EXPORT_SYMBOL(iget5_locked);
+/**
+ * iget5_locked_rcu - obtain an inode from a mounted file system
+ * @sb: super block of file system
+ * @hashval: hash value (usually inode number) to get
+ * @test: callback used for comparisons between inodes
+ * @set: callback used to initialize a new struct inode
+ * @data: opaque data pointer to pass to @test and @set
+ *
+ * This is equivalent to iget5_locked, except the @test callback must
+ * tolerate inode not being stable, including being mid-teardown.
+ */
+struct inode *iget5_locked_rcu(struct super_block *sb, unsigned long hashval,
+ int (*test)(struct inode *, void *),
+ int (*set)(struct inode *, void *), void *data)
+{
+ struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+ struct inode *inode, *new;
+
+again:
+ inode = find_inode(sb, head, test, data, false);
+ if (inode) {
+ if (IS_ERR(inode))
+ return NULL;
+ wait_on_inode(inode);
+ if (unlikely(inode_unhashed(inode))) {
+ iput(inode);
+ goto again;
+ }
+ return inode;
+ }
+
+ new = alloc_inode(sb);
+ if (new) {
+ new->i_state = 0;
+ inode = inode_insert5(new, hashval, test, set, data);
+ if (unlikely(inode != new))
+ destroy_inode(new);
+ }
+ return inode;
+}
+EXPORT_SYMBOL(iget5_locked_rcu);
+
/**
* iget_locked - obtain an inode from a mounted file system
* @sb: super block of file system
@@ -1263,9 +1324,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
struct hlist_head *head = inode_hashtable + hash(sb, ino);
struct inode *inode;
again:
- spin_lock(&inode_hash_lock);
- inode = find_inode_fast(sb, head, ino);
- spin_unlock(&inode_hash_lock);
+ inode = find_inode_fast(sb, head, ino, false);
if (inode) {
if (IS_ERR(inode))
return NULL;
@@ -1283,7 +1342,7 @@ struct inode *iget_locked(struct super_block *sb, unsigned long ino)
spin_lock(&inode_hash_lock);
/* We released the lock, so.. */
- old = find_inode_fast(sb, head, ino);
+ old = find_inode_fast(sb, head, ino, true);
if (!old) {
inode->i_ino = ino;
spin_lock(&inode->i_lock);
@@ -1419,13 +1478,35 @@ struct inode *ilookup5_nowait(struct super_block *sb, unsigned long hashval,
struct inode *inode;
spin_lock(&inode_hash_lock);
- inode = find_inode(sb, head, test, data);
+ inode = find_inode(sb, head, test, data, true);
spin_unlock(&inode_hash_lock);
return IS_ERR(inode) ? NULL : inode;
}
EXPORT_SYMBOL(ilookup5_nowait);
+/**
+ * ilookup5_nowait_rcu - search for an inode in the inode cache
+ * @sb: super block of file system to search
+ * @hashval: hash value (usually inode number) to search for
+ * @test: callback used for comparisons between inodes
+ * @data: opaque data pointer to pass to @test
+ *
+ * This is equivalent to ilookup5_nowait, except the @test callback must
+ * tolerate inode not being stable, including being mid-teardown.
+ */
+struct inode *ilookup5_nowait_rcu(struct super_block *sb, unsigned long hashval,
+ int (*test)(struct inode *, void *), void *data)
+{
+ struct hlist_head *head = inode_hashtable + hash(sb, hashval);
+ struct inode *inode;
+
+ inode = find_inode(sb, head, test, data, false);
+
+ return IS_ERR(inode) ? NULL : inode;
+}
+EXPORT_SYMBOL(ilookup5_nowait_rcu);
+
/**
* ilookup5 - search for an inode in the inode cache
* @sb: super block of file system to search
@@ -1474,7 +1555,7 @@ struct inode *ilookup(struct super_block *sb, unsigned long ino)
struct inode *inode;
again:
spin_lock(&inode_hash_lock);
- inode = find_inode_fast(sb, head, ino);
+ inode = find_inode_fast(sb, head, ino, true);
spin_unlock(&inode_hash_lock);
if (inode) {
@@ -2235,17 +2316,21 @@ EXPORT_SYMBOL(inode_needs_sync);
* wake_up_bit(&inode->i_state, __I_NEW) after removing from the hash list
* will DTRT.
*/
-static void __wait_on_freeing_inode(struct inode *inode)
+static void __wait_on_freeing_inode(struct inode *inode, bool locked)
{
wait_queue_head_t *wq;
DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
wq = bit_waitqueue(&inode->i_state, __I_NEW);
prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
spin_unlock(&inode->i_lock);
- spin_unlock(&inode_hash_lock);
+ rcu_read_unlock();
+ if (locked)
+ spin_unlock(&inode_hash_lock);
schedule();
finish_wait(wq, &wait.wq_entry);
- spin_lock(&inode_hash_lock);
+ if (locked)
+ spin_lock(&inode_hash_lock);
+ rcu_read_lock();
}
static __initdata unsigned long ihash_entries;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index bfc1e6407bf6..65cf1b00d8c0 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3037,6 +3037,9 @@ extern void d_mark_dontcache(struct inode *inode);
extern struct inode *ilookup5_nowait(struct super_block *sb,
unsigned long hashval, int (*test)(struct inode *, void *),
void *data);
+extern struct inode *ilookup5_nowait_rcu(struct super_block *sb,
+ unsigned long hashval, int (*test)(struct inode *, void *),
+ void *data);
extern struct inode *ilookup5(struct super_block *sb, unsigned long hashval,
int (*test)(struct inode *, void *), void *data);
extern struct inode *ilookup(struct super_block *sb, unsigned long ino);
@@ -3045,7 +3048,12 @@ extern struct inode *inode_insert5(struct inode *inode, unsigned long hashval,
int (*test)(struct inode *, void *),
int (*set)(struct inode *, void *),
void *data);
-extern struct inode * iget5_locked(struct super_block *, unsigned long, int (*test)(struct inode *, void *), int (*set)(struct inode *, void *), void *);
+extern struct inode * iget5_locked(struct super_block *, unsigned long,
+ int (*test)(struct inode *, void *),
+ int (*set)(struct inode *, void *), void *);
+extern struct inode * iget5_locked_rcu(struct super_block *, unsigned long,
+ int (*test)(struct inode *, void *),
+ int (*set)(struct inode *, void *), void *);
extern struct inode * iget_locked(struct super_block *, unsigned long);
extern struct inode *find_inode_nowait(struct super_block *,
unsigned long,
--
2.43.0
With 20 threads each walking a dedicated 1000 dirs * 1000 files
directory tree to stat(2) on a 32 core + 24GB RAM vm:
before: 3.54s user 892.30s system 1966% cpu 45.549 total
after: 3.28s user 738.66s system 1955% cpu 37.932 total (-16.7%)
Benchmark can be found here: https://people.freebsd.org/~mjg/fstree.tgz
Signed-off-by: Mateusz Guzik <[email protected]>
---
fs/btrfs/inode.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
index 4883cb512379..457d2c18d071 100644
--- a/fs/btrfs/inode.c
+++ b/fs/btrfs/inode.c
@@ -5588,7 +5588,7 @@ static struct inode *btrfs_iget_locked(struct super_block *s, u64 ino,
args.ino = ino;
args.root = root;
- inode = iget5_locked(s, hashval, btrfs_find_actor,
+ inode = iget5_locked_rcu(s, hashval, btrfs_find_actor,
btrfs_init_locked_inode,
(void *)&args);
return inode;
--
2.43.0
On Tue, Jun 11, 2024 at 6:59 AM Christoph Hellwig <[email protected]> wrote:
>
> > +EXPORT_SYMBOL(iget5_locked_rcu);
>
> EXPORT_SYMBOL_GPL for rcu APIs.
>
noted for v3, thanks
> > +static void __wait_on_freeing_inode(struct inode *inode, bool locked)
> > {
> > wait_queue_head_t *wq;
> > DEFINE_WAIT_BIT(wait, &inode->i_state, __I_NEW);
> > wq = bit_waitqueue(&inode->i_state, __I_NEW);
> > prepare_to_wait(wq, &wait.wq_entry, TASK_UNINTERRUPTIBLE);
> > spin_unlock(&inode->i_lock);
> > - spin_unlock(&inode_hash_lock);
> > + rcu_read_unlock();
> > + if (locked)
> > + spin_unlock(&inode_hash_lock);
>
> The conditional locking here is going to make sparse rather unhappy.
> Please try to find a way to at least annotate it, or maybe find
> another way around it, like leaving the schedule and finish_wait
> in the callers.
>
So I tried out sparse on my patch vs fs-next and found it emits the
same warnings.
fs/inode.c:846:17: warning: context imbalance in 'inode_lru_isolate' -
unexpected unlock
fs/inode.c:901:9: warning: context imbalance in 'find_inode' -
different lock contexts for basic block
fs/inode.c:932:9: warning: context imbalance in 'find_inode_fast' -
different lock contexts for basic block
fs/inode.c:1621:5: warning: context imbalance in 'insert_inode_locked'
- wrong count at exit
fs/inode.c:1739:20: warning: context imbalance in 'iput_final' -
unexpected unlock
fs/inode.c:1753:6: warning: context imbalance in 'iput' - wrong count at exit
fs/inode.c:2238:13: warning: context imbalance in
'__wait_on_freeing_inode' - unexpected unlock
The patch does not make things *worse*, so I don't think messing with
the code is warranted here.
> > +extern struct inode *ilookup5_nowait_rcu(struct super_block *sb,
> > + unsigned long hashval, int (*test)(struct inode *, void *),
> > + void *data);
>
> No need for the extern here (or down below).
>
I agree, but this is me just copying and modifying an existing line.
include/linux/fs.h is chock full of extern-prefixed func declarations;
on top of that, some name the arguments while the rest do not.
Someone(tm) should definitely clean it up, but I'm not interested in
bikeshedding about it.
--
Mateusz Guzik <mjguzik gmail.com>
On Tue, Jun 11, 2024 at 7:20 AM Christoph Hellwig <[email protected]> wrote:
>
> On Tue, Jun 11, 2024 at 07:17:41AM +0200, Mateusz Guzik wrote:
> >
> > I agree, but this is me just copying and modifying an existing line.
> >
> > include/linux/fs.h is chock full of extern-prefixed func declarations,
> > on top of that some name the arguments while the rest does not.
> >
> > Someone(tm) should definitely clean it up, but I'm not interested in
> > bikeshedding about it.
>
> No one asked you to clean the mess up, but don't add any in changed
> or added lines. Without that:
>
> Nacked-by: Christoph Hellwig <[email protected]>
>
I am not saying you told me to clean the file up.
I am explaining why the addition looks the way it does: I am keeping
it consistent with the surroundings, which I assumed is the preferred
way.
That said, whichever way this is spelled out is fine with me, so if the
vfs overlords want 'extern' removed I will make the change, no problem.
(Or, based on what I have seen from Christian so far, he will touch it
up himself, which is again fine with me.)
--
Mateusz Guzik <mjguzik gmail.com>
On Tue, Jun 11, 2024 at 07:36:46AM +0200, Mateusz Guzik wrote:
> I am explaining why the addition looks the way it does: I am keeping
> it consistent with the surroundings, which I assumed is the preferred
> way.
It is not, and it's clearly documented that it isn't.