2022-10-13 23:50:09

by Stephen Brennan

Subject: [RFC] fsnotify: allow sleepable child dentry flag update

Hi Jan, Amir, Al - this is a quite ugly patch but I want to discuss the idea
behind it, to see whether we can find something workable. I apologize for the
length of text here, but I think it's necessary to give full context and ideas.

For background, on machines with lots of memory and weird workloads,
__fsnotify_update_child_dentry_flags() has been a real thorn in our side. It
grabs a couple spinlocks and iterates over the whole d_subdirs list. If that
list is long, this can take a while. The list can be long due to lots of
negative dentries (which can easily number in the hundreds of millions if you
have a process that's relatively frequently looking up nonexisting files). But
the list can also be long due to *positive* dentries. I've seen directories with
~7 million positive dentry children falling victim to this function before (XFS
allows lots of files per dir)! Positive dentries take longer to process in this
function (since they're actually locked and written to), so you don't need as
many for them to be a problem.

Anyway, if you have a huge d_subdirs list, then you can have problems with soft
lockups. From my measurements with ftrace, 100 million negative dentries means
that the function takes about 6 seconds to complete (varies wildly by CPU and
kernel config/version). That's bad, but it can get *much worse*. Imagine that
there are many frequently accessed files in such a directory, and you have an
inotify watch. As soon as that watch is removed, the last fsnotify connector
goes away, and i_fsnotify_mask becomes 0. System calls accessing dentries still
see DCACHE_FSNOTIFY_PARENT_WATCHED, so they fall into __fsnotify_parent and will
try to update the dentry flags. In my experience, a thundering herd of CPUs race
to __fsnotify_update_child_dentry_flags(). The winner begins updating and the
rest spin waiting for the parent inode's i_lock. Many CPUs make it to that
point, and they *all* will proceed to iterate through d_subdirs, regardless of
how long the list is, even though only the first CPU needed to do it. So now
your 6 second spin gets multiplied by 5-10. And since the directory is
frequently accessed, all the dget/dputs from other CPUs will all spin for this
long time. This amounts to a nearly unusable system.

Previously I've tried to generally limit or manage the number of negative
dentries in the dcache, which as a general problem is very difficult to get
consensus on. I've also tried the patches to reorder dentries in d_subdirs so
negative dentries are at the end, which has some difficult issues interacting
with d_walk. Neither of those ideas would help for a directory full of positive
dentries either.

So I have two more narrowly scoped strategies to improve the situation. Both are
included in the hacky, awful patch below.

First, is to let __fsnotify_update_child_dentry_flags() sleep. This means nobody
is holding the spinlock for several seconds at a time. We can actually achieve
this via a cursor, the same way that simple_readdir() is implemented. I think
this might require moving the declaration of d_alloc_cursor() and maybe
exporting it. I had to #include fs/internal.h which is not ok.

On its own, that actually makes problems worse, because it allows several tasks
to update at the same time, and they're constantly locking/unlocking, which
makes contention worse.

So second is to add an inode flag saying that
__fsnotify_update_child_dentry_flags() is already in progress. This prevents
concurrent execution, and it allows the caller to skip updating since they know
it's being handled, so it eliminates the thundering herd problem.

The patch works great! It eliminates the chances of soft lockups and makes the
system responsive under those weird loads. But now, I know I've added a new
issue. Updating dentry flags is no longer atomic, and we've lost the guarantee
that after fsnotify_recalc_mask(), child dentries are all flagged when
necessary. It's possible that __fsnotify_update_child_dentry_flags() will
skip executing because it sees an update already in progress, and inotify_add_watch()
would return without the watch being fully ready.

I think the approach can still be salvaged, we just need a way to resolve this.
EG a wait queue or mutex in the connector would allow us to preserve the
guarantee that the child dentries are flagged when necessary. But I know that's
a big addition, so I wanted to get some feedback from you as the maintainers. Is
the strategy here stupid? Am I missing an easier option? Is some added
synchronization in the connector workable?

Thanks!
Stephen

---

You may find it useful to create negative dentries with [1] while using this
patch. Create 100 million dentries as follows. Then use [2] to create userspace
load that's accessing files in the same directory. Finally, inotifywait will
setup a watch, terminate after one event, and tear it down. Without the patch,
the thundering herd of userspace tasks all line up to update the flags, and
lockup the system.

./negdentcreate -p /tmp/dir -c 100000000 -t 4
touch /tmp/dir/file{1..10}
for i in {1..10}; do ./openclose /tmp/dir/file$i & done
inotifywait /tmp/dir

[1] https://github.com/brenns10/kernel_stuff/tree/master/negdentcreate
[2] https://github.com/brenns10/kernel_stuff/tree/master/openclose

fs/notify/fsnotify.c | 96 ++++++++++++++++++++++++++++++++++----------
include/linux/fs.h | 4 ++
2 files changed, 78 insertions(+), 22 deletions(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 7974e91ffe13..d74949c01a29 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -13,6 +13,7 @@

#include <linux/fsnotify_backend.h>
#include "fsnotify.h"
+#include "../internal.h"

/*
* Clear all of the marks on an inode when it is being evicted from core
@@ -102,42 +103,93 @@ void fsnotify_sb_delete(struct super_block *sb)
* on a child we run all of our children and set a dentry flag saying that the
* parent cares. Thus when an event happens on a child it can quickly tell
* if there is a need to find a parent and send the event to the parent.
+ *
+ * Some directories may contain too many entries, making iterating through the
+ * parent list slow enough to cause soft lockups. So, we use a cursor -- just as
+ * simple_readdir() does -- which allows us to drop the locks, sleep, and pick
+ * up where we left off. To maintain mutual exclusion we set an inode state
+ * flag.
*/
void __fsnotify_update_child_dentry_flags(struct inode *inode)
{
- struct dentry *alias;
- int watched;
+ struct dentry *alias, *child, *cursor = NULL;
+ const int BATCH_SIZE = 50000;
+ int watched, prev_watched, batch_remaining = BATCH_SIZE;
+ struct list_head *p;

if (!S_ISDIR(inode->i_mode))
return;

- /* determine if the children should tell inode about their events */
- watched = fsnotify_inode_watches_children(inode);
+ /* Do a quick check while inode is unlocked. This avoids unnecessary
+ * contention on i_lock. */
+ if (inode->i_state & I_FSNOTIFY_UPDATING)
+ return;

spin_lock(&inode->i_lock);
+
+ if (inode->i_state & I_FSNOTIFY_UPDATING) {
+ spin_unlock(&inode->i_lock);
+ return;
+ } else {
+ inode->i_state |= I_FSNOTIFY_UPDATING;
+ }
+
/* run all of the dentries associated with this inode. Since this is a
- * directory, there damn well better only be one item on this list */
- hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
- struct dentry *child;
-
- /* run all of the children of the original inode and fix their
- * d_flags to indicate parental interest (their parent is the
- * original inode) */
- spin_lock(&alias->d_lock);
- list_for_each_entry(child, &alias->d_subdirs, d_child) {
- if (!child->d_inode)
- continue;
+ * directory, there damn well better be exactly one item on this list */
+ BUG_ON(!inode->i_dentry.first);
+ BUG_ON(inode->i_dentry.first->next);
+ alias = container_of(inode->i_dentry.first, struct dentry, d_u.d_alias);
+
+ cursor = d_alloc_cursor(alias);
+ if (!cursor) {
+ inode->i_state &= ~I_FSNOTIFY_UPDATING;
+ spin_unlock(&inode->i_lock);
+ return; // TODO -ENOMEM
+ }

- spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
- if (watched)
- child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
- else
- child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
- spin_unlock(&child->d_lock);
+ /* determine if the children should tell inode about their events */
+ watched = fsnotify_inode_watches_children(inode);
+ spin_lock(&alias->d_lock);
+ p = alias->d_subdirs.next;
+
+ while (p != &alias->d_subdirs) {
+ if (batch_remaining-- <= 0 && need_resched()) {
+ batch_remaining = BATCH_SIZE;
+ list_move(&cursor->d_child, p);
+ p = &cursor->d_child;
+ spin_unlock(&alias->d_lock);
+ spin_unlock(&inode->i_lock);
+ cond_resched();
+ spin_lock(&inode->i_lock);
+ spin_lock(&alias->d_lock);
+ prev_watched = watched;
+ watched = fsnotify_inode_watches_children(inode);
+ if (watched != prev_watched) {
+ /* Somebody else is messing with the flags,
+ * bail and let them handle it. */
+ goto out;
+ }
+ /* TODO: is it possible that i_dentry list is changed? */
}
- spin_unlock(&alias->d_lock);
+ child = list_entry(p, struct dentry, d_child);
+ p = p->next;
+
+ if (!child->d_inode || child->d_flags & DCACHE_DENTRY_CURSOR)
+ continue;
+
+ spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
+ if (watched)
+ child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
+ else
+ child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
+ spin_unlock(&child->d_lock);
}
+
+out:
+ inode->i_state &= ~I_FSNOTIFY_UPDATING;
+ spin_unlock(&alias->d_lock);
spin_unlock(&inode->i_lock);
+ dput(cursor);
}

/* Are inode/sb/mount interested in parent and name info with this event? */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e654435f1651..f66aab9f792a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2433,6 +2433,9 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
*
* I_PINNING_FSCACHE_WB Inode is pinning an fscache object for writeback.
*
+ * I_FSNOTIFY_UPDATING Used by fsnotify to avoid duplicated work when updating
+ * child dentry flag DCACHE_FSNOTIFY_PARENT_WATCHED.
+ *
* Q: What is the difference between I_WILL_FREE and I_FREEING?
*/
#define I_DIRTY_SYNC (1 << 0)
@@ -2456,6 +2459,7 @@ static inline void kiocb_clone(struct kiocb *kiocb, struct kiocb *kiocb_src,
#define I_DONTCACHE (1 << 16)
#define I_SYNC_QUEUED (1 << 17)
#define I_PINNING_FSCACHE_WB (1 << 18)
+#define I_FSNOTIFY_UPDATING (1 << 19)

#define I_DIRTY_INODE (I_DIRTY_SYNC | I_DIRTY_DATASYNC)
#define I_DIRTY (I_DIRTY_INODE | I_DIRTY_PAGES)
--
2.34.1


2022-10-14 00:14:00

by Al Viro

Subject: Re: [RFC] fsnotify: allow sleepable child dentry flag update

On Thu, Oct 13, 2022 at 03:27:19PM -0700, Stephen Brennan wrote:

> So I have two more narrowly scoped strategies to improve the situation. Both are
> included in the hacky, awful patch below.
>
> First, is to let __fsnotify_update_child_dentry_flags() sleep. This means nobody
> is holding the spinlock for several seconds at a time. We can actually achieve
> this via a cursor, the same way that simple_readdir() is implemented. I think
> this might require moving the declaration of d_alloc_cursor() and maybe
> exporting it. I had to #include fs/internal.h which is not ok.

Er... Won't that expose every filesystem to having to deal with cursors?
Currently it's entirely up to the filesystem in question and I wouldn't
bet a dime on everyone being ready to cope with such surprises...

2022-10-14 08:37:12

by Amir Goldstein

Subject: Re: [RFC] fsnotify: allow sleepable child dentry flag update

On Fri, Oct 14, 2022 at 1:27 AM Stephen Brennan
<[email protected]> wrote:
>
> Hi Jan, Amir, Al - this is a quite ugly patch but I want to discuss the idea
> behind it, to see whether we can find something workable. I apologize for the
> length of text here, but I think it's necessary to give full context and ideas.
>

Hi Stephen!

> For background, on machines with lots of memory and weird workloads,
> __fsnotify_update_child_dentry_flags() has been a real thorn in our side. It
> grabs a couple spinlocks and iterates over the whole d_subdirs list. If that
> list is long, this can take a while. The list can be long due to lots of
> negative dentries (which can easily number in the hundreds of millions if you
> have a process that's relatively frequently looking up nonexisting files). But
> the list can also be long due to *positive* dentries. I've seen directories with
> ~7 million positive dentry children falling victim to this function before (XFS
> allows lots of files per dir)! Positive dentries take longer to process in this
> function (since they're actually locked and written to), so you don't need as
> many for them to be a problem.
>
> Anyway, if you have a huge d_subdirs list, then you can have problems with soft
> lockups. From my measurements with ftrace, 100 million negative dentries means
> that the function takes about 6 seconds to complete (varies wildly by CPU and
> kernel config/version). That's bad, but it can get *much worse*. Imagine that
> there are many frequently accessed files in such a directory, and you have an
> inotify watch. As soon as that watch is removed, the last fsnotify connector
> goes away, and i_fsnotify_mask becomes 0. System calls accessing dentries still
> see DCACHE_FSNOTIFY_PARENT_WATCHED, so they fall into __fsnotify_parent and will
> try to update the dentry flags. In my experience, a thundering herd of CPUs race
> to __fsnotify_update_child_dentry_flags(). The winner begins updating and the
> rest spin waiting for the parent inode's i_lock. Many CPUs make it to that
> point, and they *all* will proceed to iterate through d_subdirs, regardless of
> how long the list is, even though only the first CPU needed to do it. So now
> your 6 second spin gets multiplied by 5-10. And since the directory is
> frequently accessed, all the dget/dputs from other CPUs will all spin for this
> long time. This amounts to a nearly unusable system.
>
> Previously I've tried to generally limit or manage the number of negative
> dentries in the dcache, which as a general problem is very difficult to get
> consensus on. I've also tried the patches to reorder dentries in d_subdirs so
> negative dentries are at the end, which has some difficult issues interacting
> with d_walk. Neither of those ideas would help for a directory full of positive
> dentries either.
>
> So I have two more narrowly scoped strategies to improve the situation. Both are
> included in the hacky, awful patch below.
>
> First, is to let __fsnotify_update_child_dentry_flags() sleep. This means nobody
> is holding the spinlock for several seconds at a time. We can actually achieve
> this via a cursor, the same way that simple_readdir() is implemented. I think
> this might require moving the declaration of d_alloc_cursor() and maybe
> exporting it. I had to #include fs/internal.h which is not ok.
>
> On its own, that actually makes problems worse, because it allows several tasks
> to update at the same time, and they're constantly locking/unlocking, which
> makes contention worse.
>
> So second is to add an inode flag saying that
> __fsnotify_update_child_dentry_flags() is already in progress. This prevents
> concurrent execution, and it allows the caller to skip updating since they know
> it's being handled, so it eliminates the thundering herd problem.
>
> The patch works great! It eliminates the chances of soft lockups and makes the
> system responsive under those weird loads. But now, I know I've added a new
> issue. Updating dentry flags is no longer atomic, and we've lost the guarantee

Just between us ;) the update of the inode event mask is not atomic anyway,
because the test for 'parent_watched' and fsnotify_inode_watches_children()
in __fsnotify_parent() are done without any memory access synchronization.

IOW, the only guarantee for users is that *sometime* after adding events
to a mark mask, events will start being delivered and *sometime* after
removing events from a mark mask, events will stop being delivered.
Some events may have implicit memory barriers that make event delivery
more deterministic, but others may not.

This may not be considered an issue for asynchronous events, but actually,
for permission events, I would like to fix that.
To understand my motivations you can look at:
https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Evicting_file_content

> that after fsnotify_recalc_mask(), child dentries are all flagged when
> necessary. It's possible that __fsnotify_update_child_dentry_flags() will
> skip executing because it sees an update already in progress, and inotify_add_watch()
> would return without the watch being fully ready.
>
> I think the approach can still be salvaged, we just need a way to resolve this.
> EG a wait queue or mutex in the connector would allow us to preserve the
> guarantee that the child dentries are flagged when necessary. But I know that's
> a big addition, so I wanted to get some feedback from you as the maintainers. Is
> the strategy here stupid? Am I missing an easier option?

I think you may be missing an easier option.

The call to __fsnotify_update_child_dentry_flags() in
__fsnotify_parent() is a very aggressive optimization
and I think it may be an overkill, and a footgun, according
to your analysis.

If only called from the context of fsnotify_recalc_mask()
(i.e. update mark mask), __fsnotify_update_child_dentry_flags()
can take the dir inode_lock() to synchronize.

I don't think that the dir inode spin lock needs to be held
at all during children iteration.

I think that d_find_any_alias() should be used to obtain
the alias with elevated refcount instead of the awkward
d_u.d_alias iteration loop.

In the context of __fsnotify_parent(), I think the optimization
should stick with updating the flags for the specific child dentry
that had the false positive parent_watched indication,
leaving the rest of the siblings alone.

Would that address the performance issues of your workload?

Jan,

Can you foresee any problems with this change?

Thanks,
Amir.

2022-10-17 08:05:12

by Stephen Brennan

Subject: Re: [RFC] fsnotify: allow sleepable child dentry flag update

Amir Goldstein <[email protected]> writes:
> On Fri, Oct 14, 2022 at 1:27 AM Stephen Brennan
> <[email protected]> wrote:
>>
>> Hi Jan, Amir, Al - this is a quite ugly patch but I want to discuss the idea
>> behind it, to see whether we can find something workable. I apologize for the
>> length of text here, but I think it's necessary to give full context and ideas.
>>
>
> Hi Stephen!
>
>> For background, on machines with lots of memory and weird workloads,
>> __fsnotify_update_child_dentry_flags() has been a real thorn in our side. It
>> grabs a couple spinlocks and iterates over the whole d_subdirs list. If that
>> list is long, this can take a while. The list can be long due to lots of
>> negative dentries (which can easily number in the hundreds of millions if you
>> have a process that's relatively frequently looking up nonexisting files). But
>> the list can also be long due to *positive* dentries. I've seen directories with
>> ~7 million positive dentry children falling victim to this function before (XFS
>> allows lots of files per dir)! Positive dentries take longer to process in this
>> function (since they're actually locked and written to), so you don't need as
>> many for them to be a problem.
>>
>> Anyway, if you have a huge d_subdirs list, then you can have problems with soft
>> lockups. From my measurements with ftrace, 100 million negative dentries means
>> that the function takes about 6 seconds to complete (varies wildly by CPU and
>> kernel config/version). That's bad, but it can get *much worse*. Imagine that
>> there are many frequently accessed files in such a directory, and you have an
>> inotify watch. As soon as that watch is removed, the last fsnotify connector
>> goes away, and i_fsnotify_mask becomes 0. System calls accessing dentries still
>> see DCACHE_FSNOTIFY_PARENT_WATCHED, so they fall into __fsnotify_parent and will
>> try to update the dentry flags. In my experience, a thundering herd of CPUs race
>> to __fsnotify_update_child_dentry_flags(). The winner begins updating and the
>> rest spin waiting for the parent inode's i_lock. Many CPUs make it to that
>> point, and they *all* will proceed to iterate through d_subdirs, regardless of
>> how long the list is, even though only the first CPU needed to do it. So now
>> your 6 second spin gets multiplied by 5-10. And since the directory is
>> frequently accessed, all the dget/dputs from other CPUs will all spin for this
>> long time. This amounts to a nearly unusable system.
>>
>> Previously I've tried to generally limit or manage the number of negative
>> dentries in the dcache, which as a general problem is very difficult to get
>> consensus on. I've also tried the patches to reorder dentries in d_subdirs so
>> negative dentries are at the end, which has some difficult issues interacting
>> with d_walk. Neither of those ideas would help for a directory full of positive
>> dentries either.
>>
>> So I have two more narrowly scoped strategies to improve the situation. Both are
>> included in the hacky, awful patch below.
>>
>> First, is to let __fsnotify_update_child_dentry_flags() sleep. This means nobody
>> is holding the spinlock for several seconds at a time. We can actually achieve
>> this via a cursor, the same way that simple_readdir() is implemented. I think
>> this might require moving the declaration of d_alloc_cursor() and maybe
>> exporting it. I had to #include fs/internal.h which is not ok.
>>
>> On its own, that actually makes problems worse, because it allows several tasks
>> to update at the same time, and they're constantly locking/unlocking, which
>> makes contention worse.
>>
>> So second is to add an inode flag saying that
>> __fsnotify_update_child_dentry_flags() is already in progress. This prevents
>> concurrent execution, and it allows the caller to skip updating since they know
>> it's being handled, so it eliminates the thundering herd problem.
>>
>> The patch works great! It eliminates the chances of soft lockups and makes the
>> system responsive under those weird loads. But now, I know I've added a new
>> issue. Updating dentry flags is no longer atomic, and we've lost the guarantee
>
> Just between us ;) the update of the inode event mask is not atomic anyway,
> because the test for 'parent_watched' and fsnotify_inode_watches_children()
> in __fsnotify_parent() are done without any memory access synchronization.
>
> IOW, the only guarantee for users is that *sometime* after adding events
> to a mark mask, events will start being delivered and *sometime* after
> removing events from a mark mask, events will stop being delivered.
> Some events may have implicit memory barriers that make event delivery
> more deterministic, but others may not.

I did wonder about whether it was truly atomic even without the
sleeping... the sleeping just makes matters much worse. But without the
sleeping, I feel like it wouldn't take much memory synchronization.
The dentry flags modification is protected by a spinlock, I assume we
would just need a memory barrier to pair with the unlock?

(But then again, I really need to read and then reread the memory model
document, and think about it when it's not late for me.)

> This may not be considered an issue for asynchronous events, but actually,
> for permission events, I would like to fix that.
> To understand my motivations you can look at:
> https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Evicting_file_content

I'll take a deeper look in (my) morning! It definitely helps for me to
better understand use cases since I really don't know much beyond
inotify.

>> that after fsnotify_recalc_mask(), child dentries are all flagged when
>> necessary. It's possible that __fsnotify_update_child_dentry_flags() will
>> skip executing because it sees an update already in progress, and inotify_add_watch()
>> would return without the watch being fully ready.
>>
>> I think the approach can still be salvaged, we just need a way to resolve this.
>> EG a wait queue or mutex in the connector would allow us to preserve the
>> guarantee that the child dentries are flagged when necessary. But I know that's
>> a big addition, so I wanted to get some feedback from you as the maintainers. Is
>> the strategy here stupid? Am I missing an easier option?
>
> I think you may be missing an easier option.
>
> The call to __fsnotify_update_child_dentry_flags() in
> __fsnotify_parent() is a very aggressive optimization
> and I think it may be an overkill, and a footgun, according
> to your analysis.

Agreed!

> If only called from the context of fsnotify_recalc_mask()
> (i.e. update mark mask), __fsnotify_update_child_dentry_flags()
> can take the dir inode_lock() to synchronize.
>
> I don't think that the dir inode spin lock needs to be held
> at all during children iteration.

Definitely a sleeping lock is better than the spin lock. And if we did
something like that, it would be worth keeping a little bit of state in
the connector to keep track of whether the dentry flags are set or not.
This way, if several marks are added in a row, you don't repeatedly
iterate over the children to do the same operation over and over again.

No matter what, we have to hold the parent dentry's spinlock, and that's
expensive. So if we can make the update happen only when it would
actually enable or disable the flags, that's worth a few bits of state.

> I think that d_find_any_alias() should be used to obtain
> the alias with elevated refcount instead of the awkward
> d_u.d_alias iteration loop.

D'oh! Much better idea :)
Do you think the BUG_ON would still be worthwhile?

> In the context of __fsnotify_parent(), I think the optimization
> should stick with updating the flags for the specific child dentry
> that had the false positive parent_watched indication,
> leaving the rest of the siblings alone.

> Would that address the performance issues of your workload?

I think synchronizing the __fsnotify_update_child_dentry_flags() with a
mutex and getting rid of the call from __fsnotify_parent() would go a
*huge* way (maybe 80%?) towards resolving the performance issues we've
seen. To be clear, I'm representing not one single workload, but a few
different customer workloads which center around this area.

There are some extreme cases I've seen, where the dentry list is so
huge, that even iterating over it once with the parent dentry spinlock
held is enough to trigger softlockups - no need for several calls to
__fsnotify_update_child_dentry_flags() queueing up as described in the
original mail. So ideally, I'd love to try make *something* work with
the cursor idea as well. But I think the two ideas can be separated
easily, and I can discuss with Al further about if cursors can be
salvaged at all.

Thanks so much for the detailed look at this!
Stephen

2022-10-17 09:16:05

by Jan Kara

Subject: Re: [RFC] fsnotify: allow sleepable child dentry flag update

Hi guys!

On Fri 14-10-22 11:01:39, Amir Goldstein wrote:
> On Fri, Oct 14, 2022 at 1:27 AM Stephen Brennan
> <[email protected]> wrote:
> >
> > Hi Jan, Amir, Al - this is a quite ugly patch but I want to discuss the idea
> > behind it, to see whether we can find something workable. I apologize for the
> > length of text here, but I think it's necessary to give full context and ideas.
>
> > For background, on machines with lots of memory and weird workloads,
> > __fsnotify_update_child_dentry_flags() has been a real thorn in our side. It
> > grabs a couple spinlocks and iterates over the whole d_subdirs list. If that
> > list is long, this can take a while. The list can be long due to lots of
> > negative dentries (which can easily number in the hundreds of millions if you
> > have a process that's relatively frequently looking up nonexisting files). But
> > the list can also be long due to *positive* dentries. I've seen directories with
> > ~7 million positive dentry children falling victim to this function before (XFS
> > allows lots of files per dir)! Positive dentries take longer to process in this
> > function (since they're actually locked and written to), so you don't need as
> > many for them to be a problem.
> >
> > Anyway, if you have a huge d_subdirs list, then you can have problems with soft
> > lockups. From my measurements with ftrace, 100 million negative dentries means
> > that the function takes about 6 seconds to complete (varies wildly by CPU and
> > kernel config/version). That's bad, but it can get *much worse*. Imagine that
> > there are many frequently accessed files in such a directory, and you have an
> > inotify watch. As soon as that watch is removed, the last fsnotify connector
> > goes away, and i_fsnotify_mask becomes 0. System calls accessing dentries still
> > see DCACHE_FSNOTIFY_PARENT_WATCHED, so they fall into __fsnotify_parent and will
> > try to update the dentry flags. In my experience, a thundering herd of CPUs race
> > to __fsnotify_update_child_dentry_flags(). The winner begins updating and the
> > rest spin waiting for the parent inode's i_lock. Many CPUs make it to that
> > point, and they *all* will proceed to iterate through d_subdirs, regardless of
> > how long the list is, even though only the first CPU needed to do it. So now
> > your 6 second spin gets multiplied by 5-10. And since the directory is
> > frequently accessed, all the dget/dputs from other CPUs will all spin for this
> > long time. This amounts to a nearly unusable system.
> >
> > Previously I've tried to generally limit or manage the number of negative
> > dentries in the dcache, which as a general problem is very difficult to get
> > consensus on. I've also tried the patches to reorder dentries in d_subdirs so
> > negative dentries are at the end, which has some difficult issues interacting
> > with d_walk. Neither of those ideas would help for a directory full of positive
> > dentries either.
> >
> > So I have two more narrowly scoped strategies to improve the situation. Both are
> > included in the hacky, awful patch below.
> >
> > First, is to let __fsnotify_update_child_dentry_flags() sleep. This means nobody
> > is holding the spinlock for several seconds at a time. We can actually achieve
> > this via a cursor, the same way that simple_readdir() is implemented. I think
> > this might require moving the declaration of d_alloc_cursor() and maybe
> > exporting it. I had to #include fs/internal.h which is not ok.
> >
> > On its own, that actually makes problems worse, because it allows several tasks
> > to update at the same time, and they're constantly locking/unlocking, which
> > makes contention worse.
> >
> > So second is to add an inode flag saying that
> > __fsnotify_update_child_dentry_flags() is already in progress. This prevents
> > concurrent execution, and it allows the caller to skip updating since they know
> > it's being handled, so it eliminates the thundering herd problem.
> >
> > The patch works great! It eliminates the chances of soft lockups and makes the
> > system responsive under those weird loads. But now, I know I've added a new
> > issue. Updating dentry flags is no longer atomic, and we've lost the guarantee
>
> Just between us ;) the update of the inode event mask is not atomic anyway,
> because the test for 'parent_watched' and fsnotify_inode_watches_children()
> in __fsnotify_parent() are done without any memory access synchronization.
>
> IOW, the only guarantee for users is that *sometime* after adding events
> to a mark mask, events will start being delivered and *sometime* after
> removing events from a mark mask, events will stop being delivered.
> Some events may have implicit memory barriers that make event delivery
> more deterministic, but others may not.

This holds even for fsnotify_inode_watches_children() call in
__fsnotify_update_child_dentry_flags(). In principle we can have racing
calls to __fsnotify_update_child_dentry_flags() which result in temporary
inconsistency of child dentry flags with the mask in parent's connector.

> > that after fsnotify_recalc_mask(), child dentries are all flagged when
> > necessary. It's possible that __fsnotify_update_child_dentry_flags() will
> > skip executing since it sees it's already happening, and inotify_add_watch()
> > would return without the watch being fully ready.
> >
> > I think the approach can still be salvaged, we just need a way to resolve this.
> > EG a wait queue or mutex in the connector would allow us to preserve the
> > guarantee that the child dentries are flagged when necessary. But I know that's
> > a big addition, so I wanted to get some feedback from you as the maintainers. Is
> > the strategy here stupid? Am I missing an easier option?
>
> I think you may be missing an easier option.
>
> The call to __fsnotify_update_child_dentry_flags() in
> __fsnotify_parent() is a very aggressive optimization
> and I think it may be an overkill, and a footgun, according
> to your analysis.
>
> If only called from the context of fsnotify_recalc_mask()
> (i.e. update mark mask), __fsnotify_update_child_dentry_flags()
> can take the dir inode_lock() to synchronize.

This will nest inode lock into fsnotify group lock though. I'm not aware of
any immediate lock ordering problem with that but it might be problematic.
Otherwise this seems workable.

BTW if we call __fsnotify_update_child_dentry_flags() only from
fsnotify_recalc_mask(), I don't think we even need more state in the
connector to detect whether dentry flags update is needed or not.
fsnotify_recalc_mask() knows the old & new mask state so it has all the
information it needs to detect whether dentry flags update is needed or
not. We just have to resolve the flags update races first to avoid
maintaining the inconsistent flags state for a long time.

> I don't think that the dir inode spin lock needs to be held
> at all during children iteration.
>
> I think that d_find_any_alias() should be used to obtain
> the alias with elevated refcount instead of the awkward
> d_u.d_alias iteration loop.
>
> In the context of __fsnotify_parent(), I think the optimization
> should stick with updating the flags for the specific child dentry
> that had the false positive parent_watched indication,
> leaving the rest of the siblings alone.

Yep, that looks sensible to me. Essentially this should be just an uncommon
fixup path until the full child dentry walk from fsnotify_recalc_mask()
catches up with synchronizing all child dentry flags.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-10-17 12:01:06

by Amir Goldstein

Subject: Re: [RFC] fsnotify: allow sleepable child dentry flag update

On Mon, Oct 17, 2022 at 10:59 AM Stephen Brennan
<[email protected]> wrote:
>
> Amir Goldstein <[email protected]> writes:
> > On Fri, Oct 14, 2022 at 1:27 AM Stephen Brennan
> > <[email protected]> wrote:
> >>
> >> Hi Jan, Amir, Al - this is a quite ugly patch but I want to discuss the idea
> >> behind it, to see whether we can find something workable. I apologize for the
> >> length of text here, but I think it's necessary to give full context and ideas.
> >>
> >
> > Hi Stephen!
> >
> >> For background, on machines with lots of memory and weird workloads,
> >> __fsnotify_update_child_dentry_flags() has been a real thorn in our side. It
> >> grabs a couple spinlocks and iterates over the whole d_subdirs list. If that
> >> list is long, this can take a while. The list can be long due to lots of
> >> negative dentries (which can easily number in the hundreds of millions if you
> >> have a process that's relatively frequently looking up nonexisting files). But
> >> the list can also be long due to *positive* dentries. I've seen directories with
> >> ~7 million positive dentry children falling victim to this function before (XFS
> >> allows lots of files per dir)! Positive dentries take longer to process in this
> >> function (since they're actually locked and written to), so you don't need as
> >> many for them to be a problem.
> >>
> >> Anyway, if you have a huge d_subdirs list, then you can have problems with soft
> >> lockups. From my measurements with ftrace, 100 million negative dentries means
> >> that the function takes about 6 seconds to complete (varies wildly by CPU and
> >> kernel config/version). That's bad, but it can get *much worse*. Imagine that
> >> there are many frequently accessed files in such a directory, and you have an
> >> inotify watch. As soon as that watch is removed, the last fsnotify connector
> >> goes away, and i_fsnotify_mask becomes 0. System calls accessing dentries still
> >> see DCACHE_FSNOTIFY_PARENT_WATCHED, so they fall into __fsnotify_parent and will
> >> try to update the dentry flags. In my experience, a thundering herd of CPUs race
> >> to __fsnotify_update_child_dentry_flags(). The winner begins updating and the
> >> rest spin waiting for the parent inode's i_lock. Many CPUs make it to that
> >> point, and they *all* will proceed to iterate through d_subdirs, regardless of
> >> how long the list is, even though only the first CPU needed to do it. So now
> >> your 6 second spin gets multiplied by 5-10. And since the directory is
> >> frequently accessed, all the dget/dputs from other CPUs will all spin for this
> >> long time. This amounts to a nearly unusable system.
> >>
> >> Previously I've tried to generally limit or manage the number of negative
> >> dentries in the dcache, which as a general problem is very difficult to get
> >> consensus on. I've also tried the patches to reorder dentries in d_subdirs so
> >> negative dentries are at the end, which has some difficult issues interacting
> >> with d_walk. Neither of those ideas would help for a directory full of positive
> >> dentries either.
> >>
> >> So I have two more narrowly scoped strategies to improve the situation. Both are
> >> included in the hacky, awful patch below.
> >>
> >> First, is to let __fsnotify_update_child_dentry_flags() sleep. This means nobody
> >> is holding the spinlock for several seconds at a time. We can actually achieve
> >> this via a cursor, the same way that simple_readdir() is implemented. I think
> >> this might require moving the declaration of d_alloc_cursor() and maybe
> >> exporting it. I had to #include fs/internal.h which is not ok.
> >>
> >> On its own, that actually makes problems worse, because it allows several tasks
> >> to update at the same time, and they're constantly locking/unlocking, which
> >> makes contention worse.
> >>
> >> So second is to add an inode flag saying that
> >> __fsnotify_update_child_dentry_flags() is already in progress. This prevents
> >> concurrent execution, and it allows the caller to skip updating since they know
> >> it's being handled, so it eliminates the thundering herd problem.
> >>
> >> The patch works great! It eliminates the chances of soft lockups and makes the
> >> system responsive under those weird loads. But now, I know I've added a new
> >> issue. Updating dentry flags is no longer atomic, and we've lost the guarantee
> >
> > Just between us ;) the update of the inode event mask is not atomic anyway,
> > because the test for 'parent_watched' and fsnotify_inode_watches_children()
> > in __fsnotify_parent() are done without any memory access synchronization.
> >
> > IOW, the only guarantee for users is that *sometime* after adding events
> > to a mark mask, events will start being delivered and *sometime* after
> > removing events from a mark mask, events will stop being delivered.
> > Some events may have implicit memory barriers that make event delivery
> > more deterministic, but others may not.
>
> I did wonder about whether it was truly atomic even without the
> sleeping... the sleeping just makes matters much worse. But without the
> sleeping, I feel like it wouldn't take much memory synchronization.
> The dentry flags modification is protected by a spinlock, I assume we
> would just need a memory barrier to pair with the unlock?
>

Haha "just" cannot be used here :)
Yes, some sort of memory barrier is missing.
The trick is not hurting performance in the common fast path
where the event subscription mask has not been changed.

This is especially true considering that all applications to date
did just fine without atomic semantics for updating mark masks.
Some applications may have been living in blissful ignorance ...

> (But then again, I really need to read and then reread the memory model
> document, and think about it when it's not late for me.)
>
> > This may not be considered an issue for asynchronous events, but actually,
> > for permission events, I would like to fix that.
> > To understand my motivations you can look at:
> > https://github.com/amir73il/fsnotify-utils/wiki/Hierarchical-Storage-Management-API#Evicting_file_content
>
> I'll take a deeper look in (my) morning! It definitely helps for me to
> better understand use cases since I really don't know much beyond
> inotify.
>

Don't stress yourself over this doc, it's a WIP, but it describes
a system where the atomic update of mark masks would be
important for correctness.

The document does provide a method of working around
the atomic mask update requirement (making sure that
sb event mask includes the needed event), which should be
good enough for the described use case.

> >> that after fsnotify_recalc_mask(), child dentries are all flagged when
> >> necessary. It's possible that __fsnotify_update_child_dentry_flags() will
> >> skip executing since it sees it's already happening, and inotify_add_watch()
> >> would return without the watch being fully ready.
> >>
> >> I think the approach can still be salvaged, we just need a way to resolve this.
> >> EG a wait queue or mutex in the connector would allow us to preserve the
> >> guarantee that the child dentries are flagged when necessary. But I know that's
> >> a big addition, so I wanted to get some feedback from you as the maintainers. Is
> >> the strategy here stupid? Am I missing an easier option?
> >
> > I think you may be missing an easier option.
> >
> > The call to __fsnotify_update_child_dentry_flags() in
> > __fsnotify_parent() is a very aggressive optimization
> > and I think it may be an overkill, and a footgun, according
> > to your analysis.
>
> Agreed!
>
> > If only called from the context of fsnotify_recalc_mask()
> > (i.e. update mark mask), __fsnotify_update_child_dentry_flags()
> > can take the dir inode_lock() to synchronize.
> >
> > I don't think that the dir inode spin lock needs to be held
> > at all during children iteration.
>
> Definitely a sleeping lock is better than the spin lock. And if we did
> something like that, it would be worth keeping a little bit of state in
> the connector to keep track of whether the dentry flags are set or not.
> This way, if several marks are added in a row, you don't repeatedly
> iterate over the children to do the same operation over and over again.
>
> No matter what, we have to hold the parent dentry's spinlock, and that's
> expensive. So if we can make the update happen only when it would
> actually enable or disable the flags, that's worth a few bits of state.
>

As Jan wrote, this state already exists in i_fsnotify_mask.
fsnotify_recalc_mask() is very capable of knowing when the
state described by fsnotify_inode_watches_children() is
going to change.

> > I think that d_find_any_alias() should be used to obtain
> > the alias with elevated refcount instead of the awkward
> > d_u.d_alias iteration loop.
>
> D'oh! Much better idea :)
> Do you think the BUG_ON would still be worthwhile?
>

Which BUG_ON()?
In general no, if there are ever multiple aliases for
a directory inode, updating dentry flags would be the last
of our problems.


> > In the context of __fsnotify_parent(), I think the optimization
> > should stick with updating the flags for the specific child dentry
> > that had the false positive parent_watched indication,
> > leaving the rest of the siblings alone.
>
> > WOULD that address the performance issues of your workload?
>
> I think synchronizing the __fsnotify_update_child_dentry_flags() with a
> mutex and getting rid of the call from __fsnotify_parent() would go a
> *huge* way (maybe 80%?) towards resolving the performance issues we've
> seen. To be clear, I'm representing not one single workload, but a few
> different customer workloads which center around this area.
>
> There are some extreme cases I've seen, where the dentry list is so
> huge, that even iterating over it once with the parent dentry spinlock
> held is enough to trigger softlockups - no need for several calls to
> __fsnotify_update_child_dentry_flags() queueing up as described in the
> original mail. So ideally, I'd love to try to make *something* work with
> the cursor idea as well. But I think the two ideas can be separated
> easily, and I can discuss with Al further about if cursors can be
> salvaged at all.
>

Assuming that you take the dir inode_lock() in
__fsnotify_update_child_dentry_flags(), then I *think* that children
dentries cannot be added to dcache and children dentries cannot
turn from positive to negative and vice versa.

Probably the only thing that can change d_subdirs is children dentries
being evicted from dcache(?), so I *think* that once every N children
you can dget(child), drop alias->d_lock, cond_resched(),
and then continue d_subdirs iteration from child->d_child.

Thanks,
Amir.

2022-10-17 17:06:45

by Stephen Brennan

Subject: Re: [RFC] fsnotify: allow sleepable child dentry flag update

Amir Goldstein <[email protected]> writes:

> On Mon, Oct 17, 2022 at 10:59 AM Stephen Brennan
> <[email protected]> wrote:
>>
>> Amir Goldstein <[email protected]> writes:
[snip]
>> > I think that d_find_any_alias() should be used to obtain
>> > the alias with elevated refcount instead of the awkward
>> > d_u.d_alias iteration loop.
>>
>> D'oh! Much better idea :)
>> Do you think the BUG_ON would still be worthwhile?
>>
>
> Which BUG_ON()?
> In general no, if there are ever multiple aliases for
> a directory inode, updating dentry flags would be the last
> of our problems.

Sorry, I meant the one in my patch which asserts that the dentry is the
only alias for that inode. I suppose you're right about having bigger
problems in that case -- but the existing code "handles" it by iterating
over the alias list.

>
>> > In the context of __fsnotify_parent(), I think the optimization
>> > should stick with updating the flags for the specific child dentry
>> > that had the false positive parent_watched indication,
>> > leaving the rest of the siblings alone.
>>
>> > WOULD that address the performance issues of your workload?
>>
>> I think synchronizing the __fsnotify_update_child_dentry_flags() with a
>> mutex and getting rid of the call from __fsnotify_parent() would go a
>> *huge* way (maybe 80%?) towards resolving the performance issues we've
>> seen. To be clear, I'm representing not one single workload, but a few
>> different customer workloads which center around this area.
>>
>> There are some extreme cases I've seen, where the dentry list is so
>> huge, that even iterating over it once with the parent dentry spinlock
>> held is enough to trigger softlockups - no need for several calls to
>> __fsnotify_update_child_dentry_flags() queueing up as described in the
>> original mail. So ideally, I'd love to try to make *something* work with
>> the cursor idea as well. But I think the two ideas can be separated
>> easily, and I can discuss with Al further about if cursors can be
>> salvaged at all.
>>
>
> Assuming that you take the dir inode_lock() in
> __fsnotify_update_child_dentry_flags(), then I *think* that children
> dentries cannot be added to dcache and children dentries cannot
> turn from positive to negative and vice versa.
>
> Probably the only thing that can change d_subdirs is children dentries
> being evicted from dcache(?), so I *think* that once every N children
> you can dget(child), drop alias->d_lock, cond_resched(),
> and then continue d_subdirs iteration from child->d_child.

This sounds like an excellent idea. I can't think of anything which
would remove a dentry from d_subdirs without the inode lock held.
Cursors wouldn't move without the lock held in read mode. Temporary
dentries from d_alloc_parallel are similar - they need the inode locked
shared in order to be removed from the parent list.

I'll try implementing it (along with the fsnotify changes we've
discussed in this thread). I'll add a BUG_ON after we wake up from
cond_resched() to guarantee that the parent is the same dentry as
expected - just in case the assumption is wrong.

Al - if you've read this far :) - does this approach sound reasonable,
compared to the cursor? I'll send out some concrete patches as soon as
I've implemented and done a few tests on them.

Thanks,
Stephen

2022-10-17 18:44:16

by Amir Goldstein

Subject: Re: [RFC] fsnotify: allow sleepable child dentry flag update

On Mon, Oct 17, 2022 at 8:00 PM Stephen Brennan
<[email protected]> wrote:
>
> Amir Goldstein <[email protected]> writes:
>
> > On Mon, Oct 17, 2022 at 10:59 AM Stephen Brennan
> > <[email protected]> wrote:
> >>
> >> Amir Goldstein <[email protected]> writes:
> [snip]
> >> > I think that d_find_any_alias() should be used to obtain
> >> > the alias with elevated refcount instead of the awkward
> >> > d_u.d_alias iteration loop.
> >>
> >> D'oh! Much better idea :)
> >> Do you think the BUG_ON would still be worthwhile?
> >>
> >
> > Which BUG_ON()?
> > In general no, if there are ever multiple aliases for
> > a directory inode, updating dentry flags would be the last
> > of our problems.
>
> Sorry, I meant the one in my patch which asserts that the dentry is the
> only alias for that inode. I suppose you're right about having bigger
> problems in that case -- but the existing code "handles" it by iterating
> over the alias list.
>

It is not important IMO.

> >
> >> > In the context of __fsnotify_parent(), I think the optimization
> >> > should stick with updating the flags for the specific child dentry
> >> > that had the false positive parent_watched indication,
> >> > leaving the rest of the siblings alone.
> >>
> >> > WOULD that address the performance issues of your workload?
> >>
> >> I think synchronizing the __fsnotify_update_child_dentry_flags() with a
> >> mutex and getting rid of the call from __fsnotify_parent() would go a
> >> *huge* way (maybe 80%?) towards resolving the performance issues we've
> >> seen. To be clear, I'm representing not one single workload, but a few
> >> different customer workloads which center around this area.
> >>
> >> There are some extreme cases I've seen, where the dentry list is so
> >> huge, that even iterating over it once with the parent dentry spinlock
> >> held is enough to trigger softlockups - no need for several calls to
> >> __fsnotify_update_child_dentry_flags() queueing up as described in the
> >> original mail. So ideally, I'd love to try to make *something* work with
> >> the cursor idea as well. But I think the two ideas can be separated
> >> easily, and I can discuss with Al further about if cursors can be
> >> salvaged at all.
> >>
> >
> > Assuming that you take the dir inode_lock() in
> > __fsnotify_update_child_dentry_flags(), then I *think* that children
> > dentries cannot be added to dcache and children dentries cannot
> > turn from positive to negative and vice versa.
> >
> > Probably the only thing that can change d_subdirs is children dentries
> > being evicted from dcache(?), so I *think* that once every N children
> > you can dget(child), drop alias->d_lock, cond_resched(),
> > and then continue d_subdirs iteration from child->d_child.
>
> This sounds like an excellent idea. I can't think of anything which
> would remove a dentry from d_subdirs without the inode lock held.
> Cursors wouldn't move without the lock held in read mode. Temporary
> dentries from d_alloc_parallel are similar - they need the inode locked
> shared in order to be removed from the parent list.
>
> I'll try implementing it (along with the fsnotify changes we've
> discussed in this thread). I'll add a BUG_ON after we wake up from
> cond_resched() to guarantee that the parent is the same dentry as
> expected - just in case the assumption is wrong.

BUG_ON() is almost never a good idea.
If anything, you should use if (WARN_ON_ONCE())
and break out of the loop, either returning an error
to fanotify_mark() or not.
I personally think that, as an unexpected code assertion,
returning an error to the user is not a must in this case.

Thanks,
Amir.

>
> Al - if you've read this far :) - does this approach sound reasonable,
> compared to the cursor? I'll send out some concrete patches as soon as
> I've implemented and done a few tests on them.
>
> Thanks,
> Stephen

2022-10-18 04:35:53

by Stephen Brennan

Subject: [PATCH 0/2] fsnotify: fix softlockups iterating over d_subdirs

Hi Jan, Amir, Al,

Here's my first shot at implementing what we discussed. I tested it using the
negative dentry creation tool I mentioned in my previous message, with a similar
workflow. Rather than having a bunch of threads accessing the directory to
create that "thundering herd" of CPUs in __fsnotify_update_child_dentry_flags, I
just started a lot of inotifywait tasks:

1. Create 100 million negative dentries in a dir
2. Use trace-cmd to watch __fsnotify_update_child_dentry_flags:
trace-cmd start -p function_graph -l __fsnotify_update_child_dentry_flags
sudo cat /sys/kernel/debug/tracing/trace_pipe
3. Run a lot of inotifywait tasks: for i in {1..10}; do inotifywait $dir & done

With step #3, I see only one execution of __fsnotify_update_child_dentry_flags.
Once that completes, all the inotifywait tasks say "Watches established".
Similarly, once an access occurs in the directory, a single
__fsnotify_update_child_dentry_flags execution occurs, and all the tasks exit.
In short: it works great!

However, while testing this, I've observed a dentry still in use warning during
unmount of rpc_pipefs on the "nfs" dentry during shutdown. NFS is of course in
use, and I assume that fsnotify must have been used to trigger this. The error
is not there on mainline without my patch so it's definitely caused by this
code. I'll continue debugging it but I wanted to share my first take on this so
you could take a look.

[ 1595.197339] BUG: Dentry 000000005f5e7197{i=67,n=nfs} still in use (2) [unmount of rpc_pipefs rpc_pipefs]

Thanks!
Stephen

Stephen Brennan (2):
fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
fsnotify: allow sleepable child flag update

fs/notify/fsnotify.c | 84 +++++++++++++++++++++++++++++++-------------
fs/notify/mark.c | 55 ++++++++++++++++++++++-------
2 files changed, 103 insertions(+), 36 deletions(-)

--
2.34.1

2022-10-18 04:37:58

by Stephen Brennan

Subject: [PATCH 1/2] fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem

When an inode is interested in events on its children, it must set
DCACHE_FSNOTIFY_PARENT_WATCHED flag on all its children. Currently, when
the fsnotify connector is removed and i_fsnotify_mask becomes zero, we
lazily allow __fsnotify_parent() to do this the next time we see an
event on a child.

However, if the list of children is very long (e.g., in the millions),
and lots of activity is occurring on the directory, then it's possible
for many CPUs to end up blocked on the inode spinlock in
__fsnotify_update_child_dentry_flags(). Each CPU will then redundantly
iterate over the very long list of children. This situation can cause
soft lockups.

To avoid this, stop lazily updating child flags in __fsnotify_parent().
Protect the child flag update with i_rwsem held exclusive, to ensure
that we only iterate over the child list when it's absolutely necessary,
and even then, only once.

Signed-off-by: Stephen Brennan <[email protected]>
---

Notes:

It seems that there are two implementation options for this, regarding
what i_rwsem protects:

1. Both updates to i_fsnotify_mask, and the child dentry flags, or
2. Only updates to the child dentry flags

I wanted to do #1, but it got really tricky with fsnotify_put_mark(). We
don't want to hold the inode lock whenever we decrement the refcount,
but if we don't, then we're stuck holding a spinlock when the refcount
goes to zero, and we need to grab the inode rwsem to synchronize the
update to the child flags. I'm sure there's a way around this, but I
didn't keep going with it.

With #2, as currently implemented, we have the unfortunate effect
that a mark can be added, can see that no update is required, and
return, despite the fact that the flag update is still in progress on a
different CPU/thread. From our discussion, that seems to be the current
status quo, but I wanted to explicitly point that out. If we want to
move to #1, it should be possible with some work.

fs/notify/fsnotify.c | 12 ++++++++--
fs/notify/mark.c | 55 ++++++++++++++++++++++++++++++++++----------
2 files changed, 53 insertions(+), 14 deletions(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 7974e91ffe13..e887a195983b 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -207,8 +207,16 @@ int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
parent = dget_parent(dentry);
p_inode = parent->d_inode;
p_mask = fsnotify_inode_watches_children(p_inode);
- if (unlikely(parent_watched && !p_mask))
- __fsnotify_update_child_dentry_flags(p_inode);
+ if (unlikely(parent_watched && !p_mask)) {
+ /*
+ * Flag would be cleared soon by
+ * __fsnotify_update_child_dentry_flags(), but as an
+ * optimization, clear it now.
+ */
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
+ spin_unlock(&dentry->d_lock);
+ }

/*
* Include parent/name in notification either if some notification
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index c74ef947447d..da9f944fcbbb 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -184,15 +184,36 @@ static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
*/
void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
{
+ struct inode *inode = NULL;
+ int watched_before, watched_after;
+
if (!conn)
return;

- spin_lock(&conn->lock);
- __fsnotify_recalc_mask(conn);
- spin_unlock(&conn->lock);
- if (conn->type == FSNOTIFY_OBJ_TYPE_INODE)
- __fsnotify_update_child_dentry_flags(
- fsnotify_conn_inode(conn));
+ if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
+ /*
+ * For inodes, we may need to update flags on the child
+ * dentries. To ensure these updates occur exactly once,
+ * synchronize the recalculation with the inode mutex.
+ */
+ inode = fsnotify_conn_inode(conn);
+ spin_lock(&conn->lock);
+ watched_before = fsnotify_inode_watches_children(inode);
+ __fsnotify_recalc_mask(conn);
+ watched_after = fsnotify_inode_watches_children(inode);
+ spin_unlock(&conn->lock);
+
+ inode_lock(inode);
+ if ((watched_before && !watched_after) ||
+ (!watched_before && watched_after)) {
+ __fsnotify_update_child_dentry_flags(inode);
+ }
+ inode_unlock(inode);
+ } else {
+ spin_lock(&conn->lock);
+ __fsnotify_recalc_mask(conn);
+ spin_unlock(&conn->lock);
+ }
}

/* Free all connectors queued for freeing once SRCU period ends */
@@ -295,6 +316,8 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
struct fsnotify_mark_connector *conn = READ_ONCE(mark->connector);
void *objp = NULL;
unsigned int type = FSNOTIFY_OBJ_TYPE_DETACHED;
+ struct inode *inode = NULL;
+ int watched_before, watched_after;
bool free_conn = false;

/* Catch marks that were actually never attached to object */
@@ -311,17 +334,31 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
if (!refcount_dec_and_lock(&mark->refcnt, &conn->lock))
return;

+ if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
+ inode = fsnotify_conn_inode(conn);
+ watched_before = fsnotify_inode_watches_children(inode);
+ }
+
hlist_del_init_rcu(&mark->obj_list);
if (hlist_empty(&conn->list)) {
objp = fsnotify_detach_connector_from_object(conn, &type);
free_conn = true;
+ watched_after = 0;
} else {
objp = __fsnotify_recalc_mask(conn);
type = conn->type;
+ watched_after = fsnotify_inode_watches_children(inode);
}
WRITE_ONCE(mark->connector, NULL);
spin_unlock(&conn->lock);

+ if (inode) {
+ inode_lock(inode);
+ if (watched_before && !watched_after)
+ __fsnotify_update_child_dentry_flags(inode);
+ inode_unlock(inode);
+ }
+
fsnotify_drop_object(type, objp);

if (free_conn) {
@@ -331,12 +368,6 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
spin_unlock(&destroy_lock);
queue_work(system_unbound_wq, &connector_reaper_work);
}
- /*
- * Note that we didn't update flags telling whether inode cares about
- * what's happening with children. We update these flags from
- * __fsnotify_parent() lazily when next event happens on one of our
- * children.
- */
spin_lock(&destroy_lock);
list_add(&mark->g_list, &destroy_list);
spin_unlock(&destroy_lock);
--
2.34.1

2022-10-18 04:52:45

by Stephen Brennan

Subject: [PATCH 2/2] fsnotify: allow sleepable child flag update

With very large d_subdirs lists, iteration can take a long time. Since
iteration needs to hold parent->d_lock, this can trigger soft lockups.
It would be best to make this iteration sleepable. Since we have the
inode locked exclusive, we can drop the parent->d_lock and sleep,
holding a reference to a child dentry, and continue iteration once we
wake.

Signed-off-by: Stephen Brennan <[email protected]>
---
fs/notify/fsnotify.c | 72 ++++++++++++++++++++++++++++++--------------
1 file changed, 50 insertions(+), 22 deletions(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index e887a195983b..499b19272b32 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -102,10 +102,13 @@ void fsnotify_sb_delete(struct super_block *sb)
* on a child we run all of our children and set a dentry flag saying that the
* parent cares. Thus when an event happens on a child it can quickly tell
* if there is a need to find a parent and send the event to the parent.
+ *
+ * Context: inode locked exclusive
*/
void __fsnotify_update_child_dentry_flags(struct inode *inode)
{
- struct dentry *alias;
+ struct dentry *child, *alias, *last_ref = NULL;
+ struct list_head *p;
int watched;

if (!S_ISDIR(inode->i_mode))
@@ -114,30 +117,55 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
/* determine if the children should tell inode about their events */
watched = fsnotify_inode_watches_children(inode);

- spin_lock(&inode->i_lock);
- /* run all of the dentries associated with this inode. Since this is a
- * directory, there damn well better only be one item on this list */
- hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
- struct dentry *child;
-
- /* run all of the children of the original inode and fix their
- * d_flags to indicate parental interest (their parent is the
- * original inode) */
- spin_lock(&alias->d_lock);
- list_for_each_entry(child, &alias->d_subdirs, d_child) {
- if (!child->d_inode)
- continue;
+ alias = d_find_any_alias(inode);
+
+ /*
+ * These lists can get very long, so we may need to sleep during
+ * iteration. Normally this would be impossible without a cursor,
+ * but since we have the inode locked exclusive, we're guaranteed
+ * that the directory won't be modified, so whichever dentry we
+ * pick to sleep on won't get moved. So, start a manual iteration
+ * over d_subdirs which will allow us to sleep.
+ */
+ spin_lock(&alias->d_lock);
+ p = alias->d_subdirs.next;

- spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
- if (watched)
- child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
- else
- child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
- spin_unlock(&child->d_lock);
+ while (p != &alias->d_subdirs) {
+ child = list_entry(p, struct dentry, d_child);
+ if (need_resched()) {
+ /*
+ * We need to hold a reference while we sleep. But when
+ * we wake, dput() could free the dentry, invalidating
+ * the list pointers. We can't look at the list pointers
+ * until we re-lock the parent, and we can't dput() once
+ * we have the parent locked. So the solution is to hold
+ * onto our reference and free it the *next* time we drop
+ * alias->d_lock: either at the end of the function, or
+ * at the time of the next sleep.
+ */
+ dget(child);
+ spin_unlock(&alias->d_lock);
+ if (last_ref)
+ dput(last_ref);
+ last_ref = child;
+ cond_resched();
+ spin_lock(&alias->d_lock);
}
- spin_unlock(&alias->d_lock);
+ p = p->next;
+
+ if (!child->d_inode)
+ continue;
+
+ spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
+ if (watched)
+ child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
+ else
+ child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
+ spin_unlock(&child->d_lock);
}
- spin_unlock(&inode->i_lock);
+ spin_unlock(&alias->d_lock);
+ if (last_ref)
+ dput(last_ref);
}

/* Are inode/sb/mount interested in parent and name info with this event? */
--
2.34.1
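The reference-juggling in the loop above (pin the current child with dget(), drop the parent lock, release the *previous* pin, resched, re-lock, continue) can be modeled in plain userspace C. This is only an illustrative sketch with hypothetical names (node, update_flags, get_node/put_node); a pthread mutex stands in for alias->d_lock, a refcount field for the dentry refcount, and the periodic lock drop for cond_resched(). It is not kernel code, but it shows why the pattern never dereferences list pointers of an unpinned node while the lock is dropped.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-in for a dentry on d_subdirs. */
struct node {
	struct node *next;
	int refcount;	/* models the dentry refcount (dget/dput) */
	int flag;	/* models DCACHE_FSNOTIFY_PARENT_WATCHED */
};

/* Models alias->d_lock. */
static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;

static void get_node(struct node *n) { n->refcount++; }
static void put_node(struct node *n) { if (n) n->refcount--; }

/* Set 'flag' on every node, dropping the lock every 'batch' nodes. */
static void update_flags(struct node *head, int watched, int batch)
{
	struct node *n = head, *last_ref = NULL;
	int since_resched = 0;

	pthread_mutex_lock(&list_lock);
	while (n) {
		if (++since_resched >= batch) {
			since_resched = 0;
			/* Pin the current node *before* dropping the lock,
			 * exactly as the patch does with dget(child). */
			get_node(n);
			pthread_mutex_unlock(&list_lock);
			/* Safe to release the previous pin only while the
			 * lock is not held (dput() can sleep/free). */
			put_node(last_ref);
			last_ref = n;
			/* cond_resched() would go here in the kernel. */
			pthread_mutex_lock(&list_lock);
		}
		/* Under the lock again: the pinned node's pointers are
		 * valid, so the walk resumes from it. */
		n->flag = watched;
		n = n->next;
	}
	pthread_mutex_unlock(&list_lock);
	put_node(last_ref);	/* drop the final pin */
}
```

The invariant matching the patch: at most one node is pinned at any time, and it is released only after the lock has been dropped, so every get_node() is eventually paired with a put_node().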

2022-10-18 05:42:23

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 2/2] fsnotify: allow sleepable child flag update

On Tue, Oct 18, 2022 at 7:12 AM Stephen Brennan
<[email protected]> wrote:
>
> With very large d_subdirs lists, iteration can take a long time. Since
> iteration needs to hold parent->d_lock, this can trigger soft lockups.
> It would be best to make this iteration sleepable. Since we have the
> inode locked exclusive, we can drop the parent->d_lock and sleep,
> holding a reference to a child dentry, and continue iteration once we
> wake.
>
> Signed-off-by: Stephen Brennan <[email protected]>
> ---
> fs/notify/fsnotify.c | 72 ++++++++++++++++++++++++++++++--------------
> 1 file changed, 50 insertions(+), 22 deletions(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index e887a195983b..499b19272b32 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -102,10 +102,13 @@ void fsnotify_sb_delete(struct super_block *sb)
> * on a child we run all of our children and set a dentry flag saying that the
> * parent cares. Thus when an event happens on a child it can quickly tell
> * if there is a need to find a parent and send the event to the parent.
> + *
> + * Context: inode locked exclusive

Please add a code assertion

WARN_ON_ONCE(!inode_is_locked(inode));

and it probably wouldn't hurt to add an inline wrapper
fsnotify_update_child_dentry_flags()
that locks the inode and calls this helper.

> */
> void __fsnotify_update_child_dentry_flags(struct inode *inode)
> {
> - struct dentry *alias;
> + struct dentry *child, *alias, *last_ref = NULL;
> + struct list_head *p;
> int watched;
>
> if (!S_ISDIR(inode->i_mode))
> @@ -114,30 +117,55 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
> /* determine if the children should tell inode about their events */
> watched = fsnotify_inode_watches_children(inode);
>
> - spin_lock(&inode->i_lock);
> - /* run all of the dentries associated with this inode. Since this is a
> - * directory, there damn well better only be one item on this list */
> - hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
> - struct dentry *child;
> -
> - /* run all of the children of the original inode and fix their
> - * d_flags to indicate parental interest (their parent is the
> - * original inode) */
> - spin_lock(&alias->d_lock);
> - list_for_each_entry(child, &alias->d_subdirs, d_child) {
> - if (!child->d_inode)
> - continue;
> + alias = d_find_any_alias(inode);

Please make the alias change in a separate patch.
It is not explained in the commit message, and it clutters
the diff, which makes reviewing the actual logic changes
harder.

> +
> + /*
> + * These lists can get very long, so we may need to sleep during
> + * iteration. Normally this would be impossible without a cursor,
> + * but since we have the inode locked exclusive, we're guaranteed
> + * that the directory won't be modified, so whichever dentry we
> + * pick to sleep on won't get moved. So, start a manual iteration
> + * over d_subdirs which will allow us to sleep.
> + */
> + spin_lock(&alias->d_lock);
> + p = alias->d_subdirs.next;
>
> - spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
> - if (watched)
> - child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
> - else
> - child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
> - spin_unlock(&child->d_lock);
> + while (p != &alias->d_subdirs) {
> + child = list_entry(p, struct dentry, d_child);

IMO it would be better to use list iterator helpers.
What was wrong with list_for_each_entry()?
Why did you feel that you need to open code it?

> + if (need_resched()) {
> + /*
> + * We need to hold a reference while we sleep. But when
> + * we wake, dput() could free the dentry, invalidating
> + * the list pointers. We can't look at the list pointers
> + * until we re-lock the parent, and we can't dput() once
> + * we have the parent locked. So the solution is to hold
> + * onto our reference and free it the *next* time we drop
> + * alias->d_lock: either at the end of the function, or
> + * at the time of the next sleep.
> + */
> + dget(child);
> + spin_unlock(&alias->d_lock);
> + if (last_ref)
> + dput(last_ref);
> + last_ref = child;
> + cond_resched();
> + spin_lock(&alias->d_lock);
> }
> - spin_unlock(&alias->d_lock);
> + p = p->next;
> +
> + if (!child->d_inode)
> + continue;
> +
> + spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
> + if (watched)
> + child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
> + else
> + child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
> + spin_unlock(&child->d_lock);
> }
> - spin_unlock(&inode->i_lock);
> + spin_unlock(&alias->d_lock);
> + if (last_ref)
> + dput(last_ref);

Nit: the 'if' is not needed; dput(NULL) works just fine.

Thanks,
Amir.

2022-10-18 08:14:55

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 1/2] fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem

On Tue, Oct 18, 2022 at 7:12 AM Stephen Brennan
<[email protected]> wrote:
>
> When an inode is interested in events on its children, it must set
> DCACHE_FSNOTIFY_PARENT_WATCHED flag on all its children. Currently, when
> the fsnotify connector is removed and i_fsnotify_mask becomes zero, we
> lazily allow __fsnotify_parent() to do this the next time we see an
> event on a child.
>
> However, if the list of children is very long (e.g., in the millions),
> and lots of activity is occurring on the directory, then it's possible
> for many CPUs to end up blocked on the inode spinlock in
> __fsnotify_update_child_flags(). Each CPU will then redundantly iterate
> over the very long list of children. This situation can cause soft
> lockups.
>
> To avoid this, stop lazily updating child flags in __fsnotify_parent().
> Protect the child flag update with i_rwsem held exclusive, to ensure
> that we only iterate over the child list when it's absolutely necessary,
> and even then, only once.
>
> Signed-off-by: Stephen Brennan <[email protected]>
> ---
>
> Notes:
>
> It seems that there are two implementation options for this, regarding
> what i_rwsem protects:
>
> 1. Both updates to i_fsnotify_mask, and the child dentry flags, or
> 2. Only updates to the child dentry flags
>
> I wanted to do #1, but it got really tricky with fsnotify_put_mark(). We
> don't want to hold the inode lock whenever we decrement the refcount,
> but if we don't, then we're stuck holding a spinlock when the refcount
> goes to zero, and we need to grab the inode rwsem to synchronize the
> update to the child flags. I'm sure there's a way around this, but I
> didn't keep going with it.
>
> With #1, as currently implemented, we have the unfortunate effect
> that a mark can be added, see that no update is required, and
> return, despite the fact that the flag update is still in progress on a
> different CPU/thread. From our discussion, that seems to be the current
> status quo, but I wanted to explicitly point that out. If we want to
> move to #1, it should be possible with some work.

I think the solution may be to store the state of children in conn
like you suggested.

See fsnotify_update_iref() and conn flag
FSNOTIFY_CONN_FLAG_HAS_IREF.

You can add a conn flag
FSNOTIFY_CONN_FLAG_WATCHES_CHILDREN
that caches the result of the last invocation of update children flags.

For example, fsnotify_update_iref() becomes
fsnotify_update_inode_conn_flags() and
returns the inode if either the inode ref should be dropped
or the children flags need to be updated (or both);
maybe use some out argument to differentiate the cases.
Same for fsnotify_detach_connector_from_object().

Then, where fsnotify_drop_object() is called, for the
case that inode children need to be updated,
take inode_lock() and the connector spin lock
to check if another thread has already done the update;
if not, release the spin lock, perform the update under
the inode lock, and at the end take the spin lock again
and set the FSNOTIFY_CONN_FLAG_WATCHES_CHILDREN
connector flag.

Not sure if it all works out... maybe
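This exactly-once scheme can be sketched as a userspace model (hypothetical names; pthread mutexes stand in for conn->lock and inode_lock(), a counter for the expensive d_subdirs walk — this is an illustration of the locking pattern, not kernel code). Racing threads all serialize on the inode lock, but only the first one to observe the cached state as stale performs the walk; the rest see the updated flag and return cheaply.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

/* Models conn->lock (a spinlock in the kernel). */
static pthread_mutex_t conn_lock = PTHREAD_MUTEX_INITIALIZER;
/* Models inode_lock() / i_rwsem held exclusive. */
static pthread_mutex_t inode_lock = PTHREAD_MUTEX_INITIALIZER;

/* Models FSNOTIFY_CONN_FLAG_WATCHES_CHILDREN. */
static bool watches_children;
/* Counts how many full child walks actually ran. */
static int expensive_walks;

static void update_children(bool want)
{
	bool stale;

	pthread_mutex_lock(&inode_lock);
	/* Check the cached state under the "spinlock": has another
	 * thread already done the update while we waited? */
	pthread_mutex_lock(&conn_lock);
	stale = (watches_children != want);
	pthread_mutex_unlock(&conn_lock);

	if (stale) {
		expensive_walks++;	/* stands in for the d_subdirs walk */
		/* Publish the new state under the "spinlock" again. */
		pthread_mutex_lock(&conn_lock);
		watches_children = want;
		pthread_mutex_unlock(&conn_lock);
	}
	pthread_mutex_unlock(&inode_lock);
}

static void *watcher(void *arg)
{
	(void)arg;
	update_children(true);	/* e.g. a new mark makes children watched */
	return NULL;
}
```

With this shape, a thundering herd of watchers pays for exactly one walk: the flag is only set after the walk completes, and it is re-checked under the inner lock by every thread that acquires the inode lock afterwards.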

>
> fs/notify/fsnotify.c | 12 ++++++++--
> fs/notify/mark.c | 55 ++++++++++++++++++++++++++++++++++----------
> 2 files changed, 53 insertions(+), 14 deletions(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 7974e91ffe13..e887a195983b 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -207,8 +207,16 @@ int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
> parent = dget_parent(dentry);
> p_inode = parent->d_inode;
> p_mask = fsnotify_inode_watches_children(p_inode);
> - if (unlikely(parent_watched && !p_mask))
> - __fsnotify_update_child_dentry_flags(p_inode);
> + if (unlikely(parent_watched && !p_mask)) {
> + /*
> + * Flag would be cleared soon by
> + * __fsnotify_update_child_dentry_flags(), but as an
> + * optimization, clear it now.
> + */

I think that we need to also take the p_inode spin_lock here and
check fsnotify_inode_watches_children() under the lock;
otherwise, we could be clearing the WATCHED flag
*after* __fsnotify_update_child_dentry_flags() had
already set it, because we may not observe the change to
the p_inode mask.

I would consider renaming __fsnotify_update_child_dentry_flags()
to __fsnotify_update_children_dentry_flags(struct inode *dir)

and creating another inline helper for this call site called:
fsnotify_update_child_dentry_flags(struct inode *dir, struct dentry *child)


> + spin_lock(&dentry->d_lock);
> + dentry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
> + spin_unlock(&dentry->d_lock);
> + }
>
> /*
> * Include parent/name in notification either if some notification
> diff --git a/fs/notify/mark.c b/fs/notify/mark.c
> index c74ef947447d..da9f944fcbbb 100644
> --- a/fs/notify/mark.c
> +++ b/fs/notify/mark.c
> @@ -184,15 +184,36 @@ static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
> */
> void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
> {
> + struct inode *inode = NULL;
> + int watched_before, watched_after;
> +
> if (!conn)
> return;
>
> - spin_lock(&conn->lock);
> - __fsnotify_recalc_mask(conn);
> - spin_unlock(&conn->lock);
> - if (conn->type == FSNOTIFY_OBJ_TYPE_INODE)
> - __fsnotify_update_child_dentry_flags(
> - fsnotify_conn_inode(conn));
> + if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
> + /*
> + * For inodes, we may need to update flags on the child
> + * dentries. To ensure these updates occur exactly once,
> + * synchronize the recalculation with the inode mutex.
> + */
> + inode = fsnotify_conn_inode(conn);
> + spin_lock(&conn->lock);
> + watched_before = fsnotify_inode_watches_children(inode);
> + __fsnotify_recalc_mask(conn);
> + watched_after = fsnotify_inode_watches_children(inode);
> + spin_unlock(&conn->lock);
> +
> + inode_lock(inode);

With the pattern that I suggested above, this if / else would
be unified to code that looks something like this:

spin_lock(&conn->lock);
inode = __fsnotify_recalc_mask(conn);
spin_unlock(&conn->lock);

if (inode)
fsnotify_update_children_dentry_flags(conn, inode);

Where fsnotify_update_children_dentry_flags()
takes inode lock around entire update and conn spin lock
only around check and update of conn flags.

FYI, at this time in the code, adding a mark or updating
existing mark mask cannot result in the need to drop iref.
That is the reason that return value of __fsnotify_recalc_mask()
is not checked here.

> + if ((watched_before && !watched_after) ||
> + (!watched_before && watched_after)) {
> + __fsnotify_update_child_dentry_flags(inode);
> + }
> + inode_unlock(inode);
> + } else {
> + spin_lock(&conn->lock);
> + __fsnotify_recalc_mask(conn);
> + spin_unlock(&conn->lock);
> + }
> }
>
> /* Free all connectors queued for freeing once SRCU period ends */
> @@ -295,6 +316,8 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
> struct fsnotify_mark_connector *conn = READ_ONCE(mark->connector);
> void *objp = NULL;
> unsigned int type = FSNOTIFY_OBJ_TYPE_DETACHED;
> + struct inode *inode = NULL;
> + int watched_before, watched_after;
> bool free_conn = false;
>
> /* Catch marks that were actually never attached to object */
> @@ -311,17 +334,31 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
> if (!refcount_dec_and_lock(&mark->refcnt, &conn->lock))
> return;
>
> + if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
> + inode = fsnotify_conn_inode(conn);
> + watched_before = fsnotify_inode_watches_children(inode);
> + }
> +
> hlist_del_init_rcu(&mark->obj_list);
> if (hlist_empty(&conn->list)) {
> objp = fsnotify_detach_connector_from_object(conn, &type);
> free_conn = true;
> + watched_after = 0;
> } else {
> objp = __fsnotify_recalc_mask(conn);
> type = conn->type;
> + watched_after = fsnotify_inode_watches_children(inode);
> }
> WRITE_ONCE(mark->connector, NULL);
> spin_unlock(&conn->lock);
>
> + if (inode) {
> + inode_lock(inode);
> + if (watched_before && !watched_after)
> + __fsnotify_update_child_dentry_flags(inode);
> + inode_unlock(inode);
> + }
> +
> fsnotify_drop_object(type, objp);
>

Here as well something like:
if (objp)
fsnotify_update_children_dentry_flags(conn, obj);

But we need to distinguish whether the inode ref needs to be
dropped, the children flags updated, or both.

Hope that this suggested direction turns out to be useful and not
a complete waste of time...

Thanks,
Amir.

2022-10-18 08:20:00

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 0/2] fsnotify: fix softlockups iterating over d_subdirs

On Tue, Oct 18, 2022 at 7:12 AM Stephen Brennan
<[email protected]> wrote:
>
> Hi Jan, Amir, Al,
>
> Here's my first shot at implementing what we discussed. I tested it using the
> negative dentry creation tool I mentioned in my previous message, with a similar
> workflow. Rather than having a bunch of threads accessing the directory to
> create that "thundering herd" of CPUs in __fsnotify_update_child_dentry_flags, I
> just started a lot of inotifywait tasks:
>
> 1. Create 100 million negative dentries in a dir
> 2. Use trace-cmd to watch __fsnotify_update_child_dentry_flags:
> trace-cmd start -p function_graph -l __fsnotify_update_child_dentry_flags
> sudo cat /sys/kernel/debug/tracing/trace_pipe
> 3. Run a lot of inotifywait tasks: for i in {1..10}; do inotifywait $dir & done
>
> With step #3, I see only one execution of __fsnotify_update_child_dentry_flags.
> Once that completes, all the inotifywait tasks say "Watches established".
> Similarly, once an access occurs in the directory, a single
> __fsnotify_update_child_dentry_flags execution occurs, and all the tasks exit.
> In short: it works great!
>
> However, while testing this, I've observed a dentry still in use warning during
> unmount of rpc_pipefs on the "nfs" dentry during shutdown. NFS is of course in
> use, and I assume that fsnotify must have been used to trigger this. The error
> is not there on mainline without my patch so it's definitely caused by this
> code. I'll continue debugging it but I wanted to share my first take on this so
> you could take a look.
>
> [ 1595.197339] BUG: Dentry 000000005f5e7197{i=67,n=nfs} still in use (2) [unmount of rpc_pipefs rpc_pipefs]
>

Hmm, the assumption we made about partial stability of d_subdirs
under dir inode lock looks incorrect for rpc_pipefs.
None of the functions that update the rpc_pipefs dcache take the parent
inode lock.

The assumption looks incorrect for other pseudo fs as well.

The other side of the coin is that we do not really need to worry
about walking a huge list of pseudo fs children.

The question is how to classify those pseudo fs and whether there
are other cases like this that we missed.

Perhaps having simple_dentry_operations is a good enough
clue, but perhaps it is not enough. I am not sure.

It covers all the cases of pseudo fs that I know about, so you
can certainly use this clue to avoid going to sleep in the
update loop as a first approximation.

I can try to figure this out, but I prefer that Al will chime in to
provide reliable answers to those questions.

Thanks,
Amir.

2022-10-19 00:15:04

by Stephen Brennan

[permalink] [raw]
Subject: Re: [PATCH 0/2] fsnotify: fix softlockups iterating over d_subdirs

Amir Goldstein <[email protected]> writes:
> On Tue, Oct 18, 2022 at 7:12 AM Stephen Brennan
> <[email protected]> wrote:
>>
>> Hi Jan, Amir, Al,
>>
>> Here's my first shot at implementing what we discussed. I tested it using the
>> negative dentry creation tool I mentioned in my previous message, with a similar
>> workflow. Rather than having a bunch of threads accessing the directory to
>> create that "thundering herd" of CPUs in __fsnotify_update_child_dentry_flags, I
>> just started a lot of inotifywait tasks:
>>
>> 1. Create 100 million negative dentries in a dir
>> 2. Use trace-cmd to watch __fsnotify_update_child_dentry_flags:
>> trace-cmd start -p function_graph -l __fsnotify_update_child_dentry_flags
>> sudo cat /sys/kernel/debug/tracing/trace_pipe
>> 3. Run a lot of inotifywait tasks: for i in {1..10}; do inotifywait $dir & done
>>
>> With step #3, I see only one execution of __fsnotify_update_child_dentry_flags.
>> Once that completes, all the inotifywait tasks say "Watches established".
>> Similarly, once an access occurs in the directory, a single
>> __fsnotify_update_child_dentry_flags execution occurs, and all the tasks exit.
>> In short: it works great!
>>
>> However, while testing this, I've observed a dentry still in use warning during
>> unmount of rpc_pipefs on the "nfs" dentry during shutdown. NFS is of course in
>> use, and I assume that fsnotify must have been used to trigger this. The error
>> is not there on mainline without my patch so it's definitely caused by this
>> code. I'll continue debugging it but I wanted to share my first take on this so
>> you could take a look.
>>
>> [ 1595.197339] BUG: Dentry 000000005f5e7197{i=67,n=nfs} still in use (2) [unmount of rpc_pipefs rpc_pipefs]
>>
>
> Hmm, the assumption we made about partial stability of d_subdirs
> under dir inode lock looks incorrect for rpc_pipefs.
> None of the functions that update the rpc_pipefs dcache take the parent
> inode lock.

That may be, but I'm confused how that would trigger this issue. If I'm
understanding correctly, this warning indicates a reference counting
bug.

If __fsnotify_update_child_dentry_flags() had gone to sleep and the list
were edited, then it seems like there could be only two possibilities
that could cause bugs:

1. The dentry we slept holding a reference to was removed from the list,
and maybe moved to a different one, or just removed. If that were the
case, we're quite unlucky, because we'll start looping indefinitely as
we'll never get back to the beginning of the list, or worse.

2. A dentry adjacent to the one we held a reference to was removed. In
that case, our dentry's d_child pointers should get rearranged, and when
we wake, we should see those updates and continue.

In neither of those cases do I understand where we could have done a
dget() unpaired with a dput(), which is what seemingly would trigger
this issue.

I'm probably wrong, but without understanding the mechanism behind the
error, I'm not sure how to approach it.

> The assumption looks incorrect for other pseudo fs as well.
>
> The other side of the coin is that we do not really need to worry
> about walking a huge list of pseudo fs children.
>
> The question is how to classify those pseudo fs and whether there
> are other cases like this that we missed.
>
> Perhaps having simple_dentry_operations is a good enough
> clue, but perhaps it is not enough. I am not sure.
>
> It covers all the cases of pseudo fs that I know about, so you
> can certainly use this clue to avoid going to sleep in the
> update loop as a first approximation.

I would worry that it would become an exercise of whack-a-mole.
Allow/deny-listing certain filesystems for certain behavior seems scary.

> I can try to figure this out, but I prefer that Al will chime in to
> provide reliable answers to those questions.

I have a core dump from the warning (with panic_on_warn=1) and will see
if I can trace or otherwise identify the exact mechanism myself.

> Thanks,
> Amir.
>

Thanks for your detailed review of both the patches. I didn't get much
time today to update the patches and test them. Your feedback looks very
helpful though, and I'll hope to send out an updated revision tomorrow.

In the absolute worst case (and I don't want to concede defeat just
yet), keeping patch 1 without patch 2 (sleepable iteration) would still
be a major win, since it resolves the thundering herd problem which is
what compounds the problem of the long lists.

Thanks!
Stephen

2022-10-19 06:37:41

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 0/2] fsnotify: fix softlockups iterating over d_subdirs

On Wed, Oct 19, 2022 at 2:52 AM Stephen Brennan
<[email protected]> wrote:
>
> Amir Goldstein <[email protected]> writes:
> > On Tue, Oct 18, 2022 at 7:12 AM Stephen Brennan
> > <[email protected]> wrote:
> >>
> >> Hi Jan, Amir, Al,
> >>
> >> Here's my first shot at implementing what we discussed. I tested it using the
> >> negative dentry creation tool I mentioned in my previous message, with a similar
> >> workflow. Rather than having a bunch of threads accessing the directory to
> >> create that "thundering herd" of CPUs in __fsnotify_update_child_dentry_flags, I
> >> just started a lot of inotifywait tasks:
> >>
> >> 1. Create 100 million negative dentries in a dir
> >> 2. Use trace-cmd to watch __fsnotify_update_child_dentry_flags:
> >> trace-cmd start -p function_graph -l __fsnotify_update_child_dentry_flags
> >> sudo cat /sys/kernel/debug/tracing/trace_pipe
> >> 3. Run a lot of inotifywait tasks: for i in {1..10}; do inotifywait $dir & done
> >>
> >> With step #3, I see only one execution of __fsnotify_update_child_dentry_flags.
> >> Once that completes, all the inotifywait tasks say "Watches established".
> >> Similarly, once an access occurs in the directory, a single
> >> __fsnotify_update_child_dentry_flags execution occurs, and all the tasks exit.
> >> In short: it works great!
> >>
> >> However, while testing this, I've observed a dentry still in use warning during
> >> unmount of rpc_pipefs on the "nfs" dentry during shutdown. NFS is of course in
> >> use, and I assume that fsnotify must have been used to trigger this. The error
> >> is not there on mainline without my patch so it's definitely caused by this
> >> code. I'll continue debugging it but I wanted to share my first take on this so
> >> you could take a look.
> >>
> >> [ 1595.197339] BUG: Dentry 000000005f5e7197{i=67,n=nfs} still in use (2) [unmount of rpc_pipefs rpc_pipefs]
> >>
> >
> > Hmm, the assumption we made about partial stability of d_subdirs
> > under dir inode lock looks incorrect for rpc_pipefs.
> > None of the functions that update the rpc_pipefs dcache take the parent
> > inode lock.
>
> That may be, but I'm confused how that would trigger this issue. If I'm
> understanding correctly, this warning indicates a reference counting
> bug.

Yes.
On generic_shutdown_super() there should be no more
references to dentries.

>
> If __fsnotify_update_child_dentry_flags() had gone to sleep and the list
> were edited, then it seems like there could be only two possibilities
> that could cause bugs:
>
> 1. The dentry we slept holding a reference to was removed from the list,
> and maybe moved to a different one, or just removed. If that were the
> case, we're quite unlucky, because we'll start looping indefinitely as
> we'll never get back to the beginning of the list, or worse.
>
> 2. A dentry adjacent to the one we held a reference to was removed. In
> that case, our dentry's d_child pointers should get rearranged, and when
> we wake, we should see those updates and continue.
>
> In neither of those cases do I understand where we could have done a
> dget() unpaired with a dput(), which is what seemingly would trigger
> this issue.
>

I got the same impression.

> I'm probably wrong, but without understanding the mechanism behind the
> error, I'm not sure how to approach it.
>
> > The assumption looks incorrect for other pseudo fs as well.
> >
> > The other side of the coin is that we do not really need to worry
> > about walking a huge list of pseudo fs children.
> >
> > The question is how to classify those pseudo fs and whether there
> > are other cases like this that we missed.
> >
> > Perhaps having simple_dentry_operations is a good enough
> > clue, but perhaps it is not enough. I am not sure.
> >
> > It covers all the cases of pseudo fs that I know about, so you
> > can certainly use this clue to avoid going to sleep in the
> > update loop as a first approximation.
>
> I would worry that it would become an exercise of whack-a-mole.
> Allow/deny-listing certain filesystems for certain behavior seems scary.
>

Totally agree.

> > I can try to figure this out, but I prefer that Al will chime in to
> > provide reliable answers to those questions.
>
> I have a core dump from the warning (with panic_on_warn=1) and will see
> if I can trace or otherwise identify the exact mechanism myself.
>

Most likely the refcount was already leaked earlier, but
worth trying.

>
> Thanks for your detailed review of both the patches. I didn't get much
> time today to update the patches and test them. Your feedback looks very
> helpful though, and I'll hope to send out an updated revision tomorrow.
>
> In the absolute worst case (and I don't want to concede defeat just
> yet), keeping patch 1 without patch 2 (sleepable iteration) would still
> be a major win, since it resolves the thundering herd problem which is
> what compounds the problem of the long lists.
>

Makes sense.
Patch 1 logic is solid.

Hope my suggestions won't complicate things too much;
if they do, I am sure Jan will find a way to simplify ;)

Thanks,
Amir.

2022-10-21 01:11:55

by Stephen Brennan

[permalink] [raw]
Subject: Re: [PATCH 1/2] fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem

Amir Goldstein <[email protected]> writes:

> On Tue, Oct 18, 2022 at 7:12 AM Stephen Brennan
> <[email protected]> wrote:
>>
>> When an inode is interested in events on its children, it must set
>> DCACHE_FSNOTIFY_PARENT_WATCHED flag on all its children. Currently, when
>> the fsnotify connector is removed and i_fsnotify_mask becomes zero, we
>> lazily allow __fsnotify_parent() to do this the next time we see an
>> event on a child.
>>
>> However, if the list of children is very long (e.g., in the millions),
>> and lots of activity is occurring on the directory, then it's possible
>> for many CPUs to end up blocked on the inode spinlock in
>> __fsnotify_update_child_flags(). Each CPU will then redundantly iterate
>> over the very long list of children. This situation can cause soft
>> lockups.
>>
>> To avoid this, stop lazily updating child flags in __fsnotify_parent().
>> Protect the child flag update with i_rwsem held exclusive, to ensure
>> that we only iterate over the child list when it's absolutely necessary,
>> and even then, only once.
>>
>> Signed-off-by: Stephen Brennan <[email protected]>
>> ---
>>
>> Notes:
>>
>> It seems that there are two implementation options for this, regarding
>> what i_rwsem protects:
>>
>> 1. Both updates to i_fsnotify_mask, and the child dentry flags, or
>> 2. Only updates to the child dentry flags
>>
>> I wanted to do #1, but it got really tricky with fsnotify_put_mark(). We
>> don't want to hold the inode lock whenever we decrement the refcount,
>> but if we don't, then we're stuck holding a spinlock when the refcount
>> goes to zero, and we need to grab the inode rwsem to synchronize the
>> update to the child flags. I'm sure there's a way around this, but I
>> didn't keep going with it.
>>
>> With #1, as currently implemented, we have the unfortunate effect
>> that a mark can be added, see that no update is required, and
>> return, despite the fact that the flag update is still in progress on a
>> different CPU/thread. From our discussion, that seems to be the current
>> status quo, but I wanted to explicitly point that out. If we want to
>> move to #1, it should be possible with some work.
>
> I think the solution may be to store the state of children in conn
> like you suggested.
>
> See fsnotify_update_iref() and conn flag
> FSNOTIFY_CONN_FLAG_HAS_IREF.
>
> You can add a conn flag
> FSNOTIFY_CONN_FLAG_WATCHES_CHILDREN
> that caches the result of the last invocation of update children flags.
>
> For example, fsnotify_update_iref() becomes
> fsnotify_update_inode_conn_flags() and
> returns the inode if either the inode ref should be dropped
> or the children flags need to be updated (or both);
> maybe use some out argument to differentiate the cases.
> Same for fsnotify_detach_connector_from_object().
>
> Then, where fsnotify_drop_object() is called, for the
> case that inode children need to be updated,
> take inode_lock(), take connector spin lock
> to check if another thread has already done the update
> if not release spin lock, perform the update under inode lock
> and at the end, take spin lock again and set the
> FSNOTIFY_CONN_FLAG_WATCHES_CHILDREN
> connector flag.
>
> Not sure if it all works out... maybe

I did this for v2 and I think it has worked well, all threads seem to
block until the flags are updated on all dentries.

>>
>> fs/notify/fsnotify.c | 12 ++++++++--
>> fs/notify/mark.c | 55 ++++++++++++++++++++++++++++++++++----------
>> 2 files changed, 53 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
>> index 7974e91ffe13..e887a195983b 100644
>> --- a/fs/notify/fsnotify.c
>> +++ b/fs/notify/fsnotify.c
>> @@ -207,8 +207,16 @@ int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
>> parent = dget_parent(dentry);
>> p_inode = parent->d_inode;
>> p_mask = fsnotify_inode_watches_children(p_inode);
>> - if (unlikely(parent_watched && !p_mask))
>> - __fsnotify_update_child_dentry_flags(p_inode);
>> + if (unlikely(parent_watched && !p_mask)) {
>> + /*
>> + * Flag would be cleared soon by
>> + * __fsnotify_update_child_dentry_flags(), but as an
>> + * optimization, clear it now.
>> + */
>
> I think that we need to also take p_inode spin_lock here and
> check fsnotify_inode_watches_children() under lock
> otherwise, we could be clearing the WATCHED flag
> *after* __fsnotify_update_child_dentry_flags() had
> already set it, because you we not observe the change to
> p_inode mask.

I'm not sure I follow. The i_fsnotify_mask field isn't protected by the
p_inode spinlock. It isn't really protected at all, though it mainly
gets modified with the conn->lock held.

Wouldn't it be sufficient to do in a new helper:

spin_lock(&dentry->d_lock);
if (!fsnotify_inode_watches_children(p_inode))
dentry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
spin_unlock(&dentry->d_lock);

I'm sure I'm missing something about your comment. For the moment I left
it as is in the second version of the patch, we can discuss it more and
I can update it for a v3.

>
> I would consider renaming __fsnotify_update_child_dentry_flags()
> to __fsnotify_update_children_dentry_flags(struct inode *dir)
>
> and creating another inline helper for this call site called:
> fsnotify_update_child_dentry_flags(struct inode *dir, struct dentry *child)
>
>
>> + spin_lock(&dentry->d_lock);
>> + dentry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
>> + spin_unlock(&dentry->d_lock);
>> + }
>>
>> /*
>> * Include parent/name in notification either if some notification
>> diff --git a/fs/notify/mark.c b/fs/notify/mark.c
>> index c74ef947447d..da9f944fcbbb 100644
>> --- a/fs/notify/mark.c
>> +++ b/fs/notify/mark.c
>> @@ -184,15 +184,36 @@ static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
>> */
>> void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
>> {
>> + struct inode *inode = NULL;
>> + int watched_before, watched_after;
>> +
>> if (!conn)
>> return;
>>
>> - spin_lock(&conn->lock);
>> - __fsnotify_recalc_mask(conn);
>> - spin_unlock(&conn->lock);
>> - if (conn->type == FSNOTIFY_OBJ_TYPE_INODE)
>> - __fsnotify_update_child_dentry_flags(
>> - fsnotify_conn_inode(conn));
>> + if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
>> + /*
>> + * For inodes, we may need to update flags on the child
>> + * dentries. To ensure these updates occur exactly once,
>> + * synchronize the recalculation with the inode mutex.
>> + */
>> + inode = fsnotify_conn_inode(conn);
>> + spin_lock(&conn->lock);
>> + watched_before = fsnotify_inode_watches_children(inode);
>> + __fsnotify_recalc_mask(conn);
>> + watched_after = fsnotify_inode_watches_children(inode);
>> + spin_unlock(&conn->lock);
>> +
>> + inode_lock(inode);
>
> With the pattern that I suggested above, this if / else would
> be unified to code that looks something like this:
>
> spin_lock(&conn->lock);
> inode = __fsnotify_recalc_mask(conn);
> spin_unlock(&conn->lock);
>
> if (inode)
> fsnotify_update_children_dentry_flags(conn, inode);
>
> Where fsnotify_update_children_dentry_flags()
> takes inode lock around entire update and conn spin lock
> only around check and update of conn flags.
>
> FYI, at this time in the code, adding a mark or updating
> existing mark mask cannot result in the need to drop iref.
> That is the reason that return value of __fsnotify_recalc_mask()
> is not checked here.

For v3 I tried this with a new "flags" out variable and two flags - one
for requiring an iput(), and one for calling
fsnotify_update_children_dentry_flags(). As a result, I did stick a
WARN_ON_ONCE here, but it more or less looks just like this code :)
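
The two-flag out-parameter idea can be sketched like this (the flag names and helper are made up for illustration, not the actual kernel interface):

```c
#include <assert.h>

/* Hypothetical out-flags for __fsnotify_recalc_mask(), along the lines
 * discussed here: one bit tells the caller to drop the inode reference,
 * another that the children's dentry flags need a refresh. */
#define RECALC_DROP_IREF       0x1
#define RECALC_UPDATE_CHILDREN 0x2

static unsigned int recalc_out_flags(int had_iref, int wants_iref,
				     int watched_before, int watched_after)
{
	unsigned int out = 0;

	if (had_iref && !wants_iref)
		out |= RECALC_DROP_IREF;
	if (watched_before != watched_after)   /* watch state flipped */
		out |= RECALC_UPDATE_CHILDREN;
	return out;
}
```

The caller can then do the iput() and/or the child-flag update outside the connector spinlock, depending on which bits are set.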

>> + if ((watched_before && !watched_after) ||
>> + (!watched_before && watched_after)) {
>> + __fsnotify_update_child_dentry_flags(inode);
>> + }
>> + inode_unlock(inode);
>> + } else {
>> + spin_lock(&conn->lock);
>> + __fsnotify_recalc_mask(conn);
>> + spin_unlock(&conn->lock);
>> + }
>> }
>>
>> /* Free all connectors queued for freeing once SRCU period ends */
>> @@ -295,6 +316,8 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
>> struct fsnotify_mark_connector *conn = READ_ONCE(mark->connector);
>> void *objp = NULL;
>> unsigned int type = FSNOTIFY_OBJ_TYPE_DETACHED;
>> + struct inode *inode = NULL;
>> + int watched_before, watched_after;
>> bool free_conn = false;
>>
>> /* Catch marks that were actually never attached to object */
>> @@ -311,17 +334,31 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
>> if (!refcount_dec_and_lock(&mark->refcnt, &conn->lock))
>> return;
>>
>> + if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
>> + inode = fsnotify_conn_inode(conn);
>> + watched_before = fsnotify_inode_watches_children(inode);
>> + }
>> +
>> hlist_del_init_rcu(&mark->obj_list);
>> if (hlist_empty(&conn->list)) {
>> objp = fsnotify_detach_connector_from_object(conn, &type);
>> free_conn = true;
>> + watched_after = 0;
>> } else {
>> objp = __fsnotify_recalc_mask(conn);
>> type = conn->type;
>> + watched_after = fsnotify_inode_watches_children(inode);
>> }
>> WRITE_ONCE(mark->connector, NULL);
>> spin_unlock(&conn->lock);
>>
>> + if (inode) {
>> + inode_lock(inode);
>> + if (watched_before && !watched_after)
>> + __fsnotify_update_child_dentry_flags(inode);
>> + inode_unlock(inode);
>> + }
>> +
>> fsnotify_drop_object(type, objp);
>>
>
> Here as well something like:
> if (objp)
> fsnotify_update_children_dentry_flags(conn, obj);
>
> But we need to distinguish whether the inode ref needs to be dropped,
> the children's flags updated, or both.

With a flags out-param, it works well. I actually was able to stuff this
into fsnotify_drop_object, which was good because I had missed a whole
other function that can detach a connector from an inode
(fsnotify_destroy_marks()).

> Hope that this suggestion direction turns out to be useful and not
> a complete waste of time...
>
> Thanks,
> Amir.

2022-10-21 01:50:49

by Stephen Brennan

[permalink] [raw]
Subject: [PATCH v2 0/3] fsnotify: fix softlockups iterating over d_subdirs

Hi Amir, Jan, Al,

Here is my swing at v2 of this series. I've taken Amir's suggestion and
stored the flag state in the connector. There's one issue with that:
when the connector is disconnected, we lose the state information, and
we lose the mutual exclusion of conn->lock. It becomes possible for a
new connector to appear and start doing its own updates. Thankfully I'm
pretty confident that there's no case where it would actually be wrong.

I've tested this without the final patch (since that one triggered the
strange dentry refcount warning) and everything works great. Now that
(hopefully) the changes related to fsnotify connectors and things are
solidified, I'll try to look harder at the sleepable iteration, and see
if I can identify why that's not working, and hopefully solicit some
advice & feedback from Al.

There's definitely a few nits and cleanups to be done yet on the series.
Pretty sure I need to clean up the indentation and a few other
checkpatch oddities, so feel free to hit me with whatever changes you
want to see, however small :)

Stephen Brennan (3):
fsnotify: Use d_find_any_alias to get dentry associated with inode
fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
fsnotify: allow sleepable child flag update

fs/notify/fsnotify.c | 92 +++++++++++++++++++--------
fs/notify/fsnotify.h | 31 ++++++++-
fs/notify/mark.c | 106 ++++++++++++++++++++-----------
include/linux/fsnotify_backend.h | 8 +++
4 files changed, 175 insertions(+), 62 deletions(-)

--
2.34.1

2022-10-21 07:32:52

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 1/2] fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem


On Fri, Oct 21, 2022 at 3:33 AM Stephen Brennan
<[email protected]> wrote:
>
> Amir Goldstein <[email protected]> writes:
>
> > On Tue, Oct 18, 2022 at 7:12 AM Stephen Brennan
> > <[email protected]> wrote:
> >>
> >> When an inode is interested in events on its children, it must set
> >> DCACHE_FSNOTIFY_PARENT_WATCHED flag on all its children. Currently, when
> >> the fsnotify connector is removed and i_fsnotify_mask becomes zero, we
> >> lazily allow __fsnotify_parent() to do this the next time we see an
> >> event on a child.
> >>
> >> However, if the list of children is very long (e.g., in the millions),
> >> and lots of activity is occurring on the directory, then it's possible
> >> for many CPUs to end up blocked on the inode spinlock in
> >> __fsnotify_update_child_flags(). Each CPU will then redundantly iterate
> >> over the very long list of children. This situation can cause soft
> >> lockups.
> >>
> >> To avoid this, stop lazily updating child flags in __fsnotify_parent().
> >> Protect the child flag update with i_rwsem held exclusive, to ensure
> >> that we only iterate over the child list when it's absolutely necessary,
> >> and even then, only once.
> >>
> >> Signed-off-by: Stephen Brennan <[email protected]>
> >> ---
> >>
> >> Notes:
> >>
> >> It seems that there are two implementation options for this, regarding
> >> what i_rwsem protects:
> >>
> >> 1. Both updates to i_fsnotify_mask, and the child dentry flags, or
> >> 2. Only updates to the child dentry flags
> >>
> >> I wanted to do #1, but it got really tricky with fsnotify_put_mark(). We
> >> don't want to hold the inode lock whenever we decrement the refcount,
> >> but if we don't, then we're stuck holding a spinlock when the refcount
> >> goes to zero, and we need to grab the inode rwsem to synchronize the
> >> update to the child flags. I'm sure there's a way around this, but I
> >> didn't keep going with it.
> >>
> >> With #1, as currently implemented, we have the unfortunate effect
> >> that a mark can be added, can see that no update is required, and
> >> return, despite the fact that the flag update is still in progress on a
> >> different CPU/thread. From our discussion, that seems to be the current
> >> status quo, but I wanted to explicitly point that out. If we want to
> >> move to #1, it should be possible with some work.
> >
> > I think the solution may be to store the state of children in conn
> > like you suggested.
> >
> > See fsnotify_update_iref() and conn flag
> > FSNOTIFY_CONN_FLAG_HAS_IREF.
> >
> > You can add a conn flag
> > FSNOTIFY_CONN_FLAG_WATCHES_CHILDREN
> > that caches the result of the last invocation of update children flags.
> >
> > For example, fsnotify_update_iref() becomes
> > fsnotify_update_inode_conn_flags() and
> > returns inode if either inode ref should be dropped
> > or if children flags need to be updated (or both)
> > maybe use some out argument to differentiate the cases.
> > Same for fsnotify_detach_connector_from_object().
> >
> > Then, where fsnotify_drop_object() is called, for the
> > case that inode children need to be updated,
> > take inode_lock(), take connector spin lock
> > to check if another thread has already done the update
> > if not release spin lock, perform the update under inode lock
> > and at the end, take spin lock again and set the
> > FSNOTIFY_CONN_FLAG_WATCHES_CHILDREN
> > connector flag.
> >
> > Not sure if it all works out... maybe
>
> I did this for v2 and I think it has worked well, all threads seem to
> block until the flags are updated on all dentries.
>

It did work out pretty well. v2 looks nice :)

> >>
> >> fs/notify/fsnotify.c | 12 ++++++++--
> >> fs/notify/mark.c | 55 ++++++++++++++++++++++++++++++++++----------
> >> 2 files changed, 53 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> >> index 7974e91ffe13..e887a195983b 100644
> >> --- a/fs/notify/fsnotify.c
> >> +++ b/fs/notify/fsnotify.c
> >> @@ -207,8 +207,16 @@ int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
> >> parent = dget_parent(dentry);
> >> p_inode = parent->d_inode;
> >> p_mask = fsnotify_inode_watches_children(p_inode);
> >> - if (unlikely(parent_watched && !p_mask))
> >> - __fsnotify_update_child_dentry_flags(p_inode);
> >> + if (unlikely(parent_watched && !p_mask)) {
> >> + /*
> >> + * Flag would be cleared soon by
> >> + * __fsnotify_update_child_dentry_flags(), but as an
> >> + * optimization, clear it now.
> >> + */
> >
> > I think that we need to also take p_inode spin_lock here and
> > check fsnotify_inode_watches_children() under lock
> > otherwise, we could be clearing the WATCHED flag
> > *after* __fsnotify_update_child_dentry_flags() had
> > already set it, because we might not observe the change to
> > p_inode mask.
>
> I'm not sure I follow. The i_fsnotify_mask field isn't protected by the
> p_inode spinlock. It isn't really protected at all, though it mainly
> gets modified with the conn->lock held.
>

Right.

> Wouldn't it be sufficient to do in a new helper:
>
> spin_lock(&dentry->d_lock);
> if (!fsnotify_inode_watches_children(p_inode))
> dentry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
> spin_unlock(&dentry->d_lock);
>

Yes, I think this is better.

I am not sure if that data race can lead to the wrong state of dentry
flags forever on upstream, but I'm pretty sure that it can on v1.

> I'm sure I'm missing something about your comment. For the moment I left
> it as is in the second version of the patch, we can discuss it more and
> I can update it for a v3.
>

v1 doesn't have a memory barrier between loading the value into
parent_watched and into p_mask, so the check wasn't safe.

In v2, __fsnotify_update_child_dentry_flags() rechecks
fsnotify_inode_watches_children() after the barrier provided
by the dentry spin lock, so the "lockless" check is safe:

Writer:
store i_fsnotify_mask
smp_store_release(d_lock) // spin unlock
store d_flags

Reader:
load d_flags
smp_load_acquire(d_lock) // spin_lock
load i_fsnotify_mask

Best resource I like for refreshing my "lockless" skills:
https://lwn.net/Articles/844224/
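
The writer/reader pairing above can be illustrated with C11 atomics standing in for the ordering that d_lock provides (names and the value 42 are illustrative only):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Sketch of the release/acquire pairing: the writer publishes "mask"
 * before a release store to "flag"; a reader that observes the flag via
 * an acquire load is then guaranteed to also see the new mask. */
static int mask;          /* plays the role of i_fsnotify_mask */
static atomic_bool flag;  /* plays the role of the d_flags bit  */

static void writer(void)
{
	mask = 42;                     /* store i_fsnotify_mask */
	atomic_store_explicit(&flag, true, memory_order_release);
}

static int reader(void)
{
	if (atomic_load_explicit(&flag, memory_order_acquire))
		return mask;           /* ordered after the flag load */
	return -1;                     /* flag not yet visible */
}
```

In the kernel, the release/acquire halves come from the spin unlock/lock of d_lock rather than explicit atomics, but the visibility guarantee is the same shape.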

> >
> > I would consider renaming __fsnotify_update_child_dentry_flags()
> > to __fsnotify_update_children_dentry_flags(struct inode *dir)
> >
> > and creating another inline helper for this call site called:
> > fsnotify_update_child_dentry_flags(struct inode *dir, struct dentry *child)
> >
> >
> >> + spin_lock(&dentry->d_lock);
> >> + dentry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
> >> + spin_unlock(&dentry->d_lock);
> >> + }
> >>
> >> /*
> >> * Include parent/name in notification either if some notification
> >> diff --git a/fs/notify/mark.c b/fs/notify/mark.c
> >> index c74ef947447d..da9f944fcbbb 100644
> >> --- a/fs/notify/mark.c
> >> +++ b/fs/notify/mark.c
> >> @@ -184,15 +184,36 @@ static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
> >> */
> >> void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
> >> {
> >> + struct inode *inode = NULL;
> >> + int watched_before, watched_after;
> >> +
> >> if (!conn)
> >> return;
> >>
> >> - spin_lock(&conn->lock);
> >> - __fsnotify_recalc_mask(conn);
> >> - spin_unlock(&conn->lock);
> >> - if (conn->type == FSNOTIFY_OBJ_TYPE_INODE)
> >> - __fsnotify_update_child_dentry_flags(
> >> - fsnotify_conn_inode(conn));
> >> + if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
> >> + /*
> >> + * For inodes, we may need to update flags on the child
> >> + * dentries. To ensure these updates occur exactly once,
> >> + * synchronize the recalculation with the inode mutex.
> >> + */
> >> + inode = fsnotify_conn_inode(conn);
> >> + spin_lock(&conn->lock);
> >> + watched_before = fsnotify_inode_watches_children(inode);
> >> + __fsnotify_recalc_mask(conn);
> >> + watched_after = fsnotify_inode_watches_children(inode);
> >> + spin_unlock(&conn->lock);
> >> +
> >> + inode_lock(inode);
> >
> > With the pattern that I suggested above, this if / else would
> > be unified to code that looks something like this:
> >
> > spin_lock(&conn->lock);
> > inode = __fsnotify_recalc_mask(conn);
> > spin_unlock(&conn->lock);
> >
> > if (inode)
> > fsnotify_update_children_dentry_flags(conn, inode);
> >
> > Where fsnotify_update_children_dentry_flags()
> > takes inode lock around entire update and conn spin lock
> > only around check and update of conn flags.
> >
> > FYI, at this time in the code, adding a mark or updating
> > existing mark mask cannot result in the need to drop iref.
> > That is the reason that return value of __fsnotify_recalc_mask()
> > is not checked here.
>
> For v3 I tried this with a new "flags" out variable and two flags - one
> for requiring an iput(), and one for calling
> fsnotify_update_children_dentry_flags(). As a result, I did stick a
> WARN_ON_ONCE here, but it more or less looks just like this code :)
>

Cool.

> >> + if ((watched_before && !watched_after) ||
> >> + (!watched_before && watched_after)) {
> >> + __fsnotify_update_child_dentry_flags(inode);
> >> + }
> >> + inode_unlock(inode);
> >> + } else {
> >> + spin_lock(&conn->lock);
> >> + __fsnotify_recalc_mask(conn);
> >> + spin_unlock(&conn->lock);
> >> + }
> >> }
> >>
> >> /* Free all connectors queued for freeing once SRCU period ends */
> >> @@ -295,6 +316,8 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
> >> struct fsnotify_mark_connector *conn = READ_ONCE(mark->connector);
> >> void *objp = NULL;
> >> unsigned int type = FSNOTIFY_OBJ_TYPE_DETACHED;
> >> + struct inode *inode = NULL;
> >> + int watched_before, watched_after;
> >> bool free_conn = false;
> >>
> >> /* Catch marks that were actually never attached to object */
> >> @@ -311,17 +334,31 @@ void fsnotify_put_mark(struct fsnotify_mark *mark)
> >> if (!refcount_dec_and_lock(&mark->refcnt, &conn->lock))
> >> return;
> >>
> >> + if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
> >> + inode = fsnotify_conn_inode(conn);
> >> + watched_before = fsnotify_inode_watches_children(inode);
> >> + }
> >> +
> >> hlist_del_init_rcu(&mark->obj_list);
> >> if (hlist_empty(&conn->list)) {
> >> objp = fsnotify_detach_connector_from_object(conn, &type);
> >> free_conn = true;
> >> + watched_after = 0;
> >> } else {
> >> objp = __fsnotify_recalc_mask(conn);
> >> type = conn->type;
> >> + watched_after = fsnotify_inode_watches_children(inode);
> >> }
> >> WRITE_ONCE(mark->connector, NULL);
> >> spin_unlock(&conn->lock);
> >>
> >> + if (inode) {
> >> + inode_lock(inode);
> >> + if (watched_before && !watched_after)
> >> + __fsnotify_update_child_dentry_flags(inode);
> >> + inode_unlock(inode);
> >> + }
> >> +
> >> fsnotify_drop_object(type, objp);
> >>
> >
> > Here as well something like:
> > if (objp)
> > fsnotify_update_children_dentry_flags(conn, obj);
> >
> > But we need to distinguish whether the inode ref needs to be dropped,
> > the children's flags updated, or both.
>
> With a flags out-param, it works well. I actually was able to stuff this
> into fsnotify_drop_object, which was good because I had missed a whole
> other function that can detach a connector from an inode
> (fsnotify_destroy_marks()).
>

Haha, this call is coming from __fsnotify_inode_delete()
the evicted dir inode cannot have any children in dcache,
but it's good to have robust code ;)

Thanks,
Amir.

2022-10-27 08:24:12

by Yujie Liu

[permalink] [raw]
Subject: Re: [PATCH 2/2] fsnotify: allow sleepable child flag update

Greeting,

FYI, we noticed WARNING:possible_recursive_locking_detected due to commit (built with clang-14):

commit: bed2685d9557ff9a7705f4172651a138e5f705af ("[PATCH 2/2] fsnotify: allow sleepable child flag update")
url: https://github.com/intel-lab-lkp/linux/commits/Stephen-Brennan/fsnotify-Protect-i_fsnotify_mask-and-child-flags-with-inode-rwsem/20221018-131326
base: https://git.kernel.org/cgit/linux/kernel/git/jack/linux-fs.git fsnotify
patch link: https://lore.kernel.org/linux-fsdevel/[email protected]
patch subject: [PATCH 2/2] fsnotify: allow sleepable child flag update

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


[ 31.979147][ T1]
[ 31.979446][ T1] ============================================
[ 31.980051][ T1] WARNING: possible recursive locking detected
[ 31.980674][ T1] 6.0.0-rc4-00066-gbed2685d9557 #1 Not tainted
[ 31.981286][ T1] --------------------------------------------
[ 31.981889][ T1] systemd/1 is trying to acquire lock:
[ 31.982432][ T1] ffff88813f542510 (&dentry->d_lock){+.+.}-{2:2}, at: lockref_get+0xd/0x80
[ 31.983314][ T1]
[ 31.983314][ T1] but task is already holding lock:
[ 31.984040][ T1] ffff888100441b18 (&dentry->d_lock){+.+.}-{2:2}, at: __fsnotify_update_child_dentry_flags+0x85/0x2c0
[ 31.985132][ T1]
[ 31.985132][ T1] other info that might help us debug this:
[ 31.985967][ T1] Possible unsafe locking scenario:
[ 31.985967][ T1]
[ 31.986694][ T1] CPU0
[ 31.987025][ T1] ----
[ 31.987366][ T1] lock(&dentry->d_lock);
[ 31.987828][ T1] lock(&dentry->d_lock);
[ 31.988283][ T1]
[ 31.988283][ T1] *** DEADLOCK ***
[ 31.988283][ T1]
[ 31.989061][ T1] May be due to missing lock nesting notation
[ 31.989061][ T1]
[ 31.989888][ T1] 3 locks held by systemd/1:
[ 31.990361][ T1] #0: ffff88815249e128 (&group->mark_mutex){+.+.}-{3:3}, at: __x64_sys_inotify_add_watch+0x2fc/0xc00
[ 31.991473][ T1] #1: ffff888100480af8 (&sb->s_type->i_mutex_key){++++}-{3:3}, at: fsnotify_recalc_mask+0xf1/0x1c0
[ 31.992528][ T1] #2: ffff888100441b18 (&dentry->d_lock){+.+.}-{2:2}, at: __fsnotify_update_child_dentry_flags+0x85/0x2c0
[ 31.993671][ T1]
[ 31.993671][ T1] stack backtrace:
[ 31.994260][ T1] CPU: 0 PID: 1 Comm: systemd Not tainted 6.0.0-rc4-00066-gbed2685d9557 #1 1afcec0fe797aeed18cb95313bac4a75fb6852d3
[ 31.995440][ T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 31.996441][ T1] Call Trace:
[ 31.996791][ T1] <TASK>
[ 31.997101][ T1] dump_stack_lvl+0x6a/0x100
[ 31.997590][ T1] __lock_acquire+0x1110/0x7480
[ 31.998105][ T1] ? mark_lock+0x9a/0x380
[ 31.998560][ T1] ? mark_held_locks+0xad/0x1c0
[ 31.999056][ T1] ? lockdep_hardirqs_on_prepare+0x1a8/0x400
[ 31.999650][ T1] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 32.000276][ T1] lock_acquire+0x177/0x480
[ 32.000739][ T1] ? lockref_get+0xd/0x80
[ 32.001178][ T1] _raw_spin_lock+0x2f/0x40
[ 32.001656][ T1] ? lockref_get+0xd/0x80
[ 32.002093][ T1] lockref_get+0xd/0x80
[ 32.002529][ T1] __fsnotify_update_child_dentry_flags+0x142/0x2c0
[ 32.003178][ T1] fsnotify_recalc_mask+0x126/0x1c0
[ 32.003711][ T1] fsnotify_add_mark_locked+0xd9e/0x1280
[ 32.004292][ T1] __x64_sys_inotify_add_watch+0x755/0xc00
[ 32.004898][ T1] ? syscall_enter_from_user_mode+0x26/0x180
[ 32.005660][ T1] do_syscall_64+0x6d/0xc0
[ 32.006125][ T1] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 32.006735][ T1] RIP: 0033:0x7f839dd0a8f7
[ 32.007188][ T1] Code: f0 ff ff 73 01 c3 48 8b 0d 96 f5 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 fe 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 f5 0b 00 f7 d8 64 89 01 48
[ 32.009103][ T1] RSP: 002b:00007ffe52095c98 EFLAGS: 00000202 ORIG_RAX: 00000000000000fe
[ 32.009945][ T1] RAX: ffffffffffffffda RBX: 0000555bc72cf930 RCX: 00007f839dd0a8f7
[ 32.010685][ T1] RDX: 0000000000000d84 RSI: 0000555bc72cf930 RDI: 000000000000001a
[ 32.011469][ T1] RBP: 0000555bc72cf931 R08: 00000000fe000000 R09: 0000555bc72a1e90
[ 32.012266][ T1] R10: 00007ffe52095c2c R11: 0000000000000202 R12: 0000000000000000
[ 32.012976][ T1] R13: 0000555bc72a1e90 R14: 0000000000000d84 R15: 0000555bc72cf930
[ 32.013705][ T1] </TASK>


If you fix the issue, kindly add following tag
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-lkp/[email protected]


To reproduce:

# build kernel
cd linux
cp config-6.0.0-rc4-00066-gbed2685d9557 .config
make HOSTCC=clang-14 CC=clang-14 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=clang-14 CC=clang-14 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.


--
0-DAY CI Kernel Test Service
https://01.org/lkp


Attachments:
(No filename) (5.53 kB)
config-6.0.0-rc4-00066-gbed2685d9557 (172.74 kB)
job-script (4.98 kB)
dmesg.xz (55.66 kB)
Download all attachments

2022-10-27 09:03:00

by Yujie Liu

[permalink] [raw]
Subject: Re: [PATCH 2/2] fsnotify: allow sleepable child flag update

On Thu, Oct 27, 2022 at 03:50:17PM +0800, kernel test robot wrote:
> Greeting,
>
> FYI, we noticed WARNING:possible_recursive_locking_detected due to commit (built with clang-14):
>
> commit: bed2685d9557ff9a7705f4172651a138e5f705af ("[PATCH 2/2] fsnotify: allow sleepable child flag update")
> url: https://github.com/intel-lab-lkp/linux/commits/Stephen-Brennan/fsnotify-Protect-i_fsnotify_mask-and-child-flags-with-inode-rwsem/20221018-131326
> base: https://git.kernel.org/cgit/linux/kernel/git/jack/linux-fs.git fsnotify
> patch link: https://lore.kernel.org/linux-fsdevel/[email protected]
> patch subject: [PATCH 2/2] fsnotify: allow sleepable child flag update
>
> in testcase: boot
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):

Sorry, this report is for the v1 patch, which seems to be obsolete now.
Please kindly check the details in the report; if the issue has already
been fixed in v2, please ignore this report. Thanks.

--
Best Regards,
Yujie

> [ 31.979147][ T1]
> [ 31.979446][ T1] ============================================
> [ 31.980051][ T1] WARNING: possible recursive locking detected
> [ 31.980674][ T1] 6.0.0-rc4-00066-gbed2685d9557 #1 Not tainted
> [ 31.981286][ T1] --------------------------------------------
> [ 31.981889][ T1] systemd/1 is trying to acquire lock:
> [ 31.982432][ T1] ffff88813f542510 (&dentry->d_lock){+.+.}-{2:2}, at: lockref_get+0xd/0x80
> [ 31.983314][ T1]
> [ 31.983314][ T1] but task is already holding lock:
> [ 31.984040][ T1] ffff888100441b18 (&dentry->d_lock){+.+.}-{2:2}, at: __fsnotify_update_child_dentry_flags+0x85/0x2c0
> [ 31.985132][ T1]
> [ 31.985132][ T1] other info that might help us debug this:
> [ 31.985967][ T1] Possible unsafe locking scenario:
> [ 31.985967][ T1]
> [ 31.986694][ T1] CPU0
> [ 31.987025][ T1] ----
> [ 31.987366][ T1] lock(&dentry->d_lock);
> [ 31.987828][ T1] lock(&dentry->d_lock);
> [ 31.988283][ T1]
> [ 31.988283][ T1] *** DEADLOCK ***
> [ 31.988283][ T1]
> [ 31.989061][ T1] May be due to missing lock nesting notation
> [ 31.989061][ T1]
> [ 31.989888][ T1] 3 locks held by systemd/1:
> [ 31.990361][ T1] #0: ffff88815249e128 (&group->mark_mutex){+.+.}-{3:3}, at: __x64_sys_inotify_add_watch+0x2fc/0xc00
> [ 31.991473][ T1] #1: ffff888100480af8 (&sb->s_type->i_mutex_key){++++}-{3:3}, at: fsnotify_recalc_mask+0xf1/0x1c0
> [ 31.992528][ T1] #2: ffff888100441b18 (&dentry->d_lock){+.+.}-{2:2}, at: __fsnotify_update_child_dentry_flags+0x85/0x2c0
> [ 31.993671][ T1]
> [ 31.993671][ T1] stack backtrace:
> [ 31.994260][ T1] CPU: 0 PID: 1 Comm: systemd Not tainted 6.0.0-rc4-00066-gbed2685d9557 #1 1afcec0fe797aeed18cb95313bac4a75fb6852d3
> [ 31.995440][ T1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
> [ 31.996441][ T1] Call Trace:
> [ 31.996791][ T1] <TASK>
> [ 31.997101][ T1] dump_stack_lvl+0x6a/0x100
> [ 31.997590][ T1] __lock_acquire+0x1110/0x7480
> [ 31.998105][ T1] ? mark_lock+0x9a/0x380
> [ 31.998560][ T1] ? mark_held_locks+0xad/0x1c0
> [ 31.999056][ T1] ? lockdep_hardirqs_on_prepare+0x1a8/0x400
> [ 31.999650][ T1] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [ 32.000276][ T1] lock_acquire+0x177/0x480
> [ 32.000739][ T1] ? lockref_get+0xd/0x80
> [ 32.001178][ T1] _raw_spin_lock+0x2f/0x40
> [ 32.001656][ T1] ? lockref_get+0xd/0x80
> [ 32.002093][ T1] lockref_get+0xd/0x80
> [ 32.002529][ T1] __fsnotify_update_child_dentry_flags+0x142/0x2c0
> [ 32.003178][ T1] fsnotify_recalc_mask+0x126/0x1c0
> [ 32.003711][ T1] fsnotify_add_mark_locked+0xd9e/0x1280
> [ 32.004292][ T1] __x64_sys_inotify_add_watch+0x755/0xc00
> [ 32.004898][ T1] ? syscall_enter_from_user_mode+0x26/0x180
> [ 32.005660][ T1] do_syscall_64+0x6d/0xc0
> [ 32.006125][ T1] entry_SYSCALL_64_after_hwframe+0x46/0xb0
> [ 32.006735][ T1] RIP: 0033:0x7f839dd0a8f7
> [ 32.007188][ T1] Code: f0 ff ff 73 01 c3 48 8b 0d 96 f5 0b 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 66 90 b8 fe 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 f5 0b 00 f7 d8 64 89 01 48
> [ 32.009103][ T1] RSP: 002b:00007ffe52095c98 EFLAGS: 00000202 ORIG_RAX: 00000000000000fe
> [ 32.009945][ T1] RAX: ffffffffffffffda RBX: 0000555bc72cf930 RCX: 00007f839dd0a8f7
> [ 32.010685][ T1] RDX: 0000000000000d84 RSI: 0000555bc72cf930 RDI: 000000000000001a
> [ 32.011469][ T1] RBP: 0000555bc72cf931 R08: 00000000fe000000 R09: 0000555bc72a1e90
> [ 32.012266][ T1] R10: 00007ffe52095c2c R11: 0000000000000202 R12: 0000000000000000
> [ 32.012976][ T1] R13: 0000555bc72a1e90 R14: 0000000000000d84 R15: 0000555bc72cf930
> [ 32.013705][ T1] </TASK>
>
>
> If you fix the issue, kindly add following tag
> | Reported-by: kernel test robot <[email protected]>
> | Link: https://lore.kernel.org/oe-lkp/[email protected]
>
>
> To reproduce:
>
> # build kernel
> cd linux
> cp config-6.0.0-rc4-00066-gbed2685d9557 .config
> make HOSTCC=clang-14 CC=clang-14 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
> make HOSTCC=clang-14 CC=clang-14 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
> cd <mod-install-dir>
> find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz
>
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email
>
> # if come across any failure that blocks the test,
> # please remove ~/.lkp and /lkp dir to run from a clean state.
>
>

2022-10-27 22:27:03

by Stephen Brennan

[permalink] [raw]
Subject: Re: [PATCH 0/2] fsnotify: fix softlockups iterating over d_subdirs

Amir Goldstein <[email protected]> writes:
> On Wed, Oct 19, 2022 at 2:52 AM Stephen Brennan
> <[email protected]> wrote:
>>
>> Amir Goldstein <[email protected]> writes:
>> > On Tue, Oct 18, 2022 at 7:12 AM Stephen Brennan
>> > <[email protected]> wrote:
>> >>
>> >> Hi Jan, Amir, Al,
>> >>
>> >> Here's my first shot at implementing what we discussed. I tested it using the
>> >> negative dentry creation tool I mentioned in my previous message, with a similar
>> >> workflow. Rather than having a bunch of threads accessing the directory to
>> >> create that "thundering herd" of CPUs in __fsnotify_update_child_dentry_flags, I
>> >> just started a lot of inotifywait tasks:
>> >>
>> >> 1. Create 100 million negative dentries in a dir
>> >> 2. Use trace-cmd to watch __fsnotify_update_child_dentry_flags:
>> >> trace-cmd start -p function_graph -l __fsnotify_update_child_dentry_flags
>> >> sudo cat /sys/kernel/debug/tracing/trace_pipe
>> >> 3. Run a lot of inotifywait tasks: for i in {1..10}; do inotifywait $dir & done
>> >>
>> >> With step #3, I see only one execution of __fsnotify_update_child_dentry_flags.
>> >> Once that completes, all the inotifywait tasks say "Watches established".
>> >> Similarly, once an access occurs in the directory, a single
>> >> __fsnotify_update_child_dentry_flags execution occurs, and all the tasks exit.
>> >> In short: it works great!
>> >>
>> >> However, while testing this, I've observed a dentry still in use warning during
>> >> unmount of rpc_pipefs on the "nfs" dentry during shutdown. NFS is of course in
>> >> use, and I assume that fsnotify must have been used to trigger this. The error
>> >> is not there on mainline without my patch so it's definitely caused by this
>> >> code. I'll continue debugging it but I wanted to share my first take on this so
>> >> you could take a look.
>> >>
>> >> [ 1595.197339] BUG: Dentry 000000005f5e7197{i=67,n=nfs} still in use (2) [unmount of rpc_pipefs rpc_pipefs]
>> >>
>> >
>> > Hmm, the assumption we made about partial stability of d_subdirs
>> > under dir inode lock looks incorrect for rpc_pipefs.
>> > None of the functions that update the rpc_pipefs dcache take the parent
>> > inode lock.
>>
>> That may be, but I'm confused how that would trigger this issue. If I'm
>> understanding correctly, this warning indicates a reference counting
>> bug.
>
> Yes.
> On generic_shutdown_super() there should be no more
> references to dentries.
>
>>
>> If __fsnotify_update_child_dentry_flags() had gone to sleep and the list
>> were edited, then it seems like there could be only two possibilities
>> that could cause bugs:
>>
>> 1. The dentry we slept holding a reference to was removed from the list,
>> and maybe moved to a different one, or just removed. If that were the
>> case, we're quite unlucky, because we'll start looping indefinitely as
>> we'll never get back to the beginning of the list, or worse.
>>
>> 2. A dentry adjacent to the one we held a reference to was removed. In
>> that case, our dentry's d_child pointers should get rearranged, and when
>> we wake, we should see those updates and continue.
>>
>> In neither of those cases do I understand where we could have done a
>> dget() unpaired with a dput(), which is what seemingly would trigger
>> this issue.
>>
>
> I got the same impression.

Well I feel stupid. The reason behind this seems to be... that
d_find_any_alias() returns a reference to the dentry, and I promptly
leaked that. I'll have it fixed in v3 which I'm going through testing
now.

Stephen

2022-10-27 22:40:40

by Stephen Brennan

[permalink] [raw]
Subject: Re: [PATCH 2/2] fsnotify: allow sleepable child flag update

Yujie Liu <[email protected]> writes:
> On Thu, Oct 27, 2022 at 03:50:17PM +0800, kernel test robot wrote:
>> Greeting,
>>
>> FYI, we noticed WARNING:possible_recursive_locking_detected due to commit (built with clang-14):
>>
>> commit: bed2685d9557ff9a7705f4172651a138e5f705af ("[PATCH 2/2] fsnotify: allow sleepable child flag update")
>> url: https://github.com/intel-lab-lkp/linux/commits/Stephen-Brennan/fsnotify-Protect-i_fsnotify_mask-and-child-flags-with-inode-rwsem/20221018-131326
>> base: https://git.kernel.org/cgit/linux/kernel/git/jack/linux-fs.git fsnotify
>> patch link: https://lore.kernel.org/linux-fsdevel/[email protected]
>> patch subject: [PATCH 2/2] fsnotify: allow sleepable child flag update
>>
>> in testcase: boot
>>
>> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>>
>> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
> Sorry, this report is for the v1 patch which seems to be obsolete now.
> Please kindly check the details in the report; if the issue has already
> been fixed in v2, please ignore this report. Thanks.

Thanks for the message, I'm looking deeper into it now. If it happened
on v1, it may very well occur on v2 as well.

Thanks,
Stephen

2022-10-28 00:53:06

by Stephen Brennan

[permalink] [raw]
Subject: [PATCH v3 0/3] fsnotify: fix softlockups iterating over d_subdirs

Hi all,

Here is v3 of the patch series. I've taken all of the feedback,
thanks Amir, Christian, Hillf, et al. Differences are noted in each
patch.

I caught an obvious and silly dentry reference leak: d_find_any_alias()
returns a reference, which I never called dput() on. With that change, I
no longer see the rpc_pipefs issue, but I do think I need more testing
and thinking through the third patch. Al, I'd love your feedback on that
one especially.

Thanks,
Stephen

Stephen Brennan (3):
fsnotify: Use d_find_any_alias to get dentry associated with inode
fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
fsnotify: allow sleepable child flag update

fs/notify/fsnotify.c | 115 +++++++++++++++++++++-------
fs/notify/fsnotify.h | 13 +++-
fs/notify/mark.c | 124 ++++++++++++++++++++-----------
include/linux/fsnotify_backend.h | 1 +
4 files changed, 185 insertions(+), 68 deletions(-)

--
2.34.1


2022-10-28 09:06:07

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 0/2] fsnotify: fix softlockups iterating over d_subdirs

>
> Well I feel stupid. The reason behind this seems to be... that
> d_find_any_alias() returns a reference to the dentry, and I promptly
> leaked that. I'll have it fixed in v3 which I'm going through testing
> now.
>

I reckon that if you had run the LTP fsnotify tests you would have seen
this warning a lot more often, instead of just on one random pseudo
filesystem that some process is probably setting a watch on...

You should run the existing LTP test to check for regressions.
The fanotify/inotify test cases in LTP are easy to run, for example:
run make in testcases/kernel/syscalls/fanotify and execute the individual
./fanotify* executables.

If you point me to a branch, I can run the tests until you get
your LTP setup ready.

Thanks,
Amir.

2022-11-01 18:24:49

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] fsnotify: fix softlockups iterating over d_subdirs

Hi Stephen!

On Thu 27-10-22 17:10:13, Stephen Brennan wrote:
> Here is v3 of the patch series. I've taken all of the feedback,
> thanks Amir, Christian, Hilf, et al. Differences are noted in each
> patch.
>
> I caught an obvious and silly dentry reference leak: d_find_any_alias()
> returns a reference, which I never called dput() on. With that change, I
> no longer see the rpc_pipefs issue, but I do think I need more testing
> and thinking through the third patch. Al, I'd love your feedback on that
> one especially.
>
> Thanks,
> Stephen
>
> Stephen Brennan (3):
> fsnotify: Use d_find_any_alias to get dentry associated with inode
> fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
> fsnotify: allow sleepable child flag update

Thanks for the patches Stephen and I'm sorry for replying somewhat late.
The first patch is a nobrainer. The other two patches ... complicate things
somewhat more than I'd like. I guess I can live with them if we don't find
a better solution, but I'd like to discuss alternatives a bit more.

So what would happen if we just clear DCACHE_FSNOTIFY_PARENT_WATCHED in
__fsnotify_parent() for the dentry which triggered the event and does not
have watched parent anymore and never bother with full children walk? I
suppose your contention problems will be gone, we'll just pay the price of
dget_parent() + fsnotify_inode_watches_children() for each child that
falsely triggers instead of for only one. Maybe that's not too bad? After
all, any event up to this moment triggered this overhead as well...

Am I missing something? AFAIU this would allow us to avoid the games with
the new connector flag etc... We would probably still need to avoid
softlockups when setting the flag DCACHE_FSNOTIFY_PARENT_WATCHED but that
should be much simpler (we could use i_rwsem trick like you do).
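
A userspace toy model of that lazy-clearing idea (all names here are
illustrative stand-ins, not the kernel's) might look like:

```c
#include <assert.h>
#include <stdbool.h>

#define PARENT_WATCHED 0x1u

struct toy_inode { unsigned int watch_mask; };

struct toy_dentry {
	unsigned int flags;
	struct toy_dentry *parent;
	struct toy_inode *inode;
};

static bool inode_watches_children(const struct toy_inode *inode)
{
	return inode->watch_mask != 0;
}

/* Event path: if the cached flag is set but the parent no longer has
 * watches, clear the flag on this one dentry and skip reporting, rather
 * than ever walking the whole d_subdirs list when the watch goes away. */
static bool toy_fsnotify_parent(struct toy_dentry *dentry)
{
	if (!(dentry->flags & PARENT_WATCHED))
		return false;				/* fast path */
	if (!inode_watches_children(dentry->parent->inode)) {
		dentry->flags &= ~PARENT_WATCHED;	/* lazy clear */
		return false;
	}
	return true;					/* report to parent */
}
```

Each child that falsely triggers pays the parent-lookup cost once, then
falls back to the fast path.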

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-11-01 21:06:07

by Stephen Brennan

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] fsnotify: fix softlockups iterating over d_subdirs

Hi Jan,

Jan Kara <[email protected]> writes:
> Hi Stephen!
>
> On Thu 27-10-22 17:10:13, Stephen Brennan wrote:
>> Here is v3 of the patch series. I've taken all of the feedback,
>> thanks Amir, Christian, Hilf, et al. Differences are noted in each
>> patch.
>>
>> I caught an obvious and silly dentry reference leak: d_find_any_alias()
>> returns a reference, which I never called dput() on. With that change, I
>> no longer see the rpc_pipefs issue, but I do think I need more testing
>> and thinking through the third patch. Al, I'd love your feedback on that
>> one especially.
>>
>> Thanks,
>> Stephen
>>
>> Stephen Brennan (3):
>> fsnotify: Use d_find_any_alias to get dentry associated with inode
>> fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
>> fsnotify: allow sleepable child flag update
>
> Thanks for the patches Stephen and I'm sorry for replying somewhat late.

Absolutely no worries, these things take time. Thanks for taking a look!

> The first patch is a nobrainer. The other two patches ... complicate things
> somewhat more complicated than I'd like. I guess I can live with them if we
> don't find a better solution but I'd like to discuss a bit more about
> alternatives.

Understood!

> So what would happen if we just clear DCACHE_FSNOTIFY_PARENT_WATCHED in
> __fsnotify_parent() for the dentry which triggered the event and does not
> have watched parent anymore and never bother with full children walk? I
> suppose your contention problems will be gone, we'll just pay the price of
> dget_parent() + fsnotify_inode_watches_children() for each child that
> falsely triggers instead of for only one. Maybe that's not too bad? After
> all any event upto this moment triggered this overhead as well...

This is an interesting idea. It crossed my mind, but I don't think I
considered it seriously because I assumed it was too big a change.
But I suppose in the process I created an even bigger change :P

The false positive dget_parent() + fsnotify_inode_watches_children()
shouldn't be too bad. I could see a situation where there's a lot of
random accesses within a directory, where the dget_parent() could cause
some contention over the parent dentry. But to be fair, the performance
would have been the same or worse while fsnotify was active in that
case, and the contention would go away as most of the dentries get their
flags cleared. So I don't think this is a problem.

> Am I missing something?

I think there's one thing missed here. I understand you'd like to get
rid of the extra flag in the connector. But the advantage of the flag is
avoiding duplicate work by saving a bit of state. Suppose that a mark is
added to a connector, which causes fsnotify_inode_watches_children() to
become true. Then, any subsequent call to fsnotify_recalc_mask() must
call __fsnotify_update_child_dentry_flags(), even though the child
dentry flags don't need to be updated: they're already set. For (very)
large directories, this can take a few seconds, which means that we're
doing a few extra seconds of work each time a new mark is added to or
removed from a connector in that case. I can't imagine that's a super
common workload though, and I don't know if my customers do that (my
guess would be no).

For an example of a test workload where this would make a difference:
one of my test cases is to create a directory with millions of negative
dentries, and then run "for i in {1..20}; do inotifywait $DIR & done".
With the series as-is, only the first task needs to do the child flag
update. With your proposed alternative, each task would re-do the
update.
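
The state-saving the connector flag buys can be sketched as a userspace
toy model (names are illustrative, not the real connector):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy connector: remember whether the child dentry flags already match,
 * so a second mark added to the same directory skips the O(children)
 * walk entirely. */
struct toy_connector {
	unsigned int mask;		/* union of all marks' masks */
	bool children_flagged;		/* saved state from the last walk */
};

static int child_walks;			/* counts expensive walks */

static void update_child_flags(bool watched)
{
	child_walks++;			/* stand-in for iterating d_subdirs */
}

static void toy_recalc_mask(struct toy_connector *conn, unsigned int mask)
{
	bool watched;

	conn->mask = mask;
	watched = conn->mask != 0;	/* fsnotify_inode_watches_children() stand-in */
	if (watched != conn->children_flagged) {
		update_child_flags(watched);
		conn->children_flagged = watched;
	}
}
```

Only transitions between watched and unwatched trigger a walk; adding a
twentieth inotifywait to an already-watched directory is free.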

If we were to manage to get __fsnotify_update_child_dentry_flags() to
work correctly, and safely, with the ability to drop the d_lock and
sleep, then I think that wouldn't be too much of a problem, because then
other spinlock users of the directory will get a chance to grab it, and
so we don't risk softlockups. Without the sleepable iterations, it would
be marginally worse than the current solution, but I can't really
comment _how_ much worse because like I said, it doesn't sound like a
frequent usage pattern.

I think I have a _slight_ preference for the current design, but I see
the appeal of simpler code, and your design would still improve things a
lot! Maybe Amir has an opinion too, but of course I'll defer to what you
want.

Thanks,
Stephen

2022-11-01 22:24:53

by Stephen Brennan

[permalink] [raw]
Subject: Re: [RFC] fsnotify: allow sleepable child dentry flag update

Al Viro <[email protected]> writes:
> On Thu, Oct 13, 2022 at 03:27:19PM -0700, Stephen Brennan wrote:
>
>> So I have two more narrowly scoped strategies to improve the situation. Both are
>> included in the hacky, awful patch below.
>>
>> First, is to let __fsnotify_update_child_dentry_flags() sleep. This means nobody
>> is holding the spinlock for several seconds at a time. We can actually achieve
>> this via a cursor, the same way that simple_readdir() is implemented. I think
>> this might require moving the declaration of d_alloc_cursor() and maybe
>> exporting it. I had to #include fs/internal.h which is not ok.
>
> Er... Won't that expose every filesystem to having to deal with cursors?
> Currently it's entirely up to the filesystem in question and I wouldn't
> be a dime on everyone being ready to cope with such surprises...

Hi Al,

I wanted to follow up on this. Yeah, I didn't realize that this could
cause issues for some filesystems when I wrote it, since I hadn't
considered that rather few filesystems use dcache_readdir(), and so most
aren't prepared to encounter a cursor. Thanks for that catch.

I think I came up with a better solution, which you can see in context
in v3 [1]. The only change I have from the posting there is to eliminate
the unnecessary "child->d_parent != parent" check. So the new idea is to
take a reference to the current child and then go to sleep. That alone
should be enough to prevent dentry_kill() from removing the child from
its parent's list. It wouldn't prevent d_move() from moving it to a new
list, nor would it prevent it from moving along the list if it were a
cursor. But in this situation, we also hold the parent i_rwsem in write
mode, which means that no concurrent rename or readdir can happen. You
can see my full analysis here [2]. So here's the new approach; if you
can call out anything I've missed or confirm that it's sound, I'd really
appreciate it!


/* Must be called with inode->i_rwsem held in write mode */
static bool __fsnotify_update_children_dentry_flags(struct inode *inode)
{
	struct dentry *child, *alias, *last_ref = NULL;
	/* Whether children should report events to this inode */
	bool watched = fsnotify_inode_watches_children(inode);

	alias = d_find_any_alias(inode);
	if (!alias)
		return watched;

	/*
	 * These lists can get very long, so we may need to sleep during
	 * iteration. Normally this would be impossible without a cursor,
	 * but since we have the inode locked exclusive, we're guaranteed
	 * that the directory won't be modified, so whichever dentry we
	 * pick to sleep on won't get moved.
	 */
	spin_lock(&alias->d_lock);
	list_for_each_entry(child, &alias->d_subdirs, d_child) {
		if (need_resched()) {
			/*
			 * We need to hold a reference while we sleep. But when
			 * we wake, dput() could free the dentry, invalidating
			 * the list pointers. We can't look at the list pointers
			 * until we re-lock the parent, and we can't dput() once
			 * we have the parent locked. So the solution is to hold
			 * onto our reference and free it the *next* time we drop
			 * alias->d_lock: either at the end of the function, or
			 * at the time of the next sleep.
			 */
			dget(child);
			spin_unlock(&alias->d_lock);
			dput(last_ref);
			last_ref = child;
			cond_resched();
			spin_lock(&alias->d_lock);
		}

		if (!child->d_inode)
			continue;

		spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
		if (watched)
			child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
		else
			child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
		spin_unlock(&child->d_lock);
	}
	spin_unlock(&alias->d_lock);
	dput(last_ref);
	dput(alias);
	return watched;
}

Thanks,
Stephen

[1]: https://lore.kernel.org/linux-fsdevel/[email protected]/
[2]: https://lore.kernel.org/linux-fsdevel/[email protected]/

2022-11-02 09:24:24

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] fsnotify: fix softlockups iterating over d_subdirs

On Tue, Nov 1, 2022 at 10:49 PM Stephen Brennan
<[email protected]> wrote:
>
> Hi Jan,
>
> Jan Kara <[email protected]> writes:
> > Hi Stephen!
> >
> > On Thu 27-10-22 17:10:13, Stephen Brennan wrote:
> >> Here is v3 of the patch series. I've taken all of the feedback,
> >> thanks Amir, Christian, Hilf, et al. Differences are noted in each
> >> patch.
> >>
> >> I caught an obvious and silly dentry reference leak: d_find_any_alias()
> >> returns a reference, which I never called dput() on. With that change, I
> >> no longer see the rpc_pipefs issue, but I do think I need more testing
> >> and thinking through the third patch. Al, I'd love your feedback on that
> >> one especially.
> >>
> >> Thanks,
> >> Stephen
> >>
> >> Stephen Brennan (3):
> >> fsnotify: Use d_find_any_alias to get dentry associated with inode
> >> fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
> >> fsnotify: allow sleepable child flag update
> >
> > Thanks for the patches Stephen and I'm sorry for replying somewhat late.
>
> Absolutely no worries, these things take time. Thanks for taking a look!
>
> > The first patch is a nobrainer. The other two patches ... complicate things
> > somewhat more complicated than I'd like. I guess I can live with them if we
> > don't find a better solution but I'd like to discuss a bit more about
> > alternatives.
>
> Understood!
>
> > So what would happen if we just clear DCACHE_FSNOTIFY_PARENT_WATCHED in
> > __fsnotify_parent() for the dentry which triggered the event and does not
> > have watched parent anymore and never bother with full children walk? I
> > suppose your contention problems will be gone, we'll just pay the price of
> > dget_parent() + fsnotify_inode_watches_children() for each child that
> > falsely triggers instead of for only one. Maybe that's not too bad? After
> > all any event upto this moment triggered this overhead as well...
>
> This is an interesting idea. It came across my mind but I don't think I
> considered it seriously because I assumed that it was too big a change.
> But I suppose in the process I created an even bigger change :P
>
> The false positive dget_parent() + fsnotify_inode_watches_children()
> shouldn't be too bad. I could see a situation where there's a lot of
> random accesses within a directory, where the dget_parent() could cause
> some contention over the parent dentry. But to be fair, the performance
> would have been the same or worse while fsnotify was active in that
> case, and the contention would go away as most of the dentries get their
> flags cleared. So I don't think this is a problem.
>

I also don't think that is a problem.

> > Am I missing something?
>
> I think there's one thing missed here. I understand you'd like to get
> rid of the extra flag in the connector. But the advantage of the flag is
> avoiding duplicate work by saving a bit of state. Suppose that a mark is
> added to a connector, which causes fsnotify_inode_watches_children() to
> become true. Then, any subsequent call to fsnotify_recalc_mask() must
> call __fsnotify_update_child_dentry_flags(), even though the child
> dentry flags don't need to be updated: they're already set. For (very)
> large directories, this can take a few seconds, which means that we're
> doing a few extra seconds of work each time a new mark is added to or
> removed from a connector in that case. I can't imagine that's a super
> common workload though, and I don't know if my customers do that (my
> guess would be no).
>
> For an example of a test workload where this would make a difference:
> one of my test cases is to create a directory with millions of negative
> dentries, and then run "for i in {1..20}; do inotifywait $DIR & done".
> With the series as-is, only the first task needs to do the child flag
> update. With your proposed alternative, each task would re-do the
> update.
>
> If we were to manage to get __fsnotify_update_child_dentry_flags() to
> work correctly, and safely, with the ability to drop the d_lock and
> sleep, then I think that wouldn't be too much of a problem, because then
> other spinlock users of the directory will get a chance to grab it, and
> so we don't risk softlockups. Without the sleepable iterations, it would
> be marginally worse than the current solution, but I can't really
> comment _how_ much worse because like I said, it doesn't sound like a
> frequent usage pattern.
>
> I think I have a _slight_ preference for the current design, but I see
> the appeal of simpler code, and your design would still improve things a
> lot! Maybe Amir has an opinion too, but of course I'll defer to what you
> want.
>

IIUC, patches #1+#3 fix a reproducible softlockup, so if Al approves them,
I see no reason to withhold them.

With patches #1+#3 applied, the penalty is restricted to adding or
removing/destroying multiple marks on very large dirs or dirs with
many negative dentries.

I think that fixing the synthetic test case of multiple added marks
is rather easy even without DCACHE_FSNOTIFY_PARENT_WATCHED.
Something like the attached patch is what Jan had suggested in the
first place.

The synthetic test case of concurrent add/remove watch on the same
dir may still result in multiple children iterations, but that usually
cannot be avoided even with DCACHE_FSNOTIFY_PARENT_WATCHED, and it is
probably not worth optimizing for.

Thoughts?

Thanks,
Amir.


Attachments:
fsnotify-clear-PARENT_WATCHED-flags-lazily.patch (6.63 kB)

2022-11-02 18:07:25

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] fsnotify: fix softlockups iterating over d_subdirs

On Tue 01-11-22 13:48:54, Stephen Brennan wrote:
> Jan Kara <[email protected]> writes:
> > Hi Stephen!
> >
> > On Thu 27-10-22 17:10:13, Stephen Brennan wrote:
> >> Here is v3 of the patch series. I've taken all of the feedback,
> >> thanks Amir, Christian, Hilf, et al. Differences are noted in each
> >> patch.
> >>
> >> I caught an obvious and silly dentry reference leak: d_find_any_alias()
> >> returns a reference, which I never called dput() on. With that change, I
> >> no longer see the rpc_pipefs issue, but I do think I need more testing
> >> and thinking through the third patch. Al, I'd love your feedback on that
> >> one especially.
> >>
> >> Thanks,
> >> Stephen
> >>
> >> Stephen Brennan (3):
> >> fsnotify: Use d_find_any_alias to get dentry associated with inode
> >> fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
> >> fsnotify: allow sleepable child flag update
> >
> > Thanks for the patches Stephen and I'm sorry for replying somewhat late.
>
> Absolutely no worries, these things take time. Thanks for taking a look!
>
> > The first patch is a nobrainer. The other two patches ... complicate things
> > somewhat more complicated than I'd like. I guess I can live with them if we
> > don't find a better solution but I'd like to discuss a bit more about
> > alternatives.
>
> Understood!
>
> > So what would happen if we just clear DCACHE_FSNOTIFY_PARENT_WATCHED in
> > __fsnotify_parent() for the dentry which triggered the event and does not
> > have watched parent anymore and never bother with full children walk? I
> > suppose your contention problems will be gone, we'll just pay the price of
> > dget_parent() + fsnotify_inode_watches_children() for each child that
> > falsely triggers instead of for only one. Maybe that's not too bad? After
> > all any event upto this moment triggered this overhead as well...
>
> This is an interesting idea. It came across my mind but I don't think I
> considered it seriously because I assumed that it was too big a change.
> But I suppose in the process I created an even bigger change :P
>
> The false positive dget_parent() + fsnotify_inode_watches_children()
> shouldn't be too bad. I could see a situation where there's a lot of
> random accesses within a directory, where the dget_parent() could cause
> some contention over the parent dentry. But to be fair, the performance
> would have been the same or worse while fsnotify was active in that
> case, and the contention would go away as most of the dentries get their
> flags cleared. So I don't think this is a problem.
>
> > Am I missing something?
>
> I think there's one thing missed here. I understand you'd like to get
> rid of the extra flag in the connector. But the advantage of the flag is
> avoiding duplicate work by saving a bit of state. Suppose that a mark is
> added to a connector, which causes fsnotify_inode_watches_children() to
> become true. Then, any subsequent call to fsnotify_recalc_mask() must
> call __fsnotify_update_child_dentry_flags(), even though the child
> dentry flags don't need to be updated: they're already set. For (very)
> large directories, this can take a few seconds, which means that we're
> doing a few extra seconds of work each time a new mark is added to or
> removed from a connector in that case. I can't imagine that's a super
> common workload though, and I don't know if my customers do that (my
> guess would be no).

I understand. This basically matters for fsnotify_recalc_mask(). As a side
note I've realized that your changes to fsnotify_recalc_mask() acquiring
inode->i_rwsem for updating dentry flags in patch 2/3 are problematic for
dnotify because that calls fsnotify_recalc_mask() under a spinlock.
Furthermore it is somewhat worrying also for inotify & fanotify because it
nests inode->i_rwsem inside fsnotify_group->lock however I'm not 100% sure
something doesn't force the ordering the other way around (e.g. the removal
of oneshot mark during modify event generation). Did you run tests with
lockdep enabled?

Anyway, if the lock ordering issues can be solved, I suppose we can
optimize fsnotify_recalc_mask() like:

	inode_lock(inode);
	spin_lock(&conn->lock);
	oldmask = inode->i_fsnotify_mask;
	__fsnotify_recalc_mask(conn);
	newmask = inode->i_fsnotify_mask;
	spin_unlock(&conn->lock);
	if (watching children changed(oldmask, newmask))
		__fsnotify_update_child_dentry_flags(...)
	inode_unlock(inode);

And because everything is serialized by inode_lock, we don't have to worry
about inode->i_fsnotify_mask and dentry flags getting out of sync or some
mark addition returning before all children are marked for reporting
events. No need for the connector flag AFAICT.
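
As a userspace toy model of that ordering (pthread mutexes standing in
for both locks; all names are illustrative), one could sketch:

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

struct toy_inode {
	pthread_mutex_t i_rwsem;	/* stands in for inode_lock() */
	unsigned int i_fsnotify_mask;
};

struct toy_conn {
	pthread_mutex_t lock;		/* stands in for the connector spinlock */
	unsigned int marks_mask;	/* union of the marks' masks */
	struct toy_inode *inode;
};

static int child_walks;

static bool watching_children(unsigned int mask)
{
	return mask != 0;
}

static void update_child_dentry_flags(struct toy_inode *inode)
{
	child_walks++;			/* may sleep; spinlock already dropped */
}

static void toy_recalc_mask(struct toy_conn *conn)
{
	unsigned int oldmask, newmask;
	struct toy_inode *inode = conn->inode;

	pthread_mutex_lock(&inode->i_rwsem);		/* inode_lock(inode) */
	pthread_mutex_lock(&conn->lock);
	oldmask = inode->i_fsnotify_mask;
	inode->i_fsnotify_mask = conn->marks_mask;	/* __fsnotify_recalc_mask() */
	newmask = inode->i_fsnotify_mask;
	pthread_mutex_unlock(&conn->lock);
	if (watching_children(oldmask) != watching_children(newmask))
		update_child_dentry_flags(inode);
	pthread_mutex_unlock(&inode->i_rwsem);		/* inode_unlock(inode) */
}
```

The child walk runs only on a watched/unwatched transition, and only
under the sleepable inode lock, never under the spinlock.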

But the locking issue needs to be resolved first in any case. I need to
think some more...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-11-04 23:45:55

by Stephen Brennan

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] fsnotify: fix softlockups iterating over d_subdirs

Jan Kara <[email protected]> writes:
> On Tue 01-11-22 13:48:54, Stephen Brennan wrote:
>> Jan Kara <[email protected]> writes:
>> > Hi Stephen!
>> >
>> > On Thu 27-10-22 17:10:13, Stephen Brennan wrote:
>> >> Here is v3 of the patch series. I've taken all of the feedback,
>> >> thanks Amir, Christian, Hilf, et al. Differences are noted in each
>> >> patch.
>> >>
>> >> I caught an obvious and silly dentry reference leak: d_find_any_alias()
>> >> returns a reference, which I never called dput() on. With that change, I
>> >> no longer see the rpc_pipefs issue, but I do think I need more testing
>> >> and thinking through the third patch. Al, I'd love your feedback on that
>> >> one especially.
>> >>
>> >> Thanks,
>> >> Stephen
>> >>
>> >> Stephen Brennan (3):
>> >> fsnotify: Use d_find_any_alias to get dentry associated with inode
>> >> fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
>> >> fsnotify: allow sleepable child flag update
>> >
>> > Thanks for the patches Stephen and I'm sorry for replying somewhat late.
>>
>> Absolutely no worries, these things take time. Thanks for taking a look!
>>
>> > The first patch is a nobrainer. The other two patches ... complicate things
>> > somewhat more complicated than I'd like. I guess I can live with them if we
>> > don't find a better solution but I'd like to discuss a bit more about
>> > alternatives.
>>
>> Understood!
>>
>> > So what would happen if we just clear DCACHE_FSNOTIFY_PARENT_WATCHED in
>> > __fsnotify_parent() for the dentry which triggered the event and does not
>> > have watched parent anymore and never bother with full children walk? I
>> > suppose your contention problems will be gone, we'll just pay the price of
>> > dget_parent() + fsnotify_inode_watches_children() for each child that
>> > falsely triggers instead of for only one. Maybe that's not too bad? After
>> > all any event upto this moment triggered this overhead as well...
>>
>> This is an interesting idea. It came across my mind but I don't think I
>> considered it seriously because I assumed that it was too big a change.
>> But I suppose in the process I created an even bigger change :P
>>
>> The false positive dget_parent() + fsnotify_inode_watches_children()
>> shouldn't be too bad. I could see a situation where there's a lot of
>> random accesses within a directory, where the dget_parent() could cause
>> some contention over the parent dentry. But to be fair, the performance
>> would have been the same or worse while fsnotify was active in that
>> case, and the contention would go away as most of the dentries get their
>> flags cleared. So I don't think this is a problem.
>>
>> > Am I missing something?
>>
>> I think there's one thing missed here. I understand you'd like to get
>> rid of the extra flag in the connector. But the advantage of the flag is
>> avoiding duplicate work by saving a bit of state. Suppose that a mark is
>> added to a connector, which causes fsnotify_inode_watches_children() to
>> become true. Then, any subsequent call to fsnotify_recalc_mask() must
>> call __fsnotify_update_child_dentry_flags(), even though the child
>> dentry flags don't need to be updated: they're already set. For (very)
>> large directories, this can take a few seconds, which means that we're
>> doing a few extra seconds of work each time a new mark is added to or
>> removed from a connector in that case. I can't imagine that's a super
>> common workload though, and I don't know if my customers do that (my
>> guess would be no).
>
> I understand. This basically matters for fsnotify_recalc_mask(). As a side
> note I've realized that your changes to fsnotify_recalc_mask() acquiring
> inode->i_rwsem for updating dentry flags in patch 2/3 are problematic for
> dnotify because that calls fsnotify_recalc_mask() under a spinlock.
> Furthermore it is somewhat worrying also for inotify & fanotify because it
> nests inode->i_rwsem inside fsnotify_group->lock however I'm not 100% sure
> something doesn't force the ordering the other way around (e.g. the removal
> of oneshot mark during modify event generation). Did you run tests with
> lockdep enabled?

No I didn't. I'll be sure to get the LTP tests running with lockdep
early next week and try this series out; we'll probably get an error
like you say.

I'll also take a look at the dnotify use case and see if there's
anything to do there. Hopefully there's something we can do to salvage
it :D

Thanks,
Stephen

> Anyway, if the lock ordering issues can be solved, I suppose we can
> optimize fsnotify_recalc_mask() like:
>
> 	inode_lock(inode);
> 	spin_lock(&conn->lock);
> 	oldmask = inode->i_fsnotify_mask;
> 	__fsnotify_recalc_mask(conn);
> 	newmask = inode->i_fsnotify_mask;
> 	spin_unlock(&conn->lock);
> 	if (watching children changed(oldmask, newmask))
> 		__fsnotify_update_child_dentry_flags(...)
> 	inode_unlock(inode);
>
> And because everything is serialized by inode_lock, we don't have to worry
> about inode->i_fsnotify_mask and dentry flags getting out of sync or some
> mark addition returning before all children are marked for reporting
> events. No need for the connector flag AFAICT.
>
> But the locking issue needs to be resolved first in any case. I need to
> think some more...
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR

2022-11-07 13:01:29

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 0/3] fsnotify: fix softlockups iterating over d_subdirs

On Fri 04-11-22 16:33:18, Stephen Brennan wrote:
> Jan Kara <[email protected]> writes:
> > On Tue 01-11-22 13:48:54, Stephen Brennan wrote:
> >> Jan Kara <[email protected]> writes:
> >> > Hi Stephen!
> >> >
> >> > On Thu 27-10-22 17:10:13, Stephen Brennan wrote:
> >> >> Here is v3 of the patch series. I've taken all of the feedback,
> >> >> thanks Amir, Christian, Hilf, et al. Differences are noted in each
> >> >> patch.
> >> >>
> >> >> I caught an obvious and silly dentry reference leak: d_find_any_alias()
> >> >> returns a reference, which I never called dput() on. With that change, I
> >> >> no longer see the rpc_pipefs issue, but I do think I need more testing
> >> >> and thinking through the third patch. Al, I'd love your feedback on that
> >> >> one especially.
> >> >>
> >> >> Thanks,
> >> >> Stephen
> >> >>
> >> >> Stephen Brennan (3):
> >> >> fsnotify: Use d_find_any_alias to get dentry associated with inode
> >> >> fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem
> >> >> fsnotify: allow sleepable child flag update
> >> >
> >> > Thanks for the patches Stephen and I'm sorry for replying somewhat late.
> >>
> >> Absolutely no worries, these things take time. Thanks for taking a look!
> >>
> >> > The first patch is a nobrainer. The other two patches ... complicate things
> >> > somewhat more complicated than I'd like. I guess I can live with them if we
> >> > don't find a better solution but I'd like to discuss a bit more about
> >> > alternatives.
> >>
> >> Understood!
> >>
> >> > So what would happen if we just clear DCACHE_FSNOTIFY_PARENT_WATCHED in
> >> > __fsnotify_parent() for the dentry which triggered the event and does not
> >> > have watched parent anymore and never bother with full children walk? I
> >> > suppose your contention problems will be gone, we'll just pay the price of
> >> > dget_parent() + fsnotify_inode_watches_children() for each child that
> >> > falsely triggers instead of for only one. Maybe that's not too bad? After
> >> > all, any event up to this moment triggered this overhead as well...
> >>
> >> This is an interesting idea. It crossed my mind, but I don't think I
> >> considered it seriously because I assumed that it was too big a change.
> >> But I suppose in the process I created an even bigger change :P
> >>
> >> The false positive dget_parent() + fsnotify_inode_watches_children()
> >> shouldn't be too bad. I could see a situation where there's a lot of
> >> random accesses within a directory, where the dget_parent() could cause
> >> some contention over the parent dentry. But to be fair, the performance
> >> would have been the same or worse while fsnotify was active in that
> >> case, and the contention would go away as most of the dentries get their
> >> flags cleared. So I don't think this is a problem.
> >>
> >> > Am I missing something?
> >>
> >> I think there's one thing missed here. I understand you'd like to get
> >> rid of the extra flag in the connector. But the advantage of the flag is
> >> avoiding duplicate work by saving a bit of state. Suppose that a mark is
> >> added to a connector, which causes fsnotify_inode_watches_children() to
> >> become true. Then, any subsequent call to fsnotify_recalc_mask() must
> >> call __fsnotify_update_child_dentry_flags(), even though the child
> >> dentry flags don't need to be updated: they're already set. For (very)
> >> large directories, this can take a few seconds, which means that we're
> >> doing a few extra seconds of work each time a new mark is added to or
> >> removed from a connector in that case. I can't imagine that's a super
> >> common workload though, and I don't know if my customers do that (my
> >> guess would be no).
> >
> > I understand. This basically matters for fsnotify_recalc_mask(). As a side
> > note I've realized that your changes to fsnotify_recalc_mask() acquiring
> > inode->i_rwsem for updating dentry flags in patch 2/3 are problematic for
> > dnotify because that calls fsnotify_recalc_mask() under a spinlock.
> > Furthermore it is somewhat worrying also for inotify & fanotify because it
> > nests inode->i_rwsem inside fsnotify_group->lock however I'm not 100% sure
> > something doesn't force the ordering the other way around (e.g. the removal
> > of oneshot mark during modify event generation). Did you run tests with
> > lockdep enabled?
>
> No I didn't. I'll be sure to get the LTP tests running with lockdep
> early next week and try this series out, we'll probably get an error
> like you say.
>
> I'll also take a look at the dnotify use case and see if there's
> anything to do there. Hopefully there's something we can do to salvage
> it :D

Yeah, dnotify should be solvable. I'm even willing to accept somewhat
unusual methods for dnotify ;). It is an ancient and rarely used API these
days, so I don't want it to block fixes for newer APIs. From a quick look it
should be enough to move fsnotify_recalc_mask() call in
dnotify_recalc_inode_mask() from under the mark->lock out into callers and
perhaps rename dnotify_recalc_inode_mask() to dnotify_recalc_mark_mask()
to make the name somewhat less confusing.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-11-10 21:04:40

by Stephen Brennan

Subject: Re: [PATCH v3 0/3] fsnotify: fix softlockups iterating over d_subdirs

Amir Goldstein <[email protected]> writes:
[...]
> IIUC, patches #1+#3 fix a reproducible softlock, so if Al approves them,
> I see no reason to withhold.
>
> With patches #1+#3 applied, the penalty is restricted to adding or
> removing/destroying multiple marks on very large dirs or dirs with
> many negative dentries.
>
> I think that fixing the synthetic test case of multiple added marks
> is rather easy even without DCACHE_FSNOTIFY_PARENT_WATCHED.
> Something like the attached patch is what Jan had suggested in the
> first place.
>
> The synthetic test case of concurrent add/remove watch on the same
> dir may still result in multiple children iterations, but that will usually
> not be avoided even with DCACHE_FSNOTIFY_PARENT_WATCHED
> and probably not worth optimizing for.
>
> Thoughts?

If I understand your train of thought, your below patch would take the
place of my patch #2, since #3 requires we not hold a spinlock during
fsnotify_update_children_dentry_flags().

And since you fully rely on dentry accessors to lazily clear flags, you
don't need to have the mutual exclusion provided by the inode lock.

I like that a lot.

However, the one thing I see here is that if we no longer hold the inode
lock, patch #3 needs to be re-examined: it assumes that dentries can't
be removed or moved around, and that assumption is only true with the
inode lock held.

I'll test things out with this patch replacing #2 and see how things
shake out. Maybe I can improve #3.

Thanks,
Stephen

>
> Thanks,
> Amir.
> From c8ea1d84397c26ce334bff9d9f721400cebb88dd Mon Sep 17 00:00:00 2001
> From: Amir Goldstein <[email protected]>
> Date: Wed, 2 Nov 2022 10:28:01 +0200
> Subject: [PATCH] fsnotify: clear PARENT_WATCHED flags lazily
>
> Call fsnotify_update_children_dentry_flags() to set PARENT_WATCHED flags
> only when parent starts watching children.
>
> When parent stops watching children, clear false positive PARENT_WATCHED
> flags lazily in __fsnotify_parent() for each accessed child.
>
> Suggested-by: Jan Kara <[email protected]>
> Signed-off-by: Amir Goldstein <[email protected]>
> ---
> fs/notify/fsnotify.c | 26 ++++++++++++++++++++------
> fs/notify/fsnotify.h | 3 ++-
> fs/notify/mark.c | 32 +++++++++++++++++++++++++++++---
> include/linux/fsnotify_backend.h | 8 +++++---
> 4 files changed, 56 insertions(+), 13 deletions(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 1e541a9bd12b..f60078d6bb97 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -103,17 +103,13 @@ void fsnotify_sb_delete(struct super_block *sb)
> * parent cares. Thus when an event happens on a child it can quickly tell
> * if there is a need to find a parent and send the event to the parent.
> */
> -void __fsnotify_update_child_dentry_flags(struct inode *inode)
> +void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
> {
> struct dentry *alias;
> - int watched;
>
> if (!S_ISDIR(inode->i_mode))
> return;
>
> - /* determine if the children should tell inode about their events */
> - watched = fsnotify_inode_watches_children(inode);
> -
> spin_lock(&inode->i_lock);
> /* run all of the dentries associated with this inode. Since this is a
> * directory, there damn well better only be one item on this list */
> @@ -140,6 +136,24 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
> spin_unlock(&inode->i_lock);
> }
>
> +/*
> + * Lazily clear false positive PARENT_WATCHED flag for child whose parent had
> + * stopped watching children.
> + */
> +static void fsnotify_update_child_dentry_flags(struct inode *inode,
> + struct dentry *dentry)
> +{
> + spin_lock(&dentry->d_lock);
> + /*
> + * d_lock is a sufficient barrier to prevent observing a non-watched
> + * parent state from before the fsnotify_update_children_dentry_flags()
> + * or fsnotify_update_flags() call that had set PARENT_WATCHED.
> + */
> + if (!fsnotify_inode_watches_children(inode))
> + dentry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
> + spin_unlock(&dentry->d_lock);
> +}
> +
> /* Are inode/sb/mount interested in parent and name info with this event? */
> static bool fsnotify_event_needs_parent(struct inode *inode, struct mount *mnt,
> __u32 mask)
> @@ -208,7 +222,7 @@ int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
> p_inode = parent->d_inode;
> p_mask = fsnotify_inode_watches_children(p_inode);
> if (unlikely(parent_watched && !p_mask))
> - __fsnotify_update_child_dentry_flags(p_inode);
> + fsnotify_update_child_dentry_flags(p_inode, dentry);
>
> /*
> * Include parent/name in notification either if some notification
> diff --git a/fs/notify/fsnotify.h b/fs/notify/fsnotify.h
> index fde74eb333cc..bce9be36d06b 100644
> --- a/fs/notify/fsnotify.h
> +++ b/fs/notify/fsnotify.h
> @@ -74,7 +74,8 @@ static inline void fsnotify_clear_marks_by_sb(struct super_block *sb)
> * update the dentry->d_flags of all of inode's children to indicate if inode cares
> * about events that happen to its children.
> */
> -extern void __fsnotify_update_child_dentry_flags(struct inode *inode);
> +extern void fsnotify_update_children_dentry_flags(struct inode *inode,
> + bool watched);
>
> extern struct kmem_cache *fsnotify_mark_connector_cachep;
>
> diff --git a/fs/notify/mark.c b/fs/notify/mark.c
> index fcc68b8a40fd..614bce0e7261 100644
> --- a/fs/notify/mark.c
> +++ b/fs/notify/mark.c
> @@ -176,6 +176,24 @@ static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
> return fsnotify_update_iref(conn, want_iref);
> }
>
> +static bool fsnotify_conn_watches_children(
> + struct fsnotify_mark_connector *conn)
> +{
> + if (conn->type != FSNOTIFY_OBJ_TYPE_INODE)
> + return false;
> +
> + return fsnotify_inode_watches_children(fsnotify_conn_inode(conn));
> +}
> +
> +static void fsnotify_conn_set_children_dentry_flags(
> + struct fsnotify_mark_connector *conn)
> +{
> + if (conn->type != FSNOTIFY_OBJ_TYPE_INODE)
> + return;
> +
> + fsnotify_update_children_dentry_flags(fsnotify_conn_inode(conn), true);
> +}
> +
> /*
> * Calculate mask of events for a list of marks. The caller must make sure
> * connector and connector->obj cannot disappear under us. Callers achieve
> @@ -184,15 +202,23 @@ static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
> */
> void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
> {
> + bool update_children;
> +
> if (!conn)
> return;
>
> spin_lock(&conn->lock);
> + update_children = !fsnotify_conn_watches_children(conn);
> __fsnotify_recalc_mask(conn);
> + update_children &= fsnotify_conn_watches_children(conn);
> spin_unlock(&conn->lock);
> - if (conn->type == FSNOTIFY_OBJ_TYPE_INODE)
> - __fsnotify_update_child_dentry_flags(
> - fsnotify_conn_inode(conn));
> + /*
> + * Set children's PARENT_WATCHED flags only if parent started watching.
> + * When parent stops watching, we clear false positive PARENT_WATCHED
> + * flags lazily in __fsnotify_parent().
> + */
> + if (update_children)
> + fsnotify_conn_set_children_dentry_flags(conn);
> }
>
> /* Free all connectors queued for freeing once SRCU period ends */
> diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
> index a31423c376a7..bd90bcf6c3b0 100644
> --- a/include/linux/fsnotify_backend.h
> +++ b/include/linux/fsnotify_backend.h
> @@ -586,12 +586,14 @@ static inline __u32 fsnotify_parent_needed_mask(__u32 mask)
>
> static inline int fsnotify_inode_watches_children(struct inode *inode)
> {
> + __u32 parent_mask = READ_ONCE(inode->i_fsnotify_mask);
> +
> /* FS_EVENT_ON_CHILD is set if the inode may care */
> - if (!(inode->i_fsnotify_mask & FS_EVENT_ON_CHILD))
> + if (!(parent_mask & FS_EVENT_ON_CHILD))
> return 0;
> /* this inode might care about child events, does it care about the
> * specific set of events that can happen on a child? */
> - return inode->i_fsnotify_mask & FS_EVENTS_POSS_ON_CHILD;
> + return parent_mask & FS_EVENTS_POSS_ON_CHILD;
> }
>
> /*
> @@ -605,7 +607,7 @@ static inline void fsnotify_update_flags(struct dentry *dentry)
> /*
> * Serialisation of setting PARENT_WATCHED on the dentries is provided
> * by d_lock. If inotify_inode_watched changes after we have taken
> - * d_lock, the following __fsnotify_update_child_dentry_flags call will
> + * d_lock, the following fsnotify_update_children_dentry_flags call will
> * find our entry, so it will spin until we complete here, and update
> * us with the new state.
> */
> --
> 2.25.1

2022-11-11 22:10:46

by Stephen Brennan

Subject: [PATCH v4 0/5] fsnotify: fix softlockups iterating over d_subdirs

Hi Jan, Amir, Al,

Here's my v4 patch series that aims to eliminate soft lockups when updating
dentry flags in fsnotify. I've incorporated Jan's suggestion of simply
allowing the flag to be lazily cleared in the fsnotify_parent() function,
via Amir's patch. This allowed me to drop patch #2 from my previous series
(fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem). I
replaced it with "fsnotify: require inode lock held during child flag
update", patch #5 in this series. I also added "dnotify: move
fsnotify_recalc_mask() outside spinlock" to address the sleep-during-atomic
issues with dnotify.

Jan expressed concerns about lock ordering of the inode rwsem with the
fsnotify group mutex. I built this with lockdep enabled (see below for the
lock debugging .config section -- I'm not too familiar with lockdep so I
wanted a sanity check). I ran all the fanotify, inotify, and dnotify tests
I could find in LTP, with no lockdep splats to be found. I don't know that
this can completely satisfy the concerns about lock ordering: I'm reading
through the code to better understand the concern about "the removal of
oneshot mark during modify event generation". But I'm encouraged by the
LTP+lockdep results.

I also went ahead and did my negative dentry oriented testing. Of course
the fsnotify_parent() issue is fully resolved, and when I tested several
processes all using inotifywait on the same directory full of negative
dentries, I was able to use ftrace to confirm that
fsnotify_update_children_dentry_flags() was called exactly once for all
processes. No softlockups occurred!

I originally wrote this series to make the last patch (#5) optional: if for
some reason we didn't think it was necessary to hold the inode rwsem, then
we could omit it -- the main penalty being the race condition described in
the patch description. I tested without the last patch, and LTP also passed
with lockdep enabled, but of course when multiple tasks did an inotifywait
on the same directory (with many negative dentries) only the first waited
for the flag updates, the rest of the tasks immediately returned despite
the flags not being ready.

I agree with Amir that as long as the lock ordering is fine, we should keep
patch #5. And if that's the case, I can reorder the series to make it a bit
more logical, and eliminate the logic in
fsnotify_update_children_dentry_flags() for handling d_move/cursor races,
which I promptly delete later in the series.

1. fsnotify: clear PARENT_WATCHED flags lazily
2. fsnotify: Use d_find_any_alias to get dentry associated with inode
3. dnotify: move fsnotify_recalc_mask() outside spinlock
4. fsnotify: require inode lock held during child flag update
5. fsnotify: allow sleepable child flag update

Thanks for continuing to read this series, I hope we're making progress
toward a simpler way to fix these scaling issues!

Stephen

Amir Goldstein (1):
fsnotify: clear PARENT_WATCHED flags lazily

Stephen Brennan (4):
fsnotify: Use d_find_any_alias to get dentry associated with inode
dnotify: move fsnotify_recalc_mask() outside spinlock
fsnotify: allow sleepable child flag update
fsnotify: require inode lock held during child flag update

fs/notify/dnotify/dnotify.c | 28 ++++++---
fs/notify/fsnotify.c | 101 ++++++++++++++++++++++---------
fs/notify/fsnotify.h | 3 +-
fs/notify/mark.c | 40 +++++++++++-
include/linux/fsnotify_backend.h | 8 ++-
5 files changed, 136 insertions(+), 44 deletions(-)

--
2.34.1


2022-11-11 22:13:00

by Stephen Brennan

Subject: [PATCH v4 2/5] fsnotify: Use d_find_any_alias to get dentry associated with inode

Rather than iterate over the inode's i_dentry list (which requires
holding i_lock for the entire duration of the function), observe that a
directory should have only one alias. Use d_find_any_alias() to get it,
so we no longer need to hold i_lock.

Signed-off-by: Stephen Brennan <[email protected]>
Reviewed-by: Amir Goldstein <[email protected]>
---

Notes:
Changes in v4:
- Bail out if d_find_any_alias() returns NULL
- Rebase on Amir's patch
Changes in v3:
- Add newlines in block comment
- d_find_any_alias() returns a reference, which I was leaking. Add
a dput(alias) at the end.
- Add Amir's R-b

fs/notify/fsnotify.c | 46 ++++++++++++++++++++++----------------------
1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 2c50e9e50d35..409d479cbbc6 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -105,35 +105,35 @@ void fsnotify_sb_delete(struct super_block *sb)
*/
void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
{
- struct dentry *alias;
+ struct dentry *alias, *child;

if (!S_ISDIR(inode->i_mode))
return;

- spin_lock(&inode->i_lock);
- /* run all of the dentries associated with this inode. Since this is a
- * directory, there damn well better only be one item on this list */
- hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
- struct dentry *child;
-
- /* run all of the children of the original inode and fix their
- * d_flags to indicate parental interest (their parent is the
- * original inode) */
- spin_lock(&alias->d_lock);
- list_for_each_entry(child, &alias->d_subdirs, d_child) {
- if (!child->d_inode)
- continue;
+ /* Since this is a directory, there damn well better only be one alias */
+ alias = d_find_any_alias(inode);
+ if (!alias)
+ return;

- spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
- if (watched)
- child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
- else
- child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
- spin_unlock(&child->d_lock);
- }
- spin_unlock(&alias->d_lock);
+ /*
+ * run all of the children of the original inode and fix their
+ * d_flags to indicate parental interest (their parent is the
+ * original inode)
+ */
+ spin_lock(&alias->d_lock);
+ list_for_each_entry(child, &alias->d_subdirs, d_child) {
+ if (!child->d_inode)
+ continue;
+
+ spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
+ if (watched)
+ child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
+ else
+ child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
+ spin_unlock(&child->d_lock);
}
- spin_unlock(&inode->i_lock);
+ spin_unlock(&alias->d_lock);
+ dput(alias);
}

/*
--
2.34.1


2022-11-11 22:18:26

by Stephen Brennan

Subject: [PATCH v4 1/5] fsnotify: clear PARENT_WATCHED flags lazily

From: Amir Goldstein <[email protected]>

Call fsnotify_update_children_dentry_flags() to set PARENT_WATCHED flags
only when parent starts watching children.

When parent stops watching children, clear false positive PARENT_WATCHED
flags lazily in __fsnotify_parent() for each accessed child.

Suggested-by: Jan Kara <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>
Signed-off-by: Stephen Brennan <[email protected]>
---

Notes:
This is Amir's patch based on Jan's idea, with no changes or alterations
from my side.

fs/notify/fsnotify.c | 26 ++++++++++++++++++++------
fs/notify/fsnotify.h | 3 ++-
fs/notify/mark.c | 32 +++++++++++++++++++++++++++++---
include/linux/fsnotify_backend.h | 8 +++++---
4 files changed, 56 insertions(+), 13 deletions(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 7974e91ffe13..2c50e9e50d35 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -103,17 +103,13 @@ void fsnotify_sb_delete(struct super_block *sb)
* parent cares. Thus when an event happens on a child it can quickly tell
* if there is a need to find a parent and send the event to the parent.
*/
-void __fsnotify_update_child_dentry_flags(struct inode *inode)
+void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
{
struct dentry *alias;
- int watched;

if (!S_ISDIR(inode->i_mode))
return;

- /* determine if the children should tell inode about their events */
- watched = fsnotify_inode_watches_children(inode);
-
spin_lock(&inode->i_lock);
/* run all of the dentries associated with this inode. Since this is a
* directory, there damn well better only be one item on this list */
@@ -140,6 +136,24 @@ void __fsnotify_update_child_dentry_flags(struct inode *inode)
spin_unlock(&inode->i_lock);
}

+/*
+ * Lazily clear false positive PARENT_WATCHED flag for child whose parent had
+ * stopped watching children.
+ */
+static void fsnotify_update_child_dentry_flags(struct inode *inode,
+ struct dentry *dentry)
+{
+ spin_lock(&dentry->d_lock);
+ /*
+ * d_lock is a sufficient barrier to prevent observing a non-watched
+ * parent state from before the fsnotify_update_children_dentry_flags()
+ * or fsnotify_update_flags() call that had set PARENT_WATCHED.
+ */
+ if (!fsnotify_inode_watches_children(inode))
+ dentry->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
+ spin_unlock(&dentry->d_lock);
+}
+
/* Are inode/sb/mount interested in parent and name info with this event? */
static bool fsnotify_event_needs_parent(struct inode *inode, struct mount *mnt,
__u32 mask)
@@ -208,7 +222,7 @@ int __fsnotify_parent(struct dentry *dentry, __u32 mask, const void *data,
p_inode = parent->d_inode;
p_mask = fsnotify_inode_watches_children(p_inode);
if (unlikely(parent_watched && !p_mask))
- __fsnotify_update_child_dentry_flags(p_inode);
+ fsnotify_update_child_dentry_flags(p_inode, dentry);

/*
* Include parent/name in notification either if some notification
diff --git a/fs/notify/fsnotify.h b/fs/notify/fsnotify.h
index fde74eb333cc..bce9be36d06b 100644
--- a/fs/notify/fsnotify.h
+++ b/fs/notify/fsnotify.h
@@ -74,7 +74,8 @@ static inline void fsnotify_clear_marks_by_sb(struct super_block *sb)
* update the dentry->d_flags of all of inode's children to indicate if inode cares
* about events that happen to its children.
*/
-extern void __fsnotify_update_child_dentry_flags(struct inode *inode);
+extern void fsnotify_update_children_dentry_flags(struct inode *inode,
+ bool watched);

extern struct kmem_cache *fsnotify_mark_connector_cachep;

diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index c74ef947447d..6797a2952f87 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -176,6 +176,24 @@ static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
return fsnotify_update_iref(conn, want_iref);
}

+static bool fsnotify_conn_watches_children(
+ struct fsnotify_mark_connector *conn)
+{
+ if (conn->type != FSNOTIFY_OBJ_TYPE_INODE)
+ return false;
+
+ return fsnotify_inode_watches_children(fsnotify_conn_inode(conn));
+}
+
+static void fsnotify_conn_set_children_dentry_flags(
+ struct fsnotify_mark_connector *conn)
+{
+ if (conn->type != FSNOTIFY_OBJ_TYPE_INODE)
+ return;
+
+ fsnotify_update_children_dentry_flags(fsnotify_conn_inode(conn), true);
+}
+
/*
* Calculate mask of events for a list of marks. The caller must make sure
* connector and connector->obj cannot disappear under us. Callers achieve
@@ -184,15 +202,23 @@ static void *__fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
*/
void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
{
+ bool update_children;
+
if (!conn)
return;

spin_lock(&conn->lock);
+ update_children = !fsnotify_conn_watches_children(conn);
__fsnotify_recalc_mask(conn);
+ update_children &= fsnotify_conn_watches_children(conn);
spin_unlock(&conn->lock);
- if (conn->type == FSNOTIFY_OBJ_TYPE_INODE)
- __fsnotify_update_child_dentry_flags(
- fsnotify_conn_inode(conn));
+ /*
+ * Set children's PARENT_WATCHED flags only if parent started watching.
+ * When parent stops watching, we clear false positive PARENT_WATCHED
+ * flags lazily in __fsnotify_parent().
+ */
+ if (update_children)
+ fsnotify_conn_set_children_dentry_flags(conn);
}

/* Free all connectors queued for freeing once SRCU period ends */
diff --git a/include/linux/fsnotify_backend.h b/include/linux/fsnotify_backend.h
index d7d96c806bff..1276de409724 100644
--- a/include/linux/fsnotify_backend.h
+++ b/include/linux/fsnotify_backend.h
@@ -563,12 +563,14 @@ static inline __u32 fsnotify_parent_needed_mask(__u32 mask)

static inline int fsnotify_inode_watches_children(struct inode *inode)
{
+ __u32 parent_mask = READ_ONCE(inode->i_fsnotify_mask);
+
/* FS_EVENT_ON_CHILD is set if the inode may care */
- if (!(inode->i_fsnotify_mask & FS_EVENT_ON_CHILD))
+ if (!(parent_mask & FS_EVENT_ON_CHILD))
return 0;
/* this inode might care about child events, does it care about the
* specific set of events that can happen on a child? */
- return inode->i_fsnotify_mask & FS_EVENTS_POSS_ON_CHILD;
+ return parent_mask & FS_EVENTS_POSS_ON_CHILD;
}

/*
@@ -582,7 +584,7 @@ static inline void fsnotify_update_flags(struct dentry *dentry)
/*
* Serialisation of setting PARENT_WATCHED on the dentries is provided
* by d_lock. If inotify_inode_watched changes after we have taken
- * d_lock, the following __fsnotify_update_child_dentry_flags call will
+ * d_lock, the following fsnotify_update_children_dentry_flags call will
* find our entry, so it will spin until we complete here, and update
* us with the new state.
*/
--
2.34.1


2022-11-11 22:19:55

by Stephen Brennan

Subject: [PATCH v4 4/5] fsnotify: allow sleepable child flag update

With very large d_subdirs lists, iteration can take a long time. Since
iteration needs to hold alias->d_lock, this can trigger soft lockups.
It would be best to make this iteration sleepable. We can drop
alias->d_lock and sleep by taking a reference to the current child.
However, we need to be careful, since it's possible for the child's
list pointers to be modified once we drop alias->d_lock. The following
are the only cases where the list pointers are modified:

1. dentry_unlist() in fs/dcache.c

This is a helper of dentry_kill(). This function is quite careful to
check the reference count of the dentry once it has taken the
requisite locks, and bail out if a new reference was taken. So we
can be safe that, assuming we found the dentry and took a reference
before dropping alias->d_lock, any concurrent dentry_kill() should
bail out and leave our list pointers untouched.

2. __d_move() in fs/dcache.c

If the child was moved to a new parent, then we can detect this by
testing d_parent and retrying the iteration.

3. Initialization code in d_alloc() family

We are safe from this code, since we cannot encounter a dentry until
it has been initialized.

4. Cursor code in fs/libfs.c for dcache_readdir()

Dentries with DCACHE_DENTRY_CURSOR should be skipped before sleeping,
since we could awaken to find they have skipped over a portion of the
child list.

Given these considerations, we can carefully write a loop that iterates
over d_subdirs and is capable of going to sleep periodically.

Signed-off-by: Stephen Brennan <[email protected]>
---

Notes:
v4:
- I've updated this patch so it should be safe even without the
inode locked, by handling cursors and d_move() races.
- Moved comments to lower indentation
- I didn't keep Amir's R-b since this was a fair bit of change.
v3:
- removed if statements around dput()
v2:
- added a check for child->d_parent != alias and retry logic

fs/notify/fsnotify.c | 46 ++++++++++++++++++++++++++++++++++++++++----
1 file changed, 42 insertions(+), 4 deletions(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 409d479cbbc6..0ba61211456c 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -105,7 +105,7 @@ void fsnotify_sb_delete(struct super_block *sb)
*/
void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
{
- struct dentry *alias, *child;
+ struct dentry *alias, *child, *last_ref = NULL;

if (!S_ISDIR(inode->i_mode))
return;
@@ -116,12 +116,49 @@ void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
return;

/*
- * run all of the children of the original inode and fix their
- * d_flags to indicate parental interest (their parent is the
- * original inode)
+ * These lists can get very long, so we may need to sleep during
+ * iteration. Normally this would be impossible without a cursor,
+ * but by holding a reference to the current child across the sleep,
+ * and retrying if the child has been moved by the time we wake, we
+ * can resume safely. So, start a manual iteration over d_subdirs
+ * which will allow us to sleep.
*/
spin_lock(&alias->d_lock);
+retry:
list_for_each_entry(child, &alias->d_subdirs, d_child) {
+ /*
+ * We need to hold a reference while we sleep. But we cannot
+ * sleep holding a reference to a cursor, or we risk skipping
+ * through the list.
+ *
+ * When we wake, dput() could free the dentry, invalidating the
+ * list pointers. We can't look at the list pointers until we
+ * re-lock the parent, and we can't dput() once we have the
+ * parent locked. So the solution is to hold onto our reference
+ * and free it the *next* time we drop alias->d_lock: either at
+ * the end of the function, or at the time of the next sleep.
+ */
+ if (child->d_flags & DCACHE_DENTRY_CURSOR)
+ continue;
+ if (need_resched()) {
+ dget(child);
+ spin_unlock(&alias->d_lock);
+ dput(last_ref);
+ last_ref = child;
+ cond_resched();
+ spin_lock(&alias->d_lock);
+ /* Check for races with __d_move() */
+ if (child->d_parent != alias)
+ goto retry;
+ }
+
+ /*
+ * This check is after the cond_resched() for a reason: it is
+ * safe to sleep holding a negative dentry reference. If the
+ * directory contained many negative dentries, and we skipped
+ * them checking need_resched(), we may never end up sleeping
+ * where necessary, and could trigger a soft lockup.
+ */
if (!child->d_inode)
continue;

@@ -133,6 +170,7 @@ void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
spin_unlock(&child->d_lock);
}
spin_unlock(&alias->d_lock);
+ dput(last_ref);
dput(alias);
}

--
2.34.1


2022-11-11 22:39:32

by Stephen Brennan

Subject: [PATCH v4 3/5] dnotify: move fsnotify_recalc_mask() outside spinlock

In order to allow sleeping during fsnotify_recalc_mask(), we need to
ensure no callers are holding a spinlock.

Signed-off-by: Stephen Brennan <[email protected]>
---
fs/notify/dnotify/dnotify.c | 28 +++++++++++++++++++---------
1 file changed, 19 insertions(+), 9 deletions(-)

diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
index 190aa717fa32..a9f05b3cf5ea 100644
--- a/fs/notify/dnotify/dnotify.c
+++ b/fs/notify/dnotify/dnotify.c
@@ -58,10 +58,10 @@ struct dnotify_mark {
* dnotify cares about for that inode may change. This function runs the
* list of everything receiving dnotify events about this directory and calculates
* the set of all those events. After it updates what dnotify is interested in
- * it calls the fsnotify function so it can update the set of all events relevant
- * to this inode.
+ * it returns true if fsnotify_recalc_mask() should be called to update the set
+ * of all events related to this inode.
*/
-static void dnotify_recalc_inode_mask(struct fsnotify_mark *fsn_mark)
+static bool dnotify_recalc_inode_mask(struct fsnotify_mark *fsn_mark)
{
__u32 new_mask = 0;
struct dnotify_struct *dn;
@@ -74,10 +74,9 @@ static void dnotify_recalc_inode_mask(struct fsnotify_mark *fsn_mark)
for (dn = dn_mark->dn; dn != NULL; dn = dn->dn_next)
new_mask |= (dn->dn_mask & ~FS_DN_MULTISHOT);
if (fsn_mark->mask == new_mask)
- return;
+ return false;
fsn_mark->mask = new_mask;
-
- fsnotify_recalc_mask(fsn_mark->connector);
+ return true;
}

/*
@@ -97,6 +96,7 @@ static int dnotify_handle_event(struct fsnotify_mark *inode_mark, u32 mask,
struct dnotify_struct **prev;
struct fown_struct *fown;
__u32 test_mask = mask & ~FS_EVENT_ON_CHILD;
+ bool recalc = false;

/* not a dir, dnotify doesn't care */
if (!dir && !(mask & FS_ISDIR))
@@ -118,12 +118,15 @@ static int dnotify_handle_event(struct fsnotify_mark *inode_mark, u32 mask,
else {
*prev = dn->dn_next;
kmem_cache_free(dnotify_struct_cache, dn);
- dnotify_recalc_inode_mask(inode_mark);
+ recalc = dnotify_recalc_inode_mask(inode_mark);
}
}

spin_unlock(&inode_mark->lock);

+ if (recalc)
+ fsnotify_recalc_mask(inode_mark->connector);
+
return 0;
}

@@ -158,6 +161,7 @@ void dnotify_flush(struct file *filp, fl_owner_t id)
struct dnotify_struct **prev;
struct inode *inode;
bool free = false;
+ bool recalc = false;

inode = file_inode(filp);
if (!S_ISDIR(inode->i_mode))
@@ -176,7 +180,7 @@ void dnotify_flush(struct file *filp, fl_owner_t id)
if ((dn->dn_owner == id) && (dn->dn_filp == filp)) {
*prev = dn->dn_next;
kmem_cache_free(dnotify_struct_cache, dn);
- dnotify_recalc_inode_mask(fsn_mark);
+ recalc = dnotify_recalc_inode_mask(fsn_mark);
break;
}
prev = &dn->dn_next;
@@ -184,6 +188,9 @@ void dnotify_flush(struct file *filp, fl_owner_t id)

spin_unlock(&fsn_mark->lock);

+ if (recalc)
+ fsnotify_recalc_mask(fsn_mark->connector);
+
/* nothing else could have found us thanks to the dnotify_groups
mark_mutex */
if (dn_mark->dn == NULL) {
@@ -268,6 +275,7 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
struct file *f;
int destroy = 0, error = 0;
__u32 mask;
+ bool recalc = false;

/* we use these to tell if we need to kfree */
new_fsn_mark = NULL;
@@ -377,9 +385,11 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
else if (error == -EEXIST)
error = 0;

- dnotify_recalc_inode_mask(fsn_mark);
+ recalc = dnotify_recalc_inode_mask(fsn_mark);
out:
spin_unlock(&fsn_mark->lock);
+ if (recalc)
+ fsnotify_recalc_mask(fsn_mark->connector);

if (destroy)
fsnotify_detach_mark(fsn_mark);
--
2.34.1


2022-11-11 22:42:31

by Stephen Brennan

[permalink] [raw]
Subject: [PATCH v4 5/5] fsnotify: require inode lock held during child flag update

With the prior changes to fsnotify, it is now possible for
fsnotify_recalc_mask() to return before all child dentry flags have been
set. Imagine that two CPUs attempt to add a mark to an inode which would
require watching the children of that directory:

CPU 1:                                     CPU 2:

fsnotify_recalc_mask() {
    spin_lock();
    update_children = ...
    __fsnotify_recalc_mask();
    update_children = ...
    spin_unlock();
    // update_children is true!
    fsnotify_conn_set_children_dentry_flags() {
        // updating flags ...
        cond_resched();
                                           fsnotify_recalc_mask() {
                                               spin_lock();
                                               update_children = ...
                                               __fsnotify_recalc_mask();
                                               update_children = ...
                                               spin_unlock();
                                               // update_children is false
                                           }
                                           // returns to userspace, but
                                           // not all children are marked
        // continue updating flags
    }
}

To prevent this situation, hold the directory inode lock. This ensures
that any concurrent update to the mask will block until the update is
complete, so that we can guarantee that child flags are set prior to
returning.

Since the directory inode lock is now held during iteration over
d_subdirs, we are guaranteed that __d_move() cannot remove the dentry we
hold, so we no longer need to check whether we should retry the
iteration. We are also guaranteed that no cursors are moving through the
list, since dcache_readdir() runs with the inode read lock held.
Simplify the iteration by removing this logic.

Signed-off-by: Stephen Brennan <[email protected]>
---
fs/notify/fsnotify.c | 25 +++++++++----------------
fs/notify/mark.c | 8 ++++++++
2 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 0ba61211456c..b5778775b88d 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -102,6 +102,8 @@ void fsnotify_sb_delete(struct super_block *sb)
* on a child we run all of our children and set a dentry flag saying that the
* parent cares. Thus when an event happens on a child it can quickly tell
* if there is a need to find a parent and send the event to the parent.
+ *
+ * Context: inode locked exclusive
*/
void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
{
@@ -124,22 +126,16 @@ void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
* over d_subdirs which will allow us to sleep.
*/
spin_lock(&alias->d_lock);
-retry:
list_for_each_entry(child, &alias->d_subdirs, d_child) {
/*
- * We need to hold a reference while we sleep. But we cannot
- * sleep holding a reference to a cursor, or we risk skipping
- * through the list.
- *
- * When we wake, dput() could free the dentry, invalidating the
- * list pointers. We can't look at the list pointers until we
- * re-lock the parent, and we can't dput() once we have the
- * parent locked. So the solution is to hold onto our reference
- * and free it the *next* time we drop alias->d_lock: either at
- * the end of the function, or at the time of the next sleep.
+ * We need to hold a reference while we sleep. When we wake,
+ * dput() could free the dentry, invalidating the list pointers.
+ * We can't look at the list pointers until we re-lock the
+ * parent, and we can't dput() once we have the parent locked.
+ * So the solution is to hold onto our reference and free it the
+ * *next* time we drop alias->d_lock: either at the end of the
+ * function, or at the time of the next sleep.
*/
- if (child->d_flags & DCACHE_DENTRY_CURSOR)
- continue;
if (need_resched()) {
dget(child);
spin_unlock(&alias->d_lock);
@@ -147,9 +143,6 @@ void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
last_ref = child;
cond_resched();
spin_lock(&alias->d_lock);
- /* Check for races with __d_move() */
- if (child->d_parent != alias)
- goto retry;
}

/*
diff --git a/fs/notify/mark.c b/fs/notify/mark.c
index 6797a2952f87..f39cd88ad778 100644
--- a/fs/notify/mark.c
+++ b/fs/notify/mark.c
@@ -203,10 +203,15 @@ static void fsnotify_conn_set_children_dentry_flags(
void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
{
bool update_children;
+ struct inode *inode = NULL;

if (!conn)
return;

+ if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
+ inode = fsnotify_conn_inode(conn);
+ inode_lock(inode);
+ }
spin_lock(&conn->lock);
update_children = !fsnotify_conn_watches_children(conn);
__fsnotify_recalc_mask(conn);
@@ -219,6 +224,9 @@ void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
*/
if (update_children)
fsnotify_conn_set_children_dentry_flags(conn);
+
+ if (inode)
+ inode_unlock(inode);
}

/* Free all connectors queued for freeing once SRCU period ends */
--
2.34.1


2022-11-11 22:43:27

by Stephen Brennan

[permalink] [raw]
Subject: Re: [PATCH v4 0/5] fsnotify: fix softlockups iterating over d_subdirs

Stephen Brennan <[email protected]> writes:

> Hi Jan, Amir, Al,
>
> Here's my v4 patch series that aims to eliminate soft lockups when updating
> dentry flags in fsnotify. I've incorporated Jan's suggestion of simply
> allowing the flag to be lazily cleared in the fsnotify_parent() function,
> via Amir's patch. This allowed me to drop patch #2 from my previous series
> (fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem). I
> replaced it with "fsnotify: require inode lock held during child flag
> update", patch #5 in this series. I also added "dnotify: move
> fsnotify_recalc_mask() outside spinlock" to address the sleep-during-atomic
> issues with dnotify.
>
> Jan expressed concerns about lock ordering of the inode rwsem with the
> fsnotify group mutex. I built this with lockdep enabled (see below for the
> lock debugging .config section -- I'm not too familiar with lockdep so I
> wanted a sanity check). I ran all the fanotify, inotify, and dnotify tests
> I could find in LTP, with no lockdep splats to be found. I don't know that
> this can completely satisfy the concerns about lock ordering: I'm reading
> through the code to better understand the concern about "the removal of
> oneshot mark during modify event generation". But I'm encouraged by the
> LTP+lockdep results.

Of course, I forgot to append the .config section:

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_PROVE_LOCKING=y
# CONFIG_PROVE_RAW_LOCK_NESTING is not set
CONFIG_LOCK_STAT=y
CONFIG_DEBUG_RT_MUTEXES=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y
CONFIG_DEBUG_RWSEMS=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
CONFIG_LOCKDEP_BITS=15
CONFIG_LOCKDEP_CHAINS_BITS=16
CONFIG_LOCKDEP_STACK_TRACE_BITS=19
CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
CONFIG_DEBUG_LOCKDEP=y
CONFIG_DEBUG_ATOMIC_SLEEP=y
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
# CONFIG_SCF_TORTURE_TEST is not set
CONFIG_CSD_LOCK_WAIT_DEBUG=y
# end of Lock Debugging (spinlocks, mutexes, etc...)

>
> I also went ahead and did my negative dentry oriented testing. Of course
> the fsnotify_parent() issue is fully resolved, and when I tested several
> processes all using inotifywait on the same directory full of negative
> dentries, I was able to use ftrace to confirm that
> fsnotify_update_children_dentry_flags() was called exactly once for all
> processes. No softlockups occurred!
>
> I originally wrote this series to make the last patch (#5) optional: if for
> some reason we didn't think it was necessary to hold the inode rwsem, then
> we could omit it -- the main penalty being the race condition described in
> the patch description. I tested without the last patch and LTP passed also
> with lockdep enabled, but of course when multiple tasks did an inotifywait
> on the same directory (with many negative dentries) only the first waited
> for the flag updates, the rest of the tasks immediately returned despite
> the flags not being ready.
>
> I agree with Amir that as long as the lock ordering is fine, we should keep
> patch #5. And if that's the case, I can reorder the series a bit to make it
> a bit more logical, and eliminate logic in
> fsnotify_update_children_dentry_flags() for handling d_move/cursor races,
> which I promptly delete later in the series.
>
> 1. fsnotify: clear PARENT_WATCHED flags lazily
> 2. fsnotify: Use d_find_any_alias to get dentry associated with inode
> 3. dnotify: move fsnotify_recalc_mask() outside spinlock
> 4. fsnotify: require inode lock held during child flag update
> 5. fsnotify: allow sleepable child flag update
>
> Thanks for continuing to read this series, I hope we're making progress
> toward a simpler way to fix these scaling issues!
>
> Stephen
>
> Amir Goldstein (1):
> fsnotify: clear PARENT_WATCHED flags lazily
>
> Stephen Brennan (4):
> fsnotify: Use d_find_any_alias to get dentry associated with inode
> dnotify: move fsnotify_recalc_mask() outside spinlock
> fsnotify: allow sleepable child flag update
> fsnotify: require inode lock held during child flag update
>
> fs/notify/dnotify/dnotify.c | 28 ++++++---
> fs/notify/fsnotify.c | 101 ++++++++++++++++++++++---------
> fs/notify/fsnotify.h | 3 +-
> fs/notify/mark.c | 40 +++++++++++-
> include/linux/fsnotify_backend.h | 8 ++-
> 5 files changed, 136 insertions(+), 44 deletions(-)
>
> --
> 2.34.1

2022-11-12 09:06:11

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH v4 2/5] fsnotify: Use d_find_any_alias to get dentry associated with inode

On Sat, Nov 12, 2022 at 12:06 AM Stephen Brennan
<[email protected]> wrote:
>
> Rather than iterating over the inode's i_dentry (requiring holding the
> i_lock for the entire duration of the function), we know that there
> should be only one item in the list. Use d_find_any_alias() and no
> longer hold i_lock.
>
> Signed-off-by: Stephen Brennan <[email protected]>
> Reviewed-by: Amir Goldstein <[email protected]>
> ---
>
> Notes:
> Changes in v4:
> - Bail out if d_find_any_alias() returns NULL
> - Rebase on Amir's patch
> Changes in v3:
> - Add newlines in block comment
> - d_find_any_alias() returns a reference, which I was leaking. Add
> a dput(alias) at the end.
> - Add Amir's R-b
>
> fs/notify/fsnotify.c | 46 ++++++++++++++++++++++----------------------
> 1 file changed, 23 insertions(+), 23 deletions(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 2c50e9e50d35..409d479cbbc6 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -105,35 +105,35 @@ void fsnotify_sb_delete(struct super_block *sb)
> */
> void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
> {
> - struct dentry *alias;
> + struct dentry *alias, *child;
>
> if (!S_ISDIR(inode->i_mode))
> return;
>
> - spin_lock(&inode->i_lock);
> - /* run all of the dentries associated with this inode. Since this is a
> - * directory, there damn well better only be one item on this list */
> - hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
> - struct dentry *child;
> -
> - /* run all of the children of the original inode and fix their
> - * d_flags to indicate parental interest (their parent is the
> - * original inode) */
> - spin_lock(&alias->d_lock);
> - list_for_each_entry(child, &alias->d_subdirs, d_child) {
> - if (!child->d_inode)
> - continue;
> + /* Since this is a directory, there damn well better only be one alias */
> + alias = d_find_any_alias(inode);
> + if (!alias)
> + return;
>
> - spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
> - if (watched)
> - child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
> - else
> - child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
> - spin_unlock(&child->d_lock);
> - }
> - spin_unlock(&alias->d_lock);
> + /*
> + * run all of the children of the original inode and fix their
> + * d_flags to indicate parental interest (their parent is the
> + * original inode)

nit: this comment can probably fit in two nicer lines

> + */
> + spin_lock(&alias->d_lock);
> + list_for_each_entry(child, &alias->d_subdirs, d_child) {
> + if (!child->d_inode)
> + continue;
> +
> + spin_lock_nested(&child->d_lock, DENTRY_D_LOCK_NESTED);
> + if (watched)
> + child->d_flags |= DCACHE_FSNOTIFY_PARENT_WATCHED;
> + else
> + child->d_flags &= ~DCACHE_FSNOTIFY_PARENT_WATCHED;
> + spin_unlock(&child->d_lock);
> }
> - spin_unlock(&inode->i_lock);
> + spin_unlock(&alias->d_lock);
> + dput(alias);
> }
>
> /*
> --
> 2.34.1
>

2022-11-12 09:20:30

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH v4 3/5] dnotify: move fsnotify_recalc_mask() outside spinlock

On Sat, Nov 12, 2022 at 12:06 AM Stephen Brennan
<[email protected]> wrote:
>
> In order to allow sleeping during fsnotify_recalc_mask(), we need to
> ensure no callers are holding a spinlock.
>
> Signed-off-by: Stephen Brennan <[email protected]>

Reviewed-by: Amir Goldstein <[email protected]>

small suggestion below.

> ---
> fs/notify/dnotify/dnotify.c | 28 +++++++++++++++++++---------
> 1 file changed, 19 insertions(+), 9 deletions(-)
>
> diff --git a/fs/notify/dnotify/dnotify.c b/fs/notify/dnotify/dnotify.c
> index 190aa717fa32..a9f05b3cf5ea 100644
> --- a/fs/notify/dnotify/dnotify.c
> +++ b/fs/notify/dnotify/dnotify.c
> @@ -58,10 +58,10 @@ struct dnotify_mark {
> * dnotify cares about for that inode may change. This function runs the
> * list of everything receiving dnotify events about this directory and calculates
> * the set of all those events. After it updates what dnotify is interested in
> - * it calls the fsnotify function so it can update the set of all events relevant
> - * to this inode.
> + * it returns true if fsnotify_recalc_mask() should be called to update the set
> + * of all events related to this inode.
> */
> -static void dnotify_recalc_inode_mask(struct fsnotify_mark *fsn_mark)
> +static bool dnotify_recalc_inode_mask(struct fsnotify_mark *fsn_mark)
> {
> __u32 new_mask = 0;
> struct dnotify_struct *dn;
> @@ -74,10 +74,9 @@ static void dnotify_recalc_inode_mask(struct fsnotify_mark *fsn_mark)
> for (dn = dn_mark->dn; dn != NULL; dn = dn->dn_next)
> new_mask |= (dn->dn_mask & ~FS_DN_MULTISHOT);
> if (fsn_mark->mask == new_mask)
> - return;
> + return false;
> fsn_mark->mask = new_mask;
> -
> - fsnotify_recalc_mask(fsn_mark->connector);
> + return true;
> }
>
> /*
> @@ -97,6 +96,7 @@ static int dnotify_handle_event(struct fsnotify_mark *inode_mark, u32 mask,
> struct dnotify_struct **prev;
> struct fown_struct *fown;
> __u32 test_mask = mask & ~FS_EVENT_ON_CHILD;
> + bool recalc = false;
>
> /* not a dir, dnotify doesn't care */
> if (!dir && !(mask & FS_ISDIR))
> @@ -118,12 +118,15 @@ static int dnotify_handle_event(struct fsnotify_mark *inode_mark, u32 mask,
> else {
> *prev = dn->dn_next;
> kmem_cache_free(dnotify_struct_cache, dn);
> - dnotify_recalc_inode_mask(inode_mark);
> + recalc = dnotify_recalc_inode_mask(inode_mark);
> }
> }
>
> spin_unlock(&inode_mark->lock);
>
> + if (recalc)
> + fsnotify_recalc_mask(inode_mark->connector);
> +
> return 0;
> }
>
> @@ -158,6 +161,7 @@ void dnotify_flush(struct file *filp, fl_owner_t id)
> struct dnotify_struct **prev;
> struct inode *inode;
> bool free = false;
> + bool recalc = false;
>
> inode = file_inode(filp);
> if (!S_ISDIR(inode->i_mode))
> @@ -176,7 +180,7 @@ void dnotify_flush(struct file *filp, fl_owner_t id)
> if ((dn->dn_owner == id) && (dn->dn_filp == filp)) {
> *prev = dn->dn_next;
> kmem_cache_free(dnotify_struct_cache, dn);
> - dnotify_recalc_inode_mask(fsn_mark);
> + recalc = dnotify_recalc_inode_mask(fsn_mark);
> break;
> }
> prev = &dn->dn_next;
> @@ -184,6 +188,9 @@ void dnotify_flush(struct file *filp, fl_owner_t id)
>
> spin_unlock(&fsn_mark->lock);
>
> + if (recalc)
> + fsnotify_recalc_mask(fsn_mark->connector);
> +
> /* nothing else could have found us thanks to the dnotify_groups
> mark_mutex */
> if (dn_mark->dn == NULL) {
> @@ -268,6 +275,7 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
> struct file *f;
> int destroy = 0, error = 0;
> __u32 mask;
> + bool recalc = false;
>
> /* we use these to tell if we need to kfree */
> new_fsn_mark = NULL;
> @@ -377,9 +385,11 @@ int fcntl_dirnotify(int fd, struct file *filp, unsigned long arg)
> else if (error == -EEXIST)
> error = 0;
>
> - dnotify_recalc_inode_mask(fsn_mark);
> + recalc = dnotify_recalc_inode_mask(fsn_mark);
> out:
> spin_unlock(&fsn_mark->lock);
> + if (recalc)
> + fsnotify_recalc_mask(fsn_mark->connector);
>
> if (destroy)
> fsnotify_detach_mark(fsn_mark);

I'd do else if (recalc)

just to emphasise that destroy and recalc are mutually exclusive.

Thanks,
Amir.

2022-11-12 10:06:40

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH v4 5/5] fsnotify: require inode lock held during child flag update


On Sat, Nov 12, 2022 at 12:06 AM Stephen Brennan
<[email protected]> wrote:
>
> With the prior changes to fsnotify, it is now possible for
> fsnotify_recalc_mask() to return before all child dentry flags have been
> set. Imagine that two CPUs attempt to add a mark to an inode which would
> require watching the children of that directory:
>
> CPU 1:                                     CPU 2:
>
> fsnotify_recalc_mask() {
>     spin_lock();
>     update_children = ...
>     __fsnotify_recalc_mask();
>     update_children = ...
>     spin_unlock();
>     // update_children is true!
>     fsnotify_conn_set_children_dentry_flags() {
>         // updating flags ...
>         cond_resched();
>                                            fsnotify_recalc_mask() {
>                                                spin_lock();
>                                                update_children = ...
>                                                __fsnotify_recalc_mask();
>                                                update_children = ...
>                                                spin_unlock();
>                                                // update_children is false
>                                            }
>                                            // returns to userspace, but
>                                            // not all children are marked
>         // continue updating flags
>     }
> }
>
> To prevent this situation, hold the directory inode lock. This ensures
> that any concurrent update to the mask will block until the update is
> complete, so that we can guarantee that child flags are set prior to
> returning.
>
> Since the directory inode lock is now held during iteration over
> d_subdirs, we are guaranteed that __d_move() cannot remove the dentry we
> hold, so we no longer need to check whether we should retry the
> iteration. We are also guaranteed that no cursors are moving through the
> list, since dcache_readdir() runs with the inode read lock held.
> Simplify the iteration by removing this logic.
>

I very much prefer to start the series with this patch and avoid
those details altogether.

> Signed-off-by: Stephen Brennan <[email protected]>
> ---
> fs/notify/fsnotify.c | 25 +++++++++----------------
> fs/notify/mark.c | 8 ++++++++
> 2 files changed, 17 insertions(+), 16 deletions(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 0ba61211456c..b5778775b88d 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -102,6 +102,8 @@ void fsnotify_sb_delete(struct super_block *sb)
> * on a child we run all of our children and set a dentry flag saying that the
> * parent cares. Thus when an event happens on a child it can quickly tell
> * if there is a need to find a parent and send the event to the parent.
> + *
> + * Context: inode locked exclusive
> */
> void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
> {
> @@ -124,22 +126,16 @@ void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
> * over d_subdirs which will allow us to sleep.
> */
> spin_lock(&alias->d_lock);
> -retry:
> list_for_each_entry(child, &alias->d_subdirs, d_child) {
> /*
> - * We need to hold a reference while we sleep. But we cannot
> - * sleep holding a reference to a cursor, or we risk skipping
> - * through the list.
> - *
> - * When we wake, dput() could free the dentry, invalidating the
> - * list pointers. We can't look at the list pointers until we
> - * re-lock the parent, and we can't dput() once we have the
> - * parent locked. So the solution is to hold onto our reference
> - * and free it the *next* time we drop alias->d_lock: either at
> - * the end of the function, or at the time of the next sleep.
> + * We need to hold a reference while we sleep. When we wake,
> + * dput() could free the dentry, invalidating the list pointers.
> + * We can't look at the list pointers until we re-lock the
> + * parent, and we can't dput() once we have the parent locked.
> + * So the solution is to hold onto our reference and free it the
> + * *next* time we drop alias->d_lock: either at the end of the
> + * function, or at the time of the next sleep.
> */
> - if (child->d_flags & DCACHE_DENTRY_CURSOR)
> - continue;
> if (need_resched()) {
> dget(child);
> spin_unlock(&alias->d_lock);
> @@ -147,9 +143,6 @@ void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
> last_ref = child;
> cond_resched();
> spin_lock(&alias->d_lock);
> - /* Check for races with __d_move() */
> - if (child->d_parent != alias)
> - goto retry;
> }
>
> /*
> diff --git a/fs/notify/mark.c b/fs/notify/mark.c
> index 6797a2952f87..f39cd88ad778 100644
> --- a/fs/notify/mark.c
> +++ b/fs/notify/mark.c
> @@ -203,10 +203,15 @@ static void fsnotify_conn_set_children_dentry_flags(
> void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
> {
> bool update_children;
> + struct inode *inode = NULL;
>
> if (!conn)
> return;
>
> + if (conn->type == FSNOTIFY_OBJ_TYPE_INODE) {
> + inode = fsnotify_conn_inode(conn);
> + inode_lock(inode);
> + }
> spin_lock(&conn->lock);
> update_children = !fsnotify_conn_watches_children(conn);
> __fsnotify_recalc_mask(conn);
> @@ -219,6 +224,9 @@ void fsnotify_recalc_mask(struct fsnotify_mark_connector *conn)
> */
> if (update_children)
> fsnotify_conn_set_children_dentry_flags(conn);
> +
> + if (inode)
> + inode_unlock(inode);
> }
>

Interesting.

I was imagining inode_lock taken only inside
fsnotify_conn_set_children_dentry_flags()

The reason is that removing the parent watch does not need
to be serialized for lazy clean up to work correctly.

Maybe I am missing something or maybe it is just best practice
to serialize all parent state changes to keep the mental model
of the code simpler and to keep it the same as the existing upstream
mental model.

So I am NOT opposed to serializing fsnotify_recalc_mask()
just wanted to hear opinions.

Thanks,
Amir.

2022-11-12 10:33:36

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH v4 4/5] fsnotify: allow sleepable child flag update

On Sat, Nov 12, 2022 at 12:06 AM Stephen Brennan
<[email protected]> wrote:
>
> With very large d_subdirs lists, iteration can take a long time. Since
> iteration needs to hold alias->d_lock, this can trigger soft lockups.
> It would be best to make this iteration sleepable. We can drop
> alias->d_lock and sleep, by taking a reference to the current child.
> However, we need to be careful, since it's possible for the child's
> list pointers to be modified once we drop alias->d_lock. The following
> are the only cases where the list pointers are modified:
>
> 1. dentry_unlist() in fs/dcache.c
>
> This is a helper of dentry_kill(). This function is quite careful to
> check the reference count of the dentry once it has taken the
> requisite locks, and bail out if a new reference was taken. So we
> can be safe that, assuming we found the dentry and took a reference
> before dropping alias->d_lock, any concurrent dentry_kill() should
> bail out and leave our list pointers untouched.
>
> 2. __d_move() in fs/dcache.c
>
> If the child was moved to a new parent, then we can detect this by
> testing d_parent and retrying the iteration.
>
> 3. Initialization code in d_alloc() family
>
> We are safe from this code, since we cannot encounter a dentry until
> it has been initialized.
>
> 4. Cursor code in fs/libfs.c for dcache_readdir()
>
> Dentries with DCACHE_DENTRY_CURSOR should be skipped before sleeping,
> since we could awaken to find they have skipped over a portion of the
> child list.
>
> Given these considerations, we can carefully write a loop that iterates
> over d_subdirs and is capable of going to sleep periodically.
>
> Signed-off-by: Stephen Brennan <[email protected]>
> ---
>
> Notes:
> v4:
> - I've updated this patch so it should be safe even without the
> inode locked, by handling cursors and d_move() races.
> - Moved comments to lower indentation
> - I didn't keep Amir's R-b since this was a fair bit of change.
> v3:
> - removed if statements around dput()
> v2:
> - added a check for child->d_parent != alias and retry logic
>
> fs/notify/fsnotify.c | 46 ++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 42 insertions(+), 4 deletions(-)
>
> diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
> index 409d479cbbc6..0ba61211456c 100644
> --- a/fs/notify/fsnotify.c
> +++ b/fs/notify/fsnotify.c
> @@ -105,7 +105,7 @@ void fsnotify_sb_delete(struct super_block *sb)
> */
> void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
> {
> - struct dentry *alias, *child;
> + struct dentry *alias, *child, *last_ref = NULL;
>
> if (!S_ISDIR(inode->i_mode))
> return;
> @@ -116,12 +116,49 @@ void fsnotify_update_children_dentry_flags(struct inode *inode, bool watched)
> return;
>
> /*
> - * run all of the children of the original inode and fix their
> - * d_flags to indicate parental interest (their parent is the
> - * original inode)
> + * These lists can get very long, so we may need to sleep during
> + * iteration. Normally this would be impossible without a cursor,
> + * but since we have the inode locked exclusive, we're guaranteed

Not exactly true for v4 patchset order, but I do prefer that we make it true.

> + * that the directory won't be modified, so whichever dentry we
> + * pick to sleep on won't get moved. So, start a manual iteration
> + * over d_subdirs which will allow us to sleep.
> */
> spin_lock(&alias->d_lock);
> +retry:

Better if we can avoid this retry by inode lock.
Note that it is enough to take inode_lock_shared() to protect
this code.

It means that tasks doing remove+add of a parent watch may
iterate d_subdirs in parallel, but maybe that's even better
than having them iterate d_subdirs sequentially?

> list_for_each_entry(child, &alias->d_subdirs, d_child) {
> + /*
> + * We need to hold a reference while we sleep. But we cannot
> + * sleep holding a reference to a cursor, or we risk skipping
> + * through the list.
> + *
> + * When we wake, dput() could free the dentry, invalidating the
> + * list pointers. We can't look at the list pointers until we
> + * re-lock the parent, and we can't dput() once we have the
> + * parent locked. So the solution is to hold onto our reference
> + * and free it the *next* time we drop alias->d_lock: either at
> + * the end of the function, or at the time of the next sleep.
> + */

My usual nit picking: you could concat this explanation to the
comment outside the loop. It won't make it any less clear, maybe
even more clear in the wider context of how the children are safely
iterated.

Thanks,
Amir.

2022-11-15 07:17:09

by Oliver Sang

[permalink] [raw]
Subject: Re: [PATCH v4 4/5] fsnotify: allow sleepable child flag update


Hi Stephen Brennan, we reported a similar issue on the v1 patch:
https://lore.kernel.org/all/[email protected]/
and now we have noticed that it persists on v4. FYI.


Greetings,

FYI, we noticed WARNING:possible_recursive_locking_detected due to commit (built with gcc-11):

commit: 5482581c7c96a8fe60bb29c181e86cdc9889b068 ("[PATCH v4 4/5] fsnotify: allow sleepable child flag update")
url: https://github.com/intel-lab-lkp/linux/commits/Stephen-Brennan/fsnotify-clear-PARENT_WATCHED-flags-lazily/20221112-070656
base: https://git.kernel.org/cgit/linux/kernel/git/jack/linux-fs.git fsnotify
patch subject: [PATCH v4 4/5] fsnotify: allow sleepable child flag update

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused the changes below (please refer to the attached dmesg/kmsg for the entire log/backtrace):



If you fix the issue, kindly add the following tag:
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-lkp/[email protected]


[ 16.419535][ T369] WARNING: possible recursive locking detected
[ 16.420080][ T369] 6.0.0-rc4-00068-g5482581c7c96 #1 Not tainted
[ 16.420595][ T369] --------------------------------------------
[ 16.421105][ T369] dnsmasq/369 is trying to acquire lock:
[ 16.421578][ T369] bfaa3858 (&dentry->d_lock){+.+.}-{2:2}, at: lockref_get (??:?)
[ 16.422251][ T369]
[ 16.422251][ T369] but task is already holding lock:
[ 16.422868][ T369] bf804d60 (&dentry->d_lock){+.+.}-{2:2}, at: fsnotify_update_children_dentry_flags (??:?)
[ 16.423733][ T369]
[ 16.423733][ T369] other info that might help us debug this:
[ 16.424438][ T369] Possible unsafe locking scenario:
[ 16.424438][ T369]
[ 16.425082][ T369] CPU0
[ 16.425376][ T369] ----
[ 16.425660][ T369] lock(&dentry->d_lock);
[ 16.426052][ T369] lock(&dentry->d_lock);
[ 16.426439][ T369]
[ 16.426439][ T369] *** DEADLOCK ***
[ 16.426439][ T369]
[ 16.427109][ T369] May be due to missing lock nesting notation
[ 16.427109][ T369]
[ 16.427808][ T369] 2 locks held by dnsmasq/369:
[ 16.428219][ T369] #0: 409474b0 (&group->mark_mutex){+.+.}-{3:3}, at: inotify_update_watch (inotify_user.c:?)
[ 16.429047][ T369] #1: bf804d60 (&dentry->d_lock){+.+.}-{2:2}, at: fsnotify_update_children_dentry_flags (??:?)
[ 16.430016][ T369]
[ 16.430016][ T369] stack backtrace:
[ 16.430538][ T369] CPU: 1 PID: 369 Comm: dnsmasq Not tainted 6.0.0-rc4-00068-g5482581c7c96 #1
[ 16.431276][ T369] Call Trace:
[ 16.431560][ T369] ? dump_stack_lvl (??:?)
[ 16.431990][ T369] ? dump_stack (??:?)
[ 16.432347][ T369] ? validate_chain.cold (lockdep.c:?)
[ 16.432791][ T369] ? __lock_acquire (lockdep.c:?)
[ 16.433233][ T369] ? lock_acquire (??:?)
[ 16.433651][ T369] ? lockref_get (??:?)
[ 16.434034][ T369] ? __lock_release (lockdep.c:?)
[ 16.434454][ T369] ? fsnotify_update_children_dentry_flags (??:?)
[ 16.435069][ T369] ? _raw_spin_lock (??:?)
[ 16.435465][ T369] ? lockref_get (??:?)
[ 16.435832][ T369] ? lockref_get (??:?)
[ 16.436190][ T369] ? fsnotify_update_children_dentry_flags (??:?)
[ 16.436753][ T369] ? fsnotify_recalc_mask (??:?)
[ 16.437229][ T369] ? fsnotify_add_mark_locked (??:?)
[ 16.437716][ T369] ? inotify_new_watch (inotify_user.c:?)
[ 16.438166][ T369] ? inotify_update_watch (inotify_user.c:?)
[ 16.438627][ T369] ? __ia32_sys_inotify_add_watch (??:?)
[ 16.439183][ T369] ? do_int80_syscall_32 (??:?)
[ 16.439640][ T369] ? entry_INT80_32 (entry_32.o:?)
LKP: ttyS0: 641: Kernel tests: Boot OK!
LKP: ttyS0: 641: HOSTNAME vm-snb, MAC 52:54:00:12:34:56, kernel 6.0.0-rc4-00068-g5482581c7c96 1
LKP: ttyS0: 641: /lkp/lkp/src/bin/run-lkp /lkp/jobs/scheduled/vm-meta-42/boot-1-openwrt-i386-generic-20190428.cgz-5482581c7c96a8fe60bb29c181e86cdc9889b068-20221115-34610-11px8cv-40.yaml
[ 18.633802][ T655] mkdir: can't create directory '/sys/kernel/debug': Operation not permitted
[ 18.633802][ T655] mount: mounting debug on /sys/kernel/debug failed: No such file or directory
[ 21.139501][ T856] process 856 (wait) attempted a POSIX timer syscall while CONFIG_POSIX_TIMERS is not set
[ 32.651080][ T957] rsync (957) used greatest stack depth: 6208 bytes left
LKP: ttyS0: 641: LKP: rebooting forcely
[ 42.464678][ T641] sysrq: Emergency Sync
[ 42.465117][ T641] sysrq: Resetting

Kboot worker: lkp-worker53
Elapsed time: 60

kvm=(
qemu-system-x86_64
-enable-kvm
-cpu SandyBridge
-kernel $kernel
-initrd initrd-vm-meta-42.cgz
-m 16384
-smp 2
-device e1000,netdev=net0
-netdev user,id=net0,hostfwd=tcp::32032-:22
-boot order=nc
-no-reboot
-device i6300esb
-watchdog-action debug
-rtc base=localtime
-serial stdio
-display none
-monitor null
)

append=(
ip=::::vm-meta-42::dhcp
root=/dev/ram0
RESULT_ROOT=/result/boot/1/vm-snb/openwrt-i386-generic-20190428.cgz/i386-randconfig-a011-20210930/gcc-11/5482581c7c96a8fe60bb29c181e86cdc9889b068/73
BOOT_IMAGE=/pkg/linux/i386-randconfig-a011-20210930/gcc-11/5482581c7c96a8fe60bb29c181e86cdc9889b068/vmlinuz-6.0.0-rc4-00068-g5482581c7c96
branch=linux-review/Stephen-Brennan/fsnotify-clear-PARENT_WATCHED-flags-lazily/20221112-070656
job=/job-script
user=lkp
ARCH=i386
kconfig=i386-randconfig-a011-20210930
commit=5482581c7c96a8fe60bb29c181e86cdc9889b068
vmalloc=256M
initramfs_async=0
page_owner=on
initcall_debug
max_uptime=600
result_service=tmpfs
selinux=0
debug
apic=debug
)


To reproduce:

# build kernel
cd linux
cp config-6.0.0-rc4-00068-g5482581c7c96 .config
make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.



--
0-DAY CI Kernel Test Service
https://01.org/lkp



Attachments:
(No filename) (6.17 kB)
config-6.0.0-rc4-00068-g5482581c7c96 (154.13 kB)
job-script (4.85 kB)
dmesg.xz (47.71 kB)

2022-11-22 12:38:23

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 0/5] fsnotify: fix softlockups iterating over d_subdirs

Hi Stephen!

On Fri 11-11-22 14:06:09, Stephen Brennan wrote:
> Here's my v4 patch series that aims to eliminate soft lockups when updating
> dentry flags in fsnotify. I've incorporated Jan's suggestion of simply
> allowing the flag to be lazily cleared in the fsnotify_parent() function,
> via Amir's patch. This allowed me to drop patch #2 from my previous series
> (fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem). I
> replaced it with "fsnotify: require inode lock held during child flag
> update", patch #5 in this series. I also added "dnotify: move
> fsnotify_recalc_mask() outside spinlock" to address the sleep-during-atomic
> issues with dnotify.

Yes, the series is now much simpler. Thanks!

> Jan expressed concerns about lock ordering of the inode rwsem with the
> fsnotify group mutex. I built this with lockdep enabled (see below for the
> lock debugging .config section -- I'm not too familiar with lockdep so I
> wanted a sanity check). I ran all the fanotify, inotify, and dnotify tests
> I could find in LTP, with no lockdep splats to be found. I don't know that
> this can completely satisfy the concerns about lock ordering: I'm reading
> through the code to better understand the concern about "the removal of
> oneshot mark during modify event generation". But I'm encouraged by the
> LTP+lockdep results.

So I had a look and I think your patches could cause deadlock at least for
nfsd. The problem is with things like inotify IN_ONESHOT marks. They get
autodeleted as soon as they trigger. Thus e.g. fsnotify_mkdir() can trigger
IN_ONESHOT mark and goes on removing it by calling fsnotify_destroy_mark()
from inotify_handle_inode_event(). And nfsd calls e.g. fsnotify_mkdir()
while holding dir->i_rwsem. So we have a lock ordering like:

nfsd_mkdir()
inode_lock(dir);
...
__nfsd_mkdir(dir, ...)
fsnotify_mkdir(dir, dentry);
...
inotify_handle_inode_event()
...
fsnotify_destroy_mark()
fsnotify_group_lock(group)

So we have dir->i_rwsem > group->mark_mutex. But we also have callchains
like:

inotify_add_watch()
inotify_update_watch()
fsnotify_group_lock(group)
inotify_update_existing_watch()
...
fsnotify_recalc_mask()
inode_lock(dir); -> added by your series

which creates ordering group->mark_mutex > dir->i_rwsem.

It is even worse with dnotify which (even with your patches) ends up
calling fsnotify_recalc_mask() from dnotify_handle_event() so we have a
possibility of direct A->A deadlock. But I'd leave dnotify aside, I think
that can be massaged to not need to call fsnotify_recalc_mask()
(__fsnotify_recalc_mask() would be enough there).

Still I'm not 100% sure about a proper way out of this. The simplicity of
alias->d_subdirs iteration with i_rwsem held is compelling. We could mandate
that fsnotify hooks cannot be called with inode->i_rwsem held (and fixup
nfsd) but IMO that is pushing the complexity from the fsnotify core into
its users which is undesirable. Maybe we could grab inode->i_rwsem in those
places adding / removing notification marks before we grab
group->mark_mutex, just verify (with lockdep) that fsnotify_recalc_mask()
has the inode->i_rwsem held and be done with it? That pushes a bit of
complexity into the fsnotify backends but it is not too bad.
fsnotify_recalc_mask() gets only called by dnotify, inotify, and fanotify.
Amir?

> I originally wrote this series to make the last patch (#5) optional: if for
> some reason we didn't think it was necessary to hold the inode rwsem, then
> we could omit it -- the main penalty being the race condition described in
> the patch description. I tested without the last patch and LTP passed also
> with lockdep enabled, but of course when multiple tasks did an inotifywait
> on the same directory (with many negative dentries) only the first waited
> for the flag updates, the rest of the tasks immediately returned despite
> the flags not being ready.
>
> I agree with Amir that as long as the lock ordering is fine, we should keep
> patch #5. And if that's the case, I can reorder the series a bit to make it
> a bit more logical, and eliminate logic in
> fsnotify_update_children_dentry_flags() for handling d_move/cursor races,
> which I promptly delete later in the series.
>
> 1. fsnotify: clear PARENT_WATCHED flags lazily
> 2. fsnotify: Use d_find_any_alias to get dentry associated with inode
> 3. dnotify: move fsnotify_recalc_mask() outside spinlock
> 4. fsnotify: require inode lock held during child flag update
> 5. fsnotify: allow sleepable child flag update
>
> Thanks for continuing to read this series, I hope we're making progress
> toward a simpler way to fix these scaling issues!

Yeah, so I'd be for making sure i_rwsem is held where we need it first and
only after that add reschedule handling into
fsnotify_update_children_dentry_flags(). That makes the series more
logical.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2022-11-22 14:11:14

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH v4 0/5] fsnotify: fix softlockups iterating over d_subdirs

On Tue, Nov 22, 2022 at 1:50 PM Jan Kara <[email protected]> wrote:
>
> Hi Stephen!
>
> On Fri 11-11-22 14:06:09, Stephen Brennan wrote:
> > Here's my v4 patch series that aims to eliminate soft lockups when updating
> > dentry flags in fsnotify. I've incorporated Jan's suggestion of simply
> > allowing the flag to be lazily cleared in the fsnotify_parent() function,
> > via Amir's patch. This allowed me to drop patch #2 from my previous series
> > (fsnotify: Protect i_fsnotify_mask and child flags with inode rwsem). I
> > replaced it with "fsnotify: require inode lock held during child flag
> > update", patch #5 in this series. I also added "dnotify: move
> > fsnotify_recalc_mask() outside spinlock" to address the sleep-during-atomic
> > issues with dnotify.
>
> Yes, the series is now much simpler. Thanks!
>
> > Jan expressed concerns about lock ordering of the inode rwsem with the
> > fsnotify group mutex. I built this with lockdep enabled (see below for the
> > lock debugging .config section -- I'm not too familiar with lockdep so I
> > wanted a sanity check). I ran all the fanotify, inotify, and dnotify tests
> > I could find in LTP, with no lockdep splats to be found. I don't know that
> > this can completely satisfy the concerns about lock ordering: I'm reading
> > through the code to better understand the concern about "the removal of
> > oneshot mark during modify event generation". But I'm encouraged by the
> > LTP+lockdep results.
>
> So I had a look and I think your patches could cause deadlock at least for
> nfsd. The problem is with things like inotify IN_ONESHOT marks. They get
> autodeleted as soon as they trigger. Thus e.g. fsnotify_mkdir() can trigger
> IN_ONESHOT mark and goes on removing it by calling fsnotify_destroy_mark()
> from inotify_handle_inode_event(). And nfsd calls e.g. fsnotify_mkdir()
> while holding dir->i_rwsem. So we have a lock ordering like:
>
> nfsd_mkdir()
> inode_lock(dir);
> ...
> __nfsd_mkdir(dir, ...)
> fsnotify_mkdir(dir, dentry);
> ...
> inotify_handle_inode_event()
> ...
> fsnotify_destroy_mark()
> fsnotify_group_lock(group)
>
> So we have dir->i_rwsem > group->mark_mutex. But we also have callchains
> like:
>
> inotify_add_watch()
> inotify_update_watch()
> fsnotify_group_lock(group)
> inotify_update_existing_watch()
> ...
> fsnotify_recalc_mask()
> inode_lock(dir); -> added by your series
>
> which creates ordering group->mark_mutex > dir->i_rwsem.
>
> It is even worse with dnotify which (even with your patches) ends up
> calling fsnotify_recalc_mask() from dnotify_handle_event() so we have a
> possibility of direct A->A deadlock. But I'd leave dnotify aside, I think
> that can be massaged to not need to call fsnotify_recalc_mask()
> (__fsnotify_recalc_mask() would be enough there).
>
> Still I'm not 100% sure about a proper way out of this. The simplicity of
> alias->d_subdirs iteration with i_rwsem held is compelling.

Agreed.

> We could mandate
> that fsnotify hooks cannot be called with inode->i_rwsem held (and fixup
> nfsd) but IMO that is pushing the complexity from the fsnotify core into
> its users which is undesirable.

I think inode in this context is the parent inode, so all fsnotify hooks
in namei.c are holding inode->i_rwsem by design.

> Maybe we could grab inode->i_rwsem in those
> places adding / removing notification marks before we grab
> group->mark_mutex, just verify (with lockdep) that fsnotify_recalc_mask()
> has the inode->i_rwsem held and be done with it? That pushes a bit of
> complexity into the fsnotify backends but it is not too bad.
> fsnotify_recalc_mask() gets only called by dnotify, inotify, and fanotify.
> Amir?
>

Absolutely agree - I think it makes sense and will simplify things a lot.

Obviously, we need to assert inode_is_locked() in fsnotify_recalc_mask()
only for (conn->type == FSNOTIFY_OBJ_TYPE_INODE).

Thanks,
Amir.