2022-02-09 23:52:36

by Stephen Brennan

Subject: [PATCH v2 0/4] Fix softlockup when adding inotify watch

Hi Al et al,

When a system with large amounts of memory has several millions of
negative dentries in a single directory, a softlockup can occur while
adding an inotify watch:

watchdog: BUG: soft lockup - CPU#20 stuck for 9s! [inotifywait:9528]
CPU: 20 PID: 9528 Comm: inotifywait Kdump: loaded Not tainted 5.16.0-rc4.20211208.el8uek.rc1.x86_64 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.4.1 12/03/2020
RIP: 0010:__fsnotify_update_child_dentry_flags+0xad/0x120
Call Trace:
<TASK>
fsnotify_add_mark_locked+0x113/0x160
inotify_new_watch+0x130/0x190
inotify_update_watch+0x11a/0x140
__x64_sys_inotify_add_watch+0xef/0x140
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x44/0xae

This patch series is a modified version of the following:
https://lore.kernel.org/linux-fsdevel/1611235185-1685-1-git-send-email-gautham.ananthakrishna@oracle.com/

The strategy employed by this series is to move negative dentries to the
end of the d_subdirs list, and mark them with a flag as "tail negative".
Then, readers of the d_subdirs list, which are only interested in
positive dentries, can stop reading once they reach the first tail
negative dentry. With this series applied, I'm able to avoid the above
softlockup caused by 200 million negative dentries on my test system:
inotify watches are set up nearly instantly.
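
Roughly, this means a d_subdirs walker that only cares about positive
dentries can stop at the first tail-negative sibling. As a sketch only
(the helper below is made up; the real changes are in patches 2-4):

	spin_lock(&parent->d_lock);
	list_for_each_entry(child, &parent->d_subdirs, d_child) {
		/*
		 * Negative dentries are swept to the tail and flagged, so the
		 * first tail-negative child means everything after it is
		 * negative too.
		 */
		if (d_is_tail_negative(child))
			break;
		handle_positive_child(child);	/* hypothetical */
	}
	spin_unlock(&parent->d_lock);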

Previously, Al expressed concerns about:

1. Possible memory corruption due to use of lock_parent() in
sweep_negative(), see patch 01 for fix.
2. The previous patch didn't catch all ways a negative dentry could
become positive (d_add, d_instantiate_new), see patch 01.
3. The previous series contained a new negative dentry limit, which
capped the negative dentry count at around 3 per hash bucket. I've
dropped this patch from the series.

Patches 2-4 are unmodified from the previous posting.

In v1 of the patch, a warning was triggered by patch 1:
https://lore.kernel.org/linux-fsdevel/20211218081736.GA1071@xsang-OptiPlex-9020/

I reproduced this warning and verified that it no longer occurs with my patch
on 5.17-rc kernels. In particular, commit 29044dae2e74 ("fsnotify: fix fsnotify
hooks in pseudo filesystems") resolves the warning, which I verified on the
5.16 branch that the 0day bot tested. It seems that nfsdfs was using d_delete
rather than d_drop to remove some pseudo-filesystem dentries, even though it
expected never to have negative dentries. I don't believe that warning
reflected an error in this patch series.

v2:
- explain the nfsd warning
- remove sweep_negative() call from __d_add - rely on dput() for that


Konstantin Khlebnikov (2):
dcache: add action D_WALK_SKIP_SIBLINGS to d_walk()
dcache: stop walking siblings if remaining dentries all negative

Stephen Brennan (2):
dcache: sweep cached negative dentries to the end of list of siblings
fsnotify: stop walking child dentries if remaining tail is negative

fs/dcache.c | 101 +++++++++++++++++++++++++++++++++++++++--
fs/libfs.c | 3 ++
fs/notify/fsnotify.c | 6 ++-
include/linux/dcache.h | 6 +++
4 files changed, 110 insertions(+), 6 deletions(-)

--
2.30.2



2022-02-09 23:53:17

by Stephen Brennan

Subject: [PATCH v2 3/4] dcache: add action D_WALK_SKIP_SIBLINGS to d_walk()

From: Konstantin Khlebnikov <[email protected]>

This lets a walker skip the remaining siblings once it sees d_is_tail_negative().

Signed-off-by: Konstantin Khlebnikov <[email protected]>
Signed-off-by: Gautham Ananthakrishna <[email protected]>
Signed-off-by: Stephen Brennan <[email protected]>
---
fs/dcache.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/fs/dcache.c b/fs/dcache.c
index 0960de9b9c36..8809643a4d5b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1391,12 +1391,14 @@ EXPORT_SYMBOL(shrink_dcache_sb);
* @D_WALK_QUIT: quit walk
* @D_WALK_NORETRY: quit when retry is needed
* @D_WALK_SKIP: skip this dentry and its children
+ * @D_WALK_SKIP_SIBLINGS: skip siblings and their children
*/
enum d_walk_ret {
D_WALK_CONTINUE,
D_WALK_QUIT,
D_WALK_NORETRY,
D_WALK_SKIP,
+ D_WALK_SKIP_SIBLINGS,
};

/**
@@ -1427,6 +1429,7 @@ static void d_walk(struct dentry *parent, void *data,
break;
case D_WALK_QUIT:
case D_WALK_SKIP:
+ case D_WALK_SKIP_SIBLINGS:
goto out_unlock;
case D_WALK_NORETRY:
retry = false;
@@ -1458,6 +1461,9 @@ static void d_walk(struct dentry *parent, void *data,
case D_WALK_SKIP:
spin_unlock(&dentry->d_lock);
continue;
+ case D_WALK_SKIP_SIBLINGS:
+ spin_unlock(&dentry->d_lock);
+ goto skip_siblings;
}

if (!list_empty(&dentry->d_subdirs)) {
@@ -1469,6 +1475,7 @@ static void d_walk(struct dentry *parent, void *data,
}
spin_unlock(&dentry->d_lock);
}
+skip_siblings:
/*
* All done at this level ... ascend and resume the search.
*/
--
2.30.2
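
For illustration, an enter() callback that only cares about positive
dentries could use the new action roughly like this (a sketch, not part
of this series - the real user is the fsnotify change in patch 4/4, and
the callback and helper names here are made up):

	static enum d_walk_ret collect_positive(void *data, struct dentry *dentry)
	{
		/* Tail-negative means every remaining sibling is negative too. */
		if (d_is_tail_negative(dentry))
			return D_WALK_SKIP_SIBLINGS;

		if (d_is_positive(dentry))
			note_positive_dentry(data, dentry);	/* hypothetical */

		return D_WALK_CONTINUE;
	}

	/* ... elsewhere ... */
	d_walk(parent, &my_data, collect_positive);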


2022-02-09 23:53:23

by Stephen Brennan

Subject: [PATCH v2 1/4] dcache: sweep cached negative dentries to the end of list of siblings

For disk filesystems, the result of every negative lookup is cached, and the
content of directories is usually cached too. Production of negative dentries
isn't limited by disk speed: it's really easy to generate millions of them if
the system has enough memory. Negative dentries are linked into the siblings
list along with normal positive dentries. Some operations walk the dcache
tree but look only for positive dentries; the most important is
fsnotify/inotify.

This patch sweeps negative dentries to the end of the list at final dput()
and marks them with a flag which tells that all following dentries are
negative too. We do this carefully to avoid corruption in case the dentry is
killed while we try to lock its parent. The reverse operation (recycle) is
required before instantiating a tail negative dentry, or before calling
d_add() with a non-NULL inode.

Co-authored-by: Konstantin Khlebnikov <[email protected]>
Co-authored-by: Gautham Ananthakrishna <[email protected]>
Signed-off-by: Stephen Brennan <[email protected]>
---

Notes:
v2: Remove the sweep_negative() call in __d_add() when inode is NULL.

fs/dcache.c | 85 +++++++++++++++++++++++++++++++++++++++---
include/linux/dcache.h | 6 +++
2 files changed, 86 insertions(+), 5 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index c84269c6e8bf..0960de9b9c36 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -660,6 +660,58 @@ static inline struct dentry *lock_parent(struct dentry *dentry)
return __lock_parent(dentry);
}

+/*
+ * Move cached negative dentry to the tail of parent->d_subdirs.
+ * This lets walkers skip them all together at first sight.
+ * Must be called at dput of negative dentry with d_lock held.
+ * Releases d_lock.
+ */
+static void sweep_negative(struct dentry *dentry)
+{
+ struct dentry *parent;
+
+ rcu_read_lock();
+ parent = lock_parent(dentry);
+ if (!parent) {
+ rcu_read_unlock();
+ return;
+ }
+
+ /*
+ * If we did not hold a reference to dentry (as in the case of dput),
+ * and dentry->d_lock was dropped in lock_parent(), then we could now be
+ * holding onto a dead dentry. Be careful to check d_count and unlock
+ * before dropping RCU lock, otherwise we could corrupt freed memory.
+ */
+ if (!d_count(dentry) && d_is_negative(dentry) &&
+ !d_is_tail_negative(dentry)) {
+ dentry->d_flags |= DCACHE_TAIL_NEGATIVE;
+ list_move_tail(&dentry->d_child, &parent->d_subdirs);
+ }
+
+ spin_unlock(&parent->d_lock);
+ spin_unlock(&dentry->d_lock);
+ rcu_read_unlock();
+}
+
+/*
+ * Undo sweep_negative() and move to the head of parent->d_subdirs.
+ * Must be called before converting negative dentry into positive.
+ * Must hold dentry->d_lock, we may drop and re-grab it via lock_parent().
+ * Must hold a reference or be otherwise sure the dentry cannot be killed.
+ */
+static void recycle_negative(struct dentry *dentry)
+{
+ struct dentry *parent;
+
+ parent = lock_parent(dentry);
+ dentry->d_flags &= ~DCACHE_TAIL_NEGATIVE;
+ if (parent) {
+ list_move(&dentry->d_child, &parent->d_subdirs);
+ spin_unlock(&parent->d_lock);
+ }
+}
+
static inline bool retain_dentry(struct dentry *dentry)
{
WARN_ON(d_in_lookup(dentry));
@@ -765,7 +817,7 @@ static struct dentry *dentry_kill(struct dentry *dentry)
static inline bool fast_dput(struct dentry *dentry)
{
int ret;
- unsigned int d_flags;
+ unsigned int d_flags, required;

/*
* If we have a d_op->d_delete() operation, we sould not
@@ -813,6 +865,8 @@ static inline bool fast_dput(struct dentry *dentry)
* a 'delete' op, and it's referenced and already on
* the LRU list.
*
+ * Cached negative dentry must be swept to the tail.
+ *
* NOTE! Since we aren't locked, these values are
* not "stable". However, it is sufficient that at
* some point after we dropped the reference the
@@ -830,11 +884,16 @@ static inline bool fast_dput(struct dentry *dentry)
*/
smp_rmb();
d_flags = READ_ONCE(dentry->d_flags);
+
+ required = DCACHE_REFERENCED | DCACHE_LRU_LIST |
+ (d_flags_negative(d_flags) ? DCACHE_TAIL_NEGATIVE : 0);
+
d_flags &= DCACHE_REFERENCED | DCACHE_LRU_LIST |
- DCACHE_DISCONNECTED | DCACHE_DONTCACHE;
+ DCACHE_DISCONNECTED | DCACHE_DONTCACHE |
+ DCACHE_TAIL_NEGATIVE;

/* Nothing to do? Dropping the reference was all we needed? */
- if (d_flags == (DCACHE_REFERENCED | DCACHE_LRU_LIST) && !d_unhashed(dentry))
+ if (d_flags == required && !d_unhashed(dentry))
return true;

/*
@@ -906,7 +965,10 @@ void dput(struct dentry *dentry)
rcu_read_unlock();

if (likely(retain_dentry(dentry))) {
- spin_unlock(&dentry->d_lock);
+ if (d_is_negative(dentry) && !d_is_tail_negative(dentry))
+ sweep_negative(dentry); /* drops d_lock */
+ else
+ spin_unlock(&dentry->d_lock);
return;
}

@@ -1998,6 +2060,8 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
WARN_ON(d_in_lookup(dentry));

spin_lock(&dentry->d_lock);
+ if (d_is_tail_negative(dentry))
+ recycle_negative(dentry);
/*
* Decrement negative dentry count if it was in the LRU list.
*/
@@ -2722,6 +2786,13 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
struct inode *dir = NULL;
unsigned n;
spin_lock(&dentry->d_lock);
+ /*
+ * Tail negative dentries must become positive before associating an
+ * inode. recycle_negative() is safe to use because we hold a reference
+ * to dentry.
+ */
+ if (inode && d_is_tail_negative(dentry))
+ recycle_negative(dentry);
if (unlikely(d_in_lookup(dentry))) {
dir = dentry->d_parent->d_inode;
n = start_dir_add(dir);
@@ -2738,7 +2809,11 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode)
__d_rehash(dentry);
if (dir)
end_dir_add(dir, n);
- spin_unlock(&dentry->d_lock);
+
+ if (!inode && !d_is_tail_negative(dentry))
+ sweep_negative(dentry); /* drops d_lock */
+ else
+ spin_unlock(&dentry->d_lock);
if (inode)
spin_unlock(&inode->i_lock);
}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index f5bba51480b2..7cc1bd384912 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -211,6 +211,7 @@ struct dentry_operations {
#define DCACHE_PAR_LOOKUP 0x10000000 /* being looked up (with parent locked shared) */
#define DCACHE_DENTRY_CURSOR 0x20000000
#define DCACHE_NORCU 0x40000000 /* No RCU delay for freeing */
+#define DCACHE_TAIL_NEGATIVE 0x80000000 /* All following siblings are negative */

extern seqlock_t rename_lock;

@@ -489,6 +490,11 @@ static inline int simple_positive(const struct dentry *dentry)
return d_really_is_positive(dentry) && !d_unhashed(dentry);
}

+static inline bool d_is_tail_negative(const struct dentry *dentry)
+{
+ return unlikely(dentry->d_flags & DCACHE_TAIL_NEGATIVE);
+}
+
extern void d_set_fallthru(struct dentry *dentry);

static inline bool d_is_fallthru(const struct dentry *dentry)
--
2.30.2


2022-02-10 05:44:46

by Al Viro

Subject: Re: [PATCH v2 1/4] dcache: sweep cached negative dentries to the end of list of siblings

On Wed, Feb 09, 2022 at 03:14:03PM -0800, Stephen Brennan wrote:

> +static void sweep_negative(struct dentry *dentry)
> +{
> + struct dentry *parent;
> +
> + rcu_read_lock();
> + parent = lock_parent(dentry);
> + if (!parent) {
> + rcu_read_unlock();
> + return;
> + }
> +
> + /*
> + * If we did not hold a reference to dentry (as in the case of dput),
> + * and dentry->d_lock was dropped in lock_parent(), then we could now be
> + * holding onto a dead dentry. Be careful to check d_count and unlock
> + * before dropping RCU lock, otherwise we could corrupt freed memory.
> + */
> + if (!d_count(dentry) && d_is_negative(dentry) &&
> + !d_is_tail_negative(dentry)) {
> + dentry->d_flags |= DCACHE_TAIL_NEGATIVE;
> + list_move_tail(&dentry->d_child, &parent->d_subdirs);
> + }
> +
> + spin_unlock(&parent->d_lock);
> + spin_unlock(&dentry->d_lock);
> + rcu_read_unlock();
> +}

I'm not sure if it came up the last time you'd posted this series
(and I apologize if it had and I forgot the explanation), but... consider
the comment in dentry_unlist(). What's to prevent the race described there
making d_walk() skip a part of tree, by replacing the "lseek moving cursor
in just the wrong moment" with "dput moving the negative dentry right next
to the one being killed to the tail of the list"?

The race in question:
d_walk() is leaving a subdirectory. We are here:
rcu_read_lock();
ascend:
if (this_parent != parent) {

It isn't - we are not back to the root of tree being walked.
At this point this_parent is the directory we'd just finished looking into.

struct dentry *child = this_parent;
this_parent = child->d_parent;

... and now child points to it, and this_parent points to its parent.

spin_unlock(&child->d_lock);

No locks held. Another CPU gets through successful rmdir(). child gets
unhashed and dropped. It's off the ->d_subdirs of this_parent; its
->d_child.next is still pointing where it used to, and whatever it points
to won't be physically freed until rcu_read_unlock().

Moreover, in the meanwhile this next sibling (negative, pinned) got dput().
And had been moved to the tail of the this_parent->d_subdirs. Since
its ->d_child.prev does *NOT* point to child (which is off-list, about to
be freed shortly, etc.), child->d_child.next is not modified - it still
points to that (now moved) sibling.

spin_lock(&this_parent->d_lock);
Got it.

/* might go back up the wrong parent if we have had a rename. */
if (need_seqretry(&rename_lock, seq))
goto rename_retry;

Nope, hadn't happened.

/* go into the first sibling still alive */
do {
next = child->d_child.next;
... and this is the moved sibling, now in the end of the ->d_subdirs.

if (next == &this_parent->d_subdirs)
goto ascend;

No, it is not - it's the last element of the list, not its anchor.

child = list_entry(next, struct dentry, d_child);

Our moved negative dentry.

} while (unlikely(child->d_flags & DCACHE_DENTRY_KILLED));

Not killed, that one.
rcu_read_unlock();
goto resume;

... and since that sucker has no children, we proceed to look at it,
ascend and now we are at the end of this_parent->d_subdirs. And we
ascend out of it, having entirely skipped all branches that used to
be between the rmdir victim and the end of the parent's ->d_subdirs.

What am I missing here? Unlike the trick we used with cursors (see
dentry_unlist()) we can't predict who won't get moved in this case...

Note that treating "child has DCACHE_DENTRY_KILLED" the same as we do
for rename_lock mismatches would not work unless you grab the spinlock
component of rename_lock every time dentry becomes positive. Which
is obviously not feasible (it's a system-wide lock and cacheline
pingpong alone would hurt us very badly, not to mention the contention
issues due to the frequency of grabbing it going up by several orders
of magnitude).

2022-02-16 06:49:30

by Stephen Brennan

Subject: Re: [PATCH v2 1/4] dcache: sweep cached negative dentries to the end of list of siblings

Hi Al,

Al Viro <[email protected]> writes:
> On Wed, Feb 09, 2022 at 03:14:03PM -0800, Stephen Brennan wrote:
>
>> +static void sweep_negative(struct dentry *dentry)
>> +{
>> + struct dentry *parent;
>> +
>> + rcu_read_lock();
>> + parent = lock_parent(dentry);
>> + if (!parent) {
>> + rcu_read_unlock();
>> + return;
>> + }
>> +
>> + /*
>> + * If we did not hold a reference to dentry (as in the case of dput),
>> + * and dentry->d_lock was dropped in lock_parent(), then we could now be
>> + * holding onto a dead dentry. Be careful to check d_count and unlock
>> + * before dropping RCU lock, otherwise we could corrupt freed memory.
>> + */
>> + if (!d_count(dentry) && d_is_negative(dentry) &&
>> + !d_is_tail_negative(dentry)) {
>> + dentry->d_flags |= DCACHE_TAIL_NEGATIVE;
>> + list_move_tail(&dentry->d_child, &parent->d_subdirs);
>> + }
>> +
>> + spin_unlock(&parent->d_lock);
>> + spin_unlock(&dentry->d_lock);
>> + rcu_read_unlock();
>> +}
>
> I'm not sure if it came up the last time you'd posted this series
> (and I apologize if it had and I forgot the explanation), but... consider
> the comment in dentry_unlist(). What's to prevent the race described there
> making d_walk() skip a part of tree, by replacing the "lseek moving cursor
> in just the wrong moment" with "dput moving the negative dentry right next
> to the one being killed to the tail of the list"?

This did not come up previously, so thanks for pointing this out.

>
> The race in question:
> d_walk() is leaving a subdirectory. We are here:
> rcu_read_lock();
> ascend:
> if (this_parent != parent) {
>
> It isn't - we are not back to the root of tree being walked.
> At this point this_parent is the directory we'd just finished looking into.
>
> struct dentry *child = this_parent;
> this_parent = child->d_parent;
>
> ... and now child points to it, and this_parent points to its parent.
>
> spin_unlock(&child->d_lock);
>
> No locks held. Another CPU gets through successful rmdir(). child gets
> unhashed and dropped. It's off the ->d_subdirs of this_parent; its
> ->d_child.next is still pointing where it used to, and whatever it points
> to won't be physically freed until rcu_read_unlock().
>
> Moreover, in the meanwhile this next sibling (negative, pinned) got dput().
> And had been moved to the tail of the this_parent->d_subdirs. Since
> its ->d_child.prev does *NOT* point to child (which is off-list, about to
> be freed shortly, etc.), child->d_child.next is not modified - it still
> points to that (now moved) sibling.

It seems to me that, if we had taken a reference on child by
incrementing the reference count prior to unlocking it, then
dentry_unlist could never have been called, since we would never have
made it into __dentry_kill. child would still be on the list, and any
cursor (or sweep_negative) list updates would now be reflected in
child->d_child.next. But dput is definitely not safe while holding a
lock on a parent dentry (even more so now thanks to my patch), so that
is out of the question.

Would dput_to_list be an appropriate solution to that issue? We can
maintain a dispose list in d_walk and then for any dput which really
drops the refcount to 0, we can handle them after d_walk is done. It
shouldn't be that many dentries anyway.

>
> spin_lock(&this_parent->d_lock);
> Got it.
>
> /* might go back up the wrong parent if we have had a rename. */
> if (need_seqretry(&rename_lock, seq))
> goto rename_retry;
>
> Nope, hadn't happened.
>
> /* go into the first sibling still alive */
> do {
> next = child->d_child.next;
> ... and this is the moved sibling, now in the end of the ->d_subdirs.
>
> if (next == &this_parent->d_subdirs)
> goto ascend;
>
> No, it is not - it's the last element of the list, not its anchor.
>
> child = list_entry(next, struct dentry, d_child);
>
> Our moved negative dentry.
>
> } while (unlikely(child->d_flags & DCACHE_DENTRY_KILLED));
>
> Not killed, that one.
> rcu_read_unlock();
> goto resume;
>
> ... and since that sucker has no children, we proceed to look at it,
> ascend and now we are at the end of this_parent->d_subdirs. And we
> ascend out of it, having entirely skipped all branches that used to
> be between the rmdir victim and the end of the parent's ->d_subdirs.
>
> What am I missing here? Unlike the trick we used with cursors (see
> dentry_unlist()) we can't predict who won't get moved in this case...

I don't think you're missing anything, unfortunately. Maybe if my above
idea pans out, we could prevent this, but without that, reordering dentries
on the d_subdirs list and d_walk are fundamentally opposing features. The
cursor trick is neat but not applicable here.

>
> Note that treating "child has DCACHE_DENTRY_KILLED" the same as we do
> for rename_lock mismatches would not work unless you grab the spinlock
> component of rename_lock every time dentry becomes positive. Which
> is obviously not feasible (it's a system-wide lock and cacheline
> pingpong alone would hurt us very badly, not to mention the contention
> issues due to the frequency of grabbing it going up by several orders
> of magnitude).

You won't catch me advocating for a global lock like that :P

I'm going to keep looking into this, since some of our high-uptime
customer machines have steady workloads which just keep churning out
negative dentries, and they tend to be in a particular subdirectory. If
the machine has oodles of free memory then we just let them create
dentries like candy until something eventually topples over, and it
tends to be something like this.

Thanks,
Stephen

2022-02-16 07:19:49

by Al Viro

Subject: Re: [PATCH v2 1/4] dcache: sweep cached negative dentries to the end of list of siblings

On Wed, Feb 16, 2022 at 03:27:39AM +0000, Al Viro wrote:
> On Tue, Feb 15, 2022 at 06:24:53PM -0800, Stephen Brennan wrote:
>
> > It seems to me that, if we had taken a reference on child by
> > incrementing the reference count prior to unlocking it, then
> > dentry_unlist could never have been called, since we would never have
> > made it into __dentry_kill. child would still be on the list, and any
> > cursor (or sweep_negative) list updates would now be reflected in
> > child->d_child.next. But dput is definitely not safe while holding a
> > lock on a parent dentry (even more so now thanks to my patch), so that
> > is out of the question.
> >
> > Would dput_to_list be an appropriate solution to that issue? We can
> > maintain a dispose list in d_walk and then for any dput which really
> > drops the refcount to 0, we can handle them after d_walk is done. It
> > shouldn't be that many dentries anyway.
>
> Interesting idea, but... what happens to behaviour of e.g.
> shrink_dcache_parent()? You'd obviously need to modify the test in
> select_collect(), but then the selected dentries become likely candidates
> for d_walk() itself wanting to move them over to its internal shrink list.
> OTOH, __dput_to_list() will just decrement the count and skip the sucker
> if it's already on a shrink list...
>
> It might work, but it really needs a careful analysis wrt.
> parallel d_walk(). What happens when you have two threads hitting
> shrink_dcache_parent() on two different places, one being an ancestor
> of another? That can happen in parallel, and currently it does work
> correctly, but that's fairly delicate and there are places where a minor
> change could turn O(n) into O(n^2), etc.
>
> Let me think about that - I'm not saying it's hopeless, and it
> would be nice to avoid that subtlety in dentry_unlist(), but there
> might be dragons.

PS: another obvious change is that d_walk() would become blocking.
So e.g.

int path_has_submounts(const struct path *parent)
{
	struct check_mount data = { .mnt = parent->mnt, .mounted = 0 };

	read_seqlock_excl(&mount_lock);
	d_walk(parent->dentry, &data, path_check_mount);
	read_sequnlock_excl(&mount_lock);

	return data.mounted;
}

would need a rework - d_walk() is under a spinlock here. Another
potential headache in that respect is d_genocide() - currently non-blocking,
with this change extremely likely to do evictions. That, however, is
not a problem for current in-tree callers - they are all shortly followed
by shrink_dcache_parent() or equivalents.

path_has_submounts(), though... I'd really hate to reintroduce the
"call this on entry/call this on exit" callbacks. Perhaps it would
be better to pass the dispose list to d_walk() and have the callers
deal with evictions? For that matter, shrink_dcache_parent() and
friends would be just fine passing the same list they are collecting
into.

<looks at path_has_submounts() callers>
*growl*
autofs_d_automount() has it called under sbi->fs_lock. So we'd need
to take the disposal all the way out there, and export shrink_dentry_list()
while we are at it. Not pretty ;-/

And no, we can't make the disposal async, so offloading it to a worker or
thread is not feasible...
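
FWIW, the shape for path_has_submounts() would be something like this
(very rough, assuming d_walk() grows a dispose-list argument and
shrink_dentry_list() gets exported):

	int path_has_submounts(const struct path *parent)
	{
		struct check_mount data = { .mnt = parent->mnt, .mounted = 0 };
		LIST_HEAD(dispose);

		read_seqlock_excl(&mount_lock);
		d_walk(parent->dentry, &data, path_check_mount, &dispose);
		read_sequnlock_excl(&mount_lock);

		/*
		 * Evict only after mount_lock is dropped. Still not enough
		 * for the autofs caller, which holds sbi->fs_lock here, so
		 * the disposal would really have to be pushed out to it.
		 */
		shrink_dentry_list(&dispose);

		return data.mounted;
	}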

2022-02-16 07:19:54

by Al Viro

Subject: Re: [PATCH v2 1/4] dcache: sweep cached negative dentries to the end of list of siblings

On Tue, Feb 15, 2022 at 06:24:53PM -0800, Stephen Brennan wrote:

> It seems to me that, if we had taken a reference on child by
> incrementing the reference count prior to unlocking it, then
> dentry_unlist could never have been called, since we would never have
> made it into __dentry_kill. child would still be on the list, and any
> cursor (or sweep_negative) list updates would now be reflected in
> child->d_child.next. But dput is definitely not safe while holding a
> lock on a parent dentry (even more so now thanks to my patch), so that
> is out of the question.
>
> Would dput_to_list be an appropriate solution to that issue? We can
> maintain a dispose list in d_walk and then for any dput which really
> drops the refcount to 0, we can handle them after d_walk is done. It
> shouldn't be that many dentries anyway.

Interesting idea, but... what happens to behaviour of e.g.
shrink_dcache_parent()? You'd obviously need to modify the test in
select_collect(), but then the selected dentries become likely candidates
for d_walk() itself wanting to move them over to its internal shrink list.
OTOH, __dput_to_list() will just decrement the count and skip the sucker
if it's already on a shrink list...

It might work, but it really needs a careful analysis wrt.
parallel d_walk(). What happens when you have two threads hitting
shrink_dcache_parent() on two different places, one being an ancestor
of another? That can happen in parallel, and currently it does work
correctly, but that's fairly delicate and there are places where a minor
change could turn O(n) into O(n^2), etc.

Let me think about that - I'm not saying it's hopeless, and it
would be nice to avoid that subtlety in dentry_unlist(), but there
might be dragons.

2022-02-16 08:07:33

by Stephen Brennan

Subject: Re: [PATCH v2 1/4] dcache: sweep cached negative dentries to the end of list of siblings

Al Viro <[email protected]> writes:
> On Wed, Feb 16, 2022 at 03:27:39AM +0000, Al Viro wrote:
>> On Tue, Feb 15, 2022 at 06:24:53PM -0800, Stephen Brennan wrote:
>>
>> > It seems to me that, if we had taken a reference on child by
>> > incrementing the reference count prior to unlocking it, then
>> > dentry_unlist could never have been called, since we would never have
>> > made it into __dentry_kill. child would still be on the list, and any
>> > cursor (or sweep_negative) list updates would now be reflected in
>> > child->d_child.next. But dput is definitely not safe while holding a
>> > lock on a parent dentry (even more so now thanks to my patch), so that
>> > is out of the question.
>> >
>> > Would dput_to_list be an appropriate solution to that issue? We can
>> > maintain a dispose list in d_walk and then for any dput which really
>> > drops the refcount to 0, we can handle them after d_walk is done. It
>> > shouldn't be that many dentries anyway.
>>
>> Interesting idea, but... what happens to behaviour of e.g.
>> shrink_dcache_parent()? You'd obviously need to modify the test in
>> select_collect(), but then the selected dentries become likely candidates
>> for d_walk() itself wanting to move them over to its internal shrink list.
>> OTOH, __dput_to_list() will just decrement the count and skip the sucker
>> if it's already on a shrink list...
>>
>> It might work, but it really needs a careful analysis wrt.
>> parallel d_walk(). What happens when you have two threads hitting
>> shrink_dcache_parent() on two different places, one being an ancestor
>> of another? That can happen in parallel, and currently it does work
>> correctly, but that's fairly delicate and there are places where a minor
>> change could turn O(n) into O(n^2), etc.
>>
>> Let me think about that - I'm not saying it's hopeless, and it
>> would be nice to avoid that subtlety in dentry_unlist(), but there
>> might be dragons.
>
> PS: another obvious change is that d_walk() would become blocking.
> So e.g.
>
> int path_has_submounts(const struct path *parent)
> {
> 	struct check_mount data = { .mnt = parent->mnt, .mounted = 0 };
>
> 	read_seqlock_excl(&mount_lock);
> 	d_walk(parent->dentry, &data, path_check_mount);
> 	read_sequnlock_excl(&mount_lock);
>
> 	return data.mounted;
> }
>
> would need a rework - d_walk() is under a spinlock here. Another
> potential headache in that respect is d_genocide() - currently non-blocking,
> with this change extremely likely to do evictions. That, however, is
> not a problem for current in-tree callers - they are all shortly followed
> by shrink_dcache_parent() or equivalents.
>
> path_has_submounts(), though... I'd really hate to reintroduce the
> "call this on entry/call this on exit" callbacks. Perhaps it would
> be better to pass the dispose list to d_walk() and have the callers
> deal with evictions? For that matter, shrink_dcache_parent() and
> friends would be just fine passing the same list they are collecting
> into.
>
> <looks at path_has_submounts() callers>
> *growl*
> autofs_d_automount() has it called under sbi->fs_lock. So we'd need
> to take the disposal all the way out there, and export shrink_dentry_list()
> while we are at it. Not pretty ;-/
>
> And no, we can't make the disposal async, so offloading it to a worker or
> thread is not feasible...

Hm, making d_walk blocking seems like a real barrier to some users that
use d_walk as a lightweight check (e.g. path_has_submounts) and aren't
sensitive to retries. Other callers like shrink_dcache_parent are okay
with blocking and already have a dispose list, but they want to avoid
retries.

What if we gave d_walk an optional "dispose" list parameter? For those
that provide it, we take references to ensure that child doesn't get
killed out from under us. For those which don't, if child does get
killed, then they are forced to retry from the beginning of
this_parent->d_subdirs. This wouldn't eliminate the careful analysis
necessary for parallel d_walk when both provide dispose lists. But it
would avoid making all d_walk callers pay the penalty of becoming
blocking.

Here's a super rough sketch of what I mean (it's late at night, no
obvious syntax errors but it won't compile unless all the callers are
updated):

diff --git a/fs/dcache.c b/fs/dcache.c
index e98079ed86be..307d32f023c8 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1408,11 +1408,12 @@ enum d_walk_ret {
* @enter: callback when first entering the dentry
*
* The @enter() callbacks are called with d_lock held.
*/
static void d_walk(struct dentry *parent, void *data,
- enum d_walk_ret (*enter)(void *, struct dentry *))
+ enum d_walk_ret (*enter)(void *, struct dentry *),
+ struct list_head *dispose)
{
struct dentry *this_parent;
struct list_head *next;
unsigned seq = 0;
enum d_walk_ret ret;
@@ -1483,23 +1484,30 @@ static void d_walk(struct dentry *parent, void *data,
ascend:
if (this_parent != parent) {
struct dentry *child = this_parent;
this_parent = child->d_parent;

- spin_unlock(&child->d_lock);
- spin_lock(&this_parent->d_lock);
+ if (dispose) {
+ child->d_lockref.count += 1;
+ spin_unlock(&child->d_lock);
+ spin_lock(&this_parent->d_lock);
+ dput_to_list(child, dispose);
+ } else {
+ spin_unlock(&child->d_lock);
+ spin_lock(&this_parent->d_lock);
+ }

/* might go back up the wrong parent if we have had a rename. */
if (need_seqretry(&rename_lock, seq))
goto rename_retry;
- /* go into the first sibling still alive */
- do {
- next = child->d_child.next;
- if (next == &this_parent->d_subdirs)
- goto ascend;
- child = list_entry(next, struct dentry, d_child);
- } while (unlikely(child->d_flags & DCACHE_DENTRY_KILLED));
+
+ /* child was killed, its next pointers may now be stale, retry */
+ if (child->d_flags & DCACHE_DENTRY_KILLED)
+ goto killed_retry;
+
+ next = child->d_child.next;
+ child = list_entry(next, struct dentry, d_child);
rcu_read_unlock();
goto resume;
}
if (need_seqretry(&rename_lock, seq))
goto rename_retry;
@@ -1516,10 +1524,18 @@ static void d_walk(struct dentry *parent, void *data,
BUG_ON(seq & 1);
if (!retry)
return;
seq = 1;
goto again;
+
+killed_retry:
+ rcu_read_unlock();
+ BUG_ON(dispose);
+ if (!retry)
+ return;
+ goto repeat;
+
}

struct check_mount {
struct vfsmount *mnt;
unsigned int mounted;