On Fri, 13 Apr 2018 13:28:23 -0700 Khazhismel Kumykov <[email protected]> wrote:
> shrink_dcache_parent may spin waiting for a parallel shrink_dentry_list.
> In this case we may have 0 dentries to dispose, so we will never
> schedule out while waiting for the parallel shrink_dentry_list to
> complete.
>
> Tested that this fixes syzbot reports of stalls in shrink_dcache_parent()
Well I guess the patch is OK as a stopgap, but things seem fairly
messed up in there. shrink_dcache_parent() shouldn't be doing a
busywait, waiting for the concurrent shrink_dentry_list().
Either we should be waiting (sleeping) for the concurrent operation to
complete or we should just bail out of shrink_dcache_parent(), perhaps
with
if (list_empty(&data.dispose))
break;
or similar. Dunno.
That block comment over `struct select_data' is not a good one. "It
returns zero iff...". *What* returns zero? select_collect()? No it
doesn't, it returns an `enum d_walk_ret'. Perhaps the comment is
trying to refer to select_data.found. And the real interpretation of
select_data.found is, umm, hard to describe. "Counts the number of
dentries which are on a shrink list or which were moved to the dispose
list". Why? What's that all about?
This code needs a bit of thought, documentation and perhaps a redo,
I suspect.
On Fri, 13 Apr 2018, Khazhismel Kumykov wrote:
> shrink_dcache_parent may spin waiting for a parallel shrink_dentry_list.
> In this case we may have 0 dentries to dispose, so we will never
> schedule out while waiting for the parallel shrink_dentry_list to
> complete.
>
> Tested that this fixes syzbot reports of stalls in shrink_dcache_parent()
>
> Fixes: 32785c0539b7 ("fs/dcache.c: add cond_resched() in shrink_dentry_list()")
> Reported-by: [email protected]
>
> Cc: Nikolay Borisov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Alexander Viro <[email protected]>
> Cc: Goldwyn Rodrigues <[email protected]>
> Cc: Jeff Mahoney <[email protected]>
> Cc: Davidlohr Bueso <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Signed-off-by: Khazhismel Kumykov <[email protected]>
Acked-by: David Rientjes <[email protected]>
On 14.04.2018 00:14, Andrew Morton wrote:
> On Fri, 13 Apr 2018 13:28:23 -0700 Khazhismel Kumykov <[email protected]> wrote:
>
>> shrink_dcache_parent may spin waiting for a parallel shrink_dentry_list.
>> In this case we may have 0 dentries to dispose, so we will never
>> schedule out while waiting for the parallel shrink_dentry_list to
>> complete.
>>
>> Tested that this fixes syzbot reports of stalls in shrink_dcache_parent()
>
> Well I guess the patch is OK as a stopgap, but things seem fairly
> messed up in there. shrink_dcache_parent() shouldn't be doing a
> busywait, waiting for the concurrent shrink_dentry_list().
>
> Either we should be waiting (sleeping) for the concurrent operation to
> complete or we should just bail out of shrink_dcache_parent(), perhaps
> with
>
> if (list_empty(&data.dispose))
> break;
>
> or similar. Dunno.
I agree, however, not being a dcache expert I'd refrain from touching
it, since it seems to be rather fragile. Perhaps Al could take a look in
there?
>
>
> That block comment over `struct select_data' is not a good one. "It
> returns zero iff...". *What* returns zero? select_collect()? No it
> doesn't, it returns an `enum d_walk_ret'. Perhaps the comment is
> trying to refer to select_data.found. And the real interpretation of
> select_data.found is, umm, hard to describe. "Counts the number of
> dentries which are on a shrink list or which were moved to the dispose
> list". Why? What's that all about?
>
> This code needs a bit of thought, documentation and perhaps a redo,
> I suspect.
>
On Sat, Apr 14, 2018 at 10:00:29AM +0300, Nikolay Borisov wrote:
>
>
> On 14.04.2018 00:14, Andrew Morton wrote:
> > On Fri, 13 Apr 2018 13:28:23 -0700 Khazhismel Kumykov <[email protected]> wrote:
> >
> >> shrink_dcache_parent may spin waiting for a parallel shrink_dentry_list.
> >> In this case we may have 0 dentries to dispose, so we will never
> >> schedule out while waiting for the parallel shrink_dentry_list to
> >> complete.
> >>
> >> Tested that this fixes syzbot reports of stalls in shrink_dcache_parent()
> >
> > Well I guess the patch is OK as a stopgap, but things seem fairly
> > messed up in there. shrink_dcache_parent() shouldn't be doing a
> > busywait, waiting for the concurrent shrink_dentry_list().
> >
> > Either we should be waiting (sleeping) for the concurrent operation to
> > complete or we should just bail out of shrink_dcache_parent(), perhaps
> > with
> >
> > if (list_empty(&data.dispose))
> > break;
> >
> > or similar. Dunno.
>
> I agree, however, not being a dcache expert I'd refrain from touching
> it, since it seems to be rather fragile. Perhaps Al could take a look in
> there?
"Bail out" is definitely a bad idea, "sleep"... what on? Especially
since there might be several evictions we are overlapping with...
On Sat, Apr 14, 2018 at 1:02 AM, Al Viro <[email protected]> wrote:
>
> "Bail out" is definitely a bad idea, "sleep"... what on? Especially
> since there might be several evictions we are overlapping with...
Well, one thing that should be looked at is the return condition from
select_collect() that shrink_dcache_parent() uses.
Because I think that return condition is somewhat insane.
The logic there seems to be:
- if we have found something, stop walking. Either NOW (if somebody
is waiting) or after you've hit a rename (if nobody is)
Now, this actually makes perfect sense for the whole rename situation:
if there's nobody waiting for us, but we hit a rename, we probably
should stop anyway just to let whoever is doing that rename continue,
and we might as well try to get rid of the dentries we have found so
far.
But it does *not* make sense for the case where we've hit a dentry
that is already on the shrink list. Sure, we'll continue to gather all
the other dentries, but if there is concurrent shrinking, shouldn't we
give up the CPU more eagerly - *particularly* if somebody else is
waiting (it might be the other process that actually gets rid of the
shrinking dentries!)?
So my gut feel is that we should at least try doing something like
this in select_collect():
- if (!list_empty(&data->dispose))
+ if (data->found)
ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;
because even if we haven't actually been able to shrink something, if
we hit an already shrinking entry we should probably at least not do
the "retry for rename". And if we actually are going to reschedule, we
might as well start from the beginning.
I realize that *this* thread might not be making any actual progress
(because it didn't find any dentries to shrink), but since it did find
_a_ dentry that is being shrunk, we know the operation itself - on a
bigger scale - is making progress.
Hmm?
Now, this is independent of the fact that we probably do need a
cond_resched() in shrink_dcache_parent(), to actually do the
reschedule if we're not preemptible. The "need_resched()" in
select_collect() is obviously done while holding the dentry spinlocks,
so it cannot actually reschedule on its own.
HOWEVER. Even in that case, I don't think shrink_dcache_parent() is
the right point. I'd rather just do it differently in
shrink_dentry_list(): do it even for the empty list case by just doing
it at the top of the loop:
static void shrink_dentry_list(struct list_head *list)
{
- while (!list_empty(list)) {
+ while (cond_resched(), !list_empty(list)) {
struct dentry *dentry, *parent;
- cond_resched();
so my full patch that I would suggest might be TheRightThing(tm) is
attached (but it should be committed as two patches, since the two
issues are independent - I'm just attaching it as one for testing in
case somebody wants to run some nasty workloads on it)
Comments?
Side note: I think we might want to make that
while (cond_resched(), <condition>) {
....
}
thing a pattern for doing cond_resched() in loops, instead of having
the cond_resched() inside the loop itself.
It not only handles the "zero iterations" case, it also ends up being
neutral location-wise wrt 'continue' statements, and potentially
generates *better* code.
For example, in this case, doing the cond_resched() at the very top of
the loop means that the loop itself then does that
dentry = list_entry(list->prev, struct dentry, d_lru);
right after the "list_empty()" test - which means that register
allocation etc might be easier, because it doesn't have a function
call (with associated register clobbers) in between the two accesses
to "list".
And I think that might be a fairly common pattern - the loop
conditional uses the same values as the loop itself then uses.
I don't know. Maybe I'm just making excuses for the somewhat unusual syntax.
Anybody want to test this out?
Linus
On Sat, Apr 14, 2018 at 09:36:23AM -0700, Linus Torvalds wrote:
> But it does *not* make sense for the case where we've hit a dentry
> that is already on the shrink list. Sure, we'll continue to gather all
> the other dentries, but if there is concurrent shrinking, shouldn't we
> give up the CPU more eagerly - *particularly* if somebody else is
> waiting (it might be the other process that actually gets rid of the
> shrinking dentries!)?
>
> So my gut feel is that we should at least try doing something like
> this in select_collect():
>
> - if (!list_empty(&data->dispose))
> + if (data->found)
> ret = need_resched() ? D_WALK_QUIT : D_WALK_NORETRY;
>
> because even if we haven't actually been able to shrink something, if
> we hit an already shrinking entry we should probably at least not do
> the "retry for rename". And if we actually are going to reschedule, we
> might as well start from the beginning.
>
> I realize that *this* thread might not be making any actual progress
> (because it didn't find any dentries to shrink), but since it did find
> _a_ dentry that is being shrunk, we know the operation itself - on a
> bigger scale - is making progress.
>
> Hmm?
That breaks d_invalidate(), unfortunately. Look at the termination
conditions in the loop there...
On Sat, Apr 14, 2018 at 1:58 PM, Al Viro <[email protected]> wrote:
>
> That breaks d_invalidate(), unfortunately. Look at the termination
> conditions in the loop there...
Ugh. I was going to say "but that doesn't even use select_collect()",
but yeah, detach_and_collect() calls it.
It would be easy enough to just change the
if (!list_empty(&data.select.dispose))
there to
if (!list_empty(&data.select.found))
too.
In fact, it probably *should* do that, exactly to get the whole
"cond_resched()" call in that whole call chain too. Because as-is, it
looks like it has the same issue as shrink_dcache_parent() does..
But yeah, the fact that I didn't notice that makes me a bit nervous.
But now I triple-checked, there are no other indirect callers.
Linus
On Sat, Apr 14, 2018 at 02:47:21PM -0700, Linus Torvalds wrote:
> On Sat, Apr 14, 2018 at 1:58 PM, Al Viro <[email protected]> wrote:
> >
> > That breaks d_invalidate(), unfortunately. Look at the termination
> > conditions in the loop there...
>
> Ugh. I was going to say "but that doesn't even use select_collect()",
> but yeah, detach_and_collect() calls it.
>
> It would be easy enough to just change the
>
> if (!list_empty(&data.select.dispose))
>
> there to
>
> if (!list_empty(&data.select.found))
>
> too.
You would have to do the same in check_and_drop() as well,
and that brings back d_invalidate()/d_invalidate() livelock
we used to have. See 81be24d263db...
I'm trying to put something together, but the damn thing is
full of potential livelocks, unfortunately ;-/ Will send
a followup once I have something resembling a sane solution...
On Sun, Apr 15, 2018 at 01:51:07AM +0100, Al Viro wrote:
> On Sat, Apr 14, 2018 at 02:47:21PM -0700, Linus Torvalds wrote:
> > On Sat, Apr 14, 2018 at 1:58 PM, Al Viro <[email protected]> wrote:
> > >
> > > That breaks d_invalidate(), unfortunately. Look at the termination
> > > conditions in the loop there...
> >
> > Ugh. I was going to say "but that doesn't even use select_collect()",
> > but yeah, detach_and_collect() calls it.
> >
> > It would be easy enough to just change the
> >
> > if (!list_empty(&data.select.dispose))
> >
> > there to
> >
> > if (!list_empty(&data.select.found))
> >
> > too.
>
> You would have to do the same in check_and_drop() as well,
> and that brings back d_invalidate()/d_invalidate() livelock
> we used to have. See 81be24d263db...
>
> I'm trying to put something together, but the damn thing is
> full of potential livelocks, unfortunately ;-/ Will send
> a followup once I have something resembling a sane solution...
I really wonder if we should just do the following in
d_invalidate():
* grab ->d_lock on victim, check if it's unhashed,
unlock and bugger off if it is. Otherwise, unhash and unlock.
From that point on any d_set_mounted() in the subtree will
fail.
* shrink_dcache_parent() to reduce the subtree size.
* go through the (hopefully shrunk) subtree, picking
mountpoints. detach_mounts() for each of them.
	* shrink_dcache_parent() if any mountpoints had been
encountered, to kick the now-unpinned stuff.
As a side benefit, we could probably be gentler on rename_lock
in d_set_mounted() after that change...
On Sun, Apr 15, 2018 at 03:39:44AM +0100, Al Viro wrote:
> I really wonder if we should just do the following in
> d_invalidate():
> * grab ->d_lock on victim, check if it's unhashed,
> unlock and bugger off if it is. Otherwise, unhash and unlock.
> From that point on any d_set_mounted() in the subtree will
> fail.
> * shrink_dcache_parent() to reduce the subtree size.
> * go through the (hopefully shrunk) subtree, picking
> mountpoints. detach_mounts() for each of them.
> 	* shrink_dcache_parent() if any mountpoints had been
> encountered, to kick the now-unpinned stuff.
>
> As a side benefit, we could probably be gentler on rename_lock
> in d_set_mounted() after that change...
Something like this, IOW:
diff --git a/fs/dcache.c b/fs/dcache.c
index 86d2de63461e..e61c07ff2d95 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -1230,13 +1230,11 @@ enum d_walk_ret {
* @parent: start of walk
* @data: data passed to @enter() and @finish()
* @enter: callback when first entering the dentry
- * @finish: callback when successfully finished the walk
*
* The @enter() and @finish() callbacks are called with d_lock held.
*/
static void d_walk(struct dentry *parent, void *data,
- enum d_walk_ret (*enter)(void *, struct dentry *),
- void (*finish)(void *))
+ enum d_walk_ret (*enter)(void *, struct dentry *))
{
struct dentry *this_parent;
struct list_head *next;
@@ -1325,8 +1323,6 @@ static void d_walk(struct dentry *parent, void *data,
if (need_seqretry(&rename_lock, seq))
goto rename_retry;
rcu_read_unlock();
- if (finish)
- finish(data);
out_unlock:
spin_unlock(&this_parent->d_lock);
@@ -1375,7 +1371,7 @@ int path_has_submounts(const struct path *parent)
struct check_mount data = { .mnt = parent->mnt, .mounted = 0 };
read_seqlock_excl(&mount_lock);
- d_walk(parent->dentry, &data, path_check_mount, NULL);
+ d_walk(parent->dentry, &data, path_check_mount);
read_sequnlock_excl(&mount_lock);
return data.mounted;
@@ -1483,7 +1479,7 @@ void shrink_dcache_parent(struct dentry *parent)
data.start = parent;
data.found = 0;
- d_walk(parent, &data, select_collect, NULL);
+ d_walk(parent, &data, select_collect);
if (!data.found)
break;
@@ -1518,7 +1514,7 @@ static enum d_walk_ret umount_check(void *_data, struct dentry *dentry)
static void do_one_tree(struct dentry *dentry)
{
shrink_dcache_parent(dentry);
- d_walk(dentry, dentry, umount_check, NULL);
+ d_walk(dentry, dentry, umount_check);
d_drop(dentry);
dput(dentry);
}
@@ -1542,78 +1538,48 @@ void shrink_dcache_for_umount(struct super_block *sb)
}
}
-struct detach_data {
- struct select_data select;
- struct dentry *mountpoint;
-};
-static enum d_walk_ret detach_and_collect(void *_data, struct dentry *dentry)
+static enum d_walk_ret find_submount(void *_data, struct dentry *dentry)
{
- struct detach_data *data = _data;
-
if (d_mountpoint(dentry)) {
__dget_dlock(dentry);
- data->mountpoint = dentry;
+ *(struct dentry **)_data = dentry;
return D_WALK_QUIT;
}
- return select_collect(&data->select, dentry);
-}
-
-static void check_and_drop(void *_data)
-{
- struct detach_data *data = _data;
-
- if (!data->mountpoint && list_empty(&data->select.dispose))
- __d_drop(data->select.start);
+ return D_WALK_CONTINUE;
}
/**
* d_invalidate - detach submounts, prune dcache, and drop
* @dentry: dentry to invalidate (aka detach, prune and drop)
- *
- * no dcache lock.
- *
- * The final d_drop is done as an atomic operation relative to
- * rename_lock ensuring there are no races with d_set_mounted. This
- * ensures there are no unhashed dentries on the path to a mountpoint.
*/
void d_invalidate(struct dentry *dentry)
{
- /*
- * If it's already been dropped, return OK.
- */
+ bool had_submounts = false;
spin_lock(&dentry->d_lock);
if (d_unhashed(dentry)) {
spin_unlock(&dentry->d_lock);
return;
}
+ __d_drop(dentry);
spin_unlock(&dentry->d_lock);
/* Negative dentries can be dropped without further checks */
- if (!dentry->d_inode) {
- d_drop(dentry);
+ if (!dentry->d_inode)
return;
- }
+ shrink_dcache_parent(dentry);
for (;;) {
- struct detach_data data;
-
- data.mountpoint = NULL;
- INIT_LIST_HEAD(&data.select.dispose);
- data.select.start = dentry;
- data.select.found = 0;
-
- d_walk(dentry, &data, detach_and_collect, check_and_drop);
-
- if (!list_empty(&data.select.dispose))
- shrink_dentry_list(&data.select.dispose);
- else if (!data.mountpoint)
+ struct dentry *victim = NULL;
+ d_walk(dentry, &victim, find_submount);
+ if (!victim) {
+ if (had_submounts)
+ shrink_dcache_parent(dentry);
return;
-
- if (data.mountpoint) {
- detach_mounts(data.mountpoint);
- dput(data.mountpoint);
}
+ had_submounts = true;
+ detach_mounts(victim);
+ dput(victim);
}
}
EXPORT_SYMBOL(d_invalidate);
@@ -3112,7 +3078,7 @@ static enum d_walk_ret d_genocide_kill(void *data, struct dentry *dentry)
void d_genocide(struct dentry *parent)
{
- d_walk(parent, parent, d_genocide_kill, NULL);
+ d_walk(parent, parent, d_genocide_kill);
}
EXPORT_SYMBOL(d_genocide);
On Sat, Apr 14, 2018 at 5:51 PM, Al Viro <[email protected]> wrote:
>> if (!list_empty(&data.select.found))
That was obviously meant to be just
if (data.select.found)
I had just cut-and-pasted a bit too much.
> You would have to do the same in check_and_drop() as well,
> and that brings back d_invalidate()/d_invalidate() livelock
> we used to have. See 81be24d263db...
Ugh. These are all really incestuous and very intertwined. Yes.
> I'm trying to put something together, but the damn thing is
> full of potential livelocks, unfortunately ;-/ Will send
> a followup once I have something resembling a sane solution...
Ok, that patch of yours looks like a nice cleanup, although *please*
don't do this:
- struct detach_data *data = _data;
-
if (d_mountpoint(dentry)) {
__dget_dlock(dentry);
- data->mountpoint = dentry;
+ *(struct dentry **)_data = dentry;
Please keep the temporary variable, and make it do
+ struct dentry **victim = _victim;
...
+ *victim = dentry;
to kind of match the caller, which does
d_walk(dentry, &victim, find_submount);
because I abhor those casts inside code, and we have a pattern of
passing 'void *_xyz' to callback functions and then making the right
type by that kind of
struct right_type *xyz = _xyz;
at the very top of the function.
No, it's obviously not type-safe, but at least it's _legible_, and is
a pattern, while that "let's randomly just do a cast in the middle of
the code" is just nasty.
Side note: I do feel like "d_walk()" should be returning whether it
terminated early or not. For example, this very same code in the
caller does
+ struct dentry *victim = NULL;
+ d_walk(dentry, &victim, find_submount);
+ if (!victim) {
but in many ways it would be more natural to just check the exit condition, and
+ struct dentry *victim;
+ if (!d_walk(dentry, &victim, find_submount)) {
don't you think? Because that matches the actual setting condition in
the find_submount() callback.
There are other situations where the same thing is true: that
path_check_mount() currently has that "info->mounted" flag, but again,
it could be replaced by just checking what the quit condition was, and
whether we terminated early or not. Because the two are 100%
equivalent, and the return value in many ways would be more logical, I
feel.
(I'm not sure if we should just return the actual exit condition -
defaulting to D_WALK_CONTINUE if there was nothing to walk at all - or
whether we should just return a boolean for "terminated early")
Hmm?
Linus
On Sun, Apr 15, 2018 at 11:34:17AM -0700, Linus Torvalds wrote:
> No, it's obviously not type-safe, but at least it's _legible_, and is
> a pattern, while that "let's randomly just do a cast in the middle of
> the code" is just nasty.
Sure, no problem... I really wish there was a way to say
void foo(int (*f)(α *), α *data) ∀ α
and have the compiler verify that foo(f, v) is done only
when f(v) is well-typed, but that's C, not Haskell... The best
approximation is something along the lines of
void __foo(int (*f)(void *), void *data);
#define foo(f, v) (sizeof((f)((v)), 0), __foo((f),(v)))
and that relies upon the identical calling sequence for all pointer
arguments. AFAIK, it's true for all ABIs we support, but...
Worse, there's no way to get #define in macro expansion, so the
above would be impossible to hide behind anything convenient ;-/
> Side note: I do feel like "d_walk()" should be returning whether it
> terminated early or not. For example, this very same code in the
> caller does
>
> + struct dentry *victim = NULL;
> + d_walk(dentry, &victim, find_submount);
> + if (!victim) {
>
> but in many ways it would be more natural to just check the exit condition, and
>
> + struct dentry *victim;
> + if (!d_walk(dentry, &victim, find_submount)) {
>
> don't you think? Because that matches the actual setting condition in
> the find_submount() callback.
>
> There are other situations where the same thing is true: that
> path_check_mount() currently has that "info->mounted" flag, but again,
> it could be replaced by just checking what the quit condition was, and
> whether we terminated early or not. Because the two are 100%
> equivalent, and the return value in many ways would be more logical, I
> feel.
>
> (I'm not sure if we should just return the actual exit condition -
> defaulting to D_WALK_CONTINUE if there was nothing to walk at all - or
> whether we should just return a boolean for "terminated early")
>
> Hmm?
Not sure... There are 5 callers:
* do_one_tree(), d_genocide() - nothing to return
* path_has_submounts(), d_invalidate() - could use your trick,
but d_invalidate() wants to look at victim if not buggering off, so
that one doesn't win much
* shrink_dcache_parent() - no way to use that. Here we normally
run the walk to completion and need to repeat it in all cases of early
termination *and* in some of the ran-to-completion cases.
BTW, the current placement of cond_resched() looks bogus; suppose we
have collected a lot of victims and ran into need_resched(). We leave
d_walk() and call shrink_dentry_list(). At that point there's a lot
of stuff on our shrink list and anybody else running into them will
have to keep scanning. Giving up the timeslice before we take care
of any of those looks like a bad idea, to put it mildly, and that's
precisely what will happen.
What about doing that in the end of __dentry_kill() instead? And to
hell with both existing call sites - dput() one (before going to
the parent) is obviously covered by that (dentry_kill() only returns
non-NULL after having called __dentry_kill()) and in shrink_dentry_list()
we'll get to it as soon as we go through all dentries that can be
immediately kicked off the shrink list. Which, AFAICS, improves the
situation, now that shrink_lock_dentry() contains no trylock loops...
Comments?
On Sun, Apr 15, 2018 at 09:40:54PM +0100, Al Viro wrote:
> BTW, the current placement of cond_resched() looks bogus; suppose we
> have collected a lot of victims and ran into need_resched(). We leave
> d_walk() and call shrink_dentry_list(). At that point there's a lot
> of stuff on our shrink list and anybody else running into them will
> have to keep scanning. Giving up the timeslice before we take care
> of any of those looks like a bad idea, to put it mildly, and that's
> precisely what will happen.
>
> What about doing that in the end of __dentry_kill() instead? And to
> hell with both existing call sites - dput() one (before going to
> the parent) is obviously covered by that (dentry_kill() only returns
> non-NULL after having called __dentry_kill()) and in shrink_dentry_list()
> we'll get to it as soon as we go through all dentries that can be
> immediately kicked off the shrink list. Which, AFAICS, improves the
> situation, now that shrink_lock_dentry() contains no trylock loops...
>
> Comments?
What I mean is something like this (cumulative diff, it'll obviously need
to be carved up into 3--4 commits):
diff --git a/fs/dcache.c b/fs/dcache.c
index 86d2de63461e..b1e62b54941b 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -580,6 +580,7 @@ static void __dentry_kill(struct dentry *dentry)
spin_unlock(&dentry->d_lock);
if (likely(can_free))
dentry_free(dentry);
+ cond_resched();
}
static struct dentry *__lock_parent(struct dentry *dentry)
@@ -827,30 +828,23 @@ static inline bool fast_dput(struct dentry *dentry)
*/
void dput(struct dentry *dentry)
{
- if (unlikely(!dentry))
- return;
+ while (dentry) {
+ might_sleep();
-repeat:
- might_sleep();
+ rcu_read_lock();
+ if (likely(fast_dput(dentry))) {
+ rcu_read_unlock();
+ return;
+ }
- rcu_read_lock();
- if (likely(fast_dput(dentry))) {
+ /* Slow case: now with the dentry lock held */
rcu_read_unlock();
- return;
- }
-
- /* Slow case: now with the dentry lock held */
- rcu_read_unlock();
-
- if (likely(retain_dentry(dentry))) {
- spin_unlock(&dentry->d_lock);
- return;
- }
- dentry = dentry_kill(dentry);
- if (dentry) {
- cond_resched();
- goto repeat;
+ if (likely(retain_dentry(dentry))) {
+ spin_unlock(&dentry->d_lock);
+ return;
+ }
+ dentry = dentry_kill(dentry);
}
}
EXPORT_SYMBOL(dput);
@@ -1052,8 +1046,6 @@ static void shrink_dentry_list(struct list_head *list)
while (!list_empty(list)) {
struct dentry *dentry, *parent;
- cond_resched();
-
dentry = list_entry(list->prev, struct dentry, d_lru);
spin_lock(&dentry->d_lock);
rcu_read_lock();
@@ -1230,13 +1222,11 @@ enum d_walk_ret {
* @parent: start of walk
* @data: data passed to @enter() and @finish()
* @enter: callback when first entering the dentry
- * @finish: callback when successfully finished the walk
*
* The @enter() and @finish() callbacks are called with d_lock held.
*/
static void d_walk(struct dentry *parent, void *data,
- enum d_walk_ret (*enter)(void *, struct dentry *),
- void (*finish)(void *))
+ enum d_walk_ret (*enter)(void *, struct dentry *))
{
struct dentry *this_parent;
struct list_head *next;
@@ -1325,8 +1315,6 @@ static void d_walk(struct dentry *parent, void *data,
if (need_seqretry(&rename_lock, seq))
goto rename_retry;
rcu_read_unlock();
- if (finish)
- finish(data);
out_unlock:
spin_unlock(&this_parent->d_lock);
@@ -1375,7 +1363,7 @@ int path_has_submounts(const struct path *parent)
struct check_mount data = { .mnt = parent->mnt, .mounted = 0 };
read_seqlock_excl(&mount_lock);
- d_walk(parent->dentry, &data, path_check_mount, NULL);
+ d_walk(parent->dentry, &data, path_check_mount);
read_sequnlock_excl(&mount_lock);
return data.mounted;
@@ -1483,7 +1471,7 @@ void shrink_dcache_parent(struct dentry *parent)
data.start = parent;
data.found = 0;
- d_walk(parent, &data, select_collect, NULL);
+ d_walk(parent, &data, select_collect);
if (!data.found)
break;
@@ -1518,7 +1506,7 @@ static enum d_walk_ret umount_check(void *_data, struct dentry *dentry)
static void do_one_tree(struct dentry *dentry)
{
shrink_dcache_parent(dentry);
- d_walk(dentry, dentry, umount_check, NULL);
+ d_walk(dentry, dentry, umount_check);
d_drop(dentry);
dput(dentry);
}
@@ -1542,78 +1530,49 @@ void shrink_dcache_for_umount(struct super_block *sb)
}
}
-struct detach_data {
- struct select_data select;
- struct dentry *mountpoint;
-};
-static enum d_walk_ret detach_and_collect(void *_data, struct dentry *dentry)
+static enum d_walk_ret find_submount(void *_data, struct dentry *dentry)
{
- struct detach_data *data = _data;
-
+ struct dentry **victim = (struct dentry **)_data;
if (d_mountpoint(dentry)) {
__dget_dlock(dentry);
- data->mountpoint = dentry;
+ *victim = dentry;
return D_WALK_QUIT;
}
- return select_collect(&data->select, dentry);
-}
-
-static void check_and_drop(void *_data)
-{
- struct detach_data *data = _data;
-
- if (!data->mountpoint && list_empty(&data->select.dispose))
- __d_drop(data->select.start);
+ return D_WALK_CONTINUE;
}
/**
* d_invalidate - detach submounts, prune dcache, and drop
* @dentry: dentry to invalidate (aka detach, prune and drop)
- *
- * no dcache lock.
- *
- * The final d_drop is done as an atomic operation relative to
- * rename_lock ensuring there are no races with d_set_mounted. This
- * ensures there are no unhashed dentries on the path to a mountpoint.
*/
void d_invalidate(struct dentry *dentry)
{
- /*
- * If it's already been dropped, return OK.
- */
+ bool had_submounts = false;
spin_lock(&dentry->d_lock);
if (d_unhashed(dentry)) {
spin_unlock(&dentry->d_lock);
return;
}
+ __d_drop(dentry);
spin_unlock(&dentry->d_lock);
/* Negative dentries can be dropped without further checks */
- if (!dentry->d_inode) {
- d_drop(dentry);
+ if (!dentry->d_inode)
return;
- }
+ shrink_dcache_parent(dentry);
for (;;) {
- struct detach_data data;
-
- data.mountpoint = NULL;
- INIT_LIST_HEAD(&data.select.dispose);
- data.select.start = dentry;
- data.select.found = 0;
-
- d_walk(dentry, &data, detach_and_collect, check_and_drop);
-
- if (!list_empty(&data.select.dispose))
- shrink_dentry_list(&data.select.dispose);
- else if (!data.mountpoint)
+ struct dentry *victim = NULL;
+ d_walk(dentry, &victim, find_submount);
+ if (!victim) {
+ if (had_submounts)
+ shrink_dcache_parent(dentry);
return;
-
- if (data.mountpoint) {
- detach_mounts(data.mountpoint);
- dput(data.mountpoint);
}
+ had_submounts = true;
+ detach_mounts(victim);
+ dput(victim);
}
}
EXPORT_SYMBOL(d_invalidate);
@@ -3112,7 +3071,7 @@ static enum d_walk_ret d_genocide_kill(void *data, struct dentry *dentry)
void d_genocide(struct dentry *parent)
{
- d_walk(parent, parent, d_genocide_kill, NULL);
+ d_walk(parent, parent, d_genocide_kill);
}
EXPORT_SYMBOL(d_genocide);
On Sun, Apr 15, 2018 at 10:54:55PM +0100, Al Viro wrote:
> On Sun, Apr 15, 2018 at 09:40:54PM +0100, Al Viro wrote:
>
> > BTW, the current placement of cond_resched() looks bogus; suppose we
> > have collected a lot of victims and ran into need_resched(). We leave
> > d_walk() and call shrink_dentry_list(). At that point there's a lot
> > of stuff on our shrink list and anybody else running into them will
> > have to keep scanning. Giving up the timeslice before we take care
> > of any of those looks like a bad idea, to put it mildly, and that's
> > precisely what will happen.
> >
> > What about doing that in the end of __dentry_kill() instead? And to
> > hell with both existing call sites - dput() one (before going to
> > the parent) is obviously covered by that (dentry_kill() only returns
> > non-NULL after having called __dentry_kill()) and in shrink_dentry_list()
> > we'll get to it as soon as we go through all dentries that can be
> > immediately kicked off the shrink list. Which, AFAICS, improves the
> > situation, now that shrink_lock_dentry() contains no trylock loops...
> >
> > Comments?
>
> What I mean is something like this (cumulative diff, it'll obviously need
> to be carved up into 3--4 commits):
... and carved-up version is in vfs.git#work.dcache. Could syzbot folks
hit it with their reproducers?
On Sun, Apr 15, 2018 at 3:34 PM, Al Viro <[email protected]> wrote:
> On Sun, Apr 15, 2018 at 10:54:55PM +0100, Al Viro wrote:
>> On Sun, Apr 15, 2018 at 09:40:54PM +0100, Al Viro wrote:
>>
>> > BTW, the current placement of cond_resched() looks bogus; suppose we
>> > have collected a lot of victims and ran into need_resched(). We leave
>> > d_walk() and call shrink_dentry_list(). At that point there's a lot
>> > of stuff on our shrink list and anybody else running into them will
>> > have to keep scanning. Giving up the timeslice before we take care
>> > of any of those looks like a bad idea, to put it mildly, and that's
>> > precisely what will happen.
>> >
>> > What about doing that in the end of __dentry_kill() instead? And to
>> > hell with both existing call sites - dput() one (before going to
>> > the parent) is obviously covered by that (dentry_kill() only returns
>> > non-NULL after having called __dentry_kill()) and in shrink_dentry_list()
>> > we'll get to it as soon as we go through all dentries that can be
>> > immediately kicked off the shrink list. Which, AFAICS, improves the
>> > situation, now that shrink_lock_dentry() contains no trylock loops...
>> >
>> > Comments?
>>
>> What I mean is something like this (cumulative diff, it'll obviously need
>> to be carved up into 3--4 commits):
>
> ... and carved-up version is in vfs.git#work.dcache. Could syzbot folks
> hit it with their reproducers?
To me it seems like shrink_dcache_parent() could still spin without
scheduling, similarly to before: found > 0 because *someone* is shrinking,
but we have 0 entries to dispose, so we never call __dentry_kill() or
schedule - we just spin.
And indeed, the syzbot reproducer @vfs#work.dcache hits a softlockup
in shrink_dcache_parent.
I'd think re-adding cond_resched() to shrink_dcache_parent() would make
the softlockup go away. It seems to fix the reproducer.
Although did we want to terminate the loop in shrink_dcache_parent earlier?