2014-04-29 16:02:07

by Miklos Szeredi

Subject: dcache shrink list corruption?

This was reported by IBM for 3.12, but if my analysis is right, it affects the
current kernel as well as older ones.

So the question is: does anything protect the shrink list from concurrent
modification by one or more dput() instances?

E.g. two dentries are on the shrink list, and for both of them dget(), d_drop()
and dput() are called. The dput() path is dput() -> dentry_kill() ->
dentry_lru_del() -> d_shrink_del() -> list_del_init(). Unlike the LRU list,
this is protected only by d_lock on the individual dentries, which is not
enough to prevent list corruption:

list->next = a, list->prev = b
a->next = b, a->prev = list
b->next = list, b->prev = a

CPU1: list_del_init(b)
__list_del(a, list)
a->next = list ...
CPU2: list_del_init(a)
__list_del(list, list)
list->next = list
list->prev = list
CPU1: (continuing list_del_init(b))
list->prev = a

Attached patch is just a starting point (untested). Not sure how to minimize
contention without adding too much complexity.

Thanks,
Miklos


diff --git a/fs/dcache.c b/fs/dcache.c
index 40707d88a945..5e0719292e3e 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -357,10 +357,14 @@ static void d_lru_del(struct dentry *dentry)
WARN_ON_ONCE(!list_lru_del(&dentry->d_sb->s_dentry_lru, &dentry->d_lru));
}

+static __cacheline_aligned_in_smp DEFINE_SPINLOCK(dcache_shrink_lock);
+
static void d_shrink_del(struct dentry *dentry)
{
D_FLAG_VERIFY(dentry, DCACHE_SHRINK_LIST | DCACHE_LRU_LIST);
+ spin_lock(&dcache_shrink_lock);
list_del_init(&dentry->d_lru);
+ spin_unlock(&dcache_shrink_lock);
dentry->d_flags &= ~(DCACHE_SHRINK_LIST | DCACHE_LRU_LIST);
this_cpu_dec(nr_dentry_unused);
}
@@ -368,7 +372,9 @@ static void d_shrink_del(struct dentry *dentry)
static void d_shrink_add(struct dentry *dentry, struct list_head *list)
{
D_FLAG_VERIFY(dentry, 0);
+ spin_lock(&dcache_shrink_lock);
list_add(&dentry->d_lru, list);
+ spin_unlock(&dcache_shrink_lock);
dentry->d_flags |= DCACHE_SHRINK_LIST | DCACHE_LRU_LIST;
this_cpu_inc(nr_dentry_unused);
}
@@ -391,7 +397,9 @@ static void d_lru_shrink_move(struct dentry *dentry, struct list_head *list)
{
D_FLAG_VERIFY(dentry, DCACHE_LRU_LIST);
dentry->d_flags |= DCACHE_SHRINK_LIST;
+ spin_lock(&dcache_shrink_lock);
list_move_tail(&dentry->d_lru, list);
+ spin_unlock(&dcache_shrink_lock);
}

/*


2014-04-29 17:43:09

by Linus Torvalds

Subject: Re: dcache shrink list corruption?

On Tue, Apr 29, 2014 at 9:01 AM, Miklos Szeredi <[email protected]> wrote:
> This was reported by IBM for 3.12, but if my analysis is right, it affects
> current kernel as well as older ones.
>
> So the question is: does anything protect the shrink list from concurrent
> modification by one or more dput() instances?

Ugh. I don't see anything. The shrinking list is a private list, so
adding on its own would be entirely safe, and I think that's where the
"we don't need no steenking locking" comes from.

But yes, the dentries are then visible on the hash chains, and there
can be concurrent removals from the list.

That new global lock smells, though - and if we want to use a global
lock, we should simply use the existing per-superblock LRU lock, not
make up some new global one. The moving case already holds it, can't
we just take it in the add/del case? Was there some reason you didn't
do that?

Let me think about it, maybe there's some trick we can do by virtue of
the list head being private and us holding the dentry lock. So at
least on addition, we have *two* of the three involved nodes locked.
Does that perhaps allow for any lockless games to be played?

Linus

2014-04-29 17:56:57

by Al Viro

Subject: Re: dcache shrink list corruption?

On Tue, Apr 29, 2014 at 06:01:39PM +0200, Miklos Szeredi wrote:

> Attached patch is just a starting point (untested). Not sure how to minimize
> contention without adding too much complexity.

Contention isn't the worst problem here - I'd expect the cacheline ping-pong
to hurt more... I agree with the analysis, but I'd really like to avoid that
spinlock ;-/

Let me see if we can avoid that... Oh, well - at least that's a good excuse
to take a break from fucking deadlock analysis around the damn acct(2), most
of VFS and network filesystems ;-/

2014-04-29 18:03:26

by Miklos Szeredi

Subject: Re: dcache shrink list corruption?

On Tue, Apr 29, 2014 at 7:43 PM, Linus Torvalds
<[email protected]> wrote:
> On Tue, Apr 29, 2014 at 9:01 AM, Miklos Szeredi <[email protected]> wrote:
>> This was reported by IBM for 3.12, but if my analysis is right, it affects
>> current kernel as well as older ones.
>>
>> So the question is: does anything protect the shrink list from concurrent
>> modification by one or more dput() instances?
>
> Ugh. I don't see anything. The shrinking list is a private list, so
> adding on its own would be entirely safe, and I think that's where the
> "we don't need no steenking locking" comes from.
>
> But yes, the dentries are then visible on the hash chains, and there
> can be concurrent removals from the list.
>
> That new global lock smells, though - and if we want to use a global
> lock, we should simply use the existing per-superblock LRU lock, not
> make up some new global one. The moving case already holds it, can't
> we just take it in the add/del case? Was there some reason you didn't
> do that?

Because we no longer have that. It now uses the list_lru thing, with
a "per-node" lock, whatever that one is.

Introducing a new per-sb lock should be OK.

Another idea, which could have subtler effects, is simply not to kill
a dentry that is on the shrink list (indicated by
DCACHE_SHRINK_LIST), since it's bound to get killed anyway. But
that's a change in behaviour...

>
> Let me think about it, maybe there's some trick we can do by virtue of
> the list head being private and us holding the dentry lock. So at
> least on addition, we have *two* of the three involved nodes locked.
> Does that perhaps allow for any lockless games to be played?

I don't understand how that could be made to work.

2014-04-29 18:16:13

by Al Viro

Subject: Re: dcache shrink list corruption?

On Tue, Apr 29, 2014 at 08:03:24PM +0200, Miklos Szeredi wrote:

> Introducing a new per-sb lock should be OK.
>
> Another idea, which could have subtler effects, is simply not to kill
> a dentry that is on the shrink list (indicated by
> DCACHE_SHRINK_LIST), since it's bound to get killed anyway. But
> that's a change in behaviour...

Umm... You mean, if final dput() finds dentry already on shrink list,
just leave it there and return? Might get really painful - the code
that knows it's holding the last reference to already unhashed dentry
might get a nasty surprise when dput() returns before it's killed off.

2014-04-29 18:17:14

by Linus Torvalds

Subject: Re: dcache shrink list corruption?

On Tue, Apr 29, 2014 at 11:03 AM, Miklos Szeredi <[email protected]> wrote:
>
> Because we no longer have that. It now uses the list_lru thing, with
> a "per-node" lock, whatever that one is.

Oh, yes. Right you are. I just started looking at that and went "ugh".

The lru lists are all distributed now with multiple locks (well, one
per list node).

> Another idea, which could have subtler effects, is simply not to kill
> a dentry that is on the shrink list (indicated by
> DCACHE_SHRINK_LIST), since it's bound to get killed anyway. But
> that's a change in behaviour...

Ooh, I like the way you think. I don't see why this wouldn't be the
right approach.

Linus

2014-04-29 19:10:20

by Al Viro

Subject: Re: dcache shrink list corruption?

On Tue, Apr 29, 2014 at 07:16:10PM +0100, Al Viro wrote:
> On Tue, Apr 29, 2014 at 08:03:24PM +0200, Miklos Szeredi wrote:
>
> > Introducing a new per-sb lock should be OK.
> >
> > Another idea, which could have subtler effects, is simply not to kill
> > a dentry that is on the shrink list (indicated by
> > DCACHE_SHRINK_LIST), since it's bound to get killed anyway. But
> > that's a change in behaviour...
>
> Umm... You mean, if final dput() finds dentry already on shrink list,
> just leave it there and return? Might get really painful - the code
> that knows it's holding the last reference to already unhashed dentry
> might get a nasty surprise when dput() returns before it's killed off.

I wonder if we could have the dput() side of things check if we are on the
shrink list, mark it "I'll be killing it anyway" and go ahead without
removal from the shrink list and instead of freeing mark it "I'm done
with it". With shrink_dentry_list(), on the other hand, checking for those
marks, treating the former as "just move it to private list and keep
going". After the list of victims is dealt with, keep picking dentries
from the second list, wait for them to get the second mark and do actual
freeing. That ought to avoid any extra locks and preserve all ordering,
except for that of kmem_cache_free(), AFAICS...

Comments?

2014-04-29 21:18:58

by Dave Chinner

Subject: Re: dcache shrink list corruption?

On Tue, Apr 29, 2014 at 08:10:15PM +0100, Al Viro wrote:
> On Tue, Apr 29, 2014 at 07:16:10PM +0100, Al Viro wrote:
> > On Tue, Apr 29, 2014 at 08:03:24PM +0200, Miklos Szeredi wrote:
> >
> > > Introducing a new per-sb lock should be OK.
> > >
> > > Another idea, which could have subtler effects, is simply not to kill
> > > a dentry that is on the shrink list (indicated by
> > > DCACHE_SHRINK_LIST), since it's bound to get killed anyway. But
> > > that's a change in behaviour...
> >
> > Umm... You mean, if final dput() finds dentry already on shrink list,
> > just leave it there and return? Might get really painful - the code
> > that knows it's holding the last reference to already unhashed dentry
> > might get a nasty surprise when dput() returns before it's killed off.
>
> I wonder if we could have the dput() side of things check if we are on the
> shrink list, mark it "I'll be killing it anyway" and go ahead without
> removal from the shrink list and instead of freeing mark it "I'm done
> with it". With shrink_dentry_list(), on the other hand, checking for those
> marks, treating the former as "just move it to private list and keep
> going". After the list of victims is dealt with, keep picking dentries
> from the second list, wait for them to get the second mark and do actual
> freeing. That ought to avoid any extra locks and preserve all ordering,
> except for that of kmem_cache_free(), AFAICS...
>
> Comments?

Seems like it would work, but it seems fragile to me - I'm
wondering how we can ensure that the private shrink list
manipulations can be kept private.

We have a similar situation with the inode cache (private shrink
list) but the I_FREEING flag is set the entire time the inode is on
the shrink list. Any new hash lookup or attempt to grab the inode
that occurs while I_FREEING is set fails, so perhaps dentries also
need a well defined "being torn down and freed" state where new
references cannot be taken even though the dentry can still be
found...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2014-04-29 21:48:47

by Al Viro

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 07:18:51AM +1000, Dave Chinner wrote:

> Seems like it would work, but it seems fragile to me - I'm
> wondering how we can ensure that the private shrink list
> manipulations can be kept private.
>
> We have a similar situation with the inode cache (private shrink
> list) but the I_FREEING flag is set the entire time the inode is on
> the shrink list. Any new hash lookup or attempt to grab the inode
> that occurs while I_FREEING is set fails, so perhaps dentries also
> need a well defined "being torn down and freed" state where new
> references cannot be taken even though the dentry can still be
> found...

Ummm... You mean, have d_lookup() et.al. fail on something that is on
a shrink list?

2014-04-29 23:04:13

by Linus Torvalds

Subject: Re: dcache shrink list corruption?

On Tue, Apr 29, 2014 at 2:48 PM, Al Viro <[email protected]> wrote:
>
> Ummm... You mean, have d_lookup() et.al. fail on something that is on
> a shrink list?

So I tried to see if that would work: just consider it dead by the time
it hits the shrink list, and if somebody does a lookup on the dentry,
fail on it and just allocate a new dentry and do a real lookup.

But at a minimum, we have "d_op->d_prune()" that would now possibly be
called for the old dentry *after* a new dentry has been allocated.
Not to mention the inode not having been dropped. So it looks like a
disaster where the filesystem now ends up having concurrent "live"
dentries for the same entry. Even if one of them is on its way out,
it's still visible to the filesystem. That makes me very
uncomfortable.

I'm starting to think that Miklos' original patch (perhaps with the
spinlock split to at least be one per superblock) is the most
straightforward approach after all. It's annoying having to add locks
here, but the whole pruning path should not be a hotpath, so maybe
it's not actually a big deal.

Linus

2014-04-29 23:20:22

by Al Viro

Subject: Re: dcache shrink list corruption?

On Tue, Apr 29, 2014 at 04:04:11PM -0700, Linus Torvalds wrote:
> But at a minimum, we have "d_op->d_prune()" that would now possibly be
> called for the old dentry *after* a new dentry has been allocated.
> Not to mention the inode not having been dropped. So it looks like a
> disaster where the filesystem now ends up having concurrent "live"
> dentries for the same entry. Even if one of them is on its way out,
> it's still visible to the filesystem. That makes me very
> uncomfortable.
>
> I'm starting to think that Miklos' original patch (perhaps with the
> spinlock split to at least be one per superblock) is the most
> straightforward approach after all. It's annoying having to add locks
> here, but the whole pruning path should not be a hotpath, so maybe
> it's not actually a big deal.

FWIW, my solution is more or less working; I'll give it more local beating
and post it...

2014-04-30 02:31:48

by Al Viro

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 12:20:13AM +0100, Al Viro wrote:
> On Tue, Apr 29, 2014 at 04:04:11PM -0700, Linus Torvalds wrote:
> > But at a minimum, we have "d_op->d_prune()" that would now possibly be
> > called for the old dentry *after* a new dentry has been allocated.
> > Not to mention the inode not having been dropped. So it looks like a
> > disaster where the filesystem now ends up having concurrent "live"
> > dentries for the same entry. Even if one of them is on its way out,
> > it's still visible to the filesystem. That makes me very
> > uncomfortable.
> >
> > I'm starting to think that Miklos' original patch (perhaps with the
> > spinlock split to at least be one per superblock) is the most
> > straightforward approach after all. It's annoying having to add locks
> > here, but the whole pruning path should not be a hotpath, so maybe
> > it's not actually a big deal.
>
> FWIW, my solution is more or less working; I'll give more local beating
> and post it...

OK, aggregate diff follows, more readable splitup (3 commits) attached.
It seems to survive beating here; testing, review and comments are
welcome.

diff --git a/fs/dcache.c b/fs/dcache.c
index 494a9def..a83b933 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -246,16 +246,8 @@ static void __d_free(struct rcu_head *head)
kmem_cache_free(dentry_cache, dentry);
}

-/*
- * no locks, please.
- */
-static void d_free(struct dentry *dentry)
+static void dentry_free(struct dentry *dentry)
{
- BUG_ON((int)dentry->d_lockref.count > 0);
- this_cpu_dec(nr_dentry);
- if (dentry->d_op && dentry->d_op->d_release)
- dentry->d_op->d_release(dentry);
-
/* if dentry was never visible to RCU, immediate free is OK */
if (!(dentry->d_flags & DCACHE_RCUACCESS))
__d_free(&dentry->d_u.d_rcu);
@@ -420,40 +412,6 @@ static void dentry_lru_del(struct dentry *dentry)
}

/**
- * d_kill - kill dentry and return parent
- * @dentry: dentry to kill
- * @parent: parent dentry
- *
- * The dentry must already be unhashed and removed from the LRU.
- *
- * If this is the root of the dentry tree, return NULL.
- *
- * dentry->d_lock and parent->d_lock must be held by caller, and are dropped by
- * d_kill.
- */
-static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
- __releases(dentry->d_lock)
- __releases(parent->d_lock)
- __releases(dentry->d_inode->i_lock)
-{
- list_del(&dentry->d_u.d_child);
- /*
- * Inform d_walk() that we are no longer attached to the
- * dentry tree
- */
- dentry->d_flags |= DCACHE_DENTRY_KILLED;
- if (parent)
- spin_unlock(&parent->d_lock);
- dentry_iput(dentry);
- /*
- * dentry_iput drops the locks, at which point nobody (except
- * transient RCU lookups) can reach this dentry.
- */
- d_free(dentry);
- return parent;
-}
-
-/**
* d_drop - drop a dentry
* @dentry: dentry to drop
*
@@ -499,6 +457,8 @@ void d_drop(struct dentry *dentry)
}
EXPORT_SYMBOL(d_drop);

+static DECLARE_WAIT_QUEUE_HEAD(free_wq);
+
/*
* Finish off a dentry we've decided to kill.
* dentry->d_lock must be held, returns with it unlocked.
@@ -506,16 +466,17 @@ EXPORT_SYMBOL(d_drop);
* Returns dentry requiring refcount drop, or NULL if we're done.
*/
static struct dentry *
-dentry_kill(struct dentry *dentry, int unlock_on_failure)
+dentry_kill(struct dentry *dentry, struct list_head *morgue)
__releases(dentry->d_lock)
{
struct inode *inode;
struct dentry *parent;
+ bool can_free = true;

inode = dentry->d_inode;
if (inode && !spin_trylock(&inode->i_lock)) {
relock:
- if (unlock_on_failure) {
+ if (!morgue) {
spin_unlock(&dentry->d_lock);
cpu_relax();
}
@@ -531,6 +492,15 @@ relock:
goto relock;
}

+ if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+ /* morgue must be non-NULL */
+ list_move(&dentry->d_lru, morgue);
+ if (parent)
+ spin_unlock(&parent->d_lock);
+ /* inode must be NULL */
+ spin_unlock(&dentry->d_lock);
+ return parent;
+ }
/*
* The dentry is now unrecoverably dead to the world.
*/
@@ -543,10 +513,43 @@ relock:
if ((dentry->d_flags & DCACHE_OP_PRUNE) && !d_unhashed(dentry))
dentry->d_op->d_prune(dentry);

- dentry_lru_del(dentry);
+ if (dentry->d_flags & DCACHE_LRU_LIST) {
+ if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
+ d_lru_del(dentry);
+ else if (morgue)
+ d_shrink_del(dentry);
+ else
+ can_free = false;
+ }
/* if it was on the hash then remove it */
__d_drop(dentry);
- return d_kill(dentry, parent);
+ list_del(&dentry->d_u.d_child);
+ /*
+ * Inform d_walk() that we are no longer attached to the
+ * dentry tree
+ */
+ dentry->d_flags |= DCACHE_DENTRY_KILLED;
+ if (parent)
+ spin_unlock(&parent->d_lock);
+ dentry_iput(dentry);
+ /*
+ * dentry_iput drops the locks, at which point nobody (except
+ * transient RCU lookups) can reach this dentry.
+ */
+ BUG_ON((int)dentry->d_lockref.count > 0);
+ this_cpu_dec(nr_dentry);
+ if (dentry->d_op && dentry->d_op->d_release)
+ dentry->d_op->d_release(dentry);
+
+ if (likely(can_free)) {
+ dentry_free(dentry);
+ } else {
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_MAY_FREE;
+ spin_unlock(&dentry->d_lock);
+ wake_up(&free_wq);
+ }
+ return parent;
}

/*
@@ -602,7 +605,7 @@ repeat:
return;

kill_it:
- dentry = dentry_kill(dentry, 1);
+ dentry = dentry_kill(dentry, NULL);
if (dentry)
goto repeat;
}
@@ -815,47 +818,10 @@ restart:
}
EXPORT_SYMBOL(d_prune_aliases);

-/*
- * Try to throw away a dentry - free the inode, dput the parent.
- * Requires dentry->d_lock is held, and dentry->d_count == 0.
- * Releases dentry->d_lock.
- *
- * This may fail if locks cannot be acquired no problem, just try again.
- */
-static struct dentry * try_prune_one_dentry(struct dentry *dentry)
- __releases(dentry->d_lock)
-{
- struct dentry *parent;
-
- parent = dentry_kill(dentry, 0);
- /*
- * If dentry_kill returns NULL, we have nothing more to do.
- * if it returns the same dentry, trylocks failed. In either
- * case, just loop again.
- *
- * Otherwise, we need to prune ancestors too. This is necessary
- * to prevent quadratic behavior of shrink_dcache_parent(), but
- * is also expected to be beneficial in reducing dentry cache
- * fragmentation.
- */
- if (!parent)
- return NULL;
- if (parent == dentry)
- return dentry;
-
- /* Prune ancestors. */
- dentry = parent;
- while (dentry) {
- if (lockref_put_or_lock(&dentry->d_lockref))
- return NULL;
- dentry = dentry_kill(dentry, 1);
- }
- return NULL;
-}
-
static void shrink_dentry_list(struct list_head *list)
{
- struct dentry *dentry;
+ struct dentry *dentry, *parent;
+ LIST_HEAD(morgue);

rcu_read_lock();
for (;;) {
@@ -891,24 +857,43 @@ static void shrink_dentry_list(struct list_head *list)
}
rcu_read_unlock();

+ parent = dentry_kill(dentry, &morgue);
/*
- * If 'try_to_prune()' returns a dentry, it will
- * be the same one we passed in, and d_lock will
- * have been held the whole time, so it will not
- * have been added to any other lists. We failed
- * to get the inode lock.
- *
- * We just add it back to the shrink list.
+ * If dentry_kill returns NULL, we have nothing more to do.
*/
- dentry = try_prune_one_dentry(dentry);
-
- rcu_read_lock();
- if (dentry) {
+ if (!parent) {
+ rcu_read_lock();
+ continue;
+ }
+ if (unlikely(parent == dentry)) {
+ /*
+ * trylocks have failed and d_lock has been held the
+ * whole time, so it could not have been added to any
+ * other lists. Just add it back to the shrink list.
+ */
+ rcu_read_lock();
d_shrink_add(dentry, list);
spin_unlock(&dentry->d_lock);
+ continue;
}
+ /*
+ * We need to prune ancestors too. This is necessary to prevent
+ * quadratic behavior of shrink_dcache_parent(), but is also
+ * expected to be beneficial in reducing dentry cache
+ * fragmentation.
+ */
+ dentry = parent;
+ while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
+ dentry = dentry_kill(dentry, NULL);
+ rcu_read_lock();
}
rcu_read_unlock();
+ while (unlikely(!list_empty(&morgue))) {
+ dentry = list_first_entry(&morgue, struct dentry, d_lru);
+ list_del_init(&dentry->d_lru);
+ wait_event(free_wq, dentry->d_flags & DCACHE_MAY_FREE);
+ dentry_free(dentry);
+ }
}

static enum lru_status
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3b9bfdb..3c7ec32 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,8 @@ struct dentry_operations {
#define DCACHE_SYMLINK_TYPE 0x00300000 /* Symlink */
#define DCACHE_FILE_TYPE 0x00400000 /* Other file type */

+#define DCACHE_MAY_FREE 0x00800000
+
extern seqlock_t rename_lock;

static inline int dname_external(const struct dentry *dentry)


Attachments:
(No filename) (8.51 kB)
mb2 (10.95 kB)

2014-04-30 02:56:19

by Linus Torvalds

Subject: Re: dcache shrink list corruption?

On Tue, Apr 29, 2014 at 7:31 PM, Al Viro <[email protected]> wrote:
>
> OK, aggregate diff follows, more readable splitup (3 commits) attached.
> It seems to survive beating here; testing, review and comments are
> welcome.

Miklos, did you have some particular load that triggered this, or was
it just some reports? It would be really good to get this patch some
stress-testing.

I like how the patch removes more lines than it adds, but apart from
that it's hard to read the patch (even the split-out ones) and say
anything more about it. I think this needs a *lot* of testing.

Linus

2014-04-30 04:04:43

by Al Viro

Subject: Re: dcache shrink list corruption?

On Tue, Apr 29, 2014 at 07:56:13PM -0700, Linus Torvalds wrote:
> On Tue, Apr 29, 2014 at 7:31 PM, Al Viro <[email protected]> wrote:
> >
> > OK, aggregate diff follows, more readable splitup (3 commits) attached.
> > It seems to survive beating here; testing, review and comments are
> > welcome.
>
> Miklos, did you have some particular load that triggered this, or was
> it just some reports? It would be really good to get this patch some
> stress-testing.
>
> I like how the patch removes more lines than it adds, but apart from
> that it's hard to read the patch (even the split-out ones) and say
> anything more about it. I think this needs a *lot* of testing.

FWIW, the first two are really just straightforward expansions of a function
into its only callsite. The last needs more splitup. Not sure if the
following is good enough, but it ought to be at least somewhat cleaner.
Combined change is identical to the original, so it doesn't invalidate
the testing so far...

From 895aeb48465bbf78803fd11dee2556d010ada23a Mon Sep 17 00:00:00 2001
From: Al Viro <[email protected]>
Date: Tue, 29 Apr 2014 15:45:28 -0400
Subject: [PATCH 1/6] fold d_kill() and d_free()

Signed-off-by: Al Viro <[email protected]>
---
fs/dcache.c | 76 +++++++++++++++++++----------------------------------------
1 file changed, 24 insertions(+), 52 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 494a9def..9b15c5c 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -246,23 +246,6 @@ static void __d_free(struct rcu_head *head)
kmem_cache_free(dentry_cache, dentry);
}

-/*
- * no locks, please.
- */
-static void d_free(struct dentry *dentry)
-{
- BUG_ON((int)dentry->d_lockref.count > 0);
- this_cpu_dec(nr_dentry);
- if (dentry->d_op && dentry->d_op->d_release)
- dentry->d_op->d_release(dentry);
-
- /* if dentry was never visible to RCU, immediate free is OK */
- if (!(dentry->d_flags & DCACHE_RCUACCESS))
- __d_free(&dentry->d_u.d_rcu);
- else
- call_rcu(&dentry->d_u.d_rcu, __d_free);
-}
-
/**
* dentry_rcuwalk_barrier - invalidate in-progress rcu-walk lookups
* @dentry: the target dentry
@@ -420,40 +403,6 @@ static void dentry_lru_del(struct dentry *dentry)
}

/**
- * d_kill - kill dentry and return parent
- * @dentry: dentry to kill
- * @parent: parent dentry
- *
- * The dentry must already be unhashed and removed from the LRU.
- *
- * If this is the root of the dentry tree, return NULL.
- *
- * dentry->d_lock and parent->d_lock must be held by caller, and are dropped by
- * d_kill.
- */
-static struct dentry *d_kill(struct dentry *dentry, struct dentry *parent)
- __releases(dentry->d_lock)
- __releases(parent->d_lock)
- __releases(dentry->d_inode->i_lock)
-{
- list_del(&dentry->d_u.d_child);
- /*
- * Inform d_walk() that we are no longer attached to the
- * dentry tree
- */
- dentry->d_flags |= DCACHE_DENTRY_KILLED;
- if (parent)
- spin_unlock(&parent->d_lock);
- dentry_iput(dentry);
- /*
- * dentry_iput drops the locks, at which point nobody (except
- * transient RCU lookups) can reach this dentry.
- */
- d_free(dentry);
- return parent;
-}
-
-/**
* d_drop - drop a dentry
* @dentry: dentry to drop
*
@@ -546,7 +495,30 @@ relock:
dentry_lru_del(dentry);
/* if it was on the hash then remove it */
__d_drop(dentry);
- return d_kill(dentry, parent);
+ list_del(&dentry->d_u.d_child);
+ /*
+ * Inform d_walk() that we are no longer attached to the
+ * dentry tree
+ */
+ dentry->d_flags |= DCACHE_DENTRY_KILLED;
+ if (parent)
+ spin_unlock(&parent->d_lock);
+ dentry_iput(dentry);
+ /*
+ * dentry_iput drops the locks, at which point nobody (except
+ * transient RCU lookups) can reach this dentry.
+ */
+ BUG_ON((int)dentry->d_lockref.count > 0);
+ this_cpu_dec(nr_dentry);
+ if (dentry->d_op && dentry->d_op->d_release)
+ dentry->d_op->d_release(dentry);
+
+ /* if dentry was never visible to RCU, immediate free is OK */
+ if (!(dentry->d_flags & DCACHE_RCUACCESS))
+ __d_free(&dentry->d_u.d_rcu);
+ else
+ call_rcu(&dentry->d_u.d_rcu, __d_free);
+ return parent;
}

/*
--
1.7.10.4


From aff934c47717be0216c9e2c10a2e8ca0f829bb54 Mon Sep 17 00:00:00 2001
From: Al Viro <[email protected]>
Date: Tue, 29 Apr 2014 16:13:18 -0400
Subject: [PATCH 2/6] fold try_prune_one_dentry()

Signed-off-by: Al Viro <[email protected]>
---
fs/dcache.c | 75 ++++++++++++++++++++---------------------------------------
1 file changed, 25 insertions(+), 50 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index 9b15c5c..a5540d4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -787,47 +787,9 @@ restart:
}
EXPORT_SYMBOL(d_prune_aliases);

-/*
- * Try to throw away a dentry - free the inode, dput the parent.
- * Requires dentry->d_lock is held, and dentry->d_count == 0.
- * Releases dentry->d_lock.
- *
- * This may fail if locks cannot be acquired no problem, just try again.
- */
-static struct dentry * try_prune_one_dentry(struct dentry *dentry)
- __releases(dentry->d_lock)
-{
- struct dentry *parent;
-
- parent = dentry_kill(dentry, 0);
- /*
- * If dentry_kill returns NULL, we have nothing more to do.
- * if it returns the same dentry, trylocks failed. In either
- * case, just loop again.
- *
- * Otherwise, we need to prune ancestors too. This is necessary
- * to prevent quadratic behavior of shrink_dcache_parent(), but
- * is also expected to be beneficial in reducing dentry cache
- * fragmentation.
- */
- if (!parent)
- return NULL;
- if (parent == dentry)
- return dentry;
-
- /* Prune ancestors. */
- dentry = parent;
- while (dentry) {
- if (lockref_put_or_lock(&dentry->d_lockref))
- return NULL;
- dentry = dentry_kill(dentry, 1);
- }
- return NULL;
-}
-
static void shrink_dentry_list(struct list_head *list)
{
- struct dentry *dentry;
+ struct dentry *dentry, *parent;

rcu_read_lock();
for (;;) {
@@ -863,22 +825,35 @@ static void shrink_dentry_list(struct list_head *list)
}
rcu_read_unlock();

+ parent = dentry_kill(dentry, 0);
/*
- * If 'try_to_prune()' returns a dentry, it will
- * be the same one we passed in, and d_lock will
- * have been held the whole time, so it will not
- * have been added to any other lists. We failed
- * to get the inode lock.
- *
- * We just add it back to the shrink list.
+ * If dentry_kill returns NULL, we have nothing more to do.
*/
- dentry = try_prune_one_dentry(dentry);
-
- rcu_read_lock();
- if (dentry) {
+ if (!parent) {
+ rcu_read_lock();
+ continue;
+ }
+ if (unlikely(parent == dentry)) {
+ /*
+ * trylocks have failed and d_lock has been held the
+ * whole time, so it could not have been added to any
+ * other lists. Just add it back to the shrink list.
+ */
+ rcu_read_lock();
d_shrink_add(dentry, list);
spin_unlock(&dentry->d_lock);
+ continue;
}
+ /*
+ * We need to prune ancestors too. This is necessary to prevent
+ * quadratic behavior of shrink_dcache_parent(), but is also
+ * expected to be beneficial in reducing dentry cache
+ * fragmentation.
+ */
+ dentry = parent;
+ while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
+ dentry = dentry_kill(dentry, 1);
+ rcu_read_lock();
}
rcu_read_unlock();
}
--
1.7.10.4


From d250553b1ca7d4c9b0544b4b6a3cdf5e2a3dd147 Mon Sep 17 00:00:00 2001
From: Al Viro <[email protected]>
Date: Tue, 29 Apr 2014 23:40:14 -0400
Subject: [PATCH 3/6] new helper: dentry_free()

The part of old d_free() that dealt with actual freeing of dentry.
Taken out of dentry_kill() into a separate function.

Signed-off-by: Al Viro <[email protected]>
---
fs/dcache.c | 15 ++++++++++-----
1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index a5540d4..dab7db1 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -246,6 +246,15 @@ static void __d_free(struct rcu_head *head)
kmem_cache_free(dentry_cache, dentry);
}

+static void dentry_free(struct dentry *dentry)
+{
+ /* if dentry was never visible to RCU, immediate free is OK */
+ if (!(dentry->d_flags & DCACHE_RCUACCESS))
+ __d_free(&dentry->d_u.d_rcu);
+ else
+ call_rcu(&dentry->d_u.d_rcu, __d_free);
+}
+
/**
* dentry_rcuwalk_barrier - invalidate in-progress rcu-walk lookups
* @dentry: the target dentry
@@ -513,11 +522,7 @@ relock:
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);

- /* if dentry was never visible to RCU, immediate free is OK */
- if (!(dentry->d_flags & DCACHE_RCUACCESS))
- __d_free(&dentry->d_u.d_rcu);
- else
- call_rcu(&dentry->d_u.d_rcu, __d_free);
+ dentry_free(dentry);
return parent;
}

--
1.7.10.4


From 2f2c6a0a4153054f2b3b44011ffbe8e4bae1a1e1 Mon Sep 17 00:00:00 2001
From: Al Viro <[email protected]>
Date: Tue, 29 Apr 2014 23:42:52 -0400
Subject: [PATCH 4/6] expand the call of dentry_lru_del() in dentry_kill()

Signed-off-by: Al Viro <[email protected]>
---
fs/dcache.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index dab7db1..e482775 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -501,7 +501,12 @@ relock:
if ((dentry->d_flags & DCACHE_OP_PRUNE) && !d_unhashed(dentry))
dentry->d_op->d_prune(dentry);

- dentry_lru_del(dentry);
+ if (dentry->d_flags & DCACHE_LRU_LIST) {
+ if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
+ d_lru_del(dentry);
+ else
+ d_shrink_del(dentry);
+ }
/* if it was on the hash then remove it */
__d_drop(dentry);
list_del(&dentry->d_u.d_child);
--
1.7.10.4


From 1d1c5bc3cff207847a2643d85442d25adcf9041a Mon Sep 17 00:00:00 2001
From: Al Viro <[email protected]>
Date: Tue, 29 Apr 2014 23:52:05 -0400
Subject: [PATCH 5/6] Don't try to remove from shrink list we don't own

So far it's just been local equivalent transformations. Now the meat of that
thing: dentry_kill() has two kinds of call sites - ones that had just
dropped the last reference (dput(), killing ancestors in shrink_dentry_list())
and one where the victim had sat on shrink list with zero refcount. We already
have a flag telling which kind it is (unlock_on_failure). What we want to do
is to avoid d_shrink_del() in the first kind of dentry_kill(). We do, however,
want everything except the actual freeing still done as we would in mainline.
So we need to deal with two new situations - the first kind of dentry_kill()
finding a dentry on shrink list (result of laziness in dget(); we had it on
shrink list with refcount 1) and the second kind finding a dentry that is
currently being killed by the first kind. What we do is this:

* add another list in shrink_dentry_list() ('morgue'), pass it to the second
kind of dentry_kill(), which will move dentry there, drop locks and do nothing
else if it finds that the dentry is in process of being killed; we detect that
by seeing DCACHE_DENTRY_KILLED already set.
* have the first kind skip removal from shrink list and actual freeing
if dentry was on the shrink list; in that case instead of freeing the victim
just mark it DCACHE_MAY_FREE and wake free_wq.
* once shrink_dentry_list() has dealt with the entire list it has been given,
morgue will contain the ones that had triggered "don't free them yet" path in
dentry_kill(dentry, 1, ...). They are going to acquire DCACHE_MAY_FREE once
dentry_kill() is through with them, so just wait for that (on free_wq) and
free them.

Signed-off-by: Al Viro <[email protected]>
---
fs/dcache.c | 40 ++++++++++++++++++++++++++++++++++------
include/linux/dcache.h | 2 ++
2 files changed, 36 insertions(+), 6 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index e482775..f0773f6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -457,6 +457,8 @@ void d_drop(struct dentry *dentry)
}
EXPORT_SYMBOL(d_drop);

+static DECLARE_WAIT_QUEUE_HEAD(free_wq);
+
/*
* Finish off a dentry we've decided to kill.
* dentry->d_lock must be held, returns with it unlocked.
@@ -464,11 +466,12 @@ EXPORT_SYMBOL(d_drop);
* Returns dentry requiring refcount drop, or NULL if we're done.
*/
static struct dentry *
-dentry_kill(struct dentry *dentry, int unlock_on_failure)
+dentry_kill(struct dentry *dentry, int unlock_on_failure, struct list_head *morgue)
__releases(dentry->d_lock)
{
struct inode *inode;
struct dentry *parent;
+ bool can_free = true;

inode = dentry->d_inode;
if (inode && !spin_trylock(&inode->i_lock)) {
@@ -489,6 +492,15 @@ relock:
goto relock;
}

+ if (unlikely(morgue && dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+ list_move(&dentry->d_lru, morgue);
+ if (parent)
+ spin_unlock(&parent->d_lock);
+ if (inode)
+ spin_unlock(&inode->i_lock);
+ spin_unlock(&dentry->d_lock);
+ return parent;
+ }
/*
* The dentry is now unrecoverably dead to the world.
*/
@@ -504,8 +516,10 @@ relock:
if (dentry->d_flags & DCACHE_LRU_LIST) {
if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
d_lru_del(dentry);
- else
+ else if (!unlock_on_failure)
d_shrink_del(dentry);
+ else
+ can_free = false;
}
/* if it was on the hash then remove it */
__d_drop(dentry);
@@ -527,7 +541,14 @@ relock:
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);

- dentry_free(dentry);
+ if (likely(can_free)) {
+ dentry_free(dentry);
+ } else {
+ spin_lock(&dentry->d_lock);
+ dentry->d_flags |= DCACHE_MAY_FREE;
+ spin_unlock(&dentry->d_lock);
+ wake_up(&free_wq);
+ }
return parent;
}

@@ -584,7 +605,7 @@ repeat:
return;

kill_it:
- dentry = dentry_kill(dentry, 1);
+ dentry = dentry_kill(dentry, 1, NULL);
if (dentry)
goto repeat;
}
@@ -800,6 +821,7 @@ EXPORT_SYMBOL(d_prune_aliases);
static void shrink_dentry_list(struct list_head *list)
{
struct dentry *dentry, *parent;
+ LIST_HEAD(morgue);

rcu_read_lock();
for (;;) {
@@ -835,7 +857,7 @@ static void shrink_dentry_list(struct list_head *list)
}
rcu_read_unlock();

- parent = dentry_kill(dentry, 0);
+ parent = dentry_kill(dentry, 0, &morgue);
/*
* If dentry_kill returns NULL, we have nothing more to do.
*/
@@ -862,10 +884,16 @@ static void shrink_dentry_list(struct list_head *list)
*/
dentry = parent;
while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
- dentry = dentry_kill(dentry, 1);
+ dentry = dentry_kill(dentry, 1, NULL);
rcu_read_lock();
}
rcu_read_unlock();
+ while (unlikely(!list_empty(&morgue))) {
+ dentry = list_first_entry(&morgue, struct dentry, d_lru);
+ list_del_init(&dentry->d_lru);
+ wait_event(free_wq, dentry->d_flags & DCACHE_MAY_FREE);
+ dentry_free(dentry);
+ }
}

static enum lru_status
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3b9bfdb..3c7ec32 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,8 @@ struct dentry_operations {
#define DCACHE_SYMLINK_TYPE 0x00300000 /* Symlink */
#define DCACHE_FILE_TYPE 0x00400000 /* Other file type */

+#define DCACHE_MAY_FREE 0x00800000
+
extern seqlock_t rename_lock;

static inline int dname_external(const struct dentry *dentry)
--
1.7.10.4


>From 3b9afa6f49ec53bf45067584d5baf567a0aa5105 Mon Sep 17 00:00:00 2001
From: Al Viro <[email protected]>
Date: Tue, 29 Apr 2014 23:57:14 -0400
Subject: [PATCH 6/6] clean up afterwards

a) dentry_kill() arguments are redundant - unlock_on_failure == !!morgue
b) the second kind of dentry_kill() can see a dentry in the process of being
killed only after it's been unlocked by dentry_iput(), so inode is
guaranteed to be NULL in that case
c) the first kind of dentry_kill() can't see DCACHE_DENTRY_KILLED at all
(it's only set when the refcount reaches zero and is unable to grow back;
the first kind is done right after having dropped the refcount to zero).

Signed-off-by: Al Viro <[email protected]>
---
fs/dcache.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/dcache.c b/fs/dcache.c
index f0773f6..a83b933 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -466,7 +466,7 @@ static DECLARE_WAIT_QUEUE_HEAD(free_wq);
* Returns dentry requiring refcount drop, or NULL if we're done.
*/
static struct dentry *
-dentry_kill(struct dentry *dentry, int unlock_on_failure, struct list_head *morgue)
+dentry_kill(struct dentry *dentry, struct list_head *morgue)
__releases(dentry->d_lock)
{
struct inode *inode;
@@ -476,7 +476,7 @@ dentry_kill(struct dentry *dentry, int unlock_on_failure, struct list_head *morg
inode = dentry->d_inode;
if (inode && !spin_trylock(&inode->i_lock)) {
relock:
- if (unlock_on_failure) {
+ if (!morgue) {
spin_unlock(&dentry->d_lock);
cpu_relax();
}
@@ -492,12 +492,12 @@ relock:
goto relock;
}

- if (unlikely(morgue && dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+ if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+ /* morgue must be non-NULL */
list_move(&dentry->d_lru, morgue);
if (parent)
spin_unlock(&parent->d_lock);
- if (inode)
- spin_unlock(&inode->i_lock);
+ /* inode must be NULL */
spin_unlock(&dentry->d_lock);
return parent;
}
@@ -516,7 +516,7 @@ relock:
if (dentry->d_flags & DCACHE_LRU_LIST) {
if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
d_lru_del(dentry);
- else if (!unlock_on_failure)
+ else if (morgue)
d_shrink_del(dentry);
else
can_free = false;
@@ -605,7 +605,7 @@ repeat:
return;

kill_it:
- dentry = dentry_kill(dentry, 1, NULL);
+ dentry = dentry_kill(dentry, NULL);
if (dentry)
goto repeat;
}
@@ -857,7 +857,7 @@ static void shrink_dentry_list(struct list_head *list)
}
rcu_read_unlock();

- parent = dentry_kill(dentry, 0, &morgue);
+ parent = dentry_kill(dentry, &morgue);
/*
* If dentry_kill returns NULL, we have nothing more to do.
*/
@@ -884,7 +884,7 @@ static void shrink_dentry_list(struct list_head *list)
*/
dentry = parent;
while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
- dentry = dentry_kill(dentry, 1, NULL);
+ dentry = dentry_kill(dentry, NULL);
rcu_read_lock();
}
rcu_read_unlock();
--
1.7.10.4

2014-04-30 09:15:26

by Miklos Szeredi

Subject: Re: dcache shrink list corruption?

On Tue, Apr 29, 2014 at 07:56:13PM -0700, Linus Torvalds wrote:
> On Tue, Apr 29, 2014 at 7:31 PM, Al Viro <[email protected]> wrote:
> >
> > OK, aggregate diff follows, more readable splitup (3 commits) attached.
> > It seems to survive beating here; testing, review and comments are
> > welcome.
>
> Miklos, did you have some particular load that triggered this, or was
> it just some reports? It would be really good to get this patch some
> stress-testing.
>
> I like how the patch removes more lines than it adds, but apart from
> that it's hard to read the patch (even the split-out ones) and say
> anything more about it. I think this needs a *lot* of testing.

IBM is triggering this with the host01 test from the LTP suite. I haven't yet
tried to reproduce it.

Thanks,
Miklos

2014-04-30 15:50:07

by Miklos Szeredi

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 05:04:36AM +0100, Al Viro wrote:
> On Tue, Apr 29, 2014 at 07:56:13PM -0700, Linus Torvalds wrote:
> > On Tue, Apr 29, 2014 at 7:31 PM, Al Viro <[email protected]> wrote:
> > >
> > > OK, aggregate diff follows, more readable splitup (3 commits) attached.
> > > It seems to survive beating here; testing, review and comments are
> > > welcome.
> >
> > Miklos, did you have some particular load that triggered this, or was
> > it just some reports? It would be really good to get this patch some
> > stress-testing.
> >
> > I like how the patch removes more lines than it adds, but apart from
> > that it's hard to read the patch (even the split-out ones) and say
> > anything more about it. I think this needs a *lot* of testing.
>
> FWIW, the first two are really straightforward expanding the function
> into its only callsite. The last needs more splitup. Not sure if the
> following is good enough, but it ought to be at least somewhat cleaner.
> Combined change is identical to the original, so it doesn't invalidate
> the testing so far...

Hmm, patches look okay, but I'm wondering if we really need the morgue list and
the waiting. Why not just skip dentries that are presently being handled by
dentry_kill()?

Thanks,
Miklos
---


From: Al Viro <[email protected]>
Date: Tue, 29 Apr 2014 23:52:05 -0400
Subject: Don't try to remove from shrink list we don't own

So far it's just been local equivalent transformations. Now the meat of
that thing: dentry_kill() has two kinds of call sites - ones that had just
dropped the last reference (dput(), killing ancestors in
shrink_dentry_list()) and one where the victim had sat on shrink list with
zero refcount. We already have a flag telling which kind it is
(unlock_on_failure). What we want to do is to avoid d_shrink_del() in the
first kind of dentry_kill(). We do, however, want everything except the
actual freeing still done as we would in mainline. So we need to deal with
two new situations - the first kind of dentry_kill() finding a dentry on
shrink list (result of laziness in dget(); we had it on shrink list with
refcount 1) and the second kind finding a dentry that is currently being
killed by the first kind. What we do is this:

- in the first kind of dentry_kill() we don't remove the dentry from the
shrink list; this makes the shrink list really private to the shrinker
functions

- we proceed with the normal killing but only actually free the dentry if
it's definitely not on the shrink list at this point. If it's still on
the shrink list set DCACHE_MAY_FREE and defer the freeing to
shrink_dentry_list()

- shrink_dentry_list() will detect if the dentry is killed with
DCACHE_DENTRY_KILLED. If DCACHE_MAY_FREE isn't yet set, just take the
dentry off the shrink list and let the still running dentry_kill()
finish it off. If DCACHE_MAY_FREE is set, then free the dentry.

Signed-off-by: Al Viro <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/dcache.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
include/linux/dcache.h | 2 ++
2 files changed, 44 insertions(+), 2 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -469,6 +469,7 @@ dentry_kill(struct dentry *dentry, int u
{
struct inode *inode;
struct dentry *parent;
+ bool can_free = true;

inode = dentry->d_inode;
if (inode && !spin_trylock(&inode->i_lock)) {
@@ -504,8 +505,10 @@ dentry_kill(struct dentry *dentry, int u
if (dentry->d_flags & DCACHE_LRU_LIST) {
if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
d_lru_del(dentry);
- else
+ else if (!unlock_on_failure)
d_shrink_del(dentry);
+ else
+ can_free = false;
}
/* if it was on the hash then remove it */
__d_drop(dentry);
@@ -527,7 +530,25 @@ dentry_kill(struct dentry *dentry, int u
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);

- dentry_free(dentry);
+ if (unlikely(!can_free)) {
+ spin_lock(&dentry->d_lock);
+ /*
+ * Could have gotten off the shrink list while not holding the
+ * dcache lock.
+ *
+ * If still on the shrink list shrink_dentry_list will take care
+ * of freeing it for us. Signal this by setting the
+ * DCACHE_MAY_FREE flag.
+ */
+ if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
+ can_free = true;
+ else
+ dentry->d_flags |= DCACHE_MAY_FREE;
+ spin_unlock(&dentry->d_lock);
+ }
+ if (likely(can_free))
+ dentry_free(dentry);
+
return parent;
}

@@ -835,6 +856,25 @@ static void shrink_dentry_list(struct li
}
rcu_read_unlock();

+ if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+ bool free_it;
+ /*
+ * This dentry has already been killed, but was stuck on
+ * the shrink list. It may be undergoing cleanup, in
+ * which case let dput() finish it off.
+ *
+ * If it's already clean, dput() deferred freeing to us.
+ */
+ free_it = (dentry->d_flags & DCACHE_MAY_FREE) != 0;
+ spin_unlock(&dentry->d_lock);
+
+ if (free_it)
+ dentry_free(dentry);
+
+ rcu_read_lock();
+ continue;
+ }
+
parent = dentry_kill(dentry, 0);
/*
* If dentry_kill returns NULL, we have nothing more to do.
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,8 @@ struct dentry_operations {
#define DCACHE_SYMLINK_TYPE 0x00300000 /* Symlink */
#define DCACHE_FILE_TYPE 0x00400000 /* Other file type */

+#define DCACHE_MAY_FREE 0x00800000
+
extern seqlock_t rename_lock;

static inline int dname_external(const struct dentry *dentry)

2014-04-30 15:57:04

by Miklos Szeredi

Subject: Re: dcache shrink list corruption?

And a followup patch for that, removing RCU locking from shrink_dentry_list().

I haven't tested any of this at this point, but I'll ask IBM to do some stress
testing.

Thanks,
Miklos
---

From: Miklos Szeredi <[email protected]>
Subject: dcache: don't need rcu in shrink_dentry_list()

Since now the shrink list is private and nobody can free the dentry while
it is on the shrink list, we can remove RCU protection from this.

Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/dcache.c | 29 ++++-------------------------
1 file changed, 4 insertions(+), 25 deletions(-)

--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -822,23 +822,9 @@ static void shrink_dentry_list(struct li
{
struct dentry *dentry, *parent;

- rcu_read_lock();
- for (;;) {
- dentry = list_entry_rcu(list->prev, struct dentry, d_lru);
- if (&dentry->d_lru == list)
- break; /* empty */
-
- /*
- * Get the dentry lock, and re-verify that the dentry is
- * this on the shrinking list. If it is, we know that
- * DCACHE_SHRINK_LIST and DCACHE_LRU_LIST are set.
- */
+ while (!list_empty(list)) {
+ dentry = list_entry(list->prev, struct dentry, d_lru);
spin_lock(&dentry->d_lock);
- if (dentry != list_entry(list->prev, struct dentry, d_lru)) {
- spin_unlock(&dentry->d_lock);
- continue;
- }
-
/*
* The dispose list is isolated and dentries are not accounted
* to the LRU here, so we can simply remove it from the list
@@ -854,7 +840,6 @@ static void shrink_dentry_list(struct li
spin_unlock(&dentry->d_lock);
continue;
}
- rcu_read_unlock();

if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
bool free_it;
@@ -870,8 +855,6 @@ static void shrink_dentry_list(struct li

if (free_it)
dentry_free(dentry);
-
- rcu_read_lock();
continue;
}

@@ -879,17 +862,15 @@ static void shrink_dentry_list(struct li
/*
* If dentry_kill returns NULL, we have nothing more to do.
*/
- if (!parent) {
- rcu_read_lock();
+ if (!parent)
continue;
- }
+
if (unlikely(parent == dentry)) {
/*
* trylocks have failed and d_lock has been held the
* whole time, so it could not have been added to any
* other lists. Just add it back to the shrink list.
*/
- rcu_read_lock();
d_shrink_add(dentry, list);
spin_unlock(&dentry->d_lock);
continue;
@@ -903,9 +884,7 @@ static void shrink_dentry_list(struct li
dentry = parent;
while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
dentry = dentry_kill(dentry, 1);
- rcu_read_lock();
}
- rcu_read_unlock();
}

static enum lru_status

2014-04-30 16:03:51

by Al Viro

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
> > FWIW, the first two are really straightforward expanding the function
> > into its only callsite. The last needs more splitup. Not sure if the
> > following is good enough, but it ought to be at least somewhat cleaner.
> > Combined change is identical to the original, so it doesn't invalidate
> > the testing so far...
>
> Hmm, patches look okay, but I'm wondering if we really need the morgue list and
> the waiting. Why not just skip dentries that are presently being handled by
> dentry_kill()?

Who will be freeing them? If we do that from dentry_kill(), we are back to
needing them removed from shrink list by something called by dput(). And
if we do that from shrink_dentry_list(), we'd damn better wait for
dentry_iput() et.al. to finish.

2014-04-30 17:33:41

by Miklos Szeredi

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 6:03 PM, Al Viro <[email protected]> wrote:
> On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
>> > FWIW, the first two are really straightforward expanding the function
>> > into its only callsite. The last needs more splitup. Not sure if the
>> > following is good enough, but it ought to be at least somewhat cleaner.
>> > Combined change is identical to the original, so it doesn't invalidate
>> > the testing so far...
>>
>> Hmm, patches look okay, but I'm wondering if we really need the morgue list and
>> the waiting. Why not just skip dentries that are presently being handled by
>> dentry_kill()?
>
> Who will be freeing them? If we do that from dentry_kill(), we are back to
> needing them removed from shrink list by something called by dput(). And
> if we do that from shrink_dentry_list(), we'd damn better wait for
> dentry_iput() et.al. to finish.

We can do it from dput if the shrinker gets there first and from the
shrinker if dput manages to finish before. See the updated patch in
the previous mail.

Thanks,
Miklos

2014-04-30 18:36:55

by Al Viro

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 07:33:38PM +0200, Miklos Szeredi wrote:
> On Wed, Apr 30, 2014 at 6:03 PM, Al Viro <[email protected]> wrote:
> > On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
> >> > FWIW, the first two are really straightforward expanding the function
> >> > into its only callsite. The last needs more splitup. Not sure if the
> >> > following is good enough, but it ought to be at least somewhat cleaner.
> >> > Combined change is identical to the original, so it doesn't invalidate
> >> > the testing so far...
> >>
> >> Hmm, patches look okay, but I'm wondering if we really need the morgue list and
> >> the waiting. Why not just skip dentries that are presently being handled by
> >> dentry_kill()?
> >
> > Who will be freeing them? If we do that from dentry_kill(), we are back to
> > needing them removed from shrink list by something called by dput(). And
> > if we do that from shrink_dentry_list(), we'd damn better wait for
> > dentry_iput() et.al. to finish.
>
> We can do it from dput if the shrinker gets there first and from the
> shrinker if dput manages to finish before. See the updated patch in
> the previous mail.

Er? The only patch I see is removal of RCU from shrink_dentry_list(), which
is fine, but doesn't do anything of that sort. What was the Message-ID?

Let me see if I understand what you are proposing:
dentry_kill(dentry, 0) seeing DCACHE_DENTRY_KILLED:
	check DCACHE_MAY_FREE, free if it's been set
dentry_kill(dentry, 1) seeing that we are on the shrinker list:
	leave on the list, do the work on killing, retake ->d_lock,
	if we are still on the shrinker list
		set DCACHE_MAY_FREE
	else
		free it

That would probably work...

2014-04-30 18:42:05

by Miklos Szeredi

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 8:36 PM, Al Viro <[email protected]> wrote:
> On Wed, Apr 30, 2014 at 07:33:38PM +0200, Miklos Szeredi wrote:
>> On Wed, Apr 30, 2014 at 6:03 PM, Al Viro <[email protected]> wrote:
>> > On Wed, Apr 30, 2014 at 05:49:58PM +0200, Miklos Szeredi wrote:
>> >> > FWIW, the first two are really straightforward expanding the function
>> >> > into its only callsite. The last needs more splitup. Not sure if the
>> >> > following is good enough, but it ought to be at least somewhat cleaner.
>> >> > Combined change is identical to the original, so it doesn't invalidate
>> >> > the testing so far...
>> >>
>> >> Hmm, patches look okay, but I'm wondering if we really need the morgue list and
>> >> the waiting. Why not just skip dentries that are presently being handled by
>> >> dentry_kill()?
>> >
>> > Who will be freeing them? If we do that from dentry_kill(), we are back to
>> > needing them removed from shrink list by something called by dput(). And
>> > if we do that from shrink_dentry_list(), we'd damn better wait for
>> > dentry_iput() et.al. to finish.
>>
>> We can do it from dput if the shrinker gets there first and from the
>> shrinker if dput manages to finish before. See the updated patch in
>> the previous mail.
>
> Er? The only patch I see is removal of RCU from shrink_dentry_list(), which
> is fine, but doesn't do anything of that sort. What was the Message-ID?

Message-ID: <[email protected]>

>
> Let me see if I understand what you are proposing:
> dentry_kill(dentry, 0) seeing DCACHE_DENTRY_KILLED:
>	check DCACHE_MAY_FREE, free if it's been set
> dentry_kill(dentry, 1) seeing that we are on the shrinker list:
>	leave on the list, do the work on killing, retake ->d_lock,
>	if we are still on the shrinker list
>		set DCACHE_MAY_FREE
>	else
>		free it
>
> That would probably work...

Exactly.

Thanks,
Miklos

2014-04-30 19:02:33

by Al Viro

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 08:42:01PM +0200, Miklos Szeredi wrote:

> Message-ID: <[email protected]>

I see... Several points:
* I still think that it's better to do handling of "we see
DCACHE_DENTRY_KILLED already set" in dentry_kill() itself.
* in dentry_kill(dentry, 0) we *know* that it's not on a shrink
list - the caller has just removed it from there and we'd kept ->d_lock
since that point. What's more, with that scheme we don't need to bother
with can_free at all - just grab ->d_lock after ->d_release() call and check
DCACHE_SHRINK_LIST. No sense trying to avoid that - in case where we
could just go ahead and free the sucker, there's neither contention nor
anybody else interested in that cacheline, so spin_lock(&dentry->d_lock)
is pretty much free.

IOW, I'd prefer to do the following (completely untested) on top of 1/6--4/6:

diff --git a/fs/dcache.c b/fs/dcache.c
index e482775..82d8eb4 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -489,6 +489,17 @@ relock:
goto relock;
}

+ if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+ if (parent)
+ spin_unlock(&parent->d_lock);
+ if (dentry->d_flags & DCACHE_MAY_FREE) {
+ spin_unlock(&dentry->d_lock);
+ dentry_free(dentry);
+ } else {
+ spin_unlock(&dentry->d_lock);
+ }
+ return parent;
+ }
/*
* The dentry is now unrecoverably dead to the world.
*/
@@ -504,8 +515,6 @@ relock:
if (dentry->d_flags & DCACHE_LRU_LIST) {
if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
d_lru_del(dentry);
- else
- d_shrink_del(dentry);
}
/* if it was on the hash then remove it */
__d_drop(dentry);
@@ -527,7 +536,14 @@ relock:
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);

- dentry_free(dentry);
+ spin_lock(&dentry->d_lock);
+ if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+ dentry->d_flags |= DCACHE_MAY_FREE;
+ spin_unlock(&dentry->d_lock);
+ } else {
+ spin_unlock(&dentry->d_lock);
+ dentry_free(dentry);
+ }
return parent;
}

diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3b9bfdb..3c7ec32 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,8 @@ struct dentry_operations {
#define DCACHE_SYMLINK_TYPE 0x00300000 /* Symlink */
#define DCACHE_FILE_TYPE 0x00400000 /* Other file type */

+#define DCACHE_MAY_FREE 0x00800000
+
extern seqlock_t rename_lock;

static inline int dname_external(const struct dentry *dentry)

2014-04-30 19:59:22

by Al Viro

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 08:02:27PM +0100, Al Viro wrote:
> On Wed, Apr 30, 2014 at 08:42:01PM +0200, Miklos Szeredi wrote:
>
> > Message-ID: <[email protected]>
>
> I see... Several points:
> * I still think that it's better to do handling of "we see
> DCACHE_DENTRY_KILLED already set" in dentry_kill() itself.
> * in dentry_kill(dentry, 0) we *know* that it's not on a shrink
> list - the caller has just removed it from there and we'd kept ->d_lock
> since that point. What's more, with that scheme we don't need to bother
> with can_free at all - just grab ->d_lock after ->d_release() call and check
> DCACHE_SHRINK_LIST. No sense trying to avoid that - in case where we
> could just go ahead and free the sucker, there's neither contention nor
> anybody else interested in that cacheline, so spin_lock(&dentry->d_lock)
> is pretty much free.
>
> IOW, I'd prefer to do the following (completely untested) on top of 1/6--4/6:

Sigh... One more problem:
	/*
	 * We found an inuse dentry which was not removed from
	 * the LRU because of laziness during lookup. Do not free it.
	 */
	if (dentry->d_lockref.count) {
		spin_unlock(&dentry->d_lock);
		continue;
	}
should become
	if (dentry->d_lockref.count > 0) {
	....
instead - lockref_mark_dead() slaps -128 into it, so we'll just leak all
dentries where dput() has come first and decided to leave the suckers
for us.

Another thing: I don't like what's going on with freeing vs. ->d_lock there.
Had that been a mutex, we'd definitely get a repeat of "vfs: fix subtle
use-after-free of pipe_inode_info". The question is, can spin_unlock(p)
dereference p after another CPU gets through spin_lock(p)? Linus?

It can be dealt with by setting DCACHE_RCUACCESS in "let another guy free
it" cases and playing with rcu_read_lock a bit, but I wonder whether we
need to bother - quick look through alpha/sparc/ppc shows that on those
we do not and the same is true for non-paravirt case on x86. I hadn't
checked what paravirt one does, though, and I certainly hadn't done
exhaustive search for architectures doing something weird...

2014-04-30 20:23:29

by Linus Torvalds

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 12:59 PM, Al Viro <[email protected]> wrote:
>
> Another thing: I don't like what's going on with freeing vs. ->d_lock there.
> Had that been a mutex, we'd definitely get a repeat of "vfs: fix subtle
> use-after-free of pipe_inode_info". The question is, can spin_unlock(p)
> dereference p after another CPU gets through spin_lock(p)? Linus?

spin_unlock() *should* be safe wrt that issue.

But I have to say, I think paravirtualized spinlocks may break that.
They do all kinds of "kick waiters" after releasing the lock.

Doesn't the RCU protection solve that, though? Nobody should be
releasing the dentry under us, afaik..

Linus

2014-04-30 20:38:26

by Al Viro

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 01:23:26PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 12:59 PM, Al Viro <[email protected]> wrote:
> >
> > Another thing: I don't like what's going on with freeing vs. ->d_lock there.
> > Had that been a mutex, we'd definitely get a repeat of "vfs: fix subtle
> > use-after-free of pipe_inode_info". The question is, can spin_unlock(p)
> > dereference p after another CPU gets through spin_lock(p)? Linus?
>
> spin_unlock() *should* be safe wrt that issue.
>
> But I have to say, I think paravirtualized spinlocks may break that.
> They do all kinds of "kick waiters" after releasing the lock.
>
> Doesn't the RCU protection solve that, though? Nobody should be
> releasing the dentry under us, afaik..

We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can
trigger any amount of IO, for one thing. We can take it around the
couple of places where we do that spin_unlock(&dentry->d_lock) (along with
setting DCACHE_RCUACCESS) - that's what I'd been referring to. Then this
sucker (tests still running, so far everything seems to survive) becomes
the following (again, on top of 1/6..4/6). BTW, is there any convenient
way to tell git commit --amend to update the commit date? Something
like --date=now would be nice, but it isn't accepted...

commit 797ff22681dc969b478ed837787d24dfd2dd2132
Author: Al Viro <[email protected]>
Date: Tue Apr 29 23:52:05 2014 -0400

dentry_kill(): don't try to remove from shrink list

If the victim in on the shrink list, don't remove it from there.
If shrink_dentry_list() manages to remove it from the list before
we are done - fine, we'll just free it as usual. If not - mark
it with new flag (DCACHE_MAY_FREE) and leave it there.

Eventually, shrink_dentry_list() will get to it, remove the sucker
from shrink list and call dentry_kill(dentry, 0). Which is where
we'll deal with freeing.

Since now dentry_kill(dentry, 0) may happen after or during
dentry_kill(dentry, 1), we need to recognize that (by seeing
DCACHE_DENTRY_KILLED already set), unlock everything
and either free the sucker (in case DCACHE_MAY_FREE has been
set) or leave it for ongoing dentry_kill(dentry, 1) to deal with.

Signed-off-by: Al Viro <[email protected]>

diff --git a/fs/dcache.c b/fs/dcache.c
index e482775..fa40d26 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -489,6 +489,20 @@ relock:
goto relock;
}

+ if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
+ if (parent)
+ spin_unlock(&parent->d_lock);
+ if (dentry->d_flags & DCACHE_MAY_FREE) {
+ spin_unlock(&dentry->d_lock);
+ dentry_free(dentry);
+ } else {
+ dentry->d_flags |= DCACHE_RCUACCESS;
+ rcu_read_lock();
+ spin_unlock(&dentry->d_lock);
+ rcu_read_unlock();
+ }
+ return parent;
+ }
/*
* The dentry is now unrecoverably dead to the world.
*/
@@ -504,8 +518,6 @@ relock:
if (dentry->d_flags & DCACHE_LRU_LIST) {
if (!(dentry->d_flags & DCACHE_SHRINK_LIST))
d_lru_del(dentry);
- else
- d_shrink_del(dentry);
}
/* if it was on the hash then remove it */
__d_drop(dentry);
@@ -527,7 +539,16 @@ relock:
if (dentry->d_op && dentry->d_op->d_release)
dentry->d_op->d_release(dentry);

- dentry_free(dentry);
+ spin_lock(&dentry->d_lock);
+ if (dentry->d_flags & DCACHE_SHRINK_LIST) {
+ dentry->d_flags |= DCACHE_MAY_FREE | DCACHE_RCUACCESS;
+ rcu_read_lock();
+ spin_unlock(&dentry->d_lock);
+ rcu_read_unlock();
+ } else {
+ spin_unlock(&dentry->d_lock);
+ dentry_free(dentry);
+ }
return parent;
}

@@ -829,7 +850,7 @@ static void shrink_dentry_list(struct list_head *list)
* We found an inuse dentry which was not removed from
* the LRU because of laziness during lookup. Do not free it.
*/
- if (dentry->d_lockref.count) {
+ if (dentry->d_lockref.count > 0) {
spin_unlock(&dentry->d_lock);
continue;
}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 3b9bfdb..3c7ec32 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -221,6 +221,8 @@ struct dentry_operations {
#define DCACHE_SYMLINK_TYPE 0x00300000 /* Symlink */
#define DCACHE_FILE_TYPE 0x00400000 /* Other file type */

+#define DCACHE_MAY_FREE 0x00800000
+
extern seqlock_t rename_lock;

static inline int dname_external(const struct dentry *dentry)

2014-04-30 20:57:09

by Linus Torvalds

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 1:38 PM, Al Viro <[email protected]> wrote:
>
> We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can
> trigger any amount of IO, for one thing. We can take it around the
> couple of places where we do that spin_unlock(&dentry->d_lock) (along with
> setting DCACHE_RCUACCESS) - that's what I'd been referring to.

Just the last spin_unlock() would be the case that matters, if the
spin_unlock() is done on something that could be freed immediately and
the lock protects and is inside the entity that gets freed.

> BTW, is there any convenient
> way to tell git commit --amend to update the commit date? Something
> like --date=now would be nice, but it isn't accepted...

--date="$(date)" works.

The "--date" thing doesn't take the nice "git approxidate" strings,
maybe it should. So you have to give it a "real" date.

Linus

2014-04-30 21:12:11

by Al Viro

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 01:57:05PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 1:38 PM, Al Viro <[email protected]> wrote:
> >
> > We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can
> > trigger any amount of IO, for one thing. We can take it around the
> > couple of places where we do that spin_unlock(&dentry->d_lock) (along with
> > setting DCACHE_RCUACCESS) - that's what I'd been referring to.
>
> Just the last spin_unlock() would be the case that matters, if the
> spin_unlock() is done on something that could be freed immediately and
> the lock protects and is inside the entity that gets freed.

*nod*

There are two such spin_unlock (handover from shrink_dentry_list() to
dput() and the opposite one), but they are all that needs protection -
->d_flags update is outside the rcu-critical area. I really wonder
if we *can* get there without DCACHE_RCUACCESS having been set, though;
dentry would have to be
* picked into shrink list (i.e. have had zero refcount at some point)
* never had been through __d_rehash()
shrink_dentry_list() definitely counts on that being impossible, and it
probably is, but I'm feeling seriously paranoid about the whole area.
I'll finish grepping through the tree and probably drop setting
DCACHE_RCUACCESS from the patch - either that, or set it in d_shrink_add()
if it turns out that it is possible and shrink_dentry_list() is fucked...

Tests seem to be running fine so far...

> > BTW, is there any convenient
> > way to tell git commit --amend to update the commit date? Something
> > like --date=now would be nice, but it isn't accepted...
>
> --date="$(date)" works.

Thanks...

2014-04-30 22:12:42

by Al Viro

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 10:12:06PM +0100, Al Viro wrote:
> On Wed, Apr 30, 2014 at 01:57:05PM -0700, Linus Torvalds wrote:
> > On Wed, Apr 30, 2014 at 1:38 PM, Al Viro <[email protected]> wrote:
> > >
> > > We do not (and cannot) call dentry_kill() with rcu_read_lock held - it can
> > > trigger any amount of IO, for one thing. We can take it around the
> > > couple of places where we do that spin_unlock(&dentry->d_lock) (along with
> > > setting DCACHE_RCUACCESS) - that's what I'd been referring to.
> >
> > Just the last spin_unlock() would be the case that matters, if the
> > spin_unlock() is done on something that could be freed immediately and
> > the lock protects and is inside the entity that gets freed.
>
> *nod*
>
> There are two such spin_unlock (handover from shrink_dentry_list() to
> dput() and the opposite one), but they are all that needs protection -
> ->d_flags update is outside the rcu-critical area. I really wonder
> if we *can* get there without DCACHE_RCUACCESS having been set, though;
> dentry would have to be
> * picked into shrink list (i.e. have had zero refcount at some point)
> * never had been through __d_rehash()
> shrink_dentry_list() definitely counts on that being impossible, and it
> probably is, but I'm feeling seriously paranoid about the whole area.
> I'll finish grepping through the tree and probably drop setting
> DCACHE_RCUACCESS from the patch - either that, or set it in d_shrink_add()
> if it turns out that it is possible and shrink_dentry_list() is fucked...

OK, it really can't happen. The proof is more convoluted than I'd like it,
but it's solid enough, so setting that flag in dentry_kill() handover cases
wasn't needed. I've just pushed the whole thing to vfs.git#for-linus;
review and testing would be very welcome. I can repost it one more time,
but the only difference compared to the last variant in this thread is not
bothering with DCACHE_RCUACCESS.

It has survived LTP tests, going through xfstests now...

2014-04-30 23:05:04

by Linus Torvalds

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 3:12 PM, Al Viro <[email protected]> wrote:
>
> I've just pushed the whole thing to vfs.git#for-linus;
> review and testing would be very welcome.

I have no half-way relevant test-case for this, so I'm hoping people
who have good VFS stress-tests (preferably under memory pressure so
that we get that whole shrinking path) will test.

But it looks fine.

That said, I do hate that RCU read-lock around the final spin-unlock.

Let me go talk to the paravirt people. Maybe they don't need this, and
I don't know exactly *how* they use that lock pointer after the unlock
in the "kick waiters" part. Maybe it's all good.

Linus

2014-04-30 23:14:17

by Linus Torvalds

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 4:04 PM, Linus Torvalds
<[email protected]> wrote:
>
> Let me go talk to the paravirt people. Maybe they don't need this, and
> I don't know exactly *how* they use that lock pointer after the unlock
> in the "kick waiters" part. Maybe it's all good.

.. looking at current users (xen and kvm) it does in fact look all
good. Yes, we "kick" possible waiters after the unlock, but the lock
itself is not touched by that, it just uses the pointer to the lock as
a way to figure out who to kick.

In fact, I kind of knew that, but had forgotten. We very much depend
on spin_unlock being safe wrt immediate deletion already: the
"completion" code very much depends on it. It does a "spin_unlock()"
to release the completion, and it can get reused immediately (it might
be on the stack, it might be in some data structure that gets freed).

So I'd suggest removing that whole RCU thing, because it should be
safe to unlock something that can go away immediately. We'd have huge
problems elsewhere if it wasn't safe.

Linus

2014-04-30 23:38:09

by Al Viro

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 04:04:58PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 3:12 PM, Al Viro <[email protected]> wrote:
> >
> > I've just pushed the whole thing to vfs.git#for-linus;
> > review and testing would be very welcome.
>
> I have no half-way relevant test-case for this, so I'm hoping people
> who have good VFS stress-tests (preferably under memory pressure so
> that we get that whole shrinking path) will test.
>
> But it looks fine.
>
> That said, I do hate that RCU read-lock around the final spin-unlock.

So do I, obviously... After looking through the rest of the arch_spin_unlock()
implementations, it looks like the only suspicious one besides the paravirt
one on x86 is blackfin. And that might be a misreading - I'm not familiar
enough with the architecture to tell...

2014-04-30 23:43:45

by Al Viro

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 04:14:14PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 4:04 PM, Linus Torvalds
> <[email protected]> wrote:
> >
> > Let me go talk to the paravirt people. Maybe they don't need this, and
> > I don't know exactly *how* they use that lock pointer after the unlock
> > in the "kick waiters" part. Maybe it's all good.
>
> .. looking at current users (xen and kvm) it does in fact look all
> good. Yes, we "kick" possible waiters after the unlock, but the lock
> itself is not touched by that, it just uses the pointer to the lock as
> a way to figure out who to kick.
>
> In fact, I kind of knew that, but had forgotten. We very much depend
> on spin_unlock being safe wrt immediate deletion already: the
> "completion" code very much depends on it. It does a "spin_unlock()"
> to release the completion, and it can get reused immediately (it might
> be on the stack, it might be in some data structure that gets freed).
>
> So I'd suggest removing that whole RCU thing, because it should be
> safe to unlock something that can go away immediately. We'd have huge
> problems elsewhere if it wasn't safe.

OK, done and force-pushed. Should propagate in a few...

2014-05-01 00:18:27

by Linus Torvalds

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 4:43 PM, Al Viro <[email protected]> wrote:
>
> OK, done and force-pushed. Should propagate in a few...

That made it more obvious how the DCACHE_MAY_FREE case ends up
working. And in particular, mind rewriting this:

if (dentry->d_flags & DCACHE_MAY_FREE) {
spin_unlock(&dentry->d_lock);
dentry_free(dentry);
} else {
spin_unlock(&dentry->d_lock);
}
return parent;

as just

bool free = dentry->d_flags & DCACHE_MAY_FREE;
spin_unlock(&dentry->d_lock);
if (free)
dentry_free(dentry);
return parent;

instead? In fact, I get the feeling that the other case later on
really fits the same model:

spin_lock(&dentry->d_lock);
if (dentry->d_flags & DCACHE_SHRINK_LIST) {
dentry->d_flags |= DCACHE_MAY_FREE;
spin_unlock(&dentry->d_lock);
} else {
spin_unlock(&dentry->d_lock);
dentry_free(dentry);
}

ends up really being better as

spin_lock(&dentry->d_lock);
free = 1;
if (dentry->d_flags & DCACHE_SHRINK_LIST) {
dentry->d_flags |= DCACHE_MAY_FREE;
free = 0;
}
spin_unlock(&dentry->d_lock);
if (free)
dentry_free(dentry);
return parent;

and then suddenly it looks like we have a common exit sequence from
that dentry_kill() function, no?

(The earlier "unlock_on_failure" exit case is altogether a different case).

I dunno. Maybe not a big deal, but one reason I prefer doing that
"free" flag is because I really tend to prefer the simple case of
lock-unlock pairing cleanly at the same level. NOT the pattern where
you have one lock at one indentation level, paired with multiple
unlocks for all the different cases.

Linus

2014-05-01 02:51:10

by Al Viro

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 05:18:23PM -0700, Linus Torvalds wrote:

> and then suddenly it looks like we have a common exit sequence from
> that dentry_kill() function, no?
>
> (The earlier "unlock_on_failure" exit case is altogether a different case).
>
> I dunno. Maybe not a big deal, but one reason I prefer doing that
> "free" flag is because I really tend to prefer the simple case of
> lock-unlock pairing cleanly at the same level. NOT the pattern where
> you have one lock at one indentation level, paired with multiple
> unlocks for all the different cases.

Yeah, but... I have such a variant, but the best I could get still generated
code that wasn't particularly nice. Part might be gcc code generation
sucking for bool, part - extra register pressure... It has slightly lower
i-cache footprint, though, so it might be worth doing. Hell knows; that's a
fairly hot codepath, so let's do it that way - I've just pushed an alternative
branch with bool can_free variant; branches in question: vfs.git#for-linus and
vfs.git#dentry_kill-2. Help with profiling is needed; the loads to watch are
the ones where dentry_kill() + dentry_free() are sufficiently high in profiles
for the differences to matter. Any takers?

2014-05-01 02:59:45

by Linus Torvalds

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 7:51 PM, Al Viro <[email protected]> wrote:
>
> Help with profiling is needed; the loads to watch are
> the ones where dentry_kill() + dentry_free() are sufficiently high in profiles
> for the differences to matter. Any takers?

I really hope there are no such loads, my "lock/unlock pairing"
suggestion was mostly so that the pairing is clearer, not necessarily
for performance.

That said, I'd assume that it might be worth testing at least the
"lots of concurrent lookups of 'simple_dentry_operations' dentries".
So most of /proc, most uses of simple_lookup(). That at least triggers
the dentry_kill() path in dput(), so it should be fairly easy to get
profiles.

But real loads with real filesystems? That sounds harder.

Linus

2014-05-01 03:12:55

by Al Viro

Subject: Re: dcache shrink list corruption?

On Wed, Apr 30, 2014 at 07:59:43PM -0700, Linus Torvalds wrote:
> On Wed, Apr 30, 2014 at 7:51 PM, Al Viro <[email protected]> wrote:
> >
> > Help with profiling is needed; the loads to watch are
> > the ones where dentry_kill() + dentry_free() are sufficiently high in profiles
> > for the differences to matter. Any takers?
>
> I really hope there are no such loads, my "lock/unlock pairing"
> suggestion was mostly so that the pairing is clearer, not necessarily
> for performance.
>
> That said, I'd assume that it might be worth testing at least the
> "lots of concurrent lookups of 'simple_dentry_operations' dentries".
> So most of /proc, most uses of simple_lookup(). That at least triggers
> the dentry_kill() path in dput(), so it should be fairly easy to get
> profiles.
>
> But real loads with real filesystems? That sounds harder.

Well, the simplest way to do that is a bunch of open/unlink/close. Another
one is cp -rl / rm -rf of a large tree (rmdir(2) does shrink_dcache_parent()).
For that matter, unmapping an anon shared mapping will trigger the same path.