2009-03-09 19:05:36

by Sage Weil

Subject: [PATCH] vfs: make real_lookup do dentry revalidation with i_mutex held

real_lookup() is called by do_lookup() if dentry revalidation fails. If
the cache is re-populated while waiting for i_mutex, it may find that
a d_lookup() subsequently succeeds (see the "Uhhuh! Nasty case" comment).

Previously, real_lookup() would drop i_mutex and do_revalidate() again. If
revalidate failed _again_, however, it would give up with -ENOENT. The
problem here is that network file systems may be invalidating dentries via
server callbacks, e.g. due to concurrent access from another client, and
-ENOENT is frequently the wrong answer.

This problem has been seen with both Lustre and Ceph. It seems possible
to hit this case with NFS as well if the cache lifetime is very short.

Instead, we should do_revalidate() while i_mutex is still held. If
revalidation fails, we can move on to a ->lookup() and ensure a correct
result without worrying about any subsequent races.

Note that do_revalidate() is called with i_mutex held elsewhere. For
example, do_filp_open(), lookup_create(), do_unlinkat(), do_rmdir(),
and possibly others all take the directory i_mutex, and then

-> lookup_hash
-> __lookup_hash
-> cached_lookup
-> do_revalidate

so this does not introduce any new locking rules for d_revalidate
implementations.

CC: Al Viro <[email protected]>
CC: Andreas Dilger <[email protected]>
Signed-off-by: Yehuda Sadeh <[email protected]>
Signed-off-by: Sage Weil <[email protected]>
---
fs/namei.c | 56 +++++++++++++++++++++++++++++---------------------------
1 files changed, 29 insertions(+), 27 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index c30e33d..49f58d1 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -469,6 +469,7 @@ static struct dentry * real_lookup(struct dentry * parent, struct qstr * name, s
{
struct dentry * result;
struct inode *dir = parent->d_inode;
+ struct dentry *dentry;

mutex_lock(&dir->i_mutex);
/*
@@ -486,38 +487,39 @@ static struct dentry * real_lookup(struct dentry * parent, struct qstr * name, s
* so doing d_lookup() (with seqlock), instead of lockfree __d_lookup
*/
result = d_lookup(parent, name);
- if (!result) {
- struct dentry *dentry;
-
- /* Don't create child dentry for a dead directory. */
- result = ERR_PTR(-ENOENT);
- if (IS_DEADDIR(dir))
- goto out_unlock;
-
- dentry = d_alloc(parent, name);
- result = ERR_PTR(-ENOMEM);
- if (dentry) {
- result = dir->i_op->lookup(dir, dentry, nd);
+ if (result) {
+ /*
+ * The cache was re-populated while we waited on the
+ * mutex. We need to revalidate, this time while
+ * holding i_mutex (to avoid another race).
+ */
+ if (result->d_op && result->d_op->d_revalidate) {
+ result = do_revalidate(result, nd);
if (result)
- dput(dentry);
- else
- result = dentry;
+ goto out_unlock;
+ /*
+ * The dentry was left behind invalid. Just
+ * do the lookup.
+ */
}
-out_unlock:
- mutex_unlock(&dir->i_mutex);
- return result;
}

- /*
- * Uhhuh! Nasty case: the cache was re-populated while
- * we waited on the semaphore. Need to revalidate.
- */
- mutex_unlock(&dir->i_mutex);
- if (result->d_op && result->d_op->d_revalidate) {
- result = do_revalidate(result, nd);
- if (!result)
- result = ERR_PTR(-ENOENT);
+ /* Don't create child dentry for a dead directory. */
+ result = ERR_PTR(-ENOENT);
+ if (IS_DEADDIR(dir))
+ goto out_unlock;
+
+ dentry = d_alloc(parent, name);
+ result = ERR_PTR(-ENOMEM);
+ if (dentry) {
+ result = dir->i_op->lookup(dir, dentry, nd);
+ if (result) {
+ dput(dentry);
+ } else
+ result = dentry;
}
+out_unlock:
+ mutex_unlock(&dir->i_mutex);
return result;
}

--
1.5.6.5
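
A minimal userspace sketch of the behaviour change described in the commit message above; it is not kernel code. It assumes the dentry found by d_lookup() has a d_revalidate method, and it ignores locking and ->lookup() failures. The names toy_old_flow, toy_new_flow and the TOY_* constants are hypothetical and exist only for illustration.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical outcome codes for this sketch (not kernel values). */
enum { TOY_USED_CACHED = 0, TOY_DID_LOOKUP = 1 };

/*
 * Pre-patch flow, for the case where a dentry reappeared in the cache
 * while we waited for i_mutex: revalidate once more and, if that
 * fails, give up with -ENOENT.
 */
static int toy_old_flow(bool revalidate_ok)
{
	return revalidate_ok ? TOY_USED_CACHED : -ENOENT;
}

/*
 * Patched flow: a failed revalidation is not an error; the code
 * falls through to a fresh ->lookup() instead.
 */
static int toy_new_flow(bool revalidate_ok)
{
	if (revalidate_ok)
		return TOY_USED_CACHED;
	return TOY_DID_LOOKUP;
}
```

With revalidate_ok == false, toy_old_flow() returns -ENOENT while toy_new_flow() reports a fresh lookup, which is the difference the patch is after.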


2009-03-10 19:23:13

by Miklos Szeredi

Subject: Re: [PATCH] vfs: make real_lookup do dentry revalidation with i_mutex held

The patch is wrong in case ->d_revalidate is NULL.

Something like this should fix it up:

Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c 2009-03-10 20:03:58.000000000 +0100
+++ linux-2.6/fs/namei.c 2009-03-10 20:19:29.000000000 +0100
@@ -501,6 +501,8 @@ static struct dentry * real_lookup(struc
* The dentry was left behind invalid. Just
* do the lookup.
*/
+ } else {
+ goto out_unlock;
}
}

Otherwise looks OK.

Thanks,
Miklos
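
A rough userspace sketch of the case pointed out above, not kernel code: with the first version of the patch, a cached dentry that has no d_revalidate method falls through to the lookup path even though nothing said it was invalid, while the added else branch returns it directly. The toy_dentry struct, the toy_outcome enum and both function names are hypothetical, introduced only to model the control flow.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a dentry; NOT the kernel's struct dentry. */
struct toy_dentry {
	bool has_revalidate;	/* models d_op && d_op->d_revalidate */
	bool still_valid;	/* what revalidation would report */
};

enum toy_outcome { USED_CACHED, DID_LOOKUP };

/* Control flow of the first posted patch. */
static enum toy_outcome lookup_v1(const struct toy_dentry *cached)
{
	if (cached) {
		if (cached->has_revalidate) {
			if (cached->still_valid)
				return USED_CACHED;
			/* invalid: fall through to the lookup */
		}
		/* no d_revalidate: also falls through (the bug) */
	}
	return DID_LOOKUP;
}

/* Control flow with the suggested else branch added. */
static enum toy_outcome lookup_v2(const struct toy_dentry *cached)
{
	if (cached) {
		if (cached->has_revalidate) {
			if (cached->still_valid)
				return USED_CACHED;
			/* invalid: fall through to the lookup */
		} else {
			return USED_CACHED;
		}
	}
	return DID_LOOKUP;
}
```

In the real function the fall-through would also overwrite result without dropping the reference taken by d_lookup(), which this toy model does not try to represent.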


On Mon, 9 Mar 2009, Sage Weil wrote:
> real_lookup() is called by do_lookup() if dentry revalidation fails. If
> the cache is re-populated while waiting for i_mutex, it may find that
> a d_lookup() subsequently succeeds (see the "Uhhuh! Nasty case" comment).
>
> Previously, real_lookup() would drop i_mutex and do_revalidate() again. If
> revalidate failed _again_, however, it would give up with -ENOENT. The
> problem here is that network file systems may be invalidating dentries via
> server callbacks, e.g. due to concurrent access from another client, and
> -ENOENT is frequently the wrong answer.
>
> This problem has been seen with both Lustre and Ceph. It seems possible
> to hit this case with NFS as well if the cache lifetime is very short.
>
> Instead, we should do_revalidate() while i_mutex is still held. If
> revalidation fails, we can move on to a ->lookup() and ensure a correct
> result without worrying about any subsequent races.
>
> Note that do_revalidate() is called with i_mutex held elsewhere. For
> example, do_filp_open(), lookup_create(), do_unlinkat(), do_rmdir(),
> and possibly others all take the directory i_mutex, and then
>
> -> lookup_hash
> -> __lookup_hash
> -> cached_lookup
> -> do_revalidate
>
> so this does not introduce any new locking rules for d_revalidate
> implementations.
>
> CC: Al Viro <[email protected]>
> CC: Andreas Dilger <[email protected]>
> Signed-off-by: Yehuda Sadeh <[email protected]>
> Signed-off-by: Sage Weil <[email protected]>
> ---
> fs/namei.c | 56 +++++++++++++++++++++++++++++---------------------------
> 1 files changed, 29 insertions(+), 27 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index c30e33d..49f58d1 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -469,6 +469,7 @@ static struct dentry * real_lookup(struct dentry * parent, struct qstr * name, s
> {
> struct dentry * result;
> struct inode *dir = parent->d_inode;
> + struct dentry *dentry;
>
> mutex_lock(&dir->i_mutex);
> /*
> @@ -486,38 +487,39 @@ static struct dentry * real_lookup(struct dentry * parent, struct qstr * name, s
> * so doing d_lookup() (with seqlock), instead of lockfree __d_lookup
> */
> result = d_lookup(parent, name);
> - if (!result) {
> - struct dentry *dentry;
> -
> - /* Don't create child dentry for a dead directory. */
> - result = ERR_PTR(-ENOENT);
> - if (IS_DEADDIR(dir))
> - goto out_unlock;
> -
> - dentry = d_alloc(parent, name);
> - result = ERR_PTR(-ENOMEM);
> - if (dentry) {
> - result = dir->i_op->lookup(dir, dentry, nd);
> + if (result) {
> + /*
> + * The cache was re-populated while we waited on the
> + * mutex. We need to revalidate, this time while
> + * holding i_mutex (to avoid another race).
> + */
> + if (result->d_op && result->d_op->d_revalidate) {
> + result = do_revalidate(result, nd);
> if (result)
> - dput(dentry);
> - else
> - result = dentry;
> + goto out_unlock;
> + /*
> + * The dentry was left behind invalid. Just
> + * do the lookup.
> + */
> }
> -out_unlock:
> - mutex_unlock(&dir->i_mutex);
> - return result;
> }
>
> - /*
> - * Uhhuh! Nasty case: the cache was re-populated while
> - * we waited on the semaphore. Need to revalidate.
> - */
> - mutex_unlock(&dir->i_mutex);
> - if (result->d_op && result->d_op->d_revalidate) {
> - result = do_revalidate(result, nd);
> - if (!result)
> - result = ERR_PTR(-ENOENT);
> + /* Don't create child dentry for a dead directory. */
> + result = ERR_PTR(-ENOENT);
> + if (IS_DEADDIR(dir))
> + goto out_unlock;
> +
> + dentry = d_alloc(parent, name);
> + result = ERR_PTR(-ENOMEM);
> + if (dentry) {
> + result = dir->i_op->lookup(dir, dentry, nd);
> + if (result) {
> + dput(dentry);
> + } else
> + result = dentry;
> }
> +out_unlock:
> + mutex_unlock(&dir->i_mutex);
> return result;
> }
>
> --
> 1.5.6.5

2009-03-10 19:31:32

by Sage Weil

Subject: Re: [PATCH] vfs: make real_lookup do dentry revalidation with i_mutex held

On Tue, 10 Mar 2009, Miklos Szeredi wrote:

> The patch is wrong in case ->d_revalidate is NULL.
>
> Something like this should fix it up:
>
> Index: linux-2.6/fs/namei.c
> ===================================================================
> --- linux-2.6.orig/fs/namei.c 2009-03-10 20:03:58.000000000 +0100
> +++ linux-2.6/fs/namei.c 2009-03-10 20:19:29.000000000 +0100
> @@ -501,6 +501,8 @@ static struct dentry * real_lookup(struc
> * The dentry was left behind invalid. Just
> * do the lookup.
> */
> + } else {
> + goto out_unlock;
> }
> }
>
> Otherwise looks OK.

Good catch. Here is an updated patch (fixing the checkpatch error as
well).

Thanks!
sage

---

From d33ad281f3e6a3bb172a39a55824ce69187903be Mon Sep 17 00:00:00 2001
From: Sage Weil <[email protected]>
Date: Tue, 10 Mar 2009 12:26:37 -0700
Subject: [PATCH] vfs: make real_lookup do dentry revalidation with i_mutex held

real_lookup() is called by do_lookup() if dentry revalidation fails. If
the cache is re-populated while waiting for i_mutex, it may find that
a d_lookup() subsequently succeeds (see the "Uhhuh! Nasty case" comment).

Previously, real_lookup() would drop i_mutex and do_revalidate() again. If
revalidate failed _again_, however, it would give up with -ENOENT. The
problem here is that network file systems may be invalidating dentries via
server callbacks, e.g. due to concurrent access from another client, and
-ENOENT is frequently the wrong answer.

This problem has been seen with both Lustre and Ceph. It seems possible
to hit this case with NFS as well if the cache lifetime is very short.

Instead, we should do_revalidate() while i_mutex is still held. If
revalidation fails, we can move on to a ->lookup() and ensure a correct
result without worrying about any subsequent races.

Note that do_revalidate() is called with i_mutex held elsewhere. For
example, do_filp_open(), lookup_create(), do_unlinkat(), do_rmdir(),
and possibly others all take the directory i_mutex, and then

-> lookup_hash
-> __lookup_hash
-> cached_lookup
-> do_revalidate

so this does not introduce any new locking rules for d_revalidate
implementations.

Signed-off-by: Yehuda Sadeh <[email protected]>
Signed-off-by: Sage Weil <[email protected]>
---
fs/namei.c | 58 +++++++++++++++++++++++++++++++---------------------------
1 files changed, 31 insertions(+), 27 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index c30e33d..64cf927 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -469,6 +469,7 @@ static struct dentry * real_lookup(struct dentry * parent, struct qstr * name, s
{
struct dentry * result;
struct inode *dir = parent->d_inode;
+ struct dentry *dentry;

mutex_lock(&dir->i_mutex);
/*
@@ -486,38 +487,41 @@ static struct dentry * real_lookup(struct dentry * parent, struct qstr * name, s
* so doing d_lookup() (with seqlock), instead of lockfree __d_lookup
*/
result = d_lookup(parent, name);
- if (!result) {
- struct dentry *dentry;
-
- /* Don't create child dentry for a dead directory. */
- result = ERR_PTR(-ENOENT);
- if (IS_DEADDIR(dir))
- goto out_unlock;
-
- dentry = d_alloc(parent, name);
- result = ERR_PTR(-ENOMEM);
- if (dentry) {
- result = dir->i_op->lookup(dir, dentry, nd);
+ if (result) {
+ /*
+ * The cache was re-populated while we waited on the
+ * mutex. We need to revalidate, this time while
+ * holding i_mutex (to avoid another race).
+ */
+ if (result->d_op && result->d_op->d_revalidate) {
+ result = do_revalidate(result, nd);
if (result)
- dput(dentry);
- else
- result = dentry;
+ goto out_unlock;
+ /*
+ * The dentry was left behind invalid. Just
+ * do the lookup.
+ */
+ } else {
+ goto out_unlock;
}
-out_unlock:
- mutex_unlock(&dir->i_mutex);
- return result;
}

- /*
- * Uhhuh! Nasty case: the cache was re-populated while
- * we waited on the semaphore. Need to revalidate.
- */
- mutex_unlock(&dir->i_mutex);
- if (result->d_op && result->d_op->d_revalidate) {
- result = do_revalidate(result, nd);
- if (!result)
- result = ERR_PTR(-ENOENT);
+ /* Don't create child dentry for a dead directory. */
+ result = ERR_PTR(-ENOENT);
+ if (IS_DEADDIR(dir))
+ goto out_unlock;
+
+ dentry = d_alloc(parent, name);
+ result = ERR_PTR(-ENOMEM);
+ if (dentry) {
+ result = dir->i_op->lookup(dir, dentry, nd);
+ if (result)
+ dput(dentry);
+ else
+ result = dentry;
}
+out_unlock:
+ mutex_unlock(&dir->i_mutex);
return result;
}

--
1.5.6.5

2009-03-17 08:17:51

by Christoph Hellwig

Subject: Re: [PATCH] vfs: make real_lookup do dentry revalidation with i_mutex held

Keeping i_mutex held over do_revalidate seems fine at first glance, but
can you please do it without rearranging the whole code?

Something like the tiny untested patch below should achieve the same
thing:


Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c 2009-03-17 09:15:53.430978739 +0100
+++ linux-2.6/fs/namei.c 2009-03-17 09:16:19.553981306 +0100
@@ -512,12 +512,12 @@ out_unlock:
* Uhhuh! Nasty case: the cache was re-populated while
* we waited on the semaphore. Need to revalidate.
*/
- mutex_unlock(&dir->i_mutex);
if (result->d_op && result->d_op->d_revalidate) {
result = do_revalidate(result, nd);
if (!result)
result = ERR_PTR(-ENOENT);
}
+ mutex_unlock(&dir->i_mutex);
return result;
}

2009-03-17 17:03:48

by Sage Weil

Subject: Re: [PATCH] vfs: make real_lookup do dentry revalidation with i_mutex held

On Tue, 17 Mar 2009, Christoph Hellwig wrote:
> Keeping i_mutex held over do_revalidate seems fine at first glance, but
> can you please do it without rearranging the whole code?

Yeah, but not without an extra goto. Holding i_mutex over revalidate is
only half of it... we also want to go ahead with the ->lookup if the
revalidate fails (instead of returning -ENOENT). I can make the patch easier
to read (with a goto), but I assumed we'd want the resulting code to be
clearer?

FWIW, here's the patched result:

	result = d_lookup(parent, name);
	if (result) {
		/*
		 * The cache was re-populated while we waited on the
		 * mutex. We need to revalidate, this time while
		 * holding i_mutex (to avoid another race).
		 */
		if (result->d_op && result->d_op->d_revalidate) {
			result = do_revalidate(result, nd);
			if (result)
				goto out_unlock;
			/*
			 * The dentry was left behind invalid. Just
			 * do the lookup.
			 */
		} else {
			goto out_unlock;
		}
	}

	/* Don't create child dentry for a dead directory. */
	result = ERR_PTR(-ENOENT);
	if (IS_DEADDIR(dir))
		goto out_unlock;

	dentry = d_alloc(parent, name);
	result = ERR_PTR(-ENOMEM);
	if (dentry) {
		result = dir->i_op->lookup(dir, dentry, nd);
		if (result)
			dput(dentry);
		else
			result = dentry;
	}
out_unlock:

Let me know!
sage


> Something like the tiny untested patch below should achieve the same
> thing:
>
>
> Index: linux-2.6/fs/namei.c
> ===================================================================
> --- linux-2.6.orig/fs/namei.c 2009-03-17 09:15:53.430978739 +0100
> +++ linux-2.6/fs/namei.c 2009-03-17 09:16:19.553981306 +0100
> @@ -512,12 +512,12 @@ out_unlock:
> * Uhhuh! Nasty case: the cache was re-populated while
> * we waited on the semaphore. Need to revalidate.
> */
> - mutex_unlock(&dir->i_mutex);
> if (result->d_op && result->d_op->d_revalidate) {
> result = do_revalidate(result, nd);
> if (!result)
> result = ERR_PTR(-ENOENT);
> }
> + mutex_unlock(&dir->i_mutex);
> return result;
> }
>
>
>

2009-03-19 19:33:15

by Christoph Hellwig

Subject: Re: [PATCH] vfs: make real_lookup do dentry revalidation with i_mutex held

On Tue, Mar 17, 2009 at 10:03:35AM -0700, Sage Weil wrote:
> On Tue, 17 Mar 2009, Christoph Hellwig wrote:
> > Keeping i_mutex held over do_revalidate seems fine at first glance, but
> > can you please do it without rearranging the whole code?
>
> Yeah, but not without an extra goto. Holding i_mutex over revalidate is
> only half of it... we also want to go ahead with the ->lookup if the
> revalidate fails (instead of returning -ENOENT). I can make the patch easier
> to read (with a goto), but I assumed we'd want the resulting code to be
> clearer?

Well, if you want to reorganize real_lookup, make that a separate patch.
It might actually be worthwhile to do so and clean up the other issues
in there too (the overly long line in the prototype, spaces after the
pointer *). Then have a small patch on top to implement holding the mutex
and going ahead with the lookup.