2009-03-19 14:54:21

by J. R. Okajima

[permalink] [raw]
Subject: Q: NFSD readdir in linux-2.6.28


Hello David and Al,
I have a question about NFSD readdir.

By the commit 14f7dd632011bb89c035722edd6ea0d90ca6b078
"[PATCH] Copy XFS readdir hack into nfsd code", nfsd_buffered_filldir()
was introduced and nfs3svc_encode_entry_plus() (the 'func' parameter) is
not called from vfs_readdir().

In 2.6.27, when nfs3svc_encode_entry_plus() calls lookup_one_len(), the
i_mutex lock was acquired by vfs_readdir() and it was not a problem.

After the commit (above), nfsd_readdir/nfsd_buffered_readdir/vfs_readdir
calls nfsd_buffered_filldir(), and nfs3svc_encode_entry_plus() is called
later.
In this sequence, lookup_one_len() is called without i_mutex held.

Isn't it a problem?


J. R. Okajima


2009-03-19 15:17:36

by David Woodhouse

[permalink] [raw]
Subject: Re: Q: NFSD readdir in linux-2.6.28

On Thu, 2009-03-19 at 14:54 +0000, [email protected] wrote:
>
> Hello David and Al,
> I have a question about NFSD readdir.
>
> By the commit 14f7dd632011bb89c035722edd6ea0d90ca6b078
> "[PATCH] Copy XFS readdir hack into nfsd code", nfsd_buffered_filldir()
> was introduced and nfs3svc_encode_entry_plus() (the 'func' parameter) is
> not called from vfs_readdir().
>
> In 2.6.27, when nfs3svc_encode_entry_plus() calls lookup_one_len(), the
> i_mutex lock was acquired by vfs_readdir() and it was not a problem.
>
> After the commit (above), nfsd_readdir/nfsd_buffered_readdir/vfs_readdir
> calls nfsd_buffered_filldir(), and nfs3svc_encode_entry_plus() is called
> later.
> In this sequence, lookup_one_len() is called without i_mutex held.
>
> Isn't it a problem?

Yes, well spotted. It didn't matter when the buffered readdir() was
purely internal to XFS, because it didn't matter there that we called
->lookup() without i_mutex set. But now we're exposing arbitrary file
systems to it, we need to make sure we follow the locking rules.

I _think_ it's sufficient to make the affected callers of
lookup_one_len() lock the parent's i_mutex for themselves before calling
it. I'll take a closer look...

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation

2009-03-19 15:35:52

by J. R. Okajima

[permalink] [raw]
Subject: Re: Q: NFSD readdir in linux-2.6.28


David Woodhouse:
> Yes, well spotted. It didn't matter when the buffered readdir() was
> purely internal to XFS, because it didn't matter there that we called
> ->lookup() without i_mutex set. But now we're exposing arbitrary file
> systems to it, we need to make sure we follow the locking rules.
>
> I _think_ it's sufficient to make the affected callers of
> lookup_one_len() lock the parent's i_mutex for themselves before calling
> it. I'll take a closer look...

If you remember why you discarded the FS_NO_LOOKUP_IN_READDIR flag
approach, please let me know. URL or something is enough.

Thanx
J. R. Okajima

2009-03-19 15:37:26

by Matthew Wilcox

[permalink] [raw]
Subject: Re: Q: NFSD readdir in linux-2.6.28

On Thu, Mar 19, 2009 at 03:17:17PM +0000, David Woodhouse wrote:
> > In 2.6.27, when nfs3svc_encode_entry_plus() calls lookup_one_len(), the
> > i_mutex lock was acquired by vfs_readdir() and it was not a problem.
> >
> > After the commit (above), nfsd_readdir/nfsd_buffered_readdir/vfs_readdir
> > calls nfsd_buffered_filldir(), and nfs3svc_encode_entry_plus() is called
> > later.
> > In this sequence, lookup_one_len() is called without i_mutex held.
> >
> > Isn't it a problem?
>
> Yes, well spotted. It didn't matter when the buffered readdir() was
> purely internal to XFS, because it didn't matter there that we called
> ->lookup() without i_mutex set. But now we're exposing arbitrary file
> systems to it, we need to make sure we follow the locking rules.
>
> I _think_ it's sufficient to make the affected callers of
> lookup_one_len() lock the parent's i_mutex for themselves before calling
> it. I'll take a closer look...

Should we also add this?

---

Ensure inode is locked in lookup_one_len()

Signed-off-by: Matthew Wilcox <[email protected]>

diff --git a/fs/namei.c b/fs/namei.c
index bbc15c2..476b1d0 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1244,6 +1244,7 @@ struct dentry *lookup_one_len(const char *name, struct dentry *base, int len)
int err;
struct qstr this;

+ BUG_ON(!mutex_is_locked(&base->d_inode->i_mutex));
err = __lookup_one_len(name, &this, base, len);
if (err)
return ERR_PTR(err);

--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2009-03-19 15:46:07

by J. R. Okajima

[permalink] [raw]
Subject: Re: Q: NFSD readdir in linux-2.6.28


Matthew Wilcox:
> > I _think_ it's sufficient to make the affected callers of
> > lookup_one_len() lock the parent's i_mutex for themselves before calling
> > it. I'll take a closer look...
>
> Should we also add this?
:::
> @@ -1244,6 +1244,7 @@ struct dentry *lookup_one_len(const char *name, struct dentry *base, int len)
> int err;
> struct qstr this;
>
> + BUG_ON(!mutex_is_locked(&base->d_inode->i_mutex));
> err = __lookup_one_len(name, &this, base, len);

That was how I found this problem actually. :-)
I implemented very similar debug code in aufs which is compiled when
CONFIG_AUFS_DEBUG is enabled.

(in aufs header files)
#ifdef CONFIG_AUFS_DEBUG
#define AuDebugOn(a) BUG_ON(a)
:::
#endif

/* to debug easier, do not make them inlined functions */
#define MtxMustLock(mtx) AuDebugOn(!mutex_is_locked(mtx))
#define IMustLock(i) MtxMustLock(&(i)->i_mutex)

(in ->lookup of aufs)
static struct dentry *aufs_lookup(struct inode *dir, struct dentry *dentry,
struct nameidata *nd)
{
:::
IMustLock(dir);


J. R. Okajima

2009-03-19 15:52:18

by David Woodhouse

[permalink] [raw]
Subject: Re: Q: NFSD readdir in linux-2.6.28

On Fri, 2009-03-20 at 00:34 +0900, [email protected] wrote:
> If you remember why you discarded the FS_NO_LOOKUP_IN_READDIR flag
> approach, please let me know. URL or something is enough.

I was just lamenting that decision. I think someone persuaded me that
the extra complexity wasn't worth it, and that we should just do it
unconditionally.

One option would be to restore the FS_NO_LOOKUP_IN_READDIR flag, and
document that it means that your ->lookup is called _without_ i_mutex
held. That would actually be OK in all cases that need the flag, I
believe (since the whole point in those cases is that they have their
_own_ locking which is why we did it in the first place).

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation

2009-03-19 16:02:36

by David Woodhouse

[permalink] [raw]
Subject: Re: Q: NFSD readdir in linux-2.6.28

On Thu, 2009-03-19 at 09:37 -0600, Matthew Wilcox wrote:
> Should we also add this?

Maybe. It was actually perfectly OK when XFS was doing that to _itself_
though, and that BUG_ON() would have triggered. It's only now that we're
doing it unconditionally that it's really a problem.

--
dwmw2

2009-04-16 21:46:17

by J. Bruce Fields

[permalink] [raw]
Subject: Re: Q: NFSD readdir in linux-2.6.28

On Thu, Mar 19, 2009 at 03:17:17PM +0000, David Woodhouse wrote:
> On Thu, 2009-03-19 at 14:54 +0000, [email protected] wrote:
> >
> > Hello David and Al,
> > I have a question about NFSD readdir.
> >
> > By the commit 14f7dd632011bb89c035722edd6ea0d90ca6b078
> > "[PATCH] Copy XFS readdir hack into nfsd code", nfsd_buffered_filldir()
> > was introduced and nfs3svc_encode_entry_plus() (the 'func' parameter) is
> > not called from vfs_readdir().
> >
> > In 2.6.27, when nfs3svc_encode_entry_plus() calls lookup_one_len(), the
> > i_mutex lock was acquired by vfs_readdir() and it was not a problem.
> >
> > After the commit (above), nfsd_readdir/nfsd_buffered_readdir/vfs_readdir
> > calls nfsd_buffered_filldir(), and nfs3svc_encode_entry_plus() is called
> > later.
> > In this sequence, lookup_one_len() is called without i_mutex held.
> >
> > Isn't it a problem?
>
> Yes, well spotted. It didn't matter when the buffered readdir() was
> purely internal to XFS, because it didn't matter there that we called
> ->lookup() without i_mutex set. But now we're exposing arbitrary file
> systems to it, we need to make sure we follow the locking rules.
>
> I _think_ it's sufficient to make the affected callers of
> lookup_one_len() lock the parent's i_mutex for themselves before calling
> it. I'll take a closer look...

Yipes--is this problem still here?

--b.

2009-04-17 04:13:45

by J. R. Okajima

[permalink] [raw]
Subject: Re: Q: NFSD readdir in linux-2.6.28


"J. Bruce Fields":
> > > After the commit (above), nfsd_readdir/nfsd_buffered_readdir/vfs_readdir
> > > calls nfsd_buffered_filldir(), and nfs3svc_encode_entry_plus() is called
> > > later.
> > > In this sequence, lookup_one_len() is called without i_mutex held.
:::
> Yipes--is this problem still here?

Yes.


J. R. Okajima

2009-04-17 09:32:38

by David Woodhouse

[permalink] [raw]
Subject: Re: Q: NFSD readdir in linux-2.6.28

On Fri, 2009-03-20 at 00:34 +0900, [email protected] wrote:
> David Woodhouse:
> > Yes, well spotted. It didn't matter when the buffered readdir() was
> > purely internal to XFS, because it didn't matter there that we called
> > ->lookup() without i_mutex set. But now we're exposing arbitrary file
> > systems to it, we need to make sure we follow the locking rules.
> >
> > I _think_ it's sufficient to make the affected callers of
> > lookup_one_len() lock the parent's i_mutex for themselves before calling
> > it. I'll take a closer look...
>
> If you remember why you discarded the FS_NO_LOOKUP_IN_READDIR flag
> approach, please let me know. URL or something is enough.

I think someone talked me into doing it in the interest of simplicity,
so we confine the necessary hack into the NFS code alone and _always_
just use the "safe" version. I can't remember who it was, but I'm
guessing Al or Christoph -- both of whom are CC'd in case they want to
object:

Given the fun with i_mutex I think the best answer is to bring it back.
I'm about to test this patch which restores it (and sets it for btrfs,
jffs2 and xfs)...

diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
index deeeed0..06f539c 100644
--- a/Documentation/filesystems/vfs.txt
+++ b/Documentation/filesystems/vfs.txt
@@ -357,7 +357,10 @@ otherwise noted.
If you wish to overload the dentry methods then you should
initialise the "d_dop" field in the dentry; this is a pointer
to a struct "dentry_operations".
- This method is called with the directory inode semaphore held
+ This method is called with the directory inode mutex held
+ unless the file system sets the FS_NO_LOOKUP_IN_READDIR flag
+ its file system flags, in which case lookup() calls from NFS
+ readdirplus() requests will be made without directory mutex.

link: called by the link(2) system call. Only required if you want
to support hard links. You will probably need to call
diff --git a/fs/btrfs/super.c b/fs/btrfs/super.c
index 9744af9..f0e9d18 100644
--- a/fs/btrfs/super.c
+++ b/fs/btrfs/super.c
@@ -619,7 +619,7 @@ static struct file_system_type btrfs_fs_type = {
.name = "btrfs",
.get_sb = btrfs_get_sb,
.kill_sb = kill_anon_super,
- .fs_flags = FS_REQUIRES_DEV,
+ .fs_flags = FS_REQUIRES_DEV | FS_NO_LOOKUP_IN_READDIR,
};

/*
diff --git a/fs/jffs2/super.c b/fs/jffs2/super.c
index 4c4e18c..0e97a35 100644
--- a/fs/jffs2/super.c
+++ b/fs/jffs2/super.c
@@ -209,6 +209,7 @@ static struct file_system_type jffs2_fs_type = {
.name = "jffs2",
.get_sb = jffs2_get_sb,
.kill_sb = jffs2_kill_sb,
+ .flags = FS_NO_LOOKUP_IN_READDIR,
};

static int __init init_jffs2_fs(void)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index ab93fcf..230e30e 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1841,6 +1841,30 @@ out:
return err;
}

+
+static int nfsd_real_readdir(struct file *file, filldir_t func,
+ struct readdir_cd *cdp, loff_t *offsetp)
+{
+ int host_err;
+
+ /*
+ * Read the directory entries. This silly loop is necessary because
+ * readdir() is not guaranteed to fill up the entire buffer, but
+ * may choose to do less.
+ */
+ do {
+ cdp->err = nfserr_eof; /* will be cleared on successful read */
+ host_err = vfs_readdir(file, func, cdp);
+ } while (host_err >=0 && cdp->err == nfs_ok);
+
+ *offsetp = vfs_llseek(file, 0, 1);
+
+ if (host_err)
+ return nfserrno(host_err);
+ else
+ return cdp->err;
+}
+
/*
* We do this buffering because we must not call back into the file
* system's ->lookup() method from the filldir callback. That may well
@@ -1970,7 +1994,11 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp,
goto out_close;
}

- err = nfsd_buffered_readdir(file, func, cdp, offsetp);
+ if ((file->f_path.dentry->d_inode->i_sb->s_type->fs_flags &
+ FS_NO_LOOKUP_IN_READDIR))
+ err = nfsd_buffered_readdir(file, func, cdp, offsetp);
+ else
+ err = nfsd_real_readdir(file, func, cdp, offsetp);

if (err == nfserr_eof || err == nfserr_toosmall)
err = nfs_ok; /* can still be found in ->err */
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index bb68526..7bba46a 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1546,7 +1546,7 @@ static struct file_system_type xfs_fs_type = {
.name = "xfs",
.get_sb = xfs_fs_get_sb,
.kill_sb = kill_block_super,
- .fs_flags = FS_REQUIRES_DEV,
+ .fs_flags = FS_REQUIRES_DEV | FS_NO_LOOKUP_IN_READDIR,
};

STATIC int __init
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e766be0..caf2a94 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -176,6 +176,11 @@ struct inodes_stat_t {
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move()
* during rename() internally.
*/
+#define FS_NO_LOOKUP_IN_READDIR 65536 /* FS has its own locking and will
+ deadlock if you call its lookup() method from within a filldir
+ function, as NFSd wants to. A file system setting this flag must
+ be happy for its ->lookup() method to be called without i_mutex,
+ which is what the NFSd workaround code will do. */

/*
* These are the fs-independent mount-flags: up to 32 flags are supported


--
dwmw2

2009-04-17 19:32:51

by Al Viro

[permalink] [raw]
Subject: Re: Q: NFSD readdir in linux-2.6.28

On Fri, Apr 17, 2009 at 10:32:19AM +0100, David Woodhouse wrote:
> On Fri, 2009-03-20 at 00:34 +0900, [email protected] wrote:
> > David Woodhouse:
> > > Yes, well spotted. It didn't matter when the buffered readdir() was
> > > purely internal to XFS, because it didn't matter there that we called
> > > ->lookup() without i_mutex set. But now we're exposing arbitrary file
> > > systems to it, we need to make sure we follow the locking rules.
> > >
> > > I _think_ it's sufficient to make the affected callers of
> > > lookup_one_len() lock the parent's i_mutex for themselves before calling
> > > it. I'll take a closer look...
> >
> > If you remember why you discarded the FS_NO_LOOKUP_IN_READDIR flag
> > approach, please let me know. URL or something is enough.
>
> I think someone talked me into doing it in the interest of simplicity,
> so we confine the necessary hack into the NFS code alone and _always_
> just use the "safe" version. I can't remember who it was, but I'm
> guessing Al or Christoph -- both of whom are CC'd in case they want to
> object:

Ow... Sorry, I've missed that one. I really don't like flags-based
solution ;-/ It's not just filesystem method that want i_mutex there -
we have fs/namei.c code suddenly called in different locking conditions
now.

What were the details of xfs and jffs2 locking problems, just in case
if something can be done in that direction short-term?

2009-04-17 22:17:37

by David Woodhouse

[permalink] [raw]
Subject: Re: Q: NFSD readdir in linux-2.6.28

On Fri, 2009-04-17 at 20:32 +0100, Al Viro wrote:
> Ow... Sorry, I've missed that one. I really don't like flags-based
> solution ;-/ It's not just filesystem method that want i_mutex there -
> we have fs/namei.c code suddenly called in different locking conditions
> now.

Well, we have that already.

When nfsd readdirplus code ends up calling back into lookup_one_len(),
it does so without i_mutex held. XFS had been doing it that way for
years, so that detail escaped our notice when we shifted the XFS 'hack'
into nfsd code.

> What were the details of xfs and jffs2 locking problems, just in case
> if something can be done in that direction short-term?

JFFS2 has its own internal mutex protecting the directory. It needs an
internal mutex instead of i_mutex because of ordering issues with
garbage collection. We are _in_ jffs2_readdir(), with the lock held,
when we call back into nfsd's filldir function... which calls right back
into jffs2_lookup() and deadlocks when it tries to take the same lock
again.

The situation in btrfs and xfs is broadly similar. They have their own
locks, and we end up deadlocking. GFS2 has some lovely "is this process
the owner of the lock" magic to avoid a similar deadlock.

It sounds like the better answer is to just make sure i_mutex is held
when nfsd_buffered_readdir() calls back into the provided filldir
function (we could do it in the various filldir functions themselves,
_if_ they call lookup_one_len(), but I think I prefer it this way --
it's simpler). Patch below for comment.

(While I'm staring at it, it looks like nfsd_buffered_readdir() should
be returning a __be32 not an int, and its 'return -ENOMEM' should be
'return nfserrno(-ENOMEM)'. The first bug I inherited from the existing
nfsd_do_readdir() when I replaced it, but the second is all my own. I'll
send a patch to fix those shortly.)

diff --git a/fs/namei.c b/fs/namei.c
index b8433eb..78f253c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1248,6 +1248,8 @@ struct dentry *lookup_one_len(const char *name, struct dentry *base, int len)
int err;
struct qstr this;

+ WARN_ON_ONCE(!mutex_is_locked(&base->d_inode->i_mutex));
+
err = __lookup_one_len(name, &this, base, len);
if (err)
return ERR_PTR(err);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index ab93fcf..fb2965d 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1920,13 +1920,27 @@ static int nfsd_buffered_readdir(struct file *file, filldir_t func,
break;

de = (struct buffered_dirent *)buf.dirent;
+
while (size > 0) {
+ struct inode *dir_inode = file->f_path.dentry->d_inode;
+ int finished;
+
offset = de->offset;

- if (func(cdp, de->name, de->namlen, de->offset,
- de->ino, de->d_type))
+ /*
+ * Various filldir functions may end up calling back
+ * into lookup_one_len() and the file system's
+ * ->lookup() method. These expect i_mutex to be held.
+ */
+ host_err = mutex_lock_killable(&dir_inode->i_mutex);
+ if (host_err)
goto done;

+ finished = func(cdp, de->name, de->namlen, de->offset,
+ de->ino, de->d_type);
+
+ mutex_unlock(&dir_inode->i_mutex);
+
if (cdp->err != nfs_ok)
goto done;




--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation

2009-04-17 22:44:15

by Al Viro

[permalink] [raw]
Subject: Re: Q: NFSD readdir in linux-2.6.28

On Fri, Apr 17, 2009 at 11:17:00PM +0100, David Woodhouse wrote:
> It sounds like the better answer is to just make sure i_mutex is held
> when nfsd_buffered_readdir() calls back into the provided filldir
> function (we could do it in the various filldir functions themselves,
> _if_ they call lookup_one_len(), but I think I prefer it this way --
> it's simpler). Patch below for comment.

Umm... I can live with that, assuming that we don't have callbacks
that take i_mutex themselves. AFAICS, everything we call there is
either obviously not touching i_mutex or is already called while we
hold i_mutex elsewhere, but I'd appreciate if somebody actually
tested that sucker for different versions of protocol...

> (While I'm staring at it, it looks like nfsd_buffered_readdir() should
> be returning a __be32 not an int, and its 'return -ENOMEM' should be
> 'return nfserrno(-ENOMEM)'. The first bug I inherited from the existing
> nfsd_do_readdir() when I replaced it, but the second is all my own. I'll
> send a patch to fix those shortly.)

Fold it into this one, please.

2009-04-17 22:52:15

by David Woodhouse

[permalink] [raw]
Subject: Re: Q: NFSD readdir in linux-2.6.28

On Fri, 2009-04-17 at 23:43 +0100, Al Viro wrote:
> On Fri, Apr 17, 2009 at 11:17:00PM +0100, David Woodhouse wrote:
> > It sounds like the better answer is to just make sure i_mutex is held
> > when nfsd_buffered_readdir() calls back into the provided filldir
> > function (we could do it in the various filldir functions themselves,
> > _if_ they call lookup_one_len(), but I think I prefer it this way --
> > it's simpler). Patch below for comment.
>
> Umm... I can live with that, assuming that we don't have callbacks
> that take i_mutex themselves. AFAICS, everything we call there is
> either obviously not touching i_mutex or is already called while we
> hold i_mutex elsewhere,

More than that... until commit 14f7dd63, we were holding i_mutex when we
called back into those callbacks from nfsd_readdir() itself. We're only
reverting to a fairly recent behaviour, in that respect.

> > (While I'm staring at it, it looks like nfsd_buffered_readdir() should
> > be returning a __be32 not an int, and its 'return -ENOMEM' should be
> > 'return nfserrno(-ENOMEM)'. The first bug I inherited from the existing
> > nfsd_do_readdir() when I replaced it, but the second is all my own. I'll
> > send a patch to fix those shortly.)
>
> Fold it into this one, please.

OK.

Ah, and see also commit 05f4f678b (Bruce).

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation

2009-04-17 22:53:24

by Al Viro

[permalink] [raw]
Subject: Re: Q: NFSD readdir in linux-2.6.28

On Fri, Apr 17, 2009 at 11:43:50PM +0100, Al Viro wrote:
> On Fri, Apr 17, 2009 at 11:17:00PM +0100, David Woodhouse wrote:
> > It sounds like the better answer is to just make sure i_mutex is held
> > when nfsd_buffered_readdir() calls back into the provided filldir
> > function (we could do it in the various filldir functions themselves,
> > _if_ they call lookup_one_len(), but I think I prefer it this way --
> > it's simpler). Patch below for comment.
>
> Umm... I can live with that, assuming that we don't have callbacks
> that take i_mutex themselves. AFAICS, everything we call there is
> either obviously not touching i_mutex or is already called while we
> hold i_mutex elsewhere, but I'd appreciate if somebody actually
> tested that sucker for different versions of protocol...

BTW, why mess with taking i_mutex inside the inner loop and not
immediately around it?

2009-04-17 22:56:33

by David Woodhouse

[permalink] [raw]
Subject: Re: Q: NFSD readdir in linux-2.6.28

On Fri, 2009-04-17 at 23:53 +0100, Al Viro wrote:
> On Fri, Apr 17, 2009 at 11:43:50PM +0100, Al Viro wrote:
> > On Fri, Apr 17, 2009 at 11:17:00PM +0100, David Woodhouse wrote:
> > > It sounds like the better answer is to just make sure i_mutex is held
> > > when nfsd_buffered_readdir() calls back into the provided filldir
> > > function (we could do it in the various filldir functions themselves,
> > > _if_ they call lookup_one_len(), but I think I prefer it this way --
> > > it's simpler). Patch below for comment.
> >
> > Umm... I can live with that, assuming that we don't have callbacks
> > that take i_mutex themselves. AFAICS, everything we call there is
> > either obviously not touching i_mutex or is already called while we
> > hold i_mutex elsewhere, but I'd appreciate if somebody actually
> > tested that sucker for different versions of protocol...
>
> BTW, why mess with taking i_mutex inside the inner loop and not
> immediately around it?

Just to avoid making spaghetti of the bail-out cases, really. I did
consider it (it wouldn't be _that_ bad), but figured that it wasn't such
a bad thing if we don't hog the lock.

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation

2009-04-17 23:23:49

by David Woodhouse

[permalink] [raw]
Subject: Re: Q: NFSD readdir in linux-2.6.28

On Fri, 2009-04-17 at 23:53 +0100, Al Viro wrote:
> On Fri, Apr 17, 2009 at 11:43:50PM +0100, Al Viro wrote:
> > On Fri, Apr 17, 2009 at 11:17:00PM +0100, David Woodhouse wrote:
> > > It sounds like the better answer is to just make sure i_mutex is held
> > > when nfsd_buffered_readdir() calls back into the provided filldir
> > > function (we could do it in the various filldir functions themselves,
> > > _if_ they call lookup_one_len(), but I think I prefer it this way --
> > > it's simpler). Patch below for comment.
> >
> > Umm... I can live with that, assuming that we don't have callbacks
> > that take i_mutex themselves. AFAICS, everything we call there is
> > either obviously not touching i_mutex or is already called while we
> > hold i_mutex elsewhere, but I'd appreciate if somebody actually
> > tested that sucker for different versions of protocol...
>
> BTW, why mess with taking i_mutex inside the inner loop and not
> immediately around it?

Or, to phrase my last response slightly differently... because I don't
like either of these two versions very much...



diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index ab93fcf..0523e9f 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1885,9 +1885,10 @@ static int nfsd_buffered_filldir(void *__buf, const char *name, int namlen,
return 0;
}

-static int nfsd_buffered_readdir(struct file *file, filldir_t func,
- struct readdir_cd *cdp, loff_t *offsetp)
+static __be32 nfsd_buffered_readdir(struct file *file, filldir_t func,
+ struct readdir_cd *cdp, loff_t *offsetp)
{
+ struct inode *dir_inode = file->f_path.dentry->d_inode;
struct readdir_data buf;
struct buffered_dirent *de;
int host_err;
@@ -1896,7 +1897,7 @@ static int nfsd_buffered_readdir(struct file *file, filldir_t func,

buf.dirent = (void *)__get_free_page(GFP_KERNEL);
if (!buf.dirent)
- return -ENOMEM;
+ return nfserrno(-ENOMEM);

offset = *offsetp;

@@ -1912,23 +1913,37 @@ static int nfsd_buffered_readdir(struct file *file, filldir_t func,
host_err = 0;

if (host_err < 0)
- break;
+ goto done;

size = buf.used;

if (!size)
- break;
+ goto done;

de = (struct buffered_dirent *)buf.dirent;
+
+ host_err = mutex_lock_killable(&dir_inode->i_mutex);
+ if (host_err)
+ goto done;
+
while (size > 0) {
+
+ int finished;
+
offset = de->offset;

+ /*
+ * Various filldir functions may end up calling back
+ * into lookup_one_len() and the file system's
+ * ->lookup() method. These expect i_mutex to be held.
+ */
+
if (func(cdp, de->name, de->namlen, de->offset,
de->ino, de->d_type))
- goto done;
-
+ goto done_unlock;
+
if (cdp->err != nfs_ok)
- goto done;
+ goto done_unlock;

reclen = ALIGN(sizeof(*de) + de->namlen,
sizeof(u64));
@@ -1937,6 +1952,8 @@ static int nfsd_buffered_readdir(struct file *file, filldir_t func,
}
offset = vfs_llseek(file, 0, SEEK_CUR);
}
+ done_unlock:
+ mutex_unlock(&dir_inode->i_mutex);

done:
free_page((unsigned long)(buf.dirent));



====================================================================

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index ab93fcf..915a47c 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1885,8 +1885,8 @@ static int nfsd_buffered_filldir(void *__buf, const char *name, int namlen,
return 0;
}

-static int nfsd_buffered_readdir(struct file *file, filldir_t func,
- struct readdir_cd *cdp, loff_t *offsetp)
+static __be32 nfsd_buffered_readdir(struct file *file, filldir_t func,
+ struct readdir_cd *cdp, loff_t *offsetp)
{
struct readdir_data buf;
struct buffered_dirent *de;
@@ -1896,11 +1896,12 @@ static int nfsd_buffered_readdir(struct file *file, filldir_t func,

buf.dirent = (void *)__get_free_page(GFP_KERNEL);
if (!buf.dirent)
- return -ENOMEM;
+ return nfserrno(-ENOMEM);

offset = *offsetp;

while (1) {
+ struct inode *dir_inode = file->f_path.dentry->d_inode;
unsigned int reclen;

cdp->err = nfserr_eof; /* will be cleared on successful read */
@@ -1920,24 +1921,42 @@ static int nfsd_buffered_readdir(struct file *file, filldir_t func,
break;

de = (struct buffered_dirent *)buf.dirent;
+
+ host_err = mutex_lock_killable(&dir_inode->i_mutex);
+ if (host_err)
+ break;
+
while (size > 0) {
+ int finished;
+
offset = de->offset;

+ /*
+ * Various filldir functions may end up calling back
+ * into lookup_one_len() and the file system's
+ * ->lookup() method. These expect i_mutex to be held.
+ */
+
if (func(cdp, de->name, de->namlen, de->offset,
de->ino, de->d_type))
- goto done;
-
+ goto done_unlock;
+
if (cdp->err != nfs_ok)
- goto done;
+ goto done_unlock;

reclen = ALIGN(sizeof(*de) + de->namlen,
sizeof(u64));
size -= reclen;
de = (struct buffered_dirent *)((char *)de + reclen);
}
+ mutex_unlock(&dir_inode->i_mutex);
offset = vfs_llseek(file, 0, SEEK_CUR);
- }
+ continue;

+ done_unlock:
+ mutex_unlock(&dir_inode->i_mutex);
+ break;
+ }
done:
free_page((unsigned long)(buf.dirent));


--
dwmw2

2009-04-17 23:38:19

by Al Viro

[permalink] [raw]
Subject: Re: Q: NFSD readdir in linux-2.6.28

On Sat, Apr 18, 2009 at 12:23:32AM +0100, David Woodhouse wrote:

> Or, to phrase my last response slightly differently... because I don't
> like either of these two versions very much...

Eh? Just have
host_err = mutex_lock_killable(....);
if (host_err)
break;
de = ...
while (size > 0) {
offset = de->offset;
if (func(cdp, de->name, de->namlen, de->offset,
de->ino, de->d_type))
break;
if (cdp->err != nfs_ok)
break;
...
size -= reclen;
de = ...
}
mutex_unlock(....);
if (size > 0) /* we'd an error */
break;
offset = vfs_llseek(....);
}
free_page(....);
and to hell with all goto in there...

2009-04-17 23:40:10

by Al Viro

[permalink] [raw]
Subject: Re: Q: NFSD readdir in linux-2.6.28

On Sat, Apr 18, 2009 at 12:37:55AM +0100, Al Viro wrote:
> if (size > 0) /* we'd an error */
bailed out early
that is.

2009-04-18 00:16:20

by David Woodhouse

[permalink] [raw]
Subject: [PATCH] Fix i_mutex handling in nfsd readdir.

Commit 14f7dd63 ("Copy XFS readdir hack into nfsd code") introduced a
bug to generic code which had been extant for a long time in the XFS
version -- it started to call through into lookup_one_len() and hence
into the file systems' ->lookup() methods without i_mutex held on the
directory.

This patch fixes it by locking the directory's i_mutex again before
calling the filldir functions. The original deadlocks which commit
14f7dd63 was designed to avoid are still avoided, because they were due
to fs-internal locking, not i_mutex.

While we're at it, fix the return type of nfsd_buffered_readdir() which
should be a __be32 not an int -- it's an NFS errno, not a Linux errno.
And return nfserrno(-ENOMEM) when allocation fails, not just -ENOMEM.
Sparse would have caught both of those if it wasn't so busy bitching
about __cold__.

Reported-by: J. R. Okajima <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
---

Commit 05f4f678b introduced a similar problem for NFSv4 which I'm not
going to attempt to deal with before getting some sleep...

diff --git a/fs/namei.c b/fs/namei.c
index b8433eb..78f253c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1248,6 +1248,8 @@ struct dentry *lookup_one_len(const char *name, struct dentry *base, int len)
int err;
struct qstr this;

+ WARN_ON_ONCE(!mutex_is_locked(&base->d_inode->i_mutex));
+
err = __lookup_one_len(name, &this, base, len);
if (err)
return ERR_PTR(err);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index ab93fcf..44a68e1 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1885,8 +1885,8 @@ static int nfsd_buffered_filldir(void *__buf, const char *name, int namlen,
return 0;
}

-static int nfsd_buffered_readdir(struct file *file, filldir_t func,
- struct readdir_cd *cdp, loff_t *offsetp)
+static __be32 nfsd_buffered_readdir(struct file *file, filldir_t func,
+ struct readdir_cd *cdp, loff_t *offsetp)
{
struct readdir_data buf;
struct buffered_dirent *de;
@@ -1896,11 +1896,12 @@ static int nfsd_buffered_readdir(struct file *file, filldir_t func,

buf.dirent = (void *)__get_free_page(GFP_KERNEL);
if (!buf.dirent)
- return -ENOMEM;
+ return nfserrno(-ENOMEM);

offset = *offsetp;

while (1) {
+ struct inode *dir_inode = file->f_path.dentry->d_inode;
unsigned int reclen;

cdp->err = nfserr_eof; /* will be cleared on successful read */
@@ -1919,22 +1920,35 @@ static int nfsd_buffered_readdir(struct file *file, filldir_t func,
if (!size)
break;

+ /*
+ * Various filldir functions may end up calling back into
+ * lookup_one_len() and the file system's ->lookup() method.
+ * These expect i_mutex to be held, as it would within readdir.
+ */
+ host_err = mutex_lock_killable(&dir_inode->i_mutex);
+ if (host_err)
+ break;
+
de = (struct buffered_dirent *)buf.dirent;
while (size > 0) {
offset = de->offset;

if (func(cdp, de->name, de->namlen, de->offset,
de->ino, de->d_type))
- goto done;
+ break;

if (cdp->err != nfs_ok)
- goto done;
+ break;

reclen = ALIGN(sizeof(*de) + de->namlen,
sizeof(u64));
size -= reclen;
de = (struct buffered_dirent *)((char *)de + reclen);
}
+ mutex_unlock(&dir_inode->i_mutex);
+ if (size > 0) /* We bailed out early */
+ break;
+
offset = vfs_llseek(file, 0, SEEK_CUR);
}



--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation

2009-04-18 03:12:22

by J. R. Okajima

[permalink] [raw]
Subject: Re: [PATCH] Fix i_mutex handling in nfsd readdir.


David Woodhouse:
> This patch fixes it by locking the directory's i_mutex again before
> calling the filldir functions. The original deadlocks which commit
> 14f7dd63 was designed to avoid are still avoided, because they were due
> to fs-internal locking, not i_mutex.
:::
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1248,6 +1248,8 @@ struct dentry *lookup_one_len(const char *name, struct dentry *base, int len)
> int err;
> struct qstr this;
>
> + WARN_ON_ONCE(!mutex_is_locked(&base->d_inode->i_mutex));
> +
> err = __lookup_one_len(name, &this, base, len);

I'd suggest this checking is done only when CONFIG_DEBUG_KERNEL (or
something) is enabled. Because unconditional checking costs high for the
well-reviewed lookup code.


J. R. Okajima

2009-04-18 14:26:04

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] Fix i_mutex handling in nfsd readdir.

On Sat, Apr 18, 2009 at 12:11:54PM +0900, [email protected] wrote:
>
> David Woodhouse:
> > This patch fixes it by locking the directory's i_mutex again before
> > calling the filldir functions. The original deadlocks which commit
> > 14f7dd63 was designed to avoid are still avoided, because they were due
> > to fs-internal locking, not i_mutex.
> :::
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -1248,6 +1248,8 @@ struct dentry *lookup_one_len(const char *name, struct dentry *base, int len)
> > int err;
> > struct qstr this;
> >
> > + WARN_ON_ONCE(!mutex_is_locked(&base->d_inode->i_mutex));
> > +
> > err = __lookup_one_len(name, &this, base, len);
>
> I'd suggest this checking is done only when CONFIG_DEBUG_KERNEL (or
> something) is enabled. Because unconditional checking costs high for the
> well-reviewed lookup code.

Frankly, I'd rather have DEBUG_WARN... conditional on that, rather than
cluttering code with ifdefs...

2009-04-19 07:19:17

by David Woodhouse

[permalink] [raw]
Subject: Re: [PATCH] Fix i_mutex handling in nfsd readdir.

On Sat, 2009-04-18 at 12:11 +0900, [email protected] wrote:
> David Woodhouse:
> > This patch fixes it by locking the directory's i_mutex again before
> > calling the filldir functions. The original deadlocks which commit
> > 14f7dd63 was designed to avoid are still avoided, because they were due
> > to fs-internal locking, not i_mutex.
> :::
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -1248,6 +1248,8 @@ struct dentry *lookup_one_len(const char *name, struct dentry *base, int len)
> > int err;
> > struct qstr this;
> >
> > + WARN_ON_ONCE(!mutex_is_locked(&base->d_inode->i_mutex));
> > +
> > err = __lookup_one_len(name, &this, base, len);
>
> I'd suggest this checking is done only when CONFIG_DEBUG_KERNEL (or
> something) is enabled. Because unconditional checking costs high for the
> well-reviewed lookup code.

It's supposed to be locked. It's likely to have been locked quite
recently, so it'll be in the cache. I don't think the mutex_is_locked()
check is going to be that expensive, is it?

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation

2009-04-19 12:28:17

by David Woodhouse

[permalink] [raw]
Subject: [PATCH v2] Fix i_mutex handling in nfsd readdir

Commit 14f7dd63 ("Copy XFS readdir hack into nfsd code") introduced a
bug to generic code which had been extant for a long time in the XFS
version -- it started to call through into lookup_one_len() and hence
into the file systems' ->lookup() methods without i_mutex held on the
directory.

This patch fixes it by locking the directory's i_mutex again before
calling the filldir functions. The original deadlocks which commit
14f7dd63 was designed to avoid are still avoided, because they were due
to fs-internal locking, not i_mutex.

Commit 05f4f678 ("nfsd4: don't do lookup within readdir in recovery
code") introduced a similar problem there, which this also addresses.

While we're at it, fix the return type of nfsd_buffered_readdir() which
should be a __be32 not an int -- it's an NFS errno, not a Linux errno.
And return nfserrno(-ENOMEM) when allocation fails, not just -ENOMEM.
Sparse would have caught both of those if it wasn't so busy bitching
about __cold__.

Commit 05f4f678 ("nfsd4: don't do lookup within readdir in recovery
code") introduced a similar problem with calling lookup_one_len()
without i_mutex, which this patch also addresses.

Reported-by: J. R. Okajima <[email protected]>
Signed-off-by: David Woodhouse <[email protected]>
Umm-I-can-live-with-that-by: Al Viro <[email protected]>
---
Still haven't tested the NFSv4 bit -- Bruce?

This time I remembered to remove the no-longer-used 'done:' label from
nfsd_buffered_readdir() too.

diff --git a/fs/namei.c b/fs/namei.c
index b8433eb..78f253c 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1248,6 +1248,8 @@ struct dentry *lookup_one_len(const char *name, struct dentry *base, int len)
int err;
struct qstr this;

+ WARN_ON_ONCE(!mutex_is_locked(&base->d_inode->i_mutex));
+
err = __lookup_one_len(name, &this, base, len);
if (err)
return ERR_PTR(err);
diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index 3444c00..210709c 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -229,21 +229,23 @@ nfsd4_list_rec_dir(struct dentry *dir, recdir_func *f)
goto out;
status = vfs_readdir(filp, nfsd4_build_namelist, &names);
fput(filp);
+ mutex_lock(&dir->d_inode->i_mutex);
while (!list_empty(&names)) {
entry = list_entry(names.next, struct name_list, list);

dentry = lookup_one_len(entry->name, dir, HEXDIR_LEN-1);
if (IS_ERR(dentry)) {
status = PTR_ERR(dentry);
- goto out;
+ break;
}
status = f(dir, dentry);
dput(dentry);
if (status)
- goto out;
+ break;
list_del(&entry->list);
kfree(entry);
}
+ mutex_unlock(&dir->d_inode->i_mutex);
out:
while (!list_empty(&names)) {
entry = list_entry(names.next, struct name_list, list);
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index ab93fcf..0874299 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1885,8 +1885,8 @@ static int nfsd_buffered_filldir(void *__buf, const char *name, int namlen,
return 0;
}

-static int nfsd_buffered_readdir(struct file *file, filldir_t func,
- struct readdir_cd *cdp, loff_t *offsetp)
+static __be32 nfsd_buffered_readdir(struct file *file, filldir_t func,
+ struct readdir_cd *cdp, loff_t *offsetp)
{
struct readdir_data buf;
struct buffered_dirent *de;
@@ -1896,11 +1896,12 @@ static int nfsd_buffered_readdir(struct file *file, filldir_t func,

buf.dirent = (void *)__get_free_page(GFP_KERNEL);
if (!buf.dirent)
- return -ENOMEM;
+ return nfserrno(-ENOMEM);

offset = *offsetp;

while (1) {
+ struct inode *dir_inode = file->f_path.dentry->d_inode;
unsigned int reclen;

cdp->err = nfserr_eof; /* will be cleared on successful read */
@@ -1919,26 +1920,38 @@ static int nfsd_buffered_readdir(struct file *file, filldir_t func,
if (!size)
break;

+ /*
+ * Various filldir functions may end up calling back into
+ * lookup_one_len() and the file system's ->lookup() method.
+ * These expect i_mutex to be held, as it would within readdir.
+ */
+ host_err = mutex_lock_killable(&dir_inode->i_mutex);
+ if (host_err)
+ break;
+
de = (struct buffered_dirent *)buf.dirent;
while (size > 0) {
offset = de->offset;

if (func(cdp, de->name, de->namlen, de->offset,
de->ino, de->d_type))
- goto done;
+ break;

if (cdp->err != nfs_ok)
- goto done;
+ break;

reclen = ALIGN(sizeof(*de) + de->namlen,
sizeof(u64));
size -= reclen;
de = (struct buffered_dirent *)((char *)de + reclen);
}
+ mutex_unlock(&dir_inode->i_mutex);
+ if (size > 0) /* We bailed out early */
+ break;
+
offset = vfs_llseek(file, 0, SEEK_CUR);
}

- done:
free_page((unsigned long)(buf.dirent));

if (host_err)

--
dwmw2

2009-04-19 20:52:25

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v2] Fix i_mutex handling in nfsd readdir

On Sun, Apr 19, 2009 at 01:27:49PM +0100, David Woodhouse wrote:
> Commit 14f7dd63 ("Copy XFS readdir hack into nfsd code") introduced a
> bug to generic code which had been extant for a long time in the XFS
> version -- it started to call through into lookup_one_len() and hence
> into the file systems' ->lookup() methods without i_mutex held on the
> directory.
>
> This patch fixes it by locking the directory's i_mutex again before
> calling the filldir functions. The original deadlocks which commit
> 14f7dd63 was designed to avoid are still avoided, because they were due
> to fs-internal locking, not i_mutex.
>
> Commit 05f4f678 ("nfsd4: don't do lookup within readdir in recovery
> code") introduced a similar problem there, which this also addresses.
>
> While we're at it, fix the return type of nfsd_buffered_readdir() which
> should be a __be32 not an int -- it's an NFS errno, not a Linux errno.
> And return nfserrno(-ENOMEM) when allocation fails, not just -ENOMEM.
> Sparse would have caught both of those if it wasn't so busy bitching
> about __cold__.
>
> Commit 05f4f678 ("nfsd4: don't do lookup within readdir in recovery
> code") introduced a similar problem with calling lookup_one_len()
> without i_mutex, which this patch also addresses.
>
> Reported-by: J. R. Okajima <[email protected]>
> Signed-off-by: David Woodhouse <[email protected]>
> Umm-I-can-live-with-that-by: Al Viro <[email protected]>
> ---
> Still haven't tested the NFSv4 bit -- Bruce?

Thanks, there's an iterator in there that calls a passed-in function,
some examples of which were taking the i_mutex--so some fixing up is
needed. I'll follow up with a patch once I've got one tested.

--b.

>
> This time I remembered to remove the no-longer-used 'done:' label from
> nfsd_buffered_readdir() too.
>
> diff --git a/fs/namei.c b/fs/namei.c
> index b8433eb..78f253c 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1248,6 +1248,8 @@ struct dentry *lookup_one_len(const char *name, struct dentry *base, int len)
> int err;
> struct qstr this;
>
> + WARN_ON_ONCE(!mutex_is_locked(&base->d_inode->i_mutex));
> +
> err = __lookup_one_len(name, &this, base, len);
> if (err)
> return ERR_PTR(err);
> diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
> index 3444c00..210709c 100644
> --- a/fs/nfsd/nfs4recover.c
> +++ b/fs/nfsd/nfs4recover.c
> @@ -229,21 +229,23 @@ nfsd4_list_rec_dir(struct dentry *dir, recdir_func *f)
> goto out;
> status = vfs_readdir(filp, nfsd4_build_namelist, &names);
> fput(filp);
> + mutex_lock(&dir->d_inode->i_mutex);
> while (!list_empty(&names)) {
> entry = list_entry(names.next, struct name_list, list);
>
> dentry = lookup_one_len(entry->name, dir, HEXDIR_LEN-1);
> if (IS_ERR(dentry)) {
> status = PTR_ERR(dentry);
> - goto out;
> + break;
> }
> status = f(dir, dentry);
> dput(dentry);
> if (status)
> - goto out;
> + break;
> list_del(&entry->list);
> kfree(entry);
> }
> + mutex_unlock(&dir->d_inode->i_mutex);
> out:
> while (!list_empty(&names)) {
> entry = list_entry(names.next, struct name_list, list);
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index ab93fcf..0874299 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -1885,8 +1885,8 @@ static int nfsd_buffered_filldir(void *__buf, const char *name, int namlen,
> return 0;
> }
>
> -static int nfsd_buffered_readdir(struct file *file, filldir_t func,
> - struct readdir_cd *cdp, loff_t *offsetp)
> +static __be32 nfsd_buffered_readdir(struct file *file, filldir_t func,
> + struct readdir_cd *cdp, loff_t *offsetp)
> {
> struct readdir_data buf;
> struct buffered_dirent *de;
> @@ -1896,11 +1896,12 @@ static int nfsd_buffered_readdir(struct file *file, filldir_t func,
>
> buf.dirent = (void *)__get_free_page(GFP_KERNEL);
> if (!buf.dirent)
> - return -ENOMEM;
> + return nfserrno(-ENOMEM);
>
> offset = *offsetp;
>
> while (1) {
> + struct inode *dir_inode = file->f_path.dentry->d_inode;
> unsigned int reclen;
>
> cdp->err = nfserr_eof; /* will be cleared on successful read */
> @@ -1919,26 +1920,38 @@ static int nfsd_buffered_readdir(struct file *file, filldir_t func,
> if (!size)
> break;
>
> + /*
> + * Various filldir functions may end up calling back into
> + * lookup_one_len() and the file system's ->lookup() method.
> + * These expect i_mutex to be held, as it would within readdir.
> + */
> + host_err = mutex_lock_killable(&dir_inode->i_mutex);
> + if (host_err)
> + break;
> +
> de = (struct buffered_dirent *)buf.dirent;
> while (size > 0) {
> offset = de->offset;
>
> if (func(cdp, de->name, de->namlen, de->offset,
> de->ino, de->d_type))
> - goto done;
> + break;
>
> if (cdp->err != nfs_ok)
> - goto done;
> + break;
>
> reclen = ALIGN(sizeof(*de) + de->namlen,
> sizeof(u64));
> size -= reclen;
> de = (struct buffered_dirent *)((char *)de + reclen);
> }
> + mutex_unlock(&dir_inode->i_mutex);
> + if (size > 0) /* We bailed out early */
> + break;
> +
> offset = vfs_llseek(file, 0, SEEK_CUR);
> }
>
> - done:
> free_page((unsigned long)(buf.dirent));
>
> if (host_err)
>
> --
> dwmw2
>

2009-04-20 19:50:59

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v2] Fix i_mutex handling in nfsd readdir

On Sun, Apr 19, 2009 at 04:51:54PM -0400, bfields wrote:
> On Sun, Apr 19, 2009 at 01:27:49PM +0100, David Woodhouse wrote:
> > Commit 14f7dd63 ("Copy XFS readdir hack into nfsd code") introduced a
> > bug to generic code which had been extant for a long time in the XFS
> > version -- it started to call through into lookup_one_len() and hence
> > into the file systems' ->lookup() methods without i_mutex held on the
> > directory.
> >
> > This patch fixes it by locking the directory's i_mutex again before
> > calling the filldir functions. The original deadlocks which commit
> > 14f7dd63 was designed to avoid are still avoided, because they were due
> > to fs-internal locking, not i_mutex.
> >
> > Commit 05f4f678 ("nfsd4: don't do lookup within readdir in recovery
> > code") introduced a similar problem there, which this also addresses.
> >
> > While we're at it, fix the return type of nfsd_buffered_readdir() which
> > should be a __be32 not an int -- it's an NFS errno, not a Linux errno.
> > And return nfserrno(-ENOMEM) when allocation fails, not just -ENOMEM.
> > Sparse would have caught both of those if it wasn't so busy bitching
> > about __cold__.
> >
> > Commit 05f4f678 ("nfsd4: don't do lookup within readdir in recovery
> > code") introduced a similar problem with calling lookup_one_len()
> > without i_mutex, which this patch also addresses.
> >
> > Reported-by: J. R. Okajima <[email protected]>
> > Signed-off-by: David Woodhouse <[email protected]>
> > Umm-I-can-live-with-that-by: Al Viro <[email protected]>
> > ---
> > Still haven't tested the NFSv4 bit -- Bruce?
>
> Thanks, there's an iterator in there that calls a passed-in function,
> some examples of which were taking the i_mutex--so some fixing up is
> needed. I'll follow up with a patch once I've got one tested.

Sorry for the delay. Simpler might be just to drop and reacquire the
mutex each time through nfsd4_list_rec_dir()'s loop, but I'd just as
soon rework the called functions to expect the mutex be held (and get
rid of the unused, probably fragile, clear_clid_dir() in the process).

So the following could be folded in to your patch.

I tested the combined patch over 2.6.30-rc2. I also tested 2.6.29 +
05f4f678 + the combined patch. Both look OK. Feel free to add a
tested-by or acked-by for "J. Bruce Fields" <[email protected]> as
appropriate. Or happy to add a s-o-b and shepherd it along myself if
it's easier....

--b.

diff --git a/fs/nfsd/nfs4recover.c b/fs/nfsd/nfs4recover.c
index 210709c..5275097 100644
--- a/fs/nfsd/nfs4recover.c
+++ b/fs/nfsd/nfs4recover.c
@@ -257,36 +257,6 @@ out:
}

static int
-nfsd4_remove_clid_file(struct dentry *dir, struct dentry *dentry)
-{
- int status;
-
- if (!S_ISREG(dir->d_inode->i_mode)) {
- printk("nfsd4: non-file found in client recovery directory\n");
- return -EINVAL;
- }
- mutex_lock_nested(&dir->d_inode->i_mutex, I_MUTEX_PARENT);
- status = vfs_unlink(dir->d_inode, dentry);
- mutex_unlock(&dir->d_inode->i_mutex);
- return status;
-}
-
-static int
-nfsd4_clear_clid_dir(struct dentry *dir, struct dentry *dentry)
-{
- int status;
-
- /* For now this directory should already be empty, but we empty it of
- * any regular files anyway, just in case the directory was created by
- * a kernel from the future.... */
- nfsd4_list_rec_dir(dentry, nfsd4_remove_clid_file);
- mutex_lock_nested(&dir->d_inode->i_mutex, I_MUTEX_PARENT);
- status = vfs_rmdir(dir->d_inode, dentry);
- mutex_unlock(&dir->d_inode->i_mutex);
- return status;
-}
-
-static int
nfsd4_unlink_clid_dir(char *name, int namlen)
{
struct dentry *dentry;
@@ -296,18 +266,18 @@ nfsd4_unlink_clid_dir(char *name, int namlen)

mutex_lock(&rec_dir.dentry->d_inode->i_mutex);
dentry = lookup_one_len(name, rec_dir.dentry, namlen);
- mutex_unlock(&rec_dir.dentry->d_inode->i_mutex);
if (IS_ERR(dentry)) {
status = PTR_ERR(dentry);
- return status;
+ goto out_unlock;
}
status = -ENOENT;
if (!dentry->d_inode)
goto out;
-
- status = nfsd4_clear_clid_dir(rec_dir.dentry, dentry);
+ status = vfs_rmdir(rec_dir.dentry->d_inode, dentry);
out:
dput(dentry);
+out_unlock:
+ mutex_unlock(&rec_dir.dentry->d_inode->i_mutex);
return status;
}

@@ -350,7 +320,7 @@ purge_old(struct dentry *parent, struct dentry *child)
if (nfs4_has_reclaimed_state(child->d_name.name, false))
return 0;

- status = nfsd4_clear_clid_dir(parent, child);
+ status = vfs_rmdir(parent->d_inode, child);
if (status)
printk("failed to remove client recovery directory %s\n",
child->d_name.name);

2009-04-21 00:29:53

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v2] Fix i_mutex handling in nfsd readdir

> Sorry for the delay. Simpler might be just to drop and reacquire the
> mutex each time through nfsd4_list_rec_dir()'s loop, but I'd just as
> soon rework the called functions to expect the mutex be held (and get
> rid of the unused, probably fragile, clear_clid_dir() in the process).
>
> So the following could be folded in to your patch.
>
> I tested the combined patch over 2.6.30-rc2. I also tested 2.6.29 +
> 05f4f678 + the combined patch. Both look OK. Feel free to add a
> tested-by or acked-by for "J. Bruce Fields" <[email protected]> as
> appropriate. Or happy to add a s-o-b and shepherd it along myself if
> it's easier....

I can take both, but if you prefer to have that one go through nfs tree -
fine by me.

I'm going to push a queue into for-next in a couple of hours; running build
tests right now. Patch from dwmw2 is in there...

2009-04-21 21:16:39

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v2] Fix i_mutex handling in nfsd readdir

On Tue, Apr 21, 2009 at 01:29:34AM +0100, Al Viro wrote:
> > Sorry for the delay. Simpler might be just to drop and reacquire the
> > mutex each time through nfsd4_list_rec_dir()'s loop, but I'd just as
> > soon rework the called functions to expect the mutex be held (and get
> > rid of the unused, probably fragile, clear_clid_dir() in the process).
> >
> > So the following could be folded in to your patch.
> >
> > I tested the combined patch over 2.6.30-rc2. I also tested 2.6.29 +
> > 05f4f678 + the combined patch. Both look OK. Feel free to add a
> > tested-by or acked-by for "J. Bruce Fields" <[email protected]> as
> > appropriate. Or happy to add a s-o-b and shepherd it along myself if
> > it's easier....
>
> I can take both, but if you prefer to have that one go through nfs tree -
> fine by me.
>
> I'm going to push a queue into for-next in a couple of hours; running build
> tests right now. Patch from dwmw2 is in there...

However it makes it there is fine by me; I'll wait for one of you to
take care of it.

--b.

2009-04-21 21:55:37

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH v2] Fix i_mutex handling in nfsd readdir

On Tue, Apr 21, 2009 at 05:15:59PM -0400, J. Bruce Fields wrote:

> However it makes it there is fine by me; I'll wait for one of you to
> take care of it.

Folded variant is in the mainline.

2009-04-22 04:42:21

by J. R. Okajima

[permalink] [raw]
Subject: Re: [PATCH v2] Fix i_mutex handling in nfsd readdir


David Woodhouse:
> This patch fixes it by locking the directory's i_mutex again before
> calling the filldir functions. The original deadlocks which commit
:::

An entry may be removed between the first mutex_unlock and the second
mutex_lock. In this case, lookup_one_len() in compose_entry_fh() will
return a negative dentry.
Currently the inode test (positive/negative) is done AFTER fh_compose().
Isn't it better to test it BEFORE fh_compose()?

compose_entry_fh()
{
:::
dchild = lookup_one_len(name, dparent, namlen);
if (IS_ERR(dchild))
return 1;
if (d_mountpoint(dchild) ||
fh_compose(fhp, exp, dchild, &cd->fh) != 0 ||
!dchild->d_inode)
rv = 1;
:::
}


J. R. Okajima

2009-04-22 19:13:12

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v2] Fix i_mutex handling in nfsd readdir

On Wed, Apr 22, 2009 at 01:41:46PM +0900, [email protected] wrote:
>
> David Woodhouse:
> > This patch fixes it by locking the directory's i_mutex again before
> > calling the filldir functions. The original deadlocks which commit
> :::
>
> An entry may be removed between the first mutex_unlock and the second
> mutex_lock. In this case, lookup_one_len() in compose_entry_fh() will
> return a negative dentry.
> Currently the inode test (positive/negative) is done AFTER fh_compose().
> Isn't it better to test it BEFORE fh_compose()?
>
> compose_entry_fh()
> {
> :::
> dchild = lookup_one_len(name, dparent, namlen);
> if (IS_ERR(dchild))
> return 1;
> if (d_mountpoint(dchild) ||
> fh_compose(fhp, exp, dchild, &cd->fh) != 0 ||
> !dchild->d_inode)
> rv = 1;
> :::
> }

Yes, I think you're right.

Arrrgh.

--b.

2009-04-23 06:41:07

by J. R. Okajima

[permalink] [raw]
Subject: Re: [PATCH v2] Fix i_mutex handling in nfsd readdir


"J. Bruce Fields":
> > Isn't it better to test it BEFORE fh_compose()?
:::
> Yes, I think you're right.

Then here you are.

J. R. Okajima

----------------------------------------------------------------------

commit c98c6c4a207d602bd9498ea5f1d2993a00e98445
Author: J. R. Okajima <[email protected]>
Date: Thu Apr 23 15:38:43 2009 +0900

NFSD: test d_inode before fh_compose()

After 2f9092e1020246168b1309b35e085ecd7ff9ff72 "Fix i_mutex vs. readdir
handling in nfsd" (and 14f7dd63 "Copy XFS readdir hack into nfsd code"),
an entry may be removed between the first mutex_unlock and the second
mutex_lock. In this case, lookup_one_len() in compose_entry_fh() will
return a negative dentry.
It is better to test inode (positive/negative) BEFORE fh_compose().

Signed-off-by: J. R. Okajima <[email protected]>

diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
index 17d0dd9..1b5543b 100644
--- a/fs/nfsd/nfs3xdr.c
+++ b/fs/nfsd/nfs3xdr.c
@@ -851,8 +851,8 @@ compose_entry_fh(struct nfsd3_readdirres *cd, struct svc_fh *fhp,
if (IS_ERR(dchild))
return 1;
if (d_mountpoint(dchild) ||
- fh_compose(fhp, exp, dchild, &cd->fh) != 0 ||
- !dchild->d_inode)
+ !dchild->d_inode ||
+ fh_compose(fhp, exp, dchild, &cd->fh) != 0)
rv = 1;
dput(dchild);
return rv;

2009-04-23 20:27:29

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v2] Fix i_mutex handling in nfsd readdir

On Thu, Apr 23, 2009 at 03:40:23PM +0900, [email protected] wrote:
>
> "J. Bruce Fields":
> > > Isn't it better to test it BEFORE fh_compose()?
> :::
> > Yes, I think you're right.
>
> Then here you are.

The nfsv4 readdir callback needs a similar fix.

Also, it looks to me like this results in us encoding an entry for this
deleted file in the readdir reply, but with an empty filehandle. From a
quick glance at the rfc it's not clear to me whether this is really
legal. I suspect it may cause odd behavior on clients. At the least it
would seem cleaner to check for this condition early enough that we can
just skip the entry entirely.

--b.

>
> J. R. Okajima
>
> ----------------------------------------------------------------------
>
> commit c98c6c4a207d602bd9498ea5f1d2993a00e98445
> Author: J. R. Okajima <[email protected]>
> Date: Thu Apr 23 15:38:43 2009 +0900
>
> NFSD: test d_inode before fh_compose()
>
> After 2f9092e1020246168b1309b35e085ecd7ff9ff72 "Fix i_mutex vs. readdir
> handling in nfsd" (and 14f7dd63 "Copy XFS readdir hack into nfsd code"),
> an entry may be removed between the first mutex_unlock and the second
> mutex_lock. In this case, lookup_one_len() in compose_entry_fh() will
> return a negative dentry.
> It is better to test inode (positive/negative) BEFORE fh_compose().
>
> Signed-off-by: J. R. Okajima <[email protected]>
>
> diff --git a/fs/nfsd/nfs3xdr.c b/fs/nfsd/nfs3xdr.c
> index 17d0dd9..1b5543b 100644
> --- a/fs/nfsd/nfs3xdr.c
> +++ b/fs/nfsd/nfs3xdr.c
> @@ -851,8 +851,8 @@ compose_entry_fh(struct nfsd3_readdirres *cd, struct svc_fh *fhp,
> if (IS_ERR(dchild))
> return 1;
> if (d_mountpoint(dchild) ||
> - fh_compose(fhp, exp, dchild, &cd->fh) != 0 ||
> - !dchild->d_inode)
> + !dchild->d_inode ||
> + fh_compose(fhp, exp, dchild, &cd->fh) != 0)
> rv = 1;
> dput(dchild);
> return rv;