On Thursday May 1, [email protected] wrote:
> On Thu, 2008-05-01 at 16:48 -0400, Christoph Hellwig wrote:
> > Yes, and get_fsid would be extremly useful, especially for those
> > filesystems that already have an uuid in the superblock
> > (*cough*, XFS, *cough*), but it'll need some co-operations with
> > nfs-utils on when to use it.
>
> Why do you need to co-operate with userspace? Userspace shouldn't need
> to do anything -- we'll just generate a suitable fsid/uuid for
> ourselves, unless userspace deliberately overrides it for the export in
> question.
Actually it is the kernel that doesn't need to do anything....
Mapping between the filesystem and the filesystem part of the
filehandle is done entirely in user space.
The kernel says "Here is a filehandle fragement, what filesystem
should I be accessing".
So what you really want is to teach nfs-utils to recognise JFFS2 and
extract an appropriate uuid.
It already uses libblkid to get uuids for ext3 and XFS and others.
Extending that to handle JFFS2 should be much of a drama.
>
> > Patch looks good for me except for the few introduced overlong lines.
>
> Bah, don't you start. A less onanistic problem with it is the deadlock
> with NFS readdirplus (->readdir->encode_entry->compose_entry_fh->lookup)
>
> I wonder if we should postpone the calls to compose_entry_fh() until
> _after_ readdir() has completed. Leave space in the response for the
> filehandles, but only walk through it again calling compose_entry_fh()
> once we're done in readdir. That would allow us to get rid of the
> various icky hacks that file systems have to avoid that deadlock.
Why is there a deadlock here?
Both readdir and lookup are called with i_mutex held on the directory
so there should need to do any extra locking (he said, naively). In
the readdirplus cases, i_mutex is held across both the readdir and the
lookup....
One problem with your proposed solution is that filehandles aren't all
the same length, so you cannot reliably leave space for them.
Awkward.
NeilBrown
On Fri, 2008-05-02 at 11:38 +1000, Neil Brown wrote:
> On Thursday May 1, [email protected] wrote:
> > On Thu, 2008-05-01 at 16:48 -0400, Christoph Hellwig wrote:
> > > Yes, and get_fsid would be extremly useful, especially for those
> > > filesystems that already have an uuid in the superblock
> > > (*cough*, XFS, *cough*), but it'll need some co-operations with
> > > nfs-utils on when to use it.
> >
> > Why do you need to co-operate with userspace? Userspace shouldn't need
> > to do anything -- we'll just generate a suitable fsid/uuid for
> > ourselves, unless userspace deliberately overrides it for the export in
> > question.
>
> Actually it is the kernel that doesn't need to do anything....
> Mapping between the filesystem and the filesystem part of the
> filehandle is done entirely in user space.
> The kernel says "Here is a filehandle fragement, what filesystem
> should I be accessing".
>
> So what you really want is to teach nfs-utils to recognise JFFS2 and
> extract an appropriate uuid.
> It already uses libblkid to get uuids for ext3 and XFS and others.
> Extending that to handle JFFS2 should be much of a drama.
For JFFS2, there is no UUID; only i_sb->s_dev. Actually. if we just set
FS_REQUIRES_DEV then it would work out OK -- at least for the NFS
export. We don't do that though, because it gives behaviour we don't
want in other situations,
> Why is there a deadlock here?
Many file systems have their own locking, and lookup() can end up trying
to re-take a lock which readdir() is already holding. In the JFFS2 case,
it's the fs-internal inode mutex, which is required because the garbage
collector can't use i_mutex for lock ordering reasons.
See also the readdir implementation and surrounding comments in
fs/xfs/linux-2.6/xfs_file.c -- and the way GFS2 uses
gfs2_glock_is_locked_by_me() to avoid the deadlock.
The annoying thing is that JFFS2 doesn't even _implement_ i_generation,
so you get no more useful information out of the lookup() call anyway :)
> Both readdir and lookup are called with i_mutex held on the directory
> so there should need to do any extra locking (he said, naively). In
> the readdirplus cases, i_mutex is held across both the readdir and the
> lookup....
>
> One problem with your proposed solution is that filehandles aren't all
> the same length, so you cannot reliably leave space for them.
Not without moving stuff around during the postprocessing, I suppose.
Which isn't very pretty -- but it's prettier than some of the hacks we
have at the moment to avoid the deadlock.
--
dwmw2
On Fri, 2008-05-02 at 11:38 +1000, Neil Brown wrote:
> Why is there a deadlock here?
> Both readdir and lookup are called with i_mutex held on the directory
> so there should need to do any extra locking (he said, naively). In
> the readdirplus cases, i_mutex is held across both the readdir and the
> lookup....
>
> One problem with your proposed solution is that filehandles aren't all
> the same length, so you cannot reliably leave space for them.
>
> Awkward.
Yeah. I think the sanest plan for the short term is, as hch suggests,
just to transplant the existing XFS hack into the nfsd code. That way,
at least we can avoid using the hack for local users. And it makes NFS
export from other file systems (jffs2, btrfs, etc.) easier without
having to put the same hacks in each one.
Git tree at git.infradead.org/users/dwmw2/nfsexport-2.6.git; patch
sequence follows...
--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation
Signed-off-by: David Woodhouse <[email protected]>
---
fs/nfsd/vfs.c | 40 ++++++++++++++++++++++++----------------
1 files changed, 24 insertions(+), 16 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 18060be..3e22634 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1814,6 +1814,29 @@ out:
return err;
}
+static int nfsd_do_readdir(struct file *file, filldir_t func,
+ struct readdir_cd *cdp, loff_t *offsetp)
+{
+ int host_err;
+
+ /*
+ * Read the directory entries. This silly loop is necessary because
+ * readdir() is not guaranteed to fill up the entire buffer, but
+ * may choose to do less.
+ */
+ do {
+ cdp->err = nfserr_eof; /* will be cleared on successful read */
+ host_err = vfs_readdir(file, func, cdp);
+ } while (host_err >=0 && cdp->err == nfs_ok);
+
+ *offsetp = vfs_llseek(file, 0, 1);
+
+ if (host_err)
+ return nfserrno(host_err);
+ else
+ return cdp->err;
+}
+
/*
* Read entries from a directory.
* The NFSv3/4 verifier we ignore for now.
@@ -1823,7 +1846,6 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp,
struct readdir_cd *cdp, filldir_t func)
{
__be32 err;
- int host_err;
struct file *file;
loff_t offset = *offsetp;
@@ -1837,21 +1859,7 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp,
goto out_close;
}
- /*
- * Read the directory entries. This silly loop is necessary because
- * readdir() is not guaranteed to fill up the entire buffer, but
- * may choose to do less.
- */
-
- do {
- cdp->err = nfserr_eof; /* will be cleared on successful read */
- host_err = vfs_readdir(file, func, cdp);
- } while (host_err >=0 && cdp->err == nfs_ok);
- if (host_err)
- err = nfserrno(host_err);
- else
- err = cdp->err;
- *offsetp = vfs_llseek(file, 0, 1);
+ err = nfsd_do_readdir(file, func, cdp, offsetp);
if (err == nfserr_eof || err == nfserr_toosmall)
err = nfs_ok; /* can still be found in ->err */
--
1.5.5.1
--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation
Some file systems with their own internal locking have problems with the
way that nfsd calls the ->lookup() method from within a filldir function
called from their ->readdir() method. The recursion back into the file
system code can cause deadlock.
XFS has a fairly hackish solution to this which involves doing the
readdir() into a locally-allocated buffer, then going back through it
calling the filldir function afterwards. It's not ideal, but it works.
It's particularly suboptimal because XFS does this for local file
systems too, where it's completely unnecessary.
Copy this hack into the NFS code where it can be used only for NFS, and
only for file systems which indicate that they need it by setting
FS_NO_LOOKUP_IN_READDIR in their fs_type flags.
Signed-off-by: David Woodhouse <[email protected]>
---
fs/nfsd/vfs.c | 111 +++++++++++++++++++++++++++++++++++++++++++++++++++-
include/linux/fs.h | 2 +
2 files changed, 112 insertions(+), 1 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 3e22634..81c9411 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -1814,6 +1814,111 @@ out:
return err;
}
+struct hack_dirent {
+ u64 ino;
+ loff_t offset;
+ int namlen;
+ unsigned int d_type;
+ char name[];
+};
+
+struct hack_callback {
+ char *dirent;
+ size_t len;
+ size_t used;
+};
+
+static int nfsd_hack_filldir(void *__buf, const char *name, int namlen,
+ loff_t offset, u64 ino, unsigned int d_type)
+{
+ struct hack_callback *buf = __buf;
+ struct hack_dirent *de = (void *)(buf->dirent + buf->used);
+ unsigned int reclen;
+
+ reclen = ALIGN(sizeof(struct hack_dirent) + namlen, sizeof(u64));
+ if (buf->used + reclen > buf->len)
+ return -EINVAL;
+
+ de->namlen = namlen;
+ de->offset = offset;
+ de->ino = ino;
+ de->d_type = d_type;
+ memcpy(de->name, name, namlen);
+ buf->used += reclen;
+
+ return 0;
+}
+
+static int nfsd_hack_readdir(struct file *file, filldir_t func,
+ struct readdir_cd *cdp, loff_t *offsetp)
+{
+ struct hack_callback buf;
+ struct hack_dirent *de;
+ int host_err;
+ int size;
+ loff_t offset;
+
+ /*
+ * Try fairly hard to get memory
+ */
+ buf.len = PAGE_CACHE_SIZE;
+ do {
+ buf.dirent = kmalloc(buf.len, GFP_KERNEL);
+ if (buf.dirent)
+ break;
+ buf.len >>= 1;
+ } while (buf.len >= 1024);
+
+ if (!buf.dirent)
+ return -ENOMEM;
+
+ offset = *offsetp;
+ cdp->err = nfserr_eof; /* will be cleared on successful read */
+
+ while (1) {
+ unsigned int reclen;
+
+ buf.used = 0;
+
+ host_err = vfs_readdir(file, nfsd_hack_filldir, &buf);
+ if (host_err)
+ break;
+
+ size = buf.used;
+
+ if (!size)
+ break;
+
+
+ de = (struct hack_dirent *)buf.dirent;
+ while (size > 0) {
+ offset = de->offset;
+
+ if (func(cdp, de->name, de->namlen, de->offset,
+ de->ino, de->d_type))
+ goto done;
+
+ if (cdp->err != nfs_ok)
+ goto done;
+
+ reclen = ALIGN(sizeof(struct hack_dirent) + de->namlen,
+ sizeof(u64));
+ size -= reclen;
+ de = (struct hack_dirent *)((char *)de + reclen);
+ }
+ offset = vfs_llseek(file, 0, 1);
+ }
+
+ done:
+ kfree(buf.dirent);
+
+ if (host_err)
+ return nfserrno(host_err);
+
+ *offsetp = offset;
+ return cdp->err;
+}
+
static int nfsd_do_readdir(struct file *file, filldir_t func,
struct readdir_cd *cdp, loff_t *offsetp)
{
@@ -1859,7 +1964,11 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp,
goto out_close;
}
- err = nfsd_do_readdir(file, func, cdp, offsetp);
+ if ((file->f_path.dentry->d_inode->i_sb->s_type->fs_flags &
+ FS_NO_LOOKUP_IN_READDIR))
+ err = nfsd_hack_readdir(file, func, cdp, offsetp);
+ else
+ err = nfsd_do_readdir(file, func, cdp, offsetp);
if (err == nfserr_eof || err == nfserr_toosmall)
err = nfs_ok; /* can still be found in ->err */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 580b513..80ca410 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -100,6 +100,8 @@ extern int dir_notify_enable;
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move()
* during rename() internally.
*/
+#define FS_NO_LOOKUP_IN_READDIR 65536 /* FS will deadlock if you call its
+ lookup() method from filldir */
/*
* These are the fs-independent mount-flags: up to 32 flags are supported
--
1.5.5.1
--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation
Now that we've moved the readdir hack to the nfsd code, we can just set
the FS_NO_LOOKUP_IN_READDIR flag and then remove the local hack from the
xfs code.
Signed-off-by: David Woodhouse <[email protected]>
---
fs/xfs/linux-2.6/xfs_file.c | 120 ------------------------------------------
fs/xfs/linux-2.6/xfs_super.c | 2 +-
2 files changed, 1 insertions(+), 121 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index 5f60363..d65d377 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -212,7 +212,6 @@ xfs_file_fsync(
* Hopefully we'll find a better workaround that allows to use the optimal
* version at least for local readdirs for 2.6.25.
*/
-#if 0
STATIC int
xfs_file_readdir(
struct file *filp,
@@ -244,125 +243,6 @@ xfs_file_readdir(
return -error;
return 0;
}
-#else
-
-struct hack_dirent {
- u64 ino;
- loff_t offset;
- int namlen;
- unsigned int d_type;
- char name[];
-};
-
-struct hack_callback {
- char *dirent;
- size_t len;
- size_t used;
-};
-
-STATIC int
-xfs_hack_filldir(
- void *__buf,
- const char *name,
- int namlen,
- loff_t offset,
- u64 ino,
- unsigned int d_type)
-{
- struct hack_callback *buf = __buf;
- struct hack_dirent *de = (struct hack_dirent *)(buf->dirent + buf->used);
- unsigned int reclen;
-
- reclen = ALIGN(sizeof(struct hack_dirent) + namlen, sizeof(u64));
- if (buf->used + reclen > buf->len)
- return -EINVAL;
-
- de->namlen = namlen;
- de->offset = offset;
- de->ino = ino;
- de->d_type = d_type;
- memcpy(de->name, name, namlen);
- buf->used += reclen;
- return 0;
-}
-
-STATIC int
-xfs_file_readdir(
- struct file *filp,
- void *dirent,
- filldir_t filldir)
-{
- struct inode *inode = filp->f_path.dentry->d_inode;
- xfs_inode_t *ip = XFS_I(inode);
- struct hack_callback buf;
- struct hack_dirent *de;
- int error;
- loff_t size;
- int eof = 0;
- xfs_off_t start_offset, curr_offset, offset;
-
- /*
- * Try fairly hard to get memory
- */
- buf.len = PAGE_CACHE_SIZE;
- do {
- buf.dirent = kmalloc(buf.len, GFP_KERNEL);
- if (buf.dirent)
- break;
- buf.len >>= 1;
- } while (buf.len >= 1024);
-
- if (!buf.dirent)
- return -ENOMEM;
-
- curr_offset = filp->f_pos;
- if (curr_offset == 0x7fffffff)
- offset = 0xffffffff;
- else
- offset = filp->f_pos;
-
- while (!eof) {
- unsigned int reclen;
-
- start_offset = offset;
-
- buf.used = 0;
- error = -xfs_readdir(ip, &buf, buf.len, &offset,
- xfs_hack_filldir);
- if (error || offset == start_offset) {
- size = 0;
- break;
- }
-
- size = buf.used;
- de = (struct hack_dirent *)buf.dirent;
- while (size > 0) {
- curr_offset = de->offset /* & 0x7fffffff */;
- if (filldir(dirent, de->name, de->namlen,
- curr_offset & 0x7fffffff,
- de->ino, de->d_type)) {
- goto done;
- }
-
- reclen = ALIGN(sizeof(struct hack_dirent) + de->namlen,
- sizeof(u64));
- size -= reclen;
- de = (struct hack_dirent *)((char *)de + reclen);
- }
- }
-
- done:
- if (!error) {
- if (size == 0)
- filp->f_pos = offset & 0x7fffffff;
- else if (de)
- filp->f_pos = curr_offset;
- }
-
- kfree(buf.dirent);
- return error;
-}
-#endif
STATIC int
xfs_file_mmap(
diff --git a/fs/xfs/linux-2.6/xfs_super.c b/fs/xfs/linux-2.6/xfs_super.c
index 9433812..e44a21d 100644
--- a/fs/xfs/linux-2.6/xfs_super.c
+++ b/fs/xfs/linux-2.6/xfs_super.c
@@ -1426,7 +1426,7 @@ static struct file_system_type xfs_fs_type = {
.name = "xfs",
.get_sb = xfs_fs_get_sb,
.kill_sb = kill_block_super,
- .fs_flags = FS_REQUIRES_DEV,
+ .fs_flags = FS_REQUIRES_DEV | FS_NO_LOOKUP_IN_READDIR,
};
--
1.5.5.1
--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation