With the ext3/ext4 directory index implementation hashes are used to specify
offsets for llseek(). For compatibility with NFSv2 and 32-bit user space
on 64-bit systems (kernel space) ext3/ext4 currently only return 32-bit
hashes and therefore the probability of hash collisions for larger directories
is rather high. As recently reported on the NFS mailing list that theoretical
problem also happens on real systems:
http://comments.gmane.org/gmane.linux.nfs/40863
The following series adds two new f_mode flags to tell ext4
to use 32-bit or 64-bit hash values for llseek() calls.
These flags can then used by network file systems, such as NFS, to
request 32-bit or 64-bit offsets (hashes).
Version 5
- update NFSD_MAY_64BIT_COOKIE to 0x1000
Version 4
- Andreas noticed there was HAVE_IS_COMPAT_TASK instead of
CONFIG_COMPAT in the
"Return 32/64-bit dir name hash according to usage type"
patch
Version 3:
- remove patch "RFC: Remove check for a 32-bit cookie in nfsd4_readdir()",
as it was included upstream separately
- split "nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes)" into two
two separate patches as suggested by Bruce, one patch to rename
'access' to 'may_flags'. And the remainder of the original patch to set
FMODE_32BITHASH/FMODE_64BITHASH flags and to introduce the new
NFSD_MAY_64BIT_COOKIE flag
Version 2:
- use f_mode instead of O_* flags and also in a separate patch
- introduce EXT4_HTREE_EOF_32BIT and EXT4_HTREE_EOF_64BIT
- fix SEEK_END in ext4_dir_llseek()
- set f_mode flags in NFS code as early as possible and introduce a new
NFSD_MAY_64BIT_COOKIE flag for that
--
Bernd Schubert
Fraunhofer ITWM
From: Fan Yong <[email protected]>
Traditionally ext2/3/4 has returned a 32-bit hash value from llseek()
to appease NFSv2, which can only handle a 32-bit cookie for seekdir()
and telldir(). However, this causes problems if there are 32-bit hash
collisions, since the NFSv2 server can get stuck resending the same
entries from the directory repeatedly.
Allow ext4 to return a full 64-bit hash (both major and minor) for
telldir to decrease the chance of hash collisions. This still needs
integration on the NFS side.
Patch-updated-by: Bernd Schubert <[email protected]>
(blame me if something is not correct)
Signed-off-by: Fan Yong <[email protected]>
Signed-off-by: Andreas Dilger <[email protected]>
Signed-off-by: Bernd Schubert <[email protected]>
---
fs/ext4/dir.c | 185 ++++++++++++++++++++++++++++++++++++++++++++------------
fs/ext4/ext4.h | 6 ++
fs/ext4/hash.c | 4 +
3 files changed, 154 insertions(+), 41 deletions(-)
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 164c560..cee09f2 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -32,24 +32,8 @@ static unsigned char ext4_filetype_table[] = {
DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
};
-static int ext4_readdir(struct file *, void *, filldir_t);
static int ext4_dx_readdir(struct file *filp,
void *dirent, filldir_t filldir);
-static int ext4_release_dir(struct inode *inode,
- struct file *filp);
-
-const struct file_operations ext4_dir_operations = {
- .llseek = ext4_llseek,
- .read = generic_read_dir,
- .readdir = ext4_readdir, /* we take BKL. needed?*/
- .unlocked_ioctl = ext4_ioctl,
-#ifdef CONFIG_COMPAT
- .compat_ioctl = ext4_compat_ioctl,
-#endif
- .fsync = ext4_sync_file,
- .release = ext4_release_dir,
-};
-
static unsigned char get_dtype(struct super_block *sb, int filetype)
{
@@ -254,22 +238,134 @@ out:
return ret;
}
+static inline int is_32bit_api(void)
+{
+#ifdef CONFIG_COMPAT
+ return is_compat_task();
+#else
+ return (BITS_PER_LONG == 32);
+#endif
+}
+
/*
* These functions convert from the major/minor hash to an f_pos
- * value.
+ * value for dx directories
*
- * Currently we only use major hash numer. This is unfortunate, but
- * on 32-bit machines, the same VFS interface is used for lseek and
- * llseek, so if we use the 64 bit offset, then the 32-bit versions of
- * lseek/telldir/seekdir will blow out spectacularly, and from within
- * the ext2 low-level routine, we don't know if we're being called by
- * a 64-bit version of the system call or the 32-bit version of the
- * system call. Worse yet, NFSv2 only allows for a 32-bit readdir
- * cookie. Sigh.
+ * Upper layer (for example NFS) should specify FMODE_32BITHASH or
+ * FMODE_64BITHASH explicitly. On the other hand, we allow ext4 to be mounted
+ * directly on both 32-bit and 64-bit nodes, under such case, neither
+ * FMODE_32BITHASH nor FMODE_64BITHASH is specified.
*/
-#define hash2pos(major, minor) (major >> 1)
-#define pos2maj_hash(pos) ((pos << 1) & 0xffffffff)
-#define pos2min_hash(pos) (0)
+static inline loff_t hash2pos(struct file *filp, __u32 major, __u32 minor)
+{
+ if ((filp->f_flags & FMODE_32BITHASH) ||
+ (!(filp->f_flags & FMODE_64BITHASH) && is_32bit_api()))
+ return major >> 1;
+ else
+ return ((__u64)(major >> 1) << 32) | (__u64)minor;
+}
+
+static inline __u32 pos2maj_hash(struct file *filp, loff_t pos)
+{
+ if ((filp->f_flags & FMODE_32BITHASH) ||
+ (!(filp->f_mode & FMODE_64BITHASH) && is_32bit_api()))
+ return (pos << 1) & 0xffffffff;
+ else
+ return ((pos >> 32) << 1) & 0xffffffff;
+}
+
+static inline __u32 pos2min_hash(struct file *filp, loff_t pos)
+{
+ if ((filp->f_flags & FMODE_32BITHASH) ||
+ (!(filp->f_flags & FMODE_64BITHASH) && is_32bit_api()))
+ return 0;
+ else
+ return pos & 0xffffffff;
+}
+
+/*
+ * Return 32- or 64-bit end-of-file for dx directories
+ */
+static inline loff_t ext4_get_htree_eof(struct file *filp)
+{
+ if ((filp->f_mode & FMODE_32BITHASH) ||
+ (!(filp->f_mode & FMODE_64BITHASH) && is_32bit_api()))
+ return EXT4_HTREE_EOF_32BIT;
+ else
+ return EXT4_HTREE_EOF_64BIT;
+}
+
+
+/*
+ * ext4_dir_llseek() based on generic_file_llseek() to handle both
+ * non-htree and htree directories, where the "offset" is in terms
+ * of the filename hash value instead of the byte offset.
+ *
+ * NOTE: offsets obtained *before* ext4_set_inode_flag(dir, EXT4_INODE_INDEX)
+ * will be invalid once the directory was converted into a dx directory
+ */
+loff_t ext4_dir_llseek(struct file *file, loff_t offset, int origin)
+{
+ struct inode *inode = file->f_mapping->host;
+ loff_t ret = -EINVAL;
+ int is_dx_dir = ext4_test_inode_flag(inode, EXT4_INODE_INDEX);
+
+ mutex_lock(&inode->i_mutex);
+
+ /* NOTE: relative offsets with dx directories might not work
+ * as expected, as it is difficult to figure out the
+ * correct offset between dx hashes */
+
+ switch (origin) {
+ case SEEK_END:
+ if (unlikely(offset > 0))
+ goto out_err; /* not supported for directories */
+
+ /* so only negative offsets are left, does that have a
+ * meaning for directories at all? */
+ if (is_dx_dir)
+ offset += ext4_get_htree_eof(file);
+ else
+ offset += inode->i_size;
+ break;
+ case SEEK_CUR:
+ /*
+ * Here we special-case the lseek(fd, 0, SEEK_CUR)
+ * position-querying operation. Avoid rewriting the "same"
+ * f_pos value back to the file because a concurrent read(),
+ * write() or lseek() might have altered it
+ */
+ if (offset == 0) {
+ offset = file->f_pos;
+ goto out_ok;
+ }
+
+ offset += file->f_pos;
+ break;
+ }
+
+ if (unlikely(offset < 0))
+ goto out_err;
+
+ if (!is_dx_dir) {
+ if (offset > inode->i_sb->s_maxbytes)
+ goto out_err;
+ } else if (offset > ext4_get_htree_eof(file))
+ goto out_err;
+
+ /* Special lock needed here? */
+ if (offset != file->f_pos) {
+ file->f_pos = offset;
+ file->f_version = 0;
+ }
+
+out_ok:
+ ret = offset;
+out_err:
+ mutex_unlock(&inode->i_mutex);
+
+ return ret;
+}
/*
* This structure holds the nodes of the red-black tree used to store
@@ -330,15 +426,16 @@ static void free_rb_tree_fname(struct rb_root *root)
}
-static struct dir_private_info *ext4_htree_create_dir_info(loff_t pos)
+static struct dir_private_info *ext4_htree_create_dir_info(struct file *filp,
+ loff_t pos)
{
struct dir_private_info *p;
p = kzalloc(sizeof(struct dir_private_info), GFP_KERNEL);
if (!p)
return NULL;
- p->curr_hash = pos2maj_hash(pos);
- p->curr_minor_hash = pos2min_hash(pos);
+ p->curr_hash = pos2maj_hash(filp, pos);
+ p->curr_minor_hash = pos2min_hash(filp, pos);
return p;
}
@@ -429,7 +526,7 @@ static int call_filldir(struct file *filp, void *dirent,
"null fname?!?\n");
return 0;
}
- curr_pos = hash2pos(fname->hash, fname->minor_hash);
+ curr_pos = hash2pos(filp, fname->hash, fname->minor_hash);
while (fname) {
error = filldir(dirent, fname->name,
fname->name_len, curr_pos,
@@ -454,13 +551,13 @@ static int ext4_dx_readdir(struct file *filp,
int ret;
if (!info) {
- info = ext4_htree_create_dir_info(filp->f_pos);
+ info = ext4_htree_create_dir_info(filp, filp->f_pos);
if (!info)
return -ENOMEM;
filp->private_data = info;
}
- if (filp->f_pos == EXT4_HTREE_EOF)
+ if (filp->f_pos == ext4_get_htree_eof(filp))
return 0; /* EOF */
/* Some one has messed with f_pos; reset the world */
@@ -468,8 +565,8 @@ static int ext4_dx_readdir(struct file *filp,
free_rb_tree_fname(&info->root);
info->curr_node = NULL;
info->extra_fname = NULL;
- info->curr_hash = pos2maj_hash(filp->f_pos);
- info->curr_minor_hash = pos2min_hash(filp->f_pos);
+ info->curr_hash = pos2maj_hash(filp, filp->f_pos);
+ info->curr_minor_hash = pos2min_hash(filp, filp->f_pos);
}
/*
@@ -501,7 +598,7 @@ static int ext4_dx_readdir(struct file *filp,
if (ret < 0)
return ret;
if (ret == 0) {
- filp->f_pos = EXT4_HTREE_EOF;
+ filp->f_pos = ext4_get_htree_eof(filp);
break;
}
info->curr_node = rb_first(&info->root);
@@ -521,7 +618,7 @@ static int ext4_dx_readdir(struct file *filp,
info->curr_minor_hash = fname->minor_hash;
} else {
if (info->next_hash == ~0) {
- filp->f_pos = EXT4_HTREE_EOF;
+ filp->f_pos = ext4_get_htree_eof(filp);
break;
}
info->curr_hash = info->next_hash;
@@ -540,3 +637,15 @@ static int ext4_release_dir(struct inode *inode, struct file *filp)
return 0;
}
+
+const struct file_operations ext4_dir_operations = {
+ .llseek = ext4_dir_llseek,
+ .read = generic_read_dir,
+ .readdir = ext4_readdir,
+ .unlocked_ioctl = ext4_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = ext4_compat_ioctl,
+#endif
+ .fsync = ext4_sync_file,
+ .release = ext4_release_dir,
+};
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1554b15..d3fe1ea 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1599,7 +1599,11 @@ struct dx_hash_info
u32 *seed;
};
-#define EXT4_HTREE_EOF 0x7fffffff
+
+/* 32 and 64 bit signed EOF for dx directories */
+#define EXT4_HTREE_EOF_32BIT ((1UL << (32 - 1)) - 1)
+#define EXT4_HTREE_EOF_64BIT ((1ULL << (64 - 1)) - 1)
+
/*
* Control parameters used by ext4_htree_next_block
diff --git a/fs/ext4/hash.c b/fs/ext4/hash.c
index ac8f168..fa8e491 100644
--- a/fs/ext4/hash.c
+++ b/fs/ext4/hash.c
@@ -200,8 +200,8 @@ int ext4fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo)
return -1;
}
hash = hash & ~1;
- if (hash == (EXT4_HTREE_EOF << 1))
- hash = (EXT4_HTREE_EOF-1) << 1;
+ if (hash == (EXT4_HTREE_EOF_32BIT << 1))
+ hash = (EXT4_HTREE_EOF_32BIT - 1) << 1;
hinfo->hash = hash;
hinfo->minor_hash = minor_hash;
return 0;
On Wed, Jan 11, 2012 at 09:48:27AM -0500, J. Bruce Fields wrote:
>
> Yes, this needs to be fixed--is there anything in particular holding up
> these patches?
Yes, I need more people reviewing ext4 patches, so it's not me doing
all of the review. :-)
I will put this on my "high priority to review ASAP for the next merge
window" list.
- Ted
Use 32-bit or 64-bit llseek() hashes for directory offsets depending on
the NFS version. NFSv2 gets 32-bit hashes only.
NOTE: This patch got rather complex as Christoph asked to set the
filp->f_mode flag in the open call or immediatly after dentry_open()
in nfsd_open() to avoid races.
Personally I still do not see a reason for that and in my opinion
FMODE_32BITHASH/FMODE_64BITHASH flags could be set nfsd_readdir(), as it
follows directly after nfsd_open() without a chance of races.
Signed-off-by: Bernd Schubert <[email protected]>
Acked-by: J. Bruce Fields<[email protected]>
---
fs/nfsd/vfs.c | 15 +++++++++++++--
fs/nfsd/vfs.h | 2 ++
2 files changed, 15 insertions(+), 2 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 20fa252..f162dc3 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -785,9 +785,15 @@ nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, umode_t type,
flags, current_cred());
if (IS_ERR(*filp))
host_err = PTR_ERR(*filp);
- else
+ else {
host_err = ima_file_check(*filp, may_flags);
+ if (may_flags & NFSD_MAY_64BIT_COOKIE)
+ (*filp)->f_mode |= FMODE_64BITHASH;
+ else
+ (*filp)->f_mode |= FMODE_32BITHASH;
+ }
+
out_nfserr:
err = nfserrno(host_err);
out:
@@ -2011,8 +2017,13 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp,
__be32 err;
struct file *file;
loff_t offset = *offsetp;
+ int may_flags = NFSD_MAY_READ;
+
+ /* NFSv2 only supports 32 bit cookies */
+ if (rqstp->rq_vers > 2)
+ may_flags |= NFSD_MAY_64BIT_COOKIE;
- err = nfsd_open(rqstp, fhp, S_IFDIR, NFSD_MAY_READ, &file);
+ err = nfsd_open(rqstp, fhp, S_IFDIR, may_flags, &file);
if (err)
goto out;
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 1dcd238..ec0611b 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -27,6 +27,8 @@
#define NFSD_MAY_BYPASS_GSS 0x400
#define NFSD_MAY_READ_IF_EXEC 0x800
+#define NFSD_MAY_64BIT_COOKIE 0x1000 /* 64 bit readdir cookies for >= NFSv3 */
+
#define NFSD_MAY_CREATE (NFSD_MAY_EXEC|NFSD_MAY_WRITE)
#define NFSD_MAY_REMOVE (NFSD_MAY_EXEC|NFSD_MAY_WRITE|NFSD_MAY_TRUNC)
Those flags are supposed to be set by NFS readdir() to tell ext3/ext4
to 32bit (NFSv2) or 64bit hash values (offsets) in seekdir().
Signed-off-by: Bernd Schubert <[email protected]>
---
include/linux/fs.h | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7aacf31..b158386 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -92,6 +92,11 @@ struct inodes_stat_t {
/* File is opened using open(.., 3, ..) and is writeable only for ioctls
(specialy hack for floppy.c) */
#define FMODE_WRITE_IOCTL ((__force fmode_t)0x100)
+/* 32bit hashes as llseek() offset (for directories) */
+#define FMODE_32BITHASH ((__force fmode_t)0x200)
+/* 64bit hashes as llseek() offset (for directories) */
+#define FMODE_64BITHASH ((__force fmode_t)0x400)
+
/*
* Don't update ctime and mtime.
On 2012-01-09, at 6:21 AM, Bernd Schubert wrote:
> With the ext3/ext4 directory index implementation hashes are used to specify
> offsets for llseek(). For compatibility with NFSv2 and 32-bit user space
> on 64-bit systems (kernel space) ext3/ext4 currently only return 32-bit
> hashes and therefore the probability of hash collisions for larger directories
> is rather high. As recently reported on the NFS mailing list that theoretical
> problem also happens on real systems:
> http://comments.gmane.org/gmane.linux.nfs/40863
>
> The following series adds two new f_mode flags to tell ext4
> to use 32-bit or 64-bit hash values for llseek() calls.
> These flags can then used by network file systems, such as NFS, to
> request 32-bit or 64-bit offsets (hashes).
Ted, it would be great if these patches could land. We hit issues like
this previously as well, which is why we started this patch series in the
first place.
> Version 5
> - update NFSD_MAY_64BIT_COOKIE to 0x1000
>
> Version 4
> - Andreas noticed there was HAVE_IS_COMPAT_TASK instead of
> CONFIG_COMPAT in the
> "Return 32/64-bit dir name hash according to usage type"
> patch
>
> Version 3:
> - remove patch "RFC: Remove check for a 32-bit cookie in nfsd4_readdir()",
> as it was included upstream separately
> - split "nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes)" into two
> two separate patches as suggested by Bruce, one patch to rename
> 'access' to 'may_flags'. And the remainder of the original patch to set
> FMODE_32BITHASH/FMODE_64BITHASH flags and to introduce the new
> NFSD_MAY_64BIT_COOKIE flag
>
> Version 2:
> - use f_mode instead of O_* flags and also in a separate patch
> - introduce EXT4_HTREE_EOF_32BIT and EXT4_HTREE_EOF_64BIT
> - fix SEEK_END in ext4_dir_llseek()
> - set f_mode flags in NFS code as early as possible and introduce a new
> NFSD_MAY_64BIT_COOKIE flag for that
>
> --
> Bernd Schubert
> Fraunhofer ITWM
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Cheers, Andreas
--
Andreas Dilger Whamcloud, Inc.
Principal Engineer http://www.whamcloud.com/
Just rename this variable, as the next patch will add a flag and
'access' as variable name would not be correct any more.
Signed-off-by: Bernd Schubert <[email protected]>
Acked-by: J. Bruce Fields<[email protected]>
---
fs/nfsd/vfs.c | 18 ++++++++++--------
1 files changed, 10 insertions(+), 8 deletions(-)
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index d25a723..20fa252 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -726,12 +726,13 @@ static int nfsd_open_break_lease(struct inode *inode, int access)
/*
* Open an existing file or directory.
- * The access argument indicates the type of open (read/write/lock)
+ * The may_flags argument indicates the type of open (read/write/lock)
+ * and additional flags.
* N.B. After this call fhp needs an fh_put
*/
__be32
nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, umode_t type,
- int access, struct file **filp)
+ int may_flags, struct file **filp)
{
struct dentry *dentry;
struct inode *inode;
@@ -746,7 +747,7 @@ nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, umode_t type,
* and (hopefully) checked permission - so allow OWNER_OVERRIDE
* in case a chmod has now revoked permission.
*/
- err = fh_verify(rqstp, fhp, type, access | NFSD_MAY_OWNER_OVERRIDE);
+ err = fh_verify(rqstp, fhp, type, may_flags | NFSD_MAY_OWNER_OVERRIDE);
if (err)
goto out;
@@ -757,7 +758,7 @@ nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, umode_t type,
* or any access when mandatory locking enabled
*/
err = nfserr_perm;
- if (IS_APPEND(inode) && (access & NFSD_MAY_WRITE))
+ if (IS_APPEND(inode) && (may_flags & NFSD_MAY_WRITE))
goto out;
/*
* We must ignore files (but only files) which might have mandatory
@@ -770,12 +771,12 @@ nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, umode_t type,
if (!inode->i_fop)
goto out;
- host_err = nfsd_open_break_lease(inode, access);
+ host_err = nfsd_open_break_lease(inode, may_flags);
if (host_err) /* NOMEM or WOULDBLOCK */
goto out_nfserr;
- if (access & NFSD_MAY_WRITE) {
- if (access & NFSD_MAY_READ)
+ if (may_flags & NFSD_MAY_WRITE) {
+ if (may_flags & NFSD_MAY_READ)
flags = O_RDWR|O_LARGEFILE;
else
flags = O_WRONLY|O_LARGEFILE;
@@ -785,7 +786,8 @@ nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, umode_t type,
if (IS_ERR(*filp))
host_err = PTR_ERR(*filp);
else
- host_err = ima_file_check(*filp, access);
+ host_err = ima_file_check(*filp, may_flags);
+
out_nfserr:
err = nfserrno(host_err);
out:
On Tue, Jan 10, 2012 at 04:27:15AM -0700, Andreas Dilger wrote:
> On 2012-01-09, at 6:21 AM, Bernd Schubert wrote:
> > With the ext3/ext4 directory index implementation hashes are used to specify
> > offsets for llseek(). For compatibility with NFSv2 and 32-bit user space
> > on 64-bit systems (kernel space) ext3/ext4 currently only return 32-bit
> > hashes and therefore the probability of hash collisions for larger directories
> > is rather high. As recently reported on the NFS mailing list that theoretical
> > problem also happens on real systems:
> > http://comments.gmane.org/gmane.linux.nfs/40863
> >
> > The following series adds two new f_mode flags to tell ext4
> > to use 32-bit or 64-bit hash values for llseek() calls.
> > These flags can then used by network file systems, such as NFS, to
> > request 32-bit or 64-bit offsets (hashes).
>
> Ted, it would be great if these patches could land. We hit issues like
> this previously as well, which is why we started this patch series in the
> first place.
Yes, this needs to be fixed--is there anything in particular holding up
these patches?
--b.
On Tue, Mar 06, 2012 at 10:59:55AM +0100, Bernd Schubert wrote:
> Arg, my bad, I introduced this issue when I converted from f_flags
> to f_mode, seems I forgot all of those above :(
> Hrm, I thought I had tested sufficiently, but obviously I did not :(
> Here's the test tool.
> http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/test_seekdir/
>
> While quickly looking, I think it only affects NFSv2, which I think
> I indeed didn't test. I only run tests for 32 bit and 64-bit user
> space and NFSv3. But yes, NFSv2 is an important test too. Not sure
> if I will find time for that today.
I think the problem case to worry about would have been on a 32-bit
NFS server (think embedded/bookshelf NAS servers) and NFSv3, since
that's where the inconsistency would have meant that hash2pos would
use the 32-bit codepath, but pos2maj_hash would use the 64-bit
codepath, and so then... kablooey.
> Will send an updated version later on.
If it's just those changes, no worries, as I've already fixed up the
f_flags vs. f_mode issue in my copy of the patches.
Let me know if there are any other changes that you'd like to make or
issues that you've spotted.
- Ted
On 01/11/2012 04:31 PM, Ted Ts'o wrote:
> On Wed, Jan 11, 2012 at 09:48:27AM -0500, J. Bruce Fields wrote:
>>
>> Yes, this needs to be fixed--is there anything in particular holding up
>> these patches?
>
> Yes, I need more people reviewing ext4 patches, so it's not me doing
> all of the review. :-)
>
> I will put this on my "high priority to review ASAP for the next merge
> window" list.
Ted,
did you have a chance to review the patches in the mean time?
I will go on vacation on Wednesday for two weeks in Moscow and I'm not
sure yet if I will have internet access there. So if I need to update
the patches for linux-3.4, tomorrow is probably the last day me to do that.
Thanks,
Bernd
On Tue, Mar 13, 2012 at 09:03:51PM +0100, Bernd Schubert wrote:
> On 03/13/2012 09:01 PM, J. Bruce Fields wrote:
> >On Mon, Mar 12, 2012 at 06:22:50PM -0400, bfields wrote:
> >>On Mon, Mar 12, 2012 at 11:49:21AM -0400, J. Bruce Fields wrote:
> >>>On Mon, Mar 12, 2012 at 11:09:12AM -0400, Ted Ts'o wrote:
> >>>>On Fri, Mar 09, 2012 at 03:51:48PM -0500, Ted Ts'o wrote:
> >>>>>
> >>>>>Linus in general doesn't like cross tree merges (or any extraneous
> >>>>>merges) unless they are absolutely necessary; but then, he trusts git
> >>>>>merges more than many of us do. :-)
> >>>>>
> >>>>>I've put the the patch series on a separate patch, with a signed tag, at:
> >>>>>
> >>>>> git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git nfs-ext4-premerge
> >>>>>
> >>>>>Can you confirm that you've pulled it into your tree and it's landed
> >>>>>in linux-next? I probably won't bother pulling it in mine unless
> >>>>>there is definitely a merge conflict that I need to resolve. (I've
> >>>>>checked and there do not appear to be any against linux-next as of
> >>>>>last night.)
> >>>>
> >>>>Ping?
> >>>
> >>>Whoops, sorry.
> >>>
> >>>Looks perfect, thanks for handling these!
> >>>
> >>>I'm running them through my usual regression tests
> >>
> >>Urgh, I'm seeing a failure on the telldir test (part of the "special"
> >>connectathon tests). I haven't looked at what it's trying to do yet.
> >>Reproduceable even just on a local filesystem without NFS involved. If
> >>someone wants to look at it, you can just do:
> >>
> >> git clone git://linux-nfs.org/~bfields/cthon04.git
> >> cd cthon04
> >> make
> >> NFS_TESTDIR=/somewhere_on_an_ext4_fs/TMP ./runtests -s
> >>
> >>(Or after running that, more specifically,
> >>
> >> cd /somewhere_on_an_ext4_fs/TMP/
> >> ./telldir
> >>
> >>)
> >>
> >>I'll look at it tomorrow.
> >
> >Looks like seekdir is just not accepting an offset returned from
> >getdents:
> >
> > openat(AT_FDCWD, "telldir-test", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
> > getdents(3, {...{d_ino=20386, d_off=5728083968307607285, d_reclen=24, d_name="192"}...} = 4848
> > lseek(3, 5728083968307607285, SEEK_SET) = -1 EINVAL (Invalid argument)
> >
> >Still investigating.
>
> Thanks! I'm also just going to work on it (just noticed your mail
> while on vacation ...).
OK, good. You can probably figure this out faster than me....
--b.
On Mon, Mar 05, 2012 at 07:08:37PM -0500, Ted Ts'o wrote:
> On Mon, Jan 09, 2012 at 02:21:53PM +0100, Bernd Schubert wrote:
> > Just rename this variable, as the next patch will add a flag and
> > 'access' as variable name would not be correct any more.
> >
> > Signed-off-by: Bernd Schubert <[email protected]>
> > Acked-by: J. Bruce Fields<[email protected]>
> > ---
> > fs/nfsd/vfs.c | 18 ++++++++++--------
> > 1 files changed, 10 insertions(+), 8 deletions(-)
>
> On Mon, Jan 09, 2012 at 02:21:58PM +0100, Bernd Schubert wrote:
> > Use 32-bit or 64-bit llseek() hashes for directory offsets depending on
> > the NFS version. NFSv2 gets 32-bit hashes only.
> >
> > NOTE: This patch got rather complex as Christoph asked to set the
> > filp->f_mode flag in the open call or immediatly after dentry_open()
> > in nfsd_open() to avoid races.
> > Personally I still do not see a reason for that and in my opinion
> > FMODE_32BITHASH/FMODE_64BITHASH flags could be set nfsd_readdir(), as it
> > follows directly after nfsd_open() without a chance of races.
> >
> > Signed-off-by: Bernd Schubert <[email protected]>
> > Acked-by: J. Bruce Fields<[email protected]>
> > ---
> > fs/nfsd/vfs.c | 15 +++++++++++++--
> > fs/nfsd/vfs.h | 2 ++
> > 2 files changed, 15 insertions(+), 2 deletions(-)
>
> NFS folks,
>
> Since Bruce has ack'ed these patches I assume it will be OK if I carry
> them in the ext4 tree for merging? (They depend on changes in ext4
> found in the first two patches of these patch series.)
Yes, that would be fine.
(There may well be some conflict: simplest might be if you can get that
pushed out to a stable public branch somewhere, hopefully before the
merge window opens, and that'll give me a chance to pull it and handle
any merge conflicts.
Then as long as your initial pull request goes in early (so that the
ext4 changes don't get sucked in with mine), everything works.)
--b.
On Mon, Mar 05, 2012 at 09:08:19PM -0500, J. Bruce Fields wrote:
> Yes, that would be fine.
>
> (There may well be some conflict: simplest might be if you can get that
> pushed out to a stable public branch somewhere, hopefully before the
> merge window opens, and that'll give me a chance to pull it and handle
> any merge conflicts.
Are the commits you're worried about on the NFS side already in
linux-next? If so, I'll do a quick test merge and see if there are
any conflicts. If so, I'll segregate out these patches into a
separate pre-merge branch --- and then if you like you could pull them
into your branch and we can deal with it that way.
- Ted
On 03/06/2012 03:28 AM, Ted Ts'o wrote:
> On Tue, Mar 06, 2012 at 01:40:05AM +0100, Bernd Schubert wrote:
>>
>> Yeah, you are right, we also should check for 64-bit EOF. But
>> wouldn't be something like this be better?
>>
>> /* check for hash collision */
>> if(is_32bit_api() ) {
>> if (hash == (EXT4_HTREE_EOF_32BIT<< 1))
>> hash = (EXT4_HTREE_EOF_32BIT - 1)<< 1;
>> } else {
>> if (hash == (EXT4_HTREE_EOF_64BIT<< 1))
>> hash = (EXT4_HTREE_EOF_64BIT - 1)<< 1;
>> }
>
> Actually, neither change is needed, now that I look at things more
> closely. hash is a __u32, so it could never been
> EXT4_HTREE_EOF_64BIT. But given that we won't let major hash become
> larger than 0xfffffffc, that means the largest possible position value
> is 0x7ffffffeffffffff. So using an EOF value of 0x0x7fffffffffffffff
> will work fine.
Ah, I looked after 1 a.m., seems that was too late for me to notice.
>
> The bigger problem that I found when I looked more closely at the
> patch is that the patch uses f_flags in places where f_mode needs to
> be used:
>
> static inline loff_t hash2pos(struct file *filp, __u32 major, __u32 minor)
> {
> if ((filp->f_flags& FMODE_32BITHASH) ||
> ^^^^^^^
> (!(filp->f_flags& FMODE_64BITHASH)&& is_32bit_api()))
> ^^^^^^^
> return major>> 1;
> else
> return ((__u64)(major>> 1)<< 32) | (__u64)minor;
> }
>
> static inline __u32 pos2maj_hash(struct file *filp, loff_t pos)
> {
> if ((filp->f_flags& FMODE_32BITHASH) ||
> ^^^^^^
> (!(filp->f_mode& FMODE_64BITHASH)&& is_32bit_api()))
> ^^^^^^
> return (pos<< 1)& 0xffffffff;
> else
> return ((pos>> 32)<< 1)& 0xffffffff;
> }
>
> Which makes me wonder how much this has been tested?
Arg, my bad, I introduced this issue when I converted from f_flags to
f_mode, seems I forgot all of those above :(
Hrm, I thought I had tested sufficiently, but obviously I did not :(
Here's the test tool.
http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/test_seekdir/
While quickly looking, I think it only affects NFSv2, which I think I
indeed didn't test. I only run tests for 32 bit and 64-bit user space
and NFSv3. But yes, NFSv2 is an important test too. Not sure if I will
find time for that today.
Will send an updated version later on.
Thanks for your review,
Bernd
On 03/06/2012 04:15 PM, Ted Ts'o wrote:
> On Tue, Mar 06, 2012 at 10:59:55AM +0100, Bernd Schubert wrote:
>> Arg, my bad, I introduced this issue when I converted from f_flags
>> to f_mode, seems I forgot all of those above :(
>> Hrm, I thought I had tested sufficiently, but obviously I did not :(
>> Here's the test tool.
>> http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/test_seekdir/
>>
>> While quickly looking, I think it only affects NFSv2, which I think
>> I indeed didn't test. I only run tests for 32 bit and 64-bit user
>> space and NFSv3. But yes, NFSv2 is an important test too. Not sure
>> if I will find time for that today.
>
> I think the problem case to worry about would have been on a 32-bit
> NFS server (think embedded/bookshelf NAS servers) and NFSv3, since
> that's where the inconsistency would have meant that hash2pos would
> use the 32-bit codepath, but pos2maj_hash would use the 64-bit
> codepath, and so then... kablooey.
>
>> Will send an updated version later on.
>
> If it's just those changes, no worries, as I've already fixed up the
> f_flags vs. f_mode issue in my copy of the patches.
>
> Let me know if there are any other changes that you'd like to make or
> issues that you've spotted.
Ted, thanks a bunch! I really will try to be more careful updating
patches next time. And during the next days I'm also going to review the
patches again, at least I hope I find the time for that while being on
vacation).
Thanks again,
Bernd
On Mon, Mar 12, 2012 at 11:09:12AM -0400, Ted Ts'o wrote:
> On Fri, Mar 09, 2012 at 03:51:48PM -0500, Ted Ts'o wrote:
> >
> > Linus in general doesn't like cross tree merges (or any extraneous
> > merges) unless they are absolutely necessary; but then, he trusts git
> > merges more than many of us do. :-)
> >
> > I've put the the patch series on a separate patch, with a signed tag, at:
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git nfs-ext4-premerge
> >
> > Can you confirm that you've pulled it into your tree and it's landed
> > in linux-next? I probably won't bother pulling it in mine unless
> > there is definitely a merge conflict that I need to resolve. (I've
> > checked and there do not appear to be any against linux-next as of
> > last night.)
>
> Ping?
Whoops, sorry.
Looks perfect, thanks for handling these!
I'm running them through my usual regression tests and then I'll push
the result out soon, with a "here's why I'm merging this" comment on the
merge commit. (Though I can't see it would trigger Linus's usual
back-merge rant anyway, as it's obviously quite targetted, not a case of
"oh I think I'll update to -rc5 now".)
--b.
On Mon, Mar 19, 2012 at 04:00:41PM -0400, J. Bruce Fields wrote:
> Sorry that was more of a hassle than it could have been.... It would be
> nice to try rewriting the cthon tests for xfstests, if that would help
> catch nfs-related bugs sooner. I don't have the time for now, though.
No problem. I'd love to look at adapting the connectathon tests for
xfstests as well, but I don't have the time either.
I'll put it on an ext4 todo list; maybe someone who is looking to get
into ext4 programming will take this on as a task.
- Ted
On Tue, Mar 13, 2012 at 10:09:05PM +0100, Bernd Schubert wrote:
> Hmm, there must have gone something wrong on merging,
In that case one approach would be to rebase your last-sent patches on
to the same base as Ted's versions, confirm that one still works and the
other doesn't, and do a diff....
> my own test
> also fails
>
> http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/test_seekdir/
>
> (Sorry, it does not say 'failure', but one needs to compare the file
> names and telldir-offset numbers)
>
> I think I will continue in the morning as its already 1 a.m. here.
OK, thanks for following up!
--b.
On Mon, Jan 09, 2012 at 02:21:48PM +0100, Bernd Schubert wrote:
> diff --git a/fs/ext4/hash.c b/fs/ext4/hash.c
> index ac8f168..fa8e491 100644
> --- a/fs/ext4/hash.c
> +++ b/fs/ext4/hash.c
> @@ -200,8 +200,8 @@ int ext4fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo)
> return -1;
> }
> hash = hash & ~1;
> - if (hash == (EXT4_HTREE_EOF << 1))
> - hash = (EXT4_HTREE_EOF-1) << 1;
> + if (hash == (EXT4_HTREE_EOF_32BIT << 1))
> + hash = (EXT4_HTREE_EOF_32BIT - 1) << 1;
> hinfo->hash = hash;
> hinfo->minor_hash = minor_hash;
> return 0;
Is there a reason why we don't need to avoid the collsion with the
64-bit EOF value as well? i.e., I think we also need to add:
if (hash == (EXT4_HTREE_EOF_64BIT << 1))
hash = (EXT4_HTREE_EOF_64BIT - 1) << 1;
- Ted
On Tue, Mar 13, 2012 at 04:34:46PM -0400, J. Bruce Fields wrote:
> >
> > Thanks! I'm also just going to work on it (just noticed your mail
> > while on vacation ...).
Brend, thanks!!
Bruce, random question --- do you know what the copyright and
licensing terms are for the connectathon test suite? It occurs to me
that it might be good to get some of these tests into xfstests, given
that you mentioned that the telldir test was failing even when run on
a local file system. That says to me this is a good file system level
test that we should try to get into xfstests if possible.
Getting it into xfstests would also tend to reduce the chance of
regressions, since I always run xfstests before I submit patches (and
in fact I did run xfstests before sending you a pull request Bernd's
patches --- and I'm glad your testing caught the problem when mine
didn't!)
Thanks,
- Ted
On 03/14/2012 05:05 PM, J. Bruce Fields wrote:
> On Wed, Mar 14, 2012 at 03:32:45PM +0100, Bernd Schubert wrote:
>> shall I sent an updated version of the previous ext4 patch
>> or is this patch sufficient?
>
> Up to Ted.
>
> Ted, I haven't published my merge of your branch, so at least for my
> purposes I don't care whether you fold this in or pop another patch on
> top.
Ted, could you please tell me how to proceed?
Thanks,
Bernd
On Tue, Mar 06, 2012 at 01:40:05AM +0100, Bernd Schubert wrote:
>
> Yeah, you are right, we also should check for 64-bit EOF. But
> wouldn't be something like this be better?
>
> /* check for hash collision */
> if(is_32bit_api() ) {
> if (hash == (EXT4_HTREE_EOF_32BIT<< 1))
> hash = (EXT4_HTREE_EOF_32BIT - 1)<< 1;
> } else {
> if (hash == (EXT4_HTREE_EOF_64BIT<< 1))
> hash = (EXT4_HTREE_EOF_64BIT - 1)<< 1;
> }
Actually, neither change is needed, now that I look at things more
closely. hash is a __u32, so it could never been
EXT4_HTREE_EOF_64BIT. But given that we won't let major hash become
larger than 0xfffffffc, that means the largest possible position value
is 0x7ffffffeffffffff. So using an EOF value of 0x0x7fffffffffffffff
will work fine.
The bigger problem that I found when I looked more closely at the
patch is that the patch uses f_flags in places where f_mode needs to
be used:
static inline loff_t hash2pos(struct file *filp, __u32 major, __u32 minor)
{
if ((filp->f_flags & FMODE_32BITHASH) ||
^^^^^^^
(!(filp->f_flags & FMODE_64BITHASH) && is_32bit_api()))
^^^^^^^
return major >> 1;
else
return ((__u64)(major >> 1) << 32) | (__u64)minor;
}
static inline __u32 pos2maj_hash(struct file *filp, loff_t pos)
{
if ((filp->f_flags & FMODE_32BITHASH) ||
^^^^^^
(!(filp->f_mode & FMODE_64BITHASH) && is_32bit_api()))
^^^^^^
return (pos << 1) & 0xffffffff;
else
return ((pos >> 32) << 1) & 0xffffffff;
}
Which makes me wonder how much this has been tested?
- Ted
On Fri, Mar 09, 2012 at 03:51:48PM -0500, Ted Ts'o wrote:
>
> Linus in general doesn't like cross tree merges (or any extraneous
> merges) unless they are absolutely necessary; but then, he trusts git
> merges more than many of us do. :-)
>
> I've put the the patch series on a separate patch, with a signed tag, at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git nfs-ext4-premerge
>
> Can you confirm that you've pulled it into your tree and it's landed
> in linux-next? I probably won't bother pulling it in mine unless
> there is definitely a merge conflict that I need to resolve. (I've
> checked and there do not appear to be any against linux-next as of
> last night.)
Ping?
- Ted
On Mon, Jan 09, 2012 at 02:21:53PM +0100, Bernd Schubert wrote:
> Just rename this variable, as the next patch will add a flag and
> 'access' as variable name would not be correct any more.
>
> Signed-off-by: Bernd Schubert <[email protected]>
> Acked-by: J. Bruce Fields<[email protected]>
> ---
> fs/nfsd/vfs.c | 18 ++++++++++--------
> 1 files changed, 10 insertions(+), 8 deletions(-)
On Mon, Jan 09, 2012 at 02:21:58PM +0100, Bernd Schubert wrote:
> Use 32-bit or 64-bit llseek() hashes for directory offsets depending on
> the NFS version. NFSv2 gets 32-bit hashes only.
>
> NOTE: This patch got rather complex as Christoph asked to set the
> filp->f_mode flag in the open call or immediatly after dentry_open()
> in nfsd_open() to avoid races.
> Personally I still do not see a reason for that and in my opinion
> FMODE_32BITHASH/FMODE_64BITHASH flags could be set nfsd_readdir(), as it
> follows directly after nfsd_open() without a chance of races.
>
> Signed-off-by: Bernd Schubert <[email protected]>
> Acked-by: J. Bruce Fields<[email protected]>
> ---
> fs/nfsd/vfs.c | 15 +++++++++++++--
> fs/nfsd/vfs.h | 2 ++
> 2 files changed, 15 insertions(+), 2 deletions(-)
NFS folks,
Since Bruce has ack'ed these patches I assume it will be OK if I carry
them in the ext4 tree for merging? (They depend on changes in ext4
found in the first two patches of these patch series.)
- Ted
On Mon, Mar 12, 2012 at 06:22:50PM -0400, bfields wrote:
> On Mon, Mar 12, 2012 at 11:49:21AM -0400, J. Bruce Fields wrote:
> > On Mon, Mar 12, 2012 at 11:09:12AM -0400, Ted Ts'o wrote:
> > > On Fri, Mar 09, 2012 at 03:51:48PM -0500, Ted Ts'o wrote:
> > > >
> > > > Linus in general doesn't like cross tree merges (or any extraneous
> > > > merges) unless they are absolutely necessary; but then, he trusts git
> > > > merges more than many of us do. :-)
> > > >
> > > > I've put the the patch series on a separate patch, with a signed tag, at:
> > > >
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git nfs-ext4-premerge
> > > >
> > > > Can you confirm that you've pulled it into your tree and it's landed
> > > > in linux-next? I probably won't bother pulling it in mine unless
> > > > there is definitely a merge conflict that I need to resolve. (I've
> > > > checked and there do not appear to be any against linux-next as of
> > > > last night.)
> > >
> > > Ping?
> >
> > Whoops, sorry.
> >
> > Looks perfect, thanks for handling these!
> >
> > I'm running them through my usual regression tests
>
> Urgh, I'm seeing a failure on the telldir test (part of the "special"
> connectathon tests). I haven't looked at what it's trying to do yet.
> Reproduceable even just on a local filesystem without NFS involved. If
> someone wants to look at it, you can just do:
>
> git clone git://linux-nfs.org/~bfields/cthon04.git
> cd cthon04
> make
> NFS_TESTDIR=/somewhere_on_an_ext4_fs/TMP ./runtests -s
>
> (Or after running that, more specifically,
>
> cd /somewhere_on_an_ext4_fs/TMP/
> ./telldir
>
> )
>
> I'll look at it tomorrow.
Looks like seekdir is just not accepting an offset returned from
getdents:
openat(AT_FDCWD, "telldir-test", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
getdents(3, {...{d_ino=20386, d_off=5728083968307607285, d_reclen=24, d_name="192"}...} = 4848
lseek(3, 5728083968307607285, SEEK_SET) = -1 EINVAL (Invalid argument)
Still investigating.
--b.
On Sun, Mar 18, 2012 at 10:54:55PM -0400, Ted Ts'o wrote:
> On Fri, Mar 16, 2012 at 10:22:51PM +0100, Bernd Schubert wrote:
> > On 03/14/2012 05:05 PM, J. Bruce Fields wrote:
> > >On Wed, Mar 14, 2012 at 03:32:45PM +0100, Bernd Schubert wrote:
> > >>shall I sent an updated version of the previous ext4 patch
> > >>or is this patch sufficient?
> > >
> > >Up to Ted.
> > >
> > >Ted, I haven't published my merge of your branch, so at least for my
> > >purposes I don't care whether you fold this in or pop another patch on
> > >top.
> >
> > Ted, could you please tell me how to proceed?
>
> I've merged it in.
>
> Bruce, I have an updated tag and branch at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git nfs-ext4-premerge
>
> Can you give it a try and pull it in if it passes your tests?
Looks good.
(Merged and pushed out to my git://linux-nfs.org/~bfields/linux.git.)
Sorry that was more of a hassle than it could have been.... It would be
nice to try rewriting the cthon tests for xfstests, if that would help
catch nfs-related bugs sooner. I don't have the time for now, though.
Merging a topic branch like this looks like it should work fine, anyway.
Thanks!
--b.
commit 62b9510cb373d5722fdaba71d961d8f695acfcd5
Merge: 8546ee5 06effdb
Author: J. Bruce Fields <[email protected]>
Date: Mon Mar 19 12:34:39 2012 -0400
nfsd: merge cookie collision fixes from ext4 tree
These changes fix readdir loops on ext4 filesystems with dir_index
turned on. I'm pulling them from Ted's tree as I'd like to give them
some extra nfsd testing, and expect to be applying (potentially
conflicting) patches to the same code before the next merge window.
From the nfs-ext4-premerge branch of
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Signed-off-by: J. Bruce Fields <[email protected]>
On 03/13/2012 09:01 PM, J. Bruce Fields wrote:
> On Mon, Mar 12, 2012 at 06:22:50PM -0400, bfields wrote:
>> On Mon, Mar 12, 2012 at 11:49:21AM -0400, J. Bruce Fields wrote:
>>> On Mon, Mar 12, 2012 at 11:09:12AM -0400, Ted Ts'o wrote:
>>>> On Fri, Mar 09, 2012 at 03:51:48PM -0500, Ted Ts'o wrote:
>>>>>
>>>>> Linus in general doesn't like cross tree merges (or any extraneous
>>>>> merges) unless they are absolutely necessary; but then, he trusts git
>>>>> merges more than many of us do. :-)
>>>>>
>>>>> I've put the the patch series on a separate patch, with a signed tag, at:
>>>>>
>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git nfs-ext4-premerge
>>>>>
>>>>> Can you confirm that you've pulled it into your tree and it's landed
>>>>> in linux-next? I probably won't bother pulling it in mine unless
>>>>> there is definitely a merge conflict that I need to resolve. (I've
>>>>> checked and there do not appear to be any against linux-next as of
>>>>> last night.)
>>>>
>>>> Ping?
>>>
>>> Whoops, sorry.
>>>
>>> Looks perfect, thanks for handling these!
>>>
>>> I'm running them through my usual regression tests
>>
>> Urgh, I'm seeing a failure on the telldir test (part of the "special"
>> connectathon tests). I haven't looked at what it's trying to do yet.
>> Reproduceable even just on a local filesystem without NFS involved. If
>> someone wants to look at it, you can just do:
>>
>> git clone git://linux-nfs.org/~bfields/cthon04.git
>> cd cthon04
>> make
>> NFS_TESTDIR=/somewhere_on_an_ext4_fs/TMP ./runtests -s
>>
>> (Or after running that, more specifically,
>>
>> cd /somewhere_on_an_ext4_fs/TMP/
>> ./telldir
>>
>> )
>>
>> I'll look at it tomorrow.
>
> Looks like seekdir is just not accepting an offset returned from
> getdents:
>
> openat(AT_FDCWD, "telldir-test", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
> getdents(3, {...{d_ino=20386, d_off=5728083968307607285, d_reclen=24, d_name="192"}...} = 4848
> lseek(3, 5728083968307607285, SEEK_SET) = -1 EINVAL (Invalid argument)
>
> Still investigating.
Thanks! I'm also just going to work on it (just noticed your mail while
on vacation ...).
On 03/13/2012 09:34 PM, J. Bruce Fields wrote:
> On Tue, Mar 13, 2012 at 09:03:51PM +0100, Bernd Schubert wrote:
>> On 03/13/2012 09:01 PM, J. Bruce Fields wrote:
>>> On Mon, Mar 12, 2012 at 06:22:50PM -0400, bfields wrote:
>>>> On Mon, Mar 12, 2012 at 11:49:21AM -0400, J. Bruce Fields wrote:
>>>>> On Mon, Mar 12, 2012 at 11:09:12AM -0400, Ted Ts'o wrote:
>>>>>> On Fri, Mar 09, 2012 at 03:51:48PM -0500, Ted Ts'o wrote:
>>>>>>>
>>>>>>> Linus in general doesn't like cross tree merges (or any extraneous
>>>>>>> merges) unless they are absolutely necessary; but then, he trusts git
>>>>>>> merges more than many of us do. :-)
>>>>>>>
>>>>>>> I've put the the patch series on a separate patch, with a signed tag, at:
>>>>>>>
>>>>>>> git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git nfs-ext4-premerge
>>>>>>>
>>>>>>> Can you confirm that you've pulled it into your tree and it's landed
>>>>>>> in linux-next? I probably won't bother pulling it in mine unless
>>>>>>> there is definitely a merge conflict that I need to resolve. (I've
>>>>>>> checked and there do not appear to be any against linux-next as of
>>>>>>> last night.)
>>>>>>
>>>>>> Ping?
>>>>>
>>>>> Whoops, sorry.
>>>>>
>>>>> Looks perfect, thanks for handling these!
>>>>>
>>>>> I'm running them through my usual regression tests
>>>>
>>>> Urgh, I'm seeing a failure on the telldir test (part of the "special"
>>>> connectathon tests). I haven't looked at what it's trying to do yet.
>>>> Reproduceable even just on a local filesystem without NFS involved. If
>>>> someone wants to look at it, you can just do:
>>>>
>>>> git clone git://linux-nfs.org/~bfields/cthon04.git
>>>> cd cthon04
>>>> make
>>>> NFS_TESTDIR=/somewhere_on_an_ext4_fs/TMP ./runtests -s
>>>>
>>>> (Or after running that, more specifically,
>>>>
>>>> cd /somewhere_on_an_ext4_fs/TMP/
>>>> ./telldir
>>>>
>>>> )
>>>>
>>>> I'll look at it tomorrow.
>>>
>>> Looks like seekdir is just not accepting an offset returned from
>>> getdents:
>>>
>>> openat(AT_FDCWD, "telldir-test", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
>>> getdents(3, {...{d_ino=20386, d_off=5728083968307607285, d_reclen=24, d_name="192"}...} = 4848
>>> lseek(3, 5728083968307607285, SEEK_SET) = -1 EINVAL (Invalid argument)
>>>
>>> Still investigating.
>>
>> Thanks! I'm also just going to work on it (just noticed your mail
>> while on vacation ...).
>
> OK, good. You can probably figure this out faster than me....
>
Hmm, there must have gone something wrong on merging, my own test also
fails
http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/test_seekdir/
(Sorry, it does not say 'failure', but one needs to compare the file
names and telldir-offset numbers)
I think I will continue in the morning as its already 1 a.m. here.
Cheers,
Bernd
On Mon, Mar 12, 2012 at 11:49:21AM -0400, J. Bruce Fields wrote:
> On Mon, Mar 12, 2012 at 11:09:12AM -0400, Ted Ts'o wrote:
> > On Fri, Mar 09, 2012 at 03:51:48PM -0500, Ted Ts'o wrote:
> > >
> > > Linus in general doesn't like cross tree merges (or any extraneous
> > > merges) unless they are absolutely necessary; but then, he trusts git
> > > merges more than many of us do. :-)
> > >
> > > I've put the the patch series on a separate patch, with a signed tag, at:
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git nfs-ext4-premerge
> > >
> > > Can you confirm that you've pulled it into your tree and it's landed
> > > in linux-next? I probably won't bother pulling it in mine unless
> > > there is definitely a merge conflict that I need to resolve. (I've
> > > checked and there do not appear to be any against linux-next as of
> > > last night.)
> >
> > Ping?
>
> Whoops, sorry.
>
> Looks perfect, thanks for handling these!
>
> I'm running them through my usual regression tests
Urgh, I'm seeing a failure on the telldir test (part of the "special"
connectathon tests). I haven't looked at what it's trying to do yet.
Reproduceable even just on a local filesystem without NFS involved. If
someone wants to look at it, you can just do:
git clone git://linux-nfs.org/~bfields/cthon04.git
cd cthon04
make
NFS_TESTDIR=/somewhere_on_an_ext4_fs/TMP ./runtests -s
(Or after running that, more specifically,
cd /somewhere_on_an_ext4_fs/TMP/
./telldir
)
I'll look at it tomorrow.
--b.
On Tue, Mar 06, 2012 at 10:18:16AM -0500, Ted Ts'o wrote:
> On Mon, Mar 05, 2012 at 09:08:19PM -0500, J. Bruce Fields wrote:
> > Yes, that would be fine.
> >
> > (There may well be some conflict: simplest might be if you can get that
> > pushed out to a stable public branch somewhere, hopefully before the
> > merge window opens, and that'll give me a chance to pull it and handle
> > any merge conflicts.
>
> Are the commits you're worried about on the NFS side already in
> linux-next?
No (though there might be some other conflicts there I haven't thought
of).
> If so, I'll do a quick test merge and see if there are
> any conflicts. If so, I'll segregate out these patches into a
> separate pre-merge branch --- and then if you like you could pull them
> into your branch and we can deal with it that way.
That sounds like a good way to do it regardless?
--b.
On 03/05/2012 04:59 PM, Ted Ts'o wrote:
> On Mon, Jan 09, 2012 at 02:21:48PM +0100, Bernd Schubert wrote:
>> diff --git a/fs/ext4/hash.c b/fs/ext4/hash.c
>> index ac8f168..fa8e491 100644
>> --- a/fs/ext4/hash.c
>> +++ b/fs/ext4/hash.c
>> @@ -200,8 +200,8 @@ int ext4fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo)
>> return -1;
>> }
>> hash = hash& ~1;
>> - if (hash == (EXT4_HTREE_EOF<< 1))
>> - hash = (EXT4_HTREE_EOF-1)<< 1;
>> + if (hash == (EXT4_HTREE_EOF_32BIT<< 1))
>> + hash = (EXT4_HTREE_EOF_32BIT - 1)<< 1;
>> hinfo->hash = hash;
>> hinfo->minor_hash = minor_hash;
>> return 0;
>
> Is there a reason why we don't need to avoid the collsion with the
> 64-bit EOF value as well? i.e., I think we also need to add:
>
> if (hash == (EXT4_HTREE_EOF_64BIT<< 1))
> hash = (EXT4_HTREE_EOF_64BIT - 1)<< 1;
>
> - Ted
Thanks for looking into it, really appreciated!
Yeah, you are right, we also should check for 64-bit EOF. But wouldn't
be something like this be better?
/* check for hash collision */
if(is_32bit_api() ) {
if (hash == (EXT4_HTREE_EOF_32BIT<< 1))
hash = (EXT4_HTREE_EOF_32BIT - 1)<< 1;
} else {
if (hash == (EXT4_HTREE_EOF_64BIT<< 1))
hash = (EXT4_HTREE_EOF_64BIT - 1)<< 1;
}
Or am I over engineering?
Thanks,
Bernd
On Tue, Mar 06, 2012 at 10:28:43AM -0500, J. Bruce Fields wrote:
> > If so, I'll do a quick test merge and see if there are
> > any conflicts. If so, I'll segregate out these patches into a
> > separate pre-merge branch --- and then if you like you could pull them
> > into your branch and we can deal with it that way.
>
> That sounds like a good way to do it regardless?
Linus in general doesn't like cross tree merges (or any extraneous
merges) unless they are absolutely necessary; but then, he trusts git
merges more than many of us do. :-)
I've put the the patch series on a separate patch, with a signed tag, at:
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git nfs-ext4-premerge
Can you confirm that you've pulled it into your tree and it's landed
in linux-next? I probably won't bother pulling it in mine unless
there is definitely a merge conflict that I need to resolve. (I've
checked and there do not appear to be any against linux-next as of
last night.)
- Ted
On Tue, Mar 13, 2012 at 05:10:09PM -0400, Ted Ts'o wrote:
> On Tue, Mar 13, 2012 at 04:34:46PM -0400, J. Bruce Fields wrote:
> > >
> > > Thanks! I'm also just going to work on it (just noticed your mail
> > > while on vacation ...).
>
> Brend, thanks!!
>
> Bruce, random question --- do you know what the copyright and
> licensing terms are for the connectathon test suite?
Alas. Nobody does!
I'ts unlikely there's any owner who would care enough to make a fuss--so
we keep using the tests, but they don't really get any love.
It's one of those annoyances we argue about once a year or so, and then
forget till next time.
So, maybe one way forward would be to reimplement the connectathon tests
for xfstests?
--b.
> It occurs to me
> that it might be good to get some of these tests into xfstests, given
> that you mentioned that the telldir test was failing even when run on
> a local file system. That says to me this is a good file system level
> test that we should try to get into xfstests if possible.
>
> Getting it into xfstests would also tend to reduce the chance of
> regressions, since I always run xfstests before I submit patches (and
> in fact I did run xfstests before sending you a pull request Bernd's
> patches --- and I'm glad your testing caught the problem when mine
> didn't!)
>
> Thanks,
>
> - Ted
On 03/13/2012 10:29 PM, J. Bruce Fields wrote:
> On Tue, Mar 13, 2012 at 10:09:05PM +0100, Bernd Schubert wrote:
>> Hmm, there must have gone something wrong on merging,
>
> In that case one approach would be to rebase your last-sent patches on
> to the same base as Ted's versions, confirm that one still works and the
> other doesn't, and do a diff....
>
>> my own test
>> also fails
>>
>> http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/test_seekdir/
>>
>> (Sorry, it does not say 'failure', but one needs to compare the file
>> names and telldir-offset numbers)
>>
>> I think I will continue in the morning as its already 1 a.m. here.
>
> OK, thanks for following up!
Took me some time to figure out what is going on and in the end I previously forgot
another test case - I always tested directories being sufficiently large, so that
they got the EXT4_INODE_INDEX flag. However, the cython tests failed on a small
directory, which doesn't have that flag. But then ext4_readdir() always uses the
dx version, I guess in order to always return the same offset values
(in the sense of converting a non-dx dir to dx). In ext4_dir_llseek() I only
tested for the EXT4_INODE_INDEX flag, which is not correct in any case.
So here is the patch, shall I sent an updated version of the previous ext4 patch
or is this patch sufficient?
Signed-off-by: Bernd Schubert <[email protected]>
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 71a66ff..6a19a35 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -44,6 +44,24 @@ static unsigned char get_dtype(struct super_block *sb, int filetype)
return (ext4_filetype_table[filetype]);
}
+/**
+ * Check if the given dir-inode refers to an htree indexed directory
+ *
+ * Return 1 if it is a dx dir, 0 if not
+ */
+static int is_dx_dir(struct inode *inode)
+{
+ struct super_block *sb = inode->i_sb;
+
+ if (EXT4_HAS_COMPAT_FEATURE(inode->i_sb,
+ EXT4_FEATURE_COMPAT_DIR_INDEX) &&
+ ((ext4_test_inode_flag(inode, EXT4_INODE_INDEX)) ||
+ ((inode->i_size >> sb->s_blocksize_bits) == 1)))
+ return 1;
+
+ return 0;
+}
+
/*
* Return 0 if the directory entry is OK, and 1 if there is a problem
*
@@ -99,18 +117,13 @@ static int ext4_readdir(struct file *filp,
unsigned int offset;
int i, stored;
struct ext4_dir_entry_2 *de;
- struct super_block *sb;
int err;
struct inode *inode = filp->f_path.dentry->d_inode;
+ struct super_block *sb = inode->i_sb;
int ret = 0;
int dir_has_error = 0;
- sb = inode->i_sb;
-
- if (EXT4_HAS_COMPAT_FEATURE(inode->i_sb,
- EXT4_FEATURE_COMPAT_DIR_INDEX) &&
- ((ext4_test_inode_flag(inode, EXT4_INODE_INDEX)) ||
- ((inode->i_size >> sb->s_blocksize_bits) == 1))) {
+ if (is_dx_dir(inode)) {
err = ext4_dx_readdir(filp, dirent, filldir);
if (err != ERR_BAD_DX_DIR) {
ret = err;
@@ -308,7 +321,7 @@ loff_t ext4_dir_llseek(struct file *file, loff_t offset, int origin)
{
struct inode *inode = file->f_mapping->host;
loff_t ret = -EINVAL;
- int is_dx_dir = ext4_test_inode_flag(inode, EXT4_INODE_INDEX);
+ int dx_dir = is_dx_dir(inode);
mutex_lock(&inode->i_mutex);
@@ -323,7 +336,7 @@ loff_t ext4_dir_llseek(struct file *file, loff_t offset, int origin)
/* so only negative offsets are left, does that have a
* meaning for directories at all? */
- if (is_dx_dir)
+ if (dx_dir)
offset += ext4_get_htree_eof(file);
else
offset += inode->i_size;
@@ -347,7 +360,7 @@ loff_t ext4_dir_llseek(struct file *file, loff_t offset, int origin)
if (unlikely(offset < 0))
goto out_err;
- if (!is_dx_dir) {
+ if (!dx_dir) {
if (offset > inode->i_sb->s_maxbytes)
goto out_err;
} else if (offset > ext4_get_htree_eof(file))
On Wed, Mar 14, 2012 at 03:32:45PM +0100, Bernd Schubert wrote:
> On 03/13/2012 10:29 PM, J. Bruce Fields wrote:
> > On Tue, Mar 13, 2012 at 10:09:05PM +0100, Bernd Schubert wrote:
> >> Hmm, there must have gone something wrong on merging,
> >
> > In that case one approach would be to rebase your last-sent patches on
> > to the same base as Ted's versions, confirm that one still works and the
> > other doesn't, and do a diff....
> >
> >> my own test
> >> also fails
> >>
> >> http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/test_seekdir/
> >>
> >> (Sorry, it does not say 'failure', but one needs to compare the file
> >> names and telldir-offset numbers)
> >>
> >> I think I will continue in the morning as its already 1 a.m. here.
> >
> > OK, thanks for following up!
>
> Took me some time to figure out what is going on and in the end I previously forgot
> another test case - I always tested directories being sufficiently large, so that
> they got the EXT4_INODE_INDEX flag. However, the cython tests failed on a small
> directory, which doesn't have that flag. But then ext4_readdir() always uses the
> dx version, I guess in order to always return the same offset values
> (in the sense of converting a non-dx dir to dx). In ext4_dir_llseek() I only
> tested for the EXT4_INODE_INDEX flag, which is not correct in any case.
> So here is the patch,
Yep, that passes my usual tests now, thanks!
> shall I sent an updated version of the previous ext4 patch
> or is this patch sufficient?
Up to Ted.
Ted, I haven't published my merge of your branch, so at least for my
purposes I don't care whether you fold this in or pop another patch on
top.
--b.
>
> Signed-off-by: Bernd Schubert <[email protected]>
>
> diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
> index 71a66ff..6a19a35 100644
> --- a/fs/ext4/dir.c
> +++ b/fs/ext4/dir.c
> @@ -44,6 +44,24 @@ static unsigned char get_dtype(struct super_block *sb, int filetype)
> return (ext4_filetype_table[filetype]);
> }
>
> +/**
> + * Check if the given dir-inode refers to an htree indexed directory
> + *
> + * Return 1 if it is a dx dir, 0 if not
> + */
> +static int is_dx_dir(struct inode *inode)
> +{
> + struct super_block *sb = inode->i_sb;
> +
> + if (EXT4_HAS_COMPAT_FEATURE(inode->i_sb,
> + EXT4_FEATURE_COMPAT_DIR_INDEX) &&
> + ((ext4_test_inode_flag(inode, EXT4_INODE_INDEX)) ||
> + ((inode->i_size >> sb->s_blocksize_bits) == 1)))
> + return 1;
> +
> + return 0;
> +}
> +
> /*
> * Return 0 if the directory entry is OK, and 1 if there is a problem
> *
> @@ -99,18 +117,13 @@ static int ext4_readdir(struct file *filp,
> unsigned int offset;
> int i, stored;
> struct ext4_dir_entry_2 *de;
> - struct super_block *sb;
> int err;
> struct inode *inode = filp->f_path.dentry->d_inode;
> + struct super_block *sb = inode->i_sb;
> int ret = 0;
> int dir_has_error = 0;
>
> - sb = inode->i_sb;
> -
> - if (EXT4_HAS_COMPAT_FEATURE(inode->i_sb,
> - EXT4_FEATURE_COMPAT_DIR_INDEX) &&
> - ((ext4_test_inode_flag(inode, EXT4_INODE_INDEX)) ||
> - ((inode->i_size >> sb->s_blocksize_bits) == 1))) {
> + if (is_dx_dir(inode)) {
> err = ext4_dx_readdir(filp, dirent, filldir);
> if (err != ERR_BAD_DX_DIR) {
> ret = err;
> @@ -308,7 +321,7 @@ loff_t ext4_dir_llseek(struct file *file, loff_t offset, int origin)
> {
> struct inode *inode = file->f_mapping->host;
> loff_t ret = -EINVAL;
> - int is_dx_dir = ext4_test_inode_flag(inode, EXT4_INODE_INDEX);
> + int dx_dir = is_dx_dir(inode);
>
> mutex_lock(&inode->i_mutex);
>
> @@ -323,7 +336,7 @@ loff_t ext4_dir_llseek(struct file *file, loff_t offset, int origin)
>
> /* so only negative offsets are left, does that have a
> * meaning for directories at all? */
> - if (is_dx_dir)
> + if (dx_dir)
> offset += ext4_get_htree_eof(file);
> else
> offset += inode->i_size;
> @@ -347,7 +360,7 @@ loff_t ext4_dir_llseek(struct file *file, loff_t offset, int origin)
> if (unlikely(offset < 0))
> goto out_err;
>
> - if (!is_dx_dir) {
> + if (!dx_dir) {
> if (offset > inode->i_sb->s_maxbytes)
> goto out_err;
> } else if (offset > ext4_get_htree_eof(file))
>
>
>
>
On Fri, Mar 16, 2012 at 10:22:51PM +0100, Bernd Schubert wrote:
> On 03/14/2012 05:05 PM, J. Bruce Fields wrote:
> >On Wed, Mar 14, 2012 at 03:32:45PM +0100, Bernd Schubert wrote:
> >>shall I sent an updated version of the previous ext4 patch
> >>or is this patch sufficient?
> >
> >Up to Ted.
> >
> >Ted, I haven't published my merge of your branch, so at least for my
> >purposes I don't care whether you fold this in or pop another patch on
> >top.
>
> Ted, could you please tell me how to proceed?
I've merged it in.
Bruce, I have an updated tag and branch at:
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git nfs-ext4-premerge
Can you give it a try and pull it in if it passes your tests?
- Ted
On 4/22/12 7:51 AM, Bernd Schubert wrote:
> On 04/20/2012 10:04 PM, Eric Sandeen wrote:
>> On 1/9/12 7:21 AM, Bernd Schubert wrote:
>>> From: Fan Yong <[email protected]>
>>>
>>> Traditionally ext2/3/4 has returned a 32-bit hash value from llseek()
>>> to appease NFSv2, which can only handle a 32-bit cookie for seekdir()
>>> and telldir(). However, this causes problems if there are 32-bit hash
>>> collisions, since the NFSv2 server can get stuck resending the same
>>> entries from the directory repeatedly.
>>>
>>> Allow ext4 to return a full 64-bit hash (both major and minor) for
>>> telldir to decrease the chance of hash collisions. This still needs
>>> integration on the NFS side.
>>>
>>> Patch-updated-by: Bernd Schubert <[email protected]>
>>> (blame me if something is not correct)
>>
>> Bernd, I've merged this to ext3. Bruce thought maybe you were working
>> on the same. Should I send mine?
>
> That is perfectly fine with me.
>
>>
>> Also...
>>
>>> +/*
>>> + * ext4_dir_llseek() based on generic_file_llseek() to handle both
>>> + * non-htree and htree directories, where the "offset" is in terms
>>> + * of the filename hash value instead of the byte offset.
>>> + *
>>> + * NOTE: offsets obtained *before* ext4_set_inode_flag(dir, EXT4_INODE_INDEX)
>>> + * will be invalid once the directory was converted into a dx directory
>>> + */
>>> +loff_t ext4_dir_llseek(struct file *file, loff_t offset, int origin)
>>
>> ext4_llseek() worries about max offset for direct/indirect vs. extent-mapped
>> files. Do we need to worry about the same thing in this function?
>
> Hrmm, I just checked it and I think either is wrong. We only have to
> care about non-dx directories, so ext4_readdir() applies, which limits
> filp->f_pos < inode->i_size.
> Going to send a patch tomorrow. Thanks for spotting this!
The other thing I'm wondering is whether, in light of
ef3d0fd27e90f67e35da516dafc1482c82939a60 vfs: do (nearly) lockless generic_file_llseek
taking the i_mutex in ext4_dir_llseek could be a perf regression vs what was there before? Is there anything about the new function which requires stronger locking?
I may be missing something obvious about the nfs interaction, not sure.
-Eric
> Cheers,
> Bernd
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Thu, Apr 12, 2012 at 11:22:41PM +0200, Bernd Schubert wrote:
> On 04/12/2012 10:49 PM, J. Bruce Fields wrote:
> > Stupid question: is there any fundamental feature of ext4 this depends
> > on, or would this fix work equally well for fs/ext3?
>
> First of all, Bruce and Ted, thanks a lot for your patience with those
> patches!
>
> I have not looked into it in detail yet, but it should be possible to
> back port it to ext3. I don't think there is anything that would depend
> on ext4. I also have ext3 on my todo list, but I first wanted to have it
> in ext4 (in the sense of stability of ext3, if there still should be a
> bug...) and I think for 3.4 it is too late now anyways. So unless there
> are objections, lets target the back port to ext3 for 3.5?
That sounds sensible, thanks!
--b.
On 04/12/2012 10:49 PM, J. Bruce Fields wrote:
> On Wed, Mar 14, 2012 at 03:32:45PM +0100, Bernd Schubert wrote:
>> On 03/13/2012 10:29 PM, J. Bruce Fields wrote:
>>> On Tue, Mar 13, 2012 at 10:09:05PM +0100, Bernd Schubert wrote:
>>>> Hmm, there must have gone something wrong on merging,
>>>
>>> In that case one approach would be to rebase your last-sent patches on
>>> to the same base as Ted's versions, confirm that one still works and the
>>> other doesn't, and do a diff....
>>>
>>>> my own test
>>>> also fails
>>>>
>>>> http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/test_seekdir/
>>>>
>>>> (Sorry, it does not say 'failure', but one needs to compare the file
>>>> names and telldir-offset numbers)
>>>>
>>>> I think I will continue in the morning as its already 1 a.m. here.
>>>
>>> OK, thanks for following up!
>
> Stupid question: is there any fundamental feature of ext4 this depends
> on, or would this fix work equally well for fs/ext3?
First of all, Bruce and Ted, thanks a lot for your patience with those
patches!
I have not looked into it in detail yet, but it should be possible to
back port it to ext3. I don't think there is anything that would depend
on ext4. I also have ext3 on my todo list, but I first wanted to have it
in ext4 (in the sense of stability of ext3, if there still should be a
bug...) and I think for 3.4 it is too late now anyways. So unless there
are objections, lets target the back port to ext3 for 3.5?
Cheers,
Bernd
On Wed, Mar 14, 2012 at 03:32:45PM +0100, Bernd Schubert wrote:
> On 03/13/2012 10:29 PM, J. Bruce Fields wrote:
> > On Tue, Mar 13, 2012 at 10:09:05PM +0100, Bernd Schubert wrote:
> >> Hmm, there must have gone something wrong on merging,
> >
> > In that case one approach would be to rebase your last-sent patches on
> > to the same base as Ted's versions, confirm that one still works and the
> > other doesn't, and do a diff....
> >
> >> my own test
> >> also fails
> >>
> >> http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/test_seekdir/
> >>
> >> (Sorry, it does not say 'failure', but one needs to compare the file
> >> names and telldir-offset numbers)
> >>
> >> I think I will continue in the morning as its already 1 a.m. here.
> >
> > OK, thanks for following up!
Stupid question: is there any fundamental feature of ext4 this depends
on, or would this fix work equally well for fs/ext3?
--b.
>
> Took me some time to figure out what is going on and in the end I previously forgot
> another test case - I always tested directories being sufficiently large, so that
> they got the EXT4_INODE_INDEX flag. However, the cython tests failed on a small
> directory, which doesn't have that flag. But then ext4_readdir() always uses the
> dx version, I guess in order to always return the same offset values
> (in the sense of converting a non-dx dir to dx). In ext4_dir_llseek() I only
> tested for the EXT4_INODE_INDEX flag, which is not correct in any case.
> So here is the patch, shall I sent an updated version of the previous ext4 patch
> or is this patch sufficient?
>
> Signed-off-by: Bernd Schubert <[email protected]>
>
> diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
> index 71a66ff..6a19a35 100644
> --- a/fs/ext4/dir.c
> +++ b/fs/ext4/dir.c
> @@ -44,6 +44,24 @@ static unsigned char get_dtype(struct super_block *sb, int filetype)
> return (ext4_filetype_table[filetype]);
> }
>
> +/**
> + * Check if the given dir-inode refers to an htree indexed directory
> + *
> + * Return 1 if it is a dx dir, 0 if not
> + */
> +static int is_dx_dir(struct inode *inode)
> +{
> + struct super_block *sb = inode->i_sb;
> +
> + if (EXT4_HAS_COMPAT_FEATURE(inode->i_sb,
> + EXT4_FEATURE_COMPAT_DIR_INDEX) &&
> + ((ext4_test_inode_flag(inode, EXT4_INODE_INDEX)) ||
> + ((inode->i_size >> sb->s_blocksize_bits) == 1)))
> + return 1;
> +
> + return 0;
> +}
> +
> /*
> * Return 0 if the directory entry is OK, and 1 if there is a problem
> *
> @@ -99,18 +117,13 @@ static int ext4_readdir(struct file *filp,
> unsigned int offset;
> int i, stored;
> struct ext4_dir_entry_2 *de;
> - struct super_block *sb;
> int err;
> struct inode *inode = filp->f_path.dentry->d_inode;
> + struct super_block *sb = inode->i_sb;
> int ret = 0;
> int dir_has_error = 0;
>
> - sb = inode->i_sb;
> -
> - if (EXT4_HAS_COMPAT_FEATURE(inode->i_sb,
> - EXT4_FEATURE_COMPAT_DIR_INDEX) &&
> - ((ext4_test_inode_flag(inode, EXT4_INODE_INDEX)) ||
> - ((inode->i_size >> sb->s_blocksize_bits) == 1))) {
> + if (is_dx_dir(inode)) {
> err = ext4_dx_readdir(filp, dirent, filldir);
> if (err != ERR_BAD_DX_DIR) {
> ret = err;
> @@ -308,7 +321,7 @@ loff_t ext4_dir_llseek(struct file *file, loff_t offset, int origin)
> {
> struct inode *inode = file->f_mapping->host;
> loff_t ret = -EINVAL;
> - int is_dx_dir = ext4_test_inode_flag(inode, EXT4_INODE_INDEX);
> + int dx_dir = is_dx_dir(inode);
>
> mutex_lock(&inode->i_mutex);
>
> @@ -323,7 +336,7 @@ loff_t ext4_dir_llseek(struct file *file, loff_t offset, int origin)
>
> /* so only negative offsets are left, does that have a
> * meaning for directories at all? */
> - if (is_dx_dir)
> + if (dx_dir)
> offset += ext4_get_htree_eof(file);
> else
> offset += inode->i_size;
> @@ -347,7 +360,7 @@ loff_t ext4_dir_llseek(struct file *file, loff_t offset, int origin)
> if (unlikely(offset < 0))
> goto out_err;
>
> - if (!is_dx_dir) {
> + if (!dx_dir) {
> if (offset > inode->i_sb->s_maxbytes)
> goto out_err;
> } else if (offset > ext4_get_htree_eof(file))
>
>
>
>
On 2012-04-23, at 5:23 PM, Eric Sandeen wrote:
> I'm curious about the above as well as:
>
> case SEEK_END:
> if (unlikely(offset > 0))
> goto out_err; /* not supported for directories */
>
> The previous .llseek handler, and the generic handler for other filesystems, allow seeking past the end of the dir AFAICT. (not sure why you'd want to, but I don't see that you'd get an error back).
>
> Is there a reason to uniquely exclude it in ext4? Does that line up with POSIX?
I don't know what the origin of this was... I don't think there is a real reason for it except that it doesn't make any sense to do so.
Cheers, Andreas
--
Andreas Dilger Whamcloud, Inc.
Principal Lustre Engineer http://www.whamcloud.com/
On 4/23/12 3:52 PM, Bernd Schubert wrote:
> On 04/23/2012 10:37 PM, Eric Sandeen wrote:
>> On 4/22/12 7:51 AM, Bernd Schubert wrote:
>>> On 04/20/2012 10:04 PM, Eric Sandeen wrote:
>>>> On 1/9/12 7:21 AM, Bernd Schubert wrote:
>>>>> From: Fan Yong <[email protected]>
>>>>>
>>>>> Traditionally ext2/3/4 has returned a 32-bit hash value from llseek()
>>>>> to appease NFSv2, which can only handle a 32-bit cookie for seekdir()
>>>>> and telldir(). However, this causes problems if there are 32-bit hash
>>>>> collisions, since the NFSv2 server can get stuck resending the same
>>>>> entries from the directory repeatedly.
>>>>>
>>>>> Allow ext4 to return a full 64-bit hash (both major and minor) for
>>>>> telldir to decrease the chance of hash collisions. This still needs
>>>>> integration on the NFS side.
>>>>>
>>>>> Patch-updated-by: Bernd Schubert <[email protected]>
>>>>> (blame me if something is not correct)
>>>>
>>>> Bernd, I've merged this to ext3. Bruce thought maybe you were working
>>>> on the same. Should I send mine?
>>>
>>> That is perfectly fine with me.
>>>
>>>>
>>>> Also...
>>>>
>>>>> +/*
>>>>> + * ext4_dir_llseek() based on generic_file_llseek() to handle both
>>>>> + * non-htree and htree directories, where the "offset" is in terms
>>>>> + * of the filename hash value instead of the byte offset.
>>>>> + *
>>>>> + * NOTE: offsets obtained *before* ext4_set_inode_flag(dir, EXT4_INODE_INDEX)
>>>>> + * will be invalid once the directory was converted into a dx directory
>>>>> + */
>>>>> +loff_t ext4_dir_llseek(struct file *file, loff_t offset, int origin)
>>>>
>>>> ext4_llseek() worries about max offset for direct/indirect vs. extent-mapped
>>>> files. Do we need to worry about the same thing in this function?
>>>
>>> Hrmm, I just checked it and I think either is wrong. We only have to
>>> care about non-dx directories, so ext4_readdir() applies, which limits
>>> filp->f_pos < inode->i_size.
>>> Going to send a patch tomorrow. Thanks for spotting this!
>>
>> The other thing I'm wondering is whether, in light of
>>
>> ef3d0fd27e90f67e35da516dafc1482c82939a60 vfs: do (nearly) lockless generic_file_llseek
>>
>> taking the i_mutex in ext4_dir_llseek could be a perf regression vs what was there before? Is there anything about the new function which requires stronger locking?
>>
>> I may be missing something obvious about the nfs interaction, not sure.
>>
>
> Oh, good point. I was just about to send a small patch, but reading
> through the lockless commit will take some time - its already too late
> for me for today. Will work on that tomorrow. Thanks again for your review!
Sorry it's so late :(
-Eric
> Cheers,
> Bernd
>
>
>
> diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
> index b867862..3a4988e2 100644
> --- a/fs/ext4/dir.c
> +++ b/fs/ext4/dir.c
> @@ -363,7 +363,7 @@ loff_t ext4_dir_llseek(struct file *file, loff_t
> offset, int origin)
> goto out_err;
>
> if (!dx_dir) {
> - if (offset > inode->i_sb->s_maxbytes)
> + if (offset > i_size_read(inode))
> goto out_err;
> } else if (offset > ext4_get_htree_eof(file))
> goto out_err;
>
>
On 04/23/2012 10:37 PM, Eric Sandeen wrote:
> On 4/22/12 7:51 AM, Bernd Schubert wrote:
>> On 04/20/2012 10:04 PM, Eric Sandeen wrote:
>>> On 1/9/12 7:21 AM, Bernd Schubert wrote:
>>>> From: Fan Yong <[email protected]>
>>>>
>>>> Traditionally ext2/3/4 has returned a 32-bit hash value from llseek()
>>>> to appease NFSv2, which can only handle a 32-bit cookie for seekdir()
>>>> and telldir(). However, this causes problems if there are 32-bit hash
>>>> collisions, since the NFSv2 server can get stuck resending the same
>>>> entries from the directory repeatedly.
>>>>
>>>> Allow ext4 to return a full 64-bit hash (both major and minor) for
>>>> telldir to decrease the chance of hash collisions. This still needs
>>>> integration on the NFS side.
>>>>
>>>> Patch-updated-by: Bernd Schubert <[email protected]>
>>>> (blame me if something is not correct)
>>>
>>> Bernd, I've merged this to ext3. Bruce thought maybe you were working
>>> on the same. Should I send mine?
>>
>> That is perfectly fine with me.
>>
>>>
>>> Also...
>>>
>>>> +/*
>>>> + * ext4_dir_llseek() based on generic_file_llseek() to handle both
>>>> + * non-htree and htree directories, where the "offset" is in terms
>>>> + * of the filename hash value instead of the byte offset.
>>>> + *
>>>> + * NOTE: offsets obtained *before* ext4_set_inode_flag(dir, EXT4_INODE_INDEX)
>>>> + * will be invalid once the directory was converted into a dx directory
>>>> + */
>>>> +loff_t ext4_dir_llseek(struct file *file, loff_t offset, int origin)
>>>
>>> ext4_llseek() worries about max offset for direct/indirect vs. extent-mapped
>>> files. Do we need to worry about the same thing in this function?
>>
>> Hrmm, I just checked it and I think either is wrong. We only have to
>> care about non-dx directories, so ext4_readdir() applies, which limits
>> filp->f_pos < inode->i_size.
>> Going to send a patch tomorrow. Thanks for spotting this!
>
> The other thing I'm wondering is whether, in light of
>
> ef3d0fd27e90f67e35da516dafc1482c82939a60 vfs: do (nearly) lockless generic_file_llseek
>
> taking the i_mutex in ext4_dir_llseek could be a perf regression vs what was there before? Is there anything about the new function which requires stronger locking?
>
> I may be missing something obvious about the nfs interaction, not sure.
>
Oh, good point. I was just about to send a small patch, but reading
through the lockless commit will take some time - its already too late
for me for today. Will work on that tomorrow. Thanks again for your review!
Cheers,
Bernd
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index b867862..3a4988e2 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -363,7 +363,7 @@ loff_t ext4_dir_llseek(struct file *file, loff_t
offset, int origin)
goto out_err;
if (!dx_dir) {
- if (offset > inode->i_sb->s_maxbytes)
+ if (offset > i_size_read(inode))
goto out_err;
} else if (offset > ext4_get_htree_eof(file))
goto out_err;
On 4/23/12 3:52 PM, Bernd Schubert wrote:
> On 04/23/2012 10:37 PM, Eric Sandeen wrote:
...
> diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
> index b867862..3a4988e2 100644
> --- a/fs/ext4/dir.c
> +++ b/fs/ext4/dir.c
> @@ -363,7 +363,7 @@ loff_t ext4_dir_llseek(struct file *file, loff_t
> offset, int origin)
> goto out_err;
>
> if (!dx_dir) {
> - if (offset > inode->i_sb->s_maxbytes)
> + if (offset > i_size_read(inode))
> goto out_err;
> } else if (offset > ext4_get_htree_eof(file))
> goto out_err;
I'm curious about the above as well as:
case SEEK_END:
if (unlikely(offset > 0))
goto out_err; /* not supported for directories */
The previous .llseek handler, and the generic handler for other filesystems, allow seeking past the end of the dir AFAICT. (not sure why you'd want to, but I don't see that you'd get an error back).
Is there a reason to uniquely exclude it in ext4? Does that line up with POSIX?
-Eric
On 1/9/12 7:21 AM, Bernd Schubert wrote:
> From: Fan Yong <[email protected]>
>
> Traditionally ext2/3/4 has returned a 32-bit hash value from llseek()
> to appease NFSv2, which can only handle a 32-bit cookie for seekdir()
> and telldir(). However, this causes problems if there are 32-bit hash
> collisions, since the NFSv2 server can get stuck resending the same
> entries from the directory repeatedly.
>
> Allow ext4 to return a full 64-bit hash (both major and minor) for
> telldir to decrease the chance of hash collisions. This still needs
> integration on the NFS side.
>
> Patch-updated-by: Bernd Schubert <[email protected]>
> (blame me if something is not correct)
Bernd, I've merged this to ext3. Bruce thought maybe you were working
on the same. Should I send mine?
Also...
> +/*
> + * ext4_dir_llseek() based on generic_file_llseek() to handle both
> + * non-htree and htree directories, where the "offset" is in terms
> + * of the filename hash value instead of the byte offset.
> + *
> + * NOTE: offsets obtained *before* ext4_set_inode_flag(dir, EXT4_INODE_INDEX)
> + * will be invalid once the directory was converted into a dx directory
> + */
> +loff_t ext4_dir_llseek(struct file *file, loff_t offset, int origin)
ext4_llseek() worries about max offset for direct/indirect vs. extent-mapped
files. Do we need to worry about the same thing in this function?
-Eric
On 04/24/2012 12:42 AM, Andreas Dilger wrote:
> On 2012-04-23, at 5:23 PM, Eric Sandeen wrote:
>> I'm curious about the above as well as:
>>
>> case SEEK_END:
>> if (unlikely(offset> 0))
>> goto out_err; /* not supported for directories */
>>
>> The previous .llseek handler, and the generic handler for other filesystems, allow seeking past the end of the dir AFAICT. (not sure why you'd want to, but I don't see that you'd get an error back).
>>
>> Is there a reason to uniquely exclude it in ext4? Does that line up with POSIX?
>
> I don't know what the origin of this was... I don't think there is a real reason for it except that it doesn't make any sense to do so.
>
I think I added that. According to pubs.opengroup.org:
(http://pubs.opengroup.org/onlinepubs/009695399/functions/seekdir.html)
void seekdir(DIR *dirp, long loc);
<quote>
If the value of loc was not obtained from an earlier call to telldir(),
or if a call to rewinddir() occurred between the call to telldir() and
the call to seekdir(), the results of subsequent calls to readdir() are
unspecified.
</quote>
As telldir(), which should correlate to 'case SEEK_CUR' will not provide
invalid values, the behaviour is undefined.
Also,
case SEEK_END:
[...]
if (dx_dir)
offset += ext4_get_htree_eof(file);
else
offset += inode->i_size;
[...]
if (!dx_dir) {
if (offset > inode->i_sb->s_maxbytes)
goto out_err;
} else if (offset > ext4_get_htree_eof(file))
goto out_err;
Hence, the additional:
case SEEK_END:
if (unlikely(offset> 0))
goto out_err; /* not supported for directories */
is just a shortcut to avoid useless calculations.
Unless I missed something, it only remains the question if could break
existing applications relying on undefined behaviour. However, I have no
idea how an application might trigger that?
Thanks,
Bernd
On 04/20/2012 10:04 PM, Eric Sandeen wrote:
> On 1/9/12 7:21 AM, Bernd Schubert wrote:
>> From: Fan Yong <[email protected]>
>>
>> Traditionally ext2/3/4 has returned a 32-bit hash value from llseek()
>> to appease NFSv2, which can only handle a 32-bit cookie for seekdir()
>> and telldir(). However, this causes problems if there are 32-bit hash
>> collisions, since the NFSv2 server can get stuck resending the same
>> entries from the directory repeatedly.
>>
>> Allow ext4 to return a full 64-bit hash (both major and minor) for
>> telldir to decrease the chance of hash collisions. This still needs
>> integration on the NFS side.
>>
>> Patch-updated-by: Bernd Schubert <[email protected]>
>> (blame me if something is not correct)
>
> Bernd, I've merged this to ext3. Bruce thought maybe you were working
> on the same. Should I send mine?
That is perfectly fine with me.
>
> Also...
>
>> +/*
>> + * ext4_dir_llseek() based on generic_file_llseek() to handle both
>> + * non-htree and htree directories, where the "offset" is in terms
>> + * of the filename hash value instead of the byte offset.
>> + *
>> + * NOTE: offsets obtained *before* ext4_set_inode_flag(dir, EXT4_INODE_INDEX)
>> + * will be invalid once the directory was converted into a dx directory
>> + */
>> +loff_t ext4_dir_llseek(struct file *file, loff_t offset, int origin)
>
> ext4_llseek() worries about max offset for direct/indirect vs. extent-mapped
> files. Do we need to worry about the same thing in this function?
Hrmm, I just checked it and I think either is wrong. We only have to
care about non-dx directories, so ext4_readdir() applies, which limits
filp->f_pos < inode->i_size.
Going to send a patch tomorrow. Thanks for spotting this!
Cheers,
Bernd