With the ext3/ext4 directory index implementation hashes are used to specify
offsets for llseek(). For compatibility with NFSv2 and 32-bit user space
on 64-bit systems (kernel space) ext3/ext4 currently only return 32-bit
hashes and therefore the probability of hash collisions for larger directories
is rather high. As recently reported on the NFS mailing list that theoretical
problem also happens on real systems:
http://comments.gmane.org/gmane.linux.nfs/40863
The following series adds two new f_mode flags to tell ext4
to use 32-bit or 64-bit hash values for llseek() calls.
These flags can then used by network file systems, such as NFS, to
request 32-bit or 64-bit offsets (hashes).
Version 3:
- remove patch "RFC: Remove check for a 32-bit cookie in nfsd4_readdir()",
I think Bruce wanted to take it seperately as bug fix. It should be applied
before applying the remaining NFS patches, as without it NFSv4 will always
fail with the new 64-bit ext4 seek hashes.
- split "nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes)" into two
two separate patches as suggested by Bruce, one patch to rename
'access' to 'may_flags'. And the remainder of the original patch to set
FMODE_32BITHASH/FMODE_64BITHASH flags and to introduce the new
NFSD_MAY_64BIT_COOKIE flag
Version 2:
- use f_mode instead of O_* flags and also in a separate patch
- introduce EXT4_HTREE_EOF_32BIT and EXT4_HTREE_EOF_64BIT
- fix SEEK_END in ext4_dir_llseek()
- set f_mode flags in NFS code as early as possible and introduce a new
NFSD_MAY_64BIT_COOKIE flag for that
--
Bernd Schubert
Fraunhofer ITWM
Those flags are supposed to be set by NFS readdir() to tell ext3/ext4
to 32bit (NFSv2) or 64bit hash values (offsets) in seekdir().
Signed-off-by: Bernd Schubert <[email protected]>
---
include/linux/fs.h | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 178cdb4..18d40ae 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -91,6 +91,11 @@ struct inodes_stat_t {
/* File is opened using open(.., 3, ..) and is writeable only for ioctls
(specialy hack for floppy.c) */
#define FMODE_WRITE_IOCTL ((__force fmode_t)0x100)
+/* 32bit hashes as llseek() offset (for directories) */
+#define FMODE_32BITHASH ((__force fmode_t)0x200)
+/* 64bit hashes as llseek() offset (for directories) */
+#define FMODE_64BITHASH ((__force fmode_t)0x400)
+
/*
* Don't update ctime and mtime.
From: Fan Yong <[email protected]>
Traditionally ext2/3/4 has returned a 32-bit hash value from llseek()
to appease NFSv2, which can only handle a 32-bit cookie for seekdir()
and telldir(). However, this causes problems if there are 32-bit hash
collisions, since the NFSv2 server can get stuck resending the same
entries from the directory repeatedly.
Allow ext4 to return a full 64-bit hash (both major and minor) for
telldir to decrease the chance of hash collisions. This still needs
integration on the NFS side.
Patch-updated-by: Bernd Schubert <[email protected]>
(blame me if something is not correct)
Signed-off-by: Fan Yong <[email protected]>
Signed-off-by: Andreas Dilger <[email protected]>
Signed-off-by: Bernd Schubert <[email protected]>
---
fs/ext4/dir.c | 185 ++++++++++++++++++++++++++++++++++++++++++++------------
fs/ext4/ext4.h | 6 ++
fs/ext4/hash.c | 4 +
3 files changed, 154 insertions(+), 41 deletions(-)
diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 164c560..cc47087 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -32,24 +32,8 @@ static unsigned char ext4_filetype_table[] = {
DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
};
-static int ext4_readdir(struct file *, void *, filldir_t);
static int ext4_dx_readdir(struct file *filp,
void *dirent, filldir_t filldir);
-static int ext4_release_dir(struct inode *inode,
- struct file *filp);
-
-const struct file_operations ext4_dir_operations = {
- .llseek = ext4_llseek,
- .read = generic_read_dir,
- .readdir = ext4_readdir, /* we take BKL. needed?*/
- .unlocked_ioctl = ext4_ioctl,
-#ifdef CONFIG_COMPAT
- .compat_ioctl = ext4_compat_ioctl,
-#endif
- .fsync = ext4_sync_file,
- .release = ext4_release_dir,
-};
-
static unsigned char get_dtype(struct super_block *sb, int filetype)
{
@@ -254,22 +238,134 @@ out:
return ret;
}
+static inline int is_32bit_api(void)
+{
+#ifdef HAVE_IS_COMPAT_TASK
+ return is_compat_task();
+#else
+ return (BITS_PER_LONG == 32);
+#endif
+}
+
/*
* These functions convert from the major/minor hash to an f_pos
- * value.
+ * value for dx directories
+ *
+ * Upper layer (for example NFS) should specify FMODE_32BITHASH or
+ * FMODE_64BITHASH explicitly. On the other hand, we allow ext4 to be mounted
+ * directly on both 32-bit and 64-bit nodes, under such case, neither
+ * FMODE_32BITHASH nor FMODE_64BITHASH is specified.
+ */
+static inline loff_t hash2pos(struct file *filp, __u32 major, __u32 minor)
+{
+ if ((filp->f_flags & FMODE_32BITHASH) ||
+ (!(filp->f_flags & FMODE_64BITHASH) && is_32bit_api()))
+ return major >> 1;
+ else
+ return ((__u64)(major >> 1) << 32) | (__u64)minor;
+}
+
+static inline __u32 pos2maj_hash(struct file *filp, loff_t pos)
+{
+ if ((filp->f_flags & FMODE_32BITHASH) ||
+ (!(filp->f_mode & FMODE_64BITHASH) && is_32bit_api()))
+ return (pos << 1) & 0xffffffff;
+ else
+ return ((pos >> 32) << 1) & 0xffffffff;
+}
+
+static inline __u32 pos2min_hash(struct file *filp, loff_t pos)
+{
+ if ((filp->f_flags & FMODE_32BITHASH) ||
+ (!(filp->f_flags & FMODE_64BITHASH) && is_32bit_api()))
+ return 0;
+ else
+ return pos & 0xffffffff;
+}
+
+/*
+ * Return 32- or 64-bit end-of-file for dx directories
+ */
+static inline loff_t ext4_get_htree_eof(struct file *filp)
+{
+ if ((filp->f_mode & FMODE_32BITHASH) ||
+ (!(filp->f_mode & FMODE_64BITHASH) && is_32bit_api()))
+ return EXT4_HTREE_EOF_32BIT;
+ else
+ return EXT4_HTREE_EOF_64BIT;
+}
+
+
+/*
+ * ext4_dir_llseek() based on generic_file_llseek() to handle both
+ * non-htree and htree directories, where the "offset" is in terms
+ * of the filename hash value instead of the byte offset.
*
- * Currently we only use major hash numer. This is unfortunate, but
- * on 32-bit machines, the same VFS interface is used for lseek and
- * llseek, so if we use the 64 bit offset, then the 32-bit versions of
- * lseek/telldir/seekdir will blow out spectacularly, and from within
- * the ext2 low-level routine, we don't know if we're being called by
- * a 64-bit version of the system call or the 32-bit version of the
- * system call. Worse yet, NFSv2 only allows for a 32-bit readdir
- * cookie. Sigh.
+ * NOTE: offsets obtained *before* ext4_set_inode_flag(dir, EXT4_INODE_INDEX)
+ * will be invalid once the directory was converted into a dx directory
*/
-#define hash2pos(major, minor) (major >> 1)
-#define pos2maj_hash(pos) ((pos << 1) & 0xffffffff)
-#define pos2min_hash(pos) (0)
+loff_t ext4_dir_llseek(struct file *file, loff_t offset, int origin)
+{
+ struct inode *inode = file->f_mapping->host;
+ loff_t ret = -EINVAL;
+ int is_dx_dir = ext4_test_inode_flag(inode, EXT4_INODE_INDEX);
+
+ mutex_lock(&inode->i_mutex);
+
+ /* NOTE: relative offsets with dx directories might not work
+ * as expected, as it is difficult to figure out the
+ * correct offset between dx hashes */
+
+ switch (origin) {
+ case SEEK_END:
+ if (unlikely(offset > 0))
+ goto out_err; /* not supported for directories */
+
+ /* so only negative offsets are left, does that have a
+ * meaning for directories at all? */
+ if (is_dx_dir)
+ offset += ext4_get_htree_eof(file);
+ else
+ offset += inode->i_size;
+ break;
+ case SEEK_CUR:
+ /*
+ * Here we special-case the lseek(fd, 0, SEEK_CUR)
+ * position-querying operation. Avoid rewriting the "same"
+ * f_pos value back to the file because a concurrent read(),
+ * write() or lseek() might have altered it
+ */
+ if (offset == 0) {
+ offset = file->f_pos;
+ goto out_ok;
+ }
+
+ offset += file->f_pos;
+ break;
+ }
+
+ if (unlikely(offset < 0))
+ goto out_err;
+
+ if (!is_dx_dir) {
+ if (offset > inode->i_sb->s_maxbytes)
+ goto out_err;
+ } else if (offset > ext4_get_htree_eof(file))
+ goto out_err;
+
+ /* Special lock needed here? */
+ if (offset != file->f_pos) {
+ file->f_pos = offset;
+ file->f_version = 0;
+ }
+
+out_ok:
+ ret = offset;
+out_err:
+ mutex_unlock(&inode->i_mutex);
+
+ return ret;
+}
/*
* This structure holds the nodes of the red-black tree used to store
@@ -330,15 +426,16 @@ static void free_rb_tree_fname(struct rb_root *root)
}
-static struct dir_private_info *ext4_htree_create_dir_info(loff_t pos)
+static struct dir_private_info *ext4_htree_create_dir_info(struct file *filp,
+ loff_t pos)
{
struct dir_private_info *p;
p = kzalloc(sizeof(struct dir_private_info), GFP_KERNEL);
if (!p)
return NULL;
- p->curr_hash = pos2maj_hash(pos);
- p->curr_minor_hash = pos2min_hash(pos);
+ p->curr_hash = pos2maj_hash(filp, pos);
+ p->curr_minor_hash = pos2min_hash(filp, pos);
return p;
}
@@ -429,7 +526,7 @@ static int call_filldir(struct file *filp, void *dirent,
"null fname?!?\n");
return 0;
}
- curr_pos = hash2pos(fname->hash, fname->minor_hash);
+ curr_pos = hash2pos(filp, fname->hash, fname->minor_hash);
while (fname) {
error = filldir(dirent, fname->name,
fname->name_len, curr_pos,
@@ -454,13 +551,13 @@ static int ext4_dx_readdir(struct file *filp,
int ret;
if (!info) {
- info = ext4_htree_create_dir_info(filp->f_pos);
+ info = ext4_htree_create_dir_info(filp, filp->f_pos);
if (!info)
return -ENOMEM;
filp->private_data = info;
}
- if (filp->f_pos == EXT4_HTREE_EOF)
+ if (filp->f_pos == ext4_get_htree_eof(filp))
return 0; /* EOF */
/* Some one has messed with f_pos; reset the world */
@@ -468,8 +565,8 @@ static int ext4_dx_readdir(struct file *filp,
free_rb_tree_fname(&info->root);
info->curr_node = NULL;
info->extra_fname = NULL;
- info->curr_hash = pos2maj_hash(filp->f_pos);
- info->curr_minor_hash = pos2min_hash(filp->f_pos);
+ info->curr_hash = pos2maj_hash(filp, filp->f_pos);
+ info->curr_minor_hash = pos2min_hash(filp, filp->f_pos);
}
/*
@@ -501,7 +598,7 @@ static int ext4_dx_readdir(struct file *filp,
if (ret < 0)
return ret;
if (ret == 0) {
- filp->f_pos = EXT4_HTREE_EOF;
+ filp->f_pos = ext4_get_htree_eof(filp);
break;
}
info->curr_node = rb_first(&info->root);
@@ -521,7 +618,7 @@ static int ext4_dx_readdir(struct file *filp,
info->curr_minor_hash = fname->minor_hash;
} else {
if (info->next_hash == ~0) {
- filp->f_pos = EXT4_HTREE_EOF;
+ filp->f_pos = ext4_get_htree_eof(filp);
break;
}
info->curr_hash = info->next_hash;
@@ -540,3 +637,15 @@ static int ext4_release_dir(struct inode *inode, struct file *filp)
return 0;
}
+
+const struct file_operations ext4_dir_operations = {
+ .llseek = ext4_dir_llseek,
+ .read = generic_read_dir,
+ .readdir = ext4_readdir,
+ .unlocked_ioctl = ext4_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = ext4_compat_ioctl,
+#endif
+ .fsync = ext4_sync_file,
+ .release = ext4_release_dir,
+};
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e717dfd..31d9ba0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1560,7 +1560,11 @@ struct dx_hash_info
u32 *seed;
};
-#define EXT4_HTREE_EOF 0x7fffffff
+
+/* 32 and 64 bit signed EOF for dx directories */
+#define EXT4_HTREE_EOF_32BIT ((1UL << (32 - 1)) - 1)
+#define EXT4_HTREE_EOF_64BIT ((1ULL << (64 - 1)) - 1)
+
/*
* Control parameters used by ext4_htree_next_block
diff --git a/fs/ext4/hash.c b/fs/ext4/hash.c
index ac8f168..fa8e491 100644
--- a/fs/ext4/hash.c
+++ b/fs/ext4/hash.c
@@ -200,8 +200,8 @@ int ext4fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo)
return -1;
}
hash = hash & ~1;
- if (hash == (EXT4_HTREE_EOF << 1))
- hash = (EXT4_HTREE_EOF-1) << 1;
+ if (hash == (EXT4_HTREE_EOF_32BIT << 1))
+ hash = (EXT4_HTREE_EOF_32BIT - 1) << 1;
hinfo->hash = hash;
hinfo->minor_hash = minor_hash;
return 0;
On Tue, Aug 16, 2011 at 01:54:14PM +0200, Bernd Schubert wrote:
> +static inline int is_32bit_api(void)
> +{
> +#ifdef HAVE_IS_COMPAT_TASK
> + return is_compat_task();
> +#else
> + return (BITS_PER_LONG == 32);
> +#endif
I assume is_compat_task() is coming from another patch? What is the
status of that change?
In the case where is_compat_task() is not defined, we can't just test
based on BITS_PER_LONG == 32, since even on an x86_64 machine, it's
possible we're running a 32-bit binary in compat mode....
- Ted
On 2011-08-19, at 4:29 PM, Ted Ts'o wrote:
> On Tue, Aug 16, 2011 at 01:54:14PM +0200, Bernd Schubert wrote:
>> +static inline int is_32bit_api(void)
>> +{
>> +#ifdef HAVE_IS_COMPAT_TASK
>> + return is_compat_task();
>> +#else
>> + return (BITS_PER_LONG == 32);
>> +#endif
>
> I assume is_compat_task() is coming from another patch? What is the
> status of that change?
No, is_compat_task() is upstream for most (all?) of the architectures
that support hybrid 32-/64-bit operation. It is set at 32-bit syscall
entry when running on 64-bit architectures.
The only minor error in this patch (fixed with a new version from Bernd)
is that this should be under CONFIG_COMPAT instead of HAVE_IS_COMPAT_TASK.
> In the case where is_compat_task() is not defined, we can't just test
> based on BITS_PER_LONG == 32, since even on an x86_64 machine, it's
> possible we're running a 32-bit binary in compat mode....
It is definitely available on x86_64.
Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.
On Tue, Aug 16, 2011 at 01:54:04PM +0200, Bernd Schubert wrote:
> With the ext3/ext4 directory index implementation hashes are used to specify
> offsets for llseek(). For compatibility with NFSv2 and 32-bit user space
> on 64-bit systems (kernel space) ext3/ext4 currently only return 32-bit
> hashes and therefore the probability of hash collisions for larger directories
> is rather high. As recently reported on the NFS mailing list that theoretical
> problem also happens on real systems:
> http://comments.gmane.org/gmane.linux.nfs/40863
>
> The following series adds two new f_mode flags to tell ext4
> to use 32-bit or 64-bit hash values for llseek() calls.
> These flags can then used by network file systems, such as NFS, to
> request 32-bit or 64-bit offsets (hashes).
>
> Version 3:
> - remove patch "RFC: Remove check for a 32-bit cookie in nfsd4_readdir()",
> I think Bruce wanted to take it seperately as bug fix. It should be applied
> before applying the remaining NFS patches, as without it NFSv4 will always
> fail with the new 64-bit ext4 seek hashes.
Yes, applied to my for-3.2 branch at
git://linux-nfs.org/~bfields/linux.git.
For the NFS patches:
Acked-by: J. Bruce Fields <[email protected]>
OK by me if they go in through ext4 tree, or however's most convenient.
--b.
> - split "nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes)" into two
> two separate patches as suggested by Bruce, one patch to rename
> 'access' to 'may_flags'. And the remainder of the original patch to set
> FMODE_32BITHASH/FMODE_64BITHASH flags and to introduce the new
> NFSD_MAY_64BIT_COOKIE flag
>
> Version 2:
> - use f_mode instead of O_* flags and also in a separate patch
> - introduce EXT4_HTREE_EOF_32BIT and EXT4_HTREE_EOF_64BIT
> - fix SEEK_END in ext4_dir_llseek()
> - set f_mode flags in NFS code as early as possible and introduce a new
> NFSD_MAY_64BIT_COOKIE flag for that
>
> --
> Bernd Schubert
> Fraunhofer ITWM
>
On 08/23/2011 11:56 PM, J. Bruce Fields wrote:
> On Tue, Aug 16, 2011 at 01:54:04PM +0200, Bernd Schubert wrote:
>> With the ext3/ext4 directory index implementation hashes are used to specify
>> offsets for llseek(). For compatibility with NFSv2 and 32-bit user space
>> on 64-bit systems (kernel space) ext3/ext4 currently only return 32-bit
>> hashes and therefore the probability of hash collisions for larger directories
>> is rather high. As recently reported on the NFS mailing list that theoretical
>> problem also happens on real systems:
>> http://comments.gmane.org/gmane.linux.nfs/40863
>>
>> The following series adds two new f_mode flags to tell ext4
>> to use 32-bit or 64-bit hash values for llseek() calls.
>> These flags can then used by network file systems, such as NFS, to
>> request 32-bit or 64-bit offsets (hashes).
>>
>> Version 3:
>> - remove patch "RFC: Remove check for a 32-bit cookie in nfsd4_readdir()",
>> I think Bruce wanted to take it seperately as bug fix. It should be applied
>> before applying the remaining NFS patches, as without it NFSv4 will always
>> fail with the new 64-bit ext4 seek hashes.
>
> Yes, applied to my for-3.2 branch at
> git://linux-nfs.org/~bfields/linux.git.
>
> For the NFS patches:
>
> Acked-by: J. Bruce Fields<[email protected]>
>
> OK by me if they go in through ext4 tree, or however's most convenient.
Great, thanks!
Cheers,
Bernd
On 08/20/2011 08:23 AM, Andreas Dilger wrote:
> On 2011-08-19, at 4:29 PM, Ted Ts'o wrote:
>> On Tue, Aug 16, 2011 at 01:54:14PM +0200, Bernd Schubert wrote:
>>> +static inline int is_32bit_api(void)
>>> +{
>>> +#ifdef HAVE_IS_COMPAT_TASK
>>> + return is_compat_task();
>>> +#else
>>> + return (BITS_PER_LONG == 32);
>>> +#endif
>>
>> I assume is_compat_task() is coming from another patch? What is the
>> status of that change?
>
> No, is_compat_task() is upstream for most (all?) of the architectures
> that support hybrid 32-/64-bit operation. It is set at 32-bit syscall
> entry when running on 64-bit architectures.
>
> The only minor error in this patch (fixed with a new version from Bernd)
> is that this should be under CONFIG_COMPAT instead of HAVE_IS_COMPAT_TASK.
Yes sorry again about this. Could you please see patch series v4 please?
>
>> In the case where is_compat_task() is not defined, we can't just test
>> based on BITS_PER_LONG == 32, since even on an x86_64 machine, it's
>> possible we're running a 32-bit binary in compat mode....
>
> It is definitely available on x86_64.
Yep, otherwise it even wouldn't compile, at least not with patch series v4.
Thanks,
Bernd