LinuxLists.cc - [PATCH 0/2] 32/64 bit llseek hashes

2011-08-16 11:25:40

Subject: [PATCH 0/2] 32/64 bit llseek hashes

With the ext3/ext4 directory index implementation, hashes are used to specify
offsets for llseek(). For compatibility with NFSv2 and with 32-bit user space
on 64-bit systems (kernel space), ext3/ext4 currently return only 32-bit
hashes, and therefore the probability of hash collisions for larger directories
is rather high. As recently reported on the NFS mailing list, that theoretical
problem also happens on real systems:
http://comments.gmane.org/gmane.linux.nfs/40863
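
To put a rough number on "rather high", here is a small userspace sketch.
It is not part of the series. It assumes uniformly distributed hashes and
uses a hypothetical helper name; 31 bits models the current position value
(major hash >> 1), 63 bits models major plus minor hash.

```c
#include <stdint.h>

/*
 * Expected number of colliding pairs among n names when every name gets
 * a uniformly distributed hash of `bits` bits: n * (n - 1) / 2 / 2^bits.
 * Hypothetical helper for illustration only, not kernel code.
 */
double demo_expected_collisions(uint64_t n, unsigned int bits)
{
	double pairs = (double)n * (double)(n - 1) / 2.0;
	double space = 1.0;
	unsigned int i;

	/* Build 2^bits as a double; avoids shifting by 64 or more. */
	for (i = 0; i < bits; i++)
		space *= 2.0;

	return pairs / space;
}
```

Under that assumption, a directory with 100000 entries gives about 2.3
expected colliding pairs with 31 bits, and a negligible number with 63 bits.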

The following series adds two new f_mode flags to tell ext4
to use 32-bit or 64-bit hash values for llseek() calls.
These flags can then be used by network file systems, such as NFS, to
request 32-bit or 64-bit offsets (hashes).
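
As a rough userspace model of the two offset layouts (the function names are
hypothetical, the arithmetic follows the hash2pos()/pos2maj_hash()/
pos2min_hash() helpers in the ext4 patch of this series):

```c
#include <stdint.h>

/*
 * 32-bit layout: only the major hash is used. It is shifted right by
 * one so the value stays non-negative in a signed 32-bit offset.
 */
int64_t demo_pos32(uint32_t major)
{
	return (int64_t)(major >> 1);
}

/* 64-bit layout: major hash in the upper half, minor hash in the lower. */
int64_t demo_pos64(uint32_t major, uint32_t minor)
{
	return (int64_t)(((uint64_t)(major >> 1) << 32) | (uint64_t)minor);
}

/* Recover the major hash from a 64-bit position. */
uint32_t demo_pos64_major(int64_t pos)
{
	return (uint32_t)((((uint64_t)pos >> 32) << 1) & 0xffffffffu);
}

/* Recover the minor hash from a 64-bit position. */
uint32_t demo_pos64_minor(int64_t pos)
{
	return (uint32_t)((uint64_t)pos & 0xffffffffu);
}
```

Two names that share a major hash but differ in the minor hash map to the
same 32-bit position, while their 64-bit positions differ. The major hash
round-trips only because its lowest bit is always cleared before use.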

Version 3:
- remove patch "RFC: Remove check for a 32-bit cookie in nfsd4_readdir()",
I think Bruce wanted to take it separately as a bug fix. It should be applied
before applying the remaining NFS patches, as without it NFSv4 will always
fail with the new 64-bit ext4 seek hashes.
- split "nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes)" into
two separate patches as suggested by Bruce: one patch to rename
'access' to 'may_flags', and the remainder of the original patch to set
FMODE_32BITHASH/FMODE_64BITHASH flags and to introduce the new
NFSD_MAY_64BIT_COOKIE flag

Version 2:
- use f_mode instead of O_* flags and add the new flags in a separate patch
- introduce EXT4_HTREE_EOF_32BIT and EXT4_HTREE_EOF_64BIT
- fix SEEK_END in ext4_dir_llseek()
- set f_mode flags in NFS code as early as possible and introduce a new
NFSD_MAY_64BIT_COOKIE flag for that

--
Bernd Schubert
Fraunhofer ITWM

2011-08-16 11:25:44

by Bernd Schubert

Subject: [PATCH 1/6] Add new FMODE flags: FMODE_32BITHASH and FMODE_64BITHASH

Those flags are supposed to be set by NFS readdir() to tell ext3/ext4
to use 32-bit (NFSv2) or 64-bit hash values (offsets) in seekdir().

Signed-off-by: Bernd Schubert <[email protected]>
---
include/linux/fs.h | 5 +++++
1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 178cdb4..18d40ae 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -91,6 +91,11 @@ struct inodes_stat_t {
/* File is opened using open(.., 3, ..) and is writeable only for ioctls
(specialy hack for floppy.c) */
#define FMODE_WRITE_IOCTL ((__force fmode_t)0x100)
+/* 32bit hashes as llseek() offset (for directories) */
+#define FMODE_32BITHASH ((__force fmode_t)0x200)
+/* 64bit hashes as llseek() offset (for directories) */
+#define FMODE_64BITHASH ((__force fmode_t)0x400)
+

/*
* Don't update ctime and mtime.
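
How a file system might consume the two flags can be sketched in userspace
as follows. The DEMO_* names and the helper are hypothetical; the flag
values are copied from the patch above, and the decision rule mirrors the
one used in the ext4 patch of this series.

```c
#include <stdbool.h>

/* Userspace stand-ins for the new flags (values as in the patch). */
#define DEMO_FMODE_32BITHASH 0x200u
#define DEMO_FMODE_64BITHASH 0x400u

/*
 * Decide whether a 32-bit hash has to be used for this open file.
 * An explicit flag always wins. If the opener set neither flag,
 * fall back to the word size of the calling task.
 */
bool demo_use_32bit_hash(unsigned int f_mode, bool caller_is_32bit)
{
	if (f_mode & DEMO_FMODE_32BITHASH)
		return true;
	if (f_mode & DEMO_FMODE_64BITHASH)
		return false;
	return caller_is_32bit;
}
```

With neither flag set, a local open keeps its current behaviour: a 32-bit
caller gets 32-bit hashes and a 64-bit caller gets 64-bit hashes.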

2011-08-16 11:25:52

by Bernd Schubert

Subject: [PATCH 3/6] nfsd_open(): rename 'int access' to 'int may_flags' in nfsd_open()

Just rename this variable, as the next patch will add a flag and
'access' as variable name would not be correct any more.

Signed-off-by: Bernd Schubert <[email protected]>
---
fs/nfsd/vfs.c | 18 ++++++++++--------
1 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index fd0acca..ca692b4 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -708,12 +708,13 @@ static int nfsd_open_break_lease(struct inode *inode, int access)

/*
* Open an existing file or directory.
- * The access argument indicates the type of open (read/write/lock)
+ * The may_flags argument indicates the type of open (read/write/lock)
+ * and additional flags.
* N.B. After this call fhp needs an fh_put
*/
__be32
nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
- int access, struct file **filp)
+ int may_flags, struct file **filp)
{
struct dentry *dentry;
struct inode *inode;
@@ -728,7 +729,7 @@ nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
* and (hopefully) checked permission - so allow OWNER_OVERRIDE
* in case a chmod has now revoked permission.
*/
- err = fh_verify(rqstp, fhp, type, access | NFSD_MAY_OWNER_OVERRIDE);
+ err = fh_verify(rqstp, fhp, type, may_flags | NFSD_MAY_OWNER_OVERRIDE);
if (err)
goto out;

@@ -739,7 +740,7 @@ nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
* or any access when mandatory locking enabled
*/
err = nfserr_perm;
- if (IS_APPEND(inode) && (access & NFSD_MAY_WRITE))
+ if (IS_APPEND(inode) && (may_flags & NFSD_MAY_WRITE))
goto out;
/*
* We must ignore files (but only files) which might have mandatory
@@ -752,12 +753,12 @@ nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
if (!inode->i_fop)
goto out;

- host_err = nfsd_open_break_lease(inode, access);
+ host_err = nfsd_open_break_lease(inode, may_flags);
if (host_err) /* NOMEM or WOULDBLOCK */
goto out_nfserr;

- if (access & NFSD_MAY_WRITE) {
- if (access & NFSD_MAY_READ)
+ if (may_flags & NFSD_MAY_WRITE) {
+ if (may_flags & NFSD_MAY_READ)
flags = O_RDWR|O_LARGEFILE;
else
flags = O_WRONLY|O_LARGEFILE;
@@ -767,7 +768,8 @@ nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
if (IS_ERR(*filp))
host_err = PTR_ERR(*filp);
else
- host_err = ima_file_check(*filp, access);
+ host_err = ima_file_check(*filp, may_flags);
+
out_nfserr:
err = nfserrno(host_err);
out:

2011-08-16 11:25:49

by Bernd Schubert

Subject: [PATCH 2/6] Return 32/64-bit dir name hash according to usage type

From: Fan Yong <[email protected]>

Traditionally ext2/3/4 has returned a 32-bit hash value from llseek()
to appease NFSv2, which can only handle a 32-bit cookie for seekdir()
and telldir(). However, this causes problems if there are 32-bit hash
collisions, since the NFSv2 server can get stuck resending the same
entries from the directory repeatedly.

Allow ext4 to return a full 64-bit hash (both major and minor) for
telldir to decrease the chance of hash collisions. This still needs
integration on the NFS side.

Patch-updated-by: Bernd Schubert <[email protected]>
(blame me if something is not correct)

Signed-off-by: Fan Yong <[email protected]>
Signed-off-by: Andreas Dilger <[email protected]>
Signed-off-by: Bernd Schubert <[email protected]>
---
fs/ext4/dir.c | 185 ++++++++++++++++++++++++++++++++++++++++++++------------
fs/ext4/ext4.h | 6 ++
fs/ext4/hash.c | 4 +
3 files changed, 154 insertions(+), 41 deletions(-)

diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
index 164c560..cc47087 100644
--- a/fs/ext4/dir.c
+++ b/fs/ext4/dir.c
@@ -32,24 +32,8 @@ static unsigned char ext4_filetype_table[] = {
DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
};

-static int ext4_readdir(struct file *, void *, filldir_t);
static int ext4_dx_readdir(struct file *filp,
void *dirent, filldir_t filldir);
-static int ext4_release_dir(struct inode *inode,
- struct file *filp);
-
-const struct file_operations ext4_dir_operations = {
- .llseek = ext4_llseek,
- .read = generic_read_dir,
- .readdir = ext4_readdir, /* we take BKL. needed?*/
- .unlocked_ioctl = ext4_ioctl,
-#ifdef CONFIG_COMPAT
- .compat_ioctl = ext4_compat_ioctl,
-#endif
- .fsync = ext4_sync_file,
- .release = ext4_release_dir,
-};
-

static unsigned char get_dtype(struct super_block *sb, int filetype)
{
@@ -254,22 +238,134 @@ out:
return ret;
}

+static inline int is_32bit_api(void)
+{
+#ifdef HAVE_IS_COMPAT_TASK
+ return is_compat_task();
+#else
+ return (BITS_PER_LONG == 32);
+#endif
+}
+
/*
* These functions convert from the major/minor hash to an f_pos
- * value.
+ * value for dx directories
+ *
+ * Upper layer (for example NFS) should specify FMODE_32BITHASH or
+ * FMODE_64BITHASH explicitly. On the other hand, we allow ext4 to be mounted
+ * directly on both 32-bit and 64-bit nodes, under such case, neither
+ * FMODE_32BITHASH nor FMODE_64BITHASH is specified.
+ */
+static inline loff_t hash2pos(struct file *filp, __u32 major, __u32 minor)
+{
+ if ((filp->f_mode & FMODE_32BITHASH) ||
+ (!(filp->f_mode & FMODE_64BITHASH) && is_32bit_api()))
+ return major >> 1;
+ else
+ return ((__u64)(major >> 1) << 32) | (__u64)minor;
+}
+
+static inline __u32 pos2maj_hash(struct file *filp, loff_t pos)
+{
+ if ((filp->f_mode & FMODE_32BITHASH) ||
+ (!(filp->f_mode & FMODE_64BITHASH) && is_32bit_api()))
+ return (pos << 1) & 0xffffffff;
+ else
+ return ((pos >> 32) << 1) & 0xffffffff;
+}
+
+static inline __u32 pos2min_hash(struct file *filp, loff_t pos)
+{
+ if ((filp->f_mode & FMODE_32BITHASH) ||
+ (!(filp->f_mode & FMODE_64BITHASH) && is_32bit_api()))
+ return 0;
+ else
+ return pos & 0xffffffff;
+}
+
+/*
+ * Return 32- or 64-bit end-of-file for dx directories
+ */
+static inline loff_t ext4_get_htree_eof(struct file *filp)
+{
+ if ((filp->f_mode & FMODE_32BITHASH) ||
+ (!(filp->f_mode & FMODE_64BITHASH) && is_32bit_api()))
+ return EXT4_HTREE_EOF_32BIT;
+ else
+ return EXT4_HTREE_EOF_64BIT;
+}
+
+
+/*
+ * ext4_dir_llseek() based on generic_file_llseek() to handle both
+ * non-htree and htree directories, where the "offset" is in terms
+ * of the filename hash value instead of the byte offset.
*
- * Currently we only use major hash numer. This is unfortunate, but
- * on 32-bit machines, the same VFS interface is used for lseek and
- * llseek, so if we use the 64 bit offset, then the 32-bit versions of
- * lseek/telldir/seekdir will blow out spectacularly, and from within
- * the ext2 low-level routine, we don't know if we're being called by
- * a 64-bit version of the system call or the 32-bit version of the
- * system call. Worse yet, NFSv2 only allows for a 32-bit readdir
- * cookie. Sigh.
+ * NOTE: offsets obtained *before* ext4_set_inode_flag(dir, EXT4_INODE_INDEX)
+ * will be invalid once the directory was converted into a dx directory
*/
-#define hash2pos(major, minor) (major >> 1)
-#define pos2maj_hash(pos) ((pos << 1) & 0xffffffff)
-#define pos2min_hash(pos) (0)
+loff_t ext4_dir_llseek(struct file *file, loff_t offset, int origin)
+{
+ struct inode *inode = file->f_mapping->host;
+ loff_t ret = -EINVAL;
+ int is_dx_dir = ext4_test_inode_flag(inode, EXT4_INODE_INDEX);
+
+ mutex_lock(&inode->i_mutex);
+
+ /* NOTE: relative offsets with dx directories might not work
+ * as expected, as it is difficult to figure out the
+ * correct offset between dx hashes */
+
+ switch (origin) {
+ case SEEK_END:
+ if (unlikely(offset > 0))
+ goto out_err; /* not supported for directories */
+
+ /* so only negative offsets are left, does that have a
+ * meaning for directories at all? */
+ if (is_dx_dir)
+ offset += ext4_get_htree_eof(file);
+ else
+ offset += inode->i_size;
+ break;
+ case SEEK_CUR:
+ /*
+ * Here we special-case the lseek(fd, 0, SEEK_CUR)
+ * position-querying operation. Avoid rewriting the "same"
+ * f_pos value back to the file because a concurrent read(),
+ * write() or lseek() might have altered it
+ */
+ if (offset == 0) {
+ offset = file->f_pos;
+ goto out_ok;
+ }
+
+ offset += file->f_pos;
+ break;
+ }
+
+ if (unlikely(offset < 0))
+ goto out_err;
+
+ if (!is_dx_dir) {
+ if (offset > inode->i_sb->s_maxbytes)
+ goto out_err;
+ } else if (offset > ext4_get_htree_eof(file))
+ goto out_err;
+
+ /* Special lock needed here? */
+ if (offset != file->f_pos) {
+ file->f_pos = offset;
+ file->f_version = 0;
+ }
+
+out_ok:
+ ret = offset;
+out_err:
+ mutex_unlock(&inode->i_mutex);
+
+ return ret;
+}

/*
* This structure holds the nodes of the red-black tree used to store
@@ -330,15 +426,16 @@ static void free_rb_tree_fname(struct rb_root *root)
}

-static struct dir_private_info *ext4_htree_create_dir_info(loff_t pos)
+static struct dir_private_info *ext4_htree_create_dir_info(struct file *filp,
+ loff_t pos)
{
struct dir_private_info *p;

p = kzalloc(sizeof(struct dir_private_info), GFP_KERNEL);
if (!p)
return NULL;
- p->curr_hash = pos2maj_hash(pos);
- p->curr_minor_hash = pos2min_hash(pos);
+ p->curr_hash = pos2maj_hash(filp, pos);
+ p->curr_minor_hash = pos2min_hash(filp, pos);
return p;
}

@@ -429,7 +526,7 @@ static int call_filldir(struct file *filp, void *dirent,
"null fname?!?\n");
return 0;
}
- curr_pos = hash2pos(fname->hash, fname->minor_hash);
+ curr_pos = hash2pos(filp, fname->hash, fname->minor_hash);
while (fname) {
error = filldir(dirent, fname->name,
fname->name_len, curr_pos,
@@ -454,13 +551,13 @@ static int ext4_dx_readdir(struct file *filp,
int ret;

if (!info) {
- info = ext4_htree_create_dir_info(filp->f_pos);
+ info = ext4_htree_create_dir_info(filp, filp->f_pos);
if (!info)
return -ENOMEM;
filp->private_data = info;
}

- if (filp->f_pos == EXT4_HTREE_EOF)
+ if (filp->f_pos == ext4_get_htree_eof(filp))
return 0; /* EOF */

/* Some one has messed with f_pos; reset the world */
@@ -468,8 +565,8 @@ static int ext4_dx_readdir(struct file *filp,
free_rb_tree_fname(&info->root);
info->curr_node = NULL;
info->extra_fname = NULL;
- info->curr_hash = pos2maj_hash(filp->f_pos);
- info->curr_minor_hash = pos2min_hash(filp->f_pos);
+ info->curr_hash = pos2maj_hash(filp, filp->f_pos);
+ info->curr_minor_hash = pos2min_hash(filp, filp->f_pos);
}

/*
@@ -501,7 +598,7 @@ static int ext4_dx_readdir(struct file *filp,
if (ret < 0)
return ret;
if (ret == 0) {
- filp->f_pos = EXT4_HTREE_EOF;
+ filp->f_pos = ext4_get_htree_eof(filp);
break;
}
info->curr_node = rb_first(&info->root);
@@ -521,7 +618,7 @@ static int ext4_dx_readdir(struct file *filp,
info->curr_minor_hash = fname->minor_hash;
} else {
if (info->next_hash == ~0) {
- filp->f_pos = EXT4_HTREE_EOF;
+ filp->f_pos = ext4_get_htree_eof(filp);
break;
}
info->curr_hash = info->next_hash;
@@ -540,3 +637,15 @@ static int ext4_release_dir(struct inode *inode, struct file *filp)

return 0;
}
+
+const struct file_operations ext4_dir_operations = {
+ .llseek = ext4_dir_llseek,
+ .read = generic_read_dir,
+ .readdir = ext4_readdir,
+ .unlocked_ioctl = ext4_ioctl,
+#ifdef CONFIG_COMPAT
+ .compat_ioctl = ext4_compat_ioctl,
+#endif
+ .fsync = ext4_sync_file,
+ .release = ext4_release_dir,
+};
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e717dfd..31d9ba0 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1560,7 +1560,11 @@ struct dx_hash_info
u32 *seed;
};

-#define EXT4_HTREE_EOF 0x7fffffff
+
+/* 32 and 64 bit signed EOF for dx directories */
+#define EXT4_HTREE_EOF_32BIT ((1UL << (32 - 1)) - 1)
+#define EXT4_HTREE_EOF_64BIT ((1ULL << (64 - 1)) - 1)
+

/*
* Control parameters used by ext4_htree_next_block
diff --git a/fs/ext4/hash.c b/fs/ext4/hash.c
index ac8f168..fa8e491 100644
--- a/fs/ext4/hash.c
+++ b/fs/ext4/hash.c
@@ -200,8 +200,8 @@ int ext4fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo)
return -1;
}
hash = hash & ~1;
- if (hash == (EXT4_HTREE_EOF << 1))
- hash = (EXT4_HTREE_EOF-1) << 1;
+ if (hash == (EXT4_HTREE_EOF_32BIT << 1))
+ hash = (EXT4_HTREE_EOF_32BIT - 1) << 1;
hinfo->hash = hash;
hinfo->minor_hash = minor_hash;
return 0;
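
The hash.c hunk keeps one hash value out of circulation so that no directory
entry can ever sit at the position used as the EOF marker. A userspace
sketch of that adjustment, with hypothetical DEMO_/demo_ names and the
constant written out as a plain 32-bit literal:

```c
#include <stdint.h>

/* Same value as EXT4_HTREE_EOF_32BIT in the patch: 0x7fffffff. */
#define DEMO_HTREE_EOF_32BIT 0x7fffffffu

/*
 * Clear the lowest bit of the hash, then steer it away from the single
 * value whose 32-bit position (hash >> 1) would equal the EOF marker.
 */
uint32_t demo_reserve_eof(uint32_t hash)
{
	hash &= ~1u;
	if (hash == (DEMO_HTREE_EOF_32BIT << 1))
		hash = (DEMO_HTREE_EOF_32BIT - 1) << 1;
	return hash;
}
```

After this step, hash >> 1 is at most 0x7ffffffe, so a position equal to
0x7fffffff can only mean end of directory in the 32-bit layout.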

2011-08-16 11:25:57

by Bernd Schubert

Subject: [PATCH 4/6] nfsd: vfs_llseek() with 32 or 64 bit offsets (hashes)

Use 32-bit or 64-bit llseek() hashes for directory offsets depending on
the NFS version. NFSv2 gets 32-bit hashes only.

NOTE: This patch got rather complex as Christoph asked to set the
filp->f_mode flag in the open call or immediately after dentry_open()
in nfsd_open() to avoid races.
Personally I still do not see a reason for that, and in my opinion the
FMODE_32BITHASH/FMODE_64BITHASH flags could be set in nfsd_readdir(), as it
follows directly after nfsd_open() without a chance of races.

Signed-off-by: Bernd Schubert <[email protected]>
---
fs/nfsd/vfs.c | 15 +++++++++++++--
fs/nfsd/vfs.h | 2 ++
2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index ca692b4..97a99f1 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -767,9 +767,15 @@ nfsd_open(struct svc_rqst *rqstp, struct svc_fh *fhp, int type,
flags, current_cred());
if (IS_ERR(*filp))
host_err = PTR_ERR(*filp);
- else
+ else {
host_err = ima_file_check(*filp, may_flags);

+ if (may_flags & NFSD_MAY_64BIT_COOKIE)
+ (*filp)->f_mode |= FMODE_64BITHASH;
+ else
+ (*filp)->f_mode |= FMODE_32BITHASH;
+ }
+
out_nfserr:
err = nfserrno(host_err);
out:
@@ -1991,8 +1997,13 @@ nfsd_readdir(struct svc_rqst *rqstp, struct svc_fh *fhp, loff_t *offsetp,
__be32 err;
struct file *file;
loff_t offset = *offsetp;
+ int may_flags = NFSD_MAY_READ;
+
+ /* NFSv2 only supports 32 bit cookies */
+ if (rqstp->rq_vers > 2)
+ may_flags |= NFSD_MAY_64BIT_COOKIE;

- err = nfsd_open(rqstp, fhp, S_IFDIR, NFSD_MAY_READ, &file);
+ err = nfsd_open(rqstp, fhp, S_IFDIR, may_flags, &file);
if (err)
goto out;

diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index e0bbac0..ecd00e1 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -26,6 +26,8 @@
#define NFSD_MAY_NOT_BREAK_LEASE 512
#define NFSD_MAY_BYPASS_GSS 1024

+#define NFSD_MAY_64BIT_COOKIE 2048 /* 64 bit readdir cookies for >= NFSv3 */
+
#define NFSD_MAY_CREATE (NFSD_MAY_EXEC|NFSD_MAY_WRITE)
#define NFSD_MAY_REMOVE (NFSD_MAY_EXEC|NFSD_MAY_WRITE|NFSD_MAY_TRUNC)
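
The flow of this patch, condensed into a userspace sketch. All DEMO_/demo_
names are hypothetical; the value 2048 and the two FMODE values come from
this series, while the value used for the read permission bit is only an
assumption for the demo.

```c
/* Stand-ins for the nfsd permission bits and the new f_mode flags. */
#define DEMO_NFSD_MAY_READ         4      /* assumed value, demo only */
#define DEMO_NFSD_MAY_64BIT_COOKIE 2048
#define DEMO_FMODE_32BITHASH       0x200u
#define DEMO_FMODE_64BITHASH       0x400u

/* readdir side: NFSv2 only supports 32-bit cookies, later versions 64. */
int demo_readdir_may_flags(int nfs_version)
{
	int may_flags = DEMO_NFSD_MAY_READ;

	if (nfs_version > 2)
		may_flags |= DEMO_NFSD_MAY_64BIT_COOKIE;
	return may_flags;
}

/* open side: translate the request into exactly one f_mode hash flag. */
unsigned int demo_hash_mode(int may_flags)
{
	if (may_flags & DEMO_NFSD_MAY_64BIT_COOKIE)
		return DEMO_FMODE_64BITHASH;
	return DEMO_FMODE_32BITHASH;
}
```

Every file opened through this path ends up with one of the two flags set,
so the file system never has to guess on behalf of an NFS client.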

2011-08-16 11:26:02

by Bernd Schubert

Subject: [PATCH 5/6] Fix possible Null pointer dereference in ipoib_start_xmit()

This will fix https://bugzilla.kernel.org/show_bug.cgi?id=41212

fslab2 login: [ 114.392408] EXT4-fs (sdc): barriers disabled
[ 114.449737] EXT4-fs (sdc): mounted filesystem with writeback data mode.
Opts: journal_async_commit,barrier=0,data=writeback
[ 240.944030] BUG: unable to handle kernel NULL pointer dereference at
0000000000000040
[ 240.948007] IP: [<ffffffffa0366ce9>] ipoib_start_xmit+0x39/0x280 [ib_ipoib]
[...]
[ 240.948007] Call Trace:
[ 240.948007] <IRQ>
[ 240.948007] [<ffffffff812cd5e0>] dev_hard_start_xmit+0x2a0/0x590
[ 240.948007] [<ffffffff8131f680>] ? arp_create+0x70/0x200
[ 240.948007] [<ffffffff812e8e1f>] sch_direct_xmit+0xef/0x1c0

Signed-off-by: Bernd Schubert <[email protected]>
---
drivers/infiniband/ulp/ipoib/ipoib_main.c | 8 +++++---
1 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 43f89ba..fe89c46 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -717,11 +717,13 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct ipoib_dev_priv *priv = netdev_priv(dev);
struct ipoib_neigh *neigh;
- struct neighbour *n;
+ struct neighbour *n = NULL;
unsigned long flags;

- n = dst_get_neighbour(skb_dst(skb));
- if (likely(skb_dst(skb) && n)) {
+ if (likely(skb_dst(skb)))
+ n = dst_get_neighbour(skb_dst(skb));
+
+ if (likely(n)) {
if (unlikely(!*to_ipoib_neigh(n))) {
ipoib_path_lookup(skb, dev);
return NETDEV_TX_OK;
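
The pattern behind this fix, reduced to a userspace sketch with hypothetical
types and names. It assumes the accessor dereferences its argument
unconditionally, which the small fault address in the oops above suggests.

```c
#include <stddef.h>

struct demo_neighbour {
	int id;
};

struct demo_dst {
	struct demo_neighbour *neighbour;
};

/* Accessor that dereferences its argument without checking it. */
struct demo_neighbour *demo_get_neighbour(const struct demo_dst *dst)
{
	return dst->neighbour;
}

/*
 * Fixed ordering: test the outer pointer first and call the accessor
 * only when it is safe. The result stays NULL otherwise.
 */
struct demo_neighbour *demo_lookup(const struct demo_dst *dst)
{
	struct demo_neighbour *n = NULL;

	if (dst)
		n = demo_get_neighbour(dst);
	return n;
}
```

The old code called the accessor first and checked the outer pointer
afterwards, so the check came too late to prevent the dereference.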

2011-08-16 11:26:08

by Bernd Schubert

Subject: [PATCH 6/6] Rename 'n' into a longer variable name.

In my opinion, variable names consisting of a single letter
should be forbidden by the coding style guidelines, as it is rather
difficult to search for a single letter such as 'n'.

Rename struct neighbour *n to dst_neigh

Signed-off-by: Bernd Schubert <[email protected]>
---
drivers/infiniband/ulp/ipoib/ipoib_main.c | 15 ++++++++-------
1 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index fe89c46..189d4cb 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -717,22 +717,22 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
{
struct ipoib_dev_priv *priv = netdev_priv(dev);
struct ipoib_neigh *neigh;
- struct neighbour *n = NULL;
+ struct neighbour *dst_neigh = NULL;
unsigned long flags;

if (likely(skb_dst(skb)))
- n = dst_get_neighbour(skb_dst(skb));
+ dst_neigh = dst_get_neighbour(skb_dst(skb));

- if (likely(n)) {
- if (unlikely(!*to_ipoib_neigh(n))) {
+ if (likely(dst_neigh)) {
+ if (unlikely(!*to_ipoib_neigh(dst_neigh))) {
ipoib_path_lookup(skb, dev);
return NETDEV_TX_OK;
}

- neigh = *to_ipoib_neigh(n);
+ neigh = *to_ipoib_neigh(dst_neigh);

if (unlikely((memcmp(&neigh->dgid.raw,
- n->ha + 4,
+ dst_neigh->ha + 4,
sizeof(union ib_gid))) ||
(neigh->dev != dev))) {
spin_lock_irqsave(&priv->lock, flags);
@@ -758,7 +758,8 @@ static int ipoib_start_xmit(struct sk_buff *skb, struct net_device *dev)
return NETDEV_TX_OK;
}
} else if (neigh->ah) {
- ipoib_send(dev, skb, neigh->ah, IPOIB_QPN(n->ha));
+ ipoib_send(dev, skb, neigh->ah,
+ IPOIB_QPN(dst_neigh->ha));
return NETDEV_TX_OK;
}

2011-08-16 11:56:05

by Bernd Schubert

Subject: Re: [PATCH 0/2] 32/64 bit llseek hashes

I'm really sorry for this patch spam :( I accidentally hit the
enter key too early, so patches that were not supposed to be part of
this series got sent along with it :(

2011-08-16 21:22:27

by Andreas Dilger

Subject: Re: [PATCH 2/6] Return 32/64-bit dir name hash according to usage type

On 2011-08-16, at 5:25 AM, Bernd Schubert wrote:

> From: Fan Yong <[email protected]>
>
> Traditionally ext2/3/4 has returned a 32-bit hash value from llseek()
> to appease NFSv2, which can only handle a 32-bit cookie for seekdir()
> and telldir(). However, this causes problems if there are 32-bit hash
> collisions, since the NFSv2 server can get stuck resending the same
> entries from the directory repeatedly.
>
> Allow ext4 to return a full 64-bit hash (both major and minor) for
> telldir to decrease the chance of hash collisions. This still needs
> integration on the NFS side.
>
> Patch-updated-by: Bernd Schubert <[email protected]>
> (blame me if something is not correct)
>
> Signed-off-by: Fan Yong <[email protected]>
> Signed-off-by: Andreas Dilger <[email protected]>
> Signed-off-by: Bernd Schubert <[email protected]>
> ---
> fs/ext4/dir.c | 185 ++++++++++++++++++++++++++++++++++++++++++++------------
> fs/ext4/ext4.h | 6 ++
> fs/ext4/hash.c | 4 +
> 3 files changed, 154 insertions(+), 41 deletions(-)
>
> diff --git a/fs/ext4/dir.c b/fs/ext4/dir.c
> index 164c560..cc47087 100644
> --- a/fs/ext4/dir.c
> +++ b/fs/ext4/dir.c
> @@ -32,24 +32,8 @@ static unsigned char ext4_filetype_table[] = {
> DT_UNKNOWN, DT_REG, DT_DIR, DT_CHR, DT_BLK, DT_FIFO, DT_SOCK, DT_LNK
> };
>
> -static int ext4_readdir(struct file *, void *, filldir_t);
> static int ext4_dx_readdir(struct file *filp,
> void *dirent, filldir_t filldir);
> -static int ext4_release_dir(struct inode *inode,
> - struct file *filp);
> -
> -const struct file_operations ext4_dir_operations = {
> - .llseek = ext4_llseek,
> - .read = generic_read_dir,
> - .readdir = ext4_readdir, /* we take BKL. needed?*/
> - .unlocked_ioctl = ext4_ioctl,
> -#ifdef CONFIG_COMPAT
> - .compat_ioctl = ext4_compat_ioctl,
> -#endif
> - .fsync = ext4_sync_file,
> - .release = ext4_release_dir,
> -};
> -
>
> static unsigned char get_dtype(struct super_block *sb, int filetype)
> {
> @@ -254,22 +238,134 @@ out:
> return ret;
> }
>
> +static inline int is_32bit_api(void)
> +{
> +#ifdef HAVE_IS_COMPAT_TASK
> + return is_compat_task();

Looking more closely, this should actually be "#ifdef CONFIG_COMPAT" in the mainline kernel.

HAVE_IS_COMPAT_TASK is from the Lustre configure script for detecting which kernel is_compat_task() was added in, since it appeared in some kernels at 2.6.17 but wasn't in most arches until 2.6.29.

Sorry I didn't notice this earlier.

> +#else
> + return (BITS_PER_LONG == 32);
> +#endif
> +}
> +
> /*
> * These functions convert from the major/minor hash to an f_pos
> - * value.
> + * value for dx directories
> + *
> + * Upper layer (for example NFS) should specify FMODE_32BITHASH or
> + * FMODE_64BITHASH explicitly. On the other hand, we allow ext4 to be mounted
> + * directly on both 32-bit and 64-bit nodes, under such case, neither
> + * FMODE_32BITHASH nor FMODE_64BITHASH is specified.
> + */
> +static inline loff_t hash2pos(struct file *filp, __u32 major, __u32 minor)
> +{
> + if ((filp->f_flags & FMODE_32BITHASH) ||
> + (!(filp->f_flags & FMODE_64BITHASH) && is_32bit_api()))
> + return major >> 1;
> + else
> + return ((__u64)(major >> 1) << 32) | (__u64)minor;
> +}
> +
> +static inline __u32 pos2maj_hash(struct file *filp, loff_t pos)
> +{
> + if ((filp->f_flags & FMODE_32BITHASH) ||
> + (!(filp->f_mode & FMODE_64BITHASH) && is_32bit_api()))
> + return (pos << 1) & 0xffffffff;
> + else
> + return ((pos >> 32) << 1) & 0xffffffff;
> +}
> +
> +static inline __u32 pos2min_hash(struct file *filp, loff_t pos)
> +{
> + if ((filp->f_flags & FMODE_32BITHASH) ||
> + (!(filp->f_flags & FMODE_64BITHASH) && is_32bit_api()))
> + return 0;
> + else
> + return pos & 0xffffffff;
> +}
> +
> +/*
> + * Return 32- or 64-bit end-of-file for dx directories
> + */
> +static inline loff_t ext4_get_htree_eof(struct file *filp)
> +{
> + if ((filp->f_mode & FMODE_32BITHASH) ||
> + (!(filp->f_mode & FMODE_64BITHASH) && is_32bit_api()))
> + return EXT4_HTREE_EOF_32BIT;
> + else
> + return EXT4_HTREE_EOF_64BIT;
> +}
> +
> +
> +/*
> + * ext4_dir_llseek() based on generic_file_llseek() to handle both
> + * non-htree and htree directories, where the "offset" is in terms
> + * of the filename hash value instead of the byte offset.
> *
> - * Currently we only use major hash numer. This is unfortunate, but
> - * on 32-bit machines, the same VFS interface is used for lseek and
> - * llseek, so if we use the 64 bit offset, then the 32-bit versions of
> - * lseek/telldir/seekdir will blow out spectacularly, and from within
> - * the ext2 low-level routine, we don't know if we're being called by
> - * a 64-bit version of the system call or the 32-bit version of the
> - * system call. Worse yet, NFSv2 only allows for a 32-bit readdir
> - * cookie. Sigh.
> + * NOTE: offsets obtained *before* ext4_set_inode_flag(dir, EXT4_INODE_INDEX)
> + * will be invalid once the directory was converted into a dx directory
> */
> -#define hash2pos(major, minor) (major >> 1)
> -#define pos2maj_hash(pos) ((pos << 1) & 0xffffffff)
> -#define pos2min_hash(pos) (0)
> +loff_t ext4_dir_llseek(struct file *file, loff_t offset, int origin)
> +{
> + struct inode *inode = file->f_mapping->host;
> + loff_t ret = -EINVAL;
> + int is_dx_dir = ext4_test_inode_flag(inode, EXT4_INODE_INDEX);
> +
> + mutex_lock(&inode->i_mutex);
> +
> + /* NOTE: relative offsets with dx directories might not work
> + * as expected, as it is difficult to figure out the
> + * correct offset between dx hashes */
> +
> + switch (origin) {
> + case SEEK_END:
> + if (unlikely(offset > 0))
> + goto out_err; /* not supported for directories */
> +
> + /* so only negative offsets are left, does that have a
> + * meaning for directories at all? */
> + if (is_dx_dir)
> + offset += ext4_get_htree_eof(file);
> + else
> + offset += inode->i_size;
> + break;
> + case SEEK_CUR:
> + /*
> + * Here we special-case the lseek(fd, 0, SEEK_CUR)
> + * position-querying operation. Avoid rewriting the "same"
> + * f_pos value back to the file because a concurrent read(),
> + * write() or lseek() might have altered it
> + */
> + if (offset == 0) {
> + offset = file->f_pos;
> + goto out_ok;
> + }
> +
> + offset += file->f_pos;
> + break;
> + }
> +
> + if (unlikely(offset < 0))
> + goto out_err;
> +
> + if (!is_dx_dir) {
> + if (offset > inode->i_sb->s_maxbytes)
> + goto out_err;
> + } else if (offset > ext4_get_htree_eof(file))
> + goto out_err;
> +
> + /* Special lock needed here? */
> + if (offset != file->f_pos) {
> + file->f_pos = offset;
> + file->f_version = 0;
> + }
> +
> +out_ok:
> + ret = offset;
> +out_err:
> + mutex_unlock(&inode->i_mutex);
> +
> + return ret;
> +}
>
> /*
> * This structure holds the nodes of the red-black tree used to store
> @@ -330,15 +426,16 @@ static void free_rb_tree_fname(struct rb_root *root)
> }
>
>
> -static struct dir_private_info *ext4_htree_create_dir_info(loff_t pos)
> +static struct dir_private_info *ext4_htree_create_dir_info(struct file *filp,
> + loff_t pos)
> {
> struct dir_private_info *p;
>
> p = kzalloc(sizeof(struct dir_private_info), GFP_KERNEL);
> if (!p)
> return NULL;
> - p->curr_hash = pos2maj_hash(pos);
> - p->curr_minor_hash = pos2min_hash(pos);
> + p->curr_hash = pos2maj_hash(filp, pos);
> + p->curr_minor_hash = pos2min_hash(filp, pos);
> return p;
> }
>
> @@ -429,7 +526,7 @@ static int call_filldir(struct file *filp, void *dirent,
> "null fname?!?\n");
> return 0;
> }
> - curr_pos = hash2pos(fname->hash, fname->minor_hash);
> + curr_pos = hash2pos(filp, fname->hash, fname->minor_hash);
> while (fname) {
> error = filldir(dirent, fname->name,
> fname->name_len, curr_pos,
> @@ -454,13 +551,13 @@ static int ext4_dx_readdir(struct file *filp,
> int ret;
>
> if (!info) {
> - info = ext4_htree_create_dir_info(filp->f_pos);
> + info = ext4_htree_create_dir_info(filp, filp->f_pos);
> if (!info)
> return -ENOMEM;
> filp->private_data = info;
> }
>
> - if (filp->f_pos == EXT4_HTREE_EOF)
> + if (filp->f_pos == ext4_get_htree_eof(filp))
> return 0; /* EOF */
>
> /* Some one has messed with f_pos; reset the world */
> @@ -468,8 +565,8 @@ static int ext4_dx_readdir(struct file *filp,
> free_rb_tree_fname(&info->root);
> info->curr_node = NULL;
> info->extra_fname = NULL;
> - info->curr_hash = pos2maj_hash(filp->f_pos);
> - info->curr_minor_hash = pos2min_hash(filp->f_pos);
> + info->curr_hash = pos2maj_hash(filp, filp->f_pos);
> + info->curr_minor_hash = pos2min_hash(filp, filp->f_pos);
> }
>
> /*
> @@ -501,7 +598,7 @@ static int ext4_dx_readdir(struct file *filp,
> if (ret < 0)
> return ret;
> if (ret == 0) {
> - filp->f_pos = EXT4_HTREE_EOF;
> + filp->f_pos = ext4_get_htree_eof(filp);
> break;
> }
> info->curr_node = rb_first(&info->root);
> @@ -521,7 +618,7 @@ static int ext4_dx_readdir(struct file *filp,
> info->curr_minor_hash = fname->minor_hash;
> } else {
> if (info->next_hash == ~0) {
> - filp->f_pos = EXT4_HTREE_EOF;
> + filp->f_pos = ext4_get_htree_eof(filp);
> break;
> }
> info->curr_hash = info->next_hash;
> @@ -540,3 +637,15 @@ static int ext4_release_dir(struct inode *inode, struct file *filp)
>
> return 0;
> }
> +
> +const struct file_operations ext4_dir_operations = {
> + .llseek = ext4_dir_llseek,
> + .read = generic_read_dir,
> + .readdir = ext4_readdir,
> + .unlocked_ioctl = ext4_ioctl,
> +#ifdef CONFIG_COMPAT
> + .compat_ioctl = ext4_compat_ioctl,
> +#endif
> + .fsync = ext4_sync_file,
> + .release = ext4_release_dir,
> +};
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index e717dfd..31d9ba0 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1560,7 +1560,11 @@ struct dx_hash_info
> u32 *seed;
> };
>
> -#define EXT4_HTREE_EOF 0x7fffffff
> +
> +/* 32 and 64 bit signed EOF for dx directories */
> +#define EXT4_HTREE_EOF_32BIT ((1UL << (32 - 1)) - 1)
> +#define EXT4_HTREE_EOF_64BIT ((1ULL << (64 - 1)) - 1)
> +
>
> /*
> * Control parameters used by ext4_htree_next_block
> diff --git a/fs/ext4/hash.c b/fs/ext4/hash.c
> index ac8f168..fa8e491 100644
> --- a/fs/ext4/hash.c
> +++ b/fs/ext4/hash.c
> @@ -200,8 +200,8 @@ int ext4fs_dirhash(const char *name, int len, struct dx_hash_info *hinfo)
> return -1;
> }
> hash = hash & ~1;
> - if (hash == (EXT4_HTREE_EOF << 1))
> - hash = (EXT4_HTREE_EOF-1) << 1;
> + if (hash == (EXT4_HTREE_EOF_32BIT << 1))
> + hash = (EXT4_HTREE_EOF_32BIT - 1) << 1;
> hinfo->hash = hash;
> hinfo->minor_hash = minor_hash;
> return 0;
>

Cheers, Andreas
--
Andreas Dilger
Principal Engineer
Whamcloud, Inc.

2011-08-17 09:17:28

by Bernd Schubert

Subject: Re: [PATCH 2/6] Return 32/64-bit dir name hash according to usage type

On 08/16/2011 11:22 PM, Andreas Dilger wrote:
>> +static inline int is_32bit_api(void)
>> +{
>> +#ifdef HAVE_IS_COMPAT_TASK
>> + return is_compat_task();
>
> Looking more closely, this should actually be "#ifdef CONFIG_COMPAT"
> in the mainline kernel.
>
> HAVE_IS_COMPAT_TASK is from the Lustre configure script for detecting
> which kernel is_compat_task() was added in, since it appeared in some
> kernels at 2.6.17 but wasn't in most arches until 2.6.29.
>
> Sorry I didn't notice this earlier.
>

Oh no, I also should have noticed it :( I even made a small test program
[1], but I then never executed it in 32bit mode :(

Cheers,
Bernd

[1]
http://www.pci.uni-heidelberg.de/tc/usr/bernd/downloads/test_seekdir/
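
For readers without access to [1], a minimal telldir()/seekdir() round-trip
check could look like the sketch below. This is not the program behind the
link, only an illustration with hypothetical names, built on the standard
opendir()/readdir()/telldir()/seekdir() calls.

```c
#define _XOPEN_SOURCE 700
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

#define DEMO_MAX_ENTRIES 1024

struct demo_entry {
	long pos;        /* telldir() value taken in front of the entry */
	char name[256];  /* name returned by the following readdir() */
};

/*
 * Record the position in front of every entry, then seek back to each
 * position and check that readdir() returns the same name again.
 * Returns the number of mismatches, or -1 on error.
 */
int demo_check_seekdir(const char *path)
{
	struct demo_entry *entries;
	struct dirent *de;
	int count = 0, mismatches = 0, i;
	DIR *dir = opendir(path);

	if (!dir)
		return -1;

	entries = calloc(DEMO_MAX_ENTRIES, sizeof(*entries));
	if (!entries) {
		closedir(dir);
		return -1;
	}

	/* First pass: remember where each entry starts. */
	while (count < DEMO_MAX_ENTRIES) {
		long pos = telldir(dir);

		de = readdir(dir);
		if (!de)
			break;
		entries[count].pos = pos;
		snprintf(entries[count].name, sizeof(entries[count].name),
			 "%s", de->d_name);
		count++;
	}

	/* Second pass: every recorded position must lead to its entry. */
	for (i = 0; i < count; i++) {
		seekdir(dir, entries[i].pos);
		de = readdir(dir);
		if (!de || strcmp(de->d_name, entries[i].name) != 0)
			mismatches++;
	}

	free(entries);
	closedir(dir);
	return mismatches;
}
```

A nonzero result on a directory that is not modified during the run would
point at colliding or truncated positions. To exercise the 32-bit path
discussed above, the same source would have to be built as a 32-bit binary
(for example with gcc -m32 where the toolchain supports it) and run on a
64-bit kernel.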