2010-06-15 17:13:17

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

Hi,

The below set of patches implement open by handle support using exportfs
operations. This allows user space application to map a file name to file
handle and later open the file using handle. This should be usable
for userspace NFS [1] and 9P server [2]. XFS already support this with the ioctls
XFS_IOC_PATH_TO_HANDLE and XFS_IOC_OPEN_BY_HANDLE.

[1] http://nfs-ganesha.sourceforge.net/
[2] http://thread.gmane.org/gmane.comp.emulators.qemu/68992

git repo for the patchset at:

git://git.kernel.org/pub/scm/linux/kernel/git/kvaneesh/linux-open-handle.git open-by-handle-v14

Changes from V13:
a) Add support for file descriptor to handle conversion. This is needed
so that we find the right file handle for newly created files.

Changes from V12:
a) Use CAP_DAC_READ_SEARCH instead of CAP_DAC_OVERRIDE in open_by_handle
b) Return -ENOTDIR if O_DIRECTORY flag is specified in open_by_handle with
handle for non directory

Changes from V11:
a) Add necessary documentation to different functions
b) Add null pathname support to faccessat and linkat similar to
readlinkat.
c) compile fix on x86_64

Changes from V10:
a) Missed an stg refresh before sending out the patchset. Send
updated patchset.

Changes from V9:
a) Fix compile errors with CONFIG_EXPORTFS not defined
b) Return -EOPNOTSUPP if file system doesn't support fh_to_dentry exportfs callback.

Changes from V8:
a) exportfs_decode_fh now returns -ESTALE if export operations is not defined.
b) drop get_fsid super_operations. Instead use superblock to store uuid.

Changes from V7:
a) open_by_handle now use mountdirfd to identify the vfsmount.
b) We don't validate the UUID passed as a part of file handle in open_by_handle.
UUID is provided as a part of file handle as an easy way for userspace to
use the kernel returned handle as it is. It also helps in finding the 16 byte
filessytem UUID in userspace without using file system specific libraries to
read file system superblock. If a particular file system doesn't support UUID
or any form of unique id this field in the file handle will be zero filled.
c) drop freadlink syscall. Instead use readlinkat with NULL pathname to indicate
read the link target name of the link pointed by fd. This is similar to
sys_utimensat
d) Instead of opencoding all the open flag related check use helper functions.
Did finish_open_by_handle similar to finish_open.
c) Fix may_open to not return ELOOP for symlink when we are called from handle open.
open(2) still returns error as expected.

Changes from V6:
a) Add uuid to vfsmount lookup and drop uuid to superblock lookup
b) Return -EOPNOTSUPP in sys_name_to_handle if the file system returned uuid
doesn't give the same vfsmount on lookup. This ensure that we fail
sys_name_to_handle when we have multiple file system returning same UUID.

Changes from V5:
a) added sys_name_to_handle_at syscall which takes AT_SYMLINK_NOFOLLOW flag
instead of two syscalls sys_name_to_handle and sys_lname_to_handle.
b) addressed review comments from Niel Brown
c) rebased to b91ce4d14a21fc04d165be30319541e0f9204f15
d) Add compat_sys_open_by_handle

Chages from V4:
a) Changed the syscal arguments so that we don't need compat syscalls
as suggested by Christoph
c) Added two new syscall sys_lname_to_handle and sys_freadlink to work with
symlinks
d) Changed open_by_handle to work with all file types
e) Add ext3 support

Changes from V3:
a) Code cleanup suggested by Andreas
b) x86_64 syscall support
c) add compat syscall

Chages from V2:
a) Support system wide unique handle.

Changes from v1:
a) handle size is now specified in bytes
b) returns -EOVERFLOW if the handle size is small
c) dropped open_handle syscall and added open_by_handle_at syscall
open_by_handle_at takes mount_fd as the directory fd of the mount point
containing the file
e) handle will only be unique in a given file system. So for an NFS server
exporting multiple file system, NFS server will have to internally track the
mount point to which a file handle belongs to. We should be able to do it much
easily than expecting kernel to give a system wide unique file handle. System
wide unique file handle would need much larger changes to the exportfs or VFS
interface and I was not sure whether we really need to do that in the kernel or
in the user space
f) open_handle_at now only check for DAC_OVERRIDE capability


Example program: (x86_32). (x86_64 would need a different syscall number)
-------
cc <source.c> -luuid
--------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <string.h>
#include <uuid/uuid.h>

struct file_handle {
int handle_size;
int handle_type;
uuid_t fs_uuid;
unsigned char handle[0];
};

#define AT_FDCWD -100
#define AT_SYMLINK_FOLLOW 0x400

static int name_to_handle(const char *name, struct file_handle *fh)
{
return syscall(338, AT_FDCWD, name, fh, AT_SYMLINK_FOLLOW);
}

static int lname_to_handle(const char *name, struct file_handle *fh)
{
return syscall(338, AT_FDCWD, name, fh, 0);
}

static int fd_to_handle(int fd, struct file_handle *fh)
{
return syscall(338, fd, NULL, fh, AT_SYMLINK_FOLLOW);
}

static int open_by_handle(int mountfd, struct file_handle *fh, int flags)
{
return syscall(339, mountfd, fh, flags);
}

#define BUFSZ 100
int main(int argc, char *argv[])
{
int fd;
int ret, done = 0;
int mountfd;
int handle_sz;
struct stat bufstat;
char buf[BUFSZ];
char uuid[36];
struct file_handle *fh = NULL;;
if (argc != 3 ) {
printf("Usage: %s <filename> <mount-dir-name>\n", argv[0]);
exit(1);
}
again:
if (fh && fh->handle_size) {
handle_sz = fh->handle_size;
free(fh);
fh = malloc(sizeof(struct file_handle) + handle_sz);
fh->handle_size = handle_sz;
} else {
fh = malloc(sizeof(struct file_handle));
fh->handle_size = 0;
}
errno = 0;
ret = lname_to_handle(argv[1], fh);
if (ret && errno == EOVERFLOW) {
printf("Found the handle size needed to be %d\n", fh->handle_size);
goto again;
} else if (ret) {
perror("Error:");
exit(1);
}
do_again:
uuid_unparse(fh->fs_uuid, uuid);
printf("UUID:%s\n", uuid);
printf("Waiting for input");
getchar();
mountfd = open(argv[2], O_RDONLY | O_DIRECTORY);
if (mountfd <= 0) {
perror("Error:");
exit(1);
}
fd = open_by_handle(mountfd, fh, O_RDONLY);
if (fd <= 0 ) {
perror("Error:");
exit(1);
}
printf("Reading the content now \n");
fstat(fd, &bufstat);
ret = S_ISLNK(bufstat.st_mode);
if (ret) {
memset(buf, 0 , BUFSZ);
readlinkat(fd, NULL, buf, BUFSZ);
printf("%s is a symlink pointing to %s\n", argv[1], buf);
}
memset(buf, 0 , BUFSZ);
while (1) {
ret = read(fd, buf, BUFSZ -1);
if (ret <= 0)
break;
buf[ret] = '\0';
printf("%s", buf);
memset(buf, 0 , BUFSZ);
}
/* Now check for faccess */
if (faccessat(fd, NULL, W_OK, 0) == 0) {
printf("Got write permission on the file \n");
} else
perror("faccess error");
/* now try to create a hardlink */
if (linkat(fd, NULL, AT_FDCWD, "test", 0) == 0){
printf("created hardlink\n");
} else
perror("linkat error");
if (done)
exit(0);
printf("Map fd to handle \n");
ret = fd_to_handle(fd, fh);
if (ret) {
perror("Error:");
exit(1);
}
done = 1;
goto do_again;
}

-aneesh


2010-06-15 17:13:25

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V14 02/11] vfs: Add name to file handle conversion support

This patch add a new superblock field unsigned char s_uuid[16]
to store UUID mapping for the file system. The s_uuid[16] is used to
identify the file system apart of file_handle

Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/open.c | 128 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 11 ++++
include/linux/syscalls.h | 3 +
3 files changed, 142 insertions(+), 0 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 5463266..43ac798 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -29,6 +29,7 @@
#include <linux/falloc.h>
#include <linux/fs_struct.h>
#include <linux/ima.h>
+#include <linux/exportfs.h>

#include "internal.h"

@@ -1040,3 +1041,130 @@ int nonseekable_open(struct inode *inode, struct file *filp)
}

EXPORT_SYMBOL(nonseekable_open);
+
+#ifdef CONFIG_EXPORTFS
+/* limit the handle size to some value */
+#define MAX_HANDLE_SZ 4096
+static long do_sys_name_to_handle(struct path *path,
+ struct file_handle __user *ufh)
+{
+ long retval;
+ int handle_size;
+ struct super_block *sb;
+ struct file_handle f_handle;
+ struct file_handle *handle = NULL;
+
+ if (copy_from_user(&f_handle, ufh, sizeof(struct file_handle))) {
+ retval = -EFAULT;
+ goto err_out;
+ }
+ if (f_handle.handle_size > MAX_HANDLE_SZ) {
+ retval = -EINVAL;
+ goto err_out;
+ }
+ handle = kmalloc(sizeof(struct file_handle) + f_handle.handle_size,
+ GFP_KERNEL);
+ if (!handle) {
+ retval = -ENOMEM;
+ goto err_out;
+ }
+
+ /* convert handle size to multiple of sizeof(u32) */
+ handle_size = f_handle.handle_size >> 2;
+
+ /* we ask for a non connected handle */
+ retval = exportfs_encode_fh(path->dentry,
+ (struct fid *)handle->f_handle,
+ &handle_size, 0);
+ /* convert handle size to bytes */
+ handle_size *= sizeof(u32);
+ handle->handle_type = retval;
+ handle->handle_size = handle_size;
+ if (handle_size <= f_handle.handle_size) {
+ /* get the uuid */
+ sb = path->mnt->mnt_sb;
+ memcpy(handle->fs_uuid,
+ sb->s_uuid,
+ sizeof(handle->fs_uuid));
+ retval = 0;
+ } else {
+ /*
+ * set the handle_size to zero so we copy only
+ * non variable part of the file_handle
+ */
+ handle_size = 0;
+ retval = -EOVERFLOW;
+ }
+ if (copy_to_user(ufh, handle,
+ sizeof(struct file_handle) + handle_size))
+ retval = -EFAULT;
+
+ kfree(handle);
+err_out:
+ return retval;
+}
+
+/**
+ * sys_name_to_handle_at: convert name to handle
+ * @dfd: directory relative to which name is interpreted if not absolute
+ * @name: name that should be converted to handle.
+ * @handle: resulting file handle
+ * @flag: flag value to indicate whether to follow symlink or not
+ *
+ * @handle->handle_size indicate the space available to store the
+ * variable part of the file handle in bytes. If there is not
+ * enough space, the field is updated to return the minimum
+ * value required.
+ */
+SYSCALL_DEFINE4(name_to_handle_at, int, dfd, const char __user *, name,
+ struct file_handle __user *, handle, int, flag)
+{
+
+ int follow;
+ long ret = -EINVAL;
+ struct path path, *pp;
+ struct file *file = NULL;
+
+ if ((flag & ~AT_SYMLINK_FOLLOW) != 0)
+ goto err_out;
+
+ if (name == NULL && dfd != AT_FDCWD) {
+ file = fget(dfd);
+ if (file) {
+ pp = &file->f_path;
+ ret = 0;
+ } else
+ ret = -EBADF;
+ } else {
+ follow = (flag & AT_SYMLINK_FOLLOW) ? LOOKUP_FOLLOW : 0;
+ ret = user_path_at(dfd, name, follow, &path);
+ pp = &path;
+ }
+ if (ret)
+ goto err_out;
+ /*
+ * We need t make sure wether the file system
+ * support decoding of the file handle
+ */
+ if (!pp->mnt->mnt_sb->s_export_op ||
+ !pp->mnt->mnt_sb->s_export_op->fh_to_dentry) {
+ ret = -EOPNOTSUPP;
+ goto out_path;
+ }
+ ret = do_sys_name_to_handle(pp, handle);
+
+out_path:
+ if (file)
+ fput(file);
+ else
+ path_put(&path);
+err_out:
+ return ret;
+}
+#else
+SYSCALL_DEFINE4(name_to_handle_at, int, dfd, const char __user *, name,
+ struct file_handle __user *, handle, int, flag)
+{
+ return -ENOSYS;
+}
+#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 471e1ff..ebc0f98 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -949,6 +949,16 @@ struct file {
unsigned long f_mnt_write_state;
#endif
};
+
+struct file_handle {
+ int handle_size;
+ int handle_type;
+ /* File system UUID identifier */
+ u8 fs_uuid[16];
+ /* file identifier */
+ unsigned char f_handle[0];
+};
+
extern spinlock_t files_lock;
#define file_list_lock() spin_lock(&files_lock);
#define file_list_unlock() spin_unlock(&files_lock);
@@ -1357,6 +1367,7 @@ struct super_block {
wait_queue_head_t s_wait_unfrozen;

char s_id[32]; /* Informational name */
+ u8 s_uuid[16]; /* UUID */

void *s_fs_info; /* Filesystem private info */
fmode_t s_mode;
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 7f614ce..4d4e922 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -61,6 +61,7 @@ struct robust_list_head;
struct getcpu_cache;
struct old_linux_dirent;
struct perf_event_attr;
+struct file_handle;

#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -823,5 +824,7 @@ asmlinkage long sys_mmap_pgoff(unsigned long addr, unsigned long len,
unsigned long prot, unsigned long flags,
unsigned long fd, unsigned long pgoff);
asmlinkage long sys_old_mmap(struct mmap_arg_struct __user *arg);
+asmlinkage long sys_name_to_handle_at(int dfd, const char __user *name,
+ struct file_handle __user *handle, int flag);

#endif
--
1.7.1.331.ga5080

2010-06-15 17:13:22

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V14 01/11] exportfs: Return the minimum required handle size

The exportfs encode handle function should return the minimum required
handle size. This helps user to find out the handle size by passing 0
handle size in the first step and then redoing to the call again with
the returned handle size value.

Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/btrfs/export.c | 8 ++++++--
fs/exportfs/expfs.c | 9 +++++++--
fs/fat/inode.c | 4 +++-
fs/fuse/inode.c | 4 +++-
fs/gfs2/export.c | 8 ++++++--
fs/isofs/export.c | 8 ++++++--
fs/ocfs2/export.c | 8 +++++++-
fs/reiserfs/inode.c | 7 ++++++-
fs/udf/namei.c | 7 ++++++-
fs/xfs/linux-2.6/xfs_export.c | 4 +++-
include/linux/exportfs.h | 6 ++++--
mm/shmem.c | 4 +++-
12 files changed, 60 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
index 951ef09..5f8ee5a 100644
--- a/fs/btrfs/export.c
+++ b/fs/btrfs/export.c
@@ -21,9 +21,13 @@ static int btrfs_encode_fh(struct dentry *dentry, u32 *fh, int *max_len,
int len = *max_len;
int type;

- if ((len < BTRFS_FID_SIZE_NON_CONNECTABLE) ||
- (connectable && len < BTRFS_FID_SIZE_CONNECTABLE))
+ if (connectable && (len < BTRFS_FID_SIZE_CONNECTABLE)) {
+ *max_len = BTRFS_FID_SIZE_CONNECTABLE;
return 255;
+ } else if (len < BTRFS_FID_SIZE_NON_CONNECTABLE) {
+ *max_len = BTRFS_FID_SIZE_NON_CONNECTABLE;
+ return 255;
+ }

len = BTRFS_FID_SIZE_NON_CONNECTABLE;
type = FILEID_BTRFS_WITHOUT_PARENT;
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index e9e1759..cfee0f0 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -319,9 +319,14 @@ static int export_encode_fh(struct dentry *dentry, struct fid *fid,
struct inode * inode = dentry->d_inode;
int len = *max_len;
int type = FILEID_INO32_GEN;
-
- if (len < 2 || (connectable && len < 4))
+
+ if (connectable && (len < 4)) {
+ *max_len = 4;
+ return 255;
+ } else if (len < 2) {
+ *max_len = 2;
return 255;
+ }

len = 2;
fid->i32.ino = inode->i_ino;
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 7bf45ae..da2f8a1 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -761,8 +761,10 @@ fat_encode_fh(struct dentry *de, __u32 *fh, int *lenp, int connectable)
struct inode *inode = de->d_inode;
u32 ipos_h, ipos_m, ipos_l;

- if (len < 5)
+ if (len < 5) {
+ *lenp = 5;
return 255; /* no room */
+ }

ipos_h = MSDOS_I(inode)->i_pos >> 8;
ipos_m = (MSDOS_I(inode)->i_pos & 0xf0) << 24;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index ec14d19..beaea69 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -638,8 +638,10 @@ static int fuse_encode_fh(struct dentry *dentry, u32 *fh, int *max_len,
u64 nodeid;
u32 generation;

- if (*max_len < len)
+ if (*max_len < len) {
+ *max_len = len;
return 255;
+ }

nodeid = get_fuse_inode(inode)->nodeid;
generation = inode->i_generation;
diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c
index dfe237a..bd0fd68 100644
--- a/fs/gfs2/export.c
+++ b/fs/gfs2/export.c
@@ -36,9 +36,13 @@ static int gfs2_encode_fh(struct dentry *dentry, __u32 *p, int *len,
struct super_block *sb = inode->i_sb;
struct gfs2_inode *ip = GFS2_I(inode);

- if (*len < GFS2_SMALL_FH_SIZE ||
- (connectable && *len < GFS2_LARGE_FH_SIZE))
+ if (connectable && (*len < GFS2_LARGE_FH_SIZE)) {
+ *len = GFS2_LARGE_FH_SIZE;
return 255;
+ } else if (*len < GFS2_SMALL_FH_SIZE) {
+ *len = GFS2_SMALL_FH_SIZE;
+ return 255;
+ }

fh[0] = cpu_to_be32(ip->i_no_formal_ino >> 32);
fh[1] = cpu_to_be32(ip->i_no_formal_ino & 0xFFFFFFFF);
diff --git a/fs/isofs/export.c b/fs/isofs/export.c
index ed752cb..dd4687f 100644
--- a/fs/isofs/export.c
+++ b/fs/isofs/export.c
@@ -124,9 +124,13 @@ isofs_export_encode_fh(struct dentry *dentry,
* offset of the inode and the upper 16 bits of fh32[1] to
* hold the offset of the parent.
*/
-
- if (len < 3 || (connectable && len < 5))
+ if (connectable && (len < 5)) {
+ *max_len = 5;
+ return 255;
+ } else if (len < 3) {
+ *max_len = 3;
return 255;
+ }

len = 3;
fh32[0] = ei->i_iget5_block;
diff --git a/fs/ocfs2/export.c b/fs/ocfs2/export.c
index 19ad145..250a347 100644
--- a/fs/ocfs2/export.c
+++ b/fs/ocfs2/export.c
@@ -201,8 +201,14 @@ static int ocfs2_encode_fh(struct dentry *dentry, u32 *fh_in, int *max_len,
dentry->d_name.len, dentry->d_name.name,
fh, len, connectable);

- if (len < 3 || (connectable && len < 6)) {
+ if (connectable && (len < 6)) {
mlog(ML_ERROR, "fh buffer is too small for encoding\n");
+ *max_len = 6;
+ type = 255;
+ goto bail;
+ } else if (len < 3) {
+ mlog(ML_ERROR, "fh buffer is too small for encoding\n");
+ *max_len = 3;
type = 255;
goto bail;
}
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index 0f22fda..8f1bf99 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -1588,8 +1588,13 @@ int reiserfs_encode_fh(struct dentry *dentry, __u32 * data, int *lenp,
struct inode *inode = dentry->d_inode;
int maxlen = *lenp;

- if (maxlen < 3)
+ if (need_parent && (maxlen < 5)) {
+ *lenp = 5;
return 255;
+ } else if (maxlen < 3) {
+ *lenp = 3;
+ return 255;
+ }

data[0] = inode->i_ino;
data[1] = le32_to_cpu(INODE_PKEY(inode)->k_dir_id);
diff --git a/fs/udf/namei.c b/fs/udf/namei.c
index bf5fc67..20db42f 100644
--- a/fs/udf/namei.c
+++ b/fs/udf/namei.c
@@ -1336,8 +1336,13 @@ static int udf_encode_fh(struct dentry *de, __u32 *fh, int *lenp,
struct fid *fid = (struct fid *)fh;
int type = FILEID_UDF_WITHOUT_PARENT;

- if (len < 3 || (connectable && len < 5))
+ if (connectable && (len < 5)) {
+ *lenp = 5;
+ return 255;
+ } else if (len < 3) {
+ *lenp = 3;
return 255;
+ }

*lenp = 3;
fid->udf.block = location.logicalBlockNum;
diff --git a/fs/xfs/linux-2.6/xfs_export.c b/fs/xfs/linux-2.6/xfs_export.c
index 846b75a..82c0553 100644
--- a/fs/xfs/linux-2.6/xfs_export.c
+++ b/fs/xfs/linux-2.6/xfs_export.c
@@ -81,8 +81,10 @@ xfs_fs_encode_fh(
* seven combinations work. The real answer is "don't use v2".
*/
len = xfs_fileid_length(fileid_type);
- if (*max_len < len)
+ if (*max_len < len) {
+ *max_len = len
return 255;
+ }
*max_len = len;

switch (fileid_type) {
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index a9cd507..acd0b2d 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -108,8 +108,10 @@ struct fid {
* set, the encode_fh() should store sufficient information so that a good
* attempt can be made to find not only the file but also it's place in the
* filesystem. This typically means storing a reference to de->d_parent in
- * the filehandle fragment. encode_fh() should return the number of bytes
- * stored or a negative error code such as %-ENOSPC
+ * the filehandle fragment. encode_fh() should return the fileid_type on
+ * success and on error returns 255 (if the space needed to encode fh is
+ * greater than @max_len*4 bytes). On error @max_len contain the minimum
+ * size(in 4 byte unit) needed to encode the file handle.
*
* fh_to_dentry:
* @fh_to_dentry is given a &struct super_block (@sb) and a file handle
diff --git a/mm/shmem.c b/mm/shmem.c
index f65f840..d8223db 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2117,8 +2117,10 @@ static int shmem_encode_fh(struct dentry *dentry, __u32 *fh, int *len,
{
struct inode *inode = dentry->d_inode;

- if (*len < 3)
+ if (*len < 3) {
+ *len = 3;
return 255;
+ }

if (hlist_unhashed(&inode->i_hash)) {
/* Unfortunately insert_inode_hash is not idempotent,
--
1.7.1.331.ga5080

2010-06-15 17:13:39

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V14 05/11] vfs: Support null pathname in readlink

From: NeilBrown <[email protected]>

This enables to use readlink to get the link target name
from a file descriptor point to the link. This can be used
with open_by_handle syscall that returns a file descriptor for a link.
We can then use this file descriptor to get the target name.

Signed-off-by: NeilBrown <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/stat.c | 30 ++++++++++++++++++++++--------
1 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/fs/stat.c b/fs/stat.c
index c4ecd52..49b95a7 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -284,26 +284,40 @@ SYSCALL_DEFINE2(newfstat, unsigned int, fd, struct stat __user *, statbuf)
SYSCALL_DEFINE4(readlinkat, int, dfd, const char __user *, pathname,
char __user *, buf, int, bufsiz)
{
- struct path path;
- int error;
+ int error = 0;
+ struct path path, *pp;
+ struct file *file = NULL;

if (bufsiz <= 0)
return -EINVAL;

- error = user_path_at(dfd, pathname, 0, &path);
+ if (pathname == NULL && dfd != AT_FDCWD) {
+ file = fget(dfd);
+
+ if (file)
+ pp = &file->f_path;
+ else
+ error = -EBADF;
+ } else {
+ error = user_path_at(dfd, pathname, 0, &path);
+ pp = &path;
+ }
if (!error) {
- struct inode *inode = path.dentry->d_inode;
+ struct inode *inode = pp->dentry->d_inode;

error = -EINVAL;
if (inode->i_op->readlink) {
- error = security_inode_readlink(path.dentry);
+ error = security_inode_readlink(pp->dentry);
if (!error) {
- touch_atime(path.mnt, path.dentry);
- error = inode->i_op->readlink(path.dentry,
+ touch_atime(pp->mnt, pp->dentry);
+ error = inode->i_op->readlink(pp->dentry,
buf, bufsiz);
}
}
- path_put(&path);
+ if (file)
+ fput(file);
+ else
+ path_put(&path);
}
return error;
}
--
1.7.1.331.ga5080

2010-06-15 17:13:47

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V14 07/11] x86: Add new syscalls for x86_32

This patch adds sys_name_to_handle_at and sys_open_by_handle_at
syscalls to x86_32

Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
arch/x86/include/asm/unistd_32.h | 4 +++-
arch/x86/kernel/syscall_table_32.S | 2 ++
2 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index beb9b5f..06890db 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -343,10 +343,12 @@
#define __NR_rt_tgsigqueueinfo 335
#define __NR_perf_event_open 336
#define __NR_recvmmsg 337
+#define __NR_name_to_handle_at 338
+#define __NR_open_by_handle_at 339

#ifdef __KERNEL__

-#define NR_syscalls 338
+#define NR_syscalls 340

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index 8b37293..646717f 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -337,3 +337,5 @@ ENTRY(sys_call_table)
.long sys_rt_tgsigqueueinfo /* 335 */
.long sys_perf_event_open
.long sys_recvmmsg
+ .long sys_name_to_handle_at
+ .long sys_open_by_handle_at /* 339 */
--
1.7.1.331.ga5080

2010-06-15 17:13:52

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V14 08/11] x86: Add new syscalls for x86_64

Add sys_name_to_handle_at and sys_open_by_handle_at syscalls
for x86_64

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
arch/x86/ia32/ia32entry.S | 2 ++
arch/x86/include/asm/unistd_64.h | 4 ++++
fs/compat.c | 11 +++++++++++
fs/open.c | 20 ++++++++++----------
include/linux/fs.h | 1 +
5 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index e790bc1..99f9623 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -842,4 +842,6 @@ ia32_sys_call_table:
.quad compat_sys_rt_tgsigqueueinfo /* 335 */
.quad sys_perf_event_open
.quad compat_sys_recvmmsg
+ .quad sys_name_to_handle_at
+ .quad compat_sys_open_by_handle_at /* 339 */
ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index ff4307b..6afd818 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -663,6 +663,10 @@ __SYSCALL(__NR_rt_tgsigqueueinfo, sys_rt_tgsigqueueinfo)
__SYSCALL(__NR_perf_event_open, sys_perf_event_open)
#define __NR_recvmmsg 299
__SYSCALL(__NR_recvmmsg, sys_recvmmsg)
+#define __NR_name_to_handle_at 300
+__SYSCALL(__NR_name_to_handle_at, sys_name_to_handle_at)
+#define __NR_open_by_handle_at 301
+__SYSCALL(__NR_open_by_handle_at, sys_open_by_handle_at)

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
diff --git a/fs/compat.c b/fs/compat.c
index 6490d21..dc662ac 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -2332,3 +2332,14 @@ asmlinkage long compat_sys_timerfd_gettime(int ufd,
}

#endif /* CONFIG_TIMERFD */
+
+/*
+ * Exactly like fs/open.c:sys_open_by_handle_at(), except that it
+ * doesn't set the O_LARGEFILE flag.
+ */
+asmlinkage long
+compat_sys_open_by_handle_at(int mountdirfd,
+ struct file_handle __user *handle, int flags)
+{
+ return do_sys_open_by_handle(mountdirfd, handle, flags);
+}
diff --git a/fs/open.c b/fs/open.c
index 1c5daa3..43093af 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1231,8 +1231,8 @@ out_err:
return ERR_PTR(retval);
}

-static long do_sys_open_by_handle(int mountdirfd,
- struct file_handle __user *ufh, int open_flag)
+long do_sys_open_by_handle(int mountdirfd,
+ struct file_handle __user *ufh, int open_flag)
{
long retval = 0;
int fd, acc_mode;
@@ -1334,6 +1334,14 @@ out_handle:
out_err:
return retval;
}
+#else
+long do_sys_open_by_handle(int mountdirfd,
+ struct file_handle __user *ufh, int open_flag)
+{
+
+ return -ENOSYS;
+}
+#endif

/**
* sys_open_by_handle_at: Open the file handle
@@ -1358,11 +1366,3 @@ SYSCALL_DEFINE3(open_by_handle_at, int, mountdirfd,
ret = do_sys_open_by_handle(mountdirfd, handle, flags);
return ret;
}
-#else
-SYSCALL_DEFINE3(open_by_handle_at, int, mountdirfd,
- struct file_handle __user *, handle,
- int, flags)
-{
- return -ENOSYS;
-}
-#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1415cad..34d4f60 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1930,6 +1930,7 @@ extern struct file * dentry_open(struct dentry *, struct vfsmount *, int,
const struct cred *);
extern int filp_close(struct file *, fl_owner_t id);
extern char * getname(const char __user *);
+extern long do_sys_open_by_handle(int, struct file_handle __user *, int);

/* fs/ioctl.c */

--
1.7.1.331.ga5080

2010-06-15 17:13:57

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V14 09/11] ext3: Copy fs UUID to superblock.

This enable user space application to get the file system UUID
using sys_name_to_handle_at syscall

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext3/super.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 6c953bb..1596795 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1931,6 +1931,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
sb->s_qcop = &ext3_qctl_operations;
sb->dq_op = &ext3_quota_operations;
#endif
+ memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid));
INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
mutex_init(&sbi->s_orphan_lock);
mutex_init(&sbi->s_resize_lock);
--
1.7.1.331.ga5080

2010-06-15 17:14:00

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V14 10/11] vfs: Support null pathname in faccessat

This enables to use faccessat to get the access check details
from a file descriptor pointing to the file. This can be used
with open_by_handle syscall that returns a file descriptor.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/open.c | 27 ++++++++++++++++++++-------
1 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 43093af..90fb12f 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -288,7 +288,8 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
{
const struct cred *old_cred;
struct cred *override_cred;
- struct path path;
+ struct file *file = NULL;
+ struct path path, *pp;
struct inode *inode;
int res;

@@ -312,12 +313,21 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
}

old_cred = override_creds(override_cred);
-
- res = user_path_at(dfd, filename, LOOKUP_FOLLOW, &path);
+ if (filename == NULL && dfd != AT_FDCWD) {
+ file = fget(dfd);
+ if (file) {
+ pp = &file->f_path;
+ res = 0;
+ } else
+ res = -EBADF;
+ } else {
+ res = user_path_at(dfd, filename, LOOKUP_FOLLOW, &path);
+ pp = &path;
+ }
if (res)
goto out;

- inode = path.dentry->d_inode;
+ inode = pp->dentry->d_inode;

if ((mode & MAY_EXEC) && S_ISREG(inode->i_mode)) {
/*
@@ -325,7 +335,7 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
* with the "noexec" flag.
*/
res = -EACCES;
- if (path.mnt->mnt_flags & MNT_NOEXEC)
+ if (pp->mnt->mnt_flags & MNT_NOEXEC)
goto out_path_release;
}

@@ -343,11 +353,14 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
* inherently racy and know that the fs may change
* state before we even see this result.
*/
- if (__mnt_is_readonly(path.mnt))
+ if (__mnt_is_readonly(pp->mnt))
res = -EROFS;

out_path_release:
- path_put(&path);
+ if (file)
+ fput(file);
+ else
+ path_put(&path);
out:
revert_creds(old_cred);
put_cred(override_cred);
--
1.7.1.331.ga5080

2010-06-15 17:14:10

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V14 11/11] vfs: Support null pathname in linkat

This enables to use linkat to create hardlinks from a
file descriptor pointing to the file. This can be used
with open_by_handle syscall that returns a file descriptor.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/namei.c | 31 +++++++++++++++++++++++--------
1 files changed, 23 insertions(+), 8 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 417bf53..8b1cc23 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2533,16 +2533,27 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
{
struct dentry *new_dentry;
struct nameidata nd;
- struct path old_path;
+ struct path old_path, *old_pathp;
+ struct file *file = NULL;
int error;
char *to;

if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
return -EINVAL;

- error = user_path_at(olddfd, oldname,
- flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
- &old_path);
+ if (oldname == NULL && olddfd != AT_FDCWD) {
+ file = fget(olddfd);
+ if (file) {
+ old_pathp = &file->f_path;
+ error = 0;
+ } else
+ error = -EBADF;
+ } else {
+ error = user_path_at(olddfd, oldname,
+ flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
+ &old_path);
+ old_pathp = &old_path;
+ }
if (error)
return error;

@@ -2550,7 +2561,7 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
if (error)
goto out;
error = -EXDEV;
- if (old_path.mnt != nd.path.mnt)
+ if (old_pathp->mnt != nd.path.mnt)
goto out_release;
new_dentry = lookup_create(&nd, 0);
error = PTR_ERR(new_dentry);
@@ -2559,10 +2570,11 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
error = mnt_want_write(nd.path.mnt);
if (error)
goto out_dput;
- error = security_path_link(old_path.dentry, &nd.path, new_dentry);
+ error = security_path_link(old_pathp->dentry, &nd.path, new_dentry);
if (error)
goto out_drop_write;
- error = vfs_link(old_path.dentry, nd.path.dentry->d_inode, new_dentry);
+ error = vfs_link(old_pathp->dentry,
+ nd.path.dentry->d_inode, new_dentry);
out_drop_write:
mnt_drop_write(nd.path.mnt);
out_dput:
@@ -2573,7 +2585,10 @@ out_release:
path_put(&nd.path);
putname(to);
out:
- path_put(&old_path);
+ if (file)
+ fput(file);
+ else
+ path_put(&old_path);

return error;
}
--
1.7.1.331.ga5080

2010-06-15 17:13:35

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V14 04/11] vfs: Allow handle based open on symlinks

The patch update may_open to allow handle based open on symlinks.
The file handle based API use file descritor returned from open_by_handle_at
to do different file system operations. To find the link target name we
need to get a file descriptor on symlinks.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/namei.c | 17 ++++++++++++++---
1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index c2d19c7..417bf53 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1422,7 +1422,7 @@ int vfs_create(struct inode *dir, struct dentry *dentry, int mode,
return error;
}

-int may_open(struct path *path, int acc_mode, int flag)
+static int __may_open(struct path *path, int acc_mode, int flag, int handle)
{
struct dentry *dentry = path->dentry;
struct inode *inode = dentry->d_inode;
@@ -1433,7 +1433,13 @@ int may_open(struct path *path, int acc_mode, int flag)

switch (inode->i_mode & S_IFMT) {
case S_IFLNK:
- return -ELOOP;
+ /*
+ * For file handle based open we should allow
+ * open of symlink.
+ */
+ if (!handle)
+ return -ELOOP;
+ break;
case S_IFDIR:
if (acc_mode & MAY_WRITE)
return -EISDIR;
@@ -1473,6 +1479,11 @@ int may_open(struct path *path, int acc_mode, int flag)
return break_lease(inode, flag);
}

+int may_open(struct path *path, int acc_mode, int flag)
+{
+ return __may_open(path, acc_mode, flag, 0);
+}
+
static int handle_truncate(struct path *path)
{
struct inode *inode = path->dentry->d_inode;
@@ -1570,7 +1581,7 @@ struct file *finish_open_handle(struct path *path,
if (error)
goto exit;
}
- error = may_open(path, acc_mode, open_flag);
+ error = __may_open(path, acc_mode, open_flag, 1);
if (error) {
if (will_truncate)
mnt_drop_write(path->mnt);
--
1.7.1.331.ga5080

2010-06-15 17:13:36

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V14 03/11] vfs: Add open by file handle support

Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/exportfs/expfs.c | 2 +
fs/namei.c | 50 ++++++++++++
fs/open.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 3 +-
include/linux/syscalls.h | 2 +
5 files changed, 254 insertions(+), 1 deletions(-)

diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index cfee0f0..05a1179 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -373,6 +373,8 @@ struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
/*
* Try to get any dentry for the given file handle from the filesystem.
*/
+ if (!nop || !nop->fh_to_dentry)
+ return ERR_PTR(-ESTALE);
result = nop->fh_to_dentry(mnt->mnt_sb, fid, fh_len, fileid_type);
if (!result)
result = ERR_PTR(-ESTALE);
diff --git a/fs/namei.c b/fs/namei.c
index 868d0cb..c2d19c7 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1557,6 +1557,56 @@ static int open_will_truncate(int flag, struct inode *inode)
return (flag & O_TRUNC);
}

+struct file *finish_open_handle(struct path *path,
+ int open_flag, int acc_mode)
+{
+ int error;
+ struct file *filp;
+ int will_truncate;
+
+ will_truncate = open_will_truncate(open_flag, path->dentry->d_inode);
+ if (will_truncate) {
+ error = mnt_want_write(path->mnt);
+ if (error)
+ goto exit;
+ }
+ error = may_open(path, acc_mode, open_flag);
+ if (error) {
+ if (will_truncate)
+ mnt_drop_write(path->mnt);
+ goto exit;
+ }
+ filp = dentry_open(path->dentry, path->mnt, open_flag, current_cred());
+ if (!IS_ERR(filp)) {
+ error = ima_file_check(filp, acc_mode);
+ if (error) {
+ fput(filp);
+ filp = ERR_PTR(error);
+ }
+ }
+ if (!IS_ERR(filp)) {
+ if (will_truncate) {
+ error = handle_truncate(path);
+ if (error) {
+ fput(filp);
+ filp = ERR_PTR(error);
+ }
+ }
+ }
+ /*
+ * It is now safe to drop the mnt write
+ * because the filp has had a write taken
+ * on its behalf.
+ */
+ if (will_truncate)
+ mnt_drop_write(path->mnt);
+ return filp;
+
+exit:
+ path_put(path);
+ return ERR_PTR(error);
+}
+
static struct file *finish_open(struct nameidata *nd,
int open_flag, int acc_mode)
{
diff --git a/fs/open.c b/fs/open.c
index 43ac798..1c5daa3 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1168,3 +1168,201 @@ SYSCALL_DEFINE4(name_to_handle_at, int, dfd, const char __user *, name,
return -ENOSYS;
}
#endif
+
+#ifdef CONFIG_EXPORTFS
+static struct vfsmount *get_vfsmount_from_fd(int fd)
+{
+ int fput_needed;
+ struct path *path;
+ struct file *filep;
+
+ if (fd == AT_FDCWD) {
+ struct fs_struct *fs = current->fs;
+ read_lock(&fs->lock);
+ path = &fs->pwd;
+ mntget(path->mnt);
+ read_unlock(&fs->lock);
+ } else {
+ filep = fget_light(fd, &fput_needed);
+ if (!filep)
+ return ERR_PTR(-EBADF);
+ path = &filep->f_path;
+ mntget(path->mnt);
+ fput_light(filep, fput_needed);
+ }
+ return path->mnt;
+}
+
+static int vfs_dentry_acceptable(void *context, struct dentry *dentry)
+{
+ return 1;
+}
+
+static struct path *handle_to_path(int mountdirfd, struct file_handle *handle)
+{
+ int retval;
+ int handle_size;
+ struct path *path;
+
+ path = kmalloc(sizeof(struct path), GFP_KERNEL);
+ if (!path)
+ return ERR_PTR(-ENOMEM);
+
+ path->mnt = get_vfsmount_from_fd(mountdirfd);
+ if (IS_ERR(path->mnt)) {
+ retval = PTR_ERR(path->mnt);
+ goto out_err;
+ }
+ /* change the handle size to multiple of sizeof(u32) */
+ handle_size = handle->handle_size >> 2;
+ path->dentry = exportfs_decode_fh(path->mnt,
+ (struct fid *)handle->f_handle,
+ handle_size, handle->handle_type,
+ vfs_dentry_acceptable, NULL);
+ if (IS_ERR(path->dentry)) {
+ retval = PTR_ERR(path->dentry);
+ goto out_mnt;
+ }
+ return path;
+out_mnt:
+ mntput(path->mnt);
+out_err:
+ kfree(path);
+ return ERR_PTR(retval);
+}
+
+static long do_sys_open_by_handle(int mountdirfd,
+ struct file_handle __user *ufh, int open_flag)
+{
+ long retval = 0;
+ int fd, acc_mode;
+ struct file *filp;
+ struct path *path;
+ struct file_handle f_handle;
+ struct file_handle *handle = NULL;
+
+ /*
+ * With handle we don't look at the execute bit on the
+ * the directory. Ideally we would like CAP_DAC_SEARCH.
+ * But we don't have that
+ */
+ if (!capable(CAP_DAC_READ_SEARCH)) {
+ retval = -EPERM;
+ goto out_err;
+ }
+ /* can't use O_CREATE with open_by_handle */
+ if (open_flag & O_CREAT) {
+ retval = -EINVAL;
+ goto out_err;
+ }
+ if (copy_from_user(&f_handle, ufh, sizeof(struct file_handle))) {
+ retval = -EFAULT;
+ goto out_err;
+ }
+ if ((f_handle.handle_size > MAX_HANDLE_SZ) ||
+ (f_handle.handle_size <= 0)) {
+ retval = -EINVAL;
+ goto out_err;
+ }
+ handle = kmalloc(sizeof(struct file_handle) + f_handle.handle_size,
+ GFP_KERNEL);
+ if (!handle) {
+ retval = -ENOMEM;
+ goto out_err;
+ }
+ /* copy the full handle */
+ if (copy_from_user(handle, ufh,
+ sizeof(struct file_handle) +
+ f_handle.handle_size)) {
+ retval = -EFAULT;
+ goto out_handle;
+ }
+ path = handle_to_path(mountdirfd, handle);
+ if (IS_ERR(path)) {
+ retval = PTR_ERR(path);
+ goto out_handle;
+ }
+ if ((open_flag & O_DIRECTORY) &&
+ !S_ISDIR(path->dentry->d_inode->i_mode)) {
+ retval = -ENOTDIR;
+ goto out_path;
+ }
+ /*
+ * O_SYNC is implemented as __O_SYNC|O_DSYNC. As many places only
+ * check for O_DSYNC if the need any syncing at all we enforce it's
+ * always set instead of having to deal with possibly weird behaviour
+ * for malicious applications setting only __O_SYNC.
+ */
+ if (open_flag & __O_SYNC)
+ open_flag |= O_DSYNC;
+
+ acc_mode = MAY_OPEN | ACC_MODE(open_flag);
+
+ /* O_TRUNC implies we need access checks for write permissions */
+ if (open_flag & O_TRUNC)
+ acc_mode |= MAY_WRITE;
+ /*
+ * Allow the LSM permission hook to distinguish append
+ * access from general write access.
+ */
+ if (open_flag & O_APPEND)
+ acc_mode |= MAY_APPEND;
+
+ fd = get_unused_fd_flags(open_flag);
+ if (fd < 0) {
+ retval = fd;
+ goto out_path;
+ }
+ filp = finish_open_handle(path, open_flag, acc_mode);
+ if (IS_ERR(filp)) {
+ put_unused_fd(fd);
+ retval = PTR_ERR(filp);
+ } else {
+ retval = fd;
+ fsnotify_open(filp->f_path.dentry);
+ fd_install(fd, filp);
+ }
+ kfree(path);
+ kfree(handle);
+ return retval;
+
+out_path:
+ path_put(path);
+ kfree(path);
+out_handle:
+ kfree(handle);
+out_err:
+ return retval;
+}
+
+/**
+ * sys_open_by_handle_at: Open the file handle
+ * @mountdirfd: directory file descriptor
+ * @handle: file handle to be opened
+ * @flag: open flags.
+ *
+ * @mountdirfd indicate the directory file descriptor
+ * of the mount point. file handle is decoded relative
+ * to the vfsmount pointed by the @mountdirfd. @flags
+ * value is same as the open(2) flags.
+ */
+SYSCALL_DEFINE3(open_by_handle_at, int, mountdirfd,
+ struct file_handle __user *, handle,
+ int, flags)
+{
+ long ret;
+
+ if (force_o_largefile())
+ flags |= O_LARGEFILE;
+
+ ret = do_sys_open_by_handle(mountdirfd, handle, flags);
+ return ret;
+}
+#else
+SYSCALL_DEFINE3(open_by_handle_at, int, mountdirfd,
+ struct file_handle __user *, handle,
+ int, flags)
+{
+ return -ENOSYS;
+}
+#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ebc0f98..1415cad 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2141,7 +2141,8 @@ extern int may_open(struct path *, int, int);

extern int kernel_read(struct file *, loff_t, char *, unsigned long);
extern struct file * open_exec(const char *);
-
+extern struct file *finish_open_handle(struct path *, int, int);
+
/* fs/dcache.c -- generic fs support functions */
extern int is_subdir(struct dentry *, struct dentry *);
extern int path_is_under(struct path *, struct path *);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 4d4e922..37c629d 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -826,5 +826,7 @@ asmlinkage long sys_mmap_pgoff(unsigned long addr, unsigned long len,
asmlinkage long sys_old_mmap(struct mmap_arg_struct __user *arg);
asmlinkage long sys_name_to_handle_at(int dfd, const char __user *name,
struct file_handle __user *handle, int flag);
+asmlinkage long sys_open_by_handle_at(int mountdirfd,
+ struct file_handle __user *handle, int flags);

#endif
--
1.7.1.331.ga5080

2010-06-15 17:13:45

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V14 06/11] ext4: Copy fs UUID to superblock

This enables userspace application to find the file system
UUID using sys_name_to_handle_at syscall

Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/super.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 4e8983a..f2bd1bc 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2816,6 +2816,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
sb->s_qcop = &ext4_qctl_operations;
sb->dq_op = &ext4_quota_operations;
#endif
+ memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid));
+
INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
mutex_init(&sbi->s_orphan_lock);
mutex_init(&sbi->s_resize_lock);
--
1.7.1.331.ga5080

2010-07-01 16:29:10

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Tue, 15 Jun 2010 22:42:50 +0530, "Aneesh Kumar K.V" <[email protected]> wrote:

Hi Al,

Any chance of getting this reviewed/merged in the next merge window ?

-aneesh

> Hi,
>
> The below set of patches implement open by handle support using exportfs
> operations. This allows user space application to map a file name to file
> handle and later open the file using handle. This should be usable
> for userspace NFS [1] and 9P server [2]. XFS already support this with the ioctls
> XFS_IOC_PATH_TO_HANDLE and XFS_IOC_OPEN_BY_HANDLE.
>
> [1] http://nfs-ganesha.sourceforge.net/
> [2] http://thread.gmane.org/gmane.comp.emulators.qemu/68992
>
> git repo for the patchset at:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/kvaneesh/linux-open-handle.git open-by-handle-v14
>
> Changes from V13:
> a) Add support for file descriptor to handle conversion. This is needed
> so that we find the right file handle for newly created files.
>
> Changes from V12:
> a) Use CAP_DAC_READ_SEARCH instead of CAP_DAC_OVERRIDE in open_by_handle
> b) Return -ENOTDIR if O_DIRECTORY flag is specified in open_by_handle with
> handle for non directory
>
> Changes from V11:
> a) Add necessary documentation to different functions
> b) Add null pathname support to faccessat and linkat similar to
> readlinkat.
> c) compile fix on x86_64
>
> Changes from V10:
> a) Missed an stg refresh before sending out the patchset. Send
> updated patchset.
>
> Changes from V9:
> a) Fix compile errors with CONFIG_EXPORTFS not defined
> b) Return -EOPNOTSUPP if file system doesn't support fh_to_dentry exportfs callback.
>
> Changes from V8:
> a) exportfs_decode_fh now returns -ESTALE if export operations is not defined.
> b) drop get_fsid super_operations. Instead use superblock to store uuid.
>
> Changes from V7:
> a) open_by_handle now use mountdirfd to identify the vfsmount.
> b) We don't validate the UUID passed as a part of file handle in open_by_handle.
> UUID is provided as a part of file handle as an easy way for userspace to
> use the kernel returned handle as it is. It also helps in finding the 16 byte
> filessytem UUID in userspace without using file system specific libraries to
> read file system superblock. If a particular file system doesn't support UUID
> or any form of unique id this field in the file handle will be zero filled.
> c) drop freadlink syscall. Instead use readlinkat with NULL pathname to indicate
> read the link target name of the link pointed by fd. This is similar to
> sys_utimensat
> d) Instead of opencoding all the open flag related check use helper functions.
> Did finish_open_by_handle similar to finish_open.
> c) Fix may_open to not return ELOOP for symlink when we are called from handle open.
> open(2) still returns error as expected.
>
> Changes from V6:
> a) Add uuid to vfsmount lookup and drop uuid to superblock lookup
> b) Return -EOPNOTSUPP in sys_name_to_handle if the file system returned uuid
> doesn't give the same vfsmount on lookup. This ensure that we fail
> sys_name_to_handle when we have multiple file system returning same UUID.
>
> Changes from V5:
> a) added sys_name_to_handle_at syscall which takes AT_SYMLINK_NOFOLLOW flag
> instead of two syscalls sys_name_to_handle and sys_lname_to_handle.
> b) addressed review comments from Niel Brown
> c) rebased to b91ce4d14a21fc04d165be30319541e0f9204f15
> d) Add compat_sys_open_by_handle
>
> Chages from V4:
> a) Changed the syscal arguments so that we don't need compat syscalls
> as suggested by Christoph
> c) Added two new syscall sys_lname_to_handle and sys_freadlink to work with
> symlinks
> d) Changed open_by_handle to work with all file types
> e) Add ext3 support
>
> Changes from V3:
> a) Code cleanup suggested by Andreas
> b) x86_64 syscall support
> c) add compat syscall
>
> Chages from V2:
> a) Support system wide unique handle.
>
> Changes from v1:
> a) handle size is now specified in bytes
> b) returns -EOVERFLOW if the handle size is small
> c) dropped open_handle syscall and added open_by_handle_at syscall
> open_by_handle_at takes mount_fd as the directory fd of the mount point
> containing the file
> e) handle will only be unique in a given file system. So for an NFS server
> exporting multiple file system, NFS server will have to internally track the
> mount point to which a file handle belongs to. We should be able to do it much
> easily than expecting kernel to give a system wide unique file handle. System
> wide unique file handle would need much larger changes to the exportfs or VFS
> interface and I was not sure whether we really need to do that in the kernel or
> in the user space
> f) open_handle_at now only check for DAC_OVERRIDE capability
>
>
> Example program: (x86_32). (x86_64 would need a different syscall number)
> -------
> cc <source.c> -luuid
> --------
> #define _GNU_SOURCE
> #include <stdio.h>
> #include <stdlib.h>
>
> #include <fcntl.h>
> #include <unistd.h>
> #include <errno.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <string.h>
> #include <uuid/uuid.h>
>
> struct file_handle {
> int handle_size;
> int handle_type;
> uuid_t fs_uuid;
> unsigned char handle[0];
> };
>
> #define AT_FDCWD -100
> #define AT_SYMLINK_FOLLOW 0x400
>
> static int name_to_handle(const char *name, struct file_handle *fh)
> {
> return syscall(338, AT_FDCWD, name, fh, AT_SYMLINK_FOLLOW);
> }
>
> static int lname_to_handle(const char *name, struct file_handle *fh)
> {
> return syscall(338, AT_FDCWD, name, fh, 0);
> }
>
> static int fd_to_handle(int fd, struct file_handle *fh)
> {
> return syscall(338, fd, NULL, fh, AT_SYMLINK_FOLLOW);
> }
>
> static int open_by_handle(int mountfd, struct file_handle *fh, int flags)
> {
> return syscall(339, mountfd, fh, flags);
> }
>
> #define BUFSZ 100
> int main(int argc, char *argv[])
> {
> int fd;
> int ret, done = 0;
> int mountfd;
> int handle_sz;
> struct stat bufstat;
> char buf[BUFSZ];
> char uuid[36];
> struct file_handle *fh = NULL;;
> if (argc != 3 ) {
> printf("Usage: %s <filename> <mount-dir-name>\n", argv[0]);
> exit(1);
> }
> again:
> if (fh && fh->handle_size) {
> handle_sz = fh->handle_size;
> free(fh);
> fh = malloc(sizeof(struct file_handle) + handle_sz);
> fh->handle_size = handle_sz;
> } else {
> fh = malloc(sizeof(struct file_handle));
> fh->handle_size = 0;
> }
> errno = 0;
> ret = lname_to_handle(argv[1], fh);
> if (ret && errno == EOVERFLOW) {
> printf("Found the handle size needed to be %d\n", fh->handle_size);
> goto again;
> } else if (ret) {
> perror("Error:");
> exit(1);
> }
> do_again:
> uuid_unparse(fh->fs_uuid, uuid);
> printf("UUID:%s\n", uuid);
> printf("Waiting for input");
> getchar();
> mountfd = open(argv[2], O_RDONLY | O_DIRECTORY);
> if (mountfd <= 0) {
> perror("Error:");
> exit(1);
> }
> fd = open_by_handle(mountfd, fh, O_RDONLY);
> if (fd <= 0 ) {
> perror("Error:");
> exit(1);
> }
> printf("Reading the content now \n");
> fstat(fd, &bufstat);
> ret = S_ISLNK(bufstat.st_mode);
> if (ret) {
> memset(buf, 0 , BUFSZ);
> readlinkat(fd, NULL, buf, BUFSZ);
> printf("%s is a symlink pointing to %s\n", argv[1], buf);
> }
> memset(buf, 0 , BUFSZ);
> while (1) {
> ret = read(fd, buf, BUFSZ -1);
> if (ret <= 0)
> break;
> buf[ret] = '\0';
> printf("%s", buf);
> memset(buf, 0 , BUFSZ);
> }
> /* Now check for faccess */
> if (faccessat(fd, NULL, W_OK, 0) == 0) {
> printf("Got write permission on the file \n");
> } else
> perror("faccess error");
> /* now try to create a hardlink */
> if (linkat(fd, NULL, AT_FDCWD, "test", 0) == 0){
> printf("created hardlink\n");
> } else
> perror("linkat error");
> if (done)
> exit(0);
> printf("Map fd to handle \n");
> ret = fd_to_handle(fd, fh);
> if (ret) {
> perror("Error:");
> exit(1);
> }
> done = 1;
> goto do_again;
> }
>
> -aneesh
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2010-07-01 20:41:28

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Thu, 01 Jul 2010 21:58:54 +0530
"Aneesh Kumar K. V" <[email protected]> wrote:

> On Tue, 15 Jun 2010 22:42:50 +0530, "Aneesh Kumar K.V" <[email protected]> wrote:
>
> Hi Al,
>
> Any chance of getting this reviewed/merged in the next merge window ?

My own opinion of the patchset is that the code itself is fine,
however there is one part of the interface that bothers me.

I think that it is a little ugly that filesystem uuid extraction is so
closely tied to filehandle manipulation. They are certainly related, and we
certainly need to be able to get the filesystem uuid directly from the
filesystem, but given that filehandle -> fd mapping doesn't (and shouldn't)
use the uuid, the fact that fd/name -> filehandle mapping does return the
uuid looks like it is simply piggy backing some functionality on the side,
rather than creating a properly designed and general interface.

I would feel happier about the patches if you removed all reference to uuids
and then found some other way to ask a filesystem what its uuid was.

This is not an issue that would make be want to stop the patches going
upstream, but it does hold me back from offering a reviewed-by or
acked-by (for whatever they might be worth).

NeilBrown

2010-07-01 21:25:48

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Fri, 2 Jul 2010 06:41:08 +1000, Neil Brown <[email protected]> wrote:
> On Thu, 01 Jul 2010 21:58:54 +0530
> "Aneesh Kumar K. V" <[email protected]> wrote:
>
> > On Tue, 15 Jun 2010 22:42:50 +0530, "Aneesh Kumar K.V" <[email protected]> wrote:
> >
> > Hi Al,
> >
> > Any chance of getting this reviewed/merged in the next merge window ?
>
> My own opinion of the patchset is that the code itself is fine,
> however there is one part of the interface that bothers me.
>
> I think that it is a little ugly that filesystem uuid extraction is so
> closely tied to filehandle manipulation. They are certainly related, and we
> certainly need to be able to get the filesystem uuid directly from the
> filesystem, but given that filehandle -> fd mapping doesn't (and shouldn't)
> use the uuid, the fact that fd/name -> filehandle mapping does return the
> uuid looks like it is simply piggy backing some functionality on the side,
> rather than creating a properly designed and general interface.
>
> I would feel happier about the patches if you removed all reference to uuids
> and then found some other way to ask a filesystem what its uuid was.
>
> This is not an issue that would make be want to stop the patches going
> upstream, but it does hold me back from offering a reviewed-by or
> acked-by (for whatever they might be worth).
>

One use case i had was that if the userspace file server can directly work
with the returned file system UUID, the it can build the file handle for client
in a single call.


-aneesh

2010-07-02 04:04:19

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On 2010-07-01, at 14:41, Neil Brown <[email protected]> wrote:
> I think that it is a little ugly that filesystem uuid extraction is so
> closely tied to filehandle manipulation. They are certainly related, and we
> certainly need to be able to get the filesystem uuid directly from the
> filesystem, but given that filehandle -> fd mapping doesn't (and shouldn't)
> use the uuid, the fact that fd/name -> filehandle mapping does return the
> uuid looks like it is simply piggy backing some functionality on the side,
> rather than creating a properly designed and general interface.

I disagree. Getting the UUID as part of the filehandle avoids an extra system call for the client and avoids the need for some other non-standard interface to get the UUID.

I'd like to be able to use this interface to implement the distributed open call proposed by the POSIX HECWG. This allows one client to do the path traversal, broadcast the file handle to the (maybe) 1M processes in the job via MPI, and then the other clients can open the file by handle without doing 1M times the full path traversal (which might be 10's of RPCs per process).

Cheers, Andreas
>

2010-07-02 07:06:05

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Thu, Jul 01, 2010 at 10:02:29PM -0600, Andreas Dilger wrote:
> I'd like to be able to use this interface to implement the distributed open call proposed by the POSIX HECWG. This allows one client to do the path traversal, broadcast the file handle to the (maybe) 1M processes in the job via MPI, and then the other clients can open the file by handle without doing 1M times the full path traversal (which might be 10's of RPCs per process).

The proposal is doomed anyway. If we allow any sort of open by handle
system call for unprivilegued users we need to do reconnect the dentry
to the dcache path anyway (reconnect_path), which is more expensive than
a normal path lookup.

2010-07-02 16:15:24

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On 2010-07-02, at 01:05, [email protected] wrote:
> On Thu, Jul 01, 2010 at 10:02:29PM -0600, Andreas Dilger wrote:
>> I'd like to be able to use this interface to implement the distributed open call proposed by the POSIX HECWG. This allows one client to do the path traversal, broadcast the file handle to the (maybe) 1M processes in the job via MPI, and then the other clients can open the file by handle without doing 1M times the full path traversal (which might be 10's of RPCs per process).
>
> The proposal is doomed anyway. If we allow any sort of open by handle
> system call for unprivilegued users we need to do reconnect the dentry
> to the dcache path anyway (reconnect_path), which is more expensive than
> a normal path lookup.

I haven't looked at this part of the VFS in a while, but it looks like an implementation issue specific to knfsd, and shouldn't be needed for regular files. i.e. if exportfs_encode_fh() is never used on a disconnected file, then this overhead is not incurred.

The above use of open_by_handle() is not for userspace NFS/Samba re-export, but to allow applications to open regular files for IO.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

2010-07-02 22:09:28

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Fri, 2 Jul 2010 10:12:47 -0600
Andreas Dilger <[email protected]> wrote:

> On 2010-07-02, at 01:05, [email protected] wrote:
> > On Thu, Jul 01, 2010 at 10:02:29PM -0600, Andreas Dilger wrote:
> >> I'd like to be able to use this interface to implement the distributed open call proposed by the POSIX HECWG. This allows one client to do the path traversal, broadcast the file handle to the (maybe) 1M processes in the job via MPI, and then the other clients can open the file by handle without doing 1M times the full path traversal (which might be 10's of RPCs per process).
> >
> > The proposal is doomed anyway. If we allow any sort of open by handle
> > system call for unprivilegued users we need to do reconnect the dentry
> > to the dcache path anyway (reconnect_path), which is more expensive than
> > a normal path lookup.
>
> I haven't looked at this part of the VFS in a while, but it looks like an implementation issue specific to knfsd, and shouldn't be needed for regular files. i.e. if exportfs_encode_fh() is never used on a disconnected file, then this overhead is not incurred.
>
> The above use of open_by_handle() is not for userspace NFS/Samba re-export, but to allow applications to open regular files for IO.
>

From my recollection of implementing dentry reconnection there are two
needs for it.

Firstly it is needed for directories so that the VFS can effectively lock
against directory rename races which could otherwise create disconnected
subtrees (where the first parent is a member only of one of its
descendants). So if you get a filehandle for a directory it *must* be
properly connected to the root for rename to be safe. This operation is
faster than a full path lookup if the dentry is already is cache, and slower
if it and any of the path is not in cache.
You could possibly delay the full-connection of the dentry until the first
attempt to rename beneath it. I'm not sure how much VFS surgery that would
require.

Secondly it is needed if you want to enforce the rule that the contents of a
directory are only accessible if the 'x' bit on the directory is set.
kNFSd does not enforce this (unless subtree_check is specified), partly
because it is hard to do correctly and partly because we have to trust the
client any, so trusting it to check the 'x' bit is very little extra trust.

Note that it is not possible to reliably perform filehandle lookup for
non-directories if you need a fully reconnected dentry, as
cross-directory-renames can confuse the situation beyond recovery.

Maybe open-by-handle should require DAC_OVERRIDE, or maybe a new
DAC_X_OVERRIDE. And if those aren't provided it only works for directories.
???

NeilBrown

2010-07-02 22:52:53

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On 2010-07-02, at 16:09, Neil Brown wrote:
> On Fri, 2 Jul 2010 10:12:47 -0600
> Andreas Dilger <[email protected]> wrote:
>>
>> I haven't looked at this part of the VFS in a while, but it looks like reconnect_path() is an implementation issue specific to knfsd, and shouldn't be needed for regular files. i.e. if exportfs_encode_fh() is never used on a disconnected file, then this overhead is not incurred.
>>
>> The above use of open_by_handle() is not for userspace NFS/Samba re-export, but to allow applications to open regular files for IO.
>
> Firstly it is needed for directories so that the VFS can effectively lock
> against directory rename races which could otherwise create disconnected
> subtrees (where the first parent is a member only of one of its
> descendants). So if you get a filehandle for a directory it *must* be
> properly connected to the root for rename to be safe. This operation is
> faster than a full path lookup if the dentry is already is cache, and slower
> if it and any of the path is not in cache.

OK, so this requirement is specific for directories, and not at all needed for regular files.

> Secondly it is needed if you want to enforce the rule that the contents of a
> directory are only accessible if the 'x' bit on the directory is set.
> kNFSd does not enforce this (unless subtree_check is specified), partly
> because it is hard to do correctly and partly because we have to trust the
> client any, so trusting it to check the 'x' bit is very little extra trust.

If the application that called name_to_handle() already had to traverse the whole pathname to get the file handle, then there shouldn't necessarily be a requirement to do this when calling open_by_handle(). The only possible permission checking in open_by_handle() is the permission on the inode itself.

> Note that it is not possible to reliably perform filehandle lookup for
> non-directories if you need a fully reconnected dentry, as
> cross-directory-renames can confuse the situation beyond recovery.

For normal file IO, a fully connected dentry is not needed, and in fact the handle_to_path->exportfs_decode_fh() code will accept any inode alias for reguar file use.

> Maybe open-by-handle should require DAC_OVERRIDE, or maybe a new
> DAC_X_OVERRIDE. And if those aren't provided it only works for directories.

That's the big question. If the file handle has some "non-public" information in it (i.e. a capability that cannot be (easily) guessed or forged), then there should not be any need for DAC_OVERRIDE. This could easily be enforced if there was a provision for "short term" file handles that only had to live a few minutes or less, so the kernel could just store a random cookie in each file handle and require applications to get a new handle if the cookie expires or the server crashes.

However, even a "plain" file handle containing only the inode/generation is relatively secure in this respect, since the only way to get the inode number of a particular file is "ls -li" (which either assumes path "x" traversal permission, OR guessing the inode number), and ioctl(FS_IOC_GETVERSION) which requires being able to open the inode already.

Guessing the inode number by itself is fairly weak, at most 2^32 inodes in most filesystems, usually far fewer. Guessing the generation number is much harder (though not impossible).

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

2010-07-03 16:06:25

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Sat, 3 Jul 2010 08:09:04 +1000, Neil Brown <[email protected]> wrote:
> On Fri, 2 Jul 2010 10:12:47 -0600
> Andreas Dilger <[email protected]> wrote:
>
> > On 2010-07-02, at 01:05, [email protected] wrote:
> > > On Thu, Jul 01, 2010 at 10:02:29PM -0600, Andreas Dilger wrote:
> > >> I'd like to be able to use this interface to implement the distributed open call proposed by the POSIX HECWG. This allows one client to do the path traversal, broadcast the file handle to the (maybe) 1M processes in the job via MPI, and then the other clients can open the file by handle without doing 1M times the full path traversal (which might be 10's of RPCs per process).
> > >
> > > The proposal is doomed anyway. If we allow any sort of open by handle
> > > system call for unprivilegued users we need to do reconnect the dentry
> > > to the dcache path anyway (reconnect_path), which is more expensive than
> > > a normal path lookup.
> >
> > I haven't looked at this part of the VFS in a while, but it looks like an implementation issue specific to knfsd, and shouldn't be needed for regular files. i.e. if exportfs_encode_fh() is never used on a disconnected file, then this overhead is not incurred.
> >
> > The above use of open_by_handle() is not for userspace NFS/Samba re-export, but to allow applications to open regular files for IO.
> >
>
> From my recollection of implementing dentry reconnection there are two
> needs for it.
>
> Firstly it is needed for directories so that the VFS can effectively lock
> against directory rename races which could otherwise create disconnected
> subtrees (where the first parent is a member only of one of its
> descendants). So if you get a filehandle for a directory it *must* be
> properly connected to the root for rename to be safe. This operation is
> faster than a full path lookup if the dentry is already is cache, and slower
> if it and any of the path is not in cache.
> You could possibly delay the full-connection of the dentry until the first
> attempt to rename beneath it. I'm not sure how much VFS surgery that would
> require.
>
> Secondly it is needed if you want to enforce the rule that the contents of a
> directory are only accessible if the 'x' bit on the directory is set.
> kNFSd does not enforce this (unless subtree_check is specified), partly
> because it is hard to do correctly and partly because we have to trust the
> client any, so trusting it to check the 'x' bit is very little extra trust.
>
> Note that it is not possible to reliably perform filehandle lookup for
> non-directories if you need a fully reconnected dentry, as
> cross-directory-renames can confuse the situation beyond recovery.
>
> Maybe open-by-handle should require DAC_OVERRIDE, or maybe a new
> DAC_X_OVERRIDE. And if those aren't provided it only works for directories.
> ???
>

Currently I have the below in open_by_handle

/*
* With handle we don't look at the execute bit on the
* the directory. Ideally we would like CAP_DAC_SEARCH.
* But we don't have that
*/
if (!capable(CAP_DAC_READ_SEARCH)) {
retval = -EPERM;
goto out_err;
}

-aneesh

2010-07-06 16:10:22

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Fri, Jul 02, 2010 at 02:45:45AM +0530, Aneesh Kumar K. V wrote:
> One use case i had was that if the userspace file server can directly
> work with the returned file system UUID,

I agree that the uuid should be split out from the rest of the
filehandle, but ...

> the it can build the file
> handle for client in a single call.

... I don't understand why both need to come in the same system call.
Is it purely an efficiency question? If so, why do you expect this to
be significant?

(I would have thought that the system call overhead is so small, and so
many calls will already be required to perform the typical rpc, that
this would be insignificant.)

A filesystem uuid seems like a generally useful thing (maybe more so
than a filehandle), so it'd seem worth figuring out how to export that
separately.

--b.

2010-07-06 17:25:14

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Tue, 6 Jul 2010 12:10:02 -0400, "J. Bruce Fields" <[email protected]> wrote:
> On Fri, Jul 02, 2010 at 02:45:45AM +0530, Aneesh Kumar K. V wrote:
> > One use case i had was that if the userspace file server can directly
> > work with the returned file system UUID,
>
> I agree that the uuid should be split out from the rest of the
> filehandle, but ...
>
> > the it can build the file
> > handle for client in a single call.
>
> ... I don't understand why both need to come in the same system call.
> Is it purely an efficiency question? If so, why do you expect this to
> be significant?

Since we know that system wide file handle should include a file system
identifier and a file identifier my plan was to retrieve both in the
same syscall.


>
> (I would have thought that the system call overhead is so small, and so
> many calls will already be required to perform the typical rpc, that
> this would be insignificant.)
>
> A filesystem uuid seems like a generally useful thing (maybe more so
> than a filehandle), so it'd seem worth figuring out how to export that
> separately.
>

I can add a new syscall that returns

struct fs_uuid {
u8 fs_uuid[16];
};

long sys_get_fs_uuid(int dfd, char *name, struct fs_uuid *fsid, int flag);

-aneesh

2010-07-06 23:23:58

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Tue, Jul 06, 2010 at 10:39:56PM +0530, Aneesh Kumar K. V wrote:
> On Tue, 6 Jul 2010 12:10:02 -0400, "J. Bruce Fields" <[email protected]> wrote:
> > On Fri, Jul 02, 2010 at 02:45:45AM +0530, Aneesh Kumar K. V wrote:
> > > One use case i had was that if the userspace file server can directly
> > > work with the returned file system UUID,
> >
> > I agree that the uuid should be split out from the rest of the
> > filehandle, but ...
> >
> > > the it can build the file
> > > handle for client in a single call.
> >
> > ... I don't understand why both need to come in the same system call.
> > Is it purely an efficiency question? If so, why do you expect this to
> > be significant?
>
> Since we know that system wide file handle should include a file system
> identifier and a file identifier my plan was to retrieve both in the
> same syscall.
>
>
> >
> > (I would have thought that the system call overhead is so small, and so
> > many calls will already be required to perform the typical rpc, that
> > this would be insignificant.)
> >
> > A filesystem uuid seems like a generally useful thing (maybe more so
> > than a filehandle), so it'd seem worth figuring out how to export that
> > separately.
> >
>
> I can add a new syscall that returns
>
> struct fs_uuid {
> u8 fs_uuid[16];
> };
>
> long sys_get_fs_uuid(int dfd, char *name, struct fs_uuid *fsid, int flag);

libblkid already provides the UUID to userspace applications, doesn't it?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-07-06 23:36:47

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Wed, 7 Jul 2010 09:23:51 +1000
Dave Chinner <[email protected]> wrote:

> > I can add a new syscall that returns
> >
> > struct fs_uuid {
> > u8 fs_uuid[16];
> > };
> >
> > long sys_get_fs_uuid(int dfd, char *name, struct fs_uuid *fsid, int flag);
>
> libblkid already provides the UUID to userspace applications, doesn't it?

Yes and no.

libblkid provides the uuid of the thing that uses a block device. That
doesn't directly map to "UUID of a filesystem".

There are two types of filesystem that I can think of for which libblkid
cannot give a uuid.
- network filesystems (or virtual filesystems, or fuse )
- filesystems which share a block device, such as btrfs.
btrfs can have 'subvols' - multiple "filesystems" within
the one (set of) block device(s). libblkid cannot be asked about these
different subvols.

libblkid is useful, but not a real solution.

NeilBrown

2010-07-07 02:11:57

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Wed, Jul 07, 2010 at 09:36:29AM +1000, Neil Brown wrote:
> On Wed, 7 Jul 2010 09:23:51 +1000
> Dave Chinner <[email protected]> wrote:
>
> > > I can add a new syscall that returns
> > >
> > > struct fs_uuid {
> > > u8 fs_uuid[16];
> > > };
> > >
> > > long sys_get_fs_uuid(int dfd, char *name, struct fs_uuid *fsid, int flag);
> >
> > libblkid already provides the UUID to userspace applications, doesn't it?
>
> Yes and no.
>
> libblkid provides the uuid of the thing that uses a block device. That
> doesn't directly map to "UUID of a filesystem".

True.

> There are two types of filesystem that I can think of for which libblkid
> cannot give a uuid.
> - network filesystems (or virtual filesystems, or fuse )

How would you guarantee persistent uniqueness for such filesystems?

> - filesystems which share a block device, such as btrfs.
> btrfs can have 'subvols' - multiple "filesystems" within
> the one (set of) block device(s). libblkid cannot be asked about these
> different subvols.
>
> libblkid is useful, but not a real solution.

So libblkid doesn't cover everything, but I think my question is
still valid - if we want per-filesystem UUIDs, why a syscall and not
just publishing it somewhere where we already publish per-mount
information? e.g. in /proc/mounts?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2010-07-07 02:57:49

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Wed, 7 Jul 2010 12:11:50 +1000
Dave Chinner <[email protected]> wrote:

> On Wed, Jul 07, 2010 at 09:36:29AM +1000, Neil Brown wrote:
> > On Wed, 7 Jul 2010 09:23:51 +1000
> > Dave Chinner <[email protected]> wrote:
> >
> > > > I can add a new syscall that returns
> > > >
> > > > struct fs_uuid {
> > > > u8 fs_uuid[16];
> > > > };
> > > >
> > > > long sys_get_fs_uuid(int dfd, char *name, struct fs_uuid *fsid, int flag);
> > >
> > > libblkid already provides the UUID to userspace applications, doesn't it?
> >
> > Yes and no.
> >
> > libblkid provides the uuid of the thing that uses a block device. That
> > doesn't directly map to "UUID of a filesystem".
>
> True.
>
> > There are two types of filesystem that I can think of for which libblkid
> > cannot give a uuid.
> > - network filesystems (or virtual filesystems, or fuse )
>
> How would you guarantee persistent uniqueness for such filesystems?

Persistent shouldn't be too hard in many cases.
What uniqueness guarantees do we have anyway? Mostly stochastic I expect.


>
> > - filesystems which share a block device, such as btrfs.
> > btrfs can have 'subvols' - multiple "filesystems" within
> > the one (set of) block device(s). libblkid cannot be asked about these
> > different subvols.
> >
> > libblkid is useful, but not a real solution.
>
> So libblkid doesn't cover everything, but I think my question is
> still valid - if we want per-filesystem UUIDs, why a syscall and not
> just publishing it somewhere where we already publish per-mount
> information? e.g. in /proc/mounts?

The trouble with /proc/mounts is that it is somewhat clumsy to parse
(remember to handle \0ctal escapes) and doesn't include major/minor number
which is the primary key for identifying filesystems in Linux
(see /sys/class/bdi/MAJOR:MINOR which is e.g. the best place to configure
read-ahead for a filesystem).

So /proc/mounts could work (and would probably be better than a new syscall)
but I would really rather see something sane in /sys for
inspecting/configuring filesystems (rather than each filesystem doing their
own independent thing in /sys/fs).

NeilBrown

2010-07-07 12:45:11

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Wed, 7 Jul 2010, Neil Brown wrote:
> The trouble with /proc/mounts is that it is somewhat clumsy to parse
> (remember to handle \0ctal escapes) and doesn't include major/minor number

/proc/self/mountinfo does. And it's extensible, just add an
"uuid:123" before the dash.

Thanks,
Miklos

2010-07-07 12:57:34

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Wed, Jul 07, 2010 at 12:57:26PM +1000, Neil Brown wrote:
> The trouble with /proc/mounts is that it is somewhat clumsy to parse
> (remember to handle \0ctal escapes) and doesn't include major/minor number
> which is the primary key for identifying filesystems in Linux
> (see /sys/class/bdi/MAJOR:MINOR which is e.g. the best place to configure
> read-ahead for a filesystem).
>
> So /proc/mounts could work (and would probably be better than a new syscall)
> but I would really rather see something sane in /sys for
> inspecting/configuring filesystems (rather than each filesystem doing their
> own independent thing in /sys/fs).

If you use sys or proc, is it possible to get the uuid from a file
descriptor or pathname without races?

--b.

2010-07-07 13:11:16

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Wed, 7 Jul 2010, J. Bruce Fields wrote:
> On Wed, Jul 07, 2010 at 12:57:26PM +1000, Neil Brown wrote:
> > The trouble with /proc/mounts is that it is somewhat clumsy to parse
> > (remember to handle \0ctal escapes) and doesn't include major/minor number
> > which is the primary key for identifying filesystems in Linux
> > (see /sys/class/bdi/MAJOR:MINOR which is e.g. the best place to configure
> > read-ahead for a filesystem).
> >
> > So /proc/mounts could work (and would probably be better than a new syscall)
> > but I would really rather see something sane in /sys for
> > inspecting/configuring filesystems (rather than each filesystem doing their
> > own independent thing in /sys/fs).
>
> If you use sys or proc, is it possible to get the uuid from a file
> descriptor or pathname without races?

You can do stat/fstat to find out the device number (which is unique,
but not persistent) then look for the major:minor calculated from
st_dev in /proc/self/mountinfo.

Thanks,
Miklos

2010-07-07 13:17:44

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Wed, Jul 07, 2010 at 03:10:21PM +0200, Miklos Szeredi wrote:
> On Wed, 7 Jul 2010, J. Bruce Fields wrote:
> > On Wed, Jul 07, 2010 at 12:57:26PM +1000, Neil Brown wrote:
> > > The trouble with /proc/mounts is that it is somewhat clumsy to parse
> > > (remember to handle \0ctal escapes) and doesn't include major/minor number
> > > which is the primary key for identifying filesystems in Linux
> > > (see /sys/class/bdi/MAJOR:MINOR which is e.g. the best place to configure
> > > read-ahead for a filesystem).
> > >
> > > So /proc/mounts could work (and would probably be better than a new syscall)
> > > but I would really rather see something sane in /sys for
> > > inspecting/configuring filesystems (rather than each filesystem doing their
> > > own independent thing in /sys/fs).
> >
> > If you use sys or proc, is it possible to get the uuid from a file
> > descriptor or pathname without races?
>
> You can do stat/fstat to find out the device number (which is unique,
> but not persistent)

Is it really unique over time? (Can't a given st_dev value map to one
filesystem now, and another later?) And it must also have the same
problems as libblkid (e.g. btrfs subvolumes share the same st_dev).

--b.

> then look for the major:minor calculated from
> st_dev in /proc/self/mountinfo.
>
> Thanks,
> Miklos

2010-07-07 13:36:19

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Wed, 7 Jul 2010, J. Bruce Fields wrote:
> > > If you use sys or proc, is it possible to get the uuid from a file
> > > descriptor or pathname without races?
> >
> > You can do stat/fstat to find out the device number (which is unique,
> > but not persistent)
>
> Is it really unique over time? (Can't a given st_dev value map to one
> filesystem now, and another later?)

It's unique at a single point in time. But if you have a reference
(e.g. open file descriptor) on the mount then that's not a problem.

fd = open(path, ...);
fstat(fd, &st);
search st.st_dev in mountinfo
close(fd)

is effectively the same as an getuuid(path) syscall (lazy unmounted
filesystems will not be found in mountinfo, but the reference is still
there so st_dev will not be reused for other filesystems).

Thanks,
Miklos

2010-07-07 14:45:36

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Wed, Jul 07, 2010 at 03:35:50PM +0200, Miklos Szeredi wrote:
> On Wed, 7 Jul 2010, J. Bruce Fields wrote:
> > > > If you use sys or proc, is it possible to get the uuid from a file
> > > > descriptor or pathname without races?
> > >
> > > You can do stat/fstat to find out the device number (which is unique,
> > > but not persistent)
> >
> > Is it really unique over time? (Can't a given st_dev value map to one
> > filesystem now, and another later?)
>
> It's unique at a single point in time. But if you have a reference
> (e.g. open file descriptor) on the mount then that's not a problem.
>
> fd = open(path, ...);
> fstat(fd, &st);
> search st.st_dev in mountinfo
> close(fd)
>
> is effectively the same as an getuuid(path) syscall (lazy unmounted
> filesystems will not be found in mountinfo, but the reference is still
> there so st_dev will not be reused for other filesystems).

OK, cool.

That still leaves the problem that there isn't always an underlying
block device, and/or when there is it doesn't always uniquely specify
the filesystem.

--b.

2010-07-07 15:01:24

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On 2010-07-06, at 11:09, Aneesh Kumar K. V wrote:
> On Tue, 6 Jul 2010 12:10:02 -0400, "J. Bruce Fields" <[email protected]> wrote:
>> ... I don't understand why both need to come in the same system call.
>> Is it purely an efficiency question? If so, why do you expect this to
>> be significant?
>
> Since we know that system wide file handle should include a file system
> identifier and a file identifier my plan was to retrieve both in the
> same syscall.

Won't having it be in a separate system call be racy w.r.t. doing the pathname lookup twice?

>> A filesystem uuid seems like a generally useful thing (maybe more so
>> than a filehandle), so it'd seem worth figuring out how to export that
>> separately.
>>
>
> I can add a new syscall that returns
>
> struct fs_uuid {
> u8 fs_uuid[16];
> };
>
> long sys_get_fs_uuid(int dfd, char *name, struct fs_uuid *fsid, int flag);

While this might be useful, I think the file handle should identify the filesystem itself.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

2010-07-07 15:05:52

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Wed, Jul 07, 2010 at 01:40:53AM -0600, Andreas Dilger wrote:
> On 2010-07-06, at 11:09, Aneesh Kumar K. V wrote:
> > On Tue, 6 Jul 2010 12:10:02 -0400, "J. Bruce Fields" <[email protected]> wrote:
> >> ... I don't understand why both need to come in the same system call.
> >> Is it purely an efficiency question? If so, why do you expect this to
> >> be significant?
> >
> > Since we know that system wide file handle should include a file system
> > identifier and a file identifier my plan was to retrieve both in the
> > same syscall.
>
> Won't having it be in a separate system call be racy w.r.t. doing the pathname lookup twice?

It'll be rare that a server will want to *just* get a filehandle;
normally it will at least want to get some attributes at the same time.

So I think it will always need to open the file first and then do the
rest of the operations on the returned filehandle.

--b.

2010-07-07 15:17:57

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH -V14 03/11] vfs: Add open by file handle support

On Tue, Jun 15, 2010 at 10:42:53PM +0530, Aneesh Kumar K.V wrote:
> diff --git a/fs/open.c b/fs/open.c
> index 43ac798..1c5daa3 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -1168,3 +1168,201 @@ SYSCALL_DEFINE4(name_to_handle_at, int, dfd, const char __user *, name,
> return -ENOSYS;
> }
> #endif
> +
> +#ifdef CONFIG_EXPORTFS
> +static struct vfsmount *get_vfsmount_from_fd(int fd)
> +{
> + int fput_needed;
> + struct path *path;
> + struct file *filep;
> +
> + if (fd == AT_FDCWD) {
> + struct fs_struct *fs = current->fs;
> + read_lock(&fs->lock);
> + path = &fs->pwd;
> + mntget(path->mnt);
> + read_unlock(&fs->lock);
> + } else {
> + filep = fget_light(fd, &fput_needed);
> + if (!filep)
> + return ERR_PTR(-EBADF);
> + path = &filep->f_path;
> + mntget(path->mnt);
> + fput_light(filep, fput_needed);
> + }
> + return path->mnt;
> +}

The bulk of this this should probably be in fs/namei.c, factored with
path_init stuff if possible please.

2010-07-07 15:23:19

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH -V14 04/11] vfs: Allow handle based open on symlinks

On Tue, Jun 15, 2010 at 10:42:54PM +0530, Aneesh Kumar K.V wrote:
> The patch update may_open to allow handle based open on symlinks.
> The file handle based API use file descritor returned from open_by_handle_at
> to do different file system operations. To find the link target name we
> need to get a file descriptor on symlinks.

This is a pretty big change, isn't it? Have you looked through vfs to
ensure this is actually OK? I was just looking at remount,ro code, for
example, and it seems to assume only writable open files on ISREG
files.

Is this restricted to RDONLY? Really, it should be O_NONE...

2010-07-07 15:27:11

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH -V14 05/11] vfs: Support null pathname in readlink

On Tue, Jun 15, 2010 at 10:42:55PM +0530, Aneesh Kumar K.V wrote:
> From: NeilBrown <[email protected]>
>
> This enables to use readlink to get the link target name
> from a file descriptor point to the link. This can be used
> with open_by_handle syscall that returns a file descriptor for a link.
> We can then use this file descriptor to get the target name.
>
> Signed-off-by: NeilBrown <[email protected]>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
> fs/stat.c | 30 ++++++++++++++++++++++--------
> 1 files changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/fs/stat.c b/fs/stat.c
> index c4ecd52..49b95a7 100644
> --- a/fs/stat.c
> +++ b/fs/stat.c
> @@ -284,26 +284,40 @@ SYSCALL_DEFINE2(newfstat, unsigned int, fd, struct stat __user *, statbuf)
> SYSCALL_DEFINE4(readlinkat, int, dfd, const char __user *, pathname,
> char __user *, buf, int, bufsiz)
> {
> - struct path path;
> - int error;
> + int error = 0;
> + struct path path, *pp;
> + struct file *file = NULL;
>
> if (bufsiz <= 0)
> return -EINVAL;
>
> - error = user_path_at(dfd, pathname, 0, &path);
> + if (pathname == NULL && dfd != AT_FDCWD) {
> + file = fget(dfd);
> +
> + if (file)
> + pp = &file->f_path;
> + else
> + error = -EBADF;
> + } else {
> + error = user_path_at(dfd, pathname, 0, &path);
> + pp = &path;
> + }

This (and all the others) is really ugly overloading of syscall
arguments IMO, and the changelog is seriously lacking for such changes.

This also changes the the syscall API of existing calls; from reading
the path at NULL, to switching to a completely different syscall.
Perhaps you're assuming nobody relies on SIGSEGV / mmapped NULL address
there, but even then you surely need to document the changed semantics
somewhere (and document the new syscall semantics properly).

2010-07-07 16:16:26

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V14 03/11] vfs: Add open by file handle support

On Thu, 8 Jul 2010 01:17:38 +1000, Nick Piggin <[email protected]> wrote:
> On Tue, Jun 15, 2010 at 10:42:53PM +0530, Aneesh Kumar K.V wrote:
> > diff --git a/fs/open.c b/fs/open.c
> > index 43ac798..1c5daa3 100644
> > --- a/fs/open.c
> > +++ b/fs/open.c
> > @@ -1168,3 +1168,201 @@ SYSCALL_DEFINE4(name_to_handle_at, int, dfd, const char __user *, name,
> > return -ENOSYS;
> > }
> > #endif
> > +
> > +#ifdef CONFIG_EXPORTFS
> > +static struct vfsmount *get_vfsmount_from_fd(int fd)
> > +{
> > + int fput_needed;
> > + struct path *path;
> > + struct file *filep;
> > +
> > + if (fd == AT_FDCWD) {
> > + struct fs_struct *fs = current->fs;
> > + read_lock(&fs->lock);
> > + path = &fs->pwd;
> > + mntget(path->mnt);
> > + read_unlock(&fs->lock);
> > + } else {
> > + filep = fget_light(fd, &fput_needed);
> > + if (!filep)
> > + return ERR_PTR(-EBADF);
> > + path = &filep->f_path;
> > + mntget(path->mnt);
> > + fput_light(filep, fput_needed);
> > + }
> > + return path->mnt;
> > +}
>
> The bulk of this this should probably be in fs/namei.c, factored with
> path_init stuff if possible please.
>

Will move this to fs/namei.c. I guess we can also remove #ifdef
CONFIG_EXPORTFS around the code ?

-aneesh

2010-07-07 16:25:12

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V14 04/11] vfs: Allow handle based open on symlinks

On Thu, 8 Jul 2010 01:23:03 +1000, Nick Piggin <[email protected]> wrote:
> On Tue, Jun 15, 2010 at 10:42:54PM +0530, Aneesh Kumar K.V wrote:
> > The patch update may_open to allow handle based open on symlinks.
> > The file handle based API use file descritor returned from open_by_handle_at
> > to do different file system operations. To find the link target name we
> > need to get a file descriptor on symlinks.
>
> This is a pretty big change, isn't it?

Yes

> Have you looked through vfs to
> ensure this is actually OK? I was just looking at remount,ro code, for
> example, and it seems to assume only writable open files on ISREG
> files.

I verified that write and read returns an error on that descriptor. I
will look at rest of code to ensure the it doesn't break assumptions in
the rest of VFS.

>
> Is this restricted to RDONLY? Really, it should be O_NONE...
>
>

we need the interface so that we can do fstat(lstat) and readlink using the
handle. If we are against the idea of allowing an fd on symlink, how about
adding stat_by_handle and readlink_by_handle syscall ? That way we can
drop "vfs: Support null pathname in readlink" patch

-aneesh

2010-07-07 16:32:29

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V14 05/11] vfs: Support null pathname in readlink

On Thu, 8 Jul 2010 01:27:06 +1000, Nick Piggin <[email protected]> wrote:
> On Tue, Jun 15, 2010 at 10:42:55PM +0530, Aneesh Kumar K.V wrote:
> > From: NeilBrown <[email protected]>
> >
> > This enables to use readlink to get the link target name
> > from a file descriptor point to the link. This can be used
> > with open_by_handle syscall that returns a file descriptor for a link.
> > We can then use this file descriptor to get the target name.
> >
> > Signed-off-by: NeilBrown <[email protected]>
> > Signed-off-by: Aneesh Kumar K.V <[email protected]>
> > ---
> > fs/stat.c | 30 ++++++++++++++++++++++--------
> > 1 files changed, 22 insertions(+), 8 deletions(-)
> >
> > diff --git a/fs/stat.c b/fs/stat.c
> > index c4ecd52..49b95a7 100644
> > --- a/fs/stat.c
> > +++ b/fs/stat.c
> > @@ -284,26 +284,40 @@ SYSCALL_DEFINE2(newfstat, unsigned int, fd, struct stat __user *, statbuf)
> > SYSCALL_DEFINE4(readlinkat, int, dfd, const char __user *, pathname,
> > char __user *, buf, int, bufsiz)
> > {
> > - struct path path;
> > - int error;
> > + int error = 0;
> > + struct path path, *pp;
> > + struct file *file = NULL;
> >
> > if (bufsiz <= 0)
> > return -EINVAL;
> >
> > - error = user_path_at(dfd, pathname, 0, &path);
> > + if (pathname == NULL && dfd != AT_FDCWD) {
> > + file = fget(dfd);
> > +
> > + if (file)
> > + pp = &file->f_path;
> > + else
> > + error = -EBADF;
> > + } else {
> > + error = user_path_at(dfd, pathname, 0, &path);
> > + pp = &path;
> > + }
>
> This (and all the others) is really ugly overloading of syscall
> arguments IMO, and the changelog is seriously lacking for such
> changes.

Initially we had freadlink

http://lkml.org/lkml/2010/5/12/222

We updated the patches to use the existing readlinkat interface because
utimensat(2) already exposed a similar interface. So it should be ok to
expect that other *at call behaved in a similar way ?

>
> This also changes the the syscall API of existing calls; from reading
> the path at NULL, to switching to a completely different syscall.
> Perhaps you're assuming nobody relies on SIGSEGV / mmapped NULL address
> there, but even then you surely need to document the changed semantics
> somewhere (and document the new syscall semantics properly).


Yes this would need a documentation update. But i guess since we already
have utimensat(2) behaving similarly we are ok to extent readlinkat,
linkat and faccessat on similar lines ?

-aneesh

2010-07-07 16:33:58

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Wed, 7 Jul 2010 10:45:11 -0400, "J. Bruce Fields" <[email protected]> wrote:
> On Wed, Jul 07, 2010 at 03:35:50PM +0200, Miklos Szeredi wrote:
> > On Wed, 7 Jul 2010, J. Bruce Fields wrote:
> > > > > If you use sys or proc, is it possible to get the uuid from a file
> > > > > descriptor or pathname without races?
> > > >
> > > > You can do stat/fstat to find out the device number (which is unique,
> > > > but not persistent)
> > >
> > > Is it really unique over time? (Can't a given st_dev value map to one
> > > filesystem now, and another later?)
> >
> > It's unique at a single point in time. But if you have a reference
> > (e.g. open file descriptor) on the mount then that's not a problem.
> >
> > fd = open(path, ...);
> > fstat(fd, &st);
> > search st.st_dev in mountinfo
> > close(fd)
> >
> > is effectively the same as an getuuid(path) syscall (lazy unmounted
> > filesystems will not be found in mountinfo, but the reference is still
> > there so st_dev will not be reused for other filesystems).
>
> OK, cool.
>
> That still leaves the problem that there isn't always an underlying
> block device, and/or when there is it doesn't always uniquely specify
> the filesystem.
>

And for this reason we would need this as a syscall right ?

-aneesh

2010-07-07 16:39:37

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Wed, Jul 07, 2010 at 10:03:39PM +0530, Aneesh Kumar K. V wrote:
> On Wed, 7 Jul 2010 10:45:11 -0400, "J. Bruce Fields" <[email protected]> wrote:
> > On Wed, Jul 07, 2010 at 03:35:50PM +0200, Miklos Szeredi wrote:
> > > On Wed, 7 Jul 2010, J. Bruce Fields wrote:
> > > > > > If you use sys or proc, is it possible to get the uuid from a file
> > > > > > descriptor or pathname without races?
> > > > >
> > > > > You can do stat/fstat to find out the device number (which is unique,
> > > > > but not persistent)
> > > >
> > > > Is it really unique over time? (Can't a given st_dev value map to one
> > > > filesystem now, and another later?)
> > >
> > > It's unique at a single point in time. But if you have a reference
> > > (e.g. open file descriptor) on the mount then that's not a problem.
> > >
> > > fd = open(path, ...);
> > > fstat(fd, &st);
> > > search st.st_dev in mountinfo
> > > close(fd)
> > >
> > > is effectively the same as an getuuid(path) syscall (lazy unmounted
> > > filesystems will not be found in mountinfo, but the reference is still
> > > there so st_dev will not be reused for other filesystems).
> >
> > OK, cool.
> >
> > That still leaves the problem that there isn't always an underlying
> > block device, and/or when there is it doesn't always uniquely specify
> > the filesystem.
> >
>
> And for this reason we would need this as a syscall right ?

That's the only solution I see. (Or use an xattr?)

--b.

2010-07-07 16:49:09

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH -V14 04/11] vfs: Allow handle based open on symlinks

On Tue, Jun 15, 2010 at 10:42:54PM +0530, Aneesh Kumar K.V wrote:
> The patch update may_open to allow handle based open on symlinks.
> The file handle based API use file descritor returned from open_by_handle_at
> to do different file system operations. To find the link target name we
> need to get a file descriptor on symlinks.
>
> Signed-off-by: Aneesh Kumar K.V <[email protected]>
> ---
> fs/namei.c | 17 ++++++++++++++---
> 1 files changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/fs/namei.c b/fs/namei.c
> index c2d19c7..417bf53 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -1422,7 +1422,7 @@ int vfs_create(struct inode *dir, struct dentry *dentry, int mode,
> return error;
> }
>
> -int may_open(struct path *path, int acc_mode, int flag)
> +static int __may_open(struct path *path, int acc_mode, int flag, int handle)
> {
> struct dentry *dentry = path->dentry;
> struct inode *inode = dentry->d_inode;
> @@ -1433,7 +1433,13 @@ int may_open(struct path *path, int acc_mode, int flag)
>
> switch (inode->i_mode & S_IFMT) {
> case S_IFLNK:
> - return -ELOOP;
> + /*
> + * For file handle based open we should allow
> + * open of symlink.
> + */
> + if (!handle)
> + return -ELOOP;
> + break;
> case S_IFDIR:
> if (acc_mode & MAY_WRITE)
> return -EISDIR;
> @@ -1473,6 +1479,11 @@ int may_open(struct path *path, int acc_mode, int flag)
> return break_lease(inode, flag);
> }
>
> +int may_open(struct path *path, int acc_mode, int flag)
> +{
> + return __may_open(path, acc_mode, flag, 0);
> +}

Why don't you just add MAY_OPEN_LNK instead of this ugliness?

2010-07-07 16:57:25

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH -V14 04/11] vfs: Allow handle based open on symlinks

On Wed, Jul 07, 2010 at 09:54:52PM +0530, Aneesh Kumar K. V wrote:
> On Thu, 8 Jul 2010 01:23:03 +1000, Nick Piggin <[email protected]> wrote:
> > On Tue, Jun 15, 2010 at 10:42:54PM +0530, Aneesh Kumar K.V wrote:
> > > The patch update may_open to allow handle based open on symlinks.
> > > The file handle based API use file descritor returned from open_by_handle_at
> > > to do different file system operations. To find the link target name we
> > > need to get a file descriptor on symlinks.
> >
> > This is a pretty big change, isn't it?
>
> Yes
>
> > Have you looked through vfs to
> > ensure this is actually OK? I was just looking at remount,ro code, for
> > example, and it seems to assume only writable open files on ISREG
> > files.
>
> I verified that write and read returns an error on that descriptor. I
> will look at rest of code to ensure the it doesn't break assumptions in
> the rest of VFS.
>
> >
> > Is this restricted to RDONLY? Really, it should be O_NONE...
> >
> >
>
> we need the interface so that we can do fstat(lstat) and readlink using the
> handle. If we are against the idea of allowing an fd on symlink, how about
> adding stat_by_handle and readlink_by_handle syscall ? That way we can
> drop "vfs: Support null pathname in readlink" patch

It would be nice to avoid introducing open files to symlinks, but
I don't know about really the *right* thing to do though. I think
avoiding adding any by-handle syscalls except for open by handle
may be a good idea... OTOH if this is really a special case for
handles...

I haven't followed this thread closely, so can you just go back
and explain to me why you need this? Ie. that you can't achieve with
a path-to-handle follow/nofollow option.

2010-07-07 17:03:09

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH -V14 05/11] vfs: Support null pathname in readlink

On Wed, Jul 07, 2010 at 10:02:12PM +0530, Aneesh Kumar K. V wrote:
> On Thu, 8 Jul 2010 01:27:06 +1000, Nick Piggin <[email protected]> wrote:
> > This (and all the others) is really ugly overloading of syscall
> > arguments IMO, and the changelog is seriously lacking for such
> > changes.
>
> Initially we had freadlink
>
> http://lkml.org/lkml/2010/5/12/222
>
> We updated the patches to use the existing readlinkat interface because
> utimensat(2) already exposed a similar interface. So it should be ok to
> expect that other *at call behaved in a similar way ?

I'm not sure whether it's OK or not. Probably is, it is a slight API
change though, that should at least be noted in the changelog.


> > This also changes the the syscall API of existing calls; from reading
> > the path at NULL, to switching to a completely different syscall.
> > Perhaps you're assuming nobody relies on SIGSEGV / mmapped NULL address
> > there, but even then you surely need to document the changed semantics
> > somewhere (and document the new syscall semantics properly).
>
>
> Yes this would need a documentation update. But i guess since we already
> have utimensat(2) behaving similarly we are ok to extent readlinkat,
> linkat and faccessat on similar lines ?

At least there is precedent. Pretty ugly though :( Well if others
(Christoph and Al, primarily) think it's OK then fine by me. But
please put comments or changelog for API changes such that a man page
writer could easily update it.

2010-07-07 17:05:20

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On 2010-07-07, at 09:05, J. Bruce Fields wrote:
> On Wed, Jul 07, 2010 at 01:40:53AM -0600, Andreas Dilger wrote:
>> On 2010-07-06, at 11:09, Aneesh Kumar K. V wrote:
>>> Since we know that system wide file handle should include a file system
>>> identifier and a file identifier my plan was to retrieve both in the
>>> same syscall.
>>
>> Won't having it be in a separate system call be racy w.r.t. doing the pathname lookup twice?
>
> It'll be rare that a server will want to *just* get a filehandle;
> normally it will at least want to get some attributes at the same time.
> So I think it will always need to open the file first and then do the
> rest of the operations on the returned filehandle.

I think you are assuming too much about the use of the file handle. What I'm interested in is not a userspace file server, but rather a more efficient way to have 10000's to millions of clients to be able to open the same regular file, without having to do full path traversal for each one.

>>> That still leaves the problem that there isn't always an underlying
>>> block device, and/or when there is it doesn't always uniquely specify
>>> the filesystem.
>>
>> And for this reason we would need this as a syscall right ?
>
> That's the only solution I see. (Or use an xattr?)

Or... return the UUID as part of the file handle in the first place. That avoids races, avoids adding more syscalls that have to be called for each file handle, or IMNSHO the worst proposal that requires applications to parse a text file in some obscure path for each file handle (requiring a stat() to find the major/minor device of the file, walking through /proc or /sys, and other nastiness).

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

2010-07-07 17:37:39

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Wed, Jul 07, 2010 at 11:02:47AM -0600, Andreas Dilger wrote:
> On 2010-07-07, at 09:05, J. Bruce Fields wrote:
> > On Wed, Jul 07, 2010 at 01:40:53AM -0600, Andreas Dilger wrote:
> >> On 2010-07-06, at 11:09, Aneesh Kumar K. V wrote:
> >>> Since we know that system wide file handle should include a file system
> >>> identifier and a file identifier my plan was to retrieve both in the
> >>> same syscall.
> >>
> >> Won't having it be in a separate system call be racy w.r.t. doing the pathname lookup twice?
> >
> > It'll be rare that a server will want to *just* get a filehandle;
> > normally it will at least want to get some attributes at the same time.
> > So I think it will always need to open the file first and then do the
> > rest of the operations on the returned filehandle.
>
> I think you are assuming too much about the use of the file handle. What I'm interested in is not a userspace file server, but rather a more efficient way to have 10000's to millions of clients to be able to open the same regular file, without having to do full path traversal for each one.

Understood. I don't understand how that case decides the question of
whether to use a separate system call for the uuid or not?

Those millions of clients are doing the filehandle-to-file-descriptor
mapping, not the reverse.

And you may not need any uuid at all since the clients probably all have
some way of knowing which shared filesystem they want to work with.

> >>> That still leaves the problem that there isn't always an
> >>> underlying block device, and/or when there is it doesn't always
> >>> uniquely specify the filesystem.
> >>
> >> And for this reason we would need this as a syscall right ?
> >
> > That's the only solution I see. (Or use an xattr?)
>
> Or... return the UUID as part of the file handle in the first place.
> That avoids races, avoids adding more syscalls that have to be called
> for each file handle,

I don't hate the idea. A uuid lookup seems like a useful thing, though,
and maybe less expensive, so it seems weird if the only way to get the
uuid is by looking up the filehandle as well.

> or IMNSHO the worst proposal that requires
> applications to parse a text file in some obscure path for each file
> handle (requiring a stat() to find the major/minor device of the file,
> walking through /proc or /sys, and other nastiness).

OK.

--b.

2010-07-07 17:53:41

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V14 04/11] vfs: Allow handle based open on symlinks

On Thu, 8 Jul 2010 02:57:19 +1000, Nick Piggin <[email protected]> wrote:
> On Wed, Jul 07, 2010 at 09:54:52PM +0530, Aneesh Kumar K. V wrote:
> > On Thu, 8 Jul 2010 01:23:03 +1000, Nick Piggin <[email protected]> wrote:
> > > On Tue, Jun 15, 2010 at 10:42:54PM +0530, Aneesh Kumar K.V wrote:
> > > > The patch update may_open to allow handle based open on symlinks.
> > > > The file handle based API use file descritor returned from open_by_handle_at
> > > > to do different file system operations. To find the link target name we
> > > > need to get a file descriptor on symlinks.
> > >
> > > This is a pretty big change, isn't it?
> >
> > Yes
> >
> > > Have you looked through vfs to
> > > ensure this is actually OK? I was just looking at remount,ro code, for
> > > example, and it seems to assume only writable open files on ISREG
> > > files.
> >
> > I verified that write and read returns an error on that descriptor. I
> > will look at rest of code to ensure the it doesn't break assumptions in
> > the rest of VFS.
> >
> > >
> > > Is this restricted to RDONLY? Really, it should be O_NONE...
> > >
> > >
> >
> > we need the interface so that we can do fstat(lstat) and readlink using the
> > handle. If we are against the idea of allowing an fd on symlink, how about
> > adding stat_by_handle and readlink_by_handle syscall ? That way we can
> > drop "vfs: Support null pathname in readlink" patch
>
> It would be nice to avoid introducing open files to symlinks, but
> I don't know about really the *right* thing to do though. I think
> avoiding adding any by-handle syscalls except for open by handle
> may be a good idea... OTOH if this is really a special case for
> handles...
>
> I haven't followed this thread closely, so can you just go back
> and explain to me why you need this? Ie. that you can't achieve with
> a path-to-handle follow/nofollow option.


We need to be able to read the link target using file handle. The exact
usecase i have is with respect to implementing READLINK operation on a
userspace NFS server. The request contain the file handle and the
response include target name.

Similarly we need to able to get file attribute based on file handle.
I was using readlinkat and fstat to obtain both the above information
using the descriptor returned by open_by_handle on a file handle
representing symlink

-aneesh

2010-07-07 18:05:17

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Wed, Jul 07, 2010 at 11:02:47AM -0600, Andreas Dilger wrote:
> On 2010-07-07, at 09:05, J. Bruce Fields wrote:
> > On Wed, Jul 07, 2010 at 01:40:53AM -0600, Andreas Dilger wrote:
> >> On 2010-07-06, at 11:09, Aneesh Kumar K. V wrote:
> >>> Since we know that system wide file handle should include a file system
> >>> identifier and a file identifier my plan was to retrieve both in the
> >>> same syscall.
> >>
> >> Won't having it be in a separate system call be racy w.r.t. doing the pathname lookup twice?
> >
> > It'll be rare that a server will want to *just* get a filehandle;
> > normally it will at least want to get some attributes at the same time.
> > So I think it will always need to open the file first and then do the
> > rest of the operations on the returned filehandle.
>
> I think you are assuming too much about the use of the file handle. What I'm interested in is not a userspace file server, but rather a more efficient way to have 10000's to millions of clients to be able to open the same regular file, without having to do full path traversal for each one.

Really? What kind of clients? What sort of speedups do you hope to see?
Path traversal can get vastly cheaper in both single threaded and parallel
cases with my locking changes.

It is not acceptable to work around fixable deficiencies in our critical
infrastructure like path walking with hacks like this. If path walking
is still much too expensive, that's another story...

2010-07-07 18:19:10

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Wed, 7 Jul 2010 11:02:47 -0600, Andreas Dilger <[email protected]> wrote:
> On 2010-07-07, at 09:05, J. Bruce Fields wrote:
> > On Wed, Jul 07, 2010 at 01:40:53AM -0600, Andreas Dilger wrote:
> >> On 2010-07-06, at 11:09, Aneesh Kumar K. V wrote:
> >>> Since we know that system wide file handle should include a file system
> >>> identifier and a file identifier my plan was to retrieve both in the
> >>> same syscall.
> >>
> >> Won't having it be in a separate system call be racy w.r.t. doing the pathname lookup twice?
> >
> > It'll be rare that a server will want to *just* get a filehandle;
> > normally it will at least want to get some attributes at the same time.
> > So I think it will always need to open the file first and then do the
> > rest of the operations on the returned filehandle.
>
> I think you are assuming too much about the use of the file handle.
> What I'm interested in is not a userspace file server, but rather a
> more efficient way to have 10000's to millions of clients to be able
> to open the same regular file, without having to do full path
> traversal for each one.


With the suggested syscall approach we can do on the client that does
the path traversal.

fd = open(name)
file_identifier = fd_to_handle(fd);
fs_identifier = fd_to_fshandle(fd);
close(fd);


>
> >>> That still leaves the problem that there isn't always an underlying
> >>> block device, and/or when there is it doesn't always uniquely specify
> >>> the filesystem.
> >>
> >> And for this reason we would need this as a syscall right ?
> >
> > That's the only solution I see. (Or use an xattr?)
>
> Or... return the UUID as part of the file handle in the first place.
> That avoids races, avoids adding more syscalls that have to be called
> for each file handle, or IMNSHO the worst proposal that requires
> applications to parse a text file in some obscure path for each file
> handle (requiring a stat() to find the major/minor device of the file,
> walking through /proc or /sys, and other nastiness).

I would also like to get both file system identifier and file identifier
in a single call.

That would also imply instead of the above sequence of 4 calls, we can
do

file_handle = name_to_handle(name);

-aneesh


2010-07-07 18:20:07

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH -V14 04/11] vfs: Allow handle based open on symlinks

On Wed, Jul 07, 2010 at 11:23:31PM +0530, Aneesh Kumar K. V wrote:
> On Thu, 8 Jul 2010 02:57:19 +1000, Nick Piggin <[email protected]> wrote:
> > I haven't followed this thread closely, so can you just go back
> > and explain to me why you need this? Ie. that you can't achieve with
> > a path-to-handle follow/nofollow option.
>
>
> We need to be able to read the link target using file handle. The exact
> usecase i have is with respect to implementing READLINK operation on a
> userspace NFS server. The request contain the file handle and the
> response include target name.

Fair point, could you include this stuff in your changelog? It may
help with other people reviewing it too who have not followed closely.


> Similarly we need to able to get file attribute based on file handle.
> I was using readlinkat and fstat to obtain both the above information
> using the descriptor returned by open_by_handle on a file handle
> representing symlink
>
> -aneesh

2010-07-07 20:33:08

by Alan

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

> I think you are assuming too much about the use of the file handle. What I'm interested in is not a userspace file server, but rather a more efficient way to have 10000's to millions of clients to be able to open the same regular file, without having to do full path traversal for each one.

The unix security model requires the traversal. In that respect it
differs significantly from some other OS's that did have handle based
direct opens and treat directories as a lookup and translation index only.

2010-07-07 22:22:05

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Wed, 7 Jul 2010 10:45:11 -0400
"J. Bruce Fields" <[email protected]> wrote:

> On Wed, Jul 07, 2010 at 03:35:50PM +0200, Miklos Szeredi wrote:
> > On Wed, 7 Jul 2010, J. Bruce Fields wrote:
> > > > > If you use sys or proc, is it possible to get the uuid from a file
> > > > > descriptor or pathname without races?
> > > >
> > > > You can do stat/fstat to find out the device number (which is unique,
> > > > but not persistent)
> > >
> > > Is it really unique over time? (Can't a given st_dev value map to one
> > > filesystem now, and another later?)
> >
> > It's unique at a single point in time. But if you have a reference
> > (e.g. open file descriptor) on the mount then that's not a problem.
> >
> > fd = open(path, ...);
> > fstat(fd, &st);
> > search st.st_dev in mountinfo
> > close(fd)
> >
> > is effectively the same as an getuuid(path) syscall (lazy unmounted
> > filesystems will not be found in mountinfo, but the reference is still
> > there so st_dev will not be reused for other filesystems).
>
> OK, cool.
>
> That still leaves the problem that there isn't always an underlying
> block device, and/or when there is it doesn't always uniquely specify
> the filesystem.

It doesn't matter if there is an underlying block device, or if it is shared
among subvolmes.
st_dev is *the* primary key for filesystems. Every "struct super_block" has a
unquie s_dev and that is returned in st_dev.

For "traditional" filesystem, this is the major/minor number of the block
device.
For NFS and btrfs and other filesystems which don't have exclusive use of a
block device, 'set_anon_super' is used to get a unique s_dev based on a major
number of '0'.

So you can *always* use st_dev as an identifier for the filesystem which is
stable and unique as long as you hold an active reference to the filesystem
(open file descriptor, cwd in fs, etc).

If you poll(2) /proc/mounts to get notifications of changes to the mount
table, then it should be quite easy to cache st-dev -> uuid mappings in a
race-free way.

There might be value in getting name_to_handle to return the st_dev of the
target file to ensure that you haven't unexepected crossed into a different
filesystem. I would prefer that to returning a uuid: st_dev is guaranteed
to be unique, a uuid is only supposed to be unique (i.e. that is not
enforced).

NeilBrown

2010-07-07 22:26:17

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Thu, Jul 08, 2010 at 08:21:43AM +1000, Neil Brown wrote:
> On Wed, 7 Jul 2010 10:45:11 -0400
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Wed, Jul 07, 2010 at 03:35:50PM +0200, Miklos Szeredi wrote:
> > > It's unique at a single point in time. But if you have a reference
> > > (e.g. open file descriptor) on the mount then that's not a problem.
> > >
> > > fd = open(path, ...);
> > > fstat(fd, &st);
> > > search st.st_dev in mountinfo
> > > close(fd)
> > >
> > > is effectively the same as an getuuid(path) syscall (lazy unmounted
> > > filesystems will not be found in mountinfo, but the reference is still
> > > there so st_dev will not be reused for other filesystems).
> >
> > OK, cool.
> >
> > That still leaves the problem that there isn't always an underlying
> > block device, and/or when there is it doesn't always uniquely specify
> > the filesystem.
>
> It doesn't matter if there is an underlying block device, or if it is shared
> among subvolmes.
> st_dev is *the* primary key for filesystems. Every "struct super_block" has a
> unquie s_dev and that is returned in st_dev.
>
> For "traditional" filesystem, this is the major/minor number of the block
> device.
> For NFS and btrfs and other filesystems which don't have exclusive use of a
> block device, 'set_anon_super' is used to get a unique s_dev based on a major
> number of '0'.

Whoops, OK, thanks for the explanation.

--b.

> So you can *always* use st_dev as an identifier for the filesystem which is
> stable and unique as long as you hold an active reference to the filesystem
> (open file descriptor, cwd in fs, etc).
>
> If you poll(2) /proc/mounts to get notifications of changes to the mount
> table, then it should be quite easy to cache st-dev -> uuid mappings in a
> race-free way.
>
> There might be value in getting name_to_handle to return the st_dev of the
> target file to ensure that you haven't unexepected crossed into a different
> filesystem. I would prefer that to returning a uuid: st_dev is guaranteed
> to be unique, a uuid is only supposed to be unique (i.e. that is not
> enforced).
>
> NeilBrown

2010-07-08 04:31:10

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On 2010-07-07, at 12:05, Nick Piggin wrote:
> On Wed, Jul 07, 2010 at 11:02:47AM -0600, Andreas Dilger wrote:
>> I think you are assuming too much about the use of the file handle. What I'm interested in is not a userspace file server, but rather a more efficient way to have 10000's to millions of clients to be able to open the same regular file, without having to do full path traversal for each one.
>
> Really? What kind of clients? What sort of speedups do you hope to see?
> Path traversal can get vastly cheaper in both single threaded and parallel
> cases with my locking changes.

This is for Lustre clients, but really any kind of network filesystem is equally affected. This isn't really an issue of the local dcache performance, but rather network latency for each component of the path traversal, and for a large number of clients the metadata server is the bottleneck for doing the traversal.

> It is not acceptable to work around fixable deficiencies in our critical
> infrastructure like path walking with hacks like this. If path walking
> is still much too expensive, that's another story...

Two different problems, I'm afraid.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

2010-07-08 04:31:16

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On 2010-07-07, at 14:39, Alan Cox wrote:
>> I think you are assuming too much about the use of the file handle. What I'm interested in is not a userspace file server, but rather a more efficient way to have 10000's to millions of clients to be able to open the same regular file, without having to do full path traversal for each one.
>
> The unix security model requires the traversal. In that respect it
> differs significantly from some other OS's that did have handle based
> direct opens and treat directories as a lookup and translation index only.

There IS full path traversal at the time the file handle is generated, so it meets these requirements. This isn't really conceptually any different than fd passing over a unix socket on a local system, but passing the fd between user processes (via an out-of-band communication method, likely an MPI broadcast message) that are running on different nodes all accessing the same distributed filesystem.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

2010-07-08 04:31:53

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On 2010-07-07, at 16:21, Neil Brown wrote:
> It doesn't matter if there is an underlying block device, or if it is shared
> among subvolmes. st_dev is *the* primary key for filesystems. Every "struct super_block" has a unique s_dev and that is returned in st_dev.
>
> For "traditional" filesystem, this is the major/minor number of the block
> device.
> For NFS and btrfs and other filesystems which don't have exclusive use of a
> block device, 'set_anon_super' is used to get a unique s_dev based on a major
> number of '0'.

But the major/minor number returned is essentially random between different clients, so there is no way to use it on another node that is accessing the same filesystem. Conversely, the UUID will be the same on all of the clients.

> So you can *always* use st_dev as an identifier for the filesystem which is
> stable and unique as long as you hold an active reference to the filesystem
> (open file descriptor, cwd in fs, etc).

Only on a single system.

> If you poll(2) /proc/mounts to get notifications of changes to the mount
> table, then it should be quite easy to cache st-dev -> uuid mappings in a
> race-free way.

This sounds unpleasant for any application to implement. It might be OK for a user-space NFS/CIFS server, but it is complex and error-prone for any normal usage, and doesn't seem like a good API design to me.

> There might be value in getting name_to_handle to return the st_dev of the
> target file to ensure that you haven't unexepected crossed into a different
> filesystem. I would prefer that to returning a uuid: st_dev is guaranteed
> to be unique, a uuid is only supposed to be unique (i.e. that is not
> enforced).

UUID duplication (w.r.t. multiple mounts of the same underlying device) doesn't matter at all for regular file opens, where the only interest is getting a handle for the inode. I wouldn't be against requiring the UUID be unique if that was needed, or failing regular opens in the rare case that there is a non-unique UUID pointing to different devices, or failing directory opens for the case of multiple mountpoints.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.

2010-07-08 05:03:55

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Wed, 7 Jul 2010 18:03:36 -0600
Andreas Dilger <[email protected]> wrote:

> On 2010-07-07, at 16:21, Neil Brown wrote:
> > It doesn't matter if there is an underlying block device, or if it is shared
> > among subvolmes. st_dev is *the* primary key for filesystems. Every "struct super_block" has a unique s_dev and that is returned in st_dev.
> >
> > For "traditional" filesystem, this is the major/minor number of the block
> > device.
> > For NFS and btrfs and other filesystems which don't have exclusive use of a
> > block device, 'set_anon_super' is used to get a unique s_dev based on a major
> > number of '0'.
>
> But the major/minor number returned is essentially random between different clients, so there is no way to use it on another node that is accessing the same filesystem. Conversely, the UUID will be the same on all of the clients.
>
> > So you can *always* use st_dev as an identifier for the filesystem which is
> > stable and unique as long as you hold an active reference to the filesystem
> > (open file descriptor, cwd in fs, etc).
>
> Only on a single system.

Well the system call only runs on a single system.
If you want a cluster-unique name, get the cluster software to generate it
or enforce it.
Performing a mapping is not hard.

>
> > If you poll(2) /proc/mounts to get notifications of changes to the mount
> > table, then it should be quite easy to cache st-dev -> uuid mappings in a
> > race-free way.
>
> This sounds unpleasant for any application to implement. It might be OK for a user-space NFS/CIFS server, but it is complex and error-prone for any normal usage, and doesn't seem like a good API design to me.

Define "normal usage" for filehandle-based lookup ???

Identifing a filesystem by st_dev is completely reliable. That is a good
start for API design.

Identifing by UUID is not unless uniqueness is enforced, and ....

>
> > There might be value in getting name_to_handle to return the st_dev of the
> > target file to ensure that you haven't unexepected crossed into a different
> > filesystem. I would prefer that to returning a uuid: st_dev is guaranteed
> > to be unique, a uuid is only supposed to be unique (i.e. that is not
> > enforced).
>
> UUID duplication (w.r.t. multiple mounts of the same underlying device) doesn't matter at all for regular file opens, where the only interest is getting a handle for the inode. I wouldn't be against requiring the UUID be unique if that was needed, or failing regular opens in the rare case that there is a non-unique UUID pointing to different devices, or failing directory opens for the case of multiple mountpoints.

It has already been said that requiring uuids to be unique breaks current
practice (involving mounting dm snapshots of active filesystems).
Failing legitimate syscalls in rare circumstances sounds like bad API design
to me.

NeilBrown


>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Technical Lead
> Oracle Corporation Canada Inc.

2010-07-08 10:40:19

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Thu, 8 Jul 2010 08:21:43 +1000, Neil Brown <[email protected]> wrote:
> On Wed, 7 Jul 2010 10:45:11 -0400
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Wed, Jul 07, 2010 at 03:35:50PM +0200, Miklos Szeredi wrote:
> > > On Wed, 7 Jul 2010, J. Bruce Fields wrote:
> > > > > > If you use sys or proc, is it possible to get the uuid from a file
> > > > > > descriptor or pathname without races?
> > > > >
> > > > > You can do stat/fstat to find out the device number (which is unique,
> > > > > but not persistent)
> > > >
> > > > Is it really unique over time? (Can't a given st_dev value map to one
> > > > filesystem now, and another later?)
> > >
> > > It's unique at a single point in time. But if you have a reference
> > > (e.g. open file descriptor) on the mount then that's not a problem.
> > >
> > > fd = open(path, ...);
> > > fstat(fd, &st);
> > > search st.st_dev in mountinfo
> > > close(fd)
> > >
> > > is effectively the same as an getuuid(path) syscall (lazy unmounted
> > > filesystems will not be found in mountinfo, but the reference is still
> > > there so st_dev will not be reused for other filesystems).
> >
> > OK, cool.
> >
> > That still leaves the problem that there isn't always an underlying
> > block device, and/or when there is it doesn't always uniquely specify
> > the filesystem.
>
> It doesn't matter if there is an underlying block device, or if it is shared
> among subvolmes.
> st_dev is *the* primary key for filesystems. Every "struct super_block" has a
> unquie s_dev and that is returned in st_dev.
>
> For "traditional" filesystem, this is the major/minor number of the block
> device.
> For NFS and btrfs and other filesystems which don't have exclusive use of a
> block device, 'set_anon_super' is used to get a unique s_dev based on a major
> number of '0'.
>
> So you can *always* use st_dev as an identifier for the filesystem which is
> stable and unique as long as you hold an active reference to the filesystem
> (open file descriptor, cwd in fs, etc).
>
> If you poll(2) /proc/mounts to get notifications of changes to the mount
> table, then it should be quite easy to cache st-dev -> uuid mappings in a
> race-free way.
>
> There might be value in getting name_to_handle to return the st_dev of the
> target file to ensure that you haven't unexepected crossed into a different
> filesystem. I would prefer that to returning a uuid: st_dev is guaranteed
> to be unique, a uuid is only supposed to be unique (i.e. that is not
> enforced).

How about adding mnt_id to the handle ? Documentation file says it is
unique

(1) mount ID: unique identifier of the mount (may be reused after umount)

I also updated (/proc/self/mountinfo) to carry the optional uuid field
With the below patch i get in /proc/self/mountinfo

13 1 253:0 / / rw,relatime,uuid:9b5af62a-a34a-43f6-a5bb-1cc22d97e862 - ext3 /dev/root rw,errors=continue,barrier=0,data=writeback

And the handle returns the value 13 in mnt_id field. We should able to
lookup mountinfo with mnt_id and find the corresponding uuid.

diff --git a/fs/namespace.c b/fs/namespace.c
index 88058de..498bd9a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -871,6 +871,9 @@ static int show_mountinfo(struct seq_file *m, void *v)
if (IS_MNT_UNBINDABLE(mnt))
seq_puts(m, " unbindable");

+ /* print the uuid */
+ seq_printf(m, ",uuid:%pU", mnt->mnt_sb->s_uuid);
+
/* Filesystem specific data */
seq_puts(m, " - ");
show_type(m, sb);
diff --git a/fs/open.c b/fs/open.c
index 23d05d3..13d426e 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1092,6 +1092,8 @@ static long do_sys_name_to_handle(struct path *path,
handle_size *= sizeof(u32);
handle->handle_type = retval;
handle->handle_size = handle_size;
+ /* copy the mount id */
+ handle->mnt_id = path->mnt->mnt_id;
if (handle_size > f_handle.handle_size) {
/*
* set the handle_size to zero so we copy only
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ffcb9bf..5f43472 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -952,6 +952,7 @@ struct file {
};

struct file_handle {
+ int mnt_id;
int handle_size;
int handle_type;
/* file identifier */

-aneesh

2010-07-08 10:42:44

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V14 04/11] vfs: Allow handle based open on symlinks

On Thu, 8 Jul 2010 02:48:59 +1000, Nick Piggin <[email protected]> wrote:
> On Tue, Jun 15, 2010 at 10:42:54PM +0530, Aneesh Kumar K.V wrote:
> > The patch update may_open to allow handle based open on symlinks.
> > The file handle based API use file descritor returned from open_by_handle_at
> > to do different file system operations. To find the link target name we
> > need to get a file descriptor on symlinks.
> >
> > Signed-off-by: Aneesh Kumar K.V <[email protected]>
> > ---
> > fs/namei.c | 17 ++++++++++++++---
> > 1 files changed, 14 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/namei.c b/fs/namei.c
> > index c2d19c7..417bf53 100644
> > --- a/fs/namei.c
> > +++ b/fs/namei.c
> > @@ -1422,7 +1422,7 @@ int vfs_create(struct inode *dir, struct dentry *dentry, int mode,
> > return error;
> > }
> >
> > -int may_open(struct path *path, int acc_mode, int flag)
> > +static int __may_open(struct path *path, int acc_mode, int flag, int handle)
> > {
> > struct dentry *dentry = path->dentry;
> > struct inode *inode = dentry->d_inode;
> > @@ -1433,7 +1433,13 @@ int may_open(struct path *path, int acc_mode, int flag)
> >
> > switch (inode->i_mode & S_IFMT) {
> > case S_IFLNK:
> > - return -ELOOP;
> > + /*
> > + * For file handle based open we should allow
> > + * open of symlink.
> > + */
> > + if (!handle)
> > + return -ELOOP;
> > + break;
> > case S_IFDIR:
> > if (acc_mode & MAY_WRITE)
> > return -EISDIR;
> > @@ -1473,6 +1479,11 @@ int may_open(struct path *path, int acc_mode, int flag)
> > return break_lease(inode, flag);
> > }
> >
> > +int may_open(struct path *path, int acc_mode, int flag)
> > +{
> > + return __may_open(path, acc_mode, flag, 0);
> > +}
>
> Why don't you just add MAY_OPEN_LNK instead of this ugliness?
>

Something like the below ?

commit 69787f5ed6c8a7a63ad21ef5d7de0b62f124ff0a
Author: Aneesh Kumar K.V <[email protected]>
Date: Tue Jul 6 23:25:57 2010 +0530

vfs: Allow handle based open on symlinks

The patch update may_open to allow handle based open on symlinks.
The file handle based API use file descritor returned from open_by_handle_at
to do different file system operations. To find the link target name we
need to get a file descriptor on symlinks.

Signed-off-by: Aneesh Kumar K.V <[email protected]>

diff --git a/fs/namei.c b/fs/namei.c
index de40312..2ea5c79 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1456,7 +1456,15 @@ int may_open(struct path *path, int acc_mode, int flag)

switch (inode->i_mode & S_IFMT) {
case S_IFLNK:
- return -ELOOP;
+ /*
+ * Allow only if acc_mode contain
+ * open link request and read request.
+ */
+ if (acc_mode != (MAY_OPEN_LINK | MAY_READ))
+ return -ELOOP;
+ if (flag != O_RDONLY)
+ return -ELOOP;
+ break;
case S_IFDIR:
if (acc_mode & MAY_WRITE)
return -EISDIR;
diff --git a/fs/open.c b/fs/open.c
index daf2534..a58837a 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -1266,8 +1266,15 @@ static long do_sys_open_by_handle(int mountdirfd,
*/
if (open_flag & __O_SYNC)
open_flag |= O_DSYNC;
+ /*
+ * Handle based API allow open on a symlink
+ */
+ if (S_ISLNK(path->dentry->d_inode->i_mode))
+ acc_mode = MAY_OPEN_LINK;
+ else
+ acc_mode = MAY_OPEN;

- acc_mode = MAY_OPEN | ACC_MODE(open_flag);
+ acc_mode |= ACC_MODE(open_flag);

/* O_TRUNC implies we need access checks for write permissions */
if (open_flag & O_TRUNC)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index cbff4ca..eaf13d1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -53,6 +53,7 @@ struct inodes_stat_t {
#define MAY_APPEND 8
#define MAY_ACCESS 16
#define MAY_OPEN 32
+#define MAY_OPEN_LINK 64

/*
* flags in file.f_mode. Note that FMODE_READ and FMODE_WRITE must correspond

2010-07-08 11:53:19

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Thu, 08 Jul 2010, Aneesh Kumar K. V wrote:
> How about adding mnt_id to the handle ? Documentation file says it is
> unique
>
> (1) mount ID: unique identifier of the mount (may be reused after umount)
>
> I also updated (/proc/self/mountinfo) to carry the optional uuid field
> With the below patch i get in /proc/self/mountinfo
>
> 13 1 253:0 / / rw,relatime,uuid:9b5af62a-a34a-43f6-a5bb-1cc22d97e862 - ext3 /dev/root rw,errors=continue,barrier=0,data=writeback
>
> And the handle returns the value 13 in mnt_id field. We should able to
> lookup mountinfo with mnt_id and find the corresponding uuid.
>
> diff --git a/fs/namespace.c b/fs/namespace.c
> index 88058de..498bd9a 100644
> --- a/fs/namespace.c
> +++ b/fs/namespace.c
> @@ -871,6 +871,9 @@ static int show_mountinfo(struct seq_file *m, void *v)
> if (IS_MNT_UNBINDABLE(mnt))
> seq_puts(m, " unbindable");
>
> + /* print the uuid */
> + seq_printf(m, ",uuid:%pU", mnt->mnt_sb->s_uuid);
> +

This should be

seq_printf(m, " uuid:%pU", mnt->mnt_sb->s_uuid);

Thanks,
Miklos

2010-07-08 12:21:42

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Thu, 08 Jul 2010 16:10:09 +0530
"Aneesh Kumar K. V" <[email protected]> wrote:

> On Thu, 8 Jul 2010 08:21:43 +1000, Neil Brown <[email protected]> wrote:
> > On Wed, 7 Jul 2010 10:45:11 -0400
> > "J. Bruce Fields" <[email protected]> wrote:
> >
> > > On Wed, Jul 07, 2010 at 03:35:50PM +0200, Miklos Szeredi wrote:
> > > > On Wed, 7 Jul 2010, J. Bruce Fields wrote:
> > > > > > > If you use sys or proc, is it possible to get the uuid from a file
> > > > > > > descriptor or pathname without races?
> > > > > >
> > > > > > You can do stat/fstat to find out the device number (which is unique,
> > > > > > but not persistent)
> > > > >
> > > > > Is it really unique over time? (Can't a given st_dev value map to one
> > > > > filesystem now, and another later?)
> > > >
> > > > It's unique at a single point in time. But if you have a reference
> > > > (e.g. open file descriptor) on the mount then that's not a problem.
> > > >
> > > > fd = open(path, ...);
> > > > fstat(fd, &st);
> > > > search st.st_dev in mountinfo
> > > > close(fd)
> > > >
> > > > is effectively the same as an getuuid(path) syscall (lazy unmounted
> > > > filesystems will not be found in mountinfo, but the reference is still
> > > > there so st_dev will not be reused for other filesystems).
> > >
> > > OK, cool.
> > >
> > > That still leaves the problem that there isn't always an underlying
> > > block device, and/or when there is it doesn't always uniquely specify
> > > the filesystem.
> >
> > It doesn't matter if there is an underlying block device, or if it is shared
> > among subvolmes.
> > st_dev is *the* primary key for filesystems. Every "struct super_block" has a
> > unquie s_dev and that is returned in st_dev.
> >
> > For "traditional" filesystem, this is the major/minor number of the block
> > device.
> > For NFS and btrfs and other filesystems which don't have exclusive use of a
> > block device, 'set_anon_super' is used to get a unique s_dev based on a major
> > number of '0'.
> >
> > So you can *always* use st_dev as an identifier for the filesystem which is
> > stable and unique as long as you hold an active reference to the filesystem
> > (open file descriptor, cwd in fs, etc).
> >
> > If you poll(2) /proc/mounts to get notifications of changes to the mount
> > table, then it should be quite easy to cache st-dev -> uuid mappings in a
> > race-free way.
> >
> > There might be value in getting name_to_handle to return the st_dev of the
> > target file to ensure that you haven't unexepected crossed into a different
> > filesystem. I would prefer that to returning a uuid: st_dev is guaranteed
> > to be unique, a uuid is only supposed to be unique (i.e. that is not
> > enforced).
>
> How about adding mnt_id to the handle ? Documentation file says it is
> unique
>
> (1) mount ID: unique identifier of the mount (may be reused after umount)
>
> I also updated (/proc/self/mountinfo) to carry the optional uuid field
> With the below patch i get in /proc/self/mountinfo
>
> 13 1 253:0 / / rw,relatime,uuid:9b5af62a-a34a-43f6-a5bb-1cc22d97e862 - ext3 /dev/root rw,errors=continue,barrier=0,data=writeback
>
> And the handle returns the value 13 in mnt_id field. We should able to
> lookup mountinfo with mnt_id and find the corresponding uuid.
>

That sounds good. mnt_id will even let you know if you have crossed
a --bind mount, which st_dev wouldn't. That may not always be useful, but it
is good to have it.

NeilBrown

2010-07-09 18:43:07

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On 2010-07-08, at 06:21, Neil Brown wrote:
> On Thu, 08 Jul 2010 16:10:09 +0530
> "Aneesh Kumar K. V" <[email protected]> wrote:
>> How about adding mnt_id to the handle ? Documentation file says it is unique
>>
>> (1) mount ID: unique identifier of the mount (may be reused after umount)

But this value is not persistent across a reboot, or even an umount/mount so it is not useful as an identifier.

I suppose one way to resolve this issue is to just allow the underlying filesystem to supply a completely opaque filehandle to userspace. For local filesystems that don't care about persistence or uniqueness between nodes they can use something like mount_id, and for distributed/clustered filesystems they can include a globally-unique identifier.

Cheers, Andreas




2010-07-10 04:59:01

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V14 0/11] Generic name to handle and open by handle syscalls

On Fri, 9 Jul 2010 12:42:42 -0600, Andreas Dilger <[email protected]> wrote:
> On 2010-07-08, at 06:21, Neil Brown wrote:
> > On Thu, 08 Jul 2010 16:10:09 +0530
> > "Aneesh Kumar K. V" <[email protected]> wrote:
> >> How about adding mnt_id to the handle ? Documentation file says it is unique
> >>
> >> (1) mount ID: unique identifier of the mount (may be reused after umount)
>
> But this value is not persistent across a reboot, or even an
> umount/mount so it is not useful as an identifier.


mount id should not be looked at as a persistent identifier. It should
be used to derive a persistent identifier from /proc/self/mountinfo. The
persistent identifier could be the combination of device properties,
file system properties or the uuid which is going to be an optional
tag in /proc/self/mountinfo.

This also implies we need to hold a reference in the mount to make sure
we can safely lookup uuid using mount id.

>
> I suppose one way to resolve this issue is to just allow the
> underlying filesystem to supply a completely opaque filehandle to
> userspace. For local filesystems that don't care about persistence or
> uniqueness between nodes they can use something like mount_id, and for
> distributed/clustered filesystems they can include a globally-unique
> identifier.


We could use mountid to get the persistent id from mountinfo right ? So
file handle request would include

fd = open(name);
file_handle = fd_to_handle(fd);
fs_uuid = get_uuid(file_handle.mnt_id);
close(fd);

So for your usecase the handle send to other nodes include will include
cluster_fs_uuid and file_identifier.

-aneesh