2010-08-20 01:51:52

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V18 0/13] Generic name to handle and open by handle syscalls

Hi,

The below set of patches implement open by handle support using exportfs
operations. This allows user space application to map a file name to file
handle and later open the file using handle. This should be usable
for userspace NFS [1] and 9P server [2]. XFS already support this with the ioctls
XFS_IOC_PATH_TO_HANDLE and XFS_IOC_OPEN_BY_HANDLE.

[1] http://nfs-ganesha.sourceforge.net/
[2] http://thread.gmane.org/gmane.comp.emulators.qemu/68992

git repo for the patchset at:

git://git.kernel.org/pub/scm/linux/kernel/git/kvaneesh/linux-open-handle.git open-by-handle

Changes from V17:
a) Add syscall number to asm-generic/unistd.h
c) Fix wrong put_filep usage in do_handle_open

Changes from V16:
a) mnt_id is now an arg to syscall
b) finish_open_handle is removed. Update finish_open to work with
struct path and struct file.
c) Add O_NOACCESS and use that for open_by_handle on symlink

Changes from V15:
a) Add mount id to file handle
b) Add support for optional uuid tag in /proc/<pid>/mountinfo
c) Added MAY_OPEN_LINK to support open on symlink instead of
adding a new may_open function. Also limited the open flag
to O_RDONLY.

Changes from V14:
a) Use fget_light instead of fget in the patches
b) Drop uuid from struct file_handle as per the last review

Changes from V13:
a) Add support for file descriptor to handle conversion. This is needed
so that we find the right file handle for newly created files.

Changes from V12:
a) Use CAP_DAC_READ_SEARCH instead of CAP_DAC_OVERRIDE in open_by_handle
b) Return -ENOTDIR if O_DIRECTORY flag is specified in open_by_handle with
handle for non directory

Changes from V11:
a) Add necessary documentation to different functions
b) Add null pathname support to faccessat and linkat similar to
readlinkat.
c) compile fix on x86_64

Changes from V10:
a) Missed an stg refresh before sending out the patchset. Send
updated patchset.

Changes from V9:
a) Fix compile errors with CONFIG_EXPORTFS not defined
b) Return -EOPNOTSUPP if file system doesn't support fh_to_dentry exportfs callback.

Changes from V8:
a) exportfs_decode_fh now returns -ESTALE if export operations is not defined.
b) drop get_fsid super_operations. Instead use superblock to store uuid.

Changes from V7:
a) open_by_handle now use mountdirfd to identify the vfsmount.
b) We don't validate the UUID passed as a part of file handle in open_by_handle.
UUID is provided as a part of file handle as an easy way for userspace to
use the kernel returned handle as it is. It also helps in finding the 16 byte
filessytem UUID in userspace without using file system specific libraries to
read file system superblock. If a particular file system doesn't support UUID
or any form of unique id this field in the file handle will be zero filled.
c) drop freadlink syscall. Instead use readlinkat with NULL pathname to indicate
read the link target name of the link pointed by fd. This is similar to
sys_utimensat
d) Instead of opencoding all the open flag related check use helper functions.
Did finish_open_by_handle similar to finish_open.
c) Fix may_open to not return ELOOP for symlink when we are called from handle open.
open(2) still returns error as expected.

Changes from V6:
a) Add uuid to vfsmount lookup and drop uuid to superblock lookup
b) Return -EOPNOTSUPP in sys_name_to_handle if the file system returned uuid
doesn't give the same vfsmount on lookup. This ensure that we fail
sys_name_to_handle when we have multiple file system returning same UUID.

Changes from V5:
a) added sys_name_to_handle_at syscall which takes AT_SYMLINK_NOFOLLOW flag
instead of two syscalls sys_name_to_handle and sys_lname_to_handle.
b) addressed review comments from Niel Brown
c) rebased to b91ce4d14a21fc04d165be30319541e0f9204f15
d) Add compat_sys_open_by_handle

Chages from V4:
a) Changed the syscal arguments so that we don't need compat syscalls
as suggested by Christoph
c) Added two new syscall sys_lname_to_handle and sys_freadlink to work with
symlinks
d) Changed open_by_handle to work with all file types
e) Add ext3 support

Changes from V3:
a) Code cleanup suggested by Andreas
b) x86_64 syscall support
c) add compat syscall

Chages from V2:
a) Support system wide unique handle.

Changes from v1:
a) handle size is now specified in bytes
b) returns -EOVERFLOW if the handle size is small
c) dropped open_handle syscall and added open_by_handle_at syscall
open_by_handle_at takes mount_fd as the directory fd of the mount point
containing the file
e) handle will only be unique in a given file system. So for an NFS server
exporting multiple file system, NFS server will have to internally track the
mount point to which a file handle belongs to. We should be able to do it much
easily than expecting kernel to give a system wide unique file handle. System
wide unique file handle would need much larger changes to the exportfs or VFS
interface and I was not sure whether we really need to do that in the kernel or
in the user space
f) open_handle_at now only check for DAC_OVERRIDE capability


Example program: (x86_32). (x86_64 would need a different syscall number)
-------
cc <source.c>
--------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>

#include <fcntl.h>
#include <unistd.h>
#include <errno.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <string.h>

struct file_handle {
int handle_size;
int handle_type;
unsigned char handle[0];
};

#define AT_FDCWD -100
#define AT_SYMLINK_FOLLOW 0x400

static int name_to_handle(const char *name, struct file_handle *fh, int *mnt_id)
{
return syscall(341, AT_FDCWD, name, fh, mnt_id, AT_SYMLINK_FOLLOW);
}

static int lname_to_handle(const char *name, struct file_handle *fh, int *mnt_id )
{
return syscall(341, AT_FDCWD, name, fh, mnt_id, 0);
}

static int fd_to_handle(int fd, struct file_handle *fh, int *mnt_id)
{
return syscall(341, fd, NULL, fh, mnt_id, 0);
}

static int open_by_handle(int mountfd, struct file_handle *fh, int flags)
{
return syscall(342, mountfd, fh, flags);
}

#define BUFSZ 100
int main(int argc, char *argv[])
{
int fd, mnt_id;
int ret, done = 0;
int mountfd;
int handle_sz;
struct stat bufstat;
char buf[BUFSZ];
struct file_handle *fh = NULL;;
if (argc != 3 ) {
printf("Usage: %s <filename> <mount-dir-name>\n", argv[0]);
exit(1);
}
again:
if (fh && fh->handle_size) {
handle_sz = fh->handle_size;
free(fh);
fh = malloc(sizeof(struct file_handle) + handle_sz);
fh->handle_size = handle_sz;
} else {
fh = malloc(sizeof(struct file_handle));
fh->handle_size = 0;
}
errno = 0;
ret = lname_to_handle(argv[1], fh, &mnt_id);
if (ret && errno == EOVERFLOW) {
printf("Found the handle size needed to be %d\n", fh->handle_size);
goto again;
} else if (ret) {
perror("Error:");
exit(1);
}
do_again:
printf("found mount_id %d\n", mnt_id);
printf("Waiting for input");
getchar();
mountfd = open(argv[2], O_RDONLY | O_DIRECTORY);
if (mountfd <= 0) {
perror("Error:");
exit(1);
}
fd = open_by_handle(mountfd, fh, O_RDONLY);
if (fd <= 0 ) {
perror("Error:");
exit(1);
}
printf("Reading the content now \n");
fstat(fd, &bufstat);
ret = S_ISLNK(bufstat.st_mode);
if (ret) {
memset(buf, 0 , BUFSZ);
readlinkat(fd, NULL, buf, BUFSZ);
printf("%s is a symlink pointing to %s\n", argv[1], buf);
}
memset(buf, 0 , BUFSZ);
while (1) {
ret = read(fd, buf, BUFSZ -1);
if (ret <= 0)
break;
buf[ret] = '\0';
printf("%s", buf);
memset(buf, 0 , BUFSZ);
}
/* Now check for faccess */
if (faccessat(fd, NULL, W_OK, 0) == 0) {
printf("Got write permission on the file \n");
} else
perror("faccess error");
/* now try to create a hardlink */
if (linkat(fd, NULL, AT_FDCWD, "test", 0) == 0){
printf("created hardlink\n");
} else
perror("linkat error");
if (done)
exit(0);
printf("Map fd to handle \n");
ret = fd_to_handle(fd, fh, &mnt_id);
if (ret) {
perror("Error:");
exit(1);
}
done = 1;
goto do_again;
}


2010-08-20 01:51:58

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V18 01/13] exportfs: Return the minimum required handle size

The exportfs encode handle function should return the minimum required
handle size. This helps user to find out the handle size by passing 0
handle size in the first step and then redoing to the call again with
the returned handle size value.

Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/btrfs/export.c | 8 ++++++--
fs/exportfs/expfs.c | 9 +++++++--
fs/fat/inode.c | 4 +++-
fs/fuse/inode.c | 4 +++-
fs/gfs2/export.c | 8 ++++++--
fs/isofs/export.c | 8 ++++++--
fs/ocfs2/export.c | 8 +++++++-
fs/reiserfs/inode.c | 7 ++++++-
fs/udf/namei.c | 7 ++++++-
fs/xfs/linux-2.6/xfs_export.c | 4 +++-
include/linux/exportfs.h | 6 ++++--
mm/shmem.c | 4 +++-
12 files changed, 60 insertions(+), 17 deletions(-)

diff --git a/fs/btrfs/export.c b/fs/btrfs/export.c
index 951ef09..5f8ee5a 100644
--- a/fs/btrfs/export.c
+++ b/fs/btrfs/export.c
@@ -21,9 +21,13 @@ static int btrfs_encode_fh(struct dentry *dentry, u32 *fh, int *max_len,
int len = *max_len;
int type;

- if ((len < BTRFS_FID_SIZE_NON_CONNECTABLE) ||
- (connectable && len < BTRFS_FID_SIZE_CONNECTABLE))
+ if (connectable && (len < BTRFS_FID_SIZE_CONNECTABLE)) {
+ *max_len = BTRFS_FID_SIZE_CONNECTABLE;
return 255;
+ } else if (len < BTRFS_FID_SIZE_NON_CONNECTABLE) {
+ *max_len = BTRFS_FID_SIZE_NON_CONNECTABLE;
+ return 255;
+ }

len = BTRFS_FID_SIZE_NON_CONNECTABLE;
type = FILEID_BTRFS_WITHOUT_PARENT;
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index e9e1759..cfee0f0 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -319,9 +319,14 @@ static int export_encode_fh(struct dentry *dentry, struct fid *fid,
struct inode * inode = dentry->d_inode;
int len = *max_len;
int type = FILEID_INO32_GEN;
-
- if (len < 2 || (connectable && len < 4))
+
+ if (connectable && (len < 4)) {
+ *max_len = 4;
+ return 255;
+ } else if (len < 2) {
+ *max_len = 2;
return 255;
+ }

len = 2;
fid->i32.ino = inode->i_ino;
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 8300580..0812d29 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -759,8 +759,10 @@ fat_encode_fh(struct dentry *de, __u32 *fh, int *lenp, int connectable)
struct inode *inode = de->d_inode;
u32 ipos_h, ipos_m, ipos_l;

- if (len < 5)
+ if (len < 5) {
+ *lenp = 5;
return 255; /* no room */
+ }

ipos_h = MSDOS_I(inode)->i_pos >> 8;
ipos_m = (MSDOS_I(inode)->i_pos & 0xf0) << 24;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index da9e6e1..52adfcd 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -640,8 +640,10 @@ static int fuse_encode_fh(struct dentry *dentry, u32 *fh, int *max_len,
u64 nodeid;
u32 generation;

- if (*max_len < len)
+ if (*max_len < len) {
+ *max_len = len;
return 255;
+ }

nodeid = get_fuse_inode(inode)->nodeid;
generation = inode->i_generation;
diff --git a/fs/gfs2/export.c b/fs/gfs2/export.c
index dfe237a..bd0fd68 100644
--- a/fs/gfs2/export.c
+++ b/fs/gfs2/export.c
@@ -36,9 +36,13 @@ static int gfs2_encode_fh(struct dentry *dentry, __u32 *p, int *len,
struct super_block *sb = inode->i_sb;
struct gfs2_inode *ip = GFS2_I(inode);

- if (*len < GFS2_SMALL_FH_SIZE ||
- (connectable && *len < GFS2_LARGE_FH_SIZE))
+ if (connectable && (*len < GFS2_LARGE_FH_SIZE)) {
+ *len = GFS2_LARGE_FH_SIZE;
return 255;
+ } else if (*len < GFS2_SMALL_FH_SIZE) {
+ *len = GFS2_SMALL_FH_SIZE;
+ return 255;
+ }

fh[0] = cpu_to_be32(ip->i_no_formal_ino >> 32);
fh[1] = cpu_to_be32(ip->i_no_formal_ino & 0xFFFFFFFF);
diff --git a/fs/isofs/export.c b/fs/isofs/export.c
index ed752cb..dd4687f 100644
--- a/fs/isofs/export.c
+++ b/fs/isofs/export.c
@@ -124,9 +124,13 @@ isofs_export_encode_fh(struct dentry *dentry,
* offset of the inode and the upper 16 bits of fh32[1] to
* hold the offset of the parent.
*/
-
- if (len < 3 || (connectable && len < 5))
+ if (connectable && (len < 5)) {
+ *max_len = 5;
+ return 255;
+ } else if (len < 3) {
+ *max_len = 3;
return 255;
+ }

len = 3;
fh32[0] = ei->i_iget5_block;
diff --git a/fs/ocfs2/export.c b/fs/ocfs2/export.c
index 19ad145..250a347 100644
--- a/fs/ocfs2/export.c
+++ b/fs/ocfs2/export.c
@@ -201,8 +201,14 @@ static int ocfs2_encode_fh(struct dentry *dentry, u32 *fh_in, int *max_len,
dentry->d_name.len, dentry->d_name.name,
fh, len, connectable);

- if (len < 3 || (connectable && len < 6)) {
+ if (connectable && (len < 6)) {
mlog(ML_ERROR, "fh buffer is too small for encoding\n");
+ *max_len = 6;
+ type = 255;
+ goto bail;
+ } else if (len < 3) {
+ mlog(ML_ERROR, "fh buffer is too small for encoding\n");
+ *max_len = 3;
type = 255;
goto bail;
}
diff --git a/fs/reiserfs/inode.c b/fs/reiserfs/inode.c
index ae35413..e6c336b 100644
--- a/fs/reiserfs/inode.c
+++ b/fs/reiserfs/inode.c
@@ -1594,8 +1594,13 @@ int reiserfs_encode_fh(struct dentry *dentry, __u32 * data, int *lenp,
struct inode *inode = dentry->d_inode;
int maxlen = *lenp;

- if (maxlen < 3)
+ if (need_parent && (maxlen < 5)) {
+ *lenp = 5;
return 255;
+ } else if (maxlen < 3) {
+ *lenp = 3;
+ return 255;
+ }

data[0] = inode->i_ino;
data[1] = le32_to_cpu(INODE_PKEY(inode)->k_dir_id);
diff --git a/fs/udf/namei.c b/fs/udf/namei.c
index bf5fc67..20db42f 100644
--- a/fs/udf/namei.c
+++ b/fs/udf/namei.c
@@ -1336,8 +1336,13 @@ static int udf_encode_fh(struct dentry *de, __u32 *fh, int *lenp,
struct fid *fid = (struct fid *)fh;
int type = FILEID_UDF_WITHOUT_PARENT;

- if (len < 3 || (connectable && len < 5))
+ if (connectable && (len < 5)) {
+ *lenp = 5;
+ return 255;
+ } else if (len < 3) {
+ *lenp = 3;
return 255;
+ }

*lenp = 3;
fid->udf.block = location.logicalBlockNum;
diff --git a/fs/xfs/linux-2.6/xfs_export.c b/fs/xfs/linux-2.6/xfs_export.c
index 3764d74..7132d7c 100644
--- a/fs/xfs/linux-2.6/xfs_export.c
+++ b/fs/xfs/linux-2.6/xfs_export.c
@@ -81,8 +81,10 @@ xfs_fs_encode_fh(
* seven combinations work. The real answer is "don't use v2".
*/
len = xfs_fileid_length(fileid_type);
- if (*max_len < len)
+ if (*max_len < len) {
+ *max_len = len;
return 255;
+ }
*max_len = len;

switch (fileid_type) {
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index a9cd507..acd0b2d 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -108,8 +108,10 @@ struct fid {
* set, the encode_fh() should store sufficient information so that a good
* attempt can be made to find not only the file but also it's place in the
* filesystem. This typically means storing a reference to de->d_parent in
- * the filehandle fragment. encode_fh() should return the number of bytes
- * stored or a negative error code such as %-ENOSPC
+ * the filehandle fragment. encode_fh() should return the fileid_type on
+ * success and on error returns 255 (if the space needed to encode fh is
+ * greater than @max_len*4 bytes). On error @max_len contain the minimum
+ * size(in 4 byte unit) needed to encode the file handle.
*
* fh_to_dentry:
* @fh_to_dentry is given a &struct super_block (@sb) and a file handle
diff --git a/mm/shmem.c b/mm/shmem.c
index dfaa0f4..47e58af 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2143,8 +2143,10 @@ static int shmem_encode_fh(struct dentry *dentry, __u32 *fh, int *len,
{
struct inode *inode = dentry->d_inode;

- if (*len < 3)
+ if (*len < 3) {
+ *len = 3;
return 255;
+ }

if (hlist_unhashed(&inode->i_hash)) {
/* Unfortunately insert_inode_hash is not idempotent,
--
1.7.0.4

2010-08-20 01:52:05

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V18 03/13] vfs: Add open by file handle support

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/exportfs/expfs.c | 2 +
fs/namei.c | 220 +++++++++++++++++++++++++++++++++++++++++++---
fs/open.c | 32 ++++++-
include/linux/fs.h | 8 ++-
include/linux/namei.h | 1 +
include/linux/syscalls.h | 3 +
6 files changed, 247 insertions(+), 19 deletions(-)

diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index cfee0f0..05a1179 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -373,6 +373,8 @@ struct dentry *exportfs_decode_fh(struct vfsmount *mnt, struct fid *fid,
/*
* Try to get any dentry for the given file handle from the filesystem.
*/
+ if (!nop || !nop->fh_to_dentry)
+ return ERR_PTR(-ESTALE);
result = nop->fh_to_dentry(mnt->mnt_sb, fid, fh_len, fileid_type);
if (!result)
result = ERR_PTR(-ESTALE);
diff --git a/fs/namei.c b/fs/namei.c
index 17ea76b..1291dfc 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -32,6 +32,7 @@
#include <linux/fcntl.h>
#include <linux/device_cgroup.h>
#include <linux/fs_struct.h>
+#include <linux/exportfs.h>
#include <asm/uaccess.h>

#include "internal.h"
@@ -1042,6 +1043,29 @@ out_fail:
return retval;
}

+struct vfsmount *get_vfsmount_from_fd(int fd)
+{
+ int fput_needed;
+ struct path *path;
+ struct file *filep;
+
+ if (fd == AT_FDCWD) {
+ struct fs_struct *fs = current->fs;
+ read_lock(&fs->lock);
+ path = &fs->pwd;
+ mntget(path->mnt);
+ read_unlock(&fs->lock);
+ } else {
+ filep = fget_light(fd, &fput_needed);
+ if (!filep)
+ return ERR_PTR(-EBADF);
+ path = &filep->f_path;
+ mntget(path->mnt);
+ fput_light(filep, fput_needed);
+ }
+ return path->mnt;
+}
+
/* Returns 0 and nd will be valid on success; Retuns error, otherwise. */
static int do_path_lookup(int dfd, const char *name,
unsigned int flags, struct nameidata *nd)
@@ -1546,26 +1570,30 @@ static int open_will_truncate(int flag, struct inode *inode)
return (flag & O_TRUNC);
}

-static struct file *finish_open(struct nameidata *nd,
+static struct file *finish_open(struct file *filp, struct path *path,
int open_flag, int acc_mode)
{
- struct file *filp;
- int will_truncate;
int error;
+ int will_truncate;

- will_truncate = open_will_truncate(open_flag, nd->path.dentry->d_inode);
+ will_truncate = open_will_truncate(open_flag, path->dentry->d_inode);
if (will_truncate) {
- error = mnt_want_write(nd->path.mnt);
+ error = mnt_want_write(path->mnt);
if (error)
goto exit;
}
- error = may_open(&nd->path, acc_mode, open_flag);
+ error = may_open(path, acc_mode, open_flag);
if (error) {
if (will_truncate)
- mnt_drop_write(nd->path.mnt);
+ mnt_drop_write(path->mnt);
goto exit;
}
- filp = nameidata_to_filp(nd);
+ /* Has the filesystem initialised the file for us? */
+ if (filp->f_path.dentry == NULL)
+ filp = __dentry_open(path->dentry, path->mnt, filp,
+ NULL, current_cred());
+ else
+ path_put(path);
if (!IS_ERR(filp)) {
error = ima_file_check(filp, acc_mode);
if (error) {
@@ -1575,7 +1603,7 @@ static struct file *finish_open(struct nameidata *nd,
}
if (!IS_ERR(filp)) {
if (will_truncate) {
- error = handle_truncate(&nd->path);
+ error = handle_truncate(path);
if (error) {
fput(filp);
filp = ERR_PTR(error);
@@ -1588,13 +1616,17 @@ static struct file *finish_open(struct nameidata *nd,
* on its behalf.
*/
if (will_truncate)
- mnt_drop_write(nd->path.mnt);
+ mnt_drop_write(path->mnt);
return filp;

exit:
- if (!IS_ERR(nd->intent.open.file))
- release_open_intent(nd);
- path_put(&nd->path);
+ if (!IS_ERR(filp)) {
+ if (filp->f_path.dentry == NULL)
+ put_filp(filp);
+ else
+ fput(filp);
+ }
+ path_put(path);
return ERR_PTR(error);
}

@@ -1728,7 +1760,9 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
if (S_ISDIR(path->dentry->d_inode->i_mode))
goto exit;
ok:
- filp = finish_open(nd, open_flag, acc_mode);
+ filp = finish_open(nd->intent.open.file, &nd->path,
+ open_flag, acc_mode);
+
return filp;

exit_mutex_unlock:
@@ -1901,6 +1935,164 @@ struct file *filp_open(const char *filename, int flags, int mode)
}
EXPORT_SYMBOL(filp_open);

+#ifdef CONFIG_EXPORTFS
+static int vfs_dentry_acceptable(void *context, struct dentry *dentry)
+{
+ return 1;
+}
+
+static struct path *handle_to_path(int mountdirfd, struct file_handle *handle)
+{
+ int retval;
+ int handle_size;
+ struct path *path;
+
+ path = kmalloc(sizeof(struct path), GFP_KERNEL);
+ if (!path)
+ return ERR_PTR(-ENOMEM);
+
+ path->mnt = get_vfsmount_from_fd(mountdirfd);
+ if (IS_ERR(path->mnt)) {
+ retval = PTR_ERR(path->mnt);
+ goto out_err;
+ }
+ /* change the handle size to multiple of sizeof(u32) */
+ handle_size = handle->handle_size >> 2;
+ path->dentry = exportfs_decode_fh(path->mnt,
+ (struct fid *)handle->f_handle,
+ handle_size, handle->handle_type,
+ vfs_dentry_acceptable, NULL);
+ if (IS_ERR(path->dentry)) {
+ retval = PTR_ERR(path->dentry);
+ goto out_mnt;
+ }
+ return path;
+out_mnt:
+ mntput(path->mnt);
+out_err:
+ kfree(path);
+ return ERR_PTR(retval);
+}
+
+long do_handle_open(int mountdirfd,
+ struct file_handle __user *ufh, int open_flag)
+{
+ long retval = 0;
+ int fd, acc_mode;
+ struct file *filp;
+ struct path *path;
+ struct file_handle f_handle;
+ struct file_handle *handle = NULL;
+
+ /*
+ * With handle we don't look at the execute bit on the
+ * the directory. Ideally we would like CAP_DAC_SEARCH.
+ * But we don't have that
+ */
+ if (!capable(CAP_DAC_READ_SEARCH)) {
+ retval = -EPERM;
+ goto out_err;
+ }
+ /* can't use O_CREATE with open_by_handle */
+ if (open_flag & O_CREAT) {
+ retval = -EINVAL;
+ goto out_err;
+ }
+ if (copy_from_user(&f_handle, ufh, sizeof(struct file_handle))) {
+ retval = -EFAULT;
+ goto out_err;
+ }
+ if ((f_handle.handle_size > MAX_HANDLE_SZ) ||
+ (f_handle.handle_size <= 0)) {
+ retval = -EINVAL;
+ goto out_err;
+ }
+ handle = kmalloc(sizeof(struct file_handle) + f_handle.handle_size,
+ GFP_KERNEL);
+ if (!handle) {
+ retval = -ENOMEM;
+ goto out_err;
+ }
+ /* copy the full handle */
+ if (copy_from_user(handle, ufh,
+ sizeof(struct file_handle) +
+ f_handle.handle_size)) {
+ retval = -EFAULT;
+ goto out_handle;
+ }
+ path = handle_to_path(mountdirfd, handle);
+ if (IS_ERR(path)) {
+ retval = PTR_ERR(path);
+ goto out_handle;
+ }
+ if ((open_flag & O_DIRECTORY) &&
+ !S_ISDIR(path->dentry->d_inode->i_mode)) {
+ retval = -ENOTDIR;
+ goto out_path;
+ }
+ /*
+ * O_SYNC is implemented as __O_SYNC|O_DSYNC. As many places only
+ * check for O_DSYNC if the need any syncing at all we enforce it's
+ * always set instead of having to deal with possibly weird behaviour
+ * for malicious applications setting only __O_SYNC.
+ */
+ if (open_flag & __O_SYNC)
+ open_flag |= O_DSYNC;
+
+ acc_mode = MAY_OPEN | ACC_MODE(open_flag);
+
+ /* O_TRUNC implies we need access checks for write permissions */
+ if (open_flag & O_TRUNC)
+ acc_mode |= MAY_WRITE;
+ /*
+ * Allow the LSM permission hook to distinguish append
+ * access from general write access.
+ */
+ if (open_flag & O_APPEND)
+ acc_mode |= MAY_APPEND;
+
+ fd = get_unused_fd_flags(open_flag);
+ if (fd < 0) {
+ retval = fd;
+ goto out_path;
+ }
+ filp = get_empty_filp();
+ if (!filp) {
+ retval = -ENFILE;
+ goto out_free_fd;
+ }
+ filp->f_flags = open_flag;
+ filp = finish_open(filp, path, open_flag, acc_mode);
+ if (IS_ERR(filp)) {
+ put_unused_fd(fd);
+ retval = PTR_ERR(filp);
+ } else {
+ retval = fd;
+ fsnotify_open(filp);
+ fd_install(fd, filp);
+ }
+ kfree(path);
+ kfree(handle);
+ return retval;
+
+out_free_fd:
+ put_unused_fd(fd);
+out_path:
+ path_put(path);
+ kfree(path);
+out_handle:
+ kfree(handle);
+out_err:
+ return retval;
+}
+#else
+long do_handle_open(int mountdirfd,
+ struct file_handle __user *ufh, int open_flag)
+{
+ return -ENOSYS;
+}
+#endif
+
/**
* lookup_create - lookup a dentry, creating it if it doesn't exist
* @nd: nameidata info
diff --git a/fs/open.c b/fs/open.c
index efb1806..7abdcba 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -652,10 +652,10 @@ static inline int __get_file_write_access(struct inode *inode,
return error;
}

-static struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
- struct file *f,
- int (*open)(struct inode *, struct file *),
- const struct cred *cred)
+struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
+ struct file *f,
+ int (*open)(struct inode *, struct file *),
+ const struct cred *cred)
{
struct inode *inode;
int error;
@@ -1171,3 +1171,27 @@ SYSCALL_DEFINE5(name_to_handle_at, int, dfd, const char __user *, name,
return -ENOSYS;
}
#endif
+
+/**
+ * sys_open_by_handle_at: Open the file handle
+ * @mountdirfd: directory file descriptor
+ * @handle: file handle to be opened
+ * @flag: open flags.
+ *
+ * @mountdirfd indicate the directory file descriptor
+ * of the mount point. file handle is decoded relative
+ * to the vfsmount pointed by the @mountdirfd. @flags
+ * value is same as the open(2) flags.
+ */
+SYSCALL_DEFINE3(open_by_handle_at, int, mountdirfd,
+ struct file_handle __user *, handle,
+ int, flags)
+{
+ long ret;
+
+ if (force_o_largefile())
+ flags |= O_LARGEFILE;
+
+ ret = do_handle_open(mountdirfd, handle, flags);
+ return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index f2f7ad3..e9471c7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1939,6 +1939,10 @@ extern int do_fallocate(struct file *file, int mode, loff_t offset,
extern long do_sys_open(int dfd, const char __user *filename, int flags,
int mode);
extern struct file *filp_open(const char *, int, int);
+struct file *__dentry_open(struct dentry *dentry, struct vfsmount *mnt,
+ struct file *f,
+ int (*open)(struct inode *, struct file *),
+ const struct cred *cred);
extern struct file * dentry_open(struct dentry *, struct vfsmount *, int,
const struct cred *);
extern int filp_close(struct file *, fl_owner_t id);
@@ -2150,11 +2154,13 @@ extern void free_write_pipe(struct file *);

extern struct file *do_filp_open(int dfd, const char *pathname,
int open_flag, int mode, int acc_mode);
+extern long do_handle_open(int mountdirfd,
+ struct file_handle __user *ufh, int open_flag);
extern int may_open(struct path *, int, int);

extern int kernel_read(struct file *, loff_t, char *, unsigned long);
extern struct file * open_exec(const char *);
-
+
/* fs/dcache.c -- generic fs support functions */
extern int is_subdir(struct dentry *, struct dentry *);
extern int path_is_under(struct path *, struct path *);
diff --git a/include/linux/namei.h b/include/linux/namei.h
index 05b441d..827aef0 100644
--- a/include/linux/namei.h
+++ b/include/linux/namei.h
@@ -64,6 +64,7 @@ extern int user_path_at(int, const char __user *, unsigned, struct path *);
#define user_path_dir(name, path) \
user_path_at(AT_FDCWD, name, LOOKUP_FOLLOW | LOOKUP_DIRECTORY, path)

+extern struct vfsmount *get_vfsmount_from_fd(int);
extern int kern_path(const char *, unsigned, struct path *);

extern int path_lookup(const char *, unsigned, struct nameidata *);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index f7aba76..4dc527c 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -835,4 +835,7 @@ asmlinkage long sys_old_mmap(struct mmap_arg_struct __user *arg);
asmlinkage long sys_name_to_handle_at(int dfd, const char __user *name,
struct file_handle __user *handle,
int __user *mnt_id, int flag);
+asmlinkage long sys_open_by_handle_at(int mountdirfd,
+ struct file_handle __user *handle,
+ int flags);
#endif
--
1.7.0.4

2010-08-20 01:52:19

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V18 07/13] vfs: Support null pathname in linkat

This enables to use linkat to create hardlinks from a
file descriptor pointing to the file. This can be used
with open_by_handle syscall that returns a file descriptor.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/namei.c | 36 ++++++++++++++++++++++++++----------
1 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 2c79363..fd9febc 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -2627,7 +2627,7 @@ int vfs_link(struct dentry *old_dentry, struct inode *dir, struct dentry *new_de
struct inode *inode = old_dentry->d_inode;
int error;

- if (!inode)
+ if (!inode || inode->i_nlink == 0)
return -ENOENT;

error = may_create(dir, new_dentry);
@@ -2673,16 +2673,28 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
{
struct dentry *new_dentry;
struct nameidata nd;
- struct path old_path;
- int error;
+ struct path old_path, *old_pathp;
+ struct file *file = NULL;
+ int error, fput_needed;
char *to;

if ((flags & ~AT_SYMLINK_FOLLOW) != 0)
return -EINVAL;

- error = user_path_at(olddfd, oldname,
- flags & AT_SYMLINK_FOLLOW ? LOOKUP_FOLLOW : 0,
- &old_path);
+ if (oldname == NULL && olddfd != AT_FDCWD) {
+ file = fget_light(olddfd, &fput_needed);
+ if (file) {
+ old_pathp = &file->f_path;
+ error = 0;
+ } else
+ error = -EBADF;
+ } else {
+ error = user_path_at(olddfd, oldname,
+ flags & AT_SYMLINK_FOLLOW ?
+ LOOKUP_FOLLOW : 0,
+ &old_path);
+ old_pathp = &old_path;
+ }
if (error)
return error;

@@ -2690,7 +2702,7 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
if (error)
goto out;
error = -EXDEV;
- if (old_path.mnt != nd.path.mnt)
+ if (old_pathp->mnt != nd.path.mnt)
goto out_release;
new_dentry = lookup_create(&nd, 0);
error = PTR_ERR(new_dentry);
@@ -2699,10 +2711,11 @@ SYSCALL_DEFINE5(linkat, int, olddfd, const char __user *, oldname,
error = mnt_want_write(nd.path.mnt);
if (error)
goto out_dput;
- error = security_path_link(old_path.dentry, &nd.path, new_dentry);
+ error = security_path_link(old_pathp->dentry, &nd.path, new_dentry);
if (error)
goto out_drop_write;
- error = vfs_link(old_path.dentry, nd.path.dentry->d_inode, new_dentry);
+ error = vfs_link(old_pathp->dentry,
+ nd.path.dentry->d_inode, new_dentry);
out_drop_write:
mnt_drop_write(nd.path.mnt);
out_dput:
@@ -2713,7 +2726,10 @@ out_release:
path_put(&nd.path);
putname(to);
out:
- path_put(&old_path);
+ if (file)
+ fput_light(file, fput_needed);
+ else
+ path_put(&old_path);

return error;
}
--
1.7.0.4

2010-08-20 01:52:24

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V18 08/13] x86: Add new syscalls for x86_32

This patch adds sys_name_to_handle_at and sys_open_by_handle_at
syscalls to x86_32

Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
arch/x86/include/asm/unistd_32.h | 5 ++++-
arch/x86/kernel/syscall_table_32.S | 2 ++
2 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/unistd_32.h b/arch/x86/include/asm/unistd_32.h
index b766a5e..62e35b7 100644
--- a/arch/x86/include/asm/unistd_32.h
+++ b/arch/x86/include/asm/unistd_32.h
@@ -346,10 +346,13 @@
#define __NR_fanotify_init 338
#define __NR_fanotify_mark 339
#define __NR_prlimit64 340
+#define __NR_name_to_handle_at 341
+#define __NR_open_by_handle_at 342
+

#ifdef __KERNEL__

-#define NR_syscalls 341
+#define NR_syscalls 343

#define __ARCH_WANT_IPC_PARSE_VERSION
#define __ARCH_WANT_OLD_READDIR
diff --git a/arch/x86/kernel/syscall_table_32.S b/arch/x86/kernel/syscall_table_32.S
index b35786d..c314b21 100644
--- a/arch/x86/kernel/syscall_table_32.S
+++ b/arch/x86/kernel/syscall_table_32.S
@@ -340,3 +340,5 @@ ENTRY(sys_call_table)
.long sys_fanotify_init
.long sys_fanotify_mark
.long sys_prlimit64 /* 340 */
+ .long sys_name_to_handle_at
+ .long sys_open_by_handle_at
--
1.7.0.4

2010-08-20 01:52:28

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V18 09/13] x86: Add new syscalls for x86_64

Add sys_name_to_handle_at and sys_open_by_handle_at syscalls
for x86_64

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
arch/x86/ia32/ia32entry.S | 2 ++
arch/x86/include/asm/unistd_64.h | 5 +++++
fs/compat.c | 11 +++++++++++
3 files changed, 18 insertions(+), 0 deletions(-)

diff --git a/arch/x86/ia32/ia32entry.S b/arch/x86/ia32/ia32entry.S
index b86feab..d93cd0d 100644
--- a/arch/x86/ia32/ia32entry.S
+++ b/arch/x86/ia32/ia32entry.S
@@ -845,4 +845,6 @@ ia32_sys_call_table:
.quad sys_fanotify_init
.quad sys32_fanotify_mark
.quad sys_prlimit64 /* 340 */
+ .quad sys_name_to_handle_at
+ .quad compat_sys_open_by_handle_at
ia32_syscall_end:
diff --git a/arch/x86/include/asm/unistd_64.h b/arch/x86/include/asm/unistd_64.h
index 363e9b8..24d8a38 100644
--- a/arch/x86/include/asm/unistd_64.h
+++ b/arch/x86/include/asm/unistd_64.h
@@ -669,6 +669,11 @@ __SYSCALL(__NR_fanotify_init, sys_fanotify_init)
__SYSCALL(__NR_fanotify_mark, sys_fanotify_mark)
#define __NR_prlimit64 302
__SYSCALL(__NR_prlimit64, sys_prlimit64)
+#define __NR_name_to_handle_at 303
+__SYSCALL(__NR_name_to_handle_at, sys_name_to_handle_at)
+#define __NR_open_by_handle_at 304
+__SYSCALL(__NR_open_by_handle_at, sys_open_by_handle_at)
+

#ifndef __NO_STUBS
#define __ARCH_WANT_OLD_READDIR
diff --git a/fs/compat.c b/fs/compat.c
index 718c706..c4037fd 100644
--- a/fs/compat.c
+++ b/fs/compat.c
@@ -2334,3 +2334,14 @@ asmlinkage long compat_sys_timerfd_gettime(int ufd,
}

#endif /* CONFIG_TIMERFD */
+
+/*
+ * Exactly like fs/open.c:sys_open_by_handle_at(), except that it
+ * doesn't set the O_LARGEFILE flag.
+ */
+asmlinkage long
+compat_sys_open_by_handle_at(int mountdirfd,
+ struct file_handle __user *handle, int flags)
+{
+ return do_handle_open(mountdirfd, handle, flags);
+}
--
1.7.0.4

2010-08-20 01:52:31

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V18 10/13] unistd.h: Add new syscalls numbers to asm-generic

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
include/asm-generic/unistd.h | 6 +++++-
1 files changed, 5 insertions(+), 1 deletions(-)

diff --git a/include/asm-generic/unistd.h b/include/asm-generic/unistd.h
index b969770..8e7bd87 100644
--- a/include/asm-generic/unistd.h
+++ b/include/asm-generic/unistd.h
@@ -646,9 +646,13 @@ __SYSCALL(__NR_prlimit64, sys_prlimit64)
__SYSCALL(__NR_fanotify_init, sys_fanotify_init)
#define __NR_fanotify_mark 263
__SYSCALL(__NR_fanotify_mark, sys_fanotify_mark)
+#define __NR_name_to_handle_at 264
+__SYSCALL(__NR_name_to_handle_at, sys_name_to_handle_at)
+#define __NR_open_by_handle_at 265
+__SYSCALL(__NR_open_by_handle_at, sys_open_by_handle_at)

#undef __NR_syscalls
-#define __NR_syscalls 264
+#define __NR_syscalls 266

/*
* All syscalls below here should go away really,
--
1.7.0.4

2010-08-20 01:52:36

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V18 11/13] vfs: Export file system uuid via /proc/<pid>/mountinfo

We add a per superblock uuid field. File systems should
update the uuid in the fill_super callback

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/namespace.c | 16 ++++++++++++++++
include/linux/fs.h | 1 +
2 files changed, 17 insertions(+), 0 deletions(-)

diff --git a/fs/namespace.c b/fs/namespace.c
index 2e10cb1..041645a 100644
--- a/fs/namespace.c
+++ b/fs/namespace.c
@@ -837,6 +837,18 @@ const struct seq_operations mounts_op = {
.show = show_vfsmnt
};

+static int uuid_is_nil(u8 *uuid)
+{
+ int i;
+ u8 *cp = (u8 *)uuid;
+
+ for (i = 0; i < 16; i++) {
+ if (*cp++)
+ return 0;
+ }
+ return 1;
+}
+
static int show_mountinfo(struct seq_file *m, void *v)
{
struct proc_mounts *p = m->private;
@@ -875,6 +887,10 @@ static int show_mountinfo(struct seq_file *m, void *v)
if (IS_MNT_UNBINDABLE(mnt))
seq_puts(m, " unbindable");

+ if (!uuid_is_nil(mnt->mnt_sb->s_uuid))
+ /* print the uuid */
+ seq_printf(m, " uuid:%pU", mnt->mnt_sb->s_uuid);
+
/* Filesystem specific data */
seq_puts(m, " - ");
show_type(m, sb);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3eb3571..06654e5 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1370,6 +1370,7 @@ struct super_block {
wait_queue_head_t s_wait_unfrozen;

char s_id[32]; /* Informational name */
+ u8 s_uuid[16]; /* UUID */

void *s_fs_info; /* Filesystem private info */
fmode_t s_mode;
--
1.7.0.4

2010-08-20 01:52:41

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V18 12/13] ext3: Copy fs UUID to superblock.

File system UUID is made available to application
via /proc/<pid>/mountinfo

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext3/super.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/fs/ext3/super.c b/fs/ext3/super.c
index 5dbf4db..6dda322 100644
--- a/fs/ext3/super.c
+++ b/fs/ext3/super.c
@@ -1918,6 +1918,7 @@ static int ext3_fill_super (struct super_block *sb, void *data, int silent)
sb->s_qcop = &ext3_qctl_operations;
sb->dq_op = &ext3_quota_operations;
#endif
+ memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid));
INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
mutex_init(&sbi->s_orphan_lock);
mutex_init(&sbi->s_resize_lock);
--
1.7.0.4

2010-08-20 01:52:46

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V18 13/13] ext4: Copy fs UUID to superblock

File system UUID is made available to application
via /proc/<pid>/mountinfo

Acked-by: Serge Hallyn <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/ext4/super.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 2614774..b46a78c 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -2941,6 +2941,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
sb->s_qcop = &ext4_qctl_operations;
sb->dq_op = &ext4_quota_operations;
#endif
+ memcpy(sb->s_uuid, es->s_uuid, sizeof(es->s_uuid));
+
INIT_LIST_HEAD(&sbi->s_orphan); /* unlinked but open files */
mutex_init(&sbi->s_orphan_lock);
mutex_init(&sbi->s_resize_lock);
--
1.7.0.4

2010-08-20 01:52:15

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V18 05/13] vfs: Support null pathname in readlink

From: NeilBrown <[email protected]>

This enables to use readlink to get the link target name
from a file descriptor point to the link. This can be used
with open_by_handle syscall that returns a file descriptor for a link.
We can then use this file descriptor to get the target name.

This is similar to utimensat(2) interface

Signed-off-by: NeilBrown <[email protected]>
Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/stat.c | 30 ++++++++++++++++++++++--------
1 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/fs/stat.c b/fs/stat.c
index 12e90e2..3723a9f 100644
--- a/fs/stat.c
+++ b/fs/stat.c
@@ -289,26 +289,40 @@ SYSCALL_DEFINE2(newfstat, unsigned int, fd, struct stat __user *, statbuf)
SYSCALL_DEFINE4(readlinkat, int, dfd, const char __user *, pathname,
char __user *, buf, int, bufsiz)
{
- struct path path;
- int error;
+ int error = 0, fput_needed;
+ struct path path, *pp;
+ struct file *file = NULL;

if (bufsiz <= 0)
return -EINVAL;

- error = user_path_at(dfd, pathname, 0, &path);
+ if (pathname == NULL && dfd != AT_FDCWD) {
+ file = fget_light(dfd, &fput_needed);
+
+ if (file)
+ pp = &file->f_path;
+ else
+ error = -EBADF;
+ } else {
+ error = user_path_at(dfd, pathname, 0, &path);
+ pp = &path;
+ }
if (!error) {
- struct inode *inode = path.dentry->d_inode;
+ struct inode *inode = pp->dentry->d_inode;

error = -EINVAL;
if (inode->i_op->readlink) {
- error = security_inode_readlink(path.dentry);
+ error = security_inode_readlink(pp->dentry);
if (!error) {
- touch_atime(path.mnt, path.dentry);
- error = inode->i_op->readlink(path.dentry,
+ touch_atime(pp->mnt, pp->dentry);
+ error = inode->i_op->readlink(pp->dentry,
buf, bufsiz);
}
}
- path_put(&path);
+ if (file)
+ fput_light(file, fput_needed);
+ else
+ path_put(&path);
}
return error;
}
--
1.7.0.4

2010-08-20 01:53:51

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V18 06/13] vfs: Support null pathname in faccessat

This enables to use faccessat to get the access check details
from a file descriptor pointing to the file. This can be used
with open_by_handle syscall that returns a file descriptor.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/open.c | 29 +++++++++++++++++++++--------
1 files changed, 21 insertions(+), 8 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 7abdcba..f907b50 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -288,9 +288,10 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
{
const struct cred *old_cred;
struct cred *override_cred;
- struct path path;
+ struct file *file = NULL;
+ struct path path, *pp;
struct inode *inode;
- int res;
+ int res, fput_needed;

if (mode & ~S_IRWXO) /* where's F_OK, X_OK, W_OK, R_OK? */
return -EINVAL;
@@ -312,12 +313,21 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
}

old_cred = override_creds(override_cred);
-
- res = user_path_at(dfd, filename, LOOKUP_FOLLOW, &path);
+ if (filename == NULL && dfd != AT_FDCWD) {
+ file = fget_light(dfd, &fput_needed);
+ if (file) {
+ pp = &file->f_path;
+ res = 0;
+ } else
+ res = -EBADF;
+ } else {
+ res = user_path_at(dfd, filename, LOOKUP_FOLLOW, &path);
+ pp = &path;
+ }
if (res)
goto out;

- inode = path.dentry->d_inode;
+ inode = pp->dentry->d_inode;

if ((mode & MAY_EXEC) && S_ISREG(inode->i_mode)) {
/*
@@ -325,7 +335,7 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
* with the "noexec" flag.
*/
res = -EACCES;
- if (path.mnt->mnt_flags & MNT_NOEXEC)
+ if (pp->mnt->mnt_flags & MNT_NOEXEC)
goto out_path_release;
}

@@ -343,11 +353,14 @@ SYSCALL_DEFINE3(faccessat, int, dfd, const char __user *, filename, int, mode)
* inherently racy and know that the fs may change
* state before we even see this result.
*/
- if (__mnt_is_readonly(path.mnt))
+ if (__mnt_is_readonly(pp->mnt))
res = -EROFS;

out_path_release:
- path_put(&path);
+ if (file)
+ fput_light(file, fput_needed);
+ else
+ path_put(&path);
out:
revert_creds(old_cred);
put_cred(override_cred);
--
1.7.0.4

2010-08-20 01:52:11

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

The patch update may_open to allow handle based open on symlinks.
The file handle based API use file descritor returned from open_by_handle_at
to do different file system operations. To find the link target name we
need to get a file descriptor on symlinks.

We should be able to read the link target using file handle. The exact
usecase is with respect to implementing READLINK operation on a
userspace NFS server. The request contain the file handle and the
response include target name.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/namei.c | 24 ++++++++++++++++++++++--
include/asm-generic/fcntl.h | 1 +
include/linux/fs.h | 1 +
3 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/fs/namei.c b/fs/namei.c
index 1291dfc..2c79363 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1447,7 +1447,20 @@ int may_open(struct path *path, int acc_mode, int flag)

switch (inode->i_mode & S_IFMT) {
case S_IFLNK:
- return -ELOOP;
+ /*
+ * Allow only if acc_mode contain
+ * open link request.
+ */
+ if (!(acc_mode & MAY_OPEN_LINK))
+ return -ELOOP;
+ /*
+ * Allow open on symlink only with
+ * open flag O_NOACCESS. We don't
+ * allow read/write on symlinks
+ */
+ if (flag != O_NOACCESS)
+ return -ELOOP;
+ break;
case S_IFDIR:
if (acc_mode & MAY_WRITE)
return -EISDIR;
@@ -2038,8 +2051,15 @@ long do_handle_open(int mountdirfd,
*/
if (open_flag & __O_SYNC)
open_flag |= O_DSYNC;
+ /*
+ * Handle based API allow open on a symlink
+ */
+ if (S_ISLNK(path->dentry->d_inode->i_mode))
+ acc_mode = MAY_OPEN_LINK;
+ else
+ acc_mode = MAY_OPEN;

- acc_mode = MAY_OPEN | ACC_MODE(open_flag);
+ acc_mode |= ACC_MODE(open_flag);

/* O_TRUNC implies we need access checks for write permissions */
if (open_flag & O_TRUNC)
diff --git a/include/asm-generic/fcntl.h b/include/asm-generic/fcntl.h
index a70b2d2..9d8f455 100644
--- a/include/asm-generic/fcntl.h
+++ b/include/asm-generic/fcntl.h
@@ -19,6 +19,7 @@
#define O_RDONLY 00000000
#define O_WRONLY 00000001
#define O_RDWR 00000002
+#define O_NOACCESS 00000003 /* No read/write access */
#ifndef O_CREAT
#define O_CREAT 00000100 /* not fcntl */
#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e9471c7..3eb3571 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -55,6 +55,7 @@ struct inodes_stat_t {
#define MAY_ACCESS 16
#define MAY_OPEN 32
#define MAY_CHDIR 64
+#define MAY_OPEN_LINK 128

/*
* flags in file.f_mode. Note that FMODE_READ and FMODE_WRITE must correspond
--
1.7.0.4

2010-08-20 01:54:24

by Aneesh Kumar K.V

[permalink] [raw]
Subject: [PATCH -V18 02/13] vfs: Add name to file handle conversion support

The syscall also return mount id which can be used
to lookup file system specific information such as uuid
in /proc/<pid>/mountinfo

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
fs/open.c | 129 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/exportfs.h | 3 +
include/linux/fs.h | 8 +++
include/linux/syscalls.h | 5 ++-
4 files changed, 144 insertions(+), 1 deletions(-)

diff --git a/fs/open.c b/fs/open.c
index 630715f..efb1806 100644
--- a/fs/open.c
+++ b/fs/open.c
@@ -30,6 +30,7 @@
#include <linux/fs_struct.h>
#include <linux/ima.h>
#include <linux/dnotify.h>
+#include <linux/exportfs.h>

#include "internal.h"

@@ -1042,3 +1043,131 @@ int nonseekable_open(struct inode *inode, struct file *filp)
}

EXPORT_SYMBOL(nonseekable_open);
+
+#ifdef CONFIG_EXPORTFS
+static long do_sys_name_to_handle(struct path *path,
+ struct file_handle __user *ufh,
+ int __user *mnt_id)
+{
+ long retval;
+ int handle_size;
+ struct file_handle f_handle;
+ struct file_handle *handle = NULL;
+
+ if (copy_from_user(&f_handle, ufh, sizeof(struct file_handle))) {
+ retval = -EFAULT;
+ goto err_out;
+ }
+ if (f_handle.handle_size > MAX_HANDLE_SZ) {
+ retval = -EINVAL;
+ goto err_out;
+ }
+ handle = kmalloc(sizeof(struct file_handle) + f_handle.handle_size,
+ GFP_KERNEL);
+ if (!handle) {
+ retval = -ENOMEM;
+ goto err_out;
+ }
+
+ /* convert handle size to multiple of sizeof(u32) */
+ handle_size = f_handle.handle_size >> 2;
+
+ /* we ask for a non connected handle */
+ retval = exportfs_encode_fh(path->dentry,
+ (struct fid *)handle->f_handle,
+ &handle_size, 0);
+ /* convert handle size to bytes */
+ handle_size *= sizeof(u32);
+ handle->handle_type = retval;
+ handle->handle_size = handle_size;
+ if (handle_size > f_handle.handle_size) {
+ /*
+ * set the handle_size to zero so we copy only
+ * non variable part of the file_handle
+ */
+ handle_size = 0;
+ retval = -EOVERFLOW;
+ } else
+ retval = 0;
+ /* copy the mount id */
+ if (copy_to_user(mnt_id, &path->mnt->mnt_id, sizeof(*mnt_id))) {
+ retval = -EFAULT;
+ goto err_free_out;
+ }
+ if (copy_to_user(ufh, handle,
+ sizeof(struct file_handle) + handle_size))
+ retval = -EFAULT;
+err_free_out:
+ kfree(handle);
+err_out:
+ return retval;
+}
+
+/**
+ * sys_name_to_handle_at: convert name to handle
+ * @dfd: directory relative to which name is interpreted if not absolute
+ * @name: name that should be converted to handle.
+ * @handle: resulting file handle
+ * @mnt_id: mount id of the file system containing the file
+ * @flag: flag value to indicate whether to follow symlink or not
+ *
+ * @handle->handle_size indicate the space available to store the
+ * variable part of the file handle in bytes. If there is not
+ * enough space, the field is updated to return the minimum
+ * value required.
+ */
+SYSCALL_DEFINE5(name_to_handle_at, int, dfd, const char __user *, name,
+ struct file_handle __user *, handle, int __user*, mnt_id,
+ int, flag)
+{
+
+ int follow;
+ int fput_needed;
+ long ret = -EINVAL;
+ struct path path, *pp;
+ struct file *file = NULL;
+
+ if ((flag & ~AT_SYMLINK_FOLLOW) != 0)
+ goto err_out;
+
+ if (name == NULL && dfd != AT_FDCWD) {
+ file = fget_light(dfd, &fput_needed);
+ if (file) {
+ pp = &file->f_path;
+ ret = 0;
+ } else
+ ret = -EBADF;
+ } else {
+ follow = (flag & AT_SYMLINK_FOLLOW) ? LOOKUP_FOLLOW : 0;
+ ret = user_path_at(dfd, name, follow, &path);
+ pp = &path;
+ }
+ if (ret)
+ goto err_out;
+ /*
+ * We need t make sure wether the file system
+ * support decoding of the file handle
+ */
+ if (!pp->mnt->mnt_sb->s_export_op ||
+ !pp->mnt->mnt_sb->s_export_op->fh_to_dentry) {
+ ret = -EOPNOTSUPP;
+ goto out_path;
+ }
+ ret = do_sys_name_to_handle(pp, handle, mnt_id);
+
+out_path:
+ if (file)
+ fput_light(file, fput_needed);
+ else
+ path_put(&path);
+err_out:
+ return ret;
+}
+#else
+SYSCALL_DEFINE5(name_to_handle_at, int, dfd, const char __user *, name,
+ struct file_handle __user *, handle, int __user *, mnt_id,
+ int, flag)
+{
+ return -ENOSYS;
+}
+#endif
diff --git a/include/linux/exportfs.h b/include/linux/exportfs.h
index acd0b2d..1a6d72f 100644
--- a/include/linux/exportfs.h
+++ b/include/linux/exportfs.h
@@ -8,6 +8,9 @@ struct inode;
struct super_block;
struct vfsmount;

+/* limit the handle size to some value */
+#define MAX_HANDLE_SZ 4096
+
/*
* The fileid_type identifies how the file within the filesystem is encoded.
* In theory this is freely set and parsed by the filesystem, but we try to
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 9a96b4d..f2f7ad3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -953,6 +953,14 @@ struct file {
unsigned long f_mnt_write_state;
#endif
};
+
+struct file_handle {
+ int handle_size;
+ int handle_type;
+ /* file identifier */
+ unsigned char f_handle[0];
+};
+
extern spinlock_t files_lock;
#define file_list_lock() spin_lock(&files_lock);
#define file_list_unlock() spin_unlock(&files_lock);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index 6e5d197..f7aba76 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -62,6 +62,7 @@ struct robust_list_head;
struct getcpu_cache;
struct old_linux_dirent;
struct perf_event_attr;
+struct file_handle;

#include <linux/types.h>
#include <linux/aio_abi.h>
@@ -831,5 +832,7 @@ asmlinkage long sys_mmap_pgoff(unsigned long addr, unsigned long len,
unsigned long prot, unsigned long flags,
unsigned long fd, unsigned long pgoff);
asmlinkage long sys_old_mmap(struct mmap_arg_struct __user *arg);
-
+asmlinkage long sys_name_to_handle_at(int dfd, const char __user *name,
+ struct file_handle __user *handle,
+ int __user *mnt_id, int flag);
#endif
--
1.7.0.4

2010-08-20 02:13:35

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Fri, 20 Aug 2010 07:21:28 +0530, "Aneesh Kumar K.V" <[email protected]> wrote:
> The patch update may_open to allow handle based open on symlinks.
> The file handle based API use file descritor returned from open_by_handle_at
> to do different file system operations. To find the link target name we
> need to get a file descriptor on symlinks.
>
> We should be able to read the link target using file handle. The exact
> usecase is with respect to implementing READLINK operation on a
> userspace NFS server. The request contain the file handle and the
> response include target name.
>

If we don't do this, we will need new syscalls for doing stat, readlink
and link on symlinks that take file handle as the argument. If that is ok,
I can redo the patch series accordingly.

-aneesh

2010-08-20 06:53:45

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Fri, 20 Aug 2010 07:43:12 +0530, "Aneesh Kumar K. V" <[email protected]> wrote:
> On Fri, 20 Aug 2010 07:21:28 +0530, "Aneesh Kumar K.V" <[email protected]> wrote:
> > The patch update may_open to allow handle based open on symlinks.
> > The file handle based API use file descritor returned from open_by_handle_at
> > to do different file system operations. To find the link target name we
> > need to get a file descriptor on symlinks.
> >
> > We should be able to read the link target using file handle. The exact
> > usecase is with respect to implementing READLINK operation on a
> > userspace NFS server. The request contain the file handle and the
> > response include target name.
> >
>
> If we don't do this, we will need new syscalls for doing stat, readlink
> and link on symlinks that take file handle as the argument. If that is ok,
> I can redo the patch series accordingly.
>

This ended up with the below new syscalls

sys_handle_readlink, sys_handle_stat64, sys_handle_link, sys_handle_chown,
sys_handle_setxattr, sys_handle_getxattr, sys_handle_listxattr,
sys_handle_removexattr, sys_handle_utimes, compat_sys_handle_utimes

-aneesh

2010-08-20 08:31:14

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

Suddenly getting an file pointer for a symlink which could never happen
before is a really bad idea. Just add a proper readlink_by_handle
system call, similar to what's done in the XFS interface.

2010-08-20 08:32:20

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH -V18 05/13] vfs: Support null pathname in readlink

Changing the interfaces of existing system calls to accept a NULL name
which previously wasn't acceptable is not valid. Even for new system
calls I think it's a bad idea. utimensat already has the most ugly
code of all fs related system calls because of that.

2010-08-20 09:53:20

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Fri, 20 Aug 2010 04:30:57 -0400
Christoph Hellwig <[email protected]> wrote:

> Suddenly getting an file pointer for a symlink which could never happen
> before is a really bad idea. Just add a proper readlink_by_handle
> system call, similar to what's done in the XFS interface.

Why is that?
With futexes we suddenly get a file descriptor for something we could never
get a file descriptor on before and that doesn't seem to be a problem.

Why should symlinks be special as the only thing that you cannot have a file
descriptor for? Uniformity of interface is a very valuable property.

NeilBrown

2010-08-20 10:04:19

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH -V18 05/13] vfs: Support null pathname in readlink

On Fri, 20 Aug 2010 04:32:13 -0400
Christoph Hellwig <[email protected]> wrote:

> Changing the interfaces of existing system calls to accept a NULL name
> which previously wasn't acceptable is not valid. Even for new system
> calls I think it's a bad idea. utimensat already has the most ugly
> code of all fs related system calls because of that.

Again, you have made an assertion without justifying it.
I disagree, I think it is perfectly valid.

The function would still return an error in every case that it previously
returned an error, because in every previous case the fd would not have
referred to a symlink so the new version will have nothing to perform a
readlink on when NULL is passed. The error would probably be EINVAL rather
than EFAULT but even that difference could be avoided if it was really
important (I doubt it is).

You only get different behaviour if the path name is NULL and the fd refers
to a symlink. This is a completely new situation for which there is no
precedent for how the syscall should behave. We are free to create whatever
behaviour is most consistent. I believe the proposed behaviour is the
correct one.

The code may be ugly, but I'm sure it can be tidied up, particularly if
several system-calls adopted the same calling convention.

NeilBrown

2010-08-20 11:52:01

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Fri, Aug 20, 2010 at 07:53:03PM +1000, Neil Brown wrote:
> On Fri, 20 Aug 2010 04:30:57 -0400
> Christoph Hellwig <[email protected]> wrote:
>
> > Suddenly getting an file pointer for a symlink which could never happen
> > before is a really bad idea. Just add a proper readlink_by_handle
> > system call, similar to what's done in the XFS interface.
>
> Why is that?
> With futexes we suddenly get a file descriptor for something we could never
> get a file descriptor on before and that doesn't seem to be a problem.
>
> Why should symlinks be special as the only thing that you cannot have a file
> descriptor for? Uniformity of interface is a very valuable property.

You are welcome to review the codepaths around pathname resolution for
assumptions of presense of ->follow_link() and friends; there _are_
subtle cases and dumping your "opened symlinks" in there is far from
a trivial change. Note that it affects more than just the starting
points of lookups; /proc/*/fd/* stuff is also involved.

BTW, speaking of NULL pathname, linkat() variant that allows creating a link
to an opened file is also a very dubious thing; at the very least, you get
non-trivial security implications, since now a process that got an opened
descriptor passed to it by somebody else may create hardlinks to the sucker.

2010-08-20 13:25:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Fri, 2010-08-20 at 19:53 +1000, Neil Brown wrote:
> With futexes we suddenly get a file descriptor for something we could never
> get a file descriptor on before and that doesn't seem to be a problem.

FWIW

commit 82af7aca56c67061420d618cc5a30f0fd4106b80
Author: Eric Sesterhenn <[email protected]>
Date: Fri Jan 25 10:40:46 2008 +0100

Removal of FUTEX_FD

2010-08-20 14:38:59

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Fri, 20 Aug 2010 04:30:57 -0400, Christoph Hellwig <[email protected]> wrote:
> Suddenly getting an file pointer for a symlink which could never happen
> before is a really bad idea. Just add a proper readlink_by_handle
> system call, similar to what's done in the XFS interface.

It is not just readlink that we would need. We would require

sys_handle_readlink, sys_handle_stat64, sys_handle_link, sys_handle_chown,
sys_handle_setxattr, sys_handle_getxattr, sys_handle_listxattr,
sys_handle_removexattr, sys_handle_utimes, compat_sys_handle_utimes

I have most of the changes done, so if we are ok adding all these
new syscalls i can post the patch series.

-aneesh

2010-08-20 14:43:34

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V18 05/13] vfs: Support null pathname in readlink

On Fri, 20 Aug 2010 04:32:13 -0400, Christoph Hellwig <[email protected]> wrote:
> Changing the interfaces of existing system calls to accept a NULL name
> which previously wasn't acceptable is not valid. Even for new system
> calls I think it's a bad idea. utimensat already has the most ugly
> code of all fs related system calls because of that.
>

But from an interface point, it seems nice to say if the relative name
from an fd is NULL, the operation should happen on fd itself.

-aneesh

2010-08-20 23:48:18

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Fri, 20 Aug 2010 15:25:29 +0200
Peter Dijkstra <[email protected]> wrote:

> On Fri, 2010-08-20 at 19:53 +1000, Neil Brown wrote:
> > With futexes we suddenly get a file descriptor for something we could never
> > get a file descriptor on before and that doesn't seem to be a problem.
>
> FWIW
>
> commit 82af7aca56c67061420d618cc5a30f0fd4106b80
> Author: Eric Sesterhenn <[email protected]>
> Date: Fri Jan 25 10:40:46 2008 +0100
>
> Removal of FUTEX_FD

Interesting - thanks.

OK, adjust my argument to reference signalfd, timerfs and pollfd.

I haven't looked at that code before - "anon_inode_getfd" gives you an fd on
an inode for which
(stat.st_mode & S_IFMT) == 0
Cool! They have no format at all :-)

NeilBrown

2010-08-21 00:09:17

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

[[email address for Nick Piggin changed to [email protected]]]

On Fri, 20 Aug 2010 12:51:35 +0100
Al Viro <[email protected]> wrote:

> On Fri, Aug 20, 2010 at 07:53:03PM +1000, Neil Brown wrote:
> > On Fri, 20 Aug 2010 04:30:57 -0400
> > Christoph Hellwig <[email protected]> wrote:
> >
> > > Suddenly getting an file pointer for a symlink which could never happen
> > > before is a really bad idea. Just add a proper readlink_by_handle
> > > system call, similar to what's done in the XFS interface.
> >
> > Why is that?
> > With futexes we suddenly get a file descriptor for something we could never
> > get a file descriptor on before and that doesn't seem to be a problem.
> >
> > Why should symlinks be special as the only thing that you cannot have a file
> > descriptor for? Uniformity of interface is a very valuable property.
>
> You are welcome to review the codepaths around pathname resolution for
> assumptions of presense of ->follow_link() and friends; there _are_
> subtle cases and dumping your "opened symlinks" in there is far from
> a trivial change. Note that it affects more than just the starting
> points of lookups; /proc/*/fd/* stuff is also involved.

So as I understand it you aren't rejecting the concept in principle, but you
believe non-trivial code review is required before it can be considered an
acceptable change?
That's quite reasonable. I hope to find time to have a look at the code.

>
> BTW, speaking of NULL pathname, linkat() variant that allows creating a link
> to an opened file is also a very dubious thing; at the very least, you get
> non-trivial security implications, since now a process that got an opened
> descriptor passed to it by somebody else may create hardlinks to the sucker.

Fair comment, and while one could imagine ways around this (such as requiring
some Capability to link an fd) they wouldn't be very elegant.
But then nor is inventing a pile of new syscalls for doing different things
with handles such as the list Aneesh posted.

Maybe a different approach is needed.

How about a new AT flag: AT_FILE_HANDLE

Meaning is that the 'dirfd' is used only to identify a filesystem (vfsmnt) and
the 'name' pointer actually points to a filehandle fragment interpreted in
that filesystem.

One problem is that there is no way to pass the length...
Options:
fragment is at most 64 bytes nul padded at the end
fragment is hex encoded and nul terminated
??

I think I prefer the hex encoding, but I'm hoping someone else has a better
idea.

NeilBrown

2010-08-21 07:15:31

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On 2010-08-20, at 18:09, Neil Brown <[email protected]> wrote:
> How about a new AT flag: AT_FILE_HANDLE
>
> Meaning is that the 'dirfd' is used only to identify a filesystem (vfsmnt) and
> the 'name' pointer actually points to a filehandle fragment interpreted in
> that filesystem.
>
> One problem is that there is no way to pass the length...
> Options:
> fragment is at most 64 bytes nul padded at the end
> fragment is hex encoded and nul terminated
> ??
>
> I think I prefer the hex encoding, but I'm hoping someone else has a better
> idea.

That makes it ugly for the kernel to stringify and parse the file handles.

How about for AT_FILE_HANDLE THE FIRST __u32 (maybe with an extra __u32 for alignment) is the length and the rest of the binary file handle follows this? In fact, doesn't the handle itself already encode the length in the header?

Cheers, Andreas-

2010-08-21 08:30:35

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

Thanks, I had both of the same concerns as Christoph with API
change and exposing symlink fds last time I looked at the patces,
actually.

But they can probably be worked around or avoided. I think the more
important thing is whether it is worth supporting. This is
all restricted to root (or CAP_DAC_READ_SEARCH) only, right, and
what exact semantics they want. I would like to see more discussion
of what this enables and some results.

For the case of avoiding expensive network revalidations in path name
lookup, do we even need to open symlinks? Could the security issues be
avoided by always having handle attached to an open fd?

On Sat, Aug 21, 2010 at 10:09:00AM +1000, Neil Brown wrote:
> [[email address for Nick Piggin changed to [email protected]]]
>
> On Fri, 20 Aug 2010 12:51:35 +0100
> Al Viro <[email protected]> wrote:
>
> > On Fri, Aug 20, 2010 at 07:53:03PM +1000, Neil Brown wrote:
> > > On Fri, 20 Aug 2010 04:30:57 -0400
> > > Christoph Hellwig <[email protected]> wrote:
> > >
> > > > Suddenly getting an file pointer for a symlink which could never happen
> > > > before is a really bad idea. Just add a proper readlink_by_handle
> > > > system call, similar to what's done in the XFS interface.
> > >
> > > Why is that?
> > > With futexes we suddenly get a file descriptor for something we could never
> > > get a file descriptor on before and that doesn't seem to be a problem.
> > >
> > > Why should symlinks be special as the only thing that you cannot have a file
> > > descriptor for? Uniformity of interface is a very valuable property.
> >
> > You are welcome to review the codepaths around pathname resolution for
> > assumptions of presense of ->follow_link() and friends; there _are_
> > subtle cases and dumping your "opened symlinks" in there is far from
> > a trivial change. Note that it affects more than just the starting
> > points of lookups; /proc/*/fd/* stuff is also involved.
>
> So as I understand it you aren't rejecting the concept in principle, but you
> believe non-trivial code review is required before it can be considered an
> acceptable change?
> That's quite reasonable. I hope to find time to have a look at the code.
>
> >
> > BTW, speaking of NULL pathname, linkat() variant that allows creating a link
> > to an opened file is also a very dubious thing; at the very least, you get
> > non-trivial security implications, since now a process that got an opened
> > descriptor passed to it by somebody else may create hardlinks to the sucker.
>
> Fair comment, and while one could imagine ways around this (such as requiring
> some Capability to link an fd) they wouldn't be very elegant.
> But then nor is inventing a pile of new syscalls for doing different things
> with handles such as the list Aneesh posted.
>
> Maybe a different approach is needed.
>
> How about a new AT flag: AT_FILE_HANDLE
>
> Meaning is that the 'dirfd' is used only to identify a filesystem (vfsmnt) and
> the 'name' pointer actually points to a filehandle fragment interpreted in
> that filesystem.
>
> One problem is that there is no way to pass the length...
> Options:
> fragment is at most 64 bytes nul padded at the end
> fragment is hex encoded and nul terminated
> ??
>
> I think I prefer the hex encoding, but I'm hoping someone else has a better
> idea.
>
> NeilBrown

2010-08-21 09:31:33

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Sat, 21 Aug 2010 10:09:00 +1000, Neil Brown <[email protected]> wrote:
> [[email address for Nick Piggin changed to [email protected]]]
>
> On Fri, 20 Aug 2010 12:51:35 +0100
> Al Viro <[email protected]> wrote:
>
> > On Fri, Aug 20, 2010 at 07:53:03PM +1000, Neil Brown wrote:
> > > On Fri, 20 Aug 2010 04:30:57 -0400
> > > Christoph Hellwig <[email protected]> wrote:
> > >
> > > > Suddenly getting an file pointer for a symlink which could never happen
> > > > before is a really bad idea. Just add a proper readlink_by_handle
> > > > system call, similar to what's done in the XFS interface.
> > >
> > > Why is that?
> > > With futexes we suddenly get a file descriptor for something we could never
> > > get a file descriptor on before and that doesn't seem to be a problem.
> > >
> > > Why should symlinks be special as the only thing that you cannot have a file
> > > descriptor for? Uniformity of interface is a very valuable property.
> >
> > You are welcome to review the codepaths around pathname resolution for
> > assumptions of presense of ->follow_link() and friends; there _are_
> > subtle cases and dumping your "opened symlinks" in there is far from
> > a trivial change. Note that it affects more than just the starting
> > points of lookups; /proc/*/fd/* stuff is also involved.
>
> So as I understand it you aren't rejecting the concept in principle, but you
> believe non-trivial code review is required before it can be considered an
> acceptable change?
> That's quite reasonable. I hope to find time to have a look at the code.
>
> >
> > BTW, speaking of NULL pathname, linkat() variant that allows creating a link
> > to an opened file is also a very dubious thing; at the very least, you get
> > non-trivial security implications, since now a process that got an opened
> > descriptor passed to it by somebody else may create hardlinks to the
> > sucker.

Why would that be an issue ? Even though a hardlink can be created,
only process with right privileges can access them. One problem is;
being able to create hardlinks at different directory location as above
implies that there is no way to guarantee that the file got completely
removed from the system. Is this the security risk that you are pointing
above ?

>
> Fair comment, and while one could imagine ways around this (such as requiring
> some Capability to link an fd) they wouldn't be very elegant.


a sys_handle_link that limits to CAP_DAC_READ_SEARCH is a nice compromise.


> But then nor is inventing a pile of new syscalls for doing different things
> with handles such as the list Aneesh posted.
>
> Maybe a different approach is needed.
>
> How about a new AT flag: AT_FILE_HANDLE
>
> Meaning is that the 'dirfd' is used only to identify a filesystem (vfsmnt) and
> the 'name' pointer actually points to a filehandle fragment interpreted in
> that filesystem.


But that would imply once can guess the handle and create a hardlink to it.

>
> One problem is that there is no way to pass the length...


struct file_handle already include the length.

> Options:
> fragment is at most 64 bytes nul padded at the end
> fragment is hex encoded and nul terminated
> ??
>
> I think I prefer the hex encoding, but I'm hoping someone else has a better
> idea.
>

-aneesh

2010-08-21 09:32:36

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Sat, 21 Aug 2010 01:13:52 -0600, Andreas Dilger <[email protected]> wrote:
> On 2010-08-20, at 18:09, Neil Brown <[email protected]> wrote:
> > How about a new AT flag: AT_FILE_HANDLE
> >
> > Meaning is that the 'dirfd' is used only to identify a filesystem (vfsmnt) and
> > the 'name' pointer actually points to a filehandle fragment interpreted in
> > that filesystem.
> >
> > One problem is that there is no way to pass the length...
> > Options:
> > fragment is at most 64 bytes nul padded at the end
> > fragment is hex encoded and nul terminated
> > ??
> >
> > I think I prefer the hex encoding, but I'm hoping someone else has a better
> > idea.
>
> That makes it ugly for the kernel to stringify and parse the file handles.
>
> How about for AT_FILE_HANDLE THE FIRST __u32 (maybe with an extra
> __u32 for alignment) is the length and the rest of the binary file
> handle follows this? In fact, doesn't the handle itself already
> encode the length in the header?


struct file_handle already include length

-aneesh

2010-08-21 09:42:40

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Sat, 21 Aug 2010 18:30:24 +1000, Nick Piggin <[email protected]> wrote:
> Thanks, I had both of the same concerns as Christoph with API
> change and exposing symlink fds last time I looked at the patces,
> actually.
>
> But they can probably be worked around or avoided. I think the more
> important thing is whether it is worth supporting. This is
> all restricted to root (or CAP_DAC_READ_SEARCH) only, right, and
> what exact semantics they want. I would like to see more discussion
> of what this enables and some results.
>
> For the case of avoiding expensive network revalidations in path name
> lookup, do we even need to open symlinks? Could the security issues be
> avoided by always having handle attached to an open fd?
>

For implementing a userspace file server that use handle for
representing files (like NFS) we would require to have the ability to do
different file system operations that can operate on symlink to work on
handle too.

-aneesh

2010-08-22 02:02:29

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Sat, 21 Aug 2010 15:12:15 +0530, "Aneesh Kumar K. V" <[email protected]> wrote:
> On Sat, 21 Aug 2010 18:30:24 +1000, Nick Piggin <[email protected]> wrote:
> > Thanks, I had both of the same concerns as Christoph with API
> > change and exposing symlink fds last time I looked at the patces,
> > actually.
> >
> > But they can probably be worked around or avoided. I think the more
> > important thing is whether it is worth supporting. This is
> > all restricted to root (or CAP_DAC_READ_SEARCH) only, right, and
> > what exact semantics they want. I would like to see more discussion
> > of what this enables and some results.
> >
> > For the case of avoiding expensive network revalidations in path name
> > lookup, do we even need to open symlinks? Could the security issues be
> > avoided by always having handle attached to an open fd?
> >
>
> For implementing a userspace file server that use handle for
> representing files (like NFS) we would require to have the ability to do
> different file system operations that can operate on symlink to work on
> handle too.
>

linkat change to take NULL path name is needed for regular files
also. The file server will get a request to create a link with handle
of oldpath and directory handle of new path and new name.

The changes would also help in the implementation of 9P server in Qemu
for implementing a fast virtualization passthrough file system (aka
VirtFS)

http://thread.gmane.org/gmane.comp.emulators.qemu/68992

-aneesh

2010-08-22 23:06:24

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Sat, 21 Aug 2010 01:13:52 -0600
Andreas Dilger <[email protected]> wrote:

> On 2010-08-20, at 18:09, Neil Brown <[email protected]> wrote:
> > How about a new AT flag: AT_FILE_HANDLE
> >
> > Meaning is that the 'dirfd' is used only to identify a filesystem (vfsmnt) and
> > the 'name' pointer actually points to a filehandle fragment interpreted in
> > that filesystem.
> >
> > One problem is that there is no way to pass the length...
> > Options:
> > fragment is at most 64 bytes nul padded at the end
> > fragment is hex encoded and nul terminated
> > ??
> >
> > I think I prefer the hex encoding, but I'm hoping someone else has a better
> > idea.
>
> That makes it ugly for the kernel to stringify and parse the file handles.

We already parse filenames into components separated by '/'. Is HEX decoding
that much more ugly.

Filehandles are currently passed between the kernel and mountd as HEX
strings, so at least there is some precedent.

>
> How about for AT_FILE_HANDLE THE FIRST __u32 (maybe with an extra __u32 for alignment) is the length and the rest of the binary file handle follows this? In fact, doesn't the handle itself already encode the length in the header?

That part of a filehandle that nfsd gives to the filesystem is one byte out
of a 4-byte header, plus the tail of the filehandle after the part that
identifies the filesystem.
This 'one byte' does imply the length, but it doesn't necessarily encode it.
Rather it is a 'type'. So it cannot really be used to determine the length
at the point when the filehandle would need to be copied from userspace into
the kernel.


I don't think there is any precedent for passing a 4-byte length followed by
a binary string, while there is plenty of precedent for passing a
nul-terminated ASCII string.

[[ Following this approach I would like to avoid any filehandle-specific
syscalls altogether.
Just use a *at syscall with AT_FILE_HANDLE for filehandle lookup, and use
getxattr('system:linux.file_handle') to get the filehandle for a given path.

Ofcourse we would need to at *at versions of the *xattr syscalls, but that is
probably a good idea anyway.
]]

NeilBrown

>
> Cheers, Andreas--
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2010-08-22 23:17:27

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Sat, 21 Aug 2010 18:30:24 +1000
Nick Piggin <[email protected]> wrote:

> Thanks, I had both of the same concerns as Christoph with API
> change and exposing symlink fds last time I looked at the patces,
> actually.
>
> But they can probably be worked around or avoided. I think the more
> important thing is whether it is worth supporting. This is
> all restricted to root (or CAP_DAC_READ_SEARCH) only, right, and
> what exact semantics they want. I would like to see more discussion
> of what this enables and some results.

They allow a credible user-space implementation of the server for some
network filesystem protocols such as NFS and apparently 9P.

>
> For the case of avoiding expensive network revalidations in path name
> lookup, do we even need to open symlinks? Could the security issues be
> avoided by always having handle attached to an open fd?

I don't see what you are getting at here... which particular security isses,
and what fd would you use?

As I understand it there are only two security issues that have been noted.
1/ lookup-by-filehandle can bypass any 'search' permission tests on ancestor
directories. I cannot see any way to avoid this except require
CAP_DAC_READ_SEARCH
2/ Creating a hardlink to an 'fd' allows a process that was given an 'fd'
that it could not have opened itself to prevent that file from being
removed (and space reclaimed) by creating a private hardlink.
This could be avoided by requiring CAP_DAC_READ_SEARCH for that particular
operation (and probably requiring i_nlink > 0 anyway) but that feels like
a very special-case restriction.

Was it one of these that you were referring to?

Thanks,
NeilBrown


>
> On Sat, Aug 21, 2010 at 10:09:00AM +1000, Neil Brown wrote:
> > [[email address for Nick Piggin changed to [email protected]]]
> >
> > On Fri, 20 Aug 2010 12:51:35 +0100
> > Al Viro <[email protected]> wrote:
> >
> > > On Fri, Aug 20, 2010 at 07:53:03PM +1000, Neil Brown wrote:
> > > > On Fri, 20 Aug 2010 04:30:57 -0400
> > > > Christoph Hellwig <[email protected]> wrote:
> > > >
> > > > > Suddenly getting an file pointer for a symlink which could never happen
> > > > > before is a really bad idea. Just add a proper readlink_by_handle
> > > > > system call, similar to what's done in the XFS interface.
> > > >
> > > > Why is that?
> > > > With futexes we suddenly get a file descriptor for something we could never
> > > > get a file descriptor on before and that doesn't seem to be a problem.
> > > >
> > > > Why should symlinks be special as the only thing that you cannot have a file
> > > > descriptor for? Uniformity of interface is a very valuable property.
> > >
> > > You are welcome to review the codepaths around pathname resolution for
> > > assumptions of presense of ->follow_link() and friends; there _are_
> > > subtle cases and dumping your "opened symlinks" in there is far from
> > > a trivial change. Note that it affects more than just the starting
> > > points of lookups; /proc/*/fd/* stuff is also involved.
> >
> > So as I understand it you aren't rejecting the concept in principle, but you
> > believe non-trivial code review is required before it can be considered an
> > acceptable change?
> > That's quite reasonable. I hope to find time to have a look at the code.
> >
> > >
> > > BTW, speaking of NULL pathname, linkat() variant that allows creating a link
> > > to an opened file is also a very dubious thing; at the very least, you get
> > > non-trivial security implications, since now a process that got an opened
> > > descriptor passed to it by somebody else may create hardlinks to the sucker.
> >
> > Fair comment, and while one could imagine ways around this (such as requiring
> > some Capability to link an fd) they wouldn't be very elegant.
> > But then nor is inventing a pile of new syscalls for doing different things
> > with handles such as the list Aneesh posted.
> >
> > Maybe a different approach is needed.
> >
> > How about a new AT flag: AT_FILE_HANDLE
> >
> > Meaning is that the 'dirfd' is used only to identify a filesystem (vfsmnt) and
> > the 'name' pointer actually points to a filehandle fragment interpreted in
> > that filesystem.
> >
> > One problem is that there is no way to pass the length...
> > Options:
> > fragment is at most 64 bytes nul padded at the end
> > fragment is hex encoded and nul terminated
> > ??
> >
> > I think I prefer the hex encoding, but I'm hoping someone else has a better
> > idea.
> >
> > NeilBrown
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2010-08-23 01:24:18

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Mon, 23 Aug 2010 09:06:04 +1000, Neil Brown <[email protected]> wrote:
> On Sat, 21 Aug 2010 01:13:52 -0600
> Andreas Dilger <[email protected]> wrote:
>
> > On 2010-08-20, at 18:09, Neil Brown <[email protected]> wrote:
> > > How about a new AT flag: AT_FILE_HANDLE
> > >
> > > Meaning is that the 'dirfd' is used only to identify a filesystem (vfsmnt) and
> > > the 'name' pointer actually points to a filehandle fragment interpreted in
> > > that filesystem.
> > >
> > > One problem is that there is no way to pass the length...
> > > Options:
> > > fragment is at most 64 bytes nul padded at the end
> > > fragment is hex encoded and nul terminated
> > > ??
> > >
> > > I think I prefer the hex encoding, but I'm hoping someone else has a better
> > > idea.
> >
> > That makes it ugly for the kernel to stringify and parse the file handles.
>
> We already parse filenames into components separated by '/'. Is HEX decoding
> that much more ugly.
>
> Filehandles are currently passed between the kernel and mountd as HEX
> strings, so at least there is some precedent.
>
> >
> > How about for AT_FILE_HANDLE THE FIRST __u32 (maybe with an extra __u32 for alignment) is the length and the rest of the binary file handle follows this? In fact, doesn't the handle itself already encode the length in the header?
>
> That part of a filehandle that nfsd gives to the filesystem is one byte out
> of a 4-byte header, plus the tail of the filehandle after the part that
> identifies the filesystem.
> This 'one byte' does imply the length, but it doesn't necessarily encode it.
> Rather it is a 'type'. So it cannot really be used to determine the length
> at the point when the filehandle would need to be copied from userspace into
> the kernel.
>
>
> I don't think there is any precedent for passing a 4-byte length followed by
> a binary string, while there is plenty of precedent for passing a
> nul-terminated ASCII string.
>
> [[ Following this approach I would like to avoid any filehandle-specific
> syscalls altogether.
> Just use a *at syscall with AT_FILE_HANDLE for filehandle lookup, and use
> getxattr('system:linux.file_handle') to get the filehandle for a given path.
>
> Ofcourse we would need to at *at versions of the *xattr syscalls, but that is
> probably a good idea anyway.
> ]]

There are at* syscalls that doesn't take the additional flags as the
argument, like openat, readlinkat. How will handle based open and
readlink work with the above interface ?

-aneesh

2010-08-23 01:53:05

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Mon, 23 Aug 2010 06:54:03 +0530
"Aneesh Kumar K. V" <[email protected]> wrote:

> On Mon, 23 Aug 2010 09:06:04 +1000, Neil Brown <[email protected]> wrote:
> > On Sat, 21 Aug 2010 01:13:52 -0600
> > Andreas Dilger <[email protected]> wrote:
> >
> > > On 2010-08-20, at 18:09, Neil Brown <[email protected]> wrote:
> > > > How about a new AT flag: AT_FILE_HANDLE
> > > >
> > > > Meaning is that the 'dirfd' is used only to identify a filesystem (vfsmnt) and
> > > > the 'name' pointer actually points to a filehandle fragment interpreted in
> > > > that filesystem.
> > > >
> > > > One problem is that there is no way to pass the length...
> > > > Options:
> > > > fragment is at most 64 bytes nul padded at the end
> > > > fragment is hex encoded and nul terminated
> > > > ??
> > > >
> > > > I think I prefer the hex encoding, but I'm hoping someone else has a better
> > > > idea.
> > >
> > > That makes it ugly for the kernel to stringify and parse the file handles.
> >
> > We already parse filenames into components separated by '/'. Is HEX decoding
> > that much more ugly.
> >
> > Filehandles are currently passed between the kernel and mountd as HEX
> > strings, so at least there is some precedent.
> >
> > >
> > > How about for AT_FILE_HANDLE THE FIRST __u32 (maybe with an extra __u32 for alignment) is the length and the rest of the binary file handle follows this? In fact, doesn't the handle itself already encode the length in the header?
> >
> > That part of a filehandle that nfsd gives to the filesystem is one byte out
> > of a 4-byte header, plus the tail of the filehandle after the part that
> > identifies the filesystem.
> > This 'one byte' does imply the length, but it doesn't necessarily encode it.
> > Rather it is a 'type'. So it cannot really be used to determine the length
> > at the point when the filehandle would need to be copied from userspace into
> > the kernel.
> >
> >
> > I don't think there is any precedent for passing a 4-byte length followed by
> > a binary string, while there is plenty of precedent for passing a
> > nul-terminated ASCII string.
> >
> > [[ Following this approach I would like to avoid any filehandle-specific
> > syscalls altogether.
> > Just use a *at syscall with AT_FILE_HANDLE for filehandle lookup, and use
> > getxattr('system:linux.file_handle') to get the filehandle for a given path.
> >
> > Ofcourse we would need to at *at versions of the *xattr syscalls, but that is
> > probably a good idea anyway.
> > ]]
>
> There are at* syscalls that doesn't take the additional flags as the
> argument, like openat, readlinkat. How will handle based open and
> readlink work with the above interface ?
>

Bother... you are right.

I had remembered that at the time that all that *at calls were added there was
discussion about how you always need some flags, particularly in the context
of adding O_CLOEXEC and (I thought) a flag to allow non-sequential allocation
of fds.
I had thought that they all got 'flags' arguments as a result, but it seems
not.
For openat you could squeeze something into the current 'flags' arg
(O_FILE_HANDLE), but for readlinkat, symlinkat at least there is no such
option. Sad really.

NeilBrown

2010-08-23 02:50:06

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Mon, 23 Aug 2010 09:06:04 +1000, Neil Brown <[email protected]> wrote:
> [[ Following this approach I would like to avoid any filehandle-specific
> syscalls altogether.
> Just use a *at syscall with AT_FILE_HANDLE for filehandle lookup, and use
> getxattr('system:linux.file_handle') to get the filehandle for a given path.
>
> Ofcourse we would need to at *at versions of the *xattr syscalls, but that is
> probably a good idea anyway.
> ]]

sys_setxattrat would take 7 arguments.

-aneesh

2010-08-24 07:22:04

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Sat, Aug 21, 2010 at 03:12:15PM +0530, Aneesh Kumar K. V wrote:
> On Sat, 21 Aug 2010 18:30:24 +1000, Nick Piggin <[email protected]> wrote:
> > Thanks, I had both of the same concerns as Christoph with API
> > change and exposing symlink fds last time I looked at the patces,
> > actually.
> >
> > But they can probably be worked around or avoided. I think the more
> > important thing is whether it is worth supporting. This is
> > all restricted to root (or CAP_DAC_READ_SEARCH) only, right, and
> > what exact semantics they want. I would like to see more discussion
> > of what this enables and some results.
> >
> > For the case of avoiding expensive network revalidations in path name
> > lookup, do we even need to open symlinks? Could the security issues be
> > avoided by always having handle attached to an open fd?
> >
>
> For implementing a userspace file server that use handle for
> representing files (like NFS) we would require to have the ability to do
> different file system operations that can operate on symlink to work on
> handle too.

Right. Is this a really important goal, I'm wondering? Is it realistic
(ie. to be able to remove the nfs server from the kernel)?

2010-08-24 07:30:10

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Mon, Aug 23, 2010 at 09:17:08AM +1000, Neil Brown wrote:
> On Sat, 21 Aug 2010 18:30:24 +1000
> Nick Piggin <[email protected]> wrote:
>
> > Thanks, I had both of the same concerns as Christoph with API
> > change and exposing symlink fds last time I looked at the patces,
> > actually.
> >
> > But they can probably be worked around or avoided. I think the more
> > important thing is whether it is worth supporting. This is
> > all restricted to root (or CAP_DAC_READ_SEARCH) only, right, and
> > what exact semantics they want. I would like to see more discussion
> > of what this enables and some results.
>
> They allow a credible user-space implementation of the server for some
> network filesystem protocols such as NFS and apparently 9P.
>
> >
> > For the case of avoiding expensive network revalidations in path name
> > lookup, do we even need to open symlinks? Could the security issues be
> > avoided by always having handle attached to an open fd?
>
> I don't see what you are getting at here... which particular security isses,
> and what fd would you use?

Well the issue that you need escalated privilges to use it. The other
use case for it I understand is Andreas's file-handle-server which
avoids a lot of path lookup costs on non-local filesystems. I'm
wondering is that really useful if it's not availale to unprivileged
users?

>
> As I understand it there are only two security issues that have been noted.
> 1/ lookup-by-filehandle can bypass any 'search' permission tests on ancestor
> directories. I cannot see any way to avoid this except require
> CAP_DAC_READ_SEARCH
> 2/ Creating a hardlink to an 'fd' allows a process that was given an 'fd'
> that it could not have opened itself to prevent that file from being
> removed (and space reclaimed) by creating a private hardlink.
> This could be avoided by requiring CAP_DAC_READ_SEARCH for that particular
> operation (and probably requiring i_nlink > 0 anyway) but that feels like
> a very special-case restriction.

Just so long as the process could have created a hardlink to the file
otherwise via traditional operations, I think it's OK.

>
> Was it one of these that you were referring to?

Just the general problem of security and inherent restrictions to using
the syscalls.

2010-08-24 09:41:14

by Bastien Roucariès

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Mon, Aug 23, 2010 at 1:06 AM, Neil Brown <[email protected]> wrote:
> On Sat, 21 Aug 2010 01:13:52 -0600
> Andreas Dilger <[email protected]> wrote:
>
>> On 2010-08-20, at 18:09, Neil Brown <[email protected]> wrote:
>> > How about a new AT flag: ?AT_FILE_HANDLE
>> >
>> > Meaning is that the 'dirfd' is used only to identify a filesystem (vfsmnt) and
>> > the 'name' pointer actually points to a filehandle fragment interpreted in
>> > that filesystem.

Why ot creating a special file system for this kind of operation ?
I mean a vfsmnt filesystem, with each directory on the root is a
symlink to the root of the real vfsmnt root ?

I could be even be in proc space like /proc/self/vfsmnt

path_to_handle will return a relative path from this directory like
0x75843558/somehandle (if X is on /usr/bin/X and usr is mounted by
filesystem 0x75843558)
path_to_fshandle() will return 0x75843558

opening file handle will be just a matter to thus open
/proc/self/vfsmount/0x75843558/somehandle

Permission will be determined by vfsmount filesystem.

No need to create new syscall all te handle to filename will be handle
by the vfsmount filesystem

We could even use at existing command. The dirfd will need to be only
/proc/self/vfsmnt (and if you need to get a fd without mounting /proc
create a syscall to get this fd).

Does sound plausible ?

Bastien

2010-08-24 10:34:40

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Tue, 24 Aug 2010 17:21:55 +1000, Nick Piggin <[email protected]> wrote:
> On Sat, Aug 21, 2010 at 03:12:15PM +0530, Aneesh Kumar K. V wrote:
> > On Sat, 21 Aug 2010 18:30:24 +1000, Nick Piggin <[email protected]> wrote:
> > > Thanks, I had both of the same concerns as Christoph with API
> > > change and exposing symlink fds last time I looked at the patces,
> > > actually.
> > >
> > > But they can probably be worked around or avoided. I think the more
> > > important thing is whether it is worth supporting. This is
> > > all restricted to root (or CAP_DAC_READ_SEARCH) only, right, and
> > > what exact semantics they want. I would like to see more discussion
> > > of what this enables and some results.
> > >
> > > For the case of avoiding expensive network revalidations in path name
> > > lookup, do we even need to open symlinks? Could the security issues be
> > > avoided by always having handle attached to an open fd?
> > >
> >
> > For implementing a userspace file server that use handle for
> > representing files (like NFS) we would require to have the ability to do
> > different file system operations that can operate on symlink to work on
> > handle too.
>
> Right. Is this a really important goal, I'm wondering? Is it realistic
> (ie. to be able to remove the nfs server from the kernel)?
>

The feature is also needed to implement a 9p virtio pass through file
system in Qemu http://thread.gmane.org/gmane.comp.emulators.qemu/68992

-aneesh

2010-08-24 10:40:54

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Mon, 23 Aug 2010 11:52:47 +1000, Neil Brown <[email protected]> wrote:
> On Mon, 23 Aug 2010 06:54:03 +0530
> "Aneesh Kumar K. V" <[email protected]> wrote:
>
> > On Mon, 23 Aug 2010 09:06:04 +1000, Neil Brown <[email protected]> wrote:
> > > On Sat, 21 Aug 2010 01:13:52 -0600
> > > Andreas Dilger <[email protected]> wrote:
> > >
> > > > On 2010-08-20, at 18:09, Neil Brown <[email protected]> wrote:
> > > > > How about a new AT flag: AT_FILE_HANDLE
> > > > >
> > > > > Meaning is that the 'dirfd' is used only to identify a filesystem (vfsmnt) and
> > > > > the 'name' pointer actually points to a filehandle fragment interpreted in
> > > > > that filesystem.
> > > > >
> > > > > One problem is that there is no way to pass the length...
> > > > > Options:
> > > > > fragment is at most 64 bytes nul padded at the end
> > > > > fragment is hex encoded and nul terminated
> > > > > ??
> > > > >
> > > > > I think I prefer the hex encoding, but I'm hoping someone else has a better
> > > > > idea.
> > > >
> > > > That makes it ugly for the kernel to stringify and parse the file handles.
> > >
> > > We already parse filenames into components separated by '/'. Is HEX decoding
> > > that much more ugly.
> > >
> > > Filehandles are currently passed between the kernel and mountd as HEX
> > > strings, so at least there is some precedent.
> > >
> > > >
> > > > How about for AT_FILE_HANDLE THE FIRST __u32 (maybe with an extra __u32 for alignment) is the length and the rest of the binary file handle follows this? In fact, doesn't the handle itself already encode the length in the header?
> > >
> > > That part of a filehandle that nfsd gives to the filesystem is one byte out
> > > of a 4-byte header, plus the tail of the filehandle after the part that
> > > identifies the filesystem.
> > > This 'one byte' does imply the length, but it doesn't necessarily encode it.
> > > Rather it is a 'type'. So it cannot really be used to determine the length
> > > at the point when the filehandle would need to be copied from userspace into
> > > the kernel.
> > >
> > >
> > > I don't think there is any precedent for passing a 4-byte length followed by
> > > a binary string, while there is plenty of precedent for passing a
> > > nul-terminated ASCII string.
> > >
> > > [[ Following this approach I would like to avoid any filehandle-specific
> > > syscalls altogether.
> > > Just use a *at syscall with AT_FILE_HANDLE for filehandle lookup, and use
> > > getxattr('system:linux.file_handle') to get the filehandle for a given path.
> > >
> > > Ofcourse we would need to at *at versions of the *xattr syscalls, but that is
> > > probably a good idea anyway.
> > > ]]
> >
> > There are at* syscalls that doesn't take the additional flags as the
> > argument, like openat, readlinkat. How will handle based open and
> > readlink work with the above interface ?
> >
>
> Bother... you are right.
>
> I had remembered that at the time that all that *at calls were added there was
> discussion about how you always need some flags, particularly in the context
> of adding O_CLOEXEC and (I thought) a flag to allow non-sequential allocation
> of fds.
> I had thought that they all got 'flags' arguments as a result, but it seems
> not.
> For openat you could squeeze something into the current 'flags' arg
> (O_FILE_HANDLE), but for readlinkat, symlinkat at least there is no such
> option. Sad really.
>

IMHO that is really bad overloading of flags value. I also find hex
encoded handle in pathname argument of *at syscalls confusing. The above
discussion also hint that we would need a new syscall for open,
readlink and setxattr, So how about me posting the new series which
remove open on symlink patch and add a bunch of syscalls to allow
operation on symlinks based on handle ?

-aneesh


2010-08-24 13:19:50

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Tue, Aug 24, 2010 at 05:21:55PM +1000, Nick Piggin wrote:
> Right. Is this a really important goal, I'm wondering? Is it realistic
> (ie. to be able to remove the nfs server from the kernel)?

I think so.

The hardest part is if you want to export a filesystem that also has
non-nfs users--in the kernel it's much easier to maintain consistency
between local users of the filesystem and nfs clients.

If the server has no local users, then userspace may turn out to be
easier.

In that case you might ask why the server needs a real filesystem to
export at all--in theory it could maintain all its data on its own. But
I suspect if you try that you find you're reinventing a lot of wheels.

--b.

2010-08-25 02:04:32

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Tue, 24 Aug 2010 11:41:10 +0200
Bastien ROUCARIES <[email protected]> wrote:

> On Mon, Aug 23, 2010 at 1:06 AM, Neil Brown <[email protected]> wrote:
> > On Sat, 21 Aug 2010 01:13:52 -0600
> > Andreas Dilger <[email protected]> wrote:
> >
> >> On 2010-08-20, at 18:09, Neil Brown <[email protected]> wrote:
> >> > How about a new AT flag:  AT_FILE_HANDLE
> >> >
> >> > Meaning is that the 'dirfd' is used only to identify a filesystem (vfsmnt) and
> >> > the 'name' pointer actually points to a filehandle fragment interpreted in
> >> > that filesystem.
>
> Why ot creating a special file system for this kind of operation ?
> I mean a vfsmnt filesystem, with each directory on the root is a
> symlink to the root of the real vfsmnt root ?
>
> I could be even be in proc space like /proc/self/vfsmnt
>
> path_to_handle will return a relative path from this directory like
> 0x75843558/somehandle (if X is on /usr/bin/X and usr is mounted by
> filesystem 0x75843558)
> path_to_fshandle() will return 0x75843558
>
> opening file handle will be just a matter to thus open
> /proc/self/vfsmount/0x75843558/somehandle
>
> Permission will be determined by vfsmount filesystem.
>
> No need to create new syscall all te handle to filename will be handle
> by the vfsmount filesystem
>
> We could even use at existing command. The dirfd will need to be only
> /proc/self/vfsmnt (and if you need to get a fd without mounting /proc
> create a syscall to get this fd).
>
> Does sound plausible ?
>

I don't think so.

I'm not 100% certain what you are proposing, but I think the basic idea is a
virtual filesystem where giving a textual filehandle as a name gives access
to the file with that filehandle.

This could only work by creating a virtual symlink from the name to the
object in whichever filesystem - somewhat like /proc/self/fd/*.
This could be used to open the file, not to create a hard-link or read a
symlink which are two of the issues we are struggling with.

Maybe I have misunderstood you though.

NeilBrown

2010-08-25 02:07:10

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Mon, 23 Aug 2010 08:19:48 +0530
"Aneesh Kumar K. V" <[email protected]> wrote:

> On Mon, 23 Aug 2010 09:06:04 +1000, Neil Brown <[email protected]> wrote:
> > [[ Following this approach I would like to avoid any filehandle-specific
> > syscalls altogether.
> > Just use a *at syscall with AT_FILE_HANDLE for filehandle lookup, and use
> > getxattr('system:linux.file_handle') to get the filehandle for a given path.
> >
> > Ofcourse we would need to at *at versions of the *xattr syscalls, but that is
> > probably a good idea anyway.
> > ]]
>
> sys_setxattrat would take 7 arguments.
>

I count 6 - assuming we share the 'flags' argument. There are only 2 xattr
flags currently so there should be room to share.

dirfd path name value size flags == 6

NeilBrown

2010-08-25 09:13:15

by Bastien Roucariès

[permalink] [raw]
Subject: Re: [PATCH -V18 04/13] vfs: Allow handle based open on symlinks

On Wed, Aug 25, 2010 at 4:04 AM, Neil Brown <[email protected]> wrote:
> On Tue, 24 Aug 2010 11:41:10 +0200
>>
>> Why ot creating a special file system for this kind of operation ?
>> I mean a vfsmnt filesystem, with each directory on the root is a
>> symlink to the root of the real vfsmnt root ?
>>
>> I could be even be in proc space like /proc/self/vfsmnt
>>
>> path_to_handle will return a relative path from this directory like
>> 0x75843558/somehandle (if X is on /usr/bin/X and usr is mounted by
>> filesystem 0x75843558)
>> path_to_fshandle() will return 0x75843558
>>
>> opening file handle will be just a matter to thus open
>> /proc/self/vfsmount/0x75843558/somehandle
>>
>> Permission will be determined by vfsmount filesystem.
>>
>> No need to create new syscall all te handle to filename will be handle
>> by the vfsmount filesystem
>>
>> We could even use at existing command. The dirfd will need to be only
>> /proc/self/vfsmnt (and if you need to get a fd without mounting /proc
>> create a syscall to get this fd).
>>
>> Does sound plausible ?
>>
>
> I don't think so.
>
> I'm not 100% certain what you are proposing, but I think the basic idea is a
> virtual filesystem where giving a textual filehandle as a name gives access
> to the file with that filehandle.

Exactly

> This could only work by creating a virtual symlink from the name to the
> object in whichever filesystem - somewhat like /proc/self/fd/*.
> This could be used to open the file, not to create a hard-link or read a
> symlink which are two of the issues we are struggling with.

This issue could be raised by the open O_NODE patch
(http://lwn.net/Articles/364735/).
For symlink it will allow to read the symlink.

For hard link some bsd have fsdb that allow this kind of stuff. in our
case the security implication are huge, We could undelete a removed
file. In fact you need a flink(fd, "new path"). I could be useful in
some context note but seriously restricted.

Bastien