2015-11-26 18:55:36

by Christoph Hellwig

Subject: vfs: move btrfs clone ioctls to common code

This patch set moves the existing btrfs clone ioctls, which other file
systems have started to implement, to common code, and allows the NFS
server to export this functionality to remote systems.
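
Because the ioctls move rather than change, userspace built against the
btrfs definitions should keep issuing exactly the same requests. The
compile-time checks below sketch that invariant. They assume installed
uapi headers that provide both <linux/fs.h> with FICLONE/FICLONERANGE
(4.5 or later) and <linux/btrfs.h>; OLD_NFS_IOC_CLONE_RANGE is a local
copy of the NFS-private definition this series deletes, kept only for
comparison.

```c
#include <assert.h>
#include <linux/btrfs.h>  /* BTRFS_IOC_CLONE, BTRFS_IOC_CLONE_RANGE */
#include <linux/fs.h>     /* FICLONE, FICLONERANGE, struct file_clone_range */
#include <sys/ioctl.h>    /* _IOW(), _IOC_NR(), _IOC_SIZE() */

/* The generic requests reuse the btrfs magic (0x94) and request numbers,
 * so binaries built against the btrfs header issue identical ioctls. */
_Static_assert(FICLONE == BTRFS_IOC_CLONE,
	       "FICLONE must keep the BTRFS_IOC_CLONE number");
_Static_assert(FICLONERANGE == BTRFS_IOC_CLONE_RANGE,
	       "FICLONERANGE must keep the BTRFS_IOC_CLONE_RANGE number");
_Static_assert(sizeof(struct file_clone_range) ==
	       sizeof(struct btrfs_ioctl_clone_range_args),
	       "both argument structs must have the same size");

/* Local copy of the NFS-private definition removed by this series.  It
 * encoded sizeof(int) rather than the argument struct size, so the full
 * request number differed from the btrfs one even though nr 13 matched. */
#define OLD_NFS_IOC_CLONE_RANGE _IOW(0x94, 13, int)
```

The last definition shows why a single shared header is worth having:
the NFS copy of the "same" request never actually matched btrfs.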

This work is originally based on my NFS CLONE prototype, which reused
code from Anna Schumaker's NFS COPY prototype, and incorporates various
updates to that code from Peng Tao.

The patches are also available as a git branch and on gitweb:

git://git.infradead.org/users/hch/pnfs.git clone-for-viro
http://git.infradead.org/users/hch/pnfs.git/shortlog/refs/heads/clone-for-viro



2015-11-26 18:55:42

by Christoph Hellwig

Subject: [PATCH 2/5] locks: new locks_mandatory_area calling convention

Pass a loff_t end for the last byte instead of the size_t count
parameter, which is only 32 bits wide on 32-bit architectures, so that
full file clones work there too. While we're at it, also drop the
pointless inode argument and simplify the read/write selection.
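
The arithmetic behind the new convention, modeled in userspace as a
sketch: model_loff_t stands in for the kernel's signed 64-bit loff_t, and
area_end() and truncate_area() are hypothetical helpers mirroring what
rw_verify_area() and locks_verify_truncate() compute after this patch.
A 6 GiB range cannot be described by a 32-bit count, but its last byte
fits easily in a 64-bit end offset.

```c
#include <assert.h>
#include <stdint.h>

/* Userspace stand-in for the kernel's signed 64-bit loff_t. */
typedef int64_t model_loff_t;

/* Hypothetical helper: inclusive last byte of an I/O of count bytes at
 * offset, as rw_verify_area() now computes it (count must be > 0). */
static model_loff_t area_end(model_loff_t offset, uint64_t count)
{
	return offset + (model_loff_t)count - 1;
}

/* Hypothetical helper mirroring locks_verify_truncate(): the checked
 * range covers the bytes between the old and the new size, whichever
 * direction the size moves. */
static void truncate_area(model_loff_t i_size, model_loff_t size,
			  model_loff_t *start, model_loff_t *end)
{
	if (size < i_size) {
		*start = size;		/* shrinking: bytes being cut off */
		*end = i_size - 1;
	} else {
		*start = i_size;	/* growing: bytes being added */
		*end = size - 1;
	}
}
```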

Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/locks.c | 22 +++++++++-------------
fs/read_write.c | 5 ++---
include/linux/fs.h | 28 +++++++++++++---------------
3 files changed, 24 insertions(+), 31 deletions(-)

diff --git a/fs/locks.c b/fs/locks.c
index 0d2b326..d503669 100644
--- a/fs/locks.c
+++ b/fs/locks.c
@@ -1227,21 +1227,17 @@ int locks_mandatory_locked(struct file *file)

/**
* locks_mandatory_area - Check for a conflicting lock
- * @read_write: %FLOCK_VERIFY_WRITE for exclusive access, %FLOCK_VERIFY_READ
- * for shared
- * @inode: the file to check
* @filp: how the file was opened (if it was)
- * @offset: start of area to check
- * @count: length of area to check
+ * @start: first byte in the file to check
+ * @end: last byte in the file to check
+ * @write: %true if checking for write access
*
* Searches the inode's list of locks to find any POSIX locks which conflict.
- * This function is called from rw_verify_area() and
- * locks_verify_truncate().
*/
-int locks_mandatory_area(int read_write, struct inode *inode,
- struct file *filp, loff_t offset,
- size_t count)
+int locks_mandatory_area(struct file *filp, loff_t start, loff_t end,
+ bool write)
{
+ struct inode *inode = file_inode(filp);
struct file_lock fl;
int error;
bool sleep = false;
@@ -1252,9 +1248,9 @@ int locks_mandatory_area(int read_write, struct inode *inode,
fl.fl_flags = FL_POSIX | FL_ACCESS;
if (filp && !(filp->f_flags & O_NONBLOCK))
sleep = true;
- fl.fl_type = (read_write == FLOCK_VERIFY_WRITE) ? F_WRLCK : F_RDLCK;
- fl.fl_start = offset;
- fl.fl_end = offset + count - 1;
+ fl.fl_type = write ? F_WRLCK : F_RDLCK;
+ fl.fl_start = start;
+ fl.fl_end = end;

for (;;) {
if (filp) {
diff --git a/fs/read_write.c b/fs/read_write.c
index c81ef39..48157dd 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -396,9 +396,8 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t
}

if (unlikely(inode->i_flctx && mandatory_lock(inode))) {
- retval = locks_mandatory_area(
- read_write == READ ? FLOCK_VERIFY_READ : FLOCK_VERIFY_WRITE,
- inode, file, pos, count);
+ retval = locks_mandatory_area(file, pos, pos + count - 1,
+ read_write == READ ? false : true);
if (retval < 0)
return retval;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 870a76e..e640f791 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2030,12 +2030,9 @@ extern struct kobject *fs_kobj;

#define MAX_RW_COUNT (INT_MAX & PAGE_CACHE_MASK)

-#define FLOCK_VERIFY_READ 1
-#define FLOCK_VERIFY_WRITE 2
-
#ifdef CONFIG_FILE_LOCKING
extern int locks_mandatory_locked(struct file *);
-extern int locks_mandatory_area(int, struct inode *, struct file *, loff_t, size_t);
+extern int locks_mandatory_area(struct file *, loff_t, loff_t, bool);

/*
* Candidates for mandatory locking have the setgid bit set
@@ -2068,14 +2065,16 @@ static inline int locks_verify_truncate(struct inode *inode,
struct file *filp,
loff_t size)
{
- if (inode->i_flctx && mandatory_lock(inode))
- return locks_mandatory_area(
- FLOCK_VERIFY_WRITE, inode, filp,
- size < inode->i_size ? size : inode->i_size,
- (size < inode->i_size ? inode->i_size - size
- : size - inode->i_size)
- );
- return 0;
+ if (!inode->i_flctx || !mandatory_lock(inode))
+ return 0;
+
+ if (size < inode->i_size) {
+ return locks_mandatory_area(filp, size, inode->i_size - 1,
+ true);
+ } else {
+ return locks_mandatory_area(filp, inode->i_size, size - 1,
+ true);
+ }
}

static inline int break_lease(struct inode *inode, unsigned int mode)
@@ -2144,9 +2143,8 @@ static inline int locks_mandatory_locked(struct file *file)
return 0;
}

-static inline int locks_mandatory_area(int rw, struct inode *inode,
- struct file *filp, loff_t offset,
- size_t count)
+static inline int locks_mandatory_area(struct file *filp, loff_t start,
+ loff_t end, bool write)
{
return 0;
}
--
1.9.1


2015-11-26 18:55:45

by Christoph Hellwig

Subject: [PATCH 3/5] vfs: pull btrfs clone API to vfs layer

The btrfs clone ioctls are now adopted by other file systems: NFS
since 4.3 and XFS a few kernel releases from now, as well as the
previous (incorrect) usage by CIFS. To avoid a growing number of
slightly incompatible implementations, add one to the core VFS code.
Note that clones are different from file copies in several ways:

- they are atomic with respect to other writers
- they support whole-file clones
- they support 64-bit length clones
- they do not allow partial success (aka short writes)
- they are expected to be a fast metadata operation

Because of that it would be rather cumbersome to try to piggyback
them on top of the recent copy_file_range infrastructure.

Based on earlier work from Peng Tao.
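
From userspace, the generic ioctls could be wrapped roughly as follows.
This is a sketch assuming kernel headers that define FICLONE,
FICLONERANGE and struct file_clone_range (as added to
include/uapi/linux/fs.h below); clone_range() and clone_file() are
hypothetical helper names. Note the all-or-nothing result: success is 0,
never a byte count, and a src_length of 0 asks for a clone through the
end of the source file.

```c
#include <assert.h>
#include <errno.h>
#include <linux/fs.h>	/* FICLONE, FICLONERANGE, struct file_clone_range */
#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical helper: clone len bytes at src_off in src_fd to dst_off in
 * dst_fd.  len == 0 means "through the end of the source file".
 * Returns 0 on success or a negative errno; there is no partial result. */
static int clone_range(int src_fd, uint64_t src_off, uint64_t len,
		       int dst_fd, uint64_t dst_off)
{
	struct file_clone_range args = {
		.src_fd = src_fd,
		.src_offset = src_off,
		.src_length = len,
		.dest_offset = dst_off,
	};

	if (ioctl(dst_fd, FICLONERANGE, &args) < 0)
		return -errno;
	return 0;
}

/* Hypothetical helper: whole-file clone.  FICLONE takes the source fd
 * itself as the ioctl argument. */
static int clone_file(int src_fd, int dst_fd)
{
	if (ioctl(dst_fd, FICLONE, src_fd) < 0)
		return -errno;
	return 0;
}
```

Both requests are issued on the destination descriptor; the source
travels as an fd, either as the ioctl argument itself (FICLONE) or inside
the argument struct (FICLONERANGE).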

Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/btrfs/ctree.h | 3 +-
fs/btrfs/file.c | 1 +
fs/btrfs/ioctl.c | 49 ++--------------------
fs/ioctl.c | 29 +++++++++++++
fs/nfs/nfs42proc.c | 1 +
fs/nfs/nfs4file.c | 107 ++++++++---------------------------------------
fs/read_write.c | 71 +++++++++++++++++++++++++++++++
include/linux/fs.h | 7 +++-
include/uapi/linux/fs.h | 9 ++++
include/uapi/linux/nfs.h | 11 -----
10 files changed, 140 insertions(+), 148 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index dd7d888..adc997f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4021,7 +4021,6 @@ void btrfs_get_block_group_info(struct list_head *groups_list,
void update_ioctl_balance_args(struct btrfs_fs_info *fs_info, int lock,
struct btrfs_ioctl_balance_args *bargs);

-
/* file.c */
int btrfs_auto_defrag_init(void);
void btrfs_auto_defrag_exit(void);
@@ -4054,6 +4053,8 @@ int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t len, unsigned int flags);
+int btrfs_clone_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out, u64 len);

/* tree-defrag.c */
int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index 1c0ee74..3b61b0a 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2921,6 +2921,7 @@ const struct file_operations btrfs_file_operations = {
.compat_ioctl = btrfs_ioctl,
#endif
.copy_file_range = btrfs_copy_file_range,
+ .clone_file_range = btrfs_clone_file_range,
};

void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0f92735..85b1cae 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3906,49 +3906,10 @@ ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
return ret;
}

-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
- u64 off, u64 olen, u64 destoff)
+int btrfs_clone_file_range(struct file *src_file, loff_t off,
+ struct file *dst_file, loff_t destoff, u64 len)
{
- struct fd src_file;
- int ret;
-
- /* the destination must be opened for writing */
- if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
- return -EINVAL;
-
- ret = mnt_want_write_file(file);
- if (ret)
- return ret;
-
- src_file = fdget(srcfd);
- if (!src_file.file) {
- ret = -EBADF;
- goto out_drop_write;
- }
-
- /* the src must be open for reading */
- if (!(src_file.file->f_mode & FMODE_READ)) {
- ret = -EINVAL;
- goto out_fput;
- }
-
- ret = btrfs_clone_files(file, src_file.file, off, olen, destoff);
-
-out_fput:
- fdput(src_file);
-out_drop_write:
- mnt_drop_write_file(file);
- return ret;
-}
-
-static long btrfs_ioctl_clone_range(struct file *file, void __user *argp)
-{
- struct btrfs_ioctl_clone_range_args args;
-
- if (copy_from_user(&args, argp, sizeof(args)))
- return -EFAULT;
- return btrfs_ioctl_clone(file, args.src_fd, args.src_offset,
- args.src_length, args.dest_offset);
+ return btrfs_clone_files(dst_file, src_file, off, len, destoff);
}

/*
@@ -5498,10 +5459,6 @@ long btrfs_ioctl(struct file *file, unsigned int
return btrfs_ioctl_dev_info(root, argp);
case BTRFS_IOC_BALANCE:
return btrfs_ioctl_balance(file, NULL);
- case BTRFS_IOC_CLONE:
- return btrfs_ioctl_clone(file, arg, 0, 0, 0);
- case BTRFS_IOC_CLONE_RANGE:
- return btrfs_ioctl_clone_range(file, argp);
case BTRFS_IOC_TRANS_START:
return btrfs_ioctl_trans_start(file);
case BTRFS_IOC_TRANS_END:
diff --git a/fs/ioctl.c b/fs/ioctl.c
index 5d01d26..84c6e79 100644
--- a/fs/ioctl.c
+++ b/fs/ioctl.c
@@ -215,6 +215,29 @@ static int ioctl_fiemap(struct file *filp, unsigned long arg)
return error;
}

+static long ioctl_file_clone(struct file *dst_file, unsigned long srcfd,
+ u64 off, u64 olen, u64 destoff)
+{
+ struct fd src_file = fdget(srcfd);
+ int ret;
+
+ if (!src_file.file)
+ return -EBADF;
+ ret = vfs_clone_file_range(src_file.file, off, dst_file, destoff, olen);
+ fdput(src_file);
+ return ret;
+}
+
+static long ioctl_file_clone_range(struct file *file, void __user *argp)
+{
+ struct file_clone_range args;
+
+ if (copy_from_user(&args, argp, sizeof(args)))
+ return -EFAULT;
+ return ioctl_file_clone(file, args.src_fd, args.src_offset,
+ args.src_length, args.dest_offset);
+}
+
#ifdef CONFIG_BLOCK

static inline sector_t logical_to_blk(struct inode *inode, loff_t offset)
@@ -600,6 +623,12 @@ int do_vfs_ioctl(struct file *filp, unsigned int fd, unsigned int cmd,
case FIGETBSZ:
return put_user(inode->i_sb->s_blocksize, argp);

+ case FICLONE:
+ return ioctl_file_clone(filp, arg, 0, 0, 0);
+
+ case FICLONERANGE:
+ return ioctl_file_clone_range(filp, argp);
+
default:
if (S_ISREG(inode->i_mode))
error = file_ioctl(filp, cmd, arg);
diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
index 3e92a3c..303d22e 100644
--- a/fs/nfs/nfs42proc.c
+++ b/fs/nfs/nfs42proc.c
@@ -284,6 +284,7 @@ static int _nfs42_proc_clone(struct rpc_message *msg, struct file *src_f,
.dst_fh = NFS_FH(dst_inode),
.src_offset = src_offset,
.dst_offset = dst_offset,
+ .count = count,
.dst_bitmask = server->cache_consistency_bitmask,
};
struct nfs42_clone_res res = {
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 4aa5719..f46d087 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -194,63 +194,32 @@ static long nfs42_fallocate(struct file *filep, int mode, loff_t offset, loff_t
return nfs42_proc_allocate(filep, offset, len);
}

-static noinline long
-nfs42_ioctl_clone(struct file *dst_file, unsigned long srcfd,
- u64 src_off, u64 dst_off, u64 count)
+static int nfs42_clone_file_range(struct file *src_file, loff_t src_off,
+ struct file *dst_file, loff_t dst_off, u64 count)
{
struct inode *dst_inode = file_inode(dst_file);
struct nfs_server *server = NFS_SERVER(dst_inode);
- struct fd src_file;
- struct inode *src_inode;
+ struct inode *src_inode = file_inode(src_file);
unsigned int bs = server->clone_blksize;
+ bool same_inode = false;
int ret;

- /* dst file must be opened for writing */
- if (!(dst_file->f_mode & FMODE_WRITE))
- return -EINVAL;
-
- ret = mnt_want_write_file(dst_file);
- if (ret)
- return ret;
-
- src_file = fdget(srcfd);
- if (!src_file.file) {
- ret = -EBADF;
- goto out_drop_write;
- }
-
- src_inode = file_inode(src_file.file);
-
- /* src and dst must be different files */
- ret = -EINVAL;
- if (src_inode == dst_inode)
- goto out_fput;
-
- /* src file must be opened for reading */
- if (!(src_file.file->f_mode & FMODE_READ))
- goto out_fput;
-
- /* src and dst must be regular files */
- ret = -EISDIR;
- if (!S_ISREG(src_inode->i_mode) || !S_ISREG(dst_inode->i_mode))
- goto out_fput;
-
- ret = -EXDEV;
- if (src_file.file->f_path.mnt != dst_file->f_path.mnt ||
- src_inode->i_sb != dst_inode->i_sb)
- goto out_fput;
-
/* check alignment w.r.t. clone_blksize */
ret = -EINVAL;
if (bs) {
if (!IS_ALIGNED(src_off, bs) || !IS_ALIGNED(dst_off, bs))
- goto out_fput;
+ goto out;
if (!IS_ALIGNED(count, bs) && i_size_read(src_inode) != (src_off + count))
- goto out_fput;
+ goto out;
}

+ if (src_inode == dst_inode)
+ same_inode = true;
+
/* XXX: do we lock at all? what if server needs CB_RECALL_LAYOUT? */
- if (dst_inode < src_inode) {
+ if (same_inode) {
+ mutex_lock(&src_inode->i_mutex);
+ } else if (dst_inode < src_inode) {
mutex_lock_nested(&dst_inode->i_mutex, I_MUTEX_PARENT);
mutex_lock_nested(&src_inode->i_mutex, I_MUTEX_CHILD);
} else {
@@ -267,7 +236,7 @@ nfs42_ioctl_clone(struct file *dst_file, unsigned long srcfd,
if (ret)
goto out_unlock;

- ret = nfs42_proc_clone(src_file.file, dst_file, src_off, dst_off, count);
+ ret = nfs42_proc_clone(src_file, dst_file, src_off, dst_off, count);

/* truncate inode page cache of the dst range so that future reads can fetch
* new data from server */
@@ -275,56 +244,20 @@ nfs42_ioctl_clone(struct file *dst_file, unsigned long srcfd,
truncate_inode_pages_range(&dst_inode->i_data, dst_off, dst_off + count - 1);

out_unlock:
- if (dst_inode < src_inode) {
+ if (same_inode) {
+ mutex_unlock(&src_inode->i_mutex);
+ } else if (dst_inode < src_inode) {
mutex_unlock(&src_inode->i_mutex);
mutex_unlock(&dst_inode->i_mutex);
} else {
mutex_unlock(&dst_inode->i_mutex);
mutex_unlock(&src_inode->i_mutex);
}
-out_fput:
- fdput(src_file);
-out_drop_write:
- mnt_drop_write_file(dst_file);
+out:
return ret;
}
-
-static long nfs42_ioctl_clone_range(struct file *dst_file, void __user *argp)
-{
- struct nfs_ioctl_clone_range_args args;
-
- if (copy_from_user(&args, argp, sizeof(args)))
- return -EFAULT;
-
- return nfs42_ioctl_clone(dst_file, args.src_fd, args.src_off, args.dst_off, args.count);
-}
-#else
-static long nfs42_ioctl_clone(struct file *dst_file, unsigned long srcfd,
- u64 src_off, u64 dst_off, u64 count)
-{
- return -ENOTTY;
-}
-
-static long nfs42_ioctl_clone_range(struct file *dst_file, void __user *argp)
-{
- return -ENOTTY;
-}
#endif /* CONFIG_NFS_V4_2 */

-long nfs4_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
-{
- void __user *argp = (void __user *)arg;
-
- switch (cmd) {
- case NFS_IOC_CLONE:
- return nfs42_ioctl_clone(file, arg, 0, 0, 0);
- case NFS_IOC_CLONE_RANGE:
- return nfs42_ioctl_clone_range(file, argp);
- }
-
- return -ENOTTY;
-}
-
const struct file_operations nfs4_file_operations = {
#ifdef CONFIG_NFS_V4_2
.llseek = nfs4_file_llseek,
@@ -344,12 +277,8 @@ const struct file_operations nfs4_file_operations = {
.splice_write = iter_file_splice_write,
#ifdef CONFIG_NFS_V4_2
.fallocate = nfs42_fallocate,
+ .clone_file_range = nfs42_clone_file_range,
#endif /* CONFIG_NFS_V4_2 */
.check_flags = nfs_check_flags,
.setlease = simple_nosetlease,
-#ifdef CONFIG_COMPAT
- .unlocked_ioctl = nfs4_ioctl,
-#else
- .compat_ioctl = nfs4_ioctl,
-#endif /* CONFIG_COMPAT */
};
diff --git a/fs/read_write.c b/fs/read_write.c
index 48157dd..095e209 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1451,3 +1451,74 @@ out1:
out2:
return ret;
}
+
+static int clone_verify_area(struct file *file, loff_t pos, u64 len, bool write)
+{
+ struct inode *inode = file_inode(file);
+
+ if (unlikely(pos < 0))
+ return -EINVAL;
+
+ if (unlikely((loff_t) (pos + len) < 0))
+ return -EINVAL;
+
+ if (unlikely(inode->i_flctx && mandatory_lock(inode))) {
+ loff_t end = len ? pos + len - 1 : OFFSET_MAX;
+ int retval;
+
+ retval = locks_mandatory_area(file, pos, end, write);
+ if (retval < 0)
+ return retval;
+ }
+
+ return security_file_permission(file, write ? MAY_WRITE : MAY_READ);
+}
+
+int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out, u64 len)
+{
+ struct inode *inode_in = file_inode(file_in);
+ struct inode *inode_out = file_inode(file_out);
+ int ret;
+
+ if (inode_in->i_sb != inode_out->i_sb ||
+ file_in->f_path.mnt != file_out->f_path.mnt)
+ return -EXDEV;
+
+ if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
+ return -EISDIR;
+ if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+ return -EOPNOTSUPP;
+
+ if (!(file_in->f_mode & FMODE_READ) ||
+ !(file_out->f_mode & FMODE_WRITE) ||
+ (file_out->f_flags & O_APPEND) ||
+ !file_in->f_op->clone_file_range)
+ return -EBADF;
+
+ ret = clone_verify_area(file_in, pos_in, len, false);
+ if (ret)
+ return ret;
+
+ ret = clone_verify_area(file_out, pos_out, len, true);
+ if (ret)
+ return ret;
+
+ if (pos_in + len > i_size_read(inode_in))
+ return -EINVAL;
+
+ ret = mnt_want_write_file(file_out);
+ if (ret)
+ return ret;
+
+ ret = file_in->f_op->clone_file_range(file_in, pos_in,
+ file_out, pos_out, len);
+ if (!ret) {
+ fsnotify_access(file_in);
+ fsnotify_modify(file_out);
+ }
+
+ mnt_drop_write_file(file_out);
+ return ret;
+}
+EXPORT_SYMBOL(vfs_clone_file_range);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e640f791..75ce095 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1629,7 +1629,10 @@ struct file_operations {
#ifndef CONFIG_MMU
unsigned (*mmap_capabilities)(struct file *);
#endif
- ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
+ ssize_t (*copy_file_range)(struct file *, loff_t, struct file *,
+ loff_t, size_t, unsigned int);
+ int (*clone_file_range)(struct file *, loff_t, struct file *, loff_t,
+ u64);
};

struct inode_operations {
@@ -1683,6 +1686,8 @@ extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
unsigned long, loff_t *);
extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
loff_t, size_t, unsigned int);
+extern int vfs_clone_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out, u64 len);

struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index f15d980..cd5db7f 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -39,6 +39,13 @@
#define RENAME_EXCHANGE (1 << 1) /* Exchange source and dest */
#define RENAME_WHITEOUT (1 << 2) /* Whiteout source */

+struct file_clone_range {
+ __s64 src_fd;
+ __u64 src_offset;
+ __u64 src_length;
+ __u64 dest_offset;
+};
+
struct fstrim_range {
__u64 start;
__u64 len;
@@ -159,6 +166,8 @@ struct inodes_stat_t {
#define FIFREEZE _IOWR('X', 119, int) /* Freeze */
#define FITHAW _IOWR('X', 120, int) /* Thaw */
#define FITRIM _IOWR('X', 121, struct fstrim_range) /* Trim */
+#define FICLONE _IOW(0x94, 9, int)
+#define FICLONERANGE _IOW(0x94, 13, struct file_clone_range)

#define FS_IOC_GETFLAGS _IOR('f', 1, long)
#define FS_IOC_SETFLAGS _IOW('f', 2, long)
diff --git a/include/uapi/linux/nfs.h b/include/uapi/linux/nfs.h
index 654bae3..5e62961 100644
--- a/include/uapi/linux/nfs.h
+++ b/include/uapi/linux/nfs.h
@@ -33,17 +33,6 @@

#define NFS_PIPE_DIRNAME "nfs"

-/* NFS ioctls */
-/* Let's follow btrfs lead on CLONE to avoid messing userspace */
-#define NFS_IOC_CLONE _IOW(0x94, 9, int)
-#define NFS_IOC_CLONE_RANGE _IOW(0x94, 13, int)
-
-struct nfs_ioctl_clone_range_args {
- __s64 src_fd;
- __u64 src_off, count;
- __u64 dst_off;
-};
-
/*
* NFS stats. The good thing with these values is that NFSv3 errors are
* a superset of NFSv2 errors (with the exception of NFSERR_WFLUSH which
--
1.9.1
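
The range validation in clone_verify_area() and the clone_blksize rule
in nfs42_clone_file_range() can be modeled in plain C to make the edge
cases explicit. This is a sketch, not kernel code: clone_range_check()
and nfs_clone_align_check() are hypothetical names, INT64_MAX stands in
for OFFSET_MAX, and the overflow test is written as an unsigned
comparison, a slightly stricter formulation than the loff_t cast in the
patch (it also rejects lengths large enough to wrap pos + len past 2^64).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical model of the range part of clone_verify_area().
 * pos is a signed loff_t, len an unsigned u64; len == 0 means "to EOF". */
static int clone_range_check(int64_t pos, uint64_t len, int64_t *end)
{
	if (pos < 0)
		return -EINVAL;
	/* pos + len must stay representable as a non-negative loff_t. */
	if (len > (uint64_t)INT64_MAX - (uint64_t)pos)
		return -EINVAL;
	/* Inclusive last byte, as handed to locks_mandatory_area(). */
	*end = len ? pos + (int64_t)len - 1 : INT64_MAX;
	return 0;
}

/* Hypothetical model of the clone_blksize rule: offsets must be block
 * aligned, and so must the length unless the range ends exactly at the
 * source EOF.  bs == 0 means no alignment is enforced. */
static int nfs_clone_align_check(uint64_t bs, uint64_t src_off,
				 uint64_t dst_off, uint64_t count,
				 uint64_t src_size)
{
	if (!bs)
		return 0;
	if (src_off % bs || dst_off % bs)
		return -EINVAL;
	if (count % bs && src_size != src_off + count)
		return -EINVAL;
	return 0;
}
```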


2015-11-26 18:55:52

by Christoph Hellwig

Subject: [PATCH 5/5] nfsd: implement the NFSv4.2 CLONE operation

This is basically a remote version of the btrfs CLONE operation,
so the implementation is fairly trivial. It is made even more trivial
by stealing the XDR code and general framework from Anna Schumaker's
COPY prototype.
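
The CLONE arguments decoded by nfsd4_decode_clone() are two stateids
followed by three 64-bit XDR hypers (source offset, destination offset,
count), matching the CLONE4args definition in RFC 7862. A standalone
sketch of that wire layout follows, assuming a 16-byte stateid (a 32-bit
seqid plus 12 opaque bytes); the struct and function names are
hypothetical, not nfsd's.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

struct model_stateid4 {
	uint32_t seqid;
	unsigned char other[12];
};

struct model_clone4args {
	struct model_stateid4 src_stateid, dst_stateid;
	uint64_t src_offset, dst_offset, count;
};

/* XDR integers are big-endian; a hyper is two 32-bit words, high first. */
static uint32_t get_be32(const unsigned char *p)
{
	return (uint32_t)p[0] << 24 | (uint32_t)p[1] << 16 |
	       (uint32_t)p[2] << 8 | p[3];
}

static uint64_t get_be64(const unsigned char *p)
{
	return (uint64_t)get_be32(p) << 32 | get_be32(p + 4);
}

static const unsigned char *decode_stateid(const unsigned char *p,
					   struct model_stateid4 *s)
{
	s->seqid = get_be32(p);
	memcpy(s->other, p + 4, sizeof(s->other));
	return p + 16;
}

/* Returns 0 on success, -1 if the buffer is too short. */
static int decode_clone(const unsigned char *p, size_t len,
			struct model_clone4args *a)
{
	/* Same total as the two stateid reads plus READ_BUF(8 + 8 + 8). */
	if (len < 2 * 16 + 3 * 8)
		return -1;
	p = decode_stateid(p, &a->src_stateid);
	p = decode_stateid(p, &a->dst_stateid);
	a->src_offset = get_be64(p);
	a->dst_offset = get_be64(p + 8);
	a->count = get_be64(p + 16);
	return 0;
}
```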

Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/nfsd/nfs4proc.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
fs/nfsd/nfs4xdr.c | 21 +++++++++++++++++++++
fs/nfsd/vfs.c | 8 ++++++++
fs/nfsd/vfs.h | 2 ++
fs/nfsd/xdr4.h | 10 ++++++++++
include/linux/nfs4.h | 4 ++--
6 files changed, 90 insertions(+), 2 deletions(-)

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 3ba10a3..819ad81 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1012,6 +1012,47 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
}

static __be32
+nfsd4_clone(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
+ struct nfsd4_clone *clone)
+{
+ struct file *src, *dst;
+ __be32 status;
+
+ status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->save_fh,
+ &clone->cl_src_stateid, RD_STATE,
+ &src, NULL);
+ if (status) {
+ dprintk("NFSD: %s: couldn't process src stateid!\n", __func__);
+ goto out;
+ }
+
+ status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
+ &clone->cl_dst_stateid, WR_STATE,
+ &dst, NULL);
+ if (status) {
+ dprintk("NFSD: %s: couldn't process dst stateid!\n", __func__);
+ goto out_put_src;
+ }
+
+ /* fix up for NFS-specific error code */
+ if (!S_ISREG(file_inode(src)->i_mode) ||
+ !S_ISREG(file_inode(dst)->i_mode)) {
+ status = nfserr_wrong_type;
+ goto out_put_dst;
+ }
+
+ status = nfsd4_clone_file_range(src, clone->cl_src_pos,
+ dst, clone->cl_dst_pos, clone->cl_count);
+
+out_put_dst:
+ fput(dst);
+out_put_src:
+ fput(src);
+out:
+ return status;
+}
+
+static __be32
nfsd4_fallocate(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
struct nfsd4_fallocate *fallocate, int flags)
{
@@ -2281,6 +2322,12 @@ static struct nfsd4_operation nfsd4_ops[] = {
.op_name = "OP_DEALLOCATE",
.op_rsize_bop = (nfsd4op_rsize)nfsd4_only_status_rsize,
},
+ [OP_CLONE] = {
+ .op_func = (nfsd4op_func)nfsd4_clone,
+ .op_flags = OP_MODIFIES_SOMETHING | OP_CACHEME,
+ .op_name = "OP_CLONE",
+ .op_rsize_bop = (nfsd4op_rsize)nfsd4_only_status_rsize,
+ },
[OP_SEEK] = {
.op_func = (nfsd4op_func)nfsd4_seek,
.op_name = "OP_SEEK",
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 51c9e9c..924416f 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1675,6 +1675,25 @@ nfsd4_decode_fallocate(struct nfsd4_compoundargs *argp,
}

static __be32
+nfsd4_decode_clone(struct nfsd4_compoundargs *argp, struct nfsd4_clone *clone)
+{
+ DECODE_HEAD;
+
+ status = nfsd4_decode_stateid(argp, &clone->cl_src_stateid);
+ if (status)
+ return status;
+ status = nfsd4_decode_stateid(argp, &clone->cl_dst_stateid);
+ if (status)
+ return status;
+
+ READ_BUF(8 + 8 + 8);
+ p = xdr_decode_hyper(p, &clone->cl_src_pos);
+ p = xdr_decode_hyper(p, &clone->cl_dst_pos);
+ p = xdr_decode_hyper(p, &clone->cl_count);
+ DECODE_TAIL;
+}
+
+static __be32
nfsd4_decode_seek(struct nfsd4_compoundargs *argp, struct nfsd4_seek *seek)
{
DECODE_HEAD;
@@ -1785,6 +1804,7 @@ static nfsd4_dec nfsd4_dec_ops[] = {
[OP_READ_PLUS] = (nfsd4_dec)nfsd4_decode_notsupp,
[OP_SEEK] = (nfsd4_dec)nfsd4_decode_seek,
[OP_WRITE_SAME] = (nfsd4_dec)nfsd4_decode_notsupp,
+ [OP_CLONE] = (nfsd4_dec)nfsd4_decode_clone,
};

static inline bool
@@ -4292,6 +4312,7 @@ static nfsd4_enc nfsd4_enc_ops[] = {
[OP_READ_PLUS] = (nfsd4_enc)nfsd4_encode_noop,
[OP_SEEK] = (nfsd4_enc)nfsd4_encode_seek,
[OP_WRITE_SAME] = (nfsd4_enc)nfsd4_encode_noop,
+ [OP_CLONE] = (nfsd4_enc)nfsd4_encode_noop,
};

/*
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 994d66f..5411bf0 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -36,6 +36,7 @@
#endif /* CONFIG_NFSD_V3 */

#ifdef CONFIG_NFSD_V4
+#include "../internal.h"
#include "acl.h"
#include "idmap.h"
#endif /* CONFIG_NFSD_V4 */
@@ -498,6 +499,13 @@ __be32 nfsd4_set_nfs4_label(struct svc_rqst *rqstp, struct svc_fh *fhp,
}
#endif

+__be32 nfsd4_clone_file_range(struct file *src, u64 src_pos, struct file *dst,
+ u64 dst_pos, u64 count)
+{
+ return nfserrno(vfs_clone_file_range(src, src_pos, dst, dst_pos,
+ count));
+}
+
__be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct file *file, loff_t offset, loff_t len,
int flags)
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index fcfc48c..c11ba31 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -56,6 +56,8 @@ __be32 nfsd4_set_nfs4_label(struct svc_rqst *, struct svc_fh *,
struct xdr_netobj *);
__be32 nfsd4_vfs_fallocate(struct svc_rqst *, struct svc_fh *,
struct file *, loff_t, loff_t, int);
+__be32 nfsd4_clone_file_range(struct file *, u64, struct file *,
+ u64, u64);
#endif /* CONFIG_NFSD_V4 */
__be32 nfsd_create(struct svc_rqst *, struct svc_fh *,
char *name, int len, struct iattr *attrs,
diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
index ce7362c..d955481 100644
--- a/fs/nfsd/xdr4.h
+++ b/fs/nfsd/xdr4.h
@@ -491,6 +491,15 @@ struct nfsd4_fallocate {
u64 falloc_length;
};

+struct nfsd4_clone {
+ /* request */
+ stateid_t cl_src_stateid;
+ stateid_t cl_dst_stateid;
+ u64 cl_src_pos;
+ u64 cl_dst_pos;
+ u64 cl_count;
+};
+
struct nfsd4_seek {
/* request */
stateid_t seek_stateid;
@@ -555,6 +564,7 @@ struct nfsd4_op {
/* NFSv4.2 */
struct nfsd4_fallocate allocate;
struct nfsd4_fallocate deallocate;
+ struct nfsd4_clone clone;
struct nfsd4_seek seek;
} u;
struct nfs4_replay * replay;
diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index e7e7853..43aeabd 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -139,10 +139,10 @@ enum nfs_opnum4 {
Needs to be updated if more operations are defined in future.*/

#define FIRST_NFS4_OP OP_ACCESS
-#define LAST_NFS4_OP OP_WRITE_SAME
#define LAST_NFS40_OP OP_RELEASE_LOCKOWNER
#define LAST_NFS41_OP OP_RECLAIM_COMPLETE
-#define LAST_NFS42_OP OP_WRITE_SAME
+#define LAST_NFS42_OP OP_CLONE
+#define LAST_NFS4_OP LAST_NFS42_OP

enum nfsstat4 {
NFS4_OK = 0,
--
1.9.1


2015-11-26 18:55:53

by Christoph Hellwig

Subject: [PATCH 4/5] nfsd: Pass filehandle to nfs4_preprocess_stateid_op()

From: Anna Schumaker <[email protected]>

This will be needed so COPY can look up the saved_fh in addition to the
current_fh.
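
Why the filehandle has to become a parameter: a two-file operation such
as COPY (or CLONE in patch 5/5) is sent as PUTFH(source), SAVEFH,
PUTFH(destination), then the operation itself, so the source lives in the
saved filehandle and the destination in the current one. The toy model
below illustrates only that bookkeeping; every name in it is
hypothetical and it shares no code with nfsd.

```c
#include <assert.h>

/* Toy model: a filehandle is reduced to an inode number. */
struct model_fh {
	int ino;
};

/* Per-COMPOUND state: a current and a saved filehandle. */
struct model_cstate {
	struct model_fh current_fh;
	struct model_fh saved_fh;
};

static void op_putfh(struct model_cstate *cs, int ino)
{
	cs->current_fh.ino = ino;
}

static void op_savefh(struct model_cstate *cs)
{
	cs->saved_fh = cs->current_fh;
}

/* Like nfs4_preprocess_stateid_op() after this patch, the check is told
 * which filehandle to use instead of always reading current_fh. */
static int check_stateid(const struct model_fh *fhp)
{
	return fhp->ino;	/* stand-in for "the file to operate on" */
}

/* Two-file op: source from the saved fh, destination from the current fh. */
static void op_clone(struct model_cstate *cs, int *src, int *dst)
{
	*src = check_stateid(&cs->saved_fh);
	*dst = check_stateid(&cs->current_fh);
}
```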

Signed-off-by: Anna Schumaker <[email protected]>
---
fs/nfsd/nfs4proc.c | 16 +++++++++-------
fs/nfsd/nfs4state.c | 5 ++---
fs/nfsd/state.h | 4 ++--
3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index a9f096c..3ba10a3 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -774,8 +774,9 @@ nfsd4_read(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
clear_bit(RQ_SPLICE_OK, &rqstp->rq_flags);

/* check stateid */
- status = nfs4_preprocess_stateid_op(rqstp, cstate, &read->rd_stateid,
- RD_STATE, &read->rd_filp, &read->rd_tmp_file);
+ status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
+ &read->rd_stateid, RD_STATE,
+ &read->rd_filp, &read->rd_tmp_file);
if (status) {
dprintk("NFSD: nfsd4_read: couldn't process stateid!\n");
goto out;
@@ -921,7 +922,8 @@ nfsd4_setattr(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,

if (setattr->sa_iattr.ia_valid & ATTR_SIZE) {
status = nfs4_preprocess_stateid_op(rqstp, cstate,
- &setattr->sa_stateid, WR_STATE, NULL, NULL);
+ &cstate->current_fh, &setattr->sa_stateid,
+ WR_STATE, NULL, NULL);
if (status) {
dprintk("NFSD: nfsd4_setattr: couldn't process stateid!\n");
return status;
@@ -985,8 +987,8 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
if (write->wr_offset >= OFFSET_MAX)
return nfserr_inval;

- status = nfs4_preprocess_stateid_op(rqstp, cstate, stateid, WR_STATE,
- &filp, NULL);
+ status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
+ stateid, WR_STATE, &filp, NULL);
if (status) {
dprintk("NFSD: nfsd4_write: couldn't process stateid!\n");
return status;
@@ -1016,7 +1018,7 @@ nfsd4_fallocate(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
__be32 status = nfserr_notsupp;
struct file *file;

- status = nfs4_preprocess_stateid_op(rqstp, cstate,
+ status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
&fallocate->falloc_stateid,
WR_STATE, &file, NULL);
if (status != nfs_ok) {
@@ -1055,7 +1057,7 @@ nfsd4_seek(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
__be32 status;
struct file *file;

- status = nfs4_preprocess_stateid_op(rqstp, cstate,
+ status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
&seek->seek_stateid,
RD_STATE, &file, NULL);
if (status) {
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 6b800b5..df5dba6 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -4797,10 +4797,9 @@ nfs4_check_file(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfs4_stid *s,
*/
__be32
nfs4_preprocess_stateid_op(struct svc_rqst *rqstp,
- struct nfsd4_compound_state *cstate, stateid_t *stateid,
- int flags, struct file **filpp, bool *tmp_file)
+ struct nfsd4_compound_state *cstate, struct svc_fh *fhp,
+ stateid_t *stateid, int flags, struct file **filpp, bool *tmp_file)
{
- struct svc_fh *fhp = &cstate->current_fh;
struct inode *ino = d_inode(fhp->fh_dentry);
struct net *net = SVC_NET(rqstp);
struct nfsd_net *nn = net_generic(net, nfsd_net_id);
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 77fdf4d..99432b7 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -578,8 +578,8 @@ struct nfsd4_compound_state;
struct nfsd_net;

extern __be32 nfs4_preprocess_stateid_op(struct svc_rqst *rqstp,
- struct nfsd4_compound_state *cstate, stateid_t *stateid,
- int flags, struct file **filp, bool *tmp_file);
+ struct nfsd4_compound_state *cstate, struct svc_fh *fhp,
+ stateid_t *stateid, int flags, struct file **filp, bool *tmp_file);
__be32 nfsd4_lookup_stateid(struct nfsd4_compound_state *cstate,
stateid_t *stateid, unsigned char typemask,
struct nfs4_stid **s, struct nfsd_net *nn);
--
1.9.1


2015-11-26 18:55:38

by Christoph Hellwig

Subject: [PATCH 1/5] cifs: implement clone_file_range operation

And drop the fake support for the btrfs CLONE ioctl - SMB2 copies are
chunked and do not actually implement clone semantics!

Heavily based on a previous patch from Peng Tao.
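
For contrast with the clone semantics listed in patch 3/5, a caller of
copy_file_range(2) must be prepared for short copies and loop until
done. A sketch assuming glibc 2.27 or later for the copy_file_range()
wrapper; copy_all() is a hypothetical helper, and returning -ENODATA on
a premature end of the source is an arbitrary choice made for this
sketch.

```c
#define _GNU_SOURCE		/* copy_file_range() and off64_t in glibc */
#include <assert.h>
#include <errno.h>
#include <stdlib.h>		/* mkstemp(), used by the usage check */
#include <string.h>		/* memcmp(), used by the usage check */
#include <unistd.h>		/* copy_file_range(), pread(), write() */

/* Hypothetical helper: copy exactly len bytes, looping over short copies.
 * Returns 0 on success or a negative errno. */
static int copy_all(int fd_in, off64_t off_in, int fd_out, off64_t off_out,
		    size_t len)
{
	while (len > 0) {
		/* The kernel advances off_in/off_out by the bytes copied. */
		ssize_t n = copy_file_range(fd_in, &off_in, fd_out, &off_out,
					    len, 0);
		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -errno;
		}
		if (n == 0)		/* source ended before len bytes */
			return -ENODATA;
		len -= (size_t)n;
	}
	return 0;
}
```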

Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/cifs/cifsfs.c | 25 ++++++++++++++
fs/cifs/cifsfs.h | 4 ++-
fs/cifs/ioctl.c | 103 +++++++++++++++++++++++++++++++------------------------
3 files changed, 86 insertions(+), 46 deletions(-)

diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index cbc0f4b..ad7117a 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -914,6 +914,23 @@ const struct inode_operations cifs_symlink_inode_ops = {
#endif
};

+ssize_t cifs_file_copy_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, unsigned int flags)
+{
+ unsigned int xid;
+ int rc;
+
+ if (flags)
+ return -EOPNOTSUPP;
+
+ xid = get_xid();
+ rc = cifs_file_clone_range(xid, file_in, file_out, pos_in,
+ len, pos_out, true);
+ free_xid(xid);
+ return rc < 0 ? rc : len;
+}
+
const struct file_operations cifs_file_ops = {
.read_iter = cifs_loose_read_iter,
.write_iter = cifs_file_write_iter,
@@ -926,6 +943,7 @@ const struct file_operations cifs_file_ops = {
.splice_read = generic_file_splice_read,
.llseek = cifs_llseek,
.unlocked_ioctl = cifs_ioctl,
+ .copy_file_range = cifs_file_copy_range,
.setlease = cifs_setlease,
.fallocate = cifs_fallocate,
};
@@ -942,6 +960,8 @@ const struct file_operations cifs_file_strict_ops = {
.splice_read = generic_file_splice_read,
.llseek = cifs_llseek,
.unlocked_ioctl = cifs_ioctl,
+ .copy_file_range = cifs_file_copy_range,
+ .copy_file_range = cifs_file_copy_range,
.setlease = cifs_setlease,
.fallocate = cifs_fallocate,
};
@@ -958,6 +978,7 @@ const struct file_operations cifs_file_direct_ops = {
.mmap = cifs_file_mmap,
.splice_read = generic_file_splice_read,
.unlocked_ioctl = cifs_ioctl,
+ .copy_file_range = cifs_file_copy_range,
.llseek = cifs_llseek,
.setlease = cifs_setlease,
.fallocate = cifs_fallocate,
@@ -974,6 +995,7 @@ const struct file_operations cifs_file_nobrl_ops = {
.splice_read = generic_file_splice_read,
.llseek = cifs_llseek,
.unlocked_ioctl = cifs_ioctl,
+ .copy_file_range = cifs_file_copy_range,
.setlease = cifs_setlease,
.fallocate = cifs_fallocate,
};
@@ -989,6 +1011,7 @@ const struct file_operations cifs_file_strict_nobrl_ops = {
.splice_read = generic_file_splice_read,
.llseek = cifs_llseek,
.unlocked_ioctl = cifs_ioctl,
+ .copy_file_range = cifs_file_copy_range,
.setlease = cifs_setlease,
.fallocate = cifs_fallocate,
};
@@ -1004,6 +1027,7 @@ const struct file_operations cifs_file_direct_nobrl_ops = {
.mmap = cifs_file_mmap,
.splice_read = generic_file_splice_read,
.unlocked_ioctl = cifs_ioctl,
+ .copy_file_range = cifs_file_copy_range,
.llseek = cifs_llseek,
.setlease = cifs_setlease,
.fallocate = cifs_fallocate,
@@ -1014,6 +1038,7 @@ const struct file_operations cifs_dir_ops = {
.release = cifs_closedir,
.read = generic_read_dir,
.unlocked_ioctl = cifs_ioctl,
+ .copy_file_range = cifs_file_copy_range,
.llseek = generic_file_llseek,
};

diff --git a/fs/cifs/cifsfs.h b/fs/cifs/cifsfs.h
index c3cc160..797439b 100644
--- a/fs/cifs/cifsfs.h
+++ b/fs/cifs/cifsfs.h
@@ -131,7 +131,9 @@ extern int cifs_setxattr(struct dentry *, const char *, const void *,
extern ssize_t cifs_getxattr(struct dentry *, const char *, void *, size_t);
extern ssize_t cifs_listxattr(struct dentry *, char *, size_t);
extern long cifs_ioctl(struct file *filep, unsigned int cmd, unsigned long arg);
-
+extern int cifs_file_clone_range(unsigned int xid, struct file *src_file,
+ struct file *dst_file, u64 off, u64 len,
+ u64 destoff, bool dup_extents);
#ifdef CONFIG_CIFS_NFSD_EXPORT
extern const struct export_operations cifs_export_ops;
#endif /* CONFIG_CIFS_NFSD_EXPORT */
diff --git a/fs/cifs/ioctl.c b/fs/cifs/ioctl.c
index 35cf990..4f92f5c 100644
--- a/fs/cifs/ioctl.c
+++ b/fs/cifs/ioctl.c
@@ -34,73 +34,43 @@
#include "cifs_ioctl.h"
#include <linux/btrfs.h>

-static long cifs_ioctl_clone(unsigned int xid, struct file *dst_file,
- unsigned long srcfd, u64 off, u64 len, u64 destoff,
- bool dup_extents)
+int cifs_file_clone_range(unsigned int xid, struct file *src_file,
+ struct file *dst_file, u64 off, u64 len,
+ u64 destoff, bool dup_extents)
{
- int rc;
- struct cifsFileInfo *smb_file_target = dst_file->private_data;
+ struct inode *src_inode = file_inode(src_file);
struct inode *target_inode = file_inode(dst_file);
- struct cifs_tcon *target_tcon;
- struct fd src_file;
struct cifsFileInfo *smb_file_src;
- struct inode *src_inode;
+ struct cifsFileInfo *smb_file_target;
struct cifs_tcon *src_tcon;
+ struct cifs_tcon *target_tcon;
+ int rc;

cifs_dbg(FYI, "ioctl clone range\n");
- /* the destination must be opened for writing */
- if (!(dst_file->f_mode & FMODE_WRITE)) {
- cifs_dbg(FYI, "file target not open for write\n");
- return -EINVAL;
- }

- /* check if target volume is readonly and take reference */
- rc = mnt_want_write_file(dst_file);
- if (rc) {
- cifs_dbg(FYI, "mnt_want_write failed with rc %d\n", rc);
- return rc;
- }
-
- src_file = fdget(srcfd);
- if (!src_file.file) {
- rc = -EBADF;
- goto out_drop_write;
- }
-
- if (src_file.file->f_op->unlocked_ioctl != cifs_ioctl) {
- rc = -EBADF;
- cifs_dbg(VFS, "src file seems to be from a different filesystem type\n");
- goto out_fput;
- }
-
- if ((!src_file.file->private_data) || (!dst_file->private_data)) {
+ if (!src_file->private_data || !dst_file->private_data) {
rc = -EBADF;
cifs_dbg(VFS, "missing cifsFileInfo on copy range src file\n");
- goto out_fput;
+ goto out;
}

rc = -EXDEV;
smb_file_target = dst_file->private_data;
- smb_file_src = src_file.file->private_data;
+ smb_file_src = src_file->private_data;
src_tcon = tlink_tcon(smb_file_src->tlink);
target_tcon = tlink_tcon(smb_file_target->tlink);

/* check source and target on same server (or volume if dup_extents) */
if (dup_extents && (src_tcon != target_tcon)) {
cifs_dbg(VFS, "source and target of copy not on same share\n");
- goto out_fput;
+ goto out;
}

if (!dup_extents && (src_tcon->ses != target_tcon->ses)) {
cifs_dbg(VFS, "source and target of copy not on same server\n");
- goto out_fput;
+ goto out;
}

- src_inode = file_inode(src_file.file);
- rc = -EINVAL;
- if (S_ISDIR(src_inode->i_mode))
- goto out_fput;
-
/*
* Note: cifs case is easier than btrfs since server responsible for
* checks for proper open modes and file type and if it wants
@@ -136,6 +106,52 @@ out_unlock:
/* although unlocking in the reverse order from locking is not
strictly necessary here it is a little cleaner to be consistent */
unlock_two_nondirectories(src_inode, target_inode);
+out:
+ return rc;
+}
+
+static long cifs_ioctl_clone(unsigned int xid, struct file *dst_file,
+ unsigned long srcfd, u64 off, u64 len, u64 destoff,
+ bool dup_extents)
+{
+ int rc;
+ struct fd src_file;
+ struct inode *src_inode;
+
+ cifs_dbg(FYI, "ioctl clone range\n");
+ /* the destination must be opened for writing */
+ if (!(dst_file->f_mode & FMODE_WRITE)) {
+ cifs_dbg(FYI, "file target not open for write\n");
+ return -EINVAL;
+ }
+
+ /* check if target volume is readonly and take reference */
+ rc = mnt_want_write_file(dst_file);
+ if (rc) {
+ cifs_dbg(FYI, "mnt_want_write failed with rc %d\n", rc);
+ return rc;
+ }
+
+ src_file = fdget(srcfd);
+ if (!src_file.file) {
+ rc = -EBADF;
+ goto out_drop_write;
+ }
+
+ if (src_file.file->f_op->unlocked_ioctl != cifs_ioctl) {
+ rc = -EBADF;
+ cifs_dbg(VFS, "src file seems to be from a different filesystem type\n");
+ goto out_fput;
+ }
+
+ src_inode = file_inode(src_file.file);
+ rc = -EINVAL;
+ if (S_ISDIR(src_inode->i_mode))
+ goto out_fput;
+
+ rc = cifs_file_clone_range(xid, src_file.file, dst_file, off, len,
+ destoff, dup_extents);
+
out_fput:
fdput(src_file);
out_drop_write:
@@ -258,9 +274,6 @@ long cifs_ioctl(struct file *filep, unsigned int command, unsigned long arg)
case CIFS_IOC_COPYCHUNK_FILE:
rc = cifs_ioctl_clone(xid, filep, arg, 0, 0, 0, false);
break;
- case BTRFS_IOC_CLONE:
- rc = cifs_ioctl_clone(xid, filep, arg, 0, 0, 0, true);
- break;
case CIFS_IOC_SET_INTEGRITY:
if (pSMBFile == NULL)
break;
--
1.9.1
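The refactor above splits cifs_ioctl_clone() into an fd-level wrapper (write-mode check, mnt_want_write_file(), fdget(), filesystem-type and directory checks) and a core helper, cifs_file_clone_range(), that takes two struct file pointers. As a rough user-space illustration of the wrapper's job only, the hypothetical helper below applies comparable checks to two descriptors using fcntl() and fstat(). It is not kernel code. In particular, comparing st_dev and returning -EXDEV is an assumed stand-in for the patch's f_op comparison, which returns -EBADF.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>

/*
 * Hypothetical helper: user-space analogue of the descriptor checks that
 * cifs_ioctl_clone() keeps before calling cifs_file_clone_range().
 * Returns 0 if the pair looks usable, otherwise a negative errno.
 */
static int validate_clone_fds(int src_fd, int dst_fd)
{
	struct stat src_st, dst_st;
	int dst_flags;

	/* The destination must be open for writing (-EINVAL, as in cifs). */
	dst_flags = fcntl(dst_fd, F_GETFL);
	if (dst_flags < 0)
		return -EBADF;
	if ((dst_flags & O_ACCMODE) == O_RDONLY)
		return -EINVAL;

	/* The source must be a valid descriptor (fdget() failure -> -EBADF). */
	if (fstat(src_fd, &src_st) < 0)
		return -EBADF;

	/* Directories cannot be cloned (-EINVAL, as in cifs). */
	if (S_ISDIR(src_st.st_mode))
		return -EINVAL;

	/*
	 * Assumed stand-in for "same filesystem": cifs compares f_op and
	 * returns -EBADF; here we compare st_dev and return -EXDEV.
	 */
	if (fstat(dst_fd, &dst_st) < 0)
		return -EBADF;
	if (src_st.st_dev != dst_st.st_dev)
		return -EXDEV;

	return 0;
}
```

The order of the checks follows the wrapper in the patch: destination mode first, then the source descriptor, then the file type.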


2015-11-30 22:38:32

by J. Bruce Fields

Subject: Re: [PATCH 2/5] locks: new locks_mandatory_area calling convention

On Thu, Nov 26, 2015 at 07:50:56PM +0100, Christoph Hellwig wrote:
> Pass a loff_t end for the last byte instead of the 32-bit count
> parameter to allow full file clones even on 32-bit architectures.
> While we're at it also drop the pointless inode argument and simplify
> the read/write selection.
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> ---
> fs/locks.c | 22 +++++++++-------------
> fs/read_write.c | 5 ++---
> include/linux/fs.h | 28 +++++++++++++---------------
> 3 files changed, 24 insertions(+), 31 deletions(-)
>
> diff --git a/fs/locks.c b/fs/locks.c
> index 0d2b326..d503669 100644
> --- a/fs/locks.c
> +++ b/fs/locks.c
> @@ -1227,21 +1227,17 @@ int locks_mandatory_locked(struct file *file)
>
> /**
> * locks_mandatory_area - Check for a conflicting lock
> - * @read_write: %FLOCK_VERIFY_WRITE for exclusive access, %FLOCK_VERIFY_READ
> - * for shared
> - * @inode: the file to check
> * @filp: how the file was opened (if it was)
> - * @offset: start of area to check
> - * @count: length of area to check
> + * @start: first byte in the file to check
> + * @end: last byte in the file to check
> + * @write: %true if checking for write access
> *
> * Searches the inode's list of locks to find any POSIX locks which conflict.
> - * This function is called from rw_verify_area() and
> - * locks_verify_truncate().
> */
> -int locks_mandatory_area(int read_write, struct inode *inode,
> - struct file *filp, loff_t offset,
> - size_t count)
> +int locks_mandatory_area(struct file *filp, loff_t start, loff_t end,
> + bool write)
> {
> + struct inode *inode = file_inode(filp);
> struct file_lock fl;
> int error;
> bool sleep = false;
> @@ -1252,9 +1248,9 @@ int locks_mandatory_area(int read_write, struct inode *inode,
> fl.fl_flags = FL_POSIX | FL_ACCESS;
> if (filp && !(filp->f_flags & O_NONBLOCK))
> sleep = true;
> - fl.fl_type = (read_write == FLOCK_VERIFY_WRITE) ? F_WRLCK : F_RDLCK;
> - fl.fl_start = offset;
> - fl.fl_end = offset + count - 1;
> + fl.fl_type = write ? F_WRLCK : F_RDLCK;
> + fl.fl_start = start;
> + fl.fl_end = end;
>
> for (;;) {
> if (filp) {
> diff --git a/fs/read_write.c b/fs/read_write.c
> index c81ef39..48157dd 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -396,9 +396,8 @@ int rw_verify_area(int read_write, struct file *file, const loff_t *ppos, size_t
> }
>
> if (unlikely(inode->i_flctx && mandatory_lock(inode))) {
> - retval = locks_mandatory_area(
> - read_write == READ ? FLOCK_VERIFY_READ : FLOCK_VERIFY_WRITE,
> - inode, file, pos, count);
> + retval = locks_mandatory_area(file, pos, pos + count - 1,
> + read_write == READ ? false : true);
> if (retval < 0)
> return retval;
> }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 870a76e..e640f791 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2030,12 +2030,9 @@ extern struct kobject *fs_kobj;
>
> #define MAX_RW_COUNT (INT_MAX & PAGE_CACHE_MASK)
>
> -#define FLOCK_VERIFY_READ 1
> -#define FLOCK_VERIFY_WRITE 2
> -
> #ifdef CONFIG_FILE_LOCKING
> extern int locks_mandatory_locked(struct file *);
> -extern int locks_mandatory_area(int, struct inode *, struct file *, loff_t, size_t);
> +extern int locks_mandatory_area(struct file *, loff_t, loff_t, bool);
>
> /*
> * Candidates for mandatory locking have the setgid bit set
> @@ -2068,14 +2065,16 @@ static inline int locks_verify_truncate(struct inode *inode,
> struct file *filp,
> loff_t size)
> {
> - if (inode->i_flctx && mandatory_lock(inode))
> - return locks_mandatory_area(
> - FLOCK_VERIFY_WRITE, inode, filp,
> - size < inode->i_size ? size : inode->i_size,
> - (size < inode->i_size ? inode->i_size - size
> - : size - inode->i_size)
> - );
> - return 0;
> + if (!inode->i_flctx || !mandatory_lock(inode))
> + return 0;
> +
> + if (size < inode->i_size) {
> + return locks_mandatory_area(filp, size, inode->i_size - 1,
> + true);
> + } else {
> + return locks_mandatory_area(filp, inode->i_size, size - 1,
> + true);

I feel like these callers would be just slightly more self-documenting
if that last parameter was F_WRLCK instead of true.

But I could live with it either way, patch looks like an
improvement--ACK.

--b.

> + }
> }
>
> static inline int break_lease(struct inode *inode, unsigned int mode)
> @@ -2144,9 +2143,8 @@ static inline int locks_mandatory_locked(struct file *file)
> return 0;
> }
>
> -static inline int locks_mandatory_area(int rw, struct inode *inode,
> - struct file *filp, loff_t offset,
> - size_t count)
> +static inline int locks_mandatory_area(struct file *filp, loff_t start,
> + loff_t end, bool write)
> {
> return 0;
> }
> --
> 1.9.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
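To make the new calling convention concrete: both bounds are inclusive, so an (offset, count) pair becomes [pos, pos + count - 1], and a truncate covers the bytes between the old and new size. The sketch below mirrors that arithmetic from rw_verify_area() and locks_verify_truncate() in the patch, using int64_t as a stand-in for loff_t. struct byte_range and both helpers are hypothetical names. A 5 GiB range, which a 32-bit size_t count cannot represent, is an ordinary value here.

```c
#include <assert.h>
#include <stdint.h>

/* Inclusive byte range; int64_t stands in for the kernel's loff_t. */
struct byte_range {
	int64_t start;	/* first byte to check */
	int64_t end;	/* last byte to check */
};

/* rw_verify_area(): an I/O of count bytes at pos covers pos .. pos + count - 1. */
static struct byte_range range_from_count(int64_t pos, int64_t count)
{
	struct byte_range r;

	assert(count > 0);	/* a zero-length range has no last byte */
	r.start = pos;
	r.end = pos + count - 1;
	return r;
}

/*
 * locks_verify_truncate(): the bytes between the old size (i_size) and
 * the new size. If the sizes are equal, end < start: an empty range.
 */
static struct byte_range truncate_range(int64_t i_size, int64_t size)
{
	struct byte_range r;

	if (size < i_size) {		/* shrinking: bytes being cut off */
		r.start = size;
		r.end = i_size - 1;
	} else {			/* growing: bytes being added */
		r.start = i_size;
		r.end = size - 1;
	}
	return r;
}
```

For example, truncating a 4096-byte file to 1000 bytes checks bytes 1000 through 4095, and growing it back checks the same range.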

2015-11-30 22:56:30

by J. Bruce Fields

Subject: Re: vfs: move btrfs clone ioctls to common code

On Thu, Nov 26, 2015 at 07:50:54PM +0100, Christoph Hellwig wrote:
> This patch set moves the existing btrfs clone ioctls that other file
> system have started to implement to common code, and allows the NFS
> server to export this functionality to remote systems.
>
> This work is based originally on my NFS CLONE prototype, which reused
> code from Anna Schumaker's NFS COPY prototype, as well as various
> updates from Peng Tao to this code.

Looks good to me. (In particular: ACK to the locks.c and nfsd patches.
But, disclaimer, I haven't tried to test clone.)

--b.

>
> The patches are also available as a git branch and on gitweb:
>
> git://git.infradead.org/users/hch/pnfs.git clone-for-viro
> http://git.infradead.org/users/hch/pnfs.git/shortlog/refs/heads/clone-for-viro

2015-12-01 07:37:21

by Christoph Hellwig

Subject: Re: [PATCH 2/5] locks: new locks_mandatory_area calling convention

On Mon, Nov 30, 2015 at 05:38:30PM -0500, J. Bruce Fields wrote:
> > + if (size < inode->i_size) {
> > + return locks_mandatory_area(filp, size, inode->i_size - 1,
> > + true);
> > + } else {
> > + return locks_mandatory_area(filp, inode->i_size, size - 1,
> > + true);
>
> I feel like these callers would be just slightly more self-documenting
> if that last parameter was F_WRLCK instead of true.

Sure, I can change that for the next version.
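For comparison, user space expresses the same read/write choice with the lock type itself. The hypothetical helper below is a user-space analogue of the conflict check, not the kernel function. It probes the inclusive range [start, end] with fcntl(F_GETLK), passing F_RDLCK or F_WRLCK in the way suggested above. Note that struct flock takes a length, so the inclusive end has to be converted back, and that locks held by the calling process never count as conflicts.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical user-space analogue of locks_mandatory_area(): does a
 * POSIX lock held by another process conflict with a read (F_RDLCK) or
 * write (F_WRLCK) of the inclusive range [start, end]?
 * Returns 1 on conflict, 0 if none, -1 on error (errno set by fcntl).
 */
static int range_has_conflict(int fd, off_t start, off_t end, short type)
{
	struct flock fl = {
		.l_type = type,			/* F_RDLCK or F_WRLCK */
		.l_whence = SEEK_SET,
		.l_start = start,
		.l_len = end - start + 1,	/* flock wants a length */
	};

	if (fcntl(fd, F_GETLK, &fl) < 0)
		return -1;
	/* F_GETLK sets l_type to F_UNLCK when nothing would block us. */
	return fl.l_type != F_UNLCK;
}
```

A call such as range_has_conflict(fd, 0, 4095, F_WRLCK) reads as "check the first 4 KiB for write access" without a bare true/false at the call site.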


2015-12-01 17:09:28

by Chris Mason

Subject: Re: vfs: move btrfs clone ioctls to common code

On Thu, Nov 26, 2015 at 07:50:54PM +0100, Christoph Hellwig wrote:
> This patch set moves the existing btrfs clone ioctls that other file
> system have started to implement to common code, and allows the NFS
> server to export this functionality to remote systems.
>
> This work is based originally on my NFS CLONE prototype, which reused
> code from Anna Schumaker's NFS COPY prototype, as well as various
> updates from Peng Tao to this code.
>
> The patches are also available as a git branch and on gitweb:
>
> git://git.infradead.org/users/hch/pnfs.git clone-for-viro

Thanks Christoph, this looks fine to me.

-chris


2015-12-01 22:48:33

by Steve French

Subject: Re: vfs: move btrfs clone ioctls to common code

In the new API is there a way to distinguish between the two copy
offload behaviors:

1) FSCTL_DUPLICATE_EXTENTS (where the server file system increments a
refcount on blocks in the range)
and
2) FSCTL_COPYCHUNK (where the server does a server side copy of the
requested range, but does not necessarily use reflink, although in the
case of Samba on btrfs it implements it this way). In this case NTFS
will usually make a copy of the range requested rather than linking
the ranges.

For the former, cifs uses (or used, prior to this patch) the btrfs ioctl;
for the latter, it has a private ioctl (CIFS_IOC_COPYCHUNK_FILE). The
former requires the files to be on the same share (export), while the
latter only requires them to be on the same server. In common cases
(drag and drop in the file explorer on the desktop), the source and
target files would be on different mounts to the same server.

There is a whole-file clone operation, unimplemented in cifs.ko
(copy-on-write of a file, with a network API similar to hardlink), but it
looks like it is no longer supported by newer servers, perhaps because
there is more interest in the ODX mechanism for duplicating files, as in
https://msdn.microsoft.com/en-us/library/windows/desktop/hh848056(v=vs.85).aspx
across server farms for managing virtualization images. I need to
add the ODX copy offload mechanism to cifs.ko, but presumably it would
behave more like FSCTL_COPYCHUNK (i.e. not reflink the blocks).

The performance improvement from server-side copy offload is huge
whether or not reflink is done, so allowing the cp command (or common
user-space tools such as robocopy, which already does this on Windows)
to do fast copies is particularly important for network file systems.

On Thu, Nov 26, 2015 at 12:50 PM, Christoph Hellwig <[email protected]> wrote:
> This patch set moves the existing btrfs clone ioctls that other file
> system have started to implement to common code, and allows the NFS
> server to export this functionality to remote systems.
>
> This work is based originally on my NFS CLONE prototype, which reused
> code from Anna Schumaker's NFS COPY prototype, as well as various
> updates from Peng Tao to this code.
>
> The patches are also available as a git branch and on gitweb:
>
> git://git.infradead.org/users/hch/pnfs.git clone-for-viro
> http://git.infradead.org/users/hch/pnfs.git/shortlog/refs/heads/clone-for-viro
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-cifs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



--
Thanks,

Steve

2015-12-02 07:28:00

by Christoph Hellwig

Subject: Re: vfs: move btrfs clone ioctls to common code

Hi Steve,

we have two APIs in Linux:

- the copy_file_range syscall which just is a "do a copy by any means"
- the btrfs clone ioctls which have stricter semantics that very much
expect a reflink-like operation

I plan to also wire up copy_file_range to try the clone_file_range method
first if available to make life easier for file systems, but as there isn't
any test coverage for that I don't dare to actually submit it yet. I'll
send a compile tested only RFC for it when resending this series.
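From user space, the two APIs can be combined in the order Christoph plans for the kernel: try the strict clone first, then fall back to a copy. The sketch below assumes Linux 4.5 or later, where the btrfs clone ioctl is available to other file systems as FICLONE (same number as BTRFS_IOC_CLONE; defined by hand here in case the installed headers predate it), and glibc 2.27 or later for the copy_file_range() wrapper. copy_prefer_clone() is a hypothetical name, and the destination is assumed to be empty with both descriptors at offset 0.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef FICLONE
#define FICLONE _IOW(0x94, 9, int)	/* same value as BTRFS_IOC_CLONE */
#endif

/*
 * Hypothetical helper: copy all of src_fd into an empty dst_fd, both at
 * offset 0. Tries the strict clone first, then "copy by any means", then
 * a plain read/write loop. Each fallback is taken on any error, so the
 * last step reports genuine failures. Returns 0, or -1 with errno set.
 */
static int copy_prefer_clone(int src_fd, int dst_fd)
{
	char buf[65536];
	ssize_t n;

	/* 1. Reflink-style clone: only works if the fs can share extents. */
	if (ioctl(dst_fd, FICLONE, src_fd) == 0)
		return 0;

	/* 2. copy_file_range(): NULL offsets advance both file positions. */
	while ((n = copy_file_range(src_fd, NULL, dst_fd, NULL,
				    (size_t)1 << 30, 0)) > 0)
		;
	if (n == 0)
		return 0;			/* reached end of source */

	/* 3. Plain read/write, resuming wherever step 2 stopped. */
	while ((n = read(src_fd, buf, sizeof(buf))) > 0) {
		char *p = buf;

		while (n > 0) {
			ssize_t w = write(dst_fd, p, (size_t)n);

			if (w < 0)
				return -1;
			p += w;
			n -= w;
		}
	}
	return n < 0 ? -1 : 0;
}
```

On a file system without clone support, the FICLONE call is expected to fail and the copy proceeds through copy_file_range() or, failing that, read/write.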

2015-12-02 17:40:33

by Steve French

Subject: Re: vfs: move btrfs clone ioctls to common code

On Wed, Dec 2, 2015 at 1:27 AM, Christoph Hellwig <[email protected]> wrote:
> Hi Steve,
>
> we have two APIs in Linux:
>
> - the copy_file_range syscall which just is a "do a copy by any means"
> - the btrfs clone ioctls which have stricter semantics that very much
> expect a reflink-like operation

If the copy_file_range is allowed to use any offload mechanism then
cifs.ko could be changed as follows, to fallback among the three
possible mechanisms depending on what the target supports.

- send the fastest of the three choices, the ("reflink-like")
FSCTL_DUPLICATE_EXTENTS (there is a server fs capability that we check
at mount time that indicates whether it is supported). If it is not
supported, or if the source and target are on different shares
(exports), then fall back to
- send the ODX-style copy offload (when implemented). This is the only
one that could in theory support cross-server copies (rather than
requiring the source and target to be on the same server)
- (if the above aren't supported) send FSCTL_COPYCHUNK (currently
called via CIFS_IOC_COPYCHUNK_FILE)

For btrfs_ioc_clone_range (or similar), FSCTL_DUPLICATE_EXTENTS could
probably stay the same, since it is the only one of the three that
guarantees the use of reflinks.
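The proposed fallback order can be sketched as a table of mechanisms tried in turn, where "not supported" (-EOPNOTSUPP) or "different share" (-EXDEV) moves on to the next entry and anything else stops. Everything here is hypothetical: none of these functions exist in cifs.ko, the mechanisms are stubs driven by flags, and the ODX entry stands for a mechanism that is not implemented yet. The clone path deliberately uses only FSCTL_DUPLICATE_EXTENTS, since it is the one that guarantees reflinks.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Which (stub) mechanism ended up doing the work. */
enum mech { MECH_NONE, MECH_DUP_EXTENTS, MECH_ODX, MECH_COPYCHUNK };

struct offload_ctx {
	int dup_extents_ok;	/* server fs advertises block refcounting */
	int same_share;		/* source and target on the same share */
	int odx_ok;		/* ODX offload available (not in cifs.ko yet) */
	enum mech used;
};

/* Each mechanism returns 0, or a negative errno. */
typedef int (*offload_fn)(struct offload_ctx *c);

static int dup_extents(struct offload_ctx *c)
{
	if (!c->dup_extents_ok)
		return -EOPNOTSUPP;
	if (!c->same_share)
		return -EXDEV;
	c->used = MECH_DUP_EXTENTS;
	return 0;
}

static int odx(struct offload_ctx *c)
{
	if (!c->odx_ok)
		return -EOPNOTSUPP;
	c->used = MECH_ODX;
	return 0;
}

static int copychunk(struct offload_ctx *c)
{
	/* Needs only the same server; this stub does not check that. */
	c->used = MECH_COPYCHUNK;
	return 0;
}

/* Try each mechanism in order; -EOPNOTSUPP/-EXDEV mean "next one". */
static int try_in_order(const offload_fn *fns, size_t n, struct offload_ctx *c)
{
	int rc = -EOPNOTSUPP;

	for (size_t i = 0; i < n; i++) {
		rc = fns[i](c);
		if (rc != -EOPNOTSUPP && rc != -EXDEV)
			break;		/* success or a real error */
	}
	return rc;
}

/* copy_file_range: any mechanism will do, fastest first. */
static int sketch_copy_range(struct offload_ctx *c)
{
	static const offload_fn order[] = { dup_extents, odx, copychunk };

	return try_in_order(order, sizeof(order) / sizeof(order[0]), c);
}

/* Clone ioctls: only the mechanism that guarantees reflinks qualifies. */
static int sketch_clone_range(struct offload_ctx *c)
{
	return dup_extents(c);
}
```

With the source and target on different shares, the copy path ends up using copychunk, while the clone path returns -EXDEV instead of silently doing a non-reflink copy.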

If we want to, for Linux->Samba we could probably add a whole-file
clone (similar to hardlinks on the wire) to Samba and cifs.ko, if that
is useful (as opposed to the three mechanisms above, which copy ranges).

In addition, I noticed that the cp command has added various
optimizations for sparse files. I need to test those on cifs.ko and
update the ioctls for retrieving sparse ranges to make sure that they
work over SMB3 mounts, to optimize the case where the source file is
sparse and mostly empty.

> I plan to also wire up copy_file_range to try the clone_file_range method
> first if available to make life easier for file systems, but as there isn't
> any test coverage for that I don't dare to actually submit it yet. I'll
> send a compile tested only RFC for it when resending this series.



--
Thanks,

Steve

2015-12-03 10:30:37

by Christoph Hellwig

Subject: Re: vfs: move btrfs clone ioctls to common code

On Wed, Dec 02, 2015 at 11:40:13AM -0600, Steve French wrote:
> If the copy_file_range is allowed to use any offload mechanism then
> cifs.ko could be changed as follows, to fallback among the three
> possible mechanisms depending on what the target supports.

How reliable are the fallbacks? E.g. for clones we usually have alignment
restrictions that we'd need to communicate back, and cifs currently
doesn't have client side checks for those.

2015-12-03 19:29:12

by Steve French

Subject: Re: vfs: move btrfs clone ioctls to common code

On Thu, Dec 3, 2015 at 4:30 AM, Christoph Hellwig <[email protected]> wrote:
> On Wed, Dec 02, 2015 at 11:40:13AM -0600, Steve French wrote:
>> If the copy_file_range is allowed to use any offload mechanism then
>> cifs.ko could be changed as follows, to fallback among the three
>> possible mechanisms depending on what the target supports.
>
> How reliable are the fallbacks? E.g. for clones we usually have alignment
> restrictions that we'd need to communicate back, and cifs currently
> doesn't have client side checks for those.

I am not worried about fallback inconsistency for the current two
options: if block refcounting is not supported, we will know before we
issue the request, and the copychunk fallback has few restrictions.
When we add ODX there may be additional alignment restrictions, but we
won't know until we investigate further.

Although we can query alignment over CIFS and SMB3, it matters less
for a network file system than for a block device, and is unlikely to
be a restriction. The protocol does not limit the maximum chunk size,
but the server can return an error indicating the maximum chunk size
it supports, allowing the client to retry with the chunk size the
server requests. To match existing server behavior with reasonable
defaults for common servers, the cifs client uses 16 chunks of 1 MB
each for each FSCTL_SRV_COPYCHUNK_WRITE request sent on the wire.
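As a worked example of those defaults: one request covers 16 chunks of 1 MB, i.e. 16 MB, so a 100 MB copy takes 7 requests (100 / 16 = 6.25, rounded up). If a server instead reported a hypothetical 256 KB maximum chunk size, each request would cover 4 MB and the same copy would take 25 requests. The sketch below does that arithmetic. The function names are hypothetical, and 1 MB is taken as 1 MiB (1048576 bytes), which is an assumption.

```c
#include <assert.h>
#include <stdint.h>

/* Defaults from the mail; 1 MB is taken as 1 MiB here (assumption). */
#define SKETCH_CHUNKS_PER_REQ	16u
#define SKETCH_CHUNK_SIZE	(1024u * 1024u)

/* Number of FSCTL_SRV_COPYCHUNK_WRITE requests needed for len bytes. */
static uint64_t copychunk_requests(uint64_t len, uint32_t chunks_per_req,
				   uint32_t chunk_size)
{
	uint64_t per_req = (uint64_t)chunks_per_req * chunk_size;

	assert(per_req > 0);
	return len / per_req + (len % per_req != 0);	/* round up */
}

/*
 * When the server rejects a request and reports the largest chunk size
 * it supports, retry with that size (never with a larger one).
 */
static uint32_t retry_chunk_size(uint32_t current, uint32_t server_max)
{
	return server_max < current ? server_max : current;
}
```

Rounding up with division plus remainder, rather than (len + per_req - 1) / per_req, avoids overflow for lengths near the top of the 64-bit range.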

--
Thanks,

Steve