2015-10-23 19:32:21

by Anna Schumaker

Subject: [PATCH v7 0/4] VFS: In-kernel copy system call

Copy system calls came up during Plumbers a while ago, mostly because several
filesystems (including NFS and XFS) are currently working on copy acceleration
implementations. We haven't heard from Zach Brown in a while, so I volunteered
to push his patches upstream so individual filesystems don't need to keep
writing their own ioctls.

This posting removes the COPY_FR_REFLINK flag. Patch 3 still adds btrfs
support for copy_file_range() as a reflink, so if this behavior is undesirable
then the patch can be dropped.

Changes in v7:
- Remove COPY_FR_REFLINK flag.
- Fix build warning on ARM devices.
- Mention sparse file expansion in the man page.


Anna Schumaker (1):
vfs: Add vfs_copy_file_range() support for pagecache copies

Zach Brown (3):
vfs: add copy_file_range syscall and vfs helper
x86: add sys_copy_file_range to syscall tables
btrfs: add .copy_file_range file operation

arch/arm/include/uapi/asm/unistd.h | 1 +
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
fs/btrfs/ctree.h | 3 +
fs/btrfs/file.c | 1 +
fs/btrfs/ioctl.c | 91 ++++++++++++----------
fs/read_write.c | 133 +++++++++++++++++++++++++++++++++
include/linux/fs.h | 3 +
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/sys_ni.c | 1 +
11 files changed, 202 insertions(+), 40 deletions(-)

--
2.6.2



2015-10-23 19:32:23

by Anna Schumaker

Subject: [PATCH v7 1/4] vfs: add copy_file_range syscall and vfs helper

From: Zach Brown <[email protected]>

Add a copy_file_range() system call for offloading copies between
regular files.

This gives an interface to underlying layers of the storage stack which
can copy without reading and writing all the data. There are a few
candidates that should support copy offloading in the nearer term:

- btrfs shares extent references with its clone ioctl
- NFS has patches to add a COPY command which copies on the server
- SCSI has a family of XCOPY commands which copy in the device

This system call avoids the complexity of also accelerating the creation
of the destination file by operating on an existing destination file
descriptor, not a path.

Currently the high level vfs entry point limits copy offloading to files
on the same mount and super (and not in the same file). This can be
relaxed if we get implementations which can copy between file systems
safely.

Signed-off-by: Zach Brown <[email protected]>
[Anna Schumaker: Change -EINVAL to -EBADF during file verification,
Change flags parameter from int to unsigned int,
Add function to include/linux/syscalls.h,
Check copy len after file open mode,
Don't forbid ranges inside the same file,
Use rw_verify_area() to verify ranges,
Use file_out rather than file_in]
Signed-off-by: Anna Schumaker <[email protected]>
---
v7:
- Remove COPY_FR_REFLINK flag
---
arch/arm/include/uapi/asm/unistd.h | 1 +
fs/read_write.c | 117 +++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 3 +
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/sys_ni.c | 1 +
6 files changed, 128 insertions(+), 1 deletion(-)

diff --git a/arch/arm/include/uapi/asm/unistd.h b/arch/arm/include/uapi/asm/unistd.h
index 7a2a32a..a8f0219 100644
--- a/arch/arm/include/uapi/asm/unistd.h
+++ b/arch/arm/include/uapi/asm/unistd.h
@@ -416,6 +416,7 @@
#define __NR_execveat (__NR_SYSCALL_BASE+387)
#define __NR_userfaultfd (__NR_SYSCALL_BASE+388)
#define __NR_membarrier (__NR_SYSCALL_BASE+389)
+#define __NR_copy_file_range (__NR_SYSCALL_BASE+390)

/*
* The following SWIs are ARM private.
diff --git a/fs/read_write.c b/fs/read_write.c
index 819ef3f..89c9e65 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
#include <linux/pagemap.h>
#include <linux/splice.h>
#include <linux/compat.h>
+#include <linux/mount.h>
#include "internal.h"

#include <asm/uaccess.h>
@@ -1327,3 +1328,119 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
return do_sendfile(out_fd, in_fd, NULL, count, 0);
}
#endif
+
+/*
+ * copy_file_range() differs from regular file read and write in that it
+ * specifically allows return partial success. When it does so is up to
+ * the copy_file_range method.
+ */
+ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, unsigned int flags)
+{
+ struct inode *inode_in = file_inode(file_in);
+ struct inode *inode_out = file_inode(file_out);
+ ssize_t ret;
+
+ if (flags != 0)
+ return -EINVAL;
+
+ /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT */
+ ret = rw_verify_area(READ, file_in, &pos_in, len);
+ if (ret >= 0)
+ ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+ if (ret < 0)
+ return ret;
+
+ if (!(file_in->f_mode & FMODE_READ) ||
+ !(file_out->f_mode & FMODE_WRITE) ||
+ (file_out->f_flags & O_APPEND) ||
+ !file_out->f_op || !file_out->f_op->copy_file_range)
+ return -EBADF;
+
+ /* this could be relaxed once a method supports cross-fs copies */
+ if (inode_in->i_sb != inode_out->i_sb ||
+ file_in->f_path.mnt != file_out->f_path.mnt)
+ return -EXDEV;
+
+ if (len == 0)
+ return 0;
+
+ ret = mnt_want_write_file(file_out);
+ if (ret)
+ return ret;
+
+ ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+ len, flags);
+ if (ret > 0) {
+ fsnotify_access(file_in);
+ add_rchar(current, ret);
+ fsnotify_modify(file_out);
+ add_wchar(current, ret);
+ }
+ inc_syscr(current);
+ inc_syscw(current);
+
+ mnt_drop_write_file(file_out);
+
+ return ret;
+}
+EXPORT_SYMBOL(vfs_copy_file_range);
+
+SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
+ int, fd_out, loff_t __user *, off_out,
+ size_t, len, unsigned int, flags)
+{
+ loff_t pos_in;
+ loff_t pos_out;
+ struct fd f_in;
+ struct fd f_out;
+ ssize_t ret;
+
+ f_in = fdget(fd_in);
+ f_out = fdget(fd_out);
+ if (!f_in.file || !f_out.file) {
+ ret = -EBADF;
+ goto out;
+ }
+
+ ret = -EFAULT;
+ if (off_in) {
+ if (copy_from_user(&pos_in, off_in, sizeof(loff_t)))
+ goto out;
+ } else {
+ pos_in = f_in.file->f_pos;
+ }
+
+ if (off_out) {
+ if (copy_from_user(&pos_out, off_out, sizeof(loff_t)))
+ goto out;
+ } else {
+ pos_out = f_out.file->f_pos;
+ }
+
+ ret = vfs_copy_file_range(f_in.file, pos_in, f_out.file, pos_out, len,
+ flags);
+ if (ret > 0) {
+ pos_in += ret;
+ pos_out += ret;
+
+ if (off_in) {
+ if (copy_to_user(off_in, &pos_in, sizeof(loff_t)))
+ ret = -EFAULT;
+ } else {
+ f_in.file->f_pos = pos_in;
+ }
+
+ if (off_out) {
+ if (copy_to_user(off_out, &pos_out, sizeof(loff_t)))
+ ret = -EFAULT;
+ } else {
+ f_out.file->f_pos = pos_out;
+ }
+ }
+out:
+ fdput(f_in);
+ fdput(f_out);
+ return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 72d8a84..6220307 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1642,6 +1642,7 @@ struct file_operations {
#ifndef CONFIG_MMU
unsigned (*mmap_capabilities)(struct file *);
#endif
+ ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
};

struct inode_operations {
@@ -1695,6 +1696,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
unsigned long, loff_t *);
extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
unsigned long, loff_t *);
+extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
+ loff_t, size_t, unsigned int);

struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a460e2e..290205f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -886,5 +886,8 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
const char __user *const __user *envp, int flags);

asmlinkage long sys_membarrier(int cmd, int flags);
+asmlinkage long sys_copy_file_range(int fd_in, loff_t __user *off_in,
+ int fd_out, loff_t __user *off_out,
+ size_t len, unsigned int flags);

#endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index ee12400..2d79155 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -713,9 +713,11 @@ __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
__SYSCALL(__NR_userfaultfd, sys_userfaultfd)
#define __NR_membarrier 283
__SYSCALL(__NR_membarrier, sys_membarrier)
+#define __NR_copy_file_range 284
+__SYSCALL(__NR_copy_file_range, sys_copy_file_range)

#undef __NR_syscalls
-#define __NR_syscalls 284
+#define __NR_syscalls 285

/*
* All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index a02decf..83c5c82 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -174,6 +174,7 @@ cond_syscall(sys_setfsuid);
cond_syscall(sys_setfsgid);
cond_syscall(sys_capget);
cond_syscall(sys_capset);
+cond_syscall(sys_copy_file_range);

/* arch-specific weak syscall entries */
cond_syscall(sys_pciconfig_read);
--
2.6.2


2015-10-23 19:32:28

by Anna Schumaker

Subject: [PATCH v7 4/4] vfs: Add vfs_copy_file_range() support for pagecache copies

This allows us to have an in-kernel copy mechanism that avoids frequent
switches between kernel and user space. This is especially useful so
NFSD can support server-side copies. Let's first check if the filesystem
supports any kind of copy acceleration, but fall back on copying through the
pagecache.

I moved the rw_verify_area() calls into the fallback code since some
filesystems can handle reflinking a large range.

Signed-off-by: Anna Schumaker <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Padraig Brady <[email protected]>
---
v7:
- Remove checks for COPY_FR_REFLINK
---
fs/read_write.c | 36 ++++++++++++++++++++++++++----------
1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 89c9e65..d2da7e4 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1329,6 +1329,24 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
}
#endif

+static ssize_t vfs_copy_fr_copy(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len)
+{
+ ssize_t ret = rw_verify_area(READ, file_in, &pos_in, len);
+
+ if (ret >= 0) {
+ len = ret;
+ ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+ if (ret >= 0)
+ len = ret;
+ }
+ if (ret < 0)
+ return ret;
+
+ return do_splice_direct(file_in, &pos_in, file_out, &pos_out, len, 0);
+}
+
/*
* copy_file_range() differs from regular file read and write in that it
* specifically allows return partial success. When it does so is up to
@@ -1345,17 +1363,10 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (flags != 0)
return -EINVAL;

- /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT */
- ret = rw_verify_area(READ, file_in, &pos_in, len);
- if (ret >= 0)
- ret = rw_verify_area(WRITE, file_out, &pos_out, len);
- if (ret < 0)
- return ret;
-
if (!(file_in->f_mode & FMODE_READ) ||
!(file_out->f_mode & FMODE_WRITE) ||
(file_out->f_flags & O_APPEND) ||
- !file_out->f_op || !file_out->f_op->copy_file_range)
+ !file_out->f_op)
return -EBADF;

/* this could be relaxed once a method supports cross-fs copies */
@@ -1370,8 +1381,13 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (ret)
return ret;

- ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
- len, flags);
+ ret = -EOPNOTSUPP;
+ if (file_out->f_op->copy_file_range)
+ ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
+ pos_out, len, flags);
+ if (ret == -EOPNOTSUPP)
+ ret = vfs_copy_fr_copy(file_in, pos_in, file_out, pos_out, len);
+
if (ret > 0) {
fsnotify_access(file_in);
add_rchar(current, ret);
--
2.6.2


2015-10-23 19:32:27

by Anna Schumaker

Subject: [PATCH v7 3/4] btrfs: add .copy_file_range file operation

From: Zach Brown <[email protected]>

This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_file_range file operation can call the core loop that copies file
data extent items.

The extent copying loop is lifted up into its own function. It retains
the core btrfs error checks that should be shared.

Signed-off-by: Zach Brown <[email protected]>
[Anna Schumaker: Make flags an unsigned int]
Signed-off-by: Anna Schumaker <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
---
v7:
- Remove COPY_FR_REFLINK check
---
fs/btrfs/ctree.h | 3 ++
fs/btrfs/file.c | 1 +
fs/btrfs/ioctl.c | 91 ++++++++++++++++++++++++++++++++------------------------
3 files changed, 56 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 938efe3..0046567 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3996,6 +3996,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
loff_t pos, size_t write_bytes,
struct extent_state **cached);
int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, unsigned int flags);

/* tree-defrag.c */
int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b823fac..b05449c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2816,6 +2816,7 @@ const struct file_operations btrfs_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = btrfs_ioctl,
#endif
+ .copy_file_range = btrfs_copy_file_range,
};

void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 3e3e613..ad75e48 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3727,17 +3727,16 @@ out:
return ret;
}

-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
- u64 off, u64 olen, u64 destoff)
+static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
+ u64 off, u64 olen, u64 destoff)
{
struct inode *inode = file_inode(file);
+ struct inode *src = file_inode(file_src);
struct btrfs_root *root = BTRFS_I(inode)->root;
- struct fd src_file;
- struct inode *src;
int ret;
u64 len = olen;
u64 bs = root->fs_info->sb->s_blocksize;
- int same_inode = 0;
+ int same_inode = src == inode;

/*
* TODO:
@@ -3750,49 +3749,20 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
* be either compressed or non-compressed.
*/

- /* the destination must be opened for writing */
- if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
- return -EINVAL;
-
if (btrfs_root_readonly(root))
return -EROFS;

- ret = mnt_want_write_file(file);
- if (ret)
- return ret;
-
- src_file = fdget(srcfd);
- if (!src_file.file) {
- ret = -EBADF;
- goto out_drop_write;
- }
-
- ret = -EXDEV;
- if (src_file.file->f_path.mnt != file->f_path.mnt)
- goto out_fput;
-
- src = file_inode(src_file.file);
-
- ret = -EINVAL;
- if (src == inode)
- same_inode = 1;
-
- /* the src must be open for reading */
- if (!(src_file.file->f_mode & FMODE_READ))
- goto out_fput;
+ if (file_src->f_path.mnt != file->f_path.mnt ||
+ src->i_sb != inode->i_sb)
+ return -EXDEV;

/* don't make the dst file partly checksummed */
if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
- goto out_fput;
+ return -EINVAL;

- ret = -EISDIR;
if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
- goto out_fput;
-
- ret = -EXDEV;
- if (src->i_sb != inode->i_sb)
- goto out_fput;
+ return -EISDIR;

if (!same_inode) {
btrfs_double_inode_lock(src, inode);
@@ -3869,6 +3839,49 @@ out_unlock:
btrfs_double_inode_unlock(src, inode);
else
mutex_unlock(&src->i_mutex);
+ return ret;
+}
+
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, unsigned int flags)
+{
+ ssize_t ret;
+
+ ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
+ if (ret == 0)
+ ret = len;
+ return ret;
+}
+
+static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
+ u64 off, u64 olen, u64 destoff)
+{
+ struct fd src_file;
+ int ret;
+
+ /* the destination must be opened for writing */
+ if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
+ return -EINVAL;
+
+ ret = mnt_want_write_file(file);
+ if (ret)
+ return ret;
+
+ src_file = fdget(srcfd);
+ if (!src_file.file) {
+ ret = -EBADF;
+ goto out_drop_write;
+ }
+
+ /* the src must be open for reading */
+ if (!(src_file.file->f_mode & FMODE_READ)) {
+ ret = -EINVAL;
+ goto out_fput;
+ }
+
+ ret = btrfs_clone_files(file, src_file.file, off, olen, destoff);
+
out_fput:
fdput(src_file);
out_drop_write:
--
2.6.2


2015-10-23 19:32:30

by Anna Schumaker

Subject: [PATCH v7 5/4] copy_file_range.2: New page documenting copy_file_range()

copy_file_range() is a new system call for copying ranges of data
completely in the kernel. This gives filesystems an opportunity to
implement some kind of "copy acceleration", such as reflinks or
server-side-copy (in the case of NFS).

Signed-off-by: Anna Schumaker <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
---
v7:
- Remove COPY_FR_REFLINK
- Fix extra punctuation in the license
- Make a note about sparse file expansion
---
man2/copy_file_range.2 | 199 +++++++++++++++++++++++++++++++++++++++++++++++++
man2/splice.2 | 1 +
2 files changed, 200 insertions(+)
create mode 100644 man2/copy_file_range.2

diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
new file mode 100644
index 0000000..d619e37
--- /dev/null
+++ b/man2/copy_file_range.2
@@ -0,0 +1,199 @@
+.\"This manpage is Copyright (C) 2015 Anna Schumaker <[email protected]>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of
+.\" this manual under the conditions for verbatim copying, provided that
+.\" the entire resulting derived work is distributed under the terms of
+.\" a permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date. The author(s) assume
+.\" no responsibility for errors or omissions, or for damages resulting
+.\" from the use of the information contained herein. The author(s) may
+.\" not have taken the same level of care in the production of this
+.\" manual, which is licensed free of charge, as they might when working
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH COPY 2 2015-10-16 "Linux" "Linux Programmer's Manual"
+.SH NAME
+copy_file_range \- Copy a range of data from one file to another
+.SH SYNOPSIS
+.nf
+.B #include <sys/syscall.h>
+.B #include <unistd.h>
+
+.BI "ssize_t copy_file_range(int " fd_in ", loff_t *" off_in ", int " fd_out ",
+.BI " loff_t *" off_out ", size_t " len \
+", unsigned int " flags );
+.fi
+.SH DESCRIPTION
+The
+.BR copy_file_range ()
+system call performs an in-kernel copy between two file descriptors
+without the additional cost of transferring data from the kernel to userspace
+and then back into the kernel.
+It copies up to
+.I len
+bytes of data from file descriptor
+.I fd_in
+to file descriptor
+.IR fd_out ,
+overwriting any data that exists within the requested range of the target file.
+
+The following semantics apply for
+.IR off_in ,
+and similar statements apply to
+.IR off_out :
+.IP * 3
+If
+.I off_in
+is NULL, then bytes are read from
+.I fd_in
+starting from the current file offset, and the offset is
+adjusted by the number of bytes copied.
+.IP *
+If
+.I off_in
+is not NULL, then
+.I off_in
+must point to a buffer that specifies the starting
+offset where bytes from
+.I fd_in
+will be read. The current file offset of
+.I fd_in
+is not changed, but
+.I off_in
+is adjusted appropriately.
+.PP
+
+The
+.I flags
+argument must be set to 0.
+.SH RETURN VALUE
+Upon successful completion,
+.BR copy_file_range ()
+will return the number of bytes copied between files.
+This could be less than the length originally requested.
+
+On error,
+.BR copy_file_range ()
+returns \-1 and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EBADF
+One or more file descriptors are not valid; or
+.I fd_in
+is not open for reading; or
+.I fd_out
+is not open for writing.
+.TP
+.B EINVAL
+Requested range extends beyond the end of the source file; or the
+.I flags
+argument is not 0.
+.TP
+.B EIO
+A low level I/O error occurred while copying.
+.TP
+.B ENOMEM
+Out of memory.
+.TP
+.B ENOSPC
+There is not enough space on the target filesystem to complete the copy.
+.TP
+.B EXDEV
+.IR file_in " and " file_out
+are not on the same mounted filesystem.
+.SH VERSIONS
+The
+.BR copy_file_range ()
+system call first appeared in Linux 4.4.
+.SH CONFORMING TO
+The
+.BR copy_file_range ()
+system call is a nonstandard Linux extension.
+.SH NOTES
+If
+.I file_in
+is a sparse file, then
+.BR copy_file_range ()
+may expand any holes existing in the requested range.
+Users may benefit from calling
+.BR copy_file_range ()
+in a loop, and using
+.BR lseek (2)
+to find the locations of data segments.
+.SH EXAMPLE
+.nf
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+loff_t copy_file_range(int fd_in, loff_t *off_in, int fd_out,
+ loff_t *off_out, size_t len, unsigned int flags)
+{
+ return syscall(__NR_copy_file_range, fd_in, off_in, fd_out,
+ off_out, len, flags);
+}
+
+int main(int argc, char **argv)
+{
+ int fd_in, fd_out;
+ struct stat stat;
+ loff_t len, ret;
+ char buf[2];
+
+ if (argc != 3) {
+ fprintf(stderr, "Usage: %s <source> <destination>\\n", argv[0]);
+ exit(EXIT_FAILURE);
+ }
+
+ fd_in = open(argv[1], O_RDONLY);
+ if (fd_in == \-1) {
+ perror("open (argv[1])");
+ exit(EXIT_FAILURE);
+ }
+
+ if (fstat(fd_in, &stat) == \-1) {
+ perror("fstat");
+ exit(EXIT_FAILURE);
+ }
+ len = stat.st_size;
+
+ fd_out = open(argv[2], O_CREAT|O_WRONLY|O_TRUNC, 0644);
+ if (fd_out == \-1) {
+ perror("open (argv[2])");
+ exit(EXIT_FAILURE);
+ }
+
+ do {
+ ret = copy_file_range(fd_in, NULL, fd_out, NULL, len, 0);
+ if (ret == \-1) {
+ perror("copy_file_range");
+ exit(EXIT_FAILURE);
+ }
+
+ len \-= ret;
+ } while (len > 0);
+
+ close(fd_in);
+ close(fd_out);
+ exit(EXIT_SUCCESS);
+}
+.fi
+.SH SEE ALSO
+.BR splice (2)
diff --git a/man2/splice.2 b/man2/splice.2
index b9b4f42..5c162e0 100644
--- a/man2/splice.2
+++ b/man2/splice.2
@@ -238,6 +238,7 @@ only pointers are copied, not the pages of the buffer.
See
.BR tee (2).
.SH SEE ALSO
+.BR copy_file_range (2),
.BR sendfile (2),
.BR tee (2),
.BR vmsplice (2)
--
2.6.2


2015-10-23 19:32:24

by Anna Schumaker

Subject: [PATCH v7 2/4] x86: add sys_copy_file_range to syscall tables

From: Zach Brown <[email protected]>

Add sys_copy_file_range to the x86 syscall tables.

Signed-off-by: Zach Brown <[email protected]>
[Anna Schumaker: Update syscall number in syscall_32.tbl]
Signed-off-by: Anna Schumaker <[email protected]>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 7663c45..0531270 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -382,3 +382,4 @@
373 i386 shutdown sys_shutdown
374 i386 userfaultfd sys_userfaultfd
375 i386 membarrier sys_membarrier
+376 i386 copy_file_range sys_copy_file_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 278842f..03a9396 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -331,6 +331,7 @@
322 64 execveat stub_execveat
323 common userfaultfd sys_userfaultfd
324 common membarrier sys_membarrier
+325 common copy_file_range sys_copy_file_range

#
# x32-specific system call numbers start at 512 to avoid cache impact
--
2.6.2


2015-10-24 06:21:18

by Christoph Hellwig

Subject: Re: [PATCH v7 0/4] VFS: In-kernel copy system call

Thanks Anna,

the whole series looks good to me,

Reviewed-by: Christoph Hellwig <[email protected]>

2015-10-24 12:02:25

by Pádraig Brady

Subject: Re: [PATCH v7 5/4] copy_file_range.2: New page documenting copy_file_range()

On 23/10/15 20:32, Anna Schumaker wrote:
> + len = stat.st_size;
> +
> + fd_out = open(argv[2], O_CREAT|O_WRONLY|O_TRUNC, 0644);
> + if (fd_out == \-1) {
> + perror("open (argv[2])");
> + exit(EXIT_FAILURE);
> + }
> +
> + do {
> + ret = copy_file_range(fd_in, NULL, fd_out, NULL, len, 0);
> + if (ret == \-1) {
> + perror("copy_file_range");
> + exit(EXIT_FAILURE);
> + }
> +
> + len \-= ret;
> + } while (len > 0);

Is this an infinite loop if the source file shrinks before the copy completes?
Perhaps this should be: while (len && ret);
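
A minimal sketch of that termination condition, assuming the copy call returns 0 once the source has no more data (for example, because the file was truncated after fstat()). The names copy_fn, copy_all, fake_src and fake_copy are hypothetical; fake_copy only stands in for copy_file_range() so the loop can be exercised without the new syscall.

```c
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical callback standing in for copy_file_range(). */
typedef ssize_t (*copy_fn)(void *ctx, size_t len);

/*
 * Copy up to len bytes. Stops on error, or when the callback reports
 * 0 bytes copied (source exhausted), so a shrinking source cannot
 * make the loop spin forever. Returns bytes copied, or -1 on error.
 */
static ssize_t copy_all(copy_fn copy, void *ctx, size_t len)
{
	size_t total = 0;

	while (len > 0) {
		ssize_t ret = copy(ctx, len);

		if (ret < 0)
			return -1;
		if (ret == 0)
			break;	/* nothing left to copy */
		total += (size_t)ret;
		len -= (size_t)ret;
	}
	return (ssize_t)total;
}

/* Stand-in source: holds `remaining` bytes, hands out at most 4 per call. */
struct fake_src {
	size_t remaining;
};

static ssize_t fake_copy(void *ctx, size_t len)
{
	struct fake_src *src = ctx;
	size_t n = len < 4 ? len : 4;

	if (n > src->remaining)
		n = src->remaining;
	src->remaining -= n;
	return (ssize_t)n;
}
```

With a 10-byte stand-in source and a 16-byte request, copy_all() returns after the first zero-length result instead of waiting for len to reach 0.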

Otherwise this set looks good.

I'm a bit worried about the sparse expansion and default reflinking
which might preclude cp(1) from using this call in most cases, but I will
test and try to use it. coreutils has heuristics for determining if files
are remote, which we might use to restrict to that use case.

thanks,
Pádraig.

2015-10-24 16:52:40

by Eric Biggers

Subject: Re: [PATCH v7 0/4] VFS: In-kernel copy system call

A few comments:

> if (!(file_in->f_mode & FMODE_READ) ||
> !(file_out->f_mode & FMODE_WRITE) ||
> (file_out->f_flags & O_APPEND) ||
> !file_out->f_op)
> return -EBADF;

Isn't 'f_op' always non-NULL?

If the destination file cannot be append-only, shouldn't this be documented?

> if (inode_in->i_sb != inode_out->i_sb ||
> file_in->f_path.mnt != file_out->f_path.mnt)
> return -EXDEV;

Doesn't the same mount already imply the same superblock?

> /*
> * copy_file_range() differs from regular file read and write in that it
> * specifically allows return partial success. When it does so is up to
> * the copy_file_range method.
> */

What does this mean? I thought that read() and write() can also return partial
success.

> f_out = fdget(fd_out);
> if (!f_in.file || !f_out.file) {
> ret = -EBADF;
> goto out;
> }

This looked wrong at first because it may call fdput() on a 'struct fd' that was
not successfully acquired, but it looks like it works currently because of how
the FDPUT_FPUT flag is used. It may be a good idea to write it the "obvious"
way, though (use separate labels depending on which fdput()s need to happen).
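
The separate-label pattern can be sketched in userspace as follows. This is only an analogy, not kernel code: struct res, acquire(), release() and use_both() are hypothetical stand-ins for struct fd, fdget() and fdput(), and the release counter exists only so the unwinding can be observed.

```c
#include <stddef.h>

/* Hypothetical resource standing in for a 'struct fd'. */
struct res {
	int held;
};

static int released;	/* counts release() calls, for illustration only */

static int acquire(struct res *r, int ok)
{
	r->held = ok;
	return ok ? 0 : -1;
}

static void release(struct res *r)
{
	r->held = 0;
	released++;
}

/*
 * Acquire two resources and unwind with one label per resource, so
 * release() is only ever called on something that was acquired.
 */
static int use_both(int ok_in, int ok_out)
{
	struct res in, out;
	int ret;

	ret = acquire(&in, ok_in);
	if (ret)
		goto out;

	ret = acquire(&out, ok_out);
	if (ret)
		goto out_put_in;

	/* ... the actual work would happen here ... */
	ret = 0;

	release(&out);
out_put_in:
	release(&in);
out:
	return ret;
}
```

Each failure path jumps to the label that undoes exactly what has been acquired so far, which is what makes the cleanup easy to audit.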


Other questions:

Should FMODE_PREAD or FMODE_PWRITE access be checked if the user specifies their
own 'off_in' or 'off_out', respectively?

What is supposed to happen if the user passes a file descriptor to a
non-regular file, such as a block device or char device?

If the 'in' file has fewer than 'len' bytes remaining until EOF, what is the
expected behavior? It looks like the btrfs implementation has different
behavior from the pagecache implementation.

It appears the btrfs implementation has alignment restrictions --- where is this
documented and how will users know what alignment to use?

Are copies within the same file permitted and can the ranges overlap? The man
page doesn't say.

It looks like the initial patch defines __NR_copy_file_range for the ARM
architecture but doesn't actually hook that system call up for ARM; why is that?

2015-10-25 05:23:29

by Andreas Dilger

Subject: Re: [PATCH v7 0/4] VFS: In-kernel copy system call


> On Oct 24, 2015, at 10:52 AM, Eric Biggers <[email protected]> wrote:
>
> A few comments:
>
>> if (!(file_in->f_mode & FMODE_READ) ||
>> !(file_out->f_mode & FMODE_WRITE) ||
>> (file_out->f_flags & O_APPEND) ||
>> !file_out->f_op)
>> return -EBADF;
>
> Isn't 'f_op' always non-NULL?
>
> If the destination file cannot be append-only, shouldn't this be documented?

Actually, wouldn't O_APPEND only be a problem if the target file wasn't
being appended to? In other words, if the target i_size == start offset
then it should be possible to use the copy syscall on an O_APPEND file.

Cheers, Andreas

>> if (inode_in->i_sb != inode_out->i_sb ||
>> file_in->f_path.mnt != file_out->f_path.mnt)
>> return -EXDEV;
>
> Doesn't the same mount already imply the same superblock?
>
>> /*
>> * copy_file_range() differs from regular file read and write in that it
>> * specifically allows return partial success. When it does so is up to
>> * the copy_file_range method.
>> */
>
> What does this mean? I thought that read() and write() can also return partial
> success.
>
>> f_out = fdget(fd_out);
>> if (!f_in.file || !f_out.file) {
>> ret = -EBADF;
>> goto out;
>> }
>
> This looked wrong at first because it may call fdput() on a 'struct fd' that was
> not successfully acquired, but it looks like it works currently because of how
> the FDPUT_FPUT flag is used. It may be a good idea to write it the "obvious"
> way, though (use separate labels depending on which fdput()s need to happen).
>
>
> Other questions:
>
> Should FMODE_PREAD or FMODE_PWRITE access be checked if the user specifies their
> own 'off_in' or 'off_out', respectively?
>
> What is supposed to happen if the user passes a file descriptor to a
> non-regular file, such as a block device or char device?
>
> If the 'in' file has fewer than 'len' bytes remaining until EOF, what is the
> expected behavior? It looks like the btrfs implementation has different
> behavior from the pagecache implementation.
>
> It appears the btrfs implementation has alignment restrictions --- where is this
> documented and how will users know what alignment to use?
>
> Are copies within the same file permitted and can the ranges overlap? The man
> page doesn't say.
>
> It looks like the initial patch defines __NR_copy_file_range for the ARM
> architecture but doesn't actually hook that system call up for ARM; why is that?









2015-10-26 03:39:33

by Christoph Hellwig

Subject: Re: [PATCH v7 5/4] copy_file_range.2: New page documenting copy_file_range()

On Sat, Oct 24, 2015 at 01:02:21PM +0100, Pádraig Brady wrote:
> I'm a bit worried about the sparse expansion and default reflinking
> which might preclude cp(1) from using this call in most cases, but I will
> test and try to use it. coreutils has heuristics for determining if files
> are remote, which we might use to restrict to that use case.

Can you explain why reflinking and hole expansion are an issue if done
locally and not if done remotely? I'd really like to make the call as
usable as possible for everyone, but we really need clear semantics for
that.

Also note that Anna's current series allows for hole filling. Any decent
implementation should not do it, but that's really a quality of
implementation issue and not an interface issue.

2015-10-26 03:45:41

by Christoph Hellwig

Subject: Re: [PATCH v7 0/4] VFS: In-kernel copy system call

On Sat, Oct 24, 2015 at 11:52:37AM -0500, Eric Biggers wrote:
> A few comments:
>
> > if (!(file_in->f_mode & FMODE_READ) ||
> > !(file_out->f_mode & FMODE_WRITE) ||
> > (file_out->f_flags & O_APPEND) ||
> > !file_out->f_op)
> > return -EBADF;
>
> Isn't 'f_op' always non-NULL?

Yes, it is.

> If the destination file cannot be append-only, shouldn't this be documented?

Yes.

> > if (inode_in->i_sb != inode_out->i_sb ||
> > file_in->f_path.mnt != file_out->f_path.mnt)
> > return -EXDEV;
>
> Doesn't the same mount already imply the same superblock?

It does.

> > /*
> > * copy_file_range() differs from regular file read and write in that it
> > * specifically allows return partial success. When it does so is up to
> > * the copy_file_range method.
> > */
>
> What does this mean? I thought that read() and write() can also return
> partial success.

The syscalls are allowed to return short from the standards perspective,
but if you actually do that for regular files all hell will break loose,
as applications don't expect it. That's why we can't actually ever do it.
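
For illustration, this is the kind of retry loop a careful caller wraps
around write(), and that many applications omit for regular files. A
caller of copy_file_range() would presumably need the same pattern, since
the comment in the patch explicitly allows partial success. A minimal
sketch; write_all() is a hypothetical helper name:

```c
#include <errno.h>
#include <unistd.h>

/*
 * Write exactly len bytes from buf to fd, retrying after short writes
 * and EINTR. Returns 0 on success or -1 with errno set.
 */
int write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted, try again */
			return -1;
		}
		if (n == 0) {
			/* No progress: bail out instead of spinning. */
			errno = EIO;
			return -1;
		}
		p += n;			/* skip what was written */
		len -= (size_t)n;
	}
	return 0;
}
```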

> Should FMODE_PREAD or FMODE_PWRITE access be checked if the user specifies their
> own 'off_in' or 'off_out', respectively?

Maybe.

> What is supposed to happen if the user provides a file descriptor to a
> non-regular file, such as a block device or char device?

If they implement the proper method I see no reason why we can't support
it. For block devices we only have one file_ops instance, and mapping
that to the bio-level XCOPY abstraction that's been posted a couple of
times would seem sensible. For character devices that's entirely up to
the driver.

> If the 'in' file has fewer than 'len' bytes remaining until EOF, what is the
> expected behavior? It looks like the btrfs implementation has different
> behavior from the pagecache implementation.

Good question. I'd say failure is the right way to handle a mismatching
length.
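
A userspace sketch of what "fail on a mismatching length" could look
like. The helper name check_source_range() and the choice of -EINVAL are
assumptions for illustration only; the thread has not settled on an error
code:

```c
#include <errno.h>
#include <sys/stat.h>
#include <sys/types.h>

/*
 * Return 0 if [off, off + len) lies entirely within the file behind fd,
 * -EINVAL if it would run past EOF, or -errno if fstat() fails.
 */
int check_source_range(int fd, off_t off, off_t len)
{
	struct stat st;

	if (fstat(fd, &st) < 0)
		return -errno;

	/* Written as a subtraction so off + len cannot overflow. */
	if (off < 0 || len < 0 || off > st.st_size || len > st.st_size - off)
		return -EINVAL;

	return 0;
}
```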

> It appears the btrfs implementation has alignment restrictions --- where is this
> documented and how will users know what alignment to use?

For actual clones we're limited to the file system block size (NFS adds
an extra attribute for the clone block size), but for regular copies we
probably should fall back to the dumb implementation if we don't match
it.
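
The fallback decision could hinge on a test like the one below. This is a
sketch only: range_is_block_aligned() is a hypothetical name, and a real
filesystem may apply extra rules (for instance around the end of the file)
that it ignores.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if both offsets and the length are multiples of blksize,
 * i.e. the request could be served by sharing whole blocks. A caller
 * would take the plain data-copy path when this returns false.
 */
bool range_is_block_aligned(uint64_t off_in, uint64_t off_out,
			    uint64_t len, uint64_t blksize)
{
	return blksize != 0 &&
	       off_in % blksize == 0 &&
	       off_out % blksize == 0 &&
	       len % blksize == 0;
}
```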

> Are copies within the same file permitted and can the ranges overlap? The man
> page doesn't say.

For clones we definitely want to support it, but for copies I'd be
tempted to say no. Does anyone else have an opinion?
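
If overlapping copies within one file were rejected, the test itself is
cheap. A minimal sketch, with ranges_overlap() as a hypothetical helper
name:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Return true if [a, a + len) and [b, b + len) share at least one byte.
 * Comparing the distance between the start offsets against len avoids
 * computing a + len, so the check cannot overflow.
 */
bool ranges_overlap(uint64_t a, uint64_t b, uint64_t len)
{
	if (len == 0)
		return false;
	return a < b ? b - a < len : a - b < len;
}
```

Two ranges that merely touch, such as offsets 0 and 10 with a length of
10, do not count as overlapping here.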

> It looks like the initial patch defines __NR_copy_file_range for the ARM
> architecture but doesn't actually hook that system call up for ARM; why is that?

Looks like that should be dropped. I really wish we had a way to just
wire up syscalls everywhere.

2015-10-26 12:19:36

by Pádraig Brady

[permalink] [raw]
Subject: Re: [PATCH v7 5/4] copy_file_range.2: New page documenting copy_file_range()

On 26/10/15 03:39, Christoph Hellwig wrote:
> On Sat, Oct 24, 2015 at 01:02:21PM +0100, Pádraig Brady wrote:
>> I'm a bit worried about the sparse expansion and default reflinking
>> which might preclude cp(1) from using this call in most cases, but I will
>> test and try to use it. coreutils has heuristics for determining if files
>> are remote, which we might use to restrict to that use case.
>
> Can you explain why reflinking and hole expansion are an issue if done
> locally and not if done remotely? I'd really like to make the call as
> usable as possible for everyone, but we really need clear semantics for
> that.

Fair point on local vs remote. I was just assuming that remote
copy offload would not do reflinking on the backend, or at
least wasn't an exposed option over the remote interface.

I get the impression that you think reflinking should be hidden
from the user, i.e. cp(1) should not have had the --reflink option
(for the last 6 years)? I'm not convinced of that, and even so
I think lower level interfaces would benefit from finer grained options.
This would be especially useful since there is no general interface
to reflink at present. I was happy with the reflink control options,
thinking the extra control could allow cp to use this by default.

> Also note that Anna's current series allows for hole filling - any decent
> implementation should not do that, but that's really a quality of
> implementation issue and not an interface issue.

I think you're saying the default `cp --sparse=auto` operation
could rely on copy_file_range(...complete file...), while
cp --sparse={always,never} would have to iterate over the
file, punching or filling holes as appropriate. I thought
Anna indicated differently wrt splice filling holes by default.

TBH I'm not clear on the semantics of the current implementation,
so need to test the above in various cases.

thanks,
Pádraig.

2015-10-26 21:41:12

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v7 5/4] copy_file_range.2: New page documenting copy_file_range()

On Mon, Oct 26, 2015 at 12:19:33PM +0000, Pádraig Brady wrote:
> On 26/10/15 03:39, Christoph Hellwig wrote:
> > On Sat, Oct 24, 2015 at 01:02:21PM +0100, Pádraig Brady wrote:
> >> I'm a bit worried about the sparse expansion and default reflinking
> >> which might preclude cp(1) from using this call in most cases, but I will
> >> test and try to use it. coreutils has heuristics for determining if files
> >> are remote, which we might use to restrict to that use case.
> >
> > Can you explain why reflinking and hole expansion are an issue if done
> > locally and not if done remotely? I'd really like to make the call as
> > usable as possible for everyone, but we really need clear semantics for
> > that.
>
> Fair point on local vs remote. I was just assuming that remote
> copy offload would not do reflinking on the backend, or at
> least wasn't an exposed option over the remote interface.

The server could definitely do a reflink. More generally, from the
description of the NFS COPY operation:

https://tools.ietf.org/html/draft-ietf-nfsv4-minorversion2-39#page-64

If the copy completes successfully, either synchronously or
asynchronously, the data copied from the source file to the
destination file MUST appear identical to the NFS client.
However, the NFS server's on disk representation of the data in
the source file and destination file MAY differ. For example,
the NFS server might encrypt, compress, deduplicate, or
otherwise represent the on disk data in the source and
destination file differently.

> I get the impression that you think reflinking should be hidden
> from the user, i.e. cp(1) should not have had the --reflink option
> (for the last 6 years)? I'm not convinced of that, and even so
> I think lower level interfaces would benefit from finer grained options.
> This would be especially useful since there is no general interface
> to reflink at present. I was happy with the reflink control options,
> thinking the extra control could allow cp to use this by default.

Maybe that's a case for Christoph's "clone" operation.

I agree with him that it makes sense to allow the filesystem to
implement "copy" using reflink or similar tricks under the covers. And
that in fact it's difficult to imagine how you'd prevent that in the
presence of layers of filesystem or block protocols underneath.

That "cp" flag seems strange to me, but if "cp" wants to take advantage
of a copy system call while continuing to make something like that
distinction then I suppose it could fallocate the destination file range
after the copy.

--b.

> > Also note that Anna's current series allows for hole filling - any decent
> > implementation should not do that, but that's really a quality of
> > implementation issue and not an interface issue.
>
> I think you're saying the default `cp --sparse=auto` operation
> could rely on copy_file_range(...complete file...), while
> cp --sparse={always,never} would have to iterate over the
> file, punching or filling holes as appropriate. I thought
> Anna indicated differently wrt splice filling holes by default.
>
> TBH I'm not clear on the semantics of the current implementation,
> so need to test the above in various cases.
>
> thanks,
> Pádraig.

2015-10-27 11:35:06

by Austin S Hemmelgarn

[permalink] [raw]
Subject: Re: [PATCH v7 5/4] copy_file_range.2: New page documenting copy_file_range()

On 2015-10-26 17:41, J. Bruce Fields wrote:
> On Mon, Oct 26, 2015 at 12:19:33PM +0000, Pádraig Brady wrote:
>> I get the impression that you think reflinking should be hidden
>> from the user, i.e. cp(1) should not have had the --reflink option
>> (for the last 6 years)? I'm not convinced of that, and even so
>> I think lower level interfaces would benefit from finer grained options.
>> This would be especially useful since there is no general interface
>> to reflink at present. I was happy with the reflink control options,
>> thinking the extra control could allow cp to use this by default.
>
> Maybe that's a case for Christoph's "clone" operation.
>
> I agree with him that it makes sense to allow the filesystem to
> implement "copy" using reflink or similar tricks under the covers. And
> that in fact it's difficult to imagine how you'd prevent that in the
> presence of layers of filesystem or block protocols underneath.
>
> That "cp" flag seems strange to me, but if "cp" wants to take advantage
> of a copy system call while continuing to make something like that
> distinction then I suppose it could fallocate the destination file range
> after the copy.
FWIW, I'm pretty sure that the '--reflink=never' option was added
originally just for those poor misguided people who don't understand
that deduplication is perfectly safe as long as you do it right.
Personally, I really hope that Busybox and the other Coreutils
replacements don't make that mistake, as the very fact that cp allows
you to force it not to reflink things indirectly implies that it isn't
safe in some circumstances, which is completely bogus WRT all the
filesystems in Linux that support it if they are used properly.

If you want to make sure the space is allocated on disk, you should be
using fallocate (or dd, or something equivalent), not cp.



2015-10-27 16:03:57

by Steve French

[permalink] [raw]
Subject: Re: [PATCH v7 1/4] vfs: add copy_file_range syscall and vfs helper

The patch set looks good so far, and copy offload is important for
CIFS/SMB3 (not just Windows/Mac/Samba but most servers support one of
the various methods available to do this over the network via CIFS and
also SMB2/SMB3). I have only implemented two ways of copy offload in
cifs.ko so far but will look at adding the remaining mechanisms.

Currently cifs.ko does have a restriction similar to the one below
that you have in patch 1, but I was planning to relax it (as long
as source and target are on the same server) since some of the
important use cases are having the server do copy offload from one
share to another. So we should support the case where the files are
on the same file system type (nfs, cifs etc.) but not necessarily on
the same superblock (for the case of cifs they will be on the same
server, but with some forms of copy offload that would not necessarily
have to be on the same server).

The following check should be removed (an alternative would be to
check that source and target are the same filesystem type, i.e. nfs,
cifs, xfs etc., or simply to let the file systems check the validity of
file_in and file_out):

> + /* this could be relaxed once a method supports cross-fs copies */
> + if (inode_in->i_sb != inode_out->i_sb ||
> + file_in->f_path.mnt != file_out->f_path.mnt)
> + return -EXDEV;




--
Thanks,

Steve