2015-10-16 21:08:34

by Anna Schumaker

Subject: [PATCH v6 0/4] VFS: In-kernel copy system call

Copy system calls came up during Plumbers a while ago, mostly because several
filesystems (including NFS and XFS) are currently working on copy acceleration
implementations. We haven't heard from Zach Brown in a while, so I volunteered
to push his patches upstream so individual filesystems don't need to keep
writing their own ioctls.

This posting addresses Christoph's comments. I'm pretty sure that splice() is
already interruptible, so I don't know what else needs to be done for this
system call.

I haven't started work on a "sparse" copy flag yet. I would like to focus on
the base system call first, and then add that later if it's still desired.

Changes in v6:
- Squash together most patches.
- Drop all flags except COPY_FR_REFLINK.
- Drop patch removing same mountpoint check.
- Change default behavior (flags = 0) to a data copy.


Anna Schumaker (1):
vfs: Add vfs_copy_file_range() support for pagecache copies

Zach Brown (3):
vfs: add copy_file_range syscall and vfs helper
x86: add sys_copy_file_range to syscall tables
btrfs: add .copy_file_range file operation

arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
fs/btrfs/ctree.h | 3 +
fs/btrfs/file.c | 1 +
fs/btrfs/ioctl.c | 94 +++++++++++++----------
fs/read_write.c | 133 +++++++++++++++++++++++++++++++++
include/linux/fs.h | 3 +
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/fs.h | 2 +
kernel/sys_ni.c | 1 +
11 files changed, 206 insertions(+), 40 deletions(-)

--
2.6.1



2015-10-16 21:08:40

by Anna Schumaker

Subject: [PATCH v6 3/4] btrfs: add .copy_file_range file operation

From: Zach Brown <[email protected]>

This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_file_range file operation can call the core loop that copies file
data extent items.

The extent copying loop is lifted up into its own function. It retains
the core btrfs error checks that should be shared.

Signed-off-by: Zach Brown <[email protected]>
[Anna Schumaker: Make flags an unsigned int,
Check for COPY_FR_REFLINK]
Signed-off-by: Anna Schumaker <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
---
v6:
- Check for COPY_FR_REFLINK
---
fs/btrfs/ctree.h | 3 ++
fs/btrfs/file.c | 1 +
fs/btrfs/ioctl.c | 94 +++++++++++++++++++++++++++++++++-----------------------
3 files changed, 59 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 938efe3..0046567 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3996,6 +3996,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
loff_t pos, size_t write_bytes,
struct extent_state **cached);
int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, unsigned int flags);

/* tree-defrag.c */
int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b823fac..b05449c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2816,6 +2816,7 @@ const struct file_operations btrfs_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = btrfs_ioctl,
#endif
+ .copy_file_range = btrfs_copy_file_range,
};

void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0adf542..3fee0cb 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3727,17 +3727,16 @@ out:
return ret;
}

-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
- u64 off, u64 olen, u64 destoff)
+static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
+ u64 off, u64 olen, u64 destoff)
{
struct inode *inode = file_inode(file);
+ struct inode *src = file_inode(file_src);
struct btrfs_root *root = BTRFS_I(inode)->root;
- struct fd src_file;
- struct inode *src;
int ret;
u64 len = olen;
u64 bs = root->fs_info->sb->s_blocksize;
- int same_inode = 0;
+ int same_inode = src == inode;

/*
* TODO:
@@ -3750,49 +3749,20 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
* be either compressed or non-compressed.
*/

- /* the destination must be opened for writing */
- if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
- return -EINVAL;
-
if (btrfs_root_readonly(root))
return -EROFS;

- ret = mnt_want_write_file(file);
- if (ret)
- return ret;
-
- src_file = fdget(srcfd);
- if (!src_file.file) {
- ret = -EBADF;
- goto out_drop_write;
- }
-
- ret = -EXDEV;
- if (src_file.file->f_path.mnt != file->f_path.mnt)
- goto out_fput;
-
- src = file_inode(src_file.file);
-
- ret = -EINVAL;
- if (src == inode)
- same_inode = 1;
-
- /* the src must be open for reading */
- if (!(src_file.file->f_mode & FMODE_READ))
- goto out_fput;
+ if (file_src->f_path.mnt != file->f_path.mnt ||
+ src->i_sb != inode->i_sb)
+ return -EXDEV;

/* don't make the dst file partly checksummed */
if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
- goto out_fput;
+ return -EINVAL;

- ret = -EISDIR;
if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
- goto out_fput;
-
- ret = -EXDEV;
- if (src->i_sb != inode->i_sb)
- goto out_fput;
+ return -EISDIR;

if (!same_inode) {
btrfs_double_inode_lock(src, inode);
@@ -3869,6 +3839,52 @@ out_unlock:
btrfs_double_inode_unlock(src, inode);
else
mutex_unlock(&src->i_mutex);
+ return ret;
+}
+
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, unsigned int flags)
+{
+ ssize_t ret;
+
+ if (!(flags & COPY_FR_REFLINK))
+ return -EOPNOTSUPP;
+
+ ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
+ if (ret == 0)
+ ret = len;
+ return ret;
+}
+
+static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
+ u64 off, u64 olen, u64 destoff)
+{
+ struct fd src_file;
+ int ret;
+
+ /* the destination must be opened for writing */
+ if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
+ return -EINVAL;
+
+ ret = mnt_want_write_file(file);
+ if (ret)
+ return ret;
+
+ src_file = fdget(srcfd);
+ if (!src_file.file) {
+ ret = -EBADF;
+ goto out_drop_write;
+ }
+
+ /* the src must be open for reading */
+ if (!(src_file.file->f_mode & FMODE_READ)) {
+ ret = -EINVAL;
+ goto out_fput;
+ }
+
+ ret = btrfs_clone_files(file, src_file.file, off, olen, destoff);
+
out_fput:
fdput(src_file);
out_drop_write:
--
2.6.1


2015-10-16 21:08:43

by Anna Schumaker

Subject: [PATCH v6 5/4] copy_file_range.2: New page documenting copy_file_range()

copy_file_range() is a new system call for copying ranges of data
completely in the kernel. This gives filesystems an opportunity to
implement some kind of "copy acceleration", such as reflinks or
server-side-copy (in the case of NFS).

Signed-off-by: Anna Schumaker <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
---
v6:
- Updates for removing most flags
---
man2/copy_file_range.2 | 204 +++++++++++++++++++++++++++++++++++++++++++++++++
man2/splice.2 | 1 +
2 files changed, 205 insertions(+)
create mode 100644 man2/copy_file_range.2

diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
new file mode 100644
index 0000000..6c52c85
--- /dev/null
+++ b/man2/copy_file_range.2
@@ -0,0 +1,204 @@
+.\"This manpage is Copyright (C) 2015 Anna Schumaker <[email protected]>
+.\"
+.\" %%%LICENSE_START(VERBATIM)
+.\" Permission is granted to make and distribute verbatim copies of this
+.\" manual provided the copyright notice and this permission notice are
+.\" preserved on all copies.
+.\"
+.\" Permission is granted to copy and distribute modified versions of
+.\" this manual under the conditions for verbatim copying, provided that
+.\" the entire resulting derived work is distributed under the terms of
+.\" a permission notice identical to this one.
+.\"
+.\" Since the Linux kernel and libraries are constantly changing, this
+.\" manual page may be incorrect or out-of-date. The author(s) assume.
+.\" no responsibility for errors or omissions, or for damages resulting.
+.\" from the use of the information contained herein. The author(s) may.
+.\" not have taken the same level of care in the production of this.
+.\" manual, which is licensed free of charge, as they might when working.
+.\" professionally.
+.\"
+.\" Formatted or processed versions of this manual, if unaccompanied by
+.\" the source, must acknowledge the copyright and authors of this work.
+.\" %%%LICENSE_END
+.\"
+.TH COPY 2 2015-10-16 "Linux" "Linux Programmer's Manual"
+.SH NAME
+copy_file_range \- Copy a range of data from one file to another
+.SH SYNOPSIS
+.nf
+.B #include <linux/fs.h>
+.B #include <sys/syscall.h>
+.B #include <unistd.h>
+
+.BI "ssize_t copy_file_range(int " fd_in ", loff_t *" off_in ", int " fd_out ",
+.BI " loff_t *" off_out ", size_t " len \
+", unsigned int " flags );
+.fi
+.SH DESCRIPTION
+The
+.BR copy_file_range ()
+system call performs an in-kernel copy between two file descriptors
+without the additional cost of transferring data from the kernel to userspace
+and then back into the kernel.
+It copies up to
+.I len
+bytes of data from file descriptor
+.I fd_in
+to file descriptor
+.IR fd_out ,
+overwriting any data that exists within the requested range of the target file.
+
+The following semantics apply for
+.IR off_in ,
+and similar statements apply to
+.IR off_out :
+.IP * 3
+If
+.I off_in
+is NULL, then bytes are read from
+.I fd_in
+starting from the current file offset, and the offset is
+adjusted by the number of bytes copied.
+.IP *
+If
+.I off_in
+is not NULL, then
+.I off_in
+must point to a buffer that specifies the starting
+offset where bytes from
+.I fd_in
+will be read. The current file offset of
+.I fd_in
+is not changed, but
+.I off_in
+is adjusted appropriately.
+.PP
+
+The
+.I flags
+argument can have the following flag set:
+.TP 1.9i
+.B COPY_FR_REFLINK
+Create a lightweight "reflink", where data is not copied until
+one of the files is modified.
+.PP
+The default behavior
+.RI ( flags
+== 0) is to perform a full data copy of the requested range.
+.SH RETURN VALUE
+Upon successful completion,
+.BR copy_file_range ()
+will return the number of bytes copied between files.
+This could be less than the length originally requested.
+
+On error,
+.BR copy_file_range ()
+returns \-1 and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EBADF
+One or more file descriptors are not valid; or
+.I fd_in
+is not open for reading; or
+.I fd_out
+is not open for writing.
+.TP
+.B EINVAL
+Requested range extends beyond the end of the source file; or the
+.I flags
+argument is set to an invalid value.
+.TP
+.B EIO
+A low level I/O error occurred while copying.
+.TP
+.B ENOMEM
+Out of memory.
+.TP
+.B ENOSPC
+There is not enough space on the target filesystem to complete the copy.
+.TP
+.B EOPNOTSUPP
+.B COPY_REFLINK
+was specified in
+.IR flags ,
+but the target filesystem does not support reflinks.
+.TP
+.B EXDEV
+.IR file_in " and " file_out
+are not on the same mounted filesystem.
+.SH VERSIONS
+The
+.BR copy_file_range ()
+system call first appeared in Linux 4.4.
+.SH CONFORMING TO
+The
+.BR copy_file_range ()
+system call is a nonstandard Linux extension.
+.SH EXAMPLE
+.nf
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <linux/fs.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+loff_t copy_file_range(int fd_in, loff_t *off_in, int fd_out,
+ loff_t *off_out, size_t len, unsigned int flags)
+{
+ return syscall(__NR_copy_file_range, fd_in, off_in, fd_out,
+ off_out, len, flags);
+}
+
+int main(int argc, char **argv)
+{
+ int fd_in, fd_out;
+ struct stat stat;
+ loff_t len, ret;
+ char buf[2];
+
+ if (argc != 3) {
+ fprintf(stderr, "Usage: %s <source> <destination>\\n", argv[0]);
+ exit(EXIT_FAILURE);
+ }
+
+ fd_in = open(argv[1], O_RDONLY);
+ if (fd_in == \-1) {
+ perror("open (argv[1])");
+ exit(EXIT_FAILURE);
+ }
+
+ if (fstat(fd_in, &stat) == \-1) {
+ perror("fstat");
+ exit(EXIT_FAILURE);
+ }
+ len = stat.st_size;
+
+ fd_out = open(argv[2], O_CREAT|O_WRONLY|O_TRUNC, 0644);
+ if (fd_out == \-1) {
+ perror("open (argv[2])");
+ exit(EXIT_FAILURE);
+ }
+
+ do {
+ ret = copy_file_range(fd_in, NULL, fd_out, NULL, len, 0);
+ if (ret == \-1) {
+ perror("copy_file_range");
+ exit(EXIT_FAILURE);
+ }
+
+ len \-= ret;
+ } while (len > 0);
+
+ close(fd_in);
+ close(fd_out);
+ exit(EXIT_SUCCESS);
+}
+.fi
+.SH SEE ALSO
+.BR splice (2)
diff --git a/man2/splice.2 b/man2/splice.2
index b9b4f42..5c162e0 100644
--- a/man2/splice.2
+++ b/man2/splice.2
@@ -238,6 +238,7 @@ only pointers are copied, not the pages of the buffer.
See
.BR tee (2).
.SH SEE ALSO
+.BR copy_file_range (2),
.BR sendfile (2),
.BR tee (2),
.BR vmsplice (2)
--
2.6.1


2015-10-16 21:08:41

by Anna Schumaker

Subject: [PATCH v6 4/4] vfs: Add vfs_copy_file_range() support for pagecache copies

This allows us to have an in-kernel copy mechanism that avoids frequent
switches between kernel and user space. This is especially useful so
NFSD can support server-side copies.

The default (flags=0) means to first attempt copy acceleration, but use
the pagecache if that fails.

I moved the rw_verify_area() calls into the fallback code since some
filesystems can handle reflinking a large range.

Signed-off-by: Anna Schumaker <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Reviewed-by: Padraig Brady <[email protected]>
---
v6:
- Don't reflink by default.
- Reword commit message.
---
fs/read_write.c | 36 ++++++++++++++++++++++++++----------
1 file changed, 26 insertions(+), 10 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 20f147d..1fd555c 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1329,6 +1329,24 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
}
#endif

+static ssize_t vfs_copy_fr_copy(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len)
+{
+ ssize_t ret = rw_verify_area(READ, file_in, &pos_in, len);
+
+ if (ret >= 0) {
+ len = ret;
+ ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+ if (ret >= 0)
+ len = ret;
+ }
+ if (ret < 0)
+ return ret;
+
+ return do_splice_direct(file_in, &pos_in, file_out, &pos_out, len, 0);
+}
+
/*
* copy_file_range() differs from regular file read and write in that it
* specifically allows return partial success. When it does so is up to
@@ -1345,17 +1363,10 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (flags & ~(COPY_FR_REFLINK))
return -EINVAL;

- /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT */
- ret = rw_verify_area(READ, file_in, &pos_in, len);
- if (ret >= 0)
- ret = rw_verify_area(WRITE, file_out, &pos_out, len);
- if (ret < 0)
- return ret;
-
if (!(file_in->f_mode & FMODE_READ) ||
!(file_out->f_mode & FMODE_WRITE) ||
(file_out->f_flags & O_APPEND) ||
- !file_out->f_op || !file_out->f_op->copy_file_range)
+ !file_out->f_op)
return -EBADF;

/* this could be relaxed once a method supports cross-fs copies */
@@ -1370,8 +1381,13 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (ret)
return ret;

- ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
- len, flags);
+ ret = -EOPNOTSUPP;
+ if (file_out->f_op->copy_file_range)
+ ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
+ pos_out, len, flags);
+ if ((ret == -EOPNOTSUPP) && !(flags & COPY_FR_REFLINK))
+ ret = vfs_copy_fr_copy(file_in, pos_in, file_out, pos_out, len);
+
if (ret > 0) {
fsnotify_access(file_in);
add_rchar(current, ret);
--
2.6.1
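
The fallback decision made in vfs_copy_file_range() by this patch can be
modeled in userspace roughly as follows. This is only a sketch of the control
flow, not kernel code: every model_* name is hypothetical, the filesystem
method and the splice-based copy are reduced to stubs, and
MODEL_COPY_FR_REFLINK merely mirrors the COPY_FR_REFLINK bit from patch 1/4.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Mirrors COPY_FR_REFLINK (1 << 0) from patch 1/4; the name is hypothetical. */
#define MODEL_COPY_FR_REFLINK (1u << 0)

/* Hypothetical stand-in for a filesystem's .copy_file_range method. */
typedef long (*model_fs_copy_fn)(size_t len, unsigned int flags);

/* Hypothetical stand-in for the splice-based pagecache copy. */
static long model_pagecache_copy(size_t len)
{
    return (long)len;
}

/* Mimics btrfs_copy_file_range() from patch 3/4: it only handles reflinks. */
static long model_btrfs_copy(size_t len, unsigned int flags)
{
    if (!(flags & MODEL_COPY_FR_REFLINK))
        return -EOPNOTSUPP;
    return (long)len;
}

/* Models the dispatch: try the filesystem method, then fall back. */
static long model_vfs_copy(model_fs_copy_fn fs_copy, size_t len,
                           unsigned int flags)
{
    long ret = -EOPNOTSUPP;

    if (flags & ~MODEL_COPY_FR_REFLINK)
        return -EINVAL;         /* unknown flag bits are rejected */
    if (len == 0)
        return 0;

    if (fs_copy)
        ret = fs_copy(len, flags);

    /* Fall back to a data copy only if the caller did not insist on a reflink. */
    if (ret == -EOPNOTSUPP && !(flags & MODEL_COPY_FR_REFLINK))
        ret = model_pagecache_copy(len);

    return ret;
}
```

In this model a flags == 0 call on the btrfs stub ends up in the pagecache
copy, which matches the "Don't reflink by default" note in the changelog,
while a COPY_FR_REFLINK request on a filesystem without the method fails with
EOPNOTSUPP instead of silently copying data.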


2015-10-16 21:08:36

by Anna Schumaker

Subject: [PATCH v6 1/4] vfs: add copy_file_range syscall and vfs helper

From: Zach Brown <[email protected]>

Add a copy_file_range() system call for offloading copies between
regular files.

This gives an interface to underlying layers of the storage stack which
can copy without reading and writing all the data. There are a few
candidates that should support copy offloading in the nearer term:

- btrfs shares extent references with its clone ioctl
- NFS has patches to add a COPY command which copies on the server
- SCSI has a family of XCOPY commands which copy in the device

This system call avoids the complexity of also accelerating the creation
of the destination file by operating on an existing destination file
descriptor, not a path.

Currently the high level vfs entry point limits copy offloading to files
on the same mount and super (and not in the same file). This can be
relaxed if we get implementations which can copy between file systems
safely.

Signed-off-by: Zach Brown <[email protected]>
[Anna Schumaker: Change -EINVAL to -EBADF during file verification,
Change flags parameter from int to unsigned int,
Add function to include/linux/syscalls.h,
Check copy len after file open mode,
Don't forbid ranges inside the same file,
Use rw_verify_area() to verify ranges,
Use file_out rather than file_in,
Add COPY_FR_REFLINK flag]
Signed-off-by: Anna Schumaker <[email protected]>
---
v6:
- Christoph suggested that I squash in some of my clean-up patches.
- Introduce COPY_FR_REFLINK in this patch.
---
fs/read_write.c | 117 ++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 3 +
include/linux/syscalls.h | 3 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/fs.h | 2 +
kernel/sys_ni.c | 1 +
6 files changed, 129 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 819ef3f..20f147d 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
#include <linux/pagemap.h>
#include <linux/splice.h>
#include <linux/compat.h>
+#include <linux/mount.h>
#include "internal.h"

#include <asm/uaccess.h>
@@ -1327,3 +1328,119 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
return do_sendfile(out_fd, in_fd, NULL, count, 0);
}
#endif
+
+/*
+ * copy_file_range() differs from regular file read and write in that it
+ * specifically allows return partial success. When it does so is up to
+ * the copy_file_range method.
+ */
+ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, unsigned int flags)
+{
+ struct inode *inode_in = file_inode(file_in);
+ struct inode *inode_out = file_inode(file_out);
+ ssize_t ret;
+
+ if (flags & ~(COPY_FR_REFLINK))
+ return -EINVAL;
+
+ /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT */
+ ret = rw_verify_area(READ, file_in, &pos_in, len);
+ if (ret >= 0)
+ ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+ if (ret < 0)
+ return ret;
+
+ if (!(file_in->f_mode & FMODE_READ) ||
+ !(file_out->f_mode & FMODE_WRITE) ||
+ (file_out->f_flags & O_APPEND) ||
+ !file_out->f_op || !file_out->f_op->copy_file_range)
+ return -EBADF;
+
+ /* this could be relaxed once a method supports cross-fs copies */
+ if (inode_in->i_sb != inode_out->i_sb ||
+ file_in->f_path.mnt != file_out->f_path.mnt)
+ return -EXDEV;
+
+ if (len == 0)
+ return 0;
+
+ ret = mnt_want_write_file(file_out);
+ if (ret)
+ return ret;
+
+ ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+ len, flags);
+ if (ret > 0) {
+ fsnotify_access(file_in);
+ add_rchar(current, ret);
+ fsnotify_modify(file_out);
+ add_wchar(current, ret);
+ }
+ inc_syscr(current);
+ inc_syscw(current);
+
+ mnt_drop_write_file(file_out);
+
+ return ret;
+}
+EXPORT_SYMBOL(vfs_copy_file_range);
+
+SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
+ int, fd_out, loff_t __user *, off_out,
+ size_t, len, unsigned int, flags)
+{
+ loff_t pos_in;
+ loff_t pos_out;
+ struct fd f_in;
+ struct fd f_out;
+ ssize_t ret;
+
+ f_in = fdget(fd_in);
+ f_out = fdget(fd_out);
+ if (!f_in.file || !f_out.file) {
+ ret = -EBADF;
+ goto out;
+ }
+
+ ret = -EFAULT;
+ if (off_in) {
+ if (copy_from_user(&pos_in, off_in, sizeof(loff_t)))
+ goto out;
+ } else {
+ pos_in = f_in.file->f_pos;
+ }
+
+ if (off_out) {
+ if (copy_from_user(&pos_out, off_out, sizeof(loff_t)))
+ goto out;
+ } else {
+ pos_out = f_out.file->f_pos;
+ }
+
+ ret = vfs_copy_file_range(f_in.file, pos_in, f_out.file, pos_out, len,
+ flags);
+ if (ret > 0) {
+ pos_in += ret;
+ pos_out += ret;
+
+ if (off_in) {
+ if (copy_to_user(off_in, &pos_in, sizeof(loff_t)))
+ ret = -EFAULT;
+ } else {
+ f_in.file->f_pos = pos_in;
+ }
+
+ if (off_out) {
+ if (copy_to_user(off_out, &pos_out, sizeof(loff_t)))
+ ret = -EFAULT;
+ } else {
+ f_out.file->f_pos = pos_out;
+ }
+ }
+out:
+ fdput(f_in);
+ fdput(f_out);
+ return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 72d8a84..6220307 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1642,6 +1642,7 @@ struct file_operations {
#ifndef CONFIG_MMU
unsigned (*mmap_capabilities)(struct file *);
#endif
+ ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, unsigned int);
};

struct inode_operations {
@@ -1695,6 +1696,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
unsigned long, loff_t *);
extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
unsigned long, loff_t *);
+extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
+ loff_t, size_t, unsigned int);

struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
index a460e2e..290205f 100644
--- a/include/linux/syscalls.h
+++ b/include/linux/syscalls.h
@@ -886,5 +886,8 @@ asmlinkage long sys_execveat(int dfd, const char __user *filename,
const char __user *const __user *envp, int flags);

asmlinkage long sys_membarrier(int cmd, int flags);
+asmlinkage long sys_copy_file_range(int fd_in, loff_t __user *off_in,
+ int fd_out, loff_t __user *off_out,
+ size_t len, unsigned int flags);

#endif
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index ee12400..2d79155 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -713,9 +713,11 @@ __SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
__SYSCALL(__NR_userfaultfd, sys_userfaultfd)
#define __NR_membarrier 283
__SYSCALL(__NR_membarrier, sys_membarrier)
+#define __NR_copy_file_range 284
+__SYSCALL(__NR_copy_file_range, sys_copy_file_range)

#undef __NR_syscalls
-#define __NR_syscalls 284
+#define __NR_syscalls 285

/*
* All syscalls below here should go away really,
diff --git a/include/uapi/linux/fs.h b/include/uapi/linux/fs.h
index 9b964a5..30b44f4 100644
--- a/include/uapi/linux/fs.h
+++ b/include/uapi/linux/fs.h
@@ -207,4 +207,6 @@ struct inodes_stat_t {
#define SYNC_FILE_RANGE_WRITE 2
#define SYNC_FILE_RANGE_WAIT_AFTER 4

+#define COPY_FR_REFLINK (1 << 0) /* Create a reflink instead. */
+
#endif /* _UAPI_LINUX_FS_H */
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index a02decf..83c5c82 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -174,6 +174,7 @@ cond_syscall(sys_setfsuid);
cond_syscall(sys_setfsgid);
cond_syscall(sys_capget);
cond_syscall(sys_capset);
+cond_syscall(sys_copy_file_range);

/* arch-specific weak syscall entries */
cond_syscall(sys_pciconfig_read);
--
2.6.1
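
The offset bookkeeping in the syscall entry point above can be illustrated
with a small userspace model. This is a sketch under stated assumptions:
struct model_file and model_copy_offsets() are hypothetical names, the user
copies (copy_from_user()/copy_to_user()) are replaced by plain pointer
accesses, and the `copied` argument stands in for whatever
vfs_copy_file_range() would have returned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct file: only the file position matters. */
struct model_file {
    long long f_pos;
};

/*
 * Models the position handling of the copy_file_range() entry point.
 * `copied` plays the role of the vfs_copy_file_range() return value.
 */
static long model_copy_offsets(struct model_file *in, long long *off_in,
                               struct model_file *out, long long *off_out,
                               long copied)
{
    long long pos_in, pos_out;

    assert(in != NULL && out != NULL);

    /* A NULL offset pointer means "use the descriptor's own position". */
    pos_in = off_in ? *off_in : in->f_pos;
    pos_out = off_out ? *off_out : out->f_pos;

    /* Positions only advance when some bytes were actually copied. */
    if (copied > 0) {
        pos_in += copied;
        pos_out += copied;

        if (off_in)
            *off_in = pos_in;       /* caller's offset is updated ... */
        else
            in->f_pos = pos_in;     /* ... or the file position is */

        if (off_out)
            *off_out = pos_out;
        else
            out->f_pos = pos_out;
    }
    return copied;
}
```

With an explicit off_in the source file position is left alone and only the
caller's variable moves, which is the behavior the man page in 5/4 describes.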


2015-10-16 21:08:38

by Anna Schumaker

Subject: [PATCH v6 2/4] x86: add sys_copy_file_range to syscall tables

From: Zach Brown <[email protected]>

Add sys_copy_file_range to the x86 syscall tables.

Signed-off-by: Zach Brown <[email protected]>
[Anna Schumaker: Update syscall number in syscall_32.tbl]
Signed-off-by: Anna Schumaker <[email protected]>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 7663c45..0531270 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -382,3 +382,4 @@
373 i386 shutdown sys_shutdown
374 i386 userfaultfd sys_userfaultfd
375 i386 membarrier sys_membarrier
+376 i386 copy_file_range sys_copy_file_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 278842f..03a9396 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -331,6 +331,7 @@
322 64 execveat stub_execveat
323 common userfaultfd sys_userfaultfd
324 common membarrier sys_membarrier
+325 common copy_file_range sys_copy_file_range

#
# x32-specific system call numbers start at 512 to avoid cache impact
--
2.6.1


2015-10-16 21:21:28

by Andreas Dilger

Subject: Re: [PATCH v6 5/4] copy_file_range.2: New page documenting copy_file_range()


> On Oct 16, 2015, at 3:08 PM, Anna Schumaker <[email protected]> wrote:
>
> copy_file_range() is a new system call for copying ranges of data
> completely in the kernel. This gives filesystems an opportunity to
> implement some kind of "copy acceleration", such as reflinks or
> server-side-copy (in the case of NFS).
>
> Signed-off-by: Anna Schumaker <[email protected]>
> Reviewed-by: Darrick J. Wong <[email protected]>
> ---
> v6:
> - Updates for removing most flags
> ---
> man2/copy_file_range.2 | 204 +++++++++++++++++++++++++++++++++++++++++++++++++
> man2/splice.2 | 1 +
> 2 files changed, 205 insertions(+)
> create mode 100644 man2/copy_file_range.2
>
> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> new file mode 100644
> index 0000000..6c52c85
> --- /dev/null
> +++ b/man2/copy_file_range.2
> @@ -0,0 +1,204 @@
> +.\"This manpage is Copyright (C) 2015 Anna Schumaker <[email protected]>
> +.\"
> +.\" %%%LICENSE_START(VERBATIM)
> +.\" Permission is granted to make and distribute verbatim copies of this
> +.\" manual provided the copyright notice and this permission notice are
> +.\" preserved on all copies.
> +.\"
> +.\" Permission is granted to copy and distribute modified versions of
> +.\" this manual under the conditions for verbatim copying, provided that
> +.\" the entire resulting derived work is distributed under the terms of
> +.\" a permission notice identical to this one.
> +.\"
> +.\" Since the Linux kernel and libraries are constantly changing, this
> +.\" manual page may be incorrect or out-of-date. The author(s) assume.
> +.\" no responsibility for errors or omissions, or for damages resulting.
> +.\" from the use of the information contained herein. The author(s) may.
> +.\" not have taken the same level of care in the production of this.
> +.\" manual, which is licensed free of charge, as they might when working.

Is there a reason why every. one. of. those. lines. ends. in. a. period?
I don't think that is needed for nroff, and other paragraphs would support
that conclusion.

> +.\" professionally.
> +.\"
> +.\" Formatted or processed versions of this manual, if unaccompanied by
> +.\" the source, must acknowledge the copyright and authors of this work.
> +.\" %%%LICENSE_END
> +.\"
> +.TH COPY 2 2015-10-16 "Linux" "Linux Programmer's Manual"
> +.SH NAME
> +copy_file_range \- Copy a range of data from one file to another
> +.SH SYNOPSIS
> +.nf
> +.B #include <linux/fs.h>
> +.B #include <sys/syscall.h>
> +.B #include <unistd.h>
> +
> +.BI "ssize_t copy_file_range(int " fd_in ", loff_t *" off_in ", int " fd_out ",
> +.BI " loff_t *" off_out ", size_t " len \
> +", unsigned int " flags );
> +.fi
> +.SH DESCRIPTION
> +The
> +.BR copy_file_range ()
> +system call performs an in-kernel copy between two file descriptors
> +without the additional cost of transferring data from the kernel to userspace
> +and then back into the kernel.
> +It copies up to
> +.I len
> +bytes of data from file descriptor
> +.I fd_in
> +to file descriptor
> +.IR fd_out ,
> +overwriting any data that exists within the requested range of the target file.
> +
> +The following semantics apply for
> +.IR off_in ,
> +and similar statements apply to
> +.IR off_out :
> +.IP * 3
> +If
> +.I off_in
> +is NULL, then bytes are read from
> +.I fd_in
> +starting from the current file offset, and the offset is
> +adjusted by the number of bytes copied.
> +.IP *
> +If
> +.I off_in
> +is not NULL, then
> +.I off_in
> +must point to a buffer that specifies the starting
> +offset where bytes from
> +.I fd_in
> +will be read. The current file offset of
> +.I fd_in
> +is not changed, but
> +.I off_in
> +is adjusted appropriately.
> +.PP
> +
> +The
> +.I flags
> +argument can have the following flag set:
> +.TP 1.9i
> +.B COPY_FR_REFLINK
> +Create a lightweight "reflink", where data is not copied until
> +one of the files is modified.

This is a circular definition. Something like:

Create a lightweight reference to the data blocks in the original
file, where data is not copied until one of the files is modified.

although I'm not sure if "lightweight" is really valuable there.

> +.PP
> +The default behavior
> +.RI ( flags
> +== 0) is to perform a full data copy of the requested range.
> +.SH RETURN VALUE
> +Upon successful completion,
> +.BR copy_file_range ()
> +will return the number of bytes copied between files.
> +This could be less than the length originally requested.

This is a bit vague. When COPY_FR_REFLINK is used, no data is "copied",
per se, but I doubt that "0" would be returned in that case either. It
probably makes sense to write something like:

... return the number of bytes accessible in the target file.

or maybe (s/accessible/transferred/) or

... return the number of bytes added to the target file.

or similar.

> +
> +On error,
> +.BR copy_file_range ()
> +returns \-1 and
> +.I errno
> +is set to indicate the error.
> +.SH ERRORS
> +.TP
> +.B EBADF
> +One or more file descriptors are not valid; or
> +.I fd_in
> +is not open for reading; or
> +.I fd_out
> +is not open for writing.
> +.TP
> +.B EINVAL
> +Requested range extends beyond the end of the source file; or the
> +.I flags
> +argument is set to an invalid value.

Is it possible to return EINTR as well?

Cheers, Andreas

> +.TP
> +.B EIO
> +A low level I/O error occurred while copying.
> +.TP
> +.B ENOMEM
> +Out of memory.
> +.TP
> +.B ENOSPC
> +There is not enough space on the target filesystem to complete the copy.
> +.TP
> +.B EOPNOTSUPP
> +.B COPY_REFLINK
> +was specified in
> +.IR flags ,
> +but the target filesystem does not support reflinks.
> +.TP
> +.B EXDEV
> +.IR file_in " and " file_out
> +are not on the same mounted filesystem.
> +.SH VERSIONS
> +The
> +.BR copy_file_range ()
> +system call first appeared in Linux 4.4.
> +.SH CONFORMING TO
> +The
> +.BR copy_file_range ()
> +system call is a nonstandard Linux extension.
> +.SH EXAMPLE
> +.nf
> +#define _GNU_SOURCE
> +#include <fcntl.h>
> +#include <linux/fs.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <sys/stat.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +
> +loff_t copy_file_range(int fd_in, loff_t *off_in, int fd_out,
> + loff_t *off_out, size_t len, unsigned int flags)
> +{
> + return syscall(__NR_copy_file_range, fd_in, off_in, fd_out,
> + off_out, len, flags);
> +}
> +
> +int main(int argc, char **argv)
> +{
> + int fd_in, fd_out;
> + struct stat stat;
> + loff_t len, ret;
> + char buf[2];
> +
> + if (argc != 3) {
> + fprintf(stderr, "Usage: %s <source> <destination>\\n", argv[0]);
> + exit(EXIT_FAILURE);
> + }
> +
> + fd_in = open(argv[1], O_RDONLY);
> + if (fd_in == \-1) {
> + perror("open (argv[1])");
> + exit(EXIT_FAILURE);
> + }
> +
> + if (fstat(fd_in, &stat) == \-1) {
> + perror("fstat");
> + exit(EXIT_FAILURE);
> + }
> + len = stat.st_size;
> +
> + fd_out = open(argv[2], O_CREAT|O_WRONLY|O_TRUNC, 0644);
> + if (fd_out == \-1) {
> + perror("open (argv[2])");
> + exit(EXIT_FAILURE);
> + }
> +
> + do {
> + ret = copy_file_range(fd_in, NULL, fd_out, NULL, len, 0);
> + if (ret == \-1) {
> + perror("copy_file_range");
> + exit(EXIT_FAILURE);
> + }
> +
> + len \-= ret;
> + } while (len > 0 && ret > 0);
> +
> + close(fd_in);
> + close(fd_out);
> + exit(EXIT_SUCCESS);
> +}
> +.fi
> +.SH SEE ALSO
> +.BR splice (2)
> diff --git a/man2/splice.2 b/man2/splice.2
> index b9b4f42..5c162e0 100644
> --- a/man2/splice.2
> +++ b/man2/splice.2
> @@ -238,6 +238,7 @@ only pointers are copied, not the pages of the buffer.
> See
> .BR tee (2).
> .SH SEE ALSO
> +.BR copy_file_range (2),
> .BR sendfile (2),
> .BR tee (2),
> .BR vmsplice (2)
> --
> 2.6.1


Cheers, Andreas

2015-10-16 21:42:44

by Pádraig Brady

Subject: Re: [PATCH v6 5/4] copy_file_range.2: New page documenting copy_file_range()

It's probably worth mentioning the sparse expansion caveat
in the docs, as it's important and not obvious.

I.e. copy_file_range() could be called from a SEEK_{HOLE,DATA} loop
in a user space app, and would still benefit from one.
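
A rough sketch of such a loop follows. It is an illustration under stated assumptions, not code from the patch set: the names copy_data_segments() and sparse_copy_file() are hypothetical, a libc wrapper for copy_file_range() is assumed to exist, flags is passed as 0, and source and destination are assumed to be on the same mounted filesystem. Each data segment found with SEEK_DATA/SEEK_HOLE is copied to the same offset in the destination, and holes are simply never written.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy only the data segments of fd_in (size bytes long) to the same
 * offsets in fd_out, leaving holes unwritten.
 * Returns 0 on success, -1 on error with errno set.
 */
static int copy_data_segments(int fd_in, int fd_out, off_t size)
{
    off_t pos = 0;

    while (pos < size) {
        /* Find the next data segment at or after pos. */
        off_t data = lseek(fd_in, pos, SEEK_DATA);
        if (data == -1) {
            if (errno == ENXIO)     /* only a hole remains up to EOF */
                break;
            return -1;
        }
        /* Find where that data segment ends. */
        off_t hole = lseek(fd_in, data, SEEK_HOLE);
        if (hole == -1)
            return -1;

        loff_t off_in = data, off_out = data;
        while (off_in < hole) {
            ssize_t n = copy_file_range(fd_in, &off_in, fd_out, &off_out,
                                        (size_t)(hole - off_in), 0);
            if (n == -1)
                return -1;
            if (n == 0)             /* source shrank underneath us */
                break;
        }
        pos = hole;
    }
    /* Extend the destination so that a trailing hole is preserved. */
    return ftruncate(fd_out, size);
}

/* Open src and dst by path and copy src sparsely into dst. */
int sparse_copy_file(const char *src, const char *dst)
{
    struct stat st;
    int ret = -1;
    int fd_in = open(src, O_RDONLY);

    if (fd_in == -1)
        return -1;

    int fd_out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd_out != -1) {
        if (fstat(fd_in, &st) == 0)
            ret = copy_data_segments(fd_in, fd_out, st.st_size);
        if (close(fd_out) == -1)
            ret = -1;
    }
    close(fd_in);
    return ret;
}
```

On a filesystem without hole support, lseek() reports the whole file as one data segment, so the loop degrades to a plain copy.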

thanks,
Pádraig.

2015-10-18 18:30:14

by Christoph Hellwig

Subject: Re: [PATCH v6 5/4] copy_file_range.2: New page documenting copy_file_range()

Just commenting on the man page here as the comment is about semantics.
All the infrastructure in the patch looks reasonable to me, but this
is something we need to get right.

> +.B COPY_FR_REFLINK
> +Create a lightweight "reflink", where data is not copied until
> +one of the files is modified.
> +.PP
> +The default behavior
> +.RI ( flags
> +== 0) is to perform a full data copy of the requested range.
> +.SH RETURN VALUE
> +Upon successful completion,
> +.BR copy_file_range ()
> +will return the number of bytes copied between files.
> +This could be less than the length originally requested.

As mentioned in the previous discussion I fundamentally disagree with
the way you word the flags here.

flags = 0 gives you the data from source at dest, period. How it's
implemented is up to the file system as a user cannot observe how data
actually is stored underneath.

Additionally I think the 'clone' option with its stronger guarantees
should be a separate system call. So for now just have no supported
flag and leave it up to the file system and storage device how to
implement it.

For the future a COPY_FALLOC flag that guarantees you do not get ENOSPC
on the copied range will be very useful, but given the complexity I
think it's not something we should add now.

2015-10-19 20:45:05

by J. Bruce Fields

Subject: Re: [PATCH v6 5/4] copy_file_range.2: New page documenting copy_file_range()

On Sun, Oct 18, 2015 at 11:30:13AM -0700, Christoph Hellwig wrote:
> Just commenting on the man page here as the comment is about semantics.
> All the infrastructure in the patch looks reasonable to me, but this
> is something we need to get right.
>
> > +.B COPY_FR_REFLINK
> > +Create a lightweight "reflink", where data is not copied until
> > +one of the files is modified.
> > +.PP
> > +The default behavior
> > +.RI ( flags
> > +== 0) is to perform a full data copy of the requested range.
> > +.SH RETURN VALUE
> > +Upon successful completion,
> > +.BR copy_file_range ()
> > +will return the number of bytes copied between files.
> > +This could be less than the length originally requested.
>
> As mentioned in the previous discussion I fundamentally disagree with
> the way you word the flags here.
>
> flags = 0 gives you the data from source at dest, period. How it's
> implemented is up to the file system as a user cannot observe how data
> actually is stored underneath.
>
> Additionally I think the 'clone' option with its stronger guarantees
> should be a separate system call. So for now just have no supported
> flag and leave it up to the file system and storage device how to
> implement it.

So, continue to include a "flags" field but just error out if it's
anything but zero for now?

Sounds fine by me. We can always implement the other stuff later.

--b.

>
> For the future a COPY_FALLOC flag that guarantees you do not get ENOSPC
> on the copied range will be very useful, but given the complexity I
> think it's not something we should add now.

2015-10-19 21:03:28

by Christoph Hellwig

Subject: Re: [PATCH v6 5/4] copy_file_range.2: New page documenting copy_file_range()

On Mon, Oct 19, 2015 at 04:45:03PM -0400, J. Bruce Fields wrote:
> So, continue to include a "flags" field but just error out if it's
> anything but zero for now?

Exactly!
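
A minimal sketch of that check, written as a plain user-space function so it can be compiled on its own. The name check_copy_flags() is hypothetical, and EINVAL is assumed as the error code because the man page draft reports an invalid flags value that way; the actual kernel entry point may look different.

```c
#include <errno.h>

/*
 * Hypothetical model of the entry check: no flags are defined yet, so
 * any non-zero value is rejected. Rejecting unknown bits now keeps them
 * available for future use without breaking existing callers.
 */
int check_copy_flags(unsigned int flags)
{
    if (flags != 0)
        return -EINVAL;   /* assumed error code for unsupported flags */
    return 0;
}
```

With this shape, a later flag can be introduced by relaxing the check for that one bit, and older kernels will still fail cleanly when they see it.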

2015-10-20 09:41:36

by Zhao Lei

Subject: RE: [PATCH v6 0/4] VFS: In-kernel copy system call

Hi, Anna Schumaker

This patchset compiles fine for the x86 and x86_64 targets,
but fails on arm when compiling the btrfs directory, with the following message:
<stdin>:1304:2: warning: #warning syscall copy_file_range not implemented [-Wcpp]

To reproduce:

merge commands:
cd /mnt/big1/linux
git fetch -q --all
git checkout --force -B btrfs_base v4.3-rc5
git am --abort
git am --whitespace=nowarn PATCH_v6_1_4__vfs__add_copy_file_range_syscall_and_vfs_helper
git am --whitespace=nowarn PATCH_v6_2_4__x86__add_sys_copy_file_range_to_syscall_tables
git am --whitespace=nowarn PATCH_v6_3_4__btrfs__add_.copy_file_range_file_operation
git am --whitespace=nowarn PATCH_v6_4_4__vfs__Add_vfs_copy_file_range___support_for_pagecache_copies

compile commands:
make --directory=/mnt/big1/linux ARCH=arm CROSS_COMPILE=arm-buildroot-linux-uclibcgnueabi- distclean
make --directory=/mnt/big1/linux ARCH=arm CROSS_COMPILE=arm-buildroot-linux-uclibcgnueabi- defconfig
/mnt/big1/linux/scripts/config --file /mnt/big1/linux/.config --module CONFIG_BTRFS_FS
/mnt/big1/linux/scripts/config --file /mnt/big1/linux/.config --enable CONFIG_BTRFS_FS_POSIX_ACL
/mnt/big1/linux/scripts/config --file /mnt/big1/linux/.config --enable CONFIG_BTRFS_FS_CHECK_INTEGRITY
/mnt/big1/linux/scripts/config --file /mnt/big1/linux/.config --enable CONFIG_BTRFS_FS_RUN_SANITY_TESTS
/mnt/big1/linux/scripts/config --file /mnt/big1/linux/.config --enable CONFIG_BTRFS_DEBUG
/mnt/big1/linux/scripts/config --file /mnt/big1/linux/.config --enable CONFIG_BTRFS_ASSERT
make --directory=/mnt/big1/linux ARCH=arm CROSS_COMPILE=arm-buildroot-linux-uclibcgnueabi- -j8 fs/btrfs/

Thanks
Zhaolei
