2015-09-11 20:30:28

by Anna Schumaker

[permalink] [raw]
Subject: [PATCH v2 0/9] VFS: In-kernel copy system call

Copy system calls came up during Plumbers a couple of weeks ago, because
several filesystems (including NFS and XFS) are currently working on copy
acceleration implementations. We haven't heard from Zach Brown in a while,
so I volunteered to push his patches upstream so individual filesystems
don't need to keep writing their own ioctls.

Changes in v2:
- Update against the most recent Linus kernel
- Fix conflicts due to new system calls
- Remove requirement that inode_in == inode_out
- Drop patch to add mountpoint checking to btrfs
- btrfs already did this check
- Rename COPY_REFLINK -> COPY_FR_REFLINK
- Add COPY_FR_COPY flag
- Expand flags == 0 to (COPY_FR_COPY | COPY_FR_REFLINK)
- Remove checking for invalid flags
- Create a new function for handling pagecache copies
- Move rw_verify_area() checks into the new pagecache-copy function
- Use the return value from rw_verify_area() to set amount of data to copy
- Update man page

I tested the COPY_FR_COPY flag by using /dev/urandom to generate files of
varying sizes and copying them. I compared the output from `time` against
that of `cp` to see if there is any noticable difference. I think there
have been some libvirt changes since my first set of trials, because this
time around cpu usage was down significantly. This time around, VFS copy
was slightly faster than /usr/bin/cp in all cases. Values in the tables
below are averages across multiple trials.


/usr/bin/cp | 512 | 1024 | 1536 | 2048 | 2560 | 3072 | 5120
-------------|--------|--------|--------|--------|--------|--------|--------
user | 0.00s | 0.01s | 0.01s | 0.01s | 0.01s | 0.01s | 0.02s
system | 0.68s | 0.48s | 0.74s | 0.99s | 1.25s | 1.50s | 2.51s
cpu | 34% | 14% | 14% | 15% | 14% | 14% | 15%
total | 1.993 | 3.314 | 4.994 | 6.599 | 8.627 | 10.079 | 16.852


VFS copy | 512 | 1024 | 1536 | 2048 | 2560 | 3072 | 5120
-------------|--------|--------|--------|--------|--------|--------|--------
user | 0.00s | 0.00s | 0.00s | 0.00s | 0.00s | 0.00s | 0.00s
system | 0.65s | 0.46s | 0.70s | 0.93s | 1.18s | 1.41s | 2.37s
cpu | 35% | 14% | 15% | 14% | 14% | 14% | 14%
total | 1.870 | 3.084 | 4.613 | 6.206 | 7.884 | 9.372 | 15.904


Questions? Comments? Thoughts?

Anna


Anna Schumaker (6):
vfs: Copy should check len after file open mode
vfs: Copy shouldn't forbid ranges inside the same file
vfs: Copy should use file_out rather than file_in
vfs: Remove copy_file_range mountpoint checks
vfs: copy_file_range() can do a pagecache copy with splice
btrfs: btrfs_copy_file_range() only supports reflinks

Zach Brown (3):
vfs: add copy_file_range syscall and vfs helper
x86: add sys_copy_file_range to syscall tables
btrfs: add .copy_file_range file operation

arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
fs/btrfs/ctree.h | 3 +
fs/btrfs/file.c | 1 +
fs/btrfs/ioctl.c | 95 ++++++++++++++----------
fs/read_write.c | 132 +++++++++++++++++++++++++++++++++
include/linux/copy.h | 6 ++
include/linux/fs.h | 3 +
include/uapi/asm-generic/unistd.h | 4 +-
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/copy.h | 7 ++
kernel/sys_ni.c | 1 +
12 files changed, 215 insertions(+), 40 deletions(-)
create mode 100644 include/linux/copy.h
create mode 100644 include/uapi/linux/copy.h

--
2.5.1



2015-09-11 20:30:50

by Anna Schumaker

[permalink] [raw]
Subject: [PATCH v2 5/9] vfs: Copy shouldn't forbid ranges inside the same file

This is perfectly valid for BTRFS and XFS, so let's leave this up to
filesystems to check.

Signed-off-by: Anna Schumaker <[email protected]>
---
fs/read_write.c | 4 ----
1 file changed, 4 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 38cc251..d32549b 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1371,10 +1371,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
file_in->f_path.mnt != file_out->f_path.mnt)
return -EXDEV;

- /* forbid ranges in the same file */
- if (inode_in == inode_out)
- return -EINVAL;
-
if (len == 0)
return 0;

--
2.5.1


2015-09-11 20:30:46

by Anna Schumaker

[permalink] [raw]
Subject: [PATCH v2 10/9] copy_file_range.2: New page documenting copy_file_range()

copy_file_range() is a new system call for copying ranges of data
completely in the kernel. This gives filesystems an opportunity to
implement some kind of "copy acceleration", such as reflinks or
server-side-copy (in the case of NFS).

Signed-off-by: Anna Schumaker <[email protected]>
---
man2/copy_file_range.2 | 188 +++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 188 insertions(+)
create mode 100644 man2/copy_file_range.2

diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
new file mode 100644
index 0000000..84912b5
--- /dev/null
+++ b/man2/copy_file_range.2
@@ -0,0 +1,188 @@
+.\"This manpage is Copyright (C) 2015 Anna Schumaker <[email protected]>
+.TH COPY 2 2015-8-31 "Linux" "Linux Programmer's Manual"
+.SH NAME
+copy_file_range \- Copy a range of data from one file to another
+.SH SYNOPSIS
+.nf
+.B #include <linux/copy.h>
+.B #include <sys/syscall.h>
+.B #include <unistd.h>
+
+.BI "ssize_t syscall(__NR_copy_file_range, int " fd_in ", loff_t * " off_in ",
+.BI " int " fd_out ", loff_t * " off_out ", size_t " len ",
+.BI " unsigned int " flags );
+.fi
+.SH DESCRIPTION
+The
+.BR copy_file_range ()
+system call performs an in-kernel copy between two file descriptors
+without all that tedious mucking about in userspace.
+It copies up to
+.I len
+bytes of data from file descriptor
+.I fd_in
+to file descriptor
+.IR fd_out ,
+overwriting any data that exists within the requested range.
+
+The following semantics apply for
+.IR off_in ,
+and similar statements apply to
+.IR off_out :
+.IP * 3
+If
+.I off_in
+is NULL, then bytes are read from
+.I fd_in
+starting from the current file offset and the current
+file offset is adjusted appropriately.
+.IP *
+If
+.I off_in
+is not NULL, then
+.I off_in
+must point to a buffer that specifies the starting
+offset where bytes from
+.I fd_in
+will be read. The current file offset of
+.I fd_in
+is not changed, but
+.I off_in
+is adjusted appropriately.
+.PP
+
+The
+.I flags
+argument is a bit mask composed by OR-ing together zero
+or more of the following flags:
+.TP 1.9i
+.B COPY_FR_COPY
+Copy all the file data in the requested range.
+Some filesystems, like NFS, might be able to accelerate this copy
+to avoid unnecessary data transfers.
+.TP
+.B COPY_FR_REFLINK
+Create a lightweight "reflink", where data is not copied until
+one of the files is modified.
+.PP
+The default behavior
+.RI ( flags
+== 0) is to try creating a reflink,
+and if reflinking fails
+.BR copy_file_range ()
+will fall back on performing a full data copy.
+This is equivalent to setting
+.I flags
+equal to
+.RB ( COPY_FR_COPY | COPY_FR_REFLINK ).
+.SH RETURN VALUE
+Upon successful completion,
+.BR copy_file_range ()
+will return the number of bytes copied between files.
+This could be less than the length originally requested.
+
+On error,
+.BR copy_file_range ()
+returns \-1 and
+.I errno
+is set to indicate the error.
+.SH ERRORS
+.TP
+.B EBADF
+One or more file descriptors are not valid,
+or do not have proper read-write mode;
+.I fd_in
+is not open for reading; or
+.I fd_out
+is not open for writing.
+.TP
+.B EINVAL
+Requested range extends beyond the end of the file; or the
+.I flags
+argument is set to an invalid value.
+.TP
+.B EIO
+A low level I/O error occurred while copying.
+.TP
+.B ENOMEM
+Out of memory.
+.TP
+.B ENOSPC
+There is not enough space to complete the copy.
+.TP
+.B EOPNOTSUPP
+.B COPY_REFLINK
+was specified in
+.IR flags ,
+but the target filesystem does not support reflinks.
+.TP
+.B EXDEV
+Target filesystem doesn't support cross-filesystem copies.
+.SH VERSIONS
+The
+.BR copy_file_range ()
+system call first appeared in Linux 4.4.
+.SH CONFORMING TO
+The
+.BR copy_file_range ()
+system call is a nonstandard Linux extension.
+.SH EXAMPLE
+.nf
+
+#define _GNU_SOURCE
+#include <fcntl.h>
+#include <linux/copy.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <sys/stat.h>
+#include <sys/syscall.h>
+#include <unistd.h>
+
+
+int main(int argc, char **argv)
+{
+ int fd_in, fd_out;
+ struct stat stat;
+ loff_t len, ret;
+
+ if (argc != 3) {
+ fprintf(stderr, "Usage: %s <source> <destination>\\n", argv[0]);
+ exit(EXIT_FAILURE);
+ }
+
+ fd_in = open(argv[1], O_RDONLY);
+ if (fd_in == -1) {
+ perror("open (argv[1])");
+ exit(EXIT_FAILURE);
+ }
+
+ if (fstat(fd_in, &stat) == -1) {
+ perror("fstat");
+ exit(EXIT_FAILURE);
+ }
+ len = stat.st_size;
+
+ fd_out = creat(argv[2], 0644);
+ if (fd_out == -1) {
+ perror("creat (argv[2])");
+ exit(EXIT_FAILURE);
+ }
+
+ do {
+ ret = syscall(__NR_copy_file_range, fd_in, NULL,
+ fd_out, NULL, len, 0);
+ if (ret == -1) {
+ perror("copy_file_range");
+ exit(EXIT_FAILURE);
+ }
+
+ len -= ret;
+ } while (len > 0);
+
+ close(fd_in);
+ close(fd_out);
+ exit(EXIT_SUCCESS);
+}
+.fi
+.SH SEE ALSO
+.BR splice (2)
--
2.5.1


2015-09-11 20:30:44

by Anna Schumaker

[permalink] [raw]
Subject: [PATCH v2 9/9] btrfs: btrfs_copy_file_range() only supports reflinks

Reject copies that don't have the COPY_FR_REFLINK flag set.

Signed-off-by: Anna Schumaker <[email protected]>
---
fs/btrfs/ioctl.c | 4 ++++
1 file changed, 4 insertions(+)

diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 4311554..2e14b91 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -44,6 +44,7 @@
#include <linux/uuid.h>
#include <linux/btrfs.h>
#include <linux/uaccess.h>
+#include <linux/copy.h>
#include "ctree.h"
#include "disk-io.h"
#include "transaction.h"
@@ -3848,6 +3849,9 @@ ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
{
ssize_t ret;

+ if (!(flags & COPY_FR_REFLINK))
+ return -EOPNOTSUPP;
+
ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
if (ret == 0)
ret = len;
--
2.5.1


2015-09-11 20:30:40

by Anna Schumaker

[permalink] [raw]
Subject: [PATCH v2 7/9] vfs: Remove copy_file_range mountpoint checks

I still want to do an in-kernel copy even if the files are on different
mountpoints, and NFS has a "server to server" copy that expects two
files on different mountpoints. Let's have individual filesystems
implement this check instead.

Signed-off-by: Anna Schumaker <[email protected]>
---
fs/read_write.c | 5 -----
1 file changed, 5 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index ac32388..363bd3e 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1366,11 +1366,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
pos_in + len > i_size_read(inode_in))
return -EINVAL;

- /* this could be relaxed once a method supports cross-fs copies */
- if (inode_in->i_sb != inode_out->i_sb ||
- file_in->f_path.mnt != file_out->f_path.mnt)
- return -EXDEV;
-
if (len == 0)
return 0;

--
2.5.1


2015-09-11 20:30:43

by Anna Schumaker

[permalink] [raw]
Subject: [PATCH v2 8/9] vfs: copy_file_range() can do a pagecache copy with splice

The NFS server will need some kind offallback for filesystems that don't
have any kind of copy acceleration, and it should be generally useful to
have an in-kernel copy to avoid lots of switches between kernel and user
space.

I make this configurable by adding two new flags. Users who only want a
reflink can pass COPY_FR_REFLINK, and users who want a full data copy can
pass COPY_FR_COPY. The default (flags=0) means to first attempt a
reflink, but use the pagecache if that fails.

I moved the rw_verify_area() calls into the fallback code since some
filesystems can handle reflinking a large range.

Signed-off-by: Anna Schumaker <[email protected]>
---
v2:
- Rename COPY_REFLINK -> COPY_FR_REFLINK
- Introduce COPY_FR_COPY flag
- Flags == 0 is really COPY_FR_COPY|COPY_FR_REFLINK
- Drop check for invalid flags
- Move call to do_splice_direct() into a new function
- Move rw_verify_area() checks into the new fallback function
---
fs/read_write.c | 56 ++++++++++++++++++++++++++++-------------------
include/linux/copy.h | 6 +++++
include/uapi/linux/Kbuild | 1 +
include/uapi/linux/copy.h | 7 ++++++
4 files changed, 48 insertions(+), 22 deletions(-)
create mode 100644 include/linux/copy.h
create mode 100644 include/uapi/linux/copy.h

diff --git a/fs/read_write.c b/fs/read_write.c
index 363bd3e..ba24884 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -7,6 +7,7 @@
#include <linux/slab.h>
#include <linux/stat.h>
#include <linux/fcntl.h>
+#include <linux/copy.h>
#include <linux/file.h>
#include <linux/uio.h>
#include <linux/fsnotify.h>
@@ -1329,6 +1330,29 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
}
#endif

+static ssize_t vfs_copy_file_pagecache(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len)
+{
+ ssize_t ret;
+
+ ret = rw_verify_area(READ, file_in, &pos_in, len);
+ if (ret >= 0) {
+ len = ret;
+ ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+ if (ret >= 0)
+ len = ret;
+ }
+ if (ret < 0)
+ return ret;
+
+ file_start_write(file_out);
+ ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out, len, 0);
+ file_end_write(file_out);
+
+ return ret;
+}
+
/*
* copy_file_range() differs from regular file read and write in that it
* specifically allows return partial success. When it does so is up to
@@ -1338,34 +1362,17 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t len, int flags)
{
- struct inode *inode_in;
- struct inode *inode_out;
ssize_t ret;

- if (flags)
- return -EINVAL;
-
- /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT */
- ret = rw_verify_area(READ, file_in, &pos_in, len);
- if (ret >= 0)
- ret = rw_verify_area(WRITE, file_out, &pos_out, len);
- if (ret < 0)
- return ret;
+ if (flags == 0)
+ flags = COPY_FR_COPY | COPY_FR_REFLINK;

if (!(file_in->f_mode & FMODE_READ) ||
!(file_out->f_mode & FMODE_WRITE) ||
(file_out->f_flags & O_APPEND) ||
- !file_out->f_op || !file_out->f_op->copy_file_range)
+ !file_in->f_op)
return -EBADF;

- inode_in = file_inode(file_in);
- inode_out = file_inode(file_out);
-
- /* make sure offsets don't wrap and the input is inside i_size */
- if (pos_in + len < pos_in || pos_out + len < pos_out ||
- pos_in + len > i_size_read(inode_in))
- return -EINVAL;
-
if (len == 0)
return 0;

@@ -1373,8 +1380,13 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (ret)
return ret;

- ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
- len, flags);
+ ret = -EOPNOTSUPP;
+ if (file_out->f_op->copy_file_range)
+ ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
+ pos_out, len, flags);
+ if ((ret < 0) && (flags & COPY_FR_COPY))
+ ret = vfs_copy_file_pagecache(file_in, pos_in, file_out,
+ pos_out, len);
if (ret > 0) {
fsnotify_access(file_in);
add_rchar(current, ret);
diff --git a/include/linux/copy.h b/include/linux/copy.h
new file mode 100644
index 0000000..fd54543
--- /dev/null
+++ b/include/linux/copy.h
@@ -0,0 +1,6 @@
+#ifndef _LINUX_COPY_H
+#define _LINUX_COPY_H
+
+#include <uapi/linux/copy.h>
+
+#endif /* _LINUX_COPY_H */
diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
index 70ff1d9..d46830a 100644
--- a/include/uapi/linux/Kbuild
+++ b/include/uapi/linux/Kbuild
@@ -90,6 +90,7 @@ header-y += coda_psdev.h
header-y += coff.h
header-y += connector.h
header-y += const.h
+header-y += copy.h
header-y += cramfs_fs.h
header-y += cuda.h
header-y += cyclades.h
diff --git a/include/uapi/linux/copy.h b/include/uapi/linux/copy.h
new file mode 100644
index 0000000..2da59a8
--- /dev/null
+++ b/include/uapi/linux/copy.h
@@ -0,0 +1,7 @@
+#ifndef _UAPI_LINUX_COPY_H
+#define _UAPI_LINUX_COPY_H
+
+#define COPY_FR_COPY (1 << 0) /* Only do a pagecache copy. */
+#define COPY_FR_REFLINK (1 << 1) /* Only make a reflink. */
+
+#endif /* _UAPI_LINUX_COPY_H */
--
2.5.1


2015-09-11 20:30:38

by Anna Schumaker

[permalink] [raw]
Subject: [PATCH v2 6/9] vfs: Copy should use file_out rather than file_in

The way to think about this is that the destination filesystem reads the
data from the source file and processes it accordingly. This is
especially important to avoid an infinate loop when doing a "server to
server" copy on NFS.

Signed-off-by: Anna Schumaker <[email protected]>
---
fs/read_write.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index d32549b..ac32388 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1355,7 +1355,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (!(file_in->f_mode & FMODE_READ) ||
!(file_out->f_mode & FMODE_WRITE) ||
(file_out->f_flags & O_APPEND) ||
- !file_in->f_op || !file_in->f_op->copy_file_range)
+ !file_out->f_op || !file_out->f_op->copy_file_range)
return -EBADF;

inode_in = file_inode(file_in);
@@ -1378,8 +1378,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (ret)
return ret;

- ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
- len, flags);
+ ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+ len, flags);
if (ret > 0) {
fsnotify_access(file_in);
add_rchar(current, ret);
--
2.5.1


2015-09-11 20:30:35

by Anna Schumaker

[permalink] [raw]
Subject: [PATCH v2 4/9] vfs: Copy should check len after file open mode

I don't think it makes sense to report that a copy succeeded if the
files aren't open properly.

Signed-off-by: Anna Schumaker <[email protected]>
---
fs/read_write.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 82c4933..38cc251 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1345,9 +1345,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (flags)
return -EINVAL;

- if (len == 0)
- return 0;
-
/* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT */
ret = rw_verify_area(READ, file_in, &pos_in, len);
if (ret >= 0)
@@ -1378,6 +1375,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (inode_in == inode_out)
return -EINVAL;

+ if (len == 0)
+ return 0;
+
ret = mnt_want_write_file(file_out);
if (ret)
return ret;
--
2.5.1


2015-09-11 20:30:31

by Anna Schumaker

[permalink] [raw]
Subject: [PATCH v2 2/9] x86: add sys_copy_file_range to syscall tables

From: Zach Brown <[email protected]>

Add sys_copy_file_range to the x86 syscall tables.

Signed-off-by: Zach Brown <[email protected]>
[Anna Schumaker: Update syscall number in syscall_32.tbl]
Signed-off-by: Anna Schumaker <[email protected]>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index 477bfa6..6867783 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -381,3 +381,4 @@
372 i386 recvmsg sys_recvmsg compat_sys_recvmsg
373 i386 shutdown sys_shutdown
374 i386 userfaultfd sys_userfaultfd
+375 i386 copy_file_range sys_copy_file_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 81c4906..23baaa5 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -330,6 +330,7 @@
321 common bpf sys_bpf
322 64 execveat stub_execveat
323 common userfaultfd sys_userfaultfd
+324 common copy_file_range sys_copy_file_range

#
# x32-specific system call numbers start at 512 to avoid cache impact
--
2.5.1


2015-09-11 20:30:33

by Anna Schumaker

[permalink] [raw]
Subject: [PATCH v2 3/9] btrfs: add .copy_file_range file operation

From: Zach Brown <[email protected]>

This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_file_range file operation can call the core loop that copies file
data extent items.

The extent copying loop is lifted up into its own function. It retains
the core btrfs error checks that should be shared.

Signed-off-by: Zach Brown <[email protected]>
Signed-off-by: Anna Schumaker <[email protected]>
Reviewed-by: Josef Bacik <[email protected]>
Reviewed-by: David Sterba <[email protected]>
---
fs/btrfs/ctree.h | 3 ++
fs/btrfs/file.c | 1 +
fs/btrfs/ioctl.c | 91 ++++++++++++++++++++++++++++++++------------------------
3 files changed, 56 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index 938efe3..5d06a4f 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -3996,6 +3996,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
loff_t pos, size_t write_bytes,
struct extent_state **cached);
int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, int flags);

/* tree-defrag.c */
int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b823fac..b05449c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2816,6 +2816,7 @@ const struct file_operations btrfs_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = btrfs_ioctl,
#endif
+ .copy_file_range = btrfs_copy_file_range,
};

void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0adf542..4311554 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3727,17 +3727,16 @@ out:
return ret;
}

-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
- u64 off, u64 olen, u64 destoff)
+static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
+ u64 off, u64 olen, u64 destoff)
{
struct inode *inode = file_inode(file);
+ struct inode *src = file_inode(file_src);
struct btrfs_root *root = BTRFS_I(inode)->root;
- struct fd src_file;
- struct inode *src;
int ret;
u64 len = olen;
u64 bs = root->fs_info->sb->s_blocksize;
- int same_inode = 0;
+ int same_inode = src == inode;

/*
* TODO:
@@ -3750,49 +3749,20 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
* be either compressed or non-compressed.
*/

- /* the destination must be opened for writing */
- if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
- return -EINVAL;
-
if (btrfs_root_readonly(root))
return -EROFS;

- ret = mnt_want_write_file(file);
- if (ret)
- return ret;
-
- src_file = fdget(srcfd);
- if (!src_file.file) {
- ret = -EBADF;
- goto out_drop_write;
- }
-
- ret = -EXDEV;
- if (src_file.file->f_path.mnt != file->f_path.mnt)
- goto out_fput;
-
- src = file_inode(src_file.file);
-
- ret = -EINVAL;
- if (src == inode)
- same_inode = 1;
-
- /* the src must be open for reading */
- if (!(src_file.file->f_mode & FMODE_READ))
- goto out_fput;
+ if (file_src->f_path.mnt != file->f_path.mnt ||
+ src->i_sb != inode->i_sb)
+ return -EXDEV;

/* don't make the dst file partly checksummed */
if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
- goto out_fput;
+ return -EINVAL;

- ret = -EISDIR;
if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
- goto out_fput;
-
- ret = -EXDEV;
- if (src->i_sb != inode->i_sb)
- goto out_fput;
+ return -EISDIR;

if (!same_inode) {
btrfs_double_inode_lock(src, inode);
@@ -3869,6 +3839,49 @@ out_unlock:
btrfs_double_inode_unlock(src, inode);
else
mutex_unlock(&src->i_mutex);
+ return ret;
+}
+
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, int flags)
+{
+ ssize_t ret;
+
+ ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
+ if (ret == 0)
+ ret = len;
+ return ret;
+}
+
+static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
+ u64 off, u64 olen, u64 destoff)
+{
+ struct fd src_file;
+ int ret;
+
+ /* the destination must be opened for writing */
+ if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
+ return -EINVAL;
+
+ ret = mnt_want_write_file(file);
+ if (ret)
+ return ret;
+
+ src_file = fdget(srcfd);
+ if (!src_file.file) {
+ ret = -EBADF;
+ goto out_drop_write;
+ }
+
+ /* the src must be open for reading */
+ if (!(src_file.file->f_mode & FMODE_READ)) {
+ ret = -EINVAL;
+ goto out_fput;
+ }
+
+ ret = btrfs_clone_files(file, src_file.file, off, olen, destoff);
+
out_fput:
fdput(src_file);
out_drop_write:
--
2.5.1


2015-09-11 20:30:30

by Anna Schumaker

[permalink] [raw]
Subject: [PATCH v2 1/9] vfs: add copy_file_range syscall and vfs helper

From: Zach Brown <[email protected]>

Add a copy_file_range() system call for offloading copies between
regular files.

This gives an interface to underlying layers of the storage stack which
can copy without reading and writing all the data. There are a few
candidates that should support copy offloading in the nearer term:

- btrfs shares extent references with its clone ioctl
- NFS has patches to add a COPY command which copies on the server
- SCSI has a family of XCOPY commands which copy in the device

This system call avoids the complexity of also accelerating the creation
of the destination file by operating on an existing destination file
descriptor, not a path.

Currently the high level vfs entry point limits copy offloading to files
on the same mount and super (and not in the same file). This can be
relaxed if we get implementations which can copy between file systems
safely.

Signed-off-by: Zach Brown <[email protected]>
[Anna Schumaker: Change -EINVAL to -EBADF during file verification]
Signed-off-by: Anna Schumaker <[email protected]>
---
fs/read_write.c | 129 ++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 3 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/sys_ni.c | 1 +
4 files changed, 136 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 819ef3f..82c4933 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
#include <linux/pagemap.h>
#include <linux/splice.h>
#include <linux/compat.h>
+#include <linux/mount.h>
#include "internal.h"

#include <asm/uaccess.h>
@@ -1327,3 +1328,131 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
return do_sendfile(out_fd, in_fd, NULL, count, 0);
}
#endif
+
+/*
+ * copy_file_range() differs from regular file read and write in that it
+ * specifically allows return partial success. When it does so is up to
+ * the copy_file_range method.
+ */
+ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, int flags)
+{
+ struct inode *inode_in;
+ struct inode *inode_out;
+ ssize_t ret;
+
+ if (flags)
+ return -EINVAL;
+
+ if (len == 0)
+ return 0;
+
+ /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT */
+ ret = rw_verify_area(READ, file_in, &pos_in, len);
+ if (ret >= 0)
+ ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+ if (ret < 0)
+ return ret;
+
+ if (!(file_in->f_mode & FMODE_READ) ||
+ !(file_out->f_mode & FMODE_WRITE) ||
+ (file_out->f_flags & O_APPEND) ||
+ !file_in->f_op || !file_in->f_op->copy_file_range)
+ return -EBADF;
+
+ inode_in = file_inode(file_in);
+ inode_out = file_inode(file_out);
+
+ /* make sure offsets don't wrap and the input is inside i_size */
+ if (pos_in + len < pos_in || pos_out + len < pos_out ||
+ pos_in + len > i_size_read(inode_in))
+ return -EINVAL;
+
+ /* this could be relaxed once a method supports cross-fs copies */
+ if (inode_in->i_sb != inode_out->i_sb ||
+ file_in->f_path.mnt != file_out->f_path.mnt)
+ return -EXDEV;
+
+ /* forbid ranges in the same file */
+ if (inode_in == inode_out)
+ return -EINVAL;
+
+ ret = mnt_want_write_file(file_out);
+ if (ret)
+ return ret;
+
+ ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+ len, flags);
+ if (ret > 0) {
+ fsnotify_access(file_in);
+ add_rchar(current, ret);
+ fsnotify_modify(file_out);
+ add_wchar(current, ret);
+ }
+ inc_syscr(current);
+ inc_syscw(current);
+
+ mnt_drop_write_file(file_out);
+
+ return ret;
+}
+EXPORT_SYMBOL(vfs_copy_file_range);
+
+SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
+ int, fd_out, loff_t __user *, off_out,
+ size_t, len, unsigned int, flags)
+{
+ loff_t pos_in;
+ loff_t pos_out;
+ struct fd f_in;
+ struct fd f_out;
+ ssize_t ret;
+
+ f_in = fdget(fd_in);
+ f_out = fdget(fd_out);
+ if (!f_in.file || !f_out.file) {
+ ret = -EBADF;
+ goto out;
+ }
+
+ ret = -EFAULT;
+ if (off_in) {
+ if (copy_from_user(&pos_in, off_in, sizeof(loff_t)))
+ goto out;
+ } else {
+ pos_in = f_in.file->f_pos;
+ }
+
+ if (off_out) {
+ if (copy_from_user(&pos_out, off_out, sizeof(loff_t)))
+ goto out;
+ } else {
+ pos_out = f_out.file->f_pos;
+ }
+
+ ret = vfs_copy_file_range(f_in.file, pos_in, f_out.file, pos_out, len,
+ flags);
+ if (ret > 0) {
+ pos_in += ret;
+ pos_out += ret;
+
+ if (off_in) {
+ if (copy_to_user(off_in, &pos_in, sizeof(loff_t)))
+ ret = -EFAULT;
+ } else {
+ f_in.file->f_pos = pos_in;
+ }
+
+ if (off_out) {
+ if (copy_to_user(off_out, &pos_out, sizeof(loff_t)))
+ ret = -EFAULT;
+ } else {
+ f_out.file->f_pos = pos_out;
+ }
+ }
+out:
+ fdput(f_in);
+ fdput(f_out);
+ return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 72d8a84..4c70582 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1642,6 +1642,7 @@ struct file_operations {
#ifndef CONFIG_MMU
unsigned (*mmap_capabilities)(struct file *);
#endif
+ ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, int);
};

struct inode_operations {
@@ -1695,6 +1696,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
unsigned long, loff_t *);
extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
unsigned long, loff_t *);
+extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
+ loff_t, size_t, int);

struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e016bd9..2b60f0c 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
__SYSCALL(__NR_bpf, sys_bpf)
#define __NR_execveat 281
__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
+#define __NR_copy_file_range 282
+__SYSCALL(__NR_copy_file_range, sys_copy_file_range)

#undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 283

/*
* All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 03c3875..7f53519 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -174,6 +174,7 @@ cond_syscall(sys_setfsuid);
cond_syscall(sys_setfsgid);
cond_syscall(sys_capget);
cond_syscall(sys_capset);
+cond_syscall(sys_copy_file_range);

/* arch-specific weak syscall entries */
cond_syscall(sys_pciconfig_read);
--
2.5.1


Subject: Re: [PATCH v2 10/9] copy_file_range.2: New page documenting copy_file_range()

Hi Anna,

On 09/11/2015 10:30 PM, Anna Schumaker wrote:
> copy_file_range() is a new system call for copying ranges of data
> completely in the kernel. This gives filesystems an opportunity to
> implement some kind of "copy acceleration", such as reflinks or
> server-side-copy (in the case of NFS).
>
> Signed-off-by: Anna Schumaker <[email protected]>

Thanks for writing such a nice first draft man page! I have a few
comments below. Would you be willing to revise and resubmit?

> ---
> man2/copy_file_range.2 | 188 +++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 188 insertions(+)
> create mode 100644 man2/copy_file_range.2
>
> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> new file mode 100644
> index 0000000..84912b5
> --- /dev/null
> +++ b/man2/copy_file_range.2
> @@ -0,0 +1,188 @@
> +.\"This manpage is Copyright (C) 2015 Anna Schumaker <[email protected]>

We need a license for this page. Please see
https://www.kernel.org/doc/man-pages/licenses.html

> +.TH COPY 2 2015-8-31 "Linux" "Linux Programmer's Manual"

Make the month 2 digits (leading 0).

> +.SH NAME
> +copy_file_range \- Copy a range of data from one file to another
> +.SH SYNOPSIS
> +.nf
> +.B #include <linux/copy.h>
> +.B #include <sys/syscall.h>
> +.B #include <unistd.h>
> +
> +.BI "ssize_t syscall(__NR_copy_file_range, int " fd_in ", loff_t * " off_in ",
> +.BI " int " fd_out ", loff_t * " off_out ", size_t " len ",
> +.BI " unsigned int " flags );

Remove spaces following "*" in the lines above. (man-pages convention for code)

I know that the copy_file_range() (obviously) doesn't yet have a wrapper
in glibc, but in the man pages we document all system calls as though there
is a wrapper. See, for example, gettid(2), for an axample of how this is done
(a note in the SYNOPSIS and then some further details in NOTES).

> +.fi
> +.SH DESCRIPTION
> +The
> +.BR copy_file_range ()
> +system call performs an in-kernel copy between two file descriptors
> +without all that tedious mucking about in userspace.

I'd write that last piece as:

"without the cost of (a loop) transferring data from the kernel to a
user-space buffer and then back to the kernel again.

> +It copies up to
> +.I len
> +bytes of data from file descriptor
> +.I fd_in
> +to file descriptor
> +.IR fd_out ,
> +overwriting any data that exists within the requested range.

s/.$/ of the target file./

> +
> +The following semantics apply for
> +.IR off_in ,
> +and similar statements apply to
> +.IR off_out :
> +.IP * 3
> +If
> +.I off_in
> +is NULL, then bytes are read from
> +.I fd_in
> +starting from the current file offset and the current
> +file offset is adjusted appropriately.
> +.IP *
> +If
> +.I off_in
> +is not NULL, then
> +.I off_in
> +must point to a buffer that specifies the starting
> +offset where bytes from
> +.I fd_in
> +will be read. The current file offset of
> +.I fd_in
> +is not changed, but
> +.I off_in
> +is adjusted appropriately.
> +.PP
> +
> +The
> +.I flags
> +argument is a bit mask composed by OR-ing together zero
> +or more of the following flags:
> +.TP 1.9i
> +.B COPY_FR_COPY
> +Copy all the file data in the requested range.
> +Some filesystems, like NFS, might be able to accelerate this copy
> +to avoid unnecessary data transfers.
> +.TP
> +.B COPY_FR_REFLINK
> +Create a lightweight "reflink", where data is not copied until
> +one of the files is modified.
> +.PP
> +The default behavior
> +.RI ( flags
> +== 0) is to try creating a reflink,
> +and if reflinking fails
> +.BR copy_file_range ()
> +will fall back on performing a full data copy.

s/back on/back to/

> +This is equivalent to setting
> +.I flags
> +equal to
> +.RB ( COPY_FR_COPY | COPY_FR_REFLINK ).

So, from an API deign perspective, the interoperation of these two
flags appears odd. Bit flags are commonly (not always) orthogonal.
I think here it could be pointed out a little more explicitly that
these two flags are not orthogonal. In particular, perhaps after the
last sentence, you could add another sentence:

[[
(This contrasts with specifying
.I flags
as just
.BR COPY_FR_REFLINK ,
which causes the call to create a reflink,
and fail if that is not possible,
rather than falling back to a full data copy.)
]]

Furthermore, I even wonder if explicitly specifying flags as
COPY_FR_COPY | COPY_FR_REFLINK should just generate an EINVAL
error. 0 already gives us the behavior described above,
and allowing the combination COPY_FR_COPY | COPY_FR_REFLINK
perhaps just contributes to misleading the user that these
flags are orthogonal, when in reality they are not. What do
you think?

What are the semantics with respect to signals, especially if data
copying a very large file range? This info should be included in the
man page, probably under NOTES.

> +.SH RETURN VALUE
> +Upon successful completion,
> +.BR copy_file_range ()
> +will return the number of bytes copied between files.
> +This could be less than the length originally requested.
> +
> +On error,
> +.BR copy_file_range ()
> +returns \-1 and
> +.I errno
> +is set to indicate the error.
> +.SH ERRORS
> +.TP
> +.B EBADF
> +One or more file descriptors are not valid,
> +or do not have proper read-write mode;

I think that last line can go. I mean, isn't this point (again)
covered in the next few lines?

> +.I fd_in
> +is not open for reading; or
> +.I fd_out
> +is not open for writing.
> +.TP
> +.B EINVAL
> +Requested range extends beyond the end of the file; or the

s/file/source file/ (right?)

> +.I flags
> +argument is set to an invalid value.
> +.TP
> +.B EIO
> +A low level I/O error occurred while copying.
> +.TP
> +.B ENOMEM
> +Out of memory.
> +.TP
> +.B ENOSPC
> +There is not enough space to complete the copy.

Space where? On the filesystem?
=> s/space/space on the target filesystem/

> +.TP
> +.B EOPNOTSUPP
> +.B COPY_REFLINK
> +was specified in
> +.IR flags ,
> +but the target filesystem does not support reflinks.
> +.TP
> +.B EXDEV
> +Target filesystem doesn't support cross-filesystem copies.

I'm curious. What are some examples of filesystems that produce this
error?

> +.SH VERSIONS
> +The
> +.BR copy_file_range ()
> +system call first appeared in Linux 4.4.
> +.SH CONFORMING TO
> +The
> +.BR copy_file_range ()
> +system call is a nonstandard Linux extension.
> +.SH EXAMPLE
> +.nf
> +
> +#define _GNU_SOURCE
> +#include <fcntl.h>
> +#include <linux/copy.h>
> +#include <stdio.h>
> +#include <stdlib.h>
> +#include <sys/stat.h>
> +#include <sys/syscall.h>
> +#include <unistd.h>
> +
> +
> +int main(int argc, char **argv)
> +{
> + int fd_in, fd_out;
> + struct stat stat;
> + loff_t len, ret;
> +
> + if (argc != 3) {
> + fprintf(stderr, "Usage: %s <source> <destination>\\n", argv[0]);
> + exit(EXIT_FAILURE);
> + }
> +
> + fd_in = open(argv[1], O_RDONLY);
> + if (fd_in == -1) {

Please replace all "-" in code by "\-". (This is a groff
detail.)

> + perror("open (argv[1])");
> + exit(EXIT_FAILURE);
> + }
> +
> + if (fstat(fd_in, &stat) == -1) {
> + perror("fstat");
> + exit(EXIT_FAILURE);
> + }
> + len = stat.st_size;
> +
> + fd_out = creat(argv[2], 0644);

These days, I think we should discourage the use of creat(). Maybe
better to use open() instead here?

> + if (fd_out == -1) {
> + perror("creat (argv[2])");
> + exit(EXIT_FAILURE);
> + }
> +
> + do {
> + ret = syscall(__NR_copy_file_range, fd_in, NULL,
> + fd_out, NULL, len, 0);

I'd rather see this as:

int
copy_file_range(int fd_in, loff_t *off_in,
int fd_out, loff_t *off_out, size_t len,
unsigned int flags)
{
return(syscall(__NR_copy_file_range, fd_in, fd_out, off_out, len, flags);
}

...

copy_file_range(fd_in, fd_out, off_out, len, flags);


> + if (ret == -1) {
> + perror("copy_file_range");
> + exit(EXIT_FAILURE);
> + }
> +
> + len -= ret;
> + } while (len > 0);
> +
> + close(fd_in);
> + close(fd_out);
> + exit(EXIT_SUCCESS);
> +}
> +.fi
> +.SH SEE ALSO
> +.BR splice (2)
>

In the next iteration of this patch, could you include a change to
splice(2) so that copy_file_range(2) is added under SEE ALSO in
that page. Also, are there any other pages that we should like
to/from? (sendfile(2)?)

Thanks,

Michael


--
Michael Kerrisk
Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
Linux/UNIX System Programming Training: http://man7.org/training/

2015-09-14 18:33:01

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v2 10/9] copy_file_range.2: New page documenting copy_file_range()

On Sun, Sep 13, 2015 at 09:50:18AM +0200, Michael Kerrisk (man-pages) wrote:
> Hi Anna,
>
> On 09/11/2015 10:30 PM, Anna Schumaker wrote:
> > copy_file_range() is a new system call for copying ranges of data
> > completely in the kernel. This gives filesystems an opportunity to
> > implement some kind of "copy acceleration", such as reflinks or
> > server-side-copy (in the case of NFS).
> >
> > Signed-off-by: Anna Schumaker <[email protected]>
>
> Thanks for writing such a nice first draft man page! I have a few
> comments below. Would you be willing to revise and resubmit?
>
> > ---
> > man2/copy_file_range.2 | 188 +++++++++++++++++++++++++++++++++++++++++++++++++
> > 1 file changed, 188 insertions(+)
> > create mode 100644 man2/copy_file_range.2
> >
> > diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> > new file mode 100644
> > index 0000000..84912b5
> > --- /dev/null
> > +++ b/man2/copy_file_range.2
> > @@ -0,0 +1,188 @@
> > +.\"This manpage is Copyright (C) 2015 Anna Schumaker <[email protected]>
>
> We need a license for this page. Please see
> https://www.kernel.org/doc/man-pages/licenses.html
>
> > +.TH COPY 2 2015-8-31 "Linux" "Linux Programmer's Manual"
>
> Make the month 2 digits (leading 0).
>
> > +.SH NAME
> > +copy_file_range \- Copy a range of data from one file to another
> > +.SH SYNOPSIS
> > +.nf
> > +.B #include <linux/copy.h>
> > +.B #include <sys/syscall.h>
> > +.B #include <unistd.h>
> > +
> > +.BI "ssize_t syscall(__NR_copy_file_range, int " fd_in ", loff_t * " off_in ",
> > +.BI " int " fd_out ", loff_t * " off_out ", size_t " len ",
> > +.BI " unsigned int " flags );
>
> Remove spaces following "*" in the lines above. (man-pages convention for code)
>
> I know that the copy_file_range() (obviously) doesn't yet have a wrapper
> in glibc, but in the man pages we document all system calls as though there
> is a wrapper. See, for example, gettid(2), for an axample of how this is done
> (a note in the SYNOPSIS and then some further details in NOTES).
>
> > +.fi
> > +.SH DESCRIPTION
> > +The
> > +.BR copy_file_range ()
> > +system call performs an in-kernel copy between two file descriptors
> > +without all that tedious mucking about in userspace.
>
> I'd write that last piece as:
>
> "without the cost of (a loop) transferring data from the kernel to a
> user-space buffer and then back to the kernel again.
>
> > +It copies up to
> > +.I len
> > +bytes of data from file descriptor
> > +.I fd_in
> > +to file descriptor
> > +.IR fd_out ,
> > +overwriting any data that exists within the requested range.
>
> s/.$/ of the target file./
>
> > +
> > +The following semantics apply for
> > +.IR off_in ,
> > +and similar statements apply to
> > +.IR off_out :
> > +.IP * 3
> > +If
> > +.I off_in
> > +is NULL, then bytes are read from
> > +.I fd_in
> > +starting from the current file offset and the current
> > +file offset is adjusted appropriately.
> > +.IP *
> > +If
> > +.I off_in
> > +is not NULL, then
> > +.I off_in
> > +must point to a buffer that specifies the starting
> > +offset where bytes from
> > +.I fd_in
> > +will be read. The current file offset of
> > +.I fd_in
> > +is not changed, but
> > +.I off_in
> > +is adjusted appropriately.
> > +.PP
> > +
> > +The
> > +.I flags
> > +argument is a bit mask composed by OR-ing together zero
> > +or more of the following flags:
> > +.TP 1.9i
> > +.B COPY_FR_COPY
> > +Copy all the file data in the requested range.
> > +Some filesystems, like NFS, might be able to accelerate this copy
> > +to avoid unnecessary data transfers.
> > +.TP
> > +.B COPY_FR_REFLINK
> > +Create a lightweight "reflink", where data is not copied until
> > +one of the files is modified.
> > +.PP
> > +The default behavior
> > +.RI ( flags
> > +== 0) is to try creating a reflink,
> > +and if reflinking fails
> > +.BR copy_file_range ()
> > +will fall back on performing a full data copy.
>
> s/back on/back to/
>
> > +This is equivalent to setting
> > +.I flags
> > +equal to
> > +.RB ( COPY_FR_COPY | COPY_FR_REFLINK ).
>
> So, from an API deign perspective, the interoperation of these two
> flags appears odd. Bit flags are commonly (not always) orthogonal.
> I think here it could be pointed out a little more explicitly that
> these two flags are not orthogonal. In particular, perhaps after the
> last sentence, you could add another sentence:
>
> [[
> (This contrasts with specifying
> .I flags
> as just
> .BR COPY_FR_REFLINK ,
> which causes the call to create a reflink,
> and fail if that is not possible,
> rather than falling back to a full data copy.)
> ]]
>
> Furthermore, I even wonder if explicitly specifying flags as
> COPY_FR_COPY | COPY_FR_REFLINK should just generate an EINVAL
> error. 0 already gives us the behavior described above,
> and allowing the combination COPY_FR_COPY | COPY_FR_REFLINK
> perhaps just contributes to misleading the user that these
> flags are orthogonal, when in reality they are not. What do
> you think?

Personally, I think it's a little weird that one turns on reflink with a flag;
turns on regular copy with a different flag; and turns on both by not
specifying either flag. :)

> What are the semantics with respect to signals, especially if data
> copying a very large file range? This info should be included in the
> man page, probably under NOTES.
>
> > +.SH RETURN VALUE
> > +Upon successful completion,
> > +.BR copy_file_range ()
> > +will return the number of bytes copied between files.
> > +This could be less than the length originally requested.
> > +
> > +On error,
> > +.BR copy_file_range ()
> > +returns \-1 and
> > +.I errno
> > +is set to indicate the error.
> > +.SH ERRORS
> > +.TP
> > +.B EBADF
> > +One or more file descriptors are not valid,
> > +or do not have proper read-write mode;
>
> I think that last line can go. I mean, isn't this point (again)
> covered in the next few lines?

Or change the ';' to a ':'?

> > +.I fd_in
> > +is not open for reading; or
> > +.I fd_out
> > +is not open for writing.
> > +.TP
> > +.B EINVAL
> > +Requested range extends beyond the end of the file; or the
>
> s/file/source file/ (right?)
>
> > +.I flags
> > +argument is set to an invalid value.
> > +.TP
> > +.B EIO
> > +A low level I/O error occurred while copying.
> > +.TP
> > +.B ENOMEM
> > +Out of memory.
> > +.TP
> > +.B ENOSPC
> > +There is not enough space to complete the copy.
>
> Space where? On the filesystem?
> => s/space/space on the target filesystem/
>
> > +.TP
> > +.B EOPNOTSUPP
> > +.B COPY_REFLINK
> > +was specified in
> > +.IR flags ,
> > +but the target filesystem does not support reflinks.
> > +.TP
> > +.B EXDEV
> > +Target filesystem doesn't support cross-filesystem copies.
>
> I'm curious. What are some examples of filesystems that produce this
> error?

btrfs and xfs (and probably most local filesystems) don't support cross-fs
copies.

--D

>
> > +.SH VERSIONS
> > +The
> > +.BR copy_file_range ()
> > +system call first appeared in Linux 4.4.
> > +.SH CONFORMING TO
> > +The
> > +.BR copy_file_range ()
> > +system call is a nonstandard Linux extension.
> > +.SH EXAMPLE
> > +.nf
> > +
> > +#define _GNU_SOURCE
> > +#include <fcntl.h>
> > +#include <linux/copy.h>
> > +#include <stdio.h>
> > +#include <stdlib.h>
> > +#include <sys/stat.h>
> > +#include <sys/syscall.h>
> > +#include <unistd.h>
> > +
> > +
> > +int main(int argc, char **argv)
> > +{
> > + int fd_in, fd_out;
> > + struct stat stat;
> > + loff_t len, ret;
> > +
> > + if (argc != 3) {
> > + fprintf(stderr, "Usage: %s <source> <destination>\\n", argv[0]);
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + fd_in = open(argv[1], O_RDONLY);
> > + if (fd_in == -1) {
>
> Please replace all "-" in code by "\-". (This is a groff
> detail.)
>
> > + perror("open (argv[1])");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + if (fstat(fd_in, &stat) == -1) {
> > + perror("fstat");
> > + exit(EXIT_FAILURE);
> > + }
> > + len = stat.st_size;
> > +
> > + fd_out = creat(argv[2], 0644);
>
> These days, I think we should discourage the use of creat(). Maybe
> better to use open() instead here?
>
> > + if (fd_out == -1) {
> > + perror("creat (argv[2])");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + do {
> > + ret = syscall(__NR_copy_file_range, fd_in, NULL,
> > + fd_out, NULL, len, 0);
>
> I'd rather see this as:
>
> int
> copy_file_range(int fd_in, loff_t *off_in,
> int fd_out, loff_t *off_out, size_t len,
> unsigned int flags)
> {
> return(syscall(__NR_copy_file_range, fd_in, fd_out, off_out, len, flags);
> }
>
> ...
>
> copy_file_range(fd_in, fd_out, off_out, len, flags);
>
>
> > + if (ret == -1) {
> > + perror("copy_file_range");
> > + exit(EXIT_FAILURE);
> > + }
> > +
> > + len -= ret;
> > + } while (len > 0);
> > +
> > + close(fd_in);
> > + close(fd_out);
> > + exit(EXIT_SUCCESS);
> > +}
> > +.fi
> > +.SH SEE ALSO
> > +.BR splice (2)
> >
>
> In the next iteration of this patch, could you include a change to
> splice(2) so that copy_file_range(2) is added under SEE ALSO in
> that page. Also, are there any other pages that we should like
> to/from? (sendfile(2)?)
>
> Thanks,
>
> Michael
>
>
> --
> Michael Kerrisk
> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> Linux/UNIX System Programming Training: http://man7.org/training/

2015-09-14 19:02:49

by Austin S Hemmelgarn

[permalink] [raw]
Subject: Re: [PATCH v2 10/9] copy_file_range.2: New page documenting copy_file_range()

On 2015-09-13 03:50, Michael Kerrisk (man-pages) wrote:
> Hi Anna,
>
> On 09/11/2015 10:30 PM, Anna Schumaker wrote:
>> copy_file_range() is a new system call for copying ranges of data
>> completely in the kernel. This gives filesystems an opportunity to
>> implement some kind of "copy acceleration", such as reflinks or
>> server-side-copy (in the case of NFS).
>>
>> Signed-off-by: Anna Schumaker <[email protected]>
>
> Thanks for writing such a nice first draft man page! I have a few
> comments below. Would you be willing to revise and resubmit?
>
>> ---
>> man2/copy_file_range.2 | 188 +++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 188 insertions(+)
>> create mode 100644 man2/copy_file_range.2
>>
>> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
>> new file mode 100644
>> index 0000000..84912b5
>> --- /dev/null
>> +++ b/man2/copy_file_range.2
>> @@ -0,0 +1,188 @@
>> +.\"This manpage is Copyright (C) 2015 Anna Schumaker <[email protected]>
>
> We need a license for this page. Please see
> https://www.kernel.org/doc/man-pages/licenses.html
>
>> +.TH COPY 2 2015-8-31 "Linux" "Linux Programmer's Manual"
>
> Make the month 2 digits (leading 0).
>
>> +.SH NAME
>> +copy_file_range \- Copy a range of data from one file to another
>> +.SH SYNOPSIS
>> +.nf
>> +.B #include <linux/copy.h>
>> +.B #include <sys/syscall.h>
>> +.B #include <unistd.h>
>> +
>> +.BI "ssize_t syscall(__NR_copy_file_range, int " fd_in ", loff_t * " off_in ",
>> +.BI " int " fd_out ", loff_t * " off_out ", size_t " len ",
>> +.BI " unsigned int " flags );
>
> Remove spaces following "*" in the lines above. (man-pages convention for code)
>
> I know that the copy_file_range() (obviously) doesn't yet have a wrapper
> in glibc, but in the man pages we document all system calls as though there
> is a wrapper. See, for example, gettid(2), for an axample of how this is done
> (a note in the SYNOPSIS and then some further details in NOTES).
>
>> +.fi
>> +.SH DESCRIPTION
>> +The
>> +.BR copy_file_range ()
>> +system call performs an in-kernel copy between two file descriptors
>> +without all that tedious mucking about in userspace.
>
> I'd write that last piece as:
>
> "without the cost of (a loop) transferring data from the kernel to a
> user-space buffer and then back to the kernel again.
>
Seconded, this will likely sound odd to someone who has only ever heard
American English, and while I hate to propagate stereotypes, there might
be some less than intelligent speakers of American English who this
could confuse. That, and it's better to be clear about what 'tedious
mucking about in userspace' it's avoiding.
>> +It copies up to
>> +.I len
>> +bytes of data from file descriptor
>> +.I fd_in
>> +to file descriptor
>> +.IR fd_out ,
>> +overwriting any data that exists within the requested range.
>
> s/.$/ of the target file./
>
>> +
>> +The following semantics apply for
>> +.IR off_in ,
>> +and similar statements apply to
>> +.IR off_out :
>> +.IP * 3
>> +If
>> +.I off_in
>> +is NULL, then bytes are read from
>> +.I fd_in
>> +starting from the current file offset and the current
>> +file offset is adjusted appropriately.
>> +.IP *
>> +If
>> +.I off_in
>> +is not NULL, then
>> +.I off_in
>> +must point to a buffer that specifies the starting
>> +offset where bytes from
>> +.I fd_in
>> +will be read. The current file offset of
>> +.I fd_in
>> +is not changed, but
>> +.I off_in
>> +is adjusted appropriately.
>> +.PP
>> +
>> +The
>> +.I flags
>> +argument is a bit mask composed by OR-ing together zero
>> +or more of the following flags:
>> +.TP 1.9i
>> +.B COPY_FR_COPY
>> +Copy all the file data in the requested range.
>> +Some filesystems, like NFS, might be able to accelerate this copy
>> +to avoid unnecessary data transfers.
>> +.TP
>> +.B COPY_FR_REFLINK
>> +Create a lightweight "reflink", where data is not copied until
>> +one of the files is modified.
>> +.PP
>> +The default behavior
>> +.RI ( flags
>> +== 0) is to try creating a reflink,
>> +and if reflinking fails
>> +.BR copy_file_range ()
>> +will fall back on performing a full data copy.
>
> s/back on/back to/
>
>> +This is equivalent to setting
>> +.I flags
>> +equal to
>> +.RB ( COPY_FR_COPY | COPY_FR_REFLINK ).
>
> So, from an API deign perspective, the interoperation of these two
> flags appears odd. Bit flags are commonly (not always) orthogonal.
> I think here it could be pointed out a little more explicitly that
> these two flags are not orthogonal. In particular, perhaps after the
> last sentence, you could add another sentence:
>
> [[
> (This contrasts with specifying
> .I flags
> as just
> .BR COPY_FR_REFLINK ,
> which causes the call to create a reflink,
> and fail if that is not possible,
> rather than falling back to a full data copy.)
> ]]
>
> Furthermore, I even wonder if explicitly specifying flags as
> COPY_FR_COPY | COPY_FR_REFLINK should just generate an EINVAL
> error. 0 already gives us the behavior described above,
> and allowing the combination COPY_FR_COPY | COPY_FR_REFLINK
> perhaps just contributes to misleading the user that these
> flags are orthogonal, when in reality they are not. What do
> you think?
>
> What are the semantics with respect to signals, especially if data
> copying a very large file range? This info should be included in the
> man page, probably under NOTES.
>
>> +.SH RETURN VALUE
>> +Upon successful completion,
>> +.BR copy_file_range ()
>> +will return the number of bytes copied between files.
>> +This could be less than the length originally requested.
>> +
>> +On error,
>> +.BR copy_file_range ()
>> +returns \-1 and
>> +.I errno
>> +is set to indicate the error.
>> +.SH ERRORS
>> +.TP
>> +.B EBADF
>> +One or more file descriptors are not valid,
>> +or do not have proper read-write mode;
>
> I think that last line can go. I mean, isn't this point (again)
> covered in the next few lines?
>
>> +.I fd_in
>> +is not open for reading; or
>> +.I fd_out
>> +is not open for writing.
>> +.TP
>> +.B EINVAL
>> +Requested range extends beyond the end of the file; or the
>
> s/file/source file/ (right?)
>
>> +.I flags
>> +argument is set to an invalid value.
>> +.TP
>> +.B EIO
>> +A low level I/O error occurred while copying.
>> +.TP
>> +.B ENOMEM
>> +Out of memory.
>> +.TP
>> +.B ENOSPC
>> +There is not enough space to complete the copy.
>
> Space where? On the filesystem?
> => s/space/space on the target filesystem/
>
>> +.TP
>> +.B EOPNOTSUPP
>> +.B COPY_REFLINK
>> +was specified in
>> +.IR flags ,
>> +but the target filesystem does not support reflinks.
>> +.TP
>> +.B EXDEV
>> +Target filesystem doesn't support cross-filesystem copies.
>
> I'm curious. What are some examples of filesystems that produce this
> error?
>
>> +.SH VERSIONS
>> +The
>> +.BR copy_file_range ()
>> +system call first appeared in Linux 4.4.
>> +.SH CONFORMING TO
>> +The
>> +.BR copy_file_range ()
>> +system call is a nonstandard Linux extension.
>> +.SH EXAMPLE
>> +.nf
>> +
>> +#define _GNU_SOURCE
>> +#include <fcntl.h>
>> +#include <linux/copy.h>
>> +#include <stdio.h>
>> +#include <stdlib.h>
>> +#include <sys/stat.h>
>> +#include <sys/syscall.h>
>> +#include <unistd.h>
>> +
>> +
>> +int main(int argc, char **argv)
>> +{
>> + int fd_in, fd_out;
>> + struct stat stat;
>> + loff_t len, ret;
>> +
>> + if (argc != 3) {
>> + fprintf(stderr, "Usage: %s <source> <destination>\\n", argv[0]);
>> + exit(EXIT_FAILURE);
>> + }
>> +
>> + fd_in = open(argv[1], O_RDONLY);
>> + if (fd_in == -1) {
>
> Please replace all "-" in code by "\-". (This is a groff
> detail.)
>
>> + perror("open (argv[1])");
>> + exit(EXIT_FAILURE);
>> + }
>> +
>> + if (fstat(fd_in, &stat) == -1) {
>> + perror("fstat");
>> + exit(EXIT_FAILURE);
>> + }
>> + len = stat.st_size;
>> +
>> + fd_out = creat(argv[2], 0644);
>
> These days, I think we should discourage the use of creat(). Maybe
> better to use open() instead here?
>
>> + if (fd_out == -1) {
>> + perror("creat (argv[2])");
>> + exit(EXIT_FAILURE);
>> + }
>> +
>> + do {
>> + ret = syscall(__NR_copy_file_range, fd_in, NULL,
>> + fd_out, NULL, len, 0);
>
> I'd rather see this as:
>
> int
> copy_file_range(int fd_in, loff_t *off_in,
> int fd_out, loff_t *off_out, size_t len,
> unsigned int flags)
> {
> return(syscall(__NR_copy_file_range, fd_in, fd_out, off_out, len, flags);
> }
>
> ...
>
> copy_file_range(fd_in, fd_out, off_out, len, flags);
>
>
>> + if (ret == -1) {
>> + perror("copy_file_range");
>> + exit(EXIT_FAILURE);
>> + }
>> +
>> + len -= ret;
>> + } while (len > 0);
>> +
>> + close(fd_in);
>> + close(fd_out);
>> + exit(EXIT_SUCCESS);
>> +}
>> +.fi
>> +.SH SEE ALSO
>> +.BR splice (2)
>>
>
> In the next iteration of this patch, could you include a change to
> splice(2) so that copy_file_range(2) is added under SEE ALSO in
> that page. Also, are there any other pages that we should like
> to/from? (sendfile(2)?)
>
> Thanks,
>
> Michael
>
>



Attachments:
smime.p7s (2.95 kB)
S/MIME Cryptographic Signature

2015-09-15 03:32:48

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v2 8/9] vfs: copy_file_range() can do a pagecache copy with splice

On Fri, Sep 11, 2015 at 04:30:21PM -0400, Anna Schumaker wrote:
> The NFS server will need some kind offallback for filesystems that don't
> have any kind of copy acceleration, and it should be generally useful to
> have an in-kernel copy to avoid lots of switches between kernel and user
> space.
>
> I make this configurable by adding two new flags. Users who only want a
> reflink can pass COPY_FR_REFLINK, and users who want a full data copy can
> pass COPY_FR_COPY. The default (flags=0) means to first attempt a
> reflink, but use the pagecache if that fails.
>
> I moved the rw_verify_area() calls into the fallback code since some
> filesystems can handle reflinking a large range.
>
> Signed-off-by: Anna Schumaker <[email protected]>
> ---
> v2:
> - Rename COPY_REFLINK -> COPY_FR_REFLINK
> - Introduce COPY_FR_COPY flag
> - Flags == 0 is really COPY_FR_COPY|COPY_FR_REFLINK
> - Drop check for invalid flags
> - Move call to do_splice_direct() into a new function
> - Move rw_verify_area() checks into the new fallback function
> ---
> fs/read_write.c | 56 ++++++++++++++++++++++++++++-------------------
> include/linux/copy.h | 6 +++++
> include/uapi/linux/Kbuild | 1 +
> include/uapi/linux/copy.h | 7 ++++++
> 4 files changed, 48 insertions(+), 22 deletions(-)
> create mode 100644 include/linux/copy.h
> create mode 100644 include/uapi/linux/copy.h
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 363bd3e..ba24884 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -7,6 +7,7 @@
> #include <linux/slab.h>
> #include <linux/stat.h>
> #include <linux/fcntl.h>
> +#include <linux/copy.h>
> #include <linux/file.h>
> #include <linux/uio.h>
> #include <linux/fsnotify.h>
> @@ -1329,6 +1330,29 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
> }
> #endif
>
> +static ssize_t vfs_copy_file_pagecache(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + size_t len)
> +{
> + ssize_t ret;
> +
> + ret = rw_verify_area(READ, file_in, &pos_in, len);
> + if (ret >= 0) {
> + len = ret;
> + ret = rw_verify_area(WRITE, file_out, &pos_out, len);
> + if (ret >= 0)
> + len = ret;
> + }
> + if (ret < 0)
> + return ret;
> +
> + file_start_write(file_out);
> + ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out, len, 0);
> + file_end_write(file_out);
> +
> + return ret;
> +}
> +
> /*
> * copy_file_range() differs from regular file read and write in that it
> * specifically allows return partial success. When it does so is up to
> @@ -1338,34 +1362,17 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> size_t len, int flags)
> {
> - struct inode *inode_in;
> - struct inode *inode_out;
> ssize_t ret;
>
> - if (flags)
> - return -EINVAL;
> -
> - /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT */
> - ret = rw_verify_area(READ, file_in, &pos_in, len);
> - if (ret >= 0)
> - ret = rw_verify_area(WRITE, file_out, &pos_out, len);
> - if (ret < 0)
> - return ret;
> + if (flags == 0)
> + flags = COPY_FR_COPY | COPY_FR_REFLINK;

This function must return -EINVAL if any of the undefined flags bits are
set.

>
> if (!(file_in->f_mode & FMODE_READ) ||
> !(file_out->f_mode & FMODE_WRITE) ||
> (file_out->f_flags & O_APPEND) ||
> - !file_out->f_op || !file_out->f_op->copy_file_range)
> + !file_in->f_op)
> return -EBADF;
>
> - inode_in = file_inode(file_in);
> - inode_out = file_inode(file_out);
> -
> - /* make sure offsets don't wrap and the input is inside i_size */
> - if (pos_in + len < pos_in || pos_out + len < pos_out ||
> - pos_in + len > i_size_read(inode_in))
> - return -EINVAL;
> -
> if (len == 0)
> return 0;
>
> @@ -1373,8 +1380,13 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> if (ret)
> return ret;
>
> - ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
> - len, flags);
> + ret = -EOPNOTSUPP;
> + if (file_out->f_op->copy_file_range)
> + ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
> + pos_out, len, flags);
> + if ((ret < 0) && (flags & COPY_FR_COPY))
> + ret = vfs_copy_file_pagecache(file_in, pos_in, file_out,
> + pos_out, len);
> if (ret > 0) {
> fsnotify_access(file_in);
> add_rchar(current, ret);
> diff --git a/include/linux/copy.h b/include/linux/copy.h
> new file mode 100644
> index 0000000..fd54543
> --- /dev/null
> +++ b/include/linux/copy.h
> @@ -0,0 +1,6 @@
> +#ifndef _LINUX_COPY_H
> +#define _LINUX_COPY_H
> +
> +#include <uapi/linux/copy.h>
> +
> +#endif /* _LINUX_COPY_H */
> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
> index 70ff1d9..d46830a 100644
> --- a/include/uapi/linux/Kbuild
> +++ b/include/uapi/linux/Kbuild
> @@ -90,6 +90,7 @@ header-y += coda_psdev.h
> header-y += coff.h
> header-y += connector.h
> header-y += const.h
> +header-y += copy.h
> header-y += cramfs_fs.h
> header-y += cuda.h
> header-y += cyclades.h
> diff --git a/include/uapi/linux/copy.h b/include/uapi/linux/copy.h
> new file mode 100644
> index 0000000..2da59a8
> --- /dev/null
> +++ b/include/uapi/linux/copy.h
> @@ -0,0 +1,7 @@
> +#ifndef _UAPI_LINUX_COPY_H
> +#define _UAPI_LINUX_COPY_H
> +
> +#define COPY_FR_COPY (1 << 0) /* Only do a pagecache copy. */
> +#define COPY_FR_REFLINK (1 << 1) /* Only make a reflink. */

Could I have a COPY_FR_DEDUPE flag too, please?

(I don't mind adding it myself when I get around to hooking up XFS, but I
was hoping to get it in during the first round).

--D

> +
> +#endif /* _UAPI_LINUX_COPY_H */
> --
> 2.5.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2015-09-15 15:58:09

by Anna Schumaker

[permalink] [raw]
Subject: Re: [PATCH v2 8/9] vfs: copy_file_range() can do a pagecache copy with splice

On 09/14/2015 11:32 PM, Darrick J. Wong wrote:
> On Fri, Sep 11, 2015 at 04:30:21PM -0400, Anna Schumaker wrote:
>> The NFS server will need some kind offallback for filesystems that don't
>> have any kind of copy acceleration, and it should be generally useful to
>> have an in-kernel copy to avoid lots of switches between kernel and user
>> space.
>>
>> I make this configurable by adding two new flags. Users who only want a
>> reflink can pass COPY_FR_REFLINK, and users who want a full data copy can
>> pass COPY_FR_COPY. The default (flags=0) means to first attempt a
>> reflink, but use the pagecache if that fails.
>>
>> I moved the rw_verify_area() calls into the fallback code since some
>> filesystems can handle reflinking a large range.
>>
>> Signed-off-by: Anna Schumaker <[email protected]>
>> ---
>> v2:
>> - Rename COPY_REFLINK -> COPY_FR_REFLINK
>> - Introduce COPY_FR_COPY flag
>> - Flags == 0 is really COPY_FR_COPY|COPY_FR_REFLINK
>> - Drop check for invalid flags
>> - Move call to do_splice_direct() into a new function
>> - Move rw_verify_area() checks into the new fallback function
>> ---
>> fs/read_write.c | 56 ++++++++++++++++++++++++++++-------------------
>> include/linux/copy.h | 6 +++++
>> include/uapi/linux/Kbuild | 1 +
>> include/uapi/linux/copy.h | 7 ++++++
>> 4 files changed, 48 insertions(+), 22 deletions(-)
>> create mode 100644 include/linux/copy.h
>> create mode 100644 include/uapi/linux/copy.h
>>
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 363bd3e..ba24884 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -7,6 +7,7 @@
>> #include <linux/slab.h>
>> #include <linux/stat.h>
>> #include <linux/fcntl.h>
>> +#include <linux/copy.h>
>> #include <linux/file.h>
>> #include <linux/uio.h>
>> #include <linux/fsnotify.h>
>> @@ -1329,6 +1330,29 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
>> }
>> #endif
>>
>> +static ssize_t vfs_copy_file_pagecache(struct file *file_in, loff_t pos_in,
>> + struct file *file_out, loff_t pos_out,
>> + size_t len)
>> +{
>> + ssize_t ret;
>> +
>> + ret = rw_verify_area(READ, file_in, &pos_in, len);
>> + if (ret >= 0) {
>> + len = ret;
>> + ret = rw_verify_area(WRITE, file_out, &pos_out, len);
>> + if (ret >= 0)
>> + len = ret;
>> + }
>> + if (ret < 0)
>> + return ret;
>> +
>> + file_start_write(file_out);
>> + ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out, len, 0);
>> + file_end_write(file_out);
>> +
>> + return ret;
>> +}
>> +
>> /*
>> * copy_file_range() differs from regular file read and write in that it
>> * specifically allows return partial success. When it does so is up to
>> @@ -1338,34 +1362,17 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>> struct file *file_out, loff_t pos_out,
>> size_t len, int flags)
>> {
>> - struct inode *inode_in;
>> - struct inode *inode_out;
>> ssize_t ret;
>>
>> - if (flags)
>> - return -EINVAL;
>> -
>> - /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT */
>> - ret = rw_verify_area(READ, file_in, &pos_in, len);
>> - if (ret >= 0)
>> - ret = rw_verify_area(WRITE, file_out, &pos_out, len);
>> - if (ret < 0)
>> - return ret;
>> + if (flags == 0)
>> + flags = COPY_FR_COPY | COPY_FR_REFLINK;
>
> This function must return -EINVAL if any of the undefined flags bits are
> set.

Sure, I'll add that.

>
>>
>> if (!(file_in->f_mode & FMODE_READ) ||
>> !(file_out->f_mode & FMODE_WRITE) ||
>> (file_out->f_flags & O_APPEND) ||
>> - !file_out->f_op || !file_out->f_op->copy_file_range)
>> + !file_in->f_op)
>> return -EBADF;
>>
>> - inode_in = file_inode(file_in);
>> - inode_out = file_inode(file_out);
>> -
>> - /* make sure offsets don't wrap and the input is inside i_size */
>> - if (pos_in + len < pos_in || pos_out + len < pos_out ||
>> - pos_in + len > i_size_read(inode_in))
>> - return -EINVAL;
>> -
>> if (len == 0)
>> return 0;
>>
>> @@ -1373,8 +1380,13 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>> if (ret)
>> return ret;
>>
>> - ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
>> - len, flags);
>> + ret = -EOPNOTSUPP;
>> + if (file_out->f_op->copy_file_range)
>> + ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
>> + pos_out, len, flags);
>> + if ((ret < 0) && (flags & COPY_FR_COPY))
>> + ret = vfs_copy_file_pagecache(file_in, pos_in, file_out,
>> + pos_out, len);
>> if (ret > 0) {
>> fsnotify_access(file_in);
>> add_rchar(current, ret);
>> diff --git a/include/linux/copy.h b/include/linux/copy.h
>> new file mode 100644
>> index 0000000..fd54543
>> --- /dev/null
>> +++ b/include/linux/copy.h
>> @@ -0,0 +1,6 @@
>> +#ifndef _LINUX_COPY_H
>> +#define _LINUX_COPY_H
>> +
>> +#include <uapi/linux/copy.h>
>> +
>> +#endif /* _LINUX_COPY_H */
>> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
>> index 70ff1d9..d46830a 100644
>> --- a/include/uapi/linux/Kbuild
>> +++ b/include/uapi/linux/Kbuild
>> @@ -90,6 +90,7 @@ header-y += coda_psdev.h
>> header-y += coff.h
>> header-y += connector.h
>> header-y += const.h
>> +header-y += copy.h
>> header-y += cramfs_fs.h
>> header-y += cuda.h
>> header-y += cyclades.h
>> diff --git a/include/uapi/linux/copy.h b/include/uapi/linux/copy.h
>> new file mode 100644
>> index 0000000..2da59a8
>> --- /dev/null
>> +++ b/include/uapi/linux/copy.h
>> @@ -0,0 +1,7 @@
>> +#ifndef _UAPI_LINUX_COPY_H
>> +#define _UAPI_LINUX_COPY_H
>> +
>> +#define COPY_FR_COPY (1 << 0) /* Only do a pagecache copy. */
>> +#define COPY_FR_REFLINK (1 << 1) /* Only make a reflink. */
>
> Could I have a COPY_FR_DEDUPE flag too, please?
>
> (I don't mind adding it myself when I get around to hooking up XFS, but I
> was hoping to get it in during the first round).

I guess I can, but only iff everybody has agreed on using copy for dedupes instead of somethink like fallocate.

Anna

>
> --D
>
>> +
>> +#endif /* _UAPI_LINUX_COPY_H */
>> --
>> 2.5.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html


2015-09-15 16:41:57

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v2 8/9] vfs: copy_file_range() can do a pagecache copy with splice

On Tue, Sep 15, 2015 at 11:58:04AM -0400, Anna Schumaker wrote:
> On 09/14/2015 11:32 PM, Darrick J. Wong wrote:
> > On Fri, Sep 11, 2015 at 04:30:21PM -0400, Anna Schumaker wrote:
> >> The NFS server will need some kind offallback for filesystems that don't
> >> have any kind of copy acceleration, and it should be generally useful to
> >> have an in-kernel copy to avoid lots of switches between kernel and user
> >> space.
> >>
> >> I make this configurable by adding two new flags. Users who only want a
> >> reflink can pass COPY_FR_REFLINK, and users who want a full data copy can
> >> pass COPY_FR_COPY. The default (flags=0) means to first attempt a
> >> reflink, but use the pagecache if that fails.
> >>
> >> I moved the rw_verify_area() calls into the fallback code since some
> >> filesystems can handle reflinking a large range.
> >>
> >> Signed-off-by: Anna Schumaker <[email protected]>
> >> ---
> >> v2:
> >> - Rename COPY_REFLINK -> COPY_FR_REFLINK
> >> - Introduce COPY_FR_COPY flag
> >> - Flags == 0 is really COPY_FR_COPY|COPY_FR_REFLINK
> >> - Drop check for invalid flags
> >> - Move call to do_splice_direct() into a new function
> >> - Move rw_verify_area() checks into the new fallback function
> >> ---
> >> fs/read_write.c | 56 ++++++++++++++++++++++++++++-------------------
> >> include/linux/copy.h | 6 +++++
> >> include/uapi/linux/Kbuild | 1 +
> >> include/uapi/linux/copy.h | 7 ++++++
> >> 4 files changed, 48 insertions(+), 22 deletions(-)
> >> create mode 100644 include/linux/copy.h
> >> create mode 100644 include/uapi/linux/copy.h
> >>
> >> diff --git a/fs/read_write.c b/fs/read_write.c
> >> index 363bd3e..ba24884 100644
> >> --- a/fs/read_write.c
> >> +++ b/fs/read_write.c
> >> @@ -7,6 +7,7 @@
> >> #include <linux/slab.h>
> >> #include <linux/stat.h>
> >> #include <linux/fcntl.h>
> >> +#include <linux/copy.h>
> >> #include <linux/file.h>
> >> #include <linux/uio.h>
> >> #include <linux/fsnotify.h>
> >> @@ -1329,6 +1330,29 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
> >> }
> >> #endif
> >>
> >> +static ssize_t vfs_copy_file_pagecache(struct file *file_in, loff_t pos_in,
> >> + struct file *file_out, loff_t pos_out,
> >> + size_t len)
> >> +{
> >> + ssize_t ret;
> >> +
> >> + ret = rw_verify_area(READ, file_in, &pos_in, len);
> >> + if (ret >= 0) {
> >> + len = ret;
> >> + ret = rw_verify_area(WRITE, file_out, &pos_out, len);
> >> + if (ret >= 0)
> >> + len = ret;
> >> + }
> >> + if (ret < 0)
> >> + return ret;
> >> +
> >> + file_start_write(file_out);
> >> + ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out, len, 0);
> >> + file_end_write(file_out);
> >> +
> >> + return ret;
> >> +}
> >> +
> >> /*
> >> * copy_file_range() differs from regular file read and write in that it
> >> * specifically allows return partial success. When it does so is up to
> >> @@ -1338,34 +1362,17 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> >> struct file *file_out, loff_t pos_out,
> >> size_t len, int flags)
> >> {
> >> - struct inode *inode_in;
> >> - struct inode *inode_out;
> >> ssize_t ret;
> >>
> >> - if (flags)
> >> - return -EINVAL;
> >> -
> >> - /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT */
> >> - ret = rw_verify_area(READ, file_in, &pos_in, len);
> >> - if (ret >= 0)
> >> - ret = rw_verify_area(WRITE, file_out, &pos_out, len);
> >> - if (ret < 0)
> >> - return ret;
> >> + if (flags == 0)
> >> + flags = COPY_FR_COPY | COPY_FR_REFLINK;
> >
> > This function must return -EINVAL if any of the undefined flags bits are
> > set.
>
> Sure, I'll add that.
>
> >
> >>
> >> if (!(file_in->f_mode & FMODE_READ) ||
> >> !(file_out->f_mode & FMODE_WRITE) ||
> >> (file_out->f_flags & O_APPEND) ||
> >> - !file_out->f_op || !file_out->f_op->copy_file_range)
> >> + !file_in->f_op)
> >> return -EBADF;
> >>
> >> - inode_in = file_inode(file_in);
> >> - inode_out = file_inode(file_out);
> >> -
> >> - /* make sure offsets don't wrap and the input is inside i_size */
> >> - if (pos_in + len < pos_in || pos_out + len < pos_out ||
> >> - pos_in + len > i_size_read(inode_in))
> >> - return -EINVAL;
> >> -
> >> if (len == 0)
> >> return 0;
> >>
> >> @@ -1373,8 +1380,13 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> >> if (ret)
> >> return ret;
> >>
> >> - ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
> >> - len, flags);
> >> + ret = -EOPNOTSUPP;
> >> + if (file_out->f_op->copy_file_range)
> >> + ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
> >> + pos_out, len, flags);
> >> + if ((ret < 0) && (flags & COPY_FR_COPY))
> >> + ret = vfs_copy_file_pagecache(file_in, pos_in, file_out,
> >> + pos_out, len);
> >> if (ret > 0) {
> >> fsnotify_access(file_in);
> >> add_rchar(current, ret);
> >> diff --git a/include/linux/copy.h b/include/linux/copy.h
> >> new file mode 100644
> >> index 0000000..fd54543
> >> --- /dev/null
> >> +++ b/include/linux/copy.h
> >> @@ -0,0 +1,6 @@
> >> +#ifndef _LINUX_COPY_H
> >> +#define _LINUX_COPY_H
> >> +
> >> +#include <uapi/linux/copy.h>
> >> +
> >> +#endif /* _LINUX_COPY_H */
> >> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
> >> index 70ff1d9..d46830a 100644
> >> --- a/include/uapi/linux/Kbuild
> >> +++ b/include/uapi/linux/Kbuild
> >> @@ -90,6 +90,7 @@ header-y += coda_psdev.h
> >> header-y += coff.h
> >> header-y += connector.h
> >> header-y += const.h
> >> +header-y += copy.h
> >> header-y += cramfs_fs.h
> >> header-y += cuda.h
> >> header-y += cyclades.h
> >> diff --git a/include/uapi/linux/copy.h b/include/uapi/linux/copy.h
> >> new file mode 100644
> >> index 0000000..2da59a8
> >> --- /dev/null
> >> +++ b/include/uapi/linux/copy.h
> >> @@ -0,0 +1,7 @@
> >> +#ifndef _UAPI_LINUX_COPY_H
> >> +#define _UAPI_LINUX_COPY_H
> >> +
> >> +#define COPY_FR_COPY (1 << 0) /* Only do a pagecache copy. */
> >> +#define COPY_FR_REFLINK (1 << 1) /* Only make a reflink. */
> >
> > Could I have a COPY_FR_DEDUPE flag too, please?
> >
> > (I don't mind adding it myself when I get around to hooking up XFS, but I
> > was hoping to get it in during the first round).
>
> I guess I can, but only iff everybody has agreed on using copy for dedupes
> instead of somethink like fallocate.

I don't see how fallocate could even handle userspace-directed deduplication
since the dedupe operation compares two file ranges and reflinks them if the
contents match whereas fallocate only operates on a single file range.

A single file range would work if userspace was telling the kernel to start a
lengthy search for all duplicates of a particular file's range, but ...
yuck.

(Granted, the btrfs extent_same ioctl lets userspace call dedupe on multiple
files simultaneously, but that's for another day.)

--D

>
> Anna
>
> >
> > --D
> >
> >> +
> >> +#endif /* _UAPI_LINUX_COPY_H */
> >> --
> >> 2.5.1
> >>
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> >> the body of a message to [email protected]
> >> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2015-09-15 17:01:36

by Austin S Hemmelgarn

[permalink] [raw]
Subject: Re: [PATCH v2 8/9] vfs: copy_file_range() can do a pagecache copy with splice

On 2015-09-15 12:38, Darrick J. Wong wrote:
> On Tue, Sep 15, 2015 at 11:58:04AM -0400, Anna Schumaker wrote:
>> On 09/14/2015 11:32 PM, Darrick J. Wong wrote:
>>> On Fri, Sep 11, 2015 at 04:30:21PM -0400, Anna Schumaker wrote:
>>>> The NFS server will need some kind offallback for filesystems that don't
>>>> have any kind of copy acceleration, and it should be generally useful to
>>>> have an in-kernel copy to avoid lots of switches between kernel and user
>>>> space.
>>>>
>>>> I make this configurable by adding two new flags. Users who only want a
>>>> reflink can pass COPY_FR_REFLINK, and users who want a full data copy can
>>>> pass COPY_FR_COPY. The default (flags=0) means to first attempt a
>>>> reflink, but use the pagecache if that fails.
>>>>
>>>> I moved the rw_verify_area() calls into the fallback code since some
>>>> filesystems can handle reflinking a large range.
>>>>
>>>> Signed-off-by: Anna Schumaker <[email protected]>
>>>> ---
>>>> v2:
>>>> - Rename COPY_REFLINK -> COPY_FR_REFLINK
>>>> - Introduce COPY_FR_COPY flag
>>>> - Flags == 0 is really COPY_FR_COPY|COPY_FR_REFLINK
>>>> - Drop check for invalid flags
>>>> - Move call to do_splice_direct() into a new function
>>>> - Move rw_verify_area() checks into the new fallback function
>>>> ---
>>>> fs/read_write.c | 56 ++++++++++++++++++++++++++++-------------------
>>>> include/linux/copy.h | 6 +++++
>>>> include/uapi/linux/Kbuild | 1 +
>>>> include/uapi/linux/copy.h | 7 ++++++
>>>> 4 files changed, 48 insertions(+), 22 deletions(-)
>>>> create mode 100644 include/linux/copy.h
>>>> create mode 100644 include/uapi/linux/copy.h
>>>>
>>>> diff --git a/fs/read_write.c b/fs/read_write.c
>>>> index 363bd3e..ba24884 100644
>>>> --- a/fs/read_write.c
>>>> +++ b/fs/read_write.c
>>>> @@ -7,6 +7,7 @@
>>>> #include <linux/slab.h>
>>>> #include <linux/stat.h>
>>>> #include <linux/fcntl.h>
>>>> +#include <linux/copy.h>
>>>> #include <linux/file.h>
>>>> #include <linux/uio.h>
>>>> #include <linux/fsnotify.h>
>>>> @@ -1329,6 +1330,29 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
>>>> }
>>>> #endif
>>>>
>>>> +static ssize_t vfs_copy_file_pagecache(struct file *file_in, loff_t pos_in,
>>>> + struct file *file_out, loff_t pos_out,
>>>> + size_t len)
>>>> +{
>>>> + ssize_t ret;
>>>> +
>>>> + ret = rw_verify_area(READ, file_in, &pos_in, len);
>>>> + if (ret >= 0) {
>>>> + len = ret;
>>>> + ret = rw_verify_area(WRITE, file_out, &pos_out, len);
>>>> + if (ret >= 0)
>>>> + len = ret;
>>>> + }
>>>> + if (ret < 0)
>>>> + return ret;
>>>> +
>>>> + file_start_write(file_out);
>>>> + ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out, len, 0);
>>>> + file_end_write(file_out);
>>>> +
>>>> + return ret;
>>>> +}
>>>> +
>>>> /*
>>>> * copy_file_range() differs from regular file read and write in that it
>>>> * specifically allows return partial success. When it does so is up to
>>>> @@ -1338,34 +1362,17 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>>>> struct file *file_out, loff_t pos_out,
>>>> size_t len, int flags)
>>>> {
>>>> - struct inode *inode_in;
>>>> - struct inode *inode_out;
>>>> ssize_t ret;
>>>>
>>>> - if (flags)
>>>> - return -EINVAL;
>>>> -
>>>> - /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT */
>>>> - ret = rw_verify_area(READ, file_in, &pos_in, len);
>>>> - if (ret >= 0)
>>>> - ret = rw_verify_area(WRITE, file_out, &pos_out, len);
>>>> - if (ret < 0)
>>>> - return ret;
>>>> + if (flags == 0)
>>>> + flags = COPY_FR_COPY | COPY_FR_REFLINK;
>>>
>>> This function must return -EINVAL if any of the undefined flags bits are
>>> set.
>>
>> Sure, I'll add that.
>>
>>>
>>>>
>>>> if (!(file_in->f_mode & FMODE_READ) ||
>>>> !(file_out->f_mode & FMODE_WRITE) ||
>>>> (file_out->f_flags & O_APPEND) ||
>>>> - !file_out->f_op || !file_out->f_op->copy_file_range)
>>>> + !file_in->f_op)
>>>> return -EBADF;
>>>>
>>>> - inode_in = file_inode(file_in);
>>>> - inode_out = file_inode(file_out);
>>>> -
>>>> - /* make sure offsets don't wrap and the input is inside i_size */
>>>> - if (pos_in + len < pos_in || pos_out + len < pos_out ||
>>>> - pos_in + len > i_size_read(inode_in))
>>>> - return -EINVAL;
>>>> -
>>>> if (len == 0)
>>>> return 0;
>>>>
>>>> @@ -1373,8 +1380,13 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>>>> if (ret)
>>>> return ret;
>>>>
>>>> - ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
>>>> - len, flags);
>>>> + ret = -EOPNOTSUPP;
>>>> + if (file_out->f_op->copy_file_range)
>>>> + ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
>>>> + pos_out, len, flags);
>>>> + if ((ret < 0) && (flags & COPY_FR_COPY))
>>>> + ret = vfs_copy_file_pagecache(file_in, pos_in, file_out,
>>>> + pos_out, len);
>>>> if (ret > 0) {
>>>> fsnotify_access(file_in);
>>>> add_rchar(current, ret);
>>>> diff --git a/include/linux/copy.h b/include/linux/copy.h
>>>> new file mode 100644
>>>> index 0000000..fd54543
>>>> --- /dev/null
>>>> +++ b/include/linux/copy.h
>>>> @@ -0,0 +1,6 @@
>>>> +#ifndef _LINUX_COPY_H
>>>> +#define _LINUX_COPY_H
>>>> +
>>>> +#include <uapi/linux/copy.h>
>>>> +
>>>> +#endif /* _LINUX_COPY_H */
>>>> diff --git a/include/uapi/linux/Kbuild b/include/uapi/linux/Kbuild
>>>> index 70ff1d9..d46830a 100644
>>>> --- a/include/uapi/linux/Kbuild
>>>> +++ b/include/uapi/linux/Kbuild
>>>> @@ -90,6 +90,7 @@ header-y += coda_psdev.h
>>>> header-y += coff.h
>>>> header-y += connector.h
>>>> header-y += const.h
>>>> +header-y += copy.h
>>>> header-y += cramfs_fs.h
>>>> header-y += cuda.h
>>>> header-y += cyclades.h
>>>> diff --git a/include/uapi/linux/copy.h b/include/uapi/linux/copy.h
>>>> new file mode 100644
>>>> index 0000000..2da59a8
>>>> --- /dev/null
>>>> +++ b/include/uapi/linux/copy.h
>>>> @@ -0,0 +1,7 @@
>>>> +#ifndef _UAPI_LINUX_COPY_H
>>>> +#define _UAPI_LINUX_COPY_H
>>>> +
>>>> +#define COPY_FR_COPY (1 << 0) /* Only do a pagecache copy. */
>>>> +#define COPY_FR_REFLINK (1 << 1) /* Only make a reflink. */
>>>
>>> Could I have a COPY_FR_DEDUPE flag too, please?
>>>
>>> (I don't mind adding it myself when I get around to hooking up XFS, but I
>>> was hoping to get it in during the first round).
>>
>> I guess I can, but only iff everybody has agreed on using copy for dedupes
>> instead of somethink like fallocate.
>
> I don't see how fallocate could even handle userspace-directed deduplication
> since the dedupe operation compares two file ranges and reflinks them if the
> contents match whereas fallocate only operates on a single file range.
>
> A single file range would work if userspace was telling the kernel to start a
> lengthy search for all duplicates of a particular file's range, but ...
> yuck.
>
> (Granted, the btrfs extent_same ioctl lets userspace call dedupe on multiple
> files simultaneously, but that's for another day.)
>
Agreed, having something equivalent to reflinking would be the most
sensible interface for deduplicating two files. I can however see
fallocate possibly having some value when in-band deduplication is
considered (ie, use fallocate to tell the system to include/exclude
certain file ranges from in-band deduplication).



Attachments:
smime.p7s (2.95 kB)
S/MIME Cryptographic Signature

2015-09-22 11:45:01

by David Sterba

[permalink] [raw]
Subject: Re: [PATCH v2 1/9] vfs: add copy_file_range syscall and vfs helper

On Fri, Sep 11, 2015 at 04:30:14PM -0400, Anna Schumaker wrote:
> From: Zach Brown <[email protected]>
> +/*
> + * copy_file_range() differs from regular file read and write in that it
> + * specifically allows return partial success. When it does so is up to
> + * the copy_file_range method.
> + */
> +ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + size_t len, int flags)

Is the signed type for flags correct? I had the impression that it's
usually good to have unsigned int/long for flags, this can be seen
frequently in the vfs/fs code. Mainly for consistency.

> + ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
> + len, flags);

int -> unsigned int

> +}
> +EXPORT_SYMBOL(vfs_copy_file_range);
> +
> +SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
> + int, fd_out, loff_t __user *, off_out,
> + size_t, len, unsigned int, flags)

the syscal takes unsigned int

> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1642,6 +1642,7 @@ struct file_operations {
> #ifndef CONFIG_MMU
> unsigned (*mmap_capabilities)(struct file *);
> #endif
> + ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, int);

switch to unsigned

> +extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
> + loff_t, size_t, int);

and here

2015-09-22 11:48:08

by David Sterba

[permalink] [raw]
Subject: Re: [PATCH v2 4/9] vfs: Copy should check len after file open mode

On Fri, Sep 11, 2015 at 04:30:17PM -0400, Anna Schumaker wrote:
> I don't think it makes sense to report that a copy succeeded if the
> files aren't open properly.
>
> Signed-off-by: Anna Schumaker <[email protected]>

Reviewed-by: David Sterba <[email protected]>

2015-09-22 11:49:38

by David Sterba

[permalink] [raw]
Subject: Re: [PATCH v2 5/9] vfs: Copy shouldn't forbid ranges inside the same file

On Fri, Sep 11, 2015 at 04:30:18PM -0400, Anna Schumaker wrote:
> This is perfectly valid for BTRFS and XFS, so let's leave this up to
> filesystems to check.
>
> Signed-off-by: Anna Schumaker <[email protected]>

For the btrfs part,

Reviewed-by: David Sterba <[email protected]>

2015-09-22 11:53:05

by David Sterba

[permalink] [raw]
Subject: Re: [PATCH v2 7/9] vfs: Remove copy_file_range mountpoint checks

On Fri, Sep 11, 2015 at 04:30:20PM -0400, Anna Schumaker wrote:
> I still want to do an in-kernel copy even if the files are on different
> mountpoints, and NFS has a "server to server" copy that expects two
> files on different mountpoints. Let's have individual filesystems
> implement this check instead.
>
> Signed-off-by: Anna Schumaker <[email protected]>

The cross-fs checks exist in the btrfs callback,

Reviewed-by: David Sterba <[email protected]>

2015-09-22 11:57:12

by David Sterba

[permalink] [raw]
Subject: Re: [PATCH v2 9/9] btrfs: btrfs_copy_file_range() only supports reflinks

On Fri, Sep 11, 2015 at 04:30:22PM -0400, Anna Schumaker wrote:
> Reject copies that don't have the COPY_FR_REFLINK flag set.
>
> Signed-off-by: Anna Schumaker <[email protected]>

Reviewed-by: David Sterba <[email protected]>

2015-09-22 18:28:28

by Anna Schumaker

[permalink] [raw]
Subject: Re: [PATCH v2 1/9] vfs: add copy_file_range syscall and vfs helper

On 09/22/2015 07:44 AM, David Sterba wrote:
> On Fri, Sep 11, 2015 at 04:30:14PM -0400, Anna Schumaker wrote:
>> From: Zach Brown <[email protected]>
>> +/*
>> + * copy_file_range() differs from regular file read and write in that it
>> + * specifically allows return partial success. When it does so is up to
>> + * the copy_file_range method.
>> + */
>> +ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>> + struct file *file_out, loff_t pos_out,
>> + size_t len, int flags)
>
> Is the signed type for flags correct? I had the impression that it's
> usually good to have unsigned int/long for flags, this can be seen
> frequently in the vfs/fs code. Mainly for consistency.

I'm all for consistency! I'll change the function to take an unsigned int.

Thanks,
Anna

>
>> + ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
>> + len, flags);
>
> int -> unsigned int
>
>> +}
>> +EXPORT_SYMBOL(vfs_copy_file_range);
>> +
>> +SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
>> + int, fd_out, loff_t __user *, off_out,
>> + size_t, len, unsigned int, flags)
>
> the syscal takes unsigned int
>
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -1642,6 +1642,7 @@ struct file_operations {
>> #ifndef CONFIG_MMU
>> unsigned (*mmap_capabilities)(struct file *);
>> #endif
>> + ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, int);
>
> switch to unsigned
>
>> +extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
>> + loff_t, size_t, int);
>
> and here
>


2015-09-22 20:11:16

by Anna Schumaker

[permalink] [raw]
Subject: Re: [PATCH v2 10/9] copy_file_range.2: New page documenting copy_file_range()

On 09/14/2015 02:32 PM, Darrick J. Wong wrote:
> On Sun, Sep 13, 2015 at 09:50:18AM +0200, Michael Kerrisk (man-pages) wrote:
>> Hi Anna,
>>
>> On 09/11/2015 10:30 PM, Anna Schumaker wrote:
>>> copy_file_range() is a new system call for copying ranges of data
>>> completely in the kernel. This gives filesystems an opportunity to
>>> implement some kind of "copy acceleration", such as reflinks or
>>> server-side-copy (in the case of NFS).
>>>
>>> Signed-off-by: Anna Schumaker <[email protected]>
>>
>> Thanks for writing such a nice first draft man page! I have a few
>> comments below. Would you be willing to revise and resubmit?
>>
>>> ---
>>> man2/copy_file_range.2 | 188 +++++++++++++++++++++++++++++++++++++++++++++++++
>>> 1 file changed, 188 insertions(+)
>>> create mode 100644 man2/copy_file_range.2
>>>
>>> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
>>> new file mode 100644
>>> index 0000000..84912b5
>>> --- /dev/null
>>> +++ b/man2/copy_file_range.2
>>> @@ -0,0 +1,188 @@
>>> +.\"This manpage is Copyright (C) 2015 Anna Schumaker <[email protected]>
>>
>> We need a license for this page. Please see
>> https://www.kernel.org/doc/man-pages/licenses.html
>>
>>> +.TH COPY 2 2015-8-31 "Linux" "Linux Programmer's Manual"
>>
>> Make the month 2 digits (leading 0).
>>
>>> +.SH NAME
>>> +copy_file_range \- Copy a range of data from one file to another
>>> +.SH SYNOPSIS
>>> +.nf
>>> +.B #include <linux/copy.h>
>>> +.B #include <sys/syscall.h>
>>> +.B #include <unistd.h>
>>> +
>>> +.BI "ssize_t syscall(__NR_copy_file_range, int " fd_in ", loff_t * " off_in ",
>>> +.BI " int " fd_out ", loff_t * " off_out ", size_t " len ",
>>> +.BI " unsigned int " flags );
>>
>> Remove spaces following "*" in the lines above. (man-pages convention for code)
>>
>> I know that the copy_file_range() (obviously) doesn't yet have a wrapper
>> in glibc, but in the man pages we document all system calls as though there
>> is a wrapper. See, for example, gettid(2), for an axample of how this is done
>> (a note in the SYNOPSIS and then some further details in NOTES).
>>
>>> +.fi
>>> +.SH DESCRIPTION
>>> +The
>>> +.BR copy_file_range ()
>>> +system call performs an in-kernel copy between two file descriptors
>>> +without all that tedious mucking about in userspace.
>>
>> I'd write that last piece as:
>>
>> "without the cost of (a loop) transferring data from the kernel to a
>> user-space buffer and then back to the kernel again.
>>
>>> +It copies up to
>>> +.I len
>>> +bytes of data from file descriptor
>>> +.I fd_in
>>> +to file descriptor
>>> +.IR fd_out ,
>>> +overwriting any data that exists within the requested range.
>>
>> s/.$/ of the target file./
>>
>>> +
>>> +The following semantics apply for
>>> +.IR off_in ,
>>> +and similar statements apply to
>>> +.IR off_out :
>>> +.IP * 3
>>> +If
>>> +.I off_in
>>> +is NULL, then bytes are read from
>>> +.I fd_in
>>> +starting from the current file offset and the current
>>> +file offset is adjusted appropriately.
>>> +.IP *
>>> +If
>>> +.I off_in
>>> +is not NULL, then
>>> +.I off_in
>>> +must point to a buffer that specifies the starting
>>> +offset where bytes from
>>> +.I fd_in
>>> +will be read. The current file offset of
>>> +.I fd_in
>>> +is not changed, but
>>> +.I off_in
>>> +is adjusted appropriately.
>>> +.PP
>>> +
>>> +The
>>> +.I flags
>>> +argument is a bit mask composed by OR-ing together zero
>>> +or more of the following flags:
>>> +.TP 1.9i
>>> +.B COPY_FR_COPY
>>> +Copy all the file data in the requested range.
>>> +Some filesystems, like NFS, might be able to accelerate this copy
>>> +to avoid unnecessary data transfers.
>>> +.TP
>>> +.B COPY_FR_REFLINK
>>> +Create a lightweight "reflink", where data is not copied until
>>> +one of the files is modified.
>>> +.PP
>>> +The default behavior
>>> +.RI ( flags
>>> +== 0) is to try creating a reflink,
>>> +and if reflinking fails
>>> +.BR copy_file_range ()
>>> +will fall back on performing a full data copy.
>>
>> s/back on/back to/
>>
>>> +This is equivalent to setting
>>> +.I flags
>>> +equal to
>>> +.RB ( COPY_FR_COPY | COPY_FR_REFLINK ).
>>
>> So, from an API deign perspective, the interoperation of these two
>> flags appears odd. Bit flags are commonly (not always) orthogonal.
>> I think here it could be pointed out a little more explicitly that
>> these two flags are not orthogonal. In particular, perhaps after the
>> last sentence, you could add another sentence:
>>
>> [[
>> (This contrasts with specifying
>> .I flags
>> as just
>> .BR COPY_FR_REFLINK ,
>> which causes the call to create a reflink,
>> and fail if that is not possible,
>> rather than falling back to a full data copy.)
>> ]]
>>
>> Furthermore, I even wonder if explicitly specifying flags as
>> COPY_FR_COPY | COPY_FR_REFLINK should just generate an EINVAL
>> error. 0 already gives us the behavior described above,
>> and allowing the combination COPY_FR_COPY | COPY_FR_REFLINK
>> perhaps just contributes to misleading the user that these
>> flags are orthogonal, when in reality they are not. What do
>> you think?
>
> Personally, I think it's a little weird that one turns on reflink with a flag;
> turns on regular copy with a different flag; and turns on both by not
> specifying either flag. :)

Is there a better behavior for flags=0? I was thinking this would be what people want when they don't care how the copy happens in the kernel.

Anna

>
>> What are the semantics with respect to signals, especially if data
>> copying a very large file range? This info should be included in the
>> man page, probably under NOTES.
>>
>>> +.SH RETURN VALUE
>>> +Upon successful completion,
>>> +.BR copy_file_range ()
>>> +will return the number of bytes copied between files.
>>> +This could be less than the length originally requested.
>>> +
>>> +On error,
>>> +.BR copy_file_range ()
>>> +returns \-1 and
>>> +.I errno
>>> +is set to indicate the error.
>>> +.SH ERRORS
>>> +.TP
>>> +.B EBADF
>>> +One or more file descriptors are not valid,
>>> +or do not have proper read-write mode;
>>
>> I think that last line can go. I mean, isn't this point (again)
>> covered in the next few lines?
>
> Or change the ';' to a ':'?
>
>>> +.I fd_in
>>> +is not open for reading; or
>>> +.I fd_out
>>> +is not open for writing.
>>> +.TP
>>> +.B EINVAL
>>> +Requested range extends beyond the end of the file; or the
>>
>> s/file/source file/ (right?)
>>
>>> +.I flags
>>> +argument is set to an invalid value.
>>> +.TP
>>> +.B EIO
>>> +A low level I/O error occurred while copying.
>>> +.TP
>>> +.B ENOMEM
>>> +Out of memory.
>>> +.TP
>>> +.B ENOSPC
>>> +There is not enough space to complete the copy.
>>
>> Space where? On the filesystem?
>> => s/space/space on the target filesystem/
>>
>>> +.TP
>>> +.B EOPNOTSUPP
>>> +.B COPY_REFLINK
>>> +was specified in
>>> +.IR flags ,
>>> +but the target filesystem does not support reflinks.
>>> +.TP
>>> +.B EXDEV
>>> +Target filesystem doesn't support cross-filesystem copies.
>>
>> I'm curious. What are some examples of filesystems that produce this
>> error?
>
> btrfs and xfs (and probably most local filesystems) don't support cross-fs
> copies.
>
> --D
>
>>
>>> +.SH VERSIONS
>>> +The
>>> +.BR copy_file_range ()
>>> +system call first appeared in Linux 4.4.
>>> +.SH CONFORMING TO
>>> +The
>>> +.BR copy_file_range ()
>>> +system call is a nonstandard Linux extension.
>>> +.SH EXAMPLE
>>> +.nf
>>> +
>>> +#define _GNU_SOURCE
>>> +#include <fcntl.h>
>>> +#include <linux/copy.h>
>>> +#include <stdio.h>
>>> +#include <stdlib.h>
>>> +#include <sys/stat.h>
>>> +#include <sys/syscall.h>
>>> +#include <unistd.h>
>>> +
>>> +
>>> +int main(int argc, char **argv)
>>> +{
>>> + int fd_in, fd_out;
>>> + struct stat stat;
>>> + loff_t len, ret;
>>> +
>>> + if (argc != 3) {
>>> + fprintf(stderr, "Usage: %s <source> <destination>\\n", argv[0]);
>>> + exit(EXIT_FAILURE);
>>> + }
>>> +
>>> + fd_in = open(argv[1], O_RDONLY);
>>> + if (fd_in == -1) {
>>
>> Please replace all "-" in code by "\-". (This is a groff
>> detail.)
>>
>>> + perror("open (argv[1])");
>>> + exit(EXIT_FAILURE);
>>> + }
>>> +
>>> + if (fstat(fd_in, &stat) == -1) {
>>> + perror("fstat");
>>> + exit(EXIT_FAILURE);
>>> + }
>>> + len = stat.st_size;
>>> +
>>> + fd_out = creat(argv[2], 0644);
>>
>> These days, I think we should discourage the use of creat(). Maybe
>> better to use open() instead here?
>>
>>> + if (fd_out == -1) {
>>> + perror("creat (argv[2])");
>>> + exit(EXIT_FAILURE);
>>> + }
>>> +
>>> + do {
>>> + ret = syscall(__NR_copy_file_range, fd_in, NULL,
>>> + fd_out, NULL, len, 0);
>>
>> I'd rather see this as:
>>
>> int
>> copy_file_range(int fd_in, loff_t *off_in,
>> int fd_out, loff_t *off_out, size_t len,
>> unsigned int flags)
>> {
>> return(syscall(__NR_copy_file_range, fd_in, fd_out, off_out, len, flags);
>> }
>>
>> ...
>>
>> copy_file_range(fd_in, fd_out, off_out, len, flags);
>>
>>
>>> + if (ret == -1) {
>>> + perror("copy_file_range");
>>> + exit(EXIT_FAILURE);
>>> + }
>>> +
>>> + len -= ret;
>>> + } while (len > 0);
>>> +
>>> + close(fd_in);
>>> + close(fd_out);
>>> + exit(EXIT_SUCCESS);
>>> +}
>>> +.fi
>>> +.SH SEE ALSO
>>> +.BR splice (2)
>>>
>>
>> In the next iteration of this patch, could you include a change to
>> splice(2) so that copy_file_range(2) is added under SEE ALSO in
>> that page. Also, are there any other pages that we should like
>> to/from? (sendfile(2)?)
>>
>> Thanks,
>>
>> Michael
>>
>>
>> --
>> Michael Kerrisk
>> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
>> Linux/UNIX System Programming Training: http://man7.org/training/


2015-09-22 20:30:14

by Pádraig Brady

[permalink] [raw]
Subject: Re: [PATCH v2 10/9] copy_file_range.2: New page documenting copy_file_range()

On 22/09/15 21:10, Anna Schumaker wrote:
> On 09/14/2015 02:32 PM, Darrick J. Wong wrote:
>> On Sun, Sep 13, 2015 at 09:50:18AM +0200, Michael Kerrisk (man-pages) wrote:
>>> Hi Anna,

>>> Furthermore, I even wonder if explicitly specifying flags as
>>> COPY_FR_COPY | COPY_FR_REFLINK should just generate an EINVAL
>>> error. 0 already gives us the behavior described above,
>>> and allowing the combination COPY_FR_COPY | COPY_FR_REFLINK
>>> perhaps just contributes to misleading the user that these
>>> flags are orthogonal, when in reality they are not. What do
>>> you think?
>>
>> Personally, I think it's a little weird that one turns on reflink with a flag;
>> turns on regular copy with a different flag; and turns on both by not
>> specifying either flag. :)
>
> Is there a better behavior for flags=0? I was thinking this would be what people want when they don't care how the copy happens in the kernel.

As a user, I'm fine with this default and the interface in general.

thanks,
Pádraig.

2015-09-22 20:30:51

by Anna Schumaker

[permalink] [raw]
Subject: Re: [PATCH v2 10/9] copy_file_range.2: New page documenting copy_file_range()

Hi Michael,

On 09/13/2015 03:50 AM, Michael Kerrisk (man-pages) wrote:
> Hi Anna,
>
> On 09/11/2015 10:30 PM, Anna Schumaker wrote:
>> copy_file_range() is a new system call for copying ranges of data
>> completely in the kernel. This gives filesystems an opportunity to
>> implement some kind of "copy acceleration", such as reflinks or
>> server-side-copy (in the case of NFS).
>>
>> Signed-off-by: Anna Schumaker <[email protected]>
>
> Thanks for writing such a nice first draft man page! I have a few
> comments below. Would you be willing to revise and resubmit?

Sorry for the delay in responding, I'm cycling back around to copy stuff now. I'll be more than happy to revise and resubmit!

>
>> ---
>> man2/copy_file_range.2 | 188 +++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 188 insertions(+)
>> create mode 100644 man2/copy_file_range.2
>>
>> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
>> new file mode 100644
>> index 0000000..84912b5
>> --- /dev/null
>> +++ b/man2/copy_file_range.2
>> @@ -0,0 +1,188 @@
>> +.\"This manpage is Copyright (C) 2015 Anna Schumaker <[email protected]>
>
> We need a license for this page. Please see
> https://www.kernel.org/doc/man-pages/licenses.html

Sure.

>
>> +.TH COPY 2 2015-8-31 "Linux" "Linux Programmer's Manual"
>
> Make the month 2 digits (leading 0).

Okay.

>
>> +.SH NAME
>> +copy_file_range \- Copy a range of data from one file to another
>> +.SH SYNOPSIS
>> +.nf
>> +.B #include <linux/copy.h>
>> +.B #include <sys/syscall.h>
>> +.B #include <unistd.h>
>> +
>> +.BI "ssize_t syscall(__NR_copy_file_range, int " fd_in ", loff_t * " off_in ",
>> +.BI " int " fd_out ", loff_t * " off_out ", size_t " len ",
>> +.BI " unsigned int " flags );
>
> Remove spaces following "*" in the lines above. (man-pages convention for code)
>
> I know that the copy_file_range() (obviously) doesn't yet have a wrapper
> in glibc, but in the man pages we document all system calls as though there
> is a wrapper. See, for example, gettid(2), for an axample of how this is done
> (a note in the SYNOPSIS and then some further details in NOTES).

Makes sense. I'll change that!
>
>> +.fi
>> +.SH DESCRIPTION
>> +The
>> +.BR copy_file_range ()
>> +system call performs an in-kernel copy between two file descriptors
>> +without all that tedious mucking about in userspace.
>
> I'd write that last piece as:
>
> "without the cost of (a loop) transferring data from the kernel to a
> user-space buffer and then back to the kernel again.

Aww, I was hoping to sneak in a Hitchhiker's Guide reference... I'll change the text.

>
>> +It copies up to
>> +.I len
>> +bytes of data from file descriptor
>> +.I fd_in
>> +to file descriptor
>> +.IR fd_out ,
>> +overwriting any data that exists within the requested range.
>
> s/.$/ of the target file./
>
>> +
>> +The following semantics apply for
>> +.IR off_in ,
>> +and similar statements apply to
>> +.IR off_out :
>> +.IP * 3
>> +If
>> +.I off_in
>> +is NULL, then bytes are read from
>> +.I fd_in
>> +starting from the current file offset and the current
>> +file offset is adjusted appropriately.
>> +.IP *
>> +If
>> +.I off_in
>> +is not NULL, then
>> +.I off_in
>> +must point to a buffer that specifies the starting
>> +offset where bytes from
>> +.I fd_in
>> +will be read. The current file offset of
>> +.I fd_in
>> +is not changed, but
>> +.I off_in
>> +is adjusted appropriately.
>> +.PP
>> +
>> +The
>> +.I flags
>> +argument is a bit mask composed by OR-ing together zero
>> +or more of the following flags:
>> +.TP 1.9i
>> +.B COPY_FR_COPY
>> +Copy all the file data in the requested range.
>> +Some filesystems, like NFS, might be able to accelerate this copy
>> +to avoid unnecessary data transfers.
>> +.TP
>> +.B COPY_FR_REFLINK
>> +Create a lightweight "reflink", where data is not copied until
>> +one of the files is modified.
>> +.PP
>> +The default behavior
>> +.RI ( flags
>> +== 0) is to try creating a reflink,
>> +and if reflinking fails
>> +.BR copy_file_range ()
>> +will fall back on performing a full data copy.
>
> s/back on/back to/
>
>> +This is equivalent to setting
>> +.I flags
>> +equal to
>> +.RB ( COPY_FR_COPY | COPY_FR_REFLINK ).
>
> So, from an API deign perspective, the interoperation of these two
> flags appears odd. Bit flags are commonly (not always) orthogonal.
> I think here it could be pointed out a little more explicitly that
> these two flags are not orthogonal. In particular, perhaps after the
> last sentence, you could add another sentence:
>
> [[
> (This contrasts with specifying
> .I flags
> as just
> .BR COPY_FR_REFLINK ,
> which causes the call to create a reflink,
> and fail if that is not possible,
> rather than falling back to a full data copy.)
> ]]
>
> Furthermore, I even wonder if explicitly specifying flags as
> COPY_FR_COPY | COPY_FR_REFLINK should just generate an EINVAL
> error. 0 already gives us the behavior described above,
> and allowing the combination COPY_FR_COPY | COPY_FR_REFLINK
> perhaps just contributes to misleading the user that these
> flags are orthogonal, when in reality they are not. What do
> you think?

That could work. I'll think on this before resubmitting anything.

>
> What are the semantics with respect to signals, especially if data
> copying a very large file range? This info should be included in the
> man page, probably under NOTES.

I'm not sure offhand. I need to look into this!

>
>> +.SH RETURN VALUE
>> +Upon successful completion,
>> +.BR copy_file_range ()
>> +will return the number of bytes copied between files.
>> +This could be less than the length originally requested.
>> +
>> +On error,
>> +.BR copy_file_range ()
>> +returns \-1 and
>> +.I errno
>> +is set to indicate the error.
>> +.SH ERRORS
>> +.TP
>> +.B EBADF
>> +One or more file descriptors are not valid,
>> +or do not have proper read-write mode;
>
> I think that last line can go. I mean, isn't this point (again)
> covered in the next few lines?
>
>> +.I fd_in
>> +is not open for reading; or
>> +.I fd_out
>> +is not open for writing.
>> +.TP
>> +.B EINVAL
>> +Requested range extends beyond the end of the file; or the
>
> s/file/source file/ (right?)
>
>> +.I flags
>> +argument is set to an invalid value.
>> +.TP
>> +.B EIO
>> +A low level I/O error occurred while copying.
>> +.TP
>> +.B ENOMEM
>> +Out of memory.
>> +.TP
>> +.B ENOSPC
>> +There is not enough space to complete the copy.
>
> Space where? On the filesystem?
> => s/space/space on the target filesystem/
>
>> +.TP
>> +.B EOPNOTSUPP
>> +.B COPY_REFLINK
>> +was specified in
>> +.IR flags ,
>> +but the target filesystem does not support reflinks.
>> +.TP
>> +.B EXDEV
>> +Target filesystem doesn't support cross-filesystem copies.
>
> I'm curious. What are some examples of filesystems that produce this
> error?
>
>> +.SH VERSIONS
>> +The
>> +.BR copy_file_range ()
>> +system call first appeared in Linux 4.4.
>> +.SH CONFORMING TO
>> +The
>> +.BR copy_file_range ()
>> +system call is a nonstandard Linux extension.
>> +.SH EXAMPLE
>> +.nf
>> +
>> +#define _GNU_SOURCE
>> +#include <fcntl.h>
>> +#include <linux/copy.h>
>> +#include <stdio.h>
>> +#include <stdlib.h>
>> +#include <sys/stat.h>
>> +#include <sys/syscall.h>
>> +#include <unistd.h>
>> +
>> +
>> +int main(int argc, char **argv)
>> +{
>> + int fd_in, fd_out;
>> + struct stat stat;
>> + loff_t len, ret;
>> +
>> + if (argc != 3) {
>> + fprintf(stderr, "Usage: %s <source> <destination>\\n", argv[0]);
>> + exit(EXIT_FAILURE);
>> + }
>> +
>> + fd_in = open(argv[1], O_RDONLY);
>> + if (fd_in == -1) {
>
> Please replace all "-" in code by "\-". (This is a groff
> detail.)

Sure.

>
>> + perror("open (argv[1])");
>> + exit(EXIT_FAILURE);
>> + }
>> +
>> + if (fstat(fd_in, &stat) == -1) {
>> + perror("fstat");
>> + exit(EXIT_FAILURE);
>> + }
>> + len = stat.st_size;
>> +
>> + fd_out = creat(argv[2], 0644);
>
> These days, I think we should discourage the use of creat(). Maybe
> better to use open() instead here?

Sure, that's a simple enough switch.

>
>> + if (fd_out == -1) {
>> + perror("creat (argv[2])");
>> + exit(EXIT_FAILURE);
>> + }
>> +
>> + do {
>> + ret = syscall(__NR_copy_file_range, fd_in, NULL,
>> + fd_out, NULL, len, 0);
>
> I'd rather see this as:
>
> int
> copy_file_range(int fd_in, loff_t *off_in,
> int fd_out, loff_t *off_out, size_t len,
> unsigned int flags)
> {
> return(syscall(__NR_copy_file_range, fd_in, fd_out, off_out, len, flags);
> }
>
> ...
>
> copy_file_range(fd_in, fd_out, off_out, len, flags);

Okay.

>
>
>> + if (ret == -1) {
>> + perror("copy_file_range");
>> + exit(EXIT_FAILURE);
>> + }
>> +
>> + len -= ret;
>> + } while (len > 0);
>> +
>> + close(fd_in);
>> + close(fd_out);
>> + exit(EXIT_SUCCESS);
>> +}
>> +.fi
>> +.SH SEE ALSO
>> +.BR splice (2)
>>
>
> In the next iteration of this patch, could you include a change to
> splice(2) so that copy_file_range(2) is added under SEE ALSO in
> that page. Also, are there any other pages that we should like
> to/from? (sendfile(2)?)

Sure, no problem.

Thanks for the feedback!
Anna

>
> Thanks,
>
> Michael
>
>


2015-09-28 17:24:38

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v2 10/9] copy_file_range.2: New page documenting copy_file_range()

On Tue, Sep 22, 2015 at 04:10:47PM -0400, Anna Schumaker wrote:
> On 09/14/2015 02:32 PM, Darrick J. Wong wrote:
> > On Sun, Sep 13, 2015 at 09:50:18AM +0200, Michael Kerrisk (man-pages) wrote:
> >> Hi Anna,
> >>
> >> On 09/11/2015 10:30 PM, Anna Schumaker wrote:
> >>> copy_file_range() is a new system call for copying ranges of data
> >>> completely in the kernel. This gives filesystems an opportunity to
> >>> implement some kind of "copy acceleration", such as reflinks or
> >>> server-side-copy (in the case of NFS).
> >>>
> >>> Signed-off-by: Anna Schumaker <[email protected]>
> >>
> >> Thanks for writing such a nice first draft man page! I have a few
> >> comments below. Would you be willing to revise and resubmit?
> >>
> >>> ---
> >>> man2/copy_file_range.2 | 188 +++++++++++++++++++++++++++++++++++++++++++++++++
> >>> 1 file changed, 188 insertions(+)
> >>> create mode 100644 man2/copy_file_range.2
> >>>
> >>> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> >>> new file mode 100644
> >>> index 0000000..84912b5
> >>> --- /dev/null
> >>> +++ b/man2/copy_file_range.2
> >>> @@ -0,0 +1,188 @@
> >>> +.\"This manpage is Copyright (C) 2015 Anna Schumaker <[email protected]>
> >>
> >> We need a license for this page. Please see
> >> https://www.kernel.org/doc/man-pages/licenses.html
> >>
> >>> +.TH COPY 2 2015-8-31 "Linux" "Linux Programmer's Manual"
> >>
> >> Make the month 2 digits (leading 0).
> >>
> >>> +.SH NAME
> >>> +copy_file_range \- Copy a range of data from one file to another
> >>> +.SH SYNOPSIS
> >>> +.nf
> >>> +.B #include <linux/copy.h>
> >>> +.B #include <sys/syscall.h>
> >>> +.B #include <unistd.h>
> >>> +
> >>> +.BI "ssize_t syscall(__NR_copy_file_range, int " fd_in ", loff_t * " off_in ",
> >>> +.BI " int " fd_out ", loff_t * " off_out ", size_t " len ",
> >>> +.BI " unsigned int " flags );
> >>
> >> Remove spaces following "*" in the lines above. (man-pages convention for code)
> >>
> >> I know that the copy_file_range() (obviously) doesn't yet have a wrapper
> >> in glibc, but in the man pages we document all system calls as though there
> >> is a wrapper. See, for example, gettid(2), for an axample of how this is done
> >> (a note in the SYNOPSIS and then some further details in NOTES).
> >>
> >>> +.fi
> >>> +.SH DESCRIPTION
> >>> +The
> >>> +.BR copy_file_range ()
> >>> +system call performs an in-kernel copy between two file descriptors
> >>> +without all that tedious mucking about in userspace.
> >>
> >> I'd write that last piece as:
> >>
> >> "without the cost of (a loop) transferring data from the kernel to a
> >> user-space buffer and then back to the kernel again.
> >>
> >>> +It copies up to
> >>> +.I len
> >>> +bytes of data from file descriptor
> >>> +.I fd_in
> >>> +to file descriptor
> >>> +.IR fd_out ,
> >>> +overwriting any data that exists within the requested range.
> >>
> >> s/.$/ of the target file./
> >>
> >>> +
> >>> +The following semantics apply for
> >>> +.IR off_in ,
> >>> +and similar statements apply to
> >>> +.IR off_out :
> >>> +.IP * 3
> >>> +If
> >>> +.I off_in
> >>> +is NULL, then bytes are read from
> >>> +.I fd_in
> >>> +starting from the current file offset and the current
> >>> +file offset is adjusted appropriately.
> >>> +.IP *
> >>> +If
> >>> +.I off_in
> >>> +is not NULL, then
> >>> +.I off_in
> >>> +must point to a buffer that specifies the starting
> >>> +offset where bytes from
> >>> +.I fd_in
> >>> +will be read. The current file offset of
> >>> +.I fd_in
> >>> +is not changed, but
> >>> +.I off_in
> >>> +is adjusted appropriately.
> >>> +.PP
> >>> +
> >>> +The
> >>> +.I flags
> >>> +argument is a bit mask composed by OR-ing together zero
> >>> +or more of the following flags:
> >>> +.TP 1.9i
> >>> +.B COPY_FR_COPY
> >>> +Copy all the file data in the requested range.
> >>> +Some filesystems, like NFS, might be able to accelerate this copy
> >>> +to avoid unnecessary data transfers.
> >>> +.TP
> >>> +.B COPY_FR_REFLINK
> >>> +Create a lightweight "reflink", where data is not copied until
> >>> +one of the files is modified.
> >>> +.PP
> >>> +The default behavior
> >>> +.RI ( flags
> >>> +== 0) is to try creating a reflink,
> >>> +and if reflinking fails
> >>> +.BR copy_file_range ()
> >>> +will fall back on performing a full data copy.
> >>
> >> s/back on/back to/
> >>
> >>> +This is equivalent to setting
> >>> +.I flags
> >>> +equal to
> >>> +.RB ( COPY_FR_COPY | COPY_FR_REFLINK ).
> >>
> >> So, from an API deign perspective, the interoperation of these two
> >> flags appears odd. Bit flags are commonly (not always) orthogonal.
> >> I think here it could be pointed out a little more explicitly that
> >> these two flags are not orthogonal. In particular, perhaps after the
> >> last sentence, you could add another sentence:
> >>
> >> [[
> >> (This contrasts with specifying
> >> .I flags
> >> as just
> >> .BR COPY_FR_REFLINK ,
> >> which causes the call to create a reflink,
> >> and fail if that is not possible,
> >> rather than falling back to a full data copy.)
> >> ]]
> >>
> >> Furthermore, I even wonder if explicitly specifying flags as
> >> COPY_FR_COPY | COPY_FR_REFLINK should just generate an EINVAL
> >> error. 0 already gives us the behavior described above,
> >> and allowing the combination COPY_FR_COPY | COPY_FR_REFLINK
> >> perhaps just contributes to misleading the user that these
> >> flags are orthogonal, when in reality they are not. What do
> >> you think?
> >
> > Personally, I think it's a little weird that one turns on reflink with a flag;
> > turns on regular copy with a different flag; and turns on both by not
> > specifying either flag. :)
>
> Is there a better behavior for flags=0? I was thinking this would be what
> people want when they don't care how the copy happens in the kernel.

[sorry, was out on vacation for a week]

Probably not. I guess you could "#define COPY_FR_CHEFS_SURPRISE 0" so at least
it'd be obvious in userland source code that the last argument is flags
and not just some number that happens to be zero. (I don't like when I have
to look up the manpage to remember which arguments are numbers and which are
flags.)

(OTOH I concede that we do have a precedent of flags == 0 meaning 'use sane
defaults'.)

--D

>
> Anna
>
> >
> >> What are the semantics with respect to signals, especially if data
> >> copying a very large file range? This info should be included in the
> >> man page, probably under NOTES.
> >>
> >>> +.SH RETURN VALUE
> >>> +Upon successful completion,
> >>> +.BR copy_file_range ()
> >>> +will return the number of bytes copied between files.
> >>> +This could be less than the length originally requested.
> >>> +
> >>> +On error,
> >>> +.BR copy_file_range ()
> >>> +returns \-1 and
> >>> +.I errno
> >>> +is set to indicate the error.
> >>> +.SH ERRORS
> >>> +.TP
> >>> +.B EBADF
> >>> +One or more file descriptors are not valid,
> >>> +or do not have proper read-write mode;
> >>
> >> I think that last line can go. I mean, isn't this point (again)
> >> covered in the next few lines?
> >
> > Or change the ';' to a ':'?
> >
> >>> +.I fd_in
> >>> +is not open for reading; or
> >>> +.I fd_out
> >>> +is not open for writing.
> >>> +.TP
> >>> +.B EINVAL
> >>> +Requested range extends beyond the end of the file; or the
> >>
> >> s/file/source file/ (right?)
> >>
> >>> +.I flags
> >>> +argument is set to an invalid value.
> >>> +.TP
> >>> +.B EIO
> >>> +A low level I/O error occurred while copying.
> >>> +.TP
> >>> +.B ENOMEM
> >>> +Out of memory.
> >>> +.TP
> >>> +.B ENOSPC
> >>> +There is not enough space to complete the copy.
> >>
> >> Space where? On the filesystem?
> >> => s/space/space on the target filesystem/
> >>
> >>> +.TP
> >>> +.B EOPNOTSUPP
> >>> +.B COPY_REFLINK
> >>> +was specified in
> >>> +.IR flags ,
> >>> +but the target filesystem does not support reflinks.
> >>> +.TP
> >>> +.B EXDEV
> >>> +Target filesystem doesn't support cross-filesystem copies.
> >>
> >> I'm curious. What are some examples of filesystems that produce this
> >> error?
> >
> > btrfs and xfs (and probably most local filesystems) don't support cross-fs
> > copies.
> >
> > --D
> >
> >>
> >>> +.SH VERSIONS
> >>> +The
> >>> +.BR copy_file_range ()
> >>> +system call first appeared in Linux 4.4.
> >>> +.SH CONFORMING TO
> >>> +The
> >>> +.BR copy_file_range ()
> >>> +system call is a nonstandard Linux extension.
> >>> +.SH EXAMPLE
> >>> +.nf
> >>> +
> >>> +#define _GNU_SOURCE
> >>> +#include <fcntl.h>
> >>> +#include <linux/copy.h>
> >>> +#include <stdio.h>
> >>> +#include <stdlib.h>
> >>> +#include <sys/stat.h>
> >>> +#include <sys/syscall.h>
> >>> +#include <unistd.h>
> >>> +
> >>> +
> >>> +int main(int argc, char **argv)
> >>> +{
> >>> + int fd_in, fd_out;
> >>> + struct stat stat;
> >>> + loff_t len, ret;
> >>> +
> >>> + if (argc != 3) {
> >>> + fprintf(stderr, "Usage: %s <source> <destination>\\n", argv[0]);
> >>> + exit(EXIT_FAILURE);
> >>> + }
> >>> +
> >>> + fd_in = open(argv[1], O_RDONLY);
> >>> + if (fd_in == -1) {
> >>
> >> Please replace all "-" in code by "\-". (This is a groff
> >> detail.)
> >>
> >>> + perror("open (argv[1])");
> >>> + exit(EXIT_FAILURE);
> >>> + }
> >>> +
> >>> + if (fstat(fd_in, &stat) == -1) {
> >>> + perror("fstat");
> >>> + exit(EXIT_FAILURE);
> >>> + }
> >>> + len = stat.st_size;
> >>> +
> >>> + fd_out = creat(argv[2], 0644);
> >>
> >> These days, I think we should discourage the use of creat(). Maybe
> >> better to use open() instead here?
> >>
> >>> + if (fd_out == -1) {
> >>> + perror("creat (argv[2])");
> >>> + exit(EXIT_FAILURE);
> >>> + }
> >>> +
> >>> + do {
> >>> + ret = syscall(__NR_copy_file_range, fd_in, NULL,
> >>> + fd_out, NULL, len, 0);
> >>
> >> I'd rather see this as:
> >>
> >> int
> >> copy_file_range(int fd_in, loff_t *off_in,
> >> int fd_out, loff_t *off_out, size_t len,
> >> unsigned int flags)
> >> {
> >> return(syscall(__NR_copy_file_range, fd_in, fd_out, off_out, len, flags);
> >> }
> >>
> >> ...
> >>
> >> copy_file_range(fd_in, fd_out, off_out, len, flags);
> >>
> >>
> >>> + if (ret == -1) {
> >>> + perror("copy_file_range");
> >>> + exit(EXIT_FAILURE);
> >>> + }
> >>> +
> >>> + len -= ret;
> >>> + } while (len > 0);
> >>> +
> >>> + close(fd_in);
> >>> + close(fd_out);
> >>> + exit(EXIT_SUCCESS);
> >>> +}
> >>> +.fi
> >>> +.SH SEE ALSO
> >>> +.BR splice (2)
> >>>
> >>
> >> In the next iteration of this patch, could you include a change to
> >> splice(2) so that copy_file_range(2) is added under SEE ALSO in
> >> that page. Also, are there any other pages that we should like
> >> to/from? (sendfile(2)?)
> >>
> >> Thanks,
> >>
> >> Michael
> >>
> >>
> >> --
> >> Michael Kerrisk
> >> Linux man-pages maintainer; http://www.kernel.org/doc/man-pages/
> >> Linux/UNIX System Programming Training: http://man7.org/training/
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-api" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html