2015-08-07 20:38:30

by Anna Schumaker

Subject: [PATCH 0/7] NFSv4.2: Add support for the COPY operation

These patches add client and server support for the NFS v4.2 COPY operation.
Unlike the similar CLONE operation, COPY can support both acceleration through
reflink and a full copy of data from one file into another. These patches
make use of Zach Brown's copy_file_range() syscall and its
vfs_copy_file_range() helper, and the first three patches in this series are
simply a reposting of the patches that add the syscall.

Patch 4 expands vfs_copy_file_range() to fall back on the splice interface for
copies where the filesystem does not support copy accelerations. This behavior
is useful for NFSD, since we'll still want to copy the file even if we can't
do a reflink. Additionally, this opens up the possibility of in-kernel copies
for all filesystems without needing to do frequent switches between kernel and
user space. The only potential drawback I've noticed is that splice will write
out data in PAGE_SIZE chunks, even if wsize > PAGE_SIZE. This leads to a few
more writes over the wire, but I have not noticed a significant timing
difference. Still, I wonder if there is a better way to optimize this for NFS.

The remaining patches implement the COPY operation for both the client and the
server. The program I used for testing is included as an RFC as the last patch
in the series. I gathered performance information by comparing the runtime and
RPC count of this program against /usr/bin/cp for various file sizes.

/usr/bin/cp:
                      size:    512MB   1024MB   1536MB   2048MB
------------- ------------- -------- -------- -------- --------
nfs v4 client        total:     8203    16396    24588    32780
------------- ------------- -------- -------- -------- --------
nfs v4 client         read:     4096     8192    12288    16384
nfs v4 client        write:     4096     8192    12288    16384
nfs v4 client       commit:        1        1        1        1
nfs v4 client         open:        1        1        1        1
nfs v4 client    open_noat:        2        2        2        2
nfs v4 client        close:        1        1        1        1
nfs v4 client      setattr:        2        2        2        2
nfs v4 client       access:        2        3        3        3
nfs v4 client      getattr:        2        2        2        2

/usr/bin/cp /nfs/test-512 /nfs/test-copy 0.00s user 0.32s system 14% cpu 2.209 total
/usr/bin/cp /nfs/test-1024 /nfs/test-copy 0.00s user 0.66s system 18% cpu 3.651 total
/usr/bin/cp /nfs/test-1536 /nfs/test-copy 0.02s user 0.97s system 18% cpu 5.477 total
/usr/bin/cp /nfs/test-2048 /nfs/test-copy 0.00s user 1.38s system 15% cpu 9.085 total


Copy system call:
                      size:    512MB   1024MB   1536MB   2048MB
------------- ------------- -------- -------- -------- --------
nfs v4 client        total:        6        6        6        6
------------- ------------- -------- -------- -------- --------
nfs v4 client         open:        2        2        2        2
nfs v4 client        close:        2        2        2        2
nfs v4 client       access:        1        1        1        1
nfs v4 client         copy:        1        1        1        1


./nfscopy /nfs/test-512 /nfs/test-copy 0.00s user 0.00s system 0% cpu 1.148 total
./nfscopy /nfs/test-1024 /nfs/test-copy 0.00s user 0.00s system 0% cpu 2.293 total
./nfscopy /nfs/test-1536 /nfs/test-copy 0.00s user 0.00s system 0% cpu 3.037 total
./nfscopy /nfs/test-2048 /nfs/test-copy 0.00s user 0.00s system 0% cpu 4.045 total
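
The READ and WRITE counts in the cp table are consistent with plain ceiling
division of the file size by the transfer size. The sketch below checks that
arithmetic under one assumption: a 128 KiB transfer size, which is inferred
from the table (512 MiB / 4096 RPCs) and is not stated anywhere in this mail.
The helper names are made up for this illustration.

```c
#include <stdint.h>
#include <stdio.h>

/* Number of fixed-size RPCs needed to move `bytes` bytes, rounding up. */
static uint64_t rpcs_needed(uint64_t bytes, uint64_t io_size)
{
	return (bytes + io_size - 1) / io_size;
}

/* Print the expected per-direction RPC count for each size in the table. */
static void print_expected_counts(void)
{
	const uint64_t io_size = 128 * 1024;	/* assumed, see above */
	const uint64_t sizes_mb[] = { 512, 1024, 1536, 2048 };
	size_t i;

	for (i = 0; i < sizeof(sizes_mb) / sizeof(sizes_mb[0]); i++) {
		uint64_t bytes = sizes_mb[i] * 1024 * 1024;
		uint64_t n = rpcs_needed(bytes, io_size);

		printf("%4llu MB: cp needs %llu READs and %llu WRITEs, COPY needs 1 RPC\n",
		       (unsigned long long)sizes_mb[i],
		       (unsigned long long)n, (unsigned long long)n);
	}
}
```

With cp the data RPC count grows linearly with the file size, while the COPY
path stays at a single data-carrying operation regardless of size.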


Questions, comments, and other testing ideas would be greatly appreciated!

Thanks,
Anna


Anna Schumaker (4):
VFS: Fall back on splice if no copy function defined
nfsd: Pass filehandle to nfs4_preprocess_stateid_op()
NFSD: Implement the COPY call
NFS: Add COPY nfs operation

Zach Brown (3):
vfs: add copy_file_range syscall and vfs helper
x86: add sys_copy_file_range to syscall tables
btrfs: add .copy_file_range file operation

arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
fs/btrfs/ctree.h | 3 +
fs/btrfs/file.c | 1 +
fs/btrfs/ioctl.c | 91 ++++++++++++----------
fs/nfs/nfs42.h | 1 +
fs/nfs/nfs42proc.c | 40 ++++++++++
fs/nfs/nfs42xdr.c | 136 +++++++++++++++++++++++++++++++++
fs/nfs/nfs4file.c | 8 ++
fs/nfs/nfs4proc.c | 1 +
fs/nfs/nfs4xdr.c | 1 +
fs/nfsd/nfs4proc.c | 79 +++++++++++++++++--
fs/nfsd/nfs4state.c | 5 +-
fs/nfsd/nfs4xdr.c | 62 ++++++++++++++-
fs/nfsd/state.h | 4 +-
fs/nfsd/vfs.c | 13 ++++
fs/nfsd/vfs.h | 1 +
fs/nfsd/xdr4.h | 23 ++++++
fs/read_write.c | 133 ++++++++++++++++++++++++++++++++
include/linux/fs.h | 3 +
include/linux/nfs4.h | 1 +
include/linux/nfs_fs_sb.h | 1 +
include/linux/nfs_xdr.h | 27 +++++++
include/uapi/asm-generic/unistd.h | 4 +-
kernel/sys_ni.c | 1 +
25 files changed, 587 insertions(+), 54 deletions(-)

--
2.5.0



2015-08-07 20:38:32

by Anna Schumaker

Subject: [PATCH 3/7] btrfs: add .copy_file_range file operation

From: Zach Brown <[email protected]>

This rearranges the existing COPY_RANGE ioctl implementation so that the
.copy_file_range file operation can call the core loop that copies file
data extent items.

The extent copying loop is lifted up into its own function. It retains
the core btrfs error checks that should be shared.
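
One detail worth spelling out is the return-value convention: the clone
helper reports 0 on success, while a .copy_file_range method reports the
number of bytes copied. A minimal userspace model of that adapter follows.
clone_range_model() and copy_range_model() are hypothetical stand-ins, and
the zero-length rejection is an arbitrary choice for the sketch, not btrfs
behavior.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Stand-in for a clone-style helper: returns 0 on success or a negative
 * errno. It only rejects a zero-length request so that both outcomes can
 * be exercised; a real helper would share extents between the files.
 */
static int clone_range_model(size_t len)
{
	return len == 0 ? -EINVAL : 0;
}

/*
 * copy_file_range-style wrapper: a clone is all-or-nothing, so success is
 * reported as "len bytes copied" and errors pass through unchanged.
 */
static ssize_t copy_range_model(size_t len)
{
	ssize_t ret = clone_range_model(len);

	if (ret == 0)
		ret = (ssize_t)len;
	return ret;
}
```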

Signed-off-by: Zach Brown <[email protected]>
---
fs/btrfs/ctree.h | 3 ++
fs/btrfs/file.c | 1 +
fs/btrfs/ioctl.c | 91 ++++++++++++++++++++++++++++++++------------------------
3 files changed, 56 insertions(+), 39 deletions(-)

diff --git a/fs/btrfs/ctree.h b/fs/btrfs/ctree.h
index aac314e..e09d4e2 100644
--- a/fs/btrfs/ctree.h
+++ b/fs/btrfs/ctree.h
@@ -4000,6 +4000,9 @@ int btrfs_dirty_pages(struct btrfs_root *root, struct inode *inode,
loff_t pos, size_t write_bytes,
struct extent_state **cached);
int btrfs_fdatawrite_range(struct inode *inode, loff_t start, loff_t end);
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, int flags);

/* tree-defrag.c */
int btrfs_defrag_leaves(struct btrfs_trans_handle *trans,
diff --git a/fs/btrfs/file.c b/fs/btrfs/file.c
index b823fac..b05449c 100644
--- a/fs/btrfs/file.c
+++ b/fs/btrfs/file.c
@@ -2816,6 +2816,7 @@ const struct file_operations btrfs_file_operations = {
#ifdef CONFIG_COMPAT
.compat_ioctl = btrfs_ioctl,
#endif
+ .copy_file_range = btrfs_copy_file_range,
};

void btrfs_auto_defrag_exit(void)
diff --git a/fs/btrfs/ioctl.c b/fs/btrfs/ioctl.c
index 0770c91..62ae286 100644
--- a/fs/btrfs/ioctl.c
+++ b/fs/btrfs/ioctl.c
@@ -3719,17 +3719,16 @@ out:
return ret;
}

-static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
- u64 off, u64 olen, u64 destoff)
+static noinline int btrfs_clone_files(struct file *file, struct file *file_src,
+ u64 off, u64 olen, u64 destoff)
{
struct inode *inode = file_inode(file);
+ struct inode *src = file_inode(file_src);
struct btrfs_root *root = BTRFS_I(inode)->root;
- struct fd src_file;
- struct inode *src;
int ret;
u64 len = olen;
u64 bs = root->fs_info->sb->s_blocksize;
- int same_inode = 0;
+ int same_inode = src == inode;

/*
* TODO:
@@ -3742,49 +3741,20 @@ static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
* be either compressed or non-compressed.
*/

- /* the destination must be opened for writing */
- if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
- return -EINVAL;
-
if (btrfs_root_readonly(root))
return -EROFS;

- ret = mnt_want_write_file(file);
- if (ret)
- return ret;
-
- src_file = fdget(srcfd);
- if (!src_file.file) {
- ret = -EBADF;
- goto out_drop_write;
- }
-
- ret = -EXDEV;
- if (src_file.file->f_path.mnt != file->f_path.mnt)
- goto out_fput;
-
- src = file_inode(src_file.file);
-
- ret = -EINVAL;
- if (src == inode)
- same_inode = 1;
-
- /* the src must be open for reading */
- if (!(src_file.file->f_mode & FMODE_READ))
- goto out_fput;
+ if (file_src->f_path.mnt != file->f_path.mnt ||
+ src->i_sb != inode->i_sb)
+ return -EXDEV;

/* don't make the dst file partly checksummed */
if ((BTRFS_I(src)->flags & BTRFS_INODE_NODATASUM) !=
(BTRFS_I(inode)->flags & BTRFS_INODE_NODATASUM))
- goto out_fput;
+ return -EINVAL;

- ret = -EISDIR;
if (S_ISDIR(src->i_mode) || S_ISDIR(inode->i_mode))
- goto out_fput;
-
- ret = -EXDEV;
- if (src->i_sb != inode->i_sb)
- goto out_fput;
+ return -EISDIR;

if (!same_inode) {
if (inode < src) {
@@ -3877,6 +3847,49 @@ out_unlock:
} else {
mutex_unlock(&src->i_mutex);
}
+ return ret;
+}
+
+ssize_t btrfs_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, int flags)
+{
+ ssize_t ret;
+
+ ret = btrfs_clone_files(file_out, file_in, pos_in, len, pos_out);
+ if (ret == 0)
+ ret = len;
+ return ret;
+}
+
+static noinline long btrfs_ioctl_clone(struct file *file, unsigned long srcfd,
+ u64 off, u64 olen, u64 destoff)
+{
+ struct fd src_file;
+ int ret;
+
+ /* the destination must be opened for writing */
+ if (!(file->f_mode & FMODE_WRITE) || (file->f_flags & O_APPEND))
+ return -EINVAL;
+
+ ret = mnt_want_write_file(file);
+ if (ret)
+ return ret;
+
+ src_file = fdget(srcfd);
+ if (!src_file.file) {
+ ret = -EBADF;
+ goto out_drop_write;
+ }
+
+ /* the src must be open for reading */
+ if (!(src_file.file->f_mode & FMODE_READ)) {
+ ret = -EINVAL;
+ goto out_fput;
+ }
+
+ ret = btrfs_clone_files(file, src_file.file, off, olen, destoff);
+
out_fput:
fdput(src_file);
out_drop_write:
--
2.5.0


2015-08-07 20:38:32

by Anna Schumaker

Subject: [PATCH 4/7] VFS: Fall back on splice if no copy function defined

The NFS server will need a fallback for filesystems that don't have any
kind of copy acceleration yet. Let's handle this by having
vfs_copy_file_range() fall back to splice, enabling an in-kernel fallback for
all filesystems.
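
The dispatch described above can be modeled in userspace as an optional
function pointer plus a generic path that is tried when the pointer is
missing or declines the request. This is only a sketch: every model_* name is
hypothetical, memcpy() stands in for splice, and MODEL_ENOTSUPP mirrors the
kernel-internal ENOTSUPP value (524), which userspace errno.h does not
define.

```c
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define MODEL_ENOTSUPP 524	/* same value as the kernel-internal ENOTSUPP */

struct model_file;

struct model_file_ops {
	/* Optional accelerated copy; may be NULL. */
	ssize_t (*copy_range)(struct model_file *in, struct model_file *out,
			      size_t len);
};

struct model_file {
	const struct model_file_ops *ops;
	char data[64];
};

/* An accelerated method that always declines, to exercise the fallback. */
static ssize_t declining_copy(struct model_file *in, struct model_file *out,
			      size_t len)
{
	(void)in; (void)out; (void)len;
	return -MODEL_ENOTSUPP;
}

/* Generic path: always available, plain memcpy (stands in for splice). */
static ssize_t generic_copy(struct model_file *in, struct model_file *out,
			    size_t len)
{
	if (len > sizeof(in->data))
		len = sizeof(in->data);
	memcpy(out->data, in->data, len);
	return (ssize_t)len;
}

/* Try the accelerated method first, then fall back to the generic path. */
static ssize_t model_copy_range(struct model_file *in, struct model_file *out,
				size_t len)
{
	ssize_t ret = -MODEL_ENOTSUPP;

	if (in->ops && in->ops->copy_range)
		ret = in->ops->copy_range(in, out, len);
	if (ret == -MODEL_ENOTSUPP)
		ret = generic_copy(in, out, len);
	return ret;
}
```

Any other error from the accelerated method is returned as-is, so only an
explicit "not supported" answer triggers the generic copy.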

Signed-off-by: Anna Schumaker <[email protected]>
---
fs/read_write.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 3804547..e564a6b 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1358,7 +1358,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (!(file_in->f_mode & FMODE_READ) ||
!(file_out->f_mode & FMODE_WRITE) ||
(file_out->f_flags & O_APPEND) ||
- !file_in->f_op || !file_in->f_op->copy_file_range)
+ !file_in->f_op)
return -EINVAL;

inode_in = file_inode(file_in);
@@ -1382,8 +1382,12 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (ret)
return ret;

- ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
- len, flags);
+ ret = -ENOTSUPP;
+ if (file_in->f_op->copy_file_range)
+ ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out,
+ pos_out, len, flags);
+ if (ret == -ENOTSUPP)
+ ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out, len, flags);
if (ret > 0) {
fsnotify_access(file_in);
add_rchar(current, ret);
--
2.5.0


2015-08-07 20:38:34

by Anna Schumaker

Subject: [PATCH 5/7] nfsd: Pass filehandle to nfs4_preprocess_stateid_op()

This will be needed so COPY can look up the saved_fh in addition to the
current_fh.
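
To illustrate why the explicit argument helps, here is a small userspace
model in which the stateid check takes the filehandle as a parameter, so a
COPY-style caller can validate the source against the saved handle and the
destination against the current one. All model_* names are hypothetical, and
the "check" is reduced to comparing inode numbers; the real function does
far more.

```c
#include <stddef.h>

struct model_fh { unsigned int ino; };
struct model_stateid { unsigned int ino; };	/* file the stateid belongs to */

struct model_compound_state {
	struct model_fh current_fh;	/* destination file for COPY */
	struct model_fh save_fh;	/* source file for COPY */
};

/* The caller chooses which filehandle the stateid is checked against. */
static int model_preprocess_stateid(const struct model_fh *fhp,
				    const struct model_stateid *stid)
{
	return fhp->ino == stid->ino ? 0 : -1;
}

/* COPY-style check: source against save_fh, destination against current_fh. */
static int model_verify_copy(const struct model_compound_state *cs,
			     const struct model_stateid *src,
			     const struct model_stateid *dst)
{
	int status = model_preprocess_stateid(&cs->save_fh, src);

	if (status)
		return status;
	return model_preprocess_stateid(&cs->current_fh, dst);
}
```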

Signed-off-by: Anna Schumaker <[email protected]>
---
fs/nfsd/nfs4proc.c | 16 +++++++++-------
fs/nfsd/nfs4state.c | 5 ++---
fs/nfsd/state.h | 4 ++--
3 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index 90cfda7..d34c967 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -776,8 +776,9 @@ nfsd4_read(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
clear_bit(RQ_SPLICE_OK, &rqstp->rq_flags);

/* check stateid */
- status = nfs4_preprocess_stateid_op(rqstp, cstate, &read->rd_stateid,
- RD_STATE, &read->rd_filp, &read->rd_tmp_file);
+ status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
+ &read->rd_stateid, RD_STATE,
+ &read->rd_filp, &read->rd_tmp_file);
if (status) {
dprintk("NFSD: nfsd4_read: couldn't process stateid!\n");
goto out;
@@ -923,7 +924,8 @@ nfsd4_setattr(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,

if (setattr->sa_iattr.ia_valid & ATTR_SIZE) {
status = nfs4_preprocess_stateid_op(rqstp, cstate,
- &setattr->sa_stateid, WR_STATE, NULL, NULL);
+ &cstate->current_fh, &setattr->sa_stateid,
+ WR_STATE, NULL, NULL);
if (status) {
dprintk("NFSD: nfsd4_setattr: couldn't process stateid!\n");
return status;
@@ -987,8 +989,8 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
if (write->wr_offset >= OFFSET_MAX)
return nfserr_inval;

- status = nfs4_preprocess_stateid_op(rqstp, cstate, stateid, WR_STATE,
- &filp, NULL);
+ status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
+ stateid, WR_STATE, &filp, NULL);
if (status) {
dprintk("NFSD: nfsd4_write: couldn't process stateid!\n");
return status;
@@ -1018,7 +1020,7 @@ nfsd4_fallocate(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
__be32 status = nfserr_notsupp;
struct file *file;

- status = nfs4_preprocess_stateid_op(rqstp, cstate,
+ status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
&fallocate->falloc_stateid,
WR_STATE, &file, NULL);
if (status != nfs_ok) {
@@ -1057,7 +1059,7 @@ nfsd4_seek(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
__be32 status;
struct file *file;

- status = nfs4_preprocess_stateid_op(rqstp, cstate,
+ status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
&seek->seek_stateid,
RD_STATE, &file, NULL);
if (status) {
diff --git a/fs/nfsd/nfs4state.c b/fs/nfsd/nfs4state.c
index 61dfb33..7b0059d 100644
--- a/fs/nfsd/nfs4state.c
+++ b/fs/nfsd/nfs4state.c
@@ -4645,10 +4645,9 @@ nfs4_check_file(struct svc_rqst *rqstp, struct svc_fh *fhp, struct nfs4_stid *s,
*/
__be32
nfs4_preprocess_stateid_op(struct svc_rqst *rqstp,
- struct nfsd4_compound_state *cstate, stateid_t *stateid,
- int flags, struct file **filpp, bool *tmp_file)
+ struct nfsd4_compound_state *cstate, struct svc_fh *fhp,
+ stateid_t *stateid, int flags, struct file **filpp, bool *tmp_file)
{
- struct svc_fh *fhp = &cstate->current_fh;
struct inode *ino = d_inode(fhp->fh_dentry);
struct net *net = SVC_NET(rqstp);
struct nfsd_net *nn = net_generic(net, nfsd_net_id);
diff --git a/fs/nfsd/state.h b/fs/nfsd/state.h
index 4874ce5..d3e81ce 100644
--- a/fs/nfsd/state.h
+++ b/fs/nfsd/state.h
@@ -584,8 +584,8 @@ struct nfsd4_compound_state;
struct nfsd_net;

extern __be32 nfs4_preprocess_stateid_op(struct svc_rqst *rqstp,
- struct nfsd4_compound_state *cstate, stateid_t *stateid,
- int flags, struct file **filp, bool *tmp_file);
+ struct nfsd4_compound_state *cstate, struct svc_fh *fhp,
+ stateid_t *stateid, int flags, struct file **filp, bool *tmp_file);
__be32 nfsd4_lookup_stateid(struct nfsd4_compound_state *cstate,
stateid_t *stateid, unsigned char typemask,
struct nfs4_stid **s, struct nfsd_net *nn);
--
2.5.0


2015-08-07 20:38:34

by Anna Schumaker

Subject: [PATCH 1/7] vfs: add copy_file_range syscall and vfs helper

From: Zach Brown <[email protected]>

Add a copy_file_range() system call for offloading copies between
regular files.

This gives an interface to underlying layers of the storage stack which
can copy without reading and writing all the data. There are a few
candidates that should support copy offloading in the nearer term:

- btrfs shares extent references with its clone ioctl
- NFS has patches to add a COPY command which copies on the server
- SCSI has a family of XCOPY commands which copy in the device

This system call avoids the complexity of also accelerating the creation
of the destination file by operating on an existing destination file
descriptor, not a path.

Currently the high level vfs entry point limits copy offloading to files
on the same mount and super (and not in the same file). This can be
relaxed if we get implementations which can copy between file systems
safely.
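
From userspace, the call described above might be driven roughly as follows.
This is a sketch under a few assumptions: the libc headers define
SYS_copy_file_range (the number is not hardcoded here), both descriptors are
already open, and the set of errno values treated as "no in-kernel copy
available" is a guess for illustration. copy_whole_file(), copy_by_hand() and
copy_unsupported() are hypothetical helpers, not part of this series.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <stdint.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Errors treated as "fall back to read/write" (a guess for this sketch). */
static int copy_unsupported(int err)
{
	return err == ENOSYS || err == EXDEV || err == EINVAL ||
	       err == EOPNOTSUPP || err == EPERM;
}

/* Plain pread/pwrite loop; returns the final offset, or -1 on error. */
static int64_t copy_by_hand(int fd_in, int fd_out, int64_t off, int64_t end)
{
	char buf[65536];

	while (off < end) {
		size_t want = sizeof(buf);
		ssize_t got, put;

		if ((int64_t)want > end - off)
			want = (size_t)(end - off);
		got = pread(fd_in, buf, want, (off_t)off);
		if (got < 0)
			return -1;
		if (got == 0)
			break;		/* source is shorter than expected */
		put = pwrite(fd_out, buf, (size_t)got, (off_t)off);
		if (put <= 0)
			return -1;
		off += put;		/* a short write is retried from here */
	}
	return off;
}

/*
 * Copy [0, size) from fd_in to fd_out at matching offsets.
 * Returns the number of bytes copied, or -1 with errno set.
 */
static int64_t copy_whole_file(int fd_in, int fd_out, int64_t size)
{
	int64_t off_in = 0, off_out = 0;

	while (off_in < size) {
		long ret = syscall(SYS_copy_file_range, fd_in, &off_in,
				   fd_out, &off_out,
				   (size_t)(size - off_in), 0);
		if (ret < 0) {
			if (copy_unsupported(errno))
				return copy_by_hand(fd_in, fd_out, off_in, size);
			return -1;
		}
		if (ret == 0)
			break;		/* no progress: stop rather than spin */
		/* On success the kernel has advanced off_in and off_out. */
	}
	return off_in;
}
```

Because the call may return partial success, the loop keeps going until the
offset reaches the requested size, and it stops on a zero return so that a
source file shorter than expected cannot cause an endless loop.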

Signed-off-by: Zach Brown <[email protected]>
---
fs/read_write.c | 129 ++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 3 +
include/uapi/asm-generic/unistd.h | 4 +-
kernel/sys_ni.c | 1 +
4 files changed, 136 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 819ef3f..3804547 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -16,6 +16,7 @@
#include <linux/pagemap.h>
#include <linux/splice.h>
#include <linux/compat.h>
+#include <linux/mount.h>
#include "internal.h"

#include <asm/uaccess.h>
@@ -1327,3 +1328,131 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
return do_sendfile(out_fd, in_fd, NULL, count, 0);
}
#endif
+
+/*
+ * copy_file_range() differs from regular file read and write in that it
+ * specifically allows returning partial success. When it does so is up to
+ * the copy_file_range method.
+ */
+ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, int flags)
+{
+ struct inode *inode_in;
+ struct inode *inode_out;
+ ssize_t ret;
+
+ if (flags)
+ return -EINVAL;
+
+ if (len == 0)
+ return 0;
+
+ /* copy_file_range allows full ssize_t len, ignoring MAX_RW_COUNT */
+ ret = rw_verify_area(READ, file_in, &pos_in, len);
+ if (ret >= 0)
+ ret = rw_verify_area(WRITE, file_out, &pos_out, len);
+ if (ret < 0)
+ return ret;
+
+ if (!(file_in->f_mode & FMODE_READ) ||
+ !(file_out->f_mode & FMODE_WRITE) ||
+ (file_out->f_flags & O_APPEND) ||
+ !file_in->f_op || !file_in->f_op->copy_file_range)
+ return -EINVAL;
+
+ inode_in = file_inode(file_in);
+ inode_out = file_inode(file_out);
+
+ /* make sure offsets don't wrap and the input is inside i_size */
+ if (pos_in + len < pos_in || pos_out + len < pos_out ||
+ pos_in + len > i_size_read(inode_in))
+ return -EINVAL;
+
+ /* this could be relaxed once a method supports cross-fs copies */
+ if (inode_in->i_sb != inode_out->i_sb ||
+ file_in->f_path.mnt != file_out->f_path.mnt)
+ return -EXDEV;
+
+ /* forbid ranges in the same file */
+ if (inode_in == inode_out)
+ return -EINVAL;
+
+ ret = mnt_want_write_file(file_out);
+ if (ret)
+ return ret;
+
+ ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
+ len, flags);
+ if (ret > 0) {
+ fsnotify_access(file_in);
+ add_rchar(current, ret);
+ fsnotify_modify(file_out);
+ add_wchar(current, ret);
+ }
+ inc_syscr(current);
+ inc_syscw(current);
+
+ mnt_drop_write_file(file_out);
+
+ return ret;
+}
+EXPORT_SYMBOL(vfs_copy_file_range);
+
+SYSCALL_DEFINE6(copy_file_range, int, fd_in, loff_t __user *, off_in,
+ int, fd_out, loff_t __user *, off_out,
+ size_t, len, unsigned int, flags)
+{
+ loff_t pos_in;
+ loff_t pos_out;
+ struct fd f_in;
+ struct fd f_out;
+ ssize_t ret;
+
+ f_in = fdget(fd_in);
+ f_out = fdget(fd_out);
+ if (!f_in.file || !f_out.file) {
+ ret = -EBADF;
+ goto out;
+ }
+
+ ret = -EFAULT;
+ if (off_in) {
+ if (copy_from_user(&pos_in, off_in, sizeof(loff_t)))
+ goto out;
+ } else {
+ pos_in = f_in.file->f_pos;
+ }
+
+ if (off_out) {
+ if (copy_from_user(&pos_out, off_out, sizeof(loff_t)))
+ goto out;
+ } else {
+ pos_out = f_out.file->f_pos;
+ }
+
+ ret = vfs_copy_file_range(f_in.file, pos_in, f_out.file, pos_out, len,
+ flags);
+ if (ret > 0) {
+ pos_in += ret;
+ pos_out += ret;
+
+ if (off_in) {
+ if (copy_to_user(off_in, &pos_in, sizeof(loff_t)))
+ ret = -EFAULT;
+ } else {
+ f_in.file->f_pos = pos_in;
+ }
+
+ if (off_out) {
+ if (copy_to_user(off_out, &pos_out, sizeof(loff_t)))
+ ret = -EFAULT;
+ } else {
+ f_out.file->f_pos = pos_out;
+ }
+ }
+out:
+ fdput(f_in);
+ fdput(f_out);
+ return ret;
+}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index cc008c3..c97aed8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1631,6 +1631,7 @@ struct file_operations {
#ifndef CONFIG_MMU
unsigned (*mmap_capabilities)(struct file *);
#endif
+ ssize_t (*copy_file_range)(struct file *, loff_t, struct file *, loff_t, size_t, int);
};

struct inode_operations {
@@ -1684,6 +1685,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
unsigned long, loff_t *);
extern ssize_t vfs_writev(struct file *, const struct iovec __user *,
unsigned long, loff_t *);
+extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
+ loff_t, size_t, int);

struct super_operations {
struct inode *(*alloc_inode)(struct super_block *sb);
diff --git a/include/uapi/asm-generic/unistd.h b/include/uapi/asm-generic/unistd.h
index e016bd9..2b60f0c 100644
--- a/include/uapi/asm-generic/unistd.h
+++ b/include/uapi/asm-generic/unistd.h
@@ -709,9 +709,11 @@ __SYSCALL(__NR_memfd_create, sys_memfd_create)
__SYSCALL(__NR_bpf, sys_bpf)
#define __NR_execveat 281
__SC_COMP(__NR_execveat, sys_execveat, compat_sys_execveat)
+#define __NR_copy_file_range 282
+__SYSCALL(__NR_copy_file_range, sys_copy_file_range)

#undef __NR_syscalls
-#define __NR_syscalls 282
+#define __NR_syscalls 283

/*
* All syscalls below here should go away really,
diff --git a/kernel/sys_ni.c b/kernel/sys_ni.c
index 7995ef5..4e01cd9 100644
--- a/kernel/sys_ni.c
+++ b/kernel/sys_ni.c
@@ -173,6 +173,7 @@ cond_syscall(sys_setfsuid);
cond_syscall(sys_setfsgid);
cond_syscall(sys_capget);
cond_syscall(sys_capset);
+cond_syscall(sys_copy_file_range);

/* arch-specific weak syscall entries */
cond_syscall(sys_pciconfig_read);
--
2.5.0


2015-08-07 20:38:38

by Anna Schumaker

Subject: [RFC] vfs_copy_range() test program

This is a simple C program that I used for calling the copy system call.
Usage: ./nfscopy /nfs/original.txt /nfs/copy.txt
---
nfscopy.c | 64 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 64 insertions(+)
create mode 100644 nfscopy.c

diff --git a/nfscopy.c b/nfscopy.c
new file mode 100644
index 0000000..535673e
--- /dev/null
+++ b/nfscopy.c
@@ -0,0 +1,64 @@
+#include <errno.h>
+#include <fcntl.h>
+#include <limits.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <sys/stat.h>
+#include <unistd.h>
+
+#define SYS_COPY_RANGE 323
+
+int do_copy(int f_in, int f_out, loff_t f_size)
+{
+ loff_t offset = 0;
+ ssize_t ret;
+
+ while (offset < f_size) {
+ size_t size = f_size - offset;
+ if ((size + offset) != f_size)
+ size = INT_MAX - 1;
+
+ ret = syscall(SYS_COPY_RANGE, f_in, &offset, f_out, &offset, size, 0);
+ if (ret < 0) {
+ printf("Copy error: %s\n", strerror(errno));
+ return ret;
+ }
+ }
+ return 0;
+}
+
+int main(int argc, char **argv)
+{
+ int f_in, f_out, ret;
+ struct stat fstat;
+
+ if (argc != 3) {
+ printf("Usage: %s f_in f_out\n", argv[0]);
+ exit(1);
+ }
+
+ f_in = open(argv[1], O_RDONLY);
+ if (f_in < 0) {
+ printf("%s: %s\n", argv[1], strerror(errno));
+ exit(1);
+ }
+
+ if (stat(argv[1], &fstat) < 0) {
+ printf("%s: %s\n", argv[1], strerror(errno));
+ exit(1);
+ }
+
+ f_out = open(argv[2], O_WRONLY | O_CREAT | O_SYNC, S_IRUSR | S_IWUSR | S_IRGRP | S_IROTH);
+ if (f_out < 0) {
+ printf("%s: %s\n", argv[2], strerror(errno));
+ exit(1);
+ }
+
+ ret = do_copy(f_in, f_out, fstat.st_size);
+
+ fsync(f_out);
+ close(f_in);
+ close(f_out);
+ return ret;
+}
--
2.5.0


2015-08-07 20:38:34

by Anna Schumaker

Subject: [PATCH 6/7] NFSD: Implement the COPY call

From: Anna Schumaker <[email protected]>

I only implemented the sync version of this call, since it's the
easiest. I can simply call vfs_copy_file_range() and have the vfs do the
right thing for the filesystem being exported.
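
For reference, the reply body that the encoder in this patch produces for a
synchronous copy is 32 bytes of big-endian XDR: a zero count of callback
stateids, the byte count, the stable_how value, an 8-byte verifier, and the
two boolean requirements. The userspace sketch below models that layout. The
function names are hypothetical, and the buffer handling is simplified
compared with the kernel's xdr_reserve_space() interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define MODEL_VERIFIER_SIZE 8	/* NFSv4 write verifiers are 8 bytes */

/* Store a 32-bit value in network (big-endian) byte order. */
static uint8_t *put_be32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
	return p + 4;
}

/* Store a 64-bit value ("hyper" in XDR terms) in network byte order. */
static uint8_t *put_be64(uint8_t *p, uint64_t v)
{
	p = put_be32(p, (uint32_t)(v >> 32));
	return put_be32(p, (uint32_t)v);
}

/*
 * Encode a synchronous COPY reply body into buf, which must hold at
 * least 32 bytes. Returns the number of bytes written.
 */
static size_t encode_sync_copy_reply(uint8_t *buf, uint64_t bytes_written,
				     uint32_t stable_how,
				     const uint8_t *verifier)
{
	uint8_t *p = buf;

	p = put_be32(p, 0);		/* no callback stateid for a sync copy */
	p = put_be64(p, bytes_written);
	p = put_be32(p, stable_how);
	memcpy(p, verifier, MODEL_VERIFIER_SIZE);
	p += MODEL_VERIFIER_SIZE;
	p = put_be32(p, 1);		/* consecutive */
	p = put_be32(p, 1);		/* synchronous */
	return (size_t)(p - buf);
}
```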

Signed-off-by: Anna Schumaker <[email protected]>
---
fs/nfsd/nfs4proc.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
fs/nfsd/nfs4xdr.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++--
fs/nfsd/vfs.c | 13 +++++++++++
fs/nfsd/vfs.h | 1 +
fs/nfsd/xdr4.h | 23 ++++++++++++++++++++
5 files changed, 160 insertions(+), 2 deletions(-)

diff --git a/fs/nfsd/nfs4proc.c b/fs/nfsd/nfs4proc.c
index d34c967..fbfb509 100644
--- a/fs/nfsd/nfs4proc.c
+++ b/fs/nfsd/nfs4proc.c
@@ -1014,6 +1014,63 @@ nfsd4_write(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
}

static __be32
+nfsd4_verify_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
+ struct nfsd4_copy *copy, struct file **src, struct file **dst)
+{
+ __be32 status;
+
+ status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->save_fh,
+ &copy->cp_src_stateid, RD_STATE,
+ src, NULL);
+ if (status) {
+ dprintk("NFSD: nfsd4_copy: couldn't process src stateid!\n");
+ return status;
+ }
+
+ status = nfs4_preprocess_stateid_op(rqstp, cstate, &cstate->current_fh,
+ &copy->cp_dst_stateid, WR_STATE,
+ dst, NULL);
+ if (status) {
+ dprintk("NFSD: nfsd4_copy: couldn't process dst stateid!\n");
+ fput(*src);
+ }
+
+ return status;
+}
+
+static __be32
+nfsd4_copy(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
+ struct nfsd4_copy *copy)
+{
+ ssize_t bytes;
+ __be32 status;
+ struct file *src = NULL, *dst = NULL;
+
+ status = nfsd4_verify_copy(rqstp, cstate, copy, &src, &dst);
+ if (status)
+ return status;
+
+ bytes = nfsd_copy_range(src, copy->cp_src_pos,
+ dst, copy->cp_dst_pos,
+ copy->cp_count);
+
+ if (bytes < 0)
+ status = nfserrno(bytes);
+ else {
+ copy->cp_res.wr_bytes_written = bytes;
+ copy->cp_res.wr_stable_how = NFS_FILE_SYNC;
+ copy->cp_consecutive = 1;
+ copy->cp_synchronous = 1;
+ gen_boot_verifier(&copy->cp_res.wr_verifier, SVC_NET(rqstp));
+ status = nfs_ok;
+ }
+
+ fput(src);
+ fput(dst);
+ return status;
+}
+
+static __be32
nfsd4_fallocate(struct svc_rqst *rqstp, struct nfsd4_compound_state *cstate,
struct nfsd4_fallocate *fallocate, int flags)
{
@@ -2283,6 +2340,12 @@ static struct nfsd4_operation nfsd4_ops[] = {
.op_name = "OP_DEALLOCATE",
.op_rsize_bop = (nfsd4op_rsize)nfsd4_only_status_rsize,
},
+ [OP_COPY] = {
+ .op_func = (nfsd4op_func)nfsd4_copy,
+ .op_flags = OP_MODIFIES_SOMETHING | OP_CACHEME,
+ .op_name = "OP_COPY",
+ .op_rsize_bop = (nfsd4op_rsize)nfsd4_write_rsize,
+ },
[OP_SEEK] = {
.op_func = (nfsd4op_func)nfsd4_seek,
.op_name = "OP_SEEK",
diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
index 5463385..3a78c7f 100644
--- a/fs/nfsd/nfs4xdr.c
+++ b/fs/nfsd/nfs4xdr.c
@@ -1675,6 +1675,30 @@ nfsd4_decode_fallocate(struct nfsd4_compoundargs *argp,
}

static __be32
+nfsd4_decode_copy(struct nfsd4_compoundargs *argp, struct nfsd4_copy *copy)
+{
+ DECODE_HEAD;
+ unsigned int tmp;
+
+ status = nfsd4_decode_stateid(argp, &copy->cp_src_stateid);
+ if (status)
+ return status;
+ status = nfsd4_decode_stateid(argp, &copy->cp_dst_stateid);
+ if (status)
+ return status;
+
+ READ_BUF(8 + 8 + 8 + 4 + 4 + 4);
+ p = xdr_decode_hyper(p, &copy->cp_src_pos);
+ p = xdr_decode_hyper(p, &copy->cp_dst_pos);
+ p = xdr_decode_hyper(p, &copy->cp_count);
+ copy->cp_consecutive = be32_to_cpup(p++);
+ copy->cp_synchronous = be32_to_cpup(p++);
+ tmp = be32_to_cpup(p); /* Source server list not supported */
+
+ DECODE_TAIL;
+}
+
+static __be32
nfsd4_decode_seek(struct nfsd4_compoundargs *argp, struct nfsd4_seek *seek)
{
DECODE_HEAD;
@@ -1774,7 +1798,7 @@ static nfsd4_dec nfsd4_dec_ops[] = {

/* new operations for NFSv4.2 */
[OP_ALLOCATE] = (nfsd4_dec)nfsd4_decode_fallocate,
- [OP_COPY] = (nfsd4_dec)nfsd4_decode_notsupp,
+ [OP_COPY] = (nfsd4_dec)nfsd4_decode_copy,
[OP_COPY_NOTIFY] = (nfsd4_dec)nfsd4_decode_notsupp,
[OP_DEALLOCATE] = (nfsd4_dec)nfsd4_decode_fallocate,
[OP_IO_ADVISE] = (nfsd4_dec)nfsd4_decode_notsupp,
@@ -4140,6 +4164,40 @@ nfsd4_encode_layoutreturn(struct nfsd4_compoundres *resp, __be32 nfserr,
#endif /* CONFIG_NFSD_PNFS */

static __be32
+nfsd42_encode_write_res(struct nfsd4_compoundres *resp, struct nfsd42_write_res *write)
+{
+ __be32 *p;
+
+ p = xdr_reserve_space(&resp->xdr, 4 + 8 + 4 + NFS4_VERIFIER_SIZE);
+ if (!p)
+ return nfserr_resource;
+
+ *p++ = cpu_to_be32(0);
+ p = xdr_encode_hyper(p, write->wr_bytes_written);
+ *p++ = cpu_to_be32(write->wr_stable_how);
+ p = xdr_encode_opaque_fixed(p, write->wr_verifier.data, NFS4_VERIFIER_SIZE);
+ return nfs_ok;
+}
+
+static __be32
+nfsd4_encode_copy(struct nfsd4_compoundres *resp, __be32 nfserr,
+ struct nfsd4_copy *copy)
+{
+ __be32 *p, err;
+
+ if (!nfserr) {
+ err = nfsd42_encode_write_res(resp, &copy->cp_res);
+ if (err)
+ return err;
+
+ p = xdr_reserve_space(&resp->xdr, 4 + 4);
+ *p++ = cpu_to_be32(copy->cp_consecutive);
+ *p++ = cpu_to_be32(copy->cp_synchronous);
+ }
+ return nfserr;
+}
+
+static __be32
nfsd4_encode_seek(struct nfsd4_compoundres *resp, __be32 nfserr,
struct nfsd4_seek *seek)
{
@@ -4238,7 +4296,7 @@ static nfsd4_enc nfsd4_enc_ops[] = {

/* NFSv4.2 operations */
[OP_ALLOCATE] = (nfsd4_enc)nfsd4_encode_noop,
- [OP_COPY] = (nfsd4_enc)nfsd4_encode_noop,
+ [OP_COPY] = (nfsd4_enc)nfsd4_encode_copy,
[OP_COPY_NOTIFY] = (nfsd4_enc)nfsd4_encode_noop,
[OP_DEALLOCATE] = (nfsd4_enc)nfsd4_encode_noop,
[OP_IO_ADVISE] = (nfsd4_enc)nfsd4_encode_noop,
diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index b5e077a..4065f38 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -36,6 +36,7 @@
#endif /* CONFIG_NFSD_V3 */

#ifdef CONFIG_NFSD_V4
+#include "../internal.h"
#include "acl.h"
#include "idmap.h"
#endif /* CONFIG_NFSD_V4 */
@@ -498,6 +499,18 @@ __be32 nfsd4_set_nfs4_label(struct svc_rqst *rqstp, struct svc_fh *fhp,
}
#endif

+ssize_t nfsd_copy_range(struct file *src, u64 src_pos,
+ struct file *dst, u64 dst_pos,
+ u64 count)
+{
+ ssize_t bytes;
+
+ bytes = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+ if (bytes > 0)
+ vfs_fsync_range(dst, dst_pos, dst_pos + bytes, 0);
+ return bytes;
+}
+
__be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
struct file *file, loff_t offset, loff_t len,
int flags)
diff --git a/fs/nfsd/vfs.h b/fs/nfsd/vfs.h
index 5be875e..c529f9e 100644
--- a/fs/nfsd/vfs.h
+++ b/fs/nfsd/vfs.h
@@ -91,6 +91,7 @@ __be32 nfsd_symlink(struct svc_rqst *, struct svc_fh *,
struct svc_fh *res);
__be32 nfsd_link(struct svc_rqst *, struct svc_fh *,
char *, int, struct svc_fh *);
+ssize_t nfsd_copy_range(struct file *, u64, struct file *, u64, u64);
__be32 nfsd_rename(struct svc_rqst *,
struct svc_fh *, char *, int,
struct svc_fh *, char *, int);
diff --git a/fs/nfsd/xdr4.h b/fs/nfsd/xdr4.h
index 9f99100..9e83f95 100644
--- a/fs/nfsd/xdr4.h
+++ b/fs/nfsd/xdr4.h
@@ -491,6 +491,28 @@ struct nfsd4_fallocate {
u64 falloc_length;
};

+struct nfsd42_write_res {
+ u64 wr_bytes_written;
+ u32 wr_stable_how;
+ nfs4_verifier wr_verifier;
+};
+
+struct nfsd4_copy {
+ /* request */
+ stateid_t cp_src_stateid;
+ stateid_t cp_dst_stateid;
+ u64 cp_src_pos;
+ u64 cp_dst_pos;
+ u64 cp_count;
+
+ /* both */
+ bool cp_consecutive;
+ bool cp_synchronous;
+
+ /* response */
+ struct nfsd42_write_res cp_res;
+};
+
struct nfsd4_seek {
/* request */
stateid_t seek_stateid;
@@ -555,6 +577,7 @@ struct nfsd4_op {
/* NFSv4.2 */
struct nfsd4_fallocate allocate;
struct nfsd4_fallocate deallocate;
+ struct nfsd4_copy copy;
struct nfsd4_seek seek;
} u;
struct nfs4_replay * replay;
--
2.5.0


2015-08-07 20:44:51

by Anna Schumaker

Subject: [PATCH 2/7] x86: add sys_copy_file_range to syscall tables

From: Zach Brown <[email protected]>

Add sys_copy_file_range to the x86 syscall tables.

Signed-off-by: Zach Brown <[email protected]>
---
arch/x86/entry/syscalls/syscall_32.tbl | 1 +
arch/x86/entry/syscalls/syscall_64.tbl | 1 +
2 files changed, 2 insertions(+)

diff --git a/arch/x86/entry/syscalls/syscall_32.tbl b/arch/x86/entry/syscalls/syscall_32.tbl
index ef8187f..2f5e1e0 100644
--- a/arch/x86/entry/syscalls/syscall_32.tbl
+++ b/arch/x86/entry/syscalls/syscall_32.tbl
@@ -365,3 +365,4 @@
356 i386 memfd_create sys_memfd_create
357 i386 bpf sys_bpf
358 i386 execveat sys_execveat stub32_execveat
+359 i386 copy_file_range sys_copy_file_range
diff --git a/arch/x86/entry/syscalls/syscall_64.tbl b/arch/x86/entry/syscalls/syscall_64.tbl
index 9ef32d5..b2101de 100644
--- a/arch/x86/entry/syscalls/syscall_64.tbl
+++ b/arch/x86/entry/syscalls/syscall_64.tbl
@@ -329,6 +329,7 @@
320 common kexec_file_load sys_kexec_file_load
321 common bpf sys_bpf
322 64 execveat stub_execveat
+323 common copy_file_range sys_copy_file_range

#
# x32-specific system call numbers start at 512 to avoid cache impact
--
2.5.0


2015-08-07 20:44:57

by Anna Schumaker

Subject: [PATCH 7/7] NFS: Add COPY nfs operation

From: Anna Schumaker <[email protected]>

This adds the copy_file_range file_operations function pointer used by the
copy_file_range() system call. This patch only implements sync copies,
so if an async copy happens we decode the stateid and ignore it.
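
The client also remembers when a server turns out not to support COPY, so
that later calls can skip the round trip. A minimal userspace model of that
capability-bit pattern is sketched below. The model_* names, the simulated
RPC, and the bit value are all hypothetical; MODEL_ENOTSUPP mirrors the
kernel-internal ENOTSUPP value (524).

```c
#include <stddef.h>
#include <sys/types.h>

#define MODEL_CAP_COPY	(1u << 0)	/* hypothetical capability bit */
#define MODEL_ENOTSUPP	524

struct model_server {
	unsigned int caps;	/* what the client believes is supported */
	int supports_copy;	/* what the simulated server really does */
	int rpc_calls;		/* number of simulated round trips */
};

/* Simulated RPC: a server without COPY answers "not supported". */
static ssize_t model_rpc_copy(struct model_server *srv, size_t count)
{
	srv->rpc_calls++;
	return srv->supports_copy ? (ssize_t)count : -MODEL_ENOTSUPP;
}

/* Client-side entry point: check the cached capability before calling. */
static ssize_t model_proc_copy(struct model_server *srv, size_t count)
{
	ssize_t status;

	if (!(srv->caps & MODEL_CAP_COPY))
		return -MODEL_ENOTSUPP;		/* known unsupported: no RPC */

	status = model_rpc_copy(srv, count);
	if (status == -MODEL_ENOTSUPP)
		srv->caps &= ~MODEL_CAP_COPY;	/* remember for next time */
	return status;
}
```

With the splice fallback from patch 4 applied, an -ENOTSUPP result from the
file operation is also what makes the VFS take the generic copy path.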

Signed-off-by: Anna Schumaker <[email protected]>
---
fs/nfs/nfs42.h | 1 +
fs/nfs/nfs42proc.c | 40 ++++++++++++++
fs/nfs/nfs42xdr.c | 136 ++++++++++++++++++++++++++++++++++++++++++++++
fs/nfs/nfs4file.c | 8 +++
fs/nfs/nfs4proc.c | 1 +
fs/nfs/nfs4xdr.c | 1 +
include/linux/nfs4.h | 1 +
include/linux/nfs_fs_sb.h | 1 +
include/linux/nfs_xdr.h | 27 +++++++++
9 files changed, 216 insertions(+)

diff --git a/fs/nfs/nfs42.h b/fs/nfs/nfs42.h
index ff66ae7..b54b916 100644
--- a/fs/nfs/nfs42.h
+++ b/fs/nfs/nfs42.h
@@ -13,6 +13,7 @@

/* nfs4.2proc.c */
int nfs42_proc_allocate(struct file *, loff_t, loff_t);
+ssize_t nfs42_proc_copy(struct file *, loff_t, struct file *, loff_t, size_t);
int nfs42_proc_deallocate(struct file *, loff_t, loff_t);
loff_t nfs42_proc_llseek(struct file *, loff_t, int);
int nfs42_proc_layoutstats_generic(struct nfs_server *,
diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
index d731bbf..fa665f9 100644
--- a/fs/nfs/nfs42proc.c
+++ b/fs/nfs/nfs42proc.c
@@ -135,6 +135,46 @@ int nfs42_proc_deallocate(struct file *filep, loff_t offset, loff_t len)
return err;
}

+ssize_t nfs42_proc_copy(struct file *src, loff_t pos_src,
+			struct file *dst, loff_t pos_dst,
+			size_t count)
+{
+	struct nfs42_copy_args args = {
+		.src_fh = NFS_FH(file_inode(src)),
+		.src_pos = pos_src,
+		.dst_fh = NFS_FH(file_inode(dst)),
+		.dst_pos = pos_dst,
+		.count = count,
+	};
+	struct nfs42_copy_res res;
+	struct rpc_message msg = {
+		.rpc_proc = &nfs4_procedures[NFSPROC4_CLNT_COPY],
+		.rpc_argp = &args,
+		.rpc_resp = &res,
+	};
+	struct nfs_server *server = NFS_SERVER(file_inode(dst));
+	int status;
+
+	if (!(server->caps & NFS_CAP_COPY))
+		return -ENOTSUPP;
+
+	status = nfs42_set_rw_stateid(&args.src_stateid, src, FMODE_READ);
+	if (status)
+		return status;
+
+	status = nfs42_set_rw_stateid(&args.dst_stateid, dst, FMODE_WRITE);
+	if (status)
+		return status;
+
+	status = nfs4_call_sync(server->client, server, &msg,
+				&args.seq_args, &res.seq_res, 0);
+	if (status == -ENOTSUPP)
+		server->caps &= ~NFS_CAP_COPY;
+	if (status)
+		return status;
+	return res.write_res.count;
+}
+
static loff_t _nfs42_proc_llseek(struct file *filep, loff_t offset, int whence)
{
struct inode *inode = file_inode(filep);
diff --git a/fs/nfs/nfs42xdr.c b/fs/nfs/nfs42xdr.c
index a6bd27d..489bbf3 100644
--- a/fs/nfs/nfs42xdr.c
+++ b/fs/nfs/nfs42xdr.c
@@ -9,9 +9,22 @@
#define encode_fallocate_maxsz (encode_stateid_maxsz + \
2 /* offset */ + \
2 /* length */)
+#define NFS42_WRITE_RES_SIZE (1 /* wr_callback_id size */ +\
+ XDR_QUADLEN(NFS4_STATEID_SIZE) + \
+ 2 /* wr_count */ + \
+ 1 /* wr_committed */ + \
+ XDR_QUADLEN(NFS4_VERIFIER_SIZE))
#define encode_allocate_maxsz (op_encode_hdr_maxsz + \
encode_fallocate_maxsz)
#define decode_allocate_maxsz (op_decode_hdr_maxsz)
+#define encode_copy_maxsz (op_encode_hdr_maxsz + \
+ XDR_QUADLEN(NFS4_STATEID_SIZE) + \
+ XDR_QUADLEN(NFS4_STATEID_SIZE) + \
+ 2 + 2 + 2 + 1 + 1 + 1)
+#define decode_copy_maxsz (op_decode_hdr_maxsz + \
+ NFS42_WRITE_RES_SIZE + \
+ 1 /* cr_consecutive */ + \
+ 1 /* cr_synchronous */)
#define encode_deallocate_maxsz (op_encode_hdr_maxsz + \
encode_fallocate_maxsz)
#define decode_deallocate_maxsz (op_decode_hdr_maxsz)
@@ -43,6 +56,16 @@
decode_putfh_maxsz + \
decode_allocate_maxsz + \
decode_getattr_maxsz)
+#define NFS4_enc_copy_sz (compound_encode_hdr_maxsz + \
+ encode_putfh_maxsz + \
+ encode_savefh_maxsz + \
+ encode_putfh_maxsz + \
+ encode_copy_maxsz)
+#define NFS4_dec_copy_sz (compound_decode_hdr_maxsz + \
+ decode_putfh_maxsz + \
+ decode_savefh_maxsz + \
+ decode_putfh_maxsz + \
+ decode_copy_maxsz)
#define NFS4_enc_deallocate_sz (compound_encode_hdr_maxsz + \
encode_putfh_maxsz + \
encode_deallocate_maxsz + \
@@ -83,6 +106,23 @@ static void encode_allocate(struct xdr_stream *xdr,
encode_fallocate(xdr, args);
}

+static void encode_copy(struct xdr_stream *xdr,
+			struct nfs42_copy_args *args,
+			struct compound_hdr *hdr)
+{
+	encode_op_hdr(xdr, OP_COPY, decode_copy_maxsz, hdr);
+	encode_nfs4_stateid(xdr, &args->src_stateid);
+	encode_nfs4_stateid(xdr, &args->dst_stateid);
+
+	encode_uint64(xdr, args->src_pos);
+	encode_uint64(xdr, args->dst_pos);
+	encode_uint64(xdr, args->count);
+
+	encode_uint32(xdr, 1); /* consecutive = true */
+	encode_uint32(xdr, 1); /* synchronous = true */
+	encode_uint32(xdr, 0); /* src server list */
+}
+
static void encode_deallocate(struct xdr_stream *xdr,
struct nfs42_falloc_args *args,
struct compound_hdr *hdr)
@@ -148,6 +188,26 @@ static void nfs4_xdr_enc_allocate(struct rpc_rqst *req,
}

/*
+ * Encode COPY request
+ */
+static void nfs4_xdr_enc_copy(struct rpc_rqst *req,
+			      struct xdr_stream *xdr,
+			      struct nfs42_copy_args *args)
+{
+	struct compound_hdr hdr = {
+		.minorversion = nfs4_xdr_minorversion(&args->seq_args),
+	};
+
+	encode_compound_hdr(xdr, req, &hdr);
+	encode_sequence(xdr, &args->seq_args, &hdr);
+	encode_putfh(xdr, args->src_fh, &hdr);
+	encode_savefh(xdr, &hdr);
+	encode_putfh(xdr, args->dst_fh, &hdr);
+	encode_copy(xdr, args, &hdr);
+	encode_nops(&hdr);
+}
+
+/*
* Encode DEALLOCATE request
*/
static void nfs4_xdr_enc_deallocate(struct rpc_rqst *req,
@@ -211,6 +271,52 @@ static int decode_allocate(struct xdr_stream *xdr, struct nfs42_falloc_res *res)
return decode_op_hdr(xdr, OP_ALLOCATE);
}

+static int decode_write_response(struct xdr_stream *xdr,
+				 struct nfs42_write_res *res)
+{
+	__be32 *p;
+	int stateids;
+
+	p = xdr_inline_decode(xdr, 4 + 8 + 4);
+	if (unlikely(!p))
+		goto out_overflow;
+
+	stateids = be32_to_cpup(p++);
+	p = xdr_decode_hyper(p, &res->count);
+	res->committed = be32_to_cpup(p);
+	return decode_verifier(xdr, &res->verifier);
+
+out_overflow:
+	print_overflow_msg(__func__, xdr);
+	return -EIO;
+}
+
+static int decode_copy(struct xdr_stream *xdr, struct nfs42_copy_res *res)
+{
+	__be32 *p;
+	int status;
+
+	status = decode_op_hdr(xdr, OP_COPY);
+	if (status)
+		return status;
+
+	status = decode_write_response(xdr, &res->write_res);
+	if (status)
+		return status;
+
+	p = xdr_inline_decode(xdr, 4 + 4);
+	if (unlikely(!p))
+		goto out_overflow;
+
+	res->consecutive = be32_to_cpup(p++);
+	res->synchronous = be32_to_cpup(p++);
+	return 0;
+
+out_overflow:
+	print_overflow_msg(__func__, xdr);
+	return -EIO;
+}
+
static int decode_deallocate(struct xdr_stream *xdr, struct nfs42_falloc_res *res)
{
return decode_op_hdr(xdr, OP_DEALLOCATE);
@@ -272,6 +378,36 @@ out:
}

/*
+ * Decode COPY response
+ */
+static int nfs4_xdr_dec_copy(struct rpc_rqst *rqstp,
+			     struct xdr_stream *xdr,
+			     struct nfs42_copy_res *res)
+{
+	struct compound_hdr hdr;
+	int status;
+
+	status = decode_compound_hdr(xdr, &hdr);
+	if (status)
+		goto out;
+	status = decode_sequence(xdr, &res->seq_res, rqstp);
+	if (status)
+		goto out;
+	status = decode_putfh(xdr);
+	if (status)
+		goto out;
+	status = decode_savefh(xdr);
+	if (status)
+		goto out;
+	status = decode_putfh(xdr);
+	if (status)
+		goto out;
+	status = decode_copy(xdr, res);
+out:
+	return status;
+}
+
+/*
* Decode DEALLOCATE request
*/
static int nfs4_xdr_dec_deallocate(struct rpc_rqst *rqstp,
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index dcd39d4..cc3353a 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -132,6 +132,13 @@ nfs4_file_fsync(struct file *file, loff_t start, loff_t end, int datasync)
}

#ifdef CONFIG_NFS_V4_2
+static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
+				    struct file *file_out, loff_t pos_out,
+				    size_t count, int flags)
+{
+	return nfs42_proc_copy(file_in, pos_in, file_out, pos_out, count);
+}
+
static loff_t nfs4_file_llseek(struct file *filep, loff_t offset, int whence)
{
loff_t ret;
@@ -186,6 +193,7 @@ const struct file_operations nfs4_file_operations = {
.splice_read = nfs_file_splice_read,
.splice_write = iter_file_splice_write,
#ifdef CONFIG_NFS_V4_2
+ .copy_file_range = nfs4_copy_file_range,
.fallocate = nfs42_fallocate,
#endif /* CONFIG_NFS_V4_2 */
.check_flags = nfs_check_flags,
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 3acb1eb..f0c59eb 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -8648,6 +8648,7 @@ static const struct nfs4_minor_version_ops nfs_v4_2_minor_ops = {
| NFS_CAP_STATEID_NFSV41
| NFS_CAP_ATOMIC_OPEN_V1
| NFS_CAP_ALLOCATE
+ | NFS_CAP_COPY
| NFS_CAP_DEALLOCATE
| NFS_CAP_SEEK
| NFS_CAP_LAYOUTSTATS,
diff --git a/fs/nfs/nfs4xdr.c b/fs/nfs/nfs4xdr.c
index 558cd65d..8296628 100644
--- a/fs/nfs/nfs4xdr.c
+++ b/fs/nfs/nfs4xdr.c
@@ -7432,6 +7432,7 @@ struct rpc_procinfo nfs4_procedures[] = {
PROC(ALLOCATE, enc_allocate, dec_allocate),
PROC(DEALLOCATE, enc_deallocate, dec_deallocate),
PROC(LAYOUTSTATS, enc_layoutstats, dec_layoutstats),
+ PROC(COPY, enc_copy, dec_copy),
#endif /* CONFIG_NFS_V4_2 */
};

diff --git a/include/linux/nfs4.h b/include/linux/nfs4.h
index b8e72aa..c975a99 100644
--- a/include/linux/nfs4.h
+++ b/include/linux/nfs4.h
@@ -501,6 +501,7 @@ enum {
NFSPROC4_CLNT_ALLOCATE,
NFSPROC4_CLNT_DEALLOCATE,
NFSPROC4_CLNT_LAYOUTSTATS,
+ NFSPROC4_CLNT_COPY,
};

/* nfs41 types */
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index 20bc8e5..8d37f59 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -238,5 +238,6 @@ struct nfs_server {
#define NFS_CAP_ALLOCATE (1U << 20)
#define NFS_CAP_DEALLOCATE (1U << 21)
#define NFS_CAP_LAYOUTSTATS (1U << 22)
+#define NFS_CAP_COPY (1U << 23)

#endif
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 7bbe505..e5f6227 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -1321,6 +1321,33 @@ struct nfs42_falloc_res {
const struct nfs_server *falloc_server;
};

+struct nfs42_copy_args {
+	struct nfs4_sequence_args seq_args;
+
+	struct nfs_fh *src_fh;
+	nfs4_stateid src_stateid;
+	u64 src_pos;
+
+	struct nfs_fh *dst_fh;
+	nfs4_stateid dst_stateid;
+	u64 dst_pos;
+
+	u64 count;
+};
+
+struct nfs42_write_res {
+	u64 count;
+	u32 committed;
+	nfs4_verifier verifier;
+};
+
+struct nfs42_copy_res {
+	struct nfs4_sequence_res seq_res;
+	struct nfs42_write_res write_res;
+	bool consecutive;
+	bool synchronous;
+};
+
struct nfs42_seek_args {
struct nfs4_sequence_args seq_args;

--
2.5.0
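
The argument layout that encode_copy() puts on the wire in the patch above can be modelled in user space. The following is only a sketch under stated assumptions: struct copy_args_model, put_u32(), put_u64() and encode_copy_body() are hypothetical names, STATEID_SIZE is assumed to equal NFS4_STATEID_SIZE (16 bytes), and the operation header written by encode_op_hdr() is left out.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STATEID_SIZE 16	/* assumed equal to NFS4_STATEID_SIZE */

/* Hypothetical user-space stand-in for struct nfs42_copy_args. */
struct copy_args_model {
	uint8_t  src_stateid[STATEID_SIZE];
	uint8_t  dst_stateid[STATEID_SIZE];
	uint64_t src_pos;
	uint64_t dst_pos;
	uint64_t count;
};

/* XDR integers are big-endian and occupy whole 4-byte words. */
static uint8_t *put_u32(uint8_t *p, uint32_t v)
{
	p[0] = (uint8_t)(v >> 24);
	p[1] = (uint8_t)(v >> 16);
	p[2] = (uint8_t)(v >> 8);
	p[3] = (uint8_t)v;
	return p + 4;
}

static uint8_t *put_u64(uint8_t *p, uint64_t v)
{
	p = put_u32(p, (uint32_t)(v >> 32));
	return put_u32(p, (uint32_t)v);
}

/*
 * Write the COPY arguments in the order encode_copy() uses, without the
 * operation header. Returns the encoded length, or 0 if buf is too small.
 */
static size_t encode_copy_body(uint8_t *buf, size_t len,
			       const struct copy_args_model *a)
{
	const size_t need = 2 * STATEID_SIZE + 3 * 8 + 3 * 4;	/* 68 bytes */
	uint8_t *p = buf;

	if (len < need)
		return 0;
	memcpy(p, a->src_stateid, STATEID_SIZE);
	p += STATEID_SIZE;
	memcpy(p, a->dst_stateid, STATEID_SIZE);
	p += STATEID_SIZE;
	p = put_u64(p, a->src_pos);
	p = put_u64(p, a->dst_pos);
	p = put_u64(p, a->count);
	p = put_u32(p, 1);	/* consecutive = true */
	p = put_u32(p, 1);	/* synchronous = true */
	p = put_u32(p, 0);	/* empty source-server list */
	return (size_t)(p - buf);
}
```

The 68 bytes are 17 XDR words, which agrees with the part of encode_copy_maxsz that follows op_encode_hdr_maxsz: two stateids of 4 words each plus 2 + 2 + 2 + 1 + 1 + 1.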


2015-08-10 21:07:58

by J. Bruce Fields

Subject: Re: [PATCH 0/7] NFSv4.2: Add support for the COPY operation

On Fri, Aug 07, 2015 at 04:38:16PM -0400, Anna Schumaker wrote:
> These patches add client and server support for the NFS v4.2 COPY operation.
> Unlike the similar CLONE operation, COPY can support both acceleration through
> a reflink and a full copy of data from one file into another. These patches
> make use of Zach Brown's vfs_copy_file_range() syscall, and the first three
> patches in this series are simply a reposting of the patches that add the
> syscall.
>
> Patch 4 expands vfs_copy_file_range() to fall back on the splice interface for
> copies where the filesystem does not support copy accelerations. This behavior
> is useful for NFSD, since we'll still want to copy the file even if we can't
> do a reflink. Additionally, this opens up the possibility of in-kernel copies
> for all filesystems without needing to do frequent switches between kernel and
> user space. The only potential drawback I've noticed is that splice will write

Also on the server side it means the copy can potentially take
arbitrarily long, right? (And tie up a protocol slot and server thread
the whole time?)

> out data in PAGE_SIZE chunks, even if wsize > PAGE_SIZE. This leads to a few
> more writes over the wire, but I have not noticed a significant timing
> difference. Still, I wonder if there is a better way to optimize this for NFS.

Ideally, write-behind and readahead should paper over this?

>
> The remaining patches implement the COPY operation for both the client and the
> server. The program I used for testing is included as an RFC as the last patch
> in the series. I gathered performance information by comparing the runtime and
> RPC count of this program against /usr/bin/cp for various file sizes.
>
> /usr/bin/cp:
>                       size:    513MB   1024MB   1536MB   2048MB
> ------------- ------------- -------- -------- -------- --------
> nfs v4 client        total:     8203    16396    24588    32780
> ------------- ------------- -------- -------- -------- --------
> nfs v4 client         read:     4096     8192    12288    16384
> nfs v4 client        write:     4096     8192    12288    16384
> nfs v4 client       commit:        1        1        1        1
> nfs v4 client         open:        1        1        1        1
> nfs v4 client    open_noat:        2        2        2        2
> nfs v4 client        close:        1        1        1        1
> nfs v4 client      setattr:        2        2        2        2
> nfs v4 client       access:        2        3        3        3
> nfs v4 client      getattr:        2        2        2        2
>
> /usr/bin/cp /nfs/test-512 /nfs/test-copy 0.00s user 0.32s system 14% cpu 2.209 total
> /usr/bin/cp /nfs/test-1024 /nfs/test-copy 0.00s user 0.66s system 18% cpu 3.651 total
> /usr/bin/cp /nfs/test-1536 /nfs/test-copy 0.02s user 0.97s system 18% cpu 5.477 total
> /usr/bin/cp /nfs/test-2048 /nfs/test-copy 0.00s user 1.38s system 15% cpu 9.085 total
>
>
> Copy system call:
>                       size:    512MB   1024MB   1536MB   2048MB
> ------------- ------------- -------- -------- -------- --------
> nfs v4 client        total:        6        6        6        6
> ------------- ------------- -------- -------- -------- --------
> nfs v4 client         open:        2        2        2        2
> nfs v4 client        close:        2        2        2        2
> nfs v4 client       access:        1        1        1        1
> nfs v4 client         copy:        1        1        1        1
>
>
> ./nfscopy /nfs/test-512 /nfs/test-copy 0.00s user 0.00s system 0% cpu 1.148 total
> ./nfscopy /nfs/test-1024 /nfs/test-copy 0.00s user 0.00s system 0% cpu 2.293 total
> ./nfscopy /nfs/test-1536 /nfs/test-copy 0.00s user 0.00s system 0% cpu 3.037 total
> ./nfscopy /nfs/test-2048 /nfs/test-copy 0.00s user 0.00s system 0% cpu 4.045 total
>
>
> Questions, comments, and other testing ideas would be greatly appreciated!
>
> Thanks,
> Anna
>
>
> Anna Schumaker (4):
> VFS: Fall back on splice if no copy function defined
> nfsd: Pass filehandle to nfs4_preprocess_stateid_op()
> NFSD: Implement the COPY call
> NFS: Add COPY nfs operation
>
> Zach Brown (3):
> vfs: add copy_file_range syscall and vfs helper
> x86: add sys_copy_file_range to syscall tables
> btrfs: add .copy_file_range file operation
>
> arch/x86/entry/syscalls/syscall_32.tbl | 1 +
> arch/x86/entry/syscalls/syscall_64.tbl | 1 +
> fs/btrfs/ctree.h | 3 +
> fs/btrfs/file.c | 1 +
> fs/btrfs/ioctl.c | 91 ++++++++++++----------
> fs/nfs/nfs42.h | 1 +
> fs/nfs/nfs42proc.c | 40 ++++++++++
> fs/nfs/nfs42xdr.c | 136 +++++++++++++++++++++++++++++++++
> fs/nfs/nfs4file.c | 8 ++
> fs/nfs/nfs4proc.c | 1 +
> fs/nfs/nfs4xdr.c | 1 +
> fs/nfsd/nfs4proc.c | 79 +++++++++++++++++--
> fs/nfsd/nfs4state.c | 5 +-
> fs/nfsd/nfs4xdr.c | 62 ++++++++++++++-
> fs/nfsd/state.h | 4 +-
> fs/nfsd/vfs.c | 13 ++++
> fs/nfsd/vfs.h | 1 +
> fs/nfsd/xdr4.h | 23 ++++++
> fs/read_write.c | 133 ++++++++++++++++++++++++++++++++
> include/linux/fs.h | 3 +
> include/linux/nfs4.h | 1 +
> include/linux/nfs_fs_sb.h | 1 +
> include/linux/nfs_xdr.h | 27 +++++++
> include/uapi/asm-generic/unistd.h | 4 +-
> kernel/sys_ni.c | 1 +
> 25 files changed, 587 insertions(+), 54 deletions(-)
>
> --
> 2.5.0

2015-08-11 17:52:33

by Anna Schumaker

Subject: Re: [PATCH 0/7] NFSv4.2: Add support for the COPY operation

On 08/10/2015 05:07 PM, J. Bruce Fields wrote:
> On Fri, Aug 07, 2015 at 04:38:16PM -0400, Anna Schumaker wrote:
>> These patches add client and server support for the NFS v4.2 COPY operation.
>> Unlike the similar CLONE operation, COPY can support both acceleration through
>> and reflink and a full copy of data from one file into another. These patches
>> make use of Zach Brown's vfs_copy_file_range() syscall, and the first three
>> patches in this series are simply a reposting of the patches that add the
>> syscall.
>>
>> Patch 4 expands vfs_copy_file_range() to fall back on the splice interface for
>> copies where the filesystem does not support copy accelerations. This behavior
>> is useful for NFSD, since we'll still want to copy the file even if we can't
>> do a reflink. Additionally, this opens up the possibility of in-kernel copies
>> for all filesystems without needing to do frequent switches between kernel and
>> user space. The only potential drawback I've noticed is that splice will write
>
> Also on the server side it means the copy can potentially take
> arbitrarily long, right? (And tie up a protocol slot and server thread
> the whole time?)

Potentially, but I could put in a cap like we had talked about in the past. In practice, the VFS limits copies to slightly more than 2G because of the call to rw_verify_area().
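
As a rough illustration of the cap idea, here is a small user-space model, not the nfsd code: a "server" that copies at most a fixed number of bytes per call and reports how many it copied, and a caller that reissues the copy for the remainder. All names (struct copy_server, server_copy(), client_copy(), the cap field) are hypothetical, and plain memory buffers stand in for the two files.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical server state: one source/destination pair and a per-call cap. */
struct copy_server {
	const unsigned char *src;
	unsigned char *dst;
	size_t cap;		/* most bytes a single COPY may move */
	unsigned int calls;	/* number of COPY calls served so far */
};

/* Model of one synchronous COPY: moves at most cap bytes, returns the count. */
static size_t server_copy(struct copy_server *s, size_t src_pos,
			  size_t dst_pos, size_t count)
{
	size_t n = count < s->cap ? count : s->cap;

	memcpy(s->dst + dst_pos, s->src + src_pos, n);
	s->calls++;
	return n;
}

/* Caller loop: reissue COPY for the remainder until everything is copied. */
static size_t client_copy(struct copy_server *s, size_t src_pos,
			  size_t dst_pos, size_t count)
{
	size_t done = 0;

	while (done < count) {
		size_t n = server_copy(s, src_pos + done, dst_pos + done,
				       count - done);

		if (n == 0)
			break;	/* no progress: stop rather than spin */
		done += n;
	}
	return done;
}
```

With a cap of 4 bytes, a 10-byte copy takes three calls. Each call holds its slot and thread for a bounded amount of work, at the cost of extra round trips for large files.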

>
>> out data in PAGE_SIZE chunks, even if wsize > PAGE_SIZE. This leads to a few
>> more writes over the wire, but I have not noticed a significant timing
>> difference. Still, I wonder if there is a better way to optimize this for NFS.
>
> Ideally, write-behind and readahead should paper over this?

I think I might have misinterpreted the RPC counts, and it is writing out chunks larger than PAGE_SIZE. I noticed that there were twice as many writes as reads, but when I looked at wireshark I saw that each write was wsize/2.

Thanks,
Anna
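
The relationship between transfer size and RPC count mentioned above is simple arithmetic, sketched here. rpc_count() is a hypothetical helper, and the 128 KiB transfer size is an inference from the /usr/bin/cp table (a 512 MB file moved in 4096 READs), not a value stated in the thread.

```c
#include <stdint.h>

/* Number of RPCs needed to move `bytes` when each RPC carries `io_size` bytes. */
static uint64_t rpc_count(uint64_t bytes, uint64_t io_size)
{
	return (bytes + io_size - 1) / io_size;	/* round up */
}
```

Under that assumption, rpc_count(512 MiB, 128 KiB) gives 4096, while halving the payload to 64 KiB gives 8192. That is the "twice as many writes as reads" pattern when each WRITE carries wsize/2.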



2015-08-13 13:04:00

by Kinglong Mee

Subject: Re: [PATCH 4/7] VFS: Fall back on splice if no copy function defined

On 8/8/2015 04:38, Anna Schumaker wrote:
> The NFS server will need a fallback for filesystems that don't have any
> kind of copy acceleration yet. Let's handle this by having
> vfs_copy_range() fall back to splice, enabling an in-kernel fallback for
> all filesystems.

I'd prefer to do this job in nfsd_copy_range().

What if a user only wants to call the underlying filesystem's
copy_file_range() and wants an error if it is not supported? With this
patch, the syscall instead takes a different path and calls
do_splice_direct().

thanks,
Kinglong Mee
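
One way to picture the policy question raised here is to keep the copy primitive strict and let each caller choose whether to fall back. The following is a user-space sketch only, not the nfsd or VFS code. copy_range(), copy_by_read_write(), copy_unsupported() and the allow_fallback flag are hypothetical. It assumes the system headers define SYS_copy_file_range, and the set of errno values treated as "not supported" is also an assumption.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Plain read()/write() loop, the user-space analogue of a splice fallback. */
static ssize_t copy_by_read_write(int fd_in, int fd_out, size_t len)
{
	char buf[65536];
	size_t done = 0;

	while (done < len) {
		size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
		ssize_t n = read(fd_in, buf, want);
		ssize_t off = 0;

		if (n < 0 && errno == EINTR)
			continue;
		if (n < 0)
			return -1;
		if (n == 0)
			break;			/* end of source */
		while (off < n) {
			ssize_t w = write(fd_out, buf + off, (size_t)(n - off));

			if (w < 0 && errno == EINTR)
				continue;
			if (w < 0)
				return -1;
			off += w;
		}
		done += (size_t)n;
	}
	return (ssize_t)done;
}

/* Errors treated here as "no accelerated copy available" (an assumption). */
static int copy_unsupported(int err)
{
	return err == ENOSYS || err == EXDEV || err == EOPNOTSUPP || err == EINVAL;
}

/*
 * Copy up to len bytes between the current offsets of fd_in and fd_out.
 * allow_fallback == 0: any failure, including "not supported", returns -1.
 * allow_fallback != 0: the caller opts in to a plain copy instead.
 */
static ssize_t copy_range(int fd_in, int fd_out, size_t len, int allow_fallback)
{
	size_t done = 0;

	while (done < len) {
		long n;
		ssize_t rest;
#ifdef SYS_copy_file_range
		n = syscall(SYS_copy_file_range, fd_in, NULL, fd_out, NULL,
			    len - done, 0U);
#else
		n = -1;
		errno = ENOSYS;
#endif
		if (n > 0) {
			done += (size_t)n;
			continue;
		}
		if (n == 0)
			break;			/* end of source */
		if (errno == EINTR)
			continue;
		if (!allow_fallback || !copy_unsupported(errno))
			return -1;		/* strict caller sees the error */
		rest = copy_by_read_write(fd_in, fd_out, len - done);
		return rest < 0 ? -1 : (ssize_t)done + rest;
	}
	return (ssize_t)done;
}
```

A caller that only wants an accelerated copy passes allow_fallback = 0 and checks errno. A caller such as a file server, which must produce the copy either way, passes 1.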

>
> Signed-off-by: Anna Schumaker <[email protected]>
> ---
> fs/read_write.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 3804547..e564a6b 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1358,7 +1358,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> if (!(file_in->f_mode & FMODE_READ) ||
> !(file_out->f_mode & FMODE_WRITE) ||
> (file_out->f_flags & O_APPEND) ||
> - !file_in->f_op || !file_in->f_op->copy_file_range)
> + !file_in->f_op)
> return -EINVAL;
>
> inode_in = file_inode(file_in);
> @@ -1382,8 +1382,12 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> if (ret)
> return ret;
>
> - ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out, pos_out,
> - len, flags);
> + ret = -ENOTSUPP;
> + if (file_in->f_op->copy_file_range)
> + ret = file_in->f_op->copy_file_range(file_in, pos_in, file_out,
> + pos_out, len, flags);
> + if (ret == -ENOTSUPP)
> + ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out, len, flags);
> if (ret > 0) {
> fsnotify_access(file_in);
> add_rchar(current, ret);
>