2019-05-26 06:11:59

by Amir Goldstein

[permalink] [raw]
Subject: [PATCH v2 0/8] Fixes for major copy_file_range() issues

Hi Darrick,

Following is a re-work of Dave Chinner's patches from December [4].
I have updated the kernel patches [1] xfstests [2] and man-page [3]
according to the feedback on v1.

NOTE that this work changes user visible behavior of copy_file_range(2)!
It introduces new errors for cases that were not checked before and it
allows cross-device copy by default. After this work, cifs copy offload
should be possible between two shares on the same server, but I did not
check this functionality.

The major difference from v1 is to conform to short read(2) semantics
that are already implemented for copy_file_range(2) instead of the
documented EINVAL, as suggested by Christoph.

My tests of this work included testing various filesystems for the
fallback default copy_file_range implementation, both filesystems that
support copy_file_range and filesystems that do not. My tests did not
include actual copy offload with nfs/cifs/ceph, so any such tests by
said filesystem developers would be much appreciated.
Special thanks to Olga Kornievskaia and Luis Henriques for helping me
test this work on nfs and ceph.

Darrick, seeing that you and Dave invested most in this work and
previous similar fixes and tests of the remap_file_range series, I
though it would be best if you carried these patches through the xfs
tree or collaborate their merge with the vfs tree.

Please note that the patch [8/8] is not related to copy_file_range,
but I included it in the series because it belongs in the context.

The man page update patch (again, mostly Dave's work) is appended
to the series.

Thanks,
Amir.

Changes since v1:
- Short read instead of EINVAL (Christoph)
- generic_file_rw_checks() helper (Darrick)
- generic_copy_file_range_prep() helper (Christoph)
- Not calling ->remap_file_range() with different sb
- Not calling ->copy_file_range() with different fs type
- Remove changes to overlayfs
- Extra fix to clone/dedupe checks

Amir Goldstein (5):
vfs: introduce generic_file_rw_checks()
vfs: add missing checks to copy_file_range
vfs: copy_file_range needs to strip setuid bits
vfs: allow copy_file_range to copy across devices
vfs: remove redundant checks from generic_remap_checks()

Dave Chinner (3):
vfs: introduce generic_copy_file_range()
vfs: no fallback for ->copy_file_range
vfs: copy_file_range should update file timestamps

fs/ceph/file.c | 32 +++++++++-
fs/cifs/cifsfs.c | 13 +++-
fs/fuse/file.c | 28 ++++++++-
fs/nfs/nfs42proc.c | 8 ++-
fs/nfs/nfs4file.c | 23 ++++++-
fs/read_write.c | 146 +++++++++++++++++++++++++++++++++------------
include/linux/fs.h | 9 +++
mm/filemap.c | 81 +++++++++++++++++++++++--
8 files changed, 283 insertions(+), 57 deletions(-)

[1] https://github.com/amir73il/linux/commits/copy_file_range-v2
[2] https://github.com/amir73il/xfstests/commits/copy_file_range-v2
[3] https://github.com/amir73il/man-pages/commits/copy_file_range-v2
[4] https://lore.kernel.org/linux-fsdevel/[email protected]/

Original cover letter by Dave:
------------------------------

Hi folks,

As most of you already know, we really suck at introducing new
functionality. The recent problems we found with clone/dedupe file
range interfaces also plague the copy_file_range() API and
implementation. Not only doesn't it do exactly what the man page
says, the man page doesn't document everything the syscall does
either.

There's a few problems:
- can overwrite setuid files
- can read from and overwrite active swap files
- can overwrite immutable files
- doesn't update timestamps
- doesn't obey resource limits
- doesn't catch overlapping copy ranges to the same file
- doesn't consistently implement fallback strategies
- does error out when the source range extends past EOF like
the man page says it should
- isn't consistent with clone file range behaviour
- inconsistent behaviour between filesystems
- inconsistent fallback implementations

And so on. There's so much wrong, and I haven't even got to the
problems that the generic fallback code (i.e. do_splice_direct()
has). That's for another day.

So, what this series attempts to do is clean up the code, implement
all the missing checks, provide an infrastructure layout that allows
for consistent behaviour across filesystems and allows filesysetms
to control fallback mechanisms and cross-device copies.

I'll repeat that so it's clear: the series also enabled cross-device
copies once all the problems are sorted out.

To that end, the current fallback code is moved to
generic_copy_file_range(), and that is called only if the filesystem
does not provide a ->copy_file_range implementation. If the
filesystem provides such a method, it must implement the page cache
copy fallback itself by calling generic_copy_file_range() when
appropriate. I did this because different filesystems have different
copy-offload capabilities and so need to fall back in different
situations. It's easier to have them call generic_copy_file_range()
to do that copy when necessary than it is to have them try to
communicate back up to vfs_copy_file_range() that it should run a
fallback copy.

To make all the implementations perform the same validity checks,
I've created a generic_copy_file_checks() which is similar to the
checks we do for clone/dedupe. It's not quite the same, but the core
is very similar. This strips setuid, updates timestamps, checks and
enforces filesystem and resource limits, bounds checks the copy
ranges, etc.

This needs to be run before we call ->remap_file_range() so that we
end up with consistent behaviour across copy_file_range() calls.
e.g. we want an XFS filesystem with reflink=1 (i.e. supports
->remap_file_range()) to behave the same as an XFS filesystem with
reflink=0. Hence we need to check all the parameters up front so we
don't end up with calls to ->remap_file_range() resulting in
different behaviour.

It also means that ->copy_file_range implementations only need to
bounds checking the input against fileystem internal constraints,
not everything. This makes the filesystem implementations simpler,
and means they can call the fallback generic_copy_file_range()
implementation without having to care about further bounds checking.

I have not changed the fallback behaviour of the CIFS, Ceph or NFS
client implementations. They still reject copy_file_range() to the
same file with EINVAL, even though it is supported by the fallback
and filesystems that implement ->remap_file_range(). I'll leave it
for the maintainers to decide if they want to implement the manual
data copy fallback or not. My personal opinion is that they should
implement the fallback where-ever they can, but userspace has to be
prepared for copy_file_range() to fail and so implementing the
fallback is an optional feature.

In terms of testing, Darrick and I have been beating the hell out of
copy_file_range with fsx on XFS to sort out all the data corruption
problems it has exposed (we're still working on that). Patches have
been posted to enhance fsx and fsstress in fstests to exercise
clone/dedupe/copy_file_range. Thread here:

https://www.spinics.net/lists/fstests/msg10920.html

I've also written a bounds/behaviour exercising test:

https://marc.info/?l=fstests&m=154381938829897&w=2
https://marc.info/?l=fstests&m=154381939029898&w=2
https://marc.info/?l=fstests&m=154381939229899&w=2
https://marc.info/?l=fstests&m=154381939329900&w=2

I don't know whether I've got all the permission tests right in this
patchset. There's absolutely no documentation telling us when we
should use file_permission, inode_permission, etc in the
documentation or the code, so I just added the things that made the
tests do the things i think are the right things to be doing.

To run the tests, you'll also need modifications to xfs_io to allow
it to modify state appropriately. This is something we have
overlooked in the past, and so a lots of xfs_io based behaviour
checking is not actually testing the syscall we thought it was
testing but is instead testing the permission checking of the open()
syscall. Those patches are here:

https://marc.info/?l=linux-xfs&m=154378403323889&w=2
https://marc.info/?l=linux-xfs&m=154378403523890&w=2
https://marc.info/?l=linux-xfs&m=154378403323888&w=2
https://marc.info/?l=linux-xfs&m=154379644526132&w=2

These changes really need to go in before we merge any more
copy_file_range() features - we need to get the basics right and get
test coverage over it before we unleash things like NFS server-side
copies on unsuspecting users with filesystems that have busted
copy_file_range() implementations.

I'll be appending a man page patch to this series that documents all
the errors this syscall can throw, the expected behaviours, etc. The
test and the man page were written together first, and the
implementation changes were done second. So if you don't agree with
the behaviour, discuss what the man page patch should say and define,
then I'll change the test to reflect that and I'll go from there.

-Dave.
--
2.17.1


2019-05-26 06:11:59

by Amir Goldstein

[permalink] [raw]
Subject: [PATCH v2 3/8] vfs: introduce generic_file_rw_checks()

Factor out helper with some checks on in/out file that are
common to clone_file_range and copy_file_range.

Suggested-by: Darrick J. Wong <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>
---
fs/read_write.c | 38 +++++++++++---------------------------
include/linux/fs.h | 1 +
mm/filemap.c | 24 ++++++++++++++++++++++++
3 files changed, 36 insertions(+), 27 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index b63dcb4e4fe9..f1900bdb3127 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1617,17 +1617,18 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t len, unsigned int flags)
{
- struct inode *inode_in = file_inode(file_in);
- struct inode *inode_out = file_inode(file_out);
ssize_t ret;

if (flags != 0)
return -EINVAL;

- if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
- return -EISDIR;
- if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
- return -EINVAL;
+ /* this could be relaxed once a method supports cross-fs copies */
+ if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
+ return -EXDEV;
+
+ ret = generic_file_rw_checks(file_in, file_out);
+ if (unlikely(ret))
+ return ret;

ret = rw_verify_area(READ, file_in, &pos_in, len);
if (unlikely(ret))
@@ -1637,15 +1638,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (unlikely(ret))
return ret;

- if (!(file_in->f_mode & FMODE_READ) ||
- !(file_out->f_mode & FMODE_WRITE) ||
- (file_out->f_flags & O_APPEND))
- return -EBADF;
-
- /* this could be relaxed once a method supports cross-fs copies */
- if (inode_in->i_sb != inode_out->i_sb)
- return -EXDEV;
-
if (len == 0)
return 0;

@@ -2013,29 +2005,21 @@ loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags)
{
- struct inode *inode_in = file_inode(file_in);
- struct inode *inode_out = file_inode(file_out);
loff_t ret;

WARN_ON_ONCE(remap_flags & REMAP_FILE_DEDUP);

- if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
- return -EISDIR;
- if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
- return -EINVAL;
-
/*
* FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
* the same mount. Practically, they only need to be on the same file
* system.
*/
- if (inode_in->i_sb != inode_out->i_sb)
+ if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
return -EXDEV;

- if (!(file_in->f_mode & FMODE_READ) ||
- !(file_out->f_mode & FMODE_WRITE) ||
- (file_out->f_flags & O_APPEND))
- return -EBADF;
+ ret = generic_file_rw_checks(file_in, file_out);
+ if (ret < 0)
+ return ret;

if (!file_in->f_op->remap_file_range)
return -EOPNOTSUPP;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ea17858310ff..89b9b73eb581 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3049,6 +3049,7 @@ extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t *count, unsigned int remap_flags);
+extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
diff --git a/mm/filemap.c b/mm/filemap.c
index c5af80c43d36..798aac92cd76 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3040,6 +3040,30 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
return 0;
}

+
+/*
+ * Performs common checks before doing a file copy/clone
+ * from @file_in to @file_out.
+ */
+int generic_file_rw_checks(struct file *file_in, struct file *file_out)
+{
+ struct inode *inode_in = file_inode(file_in);
+ struct inode *inode_out = file_inode(file_out);
+
+ /* Don't copy dirs, pipes, sockets... */
+ if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
+ return -EISDIR;
+ if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+ return -EINVAL;
+
+ if (!(file_in->f_mode & FMODE_READ) ||
+ !(file_out->f_mode & FMODE_WRITE) ||
+ (file_out->f_flags & O_APPEND))
+ return -EBADF;
+
+ return 0;
+}
+
int pagecache_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
--
2.17.1

2019-05-26 06:12:01

by Amir Goldstein

[permalink] [raw]
Subject: [PATCH v2 7/8] vfs: allow copy_file_range to copy across devices

We want to enable cross-filesystem copy_file_range functionality
where possible, so push the "same superblock only" checks down to
the individual filesystem callouts so they can make their own
decisions about cross-superblock copy offload and fallack to
generic_copy_file_range() for cross-superblock copy.

[Amir] We do not call ->remap_file_range() in case the inodes are not
on the same sb and do not call ->copy_file_range() in case the inodes
are not on the same filesystem type.

This changes behavior of the copy_file_range(2) syscall, which will
now allow cross filesystem in-kernel copy. CIFS already supports
cross-superblock copy, between two shares to the same server. This
functionality will now be available via the copy_file_range(2) syscall.

Cc: Steve French <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>
---
fs/ceph/file.c | 4 +++-
fs/cifs/cifsfs.c | 2 +-
fs/fuse/file.c | 5 ++++-
fs/nfs/nfs4file.c | 5 ++++-
fs/read_write.c | 20 ++++++++++++++------
5 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 54cfc877a6ef..adf99557c46c 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1909,6 +1909,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,

if (src_inode == dst_inode)
return -EINVAL;
+ if (src_inode->i_sb != dst_inode->i_sb)
+ return -EXDEV;
if (ceph_snap(dst_inode) != CEPH_NOSNAP)
return -EROFS;

@@ -2118,7 +2120,7 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
len, flags);

- if (ret == -EOPNOTSUPP)
+ if (ret == -EOPNOTSUPP || ret == -EXDEV)
ret = generic_copy_file_range(src_file, src_off, dst_file,
dst_off, len, flags);
return ret;
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index e103b499aaa8..7bde046110ce 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -1152,7 +1152,7 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off,
len, flags);
free_xid(xid);

- if (rc == -EOPNOTSUPP)
+ if (rc == -EOPNOTSUPP || rc == -EXDEV)
rc = generic_copy_file_range(src_file, off, dst_file,
destoff, len, flags);
return rc;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3531d4a3d9ec..180161f6f0bd 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3126,6 +3126,9 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
if (fc->no_copy_file_range)
return -EOPNOTSUPP;

+ if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
+ return -EXDEV;
+
inode_lock(inode_out);

err = generic_copy_file_range_prep(file_in, file_out);
@@ -3186,7 +3189,7 @@ static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
len, flags);

- if (ret == -EOPNOTSUPP)
+ if (ret == -EOPNOTSUPP || ret == -EXDEV)
ret = generic_copy_file_range(src_file, src_off, dst_file,
dst_off, len, flags);
return ret;
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 4842f3ab3161..f4157eb1f69d 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -133,6 +133,9 @@ static ssize_t __nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t count, unsigned int flags)
{
+ /* Only offload copy if superblock is the same */
+ if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
+ return -EXDEV;
if (!nfs_server_capable(file_inode(file_out), NFS_CAP_COPY))
return -EOPNOTSUPP;
if (file_inode(file_in) == file_inode(file_out))
@@ -148,7 +151,7 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,

ret = __nfs4_copy_file_range(file_in, pos_in, file_out, pos_out, count,
flags);
- if (ret == -EOPNOTSUPP)
+ if (ret == -EOPNOTSUPP || ret == -EXDEV)
ret = generic_copy_file_range(file_in, pos_in, file_out,
pos_out, count, flags);
return ret;
diff --git a/fs/read_write.c b/fs/read_write.c
index 4b23a86aacd9..283ec30d2136 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1640,7 +1640,18 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t len, unsigned int flags)
{
- if (file_out->f_op->copy_file_range)
+ /*
+ * Although we now allow filesystems to handle cross sb copy, passing
+ * an inode of the wrong filesystem type to filesystem operation can
+ * often result in an attempt to dereference the wrong concrete inode
+ * struct, so avoid doing that until we really have a good reason.
+ * The incentive for passing inode from different sb to filesystem is
+ * NFS cross server copy and for that use case, enforcing same
+ * filesystem type is acceptable.
+ */
+ if (file_out->f_op->copy_file_range &&
+ file_inode(file_in)->i_sb->s_type ==
+ file_inode(file_out)->i_sb->s_type)
return file_out->f_op->copy_file_range(file_in, pos_in,
file_out, pos_out,
len, flags);
@@ -1663,10 +1674,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (flags != 0)
return -EINVAL;

- /* this could be relaxed once a method supports cross-fs copies */
- if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
- return -EXDEV;
-
ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
flags);
if (unlikely(ret))
@@ -1689,7 +1696,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
* Try cloning first, this is supported by more file systems, and
* more efficient if both clone and copy are supported (e.g. NFS).
*/
- if (file_in->f_op->remap_file_range) {
+ if (file_in->f_op->remap_file_range &&
+ file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) {
loff_t cloned;

cloned = file_in->f_op->remap_file_range(file_in, pos_in,
--
2.17.1

2019-05-26 06:12:02

by Amir Goldstein

[permalink] [raw]
Subject: [PATCH v2 8/8] vfs: remove redundant checks from generic_remap_checks()

The access limit checks on input file range in generic_remap_checks()
are redundant because the input file size is guarantied to be within
limits and pos+len are already checked to be within input file size.

Beyond the fact that the check cannot fail, if it would have failed,
it could return -EFBIG for input file range error. There is no precedent
for that. -EFBIG is returned in syscalls that would change file length.

Signed-off-by: Amir Goldstein <[email protected]>
---
mm/filemap.c | 4 ----
1 file changed, 4 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 1852fbf08eeb..7e1aa36d57a2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3000,10 +3000,6 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
return -EINVAL;
count = min(count, size_in - (uint64_t)pos_in);

- ret = generic_access_check_limits(file_in, pos_in, &count);
- if (ret)
- return ret;
-
ret = generic_write_check_limits(file_out, pos_out, &count);
if (ret)
return ret;
--
2.17.1

2019-05-26 06:12:15

by Amir Goldstein

[permalink] [raw]
Subject: [PATCH v2 9/8] man-pages: copy_file_range updates

Update with all the missing errors the syscall can return, the
behaviour the syscall should have w.r.t. to copies within single
files, etc.

[Amir] Copying beyond EOF returns zero.

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>
---
man2/copy_file_range.2 | 93 ++++++++++++++++++++++++++++++++++--------
1 file changed, 77 insertions(+), 16 deletions(-)

diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
index 2438b63c8..fab11f977 100644
--- a/man2/copy_file_range.2
+++ b/man2/copy_file_range.2
@@ -42,9 +42,9 @@ without the additional cost of transferring data from the kernel to user space
and then back into the kernel.
It copies up to
.I len
-bytes of data from file descriptor
+bytes of data from the source file descriptor
.I fd_in
-to file descriptor
+to target file descriptor
.IR fd_out ,
overwriting any data that exists within the requested range of the target file.
.PP
@@ -74,6 +74,11 @@ is not changed, but
.I off_in
is adjusted appropriately.
.PP
+.I fd_in
+and
+.I fd_out
+can refer to the same file. If they refer to the same file, then the source and
+target ranges are not allowed to overlap.
.PP
The
.I flags
@@ -84,6 +89,11 @@ Upon successful completion,
.BR copy_file_range ()
will return the number of bytes copied between files.
This could be less than the length originally requested.
+If the file offset of
+.I fd_in
+is at or past the end of file, no bytes are copied, and
+.BR copy_file_range ()
+returns zero.
.PP
On error,
.BR copy_file_range ()
@@ -93,12 +103,16 @@ is set to indicate the error.
.SH ERRORS
.TP
.B EBADF
-One or more file descriptors are not valid; or
+One or more file descriptors are not valid.
+.TP
+.B EBADF
.I fd_in
is not open for reading; or
.I fd_out
-is not open for writing; or
-the
+is not open for writing.
+.TP
+.B EBADF
+The
.B O_APPEND
flag is set for the open file description (see
.BR open (2))
@@ -106,17 +120,36 @@ referred to by the file descriptor
.IR fd_out .
.TP
.B EFBIG
-An attempt was made to write a file that exceeds the implementation-defined
-maximum file size or the process's file size limit,
-or to write at a position past the maximum allowed offset.
+An attempt was made to write at a position past the maximum file offset the
+kernel supports.
+.TP
+.B EFBIG
+An attempt was made to write a range that exceeds the allowed maximum file size.
+The maximum file size differs between filesystem implemenations and can be
+different to the maximum allowed file offset.
+.TP
+.B EFBIG
+An attempt was made to write beyond the process's file size resource
+limit. This may also result in the process receiving a
+.I SIGXFSZ
+signal.
.TP
.B EINVAL
-Requested range extends beyond the end of the source file; or the
+The
.I flags
argument is not 0.
.TP
-.B EIO
-A low-level I/O error occurred while copying.
+.B EINVAL
+.I fd_in
+and
+.I fd_out
+refer to the same file and the source and target ranges overlap.
+.TP
+.B EINVAL
+.I fd_in
+or
+.I fd_out
+is not a regular file.
.TP
.B EISDIR
.I fd_in
@@ -124,22 +157,50 @@ or
.I fd_out
refers to a directory.
.TP
+.B EOVERFLOW
+The requested source or destination range is too large to represent in the
+specified data types.
+.TP
+.B EIO
+A low-level I/O error occurred while copying.
+.TP
.B ENOMEM
Out of memory.
.TP
-.B ENOSPC
-There is not enough space on the target filesystem to complete the copy.
-.TP
.B EXDEV
The files referred to by
.IR file_in " and " file_out
-are not on the same mounted filesystem.
+are not on the same mounted filesystem (pre Linux 5.3).
+.TP
+.B ENOSPC
+There is not enough space on the target filesystem to complete the copy.
+.TP
+.B TXTBSY
+.I fd_in
+or
+.I fd_out
+refers to an active swap file.
+.TP
+.B EPERM
+.I fd_out
+refers to an immutable file.
+.TP
+.B EACCES
+The user does not have write permissions for the destination file.
.SH VERSIONS
The
.BR copy_file_range ()
system call first appeared in Linux 4.5, but glibc 2.27 provides a user-space
emulation when it is not available.
.\" https://sourceware.org/git/?p=glibc.git;a=commit;f=posix/unistd.h;h=bad7a0c81f501fbbcc79af9eaa4b8254441c4a1f
+.PP
+A major rework of the kernel implementation occurred in 5.3. Areas of the API
+that weren't clearly defined were clarified and the API bounds are much more
+strictly checked than on earlier kernels. Applications should target the
+behaviour and requirements of 5.3 kernels.
+.PP
+First support for cross-filesystem copies was introduced in Linux 5.3. Older
+kernels will return -EXDEV when cross-filesystem copies are attempted.
.SH CONFORMING TO
The
.BR copy_file_range ()
@@ -224,7 +285,7 @@ main(int argc, char **argv)
}

len \-= ret;
- } while (len > 0);
+ } while (len > 0 && ret > 0);

close(fd_in);
close(fd_out);
--
2.17.1

2019-05-26 06:12:19

by Amir Goldstein

[permalink] [raw]
Subject: [PATCH v2 2/8] vfs: no fallback for ->copy_file_range

From: Dave Chinner <[email protected]>

Now that we have generic_copy_file_range(), remove it as a fallback
case when offloads fail. This puts the responsibility for executing
fallbacks on the filesystems that implement ->copy_file_range and
allows us to add operational validity checks to
generic_copy_file_range().

Rework vfs_copy_file_range() to call a new do_copy_file_range()
helper to exceute the copying callout, and move calls to
generic_file_copy_range() into filesystem methods where they
currently return failures.

[Amir] overlayfs is not responsible of executing the fallback.
It is the responsibility of the underlying filesystem.

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>
---
fs/ceph/file.c | 21 ++++++++++++++++++---
fs/cifs/cifsfs.c | 4 ++++
fs/fuse/file.c | 21 ++++++++++++++++++---
fs/nfs/nfs4file.c | 20 +++++++++++++++++---
fs/read_write.c | 25 ++++++++++++++++---------
5 files changed, 73 insertions(+), 18 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 305daf043eb0..e87f7b2023af 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1889,9 +1889,9 @@ static int is_file_size_ok(struct inode *src_inode, struct inode *dst_inode,
return 0;
}

-static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
- struct file *dst_file, loff_t dst_off,
- size_t len, unsigned int flags)
+static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
+ struct file *dst_file, loff_t dst_off,
+ size_t len, unsigned int flags)
{
struct inode *src_inode = file_inode(src_file);
struct inode *dst_inode = file_inode(dst_file);
@@ -2100,6 +2100,21 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
return ret;
}

+static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
+ struct file *dst_file, loff_t dst_off,
+ size_t len, unsigned int flags)
+{
+ ssize_t ret;
+
+ ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
+ len, flags);
+
+ if (ret == -EOPNOTSUPP)
+ ret = generic_copy_file_range(src_file, src_off, dst_file,
+ dst_off, len, flags);
+ return ret;
+}
+
const struct file_operations ceph_file_fops = {
.open = ceph_open,
.release = ceph_release,
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index f5fcd6360056..c65823270313 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -1148,6 +1148,10 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off,
rc = cifs_file_copychunk_range(xid, src_file, off, dst_file, destoff,
len, flags);
free_xid(xid);
+
+ if (rc == -EOPNOTSUPP)
+ rc = generic_copy_file_range(src_file, off, dst_file,
+ destoff, len, flags);
return rc;
}

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3959f08279e6..e03901ae729b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3097,9 +3097,9 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
return err;
}

-static ssize_t fuse_copy_file_range(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- size_t len, unsigned int flags)
+static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, unsigned int flags)
{
struct fuse_file *ff_in = file_in->private_data;
struct fuse_file *ff_out = file_out->private_data;
@@ -3173,6 +3173,21 @@ static ssize_t fuse_copy_file_range(struct file *file_in, loff_t pos_in,
return err;
}

+static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
+ struct file *dst_file, loff_t dst_off,
+ size_t len, unsigned int flags)
+{
+ ssize_t ret;
+
+ ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
+ len, flags);
+
+ if (ret == -EOPNOTSUPP)
+ ret = generic_copy_file_range(src_file, src_off, dst_file,
+ dst_off, len, flags);
+ return ret;
+}
+
static const struct file_operations fuse_file_operations = {
.llseek = fuse_file_llseek,
.read_iter = fuse_file_read_iter,
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index cf42a8b939e3..4842f3ab3161 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -129,9 +129,9 @@ nfs4_file_flush(struct file *file, fl_owner_t id)
}

#ifdef CONFIG_NFS_V4_2
-static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- size_t count, unsigned int flags)
+static ssize_t __nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t count, unsigned int flags)
{
if (!nfs_server_capable(file_inode(file_out), NFS_CAP_COPY))
return -EOPNOTSUPP;
@@ -140,6 +140,20 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
return nfs42_proc_copy(file_in, pos_in, file_out, pos_out, count);
}

+static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t count, unsigned int flags)
+{
+ ssize_t ret;
+
+ ret = __nfs4_copy_file_range(file_in, pos_in, file_out, pos_out, count,
+ flags);
+ if (ret == -EOPNOTSUPP)
+ ret = generic_copy_file_range(file_in, pos_in, file_out,
+ pos_out, count, flags);
+ return ret;
+}
+
static loff_t nfs4_file_llseek(struct file *filep, loff_t offset, int whence)
{
loff_t ret;
diff --git a/fs/read_write.c b/fs/read_write.c
index 676b02fae589..b63dcb4e4fe9 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1595,6 +1595,19 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
}
EXPORT_SYMBOL(generic_copy_file_range);

+static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, unsigned int flags)
+{
+ if (file_out->f_op->copy_file_range)
+ return file_out->f_op->copy_file_range(file_in, pos_in,
+ file_out, pos_out,
+ len, flags);
+
+ return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
+ flags);
+}
+
/*
* copy_file_range() differs from regular file read and write in that it
* specifically allows return partial success. When it does so is up to
@@ -1655,15 +1668,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
}
}

- if (file_out->f_op->copy_file_range) {
- ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
- pos_out, len, flags);
- if (ret != -EOPNOTSUPP)
- goto done;
- }
-
- ret = generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
- flags);
+ ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len,
+ flags);
+ WARN_ON_ONCE(ret == -EOPNOTSUPP);
done:
if (ret > 0) {
fsnotify_access(file_in);
--
2.17.1

2019-05-26 06:12:35

by Amir Goldstein

[permalink] [raw]
Subject: [PATCH v2 4/8] vfs: add missing checks to copy_file_range

Like the clone and dedupe interfaces we've recently fixed, the
copy_file_range() implementation is missing basic sanity, limits and
boundary condition tests on the parameters that are passed to it
from userspace. Create a new "generic_copy_file_checks()" function
modelled on the generic_remap_checks() function to provide this
missing functionality.

[Amir] Shorten copy length instead of checking pos_in limits
because input file size already abides by the limits.

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>
---
fs/read_write.c | 3 ++-
include/linux/fs.h | 3 +++
mm/filemap.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 58 insertions(+), 1 deletion(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index f1900bdb3127..b0fb1176b628 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1626,7 +1626,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
return -EXDEV;

- ret = generic_file_rw_checks(file_in, file_out);
+ ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
+ flags);
if (unlikely(ret))
return ret;

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 89b9b73eb581..e4d382c4342a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3050,6 +3050,9 @@ extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t *count, unsigned int remap_flags);
extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
+extern int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t *count, unsigned int flags);
extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
diff --git a/mm/filemap.c b/mm/filemap.c
index 798aac92cd76..1852fbf08eeb 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3064,6 +3064,59 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
return 0;
}

+/*
+ * Performs necessary checks before doing a file copy
+ *
+ * Can adjust amount of bytes to copy
+ * Returns appropriate error code that caller should return or
+ * zero in case the copy should be allowed.
+ */
+int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t *req_count, unsigned int flags)
+{
+ struct inode *inode_in = file_inode(file_in);
+ struct inode *inode_out = file_inode(file_out);
+ uint64_t count = *req_count;
+ loff_t size_in;
+ int ret;
+
+ ret = generic_file_rw_checks(file_in, file_out);
+ if (ret)
+ return ret;
+
+ /* Don't touch certain kinds of inodes */
+ if (IS_IMMUTABLE(inode_out))
+ return -EPERM;
+
+ if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
+ return -ETXTBSY;
+
+ /* Ensure offsets don't wrap. */
+ if (pos_in + count < pos_in || pos_out + count < pos_out)
+ return -EOVERFLOW;
+
+ /* Shorten the copy to EOF */
+ size_in = i_size_read(inode_in);
+ if (pos_in >= size_in)
+ count = 0;
+ else
+ count = min(count, size_in - (uint64_t)pos_in);
+
+ ret = generic_write_check_limits(file_out, pos_out, &count);
+ if (ret)
+ return ret;
+
+ /* Don't allow overlapped copying within the same file. */
+ if (inode_in == inode_out &&
+ pos_out + count > pos_in &&
+ pos_out < pos_in + count)
+ return -EINVAL;
+
+ *req_count = count;
+ return 0;
+}
+
int pagecache_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
--
2.17.1

2019-05-26 06:13:43

by Amir Goldstein

[permalink] [raw]
Subject: [PATCH v2 6/8] vfs: copy_file_range should update file timestamps

From: Dave Chinner <[email protected]>

Timestamps are not updated right now, so programs looking for
timestamp updates for file modifications (like rsync) will not
detect that files have changed. We are also accessing the source
data when doing a copy (but not when cloning) so we need to update
atime on the source file as well.

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>
---
fs/read_write.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/fs/read_write.c b/fs/read_write.c
index e16bcafc0da2..4b23a86aacd9 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1576,6 +1576,16 @@ int generic_copy_file_range_prep(struct file *file_in, struct file *file_out)

WARN_ON_ONCE(!inode_is_locked(file_inode(file_out)));

+ /* Update source timestamps, because we are accessing file data */
+ file_accessed(file_in);
+
+ /* Update destination timestamps, since we can alter file contents. */
+ if (!(file_out->f_mode & FMODE_NOCMTIME)) {
+ ret = file_update_time(file_out);
+ if (ret)
+ return ret;
+ }
+
/*
* Clear the security bits if the process is not being run by root.
* This keeps people from modifying setuid and setgid binaries.
--
2.17.1

2019-05-26 06:13:48

by Amir Goldstein

[permalink] [raw]
Subject: [PATCH v2 5/8] vfs: copy_file_range needs to strip setuid bits

The file we are copying data into needs to have its setuid bit
stripped before we start the data copy so that unprivileged users
can't copy data into executables that are run with root privs.

[Amir] Introduce the helper generic_copy_file_range_prep() modelled
after generic_remap_file_range_prep(). Helper is called by filesystem
before the copy_file_range operation and with output inode locked.

For ceph and for default generic_copy_file_range() implementation there
is no inode lock held throughout the copy operation, so we do best
effort and remove setuid bit before copy starts. This does not protect
suid file from changing if suid bit is set after copy started.

Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>
---
fs/ceph/file.c | 9 +++++++++
fs/cifs/cifsfs.c | 9 ++++++---
fs/fuse/file.c | 4 ++++
fs/nfs/nfs42proc.c | 8 +++++---
fs/read_write.c | 31 +++++++++++++++++++++++++++++++
include/linux/fs.h | 2 ++
6 files changed, 57 insertions(+), 6 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index e87f7b2023af..54cfc877a6ef 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1947,6 +1947,15 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
goto out;
}

+ /* Should inode lock be held throughout the copy operation? */
+ inode_lock(dst_inode);
+ ret = generic_copy_file_range_prep(src_file, dst_file);
+ inode_unlock(dst_inode);
+ if (ret < 0) {
+ dout("failed to copy from src to dst file (%zd)\n", ret);
+ goto out;
+ }
+
/*
* We need FILE_WR caps for dst_ci and FILE_RD for src_ci as other
* clients may have dirty data in their caches. And OSDs know nothing
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index c65823270313..e103b499aaa8 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -1096,6 +1096,10 @@ ssize_t cifs_file_copychunk_range(unsigned int xid,
goto out;
}

+ rc = -EOPNOTSUPP;
+ if (!target_tcon->ses->server->ops->copychunk_range)
+ goto out;
+
/*
* Note: cifs case is easier than btrfs since server responsible for
* checks for proper open modes and file type and if it wants
@@ -1107,11 +1111,10 @@ ssize_t cifs_file_copychunk_range(unsigned int xid,
/* should we flush first and last page first */
truncate_inode_pages(&target_inode->i_data, 0);

- if (target_tcon->ses->server->ops->copychunk_range)
+ rc = generic_copy_file_range_prep(src_file, dst_file);
+ if (!rc)
rc = target_tcon->ses->server->ops->copychunk_range(xid,
smb_file_src, smb_file_target, off, len, destoff);
- else
- rc = -EOPNOTSUPP;

/* force revalidate of size and timestamps of target file now
* that target is updated on the server
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index e03901ae729b..3531d4a3d9ec 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3128,6 +3128,10 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,

inode_lock(inode_out);

+ err = generic_copy_file_range_prep(file_in, file_out);
+ if (err)
+ goto out;
+
if (fc->writeback_cache) {
err = filemap_write_and_wait_range(inode_out->i_mapping,
pos_out, pos_out + len);
diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
index 5196bfa7894d..b387951e1d86 100644
--- a/fs/nfs/nfs42proc.c
+++ b/fs/nfs/nfs42proc.c
@@ -345,9 +345,11 @@ ssize_t nfs42_proc_copy(struct file *src, loff_t pos_src,

do {
inode_lock(file_inode(dst));
- err = _nfs42_proc_copy(src, src_lock,
- dst, dst_lock,
- &args, &res);
+ err = generic_copy_file_range_prep(src, dst);
+ if (!err)
+ err = _nfs42_proc_copy(src, src_lock,
+ dst, dst_lock,
+ &args, &res);
inode_unlock(file_inode(dst));

if (err >= 0)
diff --git a/fs/read_write.c b/fs/read_write.c
index b0fb1176b628..e16bcafc0da2 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1565,6 +1565,28 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
}
#endif

+/*
+ * Prepare inodes for copy from @file_in to @file_out.
+ *
+ * Caller must hold output inode lock.
+ */
+int generic_copy_file_range_prep(struct file *file_in, struct file *file_out)
+{
+ int ret;
+
+ WARN_ON_ONCE(!inode_is_locked(file_inode(file_out)));
+
+ /*
+ * Clear the security bits if the process is not being run by root.
+ * This keeps people from modifying setuid and setgid binaries.
+ */
+ ret = file_remove_privs(file_out);
+
+ return ret;
+
+}
+EXPORT_SYMBOL(generic_copy_file_range_prep);
+
/**
* generic_copy_file_range - copy data between two files
* @file_in: file structure to read from
@@ -1590,6 +1612,15 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t len, unsigned int flags)
{
+ int ret;
+
+ /* Should inode lock be held throughout the copy operation? */
+ inode_lock(file_inode(file_out));
+ ret = generic_copy_file_range_prep(file_in, file_out);
+ inode_unlock(file_inode(file_out));
+ if (ret)
+ return ret;
+
return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e4d382c4342a..3e03a96d9ab6 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1889,6 +1889,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
unsigned long, loff_t *, rwf_t);
extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
loff_t, size_t, unsigned int);
+extern int generic_copy_file_range_prep(struct file *file_in,
+ struct file *file_out);
extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t len, unsigned int flags);
--
2.17.1

2019-05-27 14:37:02

by Luis Henriques

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] vfs: copy_file_range should update file timestamps

On Sun, May 26, 2019 at 09:10:57AM +0300, Amir Goldstein wrote:
> From: Dave Chinner <[email protected]>
>
> Timestamps are not updated right now, so programs looking for
> timestamp updates for file modifications (like rsync) will not
> detect that files have changed. We are also accessing the source
> data when doing a copy (but not when cloning) so we need to update
> atime on the source file as well.
>
> Signed-off-by: Dave Chinner <[email protected]>
> Signed-off-by: Amir Goldstein <[email protected]>
> ---
> fs/read_write.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index e16bcafc0da2..4b23a86aacd9 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1576,6 +1576,16 @@ int generic_copy_file_range_prep(struct file *file_in, struct file *file_out)
>
> WARN_ON_ONCE(!inode_is_locked(file_inode(file_out)));
>
> + /* Update source timestamps, because we are accessing file data */
> + file_accessed(file_in);
> +
> + /* Update destination timestamps, since we can alter file contents. */
> + if (!(file_out->f_mode & FMODE_NOCMTIME)) {
> + ret = file_update_time(file_out);
> + if (ret)
> + return ret;
> + }
> +

Is this the right place for updating the timestamps? I see that in same
cases we may be updating the timestamp even if there was an error and no
copy was performed. For example, if file_remove_privs fails.

(btw, I've re-tested everything on ceph and everything seems to be
working fine.)

Cheers,
--
Lu?s


> /*
> * Clear the security bits if the process is not being run by root.
> * This keeps people from modifying setuid and setgid binaries.
> --
> 2.17.1
>
>

2019-05-27 22:30:53

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] vfs: copy_file_range should update file timestamps

On Mon, May 27, 2019 at 03:35:39PM +0100, Luis Henriques wrote:
> On Sun, May 26, 2019 at 09:10:57AM +0300, Amir Goldstein wrote:
> > From: Dave Chinner <[email protected]>
> >
> > Timestamps are not updated right now, so programs looking for
> > timestamp updates for file modifications (like rsync) will not
> > detect that files have changed. We are also accessing the source
> > data when doing a copy (but not when cloning) so we need to update
> > atime on the source file as well.
> >
> > Signed-off-by: Dave Chinner <[email protected]>
> > Signed-off-by: Amir Goldstein <[email protected]>
> > ---
> > fs/read_write.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index e16bcafc0da2..4b23a86aacd9 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1576,6 +1576,16 @@ int generic_copy_file_range_prep(struct file *file_in, struct file *file_out)
> >
> > WARN_ON_ONCE(!inode_is_locked(file_inode(file_out)));
> >
> > + /* Update source timestamps, because we are accessing file data */
> > + file_accessed(file_in);
> > +
> > + /* Update destination timestamps, since we can alter file contents. */
> > + if (!(file_out->f_mode & FMODE_NOCMTIME)) {
> > + ret = file_update_time(file_out);
> > + if (ret)
> > + return ret;
> > + }
> > +
>
> Is this the right place for updating the timestamps? I see that in same
> cases we may be updating the timestamp even if there was an error and no
> copy was performed. For example, if file_remove_privs fails.

It's the same place we do it for read - file_accessed() is called
before we do the IO - and the same place for write -
file_update_time() is called before we copy data into the pagecache
or do direct IO. As such, it really doesn't matter if it is before
or after file_remove_privs() - the IO can still fail for many
reasons after we've updated the timestamps and in some of the
failure cases (e.g. we failed the sync at the end of an O_DSYNC
buffered write) we still want the timestamps to be modified because
the data and/or user visible metadata /may/ have been changed.

cfr operates under the same constraints as read() and write(), so we
need to update the timestamps up front regardless of whether the
copy ends up succeeding or not....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-05-28 08:54:10

by Luis Henriques

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] vfs: copy_file_range should update file timestamps

Dave Chinner <[email protected]> writes:

> On Mon, May 27, 2019 at 03:35:39PM +0100, Luis Henriques wrote:
>> On Sun, May 26, 2019 at 09:10:57AM +0300, Amir Goldstein wrote:
>> > From: Dave Chinner <[email protected]>
>> >
>> > Timestamps are not updated right now, so programs looking for
>> > timestamp updates for file modifications (like rsync) will not
>> > detect that files have changed. We are also accessing the source
>> > data when doing a copy (but not when cloning) so we need to update
>> > atime on the source file as well.
>> >
>> > Signed-off-by: Dave Chinner <[email protected]>
>> > Signed-off-by: Amir Goldstein <[email protected]>
>> > ---
>> > fs/read_write.c | 10 ++++++++++
>> > 1 file changed, 10 insertions(+)
>> >
>> > diff --git a/fs/read_write.c b/fs/read_write.c
>> > index e16bcafc0da2..4b23a86aacd9 100644
>> > --- a/fs/read_write.c
>> > +++ b/fs/read_write.c
>> > @@ -1576,6 +1576,16 @@ int generic_copy_file_range_prep(struct file *file_in, struct file *file_out)
>> >
>> > WARN_ON_ONCE(!inode_is_locked(file_inode(file_out)));
>> >
>> > + /* Update source timestamps, because we are accessing file data */
>> > + file_accessed(file_in);
>> > +
>> > + /* Update destination timestamps, since we can alter file contents. */
>> > + if (!(file_out->f_mode & FMODE_NOCMTIME)) {
>> > + ret = file_update_time(file_out);
>> > + if (ret)
>> > + return ret;
>> > + }
>> > +
>>
>> Is this the right place for updating the timestamps? I see that in same
>> cases we may be updating the timestamp even if there was an error and no
>> copy was performed. For example, if file_remove_privs fails.
>
> It's the same place we do it for read - file_accessed() is called
> before we do the IO - and the same place for write -
> file_update_time() is called before we copy data into the pagecache
> or do direct IO. As such, it really doesn't matter if it is before
> or after file_remove_privs() - the IO can still fail for many
> reasons after we've updated the timestamps and in some of the
> failure cases (e.g. we failed the sync at the end of an O_DSYNC
> buffered write) we still want the timestamps to be modified because
> the data and/or user visible metadata /may/ have been changed.
>
> cfr operates under the same constraints as read() and write(), so we
> need to update the timestamps up front regardless of whether the
> copy ends up succeeding or not....

Great, thanks for explaining it. It now makes sense, even for
consistency, to have this operation here.

Cheers,
--
Luis

2019-05-28 16:11:09

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v2 2/8] vfs: no fallback for ->copy_file_range

On Sun, May 26, 2019 at 09:10:53AM +0300, Amir Goldstein wrote:
> From: Dave Chinner <[email protected]>
>
> Now that we have generic_copy_file_range(), remove it as a fallback
> case when offloads fail. This puts the responsibility for executing
> fallbacks on the filesystems that implement ->copy_file_range and
> allows us to add operational validity checks to
> generic_copy_file_range().
>
> Rework vfs_copy_file_range() to call a new do_copy_file_range()
> helper to exceute the copying callout, and move calls to

execute ^^^^^^^

> generic_file_copy_range() into filesystem methods where they
> currently return failures.
>
> [Amir] overlayfs is not responsible of executing the fallback.
> It is the responsibility of the underlying filesystem.
>
> Signed-off-by: Dave Chinner <[email protected]>
> Signed-off-by: Amir Goldstein <[email protected]>

Seems fine to me otherwise...
Reviewed-by: Darrick J. Wong <[email protected]>

--D

> ---
> fs/ceph/file.c | 21 ++++++++++++++++++---
> fs/cifs/cifsfs.c | 4 ++++
> fs/fuse/file.c | 21 ++++++++++++++++++---
> fs/nfs/nfs4file.c | 20 +++++++++++++++++---
> fs/read_write.c | 25 ++++++++++++++++---------
> 5 files changed, 73 insertions(+), 18 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 305daf043eb0..e87f7b2023af 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1889,9 +1889,9 @@ static int is_file_size_ok(struct inode *src_inode, struct inode *dst_inode,
> return 0;
> }
>
> -static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
> - struct file *dst_file, loff_t dst_off,
> - size_t len, unsigned int flags)
> +static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> + struct file *dst_file, loff_t dst_off,
> + size_t len, unsigned int flags)
> {
> struct inode *src_inode = file_inode(src_file);
> struct inode *dst_inode = file_inode(dst_file);
> @@ -2100,6 +2100,21 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
> return ret;
> }
>
> +static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
> + struct file *dst_file, loff_t dst_off,
> + size_t len, unsigned int flags)
> +{
> + ssize_t ret;
> +
> + ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
> + len, flags);
> +
> + if (ret == -EOPNOTSUPP)
> + ret = generic_copy_file_range(src_file, src_off, dst_file,
> + dst_off, len, flags);
> + return ret;
> +}
> +
> const struct file_operations ceph_file_fops = {
> .open = ceph_open,
> .release = ceph_release,
> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> index f5fcd6360056..c65823270313 100644
> --- a/fs/cifs/cifsfs.c
> +++ b/fs/cifs/cifsfs.c
> @@ -1148,6 +1148,10 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off,
> rc = cifs_file_copychunk_range(xid, src_file, off, dst_file, destoff,
> len, flags);
> free_xid(xid);
> +
> + if (rc == -EOPNOTSUPP)
> + rc = generic_copy_file_range(src_file, off, dst_file,
> + destoff, len, flags);
> return rc;
> }
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 3959f08279e6..e03901ae729b 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -3097,9 +3097,9 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
> return err;
> }
>
> -static ssize_t fuse_copy_file_range(struct file *file_in, loff_t pos_in,
> - struct file *file_out, loff_t pos_out,
> - size_t len, unsigned int flags)
> +static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + size_t len, unsigned int flags)
> {
> struct fuse_file *ff_in = file_in->private_data;
> struct fuse_file *ff_out = file_out->private_data;
> @@ -3173,6 +3173,21 @@ static ssize_t fuse_copy_file_range(struct file *file_in, loff_t pos_in,
> return err;
> }
>
> +static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
> + struct file *dst_file, loff_t dst_off,
> + size_t len, unsigned int flags)
> +{
> + ssize_t ret;
> +
> + ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
> + len, flags);
> +
> + if (ret == -EOPNOTSUPP)
> + ret = generic_copy_file_range(src_file, src_off, dst_file,
> + dst_off, len, flags);
> + return ret;
> +}
> +
> static const struct file_operations fuse_file_operations = {
> .llseek = fuse_file_llseek,
> .read_iter = fuse_file_read_iter,
> diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> index cf42a8b939e3..4842f3ab3161 100644
> --- a/fs/nfs/nfs4file.c
> +++ b/fs/nfs/nfs4file.c
> @@ -129,9 +129,9 @@ nfs4_file_flush(struct file *file, fl_owner_t id)
> }
>
> #ifdef CONFIG_NFS_V4_2
> -static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
> - struct file *file_out, loff_t pos_out,
> - size_t count, unsigned int flags)
> +static ssize_t __nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + size_t count, unsigned int flags)
> {
> if (!nfs_server_capable(file_inode(file_out), NFS_CAP_COPY))
> return -EOPNOTSUPP;
> @@ -140,6 +140,20 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
> return nfs42_proc_copy(file_in, pos_in, file_out, pos_out, count);
> }
>
> +static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + size_t count, unsigned int flags)
> +{
> + ssize_t ret;
> +
> + ret = __nfs4_copy_file_range(file_in, pos_in, file_out, pos_out, count,
> + flags);
> + if (ret == -EOPNOTSUPP)
> + ret = generic_copy_file_range(file_in, pos_in, file_out,
> + pos_out, count, flags);
> + return ret;
> +}
> +
> static loff_t nfs4_file_llseek(struct file *filep, loff_t offset, int whence)
> {
> loff_t ret;
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 676b02fae589..b63dcb4e4fe9 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1595,6 +1595,19 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
> }
> EXPORT_SYMBOL(generic_copy_file_range);
>
> +static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + size_t len, unsigned int flags)
> +{
> + if (file_out->f_op->copy_file_range)
> + return file_out->f_op->copy_file_range(file_in, pos_in,
> + file_out, pos_out,
> + len, flags);
> +
> + return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> + flags);
> +}
> +
> /*
> * copy_file_range() differs from regular file read and write in that it
> * specifically allows return partial success. When it does so is up to
> @@ -1655,15 +1668,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> }
> }
>
> - if (file_out->f_op->copy_file_range) {
> - ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
> - pos_out, len, flags);
> - if (ret != -EOPNOTSUPP)
> - goto done;
> - }
> -
> - ret = generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> - flags);
> + ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> + flags);
> + WARN_ON_ONCE(ret == -EOPNOTSUPP);
> done:
> if (ret > 0) {
> fsnotify_access(file_in);
> --
> 2.17.1
>

2019-05-28 16:16:46

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v2 3/8] vfs: introduce generic_file_rw_checks()

On Sun, May 26, 2019 at 09:10:54AM +0300, Amir Goldstein wrote:
> Factor out helper with some checks on in/out file that are
> common to clone_file_range and copy_file_range.
>
> Suggested-by: Darrick J. Wong <[email protected]>
> Signed-off-by: Amir Goldstein <[email protected]>
> ---
> fs/read_write.c | 38 +++++++++++---------------------------
> include/linux/fs.h | 1 +
> mm/filemap.c | 24 ++++++++++++++++++++++++
> 3 files changed, 36 insertions(+), 27 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index b63dcb4e4fe9..f1900bdb3127 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1617,17 +1617,18 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> size_t len, unsigned int flags)
> {
> - struct inode *inode_in = file_inode(file_in);
> - struct inode *inode_out = file_inode(file_out);
> ssize_t ret;
>
> if (flags != 0)
> return -EINVAL;
>
> - if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
> - return -EISDIR;
> - if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> - return -EINVAL;
> + /* this could be relaxed once a method supports cross-fs copies */
> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> + return -EXDEV;
> +
> + ret = generic_file_rw_checks(file_in, file_out);
> + if (unlikely(ret))
> + return ret;
>
> ret = rw_verify_area(READ, file_in, &pos_in, len);
> if (unlikely(ret))
> @@ -1637,15 +1638,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> if (unlikely(ret))
> return ret;
>
> - if (!(file_in->f_mode & FMODE_READ) ||
> - !(file_out->f_mode & FMODE_WRITE) ||
> - (file_out->f_flags & O_APPEND))
> - return -EBADF;
> -
> - /* this could be relaxed once a method supports cross-fs copies */
> - if (inode_in->i_sb != inode_out->i_sb)
> - return -EXDEV;
> -
> if (len == 0)
> return 0;
>
> @@ -2013,29 +2005,21 @@ loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> loff_t len, unsigned int remap_flags)
> {
> - struct inode *inode_in = file_inode(file_in);
> - struct inode *inode_out = file_inode(file_out);
> loff_t ret;
>
> WARN_ON_ONCE(remap_flags & REMAP_FILE_DEDUP);
>
> - if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
> - return -EISDIR;
> - if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> - return -EINVAL;
> -
> /*
> * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> * the same mount. Practically, they only need to be on the same file
> * system.
> */
> - if (inode_in->i_sb != inode_out->i_sb)
> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> return -EXDEV;
>
> - if (!(file_in->f_mode & FMODE_READ) ||
> - !(file_out->f_mode & FMODE_WRITE) ||
> - (file_out->f_flags & O_APPEND))
> - return -EBADF;
> + ret = generic_file_rw_checks(file_in, file_out);
> + if (ret < 0)
> + return ret;
>
> if (!file_in->f_op->remap_file_range)
> return -EOPNOTSUPP;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ea17858310ff..89b9b73eb581 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3049,6 +3049,7 @@ extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
> extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> loff_t *count, unsigned int remap_flags);
> +extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
> extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
> extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
> extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c5af80c43d36..798aac92cd76 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3040,6 +3040,30 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
> return 0;
> }
>
> +
> +/*
> + * Performs common checks before doing a file copy/clone
> + * from @file_in to @file_out.
> + */
> +int generic_file_rw_checks(struct file *file_in, struct file *file_out)
> +{
> + struct inode *inode_in = file_inode(file_in);
> + struct inode *inode_out = file_inode(file_out);
> +
> + /* Don't copy dirs, pipes, sockets... */
> + if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
> + return -EISDIR;
> + if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
> + return -EINVAL;
> +
> + if (!(file_in->f_mode & FMODE_READ) ||
> + !(file_out->f_mode & FMODE_WRITE) ||
> + (file_out->f_flags & O_APPEND))
> + return -EBADF;
> +
> + return 0;
> +}

/me wishes all these generic vfs io check helpers weren't all being
stuffed into mm/ but that's a refactoring for a different patchset...

Looks ok,
Reviewed-by: Darrick J. Wong <[email protected]>

--D

> +
> int pagecache_write_begin(struct file *file, struct address_space *mapping,
> loff_t pos, unsigned len, unsigned flags,
> struct page **pagep, void **fsdata)
> --
> 2.17.1
>

2019-05-28 16:19:50

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v2 4/8] vfs: add missing checks to copy_file_range

On Sun, May 26, 2019 at 09:10:55AM +0300, Amir Goldstein wrote:
> Like the clone and dedupe interfaces we've recently fixed, the
> copy_file_range() implementation is missing basic sanity, limits and
> boundary condition tests on the parameters that are passed to it
> from userspace. Create a new "generic_copy_file_checks()" function
> modelled on the generic_remap_checks() function to provide this
> missing functionality.
>
> [Amir] Shorten copy length instead of checking pos_in limits
> because input file size already abides by the limits.
>
> Signed-off-by: Dave Chinner <[email protected]>
> Signed-off-by: Amir Goldstein <[email protected]>
> ---
> fs/read_write.c | 3 ++-
> include/linux/fs.h | 3 +++
> mm/filemap.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 58 insertions(+), 1 deletion(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index f1900bdb3127..b0fb1176b628 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1626,7 +1626,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> return -EXDEV;
>
> - ret = generic_file_rw_checks(file_in, file_out);
> + ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
> + flags);
> if (unlikely(ret))
> return ret;
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 89b9b73eb581..e4d382c4342a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3050,6 +3050,9 @@ extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> loff_t *count, unsigned int remap_flags);
> extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
> +extern int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + size_t *count, unsigned int flags);
> extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
> extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
> extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 798aac92cd76..1852fbf08eeb 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3064,6 +3064,59 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
> return 0;
> }
>
> +/*
> + * Performs necessary checks before doing a file copy
> + *
> + * Can adjust amount of bytes to copy

That's the @req_count parameter, correct?

> + * Returns appropriate error code that caller should return or
> + * zero in case the copy should be allowed.
> + */
> +int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + size_t *req_count, unsigned int flags)
> +{
> + struct inode *inode_in = file_inode(file_in);
> + struct inode *inode_out = file_inode(file_out);
> + uint64_t count = *req_count;
> + loff_t size_in;
> + int ret;
> +
> + ret = generic_file_rw_checks(file_in, file_out);
> + if (ret)
> + return ret;
> +
> + /* Don't touch certain kinds of inodes */
> + if (IS_IMMUTABLE(inode_out))
> + return -EPERM;
> +
> + if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
> + return -ETXTBSY;
> +
> + /* Ensure offsets don't wrap. */
> + if (pos_in + count < pos_in || pos_out + count < pos_out)
> + return -EOVERFLOW;
> +
> + /* Shorten the copy to EOF */
> + size_in = i_size_read(inode_in);
> + if (pos_in >= size_in)
> + count = 0;
> + else
> + count = min(count, size_in - (uint64_t)pos_in);

Do we need a call to generic_access_check_limits(file_in...) here to
prevent copies from ranges that the page cache doesn't support?

--D

> + ret = generic_write_check_limits(file_out, pos_out, &count);
> + if (ret)
> + return ret;
> +
> + /* Don't allow overlapped copying within the same file. */
> + if (inode_in == inode_out &&
> + pos_out + count > pos_in &&
> + pos_out < pos_in + count)
> + return -EINVAL;
> +
> + *req_count = count;
> + return 0;
> +}
> +
> int pagecache_write_begin(struct file *file, struct address_space *mapping,
> loff_t pos, unsigned len, unsigned flags,
> struct page **pagep, void **fsdata)
> --
> 2.17.1
>

2019-05-28 16:26:59

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH v2 4/8] vfs: add missing checks to copy_file_range

On Tue, May 28, 2019 at 7:18 PM Darrick J. Wong <[email protected]> wrote:
>
> On Sun, May 26, 2019 at 09:10:55AM +0300, Amir Goldstein wrote:
> > Like the clone and dedupe interfaces we've recently fixed, the
> > copy_file_range() implementation is missing basic sanity, limits and
> > boundary condition tests on the parameters that are passed to it
> > from userspace. Create a new "generic_copy_file_checks()" function
> > modelled on the generic_remap_checks() function to provide this
> > missing functionality.
> >
> > [Amir] Shorten copy length instead of checking pos_in limits
> > because input file size already abides by the limits.
> >
> > Signed-off-by: Dave Chinner <[email protected]>
> > Signed-off-by: Amir Goldstein <[email protected]>
> > ---
> > fs/read_write.c | 3 ++-
> > include/linux/fs.h | 3 +++
> > mm/filemap.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 58 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index f1900bdb3127..b0fb1176b628 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1626,7 +1626,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> > return -EXDEV;
> >
> > - ret = generic_file_rw_checks(file_in, file_out);
> > + ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
> > + flags);
> > if (unlikely(ret))
> > return ret;
> >
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 89b9b73eb581..e4d382c4342a 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -3050,6 +3050,9 @@ extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> > struct file *file_out, loff_t pos_out,
> > loff_t *count, unsigned int remap_flags);
> > extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
> > +extern int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> > + struct file *file_out, loff_t pos_out,
> > + size_t *count, unsigned int flags);
> > extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
> > extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
> > extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 798aac92cd76..1852fbf08eeb 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -3064,6 +3064,59 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
> > return 0;
> > }
> >
> > +/*
> > + * Performs necessary checks before doing a file copy
> > + *
> > + * Can adjust amount of bytes to copy
>
> That's the @req_count parameter, correct?

Correct. Same as generic_remap_checks()

>
> > + * Returns appropriate error code that caller should return or
> > + * zero in case the copy should be allowed.
> > + */
> > +int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> > + struct file *file_out, loff_t pos_out,
> > + size_t *req_count, unsigned int flags)
> > +{
> > + struct inode *inode_in = file_inode(file_in);
> > + struct inode *inode_out = file_inode(file_out);
> > + uint64_t count = *req_count;
> > + loff_t size_in;
> > + int ret;
> > +
> > + ret = generic_file_rw_checks(file_in, file_out);
> > + if (ret)
> > + return ret;
> > +
> > + /* Don't touch certain kinds of inodes */
> > + if (IS_IMMUTABLE(inode_out))
> > + return -EPERM;
> > +
> > + if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
> > + return -ETXTBSY;
> > +
> > + /* Ensure offsets don't wrap. */
> > + if (pos_in + count < pos_in || pos_out + count < pos_out)
> > + return -EOVERFLOW;
> > +
> > + /* Shorten the copy to EOF */
> > + size_in = i_size_read(inode_in);
> > + if (pos_in >= size_in)
> > + count = 0;
> > + else
> > + count = min(count, size_in - (uint64_t)pos_in);
>
> Do we need a call to generic_access_check_limits(file_in...) here to
> prevent copies from ranges that the page cache doesn't support?

No. Because i_size cannot be of an illegal size and we cap
the read to i_size.
I also removed generic_access_check_limits() from generic_remap_checks()
for a similar reason in patch #8.

Thanks,
Amir.

2019-05-28 16:27:33

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v2 5/8] vfs: copy_file_range needs to strip setuid bits

On Sun, May 26, 2019 at 09:10:56AM +0300, Amir Goldstein wrote:
> The file we are copying data into needs to have its setuid bit
> stripped before we start the data copy so that unprivileged users
> can't copy data into executables that are run with root privs.
>
> [Amir] Introduce the helper generic_copy_file_range_prep() modelled
> after generic_remap_file_range_prep(). Helper is called by filesystem
> before the copy_file_range operation and with output inode locked.
>
> For ceph and for default generic_copy_file_range() implementation there
> is no inode lock held throughout the copy operation, so we do best
> effort and remove setuid bit before copy starts. This does not protect
> suid file from changing if suid bit is set after copy started.

Maybe we try to remove suid once more at the end of
generic_copy_file_range if the copy succeeds? It'd still be racy, but I
can't (currently, having drank no coffee yet) think of a general way to
hold both inodes locked during the entire copy operation.

--D

> Signed-off-by: Dave Chinner <[email protected]>
> Signed-off-by: Amir Goldstein <[email protected]>
> ---
> fs/ceph/file.c | 9 +++++++++
> fs/cifs/cifsfs.c | 9 ++++++---
> fs/fuse/file.c | 4 ++++
> fs/nfs/nfs42proc.c | 8 +++++---
> fs/read_write.c | 31 +++++++++++++++++++++++++++++++
> include/linux/fs.h | 2 ++
> 6 files changed, 57 insertions(+), 6 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index e87f7b2023af..54cfc877a6ef 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1947,6 +1947,15 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> goto out;
> }
>
> + /* Should inode lock be held throughout the copy operation? */
> + inode_lock(dst_inode);
> + ret = generic_copy_file_range_prep(src_file, dst_file);
> + inode_unlock(dst_inode);
> + if (ret < 0) {
> + dout("failed to copy from src to dst file (%zd)\n", ret);
> + goto out;
> + }
> +
> /*
> * We need FILE_WR caps for dst_ci and FILE_RD for src_ci as other
> * clients may have dirty data in their caches. And OSDs know nothing
> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> index c65823270313..e103b499aaa8 100644
> --- a/fs/cifs/cifsfs.c
> +++ b/fs/cifs/cifsfs.c
> @@ -1096,6 +1096,10 @@ ssize_t cifs_file_copychunk_range(unsigned int xid,
> goto out;
> }
>
> + rc = -EOPNOTSUPP;
> + if (!target_tcon->ses->server->ops->copychunk_range)
> + goto out;
> +
> /*
> * Note: cifs case is easier than btrfs since server responsible for
> * checks for proper open modes and file type and if it wants
> @@ -1107,11 +1111,10 @@ ssize_t cifs_file_copychunk_range(unsigned int xid,
> /* should we flush first and last page first */
> truncate_inode_pages(&target_inode->i_data, 0);
>
> - if (target_tcon->ses->server->ops->copychunk_range)
> + rc = generic_copy_file_range_prep(src_file, dst_file);
> + if (!rc)
> rc = target_tcon->ses->server->ops->copychunk_range(xid,
> smb_file_src, smb_file_target, off, len, destoff);
> - else
> - rc = -EOPNOTSUPP;
>
> /* force revalidate of size and timestamps of target file now
> * that target is updated on the server
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index e03901ae729b..3531d4a3d9ec 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -3128,6 +3128,10 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
>
> inode_lock(inode_out);
>
> + err = generic_copy_file_range_prep(file_in, file_out);
> + if (err)
> + goto out;
> +
> if (fc->writeback_cache) {
> err = filemap_write_and_wait_range(inode_out->i_mapping,
> pos_out, pos_out + len);
> diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
> index 5196bfa7894d..b387951e1d86 100644
> --- a/fs/nfs/nfs42proc.c
> +++ b/fs/nfs/nfs42proc.c
> @@ -345,9 +345,11 @@ ssize_t nfs42_proc_copy(struct file *src, loff_t pos_src,
>
> do {
> inode_lock(file_inode(dst));
> - err = _nfs42_proc_copy(src, src_lock,
> - dst, dst_lock,
> - &args, &res);
> + err = generic_copy_file_range_prep(src, dst);
> + if (!err)
> + err = _nfs42_proc_copy(src, src_lock,
> + dst, dst_lock,
> + &args, &res);
> inode_unlock(file_inode(dst));
>
> if (err >= 0)
> diff --git a/fs/read_write.c b/fs/read_write.c
> index b0fb1176b628..e16bcafc0da2 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1565,6 +1565,28 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
> }
> #endif
>
> +/*
> + * Prepare inodes for copy from @file_in to @file_out.
> + *
> + * Caller must hold output inode lock.
> + */
> +int generic_copy_file_range_prep(struct file *file_in, struct file *file_out)
> +{
> + int ret;
> +
> + WARN_ON_ONCE(!inode_is_locked(file_inode(file_out)));
> +
> + /*
> + * Clear the security bits if the process is not being run by root.
> + * This keeps people from modifying setuid and setgid binaries.
> + */
> + ret = file_remove_privs(file_out);
> +
> + return ret;
> +
> +}
> +EXPORT_SYMBOL(generic_copy_file_range_prep);
> +
> /**
> * generic_copy_file_range - copy data between two files
> * @file_in: file structure to read from
> @@ -1590,6 +1612,15 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> size_t len, unsigned int flags)
> {
> + int ret;
> +
> + /* Should inode lock be held throughout the copy operation? */
> + inode_lock(file_inode(file_out));
> + ret = generic_copy_file_range_prep(file_in, file_out);
> + inode_unlock(file_inode(file_out));
> + if (ret)
> + return ret;
> +
> return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
> len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
> }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e4d382c4342a..3e03a96d9ab6 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1889,6 +1889,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
> unsigned long, loff_t *, rwf_t);
> extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
> loff_t, size_t, unsigned int);
> +extern int generic_copy_file_range_prep(struct file *file_in,
> + struct file *file_out);
> extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> size_t len, unsigned int flags);
> --
> 2.17.1
>

2019-05-28 16:32:37

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] vfs: copy_file_range should update file timestamps

On Sun, May 26, 2019 at 09:10:57AM +0300, Amir Goldstein wrote:
> From: Dave Chinner <[email protected]>
>
> Timestamps are not updated right now, so programs looking for
> timestamp updates for file modifications (like rsync) will not
> detect that files have changed. We are also accessing the source
> data when doing a copy (but not when cloning) so we need to update
> atime on the source file as well.
>
> Signed-off-by: Dave Chinner <[email protected]>
> Signed-off-by: Amir Goldstein <[email protected]>
> ---
> fs/read_write.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index e16bcafc0da2..4b23a86aacd9 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1576,6 +1576,16 @@ int generic_copy_file_range_prep(struct file *file_in, struct file *file_out)
>
> WARN_ON_ONCE(!inode_is_locked(file_inode(file_out)));
>
> + /* Update source timestamps, because we are accessing file data */
> + file_accessed(file_in);
> +
> + /* Update destination timestamps, since we can alter file contents. */
> + if (!(file_out->f_mode & FMODE_NOCMTIME)) {
> + ret = file_update_time(file_out);
> + if (ret)
> + return ret;
> + }
> +
> /*
> * Clear the security bits if the process is not being run by root.
> * This keeps people from modifying setuid and setgid binaries.

Should the file_update_time and file_remove_privs calls be factored into
a separate small function to be called by generic_{copy,remap}_range_prep?

Looks ok otherwise,
Reviewed-by: Darrick J. Wong <[email protected]>

--D

> --
> 2.17.1
>

2019-05-28 16:32:54

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v2 4/8] vfs: add missing checks to copy_file_range

On Tue, May 28, 2019 at 07:26:29PM +0300, Amir Goldstein wrote:
> On Tue, May 28, 2019 at 7:18 PM Darrick J. Wong <[email protected]> wrote:
> >
> > On Sun, May 26, 2019 at 09:10:55AM +0300, Amir Goldstein wrote:
> > > Like the clone and dedupe interfaces we've recently fixed, the
> > > copy_file_range() implementation is missing basic sanity, limits and
> > > boundary condition tests on the parameters that are passed to it
> > > from userspace. Create a new "generic_copy_file_checks()" function
> > > modelled on the generic_remap_checks() function to provide this
> > > missing functionality.
> > >
> > > [Amir] Shorten copy length instead of checking pos_in limits
> > > because input file size already abides by the limits.
> > >
> > > Signed-off-by: Dave Chinner <[email protected]>
> > > Signed-off-by: Amir Goldstein <[email protected]>
> > > ---
> > > fs/read_write.c | 3 ++-
> > > include/linux/fs.h | 3 +++
> > > mm/filemap.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
> > > 3 files changed, 58 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > index f1900bdb3127..b0fb1176b628 100644
> > > --- a/fs/read_write.c
> > > +++ b/fs/read_write.c
> > > @@ -1626,7 +1626,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > > if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> > > return -EXDEV;
> > >
> > > - ret = generic_file_rw_checks(file_in, file_out);
> > > + ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
> > > + flags);
> > > if (unlikely(ret))
> > > return ret;
> > >
> > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > index 89b9b73eb581..e4d382c4342a 100644
> > > --- a/include/linux/fs.h
> > > +++ b/include/linux/fs.h
> > > @@ -3050,6 +3050,9 @@ extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> > > struct file *file_out, loff_t pos_out,
> > > loff_t *count, unsigned int remap_flags);
> > > extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
> > > +extern int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> > > + struct file *file_out, loff_t pos_out,
> > > + size_t *count, unsigned int flags);
> > > extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
> > > extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
> > > extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index 798aac92cd76..1852fbf08eeb 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -3064,6 +3064,59 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
> > > return 0;
> > > }
> > >
> > > +/*
> > > + * Performs necessary checks before doing a file copy
> > > + *
> > > + * Can adjust amount of bytes to copy
> >
> > That's the @req_count parameter, correct?
>
> Correct. Same as generic_remap_checks()

Ok. Would you mind updating the comment?

> >
> > > + * Returns appropriate error code that caller should return or
> > > + * zero in case the copy should be allowed.
> > > + */
> > > +int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> > > + struct file *file_out, loff_t pos_out,
> > > + size_t *req_count, unsigned int flags)
> > > +{
> > > + struct inode *inode_in = file_inode(file_in);
> > > + struct inode *inode_out = file_inode(file_out);
> > > + uint64_t count = *req_count;
> > > + loff_t size_in;
> > > + int ret;
> > > +
> > > + ret = generic_file_rw_checks(file_in, file_out);
> > > + if (ret)
> > > + return ret;
> > > +
> > > + /* Don't touch certain kinds of inodes */
> > > + if (IS_IMMUTABLE(inode_out))
> > > + return -EPERM;
> > > +
> > > + if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
> > > + return -ETXTBSY;
> > > +
> > > + /* Ensure offsets don't wrap. */
> > > + if (pos_in + count < pos_in || pos_out + count < pos_out)
> > > + return -EOVERFLOW;
> > > +
> > > + /* Shorten the copy to EOF */
> > > + size_in = i_size_read(inode_in);
> > > + if (pos_in >= size_in)
> > > + count = 0;
> > > + else
> > > + count = min(count, size_in - (uint64_t)pos_in);
> >
> > Do we need a call to generic_access_check_limits(file_in...) here to
> > prevent copies from ranges that the page cache doesn't support?
>
> No. Because i_size cannot be of an illegal size and we cap
> the read to i_size.
> I also removed generic_access_check_limits() from generic_remap_checks()
> for a similar reason in patch #8.

<nod>

--D

> Thanks,
> Amir.

2019-05-28 16:37:00

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v2 7/8] vfs: allow copy_file_range to copy across devices

On Sun, May 26, 2019 at 09:10:58AM +0300, Amir Goldstein wrote:
> We want to enable cross-filesystem copy_file_range functionality
> where possible, so push the "same superblock only" checks down to
> the individual filesystem callouts so they can make their own
> decisions about cross-superblock copy offload and fallack to
> generic_copy_file_range() for cross-superblock copy.
>
> [Amir] We do not call ->remap_file_range() in case the inodes are not
> on the same sb and do not call ->copy_file_range() in case the inodes
> are not on the same filesystem type.
>
> This changes behavior of the copy_file_range(2) syscall, which will
> now allow cross filesystem in-kernel copy. CIFS already supports
> cross-superblock copy, between two shares to the same server. This
> functionality will now be available via the copy_file_range(2) syscall.
>
> Cc: Steve French <[email protected]>
> Signed-off-by: Dave Chinner <[email protected]>
> Signed-off-by: Amir Goldstein <[email protected]>
> ---
> fs/ceph/file.c | 4 +++-
> fs/cifs/cifsfs.c | 2 +-
> fs/fuse/file.c | 5 ++++-
> fs/nfs/nfs4file.c | 5 ++++-
> fs/read_write.c | 20 ++++++++++++++------
> 5 files changed, 26 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 54cfc877a6ef..adf99557c46c 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1909,6 +1909,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>
> if (src_inode == dst_inode)
> return -EINVAL;
> + if (src_inode->i_sb != dst_inode->i_sb)
> + return -EXDEV;
> if (ceph_snap(dst_inode) != CEPH_NOSNAP)
> return -EROFS;
>
> @@ -2118,7 +2120,7 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
> ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
> len, flags);
>
> - if (ret == -EOPNOTSUPP)
> + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> ret = generic_copy_file_range(src_file, src_off, dst_file,
> dst_off, len, flags);
> return ret;
> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> index e103b499aaa8..7bde046110ce 100644
> --- a/fs/cifs/cifsfs.c
> +++ b/fs/cifs/cifsfs.c
> @@ -1152,7 +1152,7 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off,
> len, flags);
> free_xid(xid);
>
> - if (rc == -EOPNOTSUPP)
> + if (rc == -EOPNOTSUPP || rc == -EXDEV)
> rc = generic_copy_file_range(src_file, off, dst_file,
> destoff, len, flags);
> return rc;
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 3531d4a3d9ec..180161f6f0bd 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -3126,6 +3126,9 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
> if (fc->no_copy_file_range)
> return -EOPNOTSUPP;
>
> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> + return -EXDEV;
> +
> inode_lock(inode_out);
>
> err = generic_copy_file_range_prep(file_in, file_out);
> @@ -3186,7 +3189,7 @@ static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
> ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
> len, flags);
>
> - if (ret == -EOPNOTSUPP)
> + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> ret = generic_copy_file_range(src_file, src_off, dst_file,
> dst_off, len, flags);
> return ret;
> diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> index 4842f3ab3161..f4157eb1f69d 100644
> --- a/fs/nfs/nfs4file.c
> +++ b/fs/nfs/nfs4file.c
> @@ -133,6 +133,9 @@ static ssize_t __nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> size_t count, unsigned int flags)
> {
> + /* Only offload copy if superblock is the same */
> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> + return -EXDEV;
> if (!nfs_server_capable(file_inode(file_out), NFS_CAP_COPY))
> return -EOPNOTSUPP;
> if (file_inode(file_in) == file_inode(file_out))
> @@ -148,7 +151,7 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
>
> ret = __nfs4_copy_file_range(file_in, pos_in, file_out, pos_out, count,
> flags);
> - if (ret == -EOPNOTSUPP)
> + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> ret = generic_copy_file_range(file_in, pos_in, file_out,
> pos_out, count, flags);
> return ret;
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 4b23a86aacd9..283ec30d2136 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1640,7 +1640,18 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> size_t len, unsigned int flags)
> {
> - if (file_out->f_op->copy_file_range)
> + /*
> + * Although we now allow filesystems to handle cross sb copy, passing
> + * an inode of the wrong filesystem type to filesystem operation can
> + * often result in an attempt to dereference the wrong concrete inode
> + * struct, so avoid doing that until we really have a good reason.
> + * The incentive for passing inode from different sb to filesystem is
> + * NFS cross server copy and for that use case, enforcing same
> + * filesystem type is acceptable.
> + */
> + if (file_out->f_op->copy_file_range &&
> + file_inode(file_in)->i_sb->s_type ==
> + file_inode(file_out)->i_sb->s_type)
> return file_out->f_op->copy_file_range(file_in, pos_in,
> file_out, pos_out,
> len, flags);
> @@ -1663,10 +1674,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> if (flags != 0)
> return -EINVAL;
>
> - /* this could be relaxed once a method supports cross-fs copies */
> - if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> - return -EXDEV;
> -
> ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
> flags);
> if (unlikely(ret))
> @@ -1689,7 +1696,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> * Try cloning first, this is supported by more file systems, and
> * more efficient if both clone and copy are supported (e.g. NFS).
> */
> - if (file_in->f_op->remap_file_range) {
> + if (file_in->f_op->remap_file_range &&
> + file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) {

For the fs/read_write.c bits,
Reviewed-by: Darrick J. Wong <[email protected]>

(I'm not going to pass judgement on whether or not cifs/nfs should be
doing cross-sb copies, as I know nothing about how well that works in
practice.)

--D

> loff_t cloned;
>
> cloned = file_in->f_op->remap_file_range(file_in, pos_in,
> --
> 2.17.1
>

2019-05-28 16:37:43

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH v2 6/8] vfs: copy_file_range should update file timestamps

On Tue, May 28, 2019 at 7:30 PM Darrick J. Wong <[email protected]> wrote:
>
> On Sun, May 26, 2019 at 09:10:57AM +0300, Amir Goldstein wrote:
> > From: Dave Chinner <[email protected]>
> >
> > Timestamps are not updated right now, so programs looking for
> > timestamp updates for file modifications (like rsync) will not
> > detect that files have changed. We are also accessing the source
> > data when doing a copy (but not when cloning) so we need to update
> > atime on the source file as well.
> >
> > Signed-off-by: Dave Chinner <[email protected]>
> > Signed-off-by: Amir Goldstein <[email protected]>
> > ---
> > fs/read_write.c | 10 ++++++++++
> > 1 file changed, 10 insertions(+)
> >
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index e16bcafc0da2..4b23a86aacd9 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1576,6 +1576,16 @@ int generic_copy_file_range_prep(struct file *file_in, struct file *file_out)
> >
> > WARN_ON_ONCE(!inode_is_locked(file_inode(file_out)));
> >
> > + /* Update source timestamps, because we are accessing file data */
> > + file_accessed(file_in);
> > +
> > + /* Update destination timestamps, since we can alter file contents. */
> > + if (!(file_out->f_mode & FMODE_NOCMTIME)) {
> > + ret = file_update_time(file_out);
> > + if (ret)
> > + return ret;
> > + }
> > +
> > /*
> > * Clear the security bits if the process is not being run by root.
> > * This keeps people from modifying setuid and setgid binaries.
>
> Should the file_update_time and file_remove_privs calls be factored into
> a separate small function to be called by generic_{copy,remap}_range_prep?
>

Ok. same helper could be called also after copy when inode is not locked
throughout the copy.

Thanks,
Amir.

2019-05-28 16:39:59

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH v2 4/8] vfs: add missing checks to copy_file_range

On Tue, May 28, 2019 at 7:31 PM Darrick J. Wong <[email protected]> wrote:
>
> On Tue, May 28, 2019 at 07:26:29PM +0300, Amir Goldstein wrote:
> > On Tue, May 28, 2019 at 7:18 PM Darrick J. Wong <[email protected]> wrote:
> > >
> > > On Sun, May 26, 2019 at 09:10:55AM +0300, Amir Goldstein wrote:
> > > > Like the clone and dedupe interfaces we've recently fixed, the
> > > > copy_file_range() implementation is missing basic sanity, limits and
> > > > boundary condition tests on the parameters that are passed to it
> > > > from userspace. Create a new "generic_copy_file_checks()" function
> > > > modelled on the generic_remap_checks() function to provide this
> > > > missing functionality.
> > > >
> > > > [Amir] Shorten copy length instead of checking pos_in limits
> > > > because input file size already abides by the limits.
> > > >
> > > > Signed-off-by: Dave Chinner <[email protected]>
> > > > Signed-off-by: Amir Goldstein <[email protected]>
> > > > ---
> > > > fs/read_write.c | 3 ++-
> > > > include/linux/fs.h | 3 +++
> > > > mm/filemap.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
> > > > 3 files changed, 58 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > index f1900bdb3127..b0fb1176b628 100644
> > > > --- a/fs/read_write.c
> > > > +++ b/fs/read_write.c
> > > > @@ -1626,7 +1626,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > > > if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> > > > return -EXDEV;
> > > >
> > > > - ret = generic_file_rw_checks(file_in, file_out);
> > > > + ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
> > > > + flags);
> > > > if (unlikely(ret))
> > > > return ret;
> > > >
> > > > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > > > index 89b9b73eb581..e4d382c4342a 100644
> > > > --- a/include/linux/fs.h
> > > > +++ b/include/linux/fs.h
> > > > @@ -3050,6 +3050,9 @@ extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> > > > struct file *file_out, loff_t pos_out,
> > > > loff_t *count, unsigned int remap_flags);
> > > > extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
> > > > +extern int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> > > > + struct file *file_out, loff_t pos_out,
> > > > + size_t *count, unsigned int flags);
> > > > extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
> > > > extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
> > > > extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
> > > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > > index 798aac92cd76..1852fbf08eeb 100644
> > > > --- a/mm/filemap.c
> > > > +++ b/mm/filemap.c
> > > > @@ -3064,6 +3064,59 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
> > > > return 0;
> > > > }
> > > >
> > > > +/*
> > > > + * Performs necessary checks before doing a file copy
> > > > + *
> > > > + * Can adjust amount of bytes to copy
> > >
> > > That's the @req_count parameter, correct?
> >
> > Correct. Same as generic_remap_checks()
>
> Ok. Would you mind updating the comment?
>
OK.

Thanks,
Amir.

2019-05-28 16:40:30

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v2 8/8] vfs: remove redundant checks from generic_remap_checks()

On Sun, May 26, 2019 at 09:10:59AM +0300, Amir Goldstein wrote:
> The access limit checks on input file range in generic_remap_checks()
> are redundant because the input file size is guarantied to be within

guaranteed ^^^^^^^^^^

> limits and pos+len are already checked to be within input file size.
>
> Beyond the fact that the check cannot fail, if it would have failed,
> it could return -EFBIG for input file range error. There is no precedent
> for that. -EFBIG is returned in syscalls that would change file length.
>
> Signed-off-by: Amir Goldstein <[email protected]>
> ---
> mm/filemap.c | 4 ----
> 1 file changed, 4 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1852fbf08eeb..7e1aa36d57a2 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3000,10 +3000,6 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
> return -EINVAL;
> count = min(count, size_in - (uint64_t)pos_in);
>
> - ret = generic_access_check_limits(file_in, pos_in, &count);

I suspect you could fold generic_access_check_limits into its only
caller, then...

--D

> - if (ret)
> - return ret;
> -
> ret = generic_write_check_limits(file_out, pos_out, &count);
> if (ret)
> return ret;
> --
> 2.17.1
>

2019-05-28 16:50:00

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH v2 9/8] man-pages: copy_file_range updates

On Sun, May 26, 2019 at 09:11:00AM +0300, Amir Goldstein wrote:
> Update with all the missing errors the syscall can return, the
> behaviour the syscall should have w.r.t. to copies within single
> files, etc.
>
> [Amir] Copying beyond EOF returns zero.
>
> Signed-off-by: Dave Chinner <[email protected]>
> Signed-off-by: Amir Goldstein <[email protected]>
> ---
> man2/copy_file_range.2 | 93 ++++++++++++++++++++++++++++++++++--------
> 1 file changed, 77 insertions(+), 16 deletions(-)
>
> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> index 2438b63c8..fab11f977 100644
> --- a/man2/copy_file_range.2
> +++ b/man2/copy_file_range.2
> @@ -42,9 +42,9 @@ without the additional cost of transferring data from the kernel to user space
> and then back into the kernel.
> It copies up to
> .I len
> -bytes of data from file descriptor
> +bytes of data from the source file descriptor
> .I fd_in
> -to file descriptor
> +to target file descriptor

"to the target file descriptor"

> .IR fd_out ,
> overwriting any data that exists within the requested range of the target file.
> .PP
> @@ -74,6 +74,11 @@ is not changed, but
> .I off_in
> is adjusted appropriately.
> .PP
> +.I fd_in
> +and
> +.I fd_out
> +can refer to the same file. If they refer to the same file, then the source and
> +target ranges are not allowed to overlap.

Please start each sentence on a new line, per mkerrisk rules.

> .PP
> The
> .I flags
> @@ -84,6 +89,11 @@ Upon successful completion,
> .BR copy_file_range ()
> will return the number of bytes copied between files.
> This could be less than the length originally requested.
> +If the file offset of
> +.I fd_in
> +is at or past the end of file, no bytes are copied, and
> +.BR copy_file_range ()
> +returns zero.
> .PP
> On error,
> .BR copy_file_range ()
> @@ -93,12 +103,16 @@ is set to indicate the error.
> .SH ERRORS
> .TP
> .B EBADF
> -One or more file descriptors are not valid; or
> +One or more file descriptors are not valid.
> +.TP
> +.B EBADF
> .I fd_in
> is not open for reading; or
> .I fd_out
> -is not open for writing; or
> -the
> +is not open for writing.
> +.TP
> +.B EBADF
> +The
> .B O_APPEND
> flag is set for the open file description (see
> .BR open (2))
> @@ -106,17 +120,36 @@ referred to by the file descriptor
> .IR fd_out .
> .TP
> .B EFBIG
> -An attempt was made to write a file that exceeds the implementation-defined
> -maximum file size or the process's file size limit,
> -or to write at a position past the maximum allowed offset.
> +An attempt was made to write at a position past the maximum file offset the
> +kernel supports.
> +.TP
> +.B EFBIG
> +An attempt was made to write a range that exceeds the allowed maximum file size.
> +The maximum file size differs between filesystem implemenations and can be

"implementations"

> +different to the maximum allowed file offset.

"...different from the maximum..."

> +.TP
> +.B EFBIG
> +An attempt was made to write beyond the process's file size resource
> +limit. This may also result in the process receiving a
> +.I SIGXFSZ
> +signal.

Start new sentences on a new line, please.

> .TP
> .B EINVAL
> -Requested range extends beyond the end of the source file; or the
> +The
> .I flags
> argument is not 0.
> .TP
> -.B EIO
> -A low-level I/O error occurred while copying.
> +.B EINVAL
> +.I fd_in
> +and
> +.I fd_out
> +refer to the same file and the source and target ranges overlap.
> +.TP
> +.B EINVAL
> +.I fd_in
> +or
> +.I fd_out
> +is not a regular file.

Adding the word "either" at the beginning of the sentence (e.g. "Either
fd_in or fd_out is not a regular file.") would help this flow better.

> .TP
> .B EISDIR
> .I fd_in
> @@ -124,22 +157,50 @@ or
> .I fd_out
> refers to a directory.
> .TP
> +.B EOVERFLOW
> +The requested source or destination range is too large to represent in the
> +specified data types.
> +.TP
> +.B EIO
> +A low-level I/O error occurred while copying.
> +.TP
> .B ENOMEM
> Out of memory.
> .TP
> -.B ENOSPC
> -There is not enough space on the target filesystem to complete the copy.
> -.TP
> .B EXDEV
> The files referred to by
> .IR file_in " and " file_out
> -are not on the same mounted filesystem.
> +are not on the same mounted filesystem (pre Linux 5.3).
> +.TP
> +.B ENOSPC
> +There is not enough space on the target filesystem to complete the copy.

Why move this?

> +.TP
> +.B TXTBSY
> +.I fd_in
> +or
> +.I fd_out
> +refers to an active swap file.

"Either fd_in or fd_out refers to..."

> +.TP
> +.B EPERM
> +.I fd_out
> +refers to an immutable file.
> +.TP
> +.B EACCES
> +The user does not have write permissions for the destination file.
> .SH VERSIONS
> The
> .BR copy_file_range ()
> system call first appeared in Linux 4.5, but glibc 2.27 provides a user-space
> emulation when it is not available.
> .\" https://sourceware.org/git/?p=glibc.git;a=commit;f=posix/unistd.h;h=bad7a0c81f501fbbcc79af9eaa4b8254441c4a1f
> +.PP
> +A major rework of the kernel implementation occurred in 5.3. Areas of the API
> +that weren't clearly defined were clarified and the API bounds are much more
> +strictly checked than on earlier kernels. Applications should target the
> +behaviour and requirements of 5.3 kernels.

Are there any weird cases where a program targetting 5.3 behavior would
fail or get stuck in an infinite loop on a 5.2 kernel?

Particularly since glibc spat out a copy_file_range fallback for 2.29
that tries to emulate the kernel behavior 100%. It even refuses
cross-filesystem copies (because hey, we documented that :() even though
that's perfectly fine for a userspace implementation.

TBH I suspect that we ought to get the glibc developers to remove the
"no cross device copies" code from their implementation and then update
the manpage to say that cross device copies are supposed to be
supported all the time, at least as of glibc 2.(futureversion).

Anyways, thanks for taking on the c_f_r cleanup! :)

--D

> +.PP
> +First support for cross-filesystem copies was introduced in Linux 5.3. Older
> +kernels will return -EXDEV when cross-filesystem copies are attempted.
> .SH CONFORMING TO
> The
> .BR copy_file_range ()
> @@ -224,7 +285,7 @@ main(int argc, char **argv)
> }
>
> len \-= ret;
> - } while (len > 0);
> + } while (len > 0 && ret > 0);
>
> close(fd_in);
> close(fd_out);
> --
> 2.17.1
>

2019-05-29 16:22:52

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH v2 9/8] man-pages: copy_file_range updates

> > +A major rework of the kernel implementation occurred in 5.3. Areas of the API
> > +that weren't clearly defined were clarified and the API bounds are much more
> > +strictly checked than on earlier kernels. Applications should target the
> > +behaviour and requirements of 5.3 kernels.
>
> Are there any weird cases where a program targetting 5.3 behavior would
> fail or get stuck in an infinite loop on a 5.2 kernel?

I don't think so. When Dave wrote this paragraph the behavior was changed
from short copy to EINVAL. That would have been a problem to maintain
old vs. new copy loops, but now the behavior did not change in that respect.

>
> Particularly since glibc spat out a copy_file_range fallback for 2.29
> that tries to emulate the kernel behavior 100%. It even refuses
> cross-filesystem copies (because hey, we documented that :() even though
> that's perfectly fine for a userspace implementation.
>
> TBH I suspect that we ought to get the glibc developers to remove the
> "no cross device copies" code from their implementation and then update
> the manpage to say that cross device copies are supposed to be
> supported all the time, at least as of glibc 2.(futureversion).

I don't see a problem with copy_file_range() returning EXDEV.
That is why I left EXDEV in the man page.
Tools should know how to deal with EXDEV by now.
If you are running on a new kernel, you get better likelihood
for copy_file_range() to do clone or in-kernel copy for you.

>
> Anyways, thanks for taking on the c_f_r cleanup! :)
>

Sure, get ready for another round ;-)

Thanks for the review!
Amir.

2019-05-31 19:35:25

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH v2 5/8] vfs: copy_file_range needs to strip setuid bits

On Sun, May 26, 2019 at 09:10:56AM +0300, Amir Goldstein wrote:
> The file we are copying data into needs to have its setuid bit
> stripped before we start the data copy so that unprivileged users
> can't copy data into executables that are run with root privs.
>
> [Amir] Introduce the helper generic_copy_file_range_prep() modelled
> after generic_remap_file_range_prep(). Helper is called by filesystem
> before the copy_file_range operation and with output inode locked.
>
> For ceph and for default generic_copy_file_range() implementation there
> is no inode lock held throughout the copy operation, so we do best
> effort and remove setuid bit before copy starts. This does not protect
> suid file from changing if suid bit is set after copy started.

I'm not sure what it would accomplish to make setuid-clearing atomic
with the write.

If an attacker could write concurrently with your setting the setuid
bit, then they could probably also perform the write just before you set
the setuid bit.

I think clearing it at the start is all that's necessary, unless I'm
missing something.

--b.


>
> Signed-off-by: Dave Chinner <[email protected]>
> Signed-off-by: Amir Goldstein <[email protected]>
> ---
> fs/ceph/file.c | 9 +++++++++
> fs/cifs/cifsfs.c | 9 ++++++---
> fs/fuse/file.c | 4 ++++
> fs/nfs/nfs42proc.c | 8 +++++---
> fs/read_write.c | 31 +++++++++++++++++++++++++++++++
> include/linux/fs.h | 2 ++
> 6 files changed, 57 insertions(+), 6 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index e87f7b2023af..54cfc877a6ef 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1947,6 +1947,15 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> goto out;
> }
>
> + /* Should inode lock be held throughout the copy operation? */
> + inode_lock(dst_inode);
> + ret = generic_copy_file_range_prep(src_file, dst_file);
> + inode_unlock(dst_inode);
> + if (ret < 0) {
> + dout("failed to copy from src to dst file (%zd)\n", ret);
> + goto out;
> + }
> +
> /*
> * We need FILE_WR caps for dst_ci and FILE_RD for src_ci as other
> * clients may have dirty data in their caches. And OSDs know nothing
> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> index c65823270313..e103b499aaa8 100644
> --- a/fs/cifs/cifsfs.c
> +++ b/fs/cifs/cifsfs.c
> @@ -1096,6 +1096,10 @@ ssize_t cifs_file_copychunk_range(unsigned int xid,
> goto out;
> }
>
> + rc = -EOPNOTSUPP;
> + if (!target_tcon->ses->server->ops->copychunk_range)
> + goto out;
> +
> /*
> * Note: cifs case is easier than btrfs since server responsible for
> * checks for proper open modes and file type and if it wants
> @@ -1107,11 +1111,10 @@ ssize_t cifs_file_copychunk_range(unsigned int xid,
> /* should we flush first and last page first */
> truncate_inode_pages(&target_inode->i_data, 0);
>
> - if (target_tcon->ses->server->ops->copychunk_range)
> + rc = generic_copy_file_range_prep(src_file, dst_file);
> + if (!rc)
> rc = target_tcon->ses->server->ops->copychunk_range(xid,
> smb_file_src, smb_file_target, off, len, destoff);
> - else
> - rc = -EOPNOTSUPP;
>
> /* force revalidate of size and timestamps of target file now
> * that target is updated on the server
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index e03901ae729b..3531d4a3d9ec 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -3128,6 +3128,10 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
>
> inode_lock(inode_out);
>
> + err = generic_copy_file_range_prep(file_in, file_out);
> + if (err)
> + goto out;
> +
> if (fc->writeback_cache) {
> err = filemap_write_and_wait_range(inode_out->i_mapping,
> pos_out, pos_out + len);
> diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
> index 5196bfa7894d..b387951e1d86 100644
> --- a/fs/nfs/nfs42proc.c
> +++ b/fs/nfs/nfs42proc.c
> @@ -345,9 +345,11 @@ ssize_t nfs42_proc_copy(struct file *src, loff_t pos_src,
>
> do {
> inode_lock(file_inode(dst));
> - err = _nfs42_proc_copy(src, src_lock,
> - dst, dst_lock,
> - &args, &res);
> + err = generic_copy_file_range_prep(src, dst);
> + if (!err)
> + err = _nfs42_proc_copy(src, src_lock,
> + dst, dst_lock,
> + &args, &res);
> inode_unlock(file_inode(dst));
>
> if (err >= 0)
> diff --git a/fs/read_write.c b/fs/read_write.c
> index b0fb1176b628..e16bcafc0da2 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1565,6 +1565,28 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
> }
> #endif
>
> +/*
> + * Prepare inodes for copy from @file_in to @file_out.
> + *
> + * Caller must hold output inode lock.
> + */
> +int generic_copy_file_range_prep(struct file *file_in, struct file *file_out)
> +{
> + int ret;
> +
> + WARN_ON_ONCE(!inode_is_locked(file_inode(file_out)));
> +
> + /*
> + * Clear the security bits if the process is not being run by root.
> + * This keeps people from modifying setuid and setgid binaries.
> + */
> + ret = file_remove_privs(file_out);
> +
> + return ret;
> +
> +}
> +EXPORT_SYMBOL(generic_copy_file_range_prep);
> +
> /**
> * generic_copy_file_range - copy data between two files
> * @file_in: file structure to read from
> @@ -1590,6 +1612,15 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> size_t len, unsigned int flags)
> {
> + int ret;
> +
> + /* Should inode lock be held throughout the copy operation? */
> + inode_lock(file_inode(file_out));
> + ret = generic_copy_file_range_prep(file_in, file_out);
> + inode_unlock(file_inode(file_out));
> + if (ret)
> + return ret;
> +
> return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
> len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
> }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e4d382c4342a..3e03a96d9ab6 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1889,6 +1889,8 @@ extern ssize_t vfs_readv(struct file *, const struct iovec __user *,
> unsigned long, loff_t *, rwf_t);
> extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
> loff_t, size_t, unsigned int);
> +extern int generic_copy_file_range_prep(struct file *file_in,
> + struct file *file_out);
> extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> size_t len, unsigned int flags);
> --
> 2.17.1