Hi Darrick,
Following is a re-work of Dave Chinner's copy_file_range() patches.
This v3 patch set is based on your feedback to v2 [1].
NOTE that this work changes user-visible behavior of copy_file_range(2)!
It introduces new errors for cases that were not checked before and it
allows cross-device copy by default. After this work, cifs copy offload
should be possible between two shares on the same server, but I did not
test this functionality.
Patches 1-3 have your Reviewed-by.
Patches 4-5 have been slightly amended to address your comments.
Patch 6 adds the new helper you requested dubbed file_modified().
Patch 7 uses the helper in xfs - unrelated to copy_file_range().
Patches 8-12 use the helper for various fs's ->copy_file_range().
Patch 13 (unmodified) has your Reviewed-by, because the bits that
you approved are those that matter to most filesystems (i.e. the
fallback logic).
The man page update patch (again, mostly Dave's work) is appended
to the series with fixes to your review comments.
Thanks,
Amir.
[1] https://lore.kernel.org/linux-fsdevel/[email protected]/
Changes since v2:
- Re-order generic_remap_checks() fix patch before
forking generic_copy_file_checks()
- Document @req_count helper argument (Darrick)
- Fold generic_access_check_limits() (Darrick)
- Added file_modified() helper (Darrick)
- Added xfs patch to use file_modified() helper
- Drop generic_copy_file_range_prep() helper
- Per filesystem patch for file_modified()/file_accessed()
- Post copy file_remove_privs() for ceph/generic (Darrick)
Changes since v1:
- Short read instead of EINVAL (Christoph)
- generic_file_rw_checks() helper (Darrick)
- generic_copy_file_range_prep() helper (Christoph)
- Not calling ->remap_file_range() with different sb
- Not calling ->copy_file_range() with different fs type
- Remove changes to overlayfs
- Extra fix to clone/dedupe checks
Amir Goldstein (11):
vfs: introduce generic_file_rw_checks()
vfs: remove redundant checks from generic_remap_checks()
vfs: add missing checks to copy_file_range
vfs: introduce file_modified() helper
xfs: use file_modified() helper
vfs: copy_file_range needs to strip setuid bits and update timestamps
ceph: copy_file_range needs to strip setuid bits and update timestamps
cifs: copy_file_range needs to strip setuid bits and update timestamps
fuse: copy_file_range needs to strip setuid bits and update timestamps
nfs: copy_file_range needs to strip setuid bits and update timestamps
vfs: allow copy_file_range to copy across devices
Dave Chinner (2):
vfs: introduce generic_copy_file_range()
vfs: no fallback for ->copy_file_range
fs/ceph/file.c | 40 ++++++++++++-
fs/cifs/cifsfs.c | 15 ++++-
fs/fuse/file.c | 29 ++++++++-
fs/inode.c | 20 +++++++
fs/nfs/nfs42proc.c | 9 ++-
fs/nfs/nfs4file.c | 23 ++++++-
fs/read_write.c | 145 ++++++++++++++++++++++++++++-----------------
fs/xfs/xfs_file.c | 15 +----
include/linux/fs.h | 9 +++
mm/filemap.c | 110 +++++++++++++++++++++++++++-------
10 files changed, 309 insertions(+), 106 deletions(-)
--
2.17.1
From: Dave Chinner <[email protected]>
Now that we have generic_copy_file_range(), remove it as a fallback
case when offloads fail. This puts the responsibility for executing
fallbacks on the filesystems that implement ->copy_file_range and
allows us to add operational validity checks to
generic_copy_file_range().
Rework vfs_copy_file_range() to call a new do_copy_file_range()
helper to execute the copying callout, and move calls to
generic_copy_file_range() into filesystem methods where they
currently return failures.
[Amir] overlayfs is not responsible for executing the fallback.
It is the responsibility of the underlying filesystem.
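
For readers following along outside the kernel tree, the wrapper pattern
this patch adds to each filesystem can be sketched in userspace roughly
as below. This is only an illustration under stated assumptions: the
copy_fn type, the offload_* stubs, generic_copy() and
fs_copy_file_range() are hypothetical names, not kernel symbols, and the
real methods take file and offset arguments that are left out here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for a copy routine (offload or generic). */
typedef ssize_t (*copy_fn)(size_t len);

/* Stand-in for an offload that the server does not support. */
static ssize_t offload_unsupported(size_t len)
{
	(void)len;
	return -EOPNOTSUPP;
}

/* Stand-in for an offload that fails with a real error. */
static ssize_t offload_io_error(size_t len)
{
	(void)len;
	return -EIO;
}

/* Stand-in for the generic copy: pretend every byte is copied. */
static ssize_t generic_copy(size_t len)
{
	return (ssize_t)len;
}

/*
 * Models the per-filesystem wrapper: try the offload first and fall
 * back to the generic copy only when the offload reports -EOPNOTSUPP.
 * Any other error is passed back to the caller unchanged.
 */
static ssize_t fs_copy_file_range(copy_fn offload, copy_fn fallback,
				  size_t len)
{
	ssize_t ret;

	assert(offload != NULL && fallback != NULL);

	ret = offload(len);
	if (ret == -EOPNOTSUPP)
		ret = fallback(len);
	return ret;
}
```

In this model, an unsupported offload ends up in generic_copy(), while
an -EIO from the offload is returned as is, which is the split of
responsibility the commit message describes.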
Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
---
fs/ceph/file.c | 21 ++++++++++++++++++---
fs/cifs/cifsfs.c | 4 ++++
fs/fuse/file.c | 21 ++++++++++++++++++---
fs/nfs/nfs4file.c | 20 +++++++++++++++++---
fs/read_write.c | 25 ++++++++++++++++---------
5 files changed, 73 insertions(+), 18 deletions(-)
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 305daf043eb0..e87f7b2023af 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1889,9 +1889,9 @@ static int is_file_size_ok(struct inode *src_inode, struct inode *dst_inode,
return 0;
}
-static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
- struct file *dst_file, loff_t dst_off,
- size_t len, unsigned int flags)
+static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
+ struct file *dst_file, loff_t dst_off,
+ size_t len, unsigned int flags)
{
struct inode *src_inode = file_inode(src_file);
struct inode *dst_inode = file_inode(dst_file);
@@ -2100,6 +2100,21 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
return ret;
}
+static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
+ struct file *dst_file, loff_t dst_off,
+ size_t len, unsigned int flags)
+{
+ ssize_t ret;
+
+ ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
+ len, flags);
+
+ if (ret == -EOPNOTSUPP)
+ ret = generic_copy_file_range(src_file, src_off, dst_file,
+ dst_off, len, flags);
+ return ret;
+}
+
const struct file_operations ceph_file_fops = {
.open = ceph_open,
.release = ceph_release,
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index f5fcd6360056..c65823270313 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -1148,6 +1148,10 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off,
rc = cifs_file_copychunk_range(xid, src_file, off, dst_file, destoff,
len, flags);
free_xid(xid);
+
+ if (rc == -EOPNOTSUPP)
+ rc = generic_copy_file_range(src_file, off, dst_file,
+ destoff, len, flags);
return rc;
}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 3959f08279e6..e03901ae729b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3097,9 +3097,9 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
return err;
}
-static ssize_t fuse_copy_file_range(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- size_t len, unsigned int flags)
+static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, unsigned int flags)
{
struct fuse_file *ff_in = file_in->private_data;
struct fuse_file *ff_out = file_out->private_data;
@@ -3173,6 +3173,21 @@ static ssize_t fuse_copy_file_range(struct file *file_in, loff_t pos_in,
return err;
}
+static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
+ struct file *dst_file, loff_t dst_off,
+ size_t len, unsigned int flags)
+{
+ ssize_t ret;
+
+ ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
+ len, flags);
+
+ if (ret == -EOPNOTSUPP)
+ ret = generic_copy_file_range(src_file, src_off, dst_file,
+ dst_off, len, flags);
+ return ret;
+}
+
static const struct file_operations fuse_file_operations = {
.llseek = fuse_file_llseek,
.read_iter = fuse_file_read_iter,
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index cf42a8b939e3..4842f3ab3161 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -129,9 +129,9 @@ nfs4_file_flush(struct file *file, fl_owner_t id)
}
#ifdef CONFIG_NFS_V4_2
-static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- size_t count, unsigned int flags)
+static ssize_t __nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t count, unsigned int flags)
{
if (!nfs_server_capable(file_inode(file_out), NFS_CAP_COPY))
return -EOPNOTSUPP;
@@ -140,6 +140,20 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
return nfs42_proc_copy(file_in, pos_in, file_out, pos_out, count);
}
+static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t count, unsigned int flags)
+{
+ ssize_t ret;
+
+ ret = __nfs4_copy_file_range(file_in, pos_in, file_out, pos_out, count,
+ flags);
+ if (ret == -EOPNOTSUPP)
+ ret = generic_copy_file_range(file_in, pos_in, file_out,
+ pos_out, count, flags);
+ return ret;
+}
+
static loff_t nfs4_file_llseek(struct file *filep, loff_t offset, int whence)
{
loff_t ret;
diff --git a/fs/read_write.c b/fs/read_write.c
index 676b02fae589..b63dcb4e4fe9 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1595,6 +1595,19 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
}
EXPORT_SYMBOL(generic_copy_file_range);
+static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, unsigned int flags)
+{
+ if (file_out->f_op->copy_file_range)
+ return file_out->f_op->copy_file_range(file_in, pos_in,
+ file_out, pos_out,
+ len, flags);
+
+ return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
+ flags);
+}
+
/*
* copy_file_range() differs from regular file read and write in that it
* specifically allows return partial success. When it does so is up to
@@ -1655,15 +1668,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
}
}
- if (file_out->f_op->copy_file_range) {
- ret = file_out->f_op->copy_file_range(file_in, pos_in, file_out,
- pos_out, len, flags);
- if (ret != -EOPNOTSUPP)
- goto done;
- }
-
- ret = generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
- flags);
+ ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len,
+ flags);
+ WARN_ON_ONCE(ret == -EOPNOTSUPP);
done:
if (ret > 0) {
fsnotify_access(file_in);
--
2.17.1
Factor out a helper with the checks on the in/out files that are
common to clone_file_range and copy_file_range.
Suggested-by: Darrick J. Wong <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
---
fs/read_write.c | 38 +++++++++++---------------------------
include/linux/fs.h | 1 +
mm/filemap.c | 24 ++++++++++++++++++++++++
3 files changed, 36 insertions(+), 27 deletions(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index b63dcb4e4fe9..f1900bdb3127 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1617,17 +1617,18 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t len, unsigned int flags)
{
- struct inode *inode_in = file_inode(file_in);
- struct inode *inode_out = file_inode(file_out);
ssize_t ret;
if (flags != 0)
return -EINVAL;
- if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
- return -EISDIR;
- if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
- return -EINVAL;
+ /* this could be relaxed once a method supports cross-fs copies */
+ if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
+ return -EXDEV;
+
+ ret = generic_file_rw_checks(file_in, file_out);
+ if (unlikely(ret))
+ return ret;
ret = rw_verify_area(READ, file_in, &pos_in, len);
if (unlikely(ret))
@@ -1637,15 +1638,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (unlikely(ret))
return ret;
- if (!(file_in->f_mode & FMODE_READ) ||
- !(file_out->f_mode & FMODE_WRITE) ||
- (file_out->f_flags & O_APPEND))
- return -EBADF;
-
- /* this could be relaxed once a method supports cross-fs copies */
- if (inode_in->i_sb != inode_out->i_sb)
- return -EXDEV;
-
if (len == 0)
return 0;
@@ -2013,29 +2005,21 @@ loff_t do_clone_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t len, unsigned int remap_flags)
{
- struct inode *inode_in = file_inode(file_in);
- struct inode *inode_out = file_inode(file_out);
loff_t ret;
WARN_ON_ONCE(remap_flags & REMAP_FILE_DEDUP);
- if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
- return -EISDIR;
- if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
- return -EINVAL;
-
/*
* FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
* the same mount. Practically, they only need to be on the same file
* system.
*/
- if (inode_in->i_sb != inode_out->i_sb)
+ if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
return -EXDEV;
- if (!(file_in->f_mode & FMODE_READ) ||
- !(file_out->f_mode & FMODE_WRITE) ||
- (file_out->f_flags & O_APPEND))
- return -EBADF;
+ ret = generic_file_rw_checks(file_in, file_out);
+ if (ret < 0)
+ return ret;
if (!file_in->f_op->remap_file_range)
return -EOPNOTSUPP;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index ea17858310ff..89b9b73eb581 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3049,6 +3049,7 @@ extern ssize_t generic_write_checks(struct kiocb *, struct iov_iter *);
extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t *count, unsigned int remap_flags);
+extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
diff --git a/mm/filemap.c b/mm/filemap.c
index df2006ba0cfa..a38619a4a6af 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3041,6 +3041,30 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
return 0;
}
+
+/*
+ * Performs common checks before doing a file copy/clone
+ * from @file_in to @file_out.
+ */
+int generic_file_rw_checks(struct file *file_in, struct file *file_out)
+{
+ struct inode *inode_in = file_inode(file_in);
+ struct inode *inode_out = file_inode(file_out);
+
+ /* Don't copy dirs, pipes, sockets... */
+ if (S_ISDIR(inode_in->i_mode) || S_ISDIR(inode_out->i_mode))
+ return -EISDIR;
+ if (!S_ISREG(inode_in->i_mode) || !S_ISREG(inode_out->i_mode))
+ return -EINVAL;
+
+ if (!(file_in->f_mode & FMODE_READ) ||
+ !(file_out->f_mode & FMODE_WRITE) ||
+ (file_out->f_flags & O_APPEND))
+ return -EBADF;
+
+ return 0;
+}
+
int pagecache_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
--
2.17.1
Because generic_copy_file_range doesn't hold the destination inode lock
throughout the copy, strip setuid bits before and after copy.
The destination inode mtime is updated before and after the copy and the
source inode atime is updated after the copy, similar to
generic_file_read_iter().
Signed-off-by: Amir Goldstein <[email protected]>
---
fs/read_write.c | 23 +++++++++++++++++++++--
1 file changed, 21 insertions(+), 2 deletions(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index cec7e7b1f693..706ea5f276a7 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1590,8 +1590,27 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t len, unsigned int flags)
{
- return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
- len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
+ struct inode *inode_out = file_inode(file_out);
+ int ret, err;
+
+ /* Should inode_out lock be held throughout the copy operation? */
+ inode_lock(inode_out);
+ err = file_modified(file_out);
+ inode_unlock(inode_out);
+ if (err)
+ return err;
+
+ ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out,
+ len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
+
+ file_accessed(file_in);
+
+ /* To be on the safe side, remove privs also after copy */
+ inode_lock(inode_out);
+ err = file_modified(file_out);
+ inode_unlock(inode_out);
+
+ return err ?: ret;
}
EXPORT_SYMBOL(generic_copy_file_range);
--
2.17.1
cifs has both source and destination inodes locked throughout the copy.
Like ->write_iter(), we update mtime and strip setuid bits of destination
file before copy and like ->read_iter(), we update atime of source file
after copy.
Signed-off-by: Amir Goldstein <[email protected]>
---
fs/cifs/cifsfs.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index c65823270313..ab6c5c24146d 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -1096,6 +1096,10 @@ ssize_t cifs_file_copychunk_range(unsigned int xid,
goto out;
}
+ rc = -EOPNOTSUPP;
+ if (!target_tcon->ses->server->ops->copychunk_range)
+ goto out;
+
/*
* Note: cifs case is easier than btrfs since server responsible for
* checks for proper open modes and file type and if it wants
@@ -1107,11 +1111,12 @@ ssize_t cifs_file_copychunk_range(unsigned int xid,
/* should we flush first and last page first */
truncate_inode_pages(&target_inode->i_data, 0);
- if (target_tcon->ses->server->ops->copychunk_range)
+ rc = file_modified(dst_file);
+ if (!rc)
rc = target_tcon->ses->server->ops->copychunk_range(xid,
smb_file_src, smb_file_target, off, len, destoff);
- else
- rc = -EOPNOTSUPP;
+
+ file_accessed(src_file);
/* force revalidate of size and timestamps of target file now
* that target is updated on the server
--
2.17.1
Like ->write_iter(), we update mtime and strip setuid of dst file before
copy and like ->read_iter(), we update atime of src file after copy.
Signed-off-by: Amir Goldstein <[email protected]>
---
fs/nfs/nfs42proc.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
index 5196bfa7894d..c37a8e5116c6 100644
--- a/fs/nfs/nfs42proc.c
+++ b/fs/nfs/nfs42proc.c
@@ -345,10 +345,13 @@ ssize_t nfs42_proc_copy(struct file *src, loff_t pos_src,
do {
inode_lock(file_inode(dst));
- err = _nfs42_proc_copy(src, src_lock,
- dst, dst_lock,
- &args, &res);
+ err = file_modified(dst);
+ if (!err)
+ err = _nfs42_proc_copy(src, src_lock,
+ dst, dst_lock,
+ &args, &res);
inode_unlock(file_inode(dst));
+ file_accessed(src);
if (err >= 0)
break;
--
2.17.1
We want to enable cross-filesystem copy_file_range functionality
where possible, so push the "same superblock only" checks down to
the individual filesystem callouts so they can make their own
decisions about cross-superblock copy offload and fall back to
generic_copy_file_range() for cross-superblock copy.
[Amir] We do not call ->remap_file_range() in case the inodes are not
on the same sb and do not call ->copy_file_range() in case the inodes
are not on the same filesystem type.
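
The resulting choice of the first copy path to try can be modelled in
userspace along these lines. This is a simplified sketch: struct
model_sb, enum copy_path and first_copy_path() are hypothetical names,
pointer equality stands in for "same superblock", and the model ignores
that the kernel moves on to the next path when a clone attempt fails.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a superblock; fs_type models s_type. */
struct model_sb {
	int fs_type;
};

enum copy_path { PATH_REMAP, PATH_FS_METHOD, PATH_GENERIC };

/*
 * Picks the first path that would be attempted:
 *  - clone (remap) only when both files are on the same superblock,
 *  - the filesystem's copy method only when the filesystem types match,
 *  - the generic in-kernel copy otherwise.
 */
static enum copy_path first_copy_path(const struct model_sb *sb_in,
				      const struct model_sb *sb_out,
				      bool has_remap, bool has_copy_method)
{
	assert(sb_in != NULL && sb_out != NULL);

	if (has_remap && sb_in == sb_out)
		return PATH_REMAP;
	if (has_copy_method && sb_in->fs_type == sb_out->fs_type)
		return PATH_FS_METHOD;
	return PATH_GENERIC;
}
```

Two superblocks of the same type (for instance two shares) reach the
filesystem method, which then decides for itself whether to offload or
to return -EXDEV and take the generic fallback.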
This changes behavior of the copy_file_range(2) syscall, which will
now allow cross filesystem in-kernel copy. CIFS already supports
cross-superblock copy, between two shares to the same server. This
functionality will now be available via the copy_file_range(2) syscall.
Cc: Steve French <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
---
fs/ceph/file.c | 4 +++-
fs/cifs/cifsfs.c | 2 +-
fs/fuse/file.c | 5 ++++-
fs/nfs/nfs4file.c | 5 ++++-
fs/read_write.c | 20 ++++++++++++++------
5 files changed, 26 insertions(+), 10 deletions(-)
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 8a70708e1aca..e9614d686301 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1909,6 +1909,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
if (src_inode == dst_inode)
return -EINVAL;
+ if (src_inode->i_sb != dst_inode->i_sb)
+ return -EXDEV;
if (ceph_snap(dst_inode) != CEPH_NOSNAP)
return -EROFS;
@@ -2126,7 +2128,7 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
len, flags);
- if (ret == -EOPNOTSUPP)
+ if (ret == -EOPNOTSUPP || ret == -EXDEV)
ret = generic_copy_file_range(src_file, src_off, dst_file,
dst_off, len, flags);
return ret;
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index ab6c5c24146d..83956452c108 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -1154,7 +1154,7 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off,
len, flags);
free_xid(xid);
- if (rc == -EOPNOTSUPP)
+ if (rc == -EOPNOTSUPP || rc == -EXDEV)
rc = generic_copy_file_range(src_file, off, dst_file,
destoff, len, flags);
return rc;
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 7f33d68f66d9..eab00cd089e8 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3126,6 +3126,9 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
if (fc->no_copy_file_range)
return -EOPNOTSUPP;
+ if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
+ return -EXDEV;
+
inode_lock(inode_out);
err = file_modified(file_out);
@@ -3187,7 +3190,7 @@ static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
len, flags);
- if (ret == -EOPNOTSUPP)
+ if (ret == -EOPNOTSUPP || ret == -EXDEV)
ret = generic_copy_file_range(src_file, src_off, dst_file,
dst_off, len, flags);
return ret;
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 4842f3ab3161..f4157eb1f69d 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -133,6 +133,9 @@ static ssize_t __nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t count, unsigned int flags)
{
+ /* Only offload copy if superblock is the same */
+ if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
+ return -EXDEV;
if (!nfs_server_capable(file_inode(file_out), NFS_CAP_COPY))
return -EOPNOTSUPP;
if (file_inode(file_in) == file_inode(file_out))
@@ -148,7 +151,7 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
ret = __nfs4_copy_file_range(file_in, pos_in, file_out, pos_out, count,
flags);
- if (ret == -EOPNOTSUPP)
+ if (ret == -EOPNOTSUPP || ret == -EXDEV)
ret = generic_copy_file_range(file_in, pos_in, file_out,
pos_out, count, flags);
return ret;
diff --git a/fs/read_write.c b/fs/read_write.c
index 706ea5f276a7..d8930bb735cb 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1618,7 +1618,18 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t len, unsigned int flags)
{
- if (file_out->f_op->copy_file_range)
+ /*
+ * Although we now allow filesystems to handle cross sb copy, passing
+ * an inode of the wrong filesystem type to filesystem operation can
+ * often result in an attempt to dereference the wrong concrete inode
+ * struct, so avoid doing that until we really have a good reason.
+ * The incentive for passing inode from different sb to filesystem is
+ * NFS cross server copy and for that use case, enforcing same
+ * filesystem type is acceptable.
+ */
+ if (file_out->f_op->copy_file_range &&
+ file_inode(file_in)->i_sb->s_type ==
+ file_inode(file_out)->i_sb->s_type)
return file_out->f_op->copy_file_range(file_in, pos_in,
file_out, pos_out,
len, flags);
@@ -1641,10 +1652,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (flags != 0)
return -EINVAL;
- /* this could be relaxed once a method supports cross-fs copies */
- if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
- return -EXDEV;
-
ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
flags);
if (unlikely(ret))
@@ -1667,7 +1674,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
* Try cloning first, this is supported by more file systems, and
* more efficient if both clone and copy are supported (e.g. NFS).
*/
- if (file_in->f_op->remap_file_range) {
+ if (file_in->f_op->remap_file_range &&
+ file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) {
loff_t cloned;
cloned = file_in->f_op->remap_file_range(file_in, pos_in,
--
2.17.1
Like ->write_iter(), we update mtime and strip setuid of dst file before
copy and like ->read_iter(), we update atime of src file after copy.
Signed-off-by: Amir Goldstein <[email protected]>
---
fs/fuse/file.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index e03901ae729b..7f33d68f66d9 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3128,6 +3128,10 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
inode_lock(inode_out);
+ err = file_modified(file_out);
+ if (err)
+ goto out;
+
if (fc->writeback_cache) {
err = filemap_write_and_wait_range(inode_out->i_mapping,
pos_out, pos_out + len);
@@ -3169,6 +3173,7 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
clear_bit(FUSE_I_SIZE_UNSTABLE, &fi_out->state);
inode_unlock(inode_out);
+ file_accessed(file_in);
return err;
}
--
2.17.1
Update with all the missing errors the syscall can return, the
behaviour the syscall should have w.r.t. copies within single
files, etc.
[Amir] Copying beyond EOF returns zero.
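
Since a zero return is now a documented outcome, a caller's copy loop
has to stop on it, or it would spin forever once the source is
exhausted. The sketch below shows such a loop with the copy step
abstracted behind a callback so it can be exercised without real files.
The names copier_fn, copy_loop() and fake_copy() are hypothetical; in a
real program the callback would wrap copy_file_range(2).

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical copier: returns bytes copied, 0 at EOF, negative on error. */
typedef ssize_t (*copier_fn)(void *ctx, size_t len);

/*
 * Copies up to len bytes. The loop ends when the request is satisfied
 * or when the copier returns 0, meaning no more data is available.
 * Returns the total number of bytes copied, or -1 on error.
 */
static ssize_t copy_loop(copier_fn copy, void *ctx, size_t len)
{
	size_t total = 0;
	ssize_t ret;

	assert(copy != NULL);

	do {
		ret = copy(ctx, len);
		if (ret < 0)
			return -1;
		len -= (size_t)ret;
		total += (size_t)ret;
	} while (len > 0 && ret > 0);

	return (ssize_t)total;
}

/* Fake source holding *ctx bytes; hands out at most 4 bytes per call. */
static ssize_t fake_copy(void *ctx, size_t len)
{
	size_t *remaining = ctx;
	size_t n = len < 4 ? len : 4;

	if (n > *remaining)
		n = *remaining;
	*remaining -= n;
	return (ssize_t)n;
}
```

With only the `len > 0` condition, asking fake_copy() for more bytes
than it holds would never terminate; the extra `ret > 0` test is what
ends the loop at EOF.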
Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>
---
man2/copy_file_range.2 | 91 ++++++++++++++++++++++++++++++++++++------
1 file changed, 78 insertions(+), 13 deletions(-)
diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
index 2438b63c8..9f9b081a7 100644
--- a/man2/copy_file_range.2
+++ b/man2/copy_file_range.2
@@ -42,9 +42,9 @@ without the additional cost of transferring data from the kernel to user space
and then back into the kernel.
It copies up to
.I len
-bytes of data from file descriptor
+bytes of data from the source file descriptor
.I fd_in
-to file descriptor
+to the target file descriptor
.IR fd_out ,
overwriting any data that exists within the requested range of the target file.
.PP
@@ -74,6 +74,12 @@ is not changed, but
.I off_in
is adjusted appropriately.
.PP
+.I fd_in
+and
+.I fd_out
+can refer to the same file.
+If they refer to the same file, then the source and target ranges are not
+allowed to overlap.
.PP
The
.I flags
@@ -84,6 +90,11 @@ Upon successful completion,
.BR copy_file_range ()
will return the number of bytes copied between files.
This could be less than the length originally requested.
+If the file offset of
+.I fd_in
+is at or past the end of file, no bytes are copied, and
+.BR copy_file_range ()
+returns zero.
.PP
On error,
.BR copy_file_range ()
@@ -93,12 +104,16 @@ is set to indicate the error.
.SH ERRORS
.TP
.B EBADF
-One or more file descriptors are not valid; or
+One or more file descriptors are not valid.
+.TP
+.B EBADF
.I fd_in
is not open for reading; or
.I fd_out
-is not open for writing; or
-the
+is not open for writing.
+.TP
+.B EBADF
+The
.B O_APPEND
flag is set for the open file description (see
.BR open (2))
@@ -106,24 +121,52 @@ referred to by the file descriptor
.IR fd_out .
.TP
.B EFBIG
-An attempt was made to write a file that exceeds the implementation-defined
-maximum file size or the process's file size limit,
-or to write at a position past the maximum allowed offset.
+An attempt was made to write at a position past the maximum file offset the
+kernel supports.
+.TP
+.B EFBIG
+An attempt was made to write a range that exceeds the allowed maximum file size.
+The maximum file size differs between filesystem implementations and can be
+different from the maximum allowed file offset.
+.TP
+.B EFBIG
+An attempt was made to write beyond the process's file size resource limit.
+This may also result in the process receiving a
+.I SIGXFSZ
+signal.
.TP
.B EINVAL
-Requested range extends beyond the end of the source file; or the
+The
.I flags
argument is not 0.
.TP
-.B EIO
-A low-level I/O error occurred while copying.
+.B EINVAL
+.I fd_in
+and
+.I fd_out
+refer to the same file and the source and target ranges overlap.
+.TP
+.B EINVAL
+Either
+.I fd_in
+or
+.I fd_out
+is not a regular file.
.TP
.B EISDIR
+Either
.I fd_in
or
.I fd_out
refers to a directory.
.TP
+.B EOVERFLOW
+The requested source or destination range is too large to represent in the
+specified data types.
+.TP
+.B EIO
+A low-level I/O error occurred while copying.
+.TP
.B ENOMEM
Out of memory.
.TP
@@ -133,13 +176,35 @@ There is not enough space on the target filesystem to complete the copy.
.B EXDEV
The files referred to by
.IR file_in " and " file_out
-are not on the same mounted filesystem.
+are not on the same mounted filesystem (pre Linux 5.3).
+.TP
+.B ETXTBSY
+Either
+.I fd_in
+or
+.I fd_out
+refers to an active swap file.
+.TP
+.B EPERM
+.I fd_out
+refers to an immutable file.
+.TP
+.B EACCES
+The user does not have write permissions for the destination file.
.SH VERSIONS
The
.BR copy_file_range ()
system call first appeared in Linux 4.5, but glibc 2.27 provides a user-space
emulation when it is not available.
.\" https://sourceware.org/git/?p=glibc.git;a=commit;f=posix/unistd.h;h=bad7a0c81f501fbbcc79af9eaa4b8254441c4a1f
+.PP
+A major rework of the kernel implementation occurred in 5.3.
+Areas of the API that weren't clearly defined were clarified and the API bounds
+are much more strictly checked than on earlier kernels.
+Applications should target the behaviour and requirements of 5.3 kernels.
+.PP
+First support for cross-filesystem copies was introduced in Linux 5.3.
+Older kernels will return -EXDEV when cross-filesystem copies are attempted.
.SH CONFORMING TO
The
.BR copy_file_range ()
@@ -224,7 +289,7 @@ main(int argc, char **argv)
}
len \-= ret;
- } while (len > 0);
+ } while (len > 0 && ret > 0);
close(fd_in);
close(fd_out);
--
2.17.1
Because ceph doesn't hold destination inode lock throughout the copy,
strip setuid bits before and after copy.
The destination inode mtime is updated before and after the copy and the
source inode atime is updated after the copy, similar to the filesystem
->read_iter() implementation.
Signed-off-by: Amir Goldstein <[email protected]>
---
fs/ceph/file.c | 17 +++++++++++++++++
1 file changed, 17 insertions(+)
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index e87f7b2023af..8a70708e1aca 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -1947,6 +1947,15 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
goto out;
}
+ /* Should dst_inode lock be held throughout the copy operation? */
+ inode_lock(dst_inode);
+ ret = file_modified(dst_file);
+ inode_unlock(dst_inode);
+ if (ret < 0) {
+ dout("failed to modify dst file before copy (%zd)\n", ret);
+ goto out;
+ }
+
/*
* We need FILE_WR caps for dst_ci and FILE_RD for src_ci as other
* clients may have dirty data in their caches. And OSDs know nothing
@@ -2097,6 +2106,14 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
out:
ceph_free_cap_flush(prealloc_cf);
+ file_accessed(src_file);
+ /* To be on the safe side, remove privs also after copy */
+ inode_lock(dst_inode);
+ err = file_modified(dst_file);
+ inode_unlock(dst_inode);
+ if (err < 0)
+ dout("failed to modify dst file after copy (%zd)\n", err);
+
return ret;
}
--
2.17.1
The access limit checks on input file range in generic_remap_checks()
are redundant because the input file size is guaranteed to be within
limits and pos+len are already checked to be within input file size.
Beyond the fact that the check cannot fail, had it failed, it would
have returned -EFBIG for an input file range error. There is no
precedent for that. -EFBIG is returned in syscalls that would change
file length.
With that call removed, we can fold generic_access_check_limits() into
generic_write_check_limits().
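
To make the folded logic easier to follow, here is a userspace model of
the combined limit check. It is a sketch under assumptions: the name
model_write_check_limits() and the MODEL_NO_LIMIT constant are
hypothetical, the caller passes the effective maximum size (the kernel
picks s_maxbytes or MAX_NON_LFS depending on O_LARGEFILE), and the
SIGXFSZ delivery the kernel performs is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define MODEL_NO_LIMIT INT64_MAX	/* stands in for RLIM_INFINITY */

/*
 * Hypothetical model of the folded write limit check.
 * Applies the process file size limit first, then the maximum file
 * size. A position at or past a limit fails with -EFBIG; a range that
 * only crosses a limit is shortened via *count.
 */
static int model_write_check_limits(int64_t pos, int64_t *count,
				    int64_t fsize_limit, int64_t max_size)
{
	assert(pos >= 0 && *count >= 0);

	if (fsize_limit != MODEL_NO_LIMIT) {
		if (pos >= fsize_limit)
			return -EFBIG;
		if (*count > fsize_limit - pos)
			*count = fsize_limit - pos;
	}

	if (pos >= max_size)
		return -EFBIG;
	if (*count > max_size - pos)
		*count = max_size - pos;

	return 0;
}
```

A write starting below a limit becomes a short write, and only a write
starting at or beyond a limit is rejected, matching the comment above
the kernel helper.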
Signed-off-by: Amir Goldstein <[email protected]>
---
mm/filemap.c | 33 ++++++++++++---------------------
1 file changed, 12 insertions(+), 21 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index a38619a4a6af..44361928bbb0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2895,24 +2895,11 @@ EXPORT_SYMBOL(read_cache_page_gfp);
* LFS limits. If pos is under the limit it becomes a short access. If it
* exceeds the limit we return -EFBIG.
*/
-static int generic_access_check_limits(struct file *file, loff_t pos,
- loff_t *count)
-{
- struct inode *inode = file->f_mapping->host;
- loff_t max_size = inode->i_sb->s_maxbytes;
-
- if (!(file->f_flags & O_LARGEFILE))
- max_size = MAX_NON_LFS;
-
- if (unlikely(pos >= max_size))
- return -EFBIG;
- *count = min(*count, max_size - pos);
- return 0;
-}
-
static int generic_write_check_limits(struct file *file, loff_t pos,
loff_t *count)
{
+ struct inode *inode = file->f_mapping->host;
+ loff_t max_size = inode->i_sb->s_maxbytes;
loff_t limit = rlimit(RLIMIT_FSIZE);
if (limit != RLIM_INFINITY) {
@@ -2923,7 +2910,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
*count = min(*count, limit - pos);
}
- return generic_access_check_limits(file, pos, count);
+ if (!(file->f_flags & O_LARGEFILE))
+ max_size = MAX_NON_LFS;
+
+ if (unlikely(pos >= max_size))
+ return -EFBIG;
+
+ *count = min(*count, max_size - pos);
+
+ return 0;
}
/*
@@ -2963,7 +2958,7 @@ EXPORT_SYMBOL(generic_write_checks);
/*
* Performs necessary checks before doing a clone.
*
- * Can adjust amount of bytes to clone.
+ * Can adjust amount of bytes to clone via @req_count argument.
* Returns appropriate error code that caller should return or
* zero in case the clone should be allowed.
*/
@@ -3001,10 +2996,6 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
return -EINVAL;
count = min(count, size_in - (uint64_t)pos_in);
- ret = generic_access_check_limits(file_in, pos_in, &count);
- if (ret)
- return ret;
-
ret = generic_write_check_limits(file_out, pos_out, &count);
if (ret)
return ret;
--
2.17.1
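For readers following the fold above, here is a minimal userspace sketch of what the merged limit check does, assuming the hunk context elided from the diff behaves as the visible lines suggest. The names model_write_check_limits() and MODEL_RLIM_INFINITY are hypothetical stand-ins, not kernel symbols; fsize_limit models rlimit(RLIMIT_FSIZE) and max_size models s_maxbytes (or MAX_NON_LFS when O_LARGEFILE is not set).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for RLIM_INFINITY in this model. */
#define MODEL_RLIM_INFINITY INT64_MAX

/*
 * Userspace model of the folded check: first clamp against the
 * per-process file size limit, then against the filesystem/LFS limit.
 * A position at or past either limit is an error; otherwise the
 * request is shortened so that it ends at the limit.
 */
static int model_write_check_limits(int64_t pos, int64_t *count,
				    int64_t fsize_limit, int64_t max_size)
{
	assert(pos >= 0 && *count >= 0);

	if (fsize_limit != MODEL_RLIM_INFINITY) {
		if (pos >= fsize_limit)
			return -EFBIG;
		if (*count > fsize_limit - pos)
			*count = fsize_limit - pos;
	}

	if (pos >= max_size)
		return -EFBIG;
	if (*count > max_size - pos)
		*count = max_size - pos;

	return 0;
}
```

Only the output side goes through this check after the patch. The input range is bounded by the input file size, which is why the separate input-side call could be dropped.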
Note that by using the helper, the order of calling file_remove_privs()
after file_update_time() in xfs_file_aio_write_checks() has changed.
Signed-off-by: Amir Goldstein <[email protected]>
---
fs/xfs/xfs_file.c | 15 +--------------
1 file changed, 1 insertion(+), 14 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 76748255f843..916a35cae5e9 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -367,20 +367,7 @@ xfs_file_aio_write_checks(
* lock above. Eventually we should look into a way to avoid
* the pointless lock roundtrip.
*/
- if (likely(!(file->f_mode & FMODE_NOCMTIME))) {
- error = file_update_time(file);
- if (error)
- return error;
- }
-
- /*
- * If we're writing the file then make sure to clear the setuid and
- * setgid bits if the process is not being run by root. This keeps
- * people from modifying setuid and setgid binaries.
- */
- if (!IS_NOSEC(inode))
- return file_remove_privs(file);
- return 0;
+ return file_modified(file);
}
static int
--
2.17.1
Like the clone and dedupe interfaces we've recently fixed, the
copy_file_range() implementation is missing basic sanity, limits and
boundary condition tests on the parameters that are passed to it
from userspace. Create a new "generic_copy_file_checks()" function
modelled on the generic_remap_checks() function to provide this
missing functionality.
[Amir] Shorten copy length instead of checking pos_in limits
because input file size already abides by the limits.
Signed-off-by: Dave Chinner <[email protected]>
Signed-off-by: Amir Goldstein <[email protected]>
---
fs/read_write.c | 3 ++-
include/linux/fs.h | 3 +++
mm/filemap.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 58 insertions(+), 1 deletion(-)
diff --git a/fs/read_write.c b/fs/read_write.c
index f1900bdb3127..b0fb1176b628 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1626,7 +1626,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
return -EXDEV;
- ret = generic_file_rw_checks(file_in, file_out);
+ ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
+ flags);
if (unlikely(ret))
return ret;
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 89b9b73eb581..e4d382c4342a 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -3050,6 +3050,9 @@ extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t *count, unsigned int remap_flags);
extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
+extern int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t *count, unsigned int flags);
extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
diff --git a/mm/filemap.c b/mm/filemap.c
index 44361928bbb0..aac71aef4c61 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3056,6 +3056,59 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
return 0;
}
+/*
+ * Performs necessary checks before doing a file copy
+ *
+ * Can adjust amount of bytes to copy via @req_count argument.
+ * Returns appropriate error code that caller should return or
+ * zero in case the copy should be allowed.
+ */
+int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t *req_count, unsigned int flags)
+{
+ struct inode *inode_in = file_inode(file_in);
+ struct inode *inode_out = file_inode(file_out);
+ uint64_t count = *req_count;
+ loff_t size_in;
+ int ret;
+
+ ret = generic_file_rw_checks(file_in, file_out);
+ if (ret)
+ return ret;
+
+ /* Don't touch certain kinds of inodes */
+ if (IS_IMMUTABLE(inode_out))
+ return -EPERM;
+
+ if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
+ return -ETXTBSY;
+
+ /* Ensure offsets don't wrap. */
+ if (pos_in + count < pos_in || pos_out + count < pos_out)
+ return -EOVERFLOW;
+
+ /* Shorten the copy to EOF */
+ size_in = i_size_read(inode_in);
+ if (pos_in >= size_in)
+ count = 0;
+ else
+ count = min(count, size_in - (uint64_t)pos_in);
+
+ ret = generic_write_check_limits(file_out, pos_out, &count);
+ if (ret)
+ return ret;
+
+ /* Don't allow overlapped copying within the same file. */
+ if (inode_in == inode_out &&
+ pos_out + count > pos_in &&
+ pos_out < pos_in + count)
+ return -EINVAL;
+
+ *req_count = count;
+ return 0;
+}
+
int pagecache_write_begin(struct file *file, struct address_space *mapping,
loff_t pos, unsigned len, unsigned flags,
struct page **pagep, void **fsdata)
--
2.17.1
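The range arithmetic in generic_copy_file_checks() above can be tried out in userspace. The following is only a sketch under stated assumptions: model_copy_range_checks() is a hypothetical name, the same_inode flag stands in for the inode_in == inode_out comparison, and the immutable, swapfile and write-limit checks are left out on purpose.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Userspace model of the offset handling only:
 *   1. reject ranges whose end wraps around,
 *   2. shorten the copy so it stops at the end of the input file,
 *   3. reject overlapping ranges when source and destination are
 *      the same inode.
 */
static int model_copy_range_checks(int64_t pos_in, int64_t pos_out,
				   uint64_t *req_count, int64_t size_in,
				   bool same_inode)
{
	uint64_t count = *req_count;

	assert(pos_in >= 0 && pos_out >= 0 && size_in >= 0);

	/* Ensure offsets don't wrap (unsigned arithmetic, as in the patch). */
	if ((uint64_t)pos_in + count < (uint64_t)pos_in ||
	    (uint64_t)pos_out + count < (uint64_t)pos_out)
		return -EOVERFLOW;

	/* Shorten the copy to EOF: a short copy, not an error. */
	if (pos_in >= size_in)
		count = 0;
	else if (count > (uint64_t)(size_in - pos_in))
		count = (uint64_t)(size_in - pos_in);

	/* Don't allow overlapped copying within the same file. */
	if (same_inode &&
	    (uint64_t)pos_out + count > (uint64_t)pos_in &&
	    (uint64_t)pos_out < (uint64_t)pos_in + count)
		return -EINVAL;

	*req_count = count;
	return 0;
}
```

Note that the overlap test runs after the length has been shortened, so a same-file request that starts at or beyond EOF becomes a zero-length copy and passes.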
The combination of file_remove_privs() and file_update_time() is
quite common in filesystem ->write_iter() methods.
Modelled after the helper file_accessed(), introduce file_modified()
and use it from generic_remap_file_range_prep().
Note that the order of calling file_remove_privs() before
file_update_time() in the helper was matched to the more common order by
filesystems and not the current order in generic_remap_file_range_prep().
Signed-off-by: Amir Goldstein <[email protected]>
---
fs/inode.c | 20 ++++++++++++++++++++
fs/read_write.c | 21 +++------------------
include/linux/fs.h | 2 ++
3 files changed, 25 insertions(+), 18 deletions(-)
diff --git a/fs/inode.c b/fs/inode.c
index df6542ec3b88..2885f2f2c7a5 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -1899,6 +1899,26 @@ int file_update_time(struct file *file)
}
EXPORT_SYMBOL(file_update_time);
+/* Caller must hold the file's inode lock */
+int file_modified(struct file *file)
+{
+ int err;
+
+ /*
+ * Clear the security bits if the process is not being run by root.
+ * This keeps people from modifying setuid and setgid binaries.
+ */
+ err = file_remove_privs(file);
+ if (err)
+ return err;
+
+ if (likely(file->f_mode & FMODE_NOCMTIME))
+ return 0;
+
+ return file_update_time(file);
+}
+EXPORT_SYMBOL(file_modified);
+
int inode_needs_sync(struct inode *inode)
{
if (IS_SYNC(inode))
diff --git a/fs/read_write.c b/fs/read_write.c
index b0fb1176b628..cec7e7b1f693 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1980,25 +1980,10 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
return ret;
/* If can't alter the file contents, we're done. */
- if (!(remap_flags & REMAP_FILE_DEDUP)) {
- /* Update the timestamps, since we can alter file contents. */
- if (!(file_out->f_mode & FMODE_NOCMTIME)) {
- ret = file_update_time(file_out);
- if (ret)
- return ret;
- }
+ if (!(remap_flags & REMAP_FILE_DEDUP))
+ ret = file_modified(file_out);
- /*
- * Clear the security bits if the process is not being run by
- * root. This keeps people from modifying setuid and setgid
- * binaries.
- */
- ret = file_remove_privs(file_out);
- if (ret)
- return ret;
- }
-
- return 0;
+ return ret;
}
EXPORT_SYMBOL(generic_remap_file_range_prep);
diff --git a/include/linux/fs.h b/include/linux/fs.h
index e4d382c4342a..79ffa2958bd8 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2177,6 +2177,8 @@ static inline void file_accessed(struct file *file)
touch_atime(&file->f_path);
}
+extern int file_modified(struct file *file);
+
int sync_inode(struct inode *inode, struct writeback_control *wbc);
int sync_inode_metadata(struct inode *inode, int wait);
--
2.17.1
On Wed, May 29, 2019 at 08:43:08PM +0300, Amir Goldstein wrote:
> The access limit checks on input file range in generic_remap_checks()
> are redundant because the input file size is guaranteied to be within
"guaranteed"...
> limits and pos+len are already checked to be within input file size.
>
> Beyond the fact that the check cannot fail, if it would have failed,
> it could return -EFBIG for input file range error. There is no precedent
> for that. -EFBIG is returned in syscalls that would change file length.
>
> With that call removed, we can fold generic_access_check_limits() into
> generic_write_check_limits().
>
> Signed-off-by: Amir Goldstein <[email protected]>
Once the changelog is fixed,
Reviewed-by: Darrick J. Wong <[email protected]>
--D
> ---
> mm/filemap.c | 33 ++++++++++++---------------------
> 1 file changed, 12 insertions(+), 21 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a38619a4a6af..44361928bbb0 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2895,24 +2895,11 @@ EXPORT_SYMBOL(read_cache_page_gfp);
> * LFS limits. If pos is under the limit it becomes a short access. If it
> * exceeds the limit we return -EFBIG.
> */
> -static int generic_access_check_limits(struct file *file, loff_t pos,
> - loff_t *count)
> -{
> - struct inode *inode = file->f_mapping->host;
> - loff_t max_size = inode->i_sb->s_maxbytes;
> -
> - if (!(file->f_flags & O_LARGEFILE))
> - max_size = MAX_NON_LFS;
> -
> - if (unlikely(pos >= max_size))
> - return -EFBIG;
> - *count = min(*count, max_size - pos);
> - return 0;
> -}
> -
> static int generic_write_check_limits(struct file *file, loff_t pos,
> loff_t *count)
> {
> + struct inode *inode = file->f_mapping->host;
> + loff_t max_size = inode->i_sb->s_maxbytes;
> loff_t limit = rlimit(RLIMIT_FSIZE);
>
> if (limit != RLIM_INFINITY) {
> @@ -2923,7 +2910,15 @@ static int generic_write_check_limits(struct file *file, loff_t pos,
> *count = min(*count, limit - pos);
> }
>
> - return generic_access_check_limits(file, pos, count);
> + if (!(file->f_flags & O_LARGEFILE))
> + max_size = MAX_NON_LFS;
> +
> + if (unlikely(pos >= max_size))
> + return -EFBIG;
> +
> + *count = min(*count, max_size - pos);
> +
> + return 0;
> }
>
> /*
> @@ -2963,7 +2958,7 @@ EXPORT_SYMBOL(generic_write_checks);
> /*
> * Performs necessary checks before doing a clone.
> *
> - * Can adjust amount of bytes to clone.
> + * Can adjust amount of bytes to clone via @req_count argument.
> * Returns appropriate error code that caller should return or
> * zero in case the clone should be allowed.
> */
> @@ -3001,10 +2996,6 @@ int generic_remap_checks(struct file *file_in, loff_t pos_in,
> return -EINVAL;
> count = min(count, size_in - (uint64_t)pos_in);
>
> - ret = generic_access_check_limits(file_in, pos_in, &count);
> - if (ret)
> - return ret;
> -
> ret = generic_write_check_limits(file_out, pos_out, &count);
> if (ret)
> return ret;
> --
> 2.17.1
>
On Wed, May 29, 2019 at 08:43:09PM +0300, Amir Goldstein wrote:
> Like the clone and dedupe interfaces we've recently fixed, the
> copy_file_range() implementation is missing basic sanity, limits and
> boundary condition tests on the parameters that are passed to it
> from userspace. Create a new "generic_copy_file_checks()" function
> modelled on the generic_remap_checks() function to provide this
> missing functionality.
>
> [Amir] Shorten copy length instead of checking pos_in limits
> because input file size already abides by the limits.
>
> Signed-off-by: Dave Chinner <[email protected]>
> Signed-off-by: Amir Goldstein <[email protected]>
Looks ok,
Reviewed-by: Darrick J. Wong <[email protected]>
--D
> ---
> fs/read_write.c | 3 ++-
> include/linux/fs.h | 3 +++
> mm/filemap.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 58 insertions(+), 1 deletion(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index f1900bdb3127..b0fb1176b628 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1626,7 +1626,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> return -EXDEV;
>
> - ret = generic_file_rw_checks(file_in, file_out);
> + ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
> + flags);
> if (unlikely(ret))
> return ret;
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 89b9b73eb581..e4d382c4342a 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -3050,6 +3050,9 @@ extern int generic_remap_checks(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> loff_t *count, unsigned int remap_flags);
> extern int generic_file_rw_checks(struct file *file_in, struct file *file_out);
> +extern int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + size_t *count, unsigned int flags);
> extern ssize_t generic_file_read_iter(struct kiocb *, struct iov_iter *);
> extern ssize_t __generic_file_write_iter(struct kiocb *, struct iov_iter *);
> extern ssize_t generic_file_write_iter(struct kiocb *, struct iov_iter *);
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 44361928bbb0..aac71aef4c61 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -3056,6 +3056,59 @@ int generic_file_rw_checks(struct file *file_in, struct file *file_out)
> return 0;
> }
>
> +/*
> + * Performs necessary checks before doing a file copy
> + *
> + * Can adjust amount of bytes to copy via @req_count argument.
> + * Returns appropriate error code that caller should return or
> + * zero in case the copy should be allowed.
> + */
> +int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + size_t *req_count, unsigned int flags)
> +{
> + struct inode *inode_in = file_inode(file_in);
> + struct inode *inode_out = file_inode(file_out);
> + uint64_t count = *req_count;
> + loff_t size_in;
> + int ret;
> +
> + ret = generic_file_rw_checks(file_in, file_out);
> + if (ret)
> + return ret;
> +
> + /* Don't touch certain kinds of inodes */
> + if (IS_IMMUTABLE(inode_out))
> + return -EPERM;
> +
> + if (IS_SWAPFILE(inode_in) || IS_SWAPFILE(inode_out))
> + return -ETXTBSY;
> +
> + /* Ensure offsets don't wrap. */
> + if (pos_in + count < pos_in || pos_out + count < pos_out)
> + return -EOVERFLOW;
> +
> + /* Shorten the copy to EOF */
> + size_in = i_size_read(inode_in);
> + if (pos_in >= size_in)
> + count = 0;
> + else
> + count = min(count, size_in - (uint64_t)pos_in);
> +
> + ret = generic_write_check_limits(file_out, pos_out, &count);
> + if (ret)
> + return ret;
> +
> + /* Don't allow overlapped copying within the same file. */
> + if (inode_in == inode_out &&
> + pos_out + count > pos_in &&
> + pos_out < pos_in + count)
> + return -EINVAL;
> +
> + *req_count = count;
> + return 0;
> +}
> +
> int pagecache_write_begin(struct file *file, struct address_space *mapping,
> loff_t pos, unsigned len, unsigned flags,
> struct page **pagep, void **fsdata)
> --
> 2.17.1
>
On Wed, May 29, 2019 at 08:43:10PM +0300, Amir Goldstein wrote:
> The combination of file_remove_privs() and file_update_mtime() is
> quite common in filesystem ->write_iter() methods.
>
> Modelled after the helper file_accessed(), introduce file_modified()
> and use it from generic_remap_file_range_prep().
>
> Note that the order of calling file_remove_privs() before
> file_update_mtime() in the helper was matched to the more common order by
> filesystems and not the current order in generic_remap_file_range_prep().
>
> Signed-off-by: Amir Goldstein <[email protected]>
> ---
> fs/inode.c | 20 ++++++++++++++++++++
> fs/read_write.c | 21 +++------------------
> include/linux/fs.h | 2 ++
> 3 files changed, 25 insertions(+), 18 deletions(-)
>
> diff --git a/fs/inode.c b/fs/inode.c
> index df6542ec3b88..2885f2f2c7a5 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -1899,6 +1899,26 @@ int file_update_time(struct file *file)
> }
> EXPORT_SYMBOL(file_update_time);
>
> +/* Caller must hold the file's inode lock */
> +int file_modified(struct file *file)
> +{
> + int err;
> +
> + /*
> + * Clear the security bits if the process is not being run by root.
> + * This keeps people from modifying setuid and setgid binaries.
> + */
> + err = file_remove_privs(file);
> + if (err)
> + return err;
> +
> + if (likely(file->f_mode & FMODE_NOCMTIME))
I would not have thought NOCMTIME is likely?
Maybe it is for io requests coming from overlayfs, but for regular uses
I don't think that's true.
--D
> + return 0;
> +
> + return file_update_time(file);
> +}
> +EXPORT_SYMBOL(file_modified);
> +
> int inode_needs_sync(struct inode *inode)
> {
> if (IS_SYNC(inode))
> diff --git a/fs/read_write.c b/fs/read_write.c
> index b0fb1176b628..cec7e7b1f693 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1980,25 +1980,10 @@ int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
> return ret;
>
> /* If can't alter the file contents, we're done. */
> - if (!(remap_flags & REMAP_FILE_DEDUP)) {
> - /* Update the timestamps, since we can alter file contents. */
> - if (!(file_out->f_mode & FMODE_NOCMTIME)) {
> - ret = file_update_time(file_out);
> - if (ret)
> - return ret;
> - }
> + if (!(remap_flags & REMAP_FILE_DEDUP))
> + ret = file_modified(file_out);
>
> - /*
> - * Clear the security bits if the process is not being run by
> - * root. This keeps people from modifying setuid and setgid
> - * binaries.
> - */
> - ret = file_remove_privs(file_out);
> - if (ret)
> - return ret;
> - }
> -
> - return 0;
> + return ret;
> }
> EXPORT_SYMBOL(generic_remap_file_range_prep);
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index e4d382c4342a..79ffa2958bd8 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2177,6 +2177,8 @@ static inline void file_accessed(struct file *file)
> touch_atime(&file->f_path);
> }
>
> +extern int file_modified(struct file *file);
> +
> int sync_inode(struct inode *inode, struct writeback_control *wbc);
> int sync_inode_metadata(struct inode *inode, int wait);
>
> --
> 2.17.1
>
On Wed, May 29, 2019 at 08:43:11PM +0300, Amir Goldstein wrote:
> Note that by using the helper, the order of calling file_remove_privs()
> after file_update_mtime() in xfs_file_aio_write_checks() has changed.
>
> Signed-off-by: Amir Goldstein <[email protected]>
> ---
> fs/xfs/xfs_file.c | 15 +--------------
> 1 file changed, 1 insertion(+), 14 deletions(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 76748255f843..916a35cae5e9 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -367,20 +367,7 @@ xfs_file_aio_write_checks(
> * lock above. Eventually we should look into a way to avoid
> * the pointless lock roundtrip.
> */
> - if (likely(!(file->f_mode & FMODE_NOCMTIME))) {
...especially since here we think NOCMTIME is likely /not/ to be set.
> - error = file_update_time(file);
> - if (error)
> - return error;
> - }
> -
> - /*
> - * If we're writing the file then make sure to clear the setuid and
> - * setgid bits if the process is not being run by root. This keeps
> - * people from modifying setuid and setgid binaries.
> - */
> - if (!IS_NOSEC(inode))
> - return file_remove_privs(file);
Hm, file_modified doesn't have the !IS_NOSEC check guarding
file_remove_privs, are you sure it's ok to remove the suid bits
unconditionally? Even if SB_NOSEC (and therefore S_NOSEC) are set?
--D
> - return 0;
> + return file_modified(file);
> }
>
> static int
> --
> 2.17.1
>
On Wed, May 29, 2019 at 08:43:12PM +0300, Amir Goldstein wrote:
> Because generic_copy_file_range doesn't hold the destination inode lock
> throughout the copy, strip setuid bits before and after copy.
>
> The destination inode mtime is updated before and after the copy and the
> source inode atime is updated after the copy, similar to
> generic_file_read_iter().
>
> Signed-off-by: Amir Goldstein <[email protected]>
Looks reasonable,
Reviewed-by: Darrick J. Wong <[email protected]>
--D
> ---
> fs/read_write.c | 23 +++++++++++++++++++++--
> 1 file changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index cec7e7b1f693..706ea5f276a7 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1590,8 +1590,27 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> size_t len, unsigned int flags)
> {
> - return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
> - len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
> + struct inode *inode_out = file_inode(file_out);
> + int ret, err;
> +
> + /* Should inode_out lock be held throughout the copy operation? */
> + inode_lock(inode_out);
> + err = file_modified(file_out);
> + inode_unlock(inode_out);
> + if (err)
> + return err;
> +
> + ret = do_splice_direct(file_in, &pos_in, file_out, &pos_out,
> + len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
> +
> + file_accessed(file_in);
> +
> + /* To be on the safe side, remove privs also after copy */
> + inode_lock(inode_out);
> + err = file_modified(file_out);
> + inode_unlock(inode_out);
> +
> + return err ?: ret;
> }
> EXPORT_SYMBOL(generic_copy_file_range);
>
> --
> 2.17.1
>
On Wed, May 29, 2019 at 9:27 PM Darrick J. Wong <[email protected]> wrote:
>
> On Wed, May 29, 2019 at 08:43:10PM +0300, Amir Goldstein wrote:
> > The combination of file_remove_privs() and file_update_mtime() is
> > quite common in filesystem ->write_iter() methods.
> >
> > Modelled after the helper file_accessed(), introduce file_modified()
> > and use it from generic_remap_file_range_prep().
> >
> > Note that the order of calling file_remove_privs() before
> > file_update_mtime() in the helper was matched to the more common order by
> > filesystems and not the current order in generic_remap_file_range_prep().
> >
> > Signed-off-by: Amir Goldstein <[email protected]>
> > ---
> > fs/inode.c | 20 ++++++++++++++++++++
> > fs/read_write.c | 21 +++------------------
> > include/linux/fs.h | 2 ++
> > 3 files changed, 25 insertions(+), 18 deletions(-)
> >
> > diff --git a/fs/inode.c b/fs/inode.c
> > index df6542ec3b88..2885f2f2c7a5 100644
> > --- a/fs/inode.c
> > +++ b/fs/inode.c
> > @@ -1899,6 +1899,26 @@ int file_update_time(struct file *file)
> > }
> > EXPORT_SYMBOL(file_update_time);
> >
> > +/* Caller must hold the file's inode lock */
> > +int file_modified(struct file *file)
> > +{
> > + int err;
> > +
> > + /*
> > + * Clear the security bits if the process is not being run by root.
> > + * This keeps people from modifying setuid and setgid binaries.
> > + */
> > + err = file_remove_privs(file);
> > + if (err)
> > + return err;
> > +
> > + if (likely(file->f_mode & FMODE_NOCMTIME))
>
> I would not have thought NOCMTIME is likely?
>
> Maybe it is for io requests coming from overlayfs, but for regular uses
> I don't think that's true.
Nope, that's a typo. Good spotting.
Overlayfs doesn't set FMODE_NOCMTIME (yet). Only xfs does, from
XFS_IOC_OPEN_BY_HANDLE, but I think Dave said that is a deprecated
API, so it should have been very_unlikely().
Thanks,
Amir.
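To make the outcome of this exchange concrete, here is a userspace sketch of the helper's control flow with the branch hint flipped to unlikely(). It is a model, not the kernel code: struct model_file and the model_* functions are hypothetical, the nocmtime field stands in for f_mode & FMODE_NOCMTIME, and unlikely() is defined locally on top of the GCC/Clang __builtin_expect builtin.

```c
#include <assert.h>
#include <stdbool.h>

/* Local definition for the model; relies on a GCC/Clang builtin. */
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Hypothetical stand-in for struct file, with counters for testing. */
struct model_file {
	bool nocmtime;		/* models f_mode & FMODE_NOCMTIME */
	int remove_privs_err;	/* error to inject, 0 for success */
	int privs_removed;	/* number of successful priv removals */
	int time_updated;	/* number of timestamp updates */
};

static int model_remove_privs(struct model_file *f)
{
	if (f->remove_privs_err)
		return f->remove_privs_err;
	f->privs_removed++;
	return 0;
}

static int model_update_time(struct model_file *f)
{
	f->time_updated++;
	return 0;
}

/* Privileges are dropped first; timestamps only if that succeeded. */
static int model_file_modified(struct model_file *f)
{
	int err;

	assert(f);
	err = model_remove_privs(f);
	if (err)
		return err;

	/* Skipping the timestamp update is the rare case. */
	if (unlikely(f->nocmtime))
		return 0;

	return model_update_time(f);
}
```

The hint changes only which branch the compiler lays out as the fast path; the observable behaviour is the same as with likely().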
On Wed, May 29, 2019 at 9:31 PM Darrick J. Wong <[email protected]> wrote:
>
> On Wed, May 29, 2019 at 08:43:11PM +0300, Amir Goldstein wrote:
> > Note that by using the helper, the order of calling file_remove_privs()
> > after file_update_mtime() in xfs_file_aio_write_checks() has changed.
> >
> > Signed-off-by: Amir Goldstein <[email protected]>
> > ---
> > fs/xfs/xfs_file.c | 15 +--------------
> > 1 file changed, 1 insertion(+), 14 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > index 76748255f843..916a35cae5e9 100644
> > --- a/fs/xfs/xfs_file.c
> > +++ b/fs/xfs/xfs_file.c
> > @@ -367,20 +367,7 @@ xfs_file_aio_write_checks(
> > * lock above. Eventually we should look into a way to avoid
> > * the pointless lock roundtrip.
> > */
> > - if (likely(!(file->f_mode & FMODE_NOCMTIME))) {
>
> ...especially since here we think NOCMTIME is likely /not/ to be set.
>
> > - error = file_update_time(file);
> > - if (error)
> > - return error;
> > - }
> > -
> > - /*
> > - * If we're writing the file then make sure to clear the setuid and
> > - * setgid bits if the process is not being run by root. This keeps
> > - * people from modifying setuid and setgid binaries.
> > - */
> > - if (!IS_NOSEC(inode))
> > - return file_remove_privs(file);
>
> Hm, file_modified doesn't have the !IS_NOSEC check guarding
> file_remove_privs, are you sure it's ok to remove the suid bits
> unconditionally? Even if SB_NOSEC (and therefore S_NOSEC) are set?
>
file_remove_privs() has its own IS_NOSEC() check.
Thanks,
Amir.
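The point of this reply can be illustrated with a small userspace model: if the callee already returns early for NOSEC inodes, a caller-side guard changes nothing. Everything here is hypothetical (struct model_inode, MODEL_S_ISUID, the function names), and the real file_remove_privs() checks more than this sketch does.

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_S_ISUID 04000u	/* hypothetical stand-in for S_ISUID */

/* Hypothetical stand-in for struct inode. */
struct model_inode {
	bool nosec;		/* models IS_NOSEC(inode) */
	unsigned int mode;	/* models i_mode */
};

/* Callee with its own NOSEC check, as described in the reply. */
static int model_remove_privs(struct model_inode *inode)
{
	assert(inode);
	if (inode->nosec)
		return 0;
	inode->mode &= ~MODEL_S_ISUID;
	return 0;
}

/* Old xfs-style caller: guards the call with its own NOSEC test. */
static int guarded_caller(struct model_inode *inode)
{
	if (!inode->nosec)
		return model_remove_privs(inode);
	return 0;
}

/* New-style caller: relies on the callee's check. */
static int plain_caller(struct model_inode *inode)
{
	return model_remove_privs(inode);
}
```

Both callers leave a NOSEC inode untouched and strip the setuid bit otherwise, so dropping the outer test does not change behaviour in this model.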
On Wed, May 29, 2019 at 10:10:37PM +0300, Amir Goldstein wrote:
> On Wed, May 29, 2019 at 9:31 PM Darrick J. Wong <[email protected]> wrote:
> >
> > On Wed, May 29, 2019 at 08:43:11PM +0300, Amir Goldstein wrote:
> > > Note that by using the helper, the order of calling file_remove_privs()
> > > after file_update_mtime() in xfs_file_aio_write_checks() has changed.
> > >
> > > Signed-off-by: Amir Goldstein <[email protected]>
> > > ---
> > > fs/xfs/xfs_file.c | 15 +--------------
> > > 1 file changed, 1 insertion(+), 14 deletions(-)
> > >
> > > diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> > > index 76748255f843..916a35cae5e9 100644
> > > --- a/fs/xfs/xfs_file.c
> > > +++ b/fs/xfs/xfs_file.c
> > > @@ -367,20 +367,7 @@ xfs_file_aio_write_checks(
> > > * lock above. Eventually we should look into a way to avoid
> > > * the pointless lock roundtrip.
> > > */
> > > - if (likely(!(file->f_mode & FMODE_NOCMTIME))) {
> >
> > ...especially since here we think NOCMTIME is likely /not/ to be set.
> >
> > > - error = file_update_time(file);
> > > - if (error)
> > > - return error;
> > > - }
> > > -
> > > - /*
> > > - * If we're writing the file then make sure to clear the setuid and
> > > - * setgid bits if the process is not being run by root. This keeps
> > > - * people from modifying setuid and setgid binaries.
> > > - */
> > > - if (!IS_NOSEC(inode))
> > > - return file_remove_privs(file);
> >
> > Hm, file_modified doesn't have the !IS_NOSEC check guarding
> > file_remove_privs, are you sure it's ok to remove the suid bits
> > unconditionally? Even if SB_NOSEC (and therefore S_NOSEC) are set?
> >
>
> file_remove_privs() has its own IS_NOSEC() check.
Ah, ok,
Reviewed-by: Darrick J. Wong <[email protected]>
--D
> Thanks,
> Amir.
On Wed, May 29, 2019 at 10:08 PM Amir Goldstein <[email protected]> wrote:
>
> On Wed, May 29, 2019 at 9:27 PM Darrick J. Wong <[email protected]> wrote:
> >
> > On Wed, May 29, 2019 at 08:43:10PM +0300, Amir Goldstein wrote:
> > > The combination of file_remove_privs() and file_update_mtime() is
> > > quite common in filesystem ->write_iter() methods.
> > >
> > > Modelled after the helper file_accessed(), introduce file_modified()
> > > and use it from generic_remap_file_range_prep().
> > >
> > > Note that the order of calling file_remove_privs() before
> > > file_update_mtime() in the helper was matched to the more common order by
> > > filesystems and not the current order in generic_remap_file_range_prep().
> > >
> > > Signed-off-by: Amir Goldstein <[email protected]>
> > > ---
> > > fs/inode.c | 20 ++++++++++++++++++++
> > > fs/read_write.c | 21 +++------------------
> > > include/linux/fs.h | 2 ++
> > > 3 files changed, 25 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index df6542ec3b88..2885f2f2c7a5 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -1899,6 +1899,26 @@ int file_update_time(struct file *file)
> > > }
> > > EXPORT_SYMBOL(file_update_time);
> > >
> > > +/* Caller must hold the file's inode lock */
> > > +int file_modified(struct file *file)
> > > +{
> > > + int err;
> > > +
> > > + /*
> > > + * Clear the security bits if the process is not being run by root.
> > > + * This keeps people from modifying setuid and setgid binaries.
> > > + */
> > > + err = file_remove_privs(file);
> > > + if (err)
> > > + return err;
> > > +
> > > + if (likely(file->f_mode & FMODE_NOCMTIME))
> >
> > I would not have thought NOCMTIME is likely?
> >
> > Maybe it is for io requests coming from overlayfs, but for regular uses
> > I don't think that's true.
>
> Nope that's a typo. Good spotting.
> Overlayfs doesn't set FMODE_NOCMTIME (yet). Only xfs does from
> XFS_IOC_OPEN_BY_HANDLE, but I think Dave said that is a deprecated
> API. so should have been very_unlikely().
>
Is that an ACK with likely converted to unlikely?
Thanks,
Amir.
Hi Olga, Anna, Trond,
Could we get an ACK on this patch?
It is a prerequisite for merging the cross-device copy_file_range work.
It depends on a new helper introduced here:
https://lore.kernel.org/linux-fsdevel/CAOQ4uxjbcSWX1hUcuXbn8hFH3QYB+5bAC9Z1yCwJdR=T-GGtCg@mail.gmail.com/T/#m1569878c41f39fac3aadb3832a30659c323b582a
Thanks,
Amir.
On Wed, May 29, 2019 at 8:43 PM Amir Goldstein <[email protected]> wrote:
>
> Like ->write_iter(), we update mtime and strip setuid of dst file before
> copy and like ->read_iter(), we update atime of src file after copy.
>
> Signed-off-by: Amir Goldstein <[email protected]>
> ---
> fs/nfs/nfs42proc.c | 9 ++++++---
> 1 file changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
> index 5196bfa7894d..c37a8e5116c6 100644
> --- a/fs/nfs/nfs42proc.c
> +++ b/fs/nfs/nfs42proc.c
> @@ -345,10 +345,13 @@ ssize_t nfs42_proc_copy(struct file *src, loff_t pos_src,
>
> do {
> inode_lock(file_inode(dst));
> - err = _nfs42_proc_copy(src, src_lock,
> - dst, dst_lock,
> - &args, &res);
> + err = file_modified(dst);
> + if (!err)
> + err = _nfs42_proc_copy(src, src_lock,
> + dst, dst_lock,
> + &args, &res);
> inode_unlock(file_inode(dst));
> + file_accessed(src);
>
> if (err >= 0)
> break;
> --
> 2.17.1
>
Hi Steve,
Could we get an ACK on this patch?
It is a prerequisite for merging the cross-device copy_file_range work.
It depends on a new helper introduced here:
https://lore.kernel.org/linux-fsdevel/CAOQ4uxjbcSWX1hUcuXbn8hFH3QYB+5bAC9Z1yCwJdR=T-GGtCg@mail.gmail.com/T/#m1569878c41f39fac3aadb3832a30659c323b582a
Thanks,
Amir.
On Wed, May 29, 2019 at 8:43 PM Amir Goldstein <[email protected]> wrote:
>
> cifs has both source and destination inodes locked throughout the copy.
> Like ->write_iter(), we update mtime and strip setuid bits of destination
> file before copy and like ->read_iter(), we update atime of source file
> after copy.
>
> Signed-off-by: Amir Goldstein <[email protected]>
> ---
> fs/cifs/cifsfs.c | 11 ++++++++---
> 1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> index c65823270313..ab6c5c24146d 100644
> --- a/fs/cifs/cifsfs.c
> +++ b/fs/cifs/cifsfs.c
> @@ -1096,6 +1096,10 @@ ssize_t cifs_file_copychunk_range(unsigned int xid,
> goto out;
> }
>
> + rc = -EOPNOTSUPP;
> + if (!target_tcon->ses->server->ops->copychunk_range)
> + goto out;
> +
> /*
> * Note: cifs case is easier than btrfs since server responsible for
> * checks for proper open modes and file type and if it wants
> @@ -1107,11 +1111,12 @@ ssize_t cifs_file_copychunk_range(unsigned int xid,
> /* should we flush first and last page first */
> truncate_inode_pages(&target_inode->i_data, 0);
>
> - if (target_tcon->ses->server->ops->copychunk_range)
> + rc = file_modified(dst_file);
> + if (!rc)
> rc = target_tcon->ses->server->ops->copychunk_range(xid,
> smb_file_src, smb_file_target, off, len, destoff);
> - else
> - rc = -EOPNOTSUPP;
> +
> + file_accessed(src_file);
>
> /* force revalidate of size and timestamps of target file now
> * that target is updated on the server
> --
> 2.17.1
>
Hi Miklos,
Could we get an ACK on this patch?
It is a prerequisite for merging the cross-device copy_file_range work.
It depends on a new helper introduced here:
https://lore.kernel.org/linux-fsdevel/CAOQ4uxjbcSWX1hUcuXbn8hFH3QYB+5bAC9Z1yCwJdR=T-GGtCg@mail.gmail.com/T/#m1569878c41f39fac3aadb3832a30659c323b582a
Thanks,
Amir.
On Wed, May 29, 2019 at 8:43 PM Amir Goldstein <[email protected]> wrote:
>
> Like ->write_iter(), we update mtime and strip setuid of dst file before
> copy and like ->read_iter(), we update atime of src file after copy.
>
> Signed-off-by: Amir Goldstein <[email protected]>
> ---
> fs/fuse/file.c | 5 +++++
> 1 file changed, 5 insertions(+)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index e03901ae729b..7f33d68f66d9 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -3128,6 +3128,10 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
>
> inode_lock(inode_out);
>
> + err = file_modified(file_out);
> + if (err)
> + goto out;
> +
> if (fc->writeback_cache) {
> err = filemap_write_and_wait_range(inode_out->i_mapping,
> pos_out, pos_out + len);
> @@ -3169,6 +3173,7 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
> clear_bit(FUSE_I_SIZE_UNSTABLE, &fi_out->state);
>
> inode_unlock(inode_out);
> + file_accessed(file_in);
>
> return err;
> }
> --
> 2.17.1
>
Hi Zheng and Ilya,
Could you help us get an ACK on this patch?
It is a prerequisite for merging the cross-device copy_file_range work.
It depends on a new helper introduced here:
https://lore.kernel.org/linux-fsdevel/CAOQ4uxjbcSWX1hUcuXbn8hFH3QYB+5bAC9Z1yCwJdR=T-GGtCg@mail.gmail.com/T/#m1569878c41f39fac3aadb3832a30659c323b582a
Luis Henriques has tested (the previous revision of) this work on ceph.
Thanks,
Amir.
On Wed, May 29, 2019 at 8:43 PM Amir Goldstein <[email protected]> wrote:
>
> Because ceph doesn't hold destination inode lock throughout the copy,
> strip setuid bits before and after copy.
>
> The destination inode mtime is updated before and after the copy and the
> source inode atime is updated after the copy, similar to the filesystem
> ->read_iter() implementation.
>
> Signed-off-by: Amir Goldstein <[email protected]>
> ---
> fs/ceph/file.c | 17 +++++++++++++++++
> 1 file changed, 17 insertions(+)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index e87f7b2023af..8a70708e1aca 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1947,6 +1947,15 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> goto out;
> }
>
> + /* Should dst_inode lock be held throughout the copy operation? */
> + inode_lock(dst_inode);
> + ret = file_modified(dst_file);
> + inode_unlock(dst_inode);
> + if (ret < 0) {
> + dout("failed to modify dst file before copy (%zd)\n", ret);
> + goto out;
> + }
> +
> /*
> * We need FILE_WR caps for dst_ci and FILE_RD for src_ci as other
> * clients may have dirty data in their caches. And OSDs know nothing
> @@ -2097,6 +2106,14 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> out:
> ceph_free_cap_flush(prealloc_cf);
>
> + file_accessed(src_file);
> + /* To be on the safe side, remove privs also after copy */
> + inode_lock(dst_inode);
> + err = file_modified(dst_file);
> + inode_unlock(dst_inode);
> + if (err < 0)
> + dout("failed to modify dst file after copy (%zd)\n", err);
> +
> return ret;
> }
>
> --
> 2.17.1
>
On Wed, 2019-05-29 at 22:34 +0300, Amir Goldstein wrote:
> Hi Olga,Anna,Trond
>
> Could we get an ACK on this patch.
> It is a prerequisite for merging the cross-device copy_file_range
> work.
>
> It depends on a new helper introduced here:
> https://lore.kernel.org/linux-fsdevel/CAOQ4uxjbcSWX1hUcuXbn8hFH3QYB+5bAC9Z1yCwJdR=T-GGtCg@mail.gmail.com/T/#m1569878c41f39fac3aadb3832a30659c323b582a
>
> Thanks,
> Amir,
>
> On Wed, May 29, 2019 at 8:43 PM Amir Goldstein <[email protected]>
> wrote:
> > Like ->write_iter(), we update mtime and strip setuid of dst file
> > before
> > copy and like ->read_iter(), we update atime of src file after
> > copy.
> >
> > Signed-off-by: Amir Goldstein <[email protected]>
> > ---
> > fs/nfs/nfs42proc.c | 9 ++++++---
> > 1 file changed, 6 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
> > index 5196bfa7894d..c37a8e5116c6 100644
> > --- a/fs/nfs/nfs42proc.c
> > +++ b/fs/nfs/nfs42proc.c
> > @@ -345,10 +345,13 @@ ssize_t nfs42_proc_copy(struct file *src,
> > loff_t pos_src,
> >
> > do {
> > inode_lock(file_inode(dst));
> > - err = _nfs42_proc_copy(src, src_lock,
> > - dst, dst_lock,
> > - &args, &res);
> > + err = file_modified(dst);
> > + if (!err)
> > + err = _nfs42_proc_copy(src, src_lock,
> > + dst, dst_lock,
> > + &args, &res);
> > inode_unlock(file_inode(dst));
> > + file_accessed(src);
> >
> > if (err >= 0)
> > break;
> > --
> > 2.17.1
> >
Please drop this patch. In the NFS protocol, the client is not expected
to mess with atime or mtime other than when the user explicitly sets it
through a call to utimensat() or similar. The server takes care of any
a/mtime updates as and when the file is read/written to.
Ditto for suid/sgid clearing.
Motto: "No file_accessed() or file_remove_privs() calls, please. We're
NFS."
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]
On Wed, May 29, 2019 at 9:38 PM Amir Goldstein <[email protected]> wrote:
>
> Hi Miklos,
>
> Could we get an ACK on this patch.
> It is a prerequisite for merging the cross-device copy_file_range work.
>
> It depends on a new helper introduced here:
> https://lore.kernel.org/linux-fsdevel/CAOQ4uxjbcSWX1hUcuXbn8hFH3QYB+5bAC9Z1yCwJdR=T-GGtCg@mail.gmail.com/T/#m1569878c41f39fac3aadb3832a30659c323b582a
That likely() is actually an unlikely().
Otherwise ACK.
Thanks,
Miklos
On Wed, May 29, 2019 at 1:43 PM Amir Goldstein <[email protected]> wrote:
>
> We want to enable cross-filesystem copy_file_range functionality
> where possible, so push the "same superblock only" checks down to
> the individual filesystem callouts so they can make their own
> decisions about cross-superblock copy offload and fall back to
> generic_copy_file_range() for cross-superblock copy.
>
> [Amir] We do not call ->remap_file_range() in case the inodes are not
> on the same sb and do not call ->copy_file_range() in case the inodes
> are not on the same filesystem type.
>
> This changes behavior of the copy_file_range(2) syscall, which will
> now allow cross filesystem in-kernel copy. CIFS already supports
> cross-superblock copy, between two shares to the same server. This
> functionality will now be available via the copy_file_range(2) syscall.
>
> Cc: Steve French <[email protected]>
> Signed-off-by: Dave Chinner <[email protected]>
> Signed-off-by: Amir Goldstein <[email protected]>
> Reviewed-by: Darrick J. Wong <[email protected]>
> ---
> fs/ceph/file.c | 4 +++-
> fs/cifs/cifsfs.c | 2 +-
> fs/fuse/file.c | 5 ++++-
> fs/nfs/nfs4file.c | 5 ++++-
> fs/read_write.c | 20 ++++++++++++++------
> 5 files changed, 26 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 8a70708e1aca..e9614d686301 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -1909,6 +1909,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>
> if (src_inode == dst_inode)
> return -EINVAL;
> + if (src_inode->i_sb != dst_inode->i_sb)
> + return -EXDEV;
> if (ceph_snap(dst_inode) != CEPH_NOSNAP)
> return -EROFS;
>
> @@ -2126,7 +2128,7 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
> ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
> len, flags);
>
> - if (ret == -EOPNOTSUPP)
> + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> ret = generic_copy_file_range(src_file, src_off, dst_file,
> dst_off, len, flags);
> return ret;
> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> index ab6c5c24146d..83956452c108 100644
> --- a/fs/cifs/cifsfs.c
> +++ b/fs/cifs/cifsfs.c
> @@ -1154,7 +1154,7 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off,
> len, flags);
> free_xid(xid);
>
> - if (rc == -EOPNOTSUPP)
> + if (rc == -EOPNOTSUPP || rc == -EXDEV)
> rc = generic_copy_file_range(src_file, off, dst_file,
> destoff, len, flags);
> return rc;
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 7f33d68f66d9..eab00cd089e8 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -3126,6 +3126,9 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
> if (fc->no_copy_file_range)
> return -EOPNOTSUPP;
>
> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> + return -EXDEV;
> +
> inode_lock(inode_out);
>
> err = file_modified(file_out);
> @@ -3187,7 +3190,7 @@ static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
> ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
> len, flags);
>
> - if (ret == -EOPNOTSUPP)
> + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> ret = generic_copy_file_range(src_file, src_off, dst_file,
> dst_off, len, flags);
> return ret;
> diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> index 4842f3ab3161..f4157eb1f69d 100644
> --- a/fs/nfs/nfs4file.c
> +++ b/fs/nfs/nfs4file.c
> @@ -133,6 +133,9 @@ static ssize_t __nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> size_t count, unsigned int flags)
> {
> + /* Only offload copy if superblock is the same */
> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> + return -EXDEV;
> if (!nfs_server_capable(file_inode(file_out), NFS_CAP_COPY))
> return -EOPNOTSUPP;
> if (file_inode(file_in) == file_inode(file_out))
> @@ -148,7 +151,7 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
>
> ret = __nfs4_copy_file_range(file_in, pos_in, file_out, pos_out, count,
> flags);
> - if (ret == -EOPNOTSUPP)
> + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> ret = generic_copy_file_range(file_in, pos_in, file_out,
> pos_out, count, flags);
> return ret;
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 706ea5f276a7..d8930bb735cb 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1618,7 +1618,18 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> size_t len, unsigned int flags)
> {
> - if (file_out->f_op->copy_file_range)
> + /*
> + * Although we now allow filesystems to handle cross sb copy, passing
> + * an inode of the wrong filesystem type to filesystem operation can
> + * often result in an attempt to dereference the wrong concrete inode
> + * struct, so avoid doing that until we really have a good reason.
> + * The incentive for passing inode from different sb to filesystem is
> + * NFS cross server copy and for that use case, enforcing same
> + * filesystem type is acceptable.
> + */
> + if (file_out->f_op->copy_file_range &&
> + file_inode(file_in)->i_sb->s_type ==
> + file_inode(file_out)->i_sb->s_type)
I'm not sure how much I care (versus wanting at least this much
cross-device copy available), but NFS defines several
file_system_type instances (nfs4_remote_fs_type,
nfs4_remote_referral_fs_type, and good old nfs4_fs_type), and this
check would disallow a copy between them.
One idea would be to push the check into the filesystems themselves.
> return file_out->f_op->copy_file_range(file_in, pos_in,
> file_out, pos_out,
> len, flags);
> @@ -1641,10 +1652,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> if (flags != 0)
> return -EINVAL;
>
> - /* this could be relaxed once a method supports cross-fs copies */
> - if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> - return -EXDEV;
> -
> ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
> flags);
> if (unlikely(ret))
> @@ -1667,7 +1674,8 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> * Try cloning first, this is supported by more file systems, and
> * more efficient if both clone and copy are supported (e.g. NFS).
> */
> - if (file_in->f_op->remap_file_range) {
> + if (file_in->f_op->remap_file_range &&
> + file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) {
> loff_t cloned;
>
> cloned = file_in->f_op->remap_file_range(file_in, pos_in,
> --
> 2.17.1
>
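For callers of copy_file_range(2), the behaviour change described in the
commit message means that a cross-filesystem copy may now succeed where it
used to fail with EXDEV. A userspace caller that has to run on both old and
new kernels could handle this roughly as sketched below. This is only an
illustration: copy_range() and copy_fallback() are hypothetical helper
names, and the set of errno values treated as "fall back to read/write" is
an assumption, not something taken from this series.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Plain read()/write() loop, used when the kernel refuses the in-kernel copy. */
static ssize_t copy_fallback(int fd_in, int fd_out, size_t len)
{
	char buf[65536];
	size_t done = 0;

	while (done < len) {
		size_t want = len - done < sizeof(buf) ? len - done : sizeof(buf);
		ssize_t n = read(fd_in, buf, want);
		ssize_t off = 0;

		if (n < 0) {
			if (errno == EINTR)
				continue;
			return -1;
		}
		if (n == 0)
			break;	/* EOF on the source */

		while (off < n) {
			ssize_t w = write(fd_out, buf + off, (size_t)(n - off));

			if (w < 0) {
				if (errno == EINTR)
					continue;
				return -1;
			}
			off += w;
		}
		done += (size_t)n;
	}
	return (ssize_t)done;
}

/* Copy up to len bytes using the file offsets of both descriptors. */
static ssize_t copy_range(int fd_in, int fd_out, size_t len)
{
	size_t done = 0;

	assert(fd_in >= 0 && fd_out >= 0);

	while (done < len) {
		ssize_t n = copy_file_range(fd_in, NULL, fd_out, NULL,
					    len - done, 0);

		if (n > 0) {
			done += (size_t)n;
			continue;
		}
		if (n == 0)
			break;	/* EOF on the source */
		if (errno == EINTR)
			continue;
		/* Assumed set of "kernel cannot do this copy" errors. */
		if (errno == EXDEV || errno == ENOSYS ||
		    errno == EOPNOTSUPP || errno == EINVAL) {
			ssize_t rest = copy_fallback(fd_in, fd_out, len - done);

			return rest < 0 ? -1 : (ssize_t)(done + (size_t)rest);
		}
		return -1;
	}
	return (ssize_t)done;
}
```

Both helpers advance the descriptors' own file offsets, so the fallback
continues from wherever copy_file_range() stopped.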
On Wed, May 29, 2019 at 11:02 PM Trond Myklebust
<[email protected]> wrote:
>
> On Wed, 2019-05-29 at 22:34 +0300, Amir Goldstein wrote:
> > Hi Olga,Anna,Trond
> >
> > Could we get an ACK on this patch.
> > It is a prerequisite for merging the cross-device copy_file_range
> > work.
> >
> > It depends on a new helper introduced here:
> > https://lore.kernel.org/linux-fsdevel/CAOQ4uxjbcSWX1hUcuXbn8hFH3QYB+5bAC9Z1yCwJdR=T-GGtCg@mail.gmail.com/T/#m1569878c41f39fac3aadb3832a30659c323b582a
> >
> > Thanks,
> > Amir,
> >
> > On Wed, May 29, 2019 at 8:43 PM Amir Goldstein <[email protected]>
> > wrote:
> > > Like ->write_iter(), we update mtime and strip setuid of dst file
> > > before
> > > copy and like ->read_iter(), we update atime of src file after
> > > copy.
> > >
> > > Signed-off-by: Amir Goldstein <[email protected]>
> > > ---
> > > fs/nfs/nfs42proc.c | 9 ++++++---
> > > 1 file changed, 6 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/fs/nfs/nfs42proc.c b/fs/nfs/nfs42proc.c
> > > index 5196bfa7894d..c37a8e5116c6 100644
> > > --- a/fs/nfs/nfs42proc.c
> > > +++ b/fs/nfs/nfs42proc.c
> > > @@ -345,10 +345,13 @@ ssize_t nfs42_proc_copy(struct file *src,
> > > loff_t pos_src,
> > >
> > > do {
> > > inode_lock(file_inode(dst));
> > > - err = _nfs42_proc_copy(src, src_lock,
> > > - dst, dst_lock,
> > > - &args, &res);
> > > + err = file_modified(dst);
> > > + if (!err)
> > > + err = _nfs42_proc_copy(src, src_lock,
> > > + dst, dst_lock,
> > > + &args, &res);
> > > inode_unlock(file_inode(dst));
> > > + file_accessed(src);
> > >
> > > if (err >= 0)
> > > break;
> > > --
> > > 2.17.1
> > >
>
> Please drop this patch. In the NFS protocol, the client is not expected
> to mess with atime or mtime other than when the user explicitly sets it
> through a call to utimensat() or similar. The server takes care of any
> a/mtime updates as and when the file is read/written to.
>
> Ditto for suid/sgid clearing.
>
> Motto: "No file_accessed() or file_remove_privs() calls, please. We're
> NFS."
OK.
Thanks,
Amir.
On Wed, May 29, 2019 at 11:09 PM Olga Kornievskaia
<[email protected]> wrote:
>
> On Wed, May 29, 2019 at 1:43 PM Amir Goldstein <[email protected]> wrote:
> >
> > We want to enable cross-filesystem copy_file_range functionality
> > where possible, so push the "same superblock only" checks down to
> > the individual filesystem callouts so they can make their own
> > decisions about cross-superblock copy offload and fall back to
> > generic_copy_file_range() for cross-superblock copy.
> >
> > [Amir] We do not call ->remap_file_range() in case the inodes are not
> > on the same sb and do not call ->copy_file_range() in case the inodes
> > are not on the same filesystem type.
> >
> > This changes behavior of the copy_file_range(2) syscall, which will
> > now allow cross filesystem in-kernel copy. CIFS already supports
> > cross-superblock copy, between two shares to the same server. This
> > functionality will now be available via the copy_file_range(2) syscall.
> >
> > Cc: Steve French <[email protected]>
> > Signed-off-by: Dave Chinner <[email protected]>
> > Signed-off-by: Amir Goldstein <[email protected]>
> > Reviewed-by: Darrick J. Wong <[email protected]>
> > ---
> > fs/ceph/file.c | 4 +++-
> > fs/cifs/cifsfs.c | 2 +-
> > fs/fuse/file.c | 5 ++++-
> > fs/nfs/nfs4file.c | 5 ++++-
> > fs/read_write.c | 20 ++++++++++++++------
> > 5 files changed, 26 insertions(+), 10 deletions(-)
> >
> > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > index 8a70708e1aca..e9614d686301 100644
> > --- a/fs/ceph/file.c
> > +++ b/fs/ceph/file.c
> > @@ -1909,6 +1909,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> >
> > if (src_inode == dst_inode)
> > return -EINVAL;
> > + if (src_inode->i_sb != dst_inode->i_sb)
> > + return -EXDEV;
> > if (ceph_snap(dst_inode) != CEPH_NOSNAP)
> > return -EROFS;
> >
> > @@ -2126,7 +2128,7 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
> > len, flags);
> >
> > - if (ret == -EOPNOTSUPP)
> > + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> > ret = generic_copy_file_range(src_file, src_off, dst_file,
> > dst_off, len, flags);
> > return ret;
> > diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> > index ab6c5c24146d..83956452c108 100644
> > --- a/fs/cifs/cifsfs.c
> > +++ b/fs/cifs/cifsfs.c
> > @@ -1154,7 +1154,7 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off,
> > len, flags);
> > free_xid(xid);
> >
> > - if (rc == -EOPNOTSUPP)
> > + if (rc == -EOPNOTSUPP || rc == -EXDEV)
> > rc = generic_copy_file_range(src_file, off, dst_file,
> > destoff, len, flags);
> > return rc;
> > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > index 7f33d68f66d9..eab00cd089e8 100644
> > --- a/fs/fuse/file.c
> > +++ b/fs/fuse/file.c
> > @@ -3126,6 +3126,9 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
> > if (fc->no_copy_file_range)
> > return -EOPNOTSUPP;
> >
> > + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> > + return -EXDEV;
> > +
> > inode_lock(inode_out);
> >
> > err = file_modified(file_out);
> > @@ -3187,7 +3190,7 @@ static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
> > ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
> > len, flags);
> >
> > - if (ret == -EOPNOTSUPP)
> > + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> > ret = generic_copy_file_range(src_file, src_off, dst_file,
> > dst_off, len, flags);
> > return ret;
> > diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> > index 4842f3ab3161..f4157eb1f69d 100644
> > --- a/fs/nfs/nfs4file.c
> > +++ b/fs/nfs/nfs4file.c
> > @@ -133,6 +133,9 @@ static ssize_t __nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
> > struct file *file_out, loff_t pos_out,
> > size_t count, unsigned int flags)
> > {
> > + /* Only offload copy if superblock is the same */
> > + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> > + return -EXDEV;
> > if (!nfs_server_capable(file_inode(file_out), NFS_CAP_COPY))
> > return -EOPNOTSUPP;
> > if (file_inode(file_in) == file_inode(file_out))
> > @@ -148,7 +151,7 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
> >
> > ret = __nfs4_copy_file_range(file_in, pos_in, file_out, pos_out, count,
> > flags);
> > - if (ret == -EOPNOTSUPP)
> > + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> > ret = generic_copy_file_range(file_in, pos_in, file_out,
> > pos_out, count, flags);
> > return ret;
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 706ea5f276a7..d8930bb735cb 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1618,7 +1618,18 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
> > struct file *file_out, loff_t pos_out,
> > size_t len, unsigned int flags)
> > {
> > - if (file_out->f_op->copy_file_range)
> > + /*
> > + * Although we now allow filesystems to handle cross sb copy, passing
> > + * an inode of the wrong filesystem type to filesystem operation can
> > + * often result in an attempt to dereference the wrong concrete inode
> > + * struct, so avoid doing that until we really have a good reason.
> > + * The incentive for passing inode from different sb to filesystem is
> > + * NFS cross server copy and for that use case, enforcing same
> > + * filesystem type is acceptable.
> > + */
> > + if (file_out->f_op->copy_file_range &&
> > + file_inode(file_in)->i_sb->s_type ==
> > + file_inode(file_out)->i_sb->s_type)
>
> While I'm not sure how much I care (vs wanting at least this much of
> cross device copy available) but in NFS there are several NFS
> file_system_type defined which would disallow a copy between them
> (like nfs4_remote_fs_type, nfs4_remote_referral_fs_type, and good old
> nfs4_fs_type).
>
> One idea would be to push the check into the filesystems themselves.
>
That will require more delicate patches to filesystems.
Are you saying there is a *good* reason to do that now?
Is nfs copy offload expected to be between different types of nfs
file_system_type?
Thanks,
Amir.
On Wed, May 29, 2019 at 9:33 PM Darrick J. Wong <[email protected]> wrote:
>
> On Wed, May 29, 2019 at 08:43:12PM +0300, Amir Goldstein wrote:
> > Because generic_copy_file_range doesn't hold the destination inode lock
> > throughout the copy, strip setuid bits before and after copy.
> >
> > The destination inode mtime is updated before and after the copy and the
> > source inode atime is updated after the copy, similar to
> > generic_file_read_iter().
> >
> > Signed-off-by: Amir Goldstein <[email protected]>
>
> Looks reasonable,
Actually, it isn't reasonable. I'd like to recall this patch :-/
As one might expect, splice_direct_to_actor() already calls file_accessed(),
and the file_modified() work is the responsibility of the filesystem's
->write_iter().
Thanks,
Amir.
On Wed, May 29, 2019 at 10:08:44PM +0300, Amir Goldstein wrote:
> On Wed, May 29, 2019 at 9:27 PM Darrick J. Wong <[email protected]> wrote:
> >
> > On Wed, May 29, 2019 at 08:43:10PM +0300, Amir Goldstein wrote:
> > > The combination of file_remove_privs() and file_update_mtime() is
> > > quite common in filesystem ->write_iter() methods.
> > >
> > > Modelled after the helper file_accessed(), introduce file_modified()
> > > and use it from generic_remap_file_range_prep().
> > >
> > > Note that the order of calling file_remove_privs() before
> > > file_update_mtime() in the helper was matched to the more common order by
> > > filesystems and not the current order in generic_remap_file_range_prep().
> > >
> > > Signed-off-by: Amir Goldstein <[email protected]>
> > > ---
> > > fs/inode.c | 20 ++++++++++++++++++++
> > > fs/read_write.c | 21 +++------------------
> > > include/linux/fs.h | 2 ++
> > > 3 files changed, 25 insertions(+), 18 deletions(-)
> > >
> > > diff --git a/fs/inode.c b/fs/inode.c
> > > index df6542ec3b88..2885f2f2c7a5 100644
> > > --- a/fs/inode.c
> > > +++ b/fs/inode.c
> > > @@ -1899,6 +1899,26 @@ int file_update_time(struct file *file)
> > > }
> > > EXPORT_SYMBOL(file_update_time);
> > >
> > > +/* Caller must hold the file's inode lock */
> > > +int file_modified(struct file *file)
> > > +{
> > > + int err;
> > > +
> > > + /*
> > > + * Clear the security bits if the process is not being run by root.
> > > + * This keeps people from modifying setuid and setgid binaries.
> > > + */
> > > + err = file_remove_privs(file);
> > > + if (err)
> > > + return err;
> > > +
> > > + if (likely(file->f_mode & FMODE_NOCMTIME))
> >
> > I would not have thought NOCMTIME is likely?
> >
> > Maybe it is for io requests coming from overlayfs, but for regular uses
> > I don't think that's true.
>
> Nope that's a typo. Good spotting.
> Overlayfs doesn't set FMODE_NOCMTIME (yet). Only xfs does from
> XFS_IOC_OPEN_BY_HANDLE, but I think Dave said that is a deprecated
> API. so should have been very_unlikely().
It is most definitely not a deprecated API. I don't know where you
got that idea from. It's used explicitly by the xfs utilities to
perform invisible IO. Anyone who runs xfs_fsr or xfsdump or has an
application that links to libhandle is using XFS_IOC_OPEN_BY_HANDLE
and FMODE_NOCMTIME....
-Dave.
--
Dave Chinner
[email protected]
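To make the likely()/unlikely() exchange above easier to follow, here is a
userspace model of what the helper is meant to do. It is a sketch under
stated assumptions, not the kernel code: struct model_file, MODEL_SETUID,
MODEL_NOCMTIME, model_remove_privs() and model_file_modified() are all
hypothetical stand-ins, and the model assumes that a file opened with the
NOCMTIME flag simply skips the timestamp update.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* GCC/Clang branch hint, mirroring the kernel's unlikely() macro. */
#define unlikely(x) __builtin_expect(!!(x), 0)

#define MODEL_SETUID   0x1u	/* stand-in for the setuid bit on the inode */
#define MODEL_NOCMTIME 0x2u	/* stand-in for FMODE_NOCMTIME on the file */

struct model_file {
	unsigned int inode_bits;	/* privilege bits of the backing inode */
	unsigned int f_mode;		/* per-open mode flags */
	time_t mtime;			/* last modification time */
};

/* Stand-in for file_remove_privs(); it always succeeds in this model. */
static int model_remove_privs(struct model_file *f)
{
	f->inode_bits &= ~MODEL_SETUID;
	return 0;
}

/*
 * Model of the helper: strip privileges first, then update mtime unless
 * the file was opened for "invisible" I/O (the rare case, hence unlikely).
 */
static int model_file_modified(struct model_file *f, time_t now)
{
	int err;

	assert(f != NULL);

	err = model_remove_privs(f);
	if (err)
		return err;

	if (unlikely(f->f_mode & MODEL_NOCMTIME))
		return 0;

	f->mtime = now;
	return 0;
}
```

The branch hint only affects code layout; the behaviour is the same either
way, which is why the mistake is easy to miss in review.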
On Wed, May 29, 2019 at 5:03 PM Amir Goldstein <[email protected]> wrote:
>
> On Wed, May 29, 2019 at 11:09 PM Olga Kornievskaia
> <[email protected]> wrote:
> >
> > On Wed, May 29, 2019 at 1:43 PM Amir Goldstein <[email protected]> wrote:
> > >
> > > We want to enable cross-filesystem copy_file_range functionality
> > > where possible, so push the "same superblock only" checks down to
> > > the individual filesystem callouts so they can make their own
> > > decisions about cross-superblock copy offload and fall back to
> > > generic_copy_file_range() for cross-superblock copy.
> > >
> > > [Amir] We do not call ->remap_file_range() in case the inodes are not
> > > on the same sb and do not call ->copy_file_range() in case the inodes
> > > are not on the same filesystem type.
> > >
> > > This changes behavior of the copy_file_range(2) syscall, which will
> > > now allow cross filesystem in-kernel copy. CIFS already supports
> > > cross-superblock copy, between two shares to the same server. This
> > > functionality will now be available via the copy_file_range(2) syscall.
> > >
> > > Cc: Steve French <[email protected]>
> > > Signed-off-by: Dave Chinner <[email protected]>
> > > Signed-off-by: Amir Goldstein <[email protected]>
> > > Reviewed-by: Darrick J. Wong <[email protected]>
> > > ---
> > > fs/ceph/file.c | 4 +++-
> > > fs/cifs/cifsfs.c | 2 +-
> > > fs/fuse/file.c | 5 ++++-
> > > fs/nfs/nfs4file.c | 5 ++++-
> > > fs/read_write.c | 20 ++++++++++++++------
> > > 5 files changed, 26 insertions(+), 10 deletions(-)
> > >
> > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > index 8a70708e1aca..e9614d686301 100644
> > > --- a/fs/ceph/file.c
> > > +++ b/fs/ceph/file.c
> > > @@ -1909,6 +1909,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > >
> > > if (src_inode == dst_inode)
> > > return -EINVAL;
> > > + if (src_inode->i_sb != dst_inode->i_sb)
> > > + return -EXDEV;
> > > if (ceph_snap(dst_inode) != CEPH_NOSNAP)
> > > return -EROFS;
> > >
> > > @@ -2126,7 +2128,7 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
> > > len, flags);
> > >
> > > - if (ret == -EOPNOTSUPP)
> > > + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> > > ret = generic_copy_file_range(src_file, src_off, dst_file,
> > > dst_off, len, flags);
> > > return ret;
> > > diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> > > index ab6c5c24146d..83956452c108 100644
> > > --- a/fs/cifs/cifsfs.c
> > > +++ b/fs/cifs/cifsfs.c
> > > @@ -1154,7 +1154,7 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off,
> > > len, flags);
> > > free_xid(xid);
> > >
> > > - if (rc == -EOPNOTSUPP)
> > > + if (rc == -EOPNOTSUPP || rc == -EXDEV)
> > > rc = generic_copy_file_range(src_file, off, dst_file,
> > > destoff, len, flags);
> > > return rc;
> > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > index 7f33d68f66d9..eab00cd089e8 100644
> > > --- a/fs/fuse/file.c
> > > +++ b/fs/fuse/file.c
> > > @@ -3126,6 +3126,9 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
> > > if (fc->no_copy_file_range)
> > > return -EOPNOTSUPP;
> > >
> > > + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> > > + return -EXDEV;
> > > +
> > > inode_lock(inode_out);
> > >
> > > err = file_modified(file_out);
> > > @@ -3187,7 +3190,7 @@ static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
> > > ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
> > > len, flags);
> > >
> > > - if (ret == -EOPNOTSUPP)
> > > + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> > > ret = generic_copy_file_range(src_file, src_off, dst_file,
> > > dst_off, len, flags);
> > > return ret;
> > > diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> > > index 4842f3ab3161..f4157eb1f69d 100644
> > > --- a/fs/nfs/nfs4file.c
> > > +++ b/fs/nfs/nfs4file.c
> > > @@ -133,6 +133,9 @@ static ssize_t __nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
> > > struct file *file_out, loff_t pos_out,
> > > size_t count, unsigned int flags)
> > > {
> > > + /* Only offload copy if superblock is the same */
> > > + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> > > + return -EXDEV;
> > > if (!nfs_server_capable(file_inode(file_out), NFS_CAP_COPY))
> > > return -EOPNOTSUPP;
> > > if (file_inode(file_in) == file_inode(file_out))
> > > @@ -148,7 +151,7 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
> > >
> > > ret = __nfs4_copy_file_range(file_in, pos_in, file_out, pos_out, count,
> > > flags);
> > > - if (ret == -EOPNOTSUPP)
> > > + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> > > ret = generic_copy_file_range(file_in, pos_in, file_out,
> > > pos_out, count, flags);
> > > return ret;
> > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > index 706ea5f276a7..d8930bb735cb 100644
> > > --- a/fs/read_write.c
> > > +++ b/fs/read_write.c
> > > @@ -1618,7 +1618,18 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
> > > struct file *file_out, loff_t pos_out,
> > > size_t len, unsigned int flags)
> > > {
> > > - if (file_out->f_op->copy_file_range)
> > > + /*
> > > + * Although we now allow filesystems to handle cross sb copy, passing
> > > + * an inode of the wrong filesystem type to filesystem operation can
> > > + * often result in an attempt to dereference the wrong concrete inode
> > > + * struct, so avoid doing that until we really have a good reason.
> > > + * The incentive for passing inode from different sb to filesystem is
> > > + * NFS cross server copy and for that use case, enforcing same
> > > + * filesystem type is acceptable.
> > > + */
> > > + if (file_out->f_op->copy_file_range &&
> > > + file_inode(file_in)->i_sb->s_type ==
> > > + file_inode(file_out)->i_sb->s_type)
> >
> > While I'm not sure how much I care (vs wanting at least this much of
> > cross device copy available) but in NFS there are several NFS
> > file_system_type defined which would disallow a copy between them
> > (like nfs4_remote_fs_type, nfs4_remote_referral_fs_type, and good old
> > nfs4_fs_type).
> >
> > One idea would be to push the check into the filesystems themselves.
> >
>
> That will require more delicate patches to filesystems.
> Are you saying there is a *good* reason to do that now?
> Is nfs copy offload expected to be between different types of nfs
> file_system_type?
So I had to set up a test case to perhaps give you a good reason. An
NFS server might have an export that's a referral to another server.
In this case the NFS client gets an ERR_MOVED and mounts the 2nd
server. It shows up as a submount with a different file system
type. Having that check would prevent the NFS client from doing
an NFS copy_file_range between those 2 servers; instead VFS would
fall back to generic_copy_file_range().
So why is it hard(er) to push the check that the ->s_type values are
the same for the input and output files into the filesystems, as this
patch does with the ->i_sb checks?
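The two placements of the check being debated can be compared with a small
userspace model. Everything here is hypothetical and only illustrates the
control flow: the model_* types, the two "nfs4-like" filesystem types and
the is_nfs_family() test are assumptions made for the sketch, not kernel
structures or an actual NFS implementation.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct model_fs_type { const char *name; };
struct model_sb { const struct model_fs_type *s_type; };

struct model_file;
typedef long (*model_copy_op)(struct model_file *in, struct model_file *out);

struct model_file {
	const struct model_sb *sb;
	model_copy_op copy_file_range;	/* NULL if the fs has no offload */
};

enum { PATH_GENERIC = 1, PATH_OFFLOAD = 2 };

static const struct model_fs_type nfs4_like = { "nfs4-like" };
static const struct model_fs_type nfs4_referral_like = { "nfs4-referral-like" };
static const struct model_fs_type other_fs = { "other" };

/* Stand-in for the generic in-kernel copy path. */
static long model_generic_copy(struct model_file *in, struct model_file *out)
{
	(void)in;
	(void)out;
	return PATH_GENERIC;
}

static int is_nfs_family(const struct model_fs_type *t)
{
	return t == &nfs4_like || t == &nfs4_referral_like;
}

/* Filesystem op that validates both files itself before touching them. */
static long model_nfs_copy(struct model_file *in, struct model_file *out)
{
	if (!is_nfs_family(in->sb->s_type) || !is_nfs_family(out->sb->s_type))
		return -EXDEV;
	return PATH_OFFLOAD;
}

/* Variant A: the VFS enforces identical s_type, as in the quoted hunk. */
static long dispatch_vfs_check(struct model_file *in, struct model_file *out)
{
	assert(in && out);
	if (out->copy_file_range && in->sb->s_type == out->sb->s_type)
		return out->copy_file_range(in, out);
	return model_generic_copy(in, out);
}

/* Variant B: the filesystem decides and asks for fallback via an error. */
static long dispatch_fs_check(struct model_file *in, struct model_file *out)
{
	long ret = -EOPNOTSUPP;

	assert(in && out);
	if (out->copy_file_range)
		ret = out->copy_file_range(in, out);
	if (ret == -EOPNOTSUPP || ret == -EXDEV)
		ret = model_generic_copy(in, out);
	return ret;
}
```

In this model the referral case takes the generic path under variant A and
the offload path under variant B. The cost of variant B is that every
filesystem op must reject foreign files before dereferencing any of its own
private data, which is the risk the comment in the quoted hunk describes.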
On Mon, Jun 3, 2019 at 11:39 PM Olga Kornievskaia
<[email protected]> wrote:
>
> On Wed, May 29, 2019 at 5:03 PM Amir Goldstein <[email protected]> wrote:
> >
> > On Wed, May 29, 2019 at 11:09 PM Olga Kornievskaia
> > <[email protected]> wrote:
> > >
> > > On Wed, May 29, 2019 at 1:43 PM Amir Goldstein <[email protected]> wrote:
> > > >
> > > > We want to enable cross-filesystem copy_file_range functionality
> > > > where possible, so push the "same superblock only" checks down to
> > > > the individual filesystem callouts so they can make their own
> > > > decisions about cross-superblock copy offload and fall back to
> > > > generic_copy_file_range() for cross-superblock copy.
> > > >
> > > > [Amir] We do not call ->remap_file_range() in case the inodes are not
> > > > on the same sb and do not call ->copy_file_range() in case the inodes
> > > > are not on the same filesystem type.
> > > >
> > > > This changes behavior of the copy_file_range(2) syscall, which will
> > > > now allow cross filesystem in-kernel copy. CIFS already supports
> > > > cross-superblock copy, between two shares to the same server. This
> > > > functionality will now be available via the copy_file_range(2) syscall.
> > > >
> > > > Cc: Steve French <[email protected]>
> > > > Signed-off-by: Dave Chinner <[email protected]>
> > > > Signed-off-by: Amir Goldstein <[email protected]>
> > > > Reviewed-by: Darrick J. Wong <[email protected]>
> > > > ---
> > > > fs/ceph/file.c | 4 +++-
> > > > fs/cifs/cifsfs.c | 2 +-
> > > > fs/fuse/file.c | 5 ++++-
> > > > fs/nfs/nfs4file.c | 5 ++++-
> > > > fs/read_write.c | 20 ++++++++++++++------
> > > > 5 files changed, 26 insertions(+), 10 deletions(-)
> > > >
> > > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> > > > index 8a70708e1aca..e9614d686301 100644
> > > > --- a/fs/ceph/file.c
> > > > +++ b/fs/ceph/file.c
> > > > @@ -1909,6 +1909,8 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > >
> > > > if (src_inode == dst_inode)
> > > > return -EINVAL;
> > > > + if (src_inode->i_sb != dst_inode->i_sb)
> > > > + return -EXDEV;
> > > > if (ceph_snap(dst_inode) != CEPH_NOSNAP)
> > > > return -EROFS;
> > > >
> > > > @@ -2126,7 +2128,7 @@ static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
> > > > ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
> > > > len, flags);
> > > >
> > > > - if (ret == -EOPNOTSUPP)
> > > > + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> > > > ret = generic_copy_file_range(src_file, src_off, dst_file,
> > > > dst_off, len, flags);
> > > > return ret;
> > > > diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> > > > index ab6c5c24146d..83956452c108 100644
> > > > --- a/fs/cifs/cifsfs.c
> > > > +++ b/fs/cifs/cifsfs.c
> > > > @@ -1154,7 +1154,7 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off,
> > > > len, flags);
> > > > free_xid(xid);
> > > >
> > > > - if (rc == -EOPNOTSUPP)
> > > > + if (rc == -EOPNOTSUPP || rc == -EXDEV)
> > > > rc = generic_copy_file_range(src_file, off, dst_file,
> > > > destoff, len, flags);
> > > > return rc;
> > > > diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> > > > index 7f33d68f66d9..eab00cd089e8 100644
> > > > --- a/fs/fuse/file.c
> > > > +++ b/fs/fuse/file.c
> > > > @@ -3126,6 +3126,9 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
> > > > if (fc->no_copy_file_range)
> > > > return -EOPNOTSUPP;
> > > >
> > > > + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> > > > + return -EXDEV;
> > > > +
> > > > inode_lock(inode_out);
> > > >
> > > > err = file_modified(file_out);
> > > > @@ -3187,7 +3190,7 @@ static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
> > > > ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
> > > > len, flags);
> > > >
> > > > - if (ret == -EOPNOTSUPP)
> > > > + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> > > > ret = generic_copy_file_range(src_file, src_off, dst_file,
> > > > dst_off, len, flags);
> > > > return ret;
> > > > diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> > > > index 4842f3ab3161..f4157eb1f69d 100644
> > > > --- a/fs/nfs/nfs4file.c
> > > > +++ b/fs/nfs/nfs4file.c
> > > > @@ -133,6 +133,9 @@ static ssize_t __nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
> > > > struct file *file_out, loff_t pos_out,
> > > > size_t count, unsigned int flags)
> > > > {
> > > > + /* Only offload copy if superblock is the same */
> > > > + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> > > > + return -EXDEV;
> > > > if (!nfs_server_capable(file_inode(file_out), NFS_CAP_COPY))
> > > > return -EOPNOTSUPP;
> > > > if (file_inode(file_in) == file_inode(file_out))
> > > > @@ -148,7 +151,7 @@ static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
> > > >
> > > > ret = __nfs4_copy_file_range(file_in, pos_in, file_out, pos_out, count,
> > > > flags);
> > > > - if (ret == -EOPNOTSUPP)
> > > > + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> > > > ret = generic_copy_file_range(file_in, pos_in, file_out,
> > > > pos_out, count, flags);
> > > > return ret;
> > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > index 706ea5f276a7..d8930bb735cb 100644
> > > > --- a/fs/read_write.c
> > > > +++ b/fs/read_write.c
> > > > @@ -1618,7 +1618,18 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
> > > > struct file *file_out, loff_t pos_out,
> > > > size_t len, unsigned int flags)
> > > > {
> > > > - if (file_out->f_op->copy_file_range)
> > > > + /*
> > > > + * Although we now allow filesystems to handle cross sb copy, passing
> > > > + * an inode of the wrong filesystem type to filesystem operation can
> > > > + * often result in an attempt to dereference the wrong concrete inode
> > > > + * struct, so avoid doing that until we really have a good reason.
> > > > + * The incentive for passing inode from different sb to filesystem is
> > > > + * NFS cross server copy and for that use case, enforcing same
> > > > + * filesystem type is acceptable.
> > > > + */
> > > > + if (file_out->f_op->copy_file_range &&
> > > > + file_inode(file_in)->i_sb->s_type ==
> > > > + file_inode(file_out)->i_sb->s_type)
I can change that to:
+ if (file_out->f_op->copy_file_range &&
+ file_out->f_op->copy_file_range ==
+ file_in->f_op->copy_file_range)
That should be fine for nfs, which uses the same ->copy_file_range()
method for all of its different s_type values.
> > >
> > > While I'm not sure how much I care (vs wanting at least this much of
> > > cross device copy available) but in NFS there are several NFS
> > > file_system_type defined which would disallow a copy between them
> > > (like nfs4_remote_fs_type, nfs4_remote_referral_fs_type, and good old
> > > nfs4_fs_type).
> > >
> > > One idea would be to push the check into the filesystems themselves.
> > >
> >
> > That will require more delicate patches to filesystems.
> > Are you saying there is a *good* reason to do that now?
> > Is nfs copy offload expected to be between different types of nfs
> > file_system_type?
>
> So I had to set up a test case to perhaps give you a good reason. An
> NFS server might have an export that's a referral to another server.
> In this case the NFS client gets an ERR_MOVED and would mount the 2nd
> server. It shows up as a submount and it would have a different file
> system type. Having that check would prevent the NFS client from doing
> an NFS copy_file_range between those 2 servers and instead VFS would
> fall back to generic_copy_file_range().
>
> So why is it hard(er) to push the check that the ->s_type values are the
> same for the input and output files into the filesystems, like it's done
> in this patch with the ->i_sb checks?
It's not harder, it's fragile.
See, in vfs, file_in and file_out are abstract file objects and only generic
file interfaces are used. This is no longer the case inside filesystem code.
If passed a file object of a different fs, almost every filesystem in question
will barf (as you noticed yourself), because it tries to access fs-specific
file object fields:
ssize_t nfs42_proc_copy(struct file *src, loff_t pos_src,
			struct file *dst, loff_t pos_dst,
			size_t count)
{
	struct nfs_server *server = NFS_SERVER(file_inode(dst));

ssize_t cifs_file_copychunk_range(unsigned int xid,
			struct file *src_file, loff_t off,
			struct file *dst_file, loff_t destoff,
			size_t len, unsigned int flags)
{
	struct inode *src_inode = file_inode(src_file);
	...
	if (!src_file->private_data || !dst_file->private_data) {
		rc = -EBADF;
		cifs_dbg(VFS, "missing cifsFileInfo on copy range src file\n");
		goto out;
	}
	...
	smb_file_src = src_file->private_data;
	src_tcon = tlink_tcon(smb_file_src->tlink);
So first of all, all the present filesystems that support copy_file_range
will have to be sanitized to verify that file_in is an object of their own
type before dereferencing private fields. That's doable, but it still leaves
future filesystems vulnerable to making this mistake. It is also quite
fragile to test, because file objects coming from most filesystems will have
NULL private_data, so it may take a very specific type of input file to
reveal bugs in implementations.
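
To make the sanitizing point concrete, here is a minimal userspace sketch of
the kind of guard each filesystem would need. It is not kernel code: the
struct definitions are simplified stand-ins, and "demofs", demofs_file_info
and demofs_fops are hypothetical names invented for illustration.

```c
#include <errno.h>
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel types (assumption). */
struct file;
struct file_operations {
	long (*copy_file_range)(struct file *src, struct file *dst, size_t len);
};
struct file {
	const struct file_operations *f_op;
	void *private_data;	/* only meaningful to the owning filesystem */
};

/* Hypothetical per-open state of an imaginary "demofs". */
struct demofs_file_info {
	int tcon_id;
};

static long demofs_copy_file_range(struct file *src, struct file *dst,
				   size_t len);

static const struct file_operations demofs_fops = {
	.copy_file_range = demofs_copy_file_range,
};

static long demofs_copy_file_range(struct file *src, struct file *dst,
				   size_t len)
{
	struct demofs_file_info *info;

	/* Refuse foreign files before touching any private field. */
	if (src->f_op != &demofs_fops || dst->f_op != &demofs_fops)
		return -EXDEV;

	if (!src->private_data || !dst->private_data)
		return -EBADF;

	info = src->private_data;	/* safe: src is known to be ours */
	(void)info->tcon_id;
	return (long)len;		/* pretend the whole range was copied */
}
```

Without the first check, a foreign file with non-NULL private_data would pass
the -EBADF test and then be misread as a demofs_file_info, which is exactly
the hard-to-trigger bug described above.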
All in all, I prefer to keep it safe. As long as we don't really need
filesystems to handle copy of file_in from a different fs, let's keep it
simple and resolve the specific nfs issue by testing for the same
->copy_file_range pointer.
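
As a rough illustration of that check, here is a userspace sketch of the
dispatch logic. The types are simplified stand-ins, and nfs_like_copy(),
other_copy(), generic_copy() and the three fops tables are hypothetical names,
not the real kernel symbols.

```c
#include <stddef.h>

/* Simplified userspace stand-ins for the kernel types (assumption). */
struct file;
struct file_operations {
	long (*copy_file_range)(struct file *in, struct file *out, size_t len);
};
struct file {
	const struct file_operations *f_op;
};

static int offload_calls;	/* counts calls into the fs-specific method */

/* Stand-in for generic_copy_file_range(), the in-kernel fallback copy. */
static long generic_copy(struct file *in, struct file *out, size_t len)
{
	(void)in; (void)out;
	return (long)len;
}

/* One offload method shared by several nfs-like file types. */
static long nfs_like_copy(struct file *in, struct file *out, size_t len)
{
	(void)in; (void)out;
	offload_calls++;
	return (long)len;
}

/* Offload method of an unrelated filesystem. */
static long other_copy(struct file *in, struct file *out, size_t len)
{
	(void)in; (void)out;
	offload_calls++;
	return (long)len;
}

/* Two distinct "types" that share one method, plus a foreign one. */
static const struct file_operations nfs_a_fops = { nfs_like_copy };
static const struct file_operations nfs_b_fops = { nfs_like_copy };
static const struct file_operations other_fops = { other_copy };

static long do_copy(struct file *in, struct file *out, size_t len)
{
	/* Only call into the filesystem if both files use the same method. */
	if (out->f_op->copy_file_range &&
	    out->f_op->copy_file_range == in->f_op->copy_file_range)
		return out->f_op->copy_file_range(in, out, len);

	return generic_copy(in, out, len);
}
```

Files backed by nfs_a_fops and nfs_b_fops reach the shared method even though
their tables differ, while a file backed by other_fops is never handed to
nfs_like_copy() and goes through the generic path instead.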
Thanks,
Amir.