2021-02-18 17:05:20

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH v5] vfs: fix copy_file_range regression in cross-fs copies

On Thu, Feb 18, 2021 at 5:16 PM Luis Henriques <[email protected]> wrote:
>
> A regression has been reported by Nicolas Boichat, found while using the
> copy_file_range syscall to copy a tracefs file. Before commit
> 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
> kernel would return -EXDEV to userspace when trying to copy a file across
> different filesystems. After this commit, the syscall doesn't fail anymore
> and instead returns zero (zero bytes copied), as this file's content is
> generated on-the-fly and thus reports a size of zero.
>
> This patch restores some cross-filesystem copy restrictions that existed
> prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> devices"). Filesystems are still allowed to fall-back to the VFS
> generic_copy_file_range() implementation, but that has now to be done
> explicitly.
>
> nfsd is also modified to fall-back into generic_copy_file_range() in case
> vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
>
> Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
> Link: https://lore.kernel.org/linux-fsdevel/[email protected]om/
> Link: https://lore.kernel.org/linux-fsdevel/[email protected]eid/
> Reported-by: Nicolas Boichat <[email protected]>
> Signed-off-by: Luis Henriques <[email protected]>
> ---
> And v5! Sorry. Sure, it makes sense to go through the all the vfs_cfr()
> checks first.

You missed my other comment on v4...

not checking NULL copy_file_range case.

Thanks,
Amir.


2021-02-18 17:05:41

by Luis Henriques

[permalink] [raw]
Subject: Re: [PATCH v5] vfs: fix copy_file_range regression in cross-fs copies

Amir Goldstein <[email protected]> writes:

> On Thu, Feb 18, 2021 at 5:16 PM Luis Henriques <[email protected]> wrote:
>>
>> A regression has been reported by Nicolas Boichat, found while using the
>> copy_file_range syscall to copy a tracefs file. Before commit
>> 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
>> kernel would return -EXDEV to userspace when trying to copy a file across
>> different filesystems. After this commit, the syscall doesn't fail anymore
>> and instead returns zero (zero bytes copied), as this file's content is
>> generated on-the-fly and thus reports a size of zero.
>>
>> This patch restores some cross-filesystem copy restrictions that existed
>> prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
>> devices"). Filesystems are still allowed to fall-back to the VFS
>> generic_copy_file_range() implementation, but that has now to be done
>> explicitly.
>>
>> nfsd is also modified to fall-back into generic_copy_file_range() in case
>> vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
>>
>> Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
>> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
>> Link: https://lore.kernel.org/linux-fsdevel/[email protected]om/
>> Link: https://lore.kernel.org/linux-fsdevel/[email protected]eid/
>> Reported-by: Nicolas Boichat <dri[email protected]>
>> Signed-off-by: Luis Henriques <[email protected]>
>> ---
>> And v5! Sorry. Sure, it makes sense to go through the all the vfs_cfr()
>> checks first.
>
> You missed my other comment on v4...
>
> not checking NULL copy_file_range case.

Ah, yeah I did missed it. I'll follow up with yet another revision.

Cheers,
--
Luis

2021-02-18 19:03:18

by Luis Henriques

[permalink] [raw]
Subject: [PATCH v6] vfs: fix copy_file_range regression in cross-fs copies

A regression has been reported by Nicolas Boichat, found while using the
copy_file_range syscall to copy a tracefs file. Before commit
5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
kernel would return -EXDEV to userspace when trying to copy a file across
different filesystems. After this commit, the syscall doesn't fail anymore
and instead returns zero (zero bytes copied), as this file's content is
generated on-the-fly and thus reports a size of zero.

This patch restores some cross-filesystem copy restrictions that existed
prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
devices"). Filesystems are still allowed to fall-back to the VFS
generic_copy_file_range() implementation, but that has now to be done
explicitly.

nfsd is also modified to fall-back into generic_copy_file_range() in case
vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.

Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
Link: https://lore.kernel.org/linux-fsdevel/[email protected]om/
Link: https://lore.kernel.org/linux-fsdevel/[email protected]eid/
Reported-by: Nicolas Boichat <[email protected]>
Signed-off-by: Luis Henriques <[email protected]>
---
And v6 is upon us. Behold!

Changes since v5
- check if ->copy_file_range is NULL before calling it
Changes since v4
- nfsd falls-back to generic_copy_file_range() only *if* it gets -EOPNOTSUPP
or -EXDEV.
Changes since v3
- dropped the COPY_FILE_SPLICE flag
- kept the f_op's checks early in generic_copy_file_checks, implementing
Amir's suggestions
- modified nfsd to use generic_copy_file_range()
Changes since v2
- do all the required checks earlier, in generic_copy_file_checks(),
adding new checks for ->remap_file_range
- new COPY_FILE_SPLICE flag
- don't remove filesystem's fallback to generic_copy_file_range()
- updated commit changelog (and subject)
Changes since v1 (after Amir review)
- restored do_copy_file_range() helper
- return -EOPNOTSUPP if fs doesn't implement CFR
- updated commit description

fs/nfsd/vfs.c | 8 +++++++-
fs/read_write.c | 53 ++++++++++++++++++++++++-------------------------
2 files changed, 33 insertions(+), 28 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 04937e51de56..23dab0fa9087 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file *nf_src, u64 src_pos,
ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
u64 dst_pos, u64 count)
{
+ ssize_t ret;

/*
* Limit copy to 4MB to prevent indefinitely blocking an nfsd
@@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
* limit like this and pipeline multiple COPY requests.
*/
count = min_t(u64, count, 1 << 22);
- return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+ ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+
+ if (ret == -EOPNOTSUPP || ret == -EXDEV)
+ ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
+ count, 0);
+ return ret;
}

__be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..0348aaa9e237 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
}
EXPORT_SYMBOL(generic_copy_file_range);

-static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- size_t len, unsigned int flags)
-{
- /*
- * Although we now allow filesystems to handle cross sb copy, passing
- * a file of the wrong filesystem type to filesystem driver can result
- * in an attempt to dereference the wrong type of ->private_data, so
- * avoid doing that until we really have a good reason. NFS defines
- * several different file_system_type structures, but they all end up
- * using the same ->copy_file_range() function pointer.
- */
- if (file_out->f_op->copy_file_range &&
- file_out->f_op->copy_file_range == file_in->f_op->copy_file_range)
- return file_out->f_op->copy_file_range(file_in, pos_in,
- file_out, pos_out,
- len, flags);
-
- return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
- flags);
-}
-
/*
* Performs necessary checks before doing a file copy
*
@@ -1427,6 +1405,25 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
loff_t size_in;
int ret;

+ /*
+ * Although we now allow filesystems to handle cross sb copy, passing
+ * a file of the wrong filesystem type to filesystem driver can result
+ * in an attempt to dereference the wrong type of ->private_data, so
+ * avoid doing that until we really have a good reason. NFS defines
+ * several different file_system_type structures, but they all end up
+ * using the same ->copy_file_range() function pointer.
+ */
+ if (file_out->f_op->copy_file_range) {
+ if (file_in->f_op->copy_file_range !=
+ file_out->f_op->copy_file_range)
+ return -EXDEV;
+ } else if (file_in->f_op->remap_file_range) {
+ if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
+ return -EXDEV;
+ } else {
+ return -EOPNOTSUPP;
+ }
+
ret = generic_file_rw_checks(file_in, file_out);
if (ret)
return ret;
@@ -1499,8 +1496,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
* Try cloning first, this is supported by more file systems, and
* more efficient if both clone and copy are supported (e.g. NFS).
*/
- if (file_in->f_op->remap_file_range &&
- file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) {
+ if (file_in->f_op->remap_file_range) {
loff_t cloned;

cloned = file_in->f_op->remap_file_range(file_in, pos_in,
@@ -1511,11 +1507,14 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
ret = cloned;
goto done;
}
+ /* Resort to copy_file_range if implemented. */
+ ret = -EOPNOTSUPP;
}

- ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len,
- flags);
- WARN_ON_ONCE(ret == -EOPNOTSUPP);
+ if (file_out->f_op->copy_file_range)
+ ret = file_out->f_op->copy_file_range(file_in, pos_in,
+ file_out, pos_out,
+ len, flags);
done:
if (ret > 0) {
fsnotify_access(file_in);

2021-02-19 21:20:34

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH v6] vfs: fix copy_file_range regression in cross-fs copies

On Thu, Feb 18, 2021 at 12:33 PM Luis Henriques <[email protected]> wrote:
>
> A regression has been reported by Nicolas Boichat, found while using the
> copy_file_range syscall to copy a tracefs file. Before commit
> 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
> kernel would return -EXDEV to userspace when trying to copy a file across
> different filesystems. After this commit, the syscall doesn't fail anymore
> and instead returns zero (zero bytes copied), as this file's content is
> generated on-the-fly and thus reports a size of zero.
>
> This patch restores some cross-filesystem copy restrictions that existed
> prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> devices"). Filesystems are still allowed to fall-back to the VFS
> generic_copy_file_range() implementation, but that has now to be done
> explicitly.
>
> nfsd is also modified to fall-back into generic_copy_file_range() in case
> vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
>
> Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
> Link: https://lore.kernel.org/linux-fsdevel/[email protected]om/
> Link: https://lore.kernel.org/linux-fsdevel/[email protected]eid/
> Reported-by: Nicolas Boichat <[email protected]>
> Signed-off-by: Luis Henriques <[email protected]>
> ---
> And v6 is upon us. Behold!


> Changes since v5
> - check if ->copy_file_range is NULL before calling it
> Changes since v4
> - nfsd falls-back to generic_copy_file_range() only *if* it gets -EOPNOTSUPP
> or -EXDEV.
> Changes since v3
> - dropped the COPY_FILE_SPLICE flag
> - kept the f_op's checks early in generic_copy_file_checks, implementing
> Amir's suggestions
> - modified nfsd to use generic_copy_file_range()
> Changes since v2
> - do all the required checks earlier, in generic_copy_file_checks(),
> adding new checks for ->remap_file_range
> - new COPY_FILE_SPLICE flag
> - don't remove filesystem's fallback to generic_copy_file_range()
> - updated commit changelog (and subject)
> Changes since v1 (after Amir review)
> - restored do_copy_file_range() helper
> - return -EOPNOTSUPP if fs doesn't implement CFR
> - updated commit description
>
> fs/nfsd/vfs.c | 8 +++++++-
> fs/read_write.c | 53 ++++++++++++++++++++++++-------------------------
> 2 files changed, 33 insertions(+), 28 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 04937e51de56..23dab0fa9087 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file *nf_src, u64 src_pos,
> ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
> u64 dst_pos, u64 count)
> {
> + ssize_t ret;
>
> /*
> * Limit copy to 4MB to prevent indefinitely blocking an nfsd
> @@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
> * limit like this and pipeline multiple COPY requests.
> */
> count = min_t(u64, count, 1 << 22);
> - return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
> + ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
> +
> + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> + ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
> + count, 0);
> + return ret;
> }
>
> __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 75f764b43418..0348aaa9e237 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
> }
> EXPORT_SYMBOL(generic_copy_file_range);
>
> -static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
> - struct file *file_out, loff_t pos_out,
> - size_t len, unsigned int flags)
> -{
> - /*
> - * Although we now allow filesystems to handle cross sb copy, passing
> - * a file of the wrong filesystem type to filesystem driver can result
> - * in an attempt to dereference the wrong type of ->private_data, so
> - * avoid doing that until we really have a good reason. NFS defines
> - * several different file_system_type structures, but they all end up
> - * using the same ->copy_file_range() function pointer.
> - */
> - if (file_out->f_op->copy_file_range &&
> - file_out->f_op->copy_file_range == file_in->f_op->copy_file_range)
> - return file_out->f_op->copy_file_range(file_in, pos_in,
> - file_out, pos_out,
> - len, flags);
> -
> - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> - flags);
> -}
> -
> /*
> * Performs necessary checks before doing a file copy
> *
> @@ -1427,6 +1405,25 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> loff_t size_in;
> int ret;
>
> + /*
> + * Although we now allow filesystems to handle cross sb copy, passing
> + * a file of the wrong filesystem type to filesystem driver can result
> + * in an attempt to dereference the wrong type of ->private_data, so
> + * avoid doing that until we really have a good reason. NFS defines
> + * several different file_system_type structures, but they all end up
> + * using the same ->copy_file_range() function pointer.
> + */
> + if (file_out->f_op->copy_file_range) {
> + if (file_in->f_op->copy_file_range !=
> + file_out->f_op->copy_file_range)
> + return -EXDEV;
> + } else if (file_in->f_op->remap_file_range) {
> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> + return -EXDEV;
> + } else {
> + return -EOPNOTSUPP;
> + }
> +
> ret = generic_file_rw_checks(file_in, file_out);
> if (ret)
> return ret;
> @@ -1499,8 +1496,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> * Try cloning first, this is supported by more file systems, and
> * more efficient if both clone and copy are supported (e.g. NFS).
> */
> - if (file_in->f_op->remap_file_range &&
> - file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) {
> + if (file_in->f_op->remap_file_range) {
> loff_t cloned;

This chunk breaks NFS. You are removing the check that the source and
destination for the CLONE operation are the same superblock and that
leads to the fact that when NFS does a copy between 2 different NFS
servers, it would try CLONE first which is not allowed. NFS relied on
this check to be done by the VFS layer. Either don't remove it or,
otherwise, fix the NFS clone's code to not send the CLONE and error
accordingly so that the COPY is done as it should have been.

> cloned = file_in->f_op->remap_file_range(file_in, pos_in,
> @@ -1511,11 +1507,14 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> ret = cloned;
> goto done;
> }
> + /* Resort to copy_file_range if implemented. */
> + ret = -EOPNOTSUPP;
> }
>
> - ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> - flags);
> - WARN_ON_ONCE(ret == -EOPNOTSUPP);
> + if (file_out->f_op->copy_file_range)
> + ret = file_out->f_op->copy_file_range(file_in, pos_in,
> + file_out, pos_out,
> + len, flags);
> done:
> if (ret > 0) {
> fsnotify_access(file_in);

2021-02-19 21:54:25

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH v6] vfs: fix copy_file_range regression in cross-fs copies

On Fri, Feb 19, 2021 at 11:18 PM Olga Kornievskaia <[email protected]> wrote:
>
> On Thu, Feb 18, 2021 at 12:33 PM Luis Henriques <[email protected]> wrote:
> >
> > A regression has been reported by Nicolas Boichat, found while using the
> > copy_file_range syscall to copy a tracefs file. Before commit
> > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
> > kernel would return -EXDEV to userspace when trying to copy a file across
> > different filesystems. After this commit, the syscall doesn't fail anymore
> > and instead returns zero (zero bytes copied), as this file's content is
> > generated on-the-fly and thus reports a size of zero.
> >
> > This patch restores some cross-filesystem copy restrictions that existed
> > prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> > devices"). Filesystems are still allowed to fall-back to the VFS
> > generic_copy_file_range() implementation, but that has now to be done
> > explicitly.
> >
> > nfsd is also modified to fall-back into generic_copy_file_range() in case
> > vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
> >
> > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
> > Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
> > Link: https://lore.kernel.org/linux-fsdevel/[email protected]om/
> > Link: https://lore.kernel.org/linux-fsdevel/[email protected]eid/
> > Reported-by: Nicolas Boichat <[email protected]>
> > Signed-off-by: Luis Henriques <[email protected]>
> > ---
> > And v6 is upon us. Behold!
>
>
> > Changes since v5
> > - check if ->copy_file_range is NULL before calling it
> > Changes since v4
> > - nfsd falls-back to generic_copy_file_range() only *if* it gets -EOPNOTSUPP
> > or -EXDEV.
> > Changes since v3
> > - dropped the COPY_FILE_SPLICE flag
> > - kept the f_op's checks early in generic_copy_file_checks, implementing
> > Amir's suggestions
> > - modified nfsd to use generic_copy_file_range()
> > Changes since v2
> > - do all the required checks earlier, in generic_copy_file_checks(),
> > adding new checks for ->remap_file_range
> > - new COPY_FILE_SPLICE flag
> > - don't remove filesystem's fallback to generic_copy_file_range()
> > - updated commit changelog (and subject)
> > Changes since v1 (after Amir review)
> > - restored do_copy_file_range() helper
> > - return -EOPNOTSUPP if fs doesn't implement CFR
> > - updated commit description
> >
> > fs/nfsd/vfs.c | 8 +++++++-
> > fs/read_write.c | 53 ++++++++++++++++++++++++-------------------------
> > 2 files changed, 33 insertions(+), 28 deletions(-)
> >
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 04937e51de56..23dab0fa9087 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file *nf_src, u64 src_pos,
> > ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
> > u64 dst_pos, u64 count)
> > {
> > + ssize_t ret;
> >
> > /*
> > * Limit copy to 4MB to prevent indefinitely blocking an nfsd
> > @@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
> > * limit like this and pipeline multiple COPY requests.
> > */
> > count = min_t(u64, count, 1 << 22);
> > - return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
> > + ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
> > +
> > + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> > + ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
> > + count, 0);
> > + return ret;
> > }
> >
> > __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 75f764b43418..0348aaa9e237 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
> > }
> > EXPORT_SYMBOL(generic_copy_file_range);
> >
> > -static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
> > - struct file *file_out, loff_t pos_out,
> > - size_t len, unsigned int flags)
> > -{
> > - /*
> > - * Although we now allow filesystems to handle cross sb copy, passing
> > - * a file of the wrong filesystem type to filesystem driver can result
> > - * in an attempt to dereference the wrong type of ->private_data, so
> > - * avoid doing that until we really have a good reason. NFS defines
> > - * several different file_system_type structures, but they all end up
> > - * using the same ->copy_file_range() function pointer.
> > - */
> > - if (file_out->f_op->copy_file_range &&
> > - file_out->f_op->copy_file_range == file_in->f_op->copy_file_range)
> > - return file_out->f_op->copy_file_range(file_in, pos_in,
> > - file_out, pos_out,
> > - len, flags);
> > -
> > - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> > - flags);
> > -}
> > -
> > /*
> > * Performs necessary checks before doing a file copy
> > *
> > @@ -1427,6 +1405,25 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> > loff_t size_in;
> > int ret;
> >
> > + /*
> > + * Although we now allow filesystems to handle cross sb copy, passing
> > + * a file of the wrong filesystem type to filesystem driver can result
> > + * in an attempt to dereference the wrong type of ->private_data, so
> > + * avoid doing that until we really have a good reason. NFS defines
> > + * several different file_system_type structures, but they all end up
> > + * using the same ->copy_file_range() function pointer.
> > + */
> > + if (file_out->f_op->copy_file_range) {
> > + if (file_in->f_op->copy_file_range !=
> > + file_out->f_op->copy_file_range)
> > + return -EXDEV;
> > + } else if (file_in->f_op->remap_file_range) {
> > + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> > + return -EXDEV;
> > + } else {
> > + return -EOPNOTSUPP;
> > + }
> > +
> > ret = generic_file_rw_checks(file_in, file_out);
> > if (ret)
> > return ret;
> > @@ -1499,8 +1496,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > * Try cloning first, this is supported by more file systems, and
> > * more efficient if both clone and copy are supported (e.g. NFS).
> > */
> > - if (file_in->f_op->remap_file_range &&
> > - file_inode(file_in)->i_sb == file_inode(file_out)->i_sb) {
> > + if (file_in->f_op->remap_file_range) {
> > loff_t cloned;
>
> This chunk breaks NFS. You are removing the check that the source and
> destination for the CLONE operation are the same superblock and that
> leads to the fact that when NFS does a copy between 2 different NFS
> servers, it would try CLONE first which is not allowed. NFS relied on
> this check to be done by the VFS layer. Either don't remove it or,
> otherwise, fix the NFS clone's code to not send the CLONE and error
> accordingly so that the COPY is done as it should have been.
>

Right, we need to add this check back (not only for NFS).

However, I was looking at the change that introduced this opportunistic
call for clone_file_range() into copy_file_range():

commit a76b5b04375f974579c83433b06466758c0c552c
Author: Christoph Hellwig <[email protected]>
Date: Fri Dec 9 16:17:19 2016 -0800

fs: try to clone files first in vfs_copy_file_range

A clone is a perfectly fine implementation of a file copy, so most
file systems just implement the copy that way. Instead of duplicating
this logic move it to the VFS. Currently btrfs and XFS implement copies
the same way as clones and there is no behavior change for them, cifs
only implements clones and grow support for copy_file_range with this
patch. NFS implements both, so this will allow copy_file_range to work
on servers that only implement CLONE and be lot more efficient on servers
that implements CLONE and COPY.

And I was thinking to myself that like the change that brought us here
("vfs: allow copy_file_range to copy across devices"), this change was done
for a certain purpose (serve copy_file_range() by fs that implement CLONE),
but that last part (prefer CLONE over COPY) also sounds like an optimization
that nobody asked for and could lead to unexpected behavior down the road.

I think that if a filesystem implements both methods (COPY and CLONE)
and user called to COPY API, we need to call the more specialized COPY
method and not try the CLONE method, because filesystem should be very
capable of making this optimization internally.

This could have been a hypothetical question, but there are actually
two filesystems that implement both COPY and CLONE, so let's ask the
developers what they think VFS should call.

Olga, Trond, Steve, which methods of your filesystem do you think that
vfs_copy_file_range() should call?
1. Only copy_file_range()?
2. Both copy_file_range() and remap_file_range()?
3. CLONE before COPY or the other way around?

Thanks,
Amir.

2021-02-21 20:04:28

by Luis Henriques

[permalink] [raw]
Subject: [PATCH v7] vfs: fix copy_file_range regression in cross-fs copies

A regression has been reported by Nicolas Boichat, found while using the
copy_file_range syscall to copy a tracefs file. Before commit
5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
kernel would return -EXDEV to userspace when trying to copy a file across
different filesystems. After this commit, the syscall doesn't fail anymore
and instead returns zero (zero bytes copied), as this file's content is
generated on-the-fly and thus reports a size of zero.

This patch restores some cross-filesystem copy restrictions that existed
prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
devices"). Filesystems are still allowed to fall-back to the VFS
generic_copy_file_range() implementation, but that has now to be done
explicitly.

nfsd is also modified to fall-back into generic_copy_file_range() in case
vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.

Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
Link: https://lore.kernel.org/linux-fsdevel/[email protected]om/
Link: https://lore.kernel.org/linux-fsdevel/[email protected]eid/
Reported-by: Nicolas Boichat <[email protected]>
Signed-off-by: Luis Henriques <[email protected]>
---
Changes since v6
- restored i_sb checks for the clone operation
Changes since v5
- check if ->copy_file_range is NULL before calling it
Changes since v4
- nfsd falls-back to generic_copy_file_range() only *if* it gets -EOPNOTSUPP
or -EXDEV.
Changes since v3
- dropped the COPY_FILE_SPLICE flag
- kept the f_op's checks early in generic_copy_file_checks, implementing
Amir's suggestions
- modified nfsd to use generic_copy_file_range()
Changes since v2
- do all the required checks earlier, in generic_copy_file_checks(),
adding new checks for ->remap_file_range
- new COPY_FILE_SPLICE flag
- don't remove filesystem's fallback to generic_copy_file_range()
- updated commit changelog (and subject)
Changes since v1 (after Amir review)
- restored do_copy_file_range() helper
- return -EOPNOTSUPP if fs doesn't implement CFR
- updated commit description

fs/nfsd/vfs.c | 8 +++++++-
fs/read_write.c | 50 ++++++++++++++++++++++++-------------------------
2 files changed, 32 insertions(+), 26 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 04937e51de56..23dab0fa9087 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file *nf_src, u64 src_pos,
ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
u64 dst_pos, u64 count)
{
+ ssize_t ret;

/*
* Limit copy to 4MB to prevent indefinitely blocking an nfsd
@@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
* limit like this and pipeline multiple COPY requests.
*/
count = min_t(u64, count, 1 << 22);
- return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+ ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+
+ if (ret == -EOPNOTSUPP || ret == -EXDEV)
+ ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
+ count, 0);
+ return ret;
}

__be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..463345c0ee30 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
}
EXPORT_SYMBOL(generic_copy_file_range);

-static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- size_t len, unsigned int flags)
-{
- /*
- * Although we now allow filesystems to handle cross sb copy, passing
- * a file of the wrong filesystem type to filesystem driver can result
- * in an attempt to dereference the wrong type of ->private_data, so
- * avoid doing that until we really have a good reason. NFS defines
- * several different file_system_type structures, but they all end up
- * using the same ->copy_file_range() function pointer.
- */
- if (file_out->f_op->copy_file_range &&
- file_out->f_op->copy_file_range == file_in->f_op->copy_file_range)
- return file_out->f_op->copy_file_range(file_in, pos_in,
- file_out, pos_out,
- len, flags);
-
- return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
- flags);
-}
-
/*
* Performs necessary checks before doing a file copy
*
@@ -1427,6 +1405,25 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
loff_t size_in;
int ret;

+ /*
+ * Although we now allow filesystems to handle cross sb copy, passing
+ * a file of the wrong filesystem type to filesystem driver can result
+ * in an attempt to dereference the wrong type of ->private_data, so
+ * avoid doing that until we really have a good reason. NFS defines
+ * several different file_system_type structures, but they all end up
+ * using the same ->copy_file_range() function pointer.
+ */
+ if (file_out->f_op->copy_file_range) {
+ if (file_in->f_op->copy_file_range !=
+ file_out->f_op->copy_file_range)
+ return -EXDEV;
+ } else if (file_in->f_op->remap_file_range) {
+ if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
+ return -EXDEV;
+ } else {
+ return -EOPNOTSUPP;
+ }
+
ret = generic_file_rw_checks(file_in, file_out);
if (ret)
return ret;
@@ -1511,11 +1508,14 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
ret = cloned;
goto done;
}
+ /* Resort to copy_file_range if implemented. */
+ ret = -EOPNOTSUPP;
}

- ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len,
- flags);
- WARN_ON_ONCE(ret == -EOPNOTSUPP);
+ if (file_out->f_op->copy_file_range)
+ ret = file_out->f_op->copy_file_range(file_in, pos_in,
+ file_out, pos_out,
+ len, flags);
done:
if (ret > 0) {
fsnotify_access(file_in);

2021-02-22 03:05:35

by Nicolas Boichat

[permalink] [raw]
Subject: Re: [PATCH v7] vfs: fix copy_file_range regression in cross-fs copies

On Mon, Feb 22, 2021 at 3:57 AM Luis Henriques <[email protected]> wrote:
>
> A regression has been reported by Nicolas Boichat, found while using the
> copy_file_range syscall to copy a tracefs file. Before commit
> 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
> kernel would return -EXDEV to userspace when trying to copy a file across
> different filesystems. After this commit, the syscall doesn't fail anymore
> and instead returns zero (zero bytes copied), as this file's content is
> generated on-the-fly and thus reports a size of zero.
>
> This patch restores some cross-filesystem copy restrictions that existed
> prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> devices"). Filesystems are still allowed to fall-back to the VFS
> generic_copy_file_range() implementation, but that has now to be done
> explicitly.
>
> nfsd is also modified to fall-back into generic_copy_file_range() in case
> vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
>
> Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
> Link: https://lore.kernel.org/linux-fsdevel/[email protected]om/
> Link: https://lore.kernel.org/linux-fsdevel/[email protected]eid/
> Reported-by: Nicolas Boichat <[email protected]>

Tested-by: Nicolas Boichat <[email protected]>

> Signed-off-by: Luis Henriques <[email protected]>
> ---
> Changes since v6
> - restored i_sb checks for the clone operation
> Changes since v5
> - check if ->copy_file_range is NULL before calling it
> Changes since v4
> - nfsd falls-back to generic_copy_file_range() only *if* it gets -EOPNOTSUPP
> or -EXDEV.
> Changes since v3
> - dropped the COPY_FILE_SPLICE flag
> - kept the f_op's checks early in generic_copy_file_checks, implementing
> Amir's suggestions
> - modified nfsd to use generic_copy_file_range()
> Changes since v2
> - do all the required checks earlier, in generic_copy_file_checks(),
> adding new checks for ->remap_file_range
> - new COPY_FILE_SPLICE flag
> - don't remove filesystem's fallback to generic_copy_file_range()
> - updated commit changelog (and subject)
> Changes since v1 (after Amir review)
> - restored do_copy_file_range() helper
> - return -EOPNOTSUPP if fs doesn't implement CFR
> - updated commit description
>
> fs/nfsd/vfs.c | 8 +++++++-
> fs/read_write.c | 50 ++++++++++++++++++++++++-------------------------
> 2 files changed, 32 insertions(+), 26 deletions(-)
> [snip]

2021-02-22 10:25:40

by Luis Henriques

[permalink] [raw]
Subject: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies

A regression has been reported by Nicolas Boichat, found while using the
copy_file_range syscall to copy a tracefs file. Before commit
5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
kernel would return -EXDEV to userspace when trying to copy a file across
different filesystems. After this commit, the syscall doesn't fail anymore
and instead returns zero (zero bytes copied), as this file's content is
generated on-the-fly and thus reports a size of zero.

This patch restores some cross-filesystem copy restrictions that existed
prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
devices"). Filesystems are still allowed to fall-back to the VFS
generic_copy_file_range() implementation, but that has now to be done
explicitly.

nfsd is also modified to fall-back into generic_copy_file_range() in case
vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.

Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
Link: https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0[email protected]/
Link: https://lore.kernel.org/linux-fsdevel/[email protected]eid/
Reported-by: Nicolas Boichat <[email protected]>
Signed-off-by: Luis Henriques <[email protected]>
---
Changes since v7
- set 'ret' to '-EOPNOTSUPP' before the clone 'if' statement so that the
error returned is always related to the 'copy' operation
Changes since v6
- restored i_sb checks for the clone operation
Changes since v5
- check if ->copy_file_range is NULL before calling it
Changes since v4
- nfsd falls-back to generic_copy_file_range() only *if* it gets -EOPNOTSUPP
or -EXDEV.
Changes since v3
- dropped the COPY_FILE_SPLICE flag
- kept the f_op's checks early in generic_copy_file_checks, implementing
Amir's suggestions
- modified nfsd to use generic_copy_file_range()
Changes since v2
- do all the required checks earlier, in generic_copy_file_checks(),
adding new checks for ->remap_file_range
- new COPY_FILE_SPLICE flag
- don't remove filesystem's fallback to generic_copy_file_range()
- updated commit changelog (and subject)
Changes since v1 (after Amir review)
- restored do_copy_file_range() helper
- return -EOPNOTSUPP if fs doesn't implement CFR
- updated commit description

fs/nfsd/vfs.c | 8 +++++++-
fs/read_write.c | 49 ++++++++++++++++++++++++-------------------------
2 files changed, 31 insertions(+), 26 deletions(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index 04937e51de56..23dab0fa9087 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file *nf_src, u64 src_pos,
ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
u64 dst_pos, u64 count)
{
+ ssize_t ret;

/*
* Limit copy to 4MB to prevent indefinitely blocking an nfsd
@@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
* limit like this and pipeline multiple COPY requests.
*/
count = min_t(u64, count, 1 << 22);
- return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+ ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
+
+ if (ret == -EOPNOTSUPP || ret == -EXDEV)
+ ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
+ count, 0);
+ return ret;
}

__be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..5a26297fd410 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
}
EXPORT_SYMBOL(generic_copy_file_range);

-static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- size_t len, unsigned int flags)
-{
- /*
- * Although we now allow filesystems to handle cross sb copy, passing
- * a file of the wrong filesystem type to filesystem driver can result
- * in an attempt to dereference the wrong type of ->private_data, so
- * avoid doing that until we really have a good reason. NFS defines
- * several different file_system_type structures, but they all end up
- * using the same ->copy_file_range() function pointer.
- */
- if (file_out->f_op->copy_file_range &&
- file_out->f_op->copy_file_range == file_in->f_op->copy_file_range)
- return file_out->f_op->copy_file_range(file_in, pos_in,
- file_out, pos_out,
- len, flags);
-
- return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
- flags);
-}
-
/*
* Performs necessary checks before doing a file copy
*
@@ -1427,6 +1405,25 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
loff_t size_in;
int ret;

+ /*
+ * Although we now allow filesystems to handle cross sb copy, passing
+ * a file of the wrong filesystem type to filesystem driver can result
+ * in an attempt to dereference the wrong type of ->private_data, so
+ * avoid doing that until we really have a good reason. NFS defines
+ * several different file_system_type structures, but they all end up
+ * using the same ->copy_file_range() function pointer.
+ */
+ if (file_out->f_op->copy_file_range) {
+ if (file_in->f_op->copy_file_range !=
+ file_out->f_op->copy_file_range)
+ return -EXDEV;
+ } else if (file_in->f_op->remap_file_range) {
+ if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
+ return -EXDEV;
+ } else {
+ return -EOPNOTSUPP;
+ }
+
ret = generic_file_rw_checks(file_in, file_out);
if (ret)
return ret;
@@ -1495,6 +1492,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,

file_start_write(file_out);

+ ret = -EOPNOTSUPP;
/*
* Try cloning first, this is supported by more file systems, and
* more efficient if both clone and copy are supported (e.g. NFS).
@@ -1513,9 +1511,10 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
}
}

- ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len,
- flags);
- WARN_ON_ONCE(ret == -EOPNOTSUPP);
+ if (file_out->f_op->copy_file_range)
+ ret = file_out->f_op->copy_file_range(file_in, pos_in,
+ file_out, pos_out,
+ len, flags);
done:
if (ret > 0) {
fsnotify_access(file_in);

2021-02-22 10:49:05

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies

On Mon, Feb 22, 2021 at 12:23 PM Luis Henriques <[email protected]> wrote:
>
> A regression has been reported by Nicolas Boichat, found while using the
> copy_file_range syscall to copy a tracefs file. Before commit
> 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
> kernel would return -EXDEV to userspace when trying to copy a file across
> different filesystems. After this commit, the syscall doesn't fail anymore
> and instead returns zero (zero bytes copied), as this file's content is
> generated on-the-fly and thus reports a size of zero.
>
> This patch restores some cross-filesystem copy restrictions that existed
> prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> devices"). Filesystems are still allowed to fall-back to the VFS
> generic_copy_file_range() implementation, but that has now to be done
> explicitly.
>
> nfsd is also modified to fall-back into generic_copy_file_range() in case
> vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
>
> Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
> Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
> Link: https://lore.kernel.org/linux-fsdevel/[email protected]om/
> Link: https://lore.kernel.org/linux-fsdevel/[email protected]eid/
> Reported-by: Nicolas Boichat <[email protected]>
> Signed-off-by: Luis Henriques <[email protected]>
> ---

Reviewed-by: Amir Goldstein <[email protected]>

Thanks,
Amir.

2021-02-22 16:29:45

by Dai Ngo

[permalink] [raw]
Subject: Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies


On 2/22/21 2:24 AM, Luis Henriques wrote:
> A regression has been reported by Nicolas Boichat, found while using the
> copy_file_range syscall to copy a tracefs file. Before commit
> 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
> kernel would return -EXDEV to userspace when trying to copy a file across
> different filesystems. After this commit, the syscall doesn't fail anymore
> and instead returns zero (zero bytes copied), as this file's content is
> generated on-the-fly and thus reports a size of zero.
>
> This patch restores some cross-filesystem copy restrictions that existed
> prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> devices"). Filesystems are still allowed to fall-back to the VFS
> generic_copy_file_range() implementation, but that has now to be done
> explicitly.
>
> nfsd is also modified to fall-back into generic_copy_file_range() in case
> vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
>
> Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
> Link: https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/[email protected]/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmi49dC6w$
> Link: https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx*[email protected]/__;Kw!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmgCmMHzA$
> Link: https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/[email protected]eid/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmzqItkrQ$
> Reported-by: Nicolas Boichat <[email protected]>
> Signed-off-by: Luis Henriques <[email protected]>
> ---
> Changes since v7
> - set 'ret' to '-EOPNOTSUPP' before the clone 'if' statement so that the
> error returned is always related to the 'copy' operation
> Changes since v6
> - restored i_sb checks for the clone operation
> Changes since v5
> - check if ->copy_file_range is NULL before calling it
> Changes since v4
> - nfsd falls-back to generic_copy_file_range() only *if* it gets -EOPNOTSUPP
> or -EXDEV.
> Changes since v3
> - dropped the COPY_FILE_SPLICE flag
> - kept the f_op's checks early in generic_copy_file_checks, implementing
> Amir's suggestions
> - modified nfsd to use generic_copy_file_range()
> Changes since v2
> - do all the required checks earlier, in generic_copy_file_checks(),
> adding new checks for ->remap_file_range
> - new COPY_FILE_SPLICE flag
> - don't remove filesystem's fallback to generic_copy_file_range()
> - updated commit changelog (and subject)
> Changes since v1 (after Amir review)
> - restored do_copy_file_range() helper
> - return -EOPNOTSUPP if fs doesn't implement CFR
> - updated commit description
>
> fs/nfsd/vfs.c | 8 +++++++-
> fs/read_write.c | 49 ++++++++++++++++++++++++-------------------------
> 2 files changed, 31 insertions(+), 26 deletions(-)
>
> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> index 04937e51de56..23dab0fa9087 100644
> --- a/fs/nfsd/vfs.c
> +++ b/fs/nfsd/vfs.c
> @@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file *nf_src, u64 src_pos,
> ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
> u64 dst_pos, u64 count)
> {
> + ssize_t ret;
>
> /*
> * Limit copy to 4MB to prevent indefinitely blocking an nfsd
> @@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
> * limit like this and pipeline multiple COPY requests.
> */
> count = min_t(u64, count, 1 << 22);
> - return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
> + ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
> +
> + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> + ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
> + count, 0);
> + return ret;
> }
>
> __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 75f764b43418..5a26297fd410 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
> }
> EXPORT_SYMBOL(generic_copy_file_range);
>
> -static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
> - struct file *file_out, loff_t pos_out,
> - size_t len, unsigned int flags)
> -{
> - /*
> - * Although we now allow filesystems to handle cross sb copy, passing
> - * a file of the wrong filesystem type to filesystem driver can result
> - * in an attempt to dereference the wrong type of ->private_data, so
> - * avoid doing that until we really have a good reason. NFS defines
> - * several different file_system_type structures, but they all end up
> - * using the same ->copy_file_range() function pointer.
> - */
> - if (file_out->f_op->copy_file_range &&
> - file_out->f_op->copy_file_range == file_in->f_op->copy_file_range)
> - return file_out->f_op->copy_file_range(file_in, pos_in,
> - file_out, pos_out,
> - len, flags);
> -
> - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> - flags);
> -}
> -
> /*
> * Performs necessary checks before doing a file copy
> *
> @@ -1427,6 +1405,25 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> loff_t size_in;
> int ret;
>
> + /*
> + * Although we now allow filesystems to handle cross sb copy, passing
> + * a file of the wrong filesystem type to filesystem driver can result
> + * in an attempt to dereference the wrong type of ->private_data, so
> + * avoid doing that until we really have a good reason. NFS defines
> + * several different file_system_type structures, but they all end up
> + * using the same ->copy_file_range() function pointer.
> + */
> + if (file_out->f_op->copy_file_range) {
> + if (file_in->f_op->copy_file_range !=
> + file_out->f_op->copy_file_range)
> + return -EXDEV;
> + } else if (file_in->f_op->remap_file_range) {
> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> + return -EXDEV;

I think this check is redundant, it's done in vfs_copy_file_range.
If this check is removed then the else clause below should be removed
also. Once this check and the else clause are removed then might as
well move the the check of copy_file_range from here to vfs_copy_file_range.

-Dai

> + } else {
> + return -EOPNOTSUPP;
> + }
> +
> ret = generic_file_rw_checks(file_in, file_out);
> if (ret)
> return ret;
> @@ -1495,6 +1492,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>
> file_start_write(file_out);
>
> + ret = -EOPNOTSUPP;
> /*
> * Try cloning first, this is supported by more file systems, and
> * more efficient if both clone and copy are supported (e.g. NFS).
> @@ -1513,9 +1511,10 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> }
> }
>
> - ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> - flags);
> - WARN_ON_ONCE(ret == -EOPNOTSUPP);
> + if (file_out->f_op->copy_file_range)
> + ret = file_out->f_op->copy_file_range(file_in, pos_in,
> + file_out, pos_out,
> + len, flags);
> done:
> if (ret > 0) {
> fsnotify_access(file_in);

2021-02-23 10:34:16

by Luis Henriques

[permalink] [raw]
Subject: Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies

On Mon, Feb 22, 2021 at 08:25:27AM -0800, [email protected] wrote:
>
> On 2/22/21 2:24 AM, Luis Henriques wrote:
> > A regression has been reported by Nicolas Boichat, found while using the
> > copy_file_range syscall to copy a tracefs file. Before commit
> > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
> > kernel would return -EXDEV to userspace when trying to copy a file across
> > different filesystems. After this commit, the syscall doesn't fail anymore
> > and instead returns zero (zero bytes copied), as this file's content is
> > generated on-the-fly and thus reports a size of zero.
> >
> > This patch restores some cross-filesystem copy restrictions that existed
> > prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> > devices"). Filesystems are still allowed to fall-back to the VFS
> > generic_copy_file_range() implementation, but that has now to be done
> > explicitly.
> >
> > nfsd is also modified to fall-back into generic_copy_file_range() in case
> > vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
> >
> > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
> > Link: https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/[email protected]/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmi49dC6w$
> > Link: https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx*[email protected]/__;Kw!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmgCmMHzA$
> > Link: https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/[email protected]eid/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmzqItkrQ$
> > Reported-by: Nicolas Boichat <[email protected]>
> > Signed-off-by: Luis Henriques <[email protected]>
> > ---
> > Changes since v7
> > - set 'ret' to '-EOPNOTSUPP' before the clone 'if' statement so that the
> > error returned is always related to the 'copy' operation
> > Changes since v6
> > - restored i_sb checks for the clone operation
> > Changes since v5
> > - check if ->copy_file_range is NULL before calling it
> > Changes since v4
> > - nfsd falls-back to generic_copy_file_range() only *if* it gets -EOPNOTSUPP
> > or -EXDEV.
> > Changes since v3
> > - dropped the COPY_FILE_SPLICE flag
> > - kept the f_op's checks early in generic_copy_file_checks, implementing
> > Amir's suggestions
> > - modified nfsd to use generic_copy_file_range()
> > Changes since v2
> > - do all the required checks earlier, in generic_copy_file_checks(),
> > adding new checks for ->remap_file_range
> > - new COPY_FILE_SPLICE flag
> > - don't remove filesystem's fallback to generic_copy_file_range()
> > - updated commit changelog (and subject)
> > Changes since v1 (after Amir review)
> > - restored do_copy_file_range() helper
> > - return -EOPNOTSUPP if fs doesn't implement CFR
> > - updated commit description
> >
> > fs/nfsd/vfs.c | 8 +++++++-
> > fs/read_write.c | 49 ++++++++++++++++++++++++-------------------------
> > 2 files changed, 31 insertions(+), 26 deletions(-)
> >
> > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > index 04937e51de56..23dab0fa9087 100644
> > --- a/fs/nfsd/vfs.c
> > +++ b/fs/nfsd/vfs.c
> > @@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file *nf_src, u64 src_pos,
> > ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
> > u64 dst_pos, u64 count)
> > {
> > + ssize_t ret;
> > /*
> > * Limit copy to 4MB to prevent indefinitely blocking an nfsd
> > @@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
> > * limit like this and pipeline multiple COPY requests.
> > */
> > count = min_t(u64, count, 1 << 22);
> > - return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
> > + ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
> > +
> > + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> > + ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
> > + count, 0);
> > + return ret;
> > }
> > __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 75f764b43418..5a26297fd410 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
> > }
> > EXPORT_SYMBOL(generic_copy_file_range);
> > -static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
> > - struct file *file_out, loff_t pos_out,
> > - size_t len, unsigned int flags)
> > -{
> > - /*
> > - * Although we now allow filesystems to handle cross sb copy, passing
> > - * a file of the wrong filesystem type to filesystem driver can result
> > - * in an attempt to dereference the wrong type of ->private_data, so
> > - * avoid doing that until we really have a good reason. NFS defines
> > - * several different file_system_type structures, but they all end up
> > - * using the same ->copy_file_range() function pointer.
> > - */
> > - if (file_out->f_op->copy_file_range &&
> > - file_out->f_op->copy_file_range == file_in->f_op->copy_file_range)
> > - return file_out->f_op->copy_file_range(file_in, pos_in,
> > - file_out, pos_out,
> > - len, flags);
> > -
> > - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> > - flags);
> > -}
> > -
> > /*
> > * Performs necessary checks before doing a file copy
> > *
> > @@ -1427,6 +1405,25 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
> > loff_t size_in;
> > int ret;
> > + /*
> > + * Although we now allow filesystems to handle cross sb copy, passing
> > + * a file of the wrong filesystem type to filesystem driver can result
> > + * in an attempt to dereference the wrong type of ->private_data, so
> > + * avoid doing that until we really have a good reason. NFS defines
> > + * several different file_system_type structures, but they all end up
> > + * using the same ->copy_file_range() function pointer.
> > + */
> > + if (file_out->f_op->copy_file_range) {
> > + if (file_in->f_op->copy_file_range !=
> > + file_out->f_op->copy_file_range)
> > + return -EXDEV;
> > + } else if (file_in->f_op->remap_file_range) {
> > + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> > + return -EXDEV;
>
> I think this check is redundant, it's done in vfs_copy_file_range.
> If this check is removed then the else clause below should be removed
> also. Once this check and the else clause are removed then might as
> well move the the check of copy_file_range from here to vfs_copy_file_range.
>

I don't think it's really redundant, although I agree is messy due to the
fact we try to clone first instead of copying them.

So, in the clone path, this is the only place where we return -EXDEV if:

1) we don't have ->copy_file_range *and*
2) we have ->remap_file_range but the i_sb are different.

The check in vfs_copy_file_range() is only executed if:

1) we have *valid* ->copy_file_range ops and/or
2) we have *valid* ->remap_file_range

So... if we remove the check in generic_copy_file_checks() as you suggest
and:
- we don't have ->copy_file_range,
- we have ->remap_file_range but
- the i_sb are different

we'll return the -EOPNOTSUPP (the one set in line "ret = -EOPNOTSUPP;" in
function vfs_copy_file_range() ) instead of -EXDEV.

But I may have got it all wrong. I've looked so many times at this code
that I'm probably useless at finding problems in it :-)

Cheers,
--
Lu?s

> -Dai
>
> > + } else {
> > + return -EOPNOTSUPP;
> > + }
> > +
> > ret = generic_file_rw_checks(file_in, file_out);
> > if (ret)
> > return ret;
> > @@ -1495,6 +1492,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > file_start_write(file_out);
> > + ret = -EOPNOTSUPP;
> > /*
> > * Try cloning first, this is supported by more file systems, and
> > * more efficient if both clone and copy are supported (e.g. NFS).
> > @@ -1513,9 +1511,10 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > }
> > }
> > - ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> > - flags);
> > - WARN_ON_ONCE(ret == -EOPNOTSUPP);
> > + if (file_out->f_op->copy_file_range)
> > + ret = file_out->f_op->copy_file_range(file_in, pos_in,
> > + file_out, pos_out,
> > + len, flags);
> > done:
> > if (ret > 0) {
> > fsnotify_access(file_in);

2021-02-23 15:35:43

by Dai Ngo

[permalink] [raw]
Subject: Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies


On 2/23/21 2:32 AM, Luis Henriques wrote:
> On Mon, Feb 22, 2021 at 08:25:27AM -0800, [email protected] wrote:
>> On 2/22/21 2:24 AM, Luis Henriques wrote:
>>> A regression has been reported by Nicolas Boichat, found while using the
>>> copy_file_range syscall to copy a tracefs file. Before commit
>>> 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
>>> kernel would return -EXDEV to userspace when trying to copy a file across
>>> different filesystems. After this commit, the syscall doesn't fail anymore
>>> and instead returns zero (zero bytes copied), as this file's content is
>>> generated on-the-fly and thus reports a size of zero.
>>>
>>> This patch restores some cross-filesystem copy restrictions that existed
>>> prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
>>> devices"). Filesystems are still allowed to fall-back to the VFS
>>> generic_copy_file_range() implementation, but that has now to be done
>>> explicitly.
>>>
>>> nfsd is also modified to fall-back into generic_copy_file_range() in case
>>> vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
>>>
>>> Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
>>> Link: https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/[email protected]/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmi49dC6w$
>>> Link: https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx*[email protected]/__;Kw!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmgCmMHzA$
>>> Link: https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/[email protected]eid/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmzqItkrQ$
>>> Reported-by: Nicolas Boichat <[email protected]>
>>> Signed-off-by: Luis Henriques <[email protected]>
>>> ---
>>> Changes since v7
>>> - set 'ret' to '-EOPNOTSUPP' before the clone 'if' statement so that the
>>> error returned is always related to the 'copy' operation
>>> Changes since v6
>>> - restored i_sb checks for the clone operation
>>> Changes since v5
>>> - check if ->copy_file_range is NULL before calling it
>>> Changes since v4
>>> - nfsd falls-back to generic_copy_file_range() only *if* it gets -EOPNOTSUPP
>>> or -EXDEV.
>>> Changes since v3
>>> - dropped the COPY_FILE_SPLICE flag
>>> - kept the f_op's checks early in generic_copy_file_checks, implementing
>>> Amir's suggestions
>>> - modified nfsd to use generic_copy_file_range()
>>> Changes since v2
>>> - do all the required checks earlier, in generic_copy_file_checks(),
>>> adding new checks for ->remap_file_range
>>> - new COPY_FILE_SPLICE flag
>>> - don't remove filesystem's fallback to generic_copy_file_range()
>>> - updated commit changelog (and subject)
>>> Changes since v1 (after Amir review)
>>> - restored do_copy_file_range() helper
>>> - return -EOPNOTSUPP if fs doesn't implement CFR
>>> - updated commit description
>>>
>>> fs/nfsd/vfs.c | 8 +++++++-
>>> fs/read_write.c | 49 ++++++++++++++++++++++++-------------------------
>>> 2 files changed, 31 insertions(+), 26 deletions(-)
>>>
>>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>>> index 04937e51de56..23dab0fa9087 100644
>>> --- a/fs/nfsd/vfs.c
>>> +++ b/fs/nfsd/vfs.c
>>> @@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file *nf_src, u64 src_pos,
>>> ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
>>> u64 dst_pos, u64 count)
>>> {
>>> + ssize_t ret;
>>> /*
>>> * Limit copy to 4MB to prevent indefinitely blocking an nfsd
>>> @@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file *dst,
>>> * limit like this and pipeline multiple COPY requests.
>>> */
>>> count = min_t(u64, count, 1 << 22);
>>> - return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
>>> + ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
>>> +
>>> + if (ret == -EOPNOTSUPP || ret == -EXDEV)
>>> + ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
>>> + count, 0);
>>> + return ret;
>>> }
>>> __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh *fhp,
>>> diff --git a/fs/read_write.c b/fs/read_write.c
>>> index 75f764b43418..5a26297fd410 100644
>>> --- a/fs/read_write.c
>>> +++ b/fs/read_write.c
>>> @@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
>>> }
>>> EXPORT_SYMBOL(generic_copy_file_range);
>>> -static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
>>> - struct file *file_out, loff_t pos_out,
>>> - size_t len, unsigned int flags)
>>> -{
>>> - /*
>>> - * Although we now allow filesystems to handle cross sb copy, passing
>>> - * a file of the wrong filesystem type to filesystem driver can result
>>> - * in an attempt to dereference the wrong type of ->private_data, so
>>> - * avoid doing that until we really have a good reason. NFS defines
>>> - * several different file_system_type structures, but they all end up
>>> - * using the same ->copy_file_range() function pointer.
>>> - */
>>> - if (file_out->f_op->copy_file_range &&
>>> - file_out->f_op->copy_file_range == file_in->f_op->copy_file_range)
>>> - return file_out->f_op->copy_file_range(file_in, pos_in,
>>> - file_out, pos_out,
>>> - len, flags);
>>> -
>>> - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
>>> - flags);
>>> -}
>>> -
>>> /*
>>> * Performs necessary checks before doing a file copy
>>> *
>>> @@ -1427,6 +1405,25 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
>>> loff_t size_in;
>>> int ret;
>>> + /*
>>> + * Although we now allow filesystems to handle cross sb copy, passing
>>> + * a file of the wrong filesystem type to filesystem driver can result
>>> + * in an attempt to dereference the wrong type of ->private_data, so
>>> + * avoid doing that until we really have a good reason. NFS defines
>>> + * several different file_system_type structures, but they all end up
>>> + * using the same ->copy_file_range() function pointer.
>>> + */
>>> + if (file_out->f_op->copy_file_range) {
>>> + if (file_in->f_op->copy_file_range !=
>>> + file_out->f_op->copy_file_range)
>>> + return -EXDEV;
>>> + } else if (file_in->f_op->remap_file_range) {
>>> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
>>> + return -EXDEV;
>> I think this check is redundant, it's done in vfs_copy_file_range.
>> If this check is removed then the else clause below should be removed
>> also. Once this check and the else clause are removed then might as
>> well move the the check of copy_file_range from here to vfs_copy_file_range.
>>
> I don't think it's really redundant, although I agree is messy due to the
> fact we try to clone first instead of copying them.
>
> So, in the clone path, this is the only place where we return -EXDEV if:
>
> 1) we don't have ->copy_file_range *and*
> 2) we have ->remap_file_range but the i_sb are different.
>
> The check in vfs_copy_file_range() is only executed if:
>
> 1) we have *valid* ->copy_file_range ops and/or
> 2) we have *valid* ->remap_file_range
>
> So... if we remove the check in generic_copy_file_checks() as you suggest
> and:
> - we don't have ->copy_file_range,
> - we have ->remap_file_range but
> - the i_sb are different
>
> we'll return the -EOPNOTSUPP (the one set in line "ret = -EOPNOTSUPP;" in
> function vfs_copy_file_range() ) instead of -EXDEV.

Yes, this is the different.The NFS code handles both -EOPNOTSUPP and
-EXDEVV by doing generic_copy_file_range. Do any other consumers of
vfs_copy_file_range rely on -EXDEV and not -EOPNOTSUPP and which is
the correct error code for this case? It seems to me that -EOPNOTSUPP
is more appropriate than EXDEV when (sb1 != sb2).

>
> But I may have got it all wrong. I've looked so many times at this code
> that I'm probably useless at finding problems in it :-)

You're not alone, we all try to do the right thing :-)

-Dai

>
> Cheers,
> --
> Luís
>
>> -Dai
>>
>>> + } else {
>>> + return -EOPNOTSUPP;
>>> + }
>>> +
>>> ret = generic_file_rw_checks(file_in, file_out);
>>> if (ret)
>>> return ret;
>>> @@ -1495,6 +1492,7 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>>> file_start_write(file_out);
>>> + ret = -EOPNOTSUPP;
>>> /*
>>> * Try cloning first, this is supported by more file systems, and
>>> * more efficient if both clone and copy are supported (e.g. NFS).
>>> @@ -1513,9 +1511,10 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>>> }
>>> }
>>> - ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len,
>>> - flags);
>>> - WARN_ON_ONCE(ret == -EOPNOTSUPP);
>>> + if (file_out->f_op->copy_file_range)
>>> + ret = file_out->f_op->copy_file_range(file_in, pos_in,
>>> + file_out, pos_out,
>>> + len, flags);
>>> done:
>>> if (ret > 0) {
>>> fsnotify_access(file_in);

2021-02-23 16:08:18

by Dai Ngo

[permalink] [raw]
Subject: Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies


On 2/23/21 7:29 AM, [email protected] wrote:
>
> On 2/23/21 2:32 AM, Luis Henriques wrote:
>> On Mon, Feb 22, 2021 at 08:25:27AM -0800, [email protected] wrote:
>>> On 2/22/21 2:24 AM, Luis Henriques wrote:
>>>> A regression has been reported by Nicolas Boichat, found while
>>>> using the
>>>> copy_file_range syscall to copy a tracefs file.  Before commit
>>>> 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
>>>> kernel would return -EXDEV to userspace when trying to copy a file
>>>> across
>>>> different filesystems.  After this commit, the syscall doesn't fail
>>>> anymore
>>>> and instead returns zero (zero bytes copied), as this file's
>>>> content is
>>>> generated on-the-fly and thus reports a size of zero.
>>>>
>>>> This patch restores some cross-filesystem copy restrictions that
>>>> existed
>>>> prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy
>>>> across
>>>> devices").  Filesystems are still allowed to fall-back to the VFS
>>>> generic_copy_file_range() implementation, but that has now to be done
>>>> explicitly.
>>>>
>>>> nfsd is also modified to fall-back into generic_copy_file_range()
>>>> in case
>>>> vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
>>>>
>>>> Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
>>>> devices")
>>>> Link:
>>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/[email protected]/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmi49dC6w$
>>>> Link:
>>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx*[email protected]/__;Kw!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmgCmMHzA$
>>>> Link:
>>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/[email protected]eid/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmzqItkrQ$
>>>> Reported-by: Nicolas Boichat <[email protected]>
>>>> Signed-off-by: Luis Henriques <[email protected]>
>>>> ---
>>>> Changes since v7
>>>> - set 'ret' to '-EOPNOTSUPP' before the clone 'if' statement so
>>>> that the
>>>>     error returned is always related to the 'copy' operation
>>>> Changes since v6
>>>> - restored i_sb checks for the clone operation
>>>> Changes since v5
>>>> - check if ->copy_file_range is NULL before calling it
>>>> Changes since v4
>>>> - nfsd falls-back to generic_copy_file_range() only *if* it gets
>>>> -EOPNOTSUPP
>>>>     or -EXDEV.
>>>> Changes since v3
>>>> - dropped the COPY_FILE_SPLICE flag
>>>> - kept the f_op's checks early in generic_copy_file_checks,
>>>> implementing
>>>>     Amir's suggestions
>>>> - modified nfsd to use generic_copy_file_range()
>>>> Changes since v2
>>>> - do all the required checks earlier, in generic_copy_file_checks(),
>>>>     adding new checks for ->remap_file_range
>>>> - new COPY_FILE_SPLICE flag
>>>> - don't remove filesystem's fallback to generic_copy_file_range()
>>>> - updated commit changelog (and subject)
>>>> Changes since v1 (after Amir review)
>>>> - restored do_copy_file_range() helper
>>>> - return -EOPNOTSUPP if fs doesn't implement CFR
>>>> - updated commit description
>>>>
>>>>    fs/nfsd/vfs.c   |  8 +++++++-
>>>>    fs/read_write.c | 49
>>>> ++++++++++++++++++++++++-------------------------
>>>>    2 files changed, 31 insertions(+), 26 deletions(-)
>>>>
>>>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>>>> index 04937e51de56..23dab0fa9087 100644
>>>> --- a/fs/nfsd/vfs.c
>>>> +++ b/fs/nfsd/vfs.c
>>>> @@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file
>>>> *nf_src, u64 src_pos,
>>>>    ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos,
>>>> struct file *dst,
>>>>                     u64 dst_pos, u64 count)
>>>>    {
>>>> +    ssize_t ret;
>>>>        /*
>>>>         * Limit copy to 4MB to prevent indefinitely blocking an nfsd
>>>> @@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src,
>>>> u64 src_pos, struct file *dst,
>>>>         * limit like this and pipeline multiple COPY requests.
>>>>         */
>>>>        count = min_t(u64, count, 1 << 22);
>>>> -    return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
>>>> +    ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
>>>> +
>>>> +    if (ret == -EOPNOTSUPP || ret == -EXDEV)
>>>> +        ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
>>>> +                          count, 0);
>>>> +    return ret;
>>>>    }
>>>>    __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh
>>>> *fhp,
>>>> diff --git a/fs/read_write.c b/fs/read_write.c
>>>> index 75f764b43418..5a26297fd410 100644
>>>> --- a/fs/read_write.c
>>>> +++ b/fs/read_write.c
>>>> @@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file
>>>> *file_in, loff_t pos_in,
>>>>    }
>>>>    EXPORT_SYMBOL(generic_copy_file_range);
>>>> -static ssize_t do_copy_file_range(struct file *file_in, loff_t
>>>> pos_in,
>>>> -                  struct file *file_out, loff_t pos_out,
>>>> -                  size_t len, unsigned int flags)
>>>> -{
>>>> -    /*
>>>> -     * Although we now allow filesystems to handle cross sb copy,
>>>> passing
>>>> -     * a file of the wrong filesystem type to filesystem driver
>>>> can result
>>>> -     * in an attempt to dereference the wrong type of
>>>> ->private_data, so
>>>> -     * avoid doing that until we really have a good reason.  NFS
>>>> defines
>>>> -     * several different file_system_type structures, but they all
>>>> end up
>>>> -     * using the same ->copy_file_range() function pointer.
>>>> -     */
>>>> -    if (file_out->f_op->copy_file_range &&
>>>> -        file_out->f_op->copy_file_range ==
>>>> file_in->f_op->copy_file_range)
>>>> -        return file_out->f_op->copy_file_range(file_in, pos_in,
>>>> -                               file_out, pos_out,
>>>> -                               len, flags);
>>>> -
>>>> -    return generic_copy_file_range(file_in, pos_in, file_out,
>>>> pos_out, len,
>>>> -                       flags);
>>>> -}
>>>> -
>>>>    /*
>>>>     * Performs necessary checks before doing a file copy
>>>>     *
>>>> @@ -1427,6 +1405,25 @@ static int generic_copy_file_checks(struct
>>>> file *file_in, loff_t pos_in,
>>>>        loff_t size_in;
>>>>        int ret;
>>>> +    /*
>>>> +     * Although we now allow filesystems to handle cross sb copy,
>>>> passing
>>>> +     * a file of the wrong filesystem type to filesystem driver
>>>> can result
>>>> +     * in an attempt to dereference the wrong type of
>>>> ->private_data, so
>>>> +     * avoid doing that until we really have a good reason.  NFS
>>>> defines
>>>> +     * several different file_system_type structures, but they all
>>>> end up
>>>> +     * using the same ->copy_file_range() function pointer.
>>>> +     */
>>>> +    if (file_out->f_op->copy_file_range) {
>>>> +        if (file_in->f_op->copy_file_range !=
>>>> +            file_out->f_op->copy_file_range)
>>>> +            return -EXDEV;
>>>> +    } else if (file_in->f_op->remap_file_range) {
>>>> +        if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
>>>> +            return -EXDEV;
>>> I think this check is redundant, it's done in vfs_copy_file_range.
>>> If this check is removed then the else clause below should be removed
>>> also. Once this check and the else clause are removed then might as
>>> well move the the check of copy_file_range from here to
>>> vfs_copy_file_range.
>>>
>> I don't think it's really redundant, although I agree is messy due to
>> the
>> fact we try to clone first instead of copying them.
>>
>> So, in the clone path, this is the only place where we return -EXDEV if:
>>
>> 1) we don't have ->copy_file_range *and*
>> 2) we have ->remap_file_range but the i_sb are different.
>>
>> The check in vfs_copy_file_range() is only executed if:
>>
>> 1) we have *valid* ->copy_file_range ops and/or
>> 2) we have *valid* ->remap_file_range
>>
>> So... if we remove the check in generic_copy_file_checks() as you
>> suggest
>> and:
>> - we don't have ->copy_file_range,
>> - we have ->remap_file_range but
>> - the i_sb are different
>>
>> we'll return the -EOPNOTSUPP (the one set in line "ret =
>> -EOPNOTSUPP;" in
>> function vfs_copy_file_range() ) instead of -EXDEV.
>
> Yes, this is the different.The NFS code handles both -EOPNOTSUPP and
> -EXDEVV by doing generic_copy_file_range.  Do any other consumers of
> vfs_copy_file_range rely on -EXDEV and not -EOPNOTSUPP and which is
> the correct error code for this case? It seems to me that -EOPNOTSUPP
> is more appropriate than EXDEV when (sb1 != sb2).

So with the current patch, for a clone operation across 2 filesystems:

. if src and dst filesystem support both copy_file_range and
map_file_range then the code returns -ENOTSUPPORT.

. if the filesystems only support map_file_range then the
code returns -EXDEV

This seems confusing, shouldn't only 1 error code returned for this case?

-Dai

>
>>
>> But I may have got it all wrong.  I've looked so many times at this code
>> that I'm probably useless at finding problems in it :-)
>
> You're not alone, we all try to do the right thing :-)
>
> -Dai
>
>>
>> Cheers,
>> --
>> Luís
>>
>>> -Dai
>>>
>>>> +    } else {
>>>> +                return -EOPNOTSUPP;
>>>> +    }
>>>> +
>>>>        ret = generic_file_rw_checks(file_in, file_out);
>>>>        if (ret)
>>>>            return ret;
>>>> @@ -1495,6 +1492,7 @@ ssize_t vfs_copy_file_range(struct file
>>>> *file_in, loff_t pos_in,
>>>>        file_start_write(file_out);
>>>> +    ret = -EOPNOTSUPP;
>>>>        /*
>>>>         * Try cloning first, this is supported by more file
>>>> systems, and
>>>>         * more efficient if both clone and copy are supported (e.g.
>>>> NFS).
>>>> @@ -1513,9 +1511,10 @@ ssize_t vfs_copy_file_range(struct file
>>>> *file_in, loff_t pos_in,
>>>>            }
>>>>        }
>>>> -    ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len,
>>>> -                flags);
>>>> -    WARN_ON_ONCE(ret == -EOPNOTSUPP);
>>>> +    if (file_out->f_op->copy_file_range)
>>>> +        ret = file_out->f_op->copy_file_range(file_in, pos_in,
>>>> +                              file_out, pos_out,
>>>> +                              len, flags);
>>>>    done:
>>>>        if (ret > 0) {
>>>>            fsnotify_access(file_in);

2021-02-23 16:50:46

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies

On Tue, Feb 23, 2021 at 6:02 PM <[email protected]> wrote:
>
>
> On 2/23/21 7:29 AM, [email protected] wrote:
> >
> > On 2/23/21 2:32 AM, Luis Henriques wrote:
> >> On Mon, Feb 22, 2021 at 08:25:27AM -0800, [email protected] wrote:
> >>> On 2/22/21 2:24 AM, Luis Henriques wrote:
> >>>> A regression has been reported by Nicolas Boichat, found while
> >>>> using the
> >>>> copy_file_range syscall to copy a tracefs file. Before commit
> >>>> 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
> >>>> kernel would return -EXDEV to userspace when trying to copy a file
> >>>> across
> >>>> different filesystems. After this commit, the syscall doesn't fail
> >>>> anymore
> >>>> and instead returns zero (zero bytes copied), as this file's
> >>>> content is
> >>>> generated on-the-fly and thus reports a size of zero.
> >>>>
> >>>> This patch restores some cross-filesystem copy restrictions that
> >>>> existed
> >>>> prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy
> >>>> across
> >>>> devices"). Filesystems are still allowed to fall-back to the VFS
> >>>> generic_copy_file_range() implementation, but that has now to be done
> >>>> explicitly.
> >>>>
> >>>> nfsd is also modified to fall-back into generic_copy_file_range()
> >>>> in case
> >>>> vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
> >>>>
> >>>> Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> >>>> devices")
> >>>> Link:
> >>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/[email protected]/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmi49dC6w$
> >>>> Link:
> >>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx*[email protected]/__;Kw!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmgCmMHzA$
> >>>> Link:
> >>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/[email protected]eid/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmzqItkrQ$
> >>>> Reported-by: Nicolas Boichat <[email protected]>
> >>>> Signed-off-by: Luis Henriques <[email protected]>
> >>>> ---
> >>>> Changes since v7
> >>>> - set 'ret' to '-EOPNOTSUPP' before the clone 'if' statement so
> >>>> that the
> >>>> error returned is always related to the 'copy' operation
> >>>> Changes since v6
> >>>> - restored i_sb checks for the clone operation
> >>>> Changes since v5
> >>>> - check if ->copy_file_range is NULL before calling it
> >>>> Changes since v4
> >>>> - nfsd falls-back to generic_copy_file_range() only *if* it gets
> >>>> -EOPNOTSUPP
> >>>> or -EXDEV.
> >>>> Changes since v3
> >>>> - dropped the COPY_FILE_SPLICE flag
> >>>> - kept the f_op's checks early in generic_copy_file_checks,
> >>>> implementing
> >>>> Amir's suggestions
> >>>> - modified nfsd to use generic_copy_file_range()
> >>>> Changes since v2
> >>>> - do all the required checks earlier, in generic_copy_file_checks(),
> >>>> adding new checks for ->remap_file_range
> >>>> - new COPY_FILE_SPLICE flag
> >>>> - don't remove filesystem's fallback to generic_copy_file_range()
> >>>> - updated commit changelog (and subject)
> >>>> Changes since v1 (after Amir review)
> >>>> - restored do_copy_file_range() helper
> >>>> - return -EOPNOTSUPP if fs doesn't implement CFR
> >>>> - updated commit description
> >>>>
> >>>> fs/nfsd/vfs.c | 8 +++++++-
> >>>> fs/read_write.c | 49
> >>>> ++++++++++++++++++++++++-------------------------
> >>>> 2 files changed, 31 insertions(+), 26 deletions(-)
> >>>>
> >>>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> >>>> index 04937e51de56..23dab0fa9087 100644
> >>>> --- a/fs/nfsd/vfs.c
> >>>> +++ b/fs/nfsd/vfs.c
> >>>> @@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file
> >>>> *nf_src, u64 src_pos,
> >>>> ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos,
> >>>> struct file *dst,
> >>>> u64 dst_pos, u64 count)
> >>>> {
> >>>> + ssize_t ret;
> >>>> /*
> >>>> * Limit copy to 4MB to prevent indefinitely blocking an nfsd
> >>>> @@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src,
> >>>> u64 src_pos, struct file *dst,
> >>>> * limit like this and pipeline multiple COPY requests.
> >>>> */
> >>>> count = min_t(u64, count, 1 << 22);
> >>>> - return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
> >>>> + ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
> >>>> +
> >>>> + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> >>>> + ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
> >>>> + count, 0);
> >>>> + return ret;
> >>>> }
> >>>> __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh
> >>>> *fhp,
> >>>> diff --git a/fs/read_write.c b/fs/read_write.c
> >>>> index 75f764b43418..5a26297fd410 100644
> >>>> --- a/fs/read_write.c
> >>>> +++ b/fs/read_write.c
> >>>> @@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file
> >>>> *file_in, loff_t pos_in,
> >>>> }
> >>>> EXPORT_SYMBOL(generic_copy_file_range);
> >>>> -static ssize_t do_copy_file_range(struct file *file_in, loff_t
> >>>> pos_in,
> >>>> - struct file *file_out, loff_t pos_out,
> >>>> - size_t len, unsigned int flags)
> >>>> -{
> >>>> - /*
> >>>> - * Although we now allow filesystems to handle cross sb copy,
> >>>> passing
> >>>> - * a file of the wrong filesystem type to filesystem driver
> >>>> can result
> >>>> - * in an attempt to dereference the wrong type of
> >>>> ->private_data, so
> >>>> - * avoid doing that until we really have a good reason. NFS
> >>>> defines
> >>>> - * several different file_system_type structures, but they all
> >>>> end up
> >>>> - * using the same ->copy_file_range() function pointer.
> >>>> - */
> >>>> - if (file_out->f_op->copy_file_range &&
> >>>> - file_out->f_op->copy_file_range ==
> >>>> file_in->f_op->copy_file_range)
> >>>> - return file_out->f_op->copy_file_range(file_in, pos_in,
> >>>> - file_out, pos_out,
> >>>> - len, flags);
> >>>> -
> >>>> - return generic_copy_file_range(file_in, pos_in, file_out,
> >>>> pos_out, len,
> >>>> - flags);
> >>>> -}
> >>>> -
> >>>> /*
> >>>> * Performs necessary checks before doing a file copy
> >>>> *
> >>>> @@ -1427,6 +1405,25 @@ static int generic_copy_file_checks(struct
> >>>> file *file_in, loff_t pos_in,
> >>>> loff_t size_in;
> >>>> int ret;
> >>>> + /*
> >>>> + * Although we now allow filesystems to handle cross sb copy,
> >>>> passing
> >>>> + * a file of the wrong filesystem type to filesystem driver
> >>>> can result
> >>>> + * in an attempt to dereference the wrong type of
> >>>> ->private_data, so
> >>>> + * avoid doing that until we really have a good reason. NFS
> >>>> defines
> >>>> + * several different file_system_type structures, but they all
> >>>> end up
> >>>> + * using the same ->copy_file_range() function pointer.
> >>>> + */
> >>>> + if (file_out->f_op->copy_file_range) {
> >>>> + if (file_in->f_op->copy_file_range !=
> >>>> + file_out->f_op->copy_file_range)
> >>>> + return -EXDEV;
> >>>> + } else if (file_in->f_op->remap_file_range) {
> >>>> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> >>>> + return -EXDEV;
> >>> I think this check is redundant, it's done in vfs_copy_file_range.
> >>> If this check is removed then the else clause below should be removed
> >>> also. Once this check and the else clause are removed then might as
> >>> well move the the check of copy_file_range from here to
> >>> vfs_copy_file_range.
> >>>
> >> I don't think it's really redundant, although I agree is messy due to
> >> the
> >> fact we try to clone first instead of copying them.
> >>
> >> So, in the clone path, this is the only place where we return -EXDEV if:
> >>
> >> 1) we don't have ->copy_file_range *and*
> >> 2) we have ->remap_file_range but the i_sb are different.
> >>
> >> The check in vfs_copy_file_range() is only executed if:
> >>
> >> 1) we have *valid* ->copy_file_range ops and/or
> >> 2) we have *valid* ->remap_file_range
> >>
> >> So... if we remove the check in generic_copy_file_checks() as you
> >> suggest
> >> and:
> >> - we don't have ->copy_file_range,
> >> - we have ->remap_file_range but
> >> - the i_sb are different
> >>
> >> we'll return the -EOPNOTSUPP (the one set in line "ret =
> >> -EOPNOTSUPP;" in
> >> function vfs_copy_file_range() ) instead of -EXDEV.
> >
> > Yes, this is the different.The NFS code handles both -EOPNOTSUPP and
> > -EXDEVV by doing generic_copy_file_range. Do any other consumers of
> > vfs_copy_file_range rely on -EXDEV and not -EOPNOTSUPP and which is
> > the correct error code for this case? It seems to me that -EOPNOTSUPP
> > is more appropriate than EXDEV when (sb1 != sb2).
>

EXDEV is the right code for:
filesystem supports the operation but not for sb1 != sb1.

> So with the current patch, for a clone operation across 2 filesystems:
>
> . if src and dst filesystem support both copy_file_range and
> map_file_range then the code returns -ENOTSUPPORT.
>

Why do you say that?
Which code are you referring to exactly?
Did you see this behavior in a test?

> . if the filesystems only support map_file_range then the
> code returns -EXDEV
>
> This seems confusing, shouldn't only 1 error code returned for this case?
>

From my read of the code, user will get -EXDEV in both the cases you
listed.

Thanks,
Amir.

2021-02-23 17:00:31

by Dai Ngo

[permalink] [raw]
Subject: Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies


On 2/23/21 8:47 AM, Amir Goldstein wrote:
> On Tue, Feb 23, 2021 at 6:02 PM <[email protected]> wrote:
>>
>> On 2/23/21 7:29 AM, [email protected] wrote:
>>> On 2/23/21 2:32 AM, Luis Henriques wrote:
>>>> On Mon, Feb 22, 2021 at 08:25:27AM -0800, [email protected] wrote:
>>>>> On 2/22/21 2:24 AM, Luis Henriques wrote:
>>>>>> A regression has been reported by Nicolas Boichat, found while
>>>>>> using the
>>>>>> copy_file_range syscall to copy a tracefs file. Before commit
>>>>>> 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
>>>>>> kernel would return -EXDEV to userspace when trying to copy a file
>>>>>> across
>>>>>> different filesystems. After this commit, the syscall doesn't fail
>>>>>> anymore
>>>>>> and instead returns zero (zero bytes copied), as this file's
>>>>>> content is
>>>>>> generated on-the-fly and thus reports a size of zero.
>>>>>>
>>>>>> This patch restores some cross-filesystem copy restrictions that
>>>>>> existed
>>>>>> prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy
>>>>>> across
>>>>>> devices"). Filesystems are still allowed to fall-back to the VFS
>>>>>> generic_copy_file_range() implementation, but that has now to be done
>>>>>> explicitly.
>>>>>>
>>>>>> nfsd is also modified to fall-back into generic_copy_file_range()
>>>>>> in case
>>>>>> vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
>>>>>>
>>>>>> Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
>>>>>> devices")
>>>>>> Link:
>>>>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/[email protected]/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmi49dC6w$
>>>>>> Link:
>>>>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx*[email protected]/__;Kw!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmgCmMHzA$
>>>>>> Link:
>>>>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/[email protected]eid/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmzqItkrQ$
>>>>>> Reported-by: Nicolas Boichat <[email protected]>
>>>>>> Signed-off-by: Luis Henriques <[email protected]>
>>>>>> ---
>>>>>> Changes since v7
>>>>>> - set 'ret' to '-EOPNOTSUPP' before the clone 'if' statement so
>>>>>> that the
>>>>>> error returned is always related to the 'copy' operation
>>>>>> Changes since v6
>>>>>> - restored i_sb checks for the clone operation
>>>>>> Changes since v5
>>>>>> - check if ->copy_file_range is NULL before calling it
>>>>>> Changes since v4
>>>>>> - nfsd falls-back to generic_copy_file_range() only *if* it gets
>>>>>> -EOPNOTSUPP
>>>>>> or -EXDEV.
>>>>>> Changes since v3
>>>>>> - dropped the COPY_FILE_SPLICE flag
>>>>>> - kept the f_op's checks early in generic_copy_file_checks,
>>>>>> implementing
>>>>>> Amir's suggestions
>>>>>> - modified nfsd to use generic_copy_file_range()
>>>>>> Changes since v2
>>>>>> - do all the required checks earlier, in generic_copy_file_checks(),
>>>>>> adding new checks for ->remap_file_range
>>>>>> - new COPY_FILE_SPLICE flag
>>>>>> - don't remove filesystem's fallback to generic_copy_file_range()
>>>>>> - updated commit changelog (and subject)
>>>>>> Changes since v1 (after Amir review)
>>>>>> - restored do_copy_file_range() helper
>>>>>> - return -EOPNOTSUPP if fs doesn't implement CFR
>>>>>> - updated commit description
>>>>>>
>>>>>> fs/nfsd/vfs.c | 8 +++++++-
>>>>>> fs/read_write.c | 49
>>>>>> ++++++++++++++++++++++++-------------------------
>>>>>> 2 files changed, 31 insertions(+), 26 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>>>>>> index 04937e51de56..23dab0fa9087 100644
>>>>>> --- a/fs/nfsd/vfs.c
>>>>>> +++ b/fs/nfsd/vfs.c
>>>>>> @@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file
>>>>>> *nf_src, u64 src_pos,
>>>>>> ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos,
>>>>>> struct file *dst,
>>>>>> u64 dst_pos, u64 count)
>>>>>> {
>>>>>> + ssize_t ret;
>>>>>> /*
>>>>>> * Limit copy to 4MB to prevent indefinitely blocking an nfsd
>>>>>> @@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src,
>>>>>> u64 src_pos, struct file *dst,
>>>>>> * limit like this and pipeline multiple COPY requests.
>>>>>> */
>>>>>> count = min_t(u64, count, 1 << 22);
>>>>>> - return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
>>>>>> + ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
>>>>>> +
>>>>>> + if (ret == -EOPNOTSUPP || ret == -EXDEV)
>>>>>> + ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
>>>>>> + count, 0);
>>>>>> + return ret;
>>>>>> }
>>>>>> __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh
>>>>>> *fhp,
>>>>>> diff --git a/fs/read_write.c b/fs/read_write.c
>>>>>> index 75f764b43418..5a26297fd410 100644
>>>>>> --- a/fs/read_write.c
>>>>>> +++ b/fs/read_write.c
>>>>>> @@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file
>>>>>> *file_in, loff_t pos_in,
>>>>>> }
>>>>>> EXPORT_SYMBOL(generic_copy_file_range);
>>>>>> -static ssize_t do_copy_file_range(struct file *file_in, loff_t
>>>>>> pos_in,
>>>>>> - struct file *file_out, loff_t pos_out,
>>>>>> - size_t len, unsigned int flags)
>>>>>> -{
>>>>>> - /*
>>>>>> - * Although we now allow filesystems to handle cross sb copy,
>>>>>> passing
>>>>>> - * a file of the wrong filesystem type to filesystem driver
>>>>>> can result
>>>>>> - * in an attempt to dereference the wrong type of
>>>>>> ->private_data, so
>>>>>> - * avoid doing that until we really have a good reason. NFS
>>>>>> defines
>>>>>> - * several different file_system_type structures, but they all
>>>>>> end up
>>>>>> - * using the same ->copy_file_range() function pointer.
>>>>>> - */
>>>>>> - if (file_out->f_op->copy_file_range &&
>>>>>> - file_out->f_op->copy_file_range ==
>>>>>> file_in->f_op->copy_file_range)
>>>>>> - return file_out->f_op->copy_file_range(file_in, pos_in,
>>>>>> - file_out, pos_out,
>>>>>> - len, flags);
>>>>>> -
>>>>>> - return generic_copy_file_range(file_in, pos_in, file_out,
>>>>>> pos_out, len,
>>>>>> - flags);
>>>>>> -}
>>>>>> -
>>>>>> /*
>>>>>> * Performs necessary checks before doing a file copy
>>>>>> *
>>>>>> @@ -1427,6 +1405,25 @@ static int generic_copy_file_checks(struct
>>>>>> file *file_in, loff_t pos_in,
>>>>>> loff_t size_in;
>>>>>> int ret;
>>>>>> + /*
>>>>>> + * Although we now allow filesystems to handle cross sb copy,
>>>>>> passing
>>>>>> + * a file of the wrong filesystem type to filesystem driver
>>>>>> can result
>>>>>> + * in an attempt to dereference the wrong type of
>>>>>> ->private_data, so
>>>>>> + * avoid doing that until we really have a good reason. NFS
>>>>>> defines
>>>>>> + * several different file_system_type structures, but they all
>>>>>> end up
>>>>>> + * using the same ->copy_file_range() function pointer.
>>>>>> + */
>>>>>> + if (file_out->f_op->copy_file_range) {
>>>>>> + if (file_in->f_op->copy_file_range !=
>>>>>> + file_out->f_op->copy_file_range)
>>>>>> + return -EXDEV;
>>>>>> + } else if (file_in->f_op->remap_file_range) {
>>>>>> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
>>>>>> + return -EXDEV;
>>>>> I think this check is redundant, it's done in vfs_copy_file_range.
>>>>> If this check is removed then the else clause below should be removed
>>>>> also. Once this check and the else clause are removed then might as
>>>>> well move the the check of copy_file_range from here to
>>>>> vfs_copy_file_range.
>>>>>
>>>> I don't think it's really redundant, although I agree is messy due to
>>>> the
>>>> fact we try to clone first instead of copying them.
>>>>
>>>> So, in the clone path, this is the only place where we return -EXDEV if:
>>>>
>>>> 1) we don't have ->copy_file_range *and*
>>>> 2) we have ->remap_file_range but the i_sb are different.
>>>>
>>>> The check in vfs_copy_file_range() is only executed if:
>>>>
>>>> 1) we have *valid* ->copy_file_range ops and/or
>>>> 2) we have *valid* ->remap_file_range
>>>>
>>>> So... if we remove the check in generic_copy_file_checks() as you
>>>> suggest
>>>> and:
>>>> - we don't have ->copy_file_range,
>>>> - we have ->remap_file_range but
>>>> - the i_sb are different
>>>>
>>>> we'll return the -EOPNOTSUPP (the one set in line "ret =
>>>> -EOPNOTSUPP;" in
>>>> function vfs_copy_file_range() ) instead of -EXDEV.
>>> Yes, this is the different.The NFS code handles both -EOPNOTSUPP and
>>> -EXDEVV by doing generic_copy_file_range. Do any other consumers of
>>> vfs_copy_file_range rely on -EXDEV and not -EOPNOTSUPP and which is
>>> the correct error code for this case? It seems to me that -EOPNOTSUPP
>>> is more appropriate than EXDEV when (sb1 != sb2).
> EXDEV is the right code for:
> filesystem supports the operation but not for sb1 != sb1.
>
>> So with the current patch, for a clone operation across 2 filesystems:
>>
>> . if src and dst filesystem support both copy_file_range and
>> map_file_range then the code returns -ENOTSUPPORT.
>>
> Why do you say that?
> Which code are you referring to exactly?

If the filesystems support both copy_file_range and map_file_range,
it passes the check in generic_file_check but it fails with the
check in vfs_copy_file_range and returns -ENOTSUPPORT (added by
the v8 patch)

-Dai

> Did you see this behavior in a test?
>
>> . if the filesystems only support map_file_range then the
>> code returns -EXDEV
>>
>> This seems confusing, shouldn't only 1 error code returned for this case?
>>
> From my read of the code, user will get -EXDEV in both the cases you
> listed.
>
> Thanks,
> Amir.

2021-02-23 17:16:13

by Olga Kornievskaia

[permalink] [raw]
Subject: Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies

On Tue, Feb 23, 2021 at 11:03 AM <[email protected]> wrote:
>
>
> On 2/23/21 7:29 AM, [email protected] wrote:
> >
> > On 2/23/21 2:32 AM, Luis Henriques wrote:
> >> On Mon, Feb 22, 2021 at 08:25:27AM -0800, [email protected] wrote:
> >>> On 2/22/21 2:24 AM, Luis Henriques wrote:
> >>>> A regression has been reported by Nicolas Boichat, found while
> >>>> using the
> >>>> copy_file_range syscall to copy a tracefs file. Before commit
> >>>> 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
> >>>> kernel would return -EXDEV to userspace when trying to copy a file
> >>>> across
> >>>> different filesystems. After this commit, the syscall doesn't fail
> >>>> anymore
> >>>> and instead returns zero (zero bytes copied), as this file's
> >>>> content is
> >>>> generated on-the-fly and thus reports a size of zero.
> >>>>
> >>>> This patch restores some cross-filesystem copy restrictions that
> >>>> existed
> >>>> prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy
> >>>> across
> >>>> devices"). Filesystems are still allowed to fall-back to the VFS
> >>>> generic_copy_file_range() implementation, but that has now to be done
> >>>> explicitly.
> >>>>
> >>>> nfsd is also modified to fall-back into generic_copy_file_range()
> >>>> in case
> >>>> vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
> >>>>
> >>>> Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> >>>> devices")
> >>>> Link:
> >>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/[email protected]/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmi49dC6w$
> >>>> Link:
> >>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx*[email protected]/__;Kw!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmgCmMHzA$
> >>>> Link:
> >>>> https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/[email protected]eid/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmzqItkrQ$
> >>>> Reported-by: Nicolas Boichat <[email protected]>
> >>>> Signed-off-by: Luis Henriques <[email protected]>
> >>>> ---
> >>>> Changes since v7
> >>>> - set 'ret' to '-EOPNOTSUPP' before the clone 'if' statement so
> >>>> that the
> >>>> error returned is always related to the 'copy' operation
> >>>> Changes since v6
> >>>> - restored i_sb checks for the clone operation
> >>>> Changes since v5
> >>>> - check if ->copy_file_range is NULL before calling it
> >>>> Changes since v4
> >>>> - nfsd falls-back to generic_copy_file_range() only *if* it gets
> >>>> -EOPNOTSUPP
> >>>> or -EXDEV.
> >>>> Changes since v3
> >>>> - dropped the COPY_FILE_SPLICE flag
> >>>> - kept the f_op's checks early in generic_copy_file_checks,
> >>>> implementing
> >>>> Amir's suggestions
> >>>> - modified nfsd to use generic_copy_file_range()
> >>>> Changes since v2
> >>>> - do all the required checks earlier, in generic_copy_file_checks(),
> >>>> adding new checks for ->remap_file_range
> >>>> - new COPY_FILE_SPLICE flag
> >>>> - don't remove filesystem's fallback to generic_copy_file_range()
> >>>> - updated commit changelog (and subject)
> >>>> Changes since v1 (after Amir review)
> >>>> - restored do_copy_file_range() helper
> >>>> - return -EOPNOTSUPP if fs doesn't implement CFR
> >>>> - updated commit description
> >>>>
> >>>> fs/nfsd/vfs.c | 8 +++++++-
> >>>> fs/read_write.c | 49
> >>>> ++++++++++++++++++++++++-------------------------
> >>>> 2 files changed, 31 insertions(+), 26 deletions(-)
> >>>>
> >>>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> >>>> index 04937e51de56..23dab0fa9087 100644
> >>>> --- a/fs/nfsd/vfs.c
> >>>> +++ b/fs/nfsd/vfs.c
> >>>> @@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file
> >>>> *nf_src, u64 src_pos,
> >>>> ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos,
> >>>> struct file *dst,
> >>>> u64 dst_pos, u64 count)
> >>>> {
> >>>> + ssize_t ret;
> >>>> /*
> >>>> * Limit copy to 4MB to prevent indefinitely blocking an nfsd
> >>>> @@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src,
> >>>> u64 src_pos, struct file *dst,
> >>>> * limit like this and pipeline multiple COPY requests.
> >>>> */
> >>>> count = min_t(u64, count, 1 << 22);
> >>>> - return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
> >>>> + ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
> >>>> +
> >>>> + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> >>>> + ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
> >>>> + count, 0);
> >>>> + return ret;
> >>>> }
> >>>> __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh
> >>>> *fhp,
> >>>> diff --git a/fs/read_write.c b/fs/read_write.c
> >>>> index 75f764b43418..5a26297fd410 100644
> >>>> --- a/fs/read_write.c
> >>>> +++ b/fs/read_write.c
> >>>> @@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file
> >>>> *file_in, loff_t pos_in,
> >>>> }
> >>>> EXPORT_SYMBOL(generic_copy_file_range);
> >>>> -static ssize_t do_copy_file_range(struct file *file_in, loff_t
> >>>> pos_in,
> >>>> - struct file *file_out, loff_t pos_out,
> >>>> - size_t len, unsigned int flags)
> >>>> -{
> >>>> - /*
> >>>> - * Although we now allow filesystems to handle cross sb copy,
> >>>> passing
> >>>> - * a file of the wrong filesystem type to filesystem driver
> >>>> can result
> >>>> - * in an attempt to dereference the wrong type of
> >>>> ->private_data, so
> >>>> - * avoid doing that until we really have a good reason. NFS
> >>>> defines
> >>>> - * several different file_system_type structures, but they all
> >>>> end up
> >>>> - * using the same ->copy_file_range() function pointer.
> >>>> - */
> >>>> - if (file_out->f_op->copy_file_range &&
> >>>> - file_out->f_op->copy_file_range ==
> >>>> file_in->f_op->copy_file_range)
> >>>> - return file_out->f_op->copy_file_range(file_in, pos_in,
> >>>> - file_out, pos_out,
> >>>> - len, flags);
> >>>> -
> >>>> - return generic_copy_file_range(file_in, pos_in, file_out,
> >>>> pos_out, len,
> >>>> - flags);
> >>>> -}
> >>>> -
> >>>> /*
> >>>> * Performs necessary checks before doing a file copy
> >>>> *
> >>>> @@ -1427,6 +1405,25 @@ static int generic_copy_file_checks(struct
> >>>> file *file_in, loff_t pos_in,
> >>>> loff_t size_in;
> >>>> int ret;
> >>>> + /*
> >>>> + * Although we now allow filesystems to handle cross sb copy,
> >>>> passing
> >>>> + * a file of the wrong filesystem type to filesystem driver
> >>>> can result
> >>>> + * in an attempt to dereference the wrong type of
> >>>> ->private_data, so
> >>>> + * avoid doing that until we really have a good reason. NFS
> >>>> defines
> >>>> + * several different file_system_type structures, but they all
> >>>> end up
> >>>> + * using the same ->copy_file_range() function pointer.
> >>>> + */
> >>>> + if (file_out->f_op->copy_file_range) {
> >>>> + if (file_in->f_op->copy_file_range !=
> >>>> + file_out->f_op->copy_file_range)
> >>>> + return -EXDEV;
> >>>> + } else if (file_in->f_op->remap_file_range) {
> >>>> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> >>>> + return -EXDEV;
> >>> I think this check is redundant, it's done in vfs_copy_file_range.
> >>> If this check is removed then the else clause below should be removed
> >>> also. Once this check and the else clause are removed then might as
> >>> well move the the check of copy_file_range from here to
> >>> vfs_copy_file_range.
> >>>
> >> I don't think it's really redundant, although I agree is messy due to
> >> the
> >> fact we try to clone first instead of copying them.
> >>
> >> So, in the clone path, this is the only place where we return -EXDEV if:
> >>
> >> 1) we don't have ->copy_file_range *and*
> >> 2) we have ->remap_file_range but the i_sb are different.
> >>
> >> The check in vfs_copy_file_range() is only executed if:
> >>
> >> 1) we have *valid* ->copy_file_range ops and/or
> >> 2) we have *valid* ->remap_file_range
> >>
> >> So... if we remove the check in generic_copy_file_checks() as you
> >> suggest
> >> and:
> >> - we don't have ->copy_file_range,
> >> - we have ->remap_file_range but
> >> - the i_sb are different
> >>
> >> we'll return the -EOPNOTSUPP (the one set in line "ret =
> >> -EOPNOTSUPP;" in
> >> function vfs_copy_file_range() ) instead of -EXDEV.
> >
> > Yes, this is the different.The NFS code handles both -EOPNOTSUPP and
> > -EXDEVV by doing generic_copy_file_range. Do any other consumers of
> > vfs_copy_file_range rely on -EXDEV and not -EOPNOTSUPP and which is
> > the correct error code for this case? It seems to me that -EOPNOTSUPP
> > is more appropriate than EXDEV when (sb1 != sb2).
>
> So with the current patch, for a clone operation across 2 filesystems:

Wait, I can't get passed "a clone operation across 2 filesystems", I
thought there are not any options. It's not allowed? Then we go do try
the copy. Those are two different steps so errors code might be
different.

> . if src and dst filesystem support both copy_file_range and
> map_file_range then the code returns -ENOTSUPPORT.
>
> . if the filesystems only support map_file_range then the
> code returns -EXDEV
>
> This seems confusing, shouldn't only 1 error code returned for this case?
>
> -Dai
>
> >
> >>
> >> But I may have got it all wrong. I've looked so many times at this code
> >> that I'm probably useless at finding problems in it :-)
> >
> > You're not alone, we all try to do the right thing :-)
> >
> > -Dai
> >
> >>
> >> Cheers,
> >> --
> >> Luís
> >>
> >>> -Dai
> >>>
> >>>> + } else {
> >>>> + return -EOPNOTSUPP;
> >>>> + }
> >>>> +
> >>>> ret = generic_file_rw_checks(file_in, file_out);
> >>>> if (ret)
> >>>> return ret;
> >>>> @@ -1495,6 +1492,7 @@ ssize_t vfs_copy_file_range(struct file
> >>>> *file_in, loff_t pos_in,
> >>>> file_start_write(file_out);
> >>>> + ret = -EOPNOTSUPP;
> >>>> /*
> >>>> * Try cloning first, this is supported by more file
> >>>> systems, and
> >>>> * more efficient if both clone and copy are supported (e.g.
> >>>> NFS).
> >>>> @@ -1513,9 +1511,10 @@ ssize_t vfs_copy_file_range(struct file
> >>>> *file_in, loff_t pos_in,
> >>>> }
> >>>> }
> >>>> - ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> >>>> - flags);
> >>>> - WARN_ON_ONCE(ret == -EOPNOTSUPP);
> >>>> + if (file_out->f_op->copy_file_range)
> >>>> + ret = file_out->f_op->copy_file_range(file_in, pos_in,
> >>>> + file_out, pos_out,
> >>>> + len, flags);
> >>>> done:
> >>>> if (ret > 0) {
> >>>> fsnotify_access(file_in);

2021-02-23 17:56:34

by Luis Henriques

[permalink] [raw]
Subject: Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies

On Tue, Feb 23, 2021 at 08:57:38AM -0800, [email protected] wrote:
>
> On 2/23/21 8:47 AM, Amir Goldstein wrote:
> > On Tue, Feb 23, 2021 at 6:02 PM <[email protected]> wrote:
> > >
> > > On 2/23/21 7:29 AM, [email protected] wrote:
> > > > On 2/23/21 2:32 AM, Luis Henriques wrote:
> > > > > On Mon, Feb 22, 2021 at 08:25:27AM -0800, [email protected] wrote:
> > > > > > On 2/22/21 2:24 AM, Luis Henriques wrote:
> > > > > > > A regression has been reported by Nicolas Boichat, found while
> > > > > > > using the
> > > > > > > copy_file_range syscall to copy a tracefs file. Before commit
> > > > > > > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
> > > > > > > kernel would return -EXDEV to userspace when trying to copy a file
> > > > > > > across
> > > > > > > different filesystems. After this commit, the syscall doesn't fail
> > > > > > > anymore
> > > > > > > and instead returns zero (zero bytes copied), as this file's
> > > > > > > content is
> > > > > > > generated on-the-fly and thus reports a size of zero.
> > > > > > >
> > > > > > > This patch restores some cross-filesystem copy restrictions that
> > > > > > > existed
> > > > > > > prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy
> > > > > > > across
> > > > > > > devices"). Filesystems are still allowed to fall-back to the VFS
> > > > > > > generic_copy_file_range() implementation, but that has now to be done
> > > > > > > explicitly.
> > > > > > >
> > > > > > > nfsd is also modified to fall-back into generic_copy_file_range()
> > > > > > > in case
> > > > > > > vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
> > > > > > >
> > > > > > > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> > > > > > > devices")
> > > > > > > Link:
> > > > > > > https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/[email protected]/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmi49dC6w$
> > > > > > > Link:
> > > > > > > https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx*[email protected]/__;Kw!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmgCmMHzA$
> > > > > > > Link:
> > > > > > > https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/[email protected]eid/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmzqItkrQ$
> > > > > > > Reported-by: Nicolas Boichat <[email protected]>
> > > > > > > Signed-off-by: Luis Henriques <[email protected]>
> > > > > > > ---
> > > > > > > Changes since v7
> > > > > > > - set 'ret' to '-EOPNOTSUPP' before the clone 'if' statement so
> > > > > > > that the
> > > > > > > error returned is always related to the 'copy' operation
> > > > > > > Changes since v6
> > > > > > > - restored i_sb checks for the clone operation
> > > > > > > Changes since v5
> > > > > > > - check if ->copy_file_range is NULL before calling it
> > > > > > > Changes since v4
> > > > > > > - nfsd falls-back to generic_copy_file_range() only *if* it gets
> > > > > > > -EOPNOTSUPP
> > > > > > > or -EXDEV.
> > > > > > > Changes since v3
> > > > > > > - dropped the COPY_FILE_SPLICE flag
> > > > > > > - kept the f_op's checks early in generic_copy_file_checks,
> > > > > > > implementing
> > > > > > > Amir's suggestions
> > > > > > > - modified nfsd to use generic_copy_file_range()
> > > > > > > Changes since v2
> > > > > > > - do all the required checks earlier, in generic_copy_file_checks(),
> > > > > > > adding new checks for ->remap_file_range
> > > > > > > - new COPY_FILE_SPLICE flag
> > > > > > > - don't remove filesystem's fallback to generic_copy_file_range()
> > > > > > > - updated commit changelog (and subject)
> > > > > > > Changes since v1 (after Amir review)
> > > > > > > - restored do_copy_file_range() helper
> > > > > > > - return -EOPNOTSUPP if fs doesn't implement CFR
> > > > > > > - updated commit description
> > > > > > >
> > > > > > > fs/nfsd/vfs.c | 8 +++++++-
> > > > > > > fs/read_write.c | 49
> > > > > > > ++++++++++++++++++++++++-------------------------
> > > > > > > 2 files changed, 31 insertions(+), 26 deletions(-)
> > > > > > >
> > > > > > > diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
> > > > > > > index 04937e51de56..23dab0fa9087 100644
> > > > > > > --- a/fs/nfsd/vfs.c
> > > > > > > +++ b/fs/nfsd/vfs.c
> > > > > > > @@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file
> > > > > > > *nf_src, u64 src_pos,
> > > > > > > ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos,
> > > > > > > struct file *dst,
> > > > > > > u64 dst_pos, u64 count)
> > > > > > > {
> > > > > > > + ssize_t ret;
> > > > > > > /*
> > > > > > > * Limit copy to 4MB to prevent indefinitely blocking an nfsd
> > > > > > > @@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src,
> > > > > > > u64 src_pos, struct file *dst,
> > > > > > > * limit like this and pipeline multiple COPY requests.
> > > > > > > */
> > > > > > > count = min_t(u64, count, 1 << 22);
> > > > > > > - return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
> > > > > > > + ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
> > > > > > > +
> > > > > > > + if (ret == -EOPNOTSUPP || ret == -EXDEV)
> > > > > > > + ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
> > > > > > > + count, 0);
> > > > > > > + return ret;
> > > > > > > }
> > > > > > > __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh
> > > > > > > *fhp,
> > > > > > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > > > > > index 75f764b43418..5a26297fd410 100644
> > > > > > > --- a/fs/read_write.c
> > > > > > > +++ b/fs/read_write.c
> > > > > > > @@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file
> > > > > > > *file_in, loff_t pos_in,
> > > > > > > }
> > > > > > > EXPORT_SYMBOL(generic_copy_file_range);
> > > > > > > -static ssize_t do_copy_file_range(struct file *file_in, loff_t
> > > > > > > pos_in,
> > > > > > > - struct file *file_out, loff_t pos_out,
> > > > > > > - size_t len, unsigned int flags)
> > > > > > > -{
> > > > > > > - /*
> > > > > > > - * Although we now allow filesystems to handle cross sb copy,
> > > > > > > passing
> > > > > > > - * a file of the wrong filesystem type to filesystem driver
> > > > > > > can result
> > > > > > > - * in an attempt to dereference the wrong type of
> > > > > > > ->private_data, so
> > > > > > > - * avoid doing that until we really have a good reason. NFS
> > > > > > > defines
> > > > > > > - * several different file_system_type structures, but they all
> > > > > > > end up
> > > > > > > - * using the same ->copy_file_range() function pointer.
> > > > > > > - */
> > > > > > > - if (file_out->f_op->copy_file_range &&
> > > > > > > - file_out->f_op->copy_file_range ==
> > > > > > > file_in->f_op->copy_file_range)
> > > > > > > - return file_out->f_op->copy_file_range(file_in, pos_in,
> > > > > > > - file_out, pos_out,
> > > > > > > - len, flags);
> > > > > > > -
> > > > > > > - return generic_copy_file_range(file_in, pos_in, file_out,
> > > > > > > pos_out, len,
> > > > > > > - flags);
> > > > > > > -}
> > > > > > > -
> > > > > > > /*
> > > > > > > * Performs necessary checks before doing a file copy
> > > > > > > *
> > > > > > > @@ -1427,6 +1405,25 @@ static int generic_copy_file_checks(struct
> > > > > > > file *file_in, loff_t pos_in,
> > > > > > > loff_t size_in;
> > > > > > > int ret;
> > > > > > > + /*
> > > > > > > + * Although we now allow filesystems to handle cross sb copy,
> > > > > > > passing
> > > > > > > + * a file of the wrong filesystem type to filesystem driver
> > > > > > > can result
> > > > > > > + * in an attempt to dereference the wrong type of
> > > > > > > ->private_data, so
> > > > > > > + * avoid doing that until we really have a good reason. NFS
> > > > > > > defines
> > > > > > > + * several different file_system_type structures, but they all
> > > > > > > end up
> > > > > > > + * using the same ->copy_file_range() function pointer.
> > > > > > > + */
> > > > > > > + if (file_out->f_op->copy_file_range) {
> > > > > > > + if (file_in->f_op->copy_file_range !=
> > > > > > > + file_out->f_op->copy_file_range)
> > > > > > > + return -EXDEV;
> > > > > > > + } else if (file_in->f_op->remap_file_range) {
> > > > > > > + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> > > > > > > + return -EXDEV;
> > > > > > I think this check is redundant, it's done in vfs_copy_file_range.
> > > > > > If this check is removed then the else clause below should be removed
> > > > > > also. Once this check and the else clause are removed then might as
> > > > > > well move the the check of copy_file_range from here to
> > > > > > vfs_copy_file_range.
> > > > > >
> > > > > I don't think it's really redundant, although I agree is messy due to
> > > > > the
> > > > > fact we try to clone first instead of copying them.
> > > > >
> > > > > So, in the clone path, this is the only place where we return -EXDEV if:
> > > > >
> > > > > 1) we don't have ->copy_file_range *and*
> > > > > 2) we have ->remap_file_range but the i_sb are different.
> > > > >
> > > > > The check in vfs_copy_file_range() is only executed if:
> > > > >
> > > > > 1) we have *valid* ->copy_file_range ops and/or
> > > > > 2) we have *valid* ->remap_file_range
> > > > >
> > > > > So... if we remove the check in generic_copy_file_checks() as you
> > > > > suggest
> > > > > and:
> > > > > - we don't have ->copy_file_range,
> > > > > - we have ->remap_file_range but
> > > > > - the i_sb are different
> > > > >
> > > > > we'll return the -EOPNOTSUPP (the one set in line "ret =
> > > > > -EOPNOTSUPP;" in
> > > > > function vfs_copy_file_range() ) instead of -EXDEV.
> > > > Yes, this is the different.The NFS code handles both -EOPNOTSUPP and
> > > > -EXDEVV by doing generic_copy_file_range. Do any other consumers of
> > > > vfs_copy_file_range rely on -EXDEV and not -EOPNOTSUPP and which is
> > > > the correct error code for this case? It seems to me that -EOPNOTSUPP
> > > > is more appropriate than EXDEV when (sb1 != sb2).
> > EXDEV is the right code for:
> > filesystem supports the operation but not for sb1 != sb1.
> >
> > > So with the current patch, for a clone operation across 2 filesystems:
> > >
> > > . if src and dst filesystem support both copy_file_range and
> > > map_file_range then the code returns -ENOTSUPPORT.
> > >
> > Why do you say that?
> > Which code are you referring to exactly?
>
> If the filesystems support both copy_file_range and map_file_range,
> it passes the check in generic_file_check but it fails with the
> check in vfs_copy_file_range and returns -ENOTSUPPORT (added by
> the v8 patch)

I'm sorry but I can't simply see where this can happen. If both syscalls
are present (and all other checks pass), the code will first try the
->map_file_range. If that succeeds, it bails out; if that fails, it tries
the ->copy_file_range. The -ENOTSUPPORT is just there for the case the
->map_file_range fails and ->copy_file_range isn't implemented.

[ <sigh> It would be so much easier if we didn't attempt to clone. ]

But as I said previously, I'm way beyond embarrassment now as I failed to
see too many obvious mistakes in previous versions :-)

Cheers,
--
Lu?s

2021-02-24 01:24:59

by Dai Ngo

[permalink] [raw]
Subject: Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies

On 2/23/21 9:33 AM, Amir Goldstein wrote:
> On Tue, Feb 23, 2021 at 7:31 PM <[email protected]> wrote:
>> On 2/23/21 8:57 AM, [email protected] wrote:
>>
>>
>> On 2/23/21 8:47 AM, Amir Goldstein wrote:
>>
>> On Tue, Feb 23, 2021 at 6:02 PM <[email protected]> wrote:
>>
>>
>> On 2/23/21 7:29 AM, [email protected] wrote:
>>
>> On 2/23/21 2:32 AM, Luis Henriques wrote:
>>
>> On Mon, Feb 22, 2021 at 08:25:27AM -0800, [email protected] wrote:
>>
>> On 2/22/21 2:24 AM, Luis Henriques wrote:
>>
>> A regression has been reported by Nicolas Boichat, found while
>> using the
>> copy_file_range syscall to copy a tracefs file. Before commit
>> 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
>> kernel would return -EXDEV to userspace when trying to copy a file
>> across
>> different filesystems. After this commit, the syscall doesn't fail
>> anymore
>> and instead returns zero (zero bytes copied), as this file's
>> content is
>> generated on-the-fly and thus reports a size of zero.
>>
>> This patch restores some cross-filesystem copy restrictions that
>> existed
>> prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy
>> across
>> devices"). Filesystems are still allowed to fall-back to the VFS
>> generic_copy_file_range() implementation, but that has now to be done
>> explicitly.
>>
>> nfsd is also modified to fall-back into generic_copy_file_range()
>> in case
>> vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
>>
>> Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
>> devices")
>> Link:
>> https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/[email protected]/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmi49dC6w$
>> Link:
>> https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/CANMq1KDZuxir2LM5jOTm0xx*[email protected]/__;Kw!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmgCmMHzA$
>> Link:
>> https://urldefense.com/v3/__https://lore.kernel.org/linux-fsdevel/[email protected]eid/__;!!GqivPVa7Brio!P1UWThiSkxbjfjFQWNYJmCxGEkiLFyvHjH6cS-G1ZTt1z-TeqwGQgQmzqItkrQ$
>> Reported-by: Nicolas Boichat <[email protected]>
>> Signed-off-by: Luis Henriques <[email protected]>
>> ---
>> Changes since v7
>> - set 'ret' to '-EOPNOTSUPP' before the clone 'if' statement so
>> that the
>> error returned is always related to the 'copy' operation
>> Changes since v6
>> - restored i_sb checks for the clone operation
>> Changes since v5
>> - check if ->copy_file_range is NULL before calling it
>> Changes since v4
>> - nfsd falls-back to generic_copy_file_range() only *if* it gets
>> -EOPNOTSUPP
>> or -EXDEV.
>> Changes since v3
>> - dropped the COPY_FILE_SPLICE flag
>> - kept the f_op's checks early in generic_copy_file_checks,
>> implementing
>> Amir's suggestions
>> - modified nfsd to use generic_copy_file_range()
>> Changes since v2
>> - do all the required checks earlier, in generic_copy_file_checks(),
>> adding new checks for ->remap_file_range
>> - new COPY_FILE_SPLICE flag
>> - don't remove filesystem's fallback to generic_copy_file_range()
>> - updated commit changelog (and subject)
>> Changes since v1 (after Amir review)
>> - restored do_copy_file_range() helper
>> - return -EOPNOTSUPP if fs doesn't implement CFR
>> - updated commit description
>>
>> fs/nfsd/vfs.c | 8 +++++++-
>> fs/read_write.c | 49
>> ++++++++++++++++++++++++-------------------------
>> 2 files changed, 31 insertions(+), 26 deletions(-)
>>
>> diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
>> index 04937e51de56..23dab0fa9087 100644
>> --- a/fs/nfsd/vfs.c
>> +++ b/fs/nfsd/vfs.c
>> @@ -568,6 +568,7 @@ __be32 nfsd4_clone_file_range(struct nfsd_file
>> *nf_src, u64 src_pos,
>> ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos,
>> struct file *dst,
>> u64 dst_pos, u64 count)
>> {
>> + ssize_t ret;
>> /*
>> * Limit copy to 4MB to prevent indefinitely blocking an nfsd
>> @@ -578,7 +579,12 @@ ssize_t nfsd_copy_file_range(struct file *src,
>> u64 src_pos, struct file *dst,
>> * limit like this and pipeline multiple COPY requests.
>> */
>> count = min_t(u64, count, 1 << 22);
>> - return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
>> + ret = vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
>> +
>> + if (ret == -EOPNOTSUPP || ret == -EXDEV)
>> + ret = generic_copy_file_range(src, src_pos, dst, dst_pos,
>> + count, 0);
>> + return ret;
>> }
>> __be32 nfsd4_vfs_fallocate(struct svc_rqst *rqstp, struct svc_fh
>> *fhp,
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 75f764b43418..5a26297fd410 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -1388,28 +1388,6 @@ ssize_t generic_copy_file_range(struct file
>> *file_in, loff_t pos_in,
>> }
>> EXPORT_SYMBOL(generic_copy_file_range);
>> -static ssize_t do_copy_file_range(struct file *file_in, loff_t
>> pos_in,
>> - struct file *file_out, loff_t pos_out,
>> - size_t len, unsigned int flags)
>> -{
>> - /*
>> - * Although we now allow filesystems to handle cross sb copy,
>> passing
>> - * a file of the wrong filesystem type to filesystem driver
>> can result
>> - * in an attempt to dereference the wrong type of
>> ->private_data, so
>> - * avoid doing that until we really have a good reason. NFS
>> defines
>> - * several different file_system_type structures, but they all
>> end up
>> - * using the same ->copy_file_range() function pointer.
>> - */
>> - if (file_out->f_op->copy_file_range &&
>> - file_out->f_op->copy_file_range ==
>> file_in->f_op->copy_file_range)
>> - return file_out->f_op->copy_file_range(file_in, pos_in,
>> - file_out, pos_out,
>> - len, flags);
>> -
>> - return generic_copy_file_range(file_in, pos_in, file_out,
>> pos_out, len,
>> - flags);
>> -}
>> -
>> /*
>> * Performs necessary checks before doing a file copy
>> *
>> @@ -1427,6 +1405,25 @@ static int generic_copy_file_checks(struct
>> file *file_in, loff_t pos_in,
>> loff_t size_in;
>> int ret;
>> + /*
>> + * Although we now allow filesystems to handle cross sb copy,
>> passing
>> + * a file of the wrong filesystem type to filesystem driver
>> can result
>> + * in an attempt to dereference the wrong type of
>> ->private_data, so
>> + * avoid doing that until we really have a good reason. NFS
>> defines
>> + * several different file_system_type structures, but they all
>> end up
>> + * using the same ->copy_file_range() function pointer.
>> + */
>> + if (file_out->f_op->copy_file_range) {
>> + if (file_in->f_op->copy_file_range !=
>> + file_out->f_op->copy_file_range)
>> + return -EXDEV;
>> + } else if (file_in->f_op->remap_file_range) {
>> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
>> + return -EXDEV;
>>
>> I think this check is redundant, it's done in vfs_copy_file_range.
>> If this check is removed then the else clause below should be removed
>> also. Once this check and the else clause are removed then might as
>> well move the the check of copy_file_range from here to
>> vfs_copy_file_range.
>>
>> I don't think it's really redundant, although I agree is messy due to
>> the
>> fact we try to clone first instead of copying them.
>>
>> So, in the clone path, this is the only place where we return -EXDEV if:
>>
>> 1) we don't have ->copy_file_range *and*
>> 2) we have ->remap_file_range but the i_sb are different.
>>
>> The check in vfs_copy_file_range() is only executed if:
>>
>> 1) we have *valid* ->copy_file_range ops and/or
>> 2) we have *valid* ->remap_file_range
>>
>> So... if we remove the check in generic_copy_file_checks() as you
>> suggest
>> and:
>> - we don't have ->copy_file_range,
>> - we have ->remap_file_range but
>> - the i_sb are different
>>
>> we'll return the -EOPNOTSUPP (the one set in line "ret =
>> -EOPNOTSUPP;" in
>> function vfs_copy_file_range() ) instead of -EXDEV.
>>
>> Yes, this is the different.The NFS code handles both -EOPNOTSUPP and
>> -EXDEVV by doing generic_copy_file_range. Do any other consumers of
>> vfs_copy_file_range rely on -EXDEV and not -EOPNOTSUPP and which is
>> the correct error code for this case? It seems to me that -EOPNOTSUPP
>> is more appropriate than EXDEV when (sb1 != sb2).
>>
>> EXDEV is the right code for:
>> filesystem supports the operation but not for sb1 != sb1.
>>
>> So with the current patch, for a clone operation across 2 filesystems:
>>
>> . if src and dst filesystem support both copy_file_range and
>> map_file_range then the code returns -ENOTSUPPORT.
>>
>> Why do you say that?
>> Which code are you referring to exactly?
>>
>>
>> If the filesystems support both copy_file_range and map_file_range,
>> it passes the check in generic_file_check but it fails with the
>> check in vfs_copy_file_range and returns -ENOTSUPPORT (added by
>> the v8 patch)
>>
>> Ok, I misread the code here. If it passes the check in generic_copy_file_checks
>> and it fails the sb check in vfs_copy_file_range then it tries copy_file_range
>> so it's ok.
>>
>> I think having the check in both generic_copy_file_checks and vfs_copy_file_range
>> making the code hard to read. What's the reason not to do the check only in
>> vfs_copy_file_range?
>>
> You are going in circles.
> I already answered that.
> Please re-read the entire thread on all patch versions before commenting.

I'm fine with the patch as it is, as long as it does not break NFS.

I just think it's easier to read if the checks are done in
vfs_copy_file_range such as:

@@ -1495,6 +1473,11 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,

file_start_write(file_out);

+ if (file_out->f_op->copy_file_range == NULL &&
+ file_in->f_op->remap_file_range == NULL)
+ return -EOPNOTSUPP; /* not sure this error is needed */
+
+ ret = -EXDEV;
/*
* Try cloning first, this is supported by more file systems, and
* more efficient if both clone and copy are supported (e.g. NFS).
@@ -1513,9 +1496,10 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
}
}

- ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len,
- flags);
- WARN_ON_ONCE(ret == -EOPNOTSUPP);
+ if (file_out->f_op->copy_file_range &&
+ file_out->f_op->copy_file_range == file_in->f_op->copy_file_range) {
+ ret = file_out->f_op->copy_file_range(file_in, pos_in,
+ file_out, pos_out, len, flags);
done:
if (ret > 0) {
fsnotify_access(file_in);

Thanks,
-Dai

>
> Thanks,
> Amir.

2021-02-24 10:25:11

by Luis Henriques

[permalink] [raw]
Subject: Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies

On Tue, Feb 23, 2021 at 08:00:54PM -0500, Olga Kornievskaia wrote:
> On Mon, Feb 22, 2021 at 5:25 AM Luis Henriques <[email protected]> wrote:
> >
> > A regression has been reported by Nicolas Boichat, found while using the
> > copy_file_range syscall to copy a tracefs file. Before commit
> > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
> > kernel would return -EXDEV to userspace when trying to copy a file across
> > different filesystems. After this commit, the syscall doesn't fail anymore
> > and instead returns zero (zero bytes copied), as this file's content is
> > generated on-the-fly and thus reports a size of zero.
> >
> > This patch restores some cross-filesystem copy restrictions that existed
> > prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> > devices"). Filesystems are still allowed to fall-back to the VFS
> > generic_copy_file_range() implementation, but that has now to be done
> > explicitly.
> >
> > nfsd is also modified to fall-back into generic_copy_file_range() in case
> > vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
> >
> > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
> > Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
> > Link: https://lore.kernel.org/linux-fsdevel/[email protected]om/
> > Link: https://lore.kernel.org/linux-fsdevel/[email protected]eid/
> > Reported-by: Nicolas Boichat <[email protected]>
> > Signed-off-by: Luis Henriques <[email protected]>
>
> I tested v8 and I believe it works for NFS.

Thanks a lot for the testing. And to everyone else for reviews,
feedback,... and patience.

I'll now go look into the manpage and see what needs to be changed.

Cheers,
--
Lu?s

2021-02-24 10:48:27

by Nicolas Boichat

[permalink] [raw]
Subject: Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies

On Wed, Feb 24, 2021 at 6:22 PM Luis Henriques <[email protected]> wrote:
>
> On Tue, Feb 23, 2021 at 08:00:54PM -0500, Olga Kornievskaia wrote:
> > On Mon, Feb 22, 2021 at 5:25 AM Luis Henriques <[email protected]> wrote:
> > >
> > > A regression has been reported by Nicolas Boichat, found while using the
> > > copy_file_range syscall to copy a tracefs file. Before commit
> > > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
> > > kernel would return -EXDEV to userspace when trying to copy a file across
> > > different filesystems. After this commit, the syscall doesn't fail anymore
> > > and instead returns zero (zero bytes copied), as this file's content is
> > > generated on-the-fly and thus reports a size of zero.
> > >
> > > This patch restores some cross-filesystem copy restrictions that existed
> > > prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> > > devices"). Filesystems are still allowed to fall-back to the VFS
> > > generic_copy_file_range() implementation, but that has now to be done
> > > explicitly.
> > >
> > > nfsd is also modified to fall-back into generic_copy_file_range() in case
> > > vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
> > >
> > > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
> > > Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
> > > Link: https://lore.kernel.org/linux-fsdevel/[email protected]om/
> > > Link: https://lore.kernel.org/linux-fsdevel/[email protected]eid/
> > > Reported-by: Nicolas Boichat <[email protected]>
> > > Signed-off-by: Luis Henriques <[email protected]>
> >
> > I tested v8 and I believe it works for NFS.
>
> Thanks a lot for the testing. And to everyone else for reviews,
> feedback,... and patience.

Thanks so much to you!!!

Works here, you can add my
Tested-by: Nicolas Boichat <[email protected]>

>
> I'll now go look into the manpage and see what needs to be changed.
>
> Cheers,
> --
> Luís

2021-02-24 14:55:52

by Luis Henriques

[permalink] [raw]
Subject: [PATCH] copy_file_range.2: Kernel v5.12 updates

Update man-page with recent changes to this syscall.

Signed-off-by: Luis Henriques <[email protected]>
---
Hi!

Here's a suggestion for fixing the manpage for copy_file_range(). Note that
I've assumed the fix will hit 5.12.

man2/copy_file_range.2 | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
index 611a39b8026b..b0fd85e2631e 100644
--- a/man2/copy_file_range.2
+++ b/man2/copy_file_range.2
@@ -169,6 +169,9 @@ Out of memory.
.B ENOSPC
There is not enough space on the target filesystem to complete the copy.
.TP
+.B EOPNOTSUPP
+The filesystem does not support this operation.
+.TP
.B EOVERFLOW
The requested source or destination range is too large to represent in the
specified data types.
@@ -187,7 +190,7 @@ refers to an active swap file.
.B EXDEV
The files referred to by
.IR fd_in " and " fd_out
-are not on the same mounted filesystem (pre Linux 5.3).
+are not on the same mounted filesystem (pre Linux 5.3 and post Linux 5.12).
.SH VERSIONS
The
.BR copy_file_range ()
@@ -202,6 +205,11 @@ Applications should target the behaviour and requirements of 5.3 kernels.
.PP
First support for cross-filesystem copies was introduced in Linux 5.3.
Older kernels will return -EXDEV when cross-filesystem copies are attempted.
+.PP
+After Linux 5.12, support for copies between different filesystems was dropped.
+However, individual filesystems may still provide
+.BR copy_file_range ()
+implementations that allow copies across different devices.
.SH CONFORMING TO
The
.BR copy_file_range ()

2021-02-25 10:28:36

by Luis Henriques

[permalink] [raw]
Subject: Re: [PATCH] copy_file_range.2: Kernel v5.12 updates

On Wed, Feb 24, 2021 at 06:10:45PM +0200, Amir Goldstein wrote:
> On Wed, Feb 24, 2021 at 4:22 PM Luis Henriques <[email protected]> wrote:
> >
> > Update man-page with recent changes to this syscall.
> >
> > Signed-off-by: Luis Henriques <[email protected]>
> > ---
> > Hi!
> >
> > Here's a suggestion for fixing the manpage for copy_file_range(). Note that
> > I've assumed the fix will hit 5.12.
> >
> > man2/copy_file_range.2 | 10 +++++++++-
> > 1 file changed, 9 insertions(+), 1 deletion(-)
> >
> > diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> > index 611a39b8026b..b0fd85e2631e 100644
> > --- a/man2/copy_file_range.2
> > +++ b/man2/copy_file_range.2
> > @@ -169,6 +169,9 @@ Out of memory.
> > .B ENOSPC
> > There is not enough space on the target filesystem to complete the copy.
> > .TP
> > +.B EOPNOTSUPP
> > +The filesystem does not support this operation.
> > +.TP
> > .B EOVERFLOW
> > The requested source or destination range is too large to represent in the
> > specified data types.
> > @@ -187,7 +190,7 @@ refers to an active swap file.
> > .B EXDEV
> > The files referred to by
> > .IR fd_in " and " fd_out
> > -are not on the same mounted filesystem (pre Linux 5.3).
> > +are not on the same mounted filesystem (pre Linux 5.3 and post Linux 5.12).
>
> I think you need to drop the (Linux range) altogether.
> What's missing here is the NFS cross server copy use case.
> Maybe:
>
> ...are not on the same mounted filesystem and the source and target filesystems
> do not support cross-filesystem copy.
>
> You may refer the reader to VERSIONS section where it will say which
> filesystems support cross-fs copy as of kernel version XXX (i.e. cifs and nfs).
>
> > .SH VERSIONS
> > The
> > .BR copy_file_range ()
> > @@ -202,6 +205,11 @@ Applications should target the behaviour and requirements of 5.3 kernels.
> > .PP
> > First support for cross-filesystem copies was introduced in Linux 5.3.
> > Older kernels will return -EXDEV when cross-filesystem copies are attempted.
> > +.PP
> > +After Linux 5.12, support for copies between different filesystems was dropped.
> > +However, individual filesystems may still provide
> > +.BR copy_file_range ()
> > +implementations that allow copies across different devices.
>
> Again, this is not likely to stay uptodate for very long.
> The stable kernels are expected to apply your patch (because it fixes
> a regression)
> so this should be phrased differently.
> If it were me, I would provide all the details of the situation to
> Michael and ask him
> to write the best description for this section.

Thanks Amir.

Yeah, it's tricky. Support was added and then dropped. Since stable
kernels will be picking this patch, maybe the best thing to do is to no
mention the generic cross-filesystem support at all...? Or simply say
that 5.3 temporarily supported it but that support was later dropped.

Michael (or Alejandro), would you be OK handling this yourself as Amir
suggested?

Cheers,
--
Lu?s

Subject: Re: [PATCH] copy_file_range.2: Kernel v5.12 updates

Hello Luis,

On 2/25/21 11:21 AM, Luis Henriques wrote:
> On Wed, Feb 24, 2021 at 06:10:45PM +0200, Amir Goldstein wrote:
>> If it were me, I would provide all the details of the situation to
>> Michael and ask him
>> to write the best description for this section.
>
> Thanks Amir.
>
> Yeah, it's tricky. Support was added and then dropped. Since stable
> kernels will be picking this patch, maybe the best thing to do is to no
> mention the generic cross-filesystem support at all...? Or simply say
> that 5.3 temporarily supported it but that support was later dropped.
>
> Michael (or Alejandro), would you be OK handling this yourself as Amir
> suggested?

Could you please provide a more detailed history of what is to be
documented?

Thanks,

Alex

--
Alejandro Colomar
Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es/

2021-02-26 10:38:29

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH] copy_file_range.2: Kernel v5.12 updates

On Fri, Feb 26, 2021 at 12:13 PM Alejandro Colomar (man-pages)
<[email protected]> wrote:
>
> Hello Luis,
>
> On 2/25/21 11:21 AM, Luis Henriques wrote:
> > On Wed, Feb 24, 2021 at 06:10:45PM +0200, Amir Goldstein wrote:
> >> If it were me, I would provide all the details of the situation to
> >> Michael and ask him
> >> to write the best description for this section.
> >
> > Thanks Amir.
> >
> > Yeah, it's tricky. Support was added and then dropped. Since stable
> > kernels will be picking this patch, maybe the best thing to do is to no
> > mention the generic cross-filesystem support at all...? Or simply say
> > that 5.3 temporarily supported it but that support was later dropped.
> >
> > Michael (or Alejandro), would you be OK handling this yourself as Amir
> > suggested?
>
> Could you please provide a more detailed history of what is to be
> documented?
>

Is this detailed enough? ;-)

https://lwn.net/Articles/846403/

Thanks,
Amir.

Subject: Re: [PATCH] copy_file_range.2: Kernel v5.12 updates

Hello Jeff,

On 2/26/21 2:59 PM, Jeff Layton wrote:
> Here's a link that should work. I'm probably breaking the rules a bit as
> a subscriber, but hopefully Jon won't mind too much. FWIW, I've found it
> to be worthwhile to subscribe to LWN if you're doing a lot of kernel
> development:
>
> https://lwn.net/SubscriberLink/846403/0fd639403e629cab/

Thanks! (I already received the link privately some minutes before from
various people.)

It seems that he considers it fair use :)

[[
Where is it appropriate to post a subscriber link?

Almost anywhere. Private mail, messages to project mailing lists, and
blog entries are all appropriate. As long as people do not use
subscriber links as a way to defeat our attempts to gain subscribers, we
are happy to see them shared.
]]
<https://lwn.net/op/FAQ.lwn#site>

Cheers,

Alex

--
Alejandro Colomar
Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es/

Subject: Re: [PATCH] copy_file_range.2: Kernel v5.12 updates

Hello Amir, Luis,

On 2/24/21 5:10 PM, Amir Goldstein wrote:
> On Wed, Feb 24, 2021 at 4:22 PM Luis Henriques <[email protected]> wrote:
>>
>> Update man-page with recent changes to this syscall.
>>
>> Signed-off-by: Luis Henriques <[email protected]>
>> ---
>> Hi!
>>
>> Here's a suggestion for fixing the manpage for copy_file_range(). Note that
>> I've assumed the fix will hit 5.12.
>>
>> man2/copy_file_range.2 | 10 +++++++++-
>> 1 file changed, 9 insertions(+), 1 deletion(-)
>>
>> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
>> index 611a39b8026b..b0fd85e2631e 100644
>> --- a/man2/copy_file_range.2
>> +++ b/man2/copy_file_range.2
>> @@ -169,6 +169,9 @@ Out of memory.
>> .B ENOSPC
>> There is not enough space on the target filesystem to complete the copy.
>> .TP
>> +.B EOPNOTSUPP

I'll add the kernel version here:

.BR EOPNOTSUPP " (since Linux 5.12)"

>> +The filesystem does not support this operation >> +.TP
>> .B EOVERFLOW
>> The requested source or destination range is too large to represent in the
>> specified data types.
>> @@ -187,7 +190,7 @@ refers to an active swap file.
>> .B EXDEV
>> The files referred to by
>> .IR fd_in " and " fd_out
>> -are not on the same mounted filesystem (pre Linux 5.3).
>> +are not on the same mounted filesystem (pre Linux 5.3 and post Linux 5.12).

I'm not sure that 'mounted' adds any value here. Would you remove the
word here?

It reads as if two separate devices with the same filesystem type would
still give this error.

Per the LWN.net article Amir shared, this is permitted ("When called
from user space, copy_file_range() will only try to copy a file across
filesystems if the two are of the same type").

This behavior was slightly different before 5.3 AFAICR (was it?) ("until
then, copy_file_range() refused to copy between files that were not
located on the same filesystem."). If that's the case, I'd specify the
difference, or more probably split the error into two, one before 5.3,
and one since 5.12.

>
> I think you need to drop the (Linux range) altogether.

I'll keep the range. Users of 5.3..5.11 might be surprised if the
filesystems are different and they don't get an error, I think.

I reworded it to follow other pages conventions:

.BR EXDEV " (before Linux 5.3; or since Linux 5.12)"

which renders as:

EXDEV (before Linux 5.3; or since Linux 5.12)
The files referred to by fd_in and fd_out are not on
the same mounted filesystem.


> What's missing here is the NFS cross server copy use case.
> Maybe:
>
> ...are not on the same mounted filesystem and the source and target filesystems
> do not support cross-filesystem copy.

Yes.

Again, this wasn't true before 5.3, right?

>
> You may refer the reader to VERSIONS section where it will say which
> filesystems support cross-fs copy as of kernel version XXX (i.e. cifs and nfs).
>
>> .SH VERSIONS
>> The
>> .BR copy_file_range ()
>> @@ -202,6 +205,11 @@ Applications should target the behaviour and requirements of 5.3 kernels.
>> .PP
>> First support for cross-filesystem copies was introduced in Linux 5.3.
>> Older kernels will return -EXDEV when cross-filesystem copies are attempted.
>> +.PP
>> +After Linux 5.12, support for copies between different filesystems was dropped.
>> +However, individual filesystems may still provide
>> +.BR copy_file_range ()
>> +implementations that allow copies across different devices.
>
> Again, this is not likely to stay uptodate for very long.
> The stable kernels are expected to apply your patch (because it fixes
> a regression)
> so this should be phrased differently.
> If it were me, I would provide all the details of the situation to
> Michael and ask him
> to write the best description for this section.

I'll look into more detail at this part in a later review.


On 2/26/21 11:34 AM, Amir Goldstein wrote:
> Is this detailed enough? ;-)
>
> https://lwn.net/Articles/846403/

Yes, it is!



Thanks,

Alex

--
Alejandro Colomar
Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es/

2021-02-27 05:43:44

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH] copy_file_range.2: Kernel v5.12 updates

On Sat, Feb 27, 2021 at 12:19 AM Alejandro Colomar (man-pages)
<[email protected]> wrote:
>
> Hello Amir, Luis,
>
> On 2/24/21 5:10 PM, Amir Goldstein wrote:
> > On Wed, Feb 24, 2021 at 4:22 PM Luis Henriques <[email protected]> wrote:
> >>
> >> Update man-page with recent changes to this syscall.
> >>
> >> Signed-off-by: Luis Henriques <[email protected]>
> >> ---
> >> Hi!
> >>
> >> Here's a suggestion for fixing the manpage for copy_file_range(). Note that
> >> I've assumed the fix will hit 5.12.
> >>
> >> man2/copy_file_range.2 | 10 +++++++++-
> >> 1 file changed, 9 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> >> index 611a39b8026b..b0fd85e2631e 100644
> >> --- a/man2/copy_file_range.2
> >> +++ b/man2/copy_file_range.2
> >> @@ -169,6 +169,9 @@ Out of memory.
> >> .B ENOSPC
> >> There is not enough space on the target filesystem to complete the copy.
> >> .TP
> >> +.B EOPNOTSUPP
>
> I'll add the kernel version here:
>
> .BR EOPNOTSUPP " (since Linux 5.12)"

Error could be returned prior to 5.3 and would be probably returned
by future stable kernels 5.3..5.12 too

>
> >> +The filesystem does not support this operation >> +.TP
> >> .B EOVERFLOW
> >> The requested source or destination range is too large to represent in the
> >> specified data types.
> >> @@ -187,7 +190,7 @@ refers to an active swap file.
> >> .B EXDEV
> >> The files referred to by
> >> .IR fd_in " and " fd_out
> >> -are not on the same mounted filesystem (pre Linux 5.3).
> >> +are not on the same mounted filesystem (pre Linux 5.3 and post Linux 5.12).
>
> I'm not sure that 'mounted' adds any value here. Would you remove the
> word here?

See rename(2). 'mounted' in this context is explained there.
HOWEVER, it does not fit here.
copy_file_range() IS allowed between two mounts of the same filesystem instance.

To make things more complicated, it appears that cross mount clone is not
allowed via FICLONE/FICLONERANGE ioctl, so ioctl_ficlonerange(2) man page
also uses the 'mounted filesystem' terminology for EXDEV

As things stand now, because of the fallback to clone logic,
copy_file_range() provides a way for users to clone across different mounts
of the same filesystem instance, which they cannot do with the FICLONE ioctl.

Fun :)

BTW, I don't know if preventing cross mount clone was done intentionally,
but as I wrote in a comment in the code once:

/*
* FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
* the same mount. Practically, they only need to be on the same file
* system.
*/

>
> It reads as if two separate devices with the same filesystem type would
> still give this error.
>
> Per the LWN.net article Amir shared, this is permitted ("When called
> from user space, copy_file_range() will only try to copy a file across
> filesystems if the two are of the same type").
>
> This behavior was slightly different before 5.3 AFAICR (was it?) ("until
> then, copy_file_range() refused to copy between files that were not
> located on the same filesystem."). If that's the case, I'd specify the
> difference, or more probably split the error into two, one before 5.3,
> and one since 5.12.
>

True.

> >
> > I think you need to drop the (Linux range) altogether.
>
> I'll keep the range. Users of 5.3..5.11 might be surprised if the
> filesystems are different and they don't get an error, I think.
>
> I reworded it to follow other pages conventions:
>
> .BR EXDEV " (before Linux 5.3; or since Linux 5.12)"
>
> which renders as:
>
> EXDEV (before Linux 5.3; or since Linux 5.12)
> The files referred to by fd_in and fd_out are not on
> the same mounted filesystem.
>

drop 'mounted'

>
> > What's missing here is the NFS cross server copy use case.
> > Maybe:
> >
> > ...are not on the same mounted filesystem and the source and target filesystems
> > do not support cross-filesystem copy.
>
> Yes.
>
> Again, this wasn't true before 5.3, right?
>

Right.
Actually, v5.3 provides the vfs capabilities for filesystems to support
cross fs copy. I am not sure if NFS already implements cross fs copy in
v5.3 and not sure about cifs. Need to get input from nfs/cis developers
or dig in the release notes for server-side copy.

> >
> > You may refer the reader to VERSIONS section where it will say which
> > filesystems support cross-fs copy as of kernel version XXX (i.e. cifs and nfs).
> >
> >> .SH VERSIONS
> >> The
> >> .BR copy_file_range ()
> >> @@ -202,6 +205,11 @@ Applications should target the behaviour and requirements of 5.3 kernels.
> >> .PP
> >> First support for cross-filesystem copies was introduced in Linux 5.3.
> >> Older kernels will return -EXDEV when cross-filesystem copies are attempted.
> >> +.PP
> >> +After Linux 5.12, support for copies between different filesystems was dropped.
> >> +However, individual filesystems may still provide
> >> +.BR copy_file_range ()
> >> +implementations that allow copies across different devices.
> >
> > Again, this is not likely to stay uptodate for very long.
> > The stable kernels are expected to apply your patch (because it fixes
> > a regression)
> > so this should be phrased differently.
> > If it were me, I would provide all the details of the situation to
> > Michael and ask him
> > to write the best description for this section.
>
> I'll look into more detail at this part in a later review.
>
>
> On 2/26/21 11:34 AM, Amir Goldstein wrote:
> > Is this detailed enough? ;-)
> >
> > https://lwn.net/Articles/846403/
>
> Yes, it is!
>

Thanks to LWN :)

Thanks,
Amir.

Subject: Re: [PATCH] copy_file_range.2: Kernel v5.12 updates

Hi Amir,

On 2/27/21 6:41 AM, Amir Goldstein wrote:
> On Sat, Feb 27, 2021 at 12:19 AM Alejandro Colomar (man-pages)
>> On 2/24/21 5:10 PM, Amir Goldstein wrote:
>>> On Wed, Feb 24, 2021 at 4:22 PM Luis Henriques <[email protected]> wrote:
>>>> .TP
>>>> +.B EOPNOTSUPP
>>
>> I'll add the kernel version here:
>>
>> .BR EOPNOTSUPP " (since Linux 5.12)"
>
> Error could be returned prior to 5.3 and would be probably returned
> by future stable kernels 5.3..5.12 too

OK, I think I'll state <5.3 and >=5.12 for the moment, and if Greg adds
that to stable 5.3..5.11 kernels, please update me.

>>>> .B EXDEV
>>>> The files referred to by
>>>> .IR fd_in " and " fd_out
>>>> -are not on the same mounted filesystem (pre Linux 5.3).
>>>> +are not on the same mounted filesystem (pre Linux 5.3 and post Linux 5.12).
>>
>> I'm not sure that 'mounted' adds any value here. Would you remove the
>> word here?
>
> See rename(2). 'mounted' in this context is explained there.
> HOWEVER, it does not fit here.
> copy_file_range() IS allowed between two mounts of the same filesystem instance.

Also allowed for <5.3 ?

>
> To make things more complicated, it appears that cross mount clone is not
> allowed via FICLONE/FICLONERANGE ioctl, so ioctl_ficlonerange(2) man page
> also uses the 'mounted filesystem' terminology for EXDEV
>
> As things stand now, because of the fallback to clone logic,
> copy_file_range() provides a way for users to clone across different mounts
> of the same filesystem instance, which they cannot do with the FICLONE ioctl.
>
> Fun :)
>
> BTW, I don't know if preventing cross mount clone was done intentionally,
> but as I wrote in a comment in the code once:
>
> /*
> * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> * the same mount. Practically, they only need to be on the same file
> * system.
> */

:)

>
>>
>> It reads as if two separate devices with the same filesystem type would
>> still give this error.
>>
>> Per the LWN.net article Amir shared, this is permitted ("When called
>> from user space, copy_file_range() will only try to copy a file across
>> filesystems if the two are of the same type").
>>
>> This behavior was slightly different before 5.3 AFAICR (was it?) ("until
>> then, copy_file_range() refused to copy between files that were not
>> located on the same filesystem."). If that's the case, I'd specify the
>> difference, or more probably split the error into two, one before 5.3,
>> and one since 5.12.
>>
>
> True.
>
>>>
>>> I think you need to drop the (Linux range) altogether.
>>
>> I'll keep the range. Users of 5.3..5.11 might be surprised if the
>> filesystems are different and they don't get an error, I think.
>>
>> I reworded it to follow other pages conventions:
>>
>> .BR EXDEV " (before Linux 5.3; or since Linux 5.12)"
>>
>> which renders as:
>>
>> EXDEV (before Linux 5.3; or since Linux 5.12)
>> The files referred to by fd_in and fd_out are not on
>> the same mounted filesystem.
>>
>
> drop 'mounted'

Yes

>
>>
>>> What's missing here is the NFS cross server copy use case.
>>> Maybe:
>>>
>>> ...are not on the same mounted filesystem and the source and target filesystems
>>> do not support cross-filesystem copy.
>>
>> Yes.
>>
>> Again, this wasn't true before 5.3, right?
>>
>
> Right.
> Actually, v5.3 provides the vfs capabilities for filesystems to support
> cross fs copy. I am not sure if NFS already implements cross fs copy in
> v5.3 and not sure about cifs. Need to get input from nfs/cis developers
> or dig in the release notes for server-side copy.

Okay
> Thanks to LWN :)

:)

Thanks,

Alex


--
Alejandro Colomar
Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es/

Subject: [RFC v2] copy_file_range.2: Update cross-filesystem support for 5.12

Linux 5.12 fixes a regression.

Cross-filesystem copies (introduced in 5.3) were buggy.

Move the statements documenting cross-fs to BUGS.
Kernels 5.3..5.11 should be patched soon.

State version information for some errors related to this.

Reported-by: Luis Henriques <[email protected]>
Reported-by: Amir Goldstein <[email protected]>
Related: <https://lwn.net/Articles/846403/>
Cc: Greg KH <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Steve French <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Nicolas Boichat <[email protected]>
Cc: Ian Lance Taylor <[email protected]>
Cc: Luis Lozano <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Olga Kornievskaia <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: ceph-devel <[email protected]>
Cc: linux-kernel <[email protected]>
Cc: CIFS <[email protected]>
Cc: samba-technical <[email protected]>
Cc: linux-fsdevel <[email protected]>
Cc: Linux NFS Mailing List <[email protected]>
Cc: Walter Harms <[email protected]>
Signed-off-by: Alejandro Colomar <[email protected]>
---

Hi all,

Please check that this is correct.
I wrote it as I understood copy_file_range() from the LWN article,
and the conversation on this thread,
but maybe someone with more experience on this syscall find bugs in my patch.

When kernels 5.3..5.11 fix this, some info could be compacted a bit more,
and maybe the BUGS section could be removed.

Also, I'd like to know which filesystems support cross-fs, and since when.

Amir, you said that it was only cifs and nfs (since when? 5.3? 5.12?).

Also, I'm a bit surprised that <5.3 could fail with EOPNOTSUPP
and it wasn't documented. Is that for sure, Amir?

Thanks,

Alex

---
man2/copy_file_range.2 | 29 ++++++++++++++++++++---------
1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
index 611a39b80..93f54889d 100644
--- a/man2/copy_file_range.2
+++ b/man2/copy_file_range.2
@@ -169,6 +169,9 @@ Out of memory.
.B ENOSPC
There is not enough space on the target filesystem to complete the copy.
.TP
+.BR EOPNOTSUPP " (before Linux 5.3; or since Linux 5.12)"
+The filesystem does not support this operation.
+.TP
.B EOVERFLOW
The requested source or destination range is too large to represent in the
specified data types.
@@ -184,10 +187,17 @@ or
.I fd_out
refers to an active swap file.
.TP
-.B EXDEV
+.BR EXDEV " (before Linux 5.3)"
The files referred to by
.IR fd_in " and " fd_out
-are not on the same mounted filesystem (pre Linux 5.3).
+are not on the same filesystem.
+.TP
+.BR EXDEV " (or since Linux 5.12)"
+The files referred to by
+.IR fd_in " and " fd_out
+are not on the same filesystem,
+and the source and target filesystems are not of the same type,
+or do not support cross-filesystem copy.
.SH VERSIONS
The
.BR copy_file_range ()
@@ -195,13 +205,10 @@ system call first appeared in Linux 4.5, but glibc 2.27 provides a user-space
emulation when it is not available.
.\" https://sourceware.org/git/?p=glibc.git;a=commit;f=posix/unistd.h;h=bad7a0c81f501fbbcc79af9eaa4b8254441c4a1f
.PP
-A major rework of the kernel implementation occurred in 5.3.
-Areas of the API that weren't clearly defined were clarified and the API bounds
-are much more strictly checked than on earlier kernels.
-Applications should target the behaviour and requirements of 5.3 kernels.
-.PP
-First support for cross-filesystem copies was introduced in Linux 5.3.
-Older kernels will return -EXDEV when cross-filesystem copies are attempted.
+Since 5.12,
+cross-filesystem copies can be achieved
+when both filesystems are of the same type,
+and that filesystem implements support for it.
.SH CONFORMING TO
The
.BR copy_file_range ()
@@ -226,6 +233,10 @@ gives filesystems an opportunity to implement "copy acceleration" techniques,
such as the use of reflinks (i.e., two or more inodes that share
pointers to the same copy-on-write disk blocks)
or server-side-copy (in the case of NFS).
+.SH BUGS
+In Linux kernels 5.3 to 5.11, cross-filesystem copies were supported.
+However, on some virtual filesystems, the call failed to copy,
+eventhough it may have reported success.
.SH EXAMPLES
.EX
#define _GNU_SOURCE
--
2.30.1.721.g45526154a5

2021-02-27 16:01:59

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC v2] copy_file_range.2: Update cross-filesystem support for 5.12

On Sat, Feb 27, 2021 at 3:59 PM Alejandro Colomar
<[email protected]> wrote:
>
> Linux 5.12 fixes a regression.
>
> Cross-filesystem copies (introduced in 5.3) were buggy.
>
> Move the statements documenting cross-fs to BUGS.
> Kernels 5.3..5.11 should be patched soon.
>
> State version information for some errors related to this.
>
> Reported-by: Luis Henriques <[email protected]>
> Reported-by: Amir Goldstein <[email protected]>
> Related: <https://lwn.net/Articles/846403/>
> Cc: Greg KH <[email protected]>
> Cc: Michael Kerrisk <[email protected]>
> Cc: Anna Schumaker <[email protected]>
> Cc: Jeff Layton <[email protected]>
> Cc: Steve French <[email protected]>
> Cc: Miklos Szeredi <[email protected]>
> Cc: Trond Myklebust <[email protected]>
> Cc: Alexander Viro <[email protected]>
> Cc: "Darrick J. Wong" <[email protected]>
> Cc: Dave Chinner <[email protected]>
> Cc: Nicolas Boichat <[email protected]>
> Cc: Ian Lance Taylor <[email protected]>
> Cc: Luis Lozano <[email protected]>
> Cc: Andreas Dilger <[email protected]>
> Cc: Olga Kornievskaia <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: ceph-devel <[email protected]>
> Cc: linux-kernel <[email protected]>
> Cc: CIFS <[email protected]>
> Cc: samba-technical <[email protected]>
> Cc: linux-fsdevel <[email protected]>
> Cc: Linux NFS Mailing List <[email protected]>
> Cc: Walter Harms <[email protected]>
> Signed-off-by: Alejandro Colomar <[email protected]>
> ---
>
> Hi all,
>
> Please check that this is correct.
> I wrote it as I understood copy_file_range() from the LWN article,
> and the conversation on this thread,
> but maybe someone with more experience on this syscall find bugs in my patch.
>
> When kernels 5.3..5.11 fix this, some info could be compacted a bit more,
> and maybe the BUGS section could be removed.
>
> Also, I'd like to know which filesystems support cross-fs, and since when.
>
> Amir, you said that it was only cifs and nfs (since when? 5.3? 5.12?).
>
> Also, I'm a bit surprised that <5.3 could fail with EOPNOTSUPP
> and it wasn't documented. Is that for sure, Amir?

No. You are right. EOPNOTSUPP is new.
Kernel always fell back to sendfile(2) if the filesystem did not support
copy_file_range().

>
> Thanks,
>
> Alex
>
> ---
> man2/copy_file_range.2 | 29 ++++++++++++++++++++---------
> 1 file changed, 20 insertions(+), 9 deletions(-)
>
> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> index 611a39b80..93f54889d 100644
> --- a/man2/copy_file_range.2
> +++ b/man2/copy_file_range.2
> @@ -169,6 +169,9 @@ Out of memory.
> .B ENOSPC
> There is not enough space on the target filesystem to complete the copy.
> .TP
> +.BR EOPNOTSUPP " (before Linux 5.3; or since Linux 5.12)"
> +The filesystem does not support this operation.
> +.TP

so not before 5.3

> .B EOVERFLOW
> The requested source or destination range is too large to represent in the
> specified data types.
> @@ -184,10 +187,17 @@ or
> .I fd_out
> refers to an active swap file.
> .TP
> -.B EXDEV
> +.BR EXDEV " (before Linux 5.3)"
> The files referred to by
> .IR fd_in " and " fd_out
> -are not on the same mounted filesystem (pre Linux 5.3).
> +are not on the same filesystem.
> +.TP
> +.BR EXDEV " (or since Linux 5.12)"
> +The files referred to by
> +.IR fd_in " and " fd_out
> +are not on the same filesystem,
> +and the source and target filesystems are not of the same type,
> +or do not support cross-filesystem copy.

ok.

> .SH VERSIONS
> The
> .BR copy_file_range ()
> @@ -195,13 +205,10 @@ system call first appeared in Linux 4.5, but glibc 2.27 provides a user-space
> emulation when it is not available.
> .\" https://sourceware.org/git/?p=glibc.git;a=commit;f=posix/unistd.h;h=bad7a0c81f501fbbcc79af9eaa4b8254441c4a1f
> .PP
> -A major rework of the kernel implementation occurred in 5.3.
> -Areas of the API that weren't clearly defined were clarified and the API bounds
> -are much more strictly checked than on earlier kernels.
> -Applications should target the behaviour and requirements of 5.3 kernels.
> -.PP

That information is useful. Why remove it?
FYI, the LTP tests written to velidate the copy_file_range() API are not running
on kernel < 5.3 at all.

> -First support for cross-filesystem copies was introduced in Linux 5.3.
> -Older kernels will return -EXDEV when cross-filesystem copies are attempted.
> +Since 5.12,
> +cross-filesystem copies can be achieved
> +when both filesystems are of the same type,
> +and that filesystem implements support for it.
> .SH CONFORMING TO
> The
> .BR copy_file_range ()
> @@ -226,6 +233,10 @@ gives filesystems an opportunity to implement "copy acceleration" techniques,
> such as the use of reflinks (i.e., two or more inodes that share
> pointers to the same copy-on-write disk blocks)
> or server-side-copy (in the case of NFS).
> +.SH BUGS
> +In Linux kernels 5.3 to 5.11, cross-filesystem copies were supported.

I think it is a bit confusing to say "were supported", because how come
support went away from kernel 5.12? maybe something along the lines
that kernel implementation of copy was used if there was no filesystem
support for the operation...

> +However, on some virtual filesystems, the call failed to copy,
> +eventhough it may have reported success.
> .SH EXAMPLES
> .EX
> #define _GNU_SOURCE
> --
> 2.30.1.721.g45526154a5
>

2021-02-27 23:10:37

by Steve French

[permalink] [raw]
Subject: Re: [PATCH] copy_file_range.2: Kernel v5.12 updates

On Fri, Feb 26, 2021 at 11:43 PM Amir Goldstein <[email protected]> wrote:
>
> On Sat, Feb 27, 2021 at 12:19 AM Alejandro Colomar (man-pages)
> <[email protected]> wrote:
> >
> > Hello Amir, Luis,
> >
> > On 2/24/21 5:10 PM, Amir Goldstein wrote:
> > > On Wed, Feb 24, 2021 at 4:22 PM Luis Henriques <[email protected]> wrote:
> > >>
> > >> Update man-page with recent changes to this syscall.
> > >>
> > >> Signed-off-by: Luis Henriques <[email protected]>
> > >> ---
> > >> Hi!
> > >>
> > >> Here's a suggestion for fixing the manpage for copy_file_range(). Note that
> > >> I've assumed the fix will hit 5.12.
> > >>
> > >> man2/copy_file_range.2 | 10 +++++++++-
> > >> 1 file changed, 9 insertions(+), 1 deletion(-)
> > >>
> > >> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> > >> index 611a39b8026b..b0fd85e2631e 100644
> > >> --- a/man2/copy_file_range.2
> > >> +++ b/man2/copy_file_range.2
> > >> @@ -169,6 +169,9 @@ Out of memory.
> > >> .B ENOSPC
> > >> There is not enough space on the target filesystem to complete the copy.
> > >> .TP
> > >> +.B EOPNOTSUPP
> >
> > I'll add the kernel version here:
> >
> > .BR EOPNOTSUPP " (since Linux 5.12)"
>
> Error could be returned prior to 5.3 and would be probably returned
> by future stable kernels 5.3..5.12 too
>
> >
> > >> +The filesystem does not support this operation >> +.TP
> > >> .B EOVERFLOW
> > >> The requested source or destination range is too large to represent in the
> > >> specified data types.
> > >> @@ -187,7 +190,7 @@ refers to an active swap file.
> > >> .B EXDEV
> > >> The files referred to by
> > >> .IR fd_in " and " fd_out
> > >> -are not on the same mounted filesystem (pre Linux 5.3).
> > >> +are not on the same mounted filesystem (pre Linux 5.3 and post Linux 5.12).
> >
> > I'm not sure that 'mounted' adds any value here. Would you remove the
> > word here?
>
> See rename(2). 'mounted' in this context is explained there.
> HOWEVER, it does not fit here.
> copy_file_range() IS allowed between two mounts of the same filesystem instance.
>
> To make things more complicated, it appears that cross mount clone is not
> allowed via FICLONE/FICLONERANGE ioctl, so ioctl_ficlonerange(2) man page
> also uses the 'mounted filesystem' terminology for EXDEV
>
> As things stand now, because of the fallback to clone logic,
> copy_file_range() provides a way for users to clone across different mounts
> of the same filesystem instance, which they cannot do with the FICLONE ioctl.
>
> Fun :)
>
> BTW, I don't know if preventing cross mount clone was done intentionally,
> but as I wrote in a comment in the code once:
>
> /*
> * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> * the same mount. Practically, they only need to be on the same file
> * system.
> */
>
> >
> > It reads as if two separate devices with the same filesystem type would
> > still give this error.
> >
> > Per the LWN.net article Amir shared, this is permitted ("When called
> > from user space, copy_file_range() will only try to copy a file across
> > filesystems if the two are of the same type").
> >
> > This behavior was slightly different before 5.3 AFAICR (was it?) ("until
> > then, copy_file_range() refused to copy between files that were not
> > located on the same filesystem."). If that's the case, I'd specify the
> > difference, or more probably split the error into two, one before 5.3,
> > and one since 5.12.
> >
>
> True.
>
> > >
> > > I think you need to drop the (Linux range) altogether.
> >
> > I'll keep the range. Users of 5.3..5.11 might be surprised if the
> > filesystems are different and they don't get an error, I think.
> >
> > I reworded it to follow other pages conventions:
> >
> > .BR EXDEV " (before Linux 5.3; or since Linux 5.12)"
> >
> > which renders as:
> >
> > EXDEV (before Linux 5.3; or since Linux 5.12)
> > The files referred to by fd_in and fd_out are not on
> > the same mounted filesystem.
> >
>
> drop 'mounted'
>
> >
> > > What's missing here is the NFS cross server copy use case.
> > > Maybe:

At least for the SMB3 kernel server (ksmbd "cifsd") looks like they use splice.
And for the user space CIFS/SMB3 server (like Samba) they have a configurable
plug in library interface ("Samba VFS modules") that would allow you
to implement
cross filesystem copy optimally for your version of Linux and plug
this into Samba
with little work on your part.

> >
> > Again, this wasn't true before 5.3, right?
> >
>
> Right.
> Actually, v5.3 provides the vfs capabilities for filesystems to support
> cross fs copy. I am not sure if NFS already implements cross fs copy in
> v5.3 and not sure about cifs. Need to get input from nfs/cis developers
> or dig in the release notes for server-side copy.

The SMB3 protocol has multiple ways to do "server side copy" (copy
offload to the server), some of which would apply to your example.
The case of "reflink" in many cases would be most efficient, and is supported
by the Linux client (see MS-SMB2 protocol specification section 3.3.5.15.18) but
is supported by fewer server file systems, so probably more important
to focus on
the other mechanisms which are server side copy rather than clone. The most
popular way, supported by most servers, is "CopyChunk" - 100s of
millions of systems
support this (if not more) - see MS-SMB2 protocol specification
section 2.2.31.1 and
3.3.5.15.16 - there are various cases where two different SMB3 mounts
on the same
client could handle cross mount server side copy.

There are other mechanisms supported by fewer servers SMB3 ODX/T10 style copy
offload (Windows and some others see e.g. Gordon at Nexenta's presentation
https://www.slideshare.net/gordonross/smb3-offload-data-transfer-odx)
but still popular for virtualization workloads. For this it could be
even more common
for those to be different mounts on the client. The Linux client does
not support
the SMB3 ODX/T10 offload yet but it would be good to add support for it.
There is a nice description of its additional benefits at
https://docs.microsoft.com/en-us/windows-hardware/drivers/storage/offloaded-data-transfer

But - yes SMB3 on Linux can have cross mount file copy today, which is
far more efficient
(having the server do the copy for us) rather than sending large
reads/writes back and
forth over the network from the client. In the future I am hoping that use case
becomes even more common over SMB3 as cloud servers improve.


> > > You may refer the reader to VERSIONS section where it will say which
> > > filesystems support cross-fs copy as of kernel version XXX (i.e. cifs and nfs).
> > >
> > >> .SH VERSIONS
> > >> The
> > >> .BR copy_file_range ()
> > >> @@ -202,6 +205,11 @@ Applications should target the behaviour and requirements of 5.3 kernels.
> > >> .PP
> > >> First support for cross-filesystem copies was introduced in Linux 5.3.
> > >> Older kernels will return -EXDEV when cross-filesystem copies are attempted.
> > >> +.PP
> > >> +After Linux 5.12, support for copies between different filesystems was dropped.
> > >> +However, individual filesystems may still provide
> > >> +.BR copy_file_range ()
> > >> +implementations that allow copies across different devices.

Yes - this could be very important, especially for cifs (smb3) going forward.



--
Thanks,

Steve

2021-02-28 07:40:57

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH] copy_file_range.2: Kernel v5.12 updates

On Sun, Feb 28, 2021 at 1:08 AM Steve French <[email protected]> wrote:
>
> On Fri, Feb 26, 2021 at 11:43 PM Amir Goldstein <[email protected]> wrote:
> >
> > On Sat, Feb 27, 2021 at 12:19 AM Alejandro Colomar (man-pages)
> > <[email protected]> wrote:
> > >
> > > Hello Amir, Luis,
> > >
> > > On 2/24/21 5:10 PM, Amir Goldstein wrote:
> > > > On Wed, Feb 24, 2021 at 4:22 PM Luis Henriques <[email protected]> wrote:
> > > >>
> > > >> Update man-page with recent changes to this syscall.
> > > >>
> > > >> Signed-off-by: Luis Henriques <[email protected]>
> > > >> ---
> > > >> Hi!
> > > >>
> > > >> Here's a suggestion for fixing the manpage for copy_file_range(). Note that
> > > >> I've assumed the fix will hit 5.12.
> > > >>
> > > >> man2/copy_file_range.2 | 10 +++++++++-
> > > >> 1 file changed, 9 insertions(+), 1 deletion(-)
> > > >>
> > > >> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> > > >> index 611a39b8026b..b0fd85e2631e 100644
> > > >> --- a/man2/copy_file_range.2
> > > >> +++ b/man2/copy_file_range.2
> > > >> @@ -169,6 +169,9 @@ Out of memory.
> > > >> .B ENOSPC
> > > >> There is not enough space on the target filesystem to complete the copy.
> > > >> .TP
> > > >> +.B EOPNOTSUPP
> > >
> > > I'll add the kernel version here:
> > >
> > > .BR EOPNOTSUPP " (since Linux 5.12)"
> >
> > Error could be returned prior to 5.3 and would be probably returned
> > by future stable kernels 5.3..5.12 too
> >
> > >
> > > >> +The filesystem does not support this operation >> +.TP
> > > >> .B EOVERFLOW
> > > >> The requested source or destination range is too large to represent in the
> > > >> specified data types.
> > > >> @@ -187,7 +190,7 @@ refers to an active swap file.
> > > >> .B EXDEV
> > > >> The files referred to by
> > > >> .IR fd_in " and " fd_out
> > > >> -are not on the same mounted filesystem (pre Linux 5.3).
> > > >> +are not on the same mounted filesystem (pre Linux 5.3 and post Linux 5.12).
> > >
> > > I'm not sure that 'mounted' adds any value here. Would you remove the
> > > word here?
> >
> > See rename(2). 'mounted' in this context is explained there.
> > HOWEVER, it does not fit here.
> > copy_file_range() IS allowed between two mounts of the same filesystem instance.
> >
> > To make things more complicated, it appears that cross mount clone is not
> > allowed via FICLONE/FICLONERANGE ioctl, so ioctl_ficlonerange(2) man page
> > also uses the 'mounted filesystem' terminology for EXDEV
> >
> > As things stand now, because of the fallback to clone logic,
> > copy_file_range() provides a way for users to clone across different mounts
> > of the same filesystem instance, which they cannot do with the FICLONE ioctl.
> >
> > Fun :)
> >
> > BTW, I don't know if preventing cross mount clone was done intentionally,
> > but as I wrote in a comment in the code once:
> >
> > /*
> > * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> > * the same mount. Practically, they only need to be on the same file
> > * system.
> > */
> >
> > >
> > > It reads as if two separate devices with the same filesystem type would
> > > still give this error.
> > >
> > > Per the LWN.net article Amir shared, this is permitted ("When called
> > > from user space, copy_file_range() will only try to copy a file across
> > > filesystems if the two are of the same type").
> > >
> > > This behavior was slightly different before 5.3 AFAICR (was it?) ("until
> > > then, copy_file_range() refused to copy between files that were not
> > > located on the same filesystem."). If that's the case, I'd specify the
> > > difference, or more probably split the error into two, one before 5.3,
> > > and one since 5.12.
> > >
> >
> > True.
> >
> > > >
> > > > I think you need to drop the (Linux range) altogether.
> > >
> > > I'll keep the range. Users of 5.3..5.11 might be surprised if the
> > > filesystems are different and they don't get an error, I think.
> > >
> > > I reworded it to follow other pages conventions:
> > >
> > > .BR EXDEV " (before Linux 5.3; or since Linux 5.12)"
> > >
> > > which renders as:
> > >
> > > EXDEV (before Linux 5.3; or since Linux 5.12)
> > > The files referred to by fd_in and fd_out are not on
> > > the same mounted filesystem.
> > >
> >
> > drop 'mounted'
> >
> > >
> > > > What's missing here is the NFS cross server copy use case.
> > > > Maybe:
>
> At least for the SMB3 kernel server (ksmbd "cifsd") looks like they use splice.
> And for the user space CIFS/SMB3 server (like Samba) they have a configurable
> plug in library interface ("Samba VFS modules") that would allow you
> to implement
> cross filesystem copy optimally for your version of Linux and plug
> this into Samba
> with little work on your part.
>
> > >
> > > Again, this wasn't true before 5.3, right?
> > >
> >
> > Right.
> > Actually, v5.3 provides the vfs capabilities for filesystems to support
> > cross fs copy. I am not sure if NFS already implements cross fs copy in
> > v5.3 and not sure about cifs. Need to get input from nfs/cis developers
> > or dig in the release notes for server-side copy.
>
> The SMB3 protocol has multiple ways to do "server side copy" (copy
> offload to the server), some of which would apply to your example.
> The case of "reflink" in many cases would be most efficient, and is supported
> by the Linux client (see MS-SMB2 protocol specification section 3.3.5.15.18) but
> is supported by fewer server file systems, so probably more important
> to focus on
> the other mechanisms which are server side copy rather than clone. The most
> popular way, supported by most servers, is "CopyChunk" - 100s of
> millions of systems
> support this (if not more) - see MS-SMB2 protocol specification
> section 2.2.31.1 and
> 3.3.5.15.16 - there are various cases where two different SMB3 mounts
> on the same
> client could handle cross mount server side copy.
>
> There are other mechanisms supported by fewer servers SMB3 ODX/T10 style copy
> offload (Windows and some others see e.g. Gordon at Nexenta's presentation
> https://www.slideshare.net/gordonross/smb3-offload-data-transfer-odx)
> but still popular for virtualization workloads. For this it could be
> even more common
> for those to be different mounts on the client. The Linux client does
> not support
> the SMB3 ODX/T10 offload yet but it would be good to add support for it.
> There is a nice description of its additional benefits at
> https://docs.microsoft.com/en-us/windows-hardware/drivers/storage/offloaded-data-transfer
>
> But - yes SMB3 on Linux can have cross mount file copy today, which is
> far more efficient

Can have? or does have?
IIUC, server-side copy ability exists for "same cifs fs" for a long time and
since v5.3, it is available for "same cifs connection", which is not exactly
the same as "same cifs fs" but also not really different for most people.
Can you elaborate about that?
Just assume the server can do anything. What can the Linux client do
since v5.3 or later?

> (having the server do the copy for us) rather than sending large
> reads/writes back and
> forth over the network from the client. In the future I am hoping that use case
> becomes even more common over SMB3 as cloud servers improve.
>
>
> > > > You may refer the reader to VERSIONS section where it will say which
> > > > filesystems support cross-fs copy as of kernel version XXX (i.e. cifs and nfs).
> > > >
> > > >> .SH VERSIONS
> > > >> The
> > > >> .BR copy_file_range ()
> > > >> @@ -202,6 +205,11 @@ Applications should target the behaviour and requirements of 5.3 kernels.
> > > >> .PP
> > > >> First support for cross-filesystem copies was introduced in Linux 5.3.
> > > >> Older kernels will return -EXDEV when cross-filesystem copies are attempted.
> > > >> +.PP
> > > >> +After Linux 5.12, support for copies between different filesystems was dropped.
> > > >> +However, individual filesystems may still provide
> > > >> +.BR copy_file_range ()
> > > >> +implementations that allow copies across different devices.
>
> Yes - this could be very important, especially for cifs (smb3) going forward.
>
>
>
> --
> Thanks,
>
> Steve

2021-02-28 22:50:35

by Steve French

[permalink] [raw]
Subject: Re: [PATCH] copy_file_range.2: Kernel v5.12 updates

On Sun, Feb 28, 2021 at 1:36 AM Amir Goldstein <[email protected]> wrote:
>
> On Sun, Feb 28, 2021 at 1:08 AM Steve French <[email protected]> wrote:
> >
> > On Fri, Feb 26, 2021 at 11:43 PM Amir Goldstein <[email protected]> wrote:
> > >
> > > On Sat, Feb 27, 2021 at 12:19 AM Alejandro Colomar (man-pages)
> > > <[email protected]> wrote:
> > > >
> > > > Hello Amir, Luis,
> > > >
> > > > On 2/24/21 5:10 PM, Amir Goldstein wrote:
> > > > > On Wed, Feb 24, 2021 at 4:22 PM Luis Henriques <[email protected]> wrote:
> > > > >>
> > > > >> Update man-page with recent changes to this syscall.
> > > > >>
> > > > >> Signed-off-by: Luis Henriques <[email protected]>
> > > > >> ---
> > > > >> Hi!
> > > > >>
> > > > >> Here's a suggestion for fixing the manpage for copy_file_range(). Note that
> > > > >> I've assumed the fix will hit 5.12.
> > > > >>
> > > > >> man2/copy_file_range.2 | 10 +++++++++-
> > > > >> 1 file changed, 9 insertions(+), 1 deletion(-)
> > > > >>
> > > > >> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> > > > >> index 611a39b8026b..b0fd85e2631e 100644
> > > > >> --- a/man2/copy_file_range.2
> > > > >> +++ b/man2/copy_file_range.2
> > > > >> @@ -169,6 +169,9 @@ Out of memory.
> > > > >> .B ENOSPC
> > > > >> There is not enough space on the target filesystem to complete the copy.
> > > > >> .TP
> > > > >> +.B EOPNOTSUPP
> > > >
> > > > I'll add the kernel version here:
> > > >
> > > > .BR EOPNOTSUPP " (since Linux 5.12)"
> > >
> > > Error could be returned prior to 5.3 and would be probably returned
> > > by future stable kernels 5.3..5.12 too
> > >
> > > >
> > > > >> +The filesystem does not support this operation >> +.TP
> > > > >> .B EOVERFLOW
> > > > >> The requested source or destination range is too large to represent in the
> > > > >> specified data types.
> > > > >> @@ -187,7 +190,7 @@ refers to an active swap file.
> > > > >> .B EXDEV
> > > > >> The files referred to by
> > > > >> .IR fd_in " and " fd_out
> > > > >> -are not on the same mounted filesystem (pre Linux 5.3).
> > > > >> +are not on the same mounted filesystem (pre Linux 5.3 and post Linux 5.12).
> > > >
> > > > I'm not sure that 'mounted' adds any value here. Would you remove the
> > > > word here?
> > >
> > > See rename(2). 'mounted' in this context is explained there.
> > > HOWEVER, it does not fit here.
> > > copy_file_range() IS allowed between two mounts of the same filesystem instance.
> > >
> > > To make things more complicated, it appears that cross mount clone is not
> > > allowed via FICLONE/FICLONERANGE ioctl, so ioctl_ficlonerange(2) man page
> > > also uses the 'mounted filesystem' terminology for EXDEV
> > >
> > > As things stand now, because of the fallback to clone logic,
> > > copy_file_range() provides a way for users to clone across different mounts
> > > of the same filesystem instance, which they cannot do with the FICLONE ioctl.
> > >
> > > Fun :)
> > >
> > > BTW, I don't know if preventing cross mount clone was done intentionally,
> > > but as I wrote in a comment in the code once:
> > >
> > > /*
> > > * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> > > * the same mount. Practically, they only need to be on the same file
> > > * system.
> > > */
> > >
> > > >
> > > > It reads as if two separate devices with the same filesystem type would
> > > > still give this error.
> > > >
> > > > Per the LWN.net article Amir shared, this is permitted ("When called
> > > > from user space, copy_file_range() will only try to copy a file across
> > > > filesystems if the two are of the same type").
> > > >
> > > > This behavior was slightly different before 5.3 AFAICR (was it?) ("until
> > > > then, copy_file_range() refused to copy between files that were not
> > > > located on the same filesystem."). If that's the case, I'd specify the
> > > > difference, or more probably split the error into two, one before 5.3,
> > > > and one since 5.12.
> > > >
> > >
> > > True.
> > >
> > > > >
> > > > > I think you need to drop the (Linux range) altogether.
> > > >
> > > > I'll keep the range. Users of 5.3..5.11 might be surprised if the
> > > > filesystems are different and they don't get an error, I think.
> > > >
> > > > I reworded it to follow other pages conventions:
> > > >
> > > > .BR EXDEV " (before Linux 5.3; or since Linux 5.12)"
> > > >
> > > > which renders as:
> > > >
> > > > EXDEV (before Linux 5.3; or since Linux 5.12)
> > > > The files referred to by fd_in and fd_out are not on
> > > > the same mounted filesystem.
> > > >
> > >
> > > drop 'mounted'
> > >
> > > >
> > > > > What's missing here is the NFS cross server copy use case.
> > > > > Maybe:
> >
> > At least for the SMB3 kernel server (ksmbd "cifsd") looks like they use splice.
> > And for the user space CIFS/SMB3 server (like Samba) they have a configurable
> > plug in library interface ("Samba VFS modules") that would allow you
> > to implement
> > cross filesystem copy optimally for your version of Linux and plug
> > this into Samba
> > with little work on your part.
> >
> > > >
> > > > Again, this wasn't true before 5.3, right?
> > > >
> > >
> > > Right.
> > > Actually, v5.3 provides the vfs capabilities for filesystems to support
> > > cross fs copy. I am not sure if NFS already implements cross fs copy in
> > > v5.3 and not sure about cifs. Need to get input from nfs/cis developers
> > > or dig in the release notes for server-side copy.
> >
> > The SMB3 protocol has multiple ways to do "server side copy" (copy
> > offload to the server), some of which would apply to your example.
> > The case of "reflink" in many cases would be most efficient, and is supported
> > by the Linux client (see MS-SMB2 protocol specification section 3.3.5.15.18) but
> > is supported by fewer server file systems, so probably more important
> > to focus on
> > the other mechanisms which are server side copy rather than clone. The most
> > popular way, supported by most servers, is "CopyChunk" - 100s of
> > millions of systems
> > support this (if not more) - see MS-SMB2 protocol specification
> > section 2.2.31.1 and
> > 3.3.5.15.16 - there are various cases where two different SMB3 mounts
> > on the same
> > client could handle cross mount server side copy.
> >
> > There are other mechanisms supported by fewer servers SMB3 ODX/T10 style copy
> > offload (Windows and some others see e.g. Gordon at Nexenta's presentation
> > https://www.slideshare.net/gordonross/smb3-offload-data-transfer-odx)
> > but still popular for virtualization workloads. For this it could be
> > even more common
> > for those to be different mounts on the client. The Linux client does
> > not support
> > the SMB3 ODX/T10 offload yet but it would be good to add support for it.
> > There is a nice description of its additional benefits at
> > https://docs.microsoft.com/en-us/windows-hardware/drivers/storage/offloaded-data-transfer
> >
> > But - yes SMB3 on Linux can have cross mount file copy today, which is
> > far more efficient
>
> Can have? or does have?
> IIUC, server-side copy ability exists for "same cifs fs" for a long time and
> since v5.3, it is available for "same cifs connection", which is not exactly
> the same as "same cifs fs" but also not really different for most people.
> Can you elaborate about that?
> Just assume the server can do anything. What can the Linux client do
> since v5.3 or later?

Inside the SMB3 client (cifs.ko) we check that the file handles provided
are for the same authenticated user to the same server, so
e.g. you could mount //server/share on /mnt1 and //server/anothershare on /mnt2
and do a copy_file_range from /mnt1/file1 to /mnt2/file2 even though these are
different mounts. The cifs client should allow additional cases of cross mount
copy, but at least this helps for various common scenarios and is very widely
supported on most servers as well.


--
Thanks,

Steve

2021-03-01 06:27:57

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH] copy_file_range.2: Kernel v5.12 updates

On Mon, Mar 1, 2021 at 12:25 AM Steve French <[email protected]> wrote:
>
> On Sun, Feb 28, 2021 at 1:36 AM Amir Goldstein <[email protected]> wrote:
> >
> > On Sun, Feb 28, 2021 at 1:08 AM Steve French <[email protected]> wrote:
> > >
> > > On Fri, Feb 26, 2021 at 11:43 PM Amir Goldstein <[email protected]> wrote:
> > > >
> > > > On Sat, Feb 27, 2021 at 12:19 AM Alejandro Colomar (man-pages)
> > > > <[email protected]> wrote:
> > > > >
> > > > > Hello Amir, Luis,
> > > > >
> > > > > On 2/24/21 5:10 PM, Amir Goldstein wrote:
> > > > > > On Wed, Feb 24, 2021 at 4:22 PM Luis Henriques <[email protected]> wrote:
> > > > > >>
> > > > > >> Update man-page with recent changes to this syscall.
> > > > > >>
> > > > > >> Signed-off-by: Luis Henriques <[email protected]>
> > > > > >> ---
> > > > > >> Hi!
> > > > > >>
> > > > > >> Here's a suggestion for fixing the manpage for copy_file_range(). Note that
> > > > > >> I've assumed the fix will hit 5.12.
> > > > > >>
> > > > > >> man2/copy_file_range.2 | 10 +++++++++-
> > > > > >> 1 file changed, 9 insertions(+), 1 deletion(-)
> > > > > >>
> > > > > >> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> > > > > >> index 611a39b8026b..b0fd85e2631e 100644
> > > > > >> --- a/man2/copy_file_range.2
> > > > > >> +++ b/man2/copy_file_range.2
> > > > > >> @@ -169,6 +169,9 @@ Out of memory.
> > > > > >> .B ENOSPC
> > > > > >> There is not enough space on the target filesystem to complete the copy.
> > > > > >> .TP
> > > > > >> +.B EOPNOTSUPP
> > > > >
> > > > > I'll add the kernel version here:
> > > > >
> > > > > .BR EOPNOTSUPP " (since Linux 5.12)"
> > > >
> > > > Error could be returned prior to 5.3 and would be probably returned
> > > > by future stable kernels 5.3..5.12 too
> > > >
> > > > >
> > > > > >> +The filesystem does not support this operation >> +.TP
> > > > > >> .B EOVERFLOW
> > > > > >> The requested source or destination range is too large to represent in the
> > > > > >> specified data types.
> > > > > >> @@ -187,7 +190,7 @@ refers to an active swap file.
> > > > > >> .B EXDEV
> > > > > >> The files referred to by
> > > > > >> .IR fd_in " and " fd_out
> > > > > >> -are not on the same mounted filesystem (pre Linux 5.3).
> > > > > >> +are not on the same mounted filesystem (pre Linux 5.3 and post Linux 5.12).
> > > > >
> > > > > I'm not sure that 'mounted' adds any value here. Would you remove the
> > > > > word here?
> > > >
> > > > See rename(2). 'mounted' in this context is explained there.
> > > > HOWEVER, it does not fit here.
> > > > copy_file_range() IS allowed between two mounts of the same filesystem instance.
> > > >
> > > > To make things more complicated, it appears that cross mount clone is not
> > > > allowed via FICLONE/FICLONERANGE ioctl, so ioctl_ficlonerange(2) man page
> > > > also uses the 'mounted filesystem' terminology for EXDEV
> > > >
> > > > As things stand now, because of the fallback to clone logic,
> > > > copy_file_range() provides a way for users to clone across different mounts
> > > > of the same filesystem instance, which they cannot do with the FICLONE ioctl.
> > > >
> > > > Fun :)
> > > >
> > > > BTW, I don't know if preventing cross mount clone was done intentionally,
> > > > but as I wrote in a comment in the code once:
> > > >
> > > > /*
> > > > * FICLONE/FICLONERANGE ioctls enforce that src and dest files are on
> > > > * the same mount. Practically, they only need to be on the same file
> > > > * system.
> > > > */
> > > >
> > > > >
> > > > > It reads as if two separate devices with the same filesystem type would
> > > > > still give this error.
> > > > >
> > > > > Per the LWN.net article Amir shared, this is permitted ("When called
> > > > > from user space, copy_file_range() will only try to copy a file across
> > > > > filesystems if the two are of the same type").
> > > > >
> > > > > This behavior was slightly different before 5.3 AFAICR (was it?) ("until
> > > > > then, copy_file_range() refused to copy between files that were not
> > > > > located on the same filesystem."). If that's the case, I'd specify the
> > > > > difference, or more probably split the error into two, one before 5.3,
> > > > > and one since 5.12.
> > > > >
> > > >
> > > > True.
> > > >
> > > > > >
> > > > > > I think you need to drop the (Linux range) altogether.
> > > > >
> > > > > I'll keep the range. Users of 5.3..5.11 might be surprised if the
> > > > > filesystems are different and they don't get an error, I think.
> > > > >
> > > > > I reworded it to follow other pages conventions:
> > > > >
> > > > > .BR EXDEV " (before Linux 5.3; or since Linux 5.12)"
> > > > >
> > > > > which renders as:
> > > > >
> > > > > EXDEV (before Linux 5.3; or since Linux 5.12)
> > > > > The files referred to by fd_in and fd_out are not on
> > > > > the same mounted filesystem.
> > > > >
> > > >
> > > > drop 'mounted'
> > > >
> > > > >
> > > > > > What's missing here is the NFS cross server copy use case.
> > > > > > Maybe:
> > >
> > > At least for the SMB3 kernel server (ksmbd "cifsd") looks like they use splice.
> > > And for the user space CIFS/SMB3 server (like Samba) they have a configurable
> > > plug in library interface ("Samba VFS modules") that would allow you
> > > to implement
> > > cross filesystem copy optimally for your version of Linux and plug
> > > this into Samba
> > > with little work on your part.
> > >
> > > > >
> > > > > Again, this wasn't true before 5.3, right?
> > > > >
> > > >
> > > > Right.
> > > > Actually, v5.3 provides the vfs capabilities for filesystems to support
> > > > cross fs copy. I am not sure if NFS already implements cross fs copy in
> > > > v5.3 and not sure about cifs. Need to get input from nfs/cis developers
> > > > or dig in the release notes for server-side copy.
> > >
> > > The SMB3 protocol has multiple ways to do "server side copy" (copy
> > > offload to the server), some of which would apply to your example.
> > > The case of "reflink" in many cases would be most efficient, and is supported
> > > by the Linux client (see MS-SMB2 protocol specification section 3.3.5.15.18) but
> > > is supported by fewer server file systems, so probably more important
> > > to focus on
> > > the other mechanisms which are server side copy rather than clone. The most
> > > popular way, supported by most servers, is "CopyChunk" - 100s of
> > > millions of systems
> > > support this (if not more) - see MS-SMB2 protocol specification
> > > section 2.2.31.1 and
> > > 3.3.5.15.16 - there are various cases where two different SMB3 mounts
> > > on the same
> > > client could handle cross mount server side copy.
> > >
> > > There are other mechanisms supported by fewer servers SMB3 ODX/T10 style copy
> > > offload (Windows and some others see e.g. Gordon at Nexenta's presentation
> > > https://www.slideshare.net/gordonross/smb3-offload-data-transfer-odx)
> > > but still popular for virtualization workloads. For this it could be
> > > even more common
> > > for those to be different mounts on the client. The Linux client does
> > > not support
> > > the SMB3 ODX/T10 offload yet but it would be good to add support for it.
> > > There is a nice description of its additional benefits at
> > > https://docs.microsoft.com/en-us/windows-hardware/drivers/storage/offloaded-data-transfer
> > >
> > > But - yes SMB3 on Linux can have cross mount file copy today, which is
> > > far more efficient
> >
> > Can have? or does have?
> > IIUC, server-side copy ability exists for "same cifs fs" for a long time and
> > since v5.3, it is available for "same cifs connection", which is not exactly
> > the same as "same cifs fs" but also not really different for most people.
> > Can you elaborate about that?
> > Just assume the server can do anything. What can the Linux client do
> > since v5.3 or later?
>
> Inside the SMB3 client (cifs.ko) we check that the file handles provided
> are for the same authenticated user to the same server, so
> e.g. you could mount //server/share on /mnt1 and //server/anothershare on /mnt2
> and do a copy_file_range from /mnt1/file1 to /mnt2/file2 even though these are
> different mounts. The cifs client should allow additional cases of cross mount
> copy, but at least this helps for various common scenarios and is very widely
> supported on most servers as well.
>

Got it. Thanks for clarifying.

So it appears that both cifs and nfs support cross-fs copy since v5.3
and many other fs that support clone, started supporting cross-mnt
(same fs) copy (implemented as clone) since v5.3 and still do to this day.

Alejandro, just to be clear, none of these changes are in v5.12 yet,
so please hold on to your patch for now.

Thanks,
Amir.

Subject: [RFC v3] copy_file_range.2: Update cross-filesystem support for 5.12

Linux 5.12 fixes a regression.

Cross-filesystem (introduced in 5.3) copies were buggy.

Move the statements documenting cross-fs to BUGS.
Kernels 5.3..5.11 should be patched soon.

State version information for some errors related to this.

Reported-by: Luis Henriques <[email protected]>
Reported-by: Amir Goldstein <[email protected]>
Related: <https://lwn.net/Articles/846403/>
Cc: Greg KH <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Steve French <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Nicolas Boichat <[email protected]>
Cc: Ian Lance Taylor <[email protected]>
Cc: Luis Lozano <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Olga Kornievskaia <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: ceph-devel <[email protected]>
Cc: linux-kernel <[email protected]>
Cc: CIFS <[email protected]>
Cc: samba-technical <[email protected]>
Cc: linux-fsdevel <[email protected]>
Cc: Linux NFS Mailing List <[email protected]>
Cc: Walter Harms <[email protected]>
Signed-off-by: Alejandro Colomar <[email protected]>
---

v3:
- Don't remove some important text.
- Reword BUGS.

---
Hi Amir,

I covered your comments. I may need to add something else after your
discussion with Steve; please comment.

I tried to reword BUGS so that it's as specific and understandable as I can.
If you still find it not good enough, please comment :)

Thanks,

Alex

---
man2/copy_file_range.2 | 26 ++++++++++++++++++++++----
1 file changed, 22 insertions(+), 4 deletions(-)

diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
index 611a39b80..1c0df3f74 100644
--- a/man2/copy_file_range.2
+++ b/man2/copy_file_range.2
@@ -169,6 +169,9 @@ Out of memory.
.B ENOSPC
There is not enough space on the target filesystem to complete the copy.
.TP
+.BR EOPNOTSUPP " (since Linux 5.12)"
+The filesystem does not support this operation.
+.TP
.B EOVERFLOW
The requested source or destination range is too large to represent in the
specified data types.
@@ -184,10 +187,17 @@ or
.I fd_out
refers to an active swap file.
.TP
-.B EXDEV
+.BR EXDEV " (before Linux 5.3)"
+The files referred to by
+.IR fd_in " and " fd_out
+are not on the same filesystem.
+.TP
+.BR EXDEV " (since Linux 5.12)"
The files referred to by
.IR fd_in " and " fd_out
-are not on the same mounted filesystem (pre Linux 5.3).
+are not on the same filesystem,
+and the source and target filesystems are not of the same type,
+or do not support cross-filesystem copy.
.SH VERSIONS
The
.BR copy_file_range ()
@@ -200,8 +210,10 @@ Areas of the API that weren't clearly defined were clarified and the API bounds
are much more strictly checked than on earlier kernels.
Applications should target the behaviour and requirements of 5.3 kernels.
.PP
-First support for cross-filesystem copies was introduced in Linux 5.3.
-Older kernels will return -EXDEV when cross-filesystem copies are attempted.
+Since 5.12,
+cross-filesystem copies can be achieved
+when both filesystems are of the same type,
+and that filesystem implements support for it.
.SH CONFORMING TO
The
.BR copy_file_range ()
@@ -226,6 +238,12 @@ gives filesystems an opportunity to implement "copy acceleration" techniques,
such as the use of reflinks (i.e., two or more inodes that share
pointers to the same copy-on-write disk blocks)
or server-side-copy (in the case of NFS).
+.SH BUGS
+In Linux kernels 5.3 to 5.11,
+cross-filesystem copies were supported by the kernel,
+instead of being supported by individual filesystems.
+However, on some virtual filesystems,
+the call failed to copy, while still reporting success.
.SH EXAMPLES
.EX
#define _GNU_SOURCE
--
2.30.1.721.g45526154a5

2021-03-01 15:00:02

by Amir Goldstein

[permalink] [raw]
Subject: Re: [RFC v3] copy_file_range.2: Update cross-filesystem support for 5.12

On Mon, Mar 1, 2021 at 4:45 PM Alejandro Colomar <[email protected]> wrote:
>
> Linux 5.12 fixes a regression.
>
> Cross-filesystem (introduced in 5.3) copies were buggy.
>
> Move the statements documenting cross-fs to BUGS.
> Kernels 5.3..5.11 should be patched soon.
>
> State version information for some errors related to this.
>
> Reported-by: Luis Henriques <[email protected]>
> Reported-by: Amir Goldstein <[email protected]>
> Related: <https://lwn.net/Articles/846403/>
> Cc: Greg KH <[email protected]>
> Cc: Michael Kerrisk <[email protected]>
> Cc: Anna Schumaker <[email protected]>
> Cc: Jeff Layton <[email protected]>
> Cc: Steve French <[email protected]>
> Cc: Miklos Szeredi <[email protected]>
> Cc: Trond Myklebust <[email protected]>
> Cc: Alexander Viro <[email protected]>
> Cc: "Darrick J. Wong" <[email protected]>
> Cc: Dave Chinner <[email protected]>
> Cc: Nicolas Boichat <[email protected]>
> Cc: Ian Lance Taylor <[email protected]>
> Cc: Luis Lozano <[email protected]>
> Cc: Andreas Dilger <[email protected]>
> Cc: Olga Kornievskaia <[email protected]>
> Cc: Christoph Hellwig <[email protected]>
> Cc: ceph-devel <[email protected]>
> Cc: linux-kernel <[email protected]>
> Cc: CIFS <[email protected]>
> Cc: samba-technical <[email protected]>
> Cc: linux-fsdevel <[email protected]>
> Cc: Linux NFS Mailing List <[email protected]>
> Cc: Walter Harms <[email protected]>
> Signed-off-by: Alejandro Colomar <[email protected]>
> ---
>
> v3:
> - Don't remove some important text.
> - Reword BUGS.
>
> ---
> Hi Amir,
>
> I covered your comments. I may need to add something else after your
> discussion with Steve; please comment.
>
> I tried to reword BUGS so that it's as specific and understandable as I can.
> If you still find it not good enough, please comment :)
>
> Thanks,
>
> Alex
>
> ---
> man2/copy_file_range.2 | 26 ++++++++++++++++++++++----
> 1 file changed, 22 insertions(+), 4 deletions(-)
>
> diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
> index 611a39b80..1c0df3f74 100644
> --- a/man2/copy_file_range.2
> +++ b/man2/copy_file_range.2
> @@ -169,6 +169,9 @@ Out of memory.
> .B ENOSPC
> There is not enough space on the target filesystem to complete the copy.
> .TP
> +.BR EOPNOTSUPP " (since Linux 5.12)"
> +The filesystem does not support this operation.
> +.TP
> .B EOVERFLOW
> The requested source or destination range is too large to represent in the
> specified data types.
> @@ -184,10 +187,17 @@ or
> .I fd_out
> refers to an active swap file.
> .TP
> -.B EXDEV
> +.BR EXDEV " (before Linux 5.3)"
> +The files referred to by
> +.IR fd_in " and " fd_out
> +are not on the same filesystem.
> +.TP
> +.BR EXDEV " (since Linux 5.12)"
> The files referred to by
> .IR fd_in " and " fd_out
> -are not on the same mounted filesystem (pre Linux 5.3).
> +are not on the same filesystem,
> +and the source and target filesystems are not of the same type,
> +or do not support cross-filesystem copy.
> .SH VERSIONS
> The
> .BR copy_file_range ()
> @@ -200,8 +210,10 @@ Areas of the API that weren't clearly defined were clarified and the API bounds
> are much more strictly checked than on earlier kernels.
> Applications should target the behaviour and requirements of 5.3 kernels.
> .PP
> -First support for cross-filesystem copies was introduced in Linux 5.3.
> -Older kernels will return -EXDEV when cross-filesystem copies are attempted.
> +Since 5.12,
> +cross-filesystem copies can be achieved
> +when both filesystems are of the same type,
> +and that filesystem implements support for it.

Maybe refer to BUGS here for pre 5.12 behavior?

> .SH CONFORMING TO
> The
> .BR copy_file_range ()
> @@ -226,6 +238,12 @@ gives filesystems an opportunity to implement "copy acceleration" techniques,
> such as the use of reflinks (i.e., two or more inodes that share
> pointers to the same copy-on-write disk blocks)
> or server-side-copy (in the case of NFS).
> +.SH BUGS
> +In Linux kernels 5.3 to 5.11,
> +cross-filesystem copies were supported by the kernel,
> +instead of being supported by individual filesystems.

Not so clear/accurate IMO. Maybe:

cross-filesystem copies were implemented by the kernel,
if the operation was not supported by individual filesystems.

> +However, on some virtual filesystems,
> +the call failed to copy, while still reporting success.

Thanks,
Amir.

Subject: [RFC v4] copy_file_range.2: Update cross-filesystem support for 5.12

Linux 5.12 fixes a regression.

Cross-filesystem (introduced in 5.3) copies were buggy.

Move the statements documenting cross-fs to BUGS.
Kernels 5.3..5.11 should be patched soon.

State version information for some errors related to this.

Reported-by: Luis Henriques <[email protected]>
Reported-by: Amir Goldstein <[email protected]>
Related: <https://lwn.net/Articles/846403/>
Cc: Greg KH <[email protected]>
Cc: Michael Kerrisk <[email protected]>
Cc: Anna Schumaker <[email protected]>
Cc: Jeff Layton <[email protected]>
Cc: Steve French <[email protected]>
Cc: Miklos Szeredi <[email protected]>
Cc: Trond Myklebust <[email protected]>
Cc: Alexander Viro <[email protected]>
Cc: "Darrick J. Wong" <[email protected]>
Cc: Dave Chinner <[email protected]>
Cc: Nicolas Boichat <[email protected]>
Cc: Ian Lance Taylor <[email protected]>
Cc: Luis Lozano <[email protected]>
Cc: Andreas Dilger <[email protected]>
Cc: Olga Kornievskaia <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: ceph-devel <[email protected]>
Cc: linux-kernel <[email protected]>
Cc: CIFS <[email protected]>
Cc: samba-technical <[email protected]>
Cc: linux-fsdevel <[email protected]>
Cc: Linux NFS Mailing List <[email protected]>
Cc: Walter Harms <[email protected]>
Signed-off-by: Alejandro Colomar <[email protected]>
---

v3:
- Don't remove some important text.
- Reword BUGS.
v4:
- Reword.
- Link to BUGS.

Thanks, Amir, for all the help and better wordings.

Cheers,

Alex

---
man2/copy_file_range.2 | 27 +++++++++++++++++++++++----
1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/man2/copy_file_range.2 b/man2/copy_file_range.2
index 611a39b80..f58bfea8f 100644
--- a/man2/copy_file_range.2
+++ b/man2/copy_file_range.2
@@ -169,6 +169,9 @@ Out of memory.
.B ENOSPC
There is not enough space on the target filesystem to complete the copy.
.TP
+.BR EOPNOTSUPP " (since Linux 5.12)"
+The filesystem does not support this operation.
+.TP
.B EOVERFLOW
The requested source or destination range is too large to represent in the
specified data types.
@@ -184,10 +187,17 @@ or
.I fd_out
refers to an active swap file.
.TP
-.B EXDEV
+.BR EXDEV " (before Linux 5.3)"
+The files referred to by
+.IR fd_in " and " fd_out
+are not on the same filesystem.
+.TP
+.BR EXDEV " (since Linux 5.12)"
The files referred to by
.IR fd_in " and " fd_out
-are not on the same mounted filesystem (pre Linux 5.3).
+are not on the same filesystem,
+and the source and target filesystems are not of the same type,
+or do not support cross-filesystem copy.
.SH VERSIONS
The
.BR copy_file_range ()
@@ -200,8 +210,11 @@ Areas of the API that weren't clearly defined were clarified and the API bounds
are much more strictly checked than on earlier kernels.
Applications should target the behaviour and requirements of 5.3 kernels.
.PP
-First support for cross-filesystem copies was introduced in Linux 5.3.
-Older kernels will return -EXDEV when cross-filesystem copies are attempted.
+Since Linux 5.12,
+cross-filesystem copies can be achieved
+when both filesystems are of the same type,
+and that filesystem implements support for it.
+See BUGS for behavior prior to 5.12.
.SH CONFORMING TO
The
.BR copy_file_range ()
@@ -226,6 +239,12 @@ gives filesystems an opportunity to implement "copy acceleration" techniques,
such as the use of reflinks (i.e., two or more inodes that share
pointers to the same copy-on-write disk blocks)
or server-side-copy (in the case of NFS).
+.SH BUGS
+In Linux kernels 5.3 to 5.11,
+cross-filesystem copies were implemented by the kernel,
+if the operation was not supported by individual filesystems.
+However, on some virtual filesystems,
+the call failed to copy, while still reporting success.
.SH EXAMPLES
.EX
#define _GNU_SOURCE
--
2.30.1.721.g45526154a5

Subject: Re: [RFC v4] copy_file_range.2: Update cross-filesystem support for 5.12

Hi Darrick,

On 3/4/21 6:13 PM, Darrick J. Wong wrote:
> On Thu, Mar 04, 2021 at 10:38:07AM +0100, Alejandro Colomar wrote:
>> +However, on some virtual filesystems,
>> +the call failed to copy, while still reporting success.
>
> ...success, or merely a short copy?

Okay.

>
> (The rest looks reasonable (at least by c_f_r standards) to me.)

I'm curious, what does "c_f_r standards" mean? :)

Cheers,

Alex

--
Alejandro Colomar
Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
http://www.alejandro-colomar.es/

2021-03-05 01:03:36

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [RFC v4] copy_file_range.2: Update cross-filesystem support for 5.12

On Thu, Mar 04, 2021 at 07:24:02PM +0100, Alejandro Colomar (man-pages) wrote:
> Hi Darrick,
>
> On 3/4/21 6:13 PM, Darrick J. Wong wrote:
> > On Thu, Mar 04, 2021 at 10:38:07AM +0100, Alejandro Colomar wrote:
> > > +However, on some virtual filesystems,
> > > +the call failed to copy, while still reporting success.
> >
> > ...success, or merely a short copy?
>
> Okay.
>
> >
> > (The rest looks reasonable (at least by c_f_r standards) to me.)
>
> I'm curious, what does "c_f_r standards" mean? :)

c_f_r is shorthand for "copy_file_range".

As for standards... well... I'll just say that this being the /second/
major shift in behavior reflects our poor community development
processes. The door to general cross-fs copies should not have been
thrown open with as little testing as it did. There are legendary
dchinner rants about how obviously broken the generic fallback was when
it was introduced.

There's a reason why we usually wire up new kernel functionality on an
opt-in basis, and that is to foster gradual enablement as QA resources
permit. It's one thing for maintainers to blow up their own subsystems
in isolation, and an entirely different thing to do it between projects
with no coordination.

Did c_f_r work between an ext4 and an xfs? I have no idea. It seemed
to work between xfses of a similar vintage and featureset, at least, but
that's about as much testing as I have ever managed.

--D

>
> Cheers,
>
> Alex
>
> --
> Alejandro Colomar
> Linux man-pages comaintainer; https://www.kernel.org/doc/man-pages/
> http://www.alejandro-colomar.es/

2021-04-09 05:25:11

by Nicolas Boichat

[permalink] [raw]
Subject: Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies

On Wed, Feb 24, 2021 at 6:44 PM Nicolas Boichat <[email protected]> wrote:
>
> On Wed, Feb 24, 2021 at 6:22 PM Luis Henriques <[email protected]> wrote:
> >
> > On Tue, Feb 23, 2021 at 08:00:54PM -0500, Olga Kornievskaia wrote:
> > > On Mon, Feb 22, 2021 at 5:25 AM Luis Henriques <[email protected]> wrote:
> > > >
> > > > A regression has been reported by Nicolas Boichat, found while using the
> > > > copy_file_range syscall to copy a tracefs file. Before commit
> > > > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
> > > > kernel would return -EXDEV to userspace when trying to copy a file across
> > > > different filesystems. After this commit, the syscall doesn't fail anymore
> > > > and instead returns zero (zero bytes copied), as this file's content is
> > > > generated on-the-fly and thus reports a size of zero.
> > > >
> > > > This patch restores some cross-filesystem copy restrictions that existed
> > > > prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> > > > devices"). Filesystems are still allowed to fall-back to the VFS
> > > > generic_copy_file_range() implementation, but that has now to be done
> > > > explicitly.
> > > >
> > > > nfsd is also modified to fall-back into generic_copy_file_range() in case
> > > > vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
> > > >
> > > > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
> > > > Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
> > > > Link: https://lore.kernel.org/linux-fsdevel/[email protected]om/
> > > > Link: https://lore.kernel.org/linux-fsdevel/[email protected]eid/
> > > > Reported-by: Nicolas Boichat <[email protected]>
> > > > Signed-off-by: Luis Henriques <[email protected]>
> > >
> > > I tested v8 and I believe it works for NFS.
> >
> > Thanks a lot for the testing. And to everyone else for reviews,
> > feedback,... and patience.
>
> Thanks so much to you!!!
>
> Works here, you can add my
> Tested-by: Nicolas Boichat <[email protected]>

What happened to this patch? It does not seem to have been picked up
yet? Any reason why?

> >
> > I'll now go look into the manpage and see what needs to be changed.
> >
> > Cheers,
> > --
> > Luís

2021-04-09 13:51:26

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH v8] vfs: fix copy_file_range regression in cross-fs copies

On Fri, Apr 9, 2021 at 4:39 PM Luis Henriques <[email protected]> wrote:
>
> Nicolas Boichat <[email protected]> writes:
>
> > On Wed, Feb 24, 2021 at 6:44 PM Nicolas Boichat <[email protected]> wrote:
> >>
> >> On Wed, Feb 24, 2021 at 6:22 PM Luis Henriques <[email protected]> wrote:
> >> >
> >> > On Tue, Feb 23, 2021 at 08:00:54PM -0500, Olga Kornievskaia wrote:
> >> > > On Mon, Feb 22, 2021 at 5:25 AM Luis Henriques <[email protected]> wrote:
> >> > > >
> >> > > > A regression has been reported by Nicolas Boichat, found while using the
> >> > > > copy_file_range syscall to copy a tracefs file. Before commit
> >> > > > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices") the
> >> > > > kernel would return -EXDEV to userspace when trying to copy a file across
> >> > > > different filesystems. After this commit, the syscall doesn't fail anymore
> >> > > > and instead returns zero (zero bytes copied), as this file's content is
> >> > > > generated on-the-fly and thus reports a size of zero.
> >> > > >
> >> > > > This patch restores some cross-filesystem copy restrictions that existed
> >> > > > prior to commit 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> >> > > > devices"). Filesystems are still allowed to fall-back to the VFS
> >> > > > generic_copy_file_range() implementation, but that has now to be done
> >> > > > explicitly.
> >> > > >
> >> > > > nfsd is also modified to fall-back into generic_copy_file_range() in case
> >> > > > vfs_copy_file_range() fails with -EOPNOTSUPP or -EXDEV.
> >> > > >
> >> > > > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
> >> > > > Link: https://lore.kernel.org/linux-fsdevel/[email protected]/
> >> > > > Link: https://lore.kernel.org/linux-fsdevel/[email protected]om/
> >> > > > Link: https://lore.kernel.org/linux-fsdevel/[email protected]eid/
> >> > > > Reported-by: Nicolas Boichat <[email protected]>
> >> > > > Signed-off-by: Luis Henriques <[email protected]>
> >> > >
> >> > > I tested v8 and I believe it works for NFS.
> >> >
> >> > Thanks a lot for the testing. And to everyone else for reviews,
> >> > feedback,... and patience.
> >>
> >> Thanks so much to you!!!
> >>
> >> Works here, you can add my
> >> Tested-by: Nicolas Boichat <[email protected]>
> >
> > What happened to this patch? It does not seem to have been picked up
> > yet? Any reason why?
>
> Hmm... good question. I'm not actually sure who would be picking it. Al,
> maybe...?
>

Darrick,

Would you mind taking this through your tree in case Al doesn't pick it up?

Thanks,
Amir.