2021-02-15 19:45:50

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices

On Mon, Feb 15, 2021 at 8:57 PM Trond Myklebust <[email protected]> wrote:
>
> On Mon, 2021-02-15 at 19:24 +0200, Amir Goldstein wrote:
> > On Mon, Feb 15, 2021 at 6:53 PM Trond Myklebust <
> > [email protected]> wrote:
> > >
> > > On Mon, 2021-02-15 at 18:34 +0200, Amir Goldstein wrote:
> > > > On Mon, Feb 15, 2021 at 5:42 PM Luis Henriques <
> > > > [email protected]>
> > > > wrote:
> > > > >
> > > > > Nicolas Boichat reported an issue when trying to use the
> > > > > copy_file_range
> > > > > syscall on a tracefs file. It failed silently because the file
> > > > > content is
> > > > > generated on-the-fly (reporting a size of zero) and
> > > > > copy_file_range
> > > > > needs
> > > > > to know in advance how much data is present.
> > > > >
> > > > > This commit restores the cross-fs restrictions that existed
> > > > > prior
> > > > > to
> > > > > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> > > > > devices")
> > > > > and
> > > > > removes generic_copy_file_range() calls from ceph, cifs, fuse,
> > > > > and
> > > > > nfs.
> > > > >
> > > > > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
> > > > > devices")
> > > > > Link:
> > > > > https://lore.kernel.org/linux-fsdevel/[email protected]/
> > > > > Cc: Nicolas Boichat <[email protected]>
> > > > > Signed-off-by: Luis Henriques <[email protected]>
> > > >
> > > > Code looks ok.
> > > > You may add:
> > > >
> > > > Reviewed-by: Amir Goldstein <[email protected]>
> > > >
> > > > I agree with Trond that the first paragraph of the commit message
> > > > could
> > > > be improved.
> > > > The purpose of this change is to fix the change of behavior that
> > > > caused the regression.
> > > >
> > > > Before v5.3, behavior was -EXDEV and userspace could fallback to
> > > > read.
> > > > After v5.3, behavior is zero size copy.
> > > >
> > > > It does not matter so much what makes sense for CFR to do in this
> > > > case (generic cross-fs copy). What matters is that nobody asked
> > > > for
> > > > this change and that it caused problems.
> > > >
> > >
> > > No. I'm saying that this patch should be NACKed unless there is a
> > > real
> > > explanation for why we give crap about this tracefs corner case and
> > > why
> > > it can't be fixed.
> > >
> > > There are plenty of reasons why copy offload across filesystems
> > > makes
> > > sense, and particularly when you're doing NAS. Clone just doesn't
> > > cut
> > > it when it comes to disaster recovery (whereas backup to a
> > > different
> > > storage unit does). If the client has to do the copy, then you're
> > > effectively doubling the load on the server, and you're adding
> > > potentially unnecessary network traffic (or at the very least you
> > > are
> > > doubling that traffic).
> > >
> >
> > I don't understand the use case you are describing.
> >
> > Which filesystem types are you talking about for source and target
> > of copy_file_range()?
> >
> > To be clear, the original change was done to support NFS/CIFS server-
> > side
> > copy and those should not be affected by this change.
> >
>
> That is incorrect:
>
> ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file
> *dst,
> u64 dst_pos, u64 count)
> {
>
> /*
> * Limit copy to 4MB to prevent indefinitely blocking an nfsd
> * thread and client rpc slot. The choice of 4MB is somewhat
> * arbitrary. We might instead base this on r/wsize, or make it
> * tunable, or use a time instead of a byte limit, or implement
> * asynchronous copy. In theory a client could also recognize a
> * limit like this and pipeline multiple COPY requests.
> */
> count = min_t(u64, count, 1 << 22);
> return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
> }
>
> You are now explicitly changing the behaviour of knfsd when the source
> and destination filesystem differ.
>
> For one thing, you are disallowing the NFSv4.2 copy offload use case of
> copying from a local filesystem to a remote NFS server. However you are
> also disallowing the copy from, say, an XFS formatted partition to an
> ext4 partition.
>

Got it.
This is easy to solve with a flag COPY_FILE_SPLICE (or something)
that is internal to kernel users.

FWIW, you may want to look at the loop in ovl_copy_up_data()
for improvements to nfsd_copy_file_range().

We can move the check out to copy_file_range syscall:

if (flags != 0)
return -EINVAL;

Leave the fallback from all filesystems and check for the
COPY_FILE_SPLICE flag inside generic_copy_file_range().

Thanks,
Amir.


2021-02-16 11:20:41

by Luis Henriques

[permalink] [raw]
Subject: Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices

Amir Goldstein <[email protected]> writes:

> On Mon, Feb 15, 2021 at 8:57 PM Trond Myklebust <[email protected]> wrote:
>>
>> On Mon, 2021-02-15 at 19:24 +0200, Amir Goldstein wrote:
>> > On Mon, Feb 15, 2021 at 6:53 PM Trond Myklebust <
>> > [email protected]> wrote:
>> > >
>> > > On Mon, 2021-02-15 at 18:34 +0200, Amir Goldstein wrote:
>> > > > On Mon, Feb 15, 2021 at 5:42 PM Luis Henriques <
>> > > > [email protected]>
>> > > > wrote:
>> > > > >
>> > > > > Nicolas Boichat reported an issue when trying to use the
>> > > > > copy_file_range
>> > > > > syscall on a tracefs file. It failed silently because the file
>> > > > > content is
>> > > > > generated on-the-fly (reporting a size of zero) and
>> > > > > copy_file_range
>> > > > > needs
>> > > > > to know in advance how much data is present.
>> > > > >
>> > > > > This commit restores the cross-fs restrictions that existed
>> > > > > prior
>> > > > > to
>> > > > > 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
>> > > > > devices")
>> > > > > and
>> > > > > removes generic_copy_file_range() calls from ceph, cifs, fuse,
>> > > > > and
>> > > > > nfs.
>> > > > >
>> > > > > Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across
>> > > > > devices")
>> > > > > Link:
>> > > > > https://lore.kernel.org/linux-fsdevel/[email protected]/
>> > > > > Cc: Nicolas Boichat <[email protected]>
>> > > > > Signed-off-by: Luis Henriques <[email protected]>
>> > > >
>> > > > Code looks ok.
>> > > > You may add:
>> > > >
>> > > > Reviewed-by: Amir Goldstein <[email protected]>
>> > > >
>> > > > I agree with Trond that the first paragraph of the commit message
>> > > > could
>> > > > be improved.
>> > > > The purpose of this change is to fix the change of behavior that
>> > > > caused the regression.
>> > > >
>> > > > Before v5.3, behavior was -EXDEV and userspace could fallback to
>> > > > read.
>> > > > After v5.3, behavior is zero size copy.
>> > > >
>> > > > It does not matter so much what makes sense for CFR to do in this
>> > > > case (generic cross-fs copy). What matters is that nobody asked
>> > > > for
>> > > > this change and that it caused problems.
>> > > >
>> > >
>> > > No. I'm saying that this patch should be NACKed unless there is a
>> > > real
>> > > explanation for why we give crap about this tracefs corner case and
>> > > why
>> > > it can't be fixed.
>> > >
>> > > There are plenty of reasons why copy offload across filesystems
>> > > makes
>> > > sense, and particularly when you're doing NAS. Clone just doesn't
>> > > cut
>> > > it when it comes to disaster recovery (whereas backup to a
>> > > different
>> > > storage unit does). If the client has to do the copy, then you're
>> > > effectively doubling the load on the server, and you're adding
>> > > potentially unnecessary network traffic (or at the very least you
>> > > are
>> > > doubling that traffic).
>> > >
>> >
>> > I don't understand the use case you are describing.
>> >
>> > Which filesystem types are you talking about for source and target
>> > of copy_file_range()?
>> >
>> > To be clear, the original change was done to support NFS/CIFS server-
>> > side
>> > copy and those should not be affected by this change.
>> >
>>
>> That is incorrect:
>>
>> ssize_t nfsd_copy_file_range(struct file *src, u64 src_pos, struct file
>> *dst,
>> u64 dst_pos, u64 count)
>> {
>>
>> /*
>> * Limit copy to 4MB to prevent indefinitely blocking an nfsd
>> * thread and client rpc slot. The choice of 4MB is somewhat
>> * arbitrary. We might instead base this on r/wsize, or make it
>> * tunable, or use a time instead of a byte limit, or implement
>> * asynchronous copy. In theory a client could also recognize a
>> * limit like this and pipeline multiple COPY requests.
>> */
>> count = min_t(u64, count, 1 << 22);
>> return vfs_copy_file_range(src, src_pos, dst, dst_pos, count, 0);
>> }
>>
>> You are now explicitly changing the behaviour of knfsd when the source
>> and destination filesystem differ.
>>
>> For one thing, you are disallowing the NFSv4.2 copy offload use case of
>> copying from a local filesystem to a remote NFS server. However you are
>> also disallowing the copy from, say, an XFS formatted partition to an
>> ext4 partition.
>>
>
> Got it.

Ugh. And I guess overlayfs may have a similar problem.

> This is easy to solve with a flag COPY_FILE_SPLICE (or something) that
> is internal to kernel users.
>
> FWIW, you may want to look at the loop in ovl_copy_up_data()
> for improvements to nfsd_copy_file_range().
>
> We can move the check out to copy_file_range syscall:
>
> if (flags != 0)
> return -EINVAL;
>
> Leave the fallback from all filesystems and check for the
> COPY_FILE_SPLICE flag inside generic_copy_file_range().

Ok, the diff bellow is just to make sure I understood your suggestion.

The patch will also need to:

- change nfs and overlayfs calls to vfs_copy_file_range() so that they
use the new flag.

- check flags in generic_copy_file_checks() to make sure only valid flags
are used (COPY_FILE_SPLICE at the moment).

Also, where should this flag be defined? include/uapi/linux/fs.h?

Cheers,
--
Luis

diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..341d315d2a96 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1383,6 +1383,13 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
size_t len, unsigned int flags)
{
+ if (!(flags & COPY_FILE_SPLICE)) {
+ if (!file_out->f_op->copy_file_range)
+ return -EOPNOTSUPP;
+ else if (file_out->f_op->copy_file_range !=
+ file_in->f_op->copy_file_range)
+ return -EXDEV;
+ }
return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
}
@@ -1474,9 +1481,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
{
ssize_t ret;

- if (flags != 0)
- return -EINVAL;
-
ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
flags);
if (unlikely(ret))

2021-02-16 13:54:05

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices

> Ugh. And I guess overlayfs may have a similar problem.

Not exactly.
Generally speaking, overlayfs should call vfs_copy_file_range()
with the flags it got from layer above, so if called from nfsd it
will allow cross fs copy and when called from syscall it won't.

There are some corner cases where overlayfs could benefit from
COPY_FILE_SPLICE (e.g. copy from lower file to upper file), but
let's leave those for now. Just leave overlayfs code as is.

>
> > This is easy to solve with a flag COPY_FILE_SPLICE (or something) that
> > is internal to kernel users.
> >
> > FWIW, you may want to look at the loop in ovl_copy_up_data()
> > for improvements to nfsd_copy_file_range().
> >
> > We can move the check out to copy_file_range syscall:
> >
> > if (flags != 0)
> > return -EINVAL;
> >
> > Leave the fallback from all filesystems and check for the
> > COPY_FILE_SPLICE flag inside generic_copy_file_range().
>
> Ok, the diff bellow is just to make sure I understood your suggestion.
>
> The patch will also need to:
>
> - change nfs and overlayfs calls to vfs_copy_file_range() so that they
> use the new flag.
>
> - check flags in generic_copy_file_checks() to make sure only valid flags
> are used (COPY_FILE_SPLICE at the moment).
>
> Also, where should this flag be defined? include/uapi/linux/fs.h?

Grep for REMAP_FILE_
Same header file, same Documentation rst file.

>
> Cheers,
> --
> Luis
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 75f764b43418..341d315d2a96 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1383,6 +1383,13 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
> struct file *file_out, loff_t pos_out,
> size_t len, unsigned int flags)
> {
> + if (!(flags & COPY_FILE_SPLICE)) {
> + if (!file_out->f_op->copy_file_range)
> + return -EOPNOTSUPP;
> + else if (file_out->f_op->copy_file_range !=
> + file_in->f_op->copy_file_range)
> + return -EXDEV;
> + }

That looks strange, because you are duplicating the logic in
do_copy_file_range(). Maybe better:

if (WARN_ON_ONCE(flags & ~COPY_FILE_SPLICE))
return -EINVAL;
if (flags & COPY_FILE_SPLICE)
return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
if (!file_out->f_op->copy_file_range)
return -EOPNOTSUPP;
return -EXDEV;

> }
> @@ -1474,9 +1481,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> {
> ssize_t ret;
>
> - if (flags != 0)
> - return -EINVAL;
> -

This needs to move to the beginning of SYSCALL_DEFINE6(copy_file_range,...

Thanks,
Amir.

2021-02-16 16:44:56

by Luis Henriques

[permalink] [raw]
Subject: Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices

Amir Goldstein <[email protected]> writes:

>> Ugh. And I guess overlayfs may have a similar problem.
>
> Not exactly.
> Generally speaking, overlayfs should call vfs_copy_file_range()
> with the flags it got from layer above, so if called from nfsd it
> will allow cross fs copy and when called from syscall it won't.
>
> There are some corner cases where overlayfs could benefit from
> COPY_FILE_SPLICE (e.g. copy from lower file to upper file), but
> let's leave those for now. Just leave overlayfs code as is.

Got it, thanks for clarifying.

>> > This is easy to solve with a flag COPY_FILE_SPLICE (or something) that
>> > is internal to kernel users.
>> >
>> > FWIW, you may want to look at the loop in ovl_copy_up_data()
>> > for improvements to nfsd_copy_file_range().
>> >
>> > We can move the check out to copy_file_range syscall:
>> >
>> > if (flags != 0)
>> > return -EINVAL;
>> >
>> > Leave the fallback from all filesystems and check for the
>> > COPY_FILE_SPLICE flag inside generic_copy_file_range().
>>
>> Ok, the diff bellow is just to make sure I understood your suggestion.
>>
>> The patch will also need to:
>>
>> - change nfs and overlayfs calls to vfs_copy_file_range() so that they
>> use the new flag.
>>
>> - check flags in generic_copy_file_checks() to make sure only valid flags
>> are used (COPY_FILE_SPLICE at the moment).
>>
>> Also, where should this flag be defined? include/uapi/linux/fs.h?
>
> Grep for REMAP_FILE_
> Same header file, same Documentation rst file.
>
>>
>> Cheers,
>> --
>> Luis
>>
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 75f764b43418..341d315d2a96 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -1383,6 +1383,13 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
>> struct file *file_out, loff_t pos_out,
>> size_t len, unsigned int flags)
>> {
>> + if (!(flags & COPY_FILE_SPLICE)) {
>> + if (!file_out->f_op->copy_file_range)
>> + return -EOPNOTSUPP;
>> + else if (file_out->f_op->copy_file_range !=
>> + file_in->f_op->copy_file_range)
>> + return -EXDEV;
>> + }
>
> That looks strange, because you are duplicating the logic in
> do_copy_file_range(). Maybe better:
>
> if (WARN_ON_ONCE(flags & ~COPY_FILE_SPLICE))
> return -EINVAL;
> if (flags & COPY_FILE_SPLICE)
> return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
> len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);

My initial reasoning for duplicating the logic in do_copy_file_range() was
to allow the generic_copy_file_range() callers to be left unmodified and
allow the filesystems to default to this implementation.

With this change, I guess that the calls to generic_copy_file_range() from
the different filesystems can be dropped, as in my initial patch, as they
will always get -EINVAL. The other option would be to set the
COPY_FILE_SPLICE flag in those calls, but that would get us back to the
problem we're trying to solve.

> if (!file_out->f_op->copy_file_range)
> return -EOPNOTSUPP;
> return -EXDEV;
>
>> }
>> @@ -1474,9 +1481,6 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>> {
>> ssize_t ret;
>>
>> - if (flags != 0)
>> - return -EINVAL;
>> -
>
> This needs to move to the beginning of SYSCALL_DEFINE6(copy_file_range,...

Yep, I didn't included that change in my diff as I wasn't sure if you'd
like to have the flag visible in userspace.

Anyway, thanks for your patience!

Cheers,
--
Luis

2021-02-16 18:57:32

by Andreas Dilger

[permalink] [raw]
Subject: Re: [PATCH v2] vfs: prevent copy_file_range to copy across devices

On Feb 16, 2021, at 6:51 AM, Amir Goldstein <[email protected]> wrote:
>>
>>> This is easy to solve with a flag COPY_FILE_SPLICE (or something) that
>>> is internal to kernel users.
>>>
>>> FWIW, you may want to look at the loop in ovl_copy_up_data()
>>> for improvements to nfsd_copy_file_range().
>>>
>>> We can move the check out to copy_file_range syscall:
>>>
>>> if (flags != 0)
>>> return -EINVAL;
>>>
>>> Leave the fallback from all filesystems and check for the
>>> COPY_FILE_SPLICE flag inside generic_copy_file_range().
>>
>> Ok, the diff bellow is just to make sure I understood your suggestion.
>>
>> The patch will also need to:
>>
>> - change nfs and overlayfs calls to vfs_copy_file_range() so that they
>> use the new flag.
>>
>> - check flags in generic_copy_file_checks() to make sure only valid flags
>> are used (COPY_FILE_SPLICE at the moment).
>>
>> Also, where should this flag be defined? include/uapi/linux/fs.h?
>>
>> Cheers,
>> --
>> Luis
>>
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 75f764b43418..341d315d2a96 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -1383,6 +1383,13 @@ ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
>> struct file *file_out, loff_t pos_out,
>> size_t len, unsigned int flags)
>> {
>> + if (!(flags & COPY_FILE_SPLICE)) {
>> + if (!file_out->f_op->copy_file_range)
>> + return -EOPNOTSUPP;
>> + else if (file_out->f_op->copy_file_range !=
>> + file_in->f_op->copy_file_range)
>> + return -EXDEV;
>> + }
>
> That looks strange, because you are duplicating the logic in
> do_copy_file_range(). Maybe better:
>
> if (WARN_ON_ONCE(flags & ~COPY_FILE_SPLICE))
> return -EINVAL;
> if (flags & COPY_FILE_SPLICE)
> return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
> len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
> if (!file_out->f_op->copy_file_range)
> return -EOPNOTSUPP;
> return -EXDEV;

This shouldn't return -EINVAL to userspace if the flag is not set.

That implies there *is* some valid way for userspace to call this
function, which is AFAICS not possible if COPY_FILE_SPLICE is only
available to in-kernel callers. Instead, it should continue
to return -EOPNOTSUPP to userspace if copy_file_range() is not valid
for this combination of file descriptors, so that applications will
fall back to the non-CFR implementation.

The WARN_ON_ONCE(ret == -EOPNOTSUPP) in vfs_copy_file_range() would
also need to be removed if this will be triggered from userspace.


Cheers, Andreas






Attachments:
signature.asc (890.00 B)
Message signed with OpenPGP