2021-02-12 04:46:17

by Nicolas Boichat

[permalink] [raw]
Subject: [PATCH 0/6] Add generated flag to filesystem struct to block copy_file_range

We hit an issue when upgrading Go compiler from 1.13 to 1.15 [1],
as we use Go's `io.Copy` to copy the content of
`/sys/kernel/debug/tracing/trace` to a temporary file.

Under the hood, Go 1.15 uses `copy_file_range` syscall to
optimize the copy operation. However, that fails to copy any
content when the input file is from tracefs, with an apparent
size of 0 (but there is still content when you `cat` it, of
course).

From discussions in [2][3], it is clear that copy_file_range
cannot be properly implemented on filesystems where the content
is generated at runtime: the file size is incorrect (because it
is unknown before the content is generated), and seeking in such
files (as required by partial writes) is unlikely to work
correctly.

With this patch, Go's `io.Copy` gracefully falls back to a normal
read/write file copy.

I'm not 100% sure which stable tree this should go in, I'd say
at least >=5.3 since this is what introduced support for
cross-filesystem copy_file_range (and where most users are
somewhat likely to hit this issue). But let's discuss the patch
series first.

[1] http://issuetracker.google.com/issues/178332739
[2] https://lkml.org/lkml/2021/1/25/64
[3] https://lkml.org/lkml/2021/1/26/1736


Nicolas Boichat (6):
fs: Add flag to file_system_type to indicate content is generated
proc: Add FS_GENERATED_CONTENT to filesystem flags
sysfs: Add FS_GENERATED_CONTENT to filesystem flags
debugfs: Add FS_GENERATED_CONTENT to filesystem flags
tracefs: Add FS_GENERATED_CONTENT to filesystem flags
vfs: Disallow copy_file_range on generated file systems

fs/debugfs/inode.c | 1 +
fs/proc/root.c | 2 +-
fs/read_write.c | 3 +++
fs/sysfs/mount.c | 2 +-
fs/tracefs/inode.c | 1 +
include/linux/fs.h | 1 +
6 files changed, 8 insertions(+), 2 deletions(-)

--
2.30.0.478.g8a0d178c01-goog


2021-02-12 04:47:25

by Nicolas Boichat

[permalink] [raw]
Subject: [PATCH 2/6] proc: Add FS_GENERATED_CONTENT to filesystem flags

procfs content is generated at runtime.

Signed-off-by: Nicolas Boichat <[email protected]>
---

fs/proc/root.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/proc/root.c b/fs/proc/root.c
index c7e3b1350ef8..7ed715a0f807 100644
--- a/fs/proc/root.c
+++ b/fs/proc/root.c
@@ -282,7 +282,7 @@ static struct file_system_type proc_fs_type = {
.init_fs_context = proc_init_fs_context,
.parameters = proc_fs_parameters,
.kill_sb = proc_kill_sb,
- .fs_flags = FS_USERNS_MOUNT | FS_DISALLOW_NOTIFY_PERM,
+ .fs_flags = FS_USERNS_MOUNT | FS_DISALLOW_NOTIFY_PERM | FS_GENERATED_CONTENT,
};

void __init proc_root_init(void)
--
2.30.0.478.g8a0d178c01-goog

2021-02-12 04:47:53

by Nicolas Boichat

[permalink] [raw]
Subject: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

Filesystems such as procfs and sysfs generate their content at
runtime. This implies the file sizes do not usually match the
amount of data that can be read from the file, and that seeking
may not work as intended.

This will be useful to disallow copy_file_range with input files
from such filesystems.

Signed-off-by: Nicolas Boichat <[email protected]>
---
I first thought of adding a new field to struct file_operations,
but that doesn't quite scale as every single file creation
operation would need to be modified.

include/linux/fs.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3482146b11b0..5bd58b928e94 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2335,6 +2335,7 @@ struct file_system_type {
#define FS_ALLOW_IDMAP 32 /* FS has been updated to handle vfs idmappings. */
#define FS_THP_SUPPORT 8192 /* Remove once all fs converted */
#define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
+#define FS_GENERATED_CONTENT 65536 /* FS contains generated content */
int (*init_fs_context)(struct fs_context *);
const struct fs_parameter_spec *parameters;
struct dentry *(*mount) (struct file_system_type *, int,
--
2.30.0.478.g8a0d178c01-goog

2021-02-12 04:48:14

by Nicolas Boichat

[permalink] [raw]
Subject: [PATCH 4/6] debugfs: Add FS_GENERATED_CONTENT to filesystem flags

debugfs content is generated at runtime.

Signed-off-by: Nicolas Boichat <[email protected]>
---

fs/debugfs/inode.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index c35249497b9b..2bbc5e6d3041 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -279,6 +279,7 @@ static struct file_system_type debug_fs_type = {
.name = "debugfs",
.mount = debug_mount,
.kill_sb = kill_litter_super,
+ .fs_flags = FS_GENERATED_CONTENT,
};
MODULE_ALIAS_FS("debugfs");

--
2.30.0.478.g8a0d178c01-goog

2021-02-12 04:48:59

by Nicolas Boichat

[permalink] [raw]
Subject: [PATCH 5/6] tracefs: Add FS_GENERATED_CONTENT to filesystem flags

tracefs content is generated at runtime.

Signed-off-by: Nicolas Boichat <[email protected]>
---

fs/tracefs/inode.c | 1 +
1 file changed, 1 insertion(+)

diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
index 4b83cbded559..89980312c7a3 100644
--- a/fs/tracefs/inode.c
+++ b/fs/tracefs/inode.c
@@ -308,6 +308,7 @@ static struct file_system_type trace_fs_type = {
.name = "tracefs",
.mount = trace_mount,
.kill_sb = kill_litter_super,
+ .fs_flags = FS_GENERATED_CONTENT,
};
MODULE_ALIAS_FS("tracefs");

--
2.30.0.478.g8a0d178c01-goog

2021-02-12 04:49:44

by Nicolas Boichat

[permalink] [raw]
Subject: [PATCH 3/6] sysfs: Add FS_GENERATED_CONTENT to filesystem flags

sysfs content is generated at runtime.

Signed-off-by: Nicolas Boichat <[email protected]>
---

fs/sysfs/mount.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/sysfs/mount.c b/fs/sysfs/mount.c
index e747c135c1d1..7e367ae5edc1 100644
--- a/fs/sysfs/mount.c
+++ b/fs/sysfs/mount.c
@@ -91,7 +91,7 @@ static struct file_system_type sysfs_fs_type = {
.name = "sysfs",
.init_fs_context = sysfs_init_fs_context,
.kill_sb = sysfs_kill_sb,
- .fs_flags = FS_USERNS_MOUNT,
+ .fs_flags = FS_USERNS_MOUNT | FS_GENERATED_CONTENT,
};

int __init sysfs_init(void)
--
2.30.0.478.g8a0d178c01-goog

2021-02-12 04:51:01

by Nicolas Boichat

[permalink] [raw]
Subject: [PATCH 6/6] vfs: Disallow copy_file_range on generated file systems

copy_file_range (which calls generic_copy_file_checks) uses the
inode file size to adjust the copy count parameter. This breaks
with special filesystems like procfs/sysfs/debugfs/tracefs, where
the file size appears to be zero, but content is actually returned
when a read operation is performed. Other issues would also
happen on partial writes, as the function would attempt to seek
in the input file.

Use the newly introduced FS_GENERATED_CONTENT filesystem flag
to return -EOPNOTSUPP: applications can then retry with a more
usual read/write based file copy (the fallback code is usually
already present to handle older kernels).

Signed-off-by: Nicolas Boichat <[email protected]>
---

fs/read_write.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/fs/read_write.c b/fs/read_write.c
index 0029ff2b0ca8..80322e89fb0a 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1485,6 +1485,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
if (flags != 0)
return -EINVAL;

+ if (file_inode(file_in)->i_sb->s_type->fs_flags & FS_GENERATED_CONTENT)
+ return -EOPNOTSUPP;
+
ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
flags);
if (unlikely(ret))
--
2.30.0.478.g8a0d178c01-goog

2021-02-12 04:57:41

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH 6/6] vfs: Disallow copy_file_range on generated file systems

On Fri, Feb 12, 2021 at 12:44:05PM +0800, Nicolas Boichat wrote:
> copy_file_range (which calls generic_copy_file_checks) uses the
> inode file size to adjust the copy count parameter. This breaks
> with special filesystems like procfs/sysfs/debugfs/tracefs, where
> the file size appears to be zero, but content is actually returned
> when a read operation is performed. Other issues would also
> happen on partial writes, as the function would attempt to seek
> in the input file.
>
> Use the newly introduced FS_GENERATED_CONTENT filesystem flag
> to return -EOPNOTSUPP: applications can then retry with a more
> usual read/write based file copy (the fallback code is usually
> already present to handle older kernels).
>
> Signed-off-by: Nicolas Boichat <[email protected]>
> ---
>
> fs/read_write.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 0029ff2b0ca8..80322e89fb0a 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1485,6 +1485,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> if (flags != 0)
> return -EINVAL;
>
> + if (file_inode(file_in)->i_sb->s_type->fs_flags & FS_GENERATED_CONTENT)
> + return -EOPNOTSUPP;

Why not declare a dummy copy_file_range_nop function that returns
EOPNOTSUPP and point all of these filesystems at it?

(Or, I guess in these days where function pointers are the enemy,
create a #define that is a cast of 0x1, and fix do_copy_file_range to
return EOPNOTSUPP if it sees that?)

--D

> +
> ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
> flags);
> if (unlikely(ret))
> --
> 2.30.0.478.g8a0d178c01-goog
>

2021-02-12 05:02:08

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH 6/6] vfs: Disallow copy_file_range on generated file systems

On Thu, Feb 11, 2021 at 08:53:47PM -0800, Darrick J. Wong wrote:
> On Fri, Feb 12, 2021 at 12:44:05PM +0800, Nicolas Boichat wrote:
> > copy_file_range (which calls generic_copy_file_checks) uses the
> > inode file size to adjust the copy count parameter. This breaks
> > with special filesystems like procfs/sysfs/debugfs/tracefs, where
> > the file size appears to be zero, but content is actually returned
> > when a read operation is performed. Other issues would also
> > happen on partial writes, as the function would attempt to seek
> > in the input file.
> >
> > Use the newly introduced FS_GENERATED_CONTENT filesystem flag
> > to return -EOPNOTSUPP: applications can then retry with a more
> > usual read/write based file copy (the fallback code is usually
> > already present to handle older kernels).
> >
> > Signed-off-by: Nicolas Boichat <[email protected]>
> > ---
> >
> > fs/read_write.c | 3 +++
> > 1 file changed, 3 insertions(+)
> >
> > diff --git a/fs/read_write.c b/fs/read_write.c
> > index 0029ff2b0ca8..80322e89fb0a 100644
> > --- a/fs/read_write.c
> > +++ b/fs/read_write.c
> > @@ -1485,6 +1485,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > if (flags != 0)
> > return -EINVAL;
> >
> > + if (file_inode(file_in)->i_sb->s_type->fs_flags & FS_GENERATED_CONTENT)
> > + return -EOPNOTSUPP;
>
> Why not declare a dummy copy_file_range_nop function that returns
> EOPNOTSUPP and point all of these filesystems at it?
>
> (Or, I guess in these days where function pointers are the enemy,
> create a #define that is a cast of 0x1, and fix do_copy_file_range to
> return EOPNOTSUPP if it sees that?)

Oh, I see, because that doesn't help if the source file is procfs and
the dest file is (say) xfs, because the generic version will try to do
splice magic and *poof*.

I guess the other nit thatI can think of at this late hour is ... what
about the other virtual filesystems like configfs and whatnot? Should
we have a way to flag them as "this can't be the source of a CFR
request" as well?

Or is it just trace/debug/proc/sysfs that have these "zero size but
readable" speshul behaviors?

--D

>
> --D
>
> > +
> > ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
> > flags);
> > if (unlikely(ret))
> > --
> > 2.30.0.478.g8a0d178c01-goog
> >

2021-02-12 05:27:05

by Nicolas Boichat

[permalink] [raw]
Subject: Re: [PATCH 6/6] vfs: Disallow copy_file_range on generated file systems

On Fri, Feb 12, 2021 at 12:59 PM Darrick J. Wong <[email protected]> wrote:
>
> On Thu, Feb 11, 2021 at 08:53:47PM -0800, Darrick J. Wong wrote:
> > On Fri, Feb 12, 2021 at 12:44:05PM +0800, Nicolas Boichat wrote:
> > > copy_file_range (which calls generic_copy_file_checks) uses the
> > > inode file size to adjust the copy count parameter. This breaks
> > > with special filesystems like procfs/sysfs/debugfs/tracefs, where
> > > the file size appears to be zero, but content is actually returned
> > > when a read operation is performed. Other issues would also
> > > happen on partial writes, as the function would attempt to seek
> > > in the input file.
> > >
> > > Use the newly introduced FS_GENERATED_CONTENT filesystem flag
> > > to return -EOPNOTSUPP: applications can then retry with a more
> > > usual read/write based file copy (the fallback code is usually
> > > already present to handle older kernels).
> > >
> > > Signed-off-by: Nicolas Boichat <[email protected]>
> > > ---
> > >
> > > fs/read_write.c | 3 +++
> > > 1 file changed, 3 insertions(+)
> > >
> > > diff --git a/fs/read_write.c b/fs/read_write.c
> > > index 0029ff2b0ca8..80322e89fb0a 100644
> > > --- a/fs/read_write.c
> > > +++ b/fs/read_write.c
> > > @@ -1485,6 +1485,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> > > if (flags != 0)
> > > return -EINVAL;
> > >
> > > + if (file_inode(file_in)->i_sb->s_type->fs_flags & FS_GENERATED_CONTENT)
> > > + return -EOPNOTSUPP;
> >
> > Why not declare a dummy copy_file_range_nop function that returns
> > EOPNOTSUPP and point all of these filesystems at it?
> >
> > (Or, I guess in these days where function pointers are the enemy,
> > create a #define that is a cast of 0x1, and fix do_copy_file_range to
> > return EOPNOTSUPP if it sees that?)

I was pondering abusing ERR_PTR(-EOPNOTSUPP) for this purpose ,-P

>
> Oh, I see, because that doesn't help if the source file is procfs and
> the dest file is (say) xfs, because the generic version will try to do
> splice magic and *poof*.

Yep. I mean, we could still add a check if the
file_in->f_op->copy_file_range == copy_file_range_nop in
do_copy_file_range...
But then we'd need to sprinkle .copy_file_range = copy_file_range_nop
in many many places (~700 as a lower bound[1]), since the file
operation structure is defined at the file level, not at the FS level,
and people are likely to forget...

[1]
$ git grep "struct file_operations.*=" | grep debug | wc -l
631
$ git grep "struct file_operations.*=" | grep trace | wc -l
84

>
> I guess the other nit thatI can think of at this late hour is ... what
> about the other virtual filesystems like configfs and whatnot? Should
> we have a way to flag them as "this can't be the source of a CFR
> request" as well?
>
> Or is it just trace/debug/proc/sysfs that have these "zero size but
> readable" speshul behaviors?

I did try to audit the other filesystems. The ones I spotted:
- devpts should be fine (only device nodes in there)
- I think pstore doesn't need the flag as it's RAM-backed and persistent.

But yes, I missed configfs, thanks for catching that. I think we need
to add the flag for that one (looks like the sizes are all 4K).

>
> --D
>
> >
> > --D
> >
> > > +
> > > ret = generic_copy_file_checks(file_in, pos_in, file_out, pos_out, &len,
> > > flags);
> > > if (unlikely(ret))
> > > --
> > > 2.30.0.478.g8a0d178c01-goog
> > >

2021-02-12 07:50:36

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote:
> Filesystems such as procfs and sysfs generate their content at
> runtime. This implies the file sizes do not usually match the
> amount of data that can be read from the file, and that seeking
> may not work as intended.
>
> This will be useful to disallow copy_file_range with input files
> from such filesystems.
>
> Signed-off-by: Nicolas Boichat <[email protected]>
> ---
> I first thought of adding a new field to struct file_operations,
> but that doesn't quite scale as every single file creation
> operation would need to be modified.

Even so, you missed a load of filesystems in the kernel with this patch
series, what makes the ones you did mark here different from the
"internal" filesystems that you did not?

This feels wrong, why is userspace suddenly breaking? What changed in
the kernel that caused this? Procfs has been around for a _very_ long
time :)

thanks,

greg k-h

2021-02-12 07:56:21

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Fri, Feb 12, 2021 at 6:47 AM Nicolas Boichat <[email protected]> wrote:
>
> Filesystems such as procfs and sysfs generate their content at
> runtime. This implies the file sizes do not usually match the
> amount of data that can be read from the file, and that seeking
> may not work as intended.
>
> This will be useful to disallow copy_file_range with input files
> from such filesystems.
>
> Signed-off-by: Nicolas Boichat <[email protected]>
> ---
> I first thought of adding a new field to struct file_operations,
> but that doesn't quite scale as every single file creation
> operation would need to be modified.
>
> include/linux/fs.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 3482146b11b0..5bd58b928e94 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2335,6 +2335,7 @@ struct file_system_type {
> #define FS_ALLOW_IDMAP 32 /* FS has been updated to handle vfs idmappings. */
> #define FS_THP_SUPPORT 8192 /* Remove once all fs converted */
> #define FS_RENAME_DOES_D_MOVE 32768 /* FS will handle d_move() during rename() internally. */
> +#define FS_GENERATED_CONTENT 65536 /* FS contains generated content */

Can you please make the flag name a little less arbitrary.

Either something that conveys the facts as they are (e.g. "zero size
but readable")
or anything that you think describes best the special behavior that follows from
observing this flag.

The alternative is for the flag name to express what you want
(e.g. "don't copy file range") like FS_DISALLOW_NOTIFY_PERM.

Also, I wonder. A great deal of the files you target are opened with seq_open()
(I didn't audit all of them). Maybe it's worth setting an FMODE flag
in seq_open()
and some of it's relatives to express the quality of the file instead
of flagging
the filesystem? Maybe we can do both to cover more cases.

Thanks,
Amir.

2021-02-12 08:23:11

by Nicolas Boichat

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Fri, Feb 12, 2021 at 3:46 PM Greg KH <[email protected]> wrote:
>
> On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote:
> > Filesystems such as procfs and sysfs generate their content at
> > runtime. This implies the file sizes do not usually match the
> > amount of data that can be read from the file, and that seeking
> > may not work as intended.
> >
> > This will be useful to disallow copy_file_range with input files
> > from such filesystems.
> >
> > Signed-off-by: Nicolas Boichat <[email protected]>
> > ---
> > I first thought of adding a new field to struct file_operations,
> > but that doesn't quite scale as every single file creation
> > operation would need to be modified.
>
> Even so, you missed a load of filesystems in the kernel with this patch
> series, what makes the ones you did mark here different from the
> "internal" filesystems that you did not?

Which ones did I miss? (apart from configfs, see the conversation on
patch 6/6). Anyway, hopefully auditing all filesystems is an order of
magnitude easier task, and easier to maintain in the longer run ,-)

> This feels wrong, why is userspace suddenly breaking? What changed in
> the kernel that caused this? Procfs has been around for a _very_ long
> time :)

Some of this is covered in the cover letter 0/6. To expand a bit:
copy_file_range has only supported cross-filesystem copies since 5.3
[1], before that the kernel would return EXDEV and userspace
application would fall back to a read/write based copy. After 5.3,
copy_file_range copies no data as it thinks the input file is empty.

[1] I think it makes little sense to try to use copy_file_range
between 2 files in /proc, but technically, I think that has been
broken since copy_file_range fallback to do_splice_direct was
introduced (eac70053a141 ("vfs: Add vfs_copy_file_range() support for
pagecache copies", ~4.4).

>
> thanks,
>
> greg k-h

2021-02-12 08:25:09

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Fri, Feb 12, 2021 at 9:49 AM Greg KH <[email protected]> wrote:
>
> On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote:
> > Filesystems such as procfs and sysfs generate their content at
> > runtime. This implies the file sizes do not usually match the
> > amount of data that can be read from the file, and that seeking
> > may not work as intended.
> >
> > This will be useful to disallow copy_file_range with input files
> > from such filesystems.
> >
> > Signed-off-by: Nicolas Boichat <[email protected]>
> > ---
> > I first thought of adding a new field to struct file_operations,
> > but that doesn't quite scale as every single file creation
> > operation would need to be modified.
>
> Even so, you missed a load of filesystems in the kernel with this patch
> series, what makes the ones you did mark here different from the
> "internal" filesystems that you did not?
>
> This feels wrong, why is userspace suddenly breaking? What changed in
> the kernel that caused this? Procfs has been around for a _very_ long
> time :)

That would be because of (v5.3):

5dae222a5ff0 vfs: allow copy_file_range to copy across devices

The intention of this change (series) was to allow server side copy
for nfs and cifs via copy_file_range().
This is mostly work by Dave Chinner that I picked up following requests
from the NFS folks.

But the above change also includes this generic change:

- /* this could be relaxed once a method supports cross-fs copies */
- if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
- return -EXDEV;
-

The change of behavior was documented in the commit message.
It was also documented in:

88e75e2c5 copy_file_range.2: Kernel v5.3 updates

I think our rationale for the generic change was:
"Why not? What could go wrong? (TM)"
I am not sure if any workload really gained something from this
kernel cross-fs CFR.

In retrospect, I think it would have been safer to allow cross-fs CFR
only to the filesystems that implement ->{copy,remap}_file_range()...

Our option now are:
- Restore the cross-fs restriction into generic_copy_file_range()
- Explicitly opt-out of CFR per-fs and/or per-file as Nicolas' patch does

Preference anyone?

Thanks,
Amir.

2021-02-12 08:41:31

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Fri, Feb 12, 2021 at 04:20:17PM +0800, Nicolas Boichat wrote:
> On Fri, Feb 12, 2021 at 3:46 PM Greg KH <[email protected]> wrote:
> >
> > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote:
> > > Filesystems such as procfs and sysfs generate their content at
> > > runtime. This implies the file sizes do not usually match the
> > > amount of data that can be read from the file, and that seeking
> > > may not work as intended.
> > >
> > > This will be useful to disallow copy_file_range with input files
> > > from such filesystems.
> > >
> > > Signed-off-by: Nicolas Boichat <[email protected]>
> > > ---
> > > I first thought of adding a new field to struct file_operations,
> > > but that doesn't quite scale as every single file creation
> > > operation would need to be modified.
> >
> > Even so, you missed a load of filesystems in the kernel with this patch
> > series, what makes the ones you did mark here different from the
> > "internal" filesystems that you did not?
>
> Which ones did I miss? (apart from configfs, see the conversation on
> patch 6/6). Anyway, hopefully auditing all filesystems is an order of
> magnitude easier task, and easier to maintain in the longer run ,-)

Are you going to be the one to add this to each new filesystem that is
added to the kernel?

I see filesystems in drivers/char/mem.c and
drivers/infiniband/hw/qib/qib_fs.c and drivers/misc/ibmasm/ibmasmfs.c
and loads of other driver-specific filesystems. Do all of those "just
work" somehow?

> > This feels wrong, why is userspace suddenly breaking? What changed in
> > the kernel that caused this? Procfs has been around for a _very_ long
> > time :)
>
> Some of this is covered in the cover letter 0/6. To expand a bit:
> copy_file_range has only supported cross-filesystem copies since 5.3
> [1], before that the kernel would return EXDEV and userspace
> application would fall back to a read/write based copy. After 5.3,
> copy_file_range copies no data as it thinks the input file is empty.

So older kernels work fine with this system call on procfs, but newer
ones do not? If so, that's a regression that should be fixed in the
original area, but should not require a new filesystem flag for every
individual one out there. That way lies madness and constant auditing
that I do not see anyone signing up for for the next 20 years, right?

> [1] I think it makes little sense to try to use copy_file_range
> between 2 files in /proc, but technically, I think that has been
> broken since copy_file_range fallback to do_splice_direct was
> introduced (eac70053a141 ("vfs: Add vfs_copy_file_range() support for
> pagecache copies", ~4.4).

Why are people trying to use copy_file_range on simple /proc and /sys
files in the first place? They can not seek (well most can not), so
that feels like a "oh look, a new syscall, let's use it everywhere!"
problem that userspace should not do.

thanks,

greg k-h

2021-02-12 08:43:14

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Fri, Feb 12, 2021 at 10:22:16AM +0200, Amir Goldstein wrote:
> On Fri, Feb 12, 2021 at 9:49 AM Greg KH <[email protected]> wrote:
> >
> > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote:
> > > Filesystems such as procfs and sysfs generate their content at
> > > runtime. This implies the file sizes do not usually match the
> > > amount of data that can be read from the file, and that seeking
> > > may not work as intended.
> > >
> > > This will be useful to disallow copy_file_range with input files
> > > from such filesystems.
> > >
> > > Signed-off-by: Nicolas Boichat <[email protected]>
> > > ---
> > > I first thought of adding a new field to struct file_operations,
> > > but that doesn't quite scale as every single file creation
> > > operation would need to be modified.
> >
> > Even so, you missed a load of filesystems in the kernel with this patch
> > series, what makes the ones you did mark here different from the
> > "internal" filesystems that you did not?
> >
> > This feels wrong, why is userspace suddenly breaking? What changed in
> > the kernel that caused this? Procfs has been around for a _very_ long
> > time :)
>
> That would be because of (v5.3):
>
> 5dae222a5ff0 vfs: allow copy_file_range to copy across devices
>
> The intention of this change (series) was to allow server side copy
> for nfs and cifs via copy_file_range().
> This is mostly work by Dave Chinner that I picked up following requests
> from the NFS folks.
>
> But the above change also includes this generic change:
>
> - /* this could be relaxed once a method supports cross-fs copies */
> - if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> - return -EXDEV;
> -
>
> The change of behavior was documented in the commit message.
> It was also documented in:
>
> 88e75e2c5 copy_file_range.2: Kernel v5.3 updates
>
> I think our rationale for the generic change was:
> "Why not? What could go wrong? (TM)"
> I am not sure if any workload really gained something from this
> kernel cross-fs CFR.

Why not put that check back?

> In retrospect, I think it would have been safer to allow cross-fs CFR
> only to the filesystems that implement ->{copy,remap}_file_range()...

Why not make this change? That seems easier and should fix this for
everyone, right?

> Our option now are:
> - Restore the cross-fs restriction into generic_copy_file_range()

Yes.

> - Explicitly opt-out of CFR per-fs and/or per-file as Nicolas' patch does

No. That way lies constant auditing and someone being "vigilant" for
the next 30+ years. Which will not happen.

thanks,

greg k-h

2021-02-12 12:08:19

by Luis Henriques

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

Greg KH <[email protected]> writes:

> On Fri, Feb 12, 2021 at 10:22:16AM +0200, Amir Goldstein wrote:
>> On Fri, Feb 12, 2021 at 9:49 AM Greg KH <[email protected]> wrote:
>> >
>> > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote:
>> > > Filesystems such as procfs and sysfs generate their content at
>> > > runtime. This implies the file sizes do not usually match the
>> > > amount of data that can be read from the file, and that seeking
>> > > may not work as intended.
>> > >
>> > > This will be useful to disallow copy_file_range with input files
>> > > from such filesystems.
>> > >
>> > > Signed-off-by: Nicolas Boichat <[email protected]>
>> > > ---
>> > > I first thought of adding a new field to struct file_operations,
>> > > but that doesn't quite scale as every single file creation
>> > > operation would need to be modified.
>> >
>> > Even so, you missed a load of filesystems in the kernel with this patch
>> > series, what makes the ones you did mark here different from the
>> > "internal" filesystems that you did not?
>> >
>> > This feels wrong, why is userspace suddenly breaking? What changed in
>> > the kernel that caused this? Procfs has been around for a _very_ long
>> > time :)
>>
>> That would be because of (v5.3):
>>
>> 5dae222a5ff0 vfs: allow copy_file_range to copy across devices
>>
>> The intention of this change (series) was to allow server side copy
>> for nfs and cifs via copy_file_range().
>> This is mostly work by Dave Chinner that I picked up following requests
>> from the NFS folks.
>>
>> But the above change also includes this generic change:
>>
>> - /* this could be relaxed once a method supports cross-fs copies */
>> - if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
>> - return -EXDEV;
>> -
>>
>> The change of behavior was documented in the commit message.
>> It was also documented in:
>>
>> 88e75e2c5 copy_file_range.2: Kernel v5.3 updates
>>
>> I think our rationale for the generic change was:
>> "Why not? What could go wrong? (TM)"
>> I am not sure if any workload really gained something from this
>> kernel cross-fs CFR.
>
> Why not put that check back?
>
>> In retrospect, I think it would have been safer to allow cross-fs CFR
>> only to the filesystems that implement ->{copy,remap}_file_range()...
>
> Why not make this change? That seems easier and should fix this for
> everyone, right?
>
>> Our option now are:
>> - Restore the cross-fs restriction into generic_copy_file_range()
>
> Yes.
>

Restoring this restriction will actually change the current cephfs CFR
behaviour. Since that commit we have allowed doing remote copies between
different filesystems within the same ceph cluster. See commit
6fd4e6348352 ("ceph: allow object copies across different filesystems in
the same cluster").

Although I'm not aware of any current users for this scenario, the
performance impact can actually be huge as it's the difference between
asking the OSDs for copying a file and doing a full read+write on the
client side.

Cheers,
--
Luis


>> - Explicitly opt-out of CFR per-fs and/or per-file as Nicolas' patch does
>
> No. That way lies constant auditing and someone being "vigilant" for
> the next 30+ years. Which will not happen.
>
> thanks,
>
> greg k-h

2021-02-12 12:23:49

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Fri, Feb 12, 2021 at 12:05:14PM +0000, Luis Henriques wrote:
> Greg KH <[email protected]> writes:
>
> > On Fri, Feb 12, 2021 at 10:22:16AM +0200, Amir Goldstein wrote:
> >> On Fri, Feb 12, 2021 at 9:49 AM Greg KH <[email protected]> wrote:
> >> >
> >> > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote:
> >> > > Filesystems such as procfs and sysfs generate their content at
> >> > > runtime. This implies the file sizes do not usually match the
> >> > > amount of data that can be read from the file, and that seeking
> >> > > may not work as intended.
> >> > >
> >> > > This will be useful to disallow copy_file_range with input files
> >> > > from such filesystems.
> >> > >
> >> > > Signed-off-by: Nicolas Boichat <[email protected]>
> >> > > ---
> >> > > I first thought of adding a new field to struct file_operations,
> >> > > but that doesn't quite scale as every single file creation
> >> > > operation would need to be modified.
> >> >
> >> > Even so, you missed a load of filesystems in the kernel with this patch
> >> > series, what makes the ones you did mark here different from the
> >> > "internal" filesystems that you did not?
> >> >
> >> > This feels wrong, why is userspace suddenly breaking? What changed in
> >> > the kernel that caused this? Procfs has been around for a _very_ long
> >> > time :)
> >>
> >> That would be because of (v5.3):
> >>
> >> 5dae222a5ff0 vfs: allow copy_file_range to copy across devices
> >>
> >> The intention of this change (series) was to allow server side copy
> >> for nfs and cifs via copy_file_range().
> >> This is mostly work by Dave Chinner that I picked up following requests
> >> from the NFS folks.
> >>
> >> But the above change also includes this generic change:
> >>
> >> - /* this could be relaxed once a method supports cross-fs copies */
> >> - if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> >> - return -EXDEV;
> >> -
> >>
> >> The change of behavior was documented in the commit message.
> >> It was also documented in:
> >>
> >> 88e75e2c5 copy_file_range.2: Kernel v5.3 updates
> >>
> >> I think our rationale for the generic change was:
> >> "Why not? What could go wrong? (TM)"
> >> I am not sure if any workload really gained something from this
> >> kernel cross-fs CFR.
> >
> > Why not put that check back?
> >
> >> In retrospect, I think it would have been safer to allow cross-fs CFR
> >> only to the filesystems that implement ->{copy,remap}_file_range()...
> >
> > Why not make this change? That seems easier and should fix this for
> > everyone, right?
> >
> >> Our option now are:
> >> - Restore the cross-fs restriction into generic_copy_file_range()
> >
> > Yes.
> >
>
> Restoring this restriction will actually change the current cephfs CFR
> behaviour. Since that commit we have allowed doing remote copies between
> different filesystems within the same ceph cluster. See commit
> 6fd4e6348352 ("ceph: allow object copies across different filesystems in
> the same cluster").
>
> Although I'm not aware of any current users for this scenario, the
> performance impact can actually be huge as it's the difference between
> asking the OSDs for copying a file and doing a full read+write on the
> client side.

Regression in performance is ok if it fixes a regression for things that
used to work just fine in the past :)

First rule, make it work.

thanks,

greg k-h

2021-02-12 13:07:25

by Luis Henriques

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

Greg KH <[email protected]> writes:

> On Fri, Feb 12, 2021 at 12:05:14PM +0000, Luis Henriques wrote:
>> Greg KH <[email protected]> writes:
>>
>> > On Fri, Feb 12, 2021 at 10:22:16AM +0200, Amir Goldstein wrote:
>> >> On Fri, Feb 12, 2021 at 9:49 AM Greg KH <[email protected]> wrote:
>> >> >
>> >> > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote:
>> >> > > Filesystems such as procfs and sysfs generate their content at
>> >> > > runtime. This implies the file sizes do not usually match the
>> >> > > amount of data that can be read from the file, and that seeking
>> >> > > may not work as intended.
>> >> > >
>> >> > > This will be useful to disallow copy_file_range with input files
>> >> > > from such filesystems.
>> >> > >
>> >> > > Signed-off-by: Nicolas Boichat <[email protected]>
>> >> > > ---
>> >> > > I first thought of adding a new field to struct file_operations,
>> >> > > but that doesn't quite scale as every single file creation
>> >> > > operation would need to be modified.
>> >> >
>> >> > Even so, you missed a load of filesystems in the kernel with this patch
>> >> > series, what makes the ones you did mark here different from the
>> >> > "internal" filesystems that you did not?
>> >> >
>> >> > This feels wrong, why is userspace suddenly breaking? What changed in
>> >> > the kernel that caused this? Procfs has been around for a _very_ long
>> >> > time :)
>> >>
>> >> That would be because of (v5.3):
>> >>
>> >> 5dae222a5ff0 vfs: allow copy_file_range to copy across devices
>> >>
>> >> The intention of this change (series) was to allow server side copy
>> >> for nfs and cifs via copy_file_range().
>> >> This is mostly work by Dave Chinner that I picked up following requests
>> >> from the NFS folks.
>> >>
>> >> But the above change also includes this generic change:
>> >>
>> >> - /* this could be relaxed once a method supports cross-fs copies */
>> >> - if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
>> >> - return -EXDEV;
>> >> -
>> >>
>> >> The change of behavior was documented in the commit message.
>> >> It was also documented in:
>> >>
>> >> 88e75e2c5 copy_file_range.2: Kernel v5.3 updates
>> >>
>> >> I think our rationale for the generic change was:
>> >> "Why not? What could go wrong? (TM)"
>> >> I am not sure if any workload really gained something from this
>> >> kernel cross-fs CFR.
>> >
>> > Why not put that check back?
>> >
>> >> In retrospect, I think it would have been safer to allow cross-fs CFR
>> >> only to the filesystems that implement ->{copy,remap}_file_range()...
>> >
>> > Why not make this change? That seems easier and should fix this for
>> > everyone, right?
>> >
>> >> Our option now are:
>> >> - Restore the cross-fs restriction into generic_copy_file_range()
>> >
>> > Yes.
>> >
>>
>> Restoring this restriction will actually change the current cephfs CFR
>> behaviour. Since that commit we have allowed doing remote copies between
>> different filesystems within the same ceph cluster. See commit
>> 6fd4e6348352 ("ceph: allow object copies across different filesystems in
>> the same cluster").
>>
>> Although I'm not aware of any current users for this scenario, the
>> performance impact can actually be huge as it's the difference between
>> asking the OSDs for copying a file and doing a full read+write on the
>> client side.
>
> Regression in performance is ok if it fixes a regression for things that
> used to work just fine in the past :)
>
> First rule, make it work.

Sure, I just wanted to point out that *maybe* there are other options than
simply reverting that commit :-)

Something like the patch below (completely untested!) should revert to the
old behaviour in filesystems that don't implement the CFR syscall.

Cheers,
--
Luis

diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..bf5dccc43cc9 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
file_out, pos_out,
len, flags);

- return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
- flags);
+ if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
+ return -EXDEV;
+ else
+ generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
+ flags);
}

/*

2021-02-12 14:13:10

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Fri, Feb 12, 2021 at 12:41:48PM +0000, Luis Henriques wrote:
> Greg KH <[email protected]> writes:
>
> > On Fri, Feb 12, 2021 at 12:05:14PM +0000, Luis Henriques wrote:
> >> Greg KH <[email protected]> writes:
> >>
> >> > On Fri, Feb 12, 2021 at 10:22:16AM +0200, Amir Goldstein wrote:
> >> >> On Fri, Feb 12, 2021 at 9:49 AM Greg KH <[email protected]> wrote:
> >> >> >
> >> >> > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote:
> >> >> > > Filesystems such as procfs and sysfs generate their content at
> >> >> > > runtime. This implies the file sizes do not usually match the
> >> >> > > amount of data that can be read from the file, and that seeking
> >> >> > > may not work as intended.
> >> >> > >
> >> >> > > This will be useful to disallow copy_file_range with input files
> >> >> > > from such filesystems.
> >> >> > >
> >> >> > > Signed-off-by: Nicolas Boichat <[email protected]>
> >> >> > > ---
> >> >> > > I first thought of adding a new field to struct file_operations,
> >> >> > > but that doesn't quite scale as every single file creation
> >> >> > > operation would need to be modified.
> >> >> >
> >> >> > Even so, you missed a load of filesystems in the kernel with this patch
> >> >> > series, what makes the ones you did mark here different from the
> >> >> > "internal" filesystems that you did not?
> >> >> >
> >> >> > This feels wrong, why is userspace suddenly breaking? What changed in
> >> >> > the kernel that caused this? Procfs has been around for a _very_ long
> >> >> > time :)
> >> >>
> >> >> That would be because of (v5.3):
> >> >>
> >> >> 5dae222a5ff0 vfs: allow copy_file_range to copy across devices
> >> >>
> >> >> The intention of this change (series) was to allow server side copy
> >> >> for nfs and cifs via copy_file_range().
> >> >> This is mostly work by Dave Chinner that I picked up following requests
> >> >> from the NFS folks.
> >> >>
> >> >> But the above change also includes this generic change:
> >> >>
> >> >> - /* this could be relaxed once a method supports cross-fs copies */
> >> >> - if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> >> >> - return -EXDEV;
> >> >> -
> >> >>
> >> >> The change of behavior was documented in the commit message.
> >> >> It was also documented in:
> >> >>
> >> >> 88e75e2c5 copy_file_range.2: Kernel v5.3 updates
> >> >>
> >> >> I think our rationale for the generic change was:
> >> >> "Why not? What could go wrong? (TM)"
> >> >> I am not sure if any workload really gained something from this
> >> >> kernel cross-fs CFR.
> >> >
> >> > Why not put that check back?
> >> >
> >> >> In retrospect, I think it would have been safer to allow cross-fs CFR
> >> >> only to the filesystems that implement ->{copy,remap}_file_range()...
> >> >
> >> > Why not make this change? That seems easier and should fix this for
> >> > everyone, right?
> >> >
> >> >> Our option now are:
> >> >> - Restore the cross-fs restriction into generic_copy_file_range()
> >> >
> >> > Yes.
> >> >
> >>
> >> Restoring this restriction will actually change the current cephfs CFR
> >> behaviour. Since that commit we have allowed doing remote copies between
> >> different filesystems within the same ceph cluster. See commit
> >> 6fd4e6348352 ("ceph: allow object copies across different filesystems in
> >> the same cluster").
> >>
> >> Although I'm not aware of any current users for this scenario, the
> >> performance impact can actually be huge as it's the difference between
> >> asking the OSDs for copying a file and doing a full read+write on the
> >> client side.
> >
> > Regression in performance is ok if it fixes a regression for things that
> > used to work just fine in the past :)
> >
> > First rule, make it work.
>
> Sure, I just wanted to point out that *maybe* there are other options than
> simply reverting that commit :-)
>
> Something like the patch below (completely untested!) should revert to the
> old behaviour in filesystems that don't implement the CFR syscall.
>
> Cheers,
> --
> Luis
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 75f764b43418..bf5dccc43cc9 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
> file_out, pos_out,
> len, flags);
>
> - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> - flags);
> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> + return -EXDEV;
> + else
> + generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> + flags);
> }
>
> /*

That would make much more sense to me.

2021-02-12 14:52:28

by Steven Rostedt

[permalink] [raw]
Subject: Re: [PATCH 5/6] tracefs: Add FS_GENERATED_CONTENT to filesystem flags

On Fri, 12 Feb 2021 12:44:04 +0800
Nicolas Boichat <[email protected]> wrote:

> tracefs content is generated at runtime.
>
> Signed-off-by: Nicolas Boichat <[email protected]>

Acked-by: Steven Rostedt (VMware) <[email protected]>

-- Steve

> ---
>
> fs/tracefs/inode.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/fs/tracefs/inode.c b/fs/tracefs/inode.c
> index 4b83cbded559..89980312c7a3 100644
> --- a/fs/tracefs/inode.c
> +++ b/fs/tracefs/inode.c
> @@ -308,6 +308,7 @@ static struct file_system_type trace_fs_type = {
> .name = "tracefs",
> .mount = trace_mount,
> .kill_sb = kill_litter_super,
> + .fs_flags = FS_GENERATED_CONTENT,
> };
> MODULE_ALIAS_FS("tracefs");
>

2021-02-12 15:03:56

by Luis Henriques

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

Greg KH <[email protected]> writes:

> On Fri, Feb 12, 2021 at 12:41:48PM +0000, Luis Henriques wrote:
>> Greg KH <[email protected]> writes:
...
>> >> >> Our option now are:
>> >> >> - Restore the cross-fs restriction into generic_copy_file_range()
>> >> >
>> >> > Yes.
>> >> >
>> >>
>> >> Restoring this restriction will actually change the current cephfs CFR
>> >> behaviour. Since that commit we have allowed doing remote copies between
>> >> different filesystems within the same ceph cluster. See commit
>> >> 6fd4e6348352 ("ceph: allow object copies across different filesystems in
>> >> the same cluster").
>> >>
>> >> Although I'm not aware of any current users for this scenario, the
>> >> performance impact can actually be huge as it's the difference between
>> >> asking the OSDs for copying a file and doing a full read+write on the
>> >> client side.
>> >
>> > Regression in performance is ok if it fixes a regression for things that
>> > used to work just fine in the past :)
>> >
>> > First rule, make it work.
>>
>> Sure, I just wanted to point out that *maybe* there are other options than
>> simply reverting that commit :-)
>>
>> Something like the patch below (completely untested!) should revert to the
>> old behaviour in filesystems that don't implement the CFR syscall.
>>
>> Cheers,
>> --
>> Luis
>>
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 75f764b43418..bf5dccc43cc9 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
>> file_out, pos_out,
>> len, flags);
>>
>> - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
>> - flags);
>> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
>> + return -EXDEV;
>> + else
>> + generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
>> + flags);
>> }
>>
>> /*
>
> That would make much more sense to me.

Great. I can send a proper patch with changelog, if this is the really
what we want. But I would rather hear from others first. I guess that at
least the NFS devs have something to say here.

Cheers,
--
Luis

2021-02-12 15:37:33

by Ian Lance Taylor

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Fri, Feb 12, 2021 at 12:38 AM Greg KH <[email protected]> wrote:
>
> Why are people trying to use copy_file_range on simple /proc and /sys
> files in the first place? They can not seek (well most can not), so
> that feels like a "oh look, a new syscall, let's use it everywhere!"
> problem that userspace should not do.

This may have been covered elsewhere, but it's not that people are
saying "let's use copy_file_range on files in /proc." It's that the
Go language standard library provides an interface to operating system
files. When Go code uses the standard library function io.Copy to
copy the contents of one open file to another open file, then on Linux
kernels 5.3 and greater the Go standard library will use the
copy_file_range system call. That seems to be exactly what
copy_file_range is intended for. Unfortunately it appears that when
people writing Go code open a file in /proc and use io.Copy the
contents to another open file, copy_file_range does nothing and
reports success. There isn't anything on the copy_file_range man page
explaining this limitation, and there isn't any documented way to know
that the Go standard library should not use copy_file_range on certain
files.

So ideally the kernel will report EOPNOTSUPP or EINVAL when using
copy_file_range on a file in /proc or some other file system that
fails (and, minor side note, the copy_file_range man page should
document that it can return EOPNOTSUPP or EINVAL in some cases, which
does already happen on at least some kernel versions using at least
some file systems). Or, less ideally, there will be some documented
way that the Go standard library can determine that copy_file_range
will fail before trying to use it. If neither of those can be done,
then I think the only option is for the Go standard library to never
use copy_file_range, as even though it will almost always work and
work well, in some unpredictable number of cases it will fail
silently.

Thanks.

Ian

2021-02-12 15:50:44

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote:
> On Fri, Feb 12, 2021 at 12:38 AM Greg KH <[email protected]> wrote:
> >
> > Why are people trying to use copy_file_range on simple /proc and /sys
> > files in the first place? They can not seek (well most can not), so
> > that feels like a "oh look, a new syscall, let's use it everywhere!"
> > problem that userspace should not do.
>
> This may have been covered elsewhere, but it's not that people are
> saying "let's use copy_file_range on files in /proc." It's that the
> Go language standard library provides an interface to operating system
> files. When Go code uses the standard library function io.Copy to
> copy the contents of one open file to another open file, then on Linux
> kernels 5.3 and greater the Go standard library will use the
> copy_file_range system call. That seems to be exactly what
> copy_file_range is intended for. Unfortunately it appears that when
> people writing Go code open a file in /proc and use io.Copy the
> contents to another open file, copy_file_range does nothing and
> reports success. There isn't anything on the copy_file_range man page
> explaining this limitation, and there isn't any documented way to know
> that the Go standard library should not use copy_file_range on certain
> files.

But, is this a bug in the kernel in that the syscall being made is not
working properly, or a bug in that Go decided to do this for all types
of files not knowing that some types of files can not handle this?

If the kernel has always worked this way, I would say that Go is doing
the wrong thing here. If the kernel used to work properly, and then
changed, then it's a regression on the kernel side.

So which is it?

> So ideally the kernel will report EOPNOTSUPP or EINVAL when using
> copy_file_range on a file in /proc or some other file system that
> fails (and, minor side note, the copy_file_range man page should
> document that it can return EOPNOTSUPP or EINVAL in some cases, which
> does already happen on at least some kernel versions using at least
> some file systems).

Documentation is good, but what the kernel does is the true "definition"
of what is going right or wrong here.

thanks,

greg k-h

2021-02-12 16:02:48

by Ian Lance Taylor

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Fri, Feb 12, 2021 at 7:45 AM Greg KH <[email protected]> wrote:
>
> On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote:
> > On Fri, Feb 12, 2021 at 12:38 AM Greg KH <[email protected]> wrote:
> > >
> > > Why are people trying to use copy_file_range on simple /proc and /sys
> > > files in the first place? They can not seek (well most can not), so
> > > that feels like a "oh look, a new syscall, let's use it everywhere!"
> > > problem that userspace should not do.
> >
> > This may have been covered elsewhere, but it's not that people are
> > saying "let's use copy_file_range on files in /proc." It's that the
> > Go language standard library provides an interface to operating system
> > files. When Go code uses the standard library function io.Copy to
> > copy the contents of one open file to another open file, then on Linux
> > kernels 5.3 and greater the Go standard library will use the
> > copy_file_range system call. That seems to be exactly what
> > copy_file_range is intended for. Unfortunately it appears that when
> > people writing Go code open a file in /proc and use io.Copy the
> > contents to another open file, copy_file_range does nothing and
> > reports success. There isn't anything on the copy_file_range man page
> > explaining this limitation, and there isn't any documented way to know
> > that the Go standard library should not use copy_file_range on certain
> > files.
>
> But, is this a bug in the kernel in that the syscall being made is not
> working properly, or a bug in that Go decided to do this for all types
> of files not knowing that some types of files can not handle this?
>
> If the kernel has always worked this way, I would say that Go is doing
> the wrong thing here. If the kernel used to work properly, and then
> changed, then it's a regression on the kernel side.
>
> So which is it?

I don't work on the kernel, so I can't tell you which it is. You will
have to decide.

From my perspective, as a kernel user rather than a kernel developer,
a system call that silently fails for certain files and that provides
no way to determine either 1) ahead of time that the system call will
fail, or 2) after the call that the system call did fail, is a useless
system call. I can never use that system call, because I don't know
whether or not it will work. So as a kernel user I would say that you
should fix the system call to report failure, or document some way to
know whether the system call will fail, or you should remove the
system call. But I'm not a kernel developer, I don't have all the
information, and it's obviously your call.

I'll note that to the best of my knowledge this failure started
happening with the 5.3 kernel, as before 5.3 the problematic calls
would report a failure (EXDEV). Since 5.3 isn't all that old I
personally wouldn't say that the kernel "has always worked this way."
But I may be mistaken about this.


> > So ideally the kernel will report EOPNOTSUPP or EINVAL when using
> > copy_file_range on a file in /proc or some other file system that
> > fails (and, minor side note, the copy_file_range man page should
> > document that it can return EOPNOTSUPP or EINVAL in some cases, which
> > does already happen on at least some kernel versions using at least
> > some file systems).
>
> Documentation is good, but what the kernel does is the true "definition"
> of what is going right or wrong here.

Sure. The documentation comment was just a side note. I hope that we
can all agree that accurate man pages are better than inaccurate ones.

Ian

2021-02-12 16:35:12

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Fri, Feb 12, 2021 at 07:59:04AM -0800, Ian Lance Taylor wrote:
> On Fri, Feb 12, 2021 at 7:45 AM Greg KH <[email protected]> wrote:
> >
> > On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote:
> > > On Fri, Feb 12, 2021 at 12:38 AM Greg KH <[email protected]> wrote:
> > > >
> > > > Why are people trying to use copy_file_range on simple /proc and /sys
> > > > files in the first place? They can not seek (well most can not), so
> > > > that feels like a "oh look, a new syscall, let's use it everywhere!"
> > > > problem that userspace should not do.
> > >
> > > This may have been covered elsewhere, but it's not that people are
> > > saying "let's use copy_file_range on files in /proc." It's that the
> > > Go language standard library provides an interface to operating system
> > > files. When Go code uses the standard library function io.Copy to
> > > copy the contents of one open file to another open file, then on Linux
> > > kernels 5.3 and greater the Go standard library will use the
> > > copy_file_range system call. That seems to be exactly what
> > > copy_file_range is intended for. Unfortunately it appears that when
> > > people writing Go code open a file in /proc and use io.Copy the
> > > contents to another open file, copy_file_range does nothing and
> > > reports success. There isn't anything on the copy_file_range man page
> > > explaining this limitation, and there isn't any documented way to know
> > > that the Go standard library should not use copy_file_range on certain
> > > files.
> >
> > But, is this a bug in the kernel in that the syscall being made is not
> > working properly, or a bug in that Go decided to do this for all types
> > of files not knowing that some types of files can not handle this?
> >
> > If the kernel has always worked this way, I would say that Go is doing
> > the wrong thing here. If the kernel used to work properly, and then
> > changed, then it's a regression on the kernel side.
> >
> > So which is it?
>
> I don't work on the kernel, so I can't tell you which it is. You will
> have to decide.

As you have the userspace code, it should be easier for you to test this
on an older kernel. I don't have your userspace code...

> From my perspective, as a kernel user rather than a kernel developer,
> a system call that silently fails for certain files and that provides
> no way to determine either 1) ahead of time that the system call will
> fail, or 2) after the call that the system call did fail, is a useless
> system call.

Great, then don't use copy_file_range() yet as it seems like it fits
that category at the moment :)

> I can never use that system call, because I don't know
> whether or not it will work. So as a kernel user I would say that you
> should fix the system call to report failure, or document some way to
> know whether the system call will fail, or you should remove the
> system call. But I'm not a kernel developer, I don't have all the
> information, and it's obviously your call.

That's up to the authors of that syscall, I don't know what they
intended this to be used for. It feels like you are using this in a
very generic "all file i/o works for this" way, while that is not what
the authors of the syscall intended it for, as is evident by the
failures that older kernels would report for you.

> I'll note that to the best of my knowledge this failure started
> happening with the 5.3 kernel, as before 5.3 the problematic calls
> would report a failure (EXDEV). Since 5.3 isn't all that old I
> personally wouldn't say that the kernel "has always worked this way."
> But I may be mistaken about this.

Testing would be good, as regressions are more serious than "it would be
nice if it would work like this instead!" type of issues.

thanks,

greg k-h

2021-02-12 20:26:34

by Ian Lance Taylor

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Fri, Feb 12, 2021 at 8:28 AM Greg KH <[email protected]> wrote:
>
> On Fri, Feb 12, 2021 at 07:59:04AM -0800, Ian Lance Taylor wrote:
> > On Fri, Feb 12, 2021 at 7:45 AM Greg KH <[email protected]> wrote:
> > >
> > > On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote:
> > > > On Fri, Feb 12, 2021 at 12:38 AM Greg KH <[email protected]> wrote:
> > > > >
> > > > > Why are people trying to use copy_file_range on simple /proc and /sys
> > > > > files in the first place? They can not seek (well most can not), so
> > > > > that feels like a "oh look, a new syscall, let's use it everywhere!"
> > > > > problem that userspace should not do.
> > > >
> > > > This may have been covered elsewhere, but it's not that people are
> > > > saying "let's use copy_file_range on files in /proc." It's that the
> > > > Go language standard library provides an interface to operating system
> > > > files. When Go code uses the standard library function io.Copy to
> > > > copy the contents of one open file to another open file, then on Linux
> > > > kernels 5.3 and greater the Go standard library will use the
> > > > copy_file_range system call. That seems to be exactly what
> > > > copy_file_range is intended for. Unfortunately it appears that when
> > > > people writing Go code open a file in /proc and use io.Copy the
> > > > contents to another open file, copy_file_range does nothing and
> > > > reports success. There isn't anything on the copy_file_range man page
> > > > explaining this limitation, and there isn't any documented way to know
> > > > that the Go standard library should not use copy_file_range on certain
> > > > files.
> > >
> > > But, is this a bug in the kernel in that the syscall being made is not
> > > working properly, or a bug in that Go decided to do this for all types
> > > of files not knowing that some types of files can not handle this?
> > >
> > > If the kernel has always worked this way, I would say that Go is doing
> > > the wrong thing here. If the kernel used to work properly, and then
> > > changed, then it's a regression on the kernel side.
> > >
> > > So which is it?
> >
> > I don't work on the kernel, so I can't tell you which it is. You will
> > have to decide.
>
> As you have the userspace code, it should be easier for you to test this
> on an older kernel. I don't have your userspace code...

Sorry, I'm not sure what you are asking.

I've attached a sample Go program. On kernel version 2.6.32 this
program exits 0. On kernel version 5.7.17 it prints

got "" want "./foo\x00"

and exits with status 1.

This program hardcodes the string "/proc/self/cmdline" for
convenience, but of course the same results would happen if this were
a generic copy program that somebody invoked with /proc/self/cmdline
as a command line option.

I could write the same program in C easily enough, by explicitly
calling copy_file_range. Would it help to see a sample C program?


> > From my perspective, as a kernel user rather than a kernel developer,
> > a system call that silently fails for certain files and that provides
> > no way to determine either 1) ahead of time that the system call will
> > fail, or 2) after the call that the system call did fail, is a useless
> > system call.
>
> Great, then don't use copy_file_range() yet as it seems like it fits
> that category at the moment :)

That seems like an unfortunate result, but if that is the determining
opinion then I guess that is what we will have to do in the Go
standard library.

Ian


Attachments:
foo7.go (972.00 B)

2021-02-12 23:06:44

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Fri, Feb 12, 2021 at 04:45:41PM +0100, Greg KH wrote:
> On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote:
> > On Fri, Feb 12, 2021 at 12:38 AM Greg KH <[email protected]> wrote:
> > >
> > > Why are people trying to use copy_file_range on simple /proc and /sys
> > > files in the first place? They can not seek (well most can not), so
> > > that feels like a "oh look, a new syscall, let's use it everywhere!"
> > > problem that userspace should not do.
> >
> > This may have been covered elsewhere, but it's not that people are
> > saying "let's use copy_file_range on files in /proc." It's that the
> > Go language standard library provides an interface to operating system
> > files. When Go code uses the standard library function io.Copy to
> > copy the contents of one open file to another open file, then on Linux
> > kernels 5.3 and greater the Go standard library will use the
> > copy_file_range system call. That seems to be exactly what
> > copy_file_range is intended for. Unfortunately it appears that when
> > people writing Go code open a file in /proc and use io.Copy the
> > contents to another open file, copy_file_range does nothing and
> > reports success. There isn't anything on the copy_file_range man page
> > explaining this limitation, and there isn't any documented way to know
> > that the Go standard library should not use copy_file_range on certain
> > files.
>
> But, is this a bug in the kernel in that the syscall being made is not
> working properly, or a bug in that Go decided to do this for all types
> of files not knowing that some types of files can not handle this?
>
> If the kernel has always worked this way, I would say that Go is doing
> the wrong thing here. If the kernel used to work properly, and then
> changed, then it's a regression on the kernel side.
>
> So which is it?

Both Al Viro and myself have said "copy file range is not a generic
method for copying data between two file descriptors". It is a
targetted solution for *regular files only* on filesystems that store
persistent data and can accelerate the data copy in some way (e.g.
clone, server side offload, hardware offlead, etc). It is not
intended as a copy mechanism for copying data from one random file
descriptor to another.

The use of it as a general file copy mechanism in the Go system
library is incorrect and wrong. It is a userspace bug. Userspace
has done the wrong thing, userspace needs to be fixed.

-Dave.
--
Dave Chinner
[email protected]

2021-02-12 23:09:54

by Ian Lance Taylor

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Fri, Feb 12, 2021 at 3:03 PM Dave Chinner <[email protected]> wrote:
>
> On Fri, Feb 12, 2021 at 04:45:41PM +0100, Greg KH wrote:
> > On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote:
> > > On Fri, Feb 12, 2021 at 12:38 AM Greg KH <[email protected]> wrote:
> > > >
> > > > Why are people trying to use copy_file_range on simple /proc and /sys
> > > > files in the first place? They can not seek (well most can not), so
> > > > that feels like a "oh look, a new syscall, let's use it everywhere!"
> > > > problem that userspace should not do.
> > >
> > > This may have been covered elsewhere, but it's not that people are
> > > saying "let's use copy_file_range on files in /proc." It's that the
> > > Go language standard library provides an interface to operating system
> > > files. When Go code uses the standard library function io.Copy to
> > > copy the contents of one open file to another open file, then on Linux
> > > kernels 5.3 and greater the Go standard library will use the
> > > copy_file_range system call. That seems to be exactly what
> > > copy_file_range is intended for. Unfortunately it appears that when
> > > people writing Go code open a file in /proc and use io.Copy the
> > > contents to another open file, copy_file_range does nothing and
> > > reports success. There isn't anything on the copy_file_range man page
> > > explaining this limitation, and there isn't any documented way to know
> > > that the Go standard library should not use copy_file_range on certain
> > > files.
> >
> > But, is this a bug in the kernel in that the syscall being made is not
> > working properly, or a bug in that Go decided to do this for all types
> > of files not knowing that some types of files can not handle this?
> >
> > If the kernel has always worked this way, I would say that Go is doing
> > the wrong thing here. If the kernel used to work properly, and then
> > changed, then it's a regression on the kernel side.
> >
> > So which is it?
>
> Both Al Viro and myself have said "copy file range is not a generic
> method for copying data between two file descriptors". It is a
> targetted solution for *regular files only* on filesystems that store
> persistent data and can accelerate the data copy in some way (e.g.
> clone, server side offload, hardware offlead, etc). It is not
> intended as a copy mechanism for copying data from one random file
> descriptor to another.
>
> The use of it as a general file copy mechanism in the Go system
> library is incorrect and wrong. It is a userspace bug. Userspace
> has done the wrong thing, userspace needs to be fixed.

OK, we'll take it out.

I'll just make one last plea that I think that copy_file_range could
be much more useful if there were some way that a program could know
whether it would work or not. It's pretty unfortunate that we can't
use it in the Go standard library, or, indeed, in any general purpose
code, in any language, that is intended to support arbitrary file
names. To be pedantically clear, I'm not saying that copy_file_range
should work on all file systems. I'm only saying that on file systems
for which it doesn't work it should fail rather than silently
returning success without doing anything.

Ian

2021-02-12 23:17:09

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Fri, Feb 12, 2021 at 10:22:16AM +0200, Amir Goldstein wrote:
> On Fri, Feb 12, 2021 at 9:49 AM Greg KH <[email protected]> wrote:
> >
> > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote:
> > > Filesystems such as procfs and sysfs generate their content at
> > > runtime. This implies the file sizes do not usually match the
> > > amount of data that can be read from the file, and that seeking
> > > may not work as intended.
> > >
> > > This will be useful to disallow copy_file_range with input files
> > > from such filesystems.
> > >
> > > Signed-off-by: Nicolas Boichat <[email protected]>
> > > ---
> > > I first thought of adding a new field to struct file_operations,
> > > but that doesn't quite scale as every single file creation
> > > operation would need to be modified.
> >
> > Even so, you missed a load of filesystems in the kernel with this patch
> > series, what makes the ones you did mark here different from the
> > "internal" filesystems that you did not?
> >
> > This feels wrong, why is userspace suddenly breaking? What changed in
> > the kernel that caused this? Procfs has been around for a _very_ long
> > time :)
>
> That would be because of (v5.3):
>
> 5dae222a5ff0 vfs: allow copy_file_range to copy across devices
>
> The intention of this change (series) was to allow server side copy
> for nfs and cifs via copy_file_range().
> This is mostly work by Dave Chinner that I picked up following requests
> from the NFS folks.
>
> But the above change also includes this generic change:
>
> - /* this could be relaxed once a method supports cross-fs copies */
> - if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> - return -EXDEV;
> -

This isn't the problem.

The problem is that proc sets the file size to zero length, so
if you attempt to CFR from one proc file to another it will still
report "zero bytes copied" because the source file is zero length.

The other problem is that if the write fails, the generated data
from the /proc file gets thrown away - the splice code treats write
failures like a short read and hence the data sent to the failed
write is consumed and lost.

This has nothing to do with cross-fs cfr support - it's just one
mechanism that can be used to expose the problems that using CFR on
pipes that masquerade as regular files causes.

Userspace can't even tell that CFR failed incorrectly, because the
files that it returns immediate EOF on are zero length. Nor can
userspace know taht a short read tossed away data, because it might
actually be a short read rather than a write failure...

> Our option now are:
> - Restore the cross-fs restriction into generic_copy_file_range()
> - Explicitly opt-out of CFR per-fs and/or per-file as Nicolas' patch does

- Stop trying using cfr for things it was never intended for.

Anyone using cfr has to be prepared for it to fail and do the copy
manually themselves. If you can't tell from userspace if a file has
data in it without actually calling read(), then you can't use
copy_file_range() on it...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-02-12 23:29:40

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Fri, Feb 12, 2021 at 03:07:39PM -0800, Ian Lance Taylor wrote:
> On Fri, Feb 12, 2021 at 3:03 PM Dave Chinner <[email protected]> wrote:
> >
> > On Fri, Feb 12, 2021 at 04:45:41PM +0100, Greg KH wrote:
> > > On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote:
> > > > On Fri, Feb 12, 2021 at 12:38 AM Greg KH <[email protected]> wrote:
> > > > >
> > > > > Why are people trying to use copy_file_range on simple /proc and /sys
> > > > > files in the first place? They can not seek (well most can not), so
> > > > > that feels like a "oh look, a new syscall, let's use it everywhere!"
> > > > > problem that userspace should not do.
> > > >
> > > > This may have been covered elsewhere, but it's not that people are
> > > > saying "let's use copy_file_range on files in /proc." It's that the
> > > > Go language standard library provides an interface to operating system
> > > > files. When Go code uses the standard library function io.Copy to
> > > > copy the contents of one open file to another open file, then on Linux
> > > > kernels 5.3 and greater the Go standard library will use the
> > > > copy_file_range system call. That seems to be exactly what
> > > > copy_file_range is intended for. Unfortunately it appears that when
> > > > people writing Go code open a file in /proc and use io.Copy the
> > > > contents to another open file, copy_file_range does nothing and
> > > > reports success. There isn't anything on the copy_file_range man page
> > > > explaining this limitation, and there isn't any documented way to know
> > > > that the Go standard library should not use copy_file_range on certain
> > > > files.
> > >
> > > But, is this a bug in the kernel in that the syscall being made is not
> > > working properly, or a bug in that Go decided to do this for all types
> > > of files not knowing that some types of files can not handle this?
> > >
> > > If the kernel has always worked this way, I would say that Go is doing
> > > the wrong thing here. If the kernel used to work properly, and then
> > > changed, then it's a regression on the kernel side.
> > >
> > > So which is it?
> >
> > Both Al Viro and myself have said "copy file range is not a generic
> > method for copying data between two file descriptors". It is a
> > targetted solution for *regular files only* on filesystems that store
> > persistent data and can accelerate the data copy in some way (e.g.
> > clone, server side offload, hardware offlead, etc). It is not
> > intended as a copy mechanism for copying data from one random file
> > descriptor to another.
> >
> > The use of it as a general file copy mechanism in the Go system
> > library is incorrect and wrong. It is a userspace bug. Userspace
> > has done the wrong thing, userspace needs to be fixed.
>
> OK, we'll take it out.
>
> I'll just make one last plea that I think that copy_file_range could
> be much more useful if there were some way that a program could know
> whether it would work or not.

If you can't tell from userspace that a file has data in it other
than by calling read() on it, then you can't use cfr on it.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-02-12 23:57:21

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Sat, Feb 13, 2021 at 10:27:26AM +1100, Dave Chinner wrote:
> On Fri, Feb 12, 2021 at 03:07:39PM -0800, Ian Lance Taylor wrote:
> > On Fri, Feb 12, 2021 at 3:03 PM Dave Chinner <[email protected]> wrote:
> > >
> > > On Fri, Feb 12, 2021 at 04:45:41PM +0100, Greg KH wrote:
> > > > On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote:
> > > > > On Fri, Feb 12, 2021 at 12:38 AM Greg KH <[email protected]> wrote:
> > > > > >
> > > > > > Why are people trying to use copy_file_range on simple /proc and /sys
> > > > > > files in the first place? They can not seek (well most can not), so
> > > > > > that feels like a "oh look, a new syscall, let's use it everywhere!"
> > > > > > problem that userspace should not do.
> > > > >
> > > > > This may have been covered elsewhere, but it's not that people are
> > > > > saying "let's use copy_file_range on files in /proc." It's that the
> > > > > Go language standard library provides an interface to operating system
> > > > > files. When Go code uses the standard library function io.Copy to
> > > > > copy the contents of one open file to another open file, then on Linux
> > > > > kernels 5.3 and greater the Go standard library will use the
> > > > > copy_file_range system call. That seems to be exactly what
> > > > > copy_file_range is intended for. Unfortunately it appears that when
> > > > > people writing Go code open a file in /proc and use io.Copy the
> > > > > contents to another open file, copy_file_range does nothing and
> > > > > reports success. There isn't anything on the copy_file_range man page
> > > > > explaining this limitation, and there isn't any documented way to know
> > > > > that the Go standard library should not use copy_file_range on certain
> > > > > files.
> > > >
> > > > But, is this a bug in the kernel in that the syscall being made is not
> > > > working properly, or a bug in that Go decided to do this for all types
> > > > of files not knowing that some types of files can not handle this?
> > > >
> > > > If the kernel has always worked this way, I would say that Go is doing
> > > > the wrong thing here. If the kernel used to work properly, and then
> > > > changed, then it's a regression on the kernel side.
> > > >
> > > > So which is it?
> > >
> > > Both Al Viro and myself have said "copy file range is not a generic
> > > method for copying data between two file descriptors". It is a
> > > targetted solution for *regular files only* on filesystems that store
> > > persistent data and can accelerate the data copy in some way (e.g.
> > > clone, server side offload, hardware offlead, etc). It is not
> > > intended as a copy mechanism for copying data from one random file
> > > descriptor to another.
> > >
> > > The use of it as a general file copy mechanism in the Go system
> > > library is incorrect and wrong. It is a userspace bug. Userspace
> > > has done the wrong thing, userspace needs to be fixed.
> >
> > OK, we'll take it out.
> >
> > I'll just make one last plea that I think that copy_file_range could
> > be much more useful if there were some way that a program could know
> > whether it would work or not.

Well... we could always implement a CFR_DRYRUN flag that would run
through all the parameter validation and return 0 just before actually
starting any real copying logic. But that wouldn't itself solve the
problem that there are very old virtual filesystems in Linux that have
zero-length regular files that behave like a pipe.

> If you can't tell from userspace that a file has data in it other
> than by calling read() on it, then you can't use cfr on it.

I don't know how to do that, Dave. :)

Frankly I'm with the Go developers on this -- one should detect c_f_r by
calling it and if it errors out then fall back to the usual userspace
buffer copy strategy.

That still means we need to fix the kernel WRT these weird old
filesystems. One of...

1. Get rid of the generic fallback completely, since splice only copies
64k at a time and ... yay? I guess it at least passes generic/521 and
generic/522 these days.

2. Keep it, but change c_f_r to require that both files have a
->copy_file_range implementation. If they're the same then we'll call
the function pointer, if not, we call the generic fallback. This at
least gets us back to the usual behavior which is that filesystems have
to opt in to new functionality (== we assume they QA'd all the wunnerful
combinations).

3. #2, but fix the generic fallback to not suck so badly. That sounds
like someone (else's) 2yr project. :P

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2021-02-14 23:40:39

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 0/6] Add generated flag to filesystem struct to block copy_file_range

On Fri, Feb 12, 2021 at 12:43:59PM +0800, Nicolas Boichat wrote:
> We hit an issue when upgrading Go compiler from 1.13 to 1.15 [1],
> as we use Go's `io.Copy` to copy the content of
> `/sys/kernel/debug/tracing/trace` to a temporary file.
>
> Under the hood, Go 1.15 uses `copy_file_range` syscall to
> optimize the copy operation. However, that fails to copy any
> content when the input file is from tracefs, with an apparent
> size of 0 (but there is still content when you `cat` it, of
> course).
>
> >From discussions in [2][3], it is clear that copy_file_range
> cannot be properly implemented on filesystems where the content
> is generated at runtime: the file size is incorrect (because it
> is unknown before the content is generated), and seeking in such
> files (as required by partial writes) is unlikely to work
> correctly.
>
> With this patch, Go's `io.Copy` gracefully falls back to a normal
> read/write file copy.
>
> I'm not 100% sure which stable tree this should go in, I'd say
> at least >=5.3 since this is what introduced support for
> cross-filesystem copy_file_range (and where most users are
> somewhat likely to hit this issue). But let's discuss the patch
> series first.

No. This is *NOT* an fs-wide flag. Decision regarding the
usability of copy_file_range() is on per-file basis.

The real constraint is "can freely seek back and expect to
find consistent data". That is what's violated for seq_file.
And frankly, I would rather add a flag and have seq_open()
(and other suckers, if any) clear it. With check being
"has both FMODE_PREAD and this new flag".

2021-02-15 00:55:05

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Fri, Feb 12, 2021 at 03:54:48PM -0800, Darrick J. Wong wrote:
> On Sat, Feb 13, 2021 at 10:27:26AM +1100, Dave Chinner wrote:
> > On Fri, Feb 12, 2021 at 03:07:39PM -0800, Ian Lance Taylor wrote:
> > > On Fri, Feb 12, 2021 at 3:03 PM Dave Chinner <[email protected]> wrote:
> > > >
> > > > On Fri, Feb 12, 2021 at 04:45:41PM +0100, Greg KH wrote:
> > > > > On Fri, Feb 12, 2021 at 07:33:57AM -0800, Ian Lance Taylor wrote:
> > > > > > On Fri, Feb 12, 2021 at 12:38 AM Greg KH <[email protected]> wrote:
> > > > > > >
> > > > > > > Why are people trying to use copy_file_range on simple /proc and /sys
> > > > > > > files in the first place? They can not seek (well most can not), so
> > > > > > > that feels like a "oh look, a new syscall, let's use it everywhere!"
> > > > > > > problem that userspace should not do.
> > > > > >
> > > > > > This may have been covered elsewhere, but it's not that people are
> > > > > > saying "let's use copy_file_range on files in /proc." It's that the
> > > > > > Go language standard library provides an interface to operating system
> > > > > > files. When Go code uses the standard library function io.Copy to
> > > > > > copy the contents of one open file to another open file, then on Linux
> > > > > > kernels 5.3 and greater the Go standard library will use the
> > > > > > copy_file_range system call. That seems to be exactly what
> > > > > > copy_file_range is intended for. Unfortunately it appears that when
> > > > > > people writing Go code open a file in /proc and use io.Copy the
> > > > > > contents to another open file, copy_file_range does nothing and
> > > > > > reports success. There isn't anything on the copy_file_range man page
> > > > > > explaining this limitation, and there isn't any documented way to know
> > > > > > that the Go standard library should not use copy_file_range on certain
> > > > > > files.
> > > > >
> > > > > But, is this a bug in the kernel in that the syscall being made is not
> > > > > working properly, or a bug in that Go decided to do this for all types
> > > > > of files not knowing that some types of files can not handle this?
> > > > >
> > > > > If the kernel has always worked this way, I would say that Go is doing
> > > > > the wrong thing here. If the kernel used to work properly, and then
> > > > > changed, then it's a regression on the kernel side.
> > > > >
> > > > > So which is it?
> > > >
> > > > Both Al Viro and myself have said "copy file range is not a generic
> > > > method for copying data between two file descriptors". It is a
> > > > targetted solution for *regular files only* on filesystems that store
> > > > persistent data and can accelerate the data copy in some way (e.g.
> > > > clone, server side offload, hardware offlead, etc). It is not
> > > > intended as a copy mechanism for copying data from one random file
> > > > descriptor to another.
> > > >
> > > > The use of it as a general file copy mechanism in the Go system
> > > > library is incorrect and wrong. It is a userspace bug. Userspace
> > > > has done the wrong thing, userspace needs to be fixed.
> > >
> > > OK, we'll take it out.
> > >
> > > I'll just make one last plea that I think that copy_file_range could
> > > be much more useful if there were some way that a program could know
> > > whether it would work or not.
>
> Well... we could always implement a CFR_DRYRUN flag that would run
> through all the parameter validation and return 0 just before actually
> starting any real copying logic. But that wouldn't itself solve the
> problem that there are very old virtual filesystems in Linux that have
> zero-length regular files that behave like a pipe.
>
> > If you can't tell from userspace that a file has data in it other
> > than by calling read() on it, then you can't use cfr on it.
>
> I don't know how to do that, Dave. :)

If stat returns a non-zero size, then userspace knows it has at
least that much data in it, whether it be zeros or previously
written data. cfr will copy that data. The special zero length
regular pipe files fail this simple "how much data is there to copy
in this file" check...

> Frankly I'm with the Go developers on this -- one should detect c_f_r by
> calling it and if it errors out then fall back to the usual userspace
> buffer copy strategy.
>
> That still means we need to fix the kernel WRT these weird old
> filesystems. One of...

And that is the whole problem here, not that cfr is failing. cfr is
behaving correctly and consistently as the filesystem is telling the
kernel there is no data in the file (i.e. size = 0).

> 1. Get rid of the generic fallback completely, since splice only copies
> 64k at a time and ... yay? I guess it at least passes generic/521 and
> generic/522 these days.

I've had a few people ask me for cfr to not fall back to a manual
copy because they only want it to do something if it can accelerate
the copy to be faster than userspace can copy the data itself. If
the filesystem can't optimise the copy in some way, they want to
know so they can do something else of their own chosing.

Hence this seems like the sane option to take here...

> 2. Keep it, but change c_f_r to require that both files have a
> ->copy_file_range implementation. If they're the same then we'll call
> the function pointer, if not, we call the generic fallback. This at
> least gets us back to the usual behavior which is that filesystems have
> to opt in to new functionality (== we assume they QA'd all the wunnerful
> combinations).

That doesn't address the "write failure turns into short read"
problem with the splice path...

> 3. #2, but fix the generic fallback to not suck so badly. That sounds
> like someone (else's) 2yr project. :P

Not mine, either.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-02-15 01:16:09

by Ian Lance Taylor

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Sun, Feb 14, 2021 at 4:38 PM Dave Chinner <[email protected]> wrote:
>
> On Fri, Feb 12, 2021 at 03:54:48PM -0800, Darrick J. Wong wrote:
> > On Sat, Feb 13, 2021 at 10:27:26AM +1100, Dave Chinner wrote:
> >
> > > If you can't tell from userspace that a file has data in it other
> > > than by calling read() on it, then you can't use cfr on it.
> >
> > I don't know how to do that, Dave. :)
>
> If stat returns a non-zero size, then userspace knows it has at
> least that much data in it, whether it be zeros or previously
> written data. cfr will copy that data. The special zero length
> regular pipe files fail this simple "how much data is there to copy
> in this file" check...

This suggests that if the Go standard library sees that
copy_file_range reads zero bytes, we should assume that it failed, and
should use the fallback path as though copy_file_range returned
EOPNOTSUPP or EINVAL. This will cause an extra system call for an
empty file, but as long as copy_file_range does not return an error
for cases that it does not support we are going to need an extra
system call anyhow.

Does that seem like a possible mitigation? That is, are there cases
where copy_file_range will fail to do a correct copy, and will return
success, and will not return zero?

Ian

2021-02-15 01:29:27

by Nicolas Boichat

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Mon, Feb 15, 2021 at 9:12 AM Ian Lance Taylor <[email protected]> wrote:
>
> On Sun, Feb 14, 2021 at 4:38 PM Dave Chinner <[email protected]> wrote:
> >
> > On Fri, Feb 12, 2021 at 03:54:48PM -0800, Darrick J. Wong wrote:
> > > On Sat, Feb 13, 2021 at 10:27:26AM +1100, Dave Chinner wrote:
> > >
> > > > If you can't tell from userspace that a file has data in it other
> > > > than by calling read() on it, then you can't use cfr on it.
> > >
> > > I don't know how to do that, Dave. :)
> >
> > If stat returns a non-zero size, then userspace knows it has at
> > least that much data in it, whether it be zeros or previously
> > written data. cfr will copy that data. The special zero length
> > regular pipe files fail this simple "how much data is there to copy
> > in this file" check...
>
> This suggests that if the Go standard library sees that
> copy_file_range reads zero bytes, we should assume that it failed, and
> should use the fallback path as though copy_file_range returned
> EOPNOTSUPP or EINVAL. This will cause an extra system call for an
> empty file, but as long as copy_file_range does not return an error
> for cases that it does not support we are going to need an extra
> system call anyhow.
>
> Does that seem like a possible mitigation? That is, are there cases
> where copy_file_range will fail to do a correct copy, and will return
> success, and will not return zero?

I'm a bit worried about the sysfs files that report a 4096 bytes file
size, for 2 reasons:
- I'm not sure if the content _can_ actually be longer (I couldn't
find a counterexample)
- If you're unlucky enough to have a partial write in the output
filesystem, you'll get a non-zero return value and probably corrupted
content.

>
> Ian

2021-02-15 06:02:18

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Mon, Feb 15, 2021 at 3:27 AM Nicolas Boichat <[email protected]> wrote:
>
> On Mon, Feb 15, 2021 at 9:12 AM Ian Lance Taylor <[email protected]> wrote:
> >
> > On Sun, Feb 14, 2021 at 4:38 PM Dave Chinner <[email protected]> wrote:
> > >
> > > On Fri, Feb 12, 2021 at 03:54:48PM -0800, Darrick J. Wong wrote:
> > > > On Sat, Feb 13, 2021 at 10:27:26AM +1100, Dave Chinner wrote:
> > > >
> > > > > If you can't tell from userspace that a file has data in it other
> > > > > than by calling read() on it, then you can't use cfr on it.
> > > >
> > > > I don't know how to do that, Dave. :)
> > >
> > > If stat returns a non-zero size, then userspace knows it has at
> > > least that much data in it, whether it be zeros or previously
> > > written data. cfr will copy that data. The special zero length
> > > regular pipe files fail this simple "how much data is there to copy
> > > in this file" check...
> >
> > This suggests that if the Go standard library sees that
> > copy_file_range reads zero bytes, we should assume that it failed, and
> > should use the fallback path as though copy_file_range returned
> > EOPNOTSUPP or EINVAL. This will cause an extra system call for an
> > empty file, but as long as copy_file_range does not return an error
> > for cases that it does not support we are going to need an extra
> > system call anyhow.
> >
> > Does that seem like a possible mitigation? That is, are there cases
> > where copy_file_range will fail to do a correct copy, and will return
> > success, and will not return zero?
>
> I'm a bit worried about the sysfs files that report a 4096 bytes file
> size, for 2 reasons:
> - I'm not sure if the content _can_ actually be longer (I couldn't
> find a counterexample)
> - If you're unlucky enough to have a partial write in the output
> filesystem, you'll get a non-zero return value and probably corrupted
> content.
>

First of all, I am in favor of fixing the regression in the kernel caused
by the change of behavior in v5.3.

Having said that, I don't think it is a good idea to use ANY tool to copy
a zero size pseudo file.
rsync doesn't copy any data either if you try to use it to copy tracing into
a temp file.
I think it is perfectly sane for any tool to check file size before trying
to read/copy.

w.r.t. short write/short read, did you consider using the off_in/off_out
arguments? I *think* current kernel CFR should be safe to use as long
as the tool:
1. Checks file size before copy
2. Does not try to copy a range beyond EOF
3. Pass off_in/off_out args and verify that off_in/off_out advances
as expected

Thanks,
Amir.

2021-02-15 06:16:36

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Fri, Feb 12, 2021 at 2:40 PM Luis Henriques <[email protected]> wrote:
>
> Greg KH <[email protected]> writes:
>
> > On Fri, Feb 12, 2021 at 12:05:14PM +0000, Luis Henriques wrote:
> >> Greg KH <[email protected]> writes:
> >>
> >> > On Fri, Feb 12, 2021 at 10:22:16AM +0200, Amir Goldstein wrote:
> >> >> On Fri, Feb 12, 2021 at 9:49 AM Greg KH <[email protected]> wrote:
> >> >> >
> >> >> > On Fri, Feb 12, 2021 at 12:44:00PM +0800, Nicolas Boichat wrote:
> >> >> > > Filesystems such as procfs and sysfs generate their content at
> >> >> > > runtime. This implies the file sizes do not usually match the
> >> >> > > amount of data that can be read from the file, and that seeking
> >> >> > > may not work as intended.
> >> >> > >
> >> >> > > This will be useful to disallow copy_file_range with input files
> >> >> > > from such filesystems.
> >> >> > >
> >> >> > > Signed-off-by: Nicolas Boichat <[email protected]>
> >> >> > > ---
> >> >> > > I first thought of adding a new field to struct file_operations,
> >> >> > > but that doesn't quite scale as every single file creation
> >> >> > > operation would need to be modified.
> >> >> >
> >> >> > Even so, you missed a load of filesystems in the kernel with this patch
> >> >> > series, what makes the ones you did mark here different from the
> >> >> > "internal" filesystems that you did not?
> >> >> >
> >> >> > This feels wrong, why is userspace suddenly breaking? What changed in
> >> >> > the kernel that caused this? Procfs has been around for a _very_ long
> >> >> > time :)
> >> >>
> >> >> That would be because of (v5.3):
> >> >>
> >> >> 5dae222a5ff0 vfs: allow copy_file_range to copy across devices
> >> >>
> >> >> The intention of this change (series) was to allow server side copy
> >> >> for nfs and cifs via copy_file_range().
> >> >> This is mostly work by Dave Chinner that I picked up following requests
> >> >> from the NFS folks.
> >> >>
> >> >> But the above change also includes this generic change:
> >> >>
> >> >> - /* this could be relaxed once a method supports cross-fs copies */
> >> >> - if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> >> >> - return -EXDEV;
> >> >> -
> >> >>
> >> >> The change of behavior was documented in the commit message.
> >> >> It was also documented in:
> >> >>
> >> >> 88e75e2c5 copy_file_range.2: Kernel v5.3 updates
> >> >>
> >> >> I think our rationale for the generic change was:
> >> >> "Why not? What could go wrong? (TM)"
> >> >> I am not sure if any workload really gained something from this
> >> >> kernel cross-fs CFR.
> >> >
> >> > Why not put that check back?
> >> >
> >> >> In retrospect, I think it would have been safer to allow cross-fs CFR
> >> >> only to the filesystems that implement ->{copy,remap}_file_range()...
> >> >
> >> > Why not make this change? That seems easier and should fix this for
> >> > everyone, right?
> >> >
> >> >> Our option now are:
> >> >> - Restore the cross-fs restriction into generic_copy_file_range()
> >> >
> >> > Yes.
> >> >
> >>
> >> Restoring this restriction will actually change the current cephfs CFR
> >> behaviour. Since that commit we have allowed doing remote copies between
> >> different filesystems within the same ceph cluster. See commit
> >> 6fd4e6348352 ("ceph: allow object copies across different filesystems in
> >> the same cluster").
> >>
> >> Although I'm not aware of any current users for this scenario, the
> >> performance impact can actually be huge as it's the difference between
> >> asking the OSDs for copying a file and doing a full read+write on the
> >> client side.
> >
> > Regression in performance is ok if it fixes a regression for things that
> > used to work just fine in the past :)
> >
> > First rule, make it work.
>
> Sure, I just wanted to point out that *maybe* there are other options than
> simply reverting that commit :-)
>
> Something like the patch below (completely untested!) should revert to the
> old behaviour in filesystems that don't implement the CFR syscall.
>
> Cheers,
> --
> Luis
>
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 75f764b43418..bf5dccc43cc9 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
> file_out, pos_out,
> len, flags);
>
> - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> - flags);
> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> + return -EXDEV;
> + else
> + generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> + flags);
> }
>

Which kernel is this patch based on?

At this point, I am with Dave and Darrick on not falling back to
generic_copy_file_range() at all.

We do not have proof of any workload that benefits from it and the
above patch does not protect from a wierd use case of trying to copy a file
from sysfs to sysfs.

I am indecisive about what should be done with generic_copy_file_range()
called as fallback from within filesystems.

I think the wise choice is to not do the fallback in any case, but this is up
to the specific filesystem maintainers to decide.

Thanks,
Amir.

2021-02-15 08:37:55

by Greg Kroah-Hartman

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Mon, Feb 15, 2021 at 09:25:36AM +0800, Nicolas Boichat wrote:
> On Mon, Feb 15, 2021 at 9:12 AM Ian Lance Taylor <[email protected]> wrote:
> >
> > On Sun, Feb 14, 2021 at 4:38 PM Dave Chinner <[email protected]> wrote:
> > >
> > > On Fri, Feb 12, 2021 at 03:54:48PM -0800, Darrick J. Wong wrote:
> > > > On Sat, Feb 13, 2021 at 10:27:26AM +1100, Dave Chinner wrote:
> > > >
> > > > > If you can't tell from userspace that a file has data in it other
> > > > > than by calling read() on it, then you can't use cfr on it.
> > > >
> > > > I don't know how to do that, Dave. :)
> > >
> > > If stat returns a non-zero size, then userspace knows it has at
> > > least that much data in it, whether it be zeros or previously
> > > written data. cfr will copy that data. The special zero length
> > > regular pipe files fail this simple "how much data is there to copy
> > > in this file" check...
> >
> > This suggests that if the Go standard library sees that
> > copy_file_range reads zero bytes, we should assume that it failed, and
> > should use the fallback path as though copy_file_range returned
> > EOPNOTSUPP or EINVAL. This will cause an extra system call for an
> > empty file, but as long as copy_file_range does not return an error
> > for cases that it does not support we are going to need an extra
> > system call anyhow.
> >
> > Does that seem like a possible mitigation? That is, are there cases
> > where copy_file_range will fail to do a correct copy, and will return
> > success, and will not return zero?
>
> I'm a bit worried about the sysfs files that report a 4096 bytes file
> size, for 2 reasons:
> - I'm not sure if the content _can_ actually be longer (I couldn't
> find a counterexample)

There are quite a few, look for binary sysfs files that are pipes from
hardware/firmware resources to userspace. /sys/firmware/efi/efivars/
has a number of them if you want to play around with it.

thanks,

greg k-h

2021-02-15 10:41:14

by Luis Henriques

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

Amir Goldstein <[email protected]> writes:

> On Fri, Feb 12, 2021 at 2:40 PM Luis Henriques <[email protected]> wrote:
...
>> Sure, I just wanted to point out that *maybe* there are other options than
>> simply reverting that commit :-)
>>
>> Something like the patch below (completely untested!) should revert to the
>> old behaviour in filesystems that don't implement the CFR syscall.
>>
>> Cheers,
>> --
>> Luis
>>
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 75f764b43418..bf5dccc43cc9 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
>> file_out, pos_out,
>> len, flags);
>>
>> - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
>> - flags);
>> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
>> + return -EXDEV;
>> + else
>> + generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
>> + flags);
>> }
>>
>
> Which kernel is this patch based on?

It was v5.11-rc7.

> At this point, I am with Dave and Darrick on not falling back to
> generic_copy_file_range() at all.
>
> We do not have proof of any workload that benefits from it and the
> above patch does not protect from a wierd use case of trying to copy a file
> from sysfs to sysfs.
>

Ok, cool. I can post a new patch doing just that. I guess that function
do_copy_file_range() can be dropped in that case.

> I am indecisive about what should be done with generic_copy_file_range()
> called as fallback from within filesystems.
>
> I think the wise choice is to not do the fallback in any case, but this is up
> to the specific filesystem maintainers to decide.

I see what you mean. You're suggesting to have userspace handle all the
-EOPNOTSUPP and -EXDEV errors. Would you rather have a patch that also
removes all the calls to generic_copy_file_range() function? And that
function can also be deleted too, of course.

Cheers,
--
Luis

2021-02-15 12:29:11

by Luis Henriques

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

Luis Henriques <[email protected]> writes:

> Amir Goldstein <[email protected]> writes:
>
>> On Fri, Feb 12, 2021 at 2:40 PM Luis Henriques <[email protected]> wrote:
> ...
>>> Sure, I just wanted to point out that *maybe* there are other options than
>>> simply reverting that commit :-)
>>>
>>> Something like the patch below (completely untested!) should revert to the
>>> old behaviour in filesystems that don't implement the CFR syscall.
>>>
>>> Cheers,
>>> --
>>> Luis
>>>
>>> diff --git a/fs/read_write.c b/fs/read_write.c
>>> index 75f764b43418..bf5dccc43cc9 100644
>>> --- a/fs/read_write.c
>>> +++ b/fs/read_write.c
>>> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
>>> file_out, pos_out,
>>> len, flags);
>>>
>>> - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
>>> - flags);
>>> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
>>> + return -EXDEV;
>>> + else
>>> + generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
>>> + flags);
>>> }
>>>
>>
>> Which kernel is this patch based on?
>
> It was v5.11-rc7.
>
>> At this point, I am with Dave and Darrick on not falling back to
>> generic_copy_file_range() at all.
>>
>> We do not have proof of any workload that benefits from it and the
>> above patch does not protect from a wierd use case of trying to copy a file
>> from sysfs to sysfs.
>>
>
> Ok, cool. I can post a new patch doing just that. I guess that function
> do_copy_file_range() can be dropped in that case.
>
>> I am indecisive about what should be done with generic_copy_file_range()
>> called as fallback from within filesystems.
>>
>> I think the wise choice is to not do the fallback in any case, but this is up
>> to the specific filesystem maintainers to decide.
>
> I see what you mean. You're suggesting to have userspace handle all the
> -EOPNOTSUPP and -EXDEV errors. Would you rather have a patch that also
> removes all the calls to generic_copy_file_range() function? And that
> function can also be deleted too, of course.

Here's a first stab at this patch. Hopefully I didn't forgot anything
here. Let me know if you prefer the more conservative approach, i.e. not
touching any of the filesystems and let them use generic_copy_file_range.

Once everyone agrees on the final solution, I can follow-up with the
manpages update.

Cheers,
--
Luis

From e1b37e80b12601d56f792bd19377d3e5208188ef Mon Sep 17 00:00:00 2001
From: Luis Henriques <[email protected]>
Date: Fri, 12 Feb 2021 18:03:23 +0000
Subject: [PATCH] vfs: prevent copy_file_range to copy across devices

Nicolas Boichat reported an issue when trying to use the copy_file_range
syscall on a tracefs file. It failed silently because the file content is
generated on-the-fly (reporting a size of zero) and copy_file_range needs
to know in advance how much data is present.

This commit effectively reverts 5dae222a5ff0 ("vfs: allow copy_file_range to
copy across devices"). Now the copy is done only if the filesystems for source
and destination files are the same and they implement this syscall.

Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
Cc: Nicolas Boichat <[email protected]>
Signed-off-by: Luis Henriques <[email protected]>
---
fs/ceph/file.c | 21 +++------------
fs/cifs/cifsfs.c | 3 ---
fs/fuse/file.c | 21 +++------------
fs/nfs/nfs4file.c | 20 +++-----------
fs/read_write.c | 65 ++++++++--------------------------------------
include/linux/fs.h | 3 ---
6 files changed, 20 insertions(+), 113 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 209535d5b8d3..639bd7bfaea9 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -2261,9 +2261,9 @@ static ssize_t ceph_do_objects_copy(struct ceph_inode_info *src_ci, u64 *src_off
return bytes;
}

-static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
- struct file *dst_file, loff_t dst_off,
- size_t len, unsigned int flags)
+static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
+ struct file *dst_file, loff_t dst_off,
+ size_t len, unsigned int flags)
{
struct inode *src_inode = file_inode(src_file);
struct inode *dst_inode = file_inode(dst_file);
@@ -2456,21 +2456,6 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
return ret;
}

-static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
- struct file *dst_file, loff_t dst_off,
- size_t len, unsigned int flags)
-{
- ssize_t ret;
-
- ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
- len, flags);
-
- if (ret == -EOPNOTSUPP || ret == -EXDEV)
- ret = generic_copy_file_range(src_file, src_off, dst_file,
- dst_off, len, flags);
- return ret;
-}
-
const struct file_operations ceph_file_fops = {
.open = ceph_open,
.release = ceph_release,
diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
index e46da536ed33..8b869cc67443 100644
--- a/fs/cifs/cifsfs.c
+++ b/fs/cifs/cifsfs.c
@@ -1229,9 +1229,6 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off,
len, flags);
free_xid(xid);

- if (rc == -EOPNOTSUPP || rc == -EXDEV)
- rc = generic_copy_file_range(src_file, off, dst_file,
- destoff, len, flags);
return rc;
}

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 8cccecb55fb8..0dd703278e49 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -3330,9 +3330,9 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
return err;
}

-static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- size_t len, unsigned int flags)
+static ssize_t fuse_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t len, unsigned int flags)
{
struct fuse_file *ff_in = file_in->private_data;
struct fuse_file *ff_out = file_out->private_data;
@@ -3439,21 +3439,6 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
return err;
}

-static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
- struct file *dst_file, loff_t dst_off,
- size_t len, unsigned int flags)
-{
- ssize_t ret;
-
- ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
- len, flags);
-
- if (ret == -EOPNOTSUPP || ret == -EXDEV)
- ret = generic_copy_file_range(src_file, src_off, dst_file,
- dst_off, len, flags);
- return ret;
-}
-
static const struct file_operations fuse_file_operations = {
.llseek = fuse_file_llseek,
.read_iter = fuse_file_read_iter,
diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
index 57b3821d975a..60998209e310 100644
--- a/fs/nfs/nfs4file.c
+++ b/fs/nfs/nfs4file.c
@@ -133,9 +133,9 @@ nfs4_file_flush(struct file *file, fl_owner_t id)
}

#ifdef CONFIG_NFS_V4_2
-static ssize_t __nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- size_t count, unsigned int flags)
+static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
+ struct file *file_out, loff_t pos_out,
+ size_t count, unsigned int flags)
{
struct nfs42_copy_notify_res *cn_resp = NULL;
struct nl4_server *nss = NULL;
@@ -189,20 +189,6 @@ static ssize_t __nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
return ret;
}

-static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- size_t count, unsigned int flags)
-{
- ssize_t ret;
-
- ret = __nfs4_copy_file_range(file_in, pos_in, file_out, pos_out, count,
- flags);
- if (ret == -EOPNOTSUPP || ret == -EXDEV)
- ret = generic_copy_file_range(file_in, pos_in, file_out,
- pos_out, count, flags);
- return ret;
-}
-
static loff_t nfs4_file_llseek(struct file *filep, loff_t offset, int whence)
{
loff_t ret;
diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..87bf9efd7f71 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1358,58 +1358,6 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
}
#endif

-/**
- * generic_copy_file_range - copy data between two files
- * @file_in: file structure to read from
- * @pos_in: file offset to read from
- * @file_out: file structure to write data to
- * @pos_out: file offset to write data to
- * @len: amount of data to copy
- * @flags: copy flags
- *
- * This is a generic filesystem helper to copy data from one file to another.
- * It has no constraints on the source or destination file owners - the files
- * can belong to different superblocks and different filesystem types. Short
- * copies are allowed.
- *
- * This should be called from the @file_out filesystem, as per the
- * ->copy_file_range() method.
- *
- * Returns the number of bytes copied or a negative error indicating the
- * failure.
- */
-
-ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- size_t len, unsigned int flags)
-{
- return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
- len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
-}
-EXPORT_SYMBOL(generic_copy_file_range);
-
-static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- size_t len, unsigned int flags)
-{
- /*
- * Although we now allow filesystems to handle cross sb copy, passing
- * a file of the wrong filesystem type to filesystem driver can result
- * in an attempt to dereference the wrong type of ->private_data, so
- * avoid doing that until we really have a good reason. NFS defines
- * several different file_system_type structures, but they all end up
- * using the same ->copy_file_range() function pointer.
- */
- if (file_out->f_op->copy_file_range &&
- file_out->f_op->copy_file_range == file_in->f_op->copy_file_range)
- return file_out->f_op->copy_file_range(file_in, pos_in,
- file_out, pos_out,
- len, flags);
-
- return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
- flags);
-}
-
/*
* Performs necessary checks before doing a file copy
*
@@ -1474,6 +1422,14 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
{
ssize_t ret;

+ /*
+ * Allow the copy only if the filesystems for file_in and file_out are
+ * the same, and copy_file_range is implemented.
+ */
+ if (!file_out->f_op->copy_file_range ||
+ (file_out->f_op->copy_file_range != file_in->f_op->copy_file_range))
+ return -EXDEV;
+
if (flags != 0)
return -EINVAL;

@@ -1513,8 +1469,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
}
}

- ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len,
- flags);
+ ret = file_out->f_op->copy_file_range(file_in, pos_in,
+ file_out, pos_out,
+ len, flags);
WARN_ON_ONCE(ret == -EOPNOTSUPP);
done:
if (ret > 0) {
diff --git a/include/linux/fs.h b/include/linux/fs.h
index fd47deea7c17..3aaf627be409 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1910,9 +1910,6 @@ extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
loff_t, size_t, unsigned int);
-extern ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
- struct file *file_out, loff_t pos_out,
- size_t len, unsigned int flags);
extern int generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
struct file *file_out, loff_t pos_out,
loff_t *count,

2021-02-15 14:26:33

by Amir Goldstein

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

On Mon, Feb 15, 2021 at 2:21 PM Luis Henriques <[email protected]> wrote:
>
> Luis Henriques <[email protected]> writes:
>
> > Amir Goldstein <[email protected]> writes:
> >
> >> On Fri, Feb 12, 2021 at 2:40 PM Luis Henriques <[email protected]> wrote:
> > ...
> >>> Sure, I just wanted to point out that *maybe* there are other options than
> >>> simply reverting that commit :-)
> >>>
> >>> Something like the patch below (completely untested!) should revert to the
> >>> old behaviour in filesystems that don't implement the CFR syscall.
> >>>
> >>> Cheers,
> >>> --
> >>> Luis
> >>>
> >>> diff --git a/fs/read_write.c b/fs/read_write.c
> >>> index 75f764b43418..bf5dccc43cc9 100644
> >>> --- a/fs/read_write.c
> >>> +++ b/fs/read_write.c
> >>> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
> >>> file_out, pos_out,
> >>> len, flags);
> >>>
> >>> - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> >>> - flags);
> >>> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
> >>> + return -EXDEV;
> >>> + else
> >>> + generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> >>> + flags);
> >>> }
> >>>
> >>
> >> Which kernel is this patch based on?
> >
> > It was v5.11-rc7.
> >
> >> At this point, I am with Dave and Darrick on not falling back to
> >> generic_copy_file_range() at all.
> >>
> >> We do not have proof of any workload that benefits from it and the
> >> above patch does not protect from a wierd use case of trying to copy a file
> >> from sysfs to sysfs.
> >>
> >
> > Ok, cool. I can post a new patch doing just that. I guess that function
> > do_copy_file_range() can be dropped in that case.
> >
> >> I am indecisive about what should be done with generic_copy_file_range()
> >> called as fallback from within filesystems.
> >>
> >> I think the wise choice is to not do the fallback in any case, but this is up
> >> to the specific filesystem maintainers to decide.
> >
> > I see what you mean. You're suggesting to have userspace handle all the
> > -EOPNOTSUPP and -EXDEV errors. Would you rather have a patch that also
> > removes all the calls to generic_copy_file_range() function? And that
> > function can also be deleted too, of course.
>
> Here's a first stab at this patch. Hopefully I didn't forgot anything
> here. Let me know if you prefer the more conservative approach, i.e. not
> touching any of the filesystems and let them use generic_copy_file_range.
>

I'm fine with this one (modulu my comment below).
CC'ing fuse/cifs/nfs maintainers.
Though I don't think the FS maintainers should mind removing the fallback.
It was added by "us" (64bf5ff58dff "vfs: no fallback for ->copy_file_range()")

> Once everyone agrees on the final solution, I can follow-up with the
> manpages update.
>
> Cheers,
> --
> Luis
>
> From e1b37e80b12601d56f792bd19377d3e5208188ef Mon Sep 17 00:00:00 2001
> From: Luis Henriques <[email protected]>
> Date: Fri, 12 Feb 2021 18:03:23 +0000
> Subject: [PATCH] vfs: prevent copy_file_range to copy across devices
>
> Nicolas Boichat reported an issue when trying to use the copy_file_range
> syscall on a tracefs file. It failed silently because the file content is
> generated on-the-fly (reporting a size of zero) and copy_file_range needs
> to know in advance how much data is present.
>
> This commit effectively reverts 5dae222a5ff0 ("vfs: allow copy_file_range to
> copy across devices"). Now the copy is done only if the filesystems for source
> and destination files are the same and they implement this syscall.

This paragraph is not true and confusing.
This is not a revert and it does not revert the important part of cross-fs copy
for which the original commit was for.
Either rephrase this paragraph or remove it.

>
> Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")

Please add a Link: to this discussion in lore.

> Cc: Nicolas Boichat <[email protected]>
> Signed-off-by: Luis Henriques <[email protected]>
> ---
> fs/ceph/file.c | 21 +++------------
> fs/cifs/cifsfs.c | 3 ---
> fs/fuse/file.c | 21 +++------------
> fs/nfs/nfs4file.c | 20 +++-----------
> fs/read_write.c | 65 ++++++++--------------------------------------
> include/linux/fs.h | 3 ---
> 6 files changed, 20 insertions(+), 113 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 209535d5b8d3..639bd7bfaea9 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -2261,9 +2261,9 @@ static ssize_t ceph_do_objects_copy(struct ceph_inode_info *src_ci, u64 *src_off
> return bytes;
> }
>
> -static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> - struct file *dst_file, loff_t dst_off,
> - size_t len, unsigned int flags)
> +static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
> + struct file *dst_file, loff_t dst_off,
> + size_t len, unsigned int flags)
> {
> struct inode *src_inode = file_inode(src_file);
> struct inode *dst_inode = file_inode(dst_file);
> @@ -2456,21 +2456,6 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
> return ret;
> }
>
> -static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
> - struct file *dst_file, loff_t dst_off,
> - size_t len, unsigned int flags)
> -{
> - ssize_t ret;
> -
> - ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
> - len, flags);
> -
> - if (ret == -EOPNOTSUPP || ret == -EXDEV)
> - ret = generic_copy_file_range(src_file, src_off, dst_file,
> - dst_off, len, flags);
> - return ret;
> -}
> -
> const struct file_operations ceph_file_fops = {
> .open = ceph_open,
> .release = ceph_release,
> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
> index e46da536ed33..8b869cc67443 100644
> --- a/fs/cifs/cifsfs.c
> +++ b/fs/cifs/cifsfs.c
> @@ -1229,9 +1229,6 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off,
> len, flags);
> free_xid(xid);
>
> - if (rc == -EOPNOTSUPP || rc == -EXDEV)
> - rc = generic_copy_file_range(src_file, off, dst_file,
> - destoff, len, flags);
> return rc;
> }
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 8cccecb55fb8..0dd703278e49 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -3330,9 +3330,9 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
> return err;
> }
>
> -static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
> - struct file *file_out, loff_t pos_out,
> - size_t len, unsigned int flags)
> +static ssize_t fuse_copy_file_range(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + size_t len, unsigned int flags)
> {
> struct fuse_file *ff_in = file_in->private_data;
> struct fuse_file *ff_out = file_out->private_data;
> @@ -3439,21 +3439,6 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
> return err;
> }
>
> -static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
> - struct file *dst_file, loff_t dst_off,
> - size_t len, unsigned int flags)
> -{
> - ssize_t ret;
> -
> - ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
> - len, flags);
> -
> - if (ret == -EOPNOTSUPP || ret == -EXDEV)
> - ret = generic_copy_file_range(src_file, src_off, dst_file,
> - dst_off, len, flags);
> - return ret;
> -}
> -
> static const struct file_operations fuse_file_operations = {
> .llseek = fuse_file_llseek,
> .read_iter = fuse_file_read_iter,
> diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
> index 57b3821d975a..60998209e310 100644
> --- a/fs/nfs/nfs4file.c
> +++ b/fs/nfs/nfs4file.c
> @@ -133,9 +133,9 @@ nfs4_file_flush(struct file *file, fl_owner_t id)
> }
>
> #ifdef CONFIG_NFS_V4_2
> -static ssize_t __nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
> - struct file *file_out, loff_t pos_out,
> - size_t count, unsigned int flags)
> +static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
> + struct file *file_out, loff_t pos_out,
> + size_t count, unsigned int flags)
> {
> struct nfs42_copy_notify_res *cn_resp = NULL;
> struct nl4_server *nss = NULL;
> @@ -189,20 +189,6 @@ static ssize_t __nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
> return ret;
> }
>
> -static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
> - struct file *file_out, loff_t pos_out,
> - size_t count, unsigned int flags)
> -{
> - ssize_t ret;
> -
> - ret = __nfs4_copy_file_range(file_in, pos_in, file_out, pos_out, count,
> - flags);
> - if (ret == -EOPNOTSUPP || ret == -EXDEV)
> - ret = generic_copy_file_range(file_in, pos_in, file_out,
> - pos_out, count, flags);
> - return ret;
> -}
> -
> static loff_t nfs4_file_llseek(struct file *filep, loff_t offset, int whence)
> {
> loff_t ret;
> diff --git a/fs/read_write.c b/fs/read_write.c
> index 75f764b43418..87bf9efd7f71 100644
> --- a/fs/read_write.c
> +++ b/fs/read_write.c
> @@ -1358,58 +1358,6 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
> }
> #endif
>
> -/**
> - * generic_copy_file_range - copy data between two files
> - * @file_in: file structure to read from
> - * @pos_in: file offset to read from
> - * @file_out: file structure to write data to
> - * @pos_out: file offset to write data to
> - * @len: amount of data to copy
> - * @flags: copy flags
> - *
> - * This is a generic filesystem helper to copy data from one file to another.
> - * It has no constraints on the source or destination file owners - the files
> - * can belong to different superblocks and different filesystem types. Short
> - * copies are allowed.
> - *
> - * This should be called from the @file_out filesystem, as per the
> - * ->copy_file_range() method.
> - *
> - * Returns the number of bytes copied or a negative error indicating the
> - * failure.
> - */
> -
> -ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
> - struct file *file_out, loff_t pos_out,
> - size_t len, unsigned int flags)
> -{
> - return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
> - len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
> -}
> -EXPORT_SYMBOL(generic_copy_file_range);
> -
> -static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
> - struct file *file_out, loff_t pos_out,
> - size_t len, unsigned int flags)
> -{
> - /*
> - * Although we now allow filesystems to handle cross sb copy, passing
> - * a file of the wrong filesystem type to filesystem driver can result
> - * in an attempt to dereference the wrong type of ->private_data, so
> - * avoid doing that until we really have a good reason. NFS defines
> - * several different file_system_type structures, but they all end up
> - * using the same ->copy_file_range() function pointer.
> - */
> - if (file_out->f_op->copy_file_range &&
> - file_out->f_op->copy_file_range == file_in->f_op->copy_file_range)
> - return file_out->f_op->copy_file_range(file_in, pos_in,
> - file_out, pos_out,
> - len, flags);
> -
> - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> - flags);

Please do not remove the big comment above - it is there for a reason.

The case of !file_out->f_op->copy_file_range should return -EOPNOTSUPP
because the outcome of -EXDEV for copy to/from the same fs is confusing.

Either leave this helper and only remove generic_copy_file_range() or
open code the same logic (including comment) in the caller.

> -}
> -
> /*
> * Performs necessary checks before doing a file copy
> *
> @@ -1474,6 +1422,14 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> {
> ssize_t ret;
>
> + /*
> + * Allow the copy only if the filesystems for file_in and file_out are
> + * the same, and copy_file_range is implemented.
> + */

This comment is wrong. See the big fat comment above to understand why.
Also this check is in the wrong place because it misses the case of
file_in->f_op->remap_file_range && !file_out->f_op->copy_file_range

As I wrote above either leave the helper do_copy_file_range() or open
code it in its current location.

> + if (!file_out->f_op->copy_file_range ||
> + (file_out->f_op->copy_file_range != file_in->f_op->copy_file_range))
> + return -EXDEV;
> +
> if (flags != 0)
> return -EINVAL;
>
> @@ -1513,8 +1469,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
> }
> }
>
> - ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len,
> - flags);
> + ret = file_out->f_op->copy_file_range(file_in, pos_in,
> + file_out, pos_out,
> + len, flags);
> WARN_ON_ONCE(ret == -EOPNOTSUPP);

Please remove this line.
filesystem ops can now return -EOPNOTSUPP.

Thanks,
Amir.

2021-02-15 14:52:47

by Luis Henriques

[permalink] [raw]
Subject: Re: [PATCH 1/6] fs: Add flag to file_system_type to indicate content is generated

Amir Goldstein <[email protected]> writes:

> On Mon, Feb 15, 2021 at 2:21 PM Luis Henriques <[email protected]> wrote:
>>
>> Luis Henriques <[email protected]> writes:
>>
>> > Amir Goldstein <[email protected]> writes:
>> >
>> >> On Fri, Feb 12, 2021 at 2:40 PM Luis Henriques <[email protected]> wrote:
>> > ...
>> >>> Sure, I just wanted to point out that *maybe* there are other options than
>> >>> simply reverting that commit :-)
>> >>>
>> >>> Something like the patch below (completely untested!) should revert to the
>> >>> old behaviour in filesystems that don't implement the CFR syscall.
>> >>>
>> >>> Cheers,
>> >>> --
>> >>> Luis
>> >>>
>> >>> diff --git a/fs/read_write.c b/fs/read_write.c
>> >>> index 75f764b43418..bf5dccc43cc9 100644
>> >>> --- a/fs/read_write.c
>> >>> +++ b/fs/read_write.c
>> >>> @@ -1406,8 +1406,11 @@ static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
>> >>> file_out, pos_out,
>> >>> len, flags);
>> >>>
>> >>> - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
>> >>> - flags);
>> >>> + if (file_inode(file_in)->i_sb != file_inode(file_out)->i_sb)
>> >>> + return -EXDEV;
>> >>> + else
>> >>> + generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
>> >>> + flags);
>> >>> }
>> >>>
>> >>
>> >> Which kernel is this patch based on?
>> >
>> > It was v5.11-rc7.
>> >
>> >> At this point, I am with Dave and Darrick on not falling back to
>> >> generic_copy_file_range() at all.
>> >>
>> >> We do not have proof of any workload that benefits from it and the
>> >> above patch does not protect from a wierd use case of trying to copy a file
>> >> from sysfs to sysfs.
>> >>
>> >
>> > Ok, cool. I can post a new patch doing just that. I guess that function
>> > do_copy_file_range() can be dropped in that case.
>> >
>> >> I am indecisive about what should be done with generic_copy_file_range()
>> >> called as fallback from within filesystems.
>> >>
>> >> I think the wise choice is to not do the fallback in any case, but this is up
>> >> to the specific filesystem maintainers to decide.
>> >
>> > I see what you mean. You're suggesting to have userspace handle all the
>> > -EOPNOTSUPP and -EXDEV errors. Would you rather have a patch that also
>> > removes all the calls to generic_copy_file_range() function? And that
>> > function can also be deleted too, of course.
>>
>> Here's a first stab at this patch. Hopefully I didn't forgot anything
>> here. Let me know if you prefer the more conservative approach, i.e. not
>> touching any of the filesystems and let them use generic_copy_file_range.
>>
>
> I'm fine with this one (modulu my comment below).
> CC'ing fuse/cifs/nfs maintainers.
> Though I don't think the FS maintainers should mind removing the fallback.
> It was added by "us" (64bf5ff58dff "vfs: no fallback for ->copy_file_range()")

Thanks for your review, Amir. I'll be posting v2 shortly.

Cheers,
--
Luis


>> Once everyone agrees on the final solution, I can follow-up with the
>> manpages update.
>>
>> Cheers,
>> --
>> Luis
>>
>> From e1b37e80b12601d56f792bd19377d3e5208188ef Mon Sep 17 00:00:00 2001
>> From: Luis Henriques <[email protected]>
>> Date: Fri, 12 Feb 2021 18:03:23 +0000
>> Subject: [PATCH] vfs: prevent copy_file_range to copy across devices
>>
>> Nicolas Boichat reported an issue when trying to use the copy_file_range
>> syscall on a tracefs file. It failed silently because the file content is
>> generated on-the-fly (reporting a size of zero) and copy_file_range needs
>> to know in advance how much data is present.
>>
>> This commit effectively reverts 5dae222a5ff0 ("vfs: allow copy_file_range to
>> copy across devices"). Now the copy is done only if the filesystems for source
>> and destination files are the same and they implement this syscall.
>
> This paragraph is not true and confusing.
> This is not a revert and it does not revert the important part of cross-fs copy
> for which the original commit was for.
> Either rephrase this paragraph or remove it.
>
>>
>> Fixes: 5dae222a5ff0 ("vfs: allow copy_file_range to copy across devices")
>
> Please add a Link: to this discussion in lore.
>
>> Cc: Nicolas Boichat <[email protected]>
>> Signed-off-by: Luis Henriques <[email protected]>
>> ---
>> fs/ceph/file.c | 21 +++------------
>> fs/cifs/cifsfs.c | 3 ---
>> fs/fuse/file.c | 21 +++------------
>> fs/nfs/nfs4file.c | 20 +++-----------
>> fs/read_write.c | 65 ++++++++--------------------------------------
>> include/linux/fs.h | 3 ---
>> 6 files changed, 20 insertions(+), 113 deletions(-)
>>
>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>> index 209535d5b8d3..639bd7bfaea9 100644
>> --- a/fs/ceph/file.c
>> +++ b/fs/ceph/file.c
>> @@ -2261,9 +2261,9 @@ static ssize_t ceph_do_objects_copy(struct ceph_inode_info *src_ci, u64 *src_off
>> return bytes;
>> }
>>
>> -static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> - struct file *dst_file, loff_t dst_off,
>> - size_t len, unsigned int flags)
>> +static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> + struct file *dst_file, loff_t dst_off,
>> + size_t len, unsigned int flags)
>> {
>> struct inode *src_inode = file_inode(src_file);
>> struct inode *dst_inode = file_inode(dst_file);
>> @@ -2456,21 +2456,6 @@ static ssize_t __ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> return ret;
>> }
>>
>> -static ssize_t ceph_copy_file_range(struct file *src_file, loff_t src_off,
>> - struct file *dst_file, loff_t dst_off,
>> - size_t len, unsigned int flags)
>> -{
>> - ssize_t ret;
>> -
>> - ret = __ceph_copy_file_range(src_file, src_off, dst_file, dst_off,
>> - len, flags);
>> -
>> - if (ret == -EOPNOTSUPP || ret == -EXDEV)
>> - ret = generic_copy_file_range(src_file, src_off, dst_file,
>> - dst_off, len, flags);
>> - return ret;
>> -}
>> -
>> const struct file_operations ceph_file_fops = {
>> .open = ceph_open,
>> .release = ceph_release,
>> diff --git a/fs/cifs/cifsfs.c b/fs/cifs/cifsfs.c
>> index e46da536ed33..8b869cc67443 100644
>> --- a/fs/cifs/cifsfs.c
>> +++ b/fs/cifs/cifsfs.c
>> @@ -1229,9 +1229,6 @@ static ssize_t cifs_copy_file_range(struct file *src_file, loff_t off,
>> len, flags);
>> free_xid(xid);
>>
>> - if (rc == -EOPNOTSUPP || rc == -EXDEV)
>> - rc = generic_copy_file_range(src_file, off, dst_file,
>> - destoff, len, flags);
>> return rc;
>> }
>>
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index 8cccecb55fb8..0dd703278e49 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -3330,9 +3330,9 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
>> return err;
>> }
>>
>> -static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
>> - struct file *file_out, loff_t pos_out,
>> - size_t len, unsigned int flags)
>> +static ssize_t fuse_copy_file_range(struct file *file_in, loff_t pos_in,
>> + struct file *file_out, loff_t pos_out,
>> + size_t len, unsigned int flags)
>> {
>> struct fuse_file *ff_in = file_in->private_data;
>> struct fuse_file *ff_out = file_out->private_data;
>> @@ -3439,21 +3439,6 @@ static ssize_t __fuse_copy_file_range(struct file *file_in, loff_t pos_in,
>> return err;
>> }
>>
>> -static ssize_t fuse_copy_file_range(struct file *src_file, loff_t src_off,
>> - struct file *dst_file, loff_t dst_off,
>> - size_t len, unsigned int flags)
>> -{
>> - ssize_t ret;
>> -
>> - ret = __fuse_copy_file_range(src_file, src_off, dst_file, dst_off,
>> - len, flags);
>> -
>> - if (ret == -EOPNOTSUPP || ret == -EXDEV)
>> - ret = generic_copy_file_range(src_file, src_off, dst_file,
>> - dst_off, len, flags);
>> - return ret;
>> -}
>> -
>> static const struct file_operations fuse_file_operations = {
>> .llseek = fuse_file_llseek,
>> .read_iter = fuse_file_read_iter,
>> diff --git a/fs/nfs/nfs4file.c b/fs/nfs/nfs4file.c
>> index 57b3821d975a..60998209e310 100644
>> --- a/fs/nfs/nfs4file.c
>> +++ b/fs/nfs/nfs4file.c
>> @@ -133,9 +133,9 @@ nfs4_file_flush(struct file *file, fl_owner_t id)
>> }
>>
>> #ifdef CONFIG_NFS_V4_2
>> -static ssize_t __nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
>> - struct file *file_out, loff_t pos_out,
>> - size_t count, unsigned int flags)
>> +static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
>> + struct file *file_out, loff_t pos_out,
>> + size_t count, unsigned int flags)
>> {
>> struct nfs42_copy_notify_res *cn_resp = NULL;
>> struct nl4_server *nss = NULL;
>> @@ -189,20 +189,6 @@ static ssize_t __nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
>> return ret;
>> }
>>
>> -static ssize_t nfs4_copy_file_range(struct file *file_in, loff_t pos_in,
>> - struct file *file_out, loff_t pos_out,
>> - size_t count, unsigned int flags)
>> -{
>> - ssize_t ret;
>> -
>> - ret = __nfs4_copy_file_range(file_in, pos_in, file_out, pos_out, count,
>> - flags);
>> - if (ret == -EOPNOTSUPP || ret == -EXDEV)
>> - ret = generic_copy_file_range(file_in, pos_in, file_out,
>> - pos_out, count, flags);
>> - return ret;
>> -}
>> -
>> static loff_t nfs4_file_llseek(struct file *filep, loff_t offset, int whence)
>> {
>> loff_t ret;
>> diff --git a/fs/read_write.c b/fs/read_write.c
>> index 75f764b43418..87bf9efd7f71 100644
>> --- a/fs/read_write.c
>> +++ b/fs/read_write.c
>> @@ -1358,58 +1358,6 @@ COMPAT_SYSCALL_DEFINE4(sendfile64, int, out_fd, int, in_fd,
>> }
>> #endif
>>
>> -/**
>> - * generic_copy_file_range - copy data between two files
>> - * @file_in: file structure to read from
>> - * @pos_in: file offset to read from
>> - * @file_out: file structure to write data to
>> - * @pos_out: file offset to write data to
>> - * @len: amount of data to copy
>> - * @flags: copy flags
>> - *
>> - * This is a generic filesystem helper to copy data from one file to another.
>> - * It has no constraints on the source or destination file owners - the files
>> - * can belong to different superblocks and different filesystem types. Short
>> - * copies are allowed.
>> - *
>> - * This should be called from the @file_out filesystem, as per the
>> - * ->copy_file_range() method.
>> - *
>> - * Returns the number of bytes copied or a negative error indicating the
>> - * failure.
>> - */
>> -
>> -ssize_t generic_copy_file_range(struct file *file_in, loff_t pos_in,
>> - struct file *file_out, loff_t pos_out,
>> - size_t len, unsigned int flags)
>> -{
>> - return do_splice_direct(file_in, &pos_in, file_out, &pos_out,
>> - len > MAX_RW_COUNT ? MAX_RW_COUNT : len, 0);
>> -}
>> -EXPORT_SYMBOL(generic_copy_file_range);
>> -
>> -static ssize_t do_copy_file_range(struct file *file_in, loff_t pos_in,
>> - struct file *file_out, loff_t pos_out,
>> - size_t len, unsigned int flags)
>> -{
>> - /*
>> - * Although we now allow filesystems to handle cross sb copy, passing
>> - * a file of the wrong filesystem type to filesystem driver can result
>> - * in an attempt to dereference the wrong type of ->private_data, so
>> - * avoid doing that until we really have a good reason. NFS defines
>> - * several different file_system_type structures, but they all end up
>> - * using the same ->copy_file_range() function pointer.
>> - */
>> - if (file_out->f_op->copy_file_range &&
>> - file_out->f_op->copy_file_range == file_in->f_op->copy_file_range)
>> - return file_out->f_op->copy_file_range(file_in, pos_in,
>> - file_out, pos_out,
>> - len, flags);
>> -
>> - return generic_copy_file_range(file_in, pos_in, file_out, pos_out, len,
>> - flags);
>
> Please do not remove the big comment above - it is there for a reason.
>
> The case of !file_out->f_op->copy_file_range should return -EOPNOTSUPP
> because the outcome of -EXDEV for copy to/from the same fs is confusing.
>
> Either leave this helper and only remove generic_copy_file_range() or
> open code the same logic (including comment) in the caller.
>
>> -}
>> -
>> /*
>> * Performs necessary checks before doing a file copy
>> *
>> @@ -1474,6 +1422,14 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>> {
>> ssize_t ret;
>>
>> + /*
>> + * Allow the copy only if the filesystems for file_in and file_out are
>> + * the same, and copy_file_range is implemented.
>> + */
>
> This comment is wrong. See the big fat comment above to understand why.
> Also this check is in the wrong place because it misses the case of
> file_in->f_op->remap_file_range && !file_out->f_op->copy_file_range
>
> As I wrote above either leave the helper do_copy_file_range() or open
> code it in its current location.
>
>> + if (!file_out->f_op->copy_file_range ||
>> + (file_out->f_op->copy_file_range != file_in->f_op->copy_file_range))
>> + return -EXDEV;
>> +
>> if (flags != 0)
>> return -EINVAL;
>>
>> @@ -1513,8 +1469,9 @@ ssize_t vfs_copy_file_range(struct file *file_in, loff_t pos_in,
>> }
>> }
>>
>> - ret = do_copy_file_range(file_in, pos_in, file_out, pos_out, len,
>> - flags);
>> + ret = file_out->f_op->copy_file_range(file_in, pos_in,
>> + file_out, pos_out,
>> + len, flags);
>> WARN_ON_ONCE(ret == -EOPNOTSUPP);
>
> Please remove this line.
> filesystem ops can now return -EOPNOTSUPP.
>
> Thanks,
> Amir.