2021-01-27 00:21:03

by Nicolas Boichat

[permalink] [raw]
Subject: [PATCH] fs: generic_copy_file_checks: Do not adjust count based on file size

copy_file_range (which calls generic_copy_file_checks) uses the
inode file size to adjust the copy count parameter. This breaks
with special filesystems like procfs/sysfs, where the file size
appears to be zero, but content is actually returned when a read
operation is performed.

This commit ignores the source file size, and makes copy_file_range
match the end of file behaviour documented in POSIX's "read",
where 0 is returned to mark EOF. This would allow "cp" and other
standard tools to make use of copy_file_range with the exact same
behaviour as they had in the past.

Fixes: 96e6e8f4a68d ("vfs: add missing checks to copy_file_range")
Signed-off-by: Nicolas Boichat <[email protected]>
---
This can be reproduced with this simple test case:
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

int
main(int argc, char **argv)
{
int fd_in, fd_out;
loff_t ret;

fd_in = open("/proc/version", O_RDONLY);
fd_out = open("version", O_CREAT | O_WRONLY | O_TRUNC, 0644);

do {
ret = copy_file_range(fd_in, NULL, fd_out, NULL, 1024, 0);
printf("%d bytes copied\n", (int)ret);
} while (ret > 0);

return 0;
}

Without this patch, `version` output file is empty, and no bytes
are copied:
0 bytes copied

With this patch, the loop runs twice and the content of the file
is copied:
315 bytes copied
0 bytes copied

We hit this issue when upgrading Go compiler from 1.13 to 1.15 [1],
as we use Go's `io.Copy` to copy the content of
`/sys/kernel/debug/tracing/trace` to a temporary file.

Under the hood, Go 1.15 uses `copy_file_range` syscall to optimize the
copy operation. However, that fails to copy any content when the input
file is from sysfs/tracefs, with an apparent size of 0 (but there is
still content when you `cat` it, of course).

[1] http://issuetracker.google.com/issues/178332739

fs/read_write.c | 8 --------
1 file changed, 8 deletions(-)

diff --git a/fs/read_write.c b/fs/read_write.c
index 75f764b43418..7236146f6ad7 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1424,7 +1424,6 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
struct inode *inode_in = file_inode(file_in);
struct inode *inode_out = file_inode(file_out);
uint64_t count = *req_count;
- loff_t size_in;
int ret;

ret = generic_file_rw_checks(file_in, file_out);
@@ -1442,13 +1441,6 @@ static int generic_copy_file_checks(struct file *file_in, loff_t pos_in,
if (pos_in + count < pos_in || pos_out + count < pos_out)
return -EOVERFLOW;

- /* Shorten the copy to EOF */
- size_in = i_size_read(inode_in);
- if (pos_in >= size_in)
- count = 0;
- else
- count = min(count, size_in - (uint64_t)pos_in);
-
ret = generic_write_check_limits(file_out, pos_out, &count);
if (ret)
return ret;
--
2.30.0.280.ga3ce27912f-goog


2021-01-27 20:03:29

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH] fs: generic_copy_file_checks: Do not adjust count based on file size

On Tue, Jan 26, 2021 at 01:50:22PM +0800, Nicolas Boichat wrote:
> copy_file_range (which calls generic_copy_file_checks) uses the
> inode file size to adjust the copy count parameter. This breaks
> with special filesystems like procfs/sysfs, where the file size
> appears to be zero, but content is actually returned when a read
> operation is performed.
>
> This commit ignores the source file size, and makes copy_file_range
> match the end of file behaviour documented in POSIX's "read",
> where 0 is returned to mark EOF. This would allow "cp" and other
> standard tools to make use of copy_file_range with the exact same
> behaviour as they had in the past.

Proper fix is _not_ to use copy_file_range(2) in cp(1) - it's
really not universal and its implementation will *NOT* do the right
thing for most of procfs files. Sequential read(2) (or splice(2),
for that matter) will give you consistent output; copy_file_range(2)
will not.

What copy_file_range() does is splice from input to internal pipe
+ splice from that internal pipe to output, done in a loop. HOWEVER,
a short write to output will discard the contents of the internal pipe
and pretend we had a short _read_ instead. With ->f_pos updated
accordingly, so that on the next call we'd start from the place
right after everything we'd copied. However, that will result in
implicit seek on the next call of seq_read(), with no promise whatsoever
that you won't end up with consistent records in your copy.
Short copy on read(2) will leave the rest of the record in buffer,
so the next read(2) will pick that first.

IOW, don't expect copy_file_range() to produce usable copies
for those. Doesn't work. splice to explicit pipe + splice from that
pipe (without flushing the sucker) works, and if you really want
zero-copy you could use that.

Again, use of copy_file_range(2) in cp(1) is a userland bug.
Don't do it.

2021-01-27 20:04:21

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH] fs: generic_copy_file_checks: Do not adjust count based on file size

On Tue, Jan 26, 2021 at 01:50:22PM +0800, Nicolas Boichat wrote:
> copy_file_range (which calls generic_copy_file_checks) uses the
> inode file size to adjust the copy count parameter. This breaks
> with special filesystems like procfs/sysfs, where the file size
> appears to be zero, but content is actually returned when a read
> operation is performed.
>
> This commit ignores the source file size, and makes copy_file_range
> match the end of file behaviour documented in POSIX's "read",
> where 0 is returned to mark EOF. This would allow "cp" and other
> standard tools to make use of copy_file_range with the exact same
> behaviour as they had in the past.
>
> Fixes: 96e6e8f4a68d ("vfs: add missing checks to copy_file_range")
> Signed-off-by: Nicolas Boichat <[email protected]>

Nack.

As I've explained, this is intentional and bypassing it is not a
work around for enabling cfr on filesystems that produce ephemeral,
volatile read-once data using seq-file pipes that masquerade as
regular files with zero size. These files are behaving like pipes
and only work because the VFS has to support read() and friends from
pipes that don't publish the amount of data they contain to the VFS
inode.

copy_file_range() does not support such behaviour.

copy_file_range() -writes- data, so we have to check that those
writes do not extend past boundaries that the destination inode
imposes on the operation. e.g. maximum offset limits, whether the
ranges overlap in the same file, etc.

Hence we need to know how much data there is present to copy before
we can check if it is safe to perform the -write- of the data we are
going to read. Hence we cannot safely support data sources that
cannot tell us how much data is present before we start the copy
operation.

IOWs, these source file EOF restrictions are required by the write
side of copy_file_range(), not the read side.

> ---
> This can be reproduced with this simple test case:
> #define _GNU_SOURCE
> #include <fcntl.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <sys/stat.h>
> #include <unistd.h>
>
> int
> main(int argc, char **argv)
> {
> int fd_in, fd_out;
> loff_t ret;
>
> fd_in = open("/proc/version", O_RDONLY);
> fd_out = open("version", O_CREAT | O_WRONLY | O_TRUNC, 0644);
>
> do {
> ret = copy_file_range(fd_in, NULL, fd_out, NULL, 1024, 0);
> printf("%d bytes copied\n", (int)ret);
> } while (ret > 0);
>
> return 0;
> }
>
> Without this patch, `version` output file is empty, and no bytes
> are copied:
> 0 bytes copied

$ ls -l /proc/version
-r--r--r-- 1 root root 0 Jan 20 17:25 /proc/version
$

It's a zero length file.

sysfs does this just fine - it's regular files have a size of
at least PAGE_SIZE rather than zero, and so copy_file_range works
just fine on them:

$ ls -l /sys/block/nvme0n1/capability
-r--r--r-- 1 root root 4096 Jan 27 08:41 /sys/block/nvme0n1/capability
$ cat /sys/block/nvme0n1/capability
50
$ xfs_io -f -c "copy_range -s 0 -d 0 -l 4096 /sys/block/nvme0n1/capability" /tmp/foo
$ sudo cat /tmp/foo
50

And the behaviour is exactly as you'd expect a read() loop to copy
the file to behave:

openat(AT_FDCWD, "/tmp/foo", O_RDWR|O_CREAT, 0600) = 3
....
openat(AT_FDCWD, "/sys/block/nvme0n1/capability", O_RDONLY) = 4
copy_file_range(4, [0], 3, [0], 4096, 0) = 3
copy_file_range(4, [3], 3, [3], 4093, 0) = 0
close(4)

See? Inode size of 4096 means there's a maximum of 4kB of data that
can be read from this file. copy_file_range() now behaves exactly
as read() would, returning a short copy and then 0 bytes to indicate
EOF.

If you want ephemeral data pipes masquerading as regular files to
work with copy_file_range, then the filesystem implementation needs
to provide the VFS with a data size that indicates the maximum
amount of data that the pipe can produce in a continuous read loop.
Otherwise we cannot validate the range of the write we may be asked
to perform...

> Under the hood, Go 1.15 uses `copy_file_range` syscall to optimize the
> copy operation. However, that fails to copy any content when the input
> file is from sysfs/tracefs, with an apparent size of 0 (but there is
> still content when you `cat` it, of course).

Libraries using copy_file_range() must be prepared for it to fail
and fall back to normal copy mechanisms. Of course, with these
special zero length files that contain ephemeral data, userspace can't
actually tell that they contain data from userspace using stat(). So
as far as userspace is concerned, copy_file_range() correctly
returned zero bytes copied from a zero byte long file and there's
nothing more to do.

This zero length file behaviour is, fundamentally, a kernel
filesystem implementation bug, not a copy_file_range() bug.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-01-28 00:52:42

by Nicolas Boichat

[permalink] [raw]
Subject: Re: [PATCH] fs: generic_copy_file_checks: Do not adjust count based on file size

On Wed, Jan 27, 2021 at 7:38 AM Dave Chinner <[email protected]> wrote:
>
> On Tue, Jan 26, 2021 at 01:50:22PM +0800, Nicolas Boichat wrote:
> > copy_file_range (which calls generic_copy_file_checks) uses the
> > inode file size to adjust the copy count parameter. This breaks
> > with special filesystems like procfs/sysfs, where the file size
> > appears to be zero, but content is actually returned when a read
> > operation is performed.
> >
> > This commit ignores the source file size, and makes copy_file_range
> > match the end of file behaviour documented in POSIX's "read",
> > where 0 is returned to mark EOF. This would allow "cp" and other
> > standard tools to make use of copy_file_range with the exact same
> > behaviour as they had in the past.
> >
> > Fixes: 96e6e8f4a68d ("vfs: add missing checks to copy_file_range")
> > Signed-off-by: Nicolas Boichat <[email protected]>
>
> Nack.

Thanks Dave and Al for the detailed explanations.

>
> As I've explained, this is intentional and bypassing it is not a
> work around for enabling cfr on filesystems that produce ephemeral,
> volatile read-once data using seq-file pipes that masquerade as
> regular files with zero size. These files are behaving like pipes
> and only work because the VFS has to support read() and friends from
> pipes that don't publish the amount of data they contain to the VFS
> inode.
>
> copy_file_range() does not support such behaviour.
>
> copy_file_range() -writes- data, so we have to check that those
> writes do not extend past boundaries that the destination inode
> imposes on the operation. e.g. maximum offset limits, whether the
> ranges overlap in the same file, etc.
>
> Hence we need to know how much data there is present to copy before
> we can check if it is safe to perform the -write- of the data we are
> going to read. Hence we cannot safely support data sources that
> cannot tell us how much data is present before we start the copy
> operation.
>
> IOWs, these source file EOF restrictions are required by the write
> side of copy_file_range(), not the read side.
>
> > ---
> > This can be reproduced with this simple test case:
> > #define _GNU_SOURCE
> > #include <fcntl.h>
> > #include <stdio.h>
> > #include <stdlib.h>
> > #include <sys/stat.h>
> > #include <unistd.h>
> >
> > int
> > main(int argc, char **argv)
> > {
> > int fd_in, fd_out;
> > loff_t ret;
> >
> > fd_in = open("/proc/version", O_RDONLY);
> > fd_out = open("version", O_CREAT | O_WRONLY | O_TRUNC, 0644);
> >
> > do {
> > ret = copy_file_range(fd_in, NULL, fd_out, NULL, 1024, 0);
> > printf("%d bytes copied\n", (int)ret);
> > } while (ret > 0);
> >
> > return 0;
> > }
> >
> > Without this patch, `version` output file is empty, and no bytes
> > are copied:
> > 0 bytes copied
>
> $ ls -l /proc/version
> -r--r--r-- 1 root root 0 Jan 20 17:25 /proc/version
> $
>
> It's a zero length file.
>
> sysfs does this just fine - it's regular files have a size of
> at least PAGE_SIZE rather than zero, and so copy_file_range works
> just fine on them:
>
> $ ls -l /sys/block/nvme0n1/capability
> -r--r--r-- 1 root root 4096 Jan 27 08:41 /sys/block/nvme0n1/capability
> $ cat /sys/block/nvme0n1/capability
> 50
> $ xfs_io -f -c "copy_range -s 0 -d 0 -l 4096 /sys/block/nvme0n1/capability" /tmp/foo
> $ sudo cat /tmp/foo
> 50
>
> And the behaviour is exactly as you'd expect a read() loop to copy
> the file to behave:
>
> openat(AT_FDCWD, "/tmp/foo", O_RDWR|O_CREAT, 0600) = 3
> ....
> openat(AT_FDCWD, "/sys/block/nvme0n1/capability", O_RDONLY) = 4
> copy_file_range(4, [0], 3, [0], 4096, 0) = 3
> copy_file_range(4, [3], 3, [3], 4093, 0) = 0
> close(4)
>
> See? Inode size of 4096 means there's a maximum of 4kB of data that
> can be read from this file. copy_file_range() now behaves exactly
> as read() would, returning a short copy and then 0 bytes to indicate
> EOF.

Unless the content happens to be larger than PAGE_SIZE, then
copy_file_range would only copy the beginning of the file. And as Al
explained, this will still break in case of short writes.

>
> If you want ephemeral data pipes masquerading as regular files to
> work with copy_file_range, then the filesystem implementation needs
> to provide the VFS with a data size that indicates the maximum
> amount of data that the pipe can produce in a continuous read loop.
> Otherwise we cannot validate the range of the write we may be asked
> to perform...
>
> > Under the hood, Go 1.15 uses `copy_file_range` syscall to optimize the
> > copy operation. However, that fails to copy any content when the input
> > file is from sysfs/tracefs, with an apparent size of 0 (but there is
> > still content when you `cat` it, of course).
>
> Libraries using copy_file_range() must be prepared for it to fail
> and fall back to normal copy mechanisms.

How is userspace suppose to detect that? (checking for 0 file size
won't work with the example above)

> Of course, with these
> special zero length files that contain ephemeral data, userspace can't
> actually tell that they contain data from userspace using stat(). So
> as far as userspace is concerned, copy_file_range() correctly
> returned zero bytes copied from a zero byte long file and there's
> nothing more to do.
>
> This zero length file behaviour is, fundamentally, a kernel
> filesystem implementation bug, not a copy_file_range() bug.

Okay, so, based on this and Al's reply, I see 2 things we can do:
1. Go should probably not use copy_file_range in a common library
function, as I don't see any easy way to detect this scenario
currently (detect 0 size? sure, but that won't work with the example
you provide above). And the man page should document this behaviour
more explicitly to prevent further incorrect usage.
2. Can procfs/sysfs/debugfs and friends explicitly prevent usage of
copy_file_range? (based on Al's reply, there seems to be no way to
implement it correctly as seeking in such files will not work in case
of short writes)

Thanks,

>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2021-01-28 06:00:53

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH] fs: generic_copy_file_checks: Do not adjust count based on file size

On Thu, Jan 28, 2021 at 08:46:04AM +0800, Nicolas Boichat wrote:
> On Wed, Jan 27, 2021 at 7:38 AM Dave Chinner <[email protected]> wrote:
> >
> > On Tue, Jan 26, 2021 at 01:50:22PM +0800, Nicolas Boichat wrote:
> > > copy_file_range (which calls generic_copy_file_checks) uses the
> > > inode file size to adjust the copy count parameter. This breaks
> > > with special filesystems like procfs/sysfs, where the file size
> > > appears to be zero, but content is actually returned when a read
> > > operation is performed.
> > >
> > > This commit ignores the source file size, and makes copy_file_range
> > > match the end of file behaviour documented in POSIX's "read",
> > > where 0 is returned to mark EOF. This would allow "cp" and other
> > > standard tools to make use of copy_file_range with the exact same
> > > behaviour as they had in the past.
> > >
> > > Fixes: 96e6e8f4a68d ("vfs: add missing checks to copy_file_range")
> > > Signed-off-by: Nicolas Boichat <[email protected]>
> >
> > Nack.
>
> Thanks Dave and Al for the detailed explanations.
>
> >
> > As I've explained, this is intentional and bypassing it is not a
> > work around for enabling cfr on filesystems that produce ephemeral,
> > volatile read-once data using seq-file pipes that masquerade as
> > regular files with zero size. These files are behaving like pipes
> > and only work because the VFS has to support read() and friends from
> > pipes that don't publish the amount of data they contain to the VFS
> > inode.
> >
> > copy_file_range() does not support such behaviour.
> >
> > copy_file_range() -writes- data, so we have to check that those
> > writes do not extend past boundaries that the destination inode
> > imposes on the operation. e.g. maximum offset limits, whether the
> > ranges overlap in the same file, etc.
> >
> > Hence we need to know how much data there is present to copy before
> > we can check if it is safe to perform the -write- of the data we are
> > going to read. Hence we cannot safely support data sources that
> > cannot tell us how much data is present before we start the copy
> > operation.
> >
> > IOWs, these source file EOF restrictions are required by the write
> > side of copy_file_range(), not the read side.
> >
> > > ---
> > > This can be reproduced with this simple test case:
> > > #define _GNU_SOURCE
> > > #include <fcntl.h>
> > > #include <stdio.h>
> > > #include <stdlib.h>
> > > #include <sys/stat.h>
> > > #include <unistd.h>
> > >
> > > int
> > > main(int argc, char **argv)
> > > {
> > > int fd_in, fd_out;
> > > loff_t ret;
> > >
> > > fd_in = open("/proc/version", O_RDONLY);
> > > fd_out = open("version", O_CREAT | O_WRONLY | O_TRUNC, 0644);
> > >
> > > do {
> > > ret = copy_file_range(fd_in, NULL, fd_out, NULL, 1024, 0);
> > > printf("%d bytes copied\n", (int)ret);
> > > } while (ret > 0);
> > >
> > > return 0;
> > > }
> > >
> > > Without this patch, `version` output file is empty, and no bytes
> > > are copied:
> > > 0 bytes copied
> >
> > $ ls -l /proc/version
> > -r--r--r-- 1 root root 0 Jan 20 17:25 /proc/version
> > $
> >
> > It's a zero length file.
> >
> > sysfs does this just fine - it's regular files have a size of
> > at least PAGE_SIZE rather than zero, and so copy_file_range works
> > just fine on them:
> >
> > $ ls -l /sys/block/nvme0n1/capability
> > -r--r--r-- 1 root root 4096 Jan 27 08:41 /sys/block/nvme0n1/capability
> > $ cat /sys/block/nvme0n1/capability
> > 50
> > $ xfs_io -f -c "copy_range -s 0 -d 0 -l 4096 /sys/block/nvme0n1/capability" /tmp/foo
> > $ sudo cat /tmp/foo
> > 50
> >
> > And the behaviour is exactly as you'd expect a read() loop to copy
> > the file to behave:
> >
> > openat(AT_FDCWD, "/tmp/foo", O_RDWR|O_CREAT, 0600) = 3
> > ....
> > openat(AT_FDCWD, "/sys/block/nvme0n1/capability", O_RDONLY) = 4
> > copy_file_range(4, [0], 3, [0], 4096, 0) = 3
> > copy_file_range(4, [3], 3, [3], 4093, 0) = 0
> > close(4)
> >
> > See? Inode size of 4096 means there's a maximum of 4kB of data that
> > can be read from this file. copy_file_range() now behaves exactly
> > as read() would, returning a short copy and then 0 bytes to indicate
> > EOF.
>
> Unless the content happens to be larger than PAGE_SIZE, then
> copy_file_range would only copy the beginning of the file. And as Al
> explained, this will still break in case of short writes.
>
> >
> > If you want ephemeral data pipes masquerading as regular files to
> > work with copy_file_range, then the filesystem implementation needs
> > to provide the VFS with a data size that indicates the maximum
> > amount of data that the pipe can produce in a continuous read loop.
> > Otherwise we cannot validate the range of the write we may be asked
> > to perform...
> >
> > > Under the hood, Go 1.15 uses `copy_file_range` syscall to optimize the
> > > copy operation. However, that fails to copy any content when the input
> > > file is from sysfs/tracefs, with an apparent size of 0 (but there is
> > > still content when you `cat` it, of course).
> >
> > Libraries using copy_file_range() must be prepared for it to fail
> > and fall back to normal copy mechanisms.
>
> How is userspace suppose to detect that? (checking for 0 file size
> won't work with the example above)
>
> > Of course, with these
> > special zero length files that contain ephemeral data, userspace can't
> > actually tell that they contain data from userspace using stat(). So
> > as far as userspace is concerned, copy_file_range() correctly
> > returned zero bytes copied from a zero byte long file and there's
> > nothing more to do.
> >
> > This zero length file behaviour is, fundamentally, a kernel
> > filesystem implementation bug, not a copy_file_range() bug.
>
> Okay, so, based on this and Al's reply, I see 2 things we can do:
> 1. Go should probably not use copy_file_range in a common library
> function, as I don't see any easy way to detect this scenario
> currently (detect 0 size? sure, but that won't work with the example
> you provide above). And the man page should document this behaviour
> more explicitly to prevent further incorrect usage.
> 2. Can procfs/sysfs/debugfs and friends explicitly prevent usage of
> copy_file_range? (based on Al's reply, there seems to be no way to
> implement it correctly as seeking in such files will not work in case
> of short writes)

One /could/ make those three provide a phony CFR implementation that
would return -EOPNOTSUPP, though like others have said, it's weird to
have regular files that aren't quite regular. Not sure where that
leaves them, though...

--D

>
> Thanks,
>
> >
> > Cheers,
> >
> > Dave.
> > --
> > Dave Chinner
> > [email protected]

2021-02-12 04:51:01

by Nicolas Boichat

[permalink] [raw]
Subject: Re: [PATCH] fs: generic_copy_file_checks: Do not adjust count based on file size

On Thu, Jan 28, 2021 at 1:57 PM Darrick J. Wong <[email protected]> wrote:
>
> On Thu, Jan 28, 2021 at 08:46:04AM +0800, Nicolas Boichat wrote:
[snip]
> > Okay, so, based on this and Al's reply, I see 2 things we can do:
> > 1. Go should probably not use copy_file_range in a common library
> > function, as I don't see any easy way to detect this scenario
> > currently (detect 0 size? sure, but that won't work with the example
> > you provide above). And the man page should document this behaviour
> > more explicitly to prevent further incorrect usage.
> > 2. Can procfs/sysfs/debugfs and friends explicitly prevent usage of
> > copy_file_range? (based on Al's reply, there seems to be no way to
> > implement it correctly as seeking in such files will not work in case
> > of short writes)
>
> One /could/ make those three provide a phony CFR implementation that
> would return -EOPNOTSUPP, though like others have said, it's weird to
> have regular files that aren't quite regular. Not sure where that
> leaves them, though...

Not that simple, as the issue happens on cross-filesystem operations
where file_operations->copy_file_range is not called (and also, that'd
require modifying operations for every single generated file...

Anyway, made an attempt here:
https://lore.kernel.org/linux-fsdevel/[email protected]/T/#t

> --D
>
> >
> > Thanks,
> >
> > >
> > > Cheers,
> > >
> > > Dave.
> > > --
> > > Dave Chinner
> > > [email protected]