Good day, Hugh-
I noticed that several fsx-related tests in the xfstests suite are
failing after updating my NFS server to v5.18-rc1. I normally test
against xfs, ext4, btrfs, and tmpfs exports. tmpfs is the only export
that sees these new failures:
generic/075 2s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/075.out.bad)
--- tests/generic/075.out 2014-02-13 15:40:45.000000000 -0500
+++ /home/cel/src/xfstests/results//generic/075.out.bad 2022-04-05 16:39:59.145991520 -0400
@@ -4,15 +4,5 @@
-----------------------------------------------
fsx.0 : -d -N numops -S 0
-----------------------------------------------
-
------------------------------------------------
-fsx.1 : -d -N numops -S 0 -x
------------------------------------------------
...
(Run 'diff -u /home/cel/src/xfstests/tests/generic/075.out /home/cel/src/xfstests/results//generic/075.out.bad' to see the entire diff)
generic/091 9s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/091.out.bad)
--- tests/generic/091.out 2014-02-13 15:40:45.000000000 -0500
+++ /home/cel/src/xfstests/results//generic/091.out.bad 2022-04-05 16:41:24.329063277 -0400
@@ -1,7 +1,75 @@
QA output created by 091
fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W
...
(Run 'diff -u /home/cel/src/xfstests/tests/generic/091.out /home/cel/src/xfstests/results//generic/091.out.bad' to see the entire diff)
generic/112 2s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/112.out.bad)
--- tests/generic/112.out 2014-02-13 15:40:45.000000000 -0500
+++ /home/cel/src/xfstests/results//generic/112.out.bad 2022-04-05 16:41:38.511075170 -0400
@@ -4,15 +4,4 @@
-----------------------------------------------
fsx.0 : -A -d -N numops -S 0
-----------------------------------------------
-
------------------------------------------------
-fsx.1 : -A -d -N numops -S 0 -x
------------------------------------------------
...
(Run 'diff -u /home/cel/src/xfstests/tests/generic/112.out /home/cel/src/xfstests/results//generic/112.out.bad' to see the entire diff)
generic/127 49s ... - output mismatch (see /home/cel/src/xfstests/results//generic/127.out.bad)
--- tests/generic/127.out 2016-08-28 12:16:20.000000000 -0400
+++ /home/cel/src/xfstests/results//generic/127.out.bad 2022-04-05 16:42:07.655099652 -0400
@@ -4,10 +4,198 @@
=== FSX Light Mode, Memory Mapping ===
All 100000 operations completed A-OK!
=== FSX Standard Mode, No Memory Mapping ===
-All 100000 operations completed A-OK!
+ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
+READ BAD DATA: offset = 0x9cb7, size = 0xfae3, fname = /tmp/mnt/manet.ib-2323703/fsx_std_nommap
+OFFSET GOOD BAD RANGE
...
(Run 'diff -u /home/cel/src/xfstests/tests/generic/127.out /home/cel/src/xfstests/results//generic/127.out.bad' to see the entire diff)
I bisected the problem to:
56a8c8eb1eaf ("tmpfs: do not allocate pages on read")
generic/075 fails almost immediately without any NFS-level errors.
Likely this is data corruption rather than an overt I/O error.
--
Chuck Lever
On Wed, 6 Apr 2022, Chuck Lever III wrote:
> Good day, Hugh-
Huh! If you were really wishing me a good day, would you tell me this ;-?
>
> I noticed that several fsx-related tests in the xfstests suite are
> failing after updating my NFS server to v5.18-rc1. I normally test
> against xfs, ext4, btrfs, and tmpfs exports. tmpfs is the only export
> that sees these new failures:
>
> generic/075 2s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/075.out.bad)
> --- tests/generic/075.out 2014-02-13 15:40:45.000000000 -0500
> +++ /home/cel/src/xfstests/results//generic/075.out.bad 2022-04-05 16:39:59.145991520 -0400
> @@ -4,15 +4,5 @@
> -----------------------------------------------
> fsx.0 : -d -N numops -S 0
> -----------------------------------------------
> -
> ------------------------------------------------
> -fsx.1 : -d -N numops -S 0 -x
> ------------------------------------------------
> ...
> (Run 'diff -u /home/cel/src/xfstests/tests/generic/075.out /home/cel/src/xfstests/results//generic/075.out.bad' to see the entire diff)
>
> generic/091 9s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/091.out.bad)
> --- tests/generic/091.out 2014-02-13 15:40:45.000000000 -0500
> +++ /home/cel/src/xfstests/results//generic/091.out.bad 2022-04-05 16:41:24.329063277 -0400
> @@ -1,7 +1,75 @@
> QA output created by 091
> fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
> -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
> -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
> -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
> -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
> -fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W
> ...
> (Run 'diff -u /home/cel/src/xfstests/tests/generic/091.out /home/cel/src/xfstests/results//generic/091.out.bad' to see the entire diff)
>
> generic/112 2s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/112.out.bad)
> --- tests/generic/112.out 2014-02-13 15:40:45.000000000 -0500
> +++ /home/cel/src/xfstests/results//generic/112.out.bad 2022-04-05 16:41:38.511075170 -0400
> @@ -4,15 +4,4 @@
> -----------------------------------------------
> fsx.0 : -A -d -N numops -S 0
> -----------------------------------------------
> -
> ------------------------------------------------
> -fsx.1 : -A -d -N numops -S 0 -x
> ------------------------------------------------
> ...
> (Run 'diff -u /home/cel/src/xfstests/tests/generic/112.out /home/cel/src/xfstests/results//generic/112.out.bad' to see the entire diff)
>
> generic/127 49s ... - output mismatch (see /home/cel/src/xfstests/results//generic/127.out.bad)
> --- tests/generic/127.out 2016-08-28 12:16:20.000000000 -0400
> +++ /home/cel/src/xfstests/results//generic/127.out.bad 2022-04-05 16:42:07.655099652 -0400
> @@ -4,10 +4,198 @@
> === FSX Light Mode, Memory Mapping ===
> All 100000 operations completed A-OK!
> === FSX Standard Mode, No Memory Mapping ===
> -All 100000 operations completed A-OK!
> +ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
> +READ BAD DATA: offset = 0x9cb7, size = 0xfae3, fname = /tmp/mnt/manet.ib-2323703/fsx_std_nommap
> +OFFSET GOOD BAD RANGE
> ...
> (Run 'diff -u /home/cel/src/xfstests/tests/generic/127.out /home/cel/src/xfstests/results//generic/127.out.bad' to see the entire diff)
>
> I bisected the problem to:
>
> 56a8c8eb1eaf ("tmpfs: do not allocate pages on read")
>
> generic/075 fails almost immediately without any NFS-level errors.
> Likely this is data corruption rather than an overt I/O error.
That's sad. Thanks for bisecting and reporting. Sorry for the nuisance.
I suspect this patch is heading for a revert, because I shall not have
time to debug and investigate. Cc'ing fsdevel and a few people who have
an interest in it, to warn of that likely upcoming revert.
But if it's okay with everyone, please may we leave it in for -rc2?
Given that having it in -rc1 already smoked out another issue (problem
of SetPageUptodate(ZERO_PAGE(0)) without CONFIG_MMU), I think keeping
it in a little longer might smoke out even more.
The xfstests info above doesn't actually tell very much, beyond that
generic/075 generic/091 generic/112 generic/127, each a test with fsx,
all fall at their first hurdle. If you have time, please rerun and
tar up the results/generic directory (maybe filter just those failing)
and send as attachment. But don't go to any trouble, it's unlikely
that I shall even untar it - it would be mainly to go on record if
anyone has time to look into it later. And, frankly, it's unlikely
to tell us anything more enlightening, than that the data seen was
not as expected: which we do already know.
I've had no problem with xfstests generic 075,091,112,127 testing
tmpfs here, not before and not in the month or two I've had that
patch in: so it's something in the way that NFS exercises tmpfs
that reveals it. If I had time to duplicate your procedure, I'd be
asking for detailed instructions: but no, I won't have a chance.
But I can sit here and try to guess. I notice fs/nfsd checks
file->f_op->splice_read, and employs fallback if not available:
if you have time, please try rerunning those xfstests on an -rc1
kernel, but with mm/shmem.c's .splice_read line commented out.
My guess is that will then pass the tests, and we shall know more.
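Spelling that out, the experiment is just a one-liner along these lines
(a sketch against 5.18-rc1, where I believe shmem_file_operations still
points .splice_read at generic_file_splice_read; with it left NULL,
nfsd takes its non-splice read fallback):

--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ static const struct file_operations shmem_file_operations = {
-	.splice_read	= generic_file_splice_read,
+	/* .splice_read	= generic_file_splice_read, */
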
What could be going wrong there? I've thought of two possibilities.
A minor, hopefully easily fixed, issue would be if fs/nfsd has
trouble with seeing the same page twice in a row: since tmpfs is
now using the ZERO_PAGE(0) for all pages of a hole, and I think I
caught sight of code which looks to see if the latest page is the
same as the one before. It's easy to imagine that might go wrong.
A more difficult issue would be, if fsx is racing writes and reads,
in a way that it can guarantee the correct result, but that correct
result is no longer delivered: because the writes go into freshly
allocated tmpfs cache pages, while reads are still delivering
stale ZERO_PAGEs from the pipe. I'm hazy on the guarantees there.
But unless someone has time to help out, we're heading for a revert.
Thanks,
Hugh
On Thu, 7 Apr 2022 at 01:19, Hugh Dickins <[email protected]> wrote:
>
> On Wed, 6 Apr 2022, Chuck Lever III wrote:
>
> > Good day, Hugh-
>
> Huh! If you were really wishing me a good day, would you tell me this ;-?
>
> >
> > I noticed that several fsx-related tests in the xfstests suite are
> > failing after updating my NFS server to v5.18-rc1. I normally test
> > against xfs, ext4, btrfs, and tmpfs exports. tmpfs is the only export
> > that sees these new failures:
> >
[...]
> > generic/075 fails almost immediately without any NFS-level errors.
> > Likely this is data corruption rather than an overt I/O error.
>
> That's sad. Thanks for bisecting and reporting. Sorry for the nuisance.
>
> I suspect this patch is heading for a revert, because I shall not have
> time to debug and investigate. Cc'ing fsdevel and a few people who have
> an interest in it, to warn of that likely upcoming revert.
>
> But if it's okay with everyone, please may we leave it in for -rc2?
> Given that having it in -rc1 already smoked out another issue (problem
> of SetPageUptodate(ZERO_PAGE(0)) without CONFIG_MMU), I think keeping
> it in a little longer might smoke out even more.
>
> The xfstests info above doesn't actually tell very much, beyond that
> generic/075 generic/091 generic/112 generic/127, each a test with fsx,
> all fall at their first hurdle. If you have time, please rerun and
> tar up the results/generic directory (maybe filter just those failing)
> and send as attachment. But don't go to any trouble, it's unlikely
> that I shall even untar it - it would be mainly to go on record if
> anyone has time to look into it later. And, frankly, it's unlikely
> to tell us anything more enlightening, than that the data seen was
> not as expected: which we do already know.
>
> I've had no problem with xfstests generic 075,091,112,127 testing
> tmpfs here, not before and not in the month or two I've had that
> patch in: so it's something in the way that NFS exercises tmpfs
> that reveals it. If I had time to duplicate your procedure, I'd be
> asking for detailed instructions: but no, I won't have a chance.
>
> But I can sit here and try to guess. I notice fs/nfsd checks
> file->f_op->splice_read, and employs fallback if not available:
> if you have time, please try rerunning those xfstests on an -rc1
> kernel, but with mm/shmem.c's .splice_read line commented out.
> My guess is that will then pass the tests, and we shall know more.
>
> What could be going wrong there? I've thought of two possibilities.
> A minor, hopefully easily fixed, issue would be if fs/nfsd has
> trouble with seeing the same page twice in a row: since tmpfs is
> now using the ZERO_PAGE(0) for all pages of a hole, and I think I
> caught sight of code which looks to see if the latest page is the
> same as the one before. It's easy to imagine that might go wrong.
When I worked at Veritas, data corruption over NFS was hit when
sending the same page in succession. This was triggered via VxFS's
shared page cache, after a file had been dedup'ed.
I can't remember all the details (~15yrs ago), but the core issue was
skb_can_coalesce() returning a false-positive for the 'same page' case
(no check for crossing a page boundary).
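For reference, the helper in recent kernels looks roughly like this
(include/linux/skbuff.h; the VxFS-era code differed, so take it only as
a pointer to where the 'same page' assumption lives today):

static inline bool skb_can_coalesce(struct sk_buff *skb, int i,
				    const struct page *page, int off)
{
	if (skb_zcopy(skb))
		return false;
	if (i) {
		const skb_frag_t *frag = &skb_shinfo(skb)->frags[i - 1];

		/* coalesce only if same page and the offset continues on */
		return page == skb_frag_page(frag) &&
		       off == skb_frag_off(frag) + skb_frag_size(frag);
	}
	return false;
}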
> A more difficult issue would be, if fsx is racing writes and reads,
> in a way that it can guarantee the correct result, but that correct
> result is no longer delivered: because the writes go into freshly
> allocated tmpfs cache pages, while reads are still delivering
> stale ZERO_PAGEs from the pipe. I'm hazy on the guarantees there.
>
> But unless someone has time to help out, we're heading for a revert.
>
> Thanks,
> Hugh
Cheers,
Mark
> On Apr 6, 2022, at 8:18 PM, Hugh Dickins <[email protected]> wrote:
>
> On Wed, 6 Apr 2022, Chuck Lever III wrote:
>
>> Good day, Hugh-
>
> Huh! If you were really wishing me a good day, would you tell me this ;-?
>
>>
>> I noticed that several fsx-related tests in the xfstests suite are
>> failing after updating my NFS server to v5.18-rc1. I normally test
>> against xfs, ext4, btrfs, and tmpfs exports. tmpfs is the only export
>> that sees these new failures:
>>
>> generic/075 2s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/075.out.bad)
>> --- tests/generic/075.out 2014-02-13 15:40:45.000000000 -0500
>> +++ /home/cel/src/xfstests/results//generic/075.out.bad 2022-04-05 16:39:59.145991520 -0400
>> @@ -4,15 +4,5 @@
>> -----------------------------------------------
>> fsx.0 : -d -N numops -S 0
>> -----------------------------------------------
>> -
>> ------------------------------------------------
>> -fsx.1 : -d -N numops -S 0 -x
>> ------------------------------------------------
>> ...
>> (Run 'diff -u /home/cel/src/xfstests/tests/generic/075.out /home/cel/src/xfstests/results//generic/075.out.bad' to see the entire diff)
>>
>> generic/091 9s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/091.out.bad)
>> --- tests/generic/091.out 2014-02-13 15:40:45.000000000 -0500
>> +++ /home/cel/src/xfstests/results//generic/091.out.bad 2022-04-05 16:41:24.329063277 -0400
>> @@ -1,7 +1,75 @@
>> QA output created by 091
>> fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
>> -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
>> -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
>> -fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
>> -fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
>> -fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W
>> ...
>> (Run 'diff -u /home/cel/src/xfstests/tests/generic/091.out /home/cel/src/xfstests/results//generic/091.out.bad' to see the entire diff)
>>
>> generic/112 2s ... [failed, exit status 1]- output mismatch (see /home/cel/src/xfstests/results//generic/112.out.bad)
>> --- tests/generic/112.out 2014-02-13 15:40:45.000000000 -0500
>> +++ /home/cel/src/xfstests/results//generic/112.out.bad 2022-04-05 16:41:38.511075170 -0400
>> @@ -4,15 +4,4 @@
>> -----------------------------------------------
>> fsx.0 : -A -d -N numops -S 0
>> -----------------------------------------------
>> -
>> ------------------------------------------------
>> -fsx.1 : -A -d -N numops -S 0 -x
>> ------------------------------------------------
>> ...
>> (Run 'diff -u /home/cel/src/xfstests/tests/generic/112.out /home/cel/src/xfstests/results//generic/112.out.bad' to see the entire diff)
>>
>> generic/127 49s ... - output mismatch (see /home/cel/src/xfstests/results//generic/127.out.bad)
>> --- tests/generic/127.out 2016-08-28 12:16:20.000000000 -0400
>> +++ /home/cel/src/xfstests/results//generic/127.out.bad 2022-04-05 16:42:07.655099652 -0400
>> @@ -4,10 +4,198 @@
>> === FSX Light Mode, Memory Mapping ===
>> All 100000 operations completed A-OK!
>> === FSX Standard Mode, No Memory Mapping ===
>> -All 100000 operations completed A-OK!
>> +ltp/fsx -q -l 262144 -o 65536 -S 191110531 -N 100000 -R -W fsx_std_nommap
>> +READ BAD DATA: offset = 0x9cb7, size = 0xfae3, fname = /tmp/mnt/manet.ib-2323703/fsx_std_nommap
>> +OFFSET GOOD BAD RANGE
>> ...
>> (Run 'diff -u /home/cel/src/xfstests/tests/generic/127.out /home/cel/src/xfstests/results//generic/127.out.bad' to see the entire diff)
>>
>> I bisected the problem to:
>>
>> 56a8c8eb1eaf ("tmpfs: do not allocate pages on read")
>>
>> generic/075 fails almost immediately without any NFS-level errors.
>> Likely this is data corruption rather than an overt I/O error.
>
> That's sad. Thanks for bisecting and reporting. Sorry for the nuisance.
>
> I suspect this patch is heading for a revert, because I shall not have
> time to debug and investigate. Cc'ing fsdevel and a few people who have
> an interest in it, to warn of that likely upcoming revert.
>
> But if it's okay with everyone, please may we leave it in for -rc2?
> Given that having it in -rc1 already smoked out another issue (problem
> of SetPageUptodate(ZERO_PAGE(0)) without CONFIG_MMU), I think keeping
> it in a little longer might smoke out even more.
>
> The xfstests info above doesn't actually tell very much, beyond that
> generic/075 generic/091 generic/112 generic/127, each a test with fsx,
> all fall at their first hurdle. If you have time, please rerun and
> tar up the results/generic directory (maybe filter just those failing)
> and send as attachment. But don't go to any trouble, it's unlikely
> that I shall even untar it - it would be mainly to go on record if
> anyone has time to look into it later. And, frankly, it's unlikely
> to tell us anything more enlightening, than that the data seen was
> not as expected: which we do already know.
>
> I've had no problem with xfstests generic 075,091,112,127 testing
> tmpfs here, not before and not in the month or two I've had that
> patch in: so it's something in the way that NFS exercises tmpfs
> that reveals it. If I had time to duplicate your procedure, I'd be
> asking for detailed instructions: but no, I won't have a chance.
>
> But I can sit here and try to guess. I notice fs/nfsd checks
> file->f_op->splice_read, and employs fallback if not available:
> if you have time, please try rerunning those xfstests on an -rc1
> kernel, but with mm/shmem.c's .splice_read line commented out.
> My guess is that will then pass the tests, and we shall know more.
This seemed like the most probative next step, so I commented
out the .splice_read call-out in mm/shmem.c and ran the tests
again. Yes, that change enables the fsx-related tests to pass
as expected.
> What could be going wrong there? I've thought of two possibilities.
> A minor, hopefully easily fixed, issue would be if fs/nfsd has
> trouble with seeing the same page twice in a row: since tmpfs is
> now using the ZERO_PAGE(0) for all pages of a hole, and I think I
> caught sight of code which looks to see if the latest page is the
> same as the one before. It's easy to imagine that might go wrong.
Are you referring to this function in fs/nfsd/vfs.c ?
847 static int
848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
849 struct splice_desc *sd)
850 {
851 struct svc_rqst *rqstp = sd->u.data;
852 struct page **pp = rqstp->rq_next_page;
853 struct page *page = buf->page;
854
855 if (rqstp->rq_res.page_len == 0) {
856 svc_rqst_replace_page(rqstp, page);
857 rqstp->rq_res.page_base = buf->offset;
858 } else if (page != pp[-1]) {
859 svc_rqst_replace_page(rqstp, page);
860 }
861 rqstp->rq_res.page_len += sd->len;
862
863 return sd->len;
864 }
rq_next_page should point to the first unused element of
rqstp->rq_pages, so IIUC that check is looking for the
final page that is part of the READ payload.
But that does suggest that if page -> ZERO_PAGE and so does
pp[-1], then svc_rqst_replace_page() would not be invoked.
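To make that concrete, here is my reading of a hole spanning two pages,
assuming the READ payload starts in the hole (hypothetical trace, not
something I've instrumented):

	/* 1st hole page: rq_res.page_len == 0, so it is consumed */
	svc_rqst_replace_page(rqstp, ZERO_PAGE(0));	/* rq_next_page advances */
	rqstp->rq_res.page_len += sd->len;

	/* 2nd hole page: page == pp[-1] == ZERO_PAGE(0), so no replace,
	 * and rq_next_page does not advance */
	rqstp->rq_res.page_len += sd->len;

	/* page_len now claims two pages of payload, but only one slot in
	 * rq_pages was replaced: the reply references a stale page */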
> A more difficult issue would be, if fsx is racing writes and reads,
> in a way that it can guarantee the correct result, but that correct
> result is no longer delivered: because the writes go into freshly
> allocated tmpfs cache pages, while reads are still delivering
> stale ZERO_PAGEs from the pipe. I'm hazy on the guarantees there.
>
> But unless someone has time to help out, we're heading for a revert.
>
> Thanks,
> Hugh
--
Chuck Lever
On Thu, 7 Apr 2022, Mark Hemment wrote:
> On Thu, 7 Apr 2022 at 01:19, Hugh Dickins <[email protected]> wrote:
> >
> > What could be going wrong there? I've thought of two possibilities.
> > A minor, hopefully easily fixed, issue would be if fs/nfsd has
> > trouble with seeing the same page twice in a row: since tmpfs is
> > now using the ZERO_PAGE(0) for all pages of a hole, and I think I
> > caught sight of code which looks to see if the latest page is the
> > same as the one before. It's easy to imagine that might go wrong.
>
> When I worked at Veritas, data corruption over NFS was hit when
> sending the same page in succession. This was triggered via VxFS's
> shared page cache, after a file had been dedup'ed.
> I can't remember all the details (~15yrs ago), but the core issue was
> skb_can_coalesce() returning a false-positive for the 'same page' case
> (no check for crossing a page boundary).
Very useful input: thank you Mark.
That tells me that, even if we spot a "bug" in fs/nfsd, there could
be various other places which get confused if given the ZERO_PAGE(0)
twice in a row. I won't be able to find them all, and I cannot go on
risking that.
At first I thought of using ZERO_PAGE(0) for even pgoffs, alternating
with a tmpfs-specific zeroed page for odd pgoffs. But I was forgetting
that copying ZERO_PAGE(0) is itself just a workaround to avoid the 28%
slower iov_iter_zero().
I think I have a reasonable hybrid: will reply to Chuck now with that.
Hugh
On Thu, 7 Apr 2022, Chuck Lever III wrote:
> > On Apr 6, 2022, at 8:18 PM, Hugh Dickins <[email protected]> wrote:
> >
> > But I can sit here and try to guess. I notice fs/nfsd checks
> > file->f_op->splice_read, and employs fallback if not available:
> > if you have time, please try rerunning those xfstests on an -rc1
> > kernel, but with mm/shmem.c's .splice_read line commented out.
> > My guess is that will then pass the tests, and we shall know more.
>
> This seemed like the most probative next step, so I commented
> out the .splice_read call-out in mm/shmem.c and ran the tests
> again. Yes, that change enables the fsx-related tests to pass
> as expected.
Great, thank you for trying that.
>
> > What could be going wrong there? I've thought of two possibilities.
> > A minor, hopefully easily fixed, issue would be if fs/nfsd has
> > trouble with seeing the same page twice in a row: since tmpfs is
> > now using the ZERO_PAGE(0) for all pages of a hole, and I think I
> > caught sight of code which looks to see if the latest page is the
> > same as the one before. It's easy to imagine that might go wrong.
>
> Are you referring to this function in fs/nfsd/vfs.c ?
I think that was it, didn't pay much attention.
>
> 847 static int
> 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
> 849 struct splice_desc *sd)
> 850 {
> 851 struct svc_rqst *rqstp = sd->u.data;
> 852 struct page **pp = rqstp->rq_next_page;
> 853 struct page *page = buf->page;
> 854
> 855 if (rqstp->rq_res.page_len == 0) {
> 856 svc_rqst_replace_page(rqstp, page);
> 857 rqstp->rq_res.page_base = buf->offset;
> 858 } else if (page != pp[-1]) {
> 859 svc_rqst_replace_page(rqstp, page);
> 860 }
> 861 rqstp->rq_res.page_len += sd->len;
> 862
> 863 return sd->len;
> 864 }
>
> rq_next_page should point to the first unused element of
> rqstp->rq_pages, so IIUC that check is looking for the
> final page that is part of the READ payload.
>
> But that does suggest that if page -> ZERO_PAGE and so does
> pp[-1], then svc_rqst_replace_page() would not be invoked.
I still haven't studied the logic there: Mark's input made it clear
that it's just too risky for tmpfs to pass back ZERO_PAGE repeatedly,
there could be expectations of uniqueness in other places too.
>
> > A more difficult issue would be, if fsx is racing writes and reads,
> > in a way that it can guarantee the correct result, but that correct
> > result is no longer delivered: because the writes go into freshly
> > allocated tmpfs cache pages, while reads are still delivering
> > stale ZERO_PAGEs from the pipe. I'm hazy on the guarantees there.
> >
> > But unless someone has time to help out, we're heading for a revert.
We might be able to avoid that revert, and go the whole way to using
iov_iter_zero() instead. But the significant slowness of clear_user()
relative to copy to user, on x86 at least, does ask for a hybrid.
Suggested patch below, on top of 5.18-rc1, passes my own testing:
but will it pass yours? It seems to me safe, and as fast as before,
but we don't know yet if this iov_iter_zero() works right for you.
Chuck, please give it a go and let us know.
(Don't forget to restore mm/shmem.c's .splice_read first! And if
this works, I can revert mm/filemap.c's SetPageUptodate(ZERO_PAGE(0))
in the same patch, fixing the other regression, without recourse to
#ifdefs or arch mods.)
Thanks!
Hugh
--- 5.18-rc1/mm/shmem.c
+++ linux/mm/shmem.c
@@ -2513,7 +2513,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
pgoff_t end_index;
unsigned long nr, ret;
loff_t i_size = i_size_read(inode);
- bool got_page;
end_index = i_size >> PAGE_SHIFT;
if (index > end_index)
@@ -2570,24 +2569,34 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
*/
if (!offset)
mark_page_accessed(page);
- got_page = true;
+ /*
+ * Ok, we have the page, and it's up-to-date, so
+ * now we can copy it to user space...
+ */
+ ret = copy_page_to_iter(page, offset, nr, to);
+ put_page(page);
+
+ } else if (iter_is_iovec(to)) {
+ /*
+ * Copy to user tends to be so well optimized, but
+ * clear_user() not so much, that it is noticeably
+ * faster to copy the zero page instead of clearing.
+ */
+ ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
} else {
- page = ZERO_PAGE(0);
- got_page = false;
+ /*
+ * But submitting the same page twice in a row to
+ * splice() - or others? - can result in confusion:
+ * so don't attempt that optimization on pipes etc.
+ */
+ ret = iov_iter_zero(nr, to);
}
- /*
- * Ok, we have the page, and it's up-to-date, so
- * now we can copy it to user space...
- */
- ret = copy_page_to_iter(page, offset, nr, to);
retval += ret;
offset += ret;
index += offset >> PAGE_SHIFT;
offset &= ~PAGE_MASK;
- if (got_page)
- put_page(page);
if (!iov_iter_count(to))
break;
if (ret < nr) {
> On Apr 7, 2022, at 6:26 PM, Hugh Dickins <[email protected]> wrote:
>
> On Thu, 7 Apr 2022, Chuck Lever III wrote:
>>> On Apr 6, 2022, at 8:18 PM, Hugh Dickins <[email protected]> wrote:
>>>
>>> But I can sit here and try to guess. I notice fs/nfsd checks
>>> file->f_op->splice_read, and employs fallback if not available:
>>> if you have time, please try rerunning those xfstests on an -rc1
>>> kernel, but with mm/shmem.c's .splice_read line commented out.
>>> My guess is that will then pass the tests, and we shall know more.
>>
>> This seemed like the most probative next step, so I commented
>> out the .splice_read call-out in mm/shmem.c and ran the tests
>> again. Yes, that change enables the fsx-related tests to pass
>> as expected.
>
> Great, thank you for trying that.
>
>>
>>> What could be going wrong there? I've thought of two possibilities.
>>> A minor, hopefully easily fixed, issue would be if fs/nfsd has
>>> trouble with seeing the same page twice in a row: since tmpfs is
>>> now using the ZERO_PAGE(0) for all pages of a hole, and I think I
>>> caught sight of code which looks to see if the latest page is the
>>> same as the one before. It's easy to imagine that might go wrong.
>>
>> Are you referring to this function in fs/nfsd/vfs.c ?
>
> I think that was it, didn't pay much attention.
This code seems to have been the issue. I added a little test
to see if @page pointed to ZERO_PAGE(0) and now the tests
pass as expected.
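The test was something like this, against the function quoted just
below (paraphrased for illustration; experiment only, not a proposed
fix):

	} else if (page == ZERO_PAGE(0) || page != pp[-1]) {
		/* force a fresh rq_pages slot even when tmpfs hands us
		 * the zero page repeatedly */
		svc_rqst_replace_page(rqstp, page);
	}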
>> 847 static int
>> 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
>> 849 struct splice_desc *sd)
>> 850 {
>> 851 struct svc_rqst *rqstp = sd->u.data;
>> 852 struct page **pp = rqstp->rq_next_page;
>> 853 struct page *page = buf->page;
>> 854
>> 855 if (rqstp->rq_res.page_len == 0) {
>> 856 svc_rqst_replace_page(rqstp, page);
>> 857 rqstp->rq_res.page_base = buf->offset;
>> 858 } else if (page != pp[-1]) {
>> 859 svc_rqst_replace_page(rqstp, page);
>> 860 }
>> 861 rqstp->rq_res.page_len += sd->len;
>> 862
>> 863 return sd->len;
>> 864 }
>>
>> rq_next_page should point to the first unused element of
>> rqstp->rq_pages, so IIUC that check is looking for the
>> final page that is part of the READ payload.
>>
>> But that does suggest that if page -> ZERO_PAGE and so does
>> pp[-1], then svc_rqst_replace_page() would not be invoked.
>
> I still haven't studied the logic there: Mark's input made it clear
> that it's just too risky for tmpfs to pass back ZERO_PAGE repeatedly,
> there could be expectations of uniqueness in other places too.
I can't really attest to Mark's comment, but...
After studying nfsd_splice_actor() I can't see any reason
except cleverness and technical debt for this particular
check. I have a patch that removes the check and simplifies
this function that I'm testing now -- it seems to be a
reasonable clean-up whether you keep 56a8c8eb1eaf or
choose to revert it.
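Something along these lines (just the shape, not the actual patch,
which is still going through testing):

static int
nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
		  struct splice_desc *sd)
{
	struct svc_rqst *rqstp = sd->u.data;

	/* unconditionally claim a fresh rq_pages slot for each buffer */
	svc_rqst_replace_page(rqstp, buf->page);
	if (rqstp->rq_res.page_len == 0)
		rqstp->rq_res.page_base = buf->offset;
	rqstp->rq_res.page_len += sd->len;
	return sd->len;
}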
>>> A more difficult issue would be, if fsx is racing writes and reads,
>>> in a way that it can guarantee the correct result, but that correct
>>> result is no longer delivered: because the writes go into freshly
>>> allocated tmpfs cache pages, while reads are still delivering
>>> stale ZERO_PAGEs from the pipe. I'm hazy on the guarantees there.
>>>
>>> But unless someone has time to help out, we're heading for a revert.
>
> We might be able to avoid that revert, and go the whole way to using
> iov_iter_zero() instead. But the significant slowness of clear_user()
> relative to copy to user, on x86 at least, does ask for a hybrid.
>
> Suggested patch below, on top of 5.18-rc1, passes my own testing:
> but will it pass yours? It seems to me safe, and as fast as before,
> but we don't know yet if this iov_iter_zero() works right for you.
> Chuck, please give it a go and let us know.
>
> (Don't forget to restore mm/shmem.c's .splice_read first! And if
> this works, I can revert mm/filemap.c's SetPageUptodate(ZERO_PAGE(0))
> in the same patch, fixing the other regression, without recourse to
> #ifdefs or arch mods.)
Sure, I will try this out first thing tomorrow.
One thing that occurs to me is that for NFS/RDMA, having a
page full of zeroes that is already DMA-mapped would be a
nice optimization on the sender side (on the client for an
NFS WRITE and on the server for an NFS READ). The transport
would have to set up a scatter-gather list containing a
bunch of entries that reference the same page...
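Something like this, purely as a sketch (zero_page standing in for a
pre-allocated, pre-DMA-mapped page of zeroes, and NR_HOLE_PAGES for
however many hole pages the payload covers; both are hypothetical
names):

	struct scatterlist sgl[NR_HOLE_PAGES];
	int i;

	sg_init_table(sgl, NR_HOLE_PAGES);
	for (i = 0; i < NR_HOLE_PAGES; i++)
		sg_set_page(&sgl[i], zero_page, PAGE_SIZE, 0);
	/* every entry references the same already-mapped page */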
</musing>
> Thanks!
> Hugh
>
> --- 5.18-rc1/mm/shmem.c
> +++ linux/mm/shmem.c
> @@ -2513,7 +2513,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> pgoff_t end_index;
> unsigned long nr, ret;
> loff_t i_size = i_size_read(inode);
> - bool got_page;
>
> end_index = i_size >> PAGE_SHIFT;
> if (index > end_index)
> @@ -2570,24 +2569,34 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> */
> if (!offset)
> mark_page_accessed(page);
> - got_page = true;
> + /*
> + * Ok, we have the page, and it's up-to-date, so
> + * now we can copy it to user space...
> + */
> + ret = copy_page_to_iter(page, offset, nr, to);
> + put_page(page);
> +
> + } else if (iter_is_iovec(to)) {
> + /*
> + * Copy to user tends to be so well optimized, but
> + * clear_user() not so much, that it is noticeably
> + * faster to copy the zero page instead of clearing.
> + */
> + ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
> } else {
> - page = ZERO_PAGE(0);
> - got_page = false;
> + /*
> + * But submitting the same page twice in a row to
> + * splice() - or others? - can result in confusion:
> + * so don't attempt that optimization on pipes etc.
> + */
> + ret = iov_iter_zero(nr, to);
> }
>
> - /*
> - * Ok, we have the page, and it's up-to-date, so
> - * now we can copy it to user space...
> - */
> - ret = copy_page_to_iter(page, offset, nr, to);
> retval += ret;
> offset += ret;
> index += offset >> PAGE_SHIFT;
> offset &= ~PAGE_MASK;
>
> - if (got_page)
> - put_page(page);
> if (!iov_iter_count(to))
> break;
> if (ret < nr) {
--
Chuck Lever
On Fri, 8 Apr 2022 at 00:45, Chuck Lever III <[email protected]> wrote:
> > On Apr 7, 2022, at 6:26 PM, Hugh Dickins <[email protected]> wrote:
> >
> > On Thu, 7 Apr 2022, Chuck Lever III wrote:
> >>> On Apr 6, 2022, at 8:18 PM, Hugh Dickins <[email protected]> wrote:
> >>>
> >>> But I can sit here and try to guess. I notice fs/nfsd checks
> >>> file->f_op->splice_read, and employs fallback if not available:
> >>> if you have time, please try rerunning those xfstests on an -rc1
> >>> kernel, but with mm/shmem.c's .splice_read line commented out.
> >>> My guess is that will then pass the tests, and we shall know more.
> >>
> >> This seemed like the most probative next step, so I commented
> >> out the .splice_read call-out in mm/shmem.c and ran the tests
> >> again. Yes, that change enables the fsx-related tests to pass
> >> as expected.
> >
> > Great, thank you for trying that.
> >
> >>
> >>> What could be going wrong there? I've thought of two possibilities.
> >>> A minor, hopefully easily fixed, issue would be if fs/nfsd has
> >>> trouble with seeing the same page twice in a row: since tmpfs is
> >>> now using the ZERO_PAGE(0) for all pages of a hole, and I think I
> >>> caught sight of code which looks to see if the latest page is the
> >>> same as the one before. It's easy to imagine that might go wrong.
> >>
> >> Are you referring to this function in fs/nfsd/vfs.c ?
> >
> > I think that was it, didn't pay much attention.
>
> This code seems to have been the issue. I added a little test
> to see if @page pointed to ZERO_PAGE(0) and now the tests
> pass as expected.
>
>
> >> 847 static int
> >> 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
> >> 849 struct splice_desc *sd)
> >> 850 {
> >> 851 struct svc_rqst *rqstp = sd->u.data;
> >> 852 struct page **pp = rqstp->rq_next_page;
> >> 853 struct page *page = buf->page;
> >> 854
> >> 855 if (rqstp->rq_res.page_len == 0) {
> >> 856 svc_rqst_replace_page(rqstp, page);
> >> 857 rqstp->rq_res.page_base = buf->offset;
> >> 858 } else if (page != pp[-1]) {
> >> 859 svc_rqst_replace_page(rqstp, page);
> >> 860 }
> >> 861 rqstp->rq_res.page_len += sd->len;
> >> 862
> >> 863 return sd->len;
> >> 864 }
> >>
> >> rq_next_page should point to the first unused element of
> >> rqstp->rq_pages, so IIUC that check is looking for the
> >> final page that is part of the READ payload.
> >>
> >> But that does suggest that if page -> ZERO_PAGE and so does
> >> pp[-1], then svc_rqst_replace_page() would not be invoked.
> >
> > I still haven't studied the logic there: Mark's input made it clear
> > that it's just too risky for tmpfs to pass back ZERO_PAGE repeatedly,
> > there could be expectations of uniqueness in other places too.
>
> I can't really attest to Mark's comment, but...
>
> After studying nfsd_splice_actor() I can't see any reason
> except cleverness and technical debt for this particular
> check. I have a patch that removes the check and simplifies
> this function that I'm testing now -- it seems to be a
> reasonable clean-up whether you keep 56a8c8eb1eaf or
> choose to revert it.
Agreed, nfsd_splice_actor() is broken for the same-page case.
That function hasn't changed in logic since its introduction. So when
VxFS triggered this issue, back in 2007/2008, it must have hit the
same problem with this actor (and with its predecessor, ->sendfile).
I don't remember the details. But skb_can_coalesce() sticks in my mind
for some reason. Would jumbo frames be a good stress for can_coalesce
with same-page? Or, since Hugh is proposing to avoid sending ZERO_PAGE,
should we just ignore this for now?
> >>> A more difficult issue would be, if fsx is racing writes and reads,
> >>> in a way that it can guarantee the correct result, but that correct
> >>> result is no longer delivered: because the writes go into freshly
> >>> allocated tmpfs cache pages, while reads are still delivering
> >>> stale ZERO_PAGEs from the pipe. I'm hazy on the guarantees there.
> >>>
> >>> But unless someone has time to help out, we're heading for a revert.
> >
> > We might be able to avoid that revert, and go the whole way to using
> > iov_iter_zero() instead. But the significant slowness of clear_user()
> > relative to copy to user, on x86 at least, does ask for a hybrid.
> >
> > Suggested patch below, on top of 5.18-rc1, passes my own testing:
> > but will it pass yours? It seems to me safe, and as fast as before,
> > but we don't know yet if this iov_iter_zero() works right for you.
> > Chuck, please give it a go and let us know.
> >
> > (Don't forget to restore mm/shmem.c's .splice_read first! And if
> > this works, I can revert mm/filemap.c's SetPageUptodate(ZERO_PAGE(0))
> > in the same patch, fixing the other regression, without recourse to
> > #ifdefs or arch mods.)
>
> Sure, I will try this out first thing tomorrow.
>
> One thing that occurs to me is that for NFS/RDMA, having a
> page full of zeroes that is already DMA-mapped would be a
> nice optimization on the sender side (on the client for an
> NFS WRITE and on the server for an NFS READ). The transport
> would have to set up a scatter-gather list containing a
> bunch of entries that reference the same page...
>
> </musing>
>
>
> > Thanks!
> > Hugh
> >
> > --- 5.18-rc1/mm/shmem.c
> > +++ linux/mm/shmem.c
> > @@ -2513,7 +2513,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > pgoff_t end_index;
> > unsigned long nr, ret;
> > loff_t i_size = i_size_read(inode);
> > - bool got_page;
> >
> > end_index = i_size >> PAGE_SHIFT;
> > if (index > end_index)
> > @@ -2570,24 +2569,34 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> > */
> > if (!offset)
> > mark_page_accessed(page);
> > - got_page = true;
> > + /*
> > + * Ok, we have the page, and it's up-to-date, so
> > + * now we can copy it to user space...
> > + */
> > + ret = copy_page_to_iter(page, offset, nr, to);
> > + put_page(page);
> > +
> > + } else if (iter_is_iovec(to)) {
> > + /*
> > + * Copy to user tends to be so well optimized, but
> > + * clear_user() not so much, that it is noticeably
> > + * faster to copy the zero page instead of clearing.
> > + */
> > + ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
> > } else {
> > - page = ZERO_PAGE(0);
> > - got_page = false;
> > + /*
> > + * But submitting the same page twice in a row to
> > + * splice() - or others? - can result in confusion:
> > + * so don't attempt that optimization on pipes etc.
> > + */
> > + ret = iov_iter_zero(nr, to);
> > }
> >
> > - /*
> > - * Ok, we have the page, and it's up-to-date, so
> > - * now we can copy it to user space...
> > - */
> > - ret = copy_page_to_iter(page, offset, nr, to);
> > retval += ret;
> > offset += ret;
> > index += offset >> PAGE_SHIFT;
> > offset &= ~PAGE_MASK;
> >
> > - if (got_page)
> > - put_page(page);
> > if (!iov_iter_count(to))
> > break;
> > if (ret < nr) {
>
> --
> Chuck Lever
Cheers,
Mark
On Fri, 8 Apr 2022, Chuck Lever III wrote:
> > On Apr 7, 2022, at 6:26 PM, Hugh Dickins <[email protected]> wrote:
> > On Thu, 7 Apr 2022, Chuck Lever III wrote:
> >>
> >> 847 static int
> >> 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
> >> 849 struct splice_desc *sd)
> >> 850 {
> >> 851 struct svc_rqst *rqstp = sd->u.data;
> >> 852 struct page **pp = rqstp->rq_next_page;
> >> 853 struct page *page = buf->page;
> >> 854
> >> 855 if (rqstp->rq_res.page_len == 0) {
> >> 856 svc_rqst_replace_page(rqstp, page);
> >> 857 rqstp->rq_res.page_base = buf->offset;
> >> 858 } else if (page != pp[-1]) {
> >> 859 svc_rqst_replace_page(rqstp, page);
> >> 860 }
> >> 861 rqstp->rq_res.page_len += sd->len;
> >> 862
> >> 863 return sd->len;
> >> 864 }
> >>
> >> rq_next_page should point to the first unused element of
> >> rqstp->rq_pages, so IIUC that check is looking for the
> >> final page that is part of the READ payload.
> >>
> >> But that does suggest that if page -> ZERO_PAGE and so does
> >> pp[-1], then svc_rqst_replace_page() would not be invoked.
>
> To put a little more color on this, I think the idea here
> is to prevent releasing the same page twice. It might be
> possible that NFSD can add the same page to the rq_pages
> array more than once, and we don't want to do a double
> put_page().
>
> The only time I can think this might happen is if the
> READ payload is partially contained in the page that
> contains the NFS header. I'm not sure that can ever
> happen these days.
I'd have thought that if a page were repeated, then its refcount would
have been raised twice, and so require a double put_page(). But it's
no concern of mine. The only thing I'd say is, if you do find a good
way to robustify that code for duplicates, please don't make it
conditional on ZERO_PAGE - that's just a special case which I
mis-introduced and is now about to go away.
> >
> > We might be able to avoid that revert, and go the whole way to using
> > iov_iter_zero() instead. But the significant slowness of clear_user()
> > relative to copy to user, on x86 at least, does ask for a hybrid.
> >
> > Suggested patch below, on top of 5.18-rc1, passes my own testing:
> > but will it pass yours? It seems to me safe, and as fast as before,
> > but we don't know yet if this iov_iter_zero() works right for you.
> > Chuck, please give it a go and let us know.
>
> Applied to stock v5.18-rc1. The tests pass as expected.
Great, thanks a lot, I'll move ahead with sending akpm the patch
with a proper commit message.
Hugh
> On Apr 8, 2022, at 3:09 PM, Hugh Dickins <[email protected]> wrote:
>
> On Fri, 8 Apr 2022, Chuck Lever III wrote:
>>> On Apr 7, 2022, at 6:26 PM, Hugh Dickins <[email protected]> wrote:
>>> On Thu, 7 Apr 2022, Chuck Lever III wrote:
>>>>
>>>> 847 static int
>>>> 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
>>>> 849 struct splice_desc *sd)
>>>> 850 {
>>>> 851 struct svc_rqst *rqstp = sd->u.data;
>>>> 852 struct page **pp = rqstp->rq_next_page;
>>>> 853 struct page *page = buf->page;
>>>> 854
>>>> 855 if (rqstp->rq_res.page_len == 0) {
>>>> 856 svc_rqst_replace_page(rqstp, page);
>>>> 857 rqstp->rq_res.page_base = buf->offset;
>>>> 858 } else if (page != pp[-1]) {
>>>> 859 svc_rqst_replace_page(rqstp, page);
>>>> 860 }
>>>> 861 rqstp->rq_res.page_len += sd->len;
>>>> 862
>>>> 863 return sd->len;
>>>> 864 }
>>>>
>>>> rq_next_page should point to the first unused element of
>>>> rqstp->rq_pages, so IIUC that check is looking for the
>>>> final page that is part of the READ payload.
>>>>
>>>> But that does suggest that if page -> ZERO_PAGE and so does
>>>> pp[-1], then svc_rqst_replace_page() would not be invoked.
>>
>> To put a little more color on this, I think the idea here
>> is to prevent releasing the same page twice. It might be
>> possible that NFSD can add the same page to the rq_pages
>> array more than once, and we don't want to do a double
>> put_page().
>>
>> The only time I can think this might happen is if the
>> READ payload is partially contained in the page that
>> contains the NFS header. I'm not sure that can ever
>> happen these days.
>
> I'd have thought that if a page were repeated, then its refcount would
> have been raised twice, and so require a double put_page(). But it's
> no concern of mine. The only thing I'd say is, if you do find a good
> way to robustify that code for duplicates, please don't make it
> conditional on ZERO_PAGE - that's just a special case which I
> mis-introduced and is now about to go away.
100% agree on both counts: not sure (yet) why get_page() was not
used here, and a special case for ZERO_PAGE would be a band-aid.
Anyway, I haven't found a case yet where a duplicate struct page
appears in rq_pages with current kernels.
--
Chuck Lever
> On Apr 7, 2022, at 6:26 PM, Hugh Dickins <[email protected]> wrote:
>
> On Thu, 7 Apr 2022, Chuck Lever III wrote:
>>> On Apr 6, 2022, at 8:18 PM, Hugh Dickins <[email protected]> wrote:
>>>
>>> But I can sit here and try to guess. I notice fs/nfsd checks
>>> file->f_op->splice_read, and employs fallback if not available:
>>> if you have time, please try rerunning those xfstests on an -rc1
>>> kernel, but with mm/shmem.c's .splice_read line commented out.
>>> My guess is that will then pass the tests, and we shall know more.
>>
>> This seemed like the most probative next step, so I commented
>> out the .splice_read call-out in mm/shmem.c and ran the tests
>> again. Yes, that change enables the fsx-related tests to pass
>> as expected.
>
> Great, thank you for trying that.
>
>>
>>> What could be going wrong there? I've thought of two possibilities.
>>> A minor, hopefully easily fixed, issue would be if fs/nfsd has
>>> trouble with seeing the same page twice in a row: since tmpfs is
>>> now using the ZERO_PAGE(0) for all pages of a hole, and I think I
>>> caught sight of code which looks to see if the latest page is the
>>> same as the one before. It's easy to imagine that might go wrong.
>>
>> Are you referring to this function in fs/nfsd/vfs.c ?
>
> I think that was it, didn't pay much attention.
>
>>
>> 847 static int
>> 848 nfsd_splice_actor(struct pipe_inode_info *pipe, struct pipe_buffer *buf,
>> 849 struct splice_desc *sd)
>> 850 {
>> 851 struct svc_rqst *rqstp = sd->u.data;
>> 852 struct page **pp = rqstp->rq_next_page;
>> 853 struct page *page = buf->page;
>> 854
>> 855 if (rqstp->rq_res.page_len == 0) {
>> 856 svc_rqst_replace_page(rqstp, page);
>> 857 rqstp->rq_res.page_base = buf->offset;
>> 858 } else if (page != pp[-1]) {
>> 859 svc_rqst_replace_page(rqstp, page);
>> 860 }
>> 861 rqstp->rq_res.page_len += sd->len;
>> 862
>> 863 return sd->len;
>> 864 }
>>
>> rq_next_page should point to the first unused element of
>> rqstp->rq_pages, so IIUC that check is looking for the
>> final page that is part of the READ payload.
>>
>> But that does suggest that if page -> ZERO_PAGE and so does
>> pp[-1], then svc_rqst_replace_page() would not be invoked.
To put a little more color on this, I think the idea here
is to prevent releasing the same page twice. It might be
possible that NFSD can add the same page to the rq_pages
array more than once, and we don't want to do a double
put_page().
The only time I can think this might happen is if the
READ payload is partially contained in the page that
contains the NFS header. I'm not sure that can ever
happen these days.
> I still haven't studied the logic there: Mark's input made it clear
> that it's just too risky for tmpfs to pass back ZERO_PAGE repeatedly,
> there could be expectations of uniqueness in other places too.
So far I haven't seen an issue with skb_can_coalesce().
I will keep an eye out for that.
>>> A more difficult issue would be, if fsx is racing writes and reads,
>>> in a way that it can guarantee the correct result, but that correct
>>> result is no longer delivered: because the writes go into freshly
>>> allocated tmpfs cache pages, while reads are still delivering
>>> stale ZERO_PAGEs from the pipe. I'm hazy on the guarantees there.
>>>
>>> But unless someone has time to help out, we're heading for a revert.
>
> We might be able to avoid that revert, and go the whole way to using
> iov_iter_zero() instead. But the significant slowness of clear_user()
> relative to copy to user, on x86 at least, does ask for a hybrid.
>
> Suggested patch below, on top of 5.18-rc1, passes my own testing:
> but will it pass yours? It seems to me safe, and as fast as before,
> but we don't know yet if this iov_iter_zero() works right for you.
> Chuck, please give it a go and let us know.
Applied to stock v5.18-rc1. The tests pass as expected.
> (Don't forget to restore mm/shmem.c's .splice_read first! And if
> this works, I can revert mm/filemap.c's SetPageUptodate(ZERO_PAGE(0))
> in the same patch, fixing the other regression, without recourse to
> #ifdefs or arch mods.)
>
> Thanks!
> Hugh
>
> --- 5.18-rc1/mm/shmem.c
> +++ linux/mm/shmem.c
> @@ -2513,7 +2513,6 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> pgoff_t end_index;
> unsigned long nr, ret;
> loff_t i_size = i_size_read(inode);
> - bool got_page;
>
> end_index = i_size >> PAGE_SHIFT;
> if (index > end_index)
> @@ -2570,24 +2569,34 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> */
> if (!offset)
> mark_page_accessed(page);
> - got_page = true;
> + /*
> + * Ok, we have the page, and it's up-to-date, so
> + * now we can copy it to user space...
> + */
> + ret = copy_page_to_iter(page, offset, nr, to);
> + put_page(page);
> +
> + } else if (iter_is_iovec(to)) {
> + /*
> + * Copy to user tends to be so well optimized, but
> + * clear_user() not so much, that it is noticeably
> + * faster to copy the zero page instead of clearing.
> + */
> + ret = copy_page_to_iter(ZERO_PAGE(0), offset, nr, to);
> } else {
> - page = ZERO_PAGE(0);
> - got_page = false;
> + /*
> + * But submitting the same page twice in a row to
> + * splice() - or others? - can result in confusion:
> + * so don't attempt that optimization on pipes etc.
> + */
> + ret = iov_iter_zero(nr, to);
> }
>
> - /*
> - * Ok, we have the page, and it's up-to-date, so
> - * now we can copy it to user space...
> - */
> - ret = copy_page_to_iter(page, offset, nr, to);
> retval += ret;
> offset += ret;
> index += offset >> PAGE_SHIFT;
> offset &= ~PAGE_MASK;
>
> - if (got_page)
> - put_page(page);
> if (!iov_iter_count(to))
> break;
> if (ret < nr) {
--
Chuck Lever