2023-09-24 13:25:48

by Mantas Mikulėnas

Subject: Data corruption with 5.10.x client -> 6.5.x server

I've recently upgraded my home NFS server from 6.4.12 to 6.5.4 (running
Arch Linux x86_64).

Now, when I'm accessing the server over NFSv4.2 from a client that's
running 5.10.0 (32-bit x86, Debian 11), if the mount is using sec=krb5i
or sec=krb5p, trying to read a file that's <= 4092 bytes in size will
return all-zero data. (That is, `hexdump -C file` shows "00 00 00...")
Files that are 4093 bytes or larger seem to be unaffected.

Only sec=krb5i/krb5p are affected by this – plain sec=krb5 (or sec=sys
for that matter) seems to work without any problems.

Newer clients (like 6.1.x or 6.4.x) don't seem to have any issues, it's
only 5.10.0 that does... though it might also be that the client is
32-bit, but the same client did work previously when the server was
running older kernels, so I still suspect 6.5.x on the server being the
problem.

Upgrading to 6.6.0-rc2 on the server hasn't changed anything.
The server is using Btrfs but I've tested with tmpfs as well.
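
For anyone who wants to try this, here is a minimal sketch of how test files straddling the 4092/4093-byte boundary could be generated on the server. It assumes a POSIX shell with awk and mktemp; the `make_testfile` helper, the marker text and the temporary output directory are illustrative choices, not something taken from an existing tool.

```shell
#!/bin/sh
# make_testfile DIR SIZE: create DIR/<SIZE>byteslen with exactly SIZE bytes.
# Hypothetical helper for illustration; not part of any NFS tooling.
make_testfile() {
    dir=$1
    size=$2
    # Repeat a marker line until it is long enough, then cut at exactly SIZE
    # bytes, so that an all-zero read on the client is easy to spot.
    awk -v n="$size" 'BEGIN {
        s = "Hello World (" n " bytes)\n"
        out = ""
        while (length(out) < n) out = out s
        printf "%s", substr(out, 1, n)
    }' > "$dir/${size}byteslen"
}

# Assumption: a scratch directory; in practice this would be the exported path.
outdir=$(mktemp -d)
for size in 4092 4093 4096; do
    make_testfile "$outdir" "$size"
done
ls -l "$outdir"
```

Reading each file through a sec=krb5i mount with `hexdump -C` should then show either the marker text or the zeroes described above.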


2023-09-24 13:56:54

by Chuck Lever

Subject: Re: Data corruption with 5.10.x client -> 6.5.x server



> On Sep 24, 2023, at 9:07 AM, Mantas Mikulėnas <[email protected]> wrote:
>
> I've recently upgraded my home NFS server from 6.4.12 to 6.5.4 (running Arch Linux x86_64).
>
> Now, when I'm accessing the server over NFSv4.2 from a client that's running 5.10.0 (32-bit x86, Debian 11), if the mount is using sec=krb5i or sec=krb5p, trying to read a file that's <= 4092 bytes in size will return all-zero data. (That is, `hexdump -C file` shows "00 00 00...") Files that are 4093 bytes or larger seem to be unaffected.
>
> Only sec=krb5i/krb5p are affected by this – plain sec=krb5 (or sec=sys for that matter) seems to work without any problems.
>
> Newer clients (like 6.1.x or 6.4.x) don't seem to have any issues, it's only 5.10.0 that does... though it might also be that the client is 32-bit, but the same client did work previously when the server was running older kernels, so I still suspect 6.5.x on the server being the problem.
>
> Upgrading to 6.6.0-rc2 on the server hasn't changed anything.
> The server is using Btrfs but I've tested with tmpfs as well.

I'm guessing proto=tcp as well (as opposed to proto=rdma).
Does the problem go away with vers=4.1 ?

Can you capture network traffic during the failure? Use sec=krb5i so
we can see the RPC payloads. On the client:

# tcpdump -iany -s0 -w/tmp/sniffer.pcap

--
Chuck Lever


2023-09-24 17:15:27

by Mantas Mikulėnas

Subject: Re: Data corruption with 5.10.x client -> 6.5.x server

On 2023-09-24 17:44, Chuck Lever III wrote:
>
>
>> On Sep 24, 2023, at 10:32 AM, Mantas Mikulėnas <[email protected]> wrote:
>>
>> On 2023-09-24 16:28, Chuck Lever III wrote:
>>>> On Sep 24, 2023, at 9:07 AM, Mantas Mikulėnas <[email protected]> wrote:
>>>>
>>>> I've recently upgraded my home NFS server from 6.4.12 to 6.5.4 (running Arch Linux x86_64).
>>>>
>>>> Now, when I'm accessing the server over NFSv4.2 from a client that's running 5.10.0 (32-bit x86, Debian 11), if the mount is using sec=krb5i or sec=krb5p, trying to read a file that's <= 4092 bytes in size will return all-zero data. (That is, `hexdump -C file` shows "00 00 00...") Files that are 4093 bytes or larger seem to be unaffected.
>>>>
>>>> Only sec=krb5i/krb5p are affected by this – plain sec=krb5 (or sec=sys for that matter) seems to work without any problems.
>>>>
>>>> Newer clients (like 6.1.x or 6.4.x) don't seem to have any issues, it's only 5.10.0 that does... though it might also be that the client is 32-bit, but the same client did work previously when the server was running older kernels, so I still suspect 6.5.x on the server being the problem.
>>>>
>>>> Upgrading to 6.6.0-rc2 on the server hasn't changed anything.
>>>> The server is using Btrfs but I've tested with tmpfs as well.
>>> I'm guessing proto=tcp as well (as opposed to proto=rdma).
>>
>> Yes, it's TCP.
>>
>> (I do have RDMA set up between two of the 6.5.x server systems, but in this case all the clients I've tested were TCP-only, and the home server that I originally noticed the problem with doesn't have RDMA at all.)
>>
>>> Does the problem go away with vers=4.1 ?
>>
>> No, it doesn't (neither with 4.0).
>>
>>> Can you capture network traffic during the failure? Use sec=krb5i so
>>> we can see the RPC payloads. On the client:
>>> # tcpdump -iany -s0 -w/tmp/sniffer.pcap
>>
>> Attached. (The script I've been using for testing mounts with -o sec=krb5i, cats three files, then unmounts.)<nfs_krb5i.pcap>
>
> I see three NFS READs in the capture.
>
> The first READ payload is all zeroes. The second payload contains
> "Hello World (4093 bytes)" repeatedly, and the third contains
> "Hello World (4096 bytes)" repeatedly.

Right, whereas on the server, the first file is filled with "Hello World
(4092 bytes)" as I originally tried to narrow down the issue.

Meanwhile, 6.4.x (Arch) clients don't seem to be having any problems
with the same server, and with seemingly the same mount options.

Thanks for looking into it!


Attachments:
nfs_krb5i_working_6.4client.pcap (46.39 kB)

2023-09-24 18:25:05

by Chuck Lever

Subject: Re: Data corruption with 5.10.x client -> 6.5.x server



> On Sep 24, 2023, at 12:51 PM, Mantas Mikulėnas <[email protected]> wrote:
>
> On 2023-09-24 17:44, Chuck Lever III wrote:
>>> On Sep 24, 2023, at 10:32 AM, Mantas Mikulėnas <[email protected]> wrote:
>>>
>>> On 2023-09-24 16:28, Chuck Lever III wrote:
>>>>> On Sep 24, 2023, at 9:07 AM, Mantas Mikulėnas <[email protected]> wrote:
>>>>>
>>>>> I've recently upgraded my home NFS server from 6.4.12 to 6.5.4 (running Arch Linux x86_64).
>>>>>
>>>>> Now, when I'm accessing the server over NFSv4.2 from a client that's running 5.10.0 (32-bit x86, Debian 11), if the mount is using sec=krb5i or sec=krb5p, trying to read a file that's <= 4092 bytes in size will return all-zero data. (That is, `hexdump -C file` shows "00 00 00...") Files that are 4093 bytes or larger seem to be unaffected.
>>>>>
>>>>> Only sec=krb5i/krb5p are affected by this – plain sec=krb5 (or sec=sys for that matter) seems to work without any problems.
>>>>>
>>>>> Newer clients (like 6.1.x or 6.4.x) don't seem to have any issues, it's only 5.10.0 that does... though it might also be that the client is 32-bit, but the same client did work previously when the server was running older kernels, so I still suspect 6.5.x on the server being the problem.
>>>>>
>>>>> Upgrading to 6.6.0-rc2 on the server hasn't changed anything.
>>>>> The server is using Btrfs but I've tested with tmpfs as well.
>>>> I'm guessing proto=tcp as well (as opposed to proto=rdma).
>>>
>>> Yes, it's TCP.
>>>
>>> (I do have RDMA set up between two of the 6.5.x server systems, but in this case all the clients I've tested were TCP-only, and the home server that I originally noticed the problem with doesn't have RDMA at all.)
>>>
>>>> Does the problem go away with vers=4.1 ?
>>>
>>> No, it doesn't (neither with 4.0).
>>>
>>>> Can you capture network traffic during the failure? Use sec=krb5i so
>>>> we can see the RPC payloads. On the client:
>>>> # tcpdump -iany -s0 -w/tmp/sniffer.pcap
>>>
>>> Attached. (The script I've been using for testing mounts with -o sec=krb5i, cats three files, then unmounts.)<nfs_krb5i.pcap>
>> I see three NFS READs in the capture.
>> The first READ payload is all zeroes. The second payload contains
>> "Hello World (4093 bytes)" repeatedly, and the third contains
>> "Hello World (4096 bytes)" repeatedly.
>
> Right, whereas on the server, the first file is filled with "Hello World (4092 bytes)" as I originally tried to narrow down the issue.
>
> Meanwhile, 6.4.x (Arch) clients don't seem to be having any problems with the same server, and with seemingly the same mount options.
>
> Thanks for looking into it!<nfs_krb5i_working_6.4client.pcap>

I found /a/ problem with the nfsd-fixes branch and krb5i, but
maybe not /your/ problem, and it's with a recent client. Scrounging
a v5.10-vintage client is a little more work, we'll see if that's
needed for confirming an eventual fix.


--
Chuck Lever


2023-09-24 20:56:44

by Mantas Mikulėnas

Subject: Re: Data corruption with 5.10.x client -> 6.5.x server

On 2023-09-24 16:28, Chuck Lever III wrote:
>
>
>> On Sep 24, 2023, at 9:07 AM, Mantas Mikulėnas <[email protected]> wrote:
>>
>> I've recently upgraded my home NFS server from 6.4.12 to 6.5.4 (running Arch Linux x86_64).
>>
>> Now, when I'm accessing the server over NFSv4.2 from a client that's running 5.10.0 (32-bit x86, Debian 11), if the mount is using sec=krb5i or sec=krb5p, trying to read a file that's <= 4092 bytes in size will return all-zero data. (That is, `hexdump -C file` shows "00 00 00...") Files that are 4093 bytes or larger seem to be unaffected.
>>
>> Only sec=krb5i/krb5p are affected by this – plain sec=krb5 (or sec=sys for that matter) seems to work without any problems.
>>
>> Newer clients (like 6.1.x or 6.4.x) don't seem to have any issues, it's only 5.10.0 that does... though it might also be that the client is 32-bit, but the same client did work previously when the server was running older kernels, so I still suspect 6.5.x on the server being the problem.
>>
>> Upgrading to 6.6.0-rc2 on the server hasn't changed anything.
>> The server is using Btrfs but I've tested with tmpfs as well.
>
> I'm guessing proto=tcp as well (as opposed to proto=rdma).

Yes, it's TCP.

(I do have RDMA set up between two of the 6.5.x server systems, but in
this case all the clients I've tested were TCP-only, and the home server
that I originally noticed the problem with doesn't have RDMA at all.)

> Does the problem go away with vers=4.1 ?

No, it doesn't (neither with 4.0).

>
> Can you capture network traffic during the failure? Use sec=krb5i so
> we can see the RPC payloads. On the client:
>
> # tcpdump -iany -s0 -w/tmp/sniffer.pcap

Attached. (The script I've been using for testing mounts with -o
sec=krb5i, cats three files, then unmounts.)


Attachments:
nfs_krb5i.pcap (68.67 kB)

2023-09-25 19:23:26

by Chuck Lever

Subject: Re: Data corruption with 5.10.x client -> 6.5.x server


> On Sep 24, 2023, at 2:24 PM, Chuck Lever III <[email protected]> wrote:
>
>> On Sep 24, 2023, at 12:51 PM, Mantas Mikulėnas <[email protected]> wrote:
>>
>> Right, whereas on the server, the first file is filled with "Hello World (4092 bytes)" as I originally tried to narrow down the issue.
>>
>> Meanwhile, 6.4.x (Arch) clients don't seem to be having any problems with the same server, and with seemingly the same mount options.
>>
>> Thanks for looking into it!<nfs_krb5i_working_6.4client.pcap>
>
> I found /a/ problem with the nfsd-fixes branch and krb5i, but
> maybe not /your/ problem, and it's with a recent client. Scrounging
> a v5.10-vintage client is a little more work, we'll see if that's
> needed for confirming an eventual fix.

The issue I reproduced appears to be unrelated.

I'm wondering if I can get you to bisect the server kernel using
your v5.10 client to test? good = v6.4, bad = v6.5 should do it.
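
A rough sketch of what each bisect step could look like, in case it helps. On the server's kernel tree the usual sequence would be `git bisect start`, `git bisect bad v6.5`, `git bisect good v6.4`, then build and boot whatever commit git checks out. The client-side helper below is only an assumption about how to turn the symptom into a verdict; `is_all_zero` and `classify` are made-up names.

```shell
#!/bin/sh
# is_all_zero FILE: succeed if FILE is non-empty and contains only NUL bytes.
is_all_zero() {
    file=$1
    size=$(wc -c < "$file")
    [ "$size" -gt 0 ] || return 1
    # Deleting every NUL byte leaves nothing only if the file was all zeroes.
    remaining=$(tr -d '\000' < "$file" | wc -c)
    [ "$remaining" -eq 0 ]
}

# classify FILE: print the verdict to pass to "git bisect" on the server.
classify() {
    if is_all_zero "$1"; then
        echo bad
    else
        echo good
    fi
}
```

After booting each candidate kernel, a fresh sec=krb5i mount on the v5.10 client followed by `classify` on the 4092-byte file would give the word to pass to `git bisect` on the server. Remounting between steps avoids judging a cached copy.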


--
Chuck Lever


2023-09-26 07:02:59

by Mantas Mikulėnas

Subject: Re: Data corruption with 5.10.x client -> 6.5.x server

On 25/09/2023 22.22, Chuck Lever III wrote:
>> On Sep 24, 2023, at 2:24 PM, Chuck Lever III <[email protected]> wrote:
>>
>>> On Sep 24, 2023, at 12:51 PM, Mantas Mikulėnas <[email protected]> wrote:
>>>
>>> Right, whereas on the server, the first file is filled with "Hello World (4092 bytes)" as I originally tried to narrow down the issue.
>>>
>>> Meanwhile, 6.4.x (Arch) clients don't seem to be having any problems with the same server, and with seemingly the same mount options.
>>>
>>> Thanks for looking into it!<nfs_krb5i_working_6.4client.pcap>
>> I found /a/ problem with the nfsd-fixes branch and krb5i, but
>> maybe not /your/ problem, and it's with a recent client. Scrounging
>> a v5.10-vintage client is a little more work, we'll see if that's
>> needed for confirming an eventual fix.
> The issue I reproduced appears to be unrelated.
>
> I'm wondering if I can get you to bisect the server kernel using
> your v5.10 client to test? good = v6.4, bad = v6.5 should do it.

Yeah, I will try to bisect but it'll probably take a day or two.

2023-09-26 14:41:07

by Chuck Lever

Subject: Re: Data corruption with 5.10.x client -> 6.5.x server



> On Sep 26, 2023, at 12:41 AM, Mantas Mikulėnas <[email protected]> wrote:
>
> On 25/09/2023 22.22, Chuck Lever III wrote:
>>> On Sep 24, 2023, at 2:24 PM, Chuck Lever III <[email protected]> wrote:
>>>
>>>> On Sep 24, 2023, at 12:51 PM, Mantas Mikulėnas <[email protected]> wrote:
>>>>
>>>> Right, whereas on the server, the first file is filled with "Hello World (4092 bytes)" as I originally tried to narrow down the issue.
>>>>
>>>> Meanwhile, 6.4.x (Arch) clients don't seem to be having any problems with the same server, and with seemingly the same mount options.
>>>>
>>>> Thanks for looking into it!<nfs_krb5i_working_6.4client.pcap>
>>> I found /a/ problem with the nfsd-fixes branch and krb5i, but
>>> maybe not /your/ problem, and it's with a recent client. Scrounging
>>> a v5.10-vintage client is a little more work, we'll see if that's
>>> needed for confirming an eventual fix.
>> The issue I reproduced appears to be unrelated.
>>
>> I'm wondering if I can get you to bisect the server kernel using
>> your v5.10 client to test? good = v6.4, bad = v6.5 should do it.
>
> Yeah, I will try to bisect but it'll probably take a day or two.

That's great, thank you!

I'm looking into setting up a virtual guest with v5.10 just in case.
Turns out v5.10 does not build on Fedora latest.


--
Chuck Lever


2023-09-27 00:20:08

by Olga Kornievskaia

Subject: Re: Data corruption with 5.10.x client -> 6.5.x server

On Tue, Sep 26, 2023 at 10:08 AM Chuck Lever III <[email protected]> wrote:
>
>
>
> > On Sep 26, 2023, at 12:41 AM, Mantas Mikulėnas <[email protected]> wrote:
> >
> > On 25/09/2023 22.22, Chuck Lever III wrote:
> >>> On Sep 24, 2023, at 2:24 PM, Chuck Lever III <[email protected]> wrote:
> >>>
> >>>> On Sep 24, 2023, at 12:51 PM, Mantas Mikulėnas <[email protected]> wrote:
> >>>>
> >>>> Right, whereas on the server, the first file is filled with "Hello World (4092 bytes)" as I originally tried to narrow down the issue.
> >>>>
> >>>> Meanwhile, 6.4.x (Arch) clients don't seem to be having any problems with the same server, and with seemingly the same mount options.
> >>>>
> >>>> Thanks for looking into it!<nfs_krb5i_working_6.4client.pcap>
> >>> I found /a/ problem with the nfsd-fixes branch and krb5i, but
> >>> maybe not /your/ problem, and it's with a recent client. Scrounging
> >>> a v5.10-vintage client is a little more work, we'll see if that's
> >>> needed for confirming an eventual fix.
> >> The issue I reproduced appears to be unrelated.
> >>
> >> I'm wondering if I can get you to bisect the server kernel using
> >> your v5.10 client to test? good = v6.4, bad = v6.5 should do it.
> >
> > Yeah, I will try to bisect but it'll probably take a day or two.
>
> That's great, thank you!
>
> I'm looking into setting up a virtual guest with v5.10 just in case.
> Turns out v5.10 does not build on Fedora latest.
>

I can reproduce this with an upstream client by running:

dd if=/mnt/4092byteslen of=/dev/null bs=4092 count=1 iflag=direct

> --
> Chuck Lever
>
>

2023-09-27 09:15:40

by Chuck Lever

Subject: Re: Data corruption with 5.10.x client -> 6.5.x server



> On Sep 24, 2023, at 10:32 AM, Mantas Mikulėnas <[email protected]> wrote:
>
> On 2023-09-24 16:28, Chuck Lever III wrote:
>>> On Sep 24, 2023, at 9:07 AM, Mantas Mikulėnas <[email protected]> wrote:
>>>
>>> I've recently upgraded my home NFS server from 6.4.12 to 6.5.4 (running Arch Linux x86_64).
>>>
>>> Now, when I'm accessing the server over NFSv4.2 from a client that's running 5.10.0 (32-bit x86, Debian 11), if the mount is using sec=krb5i or sec=krb5p, trying to read a file that's <= 4092 bytes in size will return all-zero data. (That is, `hexdump -C file` shows "00 00 00...") Files that are 4093 bytes or larger seem to be unaffected.
>>>
>>> Only sec=krb5i/krb5p are affected by this – plain sec=krb5 (or sec=sys for that matter) seems to work without any problems.
>>>
>>> Newer clients (like 6.1.x or 6.4.x) don't seem to have any issues, it's only 5.10.0 that does... though it might also be that the client is 32-bit, but the same client did work previously when the server was running older kernels, so I still suspect 6.5.x on the server being the problem.
>>>
>>> Upgrading to 6.6.0-rc2 on the server hasn't changed anything.
>>> The server is using Btrfs but I've tested with tmpfs as well.
>> I'm guessing proto=tcp as well (as opposed to proto=rdma).
>
> Yes, it's TCP.
>
> (I do have RDMA set up between two of the 6.5.x server systems, but in this case all the clients I've tested were TCP-only, and the home server that I originally noticed the problem with doesn't have RDMA at all.)
>
>> Does the problem go away with vers=4.1 ?
>
> No, it doesn't (neither with 4.0).
>
>> Can you capture network traffic during the failure? Use sec=krb5i so
>> we can see the RPC payloads. On the client:
>> # tcpdump -iany -s0 -w/tmp/sniffer.pcap
>
> Attached. (The script I've been using for testing mounts with -o sec=krb5i, cats three files, then unmounts.)<nfs_krb5i.pcap>

I see three NFS READs in the capture.

The first READ payload is all zeroes. The second payload contains
"Hello World (4093 bytes)" repeatedly, and the third contains
"Hello World (4096 bytes)" repeatedly.

Let me see if I can reproduce this in my lab.

--
Chuck Lever


Subject: Re: Data corruption with 5.10.x client -> 6.5.x server

[CCing the regression list, as it should be in the loop for regressions:
https://docs.kernel.org/admin-guide/reporting-regressions.html]

[TLDR: I'm adding this report to the list of tracked Linux kernel
regressions; the text you find below is based on a few templates
paragraphs you might have encountered already in similar form.
See link in footer if these mails annoy you.]

On 24.09.23 15:07, Mantas Mikulėnas wrote:
> I've recently upgraded my home NFS server from 6.4.12 to 6.5.4 (running
> Arch Linux x86_64).
>
> Now, when I'm accessing the server over NFSv4.2 from a client that's
> running 5.10.0 (32-bit x86, Debian 11), if the mount is using sec=krb5i
> or sec=krb5p, trying to read a file that's <= 4092 bytes in size will
> return all-zero data. (That is, `hexdump -C file` shows "00 00 00...")
> Files that are 4093 bytes or larger seem to be unaffected.
>
> Only sec=krb5i/krb5p are affected by this – plain sec=krb5 (or sec=sys
> for that matter) seems to work without any problems.
>
> Newer clients (like 6.1.x or 6.4.x) don't seem to have any issues, it's
> only 5.10.0 that does... though it might also be that the client is
> 32-bit, but the same client did work previously when the server was
> running older kernels, so I still suspect 6.5.x on the server being the
> problem.
>
> Upgrading to 6.6.0-rc2 on the server hasn't changed anything.
> The server is using Btrfs but I've tested with tmpfs as well.

Thanks for the report. To be sure the issue doesn't fall through the
cracks unnoticed, I'm adding it to regzbot, the Linux kernel regression
tracking bot:

#regzbot ^introduced 703d7521555504b3a316b105b4806d641
#regzbot title nfs: Data corruption with 5.10.x client -> 6.5.x server
#regzbot ignore-activity

This isn't a regression? This issue or a fix for it are already
discussed somewhere else? It was fixed already? You want to clarify when
the regression started to happen? Or point out I got the title or
something else totally wrong? Then just reply and tell me -- ideally
while also telling regzbot about it, as explained by the page listed in
the footer of this mail.

Developers: When fixing the issue, remember to add 'Link:' tags pointing
to the report (the parent of this mail). See page linked in footer for
details.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
That page also explains what to do if mails like this annoy you.

2023-09-27 21:00:42

by Mantas Mikulėnas

Subject: Re: Data corruption with 5.10.x client -> 6.5.x server

On 26/09/2023 16.57, Chuck Lever III wrote:
> I'm wondering if I can get you to bisect the server kernel using
> your v5.10 client to test? good = v6.4, bad = v6.5 should do it.
>> Yeah, I will try to bisect but it'll probably take a day or two.

I'm *nearly* done with bisect (most of the builds with distro config
took over an hour on this aging Xeon), and I'm currently in the middle of:

518f375 [refs/bisect/bad] nfsd: don't provide pre/post-op attrs if fh_getattr fails
df56b38 NFSD: Remove nfsd_readv()
703d752 [HEAD] NFSD: Hoist rq_vec preparation into nfsd_read() [step two]
507df40 NFSD: Hoist rq_vec preparation into nfsd_read()
ed4a567 [refs/bisect/good-XX] NFSD: Update rq_next_page between COMPOUND operations


2023-09-28 06:27:15

by Mantas Mikulėnas

Subject: Re: Data corruption with 5.10.x client -> 6.5.x server

On 27/09/2023 11.45, Mantas Mikulėnas wrote:
> On 26/09/2023 16.57, Chuck Lever III wrote:
>> I'm wondering if I can get you to bisect the server kernel using
>> your v5.10 client to test? good = v6.4, bad = v6.5 should do it.
>>> Yeah, I will try to bisect but it'll probably take a day or two.
>
> I'm *nearly* done with bisect (most of the builds with distro config
> took over an hour on this aging Xeon), and I'm currently in the middle
> of:

Now it's done with:

703d7521555504b3a316b105b4806d641b7ebc76 is the first bad commit
commit 703d7521555504b3a316b105b4806d641b7ebc76
Author: Chuck Lever <[email protected]>
Date:   Thu May 18 13:46:03 2023 -0400

    NFSD: Hoist rq_vec preparation into nfsd_read() [step two]

2023-09-28 12:07:51

by Chuck Lever

Subject: Re: Data corruption with 5.10.x client -> 6.5.x server



> On Sep 27, 2023, at 5:41 AM, Mantas Mikulėnas <[email protected]> wrote:
>
> On 27/09/2023 11.45, Mantas Mikulėnas wrote:
>> On 26/09/2023 16.57, Chuck Lever III wrote:
>>> I'm wondering if I can get you to bisect the server kernel using
>>> your v5.10 client to test? good = v6.4, bad = v6.5 should do it.
>>>> Yeah, I will try to bisect but it'll probably take a day or two.
>>
>> I'm *nearly* done with bisect (most of the builds with distro config took over an hour on this aging Xeon), and I'm currently in the middle of:
>
> Now it's done with:
>
> 703d7521555504b3a316b105b4806d641b7ebc76 is the first bad commit
> commit 703d7521555504b3a316b105b4806d641b7ebc76
> Author: Chuck Lever <[email protected]>
> Date: Thu May 18 13:46:03 2023 -0400
>
> NFSD: Hoist rq_vec preparation into nfsd_read() [step two]

That's even a plausible bisect result!

The difference between the v5.10 client capture and the v6.4
capture you sent us is that the v5.10 client asks for only
4092 bytes in its NFS READ request. The v6.4 client asks for
4096, so the server bug is avoided.

Olga's reproducer tickles the bug by using O_DIRECT to force
the client to request exactly 4092 bytes.
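
As a sketch of how that kind of check could be scripted on a client: the helper below reads a fixed number of bytes in one request and compares them against a known-good copy. The name `check_direct_read` and the example paths are hypothetical, and `iflag=direct` is a GNU dd option, so this assumes GNU coreutils on the client.

```shell
#!/bin/sh
# check_direct_read FILE COUNT REFERENCE [DD_ARGS...]
# Read COUNT bytes from FILE in a single dd request and compare them with
# REFERENCE, a known-good copy of the same data. Extra arguments are passed
# to dd unchanged, e.g. iflag=direct to bypass the client's page cache.
check_direct_read() {
    file=$1
    count=$2
    reference=$3
    shift 3
    if dd if="$file" bs="$count" count=1 "$@" 2>/dev/null | cmp -s - "$reference"; then
        echo "match"
    else
        echo "MISMATCH"
    fi
}

# Example invocation (both paths are assumptions):
#   check_direct_read /mnt/4092byteslen 4092 /root/ref/4092byteslen iflag=direct
```

Comparing against a reference copy rather than discarding the data to /dev/null means a read that succeeds but returns zeroes is still flagged.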

Let me take a closer look at this.


--
Chuck Lever


2023-09-28 18:46:36

by Chuck Lever

Subject: Re: Data corruption with 5.10.x client -> 6.5.x server



> On Sep 26, 2023, at 5:52 PM, Olga Kornievskaia <[email protected]> wrote:
>
> On Tue, Sep 26, 2023 at 10:08 AM Chuck Lever III <[email protected]> wrote:
>>
>>
>>
>>> On Sep 26, 2023, at 12:41 AM, Mantas Mikulėnas <[email protected]> wrote:
>>>
>>> On 25/09/2023 22.22, Chuck Lever III wrote:
>>>>> On Sep 24, 2023, at 2:24 PM, Chuck Lever III <[email protected]> wrote:
>>>>>
>>>>>> On Sep 24, 2023, at 12:51 PM, Mantas Mikulėnas <[email protected]> wrote:
>>>>>>
>>>>>> Right, whereas on the server, the first file is filled with "Hello World (4092 bytes)" as I originally tried to narrow down the issue.
>>>>>>
>>>>>> Meanwhile, 6.4.x (Arch) clients don't seem to be having any problems with the same server, and with seemingly the same mount options.
>>>>>>
>>>>>> Thanks for looking into it!<nfs_krb5i_working_6.4client.pcap>
>>>>> I found /a/ problem with the nfsd-fixes branch and krb5i, but
>>>>> maybe not /your/ problem, and it's with a recent client. Scrounging
>>>>> a v5.10-vintage client is a little more work, we'll see if that's
>>>>> needed for confirming an eventual fix.
>>>> The issue I reproduced appears to be unrelated.
>>>>
>>>> I'm wondering if I can get you to bisect the server kernel using
>>>> your v5.10 client to test? good = v6.4, bad = v6.5 should do it.
>>>
>>> Yeah, I will try to bisect but it'll probably take a day or two.
>>
>> That's great, thank you!
>>
>> I'm looking into setting up a virtual guest with v5.10 just in case.
>> Turns out v5.10 does not build on Fedora latest.
>
> I can reproduce this with upstream client do dd if=/mnt/4092byteslen
> of=/dev/null bs=4092 count=1 iflag=direct

Hrm.

[cel@morisot cthon04]$ nfsstat -m
/mnt/bazille from bazille.1015granger.net:/export/btrfs
Flags: rw,relatime,vers=4.1,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=krb5i,clientaddr=192.168.1.67,local_lock=none,addr=192.168.1.56

[cel@morisot cthon04]$ ls -l /mnt/bazille/4092byteslen
-rw-r--r-- 1 cel users 4092 Sep 26 18:52 /mnt/bazille/4092byteslen
[cel@morisot cthon04]$ dd if=/mnt/bazille/4092byteslen of=/dev/null bs=4092 count=1 iflag=direct
1+0 records in
1+0 records out
4092 bytes (4.1 kB, 4.0 KiB) copied, 0.00059679 s, 6.9 MB/s
[cel@morisot cthon04]$

--
Chuck Lever