by Olga Kornievskaia

[permalink] [raw]

Subject: Re: problem on nfsd doing RDMA write

On Wed, Jan 31, 2018 at 4:13 PM, Chuck Lever <[email protected]> wrote:
>
>
>> On Jan 31, 2018, at 2:51 PM, Chuck Lever <[email protected]> wrote:
>>
>>
>>
>>> On Jan 31, 2018, at 1:52 PM, Olga Kornievskaia <[email protected]> wrote:
>>>
>>> On Wed, Jan 31, 2018 at 1:29 PM, Chuck Lever <[email protected]> wrote:
>>>>
>>>>
>>>>> On Jan 30, 2018, at 12:02 PM, Olga Kornievskaia <[email protected]> wrote:
>>>>>
>>>>> On Mon, Jan 29, 2018 at 12:24 PM, Chuck Lever <[email protected]> wrote:
>>>>>>
>>>>>>
>>>>>>> On Jan 29, 2018, at 12:19 PM, Olga Kornievskaia <[email protected]> wrote:
>>>>>>>
>>>>>>> On Fri, Jan 26, 2018 at 4:08 PM, Chuck Lever <[email protected]> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>>> On Jan 26, 2018, at 12:13 PM, Olga Kornievskaia <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>> On Fri, Jan 26, 2018 at 2:29 PM, Chuck Lever <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> On Jan 26, 2018, at 11:05 AM, Olga Kornievskaia <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Jan 26, 2018 at 1:23 PM, Chuck Lever <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> On Jan 26, 2018, at 9:24 AM, Olga Kornievskaia <[email protected]> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hi Bruce/Chuck,
>>>>>>>>>>>>>
>>>>>>>>>>>>> There is a problem with nfsd doing a WRITE of size 4093,4094,4095. To
>>>>>>>>>>>>> reproduce, mount with RDMA and do a simple dd "dd if=/dev/zero
>>>>>>>>>>>>> of=/mnt/testfile bs=4093 count=1". What happens is that nfsd fails to
>>>>>>>>>>>>> parse GETATTR after the WRITE in the compound. It fails the operation
>>>>>>>>>>>>> with ERR_OP_ILLEGAL.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The problem happens for the values where XDR round up ends up rounding
>>>>>>>>>>>>> up to the page size. I don't know if my fix is the appropriate way to
>>>>>>>>>>>>> fix this but with it I don't get the error:
>>>>>>>>>>>>>
>>>>>>>>>>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
>>>>>>>>>>>>> index 2c61c6b..a8489c3 100644
>>>>>>>>>>>>> --- a/fs/nfsd/nfs4xdr.c
>>>>>>>>>>>>> +++ b/fs/nfsd/nfs4xdr.c
>>>>>>>>>>>>> @@ -1289,11 +1289,12 @@ static __be32 nfsd4_decode_opaque(struct nfsd4_compounda
>>>>>>>>>>>>>
>>>>>>>>>>>>> len = XDR_QUADLEN(write->wr_buflen) << 2;
>>>>>>>>>>>>> if (len >= avail) {
>>>>>>>>>>>>> - int pages;
>>>>>>>>>>>>> + int pages = 0;
>>>>>>>>>>>>>
>>>>>>>>>>>>> len -= avail;
>>>>>>>>>>>>>
>>>>>>>>>>>>> - pages = len >> PAGE_SHIFT;
>>>>>>>>>>>>> + if (write->wr_buflen >= PAGE_SIZE)
>>>>>>>>>>>>> + pages = len >> PAGE_SHIFT;
>>>>>>>>>>>>> argp->pagelist += pages;
>>>>>>>>>>>>> argp->pagelen -= pages * PAGE_SIZE;
>>>>>>>>>>>>> len -= pages * PAGE_SIZE;
>>>>>>>>>>>>>
>>>>>>>>>>>>> So the problem is the using "len" instead of "write->wr_buflen" leads
>>>>>>>>>>>>> for the values 4093,4094,4095 that are rounded up to 4096, it makes
>>>>>>>>>>>>> pages=1 and the argp->pagelen ends up being a negative value leading
>>>>>>>>>>>>> to problems of parsing the GETATTR.
>>>>>>>>>>>>
>>>>>>>>>>>> Would this also be a problem near any page boundary, say, a
>>>>>>>>>>>> write length of 8191 bytes?
>>>>>>>>>>>>
>>>>>>>>>>>> Instead of using the rounded up "len", why not try this:
>>>>>>>>>>>>
>>>>>>>>>>>> - pages = len >> PAGE_SHIFT;
>>>>>>>>>>>> + pages = write->wr_buflen >> PAGE_SHIFT;
>>>>>>>>>>>
>>>>>>>>>>> You are right. It needs to be that. Otherwise 8191 fails the same way.
>>>>>>>>>>>
>>>>>>>>>>>> And be sure to test with TCP as well.
>>>>>>>>>>>
>>>>>>>>>>> Sigh. It breaks normal (non-RDMA) mounts. I'll figure out why.
>>>>>>>>>>
>>>>>>>>>> OK.
>>>>>>>>>>
>>>>>>>>>> Remember that a Read chunk's length does not have to be
>>>>>>>>>> rounded up. Maybe the transport needs to round up the
>>>>>>>>>> length of the unmarshaled data content on behalf of the
>>>>>>>>>> NFSv4 write decoder.
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>> The problem of simply taking write->wr_buflen was that len before that
>>>>>>>>> could have been adjusted by avail value in then non-RDAM mounts.
>>>>>>>>>
>>>>>>>>> Again, I'm not sure if this is the right fix. But this one works for
>>>>>>>>> both non-RDMA and RDMA mounts.
>>>>>>>>>
>>>>>>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
>>>>>>>>> index 2c61c6b..3178997 100644
>>>>>>>>> --- a/fs/nfsd/nfs4xdr.c
>>>>>>>>> +++ b/fs/nfsd/nfs4xdr.c
>>>>>>>>> @@ -1293,7 +1293,10 @@ static __be32 nfsd4_decode_opaque(struct
>>>>>>>>> nfsd4_compoundargs *argp, struct xdr_ne
>>>>>>>>>
>>>>>>>>> len -= avail;
>>>>>>>>>
>>>>>>>>> - pages = len >> PAGE_SHIFT;
>>>>>>>>> + if (!avail)
>>>>>>>>> + pages = write->wr_buflen >> PAGE_SHIFT;
>>>>>>>>> + else
>>>>>>>>> + pages = len >> PAGE_SHIFT;
>>>>>>>>> argp->pagelist += pages;
>>>>>>>>> argp->pagelen -= pages * PAGE_SIZE;
>>>>>>>>> len -= pages * PAGE_SIZE;
>>>>>>>>
>>>>>>>> This code has been around since 2012. Has it always been
>>>>>>>> broken for RDMA? I suspect the root problem is in the
>>>>>>>> transport, not here in the decoder. Can you bisect to
>>>>>>>> find out where the problem started to occur?
>>>>>>>
>>>>>>> To see if I could go back to 2012, I first want to see if I could go
>>>>>>> back to before mlx5 driver was added which was in 3.11-rc1. But I'm
>>>>>>> having issues trying to boot into a kernel based on 3.10 or 3.9. I'm
>>>>>>> trying to verify if my CX-5 card would even work with mlx4 driver
>>>>>>> which I'm having doubts it would.
>>>>>>
>>>>>> The CX-5 works only with mlx5.
>>>>>>
>>>>>>
>>>>>>> These old kernels fail to boot in
>>>>>>> XFS saying that it's a wrong version. Some googling makes me think
>>>>>>> that since my XFS partition was created with a newer kernel it's not
>>>>>>> compatible with an older kernel...
>>>>>>
>>>>>> That's correct, the XFS filesystem has new features like
>>>>>> metadata checksum that don't work on older kernels.
>>>>>>
>>>>>> To start with, you could try some very late 3.x or early
>>>>>> 4.x kernels. My intuition is that you will find an svcrdma
>>>>>> change that is more recent than 2012 that is causing the
>>>>>> underlying issue.
>>>>>
>>>>> It works in 4.8 and does not work in 4.9. It's this commit that breaks it:
>>>>>
>>>>> commit cc9d83408b52265ddab2874cf19d1611da4ca7ee
>>>>>
>>>>> svcrdma: Server-side support for rpcrdma_connect_private
>>>>>
>>>>> Prepare to receive an RDMA-CM private message when handling a new
>>>>> connection attempt, and send a similar message as part of connection
>>>>> acceptance.
>>>>>
>>>>> Both sides can communicate their various implementation limits.
>>>>> Implementations that don't support this sideband protocol ignore it.
>>>>>
>>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>>> Reviewed-by: Sagi Grimberg <[email protected]>
>>>>> Signed-off-by: J. Bruce Fields <[email protected]>
>>>>>
>>>>> So far I don't understand how can causes problems...
>>>>
>>>> More likely it's:
>>>>
>>>>
>>>> commit eae03e2ac80a3476f0652cb0ee451d7b06d30564
>>>> Author: Chuck Lever <[email protected]>
>>>> AuthorDate: Fri Aug 18 11:12:27 2017 -0400
>>>> Commit: J. Bruce Fields <[email protected]>
>>>> CommitDate: Tue Sep 5 15:15:29 2017 -0400
>>>>
>>>> nfsd: Incoming xdr_bufs may have content in tail buffer
>>>>
>>>>
>>>>
>>>> This is a 3 op compound. The third op is GETATTR.
>>>> The reply shows that the WRITE op succeeds, which
>>>> is why no error is reported to the application.
>>>> The GETATTR fails with OP_ILLEGAL, and the client
>>>> I guess is trained to ignore that kind of failure.
>>>>
>>>> After nfsd4_decode_write pushes into the "next" page
>>>> in the page list, the next time READ_BUF is called
>>>> (to handle the next op in the compound), it is
>>>> supposed to see that the XDR page list is now
>>>> exhausted, and switch to the XDR tail buf.
>>>>
>>>> TCP doesn't use the tail buf; the GETATTR in that
>>>> case is at the end of the page list, following the
>>>> WRITE payload. That would explain why this issue
>>>> cannot be reproduced there.
>>>>
>>>> RDMA does use the tail buf in this case. For some
>>>> reason the logic added in eae03e2ac8 is not working,
>>>> and the server is looking at uninitialized memory
>>>> to find that third op.
>>>>
>>>> I run cthon04 all the time, so I wonder why I never
>>>> hit this case.
>>>
>>> While this commit seems like a better explanation, I can 100% hit the
>>> issue running 4.8-rc6 based kernel (this is from your nfsd-for-4.9
>>> branch and rollback back to the cc9d83408b) . What matters is the
>>> assignment of the "private_data" in the conn_param. If I set it to
>>> NULL and 0 for the length. The problem goes away.
>>
>> There's no obvious connection between the NFSv4 WRITE
>> misbehavior and CM private data being present.
>
> You observed that this logic in nfsd4_decode_write is
> making argp->pagelen go negative:
>
> 1304 pages = len >> PAGE_SHIFT;
> 1305 argp->pagelist += pages;
> 1306 argp->pagelen -= pages * PAGE_SIZE;
> 1307 len -= pages * PAGE_SIZE;
>
> This is because the client does not insert XDR round-up
> in the Read chunk. Since pagelen is not zero, the code
> in read_buf that is looking for that case (added by
> eae03e2ac80a) fails.
>
> I suspect what's going on here is:
>
> When you disable the private_data logic, the server is
> not "receiving" the CM private data, and thus it assumes
> the client is not Linux. In response, the server does not
> provide CM private data in it's "connection accept"
> response.
>
> The client then assumes the _server_ is not Linux, and
> disables an optimization that removes the XDR round-up
> from Read chunks.
>
> So the client adds that round-up padding to the end of
> the Read chunks, and that prevents the bug. pagelen
> never goes below zero.
>
> Basically nfsd4_decode_write() assumes that the XDR
> round-up for the WRITE payload is always going to be
> present. The round-up is optional for Read chunks,
> according to RFC 8166.
>
> 193bcb7b37 adds that round-up in the tail iovec. It looks
> like that won't be adequate in the case where the payload
> approaches a page boundary.
>
> Does this seem plausible?

Yes your explanation regarding the presence of CM private data that
toggles what client sends is exactly what I'm seeing.

I don't understand the last paragraph. But it sounds like server needs
to check that if client is linux then it needs to do a round up (and
also actually add padding which would have been done on the client)
because of the nfsd4_decode_write()'s assumptions.

>
>
>>> Are you hitting the issue now?
>>
>> I can reproduce it 100% of the time using your dd example
>> from a Linux NFS client.
>>
>> I've also done a 8009 byte write on TCP, which forces the
>> same page boundary edge condition in nfsd4_decode_write,
>> and the failure does not occur.
>>
>>
>>> Are you 100% certain that you haven't
>>> hit it before. Do you always inspect the network traces? Without
>>> inspecting the network trace you wouldn't have found it. cthon never
>>> fails so from the user perspective nothing failed.
>>
>> Fair enough.
>>
>>
>>> I will check eae03e2ac80a3476f0652cb0ee451d7b06d30564 commit to see
>>> what different do I see in my testing.
>>
>> My theory is it's that one or this related one:
>>
>> commit 193bcb7b3719a315814dce40fc8dcd1af786ea64
>> Author: Chuck Lever <[email protected]>
>> AuthorDate: Fri Aug 18 11:12:35 2017 -0400
>> Commit: J. Bruce Fields <[email protected]>
>> CommitDate: Tue Sep 5 15:15:29 2017 -0400
>>
>> svcrdma: Populate tail iovec when receiving
>>
>>
>>>>>>>>>>>>> If this looks OK, I can send a patch.
>>>>>>>>>>>>> --
>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>>>>>>>> the body of a message to [email protected]
>>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> Chuck Lever
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> --
>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>>>>>>> the body of a message to [email protected]
>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>> --
>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>>>>>> the body of a message to [email protected]
>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Chuck Lever
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>>>> the body of a message to [email protected]
>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>>>
>>>>>>>> --
>>>>>>>> Chuck Lever
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>> --
>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>>>> the body of a message to [email protected]
>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>>>
>>>>>> --
>>>>>> Chuck Lever
>>>>>>
>>>>>>
>>>>>>
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>>>>> the body of a message to [email protected]
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>>>
>>>> --
>>>> Chuck Lever
>>
>> --
>> Chuck Lever
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> Chuck Lever
>
>
>

2018-01-31 21:52:41

by Chuck Lever III

[permalink] [raw]

Subject: Re: problem on nfsd doing RDMA write

> On Jan 31, 2018, at 4:42 PM, Olga Kornievskaia <[email protected]> wrote:
>=20
> On Wed, Jan 31, 2018 at 4:13 PM, Chuck Lever <[email protected]> =
wrote:
>>=20
>>=20
>>> On Jan 31, 2018, at 2:51 PM, Chuck Lever <[email protected]> =
wrote:
>>>=20
>>>=20
>>>=20
>>>> On Jan 31, 2018, at 1:52 PM, Olga Kornievskaia <[email protected]> =
wrote:
>>>>=20
>>>> On Wed, Jan 31, 2018 at 1:29 PM, Chuck Lever =
<[email protected]> wrote:
>>>>>=20
>>>>>=20
>>>>>> On Jan 30, 2018, at 12:02 PM, Olga Kornievskaia <[email protected]> =
wrote:
>>>>>>=20
>>>>>> On Mon, Jan 29, 2018 at 12:24 PM, Chuck Lever =
<[email protected]> wrote:
>>>>>>>=20
>>>>>>>=20
>>>>>>>> On Jan 29, 2018, at 12:19 PM, Olga Kornievskaia =
<[email protected]> wrote:
>>>>>>>>=20
>>>>>>>> On Fri, Jan 26, 2018 at 4:08 PM, Chuck Lever =
<[email protected]> wrote:
>>>>>>>>>=20
>>>>>>>>>=20
>>>>>>>>>> On Jan 26, 2018, at 12:13 PM, Olga Kornievskaia =
<[email protected]> wrote:
>>>>>>>>>>=20
>>>>>>>>>> On Fri, Jan 26, 2018 at 2:29 PM, Chuck Lever =
<[email protected]> wrote:
>>>>>>>>>>>=20
>>>>>>>>>>>=20
>>>>>>>>>>>> On Jan 26, 2018, at 11:05 AM, Olga Kornievskaia =
<[email protected]> wrote:
>>>>>>>>>>>>=20
>>>>>>>>>>>> On Fri, Jan 26, 2018 at 1:23 PM, Chuck Lever =
<[email protected]> wrote:
>>>>>>>>>>>>>=20
>>>>>>>>>>>>>=20
>>>>>>>>>>>>>> On Jan 26, 2018, at 9:24 AM, Olga Kornievskaia =
<[email protected]> wrote:
>>>>>>>>>>>>>>=20
>>>>>>>>>>>>>> Hi Bruce/Chuck,
>>>>>>>>>>>>>>=20
>>>>>>>>>>>>>> There is a problem with nfsd doing a WRITE of size =
4093,4094,4095. To
>>>>>>>>>>>>>> reproduce, mount with RDMA and do a simple dd "dd =
if=3D/dev/zero
>>>>>>>>>>>>>> of=3D/mnt/testfile bs=3D4093 count=3D1". What happens is =
that nfsd fails to
>>>>>>>>>>>>>> parse GETATTR after the WRITE in the compound. It fails =
the operation
>>>>>>>>>>>>>> with ERR_OP_ILLEGAL.
>>>>>>>>>>>>>>=20
>>>>>>>>>>>>>> The problem happens for the values where XDR round up =
ends up rounding
>>>>>>>>>>>>>> up to the page size. I don't know if my fix is the =
appropriate way to
>>>>>>>>>>>>>> fix this but with it I don't get the error:
>>>>>>>>>>>>>>=20
>>>>>>>>>>>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
>>>>>>>>>>>>>> index 2c61c6b..a8489c3 100644
>>>>>>>>>>>>>> --- a/fs/nfsd/nfs4xdr.c
>>>>>>>>>>>>>> +++ b/fs/nfsd/nfs4xdr.c
>>>>>>>>>>>>>> @@ -1289,11 +1289,12 @@ static __be32 =
nfsd4_decode_opaque(struct nfsd4_compounda
>>>>>>>>>>>>>>=20
>>>>>>>>>>>>>> len =3D XDR_QUADLEN(write->wr_buflen) << 2;
>>>>>>>>>>>>>> if (len >=3D avail) {
>>>>>>>>>>>>>> - int pages;
>>>>>>>>>>>>>> + int pages =3D 0;
>>>>>>>>>>>>>>=20
>>>>>>>>>>>>>> len -=3D avail;
>>>>>>>>>>>>>>=20
>>>>>>>>>>>>>> - pages =3D len >> PAGE_SHIFT;
>>>>>>>>>>>>>> + if (write->wr_buflen >=3D PAGE_SIZE)
>>>>>>>>>>>>>> + pages =3D len >> PAGE_SHIFT;
>>>>>>>>>>>>>> argp->pagelist +=3D pages;
>>>>>>>>>>>>>> argp->pagelen -=3D pages * PAGE_SIZE;
>>>>>>>>>>>>>> len -=3D pages * PAGE_SIZE;
>>>>>>>>>>>>>>=20
>>>>>>>>>>>>>> So the problem is the using "len" instead of =
"write->wr_buflen" leads
>>>>>>>>>>>>>> for the values 4093,4094,4095 that are rounded up to =
4096, it makes
>>>>>>>>>>>>>> pages=3D1 and the argp->pagelen ends up being a negative =
value leading
>>>>>>>>>>>>>> to problems of parsing the GETATTR.
>>>>>>>>>>>>>=20
>>>>>>>>>>>>> Would this also be a problem near any page boundary, say, =
a
>>>>>>>>>>>>> write length of 8191 bytes?
>>>>>>>>>>>>>=20
>>>>>>>>>>>>> Instead of using the rounded up "len", why not try this:
>>>>>>>>>>>>>=20
>>>>>>>>>>>>> - pages =3D len >> PAGE_SHIFT;
>>>>>>>>>>>>> + pages =3D write->wr_buflen >> PAGE_SHIFT;
>>>>>>>>>>>>=20
>>>>>>>>>>>> You are right. It needs to be that. Otherwise 8191 fails =
the same way.
>>>>>>>>>>>>=20
>>>>>>>>>>>>> And be sure to test with TCP as well.
>>>>>>>>>>>>=20
>>>>>>>>>>>> Sigh. It breaks normal (non-RDMA) mounts. I'll figure out =
why.
>>>>>>>>>>>=20
>>>>>>>>>>> OK.
>>>>>>>>>>>=20
>>>>>>>>>>> Remember that a Read chunk's length does not have to be
>>>>>>>>>>> rounded up. Maybe the transport needs to round up the
>>>>>>>>>>> length of the unmarshaled data content on behalf of the
>>>>>>>>>>> NFSv4 write decoder.
>>>>>>>>>>>=20
>>>>>>>>>>=20
>>>>>>>>>> The problem of simply taking write->wr_buflen was that len =
before that
>>>>>>>>>> could have been adjusted by avail value in then non-RDAM =
mounts.
>>>>>>>>>>=20
>>>>>>>>>> Again, I'm not sure if this is the right fix. But this one =
works for
>>>>>>>>>> both non-RDMA and RDMA mounts.
>>>>>>>>>>=20
>>>>>>>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c
>>>>>>>>>> index 2c61c6b..3178997 100644
>>>>>>>>>> --- a/fs/nfsd/nfs4xdr.c
>>>>>>>>>> +++ b/fs/nfsd/nfs4xdr.c
>>>>>>>>>> @@ -1293,7 +1293,10 @@ static __be32 =
nfsd4_decode_opaque(struct
>>>>>>>>>> nfsd4_compoundargs *argp, struct xdr_ne
>>>>>>>>>>=20
>>>>>>>>>> len -=3D avail;
>>>>>>>>>>=20
>>>>>>>>>> - pages =3D len >> PAGE_SHIFT;
>>>>>>>>>> + if (!avail)
>>>>>>>>>> + pages =3D write->wr_buflen >> =
PAGE_SHIFT;
>>>>>>>>>> + else
>>>>>>>>>> + pages =3D len >> PAGE_SHIFT;
>>>>>>>>>> argp->pagelist +=3D pages;
>>>>>>>>>> argp->pagelen -=3D pages * PAGE_SIZE;
>>>>>>>>>> len -=3D pages * PAGE_SIZE;
>>>>>>>>>=20
>>>>>>>>> This code has been around since 2012. Has it always been
>>>>>>>>> broken for RDMA? I suspect the root problem is in the
>>>>>>>>> transport, not here in the decoder. Can you bisect to
>>>>>>>>> find out where the problem started to occur?
>>>>>>>>=20
>>>>>>>> To see if I could go back to 2012, I first want to see if I =
could go
>>>>>>>> back to before mlx5 driver was added which was in 3.11-rc1. But =
I'm
>>>>>>>> having issues trying to boot into a kernel based on 3.10 or =
3.9. I'm
>>>>>>>> trying to verify if my CX-5 card would even work with mlx4 =
driver
>>>>>>>> which I'm having doubts it would.
>>>>>>>=20
>>>>>>> The CX-5 works only with mlx5.
>>>>>>>=20
>>>>>>>=20
>>>>>>>> These old kernels fail to boot in
>>>>>>>> XFS saying that it's a wrong version. Some googling makes me =
think
>>>>>>>> that since my XFS partition was created with a newer kernel =
it's not
>>>>>>>> compatible with an older kernel...
>>>>>>>=20
>>>>>>> That's correct, the XFS filesystem has new features like
>>>>>>> metadata checksum that don't work on older kernels.
>>>>>>>=20
>>>>>>> To start with, you could try some very late 3.x or early
>>>>>>> 4.x kernels. My intuition is that you will find an svcrdma
>>>>>>> change that is more recent than 2012 that is causing the
>>>>>>> underlying issue.
>>>>>>=20
>>>>>> It works in 4.8 and does not work in 4.9. It's this commit that =
breaks it:
>>>>>>=20
>>>>>> commit cc9d83408b52265ddab2874cf19d1611da4ca7ee
>>>>>>=20
>>>>>> svcrdma: Server-side support for rpcrdma_connect_private
>>>>>>=20
>>>>>> Prepare to receive an RDMA-CM private message when handling a new
>>>>>> connection attempt, and send a similar message as part of =
connection
>>>>>> acceptance.
>>>>>>=20
>>>>>> Both sides can communicate their various implementation limits.
>>>>>> Implementations that don't support this sideband protocol ignore =
it.
>>>>>>=20
>>>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>>>> Reviewed-by: Sagi Grimberg <[email protected]>
>>>>>> Signed-off-by: J. Bruce Fields <[email protected]>
>>>>>>=20
>>>>>> So far I don't understand how can causes problems...
>>>>>=20
>>>>> More likely it's:
>>>>>=20
>>>>>=20
>>>>> commit eae03e2ac80a3476f0652cb0ee451d7b06d30564
>>>>> Author: Chuck Lever <[email protected]>
>>>>> AuthorDate: Fri Aug 18 11:12:27 2017 -0400
>>>>> Commit: J. Bruce Fields <[email protected]>
>>>>> CommitDate: Tue Sep 5 15:15:29 2017 -0400
>>>>>=20
>>>>> nfsd: Incoming xdr_bufs may have content in tail buffer
>>>>>=20
>>>>>=20
>>>>>=20
>>>>> This is a 3 op compound. The third op is GETATTR.
>>>>> The reply shows that the WRITE op succeeds, which
>>>>> is why no error is reported to the application.
>>>>> The GETATTR fails with OP_ILLEGAL, and the client
>>>>> I guess is trained to ignore that kind of failure.
>>>>>=20
>>>>> After nfsd4_decode_write pushes into the "next" page
>>>>> in the page list, the next time READ_BUF is called
>>>>> (to handle the next op in the compound), it is
>>>>> supposed to see that the XDR page list is now
>>>>> exhausted, and switch to the XDR tail buf.
>>>>>=20
>>>>> TCP doesn't use the tail buf; the GETATTR in that
>>>>> case is at the end of the page list, following the
>>>>> WRITE payload. That would explain why this issue
>>>>> cannot be reproduced there.
>>>>>=20
>>>>> RDMA does use the tail buf in this case. For some
>>>>> reason the logic added in eae03e2ac8 is not working,
>>>>> and the server is looking at uninitialized memory
>>>>> to find that third op.
>>>>>=20
>>>>> I run cthon04 all the time, so I wonder why I never
>>>>> hit this case.
>>>>=20
>>>> While this commit seems like a better explanation, I can 100% hit =
the
>>>> issue running 4.8-rc6 based kernel (this is from your nfsd-for-4.9
>>>> branch and rollback back to the cc9d83408b) . What matters is the
>>>> assignment of the "private_data" in the conn_param. If I set it to
>>>> NULL and 0 for the length. The problem goes away.
>>>=20
>>> There's no obvious connection between the NFSv4 WRITE
>>> misbehavior and CM private data being present.
>>=20
>> You observed that this logic in nfsd4_decode_write is
>> making argp->pagelen go negative:
>>=20
>> 1304 pages =3D len >> PAGE_SHIFT;
>> 1305 argp->pagelist +=3D pages;
>> 1306 argp->pagelen -=3D pages * PAGE_SIZE;
>> 1307 len -=3D pages * PAGE_SIZE;
>>=20
>> This is because the client does not insert XDR round-up
>> in the Read chunk. Since pagelen is not zero, the code
>> in read_buf that is looking for that case (added by
>> eae03e2ac80a) fails.
>>=20
>> I suspect what's going on here is:
>>=20
>> When you disable the private_data logic, the server is
>> not "receiving" the CM private data, and thus it assumes
>> the client is not Linux. In response, the server does not
>> provide CM private data in it's "connection accept"
>> response.
>>=20
>> The client then assumes the _server_ is not Linux, and
>> disables an optimization that removes the XDR round-up
>> from Read chunks.
>>=20
>> So the client adds that round-up padding to the end of
>> the Read chunks, and that prevents the bug. pagelen
>> never goes below zero.
>>=20
>> Basically nfsd4_decode_write() assumes that the XDR
>> round-up for the WRITE payload is always going to be
>> present. The round-up is optional for Read chunks,
>> according to RFC 8166.
>>=20
>> 193bcb7b37 adds that round-up in the tail iovec. It looks
>> like that won't be adequate in the case where the payload
>> approaches a page boundary.
>>=20
>> Does this seem plausible?
>=20
> Yes your explanation regarding the presence of CM private data that
> toggles what client sends is exactly what I'm seeing.
>=20
> I don't understand the last paragraph. But it sounds like server needs
> to check that if client is linux then it needs to do a round up (and
> also actually add padding which would have been done on the client)
> because of the nfsd4_decode_write()'s assumptions.

The server-side transport can _always_ round-up the length
of a Read chunk. If the client included the round-up padding,
the length won't change. If the client didn't include padding,
then the length will be bumped up properly. The XDR decoders
should then be happy either way. I hope.

What I'm trying now is having svc_rdma_build_normal_read_chunk
round-up rq_arg->pagelen instead of rounding up the length of
the XDR tail iovec based on the length of the Read chunk.
Seems obvious, and I'm wondering why I didn't do it this way
in the first place. The comment there suggests the NFSv3 decoder
might be finicky about the XDR lengths.

>>>> Are you hitting the issue now?
>>>=20
>>> I can reproduce it 100% of the time using your dd example
>>> from a Linux NFS client.
>>>=20
>>> I've also done a 8009 byte write on TCP, which forces the
>>> same page boundary edge condition in nfsd4_decode_write,
>>> and the failure does not occur.
>>>=20
>>>=20
>>>> Are you 100% certain that you haven't
>>>> hit it before. Do you always inspect the network traces? Without
>>>> inspecting the network trace you wouldn't have found it. cthon =
never
>>>> fails so from the user perspective nothing failed.
>>>=20
>>> Fair enough.
>>>=20
>>>=20
>>>> I will check eae03e2ac80a3476f0652cb0ee451d7b06d30564 commit to see
>>>> what different do I see in my testing.
>>>=20
>>> My theory is it's that one or this related one:
>>>=20
>>> commit 193bcb7b3719a315814dce40fc8dcd1af786ea64
>>> Author: Chuck Lever <[email protected]>
>>> AuthorDate: Fri Aug 18 11:12:35 2017 -0400
>>> Commit: J. Bruce Fields <[email protected]>
>>> CommitDate: Tue Sep 5 15:15:29 2017 -0400
>>>=20
>>> svcrdma: Populate tail iovec when receiving
>>>=20
>>>=20
>>>>>>>>>>>>>> If this looks OK, I can send a patch.
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe =
linux-nfs" in
>>>>>>>>>>>>>> the body of a message to [email protected]
>>>>>>>>>>>>>> More majordomo info at =
http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>>>=20
>>>>>>>>>>>>> --
>>>>>>>>>>>>> Chuck Lever
>>>>>>>>>>>>>=20
>>>>>>>>>>>>>=20
>>>>>>>>>>>>>=20
>>>>>>>>>>>>> --
>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe =
linux-nfs" in
>>>>>>>>>>>>> the body of a message to [email protected]
>>>>>>>>>>>>> More majordomo info at =
http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>> --
>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe =
linux-nfs" in
>>>>>>>>>>>> the body of a message to [email protected]
>>>>>>>>>>>> More majordomo info at =
http://vger.kernel.org/majordomo-info.html
>>>>>>>>>>>=20
>>>>>>>>>>> --
>>>>>>>>>>> Chuck Lever
>>>>>>>>>>>=20
>>>>>>>>>>>=20
>>>>>>>>>>>=20
>>>>>>>>>> --
>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe =
linux-nfs" in
>>>>>>>>>> the body of a message to [email protected]
>>>>>>>>>> More majordomo info at =
http://vger.kernel.org/majordomo-info.html
>>>>>>>>>=20
>>>>>>>>> --
>>>>>>>>> Chuck Lever
>>>>>>>>>=20
>>>>>>>>>=20
>>>>>>>>>=20
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe =
linux-nfs" in
>>>>>>>> the body of a message to [email protected]
>>>>>>>> More majordomo info at =
http://vger.kernel.org/majordomo-info.html
>>>>>>>=20
>>>>>>> --
>>>>>>> Chuck Lever
>>>>>>>=20
>>>>>>>=20
>>>>>>>=20
>>>>>> --
>>>>>> To unsubscribe from this list: send the line "unsubscribe =
linux-nfs" in
>>>>>> the body of a message to [email protected]
>>>>>> More majordomo info at =
http://vger.kernel.org/majordomo-info.html
>>>>>=20
>>>>> --
>>>>> Chuck Lever
>>>=20
>>> --
>>> Chuck Lever
>>>=20
>>>=20
>>>=20
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" =
in
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>=20
>> --
>> Chuck Lever

--
Chuck Lever