Return-Path: Received: from aserp2120.oracle.com ([141.146.126.78]:45614 "EHLO aserp2120.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753240AbeAaVwl (ORCPT ); Wed, 31 Jan 2018 16:52:41 -0500 Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 11.2 \(3445.5.20\)) Subject: Re: problem on nfsd doing RDMA write From: Chuck Lever In-Reply-To: Date: Wed, 31 Jan 2018 16:52:31 -0500 Cc: Bruce Fields , Linux NFS Mailing List Message-Id: <5C8B7F8C-713E-46C3-8E26-CE7EF671D552@oracle.com> References: <0B3C835B-3E30-43F3-B53C-BFB4462543D3@oracle.com> <16CE3E21-2BF9-4CC5-A3E7-FFBE706A9EF4@oracle.com> <0F49479C-EB5B-4F86-B89B-7435FDD51E54@oracle.com> To: Olga Kornievskaia Sender: linux-nfs-owner@vger.kernel.org List-ID: > On Jan 31, 2018, at 4:42 PM, Olga Kornievskaia wrote: >=20 > On Wed, Jan 31, 2018 at 4:13 PM, Chuck Lever = wrote: >>=20 >>=20 >>> On Jan 31, 2018, at 2:51 PM, Chuck Lever = wrote: >>>=20 >>>=20 >>>=20 >>>> On Jan 31, 2018, at 1:52 PM, Olga Kornievskaia = wrote: >>>>=20 >>>> On Wed, Jan 31, 2018 at 1:29 PM, Chuck Lever = wrote: >>>>>=20 >>>>>=20 >>>>>> On Jan 30, 2018, at 12:02 PM, Olga Kornievskaia = wrote: >>>>>>=20 >>>>>> On Mon, Jan 29, 2018 at 12:24 PM, Chuck Lever = wrote: >>>>>>>=20 >>>>>>>=20 >>>>>>>> On Jan 29, 2018, at 12:19 PM, Olga Kornievskaia = wrote: >>>>>>>>=20 >>>>>>>> On Fri, Jan 26, 2018 at 4:08 PM, Chuck Lever = wrote: >>>>>>>>>=20 >>>>>>>>>=20 >>>>>>>>>> On Jan 26, 2018, at 12:13 PM, Olga Kornievskaia = wrote: >>>>>>>>>>=20 >>>>>>>>>> On Fri, Jan 26, 2018 at 2:29 PM, Chuck Lever = wrote: >>>>>>>>>>>=20 >>>>>>>>>>>=20 >>>>>>>>>>>> On Jan 26, 2018, at 11:05 AM, Olga Kornievskaia = wrote: >>>>>>>>>>>>=20 >>>>>>>>>>>> On Fri, Jan 26, 2018 at 1:23 PM, Chuck Lever = wrote: >>>>>>>>>>>>>=20 >>>>>>>>>>>>>=20 >>>>>>>>>>>>>> On Jan 26, 2018, at 9:24 AM, Olga Kornievskaia = wrote: >>>>>>>>>>>>>>=20 >>>>>>>>>>>>>> Hi Bruce/Chuck, >>>>>>>>>>>>>>=20 >>>>>>>>>>>>>> There is a problem with nfsd doing a WRITE of size = 4093,4094,4095. To >>>>>>>>>>>>>> reproduce, mount with RDMA and do a simple dd "dd = if=3D/dev/zero >>>>>>>>>>>>>> of=3D/mnt/testfile bs=3D4093 count=3D1". What happens is = that nfsd fails to >>>>>>>>>>>>>> parse GETATTR after the WRITE in the compound. It fails = the operation >>>>>>>>>>>>>> with ERR_OP_ILLEGAL. >>>>>>>>>>>>>>=20 >>>>>>>>>>>>>> The problem happens for the values where XDR round up = ends up rounding >>>>>>>>>>>>>> up to the page size. I don't know if my fix is the = appropriate way to >>>>>>>>>>>>>> fix this but with it I don't get the error: >>>>>>>>>>>>>>=20 >>>>>>>>>>>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c >>>>>>>>>>>>>> index 2c61c6b..a8489c3 100644 >>>>>>>>>>>>>> --- a/fs/nfsd/nfs4xdr.c >>>>>>>>>>>>>> +++ b/fs/nfsd/nfs4xdr.c >>>>>>>>>>>>>> @@ -1289,11 +1289,12 @@ static __be32 = nfsd4_decode_opaque(struct nfsd4_compounda >>>>>>>>>>>>>>=20 >>>>>>>>>>>>>> len =3D XDR_QUADLEN(write->wr_buflen) << 2; >>>>>>>>>>>>>> if (len >=3D avail) { >>>>>>>>>>>>>> - int pages; >>>>>>>>>>>>>> + int pages =3D 0; >>>>>>>>>>>>>>=20 >>>>>>>>>>>>>> len -=3D avail; >>>>>>>>>>>>>>=20 >>>>>>>>>>>>>> - pages =3D len >> PAGE_SHIFT; >>>>>>>>>>>>>> + if (write->wr_buflen >=3D PAGE_SIZE) >>>>>>>>>>>>>> + pages =3D len >> PAGE_SHIFT; >>>>>>>>>>>>>> argp->pagelist +=3D pages; >>>>>>>>>>>>>> argp->pagelen -=3D pages * PAGE_SIZE; >>>>>>>>>>>>>> len -=3D pages * PAGE_SIZE; >>>>>>>>>>>>>>=20 >>>>>>>>>>>>>> So the problem is the using "len" instead of = "write->wr_buflen" leads >>>>>>>>>>>>>> for the values 4093,4094,4095 that are rounded up to = 4096, it makes >>>>>>>>>>>>>> pages=3D1 and the argp->pagelen ends up being a negative = value leading >>>>>>>>>>>>>> to problems of parsing the GETATTR. >>>>>>>>>>>>>=20 >>>>>>>>>>>>> Would this also be a problem near any page boundary, say, = a >>>>>>>>>>>>> write length of 8191 bytes? >>>>>>>>>>>>>=20 >>>>>>>>>>>>> Instead of using the rounded up "len", why not try this: >>>>>>>>>>>>>=20 >>>>>>>>>>>>> - pages =3D len >> PAGE_SHIFT; >>>>>>>>>>>>> + pages =3D write->wr_buflen >> PAGE_SHIFT; >>>>>>>>>>>>=20 >>>>>>>>>>>> You are right. It needs to be that. Otherwise 8191 fails = the same way. >>>>>>>>>>>>=20 >>>>>>>>>>>>> And be sure to test with TCP as well. >>>>>>>>>>>>=20 >>>>>>>>>>>> Sigh. It breaks normal (non-RDMA) mounts. I'll figure out = why. >>>>>>>>>>>=20 >>>>>>>>>>> OK. >>>>>>>>>>>=20 >>>>>>>>>>> Remember that a Read chunk's length does not have to be >>>>>>>>>>> rounded up. Maybe the transport needs to round up the >>>>>>>>>>> length of the unmarshaled data content on behalf of the >>>>>>>>>>> NFSv4 write decoder. >>>>>>>>>>>=20 >>>>>>>>>>=20 >>>>>>>>>> The problem of simply taking write->wr_buflen was that len = before that >>>>>>>>>> could have been adjusted by avail value in then non-RDAM = mounts. >>>>>>>>>>=20 >>>>>>>>>> Again, I'm not sure if this is the right fix. But this one = works for >>>>>>>>>> both non-RDMA and RDMA mounts. >>>>>>>>>>=20 >>>>>>>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c >>>>>>>>>> index 2c61c6b..3178997 100644 >>>>>>>>>> --- a/fs/nfsd/nfs4xdr.c >>>>>>>>>> +++ b/fs/nfsd/nfs4xdr.c >>>>>>>>>> @@ -1293,7 +1293,10 @@ static __be32 = nfsd4_decode_opaque(struct >>>>>>>>>> nfsd4_compoundargs *argp, struct xdr_ne >>>>>>>>>>=20 >>>>>>>>>> len -=3D avail; >>>>>>>>>>=20 >>>>>>>>>> - pages =3D len >> PAGE_SHIFT; >>>>>>>>>> + if (!avail) >>>>>>>>>> + pages =3D write->wr_buflen >> = PAGE_SHIFT; >>>>>>>>>> + else >>>>>>>>>> + pages =3D len >> PAGE_SHIFT; >>>>>>>>>> argp->pagelist +=3D pages; >>>>>>>>>> argp->pagelen -=3D pages * PAGE_SIZE; >>>>>>>>>> len -=3D pages * PAGE_SIZE; >>>>>>>>>=20 >>>>>>>>> This code has been around since 2012. Has it always been >>>>>>>>> broken for RDMA? I suspect the root problem is in the >>>>>>>>> transport, not here in the decoder. Can you bisect to >>>>>>>>> find out where the problem started to occur? >>>>>>>>=20 >>>>>>>> To see if I could go back to 2012, I first want to see if I = could go >>>>>>>> back to before mlx5 driver was added which was in 3.11-rc1. But = I'm >>>>>>>> having issues trying to boot into a kernel based on 3.10 or = 3.9. I'm >>>>>>>> trying to verify if my CX-5 card would even work with mlx4 = driver >>>>>>>> which I'm having doubts it would. >>>>>>>=20 >>>>>>> The CX-5 works only with mlx5. >>>>>>>=20 >>>>>>>=20 >>>>>>>> These old kernels fail to boot in >>>>>>>> XFS saying that it's a wrong version. Some googling makes me = think >>>>>>>> that since my XFS partition was created with a newer kernel = it's not >>>>>>>> compatible with an older kernel... >>>>>>>=20 >>>>>>> That's correct, the XFS filesystem has new features like >>>>>>> metadata checksum that don't work on older kernels. >>>>>>>=20 >>>>>>> To start with, you could try some very late 3.x or early >>>>>>> 4.x kernels. My intuition is that you will find an svcrdma >>>>>>> change that is more recent than 2012 that is causing the >>>>>>> underlying issue. >>>>>>=20 >>>>>> It works in 4.8 and does not work in 4.9. It's this commit that = breaks it: >>>>>>=20 >>>>>> commit cc9d83408b52265ddab2874cf19d1611da4ca7ee >>>>>>=20 >>>>>> svcrdma: Server-side support for rpcrdma_connect_private >>>>>>=20 >>>>>> Prepare to receive an RDMA-CM private message when handling a new >>>>>> connection attempt, and send a similar message as part of = connection >>>>>> acceptance. >>>>>>=20 >>>>>> Both sides can communicate their various implementation limits. >>>>>> Implementations that don't support this sideband protocol ignore = it. >>>>>>=20 >>>>>> Signed-off-by: Chuck Lever >>>>>> Reviewed-by: Sagi Grimberg >>>>>> Signed-off-by: J. Bruce Fields >>>>>>=20 >>>>>> So far I don't understand how can causes problems... >>>>>=20 >>>>> More likely it's: >>>>>=20 >>>>>=20 >>>>> commit eae03e2ac80a3476f0652cb0ee451d7b06d30564 >>>>> Author: Chuck Lever >>>>> AuthorDate: Fri Aug 18 11:12:27 2017 -0400 >>>>> Commit: J. Bruce Fields >>>>> CommitDate: Tue Sep 5 15:15:29 2017 -0400 >>>>>=20 >>>>> nfsd: Incoming xdr_bufs may have content in tail buffer >>>>>=20 >>>>>=20 >>>>>=20 >>>>> This is a 3 op compound. The third op is GETATTR. >>>>> The reply shows that the WRITE op succeeds, which >>>>> is why no error is reported to the application. >>>>> The GETATTR fails with OP_ILLEGAL, and the client >>>>> I guess is trained to ignore that kind of failure. >>>>>=20 >>>>> After nfsd4_decode_write pushes into the "next" page >>>>> in the page list, the next time READ_BUF is called >>>>> (to handle the next op in the compound), it is >>>>> supposed to see that the XDR page list is now >>>>> exhausted, and switch to the XDR tail buf. >>>>>=20 >>>>> TCP doesn't use the tail buf; the GETATTR in that >>>>> case is at the end of the page list, following the >>>>> WRITE payload. That would explain why this issue >>>>> cannot be reproduced there. >>>>>=20 >>>>> RDMA does use the tail buf in this case. For some >>>>> reason the logic added in eae03e2ac8 is not working, >>>>> and the server is looking at uninitialized memory >>>>> to find that third op. >>>>>=20 >>>>> I run cthon04 all the time, so I wonder why I never >>>>> hit this case. >>>>=20 >>>> While this commit seems like a better explanation, I can 100% hit = the >>>> issue running 4.8-rc6 based kernel (this is from your nfsd-for-4.9 >>>> branch and rollback back to the cc9d83408b) . What matters is the >>>> assignment of the "private_data" in the conn_param. If I set it to >>>> NULL and 0 for the length. The problem goes away. >>>=20 >>> There's no obvious connection between the NFSv4 WRITE >>> misbehavior and CM private data being present. >>=20 >> You observed that this logic in nfsd4_decode_write is >> making argp->pagelen go negative: >>=20 >> 1304 pages =3D len >> PAGE_SHIFT; >> 1305 argp->pagelist +=3D pages; >> 1306 argp->pagelen -=3D pages * PAGE_SIZE; >> 1307 len -=3D pages * PAGE_SIZE; >>=20 >> This is because the client does not insert XDR round-up >> in the Read chunk. Since pagelen is not zero, the code >> in read_buf that is looking for that case (added by >> eae03e2ac80a) fails. >>=20 >> I suspect what's going on here is: >>=20 >> When you disable the private_data logic, the server is >> not "receiving" the CM private data, and thus it assumes >> the client is not Linux. In response, the server does not >> provide CM private data in it's "connection accept" >> response. >>=20 >> The client then assumes the _server_ is not Linux, and >> disables an optimization that removes the XDR round-up >> from Read chunks. >>=20 >> So the client adds that round-up padding to the end of >> the Read chunks, and that prevents the bug. pagelen >> never goes below zero. >>=20 >> Basically nfsd4_decode_write() assumes that the XDR >> round-up for the WRITE payload is always going to be >> present. The round-up is optional for Read chunks, >> according to RFC 8166. >>=20 >> 193bcb7b37 adds that round-up in the tail iovec. It looks >> like that won't be adequate in the case where the payload >> approaches a page boundary. >>=20 >> Does this seem plausible? >=20 > Yes your explanation regarding the presence of CM private data that > toggles what client sends is exactly what I'm seeing. >=20 > I don't understand the last paragraph. But it sounds like server needs > to check that if client is linux then it needs to do a round up (and > also actually add padding which would have been done on the client) > because of the nfsd4_decode_write()'s assumptions. The server-side transport can _always_ round-up the length of a Read chunk. If the client included the round-up padding, the length won't change. If the client didn't include padding, then the length will be bumped up properly. The XDR decoders should then be happy either way. I hope. What I'm trying now is having svc_rdma_build_normal_read_chunk round-up rq_arg->pagelen instead of rounding up the length of the XDR tail iovec based on the length of the Read chunk. Seems obvious, and I'm wondering why I didn't do it this way in the first place. The comment there suggests the NFSv3 decoder might be finicky about the XDR lengths. >>>> Are you hitting the issue now? >>>=20 >>> I can reproduce it 100% of the time using your dd example >>> from a Linux NFS client. >>>=20 >>> I've also done a 8009 byte write on TCP, which forces the >>> same page boundary edge condition in nfsd4_decode_write, >>> and the failure does not occur. >>>=20 >>>=20 >>>> Are you 100% certain that you haven't >>>> hit it before. Do you always inspect the network traces? Without >>>> inspecting the network trace you wouldn't have found it. cthon = never >>>> fails so from the user perspective nothing failed. >>>=20 >>> Fair enough. >>>=20 >>>=20 >>>> I will check eae03e2ac80a3476f0652cb0ee451d7b06d30564 commit to see >>>> what different do I see in my testing. >>>=20 >>> My theory is it's that one or this related one: >>>=20 >>> commit 193bcb7b3719a315814dce40fc8dcd1af786ea64 >>> Author: Chuck Lever >>> AuthorDate: Fri Aug 18 11:12:35 2017 -0400 >>> Commit: J. Bruce Fields >>> CommitDate: Tue Sep 5 15:15:29 2017 -0400 >>>=20 >>> svcrdma: Populate tail iovec when receiving >>>=20 >>>=20 >>>>>>>>>>>>>> If this looks OK, I can send a patch. >>>>>>>>>>>>>> -- >>>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe = linux-nfs" in >>>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org >>>>>>>>>>>>>> More majordomo info at = http://vger.kernel.org/majordomo-info.html >>>>>>>>>>>>>=20 >>>>>>>>>>>>> -- >>>>>>>>>>>>> Chuck Lever >>>>>>>>>>>>>=20 >>>>>>>>>>>>>=20 >>>>>>>>>>>>>=20 >>>>>>>>>>>>> -- >>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe = linux-nfs" in >>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org >>>>>>>>>>>>> More majordomo info at = http://vger.kernel.org/majordomo-info.html >>>>>>>>>>>> -- >>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe = linux-nfs" in >>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org >>>>>>>>>>>> More majordomo info at = http://vger.kernel.org/majordomo-info.html >>>>>>>>>>>=20 >>>>>>>>>>> -- >>>>>>>>>>> Chuck Lever >>>>>>>>>>>=20 >>>>>>>>>>>=20 >>>>>>>>>>>=20 >>>>>>>>>> -- >>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe = linux-nfs" in >>>>>>>>>> the body of a message to majordomo@vger.kernel.org >>>>>>>>>> More majordomo info at = http://vger.kernel.org/majordomo-info.html >>>>>>>>>=20 >>>>>>>>> -- >>>>>>>>> Chuck Lever >>>>>>>>>=20 >>>>>>>>>=20 >>>>>>>>>=20 >>>>>>>> -- >>>>>>>> To unsubscribe from this list: send the line "unsubscribe = linux-nfs" in >>>>>>>> the body of a message to majordomo@vger.kernel.org >>>>>>>> More majordomo info at = http://vger.kernel.org/majordomo-info.html >>>>>>>=20 >>>>>>> -- >>>>>>> Chuck Lever >>>>>>>=20 >>>>>>>=20 >>>>>>>=20 >>>>>> -- >>>>>> To unsubscribe from this list: send the line "unsubscribe = linux-nfs" in >>>>>> the body of a message to majordomo@vger.kernel.org >>>>>> More majordomo info at = http://vger.kernel.org/majordomo-info.html >>>>>=20 >>>>> -- >>>>> Chuck Lever >>>=20 >>> -- >>> Chuck Lever >>>=20 >>>=20 >>>=20 >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" = in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>=20 >> -- >> Chuck Lever -- Chuck Lever