Return-Path: Received: from mail-vk0-f48.google.com ([209.85.213.48]:34031 "EHLO mail-vk0-f48.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752098AbeAaVmJ (ORCPT ); Wed, 31 Jan 2018 16:42:09 -0500 Received: by mail-vk0-f48.google.com with SMTP id f186so10004485vkc.1 for ; Wed, 31 Jan 2018 13:42:08 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: <0B3C835B-3E30-43F3-B53C-BFB4462543D3@oracle.com> <16CE3E21-2BF9-4CC5-A3E7-FFBE706A9EF4@oracle.com> <0F49479C-EB5B-4F86-B89B-7435FDD51E54@oracle.com> From: Olga Kornievskaia Date: Wed, 31 Jan 2018 16:42:07 -0500 Message-ID: Subject: Re: problem on nfsd doing RDMA write To: Chuck Lever Cc: Bruce Fields , Linux NFS Mailing List Content-Type: text/plain; charset="UTF-8" Sender: linux-nfs-owner@vger.kernel.org List-ID: On Wed, Jan 31, 2018 at 4:13 PM, Chuck Lever wrote: > > >> On Jan 31, 2018, at 2:51 PM, Chuck Lever wrote: >> >> >> >>> On Jan 31, 2018, at 1:52 PM, Olga Kornievskaia wrote: >>> >>> On Wed, Jan 31, 2018 at 1:29 PM, Chuck Lever wrote: >>>> >>>> >>>>> On Jan 30, 2018, at 12:02 PM, Olga Kornievskaia wrote: >>>>> >>>>> On Mon, Jan 29, 2018 at 12:24 PM, Chuck Lever wrote: >>>>>> >>>>>> >>>>>>> On Jan 29, 2018, at 12:19 PM, Olga Kornievskaia wrote: >>>>>>> >>>>>>> On Fri, Jan 26, 2018 at 4:08 PM, Chuck Lever wrote: >>>>>>>> >>>>>>>> >>>>>>>>> On Jan 26, 2018, at 12:13 PM, Olga Kornievskaia wrote: >>>>>>>>> >>>>>>>>> On Fri, Jan 26, 2018 at 2:29 PM, Chuck Lever wrote: >>>>>>>>>> >>>>>>>>>> >>>>>>>>>>> On Jan 26, 2018, at 11:05 AM, Olga Kornievskaia wrote: >>>>>>>>>>> >>>>>>>>>>> On Fri, Jan 26, 2018 at 1:23 PM, Chuck Lever wrote: >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>>> On Jan 26, 2018, at 9:24 AM, Olga Kornievskaia wrote: >>>>>>>>>>>>> >>>>>>>>>>>>> Hi Bruce/Chuck, >>>>>>>>>>>>> >>>>>>>>>>>>> There is a problem with nfsd doing a WRITE of size 4093,4094,4095. To >>>>>>>>>>>>> reproduce, mount with RDMA and do a simple dd "dd if=/dev/zero >>>>>>>>>>>>> of=/mnt/testfile bs=4093 count=1". What happens is that nfsd fails to >>>>>>>>>>>>> parse GETATTR after the WRITE in the compound. It fails the operation >>>>>>>>>>>>> with ERR_OP_ILLEGAL. >>>>>>>>>>>>> >>>>>>>>>>>>> The problem happens for the values where XDR round up ends up rounding >>>>>>>>>>>>> up to the page size. I don't know if my fix is the appropriate way to >>>>>>>>>>>>> fix this but with it I don't get the error: >>>>>>>>>>>>> >>>>>>>>>>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c >>>>>>>>>>>>> index 2c61c6b..a8489c3 100644 >>>>>>>>>>>>> --- a/fs/nfsd/nfs4xdr.c >>>>>>>>>>>>> +++ b/fs/nfsd/nfs4xdr.c >>>>>>>>>>>>> @@ -1289,11 +1289,12 @@ static __be32 nfsd4_decode_opaque(struct nfsd4_compounda >>>>>>>>>>>>> >>>>>>>>>>>>> len = XDR_QUADLEN(write->wr_buflen) << 2; >>>>>>>>>>>>> if (len >= avail) { >>>>>>>>>>>>> - int pages; >>>>>>>>>>>>> + int pages = 0; >>>>>>>>>>>>> >>>>>>>>>>>>> len -= avail; >>>>>>>>>>>>> >>>>>>>>>>>>> - pages = len >> PAGE_SHIFT; >>>>>>>>>>>>> + if (write->wr_buflen >= PAGE_SIZE) >>>>>>>>>>>>> + pages = len >> PAGE_SHIFT; >>>>>>>>>>>>> argp->pagelist += pages; >>>>>>>>>>>>> argp->pagelen -= pages * PAGE_SIZE; >>>>>>>>>>>>> len -= pages * PAGE_SIZE; >>>>>>>>>>>>> >>>>>>>>>>>>> So the problem is the using "len" instead of "write->wr_buflen" leads >>>>>>>>>>>>> for the values 4093,4094,4095 that are rounded up to 4096, it makes >>>>>>>>>>>>> pages=1 and the argp->pagelen ends up being a negative value leading >>>>>>>>>>>>> to problems of parsing the GETATTR. >>>>>>>>>>>> >>>>>>>>>>>> Would this also be a problem near any page boundary, say, a >>>>>>>>>>>> write length of 8191 bytes? >>>>>>>>>>>> >>>>>>>>>>>> Instead of using the rounded up "len", why not try this: >>>>>>>>>>>> >>>>>>>>>>>> - pages = len >> PAGE_SHIFT; >>>>>>>>>>>> + pages = write->wr_buflen >> PAGE_SHIFT; >>>>>>>>>>> >>>>>>>>>>> You are right. It needs to be that. Otherwise 8191 fails the same way. >>>>>>>>>>> >>>>>>>>>>>> And be sure to test with TCP as well. >>>>>>>>>>> >>>>>>>>>>> Sigh. It breaks normal (non-RDMA) mounts. I'll figure out why. >>>>>>>>>> >>>>>>>>>> OK. >>>>>>>>>> >>>>>>>>>> Remember that a Read chunk's length does not have to be >>>>>>>>>> rounded up. Maybe the transport needs to round up the >>>>>>>>>> length of the unmarshaled data content on behalf of the >>>>>>>>>> NFSv4 write decoder. >>>>>>>>>> >>>>>>>>> >>>>>>>>> The problem of simply taking write->wr_buflen was that len before that >>>>>>>>> could have been adjusted by avail value in then non-RDAM mounts. >>>>>>>>> >>>>>>>>> Again, I'm not sure if this is the right fix. But this one works for >>>>>>>>> both non-RDMA and RDMA mounts. >>>>>>>>> >>>>>>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c >>>>>>>>> index 2c61c6b..3178997 100644 >>>>>>>>> --- a/fs/nfsd/nfs4xdr.c >>>>>>>>> +++ b/fs/nfsd/nfs4xdr.c >>>>>>>>> @@ -1293,7 +1293,10 @@ static __be32 nfsd4_decode_opaque(struct >>>>>>>>> nfsd4_compoundargs *argp, struct xdr_ne >>>>>>>>> >>>>>>>>> len -= avail; >>>>>>>>> >>>>>>>>> - pages = len >> PAGE_SHIFT; >>>>>>>>> + if (!avail) >>>>>>>>> + pages = write->wr_buflen >> PAGE_SHIFT; >>>>>>>>> + else >>>>>>>>> + pages = len >> PAGE_SHIFT; >>>>>>>>> argp->pagelist += pages; >>>>>>>>> argp->pagelen -= pages * PAGE_SIZE; >>>>>>>>> len -= pages * PAGE_SIZE; >>>>>>>> >>>>>>>> This code has been around since 2012. Has it always been >>>>>>>> broken for RDMA? I suspect the root problem is in the >>>>>>>> transport, not here in the decoder. Can you bisect to >>>>>>>> find out where the problem started to occur? >>>>>>> >>>>>>> To see if I could go back to 2012, I first want to see if I could go >>>>>>> back to before mlx5 driver was added which was in 3.11-rc1. But I'm >>>>>>> having issues trying to boot into a kernel based on 3.10 or 3.9. I'm >>>>>>> trying to verify if my CX-5 card would even work with mlx4 driver >>>>>>> which I'm having doubts it would. >>>>>> >>>>>> The CX-5 works only with mlx5. >>>>>> >>>>>> >>>>>>> These old kernels fail to boot in >>>>>>> XFS saying that it's a wrong version. Some googling makes me think >>>>>>> that since my XFS partition was created with a newer kernel it's not >>>>>>> compatible with an older kernel... >>>>>> >>>>>> That's correct, the XFS filesystem has new features like >>>>>> metadata checksum that don't work on older kernels. >>>>>> >>>>>> To start with, you could try some very late 3.x or early >>>>>> 4.x kernels. My intuition is that you will find an svcrdma >>>>>> change that is more recent than 2012 that is causing the >>>>>> underlying issue. >>>>> >>>>> It works in 4.8 and does not work in 4.9. It's this commit that breaks it: >>>>> >>>>> commit cc9d83408b52265ddab2874cf19d1611da4ca7ee >>>>> >>>>> svcrdma: Server-side support for rpcrdma_connect_private >>>>> >>>>> Prepare to receive an RDMA-CM private message when handling a new >>>>> connection attempt, and send a similar message as part of connection >>>>> acceptance. >>>>> >>>>> Both sides can communicate their various implementation limits. >>>>> Implementations that don't support this sideband protocol ignore it. >>>>> >>>>> Signed-off-by: Chuck Lever >>>>> Reviewed-by: Sagi Grimberg >>>>> Signed-off-by: J. Bruce Fields >>>>> >>>>> So far I don't understand how can causes problems... >>>> >>>> More likely it's: >>>> >>>> >>>> commit eae03e2ac80a3476f0652cb0ee451d7b06d30564 >>>> Author: Chuck Lever >>>> AuthorDate: Fri Aug 18 11:12:27 2017 -0400 >>>> Commit: J. Bruce Fields >>>> CommitDate: Tue Sep 5 15:15:29 2017 -0400 >>>> >>>> nfsd: Incoming xdr_bufs may have content in tail buffer >>>> >>>> >>>> >>>> This is a 3 op compound. The third op is GETATTR. >>>> The reply shows that the WRITE op succeeds, which >>>> is why no error is reported to the application. >>>> The GETATTR fails with OP_ILLEGAL, and the client >>>> I guess is trained to ignore that kind of failure. >>>> >>>> After nfsd4_decode_write pushes into the "next" page >>>> in the page list, the next time READ_BUF is called >>>> (to handle the next op in the compound), it is >>>> supposed to see that the XDR page list is now >>>> exhausted, and switch to the XDR tail buf. >>>> >>>> TCP doesn't use the tail buf; the GETATTR in that >>>> case is at the end of the page list, following the >>>> WRITE payload. That would explain why this issue >>>> cannot be reproduced there. >>>> >>>> RDMA does use the tail buf in this case. For some >>>> reason the logic added in eae03e2ac8 is not working, >>>> and the server is looking at uninitialized memory >>>> to find that third op. >>>> >>>> I run cthon04 all the time, so I wonder why I never >>>> hit this case. >>> >>> While this commit seems like a better explanation, I can 100% hit the >>> issue running 4.8-rc6 based kernel (this is from your nfsd-for-4.9 >>> branch and rollback back to the cc9d83408b) . What matters is the >>> assignment of the "private_data" in the conn_param. If I set it to >>> NULL and 0 for the length. The problem goes away. >> >> There's no obvious connection between the NFSv4 WRITE >> misbehavior and CM private data being present. > > You observed that this logic in nfsd4_decode_write is > making argp->pagelen go negative: > > 1304 pages = len >> PAGE_SHIFT; > 1305 argp->pagelist += pages; > 1306 argp->pagelen -= pages * PAGE_SIZE; > 1307 len -= pages * PAGE_SIZE; > > This is because the client does not insert XDR round-up > in the Read chunk. Since pagelen is not zero, the code > in read_buf that is looking for that case (added by > eae03e2ac80a) fails. > > I suspect what's going on here is: > > When you disable the private_data logic, the server is > not "receiving" the CM private data, and thus it assumes > the client is not Linux. In response, the server does not > provide CM private data in it's "connection accept" > response. > > The client then assumes the _server_ is not Linux, and > disables an optimization that removes the XDR round-up > from Read chunks. > > So the client adds that round-up padding to the end of > the Read chunks, and that prevents the bug. pagelen > never goes below zero. > > Basically nfsd4_decode_write() assumes that the XDR > round-up for the WRITE payload is always going to be > present. The round-up is optional for Read chunks, > according to RFC 8166. > > 193bcb7b37 adds that round-up in the tail iovec. It looks > like that won't be adequate in the case where the payload > approaches a page boundary. > > Does this seem plausible? Yes your explanation regarding the presence of CM private data that toggles what client sends is exactly what I'm seeing. I don't understand the last paragraph. But it sounds like server needs to check that if client is linux then it needs to do a round up (and also actually add padding which would have been done on the client) because of the nfsd4_decode_write()'s assumptions. > > >>> Are you hitting the issue now? >> >> I can reproduce it 100% of the time using your dd example >> from a Linux NFS client. >> >> I've also done a 8009 byte write on TCP, which forces the >> same page boundary edge condition in nfsd4_decode_write, >> and the failure does not occur. >> >> >>> Are you 100% certain that you haven't >>> hit it before. Do you always inspect the network traces? Without >>> inspecting the network trace you wouldn't have found it. cthon never >>> fails so from the user perspective nothing failed. >> >> Fair enough. >> >> >>> I will check eae03e2ac80a3476f0652cb0ee451d7b06d30564 commit to see >>> what different do I see in my testing. >> >> My theory is it's that one or this related one: >> >> commit 193bcb7b3719a315814dce40fc8dcd1af786ea64 >> Author: Chuck Lever >> AuthorDate: Fri Aug 18 11:12:35 2017 -0400 >> Commit: J. Bruce Fields >> CommitDate: Tue Sep 5 15:15:29 2017 -0400 >> >> svcrdma: Populate tail iovec when receiving >> >> >>>>>>>>>>>>> If this looks OK, I can send a patch. >>>>>>>>>>>>> -- >>>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >>>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org >>>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> Chuck Lever >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> >>>>>>>>>>>> -- >>>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >>>>>>>>>>>> the body of a message to majordomo@vger.kernel.org >>>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>>>>>>> -- >>>>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >>>>>>>>>>> the body of a message to majordomo@vger.kernel.org >>>>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>>>>>> >>>>>>>>>> -- >>>>>>>>>> Chuck Lever >>>>>>>>>> >>>>>>>>>> >>>>>>>>>> >>>>>>>>> -- >>>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >>>>>>>>> the body of a message to majordomo@vger.kernel.org >>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>>>> >>>>>>>> -- >>>>>>>> Chuck Lever >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>> -- >>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >>>>>>> the body of a message to majordomo@vger.kernel.org >>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>> >>>>>> -- >>>>>> Chuck Lever >>>>>> >>>>>> >>>>>> >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >>>>> the body of a message to majordomo@vger.kernel.org >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>> >>>> -- >>>> Chuck Lever >> >> -- >> Chuck Lever >> >> >> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > Chuck Lever > > >