Return-Path: Received: from mail-ua0-f194.google.com ([209.85.217.194]:37067 "EHLO mail-ua0-f194.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753145AbeA3RCZ (ORCPT ); Tue, 30 Jan 2018 12:02:25 -0500 Received: by mail-ua0-f194.google.com with SMTP id q8so7532458uae.4 for ; Tue, 30 Jan 2018 09:02:24 -0800 (PST) MIME-Version: 1.0 In-Reply-To: References: <0B3C835B-3E30-43F3-B53C-BFB4462543D3@oracle.com> <16CE3E21-2BF9-4CC5-A3E7-FFBE706A9EF4@oracle.com> From: Olga Kornievskaia Date: Tue, 30 Jan 2018 12:02:22 -0500 Message-ID: Subject: Re: problem on nfsd doing RDMA write To: Chuck Lever Cc: Bruce Fields , Linux NFS Mailing List Content-Type: text/plain; charset="UTF-8" Sender: linux-nfs-owner@vger.kernel.org List-ID: On Mon, Jan 29, 2018 at 12:24 PM, Chuck Lever wrote: > > >> On Jan 29, 2018, at 12:19 PM, Olga Kornievskaia wrote: >> >> On Fri, Jan 26, 2018 at 4:08 PM, Chuck Lever wrote: >>> >>> >>>> On Jan 26, 2018, at 12:13 PM, Olga Kornievskaia wrote: >>>> >>>> On Fri, Jan 26, 2018 at 2:29 PM, Chuck Lever wrote: >>>>> >>>>> >>>>>> On Jan 26, 2018, at 11:05 AM, Olga Kornievskaia wrote: >>>>>> >>>>>> On Fri, Jan 26, 2018 at 1:23 PM, Chuck Lever wrote: >>>>>>> >>>>>>> >>>>>>>> On Jan 26, 2018, at 9:24 AM, Olga Kornievskaia wrote: >>>>>>>> >>>>>>>> Hi Bruce/Chuck, >>>>>>>> >>>>>>>> There is a problem with nfsd doing a WRITE of size 4093,4094,4095. To >>>>>>>> reproduce, mount with RDMA and do a simple dd "dd if=/dev/zero >>>>>>>> of=/mnt/testfile bs=4093 count=1". What happens is that nfsd fails to >>>>>>>> parse GETATTR after the WRITE in the compound. It fails the operation >>>>>>>> with ERR_OP_ILLEGAL. >>>>>>>> >>>>>>>> The problem happens for the values where XDR round up ends up rounding >>>>>>>> up to the page size. I don't know if my fix is the appropriate way to >>>>>>>> fix this but with it I don't get the error: >>>>>>>> >>>>>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c >>>>>>>> index 2c61c6b..a8489c3 100644 >>>>>>>> --- a/fs/nfsd/nfs4xdr.c >>>>>>>> +++ b/fs/nfsd/nfs4xdr.c >>>>>>>> @@ -1289,11 +1289,12 @@ static __be32 nfsd4_decode_opaque(struct nfsd4_compounda >>>>>>>> >>>>>>>> len = XDR_QUADLEN(write->wr_buflen) << 2; >>>>>>>> if (len >= avail) { >>>>>>>> - int pages; >>>>>>>> + int pages = 0; >>>>>>>> >>>>>>>> len -= avail; >>>>>>>> >>>>>>>> - pages = len >> PAGE_SHIFT; >>>>>>>> + if (write->wr_buflen >= PAGE_SIZE) >>>>>>>> + pages = len >> PAGE_SHIFT; >>>>>>>> argp->pagelist += pages; >>>>>>>> argp->pagelen -= pages * PAGE_SIZE; >>>>>>>> len -= pages * PAGE_SIZE; >>>>>>>> >>>>>>>> So the problem is the using "len" instead of "write->wr_buflen" leads >>>>>>>> for the values 4093,4094,4095 that are rounded up to 4096, it makes >>>>>>>> pages=1 and the argp->pagelen ends up being a negative value leading >>>>>>>> to problems of parsing the GETATTR. >>>>>>> >>>>>>> Would this also be a problem near any page boundary, say, a >>>>>>> write length of 8191 bytes? >>>>>>> >>>>>>> Instead of using the rounded up "len", why not try this: >>>>>>> >>>>>>> - pages = len >> PAGE_SHIFT; >>>>>>> + pages = write->wr_buflen >> PAGE_SHIFT; >>>>>> >>>>>> You are right. It needs to be that. Otherwise 8191 fails the same way. >>>>>> >>>>>>> And be sure to test with TCP as well. >>>>>> >>>>>> Sigh. It breaks normal (non-RDMA) mounts. I'll figure out why. >>>>> >>>>> OK. >>>>> >>>>> Remember that a Read chunk's length does not have to be >>>>> rounded up. Maybe the transport needs to round up the >>>>> length of the unmarshaled data content on behalf of the >>>>> NFSv4 write decoder. >>>>> >>>> >>>> The problem of simply taking write->wr_buflen was that len before that >>>> could have been adjusted by avail value in then non-RDAM mounts. >>>> >>>> Again, I'm not sure if this is the right fix. But this one works for >>>> both non-RDMA and RDMA mounts. >>>> >>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c >>>> index 2c61c6b..3178997 100644 >>>> --- a/fs/nfsd/nfs4xdr.c >>>> +++ b/fs/nfsd/nfs4xdr.c >>>> @@ -1293,7 +1293,10 @@ static __be32 nfsd4_decode_opaque(struct >>>> nfsd4_compoundargs *argp, struct xdr_ne >>>> >>>> len -= avail; >>>> >>>> - pages = len >> PAGE_SHIFT; >>>> + if (!avail) >>>> + pages = write->wr_buflen >> PAGE_SHIFT; >>>> + else >>>> + pages = len >> PAGE_SHIFT; >>>> argp->pagelist += pages; >>>> argp->pagelen -= pages * PAGE_SIZE; >>>> len -= pages * PAGE_SIZE; >>> >>> This code has been around since 2012. Has it always been >>> broken for RDMA? I suspect the root problem is in the >>> transport, not here in the decoder. Can you bisect to >>> find out where the problem started to occur? >> >> To see if I could go back to 2012, I first want to see if I could go >> back to before mlx5 driver was added which was in 3.11-rc1. But I'm >> having issues trying to boot into a kernel based on 3.10 or 3.9. I'm >> trying to verify if my CX-5 card would even work with mlx4 driver >> which I'm having doubts it would. > > The CX-5 works only with mlx5. > > >> These old kernels fail to boot in >> XFS saying that it's a wrong version. Some googling makes me think >> that since my XFS partition was created with a newer kernel it's not >> compatible with an older kernel... > > That's correct, the XFS filesystem has new features like > metadata checksum that don't work on older kernels. > > To start with, you could try some very late 3.x or early > 4.x kernels. My intuition is that you will find an svcrdma > change that is more recent than 2012 that is causing the > underlying issue. It works in 4.8 and does not work in 4.9. It's this commit that breaks it: commit cc9d83408b52265ddab2874cf19d1611da4ca7ee svcrdma: Server-side support for rpcrdma_connect_private Prepare to receive an RDMA-CM private message when handling a new connection attempt, and send a similar message as part of connection acceptance. Both sides can communicate their various implementation limits. Implementations that don't support this sideband protocol ignore it. Signed-off-by: Chuck Lever Reviewed-by: Sagi Grimberg Signed-off-by: J. Bruce Fields So far I don't understand how can causes problems... >>>>>>>> If this looks OK, I can send a patch. >>>>>>>> -- >>>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >>>>>>>> the body of a message to majordomo@vger.kernel.org >>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>>> >>>>>>> -- >>>>>>> Chuck Lever >>>>>>> >>>>>>> >>>>>>> >>>>>>> -- >>>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >>>>>>> the body of a message to majordomo@vger.kernel.org >>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>>> -- >>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >>>>>> the body of a message to majordomo@vger.kernel.org >>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>> >>>>> -- >>>>> Chuck Lever >>>>> >>>>> >>>>> >>>> -- >>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >>>> the body of a message to majordomo@vger.kernel.org >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >>> -- >>> Chuck Lever >>> >>> >>> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > Chuck Lever > > >