Return-Path: Received: from mail-ua0-f179.google.com ([209.85.217.179]:46190 "EHLO mail-ua0-f179.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751072AbeA2RTW (ORCPT ); Mon, 29 Jan 2018 12:19:22 -0500 Received: by mail-ua0-f179.google.com with SMTP id j23so5167051uak.13 for ; Mon, 29 Jan 2018 09:19:21 -0800 (PST) MIME-Version: 1.0 In-Reply-To: <16CE3E21-2BF9-4CC5-A3E7-FFBE706A9EF4@oracle.com> References: <0B3C835B-3E30-43F3-B53C-BFB4462543D3@oracle.com> <16CE3E21-2BF9-4CC5-A3E7-FFBE706A9EF4@oracle.com> From: Olga Kornievskaia Date: Mon, 29 Jan 2018 12:19:20 -0500 Message-ID: Subject: Re: problem on nfsd doing RDMA write To: Chuck Lever Cc: Bruce Fields , Linux NFS Mailing List Content-Type: text/plain; charset="UTF-8" Sender: linux-nfs-owner@vger.kernel.org List-ID: On Fri, Jan 26, 2018 at 4:08 PM, Chuck Lever wrote: > > >> On Jan 26, 2018, at 12:13 PM, Olga Kornievskaia wrote: >> >> On Fri, Jan 26, 2018 at 2:29 PM, Chuck Lever wrote: >>> >>> >>>> On Jan 26, 2018, at 11:05 AM, Olga Kornievskaia wrote: >>>> >>>> On Fri, Jan 26, 2018 at 1:23 PM, Chuck Lever wrote: >>>>> >>>>> >>>>>> On Jan 26, 2018, at 9:24 AM, Olga Kornievskaia wrote: >>>>>> >>>>>> Hi Bruce/Chuck, >>>>>> >>>>>> There is a problem with nfsd doing a WRITE of size 4093,4094,4095. To >>>>>> reproduce, mount with RDMA and do a simple dd "dd if=/dev/zero >>>>>> of=/mnt/testfile bs=4093 count=1". What happens is that nfsd fails to >>>>>> parse GETATTR after the WRITE in the compound. It fails the operation >>>>>> with ERR_OP_ILLEGAL. >>>>>> >>>>>> The problem happens for the values where XDR round up ends up rounding >>>>>> up to the page size. I don't know if my fix is the appropriate way to >>>>>> fix this but with it I don't get the error: >>>>>> >>>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c >>>>>> index 2c61c6b..a8489c3 100644 >>>>>> --- a/fs/nfsd/nfs4xdr.c >>>>>> +++ b/fs/nfsd/nfs4xdr.c >>>>>> @@ -1289,11 +1289,12 @@ static __be32 nfsd4_decode_opaque(struct nfsd4_compounda >>>>>> >>>>>> len = XDR_QUADLEN(write->wr_buflen) << 2; >>>>>> if (len >= avail) { >>>>>> - int pages; >>>>>> + int pages = 0; >>>>>> >>>>>> len -= avail; >>>>>> >>>>>> - pages = len >> PAGE_SHIFT; >>>>>> + if (write->wr_buflen >= PAGE_SIZE) >>>>>> + pages = len >> PAGE_SHIFT; >>>>>> argp->pagelist += pages; >>>>>> argp->pagelen -= pages * PAGE_SIZE; >>>>>> len -= pages * PAGE_SIZE; >>>>>> >>>>>> So the problem is the using "len" instead of "write->wr_buflen" leads >>>>>> for the values 4093,4094,4095 that are rounded up to 4096, it makes >>>>>> pages=1 and the argp->pagelen ends up being a negative value leading >>>>>> to problems of parsing the GETATTR. >>>>> >>>>> Would this also be a problem near any page boundary, say, a >>>>> write length of 8191 bytes? >>>>> >>>>> Instead of using the rounded up "len", why not try this: >>>>> >>>>> - pages = len >> PAGE_SHIFT; >>>>> + pages = write->wr_buflen >> PAGE_SHIFT; >>>> >>>> You are right. It needs to be that. Otherwise 8191 fails the same way. >>>> >>>>> And be sure to test with TCP as well. >>>> >>>> Sigh. It breaks normal (non-RDMA) mounts. I'll figure out why. >>> >>> OK. >>> >>> Remember that a Read chunk's length does not have to be >>> rounded up. Maybe the transport needs to round up the >>> length of the unmarshaled data content on behalf of the >>> NFSv4 write decoder. >>> >> >> The problem of simply taking write->wr_buflen was that len before that >> could have been adjusted by avail value in then non-RDAM mounts. >> >> Again, I'm not sure if this is the right fix. But this one works for >> both non-RDMA and RDMA mounts. >> >> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c >> index 2c61c6b..3178997 100644 >> --- a/fs/nfsd/nfs4xdr.c >> +++ b/fs/nfsd/nfs4xdr.c >> @@ -1293,7 +1293,10 @@ static __be32 nfsd4_decode_opaque(struct >> nfsd4_compoundargs *argp, struct xdr_ne >> >> len -= avail; >> >> - pages = len >> PAGE_SHIFT; >> + if (!avail) >> + pages = write->wr_buflen >> PAGE_SHIFT; >> + else >> + pages = len >> PAGE_SHIFT; >> argp->pagelist += pages; >> argp->pagelen -= pages * PAGE_SIZE; >> len -= pages * PAGE_SIZE; > > This code has been around since 2012. Has it always been > broken for RDMA? I suspect the root problem is in the > transport, not here in the decoder. Can you bisect to > find out where the problem started to occur? To see if I could go back to 2012, I first want to see if I could go back to before mlx5 driver was added which was in 3.11-rc1. But I'm having issues trying to boot into a kernel based on 3.10 or 3.9. I'm trying to verify if my CX-5 card would even work with mlx4 driver which I'm having doubts it would. These old kernels fail to boot in XFS saying that it's a wrong version. Some googling makes me think that since my XFS partition was created with a newer kernel it's not compatible with an older kernel... >>>>>> If this looks OK, I can send a patch. >>>>>> -- >>>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >>>>>> the body of a message to majordomo@vger.kernel.org >>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>> >>>>> -- >>>>> Chuck Lever >>>>> >>>>> >>>>> >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >>>>> the body of a message to majordomo@vger.kernel.org >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>> -- >>>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >>>> the body of a message to majordomo@vger.kernel.org >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> >>> -- >>> Chuck Lever >>> >>> >>> >> -- >> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in >> the body of a message to majordomo@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > -- > Chuck Lever > > >