Return-Path: Received: from userp2120.oracle.com ([156.151.31.85]:49414 "EHLO userp2120.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751879AbeAZVIL (ORCPT ); Fri, 26 Jan 2018 16:08:11 -0500 Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 11.2 \(3445.5.20\)) Subject: Re: problem on nfsd doing RDMA write From: Chuck Lever In-Reply-To: Date: Fri, 26 Jan 2018 13:08:02 -0800 Cc: Bruce Fields , Linux NFS Mailing List Message-Id: <16CE3E21-2BF9-4CC5-A3E7-FFBE706A9EF4@oracle.com> References: <0B3C835B-3E30-43F3-B53C-BFB4462543D3@oracle.com> To: Olga Kornievskaia Sender: linux-nfs-owner@vger.kernel.org List-ID: > On Jan 26, 2018, at 12:13 PM, Olga Kornievskaia = wrote: >=20 > On Fri, Jan 26, 2018 at 2:29 PM, Chuck Lever = wrote: >>=20 >>=20 >>> On Jan 26, 2018, at 11:05 AM, Olga Kornievskaia = wrote: >>>=20 >>> On Fri, Jan 26, 2018 at 1:23 PM, Chuck Lever = wrote: >>>>=20 >>>>=20 >>>>> On Jan 26, 2018, at 9:24 AM, Olga Kornievskaia = wrote: >>>>>=20 >>>>> Hi Bruce/Chuck, >>>>>=20 >>>>> There is a problem with nfsd doing a WRITE of size 4093,4094,4095. = To >>>>> reproduce, mount with RDMA and do a simple dd "dd if=3D/dev/zero >>>>> of=3D/mnt/testfile bs=3D4093 count=3D1". What happens is that nfsd = fails to >>>>> parse GETATTR after the WRITE in the compound. It fails the = operation >>>>> with ERR_OP_ILLEGAL. >>>>>=20 >>>>> The problem happens for the values where XDR round up ends up = rounding >>>>> up to the page size. I don't know if my fix is the appropriate way = to >>>>> fix this but with it I don't get the error: >>>>>=20 >>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c >>>>> index 2c61c6b..a8489c3 100644 >>>>> --- a/fs/nfsd/nfs4xdr.c >>>>> +++ b/fs/nfsd/nfs4xdr.c >>>>> @@ -1289,11 +1289,12 @@ static __be32 nfsd4_decode_opaque(struct = nfsd4_compounda >>>>>=20 >>>>> len =3D XDR_QUADLEN(write->wr_buflen) << 2; >>>>> if (len >=3D avail) { >>>>> - int pages; >>>>> + int pages =3D 0; >>>>>=20 >>>>> len -=3D avail; >>>>>=20 >>>>> - pages =3D len >> PAGE_SHIFT; >>>>> + if (write->wr_buflen >=3D PAGE_SIZE) >>>>> + pages =3D len >> PAGE_SHIFT; >>>>> argp->pagelist +=3D pages; >>>>> argp->pagelen -=3D pages * PAGE_SIZE; >>>>> len -=3D pages * PAGE_SIZE; >>>>>=20 >>>>> So the problem is the using "len" instead of "write->wr_buflen" = leads >>>>> for the values 4093,4094,4095 that are rounded up to 4096, it = makes >>>>> pages=3D1 and the argp->pagelen ends up being a negative value = leading >>>>> to problems of parsing the GETATTR. >>>>=20 >>>> Would this also be a problem near any page boundary, say, a >>>> write length of 8191 bytes? >>>>=20 >>>> Instead of using the rounded up "len", why not try this: >>>>=20 >>>> - pages =3D len >> PAGE_SHIFT; >>>> + pages =3D write->wr_buflen >> PAGE_SHIFT; >>>=20 >>> You are right. It needs to be that. Otherwise 8191 fails the same = way. >>>=20 >>>> And be sure to test with TCP as well. >>>=20 >>> Sigh. It breaks normal (non-RDMA) mounts. I'll figure out why. >>=20 >> OK. >>=20 >> Remember that a Read chunk's length does not have to be >> rounded up. Maybe the transport needs to round up the >> length of the unmarshaled data content on behalf of the >> NFSv4 write decoder. >>=20 >=20 > The problem of simply taking write->wr_buflen was that len before that > could have been adjusted by avail value in then non-RDAM mounts. >=20 > Again, I'm not sure if this is the right fix. But this one works for > both non-RDMA and RDMA mounts. >=20 > diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c > index 2c61c6b..3178997 100644 > --- a/fs/nfsd/nfs4xdr.c > +++ b/fs/nfsd/nfs4xdr.c > @@ -1293,7 +1293,10 @@ static __be32 nfsd4_decode_opaque(struct > nfsd4_compoundargs *argp, struct xdr_ne >=20 > len -=3D avail; >=20 > - pages =3D len >> PAGE_SHIFT; > + if (!avail) > + pages =3D write->wr_buflen >> PAGE_SHIFT; > + else > + pages =3D len >> PAGE_SHIFT; > argp->pagelist +=3D pages; > argp->pagelen -=3D pages * PAGE_SIZE; > len -=3D pages * PAGE_SIZE; This code has been around since 2012. Has it always been broken for RDMA? I suspect the root problem is in the transport, not here in the decoder. Can you bisect to find out where the problem started to occur? >>>>> If this looks OK, I can send a patch. >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe = linux-nfs" in >>>>> the body of a message to majordomo@vger.kernel.org >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>=20 >>>> -- >>>> Chuck Lever >>>>=20 >>>>=20 >>>>=20 >>>> -- >>>> To unsubscribe from this list: send the line "unsubscribe = linux-nfs" in >>>> the body of a message to majordomo@vger.kernel.org >>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" = in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>=20 >> -- >> Chuck Lever >>=20 >>=20 >>=20 > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" = in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Chuck Lever