Return-Path: Received: from userp2130.oracle.com ([156.151.31.86]:58140 "EHLO userp2130.oracle.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751417AbeA2RZD (ORCPT ); Mon, 29 Jan 2018 12:25:03 -0500 Content-Type: text/plain; charset=us-ascii Mime-Version: 1.0 (Mac OS X Mail 11.2 \(3445.5.20\)) Subject: Re: problem on nfsd doing RDMA write From: Chuck Lever In-Reply-To: Date: Mon, 29 Jan 2018 12:24:55 -0500 Cc: Bruce Fields , Linux NFS Mailing List Message-Id: References: <0B3C835B-3E30-43F3-B53C-BFB4462543D3@oracle.com> <16CE3E21-2BF9-4CC5-A3E7-FFBE706A9EF4@oracle.com> To: Olga Kornievskaia Sender: linux-nfs-owner@vger.kernel.org List-ID: > On Jan 29, 2018, at 12:19 PM, Olga Kornievskaia = wrote: >=20 > On Fri, Jan 26, 2018 at 4:08 PM, Chuck Lever = wrote: >>=20 >>=20 >>> On Jan 26, 2018, at 12:13 PM, Olga Kornievskaia = wrote: >>>=20 >>> On Fri, Jan 26, 2018 at 2:29 PM, Chuck Lever = wrote: >>>>=20 >>>>=20 >>>>> On Jan 26, 2018, at 11:05 AM, Olga Kornievskaia = wrote: >>>>>=20 >>>>> On Fri, Jan 26, 2018 at 1:23 PM, Chuck Lever = wrote: >>>>>>=20 >>>>>>=20 >>>>>>> On Jan 26, 2018, at 9:24 AM, Olga Kornievskaia = wrote: >>>>>>>=20 >>>>>>> Hi Bruce/Chuck, >>>>>>>=20 >>>>>>> There is a problem with nfsd doing a WRITE of size = 4093,4094,4095. To >>>>>>> reproduce, mount with RDMA and do a simple dd "dd if=3D/dev/zero >>>>>>> of=3D/mnt/testfile bs=3D4093 count=3D1". What happens is that = nfsd fails to >>>>>>> parse GETATTR after the WRITE in the compound. It fails the = operation >>>>>>> with ERR_OP_ILLEGAL. >>>>>>>=20 >>>>>>> The problem happens for the values where XDR round up ends up = rounding >>>>>>> up to the page size. I don't know if my fix is the appropriate = way to >>>>>>> fix this but with it I don't get the error: >>>>>>>=20 >>>>>>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c >>>>>>> index 2c61c6b..a8489c3 100644 >>>>>>> --- a/fs/nfsd/nfs4xdr.c >>>>>>> +++ b/fs/nfsd/nfs4xdr.c >>>>>>> @@ -1289,11 +1289,12 @@ static __be32 nfsd4_decode_opaque(struct = nfsd4_compounda >>>>>>>=20 >>>>>>> len =3D XDR_QUADLEN(write->wr_buflen) << 2; >>>>>>> if (len >=3D avail) { >>>>>>> - int pages; >>>>>>> + int pages =3D 0; >>>>>>>=20 >>>>>>> len -=3D avail; >>>>>>>=20 >>>>>>> - pages =3D len >> PAGE_SHIFT; >>>>>>> + if (write->wr_buflen >=3D PAGE_SIZE) >>>>>>> + pages =3D len >> PAGE_SHIFT; >>>>>>> argp->pagelist +=3D pages; >>>>>>> argp->pagelen -=3D pages * PAGE_SIZE; >>>>>>> len -=3D pages * PAGE_SIZE; >>>>>>>=20 >>>>>>> So the problem is the using "len" instead of "write->wr_buflen" = leads >>>>>>> for the values 4093,4094,4095 that are rounded up to 4096, it = makes >>>>>>> pages=3D1 and the argp->pagelen ends up being a negative value = leading >>>>>>> to problems of parsing the GETATTR. >>>>>>=20 >>>>>> Would this also be a problem near any page boundary, say, a >>>>>> write length of 8191 bytes? >>>>>>=20 >>>>>> Instead of using the rounded up "len", why not try this: >>>>>>=20 >>>>>> - pages =3D len >> PAGE_SHIFT; >>>>>> + pages =3D write->wr_buflen >> PAGE_SHIFT; >>>>>=20 >>>>> You are right. It needs to be that. Otherwise 8191 fails the same = way. >>>>>=20 >>>>>> And be sure to test with TCP as well. >>>>>=20 >>>>> Sigh. It breaks normal (non-RDMA) mounts. I'll figure out why. >>>>=20 >>>> OK. >>>>=20 >>>> Remember that a Read chunk's length does not have to be >>>> rounded up. Maybe the transport needs to round up the >>>> length of the unmarshaled data content on behalf of the >>>> NFSv4 write decoder. >>>>=20 >>>=20 >>> The problem of simply taking write->wr_buflen was that len before = that >>> could have been adjusted by avail value in then non-RDAM mounts. >>>=20 >>> Again, I'm not sure if this is the right fix. But this one works for >>> both non-RDMA and RDMA mounts. >>>=20 >>> diff --git a/fs/nfsd/nfs4xdr.c b/fs/nfsd/nfs4xdr.c >>> index 2c61c6b..3178997 100644 >>> --- a/fs/nfsd/nfs4xdr.c >>> +++ b/fs/nfsd/nfs4xdr.c >>> @@ -1293,7 +1293,10 @@ static __be32 nfsd4_decode_opaque(struct >>> nfsd4_compoundargs *argp, struct xdr_ne >>>=20 >>> len -=3D avail; >>>=20 >>> - pages =3D len >> PAGE_SHIFT; >>> + if (!avail) >>> + pages =3D write->wr_buflen >> PAGE_SHIFT; >>> + else >>> + pages =3D len >> PAGE_SHIFT; >>> argp->pagelist +=3D pages; >>> argp->pagelen -=3D pages * PAGE_SIZE; >>> len -=3D pages * PAGE_SIZE; >>=20 >> This code has been around since 2012. Has it always been >> broken for RDMA? I suspect the root problem is in the >> transport, not here in the decoder. Can you bisect to >> find out where the problem started to occur? >=20 > To see if I could go back to 2012, I first want to see if I could go > back to before mlx5 driver was added which was in 3.11-rc1. But I'm > having issues trying to boot into a kernel based on 3.10 or 3.9. I'm > trying to verify if my CX-5 card would even work with mlx4 driver > which I'm having doubts it would. The CX-5 works only with mlx5. > These old kernels fail to boot in > XFS saying that it's a wrong version. Some googling makes me think > that since my XFS partition was created with a newer kernel it's not > compatible with an older kernel... That's correct, the XFS filesystem has new features like metadata checksum that don't work on older kernels. To start with, you could try some very late 3.x or early 4.x kernels. My intuition is that you will find an svcrdma change that is more recent than 2012 that is causing the underlying issue. >>>>>>> If this looks OK, I can send a patch. >>>>>>> -- >>>>>>> To unsubscribe from this list: send the line "unsubscribe = linux-nfs" in >>>>>>> the body of a message to majordomo@vger.kernel.org >>>>>>> More majordomo info at = http://vger.kernel.org/majordomo-info.html >>>>>>=20 >>>>>> -- >>>>>> Chuck Lever >>>>>>=20 >>>>>>=20 >>>>>>=20 >>>>>> -- >>>>>> To unsubscribe from this list: send the line "unsubscribe = linux-nfs" in >>>>>> the body of a message to majordomo@vger.kernel.org >>>>>> More majordomo info at = http://vger.kernel.org/majordomo-info.html >>>>> -- >>>>> To unsubscribe from this list: send the line "unsubscribe = linux-nfs" in >>>>> the body of a message to majordomo@vger.kernel.org >>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>>>=20 >>>> -- >>>> Chuck Lever >>>>=20 >>>>=20 >>>>=20 >>> -- >>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" = in >>> the body of a message to majordomo@vger.kernel.org >>> More majordomo info at http://vger.kernel.org/majordomo-info.html >>=20 >> -- >> Chuck Lever >>=20 >>=20 >>=20 > -- > To unsubscribe from this list: send the line "unsubscribe linux-nfs" = in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- Chuck Lever