Date: Fri, 6 Feb 2015 15:28:00 -0500
From: "J. Bruce Fields"
To: Chuck Lever
Cc: Christoph Hellwig, Anna Schumaker, Linux NFS Mailing List
Subject: Re: [PATCH v2 2/4] NFSD: Add READ_PLUS support for data segments
Message-ID: <20150206202800.GH29783@fieldses.org>
In-Reply-To: <265D0458-ED72-4154-B0E3-F828E3D36E5A@oracle.com>
References: <20150205162326.GA18977@infradead.org>
 <54D39DC2.9060808@Netapp.com>
 <20150206115456.GA28915@infradead.org>
 <20150206160848.GA29783@fieldses.org>
 <067E5610-290A-4AA7-9A19-F2EF9AB4163E@oracle.com>
 <8B871365-A241-4BA8-BD95-0946AEA55E38@oracle.com>
 <20150206175915.GE29783@fieldses.org>
 <20150206193529.GF29783@fieldses.org>
 <265D0458-ED72-4154-B0E3-F828E3D36E5A@oracle.com>

On Fri, Feb 06, 2015 at 03:07:08PM -0500, Chuck Lever wrote:
> 
> On Feb 6, 2015, at 2:35 PM, J. Bruce Fields wrote:
> 
> > On Fri, Feb 06, 2015 at 01:44:15PM -0500, Chuck Lever wrote:
> >> 
> >> On Feb 6, 2015, at 12:59 PM, J. Bruce Fields wrote:
> >> 
> >>> On Fri, Feb 06, 2015 at 12:04:13PM -0500, Chuck Lever wrote:
> >>>> 
> >>>> On Feb 6, 2015, at 11:46 AM, Chuck Lever wrote:
> >>>> 
> >>>>> 
> >>>>> On Feb 6, 2015, at 11:08 AM, J. Bruce Fields wrote:
> >>>>> 
> >>>>>> On Fri, Feb 06, 2015 at 03:54:56AM -0800, Christoph Hellwig wrote:
> >>>>>>> On Thu, Feb 05, 2015 at 11:43:46AM -0500, Anna Schumaker wrote:
> >>>>>>>>> The problem is that the typical case of all data won't use
> >>>>>>>>> splice ever with your patches, as the 4.2 client will always
> >>>>>>>>> send a READ_PLUS.
> >>>>>>>>> 
> >>>>>>>>> So we'll have to find a way to use it where it helps.  While
> >>>>>>>>> we might be able to add some hacks to only use splice for the
> >>>>>>>>> first segment, I guess we just need to make the splice support
> >>>>>>>>> generic enough in the long run.
> >>>>>>>> 
> >>>>>>>> I should be able to use splice easily enough if I detect that
> >>>>>>>> we're only returning a single DATA segment.
> >>>>>>> 
> >>>>>>> You could also elect to never return more than one data segment
> >>>>>>> as a start:
> >>>>>>> 
> >>>>>>>    In all situations, the server may choose to return fewer
> >>>>>>>    bytes than specified by the client.  The client needs to
> >>>>>>>    check for this condition and handle the condition
> >>>>>>>    appropriately.
> >>>>>> 
> >>>>>> Yeah, I think that was more-or-less what Anna's first attempt did,
> >>>>>> and I said "what if that means more round trips"?  The client
> >>>>>> can't anticipate the short reads, so it can't make up for this
> >>>>>> with parallelism.
> >>>>>> 
> >>>>>>> But doing any of these for a call that's really just an
> >>>>>>> optimization sounds odd.  I'd really like to see an evaluation
> >>>>>>> of the READ_PLUS impact on various workloads before offering it.
> >>>>>> 
> >>>>>> Yes, unfortunately I don't see a way to make this just an obvious
> >>>>>> win.
> >>>>> 
> >>>>> I don’t think a “win” is necessary. It simply needs to be no worse
> >>>>> than READ for current use cases.
> >>>>> 
> >>>>> READ_PLUS should be a win for the particular use cases it was
> >>>>> designed for (large sparsely-populated datasets). Without a
> >>>>> demonstrated benefit I think there’s no point in keeping it.
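For orientation, here is a rough C sketch of the reply shape under
discussion, together with the "single DATA segment" test mentioned
above.  The names are invented for illustration and are not taken from
the patch or from the NFSv4.2 spec:

    /*
     * Sketch only: a READ_PLUS result is an array of segments, each
     * either DATA or HOLE.  The helper shows the "splice only when the
     * whole range is a single DATA extent" idea.
     */
    #include <stdbool.h>
    #include <stdint.h>

    enum seg_type { SEG_DATA, SEG_HOLE };

    struct read_plus_seg {
            enum seg_type type;
            uint64_t offset;        /* byte offset in the file */
            uint64_t length;        /* length of the segment in bytes */
    };

    /* Splice is only safe when the requested range is one DATA extent. */
    static bool read_plus_can_splice(const struct read_plus_seg *segs,
                                     unsigned int nsegs,
                                     uint64_t offset, uint64_t count)
    {
            return nsegs == 1 &&
                   segs[0].type == SEG_DATA &&
                   segs[0].offset <= offset &&
                   segs[0].offset + segs[0].length >= offset + count;
    }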
> >>>>> 
> >>>>>> (Is there any way we could make it so with better protocol?
> >>>>>> Maybe RDMA could help get the alignment right in multiple-segment
> >>>>>> cases?  But then I think there needs to be some sort of language
> >>>>>> about RDMA, or else we're stuck with:
> >>>>>> 
> >>>>>> 	https://tools.ietf.org/html/rfc5667#section-5
> >>>>>> 
> >>>>>> which I think forces us to return READ_PLUS data inline, another
> >>>>>> possible READ_PLUS regression.)
> >>>> 
> >>>> Btw, if I understand this correctly:
> >>>> 
> >>>> Without a spec update, a large NFS READ_PLUS reply would be
> >>>> returned in a reply list, which is moved via RDMA WRITE, just like
> >>>> READ replies.
> >>>> 
> >>>> The difference is NFS READ payload is placed directly into the
> >>>> client’s page cache by the adapter. With a reply list, the client
> >>>> transport would need to copy the returned data into the page cache.
> >>>> And a large reply buffer would be needed.
> >>>> 
> >>>> So, slower, yes. But not inline.
> >>> 
> >>> I'm not very good at this, bear with me, but: the above-referenced
> >>> section doesn't talk about "reply lists", only "write lists", and
> >>> only explains how to use write lists for READ and READLINK data, and
> >>> seems to expect everything else to be sent inline.
> >> 
> >> I may have some details wrong, but this is my understanding.
> >> 
> >> Small replies are sent inline. There is a size maximum for inline
> >> messages, however. I guess 5667 section 5 assumes this context, which
> >> appears throughout RFC 5666.
> >> 
> >> If an expected reply exceeds the inline size, then a client will
> >> set up a reply list for the server. A memory region on the client is
> >> registered as a target for RDMA WRITE operations, and the
> >> co-ordinates of that region are sent to the server in the RPC call.
> >> 
> >> If the server finds the reply will indeed be larger than the inline
> >> maximum, it plants the reply in the client memory region described by
> >> the request’s reply list, and repeats the co-ordinates of that region
> >> back to the client in the RPC reply.
> >> 
> >> A server may also choose to send a small reply inline, even if the
> >> client provided a reply list. In that case, the server does not
> >> repeat the reply list in the reply, and the full reply appears
> >> inline.
> >> 
> >> Linux registers part of the RPC reply buffer for the reply list.
> >> After it is received on the client, the reply payload is copied by
> >> the client CPU to its final destination.
> >> 
> >> Inline and reply list are the mechanisms used when the upper layer
> >> has some processing to do on the incoming data (e.g. READDIR). When
> >> a request just needs raw data to be simply dropped off in the
> >> client’s memory, the write list is preferred. A write list is
> >> basically zero-copy I/O.
> > 
> > The term "reply list" doesn't appear in either RFC.  I believe you
> > mean "client-posted write list" in most of the above, except this
> > last paragraph, which should have started with "Inline and
> > server-posted read list..."?
> 
> No, I meant “reply list.” Definitely not read list.
> 
> The terms used in the RFCs and the implementations vary,

OK.  Would you mind defining the term "reply list" for me?  Google's
not helping.

--b.

> unfortunately, and only the read list is an actual list. The write and
> reply lists are actually two separate counted arrays that are both
> expressed using xdr_write_list.
> 
> Have a look at RFC 5666, section 5.2, where it is referred to as
> either a “long reply” or a “reply chunk.”
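A rough C transliteration of the chunk structures defined in RFC 5666
may make the distinction clearer: the read list really is a linked
list, a write chunk is a counted array of segments, and both the write
list and the reply chunk (the "long reply" target) are built from that
same write-chunk encoding.  The identifiers below are illustrative
only; they are not the RFC's XDR names or the Linux kernel's:

    #include <stdint.h>

    struct rdma_segment {                   /* one registered memory region */
            uint32_t handle;                /* steering tag / rkey */
            uint32_t length;                /* length of the region in bytes */
            uint64_t offset;                /* virtual address or offset */
    };

    struct read_chunk {                     /* one entry in the read list */
            uint32_t position;              /* position in the XDR stream */
            struct rdma_segment target;
    };

    struct read_list {                      /* a genuine linked list */
            struct read_chunk entry;
            struct read_list *next;
    };

    struct write_chunk {                    /* a counted array of segments */
            uint32_t nsegs;
            struct rdma_segment *segs;
    };

    struct rpc_rdma_chunks {                /* tail of the RPC-over-RDMA header */
            struct read_list *read_list;    /* source buffers for RDMA Read, or NULL */
            uint32_t nwrite_chunks;         /* the write list: zero or more chunks, */
            struct write_chunk *write_list; /*   e.g. targets for READ payload */
            struct write_chunk *reply_chunk;/* target for a "long reply", or NULL */
    };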
> >> But these choices are fixed by the specified RPC/RDMA binding of
> >> the upper layer protocol (that’s what RFC 5667 is). NFS READ and
> >> READLINK are the only NFS operations allowed to use a write list.
> >> (NFSv4 compounds are somewhat ambiguous, and that too needs to be
> >> addressed).
> >> 
> >> As READ_PLUS conveys both kinds of data (zero-copy and data that
> >> might require some processing) IMO RFC 5667 does not provide
> >> adequate guidance about how to convey READ_PLUS. It will need to be
> >> added somewhere.
> > 
> > OK, good.  I wonder how it would do this.  The best the client could
> > do, I guess, is provide the same write list it would for a READ of
> > the same extent.  Could the server then write just the pieces of that
> > extent it needs to, send the hole information inline, and leave it to
> > the client to do any necessary zeroing?  (And is any of this worth
> > it?)
> 
> Conveying large data payloads using zero-copy techniques should be
> beneficial.
> 
> Since hole information could appear in a reply list if it were large,
> and thus would not be inline, technically speaking, the best we can
> say is that hole information wouldn’t be eligible for the write list.
> 
> -- 
> Chuck Lever
> chuck[dot]lever[at]oracle[dot]com
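If it worked the way suggested above, with the server writing only the
DATA pieces into the client-posted write list and returning the hole
information inline (or in a reply chunk), the client would be left to
zero the gaps itself.  A minimal sketch of that step follows, with
entirely hypothetical names and no claim about how any existing client
implements it:

    #include <stdint.h>
    #include <string.h>

    struct hole_info {
            uint64_t offset;        /* hole offset within the file */
            uint64_t length;        /* hole length in bytes */
    };

    /* buf holds the reply data for [rd_offset, rd_offset + rd_count). */
    static void read_plus_zero_holes(char *buf, uint64_t rd_offset,
                                     uint64_t rd_count,
                                     const struct hole_info *holes,
                                     unsigned int nholes)
    {
            for (unsigned int i = 0; i < nholes; i++) {
                    uint64_t start = holes[i].offset;
                    uint64_t end = start + holes[i].length;

                    /* clamp the hole to the requested range */
                    if (start < rd_offset)
                            start = rd_offset;
                    if (end > rd_offset + rd_count)
                            end = rd_offset + rd_count;
                    if (start >= end)
                            continue;

                    memset(buf + (start - rd_offset), 0, end - start);
            }
    }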