From: Devesh Sharma
Date: Sat, 6 Feb 2016 21:26:32 +0530
Subject: Re: [PATCH v1 02/10] svcrdma: Make svc_rdma_get_frmr() not sleep
To: Chuck Lever
Cc: linux-rdma@vger.kernel.org, Linux NFS Mailing List

On Fri, Feb 5, 2016 at 11:43 PM, Chuck Lever wrote:
>
>> On Feb 5, 2016, at 12:49 PM, Devesh Sharma wrote:
>>
>> On Fri, Feb 5, 2016 at 9:59 PM, Chuck Lever wrote:
>>>
>>>> On Feb 5, 2016, at 5:15 AM, Devesh Sharma wrote:
>>>>
>>>> As I understand it, you are pre-allocating because alloc_mr() can
>>>> sleep, so you separate it out, while the map-frmr-wqe path is a
>>>> non-blocking context by definition. Are we on the same page?
>>>
>>> alloc_mr() can sleep, and so can kmalloc(GFP_KERNEL).
>>> Currently svc_rdma_get_frmr() is called in a worker thread,
>>> so it's in a process context and can sleep, but if we were
>>> to move the RDMA Read logic into a softIRQ context (say,
>>> invoked during Receive processing), it would be a problem
>>> to sleep.
>>
>> Got it.
>>
>>>
>>> Another goal is to get rid of memory allocation operations
>>> that can fail in low memory situations in areas of the code
>>> where an error is hard to handle.
>>
>>>
>>> So the patch description is misleading and incorrect.
>>
>> Got it.
>>
>>>
>>> However, the problem with pre-allocating frmrs is that it
>>> is not possible for the server to determine how many will
>>> be needed at accept time. The maximum number depends on the
>>> largest possible ULP payload size, the page list depth of
>>> the server's device, and the maximum chunk size the client
>>> is using (it could be as small as 4KB for some registration
>>> modes). That would mean allocating thousands per connection
>>> up front to cover every contingency.
>>
>> Can't it decide on the fly? Start with 'n' smaller frmrs, where 'n'
>> is derived from some static calculation. Then, if the connection
>> demands a bigger size, increase the depth of all 'n' frmrs and reuse
>> them for the life of the connection. If it demands more frmrs, don't
>> do anything and try to manage with 'n'. Does that make sense? I am
>> sure I am missing a few points here.
>
> Currently, all svc_rdma_frmrs are the same. They are allocated
> on-demand then cached after an RDMA Read operation is complete.

Yes, it does the same.

>
>
>>> Perhaps a better solution would be to understand how to
>>> deal with an frmr allocation failure in rdma_read_chunk_frmr().
>>> I think just dropping the connection would be reasonable.
>>> If ib_post_recv() fails, for example, this is exactly what
>>> the server does.
>>
>> Dropping the connection seems logical; however, I think it would be
>> a harsh decision for any hypothetical device which has fewer IB/RoCE
>> resources, e.g. any given VM in an SR-IOV enabled environment where
>> lots of VFs are enabled.
>>
>> As suggested earlier in this mail, just continuing the connection with
>> whatever resources it has would let a low-resource device still use it.
>> If the system recovers from memory pressure, the allocations would
>> eventually succeed. The drawback is a performance penalty.
>
> When an frmr allocation fails, the RDMA Read to pull the
> RPC data over can't be done, so there's no recourse but for
> the server to discard that RPC request. NFSv4.x requires a
> connection drop when an RPC is lost. The client will retransmit
> that RPC.
>
> Dropping the connection frees all resources for that connection.
> Clients will typically reconnect pretty quickly when they have
> more work to do. The server will drop the connection repeatedly
> until enough resources are available to handle the client's
> workload.

OK, so effectively the user will see "failed to reconnect" messages in
syslog until it succeeds. I think if the syslog messages display a
valid reason then we are covered here.

>
> Eventually we could add a retry mechanism to the RDMA Read
> logic. Wait a bit and try the allocation again. That's probably
> not worth the effort; this really should be very, very rare.

Makes sense, I agree with you here.

>
> If a device is ultimately resource-starved, that's a different
> problem. The server may never be able to allocate enough MRs
> even if the client tries again later.

Yes.

>
> That kind of issue should be obvious at accept time. The server
> should adjust the connection's credit limit downward to fit the
> device's available resource budget.
>
> Alternately, if the device is not an iWARP device, RDMA Reads
> can use physical addressing, and no frwrs are needed.
>
>

OK, let's wait for Christoph's patches to the NFS server. Till then I
agree with your decision on this patch.

>>> Christoph says he is about to modify or replace the NFS
>>> server's RDMA Read code paths, however.
>>>
>>> I'll drop this one.
>>>
>>>
>>>> On Wed, Feb 3, 2016 at 9:21 PM, Chuck Lever wrote:
>>>>> Want to call it in a context that cannot sleep. So pre-alloc
>>>>> the memory and the MRs.
>>>>>
>>>>> Signed-off-by: Chuck Lever
>>>>> ---
>>>>>  net/sunrpc/xprtrdma/svc_rdma_transport.c |   44 ++++++++++++++++++++++++++----
>>>>>  1 file changed, 38 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>>> index 5763825..02eee12 100644
>>>>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>>> @@ -949,7 +949,7 @@ static struct svc_rdma_fastreg_mr *rdma_alloc_frmr(struct svcxprt_rdma *xprt)
>>>>>         return ERR_PTR(-ENOMEM);
>>>>>  }
>>>>>
>>>>> -static void rdma_dealloc_frmr_q(struct svcxprt_rdma *xprt)
>>>>> +static void svc_rdma_destroy_frmrs(struct svcxprt_rdma *xprt)
>>>>>  {
>>>>>         struct svc_rdma_fastreg_mr *frmr;
>>>>>
>>>>> @@ -963,6 +963,37 @@ static void rdma_dealloc_frmr_q(struct svcxprt_rdma *xprt)
>>>>>         }
>>>>>  }
>>>>>
>>>>> +static bool svc_rdma_prealloc_frmrs(struct svcxprt_rdma *xprt)
>>>>> +{
>>>>> +       struct ib_device *dev = xprt->sc_cm_id->device;
>>>>> +       unsigned int i;
>>>>> +
>>>>> +       /* Pre-allocate based on the maximum amount of payload
>>>>> +        * the server's HCA can handle per RDMA Read, to keep
>>>>> +        * the number of MRs per connection in check.
>>>>> +        *
>>>>> +        * If a client sends small Read chunks (eg. it may be
>>>>> +        * using physical registration), more RDMA Reads per
>>>>> +        * NFS WRITE will be needed. svc_rdma_get_frmr() dips
>>>>> +        * into its reserve in that case.
>>>>> +        * Better would be for the server to reduce the
>>>>> +        * connection's credit limit.
>>>>> +        */
>>>>> +       i = 1 + RPCSVC_MAXPAGES / dev->attrs.max_fast_reg_page_list_len;
>>>>> +       i *= xprt->sc_max_requests;
>>>>> +
>>>>> +       while (i--) {
>>>>> +               struct svc_rdma_fastreg_mr *frmr;
>>>>> +
>>>>> +               frmr = rdma_alloc_frmr(xprt);
>>>>> +               if (!frmr) {
>>>>> +                       dprintk("svcrdma: No memory for request map\n");
>>>>> +                       return false;
>>>>> +               }
>>>>> +               list_add(&frmr->frmr_list, &xprt->sc_frmr_q);
>>>>> +       }
>>>>> +       return true;
>>>>> +}
>>>>> +
>>>>>  struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
>>>>>  {
>>>>>         struct svc_rdma_fastreg_mr *frmr = NULL;
>>>>> @@ -975,10 +1006,9 @@ struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
>>>>>                 frmr->sg_nents = 0;
>>>>>         }
>>>>>         spin_unlock_bh(&rdma->sc_frmr_q_lock);
>>>>> -       if (frmr)
>>>>> -               return frmr;
>>>>> -
>>>>> -       return rdma_alloc_frmr(rdma);
>>>>> +       if (!frmr)
>>>>> +               return ERR_PTR(-ENOMEM);
>>>>> +       return frmr;
>>>>>  }
>>>>>
>>>>>  void svc_rdma_put_frmr(struct svcxprt_rdma *rdma,
>>>>> @@ -1149,6 +1179,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>>>>>                         dev->attrs.max_fast_reg_page_list_len;
>>>>>                 newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_FAST_REG;
>>>>>                 newxprt->sc_reader = rdma_read_chunk_frmr;
>>>>> +               if (!svc_rdma_prealloc_frmrs(newxprt))
>>>>> +                       goto errout;
>>>>>         }
>>>>>
>>>>>         /*
>>>>> @@ -1310,7 +1342,7 @@ static void __svc_rdma_free(struct work_struct *work)
>>>>>                 xprt->xpt_bc_xprt = NULL;
>>>>>         }
>>>>>
>>>>> -       rdma_dealloc_frmr_q(rdma);
>>>>> +       svc_rdma_destroy_frmrs(rdma);
>>>>>         svc_rdma_destroy_ctxts(rdma);
>>>>>         svc_rdma_destroy_maps(rdma);
>>>
>>> --
>>> Chuck Lever
>
> --
> Chuck Lever
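
For readers who want the sizing math in svc_rdma_prealloc_frmrs() spelled
out, the stand-alone C sketch below reproduces the same arithmetic with
assumed, typical values (roughly a 1MB maximum payload with 4KB pages, a
device fast-reg page-list depth of 256, and 32 credits). None of these
numbers come from the patch itself; the real code takes them from
RPCSVC_MAXPAGES, dev->attrs.max_fast_reg_page_list_len, and
xprt->sc_max_requests at accept time.

/* Stand-alone sketch of the pre-allocation sizing in
 * svc_rdma_prealloc_frmrs() above.  All values are assumptions made for
 * illustration, not taken from the kernel sources.
 */
#include <stdio.h>

int main(void)
{
	unsigned int rpcsvc_maxpages = 259;	/* ~1MB payload with 4KB pages, plus slack */
	unsigned int page_list_len = 256;	/* assumed max_fast_reg_page_list_len */
	unsigned int max_requests = 32;		/* assumed sc_max_requests (credit limit) */
	unsigned int nr_frmrs;

	/* One MR covers up to page_list_len pages, so a maximum-size
	 * RDMA Read payload needs 1 + rpcsvc_maxpages / page_list_len MRs,
	 * and one such payload may be in flight per credit.
	 */
	nr_frmrs = 1 + rpcsvc_maxpages / page_list_len;
	nr_frmrs *= max_requests;

	printf("frmrs pre-allocated per connection: %u\n", nr_frmrs);
	return 0;
}

With these assumed values each full-size RDMA Read needs two MRs, so 64
frmrs end up on sc_frmr_q per connection.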
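
And for the "just drop the connection" handling discussed earlier in the
thread, here is a minimal, hypothetical sketch of how a caller along the
lines of rdma_read_chunk_frmr() could react once svc_rdma_get_frmr()
returns ERR_PTR(-ENOMEM) instead of allocating on demand. This is not the
actual server code: the wrapper name example_read_chunk() is made up, and
only the svcrdma structures and helpers visible in the patch, plus the
standard XPT_CLOSE transport flag, are assumed.

/* Illustration only: handle svc_rdma_get_frmr() failure by marking the
 * connection for closure, as discussed in the thread.  The receive path
 * is assumed to propagate the negative return code.
 */
#include <linux/sunrpc/svc_rdma.h>	/* svcxprt_rdma, svc_rdma_get_frmr() */
#include <linux/sunrpc/svc_xprt.h>	/* XPT_CLOSE */

static int example_read_chunk(struct svcxprt_rdma *xprt)
{
	struct svc_rdma_fastreg_mr *frmr;

	frmr = svc_rdma_get_frmr(xprt);
	if (IS_ERR(frmr)) {
		/* No pre-allocated MR left: give up on this RPC and
		 * close the connection; the client will retransmit
		 * the RPC after it reconnects.
		 */
		set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
		return PTR_ERR(frmr);
	}

	/* ... register the payload pages with the frmr and post the
	 * RDMA Read here ...
	 */

	svc_rdma_put_frmr(xprt, frmr);
	return 0;
}

Returning a negative errno and setting XPT_CLOSE mirrors what the server
already does when ib_post_recv() fails, which is the behavior Chuck
describes above.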