2016-02-03 15:51:23

by Chuck Lever

Subject: [PATCH v1 00/10] NFS/RDMA server patches for v4.6

For review.

- Fix XDR round-up in write chunks (take 3)
- Pre-allocate FRWRs when accepting a connection
- Add proper handling of bad header XDR
- Convert the server to use the new core CQ API added in v4.5.

Also available in the "nfsd-rdma-for-4.6" topic branch of this git
repo:

git://git.linux-nfs.org/projects/cel/cel-2.6.git

Or for browsing:

http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfsd-rdma-for-4.6

---

Chuck Lever (10):
svcrdma: Do not send XDR roundup bytes for a write chunk
svcrdma: Make svc_rdma_get_frmr() not sleep
svcrdma: svc_rdma_post_recv() should close connection on error
rpcrdma: Add missing XDR union fields for RDMA errors
svcrdma: Make RDMA_ERROR messages work
svcrdma: Use correct XID in error replies
svcrdma: Hook up the logic to return ERR_CHUNK
svcrdma: Remove close_out exit path
svcrdma: Use new CQ API for RPC-over-RDMA server receive CQs
svcrdma: Use new CQ API for RPC-over-RDMA server send CQs


include/linux/sunrpc/rpc_rdma.h | 13 +
include/linux/sunrpc/svc_rdma.h | 18 +
net/sunrpc/xprtrdma/svc_rdma_backchannel.c | 15 -
net/sunrpc/xprtrdma/svc_rdma_marshal.c | 62 ++--
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 61 +---
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 79 ++++-
net/sunrpc/xprtrdma/svc_rdma_transport.c | 488 +++++++++++-----------------
7 files changed, 346 insertions(+), 390 deletions(-)

--
Chuck Lever


2016-02-03 15:51:32

by Chuck Lever

Subject: [PATCH v1 01/10] svcrdma: Do not send XDR roundup bytes for a write chunk

When the Linux server writes an odd-length data item into a Write
chunk, it finishes with XDR pad bytes. If the data item is smaller
than the Write chunk, the pad bytes are written at the end of the
data item, but still inside the chunk. That can expose these zero
bytes to the RPC consumer on the client.

XDR pad bytes are inserted in order to preserve the XDR alignment
of the next XDR data item in an XDR stream. But Write chunks do not
appear in the payload XDR stream, and only one data item is allowed
in each chunk. So XDR padding is unneeded.

Thus the server should not write XDR pad bytes in Write chunks.

I believe this is not an operational problem. Short NFS READs that
are returned in a Write chunk would be affected by this issue. But
they happen only when the read request goes past the EOF. Those are
zero bytes anyway, and there's no file data in the client's buffer
past EOF.

Otherwise, current NFS clients provide a separate extra segment for
catching XDR padding. If an odd-size data item fills the chunk,
then the XDR pad will be written to the extra segment.
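
For reference, this is the quad-alignment rule whose pad bytes the
patch suppresses for Write chunks. It is an illustrative sketch of the
usual round-up arithmetic, not code from the patch:

  /* A 3-byte data item normally needs 1 pad byte so the next item in
   * an XDR stream stays 4-byte aligned. A Write chunk carries exactly
   * one data item outside the XDR stream, so the pad can be omitted.
   */
  static inline unsigned int xdr_pad_bytes(unsigned int len)
  {
          return (4 - (len & 3)) & 3;
  }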

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index df57f3c..8591314 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -308,7 +308,7 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
struct svc_rqst *rqstp,
struct svc_rdma_req_map *vec)
{
- u32 xfer_len = rqstp->rq_res.page_len + rqstp->rq_res.tail[0].iov_len;
+ u32 xfer_len = rqstp->rq_res.page_len;
int write_len;
u32 xdr_off;
int chunk_off;


2016-02-03 15:51:40

by Chuck Lever

Subject: [PATCH v1 02/10] svcrdma: Make svc_rdma_get_frmr() not sleep

svc_rdma_get_frmr() will be called in a context that cannot sleep, so
pre-allocate the memory and the MRs when a connection is accepted.
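
A worked sizing example (the values below are assumptions chosen for
illustration, not taken from the patch):

  /* Hypothetical HCA and transport values:
   *   dev->attrs.max_fast_reg_page_list_len = 256
   *   RPCSVC_MAXPAGES                        = 258 (roughly, 1MB payload)
   *   xprt->sc_max_requests                  = 32
   *
   *   i = 1 + 258 / 256;   ->  2
   *   i *= 32;             -> 64 FRMRs pre-allocated per connection
   */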

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/svc_rdma_transport.c | 44 ++++++++++++++++++++++++++----
1 file changed, 38 insertions(+), 6 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 5763825..02eee12 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -949,7 +949,7 @@ static struct svc_rdma_fastreg_mr *rdma_alloc_frmr(struct svcxprt_rdma *xprt)
return ERR_PTR(-ENOMEM);
}

-static void rdma_dealloc_frmr_q(struct svcxprt_rdma *xprt)
+static void svc_rdma_destroy_frmrs(struct svcxprt_rdma *xprt)
{
struct svc_rdma_fastreg_mr *frmr;

@@ -963,6 +963,37 @@ static void rdma_dealloc_frmr_q(struct svcxprt_rdma *xprt)
}
}

+static bool svc_rdma_prealloc_frmrs(struct svcxprt_rdma *xprt)
+{
+ struct ib_device *dev = xprt->sc_cm_id->device;
+ unsigned int i;
+
+ /* Pre-allocate based on the maximum amount of payload
+ * the server's HCA can handle per RDMA Read, to keep
+ * the number of MRs per connection in check.
+ *
+ * If a client sends small Read chunks (e.g. it may be
+ * using physical registration), more RDMA Reads per
+ * NFS WRITE will be needed. svc_rdma_get_frmr() dips
+ * into its reserve in that case. Better would be for
+ * the server to reduce the connection's credit limit.
+ */
+ i = 1 + RPCSVC_MAXPAGES / dev->attrs.max_fast_reg_page_list_len;
+ i *= xprt->sc_max_requests;
+
+ while (i--) {
+ struct svc_rdma_fastreg_mr *frmr;
+
+ frmr = rdma_alloc_frmr(xprt);
+ if (!frmr) {
+ dprintk("svcrdma: No memory for FRMRs\n");
+ return false;
+ }
+ list_add(&frmr->frmr_list, &xprt->sc_frmr_q);
+ }
+ return true;
+}
+
struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
{
struct svc_rdma_fastreg_mr *frmr = NULL;
@@ -975,10 +1006,9 @@ struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
frmr->sg_nents = 0;
}
spin_unlock_bh(&rdma->sc_frmr_q_lock);
- if (frmr)
- return frmr;
-
- return rdma_alloc_frmr(rdma);
+ if (!frmr)
+ return ERR_PTR(-ENOMEM);
+ return frmr;
}

void svc_rdma_put_frmr(struct svcxprt_rdma *rdma,
@@ -1149,6 +1179,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
dev->attrs.max_fast_reg_page_list_len;
newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_FAST_REG;
newxprt->sc_reader = rdma_read_chunk_frmr;
+ if (!svc_rdma_prealloc_frmrs(newxprt))
+ goto errout;
}

/*
@@ -1310,7 +1342,7 @@ static void __svc_rdma_free(struct work_struct *work)
xprt->xpt_bc_xprt = NULL;
}

- rdma_dealloc_frmr_q(rdma);
+ svc_rdma_destroy_frmrs(rdma);
svc_rdma_destroy_ctxts(rdma);
svc_rdma_destroy_maps(rdma);



2016-02-03 15:51:56

by Chuck Lever

Subject: [PATCH v1 04/10] rpcrdma: Add missing XDR union fields for RDMA errors

Add some infrastructure (shared between xprtrdma and svcrdma) for
constructing and parsing RPC-over-RDMA error status information.
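
As a sketch of how a sender might fill the new fields for an ERR_VERS
reply (an illustration, not code from the patch; "rmsgp" here is a
hypothetical pointer to the reply header being built):

  rmsgp->rm_vers = rpcrdma_version;
  rmsgp->rm_type = cpu_to_be32(RDMA_ERROR);
  rmsgp->rm_body.rm_error.rm_err = err_vers;
  rmsgp->rm_body.rm_error.rm_vers_low = rpcrdma_version;
  rmsgp->rm_body.rm_error.rm_vers_hi = rpcrdma_version;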

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/rpc_rdma.h | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/include/linux/sunrpc/rpc_rdma.h b/include/linux/sunrpc/rpc_rdma.h
index f33c5a4..a5a36c0 100644
--- a/include/linux/sunrpc/rpc_rdma.h
+++ b/include/linux/sunrpc/rpc_rdma.h
@@ -93,6 +93,12 @@ struct rpcrdma_msg {
__be32 rm_pempty[3]; /* 3 empty chunk lists */
} rm_padded;

+ struct {
+ __be32 rm_err;
+ __be32 rm_vers_low;
+ __be32 rm_vers_hi;
+ } rm_error;
+
__be32 rm_chunks[0]; /* read, write and reply chunks */

} rm_body;
@@ -102,12 +108,16 @@ struct rpcrdma_msg {
* Smallest RPC/RDMA header: rm_xid through rm_type, then rm_nochunks
*/
#define RPCRDMA_HDRLEN_MIN (sizeof(__be32) * 7)
+#define RPCRDMA_HDRLEN_ERR (sizeof(__be32) * 5)

enum rpcrdma_errcode {
ERR_VERS = 1,
ERR_CHUNK = 2
};

+#define err_vers cpu_to_be32(ERR_VERS)
+#define err_chunk cpu_to_be32(ERR_CHUNK)
+
struct rpcrdma_err_vers {
uint32_t rdma_vers_low; /* Version range supported by peer */
uint32_t rdma_vers_high;
@@ -118,7 +128,8 @@ enum rpcrdma_proc {
RDMA_NOMSG = 1, /* An RPC call or reply msg - separate body */
RDMA_MSGP = 2, /* An RPC call or reply msg with padding */
RDMA_DONE = 3, /* Client signals reply completion */
- RDMA_ERROR = 4 /* An RPC RDMA encoding error */
+ RDMA_ERROR = 4, /* An RPC RDMA encoding error */
+ RDMA_LAST
};

#define rdma_msg cpu_to_be32(RDMA_MSG)


2016-02-03 15:52:04

by Chuck Lever

Subject: [PATCH v1 05/10] svcrdma: Make RDMA_ERROR messages work

Fix several issues with svc_rdma_send_error():

- Post a receive buffer to replace the one that was consumed by
the incoming request
- Posting a send should use DMA_TO_DEVICE, not DMA_FROM_DEVICE
- No need to put_page _and_ free pages in svc_rdma_put_context
- Make sure the sge is set up completely in case the error
path goes through svc_rdma_unmap_dma()

Related fixes in svc_rdma_recvfrom():

- Don't leak the ctxt associated with the incoming request
- Don't close the connection after sending an error reply
- Let svc_rdma_send_error() figure out the right header error code

As a last clean up, move svc_rdma_send_error() to svc_rdma_sendto.c
with other similar functions. There is some common logic in these
functions that could someday be combined to reduce code duplication.

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc_rdma.h | 4 +-
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 19 ++++-----
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 62 ++++++++++++++++++++++++++++++
net/sunrpc/xprtrdma/svc_rdma_transport.c | 54 --------------------------
4 files changed, 73 insertions(+), 66 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index d8fc58d..0589918 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -228,11 +228,11 @@ extern int svc_rdma_map_xdr(struct svcxprt_rdma *, struct xdr_buf *,
extern int svc_rdma_sendto(struct svc_rqst *);
extern struct rpcrdma_read_chunk *
svc_rdma_get_read_chunk(struct rpcrdma_msg *);
+extern void svc_rdma_send_error(struct svcxprt_rdma *, struct rpcrdma_msg *,
+ int);

/* svc_rdma_transport.c */
extern int svc_rdma_send(struct svcxprt_rdma *, struct ib_send_wr *);
-extern void svc_rdma_send_error(struct svcxprt_rdma *, struct rpcrdma_msg *,
- enum rpcrdma_errcode);
extern int svc_rdma_post_recv(struct svcxprt_rdma *, gfp_t);
extern int svc_rdma_repost_recv(struct svcxprt_rdma *, gfp_t);
extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index acf15b8..0f09052 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -612,7 +612,6 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
struct svc_rdma_op_ctxt *ctxt = NULL;
struct rpcrdma_msg *rmsgp;
int ret = 0;
- int len;

dprintk("svcrdma: rqstp=%p\n", rqstp);

@@ -654,15 +653,10 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
rdma_build_arg_xdr(rqstp, ctxt, ctxt->byte_len);

/* Decode the RDMA header. */
- len = svc_rdma_xdr_decode_req(&rmsgp, rqstp);
- rqstp->rq_xprt_hlen = len;
-
- /* If the request is invalid, reply with an error */
- if (len < 0) {
- if (len == -ENOSYS)
- svc_rdma_send_error(rdma_xprt, rmsgp, ERR_VERS);
- goto close_out;
- }
+ ret = svc_rdma_xdr_decode_req(&rmsgp, rqstp);
+ if (ret < 0)
+ goto out_err;
+ rqstp->rq_xprt_hlen = ret;

if (svc_rdma_is_backchannel_reply(xprt, rmsgp)) {
ret = svc_rdma_handle_bc_reply(xprt->xpt_bc_xprt, rmsgp,
@@ -698,6 +692,11 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
svc_xprt_copy_addrs(rqstp, xprt);
return ret;

+out_err:
+ svc_rdma_send_error(rdma_xprt, rmsgp, ret);
+ svc_rdma_put_context(ctxt, 0);
+ return 0;
+
close_out:
if (ctxt)
svc_rdma_put_context(ctxt, 1);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 687a91afe..73793dd 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -639,3 +639,65 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
svc_rdma_put_context(ctxt, 0);
return ret;
}
+
+void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,
+ int status)
+{
+ struct ib_send_wr err_wr;
+ struct page *p;
+ struct svc_rdma_op_ctxt *ctxt;
+ enum rpcrdma_errcode err;
+ __be32 *va;
+ int length;
+ int ret;
+
+ ret = svc_rdma_repost_recv(xprt, GFP_KERNEL);
+ if (ret)
+ return;
+
+ p = alloc_page(GFP_KERNEL);
+ if (!p)
+ return;
+ va = page_address(p);
+
+ /* XDR encode an error reply */
+ err = ERR_CHUNK;
+ if (status == -ENOSYS)
+ err = ERR_VERS;
+ length = svc_rdma_xdr_encode_error(xprt, rmsgp, err, va);
+
+ ctxt = svc_rdma_get_context(xprt);
+ ctxt->direction = DMA_TO_DEVICE;
+ ctxt->count = 1;
+ ctxt->pages[0] = p;
+
+ /* Prepare SGE for local address */
+ ctxt->sge[0].lkey = xprt->sc_pd->local_dma_lkey;
+ ctxt->sge[0].length = length;
+ ctxt->sge[0].addr = ib_dma_map_page(xprt->sc_cm_id->device,
+ p, 0, length, DMA_TO_DEVICE);
+ if (ib_dma_mapping_error(xprt->sc_cm_id->device, ctxt->sge[0].addr)) {
+ dprintk("svcrdma: Error mapping buffer for protocol error\n");
+ svc_rdma_put_context(ctxt, 1);
+ return;
+ }
+ atomic_inc(&xprt->sc_dma_used);
+
+ /* Prepare SEND WR */
+ memset(&err_wr, 0, sizeof err_wr);
+ ctxt->wr_op = IB_WR_SEND;
+ err_wr.wr_id = (unsigned long)ctxt;
+ err_wr.sg_list = ctxt->sge;
+ err_wr.num_sge = 1;
+ err_wr.opcode = IB_WR_SEND;
+ err_wr.send_flags = IB_SEND_SIGNALED;
+
+ /* Post It */
+ ret = svc_rdma_send(xprt, &err_wr);
+ if (ret) {
+ dprintk("svcrdma: Error %d posting send for protocol error\n",
+ ret);
+ svc_rdma_unmap_dma(ctxt);
+ svc_rdma_put_context(ctxt, 1);
+ }
+}
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index da82e36..e7bda1e 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -1465,57 +1465,3 @@ int svc_rdma_send(struct svcxprt_rdma *xprt, struct ib_send_wr *wr)
}
return ret;
}
-
-void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,
- enum rpcrdma_errcode err)
-{
- struct ib_send_wr err_wr;
- struct page *p;
- struct svc_rdma_op_ctxt *ctxt;
- __be32 *va;
- int length;
- int ret;
-
- p = alloc_page(GFP_KERNEL);
- if (!p)
- return;
- va = page_address(p);
-
- /* XDR encode error */
- length = svc_rdma_xdr_encode_error(xprt, rmsgp, err, va);
-
- ctxt = svc_rdma_get_context(xprt);
- ctxt->direction = DMA_FROM_DEVICE;
- ctxt->count = 1;
- ctxt->pages[0] = p;
-
- /* Prepare SGE for local address */
- ctxt->sge[0].addr = ib_dma_map_page(xprt->sc_cm_id->device,
- p, 0, length, DMA_FROM_DEVICE);
- if (ib_dma_mapping_error(xprt->sc_cm_id->device, ctxt->sge[0].addr)) {
- put_page(p);
- svc_rdma_put_context(ctxt, 1);
- return;
- }
- atomic_inc(&xprt->sc_dma_used);
- ctxt->sge[0].lkey = xprt->sc_pd->local_dma_lkey;
- ctxt->sge[0].length = length;
-
- /* Prepare SEND WR */
- memset(&err_wr, 0, sizeof err_wr);
- ctxt->wr_op = IB_WR_SEND;
- err_wr.wr_id = (unsigned long)ctxt;
- err_wr.sg_list = ctxt->sge;
- err_wr.num_sge = 1;
- err_wr.opcode = IB_WR_SEND;
- err_wr.send_flags = IB_SEND_SIGNALED;
-
- /* Post It */
- ret = svc_rdma_send(xprt, &err_wr);
- if (ret) {
- dprintk("svcrdma: Error %d posting send for protocol error\n",
- ret);
- svc_rdma_unmap_dma(ctxt);
- svc_rdma_put_context(ctxt, 1);
- }
-}


2016-02-03 15:52:13

by Chuck Lever

Subject: [PATCH v1 06/10] svcrdma: Use correct XID in error replies

When constructing an error reply, svc_rdma_xdr_encode_error()
needs to view the client's request message so it can get the
failing request's XID.

svc_rdma_xdr_decode_req() is supposed to return a pointer to the
client's request header. But if it fails to decode the client's
message (and thus an error reply is needed) it does not return the
pointer. The server then sends a bogus XID in the error reply.

Instead, unconditionally generate the pointer to the client's header
in svc_rdma_recvfrom(), and pass that pointer to both functions.
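
In other words, the encoder can simply echo the XID from the header it
is handed, along the lines of this sketch:

  /* First quad of the error reply repeats the failing request's XID */
  *va++ = rmsgp->rm_xid;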

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc_rdma.h | 2 +-
net/sunrpc/xprtrdma/svc_rdma_marshal.c | 7 +------
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 3 ++-
3 files changed, 4 insertions(+), 8 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 0589918..4ce7b74 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -199,7 +199,7 @@ extern int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt,
struct xdr_buf *rcvbuf);

/* svc_rdma_marshal.c */
-extern int svc_rdma_xdr_decode_req(struct rpcrdma_msg **, struct svc_rqst *);
+extern int svc_rdma_xdr_decode_req(struct rpcrdma_msg *, struct svc_rqst *);
extern int svc_rdma_xdr_encode_error(struct svcxprt_rdma *,
struct rpcrdma_msg *,
enum rpcrdma_errcode, __be32 *);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_marshal.c b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
index e2fca76..c011b12 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_marshal.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
@@ -145,15 +145,11 @@ static __be32 *decode_reply_array(__be32 *va, __be32 *vaend)
return (__be32 *)&ary->wc_array[nchunks];
}

-int svc_rdma_xdr_decode_req(struct rpcrdma_msg **rdma_req,
- struct svc_rqst *rqstp)
+int svc_rdma_xdr_decode_req(struct rpcrdma_msg *rmsgp, struct svc_rqst *rqstp)
{
- struct rpcrdma_msg *rmsgp = NULL;
__be32 *va, *vaend;
u32 hdr_len;

- rmsgp = (struct rpcrdma_msg *)rqstp->rq_arg.head[0].iov_base;
-
/* Verify that there's enough bytes for header + something */
if (rqstp->rq_arg.len <= RPCRDMA_HDRLEN_MIN) {
dprintk("svcrdma: header too short = %d\n",
@@ -201,7 +197,6 @@ int svc_rdma_xdr_decode_req(struct rpcrdma_msg **rdma_req,
hdr_len = (unsigned long)va - (unsigned long)rmsgp;
rqstp->rq_arg.head[0].iov_len -= hdr_len;

- *rdma_req = rmsgp;
return hdr_len;
}

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 0f09052..8f68cb6 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -653,7 +653,8 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
rdma_build_arg_xdr(rqstp, ctxt, ctxt->byte_len);

/* Decode the RDMA header. */
- ret = svc_rdma_xdr_decode_req(&rmsgp, rqstp);
+ rmsgp = (struct rpcrdma_msg *)rqstp->rq_arg.head[0].iov_base;
+ ret = svc_rdma_xdr_decode_req(rmsgp, rqstp);
if (ret < 0)
goto out_err;
rqstp->rq_xprt_hlen = ret;


2016-02-03 15:52:29

by Chuck Lever

Subject: [PATCH v1 08/10] svcrdma: Remove close_out exit path

Clean up: close_out is reached only when ctxt == NULL and XPT_CLOSE
is already set.

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 12 +-----------
1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index f8b840b..d3718e9 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -641,8 +641,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
* transport list
*/
if (test_bit(XPT_CLOSE, &xprt->xpt_flags))
- goto close_out;
-
+ goto defer;
goto out;
}
dprintk("svcrdma: processing ctxt=%p on xprt=%p, rqstp=%p, status=%d\n",
@@ -700,15 +699,6 @@ out_err:
svc_rdma_put_context(ctxt, 0);
return 0;

- close_out:
- if (ctxt)
- svc_rdma_put_context(ctxt, 1);
- dprintk("svcrdma: transport %p is closing\n", xprt);
- /*
- * Set the close bit and enqueue it. svc_recv will see the
- * close bit and call svc_xprt_delete
- */
- set_bit(XPT_CLOSE, &xprt->xpt_flags);
defer:
return 0;



2016-02-03 15:52:37

by Chuck Lever

Subject: [PATCH v1 09/10] svcrdma: Use new CQ API for RPC-over-RDMA server receive CQs

Calling ib_poll_cq() to sort through WCs during a completion is a
common pattern amongst RDMA consumers. Since commit 14d3a3b2498e
("IB: add a proper completion queue abstraction"), WC sorting can
be handled by the IB core.

By converting to this new API, svcrdma is made a better neighbor to
other RDMA consumers, as it allows the core to schedule the delivery
of completions more fairly amongst all active consumers.

Receive completions no longer use the dto_tasklet. Each polled
Receive WC is now handled individually in soft IRQ context.

The server transport's rdma_stat_rq_poll and rdma_stat_rq_prod
metrics are no longer updated.

As far as I can tell, ib_alloc_cq() does not enable a consumer to
specify a callback for handling async CQ events. This may cause
a problem with handling spurious connection loss, though I think
Receive completion flushing should be enough to force the server
to clean up properly anyway.
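
For reference, this is the ib_cqe pattern the conversion adopts,
sketched with hypothetical names rather than the svcrdma types:

  struct my_ctxt {
          struct ib_cqe cqe;
          /* ... per-WR state ... */
  };

  static void my_recv_done(struct ib_cq *cq, struct ib_wc *wc)
  {
          struct my_ctxt *ctxt =
                  container_of(wc->wr_cqe, struct my_ctxt, cqe);

          /* Only wc->wr_cqe and wc->status are reliable when flushed */
          if (wc->status != IB_WC_SUCCESS) {
                  my_ctxt_free(ctxt);             /* hypothetical cleanup */
                  return;
          }
          my_ctxt_handle_receive(ctxt);           /* hypothetical handler */
  }

  /* Posting side:  ctxt->cqe.done = my_recv_done;
   *                recv_wr.wr_cqe = &ctxt->cqe;
   * The CQ itself comes from ib_alloc_cq(dev, priv, depth, 0,
   * IB_POLL_SOFTIRQ) and is released with ib_free_cq().
   */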

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc_rdma.h | 2
net/sunrpc/xprtrdma/svc_rdma_transport.c | 130 +++++++++---------------------
2 files changed, 41 insertions(+), 91 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 4ce7b74..7de9cbb 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -75,6 +75,7 @@ struct svc_rdma_op_ctxt {
struct svc_rdma_fastreg_mr *frmr;
int hdr_count;
struct xdr_buf arg;
+ struct ib_cqe cqe;
struct list_head dto_q;
enum ib_wr_opcode wr_op;
enum ib_wc_status wc_status;
@@ -174,7 +175,6 @@ struct svcxprt_rdma {
struct work_struct sc_work;
};
/* sc_flags */
-#define RDMAXPRT_RQ_PENDING 1
#define RDMAXPRT_SQ_PENDING 2
#define RDMAXPRT_CONN_PENDING 3

diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index e7bda1e..316df77 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -68,7 +68,6 @@ static void svc_rdma_detach(struct svc_xprt *xprt);
static void svc_rdma_free(struct svc_xprt *xprt);
static int svc_rdma_has_wspace(struct svc_xprt *xprt);
static int svc_rdma_secure_port(struct svc_rqst *);
-static void rq_cq_reap(struct svcxprt_rdma *xprt);
static void sq_cq_reap(struct svcxprt_rdma *xprt);

static DECLARE_TASKLET(dto_tasklet, dto_tasklet_func, 0UL);
@@ -413,7 +412,6 @@ static void dto_tasklet_func(unsigned long data)
list_del_init(&xprt->sc_dto_q);
spin_unlock_irqrestore(&dto_lock, flags);

- rq_cq_reap(xprt);
sq_cq_reap(xprt);

svc_xprt_put(&xprt->sc_xprt);
@@ -422,93 +420,49 @@ static void dto_tasklet_func(unsigned long data)
spin_unlock_irqrestore(&dto_lock, flags);
}

-/*
- * Receive Queue Completion Handler
- *
- * Since an RQ completion handler is called on interrupt context, we
- * need to defer the handling of the I/O to a tasklet
- */
-static void rq_comp_handler(struct ib_cq *cq, void *cq_context)
-{
- struct svcxprt_rdma *xprt = cq_context;
- unsigned long flags;
-
- /* Guard against unconditional flush call for destroyed QP */
- if (atomic_read(&xprt->sc_xprt.xpt_ref.refcount)==0)
- return;
-
- /*
- * Set the bit regardless of whether or not it's on the list
- * because it may be on the list already due to an SQ
- * completion.
- */
- set_bit(RDMAXPRT_RQ_PENDING, &xprt->sc_flags);
-
- /*
- * If this transport is not already on the DTO transport queue,
- * add it
- */
- spin_lock_irqsave(&dto_lock, flags);
- if (list_empty(&xprt->sc_dto_q)) {
- svc_xprt_get(&xprt->sc_xprt);
- list_add_tail(&xprt->sc_dto_q, &dto_xprt_q);
- }
- spin_unlock_irqrestore(&dto_lock, flags);
-
- /* Tasklet does all the work to avoid irqsave locks. */
- tasklet_schedule(&dto_tasklet);
-}
-
-/*
- * rq_cq_reap - Process the RQ CQ.
+/**
+ * svc_rdma_receive_wc - Invoked by RDMA provider for each polled Receive WC
+ * @cq: completion queue
+ * @wc: completed WR
*
- * Take all completing WC off the CQE and enqueue the associated DTO
- * context on the dto_q for the transport.
- *
- * Note that caller must hold a transport reference.
*/
-static void rq_cq_reap(struct svcxprt_rdma *xprt)
+static void svc_rdma_receive_wc(struct ib_cq *cq, struct ib_wc *wc)
{
- int ret;
- struct ib_wc wc;
- struct svc_rdma_op_ctxt *ctxt = NULL;
-
- if (!test_and_clear_bit(RDMAXPRT_RQ_PENDING, &xprt->sc_flags))
- return;
+ struct svcxprt_rdma *xprt = cq->cq_context;
+ struct ib_cqe *cqe = wc->wr_cqe;
+ struct svc_rdma_op_ctxt *ctxt;

- ib_req_notify_cq(xprt->sc_rq_cq, IB_CQ_NEXT_COMP);
- atomic_inc(&rdma_stat_rq_poll);
+ /* WARNING: Only wc->wr_cqe and wc->status *
+ * are reliable at this point. */
+ ctxt = container_of(cqe, struct svc_rdma_op_ctxt, cqe);
+ ctxt->wc_status = wc->status;
+ svc_rdma_unmap_dma(ctxt);

- while ((ret = ib_poll_cq(xprt->sc_rq_cq, 1, &wc)) > 0) {
- ctxt = (struct svc_rdma_op_ctxt *)(unsigned long)wc.wr_id;
- ctxt->wc_status = wc.status;
- ctxt->byte_len = wc.byte_len;
- svc_rdma_unmap_dma(ctxt);
- if (wc.status != IB_WC_SUCCESS) {
- /* Close the transport */
- dprintk("svcrdma: transport closing putting ctxt %p\n", ctxt);
- set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
- svc_rdma_put_context(ctxt, 1);
- svc_xprt_put(&xprt->sc_xprt);
- continue;
- }
- spin_lock_bh(&xprt->sc_rq_dto_lock);
- list_add_tail(&ctxt->dto_q, &xprt->sc_rq_dto_q);
- spin_unlock_bh(&xprt->sc_rq_dto_lock);
- svc_xprt_put(&xprt->sc_xprt);
- }
+ if (wc->status != IB_WC_SUCCESS)
+ goto flushed;

- if (ctxt)
- atomic_inc(&rdma_stat_rq_prod);
+ /* All wc fields are now known to be valid */
+ ctxt->byte_len = wc->byte_len;
+ spin_lock(&xprt->sc_rq_dto_lock);
+ list_add_tail(&ctxt->dto_q, &xprt->sc_rq_dto_q);
+ spin_unlock(&xprt->sc_rq_dto_lock);

set_bit(XPT_DATA, &xprt->sc_xprt.xpt_flags);
- /*
- * If data arrived before established event,
- * don't enqueue. This defers RPC I/O until the
- * RDMA connection is complete.
- */
- if (!test_bit(RDMAXPRT_CONN_PENDING, &xprt->sc_flags))
- svc_xprt_enqueue(&xprt->sc_xprt);
+ if (test_bit(RDMAXPRT_CONN_PENDING, &xprt->sc_flags))
+ goto out;
+ svc_xprt_enqueue(&xprt->sc_xprt);
+ goto out;
+
+flushed:
+ if (wc->status != IB_WC_WR_FLUSH_ERR)
+ pr_warn("svcrdma: Recv completion: %s (%u, vendor %u)\n",
+ ib_wc_status_msg(wc->status),
+ wc->status, wc->vendor_err);
+ set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
+ svc_rdma_put_context(ctxt, 1);
+
+out:
+ svc_xprt_put(&xprt->sc_xprt);
}

/*
@@ -681,6 +635,7 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt, gfp_t flags)
ctxt = svc_rdma_get_context(xprt);
buflen = 0;
ctxt->direction = DMA_FROM_DEVICE;
+ ctxt->cqe.done = svc_rdma_receive_wc;
for (sge_no = 0; buflen < xprt->sc_max_req_size; sge_no++) {
if (sge_no >= xprt->sc_max_sge) {
pr_err("svcrdma: Too many sges (%d)\n", sge_no);
@@ -705,7 +660,7 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt, gfp_t flags)
recv_wr.next = NULL;
recv_wr.sg_list = &ctxt->sge[0];
recv_wr.num_sge = ctxt->count;
- recv_wr.wr_id = (u64)(unsigned long)ctxt;
+ recv_wr.wr_cqe = &ctxt->cqe;

svc_xprt_get(&xprt->sc_xprt);
ret = ib_post_recv(xprt->sc_qp, &recv_wr, &bad_recv_wr);
@@ -1124,12 +1079,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
dprintk("svcrdma: error creating SQ CQ for connect request\n");
goto errout;
}
- cq_attr.cqe = newxprt->sc_rq_depth;
- newxprt->sc_rq_cq = ib_create_cq(dev,
- rq_comp_handler,
- cq_event_handler,
- newxprt,
- &cq_attr);
+ newxprt->sc_rq_cq = ib_alloc_cq(dev, newxprt, newxprt->sc_rq_depth,
+ 0, IB_POLL_SOFTIRQ);
if (IS_ERR(newxprt->sc_rq_cq)) {
dprintk("svcrdma: error creating RQ CQ for connect request\n");
goto errout;
@@ -1225,7 +1176,6 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
* miss the first message
*/
ib_req_notify_cq(newxprt->sc_sq_cq, IB_CQ_NEXT_COMP);
- ib_req_notify_cq(newxprt->sc_rq_cq, IB_CQ_NEXT_COMP);

/* Accept Connection */
set_bit(RDMAXPRT_CONN_PENDING, &newxprt->sc_flags);
@@ -1369,7 +1319,7 @@ static void __svc_rdma_free(struct work_struct *work)
ib_destroy_cq(rdma->sc_sq_cq);

if (rdma->sc_rq_cq && !IS_ERR(rdma->sc_rq_cq))
- ib_destroy_cq(rdma->sc_rq_cq);
+ ib_free_cq(rdma->sc_rq_cq);

if (rdma->sc_pd && !IS_ERR(rdma->sc_pd))
ib_dealloc_pd(rdma->sc_pd);


2016-02-03 15:52:45

by Chuck Lever

Subject: [PATCH v1 10/10] svcrdma: Use new CQ API for RPC-over-RDMA server send CQs

Calling ib_poll_cq() to sort through WCs during a completion is a
common pattern amongst RDMA consumers. Since commit 14d3a3b2498e
("IB: add a proper completion queue abstraction"), WC sorting can
be handled by the IB core.

By converting to this new API, svcrdma is made a better neighbor to
other RDMA consumers, as it allows the core to schedule the delivery
of completions more fairly amongst all active consumers.

This new API also aims each completion at a function that is
specific to the WR's opcode. Thus the ctxt->wr_op field and the
switch in process_context are replaced by a set of methods that
handle each completion type.

The server's rdma_stat_sq_poll and rdma_stat_sq_prod metrics are no
longer updated.

As a clean-up, the cq_event_handler, the dto_tasklet, and all
associated locking are removed, as they are no longer referenced or
used.

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc_rdma.h | 9 +
net/sunrpc/xprtrdma/svc_rdma_backchannel.c | 4
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 15 +-
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 12 +
net/sunrpc/xprtrdma/svc_rdma_transport.c | 257 ++++++++++------------------
5 files changed, 119 insertions(+), 178 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 7de9cbb..8af9e3b 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -76,8 +76,9 @@ struct svc_rdma_op_ctxt {
int hdr_count;
struct xdr_buf arg;
struct ib_cqe cqe;
+ struct ib_cqe reg_cqe;
+ struct ib_cqe inv_cqe;
struct list_head dto_q;
- enum ib_wr_opcode wr_op;
enum ib_wc_status wc_status;
u32 byte_len;
u32 position;
@@ -175,7 +176,6 @@ struct svcxprt_rdma {
struct work_struct sc_work;
};
/* sc_flags */
-#define RDMAXPRT_SQ_PENDING 2
#define RDMAXPRT_CONN_PENDING 3

#define RPCRDMA_LISTEN_BACKLOG 10
@@ -232,6 +232,11 @@ extern void svc_rdma_send_error(struct svcxprt_rdma *, struct rpcrdma_msg *,
int);

/* svc_rdma_transport.c */
+extern void svc_rdma_send_wc(struct ib_cq *, struct ib_wc *);
+extern void svc_rdma_write_wc(struct ib_cq *, struct ib_wc *);
+extern void svc_rdma_reg_wc(struct ib_cq *, struct ib_wc *);
+extern void svc_rdma_read_wc(struct ib_cq *, struct ib_wc *);
+extern void svc_rdma_inv_wc(struct ib_cq *, struct ib_wc *);
extern int svc_rdma_send(struct svcxprt_rdma *, struct ib_send_wr *);
extern int svc_rdma_post_recv(struct svcxprt_rdma *, gfp_t);
extern int svc_rdma_repost_recv(struct svcxprt_rdma *, gfp_t);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
index 4498eaf..512590a 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
@@ -119,7 +119,6 @@ static int svc_rdma_bc_sendto(struct svcxprt_rdma *rdma,
ctxt->pages[0] = virt_to_page(rqst->rq_buffer);
ctxt->count = 1;

- ctxt->wr_op = IB_WR_SEND;
ctxt->direction = DMA_TO_DEVICE;
ctxt->sge[0].lkey = rdma->sc_pd->local_dma_lkey;
ctxt->sge[0].length = sndbuf->len;
@@ -133,7 +132,8 @@ static int svc_rdma_bc_sendto(struct svcxprt_rdma *rdma,
atomic_inc(&rdma->sc_dma_used);

memset(&send_wr, 0, sizeof(send_wr));
- send_wr.wr_id = (unsigned long)ctxt;
+ ctxt->cqe.done = svc_rdma_send_wc;
+ send_wr.wr_cqe = &ctxt->cqe;
send_wr.sg_list = ctxt->sge;
send_wr.num_sge = 1;
send_wr.opcode = IB_WR_SEND;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index d3718e9..1a318fc 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -180,9 +180,9 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
clear_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);

memset(&read_wr, 0, sizeof(read_wr));
- read_wr.wr.wr_id = (unsigned long)ctxt;
+ ctxt->cqe.done = svc_rdma_read_wc;
+ read_wr.wr.wr_cqe = &ctxt->cqe;
read_wr.wr.opcode = IB_WR_RDMA_READ;
- ctxt->wr_op = read_wr.wr.opcode;
read_wr.wr.send_flags = IB_SEND_SIGNALED;
read_wr.rkey = rs_handle;
read_wr.remote_addr = rs_offset;
@@ -299,8 +299,9 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
ctxt->read_hdr = head;

/* Prepare REG WR */
+ ctxt->reg_cqe.done = svc_rdma_reg_wc;
+ reg_wr.wr.wr_cqe = &ctxt->reg_cqe;
reg_wr.wr.opcode = IB_WR_REG_MR;
- reg_wr.wr.wr_id = 0;
reg_wr.wr.send_flags = IB_SEND_SIGNALED;
reg_wr.wr.num_sge = 0;
reg_wr.mr = frmr->mr;
@@ -310,26 +311,26 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,

/* Prepare RDMA_READ */
memset(&read_wr, 0, sizeof(read_wr));
+ ctxt->cqe.done = svc_rdma_read_wc;
+ read_wr.wr.wr_cqe = &ctxt->cqe;
read_wr.wr.send_flags = IB_SEND_SIGNALED;
read_wr.rkey = rs_handle;
read_wr.remote_addr = rs_offset;
read_wr.wr.sg_list = ctxt->sge;
read_wr.wr.num_sge = 1;
if (xprt->sc_dev_caps & SVCRDMA_DEVCAP_READ_W_INV) {
- read_wr.wr.opcode = IB_WR_RDMA_READ_WITH_INV;
- read_wr.wr.wr_id = (unsigned long)ctxt;
read_wr.wr.ex.invalidate_rkey = ctxt->frmr->mr->lkey;
} else {
read_wr.wr.opcode = IB_WR_RDMA_READ;
read_wr.wr.next = &inv_wr;
/* Prepare invalidate */
memset(&inv_wr, 0, sizeof(inv_wr));
- inv_wr.wr_id = (unsigned long)ctxt;
+ ctxt->inv_cqe.done = svc_rdma_inv_wc;
+ inv_wr.wr_cqe = &ctxt->inv_cqe;
inv_wr.opcode = IB_WR_LOCAL_INV;
inv_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_FENCE;
inv_wr.ex.invalidate_rkey = frmr->mr->lkey;
}
- ctxt->wr_op = read_wr.wr.opcode;

/* Post the chain */
ret = svc_rdma_send(xprt, &reg_wr.wr);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 73793dd..9d21205 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -281,8 +281,8 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,

/* Prepare WRITE WR */
memset(&write_wr, 0, sizeof write_wr);
- ctxt->wr_op = IB_WR_RDMA_WRITE;
- write_wr.wr.wr_id = (unsigned long)ctxt;
+ ctxt->cqe.done = svc_rdma_write_wc;
+ write_wr.wr.wr_cqe = &ctxt->cqe;
write_wr.wr.sg_list = &sge[0];
write_wr.wr.num_sge = sge_no;
write_wr.wr.opcode = IB_WR_RDMA_WRITE;
@@ -538,8 +538,8 @@ static int send_reply(struct svcxprt_rdma *rdma,
goto err;
}
memset(&send_wr, 0, sizeof send_wr);
- ctxt->wr_op = IB_WR_SEND;
- send_wr.wr_id = (unsigned long)ctxt;
+ ctxt->cqe.done = svc_rdma_send_wc;
+ send_wr.wr_cqe = &ctxt->cqe;
send_wr.sg_list = ctxt->sge;
send_wr.num_sge = sge_no;
send_wr.opcode = IB_WR_SEND;
@@ -685,8 +685,8 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,

/* Prepare SEND WR */
memset(&err_wr, 0, sizeof err_wr);
- ctxt->wr_op = IB_WR_SEND;
- err_wr.wr_id = (unsigned long)ctxt;
+ ctxt->cqe.done = svc_rdma_send_wc;
+ err_wr.wr_cqe = &ctxt->cqe;
err_wr.sg_list = ctxt->sge;
err_wr.num_sge = 1;
err_wr.opcode = IB_WR_SEND;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 316df77..bab80f8 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -63,16 +63,10 @@ static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
int flags);
static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt);
static void svc_rdma_release_rqst(struct svc_rqst *);
-static void dto_tasklet_func(unsigned long data);
static void svc_rdma_detach(struct svc_xprt *xprt);
static void svc_rdma_free(struct svc_xprt *xprt);
static int svc_rdma_has_wspace(struct svc_xprt *xprt);
static int svc_rdma_secure_port(struct svc_rqst *);
-static void sq_cq_reap(struct svcxprt_rdma *xprt);
-
-static DECLARE_TASKLET(dto_tasklet, dto_tasklet_func, 0UL);
-static DEFINE_SPINLOCK(dto_lock);
-static LIST_HEAD(dto_xprt_q);

static struct svc_xprt_ops svc_rdma_ops = {
.xpo_create = svc_rdma_create,
@@ -351,15 +345,6 @@ static void svc_rdma_destroy_maps(struct svcxprt_rdma *xprt)
}
}

-/* ib_cq event handler */
-static void cq_event_handler(struct ib_event *event, void *context)
-{
- struct svc_xprt *xprt = context;
- dprintk("svcrdma: received CQ event %s (%d), context=%p\n",
- ib_event_msg(event->event), event->event, context);
- set_bit(XPT_CLOSE, &xprt->xpt_flags);
-}
-
/* QP event handler */
static void qp_event_handler(struct ib_event *event, void *context)
{
@@ -391,35 +376,6 @@ static void qp_event_handler(struct ib_event *event, void *context)
}
}

-/*
- * Data Transfer Operation Tasklet
- *
- * Walks a list of transports with I/O pending, removing entries as
- * they are added to the server's I/O pending list. Two bits indicate
- * if SQ, RQ, or both have I/O pending. The dto_lock is an irqsave
- * spinlock that serializes access to the transport list with the RQ
- * and SQ interrupt handlers.
- */
-static void dto_tasklet_func(unsigned long data)
-{
- struct svcxprt_rdma *xprt;
- unsigned long flags;
-
- spin_lock_irqsave(&dto_lock, flags);
- while (!list_empty(&dto_xprt_q)) {
- xprt = list_entry(dto_xprt_q.next,
- struct svcxprt_rdma, sc_dto_q);
- list_del_init(&xprt->sc_dto_q);
- spin_unlock_irqrestore(&dto_lock, flags);
-
- sq_cq_reap(xprt);
-
- svc_xprt_put(&xprt->sc_xprt);
- spin_lock_irqsave(&dto_lock, flags);
- }
- spin_unlock_irqrestore(&dto_lock, flags);
-}
-
/**
* svc_rdma_receive_wc - Invoked by RDMA provider for each polled Receive WC
* @cq: completion queue
@@ -465,132 +421,125 @@ out:
svc_xprt_put(&xprt->sc_xprt);
}

-/*
- * Process a completion context
- */
-static void process_context(struct svcxprt_rdma *xprt,
- struct svc_rdma_op_ctxt *ctxt)
+static void svc_rdma_send_wc_common(struct svcxprt_rdma *xprt,
+ struct ib_wc *wc)
{
- struct svc_rdma_op_ctxt *read_hdr;
- int free_pages = 0;
-
- svc_rdma_unmap_dma(ctxt);
-
- switch (ctxt->wr_op) {
- case IB_WR_SEND:
- free_pages = 1;
- break;
+ if (wc->status != IB_WC_SUCCESS)
+ goto err;

- case IB_WR_RDMA_WRITE:
- break;
+out:
+ atomic_dec(&xprt->sc_sq_count);
+ wake_up(&xprt->sc_send_wait);
+ return;

- case IB_WR_RDMA_READ:
- case IB_WR_RDMA_READ_WITH_INV:
- svc_rdma_put_frmr(xprt, ctxt->frmr);
+err:
+ set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
+ if (wc->status != IB_WC_WR_FLUSH_ERR)
+ pr_warn("svcrdma: Send completion: %s (%u, vendor %u)\n",
+ ib_wc_status_msg(wc->status),
+ wc->status, wc->vendor_err);
+ goto out;
+}

- if (!test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags))
- break;
+static void svc_rdma_send_wc_common_put(struct ib_cq *cq, struct ib_wc *wc)
+{
+ struct svcxprt_rdma *xprt = cq->cq_context;

- read_hdr = ctxt->read_hdr;
- svc_rdma_put_context(ctxt, 0);
+ svc_rdma_send_wc_common(xprt, wc);
+ svc_xprt_put(&xprt->sc_xprt);
+}

- spin_lock_bh(&xprt->sc_rq_dto_lock);
- set_bit(XPT_DATA, &xprt->sc_xprt.xpt_flags);
- list_add_tail(&read_hdr->dto_q,
- &xprt->sc_read_complete_q);
- spin_unlock_bh(&xprt->sc_rq_dto_lock);
- svc_xprt_enqueue(&xprt->sc_xprt);
- return;
+/**
+ * svc_rdma_send_wc - Invoked by RDMA provider for each polled Send WC
+ * @cq: completion queue
+ * @wc: completed WR
+ *
+ */
+void svc_rdma_send_wc(struct ib_cq *cq, struct ib_wc *wc)
+{
+ struct ib_cqe *cqe = wc->wr_cqe;
+ struct svc_rdma_op_ctxt *ctxt;

- default:
- dprintk("svcrdma: unexpected completion opcode=%d\n",
- ctxt->wr_op);
- break;
- }
+ svc_rdma_send_wc_common_put(cq, wc);

- svc_rdma_put_context(ctxt, free_pages);
+ ctxt = container_of(cqe, struct svc_rdma_op_ctxt, cqe);
+ svc_rdma_unmap_dma(ctxt);
+ svc_rdma_put_context(ctxt, 1);
}

-/*
- * Send Queue Completion Handler - potentially called on interrupt context.
+/**
+ * svc_rdma_write_wc - Invoked by RDMA provider for each polled Write WC
+ * @cq: completion queue
+ * @wc: completed WR
*
- * Note that caller must hold a transport reference.
*/
-static void sq_cq_reap(struct svcxprt_rdma *xprt)
+void svc_rdma_write_wc(struct ib_cq *cq, struct ib_wc *wc)
{
- struct svc_rdma_op_ctxt *ctxt = NULL;
- struct ib_wc wc_a[6];
- struct ib_wc *wc;
- struct ib_cq *cq = xprt->sc_sq_cq;
- int ret;
+ struct ib_cqe *cqe = wc->wr_cqe;
+ struct svc_rdma_op_ctxt *ctxt;

- memset(wc_a, 0, sizeof(wc_a));
+ svc_rdma_send_wc_common_put(cq, wc);

- if (!test_and_clear_bit(RDMAXPRT_SQ_PENDING, &xprt->sc_flags))
- return;
+ ctxt = container_of(cqe, struct svc_rdma_op_ctxt, cqe);
+ svc_rdma_unmap_dma(ctxt);
+ svc_rdma_put_context(ctxt, 0);
+}

- ib_req_notify_cq(xprt->sc_sq_cq, IB_CQ_NEXT_COMP);
- atomic_inc(&rdma_stat_sq_poll);
- while ((ret = ib_poll_cq(cq, ARRAY_SIZE(wc_a), wc_a)) > 0) {
- int i;
+/**
+ * svc_rdma_reg_wc - Invoked by RDMA provider for each polled FASTREG WC
+ * @cq: completion queue
+ * @wc: completed WR
+ *
+ */
+void svc_rdma_reg_wc(struct ib_cq *cq, struct ib_wc *wc)
+{
+ svc_rdma_send_wc_common_put(cq, wc);
+}

- for (i = 0; i < ret; i++) {
- wc = &wc_a[i];
- if (wc->status != IB_WC_SUCCESS) {
- dprintk("svcrdma: sq wc err status %s (%d)\n",
- ib_wc_status_msg(wc->status),
- wc->status);
+/**
+ * svc_rdma_read_wc - Invoked by RDMA provider for each polled Read WC
+ * @cq: completion queue
+ * @wc: completed WR
+ *
+ */
+void svc_rdma_read_wc(struct ib_cq *cq, struct ib_wc *wc)
+{
+ struct svcxprt_rdma *xprt = cq->cq_context;
+ struct ib_cqe *cqe = wc->wr_cqe;
+ struct svc_rdma_op_ctxt *ctxt;

- /* Close the transport */
- set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
- }
+ svc_rdma_send_wc_common(xprt, wc);

- /* Decrement used SQ WR count */
- atomic_dec(&xprt->sc_sq_count);
- wake_up(&xprt->sc_send_wait);
+ ctxt = container_of(cqe, struct svc_rdma_op_ctxt, cqe);
+ svc_rdma_unmap_dma(ctxt);
+ svc_rdma_put_frmr(xprt, ctxt->frmr);

- ctxt = (struct svc_rdma_op_ctxt *)
- (unsigned long)wc->wr_id;
- if (ctxt)
- process_context(xprt, ctxt);
+ if (test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags)) {
+ struct svc_rdma_op_ctxt *read_hdr;

- svc_xprt_put(&xprt->sc_xprt);
- }
+ read_hdr = ctxt->read_hdr;
+ spin_lock(&xprt->sc_rq_dto_lock);
+ list_add_tail(&read_hdr->dto_q,
+ &xprt->sc_read_complete_q);
+ spin_unlock(&xprt->sc_rq_dto_lock);
+
+ set_bit(XPT_DATA, &xprt->sc_xprt.xpt_flags);
+ svc_xprt_enqueue(&xprt->sc_xprt);
}

- if (ctxt)
- atomic_inc(&rdma_stat_sq_prod);
+ svc_rdma_put_context(ctxt, 0);
+ svc_xprt_put(&xprt->sc_xprt);
}

-static void sq_comp_handler(struct ib_cq *cq, void *cq_context)
+/**
+ * svc_rdma_inv_wc - Invoked by RDMA provider for each polled LOCAL_INV WC
+ * @cq: completion queue
+ * @wc: completed WR
+ *
+ */
+void svc_rdma_inv_wc(struct ib_cq *cq, struct ib_wc *wc)
{
- struct svcxprt_rdma *xprt = cq_context;
- unsigned long flags;
-
- /* Guard against unconditional flush call for destroyed QP */
- if (atomic_read(&xprt->sc_xprt.xpt_ref.refcount)==0)
- return;
-
- /*
- * Set the bit regardless of whether or not it's on the list
- * because it may be on the list already due to an RQ
- * completion.
- */
- set_bit(RDMAXPRT_SQ_PENDING, &xprt->sc_flags);
-
- /*
- * If this transport is not already on the DTO transport queue,
- * add it
- */
- spin_lock_irqsave(&dto_lock, flags);
- if (list_empty(&xprt->sc_dto_q)) {
- svc_xprt_get(&xprt->sc_xprt);
- list_add_tail(&xprt->sc_dto_q, &dto_xprt_q);
- }
- spin_unlock_irqrestore(&dto_lock, flags);
-
- /* Tasklet does all the work to avoid irqsave locks. */
- tasklet_schedule(&dto_tasklet);
+ svc_rdma_send_wc_common_put(cq, wc);
}

static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,
@@ -1011,7 +960,6 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
struct svcxprt_rdma *listen_rdma;
struct svcxprt_rdma *newxprt = NULL;
struct rdma_conn_param conn_param;
- struct ib_cq_init_attr cq_attr = {};
struct ib_qp_init_attr qp_attr;
struct ib_device *dev;
unsigned int i;
@@ -1069,12 +1017,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
dprintk("svcrdma: error creating PD for connect request\n");
goto errout;
}
- cq_attr.cqe = newxprt->sc_sq_depth;
- newxprt->sc_sq_cq = ib_create_cq(dev,
- sq_comp_handler,
- cq_event_handler,
- newxprt,
- &cq_attr);
+ newxprt->sc_sq_cq = ib_alloc_cq(dev, newxprt, newxprt->sc_sq_depth,
+ 0, IB_POLL_SOFTIRQ);
if (IS_ERR(newxprt->sc_sq_cq)) {
dprintk("svcrdma: error creating SQ CQ for connect request\n");
goto errout;
@@ -1171,12 +1115,6 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
/* Swap out the handler */
newxprt->sc_cm_id->event_handler = rdma_cma_handler;

- /*
- * Arm the CQs for the SQ and RQ before accepting so we can't
- * miss the first message
- */
- ib_req_notify_cq(newxprt->sc_sq_cq, IB_CQ_NEXT_COMP);
-
/* Accept Connection */
set_bit(RDMAXPRT_CONN_PENDING, &newxprt->sc_flags);
memset(&conn_param, 0, sizeof conn_param);
@@ -1316,7 +1254,7 @@ static void __svc_rdma_free(struct work_struct *work)
ib_destroy_qp(rdma->sc_qp);

if (rdma->sc_sq_cq && !IS_ERR(rdma->sc_sq_cq))
- ib_destroy_cq(rdma->sc_sq_cq);
+ ib_free_cq(rdma->sc_sq_cq);

if (rdma->sc_rq_cq && !IS_ERR(rdma->sc_rq_cq))
ib_free_cq(rdma->sc_rq_cq);
@@ -1380,9 +1318,6 @@ int svc_rdma_send(struct svcxprt_rdma *xprt, struct ib_send_wr *wr)
spin_unlock_bh(&xprt->sc_lock);
atomic_inc(&rdma_stat_sq_starve);

- /* See if we can opportunistically reap SQ WR to make room */
- sq_cq_reap(xprt);
-
/* Wait until SQ WR available if SQ still full */
wait_event(xprt->sc_send_wait,
atomic_read(&xprt->sc_sq_count) <


2016-02-03 15:52:21

by Chuck Lever

Subject: [PATCH v1 07/10] svcrdma: Hook up the logic to return ERR_CHUNK

RFC 5666 Section 4.2 states:

> When the peer detects an RPC-over-RDMA header version that it does
> not support (currently this document defines only version 1), it
> replies with an error code of ERR_VERS, and provides the low and
> high inclusive version numbers it does, in fact, support.

And:

> When other decoding errors are detected in the header or chunks,
> either an RPC decode error MAY be returned or the RPC/RDMA error
> code ERR_CHUNK MUST be returned.

The Linux NFS server does throw ERR_VERS when a client sends it
a request whose rdma_version is not "one." But it does not return
ERR_CHUNK when a header decoding error occurs. It just drops the
request.

To improve protocol extensibility, it should reject invalid values
in the rdma_proc field instead of treating them all like RDMA_MSG.
Otherwise clients can't detect when the server doesn't support
new rdma_proc values.
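
For reference, the minimal error reply this enables is the five-quad
header that RPCRDMA_HDRLEN_ERR describes. A sketch of the wire layout
(field names follow the rm_* naming in rpc_rdma.h), not code from the
patch:

  /* RDMA_ERROR reply, 5 XDR quads = RPCRDMA_HDRLEN_ERR:
   *
   *   rm_xid     - XID copied from the failing request
   *   rm_vers    - 1
   *   rm_credit  - credit value
   *   rm_type    - RDMA_ERROR
   *   rm_err     - ERR_CHUNK  (ERR_VERS adds rm_vers_low and rm_vers_hi)
   */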

Signed-off-by: Chuck Lever <[email protected]>
---
net/sunrpc/xprtrdma/svc_rdma_marshal.c | 55 ++++++++++++++++++++++++-------
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 4 ++
2 files changed, 46 insertions(+), 13 deletions(-)

diff --git a/net/sunrpc/xprtrdma/svc_rdma_marshal.c b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
index c011b12..7df4f07 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_marshal.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
@@ -148,22 +148,41 @@ static __be32 *decode_reply_array(__be32 *va, __be32 *vaend)
int svc_rdma_xdr_decode_req(struct rpcrdma_msg *rmsgp, struct svc_rqst *rqstp)
{
__be32 *va, *vaend;
+ unsigned int len;
u32 hdr_len;

/* Verify that there's enough bytes for header + something */
- if (rqstp->rq_arg.len <= RPCRDMA_HDRLEN_MIN) {
+ if (rqstp->rq_arg.len <= RPCRDMA_HDRLEN_ERR) {
dprintk("svcrdma: header too short = %d\n",
rqstp->rq_arg.len);
return -EINVAL;
}

- if (rmsgp->rm_vers != rpcrdma_version)
+ if (rmsgp->rm_vers != rpcrdma_version) {
+ dprintk("%s: bad version %u\n", __func__,
+ be32_to_cpu(rmsgp->rm_vers));
return -ENOSYS;
+ }

- /* Pull in the extra for the padded case and bump our pointer */
- if (rmsgp->rm_type == rdma_msgp) {
- int hdrlen;
-
+ switch (be32_to_cpu(rmsgp->rm_type)) {
+ case RDMA_MSG:
+ case RDMA_NOMSG:
+ break;
+
+ case RDMA_DONE:
+ /* Just drop it */
+ dprintk("svcrdma: dropping RDMA_DONE message\n");
+ return 0;
+
+ case RDMA_ERROR:
+ /* Possible if this is a backchannel reply.
+ * XXX: We should cancel this XID, though.
+ */
+ dprintk("svcrdma: dropping RDMA_ERROR message\n");
+ return 0;
+
+ case RDMA_MSGP:
+ /* Pull in the extra for the padded case, bump our pointer */
rmsgp->rm_body.rm_padded.rm_align =
be32_to_cpu(rmsgp->rm_body.rm_padded.rm_align);
rmsgp->rm_body.rm_padded.rm_thresh =
@@ -171,11 +190,15 @@ int svc_rdma_xdr_decode_req(struct rpcrdma_msg *rmsgp, struct svc_rqst *rqstp)

va = &rmsgp->rm_body.rm_padded.rm_pempty[4];
rqstp->rq_arg.head[0].iov_base = va;
- hdrlen = (u32)((unsigned long)va - (unsigned long)rmsgp);
- rqstp->rq_arg.head[0].iov_len -= hdrlen;
- if (hdrlen > rqstp->rq_arg.len)
+ len = (u32)((unsigned long)va - (unsigned long)rmsgp);
+ rqstp->rq_arg.head[0].iov_len -= len;
+ if (len > rqstp->rq_arg.len)
return -EINVAL;
- return hdrlen;
+ return len;
+ default:
+ dprintk("svcrdma: bad rdma procedure (%u)\n",
+ be32_to_cpu(rmsgp->rm_type));
+ return -EINVAL;
}

/* The chunk list may contain either a read chunk list or a write
@@ -184,14 +207,20 @@ int svc_rdma_xdr_decode_req(struct rpcrdma_msg *rmsgp, struct svc_rqst *rqstp)
va = &rmsgp->rm_body.rm_chunks[0];
vaend = (__be32 *)((unsigned long)rmsgp + rqstp->rq_arg.len);
va = decode_read_list(va, vaend);
- if (!va)
+ if (!va) {
+ dprintk("svcrdma: failed to decode read list\n");
return -EINVAL;
+ }
va = decode_write_list(va, vaend);
- if (!va)
+ if (!va) {
+ dprintk("svcrdma: failed to decode write list\n");
return -EINVAL;
+ }
va = decode_reply_array(va, vaend);
- if (!va)
+ if (!va) {
+ dprintk("svcrdma: failed to decode reply chunk\n");
return -EINVAL;
+ }

rqstp->rq_arg.head[0].iov_base = va;
hdr_len = (unsigned long)va - (unsigned long)rmsgp;
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index 8f68cb6..f8b840b 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -657,6 +657,8 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
ret = svc_rdma_xdr_decode_req(rmsgp, rqstp);
if (ret < 0)
goto out_err;
+ if (ret == 0)
+ goto out_drop;
rqstp->rq_xprt_hlen = ret;

if (svc_rdma_is_backchannel_reply(xprt, rmsgp)) {
@@ -710,6 +712,8 @@ out_err:
defer:
return 0;

+out_drop:
+ svc_rdma_put_context(ctxt, 1);
repost:
return svc_rdma_repost_recv(rdma_xprt, GFP_KERNEL);
}


2016-02-03 15:51:48

by Chuck Lever

Subject: [PATCH v1 03/10] svcrdma: svc_rdma_post_recv() should close connection on error

Clean up: Most svc_rdma_post_recv() call sites close the transport
connection when a receive cannot be posted. Wrap that in a common
helper.

Signed-off-by: Chuck Lever <[email protected]>
---
include/linux/sunrpc/svc_rdma.h | 1 +
net/sunrpc/xprtrdma/svc_rdma_backchannel.c | 11 ++---------
net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 10 +---------
net/sunrpc/xprtrdma/svc_rdma_sendto.c | 7 +------
net/sunrpc/xprtrdma/svc_rdma_transport.c | 15 +++++++++++++++
5 files changed, 20 insertions(+), 24 deletions(-)

diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
index 5322fea..d8fc58d 100644
--- a/include/linux/sunrpc/svc_rdma.h
+++ b/include/linux/sunrpc/svc_rdma.h
@@ -234,6 +234,7 @@ extern int svc_rdma_send(struct svcxprt_rdma *, struct ib_send_wr *);
extern void svc_rdma_send_error(struct svcxprt_rdma *, struct rpcrdma_msg *,
enum rpcrdma_errcode);
extern int svc_rdma_post_recv(struct svcxprt_rdma *, gfp_t);
+extern int svc_rdma_repost_recv(struct svcxprt_rdma *, gfp_t);
extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *);
extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
index 65a7c23..4498eaf 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
@@ -111,16 +111,9 @@ static int svc_rdma_bc_sendto(struct svcxprt_rdma *rdma,
if (ret)
goto out_err;

- /* Post a recv buffer to handle the reply for this request. */
- ret = svc_rdma_post_recv(rdma, GFP_NOIO);
- if (ret) {
- pr_err("svcrdma: Failed to post bc receive buffer, err=%d.\n",
- ret);
- pr_err("svcrdma: closing transport %p.\n", rdma);
- set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
- ret = -ENOTCONN;
+ ret = svc_rdma_repost_recv(rdma, GFP_NOIO);
+ if (ret)
goto out_err;
- }

ctxt = svc_rdma_get_context(rdma);
ctxt->pages[0] = virt_to_page(rqst->rq_buffer);
diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
index c8b8a8b..acf15b8 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
@@ -711,13 +711,5 @@ defer:
return 0;

repost:
- ret = svc_rdma_post_recv(rdma_xprt, GFP_KERNEL);
- if (ret) {
- pr_err("svcrdma: could not post a receive buffer, err=%d.\n",
- ret);
- pr_err("svcrdma: closing transport %p.\n", rdma_xprt);
- set_bit(XPT_CLOSE, &rdma_xprt->sc_xprt.xpt_flags);
- ret = -ENOTCONN;
- }
- return ret;
+ return svc_rdma_repost_recv(rdma_xprt, GFP_KERNEL);
}
diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
index 8591314..687a91afe 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
@@ -464,13 +464,8 @@ static int send_reply(struct svcxprt_rdma *rdma,
int pages;
int ret;

- /* Post a recv buffer to handle another request. */
- ret = svc_rdma_post_recv(rdma, GFP_KERNEL);
+ ret = svc_rdma_repost_recv(rdma, GFP_KERNEL);
if (ret) {
- printk(KERN_INFO
- "svcrdma: could not post a receive buffer, err=%d."
- "Closing transport %p.\n", ret, rdma);
- set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
svc_rdma_put_context(ctxt, 0);
return -ENOTCONN;
}
diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
index 02eee12..da82e36 100644
--- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
+++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
@@ -722,6 +722,21 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt, gfp_t flags)
return -ENOMEM;
}

+int svc_rdma_repost_recv(struct svcxprt_rdma *xprt, gfp_t flags)
+{
+ int ret = 0;
+
+ ret = svc_rdma_post_recv(xprt, flags);
+ if (ret) {
+ pr_err("svcrdma: could not post a receive buffer, err=%d.\n",
+ ret);
+ pr_err("svcrdma: closing transport %p.\n", xprt);
+ set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
+ ret = -ENOTCONN;
+ }
+ return ret;
+}
+
/*
* This function handles the CONNECT_REQUEST event on a listening
* endpoint. It is passed the cma_id for the _new_ connection. The context in


2016-02-05 10:17:55

by Devesh Sharma

Subject: Re: [PATCH v1 03/10] svcrdma: svc_rdma_post_recv() should close connection on error

Looks good.

On Wed, Feb 3, 2016 at 9:21 PM, Chuck Lever <[email protected]> wrote:
> Clean up: Most svc_rdma_post_recv() call sites close the transport
> connection when a receive cannot be posted. Wrap that in a common
> helper.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> include/linux/sunrpc/svc_rdma.h | 1 +
> net/sunrpc/xprtrdma/svc_rdma_backchannel.c | 11 ++---------
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 10 +---------
> net/sunrpc/xprtrdma/svc_rdma_sendto.c | 7 +------
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 15 +++++++++++++++
> 5 files changed, 20 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> index 5322fea..d8fc58d 100644
> --- a/include/linux/sunrpc/svc_rdma.h
> +++ b/include/linux/sunrpc/svc_rdma.h
> @@ -234,6 +234,7 @@ extern int svc_rdma_send(struct svcxprt_rdma *, struct ib_send_wr *);
> extern void svc_rdma_send_error(struct svcxprt_rdma *, struct rpcrdma_msg *,
> enum rpcrdma_errcode);
> extern int svc_rdma_post_recv(struct svcxprt_rdma *, gfp_t);
> +extern int svc_rdma_repost_recv(struct svcxprt_rdma *, gfp_t);
> extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *);
> extern struct svc_rdma_op_ctxt *svc_rdma_get_context(struct svcxprt_rdma *);
> extern void svc_rdma_put_context(struct svc_rdma_op_ctxt *, int);
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
> index 65a7c23..4498eaf 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
> @@ -111,16 +111,9 @@ static int svc_rdma_bc_sendto(struct svcxprt_rdma *rdma,
> if (ret)
> goto out_err;
>
> - /* Post a recv buffer to handle the reply for this request. */
> - ret = svc_rdma_post_recv(rdma, GFP_NOIO);
> - if (ret) {
> - pr_err("svcrdma: Failed to post bc receive buffer, err=%d.\n",
> - ret);
> - pr_err("svcrdma: closing transport %p.\n", rdma);
> - set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
> - ret = -ENOTCONN;
> + ret = svc_rdma_repost_recv(rdma, GFP_NOIO);
> + if (ret)
> goto out_err;
> - }
>
> ctxt = svc_rdma_get_context(rdma);
> ctxt->pages[0] = virt_to_page(rqst->rq_buffer);
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index c8b8a8b..acf15b8 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -711,13 +711,5 @@ defer:
> return 0;
>
> repost:
> - ret = svc_rdma_post_recv(rdma_xprt, GFP_KERNEL);
> - if (ret) {
> - pr_err("svcrdma: could not post a receive buffer, err=%d.\n",
> - ret);
> - pr_err("svcrdma: closing transport %p.\n", rdma_xprt);
> - set_bit(XPT_CLOSE, &rdma_xprt->sc_xprt.xpt_flags);
> - ret = -ENOTCONN;
> - }
> - return ret;
> + return svc_rdma_repost_recv(rdma_xprt, GFP_KERNEL);
> }
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> index 8591314..687a91afe 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> @@ -464,13 +464,8 @@ static int send_reply(struct svcxprt_rdma *rdma,
> int pages;
> int ret;
>
> - /* Post a recv buffer to handle another request. */
> - ret = svc_rdma_post_recv(rdma, GFP_KERNEL);
> + ret = svc_rdma_repost_recv(rdma, GFP_KERNEL);
> if (ret) {
> - printk(KERN_INFO
> - "svcrdma: could not post a receive buffer, err=%d."
> - "Closing transport %p.\n", ret, rdma);
> - set_bit(XPT_CLOSE, &rdma->sc_xprt.xpt_flags);
> svc_rdma_put_context(ctxt, 0);
> return -ENOTCONN;
> }
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index 02eee12..da82e36 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -722,6 +722,21 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt, gfp_t flags)
> return -ENOMEM;
> }
>
> +int svc_rdma_repost_recv(struct svcxprt_rdma *xprt, gfp_t flags)
> +{
> + int ret = 0;
> +
> + ret = svc_rdma_post_recv(xprt, flags);
> + if (ret) {
> + pr_err("svcrdma: could not post a receive buffer, err=%d.\n",
> + ret);
> + pr_err("svcrdma: closing transport %p.\n", xprt);
> + set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
> + ret = -ENOTCONN;
> + }
> + return ret;
> +}
> +
> /*
> * This function handles the CONNECT_REQUEST event on a listening
> * endpoint. It is passed the cma_id for the _new_ connection. The context in
>

2016-02-05 10:19:24

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 01/10] svcrdma: Do not send XDR roundup bytes for a write chunk

Looks good to me.

On Wed, Feb 3, 2016 at 9:21 PM, Chuck Lever <[email protected]> wrote:
> When the Linux server writes an odd-length data item into a Write
> chunk, it finishes with XDR pad bytes. If the data item is smaller
> than the Write chunk, the pad bytes are written at the end of the
> data item, but still inside the chunk. That can expose these zero
> bytes to the RPC consumer on the client.
>
> XDR pad bytes are inserted in order to preserve the XDR alignment
> of the next XDR data item in an XDR stream. But Write chunks do not
> appear in the payload XDR stream, and only one data item is allowed
> in each chunk. So XDR padding is unneeded.
>
> Thus the server should not write XDR pad bytes in Write chunks.
>
> I believe this is not an operational problem. Short NFS READs that
> are returned in a Write chunk would be affected by this issue. But
> they happen only when the read request goes past the EOF. Those are
> zero bytes anyway, and there's no file data in the client's buffer
> past EOF.
>
> Otherwise, current NFS clients provide a separate extra segment for
> catching XDR padding. If an odd-size data item fills the chunk,
> then the XDR pad will be written to the extra segment.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/svc_rdma_sendto.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> index df57f3c..8591314 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> @@ -308,7 +308,7 @@ static int send_write_chunks(struct svcxprt_rdma *xprt,
> struct svc_rqst *rqstp,
> struct svc_rdma_req_map *vec)
> {
> - u32 xfer_len = rqstp->rq_res.page_len + rqstp->rq_res.tail[0].iov_len;
> + u32 xfer_len = rqstp->rq_res.page_len;
> int write_len;
> u32 xdr_off;
> int chunk_off;
>
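
To make the round-up rule concrete, here is a small stand-alone C sketch
(illustrative only; it is not code from the patch, and the payload size is
made up) showing how the XDR pad is computed and why dropping the tail iovec
from the transfer length keeps pad bytes out of the Write chunk:

    #include <stdio.h>

    /* XDR data items are padded to a multiple of 4 octets (RFC 4506). */
    static unsigned int xdr_pad_size(unsigned int len)
    {
            return (4 - (len & 3)) & 3;
    }

    int main(void)
    {
            unsigned int read_payload = 4097;  /* odd-length NFS READ result */
            unsigned int pad = xdr_pad_size(read_payload);

            printf("pad bytes:          %u\n", pad);                /* 3 */
            /* Old behavior: page_len plus the tail (the pad) was written. */
            printf("old chunk transfer: %u\n", read_payload + pad); /* 4100 */
            /* Patched behavior: only the data item itself is RDMA Written. */
            printf("new chunk transfer: %u\n", read_payload);       /* 4097 */
            return 0;
    }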

2016-02-05 10:22:26

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 02/10] svcrdma: Make svc_rdma_get_frmr() not sleep

As I understand it, you are pre-allocating because alloc_mr() can sleep,
so you separate it out, while the map-frmr-wqe path is a non-blocking
context by definition. Are we on the same page?


On Wed, Feb 3, 2016 at 9:21 PM, Chuck Lever <[email protected]> wrote:
> Want to call it in a context that cannot sleep. So pre-alloc
> the memory and the MRs.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 44 ++++++++++++++++++++++++++----
> 1 file changed, 38 insertions(+), 6 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index 5763825..02eee12 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -949,7 +949,7 @@ static struct svc_rdma_fastreg_mr *rdma_alloc_frmr(struct svcxprt_rdma *xprt)
> return ERR_PTR(-ENOMEM);
> }
>
> -static void rdma_dealloc_frmr_q(struct svcxprt_rdma *xprt)
> +static void svc_rdma_destroy_frmrs(struct svcxprt_rdma *xprt)
> {
> struct svc_rdma_fastreg_mr *frmr;
>
> @@ -963,6 +963,37 @@ static void rdma_dealloc_frmr_q(struct svcxprt_rdma *xprt)
> }
> }
>
> +static bool svc_rdma_prealloc_frmrs(struct svcxprt_rdma *xprt)
> +{
> + struct ib_device *dev = xprt->sc_cm_id->device;
> + unsigned int i;
> +
> + /* Pre-allocate based on the maximum amount of payload
> + * the server's HCA can handle per RDMA Read, to keep
> + * the number of MRs per connection in check.
> + *
> + * If a client sends small Read chunks (eg. it may be
> + * using physical registration), more RDMA Reads per
> + * NFS WRITE will be needed. svc_rdma_get_frmr() dips
> + * into its reserve in that case. Better would be for
> + * the server to reduce the connection's credit limit.
> + */
> + i = 1 + RPCSVC_MAXPAGES / dev->attrs.max_fast_reg_page_list_len;
> + i *= xprt->sc_max_requests;
> +
> + while (i--) {
> + struct svc_rdma_fastreg_mr *frmr;
> +
> + frmr = rdma_alloc_frmr(xprt);
> + if (!frmr) {
> + dprintk("svcrdma: No memory for request map\n");
> + return false;
> + }
> + list_add(&frmr->frmr_list, &xprt->sc_frmr_q);
> + }
> + return true;
> +}
> +
> struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
> {
> struct svc_rdma_fastreg_mr *frmr = NULL;
> @@ -975,10 +1006,9 @@ struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
> frmr->sg_nents = 0;
> }
> spin_unlock_bh(&rdma->sc_frmr_q_lock);
> - if (frmr)
> - return frmr;
> -
> - return rdma_alloc_frmr(rdma);
> + if (!frmr)
> + return ERR_PTR(-ENOMEM);
> + return frmr;
> }
>
> void svc_rdma_put_frmr(struct svcxprt_rdma *rdma,
> @@ -1149,6 +1179,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
> dev->attrs.max_fast_reg_page_list_len;
> newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_FAST_REG;
> newxprt->sc_reader = rdma_read_chunk_frmr;
> + if (!svc_rdma_prealloc_frmrs(newxprt))
> + goto errout;
> }
>
> /*
> @@ -1310,7 +1342,7 @@ static void __svc_rdma_free(struct work_struct *work)
> xprt->xpt_bc_xprt = NULL;
> }
>
> - rdma_dealloc_frmr_q(rdma);
> + svc_rdma_destroy_frmrs(rdma);
> svc_rdma_destroy_ctxts(rdma);
> svc_rdma_destroy_maps(rdma);
>
>
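
Outside the kernel, the shape of this change can be sketched as a simple
pre-filled free list: the allocation happens once, in a context that may
sleep, and the get path becomes a pure list operation that never allocates.
This is only an illustration (the names, types, and pthread locking are made
up, not the svcrdma code):

    #include <errno.h>
    #include <pthread.h>
    #include <stdlib.h>

    struct frmr {
            struct frmr *next;
            /* registration state (MR handle, sg list, ...) would live here */
    };

    struct frmr_pool {
            pthread_spinlock_t lock;
            struct frmr *free_list;
    };

    /* May sleep/allocate: run once while accepting the connection. */
    static int pool_prealloc(struct frmr_pool *p, unsigned int count)
    {
            pthread_spin_init(&p->lock, PTHREAD_PROCESS_PRIVATE);
            p->free_list = NULL;
            while (count--) {
                    struct frmr *f = calloc(1, sizeof(*f));

                    if (!f)
                            return -ENOMEM;
                    f->next = p->free_list;
                    p->free_list = f;
            }
            return 0;
    }

    /* Never sleeps: safe to call where blocking is not allowed. */
    static struct frmr *pool_get(struct frmr_pool *p)
    {
            struct frmr *f;

            pthread_spin_lock(&p->lock);
            f = p->free_list;
            if (f)
                    p->free_list = f->next;
            pthread_spin_unlock(&p->lock);
            return f;       /* NULL means the reserve is exhausted */
    }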

2016-02-05 10:24:12

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 04/10] rpcrdma: Add missing XDR union fields for RDMA errors

On Wed, Feb 3, 2016 at 9:21 PM, Chuck Lever <[email protected]> wrote:
> Add some infrastructure (shared between xprtrdma and svcrdma) for
> constructing and parsing RPC-over-RDMA error status information.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> include/linux/sunrpc/rpc_rdma.h | 13 ++++++++++++-
> 1 file changed, 12 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sunrpc/rpc_rdma.h b/include/linux/sunrpc/rpc_rdma.h
> index f33c5a4..a5a36c0 100644
> --- a/include/linux/sunrpc/rpc_rdma.h
> +++ b/include/linux/sunrpc/rpc_rdma.h
> @@ -93,6 +93,12 @@ struct rpcrdma_msg {
> __be32 rm_pempty[3]; /* 3 empty chunk lists */
> } rm_padded;
>
> + struct {
> + __be32 rm_err;
> + __be32 rm_vers_low;
> + __be32 rm_vers_hi;
> + } rm_error;
> +

I don't see the above members being used or filled with a value; am I
missing the link?

> __be32 rm_chunks[0]; /* read, write and reply chunks */
>
> } rm_body;
> @@ -102,12 +108,16 @@ struct rpcrdma_msg {
> * Smallest RPC/RDMA header: rm_xid through rm_type, then rm_nochunks
> */
> #define RPCRDMA_HDRLEN_MIN (sizeof(__be32) * 7)
> +#define RPCRDMA_HDRLEN_ERR (sizeof(__be32) * 5)
>
> enum rpcrdma_errcode {
> ERR_VERS = 1,
> ERR_CHUNK = 2
> };
>
> +#define err_vers cpu_to_be32(ERR_VERS)
> +#define err_chunk cpu_to_be32(ERR_CHUNK)
> +
> struct rpcrdma_err_vers {
> uint32_t rdma_vers_low; /* Version range supported by peer */
> uint32_t rdma_vers_high;
> @@ -118,7 +128,8 @@ enum rpcrdma_proc {
> RDMA_NOMSG = 1, /* An RPC call or reply msg - separate body */
> RDMA_MSGP = 2, /* An RPC call or reply msg with padding */
> RDMA_DONE = 3, /* Client signals reply completion */
> - RDMA_ERROR = 4 /* An RPC RDMA encoding error */
> + RDMA_ERROR = 4, /* An RPC RDMA encoding error */
> + RDMA_LAST
> };
>
> #define rdma_msg cpu_to_be32(RDMA_MSG)
>
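
For readers less familiar with the wire format, here is a hedged user-space
sketch of the layout these new fields describe: an RDMA_ERROR reply carries
the four-word header (xid, version, credits, proc), then the error code, and
for ERR_VERS the supported version range. The encoder below is purely
illustrative, not the kernel implementation:

    #include <stdint.h>
    #include <arpa/inet.h>

    enum { RDMA_ERROR = 4, ERR_VERS = 1 };

    /* Returns the number of 32-bit XDR words written at va. */
    static unsigned int encode_err_vers(uint32_t *va, uint32_t xid)
    {
            uint32_t *p = va;

            *p++ = xid;                /* rm_xid, copied from the failing request */
            *p++ = htonl(1);           /* rm_vers: RPC-over-RDMA version one */
            *p++ = htonl(0);           /* rm_credit (value illustrative) */
            *p++ = htonl(RDMA_ERROR);  /* rm_type */
            *p++ = htonl(ERR_VERS);    /* rm_err */
            *p++ = htonl(1);           /* rm_vers_low */
            *p++ = htonl(1);           /* rm_vers_hi */
            return (unsigned int)(p - va);  /* 7 words; HDRLEN_ERR covers the first 5 */
    }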

2016-02-05 10:35:56

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 05/10] svcrdma: Make RDMA_ERROR messages work

Looks good.

On Wed, Feb 3, 2016 at 9:22 PM, Chuck Lever <[email protected]> wrote:
> Fix several issues with svc_rdma_send_error():
>
> - Post a receive buffer to replace the one that was consumed by
> the incoming request
> - Posting a send should use DMA_TO_DEVICE, not DMA_FROM_DEVICE
> - No need to put_page _and_ free pages in svc_rdma_put_context
> - Make sure the sge is set up completely in case the error
> path goes through svc_rdma_unmap_dma()
>
> Related fixes in svc_rdma_recvfrom():
>
> - Don't leak the ctxt associated with the incoming request
> - Don't close the connection after sending an error reply
> - Let svc_rdma_send_error() figure out the right header error code
>
> As a last clean up, move svc_rdma_send_error() to svc_rdma_sendto.c
> with other similar functions. There is some common logic in these
> functions that could someday be combined to reduce code duplication.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> include/linux/sunrpc/svc_rdma.h | 4 +-
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 19 ++++-----
> net/sunrpc/xprtrdma/svc_rdma_sendto.c | 62 ++++++++++++++++++++++++++++++
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 54 --------------------------
> 4 files changed, 73 insertions(+), 66 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> index d8fc58d..0589918 100644
> --- a/include/linux/sunrpc/svc_rdma.h
> +++ b/include/linux/sunrpc/svc_rdma.h
> @@ -228,11 +228,11 @@ extern int svc_rdma_map_xdr(struct svcxprt_rdma *, struct xdr_buf *,
> extern int svc_rdma_sendto(struct svc_rqst *);
> extern struct rpcrdma_read_chunk *
> svc_rdma_get_read_chunk(struct rpcrdma_msg *);
> +extern void svc_rdma_send_error(struct svcxprt_rdma *, struct rpcrdma_msg *,
> + int);
>
> /* svc_rdma_transport.c */
> extern int svc_rdma_send(struct svcxprt_rdma *, struct ib_send_wr *);
> -extern void svc_rdma_send_error(struct svcxprt_rdma *, struct rpcrdma_msg *,
> - enum rpcrdma_errcode);
> extern int svc_rdma_post_recv(struct svcxprt_rdma *, gfp_t);
> extern int svc_rdma_repost_recv(struct svcxprt_rdma *, gfp_t);
> extern int svc_rdma_create_listen(struct svc_serv *, int, struct sockaddr *);
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index acf15b8..0f09052 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -612,7 +612,6 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
> struct svc_rdma_op_ctxt *ctxt = NULL;
> struct rpcrdma_msg *rmsgp;
> int ret = 0;
> - int len;
>
> dprintk("svcrdma: rqstp=%p\n", rqstp);
>
> @@ -654,15 +653,10 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
> rdma_build_arg_xdr(rqstp, ctxt, ctxt->byte_len);
>
> /* Decode the RDMA header. */
> - len = svc_rdma_xdr_decode_req(&rmsgp, rqstp);
> - rqstp->rq_xprt_hlen = len;
> -
> - /* If the request is invalid, reply with an error */
> - if (len < 0) {
> - if (len == -ENOSYS)
> - svc_rdma_send_error(rdma_xprt, rmsgp, ERR_VERS);
> - goto close_out;
> - }
> + ret = svc_rdma_xdr_decode_req(&rmsgp, rqstp);
> + if (ret < 0)
> + goto out_err;
> + rqstp->rq_xprt_hlen = ret;
>
> if (svc_rdma_is_backchannel_reply(xprt, rmsgp)) {
> ret = svc_rdma_handle_bc_reply(xprt->xpt_bc_xprt, rmsgp,
> @@ -698,6 +692,11 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
> svc_xprt_copy_addrs(rqstp, xprt);
> return ret;
>
> +out_err:
> + svc_rdma_send_error(rdma_xprt, rmsgp, ret);
> + svc_rdma_put_context(ctxt, 0);
> + return 0;
> +
> close_out:
> if (ctxt)
> svc_rdma_put_context(ctxt, 1);
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> index 687a91afe..73793dd 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> @@ -639,3 +639,65 @@ int svc_rdma_sendto(struct svc_rqst *rqstp)
> svc_rdma_put_context(ctxt, 0);
> return ret;
> }
> +
> +void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,
> + int status)
> +{
> + struct ib_send_wr err_wr;
> + struct page *p;
> + struct svc_rdma_op_ctxt *ctxt;
> + enum rpcrdma_errcode err;
> + __be32 *va;
> + int length;
> + int ret;
> +
> + ret = svc_rdma_repost_recv(xprt, GFP_KERNEL);
> + if (ret)
> + return;
> +
> + p = alloc_page(GFP_KERNEL);
> + if (!p)
> + return;
> + va = page_address(p);
> +
> + /* XDR encode an error reply */
> + err = ERR_CHUNK;
> + if (status == -ENOSYS)
> + err = ERR_VERS;
> + length = svc_rdma_xdr_encode_error(xprt, rmsgp, err, va);
> +
> + ctxt = svc_rdma_get_context(xprt);
> + ctxt->direction = DMA_TO_DEVICE;
> + ctxt->count = 1;
> + ctxt->pages[0] = p;
> +
> + /* Prepare SGE for local address */
> + ctxt->sge[0].lkey = xprt->sc_pd->local_dma_lkey;
> + ctxt->sge[0].length = length;
> + ctxt->sge[0].addr = ib_dma_map_page(xprt->sc_cm_id->device,
> + p, 0, length, DMA_TO_DEVICE);
> + if (ib_dma_mapping_error(xprt->sc_cm_id->device, ctxt->sge[0].addr)) {
> + dprintk("svcrdma: Error mapping buffer for protocol error\n");
> + svc_rdma_put_context(ctxt, 1);
> + return;
> + }
> + atomic_inc(&xprt->sc_dma_used);
> +
> + /* Prepare SEND WR */
> + memset(&err_wr, 0, sizeof err_wr);
> + ctxt->wr_op = IB_WR_SEND;
> + err_wr.wr_id = (unsigned long)ctxt;
> + err_wr.sg_list = ctxt->sge;
> + err_wr.num_sge = 1;
> + err_wr.opcode = IB_WR_SEND;
> + err_wr.send_flags = IB_SEND_SIGNALED;
> +
> + /* Post It */
> + ret = svc_rdma_send(xprt, &err_wr);
> + if (ret) {
> + dprintk("svcrdma: Error %d posting send for protocol error\n",
> + ret);
> + svc_rdma_unmap_dma(ctxt);
> + svc_rdma_put_context(ctxt, 1);
> + }
> +}
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index da82e36..e7bda1e 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -1465,57 +1465,3 @@ int svc_rdma_send(struct svcxprt_rdma *xprt, struct ib_send_wr *wr)
> }
> return ret;
> }
> -
> -void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,
> - enum rpcrdma_errcode err)
> -{
> - struct ib_send_wr err_wr;
> - struct page *p;
> - struct svc_rdma_op_ctxt *ctxt;
> - __be32 *va;
> - int length;
> - int ret;
> -
> - p = alloc_page(GFP_KERNEL);
> - if (!p)
> - return;
> - va = page_address(p);
> -
> - /* XDR encode error */
> - length = svc_rdma_xdr_encode_error(xprt, rmsgp, err, va);
> -
> - ctxt = svc_rdma_get_context(xprt);
> - ctxt->direction = DMA_FROM_DEVICE;
> - ctxt->count = 1;
> - ctxt->pages[0] = p;
> -
> - /* Prepare SGE for local address */
> - ctxt->sge[0].addr = ib_dma_map_page(xprt->sc_cm_id->device,
> - p, 0, length, DMA_FROM_DEVICE);
> - if (ib_dma_mapping_error(xprt->sc_cm_id->device, ctxt->sge[0].addr)) {
> - put_page(p);
> - svc_rdma_put_context(ctxt, 1);
> - return;
> - }
> - atomic_inc(&xprt->sc_dma_used);
> - ctxt->sge[0].lkey = xprt->sc_pd->local_dma_lkey;
> - ctxt->sge[0].length = length;
> -
> - /* Prepare SEND WR */
> - memset(&err_wr, 0, sizeof err_wr);
> - ctxt->wr_op = IB_WR_SEND;
> - err_wr.wr_id = (unsigned long)ctxt;
> - err_wr.sg_list = ctxt->sge;
> - err_wr.num_sge = 1;
> - err_wr.opcode = IB_WR_SEND;
> - err_wr.send_flags = IB_SEND_SIGNALED;
> -
> - /* Post It */
> - ret = svc_rdma_send(xprt, &err_wr);
> - if (ret) {
> - dprintk("svcrdma: Error %d posting send for protocol error\n",
> - ret);
> - svc_rdma_unmap_dma(ctxt);
> - svc_rdma_put_context(ctxt, 1);
> - }
> -}
>
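
One of the fixes above (DMA_TO_DEVICE rather than DMA_FROM_DEVICE) boils down
to a simple symmetry rule: memory the HCA will read and transmit is mapped
DMA_TO_DEVICE, and the unmap must use the same direction. A minimal
kernel-style sketch of that convention (not code from the patch, and not
meant to build stand-alone):

    #include <rdma/ib_verbs.h>

    static int map_for_send(struct ib_device *dev, struct page *page,
                            unsigned int len, struct ib_sge *sge, u32 lkey)
    {
            sge->lkey   = lkey;
            sge->length = len;
            sge->addr   = ib_dma_map_page(dev, page, 0, len, DMA_TO_DEVICE);
            if (ib_dma_mapping_error(dev, sge->addr))
                    return -EIO;
            return 0;
    }

    static void unmap_after_send(struct ib_device *dev, struct ib_sge *sge)
    {
            /* Must match the mapping direction exactly. */
            ib_dma_unmap_page(dev, sge->addr, sge->length, DMA_TO_DEVICE);
    }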

2016-02-05 10:41:46

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 06/10] svcrdma: Use correct XID in error replies

Looks good.

On Wed, Feb 3, 2016 at 9:22 PM, Chuck Lever <[email protected]> wrote:
> When constructing an error reply, svc_rdma_xdr_encode_error()
> needs to view the client's request message so it can get the
> failing request's XID.
>
> svc_rdma_xdr_decode_req() is supposed to return a pointer to the
> client's request header. But if it fails to decode the client's
> message (and thus an error reply is needed) it does not return the
> pointer. The server then sends a bogus XID in the error reply.
>
> Instead, unconditionally generate the pointer to the client's header
> in svc_rdma_recvfrom(), and pass that pointer to both functions.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> include/linux/sunrpc/svc_rdma.h | 2 +-
> net/sunrpc/xprtrdma/svc_rdma_marshal.c | 7 +------
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 3 ++-
> 3 files changed, 4 insertions(+), 8 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> index 0589918..4ce7b74 100644
> --- a/include/linux/sunrpc/svc_rdma.h
> +++ b/include/linux/sunrpc/svc_rdma.h
> @@ -199,7 +199,7 @@ extern int svc_rdma_handle_bc_reply(struct rpc_xprt *xprt,
> struct xdr_buf *rcvbuf);
>
> /* svc_rdma_marshal.c */
> -extern int svc_rdma_xdr_decode_req(struct rpcrdma_msg **, struct svc_rqst *);
> +extern int svc_rdma_xdr_decode_req(struct rpcrdma_msg *, struct svc_rqst *);
> extern int svc_rdma_xdr_encode_error(struct svcxprt_rdma *,
> struct rpcrdma_msg *,
> enum rpcrdma_errcode, __be32 *);
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_marshal.c b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
> index e2fca76..c011b12 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_marshal.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
> @@ -145,15 +145,11 @@ static __be32 *decode_reply_array(__be32 *va, __be32 *vaend)
> return (__be32 *)&ary->wc_array[nchunks];
> }
>
> -int svc_rdma_xdr_decode_req(struct rpcrdma_msg **rdma_req,
> - struct svc_rqst *rqstp)
> +int svc_rdma_xdr_decode_req(struct rpcrdma_msg *rmsgp, struct svc_rqst *rqstp)
> {
> - struct rpcrdma_msg *rmsgp = NULL;
> __be32 *va, *vaend;
> u32 hdr_len;
>
> - rmsgp = (struct rpcrdma_msg *)rqstp->rq_arg.head[0].iov_base;
> -
> /* Verify that there's enough bytes for header + something */
> if (rqstp->rq_arg.len <= RPCRDMA_HDRLEN_MIN) {
> dprintk("svcrdma: header too short = %d\n",
> @@ -201,7 +197,6 @@ int svc_rdma_xdr_decode_req(struct rpcrdma_msg **rdma_req,
> hdr_len = (unsigned long)va - (unsigned long)rmsgp;
> rqstp->rq_arg.head[0].iov_len -= hdr_len;
>
> - *rdma_req = rmsgp;
> return hdr_len;
> }
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index 0f09052..8f68cb6 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -653,7 +653,8 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
> rdma_build_arg_xdr(rqstp, ctxt, ctxt->byte_len);
>
> /* Decode the RDMA header. */
> - ret = svc_rdma_xdr_decode_req(&rmsgp, rqstp);
> + rmsgp = (struct rpcrdma_msg *)rqstp->rq_arg.head[0].iov_base;
> + ret = svc_rdma_xdr_decode_req(rmsgp, rqstp);
> if (ret < 0)
> goto out_err;
> rqstp->rq_xprt_hlen = ret;
>

2016-02-05 10:42:24

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 08/10] svcrdma: Remove close_out exit path

Looks good.

On Wed, Feb 3, 2016 at 9:22 PM, Chuck Lever <[email protected]> wrote:
> Clean up: close_out is reached only when ctxt == NULL and XPT_CLOSE
> is already set.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 12 +-----------
> 1 file changed, 1 insertion(+), 11 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index f8b840b..d3718e9 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -641,8 +641,7 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
> * transport list
> */
> if (test_bit(XPT_CLOSE, &xprt->xpt_flags))
> - goto close_out;
> -
> + goto defer;
> goto out;
> }
> dprintk("svcrdma: processing ctxt=%p on xprt=%p, rqstp=%p, status=%d\n",
> @@ -700,15 +699,6 @@ out_err:
> svc_rdma_put_context(ctxt, 0);
> return 0;
>
> - close_out:
> - if (ctxt)
> - svc_rdma_put_context(ctxt, 1);
> - dprintk("svcrdma: transport %p is closing\n", xprt);
> - /*
> - * Set the close bit and enqueue it. svc_recv will see the
> - * close bit and call svc_xprt_delete
> - */
> - set_bit(XPT_CLOSE, &xprt->xpt_flags);
> defer:
> return 0;
>
>

2016-02-05 10:46:05

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 07/10] svcrdma: Hook up the logic to return ERR_CHUNK

Looks good.

On Wed, Feb 3, 2016 at 9:22 PM, Chuck Lever <[email protected]> wrote:
> RFC 5666 Section 4.2 states:
>
>> When the peer detects an RPC-over-RDMA header version that it does
>> not support (currently this document defines only version 1), it
>> replies with an error code of ERR_VERS, and provides the low and
>> high inclusive version numbers it does, in fact, support.
>
> And:
>
>> When other decoding errors are detected in the header or chunks,
>> either an RPC decode error MAY be returned or the RPC/RDMA error
>> code ERR_CHUNK MUST be returned.
>
> The Linux NFS server does throw ERR_VERS when a client sends it
> a request whose rdma_version is not "one." But it does not return
> ERR_CHUNK when a header decoding error occurs. It just drops the
> request.
>
> To improve protocol extensibility, it should reject invalid values
> in the rdma_proc field instead of treating them all like RDMA_MSG.
> Otherwise clients can't detect when the server doesn't support
> new rdma_proc values.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> net/sunrpc/xprtrdma/svc_rdma_marshal.c | 55 ++++++++++++++++++++++++-------
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 4 ++
> 2 files changed, 46 insertions(+), 13 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_marshal.c b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
> index c011b12..7df4f07 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_marshal.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_marshal.c
> @@ -148,22 +148,41 @@ static __be32 *decode_reply_array(__be32 *va, __be32 *vaend)
> int svc_rdma_xdr_decode_req(struct rpcrdma_msg *rmsgp, struct svc_rqst *rqstp)
> {
> __be32 *va, *vaend;
> + unsigned int len;
> u32 hdr_len;
>
> /* Verify that there's enough bytes for header + something */
> - if (rqstp->rq_arg.len <= RPCRDMA_HDRLEN_MIN) {
> + if (rqstp->rq_arg.len <= RPCRDMA_HDRLEN_ERR) {
> dprintk("svcrdma: header too short = %d\n",
> rqstp->rq_arg.len);
> return -EINVAL;
> }
>
> - if (rmsgp->rm_vers != rpcrdma_version)
> + if (rmsgp->rm_vers != rpcrdma_version) {
> + dprintk("%s: bad version %u\n", __func__,
> + be32_to_cpu(rmsgp->rm_vers));
> return -ENOSYS;
> + }
>
> - /* Pull in the extra for the padded case and bump our pointer */
> - if (rmsgp->rm_type == rdma_msgp) {
> - int hdrlen;
> -
> + switch (be32_to_cpu(rmsgp->rm_type)) {
> + case RDMA_MSG:
> + case RDMA_NOMSG:
> + break;
> +
> + case RDMA_DONE:
> + /* Just drop it */
> + dprintk("svcrdma: dropping RDMA_DONE message\n");
> + return 0;
> +
> + case RDMA_ERROR:
> + /* Possible if this is a backchannel reply.
> + * XXX: We should cancel this XID, though.
> + */
> + dprintk("svcrdma: dropping RDMA_ERROR message\n");
> + return 0;
> +
> + case RDMA_MSGP:
> + /* Pull in the extra for the padded case, bump our pointer */
> rmsgp->rm_body.rm_padded.rm_align =
> be32_to_cpu(rmsgp->rm_body.rm_padded.rm_align);
> rmsgp->rm_body.rm_padded.rm_thresh =
> @@ -171,11 +190,15 @@ int svc_rdma_xdr_decode_req(struct rpcrdma_msg *rmsgp, struct svc_rqst *rqstp)
>
> va = &rmsgp->rm_body.rm_padded.rm_pempty[4];
> rqstp->rq_arg.head[0].iov_base = va;
> - hdrlen = (u32)((unsigned long)va - (unsigned long)rmsgp);
> - rqstp->rq_arg.head[0].iov_len -= hdrlen;
> - if (hdrlen > rqstp->rq_arg.len)
> + len = (u32)((unsigned long)va - (unsigned long)rmsgp);
> + rqstp->rq_arg.head[0].iov_len -= len;
> + if (len > rqstp->rq_arg.len)
> return -EINVAL;
> - return hdrlen;
> + return len;
> + default:
> + dprintk("svcrdma: bad rdma procedure (%u)\n",
> + be32_to_cpu(rmsgp->rm_type));
> + return -EINVAL;
> }
>
> /* The chunk list may contain either a read chunk list or a write
> @@ -184,14 +207,20 @@ int svc_rdma_xdr_decode_req(struct rpcrdma_msg *rmsgp, struct svc_rqst *rqstp)
> va = &rmsgp->rm_body.rm_chunks[0];
> vaend = (__be32 *)((unsigned long)rmsgp + rqstp->rq_arg.len);
> va = decode_read_list(va, vaend);
> - if (!va)
> + if (!va) {
> + dprintk("svcrdma: failed to decode read list\n");
> return -EINVAL;
> + }
> va = decode_write_list(va, vaend);
> - if (!va)
> + if (!va) {
> + dprintk("svcrdma: failed to decode write list\n");
> return -EINVAL;
> + }
> va = decode_reply_array(va, vaend);
> - if (!va)
> + if (!va) {
> + dprintk("svcrdma: failed to decode reply chunk\n");
> return -EINVAL;
> + }
>
> rqstp->rq_arg.head[0].iov_base = va;
> hdr_len = (unsigned long)va - (unsigned long)rmsgp;
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index 8f68cb6..f8b840b 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -657,6 +657,8 @@ int svc_rdma_recvfrom(struct svc_rqst *rqstp)
> ret = svc_rdma_xdr_decode_req(rmsgp, rqstp);
> if (ret < 0)
> goto out_err;
> + if (ret == 0)
> + goto out_drop;
> rqstp->rq_xprt_hlen = ret;
>
> if (svc_rdma_is_backchannel_reply(xprt, rmsgp)) {
> @@ -710,6 +712,8 @@ out_err:
> defer:
> return 0;
>
> +out_drop:
> + svc_rdma_put_context(ctxt, 1);
> repost:
> return svc_rdma_repost_recv(rdma_xprt, GFP_KERNEL);
> }
>

2016-02-05 10:51:49

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 09/10] svcrdma: Use new CQ API for RPC-over-RDMA server receive CQs

Looks good.

On Wed, Feb 3, 2016 at 9:22 PM, Chuck Lever <[email protected]> wrote:
> Calling ib_poll_cq() to sort through WCs during a completion is a
> common pattern amongst RDMA consumers. Since commit 14d3a3b2498e
> ("IB: add a proper completion queue abstraction"), WC sorting can
> be handled by the IB core.
>
> By converting to this new API, svcrdma is made a better neighbor to
> other RDMA consumers, as it allows the core to schedule the delivery
> of completions more fairly amongst all active consumers.
>
> Receive completions no longer use the dto_tasklet. Each polled
> Receive WC is now handled individually in soft IRQ context.
>
> The server transport's rdma_stat_rq_poll and rdma_stat_rq_prod
> metrics are no longer updated.
>
> As far as I can tell, ib_alloc_cq() does not enable a consumer to
> specify a callback for handling async CQ events. This may cause
> a problem with handling spurious connection loss, though I think
> Receive completion flushing should be enough to force the server
> to clean up properly anyway.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> include/linux/sunrpc/svc_rdma.h | 2
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 130 +++++++++---------------------
> 2 files changed, 41 insertions(+), 91 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> index 4ce7b74..7de9cbb 100644
> --- a/include/linux/sunrpc/svc_rdma.h
> +++ b/include/linux/sunrpc/svc_rdma.h
> @@ -75,6 +75,7 @@ struct svc_rdma_op_ctxt {
> struct svc_rdma_fastreg_mr *frmr;
> int hdr_count;
> struct xdr_buf arg;
> + struct ib_cqe cqe;
> struct list_head dto_q;
> enum ib_wr_opcode wr_op;
> enum ib_wc_status wc_status;
> @@ -174,7 +175,6 @@ struct svcxprt_rdma {
> struct work_struct sc_work;
> };
> /* sc_flags */
> -#define RDMAXPRT_RQ_PENDING 1
> #define RDMAXPRT_SQ_PENDING 2
> #define RDMAXPRT_CONN_PENDING 3
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index e7bda1e..316df77 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -68,7 +68,6 @@ static void svc_rdma_detach(struct svc_xprt *xprt);
> static void svc_rdma_free(struct svc_xprt *xprt);
> static int svc_rdma_has_wspace(struct svc_xprt *xprt);
> static int svc_rdma_secure_port(struct svc_rqst *);
> -static void rq_cq_reap(struct svcxprt_rdma *xprt);
> static void sq_cq_reap(struct svcxprt_rdma *xprt);
>
> static DECLARE_TASKLET(dto_tasklet, dto_tasklet_func, 0UL);
> @@ -413,7 +412,6 @@ static void dto_tasklet_func(unsigned long data)
> list_del_init(&xprt->sc_dto_q);
> spin_unlock_irqrestore(&dto_lock, flags);
>
> - rq_cq_reap(xprt);
> sq_cq_reap(xprt);
>
> svc_xprt_put(&xprt->sc_xprt);
> @@ -422,93 +420,49 @@ static void dto_tasklet_func(unsigned long data)
> spin_unlock_irqrestore(&dto_lock, flags);
> }
>
> -/*
> - * Receive Queue Completion Handler
> - *
> - * Since an RQ completion handler is called on interrupt context, we
> - * need to defer the handling of the I/O to a tasklet
> - */
> -static void rq_comp_handler(struct ib_cq *cq, void *cq_context)
> -{
> - struct svcxprt_rdma *xprt = cq_context;
> - unsigned long flags;
> -
> - /* Guard against unconditional flush call for destroyed QP */
> - if (atomic_read(&xprt->sc_xprt.xpt_ref.refcount)==0)
> - return;
> -
> - /*
> - * Set the bit regardless of whether or not it's on the list
> - * because it may be on the list already due to an SQ
> - * completion.
> - */
> - set_bit(RDMAXPRT_RQ_PENDING, &xprt->sc_flags);
> -
> - /*
> - * If this transport is not already on the DTO transport queue,
> - * add it
> - */
> - spin_lock_irqsave(&dto_lock, flags);
> - if (list_empty(&xprt->sc_dto_q)) {
> - svc_xprt_get(&xprt->sc_xprt);
> - list_add_tail(&xprt->sc_dto_q, &dto_xprt_q);
> - }
> - spin_unlock_irqrestore(&dto_lock, flags);
> -
> - /* Tasklet does all the work to avoid irqsave locks. */
> - tasklet_schedule(&dto_tasklet);
> -}
> -
> -/*
> - * rq_cq_reap - Process the RQ CQ.
> +/**
> + * svc_rdma_receive_wc - Invoked by RDMA provider for each polled Receive WC
> + * @cq: completion queue
> + * @wc: completed WR
> *
> - * Take all completing WC off the CQE and enqueue the associated DTO
> - * context on the dto_q for the transport.
> - *
> - * Note that caller must hold a transport reference.
> */
> -static void rq_cq_reap(struct svcxprt_rdma *xprt)
> +static void svc_rdma_receive_wc(struct ib_cq *cq, struct ib_wc *wc)
> {
> - int ret;
> - struct ib_wc wc;
> - struct svc_rdma_op_ctxt *ctxt = NULL;
> -
> - if (!test_and_clear_bit(RDMAXPRT_RQ_PENDING, &xprt->sc_flags))
> - return;
> + struct svcxprt_rdma *xprt = cq->cq_context;
> + struct ib_cqe *cqe = wc->wr_cqe;
> + struct svc_rdma_op_ctxt *ctxt;
>
> - ib_req_notify_cq(xprt->sc_rq_cq, IB_CQ_NEXT_COMP);
> - atomic_inc(&rdma_stat_rq_poll);
> + /* WARNING: Only wc->wr_cqe and wc->status *
> + * are reliable at this point. */
> + ctxt = container_of(cqe, struct svc_rdma_op_ctxt, cqe);
> + ctxt->wc_status = wc->status;
> + svc_rdma_unmap_dma(ctxt);
>
> - while ((ret = ib_poll_cq(xprt->sc_rq_cq, 1, &wc)) > 0) {
> - ctxt = (struct svc_rdma_op_ctxt *)(unsigned long)wc.wr_id;
> - ctxt->wc_status = wc.status;
> - ctxt->byte_len = wc.byte_len;
> - svc_rdma_unmap_dma(ctxt);
> - if (wc.status != IB_WC_SUCCESS) {
> - /* Close the transport */
> - dprintk("svcrdma: transport closing putting ctxt %p\n", ctxt);
> - set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
> - svc_rdma_put_context(ctxt, 1);
> - svc_xprt_put(&xprt->sc_xprt);
> - continue;
> - }
> - spin_lock_bh(&xprt->sc_rq_dto_lock);
> - list_add_tail(&ctxt->dto_q, &xprt->sc_rq_dto_q);
> - spin_unlock_bh(&xprt->sc_rq_dto_lock);
> - svc_xprt_put(&xprt->sc_xprt);
> - }
> + if (wc->status != IB_WC_SUCCESS)
> + goto flushed;
>
> - if (ctxt)
> - atomic_inc(&rdma_stat_rq_prod);
> + /* All wc fields are now known to be valid */
> + ctxt->byte_len = wc->byte_len;
> + spin_lock(&xprt->sc_rq_dto_lock);
> + list_add_tail(&ctxt->dto_q, &xprt->sc_rq_dto_q);
> + spin_unlock(&xprt->sc_rq_dto_lock);
>
> set_bit(XPT_DATA, &xprt->sc_xprt.xpt_flags);
> - /*
> - * If data arrived before established event,
> - * don't enqueue. This defers RPC I/O until the
> - * RDMA connection is complete.
> - */
> - if (!test_bit(RDMAXPRT_CONN_PENDING, &xprt->sc_flags))
> - svc_xprt_enqueue(&xprt->sc_xprt);
> + if (test_bit(RDMAXPRT_CONN_PENDING, &xprt->sc_flags))
> + goto out;
> + svc_xprt_enqueue(&xprt->sc_xprt);
> + goto out;
> +
> +flushed:
> + if (wc->status != IB_WC_WR_FLUSH_ERR)
> + pr_warn("svcrdma: Recv completion: %s (%u, vendor %u)\n",
> + ib_wc_status_msg(wc->status),
> + wc->status, wc->vendor_err);
> + set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
> + svc_rdma_put_context(ctxt, 1);
> +
> +out:
> + svc_xprt_put(&xprt->sc_xprt);
> }
>
> /*
> @@ -681,6 +635,7 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt, gfp_t flags)
> ctxt = svc_rdma_get_context(xprt);
> buflen = 0;
> ctxt->direction = DMA_FROM_DEVICE;
> + ctxt->cqe.done = svc_rdma_receive_wc;
> for (sge_no = 0; buflen < xprt->sc_max_req_size; sge_no++) {
> if (sge_no >= xprt->sc_max_sge) {
> pr_err("svcrdma: Too many sges (%d)\n", sge_no);
> @@ -705,7 +660,7 @@ int svc_rdma_post_recv(struct svcxprt_rdma *xprt, gfp_t flags)
> recv_wr.next = NULL;
> recv_wr.sg_list = &ctxt->sge[0];
> recv_wr.num_sge = ctxt->count;
> - recv_wr.wr_id = (u64)(unsigned long)ctxt;
> + recv_wr.wr_cqe = &ctxt->cqe;
>
> svc_xprt_get(&xprt->sc_xprt);
> ret = ib_post_recv(xprt->sc_qp, &recv_wr, &bad_recv_wr);
> @@ -1124,12 +1079,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
> dprintk("svcrdma: error creating SQ CQ for connect request\n");
> goto errout;
> }
> - cq_attr.cqe = newxprt->sc_rq_depth;
> - newxprt->sc_rq_cq = ib_create_cq(dev,
> - rq_comp_handler,
> - cq_event_handler,
> - newxprt,
> - &cq_attr);
> + newxprt->sc_rq_cq = ib_alloc_cq(dev, newxprt, newxprt->sc_rq_depth,
> + 0, IB_POLL_SOFTIRQ);
> if (IS_ERR(newxprt->sc_rq_cq)) {
> dprintk("svcrdma: error creating RQ CQ for connect request\n");
> goto errout;
> @@ -1225,7 +1176,6 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
> * miss the first message
> */
> ib_req_notify_cq(newxprt->sc_sq_cq, IB_CQ_NEXT_COMP);
> - ib_req_notify_cq(newxprt->sc_rq_cq, IB_CQ_NEXT_COMP);
>
> /* Accept Connection */
> set_bit(RDMAXPRT_CONN_PENDING, &newxprt->sc_flags);
> @@ -1369,7 +1319,7 @@ static void __svc_rdma_free(struct work_struct *work)
> ib_destroy_cq(rdma->sc_sq_cq);
>
> if (rdma->sc_rq_cq && !IS_ERR(rdma->sc_rq_cq))
> - ib_destroy_cq(rdma->sc_rq_cq);
> + ib_free_cq(rdma->sc_rq_cq);
>
> if (rdma->sc_pd && !IS_ERR(rdma->sc_pd))
> ib_dealloc_pd(rdma->sc_pd);
>
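
The conversion pattern itself is compact: embed a struct ib_cqe in the per-WR
context, point wr_cqe at it when posting, and let the IB core invoke the
.done method for each polled completion. A minimal kernel-style sketch
follows (names other than the ib_* API are invented for illustration, and it
is not meant to build stand-alone):

    #include <rdma/ib_verbs.h>

    struct my_rx_ctxt {
            struct ib_cqe cqe;          /* must outlive the posted WR */
            /* receive buffers, sge array, and so on */
    };

    static void my_recv_done(struct ib_cq *cq, struct ib_wc *wc)
    {
            struct my_rx_ctxt *ctxt =
                    container_of(wc->wr_cqe, struct my_rx_ctxt, cqe);

            if (wc->status != IB_WC_SUCCESS) {
                    /* Only wr_cqe and status are valid on flush/error. */
                    return;
            }
            /* wc->byte_len and friends are valid here;
             * hand ctxt to the upper layer (not shown). */
    }

    static int my_post_recv(struct ib_qp *qp, struct my_rx_ctxt *ctxt,
                            struct ib_sge *sge, int num_sge)
    {
            struct ib_recv_wr *bad_wr;
            struct ib_recv_wr wr = {
                    .wr_cqe  = &ctxt->cqe,
                    .sg_list = sge,
                    .num_sge = num_sge,
            };

            ctxt->cqe.done = my_recv_done;
            return ib_post_recv(qp, &wr, &bad_wr);
    }

The CQ behind this is created with ib_alloc_cq(dev, priv, depth, 0,
IB_POLL_SOFTIRQ) and released with ib_free_cq(), exactly as the hunks above
show.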

2016-02-05 10:56:50

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 10/10] svcrdma: Use new CQ API for RPC-over-RDMA server send CQs

Looks good.

On Wed, Feb 3, 2016 at 9:22 PM, Chuck Lever <[email protected]> wrote:
> Calling ib_poll_cq() to sort through WCs during a completion is a
> common pattern amongst RDMA consumers. Since commit 14d3a3b2498e
> ("IB: add a proper completion queue abstraction"), WC sorting can
> be handled by the IB core.
>
> By converting to this new API, svcrdma is made a better neighbor to
> other RDMA consumers, as it allows the core to schedule the delivery
> of completions more fairly amongst all active consumers.
>
> This new API also aims each completion at a function that is
> specific to the WR's opcode. Thus the ctxt->wr_op field and the
> switch in process_context is replaced by a set of methods that
> handle each completion type.
>
> The server's rdma_stat_sq_poll and rdma_stat_sq_prod metrics are no
> longer updated.
>
> As a clean up, the cq_event_handler, the dto_tasklet, and all
> associated locking is removed, as they are no longer referenced or
> used.
>
> Signed-off-by: Chuck Lever <[email protected]>
> ---
> include/linux/sunrpc/svc_rdma.h | 9 +
> net/sunrpc/xprtrdma/svc_rdma_backchannel.c | 4
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 15 +-
> net/sunrpc/xprtrdma/svc_rdma_sendto.c | 12 +
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 257 ++++++++++------------------
> 5 files changed, 119 insertions(+), 178 deletions(-)
>
> diff --git a/include/linux/sunrpc/svc_rdma.h b/include/linux/sunrpc/svc_rdma.h
> index 7de9cbb..8af9e3b 100644
> --- a/include/linux/sunrpc/svc_rdma.h
> +++ b/include/linux/sunrpc/svc_rdma.h
> @@ -76,8 +76,9 @@ struct svc_rdma_op_ctxt {
> int hdr_count;
> struct xdr_buf arg;
> struct ib_cqe cqe;
> + struct ib_cqe reg_cqe;
> + struct ib_cqe inv_cqe;
> struct list_head dto_q;
> - enum ib_wr_opcode wr_op;
> enum ib_wc_status wc_status;
> u32 byte_len;
> u32 position;
> @@ -175,7 +176,6 @@ struct svcxprt_rdma {
> struct work_struct sc_work;
> };
> /* sc_flags */
> -#define RDMAXPRT_SQ_PENDING 2
> #define RDMAXPRT_CONN_PENDING 3
>
> #define RPCRDMA_LISTEN_BACKLOG 10
> @@ -232,6 +232,11 @@ extern void svc_rdma_send_error(struct svcxprt_rdma *, struct rpcrdma_msg *,
> int);
>
> /* svc_rdma_transport.c */
> +extern void svc_rdma_send_wc(struct ib_cq *, struct ib_wc *);
> +extern void svc_rdma_write_wc(struct ib_cq *, struct ib_wc *);
> +extern void svc_rdma_reg_wc(struct ib_cq *, struct ib_wc *);
> +extern void svc_rdma_read_wc(struct ib_cq *, struct ib_wc *);
> +extern void svc_rdma_inv_wc(struct ib_cq *, struct ib_wc *);
> extern int svc_rdma_send(struct svcxprt_rdma *, struct ib_send_wr *);
> extern int svc_rdma_post_recv(struct svcxprt_rdma *, gfp_t);
> extern int svc_rdma_repost_recv(struct svcxprt_rdma *, gfp_t);
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
> index 4498eaf..512590a 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_backchannel.c
> @@ -119,7 +119,6 @@ static int svc_rdma_bc_sendto(struct svcxprt_rdma *rdma,
> ctxt->pages[0] = virt_to_page(rqst->rq_buffer);
> ctxt->count = 1;
>
> - ctxt->wr_op = IB_WR_SEND;
> ctxt->direction = DMA_TO_DEVICE;
> ctxt->sge[0].lkey = rdma->sc_pd->local_dma_lkey;
> ctxt->sge[0].length = sndbuf->len;
> @@ -133,7 +132,8 @@ static int svc_rdma_bc_sendto(struct svcxprt_rdma *rdma,
> atomic_inc(&rdma->sc_dma_used);
>
> memset(&send_wr, 0, sizeof(send_wr));
> - send_wr.wr_id = (unsigned long)ctxt;
> + ctxt->cqe.done = svc_rdma_send_wc;
> + send_wr.wr_cqe = &ctxt->cqe;
> send_wr.sg_list = ctxt->sge;
> send_wr.num_sge = 1;
> send_wr.opcode = IB_WR_SEND;
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> index d3718e9..1a318fc 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_recvfrom.c
> @@ -180,9 +180,9 @@ int rdma_read_chunk_lcl(struct svcxprt_rdma *xprt,
> clear_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags);
>
> memset(&read_wr, 0, sizeof(read_wr));
> - read_wr.wr.wr_id = (unsigned long)ctxt;
> + ctxt->cqe.done = svc_rdma_read_wc;
> + read_wr.wr.wr_cqe = &ctxt->cqe;
> read_wr.wr.opcode = IB_WR_RDMA_READ;
> - ctxt->wr_op = read_wr.wr.opcode;
> read_wr.wr.send_flags = IB_SEND_SIGNALED;
> read_wr.rkey = rs_handle;
> read_wr.remote_addr = rs_offset;
> @@ -299,8 +299,9 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
> ctxt->read_hdr = head;
>
> /* Prepare REG WR */
> + ctxt->reg_cqe.done = svc_rdma_reg_wc;
> + reg_wr.wr.wr_cqe = &ctxt->reg_cqe;
> reg_wr.wr.opcode = IB_WR_REG_MR;
> - reg_wr.wr.wr_id = 0;
> reg_wr.wr.send_flags = IB_SEND_SIGNALED;
> reg_wr.wr.num_sge = 0;
> reg_wr.mr = frmr->mr;
> @@ -310,26 +311,26 @@ int rdma_read_chunk_frmr(struct svcxprt_rdma *xprt,
>
> /* Prepare RDMA_READ */
> memset(&read_wr, 0, sizeof(read_wr));
> + ctxt->cqe.done = svc_rdma_read_wc;
> + read_wr.wr.wr_cqe = &ctxt->cqe;
> read_wr.wr.send_flags = IB_SEND_SIGNALED;
> read_wr.rkey = rs_handle;
> read_wr.remote_addr = rs_offset;
> read_wr.wr.sg_list = ctxt->sge;
> read_wr.wr.num_sge = 1;
> if (xprt->sc_dev_caps & SVCRDMA_DEVCAP_READ_W_INV) {
> - read_wr.wr.opcode = IB_WR_RDMA_READ_WITH_INV;
> - read_wr.wr.wr_id = (unsigned long)ctxt;
> read_wr.wr.ex.invalidate_rkey = ctxt->frmr->mr->lkey;
> } else {
> read_wr.wr.opcode = IB_WR_RDMA_READ;
> read_wr.wr.next = &inv_wr;
> /* Prepare invalidate */
> memset(&inv_wr, 0, sizeof(inv_wr));
> - inv_wr.wr_id = (unsigned long)ctxt;
> + ctxt->inv_cqe.done = svc_rdma_inv_wc;
> + inv_wr.wr_cqe = &ctxt->inv_cqe;
> inv_wr.opcode = IB_WR_LOCAL_INV;
> inv_wr.send_flags = IB_SEND_SIGNALED | IB_SEND_FENCE;
> inv_wr.ex.invalidate_rkey = frmr->mr->lkey;
> }
> - ctxt->wr_op = read_wr.wr.opcode;
>
> /* Post the chain */
> ret = svc_rdma_send(xprt, &reg_wr.wr);
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_sendto.c b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> index 73793dd..9d21205 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_sendto.c
> @@ -281,8 +281,8 @@ static int send_write(struct svcxprt_rdma *xprt, struct svc_rqst *rqstp,
>
> /* Prepare WRITE WR */
> memset(&write_wr, 0, sizeof write_wr);
> - ctxt->wr_op = IB_WR_RDMA_WRITE;
> - write_wr.wr.wr_id = (unsigned long)ctxt;
> + ctxt->cqe.done = svc_rdma_write_wc;
> + write_wr.wr.wr_cqe = &ctxt->cqe;
> write_wr.wr.sg_list = &sge[0];
> write_wr.wr.num_sge = sge_no;
> write_wr.wr.opcode = IB_WR_RDMA_WRITE;
> @@ -538,8 +538,8 @@ static int send_reply(struct svcxprt_rdma *rdma,
> goto err;
> }
> memset(&send_wr, 0, sizeof send_wr);
> - ctxt->wr_op = IB_WR_SEND;
> - send_wr.wr_id = (unsigned long)ctxt;
> + ctxt->cqe.done = svc_rdma_send_wc;
> + send_wr.wr_cqe = &ctxt->cqe;
> send_wr.sg_list = ctxt->sge;
> send_wr.num_sge = sge_no;
> send_wr.opcode = IB_WR_SEND;
> @@ -685,8 +685,8 @@ void svc_rdma_send_error(struct svcxprt_rdma *xprt, struct rpcrdma_msg *rmsgp,
>
> /* Prepare SEND WR */
> memset(&err_wr, 0, sizeof err_wr);
> - ctxt->wr_op = IB_WR_SEND;
> - err_wr.wr_id = (unsigned long)ctxt;
> + ctxt->cqe.done = svc_rdma_send_wc;
> + err_wr.wr_cqe = &ctxt->cqe;
> err_wr.sg_list = ctxt->sge;
> err_wr.num_sge = 1;
> err_wr.opcode = IB_WR_SEND;
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index 316df77..bab80f8 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -63,16 +63,10 @@ static struct svc_xprt *svc_rdma_create(struct svc_serv *serv,
> int flags);
> static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt);
> static void svc_rdma_release_rqst(struct svc_rqst *);
> -static void dto_tasklet_func(unsigned long data);
> static void svc_rdma_detach(struct svc_xprt *xprt);
> static void svc_rdma_free(struct svc_xprt *xprt);
> static int svc_rdma_has_wspace(struct svc_xprt *xprt);
> static int svc_rdma_secure_port(struct svc_rqst *);
> -static void sq_cq_reap(struct svcxprt_rdma *xprt);
> -
> -static DECLARE_TASKLET(dto_tasklet, dto_tasklet_func, 0UL);
> -static DEFINE_SPINLOCK(dto_lock);
> -static LIST_HEAD(dto_xprt_q);
>
> static struct svc_xprt_ops svc_rdma_ops = {
> .xpo_create = svc_rdma_create,
> @@ -351,15 +345,6 @@ static void svc_rdma_destroy_maps(struct svcxprt_rdma *xprt)
> }
> }
>
> -/* ib_cq event handler */
> -static void cq_event_handler(struct ib_event *event, void *context)
> -{
> - struct svc_xprt *xprt = context;
> - dprintk("svcrdma: received CQ event %s (%d), context=%p\n",
> - ib_event_msg(event->event), event->event, context);
> - set_bit(XPT_CLOSE, &xprt->xpt_flags);
> -}
> -
> /* QP event handler */
> static void qp_event_handler(struct ib_event *event, void *context)
> {
> @@ -391,35 +376,6 @@ static void qp_event_handler(struct ib_event *event, void *context)
> }
> }
>
> -/*
> - * Data Transfer Operation Tasklet
> - *
> - * Walks a list of transports with I/O pending, removing entries as
> - * they are added to the server's I/O pending list. Two bits indicate
> - * if SQ, RQ, or both have I/O pending. The dto_lock is an irqsave
> - * spinlock that serializes access to the transport list with the RQ
> - * and SQ interrupt handlers.
> - */
> -static void dto_tasklet_func(unsigned long data)
> -{
> - struct svcxprt_rdma *xprt;
> - unsigned long flags;
> -
> - spin_lock_irqsave(&dto_lock, flags);
> - while (!list_empty(&dto_xprt_q)) {
> - xprt = list_entry(dto_xprt_q.next,
> - struct svcxprt_rdma, sc_dto_q);
> - list_del_init(&xprt->sc_dto_q);
> - spin_unlock_irqrestore(&dto_lock, flags);
> -
> - sq_cq_reap(xprt);
> -
> - svc_xprt_put(&xprt->sc_xprt);
> - spin_lock_irqsave(&dto_lock, flags);
> - }
> - spin_unlock_irqrestore(&dto_lock, flags);
> -}
> -
> /**
> * svc_rdma_receive_wc - Invoked by RDMA provider for each polled Receive WC
> * @cq: completion queue
> @@ -465,132 +421,125 @@ out:
> svc_xprt_put(&xprt->sc_xprt);
> }
>
> -/*
> - * Process a completion context
> - */
> -static void process_context(struct svcxprt_rdma *xprt,
> - struct svc_rdma_op_ctxt *ctxt)
> +static void svc_rdma_send_wc_common(struct svcxprt_rdma *xprt,
> + struct ib_wc *wc)
> {
> - struct svc_rdma_op_ctxt *read_hdr;
> - int free_pages = 0;
> -
> - svc_rdma_unmap_dma(ctxt);
> -
> - switch (ctxt->wr_op) {
> - case IB_WR_SEND:
> - free_pages = 1;
> - break;
> + if (wc->status != IB_WC_SUCCESS)
> + goto err;
>
> - case IB_WR_RDMA_WRITE:
> - break;
> +out:
> + atomic_dec(&xprt->sc_sq_count);
> + wake_up(&xprt->sc_send_wait);
> + return;
>
> - case IB_WR_RDMA_READ:
> - case IB_WR_RDMA_READ_WITH_INV:
> - svc_rdma_put_frmr(xprt, ctxt->frmr);
> +err:
> + set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
> + if (wc->status != IB_WC_WR_FLUSH_ERR)
> + pr_warn("svcrdma: Send completion: %s (%u, vendor %u)\n",
> + ib_wc_status_msg(wc->status),
> + wc->status, wc->vendor_err);
> + goto out;
> +}
>
> - if (!test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags))
> - break;
> +static void svc_rdma_send_wc_common_put(struct ib_cq *cq, struct ib_wc *wc)
> +{
> + struct svcxprt_rdma *xprt = cq->cq_context;
>
> - read_hdr = ctxt->read_hdr;
> - svc_rdma_put_context(ctxt, 0);
> + svc_rdma_send_wc_common(xprt, wc);
> + svc_xprt_put(&xprt->sc_xprt);
> +}
>
> - spin_lock_bh(&xprt->sc_rq_dto_lock);
> - set_bit(XPT_DATA, &xprt->sc_xprt.xpt_flags);
> - list_add_tail(&read_hdr->dto_q,
> - &xprt->sc_read_complete_q);
> - spin_unlock_bh(&xprt->sc_rq_dto_lock);
> - svc_xprt_enqueue(&xprt->sc_xprt);
> - return;
> +/**
> + * svc_rdma_send_wc - Invoked by RDMA provider for each polled Send WC
> + * @cq: completion queue
> + * @wc: completed WR
> + *
> + */
> +void svc_rdma_send_wc(struct ib_cq *cq, struct ib_wc *wc)
> +{
> + struct ib_cqe *cqe = wc->wr_cqe;
> + struct svc_rdma_op_ctxt *ctxt;
>
> - default:
> - dprintk("svcrdma: unexpected completion opcode=%d\n",
> - ctxt->wr_op);
> - break;
> - }
> + svc_rdma_send_wc_common_put(cq, wc);
>
> - svc_rdma_put_context(ctxt, free_pages);
> + ctxt = container_of(cqe, struct svc_rdma_op_ctxt, cqe);
> + svc_rdma_unmap_dma(ctxt);
> + svc_rdma_put_context(ctxt, 1);
> }
>
> -/*
> - * Send Queue Completion Handler - potentially called on interrupt context.
> +/**
> + * svc_rdma_write_wc - Invoked by RDMA provider for each polled Write WC
> + * @cq: completion queue
> + * @wc: completed WR
> *
> - * Note that caller must hold a transport reference.
> */
> -static void sq_cq_reap(struct svcxprt_rdma *xprt)
> +void svc_rdma_write_wc(struct ib_cq *cq, struct ib_wc *wc)
> {
> - struct svc_rdma_op_ctxt *ctxt = NULL;
> - struct ib_wc wc_a[6];
> - struct ib_wc *wc;
> - struct ib_cq *cq = xprt->sc_sq_cq;
> - int ret;
> + struct ib_cqe *cqe = wc->wr_cqe;
> + struct svc_rdma_op_ctxt *ctxt;
>
> - memset(wc_a, 0, sizeof(wc_a));
> + svc_rdma_send_wc_common_put(cq, wc);
>
> - if (!test_and_clear_bit(RDMAXPRT_SQ_PENDING, &xprt->sc_flags))
> - return;
> + ctxt = container_of(cqe, struct svc_rdma_op_ctxt, cqe);
> + svc_rdma_unmap_dma(ctxt);
> + svc_rdma_put_context(ctxt, 0);
> +}
>
> - ib_req_notify_cq(xprt->sc_sq_cq, IB_CQ_NEXT_COMP);
> - atomic_inc(&rdma_stat_sq_poll);
> - while ((ret = ib_poll_cq(cq, ARRAY_SIZE(wc_a), wc_a)) > 0) {
> - int i;
> +/**
> + * svc_rdma_reg_wc - Invoked by RDMA provider for each polled FASTREG WC
> + * @cq: completion queue
> + * @wc: completed WR
> + *
> + */
> +void svc_rdma_reg_wc(struct ib_cq *cq, struct ib_wc *wc)
> +{
> + svc_rdma_send_wc_common_put(cq, wc);
> +}
>
> - for (i = 0; i < ret; i++) {
> - wc = &wc_a[i];
> - if (wc->status != IB_WC_SUCCESS) {
> - dprintk("svcrdma: sq wc err status %s (%d)\n",
> - ib_wc_status_msg(wc->status),
> - wc->status);
> +/**
> + * svc_rdma_read_wc - Invoked by RDMA provider for each polled Read WC
> + * @cq: completion queue
> + * @wc: completed WR
> + *
> + */
> +void svc_rdma_read_wc(struct ib_cq *cq, struct ib_wc *wc)
> +{
> + struct svcxprt_rdma *xprt = cq->cq_context;
> + struct ib_cqe *cqe = wc->wr_cqe;
> + struct svc_rdma_op_ctxt *ctxt;
>
> - /* Close the transport */
> - set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
> - }
> + svc_rdma_send_wc_common(xprt, wc);
>
> - /* Decrement used SQ WR count */
> - atomic_dec(&xprt->sc_sq_count);
> - wake_up(&xprt->sc_send_wait);
> + ctxt = container_of(cqe, struct svc_rdma_op_ctxt, cqe);
> + svc_rdma_unmap_dma(ctxt);
> + svc_rdma_put_frmr(xprt, ctxt->frmr);
>
> - ctxt = (struct svc_rdma_op_ctxt *)
> - (unsigned long)wc->wr_id;
> - if (ctxt)
> - process_context(xprt, ctxt);
> + if (test_bit(RDMACTXT_F_LAST_CTXT, &ctxt->flags)) {
> + struct svc_rdma_op_ctxt *read_hdr;
>
> - svc_xprt_put(&xprt->sc_xprt);
> - }
> + read_hdr = ctxt->read_hdr;
> + spin_lock(&xprt->sc_rq_dto_lock);
> + list_add_tail(&read_hdr->dto_q,
> + &xprt->sc_read_complete_q);
> + spin_unlock(&xprt->sc_rq_dto_lock);
> +
> + set_bit(XPT_DATA, &xprt->sc_xprt.xpt_flags);
> + svc_xprt_enqueue(&xprt->sc_xprt);
> }
>
> - if (ctxt)
> - atomic_inc(&rdma_stat_sq_prod);
> + svc_rdma_put_context(ctxt, 0);
> + svc_xprt_put(&xprt->sc_xprt);
> }
>
> -static void sq_comp_handler(struct ib_cq *cq, void *cq_context)
> +/**
> + * svc_rdma_inv_wc - Invoked by RDMA provider for each polled LOCAL_INV WC
> + * @cq: completion queue
> + * @wc: completed WR
> + *
> + */
> +void svc_rdma_inv_wc(struct ib_cq *cq, struct ib_wc *wc)
> {
> - struct svcxprt_rdma *xprt = cq_context;
> - unsigned long flags;
> -
> - /* Guard against unconditional flush call for destroyed QP */
> - if (atomic_read(&xprt->sc_xprt.xpt_ref.refcount)==0)
> - return;
> -
> - /*
> - * Set the bit regardless of whether or not it's on the list
> - * because it may be on the list already due to an RQ
> - * completion.
> - */
> - set_bit(RDMAXPRT_SQ_PENDING, &xprt->sc_flags);
> -
> - /*
> - * If this transport is not already on the DTO transport queue,
> - * add it
> - */
> - spin_lock_irqsave(&dto_lock, flags);
> - if (list_empty(&xprt->sc_dto_q)) {
> - svc_xprt_get(&xprt->sc_xprt);
> - list_add_tail(&xprt->sc_dto_q, &dto_xprt_q);
> - }
> - spin_unlock_irqrestore(&dto_lock, flags);
> -
> - /* Tasklet does all the work to avoid irqsave locks. */
> - tasklet_schedule(&dto_tasklet);
> + svc_rdma_send_wc_common_put(cq, wc);
> }
>
> static struct svcxprt_rdma *rdma_create_xprt(struct svc_serv *serv,
> @@ -1011,7 +960,6 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
> struct svcxprt_rdma *listen_rdma;
> struct svcxprt_rdma *newxprt = NULL;
> struct rdma_conn_param conn_param;
> - struct ib_cq_init_attr cq_attr = {};
> struct ib_qp_init_attr qp_attr;
> struct ib_device *dev;
> unsigned int i;
> @@ -1069,12 +1017,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
> dprintk("svcrdma: error creating PD for connect request\n");
> goto errout;
> }
> - cq_attr.cqe = newxprt->sc_sq_depth;
> - newxprt->sc_sq_cq = ib_create_cq(dev,
> - sq_comp_handler,
> - cq_event_handler,
> - newxprt,
> - &cq_attr);
> + newxprt->sc_sq_cq = ib_alloc_cq(dev, newxprt, newxprt->sc_sq_depth,
> + 0, IB_POLL_SOFTIRQ);
> if (IS_ERR(newxprt->sc_sq_cq)) {
> dprintk("svcrdma: error creating SQ CQ for connect request\n");
> goto errout;
> @@ -1171,12 +1115,6 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
> /* Swap out the handler */
> newxprt->sc_cm_id->event_handler = rdma_cma_handler;
>
> - /*
> - * Arm the CQs for the SQ and RQ before accepting so we can't
> - * miss the first message
> - */
> - ib_req_notify_cq(newxprt->sc_sq_cq, IB_CQ_NEXT_COMP);
> -
> /* Accept Connection */
> set_bit(RDMAXPRT_CONN_PENDING, &newxprt->sc_flags);
> memset(&conn_param, 0, sizeof conn_param);
> @@ -1316,7 +1254,7 @@ static void __svc_rdma_free(struct work_struct *work)
> ib_destroy_qp(rdma->sc_qp);
>
> if (rdma->sc_sq_cq && !IS_ERR(rdma->sc_sq_cq))
> - ib_destroy_cq(rdma->sc_sq_cq);
> + ib_free_cq(rdma->sc_sq_cq);
>
> if (rdma->sc_rq_cq && !IS_ERR(rdma->sc_rq_cq))
> ib_free_cq(rdma->sc_rq_cq);
> @@ -1380,9 +1318,6 @@ int svc_rdma_send(struct svcxprt_rdma *xprt, struct ib_send_wr *wr)
> spin_unlock_bh(&xprt->sc_lock);
> atomic_inc(&rdma_stat_sq_starve);
>
> - /* See if we can opportunistically reap SQ WR to make room */
> - sq_cq_reap(xprt);
> -
> /* Wait until SQ WR available if SQ still full */
> wait_event(xprt->sc_send_wait,
> atomic_read(&xprt->sc_sq_count) <
>
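
The distinctive part of this patch, compared to the Receive-side conversion,
is that one context can carry more than one ib_cqe, each with its own
completion method, which is what lets the old wr_op switch go away. A hedged
kernel-style sketch of that wiring (names outside the ib_* API are invented
for illustration):

    #include <rdma/ib_verbs.h>

    struct my_op_ctxt {
            struct ib_cqe cqe;      /* Send, Write and Read completions */
            struct ib_cqe reg_cqe;  /* FASTREG completions */
            struct ib_cqe inv_cqe;  /* LOCAL_INV completions */
    };

    static void my_reg_done(struct ib_cq *cq, struct ib_wc *wc)
    {
            /* Registration needs no per-WR cleanup in this sketch. */
    }

    static void my_inv_done(struct ib_cq *cq, struct ib_wc *wc)
    {
            struct my_op_ctxt *ctxt =
                    container_of(wc->wr_cqe, struct my_op_ctxt, inv_cqe);

            /* Release resources tied to the invalidated MR (not shown). */
    }

    static void my_wire_up(struct my_op_ctxt *ctxt, struct ib_reg_wr *reg_wr,
                           struct ib_send_wr *inv_wr)
    {
            ctxt->reg_cqe.done = my_reg_done;
            reg_wr->wr.wr_cqe  = &ctxt->reg_cqe;

            ctxt->inv_cqe.done = my_inv_done;
            inv_wr->wr_cqe     = &ctxt->inv_cqe;
    }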

2016-02-05 10:59:47

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 00/10] NFS/RDMA server patches for v4.6

While some bug fixes for ocrdma are still on the way, I tested
this series with a modified ocrdma driver.
The following tests passed:
iozone, short-duration cable pull, and long-duration cable pull (more than 5 mins).

Series Reviewed-By: Devesh Sharma <[email protected]>
Series Tested-By: Devesh Sharma <[email protected]>




On Wed, Feb 3, 2016 at 9:21 PM, Chuck Lever <[email protected]> wrote:
> For review.
>
> - Fix XDR round-up in write chunks (take 3)
> - Pre-allocate FRWRs when accepting a connection
> - Add proper handling of bad header XDR
> - Convert the server to use the new core CQ API added in v4.5.
>
> Also available in the "nfsd-rdma-for-4.6" topic branch of this git
> repo:
>
> git://git.linux-nfs.org/projects/cel/cel-2.6.git
>
> Or for browsing:
>
> http://git.linux-nfs.org/?p=cel/cel-2.6.git;a=log;h=refs/heads/nfsd-rdma-for-4.6
>
> ---
>
> Chuck Lever (10):
> svcrdma: Do not send XDR roundup bytes for a write chunk
> svcrdma: Make svc_rdma_get_frmr() not sleep
> svcrdma: svc_rdma_post_recv() should close connection on error
> rpcrdma: Add missing XDR union fields for RDMA errors
> svcrdma: Make RDMA_ERROR messages work
> svcrdma: Use correct XID in error replies
> svcrdma: Hook up the logic to return ERR_CHUNK
> svcrdma: Remove close_out exit path
> svcrdma: Use new CQ API for RPC-over-RDMA server receive CQs
> svcrdma: Use new CQ API for RPC-over-RDMA server send CQs
>
>
> include/linux/sunrpc/rpc_rdma.h | 13 +
> include/linux/sunrpc/svc_rdma.h | 18 +
> net/sunrpc/xprtrdma/svc_rdma_backchannel.c | 15 -
> net/sunrpc/xprtrdma/svc_rdma_marshal.c | 62 ++--
> net/sunrpc/xprtrdma/svc_rdma_recvfrom.c | 61 +---
> net/sunrpc/xprtrdma/svc_rdma_sendto.c | 79 ++++-
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 488 +++++++++++-----------------
> 7 files changed, 346 insertions(+), 390 deletions(-)
>
> --
> Chuck Lever

2016-02-05 15:53:31

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH v1 04/10] rpcrdma: Add missing XDR union fields for RDMA errors


> On Feb 5, 2016, at 5:23 AM, Devesh Sharma <[email protected]> wrote:
>
> On Wed, Feb 3, 2016 at 9:21 PM, Chuck Lever <[email protected]> wrote:
>> Add some infrastructure (shared between xprtrdma and svcrdma) for
>> constructing and parsing RPC-over-RDMA error status information.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> include/linux/sunrpc/rpc_rdma.h | 13 ++++++++++++-
>> 1 file changed, 12 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/sunrpc/rpc_rdma.h b/include/linux/sunrpc/rpc_rdma.h
>> index f33c5a4..a5a36c0 100644
>> --- a/include/linux/sunrpc/rpc_rdma.h
>> +++ b/include/linux/sunrpc/rpc_rdma.h
>> @@ -93,6 +93,12 @@ struct rpcrdma_msg {
>> __be32 rm_pempty[3]; /* 3 empty chunk lists */
>> } rm_padded;
>>
>> + struct {
>> + __be32 rm_err;
>> + __be32 rm_vers_low;
>> + __be32 rm_vers_hi;
>> + } rm_error;
>> +
>
> I don't see the above members being used or filled with a value; am I
> missing the link?

The patch adds items that are used by patches that will be
submitted separately to the NFS client maintainer and the
server maintainer for 4.6. The same patch appears in both
series to avoid merge conflicts.

The server probably only needs RPCRDMA_HDRLEN_ERR. I can
split that into a separate patch that appears in both
series, and move the rest into a client-only patch.
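
For illustration only, here is a rough sketch of how a sender might
fill the new rm_error fields when rejecting an unsupported protocol
version. The function name and call site are invented, and it assumes
the existing rdma_error constant; the real users of these fields
arrive in the follow-on client and server patches:

/* Hypothetical sketch, not part of this patch */
static void fill_err_vers(struct rpcrdma_msg *rmsgp, __be32 xid)
{
	rmsgp->rm_xid = xid;
	rmsgp->rm_vers = cpu_to_be32(1);	/* RPC-over-RDMA version one */
	rmsgp->rm_credit = cpu_to_be32(1);	/* illustrative credit value */
	rmsgp->rm_type = rdma_error;
	rmsgp->rm_body.rm_error.rm_err = err_vers;
	rmsgp->rm_body.rm_error.rm_vers_low = cpu_to_be32(1);
	rmsgp->rm_body.rm_error.rm_vers_hi = cpu_to_be32(1);
}

As I read the header layout, an ERR_CHUNK reply is just the first
five words (RPCRDMA_HDRLEN_ERR); ERR_VERS adds the two version-range
words on top of that.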


>> __be32 rm_chunks[0]; /* read, write and reply chunks */
>>
>> } rm_body;
>> @@ -102,12 +108,16 @@ struct rpcrdma_msg {
>> * Smallest RPC/RDMA header: rm_xid through rm_type, then rm_nochunks
>> */
>> #define RPCRDMA_HDRLEN_MIN (sizeof(__be32) * 7)
>> +#define RPCRDMA_HDRLEN_ERR (sizeof(__be32) * 5)
>>
>> enum rpcrdma_errcode {
>> ERR_VERS = 1,
>> ERR_CHUNK = 2
>> };
>>
>> +#define err_vers cpu_to_be32(ERR_VERS)
>> +#define err_chunk cpu_to_be32(ERR_CHUNK)
>> +
>> struct rpcrdma_err_vers {
>> uint32_t rdma_vers_low; /* Version range supported by peer */
>> uint32_t rdma_vers_high;
>> @@ -118,7 +128,8 @@ enum rpcrdma_proc {
>> RDMA_NOMSG = 1, /* An RPC call or reply msg - separate body */
>> RDMA_MSGP = 2, /* An RPC call or reply msg with padding */
>> RDMA_DONE = 3, /* Client signals reply completion */
>> - RDMA_ERROR = 4 /* An RPC RDMA encoding error */
>> + RDMA_ERROR = 4, /* An RPC RDMA encoding error */
>> + RDMA_LAST
>> };
>>
>> #define rdma_msg cpu_to_be32(RDMA_MSG)
>>

--
Chuck Lever





2016-02-05 16:29:39

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH v1 02/10] svcrdma: Make svc_rdma_get_frmr() not sleep


> On Feb 5, 2016, at 5:15 AM, Devesh Sharma <[email protected]> wrote:
>
> As I understand it, you are pre-allocating because alloc_mr() can sleep,
> so you separate it out, while the map-frmr-wqe path is by definition a
> non-blocking context. Are we on the same page?

alloc_mr() can sleep, and so can kmalloc(GFP_KERNEL).
Currently svc_rdma_get_frmr() is called in a worker thread,
so it's in a process context and can sleep, but if we were
to move the RDMA Read logic into a softIRQ context (say,
invoked during Receive processing), it would be a problem
to sleep.
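
(Illustration only, not from the patch: the constraint in a nutshell
is GFP_KERNEL versus GFP_ATOMIC.)

/* Sketch: allocation behavior by calling context */
static void *alloc_buf(size_t size, bool may_sleep)
{
	if (!may_sleep)
		/* softirq/atomic context: must not sleep, can fail under pressure */
		return kmalloc(size, GFP_ATOMIC);
	/* process context: may sleep until memory is found */
	return kmalloc(size, GFP_KERNEL);
}

Pre-allocating at accept time sidesteps both the sleep and the
failure-in-the-hot-path problem.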

Another goal is to get rid of memory allocation operations
that can fail in low memory situations in areas of the code
where an error is hard to handle.

So the patch description is misleading and incorrect.

However, the problem with pre-allocating frmrs is that it
is not possible for the server to determine how many will
be needed at accept time. The maximum number depends on the
largest possible ULP payload size, the page list depth of
the server's device, and the maximum chunk size the client
is using (it could be as small as 4KB for some registration
modes). That would mean allocating thousands per connection
up front to cover every contingency.
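
To put rough numbers on that (all values assumed for illustration:
1MB maximum payload with 4KB pages, so RPCSVC_MAXPAGES is about 258,
and a credit limit of 32):

	worst case, one 4KB segment per RDMA Read:
		258 frmrs per RPC x 32 credits	= ~8,256 frmrs per connection

	the patch's sizing, with max_fast_reg_page_list_len = 256:
		(1 + 258 / 256) x 32		= 64 frmrs per connection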

Perhaps a better solution would be to understand how to
deal with an frmr allocation failure in rdma_read_chunk_frmr().
I think just dropping the connection would be reasonable.
If ib_post_recv() fails, for example, this is exactly what
the server does.
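
A minimal sketch of that failure handling, using the names from the
patch above; the surrounding read-chunk logic is omitted and the
exact call site is an assumption:

	frmr = svc_rdma_get_frmr(xprt);
	if (IS_ERR(frmr)) {
		/* No MR available: give up on this RPC and close the
		 * connection; the client will reconnect and retransmit.
		 */
		set_bit(XPT_CLOSE, &xprt->sc_xprt.xpt_flags);
		return PTR_ERR(frmr);
	}
	/* ... otherwise build and post the fast-reg WR and RDMA Read ... */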

Christoph says he is about to modify or replace the NFS
server's RDMA Read code paths, however.

I'll drop this one.


> On Wed, Feb 3, 2016 at 9:21 PM, Chuck Lever <[email protected]> wrote:
>> Want to call it in a context that cannot sleep. So pre-alloc
>> the memory and the MRs.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> ---
>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 44 ++++++++++++++++++++++++++----
>> 1 file changed, 38 insertions(+), 6 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> index 5763825..02eee12 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> @@ -949,7 +949,7 @@ static struct svc_rdma_fastreg_mr *rdma_alloc_frmr(struct svcxprt_rdma *xprt)
>> return ERR_PTR(-ENOMEM);
>> }
>>
>> -static void rdma_dealloc_frmr_q(struct svcxprt_rdma *xprt)
>> +static void svc_rdma_destroy_frmrs(struct svcxprt_rdma *xprt)
>> {
>> struct svc_rdma_fastreg_mr *frmr;
>>
>> @@ -963,6 +963,37 @@ static void rdma_dealloc_frmr_q(struct svcxprt_rdma *xprt)
>> }
>> }
>>
>> +static bool svc_rdma_prealloc_frmrs(struct svcxprt_rdma *xprt)
>> +{
>> + struct ib_device *dev = xprt->sc_cm_id->device;
>> + unsigned int i;
>> +
>> + /* Pre-allocate based on the maximum amount of payload
>> + * the server's HCA can handle per RDMA Read, to keep
>> + * the number of MRs per connection in check.
>> + *
>> + * If a client sends small Read chunks (eg. it may be
>> + * using physical registration), more RDMA Reads per
>> + * NFS WRITE will be needed. svc_rdma_get_frmr() dips
>> + * into its reserve in that case. Better would be for
>> + * the server to reduce the connection's credit limit.
>> + */
>> + i = 1 + RPCSVC_MAXPAGES / dev->attrs.max_fast_reg_page_list_len;
>> + i *= xprt->sc_max_requests;
>> +
>> + while (i--) {
>> + struct svc_rdma_fastreg_mr *frmr;
>> +
>> + frmr = rdma_alloc_frmr(xprt);
>> + if (!frmr) {
>> + dprintk("svcrdma: No memory for request map\n");
>> + return false;
>> + }
>> + list_add(&frmr->frmr_list, &xprt->sc_frmr_q);
>> + }
>> + return true;
>> +}
>> +
>> struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
>> {
>> struct svc_rdma_fastreg_mr *frmr = NULL;
>> @@ -975,10 +1006,9 @@ struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
>> frmr->sg_nents = 0;
>> }
>> spin_unlock_bh(&rdma->sc_frmr_q_lock);
>> - if (frmr)
>> - return frmr;
>> -
>> - return rdma_alloc_frmr(rdma);
>> + if (!frmr)
>> + return ERR_PTR(-ENOMEM);
>> + return frmr;
>> }
>>
>> void svc_rdma_put_frmr(struct svcxprt_rdma *rdma,
>> @@ -1149,6 +1179,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>> dev->attrs.max_fast_reg_page_list_len;
>> newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_FAST_REG;
>> newxprt->sc_reader = rdma_read_chunk_frmr;
>> + if (!svc_rdma_prealloc_frmrs(newxprt))
>> + goto errout;
>> }
>>
>> /*
>> @@ -1310,7 +1342,7 @@ static void __svc_rdma_free(struct work_struct *work)
>> xprt->xpt_bc_xprt = NULL;
>> }
>>
>> - rdma_dealloc_frmr_q(rdma);
>> + svc_rdma_destroy_frmrs(rdma);
>> svc_rdma_destroy_ctxts(rdma);
>> svc_rdma_destroy_maps(rdma);
>>

--
Chuck Lever





2016-02-05 17:49:57

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 02/10] svcrdma: Make svc_rdma_get_frmr() not sleep

On Fri, Feb 5, 2016 at 9:59 PM, Chuck Lever <[email protected]> wrote:
>
>> On Feb 5, 2016, at 5:15 AM, Devesh Sharma <[email protected]> wrote:
>>
>> As I understand it, you are pre-allocating because alloc_mr() can sleep,
>> so you separate it out, while the map-frmr-wqe path is by definition a
>> non-blocking context. Are we on the same page?
>
> alloc_mr() can sleep, and so can kmalloc(GFP_KERNEL).
> Currently svc_rdma_get_frmr() is called in a worker thread,
> so it's in a process context and can sleep, but if we were
> to move the RDMA Read logic into a softIRQ context (say,
> invoked during Receive processing), it would be a problem
> to sleep.

Got it.

>
> Another goal is to get rid of memory allocation operations
> that can fail in low memory situations in areas of the code
> where an error is hard to handle.

>
> So the patch description is misleading and incorrect.

Got it

>
> However, the problem with pre-allocating frmrs is that it
> is not possible for the server to determine how many will
> be needed at accept time. The maximum number depends on the
> largest possible ULP payload size, the page list depth of
> the server's device, and the maximum chunk size the client
> is using (it could be as small as 4KB for some registration
> modes). That would mean allocating thousands per connection
> up front to cover every contingency.

Can't it decide on the fly? Start with 'n' smaller frmrs, where 'n'
is derived from some static calculation. Then, if the connection
demands a bigger size, increase the depth of all 'n' frmrs and reuse
them for the life of the connection. If it demands more frmrs, do
nothing and try to manage with 'n'. Does that make sense? I am sure
I am missing a few points here.

>
> Perhaps a better solution would be to understand how to
> deal with an frmr allocation failure in rdma_read_chunk_frmr().
> I think just dropping the connection would be reasonable.
> If ib_post_recv() fails, for example, this is exactly what
> the server does.

Dropping the connection seems logical; however, I think it would be
a harsh decision for any hypothetical device that has fewer IB/RoCE
resources, e.g. a VM in an SR-IOV enabled environment where many
VFs are enabled.

As suggested earlier in this mail, continuing the connection with
whatever resources it has would let a low-resource device still use
it. If the system recovers from memory pressure, the allocations
would eventually succeed. The drawback is a performance penalty.

>
> Christoph says he is about to modify or replace the NFS
> server's RDMA Read code paths, however.
>
> I'll drop this one.
>
>
>> On Wed, Feb 3, 2016 at 9:21 PM, Chuck Lever <[email protected]> wrote:
>>> Want to call it in a context that cannot sleep. So pre-alloc
>>> the memory and the MRs.
>>>
>>> Signed-off-by: Chuck Lever <[email protected]>
>>> ---
>>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 44 ++++++++++++++++++++++++++----
>>> 1 file changed, 38 insertions(+), 6 deletions(-)
>>>
>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>> index 5763825..02eee12 100644
>>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>> @@ -949,7 +949,7 @@ static struct svc_rdma_fastreg_mr *rdma_alloc_frmr(struct svcxprt_rdma *xprt)
>>> return ERR_PTR(-ENOMEM);
>>> }
>>>
>>> -static void rdma_dealloc_frmr_q(struct svcxprt_rdma *xprt)
>>> +static void svc_rdma_destroy_frmrs(struct svcxprt_rdma *xprt)
>>> {
>>> struct svc_rdma_fastreg_mr *frmr;
>>>
>>> @@ -963,6 +963,37 @@ static void rdma_dealloc_frmr_q(struct svcxprt_rdma *xprt)
>>> }
>>> }
>>>
>>> +static bool svc_rdma_prealloc_frmrs(struct svcxprt_rdma *xprt)
>>> +{
>>> + struct ib_device *dev = xprt->sc_cm_id->device;
>>> + unsigned int i;
>>> +
>>> + /* Pre-allocate based on the maximum amount of payload
>>> + * the server's HCA can handle per RDMA Read, to keep
>>> + * the number of MRs per connection in check.
>>> + *
>>> + * If a client sends small Read chunks (eg. it may be
>>> + * using physical registration), more RDMA Reads per
>>> + * NFS WRITE will be needed. svc_rdma_get_frmr() dips
>>> + * into its reserve in that case. Better would be for
>>> + * the server to reduce the connection's credit limit.
>>> + */
>>> + i = 1 + RPCSVC_MAXPAGES / dev->attrs.max_fast_reg_page_list_len;
>>> + i *= xprt->sc_max_requests;
>>> +
>>> + while (i--) {
>>> + struct svc_rdma_fastreg_mr *frmr;
>>> +
>>> + frmr = rdma_alloc_frmr(xprt);
>>> + if (!frmr) {
>>> + dprintk("svcrdma: No memory for request map\n");
>>> + return false;
>>> + }
>>> + list_add(&frmr->frmr_list, &xprt->sc_frmr_q);
>>> + }
>>> + return true;
>>> +}
>>> +
>>> struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
>>> {
>>> struct svc_rdma_fastreg_mr *frmr = NULL;
>>> @@ -975,10 +1006,9 @@ struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
>>> frmr->sg_nents = 0;
>>> }
>>> spin_unlock_bh(&rdma->sc_frmr_q_lock);
>>> - if (frmr)
>>> - return frmr;
>>> -
>>> - return rdma_alloc_frmr(rdma);
>>> + if (!frmr)
>>> + return ERR_PTR(-ENOMEM);
>>> + return frmr;
>>> }
>>>
>>> void svc_rdma_put_frmr(struct svcxprt_rdma *rdma,
>>> @@ -1149,6 +1179,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>>> dev->attrs.max_fast_reg_page_list_len;
>>> newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_FAST_REG;
>>> newxprt->sc_reader = rdma_read_chunk_frmr;
>>> + if (!svc_rdma_prealloc_frmrs(newxprt))
>>> + goto errout;
>>> }
>>>
>>> /*
>>> @@ -1310,7 +1342,7 @@ static void __svc_rdma_free(struct work_struct *work)
>>> xprt->xpt_bc_xprt = NULL;
>>> }
>>>
>>> - rdma_dealloc_frmr_q(rdma);
>>> + svc_rdma_destroy_frmrs(rdma);
>>> svc_rdma_destroy_ctxts(rdma);
>>> svc_rdma_destroy_maps(rdma);
>>>
>
> --
> Chuck Lever
>
>
>
>

2016-02-05 18:13:37

by Chuck Lever

[permalink] [raw]
Subject: Re: [PATCH v1 02/10] svcrdma: Make svc_rdma_get_frmr() not sleep


> On Feb 5, 2016, at 12:49 PM, Devesh Sharma <[email protected]> wrote:
>
> On Fri, Feb 5, 2016 at 9:59 PM, Chuck Lever <[email protected]> wrote:
>>
>>> On Feb 5, 2016, at 5:15 AM, Devesh Sharma <[email protected]> wrote:
>>>
>>> As I understand it, you are pre-allocating because alloc_mr() can sleep,
>>> so you separate it out, while the map-frmr-wqe path is by definition a
>>> non-blocking context. Are we on the same page?
>>
>> alloc_mr() can sleep, and so can kmalloc(GFP_KERNEL).
>> Currently svc_rdma_get_frmr() is called in a worker thread,
>> so it's in a process context and can sleep, but if we were
>> to move the RDMA Read logic into a softIRQ context (say,
>> invoked during Receive processing), it would be a problem
>> to sleep.
>
> Got it.
>
>>
>> Another goal is to get rid of memory allocation operations
>> that can fail in low memory situations in areas of the code
>> where an error is hard to handle.
>
>>
>> So the patch description is misleading and incorrect.
>
> Got it
>
>>
>> However, the problem with pre-allocating frmrs is that it
>> is not possible for the server to determine how many will
>> be needed at accept time. The maximum number depends on the
>> largest possible ULP payload size, the page list depth of
>> the server's device, and the maximum chunk size the client
>> is using (it could be as small as 4KB for some registration
>> modes). That would mean allocating thousands per connection
>> up front to cover every contingency.
>
> Can't it decide on the fly? Start with 'n' smaller frmrs, where 'n'
> is derived from some static calculation. Then, if the connection
> demands a bigger size, increase the depth of all 'n' frmrs and reuse
> them for the life of the connection. If it demands more frmrs, do
> nothing and try to manage with 'n'. Does that make sense? I am sure
> I am missing a few points here.

Currently, all svc_rdma_frmrs are the same. They are allocated
on-demand then cached after an RDMA Read operation is complete.


>> Perhaps a better solution would be to understand how to
>> deal with an frmr allocation failure in rdma_read_chunk_frmr().
>> I think just dropping the connection would be reasonable.
>> If ib_post_recv() fails, for example, this is exactly what
>> the server does.
>
> Dropping the connection seems logical; however, I think it would be
> a harsh decision for any hypothetical device that has fewer IB/RoCE
> resources, e.g. a VM in an SR-IOV enabled environment where many
> VFs are enabled.
>
> As suggested earlier in this mail, continuing the connection with
> whatever resources it has would let a low-resource device still use
> it. If the system recovers from memory pressure, the allocations
> would eventually succeed. The drawback is a performance penalty.

When an frmr allocation fails, the RDMA Read to pull the
RPC data over can't be done, so there's no recourse but for
the server to discard that RPC request. NFSv4.x requires a
connection drop when an RPC is lost. The client will retransmit
that RPC.

Dropping the connection frees all resources for that connection.
Clients will typically reconnect pretty quickly when they have
more work to do. The server will drop the connection repeatedly
until enough resources are available to handle the client's
workload.

Eventually we could add a retry mechanism to the RDMA Read
logic. Wait a bit and try the allocation again. That's probably
not worth the effort, this really should be very very rare.
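
Just to show the shape of such a mechanism (a hypothetical helper,
and only valid while this code still runs in process context):

static struct svc_rdma_fastreg_mr *
get_frmr_with_retry(struct svcxprt_rdma *xprt)
{
	struct svc_rdma_fastreg_mr *frmr;
	int tries;

	for (tries = 0; tries < 3; tries++) {
		frmr = svc_rdma_get_frmr(xprt);
		if (!IS_ERR(frmr))
			return frmr;
		msleep(20);	/* wait a bit, then try the reserve again */
	}
	return ERR_PTR(-ENOMEM);
}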

If a device is ultimately resource-starved, that's a different
problem. The server may never be able to allocate enough MRs
even if the client tries again later.

That kind of issue should be obvious at accept time. The server
should adjust the connection's credit limit downward to fit the
device's available resource budget.
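
A sketch of that accept-time adjustment; which device attribute to
budget against (max_mr here) and the divisor are assumptions:

static void clamp_credits_to_device(struct svcxprt_rdma *newxprt,
				    struct ib_device *dev)
{
	unsigned int per_rpc = 1 + RPCSVC_MAXPAGES /
				   dev->attrs.max_fast_reg_page_list_len;
	unsigned int budget = dev->attrs.max_mr / 4;	/* leave MRs for other users */

	/* halve the credit limit until the frmr reserve fits the budget */
	while (newxprt->sc_max_requests > 1 &&
	       newxprt->sc_max_requests * per_rpc > budget)
		newxprt->sc_max_requests >>= 1;
}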

Alternatively, if the device is not an iWARP device, RDMA Reads
can use physical addressing, and no frmrs are needed.


>> Christoph says he is about to modify or replace the NFS
>> server's RDMA Read code paths, however.
>>
>> I'll drop this one.
>>
>>
>>> On Wed, Feb 3, 2016 at 9:21 PM, Chuck Lever <[email protected]> wrote:
>>>> Want to call it in a context that cannot sleep. So pre-alloc
>>>> the memory and the MRs.
>>>>
>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>> ---
>>>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 44 ++++++++++++++++++++++++++----
>>>> 1 file changed, 38 insertions(+), 6 deletions(-)
>>>>
>>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>> index 5763825..02eee12 100644
>>>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>> @@ -949,7 +949,7 @@ static struct svc_rdma_fastreg_mr *rdma_alloc_frmr(struct svcxprt_rdma *xprt)
>>>> return ERR_PTR(-ENOMEM);
>>>> }
>>>>
>>>> -static void rdma_dealloc_frmr_q(struct svcxprt_rdma *xprt)
>>>> +static void svc_rdma_destroy_frmrs(struct svcxprt_rdma *xprt)
>>>> {
>>>> struct svc_rdma_fastreg_mr *frmr;
>>>>
>>>> @@ -963,6 +963,37 @@ static void rdma_dealloc_frmr_q(struct svcxprt_rdma *xprt)
>>>> }
>>>> }
>>>>
>>>> +static bool svc_rdma_prealloc_frmrs(struct svcxprt_rdma *xprt)
>>>> +{
>>>> + struct ib_device *dev = xprt->sc_cm_id->device;
>>>> + unsigned int i;
>>>> +
>>>> + /* Pre-allocate based on the maximum amount of payload
>>>> + * the server's HCA can handle per RDMA Read, to keep
>>>> + * the number of MRs per connection in check.
>>>> + *
>>>> + * If a client sends small Read chunks (eg. it may be
>>>> + * using physical registration), more RDMA Reads per
>>>> + * NFS WRITE will be needed. svc_rdma_get_frmr() dips
>>>> + * into its reserve in that case. Better would be for
>>>> + * the server to reduce the connection's credit limit.
>>>> + */
>>>> + i = 1 + RPCSVC_MAXPAGES / dev->attrs.max_fast_reg_page_list_len;
>>>> + i *= xprt->sc_max_requests;
>>>> +
>>>> + while (i--) {
>>>> + struct svc_rdma_fastreg_mr *frmr;
>>>> +
>>>> + frmr = rdma_alloc_frmr(xprt);
>>>> + if (!frmr) {
>>>> + dprintk("svcrdma: No memory for request map\n");
>>>> + return false;
>>>> + }
>>>> + list_add(&frmr->frmr_list, &xprt->sc_frmr_q);
>>>> + }
>>>> + return true;
>>>> +}
>>>> +
>>>> struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
>>>> {
>>>> struct svc_rdma_fastreg_mr *frmr = NULL;
>>>> @@ -975,10 +1006,9 @@ struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
>>>> frmr->sg_nents = 0;
>>>> }
>>>> spin_unlock_bh(&rdma->sc_frmr_q_lock);
>>>> - if (frmr)
>>>> - return frmr;
>>>> -
>>>> - return rdma_alloc_frmr(rdma);
>>>> + if (!frmr)
>>>> + return ERR_PTR(-ENOMEM);
>>>> + return frmr;
>>>> }
>>>>
>>>> void svc_rdma_put_frmr(struct svcxprt_rdma *rdma,
>>>> @@ -1149,6 +1179,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>>>> dev->attrs.max_fast_reg_page_list_len;
>>>> newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_FAST_REG;
>>>> newxprt->sc_reader = rdma_read_chunk_frmr;
>>>> + if (!svc_rdma_prealloc_frmrs(newxprt))
>>>> + goto errout;
>>>> }
>>>>
>>>> /*
>>>> @@ -1310,7 +1342,7 @@ static void __svc_rdma_free(struct work_struct *work)
>>>> xprt->xpt_bc_xprt = NULL;
>>>> }
>>>>
>>>> - rdma_dealloc_frmr_q(rdma);
>>>> + svc_rdma_destroy_frmrs(rdma);
>>>> svc_rdma_destroy_ctxts(rdma);
>>>> svc_rdma_destroy_maps(rdma);
>>>>
>>
>> --
>> Chuck Lever
>>
>>
>>
>>

--
Chuck Lever





2016-02-06 15:57:12

by Devesh Sharma

[permalink] [raw]
Subject: Re: [PATCH v1 02/10] svcrdma: Make svc_rdma_get_frmr() not sleep

On Fri, Feb 5, 2016 at 11:43 PM, Chuck Lever <[email protected]> wrote:
>
>> On Feb 5, 2016, at 12:49 PM, Devesh Sharma <[email protected]> wrote:
>>
>> On Fri, Feb 5, 2016 at 9:59 PM, Chuck Lever <[email protected]> wrote:
>>>
>>>> On Feb 5, 2016, at 5:15 AM, Devesh Sharma <[email protected]> wrote:
>>>>
>>>> As I understand it, you are pre-allocating because alloc_mr() can sleep,
>>>> so you separate it out, while the map-frmr-wqe path is by definition a
>>>> non-blocking context. Are we on the same page?
>>>
>>> alloc_mr() can sleep, and so can kmalloc(GFP_KERNEL).
>>> Currently svc_rdma_get_frmr() is called in a worker thread,
>>> so it's in a process context and can sleep, but if we were
>>> to move the RDMA Read logic into a softIRQ context (say,
>>> invoked during Receive processing), it would be a problem
>>> to sleep.
>>
>> Got it.
>>
>>>
>>> Another goal is to get rid of memory allocation operations
>>> that can fail in low memory situations in areas of the code
>>> where an error is hard to handle.
>>
>>>
>>> So the patch description is misleading and incorrect.
>>
>> Got it
>>
>>>
>>> However, the problem with pre-allocating frmrs is that it
>>> is not possible for the server to determine how many will
>>> be needed at accept time. The maximum number depends on the
>>> largest possible ULP payload size, the page list depth of
>>> the server's device, and the maximum chunk size the client
>>> is using (it could be as small as 4KB for some registration
>>> modes). That would mean allocating thousands per connection
>>> up front to cover every contingency.
>>
>> Can't it decide on the fly? Start with 'n' smaller frmrs, where 'n'
>> is derived from some static calculation. Then, if the connection
>> demands a bigger size, increase the depth of all 'n' frmrs and reuse
>> them for the life of the connection. If it demands more frmrs, do
>> nothing and try to manage with 'n'. Does that make sense? I am sure
>> I am missing a few points here.
>
> Currently, all svc_rdma_frmrs are the same. They are allocated
> on-demand then cached after an RDMA Read operation is complete.

Yes, it does the same.

>
>
>>> Perhaps a better solution would be to understand how to
>>> deal with an frmr allocation failure in rdma_read_chunk_frmr().
>>> I think just dropping the connection would be reasonable.
>>> If ib_post_recv() fails, for example, this is exactly what
>>> the server does.
>>
>> Dropping the connection seems logical; however, I think it would be
>> a harsh decision for any hypothetical device that has fewer IB/RoCE
>> resources, e.g. a VM in an SR-IOV enabled environment where many
>> VFs are enabled.
>>
>> As suggested earlier in this mail, continuing the connection with
>> whatever resources it has would let a low-resource device still use
>> it. If the system recovers from memory pressure, the allocations
>> would eventually succeed. The drawback is a performance penalty.
>
> When an frmr allocation fails, the RDMA Read to pull the
> RPC data over can't be done, so there's no recourse but for
> the server to discard that RPC request. NFSv4.x requires a
> connection drop when an RPC is lost. The client will retransmit
> that RPC.
>
> Dropping the connection frees all resources for that connection.
> Clients will typically reconnect pretty quickly when they have
> more work to do. The server will drop the connection repeatedly
> until enough resources are available to handle the client's
> workload.

OK, so effectively the user will see "failed to reconnect" messages in
syslog until the reconnect succeeds. I think if the syslog message
displays a valid reason, then we are covered here.

>
> Eventually we could add a retry mechanism to the RDMA Read
> logic. Wait a bit and try the allocation again. That's probably
> not worth the effort, this really should be very very rare.

Makes sense, I agree with you here.

>
> If a device is ultimately resource-starved, that's a different
> problem. The server may never be able to allocate enough MRs
> even if the client tries again later.

Yes

>
> That kind of issue should be obvious at accept time. The server
> should adjust the connection's credit limit downward to fit the
> device's available resource budget.
>
> Alternatively, if the device is not an iWARP device, RDMA Reads
> can use physical addressing, and no frmrs are needed.
>
>

OK, let's wait for Christoph's changes to the NFS server. Until then,
I agree with your decision on this patch.

>>> Christoph says he is about to modify or replace the NFS
>>> server's RDMA Read code paths, however.
>>>
>>> I'll drop this one.
>>>
>>>
>>>> On Wed, Feb 3, 2016 at 9:21 PM, Chuck Lever <[email protected]> wrote:
>>>>> Want to call it in a context that cannot sleep. So pre-alloc
>>>>> the memory and the MRs.
>>>>>
>>>>> Signed-off-by: Chuck Lever <[email protected]>
>>>>> ---
>>>>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 44 ++++++++++++++++++++++++++----
>>>>> 1 file changed, 38 insertions(+), 6 deletions(-)
>>>>>
>>>>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>>> index 5763825..02eee12 100644
>>>>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>>>>> @@ -949,7 +949,7 @@ static struct svc_rdma_fastreg_mr *rdma_alloc_frmr(struct svcxprt_rdma *xprt)
>>>>> return ERR_PTR(-ENOMEM);
>>>>> }
>>>>>
>>>>> -static void rdma_dealloc_frmr_q(struct svcxprt_rdma *xprt)
>>>>> +static void svc_rdma_destroy_frmrs(struct svcxprt_rdma *xprt)
>>>>> {
>>>>> struct svc_rdma_fastreg_mr *frmr;
>>>>>
>>>>> @@ -963,6 +963,37 @@ static void rdma_dealloc_frmr_q(struct svcxprt_rdma *xprt)
>>>>> }
>>>>> }
>>>>>
>>>>> +static bool svc_rdma_prealloc_frmrs(struct svcxprt_rdma *xprt)
>>>>> +{
>>>>> + struct ib_device *dev = xprt->sc_cm_id->device;
>>>>> + unsigned int i;
>>>>> +
>>>>> + /* Pre-allocate based on the maximum amount of payload
>>>>> + * the server's HCA can handle per RDMA Read, to keep
>>>>> + * the number of MRs per connection in check.
>>>>> + *
>>>>> + * If a client sends small Read chunks (eg. it may be
>>>>> + * using physical registration), more RDMA Reads per
>>>>> + * NFS WRITE will be needed. svc_rdma_get_frmr() dips
>>>>> + * into its reserve in that case. Better would be for
>>>>> + * the server to reduce the connection's credit limit.
>>>>> + */
>>>>> + i = 1 + RPCSVC_MAXPAGES / dev->attrs.max_fast_reg_page_list_len;
>>>>> + i *= xprt->sc_max_requests;
>>>>> +
>>>>> + while (i--) {
>>>>> + struct svc_rdma_fastreg_mr *frmr;
>>>>> +
>>>>> + frmr = rdma_alloc_frmr(xprt);
>>>>> + if (!frmr) {
>>>>> + dprintk("svcrdma: No memory for request map\n");
>>>>> + return false;
>>>>> + }
>>>>> + list_add(&frmr->frmr_list, &xprt->sc_frmr_q);
>>>>> + }
>>>>> + return true;
>>>>> +}
>>>>> +
>>>>> struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
>>>>> {
>>>>> struct svc_rdma_fastreg_mr *frmr = NULL;
>>>>> @@ -975,10 +1006,9 @@ struct svc_rdma_fastreg_mr *svc_rdma_get_frmr(struct svcxprt_rdma *rdma)
>>>>> frmr->sg_nents = 0;
>>>>> }
>>>>> spin_unlock_bh(&rdma->sc_frmr_q_lock);
>>>>> - if (frmr)
>>>>> - return frmr;
>>>>> -
>>>>> - return rdma_alloc_frmr(rdma);
>>>>> + if (!frmr)
>>>>> + return ERR_PTR(-ENOMEM);
>>>>> + return frmr;
>>>>> }
>>>>>
>>>>> void svc_rdma_put_frmr(struct svcxprt_rdma *rdma,
>>>>> @@ -1149,6 +1179,8 @@ static struct svc_xprt *svc_rdma_accept(struct svc_xprt *xprt)
>>>>> dev->attrs.max_fast_reg_page_list_len;
>>>>> newxprt->sc_dev_caps |= SVCRDMA_DEVCAP_FAST_REG;
>>>>> newxprt->sc_reader = rdma_read_chunk_frmr;
>>>>> + if (!svc_rdma_prealloc_frmrs(newxprt))
>>>>> + goto errout;
>>>>> }
>>>>>
>>>>> /*
>>>>> @@ -1310,7 +1342,7 @@ static void __svc_rdma_free(struct work_struct *work)
>>>>> xprt->xpt_bc_xprt = NULL;
>>>>> }
>>>>>
>>>>> - rdma_dealloc_frmr_q(rdma);
>>>>> + svc_rdma_destroy_frmrs(rdma);
>>>>> svc_rdma_destroy_ctxts(rdma);
>>>>> svc_rdma_destroy_maps(rdma);
>>>>>
>>>
>>> --
>>> Chuck Lever
>>>
>>>
>>>
>>>
>
> --
> Chuck Lever
>
>
>
>