2008-05-05 22:06:37

by J. Bruce Fields

Subject: Re: [PATCH 2/17] svcrdma: Fix race with dto_tasklet in svc_rdma_send

On Fri, May 02, 2008 at 11:28:34AM -0500, Tom Tucker wrote:
> The svc_rdma_send function will attempt to reap SQ WR to make room for
> a new request if it finds the SQ full. This function races with the
> dto_tasklet that also reaps SQ WR. To avoid calling the function
> unnecessarily use test_and_clear_bit with the RDMAXPRT_SQ_PENDING
> flag to serialize access to the sq_cq_reap function.

OK. I won't pretend to understand much of this, but--would it be worth
pulling out the added code here into a helper function, since it now
exists in two different places? (Especially if correctness depends on
the same thing happening in both the places this bit can be cleared.)

--b.

>
> Signed-off-by: Tom Tucker <[email protected]>
>
> ---
> net/sunrpc/xprtrdma/svc_rdma_transport.c | 8 ++++++--
> 1 files changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> index 1e0af2f..14f83a1 100644
> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
> @@ -1010,8 +1010,12 @@ int svc_rdma_send(struct svcxprt_rdma *xprt, struct ib_send_wr *wr)
> if (xprt->sc_sq_depth == atomic_read(&xprt->sc_sq_count)) {
> spin_unlock_bh(&xprt->sc_lock);
> atomic_inc(&rdma_stat_sq_starve);
> - /* See if we can reap some SQ WR */
> - sq_cq_reap(xprt);
> +
> + /* See if we can opportunistically reap SQ WR to make room */
> + if (test_and_clear_bit(RDMAXPRT_SQ_PENDING, &xprt->sc_flags)) {
> + ib_req_notify_cq(xprt->sc_sq_cq, IB_CQ_NEXT_COMP);
> + sq_cq_reap(xprt);
> + }
>
> /* Wait until SQ WR available if SQ still full */
> wait_event(xprt->sc_send_wait,


2008-05-06 02:26:15

by Tom Tucker

Subject: Re: [PATCH 2/17] svcrdma: Fix race with dto_tasklet in svc_rdma_send




On 5/5/08 5:06 PM, "J. Bruce Fields" <[email protected]> wrote:

> On Fri, May 02, 2008 at 11:28:34AM -0500, Tom Tucker wrote:
>> The svc_rdma_send function will attempt to reap SQ WR to make room for
>> a new request if it finds the SQ full. This function races with the
>> dto_tasklet that also reaps SQ WR. To avoid calling the function
>> unnecessarily use test_and_clear_bit with the RDMAXPRT_SQ_PENDING
>> flag to serialize access to the sq_cq_reap function.
>
> OK. I won't pretend to understand much of this, but--would it be worth
> pulling out the added code here into a helper function, since it now
> exists in two different places? (Especially if correctness depends on
> the same thing happening in both the places this bit can be cleared.)

Yes. Good suggestions.

BTW, this code is here because the SQ is undersized for big data. A
single NFS_READ/WRITE can result in an attempt to fetch a large amount of
data from the client (2M), and depending on certain HW resources this can
result in a lot of WRs being posted to the SQ.

That said, there is a change coming in the 2.6.27 time frame that supports
what is called fast memory registration (FRMR). This allows the transport to
effectively DMA map the entire transfer size (32k -- 2M) all as a single
SGE. This will take a lot of pressure off the SQ and effectively make this
code unnecessary.

>
> --b.
>
>>
>> Signed-off-by: Tom Tucker <[email protected]>
>>
>> ---
>> net/sunrpc/xprtrdma/svc_rdma_transport.c | 8 ++++++--
>> 1 files changed, 6 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/svc_rdma_transport.c b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> index 1e0af2f..14f83a1 100644
>> --- a/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> +++ b/net/sunrpc/xprtrdma/svc_rdma_transport.c
>> @@ -1010,8 +1010,12 @@ int svc_rdma_send(struct svcxprt_rdma *xprt, struct ib_send_wr *wr)
>> if (xprt->sc_sq_depth == atomic_read(&xprt->sc_sq_count)) {
>> spin_unlock_bh(&xprt->sc_lock);
>> atomic_inc(&rdma_stat_sq_starve);
>> - /* See if we can reap some SQ WR */
>> - sq_cq_reap(xprt);
>> +
>> + /* See if we can opportunistically reap SQ WR to make room */
>> + if (test_and_clear_bit(RDMAXPRT_SQ_PENDING, &xprt->sc_flags)) {
>> + ib_req_notify_cq(xprt->sc_sq_cq, IB_CQ_NEXT_COMP);
>> + sq_cq_reap(xprt);
>> + }
>>
>> /* Wait until SQ WR available if SQ still full */
>> wait_event(xprt->sc_send_wait,
> --
> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html



2008-05-06 21:18:52

by J. Bruce Fields

Subject: Re: [PATCH 2/17] svcrdma: Fix race with dto_tasklet in svc_rdma_send

On Mon, May 05, 2008 at 09:26:02PM -0500, Tom Tucker wrote:
>
>
>
> On 5/5/08 5:06 PM, "J. Bruce Fields" <[email protected]> wrote:
>
> > On Fri, May 02, 2008 at 11:28:34AM -0500, Tom Tucker wrote:
> >> The svc_rdma_send function will attempt to reap SQ WR to make room for
> >> a new request if it finds the SQ full. This function races with the
> >> dto_tasklet that also reaps SQ WR. To avoid calling the function
> >> unnecessarily use test_and_clear_bit with the RDMAXPRT_SQ_PENDING
> >> flag to serialize access to the sq_cq_reap function.
> >
> > OK. I won't pretend to understand much of this, but--would it be worth
> > pulling out the added code here into a helper function, since it now
> > exists in two different places? (Especially if correctness depends on
> > the same thing happening in both the places this bit can be cleared.)
>
> Yes. Good suggestions.
>
> BTW, this code is here because the SQ is undersized for big data. A
> single NFS_READ/WRITE can result in an attempt to fetch a large amount of
> data from the client (2M), and depending on certain HW resources this can
> result in a lot of WRs being posted to the SQ.

WR is write request, SQ is send queue? (In which case this happens on
nfs read operations?) I'm behind....

> That said, there is a change coming in the 2.6.27 time frame that supports
> what is called fast memory registration (FRMR). This allows the transport to
> effectively DMA map the entire transfer size (32k -- 2M) all as a single
> SGE. This will take a lot of pressure off the SQ and effectively make this
> code unnecessary.

OK!

Ideally we'd have patches for 2.6.27 in linux-next for a little while
first, so we should try to have that ready in a month or so, when the
-rc's start looking final.

--b.

2008-05-07 00:45:23

by Tom Tucker

Subject: Re: [PATCH 2/17] svcrdma: Fix race with dto_tasklet in svc_rdma_send




On 5/6/08 4:18 PM, "J. Bruce Fields" <[email protected]> wrote:

> On Mon, May 05, 2008 at 09:26:02PM -0500, Tom Tucker wrote:
>>
>>
>>
>> On 5/5/08 5:06 PM, "J. Bruce Fields" <[email protected]> wrote:
>>
>>> On Fri, May 02, 2008 at 11:28:34AM -0500, Tom Tucker wrote:
>>>> The svc_rdma_send function will attempt to reap SQ WR to make room for
>>>> a new request if it finds the SQ full. This function races with the
>>>> dto_tasklet that also reaps SQ WR. To avoid calling the function
>>>> unnecessarily use test_and_clear_bit with the RDMAXPRT_SQ_PENDING
>>>> flag to serialize access to the sq_cq_reap function.
>>>
>>> OK. I won't pretend to understand much of this, but--would it be worth
>>> pulling out the added code here into a helper function, since it now
>>> exists in two different places? (Especially if correctness depends on
>>> the same thing happening in both the places this bit can be cleared.)
>>
>> Yes. Good suggestions.
>>
>> BTW, this code is here because the SQ is undersized for big data. A
>> single NFS_READ/WRITE can result in an attempt to fetch a large amount of
>> data from the client (2M), and depending on certain HW resources this can
>> result in a lot of WRs being posted to the SQ.
>
> WR is write request, SQ is send queue? (In which case this happens on
> nfs read operations?) I'm behind....
>
>> That said, there is a change coming in the 2.6.27 time frame that supports
>> what is called fast memory registration (FRMR). This allows the transport to
>> effectively DMA map the entire transfer size (32k -- 2M) all as a single
>> SGE. This will take a lot of pressure off the SQ and effectively make this
>> code unnecessary.
>
> OK!
>
> Ideally we'd have patches for 2.6.27 in linux-next for a little while
> first, so we should try to have that ready in a month or so, when the
> -rc's start looking final.

It may be .28 then. A lot of pieces have to come together...

>
> --b.