2008-10-08 16:22:11

by Tom Talpey

[permalink] [raw]
Subject: [PATCH 12/15] RPC/RDMA: correct a 5 second pause on reconnecting to an idle server.

The RPC/RDMA code always performed a reconnect-with-backoff, even
when re-establishing a connection to a server after the RPC layer
closed it due to being idle.
---

net/sunrpc/xprtrdma/transport.c | 5 +++--
net/sunrpc/xprtrdma/verbs.c | 2 +-
2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
index c7d2380..278a544 100644
--- a/net/sunrpc/xprtrdma/transport.c
+++ b/net/sunrpc/xprtrdma/transport.c
@@ -486,8 +486,9 @@ xprt_rdma_connect(struct rpc_task *task)
struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);

if (!xprt_test_and_set_connecting(xprt)) {
- if (r_xprt->rx_ep.rep_connected != 0) {
- /* Reconnect */
+ if (r_xprt->rx_ep.rep_connected &&
+ r_xprt->rx_ep.rep_connected != -EPIPE) {
+ /* Reconnect with backoff */
schedule_delayed_work(&r_xprt->rdma_connect,
xprt->reestablish_timeout);
} else {
diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
index a63d0c0..9ef7e0d 100644
--- a/net/sunrpc/xprtrdma/verbs.c
+++ b/net/sunrpc/xprtrdma/verbs.c
@@ -317,7 +317,7 @@ rpcrdma_conn_upcall(struct rdma_cm_id *id, struct rdma_cm_event *event)
connstate = -ECONNREFUSED;
goto connected;
case RDMA_CM_EVENT_DISCONNECTED:
- connstate = -ECONNABORTED;
+ connstate = -EPIPE;
goto connected;
case RDMA_CM_EVENT_DEVICE_REMOVAL:
connstate = -ENODEV;



2008-10-08 17:35:52

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 12/15] RPC/RDMA: correct a 5 second pause on reconnecting to an idle server.

On Wed, 2008-10-08 at 11:48 -0400, Tom Talpey wrote:
> The RPC/RDMA code always performed a reconnect-with-backoff, even
> when re-establishing a connection to a server after the RPC layer
> closed it due to being idle.
> ---
>
> net/sunrpc/xprtrdma/transport.c | 5 +++--
> net/sunrpc/xprtrdma/verbs.c | 2 +-
> 2 files changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/net/sunrpc/xprtrdma/transport.c b/net/sunrpc/xprtrdma/transport.c
> index c7d2380..278a544 100644
> --- a/net/sunrpc/xprtrdma/transport.c
> +++ b/net/sunrpc/xprtrdma/transport.c
> @@ -486,8 +486,9 @@ xprt_rdma_connect(struct rpc_task *task)
> struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
>
> if (!xprt_test_and_set_connecting(xprt)) {
> - if (r_xprt->rx_ep.rep_connected != 0) {
> - /* Reconnect */
> + if (r_xprt->rx_ep.rep_connected &&
> + r_xprt->rx_ep.rep_connected != -EPIPE) {
> + /* Reconnect with backoff */
> schedule_delayed_work(&r_xprt->rdma_connect,
> xprt->reestablish_timeout);
> } else {
> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
> index a63d0c0..9ef7e0d 100644
> --- a/net/sunrpc/xprtrdma/verbs.c
> +++ b/net/sunrpc/xprtrdma/verbs.c
> @@ -317,7 +317,7 @@ rpcrdma_conn_upcall(struct rdma_cm_id *id, struct rdma_cm_event *event)
> connstate = -ECONNREFUSED;
> goto connected;
> case RDMA_CM_EVENT_DISCONNECTED:
> - connstate = -ECONNABORTED;
> + connstate = -EPIPE;
> goto connected;
> case RDMA_CM_EVENT_DEVICE_REMOVAL:
> connstate = -ENODEV;

Hmm... Why not rather do the same as the socket code: have the
disconnect handler paths that don't require exponential backoff just
reset xprt->reestablish_timeout to 0?

> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html


2008-10-08 17:52:24

by Talpey, Thomas

[permalink] [raw]
Subject: Re: [PATCH 12/15] RPC/RDMA: correct a 5 second pause on reconnecting to an idle server.

At 01:35 PM 10/8/2008, Trond Myklebust wrote:
>On Wed, 2008-10-08 at 11:48 -0400, Tom Talpey wrote:
>> The RPC/RDMA code always performed a reconnect-with-backoff, even
>> when re-establishing a connection to a server after the RPC layer
>> closed it due to being idle.
>> ---
>>
>> net/sunrpc/xprtrdma/transport.c | 5 +++--
>> net/sunrpc/xprtrdma/verbs.c | 2 +-
>> 2 files changed, 4 insertions(+), 3 deletions(-)
>>
>> diff --git a/net/sunrpc/xprtrdma/transport.c
>b/net/sunrpc/xprtrdma/transport.c
>> index c7d2380..278a544 100644
>> --- a/net/sunrpc/xprtrdma/transport.c
>> +++ b/net/sunrpc/xprtrdma/transport.c
>> @@ -486,8 +486,9 @@ xprt_rdma_connect(struct rpc_task *task)
>> struct rpcrdma_xprt *r_xprt = rpcx_to_rdmax(xprt);
>>
>> if (!xprt_test_and_set_connecting(xprt)) {
>> - if (r_xprt->rx_ep.rep_connected != 0) {
>> - /* Reconnect */
>> + if (r_xprt->rx_ep.rep_connected &&
>> + r_xprt->rx_ep.rep_connected != -EPIPE) {
>> + /* Reconnect with backoff */
>> schedule_delayed_work(&r_xprt->rdma_connect,
>> xprt->reestablish_timeout);
>> } else {
>> diff --git a/net/sunrpc/xprtrdma/verbs.c b/net/sunrpc/xprtrdma/verbs.c
>> index a63d0c0..9ef7e0d 100644
>> --- a/net/sunrpc/xprtrdma/verbs.c
>> +++ b/net/sunrpc/xprtrdma/verbs.c
>> @@ -317,7 +317,7 @@ rpcrdma_conn_upcall(struct rdma_cm_id *id,
>struct rdma_cm_event *event)
>> connstate = -ECONNREFUSED;
>> goto connected;
>> case RDMA_CM_EVENT_DISCONNECTED:
>> - connstate = -ECONNABORTED;
>> + connstate = -EPIPE;
>> goto connected;
>> case RDMA_CM_EVENT_DEVICE_REMOVAL:
>> connstate = -ENODEV;
>
>Hmm... Why not rather do the same as the socket code: have the
>disconnect handler paths that don't require exponential backoff just
>reset xprt->reestablish_timeout to 0?

Because we do want a non-zero reestablishment timeout in general, and
the RDMA client has not implemented a connection backoff. So in effect
the value is constant for this code, and I thought treating it as such is
the safer fix.

I'm not 100% convinced the TCP code is correct, btw. It appears to
zero out the reestablish timeout on idle-disconnect, but it's not obvious
to me where it sets it back to a non-zero value. It does try to double
it in xs_connect() though! :-)

I have that issue on my list to look into, of course, but I think it was
out of scope for RDMA.

Tom.

>
>> To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html


2008-10-08 18:04:23

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 12/15] RPC/RDMA: correct a 5 second pause on reconnecting to an idle server.

On Wed, 2008-10-08 at 13:51 -0400, Talpey, Thomas wrote:
> At 01:35 PM 10/8/2008, Trond Myklebust wrote:
> >Hmm... Why not rather do the same as the socket code: have the
> >disconnect handler paths that don't require exponential backoff just
> >reset xprt->reestablish_timeout to 0?
>
> Because we do want a non-zero reestablishment timeout in general, and
> the RDMA client has not implemented a connection backoff. So in effect
> the value is constant for this code, and I thought treating it as such is
> the safer fix.
>
> I'm not 100% convinced the TCP code is correct, btw. It appears to
> zero out the reestablish timeout on idle-disconnect, but it's not obvious
> to me where it sets it back to a non-zero value. It does try to double
> it in xs_connect() though! :-)

The TCP code sets the xprt->reestablish_timeout to a non-zero value
whenever the _server_ closes the connection (i.e. if ever we enter a
SYN_SENT state followed by a reset, a CLOSE_WAIT state or a CLOSING
state.
Why would the RDMA client want to do anything different?



2008-10-08 19:06:34

by Talpey, Thomas

[permalink] [raw]
Subject: Re: [PATCH 12/15] RPC/RDMA: correct a 5 second pause on reconnecting to an idle server.

At 02:04 PM 10/8/2008, Trond Myklebust wrote:
>On Wed, 2008-10-08 at 13:51 -0400, Talpey, Thomas wrote:
>> At 01:35 PM 10/8/2008, Trond Myklebust wrote:
>> >Hmm... Why not rather do the same as the socket code: have the
>> >disconnect handler paths that don't require exponential backoff just
>> >reset xprt->reestablish_timeout to 0?
>>
>> Because we do want a non-zero reestablishment timeout in general, and
>> the RDMA client has not implemented a connection backoff. So in effect
>> the value is constant for this code, and I thought treating it as such is
>> the safer fix.
>>
>> I'm not 100% convinced the TCP code is correct, btw. It appears to
>> zero out the reestablish timeout on idle-disconnect, but it's not obvious
>> to me where it sets it back to a non-zero value. It does try to double
>> it in xs_connect() though! :-)
>
>The TCP code sets the xprt->reestablish_timeout to a non-zero value
>whenever the _server_ closes the connection (i.e. if ever we enter a
>SYN_SENT state followed by a reset, a CLOSE_WAIT state or a CLOSING
>state.

Hmm, I guess. It's driven off the TCP state machine so it doesn't translate
well to the RDMA layer, which doesn't give the same granularity of upcall
event (all we see is success/fail). I think I can get close though.

>Why would the RDMA client want to do anything different?

It wouldn't, but the RDMA layer isn't the same as TCP. For example,
on Infiniband, which is a nearly lossless local medium, there are fewer
reasons to back off. And even over iWARP, the NIC's TCP stack is
handling all the recovery and retry, so it's generally better to not
overthink it.

The constant-backoff approach in the current RDMA client, however,
takes in the issue when the upper layer (nfsd) is involved. So, "going
exponential" isn't a big change, and worthwhile.

New patch on the way.

Tom.