2008-10-23 04:37:04

by Michel Lespinasse

Subject: Fwd: NFS 5-minute hangs upon S3 resume using 2.6.27 client

I sent this out to LKML earlier but was told this would be a better list.
This has already been mentioned in bugzilla, but I'd like to draw attention
to it before it gets too late for 2.6.28 (or is it already too late?).

The following is a common cause of 5-minute NFS hangs here:

* Client has TCP connections to the NFS server, then goes into S3 sleep for a few hours.
* TCP connections die on the server side.
(not 100% sure why; do they use some kind of keepalive?)
* Client resumes from S3.
* Client sends NFS requests down its TCP connections, gets back an RST packet.
* [Client hangs for exactly 300 seconds here]
* Client establishes new TCP connections to the NFS server,
and recovers from the hang.

A tcpdump trace is attached at the end of bugzilla bug 11154:
http://bugzilla.kernel.org/show_bug.cgi?id=11154

Should the client immediately try to reconnect when its existing connection
receives an RST packet? (The 5-minute delay would make sense to me if the
RST were received in reply to a SYN, but I'm not sure about it in the case
of an existing, open TCP connection.)

If the 5-minute delay after an RST is necessary, could the client avoid it
by explicitly closing and reopening its connections using suspend/resume hooks?
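
As a sanity check on the premise that the RST is immediately visible to the
sender, here is a rough, self-contained user-space illustration (this is not
the kernel RPC path, just a toy program; the loopback server and the SO_LINGER
trick are only there to force an RST):

#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
        struct sockaddr_in addr = { .sin_family = AF_INET };
        struct linger lg = { .l_onoff = 1, .l_linger = 0 };
        socklen_t alen = sizeof(addr);
        int lfd, cfd, sfd;

        /* "Server": listen on an ephemeral loopback port. */
        lfd = socket(AF_INET, SOCK_STREAM, 0);
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr));
        listen(lfd, 1);
        getsockname(lfd, (struct sockaddr *)&addr, &alen);

        /* "Client": connect to it. */
        cfd = socket(AF_INET, SOCK_STREAM, 0);
        connect(cfd, (struct sockaddr *)&addr, sizeof(addr));
        sfd = accept(lfd, NULL, NULL);

        /* Abort the connection from the server side: with SO_LINGER set to
         * a zero timeout, close() sends an RST instead of a FIN. */
        setsockopt(sfd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
        close(sfd);

        sleep(1);       /* let the RST reach the client socket */

        /* The first send() after the RST has arrived reports the pending
         * socket error (ECONNRESET); later ones typically return EPIPE. */
        if (send(cfd, "x", 1, MSG_NOSIGNAL) < 0)
                printf("first send after RST:  %s\n", strerror(errno));
        if (send(cfd, "x", 1, MSG_NOSIGNAL) < 0)
                printf("second send after RST: %s\n", strerror(errno));

        close(cfd);
        close(lfd);
        return 0;
}

If the error really does surface that promptly, the RPC client should in
principle be able to notice the dead connection right away instead of sitting
out a reconnect backoff.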

(I cannot work around the issue locally by unmounting/remounting my NFS
shares around suspend/resume, because the rootfs is also on NFS...)

This NFS setup was working fine in 2.6.24. There have been issues with
2.6.25 and 2.6.26, but I have not confirmed whether they are the same bug:
2.6.25 usually recovers after some variable delay and 2.6.26 usually
does not recover. Bugs 11154 and 11061 have more details about this;
Ian Campbell has also been tracking an NFS issue under load that
appeared at around the same time.

Hope this helps,

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.


2008-10-23 15:36:52

by Trond Myklebust

Subject: Re: Fwd: NFS 5-minute hangs upon S3 resume using 2.6.27 client

On Wed, 2008-10-22 at 21:02 -0700, Michel Lespinasse wrote:
> I sent this out to LKML earlier but was told this would be a better list.
> This has already been mentioned in bugzilla, but I'd like to draw attention
> to it before it gets too late for 2.6.28 (or is it already too late?).
>
> The following is a common cause of 5-minute NFS hangs here:
>
> * Client has TCP connections to the NFS server, then goes into S3 sleep for a few hours.
> * TCP connections die on the server side.
> (not 100% sure why; do they use some kind of keepalive?)
> * Client resumes from S3.
> * Client sends NFS requests down its TCP connections, gets back an RST packet.
> * [Client hangs for exactly 300 seconds here]
> * Client establishes new TCP connections to the NFS server,
> and recovers from the hang.
>
> A tcpdump trace is attached at the end of bugzilla bug 11154:
> http://bugzilla.kernel.org/show_bug.cgi?id=11154
>
> Should the client immediately try to reconnect when its existing connection
> receives an RST packet? (The 5-minute delay would make sense to me if the
> RST were received in reply to a SYN, but I'm not sure about it in the case
> of an existing, open TCP connection.)
>
> If the 5-minute delay after an RST is necessary, could the client avoid it
> by explicitly closing and reopening its connections using suspend/resume hooks?
>
> (I cannot work around the issue locally by unmounting/remounting my NFS
> shares around suspend/resume, because the rootfs is also on NFS...)
>
> This NFS setup was working fine in 2.6.24. There have been issues with
> 2.6.25 and 2.6.26, but I have not confirmed whether they are the same bug:
> 2.6.25 usually recovers after some variable delay and 2.6.26 usually
> does not recover. Bugs 11154 and 11061 have more details about this;
> Ian Campbell has also been tracking an NFS issue under load that
> appeared at around the same time.
>
> Hope this helps,

Does the appended patch make a difference?

Cheers
Trond
--------------------------------------------------------------------------------------
From: Trond Myklebust <[email protected]>
Date: Thu, 23 Oct 2008 11:33:59 -0400
SUNRPC: Respond promptly to server TCP resets

If the server sends us an RST error while we're in the TCP_ESTABLISHED
state, then that will not result in a state change, and so the RPC client
ends up hanging forever (see
http://bugzilla.kernel.org/show_bug.cgi?id=11154)

We can intercept the reset by setting up an sk->sk_error_report callback,
which will then allow us to initiate a proper shutdown and retry...

We also make sure that if the send request receives an ECONNRESET, then we
shutdown too...

Signed-off-by: Trond Myklebust <[email protected]>
---

net/sunrpc/xprtsock.c | 58 +++++++++++++++++++++++++++++++++++++++++--------
1 files changed, 48 insertions(+), 10 deletions(-)

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 9a288d5..0a50361 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -249,6 +249,7 @@ struct sock_xprt {
         void (*old_data_ready)(struct sock *, int);
         void (*old_state_change)(struct sock *);
         void (*old_write_space)(struct sock *);
+        void (*old_error_report)(struct sock *);
 };

 /*
@@ -698,8 +699,9 @@ static int xs_tcp_send_request(struct rpc_task *task)
         case -EAGAIN:
                 xs_nospace(task);
                 break;
-        case -ECONNREFUSED:
         case -ECONNRESET:
+                xs_tcp_shutdown(xprt);
+        case -ECONNREFUSED:
         case -ENOTCONN:
         case -EPIPE:
                 status = -ENOTCONN;
@@ -742,6 +744,22 @@ out_release:
         xprt_release_xprt(xprt, task);
 }

+static void xs_save_old_callbacks(struct sock_xprt *transport, struct sock *sk)
+{
+        transport->old_data_ready = sk->sk_data_ready;
+        transport->old_state_change = sk->sk_state_change;
+        transport->old_write_space = sk->sk_write_space;
+        transport->old_error_report = sk->sk_error_report;
+}
+
+static void xs_restore_old_callbacks(struct sock_xprt *transport, struct sock *sk)
+{
+        sk->sk_data_ready = transport->old_data_ready;
+        sk->sk_state_change = transport->old_state_change;
+        sk->sk_write_space = transport->old_write_space;
+        sk->sk_error_report = transport->old_error_report;
+}
+
 /**
  * xs_close - close a socket
  * @xprt: transport
@@ -765,9 +783,8 @@ static void xs_close(struct rpc_xprt *xprt)
         transport->sock = NULL;

         sk->sk_user_data = NULL;
-        sk->sk_data_ready = transport->old_data_ready;
-        sk->sk_state_change = transport->old_state_change;
-        sk->sk_write_space = transport->old_write_space;
+
+        xs_restore_old_callbacks(transport, sk);
         write_unlock_bh(&sk->sk_callback_lock);

         sk->sk_no_check = 0;
@@ -1180,6 +1197,28 @@ static void xs_tcp_state_change(struct sock *sk)
 }

 /**
+ * xs_tcp_error_report - callback mainly for catching RST events
+ * @sk: socket
+ */
+static void xs_tcp_error_report(struct sock *sk)
+{
+        struct rpc_xprt *xprt;
+
+        read_lock(&sk->sk_callback_lock);
+        if (sk->sk_err != ECONNRESET || sk->sk_state != TCP_ESTABLISHED)
+                goto out;
+        if (!(xprt = xprt_from_sock(sk)))
+                goto out;
+        dprintk("RPC: %s client %p...\n"
+                "RPC: error %d\n",
+                __func__, xprt, sk->sk_err);
+
+        xprt_force_disconnect(xprt);
+out:
+        read_unlock(&sk->sk_callback_lock);
+}
+
+/**
  * xs_udp_write_space - callback invoked when socket buffer space
  * becomes available
  * @sk: socket whose state has changed
@@ -1454,10 +1493,9 @@ static void xs_udp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)

         write_lock_bh(&sk->sk_callback_lock);

+        xs_save_old_callbacks(transport, sk);
+
         sk->sk_user_data = xprt;
-        transport->old_data_ready = sk->sk_data_ready;
-        transport->old_state_change = sk->sk_state_change;
-        transport->old_write_space = sk->sk_write_space;
         sk->sk_data_ready = xs_udp_data_ready;
         sk->sk_write_space = xs_udp_write_space;
         sk->sk_no_check = UDP_CSUM_NORCV;
@@ -1589,13 +1627,13 @@ static int xs_tcp_finish_connecting(struct rpc_xprt *xprt, struct socket *sock)

         write_lock_bh(&sk->sk_callback_lock);

+        xs_save_old_callbacks(transport, sk);
+
         sk->sk_user_data = xprt;
-        transport->old_data_ready = sk->sk_data_ready;
-        transport->old_state_change = sk->sk_state_change;
-        transport->old_write_space = sk->sk_write_space;
         sk->sk_data_ready = xs_tcp_data_ready;
         sk->sk_state_change = xs_tcp_state_change;
         sk->sk_write_space = xs_tcp_write_space;
+        sk->sk_error_report = xs_tcp_error_report;
         sk->sk_allocation = GFP_ATOMIC;

         /* socket options */



2008-10-23 19:52:33

by Michel Lespinasse

Subject: Re: Fwd: NFS 5-minute hangs upon S3 resume using 2.6.27 client

Hi,

On Thu, Oct 23, 2008 at 11:36:47AM -0400, Trond Myklebust wrote:
> Does the appended patch make a difference?
>
> From: Trond Myklebust <[email protected]>
> Date: Thu, 23 Oct 2008 11:33:59 -0400
> SUNRPC: Respond promptly to server TCP resets

I applied it over a 2.6.27.3 base, suspended the client for 40 minutes
and resumed it, logging what happens from the server side. The resume
went like this:

12:38:53.692785 IP client.329262748 > server.nfs: 100 getattr [|nfs]
12:38:53.699885 arp who-has client tell server
12:38:54.123793 IP client.329262748 > server.nfs: 100 getattr [|nfs]
12:38:54.695888 arp who-has client tell server
12:38:54.696011 arp reply client is-at 00:19:d1:54:0e:39 (oui Unknown)
12:38:54.696020 IP server.nfs > client.882: R 2944642919:2944642919(0) win 0
12:38:54.696024 IP server.nfs > client.882: R 2944642919:2944642919(0) win 0

(I'm still concerned about the 3 second delay here...)

12:38:57.695956 IP client.2 > server.nfs: 0 null
12:38:57.696004 IP server.nfs > client.2: reply ERR 0 null
12:38:57.696133 IP client.882 > server.nfs: . ack 931551254 win 183 <nop,nop,timestamp 448858 1116460165>
12:38:57.696159 IP client.362817180 > server.nfs: 100 getattr [|nfs]
12:38:57.696179 IP server.nfs > client.882: . ack 64732 win 91 <nop,nop,timestamp 1116460165 448858>
12:38:57.696189 IP client.346039964 > server.nfs: 100 getattr [|nfs]
12:38:57.696205 IP server.nfs > client.882: . ack 64832 win 91 <nop,nop,timestamp 1116460165 448858>
12:38:57.696294 IP client.329262748 > server.nfs: 300 getattr [|nfs]
12:38:57.696307 IP server.nfs > client.882: . ack 65132 win 108 <nop,nop,timestamp 1116460165 448858>
12:38:57.765764 IP server.nfs > client.362817180: reply ok 116 getattr [|nfs]
12:38:57.765833 IP server.nfs > client.346039964: reply ok 116 getattr [|nfs]
12:38:57.765879 IP client.882 > server.nfs: . ack 931551370 win 183 <nop,nop,timestamp 448875 1116460182>
12:38:57.765903 IP server.nfs > client.329262748: reply ok 116 getattr [|nfs]
12:38:57.765909 IP client.882 > server.nfs: . ack 931551486 win 183 <nop,nop,timestamp 448875 1116460182>
12:38:57.765933 IP server.nfs > client.379594396: reply ok 116
12:38:57.765953 IP client.882 > server.nfs: . ack 931551602 win 183 <nop,nop,timestamp 448875 1116460182>
12:38:57.766008 IP client.882 > server.nfs: . ack 931551718 win 183 <nop,nop,timestamp 448875 1116460182>
12:38:57.766058 IP server.nfs > client.396371612: reply ok 116
12:38:57.766108 IP client.413148828 > server.nfs: 100 getattr [|nfs]
12:38:57.766149 IP server.nfs > client.882: . ack 65232 win 108 <nop,nop,timestamp 1116460182 448875>
12:38:57.766176 IP server.nfs > client.413148828: reply ok 116 getattr [|nfs]
12:38:57.766274 IP client.882 > server.nfs: . ack 931551950 win 183 <nop,nop,timestamp 448875 1116460182>
12:38:57.769130 IP client.429926044 > server.nfs: 100 getattr [|nfs]
12:38:57.769188 IP server.nfs > client.429926044: reply ok 116 getattr [|nfs]
12:38:57.769418 IP client.446703260 > server.nfs: 100 getattr [|nfs]
12:38:57.769473 IP server.nfs > client.446703260: reply ok 116 getattr [|nfs]
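
For reference, my reading of the backoff constants in net/sunrpc/xprtsock.c
(worth double-checking, I may well be misreading the defines) is that the TCP
reconnect delay starts at XS_TCP_INIT_REEST_TO (3 seconds) and doubles on each
attempt up to XS_TCP_MAX_REEST_TO (5 minutes), which would line up with both
the 3 second delay above and the original 300 second hang. A trivial sketch of
that progression:

#include <stdio.h>

int main(void)
{
        /* Assumed values, in seconds: XS_TCP_INIT_REEST_TO = 3,
         * XS_TCP_MAX_REEST_TO = 5 * 60 = 300. */
        unsigned long timeout = 3;
        const unsigned long max = 300;
        int attempt;

        for (attempt = 1; attempt <= 8; attempt++) {
                printf("reconnect attempt %d delayed by %lus\n",
                       attempt, timeout);
                timeout <<= 1;          /* same doubling as xs_connect() */
                if (timeout > max)
                        timeout = max;
        }
        return 0;
}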

I'll have more results tonight when I try resuming again. If the delay
goes to 6 seconds then the issue is probably still there... I'll report
here either way.

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.

2008-10-23 23:18:04

by Trond Myklebust

Subject: Re: Fwd: NFS 5-minute hangs upon S3 resume using 2.6.27 client

On Thu, 2008-10-23 at 12:52 -0700, Michel Lespinasse wrote:
> Hi,
>
> On Thu, Oct 23, 2008 at 11:36:47AM -0400, Trond Myklebust wrote:
> > Does the appended patch make a difference?
> >
> > From: Trond Myklebust <[email protected]>
> > Date: Thu, 23 Oct 2008 11:33:59 -0400
> > SUNRPC: Respond promptly to server TCP resets
>
> I applied it over a 2.6.27.3 base, suspended the client for 40 minutes
> and resumed it, logging what happens from the server side. The resume
> went like this:
>
> 12:38:53.692785 IP client.329262748 > server.nfs: 100 getattr [|nfs]
> 12:38:53.699885 arp who-has client tell server
> 12:38:54.123793 IP client.329262748 > server.nfs: 100 getattr [|nfs]
> 12:38:54.695888 arp who-has client tell server
> 12:38:54.696011 arp reply client is-at 00:19:d1:54:0e:39 (oui Unknown)
> 12:38:54.696020 IP server.nfs > client.882: R 2944642919:2944642919(0) win 0
> 12:38:54.696024 IP server.nfs > client.882: R 2944642919:2944642919(0) win 0
>
> (I'm still concerned about the 3 second delay here...)
>
> 12:38:57.695956 IP client.2 > server.nfs: 0 null

Does this patch fix that delay?

Cheers
Trond

-------------------------------------------------------------------------------------
From: Trond Myklebust <[email protected]>
Date: Thu, 23 Oct 2008 19:14:55 -0400
SUNRPC: Fix the setting of xprt->reestablish_timeout when reconnecting

If the server aborts an established connection, then we should retry
connecting immediately. Since xprt->reestablish_timeout is not reset unless
we go through the TCP_FIN_WAIT1 state, we may end up waiting out the old
backoff delay before reconnecting.
The fix is to reset xprt->reestablish_timeout in TCP_ESTABLISHED, and then
rely on the fact that we set it to non-zero values in all other cases when
the server closes the connection.

Also fix a race between xs_connect() and xs_tcp_state_change(). The
value of xprt->reestablish_timeout should be updated before we actually
attempt the connection.

Signed-off-by: Trond Myklebust <[email protected]>
---

net/sunrpc/xprtsock.c | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 0a50361..ac2aa52 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -1144,6 +1144,8 @@ static void xs_tcp_state_change(struct sock *sk)
                         struct sock_xprt *transport = container_of(xprt,
                                         struct sock_xprt, xprt);

+                        xprt->reestablish_timeout = 0;
+
                         /* Reset TCP record info */
                         transport->tcp_offset = 0;
                         transport->tcp_reclen = 0;
@@ -1158,7 +1160,6 @@ static void xs_tcp_state_change(struct sock *sk)
         case TCP_FIN_WAIT1:
                 /* The client initiated a shutdown of the socket */
                 xprt->connect_cookie++;
-                xprt->reestablish_timeout = 0;
                 set_bit(XPRT_CLOSING, &xprt->state);
                 smp_mb__before_clear_bit();
                 clear_bit(XPRT_CONNECTED, &xprt->state);
@@ -1793,6 +1794,7 @@ static void xs_connect(struct rpc_task *task)
 {
         struct rpc_xprt *xprt = task->tk_xprt;
         struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
+        unsigned long timeout;

         if (xprt_test_and_set_connecting(xprt))
                 return;
@@ -1801,12 +1803,12 @@ static void xs_connect(struct rpc_task *task)
                 dprintk("RPC: xs_connect delayed xprt %p for %lu "
                                 "seconds\n",
                                 xprt, xprt->reestablish_timeout / HZ);
-                queue_delayed_work(rpciod_workqueue,
-                                &transport->connect_worker,
-                                xprt->reestablish_timeout);
+                timeout = xprt->reestablish_timeout;
                 xprt->reestablish_timeout <<= 1;
                 if (xprt->reestablish_timeout > XS_TCP_MAX_REEST_TO)
                         xprt->reestablish_timeout = XS_TCP_MAX_REEST_TO;
+                queue_delayed_work(rpciod_workqueue,
+                                &transport->connect_worker, timeout);
         } else {
                 dprintk("RPC: xs_connect scheduled xprt %p\n", xprt);
                 queue_delayed_work(rpciod_workqueue,



2008-10-24 06:58:03

by Michel Lespinasse

Subject: Re: Fwd: NFS 5-minute hangs upon S3 resume using 2.6.27 client

On Thu, Oct 23, 2008 at 07:17:59PM -0400, Trond Myklebust wrote:
> > (I'm still concerned about the 3 second delay here...)
> Does this patch fix that delay?
>
> SUNRPC: Fix the setting of xprt->reestablish_timeout when reconnecting

I applied this on top of the previous patch and it worked, but now I'm not
sure whether you wanted me to test it as an independent patch?

I'm wondering how the code in xs_tcp_state_change() that sets
reestablish_timeout back to XS_TCP_INIT_REEST_TO managed not to cause trouble.


Can I propose a patch too? Mine looks quite similar to your second patch,
but hopefully with the reestablish_timeout logic simplified...



SUNRPC: Fix the setting of xprt->reestablish_timeout when reconnecting

The proposed logic works as follows:

* The reestablish_timeout is initially set to zero (no delay before the
initial connection). It is also reset to zero upon establishing a
successful TCP connection (no delay the next time we try to reopen the
connection, whether it got closed by the client or by the server side).
This fixes the issue where we would wait 5 minutes after receiving an RST
from the server rather than retrying immediately (bug 11154).

* xs_connect() delays the (re)connection by reestablish_timeout jiffies.
reestablish_timeout for the *next* connection attempt is also
updated to a nonzero value at that point (a constant in the UDP case,
exponential backoff in the TCP case). There used to be an issue where
we would apply the exponential backoff even in the UDP case.

Signed-off-by: Michel Lespinasse <[email protected]>
---

--- net/sunrpc/xprtsock.c.orig 2008-10-23 10:43:31.000000000 -0700
+++ net/sunrpc/xprtsock.c 2008-10-23 22:10:50.000000000 -0700
@@ -1134,6 +1134,7 @@
                         transport->tcp_flags =
                                 TCP_RCV_COPY_FRAGHDR | TCP_RCV_COPY_XID;

+                        xprt->reestablish_timeout = 0;
                         xprt_wake_pending_tasks(xprt, 0);
                 }
                 spin_unlock_bh(&xprt->transport_lock);
@@ -1141,7 +1142,6 @@
         case TCP_FIN_WAIT1:
                 /* The client initiated a shutdown of the socket */
                 xprt->connect_cookie++;
-                xprt->reestablish_timeout = 0;
                 set_bit(XPRT_CLOSING, &xprt->state);
                 smp_mb__before_clear_bit();
                 clear_bit(XPRT_CONNECTED, &xprt->state);
@@ -1154,13 +1154,6 @@
                 xprt_force_disconnect(xprt);
         case TCP_SYN_SENT:
                 xprt->connect_cookie++;
-        case TCP_CLOSING:
-                /*
-                 * If the server closed down the connection, make sure that
-                 * we back off before reconnecting
-                 */
-                if (xprt->reestablish_timeout < XS_TCP_INIT_REEST_TO)
-                        xprt->reestablish_timeout = XS_TCP_INIT_REEST_TO;
                 break;
         case TCP_LAST_ACK:
                 smp_mb__before_clear_bit();
@@ -1755,25 +1748,25 @@
 {
         struct rpc_xprt *xprt = task->tk_xprt;
         struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
+        unsigned long timeout;

         if (xprt_test_and_set_connecting(xprt))
                 return;

-        if (transport->sock != NULL) {
-                dprintk("RPC: xs_connect delayed xprt %p for %lu "
-                                "seconds\n",
-                                xprt, xprt->reestablish_timeout / HZ);
-                queue_delayed_work(rpciod_workqueue,
-                                &transport->connect_worker,
-                                xprt->reestablish_timeout);
+        timeout = xprt->reestablish_timeout;
+        if (xprt->prot != IPPROTO_TCP)
+                xprt->reestablish_timeout = XS_UDP_REEST_TO;
+        else if (xprt->reestablish_timeout == 0)
+                xprt->reestablish_timeout = XS_TCP_INIT_REEST_TO;
+        else {
                 xprt->reestablish_timeout <<= 1;
                 if (xprt->reestablish_timeout > XS_TCP_MAX_REEST_TO)
                         xprt->reestablish_timeout = XS_TCP_MAX_REEST_TO;
-        } else {
-                dprintk("RPC: xs_connect scheduled xprt %p\n", xprt);
-                queue_delayed_work(rpciod_workqueue,
-                                &transport->connect_worker, 0);
         }
+        dprintk("RPC: xs_connect delayed xprt %p for %lu seconds\n",
+                        xprt, timeout / HZ);
+        queue_delayed_work(rpciod_workqueue, &transport->connect_worker,
+                        timeout);
 }

 static void xs_tcp_connect(struct rpc_task *task)
@@ -1935,7 +1928,7 @@

         xprt->bind_timeout = XS_BIND_TO;
         xprt->connect_timeout = XS_UDP_CONN_TO;
-        xprt->reestablish_timeout = XS_UDP_REEST_TO;
+        xprt->reestablish_timeout = 0;
         xprt->idle_timeout = XS_IDLE_DISC_TO;

         xprt->ops = &xs_udp_ops;
@@ -2003,7 +1996,7 @@

         xprt->bind_timeout = XS_BIND_TO;
         xprt->connect_timeout = XS_TCP_CONN_TO;
-        xprt->reestablish_timeout = XS_TCP_INIT_REEST_TO;
+        xprt->reestablish_timeout = 0;
         xprt->idle_timeout = XS_IDLE_DISC_TO;

         xprt->ops = &xs_tcp_ops;


Hope this helps,

--
Michel "Walken" Lespinasse
A program is never fully debugged until the last user dies.