2002-11-12 23:09:11

by Chuck Lever

Subject: [PATCH] new timeout behavior for RPC requests on TCP sockets

make RPC timeout behavior over TCP sockets behave more like reference
client implementations. reference behavior is to transmit the same
request three times at 60 second intervals; if there is no response, close
and reestablish the socket connection. we modify the Linux RPC client as
follows:

+ after a minor retransmit timeout, use the same timeout value when
retrying on a TCP socket rather than doubling the value
+ after a major retransmit timeout, close the socket and attempt
to reestablish a fresh TCP connection

note that today mount uses a 6 second timeout with 5 retries for NFS over
TCP by default; proper default behavior is 2 retries each with 60 second
timeouts. a separate patch for mount is pending.

against 2.5.47.


diff -ruN 10-connect3/include/linux/sunrpc/xprt.h 11-timeout/include/linux/sunrpc/xprt.h
--- 10-connect3/include/linux/sunrpc/xprt.h Tue Nov 12 16:18:57 2002
+++ 11-timeout/include/linux/sunrpc/xprt.h Tue Nov 12 17:06:16 2002
@@ -182,9 +182,10 @@
void xprt_reserve(struct rpc_task *);
void xprt_transmit(struct rpc_task *);
void xprt_receive(struct rpc_task *);
-int xprt_adjust_timeout(struct rpc_timeout *);
+void xprt_adjust_timeout(struct rpc_timeout *);
void xprt_release(struct rpc_task *);
void xprt_connect(struct rpc_task *);
+void xprt_disconnect(struct rpc_xprt *);
int xprt_clear_backlog(struct rpc_xprt *);
void xprt_sock_setbufsize(struct rpc_xprt *);

diff -ruN 10-connect3/net/sunrpc/clnt.c 11-timeout/net/sunrpc/clnt.c
--- 10-connect3/net/sunrpc/clnt.c Tue Nov 12 16:19:25 2002
+++ 11-timeout/net/sunrpc/clnt.c Tue Nov 12 17:13:24 2002
@@ -667,6 +667,9 @@
if (clnt->cl_autobind)
clnt->cl_port = 0;
task->tk_action = call_bind;
+ /* A disconnect can happen after only part of an RPC was
+ * sent on a TCP socket. send all of this request again */
+ req->rq_bytes_sent = 0;
break;
case -EAGAIN:
task->tk_action = call_transmit;
@@ -688,20 +691,34 @@
* 6a. Handle RPC timeout
* We do not release the request slot, so we keep using the
* same XID for all retransmits.
+ * For stream transports, shut down the transport socket when
+ * a request sees a major time out. When any request on this
+ * connection is retried, the FSM notices the socket has been
+ * shut down, and attempts to reconnect.
*/
static void
call_timeout(struct rpc_task *task)
{
struct rpc_clnt *clnt = task->tk_client;
- struct rpc_timeout *to = &task->tk_rqstp->rq_timeout;
+ struct rpc_xprt *xprt = clnt->cl_xprt;
+ struct rpc_rqst *req = task->tk_rqstp;
+ struct rpc_timeout *to = &req->rq_timeout;

- if (xprt_adjust_timeout(to)) {
- dprintk("RPC: %4d call_timeout (minor)\n", task->tk_pid);
+ if (!xprt->stream)
+ xprt_adjust_timeout(to);
+
+ if (to->to_retries--) {
+ dprintk("RPC: %4d call_timeout (minor, retries=%d)\n",
+ task->tk_pid, to->to_retries);
goto retry;
}
- to->to_retries = clnt->cl_timeout.to_retries;
+ to->to_retries = xprt->timeout.to_retries;

dprintk("RPC: %4d call_timeout (major)\n", task->tk_pid);
+
+ if (xprt->stream)
+ xprt_disconnect(xprt);
+
if (clnt->cl_softrtry) {
if (clnt->cl_chatty && !task->tk_exit)
printk(KERN_NOTICE "%s: server %s not responding, timed out\n",
@@ -710,15 +727,18 @@
return;
}

- if (clnt->cl_chatty && !(task->tk_flags & RPC_CALL_MAJORSEEN) && rpc_ntimeo(&clnt->cl_rtt) > 7) {
- task->tk_flags |= RPC_CALL_MAJORSEEN;
- printk(KERN_NOTICE "%s: server %s not responding, still trying\n",
- clnt->cl_protname, clnt->cl_server);
+ if (clnt->cl_chatty && !(task->tk_flags & RPC_CALL_MAJORSEEN)) {
+ if (xprt->stream || (rpc_ntimeo(&clnt->cl_rtt) > 7)) {
+ task->tk_flags |= RPC_CALL_MAJORSEEN;
+ printk(KERN_NOTICE "%s: server %s not responding, still trying\n",
+ clnt->cl_protname, clnt->cl_server);
+ }
}
if (clnt->cl_autobind)
clnt->cl_port = 0;

retry:
+ req->rq_bytes_sent = 0; /* send all of this request again */
clnt->cl_stats->rpcretrans++;
task->tk_action = call_bind;
task->tk_status = 0;
diff -ruN 10-connect3/net/sunrpc/xprt.c 11-timeout/net/sunrpc/xprt.c
--- 10-connect3/net/sunrpc/xprt.c Tue Nov 12 16:31:08 2002
+++ 11-timeout/net/sunrpc/xprt.c Tue Nov 12 17:06:16 2002
@@ -85,7 +85,6 @@
static void xprt_request_init(struct rpc_task *, struct rpc_xprt *);
static void do_xprt_transmit(struct rpc_task *);
static inline void do_xprt_reserve(struct rpc_task *);
-static void xprt_disconnect(struct rpc_xprt *);
static void xprt_conn_status(struct rpc_task *task);
static struct rpc_xprt * xprt_setup(int proto, struct sockaddr_in *ap,
struct rpc_timeout *to);
@@ -336,7 +335,7 @@
/*
* Adjust timeout values etc for next retransmit
*/
-int
+void
xprt_adjust_timeout(struct rpc_timeout *to)
{
if (to->to_retries > 0) {
@@ -362,7 +361,7 @@
}
pprintk("RPC: %lu %s\n", jiffies,
to->to_retries? "retrans" : "timeout");
- return to->to_retries-- > 0;
+ return;
}

/*
@@ -394,7 +393,7 @@
/*
* Mark a transport as disconnected
*/
-static void
+void
xprt_disconnect(struct rpc_xprt *xprt)
{
dprintk("RPC: disconnected transport %p\n", xprt);


2002-11-12 23:20:35

by Dan Kegel

Subject: re: [PATCH] new timeout behavior for RPC requests on TCP sockets

Chuck wrote:
> make RPC timeout behavior over TCP sockets behave more like reference
> client implementations. reference behavior is to transmit the same
> request three times at 60 second intervals; if there is no response, close
> and reestablish the socket connection. we modify the Linux RPC client as
> follows:
>
> + after a minor retransmit timeout, use the same timeout value when
> retrying on a TCP socket rather than doubling the value
> + after a major retransmit timeout, close the socket and attempt
> to reestablish a fresh TCP connection
>
> note that today mount uses a 6 second timeout with 5 retries for NFS over
> TCP by default; proper default behavior is 2 retries each with 60 second
> timeouts. a separate patch for mount is pending.

Chuck, can you briefly explain why RPC does any minor
retransmits at all over TCP?
Shouldn't TCP's natural retransmit take care of that?
- Dan

2002-11-13 15:51:57

by Chuck Lever

Subject: re: [PATCH] new timeout behavior for RPC requests on TCP sockets

On Tue, 12 Nov 2002, Dan Kegel wrote:

> Chuck wrote:
> > make RPC timeout behavior over TCP sockets behave more like reference
> > client implementations. reference behavior is to transmit the same
> > request three times at 60 second intervals; if there is no response, close
> > and reestablish the socket connection. we modify the Linux RPC client as
> > follows:
> >
> > + after a minor retransmit timeout, use the same timeout value when
> > retrying on a TCP socket rather than doubling the value
> > + after a major retransmit timeout, close the socket and attempt
> > to reestablish a fresh TCP connection
> >
> > note that today mount uses a 6 second timeout with 5 retries for NFS over
> > TCP by default; proper default behavior is 2 retries each with 60 second
> > timeouts. a separate patch for mount is pending.
>
> Chuck, can you briefly explain why RPC does any minor
> retransmits at all over TCP?
> Shouldn't TCP's natural retransmit take care of that?

the socket layer guarantees delivery only to the RPC server application...
if the application itself chooses to drop the request, an RPC retransmit
is still required.
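
here's a rough user-space sketch of the idea (illustrative only, not
the kernel RPC client; the function and names below are made up). TCP
gets the bytes as far as the server's socket, but if the server
application reads the request and throws it away, the only recovery is
an application-level timeout followed by retransmitting the same
request:

#include <poll.h>
#include <unistd.h>

/* send one marshalled request and wait for a reply, retransmitting
 * the same buffer (same XID) whenever the reply timer expires */
static ssize_t call_with_retry(int sock, const void *req, size_t reqlen,
                               void *reply, size_t replen,
                               int timeout_ms, int retries)
{
        struct pollfd pfd = { .fd = sock, .events = POLLIN };

        do {
                if (write(sock, req, reqlen) != (ssize_t)reqlen)
                        return -1;              /* send error */

                switch (poll(&pfd, 1, timeout_ms)) {
                case 1:
                        return read(sock, reply, replen); /* got a reply */
                case 0:
                        break;                  /* minor timeout: retransmit */
                default:
                        return -1;              /* poll error */
                }
        } while (retries-- > 0);

        return -1;                              /* major timeout */
}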

- Chuck Lever
--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>

2002-11-13 16:35:08

by Richard B. Johnson

Subject: re: [PATCH] new timeout behavior for RPC requests on TCP sockets

On Wed, 13 Nov 2002, Chuck Lever wrote:

> On Tue, 12 Nov 2002, Dan Kegel wrote:
>
> > Chuck wrote:
> > > make RPC timeout behavior over TCP sockets behave more like reference
> > > client implementations. reference behavior is to transmit the same
> > > request three times at 60 second intervals; if there is no response, close
> > > and reestablish the socket connection. we modify the Linux RPC client as
> > > follows:
> > >
> > > + after a minor retransmit timeout, use the same timeout value when
> > > retrying on a TCP socket rather than doubling the value
> > > + after a major retransmit timeout, close the socket and attempt
> > > to reestablish a fresh TCP connection
> > >
> > > note that today mount uses a 6 second timeout with 5 retries for NFS over
> > > TCP by default; proper default behavior is 2 retries each with 60 second
> > > timeouts. a separate patch for mount is pending.
> >
> > Chuck, can you briefly explain why RPC does any minor
> > retransmits at all over TCP?
> > Shouldn't TCP's natural retransmit take care of that?
>
> the socket layer guarantees delivery only to the RPC server application...
> if the application itself chooses to drop the request, an RPC retransmit
> is still required.
>
> - Chuck Lever
> --

If the application "chooses to drop the request", the kernel is not
required to fix that application. The RPC cannot retransmit if
it has been shut-down or disconnected, which is about the only
way the application could "choose to drop the request". So something
doesn't smell right here.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Bush : The Fourth Reich of America


2002-11-13 16:42:55

by Trond Myklebust

Subject: Re: [PATCH] new timeout behavior for RPC requests on TCP sockets

>>>>> " " == Richard B Johnson <[email protected]> writes:

> If the application "chooses to drop the request", the kernel is
> not required to fix that application. The RPC cannot retransmit
> if it has been shut-down or disconnected, which is about the
> only way the application could "choose to drop the request". So
> something doesn't smell right here.

An NFS server is perfectly free to drop an RPC request if it doesn't
have the necessary free resources to service it (i.e. if it is out of
memory). If the client doesn't time out + retry, you lose data. Not a
good idea...

Cheers,
Trond

2002-11-13 17:10:18

by Alan

Subject: re: [PATCH] new timeout behavior for RPC requests on TCP sockets

On Wed, 2002-11-13 at 16:44, Richard B. Johnson wrote:
> If the application "chooses to drop the request", the kernel is not
> required to fix that application. The RPC cannot retransmit if
> it has been shut-down or disconnected, which is about the only
> way the application could "choose to drop the request". So something
> doesn't smell right here.

Check your socks...

As far as RPC goes the RPC server can choose to drop a request whenever
it pleases by simply throwing it away (eg reading it from the socket and
binning it) depending on its workload. There are actually reasons for
that in some situations (eg if the top requests are all for a volume
that is down it's better to throw them away so you can get requests for a
volume that is functional).

Alan

2002-11-13 18:24:06

by Richard B. Johnson

Subject: re: [PATCH] new timeout behavior for RPC requests on TCP sockets

On 13 Nov 2002, Alan Cox wrote:

> On Wed, 2002-11-13 at 16:44, Richard B. Johnson wrote:
> > If the application "chooses to drop the request", the kernel is not
> > required to fix that application. The RPC cannot retransmit if
> > it has been shut-down or disconnected, which is about the only
> > way the application could "choose to drop the request". So something
> > doesn't smell right here.
>
> Check your socks...
>
> As far as RPC goes the RPC server can choose to drop a request whenever
> it pleases by simply throwing it away (eg reading it from the socket and
> binning it) depending on its workload. There are actually reasons for
> that in some situations (eg if the top requests are all for a volume
> that is down its better to throw them away so you can get requests for a
> volume that is functional)
>
> Alan

Yes! But the client is perfectly free to request it again and
should (must) do so to keep any mounted volumes intact. This doesn't
affect the internal TCP/IP stack (or it shouldn't). Since the whole
NFS thing is "stateless", the client just issues another request.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Bush : The Fourth Reich of America


2002-11-13 18:28:44

by Richard B. Johnson

Subject: Re: [PATCH] new timeout behavior for RPC requests on TCP sockets

On 13 Nov 2002, Trond Myklebust wrote:

> >>>>> " " == Richard B Johnson <[email protected]> writes:
>
> > If the application "chooses to drop the request", the kernel is
> > not required to fix that application. The RPC cannot retransmit
> > if it has been shut-down or disconnected, which is about the
> > only way the application could "choose to drop the request". So
> > something doesn't smell right here.
>
> An NFS server is perfectly free to drop an RPC request if it doesn't
> have the necessary free resources to service it (i.e. if it is out of
> memory). If the client doesn't time out + retry, you lose data. Not a
> good idea...
>
> Cheers,
> Trond

The Client is the guy that just retries, as you say from a time-out.
This shouldn't affect any internal TCP/IP code. The time-out is
at the application (client) level. It sent a request, the data
was sent or promised to be sent because the write() or send() didn't
block, now it expects to get the data it asked for. It waits, nothing
happens. It times-out and sends the exact same request again.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Bush : The Fourth Reich of America


2002-11-14 17:38:10

by Trond Myklebust

Subject: Re: [PATCH] new timeout behavior for RPC requests on TCP sockets

>>>>> " " == Richard B Johnson <[email protected]> writes:

> The Client is the guy that just retries, as you say from a
> time-out. This shouldn't affect any internal TCP/IP code. The
> time-out is at the application (client) level. It sent a
> request, the data was sent or promised to be sent because the
> write() or send() didn't block, now it expects to get the data
> it asked for. It waits, nothing happens. It times-out and sends
> the exact same request again.

Huh??? There's no 'application level' involved here at all, nor any
'internal TCP/IP code'.

Chuck's patch touches the way the kernel Sun RPC client code (as used
exclusively by the kernel NFS client and the kernel NLM client)
handles the generic case of message timeout + resend. Why would we
want to even consider pushing that sort of thing down into the NFS
code itself?

Cheers,
Trond

2002-11-14 18:27:26

by Richard B. Johnson

Subject: Re: [PATCH] new timeout behavior for RPC requests on TCP sockets

On 14 Nov 2002, Trond Myklebust wrote:

> >>>>> " " == Richard B Johnson <[email protected]> writes:
>
> > The Client is the guy that just retries, as you say from a
> > time-out. This shouldn't affect any internal TCP/IP code. The
> > time-out is at the application (client) level. It sent a
> > request, the data was sent or promised to be sent because the
> > write() or send() didn't block, now it expects to get the data
> > it asked for. It waits, nothing happens. It times-out and sends
> > the exact same request again.
>
> Huh??? There's no 'application level' involved here at all, nor any
> 'internal TCP/IP code'.
>
> Chuck's patch touches the way the kernel Sun RPC client code (as used
> exclusively by the kernel NFS client and the kernel NLM client)
> handles the generic case of message timeout + resend. Why would we
> want to even consider pushing that sort of thing down into the NFS
> code itself?
>
> Cheers,
> Trond
>

Because all of the RPC stuff was, initially, user-mode code. For
performance reasons or otherwise, it was moved into the kernel.
Okay, so far? Now, when something goes wrong with that code, should
that code be fixed, or should the unrelated TCP/IP code be modified
to accommodate? I think the time-outs should be put at the correct
places and not added to generic network code.

Once the client side gets a buffer of data from the TCP/IP stack,
the TCP/IP stack should not care (ever) what it does with it. Putting
the timeout(s) in the TCP/IP stack makes it care, and adds code
to accommodate.

Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Bush : The Fourth Reich of America


2002-11-14 19:26:53

by Trond Myklebust

Subject: Re: [PATCH] new timeout behavior for RPC requests on TCP sockets

>>>>> " " == Richard B Johnson <[email protected]> writes:


> Because all of the RPC stuff was, initially, user-mode
> code. For performance reasons or otherwise, it was moved into
> the kernel. Okay, so far? Now, when something goes wrong with
> that code, should that code be fixed, or should the unrelated
> TCP/IP code be modified to accommodate? I think the time-outs
> should be put at the correct places and not added to generic
> network code.

No. The kernel RPC code has never been user mode code, nor has it ever
been exported to userland. It exists purely for the benefit of NFS and
friends. It is located in a subdirectory of the network code, but it
is certainly not 'generic network code'.

Cheers,
Trond

2002-11-14 20:19:52

by Chuck Lever

Subject: Re: [PATCH] new timeout behavior for RPC requests on TCP sockets

On Thu, 14 Nov 2002, Richard B. Johnson wrote:

> Because all of the RPC stuff was, initially, user-mode code.

if you mean ti-rpc, that stuff comes from sun. the linux kernel ONC/RPC
implementation is not based on the ti-rpc code because, being Transport
Independent, ti-rpc is less than optimally efficient. also, it is
covered by a restrictive license agreement, so that code base can't be
included in the linux kernel.

> Now, when something goes wrong with that code, should
> that code be fixed, or should the unrelated TCP/IP code be modified
> to accommodate?

obviously the RPC client should be fixed....

> I think the time-outs should be put at the correct
> places and not added to generic network code.

...which is exactly what i did.

the new RPC retransmission logic is in net/sunrpc/clnt.c:call_timeout,
which is strictly a part of the RPC client's finite state machine.
underlying TCP retransmit behavior is not changed by this patch. the
changes apply to the RPC client only, which resides above the socket
layer.

let me go over the changes again. the RPC client sets a timeout after
sending each request. if it doesn't receive a valid reply for a request
within the timeout interval, a "minor" timeout occurs. after each
timeout, the RPC client doubles the timeout interval until it reaches a
maximum value.

for RPC over UDP, short timeouts and retransmission back-off make sense.
for TCP, retransmission is built into the underlying protocol, so it makes
more sense to use a constant long retransmit timeout.
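
in rough pseudo-C, the interval policy looks something like this (the
struct and field names here are illustrative, not the kernel's own):

struct retrans_timeout {
        unsigned long cur;      /* current retransmit interval (jiffies) */
        unsigned long max;      /* ceiling for exponential backoff */
        int exponential;        /* back off exponentially? */
};

/* called after each minor timeout, before the request is resent */
static void adjust_on_minor_timeout(struct retrans_timeout *to, int stream)
{
        if (stream)
                return;                 /* TCP: keep the same long interval */

        if (to->exponential) {
                to->cur <<= 1;          /* UDP: double the interval... */
                if (to->cur > to->max)
                        to->cur = to->max;  /* ...but never past the max */
        }
}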

a "major" timeout occurs after several "minor" timeouts. this is an
ad-hoc mechanism for detecting that a server is actually down, rather than
just a few requests have been lost. a "server not responding" message in
the kernel log appears when a major timeout occurs.

for UDP, there is no way a client can tell the server has gone away except
by noticing that the server is not sending any replies. TCP sockets
require a bit more cleanup when one end dies, however, since both ends
maintain some connection state.

i've changed the RPC client's timeout behavior when it uses a TCP socket
rather than a UDP socket to connect to a server:

+ after a minor RPC retransmit timeout on a TCP socket, the RPC client
uses the same retransmit timeout value when retransmitting the request
rather than doubling it, as it would on a UDP socket.

+ after a major RPC retransmit timeout on a TCP socket, close the socket.
the RPC finite state machine will notice the socket is no longer
connected, and attempt to reestablish a connection when it retries
the request again.

this means that after a few retransmissions, the RPC client closes the
transport socket. if a server hasn't responded after several retransmissions,
the client now assumes that it has crashed and has lost all connection
state, so it will reestablish a fresh connection with the server.
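
as a compressed sketch of that decision flow (names made up for
illustration; the real logic is call_timeout in the patch, which also
honors soft vs. hard mount semantics):

enum retry_action { RETRANSMIT, RECONNECT_THEN_RETRANSMIT, GIVE_UP };

/* what to do when a request's reply timer expires on a TCP socket */
static enum retry_action on_timeout(int retries_left, int soft_mount)
{
        if (retries_left > 0)
                return RETRANSMIT;      /* minor timeout: same interval,
                                         * same XID */

        /* major timeout: a soft mount fails the request with an error;
         * a hard mount tears down the connection, reconnects, and keeps
         * retransmitting */
        return soft_mount ? GIVE_UP : RECONNECT_THEN_RETRANSMIT;
}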

this behavior is recommended for NFSv2 and v3 over TCP, and is required
for NFSv4 over TCP (RFC3010).

- Chuck Lever
--
corporate: <cel at netapp dot com>
personal: <chucklever at bigfoot dot com>


2002-11-14 20:27:55

by Richard B. Johnson

Subject: Re: [PATCH] new timeout behavior for RPC requests on TCP sockets

On Thu, 14 Nov 2002, Chuck Lever wrote:

> On Thu, 14 Nov 2002, Richard B. Johnson wrote:
>
> > Because all of the RPC stuff was, initially, user-mode code.
>
> if you mean ti-rpc, that stuff comes from sun. the linux kernel ONC/RPC
> implementation is not based on the ti-rpc code because, being Transport
> Independent, ti-rpc is less than optimally efficient. also, it is
> covered by a restrictive license agreement, so that code base can't be
> included in the linux kernel.
>
> > Now, when something goes wrong with that code, should
> > that code be fixed, or should the unrelated TCP/IP code be modified
> > to accommodate?
>
> obviously the RPC client should be fixed....
>
> > I think the time-outs should be put at the correct
> > places and not added to generic network code.
>
> ...which is exactly what i did.
>
> the new RPC retransmission logic is in net/sunrpc/clnt.c:call_timeout,
> which is strictly a part of the RPC client's finite state machine.
> underlying TCP retransmit behavior is not changed by this patch. the
> changes apply to the RPC client only, which resides above the socket
> layer.
>
> let me go over the changes again. the RPC client sets a timeout after
> sending each request. if it doesn't receive a valid reply for a request
> within the timeout interval, a "minor" timeout occurs. after each
> timeout, the RPC client doubles the timeout interval until it reaches a
> maximum value.
>
> for RPC over UDP, short timeouts and retransmission back-off make sense.
> for TCP, retransmission is built into the underlying protocol, so it makes
> more sense to use a constant long retransmit timeout.
>
> a "major" timeout occurs after several "minor" timeouts. this is an
> ad-hoc mechanism for detecting that a server is actually down, rather than
> just a few requests have been lost. a "server not responding" message in
> the kernel log appears when a major timeout occurs.
>
> for UDP, there is no way a client can tell the server has gone away except
> by noticing that the server is not sending any replies. TCP sockets
> require a bit more cleanup when one end dies, however, since both ends
> maintain some connection state.
>
> i've changed the RPC client's timeout behavior when it uses a TCP socket
> rather than a UDP socket to connect to a server:
>
> + after a minor RPC retransmit timeout on a TCP socket, the RPC client
> uses the same retransmit timeout value when retransmitting the request
> rather than doubling it, as it would on a UDP socket.
>
> + after a major RPC retransmit timeout on a TCP socket, close the socket.
> the RPC finite state machine will notice the socket is no longer
> connected, and attempt to reestablish a connection when it retries
> the request again.
>
> this means that after a few retransmissions, the RPC client closes the
> transport socket. if a server hasn't responded after several retransmissions,
> the client now assumes that it has crashed and has lost all connection
> state, so it will reestablish a fresh connection with the server.
>
> this behavior is recommended for NFSv2 and v3 over TCP, and is required
> for NFSv4 over TCP (RFC3010).
>
> - Chuck Lever
> --
> corporate: <cel at netapp dot com>
> personal: <chucklever at bigfoot dot com>
>
>

Okay. Thanks a lot for the complete explanation. The early
information about the patch, and the patch itself that I tried
to follow, seemed to show that the new retransmit timer behavior
was applied at the TCP/IP level (actually the socket level). This
is what I was bitching about.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.18 on an i686 machine (797.90 BogoMips).
Bush : The Fourth Reich of America

