2004-03-16 11:13:15

by Olaf Kirch

Subject: NFS over TCP: random drop

Hi,

I am looking into an NFS problem at one of our customers. They are
running NFS over TCP, and the client is some kind of mainframe OS.

What happens is this:

client: initiates connections, 3way handshake completes
server: closes connection (FIN)
client: (at the same moment) sends data to server
server: RST
client: argh (reports EIO to application)

What seems to happen here is the random connection drop in svcsock.c,
where we randomly drop a connection if the total number of TCP connections
exceeds (nrthreads + 3) * 10. "Randomly" here means either the oldest
connection, or the newest one (which happens to be the one we just
accepted).
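
A minimal sketch of the check described above, in C. This is not the svcsock.c source: the function names, the enum, and the coin parameter are assumptions made for illustration, and the even split between oldest and newest is inferred from the description rather than confirmed.

```c
#include <stdbool.h>

/* Hypothetical model of the limit described in the message. */
static unsigned int conn_limit(unsigned int nrthreads)
{
    return (nrthreads + 3) * 10;
}

/* True once the number of TCP connections exceeds the limit,
 * i.e. when one connection is going to be dropped. */
static bool over_limit(unsigned int nr_conns, unsigned int nrthreads)
{
    return nr_conns > conn_limit(nrthreads);
}

enum victim { DROP_OLDEST, DROP_NEWEST };

/* "Randomly" means one of two candidates. The caller supplies the
 * random bit (assumption: both outcomes are equally likely). */
static enum victim pick_victim(unsigned int coin)
{
    return (coin & 1) ? DROP_NEWEST : DROP_OLDEST;
}
```

With 8 threads the limit in this model is 110 connections, and whenever `pick_victim` returns `DROP_NEWEST` the connection closed is the one that was just accepted.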

The specific problem we have here is that the client doesn't grok TCP RSTs
generated by the server. It's arguably a client bug, but I'm nevertheless
thinking about a way to work around this.

Possible solutions:

- always drop the oldest connection. The current strategy doesn't
  help much against DoS anyway, because rather than dropping
  old connections for _every_ new one, we drop them for every
  two new connections. Big improvement :)

- Create a sysctl that allows setting a hard limit for active
  TCP connections.

- increase the "random drop" threshold to something like
  (nrthreads + 3) * 100.
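
The first and third options can be sketched as follows. This is an illustration only: `struct conn`, its fields, and both function names are hypothetical and do not reflect how the server actually tracks its sockets.

```c
#include <stddef.h>

/* Hypothetical connection record. */
struct conn {
    int id;
    unsigned long last_active; /* larger value = more recent activity */
};

/* Limit with a configurable multiplier: 10 is the current value,
 * 100 the one proposed in the third option. */
static unsigned int conn_limit_with_factor(unsigned int nrthreads,
                                           unsigned int factor)
{
    return (nrthreads + 3) * factor;
}

/* "Always drop the oldest": return the index of the least recently
 * active connection, or -1 if the table is empty. */
static int oldest_conn(const struct conn *conns, size_t n)
{
    int victim = -1;

    for (size_t i = 0; i < n; i++) {
        if (victim < 0 || conns[i].last_active < conns[victim].last_active)
            victim = (int)i;
    }
    return victim;
}
```

Since a freshly accepted connection has the most recent activity, this selection only picks it when it is the sole entry in the table.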

Any comments?

Olaf
--
Olaf Kirch | Stop wasting entropy - start using predictable
[email protected] | tempfile names today!
---------------+


-------------------------------------------------------
This SF.Net email is sponsored by: IBM Linux Tutorials
Free Linux tutorial presented by Daniel Robbins, President and CEO of
GenToo technologies. Learn everything from fundamentals to system
administration.http://ads.osdn.com/?ad_id=1470&alloc_id=3638&op=click
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2004-03-16 15:50:32

by Trond Myklebust

Subject: Re: NFS over TCP: random drop

On Tue, 16/03/2004 at 06:11, Olaf Kirch wrote:
> Hi,
>
> I am looking into an NFS problem at one of our customers. They are
> running NFS over TCP, and the client is some kind of mainframe OS.
>
> What happens is this:
>
> client: initiates connections, 3way handshake completes
> server: closes connection (FIN)
> client: (at the same moment) sends data to server
> server: RST
> client: argh (reports EIO to application)
>
> What seems to happen here is the random connection drop in svcsock.c,
> where we randomly drop a connection if the total number of TCP connections
> exceeds (nrthreads + 3) * 10. "Randomly" here means either the oldest
> connection, or the newest one (which happens to be the one we just
> accepted).
>
> The specific problem we have here is that the client doesn't grok TCP RSTs
> generated by the server. It's arguably a client bug, but I'm nevertheless
> thinking about a way to work around this.

It would be a client bug if this was happening on hard mounts, but I
assume this is "soft"?

Cheers,
Trond



2004-03-16 16:21:29

by Lever, Charles

Subject: RE: NFS over TCP: random drop

hi olaf-

IMO it would be better to fix this in the client. overloaded
servers of all kinds behave this way, and the client should
deal with it. server cluster failover also behaves this way.

if this is a soft mount, do you see "timing out" messages in
the log?

> -----Original Message-----
> From: Olaf Kirch [mailto:[email protected]]
> Sent: Tuesday, March 16, 2004 6:12 AM
> To: [email protected]
> Subject: [NFS] NFS over TCP: random drop
>
>
> Hi,
>
> I am looking into an NFS problem at one of our customers.
> They are running NFS over TCP, and the client is some kind of
> mainframe OS.
>
> What happens is this:
>
> client: initiates connections, 3way handshake completes
> server: closes connection (FIN)
> client: (at the same moment) sends data to server
> server: RST
> client: argh (reports EIO to application)
>
> What seems to happen here is the random connection drop in
> svcsock.c, where we randomly drop a connection if the total
> number of TCP connections exceeds (nrthreads + 3) * 10.
> "Randomly" here means either the oldest connection, or the
> newest one (which happens to be the one we just accepted).
>
> The specific problem we have here is that the client doesn't
> grok TCP RSTs generated by the server. It's arguably a client
> bug, but I'm nevertheless thinking about a way to work around this.
>
> Possible solutions:
>
> - always drop the oldest connection. The current strategy doesn't
>   help much against DoS anyway, because rather than dropping
>   old connections for _every_ new one, we drop them for every
>   two new connections. Big improvement :)
>
> - Create a sysctl that allows setting a hard limit for active
>   TCP connections.
>
> - increase the "random drop" threshold to something like
>   (nrthreads + 3) * 100.
>
> Any comments?
>
> Olaf
> --
> Olaf Kirch | Stop wasting entropy - start using predictable
> [email protected] | tempfile names today!
> ---------------+



2004-03-16 17:06:19

by Trond Myklebust

Subject: RE: NFS over TCP: random drop

On Tue, 16/03/2004 at 11:21, Lever, Charles wrote:
> hi olaf-
>
> IMO it would be better to fix this in the client. overloaded
> servers of all kinds behave this way, and the client should
> deal with it. server cluster failover also behaves this way.

It *should* currently be dealing with it...

ECONNRESET in xprt_sendmsg will be transformed into ENOTCONN. That in turn
should cause call_status() to call the reconnection code.

However if the server disconnects us *while we are reconnecting*, then
the soft request will fail with an EIO. This seems like sensible
behaviour to me: a server which is accepting a connection then
immediately breaking it is fundamentally broken...
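
The path described above can be modelled roughly as below. Only ECONNRESET, ENOTCONN and EIO come from the message; the function names, the enum, and the `reconnecting` and `soft` flags are hypothetical, and the code does not mirror the actual sunrpc client. Treating hard mounts as always retrying is an assumption based on the earlier remark that an EIO on a hard mount would be a client bug.

```c
#include <errno.h>
#include <stdbool.h>

/* Transport layer in this model: a reset from the peer is reported
 * upward as "not connected". */
static int map_send_error(int err)
{
    return (err == ECONNRESET) ? ENOTCONN : err;
}

enum action {
    ACT_DONE,       /* request was sent */
    ACT_RECONNECT,  /* run the reconnection code and retry */
    ACT_FAIL_EIO,   /* give up and report EIO to the application */
    ACT_UNMODELED   /* any error this sketch does not cover */
};

/* Decide what happens after a send attempt.
 * reconnecting: the drop happened while a reconnect was in progress.
 * soft: the mount is a soft mount. */
static enum action next_action(int send_err, bool reconnecting, bool soft)
{
    int status = map_send_error(send_err);

    if (status == 0)
        return ACT_DONE;
    if (status == ENOTCONN) {
        /* Dropped again in the middle of reconnecting: a soft
         * request fails, a hard one keeps trying (assumption). */
        if (reconnecting && soft)
            return ACT_FAIL_EIO;
        return ACT_RECONNECT;
    }
    return ACT_UNMODELED;
}
```

In this model a single reset leads to a reconnect, and EIO only appears when a soft request is dropped a second time during that reconnect.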

Cheers,
Trond



2004-03-16 17:19:34

by Lever, Charles

Subject: RE: NFS over TCP: random drop

> a server which is
> accepting a connection then immediately breaking it is
> fundamentally broken...

agreed that this should be rare behavior. but i can't think
of a way to eliminate every case where a server might RST a
TCP connection just after accepting it.

cluster failover is one example where a server might legitimately
RST a connection just after accepting it.



2004-03-16 19:42:30

by Olaf Kirch

Subject: Re: NFS over TCP: random drop

On Tue, Mar 16, 2004 at 12:05:30PM -0500, Trond Myklebust wrote:
> However if the server disconnects us *while we are reconnecting*, then
> the soft request will fail with an EIO. This seems like sensible
> behaviour to me: a server which is accepting a connection then
> immediately breaking it is fundamentally broken...

But the Linux nfsd seems to be doing exactly this - it accepts the
connection, then randomly drops either the oldest or the newest one
(in terms of activity). The one at the head of the queue happens to be
the one we just accepted.

Olaf
--
Olaf Kirch | Stop wasting entropy - start using predictable
[email protected] | tempfile names today!
---------------+

