2008-04-15 15:17:41

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [NFS] How to set-up a Linux NFS server to handle massive number of requests

On Mon, Apr 14, 2008 at 11:48:33PM -0500, Tom Tucker wrote:
>
> Maybe this is a TCP_BACKLOG issue?

So, looking around.... There seems to be a global limit in
/proc/sys/net/ipv4/tcp_max_syn_backlog (default 1024?); might be worth
seeing what happens if that's increased, e.g., with

echo 2048 >/proc/sys/net/ipv4/tcp_max_syn_backlog

Though each client does have to make more than one tcp connection, I
wouldn't expect it to be making more than one at a time, so with 1340
clients, and assuming the requests are spread out at least a tiny bit, I
would have thought 1024 would be enough.
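
For anyone wanting to check the two limits discussed in this thread before changing them, here is a minimal sketch that only reads the current values. It is not part of the original message; the helper name read_sysctl is made up for illustration, and it assumes a Linux /proc filesystem.

```python
from pathlib import Path

# The two files named in this thread.
SYN_BACKLOG = "/proc/sys/net/ipv4/tcp_max_syn_backlog"
SOMAXCONN = "/proc/sys/net/core/somaxconn"


def read_sysctl(path):
    """Return the integer stored in a /proc/sys file, or None if it cannot be read."""
    try:
        return int(Path(path).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        return None


if __name__ == "__main__":
    for path in (SYN_BACKLOG, SOMAXCONN):
        print(path, "=", read_sysctl(path))
```

Reading first and writing second makes it easy to record the old value before trying the echo above.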

Oh, but: Grepping the glibc rpc code, it looks like it calls listen with
second argument SOMAXCONN == 128. You can confirm that by strace'ing
rpc.mountd -F and looking for the listen call.

And that socket's shared between all the mountd processes, so I guess
that's the real limit. I don't see an easy way to adjust that. You'd
also need to increase /proc/sys/net/core/somaxconn first.
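
To illustrate why both knobs matter: the Linux listen(2) man page says that a backlog argument larger than /proc/sys/net/core/somaxconn is silently truncated to that value, so the effective accept backlog is the smaller of the two. A rough sketch of that rule follows; the function name and the example numbers other than 128 are hypothetical.

```python
def effective_backlog(requested, somaxconn):
    """Backlog the kernel actually applies to listen(fd, requested).

    Assumption: models only the documented truncation to somaxconn,
    not any other kernel-internal rounding.
    """
    return min(requested, somaxconn)


if __name__ == "__main__":
    # 128 is the listen() argument reported above for glibc's RPC code;
    # 1024 and 4096 are arbitrary illustrative values.
    print(effective_backlog(128, 1024))   # listen() argument is the limit
    print(effective_backlog(4096, 128))   # somaxconn is the limit
```

So raising somaxconn alone would not help while the listen() argument stays at 128, and raising the listen() argument alone would not help while somaxconn stays at 128.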

But none of this explains why we'd see connections stuck in CLOSE_WAIT
indefinitely?

--b.

>
> BTW, with that many mounts won't you run out of "secure" ports (< 1024),
> so you'll need to use 'insecure' as a mount option.
>
>
> On Fri, 2008-04-11 at 19:07 -0400, J. Bruce Fields wrote:
> > On Thu, Apr 10, 2008 at 02:12:58PM +0200, Carsten Aulbert wrote:
> > > Hi all,
> > >
> > > we have a pretty extreme problem here and I am trying to figure out how
> > > to get it done right.
> > >
> > > We have a large cluster consisting of 1340 compute nodes, each with an
> > > automount directory which will subsequently trigger an NFS mount (read-only):
> > >
> > > $ ypcat auto.data
> > > -fstype=nfs,nfsvers=3,hard,intr,rsize=8192,wsize=8192,tcp &:/data
> > >
> > > $ grep auto.data /etc/auto.master
> > > /atlas/data yp:auto.data --timeout=5
> > >
> > > So far so good.
> > >
> > > When submitting 1000 jobs just doing a md5sum of the very same file from
> > > one single data server, I see very weird effects.
> > >
> > > In the standard set-up many connections get into the box (tcp connection
> > > status SYN_RECV) but those fall over after some time and stay in
> > > CLOSE_WAIT state until I restart the nfs-kernel-server. Typically that
> > > looks like (netstat -an):
> >
> > That's interesting! But I'm not sure how to figure this out.
> >
> > Is it possible to get a network trace that shows what's going on?
> >
> > What happens on the clients?
> >
> > What kernel version are you using?
> >
> > --b.
> >
> > >
> > > tcp 0 0 10.20.10.14:687 10.10.2.87:799 SYN_RECV
> > > tcp 0 0 10.20.10.14:687 10.10.4.1:823 SYN_RECV
> > > tcp 0 0 10.20.10.14:687 10.10.1.65:656 SYN_RECV
> > > tcp 0 0 10.20.10.14:687 10.10.1.30:650 SYN_RECV
> > > tcp 0 0 10.20.10.14:687 10.10.0.71:789 SYN_RECV
> > > tcp 0 0 10.20.10.14:687 10.10.1.4:602 SYN_RECV
> > > tcp 0 0 10.20.10.14:687 10.10.1.1:967 SYN_RECV
> > > tcp 0 0 10.20.10.14:687 10.10.3.66:915 SYN_RECV
> > > tcp 0 0 10.20.10.14:687 10.10.0.55:620 SYN_RECV
> > > tcp 0 0 10.20.10.14:687 10.10.1.41:835 SYN_RECV
> > > tcp 0 0 10.20.10.14:687 10.10.2.29:958 SYN_RECV
> > > tcp 0 0 10.20.10.14:687 10.10.1.12:998 SYN_RECV
> > > tcp 0 0 10.20.10.14:687 10.10.1.30:651 SYN_RECV
> > > tcp 0 0 10.20.10.14:687 10.10.1.4:601 SYN_RECV
> > > tcp 0 0 10.20.10.14:2049 10.10.1.19:846 ESTABLISHED
> > > tcp 45 0 10.20.10.14:687 10.10.0.68:979 CLOSE_WAIT
> > > tcp 45 0 10.20.10.14:687 10.10.3.83:680 CLOSE_WAIT
> > > tcp 89 0 10.20.10.14:687 10.10.0.79:604 CLOSE_WAIT
> > > tcp 0 0 10.20.10.14:2049 10.10.2.6:676 ESTABLISHED
> > > tcp 45 0 10.20.10.14:687 10.10.2.56:913 CLOSE_WAIT
> > > tcp 45 0 10.20.10.14:687 10.10.0.60:827 CLOSE_WAIT
> > > tcp 0 0 10.20.10.14:2049 10.10.3.55:778 ESTABLISHED
> > > tcp 45 0 10.20.10.14:687 10.10.2.86:981 CLOSE_WAIT
> > > tcp 45 0 10.20.10.14:687 10.10.9.13:792 CLOSE_WAIT
> > > tcp 89 0 10.20.10.14:687 10.10.2.93:728 CLOSE_WAIT
> > > tcp 45 0 10.20.10.14:687 10.10.0.20:742 CLOSE_WAIT
> > > tcp 45 0 10.20.10.14:687 10.10.3.44:982 CLOSE_WAIT
> > >
> > >
> > > I played with different numbers of nfsd (ranging from 8-1024) and
> > > increasing the number of threads for rpc.mountd from 1 to 64, in quite a
> > > few combinations, but so far I have not found a consistent set of
> > > parameters where 1000 nodes are able to read this file at the same time.
> > >
> > > Any ideas from anyone or do you need more input from me?
> > >
> > > TIA
> > >
> > > Carsten
> > >
> > > PS: Please Cc me, I'm not yet subscribed.
> > >
> > > -------------------------------------------------------------------------
> > > This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
> > > Don't miss this year's exciting event. There's still time to save $100.
> > > Use priority code J8TL2D2.
> > > http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
> > > _______________________________________________
> > > NFS maillist - [email protected]
> > > https://lists.sourceforge.net/lists/listinfo/nfs
> > > _______________________________________________
> > > Please note that [email protected] is being discontinued.
> > > Please subscribe to [email protected] instead.
> > > http://vger.kernel.org/vger-lists.html#linux-nfs
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
> > > the body of a message to [email protected]
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
>
>




2008-04-16 02:42:49

by Tom Tucker

[permalink] [raw]
Subject: Re: [NFS] How to set-up a Linux NFS server to handle massive number of requests


On Tue, 2008-04-15 at 11:12 -0400, J. Bruce Fields wrote:
> On Mon, Apr 14, 2008 at 11:48:33PM -0500, Tom Tucker wrote:
> >
> > Maybe this is a TCP_BACKLOG issue?
>
> So, looking around.... There seems to be a global limit in
> /proc/sys/net/ipv4/tcp_max_syn_backlog (default 1024?); might be worth
> seeing what happens if that's increased, e.g., with
>
> echo 2048 >/proc/sys/net/ipv4/tcp_max_syn_backlog

I think this represents the collective total for all listening
endpoints. I think we're only talking about mountd.

Shooting from the hip...

My gray-haired recollection is that the default backlog for a single
listening socket is 10 (SYNs received, not accepted connections). Additional
SYNs received on this endpoint will be dropped, and clients will retry the
SYN as part of normal TCP retransmission.

It might be that the CLOSE_WAITs in the log are _normal_. That is, they
reflect completed mount requests that are in the normal close path. If
they never go away, then that's not normal. Is this the case?
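
One way to answer that question would be to take two "netstat -an" snapshots some minutes apart and see which CLOSE_WAIT connections appear in both. A rough sketch, assuming the usual six-column netstat layout (proto, recv-q, send-q, local, remote, state) with each connection on a single line; the function names are made up for illustration.

```python
def close_wait_peers(netstat_text):
    """Return the set of (local, remote) address pairs in CLOSE_WAIT."""
    peers = set()
    for line in netstat_text.splitlines():
        fields = line.split()
        if len(fields) >= 6 and fields[0].startswith("tcp") and fields[5] == "CLOSE_WAIT":
            peers.add((fields[3], fields[4]))
    return peers


def persistent_close_wait(before, after):
    """Connections that are in CLOSE_WAIT in both snapshots."""
    return close_wait_peers(before) & close_wait_peers(after)


if __name__ == "__main__":
    # Sample lines taken from the netstat output quoted in this thread;
    # the second snapshot is invented for the sake of the example.
    first = """\
tcp 45 0 10.20.10.14:687 10.10.0.68:979 CLOSE_WAIT
tcp 89 0 10.20.10.14:687 10.10.0.79:604 CLOSE_WAIT
tcp 0 0 10.20.10.14:2049 10.10.1.19:846 ESTABLISHED
"""
    second = """\
tcp 45 0 10.20.10.14:687 10.10.0.68:979 CLOSE_WAIT
tcp 0 0 10.20.10.14:2049 10.10.1.19:846 ESTABLISHED
"""
    for local, remote in sorted(persistent_close_wait(first, second)):
        print(local, remote)
```

Entries that show up in every snapshot are the ones that never leave CLOSE_WAIT; entries that come and go would fit the "normal close path" reading.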

Suppose the 10 is roughly correct. The remaining "jilted" clients will
retransmit their SYN after a randomized exponential backoff. I think you
can imagine that trying 1300+ connections of which only 10 succeed and
then retrying 1300-10 based on a randomized exponential backoff might
get you some pretty bad performance.
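
To make that picture a bit more concrete, here is a toy simulation of the scenario. It is only a sketch under loudly stated assumptions: time advances in one-second ticks, the server accepts at most "backlog" new connections per tick, every rejected client retries after a jittered, doubling delay, and nobody ever gives up (real TCP stops after a limited number of SYN retries). All names and default values are hypothetical; the backlog of 10 is just the recollection above, not a measured number.

```python
import random


def simulate(clients=1340, backlog=10, first_retry=3, seed=0):
    """Toy model of clients racing for a small SYN backlog.

    Returns (seconds until every client is connected, total SYNs sent).
    """
    rng = random.Random(seed)          # seeded so runs are repeatable
    next_try = {c: 0 for c in range(clients)}   # client -> time of next SYN
    attempts = {c: 0 for c in range(clients)}   # client -> failed attempts so far
    now = 0
    syns_sent = 0
    while next_try:
        ready = [c for c, when in next_try.items() if when <= now]
        rng.shuffle(ready)             # arrival order within a tick is arbitrary
        syns_sent += len(ready)
        for c in ready[:backlog]:      # these fit in the backlog and connect
            del next_try[c]
        for c in ready[backlog:]:      # these SYNs are dropped; back off and retry
            attempts[c] += 1
            delay = first_retry * 2 ** (attempts[c] - 1)
            next_try[c] = now + delay + rng.randint(0, delay)
        now += 1
    return now, syns_sent


if __name__ == "__main__":
    for backlog in (10, 128):
        seconds, syns = simulate(backlog=backlog)
        print(f"backlog={backlog}: {seconds} s, {syns} SYNs")
```

The absolute numbers mean little, since the model ignores accept rate, network latency and retry limits. The point is only to compare how completion time and retransmitted SYNs change with the backlog size.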

Just a thought --

>
> Though each client does have to make more than one tcp connection, I
> wouldn't expect it to be making more than one at a time, so with 1340
> clients, and assuming the requests are spread out at least a tiny bit, I
> would have thought 1024 would be enough.
>
> Oh, but: Grepping the glibc rpc code, it looks like it calls listen with
> second argument SOMAXCONN == 128. You can confirm that by strace'ing
> rpc.mountd -F and looking for the listen call.
>
> And that socket's shared between all the mountd processes, so I guess
> that's the real limit. I don't see an easy way to adjust that. You'd
> also need to increase /proc/sys/net/core/somaxconn first.
>
> But none of this explains why we'd see connections stuck in CLOSE_WAIT
> indefinitely?
>
> --b.
>
> >
> > BTW, with that many mounts won't you run out of "secure" ports (< 1024),
> > so you'll need to use 'insecure' as a mount option.
> >
> >
> > On Fri, 2008-04-11 at 19:07 -0400, J. Bruce Fields wrote:
> > > On Thu, Apr 10, 2008 at 02:12:58PM +0200, Carsten Aulbert wrote:
> > > > Hi all,
> > > >
> > > > we have a pretty extreme problem here and I am trying to figure out how
> > > > to get it done right.
> > > >
> > > > We have a large cluster consisting of 1340 compute nodes, each with an
> > > > automount directory which will subsequently trigger an NFS mount (read-only):
> > > >
> > > > $ ypcat auto.data
> > > > -fstype=nfs,nfsvers=3,hard,intr,rsize=8192,wsize=8192,tcp &:/data
> > > >
> > > > $ grep auto.data /etc/auto.master
> > > > /atlas/data yp:auto.data --timeout=5
> > > >
> > > > So far so good.
> > > >
> > > > When submitting 1000 jobs just doing a md5sum of the very same file from
> > > > one single data server, I see very weird effects.
> > > >
> > > > In the standard set-up many connections get into the box (tcp connection
> > > > status SYN_RECV) but those fall over after some time and stay in
> > > > CLOSE_WAIT state until I restart the nfs-kernel-server. Typically that
> > > > looks like (netstat -an):
> > >
> > > That's interesting! But I'm not sure how to figure this out.
> > >
> > > Is it possible to get a network trace that shows what's going on?
> > >
> > > What happens on the clients?
> > >
> > > What kernel version are you using?
> > >
> > > --b.
> > >
> > > >
> > > > tcp 0 0 10.20.10.14:687 10.10.2.87:799 SYN_RECV
> > > > tcp 0 0 10.20.10.14:687 10.10.4.1:823 SYN_RECV
> > > > tcp 0 0 10.20.10.14:687 10.10.1.65:656 SYN_RECV
> > > > tcp 0 0 10.20.10.14:687 10.10.1.30:650 SYN_RECV
> > > > tcp 0 0 10.20.10.14:687 10.10.0.71:789 SYN_RECV
> > > > tcp 0 0 10.20.10.14:687 10.10.1.4:602 SYN_RECV
> > > > tcp 0 0 10.20.10.14:687 10.10.1.1:967 SYN_RECV
> > > > tcp 0 0 10.20.10.14:687 10.10.3.66:915 SYN_RECV
> > > > tcp 0 0 10.20.10.14:687 10.10.0.55:620 SYN_RECV
> > > > tcp 0 0 10.20.10.14:687 10.10.1.41:835 SYN_RECV
> > > > tcp 0 0 10.20.10.14:687 10.10.2.29:958 SYN_RECV
> > > > tcp 0 0 10.20.10.14:687 10.10.1.12:998 SYN_RECV
> > > > tcp 0 0 10.20.10.14:687 10.10.1.30:651 SYN_RECV
> > > > tcp 0 0 10.20.10.14:687 10.10.1.4:601 SYN_RECV
> > > > tcp 0 0 10.20.10.14:2049 10.10.1.19:846 ESTABLISHED
> > > > tcp 45 0 10.20.10.14:687 10.10.0.68:979 CLOSE_WAIT
> > > > tcp 45 0 10.20.10.14:687 10.10.3.83:680 CLOSE_WAIT
> > > > tcp 89 0 10.20.10.14:687 10.10.0.79:604 CLOSE_WAIT
> > > > tcp 0 0 10.20.10.14:2049 10.10.2.6:676 ESTABLISHED
> > > > tcp 45 0 10.20.10.14:687 10.10.2.56:913 CLOSE_WAIT
> > > > tcp 45 0 10.20.10.14:687 10.10.0.60:827 CLOSE_WAIT
> > > > tcp 0 0 10.20.10.14:2049 10.10.3.55:778 ESTABLISHED
> > > > tcp 45 0 10.20.10.14:687 10.10.2.86:981 CLOSE_WAIT
> > > > tcp 45 0 10.20.10.14:687 10.10.9.13:792 CLOSE_WAIT
> > > > tcp 89 0 10.20.10.14:687 10.10.2.93:728 CLOSE_WAIT
> > > > tcp 45 0 10.20.10.14:687 10.10.0.20:742 CLOSE_WAIT
> > > > tcp 45 0 10.20.10.14:687 10.10.3.44:982 CLOSE_WAIT
> > > >
> > > >
> > > > I played with different numbers of nfsd (ranging from 8-1024) and
> > > > increasing the number of threads for rpc.mountd from 1 to 64, in quite a
> > > > few combinations, but so far I have not found a consistent set of
> > > > parameters where 1000 nodes are able to read this file at the same time.
> > > >
> > > > Any ideas from anyone or do you need more input from me?
> > > >
> > > > TIA
> > > >
> > > > Carsten
> > > >
> > > > PS: Please Cc me, I'm not yet subscribed.
> > > >
> > >
> >
> >

