2010-11-15 19:25:23

by Mark Hills

Subject: Listen backlog set to 64

I am looking into an issue of hanging clients to a set of NFS servers, on
a large HPC cluster.

My investigation took me to the RPC code, svc_create_socket().

	if (protocol == IPPROTO_TCP) {
		if ((error = kernel_listen(sock, 64)) < 0)
			goto bummer;
	}

A fixed backlog of 64 connections at the server seems like it could be too
low on a cluster like this, particularly when the protocol opens and
closes the TCP connection.

I wondered what the rationale is behind this number, particularly as it
is a fixed value. Perhaps there is a reason why this has no effect on
nfsd, or is this a FAQ for people on large systems?

The servers show overflow of a listening queue, which I imagine is
related.

$ netstat -s
[...]
TcpExt:
6475 times the listen queue of a socket overflowed
6475 SYNs to LISTEN sockets ignored
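
For reference, those two counters come from the TcpExt line of
/proc/net/netstat (ListenOverflows and ListenDrops respectively, if I am
reading the kernel right). A small stand-alone sketch for watching them
directly; this is not from the original setup, just an illustration:

#include <stdio.h>
#include <string.h>

/* Print the TcpExt listen-queue counters from /proc/net/netstat. */
int main(void)
{
	char names[4096], values[4096];
	FILE *fp = fopen("/proc/net/netstat", "r");

	if (!fp) {
		perror("/proc/net/netstat");
		return 1;
	}

	/* The file holds pairs of lines: "TcpExt: <names>" then "TcpExt: <values>". */
	while (fgets(names, sizeof names, fp) && fgets(values, sizeof values, fp)) {
		char *np, *vp, *n, *v;

		if (strncmp(names, "TcpExt:", 7) != 0)
			continue;

		n = strtok_r(names, " \n", &np);
		v = strtok_r(values, " \n", &vp);
		while (n && v) {
			if (!strcmp(n, "ListenOverflows") || !strcmp(n, "ListenDrops"))
				printf("%s = %s\n", n, v);
			n = strtok_r(NULL, " \n", &np);
			v = strtok_r(NULL, " \n", &vp);
		}
	}
	fclose(fp);
	return 0;
}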

The affected servers are old, kernel 2.6.9. But this limit of 64 is
consistent across that and the latest kernel source.

Thanks

--
Mark


2010-11-16 19:05:48

by Mark Hills

Subject: Re: Listen backlog set to 64

On Tue, 16 Nov 2010, J. Bruce Fields wrote:

> On Mon, Nov 15, 2010 at 06:43:52PM +0000, Mark Hills wrote:
> > I am looking into an issue of hanging clients to a set of NFS servers, on
> > a large HPC cluster.
> >
> > My investigation took me to the RPC code, svc_create_socket().
> >
> > if (protocol == IPPROTO_TCP) {
> > if ((error = kernel_listen(sock, 64)) < 0)
> > goto bummer;
> > }
> >
> > A fixed backlog of 64 connections at the server seems like it could be too
> > low on a cluster like this, particularly when the protocol opens and
> > closes the TCP connection.
> >
> > I wondered what the rationale is behind this number, particularly as it
> > is a fixed value. Perhaps there is a reason why this has no effect on
> > nfsd, or is this a FAQ for people on large systems?
> >
> > The servers show overflow of a listening queue, which I imagine is
> > related.
> >
> > $ netstat -s
> > [...]
> > TcpExt:
> > 6475 times the listen queue of a socket overflowed
> > 6475 SYNs to LISTEN sockets ignored
> >
> > The affected servers are old, kernel 2.6.9. But this limit of 64 is
> > consistent across that and the latest kernel source.
>
> Looks like the last time that was touched was 8 years ago, by Neil (below, from
> historical git archive).
>
> I'd be inclined to just keep doubling it until people don't complain,
> unless it's very expensive. (How much memory (or whatever else) does a
> pending connection tie up?)

Perhaps SOMAXCONN could also be appropriate.

> The clients should be retrying, though, shouldn't they?

I think so, but a quick glance at net/sunrpc/clnt.c suggests that the
timeouts are fixed, not randomised. With nothing to smooth out the load
from a large number of (identical) clients, they could potentially
continue this process for some time.
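
For illustration, a minimal user-space sketch of what I mean by
randomised retries: jittered exponential backoff, so that a cluster of
identical clients does not wake up on exactly the same schedule. This is
not taken from net/sunrpc/clnt.c, just the shape of the idea:

#include <stdlib.h>
#include <time.h>

/* Sleep before retry number 'attempt' (0, 1, 2, ...), with random jitter. */
static void backoff_sleep(unsigned int attempt)
{
	unsigned int base_ms = 500u << (attempt < 6 ? attempt : 6); /* 0.5s .. 32s */
	unsigned int jitter_ms = rand() % (base_ms / 2 + 1);        /* up to +50% */
	unsigned int total_ms = base_ms + jitter_ms;
	struct timespec ts = {
		.tv_sec  = total_ms / 1000,
		.tv_nsec = (long)(total_ms % 1000) * 1000000L,
	};

	nanosleep(&ts, NULL);
}

Each client would seed rand() with something host-specific (the PID,
say) so the jitter actually differs between hosts.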

I may be looking at the wrong client code here for a client TCP
connection, though; perhaps someone with more experience can comment. I
hope to investigate further tomorrow.

--
Mark

2010-11-16 18:20:28

by J. Bruce Fields

Subject: Re: Listen backlog set to 64

On Mon, Nov 15, 2010 at 06:43:52PM +0000, Mark Hills wrote:
> I am looking into an issue of hanging clients to a set of NFS servers, on
> a large HPC cluster.
>
> My investigation took me to the RPC code, svc_create_socket().
>
> if (protocol == IPPROTO_TCP) {
> if ((error = kernel_listen(sock, 64)) < 0)
> goto bummer;
> }
>
> A fixed backlog of 64 connections at the server seems like it could be too
> low on a cluster like this, particularly when the protocol opens and
> closes the TCP connection.
>
> I wondered what the rationale is behind this number, particularly as it
> is a fixed value. Perhaps there is a reason why this has no effect on
> nfsd, or is this a FAQ for people on large systems?
>
> The servers show overflow of a listening queue, which I imagine is
> related.
>
> $ netstat -s
> [...]
> TcpExt:
> 6475 times the listen queue of a socket overflowed
> 6475 SYNs to LISTEN sockets ignored
>
> The affected servers are old, kernel 2.6.9. But this limit of 64 is
> consistent across that and the latest kernel source.

Looks like the last time that was touched was 8 years ago, by Neil (below, from
historical git archive).

I'd be inclined to just keep doubling it until people don't complain,
unless it's very expensive. (How much memory (or whatever else) does a
pending connection tie up?)

The clients should be retrying, though, shouldn't they?

--b.

commit df0afc51f2f74756135c8bc08ec01134eb6de287
Author: Neil Brown <[email protected]>
Date: Thu Aug 22 21:21:39 2002 -0700

[PATCH] Fix two problems with multiple concurrent nfs/tcp connects.

1/ connect requests would get lost...
As the comment at the top of svcsock.c says when discussing
SK_CONN:
* after a set, svc_sock_enqueue must be called.

We didn't and so lost connection requests.

2/ set the max accept backlog to a more reasonable number to cope
with bursts of lots of connection requests.

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index ab28937..bbeee09 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -679,6 +679,8 @@ svc_tcp_accept(struct svc_sock *svsk)
 		goto failed;	/* aborted connection or whatever */
 	}
 	set_bit(SK_CONN, &svsk->sk_flags);
+	svc_sock_enqueue(svsk);
+
 	slen = sizeof(sin);
 	err = ops->getname(newsock, (struct sockaddr *) &sin, &slen, 1);
 	if (err < 0) {
@@ -1220,7 +1222,7 @@ svc_create_socket(struct svc_serv *serv, int protocol, str
 	}
 
 	if (protocol == IPPROTO_TCP) {
-		if ((error = sock->ops->listen(sock, 5)) < 0)
+		if ((error = sock->ops->listen(sock, 64)) < 0)
 			goto bummer;
 	}


2010-11-30 17:50:55

by Mark Hills

Subject: Re: Listen backlog set to 64

On Mon, 29 Nov 2010, J. Bruce Fields wrote:

> On Wed, Nov 17, 2010 at 09:08:26AM +1100, Neil Brown wrote:
> > On Tue, 16 Nov 2010 13:20:26 -0500
> > "J. Bruce Fields" <[email protected]> wrote:
> >
> > > On Mon, Nov 15, 2010 at 06:43:52PM +0000, Mark Hills wrote:
> > > > I am looking into an issue of hanging clients to a set of NFS servers, on
> > > > a large HPC cluster.
> > > >
> > > > My investigation took me to the RPC code, svc_create_socket().
> > > >
> > > > if (protocol == IPPROTO_TCP) {
> > > > if ((error = kernel_listen(sock, 64)) < 0)
> > > > goto bummer;
> > > > }
> > > >
> > > > A fixed backlog of 64 connections at the server seems like it could be too
> > > > low on a cluster like this, particularly when the protocol opens and
> > > > closes the TCP connection.
[...]
> > So we could:
> > - hard code a new number
> > - make this another sysctl configurable
> > - auto-adjust it so that it "just works".
> >
> > I would prefer the latter if it is possible. Possibly we could adjust it
> > based on the number of nfsd threads, like we do for receive buffer space.
> > Maybe something arbitrary like:
> > min(16 + 2 * number of threads, sock_net(sk)->core.sysctl_somaxconn)
> >
> > which would get the current 64 at 24 threads, and can easily push up to 128
> > and beyond with more threads.
> >
> > Or is that too arbitrary?
>
> I kinda like the idea of piggybacking on an existing constant like
> sysctl_max_syn_backlog. Somebody else hopefully keeps it set to something
> reasonable, and as a last resort it gives you a knob to twiddle.
>
> But number of threads would work OK too.
>
> At a minimum we should make sure we solve the original problem....
> Mark, have you had a chance to check whether increasing that number to
> 128 or more is enough to solve your problem?

I think we can hold off changing the queue size, for now at least. We
reduced the reported queue overflows by increasing the number of mountd
threads, allowing it to service the queue more quickly. However this did
not fix the common problem, and I was hoping to have more information in
this follow-up email.

Our investigation brings us to rpc.mountd and mount.nfs communicating. In
the client log we see messages like:

Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up

Using strace and isolating one of these, I can see a non-blocking connect
has already managed to make a connection and even send/receive some data.

But soon a timeout of 9999 milliseconds in poll() causes a problem in
mount.nfs when waiting for a response of some sort. The socket in question
is a connection to mountd:

26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>

When it returns:

26512 <... poll resumed> ) = 0 (Timeout)
26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
26512 close(3) = 0
26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100

There's no re-try from here, just a failed mount.

What is the source of this 9999 millisecond timeout used by poll() in
mount.nfs? It was not clear in an initial search of nfs-utils and glibc,
but I need more time to investigate.

If the server is being too slow to respond, what could the cause of this
be? Multiple threads are already in use, but it seems like they are not
all in use because a thread is able to accept() the connection. I haven't
been able to pin this on the forward/reverse DNS lookup used by
authentication and logging.

Thanks

--
Mark

2010-11-29 20:59:41

by J. Bruce Fields

Subject: Re: Listen backlog set to 64

On Wed, Nov 17, 2010 at 09:08:26AM +1100, Neil Brown wrote:
> On Tue, 16 Nov 2010 13:20:26 -0500
> "J. Bruce Fields" <[email protected]> wrote:
>
> > On Mon, Nov 15, 2010 at 06:43:52PM +0000, Mark Hills wrote:
> > > I am looking into an issue of hanging clients to a set of NFS servers, on
> > > a large HPC cluster.
> > >
> > > My investigation took me to the RPC code, svc_create_socket().
> > >
> > > if (protocol == IPPROTO_TCP) {
> > > if ((error = kernel_listen(sock, 64)) < 0)
> > > goto bummer;
> > > }
> > >
> > > A fixed backlog of 64 connections at the server seems like it could be too
> > > low on a cluster like this, particularly when the protocol opens and
> > > closes the TCP connection.
> > >
> > > I wondered what the rationale is behind this number, particularly as it
> > > is a fixed value. Perhaps there is a reason why this has no effect on
> > > nfsd, or is this a FAQ for people on large systems?
> > >
> > > The servers show overflow of a listening queue, which I imagine is
> > > related.
> > >
> > > $ netstat -s
> > > [...]
> > > TcpExt:
> > > 6475 times the listen queue of a socket overflowed
> > > 6475 SYNs to LISTEN sockets ignored
> > >
> > > The affected servers are old, kernel 2.6.9. But this limit of 64 is
> > > consistent across that and the latest kernel source.
> >
> > Looks like the last time that was touched was 8 years ago, by Neil (below, from
> > historical git archive).
> >
> > I'd be inclined to just keep doubling it until people don't complain,
> > unless it's very expensive. (How much memory (or whatever else) does a
> > pending connection tie up?)
>
> Surely we should "keep multiplying by 13" as that is what I did :-)
>
> There is a sysctl 'somaxconn' which limits what a process can ask for in the
> listen() system call, but as we bypass this syscall it doesn't directly
> affect nfsd.
> It defaults to SOMAXCONN == 128 but can be raised arbitrarily by the sysadmin.
>
> There is another sysctl 'max_syn_backlog' which looks like a system-wide
> limit to the connect backlog.
> This defaults to 256. The comment says it is
> adjusted between 128 and 1024 based on memory size, though that isn't clear
> in the code (to me at least).

This comment?:

/*
 * Maximum number of SYN_RECV sockets in queue per LISTEN socket.
 * One SYN_RECV socket costs about 80bytes on a 32bit machine.
 * It would be better to replace it with a global counter for all sockets
 * but then some measure against one socket starving all other sockets
 * would be needed.
 *
 * It was 128 by default. Experiments with real servers show, that
 * it is absolutely not enough even at 100conn/sec. 256 cures most
 * of problems. This value is adjusted to 128 for very small machines
 * (<=32Mb of memory) and to 1024 on normal or better ones (>=256Mb).
 * Note : Dont forget somaxconn that may limit backlog too.
 */
int sysctl_max_syn_backlog = 256;

Looks like net/ipv4/tcp.c:tcp_init() does the memory-based calculation.

80 bytes sounds small.

> So we could:
> - hard code a new number
> - make this another sysctl configurable
> - auto-adjust it so that it "just works".
>
> I would prefer the latter if it is possible. Possibly we could adjust it
> based on the number of nfsd threads, like we do for receive buffer space.
> Maybe something arbitrary like:
> min(16 + 2 * number of threads, sock_net(sk)->core.sysctl_somaxconn)
>
> which would get the current 64 at 24 threads, and can easily push up to 128
> and beyond with more threads.
>
> Or is that too arbitrary?

I kinda like the idea of piggybacking on an existing constant like
sysctl_max_syn_backlog. Somebody else hopefully keeps it set to something
reasonable, and as a last resort it gives you a knob to twiddle.

But number of threads would work OK too.

At a minimum we should make sure we solve the original problem....
Mark, have you had a chance to check whether increasing that number to
128 or more is enough to solve your problem?

--b.

2010-11-30 20:00:17

by J. Bruce Fields

Subject: Re: Listen backlog set to 64

On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
> On Mon, 29 Nov 2010, J. Bruce Fields wrote:
>
> > On Wed, Nov 17, 2010 at 09:08:26AM +1100, Neil Brown wrote:
> > > On Tue, 16 Nov 2010 13:20:26 -0500
> > > "J. Bruce Fields" <[email protected]> wrote:
> > >
> > > > On Mon, Nov 15, 2010 at 06:43:52PM +0000, Mark Hills wrote:
> > > > > I am looking into an issue of hanging clients to a set of NFS servers, on
> > > > > a large HPC cluster.
> > > > >
> > > > > My investigation took me to the RPC code, svc_create_socket().
> > > > >
> > > > > if (protocol == IPPROTO_TCP) {
> > > > > if ((error = kernel_listen(sock, 64)) < 0)
> > > > > goto bummer;
> > > > > }
> > > > >
> > > > > A fixed backlog of 64 connections at the server seems like it could be too
> > > > > low on a cluster like this, particularly when the protocol opens and
> > > > > closes the TCP connection.
> [...]
> > > So we could:
> > > - hard code a new number
> > > - make this another sysctl configurable
> > > - auto-adjust it so that it "just works".
> > >
> > > I would prefer the latter if it is possible. Possibly we could adjust it
> > > based on the number of nfsd threads, like we do for receive buffer space.
> > > Maybe something arbitrary like:
> > > min(16 + 2 * number of threads, sock_net(sk)->core.sysctl_somaxconn)
> > >
> > > which would get the current 64 at 24 threads, and can easily push up to 128
> > > and beyond with more threads.
> > >
> > > Or is that too arbitrary?
> >
> > I kinda like the idea of piggybacking on an existing constant like
> > sysctl_max_syn_backlog. Somebody else hopefully keeps it set to something
> > reasonable, and as a last resort it gives you a knob to twiddle.
> >
> > But number of threads would work OK too.
> >
> > At a minimum we should make sure we solve the original problem....
> > Mark, have you had a chance to check whether increasing that number to
> > 128 or more is enough to solve your problem?
>
> I think we can hold off changing the queue size, for now at least. We
> reduced the reported queue overflows by increasing the number of mountd
> threads, allowing it to service the queue more quickly.

Apologies, I should have thought to suggest that at the start.

> However this did
> not fix the common problem, and I was hoping to have more information in
> this follow-up email.
>
> Our investigation brings us to rpc.mountd and mount.nfs communicating. In
> the client log we see messages like:
>
> Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
>
> Using strace and isolating one of these, I can see a non-blocking connect
> has already managed to make a connection and even send/receive some data.
>
> But soon a timeout of 9999 milliseconds in poll() causes a problem in
> mount.nfs when waiting for a response of some sort. The socket in question
> is a connection to mountd:
>
> 26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
> 26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
> 26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
>
> When it returns:
>
> 26512 <... poll resumed> ) = 0 (Timeout)
> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> 26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
> 26512 close(3) = 0
> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> 26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
>
> There's no re-try from here, just a failed mount.

That does sound wrong. I'm not at all familiar with automount,
unfortunately; how is it invoking mount.nfs?

> What is the source of this 9999 millisecond timeout used by poll() in
> mount.nfs? It was not clear in an initial search of nfs-utils and glibc,
> but I need more time to investigate.
>
> If the server is being too slow to respond, what could the cause of this
> be? Multiple threads are already in use, but it seems like they are not
> all in use because a thread is able to accept() the connection. I haven't
> been able to pin this on the forward/reverse DNS lookup used by
> authentication and logging.

Can you tell where the mountd threads are typically waiting?

--b.

2010-11-16 22:08:35

by NeilBrown

Subject: Re: Listen backlog set to 64

On Tue, 16 Nov 2010 13:20:26 -0500
"J. Bruce Fields" <[email protected]> wrote:

> On Mon, Nov 15, 2010 at 06:43:52PM +0000, Mark Hills wrote:
> > I am looking into an issue of hanging clients to a set of NFS servers, on
> > a large HPC cluster.
> >
> > My investigation took me to the RPC code, svc_create_socket().
> >
> > if (protocol == IPPROTO_TCP) {
> > if ((error = kernel_listen(sock, 64)) < 0)
> > goto bummer;
> > }
> >
> > A fixed backlog of 64 connections at the server seems like it could be too
> > low on a cluster like this, particularly when the protocol opens and
> > closes the TCP connection.
> >
> > I wondered what the rationale is behind this number, particularly as it
> > is a fixed value. Perhaps there is a reason why this has no effect on
> > nfsd, or is this a FAQ for people on large systems?
> >
> > The servers show overflow of a listening queue, which I imagine is
> > related.
> >
> > $ netstat -s
> > [...]
> > TcpExt:
> > 6475 times the listen queue of a socket overflowed
> > 6475 SYNs to LISTEN sockets ignored
> >
> > The affected servers are old, kernel 2.6.9. But this limit of 64 is
> > consistent across that and the latest kernel source.
>
> Looks like the last time that was touched was 8 years ago, by Neil (below, from
> historical git archive).
>
> I'd be inclined to just keep doubling it until people don't complain,
> unless it's very expensive. (How much memory (or whatever else) does a
> pending connection tie up?)

Surely we should "keep multiplying by 13" as that is what I did :-)

There is a sysctl 'somaxconn' which limits what a process can ask for in the
listen() system call, but as we bypass this syscall it doesn't directly
affect nfsd.
It defaults to SOMAXCONN == 128 but can be raised arbitrarily by the sysadmin.

There is another sysctl 'max_syn_backlog' which looks like a system-wide
limit to the connect backlog. This defaults to 256. The comment says it is
adjusted between 128 and 1024 based on memory size, though that isn't clear
in the code (to me at least).
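
For what it's worth, a trivial sketch for eyeballing those two limits on
a given server next to nfsd's hard-coded 64 (it just reads the /proc/sys
files; not part of any proposed change):

#include <stdio.h>

static long read_long(const char *path)
{
	FILE *fp = fopen(path, "r");
	long val = -1;

	if (fp) {
		if (fscanf(fp, "%ld", &val) != 1)
			val = -1;
		fclose(fp);
	}
	return val;
}

int main(void)
{
	printf("somaxconn           = %ld\n",
	       read_long("/proc/sys/net/core/somaxconn"));
	printf("tcp_max_syn_backlog = %ld\n",
	       read_long("/proc/sys/net/ipv4/tcp_max_syn_backlog"));
	printf("nfsd listen backlog = 64 (hard-coded)\n");
	return 0;
}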

So we could:
- hard code a new number
- make this another sysctl configurable
- auto-adjust it so that it "just works".

I would prefer the latter if it is possible. Possibly we could adjust it
based on the number of nfsd threads, like we do for receive buffer space.
Maybe something arbitrary like:
min(16 + 2 * number of threads, sock_net(sk)->core.sysctl_somaxconn)

which would get the current 64 at 24 threads, and can easily push up to 128
and beyond with more threads.

Or is that too arbitrary?
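
Purely as an untested sketch of that heuristic (assuming it sits in
net/sunrpc/svcsock.c next to svc_create_socket(), which already pulls in
the needed headers), it could look something like:

/* Scale the listen backlog with the thread count, capped by the
 * namespace's somaxconn setting. */
static int svc_listen_backlog(struct svc_serv *serv, struct sock *sk)
{
	return min_t(int, 16 + 2 * serv->sv_nrthreads,
		     sock_net(sk)->core.sysctl_somaxconn);
}

	...
	if (protocol == IPPROTO_TCP) {
		if ((error = kernel_listen(sock,
				svc_listen_backlog(serv, sock->sk))) < 0)
			goto bummer;
	}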

NeilBrown


>
> The clients should be retrying, though, shouldn't they?
>
> --b.
>
> commit df0afc51f2f74756135c8bc08ec01134eb6de287
> Author: Neil Brown <[email protected]>
> Date: Thu Aug 22 21:21:39 2002 -0700
>
> [PATCH] Fix two problems with multiple concurrent nfs/tcp connects.
>
> 1/ connect requests would get lost...
> As the comment at the top of svcsock.c says when discussing
> SK_CONN:
> * after a set, svc_sock_enqueue must be called.
>
> We didn't and so lost connection requests.
>
> 2/ set the max accept backlog to a more reasonable number to cope
> with bursts of lots of connection requests.
>
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index ab28937..bbeee09 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -679,6 +679,8 @@ svc_tcp_accept(struct svc_sock *svsk)
> goto failed; /* aborted connection or whatever */
> }
> set_bit(SK_CONN, &svsk->sk_flags);
> + svc_sock_enqueue(svsk);
> +
> slen = sizeof(sin);
> err = ops->getname(newsock, (struct sockaddr *) &sin, &slen, 1);
> if (err < 0) {
> @@ -1220,7 +1222,7 @@ svc_create_socket(struct svc_serv *serv, int protocol, str
> }
>
> if (protocol == IPPROTO_TCP) {
> - if ((error = sock->ops->listen(sock, 5)) < 0)
> + if ((error = sock->ops->listen(sock, 64)) < 0)
> goto bummer;
> }
>


2010-11-30 22:09:13

by Mark Hills

Subject: Re: Listen backlog set to 64

On Tue, 30 Nov 2010, J. Bruce Fields wrote:

> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
> > On Mon, 29 Nov 2010, J. Bruce Fields wrote:
> >
> > > On Wed, Nov 17, 2010 at 09:08:26AM +1100, Neil Brown wrote:
> > > > On Tue, 16 Nov 2010 13:20:26 -0500
> > > > "J. Bruce Fields" <[email protected]> wrote:
> > > >
> > > > > On Mon, Nov 15, 2010 at 06:43:52PM +0000, Mark Hills wrote:
> > > > > > I am looking into an issue of hanging clients to a set of NFS servers, on
> > > > > > a large HPC cluster.
> > > > > >
> > > > > > My investigation took me to the RPC code, svc_create_socket().
> > > > > >
> > > > > > if (protocol == IPPROTO_TCP) {
> > > > > > if ((error = kernel_listen(sock, 64)) < 0)
> > > > > > goto bummer;
> > > > > > }
[...]
> > I think we can hold off changing the queue size, for now at least. We
> > reduced the reported queue overflows by increasing the number of mountd
> > threads, allowing it to service the queue more quickly.
>
> Apologies, I should have thought to suggest that at the start.

No apology necessary. I actually thought we had ruled out the possibility
that mountd is at fault, but I cannot remember the exact reasons. I think
it is because we have _some_ cases of failure on a mounted filesystem. So
maybe we will find ourselves back here again later.

> > However this did not fix the common problem, and I was hoping to have
> > more information in this follow-up email.
> >
> > Our investigation brings us to rpc.mountd and mount.nfs communicating. In
> > the client log we see messages like:
> >
> > Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
> >
> > Using strace and isolating one of these, I can see a non-blocking connect
> > has already managed to make a connection and even send/receive some data.
> >
> > But soon a timeout of 9999 milliseconds in poll() causes a problem in
> > mount.nfs when waiting for a response of some sort. The socket in question
> > is a connection to mountd:
> >
> > 26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
> > 26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
> > 26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
> >
> > When it returns:
> >
> > 26512 <... poll resumed> ) = 0 (Timeout)
> > 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > 26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
> > 26512 close(3) = 0
> > 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > 26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
> >
> > There's no re-try from here, just a failed mount.
>
> That does sound wrong. I'm not at all familiar with automount,
> unfortunately; how is it invoking mount.nfs?

It invokes 'mount' as a command using exec(). A more detailed log:

Nov 22 12:07:36 nyrf014 automount[7883]: mount_mount: mount(nfs): calling mount -t nfs -s -o proto=tcp ss1a:/mnt/raid1/banana /proj/banana
Nov 22 12:07:48 nyrf014 automount[7883]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up

The process trying to access the filesystem will block whilst automount
does this. The failure of the mount command leaves an empty directory
until the negative timeout expires, after which it will invoke 'mount'
again.
It's this period that causes problems; applications fail when they see an
empty directory.

It looks like we are more likely to _see_ the problem when a large number
of hosts are mounting at the same time, but it doesn't seem to be directly
the cause of it.

> > What is the source of this 9999 millisecond timeout used by poll() in
> > mount.nfs? It was not clear in an initial search of nfs-utils and glibc,
> > but I need more time to investigate.
> >
> > If the server is being too slow to respond, what could the cause of this
> > be? Multiple threads are already in use, but it seems like they are not
> > all in use because a thread is able to accept() the connection. I haven't
> > been able to pin this on the forward/reverse DNS lookup used by
> > authentication and logging.
>
> Can you tell where the mountd threads are typically waiting?

I intend to investigate this, possibly by attaching strace to rpc.mountd
and trying to correlate the log with one of the dropouts. Perhaps somebody
has a better suggestion.

--
Mark

2010-12-08 16:45:53

by Chuck Lever III

Subject: Re: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)


On Dec 8, 2010, at 9:45 AM, Mark Hills wrote:

> On Wed, 1 Dec 2010, Chuck Lever wrote:
>
>> On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:
>>
>>> On Tue, 30 Nov 2010, J. Bruce Fields wrote:
>>>
>>>> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
>>> [...]
>>>>> Our investigation brings us to rpc.mountd and mount.nfs communicating. In
>>>>> the client log we see messages like:
>>>>>
>>>>> Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
>>>>>
>>>>> Using strace and isolating one of these, I can see a non-blocking connect
>>>>> has already managed to make a connection and even send/receive some data.
>>>>>
>>>>> But soon a timeout of 9999 milliseconds in poll() causes a problem in
>>>>> mount.nfs when waiting for a response of some sort. The socket in question
>>>>> is a connection to mountd:
>>>>>
>>>>> 26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>> 26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
>>>>> 26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
>>>>>
>>>>> When it returns:
>>>>>
>>>>> 26512 <... poll resumed> ) = 0 (Timeout)
>>>>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>>>> 26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
>>>>> 26512 close(3) = 0
>>>>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>>>> 26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
>>>>>
>>>>> There's no re-try from here, just a failed mount.
> [...]
>>>> Can you tell where the mountd threads are typically waiting?
>>>
>>> Here's a trace from mountd. Note the long pause after fdatasync():
>>>
>>> 31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
>>> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
>>> 31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
>>> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
>>> 31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
>>> 31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
>>> 31156 12:09:40 fdatasync(14) = 0 <15.413401>
> [...]
>> fdatasync() will write all outstanding dirty data on that partition to
>> persistent storage, not just the dirty data for that file. If /var is
>> on a partition that has a lot of other activity, then this is the
>> expected outcome, unfortunately. I would expect such delays to be due
>> to a large amount of outstanding writes, not to periodic synchronous
>> writes.
>
> Moving rmtab to a tmpfs, this moved the problem to another bottleneck
> writing to /var, this time via syslog.
>
> We identified a crontab entry which uses RPM to apply updates to the
> server OS at randomised times which was blocking /var for 10 seconds or
> so.

I think this is closer to the root cause than the mount.nfs connect timeout. I might consider RPM or syslog to be the final culprit, but I wonder why a multi-threaded mountd would stop servicing TCP connects because it couldn't write to /var or to syslog?

> This returns the issue to giving up after the 9999 millisecond timeout
> above, which I think is making the client overly sensitive.
>
> The affected clients are Fedora 12 distribution: nfs-utils-1.2.1-4.fc12.
> We are using mount.nfs for a v3 mount over TCP.
>
> Locating this timeout wasn't easy. There's no '9999' in any of the
> nfs-utils or glibc's rpc code.
>
> It turns out that the timeout originates in
> nfs-utils/support/nfs/rpc_socket.c as
>
> #define NFSRPC_TIMEOUT_TCP (10)
>
> This is used to populate a struct timeval *timeout which is used and
> updated by nfs_connect_nb(). Then the remaining timeout of 9999
> milliseconds is used in an RPC clnt_call operation.
>
> 10 seconds over TCP is simply too short to be reliable here; it's not a
> long period of time for a fatal timeout. I will attempt to test on our
> cluster with an extended timeout.

This is a TCP connect timeout, not a RPC retransmit timeout. Waiting 10 seconds for a TCP connect is really not that obnoxious. The reason this is short is because we want mount.nfs to realize quickly that it's making no progress on stuck servers.

> I presume that the bug is that the timeout is too short, and not that the
> timeout should cause a retry in a layer above -- because re-sending the
> request over reliable TCP would not make sense.

The upper layer wouldn't resend the request, since the underlying transport never connected.

> For UDP the retries are handled inside the RPC call, clntudp_call(). In
> which case, what is the reason the UDP timeout differs from TCP? And is it
> also too small?

RPC over UDP doesn't have to connect to the server. I wonder what would happen if you specified "mountproto=udp" when mounting exports on this overloaded server.

We could consider making the TCP connect timeout value larger, but the price would be a mount.nfs command that is less responsive to hung servers. For example, the legacy get_socket() code uses a 20 second TCP connect timeout for the pmap query, and a 30 second TCP connect timeout for the mount request.

But how large would be large enough? If we chose a TCP connect timeout of 25 seconds, what would happen if your server required, say, 30 seconds to accept a new TCP connection? I think we would agree that the server side should be fixed in that case.

IMO adding another mount option would be mistaken. It adds clutter to the mount user interface to work around what is clearly a problem on the server.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2010-12-08 14:45:46

by Mark Hills

Subject: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)

On Wed, 1 Dec 2010, Chuck Lever wrote:

> On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:
>
> > On Tue, 30 Nov 2010, J. Bruce Fields wrote:
> >
> >> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
> > [...]
> >>> Our investigation brings us to rpc.mountd and mount.nfs communicating. In
> >>> the client log we see messages like:
> >>>
> >>> Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
> >>>
> >>> Using strace and isolating one of these, I can see a non-blocking connect
> >>> has already managed to make a connection and even send/receive some data.
> >>>
> >>> But soon a timeout of 9999 milliseconds in poll() causes a problem in
> >>> mount.nfs when waiting for a response of some sort. The socket in question
> >>> is a connection to mountd:
> >>>
> >>> 26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
> >>> 26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
> >>> 26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
> >>>
> >>> When it returns:
> >>>
> >>> 26512 <... poll resumed> ) = 0 (Timeout)
> >>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> >>> 26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
> >>> 26512 close(3) = 0
> >>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> >>> 26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
> >>>
> >>> There's no re-try from here, just a failed mount.
[...]
> >> Can you tell where the mountd threads are typically waiting?
> >
> > Here's a trace from mountd. Note the long pause after fdatasync():
> >
> > 31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
> > 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> > 31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
> > 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> > 31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
> > 31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
> > 31156 12:09:40 fdatasync(14) = 0 <15.413401>
[...]
> fdatasync() will write all outstanding dirty data on that partition to
> persistent storage, not just the dirty data for that file. If /var is
> on a partition that has a lot of other activity, then this is the
> expected outcome, unfortunately. I would expect such delays to be due
> to a large amount of outstanding writes, not to periodic synchronous
> writes.

Moving rmtab to a tmpfs, this moved the problem to another bottleneck
writing to /var, this time via syslog.

We identified a crontab entry which uses RPM to apply updates to the
server OS at randomised times which was blocking /var for 10 seconds or
so.

This returns the issue to giving up after the 9999 millisecond timeout
above, which I think is making the client overly sensitive.

The affected clients are Fedora 12 distribution: nfs-utils-1.2.1-4.fc12.
We are using mount.nfs for a v3 mount over TCP.

Locating this timeout wasn't easy. There's no '9999' in any of the
nfs-utils or glibc's rpc code.

It turns out that the timeout originates in
nfs-utils/support/nfs/rpc_socket.c as

#define NFSRPC_TIMEOUT_TCP (10)

This is used to populate a struct timeval *timeout which is used and
updated by nfs_connect_nb(). Then the remaining timeout of 9999
milliseconds is used in an RPC clnt_call operation.
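
To illustrate the mechanism (this is only the shape of it, not the
actual nfs_connect_nb() code; it uses the BSD timersub()/timercmp()
helpers from <sys/time.h>): a single struct timeval acts as a total
budget, the non-blocking connect consumes part of it, and whatever
remains, here 9999 ms after a roughly 1 ms connect, is what gets passed
on as the clnt_call() timeout.

#include <stddef.h>
#include <sys/time.h>

/* Subtract the time elapsed since 'start' from the remaining budget. */
static void consume_budget(struct timeval *budget, const struct timeval *start)
{
	struct timeval now, elapsed;

	gettimeofday(&now, NULL);
	timersub(&now, start, &elapsed);            /* elapsed = now - start */
	if (timercmp(&elapsed, budget, <))
		timersub(budget, &elapsed, budget); /* budget -= elapsed */
	else
		timerclear(budget);                 /* budget exhausted */
}

So if mountd takes longer than what is left of that shared 10 second
budget to reply, the mount fails outright, as observed above.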

10 seconds over TCP is simply too short to be reliable here; it's not a
long period of time for a fatal timeout. I will attempt to test on our
cluster with an extended timeout.

I presume that the bug is that the timeout is too short, and not that the
timeout should cause a retry in a layer above -- because re-sending the
request over reliable TCP would not make sense.

For UDP the retries are handled inside the RPC call, clntudp_call(). In
which case, what is the reason the UDP timeout differs from TCP? And is it
also too small?

Thanks

--
Mark

2010-12-08 15:38:43

by J. Bruce Fields

Subject: Re: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)

On Wed, Dec 08, 2010 at 02:45:40PM +0000, Mark Hills wrote:
> On Wed, 1 Dec 2010, Chuck Lever wrote:
>
> > On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:
> >
> > > On Tue, 30 Nov 2010, J. Bruce Fields wrote:
> > >
> > >> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
> > > [...]
> > >>> Our investigation brings us to rpc.mountd and mount.nfs communicating. In
> > >>> the client log we see messages like:
> > >>>
> > >>> Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
> > >>>
> > >>> Using strace and isolating one of these, I can see a non-blocking connect
> > >>> has already managed to make a connection and even send/receive some data.
> > >>>
> > >>> But soon a timeout of 9999 milliseconds in poll() causes a problem in
> > >>> mount.nfs when waiting for a response of some sort. The socket in question
> > >>> is a connection to mountd:
> > >>>
> > >>> 26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
> > >>> 26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
> > >>> 26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
> > >>>
> > >>> When it returns:
> > >>>
> > >>> 26512 <... poll resumed> ) = 0 (Timeout)
> > >>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > >>> 26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
> > >>> 26512 close(3) = 0
> > >>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > >>> 26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
> > >>>
> > >>> There's no re-try from here, just a failed mount.
> [...]
> > >> Can you tell where the mountd threads are typically waiting?
> > >
> > > Here's a trace from mountd. Note the long pause after fdatasync():
> > >
> > > 31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
> > > 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> > > 31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
> > > 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> > > 31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
> > > 31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
> > > 31156 12:09:40 fdatasync(14) = 0 <15.413401>
> [...]
> > fdatasync() will write all outstanding dirty data on that partition to
> > persistent storage, not just the dirty data for that file. If /var is
> > on a partition that has a lot of other activity, then this is the
> > expected outcome, unfortunately. I would expect such delays to be due
> > to a large amount of outstanding writes, not to periodic synchronous
> > writes.
>
> Moving rmtab to a tmpfs, this moved the problem to another bottleneck
> writing to /var, this time via syslog.
>
> We identified a crontab entry which uses RPM to apply updates to the
> server OS at randomised times which was blocking /var for 10 seconds or
> so.

So is the problem that mountd ends up waiting on a synchronous write of
one of those "authenticated mount request" log messages?

Maybe we should downgrade those messages? On a server with a lot of
clients a log on every mount call seems obnoxious. Doing it
synchronously even more so (though I guess that's a problem for the
syslog configuration.)

> This returns the issue to giving up after the 9999 millisecond timeout
> above, which I think is making the client overly sensitive.
>
> The affected clients are Fedora 12 distribution: nfs-utils-1.2.1-4.fc12.
> We are using mount.nfs for a v3 mount over TCP.
>
> Locating this timeout wasn't easy. There's no '9999' in any of the
> nfs-utils or glibc's rpc code.
>
> It turns out that the timeout originates in
> nfs-utils/support/nfs/rpc_socket.c as
>
> #define NFSRPC_TIMEOUT_TCP (10)
>
> This is used to populate a struct timeval *timeout which is used and
> updated by nfs_connect_nb(). Then the remaining timeout of 9999
> milliseconds is used in an RPC clnt_call operation.
>
> 10 seconds over TCP is simply too short to be reliable here; it's not a
> long period of time for a fatal timeout. I will attempt to test on our
> cluster with an extended timeout.

Should this be tied to some mount option?

Chuck probably knows how this is supposed to work....

--b.

>
> I presume that the bug is that the timeout is too short, and not that the
> timeout should cause a retry in a layer above -- because re-sending the
> request over reliable TCP would not make sense.
>
> For UDP the retries are handled inside the RPC call, clntudp_call(). In
> which case, what is the reason the UDP timeout differs from TCP? And is it
> also too small?

2010-12-01 18:36:59

by J. Bruce Fields

Subject: Re: Listen backlog set to 64

On Wed, Dec 01, 2010 at 06:18:17PM +0000, Mark Hills wrote:
> Here's a trace from mountd. Note the long pause after fdatasync():
>
> 31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> 31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> 31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
> 31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
> 31156 12:09:40 fdatasync(14) = 0 <15.413401>
> 31156 12:09:55 close(14) = 0 <0.000008>
> 31156 12:09:55 munmap(0x2af00c929000, 4096) = 0 <0.000083>
> 31156 12:09:55 close(13) = 0 <0.000008>
>
> This pause exceeds the mysterious 9999 millisecond timeout allowed at the
> client.
>
> fdatasync() is happening at every write of rmtab -- a lot. Occasionally it
> takes a long time. The syncing of the rmtab seems a reasonable explanation
> why more threads does not help -- all mountd threads lock up.
>
> The strace I am analysing here contains 670 calls to fdatasync() in a 16
> minute period. All of them take < 1 second, except for this burst:
>
> 31156 12:09:11 fdatasync(14) = 0 <0.000010>
> 31156 12:09:11 fdatasync(15) = 0 <0.136285>
> 31156 12:09:12 fdatasync(14) = 0 <0.000010>
> 31156 12:09:12 fdatasync(15) = 0 <14.911323>
> 31156 12:09:27 fdatasync(14) = 0 <1.298896>
> 31156 12:09:29 fdatasync(14) = 0 <1.159262>
> 31156 12:09:33 fdatasync(14) = 0 <1.447169>
> 31156 12:09:35 fdatasync(14) = 0 <1.345562>
> 31156 12:09:40 fdatasync(14) = 0 <0.000010>
> 31156 12:09:40 fdatasync(14) = 0 <15.413401>
> 31156 12:09:55 fdatasync(17) = 0 <0.000010>
> 31156 12:09:55 fdatasync(17) = 0 <1.068503>
> 31156 12:09:57 fdatasync(18) = 0 <0.000010>
> 31156 12:09:57 fdatasync(18) = 0 <0.089682>
>
> /var/lib/nfs/rmtab[.tmp] is on an ext3 filesystem, the system disk.
>
> The server is only acting as a 'regular' and dedicated file server, but
> presumably some other process is causing /var to become erratic. Unless
> continual writing and syncing to a small (64KiB) file could cause this to
> happen periodically?
>
> In the meantime we have symlink'd /var/lib/nfs/rmtab[.tmp] to a tmpfs
> mount to see if that helps, and will leave it running tonight.

I didn't really follow the controversy earlier this year about fsync
performance on ext3--is that fixed to some degree on newer kernels?

But in any case, the rmtab file is, as the rpc.mountd man page says,
"mostly ornamental". We shouldn't be taking so much care over it.

Hm, the code in question:

	if (!have_new_cache) {
		/*
		 * If we are using the old caching interface: exportfs
		 * uses the rmtab to determine what should be exported,
		 * so it is important that it be up-to-date.
		 *
		 * If we are using the new caching interface: the rmtab
		 * is ignored by exportfs and the fdatasync only serves
		 * to slow us down.
		 */
		fflush(fp);
		fdatasync(fileno(fp));
	}

It's a little odd that you'd be hitting this case. Do you not have "nfsd"
mounted? (What distro is this again?)

--b.

2010-12-13 16:19:48

by Chuck Lever III

Subject: Re: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)


On Dec 8, 2010, at 1:28 PM, Chuck Lever wrote:

>
> On Dec 8, 2010, at 12:31 PM, Mark Hills wrote:
>
>> On Wed, 8 Dec 2010, Chuck Lever wrote:
>>
>>> On Dec 8, 2010, at 9:45 AM, Mark Hills wrote:
>>>
>>>> On Wed, 1 Dec 2010, Chuck Lever wrote:
>>>>
>>>>> On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:
>>>>>
>>>> This returns the issue to giving up after the 9999 millisecond timeout
>>>> above, which I think is making the client overly sensitive.
>>>>
>>>> The affected clients are Fedora 12 distribution: nfs-utils-1.2.1-4.fc12.
>>>> We are using mount.nfs for a v3 mount over TCP.
>>>>
>>>> Locating this timeout wasn't easy. There's no '9999' in any of the
>>>> nfs-utils or glibc's rpc code.
>>>>
>>>> It turns out that the timeout originates in
>>>> nfs-utils/support/nfs/rpc_socket.c as
>>>>
>>>> #define NFSRPC_TIMEOUT_TCP (10)
>>>>
>>>> This is used to populate a struct timeval *timeout which is used and
>>>> updated by nfs_connect_nb(). Then the remaining timeout of 9999
>>>> milliseconds is used in an RPC clnt_call operation.
>>>>
>>>> 10 seconds over TCP is simply too short to be reliable here; it's not a
>>>> long period of time for a fatal timeout. I will attempt to test on our
>>>> cluster with an extended timeout.
>>>
>>> This is a TCP connect timeout, not a RPC retransmit timeout. Waiting 10
>>> seconds for a TCP connect is really not that obnoxious. The reason this
>>> is short is because we want mount.nfs to realize quickly that it's
>>> making no progress on stuck servers.
>>
>> If my understanding of the code is correct, the 10 second timeout is not
>> just for the connect, but also for making an RPC call and receiving the
>> response (the struct timeval is carried through). Hence the diagnosis in
>> the previous mail.
>>
>>>> I presume that the bug is that the timeout is too short, and not that
>>>> the timeout should cause a retry in a layer above -- because
>>>> re-sending the request over reliable TCP would not make sense.
>>>
>>> The upper layer wouldn't resend the request, since the underlying
>>> transport never connected.
>>>
>>>> For UDP the retries are handled inside the RPC call, clntudp_call().
>>>> In which case, what is the reason the UDP timeout differs from TCP?
>>>> And is it also too small?
>>>
>>> RPC over UDP doesn't have to connect to the server. I wonder what would
>>> happen if you specified "mountproto=udp" when mounting exports on this
>>> overloaded server.
>>
>> I think we'd still see the same issue over UDP, perhaps even more brutally
>> because of the shorter 3 second UDP timeout.
>>
>> To clarify, we've deviated from the problem described in the original
>> email; it is not opening a TCP connection to mountd that is the problem,
>> but mountd's ability to send a message in response.
>
> So, you've verified (with, say, a network trace) that the clients are connecting successfully, but mountd is not sending a reply to the MNT request within ten seconds?

I've opened https://bugzilla.linux-nfs.org/show_bug.cgi?id=196 to track the mount.nfs part of this issue.

It seems to me, however, that under normal circumstances nfs-utils 1.2.1 on Fedora 12 should be doing a kernel mount, and mount.nfs should not be attempting to send a MNT request. Would it be possible to enable some trace debugging to capture the sequence of kernel events during a failed mount?

On the clients:

# rpcdebug -m nfs -s mount

To disable:

# rpcdebug -m nfs -c

Debugging messages will appear in /var/log/messages.

A full network trace could also be useful. Just needed while your jobs are starting so we can capture the network traffic at mount time.

# tcpdump -s 1600 -w /tmp/raw

Doing this on your clients would likely be more reliable than on the server.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2010-12-01 18:43:10

by Chuck Lever III

Subject: Re: Listen backlog set to 64


On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:

> On Tue, 30 Nov 2010, J. Bruce Fields wrote:
>
>> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
> [...]
>>> Our investigation brings us to rpc.mountd and mount.nfs communicating. In
>>> the client log we see messages like:
>>>
>>> Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
>>>
>>> Using strace and isolating one of these, I can see a non-blocking connect
>>> has already managed to make a connection and even send/receive some data.
>>>
>>> But soon a timeout of 9999 milliseconds in poll() causes a problem in
>>> mount.nfs when waiting for a response of some sort. The socket in question
>>> is a connection to mountd:
>>>
>>> 26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
>>> 26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
>>> 26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
>>>
>>> When it returns:
>>>
>>> 26512 <... poll resumed> ) = 0 (Timeout)
>>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>> 26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
>>> 26512 close(3) = 0
>>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>> 26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
>>>
>>> There's no re-try from here, just a failed mount.
>>
>> That does sound wrong. I'm not at all familiar with automount,
>> unfortunately; how is it invoking mount.nfs?
>>
>>> What is the source of this 9999 millisecond timeout used by poll() in
>>> mount.nfs? It was not clear in an initial search of nfs-utils and glibc,
>>> but I need more time to investigate.
>>>
>>> If the server is being too slow to respond, what could the cause of this
>>> be? Multiple threads are already in use, but it seems like they are not
>>> all in use because a thread is able to accept() the connection. I haven't
>>> been able to pin this on the forward/reverse DNS lookup used by
>>> authentication and logging.
>>
>> Can you tell where the mountd threads are typically waiting?
>
> Here's a trace from mountd. Note the long pause after fdatasync():
>
> 31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> 31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> 31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
> 31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
> 31156 12:09:40 fdatasync(14) = 0 <15.413401>
> 31156 12:09:55 close(14) = 0 <0.000008>
> 31156 12:09:55 munmap(0x2af00c929000, 4096) = 0 <0.000083>
> 31156 12:09:55 close(13) = 0 <0.000008>
>
> This pause exceeds the mysterious 9999 millisecond timeout allowed at the
> client.
>
> fdatasync() is happening at every write of rmtab -- a lot. Occasionally it
> takes a long time. The syncing of the rmtab seems a reasonable explanation
> why more threads does not help -- all mountd threads lock up.
>
> The strace I am analysing here contains 670 calls to fdatasync() in a 16
> minute period. All of them take < 1 second, except for this burst:
>
> 31156 12:09:11 fdatasync(14) = 0 <0.000010>
> 31156 12:09:11 fdatasync(15) = 0 <0.136285>
> 31156 12:09:12 fdatasync(14) = 0 <0.000010>
> 31156 12:09:12 fdatasync(15) = 0 <14.911323>
> 31156 12:09:27 fdatasync(14) = 0 <1.298896>
> 31156 12:09:29 fdatasync(14) = 0 <1.159262>
> 31156 12:09:33 fdatasync(14) = 0 <1.447169>
> 31156 12:09:35 fdatasync(14) = 0 <1.345562>
> 31156 12:09:40 fdatasync(14) = 0 <0.000010>
> 31156 12:09:40 fdatasync(14) = 0 <15.413401>
> 31156 12:09:55 fdatasync(17) = 0 <0.000010>
> 31156 12:09:55 fdatasync(17) = 0 <1.068503>
> 31156 12:09:57 fdatasync(18) = 0 <0.000010>
> 31156 12:09:57 fdatasync(18) = 0 <0.089682>
>
> /var/lib/nfs/rmtab[.tmp] is on an ext3 filesystem, the system disk.
>
> The server is only acting as a 'regular' and dedicated file server, but
> presumably some other process is causing /var to become erratic. Unless
> continual writing and syncing to a small (64KiB) file could cause this to
> happen periodically?

fdatasync() will write all outstanding dirty data on that partition to persistent storage, not just the dirty data for that file. If /var is on a partition that has a lot of other activity, then this is the expected outcome, unfortunately. I would expect such delays to be due to a large amount of outstanding writes, not to periodic synchronous writes.

This problem may be addressed in recent versions of nfs-utils. I see a commit from Ben Myers on February 12 of this year that seems on point. nfs-utils-1.2.2 has it, I think?

> In the meantime we have symlink'd /var/lib/nfs/rmtab[.tmp] to a tmpfs
> mount to see if that helps, and will leave it running tonight.
>
> --
> Mark

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2010-12-08 17:31:49

by Mark Hills

Subject: Re: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)

On Wed, 8 Dec 2010, Chuck Lever wrote:

> On Dec 8, 2010, at 9:45 AM, Mark Hills wrote:
>
> > On Wed, 1 Dec 2010, Chuck Lever wrote:
> >
> >> On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:
> >>
> >>> On Tue, 30 Nov 2010, J. Bruce Fields wrote:
> >>>
> >>>> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
> >>> [...]
> >>>>> Our investigation brings us to rpc.mountd and mount.nfs communicating. In
> >>>>> the client log we see messages like:
> >>>>>
> >>>>> Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
> >>>>>
> >>>>> Using strace and isolating one of these, I can see a non-blocking connect
> >>>>> has already managed to make a connection and even send/receive some data.
> >>>>>
> >>>>> But soon a timeout of 9999 milliseconds in poll() causes a problem in
> >>>>> mount.nfs when waiting for a response of some sort. The socket in question
> >>>>> is a connection to mountd:
> >>>>>
> >>>>> 26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
> >>>>> 26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
> >>>>> 26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
> >>>>>
> >>>>> When it returns:
> >>>>>
> >>>>> 26512 <... poll resumed> ) = 0 (Timeout)
> >>>>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> >>>>> 26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
> >>>>> 26512 close(3) = 0
> >>>>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> >>>>> 26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
> >>>>>
> >>>>> There's no re-try from here, just a failed mount.
> > [...]
> >>>> Can you tell where the mountd threads are typically waiting?
> >>>
> >>> Here's a trace from mountd. Note the long pause after fdatasync():
> >>>
> >>> 31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
> >>> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> >>> 31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
> >>> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> >>> 31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
> >>> 31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
> >>> 31156 12:09:40 fdatasync(14) = 0 <15.413401>
> > [...]
> >> fdatasync() will write all outstanding dirty data on that partition to
> >> persistent storage, not just the dirty data for that file. If /var is
> >> on a partition that has a lot of other activity, then this is the
> >> expected outcome, unfortunately. I would expect such delays to be due
> >> to a large amount of outstanding writes, not to periodic synchronous
> >> writes.
> >
> > Moving rmtab to a tmpfs, this moved the problem to another bottleneck
> > writing to /var, this time via syslog.
> >
> > We identified a crontab entry which uses RPM to apply updates to the
> > server OS at randomised times which was blocking /var for 10 seconds or
> > so.
>
> I think this is closer to the root cause than the mount.nfs connect
> timeout. I might consider RPM or syslog to be the final culprit, but I
> wonder why a multi-threaded mountd would stop servicing TCP connects
> because it couldn't write to /var or to syslog?

Of course this is the root cause, but a momentary bottleneck at the server
isn't such a rare occurrence, particularly when servicing bursts from a large
HPC cluster.

> > This returns the issue to giving up after the 9999 millisecond timeout
> > above, which I think is making the client overly sensitive.
> >
> > The affected clients are Fedora 12 distribution: nfs-utils-1.2.1-4.fc12.
> > We are using mount.nfs for a v3 mount over TCP.
> >
> > Locating this timeout wasn't easy. There's no '9999' in any of the
> > nfs-utils or glibc's rpc code.
> >
> > It turns out that the timeout originates in
> > nfs-utils/support/nfs/rpc_socket.c as
> >
> > #define NFSRPC_TIMEOUT_TCP (10)
> >
> > This is used to populate a struct timeval *timeout which is used and
> > updated by nfs_connect_nb(). Then the remaining timeout of 9999
> > milliseconds is used in an RPC clnt_call operation.
> >
> > 10 seconds over TCP is simply too short to be reliable here; it's not a
> > long period of time for a fatal timeout. I will attempt to test on our
> > cluster with an extended timeout.
>
> This is a TCP connect timeout, not a RPC retransmit timeout. Waiting 10
> seconds for a TCP connect is really not that obnoxious. The reason this
> is short is because we want mount.nfs to realize quickly that it's
> making no progress on stuck servers.

If my understanding of the code is correct, the 10 second timeout is not
just for the connect, but also for making an RPC call and receiving the
response (the struct timeval is carried through). Hence the diagnosis in
the previous mail.
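
For clarity, a rough sketch of how I read the budget handling (illustrative
only, not a quote of rpc_socket.c; the function name here is made up):

#include <poll.h>
#include <sys/time.h>
#include <unistd.h>

/* One timeval acts as a shared budget: nfs_connect_nb() has already been
 * charged against *budget by the time we get here, and whatever is left
 * bounds the wait for the RPC reply. */
static int send_and_wait(int sock, const void *req, size_t len,
                         struct timeval *budget)
{
        struct pollfd pfd = { .fd = sock, .events = POLLIN };
        int remaining_ms = budget->tv_sec * 1000 + budget->tv_usec / 1000;

        if (write(sock, req, len) != (ssize_t)len)
                return -1;

        /* A ~10s budget minus the ~1ms already spent connecting gives the
         * 9999ms poll() seen in the client strace. */
        return poll(&pfd, 1, remaining_ms) > 0 ? 0 : -1;
}

So a mountd that spends 15 seconds in fdatasync() exhausts the whole budget
even though the TCP connect itself succeeded quickly.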

> > I presume that the bug is that the timeout is too short, and not that
> > the timeout should cause a retry in a layer above -- because
> > re-sending the request over reliable TCP would not make sense.
>
> The upper layer wouldn't resend the request, since the underlying
> transport never connected.
>
> > For UDP the retries are handled inside the RPC call, clntudp_call().
> > In which case, what is the reason the UDP timeout differs from TCP?
> > And is it also too small?
>
> RPC over UDP doesn't have to connect to the server. I wonder what would
> happen if you specified "mountproto=udp" when mounting exports on this
> overloaded server.

I think we'd still see the same issue over UDP, perhaps even more brutally
because of the shorter 3 second UDP timeout.
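
For comparison, the retransmit-inside-the-call pattern over UDP looks
roughly like this (a generic sketch, not the clntudp_call() code; the
function name and intervals are illustrative):

#include <poll.h>
#include <stdbool.h>
#include <sys/socket.h>
#include <unistd.h>

/* Retransmit the request at a fixed interval until a reply arrives or the
 * total timeout is used up.  If the overall budget is only a few seconds,
 * a 15-second stall in mountd still loses, retransmissions or not. */
static bool udp_call(int sock, const void *req, size_t len,
                     int per_try_ms, int total_ms)
{
        struct pollfd pfd = { .fd = sock, .events = POLLIN };
        int waited = 0;

        while (waited < total_ms) {
                if (send(sock, req, len, 0) < 0)
                        return false;
                if (poll(&pfd, 1, per_try_ms) > 0)
                        return true;    /* reply ready to read */
                waited += per_try_ms;
        }
        return false;
}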

To clarify, we've deviated from the problem described in the original
email; it is not opening a TCP connection to mountd that is the problem, but
mountd's ability to send a response in time.

> We could consider making the TCP connect timeout value larger, but the
> price would be a mount.nfs command that is less responsive to hung
> servers. For example, the legacy get_socket() code uses a 20 second TCP
> connect timeout for the pmap query, and a 30 second TCP connect timeout
> for the mount request.
>
> But how large would be large enough? If we chose a TCP connect timeout
> of 25 seconds, what would happen if your server required, say, 30
> seconds to accept a new TCP connection? I think we would agree that the
> server side should be fixed in that case.

Whilst these timeouts could have resolved our problem, they are perhaps
still too short a basis for declaring a server fatally down, particularly if
we have already managed to open a TCP connection.

We've certainly seen the problem in the wild on a fairly 'standard' RHEL 5.4
Linux server. In this case the trigger was rpm, so what if a sysadmin needs
to run 'rpm' to perform some administration? They can't do so without high
risk to the clients.

> IMO adding another mount option would be mistaken. It adds clutter to
> the mount user interface to work around what is clearly a problem on the
> server.

I agree that new options can be cluttered.

But I think the problem needs attention on both the client and server side.
In this case the cause of our load spike was something that we hope can be
avoided, but I can't eliminate the risk. When serving many hundreds of
clients we expect hot-spots of load, especially when a single user has
control over large groups of client hosts.

Furthermore, the consequences of a mount failure are severe, because autofs
caches the failed mount and presents an empty directory for some time
afterwards.

--
Mark

2010-12-08 20:34:41

by Chuck Lever III

[permalink] [raw]
Subject: Re: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)


On Dec 8, 2010, at 1:37 PM, J. Bruce Fields wrote:

> On Wed, Dec 08, 2010 at 01:28:06PM -0500, Chuck Lever wrote:
>> From what I have read, the server's configuration is the immediate problem here: namely that the disk that /var is on is slow, or there are bugs (or not-very-useful behavior) in the file system implementation used on /var, or that rpm should not cause such latency for other applications, or that the multi-threading architecture of mountd is incorrect. Or some combination of these.
>>
>> A ten second latency for file system operations is the real issue.
>>
>>>> IMO adding another mount option would be mistaken. It adds clutter to
>>>> the mount user interface to work around what is clearly a problem on the
>>>> server.
>>>
>>> I agree that new options can be cluttered.
>>>
>>> But I think the problem needs attention on both the client and server side.
>>
>> It seems reasonable to relax mount.nfs's timeout settings used when performing mount and rpcbind requests. The question I have is what would be reasonable values to use instead of what we have? (That question is directed to everyone who is still reading).
>
> What's the behavior you want when the server is just off or unreachable?
>
> I'd've thought that there are cases where you just want the mount to
> hang till it's interrupted or the problem is fixed. (Autofs possibly
> being one of them.)

Perhaps we should retry in this case until the mount succeeds or the retry= timeout expires. Then the value of NFSRPC_TIMEOUT_TCP would be inconsequential.
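
Roughly this shape (a sketch only, not a patch; try_mnt_call() is a
hypothetical stand-in for one connect-plus-MNT attempt with the existing
short per-attempt timeout):

#include <stdbool.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical helper: one attempt, bounded by the current short timeout. */
extern bool try_mnt_call(void);

static bool mount_with_retry(time_t retry_deadline)  /* derived from retry= */
{
        unsigned int backoff = 1;

        while (time(NULL) < retry_deadline) {
                if (try_mnt_call())
                        return true;            /* mounted */

                /* Some delay (ideally randomised) between attempts keeps a
                 * large cluster from retrying in lock-step. */
                sleep(backoff);
                if (backoff < 60)
                        backoff *= 2;
        }
        return false;                           /* retry= window expired */
}

With that, NFSRPC_TIMEOUT_TCP only bounds a single attempt, and retry=
decides when mount.nfs finally gives up.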

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2010-12-08 18:37:41

by J. Bruce Fields

[permalink] [raw]
Subject: Re: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)

On Wed, Dec 08, 2010 at 01:28:06PM -0500, Chuck Lever wrote:
> From what I have read, the server's configuration is the immediate problem here: namely that the disk that /var is on is slow, or there are bugs (or not-very-useful behavior) in the file system implementation used on /var, or that rpm should not cause such latency for other applications, or that the multi-threading architecture of mountd is incorrect. Or some combination of these.
>
> A ten second latency for file system operations is the real issue.
>
> >> IMO adding another mount option would be mistaken. It adds clutter to
> >> the mount user interface to work around what is clearly a problem on the
> >> server.
> >
> > I agree that new options can be cluttered.
> >
> > But I think the problem needs attention on both the client and server side.
>
> It seems reasonable to relax mount.nfs's timeout settings used when performing mount and rpcbind requests. The question I have is what would be reasonable values to use instead of what we have? (That question is directed to everyone who is still reading).

What's the behavior you want when the server is just off or unreachable?

I'd've thought that there are cases where you just want the mount to
hang till it's interrupted or the problem is fixed. (Autofs possibly
being one of them.)

--b.

2010-12-08 21:04:41

by Chuck Lever III

[permalink] [raw]
Subject: Re: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)


On Dec 8, 2010, at 1:37 PM, J. Bruce Fields wrote:

> On Wed, Dec 08, 2010 at 01:28:06PM -0500, Chuck Lever wrote:
>> From what I have read, the server's configuration is the immediate problem here: namely that the disk that /var is on is slow, or there are bugs (or not-very-useful behavior) in the file system implementation used on /var, or that rpm should not cause such latency for other applications, or that the multi-threading architecture of mountd is incorrect. Or some combination of these.
>>
>> A ten second latency for file system operations is the real issue.
>>
>>>> IMO adding another mount option would be mistaken. It adds clutter to
>>>> the mount user interface to work around what is clearly a problem on the
>>>> server.
>>>
>>> I agree that new options can be cluttered.
>>>
>>> But I think the problem needs attention on both the client and server side.
>>
>> It seems reasonable to relax mount.nfs's timeout settings used when performing mount and rpcbind requests. The question I have is what would be reasonable values to use instead of what we have? (That question is directed to everyone who is still reading).
>
> What's the behavior you want when the server is just off or unreachable?
>
> I'd've thought that there are cases where you just want the mount to
> hang till it's interrupted or the problem is fixed. (Autofs possibly
> being one of them.)

Be it noted however that autofs usually wants a quick yes or no answer in these cases. Auto mounts that take a long time to succeed usually drive users crazy, especially if the automounter is single-threaded. That's why this timeout is kept short.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com





2010-12-08 18:29:50

by Chuck Lever III

[permalink] [raw]
Subject: Re: mount.nfs timeout of 9999ms (was Re: Listen backlog set to 64)


On Dec 8, 2010, at 12:31 PM, Mark Hills wrote:

> On Wed, 8 Dec 2010, Chuck Lever wrote:
>
>> On Dec 8, 2010, at 9:45 AM, Mark Hills wrote:
>>
>>> On Wed, 1 Dec 2010, Chuck Lever wrote:
>>>
>>>> On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:
>>>>
>>>>> On Tue, 30 Nov 2010, J. Bruce Fields wrote:
>>>>>
>>>>>> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
>>>>> [...]
>>>>>>> Our investigation brings us to rpc.mountd and mount.nfs communicating. In
>>>>>>> the client log we see messages like:
>>>>>>>
>>>>>>> Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
>>>>>>>
>>>>>>> Using strace and isolating one of these, I can see a non-blocking connect
>>>>>>> has already managed to make a connection and even send/receive some data.
>>>>>>>
>>>>>>> But soon a timeout of 9999 milliseconds in poll() causes a problem in
>>>>>>> mount.nfs when waiting for a response of some sort. The socket in question
>>>>>>> is a connection to mountd:
>>>>>>>
>>>>>>> 26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
>>>>>>> 26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
>>>>>>> 26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
>>>>>>>
>>>>>>> When it returns:
>>>>>>>
>>>>>>> 26512 <... poll resumed> ) = 0 (Timeout)
>>>>>>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>>>>>> 26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
>>>>>>> 26512 close(3) = 0
>>>>>>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
>>>>>>> 26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
>>>>>>>
>>>>>>> There's no re-try from here, just a failed mount.
>>> [...]
>>>>>> Can you tell where the mountd threads are typically waiting?
>>>>>
>>>>> Here's a trace from mountd. Note the long pause after fdatasync():
>>>>>
>>>>> 31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
>>>>> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
>>>>> 31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
>>>>> 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
>>>>> 31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
>>>>> 31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
>>>>> 31156 12:09:40 fdatasync(14) = 0 <15.413401>
>>> [...]
>>>> fdatasync() will write all outstanding dirty data on that partition to
>>>> persistent storage, not just the dirty data for that file. If /var is
>>>> on a partition that has a lot of other activity, then this is the
>>>> expected outcome, unfortunately. I would expect such delays to be due
>>>> to a large amount of outstanding writes, not to periodic synchronous
>>>> writes.
>>>
>>> Moving rmtab to a tmpfs, this moved the problem to another bottleneck
>>> writing to /var, this time via syslog.
>>>
>>> We identified a crontab entry which uses RPM to apply updates to the
>>> server OS at randomised times which was blocking /var for 10 seconds or
>>> so.
>>
>> I think this is closer to the root cause than the mount.nfs connect
>> timeout. I might consider RPM or syslog to be the final culprit, but I
>> wonder why a multi-threaded mountd would stop servicing TCP connects
>> because it couldn't write to /var or to syslog?
>
> Of course this is the root cause, but a momentary bottleneck at the server
> isn't such a rare occurrence, particularly when servicing bursts from a large
> HPC cluster.

Fwiw, the solution on enterprise storage servers (typically used in HPC cluster environments) has been to improve mountd so that momentary server system bottlenecks do not cause sporadic mountd service. That seems like the larger part of an appropriate response to this problem on Linux.
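
One possible shape for such a change, sketched very loosely (this is not a
patch and not how rpc.mountd is structured today; every name below is made
up): hand rmtab updates to a single writer thread so the threads answering
MNT requests never block on fdatasync().

#include <pthread.h>
#include <stdlib.h>
#include <string.h>

struct rmtab_entry {
        struct rmtab_entry *next;
        char line[128];
};

static struct rmtab_entry *pending;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t wake = PTHREAD_COND_INITIALIZER;

/* Called from a service thread: O(1) and never touches /var, so a MNT
 * reply is never held up by a slow disk. */
void rmtab_append_async(const char *line)
{
        struct rmtab_entry *e = calloc(1, sizeof(*e));

        if (!e)
                return;         /* best effort; rmtab is advisory anyway */
        strncpy(e->line, line, sizeof(e->line) - 1);

        pthread_mutex_lock(&lock);
        e->next = pending;
        pending = e;
        pthread_cond_signal(&wake);
        pthread_mutex_unlock(&lock);
}

/* Dedicated writer thread: the only place that ever blocks on fdatasync().
 * (Ordering, batching and shutdown handling are elided.) */
void *rmtab_writer(void *arg)
{
        (void)arg;
        for (;;) {
                struct rmtab_entry *batch, *next;

                pthread_mutex_lock(&lock);
                while (!pending)
                        pthread_cond_wait(&wake, &lock);
                batch = pending;
                pending = NULL;
                pthread_mutex_unlock(&lock);

                for (; batch; batch = next) {
                        next = batch->next;
                        /* append batch->line to /var/lib/nfs/rmtab here,
                         * then fdatasync() once per batch */
                        free(batch);
                }
        }
        return NULL;
}

Whether the update is batched, deferred or dropped entirely is a separate
question; the point is only to keep slow /var I/O off the thread that has
to answer the client within its ten-second window.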

>>> This returns the issue to giving up after the 9999 millisecond timeout
>>> above, which I think is making the client overly sensitive.
>>>
>>> The affected clients are Fedora 12 distribution: nfs-utils-1.2.1-4.fc12.
>>> We are using mount.nfs for a v3 mount over TCP.
>>>
>>> Locating this timeout wasn't easy. There's no '9999' in any of the
>>> nfs-utils or glibc's rpc code.
>>>
>>> It turns out that the timeout originates in
>>> nfs-utils/support/nfs/rpc_socket.c as
>>>
>>> #define NFSRPC_TIMEOUT_TCP (10)
>>>
>>> This is used to populate a struct timeval *timeout which is used and
>>> updated by nfs_connect_nb(). Then the remaining timeout of 9999
>>> milliseconds is used in an RPC clnt_call operation.
>>>
>>> 10 seconds over TCP is simply too short to be reliable here; it's not a
>>> long period of time for a fatal timeout. I will attempt to test on our
>>> cluster with an extended timeout.
>>
>> This is a TCP connect timeout, not a RPC retransmit timeout. Waiting 10
>> seconds for a TCP connect is really not that obnoxious. The reason this
>> is short is because we want mount.nfs to realize quickly that it's
>> making no progress on stuck servers.
>
> If my understanding of the code is correct, the 10 second timeout is not
> just for the connect, but also for making an RPC call and receiving the
> response (the struct timeval is carried through). Hence the diagnosis in
> the previous mail.
>
>>> I presume that the bug is that the timeout is too short, and not that
>>> the timeout should cause a retry in a layer above -- because
>>> re-sending the request over reliable TCP would not make sense.
>>
>> The upper layer wouldn't resend the request, since the underlying
>> transport never connected.
>>
>>> For UDP the retries are handled inside the RPC call, clntudp_call().
>>> In which case, what is the reason the UDP timeout differs from TCP?
>>> And is it also too small?
>>
>> RPC over UDP doesn't have to connect to the server. I wonder what would
>> happen if you specified "mountproto=udp" when mounting exports on this
>> overloaded server.
>
> I think we'd still see the same issue over UDP, perhaps even more brutally
> because of the shorter 3 second UDP timeout.
>
> To clarify, we've deviated from the problem described in the original
> email; it is not opening a TCP connection to mountd that is the problem, but
> mountd's ability to send a response in time.

So, you've verified (with, say, a network trace) that the clients are connecting successfully, but mountd is not sending a reply to the MNT request within ten seconds?

>> We could consider making the TCP connect timeout value larger, but the
>> price would be a mount.nfs command that is less responsive to hung
>> servers. For example, the legacy get_socket() code uses a 20 second TCP
>> connect timeout for the pmap query, and a 30 second TCP connect timeout
>> for the mount request.
>>
>> But how large would be large enough? If we chose a TCP connect timeout
>> of 25 seconds, what would happen if your server required, say, 30
>> seconds to accept a new TCP connection? I think we would agree that the
>> server side should be fixed in that case.
>
> Whilst these timeouts could have resolved our problem, they are perhaps
> still too short a basis for declaring a server fatally down, particularly if
> we have already managed to open a TCP connection.

Perhaps. If mountd is hung or otherwise unresponsive, a TCP connection could be open but the service may still be unresponsive for some reason. But that's a nit.

> We've certainly seen the problem in the wild on a fairly 'standard' RHEL 5.4
> Linux server. In this case the trigger was rpm, so what if a sysadmin needs
> to run 'rpm' to perform some administration? They can't do so without high
> risk to the clients.

From what I have read, the server's configuration is the immediate problem here: namely that the disk that /var is on is slow, or there are bugs (or not-very-useful behavior) in the file system implementation used on /var, or that rpm should not cause such latency for other applications, or that the multi-threading architecture of mountd is incorrect. Or some combination of these.

A ten second latency for file system operations is the real issue.

>> IMO adding another mount option would be mistaken. It adds clutter to
>> the mount user interface to work around what is clearly a problem on the
>> server.
>
> I agree that new options can be cluttered.
>
> But I think the problem needs attention on both the client and server side.

It seems reasonable to relax mount.nfs's timeout settings used when performing mount and rpcbind requests. The question I have is what would be reasonable values to use instead of what we have? (That question is directed to everyone who is still reading).

> In this case the cause of our load spike was something that we hope can be
> avoided, but I can't eliminate the risk. When serving many hundreds of
> clients we expect hot-spots of load, especially when a single user has
> control over large groups of client hosts.

Yes, that's a typical problem in compute clusters that use NFS.

> Furthermore, the consequences of a mount failure are severe, because autofs
> caches the failed mount and presents an empty directory for some time
> afterwards.

That's a good argument to get mountd changed so that it can better handle such mount bursts.

--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com




2010-12-01 18:18:20

by Mark Hills

[permalink] [raw]
Subject: Re: Listen backlog set to 64

On Tue, 30 Nov 2010, J. Bruce Fields wrote:

> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
[...]
> > Our investigation brings us to rpc.mountd and mount.nfs communicating. In
> > the client log we see messages like:
> >
> > Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
> >
> > Using strace and isolating one of these, I can see a non-blocking connect
> > has already managed to make a connection and even send/receive some data.
> >
> > But soon a timeout of 9999 milliseconds in poll() causes a problem in
> > mount.nfs when waiting for a response of some sort. The socket in question
> > is a connection to mountd:
> >
> > 26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
> > 26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
> > 26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
> >
> > When it returns:
> >
> > 26512 <... poll resumed> ) = 0 (Timeout)
> > 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > 26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
> > 26512 close(3) = 0
> > 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> > 26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
> >
> > There's no re-try from here, just a failed mount.
>
> That does sound wrong. I'm not at all familiar with automount,
> unfortunately; how is it invoking mount.nfs?
>
> > What is the source of this 9999 millisecond timeout used by poll() in
> > mount.nfs? It was not clear in an initial search of nfs-utils and glibc,
> > but I need more time to investigate.
> >
> > If the server is being too slow to respond, what could the cause of this
> > be? Multiple threads are already in use, but it seems like they are not
> > all in use because a thread is able to accept() the connection. I haven't
> > been able to pin this on the forward/reverse DNS lookup used by
> > authentication and logging.
>
> Can you tell where the mountd threads are typically waiting?

Here's a trace from mountd. Note the long pause after fdatasync():

31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
31156 12:09:40 fdatasync(14) = 0 <15.413401>
31156 12:09:55 close(14) = 0 <0.000008>
31156 12:09:55 munmap(0x2af00c929000, 4096) = 0 <0.000083>
31156 12:09:55 close(13) = 0 <0.000008>

This pause exceeds the mysterious 9999 millisecond timeout allowed at the
client.

fdatasync() is happening at every write of rmtab -- a lot. Occasionally it
takes a long time. The syncing of rmtab seems a reasonable explanation of
why adding more threads does not help -- all mountd threads lock up.

The strace I am analysing here contains 670 calls to fdatasync() in a 16
minute period. All of them take < 1 second, except for this burst:

31156 12:09:11 fdatasync(14) = 0 <0.000010>
31156 12:09:11 fdatasync(15) = 0 <0.136285>
31156 12:09:12 fdatasync(14) = 0 <0.000010>
31156 12:09:12 fdatasync(15) = 0 <14.911323>
31156 12:09:27 fdatasync(14) = 0 <1.298896>
31156 12:09:29 fdatasync(14) = 0 <1.159262>
31156 12:09:33 fdatasync(14) = 0 <1.447169>
31156 12:09:35 fdatasync(14) = 0 <1.345562>
31156 12:09:40 fdatasync(14) = 0 <0.000010>
31156 12:09:40 fdatasync(14) = 0 <15.413401>
31156 12:09:55 fdatasync(17) = 0 <0.000010>
31156 12:09:55 fdatasync(17) = 0 <1.068503>
31156 12:09:57 fdatasync(18) = 0 <0.000010>
31156 12:09:57 fdatasync(18) = 0 <0.089682>

/var/lib/nfs/rmtab[.tmp] is on an ext3 filesystem, the system disk.

The server is only acting as a 'regular' and dedicated file server, but
presumably some other process is causing /var to become erratic. Unless
continual writing and syncing to a small (64KiB) file could cause this to
happen periodically?

In the meantime we have symlink'd /var/lib/nfs/rmtab[.tmp] to a tmpfs
mount to see if that helps, and will leave it running tonight.

--
Mark

2010-12-01 18:47:00

by J. Bruce Fields

[permalink] [raw]
Subject: Re: Listen backlog set to 64

On Wed, Dec 01, 2010 at 01:28:40PM -0500, Chuck Lever wrote:
>
> On Dec 1, 2010, at 1:18 PM, Mark Hills wrote:
>
> > On Tue, 30 Nov 2010, J. Bruce Fields wrote:
> >
> >> On Tue, Nov 30, 2010 at 05:50:52PM +0000, Mark Hills wrote:
> > [...]
> >>> Our investigation brings us to rpc.mountd and mount.nfs communicating. In
> >>> the client log we see messages like:
> >>>
> >>> Nov 24 12:09:43 nyrd001 automount[3782]: >> mount.nfs: mount to NFS server 'ss1a:/mnt/raid1/banana' failed: timed out, giving up
> >>>
> >>> Using strace and isolating one of these, I can see a non-blocking connect
> >>> has already managed to make a connection and even send/receive some data.
> >>>
> >>> But soon a timeout of 9999 milliseconds in poll() causes a problem in
> >>> mount.nfs when waiting for a response of some sort. The socket in question
> >>> is a connection to mountd:
> >>>
> >>> 26512 futex(0x7ff76affa540, FUTEX_WAKE_PRIVATE, 1) = 0
> >>> 26512 write(3, "\200\0\0(j\212\254\365\0\0\0\0\0\0\0\2\0\1\206\245\0\0\0\3\0\0\0\0\0\0\0\0"..., 44) = 44
> >>> 26512 poll([{fd=3, events=POLLIN}], 1, 9999 <unfinished ...>
> >>>
> >>> When it returns:
> >>>
> >>> 26512 <... poll resumed> ) = 0 (Timeout)
> >>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> >>> 26512 rt_sigprocmask(SIG_SETMASK, ~[RTMIN RT_1], [], 8) = 0
> >>> 26512 close(3) = 0
> >>> 26512 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
> >>> 26512 write(2, "mount.nfs: mount to NFS server '"..., 100) = 100
> >>>
> >>> There's no re-try from here, just a failed mount.
> >>
> >> That does sound wrong. I'm not at all familiar with automount,
> >> unfortunately; how is it invoking mount.nfs?
> >>
> >>> What is the source of this 9999 millisecond timeout used by poll() in
> >>> mount.nfs? It was not clear in an initial search of nfs-utils and glibc,
> >>> but I need more time to investigate.
> >>>
> >>> If the server is being too slow to respond, what could the cause of this
> >>> be? Multiple threads are already in use, but it seems like they are not
> >>> all in use because a thread is able to accept() the connection. I haven't
> >>> been able to pin this on the forward/reverse DNS lookup used by
> >>> authentication and logging.
> >>
> >> Can you tell where the mountd threads are typically waiting?
> >
> > Here's a trace from mountd. Note the long pause after fdatasync():
> >
> > 31156 12:09:40 open("/var/lib/nfs/rmtab", O_WRONLY|O_CREAT|O_APPEND, 0666) = 14 <0.000010>
> > 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> > 31156 12:09:40 mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2af00c929000 <0.000007>
> > 31156 12:09:40 fstat(14, {st_mode=S_IFREG|0644, st_size=62642, ...}) = 0 <0.000005>
> > 31156 12:09:40 lseek(14, 62642, SEEK_SET) = 62642 <0.000006>
> > 31156 12:09:40 write(14, "192.168.14.165:/mnt/raid1/abcde_"..., 57) = 57 <0.000013>
> > 31156 12:09:40 fdatasync(14) = 0 <15.413401>
> > 31156 12:09:55 close(14) = 0 <0.000008>
> > 31156 12:09:55 munmap(0x2af00c929000, 4096) = 0 <0.000083>
> > 31156 12:09:55 close(13) = 0 <0.000008>
> >
> > This pause exceeds the mysterious 9999 millisecond timeout allowed at the
> > client.
> >
> > fdatasync() is happening at every write of rmtab -- a lot. Occasionally it
> > takes a long time. The syncing of rmtab seems a reasonable explanation of
> > why adding more threads does not help -- all mountd threads lock up.
> >
> > The strace I am analysing here contains 670 calls to fdatasync() in a 16
> > minute period. All of them take < 1 second, except for this burst:
> >
> > 31156 12:09:11 fdatasync(14) = 0 <0.000010>
> > 31156 12:09:11 fdatasync(15) = 0 <0.136285>
> > 31156 12:09:12 fdatasync(14) = 0 <0.000010>
> > 31156 12:09:12 fdatasync(15) = 0 <14.911323>
> > 31156 12:09:27 fdatasync(14) = 0 <1.298896>
> > 31156 12:09:29 fdatasync(14) = 0 <1.159262>
> > 31156 12:09:33 fdatasync(14) = 0 <1.447169>
> > 31156 12:09:35 fdatasync(14) = 0 <1.345562>
> > 31156 12:09:40 fdatasync(14) = 0 <0.000010>
> > 31156 12:09:40 fdatasync(14) = 0 <15.413401>
> > 31156 12:09:55 fdatasync(17) = 0 <0.000010>
> > 31156 12:09:55 fdatasync(17) = 0 <1.068503>
> > 31156 12:09:57 fdatasync(18) = 0 <0.000010>
> > 31156 12:09:57 fdatasync(18) = 0 <0.089682>
> >
> > /var/lib/nfs/rmtab[.tmp] is on an ext3 filesystem, the system disk.
> >
> > The server is only acting as a 'regular' and dedicated file server, but
> > presumably some other process is causing /var to become erratic. Unless
> > continual writing and syncing to a small (64KiB) file could cause this to
> > happen periodically?
>
> fdatasync() will write all outstanding dirty data on that partition to persistent storage, not just the dirty data for that file. If /var is on a partition that has a lot of other activity, then this is the expected outcome, unfortunately. I would expect such delays to be due to a large amount of outstanding writes, not to periodic synchronous writes.
>
> This problem may be addressed in recent versions of nfs-utils. I see a commit from Ben Myers on February 12 of this year that seems on point. nfs-utils-1.2.2 has it, I think?

Doh, yes, looks like the code I quoted went into nfs-utils 1.2.2 with
Ben's commit.

--b.