2013-01-25 00:59:31

by Ben Myers

Subject: sunrpc: socket buffer size tuneable

At 1020 threads the send buffer size wraps and becomes negative causing
the nfs server to grind to a halt. Rather than setting bufsize based
upon the number of nfsd threads, make the buffer sizes tuneable via
module parameters.

Set the buffer sizes in terms of the number of rpcs you want to fit into
the buffer.

Signed-off-by: Ben Myers <[email protected]>
---

v2: moved tuneables to net/sunrpc/svcsock.c, allow user to lock buffer
size

net/sunrpc/sunrpc_syms.c | 39 +++++++++++++++++++++++++++++++++++
net/sunrpc/svcsock.c | 52 ++++++++++++++++++++++++++---------------------
2 files changed, 68 insertions(+), 23 deletions(-)

Index: xfs/net/sunrpc/svcsock.c
===================================================================
--- xfs.orig/net/sunrpc/svcsock.c
+++ xfs/net/sunrpc/svcsock.c
@@ -57,6 +57,47 @@

#define RPCDBG_FACILITY RPCDBG_SVCXPRT

+/*
+ * Size each socket's send and receive buffers based upon the number of rpcs
+ * we'd like to be able to fit into them. The TCP rcvbuf is enough for just a
+ * few requests. TCP sndbuf is the same as the tcp_slot_table_entries default,
+ * the maximum number of in-flight rpcs for a client. UDP sndbuf and rcvbuf
+ * defaults are the same as the udp_slot_table_entries default.
+ *
+ * Setting these too small can cause nfsd threads to block when sending.
+ */
+static int tcp_rcvbuf_nrpc = 8;
+static int tcp_sndbuf_nrpc = 8;
+
+static int udp_rcvbuf_nrpc = 6;
+static int udp_sndbuf_nrpc = 6;
+
+static bool lock_bufsize = false;
+
+/*
+ * Minimum for bufsize settings is enough space for 6 rpcs, maximum is enough
+ * space for 128 rpcs.
+ */
+static int param_set_bufsize_nrpc(const char *val, struct kernel_param *kp)
+{
+ char *endp;
+ int num = simple_strtol(val, &endp, 0);
+
+ if (endp == val || (*endp && *endp != '\n') || num < 6 || num > 128)
+ return -EINVAL;
+ *((int *)kp->arg) = num;
+ return 0;
+}
+
+module_param_call(tcp_rcvbuf_nrpc, param_set_bufsize_nrpc, param_get_int,
+ &tcp_rcvbuf_nrpc, 0644);
+module_param_call(tcp_sndbuf_nrpc, param_set_bufsize_nrpc, param_get_int,
+ &tcp_sndbuf_nrpc, 0644);
+module_param_call(udp_rcvbuf_nrpc, param_set_bufsize_nrpc, param_get_int,
+ &udp_rcvbuf_nrpc, 0644);
+module_param_call(udp_sndbuf_nrpc, param_set_bufsize_nrpc, param_get_int,
+ &udp_sndbuf_nrpc, 0644);
+module_param(lock_bufsize, bool, 0644);

static struct svc_sock *svc_setup_socket(struct svc_serv *, struct socket *,
int flags);
@@ -375,8 +416,8 @@ static int svc_partial_recvfrom(struct s
/*
* Set socket snd and rcv buffer lengths
*/
-static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
- unsigned int rcv)
+static void svc_sock_setbufsize(struct socket *sock, int snd, int rcv,
+ bool lock)
{
#if 0
mm_segment_t oldfs;
@@ -392,8 +433,17 @@ static void svc_sock_setbufsize(struct s
* DaveM said I could!
*/
lock_sock(sock->sk);
- sock->sk->sk_sndbuf = snd * 2;
- sock->sk->sk_rcvbuf = rcv * 2;
+ sock->sk->sk_sndbuf = snd;
+ sock->sk->sk_rcvbuf = rcv;
+
+ if (lock) {
+ /*
+ * The socket buffer can be resized by the networking code
+ * unless you specify that this is not to be done.
+ */
+ sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
+ }
+
sock->sk->sk_write_space(sock->sk);
release_sock(sock->sk);
#endif
@@ -525,18 +575,17 @@ static int svc_udp_recvfrom(struct svc_r
size_t len;
int err;

- if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags))
- /* udp sockets need large rcvbuf as all pending
- * requests are still in that buffer. sndbuf must
- * also be large enough that there is enough space
- * for one reply per thread. We count all threads
- * rather than threads in a particular pool, which
- * provides an upper bound on the number of threads
- * which will access the socket.
- */
- svc_sock_setbufsize(svsk->sk_sock,
- (serv->sv_nrthreads+3) * serv->sv_max_mesg,
- (serv->sv_nrthreads+3) * serv->sv_max_mesg);
+ if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags)) {
+ /*
+ * UDP sockets need a large rcvbuf as all pending requests are
+ * still in that buffer. sndbuf must also be large enough to
+ * fit a reply for each request the server could be processing.
+ */
+ svc_sock_setbufsize(svsk->sk_sock,
+ udp_sndbuf_nrpc * serv->sv_max_mesg,
+ udp_rcvbuf_nrpc * serv->sv_max_mesg,
+ lock_bufsize);
+ }

clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
skb = NULL;
@@ -695,13 +744,14 @@ static void svc_udp_init(struct svc_sock
svsk->sk_sk->sk_data_ready = svc_udp_data_ready;
svsk->sk_sk->sk_write_space = svc_write_space;

- /* initialise setting must have enough space to
- * receive and respond to one request.
- * svc_udp_recvfrom will re-adjust if necessary
+ /*
+ * Initial setting must have enough space to receive and respond to one
+ * request. svc_udp_recvfrom will readjust if necessary.
*/
svc_sock_setbufsize(svsk->sk_sock,
- 3 * svsk->sk_xprt.xpt_server->sv_max_mesg,
- 3 * svsk->sk_xprt.xpt_server->sv_max_mesg);
+ udp_sndbuf_nrpc * svsk->sk_xprt.xpt_server->sv_max_mesg,
+ udp_rcvbuf_nrpc * svsk->sk_xprt.xpt_server->sv_max_mesg,
+ lock_bufsize);

/* data might have come in before data_ready set up */
set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
@@ -1377,8 +1427,10 @@ static struct svc_sock *svc_setup_socket
/* initialise setting must have enough space to
* receive and respond to one request.
*/
- svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
- 4 * serv->sv_max_mesg);
+ svc_sock_setbufsize(svsk->sk_sock,
+ tcp_sndbuf_nrpc * serv->sv_max_mesg,
+ tcp_rcvbuf_nrpc * serv->sv_max_mesg,
+ lock_bufsize);
svc_tcp_init(svsk, serv);
}
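
For reference, since these are ordinary module parameters with 0644 permissions, they should be settable both at module load time and at runtime once the patch is applied. Assuming the parameters land under the sunrpc module as usual (an assumption, not something the patch states), that would look something like:

	echo 16 > /sys/module/sunrpc/parameters/tcp_sndbuf_nrpc

at runtime, or sunrpc.tcp_sndbuf_nrpc=16 on the kernel command line when sunrpc is built in. Values outside the 6..128 range are rejected by param_set_bufsize_nrpc() above.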



2013-01-25 23:01:04

by Myklebust, Trond

Subject: RE: sunrpc: socket buffer size tuneable

> -----Original Message-----
> From: J. Bruce Fields [mailto:[email protected]]
> Sent: Friday, January 25, 2013 5:35 PM
> To: Myklebust, Trond
> Cc: Ben Myers; Olga Kornievskaia; [email protected]; Jim Rees
> Subject: Re: sunrpc: socket buffer size tuneable
>
> On Fri, Jan 25, 2013 at 10:20:12PM +0000, Myklebust, Trond wrote:
> > > -----Original Message-----
> > > From: J. Bruce Fields [mailto:[email protected]]
> > > Sent: Friday, January 25, 2013 4:57 PM
> > > To: Myklebust, Trond
> > > Cc: Ben Myers; Olga Kornievskaia; [email protected]; Jim
> > > Rees
> > > Subject: Re: sunrpc: socket buffer size tuneable
> > >
> > > On Fri, Jan 25, 2013 at 09:45:12PM +0000, Myklebust, Trond wrote:
> > > > > -----Original Message----- From: J. Bruce Fields
> > > > > [mailto:[email protected]] Sent: Friday, January 25, 2013
> > > > > 4:35 PM
> > > > > To: Myklebust, Trond Cc: Ben Myers; Olga Kornievskaia;
> > > > > [email protected]; Jim Rees Subject: Re: sunrpc: socket
> > > > > buffer size tuneable
> > > > >
> > > > > On Fri, Jan 25, 2013 at 09:29:09PM +0000, Myklebust, Trond wrote:
> > > > > > > -----Original Message----- From: J. Bruce Fields
> > > > > > > [mailto:[email protected]] Sent: Friday, January 25, 2013
> > > > > > > 4:21 PM To: Myklebust, Trond Cc: Ben Myers; Olga
> > > > > > > Kornievskaia; [email protected]; Jim Rees Subject:
> > > > > > > Re: sunrpc: socket buffer size tuneable
> > > > > > >
> > > > > > > On Fri, Jan 25, 2013 at 09:12:55PM +0000, Myklebust, Trond
> > > > > > > wrote:
> > > > > >
> > > > > > > > Why is it not sufficient to clamp the TCP values of 'snd'
> > > > > > > > and 'rcv' using
> > > > > > > sysctl_tcp_wmem/sysctl_tcp_rmem?
> > > > > > > > ...and clamp the UDP values using
> > > > > > > sysctl_[wr]mem_min/sysctl_[wr]mem_max?.
> > > > > > >
> > > > > > > Yeah, I was just looking at that--so, Ben, something like:
> > > > > > >
> > > > > > > echo "1048576 1048576 4194304"
> > > > > > > >/proc/sys/net/ipv4/tcp_wmem
> > > > > > >
> > > > > > > But I'm unclear on some of the details: do we need to set
> > > > > > > the minimum or only the default? And does it need any more
> > > > > > > allowance for protocol overhead?
> > > > > >
> > > > > > I meant adding a check either to svc_sock_setbufsize or to the
> > > > > > 2 call-sites
> > > > > that enforces the above limits.
> > > > >
> > > > > I lost you.
> > > > >
> > > > > It's not svc_sock_setbufsize that's setting too-small values, if
> > > > > that's what you mean.
> > > > >
> > > >
> > > > I understood that the problem was svc_udp_recvfrom() and
> > > > svc_setup_socket() were using negative values in the calls to
> > > > svc_sock_setbufsize(). Looking again at svc_setup_socket(), I
> > > > don't see how that could do so, but svc_udp_recvfrom() definitely
> > > > has potential to cause damage.
> > >
> > > Right, the changelog was confusing, the problem they're actually
> > > hitting is with tcp. Looks like tcp autotuning is decreasing the
> > > send buffer below the size we requested in svc_sock_setbufsize().
> >
> > Yes. As far as I can tell, that is endemic unless you lock the sndbuf size.
> Grep for sk_stream_moderate_sndbuf(), and you'll see what I mean.
>
> Yes. So I guess I'll investigate a little more, then do an amateur attempt at an
> interface to enforce a minimum and see if the network developers think it's a
> reasonable idea.
>
> Alternatively: is there some better strategy for the server here?
>
> It's trying to prevent threads from blocking by refusing to accept more rpc's
> than it has send buffer space to reply to.
>
> Presumably the fear is that all your threads could block trying to get
> responses to a small number of slow clients.
>
> Are there better ways to prevent that?

Why not allow the NFS server to process one request per TCP socket when there is not enough space to send the reply? Then have sock->write_space() plus a work queue handle actually completing the send. The replay cache is still there to save your arse if the connection is broken (as it would be in the normal case).
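
A very rough sketch of what that could look like, purely for illustration (sk_pending_replies, sk_send_work, svc_send_wq, svc_flush_pending_replies() and the function names below are all hypothetical, and the two new fields would have to be added to struct svc_sock; a sketch, not a tested patch):

	/* queued portion of a reply that did not fit in the send buffer */
	struct svc_pending_reply {
		struct list_head	list;
		struct xdr_buf		buf;
	};

	static void svc_tcp_pending_write_space(struct sock *sk)
	{
		struct svc_sock *svsk = sk->sk_user_data;

		/* send buffer space freed up: let a worker resume sending */
		if (svsk && !list_empty(&svsk->sk_pending_replies))
			queue_work(svc_send_wq, &svsk->sk_send_work);
	}

	static void svc_send_work_fn(struct work_struct *work)
	{
		struct svc_sock *svsk = container_of(work, struct svc_sock,
						     sk_send_work);

		/* push queued reply data; requeue whatever still won't fit */
		svc_flush_pending_replies(svsk);
	}

The point being that the thread which accepted the request never blocks in svc_send(): it queues whatever part of the reply the socket will not take, the write_space callback plus the work queue finish the transmission later, and the replay cache covers a connection that drops in the middle.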

Trond

2013-01-25 20:21:09

by J. Bruce Fields

Subject: Re: sunrpc: socket buffer size tuneable


> On Fri, Jan 25, 2013 at 01:29:35PM -0600, Ben Myers wrote:
> > Hey Bruce & Jim & Olga,
> >
> > On Fri, Jan 25, 2013 at 02:16:20PM -0500, Jim Rees wrote:
> > > J. Bruce Fields wrote:
> > >
> > > On Thu, Jan 24, 2013 at 06:59:30PM -0600, Ben Myers wrote:
> > > > At 1020 threads the send buffer size wraps and becomes negative causing
> > > > the nfs server to grind to a halt. Rather than setting bufsize based
> > > > upon the number of nfsd threads, make the buffer sizes tuneable via
> > > > module parameters.
> > > >
> > > > Set the buffer sizes in terms of the number of rpcs you want to fit into
> > > > the buffer.
> > >
> > > From private communication, my understanding is that the original
> > > problem here was due to memory pressure forcing the tcp send buffer size
> > > below the size required to hold a single rpc.
> >
> > Years ago I did see wrapping of the buffer size when tcp was used with many
> > threads. Today's problem is timeouts on a cluster with a heavy read
> > workload... and I seem to remember seeing that the send buffer size was too
> > small.
> >
> > > In which case the important variable here is lock_bufsize, as that's
> > > what prevents the buffer size from going too low.
> >
> > I tested removing the lock of bufsize and did hit the timeouts, so the overflow
> > is starting to look less relevant. I will test your minimal overflow fix to
> > see if this is the case.
>
> The minimal overflow fix did not resolve the timeouts.

OK, thanks, that's expected.

> I will test with this to see if it resolves the timeouts:

And I'd expect that to do the job--but at the expense of some tcp
bandwidth. So you end up needing your other module parameters to get
the performance back.

--b.

>
> ---
> net/sunrpc/svcsock.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> Index: b/net/sunrpc/svcsock.c
> ===================================================================
> --- a/net/sunrpc/svcsock.c 2013-01-25 13:48:05.000000000 -0600
> +++ b/net/sunrpc/svcsock.c 2013-01-25 13:49:42.000000000 -0600
> @@ -435,6 +435,13 @@ static void svc_sock_setbufsize(struct s
> lock_sock(sock->sk);
> sock->sk->sk_sndbuf = snd * 2;
> sock->sk->sk_rcvbuf = rcv * 2;
> +
> + /*
> + * The socket buffer can be resized by the networking code
> + * unless you specify that this is not to be done.
> + */
> + sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
> +
> sock->sk->sk_write_space(sock->sk);
> release_sock(sock->sk);
> #endif
>
> Thanks,
> Ben

2013-01-25 21:02:43

by J. Bruce Fields

Subject: Re: sunrpc: socket buffer size tuneable

On Fri, Jan 25, 2013 at 03:35:21PM -0500, J. Bruce Fields wrote:
> On Fri, Jan 25, 2013 at 03:21:07PM -0500, J. Bruce Fields wrote:
> >
> > > On Fri, Jan 25, 2013 at 01:29:35PM -0600, Ben Myers wrote:
> > > > Hey Bruce & Jim & Olga,
> > > >
> > > > On Fri, Jan 25, 2013 at 02:16:20PM -0500, Jim Rees wrote:
> > > > > J. Bruce Fields wrote:
> > > > >
> > > > > On Thu, Jan 24, 2013 at 06:59:30PM -0600, Ben Myers wrote:
> > > > > > At 1020 threads the send buffer size wraps and becomes negative causing
> > > > > > the nfs server to grind to a halt. Rather than setting bufsize based
> > > > > > upon the number of nfsd threads, make the buffer sizes tuneable via
> > > > > > module parameters.
> > > > > >
> > > > > > Set the buffer sizes in terms of the number of rpcs you want to fit into
> > > > > > the buffer.
> > > > >
> > > > > From private communication, my understanding is that the original
> > > > > problem here was due to memory pressure forcing the tcp send buffer size
> > > > > below the size required to hold a single rpc.
> > > >
> > > > Years ago I did see wrapping of the buffer size when tcp was used with many
> > > > threads. Today's problem is timeouts on a cluster with a heavy read
> > > > workload... and I seem to remember seeing that the send buffer size was too
> > > > small.
> > > >
> > > > > In which case the important variable here is lock_bufsize, as that's
> > > > > what prevents the buffer size from going too low.
> > > >
> > > > I tested removing the lock of bufsize and did hit the timeouts, so the overflow
> > > > is starting to look less relevant. I will test your minimal overflow fix to
> > > > see if this is the case.
> > >
> > > The minimal overflow fix did not resolve the timeouts.
> >
> > OK, thanks, that's expected.
> >
> > > I will test with this to see if it resolves the timeouts:
> >
> > And I'd expect that to do the job--but at the expense of some tcp
> > bandwidth. So you end up needing your other module parameters to get
> > the performance back.
>
> Also, what do you see happening on the server in the problem case--are
> threads blocking in svc_send, or are they dropping replies?

Oh, never mind, right, it's almost certainly svc_tcp_has_wspace failing:

	required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
	if (sk_stream_wspace(svsk->sk_sk) >= required)
		return 1;
	set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
	return 0;

That returns 0 once sk_stream_wspace falls below sv_max_mesg, so we
never take the request and don't get to the point of failing in
svc_send.
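
(As a rough worked example of why that turns into client timeouts: sk_stream_wspace() can never exceed sk_sndbuf, so once autotuning or memory pressure has moderated sk_sndbuf below

	required = xpt_reserved + sv_max_mesg

which is at least ~1MB for a 1MB rsize, svc_tcp_has_wspace() returns 0 on every call; if the buffer stays that small the transport is never handed to an nfsd thread again and the client side eventually times out.)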

--b.

2013-01-25 20:35:09

by Ben Myers

Subject: Re: sunrpc: socket buffer size tuneable

Hey Bruce,

On Fri, Jan 25, 2013 at 03:21:07PM -0500, J. Bruce Fields wrote:
> > On Fri, Jan 25, 2013 at 01:29:35PM -0600, Ben Myers wrote:
> > > Hey Bruce & Jim & Olga,
> > >
> > > On Fri, Jan 25, 2013 at 02:16:20PM -0500, Jim Rees wrote:
> > > > J. Bruce Fields wrote:
> > > >
> > > > On Thu, Jan 24, 2013 at 06:59:30PM -0600, Ben Myers wrote:
> > > > > At 1020 threads the send buffer size wraps and becomes negative causing
> > > > > the nfs server to grind to a halt. Rather than setting bufsize based
> > > > > upon the number of nfsd threads, make the buffer sizes tuneable via
> > > > > module parameters.
> > > > >
> > > > > Set the buffer sizes in terms of the number of rpcs you want to fit into
> > > > > the buffer.
> > > >
> > > > From private communication, my understanding is that the original
> > > > problem here was due to memory pressure forcing the tcp send buffer size
> > > > below the size required to hold a single rpc.
> > >
> > > Years ago I did see wrapping of the buffer size when tcp was used with many
> > > threads. Today's problem is timeouts on a cluster with a heavy read
> > > workload... and I seem to remember seeing that the send buffer size was too
> > > small.
> > >
> > > > In which case the important variable here is lock_bufsize, as that's
> > > > what prevents the buffer size from going too low.
> > >
> > > I tested removing the lock of bufsize and did hit the timeouts, so the overflow
> > > is starting to look less relevant. I will test your minimal overflow fix to
> > > see if this is the case.
> >
> > The minimal overflow fix did not resolve the timeouts.
>
> OK, thanks, that's expected.
>
> > I will test with this to see if it resolves the timeouts:
>
> And I'd expect that to do the job--

It did.

> but at the expense of some tcp
> bandwidth. So you end up needing your other module parameters to get
> the performance back.

I didn't put a timer on it, so I'm not sure. Any ideas for an alternate fix?

Thanks,
Ben

2013-01-25 22:02:50

by Ben Myers

Subject: Re: sunrpc: socket buffer size tuneable

Hey,

On Fri, Jan 25, 2013 at 04:57:12PM -0500, J. Bruce Fields wrote:
> On Fri, Jan 25, 2013 at 09:45:12PM +0000, Myklebust, Trond wrote:
> > > -----Original Message----- From: J. Bruce Fields
> > > [mailto:[email protected]] Sent: Friday, January 25, 2013 4:35 PM
> > > To: Myklebust, Trond Cc: Ben Myers; Olga Kornievskaia;
> > > [email protected]; Jim Rees Subject: Re: sunrpc: socket
> > > buffer size tuneable
> > >
> > > On Fri, Jan 25, 2013 at 09:29:09PM +0000, Myklebust, Trond wrote:
> > > > > -----Original Message----- From: J. Bruce Fields
> > > > > [mailto:[email protected]] Sent: Friday, January 25, 2013
> > > > > 4:21 PM To: Myklebust, Trond Cc: Ben Myers; Olga Kornievskaia;
> > > > > [email protected]; Jim Rees Subject: Re: sunrpc: socket
> > > > > buffer size tuneable
> > > > >
> > > > > On Fri, Jan 25, 2013 at 09:12:55PM +0000, Myklebust, Trond
> > > > > wrote:
> > > >
> > > > > > Why is it not sufficient to clamp the TCP values of 'snd' and
> > > > > > 'rcv' using
> > > > > sysctl_tcp_wmem/sysctl_tcp_rmem?
> > > > > > ...and clamp the UDP values using
> > > > > sysctl_[wr]mem_min/sysctl_[wr]mem_max?.
> > > > >
> > > > > Yeah, I was just looking at that--so, Ben, something like:
> > > > >
> > > > > echo "1048576 1048576 4194304"
> > > > > >/proc/sys/net/ipv4/tcp_wmem
> > > > >
> > > > > But I'm unclear on some of the details: do we need to set the
> > > > > minimum or only the default? And does it need any more
> > > > > allowance for protocol overhead?
> > > >
> > > > I meant adding a check either to svc_sock_setbufsize or to the 2
> > > > call-sites
> > > that enforces the above limits.
> > >
> > > I lost you.
> > >
> > > It's not svc_sock_setbufsize that's setting too-small values, if
> > > that's what you mean.
> > >
> >
> > I understood that the problem was svc_udp_recvfrom() and
> > svc_setup_socket() were using negative values in the calls to
> > svc_sock_setbufsize(). Looking again at svc_setup_socket(), I don't
> > see how that could do so, but svc_udp_recvfrom() definitely has
> > potential to cause damage.
>
> Right, the changelog was confusing, the problem they're actually hitting
> is with tcp. Looks like tcp autotuning is decreasing the send buffer
> below the size we requested in svc_sock_setbufsize().

echo "1048576 1048576 4194304" > /proc/sys/net/ipv4/tcp_wmem

Seems to have been effective. I'll be toasting to you gents tonight.

I think it would be good if the server enforced a setting that is large enough.

Thanks,
Ben

2013-01-25 21:12:57

by Myklebust, Trond

Subject: RE: sunrpc: socket buffer size tuneable

> -----Original Message-----
> From: [email protected] [mailto:linux-nfs-
> [email protected]] On Behalf Of Ben Myers
> Sent: Friday, January 25, 2013 3:35 PM
> To: J. Bruce Fields
> Cc: Olga Kornievskaia; [email protected]; Jim Rees
> Subject: Re: sunrpc: socket buffer size tuneable
>
> Hey Bruce,
>
> On Fri, Jan 25, 2013 at 03:21:07PM -0500, J. Bruce Fields wrote:
> > > On Fri, Jan 25, 2013 at 01:29:35PM -0600, Ben Myers wrote:
> > > > Hey Bruce & Jim & Olga,
> > > >
> > > > On Fri, Jan 25, 2013 at 02:16:20PM -0500, Jim Rees wrote:
> > > > > J. Bruce Fields wrote:
> > > > >
> > > > > On Thu, Jan 24, 2013 at 06:59:30PM -0600, Ben Myers wrote:
> > > > > > At 1020 threads the send buffer size wraps and becomes negative
> causing
> > > > > > the nfs server to grind to a halt. Rather than setting bufsize based
> > > > > > upon the number of nfsd threads, make the buffer sizes tuneable
> via
> > > > > > module parameters.
> > > > > >
> > > > > > Set the buffer sizes in terms of the number of rpcs you want to fit
> into
> > > > > > the buffer.
> > > > >
> > > > > From private communication, my understanding is that the original
> > > > > problem here was due to memory pressure forcing the tcp send
> buffer size
> > > > > below the size required to hold a single rpc.
> > > >
> > > > Years ago I did see wrapping of the buffer size when tcp was used
> > > > with many threads. Today's problem is timeouts on a cluster with
> > > > a heavy read workload... and I seem to remember seeing that the
> > > > send buffer size was too small.
> > > >
> > > > > In which case the important variable here is lock_bufsize, as that's
> > > > > what prevents the buffer size from going too low.
> > > >
> > > > I tested removing the lock of bufsize and did hit the timeouts, so
> > > > the overflow is starting to look less relevant. I will test your
> > > > minimal overflow fix to see if this is the case.
> > >
> > > The minimal overflow fix did not resolve the timeouts.
> >
> > OK, thanks, that's expected.
> >
> > > I will test with this to see if it resolves the timeouts:
> >
> > And I'd expect that to do the job--
>
> It did.
>
> > but at the expense of some tcp
> > bandwidth. So you end up needing your other module parameters to get
> > the performance back.
>
> I didn't put a timer on it, so I'm not sure. Any ideas for an alternate fix?
>

Why is it not sufficient to clamp the TCP values of 'snd' and 'rcv' using sysctl_tcp_wmem/sysctl_tcp_rmem?
...and clamp the UDP values using sysctl_[wr]mem_min/sysctl_[wr]mem_max?.

Cheers
Trond

2013-01-25 18:57:50

by J. Bruce Fields

Subject: Re: sunrpc: socket buffer size tuneable

On Thu, Jan 24, 2013 at 06:59:30PM -0600, Ben Myers wrote:
> At 1020 threads the send buffer size wraps and becomes negative causing
> the nfs server to grind to a halt. Rather than setting bufsize based
> upon the number of nfsd threads, make the buffer sizes tuneable via
> module parameters.
>
> Set the buffer sizes in terms of the number of rpcs you want to fit into
> the buffer.

From private communication, my understanding is that the original
problem here was due to memory pressure forcing the tcp send buffer size
below the size required to hold a single rpc.

In which case the important variable here is lock_bufsize, as that's
what prevents the buffer size from going too low.

Cc'ing Jim Rees in case he remembers: I seem to recall discussing this
possibility, wondering whether we needed a special interface to the
network layer allowing us to set a minimum, and deciding it wasn't
really necessary at the time as we didn't think the network layer would
actually do this. Is that right? In which case either we were wrong,
or something changed.

--b.

>
> Signed-off-by: Ben Myers <[email protected]>
> ---
>
> v2: moved tuneables to net/sunrpc/svcsock.c, allow user to lock buffer
> size
>
> net/sunrpc/sunrpc_syms.c | 39 +++++++++++++++++++++++++++++++++++
> net/sunrpc/svcsock.c | 52 ++++++++++++++++++++++++++---------------------
> 2 files changed, 68 insertions(+), 23 deletions(-)
>
> Index: xfs/net/sunrpc/svcsock.c
> ===================================================================
> --- xfs.orig/net/sunrpc/svcsock.c
> +++ xfs/net/sunrpc/svcsock.c
> @@ -57,6 +57,47 @@
>
> #define RPCDBG_FACILITY RPCDBG_SVCXPRT
>
> +/*
> + * Size each socket's send and receive buffers based upon the number of rpcs
> + * we'd like to be able to fit into them. The TCP rcvbuf is enough for just a
> + * few requests. TCP sndbuf is the same as the tcp_slot_table_entries default,
> + * the maximum number of in-flight rpcs for a client. UDP sndbuf and rcvbuf
> + * defaults are the same as the udp_slot_table_entries default.
> + *
> + * Setting these too small can cause nfsd threads to block when sending.
> + */
> +static int tcp_rcvbuf_nrpc = 8;
> +static int tcp_sndbuf_nrpc = 8;
> +
> +static int udp_rcvbuf_nrpc = 6;
> +static int udp_sndbuf_nrpc = 6;
> +
> +static bool lock_bufsize = false;
> +
> +/*
> + * Minimum for bufsize settings is enough space for 6 rpcs, maximum is enough
> + * space for 128 rpcs.
> + */
> +static int param_set_bufsize_nrpc(const char *val, struct kernel_param *kp)
> +{
> + char *endp;
> + int num = simple_strtol(val, &endp, 0);
> +
> + if (endp == val || (*endp && *endp != '\n') || num < 6 || num > 128)
> + return -EINVAL;
> + *((int *)kp->arg) = num;
> + return 0;
> +}
> +
> +module_param_call(tcp_rcvbuf_nrpc, param_set_bufsize_nrpc, param_get_int,
> + &tcp_rcvbuf_nrpc, 0644);
> +module_param_call(tcp_sndbuf_nrpc, param_set_bufsize_nrpc, param_get_int,
> + &tcp_sndbuf_nrpc, 0644);
> +module_param_call(udp_rcvbuf_nrpc, param_set_bufsize_nrpc, param_get_int,
> + &udp_rcvbuf_nrpc, 0644);
> +module_param_call(udp_sndbuf_nrpc, param_set_bufsize_nrpc, param_get_int,
> + &udp_sndbuf_nrpc, 0644);
> +module_param(lock_bufsize, bool, 0644);
>
> static struct svc_sock *svc_setup_socket(struct svc_serv *, struct socket *,
> int flags);
> @@ -375,8 +416,8 @@ static int svc_partial_recvfrom(struct s
> /*
> * Set socket snd and rcv buffer lengths
> */
> -static void svc_sock_setbufsize(struct socket *sock, unsigned int snd,
> - unsigned int rcv)
> +static void svc_sock_setbufsize(struct socket *sock, int snd, int rcv,
> + bool lock)
> {
> #if 0
> mm_segment_t oldfs;
> @@ -392,8 +433,17 @@ static void svc_sock_setbufsize(struct s
> * DaveM said I could!
> */
> lock_sock(sock->sk);
> - sock->sk->sk_sndbuf = snd * 2;
> - sock->sk->sk_rcvbuf = rcv * 2;
> + sock->sk->sk_sndbuf = snd;
> + sock->sk->sk_rcvbuf = rcv;
> +
> + if (lock) {
> + /*
> + * The socket buffer can be resized by the networking code
> + * unless you specify that this is not to be done.
> + */
> + sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
> + }
> +
> sock->sk->sk_write_space(sock->sk);
> release_sock(sock->sk);
> #endif
> @@ -525,18 +575,17 @@ static int svc_udp_recvfrom(struct svc_r
> size_t len;
> int err;
>
> - if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags))
> - /* udp sockets need large rcvbuf as all pending
> - * requests are still in that buffer. sndbuf must
> - * also be large enough that there is enough space
> - * for one reply per thread. We count all threads
> - * rather than threads in a particular pool, which
> - * provides an upper bound on the number of threads
> - * which will access the socket.
> - */
> - svc_sock_setbufsize(svsk->sk_sock,
> - (serv->sv_nrthreads+3) * serv->sv_max_mesg,
> - (serv->sv_nrthreads+3) * serv->sv_max_mesg);
> + if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags)) {
> + /*
> + * UDP sockets need a large rcvbuf as all pending requests are
> + * still in that buffer. sndbuf must also be large enough to
> + * fit a reply for each request the server could be processing.
> + */
> + svc_sock_setbufsize(svsk->sk_sock,
> + udp_sndbuf_nrpc * serv->sv_max_mesg,
> + udp_rcvbuf_nrpc * serv->sv_max_mesg,
> + lock_bufsize);
> + }
>
> clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
> skb = NULL;
> @@ -695,13 +744,14 @@ static void svc_udp_init(struct svc_sock
> svsk->sk_sk->sk_data_ready = svc_udp_data_ready;
> svsk->sk_sk->sk_write_space = svc_write_space;
>
> - /* initialise setting must have enough space to
> - * receive and respond to one request.
> - * svc_udp_recvfrom will re-adjust if necessary
> + /*
> + * Initial setting must have enough space to receive and respond to one
> + * request. svc_udp_recvfrom will readjust if necessary.
> */
> svc_sock_setbufsize(svsk->sk_sock,
> - 3 * svsk->sk_xprt.xpt_server->sv_max_mesg,
> - 3 * svsk->sk_xprt.xpt_server->sv_max_mesg);
> + udp_sndbuf_nrpc * svsk->sk_xprt.xpt_server->sv_max_mesg,
> + udp_rcvbuf_nrpc * svsk->sk_xprt.xpt_server->sv_max_mesg,
> + lock_bufsize);
>
> /* data might have come in before data_ready set up */
> set_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
> @@ -1377,8 +1427,10 @@ static struct svc_sock *svc_setup_socket
> /* initialise setting must have enough space to
> * receive and respond to one request.
> */
> - svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
> - 4 * serv->sv_max_mesg);
> + svc_sock_setbufsize(svsk->sk_sock,
> + tcp_sndbuf_nrpc * serv->sv_max_mesg,
> + tcp_rcvbuf_nrpc * serv->sv_max_mesg,
> + lock_bufsize);
> svc_tcp_init(svsk, serv);
> }
>

2013-01-25 19:16:35

by Jim Rees

Subject: Re: sunrpc: socket buffer size tuneable

J. Bruce Fields wrote:

On Thu, Jan 24, 2013 at 06:59:30PM -0600, Ben Myers wrote:
> At 1020 threads the send buffer size wraps and becomes negative causing
> the nfs server to grind to a halt. Rather than setting bufsize based
> upon the number of nfsd threads, make the buffer sizes tuneable via
> module parameters.
>
> Set the buffer sizes in terms of the number of rpcs you want to fit into
> the buffer.

From private communication, my understanding is that the original
problem here was due to memory pressure forcing the tcp send buffer size
below the size required to hold a single rpc.

In which case the important variable here is lock_bufsize, as that's
what prevents the buffer size from going too low.

Cc'ing Jim Rees in case he remembers: I seem to recall discussing this
possibility, wondering whether we needed a special interface to the
network layer allowing us to set a minimum, and deciding it wasn't
really necessary at the time as we didn't think the network layer would
actually do this. Is that right? In which case either we were wrong,
or something changed.

I do remember discussing this. My memory is that we needed it but no one
wanted to implement it and it never happened. But I could be
mis-remembering. Maybe ask Olga, she's the one who put the bufsize
autotuning patch in, commit 96604398.

I'll go back through my old mail and see if I can figure out what happened
with this.

2013-01-25 20:51:47

by J. Bruce Fields

Subject: Re: sunrpc: socket buffer size tuneable

On Fri, Jan 25, 2013 at 03:35:21PM -0500, J. Bruce Fields wrote:
> On Fri, Jan 25, 2013 at 03:21:07PM -0500, J. Bruce Fields wrote:
> >
> > > On Fri, Jan 25, 2013 at 01:29:35PM -0600, Ben Myers wrote:
> > > > Hey Bruce & Jim & Olga,
> > > >
> > > > On Fri, Jan 25, 2013 at 02:16:20PM -0500, Jim Rees wrote:
> > > > > J. Bruce Fields wrote:
> > > > >
> > > > > On Thu, Jan 24, 2013 at 06:59:30PM -0600, Ben Myers wrote:
> > > > > > At 1020 threads the send buffer size wraps and becomes negative causing
> > > > > > the nfs server to grind to a halt. Rather than setting bufsize based
> > > > > > upon the number of nfsd threads, make the buffer sizes tuneable via
> > > > > > module parameters.
> > > > > >
> > > > > > Set the buffer sizes in terms of the number of rpcs you want to fit into
> > > > > > the buffer.
> > > > >
> > > > > From private communication, my understanding is that the original
> > > > > problem here was due to memory pressure forcing the tcp send buffer size
> > > > > below the size required to hold a single rpc.
> > > >
> > > > Years ago I did see wrapping of the buffer size when tcp was used with many
> > > > threads. Today's problem is timeouts on a cluster with a heavy read
> > > > workload... and I seem to remember seeing that the send buffer size was too
> > > > small.
> > > >
> > > > > In which case the important variable here is lock_bufsize, as that's
> > > > > what prevents the buffer size from going too low.
> > > >
> > > > I tested removing the lock of bufsize and did hit the timeouts, so the overflow
> > > > is starting to look less relevant. I will test your minimal overflow fix to
> > > > see if this is the case.
> > >
> > > The minimal overflow fix did not resolve the timeouts.
> >
> > OK, thanks, that's expected.
> >
> > > I will test with this to see if it resolves the timeouts:
> >
> > And I'd expect that to do the job--but at the expense of some tcp
> > bandwidth. So you end up needing your other module parameters to get
> > the performance back.
>
> Also, what do you see happening on the server in the problem case--are
> threads blocking in svc_send, or are they dropping replies?

And what exactly is your test?

--b.

2013-01-25 21:13:17

by Ben Myers

Subject: Re: sunrpc: socket buffer size tuneable

Hey Bruce,

On Fri, Jan 25, 2013 at 03:51:45PM -0500, J. Bruce Fields wrote:
> On Fri, Jan 25, 2013 at 03:35:21PM -0500, J. Bruce Fields wrote:
> > On Fri, Jan 25, 2013 at 03:21:07PM -0500, J. Bruce Fields wrote:
> > > > The minimal overflow fix did not resolve the timeouts.
> > >
> > > OK, thanks, that's expected.
> > >
> > > > I will test with this to see if it resolves the timeouts:
> > >
> > > And I'd expect that to do the job--but at the expense of some tcp
> > > bandwidth. So you end up needing your other module parameters to get
> > > the performance back.
> >
> > Also, what do you see happening on the server in the problem case--are
> > threads blocking in svc_send, or are they dropping replies?

I'm working on getting a dump to determine this. Maybe there is an easier way?

> And what exactly is your test?

Boot of a 144-node nfs root cluster. The timeouts happen when loading a ~14MB
kernel module.

Thanks,
Ben

2013-01-25 21:21:08

by J. Bruce Fields

Subject: Re: sunrpc: socket buffer size tuneable

On Fri, Jan 25, 2013 at 09:12:55PM +0000, Myklebust, Trond wrote:
> > -----Original Message-----
> > From: [email protected] [mailto:linux-nfs-
> > [email protected]] On Behalf Of Ben Myers
> > Sent: Friday, January 25, 2013 3:35 PM
> > To: J. Bruce Fields
> > Cc: Olga Kornievskaia; [email protected]; Jim Rees
> > Subject: Re: sunrpc: socket buffer size tuneable
> >
> > Hey Bruce,
> >
> > On Fri, Jan 25, 2013 at 03:21:07PM -0500, J. Bruce Fields wrote:
> > > > On Fri, Jan 25, 2013 at 01:29:35PM -0600, Ben Myers wrote:
> > > > > Hey Bruce & Jim & Olga,
> > > > >
> > > > > On Fri, Jan 25, 2013 at 02:16:20PM -0500, Jim Rees wrote:
> > > > > > J. Bruce Fields wrote:
> > > > > >
> > > > > > On Thu, Jan 24, 2013 at 06:59:30PM -0600, Ben Myers wrote:
> > > > > > > At 1020 threads the send buffer size wraps and becomes negative
> > causing
> > > > > > > the nfs server to grind to a halt. Rather than setting bufsize based
> > > > > > > upon the number of nfsd threads, make the buffer sizes tuneable
> > via
> > > > > > > module parameters.
> > > > > > >
> > > > > > > Set the buffer sizes in terms of the number of rpcs you want to fit
> > into
> > > > > > > the buffer.
> > > > > >
> > > > > > From private communication, my understanding is that the original
> > > > > > problem here was due to memory pressure forcing the tcp send
> > buffer size
> > > > > > below the size required to hold a single rpc.
> > > > >
> > > > > Years ago I did see wrapping of the buffer size when tcp was used
> > > > > with many threads. Today's problem is timeouts on a cluster with
> > > > > a heavy read workload... and I seem to remember seeing that the
> > > > > send buffer size was too small.
> > > > >
> > > > > > In which case the important variable here is lock_bufsize, as that's
> > > > > > what prevents the buffer size from going too low.
> > > > >
> > > > > I tested removing the lock of bufsize and did hit the timeouts, so
> > > > > the overflow is starting to look less relevant. I will test your
> > > > > minimal overflow fix to see if this is the case.
> > > >
> > > > The minimal overflow fix did not resolve the timeouts.
> > >
> > > OK, thanks, that's expected.
> > >
> > > > I will test with this to see if it resolves the timeouts:
> > >
> > > And I'd expect that to do the job--
> >
> > It did.
> >
> > > but at the expense of some tcp
> > > bandwidth. So you end up needing your other module parameters to get
> > > the performance back.
> >
> > I didn't put a timer on it, so I'm not sure. Any ideas for an alternate fix?
> >
>
> Why is it not sufficient to clamp the TCP values of 'snd' and 'rcv' using sysctl_tcp_wmem/sysctl_tcp_rmem?
> ...and clamp the UDP values using sysctl_[wr]mem_min/sysctl_[wr]mem_max?.

Yeah, I was just looking at that--so, Ben, something like:

echo "1048576 1048576 4194304" >/proc/sys/net/ipv4/tcp_wmem

But I'm unclear on some of the details: do we need to set the minimum or
only the default? And does it need any more allowance for protocol
overhead?

Regardless, it's unfortunate if the server's buggy by default.

--b.

2013-01-25 21:29:11

by Myklebust, Trond

Subject: RE: sunrpc: socket buffer size tuneable

> -----Original Message-----
> From: J. Bruce Fields [mailto:[email protected]]
> Sent: Friday, January 25, 2013 4:21 PM
> To: Myklebust, Trond
> Cc: Ben Myers; Olga Kornievskaia; [email protected]; Jim Rees
> Subject: Re: sunrpc: socket buffer size tuneable
>
> On Fri, Jan 25, 2013 at 09:12:55PM +0000, Myklebust, Trond wrote:

> > Why is it not sufficient to clamp the TCP values of 'snd' and 'rcv' using
> sysctl_tcp_wmem/sysctl_tcp_rmem?
> > ...and clamp the UDP values using
> sysctl_[wr]mem_min/sysctl_[wr]mem_max?.
>
> Yeah, I was just looking at that--so, Ben, something like:
>
> echo "1048576 1048576 4194304" >/proc/sys/net/ipv4/tcp_wmem
>
> But I'm unclear on some of the details: do we need to set the minimum or
> only the default? And does it need any more allowance for protocol
> overhead?

I meant adding a check either to svc_sock_setbufsize or to the 2 call-sites that enforces the above limits.
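
Purely as an illustration of that (not a tested patch, and assuming the global sysctl arrays are accessible from svcsock.c), the TCP clamp could be as simple as

	snd = clamp(snd, sysctl_tcp_wmem[0], sysctl_tcp_wmem[2]);
	rcv = clamp(rcv, sysctl_tcp_rmem[0], sysctl_tcp_rmem[2]);

at the top of svc_sock_setbufsize() or at its two call sites, with the UDP callers clamped against the sysctl_[wr]mem_* limits mentioned above in the same way.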

Trond

2013-01-25 22:34:57

by J. Bruce Fields

Subject: Re: sunrpc: socket buffer size tuneable

On Fri, Jan 25, 2013 at 10:20:12PM +0000, Myklebust, Trond wrote:
> > -----Original Message-----
> > From: J. Bruce Fields [mailto:[email protected]]
> > Sent: Friday, January 25, 2013 4:57 PM
> > To: Myklebust, Trond
> > Cc: Ben Myers; Olga Kornievskaia; [email protected]; Jim Rees
> > Subject: Re: sunrpc: socket buffer size tuneable
> >
> > On Fri, Jan 25, 2013 at 09:45:12PM +0000, Myklebust, Trond wrote:
> > > > -----Original Message----- From: J. Bruce Fields
> > > > [mailto:[email protected]] Sent: Friday, January 25, 2013 4:35 PM
> > > > To: Myklebust, Trond Cc: Ben Myers; Olga Kornievskaia;
> > > > [email protected]; Jim Rees Subject: Re: sunrpc: socket
> > > > buffer size tuneable
> > > >
> > > > On Fri, Jan 25, 2013 at 09:29:09PM +0000, Myklebust, Trond wrote:
> > > > > > -----Original Message----- From: J. Bruce Fields
> > > > > > [mailto:[email protected]] Sent: Friday, January 25, 2013
> > > > > > 4:21 PM To: Myklebust, Trond Cc: Ben Myers; Olga Kornievskaia;
> > > > > > [email protected]; Jim Rees Subject: Re: sunrpc: socket
> > > > > > buffer size tuneable
> > > > > >
> > > > > > On Fri, Jan 25, 2013 at 09:12:55PM +0000, Myklebust, Trond
> > > > > > wrote:
> > > > >
> > > > > > > Why is it not sufficient to clamp the TCP values of 'snd' and
> > > > > > > 'rcv' using
> > > > > > sysctl_tcp_wmem/sysctl_tcp_rmem?
> > > > > > > ...and clamp the UDP values using
> > > > > > sysctl_[wr]mem_min/sysctl_[wr]mem_max?.
> > > > > >
> > > > > > Yeah, I was just looking at that--so, Ben, something like:
> > > > > >
> > > > > > echo "1048576 1048576 4194304"
> > > > > > >/proc/sys/net/ipv4/tcp_wmem
> > > > > >
> > > > > > But I'm unclear on some of the details: do we need to set the
> > > > > > minimum or only the default? And does it need any more
> > > > > > allowance for protocol overhead?
> > > > >
> > > > > I meant adding a check either to svc_sock_setbufsize or to the 2
> > > > > call-sites
> > > > that enforces the above limits.
> > > >
> > > > I lost you.
> > > >
> > > > It's not svc_sock_setbufsize that's setting too-small values, if
> > > > that's what you mean.
> > > >
> > >
> > > I understood that the problem was svc_udp_recvfrom() and
> > > svc_setup_socket() were using negative values in the calls to
> > > svc_sock_setbufsize(). Looking again at svc_setup_socket(), I don't
> > > see how that could do so, but svc_udp_recvfrom() definitely has
> > > potential to cause damage.
> >
> > Right, the changelog was confusing, the problem they're actually hitting is
> > with tcp. Looks like tcp autotuning is decreasing the send buffer below the
> > size we requested in svc_sock_setbufsize().
>
> Yes. As far as I can tell, that is endemic unless you lock the sndbuf size. Grep for sk_stream_moderate_sndbuf(), and you'll see what I mean.

Yes. So I guess I'll investigate a little more, then do an amateur
attempt at an interface to enforce a minimum and see if the network
developers think it's a reasonable idea.

Alternatively: is there some better strategy for the server here?

It's trying to prevent threads from blocking by refusing to accept more
rpc's than it has send buffer space to reply to.

Presumably the fear is that all your threads could block trying to get
responses to a small number of slow clients.

Are there better ways to prevent that?

--b.

2013-01-25 19:10:26

by Ben Myers

Subject: Re: sunrpc: socket buffer size tuneable

Hey Bruce,

On Fri, Jan 25, 2013 at 01:45:22PM -0500, J. Bruce Fields wrote:
> On Thu, Jan 24, 2013 at 06:59:30PM -0600, Ben Myers wrote:
> > At 1020 threads the send buffer size wraps and becomes negative causing
> > the nfs server to grind to a halt.
>
> Actually this appears to be a problem in only one place:
>
> > @@ -525,18 +575,17 @@ static int svc_udp_recvfrom(struct svc_r
> > size_t len;
> > int err;
> >
> > - if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags))
> > - /* udp sockets need large rcvbuf as all pending
> > - * requests are still in that buffer. sndbuf must
> > - * also be large enough that there is enough space
> > - * for one reply per thread. We count all threads
> > - * rather than threads in a particular pool, which
> > - * provides an upper bound on the number of threads
> > - * which will access the socket.
> > - */
> > - svc_sock_setbufsize(svsk->sk_sock,
> > - (serv->sv_nrthreads+3) * serv->sv_max_mesg,
> > - (serv->sv_nrthreads+3) * serv->sv_max_mesg);
>
> Elsewhere it's not dependent on the number of threads:
>
> > svc_sock_setbufsize(svsk->sk_sock,
> > - 3 * svsk->sk_xprt.xpt_server->sv_max_mesg,
> > - 3 * svsk->sk_xprt.xpt_server->sv_max_mesg);
> ...
> > - svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
> > - 4 * serv->sv_max_mesg);
>
> Since sv_max_mesg is at most a megabyte, those won't overflow.
>
> If you saw an overflow in the tcp case, I suspect that was with a kernel
> before 9660439861aa8dbd5e2b8087f33e20760c2c9afc "svcrpc: take advantage
> of tcp autotuning"?

Correct. This is an updated version of an old patch I've been carrying around
for a while.

> Honestly I'm a bit mystified by the udp comment in the code above; do we
> *really* need buffers that large?

I've been assuming that the author was used to running with just a few nfsd
threads, and didn't expect that someone would try a large number.

> And if so, why don't we set them to
> that size at the start in svc_udp_init?

Yeah, that's strange...

> But in the udp case I'm inclined to do just the minimum to fix the
> overflow and otherwise leave the code alone--it doesn't get as much use
> and testing as it once did, so it's probably better to leave it alone.
>
> So something like the below.

It looks reasonable. I'll give it a try.

> This still leaves the issue of tcp buffer size under memory pressure.

Yep. That may be why this patch works for me.

Thanks,
Ben

> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 0a148c9..f3a525c 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -525,7 +525,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
> size_t len;
> int err;
>
> - if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags))
> + if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags)) {
> /* udp sockets need large rcvbuf as all pending
> * requests are still in that buffer. sndbuf must
> * also be large enough that there is enough space
> @@ -534,9 +534,13 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
> * provides an upper bound on the number of threads
> * which will access the socket.
> */
> + unsigned int threads = serv->sv_nrthreads;
> + /* arbitrary limit, mainly just to prevent overflow: */
> + threads = min(threads, 128U);
> svc_sock_setbufsize(svsk->sk_sock,
> - (serv->sv_nrthreads+3) * serv->sv_max_mesg,
> - (serv->sv_nrthreads+3) * serv->sv_max_mesg);
> + (threads+3) * serv->sv_max_mesg,
> + (threads+3) * serv->sv_max_mesg);
> + }
>
> clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
> skb = NULL;

2013-01-25 20:35:23

by J. Bruce Fields

Subject: Re: sunrpc: socket buffer size tuneable

On Fri, Jan 25, 2013 at 03:21:07PM -0500, J. Bruce Fields wrote:
>
> > On Fri, Jan 25, 2013 at 01:29:35PM -0600, Ben Myers wrote:
> > > Hey Bruce & Jim & Olga,
> > >
> > > On Fri, Jan 25, 2013 at 02:16:20PM -0500, Jim Rees wrote:
> > > > J. Bruce Fields wrote:
> > > >
> > > > On Thu, Jan 24, 2013 at 06:59:30PM -0600, Ben Myers wrote:
> > > > > At 1020 threads the send buffer size wraps and becomes negative causing
> > > > > the nfs server to grind to a halt. Rather than setting bufsize based
> > > > > upon the number of nfsd threads, make the buffer sizes tuneable via
> > > > > module parameters.
> > > > >
> > > > > Set the buffer sizes in terms of the number of rpcs you want to fit into
> > > > > the buffer.
> > > >
> > > > From private communication, my understanding is that the original
> > > > problem here was due to memory pressure forcing the tcp send buffer size
> > > > below the size required to hold a single rpc.
> > >
> > > Years ago I did see wrapping of the buffer size when tcp was used with many
> > > threads. Today's problem is timeouts on a cluster with a heavy read
> > > workload... and I seem to remember seeing that the send buffer size was too
> > > small.
> > >
> > > > In which case the important variable here is lock_bufsize, as that's
> > > > what prevents the buffer size from going too low.
> > >
> > > I tested removing the lock of bufsize and did hit the timeouts, so the overflow
> > > is starting to look less relevant. I will test your minimal overflow fix to
> > > see if this is the case.
> >
> > The minimal overflow fix did not resolve the timeouts.
>
> OK, thanks, that's expected.
>
> > I will test with this to see if it resolves the timeouts:
>
> And I'd expect that to do the job--but at the expense of some tcp
> bandwidth. So you end up needing your other module parameters to get
> the performance back.

Also, what do you see happening on the server in the problem case--are
threads blocking in svc_send, or are they dropping replies?

--b.

>
> --b.
>
> >
> > ---
> > net/sunrpc/svcsock.c | 7 +++++++
> > 1 file changed, 7 insertions(+)
> >
> > Index: b/net/sunrpc/svcsock.c
> > ===================================================================
> > --- a/net/sunrpc/svcsock.c 2013-01-25 13:48:05.000000000 -0600
> > +++ b/net/sunrpc/svcsock.c 2013-01-25 13:49:42.000000000 -0600
> > @@ -435,6 +435,13 @@ static void svc_sock_setbufsize(struct s
> > lock_sock(sock->sk);
> > sock->sk->sk_sndbuf = snd * 2;
> > sock->sk->sk_rcvbuf = rcv * 2;
> > +
> > + /*
> > + * The socket buffer can be resized by the networking code
> > + * unless you specify that this is not to be done.
> > + */
> > + sock->sk->sk_userlocks |= SOCK_SNDBUF_LOCK|SOCK_RCVBUF_LOCK;
> > +
> > sock->sk->sk_write_space(sock->sk);
> > release_sock(sock->sk);
> > #endif
> >
> > Thanks,
> > Ben

2013-01-25 18:45:24

by J. Bruce Fields

Subject: Re: sunrpc: socket buffer size tuneable

On Thu, Jan 24, 2013 at 06:59:30PM -0600, Ben Myers wrote:
> At 1020 threads the send buffer size wraps and becomes negative causing
> the nfs server to grind to a halt.

Actually this appears to be a problem in only one place:

> @@ -525,18 +575,17 @@ static int svc_udp_recvfrom(struct svc_r
> size_t len;
> int err;
>
> - if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags))
> - /* udp sockets need large rcvbuf as all pending
> - * requests are still in that buffer. sndbuf must
> - * also be large enough that there is enough space
> - * for one reply per thread. We count all threads
> - * rather than threads in a particular pool, which
> - * provides an upper bound on the number of threads
> - * which will access the socket.
> - */
> - svc_sock_setbufsize(svsk->sk_sock,
> - (serv->sv_nrthreads+3) * serv->sv_max_mesg,
> - (serv->sv_nrthreads+3) * serv->sv_max_mesg);

Elsewhere it's not dependent on the number of threads:

> svc_sock_setbufsize(svsk->sk_sock,
> - 3 * svsk->sk_xprt.xpt_server->sv_max_mesg,
> - 3 * svsk->sk_xprt.xpt_server->sv_max_mesg);
...
> - svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
> - 4 * serv->sv_max_mesg);

Since sv_max_mesg is at most a megabyte, those won't overflow.

If you saw an overflow in the tcp case, I suspect that was with a kernel
before 9660439861aa8dbd5e2b8087f33e20760c2c9afc "svcrpc: take advantage
of tcp autotuning"?
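
(To put rough numbers on the overflow, for illustration: with a 1MB maximum payload, sv_max_mesg comes out to a little over 1MB, about 1,052,672 bytes, and svc_sock_setbufsize() stores snd * 2 into the int-sized sk_sndbuf. With sizing of the (sv_nrthreads+3) * sv_max_mesg form quoted above,

	(1020 + 3) * 1052672 * 2 = 2,153,766,912 > INT_MAX = 2,147,483,647

so the value wraps negative, which matches the 1020-thread symptom in the changelog.)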

Honestly I'm a bit mystified by the udp comment in the code above; do we
*really* need buffers that large? And if so, why don't we set them to
that size at the start in svc_udp_init?

But in the udp case I'm inclined to do just the minimum to fix the
overflow and otherwise leave the code alone--it doesn't get as much use
and testing as it once did, so it's probably better to leave it alone.

So something like the below.

This still leaves the issue of tcp buffer size under memory pressure.

--b.

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 0a148c9..f3a525c 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -525,7 +525,7 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
size_t len;
int err;

- if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags))
+ if (test_and_clear_bit(XPT_CHNGBUF, &svsk->sk_xprt.xpt_flags)) {
/* udp sockets need large rcvbuf as all pending
* requests are still in that buffer. sndbuf must
* also be large enough that there is enough space
@@ -534,9 +534,13 @@ static int svc_udp_recvfrom(struct svc_rqst *rqstp)
* provides an upper bound on the number of threads
* which will access the socket.
*/
+ unsigned int threads = serv->sv_nrthreads;
+ /* arbitrary limit, mainly just to prevent overflow: */
+ threads = min(threads, 128U);
svc_sock_setbufsize(svsk->sk_sock,
- (serv->sv_nrthreads+3) * serv->sv_max_mesg,
- (serv->sv_nrthreads+3) * serv->sv_max_mesg);
+ (threads+3) * serv->sv_max_mesg,
+ (threads+3) * serv->sv_max_mesg);
+ }

clear_bit(XPT_DATA, &svsk->sk_xprt.xpt_flags);
skb = NULL;
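
(Back-of-the-envelope, the cap keeps the arithmetic well clear of overflow: even at the 128-thread limit with sv_max_mesg around 1MB, (128 + 3) * 1052672 * 2 is roughly 276MB, far below INT_MAX.)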

2013-01-25 21:57:13

by J. Bruce Fields

Subject: Re: sunrpc: socket buffer size tuneable

On Fri, Jan 25, 2013 at 09:45:12PM +0000, Myklebust, Trond wrote:
> > -----Original Message----- From: J. Bruce Fields
> > [mailto:[email protected]] Sent: Friday, January 25, 2013 4:35 PM
> > To: Myklebust, Trond Cc: Ben Myers; Olga Kornievskaia;
> > [email protected]; Jim Rees Subject: Re: sunrpc: socket
> > buffer size tuneable
> >
> > On Fri, Jan 25, 2013 at 09:29:09PM +0000, Myklebust, Trond wrote:
> > > > -----Original Message----- From: J. Bruce Fields
> > > > [mailto:[email protected]] Sent: Friday, January 25, 2013
> > > > 4:21 PM To: Myklebust, Trond Cc: Ben Myers; Olga Kornievskaia;
> > > > [email protected]; Jim Rees Subject: Re: sunrpc: socket
> > > > buffer size tuneable
> > > >
> > > > On Fri, Jan 25, 2013 at 09:12:55PM +0000, Myklebust, Trond
> > > > wrote:
> > >
> > > > > Why is it not sufficient to clamp the TCP values of 'snd' and
> > > > > 'rcv' using
> > > > sysctl_tcp_wmem/sysctl_tcp_rmem?
> > > > > ...and clamp the UDP values using
> > > > sysctl_[wr]mem_min/sysctl_[wr]mem_max?.
> > > >
> > > > Yeah, I was just looking at that--so, Ben, something like:
> > > >
> > > > echo "1048576 1048576 4194304"
> > > > >/proc/sys/net/ipv4/tcp_wmem
> > > >
> > > > But I'm unclear on some of the details: do we need to set the
> > > > minimum or only the default? And does it need any more
> > > > allowance for protocol overhead?
> > >
> > > I meant adding a check either to svc_sock_setbufsize or to the 2
> > > call-sites
> > that enforces the above limits.
> >
> > I lost you.
> >
> > It's not svc_sock_setbufsize that's setting too-small values, if
> > that's what you mean.
> >
>
> I understood that the problem was svc_udp_recvfrom() and
> svc_setup_socket() were using negative values in the calls to
> svc_sock_setbufsize(). Looking again at svc_setup_socket(), I don't
> see how that could do so, but svc_udp_recvfrom() definitely has
> potential to cause damage.

Right, the changelog was confusing, the problem they're actually hitting
is with tcp. Looks like tcp autotuning is decreasing the send buffer
below the size we requested in svc_sock_setbufsize().

--b.

>
> The TCP layer should be taking care to always clamp the send/receive
> buffer sizes in tcp_new_space(),tcp_clamp_window(), etc...
>
> Trond

2013-01-25 22:20:15

by Myklebust, Trond

Subject: RE: sunrpc: socket buffer size tuneable

> -----Original Message-----
> From: J. Bruce Fields [mailto:[email protected]]
> Sent: Friday, January 25, 2013 4:57 PM
> To: Myklebust, Trond
> Cc: Ben Myers; Olga Kornievskaia; [email protected]; Jim Rees
> Subject: Re: sunrpc: socket buffer size tuneable
>
> On Fri, Jan 25, 2013 at 09:45:12PM +0000, Myklebust, Trond wrote:
> > > -----Original Message----- From: J. Bruce Fields
> > > [mailto:[email protected]] Sent: Friday, January 25, 2013 4:35 PM
> > > To: Myklebust, Trond Cc: Ben Myers; Olga Kornievskaia;
> > > [email protected]; Jim Rees Subject: Re: sunrpc: socket
> > > buffer size tuneable
> > >
> > > On Fri, Jan 25, 2013 at 09:29:09PM +0000, Myklebust, Trond wrote:
> > > > > -----Original Message----- From: J. Bruce Fields
> > > > > [mailto:[email protected]] Sent: Friday, January 25, 2013
> > > > > 4:21 PM To: Myklebust, Trond Cc: Ben Myers; Olga Kornievskaia;
> > > > > [email protected]; Jim Rees Subject: Re: sunrpc: socket
> > > > > buffer size tuneable
> > > > >
> > > > > On Fri, Jan 25, 2013 at 09:12:55PM +0000, Myklebust, Trond
> > > > > wrote:
> > > >
> > > > > > Why is it not sufficient to clamp the TCP values of 'snd' and
> > > > > > 'rcv' using
> > > > > sysctl_tcp_wmem/sysctl_tcp_rmem?
> > > > > > ...and clamp the UDP values using
> > > > > sysctl_[wr]mem_min/sysctl_[wr]mem_max?.
> > > > >
> > > > > Yeah, I was just looking at that--so, Ben, something like:
> > > > >
> > > > > echo "1048576 1048576 4194304"
> > > > > >/proc/sys/net/ipv4/tcp_wmem
> > > > >
> > > > > But I'm unclear on some of the details: do we need to set the
> > > > > minimum or only the default? And does it need any more
> > > > > allowance for protocol overhead?
> > > >
> > > > I meant adding a check either to svc_sock_setbufsize or to the 2
> > > > call-sites
> > > that enforces the above limits.
> > >
> > > I lost you.
> > >
> > > It's not svc_sock_setbufsize that's setting too-small values, if
> > > that's what you mean.
> > >
> >
> > I understood that the problem was svc_udp_recvfrom() and
> > svc_setup_socket() were using negative values in the calls to
> > svc_sock_setbufsize(). Looking again at svc_setup_socket(), I don't
> > see how that could do so, but svc_udp_recvfrom() definitely has
> > potential to cause damage.
>
> Right, the changelog was confusing, the problem they're actually hitting is
> with tcp. Looks like tcp autotuning is decreasing the send buffer below the
> size we requested in svc_sock_setbufsize().

Yes. As far as I can tell, that is endemic unless you lock the sndbuf size. Grep for sk_stream_moderate_sndbuf(), and you'll see what I mean.
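
For reference, the helper in question looked roughly like this at the time (paraphrased from include/net/sock.h of that era, so verify against the tree):

	static inline void sk_stream_moderate_sndbuf(struct sock *sk)
	{
		if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK)) {
			sk->sk_sndbuf = min(sk->sk_sndbuf, sk->sk_wmem_queued >> 1);
			sk->sk_sndbuf = max(sk->sk_sndbuf, SOCK_MIN_SNDBUF);
		}
	}

i.e. whenever the stack moderates, it pulls sk_sndbuf down toward half of what is actually queued unless SOCK_SNDBUF_LOCK is set, which is exactly what the lock_bufsize option in Ben's patch arranges.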

Trond

2013-01-25 21:45:14

by Myklebust, Trond

Subject: RE: sunrpc: socket buffer size tuneable

> -----Original Message-----
> From: J. Bruce Fields [mailto:[email protected]]
> Sent: Friday, January 25, 2013 4:35 PM
> To: Myklebust, Trond
> Cc: Ben Myers; Olga Kornievskaia; [email protected]; Jim Rees
> Subject: Re: sunrpc: socket buffer size tuneable
>
> On Fri, Jan 25, 2013 at 09:29:09PM +0000, Myklebust, Trond wrote:
> > > -----Original Message-----
> > > From: J. Bruce Fields [mailto:[email protected]]
> > > Sent: Friday, January 25, 2013 4:21 PM
> > > To: Myklebust, Trond
> > > Cc: Ben Myers; Olga Kornievskaia; [email protected]; Jim
> > > Rees
> > > Subject: Re: sunrpc: socket buffer size tuneable
> > >
> > > On Fri, Jan 25, 2013 at 09:12:55PM +0000, Myklebust, Trond wrote:
> >
> > > > Why is it not sufficient to clamp the TCP values of 'snd' and
> > > > 'rcv' using
> > > sysctl_tcp_wmem/sysctl_tcp_rmem?
> > > > ...and clamp the UDP values using
> > > sysctl_[wr]mem_min/sysctl_[wr]mem_max?.
> > >
> > > Yeah, I was just looking at that--so, Ben, something like:
> > >
> > > echo "1048576 1048576 4194304" >/proc/sys/net/ipv4/tcp_wmem
> > >
> > > But I'm unclear on some of the details: do we need to set the
> > > minimum or only the default? And does it need any more allowance
> > > for protocol overhead?
> >
> > I meant adding a check either to svc_sock_setbufsize or to the 2 call-sites
> that enforces the above limits.
>
> I lost you.
>
> It's not svc_sock_setbufsize that's setting too-small values, if that's what you
> mean.
>

I understood that the problem was svc_udp_recvfrom() and svc_setup_socket() were using negative values in the calls to svc_sock_setbufsize(). Looking again at svc_setup_socket(), I don't see how that could do so, but svc_udp_recvfrom() definitely has potential to cause damage.

The TCP layer should be taking care to always clamp the send/receive buffer sizes in tcp_new_space(), tcp_clamp_window(), etc...

Trond

2013-01-25 19:29:36

by Ben Myers

Subject: Re: sunrpc: socket buffer size tuneable

Hey Bruce & Jim & Olga,

On Fri, Jan 25, 2013 at 02:16:20PM -0500, Jim Rees wrote:
> J. Bruce Fields wrote:
>
> On Thu, Jan 24, 2013 at 06:59:30PM -0600, Ben Myers wrote:
> > At 1020 threads the send buffer size wraps and becomes negative causing
> > the nfs server to grind to a halt. Rather than setting bufsize based
> > upon the number of nfsd threads, make the buffer sizes tuneable via
> > module parameters.
> >
> > Set the buffer sizes in terms of the number of rpcs you want to fit into
> > the buffer.
>
> From private communication, my understanding is that the original
> problem here was due to memory pressure forcing the tcp send buffer size
> below the size required to hold a single rpc.

Years ago I did see wrapping of the buffer size when tcp was used with many
threads. Today's problem is timeouts on a cluster with a heavy read
workload... and I seem to remember seeing that the send buffer size was too
small.

> In which case the important variable here is lock_bufsize, as that's
> what prevents the buffer size from going too low.

I tested removing the lock of bufsize and did hit the timeouts, so the overflow
is starting to look less relevant. I will test your minimal overflow fix to
see if this is the case.

> Cc'ing Jim Rees in case he remembers: I seem to recall discussing this
> possibility, wondering whether we needed a special interface to the
> network layer allowing us to set a minimum, and deciding it wasn't
> really necessary at the time as we didn't think the network layer would
> actually do this. Is that right? In which case either we were wrong,
> or something changed.
>
> I do remember discussing this. My memory is that we needed it but no one
> wanted to implement it and it never happened. But I could be
> mis-remembering. Maybe ask Olga, she's the one who put the bufsize
> autotuning patch in, commit 96604398.
>
> I'll go back through my old mail and see if I can figure out what happened
> with this.

Thanks,
Ben

2013-01-25 21:35:06

by J. Bruce Fields

Subject: Re: sunrpc: socket buffer size tuneable

On Fri, Jan 25, 2013 at 09:29:09PM +0000, Myklebust, Trond wrote:
> > -----Original Message-----
> > From: J. Bruce Fields [mailto:[email protected]]
> > Sent: Friday, January 25, 2013 4:21 PM
> > To: Myklebust, Trond
> > Cc: Ben Myers; Olga Kornievskaia; [email protected]; Jim Rees
> > Subject: Re: sunrpc: socket buffer size tuneable
> >
> > On Fri, Jan 25, 2013 at 09:12:55PM +0000, Myklebust, Trond wrote:
>
> > > Why is it not sufficient to clamp the TCP values of 'snd' and 'rcv' using
> > sysctl_tcp_wmem/sysctl_tcp_rmem?
> > > ...and clamp the UDP values using
> > sysctl_[wr]mem_min/sysctl_[wr]mem_max?.
> >
> > Yeah, I was just looking at that--so, Ben, something like:
> >
> > echo "1048576 1048576 4194304" >/proc/sys/net/ipv4/tcp_wmem
> >
> > But I'm unclear on some of the details: do we need to set the minimum or
> > only the default? And does it need any more allowance for protocol
> > overhead?
>
> I meant adding a check either to svc_sock_setbufsize or to the 2 call-sites that enforces the above limits.

I lost you.

It's not svc_sock_setbufsize that's setting too-small values, if that's
what you mean.

--b.