2013-07-10 02:27:43

by J.Bruce Fields

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
>
> Hi,
> I just noticed this commit:
>
> commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> Author: Olga Kornievskaia <[email protected]>
> Date: Tue Oct 21 14:13:47 2008 -0400
>
> svcrpc: take advantage of tcp autotuning
>
>
> which I must confess surprised me. I wonder if the full implications of
> removing that functionality were understood.
>
> Previously nfsd would set the transmit buffer space for a connection to
> ensure there is plenty to hold all replies. Now it doesn't.
>
> nfsd refuses to accept a request if there isn't enough space in the transmit
> buffer to send a reply. This is important to ensure that each reply gets
> sent atomically without blocking and there is no risk of replies getting
> interleaved.
>
> The server starts out with a large estimate of the reply space (1M) and for
> NFSv3 and v2 it quickly adjusts this down to something realistic. For NFSv4
> it is much harder to estimate the space needed so it just assumes every
> reply will require 1M of space.
>
> This means that with NFSv4, as soon as you have enough concurrent requests
> such that 1M each reserves all of whatever window size was auto-tuned, new
> requests on that connection will be ignored.
>
> This could significantly limit the amount of parallelism that can be achieved
> for a single TCP connection (and given that the Linux client strongly prefers
> a single connection now, this could become more of an issue).

Worse, I believe it can deadlock completely if the transmit buffer
shrinks too far, and people really have run into this:

http://mid.gmane.org/<[email protected]>

Trond's suggestion looked at the time like it might work and be doable:

http://mid.gmane.org/<4FA345DA4F4AE44899BD2B03EEEC2FA91833C1D8@sacexcmbx05-prd.hq.netapp.com>

but I dropped it.

The v4-specific situation might not be hard to improve: the v4
processing decodes the whole compound at the start, so it knows the
sequence of ops before it does anything else and could compute a tighter
bound on the reply size at that point.
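
To illustrate the idea (a sketch only, not code from any tree: the
opcnt/ops fields follow my reading of nfsd's struct nfsd4_compoundargs,
and nfsd4_max_op_reply() plus the header allowance are made up):

/*
 * Sketch: walk the already-decoded compound and sum a per-op worst-case
 * reply size instead of reserving a flat 1MB for every v4 request.
 */
static u32 nfsd4_estimate_reply_size(const struct nfsd4_compoundargs *args)
{
	u32 est = 256;	/* rough allowance for RPC and compound headers */
	u32 i;

	for (i = 0; i < args->opcnt; i++)
		/*
		 * Hypothetical helper: READ would be bounded by the
		 * requested count, GETATTR by the attribute mask, and
		 * most other ops by a small constant.
		 */
		est += nfsd4_max_op_reply(&args->ops[i]);
	return est;
}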

> I don't know if this is a real issue that needs addressing - I hit in the
> context of a server filesystem which was misbehaving and so caused this issue
> to become obvious. But in this case it is certainly the filesystem, not the
> NFS server, which is causing the problem.

Yeah, it looks like a real problem.

Some good test cases would be useful if we could find some.

And, yes, my screwup for merging 966043986 without solving those other
problems first. I was confused.

It does make a difference on high bandwidth-delay product networks (something
people have also hit). I'd rather not regress there and also would
rather not require manual tuning for something we should be able to get
right automatically.

--b.


2013-07-10 17:33:13

by Dean

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

> This could significantly limit the amount of parallelism that can be
> achieved for a single TCP connection (and given that the
> Linux client strongly prefers a single connection now, this could
> become more of an issue).

I understand the simplicity in using a single tcp connection, but
performance-wise it is definitely not the way to go on WAN links. When
even a minuscule amount of packet loss is added to the link (<0.001%
packet loss), the tcp buffer collapses and performance drops
significantly (especially on 10GigE WAN links). I think new TCP
algorithms could help the problem somewhat, but nothing available today
makes much of a difference vs. cubic.
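
As a rough illustration (my numbers, assuming a 1460-byte MSS and a 50 ms
WAN RTT): the classic Mathis et al. approximation bounds steady-state
throughput at about (MSS/RTT) * 1.22/sqrt(p); cubic does somewhat better,
but the shape is the same. With p = 1e-5 that is
(11680 bits / 0.05 s) * (1.22 / 0.00316) ~= 90 Mbit/s per connection,
i.e. around 1% of a 10GigE link.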

Using multiple tcp connections allows better saturation of the link,
since when packet loss occurs on a stream, the other streams can fill
the void. Today, the only solution is to scale up the number of
physical clients, which has high coordination overhead, or use a WAN
accelerator such as Bitspeed or Riverbed (which comes with its own
issues such as extra hardware, cost, etc).


> It does make a difference on high bandwidth-delay product networks (something
> people have also hit). I'd rather not regress there and also would
> rather not require manual tuning for something we should be able to get
> right automatically.

Prior to this patch, the tcp buffer was fixed at such a small size
(especially for writes) that the idea of parallelism was moot anyway.
Whatever the tcp buffer negotiates to now is definitely bigger than what
was there beforehand, which I think is borne out by the fact that no
performance regression was found.

Regressing back to the old way is a death knell for any system with a
delay of >1ms or a bandwidth of >1GigE, so I definitely hope we never go
there. Of course, now that autoscaling allows the tcp buffer to grow to
reasonable values to achieve good performance for 10+GigE and WAN links,
if we can improve the parallelism/stability even further, that would be
great.
Dean

2013-07-26 14:19:18

by J.Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] NFSD/sunrpc: avoid deadlock on TCP connection due to memory pressure.

On Fri, Jul 26, 2013 at 06:33:03AM +1000, NeilBrown wrote:
> On Thu, 25 Jul 2013 16:18:05 -0400 "J.Bruce Fields" <[email protected]>
> wrote:
>
> > On Thu, Jul 25, 2013 at 11:30:23AM +1000, NeilBrown wrote:
> > >
> > > Since we enabled auto-tuning for sunrpc TCP connections we do not
> > > guarantee that there is enough write-space on each connection to
> > > queue a reply.
...
> > This is great, thanks!
> >
> > Inclined to queue it up for 3.11 and stable....
>
> I'd agree for 3.11.
> It feels a bit border-line for stable. "dead-lock" and "has been seen in the
> wild" are technically enough justification...
> I'd probably mark it as "please don't apply to -stable until 3.11 is released"
> or something like that, just for a bit of breathing space.
> Your call though.


So my takeaway from http://lwn.net/Articles/559113/ was that Linus and
Greg were requesting that:

- criteria for -stable and late -rc's should really be about the
same, and
- people should follow Documentation/stable-kernel-rules.txt.

So as an exercise to remind me what those rules are:

Easy questions:

- "no bigger than 100 lines, with context." Check.
- "It must fix only one thing." Check.
- "real bug that bothers people". Check.
- "tested": yep. It doesn't actually say "tested on stable
trees", and I recall this did land you with a tricky bug one
time when a prerequisite was omitted from the backport.

Judgement calls:

- "obviously correct": it's short, but admittedly subtle, and
performance regressions can take a while to get sorted out.
- "It must fix a problem that causes a build error (but not for
things marked CONFIG_BROKEN), an oops, a hang, data
corruption, a real security issue, or some "oh, that's not
good" issue. In short, something critical." We could argue
that "server stops responding" is critical, though not to the
same degree as a panic.
- OR: alternatively: "Serious issues as reported by a user of a
distribution kernel may also be considered if they fix a
notable performance or interactivity issue." The only bz I've
personally seen was the result of artificial testing of some
kind, and it sounds like your case involved a disk failure?

--b.

2013-07-16 01:11:06

by NeilBrown

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Mon, 15 Jul 2013 09:42:11 -0400 Jim Rees <[email protected]> wrote:

> Here's the thread on netdev where we discussed this issue:
> http://marc.info/?l=linux-netdev&m=121545498313619&w=2
>
> Here is where I asked about an api to set a socket buf minimum:
> http://marc.info/?l=linux-netdev&m=121554794432038&w=2
>
> There are many subtleties, and I suggest anyone who wants to try to fix this
> code should read the email thread. The netdev people were pretty insistent
> that we turn on autotuning; David Miller said the old behavior was
> equivalent to "turn[ing] off half of the TCP stack."

Thanks for those pointers.
Certainly autotuning is what we want.
The issues with the receive buffer getting tuned too low appear to have been
addressed some time ago.
The issues with the send buffer getting tuned too low are not what I at first
thought they were, but still are not completely resolved.
My previous patch, which allowed N requests beyond the available space, is, I
now think, more than necessary. You just need to make sure one can get
through. That should be enough to push the sndbuf size up quite quickly.

I'll revise it, test, and repost.

Thanks,
NeilBrown


Attachments:
signature.asc (828.00 B)

2013-07-16 04:00:33

by NeilBrown

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Mon, 15 Jul 2013 21:58:03 -0400 "J.Bruce Fields" <[email protected]>
wrote:

> On Mon, Jul 15, 2013 at 02:32:03PM +1000, NeilBrown wrote:
> > On Wed, 10 Jul 2013 15:07:28 -0400 "J.Bruce Fields" <[email protected]>
> > wrote:
> >
> > > On Wed, Jul 10, 2013 at 02:32:33PM +1000, NeilBrown wrote:
> > > > On Tue, 9 Jul 2013 22:27:35 -0400 "J.Bruce Fields" <[email protected]>
> > > > wrote:
> > > >
> > > > > On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
> > > > > >
> > > > > > Hi,
> > > > > > I just noticed this commit:
> > > > > >
> > > > > > commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> > > > > > Author: Olga Kornievskaia <[email protected]>
> > > > > > Date: Tue Oct 21 14:13:47 2008 -0400
> > > > > >
> > > > > > svcrpc: take advantage of tcp autotuning
> > > > > >
> > > > > >
> > > > > > which I must confess surprised me. I wonder if the full implications of
> > > > > > removing that functionality were understood.
> > > > > >
> > > > > > Previously nfsd would set the transmit buffer space for a connection to
> > > > > > ensure there is plenty to hold all replies. Now it doesn't.
> > > > > >
> > > > > > nfsd refuses to accept a request if there isn't enough space in the transmit
> > > > > > buffer to send a reply. This is important to ensure that each reply gets
> > > > > > sent atomically without blocking and there is no risk of replies getting
> > > > > > interleaved.
> > >
> > > By the way, it's xpt_mutex that really guarantees non-interleaving,
> > > isn't it?
> >
> > Yes. I had the two different things confused. The issue is only about
> > blocking on writes and consequently tying up nfsd threads.
> >
> > >
> > > I think of the svc_tcp_has_wspace checks as solving a problem of
> > > fairness between clients. If we only had one client, everything would
> > > work without them. But when we have multiple clients we don't want a
> > > single slow client to be able to tie up every thread waiting for that
> > > client to receive read data. Is that accurate?
> >
> > It agrees with my understanding - yes.
> >
> >
> > >
> > > > > > The server starts out with a large estimate of the reply space (1M) and for
> > > > > > NFSv3 and v2 it quickly adjusts this down to something realistic. For NFSv4
> > > > > > it is much harder to estimate the space needed so it just assumes every
> > > > > > reply will require 1M of space.
> > > > > >
> > > > > > This means that with NFSv4, as soon as you have enough concurrent requests
> > > > > > such that 1M each reserves all of whatever window size was auto-tuned, new
> > > > > > requests on that connection will be ignored.
> > > > > >
> > > > > > This could significantly limit the amount of parallelism that can be achieved
> > > > > > for a single TCP connection (and given that the Linux client strongly prefers
> > > > > > a single connection now, this could become more of an issue).
> > > > >
> > > > > Worse, I believe it can deadlock completely if the transmit buffer
> > > > > shrinks too far, and people really have run into this:
> > > > >
> > > > > http://mid.gmane.org/<[email protected]>
> > > > >
> > > > > Trond's suggestion looked at the time like it might work and be doable:
> > > > >
> > > > > http://mid.gmane.org/<4FA345DA4F4AE44899BD2B03EEEC2FA91833C1D8@sacexcmbx05-prd.hq.netapp.com>
> > > > >
> > > > > but I dropped it.
> > > >
> > > > I would probably generalise Trond's suggestion and allow "N" extra requests
> > > > through that exceed the reservation, when N is related to the number of idle
> > > > threads. squareroot might be nice, but half is probably easiest.
> > > >
> > > > If any send takes more than 30 seconds the sk_sndtimeo will kick in and close
> > > > the connection so a really bad connection won't block threads indefinitely.
> > > >
> > > >
> > > > And yes - a nice test case would be good.
> > > >
> > > > What do you think of the following (totally untested - just for comment)?
> > >
> > > In cases where the disk is the bottleneck, there aren't actually going
> > > to be idle threads, are there? In which case I don't think this helps
> > > save a stuck client.
> >
> > Do we care about the disk being a bottleneck? We don't currently try to
> > handle that situation at all - if the disk is slow, everything slows down.
>
> Right, that's fine, it's just that:
>
> Suppose you have a fast network, several fast clients all continuously
> reading, and one slow disk.
>
> Then in the steady state all the threads will be waiting on the disk,
> and the receive buffers will all be full of read requests from the
> various clients. When a thread completes a read it will immediately
> send the response, pick up the next request, and go back to waiting on
> the disk.
>
> Suppose the send buffer of one of the clients drops below the minimum
> necessary.
>
> Then after that point, I don't think that one client will ever have
> another rpc processed, even after your proposed change.

That would be because I only allowed extra requests up to half the number of
idle threads, and there would be zero idle threads.

So yes - I see your point.

I now think that it is sufficient to just allow one request through per
socket. While there is high memory pressure, that is sufficient. When the
memory pressure drops, that should be enough to cause sndbuf to grow.

We should be able to use xpt_reserved to check if this is the only request
or not:


diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 80a6640..5b832ef 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -331,7 +331,8 @@ static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
return true;
if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
- return xprt->xpt_ops->xpo_has_wspace(xprt);
+ return xprt->xpt_ops->xpo_has_wspace(xprt) ||
+ atomic_read(&xprt->xpt_reserved) == 0;
return false;
}

@@ -730,7 +731,8 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
newxpt = xprt->xpt_ops->xpo_accept(xprt);
if (newxpt)
svc_add_new_temp_xprt(serv, newxpt);
- } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
+ } else if (xprt->xpt_ops->xpo_has_wspace(xprt) ||
+ atomic_read(&xprt->xpt_reserved) == 0) {
/* XPT_DATA|XPT_DEFERRED case: */
dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
rqstp, rqstp->rq_pool->sp_id, xprt,


Thoughts?

I tried testing this, but xpo_has_wspace never fails for me, even if I remove
the calls to svc_sock_setbufsize (which probably aren't wanted for TCP any
more).

>
> Am I missing anything?

Not as far as I can see - thanks!

NeilBrown


Attachments:
signature.asc (828.00 B)

2013-07-16 01:58:05

by J.Bruce Fields

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Mon, Jul 15, 2013 at 02:32:03PM +1000, NeilBrown wrote:
> On Wed, 10 Jul 2013 15:07:28 -0400 "J.Bruce Fields" <[email protected]>
> wrote:
>
> > On Wed, Jul 10, 2013 at 02:32:33PM +1000, NeilBrown wrote:
> > > On Tue, 9 Jul 2013 22:27:35 -0400 "J.Bruce Fields" <[email protected]>
> > > wrote:
> > >
> > > > On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
> > > > >
> > > > > Hi,
> > > > > I just noticed this commit:
> > > > >
> > > > > commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> > > > > Author: Olga Kornievskaia <[email protected]>
> > > > > Date: Tue Oct 21 14:13:47 2008 -0400
> > > > >
> > > > > svcrpc: take advantage of tcp autotuning
> > > > >
> > > > >
> > > > > which I must confess surprised me. I wonder if the full implications of
> > > > > removing that functionality were understood.
> > > > >
> > > > > Previously nfsd would set the transmit buffer space for a connection to
> > > > > ensure there is plenty to hold all replies. Now it doesn't.
> > > > >
> > > > > nfsd refuses to accept a request if there isn't enough space in the transmit
> > > > > buffer to send a reply. This is important to ensure that each reply gets
> > > > > sent atomically without blocking and there is no risk of replies getting
> > > > > interleaved.
> >
> > By the way, it's xpt_mutex that really guarantees non-interleaving,
> > isn't it?
>
> Yes. I had the two different things confused. The issue is only about
> blocking on writes and consequently tying up nfsd threads.
>
> >
> > I think of the svc_tcp_has_wspace checks as solving a problem of
> > fairness between clients. If we only had one client, everything would
> > work without them. But when we have multiple clients we don't want a
> > single slow client to be able to tie up every thread waiting for that
> > client to receive read data. Is that accurate?
>
> It agrees with my understanding - yes.
>
>
> >
> > > > > The server starts out with a large estimate of the reply space (1M) and for
> > > > > NFSv3 and v2 it quickly adjusts this down to something realistic. For NFSv4
> > > > > it is much harder to estimate the space needed so it just assumes every
> > > > > reply will require 1M of space.
> > > > >
> > > > > This means that with NFSv4, as soon as you have enough concurrent requests
> > > > > such that 1M each reserves all of whatever window size was auto-tuned, new
> > > > > requests on that connection will be ignored.
> > > > >
> > > > > This could significantly limit the amount of parallelism that can be achieved
> > > > > for a single TCP connection (and given that the Linux client strongly prefers
> > > > > a single connection now, this could become more of an issue).
> > > >
> > > > Worse, I believe it can deadlock completely if the transmit buffer
> > > > shrinks too far, and people really have run into this:
> > > >
> > > > http://mid.gmane.org/<[email protected]>
> > > >
> > > > Trond's suggestion looked at the time like it might work and be doable:
> > > >
> > > > http://mid.gmane.org/<4FA345DA4F4AE44899BD2B03EEEC2FA91833C1D8@sacexcmbx05-prd.hq.netapp.com>
> > > >
> > > > but I dropped it.
> > >
> > > I would probably generalise Trond's suggestion and allow "N" extra requests
> > > through that exceed the reservation, when N is related to the number of idle
> > > threads. squareroot might be nice, but half is probably easiest.
> > >
> > > If any send takes more than 30 seconds the sk_sndtimeo will kick in and close
> > > the connection so a really bad connection won't block threads indefinitely.
> > >
> > >
> > > And yes - a nice test case would be good.
> > >
> > > What do you think of the following (totally untested - just for comment)?
> >
> > In cases where the disk is the bottleneck, there aren't actually going
> > to be idle threads, are there? In which case I don't think this helps
> > save a stuck client.
>
> Do we care about the disk being a bottleneck? We don't currently try to
> handle that situation at all - if the disk is slow, everything slows down.

Right, that's fine, it's just that:

Suppose you have a fast network, several fast clients all continuously
reading, and one slow disk.

Then in the steady state all the threads will be waiting on the disk,
and the receive buffers will all be full of read requests from the
various clients. When a thread completes a read it will immediately
send the response, pick up the next request, and go back to waiting on
the disk.

Suppose the send buffer of one of the clients drops below the minimum
necessary.

Then after that point, I don't think that one client will ever have
another rpc processed, even after your proposed change.

Am I missing anything?

--b.

2013-07-25 01:30:39

by NeilBrown

[permalink] [raw]
Subject: [PATCH] NFSD/sunrpc: avoid deadlock on TCP connection due to memory pressure.


Since we enabled auto-tuning for sunrpc TCP connections we do not
guarantee that there is enough write-space on each connection to
queue a reply.

If memory pressure causes the window to shrink too small, the request
throttling in sunrpc/svc will not accept any requests so no more requests
will be handled. Even when pressure decreases the window will not
grow again until data is sent on the connection.
This means we get a deadlock: no requests will be handled until there
is more space, and no space will be allocated until a request is
handled.

This can be simulated by modifying svc_tcp_has_wspace to inflate the
number of bytes required and removing the 'svc_sock_setbufsize' calls
in svc_setup_socket.

I found that multiplying by 16 was enough to make the requirement
exceed the default allocation. With this modification in place:
mount -o vers=3,proto=tcp 127.0.0.1:/home /mnt
would block and eventually time out because the nfs server could not
accept any requests.

This patch relaxes the request throttling to always allow at least one
request through per connection. It does this by checking both
sk_stream_min_wspace() and xprt->xpt_reserved
are zero.
The first is zero when the TCP transmit queue is empty.
The second is zero when there are no RPC requests being processed.
When both of these are zero the socket is idle and so one more
request can safely be allowed through.

Applying this patch allows the above mount command to succeed cleanly.
Tracing shows that the allocated write buffer space quickly grows and
after a few requests are handled, the extra tests are no longer needed
to permit further requests to be processed.

The main purpose of request throttling is to handle the case when one
client is slow at collecting replies and the send queue gets full of
replies that the client hasn't acknowledged (at the TCP level) yet.
As we only change behaviour when the send queue is empty this main
purpose is still preserved.

Reported-by: Ben Myers <[email protected]>
Signed-off-by: NeilBrown <[email protected]>

--
As you can see I've changed the patch. While writing up the above
description I realised there was a weakness, and so added the
sk_stream_min_wspace test. That allowed me to write the final paragraph.

NeilBrown


diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 305374d..7762b9f 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1193,7 +1193,9 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
return 1;
required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
- if (sk_stream_wspace(svsk->sk_sk) >= required)
+ if (sk_stream_wspace(svsk->sk_sk) >= required ||
+ (sk_stream_min_wspace(svsk->sk_sk) == 0 &&
+ atomic_read(&xprt->xpt_reserved) == 0))
return 1;
set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
return 0;


Attachments:
signature.asc (828.00 B)

2013-07-25 20:18:08

by J.Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] NFSD/sunrpc: avoid deadlock on TCP connection due to memory pressure.

On Thu, Jul 25, 2013 at 11:30:23AM +1000, NeilBrown wrote:
>
> Since we enabled auto-tuning for sunrpc TCP connections we do not
> guarantee that there is enough write-space on each connection to
> queue a reply.
>
> If memory pressure causes the window to shrink too small, the request
> throttling in sunrpc/svc will not accept any requests so no more requests
> will be handled. Even when pressure decreases the window will not
> grow again until data is sent on the connection.
> This means we get a deadlock: no requests will be handled until there
> is more space, and no space will be allocated until a request is
> handled.
>
> This can be simulated by modifying svc_tcp_has_wspace to inflate the
> number of bytes required and removing the 'svc_sock_setbufsize' calls
> in svc_setup_socket.

Ah-hah!

> I found that multiplying by 16 was enough to make the requirement
> exceed the default allocation. With this modification in place:
> mount -o vers=3,proto=tcp 127.0.0.1:/home /mnt
> would block and eventually time out because the nfs server could not
> accept any requests.

So, this?:

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 305374d..36de50d 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1193,6 +1193,7 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
return 1;
required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
+ required *= 16;
if (sk_stream_wspace(svsk->sk_sk) >= required)
return 1;
set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
@@ -1378,14 +1379,8 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
/* Initialize the socket */
if (sock->type == SOCK_DGRAM)
svc_udp_init(svsk, serv);
- else {
- /* initialise setting must have enough space to
- * receive and respond to one request.
- */
- svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
- 4 * serv->sv_max_mesg);
+ else
svc_tcp_init(svsk, serv);
- }

dprintk("svc: svc_setup_socket created %p (inet %p)\n",
svsk, svsk->sk_sk);

> This patch relaxes the request throttling to always allow at least one
> request through per connection. It does this by checking both
> sk_stream_min_wspace() and xprt->xpt_reserved
> are zero.
> The first is zero when the TCP transmit queue is empty.
> The second is zero when there are no RPC requests being processed.
> When both of these are zero the socket is idle and so one more
> request can safely be allowed through.
>
> Applying this patch allows the above mount command to succeed cleanly.
> Tracing shows that the allocated write buffer space quickly grows and
> after a few requests are handled, the extra tests are no longer needed
> to permit further requests to be processed.
>
> The main purpose of request throttling is to handle the case when one
> client is slow at collecting replies and the send queue gets full of
> replies that the client hasn't acknowledged (at the TCP level) yet.
> As we only change behaviour when the send queue is empty this main
> purpose is still preserved.
>
> Reported-by: Ben Myers <[email protected]>
> Signed-off-by: NeilBrown <[email protected]>
>
> --
> As you can see I've changed the patch. While writing up the above
> description I realised there was a weakness, and so added the
> sk_stream_min_wspace test. That allowed me to write the final paragraph.

This is great, thanks!

Inclined to queue it up for 3.11 and stable....

--b.

>
> NeilBrown
>
>
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 305374d..7762b9f 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -1193,7 +1193,9 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
> if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
> return 1;
> required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
> - if (sk_stream_wspace(svsk->sk_sk) >= required)
> + if (sk_stream_wspace(svsk->sk_sk) >= required ||
> + (sk_stream_min_wspace(svsk->sk_sk) == 0 &&
> + atomic_read(&xprt->xpt_reserved) == 0))
> return 1;
> set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> return 0;



2013-07-25 20:33:17

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] NFSD/sunrpc: avoid deadlock on TCP connection due to memory pressure.

On Thu, 25 Jul 2013 16:18:05 -0400 "J.Bruce Fields" <[email protected]>
wrote:

> On Thu, Jul 25, 2013 at 11:30:23AM +1000, NeilBrown wrote:
> >
> > Since we enabled auto-tuning for sunrpc TCP connections we do not
> > guarantee that there is enough write-space on each connection to
> > queue a reply.
> >
> > If memory pressure causes the window to shrink too small, the request
> > throttling in sunrpc/svc will not accept any requests so no more requests
> > will be handled. Even when pressure decreases the window will not
> > grow again until data is sent on the connection.
> > This means we get a deadlock: no requests will be handled until there
> > is more space, and no space will be allocated until a request is
> > handled.
> >
> > This can be simulated by modifying svc_tcp_has_wspace to inflate the
> > number of bytes required and removing the 'svc_sock_setbufsize' calls
> > in svc_setup_socket.
>
> Ah-hah!
>
> > I found that multiplying by 16 was enough to make the requirement
> > exceed the default allocation. With this modification in place:
> > mount -o vers=3,proto=tcp 127.0.0.1:/home /mnt
> > would block and eventually time out because the nfs server could not
> > accept any requests.
>
> So, this?:

Close enough. I just put "//" in front of the lines I didn't want rather
than delete them. But yes: exactly that effect.


>
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 305374d..36de50d 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -1193,6 +1193,7 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
> if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
> return 1;
> required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
> + required *= 16;
> if (sk_stream_wspace(svsk->sk_sk) >= required)
> return 1;
> set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> @@ -1378,14 +1379,8 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
> /* Initialize the socket */
> if (sock->type == SOCK_DGRAM)
> svc_udp_init(svsk, serv);
> - else {
> - /* initialise setting must have enough space to
> - * receive and respond to one request.
> - */
> - svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
> - 4 * serv->sv_max_mesg);
> + else
> svc_tcp_init(svsk, serv);
> - }
>
> dprintk("svc: svc_setup_socket created %p (inet %p)\n",
> svsk, svsk->sk_sk);
>
> > This patch relaxes the request throttling to always allow at least one
> > request through per connection. It does this by checking both
> > sk_stream_min_wspace() and xprt->xpt_reserved
> > are zero.
> > The first is zero when the TCP transmit queue is empty.
> > The second is zero when there are no RPC requests being processed.
> > When both of these are zero the socket is idle and so one more
> > request can safely be allowed through.
> >
> > Applying this patch allows the above mount command to succeed cleanly.
> > Tracing shows that the allocated write buffer space quickly grows and
> > after a few requests are handled, the extra tests are no longer needed
> > to permit further requests to be processed.
> >
> > The main purpose of request throttling is to handle the case when one
> > client is slow at collecting replies and the send queue gets full of
> > replies that the client hasn't acknowledged (at the TCP level) yet.
> > As we only change behaviour when the send queue is empty this main
> > purpose is still preserved.
> >
> > Reported-by: Ben Myers <[email protected]>
> > Signed-off-by: NeilBrown <[email protected]>
> >
> > --
> > As you can see I've changed the patch. While writing up the above
> > description I realised there was a weakness, and so added the
> > sk_stream_min_wspace test. That allowed me to write the final paragraph.
>
> This is great, thanks!
>
> Inclined to queue it up for 3.11 and stable....

I'd agree for 3.11.
It feels a bit border-line for stable. "dead-lock" and "has been seen in the
wild" are technically enough justification...
I'd probably mark it as "please don't apply to -stable until 3.11 is released"
or something like that, just for a bit of breathing space.
Your call though.

NeilBrown

>
> --b.
>
> >
> > NeilBrown
> >
> >
> > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > index 305374d..7762b9f 100644
> > --- a/net/sunrpc/svcsock.c
> > +++ b/net/sunrpc/svcsock.c
> > @@ -1193,7 +1193,9 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
> > if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
> > return 1;
> > required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
> > - if (sk_stream_wspace(svsk->sk_sk) >= required)
> > + if (sk_stream_wspace(svsk->sk_sk) >= required ||
> > + (sk_stream_min_wspace(svsk->sk_sk) == 0 &&
> > + atomic_read(&xprt->xpt_reserved) == 0))
> > return 1;
> > set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> > return 0;
>
>


Attachments:
signature.asc (828.00 B)

2013-07-24 21:07:51

by J.Bruce Fields

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Wed, Jul 17, 2013 at 07:03:19PM -0500, Ben Myers wrote:
> Hey Bruce,
>
> On Tue, Jul 16, 2013 at 10:24:30AM -0400, J.Bruce Fields wrote:
> > Adding Ben Myers to Cc: in case he has any testing help or advice (see
> > below):
> >
> > On Tue, Jul 16, 2013 at 02:00:21PM +1000, NeilBrown wrote:
> > > That would be because I only allowed extra requests up to half the number of
> > > idle threads, and there would be zero idle threads.
> > >
> > > So yes - I see your point.
> > >
> > > I now think that it is sufficient to just allow one request through per
> > > socket. While there is high memory pressure, that is sufficient. When the
> > > memory pressure drops, that should be enough to cause sndbuf to grow.
> > >
> > > We should be able to use xpt_reserved to check if this is the only request
> > > or not:
> >
> > A remaining possible bad case is if the number of "bad" sockets is more
> > than the number of threads, and if the problem is something that won't
> > resolve itself by just waiting a little while.
> >
> > I don't know if that's likely in practice. Maybe a network failure cuts
> > you off from a large swath (but not all) of your clients? Or maybe some
> > large proportion of your clients are just slow?
> >
> > But this is two lines and looks likely to solve the immediate
> > problem:
> >
> > > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > > index 80a6640..5b832ef 100644
> > > --- a/net/sunrpc/svc_xprt.c
> > > +++ b/net/sunrpc/svc_xprt.c
> > > @@ -331,7 +331,8 @@ static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
> > > if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
> > > return true;
> > > if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
> > > - return xprt->xpt_ops->xpo_has_wspace(xprt);
> > > + return xprt->xpt_ops->xpo_has_wspace(xprt) ||
> > > + atomic_read(&xprt->xpt_reserved) == 0;
> > > return false;
> > > }
> > >
> > > @@ -730,7 +731,8 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
> > > newxpt = xprt->xpt_ops->xpo_accept(xprt);
> > > if (newxpt)
> > > svc_add_new_temp_xprt(serv, newxpt);
> > > - } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
> > > + } else if (xprt->xpt_ops->xpo_has_wspace(xprt) ||
> > > + atomic_read(&xprt->xpt_reserved) == 0) {
> > > /* XPT_DATA|XPT_DEFERRED case: */
> > > dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
> > > rqstp, rqstp->rq_pool->sp_id, xprt,
> > >
> > >
> > > Thoughts?
> > >
> > > I tried testing this, but xpo_has_wspace never fails for me, even if I remove
> > > the calls to svc_sock_setbufsize (which probably aren't wanted for TCP any
> > > more).
> >
> > Ben, do you still have a setup that can reproduce the problem? Or did you
> > ever find an easier way to reproduce?
>
> Unfortunately I don't still have that test rig handy, and I did not find an
> easier way to reproduce this. I do agree that it 'is sufficient to just allow
> one request through per socket'... probably that would have solved the problem
> in our case. Maybe it would help to limit the amount of memory in your system
> to try and induce additional memory pressure? IIRC it was a 1MB block size
> read workload where we would hit this.

That makes sense.

Or I wonder if you could cheat and artificially force the tcp code to
behave as if there was memory pressure. (E.g. by calling
tcp_enter_memory_pressure on some trigger).
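
Something like this debug-only hack, perhaps (just a sketch, never meant
to be merged; it assumes tcp_enter_memory_pressure() from net/ipv4/tcp.c
is visible to sunrpc, and "fake_pressure" is a made-up module parameter):

/* Testing hack: pretend TCP is under memory pressure so autotuning keeps
 * sndbuf small and svc_tcp_has_wspace() can be made to fail on demand.
 * May need an extra #include <net/tcp.h> in svcsock.c.
 */
static bool fake_pressure;
module_param(fake_pressure, bool, 0644);

	/* ... then, inside svc_tcp_has_wspace(), before the wspace test: */
	if (fake_pressure)
		tcp_enter_memory_pressure(svsk->sk_sk);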

Neil, are you still looking at this? (If you're done with it, could you
resend that patch with a signed-off-by?)

--b.

2013-07-10 17:39:45

by Ben Greear

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On 07/10/2013 10:33 AM, Dean wrote:
> > This could significantly limit the amount of parallelism that can be achieved for a single TCP connection (and given that the
> > Linux client strongly prefers a single connection now, this could become more of an issue).
>
> I understand the simplicity in using a single tcp connection, but performance-wise it is definitely not the way to go on WAN links. When even a miniscule amount
> of packet loss is added to the link (<0.001% packet loss), the tcp buffer collapses and performance drops significantly (especially on 10GigE WAN links). I
> think new TCP algorithms could help the problem somewhat, but nothing available today makes much of a difference vs. cubic.
>
> Using multiple tcp connections allows better saturation of the link, since when packet loss occurs on a stream, the other streams can fill the void. Today, the
> only solution is to scale up the number of physical clients, which has high coordination overhead, or use a wan accelerator such as Bitspeed or Riverbed (which
> comes with its own issues such as extra hardware, cost, etc).

I have a set of patches that allows one to do multiple unique mounts to the same server from a single
client, but the patches are for the client side, so it would not help
non-Linux clients. And, the patches were rejected for upstream as not being
useful. But, if you are interested in such, please let me know and I can point
you to them...

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2013-07-10 04:32:52

by NeilBrown

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Tue, 9 Jul 2013 22:27:35 -0400 "J.Bruce Fields" <[email protected]>
wrote:

> On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
> >
> > Hi,
> > I just noticed this commit:
> >
> > commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> > Author: Olga Kornievskaia <[email protected]>
> > Date: Tue Oct 21 14:13:47 2008 -0400
> >
> > svcrpc: take advantage of tcp autotuning
> >
> >
> > which I must confess surprised me. I wonder if the full implications of
> > removing that functionality were understood.
> >
> > Previously nfsd would set the transmit buffer space for a connection to
> > ensure there is plenty to hold all replies. Now it doesn't.
> >
> > nfsd refuses to accept a request if there isn't enough space in the transmit
> > buffer to send a reply. This is important to ensure that each reply gets
> > sent atomically without blocking and there is no risk of replies getting
> > interleaved.
> >
> > The server starts out with a large estimate of the reply space (1M) and for
> > NFSv3 and v2 it quickly adjusts this down to something realistic. For NFSv4
> > it is much harder to estimate the space needed so it just assumes every
> > reply will require 1M of space.
> >
> > This means that with NFSv4, as soon as you have enough concurrent requests
> > such that 1M each reserves all of whatever window size was auto-tuned, new
> > requests on that connection will be ignored.
> >
> > This could significantly limit the amount of parallelism that can be achieved
> > for a single TCP connection (and given that the Linux client strongly prefers
> > a single connection now, this could become more of an issue).
>
> Worse, I believe it can deadlock completely if the transmit buffer
> shrinks too far, and people really have run into this:
>
> http://mid.gmane.org/<[email protected]>
>
> Trond's suggestion looked at the time like it might work and be doable:
>
> http://mid.gmane.org/<4FA345DA4F4AE44899BD2B03EEEC2FA91833C1D8@sacexcmbx05-prd.hq.netapp.com>
>
> but I dropped it.

I would probably generalise Trond's suggestion and allow "N" extra requests
through that exceed the reservation, where N is related to the number of idle
threads. A square root might be nice, but half is probably easiest.

If any send takes more than 30 seconds the sk_sndtimeo will kick in and close
the connection so a really bad connection won't block threads indefinitely.


And yes - a nice test case would be good.

What do you think of the following (totally untested - just for comment)?

NeilBrown


diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index b05963f..2fc92f1 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -81,6 +81,10 @@ struct svc_xprt {

struct net *xpt_net;
struct rpc_xprt *xpt_bc_xprt; /* NFSv4.1 backchannel */
+
+ atomic_t xpt_extras; /* Extra requests which
+ * might block on send
+ */
};

static inline void unregister_xpt_user(struct svc_xprt *xpt, struct svc_xpt_user *u)
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 80a6640..fc366ca 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -165,6 +165,7 @@ void svc_xprt_init(struct net *net, struct svc_xprt_class *xcl,
set_bit(XPT_BUSY, &xprt->xpt_flags);
rpc_init_wait_queue(&xprt->xpt_bc_pending, "xpt_bc_pending");
xprt->xpt_net = get_net(net);
+ atomic_set(&xprt->xpt_extras, 0);
}
EXPORT_SYMBOL_GPL(svc_xprt_init);

@@ -326,13 +327,21 @@ static void svc_thread_dequeue(struct svc_pool *pool, struct svc_rqst *rqstp)
list_del(&rqstp->rq_list);
}

-static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
+static int svc_xprt_has_something_to_do(struct svc_xprt *xprt)
{
if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
- return true;
- if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
- return xprt->xpt_ops->xpo_has_wspace(xprt);
- return false;
+ return 1;
+ if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED))) {
+ if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
+ if (atomic_read(&xprt->xpt_extras))
+ atomic_set(&xprt->xpt_extras, 0);
+ return 1;
+ } else {
+ atomic_inc(&xprt->xpt_extras);
+ return 2; /* only if free threads */
+ }
+ }
+ return 0;
}

/*
@@ -345,8 +354,9 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
struct svc_pool *pool;
struct svc_rqst *rqstp;
int cpu;
+ int todo = svc_xprt_has_something_to_do(xprt);

- if (!svc_xprt_has_something_to_do(xprt))
+ if (!todo)
return;

cpu = get_cpu();
@@ -361,6 +371,19 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
"svc_xprt_enqueue: "
"threads and transports both waiting??\n");

+ if (todo == 2) {
+ int free_needed = atomic_read(&xprt->xpt_extras) * 2;
+ list_for_each_entry(rqstp, &pool->sp_threads, rq_list)
+ if (--free_needed <= 0)
+ break;
+
+ if (free_needed > 0) {
+ /* Need more free threads before we allow this. */
+ atomic_add_unless(&xprt->xpt_extras, -1, 0);
+ goto out_unlock;
+ }
+ }
+
pool->sp_stats.packets++;

/* Mark transport as busy. It will remain in this state until
@@ -371,6 +394,8 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags)) {
/* Don't enqueue transport while already enqueued */
dprintk("svc: transport %p busy, not enqueued\n", xprt);
+ if (todo == 2)
+ atomic_add_unless(&xprt->xpt_extras, -1, 0);
goto out_unlock;
}

@@ -466,6 +491,7 @@ static void svc_xprt_release(struct svc_rqst *rqstp)
printk(KERN_ERR "RPC request reserved %d but used %d\n",
rqstp->rq_reserved,
rqstp->rq_res.len);
+ atomic_add_unless(&xprt->xpt_extras, -1, 0);

rqstp->rq_res.head[0].iov_len = 0;
svc_reserve(rqstp, 0);


Attachments:
signature.asc (828.00 B)

2013-07-30 02:49:12

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] NFSD/sunrpc: avoid deadlock on TCP connection due to memory pressure.

On Fri, 26 Jul 2013 10:19:16 -0400 "J.Bruce Fields" <[email protected]>
wrote:

> On Fri, Jul 26, 2013 at 06:33:03AM +1000, NeilBrown wrote:
> > On Thu, 25 Jul 2013 16:18:05 -0400 "J.Bruce Fields" <[email protected]>
> > wrote:
> >
> > > On Thu, Jul 25, 2013 at 11:30:23AM +1000, NeilBrown wrote:
> > > >
> > > > Since we enabled auto-tuning for sunrpc TCP connections we do not
> > > > guarantee that there is enough write-space on each connection to
> > > > queue a reply.
> ...
> > > This is great, thanks!
> > >
> > > Inclined to queue it up for 3.11 and stable....
> >
> > I'd agree for 3.11.
> > It feels a bit border-line for stable. "dead-lock" and "has been seen in the
> > wild" are technically enough justification...
> > I'd probably mark it as "please don't apply to -stable until 3.11 is released"
> > or something like that, just for a bit of breathing space.
> > Your call though.
>
>
> So my takeaway from http://lwn.net/Articles/559113/ was that Linus and
> Greg were requesting that:
>
> - criteria for -stable and late -rc's should really be about the
> same, and
> - people should follow Documentation/stable-kernel-rules.txt.
>
> So as an exercise to remind me what those rules are:
>
> Easy questions:
>
> - "no bigger than 100 lines, with context." Check.
> - "It must fix only one thing." Check.
> - "real bug that bothers people". Check.
> - "tested": yep. It doesn't actually say "tested on stable
> trees", and I recall this did land you with a tricky bug one
> time when a prerequisite was omitted from the backport.
>
> Judgement calls:
>
> - "obviously correct": it's short, but admittedly subtle, and
> performance regressions can take a while to get sorted out.
> - "It must fix a problem that causes a build error (but not for
> things marked CONFIG_BROKEN), an oops, a hang, data
> corruption, a real security issue, or some "oh, that's not
> good" issue. In short, something critical." We could argue
> that "server stops responding" is critical, though not to the
> same degree as a panic.
> - OR: alternatively: "Serious issues as reported by a user of a
> distribution kernel may also be considered if they fix a
> notable performance or interactivity issue." The only bz I've
> personally seen was the result of artificial testing of some
> kind, and it sounds like your case involved a disk failure?
>
> --b.

Looks like good analysis ... except that it doesn't seem conclusive. Being
conclusive would make it really good. :-)

The case that brought it to my attention doesn't require the fix.
A file system was misbehaving (blocking when it should return EJUKEBOX) and
this resulted in nfsd behaviour different from my expectation.
I expected nfsd to keep accepting requests until all threads were blocked.
However only 4 requests were accepted (which is actually better behaviour,
but not what I expected).
So I looked into it and thought that what I found wasn't really right. Which
turned out to be the case, but not the way I thought...

So my direct experience doesn't argue for the patch going to -stable at all.
If the only other reports are from artificial testing then I'd leave it out
of -stable. I don't feel -rc4 (that's next I think) is too late for it
though.

NeilBrown


Attachments:
signature.asc (828.00 B)

2013-07-15 23:32:07

by Ben Greear

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On 07/14/2013 09:35 PM, NeilBrown wrote:
> On Wed, 10 Jul 2013 10:39:38 -0700 Ben Greear <[email protected]> wrote:
>
>> On 07/10/2013 10:33 AM, Dean wrote:
>>> > This could significantly limit the amount of parallelism that can be achieved for a single TCP connection (and given that the
>>> > Linux client strongly prefers a single connection now, this could become more of an issue).
>>>
>>> I understand the simplicity in using a single tcp connection, but performance-wise it is definitely not the way to go on WAN links. When even a miniscule amount
>>> of packet loss is added to the link (<0.001% packet loss), the tcp buffer collapses and performance drops significantly (especially on 10GigE WAN links). I
>>> think new TCP algorithms could help the problem somewhat, but nothing available today makes much of a difference vs. cubic.
>>>
>>> Using multiple tcp connections allows better saturation of the link, since when packet loss occurs on a stream, the other streams can fill the void. Today, the
>>> only solution is to scale up the number of physical clients, which has high coordination overhead, or use a wan accelerator such as Bitspeed or Riverbed (which
>>> comes with its own issues such as extra hardware, cost, etc).
>>
>> I have a set of patches that allows one to do multiple unique mounts to the same server from a single
>> client, but the patches are for the client side, so it would not help
>> non-Linux clients. And, the patches were rejected for upstream as not being
>> useful. But, if you are interested in such, please let me know and I can point
>> you to them...
>
> Yes please!


I haven't ported these forward to 3.10 yet, but you can find my 3.9 tree
here:

http://dmz2.candelatech.com/git/gitweb.cgi?p=linux-3.9.dev.y/.git;a=summary

There's a bunch of other patches, but the nfs related ones are all in a row
and the patches are rebased, so you can probably pull them out w/out
too much difficulty. The patches introduce a bug where it fails to compile
w/out NFS 4.1 defined..I haven't bothered to fix it yet but it's probably
simple..or just compile with NFS 4.1 as I do.

Older trees found here, but we don't bother back-porting many patches,
so I'd use the latest if you can.

http://dmz2.candelatech.com/git/gitweb.cgi

And, you'll need a patched mount.nfs:

https://github.com/greearb/nfs-utils-ct

I hope to get started on porting these to 3.10 later this week.. If
there is any interest in this patch series or something like it going
upstream (there wasn't in the past, I've no good reason to think that
has changed), let me know and I will clean them up and post them to
the mailing list again...

If you search for [email protected] and 'bind to local' you
can find previous threads in the mailing list archives...

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2013-07-15 04:35:38

by NeilBrown

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Wed, 10 Jul 2013 10:39:38 -0700 Ben Greear <[email protected]> wrote:

> On 07/10/2013 10:33 AM, Dean wrote:
> > > This could significantly limit the amount of parallelism that can be achieved for a single TCP connection (and given that the
> > > Linux client strongly prefers a single connection now, this could become more of an issue).
> >
> > I understand the simplicity in using a single tcp connection, but performance-wise it is definitely not the way to go on WAN links. When even a miniscule amount
> > of packet loss is added to the link (<0.001% packet loss), the tcp buffer collapses and performance drops significantly (especially on 10GigE WAN links). I
> > think new TCP algorithms could help the problem somewhat, but nothing available today makes much of a difference vs. cubic.
> >
> > Using multiple tcp connections allows better saturation of the link, since when packet loss occurs on a stream, the other streams can fill the void. Today, the
> > only solution is to scale up the number of physical clients, which has high coordination overhead, or use a wan accelerator such as Bitspeed or Riverbed (which
> > comes with its own issues such as extra hardware, cost, etc).
>
> I have a set of patches that allows one to do multiple unique mounts to the same server from a single
> client, but the patches are for the client side, so it would not help
> non-Linux clients. And, the patches were rejected for upstream as not being
> useful. But, if you are interested in such, please let me know and I can point
> you to them...

Yes please!

NeilBrown


Attachments:
signature.asc (828.00 B)

2013-07-16 04:47:04

by NeilBrown

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Mon, 15 Jul 2013 16:32:00 -0700 Ben Greear <[email protected]> wrote:

> On 07/14/2013 09:35 PM, NeilBrown wrote:
> > On Wed, 10 Jul 2013 10:39:38 -0700 Ben Greear <[email protected]> wrote:
> >
> >> On 07/10/2013 10:33 AM, Dean wrote:
> >>> > This could significantly limit the amount of parallelism that can be achieved for a single TCP connection (and given that the
> >>> > Linux client strongly prefers a single connection now, this could become more of an issue).
> >>>
> >>> I understand the simplicity in using a single tcp connection, but performance-wise it is definitely not the way to go on WAN links. When even a miniscule amount
> >>> of packet loss is added to the link (<0.001% packet loss), the tcp buffer collapses and performance drops significantly (especially on 10GigE WAN links). I
> >>> think new TCP algorithms could help the problem somewhat, but nothing available today makes much of a difference vs. cubic.
> >>>
> >>> Using multiple tcp connections allows better saturation of the link, since when packet loss occurs on a stream, the other streams can fill the void. Today, the
> >>> only solution is to scale up the number of physical clients, which has high coordination overhead, or use a wan accelerator such as Bitspeed or Riverbed (which
> >>> comes with its own issues such as extra hardware, cost, etc).
> >>
> >> I have a set of patches that allows one to do multiple unique mounts to the same server from a single
> >> client, but the patches are for the client side, so it would not help
> >> non-Linux clients. And, the patches were rejected for upstream as not being
> >> useful. But, if you are interested in such, please let me know and I can point
> >> you to them...
> >
> > Yes please!
>
>
> I haven't ported these forward to 3.10 yet, but you can find my 3.9 tree
> here:
>
> http://dmz2.candelatech.com/git/gitweb.cgi?p=linux-3.9.dev.y/.git;a=summary
>
> There's a bunch of other patches, but the nfs related ones are all in a row
> and the patches are rebased, so you can probably pull them out w/out
> too much difficulty. The patches introduce a bug where it fails to compile
> w/out NFS 4.1 defined..I haven't bothered to fix it yet but it's probably
> simple..or just compile with NFS 4.1 as I do.
>
> Older trees found here, but we don't bother back-porting many patches,
> so I'd use the latest if you can.
>
> http://dmz2.candelatech.com/git/gitweb.cgi
>
> And, you'll need a patched mount.nfs:
>
> https://github.com/greearb/nfs-utils-ct
>
> I hope to get started on porting these to 3.10 later this week.. If
> there is any interest in this patch series or something like it going
> upstream (there wasn't in the past, I've no good reason to think that
> has changed), let me know and I will clean them up and post them to
> the mailing list again...
>
> If you search for [email protected] and 'bind to local' you
> can find previous threads in the mailing list archives...

Thanks.
You are supporting multiple independent mounts by using different client-side
IP addresses. That has clear benefits for controlling routing, but also
admin costs if you don't care about routing. The latter case applies to me.

So I'll bookmark this, but it isn't what I need just now.

Thanks,
NeilBrown


Attachments:
signature.asc (828.00 B)

2013-08-01 02:49:27

by J.Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] NFSD/sunrpc: avoid deadlock on TCP connection due to memory pressure.

On Tue, Jul 30, 2013 at 12:48:57PM +1000, NeilBrown wrote:
> On Fri, 26 Jul 2013 10:19:16 -0400 "J.Bruce Fields" <[email protected]>
> wrote:
>
> > On Fri, Jul 26, 2013 at 06:33:03AM +1000, NeilBrown wrote:
> > > On Thu, 25 Jul 2013 16:18:05 -0400 "J.Bruce Fields" <[email protected]>
> > > wrote:
> > >
> > > > On Thu, Jul 25, 2013 at 11:30:23AM +1000, NeilBrown wrote:
> > > > >
> > > > > Since we enabled auto-tuning for sunrpc TCP connections we do not
> > > > > guarantee that there is enough write-space on each connection to
> > > > > queue a reply.
> > ...
> > > > This is great, thanks!
> > > >
> > > > Inclined to queue it up for 3.11 and stable....
> > >
> > > I'd agree for 3.11.
> > > It feels a bit border-line for stable. "dead-lock" and "has been seen in the
> > > wild" are technically enough justification...
> > > I'd probably mark it as "please don't apply to -stable until 3.11 is released"
> > > or something like that, just for a bit of breathing space.
> > > Your call though.
> >
> >
> > So my takeaway from http://lwn.net/Articles/559113/ was that Linus and
> > Greg were requesting that:
> >
> > - criteria for -stable and late -rc's should really be about the
> > same, and
> > - people should follow Documentation/stable-kernel-rules.txt.
> >
> > So as an exercise to remind me what those rules are:
> >
> > Easy questions:
> >
> > - "no bigger than 100 lines, with context." Check.
> > - "It must fix only one thing." Check.
> > - "real bug that bothers people". Check.
> > - "tested": yep. It doesn't actually say "tested on stable
> > trees", and I recall this did land you with a tricky bug one
> > time when a prerequisite was omitted from the backport.
> >
> > Judgement calls:
> >
> > - "obviously correct": it's short, but admittedly subtle, and
> > performance regressions can take a while to get sorted out.
> > - "It must fix a problem that causes a build error (but not for
> > things marked CONFIG_BROKEN), an oops, a hang, data
> > corruption, a real security issue, or some "oh, that's not
> > good" issue. In short, something critical." We could argue
> > that "server stops responding" is critical, though not to the
> > same degree as a panic.
> > - OR: alternatively: "Serious issues as reported by a user of a
> > distribution kernel may also be considered if they fix a
> > notable performance or interactivity issue." The only bz I've
> > personally seen was the result of artificial testing of some
> > kind, and it sounds like your case involved a disk failure?
> >
> > --b.
>
> Looks like a good analysis ... except that it doesn't seem conclusive. Being
> conclusive would make it really good. :-)

Hah, yeah sorry.

> The case that brought it to my attention doesn't require the fix.
> A file system was misbehaving (blocking when it should return EJUKEBOX) and
> this resulted in nfsd behaviour different from what I expected.
> I expected nfsd to keep accepting requests until all threads were blocked.
> However only 4 requests were accepted (which is actually better behaviour,
> but not what I expected).
> So I looked into it and thought that what I found wasn't really right. Which
> turned out to be the case, but not the way I thought...
>
> So my direct experience doesn't argue for the patch going to -stable at all.
> If the only other reports are from artificial testing then I'd leave it out
> of -stable. I don't feel -rc4 (that's next I think) is too late for it
> though.

OK, I think I agree. We can pass it along to stable later if more
complaints surface....

--b.

2013-07-15 05:02:40

by NeilBrown

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Sun, 14 Jul 2013 21:26:20 -0400 Jim Rees <[email protected]> wrote:

> J.Bruce Fields wrote:
>
> On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
> >
> > Hi,
> > I just noticed this commit:
> >
> > commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> > Author: Olga Kornievskaia <[email protected]>
> > Date: Tue Oct 21 14:13:47 2008 -0400
> >
> > svcrpc: take advantage of tcp autotuning
> >
> >
> > which I must confess surprised me. I wonder if the full implications of
> > removing that functionality were understood.
> >
> > Previously nfsd would set the transmit buffer space for a connection to
> > ensure there is plenty to hold all replies. Now it doesn't.
> >
> > nfsd refuses to accept a request if there isn't enough space in the transmit
> > buffer to send a reply. This is important to ensure that each reply gets
> > sent atomically without blocking and there is no risk of replies getting
> > interleaved.
> >
> > The server starts out with a large estimate of the reply space (1M) and for
> > NFSv3 and v2 it quickly adjusts this down to something realistic. For NFSv4
> > it is much harder to estimate the space needed so it just assumes every
> > reply will require 1M of space.
> >
> > This means that with NFSv4, as soon as you have enough concurrent requests
> > such that 1M each reserves all of whatever window size was auto-tuned, new
> > requests on that connection will be ignored.
> >
> > This could significantly limit the amount of parallelism that can be achieved
> > for a single TCP connection (and given that the Linux client strongly prefers
> > a single connection now, this could become more of an issue).
>
> Worse, I believe it can deadlock completely if the transmit buffer
> shrinks too far, and people really have run into this:
>
> It's been a few years since I looked at this, but are you sure autotuning
> reduces the buffer space available on the sending socket? That doesn't sound
> like correct behavior to me. I know we thought about this at the time.

Autotuning is enabled when SOCK_SNDBUF_LOCK is not set in sk_userlocks.

One of the main effects of this flag is to disable:

static inline void sk_stream_moderate_sndbuf(struct sock *sk)
{
	if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK)) {
		sk->sk_sndbuf = min(sk->sk_sndbuf, sk->sk_wmem_queued >> 1);
		sk->sk_sndbuf = max(sk->sk_sndbuf, SOCK_MIN_SNDBUF);
	}
}


which will reduce sk_sndbuf to half the queued writes. As sk_wmem_queued
cannot grow above sk_sndbuf, this definitely reduces sk_sndbuf (though never
below SOCK_MIN_SNDBUF, which is 2K).

This seems to happen under memory pressure (sk_stream_alloc_skb).

So yes: memory pressure can reduce the sndbuf size when autotuning is enabled, and it
can get as low as 2K.
(An API to set this minimum to e.g. 2M for nfsd connections would be an alternate
fix for the deadlock, as Bruce has already mentioned).
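
Purely as a sketch (neither this helper nor an sk_sndbuf_min field exist in
the kernel), such a minimum could be enforced by a variant of the function
above that refuses to shrink below a caller-supplied floor:

static inline void sk_stream_moderate_sndbuf_min(struct sock *sk)
{
	/* Hypothetical only: sk_sndbuf_min does not exist upstream.  The
	 * idea is that nfsd could set it to the space reserved for
	 * outstanding replies, so memory pressure may still shrink
	 * sk_sndbuf, but never below that floor. */
	if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK)) {
		int floor = max_t(int, SOCK_MIN_SNDBUF, sk->sk_sndbuf_min);

		sk->sk_sndbuf = min(sk->sk_sndbuf, sk->sk_wmem_queued >> 1);
		sk->sk_sndbuf = max(sk->sk_sndbuf, floor);
	}
}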

>
> It does seem like a bug that we don't multiply the needed send buffer space
> by the number of threads. I think that's because we don't know how many
> threads there are going to be in svc_setup_socket()?

We used to, but it turned out to be too small in practice! As it auto-grows,
the "4 * serv->sv_max_mesg" setting is big enough ... if only it wouldn't
shrink below that.

NeilBrown


Attachments:
signature.asc (828.00 B)

2013-07-10 19:07:35

by J.Bruce Fields

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Wed, Jul 10, 2013 at 02:32:33PM +1000, NeilBrown wrote:
> On Tue, 9 Jul 2013 22:27:35 -0400 "J.Bruce Fields" <[email protected]>
> wrote:
>
> > On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
> > >
> > > Hi,
> > > I just noticed this commit:
> > >
> > > commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> > > Author: Olga Kornievskaia <[email protected]>
> > > Date: Tue Oct 21 14:13:47 2008 -0400
> > >
> > > svcrpc: take advantage of tcp autotuning
> > >
> > >
> > > which I must confess surprised me. I wonder if the full implications of
> > > removing that functionality were understood.
> > >
> > > Previously nfsd would set the transmit buffer space for a connection to
> > > ensure there is plenty to hold all replies. Now it doesn't.
> > >
> > > nfsd refuses to accept a request if there isn't enough space in the transmit
> > > buffer to send a reply. This is important to ensure that each reply gets
> > > sent atomically without blocking and there is no risk of replies getting
> > > interleaved.

By the way, it's xpt_mutex that really guarantees non-interleaving,
isn't it?

I think of the svc_tcp_has_wspace checks as solving a problem of
fairness between clients. If we only had one client, everything would
work without them. But when we have multiple clients we don't want a
single slow client to be able to tie up every thread waiting for that
client to receive read data. Is that accurate?
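
For reference, a simplified sketch of what the TCP wspace check amounts to
(not the exact upstream code): is there room in the send buffer for
everything already reserved for pending replies plus one more
maximum-sized reply?

static bool svc_tcp_wspace_sketch(struct svc_xprt *xprt, struct sock *sk,
				  struct svc_serv *serv)
{
	/* sk_stream_wspace() is sk_sndbuf - sk_wmem_queued */
	int required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;

	return sk_stream_wspace(sk) >= required;
}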

> > > The server starts out with a large estimate of the reply space (1M) and for
> > > NFSv3 and v2 it quickly adjusts this down to something realistic. For NFSv4
> > > it is much harder to estimate the space needed so it just assumes every
> > > reply will require 1M of space.
> > >
> > > This means that with NFSv4, as soon as you have enough concurrent requests
> > > such that 1M each reserves all of whatever window size was auto-tuned, new
> > > requests on that connection will be ignored.
> > >
> > > This could significantly limit the amount of parallelism that can be achieved
> > > for a single TCP connection (and given that the Linux client strongly prefers
> > > a single connection now, this could become more of an issue).
> >
> > Worse, I believe it can deadlock completely if the transmit buffer
> > shrinks too far, and people really have run into this:
> >
> > http://mid.gmane.org/<[email protected]>
> >
> > Trond's suggestion looked at the time like it might work and be doable:
> >
> > http://mid.gmane.org/<4FA345DA4F4AE44899BD2B03EEEC2FA91833C1D8@sacexcmbx05-prd.hq.netapp.com>
> >
> > but I dropped it.
>
> I would probably generalise Trond's suggestion and allow "N" extra requests
> through that exceed the reservation, where N is related to the number of idle
> threads. A square root might be nice, but half is probably easiest.
>
> If any send takes more than 30 seconds the sk_sndtimeo will kick in and close
> the connection so a really bad connection won't block threads indefinitely.
>
>
> And yes - a nice test case would be good.
>
> What do you think of the following (totally untested - just for comment)?

In cases where the disk is the bottleneck, there aren't actually going
to be idle threads, are there? In which case I don't think this helps
save a stuck client.

Trond's suggestion doesn't have the same limitation, if I understand it
correctly.

--b.

> diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
> index b05963f..2fc92f1 100644
> --- a/include/linux/sunrpc/svc_xprt.h
> +++ b/include/linux/sunrpc/svc_xprt.h
> @@ -81,6 +81,10 @@ struct svc_xprt {
>
> struct net *xpt_net;
> struct rpc_xprt *xpt_bc_xprt; /* NFSv4.1 backchannel */
> +
> + atomic_t xpt_extras; /* Extra requests which
> + * might block on send
> + */
> };
>
> static inline void unregister_xpt_user(struct svc_xprt *xpt, struct svc_xpt_user *u)
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 80a6640..fc366ca 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -165,6 +165,7 @@ void svc_xprt_init(struct net *net, struct svc_xprt_class *xcl,
> set_bit(XPT_BUSY, &xprt->xpt_flags);
> rpc_init_wait_queue(&xprt->xpt_bc_pending, "xpt_bc_pending");
> xprt->xpt_net = get_net(net);
> + atomic_set(&xprt->xpt_extras, 0);
> }
> EXPORT_SYMBOL_GPL(svc_xprt_init);
>
> @@ -326,13 +327,21 @@ static void svc_thread_dequeue(struct svc_pool *pool, struct svc_rqst *rqstp)
> list_del(&rqstp->rq_list);
> }
>
> -static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
> +static int svc_xprt_has_something_to_do(struct svc_xprt *xprt)
> {
> if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
> - return true;
> - if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
> - return xprt->xpt_ops->xpo_has_wspace(xprt);
> - return false;
> + return 1;
> + if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED))) {
> + if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
> + if (atomic_read(&xprt->xpt_extras))
> + atomic_set(&xprt->xpt_extras, 0);
> + return 1;
> + } else {
> + atomic_inc(&xprt->xpt_extras);
> + return 2; /* only if free threads */
> + }
> + }
> + return 0;
> }
>
> /*
> @@ -345,8 +354,9 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
> struct svc_pool *pool;
> struct svc_rqst *rqstp;
> int cpu;
> + int todo = svc_xprt_has_something_to_do(xprt);
>
> - if (!svc_xprt_has_something_to_do(xprt))
> + if (!todo)
> return;
>
> cpu = get_cpu();
> @@ -361,6 +371,19 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
> "svc_xprt_enqueue: "
> "threads and transports both waiting??\n");
>
> + if (todo == 2) {
> + int free_needed = atomic_read(&xprt->xpt_extras) * 2;
> + list_for_each_entry(rqstp, &pool->sp_thread, rq_list)
> + if (--free_needed <= 0)
> + break;
> +
> + if (free_needed > 0) {
> + /* Need more free threads before we allow this. */
> + atomic_add_unless(&xprt->xpt_extras, -1, 0);
> + goto out_unlock;
> + }
> + }
> +
> pool->sp_stats.packets++;
>
> /* Mark transport as busy. It will remain in this state until
> @@ -371,6 +394,8 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
> if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags)) {
> /* Don't enqueue transport while already enqueued */
> dprintk("svc: transport %p busy, not enqueued\n", xprt);
> + if (todo == 2)
> + atomic_add_unless(&xprt->xpt_extras, -1, 0);
> goto out_unlock;
> }
>
> @@ -466,6 +491,7 @@ static void svc_xprt_release(struct svc_rqst *rqstp)
> printk(KERN_ERR "RPC request reserved %d but used %d\n",
> rqstp->rq_reserved,
> rqstp->rq_res.len);
> + atomic_add_unless(&xprt->xpt_extras, -1, 0);
>
> rqstp->rq_res.head[0].iov_len = 0;
> svc_reserve(rqstp, 0);



2013-07-16 14:24:32

by J.Bruce Fields

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

Adding Ben Myers to Cc: in case he has any testing help or advice (see
below):

On Tue, Jul 16, 2013 at 02:00:21PM +1000, NeilBrown wrote:
> That would be because I only allowed extra requests up to half the number of
> idle threads, and there would be zero idle threads.
>
> So yes - I see your point.
>
> I now think that it is sufficient to just allow one request through per
> socket. While there is high memory pressure, that is sufficient. When the
> memory pressure drops, that should be enough to cause sndbuf to grow.
>
> We should be able to use xpt_reserved to check if this is the only request
> or not:

A remaining possible bad case is if the number of "bad" sockets is more
than the number of threads, and if the problem is something that won't
resolve itself by just waiting a little while.

I don't know if that's likely in practice. Maybe a network failure cuts
you off from a large swath (but not all) of your clients? Or maybe some
large proportion of your clients are just slow?

But this is two lines and looks likely to solve the immediate
problem:

> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 80a6640..5b832ef 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -331,7 +331,8 @@ static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
> if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
> return true;
> if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
> - return xprt->xpt_ops->xpo_has_wspace(xprt);
> + return xprt->xpt_ops->xpo_has_wspace(xprt) ||
> + atomic_read(&xprt->xpt_reserved) == 0;
> return false;
> }
>
> @@ -730,7 +731,8 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
> newxpt = xprt->xpt_ops->xpo_accept(xprt);
> if (newxpt)
> svc_add_new_temp_xprt(serv, newxpt);
> - } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
> + } else if (xprt->xpt_ops->xpo_has_wspace(xprt) ||
> + atomic_read(&xprt->xpt_reserved) == 0) {
> /* XPT_DATA|XPT_DEFERRED case: */
> dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
> rqstp, rqstp->rq_pool->sp_id, xprt,
>
>
> Thoughts?
>
> I tried testing this, but xpo_has_wspace never fails for me, even if I remove
> the calls to svc_sock_setbufsize (which probably aren't wanted for TCP any
> more).

Ben, do you still have a setup that can reproduce the problem? Or did
you ever find an easier way to reproduce?

--b.

2013-07-18 00:03:20

by Ben Myers

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

Hey Bruce,

On Tue, Jul 16, 2013 at 10:24:30AM -0400, J.Bruce Fields wrote:
> Adding Ben Myers to Cc: in case he has any testing help or advice (see
> below):
>
> On Tue, Jul 16, 2013 at 02:00:21PM +1000, NeilBrown wrote:
> > That would be because I only allowed extra requests up to half the number of
> > idle threads, and there would be zero idle threads.
> >
> > So yes - I see your point.
> >
> > I now think that it is sufficient to just allow one request through per
> > socket. While there is high memory pressure, that is sufficient. When the
> > memory pressure drops, that should be enough to cause sndbuf to grow.
> >
> > We should be able to use xpt_reserved to check if this is the only request
> > or not:
>
> A remaining possible bad case is if the number of "bad" sockets is more
> than the number of threads, and if the problem is something that won't
> resolve itself by just waiting a little while.
>
> I don't know if that's likely in practice. Maybe a network failure cuts
> you off from a large swath (but not all) of your clients? Or maybe some
> large proportion of your clients are just slow?
>
> But this is two lines and looks likely to solve the immediate
> problem:
>
> > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > index 80a6640..5b832ef 100644
> > --- a/net/sunrpc/svc_xprt.c
> > +++ b/net/sunrpc/svc_xprt.c
> > @@ -331,7 +331,8 @@ static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
> > if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
> > return true;
> > if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
> > - return xprt->xpt_ops->xpo_has_wspace(xprt);
> > + return xprt->xpt_ops->xpo_has_wspace(xprt) ||
> > + atomic_read(&xprt->xpt_reserved) == 0;
> > return false;
> > }
> >
> > @@ -730,7 +731,8 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
> > newxpt = xprt->xpt_ops->xpo_accept(xprt);
> > if (newxpt)
> > svc_add_new_temp_xprt(serv, newxpt);
> > - } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
> > + } else if (xprt->xpt_ops->xpo_has_wspace(xprt) ||
> > + atomic_read(&xprt->xpt_reserved) == 0) {
> > /* XPT_DATA|XPT_DEFERRED case: */
> > dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
> > rqstp, rqstp->rq_pool->sp_id, xprt,
> >
> >
> > Thoughts?
> >
> > I tried testing this, but xpo_has_wspace never fails for me, even if I remove
> > the calls to svc_sock_setbufsize (which probably aren't wanted for TCP any
> > more).
>
> Ben, do you still have a setup that can reproduce the problem? Or did you
> ever find an easier way to reproduce?

Unfortunately I don't still have that test rig handy, and I did not find an
easier way to reproduce this. I do agree that it 'is sufficient to just allow
one request through per socket'... probably that would have solved the problem
in our case. Maybe it would help to limit the amount of memory in your system
to try and induce additional memory pressure? IIRC it was a 1MB block size
read workload where we would hit this.

Thanks,
Ben

2013-07-15 04:32:16

by NeilBrown

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Wed, 10 Jul 2013 15:07:28 -0400 "J.Bruce Fields" <[email protected]>
wrote:

> On Wed, Jul 10, 2013 at 02:32:33PM +1000, NeilBrown wrote:
> > On Tue, 9 Jul 2013 22:27:35 -0400 "J.Bruce Fields" <[email protected]>
> > wrote:
> >
> > > On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
> > > >
> > > > Hi,
> > > > I just noticed this commit:
> > > >
> > > > commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> > > > Author: Olga Kornievskaia <[email protected]>
> > > > Date: Tue Oct 21 14:13:47 2008 -0400
> > > >
> > > > svcrpc: take advantage of tcp autotuning
> > > >
> > > >
> > > > which I must confess surprised me. I wonder if the full implications of
> > > > removing that functionality were understood.
> > > >
> > > > Previously nfsd would set the transmit buffer space for a connection to
> > > > ensure there is plenty to hold all replies. Now it doesn't.
> > > >
> > > > nfsd refuses to accept a request if there isn't enough space in the transmit
> > > > buffer to send a reply. This is important to ensure that each reply gets
> > > > sent atomically without blocking and there is no risk of replies getting
> > > > interleaved.
>
> By the way, it's xpt_mutex that really guarantees non-interleaving,
> isn't it?

Yes. I had the two different things confused. The issue is only about
blocking on writes and consequently tying up nfsd threads.

>
> I think of the svc_tcp_has_wspace checks as solving a problem of
> fairness between clients. If we only had one client, everything would
> work without them. But when we have multiple clients we don't want a
> single slow client to be able to tie up every thread waiting for that
> client to receive read data. Is that accurate?

It agrees with my understanding - yes.


>
> > > > The server starts out with a large estimate of the reply space (1M) and for
> > > > NFSv3 and v2 it quickly adjusts this down to something realistic. For NFSv4
> > > > it is much harder to estimate the space needed so it just assumes every
> > > > reply will require 1M of space.
> > > >
> > > > This means that with NFSv4, as soon as you have enough concurrent requests
> > > > such that 1M each reserves all of whatever window size was auto-tuned, new
> > > > requests on that connection will be ignored.
> > > >
> > > > This could significantly limit the amount of parallelism that can be achieved
> > > > for a single TCP connection (and given that the Linux client strongly prefers
> > > > a single connection now, this could become more of an issue).
> > >
> > > Worse, I believe it can deadlock completely if the transmit buffer
> > > shrinks too far, and people really have run into this:
> > >
> > > http://mid.gmane.org/<[email protected]>
> > >
> > > Trond's suggestion looked at the time like it might work and be doable:
> > >
> > > http://mid.gmane.org/<4FA345DA4F4AE44899BD2B03EEEC2FA91833C1D8@sacexcmbx05-prd.hq.netapp.com>
> > >
> > > but I dropped it.
> >
> > I would probably generalise Trond's suggestion and allow "N" extra requests
> > through that exceed the reservation, where N is related to the number of idle
> > threads. A square root might be nice, but half is probably easiest.
> >
> > If any send takes more than 30 seconds the sk_sndtimeo will kick in and close
> > the connection so a really bad connection won't block threads indefinitely.
> >
> >
> > And yes - a nice test case would be good.
> >
> > What do you think of the following (totally untested - just for comment)?
>
> In cases where the disk is the bottleneck, there aren't actually going
> to be idle threads, are there? In which case I don't think this helps
> save a stuck client.

Do we care about the disk being a bottleneck? We don't currently try to
handle that situation at all - if the disk is slow, everything slows down.
Unless you have two separate clients using two separate devices, I think
that is the right thing to do. Limiting each filesystem to some number of
threads might be interesting, but I don't think there would be much call for
it.

The present issue is when a client or network is slow and could thereby
interfere with the response to other clients.

Certainly the deadlock that can happen if write buffer space drops below 1Meg
would be an issue when very few clients are active.


>
> Trond's suggestion doesn't have the same limitation, if I understand it
> correctly.

I think I missed the "hand off to a workqueue" bit the first time.

I guess something like changing the "mutex_lock" in svc_send to
"mutex_try_lock" and if that fails, punt to a work-queue. We would need to
split the buffer space off the rqstp and attach it to the work struct.
Each rqstp could have a small array of work_structs and that sets the limit
on the number of requests that can be accepted when there is no wspace.

We probably don't even need a workqueue. After completing a send, svc_send
can check if another reply has been queued and if so, it starts sending it -
if there is enough wspace. If no wspace just do nothing. The wspace
callback can queue the socket, and svc_xprt_enqueue can queue up an nfsd
thread if there is write work to do. svc_handle_xprt() can do the actual
send.
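
In rough outline it could look like this (just a sketch of the shape; the
"delayed reply" helpers are made-up names, not existing functions):

int svc_send_sketch(struct svc_rqst *rqstp)
{
	struct svc_xprt *xprt = rqstp->rq_xprt;

	if (!mutex_trylock(&xprt->xpt_mutex))
		/* Someone else is sending: park this reply on the
		 * transport and let that thread, or the wspace
		 * callback, deal with it later. */
		return svc_park_delayed_reply(xprt, rqstp);

	svc_do_sendto(xprt, rqstp);	/* the existing xpo_sendto path */

	/* Drain parked replies while the send buffer has room. */
	while (xprt->xpt_ops->xpo_has_wspace(xprt) &&
	       svc_send_next_delayed_reply(xprt))
		;

	mutex_unlock(&xprt->xpt_mutex);
	return 0;
}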

This would require modifying all the xpo_sendto routines and the routines
they call to take an index into the list of send buffers per socket. Or
maybe a pointer to a "delayed reply" structure.


For each delayed reply we would need:
- struct xdr_buf - to hold the reply
- struct page *[RPCSVC_MAXPAGES] - to hold the pages so we can free them
when done
- struct sockaddr_storage - to hold source address for UDP reply
- struct sockaddr_storage - to hold the dest address for UDP
- maybe something else for RDMA? - svc_rdma_sendto uses rqstp->rq_arg - not
sure why.

We could possibly allocate these as needed. If the kmalloc fails, just use
"mutex_lock" instead of "mutex_trylock" and block for a while.

Note that while this structure would be smaller than "struct svc_rqst", it is
not completely trivial. It might be just as easy to allocate a new "svc_rqst"
structure. Or fork new nfsd threads.


I think the key issue is really to allow some nfsd threads to block on
writing the reply, but to firmly limit the number which do. This is what my
patch tried to do.
Punting to a work queue might allow a little less memory to be allocated
to end up with the same number of spare threads, but would require a lot more
code churn.

I wonder if it is time to revisit dynamic sizing of the nfsd thread pool.

NeilBrown


Attachments:
signature.asc (828.00 B)

2013-07-15 11:57:21

by Jim Rees

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

NeilBrown wrote:

On Sun, 14 Jul 2013 21:26:20 -0400 Jim Rees <[email protected]> wrote:

> J.Bruce Fields wrote:
>
> On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
> >
> > Hi,
> > I just noticed this commit:
> >
> > commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> > Author: Olga Kornievskaia <[email protected]>
> > Date: Tue Oct 21 14:13:47 2008 -0400
> >
> > svcrpc: take advantage of tcp autotuning
> >
> >
> > which I must confess surprised me. I wonder if the full implications of
> > removing that functionality were understood.
> >
> > Previously nfsd would set the transmit buffer space for a connection to
> > ensure there is plenty to hold all replies. Now it doesn't.
> >
> > nfsd refuses to accept a request if there isn't enough space in the transmit
> > buffer to send a reply. This is important to ensure that each reply gets
> > sent atomically without blocking and there is no risk of replies getting
> > interleaved.
> >
> > The server starts out with a large estimate of the reply space (1M) and for
> > NFSv3 and v2 it quickly adjusts this down to something realistic. For NFSv4
> > it is much harder to estimate the space needed so it just assumes every
> > reply will require 1M of space.
> >
> > This means that with NFSv4, as soon as you have enough concurrent requests
> > such that 1M each reserves all of whatever window size was auto-tuned, new
> > requests on that connection will be ignored.
> >
> > This could significantly limit the amount of parallelism that can be achieved
> > for a single TCP connection (and given that the Linux client strongly prefers
> > a single connection now, this could become more of an issue).
>
> Worse, I believe it can deadlock completely if the transmit buffer
> shrinks too far, and people really have run into this:
>
> It's been a few years since I looked at this, but are you sure autotuning
> reduces the buffer space available on the sending socket? That doesn't sound
> like correct behavior to me. I know we thought about this at the time.

Autotuning is enabled when SOCK_SNDBUF_LOCK is not set in sk_userlocks.

One of the main effects of this flag is to disable:

static inline void sk_stream_moderate_sndbuf(struct sock *sk)
{
	if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK)) {
		sk->sk_sndbuf = min(sk->sk_sndbuf, sk->sk_wmem_queued >> 1);
		sk->sk_sndbuf = max(sk->sk_sndbuf, SOCK_MIN_SNDBUF);
	}
}


which will reduce sk_sndbuf to half the queued writes. As sk_wmem_queued
cannot grow above sk_sndbuf, this definitely reduces sk_sndbuf (though never
below SOCK_MIN_SNDBUF, which is 2K).

This seems to happen under memory pressure (sk_stream_alloc_skb).

So yes: memory pressure can reduce the sndbuf size when autotuning is enabled, and it
can get as low as 2K.
(An API to set this minimum to e.g. 2M for nfsd connections would be an alternate
fix for the deadlock, as Bruce has already mentioned).

>
> It does seem like a bug that we don't multiply the needed send buffer space
> by the number of threads. I think that's because we don't know how many
> threads there are going to be in svc_setup_socket()?

We used to, but it turned out to be too small in practice! As it auto-grows,
the "4 * serv->sv_max_mesg" setting is big enough ... if only it wouldn't
shrink below that.

This sounds familiar. In fact I think we asked on the network mailing list
about providing an api to set a minimum on the socket buffer. I'll go
through my old email and see if I can find it.

2013-07-15 13:42:19

by Jim Rees

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

Here's the thread on netdev where we discussed this issue:
http://marc.info/?l=linux-netdev&m=121545498313619&w=2

Here is where I asked about an api to set a socket buf minimum:
http://marc.info/?l=linux-netdev&m=121554794432038&w=2

There are many subtleties, and I suggest anyone who wants to try to fix this
code should read the email thread. The netdev people were pretty insistent
that we turn on autotuning; David Miller said the old behavior was
equivalent to "turn[ing] off half of the TCP stack."

2013-07-15 01:26:30

by Jim Rees

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

J.Bruce Fields wrote:

On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
>
> Hi,
> I just noticed this commit:
>
> commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> Author: Olga Kornievskaia <[email protected]>
> Date: Tue Oct 21 14:13:47 2008 -0400
>
> svcrpc: take advantage of tcp autotuning
>
>
> which I must confess surprised me. I wonder if the full implications of
> removing that functionality were understood.
>
> Previously nfsd would set the transmit buffer space for a connection to
> ensure there is plenty to hold all replies. Now it doesn't.
>
> nfsd refuses to accept a request if there isn't enough space in the transmit
> buffer to send a reply. This is important to ensure that each reply gets
> sent atomically without blocking and there is no risk of replies getting
> interleaved.
>
> The server starts out with a large estimate of the reply space (1M) and for
> NFSv3 and v2 it quickly adjusts this down to something realistic. For NFSv4
> it is much harder to estimate the space needed so it just assumes every
> reply will require 1M of space.
>
> This means that with NFSv4, as soon as you have enough concurrent requests
> such that 1M each reserves all of whatever window size was auto-tuned, new
> requests on that connection will be ignored.
>
> This could significantly limit the amount of parallelism that can be achieved
> for a single TCP connection (and given that the Linux client strongly prefers
> a single connection now, this could become more of an issue).

Worse, I believe it can deadlock completely if the transmit buffer
shrinks too far, and people really have run into this:

It's been a few years since I looked at this, but are you sure autotuning
reduces the buffer space available on the sending socket? That doesn't sound
like correct behavior to me. I know we thought about this at the time.

It does seem like a bug that we don't multiply the needed send buffer space
by the number of threads. I think that's because we don't know how many
threads there are going to be in svc_setup_socket()?

2013-07-10 20:00:39

by Michael Richardson

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?


Dean <[email protected]> wrote:
>> This could significantly limit the amount of parallelism that can be
> achieved for a single TCP connection (and given that the
>> Linux client strongly prefers a single connection now, this could
> become more of an issue).

> I understand the simplicity in using a single tcp connection, but
> performance-wise it is definitely not the way to go on WAN links. When
> even a miniscule amount of packet loss is added to the link (<0.001%
> packet loss), the tcp buffer collapses and performance drops

And just remember bufferbloat.

> Using multiple tcp connections allows better saturation of the link,
> since when packet loss occurs on a stream, the other streams can fill
> the void. Today, the only solution is to scale up the number of
> physical clients, which has high coordination overhead, or use a wan
> accelerator such as Bitspeed or Riverbed (which comes with its own
> issues such as extra hardware, cost, etc).

This is true on high speed links with few bottlenecks, but not so much when
there is a DSL-type bottleneck and excessive buffers.

--
] Never tell me the odds! | ipv6 mesh networks [
] Michael Richardson, Sandelman Software Works | network architect [
] [email protected] http://www.sandelman.ca/ | ruby on rails [