2013-07-10 02:27:43

by J.Bruce Fields

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
>
> Hi,
> I just noticed this commit:
>
> commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> Author: Olga Kornievskaia <[email protected]>
> Date: Tue Oct 21 14:13:47 2008 -0400
>
> svcrpc: take advantage of tcp autotuning
>
>
> which I must confess surprised me. I wonder if the full implications of
> removing that functionality were understood.
>
> Previously nfsd would set the transmit buffer space for a connection to
> ensure there is plenty to hold all replies. Now it doesn't.
>
> nfsd refuses to accept a request if there isn't enough space in the transmit
> buffer to send a reply. This is important to ensure that each reply gets
> sent atomically without blocking and there is no risk of replies getting
> interleaved.
>
> The server starts out with a large estimate of the reply space (1M) and for
> NFSv3 and v2 it quickly adjusts this down to something realistic. For NFSv4
> it is much harder to estimate the space needed so it just assumes every
> reply will require 1M of space.
>
> This means that with NFSv4, as soon as you have enough concurrent requests
> such that 1M each reserves all of whatever window size was auto-tuned, new
> requests on that connection will be ignored.
>
> This could significantly limit the amount of parallelism that can be achieved
> for a single TCP connection (and given that the Linux client strongly prefers
> a single connection now, this could become more of an issue).

Worse, I believe it can deadlock completely if the transmit buffer
shrinks too far, and people really have run into this:

http://mid.gmane.org/<[email protected]>

Trond's suggestion looked at the time like it might work and be doable:

http://mid.gmane.org/<4FA345DA4F4AE44899BD2B03EEEC2FA91833C1D8@sacexcmbx05-prd.hq.netapp.com>

but I dropped it.

The v4-specific situation might not be hard to improve: the v4
processing decodes the whole compound at the start, so it knows the
sequence of ops before it does anything else and could compute a tighter
bound on the reply size at that point.
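
To illustrate the idea (a sketch only, not code from any tree: the
opcnt/ops fields follow my reading of nfsd's struct nfsd4_compoundargs,
and nfsd4_max_op_reply() plus the header allowance are made up):

/*
 * Sketch: walk the already-decoded compound and sum a per-op worst-case
 * reply size instead of reserving a flat 1MB for every v4 request.
 */
static u32 nfsd4_estimate_reply_size(const struct nfsd4_compoundargs *args)
{
	u32 est = 256;	/* rough allowance for RPC and compound headers */
	u32 i;

	for (i = 0; i < args->opcnt; i++)
		/*
		 * Hypothetical helper: READ would be bounded by the
		 * requested count, GETATTR by the attribute mask, and
		 * most other ops by a small constant.
		 */
		est += nfsd4_max_op_reply(&args->ops[i]);
	return est;
}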

> I don't know if this is a real issue that needs addressing - I hit in the
> context of a server filesystem which was misbehaving and so caused this issue
> to become obvious. But in this case it is certainly the filesystem, not the
> NFS server, which is causing the problem.

Yeah, it looks like a real problem.

Some good test cases would be useful if we could find some.

And, yes, my screwup for merging 966043986 without solving those other
problems first. I was confused.

It does make a difference on high bandwidth-delay product networks (something
people have also hit). I'd rather not regress there and also would
rather not require manual tuning for something we should be able to get
right automatically.

--b.


2013-07-10 17:33:13

by Dean

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

> This could significantly limit the amount of parallelism that can be
> achieved for a single TCP connection (and given that the
> Linux client strongly prefers a single connection now, this could
> become more of an issue).

I understand the simplicity in using a single tcp connection, but
performance-wise it is definitely not the way to go on WAN links. When
even a minuscule amount of packet loss is added to the link (<0.001%
packet loss), the tcp buffer collapses and performance drops
significantly (especially on 10GigE WAN links). I think new TCP
algorithms could help the problem somewhat, but nothing available today
makes much of a difference vs. cubic.
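
As a rough illustration (my numbers, assuming a 1460-byte MSS and a 50 ms
WAN RTT): the classic Mathis et al. approximation bounds steady-state
throughput at about (MSS/RTT) * 1.22/sqrt(p); cubic does somewhat better,
but the shape is the same. With p = 1e-5 that is
(11680 bits / 0.05 s) * (1.22 / 0.00316) ~= 90 Mbit/s per connection,
i.e. around 1% of a 10GigE link.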

Using multiple tcp connections allows better saturation of the link,
since when packet loss occurs on a stream, the other streams can fill
the void. Today, the only solution is to scale up the number of
physical clients, which has high coordination overhead, or use a WAN
accelerator such as Bitspeed or Riverbed (which comes with its own
issues such as extra hardware, cost, etc).


> It does make a difference on high bandwidth-delay product networks (something
> people have also hit). I'd rather not regress there and also would
> rather not require manual tuning for something we should be able to get
> right automatically.

Prior to this patch, the tcp buffer was fixed at such a small size
(especially for writes) that the idea of parallelism was moot anyway.
Whatever the tcp buffer negotiates to now is definitely bigger than what
was there beforehand, which I think is borne out by the fact that no
performance regression was found.

Regressing back to the old way is a death knell for any system with a
delay of >1ms or a bandwidth of >1GigE, so I definitely hope we never go
there. Of course, now that autoscaling allows the tcp buffer to grow to
reasonable values to achieve good performance for 10+GigE and WAN links,
if we can improve the parallelism/stability even further, that would be
great.
Dean

2013-07-26 14:19:18

by J.Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] NFSD/sunrpc: avoid deadlock on TCP connection due to memory pressure.

On Fri, Jul 26, 2013 at 06:33:03AM +1000, NeilBrown wrote:
> On Thu, 25 Jul 2013 16:18:05 -0400 "J.Bruce Fields" <[email protected]>
> wrote:
>
> > On Thu, Jul 25, 2013 at 11:30:23AM +1000, NeilBrown wrote:
> > >
> > > Since we enabled auto-tuning for sunrpc TCP connections we do not
> > > guarantee that there is enough write-space on each connection to
> > > queue a reply.
...
> > This is great, thanks!
> >
> > Inclined to queue it up for 3.11 and stable....
>
> I'd agree for 3.11.
> It feels a bit border-line for stable. "dead-lock" and "has been seen in the
> wild" are technically enough justification...
> I'd probably mark it as "please don't apply to -stable until 3.11 is released"
> or something like that, just for a bit of breathing space.
> Your call though.


So my takeaway from http://lwn.net/Articles/559113/ was that Linus and
Greg were requesting that:

- criteria for -stable and late -rc's should really be about the
same, and
- people should follow Documentation/stable-kernel-rules.txt.

So as an exercise to remind me what those rules are:

Easy questions:

- "no bigger than 100 lines, with context." Check.
- "It must fix only one thing." Check.
- "real bug that bothers people". Check.
- "tested": yep. It doesn't actually say "tested on stable
trees", and I recall this did land you with a tricky bug one
time when a prerequisite was omitted from the backport.

Judgement calls:

- "obviously correct": it's short, but admittedly subtle, and
performance regressions can take a while to get sorted out.
- "It must fix a problem that causes a build error (but not for
things marked CONFIG_BROKEN), an oops, a hang, data
corruption, a real security issue, or some "oh, that's not
good" issue. In short, something critical." We could argue
that "server stops responding" is critical, though not to the
same degree as a panic.
- OR: alternatively: "Serious issues as reported by a user of a
distribution kernel may also be considered if they fix a
notable performance or interactivity issue." The only bz I've
personally seen was the result of artificial testing of some
kind, and it sounds like your case involved a disk failure?

--b.

2013-07-16 01:11:06

by NeilBrown

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Mon, 15 Jul 2013 09:42:11 -0400 Jim Rees <[email protected]> wrote:

> Here's the thread on netdev where we discussed this issue:
> http://marc.info/?l=linux-netdev&m=121545498313619&w=2
>
> Here is where I asked about an api to set a socket buf minimum:
> http://marc.info/?l=linux-netdev&m=121554794432038&w=2
>
> There are many subtleties, and I suggest anyone who wants to try to fix this
> code should read the email thread. The netdev people were pretty insistent
> that we turn on autotuning; David Miller said the old behavior was
> equivalent to "turn[ing] off half of the TCP stack."

Thanks for those pointers.
Certainly autotuning is what we want.
The issues with the receive buffer getting tuned too low appear to have been
addressed some time ago.
The issues with the send buffer getting tuned too low are not what I at first
thought they were, but still are not completely resolved.
My previous patch, which allowed N requests beyond the available space, is, I
now think, more than necessary. You just need to make sure one can get
through. That should be enough to push the sndbuf size up quite quickly.

I'll revise it, test, and repost.

Thanks,
NeilBrown


Attachments:
signature.asc (828.00 B)

2013-07-16 04:00:33

by NeilBrown

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Mon, 15 Jul 2013 21:58:03 -0400 "J.Bruce Fields" <[email protected]>
wrote:

> On Mon, Jul 15, 2013 at 02:32:03PM +1000, NeilBrown wrote:
> > On Wed, 10 Jul 2013 15:07:28 -0400 "J.Bruce Fields" <[email protected]>
> > wrote:
> >
> > > On Wed, Jul 10, 2013 at 02:32:33PM +1000, NeilBrown wrote:
> > > > On Tue, 9 Jul 2013 22:27:35 -0400 "J.Bruce Fields" <[email protected]>
> > > > wrote:
> > > >
> > > > > On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
> > > > > >
> > > > > > Hi,
> > > > > > I just noticed this commit:
> > > > > >
> > > > > > commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> > > > > > Author: Olga Kornievskaia <[email protected]>
> > > > > > Date: Tue Oct 21 14:13:47 2008 -0400
> > > > > >
> > > > > > svcrpc: take advantage of tcp autotuning
> > > > > >
> > > > > >
> > > > > > which I must confess surprised me. I wonder if the full implications of
> > > > > > removing that functionality were understood.
> > > > > >
> > > > > > Previously nfsd would set the transmit buffer space for a connection to
> > > > > > ensure there is plenty to hold all replies. Now it doesn't.
> > > > > >
> > > > > > nfsd refuses to accept a request if there isn't enough space in the transmit
> > > > > > buffer to send a reply. This is important to ensure that each reply gets
> > > > > > sent atomically without blocking and there is no risk of replies getting
> > > > > > interleaved.
> > >
> > > By the way, it's xpt_mutex that really guarantees non-interleaving,
> > > isn't it?
> >
> > Yes. I had the two different things confused. The issue is only about
> > blocking on writes and consequently tying up nfsd threads.
> >
> > >
> > > I think of the svc_tcp_has_wspace checks as solving a problem of
> > > fairness between clients. If we only had one client, everything would
> > > work without them. But when we have multiple clients we don't want a
> > > single slow client to be able to tie up every thread waiting for that
> > > client to receive read data. Is that accurate?
> >
> > It agrees with my understanding - yes.
> >
> >
> > >
> > > > > > The server starts out with a large estimate of the reply space (1M) and for
> > > > > > NFSv3 and v2 it quickly adjusts this down to something realistic. For NFSv4
> > > > > > it is much harder to estimate the space needed so it just assumes every
> > > > > > reply will require 1M of space.
> > > > > >
> > > > > > This means that with NFSv4, as soon as you have enough concurrent requests
> > > > > > such that 1M each reserves all of whatever window size was auto-tuned, new
> > > > > > requests on that connection will be ignored.
> > > > > >
> > > > > > This could significantly limit the amount of parallelism that can be achieved
> > > > > > for a single TCP connection (and given that the Linux client strongly prefers
> > > > > > a single connection now, this could become more of an issue).
> > > > >
> > > > > Worse, I believe it can deadlock completely if the transmit buffer
> > > > > shrinks too far, and people really have run into this:
> > > > >
> > > > > http://mid.gmane.org/<[email protected]>
> > > > >
> > > > > Trond's suggestion looked at the time like it might work and be doable:
> > > > >
> > > > > http://mid.gmane.org/<4FA345DA4F4AE44899BD2B03EEEC2FA91833C1D8@sacexcmbx05-prd.hq.netapp.com>
> > > > >
> > > > > but I dropped it.
> > > >
> > > > I would probably generalise Trond's suggestion and allow "N" extra requests
> > > > through that exceed the reservation, when N is related to the number of idle
> > > > threads. squareroot might be nice, but half is probably easiest.
> > > >
> > > > If any send takes more than 30 seconds the sk_sndtimeo will kick in and close
> > > > the connection so a really bad connection won't block threads indefinitely.
> > > >
> > > >
> > > > And yes - a nice test case would be good.
> > > >
> > > > What do you think of the following (totally untested - just for comment)?
> > >
> > > In cases where the disk is the bottleneck, there aren't actually going
> > > to be idle threads, are there? In which case I don't think this helps
> > > save a stuck client.
> >
> > Do we care about the disk being a bottleneck? We don't currently try to
> > handle that situation at all - if the disk is slow, everything slows down.
>
> Right, that's fine, it's just that:
>
> Suppose you have a fast network, several fast clients all continuously
> reading, and one slow disk.
>
> Then in the steady state all the threads will be waiting on the disk,
> and the receive buffers will all be full of read requests from the
> various clients. When a thread completes a read it will immediately
> send the response, pick up the next request, and go back to waiting on
> the disk.
>
> Suppose the send buffer of one of the clients drops below the minimum
> necessary.
>
> Then after that point, I don't think that one client will ever have
> another rpc processed, even after your proposed change.

That would be because I only allowed extra requests up to half the number of
idle threads, and there would be zero idle threads.

So yes - I see your point.

I now think that it is sufficient to just allow one request through per
socket. While there is high memory pressure, that is sufficient. When the
memory pressure drops, that should be enough to cause sndbuf to grow.

We should be able to use xpt_reserved to check if this is the only request
or not:


diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 80a6640..5b832ef 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -331,7 +331,8 @@ static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
return true;
if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
- return xprt->xpt_ops->xpo_has_wspace(xprt);
+ return xprt->xpt_ops->xpo_has_wspace(xprt) ||
+ atomic_read(&xprt->xpt_reserved) == 0;
return false;
}

@@ -730,7 +731,8 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
newxpt = xprt->xpt_ops->xpo_accept(xprt);
if (newxpt)
svc_add_new_temp_xprt(serv, newxpt);
- } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
+ } else if (xprt->xpt_ops->xpo_has_wspace(xprt) ||
+ atomic_read(&xprt->xpt_reserved) == 0) {
/* XPT_DATA|XPT_DEFERRED case: */
dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
rqstp, rqstp->rq_pool->sp_id, xprt,


Thoughts?

I tried testing this, but xpo_has_wspace never fails for me, even if I remove
the calls to svc_sock_setbufsize (which probably aren't wanted for TCP any
more).

>
> Am I missing anything?

Not as far as I can see - thanks!

NeilBrown


Attachments:
signature.asc (828.00 B)

2013-07-16 01:58:05

by J.Bruce Fields

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Mon, Jul 15, 2013 at 02:32:03PM +1000, NeilBrown wrote:
> On Wed, 10 Jul 2013 15:07:28 -0400 "J.Bruce Fields" <[email protected]>
> wrote:
>
> > On Wed, Jul 10, 2013 at 02:32:33PM +1000, NeilBrown wrote:
> > > On Tue, 9 Jul 2013 22:27:35 -0400 "J.Bruce Fields" <[email protected]>
> > > wrote:
> > >
> > > > On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
> > > > >
> > > > > Hi,
> > > > > I just noticed this commit:
> > > > >
> > > > > commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> > > > > Author: Olga Kornievskaia <[email protected]>
> > > > > Date: Tue Oct 21 14:13:47 2008 -0400
> > > > >
> > > > > svcrpc: take advantage of tcp autotuning
> > > > >
> > > > >
> > > > > which I must confess surprised me. I wonder if the full implications of
> > > > > removing that functionality were understood.
> > > > >
> > > > > Previously nfsd would set the transmit buffer space for a connection to
> > > > > ensure there is plenty to hold all replies. Now it doesn't.
> > > > >
> > > > > nfsd refuses to accept a request if there isn't enough space in the transmit
> > > > > buffer to send a reply. This is important to ensure that each reply gets
> > > > > sent atomically without blocking and there is no risk of replies getting
> > > > > interleaved.
> >
> > By the way, it's xpt_mutex that really guarantees non-interleaving,
> > isn't it?
>
> Yes. I had the two different things confused. The issue is only about
> blocking on writes and consequently tying up nfsd threads.
>
> >
> > I think of the svc_tcp_has_wspace checks as solving a problem of
> > fairness between clients. If we only had one client, everything would
> > work without them. But when we have multiple clients we don't want a
> > single slow client to be able to tie up every thread waiting for that
> > client to receive read data. Is that accurate?
>
> It agrees with my understanding - yes.
>
>
> >
> > > > > The server starts out with a large estimate of the reply space (1M) and for
> > > > > NFSv3 and v2 it quickly adjusts this down to something realistic. For NFSv4
> > > > > it is much harder to estimate the space needed so it just assumes every
> > > > > reply will require 1M of space.
> > > > >
> > > > > This means that with NFSv4, as soon as you have enough concurrent requests
> > > > > such that 1M each reserves all of whatever window size was auto-tuned, new
> > > > > requests on that connection will be ignored.
> > > > >
> > > > > This could significantly limit the amount of parallelism that can be achieved
> > > > > for a single TCP connection (and given that the Linux client strongly prefers
> > > > > a single connection now, this could become more of an issue).
> > > >
> > > > Worse, I believe it can deadlock completely if the transmit buffer
> > > > shrinks too far, and people really have run into this:
> > > >
> > > > http://mid.gmane.org/<[email protected]>
> > > >
> > > > Trond's suggestion looked at the time like it might work and be doable:
> > > >
> > > > http://mid.gmane.org/<4FA345DA4F4AE44899BD2B03EEEC2FA91833C1D8@sacexcmbx05-prd.hq.netapp.com>
> > > >
> > > > but I dropped it.
> > >
> > > I would probably generalise Trond's suggestion and allow "N" extra requests
> > > through that exceed the reservation, when N is related to the number of idle
> > > threads. squareroot might be nice, but half is probably easiest.
> > >
> > > If any send takes more than 30 seconds the sk_sndtimeo will kick in and close
> > > the connection so a really bad connection won't block threads indefinitely.
> > >
> > >
> > > And yes - a nice test case would be good.
> > >
> > > What do you think of the following (totally untested - just for comment)?
> >
> > In cases where the disk is the bottleneck, there aren't actually going
> > to be idle threads, are there? In which case I don't think this helps
> > save a stuck client.
>
> Do we care about the disk being a bottleneck? We don't currently try to
> handle that situation at all - if the disk is slow, everything slows down.

Right, that's fine, it's just that:

Suppose you have a fast network, several fast clients all continuously
reading, and one slow disk.

Then in the steady state all the threads will be waiting on the disk,
and the receive buffers will all be full of read requests from the
various clients. When a thread completes a read it will immediately
send the response, pick up the next request, and go back to waiting on
the disk.

Suppose the send buffer of one of the clients drops below the minimum
necessary.

Then after that point, I don't think that one client will ever have
another rpc processed, even after your proposed change.

Am I missing anything?

--b.

2013-07-25 01:30:39

by NeilBrown

[permalink] [raw]
Subject: [PATCH] NFSD/sunrpc: avoid deadlock on TCP connection due to memory pressure.


Since we enabled auto-tuning for sunrpc TCP connections we do not
guarantee that there is enough write-space on each connection to
queue a reply.

If memory pressure causes the window to shrink too small, the request
throttling in sunrpc/svc will not accept any requests so no more requests
will be handled. Even when pressure decreases the window will not
grow again until data is sent on the connection.
This means we get a deadlock: no requests will be handled until there
is more space, and no space will be allocated until a request is
handled.

This can be simulated by modifying svc_tcp_has_wspace to inflate the
number of bytes required and removing the 'svc_sock_setbufsize' calls
in svc_setup_socket.

I found that multiplying by 16 was enough to make the requirement
exceed the default allocation. With this modification in place:
mount -o vers=3,proto=tcp 127.0.0.1:/home /mnt
would block and eventually time out because the nfs server could not
accept any requests.

This patch relaxes the request throttling to always allow at least one
request through per connection. It does this by checking both
sk_stream_min_wspace() and xprt->xpt_reserved
are zero.
The first is zero when the TCP transmit queue is empty.
The second is zero when there are no RPC requests being processed.
When both of these are zero the socket is idle and so one more
request can safely be allowed through.

Applying this patch allows the above mount command to succeed cleanly.
Tracing shows that the allocated write buffer space quickly grows and
after a few requests are handled, the extra tests are no longer needed
to permit further requests to be processed.

The main purpose of request throttling is to handle the case when one
client is slow at collecting replies and the send queue gets full of
replies that the client hasn't acknowledged (at the TCP level) yet.
As we only change behaviour when the send queue is empty this main
purpose is still preserved.

Reported-by: Ben Myers <[email protected]>
Signed-off-by: NeilBrown <[email protected]>

--
As you can see I've changed the patch. While writing up the above
description I realised there was a weakness, and so added the
sk_stream_min_wspace test. That allowed me to write the final paragraph.

NeilBrown


diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 305374d..7762b9f 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1193,7 +1193,9 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
return 1;
required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
- if (sk_stream_wspace(svsk->sk_sk) >= required)
+ if (sk_stream_wspace(svsk->sk_sk) >= required ||
+ (sk_stream_min_wspace(svsk->sk_sk) == 0 &&
+ atomic_read(&xprt->xpt_reserved) == 0))
return 1;
set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
return 0;


Attachments:
signature.asc (828.00 B)

2013-07-25 20:18:08

by J.Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] NFSD/sunrpc: avoid deadlock on TCP connection due to memory pressure.

On Thu, Jul 25, 2013 at 11:30:23AM +1000, NeilBrown wrote:
>
> Since we enabled auto-tuning for sunrpc TCP connections we do not
> guarantee that there is enough write-space on each connection to
> queue a reply.
>
> If memory pressure causes the window to shrink too small, the request
> throttling in sunrpc/svc will not accept any requests so no more requests
> will be handled. Even when pressure decreases the window will not
> grow again until data is sent on the connection.
> This means we get a deadlock: no requests will be handled until there
> is more space, and no space will be allocated until a request is
> handled.
>
> This can be simulated by modifying svc_tcp_has_wspace to inflate the
> number of bytes required and removing the 'svc_sock_setbufsize' calls
> in svc_setup_socket.

Ah-hah!

> I found that multiplying by 16 was enough to make the requirement
> exceed the default allocation. With this modification in place:
> mount -o vers=3,proto=tcp 127.0.0.1:/home /mnt
> would block and eventually time out because the nfs server could not
> accept any requests.

So, this?:

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 305374d..36de50d 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1193,6 +1193,7 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
return 1;
required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
+ required *= 16;
if (sk_stream_wspace(svsk->sk_sk) >= required)
return 1;
set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
@@ -1378,14 +1379,8 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
/* Initialize the socket */
if (sock->type == SOCK_DGRAM)
svc_udp_init(svsk, serv);
- else {
- /* initialise setting must have enough space to
- * receive and respond to one request.
- */
- svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
- 4 * serv->sv_max_mesg);
+ else
svc_tcp_init(svsk, serv);
- }

dprintk("svc: svc_setup_socket created %p (inet %p)\n",
svsk, svsk->sk_sk);

> This patch relaxes the request throttling to always allow at least one
> request through per connection. It does this by checking both
> sk_stream_min_wspace() and xprt->xpt_reserved
> are zero.
> The first is zero when the TCP transmit queue is empty.
> The second is zero when there are no RPC requests being processed.
> When both of these are zero the socket is idle and so one more
> request can safely be allowed through.
>
> Applying this patch allows the above mount command to succeed cleanly.
> Tracing shows that the allocated write buffer space quickly grows and
> after a few requests are handled, the extra tests are no longer needed
> to permit further requests to be processed.
>
> The main purpose of request throttling is to handle the case when one
> client is slow at collecting replies and the send queue gets full of
> replies that the client hasn't acknowledged (at the TCP level) yet.
> As we only change behaviour when the send queue is empty this main
> purpose is still preserved.
>
> Reported-by: Ben Myers <[email protected]>
> Signed-off-by: NeilBrown <[email protected]>
>
> --
> As you can see I've changed the patch. While writing up the above
> description I realised there was a weakness, and so added the
> sk_stream_min_wspace test. That allowed me to write the final paragraph.

This is great, thanks!

Inclined to queue it up for 3.11 and stable....

--b.

>
> NeilBrown
>
>
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 305374d..7762b9f 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -1193,7 +1193,9 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
> if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
> return 1;
> required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
> - if (sk_stream_wspace(svsk->sk_sk) >= required)
> + if (sk_stream_wspace(svsk->sk_sk) >= required ||
> + (sk_stream_min_wspace(svsk->sk_sk) == 0 &&
> + atomic_read(&xprt->xpt_reserved) == 0))
> return 1;
> set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> return 0;



2013-07-25 20:33:17

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] NFSD/sunrpc: avoid deadlock on TCP connection due to memory pressure.

On Thu, 25 Jul 2013 16:18:05 -0400 "J.Bruce Fields" <[email protected]>
wrote:

> On Thu, Jul 25, 2013 at 11:30:23AM +1000, NeilBrown wrote:
> >
> > Since we enabled auto-tuning for sunrpc TCP connections we do not
> > guarantee that there is enough write-space on each connection to
> > queue a reply.
> >
> > If memory pressure causes the window to shrink too small, the request
> > throttling in sunrpc/svc will not accept any requests so no more requests
> > will be handled. Even when pressure decreases the window will not
> > grow again until data is sent on the connection.
> > This means we get a deadlock: no requests will be handled until there
> > is more space, and no space will be allocated until a request is
> > handled.
> >
> > This can be simulated by modifying svc_tcp_has_wspace to inflate the
> > number of bytes required and removing the 'svc_sock_setbufsize' calls
> > in svc_setup_socket.
>
> Ah-hah!
>
> > I found that multiplying by 16 was enough to make the requirement
> > exceed the default allocation. With this modification in place:
> > mount -o vers=3,proto=tcp 127.0.0.1:/home /mnt
> > would block and eventually time out because the nfs server could not
> > accept any requests.
>
> So, this?:

Close enough. I just put "//" in front of the lines I didn't want rather
than delete them. But yes: exactly that effect.


>
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 305374d..36de50d 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -1193,6 +1193,7 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
> if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
> return 1;
> required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
> + required *= 16;
> if (sk_stream_wspace(svsk->sk_sk) >= required)
> return 1;
> set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> @@ -1378,14 +1379,8 @@ static struct svc_sock *svc_setup_socket(struct svc_serv *serv,
> /* Initialize the socket */
> if (sock->type == SOCK_DGRAM)
> svc_udp_init(svsk, serv);
> - else {
> - /* initialise setting must have enough space to
> - * receive and respond to one request.
> - */
> - svc_sock_setbufsize(svsk->sk_sock, 4 * serv->sv_max_mesg,
> - 4 * serv->sv_max_mesg);
> + else
> svc_tcp_init(svsk, serv);
> - }
>
> dprintk("svc: svc_setup_socket created %p (inet %p)\n",
> svsk, svsk->sk_sk);
>
> > This patch relaxes the request throttling to always allow at least one
> > request through per connection. It does this by checking both
> > sk_stream_min_wspace() and xprt->xpt_reserved
> > are zero.
> > The first is zero when the TCP transmit queue is empty.
> > The second is zero when there are no RPC requests being processed.
> > When both of these are zero the socket is idle and so one more
> > request can safely be allowed through.
> >
> > Applying this patch allows the above mount command to succeed cleanly.
> > Tracing shows that the allocated write buffer space quickly grows and
> > after a few requests are handled, the extra tests are no longer needed
> > to permit further requests to be processed.
> >
> > The main purpose of request throttling is to handle the case when one
> > client is slow at collecting replies and the send queue gets full of
> > replies that the client hasn't acknowledged (at the TCP level) yet.
> > As we only change behaviour when the send queue is empty this main
> > purpose is still preserved.
> >
> > Reported-by: Ben Myers <[email protected]>
> > Signed-off-by: NeilBrown <[email protected]>
> >
> > --
> > As you can see I've changed the patch. While writing up the above
> > description I realised there was a weakness, and so added the
> > sk_stream_min_wspace test. That allowed me to write the final paragraph.
>
> This is great, thanks!
>
> Inclined to queue it up for 3.11 and stable....

I'd agree for 3.11.
It feels a bit border-line for stable. "dead-lock" and "has been seen in the
wild" are technically enough justification...
I'd probably mark it as "please don't apply to -stable until 3.11 is released"
or something like that, just for a bit of breathing space.
Your call though.

NeilBrown

>
> --b.
>
> >
> > NeilBrown
> >
> >
> > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > index 305374d..7762b9f 100644
> > --- a/net/sunrpc/svcsock.c
> > +++ b/net/sunrpc/svcsock.c
> > @@ -1193,7 +1193,9 @@ static int svc_tcp_has_wspace(struct svc_xprt *xprt)
> > if (test_bit(XPT_LISTENER, &xprt->xpt_flags))
> > return 1;
> > required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;
> > - if (sk_stream_wspace(svsk->sk_sk) >= required)
> > + if (sk_stream_wspace(svsk->sk_sk) >= required ||
> > + (sk_stream_min_wspace(svsk->sk_sk) == 0 &&
> > + atomic_read(&xprt->xpt_reserved) == 0))
> > return 1;
> > set_bit(SOCK_NOSPACE, &svsk->sk_sock->flags);
> > return 0;
>
>


Attachments:
signature.asc (828.00 B)

2013-07-24 21:07:51

by J.Bruce Fields

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Wed, Jul 17, 2013 at 07:03:19PM -0500, Ben Myers wrote:
> Hey Bruce,
>
> On Tue, Jul 16, 2013 at 10:24:30AM -0400, J.Bruce Fields wrote:
> > Adding Ben Myers to Cc: in case he has any testing help or advice (see
> > below):
> >
> > On Tue, Jul 16, 2013 at 02:00:21PM +1000, NeilBrown wrote:
> > > That would be because I only allowed extra requests up to half the number of
> > > idle threads, and there would be zero idle threads.
> > >
> > > So yes - I see your point.
> > >
> > > I now think that it is sufficient to just allow one request through per
> > > socket. While there is high memory pressure, that is sufficient. When the
> > > memory pressure drops, that should be enough to cause sndbuf to grow.
> > >
> > > We should be able to use xpt_reserved to check if this is the only request
> > > or not:
> >
> > A remaining possible bad case is if the number of "bad" sockets is more
> > than the number of threads, and if the problem is something that won't
> > resolve itself by just waiting a little while.
> >
> > I don't know if that's likely in practice. Maybe a network failure cuts
> > you off from a large swath (but not all) of your clients? Or maybe some
> > large proportion of your clients are just slow?
> >
> > But this is two lines and looks likely to solve the immediate
> > problem:
> >
> > > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > > index 80a6640..5b832ef 100644
> > > --- a/net/sunrpc/svc_xprt.c
> > > +++ b/net/sunrpc/svc_xprt.c
> > > @@ -331,7 +331,8 @@ static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
> > > if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
> > > return true;
> > > if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
> > > - return xprt->xpt_ops->xpo_has_wspace(xprt);
> > > + return xprt->xpt_ops->xpo_has_wspace(xprt) ||
> > > + atomic_read(&xprt->xpt_reserved) == 0;
> > > return false;
> > > }
> > >
> > > @@ -730,7 +731,8 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
> > > newxpt = xprt->xpt_ops->xpo_accept(xprt);
> > > if (newxpt)
> > > svc_add_new_temp_xprt(serv, newxpt);
> > > - } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
> > > + } else if (xprt->xpt_ops->xpo_has_wspace(xprt) ||
> > > + atomic_read(&xprt->xpt_reserved) == 0) {
> > > /* XPT_DATA|XPT_DEFERRED case: */
> > > dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
> > > rqstp, rqstp->rq_pool->sp_id, xprt,
> > >
> > >
> > > Thoughts?
> > >
> > > I tried testing this, but xpo_has_wspace never fails for me, even if I remove
> > > the calls to svc_sock_setbufsize (which probably aren't wanted for TCP any
> > > more).
> >
> > Ben, do you still have a setup that can reproduce the problem? Or did you
> > ever find an easier way to reproduce?
>
> Unfortunately I don't still have that test rig handy, and I did not find an
> easier way to reproduce this. I do agree that it 'is sufficient to just allow
> one request through per socket'... probably that would have solved the problem
> in our case. Maybe it would help to limit the amount of memory in your system
> to try and induce additional memory pressure? IIRC it was a 1MB block size
> read workload where we would hit this.

That makes sense.

Or I wonder if you could cheat and artificially force the tcp code to
behave as if there was memory pressure. (E.g. by calling
tcp_enter_memory_pressure on some trigger).
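
Something like this debug-only hack, perhaps (just a sketch, never meant
to be merged; it assumes tcp_enter_memory_pressure() from net/ipv4/tcp.c
is visible to sunrpc, and "fake_pressure" is a made-up module parameter):

/* Testing hack: pretend TCP is under memory pressure so autotuning keeps
 * sndbuf small and svc_tcp_has_wspace() can be made to fail on demand.
 * May need an extra #include <net/tcp.h> in svcsock.c.
 */
static bool fake_pressure;
module_param(fake_pressure, bool, 0644);

	/* ... then, inside svc_tcp_has_wspace(), before the wspace test: */
	if (fake_pressure)
		tcp_enter_memory_pressure(svsk->sk_sk);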

Neil, are you still looking at this? (If you're done with it, could you
resend that patch with a signed-off-by?)

--b.

2013-07-10 17:39:45

by Ben Greear

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On 07/10/2013 10:33 AM, Dean wrote:
> > This could significantly limit the amount of parallelism that can be achieved for a single TCP connection (and given that the
> > Linux client strongly prefers a single connection now, this could become more of an issue).
>
> I understand the simplicity in using a single tcp connection, but performance-wise it is definitely not the way to go on WAN links. When even a miniscule amount
> of packet loss is added to the link (<0.001% packet loss), the tcp buffer collapses and performance drops significantly (especially on 10GigE WAN links). I
> think new TCP algorithms could help the problem somewhat, but nothing available today makes much of a difference vs. cubic.
>
> Using multiple tcp connections allows better saturation of the link, since when packet loss occurs on a stream, the other streams can fill the void. Today, the
> only solution is to scale up the number of physical clients, which has high coordination overhead, or use a wan accelerator such as Bitspeed or Riverbed (which
> comes with its own issues such as extra hardware, cost, etc).

I have a set of patches that allows one to do multiple unique mounts to the same server from a single
client, but the patches are for the client side, so it would not help
non-Linux clients. And, the patches were rejected for upstream as not being
useful. But, if you are interested in such, please let me know and I can point
you to them...

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2013-07-10 04:32:52

by NeilBrown

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Tue, 9 Jul 2013 22:27:35 -0400 "J.Bruce Fields" <[email protected]>
wrote:

> On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
> >
> > Hi,
> > I just noticed this commit:
> >
> > commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> > Author: Olga Kornievskaia <[email protected]>
> > Date: Tue Oct 21 14:13:47 2008 -0400
> >
> > svcrpc: take advantage of tcp autotuning
> >
> >
> > which I must confess surprised me. I wonder if the full implications of
> > removing that functionality were understood.
> >
> > Previously nfsd would set the transmit buffer space for a connection to
> > ensure there is plenty to hold all replies. Now it doesn't.
> >
> > nfsd refuses to accept a request if there isn't enough space in the transmit
> > buffer to send a reply. This is important to ensure that each reply gets
> > sent atomically without blocking and there is no risk of replies getting
> > interleaved.
> >
> > The server starts out with a large estimate of the reply space (1M) and for
> > NFSv3 and v2 it quickly adjusts this down to something realistic. For NFSv4
> > it is much harder to estimate the space needed so it just assumes every
> > reply will require 1M of space.
> >
> > This means that with NFSv4, as soon as you have enough concurrent requests
> > such that 1M each reserves all of whatever window size was auto-tuned, new
> > requests on that connection will be ignored.
> >
> > This could significantly limit the amount of parallelism that can be achieved
> > for a single TCP connection (and given that the Linux client strongly prefers
> > a single connection now, this could become more of an issue).
>
> Worse, I believe it can deadlock completely if the transmit buffer
> shrinks too far, and people really have run into this:
>
> http://mid.gmane.org/<[email protected]>
>
> Trond's suggestion looked at the time like it might work and be doable:
>
> http://mid.gmane.org/<4FA345DA4F4AE44899BD2B03EEEC2FA91833C1D8@sacexcmbx05-prd.hq.netapp.com>
>
> but I dropped it.

I would probably generalise Trond's suggestion and allow "N" extra requests
through that exceed the reservation, where N is related to the number of idle
threads. A square root might be nice, but half is probably easiest.

If any send takes more than 30 seconds the sk_sndtimeo will kick in and close
the connection so a really bad connection won't block threads indefinitely.


And yes - a nice test case would be good.

What do you think of the following (totally untested - just for comment)?

NeilBrown


diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
index b05963f..2fc92f1 100644
--- a/include/linux/sunrpc/svc_xprt.h
+++ b/include/linux/sunrpc/svc_xprt.h
@@ -81,6 +81,10 @@ struct svc_xprt {

struct net *xpt_net;
struct rpc_xprt *xpt_bc_xprt; /* NFSv4.1 backchannel */
+
+ atomic_t xpt_extras; /* Extra requests which
+ * might block on send
+ */
};

static inline void unregister_xpt_user(struct svc_xprt *xpt, struct svc_xpt_user *u)
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 80a6640..fc366ca 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -165,6 +165,7 @@ void svc_xprt_init(struct net *net, struct svc_xprt_class *xcl,
set_bit(XPT_BUSY, &xprt->xpt_flags);
rpc_init_wait_queue(&xprt->xpt_bc_pending, "xpt_bc_pending");
xprt->xpt_net = get_net(net);
+ atomic_set(&xprt->xpt_extras, 0);
}
EXPORT_SYMBOL_GPL(svc_xprt_init);

@@ -326,13 +327,21 @@ static void svc_thread_dequeue(struct svc_pool *pool, struct svc_rqst *rqstp)
list_del(&rqstp->rq_list);
}

-static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
+static int svc_xprt_has_something_to_do(struct svc_xprt *xprt)
{
if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
- return true;
- if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
- return xprt->xpt_ops->xpo_has_wspace(xprt);
- return false;
+ return 1;
+ if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED))) {
+ if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
+ if (atomic_read(&xprt->xpt_extras))
+ atomic_set(&xprt->xpt_extras, 0);
+ return 1;
+ } else {
+ atomic_inc(&xprt->xpt_extras);
+ return 2; /* only if free threads */
+ }
+ }
+ return 0;
}

/*
@@ -345,8 +354,9 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
struct svc_pool *pool;
struct svc_rqst *rqstp;
int cpu;
+ int todo = svc_xprt_has_something_to_do(xprt);

- if (!svc_xprt_has_something_to_do(xprt))
+ if (!todo)
return;

cpu = get_cpu();
@@ -361,6 +371,19 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
"svc_xprt_enqueue: "
"threads and transports both waiting??\n");

+ if (todo == 2) {
+ int free_needed = atomic_read(&xprt->xpt_extras) * 2;
+ list_for_each_entry(rqstp, &pool->sp_threads, rq_list)
+ if (--free_needed <= 0)
+ break;
+
+ if (free_needed > 0) {
+ /* Need more free threads before we allow this. */
+ atomic_add_unless(&xprt->xpt_extras, -1, 0);
+ goto out_unlock;
+ }
+ }
+
pool->sp_stats.packets++;

/* Mark transport as busy. It will remain in this state until
@@ -371,6 +394,8 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags)) {
/* Don't enqueue transport while already enqueued */
dprintk("svc: transport %p busy, not enqueued\n", xprt);
+ if (todo == 2)
+ atomic_add_unless(&xprt->xpt_extras, -1, 0);
goto out_unlock;
}

@@ -466,6 +491,7 @@ static void svc_xprt_release(struct svc_rqst *rqstp)
printk(KERN_ERR "RPC request reserved %d but used %d\n",
rqstp->rq_reserved,
rqstp->rq_res.len);
+ atomic_add_unless(&xprt->xpt_extras, -1, 0);

rqstp->rq_res.head[0].iov_len = 0;
svc_reserve(rqstp, 0);


Attachments:
signature.asc (828.00 B)

2013-07-30 02:49:12

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH] NFSD/sunrpc: avoid deadlock on TCP connection due to memory pressure.

On Fri, 26 Jul 2013 10:19:16 -0400 "J.Bruce Fields" <[email protected]>
wrote:

> On Fri, Jul 26, 2013 at 06:33:03AM +1000, NeilBrown wrote:
> > On Thu, 25 Jul 2013 16:18:05 -0400 "J.Bruce Fields" <[email protected]>
> > wrote:
> >
> > > On Thu, Jul 25, 2013 at 11:30:23AM +1000, NeilBrown wrote:
> > > >
> > > > Since we enabled auto-tuning for sunrpc TCP connections we do not
> > > > guarantee that there is enough write-space on each connection to
> > > > queue a reply.
> ...
> > > This is great, thanks!
> > >
> > > Inclined to queue it up for 3.11 and stable....
> >
> > I'd agree for 3.11.
> > It feels a bit border-line for stable. "dead-lock" and "has been seen in the
> > wild" are technically enough justification...
> > I'd probably mark it as "please don't apply to -stable until 3.11 is released"
> > or something like that, just for a bit of breathing space.
> > Your call though.
>
>
> So my takeaway from http://lwn.net/Articles/559113/ was that Linus and
> Greg were requesting that:
>
> - criteria for -stable and late -rc's should really be about the
> same, and
> - people should follow Documentation/stable-kernel-rules.txt.
>
> So as an exercise to remind me what those rules are:
>
> Easy questions:
>
> - "no bigger than 100 lines, with context." Check.
> - "It must fix only one thing." Check.
> - "real bug that bothers people". Check.
> - "tested": yep. It doesn't actually say "tested on stable
> trees", and I recall this did land you with a tricky bug one
> time when a prerequisite was omitted from the backport.
>
> Judgement calls:
>
> - "obviously correct": it's short, but admittedly subtle, and
> performance regressions can take a while to get sorted out.
> - "It must fix a problem that causes a build error (but not for
> things marked CONFIG_BROKEN), an oops, a hang, data
> corruption, a real security issue, or some "oh, that's not
> good" issue. In short, something critical." We could argue
> that "server stops responding" is critical, though not to the
> same degree as a panic.
> - OR: alternatively: "Serious issues as reported by a user of a
> distribution kernel may also be considered if they fix a
> notable performance or interactivity issue." The only bz I've
> personally seen was the result of artificial testing of some
> kind, and it sounds like your case involved a disk failure?
>
> --b.

Looks like good analysis ... except that it doesn't seem conclusive. Being
conclusive would make it really good. :-)

The case that brought it to my attention doesn't require the fix.
A file system was misbehaving (blocking when it should return EJUKEBOX) and
this resulted in nfsd behaviour different from my expectation.
I expected nfsd to keep accepting requests until all threads were blocked.
However only 4 requests were accepted (which is actually better behaviour,
but not what I expected).
So I looked into it and thought that what I found wasn't really right. Which
turned out to be the case, but not the way I thought...

So my direct experience doesn't argue for the patch going to -stable at all.
If the only other reports are from artificial testing then I'd leave it out
of -stable. I don't feel -rc4 (that's next I think) is too late for it
though.

NeilBrown


Attachments:
signature.asc (828.00 B)

2013-07-15 23:32:07

by Ben Greear

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On 07/14/2013 09:35 PM, NeilBrown wrote:
> On Wed, 10 Jul 2013 10:39:38 -0700 Ben Greear <[email protected]> wrote:
>
>> On 07/10/2013 10:33 AM, Dean wrote:
>>> > This could significantly limit the amount of parallelism that can be achieved for a single TCP connection (and given that the
>>> > Linux client strongly prefers a single connection now, this could become more of an issue).
>>>
>>> I understand the simplicity in using a single tcp connection, but performance-wise it is definitely not the way to go on WAN links. When even a miniscule amount
>>> of packet loss is added to the link (<0.001% packet loss), the tcp buffer collapses and performance drops significantly (especially on 10GigE WAN links). I
>>> think new TCP algorithms could help the problem somewhat, but nothing available today makes much of a difference vs. cubic.
>>>
>>> Using multiple tcp connections allows better saturation of the link, since when packet loss occurs on a stream, the other streams can fill the void. Today, the
>>> only solution is to scale up the number of physical clients, which has high coordination overhead, or use a wan accelerator such as Bitspeed or Riverbed (which
>>> comes with its own issues such as extra hardware, cost, etc).
>>
>> I have a set of patches that allows one to do multiple unique mounts to the same server from a single
>> client, but the patches are for the client side, so it would not help
>> non-Linux clients. And, the patches were rejected for upstream as not being
>> useful. But, if you are interested in such, please let me know and I can point
>> you to them...
>
> Yes please!


I haven't ported these forward to 3.10 yet, but you can find my 3.9 tree
here:

http://dmz2.candelatech.com/git/gitweb.cgi?p=linux-3.9.dev.y/.git;a=summary

There's a bunch of other patches, but the nfs related ones are all in a row
and the patches are rebased, so you can probably pull them out w/out
too much difficulty. The patches introduce a bug where it fails to compile
w/out NFS 4.1 defined..I haven't bothered to fix it yet but it's probably
simple..or just compile with NFS 4.1 as I do.

Older trees found here, but we don't bother back-porting many patches,
so I'd use the latest if you can.

http://dmz2.candelatech.com/git/gitweb.cgi

And, you'll need a patched mount.nfs:

https://github.com/greearb/nfs-utils-ct

I hope to get started on porting these to 3.10 later this week.. If
there is any interest in this patch series or something like it going
upstream (there wasn't in the past, I've no good reason to think that
has changed), let me know and I will clean them up and post them to
the mailing list again...

If you search for [email protected] and 'bind to local' you
can find previous threads in the mailing list archives...

Thanks,
Ben


--
Ben Greear <[email protected]>
Candela Technologies Inc http://www.candelatech.com


2013-07-15 04:35:38

by NeilBrown

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Wed, 10 Jul 2013 10:39:38 -0700 Ben Greear <[email protected]> wrote:

> On 07/10/2013 10:33 AM, Dean wrote:
> > > This could significantly limit the amount of parallelism that can be achieved for a single TCP connection (and given that the
> > > Linux client strongly prefers a single connection now, this could become more of an issue).
> >
> > I understand the simplicity in using a single tcp connection, but performance-wise it is definitely not the way to go on WAN links. When even a miniscule amount
> > of packet loss is added to the link (<0.001% packet loss), the tcp buffer collapses and performance drops significantly (especially on 10GigE WAN links). I
> > think new TCP algorithms could help the problem somewhat, but nothing available today makes much of a difference vs. cubic.
> >
> > Using multiple tcp connections allows better saturation of the link, since when packet loss occurs on a stream, the other streams can fill the void. Today, the
> > only solution is to scale up the number of physical clients, which has high coordination overhead, or use a wan accelerator such as Bitspeed or Riverbed (which
> > comes with its own issues such as extra hardware, cost, etc).
>
> I have a set of patches that allows one to do multiple unique mounts to the same server from a single
> client, but the patches are for the client side, so it would not help
> non-Linux clients. And, the patches were rejected for upstream as not being
> useful. But, if you are interested in such, please let me know and I can point
> you to them...

Yes please!

NeilBrown


Attachments:
signature.asc (828.00 B)

2013-07-16 04:47:04

by NeilBrown

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Mon, 15 Jul 2013 16:32:00 -0700 Ben Greear <[email protected]> wrote:

> On 07/14/2013 09:35 PM, NeilBrown wrote:
> > On Wed, 10 Jul 2013 10:39:38 -0700 Ben Greear <[email protected]> wrote:
> >
> >> On 07/10/2013 10:33 AM, Dean wrote:
> >>> > This could significantly limit the amount of parallelism that can be achieved for a single TCP connection (and given that the
> >>> > Linux client strongly prefers a single connection now, this could become more of an issue).
> >>>
> >>> I understand the simplicity in using a single tcp connection, but performance-wise it is definitely not the way to go on WAN links. When even a miniscule amount
> >>> of packet loss is added to the link (<0.001% packet loss), the tcp buffer collapses and performance drops significantly (especially on 10GigE WAN links). I
> >>> think new TCP algorithms could help the problem somewhat, but nothing available today makes much of a difference vs. cubic.
> >>>
> >>> Using multiple tcp connections allows better saturation of the link, since when packet loss occurs on a stream, the other streams can fill the void. Today, the
> >>> only solution is to scale up the number of physical clients, which has high coordination overhead, or use a wan accelerator such as Bitspeed or Riverbed (which
> >>> comes with its own issues such as extra hardware, cost, etc).
> >>
> >> I have a set of patches that allows one to do multiple unique mounts to the same server from a single
> >> client, but the patches are for the client side, so it would not help
> >> non-Linux clients. And, the patches were rejected for upstream as not being
> >> useful. But, if you are interested in such, please let me know and I can point
> >> you to them...
> >
> > Yes please!
>
>
> I haven't ported these forward to 3.10 yet, but you can find my 3.9 tree
> here:
>
> http://dmz2.candelatech.com/git/gitweb.cgi?p=linux-3.9.dev.y/.git;a=summary
>
> There's a bunch of other patches, but the nfs related ones are all in a row
> and the patches are rebased, so you can probably pull them out w/out
> too much difficulty. The patches introduce a bug where it fails to compile
> w/out NFS 4.1 defined..I haven't bothered to fix it yet but it's probably
> simple..or just compile with NFS 4.1 as I do.
>
> Older trees found here, but we don't bother back-porting many patches,
> so I'd use the latest if you can.
>
> http://dmz2.candelatech.com/git/gitweb.cgi
>
> And, you'll need a patched mount.nfs:
>
> https://github.com/greearb/nfs-utils-ct
>
> I hope to get started on porting these to 3.10 later this week.. If
> there is any interest in this patch series or something like it going
> upstream (there wasn't in the past, I've no good reason to think that
> has changed), let me know and I will clean them up and post them to
> the mailing list again...
>
> If you search for [email protected] and 'bind to local' you
> can find previous threads in the mailing list archives...

Thanks.
You are supporting multiple independent mounts by using different client-side
IP addresses. That has clear benefits for controlling routing, but also
admin costs if you don't care about routing. The latter case applies to me.

So I'll bookmark this, but it isn't what I need just now.

Thanks,
NeilBrown


Attachments:
signature.asc (828.00 B)

2013-08-01 02:49:27

by J.Bruce Fields

[permalink] [raw]
Subject: Re: [PATCH] NFSD/sunrpc: avoid deadlock on TCP connection due to memory pressure.

On Tue, Jul 30, 2013 at 12:48:57PM +1000, NeilBrown wrote:
> On Fri, 26 Jul 2013 10:19:16 -0400 "J.Bruce Fields" <[email protected]>
> wrote:
>
> > On Fri, Jul 26, 2013 at 06:33:03AM +1000, NeilBrown wrote:
> > > On Thu, 25 Jul 2013 16:18:05 -0400 "J.Bruce Fields" <[email protected]>
> > > wrote:
> > >
> > > > On Thu, Jul 25, 2013 at 11:30:23AM +1000, NeilBrown wrote:
> > > > >
> > > > > Since we enabled auto-tuning for sunrpc TCP connections we do not
> > > > > guarantee that there is enough write-space on each connection to
> > > > > queue a reply.
> > ...
> > > > This is great, thanks!
> > > >
> > > > Inclined to queue it up for 3.11 and stable....
> > >
> > > I'd agree for 3.11.
> > > It feels a bit border-line for stable. "dead-lock" and "has been seen in the
> > > wild" are technically enough justification...
> > > I'd probably mark it as "please don't apply to -stable until 3.11 is released"
> > > or something like that, just for a bit of breathing space.
> > > Your call though.
> >
> >
> > So my takeaway from http://lwn.net/Articles/559113/ was that Linus and
> > Greg were requesting that:
> >
> > - criteria for -stable and late -rc's should really be about the
> > same, and
> > - people should follow Documentation/stable-kernel-rules.txt.
> >
> > So as an exercise to remind me what those rules are:
> >
> > Easy questions:
> >
> > - "no bigger than 100 lines, with context." Check.
> > - "It must fix only one thing." Check.
> > - "real bug that bothers people". Check.
> > - "tested": yep. It doesn't actually say "tested on stable
> > trees", and I recall this did land you with a tricky bug one
> > time when a prerequisite was omitted from the backport.
> >
> > Judgement calls:
> >
> > - "obviously correct": it's short, but admittedly subtle, and
> > performance regressions can take a while to get sorted out.
> > - "It must fix a problem that causes a build error (but not for
> > things marked CONFIG_BROKEN), an oops, a hang, data
> > corruption, a real security issue, or some "oh, that's not
> > good" issue. In short, something critical." We could argue
> > that "server stops responding" is critical, though not to the
> > same degree as a panic.
> > - OR: alternatively: "Serious issues as reported by a user of a
> > distribution kernel may also be considered if they fix a
> > notable performance or interactivity issue." The only bz I've
> > personally seen was the result of artificial testing of some
> > kind, and it sounds like your case involved a disk failure?
> >
> > --b.
>
> Looks like a good analysis ... except that it doesn't seem conclusive. Being
> conclusive would make it really good. :-)

Hah, yeah sorry.

> The case that brought it to my attention doesn't require the fix.
> A file system was misbehaving (blocking when it should return EJUKEBOX) and
> this resulted in nfsd behaviour different from what I expected.
> I expected nfsd to keep accepting requests until all threads were blocked.
> However only 4 requests were accepted (which is actually better behaviour,
> but not what I expected).
> So I looked into it and thought that what I found wasn't really right. Which
> turned out to be the case, but not the way I thought...
>
> So my direct experience doesn't argue for the patch going to -stable at all.
> If the only other reports are from artificial testing then I'd leave it out
> of -stable. I don't feel -rc4 (that's next I think) is too late for it
> though.

OK, I think I agree. We can pass it along to stable later if more
complaints surface....

--b.

2013-07-15 05:02:40

by NeilBrown

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Sun, 14 Jul 2013 21:26:20 -0400 Jim Rees <[email protected]> wrote:

> J.Bruce Fields wrote:
>
> On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
> >
> > Hi,
> > I just noticed this commit:
> >
> > commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> > Author: Olga Kornievskaia <[email protected]>
> > Date: Tue Oct 21 14:13:47 2008 -0400
> >
> > svcrpc: take advantage of tcp autotuning
> >
> >
> > which I must confess surprised me. I wonder if the full implications of
> > removing that functionality were understood.
> >
> > Previously nfsd would set the transmit buffer space for a connection to
> > ensure there is plenty to hold all replies. Now it doesn't.
> >
> > nfsd refuses to accept a request if there isn't enough space in the transmit
> > buffer to send a reply. This is important to ensure that each reply gets
> > sent atomically without blocking and there is no risk of replies getting
> > interleaved.
> >
> > The server starts out with a large estimate of the reply space (1M) and for
> > NFSv3 and v2 it quickly adjusts this down to something realistic. For NFSv4
> > it is much harder to estimate the space needed so it just assumes every
> > reply will require 1M of space.
> >
> > This means that with NFSv4, as soon as you have enough concurrent requests
> > such that 1M each reserves all of whatever window size was auto-tuned, new
> > requests on that connection will be ignored.
> >
> > This could significantly limit the amount of parallelism that can be achieved
> > for a single TCP connection (and given that the Linux client strongly prefers
> > a single connection now, this could become more of an issue).
>
> Worse, I believe it can deadlock completely if the transmit buffer
> shrinks too far, and people really have run into this:
>
> It's been a few years since I looked at this, but are you sure autotuning
> reduces the buffer space available on the sending socket? That doesn't sound
> like correct behavior to me. I know we thought about this at the time.

Autotuning is enabled when SOCK_SNDBUF_LOCK is not set in sk_userlocks.

One of the main effects of this flag is to disable:

static inline void sk_stream_moderate_sndbuf(struct sock *sk)
{
	if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK)) {
		sk->sk_sndbuf = min(sk->sk_sndbuf, sk->sk_wmem_queued >> 1);
		sk->sk_sndbuf = max(sk->sk_sndbuf, SOCK_MIN_SNDBUF);
	}
}


which will reduce sk_sndbuf to half the queued writes. As sk_wmem_queued
cannot grow above sk_sndbuf, this definitely reduces sk_sndbuf (though never
below SOCK_MIN_SNDBUF, which is 2K).

This seems to happen under memory pressure (sk_stream_alloc_skb).

So yes: memory pressure can reduce the sndbuf size when autotuning is enabled, and it
can get as low as 2K.
(An API to set this minimum to e.g. 2M for nfsd connections would be an alternate
fix for the deadlock, as Bruce has already mentioned).
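
Purely as a sketch (neither this helper nor an sk_sndbuf_min field exist in
the kernel), such a minimum could be enforced by a variant of the function
above that refuses to shrink below a caller-supplied floor:

static inline void sk_stream_moderate_sndbuf_min(struct sock *sk)
{
	/* Hypothetical only: sk_sndbuf_min does not exist upstream.  The
	 * idea is that nfsd could set it to the space reserved for
	 * outstanding replies, so memory pressure may still shrink
	 * sk_sndbuf, but never below that floor. */
	if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK)) {
		int floor = max_t(int, SOCK_MIN_SNDBUF, sk->sk_sndbuf_min);

		sk->sk_sndbuf = min(sk->sk_sndbuf, sk->sk_wmem_queued >> 1);
		sk->sk_sndbuf = max(sk->sk_sndbuf, floor);
	}
}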

>
> It does seem like a bug that we don't multiply the needed send buffer space
> by the number of threads. I think that's because we don't know how many
> threads there are going to be in svc_setup_socket()?

We used to, but it turned out to be too small in practice! As it auto-grows,
the "4 * serv->sv_max_mesg" setting is big enough ... if only it wouldn't
shrink below that.

NeilBrown


Attachments:
signature.asc (828.00 B)

2013-07-10 19:07:35

by J.Bruce Fields

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Wed, Jul 10, 2013 at 02:32:33PM +1000, NeilBrown wrote:
> On Tue, 9 Jul 2013 22:27:35 -0400 "J.Bruce Fields" <[email protected]>
> wrote:
>
> > On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
> > >
> > > Hi,
> > > I just noticed this commit:
> > >
> > > commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> > > Author: Olga Kornievskaia <[email protected]>
> > > Date: Tue Oct 21 14:13:47 2008 -0400
> > >
> > > svcrpc: take advantage of tcp autotuning
> > >
> > >
> > > which I must confess surprised me. I wonder if the full implications of
> > > removing that functionality were understood.
> > >
> > > Previously nfsd would set the transmit buffer space for a connection to
> > > ensure there is plenty to hold all replies. Now it doesn't.
> > >
> > > nfsd refuses to accept a request if there isn't enough space in the transmit
> > > buffer to send a reply. This is important to ensure that each reply gets
> > > sent atomically without blocking and there is no risk of replies getting
> > > interleaved.

By the way, it's xpt_mutex that really guarantees non-interleaving,
isn't it?

I think of the svc_tcp_has_wspace checks as solving a problem of
fairness between clients. If we only had one client, everything would
work without them. But when we have multiple clients we don't want a
single slow client to be able to tie up every thread waiting for that
client to receive read data. Is that accurate?
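
For reference, a simplified sketch of what the TCP wspace check amounts to
(not the exact upstream code): is there room in the send buffer for
everything already reserved for pending replies plus one more
maximum-sized reply?

static bool svc_tcp_wspace_sketch(struct svc_xprt *xprt, struct sock *sk,
				  struct svc_serv *serv)
{
	/* sk_stream_wspace() is sk_sndbuf - sk_wmem_queued */
	int required = atomic_read(&xprt->xpt_reserved) + serv->sv_max_mesg;

	return sk_stream_wspace(sk) >= required;
}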

> > > The server starts out with a large estimate of the reply space (1M) and for
> > > NFSv3 and v2 it quickly adjusts this down to something realistic. For NFSv4
> > > it is much harder to estimate the space needed so it just assumes every
> > > reply will require 1M of space.
> > >
> > > This means that with NFSv4, as soon as you have enough concurrent requests
> > > such that 1M each reserves all of whatever window size was auto-tuned, new
> > > requests on that connection will be ignored.
> > >
> > > This could significantly limit the amount of parallelism that can be achieved
> > > for a single TCP connection (and given that the Linux client strongly prefers
> > > a single connection now, this could become more of an issue).
> >
> > Worse, I believe it can deadlock completely if the transmit buffer
> > shrinks too far, and people really have run into this:
> >
> > http://mid.gmane.org/<[email protected]>
> >
> > Trond's suggestion looked at the time like it might work and be doable:
> >
> > http://mid.gmane.org/<4FA345DA4F4AE44899BD2B03EEEC2FA91833C1D8@sacexcmbx05-prd.hq.netapp.com>
> >
> > but I dropped it.
>
> I would probably generalise Trond's suggestion and allow "N" extra requests
> through that exceed the reservation, where N is related to the number of idle
> threads. A square root might be nice, but half is probably easiest.
>
> If any send takes more than 30 seconds the sk_sndtimeo will kick in and close
> the connection so a really bad connection won't block threads indefinitely.
>
>
> And yes - a nice test case would be good.
>
> What do you think of the following (totally untested - just for comment)?

In cases where the disk is the bottleneck, there aren't actually going
to be idle threads, are there? In which case I don't think this helps
save a stuck client.

Trond's suggestion doesn't have the same limitation, if I understand it
correctly.

--b.

> diff --git a/include/linux/sunrpc/svc_xprt.h b/include/linux/sunrpc/svc_xprt.h
> index b05963f..2fc92f1 100644
> --- a/include/linux/sunrpc/svc_xprt.h
> +++ b/include/linux/sunrpc/svc_xprt.h
> @@ -81,6 +81,10 @@ struct svc_xprt {
>
> struct net *xpt_net;
> struct rpc_xprt *xpt_bc_xprt; /* NFSv4.1 backchannel */
> +
> + atomic_t xpt_extras; /* Extra requests which
> + * might block on send
> + */
> };
>
> static inline void unregister_xpt_user(struct svc_xprt *xpt, struct svc_xpt_user *u)
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 80a6640..fc366ca 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -165,6 +165,7 @@ void svc_xprt_init(struct net *net, struct svc_xprt_class *xcl,
> set_bit(XPT_BUSY, &xprt->xpt_flags);
> rpc_init_wait_queue(&xprt->xpt_bc_pending, "xpt_bc_pending");
> xprt->xpt_net = get_net(net);
> + atomic_set(&xprt->xpt_extras, 0);
> }
> EXPORT_SYMBOL_GPL(svc_xprt_init);
>
> @@ -326,13 +327,21 @@ static void svc_thread_dequeue(struct svc_pool *pool, struct svc_rqst *rqstp)
> list_del(&rqstp->rq_list);
> }
>
> -static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
> +static int svc_xprt_has_something_to_do(struct svc_xprt *xprt)
> {
> if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
> - return true;
> - if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
> - return xprt->xpt_ops->xpo_has_wspace(xprt);
> - return false;
> + return 1;
> + if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED))) {
> + if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
> + if (atomic_read(&xprt->xpt_extras))
> + atomic_set(&xprt->xpt_extras, 0);
> + return 1;
> + } else {
> + atomic_inc(&xprt->xpt_extras);
> + return 2; /* only if free threads */
> + }
> + }
> + return 0;
> }
>
> /*
> @@ -345,8 +354,9 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
> struct svc_pool *pool;
> struct svc_rqst *rqstp;
> int cpu;
> + int todo = svc_xprt_has_something_to_do(xprt);
>
> - if (!svc_xprt_has_something_to_do(xprt))
> + if (!todo)
> return;
>
> cpu = get_cpu();
> @@ -361,6 +371,19 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
> "svc_xprt_enqueue: "
> "threads and transports both waiting??\n");
>
> + if (todo == 2) {
> + int free_needed = atomic_read(&xprt->xpt_extras) * 2;
> + list_for_each_entry(rqstp, &pool->sp_thread, rq_list)
> + if (--free_needed <= 0)
> + break;
> +
> + if (free_needed > 0) {
> + /* Need more free threads before we allow this. */
> + atomic_add_unless(&xprt->xpt_extras, -1, 0);
> + goto out_unlock;
> + }
> + }
> +
> pool->sp_stats.packets++;
>
> /* Mark transport as busy. It will remain in this state until
> @@ -371,6 +394,8 @@ void svc_xprt_enqueue(struct svc_xprt *xprt)
> if (test_and_set_bit(XPT_BUSY, &xprt->xpt_flags)) {
> /* Don't enqueue transport while already enqueued */
> dprintk("svc: transport %p busy, not enqueued\n", xprt);
> + if (todo == 2)
> + atomic_add_unless(&xprt->xpt_extras, -1, 0);
> goto out_unlock;
> }
>
> @@ -466,6 +491,7 @@ static void svc_xprt_release(struct svc_rqst *rqstp)
> printk(KERN_ERR "RPC request reserved %d but used %d\n",
> rqstp->rq_reserved,
> rqstp->rq_res.len);
> + atomic_add_unless(&xprt->xpt_extras, -1, 0);
>
> rqstp->rq_res.head[0].iov_len = 0;
> svc_reserve(rqstp, 0);



2013-07-16 14:24:32

by J.Bruce Fields

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

Adding Ben Myers to Cc: in case he has any testing help or advice (see
below):

On Tue, Jul 16, 2013 at 02:00:21PM +1000, NeilBrown wrote:
> That would be because I only allowed extra requests up to half the number of
> idle threads, and there would be zero idle threads.
>
> So yes - I see your point.
>
> I now think that it is sufficient to just allow one request through per
> socket. While there is high memory pressure, that is sufficient. When the
> memory pressure drops, that should be enough to cause sndbuf to grow.
>
> We should be able to use xpt_reserved to check if this is the only request
> or not:

A remaining possible bad case is if the number of "bad" sockets is more
than the number of threads, and if the problem is something that won't
resolve itself by just waiting a little while.

I don't know if that's likely in practice. Maybe a network failure cuts
you off from a large swath (but not all) of your clients? Or maybe some
large proportion of your clients are just slow?

But this is two lines and looks likely to solve the immediate
problem:

> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index 80a6640..5b832ef 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -331,7 +331,8 @@ static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
> if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
> return true;
> if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
> - return xprt->xpt_ops->xpo_has_wspace(xprt);
> + return xprt->xpt_ops->xpo_has_wspace(xprt) ||
> + atomic_read(&xprt->xpt_reserved) == 0;
> return false;
> }
>
> @@ -730,7 +731,8 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
> newxpt = xprt->xpt_ops->xpo_accept(xprt);
> if (newxpt)
> svc_add_new_temp_xprt(serv, newxpt);
> - } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
> + } else if (xprt->xpt_ops->xpo_has_wspace(xprt) ||
> + atomic_read(&xprt->xpt_reserved) == 0) {
> /* XPT_DATA|XPT_DEFERRED case: */
> dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
> rqstp, rqstp->rq_pool->sp_id, xprt,
>
>
> Thoughts?
>
> I tried testing this, but xpo_has_wspace never fails for me, even if I remove
> the calls to svc_sock_setbufsize (which probably aren't wanted for TCP any
> more).

Ben, do you still have a setup that can reproduce the problem? Or did
you ever find an easier way to reproduce?

--b.

2013-07-18 00:03:20

by Ben Myers

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

Hey Bruce,

On Tue, Jul 16, 2013 at 10:24:30AM -0400, J.Bruce Fields wrote:
> Adding Ben Myers to Cc: in case he has any testing help or advice (see
> below):
>
> On Tue, Jul 16, 2013 at 02:00:21PM +1000, NeilBrown wrote:
> > That would be because I only allowed extra requests up to half the number of
> > idle threads, and there would be zero idle threads.
> >
> > So yes - I see your point.
> >
> > I now think that it is sufficient to just allow one request through per
> > socket. While there is high memory pressure, that is sufficient. When the
> > memory pressure drops, that should be enough to cause sndbuf to grow.
> >
> > We should be able to use xpt_reserved to check if this is the only request
> > or not:
>
> A remaining possible bad case is if the number of "bad" sockets is more
> than the number of threads, and if the problem is something that won't
> resolve itself by just waiting a little while.
>
> I don't know if that's likely in practice. Maybe a network failure cuts
> you off from a large swath (but not all) of your clients? Or maybe some
> large proportion of your clients are just slow?
>
> But this is two lines and looks likely to solve the immediate
> problem:
>
> > diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> > index 80a6640..5b832ef 100644
> > --- a/net/sunrpc/svc_xprt.c
> > +++ b/net/sunrpc/svc_xprt.c
> > @@ -331,7 +331,8 @@ static bool svc_xprt_has_something_to_do(struct svc_xprt *xprt)
> > if (xprt->xpt_flags & ((1<<XPT_CONN)|(1<<XPT_CLOSE)))
> > return true;
> > if (xprt->xpt_flags & ((1<<XPT_DATA)|(1<<XPT_DEFERRED)))
> > - return xprt->xpt_ops->xpo_has_wspace(xprt);
> > + return xprt->xpt_ops->xpo_has_wspace(xprt) ||
> > + atomic_read(&xprt->xpt_reserved) == 0;
> > return false;
> > }
> >
> > @@ -730,7 +731,8 @@ static int svc_handle_xprt(struct svc_rqst *rqstp, struct svc_xprt *xprt)
> > newxpt = xprt->xpt_ops->xpo_accept(xprt);
> > if (newxpt)
> > svc_add_new_temp_xprt(serv, newxpt);
> > - } else if (xprt->xpt_ops->xpo_has_wspace(xprt)) {
> > + } else if (xprt->xpt_ops->xpo_has_wspace(xprt) ||
> > + atomic_read(&xprt->xpt_reserved) == 0) {
> > /* XPT_DATA|XPT_DEFERRED case: */
> > dprintk("svc: server %p, pool %u, transport %p, inuse=%d\n",
> > rqstp, rqstp->rq_pool->sp_id, xprt,
> >
> >
> > Thoughts?
> >
> > I tried testing this, but xpo_has_wspace never fails for me, even if I remove
> > the calls to svc_sock_setbufsize (which probably aren't wanted for TCP any
> > more).
>
> Ben, do you still have a setup that can reproduce the problem? Or did you
> ever find an easier way to reproduce?

Unfortunately I don't still have that test rig handy, and I did not find an
easier way to reproduce this. I do agree that it 'is sufficient to just allow
one request through per socket'... probably that would have solved the problem
in our case. Maybe it would help to limit the amount of memory in your system
to try and induce additional memory pressure? IIRC it was a 1MB block size
read workload where we would hit this.

Thanks,
Ben

2013-07-15 04:32:16

by NeilBrown

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

On Wed, 10 Jul 2013 15:07:28 -0400 "J.Bruce Fields" <[email protected]>
wrote:

> On Wed, Jul 10, 2013 at 02:32:33PM +1000, NeilBrown wrote:
> > On Tue, 9 Jul 2013 22:27:35 -0400 "J.Bruce Fields" <[email protected]>
> > wrote:
> >
> > > On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
> > > >
> > > > Hi,
> > > > I just noticed this commit:
> > > >
> > > > commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> > > > Author: Olga Kornievskaia <[email protected]>
> > > > Date: Tue Oct 21 14:13:47 2008 -0400
> > > >
> > > > svcrpc: take advantage of tcp autotuning
> > > >
> > > >
> > > > which I must confess surprised me. I wonder if the full implications of
> > > > removing that functionality were understood.
> > > >
> > > > Previously nfsd would set the transmit buffer space for a connection to
> > > > ensure there is plenty to hold all replies. Now it doesn't.
> > > >
> > > > nfsd refuses to accept a request if there isn't enough space in the transmit
> > > > buffer to send a reply. This is important to ensure that each reply gets
> > > > sent atomically without blocking and there is no risk of replies getting
> > > > interleaved.
>
> By the way, it's xpt_mutex that really guarantees non-interleaving,
> isn't it?

Yes. I had the two different things confused. The issue is only about
blocking on writes and consequently tying up nfsd threads.

>
> I think of the svc_tcp_has_wspace checks as solving a problem of
> fairness between clients. If we only had one client, everything would
> work without them. But when we have multiple clients we don't want a
> single slow client to be able to tie up every thread waiting for that
> client to receive read data. Is that accurate?

It agrees with my understanding - yes.


>
> > > > The server starts out with a large estimate of the reply space (1M) and for
> > > > NFSv3 and v2 it quickly adjusts this down to something realistic. For NFSv4
> > > > it is much harder to estimate the space needed so it just assumes every
> > > > reply will require 1M of space.
> > > >
> > > > This means that with NFSv4, as soon as you have enough concurrent requests
> > > > such that 1M each reserves all of whatever window size was auto-tuned, new
> > > > requests on that connection will be ignored.
> > > >
> > > > This could significantly limit the amount of parallelism that can be achieved
> > > > for a single TCP connection (and given that the Linux client strongly prefers
> > > > a single connection now, this could become more of an issue).
> > >
> > > Worse, I believe it can deadlock completely if the transmit buffer
> > > shrinks too far, and people really have run into this:
> > >
> > > http://mid.gmane.org/<[email protected]>
> > >
> > > Trond's suggestion looked at the time like it might work and be doable:
> > >
> > > http://mid.gmane.org/<4FA345DA4F4AE44899BD2B03EEEC2FA91833C1D8@sacexcmbx05-prd.hq.netapp.com>
> > >
> > > but I dropped it.
> >
> > I would probably generalise Trond's suggestion and allow "N" extra requests
> > through that exceed the reservation, where N is related to the number of idle
> > threads. A square root might be nice, but half is probably easiest.
> >
> > If any send takes more than 30 seconds the sk_sndtimeo will kick in and close
> > the connection so a really bad connection won't block threads indefinitely.
> >
> >
> > And yes - a nice test case would be good.
> >
> > What do you think of the following (totally untested - just for comment)?
>
> In cases where the disk is the bottleneck, there aren't actually going
> to be idle threads, are there? In which case I don't think this helps
> save a stuck client.

Do we care about the disk being a bottleneck? We don't currently try to
handle that situation at all - if the disk is slow, everything slows down.
Unless you have two separate clients using two separate devices, I think
that is the right thing to do. Limiting each filesystem to some number of
threads might be interesting, but I don't think there would be much call for
it.

The present issue is when a client or network is slow and could thereby
interfere with the response to other clients.

Certainly the deadlock that can happen if write buffer space drops below 1Meg
would be an issue when very few clients are active.


>
> Trond's suggestion doesn't have the same limitation, if I understand it
> correctly.

I think I missed the "hand off to a workqueue" bit the first time.

I guess something like changing the "mutex_lock" in svc_send to
"mutex_try_lock" and if that fails, punt to a work-queue. We would need to
split the buffer space off the rqstp and attach it to the work struct.
Each rqstp could have a small array of work_structs and that sets the limit
on the number of requests that can be accepted when there is no wspace.

We probably don't even need a workqueue. After completing a send, svc_send
can check if another reply has been queued and if so, it starts sending it -
if there is enough wspace. If no wspace just do nothing. The wspace
callback can queue the socket, and svc_xprt_enqueue can queue up an nfsd
thread if there is write work to do. svc_handle_xprt() can do the actual
send.
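
In rough outline it could look like this (just a sketch of the shape; the
"delayed reply" helpers are made-up names, not existing functions):

int svc_send_sketch(struct svc_rqst *rqstp)
{
	struct svc_xprt *xprt = rqstp->rq_xprt;

	if (!mutex_trylock(&xprt->xpt_mutex))
		/* Someone else is sending: park this reply on the
		 * transport and let that thread, or the wspace
		 * callback, deal with it later. */
		return svc_park_delayed_reply(xprt, rqstp);

	svc_do_sendto(xprt, rqstp);	/* the existing xpo_sendto path */

	/* Drain parked replies while the send buffer has room. */
	while (xprt->xpt_ops->xpo_has_wspace(xprt) &&
	       svc_send_next_delayed_reply(xprt))
		;

	mutex_unlock(&xprt->xpt_mutex);
	return 0;
}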

This would require modifying all the xpo_sendto routines and the routines
they call to take an index into the list of send buffers per socket. Or
maybe a pointer to a "delayed reply" structure.


For each delayed reply we would need:
- struct xdr_buf - to hold the reply
- struct page *[RPCSVC_MAXPAGES] - to hold the pages so we can free them
when done
- struct sockaddr_storage - to hold source address for UDP reply
- struct sockaddr_storage - to hold the dest address for UDP
- maybe something else for RDMA? - svc_rdma_sendto uses rqstp->rq_arg - not
sure why.

We could possibly allocate these as needed. If the kmalloc fails, just use
"mutex_lock" instead of "mutex_trylock" and block for a while.

Note that while this structure would be smaller than "struct svc_rqst", it is
not completely trivial. It might be just as easy to allocate a new "svc_rqst"
structure. Or fork new nfsd threads.


I think the key issue is really to allow some nfsd threads to block on
writing the reply, but to firmly limit the number which do. This is what my
patch tried to do.
Punting to a work queue might allow a little less memory to be allocated
to end up with the same number of spare threads, but would require a lot more
code churn.

I wonder if it is time to revisit dynamic sizing of the nfsd thread pool.

NeilBrown


Attachments:
signature.asc (828.00 B)

2013-07-15 11:57:21

by Jim Rees

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

NeilBrown wrote:

On Sun, 14 Jul 2013 21:26:20 -0400 Jim Rees <[email protected]> wrote:

> J.Bruce Fields wrote:
>
> On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
> >
> > Hi,
> > I just noticed this commit:
> >
> > commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> > Author: Olga Kornievskaia <[email protected]>
> > Date: Tue Oct 21 14:13:47 2008 -0400
> >
> > svcrpc: take advantage of tcp autotuning
> >
> >
> > which I must confess surprised me. I wonder if the full implications of
> > removing that functionality were understood.
> >
> > Previously nfsd would set the transmit buffer space for a connection to
> > ensure there is plenty to hold all replies. Now it doesn't.
> >
> > nfsd refuses to accept a request if there isn't enough space in the transmit
> > buffer to send a reply. This is important to ensure that each reply gets
> > sent atomically without blocking and there is no risk of replies getting
> > interleaved.
> >
> > The server starts out with a large estimate of the reply space (1M) and for
> > NFSv3 and v2 it quickly adjusts this down to something realistic. For NFSv4
> > it is much harder to estimate the space needed so it just assumes every
> > reply will require 1M of space.
> >
> > This means that with NFSv4, as soon as you have enough concurrent requests
> > such that 1M each reserves all of whatever window size was auto-tuned, new
> > requests on that connection will be ignored.
> >
> > This could significantly limit the amount of parallelism that can be achieved
> > for a single TCP connection (and given that the Linux client strongly prefers
> > a single connection now, this could become more of an issue).
>
> Worse, I believe it can deadlock completely if the transmit buffer
> shrinks too far, and people really have run into this:
>
> It's been a few years since I looked at this, but are you sure autotuning
> reduces the buffer space available on the sending socket? That doesn't sound
> like correct behavior to me. I know we thought about this at the time.

Autotuning is enabled when SOCK_SNDBUF_LOCK is not set in sk_userlocks.

One of the main effects of this flag is to disable:

static inline void sk_stream_moderate_sndbuf(struct sock *sk)
{
	if (!(sk->sk_userlocks & SOCK_SNDBUF_LOCK)) {
		sk->sk_sndbuf = min(sk->sk_sndbuf, sk->sk_wmem_queued >> 1);
		sk->sk_sndbuf = max(sk->sk_sndbuf, SOCK_MIN_SNDBUF);
	}
}


which will reduce sk_sndbuf to half the queued writes. As sk_wmem_queued
cannot grow above sk_sndbuf, this definitely reduces sk_sndbuf (though never
below SOCK_MIN_SNDBUF, which is 2K).

This seems to happen under memory pressure (sk_stream_alloc_skb).

So yes: memory pressure can reduce the sndbuf size when autotuning is enabled, and it
can get as low as 2K.
(An API to set this minimum to e.g. 2M for nfsd connections would be an alternate
fix for the deadlock, as Bruce has already mentioned).

>
> It does seem like a bug that we don't multiply the needed send buffer space
> by the number of threads. I think that's because we don't know how many
> threads there are going to be in svc_setup_socket()?

We used to, but it turned out to be too small in practice! As it auto-grows,
the "4 * serv->sv_max_mesg" setting is big enough ... if only it wouldn't
shrink below that.

This sounds familiar. In fact I think we asked on the network mailing list
about providing an api to set a minimum on the socket buffer. I'll go
through my old email and see if I can find it.

2013-07-15 13:42:19

by Jim Rees

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

Here's the thread on netdev where we discussed this issue:
http://marc.info/?l=linux-netdev&m=121545498313619&w=2

Here is where I asked about an api to set a socket buf minimum:
http://marc.info/?l=linux-netdev&m=121554794432038&w=2

There are many subtleties, and I suggest anyone who wants to try to fix this
code should read the email thread. The netdev people were pretty insistent
that we turn on autotuning; David Miller said the old behavior was
equivalent to "turn[ing] off half of the TCP stack."

2013-07-15 01:26:30

by Jim Rees

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?

J.Bruce Fields wrote:

On Wed, Jul 10, 2013 at 09:22:55AM +1000, NeilBrown wrote:
>
> Hi,
> I just noticed this commit:
>
> commit 9660439861aa8dbd5e2b8087f33e20760c2c9afc
> Author: Olga Kornievskaia <[email protected]>
> Date: Tue Oct 21 14:13:47 2008 -0400
>
> svcrpc: take advantage of tcp autotuning
>
>
> which I must confess surprised me. I wonder if the full implications of
> removing that functionality were understood.
>
> Previously nfsd would set the transmit buffer space for a connection to
> ensure there is plenty to hold all replies. Now it doesn't.
>
> nfsd refuses to accept a request if there isn't enough space in the transmit
> buffer to send a reply. This is important to ensure that each reply gets
> sent atomically without blocking and there is no risk of replies getting
> interleaved.
>
> The server starts out with a large estimate of the reply space (1M) and for
> NFSv3 and v2 it quickly adjusts this down to something realistic. For NFSv4
> it is much harder to estimate the space needed so it just assumes every
> reply will require 1M of space.
>
> This means that with NFSv4, as soon as you have enough concurrent requests
> such that 1M each reserves all of whatever window size was auto-tuned, new
> requests on that connection will be ignored.
>
> This could significantly limit the amount of parallelism that can be achieved
> for a single TCP connection (and given that the Linux client strongly prefers
> a single connection now, this could become more of an issue).

Worse, I believe it can deadlock completely if the transmit buffer
shrinks too far, and people really have run into this:

It's been a few years since I looked at this, but are you sure autotuning
reduces the buffer space available on the sending socket? That doesn't sound
like correct behavior to me. I know we thought about this at the time.

It does seem like a bug that we don't multiply the needed send buffer space
by the number of threads. I think that's because we don't know how many
threads there are going to be in svc_setup_socket()?

2013-07-10 20:00:39

by Michael Richardson

[permalink] [raw]
Subject: Re: Is tcp autotuning really what NFS wants?


Dean <[email protected]> wrote:
>> This could significantly limit the amount of parallelism that can be
> achieved for a single TCP connection (and given that the
>> Linux client strongly prefers a single connection now, this could
> become more of an issue).

> I understand the simplicity in using a single tcp connection, but
> performance-wise it is definitely not the way to go on WAN links. When
> even a miniscule amount of packet loss is added to the link (<0.001%
> packet loss), the tcp buffer collapses and performance drops

And just remember bufferbloat.

> Using multiple tcp connections allows better saturation of the link,
> since when packet loss occurs on a stream, the other streams can fill
> the void. Today, the only solution is to scale up the number of
> physical clients, which has high coordination overhead, or use a wan
> accelerator such as Bitspeed or Riverbed (which comes with its own
> issues such as extra hardware, cost, etc).

This is true on high speed links with few bottlenecks, but not so much when
there is a DSL-type bottleneck and excessive buffers.

--
] Never tell me the odds! | ipv6 mesh networks [
] Michael Richardson, Sandelman Software Works | network architect [
] [email protected] http://www.sandelman.ca/ | ruby on rails [