LinuxLists.cc - NFS TCP race condition with SOCK_ASYNC

2011-11-18 18:40:04

Subject: NFS TCP race condition with SOCK_ASYNC_NOSPACE

Hello,

As described originally in
http://www.spinics.net/lists/linux-nfs/msg25314.html, we were
encountering a bug whereby the NFS session was unexpectedly timing out.

I believe I have found the source of the race condition causing the timeout.

Brief overview of setup:
10GiB network, NFS mounted using TCP. Problem reproduces with
multiple different NICs, with synchronous or asynchronous mounts, and
with soft and hard mounts. Reproduces on 2.6.32 and I am currently
trying to reproduce with mainline. (I don't have physical access to the
servers so installing stuff is not fantastically easy)

In net/sunrpc/xprtsock.c:xs_tcp_send_request(), we try to write data to
the sock buffer using xs_sendpages()

When the sock buffer is nearly fully, we get an EAGAIN from
xs_sendpages() which causes a break out of the loop. Lower down the
function, we switch on status which cases us to call xs_nospace() with
the task.

In xs_nospace(), we test the SOCK_ASYNC_NOSPACE bit from the socket, and
in the rare case where that bit is clear, we return 0 instead of
EAGAIN. This promptly overwrites status in xs_tcp_send_request().

The result is that xs_tcp_release_xprt() finds a request which has no
error, but has not sent all of the bytes in its send buffer. It cleans
up by setting XPRT_CLOSE_WAIT which causes xprt_clear_locked() to queue
xprt->task_cleanup, which closes the TCP connection.

Under normal operation, the TCP connection goes down and back up without
interruption to the NFS layer. However, when the NFS server hangs in a
half closed state, the client forces a RST of the TCP connection,
leading to the timeout.

I have tried a few naive fixes such as changing the default return value
in xs_nospace() from 0 to -EAGAIN (meaning that 0 will never be
returned) but this causes a kernel memory leak. Can someone who a
better understanding of these interactions than me have a look? It
seems that the if (test_bit()) test in xs_nospace() should have an else
clause.

Thanks in advance,

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

2011-11-22 12:02:20

by Andrew Cooper

[permalink] [raw]

Subject: Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE

On 22/11/11 11:38, Trond Myklebust wrote:
> On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote:
>> Following some debugging, I believe that the attached patch fixes the
>> problem.
>>
>> Simply returning EAGAIN is not sufficient, as the task does not get
>> requeued, and times out 13 seconds later (as per our mount options).
>> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
>>
>> I realize that this is a gross hack and I should probably not be using
>> SOCK_ASYNC_NOSPACE in that way. Is there a better way to achieve the
>> same solution?
>>
> What you are doing will cause the request to be put to sleep with no
> guarantee that it will ever be woken up. Why would we want to do that if
> there is no report of a tcp window/buffer space congestion?

But the reason we get to this code is because there was a report of
space collision. What would you suggest instead? Changing
xs_{tcp,udp}_send_request() to retry in this case would defeat the point
of having xs_nospace().

What should happen is the request getting re-queued to run at the next
available opportunity, rather than perhaps sleeping for a certain length
of time. At the moment, leaving SOCK_ASYNC_NOSPACE unset causes the
request to never be woken, whereas setting that bit seems to always be
re-queued at some near point in the future.

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

2011-11-18 19:14:50

by Myklebust, Trond

[permalink] [raw]

Subject: Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE

On Fri, 2011-11-18 at 19:04 +0000, Andrew Cooper wrote:
> On 18/11/11 18:52, Trond Myklebust wrote:
> > On Fri, 2011-11-18 at 18:40 +0000, Andrew Cooper wrote:
> >> Hello,
> >>
> >> As described originally in
> >> http://www.spinics.net/lists/linux-nfs/msg25314.html, we were
> >> encountering a bug whereby the NFS session was unexpectedly timing out.
> >>
> >> I believe I have found the source of the race condition causing the timeout.
> >>
> >> Brief overview of setup:
> >> 10GiB network, NFS mounted using TCP. Problem reproduces with
> >> multiple different NICs, with synchronous or asynchronous mounts, and
> >> with soft and hard mounts. Reproduces on 2.6.32 and I am currently
> >> trying to reproduce with mainline. (I don't have physical access to the
> >> servers so installing stuff is not fantastically easy)
> >>
> >>
> >>
> >> In net/sunrpc/xprtsock.c:xs_tcp_send_request(), we try to write data to
> >> the sock buffer using xs_sendpages()
> >>
> >> When the sock buffer is nearly fully, we get an EAGAIN from
> >> xs_sendpages() which causes a break out of the loop. Lower down the
> >> function, we switch on status which cases us to call xs_nospace() with
> >> the task.
> >>
> >> In xs_nospace(), we test the SOCK_ASYNC_NOSPACE bit from the socket, and
> >> in the rare case where that bit is clear, we return 0 instead of
> >> EAGAIN. This promptly overwrites status in xs_tcp_send_request().
> >>
> >> The result is that xs_tcp_release_xprt() finds a request which has no
> >> error, but has not sent all of the bytes in its send buffer. It cleans
> >> up by setting XPRT_CLOSE_WAIT which causes xprt_clear_locked() to queue
> >> xprt->task_cleanup, which closes the TCP connection.
> >>
> >>
> >> Under normal operation, the TCP connection goes down and back up without
> >> interruption to the NFS layer. However, when the NFS server hangs in a
> >> half closed state, the client forces a RST of the TCP connection,
> >> leading to the timeout.
> >>
> >> I have tried a few naive fixes such as changing the default return value
> >> in xs_nospace() from 0 to -EAGAIN (meaning that 0 will never be
> >> returned) but this causes a kernel memory leak. Can someone who a
> >> better understanding of these interactions than me have a look? It
> >> seems that the if (test_bit()) test in xs_nospace() should have an else
> >> clause.
> > I fully agree with your analysis. The correct thing to do here is to
> > always return either EAGAIN or ENOTCONN. Thank you very much for working
> > this one out!
> >
> > Trond
>
> Returning EAGAIN seems to cause a kernel memory leak, as the oomkiller
> starts going after processes holding large amounts of LowMem. Returning

The EAGAIN should trigger a retry of the send.

> ENOTCONN causes the NFS session to complain about a timeout in the logs,
> and in the case of a softmout, give an EIO to the calling process.

Correct. ENOTCONN means that the connection was lost.

> >From the looks of the TCP stream, and from the the looks of some
> targeted debugging, nothing is actually wrong, so the client should not
> be trying to FIN the TCP connection. Is it possible that there is a
> more sinister reason for SOCK_ASYNC_NOSPACE being clear?

Normally, it means that we're out of the out-of-write-buffer condition
that caused the socket to fail (i.e. the socket has made progress
sending more data, so that we can now resume sending more). Returning
EAGAIN in that condition is correct.

> I can attempt to find which of the many calls to clear that bit is
> actually causing the problem, but I have a feeing that is going to a
> little more tricky to narrow down.
>

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2011-11-22 12:16:24

by Andrew Cooper

[permalink] [raw]

Subject: Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE

On 22/11/11 12:10, Trond Myklebust wrote:
> On Tue, 2011-11-22 at 12:02 +0000, Andrew Cooper wrote:
>> On 22/11/11 11:38, Trond Myklebust wrote:
>>> On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote:
>>>> Following some debugging, I believe that the attached patch fixes the
>>>> problem.
>>>>
>>>> Simply returning EAGAIN is not sufficient, as the task does not get
>>>> requeued, and times out 13 seconds later (as per our mount options).
>>>> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
>>>>
>>>> I realize that this is a gross hack and I should probably not be using
>>>> SOCK_ASYNC_NOSPACE in that way. Is there a better way to achieve the
>>>> same solution?
>>>>
>>> What you are doing will cause the request to be put to sleep with no
>>> guarantee that it will ever be woken up. Why would we want to do that if
>>> there is no report of a tcp window/buffer space congestion?
>> But the reason we get to this code is because there was a report of
>> space collision. What would you suggest instead? Changing
>> xs_{tcp,udp}_send_request() to retry in this case would defeat the point
>> of having xs_nospace().
> I suggest doing absolutely nothing: do what you originally proposed,
> which is to report the EAGAIN so that the client state machine retries
> the socket write.
>
> My point is that this is a context which is _not_ atomic with the
> original report of tcp window/buffer space congestion. There are no
> locks or anything else that will guarantee that the congestion still
> exists, and the fact that the SOCK_ASYNC_NOSPACE flag is now clear
> indicates that this is the case.
> The whole purpose of xs_nospace() is to wait until a congestion
> condition clears. If the congestion clears before we get here, then we
> have no reason to do anything special other than retry.
>
> Trond

I am slightly confused as to what you mean now.

When you take out the if(test_bit test and always set ret to EAGAIN and
requeue the request, the next time it wakes up is when it is killed due
to timeout. This results in substantially worse effects for the
userspace, as the NFS session is killed.

Did you mean something else when you said "always report EAGAIN"?

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

2011-11-22 12:45:42

by Myklebust, Trond

[permalink] [raw]

Subject: Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE

On Tue, 2011-11-22 at 12:34 +0000, Andrew Cooper wrote:
>
> On 22/11/11 12:22, Trond Myklebust wrote:
> > On Tue, 2011-11-22 at 12:16 +0000, Andrew Cooper wrote:
> >> On 22/11/11 12:10, Trond Myklebust wrote:
> >>> On Tue, 2011-11-22 at 12:02 +0000, Andrew Cooper wrote:
> >>>> On 22/11/11 11:38, Trond Myklebust wrote:
> >>>>> On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote:
> >>>>>> Following some debugging, I believe that the attached patch fixes the
> >>>>>> problem.
> >>>>>>
> >>>>>> Simply returning EAGAIN is not sufficient, as the task does not get
> >>>>>> requeued, and times out 13 seconds later (as per our mount options).
> >>>>>> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
> >>>>>>
> >>>>>> I realize that this is a gross hack and I should probably not be using
> >>>>>> SOCK_ASYNC_NOSPACE in that way. Is there a better way to achieve the
> >>>>>> same solution?
> >>>>>>
> >>>>> What you are doing will cause the request to be put to sleep with no
> >>>>> guarantee that it will ever be woken up. Why would we want to do that if
> >>>>> there is no report of a tcp window/buffer space congestion?
> >>>> But the reason we get to this code is because there was a report of
> >>>> space collision. What would you suggest instead? Changing
> >>>> xs_{tcp,udp}_send_request() to retry in this case would defeat the point
> >>>> of having xs_nospace().
> >>> I suggest doing absolutely nothing: do what you originally proposed,
> >>> which is to report the EAGAIN so that the client state machine retries
> >>> the socket write.
> >>>
> >>> My point is that this is a context which is _not_ atomic with the
> >>> original report of tcp window/buffer space congestion. There are no
> >>> locks or anything else that will guarantee that the congestion still
> >>> exists, and the fact that the SOCK_ASYNC_NOSPACE flag is now clear
> >>> indicates that this is the case.
> >>> The whole purpose of xs_nospace() is to wait until a congestion
> >>> condition clears. If the congestion clears before we get here, then we
> >>> have no reason to do anything special other than retry.
> >>>
> >>> Trond
> >> I am slightly confused as to what you mean now.
> >>
> >> When you take out the if(test_bit test and always set ret to EAGAIN and
> >> requeue the request, the next time it wakes up is when it is killed due
> >> to timeout. This results in substantially worse effects for the
> >> userspace, as the NFS session is killed.
> > What is putting the request to sleep? It should be awake when it enters
> > xs_nospace(), and nothing in or after that function should be putting it
> > to sleep until we've retried with call_transmit().
> >
>
> I presume it is the call to xprt_wait_for_buffer_space() which calls
> rpc_sleep_on(). There is xs_tcp_write_space which appears to wake it up
> based on sk->sk_write_space which is triggered on the sock gaining more
> space, which has already happened in this specific case.

That call should only happen if we have to wait for congestion (i.e. if
SOCK_ASYNC_NOSPACE is set).

Please can you test if the following patch fixes the problem.

Cheers
Trond

8<---------------------------------------------------------------------
>From 729b3861d9a2820735b648a044d558315bc9c9db Mon Sep 17 00:00:00 2001
From: Trond Myklebust <[email protected]>
Date: Tue, 22 Nov 2011 14:44:28 +0200
Subject: [PATCH] SUNRPC: Ensure we return EAGAIN in xs_nospace if congestion
is cleared

By returning '0' instead of 'EAGAIN' when the tests in xs_nospace() fail
to find evidence of socket congestion, we are making the RPC engine believe
that the message was incompletely sent, and so it disconnects the socket
instead of just retrying the send.

The bug appears to have been introduced by commit
5e3771ce2d6a69e10fcc870cdf226d121d868491 (SUNRPC: Ensure that xs_nospace
return values are propagated).

Reported-by: Andrew Cooper <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
Cc: [email protected] [>= 2.6.30]
---
net/sunrpc/xprtsock.c | 3 +--
1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 2d78d95..55472c4 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -496,7 +496,7 @@ static int xs_nospace(struct rpc_task *task)
struct rpc_rqst *req = task->tk_rqstp;
struct rpc_xprt *xprt = req->rq_xprt;
struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
- int ret = 0;
+ int ret = -EAGAIN;

dprintk("RPC: %5u xmit incomplete (%u left of %u)\n",
task->tk_pid, req->rq_slen - req->rq_bytes_sent,
@@ -508,7 +508,6 @@ static int xs_nospace(struct rpc_task *task)
/* Don't race with disconnect */
if (xprt_connected(xprt)) {
if (test_bit(SOCK_ASYNC_NOSPACE, &transport->sock->flags)) {
- ret = -EAGAIN;
/*
* Notify TCP that we're limited by the application
* window size
--
1.7.7.3

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2011-11-22 11:38:35

by Myklebust, Trond

[permalink] [raw]

Subject: Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE

On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote:
> Following some debugging, I believe that the attached patch fixes the
> problem.
>
> Simply returning EAGAIN is not sufficient, as the task does not get
> requeued, and times out 13 seconds later (as per our mount options).
> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
>
> I realize that this is a gross hack and I should probably not be using
> SOCK_ASYNC_NOSPACE in that way. Is there a better way to achieve the
> same solution?
>

What you are doing will cause the request to be put to sleep with no
guarantee that it will ever be woken up. Why would we want to do that if
there is no report of a tcp window/buffer space congestion?

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2011-11-22 12:23:00

by Myklebust, Trond

[permalink] [raw]

Subject: Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE

On Tue, 2011-11-22 at 12:16 +0000, Andrew Cooper wrote:
> On 22/11/11 12:10, Trond Myklebust wrote:
> > On Tue, 2011-11-22 at 12:02 +0000, Andrew Cooper wrote:
> >> On 22/11/11 11:38, Trond Myklebust wrote:
> >>> On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote:
> >>>> Following some debugging, I believe that the attached patch fixes the
> >>>> problem.
> >>>>
> >>>> Simply returning EAGAIN is not sufficient, as the task does not get
> >>>> requeued, and times out 13 seconds later (as per our mount options).
> >>>> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
> >>>>
> >>>> I realize that this is a gross hack and I should probably not be using
> >>>> SOCK_ASYNC_NOSPACE in that way. Is there a better way to achieve the
> >>>> same solution?
> >>>>
> >>> What you are doing will cause the request to be put to sleep with no
> >>> guarantee that it will ever be woken up. Why would we want to do that if
> >>> there is no report of a tcp window/buffer space congestion?
> >> But the reason we get to this code is because there was a report of
> >> space collision. What would you suggest instead? Changing
> >> xs_{tcp,udp}_send_request() to retry in this case would defeat the point
> >> of having xs_nospace().
> > I suggest doing absolutely nothing: do what you originally proposed,
> > which is to report the EAGAIN so that the client state machine retries
> > the socket write.
> >
> > My point is that this is a context which is _not_ atomic with the
> > original report of tcp window/buffer space congestion. There are no
> > locks or anything else that will guarantee that the congestion still
> > exists, and the fact that the SOCK_ASYNC_NOSPACE flag is now clear
> > indicates that this is the case.
> > The whole purpose of xs_nospace() is to wait until a congestion
> > condition clears. If the congestion clears before we get here, then we
> > have no reason to do anything special other than retry.
> >
> > Trond
>
> I am slightly confused as to what you mean now.
>
> When you take out the if(test_bit test and always set ret to EAGAIN and
> requeue the request, the next time it wakes up is when it is killed due
> to timeout. This results in substantially worse effects for the
> userspace, as the NFS session is killed.

What is putting the request to sleep? It should be awake when it enters
xs_nospace(), and nothing in or after that function should be putting it
to sleep until we've retried with call_transmit().

> Did you mean something else when you said "always report EAGAIN"?

Nope.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2011-11-22 12:34:37

by Andrew Cooper

[permalink] [raw]

Subject: Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE

On 22/11/11 12:22, Trond Myklebust wrote:
> On Tue, 2011-11-22 at 12:16 +0000, Andrew Cooper wrote:
>> On 22/11/11 12:10, Trond Myklebust wrote:
>>> On Tue, 2011-11-22 at 12:02 +0000, Andrew Cooper wrote:
>>>> On 22/11/11 11:38, Trond Myklebust wrote:
>>>>> On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote:
>>>>>> Following some debugging, I believe that the attached patch fixes the
>>>>>> problem.
>>>>>>
>>>>>> Simply returning EAGAIN is not sufficient, as the task does not get
>>>>>> requeued, and times out 13 seconds later (as per our mount options).
>>>>>> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
>>>>>>
>>>>>> I realize that this is a gross hack and I should probably not be using
>>>>>> SOCK_ASYNC_NOSPACE in that way. Is there a better way to achieve the
>>>>>> same solution?
>>>>>>
>>>>> What you are doing will cause the request to be put to sleep with no
>>>>> guarantee that it will ever be woken up. Why would we want to do that if
>>>>> there is no report of a tcp window/buffer space congestion?
>>>> But the reason we get to this code is because there was a report of
>>>> space collision. What would you suggest instead? Changing
>>>> xs_{tcp,udp}_send_request() to retry in this case would defeat the point
>>>> of having xs_nospace().
>>> I suggest doing absolutely nothing: do what you originally proposed,
>>> which is to report the EAGAIN so that the client state machine retries
>>> the socket write.
>>>
>>> My point is that this is a context which is _not_ atomic with the
>>> original report of tcp window/buffer space congestion. There are no
>>> locks or anything else that will guarantee that the congestion still
>>> exists, and the fact that the SOCK_ASYNC_NOSPACE flag is now clear
>>> indicates that this is the case.
>>> The whole purpose of xs_nospace() is to wait until a congestion
>>> condition clears. If the congestion clears before we get here, then we
>>> have no reason to do anything special other than retry.
>>>
>>> Trond
>> I am slightly confused as to what you mean now.
>>
>> When you take out the if(test_bit test and always set ret to EAGAIN and
>> requeue the request, the next time it wakes up is when it is killed due
>> to timeout. This results in substantially worse effects for the
>> userspace, as the NFS session is killed.
> What is putting the request to sleep? It should be awake when it enters
> xs_nospace(), and nothing in or after that function should be putting it
> to sleep until we've retried with call_transmit().
>

I presume it is the call to xprt_wait_for_buffer_space() which calls
rpc_sleep_on(). There is xs_tcp_write_space which appears to wake it up
based on sk->sk_write_space which is triggered on the sock gaining more
space, which has already happened in this specific case.

Sorry if I am being a bit slow here - I am still learning my way round
an unfamiliar codebase.

>> Did you mean something else when you said "always report EAGAIN"?
> Nope.
>

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

2011-11-22 12:10:40

by Myklebust, Trond

[permalink] [raw]

Subject: Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE

On Tue, 2011-11-22 at 12:02 +0000, Andrew Cooper wrote:
> On 22/11/11 11:38, Trond Myklebust wrote:
> > On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote:
> >> Following some debugging, I believe that the attached patch fixes the
> >> problem.
> >>
> >> Simply returning EAGAIN is not sufficient, as the task does not get
> >> requeued, and times out 13 seconds later (as per our mount options).
> >> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
> >>
> >> I realize that this is a gross hack and I should probably not be using
> >> SOCK_ASYNC_NOSPACE in that way. Is there a better way to achieve the
> >> same solution?
> >>
> > What you are doing will cause the request to be put to sleep with no
> > guarantee that it will ever be woken up. Why would we want to do that if
> > there is no report of a tcp window/buffer space congestion?
>
> But the reason we get to this code is because there was a report of
> space collision. What would you suggest instead? Changing
> xs_{tcp,udp}_send_request() to retry in this case would defeat the point
> of having xs_nospace().

I suggest doing absolutely nothing: do what you originally proposed,
which is to report the EAGAIN so that the client state machine retries
the socket write.

My point is that this is a context which is _not_ atomic with the
original report of tcp window/buffer space congestion. There are no
locks or anything else that will guarantee that the congestion still
exists, and the fact that the SOCK_ASYNC_NOSPACE flag is now clear
indicates that this is the case.
The whole purpose of xs_nospace() is to wait until a congestion
condition clears. If the congestion clears before we get here, then we
have no reason to do anything special other than retry.

Trond

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2011-11-18 19:55:20

by Andrew Cooper

[permalink] [raw]

Subject: Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE

On 18/11/2011 19:14, Trond Myklebust wrote:
> On Fri, 2011-11-18 at 19:04 +0000, Andrew Cooper wrote:
>> On 18/11/11 18:52, Trond Myklebust wrote:
>>> On Fri, 2011-11-18 at 18:40 +0000, Andrew Cooper wrote:
>>>> Hello,
>>>>
>>>> As described originally in
>>>> http://www.spinics.net/lists/linux-nfs/msg25314.html, we were
>>>> encountering a bug whereby the NFS session was unexpectedly timing out.
>>>>
>>>> I believe I have found the source of the race condition causing the timeout.
>>>>
>>>> Brief overview of setup:
>>>> 10GiB network, NFS mounted using TCP. Problem reproduces with
>>>> multiple different NICs, with synchronous or asynchronous mounts, and
>>>> with soft and hard mounts. Reproduces on 2.6.32 and I am currently
>>>> trying to reproduce with mainline. (I don't have physical access to the
>>>> servers so installing stuff is not fantastically easy)
>>>>
>>>>
>>>>
>>>> In net/sunrpc/xprtsock.c:xs_tcp_send_request(), we try to write data to
>>>> the sock buffer using xs_sendpages()
>>>>
>>>> When the sock buffer is nearly fully, we get an EAGAIN from
>>>> xs_sendpages() which causes a break out of the loop. Lower down the
>>>> function, we switch on status which cases us to call xs_nospace() with
>>>> the task.
>>>>
>>>> In xs_nospace(), we test the SOCK_ASYNC_NOSPACE bit from the socket, and
>>>> in the rare case where that bit is clear, we return 0 instead of
>>>> EAGAIN. This promptly overwrites status in xs_tcp_send_request().
>>>>
>>>> The result is that xs_tcp_release_xprt() finds a request which has no
>>>> error, but has not sent all of the bytes in its send buffer. It cleans
>>>> up by setting XPRT_CLOSE_WAIT which causes xprt_clear_locked() to queue
>>>> xprt->task_cleanup, which closes the TCP connection.
>>>>
>>>>
>>>> Under normal operation, the TCP connection goes down and back up without
>>>> interruption to the NFS layer. However, when the NFS server hangs in a
>>>> half closed state, the client forces a RST of the TCP connection,
>>>> leading to the timeout.
>>>>
>>>> I have tried a few naive fixes such as changing the default return value
>>>> in xs_nospace() from 0 to -EAGAIN (meaning that 0 will never be
>>>> returned) but this causes a kernel memory leak. Can someone who a
>>>> better understanding of these interactions than me have a look? It
>>>> seems that the if (test_bit()) test in xs_nospace() should have an else
>>>> clause.
>>> I fully agree with your analysis. The correct thing to do here is to
>>> always return either EAGAIN or ENOTCONN. Thank you very much for working
>>> this one out!
>>>
>>> Trond
>> Returning EAGAIN seems to cause a kernel memory leak, as the oomkiller
>> starts going after processes holding large amounts of LowMem. Returning
> The EAGAIN should trigger a retry of the send.
>
>> ENOTCONN causes the NFS session to complain about a timeout in the logs,
>> and in the case of a softmout, give an EIO to the calling process.
> Correct. ENOTCONN means that the connection was lost.
>
>> >From the looks of the TCP stream, and from the the looks of some
>> targeted debugging, nothing is actually wrong, so the client should not
>> be trying to FIN the TCP connection. Is it possible that there is a
>> more sinister reason for SOCK_ASYNC_NOSPACE being clear?
> Normally, it means that we're out of the out-of-write-buffer condition
> that caused the socket to fail (i.e. the socket has made progress
> sending more data, so that we can now resume sending more). Returning
> EAGAIN in that condition is correct.

Following my latest set of tests, I would have to say that we are in the
abnormal case. Removing the test_bit check and always requeuing causes
the next call to xs_sendpages() to return with an ENOTCONN. I guess I
have some more debugging to do.

>
>> I can attempt to find which of the many calls to clear that bit is
>> actually causing the problem, but I have a feeing that is going to a
>> little more tricky to narrow down.
>>

~Andrew

2011-11-18 19:04:41

by Andrew Cooper

[permalink] [raw]

Subject: Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE

On 18/11/11 18:52, Trond Myklebust wrote:
> On Fri, 2011-11-18 at 18:40 +0000, Andrew Cooper wrote:
>> Hello,
>>
>> As described originally in
>> http://www.spinics.net/lists/linux-nfs/msg25314.html, we were
>> encountering a bug whereby the NFS session was unexpectedly timing out.
>>
>> I believe I have found the source of the race condition causing the timeout.
>>
>> Brief overview of setup:
>> 10GiB network, NFS mounted using TCP. Problem reproduces with
>> multiple different NICs, with synchronous or asynchronous mounts, and
>> with soft and hard mounts. Reproduces on 2.6.32 and I am currently
>> trying to reproduce with mainline. (I don't have physical access to the
>> servers so installing stuff is not fantastically easy)
>>
>>
>>
>> In net/sunrpc/xprtsock.c:xs_tcp_send_request(), we try to write data to
>> the sock buffer using xs_sendpages()
>>
>> When the sock buffer is nearly fully, we get an EAGAIN from
>> xs_sendpages() which causes a break out of the loop. Lower down the
>> function, we switch on status which cases us to call xs_nospace() with
>> the task.
>>
>> In xs_nospace(), we test the SOCK_ASYNC_NOSPACE bit from the socket, and
>> in the rare case where that bit is clear, we return 0 instead of
>> EAGAIN. This promptly overwrites status in xs_tcp_send_request().
>>
>> The result is that xs_tcp_release_xprt() finds a request which has no
>> error, but has not sent all of the bytes in its send buffer. It cleans
>> up by setting XPRT_CLOSE_WAIT which causes xprt_clear_locked() to queue
>> xprt->task_cleanup, which closes the TCP connection.
>>
>>
>> Under normal operation, the TCP connection goes down and back up without
>> interruption to the NFS layer. However, when the NFS server hangs in a
>> half closed state, the client forces a RST of the TCP connection,
>> leading to the timeout.
>>
>> I have tried a few naive fixes such as changing the default return value
>> in xs_nospace() from 0 to -EAGAIN (meaning that 0 will never be
>> returned) but this causes a kernel memory leak. Can someone who a
>> better understanding of these interactions than me have a look? It
>> seems that the if (test_bit()) test in xs_nospace() should have an else
>> clause.
> I fully agree with your analysis. The correct thing to do here is to
> always return either EAGAIN or ENOTCONN. Thank you very much for working
> this one out!
>
> Trond

Returning EAGAIN seems to cause a kernel memory leak, as the oomkiller
starts going after processes holding large amounts of LowMem. Returning
ENOTCONN causes the NFS session to complain about a timeout in the logs,
and in the case of a softmout, give an EIO to the calling process.

>From the looks of the TCP stream, and from the the looks of some
targeted debugging, nothing is actually wrong, so the client should not
be trying to FIN the TCP connection. Is it possible that there is a
more sinister reason for SOCK_ASYNC_NOSPACE being clear?

I can attempt to find which of the many calls to clear that bit is
actually causing the problem, but I have a feeing that is going to a
little more tricky to narrow down.

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

2011-11-21 18:14:19

by Andrew Cooper

[permalink] [raw]

Subject: Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE

Following some debugging, I believe that the attached patch fixes the
problem.

Simply returning EAGAIN is not sufficient, as the task does not get
requeued, and times out 13 seconds later (as per our mount options).
Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.

I realize that this is a gross hack and I should probably not be using
SOCK_ASYNC_NOSPACE in that way. Is there a better way to achieve the
same solution?

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

Attachments:

NFS-fix-async-nospace-race.patch (1.08 kB)

2011-11-18 18:52:50

by Myklebust, Trond

[permalink] [raw]

Subject: Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE

On Fri, 2011-11-18 at 18:40 +0000, Andrew Cooper wrote:
> Hello,
>
> As described originally in
> http://www.spinics.net/lists/linux-nfs/msg25314.html, we were
> encountering a bug whereby the NFS session was unexpectedly timing out.
>
> I believe I have found the source of the race condition causing the timeout.
>
> Brief overview of setup:
> 10GiB network, NFS mounted using TCP. Problem reproduces with
> multiple different NICs, with synchronous or asynchronous mounts, and
> with soft and hard mounts. Reproduces on 2.6.32 and I am currently
> trying to reproduce with mainline. (I don't have physical access to the
> servers so installing stuff is not fantastically easy)
>
>
>
> In net/sunrpc/xprtsock.c:xs_tcp_send_request(), we try to write data to
> the sock buffer using xs_sendpages()
>
> When the sock buffer is nearly fully, we get an EAGAIN from
> xs_sendpages() which causes a break out of the loop. Lower down the
> function, we switch on status which cases us to call xs_nospace() with
> the task.
>
> In xs_nospace(), we test the SOCK_ASYNC_NOSPACE bit from the socket, and
> in the rare case where that bit is clear, we return 0 instead of
> EAGAIN. This promptly overwrites status in xs_tcp_send_request().
>
> The result is that xs_tcp_release_xprt() finds a request which has no
> error, but has not sent all of the bytes in its send buffer. It cleans
> up by setting XPRT_CLOSE_WAIT which causes xprt_clear_locked() to queue
> xprt->task_cleanup, which closes the TCP connection.
>
>
> Under normal operation, the TCP connection goes down and back up without
> interruption to the NFS layer. However, when the NFS server hangs in a
> half closed state, the client forces a RST of the TCP connection,
> leading to the timeout.
>
> I have tried a few naive fixes such as changing the default return value
> in xs_nospace() from 0 to -EAGAIN (meaning that 0 will never be
> returned) but this causes a kernel memory leak. Can someone who a
> better understanding of these interactions than me have a look? It
> seems that the if (test_bit()) test in xs_nospace() should have an else
> clause.

I fully agree with your analysis. The correct thing to do here is to
always return either EAGAIN or ENOTCONN. Thank you very much for working
this one out!

Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2011-11-22 13:23:50

by Andrew Cooper

[permalink] [raw]

Subject: Re: NFS TCP race condition with SOCK_ASYNC_NOSPACE

On 22/11/11 12:45, Trond Myklebust wrote:
> On Tue, 2011-11-22 at 12:34 +0000, Andrew Cooper wrote:
>> On 22/11/11 12:22, Trond Myklebust wrote:
>>> On Tue, 2011-11-22 at 12:16 +0000, Andrew Cooper wrote:
>>>> On 22/11/11 12:10, Trond Myklebust wrote:
>>>>> On Tue, 2011-11-22 at 12:02 +0000, Andrew Cooper wrote:
>>>>>> On 22/11/11 11:38, Trond Myklebust wrote:
>>>>>>> On Mon, 2011-11-21 at 18:14 +0000, Andrew Cooper wrote:
>>>>>>>> Following some debugging, I believe that the attached patch fixes the
>>>>>>>> problem.
>>>>>>>>
>>>>>>>> Simply returning EAGAIN is not sufficient, as the task does not get
>>>>>>>> requeued, and times out 13 seconds later (as per our mount options).
>>>>>>>> Setting the SOCK_ASYNC_NOSPACE bit causes the requeue to happen.
>>>>>>>>
>>>>>>>> I realize that this is a gross hack and I should probably not be using
>>>>>>>> SOCK_ASYNC_NOSPACE in that way. Is there a better way to achieve the
>>>>>>>> same solution?
>>>>>>>>
>>>>>>> What you are doing will cause the request to be put to sleep with no
>>>>>>> guarantee that it will ever be woken up. Why would we want to do that if
>>>>>>> there is no report of a tcp window/buffer space congestion?
>>>>>> But the reason we get to this code is because there was a report of
>>>>>> space collision. What would you suggest instead? Changing
>>>>>> xs_{tcp,udp}_send_request() to retry in this case would defeat the point
>>>>>> of having xs_nospace().
>>>>> I suggest doing absolutely nothing: do what you originally proposed,
>>>>> which is to report the EAGAIN so that the client state machine retries
>>>>> the socket write.
>>>>>
>>>>> My point is that this is a context which is _not_ atomic with the
>>>>> original report of tcp window/buffer space congestion. There are no
>>>>> locks or anything else that will guarantee that the congestion still
>>>>> exists, and the fact that the SOCK_ASYNC_NOSPACE flag is now clear
>>>>> indicates that this is the case.
>>>>> The whole purpose of xs_nospace() is to wait until a congestion
>>>>> condition clears. If the congestion clears before we get here, then we
>>>>> have no reason to do anything special other than retry.
>>>>>
>>>>> Trond
>>>> I am slightly confused as to what you mean now.
>>>>
>>>> When you take out the if(test_bit test and always set ret to EAGAIN and
>>>> requeue the request, the next time it wakes up is when it is killed due
>>>> to timeout. This results in substantially worse effects for the
>>>> userspace, as the NFS session is killed.
>>> What is putting the request to sleep? It should be awake when it enters
>>> xs_nospace(), and nothing in or after that function should be putting it
>>> to sleep until we've retried with call_transmit().
>>>
>> I presume it is the call to xprt_wait_for_buffer_space() which calls
>> rpc_sleep_on(). There is xs_tcp_write_space which appears to wake it up
>> based on sk->sk_write_space which is triggered on the sock gaining more
>> space, which has already happened in this specific case.
> That call should only happen if we have to wait for congestion (i.e. if
> SOCK_ASYNC_NOSPACE is set).
>
> Please can you test if the following patch fixes the problem.
>
> Cheers
> Trond

Yup - this works and seems like a rather more elegant solution.

Thanks for all you help - I have been chasing this bug on and off since
the middle of May.

> 8<---------------------------------------------------------------------
> From 729b3861d9a2820735b648a044d558315bc9c9db Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <[email protected]>
> Date: Tue, 22 Nov 2011 14:44:28 +0200
> Subject: [PATCH] SUNRPC: Ensure we return EAGAIN in xs_nospace if congestion
> is cleared
>
> By returning '0' instead of 'EAGAIN' when the tests in xs_nospace() fail
> to find evidence of socket congestion, we are making the RPC engine believe
> that the message was incompletely sent, and so it disconnects the socket
> instead of just retrying the send.
>
> The bug appears to have been introduced by commit
> 5e3771ce2d6a69e10fcc870cdf226d121d868491 (SUNRPC: Ensure that xs_nospace
> return values are propagated).
>
> Reported-by: Andrew Cooper <[email protected]>
> Signed-off-by: Trond Myklebust <[email protected]>

Tested-by: Andrew Cooper <[email protected]>

> Cc: [email protected] [>= 2.6.30]
> ---
> net/sunrpc/xprtsock.c | 3 +--
> 1 files changed, 1 insertions(+), 2 deletions(-)
>
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index 2d78d95..55472c4 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -496,7 +496,7 @@ static int xs_nospace(struct rpc_task *task)
> struct rpc_rqst *req = task->tk_rqstp;
> struct rpc_xprt *xprt = req->rq_xprt;
> struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
> - int ret = 0;
> + int ret = -EAGAIN;
>
> dprintk("RPC: %5u xmit incomplete (%u left of %u)\n",
> task->tk_pid, req->rq_slen - req->rq_bytes_sent,
> @@ -508,7 +508,6 @@ static int xs_nospace(struct rpc_task *task)
> /* Don't race with disconnect */
> if (xprt_connected(xprt)) {
> if (test_bit(SOCK_ASYNC_NOSPACE, &transport->sock->flags)) {
> - ret = -EAGAIN;
> /*
> * Notify TCP that we're limited by the application
> * window size

--
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com