Hi,
We have a customer who reports occasional but reproducible hangs on our 3.0
based kernel.
I managed to deduce that
commit a9a6b52ee1baa865283a91eb8d443ee91adfca56
Author: Trond Myklebust <[email protected]>
Date: Fri Feb 22 14:57:57 2013 -0500
SUNRPC: Don't start the retransmission timer when out of socket space
was to blame (it got into our kernel through -stable ... not sure why it
deserved to be in -stable). Reverting that patch fixes the problem. However I
don't fully understand why.
Analysing the logs suggests that the NFS client had tried to write a request
to the socket, failed as the transmit queue was temporarily full, and never
got the call-back when the queue emptied. All other requests piled up behind
this one and all traffic stopped.
With the above commit reverted, the client will timeout and retry the
transmit even if the call-back hasn't arrived. With the patch, it waits
indefinitely.
This suggests that there is a race in the SOCK_ASYNC_NOSPACE handling.
The most obvious place for a race would be in xs_nospace() between
if (test_bit(SOCK_ASYNC_NOSPACE, &transport->sock->flags)) {
and before
xprt_wait_for_buffer_space(task, xs_nospace_callback);
However xprt->transport_lock is held here, and xprt_write_space() will wait
for the lock before calling rpc_wake_up_queued_task(), so the wakeup cannot
happen before the task is put on the pending list.
My only remaining theory is that the space is freed before SOCK_NOSPACE is
set in xs_nospace(). I think this would mean that the callback never happens.
In net/ipv4/tcp.c, in tcp_poll() there is code:
if (sk_stream_is_writeable(sk)) {
mask |= POLLOUT | POLLWRNORM;
} else { /* send SIGIO later */
set_bit(SOCK_ASYNC_NOSPACE,
&sk->sk_socket->flags);
set_bit(SOCK_NOSPACE, &sk->sk_socket->flags);
/* Race breaker. If space is freed after
* wspace test but before the flags are set,
* IO signal will be lost.
*/
if (sk_stream_is_writeable(sk))
mask |= POLLOUT | POLLWRNORM;
}
which suggests there is room for a race, and seems to suggest that it is
safest to test sk_stream_is_writeable() after setting SOCK_NOSPACE.
This is what the sunrpc server side does for udp (svc_udp_has_wspace).
For tcp it doesn't something which I don't really understand, so maybe I'm
missing something.
Any thoughts? I could try asking the customer to test with an extra check
for sk_stream_is_writeable before calling xprt_wait_for_buffer_space(), but I
wouldn't be surprised if they'd rather not given they have a working solution.
Does someone understand this code enough to be confident that such a test
would be correct?
Thanks,
NeilBrown
On Tue, 11 Feb 2014 09:33:31 -0500 Trond Myklebust
<[email protected]> wrote:
> On Mon, 2014-02-10 at 17:03 +1100, NeilBrown wrote:
> > Hi,
> > We have a customer who reports occasional but reproducible hangs on our 3.0
> > based kernel.
> > I managed to deduce that
> >
> > commit a9a6b52ee1baa865283a91eb8d443ee91adfca56
> > Author: Trond Myklebust <[email protected]>
> > Date: Fri Feb 22 14:57:57 2013 -0500
> >
> > SUNRPC: Don't start the retransmission timer when out of socket space
> >
> > was to blame (it got into our kernel through -stable ... not sure why it
> > deserved to be in -stable). Reverting that patch fixes the problem. However I
> > don't fully understand why.
> >
>
> The reason why that patch was put into stable was that the connection
> breakage triggered by the timeouts was causing nasty behaviour when
> servers (or the network) are heavily loaded. Instead of clearing the
> logjam, breaking the connection and then reconnecting would aggravate
> it, causing hangs.
Ahh, that make sense. Thanks.
>
> Anyhow, does the following patch help to break the race?
> 8<------------------------------------------------------------------
> >From e4c0373be4b8deae2667a7478d34415b99924abc Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <[email protected]>
> Date: Tue, 11 Feb 2014 09:15:54 -0500
> Subject: [PATCH] SUNRPC: Fix races in xs_nospace()
>
> When a send failure occurs due to the socket being out of buffer space,
> we call xs_nospace() in order to have the RPC task wait until the
> socket has drained enough to make it worth while trying again.
> The current patch fixes a race in which the socket is drained before
> we get round to setting up the machinery in xs_nospace(), and which
> is reported to cause hangs.
>
> Link: http://lkml.kernel.org/r/[email protected]
> Fixes: a9a6b52ee1ba (SUNRPC: Don't start the retransmission timer...)
> Reported-by: Neil Brown <[email protected]>
> Cc: [email protected]
> Signed-off-by: Trond Myklebust <[email protected]>
> ---
> net/sunrpc/xprtsock.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> index 6497c221612c..966763d735e9 100644
> --- a/net/sunrpc/xprtsock.c
> +++ b/net/sunrpc/xprtsock.c
> @@ -510,6 +510,7 @@ static int xs_nospace(struct rpc_task *task)
> struct rpc_rqst *req = task->tk_rqstp;
> struct rpc_xprt *xprt = req->rq_xprt;
> struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
> + struct sock *sk = transport->inet;
> int ret = -EAGAIN;
>
> dprintk("RPC: %5u xmit incomplete (%u left of %u)\n",
> @@ -527,7 +528,7 @@ static int xs_nospace(struct rpc_task *task)
> * window size
> */
> set_bit(SOCK_NOSPACE, &transport->sock->flags);
> - transport->inet->sk_write_pending++;
> + sk->sk_write_pending++;
> /* ...and wait for more buffer space */
> xprt_wait_for_buffer_space(task, xs_nospace_callback);
> }
> @@ -537,6 +538,9 @@ static int xs_nospace(struct rpc_task *task)
> }
>
> spin_unlock_bh(&xprt->transport_lock);
> +
> + /* Race breaker in case memory is freed before above code is called */
> + sk->sk_write_space(sk);
> return ret;
> }
>
I looks good. I've asked if the customer is willing to test it and provided
the patch.
By the way, this bug is the first time that I've found the tasklisk - printed
when you enable rpc debuging - useful. And it was *really* useful!
There were lots of
nfsv3 WRITE a:call_reserveresult q:xprt_backlog
and a few
nfsv3 WRITE a:call_status q:xprt_sending
and one
nfsv3 WRITE a:call_transmit_status q:xprt_pending
That last one was put to sleep by xprt_wait_for_buffer_space() and
is blocking all the rest.
Now I've got another bug with vaguely similar symptoms and only
9bfb000 (null) 0 ffffffffa03aa0a0 nfsv3 READ a:call_reserveresult q:xprt_sending
9bfb000 (null) 0 ffffffffa03aa0a0 nfsv3 READ a:call_reserveresult q:xprt_sending
a933000 (null) 0 ffffffffa03084e0 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
I suspect it is a different bug. Maybe some missing backports or something,
though we have 961a828df64979d2a which is related code.
Thanks,
NeilBrown
On Wed, 12 Feb 2014 15:05:55 +1100 NeilBrown <[email protected]> wrote:
> On Tue, 11 Feb 2014 09:33:31 -0500 Trond Myklebust
> <[email protected]> wrote:
>
> > On Mon, 2014-02-10 at 17:03 +1100, NeilBrown wrote:
> > > Hi,
> > > We have a customer who reports occasional but reproducible hangs on our 3.0
> > > based kernel.
> > > I managed to deduce that
> > >
> > > commit a9a6b52ee1baa865283a91eb8d443ee91adfca56
> > > Author: Trond Myklebust <[email protected]>
> > > Date: Fri Feb 22 14:57:57 2013 -0500
> > >
> > > SUNRPC: Don't start the retransmission timer when out of socket space
> > >
> > > was to blame (it got into our kernel through -stable ... not sure why it
> > > deserved to be in -stable). Reverting that patch fixes the problem. However I
> > > don't fully understand why.
> > >
> >
> > The reason why that patch was put into stable was that the connection
> > breakage triggered by the timeouts was causing nasty behaviour when
> > servers (or the network) are heavily loaded. Instead of clearing the
> > logjam, breaking the connection and then reconnecting would aggravate
> > it, causing hangs.
>
> Ahh, that make sense. Thanks.
>
>
> >
> > Anyhow, does the following patch help to break the race?
> > 8<------------------------------------------------------------------
> > >From e4c0373be4b8deae2667a7478d34415b99924abc Mon Sep 17 00:00:00 2001
> > From: Trond Myklebust <[email protected]>
> > Date: Tue, 11 Feb 2014 09:15:54 -0500
> > Subject: [PATCH] SUNRPC: Fix races in xs_nospace()
> >
> > When a send failure occurs due to the socket being out of buffer space,
> > we call xs_nospace() in order to have the RPC task wait until the
> > socket has drained enough to make it worth while trying again.
> > The current patch fixes a race in which the socket is drained before
> > we get round to setting up the machinery in xs_nospace(), and which
> > is reported to cause hangs.
> >
> > Link: http://lkml.kernel.org/r/[email protected]
> > Fixes: a9a6b52ee1ba (SUNRPC: Don't start the retransmission timer...)
> > Reported-by: Neil Brown <[email protected]>
> > Cc: [email protected]
> > Signed-off-by: Trond Myklebust <[email protected]>
> > ---
> > net/sunrpc/xprtsock.c | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
> > index 6497c221612c..966763d735e9 100644
> > --- a/net/sunrpc/xprtsock.c
> > +++ b/net/sunrpc/xprtsock.c
> > @@ -510,6 +510,7 @@ static int xs_nospace(struct rpc_task *task)
> > struct rpc_rqst *req = task->tk_rqstp;
> > struct rpc_xprt *xprt = req->rq_xprt;
> > struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
> > + struct sock *sk = transport->inet;
> > int ret = -EAGAIN;
> >
> > dprintk("RPC: %5u xmit incomplete (%u left of %u)\n",
> > @@ -527,7 +528,7 @@ static int xs_nospace(struct rpc_task *task)
> > * window size
> > */
> > set_bit(SOCK_NOSPACE, &transport->sock->flags);
> > - transport->inet->sk_write_pending++;
> > + sk->sk_write_pending++;
> > /* ...and wait for more buffer space */
> > xprt_wait_for_buffer_space(task, xs_nospace_callback);
> > }
> > @@ -537,6 +538,9 @@ static int xs_nospace(struct rpc_task *task)
> > }
> >
> > spin_unlock_bh(&xprt->transport_lock);
> > +
> > + /* Race breaker in case memory is freed before above code is called */
> > + sk->sk_write_space(sk);
> > return ret;
> > }
> >
>
> I looks good. I've asked if the customer is willing to test it and provided
> the patch.
Unfortunately the customer was not able to get internal approval to test this
patch, as they already have a working solution.
So we might have to go with this patch without confirmation that it fixes
this particular problem. I've examined it again and I cannot see any room
for any new races, and it should fix a credible problem.
So
Reviewed-by: NeilBrown <[email protected]>
Thanks,
NeilBrown
On Mon, 2014-02-10 at 17:03 +1100, NeilBrown wrote:
> Hi,
> We have a customer who reports occasional but reproducible hangs on our 3.0
> based kernel.
> I managed to deduce that
>
> commit a9a6b52ee1baa865283a91eb8d443ee91adfca56
> Author: Trond Myklebust <[email protected]>
> Date: Fri Feb 22 14:57:57 2013 -0500
>
> SUNRPC: Don't start the retransmission timer when out of socket space
>
> was to blame (it got into our kernel through -stable ... not sure why it
> deserved to be in -stable). Reverting that patch fixes the problem. However I
> don't fully understand why.
>
The reason why that patch was put into stable was that the connection
breakage triggered by the timeouts was causing nasty behaviour when
servers (or the network) are heavily loaded. Instead of clearing the
logjam, breaking the connection and then reconnecting would aggravate
it, causing hangs.
Anyhow, does the following patch help to break the race?
8<------------------------------------------------------------------
>From e4c0373be4b8deae2667a7478d34415b99924abc Mon Sep 17 00:00:00 2001
From: Trond Myklebust <[email protected]>
Date: Tue, 11 Feb 2014 09:15:54 -0500
Subject: [PATCH] SUNRPC: Fix races in xs_nospace()
When a send failure occurs due to the socket being out of buffer space,
we call xs_nospace() in order to have the RPC task wait until the
socket has drained enough to make it worth while trying again.
The current patch fixes a race in which the socket is drained before
we get round to setting up the machinery in xs_nospace(), and which
is reported to cause hangs.
Link: http://lkml.kernel.org/r/[email protected]
Fixes: a9a6b52ee1ba (SUNRPC: Don't start the retransmission timer...)
Reported-by: Neil Brown <[email protected]>
Cc: [email protected]
Signed-off-by: Trond Myklebust <[email protected]>
---
net/sunrpc/xprtsock.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)
diff --git a/net/sunrpc/xprtsock.c b/net/sunrpc/xprtsock.c
index 6497c221612c..966763d735e9 100644
--- a/net/sunrpc/xprtsock.c
+++ b/net/sunrpc/xprtsock.c
@@ -510,6 +510,7 @@ static int xs_nospace(struct rpc_task *task)
struct rpc_rqst *req = task->tk_rqstp;
struct rpc_xprt *xprt = req->rq_xprt;
struct sock_xprt *transport = container_of(xprt, struct sock_xprt, xprt);
+ struct sock *sk = transport->inet;
int ret = -EAGAIN;
dprintk("RPC: %5u xmit incomplete (%u left of %u)\n",
@@ -527,7 +528,7 @@ static int xs_nospace(struct rpc_task *task)
* window size
*/
set_bit(SOCK_NOSPACE, &transport->sock->flags);
- transport->inet->sk_write_pending++;
+ sk->sk_write_pending++;
/* ...and wait for more buffer space */
xprt_wait_for_buffer_space(task, xs_nospace_callback);
}
@@ -537,6 +538,9 @@ static int xs_nospace(struct rpc_task *task)
}
spin_unlock_bh(&xprt->transport_lock);
+
+ /* Race breaker in case memory is freed before above code is called */
+ sk->sk_write_space(sk);
return ret;
}
--
1.8.5.3
--
Trond Myklebust
Linux NFS client maintainer, PrimaryData
[email protected]