On Mon, 2013-01-07 at 15:29 -0500, Trond Myklebust wrote:
> On Mon, 2013-01-07 at 15:20 -0500, Chris Perl wrote:
> > On Mon, Jan 07, 2013 at 07:50:10PM +0000, Myklebust, Trond wrote:
> > > Hi Chris,
> > >
> > > Excellent sleuthing! Given the thoroughness of your explanation, I'm
> > > pretty sure that the attached patch should fix the problem.
> > >
> > > Cheers
> > >   Trond
> > > --
> > > Trond Myklebust
> > > Linux NFS client maintainer
> > >
> > > NetApp
> > > Trond.Myklebust@netapp.com
> > > http://www.netapp.com
> >
> > > From ec8cbb4aff21cd0eac2c6f3fc4273ac72cdd91ef Mon Sep 17 00:00:00 2001
> > > From: Trond Myklebust <Trond.Myklebust@netapp.com>
> > > Date: Mon, 7 Jan 2013 14:30:46 -0500
> > > Subject: [PATCH] SUNRPC: Ensure we release the socket write lock if the
> > > rpc_task exits early
> > >
> > > If the rpc_task exits while holding the socket write lock before it has
> > > allocated an rpc slot, then the usual mechanism for releasing the write
> > > lock in xprt_release() is defeated.
> > >
> > > The problem occurs if the call to xprt_lock_write() initially fails, so
> > > that the rpc_task is put on the xprt->sending wait queue. If the task
> > > exits after being assigned the lock by __xprt_lock_write_func, but
> > > before it has retried the call to xprt_lock_and_alloc_slot(), then
> > > it calls xprt_release() while holding the write lock, but will
> > > immediately exit due to the test for task->tk_rqstp != NULL.
> > >
> > > Reported-by: Chris Perl <chris.perl@gmail.com>
> > > Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
> > > Cc: stable@vger.kernel.org [>= 3.1]
> > > ---
> > > net/sunrpc/xprt.c | 6 ++++--
> > > 1 file changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> > > index bd462a5..6676457 100644
> > > --- a/net/sunrpc/xprt.c
> > > +++ b/net/sunrpc/xprt.c
> > > @@ -1136,10 +1136,12 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
> > > void xprt_release(struct rpc_task *task)
> > > {
> > > struct rpc_xprt *xprt;
> > > - struct rpc_rqst *req;
> > > + struct rpc_rqst *req = task->tk_rqstp;
> > >
> > > - if (!(req = task->tk_rqstp))
> > > + if (req == NULL) {
> > > + xprt_release_write(task->tk_xprt, task);
> > > return;
> > > + }
> > >
> > > xprt = req->rq_xprt;
> > > if (task->tk_ops->rpc_count_stats != NULL)
> > > --
> > > 1.7.11.7
> > >
> >
> > Ah, I totally missed the call to `rpc_release_task' at the bottom of the
> > `__rpc_execute' loop (at least that's how I think we'd get to this function
> > you're patching).
> >
> > But wouldn't we need to update the call site in
> > `rpc_release_resources_task' as well? It contains an explicit check for
> > `task->tk_rqstp' being non null.
>
> Ewww.... You're right: That's a wart that seems to have been copied and
> pasted quite a few times.
>
> Here is v2...

...and a v3 that adds a small optimisation to avoid taking the transport
lock in cases where we really don't need it.

--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
http://www.netapp.com
On Tue, 2013-01-08 at 13:40 -0500, Chris Perl wrote:
> On Mon, Jan 07, 2013 at 05:00:47PM -0500, Chris Perl wrote:
> Anyway, it appears that on mount the rpc_task's tk_client member is NULL
> and therefore the double dereference of task->tk_xprt is what blew
> things up. I amended the patch for this [1] and am testing it
> now.
>
> Thus far, I've still hit hangs, it just seems to take longer. I'll have
> to dig in a bit further to see what's going on now.
>
> Is this CentOS 6.3 kernel/system too old for you guys to care about?
> I.e. should I spend time digging into and reporting problems
> for this system as well, or do you only care about the fedora system?

My main interest is always the upstream (Linus) kernel, however the RPC
client in the CentOS 6.3 kernel does actually contain a lot of code that
was recently backported from upstream. As such, it is definitely of
interest to figure out corner case bugs so that we can compare to
upstream...

> I'll report back again when I have further info and after testing the
> fedora system.
>
> [1] linux-kernel-test.patch

I've attached the latest copy of the patch (v4). In addition to the
check for tk_client!=NULL, it needed a couple of changes to deal with
the RCU code.
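
[Note: the v4 attachment itself is not reproduced in this archive. As a
rough sketch only, assuming a ~3.7/3.8-era tree where a task without a
request reaches its transport through the RCU-protected
task->tk_client->cl_xprt pointer, the "tk_client check plus RCU
handling" could look roughly like the following; this is an
illustration, not the attached patch.]

void xprt_release(struct rpc_task *task)
{
	struct rpc_xprt *xprt;
	struct rpc_rqst *req = task->tk_rqstp;

	if (req == NULL) {
		if (task->tk_client) {
			/* No request yet: the task may still own the write lock */
			rcu_read_lock();
			xprt = rcu_dereference(task->tk_client->cl_xprt);
			if (xprt->snd_task == task)
				xprt_release_write(xprt, task);
			rcu_read_unlock();
		}
		return;
	}
	/* ... normal teardown of the request continues as before ... */
}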
Cheers
  Trond
--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
http://www.netapp.com
On Tue, 2013-01-08 at 16:01 -0500, Chris Perl wrote:
> > My main interest is always the upstream (Linus) kernel, however the RPC
> > client in the CentOS 6.3 kernel does actually contain a lot of code that
> > was recently backported from upstream. As such, it is definitely of
> > interest to figure out corner case bugs so that we can compare to
> > upstream...
>
> Ok, great. I will try this version of the patch as well. However, when just
> thinking about this, I'm concerned that the race still exists, but is just less
> likely to manifest.
>
> I imagine something like this happening. I assume there is some reason this
> can't happen that I'm not seeing? These are functions from Linus's current
> git, not the CentOS 6.3 code:
>
> thread 1 thread 2
> -------- --------
> __rpc_execute __rpc_execute
> ... ...
> call_reserve
> xprt_reserve
> xprt_lock_and_alloc_slot
> xprt_lock_write
> xprt_reserve_xprt
> ...
> xprt_release_write
> call_reserve
> xprt_reserve
> xprt_lock_and_alloc_slot
> xprt_lock_write
> xprt_reserve_xprt
> rpc_sleep_on_priority
> __rpc_sleep_on_priority
> __rpc_add_wait_queue
> __rpc_add_wait_queue_priority
> (Now on the sending wait queue)
> xs_tcp_release_xprt
> xprt_release_xprt
> xprt_clear_locked
> __xprt_lock_write_next
> rpc_wake_up_first
> __rpc_find_next_queued
> __rpc_find_next_queued_priority
> ...
> (has now pulled thread 2 off the wait queue)
> out_of_line_wait_on_bit
> (receive SIGKILL)
> rpc_wait_bit_killable
> rpc_exit
> rpc_exit_task
> rpc_release_task
> (doesn't release xprt b/c he isn't listed in snd_task yet)
> (returns from __rpc_execute)
> __xprt_lock_write_func
> (thread 2 now has the transport locked)
> rpc_wake_up_task_queue_locked
> __rpc_remove_wait_queue
> __rpc_remove_wait_queue_priority
> (continues on, potentially exiting early,
> potentially blocking the next time it needs the transport)
>
> With the patch we're discussing, it would fix the case where thread 2 is
> breaking out of the FSM loop after having been given the transport lock. But
> what about the above? Is there something else synchronizing things?
Hi Chris,

I'm not sure I understand how the above would work. rpc_exit_task() will
remove the rpc_task from the xprt->sending rpc_wait_queue, at which
point __xprt_lock_write_func() can no longer find it.

> In my testing so far with my (not quite right) amended v3 of the patch, the
> timing has become such that I was having trouble reproducing the problem while
> attempting to instrument things with systemtap. However, without systemtap
> running, I'm still able to reproduce the hang pretty easily.

Is it really the same hang? Does 'echo 0 >/proc/sys/sunrpc/rpc_debug'
show you a similar list of rpc tasks waiting on xprt->sending?

--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
http://www.netapp.com
On Mon, Jan 07, 2013 at 05:00:47PM -0500, Chris Perl wrote:
> On Mon, Jan 07, 2013 at 08:35:31PM +0000, Myklebust, Trond wrote:
> > On Mon, 2013-01-07 at 15:29 -0500, Trond Myklebust wrote:
> > > On Mon, 2013-01-07 at 15:20 -0500, Chris Perl wrote:
> > > > On Mon, Jan 07, 2013 at 07:50:10PM +0000, Myklebust, Trond wrote:
> > > > > Hi Chris,
> > > > >
> > > > > Excellent sleuthing! Given the thoroughness of your explanation, I'm
> > > > > pretty sure that the attached patch should fix the problem.
> > > > >
> > > > > Cheers
> > > > > Trond
> > > > > --
> > > > > Trond Myklebust
> > > > > Linux NFS client maintainer
> > > > >
> > > > > NetApp
> > > > > Trond.Myklebust@netapp.com
> > > > > http://www.netapp.com
> > > >
> > > > > From ec8cbb4aff21cd0eac2c6f3fc4273ac72cdd91ef Mon Sep 17 00:00:00 2001
> > > > > From: Trond Myklebust <Trond.Myklebust@netapp.com>
> > > > > Date: Mon, 7 Jan 2013 14:30:46 -0500
> > > > > Subject: [PATCH] SUNRPC: Ensure we release the socket write lock if the
> > > > > rpc_task exits early
> > > > >
> > > > > If the rpc_task exits while holding the socket write lock before it has
> > > > > allocated an rpc slot, then the usual mechanism for releasing the write
> > > > > lock in xprt_release() is defeated.
> > > > >
> > > > > The problem occurs if the call to xprt_lock_write() initially fails, so
> > > > > that the rpc_task is put on the xprt->sending wait queue. If the task
> > > > > exits after being assigned the lock by __xprt_lock_write_func, but
> > > > > before it has retried the call to xprt_lock_and_alloc_slot(), then
> > > > > it calls xprt_release() while holding the write lock, but will
> > > > > immediately exit due to the test for task->tk_rqstp != NULL.
> > > > >
> > > > > Reported-by: Chris Perl <chris.perl@gmail.com>
> > > > > Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
> > > > > Cc: stable@vger.kernel.org [>= 3.1]
> > > > > ---
> > > > > net/sunrpc/xprt.c | 6 ++++--
> > > > > 1 file changed, 4 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> > > > > index bd462a5..6676457 100644
> > > > > --- a/net/sunrpc/xprt.c
> > > > > +++ b/net/sunrpc/xprt.c
> > > > > @@ -1136,10 +1136,12 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
> > > > > void xprt_release(struct rpc_task *task)
> > > > > {
> > > > > struct rpc_xprt *xprt;
> > > > > - struct rpc_rqst *req;
> > > > > + struct rpc_rqst *req = task->tk_rqstp;
> > > > >
> > > > > - if (!(req = task->tk_rqstp))
> > > > > + if (req == NULL) {
> > > > > + xprt_release_write(task->tk_xprt, task);
> > > > > return;
> > > > > + }
> > > > >
> > > > > xprt = req->rq_xprt;
> > > > > if (task->tk_ops->rpc_count_stats != NULL)
> > > > > --
> > > > > 1.7.11.7
> > > > >
> > > >
> > > > Ah, I totally missed the call to `rpc_release_task' at the bottom of the
> > > > `__rpc_execute' loop (at least that's how I think we'd get to this function
> > > > you're patching).
> > > >
> > > > But wouldn't we need to update the call site in
> > > > `rpc_release_resources_task' as well? It contains an explicit check for
> > > > `task->tk_rqstp' being non null.
> > >
> > > Ewww.... You're right: That's a wart that seems to have been copied and
> > > pasted quite a few times.
> > >
> > > Here is v2...
> >
> > ...and a v3 that adds a small optimisation to avoid taking the transport
> > lock in cases where we really don't need it.
> >
> > --
> > Trond Myklebust
> > Linux NFS client maintainer
> >
> > NetApp
> > Trond.Myklebust@netapp.com
> > http://www.netapp.com
>
> > From 51b63a538c54cb9c3b83c4d62572cf18da165cba Mon Sep 17 00:00:00 2001
> > From: Trond Myklebust <Trond.Myklebust@netapp.com>
> > Date: Mon, 7 Jan 2013 14:30:46 -0500
> > Subject: [PATCH v3] SUNRPC: Ensure we release the socket write lock if the
> > rpc_task exits early
> >
> > If the rpc_task exits while holding the socket write lock before it has
> > allocated an rpc slot, then the usual mechanism for releasing the write
> > lock in xprt_release() is defeated.
> >
> > The problem occurs if the call to xprt_lock_write() initially fails, so
> > that the rpc_task is put on the xprt->sending wait queue. If the task
> > exits after being assigned the lock by __xprt_lock_write_func, but
> > before it has retried the call to xprt_lock_and_alloc_slot(), then
> > it calls xprt_release() while holding the write lock, but will
> > immediately exit due to the test for task->tk_rqstp != NULL.
> >
> > Reported-by: Chris Perl <chris.perl@gmail.com>
> > Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
> > Cc: stable@vger.kernel.org [>= 3.1]
> > ---
> > net/sunrpc/sched.c | 3 +--
> > net/sunrpc/xprt.c | 8 ++++++--
> > 2 files changed, 7 insertions(+), 4 deletions(-)
> >
> > diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> > index b4133bd..bfa3171 100644
> > --- a/net/sunrpc/sched.c
> > +++ b/net/sunrpc/sched.c
> > @@ -972,8 +972,7 @@ static void rpc_async_release(struct work_struct *work)
> >
> > static void rpc_release_resources_task(struct rpc_task *task)
> > {
> > - if (task->tk_rqstp)
> > - xprt_release(task);
> > + xprt_release(task);
> > if (task->tk_msg.rpc_cred) {
> > put_rpccred(task->tk_msg.rpc_cred);
> > task->tk_msg.rpc_cred = NULL;
> > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> > index bd462a5..6acc0c5 100644
> > --- a/net/sunrpc/xprt.c
> > +++ b/net/sunrpc/xprt.c
> > @@ -1136,10 +1136,14 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
> > void xprt_release(struct rpc_task *task)
> > {
> > struct rpc_xprt *xprt;
> > - struct rpc_rqst *req;
> > + struct rpc_rqst *req = task->tk_rqstp;
> >
> > - if (!(req = task->tk_rqstp))
> > + if (req == NULL) {
> > + xprt = task->tk_xprt;
> > + if (xprt->snd_task == task)
> > + xprt_release_write(xprt, task);
> > return;
> > + }
> >
> > xprt = req->rq_xprt;
> > if (task->tk_ops->rpc_count_stats != NULL)
> > --
> > 1.7.11.7
> >
>
> Thanks, I will give this a shot tomorrow and let you know how it goes.
I applied the patch to a CentOS 6.3 system (same kernel I referenced
before) as I don't have my laptop available today to test the fedora
system I was using prior. It panic'd on attempting to mount the NFS
share to start testing.
The backtrace from crash shows it as:
crash> bt
PID: 6721 TASK: ffff8810284acae0 CPU: 2 COMMAND: "mount.nfs"
#0 [ffff881028c255b0] machine_kexec at ffffffff8103281b
#1 [ffff881028c25610] crash_kexec at ffffffff810ba8e2
#2 [ffff881028c256e0] oops_end at ffffffff815086b0
#3 [ffff881028c25710] no_context at ffffffff81043bab
#4 [ffff881028c25760] __bad_area_nosemaphore at ffffffff81043e35
#5 [ffff881028c257b0] bad_area at ffffffff81043f5e
#6 [ffff881028c257e0] __do_page_fault at ffffffff81044710
#7 [ffff881028c25900] do_page_fault at ffffffff8150a68e
#8 [ffff881028c25930] page_fault at ffffffff81507a45
[exception RIP: xprt_release+388]
RIP: ffffffffa025ec14 RSP: ffff881028c259e8 RFLAGS: 00010246
RAX: 0000000000000000 RBX: ffff880829be0080 RCX: 0000000000000000
RDX: 0000000000000002 RSI: 0000000000000000 RDI: ffff880829be0080
RBP: ffff881028c25a18 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: ffff880829be0080 R14: ffffffffa0285fe0 R15: ffff88082a3ef968
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
#9 [ffff881028c25a20] rpc_release_resources_task at ffffffffa0266436 [sunrpc]
#10 [ffff881028c25a40] rpc_do_put_task at ffffffffa0266564 [sunrpc]
#11 [ffff881028c25a70] rpc_put_task at ffffffffa02665b0 [sunrpc]
#12 [ffff881028c25a80] rpc_call_sync at ffffffffa025d4b8 [sunrpc]
#13 [ffff881028c25ae0] rpc_ping at ffffffffa025d522 [sunrpc]
#14 [ffff881028c25b20] rpc_create at ffffffffa025df37 [sunrpc]
#15 [ffff881028c25bf0] nfs_mount at ffffffffa033f54a [nfs]
#16 [ffff881028c25cc0] nfs_try_mount at ffffffffa03329f3 [nfs]
#17 [ffff881028c25d80] nfs_get_sb at ffffffffa0333d86 [nfs]
#18 [ffff881028c25e00] vfs_kern_mount at ffffffff8117df2b
#19 [ffff881028c25e50] do_kern_mount at ffffffff8117e0d2
#20 [ffff881028c25ea0] do_mount at ffffffff8119c862
#21 [ffff881028c25f20] sys_mount at ffffffff8119cef0
#22 [ffff881028c25f80] system_call_fastpath at ffffffff8100b0f2
RIP: 00007ffff70aa8aa RSP: 00007fffffffdac8 RFLAGS: 00000202
RAX: 00000000000000a5 RBX: ffffffff8100b0f2 RCX: 0000000073726576
RDX: 00007ffff7ff8edb RSI: 00007ffff8205da0 RDI: 00007fffffffe2af
RBP: 00007ffff7fc76a8 R8: 00007ffff8210090 R9: fefeff3836333834
R10: 0000000000000000 R11: 0000000000000206 R12: 00007fffffffdbbc
R13: 00007fffffffdbbe R14: 00007fffffffdcb0 R15: 00007ffff8206040
ORIG_RAX: 00000000000000a5 CS: 0033 SS: 002b
Anyway, it appears that on mount the rpc_task's tk_client member is NULL
and therefore the double dereference of task->tk_xprt is what blew
things up. I amended the patch for this [1] and am testing it
now.
Thus far, I've still hit hangs, it just seems to take longer. I'll have
to dig in a bit further to see what's going on now.
Is this CentOS 6.3 kernel/system too old for you guys to care about?
I.e. should I spend time digging into and reporting problems
for this system as well, or do you only care about the fedora system?
I'll report back again when I have further info and after testing the
fedora system.
[1] linux-kernel-test.patch
> My main interest is always the upstream (Linus) kernel, however the RPC
> client in the CentOS 6.3 kernel does actually contain a lot of code that
> was recently backported from upstream. As such, it is definitely of
> interest to figure out corner case bugs so that we can compare to
> upstream...
Ok, great. I will try this version of the patch as well. However, when just
thinking about this, I'm concerned that the race still exists, but is just less
likely to manifest.
I imagine something like this happening. I assume there is some reason this
can't happen that I'm not seeing? These are functions from Linus's current
git, not the CentOS 6.3 code:
thread 1 thread 2
-------- --------
__rpc_execute __rpc_execute
... ...
call_reserve
xprt_reserve
xprt_lock_and_alloc_slot
xprt_lock_write
xprt_reserve_xprt
...
xprt_release_write
call_reserve
xprt_reserve
xprt_lock_and_alloc_slot
xprt_lock_write
xprt_reserve_xprt
rpc_sleep_on_priority
__rpc_sleep_on_priority
__rpc_add_wait_queue
__rpc_add_wait_queue_priority
(Now on the sending wait queue)
xs_tcp_release_xprt
xprt_release_xprt
xprt_clear_locked
__xprt_lock_write_next
rpc_wake_up_first
__rpc_find_next_queued
__rpc_find_next_queued_priority
...
(has now pulled thread 2 off the wait queue)
out_of_line_wait_on_bit
(receive SIGKILL)
rpc_wait_bit_killable
rpc_exit
rpc_exit_task
rpc_release_task
(doesn't release xprt b/c he isn't listed in snd_task yet)
(returns from __rpc_execute)
__xprt_lock_write_func
(thread 2 now has the transport locked)
rpc_wake_up_task_queue_locked
__rpc_remove_wait_queue
__rpc_remove_wait_queue_priority
(continues on, potentially exiting early,
potentially blocking the next time it needs the transport)
With the patch we're discussing, it would fix the case where thread 2 is
breaking out of the FSM loop after having been given the transport lock. But
what about the above? Is there something else synchronizing things?
In my testing so far with my (not quite right) amended v3 of the patch, the
timing has become such that I was having trouble reproducing the problem while
attempting to instrument things with systemtap. However, without systemtap
running, I'm still able to reproduce the hang pretty easily.
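
[Note: for context on the call trace in the message above, here is a
simplified sketch of xprt_reserve_xprt() paraphrased from a ~3.8-era
tree (details trimmed, not verbatim kernel source). It shows the two
outcomes in the trace: the winner becomes xprt->snd_task, while the
loser is parked on the xprt->sending wait queue to be handed the lock
later.]

int xprt_reserve_xprt(struct rpc_xprt *xprt, struct rpc_task *task)
{
	if (test_and_set_bit(XPRT_LOCKED, &xprt->state)) {
		if (task == xprt->snd_task)
			return 1;	/* we already hold the write lock */
		/* Contended: queue the task and wait to be handed the lock */
		task->tk_timeout = 0;
		task->tk_status = -EAGAIN;
		rpc_sleep_on_priority(&xprt->sending, task, NULL,
				      RPC_PRIORITY_NORMAL);
		return 0;
	}
	xprt->snd_task = task;		/* we now own the write lock */
	return 1;
}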
On Tue, Jan 08, 2013 at 09:13:45PM +0000, Myklebust, Trond wrote:
> On Tue, 2013-01-08 at 16:01 -0500, Chris Perl wrote:
> > > My main interest is always the upstream (Linus) kernel, however the RPC
> > > client in the CentOS 6.3 kernel does actually contain a lot of code that
> > > was recently backported from upstream. As such, it is definitely of
> > > interest to figure out corner case bugs so that we can compare to
> > > upstream...
> >
> > Ok, great. I will try this version of the patch as well. However, when just
> > thinking about this, I'm concerned that the race still exists, but is just less
> > likely to manifest.
> >
> > I imagine something like this happening. I assume there is some reason this
> > can't happen that I'm not seeing? These are functions from Linus's current
> > git, not the CentOS 6.3 code:
> >
> > thread 1 thread 2
> > -------- --------
> > __rpc_execute __rpc_execute
> > ... ...
> > call_reserve
> > xprt_reserve
> > xprt_lock_and_alloc_slot
> > xprt_lock_write
> > xprt_reserve_xprt
> > ...
> > xprt_release_write
> > call_reserve
> > xprt_reserve
> > xprt_lock_and_alloc_slot
> > xprt_lock_write
> > xprt_reserve_xprt
> > rpc_sleep_on_priority
> > __rpc_sleep_on_priority
> > __rpc_add_wait_queue
> > __rpc_add_wait_queue_priority
> > (Now on the sending wait queue)
> > xs_tcp_release_xprt
> > xprt_release_xprt
> > xprt_clear_locked
> > __xprt_lock_write_next
> > rpc_wake_up_first
> > __rpc_find_next_queued
> > __rpc_find_next_queued_priority
> > ...
> > (has now pulled thread 2 off the wait queue)
> > out_of_line_wait_on_bit
> > (receive SIGKILL)
> > rpc_wait_bit_killable
> > rpc_exit
> > rpc_exit_task
> > rpc_release_task
> > (doesn't release xprt b/c he isn't listed in snd_task yet)
> > (returns from __rpc_execute)
> > __xprt_lock_write_func
> > (thread 2 now has the transport locked)
> > rpc_wake_up_task_queue_locked
> > __rpc_remove_wait_queue
> > __rpc_remove_wait_queue_priority
> > (continues on, potentially exiting early,
> > potentially blocking the next time it needs the transport)
> >
> > With the patch we're discussing, it would fix the case where thread 2 is
> > breaking out of the FSM loop after having been given the transport lock. But
> > what about the above? Is there something else synchronizing things?
>
> I'm not sure I understand how the above would work. rpc_exit_task() will
> remove the rpc_task from the xprt->sending rpc_wait_queue, at which
> point __xprt_lock_write_func() can no longer find it.
I'm imagining that thread 1 pulls thread 2 off the sending wait queue
almost right after it was added, but before thread 2 has a chance to
execute rpc_exit_task. Now that thread 1 has a reference to
thread 2, it proceeds to give it the lock on the transport, but before
it does so, thread 2 completely finishes executing and returns from
__rpc_execute.
I'm probably missing something.
> Is it really the same hang? Does 'echo 0 >/proc/sys/sunrpc/rpc_debug'
> show you a similar list of rpc tasks waiting on xprt->sending?
I'm not certain. It's possible that it's not. I'm recompiling with your
v4 of the patch right now and will test shortly. I didn't know about
echoing 0 out to that file, good to know! :)
On Mon, Jan 07, 2013 at 08:35:31PM +0000, Myklebust, Trond wrote:
> On Mon, 2013-01-07 at 15:29 -0500, Trond Myklebust wrote:
> > On Mon, 2013-01-07 at 15:20 -0500, Chris Perl wrote:
> > > On Mon, Jan 07, 2013 at 07:50:10PM +0000, Myklebust, Trond wrote:
> > > > Hi Chris,
> > > >
> > > > Excellent sleuthing! Given the thoroughness of your explanation, I'm
> > > > pretty sure that the attached patch should fix the problem.
> > > >
> > > > Cheers
> > > > Trond
> > > > --
> > > > Trond Myklebust
> > > > Linux NFS client maintainer
> > > >
> > > > NetApp
> > > > Trond.Myklebust@netapp.com
> > > > http://www.netapp.com
> > >
> > > > From ec8cbb4aff21cd0eac2c6f3fc4273ac72cdd91ef Mon Sep 17 00:00:00 2001
> > > > From: Trond Myklebust <Trond.Myklebust@netapp.com>
> > > > Date: Mon, 7 Jan 2013 14:30:46 -0500
> > > > Subject: [PATCH] SUNRPC: Ensure we release the socket write lock if the
> > > > rpc_task exits early
> > > >
> > > > If the rpc_task exits while holding the socket write lock before it has
> > > > allocated an rpc slot, then the usual mechanism for releasing the write
> > > > lock in xprt_release() is defeated.
> > > >
> > > > The problem occurs if the call to xprt_lock_write() initially fails, so
> > > > that the rpc_task is put on the xprt->sending wait queue. If the task
> > > > exits after being assigned the lock by __xprt_lock_write_func, but
> > > > before it has retried the call to xprt_lock_and_alloc_slot(), then
> > > > it calls xprt_release() while holding the write lock, but will
> > > > immediately exit due to the test for task->tk_rqstp != NULL.
> > > >
> > > > Reported-by: Chris Perl <chris.perl@gmail.com>
> > > > Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
> > > > Cc: stable@vger.kernel.org [>= 3.1]
> > > > ---
> > > > net/sunrpc/xprt.c | 6 ++++--
> > > > 1 file changed, 4 insertions(+), 2 deletions(-)
> > > >
> > > > diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> > > > index bd462a5..6676457 100644
> > > > --- a/net/sunrpc/xprt.c
> > > > +++ b/net/sunrpc/xprt.c
> > > > @@ -1136,10 +1136,12 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
> > > > void xprt_release(struct rpc_task *task)
> > > > {
> > > > struct rpc_xprt *xprt;
> > > > - struct rpc_rqst *req;
> > > > + struct rpc_rqst *req = task->tk_rqstp;
> > > >
> > > > - if (!(req = task->tk_rqstp))
> > > > + if (req == NULL) {
> > > > + xprt_release_write(task->tk_xprt, task);
> > > > return;
> > > > + }
> > > >
> > > > xprt = req->rq_xprt;
> > > > if (task->tk_ops->rpc_count_stats != NULL)
> > > > --
> > > > 1.7.11.7
> > > >
> > >
> > > Ah, I totally missed the call to `rpc_release_task' at the bottom of the
> > > `__rpc_execute' loop (at least that's how I think we'd get to this function
> > > you're patching).
> > >
> > > But wouldn't we need to update the call site in
> > > `rpc_release_resources_task' as well? It contains an explicit check for
> > > `task->tk_rqstp' being non null.
> >
> > Ewww.... You're right: That's a wart that seems to have been copied and
> > pasted quite a few times.
> >
> > Here is v2...
>
> ...and a v3 that adds a small optimisation to avoid taking the transport
> lock in cases where we really don't need it.
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> Trond.Myklebust@netapp.com
> http://www.netapp.com
> From 51b63a538c54cb9c3b83c4d62572cf18da165cba Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <Trond.Myklebust@netapp.com>
> Date: Mon, 7 Jan 2013 14:30:46 -0500
> Subject: [PATCH v3] SUNRPC: Ensure we release the socket write lock if the
> rpc_task exits early
>
> If the rpc_task exits while holding the socket write lock before it has
> allocated an rpc slot, then the usual mechanism for releasing the write
> lock in xprt_release() is defeated.
>
> The problem occurs if the call to xprt_lock_write() initially fails, so
> that the rpc_task is put on the xprt->sending wait queue. If the task
> exits after being assigned the lock by __xprt_lock_write_func, but
> before it has retried the call to xprt_lock_and_alloc_slot(), then
> it calls xprt_release() while holding the write lock, but will
> immediately exit due to the test for task->tk_rqstp != NULL.
>
> Reported-by: Chris Perl <chris.perl@gmail.com>
> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
> Cc: stable@vger.kernel.org [>= 3.1]
> ---
> net/sunrpc/sched.c | 3 +--
> net/sunrpc/xprt.c | 8 ++++++--
> 2 files changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index b4133bd..bfa3171 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -972,8 +972,7 @@ static void rpc_async_release(struct work_struct *work)
>
> static void rpc_release_resources_task(struct rpc_task *task)
> {
> - if (task->tk_rqstp)
> - xprt_release(task);
> + xprt_release(task);
> if (task->tk_msg.rpc_cred) {
> put_rpccred(task->tk_msg.rpc_cred);
> task->tk_msg.rpc_cred = NULL;
> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> index bd462a5..6acc0c5 100644
> --- a/net/sunrpc/xprt.c
> +++ b/net/sunrpc/xprt.c
> @@ -1136,10 +1136,14 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
> void xprt_release(struct rpc_task *task)
> {
> struct rpc_xprt *xprt;
> - struct rpc_rqst *req;
> + struct rpc_rqst *req = task->tk_rqstp;
>
> - if (!(req = task->tk_rqstp))
> + if (req == NULL) {
> + xprt = task->tk_xprt;
> + if (xprt->snd_task == task)
> + xprt_release_write(xprt, task);
> return;
> + }
>
> xprt = req->rq_xprt;
> if (task->tk_ops->rpc_count_stats != NULL)
> --
> 1.7.11.7
>
Thanks, I will give this a shot tomorrow and let you know how it goes.
On Tue, 2013-01-08 at 16:23 -0500, Chris Perl wrote:
> On Tue, Jan 08, 2013 at 09:13:45PM +0000, Myklebust, Trond wrote:
> > On Tue, 2013-01-08 at 16:01 -0500, Chris Perl wrote:
> > > > My main interest is always the upstream (Linus) kernel, however the RPC
> > > > client in the CentOS 6.3 kernel does actually contain a lot of code that
> > > > was recently backported from upstream. As such, it is definitely of
> > > > interest to figure out corner case bugs so that we can compare to
> > > > upstream...
> > >
> > > Ok, great. I will try this version of the patch as well. However, when just
> > > thinking about this, I'm concerned that the race still exists, but is just less
> > > likely to manifest.
> > >
> > > I imagine something like this happening. I assume there is some reason this
> > > can't happen that I'm not seeing? These are functions from Linus's current
> > > git, not the CentOS 6.3 code:
> > >
> > > thread 1 thread 2
> > > -------- --------
> > > __rpc_execute __rpc_execute
> > > ... ...
> > > call_reserve
> > > xprt_reserve
> > > xprt_lock_and_alloc_slot
> > > xprt_lock_write
> > > xprt_reserve_xprt
> > > ...
> > > xprt_release_write
> > > call_reserve
> > > xprt_reserve
> > > xprt_lock_and_alloc_slot
> > > xprt_lock_write
> > > xprt_reserve_xprt
> > > rpc_sleep_on_priority
> > > __rpc_sleep_on_priority
> > > __rpc_add_wait_queue
> > > __rpc_add_wait_queue_priority
> > > (Now on the sending wait queue)
> > > xs_tcp_release_xprt
> > > xprt_release_xprt
> > > xprt_clear_locked
> > > __xprt_lock_write_next
> > > rpc_wake_up_first
> > > __rpc_find_next_queued
> > > __rpc_find_next_queued_priority
> > > ...
> > > (has now pulled thread 2 off the wait queue)
> > > out_of_line_wait_on_bit
> > > (receive SIGKILL)
> > > rpc_wait_bit_killable
> > > rpc_exit
> > > rpc_exit_task
> > > rpc_release_task
> > > (doesn't release xprt b/c he isn't listed in snd_task yet)
> > > (returns from __rpc_execute)
> > > __xprt_lock_write_func
> > > (thread 2 now has the transport locked)
> > > rpc_wake_up_task_queue_locked
> > > __rpc_remove_wait_queue
> > > __rpc_remove_wait_queue_priority
> > > (continues on, potentially exiting early,
> > > potentially blocking the next time it needs the transport)
> > >
> > > With the patch we're discussing, it would fix the case where thread 2 is
> > > breaking out of the FSM loop after having been given the transport lock. But
> > > what about the above? Is there something else synchronizing things?
> >
> > I'm not sure I understand how the above would work. rpc_exit_task() will
> > remove the rpc_task from the xprt->sending rpc_wait_queue, at which
> > point __xprt_lock_write_func() can no longer find it.
>
> I'm imagining that thread 1 pulls thread 2 off the sending wait queue
> almost right after it was added, but before thread 2 has a chance to
> execute rpc_exit_task. Now that thread 1 has a reference to
> thread 2, it proceeds to give it the lock on the transport, but before
> it does so, thread 2 completely finishes executing and returns from
> __rpc_execute.
>
> I'm probably missing something.

The lock is associated with the rpc_task. Threads can normally only
access an rpc_task when it is on a wait queue (while holding the wait
queue lock) unless they are given ownership of the rpc_task.

IOW: the scenario you describe should not be possible, since it relies
on thread 1 assigning the lock to the rpc_task after it has been removed
from the wait queue.
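
[Note: a simplified sketch of the handoff being described, paraphrased
from a ~3.8-era tree rather than quoted verbatim. The point is that the
xprt->sending queue's spinlock is held across both the lookup and the
assignment of xprt->snd_task, so a task that rpc_exit_task() has already
dequeued cannot be chosen here.]

struct rpc_task *rpc_wake_up_first(struct rpc_wait_queue *queue,
				   bool (*func)(struct rpc_task *, void *),
				   void *data)
{
	struct rpc_task *task;

	spin_lock_bh(&queue->lock);		/* serializes against dequeue */
	task = __rpc_find_next_queued(queue);	/* only sees still-queued tasks */
	if (task != NULL) {
		if (func(task, data))		/* __xprt_lock_write_func() sets
						 * xprt->snd_task = task here  */
			rpc_wake_up_task_queue_locked(queue, task);
		else
			task = NULL;
	}
	spin_unlock_bh(&queue->lock);
	return task;
}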
> > Is it really the same hang? Does 'echo 0 >/proc/sys/sunrpc/rpc_debug'
> > show you a similar list of rpc tasks waiting on xprt->sending?
>
> I'm not certain. It's possible that it's not. I'm recompiling with your
> v4 of the patch right now and will test shortly. I didn't know about
> echoing 0 out to that file, good to know! :)

That would be great.

If you are recompiling the kernel, perhaps you can also add in a patch
to rpc_show_tasks() to display the current value of
clnt->cl_xprt->snd_task?
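
[Note: purely as a hypothetical illustration of the kind of debug output
being requested (the exact shape of rpc_show_tasks() varies by kernel
version, so this is a sketch rather than a ready-to-apply patch), a
helper along these lines could be called for each client that
rpc_show_tasks() walks:]

static void rpc_show_snd_task(struct rpc_clnt *clnt)
{
	struct rpc_xprt *xprt;

	rcu_read_lock();
	xprt = rcu_dereference(clnt->cl_xprt);
	printk(KERN_INFO "clnt %p: xprt %p snd_task %p\n",
	       clnt, xprt, xprt->snd_task);
	rcu_read_unlock();
}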
--
Trond Myklebust
Linux NFS client maintainer
NetApp
Trond.Myklebust@netapp.com
http://www.netapp.com