2013-01-07 18:59:11

by Chris Perl

Subject: Possible Race Condition on SIGKILL

(Apologies up front for the long email, just trying to be thorough)

I have come across a hang in the Linux NFS client that is reproducible. I've
been testing with a Fedora 17 system (3.6.10-2.fc17.x86_64), but I've also seen
the same issues on CentOS 6.3 systems (2.6.32-279.5.2.el6.centos.plus.x86_64).
I have not had a chance to test the latest 3.8 rc kernels yet, but quickly
browsing their code, I didn't see major changes in the key areas to make me
think it would make a difference (but I could certainly be wrong).

Unfortunately, the problem is not easily reproducible. So far, the only way
that I've been able to trigger it is with the attached OCaml program [1].

Even more unfortunate, for you to compile and use that program, you'll need to
have a functioning OCaml environment with Jane Street's Core and Async
libraries available. I can give details about the easiest way to set that up
if you want, but will hold off on that for now. Please let me know if you want
these details.

The idea of the program is that it spawns many threads that start stat'ing and
reading files over NFS (the paths to each file are command line args) and then
abruptly exits.
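
For illustration, here is a rough C sketch of the same behaviour, for readers
without an OCaml toolchain; the attached hang.ml is the actual reproducer, and
the thread count and exit timing below are guesses. Run it under the restart
loop shown just below.

#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#define NTHREADS 32          /* guess; hang.ml's thread count may differ */

static char **paths;
static int npaths;

/* Each worker loops forever, stat'ing and reading every file it was given. */
static void *worker(void *arg)
{
    char buf[4096];
    struct stat st;
    int i, fd;

    (void)arg;
    for (;;) {
        for (i = 0; i < npaths; i++) {
            if (stat(paths[i], &st) < 0)
                continue;
            fd = open(paths[i], O_RDONLY);
            if (fd < 0)
                continue;
            while (read(fd, buf, sizeof(buf)) > 0)
                ;
            close(fd);
        }
    }
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t tid;
    int i;

    paths = &argv[1];
    npaths = argc - 1;

    for (i = 0; i < NTHREADS; i++)
        pthread_create(&tid, NULL, worker, NULL);

    /* Let the workers get RPCs in flight, then bail out abruptly. */
    usleep(20000);
    _exit(0);
}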

Originally, I came across a system that hung and forced a crash dump of the
kernel that I then inspected with crash [2]. With crash I could see that the
rpc_task referenced by the snd_task member in the relevant rpc_xprt didn't
exist.

Then, after talking with some coworkers (and inspecting several more hung
systems), I eventually came up with a way to reproduce the problem: take the
compiled OCaml program (which purposely exits very quickly after starting) and
wrap it in a loop that immediately restarts it when it exits:

$ while :; do ./hang.opt /mnt/nfs/file{01..10} >/dev/null 2>&1; done

In this case, file01 through file10 are 10k text files filled entirely with the
character "A". /mnt/nfs is the mount point, mounted with the following options:

$ nfsstat -m
/mnt/nfs from 127.0.0.1:/var/tmp
Flags: rw,relatime,vers=3,rsize=524288,wsize=524288,namelen=255,acregmin=0,acregmax=0,acdirmin=0,acdirmax=0,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=127.0.0.1,mountvers=3,mountport=20048,mountproto=tcp,local_lock=none,addr=127.0.0.1

And the hang should happen within a few minutes (although sometimes it did
take quite a while for it to hang).

I believe the problem is a race condition where one thread that currently has
the `rpc_xprt' for a given NFS mount locked tries to give it to another thread
on the `sending' wait queue for that `rpc_xprt' (via __xprt_lock_write_next),
but the thread that is presumably sleeping on that wait queue never made it to
sleep because it received a SIGKILL before it could do so.

Here is the interaction that I believe is causing the problem:

thread 1                                     thread 2
--------                                     --------
__rpc_execute
  ...
  call_reserve
    xprt_reserve
      xprt_lock_and_alloc_slot
        xprt_lock_write
          xprt_reserve_xprt (locks rpc_xprt)
        xprt_alloc_slot
                                             __rpc_execute
                                               ...
                                               call_reserve
                                                 xprt_reserve
                                                   xprt_lock_and_alloc_slot
                                                     xprt_lock_write
                                                       xprt_reserve_xprt
                                                         rpc_sleep_on ("sending" wait queue)
                                               out_of_line_wait_on_bit
  xprt_release_write
    xprt_release_xprt
      xprt_clear_locked
      __xprt_lock_write_next
                                             (receives SIGKILL from a 3rd thread)
                                               rpc_wait_bit_killable
                                                 (will return ERESTARTSYS b/c of SIGNAL)
        rpc_wake_up_first (wq:sending,
                           f:__xprt_lock_write_func)
          __xprt_lock_write_func (w/ task from sending)
                                             rpc_exit
                                               rpc_exit_task
  ...                                        ...

At this point the first thread has updated rpc_xprt->snd_task to point to the
second thread, and left the rpc_xprt locked, but the second thread is exiting
because of the SIGKILL. The real issue is that the second thread was added to the
"sending" wait queue (by rpc_sleep_on), but never actually made it to sleep.

The net effect of this is that the transport remains locked by a phantom
rpc_task. Any other requests that come in now just go to sleep on the
rpc_xprt's sending wait queue, and no progress is made. Any mount point that
winds up using this rpc_xprt is now hung.
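
To make the sequence concrete, here is a minimal, single-threaded C model of
the state involved. The names mirror the kernel's (XPRT_LOCKED, snd_task, the
"sending" queue), but this is only an illustration of the resulting state, not
kernel code; the two threads actually race, whereas the model runs the steps in
a fixed order.

#include <stdbool.h>
#include <stdio.h>

struct task {
    const char *name;
    bool on_sending_queue;   /* queued by rpc_sleep_on("sending")          */
    bool exited;             /* left __rpc_execute via rpc_exit_task       */
};

static struct {
    bool locked;             /* models the XPRT_LOCKED bit in xprt->state  */
    struct task *snd_task;   /* models xprt->snd_task                      */
} xprt;

/* Thread 2: xprt_reserve_xprt fails, so it queues itself on "sending"...   */
static void thread2_queue(struct task *t2)
{
    t2->on_sending_queue = true;
}

/* ...but a SIGKILL arrives before it ever sleeps: rpc_wait_bit_killable
 * returns ERESTARTSYS and the task exits via rpc_exit()/rpc_exit_task(),
 * never looking at the transport again (it is still on the queue).         */
static void thread2_sigkill(struct task *t2)
{
    t2->exited = true;
}

/* Thread 1: while releasing the transport, __xprt_lock_write_next finds
 * thread 2 on "sending" and hands it ownership by repointing snd_task;
 * XPRT_LOCKED stays set on the assumption that the new owner will release. */
static void thread1_release(struct task *t2)
{
    if (t2->on_sending_queue)
        xprt.snd_task = t2;      /* handoff: the lock is never cleared      */
    else
        xprt.locked = false;
}

int main(void)
{
    struct task t1 = { "thread 1", false, false };
    struct task t2 = { "thread 2", false, false };

    xprt.locked = true;          /* thread 1 owns the transport             */
    xprt.snd_task = &t1;

    /* In reality these two sides race; the end state is the same.          */
    thread2_queue(&t2);
    thread2_sigkill(&t2);        /* thread 2 is gone, but still queued      */
    thread1_release(&t2);

    printf("locked=%d, snd_task=%s (exited=%d): transport is wedged\n",
           xprt.locked, xprt.snd_task->name, xprt.snd_task->exited);
    return 0;
}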

To help me understand what was happening, I used the attached systemtap script
[3] to watch the system and upon each return from __rpc_execute, check to see
if our task is the task currently listed in the associated rpc_xprt's snd_task
member, as I believe this should never be the case. Running that script
produced output like the following when it hung:


[cperl@cesium ~/tmp/2013-01-05]$ sudo stap -DMAXMAPENTRIES=4096 -g ./catch_locked.stp --vp 00101
Pass 3: translated to C into "/tmp/stapMZvjDF/stap_23115b4269a0722946b88ddf556678e4_18614_src.c" using 354096virt/32668res/6728shr/25096data kb, in 90usr/10sys/101real ms.
Pass 5: starting run.
1357430209194595979: 30246: thread-pool thr -> __rpc_execute: task=0xffff88011f39ec00: GETATTR
1357430209194599327: 30246: thread-pool thr tk_callback: 0x0, tk_action: call_start
1357430209194602139: 30246: thread-pool thr tk_callback: 0x0, tk_action: call_reserve
1357430209195083066: 30246: thread-pool thr status: -512
1357430209195085888: 30246: thread-pool thr -> rpc_exit:
1357430209195088144: 30246: thread-pool thr tk_callback: 0x0, tk_action: rpc_exit_task
1357430209195090865: 30246: thread-pool thr tk_callback: 0x0, tk_action: 0x0
1357430209195093517: 30246: thread-pool thr <- __rpc_execute: tk_status: 0x0
^CPass 5: run completed in 10usr/30sys/33870real ms.

After this happened, I used crash on the live system and indeed, this task was
the one that was listed in rpc_xprt->snd_task for the transport. The rest of
the circumstances of the hang I inferred by looking at the code, so I
could definitely be wrong.

The systemtap script places probe points at specific lines within the
code (to be able to look at tk_action and tk_callback), so it would need
to be tweaked if run on a different kernel than the Fedora 17 kernel I'm
running it against.

Of course it's entirely possible that I've missed some lock or something that
makes what I've outlined above safe and fine, but in that case, I still have an
NFS hang that I need to figure out.

[1] hang.ml
[2] http://people.redhat.com/anderson/crash_whitepaper/
[3] catch_locked.stp


Attachments:
hang.ml (715.00 B)
catch_locked.stp (1.92 kB)

2013-01-08 22:19:27

by Chris Perl

Subject: Re: Possible Race Condition on SIGKILL

On Tue, Jan 08, 2013 at 05:16:51PM -0500, Chris Perl wrote:
> > The lock is associated with the rpc_task. Threads can normally only
> > access an rpc_task when it is on a wait queue (while holding the wait
> > queue lock) unless they are given ownership of the rpc_task.
> >
> > IOW: the scenario you describe should not be possible, since it relies
> > on thread 1 assigning the lock to the rpc_task after it has been removed
> > from the wait queue.
>
> Hrm. I guess I'm in over my head here. Apologies if I'm just asking
> silly bumbling questions. You can start ignoring me at any time. :)
>
> I was talking about setting (or leaving set) the XPRT_LOCKED bit in
> rpc_xprt->state. By "assigning the lock" I really just mean that thread
> 1 leaves XPRT_LOCKED set in rpc_xprt->state and sets rpc_xprt->snd_task
> to thread 2.
>
> > If you are recompiling the kernel, perhaps you can also add in a patch
> > to rpc_show_tasks() to display the current value of
> > clnt->cl_xprt->snd_task?
>
> Sure. This is what 'echo 0 > /proc/sys/sunrpc/rpc_debug' shows after
> the hang (with my extra prints):
>
> # cat /proc/kmsg
> ...
> <6>client: ffff88082b6c9c00, xprt: ffff880824aef800, snd_task: ffff881029c63ec0
> <6>client: ffff88082b6c9e00, xprt: ffff880824aef800, snd_task: ffff881029c63ec0
> <6>-pid- flgs status -client- --rqstp- -timeout ---ops--
> <6>18091 0080 -11 ffff88082b6c9e00 (null) ffff0770ns3 ACCESS a:call_reserveresult q:xprt_sending
> <6>client: ffff88082a244600, xprt: ffff88082a343000, snd_task: (null)
> <6>client: ffff880829181600, xprt: ffff88082a343000, snd_task: (null)
> <6>client: ffff880828170200, xprt: ffff880824aef800, snd_task: ffff881029c63ec0

Sorry, that output was a little messed up. Here it is again:

<6>client: ffff88082b6c9c00, xprt: ffff880824aef800, snd_task: ffff881029c63ec0
<6>client: ffff88082b6c9e00, xprt: ffff880824aef800, snd_task: ffff881029c63ec0
<6>-pid- flgs status -client- --rqstp- -timeout ---ops--
<6>18091 0080 -11 ffff88082b6c9e00 (null) 0 ffffffffa027b7e0 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
<6>client: ffff88082a244600, xprt: ffff88082a343000, snd_task: (null)
<6>client: ffff880829181600, xprt: ffff88082a343000, snd_task: (null)
<6>client: ffff880828170200, xprt: ffff880824aef800, snd_task: ffff881029c63ec0
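
(The "extra prints" referred to here were not posted to the thread; a
hypothetical sketch of the kind of line that would produce the
"client: ..., xprt: ..., snd_task: ..." output above, dropped into
rpc_show_tasks()'s walk over the client list, might look like this. The
actual diff may differ.)

/* Hypothetical sketch -- not the actual debug patch.  Inside
 * rpc_show_tasks(), for each rpc_clnt (clnt) being dumped:          */
printk(KERN_INFO "client: %p, xprt: %p, snd_task: %p\n",
       clnt, clnt->cl_xprt, clnt->cl_xprt->snd_task);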

2013-01-07 20:29:25

by Myklebust, Trond

Subject: Re: Possible Race Condition on SIGKILL

On Mon, 2013-01-07 at 15:20 -0500, Chris Perl wrote:
> Ah, I totally missed the call to `rpc_release_task' at the bottom of the
> `__rpc_execute' loop (at least that's how I think we'd get to this function
> you're patching).
>
> But wouldn't we need to update the call site in
> `rpc_release_resources_task' as well? It contains an explicit check for
> `task->tk_rqstp' being non null.

Ewww.... You're right: That's a wart that seems to have been copied and
pasted quite a few times.

Here is v2...
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


Attachments:
0001-SUNRPC-Ensure-we-release-the-socket-write-lock-if-th.patch (2.11 kB)

2013-01-09 17:55:09

by Chris Perl

Subject: Re: Possible Race Condition on SIGKILL

> Hrm. I guess I'm in over my head here. Apologies if I'm just asking
> silly bumbling questions. You can start ignoring me at any time. :)

I stared at the code for a while longer and now see why what I
outlined is not possible. Thanks for helping to clarify!

I decided to pull your git repo and compile with HEAD at
87ed50036b866db2ec2ba16b2a7aec4a2b0b7c39 (linux-next as of this
morning). Using this kernel, I can no longer induce any hangs.

Interestingly, I tried recompiling the CentOS 6.3 kernel with
both the original patch (v4) and the last patch you sent about fixing
priority queues. With both of those in place, I still run into a
problem.

echo 0 > /proc/sys/sunrpc/rpc_debug after the hang shows (I left in the
previous additional prints and added printing of the task pointer itself):

<6>client: ffff88082896c200, xprt: ffff880829011000, snd_task: ffff880829a1aac0
<6>client: ffff8808282b5600, xprt: ffff880829011000, snd_task: ffff880829a1aac0
<6>--task-- -pid- flgs status -client- --rqstp- -timeout ---ops--
<6>ffff88082a463180 22007 0080 -11 ffff8808282b5600 (null) 0 ffffffffa027b7a0 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
<6>client: ffff88082838cc00, xprt: ffff88082b7c5800, snd_task: (null)
<6>client: ffff8808283db400, xprt: ffff88082b7c5800, snd_task: (null)
<6>client: ffff8808283db200, xprt: ffff880829011000, snd_task: ffff880829a1aac0

Any thoughts about other patches that might affect this?

2013-01-09 20:53:02

by Myklebust, Trond

Subject: Re: Possible Race Condition on SIGKILL

On Wed, 2013-01-09 at 12:55 -0500, Chris Perl wrote:
> Interestingly, I tried recompiling the CentOS 6.3 kernel with
> both the original patch (v4) and the last patch you sent about fixing
> priority queues. With both of those in place, I still run into a
> problem.
>
> Any thoughts about other patches that might affect this?

Hmm... The only one that springs to mind is this one (see attachment)
and then the 'connect' fixes that you helped us with previously.

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


Attachments:
0001-SUNRPC-Ensure-that-we-free-the-rpc_task-after-cleanu.patch (2.37 kB)

2013-01-07 19:50:13

by Myklebust, Trond

Subject: Re: Possible Race Condition on SIGKILL

On Mon, 2013-01-07 at 13:58 -0500, Chris Perl wrote:
> (Apologies up front for the long email, just trying to be thorough)
>
> I have come across a hang in the Linux NFS client that is reproducible. I've
> been testing with a Fedora 17 system (3.6.10-2.fc17.x86_64), but I've also seen
> the same issues on CentOS 6.3 systems (2.6.32-279.5.2.el6.centos.plus.x86_64).
> I have not had a chance to test the latest 3.8 rc kernels yet, but quickly
> browsing their code, I didn't see major changes in the key areas to make me
> think it would make a difference (but I could certainly be wrong).

Hi Chris,

Excellent sleuthing! Given the thoroughness of your explanation, I'm
pretty sure that the attached patch should fix the problem.

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


Attachments:
0001-SUNRPC-Ensure-we-release-the-socket-write-lock-if-th.patch (1.63 kB)

2013-01-08 23:11:11

by Myklebust, Trond

Subject: Re: Possible Race Condition on SIGKILL

On Tue, 2013-01-08 at 17:19 -0500, Chris Perl wrote:
> Sorry, that output was a little messed up. Here it is again:
>
> <6>client: ffff88082b6c9c00, xprt: ffff880824aef800, snd_task: ffff881029c63ec0
> <6>client: ffff88082b6c9e00, xprt: ffff880824aef800, snd_task: ffff881029c63ec0
> <6>-pid- flgs status -client- --rqstp- -timeout ---ops--
> <6>18091 0080 -11 ffff88082b6c9e00 (null) 0 ffffffffa027b7e0 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
> <6>client: ffff88082a244600, xprt: ffff88082a343000, snd_task: (null)
> <6>client: ffff880829181600, xprt: ffff88082a343000, snd_task: (null)
> <6>client: ffff880828170200, xprt: ffff880824aef800, snd_task: ffff881029c63ec0

Hi Chris,

It looks as if the problem here is that the rpc_task in question is not
being woken up. I'm aware of at least one problem with priority queues
in RHEL-6.3/CentOS-6.3, and that has been fixed in the upstream kernel.

See attachment.

Cheers
Trond

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


Attachments:
0001-SUNRPC-Don-t-allow-low-priority-tasks-to-pre-empt-hi.patch (3.92 kB)

2013-01-08 22:24:17

by Chris Perl

Subject: Re: Possible Race Condition on SIGKILL

> The lock is associated with the rpc_task. Threads can normally only
> access an rpc_task when it is on a wait queue (while holding the wait
> queue lock) unless they are given ownership of the rpc_task.
>
> IOW: the scenario you describe should not be possible, since it relies
> on thread 1 assigning the lock to the rpc_task after it has been removed
> from the wait queue.

Hrm. I guess I'm in over my head here. Apologies if I'm just asking
silly bumbling questions. You can start ignoring me at any time. :)

I was talking about setting (or leaving set) the XPRT_LOCKED bit in
rpc_xprt->state. By "assigning the lock" I really just mean that thread
1 leaves XPRT_LOCKED set in rpc_xprt->state and sets rpc_xprt->snd_task
to thread 2.

> If you are recompiling the kernel, perhaps you can also add in a patch
> to rpc_show_tasks() to display the current value of
> clnt->cl_xprt->snd_task?

Sure. This is what 'echo 0 > /proc/sys/sunrpc/rpc_debug' shows after
the hang (with my extra prints):

# cat /proc/kmsg
...
<6>client: ffff88082b6c9c00, xprt: ffff880824aef800, snd_task: ffff881029c63ec0
<6>client: ffff88082b6c9e00, xprt: ffff880824aef800, snd_task: ffff881029c63ec0
<6>-pid- flgs status -client- --rqstp- -timeout ---ops--
<6>18091 0080 -11 ffff88082b6c9e00 (null) ffff0770ns3 ACCESS a:call_reserveresult q:xprt_sending
<6>client: ffff88082a244600, xprt: ffff88082a343000, snd_task: (null)
<6>client: ffff880829181600, xprt: ffff88082a343000, snd_task: (null)
<6>client: ffff880828170200, xprt: ffff880824aef800, snd_task: ffff881029c63ec0

2013-01-07 20:20:29

by Chris Perl

Subject: Re: Possible Race Condition on SIGKILL

On Mon, Jan 07, 2013 at 07:50:10PM +0000, Myklebust, Trond wrote:
> Hi Chris,
>
> Excellent sleuthing! Given the thoroughness of your explanation, I'm
> pretty sure that the attached patch should fix the problem.
>
> Cheers
> Trond
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com

> From ec8cbb4aff21cd0eac2c6f3fc4273ac72cdd91ef Mon Sep 17 00:00:00 2001
> From: Trond Myklebust <[email protected]>
> Date: Mon, 7 Jan 2013 14:30:46 -0500
> Subject: [PATCH] SUNRPC: Ensure we release the socket write lock if the
> rpc_task exits early
>
> If the rpc_task exits while holding the socket write lock before it has
> allocated an rpc slot, then the usual mechanism for releasing the write
> lock in xprt_release() is defeated.
>
> The problem occurs if the call to xprt_lock_write() initially fails, so
> that the rpc_task is put on the xprt->sending wait queue. If the task
> exits after being assigned the lock by __xprt_lock_write_func, but
> before it has retried the call to xprt_lock_and_alloc_slot(), then
> it calls xprt_release() while holding the write lock, but will
> immediately exit due to the test for task->tk_rqstp != NULL.
>
> Reported-by: Chris Perl <[email protected]>
> Signed-off-by: Trond Myklebust <[email protected]>
> Cc: [email protected] [>= 3.1]
> ---
> net/sunrpc/xprt.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/net/sunrpc/xprt.c b/net/sunrpc/xprt.c
> index bd462a5..6676457 100644
> --- a/net/sunrpc/xprt.c
> +++ b/net/sunrpc/xprt.c
> @@ -1136,10 +1136,12 @@ static void xprt_request_init(struct rpc_task *task, struct rpc_xprt *xprt)
> void xprt_release(struct rpc_task *task)
> {
> struct rpc_xprt *xprt;
> - struct rpc_rqst *req;
> + struct rpc_rqst *req = task->tk_rqstp;
>
> - if (!(req = task->tk_rqstp))
> + if (req == NULL) {
> + xprt_release_write(task->tk_xprt, task);
> return;
> + }
>
> xprt = req->rq_xprt;
> if (task->tk_ops->rpc_count_stats != NULL)
> --
> 1.7.11.7
>

Ah, I totally missed the call to `rpc_release_task' at the bottom of the
`__rpc_execute' loop (at least that's how I think we'd get to this function
you're patching).

But wouldn't we need to update the call site in
`rpc_release_resources_task' as well? It contains an explicit check for
`task->tk_rqstp' being non null.
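
For reference, the call site in question looks roughly like this (a paraphrase,
not an exact copy of the source of that era):

/* Rough paraphrase of the call site in net/sunrpc/sched.c discussed
 * above; details elided.                                             */
static void rpc_release_resources_task(struct rpc_task *task)
{
    if (task->tk_rqstp)
        xprt_release(task);
    /* ... the task's cred and rpc_clnt reference are dropped here ... */
}

Only when task->tk_rqstp is non-NULL does it reach xprt_release(), so a task
exiting while holding just the write lock would still skip the new unlock path
on this route; that is the point the "v2" of the patch, earlier in this thread,
addresses.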