2013-01-10 21:31:12

by Myklebust, Trond

[permalink] [raw]
Subject: Re: Possible Race Condition on SIGKILL

On Wed, 2013-01-09 at 15:52 -0500, Trond Myklebust wrote:
+AD4- On Wed, 2013-01-09 at 12:55 -0500, Chris Perl wrote:
+AD4- +AD4- +AD4- Hrm. I guess I'm in over my head here. Apologoies if I'm just asking
+AD4- +AD4- +AD4- silly bumbling questions. You can start ignoring me at any time. :)
+AD4- +AD4-
+AD4- +AD4- I stared at the code for a while and more and now see why what I
+AD4- +AD4- outlined is not possible. Thanks for helping to clarify+ACE-
+AD4- +AD4-
+AD4- +AD4- I decided to pull your git repo and compile with HEAD at
+AD4- +AD4- 87ed50036b866db2ec2ba16b2a7aec4a2b0b7c39 (linux-next as of this
+AD4- +AD4- morning). Using this kernel, I can no longer induce any hangs.
+AD4- +AD4-
+AD4- +AD4- Interestingly, I tried recompiling the CentOS 6.3 kernel with
+AD4- +AD4- both the original patch (v4) and the last patch you sent about fixing
+AD4- +AD4- priority queues. With both of those in place, I still run into a
+AD4- +AD4- problem.
+AD4- +AD4-
+AD4- +AD4- echo 0 +AD4- /proc/sys/sunrpc/rpc+AF8-debug after the hang shows (I left in the
+AD4- +AD4- previous additional prints and added printing of the tasks pointer
+AD4- +AD4- itself):
+AD4- +AD4-
+AD4- +AD4- +ADw-6+AD4-client: ffff88082896c200, xprt: ffff880829011000, snd+AF8-task: ffff880829a1aac0
+AD4- +AD4- +ADw-6+AD4-client: ffff8808282b5600, xprt: ffff880829011000, snd+AF8-task: ffff880829a1aac0
+AD4- +AD4- +ADw-6+AD4---task-- -pid- flgs status -client- --rqstp- -timeout ---ops--
+AD4- +AD4- +ADw-6+AD4-ffff88082a463180 22007 0080 -11 ffff8808282b5600 (null) 0 ffffffffa027b7a0 nfsv3 ACCESS a:call+AF8-reserveresult q:xprt+AF8-sending
+AD4- +AD4- +ADw-6+AD4-client: ffff88082838cc00, xprt: ffff88082b7c5800, snd+AF8-task: (null)
+AD4- +AD4- +ADw-6+AD4-client: ffff8808283db400, xprt: ffff88082b7c5800, snd+AF8-task: (null)
+AD4- +AD4- +ADw-6+AD4-client: ffff8808283db200, xprt: ffff880829011000, snd+AF8-task: ffff880829a1aac0
+AD4- +AD4-
+AD4- +AD4- Any thoughts about other patches that might affect this?
+AD4-
+AD4- Hmm... The only one that springs to mind is this one (see attachment)
+AD4- and then the 'connect' fixes that you helped us with previously.

Never mind. I suspect that the main reason why RHEL-6.3 is still
vulnerable is that it lacks commit
961a828df64979d2a9faeeeee043391670a193b9 (SUNRPC: Fix potential races in
xprt+AF8-lock+AF8-write+AF8-next()).

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust+AEA-netapp.com
http://www.netapp.com


Attachments:
0001-SUNRPC-Fix-potential-races-in-xprt_lock_write_next.patch (8.23 kB)
0001-SUNRPC-Fix-potential-races-in-xprt_lock_write_next.patch

2013-01-14 15:09:55

by Chris Perl

[permalink] [raw]
Subject: Re: Possible Race Condition on SIGKILL

On Fri, Jan 11, 2013 at 11:19:44AM -0500, Chris Perl wrote:
> On Thu, Jan 10, 2013 at 09:30:58PM +0000, Myklebust, Trond wrote:
> > On Wed, 2013-01-09 at 15:52 -0500, Trond Myklebust wrote:
> > > On Wed, 2013-01-09 at 12:55 -0500, Chris Perl wrote:
> > > > > Hrm. I guess I'm in over my head here. Apologoies if I'm just asking
> > > > > silly bumbling questions. You can start ignoring me at any time. :)
> > > >
> > > > I stared at the code for a while and more and now see why what I
> > > > outlined is not possible. Thanks for helping to clarify!
> > > >
> > > > I decided to pull your git repo and compile with HEAD at
> > > > 87ed50036b866db2ec2ba16b2a7aec4a2b0b7c39 (linux-next as of this
> > > > morning). Using this kernel, I can no longer induce any hangs.
> > > >
> > > > Interestingly, I tried recompiling the CentOS 6.3 kernel with
> > > > both the original patch (v4) and the last patch you sent about fixing
> > > > priority queues. With both of those in place, I still run into a
> > > > problem.
> > > >
> > > > echo 0 > /proc/sys/sunrpc/rpc_debug after the hang shows (I left in the
> > > > previous additional prints and added printing of the tasks pointer
> > > > itself):
> > > >
> > > > <6>client: ffff88082896c200, xprt: ffff880829011000, snd_task: ffff880829a1aac0
> > > > <6>client: ffff8808282b5600, xprt: ffff880829011000, snd_task: ffff880829a1aac0
> > > > <6>--task-- -pid- flgs status -client- --rqstp- -timeout ---ops--
> > > > <6>ffff88082a463180 22007 0080 -11 ffff8808282b5600 (null) 0 ffffffffa027b7a0 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
> > > > <6>client: ffff88082838cc00, xprt: ffff88082b7c5800, snd_task: (null)
> > > > <6>client: ffff8808283db400, xprt: ffff88082b7c5800, snd_task: (null)
> > > > <6>client: ffff8808283db200, xprt: ffff880829011000, snd_task: ffff880829a1aac0
> > > >
> > > > Any thoughts about other patches that might affect this?
> > >
> > > Hmm... The only one that springs to mind is this one (see attachment)
> > > and then the 'connect' fixes that you helped us with previously.
> >
> > Never mind. I suspect that the main reason why RHEL-6.3 is still
> > vulnerable is that it lacks commit
> > 961a828df64979d2a9faeeeee043391670a193b9 (SUNRPC: Fix potential races in
> > xprt_lock_write_next()).
>
> Great, thanks! I've add this on top of the others and am now testing.
> I'll let you know how it goes.

With all 4 patches in place, I am no longer able to hang my CentOS 6.3
system. I have not tested all the various combinations of the 4
patches, but can definitely confirm that without either of:

961a828df64979d2a9faeeeee043391670a193b9 SUNRPC: Fix potential races in xprt_lock_write_next()
87ed50036b866db2ec2ba16b2a7aec4a2b0b7c39 SUNRPC: Ensure we release the socket write lock if the rpc_task exits early

I can hang the system using the test program I sent in my first email.

I'll follow up with Red Hat and ask that they include all 4 patches for
6.4 (some of the earlier ones they may already have).

Thanks so much for all the help!

2013-01-14 19:36:47

by Myklebust, Trond

[permalink] [raw]
Subject: Re: Possible Race Condition on SIGKILL

On Mon, 2013-01-14 at 14:12 -0500, Chris Perl wrote:
+AD4- On Mon, Jan 14, 2013 at 03:25:50PM +-0000, Myklebust, Trond wrote:
+AD4- +AD4- On Mon, 2013-01-14 at 10:09 -0500, Chris Perl wrote:
+AD4- +AD4- +AD4- On Fri, Jan 11, 2013 at 11:19:44AM -0500, Chris Perl wrote:
+AD4- +AD4- +AD4- +AD4- On Thu, Jan 10, 2013 at 09:30:58PM +-0000, Myklebust, Trond wrote:
+AD4- +AD4- +AD4- +AD4- +AD4- On Wed, 2013-01-09 at 15:52 -0500, Trond Myklebust wrote:
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- On Wed, 2013-01-09 at 12:55 -0500, Chris Perl wrote:
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- Hrm. I guess I'm in over my head here. Apologoies if I'm just asking
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- silly bumbling questions. You can start ignoring me at any time. :)
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- I stared at the code for a while and more and now see why what I
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- outlined is not possible. Thanks for helping to clarify+ACE-
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- I decided to pull your git repo and compile with HEAD at
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- 87ed50036b866db2ec2ba16b2a7aec4a2b0b7c39 (linux-next as of this
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- morning). Using this kernel, I can no longer induce any hangs.
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- Interestingly, I tried recompiling the CentOS 6.3 kernel with
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- both the original patch (v4) and the last patch you sent about fixing
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- priority queues. With both of those in place, I still run into a
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- problem.
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- echo 0 +AD4- /proc/sys/sunrpc/rpc+AF8-debug after the hang shows (I left in the
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- previous additional prints and added printing of the tasks pointer
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- itself):
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +ADw-6+AD4-client: ffff88082896c200, xprt: ffff880829011000, snd+AF8-task: ffff880829a1aac0
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +ADw-6+AD4-client: ffff8808282b5600, xprt: ffff880829011000, snd+AF8-task: ffff880829a1aac0
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +ADw-6+AD4---task-- -pid- flgs status -client- --rqstp- -timeout ---ops--
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +ADw-6+AD4-ffff88082a463180 22007 0080 -11 ffff8808282b5600 (null) 0 ffffffffa027b7a0 nfsv3 ACCESS a:call+AF8-reserveresult q:xprt+AF8-sending
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +ADw-6+AD4-client: ffff88082838cc00, xprt: ffff88082b7c5800, snd+AF8-task: (null)
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +ADw-6+AD4-client: ffff8808283db400, xprt: ffff88082b7c5800, snd+AF8-task: (null)
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +ADw-6+AD4-client: ffff8808283db200, xprt: ffff880829011000, snd+AF8-task: ffff880829a1aac0
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- +AD4- Any thoughts about other patches that might affect this?
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- Hmm... The only one that springs to mind is this one (see attachment)
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- and then the 'connect' fixes that you helped us with previously.
+AD4- +AD4- +AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- +AD4- +AD4- Never mind. I suspect that the main reason why RHEL-6.3 is still
+AD4- +AD4- +AD4- +AD4- +AD4- vulnerable is that it lacks commit
+AD4- +AD4- +AD4- +AD4- +AD4- 961a828df64979d2a9faeeeee043391670a193b9 (SUNRPC: Fix potential races in
+AD4- +AD4- +AD4- +AD4- +AD4- xprt+AF8-lock+AF8-write+AF8-next()).
+AD4- +AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- +AD4- Great, thanks+ACE- I've add this on top of the others and am now testing.
+AD4- +AD4- +AD4- +AD4- I'll let you know how it goes.
+AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- With all 4 patches in place, I am no longer able to hang my CentOS 6.3
+AD4- +AD4- +AD4- system. I have not tested all the various combinations of the 4
+AD4- +AD4- +AD4- patches, but can definitely confirm that without either of:
+AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- 961a828df64979d2a9faeeeee043391670a193b9 SUNRPC: Fix potential races in xprt+AF8-lock+AF8-write+AF8-next()
+AD4- +AD4- +AD4- 87ed50036b866db2ec2ba16b2a7aec4a2b0b7c39 SUNRPC: Ensure we release the socket write lock if the rpc+AF8-task exits early
+AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- I can hang the system using the test program I sent in my first email.
+AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- I'll follow up with Red Hat and ask that they include all 4 patches for
+AD4- +AD4- +AD4- 6.4 (some of the earlier ones they may already have).
+AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- Thanks so much for all the help+ACE-
+AD4- +AD4-
+AD4- +AD4- Likewise. Thank you for all the work you did in debugging and testing.
+AD4- +AD4- It is much appreciated.
+AD4-
+AD4- Oh, I had just one other question about the final version of the patch
+AD4- you created. In xprt+AF8-release, is it safe to access xprt-+AD4-snd+AF8-task
+AD4- without the transport+AF8-lock because of the rcu+AF8-lock()/rcu+AF8-unlock() stuff,
+AD4- or because all architectures can atomically read pointer sized values
+AD4- that are aligned in certain ways, or something else?

It is safe to access the xprt structure because of the rcu+AF8-lock().

As for xprt-+AD4-snd+AF8-task: no other processes are allowed to change the
value of xprt-+AD4-snd+AF8-task once it has been assigned to +ACI-our+ACI- rpc+AF8-task, so
if we just want to know the answer to the question +ACI-am I holding the
socket lock+ACI-, then it is safe to read the value of xprt-+AD4-snd+AF8-task
outside the transport lock.

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust+AEA-netapp.com
http://www.netapp.com

2013-01-14 15:25:52

by Myklebust, Trond

[permalink] [raw]
Subject: Re: Possible Race Condition on SIGKILL

On Mon, 2013-01-14 at 10:09 -0500, Chris Perl wrote:
+AD4- On Fri, Jan 11, 2013 at 11:19:44AM -0500, Chris Perl wrote:
+AD4- +AD4- On Thu, Jan 10, 2013 at 09:30:58PM +-0000, Myklebust, Trond wrote:
+AD4- +AD4- +AD4- On Wed, 2013-01-09 at 15:52 -0500, Trond Myklebust wrote:
+AD4- +AD4- +AD4- +AD4- On Wed, 2013-01-09 at 12:55 -0500, Chris Perl wrote:
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- Hrm. I guess I'm in over my head here. Apologoies if I'm just asking
+AD4- +AD4- +AD4- +AD4- +AD4- +AD4- silly bumbling questions. You can start ignoring me at any time. :)
+AD4- +AD4- +AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- +AD4- +AD4- I stared at the code for a while and more and now see why what I
+AD4- +AD4- +AD4- +AD4- +AD4- outlined is not possible. Thanks for helping to clarify+ACE-
+AD4- +AD4- +AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- +AD4- +AD4- I decided to pull your git repo and compile with HEAD at
+AD4- +AD4- +AD4- +AD4- +AD4- 87ed50036b866db2ec2ba16b2a7aec4a2b0b7c39 (linux-next as of this
+AD4- +AD4- +AD4- +AD4- +AD4- morning). Using this kernel, I can no longer induce any hangs.
+AD4- +AD4- +AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- +AD4- +AD4- Interestingly, I tried recompiling the CentOS 6.3 kernel with
+AD4- +AD4- +AD4- +AD4- +AD4- both the original patch (v4) and the last patch you sent about fixing
+AD4- +AD4- +AD4- +AD4- +AD4- priority queues. With both of those in place, I still run into a
+AD4- +AD4- +AD4- +AD4- +AD4- problem.
+AD4- +AD4- +AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- +AD4- +AD4- echo 0 +AD4- /proc/sys/sunrpc/rpc+AF8-debug after the hang shows (I left in the
+AD4- +AD4- +AD4- +AD4- +AD4- previous additional prints and added printing of the tasks pointer
+AD4- +AD4- +AD4- +AD4- +AD4- itself):
+AD4- +AD4- +AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- +AD4- +AD4- +ADw-6+AD4-client: ffff88082896c200, xprt: ffff880829011000, snd+AF8-task: ffff880829a1aac0
+AD4- +AD4- +AD4- +AD4- +AD4- +ADw-6+AD4-client: ffff8808282b5600, xprt: ffff880829011000, snd+AF8-task: ffff880829a1aac0
+AD4- +AD4- +AD4- +AD4- +AD4- +ADw-6+AD4---task-- -pid- flgs status -client- --rqstp- -timeout ---ops--
+AD4- +AD4- +AD4- +AD4- +AD4- +ADw-6+AD4-ffff88082a463180 22007 0080 -11 ffff8808282b5600 (null) 0 ffffffffa027b7a0 nfsv3 ACCESS a:call+AF8-reserveresult q:xprt+AF8-sending
+AD4- +AD4- +AD4- +AD4- +AD4- +ADw-6+AD4-client: ffff88082838cc00, xprt: ffff88082b7c5800, snd+AF8-task: (null)
+AD4- +AD4- +AD4- +AD4- +AD4- +ADw-6+AD4-client: ffff8808283db400, xprt: ffff88082b7c5800, snd+AF8-task: (null)
+AD4- +AD4- +AD4- +AD4- +AD4- +ADw-6+AD4-client: ffff8808283db200, xprt: ffff880829011000, snd+AF8-task: ffff880829a1aac0
+AD4- +AD4- +AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- +AD4- +AD4- Any thoughts about other patches that might affect this?
+AD4- +AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- +AD4- Hmm... The only one that springs to mind is this one (see attachment)
+AD4- +AD4- +AD4- +AD4- and then the 'connect' fixes that you helped us with previously.
+AD4- +AD4- +AD4-
+AD4- +AD4- +AD4- Never mind. I suspect that the main reason why RHEL-6.3 is still
+AD4- +AD4- +AD4- vulnerable is that it lacks commit
+AD4- +AD4- +AD4- 961a828df64979d2a9faeeeee043391670a193b9 (SUNRPC: Fix potential races in
+AD4- +AD4- +AD4- xprt+AF8-lock+AF8-write+AF8-next()).
+AD4- +AD4-
+AD4- +AD4- Great, thanks+ACE- I've add this on top of the others and am now testing.
+AD4- +AD4- I'll let you know how it goes.
+AD4-
+AD4- With all 4 patches in place, I am no longer able to hang my CentOS 6.3
+AD4- system. I have not tested all the various combinations of the 4
+AD4- patches, but can definitely confirm that without either of:
+AD4-
+AD4- 961a828df64979d2a9faeeeee043391670a193b9 SUNRPC: Fix potential races in xprt+AF8-lock+AF8-write+AF8-next()
+AD4- 87ed50036b866db2ec2ba16b2a7aec4a2b0b7c39 SUNRPC: Ensure we release the socket write lock if the rpc+AF8-task exits early
+AD4-
+AD4- I can hang the system using the test program I sent in my first email.
+AD4-
+AD4- I'll follow up with Red Hat and ask that they include all 4 patches for
+AD4- 6.4 (some of the earlier ones they may already have).
+AD4-
+AD4- Thanks so much for all the help+ACE-

Likewise. Thank you for all the work you did in debugging and testing.
It is much appreciated.

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust+AEA-netapp.com
http://www.netapp.com

2013-01-11 16:19:51

by Chris Perl

[permalink] [raw]
Subject: Re: Possible Race Condition on SIGKILL

On Thu, Jan 10, 2013 at 09:30:58PM +0000, Myklebust, Trond wrote:
> On Wed, 2013-01-09 at 15:52 -0500, Trond Myklebust wrote:
> > On Wed, 2013-01-09 at 12:55 -0500, Chris Perl wrote:
> > > > Hrm. I guess I'm in over my head here. Apologoies if I'm just asking
> > > > silly bumbling questions. You can start ignoring me at any time. :)
> > >
> > > I stared at the code for a while and more and now see why what I
> > > outlined is not possible. Thanks for helping to clarify!
> > >
> > > I decided to pull your git repo and compile with HEAD at
> > > 87ed50036b866db2ec2ba16b2a7aec4a2b0b7c39 (linux-next as of this
> > > morning). Using this kernel, I can no longer induce any hangs.
> > >
> > > Interestingly, I tried recompiling the CentOS 6.3 kernel with
> > > both the original patch (v4) and the last patch you sent about fixing
> > > priority queues. With both of those in place, I still run into a
> > > problem.
> > >
> > > echo 0 > /proc/sys/sunrpc/rpc_debug after the hang shows (I left in the
> > > previous additional prints and added printing of the tasks pointer
> > > itself):
> > >
> > > <6>client: ffff88082896c200, xprt: ffff880829011000, snd_task: ffff880829a1aac0
> > > <6>client: ffff8808282b5600, xprt: ffff880829011000, snd_task: ffff880829a1aac0
> > > <6>--task-- -pid- flgs status -client- --rqstp- -timeout ---ops--
> > > <6>ffff88082a463180 22007 0080 -11 ffff8808282b5600 (null) 0 ffffffffa027b7a0 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
> > > <6>client: ffff88082838cc00, xprt: ffff88082b7c5800, snd_task: (null)
> > > <6>client: ffff8808283db400, xprt: ffff88082b7c5800, snd_task: (null)
> > > <6>client: ffff8808283db200, xprt: ffff880829011000, snd_task: ffff880829a1aac0
> > >
> > > Any thoughts about other patches that might affect this?
> >
> > Hmm... The only one that springs to mind is this one (see attachment)
> > and then the 'connect' fixes that you helped us with previously.
>
> Never mind. I suspect that the main reason why RHEL-6.3 is still
> vulnerable is that it lacks commit
> 961a828df64979d2a9faeeeee043391670a193b9 (SUNRPC: Fix potential races in
> xprt_lock_write_next()).

Great, thanks! I've add this on top of the others and am now testing.
I'll let you know how it goes.

2013-01-14 19:13:00

by Chris Perl

[permalink] [raw]
Subject: Re: Possible Race Condition on SIGKILL

On Mon, Jan 14, 2013 at 03:25:50PM +0000, Myklebust, Trond wrote:
> On Mon, 2013-01-14 at 10:09 -0500, Chris Perl wrote:
> > On Fri, Jan 11, 2013 at 11:19:44AM -0500, Chris Perl wrote:
> > > On Thu, Jan 10, 2013 at 09:30:58PM +0000, Myklebust, Trond wrote:
> > > > On Wed, 2013-01-09 at 15:52 -0500, Trond Myklebust wrote:
> > > > > On Wed, 2013-01-09 at 12:55 -0500, Chris Perl wrote:
> > > > > > > Hrm. I guess I'm in over my head here. Apologoies if I'm just asking
> > > > > > > silly bumbling questions. You can start ignoring me at any time. :)
> > > > > >
> > > > > > I stared at the code for a while and more and now see why what I
> > > > > > outlined is not possible. Thanks for helping to clarify!
> > > > > >
> > > > > > I decided to pull your git repo and compile with HEAD at
> > > > > > 87ed50036b866db2ec2ba16b2a7aec4a2b0b7c39 (linux-next as of this
> > > > > > morning). Using this kernel, I can no longer induce any hangs.
> > > > > >
> > > > > > Interestingly, I tried recompiling the CentOS 6.3 kernel with
> > > > > > both the original patch (v4) and the last patch you sent about fixing
> > > > > > priority queues. With both of those in place, I still run into a
> > > > > > problem.
> > > > > >
> > > > > > echo 0 > /proc/sys/sunrpc/rpc_debug after the hang shows (I left in the
> > > > > > previous additional prints and added printing of the tasks pointer
> > > > > > itself):
> > > > > >
> > > > > > <6>client: ffff88082896c200, xprt: ffff880829011000, snd_task: ffff880829a1aac0
> > > > > > <6>client: ffff8808282b5600, xprt: ffff880829011000, snd_task: ffff880829a1aac0
> > > > > > <6>--task-- -pid- flgs status -client- --rqstp- -timeout ---ops--
> > > > > > <6>ffff88082a463180 22007 0080 -11 ffff8808282b5600 (null) 0 ffffffffa027b7a0 nfsv3 ACCESS a:call_reserveresult q:xprt_sending
> > > > > > <6>client: ffff88082838cc00, xprt: ffff88082b7c5800, snd_task: (null)
> > > > > > <6>client: ffff8808283db400, xprt: ffff88082b7c5800, snd_task: (null)
> > > > > > <6>client: ffff8808283db200, xprt: ffff880829011000, snd_task: ffff880829a1aac0
> > > > > >
> > > > > > Any thoughts about other patches that might affect this?
> > > > >
> > > > > Hmm... The only one that springs to mind is this one (see attachment)
> > > > > and then the 'connect' fixes that you helped us with previously.
> > > >
> > > > Never mind. I suspect that the main reason why RHEL-6.3 is still
> > > > vulnerable is that it lacks commit
> > > > 961a828df64979d2a9faeeeee043391670a193b9 (SUNRPC: Fix potential races in
> > > > xprt_lock_write_next()).
> > >
> > > Great, thanks! I've add this on top of the others and am now testing.
> > > I'll let you know how it goes.
> >
> > With all 4 patches in place, I am no longer able to hang my CentOS 6.3
> > system. I have not tested all the various combinations of the 4
> > patches, but can definitely confirm that without either of:
> >
> > 961a828df64979d2a9faeeeee043391670a193b9 SUNRPC: Fix potential races in xprt_lock_write_next()
> > 87ed50036b866db2ec2ba16b2a7aec4a2b0b7c39 SUNRPC: Ensure we release the socket write lock if the rpc_task exits early
> >
> > I can hang the system using the test program I sent in my first email.
> >
> > I'll follow up with Red Hat and ask that they include all 4 patches for
> > 6.4 (some of the earlier ones they may already have).
> >
> > Thanks so much for all the help!
>
> Likewise. Thank you for all the work you did in debugging and testing.
> It is much appreciated.

Oh, I had just one other question about the final version of the patch
you created. In xprt_release, is it safe to access xprt->snd_task
without the transport_lock because of the rcu_lock()/rcu_unlock() stuff,
or because all architectures can atomically read pointer sized values
that are aligned in certain ways, or something else?