2008-01-10 21:51:40

by NeilBrown

Subject: Strange lockup during unmount in 2.6.22 - maybe rpciod deadlock?



I have a report of an unusual NFS deadlock in OpenSuSE 10.3, which is
based on 2.6.22.

People who like bugzilla can find a little more detail at:
https://bugzilla.novell.com/show_bug.cgi?id=352878

We start with a "soft,intr" mount over a VPN on an unreliable network.

I assume an unmount happened (by autofs) while there were some
outstanding requests - at least one was a COMMIT but I'm guessing
there were others.

When the last commit completed (probably timed-out) it put the open
context, dropped the last reference on the vfsmnt and so called
nfs_free_server and thence rpc_shutdown_client.
All this happened in rpciod/0.

rpc_shutdown_client calls rpc_killall_tasks and waits for them all to
complete. This deadlocks.
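
For reference, the loop it is stuck in looks roughly like this
(paraphrasing net/sunrpc/clnt.c from memory, so treat the details as
approximate):

void rpc_shutdown_client(struct rpc_clnt *clnt)
{
        /* Keep killing tasks until none remain on the client's list.
         * If the caller is itself an rpciod callback, any task that
         * needs rpciod to run can never leave cl_tasks, so this loop
         * never terminates. */
        while (!list_empty(&clnt->cl_tasks)) {
                rpc_killall_tasks(clnt);
                wait_event_timeout(destroy_wait,
                                list_empty(&clnt->cl_tasks), 1*HZ);
        }

        rpc_release_client(clnt);
}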

I'm wondering if some other request (read-ahead?) might need service
from rpciod/0 before it can complete. This would mean that it cannot
complete until rpciod/0 is free, and rpciod/0 won't be free until it
completes.

Is this a credible scenario?

There is an idle rpciod/1, so presumably some tasks could complete on
that, but maybe they are tied to a particular CPU??

Some excerpts from the 'showtasks' list are:

This is where I think the deadlock is:


rpciod/0 D 0000066e2a868397 0 3407 2 (L-TLB)
ffff81012bbbbd00 0000000000000046 0000000000000000 0000000000000000
0000000000000441 ffff81012bbbbcb0 ffffffff80617800 ffffffff80617800
ffffffff8061d210 ffffffff80617800 ffffffff80617800 ffff81006a507cb0
Call Trace:
[<ffffffff803fa993>] schedule_timeout+0x8a/0xad
[<ffffffff8023a5f6>] process_timeout+0x0/0x5
[<ffffffff803fa98e>] schedule_timeout+0x85/0xad
[<ffffffff88569e6d>] :sunrpc:rpc_shutdown_client+0x9f/0xfb
[<ffffffff8024433b>] autoremove_wake_function+0x0/0x2e
[<ffffffff803fabfa>] mutex_lock+0xd/0x1e
[<ffffffff88668cd3>] :nfs:nfs_free_server+0x92/0xcf
[<ffffffff8028b867>] deactivate_super+0x6a/0x82
[<ffffffff8866edd6>] :nfs:put_nfs_open_context+0x9e/0xaf
[<ffffffff88671fb5>] :nfs:nfs_release_request+0x1d/0x2f
[<ffffffff886760da>] :nfs:nfs_commit_done+0x176/0x19f
[<ffffffff8856e3aa>] :sunrpc:rpc_exit_task+0x33/0x79
[<ffffffff8856e8e7>] :sunrpc:__rpc_execute+0x88/0x242
[<ffffffff8856eaa1>] :sunrpc:rpc_async_schedule+0x0/0xc
[<ffffffff80240bbe>] run_workqueue+0x7d/0x10c
[<ffffffff80241472>] worker_thread+0xe1/0xf0
[<ffffffff8024433b>] autoremove_wake_function+0x0/0x2e
[<ffffffff80241391>] worker_thread+0x0/0xf0
[<ffffffff80244214>] kthread+0x47/0x73
[<ffffffff8020aa48>] child_rip+0xa/0x12
[<ffffffff802441cd>] kthread+0x0/0x73
[<ffffffff8020aa3e>] child_rip+0x0/0x12


This is a task trying to access a new NFS mount. It is blocking on
IO. "ls -l" and other metadata options work OK on this mount, but
anything requiring data access hangs like this. Presumably it is
blocking on rpciod/0 ??

less D 000005b8846cd503 0 31053 1 (NOTLB)
ffff810055e05b78 0000000000000082 0000000000000002 0000000000000000
ffff810001025800 ffff810055e05b28 ffffffff80617800 ffffffff80617800
ffffffff8061d210 ffffffff80617800 ffffffff80617800 ffffffff80617800
Call Trace:
[<ffffffff8856fac1>] :sunrpc:rpcauth_lookup_credcache+0x113/0x242
[<ffffffff80264a27>] sync_page+0x0/0x42
[<ffffffff803fa8d2>] io_schedule+0x3f/0x67
[<ffffffff80264a65>] sync_page+0x3e/0x42
[<ffffffff803faa1f>] __wait_on_bit_lock+0x3f/0x72
[<ffffffff80264a14>] __lock_page+0x5e/0x64
[<ffffffff80244369>] wake_bit_function+0x0/0x23
[<ffffffff802652fd>] do_generic_mapping_read+0x1d9/0x3de
[<ffffffff802647ec>] file_read_actor+0x0/0xf1
[<ffffffff80266c83>] generic_file_aio_read+0x121/0x15e
[<ffffffff80289c11>] do_sync_read+0xd1/0x118
[<ffffffff8856f79c>] :sunrpc:rpcauth_lookupcred+0x6d/0x97
[<ffffffff8866d94b>] :nfs:nfs_file_open+0x0/0x7d
[<ffffffff8024433b>] autoremove_wake_function+0x0/0x2e
[<ffffffff8028868a>] __dentry_open+0x101/0x1aa
[<ffffffff802887ea>] do_filp_open+0x2d/0x3d
[<ffffffff8028a42d>] vfs_read+0xcb/0x153
[<ffffffff8028a7c1>] sys_read+0x45/0x6e
[<ffffffff80209c2e>] system_call+0x7e/0x83

The other rpciod is doing nothing.

rpciod/1 S ffff81012b85c1c0 0 21829 2 (L-TLB)
ffff810102c1fec0 0000000000000046 ffffffff802c1215 ffffffff80229aa7
ffff810102c1fe10 ffff810102c1fe70 ffffffff80617800 ffffffff80617800
ffffffff8061d210 ffffffff80617800 ffffffff80617800 ffffffff803fa014
Call Trace:
[<ffffffff802c1215>] proc_delete_inode+0x0/0x4a
[<ffffffff80229aa7>] task_rq_lock+0x3d/0x73
[<ffffffff803fa014>] thread_return+0x0/0xf1
[<ffffffff8024143a>] worker_thread+0xa9/0xf0
[<ffffffff8024433b>] autoremove_wake_function+0x0/0x2e
[<ffffffff80241391>] worker_thread+0x0/0xf0
[<ffffffff80244214>] kthread+0x47/0x73
[<ffffffff8020aa48>] child_rip+0xa/0x12
[<ffffffff802441cd>] kthread+0x0/0x73
[<ffffffff8020aa3e>] child_rip+0x0/0x12


Any input appreciated.

Thanks,
NeilBrown


2008-01-10 22:41:39

by Trond Myklebust

Subject: Re: Strange lockup during unmount in 2.6.22 - maybe rpciod deadlock?


On Fri, 2008-01-11 at 08:51 +1100, Neil Brown wrote:
>
> I have a report of an unusual NFS deadlock in OpenSuSE 10.3, which is
> based on 2.6.22.
>
> People who like bugzilla can find a little more detail at:
> https://bugzilla.novell.com/show_bug.cgi?id=352878
>
> We start with a "soft,intr" mount over a VPN on an unreliable network.
>
> I assume an unmount happened (by autofs) while there were some
> outstanding requests - at least one was a COMMIT but I'm guessing
> there were others.
>
> When the last commit completed (probably timed-out) it put the open
> context, dropped the last reference on the vfsmnt and so called
> nfs_free_server and thence rpc_shutdown_client.
> All this happened in rpciod/0.
>
> rpc_shutdown_client calls rpc_killall_tasks and waits for them all to
> complete. This deadlocks.
>
> I'm wondering if some other request (read-ahead?) might need service
> from rpciod/0 before it can complete. This would mean that it cannot
> complete until rpciod/0 is free, and rpciod/0 won't be free until it
> completes.
>
> Is this a credible scenario?

Yes, but I have a scenario that I think trumps it:

* the call that puts the open context is being made in
nfs_commit_done (or possibly nfs_writeback_done), causing it to
wait until the rpc_killall_tasks completes.
* The problem is that rpc_killall_tasks won't complete until the
rpc_task that is stuck in nfs_commit_done/nfs_writeback_done
exits.

Urgh...

I'm surprised that we can get into this state, though. How is
sys_umount() able to exit with either readaheads or writebacks still
pending? Is this perhaps occurring on a lazy umount?

Cheers
Trond


2008-01-11 01:08:02

by NeilBrown

Subject: Re: Strange lockup during unmount in 2.6.22 - maybe rpciod deadlock?

On Thursday January 10, [email protected] wrote:
>
> On Fri, 2008-01-11 at 08:51 +1100, Neil Brown wrote:
> >
> > Is this a credible scenario?
>
> Yes, but I have a scenario that I think trumps it:
>
> * the call that puts the open context is being made in
> nfs_commit_done (or possibly nfs_writeback_done), causing it to
> wait until the rpc_killall_tasks completes.
> * The problem is that rpc_killall_tasks won't complete until the
> rpc_task that is stuck in nfs_commit_done/nfs_writeback_done
> exits.
>
> Urgh...


Oh, yes, that would be ugly!

>
> I'm surprised that we can get into this state, though. How is
> sys_umount() able to exit with either readaheads or writebacks still
> pending? Is this perhaps occurring on a lazy umount?

It does sound like it has to be a lazy unmount, doesn't it? I don't
think autofs will do that. The computer in question was suspended
while on the slow slow network and when it resumed the problem hit.
So maybe there is a suspend-time script which did the lazy unmount.
I'll ask.

Meanwhile, I managed to reproduce it. It went something like:

Client:

mount -o soft,intr server:/home /mnt
{ echo open > /dev/tty ;sleep 10; echo writing > /dev/tty; \
exec cat /boot/bzImage-test; } > /mnt/testing

"open"

Server:
rpc.nfsd 0

Client:

"writing"

type control-C

umount -l /mnt


wait a little while, notice that one rpciod is in 'D' wait.
echo t > /proc/sysrq-trigger

See the same stack trace.


However I cannot do it again, so I cannot test a fix.


The only fix that occurs to me is to use schedule_work to shunt the
rpc_shutdown_client call into a separate thread.
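
Something along these lines is what I have in mind (completely
untested sketch; the struct and function names here are invented):

/* Sketch only: defer the client shutdown to keventd so that rpciod
 * never has to wait for its own tasks.  All names are made up. */
struct nfs_client_shutdown_work {
        struct work_struct work;
        struct rpc_clnt *clnt;
};

static void nfs_client_shutdown_workfn(struct work_struct *work)
{
        struct nfs_client_shutdown_work *w =
                container_of(work, struct nfs_client_shutdown_work, work);

        rpc_shutdown_client(w->clnt);   /* safe to wait: we are not rpciod */
        kfree(w);
}

static void nfs_shutdown_client_async(struct rpc_clnt *clnt)
{
        struct nfs_client_shutdown_work *w;

        w = kmalloc(sizeof(*w), GFP_NOFS);
        if (w == NULL) {
                /* no memory: fall back to the old (deadlock-prone) path */
                rpc_shutdown_client(clnt);
                return;
        }
        w->clnt = clnt;
        INIT_WORK(&w->work, nfs_client_shutdown_workfn);
        schedule_work(&w->work);        /* runs on the generic "events" workqueue */
}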

Thanks,
NeilBrown

2008-01-11 01:26:35

by Trond Myklebust

Subject: Re: Strange lockup during unmount in 2.6.22 - maybe rpciod deadlock?


On Fri, 2008-01-11 at 12:07 +1100, Neil Brown wrote:

> The only fix that occurs to me is to use schedule_work to shunt the
> rpc_shutdown_client call into a separate thread.

Well. Instead of rpc_shutdown_client() you could use
rpc_release_client() in order to have the last RPC caller destroy the
client when the last rpc_task exits.
The problem then is that you lose the guarantee that all RPC tasks must
complete before you free structures such as the nfs_server...
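
Concretely, something like this at the shutdown site (untested, and
the wrapper name is invented; you would probably still want to kill
the outstanding tasks first):

/* Illustration only (name invented): kill whatever is outstanding,
 * then drop our reference instead of waiting.  Each live rpc_task
 * holds a reference to the client, so the last task to exit frees
 * it -- but nothing here waits, which is the caveat above. */
static void nfs_drop_rpc_client(struct rpc_clnt *clnt)
{
        rpc_killall_tasks(clnt);
        rpc_release_client(clnt);
}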

Cheers
Trond


2008-02-19 19:26:12

by Trond Myklebust

Subject: Re: Strange lockup during unmount in 2.6.22 - maybe rpciod deadlock?


On Tue, 2008-02-19 at 16:34 +1100, Neil Brown wrote:
> On Thursday January 10, [email protected] wrote:
> >
> > On Fri, 2008-01-11 at 08:51 +1100, Neil Brown wrote:
> > >
> > > I have a report of an unusual NFS deadlock in OpenSuSE 10.3, which is
> > > based on 2.6.22.
> > >
> > > People who like bugzilla can find a little more detail at:
> > > https://bugzilla.novell.com/show_bug.cgi?id=352878
> > >
> > > We start with a "soft,intr" mount over a VPN on an unreliable network.
> > >
> > > I assume an unmount happened (by autofs) while there were some
> > > outstanding requests - at least one was a COMMIT but I'm guessing
> > > there were others.
> > >
> > > When the last commit completed (probably timed-out) it put the open
> > > context, dropped the last reference on the vfsmnt and so called
> > > nfs_free_server and thence rpc_shutdown_client.
> > > All this happened in rpciod/0.
> > >
> > > rpc_shutdown_client calls rpc_killall_tasks and waits for them all to
> > > complete. This deadlocks.
> > >
> > > I'm wondering if some other request (read-ahead?) might need service
> > > from rpciod/0 before it can complete. This would mean that it cannot
> > > complete until rpciod/0 is free, and rpciod/0 won't be free until it
> > > completes.
> > >
> > > Is this a credible scenario?
> >
> > Yes, but I have a scenario that I think trumps it:
> >
> > * the call that puts the open context is being made in
> > nfs_commit_done (or possibly nfs_writeback_done), causing it to
> > wait until the rpc_killall_tasks completes.
> > * The problem is that rpc_killall_tasks won't complete until the
> > rpc_task that is stuck in nfs_commit_done/nfs_writeback_done
> > exits.
> >
> > Urgh...
> >
> > I'm surprised that we can get into this state, though. How is
> > sys_umount() able to exit with either readaheads or writebacks still
> > pending? Is this perhaps occurring on a lazy umount?
>
> (finally getting back to this).
>
> Yes, lazy umount is definitely implicated. OpenSUSE sends 'autofs' a
> SIGUSR1 when suspending, and autofs may use "umount -l" if it received
> SIGUSR1.
>
> I dug further and realised that if there is still a COMMIT in flight,
> then there should still be an open file descriptor on the filesystem,
> so when the COMMIT finishes, it won't be the last thing that is
> holding the filesystem active, so this deadlock shouldn't happen...
>
> Then I found this, which post-dates OpenSUSE-10.3.
>
> ----------------------------
> commit a49c3c7736a2e77931dabc5bc4a83fb4b2da013e
> Author: Trond Myklebust <[email protected]>
> Date: Thu Oct 18 18:03:27 2007 -0400
>
> NFSv4: Ensure that we wait for the CLOSE request to complete
>
> Otherwise, we do end up breaking close-to-open semantics. We also end up
> breaking some of the silly-rename tests in Connectathon on some setups.
>
> Please refer to the bug-report at
> http://bugzilla.linux-nfs.org/show_bug.cgi?id=150
> ----------------------------
>
> Which seems to suggest that a file descriptor could have been
> completely closed before the final COMMIT returned, so that final
> commit would be all that was holding the filesystem active.
>
> So it could be that this bug has since been fixed.
>
> Do you remember this patch well enough to be able to say if the bug it
> fixes could have allowed a COMMIT to be outstanding when the last
> nfs_file_release completed?
>
>
> In my investigations, it seemed that (in -mm where I was looking)
> nfs_wb_all can still complete before the last COMMIT returns, but only
> if the process received a signal (the new TASK_KILLABLE seems to allow
> this). That seems to suggest that the bug could be there again. If
> you "umount -l", then kill the last process with a file open, there
> could still be an in-flight COMMIT that is holding the filesystem
> active. When it completes, you could get a deadlock.
>
> It also seems to me that "rpc_shutdown_client" doesn't really need
> to loop. The only reason I can see for calling rpc_killall_tasks
> multiple times is in case a task schedules a sub-task before it
> notices that it has been killed.
>
> That possibility can be avoided by flagging the client as 'dying' and
> auto-killing any task that gets added to a dying client.
>
> Once the multiple calls to rpc_killall_tasks aren't needed, there is
> no need to wait until cl_tasks is empty as the kref counting will make
> sure that the client gets cleaned up at the right time.
>
> So maybe the following patch would still be useful to make sure that
> rpc_shutdown_client never deadlocks??
>
> Thanks,
> NeilBrown
>
>
> Signed-off-by: Neil Brown <[email protected]>
>
> ### Diffstat output
> ./include/linux/sunrpc/clnt.h | 2 ++
> ./net/sunrpc/clnt.c | 10 +++++-----
> ./net/sunrpc/sched.c | 4 ++++
> 3 files changed, 11 insertions(+), 5 deletions(-)
>
> diff .prev/include/linux/sunrpc/clnt.h ./include/linux/sunrpc/clnt.h
> --- .prev/include/linux/sunrpc/clnt.h 2008-02-19 15:28:06.000000000 +1100
> +++ ./include/linux/sunrpc/clnt.h 2008-02-19 15:51:40.000000000 +1100
> @@ -42,6 +42,8 @@ struct rpc_clnt {
>
> unsigned int cl_softrtry : 1,/* soft timeouts */
> cl_discrtry : 1,/* disconnect before retry */
> + cl_dying : 1, /* shutting down, allow
> + * no more tasks */
> cl_autobind : 1;/* use getport() */
>
> struct rpc_rtt * cl_rtt; /* RTO estimator data */
>
> diff .prev/net/sunrpc/clnt.c ./net/sunrpc/clnt.c
> --- .prev/net/sunrpc/clnt.c 2008-02-19 15:29:58.000000000 +1100
> +++ ./net/sunrpc/clnt.c 2008-02-19 15:49:43.000000000 +1100
> @@ -386,11 +386,11 @@ void rpc_shutdown_client(struct rpc_clnt
> dprintk("RPC: shutting down %s client for %s\n",
> clnt->cl_protname, clnt->cl_server);
>
> - while (!list_empty(&clnt->cl_tasks)) {
> - rpc_killall_tasks(clnt);
> - wait_event_timeout(destroy_wait,
> - list_empty(&clnt->cl_tasks), 1*HZ);
> - }
> + spin_lock(&clnt->cl_lock);
> + clnt->cl_dying = 1;
> + spin_unlock(&clnt->cl_lock);
> +
> + rpc_killall_tasks(clnt);
>
> rpc_release_client(clnt);
> }
>
> diff .prev/net/sunrpc/sched.c ./net/sunrpc/sched.c
> --- .prev/net/sunrpc/sched.c 2008-02-19 15:25:45.000000000 +1100
> +++ ./net/sunrpc/sched.c 2008-02-19 15:37:11.000000000 +1100
> @@ -278,6 +278,10 @@ static void rpc_set_active(struct rpc_ta
> if (clnt != NULL) {
> spin_lock(&clnt->cl_lock);
> list_add_tail(&task->tk_task, &clnt->cl_tasks);
> + if (clnt->cl_dying) {
> + task->tk_flags |= RPC_TASK_KILLED;
> + rpc_exit(task, -EIO);
> + }
> spin_unlock(&clnt->cl_lock);
> }
> }

Hmm... I'm not sure that I see how this will fix the deadlock.

The main problem here is that we can call nfs_inode_remove_request(),
for instance, from within the rpciod callback nfs_writeback_done_full().
This again may end up calling put_nfs_open_context(), which again calls
mntput().

The deadlock then occurs because we still have the active rpc_task in
nfs_writeback_done_full() calling mntput, and hence triggering the call
to rpc_shutdown_client(), which cannot complete until said active
rpc_task itself completes...

One solution may be to move the call to put_nfs_open_context() into the
->rpc_release() callback, since that is called after the task has been
removed from the client task list, and so it is invisible to
rpc_killall_tasks() and rpc_shutdown_client().
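
For the write path, that would look something like the following
(sketch only; I haven't checked the exact structure fields here, so
treat args.context as an assumption):

static void nfs_writedata_release(void *data)
{
        struct nfs_write_data *wdata = data;

        /* Drop the open context (and hence the mnt/sb references) from
         * ->rpc_release(), i.e. after the task has been removed from
         * clnt->cl_tasks, instead of from the ->rpc_call_done() path. */
        put_nfs_open_context(wdata->args.context);
        nfs_writedata_free(wdata);
}

The read and commit paths would need the equivalent change.
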
The next problem then becomes SteveD's wait_event() in nfs_put_super(),
which waits for all sillyrenames to complete. If rpciod is doing the
waiting, then it can't service the asynchronous sillyrename tasks, and
so it deadlocks again. In that case, it may indeed be that your fix
above will help. I need to think a bit about that, since there are
legitimate RPC calls that can be sent even after the call to
rpc_killall_tasks (mainly the RPCSEC_GSS context teardown calls).

Cheers
Trond


2008-02-19 23:19:03

by Myklebust, Trond

Subject: Re: Strange lockup during unmount in 2.6.22 - maybe rpciod deadlock?

How about moving the offending mntput calls off rpciod altogether? That
way we can avoid both the deadlock with rpc_shutdown_client() and the
deadlock with nfs_put_super().
The other advantage of doing this is that we move all those deadlocky
little malloc() calls from the NFSv4 open(), close(), lock(), and
locku() out of rpciod too. Ditto for the delegation return stuff that
may result from the dput() calls...

The way I'm attempting to do this is to add something like the
following patch series (which has been compile tested, but not
run-tested quite yet). It basically creates an 'nfsiod' workqueue, and
allows the NFS read/write/... code to specify that the
tk_ops->rpc_release() callback should be run on that particular
workqueue. It then moves all the mntput()/dput() stuff into the
rpc_release() call...
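
The workqueue setup itself is tiny. Roughly along these lines (a
sketch of the idea, not the exact patches from the series):

/* Sketch of the nfsiod workqueue setup (not the exact patch). */
static struct workqueue_struct *nfsiod_workqueue;

static int nfsiod_start(void)
{
        struct workqueue_struct *wq;

        wq = create_workqueue("nfsiod");
        if (wq == NULL)
                return -ENOMEM;
        nfsiod_workqueue = wq;
        return 0;
}

static void nfsiod_stop(void)
{
        struct workqueue_struct *wq;

        wq = nfsiod_workqueue;
        if (wq == NULL)
                return;
        nfsiod_workqueue = NULL;
        destroy_workqueue(wq);
}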

Cheers
Trond

2008-02-21 04:58:30

by NeilBrown

Subject: Re: Strange lockup during unmount in 2.6.22 - maybe rpciod deadlock?

On Tuesday February 19, [email protected] wrote:
>
> On Tue, 2008-02-19 at 16:34 +1100, Neil Brown wrote:
> > On Thursday January 10, [email protected] wrote:
> > >
> > > On Fri, 2008-01-11 at 08:51 +1100, Neil Brown wrote:
> > > >
> > > > I have a report of an unusual NFS deadlock in OpenSuSE 10.3, which is
> > > > based on 2.6.22.
> > > >
> > > > People who like bugzilla can find a little more detail at:
> > > > https://bugzilla.novell.com/show_bug.cgi?id=352878
> > > >
> > > > We start with a "soft,intr" mount over a VPN on an unreliable network.
> > > >
> > > > I assume an unmount happened (by autofs) while there were some
> > > > outstanding requests - at least one was a COMMIT but I'm guessing
> > > > there were others.
> > > >
> > > > When the last commit completed (probably timed-out) it put the open
> > > > context, dropped the last reference on the vfsmnt and so called
> > > > nfs_free_server and thence rpc_shutdown_client.
> > > > All this happened in rpciod/0.
> > > >
> > > > rpc_shutdown_client calls rpc_killall_tasks and waits for them all to
> > > > complete. This deadlocks.
> > > >
> > > > I'm wondering if some other request (read-ahead?) might need service
> > > > from rpciod/0 before it can complete. This would mean that it cannot
> > > > complete until rpciod/0 is free, and rpciod/0 won't be free until it
> > > > completes.
> > > >
> > > > Is this a credible scenario?
> > >
> > > Yes, but I have a scenario that I think trumps it:
> > >
> > > * the call that puts the open context is being made in
> > > nfs_commit_done (or possibly nfs_writeback_done), causing it to
> > > wait until the rpc_killall_tasks completes.
> > > * The problem is that rpc_killall_tasks won't complete until the
> > > rpc_task that is stuck in nfs_commit_done/nfs_writeback_done
> > > exits.
> > >
> > > Urgh...
> > >
> > > I'm surprised that we can get into this state, though. How is
> > > sys_umount() able to exit with either readaheads or writebacks still
> > > pending? Is this perhaps occurring on a lazy umount?
> >
> > (finally getting back to this).
> >
> > Yes, lazy umount is definitely implicated. OpenSUSE sends 'autofs' a
> > SIGUSR1 when suspending, and autofs may use "umount -l" if it received
> > SIGUSR1.
> >
> > I dug further and realised that if there is still a COMMIT in flight,
> > then there should still be an open file descriptor on the filesystem,
> > so when the COMMIT finishes, it won't be the last thing that is
> > holding the filesystem active, so this deadlock shouldn't happen...
> >
> > Then I found this, which post-dates OpenSUSE-10.3.
> >
> > ----------------------------
> > commit a49c3c7736a2e77931dabc5bc4a83fb4b2da013e
> > Author: Trond Myklebust <[email protected]>
> > Date: Thu Oct 18 18:03:27 2007 -0400
> >
> > NFSv4: Ensure that we wait for the CLOSE request to complete
> >
> > Otherwise, we do end up breaking close-to-open semantics. We also end up
> > breaking some of the silly-rename tests in Connectathon on some setups.
> >
> > Please refer to the bug-report at
> > http://bugzilla.linux-nfs.org/show_bug.cgi?id=150
> > ----------------------------
> >
> > Which seems to suggest that a file descriptor could have been
> > completely closed before the final COMMIT returned, so that final
> > commit would be all that was holding the filesystem active.
> >
> > So it could be that this bug has since been fixed.
> >
> > Do you remember this patch well enough to be able to say if the bug it
> > fixes could have allowed a COMMIT to be outstanding when the last
> > nfs_file_release completed?
> >
> >
> > In my investigations, it seemed that (in -mm where I was looking)
> > nfs_wb_all can still complete before the last COMMIT returns, but only
> > if the process received a signal (the new TASK_KILLABLE seems to allow
> > this). That seems to suggest that the bug could be there again. If
> > you "umount -l", then kill the last process with a file open, there
> > could still be an in-flight COMMIT that is holding the filesystem
> > active. When it completes, you could get a deadlock.
> >
> > It also seems to me that "rpc_shutdown_client" doesn't really need
> > to loop. The only reason I can see for calling rpc_killall_tasks
> > multiple times is in case a task schedules a sub-task before it
> > notices that it has been killed.
> >
> > That possibility can be avoided by flagging the client as 'dying' and
> > auto-killing any task that gets added to a dying client.
> >
> > Once the multiple calls to rpc_killall_tasks aren't needed, there is
> > no need to wait until cl_tasks is empty as the kref counting will make
> > sure that the client gets cleaned up at the right time.
> >
> > So maybe the following patch would still be useful to make sure that
> > rpc_shutdown_client never deadlocks??
> >
> > Thanks,
> > NeilBrown
> >
> >
> > Signed-off-by: Neil Brown <[email protected]>
> >
> > ### Diffstat output
> > ./include/linux/sunrpc/clnt.h | 2 ++
> > ./net/sunrpc/clnt.c | 10 +++++-----
> > ./net/sunrpc/sched.c | 4 ++++
> > 3 files changed, 11 insertions(+), 5 deletions(-)
> >
> > diff .prev/include/linux/sunrpc/clnt.h ./include/linux/sunrpc/clnt.h
> > --- .prev/include/linux/sunrpc/clnt.h 2008-02-19 15:28:06.000000000 +1100
> > +++ ./include/linux/sunrpc/clnt.h 2008-02-19 15:51:40.000000000 +1100
> > @@ -42,6 +42,8 @@ struct rpc_clnt {
> >
> > unsigned int cl_softrtry : 1,/* soft timeouts */
> > cl_discrtry : 1,/* disconnect before retry */
> > + cl_dying : 1, /* shutting down, allow
> > + * no more tasks */
> > cl_autobind : 1;/* use getport() */
> >
> > struct rpc_rtt * cl_rtt; /* RTO estimator data */
> >
> > diff .prev/net/sunrpc/clnt.c ./net/sunrpc/clnt.c
> > --- .prev/net/sunrpc/clnt.c 2008-02-19 15:29:58.000000000 +1100
> > +++ ./net/sunrpc/clnt.c 2008-02-19 15:49:43.000000000 +1100
> > @@ -386,11 +386,11 @@ void rpc_shutdown_client(struct rpc_clnt
> > dprintk("RPC: shutting down %s client for %s\n",
> > clnt->cl_protname, clnt->cl_server);
> >
> > - while (!list_empty(&clnt->cl_tasks)) {
> > - rpc_killall_tasks(clnt);
> > - wait_event_timeout(destroy_wait,
> > - list_empty(&clnt->cl_tasks), 1*HZ);
> > - }
> > + spin_lock(&clnt->cl_lock);
> > + clnt->cl_dying = 1;
> > + spin_unlock(&clnt->cl_lock);
> > +
> > + rpc_killall_tasks(clnt);
> >
> > rpc_release_client(clnt);
> > }
> >
> > diff .prev/net/sunrpc/sched.c ./net/sunrpc/sched.c
> > --- .prev/net/sunrpc/sched.c 2008-02-19 15:25:45.000000000 +1100
> > +++ ./net/sunrpc/sched.c 2008-02-19 15:37:11.000000000 +1100
> > @@ -278,6 +278,10 @@ static void rpc_set_active(struct rpc_ta
> > if (clnt != NULL) {
> > spin_lock(&clnt->cl_lock);
> > list_add_tail(&task->tk_task, &clnt->cl_tasks);
> > + if (clnt->cl_dying) {
> > + task->tk_flags |= RPC_TASK_KILLED;
> > + rpc_exit(task, -EIO);
> > + }
> > spin_unlock(&clnt->cl_lock);
> > }
> > }
>
> Hmm... I'm not sure that I see how this will fix the deadlock.

The task that I saw deadlocked was waiting in rpc_shutdown_client. I
removed the wait from rpc_shutdown_client, so presumably the task
cannot deadlock anymore.

>
> The main problem here is that we can call nfs_inode_remove_request(),
> for instance, from within the rpciod callback nfs_writeback_done_full().
> This again may end up calling put_nfs_open_context(), which again calls
> mntput().
>
> The deadlock then occurs because we still have the active rpc_task in
> nfs_writeback_done_full() calling mntput, and hence triggering the call
> to rpc_shutdown_client(), which cannot complete until said active
> rpc_task itself completes...

My question is: *why* cannot rpc_shutdown_client complete until all
active rpc_tasks complete? The use of reference counting ensures that
once they do all complete, the client will finally be released and any
relevant modules will also be released.

Is there really any need to wait for completion?

>
> One solution may be to move the call to put_nfs_open_context() into the
> ->rpc_release() callback, since that is called after the task has been
> removed from the client task list, and so it is invisible to
> rpc_killall_tasks() and rpc_shutdown_client().
> The next problem then becomes SteveD's wait_event() in nfs_put_super(),
> which waits for all sillyrenames to complete. If rpciod is doing the
> waiting, then it can't service the asynchronous sillyrename tasks, and
> so it deadlocks again. In that case, it may indeed be that your fix
> above will help. I need to think a bit about that, since there are
> legitimate RPC calls that can be sent even after the call to
> rpc_killall_tasks (mainly the RPCSEC_GSS context teardown calls).


The current rpc_shutdown_client will kill such a task if it takes
longer than 1 second, won't it?

Maybe this is all the more reason to simply cut the client loose and
let it sort itself out with the help of rpciod and clean up when it
has finished.
>
> Cheers
> Trond

2008-02-21 05:16:05

by NeilBrown

Subject: Re: Strange lockup during unmount in 2.6.22 - maybe rpciod deadlock?

On Tuesday February 19, [email protected] wrote:
> How about moving the offending mntput calls off rpciod altogether? That
> way we can avoid both the deadlock with rpc_shutdown_client() and the
> deadlock with nfs_put_super().
> The other advantage of doing this is that we move all those deadlocky
> little malloc() calls from the NFSv4 open(), close(), lock(), and
> locku() out of rpciod too. Ditto for the delegation return stuff that
> may result from the dput() calls...

Yes, that sounds like a reasonable approach.

Adding extra threads is not something I would want to do too lightly,
but it does seem reasonably justified here.

>
> The way I'm attempting to do this is to add something like the
> following patch series (which has been compile tested, but not
> run-tested quite yet). It basically creates an 'nfsiod' workqueue, and
> allows the NFS read/write/... code to specify that the
> tk_ops->rpc_release() callback should be run on that particular
> workqueue. It then moves all the mntput()/dput() stuff into the
> rpc_release() call...

Seems to make sense, but I'm not really familiar enough with the code
to be sure.


Thanks,
NeilBrown

2008-02-21 12:16:44

by Jeff Layton

Subject: Re: Strange lockup during unmount in 2.6.22 - maybe rpciod deadlock?

On Thu, 21 Feb 2008 15:58:34 +1100
Neil Brown <[email protected]> wrote:

> My question is: *why* cannot rpc_shutdown_client complete until all
> active rpc_tasks complete? The use of reference counting ensures that
> once they do all complete, the client will finally be released and any
> relevant modules will also be released.
>
> Is there really any need to wait for completion?
>

I can't speak for all uses of rpc_shutdown_client, but the fix I sent
up recently for the use-after-free when taking down lockd depends on
this. If we change the semantics of rpc_shutdown_client this way then
we'll need to fix that in some other way. I need to know that when I'm
taking down lockd that all of the grant callbacks in the RPC queue are
completely dead.

--
Jeff Layton <[email protected]>

2008-02-21 13:34:17

by Myklebust, Trond

Subject: Re: Strange lockup during unmount in 2.6.22 - maybe rpciod deadlock?


On Thu, 2008-02-21 at 16:16 +1100, Neil Brown wrote:
> On Tuesday February 19, [email protected] wrote:
> > How about moving the offending mntput calls off rpciod altogether? That
> > way we can avoid both the deadlock with rpc_shutdown_client() and the
> > deadlock with nfs_put_super().
> > The other advantage of doing this is that we move all those deadlocky
> > little malloc() calls from the NFSv4 open(), close(), lock(), and
> > locku() out of rpciod too. Ditto for the delegation return stuff that
> > may result from the dput() calls...
>
> Yes, that sounds like a reasonable approach.
>
> Adding extra threads is not something I would want to do too lightly,
> but it does seem reasonably justified here.

I did some performance tests, and found that it appears to be acceptable
to replace the multi-threaded workqueue with a single-threaded one, so
I've baked this into the third patch.
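
Relative to the nfsiod sketch I posted earlier, that is essentially a
one-line change (again, guessing at what the series ends up with):

-	wq = create_workqueue("nfsiod");
+	wq = create_singlethread_workqueue("nfsiod");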

I also did some tests with running iozone and lazy-umounting the
partition before killing the iozone threads. That appears to work fine
now...


2008-02-21 15:27:38

by Trond Myklebust

Subject: Re: Strange lockup during unmount in 2.6.22 - maybe rpciod deadlock?


On Thu, 2008-02-21 at 15:58 +1100, Neil Brown wrote:

> My question is: *why* cannot rpc_shutdown_client complete until all
> active rpc_tasks complete? The use of reference counting ensures that
> once they do all complete, the client will finally be released and any
> relevant modules will also be released.
>
> Is there really any need to wait for completion?

Looking at the code, I suspect that you can probably get rid of the
rpc_shutdown_client() without creating too much trouble (since we now
hold a reference to the vfsmount in most of the asynchronous
operations).

However the asynchronous sillyrenames are still a problem: if you don't
wait for the sillyrename RPC call to complete, then you end up with the
famous "Self-destruct in 5 seconds" message on umount (because we have
to hold a reference to the directory inode for the duration of the RPC
call if we want to avoid lookup races and cache consistency issues
during normal operation).
Hence, I think the extra workqueue is justified by the fact that rpciod
cannot ever wait for the sillyrename calls to complete.




2008-02-22 00:31:55

by NeilBrown

Subject: Re: Strange lockup during unmount in 2.6.22 - maybe rpciod deadlock?

On Thursday February 21, [email protected] wrote:
>
> On Thu, 2008-02-21 at 15:58 +1100, Neil Brown wrote:
>
> > My question is: *why* cannot rpc_shutdown_client complete until all
> > active rpc_tasks complete? The use of reference counting ensures that
> > once they do all complete, the client will finally be released and any
> > relevant modules will also be released.
> >
> > Is there really any need to wait for completion?
>
> Looking at the code, I suspect that you can probably get rid of the
> rpc_shutdown_client() without creating too much trouble (since we now
> hold a reference to the vfsmount in most of the asynchronous
> operations).
>
> However the asynchronous sillyrenames are still a problem: if you don't
> wait for the sillyrename RPC call to complete, then you end up with the
> famous "Self-destruct in 5 seconds" message on umount (because we have
> to hold a reference to the directory inode for the duration of the RPC
> call if we want to avoid lookup races and cache consistency issues
> during normal operation).

Could the sillyrename call hold a reference to the vfsmount???

I guess that would mean that an unmount could fail with EBUSY when you
don't expect it...

Presumably the sillyrename delete happens on close... could the close
wait for the sillyrename delete to finish? That would hold everything
busy in a sensible way.

But I suspect there are good answers; you've doubtless thought about
it more than me :-)

> Hence, I think the extra workqueue is justified by the fact that rpciod
> cannot ever wait for the sillyrename calls to complete.
>

Sounds like we've arrived at a good solution - thanks!

Now I just have to decide which bits, if any, to bother back-porting
to OpenSUSE before I close the bug that started all of this :-)

NeilBrown

2008-02-22 00:42:01

by Trond Myklebust

Subject: Re: Strange lockup during unmount in 2.6.22 - maybe rpciod deadlock?


On Fri, 2008-02-22 at 11:31 +1100, Neil Brown wrote:
> On Thursday February 21, [email protected] wrote:
> >
> > On Thu, 2008-02-21 at 15:58 +1100, Neil Brown wrote:
> >
> > > My question is: *why* cannot rpc_shutdown_client complete until all
> > > active rpc_tasks complete? The use of reference counting ensures that
> > > once they do all complete, the client will finally be released and any
> > > relevant modules will also be released.
> > >
> > > Is there really any need to wait for completion?
> >
> > Looking at the code, I suspect that you can probably get rid of the
> > rpc_shutdown_client() without creating too much trouble (since we now
> > hold a reference to the vfsmount in most of the asynchronous
> > operations).
> >
> > However the asynchronous sillyrenames are still a problem: if you don't
> > wait for the sillyrename RPC call to complete, then you end up with the
> > famous "Self-destruct in 5 seconds" message on umount (because we have
> > to hold a reference to the directory inode for the duration of the RPC
> > call if we want to avoid lookup races and cache consistency issues
> > during normal operation).
>
> Could the sillyrename call hold a reference to the vfsmount???

That would require a VFS change to allow the ->rename() and ->unlink()
inode ops to pass down the vfsmount to the filesystem. Unfortunately Al
and Christoph have both NACKed such a change. SteveD's fix was therefore
the only thing we could do in this situation.



2008-02-19 05:34:40

by NeilBrown

Subject: Re: Strange lockup during unmount in 2.6.22 - maybe rpciod deadlock?

On Thursday January 10, [email protected] wrote:
>
> On Fri, 2008-01-11 at 08:51 +1100, Neil Brown wrote:
> >
> > I have a report of an unusual NFS deadlock in OpenSuSE 10.3, which is
> > based on 2.6.22.
> >
> > People who like bugzilla can find a little more detail at:
> > https://bugzilla.novell.com/show_bug.cgi?id=352878
> >
> > We start with a "soft,intr" mount over a VPN on an unreliable network.
> >
> > I assume an unmount happened (by autofs) while there were some
> > outstanding requests - at least one was a COMMIT but I'm guessing
> > there were others.
> >
> > When the last commit completed (probably timed-out) it put the open
> > context, dropped the last reference on the vfsmnt and so called
> > nfs_free_server and thence rpc_shutdown_client.
> > All this happened in rpciod/0.
> >
> > rpc_shutdown_client calls rpc_killall_tasks and waits for them all to
> > complete. This deadlocks.
> >
> > I'm wondering if some other request (read-ahead?) might need service
> > from rpciod/0 before it can complete. This would mean that it cannot
> > complete until rpciod/0 is free, and rpciod/0 won't be free until it
> > completes.
> >
> > Is this a credible scenario?
>
> Yes, but I have a scenario that I think trumps it:
>
> * the call that puts the open context is being made in
> nfs_commit_done (or possibly nfs_writeback_done), causing it to
> wait until the rpc_killall_tasks completes.
> * The problem is that rpc_killall_tasks won't complete until the
> rpc_task that is stuck in nfs_commit_done/nfs_writeback_done
> exits.
>
> Urgh...
>
> I'm surprised that we can get into this state, though. How is
> sys_umount() able to exit with either readaheads or writebacks still
> pending? Is this perhaps occurring on a lazy umount?

(finally getting back to this).

Yes, lazy umount is definitely implicated. OpenSUSE sends 'autofs' a
SIGUSR1 when suspending, and autofs may use "umount -l" if it received
SIGUSR1.

I dug further and realised that if there is still a COMMIT in flight,
then there should still be an open file descriptor on the filesystem,
so when the COMMIT finishes, it won't be the last thing that is
holding the filesystem active, so this deadlock shouldn't happen...

Then I found this, which post-dates OpenSUSE-10.3.

----------------------------
commit a49c3c7736a2e77931dabc5bc4a83fb4b2da013e
Author: Trond Myklebust <[email protected]>
Date: Thu Oct 18 18:03:27 2007 -0400

NFSv4: Ensure that we wait for the CLOSE request to complete

Otherwise, we do end up breaking close-to-open semantics. We also end up
breaking some of the silly-rename tests in Connectathon on some setups.

Please refer to the bug-report at
http://bugzilla.linux-nfs.org/show_bug.cgi?id=150
----------------------------

Which seems to suggest that a file descriptor could have been
completely closed before the final COMMIT returned, so that final
commit would be all that was holding the filesystem active.

So it could be that this bug has since been fixed.

Do you remember this patch well enough to be able to say if the bug it
fixes could have allowed a COMMIT to be outstanding when the last
nfs_file_release completed?


In my investigations, it seemed that (in -mm where I was looking)
nfs_wb_all can still complete before the last COMMIT returns, but only
if the process received a signal (the new TASK_KILLABLE seems to allow
this). That seems to suggest that the bug could be there again. If
you "umount -l", then kill the last process with a file open, there
could still be an in-flight COMMIT that is holding the filesystem
active. When it completes, you could get a deadlock.

It also seems to me that "rpc_shutdown_client" doesn't really need
to loop. The only reason I can see for calling rpc_killall_tasks
multiple times is in case a task schedules a sub-task before it
notices that it has been killed.

That possibility can be avoided by flagging the client as 'dying' and
auto-killing any task that gets added to a dying client.

Once the multiple calls to rpc_killall_tasks aren't needed, there is
no need to wait until cl_tasks is empty as the kref counting will make
sure that the client gets cleaned up at the right time.

So maybe the following patch would still be useful to make sure that
rpc_shutdown_client never deadlocks??

Thanks,
NeilBrown


Signed-off-by: Neil Brown <[email protected]>

### Diffstat output
./include/linux/sunrpc/clnt.h | 2 ++
./net/sunrpc/clnt.c | 10 +++++-----
./net/sunrpc/sched.c | 4 ++++
3 files changed, 11 insertions(+), 5 deletions(-)

diff .prev/include/linux/sunrpc/clnt.h ./include/linux/sunrpc/clnt.h
--- .prev/include/linux/sunrpc/clnt.h 2008-02-19 15:28:06.000000000 +1100
+++ ./include/linux/sunrpc/clnt.h 2008-02-19 15:51:40.000000000 +1100
@@ -42,6 +42,8 @@ struct rpc_clnt {

unsigned int cl_softrtry : 1,/* soft timeouts */
cl_discrtry : 1,/* disconnect before retry */
+ cl_dying : 1, /* shutting down, allow
+ * no more tasks */
cl_autobind : 1;/* use getport() */

struct rpc_rtt * cl_rtt; /* RTO estimator data */

diff .prev/net/sunrpc/clnt.c ./net/sunrpc/clnt.c
--- .prev/net/sunrpc/clnt.c 2008-02-19 15:29:58.000000000 +1100
+++ ./net/sunrpc/clnt.c 2008-02-19 15:49:43.000000000 +1100
@@ -386,11 +386,11 @@ void rpc_shutdown_client(struct rpc_clnt
dprintk("RPC: shutting down %s client for %s\n",
clnt->cl_protname, clnt->cl_server);

- while (!list_empty(&clnt->cl_tasks)) {
- rpc_killall_tasks(clnt);
- wait_event_timeout(destroy_wait,
- list_empty(&clnt->cl_tasks), 1*HZ);
- }
+ spin_lock(&clnt->cl_lock);
+ clnt->cl_dying = 1;
+ spin_unlock(&clnt->cl_lock);
+
+ rpc_killall_tasks(clnt);

rpc_release_client(clnt);
}

diff .prev/net/sunrpc/sched.c ./net/sunrpc/sched.c
--- .prev/net/sunrpc/sched.c 2008-02-19 15:25:45.000000000 +1100
+++ ./net/sunrpc/sched.c 2008-02-19 15:37:11.000000000 +1100
@@ -278,6 +278,10 @@ static void rpc_set_active(struct rpc_ta
if (clnt != NULL) {
spin_lock(&clnt->cl_lock);
list_add_tail(&task->tk_task, &clnt->cl_tasks);
+ if (clnt->cl_dying) {
+ task->tk_flags |= RPC_TASK_KILLED;
+ rpc_exit(task, -EIO);
+ }
spin_unlock(&clnt->cl_lock);
}
}