2011-02-15 14:59:05

by Jeff Layton

Subject: [PATCH] nfs: don't queue synchronous NFSv4 close rpc_release to nfsiod

I recently had some of our QA people report some connectathon test
failures in RHEL5 (2.6.18-based kernel). For some odd reason (maybe
scheduling differences that make the race more likely?) the problem
occurs more frequently on s390.

The problem generally manifests itself on NFSv4 as a race where an rmdir
fails because a silly-renamed file in the directory wasn't deleted in
time. In traces, you typically see the rmdir attempt fail, with the
sillydelete of the file that prevented it following very soon
afterward.

Silly deletes are handled via dentry_iput and in the case of a close on
NFSv4, the last dentry reference is often held by the CLOSE RPC task.
nfs4_do_close does the close as an async RPC task that it conditionally
waits on depending on whether the close is synchronous or not.

It also sets the workqueue for the task to nfsiod_workqueue. When
tk_workqueue is set, the rpc_release operation is queued to that
workqueue. rpc_release is where the dentry reference held by the task is
put. The caller has no way to wait for that to complete, so the close(2)
syscall can easily return before the rpc_release call is ever done. In
some cases, that rpc_release is delayed long enough to prevent a
subsequent rmdir of the containing directory.

I believe this is a bug, or at least not ideal behavior. We should try
not to have the close(2) call return in this situation until the
sillydelete is done.

I've been able to reproduce this more reliably by adding a 100ms sleep
at the top of nfs4_free_closedata. I've not seen it "in the wild" on
mainline kernels, but it seems quite possible when a machine is heavily
loaded.

This patch fixes this by not setting tk_workqueue in nfs4_do_close when
the wait flag is set. This makes the final rpc_put_task a synchronous
operation and should prevent close(2) from returning before the
dentry_iput is done.
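The ordering problem is easy to model in userspace. The following is a toy, single-threaded sketch, not kernel API (all names are hypothetical): a one-slot fake workqueue stands in for nfsiod, and a plain integer stands in for the dentry reference that rpc_release drops.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the close(2) race: queue_work() stands in for queueing
 * rpc_release to nfsiod; run_workqueue() is the worker getting around
 * to it some time after close() has already returned. */
typedef void (*work_fn)(void *);
static work_fn pending_work;
static void *pending_arg;

static void queue_work(work_fn fn, void *arg)
{
	pending_work = fn;
	pending_arg = arg;
}

static void run_workqueue(void)
{
	if (pending_work) {
		pending_work(pending_arg);
		pending_work = NULL;
	}
}

/* Models rpc_release, where the final dput() on the silly-renamed
 * dentry happens. */
static void toy_rpc_release(void *arg)
{
	int *dentry_refs = arg;
	(*dentry_refs)--;
}

/* wait = 0: release is queued to the workqueue (pre-patch behavior).
 * wait = 1: release runs synchronously before we return (the fix).
 * Returns the refcount a subsequent rmdir would observe. */
static int toy_do_close(int wait, int *dentry_refs)
{
	*dentry_refs = 1;		/* the CLOSE task's dentry reference */
	if (wait)
		toy_rpc_release(dentry_refs);
	else
		queue_work(toy_rpc_release, dentry_refs);
	return *dentry_refs;
}
```

In the async case the caller returns while the reference is still held, which is exactly the window in which connectathon's rmdir runs and fails.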

Signed-off-by: Jeff Layton <[email protected]>
---
fs/nfs/nfs4proc.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 78936a8..4cabfea 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -1988,11 +1988,14 @@ int nfs4_do_close(struct path *path, struct nfs4_state *state, gfp_t gfp_mask, i
 		.rpc_client = server->client,
 		.rpc_message = &msg,
 		.callback_ops = &nfs4_close_ops,
-		.workqueue = nfsiod_workqueue,
 		.flags = RPC_TASK_ASYNC,
 	};
 	int status = -ENOMEM;
 
+	/* rpc_release must be synchronous too if "wait" is set */
+	if (!wait)
+		task_setup_data.workqueue = nfsiod_workqueue;
+
 	calldata = kzalloc(sizeof(*calldata), gfp_mask);
 	if (calldata == NULL)
 		goto out;
--
1.7.4



2011-02-18 20:54:25

by Myklebust, Trond

Subject: Re: [PATCH] nfs: don't queue synchronous NFSv4 close rpc_release to nfsiod

On Fri, 2011-02-18 at 15:04 -0500, Jeff Layton wrote:
> On Thu, 17 Feb 2011 16:37:41 -0500
> Jeff Layton <[email protected]> wrote:
>
> > On Thu, 17 Feb 2011 11:47:24 -0800
> > Trond Myklebust <[email protected]> wrote:
> >
> > > On Thu, 2011-02-17 at 10:10 -0500, Jeff Layton wrote:
> > > > On Thu, 17 Feb 2011 08:40:52 -0500
> > > > Jeff Layton <[email protected]> wrote:
> > > >
> > > > >
> > > > > Looks like it finally failed on the 39th pass:
> > > > >
> > > > > second check for lost reply on non-idempotent requests
> > > > > testing 50 idempotencies in directory "testdir"
> > > > > rmdir 1: Directory not empty
> > > > > special tests failed
> > > > >
> > > > > When I look in the directory (several hours after it failed), the
> > > > > silly-renamed file is still there:
> > > > >
> > > > > -rw---x--x. 1 root root 30 Feb 16 15:04 .nfs000000000000002d00000090
> > > > >
> > > > > ...so I'm not sure what exactly is wrong yet, but it looks like the
> > > > > silly delete just never happened. Maybe there's a dentry refcount leak
> > > > > of some sort? There are no queued RPC's.
> > > > >
> > > > > I'll keep looking at it but if you have ideas as to what it could be,
> > > > > let me know.
> > > > >
> > > >
> > > > I walked down the directory tree in crash on the live kernel and found
> > > > the dentry. The d_count is 0x0, so I'm not clear on why it didn't get
> > > > cleaned up:
> > > >
> > > > crash> struct dentry.d_flags,d_count,d_name 0xffff880017d46a80
> > > > d_flags = 0xc000,
> > > > d_count = 0x0,
> > > > d_name = {
> > > > hash = 0xe08ab5c8,
> > > > len = 0x1c,
> > > > name = 0xffff880017d46ab8 ".nfs000000000000002d00000090"
> > > > },
> > > >
> > > >
> > > > The d_flags are:
> > > >
> > > > #define DCACHE_OP_REVALIDATE 0x4000
> > > > #define DCACHE_OP_DELETE 0x8000
> > > >
> > > > ...very odd. I'd have expected to see this one set too:
> > > >
> > > > #define DCACHE_NFSFS_RENAMED 0x0002
> > > >
> > > > I suppose the async sillyrename call could have failed and we ended up
> > > > calling nfs_cancel_async_unlink? I'll stick in some printk's around
> > > > that area and see if I can figure out what's going on...
> > > >
> > >
> > > Perhaps I missed a call site that needs an rpc_put_task_async()?
> > >
> >
> > I'm not sure that would explain what we're seeing here, unless I'm just
> > missing something. I do know that nfs_cancel_async_unlink was never
> > called in my latest test run, so that does not seem to be it.
> >
> > Another anomaly -- d_fsdata is NULL. How do we get into that state with
> > a dentry that has been silly-renamed? FWIW, here's the entire dentry
> > struct with my latest reproducer. Let me know if anything stands out to
> > you...
> >
>
> My current thinking is that the problems I'm seeing here are not
> directly related to the bug we're trying to fix. It looks like
> sillyrename is still just plain broken à la your email entitled:
>
> 2.6.38-rc2... NFS sillyrename is broken...
>
> ...I'll plan to keep an eye out for the fix from Nick, and re-test this
> when it hits.
>
> Thanks,

Oh, I'm sorry. I wasn't aware you were testing on 2.6.38-rc...

Yes, sillyrename is still broken there due to the NULL parent dentry
being passed to the d_iput() callback. Nick promised us a patch, but
hasn't delivered yet.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-02-16 14:26:16

by Myklebust, Trond

Subject: Re: [PATCH] nfs: don't queue synchronous NFSv4 close rpc_release to nfsiod

On Wed, 2011-02-16 at 09:09 -0500, Trond Myklebust wrote:
> On Tue, 2011-02-15 at 18:47 -0500, Trond Myklebust wrote:
> > diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> > index 243fc09..11b71b1 100644
> > --- a/net/sunrpc/sched.c
> > +++ b/net/sunrpc/sched.c
> > @@ -252,23 +252,39 @@ static void rpc_set_active(struct rpc_task *task)
> >
> >  /*
> >   * Mark an RPC call as having completed by clearing the 'active' bit
> > + * and then waking up all tasks that were sleeping.
> >   */
> > -static void rpc_mark_complete_task(struct rpc_task *task)
> > +static int rpc_complete_task(struct rpc_task *task)
> >  {
> > -	smp_mb__before_clear_bit();
> > +	void *m = &task->tk_runstate;
> > +	wait_queue_head_t *wq = bit_waitqueue(m, RPC_TASK_ACTIVE);
> > +	struct wait_bit_key k = __WAIT_BIT_KEY_INITIALIZER(m, RPC_TASK_ACTIVE);
> > +	unsigned long flags;
> > +	int ret;
> > +
> > +	spin_lock_irqsave(&wq->lock, flags);
> >  	clear_bit(RPC_TASK_ACTIVE, &task->tk_runstate);
> > -	smp_mb__after_clear_bit();
> > -	wake_up_bit(&task->tk_runstate, RPC_TASK_ACTIVE);
> > +	if (waitqueue_active(wq))
> > +		__wake_up_locked_key(wq, TASK_NORMAL, &k);
> > +	ret = atomic_dec_and_test(&task->tk_count);
> > +	spin_unlock_irqrestore(&wq->lock, flags);
> > +	return ret;
> >  }
> >  
> >  /*
> >   * Allow callers to wait for completion of an RPC call
> > + *
> > + * Note the use of out_of_line_wait_on_bit() rather than wait_on_bit()
> > + * to enforce taking of the wq->lock and hence avoid races with
> > + * rpc_complete_task().
> >   */
> >  int __rpc_wait_for_completion_task(struct rpc_task *task, int (*action)(void *))
> >  {
> > +	BUG_ON(!RPC_IS_ASYNC(task));
> > +
> >  	if (action == NULL)
> >  		action = rpc_wait_bit_killable;
> > -	return wait_on_bit(&task->tk_runstate, RPC_TASK_ACTIVE,
> > +	return out_of_line_wait_on_bit(&task->tk_runstate, RPC_TASK_ACTIVE,
> >  			action, TASK_KILLABLE);
> >  }
> >  EXPORT_SYMBOL_GPL(__rpc_wait_for_completion_task);
>
> Never mind. The above ordering scheme is broken by the fact that
> 'wake_bit_function' calls autoremove_wake_function, which again means
> that finish_wait optimises away the spin lock.
>
> Back to the drawing board...

...and after yet another cup of coffee the solution suggests itself: we
just need to reorder the calls to __wake_up_locked_key and the
atomic_dec_and_test in rpc_complete_task. The revised patch is appended.
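The invariant the reordering buys can be modeled outside the kernel. With tk_count at 2 (one reference for the completing rpciod side, one for the caller sleeping in rpc_wait_for_completion_task), decrementing while still inside the waitqueue-lock critical section guarantees exactly one side observes the count hit zero, so exactly one side frees the task. A toy sketch, with hypothetical names and the lock shown only as comments:

```c
#include <assert.h>

/* Toy rpc_task: tk_count starts at 2 -- one reference held by the
 * completing (rpciod) side, one by the waiter. */
struct toy_task {
	int tk_count;
	int active;
};

/* Completing side, mirroring the reordered rpc_complete_task(): clear
 * ACTIVE and drop our reference under the (commented) waitqueue lock,
 * then wake the waiter.  Returns nonzero if ours was the last
 * reference, in which case this side -- and only this side -- frees. */
static int toy_complete_task(struct toy_task *t)
{
	int last;

	/* spin_lock_irqsave(&wq->lock, flags); */
	t->active = 0;
	last = (--t->tk_count == 0);
	/* __wake_up_locked_key(wq, TASK_NORMAL, &k); */
	/* spin_unlock_irqrestore(&wq->lock, flags); */
	return last;
}

/* Waiter side once it observes !active: drop its reference; it frees
 * the task only if that was the last one. */
static int toy_put_task(struct toy_task *t)
{
	return (--t->tk_count == 0);
}
```

Whichever decrement runs second sees zero, so the free happens exactly once and never while the other side can still touch the task.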

Cheers
Trond

8<----------------------------------------------------------------------------

2011-02-17 21:33:42

by Jeff Layton

Subject: Re: [PATCH] nfs: don't queue synchronous NFSv4 close rpc_release to nfsiod

On Thu, 17 Feb 2011 11:47:24 -0800
Trond Myklebust <[email protected]> wrote:

> On Thu, 2011-02-17 at 10:10 -0500, Jeff Layton wrote:
> > On Thu, 17 Feb 2011 08:40:52 -0500
> > Jeff Layton <[email protected]> wrote:
> >
> > >
> > > Looks like it finally failed on the 39th pass:
> > >
> > > second check for lost reply on non-idempotent requests
> > > testing 50 idempotencies in directory "testdir"
> > > rmdir 1: Directory not empty
> > > special tests failed
> > >
> > > When I look in the directory (several hours after it failed), the
> > > silly-renamed file is still there:
> > >
> > > -rw---x--x. 1 root root 30 Feb 16 15:04 .nfs000000000000002d00000090
> > >
> > > ...so I'm not sure what exactly is wrong yet, but it looks like the
> > > silly delete just never happened. Maybe there's a dentry refcount leak
> > > of some sort? There are no queued RPC's.
> > >
> > > I'll keep looking at it but if you have ideas as to what it could be,
> > > let me know.
> > >
> >
> > I walked down the directory tree in crash on the live kernel and found
> > the dentry. The d_count is 0x0, so I'm not clear on why it didn't get
> > cleaned up:
> >
> > crash> struct dentry.d_flags,d_count,d_name 0xffff880017d46a80
> > d_flags = 0xc000,
> > d_count = 0x0,
> > d_name = {
> > hash = 0xe08ab5c8,
> > len = 0x1c,
> > name = 0xffff880017d46ab8 ".nfs000000000000002d00000090"
> > },
> >
> >
> > The d_flags are:
> >
> > #define DCACHE_OP_REVALIDATE 0x4000
> > #define DCACHE_OP_DELETE 0x8000
> >
> > ...very odd. I'd have expected to see this one set too:
> >
> > #define DCACHE_NFSFS_RENAMED 0x0002
> >
> > I suppose the async sillyrename call could have failed and we ended up
> > calling nfs_cancel_async_unlink? I'll stick in some printk's around
> > that area and see if I can figure out what's going on...
> >
>
> Perhaps I missed a call site that needs an rpc_put_task_async()?
>

I'm not sure that would explain what we're seeing here, unless I'm just
missing something. I do know that nfs_cancel_async_unlink was never
called in my latest test run, so that does not seem to be it.

Another anomaly -- d_fsdata is NULL. How do we get into that state with
a dentry that has been silly-renamed? FWIW, here's the entire dentry
struct with my latest reproducer. Let me know if anything stands out to
you...

ffff880017d68e70
struct dentry {
  d_flags = 0xc008,
  d_seq = {
    sequence = 0x2
  },
  d_hash = {
    next = 0x0,
    pprev = 0xffffc900000b45d0
  },
  d_parent = 0xffff880017e50e70,
  d_name = {
    hash = 0xe08a8918,
    len = 0x1c,
    name = 0xffff880017d68ea8 ".nfs000000000000002d00000031"
  },
  d_inode = 0xffff88002cec1070,
  d_iname = ".nfs000000000000002d00000031\000kkk",
  d_count = 0x0,
  d_lock = {
    {
      rlock = {
        raw_lock = {
          slock = 0x180018
        },
        magic = 0xdead4ead,
        owner_cpu = 0xffffffff,
        owner = 0xffffffffffffffff,
        dep_map = {
          key = 0xffffffff827df728,
          class_cache = {0x0, 0x0},
          name = 0xffffffff817cb493 "&(&dentry->d_lock)->rlock",
          cpu = 0x1,
          ip = 0xffffffff811467bc
        }
      },
      {
        __padding = "\030\000\030\000\255N\255\336\377\377\377\377kkkk\377\377\377\377\377\377\377\377",
        dep_map = {
          key = 0xffffffff827df728,
          class_cache = {0x0, 0x0},
          name = 0xffffffff817cb493 "&(&dentry->d_lock)->rlock",
          cpu = 0x1,
          ip = 0xffffffff811467bc
        }
      }
    }
  },
  d_op = 0xffffffffa01402c0,
  d_sb = 0xffff88003436b9f8,
  d_time = 0x5a5a5a5a5a5a5a65,
  d_fsdata = 0x0,
  d_lru = {
    next = 0xffff880017e50f38,
    prev = 0xffff88003436bbd0
  },
  d_u = {
    d_child = {
      next = 0xffff880017e50f58,
      prev = 0xffff880017e50f58
    },
    d_rcu = {
      next = 0xffff880017e50f58,
      func = 0xffff880017e50f58
    }
  },
  d_subdirs = {
    next = 0xffff880017d68f58,
    prev = 0xffff880017d68f58
  },
  d_alias = {
    next = 0xffff88002cec11d8,
    prev = 0xffff88002cec11d8
  }
}

--
Jeff Layton <[email protected]>

2011-02-18 20:00:35

by Jeff Layton

Subject: Re: [PATCH] nfs: don't queue synchronous NFSv4 close rpc_release to nfsiod

On Thu, 17 Feb 2011 16:37:41 -0500
Jeff Layton <[email protected]> wrote:

> On Thu, 17 Feb 2011 11:47:24 -0800
> Trond Myklebust <[email protected]> wrote:
>
> > On Thu, 2011-02-17 at 10:10 -0500, Jeff Layton wrote:
> > > On Thu, 17 Feb 2011 08:40:52 -0500
> > > Jeff Layton <[email protected]> wrote:
> > >
> > > >
> > > > Looks like it finally failed on the 39th pass:
> > > >
> > > > second check for lost reply on non-idempotent requests
> > > > testing 50 idempotencies in directory "testdir"
> > > > rmdir 1: Directory not empty
> > > > special tests failed
> > > >
> > > > When I look in the directory (several hours after it failed), the
> > > > silly-renamed file is still there:
> > > >
> > > > -rw---x--x. 1 root root 30 Feb 16 15:04 .nfs000000000000002d00000090
> > > >
> > > > ...so I'm not sure what exactly is wrong yet, but it looks like the
> > > > silly delete just never happened. Maybe there's a dentry refcount leak
> > > > of some sort? There are no queued RPC's.
> > > >
> > > > I'll keep looking at it but if you have ideas as to what it could be,
> > > > let me know.
> > > >
> > >
> > > I walked down the directory tree in crash on the live kernel and found
> > > the dentry. The d_count is 0x0, so I'm not clear on why it didn't get
> > > cleaned up:
> > >
> > > crash> struct dentry.d_flags,d_count,d_name 0xffff880017d46a80
> > > d_flags = 0xc000,
> > > d_count = 0x0,
> > > d_name = {
> > > hash = 0xe08ab5c8,
> > > len = 0x1c,
> > > name = 0xffff880017d46ab8 ".nfs000000000000002d00000090"
> > > },
> > >
> > >
> > > The d_flags are:
> > >
> > > #define DCACHE_OP_REVALIDATE 0x4000
> > > #define DCACHE_OP_DELETE 0x8000
> > >
> > > ...very odd. I'd have expected to see this one set too:
> > >
> > > #define DCACHE_NFSFS_RENAMED 0x0002
> > >
> > > I suppose the async sillyrename call could have failed and we ended up
> > > calling nfs_cancel_async_unlink? I'll stick in some printk's around
> > > that area and see if I can figure out what's going on...
> > >
> >
> > Perhaps I missed a call site that needs an rpc_put_task_async()?
> >
>
> I'm not sure that would explain what we're seeing here, unless I'm just
> missing something. I do know that nfs_cancel_async_unlink was never
> called in my latest test run, so that does not seem to be it.
>
> Another anomaly -- d_fsdata is NULL. How do we get into that state with
> a dentry that has been silly-renamed? FWIW, here's the entire dentry
> struct with my latest reproducer. Let me know if anything stands out to
> you...
>

My current thinking is that the problems I'm seeing here are not
directly related to the bug we're trying to fix. It looks like
sillyrename is still just plain broken à la your email entitled:

2.6.38-rc2... NFS sillyrename is broken...

...I'll plan to keep an eye out for the fix from Nick, and re-test this
when it hits.

Thanks,
--
Jeff Layton <[email protected]>

2011-02-17 13:40:56

by Jeff Layton

Subject: Re: [PATCH] nfs: don't queue synchronous NFSv4 close rpc_release to nfsiod

On Wed, 16 Feb 2011 13:13:07 -0500
Jeff Layton <[email protected]> wrote:

> On Wed, 16 Feb 2011 10:21:17 -0500
> Trond Myklebust <[email protected]> wrote:
>
> > On Wed, 2011-02-16 at 09:50 -0500, Jeff Layton wrote:
> > > Thanks Trond,
> > >
> > > This builds, but I can't plug in the module:
> > >
> > > [ 103.540405] sunrpc: Unknown symbol __wake_up_locked_key (err 0)
> > >
> > > ...I think __wake_up_locked_key will need to be exported too. I'll do
> > > that and then test this out later today.
> >
> > Thanks! I've added an EXPORT_SYMBOL_GPL() for __wake_up_locked_key to
> > the patch.
> >
>
> So far this patch looks good. I've been able to reproduce the problem
> much more reliably with this patch and running the cthon special tests:
>
> ---------------------------[snip]-------------------------
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 40381d2..0e3d75f 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -1849,6 +1849,7 @@ static void nfs4_free_closedata(void *data)
> struct nfs4_closedata *calldata = data;
> struct nfs4_state_owner *sp = calldata->state->owner;
>
> + msleep(100);
> if (calldata->roc)
> pnfs_roc_release(calldata->state->inode);
> nfs4_put_open_state(calldata->state);
> ---------------------------[snip]-------------------------
>
> ...with your patch on top of that, I've not been able to reproduce the
> problem so far after around 20 passes. I'll plan to let the tests run
> this evening to make sure, but initial results are good.
>

Looks like it finally failed on the 39th pass:

second check for lost reply on non-idempotent requests
testing 50 idempotencies in directory "testdir"
rmdir 1: Directory not empty
special tests failed

When I look in the directory (several hours after it failed), the
silly-renamed file is still there:

-rw---x--x. 1 root root 30 Feb 16 15:04 .nfs000000000000002d00000090

...so I'm not sure what exactly is wrong yet, but it looks like the
silly delete just never happened. Maybe there's a dentry refcount leak
of some sort? There are no queued RPC's.

I'll keep looking at it but if you have ideas as to what it could be,
let me know.

--
Jeff Layton <[email protected]>

2011-02-16 18:13:09

by Jeff Layton

Subject: Re: [PATCH] nfs: don't queue synchronous NFSv4 close rpc_release to nfsiod

On Wed, 16 Feb 2011 10:21:17 -0500
Trond Myklebust <[email protected]> wrote:

> On Wed, 2011-02-16 at 09:50 -0500, Jeff Layton wrote:
> > Thanks Trond,
> >
> > This builds, but I can't plug in the module:
> >
> > [ 103.540405] sunrpc: Unknown symbol __wake_up_locked_key (err 0)
> >
> > ...I think __wake_up_locked_key will need to be exported too. I'll do
> > that and then test this out later today.
>
> Thanks! I've added an EXPORT_SYMBOL_GPL() for __wake_up_locked_key to
> the patch.
>

So far this patch looks good. I've been able to reproduce the problem
much more reliably with this patch and running the cthon special tests:

---------------------------[snip]-------------------------
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 40381d2..0e3d75f 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -1849,6 +1849,7 @@ static void nfs4_free_closedata(void *data)
 	struct nfs4_closedata *calldata = data;
 	struct nfs4_state_owner *sp = calldata->state->owner;
 
+	msleep(100);
 	if (calldata->roc)
 		pnfs_roc_release(calldata->state->inode);
 	nfs4_put_open_state(calldata->state);
---------------------------[snip]-------------------------

...with your patch on top of that, I've not been able to reproduce the
problem so far after around 20 passes. I'll plan to let the tests run
this evening to make sure, but initial results are good.

Cheers,
--
Jeff Layton <[email protected]>

2011-02-17 15:10:17

by Jeff Layton

Subject: Re: [PATCH] nfs: don't queue synchronous NFSv4 close rpc_release to nfsiod

On Thu, 17 Feb 2011 08:40:52 -0500
Jeff Layton <[email protected]> wrote:

>
> Looks like it finally failed on the 39th pass:
>
> second check for lost reply on non-idempotent requests
> testing 50 idempotencies in directory "testdir"
> rmdir 1: Directory not empty
> special tests failed
>
> When I look in the directory (several hours after it failed), the
> silly-renamed file is still there:
>
> -rw---x--x. 1 root root 30 Feb 16 15:04 .nfs000000000000002d00000090
>
> ...so I'm not sure what exactly is wrong yet, but it looks like the
> silly delete just never happened. Maybe there's a dentry refcount leak
> of some sort? There are no queued RPC's.
>
> I'll keep looking at it but if you have ideas as to what it could be,
> let me know.
>

I walked down the directory tree in crash on the live kernel and found
the dentry. The d_count is 0x0, so I'm not clear on why it didn't get
cleaned up:

crash> struct dentry.d_flags,d_count,d_name 0xffff880017d46a80
  d_flags = 0xc000,
  d_count = 0x0,
  d_name = {
    hash = 0xe08ab5c8,
    len = 0x1c,
    name = 0xffff880017d46ab8 ".nfs000000000000002d00000090"
  },


The d_flags are:

#define DCACHE_OP_REVALIDATE 0x4000
#define DCACHE_OP_DELETE 0x8000

...very odd. I'd have expected to see this one set too:

#define DCACHE_NFSFS_RENAMED 0x0002

I suppose the async sillyrename call could have failed and we ended up
calling nfs_cancel_async_unlink? I'll stick in some printk's around
that area and see if I can figure out what's going on...

--
Jeff Layton <[email protected]>

2011-02-16 15:21:19

by Myklebust, Trond

Subject: Re: [PATCH] nfs: don't queue synchronous NFSv4 close rpc_release to nfsiod

On Wed, 2011-02-16 at 09:50 -0500, Jeff Layton wrote:
> Thanks Trond,
>
> This builds, but I can't plug in the module:
>
> [ 103.540405] sunrpc: Unknown symbol __wake_up_locked_key (err 0)
>
> ...I think __wake_up_locked_key will need to be exported too. I'll do
> that and then test this out later today.

Thanks! I've added an EXPORT_SYMBOL_GPL() for __wake_up_locked_key to
the patch.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-02-15 23:47:10

by Myklebust, Trond

Subject: Re: [PATCH] nfs: don't queue synchronous NFSv4 close rpc_release to nfsiod

On Tue, 2011-02-15 at 11:30 -0500, Jeff Layton wrote:
> On Tue, 15 Feb 2011 10:31:38 -0500
> Trond Myklebust <[email protected]> wrote:
>
> > On Tue, 2011-02-15 at 09:58 -0500, Jeff Layton wrote:
> > > I recently had some of our QA people report some connectathon test
> > > failures in RHEL5 (2.6.18-based kernel). For some odd reason (maybe
> > > scheduling differences that make the race more likely?) the problem
> > > occurs more frequently on s390.
> > >
> > > The problem generally manifests itself on NFSv4 as a race where an rmdir
> > > fails because a silly-renamed file in the directory wasn't deleted in
> > > time. In traces, you typically see the rmdir attempt fail, with the
> > > sillydelete of the file that prevented it following very soon
> > > afterward.
> > >
> > > Silly deletes are handled via dentry_iput and in the case of a close on
> > > NFSv4, the last dentry reference is often held by the CLOSE RPC task.
> > > nfs4_do_close does the close as an async RPC task that it conditionally
> > > waits on depending on whether the close is synchronous or not.
> > >
> > > It also sets the workqueue for the task to nfsiod_workqueue. When
> > > tk_workqueue is set, the rpc_release operation is queued to that
> > > workqueue. rpc_release is where the dentry reference held by the task is
> > > put. The caller has no way to wait for that to complete, so the close(2)
> > > syscall can easily return before the rpc_release call is ever done. In
> > > some cases, that rpc_release is delayed long enough to prevent a
> > > subsequent rmdir of the containing directory.
> > >
> > > I believe this is a bug, or at least not ideal behavior. We should try
> > > not to have the close(2) call return in this situation until the
> > > sillydelete is done.
> > >
> > > I've been able to reproduce this more reliably by adding a 100ms sleep
> > > at the top of nfs4_free_closedata. I've not seen it "in the wild" on
> > > mainline kernels, but it seems quite possible when a machine is heavily
> > > loaded.
<snip>
> > There is no guarantee that rpc_wait_for_completion_task() will wait
> > until rpciod had called rpc_put_task(), in which case the above change
> > ends up with a dput() on the rpciod workqueue, and potential deadlocks.
> >
> > Let's rather look at making rpc_put_task a little bit more safe, by
> > ensuring that rpciod always calls queue_work(), and by allowing other
> > callers to switch it off.
> >
>
> Ok, I see the potential deadlock. To make sure I understand correctly,
> is this what you'd like to see?
>
> 1) If tk_workqueue is set, then queue rpc_release to that.
>
> 2) If it's not, queue it to rpciod.
>
> 3) add a flag to rpc_put_task (or a variant of that function) that
> calls rpc_release synchronously, and have nfs4_do_close use that
> variant when it puts the task.
>
> ...does that sound correct? If so, then I'm not sure that it'll 100%
> close the race. It's still possible that the rpc_put_task call in
> rpc_release_task would be the last one. If so and we queue it to the
> workqueue, then we're back in the same situation.

Agreed. The problem is lack of atomicity between rpc_complete_task() and
rpc_put_task(). How about something like the following?

Cheers
Trond
8<-----------------------------------------------------------------------------------------

2011-02-23 20:13:24

by Jeff Layton

Subject: Re: [PATCH] nfs: don't queue synchronous NFSv4 close rpc_release to nfsiod

On Fri, 18 Feb 2011 12:54:07 -0800
Trond Myklebust <[email protected]> wrote:

> On Fri, 2011-02-18 at 15:04 -0500, Jeff Layton wrote:
> > On Thu, 17 Feb 2011 16:37:41 -0500
> > Jeff Layton <[email protected]> wrote:
> >
> > > On Thu, 17 Feb 2011 11:47:24 -0800
> > > Trond Myklebust <[email protected]> wrote:
> > >
> > > > On Thu, 2011-02-17 at 10:10 -0500, Jeff Layton wrote:
> > > > > On Thu, 17 Feb 2011 08:40:52 -0500
> > > > > Jeff Layton <[email protected]> wrote:
> > > > >
> > > > > >
> > > > > > Looks like it finally failed on the 39th pass:
> > > > > >
> > > > > > second check for lost reply on non-idempotent requests
> > > > > > testing 50 idempotencies in directory "testdir"
> > > > > > rmdir 1: Directory not empty
> > > > > > special tests failed
> > > > > >
> > > > > > When I look in the directory (several hours after it failed), the
> > > > > > silly-renamed file is still there:
> > > > > >
> > > > > > -rw---x--x. 1 root root 30 Feb 16 15:04 .nfs000000000000002d00000090
> > > > > >
> > > > > > ...so I'm not sure what exactly is wrong yet, but it looks like the
> > > > > > silly delete just never happened. Maybe there's a dentry refcount leak
> > > > > > of some sort? There are no queued RPC's.
> > > > > >
> > > > > > I'll keep looking at it but if you have ideas as to what it could be,
> > > > > > let me know.
> > > > > >
> > > > >
> > > > > I walked down the directory tree in crash on the live kernel and found
> > > > > the dentry. The d_count is 0x0, so I'm not clear on why it didn't get
> > > > > cleaned up:
> > > > >
> > > > > crash> struct dentry.d_flags,d_count,d_name 0xffff880017d46a80
> > > > > d_flags = 0xc000,
> > > > > d_count = 0x0,
> > > > > d_name = {
> > > > > hash = 0xe08ab5c8,
> > > > > len = 0x1c,
> > > > > name = 0xffff880017d46ab8 ".nfs000000000000002d00000090"
> > > > > },
> > > > >
> > > > >
> > > > > The d_flags are:
> > > > >
> > > > > #define DCACHE_OP_REVALIDATE 0x4000
> > > > > #define DCACHE_OP_DELETE 0x8000
> > > > >
> > > > > ...very odd. I'd have expected to see this one set too:
> > > > >
> > > > > #define DCACHE_NFSFS_RENAMED 0x0002
> > > > >
> > > > > I suppose the async sillyrename call could have failed and we ended up
> > > > > calling nfs_cancel_async_unlink? I'll stick in some printk's around
> > > > > that area and see if I can figure out what's going on...
> > > > >
> > > >
> > > > Perhaps I missed a call site that needs an rpc_put_task_async()?
> > > >
> > >
> > > I'm not sure that would explain what we're seeing here, unless I'm just
> > > missing something. I do know that nfs_cancel_async_unlink was never
> > > called in my latest test run, so that does not seem to be it.
> > >
> > > Another anomaly -- d_fsdata is NULL. How do we get into that state with
> > > a dentry that has been silly-renamed? FWIW, here's the entire dentry
> > > struct with my latest reproducer. Let me know if anything stands out to
> > > you...
> > >
> >
> > My current thinking is that the problems I'm seeing here are not
> > directly related to the bug we're trying to fix. It looks like
> > sillyrename is still just plain broken à la your email entitled:
> >
> > 2.6.38-rc2... NFS sillyrename is broken...
> >
> > ...I'll plan to keep an eye out for the fix from Nick, and re-test this
> > when it hits.
> >
> > Thanks,
>
> Oh, I'm sorry. I wasn't aware you were testing on 2.6.38-rc...
>
> Yes, sillyrename is still broken there due to the NULL parent dentry
> being passed to the d_iput() callback. Nick promised us a patch, but
> hasn't delivered yet.
>

Thanks. I went back and tested this patch along with my trivial patch
that adds an artificial delay in the rpc_release callback. This patch
seems to fix the race.

--
Jeff Layton <[email protected]>

2011-02-15 15:31:55

by Myklebust, Trond

Subject: Re: [PATCH] nfs: don't queue synchronous NFSv4 close rpc_release to nfsiod

On Tue, 2011-02-15 at 09:58 -0500, Jeff Layton wrote:
> I recently had some of our QA people report some connectathon test
> failures in RHEL5 (2.6.18-based kernel). For some odd reason (maybe
> scheduling differences that make the race more likely?) the problem
> occurs more frequently on s390.
>
> The problem generally manifests itself on NFSv4 as a race where an rmdir
> fails because a silly-renamed file in the directory wasn't deleted in
> time. In traces, you typically see the rmdir attempt fail, with the
> sillydelete of the file that prevented it following very soon
> afterward.
>
> Silly deletes are handled via dentry_iput and in the case of a close on
> NFSv4, the last dentry reference is often held by the CLOSE RPC task.
> nfs4_do_close does the close as an async RPC task that it conditionally
> waits on depending on whether the close is synchronous or not.
>
> It also sets the workqueue for the task to nfsiod_workqueue. When
> tk_workqueue is set, the rpc_release operation is queued to that
> workqueue. rpc_release is where the dentry reference held by the task is
> put. The caller has no way to wait for that to complete, so the close(2)
> syscall can easily return before the rpc_release call is ever done. In
> some cases, that rpc_release is delayed long enough to prevent a
> subsequent rmdir of the containing directory.
>
> I believe this is a bug, or at least not ideal behavior. We should try
> not to have the close(2) call return in this situation until the
> sillydelete is done.
>
> I've been able to reproduce this more reliably by adding a 100ms sleep
> at the top of nfs4_free_closedata. I've not seen it "in the wild" on
> mainline kernels, but it seems quite possible when a machine is heavily
> loaded.
>
> This patch fixes this by not setting tk_workqueue in nfs4_do_close when
> the wait flag is set. This makes the final rpc_put_task a synchronous
> operation and should prevent close(2) from returning before the
> dentry_iput is done.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfs/nfs4proc.c | 5 ++++-
> 1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 78936a8..4cabfea 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -1988,11 +1988,14 @@ int nfs4_do_close(struct path *path, struct nfs4_state *state, gfp_t gfp_mask, i
> .rpc_client = server->client,
> .rpc_message = &msg,
> .callback_ops = &nfs4_close_ops,
> - .workqueue = nfsiod_workqueue,
> .flags = RPC_TASK_ASYNC,
> };
> int status = -ENOMEM;
>
> + /* rpc_release must be synchronous too if "wait" is set */
> + if (!wait)
> + task_setup_data.workqueue = nfsiod_workqueue;
> +
> calldata = kzalloc(sizeof(*calldata), gfp_mask);
> if (calldata == NULL)
> goto out;

There is no guarantee that rpc_wait_for_completion_task() will wait
until rpciod has called rpc_put_task(), in which case the above change
ends up with a dput() on the rpciod workqueue, and potential deadlocks.

Let's rather look at making rpc_put_task a little bit more safe, by
ensuring that rpciod always calls queue_work(), and by allowing other
callers to switch it off.
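To make that direction concrete, here is a userspace toy model (all names are hypothetical, not the eventual kernel API) of a put where whichever caller drops the final reference decides how the release runs, and puts made from rpciod context always defer it:

```c
/* Toy model of the direction suggested above: the final put decides
 * whether "rpc_release" runs inline or is deferred, instead of the task
 * unconditionally inheriting tk_workqueue. All names are illustrative. */
#include <stdbool.h>

struct fake_task {
	int refcount;          /* stands in for tk_count */
	bool released;         /* set once the "release" callback has run */
	int deferred_releases; /* times release was pushed to a workqueue */
};

/* The release callback; in the kernel this is where dput() happens. */
static void fake_release(struct fake_task *t)
{
	t->released = true;
}

/* A final put from rpciod context must defer the release (queue_work()
 * in the real code; the model just counts it), so dput() never runs on
 * the rpciod workqueue. Other callers may release synchronously, which
 * is what a waiting nfs4_do_close() wants. */
static void fake_put_task(struct fake_task *t, bool from_rpciod)
{
	if (--t->refcount != 0)
		return;
	if (from_rpciod)
		t->deferred_releases++; /* would be queue_work() */
	fake_release(t);
}
```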

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-02-17 19:47:41

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH] nfs: don't queue synchronous NFSv4 close rpc_release to nfsiod

On Thu, 2011-02-17 at 10:10 -0500, Jeff Layton wrote:
> On Thu, 17 Feb 2011 08:40:52 -0500
> Jeff Layton <[email protected]> wrote:
>
> >
> > Looks like it finally failed on the 39th pass:
> >
> > second check for lost reply on non-idempotent requests
> > testing 50 idempotencies in directory "testdir"
> > rmdir 1: Directory not empty
> > special tests failed
> >
> > When I look in the directory (several hours after it failed), the
> > silly-renamed file is still there:
> >
> > -rw---x--x. 1 root root 30 Feb 16 15:04 .nfs000000000000002d00000090
> >
> > ...so I'm not sure what exactly is wrong yet, but it looks like the
> > silly delete just never happened. Maybe there's a dentry refcount leak
> > of some sort? There are no queued RPC's.
> >
> > I'll keep looking at it but if you have ideas as to what it could be,
> > let me know.
> >
>
> I walked down the directory tree in crash on the live kernel and found
> the dentry. The d_count is 0x0, so I'm not clear on why it didn't get
> cleaned up:
>
> crash> struct dentry.d_flags,d_count,d_name 0xffff880017d46a80
> d_flags = 0xc000,
> d_count = 0x0,
> d_name = {
> hash = 0xe08ab5c8,
> len = 0x1c,
> name = 0xffff880017d46ab8 ".nfs000000000000002d00000090"
> },
>
>
> The d_flags are:
>
> #define DCACHE_OP_REVALIDATE 0x4000
> #define DCACHE_OP_DELETE 0x8000
>
> ...very odd. I'd have expected to see this one set too:
>
> #define DCACHE_NFSFS_RENAMED 0x0002
>
> I suppose the async sillyrename call could have failed and we ended up
> calling nfs_cancel_async_unlink? I'll stick in some printk's around
> that area and see if I can figure out what's going on...
>
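As a quick sanity check on the flag decoding in the message above (DCACHE_* values as quoted from include/linux/dcache.h), 0xc000 is exactly those two OP_* flags, with DCACHE_NFSFS_RENAMED indeed absent:

```c
/* Verify the d_flags decoding: 0xc000 == DCACHE_OP_REVALIDATE |
 * DCACHE_OP_DELETE, and the NFSFS_RENAMED bit is not set. */
#define DCACHE_NFSFS_RENAMED 0x0002
#define DCACHE_OP_REVALIDATE 0x4000
#define DCACHE_OP_DELETE     0x8000

static int flags_decode_ok(unsigned long d_flags)
{
	return d_flags == (DCACHE_OP_REVALIDATE | DCACHE_OP_DELETE) &&
	       !(d_flags & DCACHE_NFSFS_RENAMED);
}
```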

Perhaps I missed a call site that needs an rpc_put_task_async()?

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-02-15 15:54:06

by Mkrtchyan, Tigran

[permalink] [raw]
Subject: Re: [PATCH] nfs: don't queue synchronous NFSv4 close rpc_release to nfsiod

I have seen this as well, but was interpreting it as a problem in my server.

Tigran.

On 02/15/2011 03:58 PM, Jeff Layton wrote:
> I recently had some of our QA people report some connectathon test
> failures in RHEL5 (2.6.18-based kernel). For some odd reason (maybe
> scheduling differences that make the race more likely?) the problem
> occurs more frequently on s390.
>
> The problem generally manifests itself on NFSv4 as a race where an rmdir
> fails because a silly-renamed file in the directory wasn't deleted in
> time. Looking at traces, what you usually see is the failing rmdir
> attempt, followed very soon afterward by the sillydelete of the file
> that prevented it.
>
> Silly deletes are handled via dentry_iput and in the case of a close on
> NFSv4, the last dentry reference is often held by the CLOSE RPC task.
> nfs4_do_close does the close as an async RPC task that it conditionally
> waits on depending on whether the close is synchronous or not.
>
> It also sets the workqueue for the task to nfsiod_workqueue. When
> tk_workqueue is set, the rpc_release operation is queued to that
> workqueue. rpc_release is where the dentry reference held by the task is
> put. The caller has no way to wait for that to complete, so the close(2)
> syscall can easily return before the rpc_release call is ever done. In
> some cases, that rpc_release is delayed for long enough to prevent a
> subsequent rmdir of the containing directory.
>
> I believe this is a bug, or at least not ideal behavior. We should try
> not to have the close(2) call return in this situation until the
> sillydelete is done.
>
> I've been able to reproduce this more reliably by adding a 100ms sleep
> at the top of nfs4_free_closedata. I've not seen it "in the wild" on
> mainline kernels, but it seems quite possible when a machine is heavily
> loaded.
>
> This patch fixes this by not setting tk_workqueue in nfs4_do_close when
> the wait flag is set. This makes the final rpc_put_task a synchronous
> operation and should prevent close(2) from returning before the
> dentry_iput is done.
>
> Signed-off-by: Jeff Layton <[email protected]>
> ---
> fs/nfs/nfs4proc.c | 5 ++++-
> 1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 78936a8..4cabfea 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -1988,11 +1988,14 @@ int nfs4_do_close(struct path *path, struct nfs4_state *state, gfp_t gfp_mask, i
> .rpc_client = server->client,
> .rpc_message = &msg,
> .callback_ops = &nfs4_close_ops,
> - .workqueue = nfsiod_workqueue,
> .flags = RPC_TASK_ASYNC,
> };
> int status = -ENOMEM;
>
> + /* rpc_release must be synchronous too if "wait" is set */
> + if (!wait)
> + task_setup_data.workqueue = nfsiod_workqueue;
> +
> calldata = kzalloc(sizeof(*calldata), gfp_mask);
> if (calldata == NULL)
> goto out;


2011-02-16 14:50:09

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] nfs: don't queue synchronous NFSv4 close rpc_release to nfsiod

On Wed, 16 Feb 2011 09:26:13 -0500
Trond Myklebust <[email protected]> wrote:

> On Wed, 2011-02-16 at 09:09 -0500, Trond Myklebust wrote:
> > On Tue, 2011-02-15 at 18:47 -0500, Trond Myklebust wrote:
> > > diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> > > index 243fc09..11b71b1 100644
> > > --- a/net/sunrpc/sched.c
> > > +++ b/net/sunrpc/sched.c
> > > @@ -252,23 +252,39 @@ static void rpc_set_active(struct rpc_task *task)
> > >
> > > /*
> > > * Mark an RPC call as having completed by clearing the 'active' bit
> > > + * and then waking up all tasks that were sleeping.
> > > */
> > > -static void rpc_mark_complete_task(struct rpc_task *task)
> > > +static int rpc_complete_task(struct rpc_task *task)
> > > {
> > > - smp_mb__before_clear_bit();
> > > + void *m = &task->tk_runstate;
> > > + wait_queue_head_t *wq = bit_waitqueue(m, RPC_TASK_ACTIVE);
> > > + struct wait_bit_key k = __WAIT_BIT_KEY_INITIALIZER(m, RPC_TASK_ACTIVE);
> > > + unsigned long flags;
> > > + int ret;
> > > +
> > > + spin_lock_irqsave(&wq->lock, flags);
> > > clear_bit(RPC_TASK_ACTIVE, &task->tk_runstate);
> > > - smp_mb__after_clear_bit();
> > > - wake_up_bit(&task->tk_runstate, RPC_TASK_ACTIVE);
> > > + if (waitqueue_active(wq))
> > > + __wake_up_locked_key(wq, TASK_NORMAL, &k);
> > > + ret = atomic_dec_and_test(&task->tk_count);
> > > + spin_unlock_irqrestore(&wq->lock, flags);
> > > + return ret;
> > > }
> > >
> > > /*
> > > * Allow callers to wait for completion of an RPC call
> > > + *
> > > + * Note the use of out_of_line_wait_on_bit() rather than wait_on_bit()
> > > + * to enforce taking of the wq->lock and hence avoid races with
> > > + * rpc_complete_task().
> > > */
> > > int __rpc_wait_for_completion_task(struct rpc_task *task, int (*action)(void *))
> > > {
> > > + BUG_ON(!RPC_IS_ASYNC(task));
> > > +
> > > if (action == NULL)
> > > action = rpc_wait_bit_killable;
> > > - return wait_on_bit(&task->tk_runstate, RPC_TASK_ACTIVE,
> > > + return out_of_line_wait_on_bit(&task->tk_runstate, RPC_TASK_ACTIVE,
> > > action, TASK_KILLABLE);
> > > }
> > > EXPORT_SYMBOL_GPL(__rpc_wait_for_completion_task);
> >
> > Never mind. The above ordering scheme is broken by the fact that
> > 'wake_bit_function' calls autoremove_wake_function, which again means
> > that finish_wait optimises away the spin lock.
> >
> > Back to the drawing board...
>
> ...and after yet another cup of coffee the solution suggests itself: we
> just need to reorder the calls to __wake_up_locked_key and the
> atomic_dec_and_test in rpc_complete_task. The revised patch is appended.
>
> Cheers
> Trond
>

Thanks Trond,

This builds, but I can't plug in the module:

[ 103.540405] sunrpc: Unknown symbol __wake_up_locked_key (err 0)

...I think __wake_up_locked_key will need to be exported too. I'll do
that and then test this out later today.
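For reference, the discipline the reordered version enforces (clear the bit, wake waiters, and drop the task's own reference all under the same lock, with waiters taking that lock too) can be modeled in userspace. Here pthread mutex/cond primitives stand in for wq->lock and the bit waitqueue; all names are illustrative:

```c
/* Userspace model of rpc_complete_task()/__rpc_wait_for_completion_task():
 * the completer clears "active", wakes waiters, and drops its reference
 * while holding the lock, so a waiter that observes !active cannot race
 * the final put. The waiter must take the same lock to check the bit,
 * mirroring out_of_line_wait_on_bit() versus a plain wait_on_bit(). */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct model_task {
	pthread_mutex_t lock; /* stands in for wq->lock */
	pthread_cond_t done;
	bool active;          /* stands in for RPC_TASK_ACTIVE */
	atomic_int refcount;  /* stands in for tk_count */
};

/* Completer side: returns true if this dropped the last reference, in
 * which case the caller (and only the caller) may free the task. */
static bool model_complete(struct model_task *t)
{
	bool last;

	pthread_mutex_lock(&t->lock);
	t->active = false;
	pthread_cond_broadcast(&t->done);
	last = (atomic_fetch_sub(&t->refcount, 1) == 1);
	pthread_mutex_unlock(&t->lock);
	return last;
}

/* Waiter side: serialized against model_complete() by the same lock. */
static void model_wait(struct model_task *t)
{
	pthread_mutex_lock(&t->lock);
	while (t->active)
		pthread_cond_wait(&t->done, &t->lock);
	pthread_mutex_unlock(&t->lock);
}
```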

--
Jeff Layton <[email protected]>

2011-02-16 14:09:16

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH] nfs: don't queue synchronous NFSv4 close rpc_release to nfsiod

On Tue, 2011-02-15 at 18:47 -0500, Trond Myklebust wrote:
> diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
> index 243fc09..11b71b1 100644
> --- a/net/sunrpc/sched.c
> +++ b/net/sunrpc/sched.c
> @@ -252,23 +252,39 @@ static void rpc_set_active(struct rpc_task *task)
>
> /*
> * Mark an RPC call as having completed by clearing the 'active' bit
> + * and then waking up all tasks that were sleeping.
> */
> -static void rpc_mark_complete_task(struct rpc_task *task)
> +static int rpc_complete_task(struct rpc_task *task)
> {
> - smp_mb__before_clear_bit();
> + void *m = &task->tk_runstate;
> + wait_queue_head_t *wq = bit_waitqueue(m, RPC_TASK_ACTIVE);
> + struct wait_bit_key k = __WAIT_BIT_KEY_INITIALIZER(m, RPC_TASK_ACTIVE);
> + unsigned long flags;
> + int ret;
> +
> + spin_lock_irqsave(&wq->lock, flags);
> clear_bit(RPC_TASK_ACTIVE, &task->tk_runstate);
> - smp_mb__after_clear_bit();
> - wake_up_bit(&task->tk_runstate, RPC_TASK_ACTIVE);
> + if (waitqueue_active(wq))
> + __wake_up_locked_key(wq, TASK_NORMAL, &k);
> + ret = atomic_dec_and_test(&task->tk_count);
> + spin_unlock_irqrestore(&wq->lock, flags);
> + return ret;
> }
>
> /*
> * Allow callers to wait for completion of an RPC call
> + *
> + * Note the use of out_of_line_wait_on_bit() rather than wait_on_bit()
> + * to enforce taking of the wq->lock and hence avoid races with
> + * rpc_complete_task().
> */
> int __rpc_wait_for_completion_task(struct rpc_task *task, int (*action)(void *))
> {
> + BUG_ON(!RPC_IS_ASYNC(task));
> +
> if (action == NULL)
> action = rpc_wait_bit_killable;
> - return wait_on_bit(&task->tk_runstate, RPC_TASK_ACTIVE,
> + return out_of_line_wait_on_bit(&task->tk_runstate, RPC_TASK_ACTIVE,
> action, TASK_KILLABLE);
> }
> EXPORT_SYMBOL_GPL(__rpc_wait_for_completion_task);

Never mind. The above ordering scheme is broken by the fact that
'wake_bit_function' calls autoremove_wake_function, which again means
that finish_wait optimises away the spin lock.

Back to the drawing board...

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2011-02-15 16:30:55

by Jeff Layton

[permalink] [raw]
Subject: Re: [PATCH] nfs: don't queue synchronous NFSv4 close rpc_release to nfsiod

On Tue, 15 Feb 2011 10:31:38 -0500
Trond Myklebust <[email protected]> wrote:

> On Tue, 2011-02-15 at 09:58 -0500, Jeff Layton wrote:
> > I recently had some of our QA people report some connectathon test
> > failures in RHEL5 (2.6.18-based kernel). For some odd reason (maybe
> > scheduling differences that make the race more likely?) the problem
> > occurs more frequently on s390.
> >
> > The problem generally manifests itself on NFSv4 as a race where an rmdir
> > fails because a silly-renamed file in the directory wasn't deleted in
> > time. Looking at traces, what you usually see is the failing rmdir
> > attempt, followed very soon afterward by the sillydelete of the file
> > that prevented it.
> >
> > Silly deletes are handled via dentry_iput and in the case of a close on
> > NFSv4, the last dentry reference is often held by the CLOSE RPC task.
> > nfs4_do_close does the close as an async RPC task that it conditionally
> > waits on depending on whether the close is synchronous or not.
> >
> > It also sets the workqueue for the task to nfsiod_workqueue. When
> > tk_workqueue is set, the rpc_release operation is queued to that
> > workqueue. rpc_release is where the dentry reference held by the task is
> > put. The caller has no way to wait for that to complete, so the close(2)
> > syscall can easily return before the rpc_release call is ever done. In
> > some cases, that rpc_release is delayed for long enough to prevent a
> > subsequent rmdir of the containing directory.
> >
> > I believe this is a bug, or at least not ideal behavior. We should try
> > not to have the close(2) call return in this situation until the
> > sillydelete is done.
> >
> > I've been able to reproduce this more reliably by adding a 100ms sleep
> > at the top of nfs4_free_closedata. I've not seen it "in the wild" on
> > mainline kernels, but it seems quite possible when a machine is heavily
> > loaded.
> >
> > This patch fixes this by not setting tk_workqueue in nfs4_do_close when
> > the wait flag is set. This makes the final rpc_put_task a synchronous
> > operation and should prevent close(2) from returning before the
> > dentry_iput is done.
> >
> > Signed-off-by: Jeff Layton <[email protected]>
> > ---
> > fs/nfs/nfs4proc.c | 5 ++++-
> > 1 files changed, 4 insertions(+), 1 deletions(-)
> >
> > diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> > index 78936a8..4cabfea 100644
> > --- a/fs/nfs/nfs4proc.c
> > +++ b/fs/nfs/nfs4proc.c
> > @@ -1988,11 +1988,14 @@ int nfs4_do_close(struct path *path, struct nfs4_state *state, gfp_t gfp_mask, i
> > .rpc_client = server->client,
> > .rpc_message = &msg,
> > .callback_ops = &nfs4_close_ops,
> > - .workqueue = nfsiod_workqueue,
> > .flags = RPC_TASK_ASYNC,
> > };
> > int status = -ENOMEM;
> >
> > + /* rpc_release must be synchronous too if "wait" is set */
> > + if (!wait)
> > + task_setup_data.workqueue = nfsiod_workqueue;
> > +
> > calldata = kzalloc(sizeof(*calldata), gfp_mask);
> > if (calldata == NULL)
> > goto out;
>
> There is no guarantee that rpc_wait_for_completion_task() will wait
> until rpciod has called rpc_put_task(), in which case the above change
> ends up with a dput() on the rpciod workqueue, and potential deadlocks.
>
> Let's rather look at making rpc_put_task a little bit more safe, by
> ensuring that rpciod always calls queue_work(), and by allowing other
> callers to switch it off.
>

Ok, I see the potential deadlock. To make sure I understand correctly,
is this what you'd like to see?

1) If tk_workqueue is set, then queue rpc_release to that.

2) If it's not, queue it to rpciod.

3) add a flag to rpc_put_task (or a variant of that function) that
calls rpc_release synchronously, and have nfs4_do_close use that
variant when it puts the task.

...does that sound correct? If so, then I'm not sure that it'll 100%
close the race. It's still possible that the rpc_put_task call in
rpc_release_task would be the last one. If so and we queue it to the
workqueue, then we're back in the same situation.
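A toy model of the interleaving I mean (names are mine, not a proposed API): whichever put drops the last reference decides where the release runs, so if the caller's synchronous put happens before the internal put in rpc_release_task, the release still lands on the workqueue after close(2) has returned:

```c
/* Model of the residual race: two references, one held by the caller of
 * nfs4_do_close() and one by the task itself. The ordering of the two
 * puts decides whether the release is synchronous. */
#include <stdbool.h>

enum release_ctx { NOT_RELEASED, RELEASED_INLINE, RELEASED_ON_WORKQUEUE };

struct close_task {
	int refcount;
	enum release_ctx ctx;
};

/* The caller's put: the hypothetical synchronous variant from step 3. */
static void put_task_sync(struct close_task *t)
{
	if (--t->refcount == 0)
		t->ctx = RELEASED_INLINE;
}

/* The put done internally when the task finishes running; if a
 * workqueue is set, the release is deferred to it. */
static void put_task_internal(struct close_task *t, bool have_workqueue)
{
	if (--t->refcount == 0)
		t->ctx = have_workqueue ? RELEASED_ON_WORKQUEUE
					: RELEASED_INLINE;
}
```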

--
Jeff Layton <[email protected]>