2010-10-04 10:03:31

by Sachin Prabhu

[permalink] [raw]
Subject: Re: NFS4 clients cannot reclaim locks

----- "Trond Myklebust" <[email protected]> wrote:
> On Fri, 2010-10-01 at 07:30 -0400, Sachin Prabhu wrote:
> > NFS4 clients appear to have problems reclaiming locks after a server
> reboot. I can recreate the issue on 2.6.34.7-56.fc13.x86_64 on a
> Fedora system.
> >
> > The problem appears to happen in cases where after a reboot, a WRITE
> call is made just before the RENEW call. In that case, the
> NFS4ERR_STALE_STATEID is returned for the WRITE call which results in
> NFS_STATE_RECLAIM_REBOOT being set in the state flags. However the
> NFS4ERR_STALE_CLIENTID returned for the subsequent RENEW call is
> handled by
> > nfs4_recovery_handle_error() -> nfs4_state_end_reclaim_reboot(clp);
>
> > which ends up setting the state flag to NFS_STATE_RECLAIM_NOGRACE
> and clearing the NFS_STATE_RECLAIM_REBOOT in
> nfs4_state_mark_reclaim_nograce().
>
> Yup. I don't think we should call nfs4_state_mark_reclaim_reboot()
> here.
>
> > The process of reclaiming the locks then seem to hit another
> roadblock in nfs4_open_expired() where it fails to open the file and
> reset the state. It ends up calling nfs4_reclaim_locks() in a loop
> with the old stateid in nfs4_reclaim_open_state().
>
> Any idea how nfs4_open_expired() is failing? It seems that if it
> does,
> we should see an error, which would cause the lock reclaim to fail.
>
> Also, why is the call to nfs4_reclaim_locks() looping? That too
> should
> exit in case of an error.
>

>From instrumentation, the problem appears to happen at nfs4_open_prepare

static void nfs4_open_prepare(struct rpc_task *task, void *calldata)
{
..
/*
* Check if we still need to send an OPEN call, or if we can use
* a delegation instead.
*/

if (data->state != NULL) {
struct nfs_delegation *delegation;

if (can_open_cached(data->state, data->o_arg.fmode, data->o_arg.open_flags))
goto out_no_action;
..
out_no_action:
task->tk_action = NULL;

}

Here, can_open_cached returns true. The open call is never made and the old state is used.
static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs4_state_recovery_ops *ops)
{
..
restart:
..
status = ops->recover_open(sp, state); <-- This call attempts to use cached state and status is set to 0
if (status >= 0) {
status = nfs4_reclaim_locks(state, ops); <-- Attempts to reclaim locks using old stateid
-- Here status is set to -NFS4ERR_BAD_STATEID --
..
}
switch (status) {
..
case -NFS4ERR_BAD_STATEID:
case -NFS4ERR_RECLAIM_BAD:
case -NFS4ERR_RECLAIM_CONFLICT:
nfs4_state_mark_reclaim_nograce(sp->so_client, state);
break;
..
}
nfs4_put_open_state(state);
goto restart;
..
}

The call to ops->recover_open() calls nfs4_open_expired(). While preparing the RPC call to OPEN, in nfs4_open_prepare(), it decides that the caches copy is valid and it attempts to use it. So nfs4_open_expired() returns 0. The subsequent call to reclaim locks using nfs4_reclaim_locks() fails with with a -NFS4ERR_BAD_STATEID. A goto statement in nfs4_reclaim_open_state() results in it looping with the same results as before.

Sachin Prabhu


2010-10-05 13:38:47

by Myklebust, Trond

[permalink] [raw]
Subject: Re: NFS4 clients cannot reclaim locks

On Mon, 2010-10-04 at 06:03 -0400, Sachin Prabhu wrote:
> ----- "Trond Myklebust" <[email protected]> wrote:
> > On Fri, 2010-10-01 at 07:30 -0400, Sachin Prabhu wrote:
> > > NFS4 clients appear to have problems reclaiming locks after a server
> > reboot. I can recreate the issue on 2.6.34.7-56.fc13.x86_64 on a
> > Fedora system.
> > >
> > > The problem appears to happen in cases where after a reboot, a WRITE
> > call is made just before the RENEW call. In that case, the
> > NFS4ERR_STALE_STATEID is returned for the WRITE call which results in
> > NFS_STATE_RECLAIM_REBOOT being set in the state flags. However the
> > NFS4ERR_STALE_CLIENTID returned for the subsequent RENEW call is
> > handled by
> > > nfs4_recovery_handle_error() -> nfs4_state_end_reclaim_reboot(clp);
> >
> > > which ends up setting the state flag to NFS_STATE_RECLAIM_NOGRACE
> > and clearing the NFS_STATE_RECLAIM_REBOOT in
> > nfs4_state_mark_reclaim_nograce().
> >
> > Yup. I don't think we should call nfs4_state_mark_reclaim_reboot()
> > here.

...Here is the second patch.

Cheers
Trond
------------------------------------------------------------------------------------------------------
NFSv4: Don't call nfs4_state_mark_reclaim_reboot() from error handlers

From: Trond Myklebust <[email protected]>

In the case of a server reboot, the state recovery thread starts by calling
nfs4_state_end_reclaim_reboot() in order to avoid edge conditions when
the server reboots while the client is in the middle of recovery.

However, if the client has already marked the nfs4_state as requiring
reboot recovery, then the above behaviour will cause the recovery thread to
treat the open as if it was part of such an edge condition: the open will
be recovered as if it was part of a lease expiration (and all the locks
will be lost).
Fix is to remove the call to nfs4_state_mark_reclaim_reboot from
nfs4_async_handle_error(), and nfs4_handle_exception(). Instead we leave it
to the recovery thread to do this for us.

Signed-off-by: Trond Myklebust <[email protected]>
---

fs/nfs/nfs4proc.c | 6 ------
1 files changed, 0 insertions(+), 6 deletions(-)


diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 01b4817..74aa54e 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -255,9 +255,6 @@ static int nfs4_handle_exception(const struct nfs_server *server, int errorcode,
nfs4_state_mark_reclaim_nograce(clp, state);
goto do_state_recovery;
case -NFS4ERR_STALE_STATEID:
- if (state == NULL)
- break;
- nfs4_state_mark_reclaim_reboot(clp, state);
case -NFS4ERR_STALE_CLIENTID:
case -NFS4ERR_EXPIRED:
goto do_state_recovery;
@@ -3493,9 +3490,6 @@ nfs4_async_handle_error(struct rpc_task *task, const struct nfs_server *server,
nfs4_state_mark_reclaim_nograce(clp, state);
goto do_state_recovery;
case -NFS4ERR_STALE_STATEID:
- if (state == NULL)
- break;
- nfs4_state_mark_reclaim_reboot(clp, state);
case -NFS4ERR_STALE_CLIENTID:
case -NFS4ERR_EXPIRED:
goto do_state_recovery;


2010-10-05 13:37:42

by Myklebust, Trond

[permalink] [raw]
Subject: Re: NFS4 clients cannot reclaim locks

On Mon, 2010-10-04 at 06:03 -0400, Sachin Prabhu wrote:
> From instrumentation, the problem appears to happen at nfs4_open_prepare
>
> static void nfs4_open_prepare(struct rpc_task *task, void *calldata)
> {
> ..
> /*
> * Check if we still need to send an OPEN call, or if we can use
> * a delegation instead.
> */
>
> if (data->state != NULL) {
> struct nfs_delegation *delegation;
>
> if (can_open_cached(data->state, data->o_arg.fmode, data->o_arg.open_flags))
> goto out_no_action;
> ..
> out_no_action:
> task->tk_action = NULL;
>
> }
>
> Here, can_open_cached returns true. The open call is never made and the old state is used.
> static int nfs4_reclaim_open_state(struct nfs4_state_owner *sp, const struct nfs4_state_recovery_ops *ops)
> {
> ..
> restart:
> ..
> status = ops->recover_open(sp, state); <-- This call attempts to use cached state and status is set to 0
> if (status >= 0) {
> status = nfs4_reclaim_locks(state, ops); <-- Attempts to reclaim locks using old stateid
> -- Here status is set to -NFS4ERR_BAD_STATEID --
> ..
> }
> switch (status) {
> ..
> case -NFS4ERR_BAD_STATEID:
> case -NFS4ERR_RECLAIM_BAD:
> case -NFS4ERR_RECLAIM_CONFLICT:
> nfs4_state_mark_reclaim_nograce(sp->so_client, state);
> break;
> ..
> }
> nfs4_put_open_state(state);
> goto restart;
> ..
> }
>
> The call to ops->recover_open() calls nfs4_open_expired(). While preparing the RPC call to OPEN, in nfs4_open_prepare(), it decides that the caches copy is valid and it attempts to use it. So nfs4_open_expired() returns 0. The subsequent call to reclaim locks using nfs4_reclaim_locks() fails with with a -NFS4ERR_BAD_STATEID. A goto statement in nfs4_reclaim_open_state() results in it looping with the same results as before.

Yup. That makes sense. Does the following patch help?

Cheers
Trond
--------------------------------------------------------------------------------------------------------
NFSv4: Fix open recovery

From: Trond Myklebust <[email protected]>

NFSv4 open recovery is currently broken: since we do not clear the
state->flags states before attempting recovery, we end up with the
'can_open_cached()' function triggering. This again leads to no OPEN call
being put on the wire.

Reported-by: Sachin Prabhu <[email protected]>
Signed-off-by: Trond Myklebust <[email protected]>
---

fs/nfs/nfs4proc.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)


diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 089da5b..01b4817 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -1120,6 +1120,7 @@ static int nfs4_open_recover(struct nfs4_opendata *opendata, struct nfs4_state *
clear_bit(NFS_DELEGATED_STATE, &state->flags);
smp_rmb();
if (state->n_rdwr != 0) {
+ clear_bit(NFS_O_RDWR_STATE, &state->flags);
ret = nfs4_open_recover_helper(opendata, FMODE_READ|FMODE_WRITE, &newstate);
if (ret != 0)
return ret;
@@ -1127,6 +1128,7 @@ static int nfs4_open_recover(struct nfs4_opendata *opendata, struct nfs4_state *
return -ESTALE;
}
if (state->n_wronly != 0) {
+ clear_bit(NFS_O_WRONLY_STATE, &state->flags);
ret = nfs4_open_recover_helper(opendata, FMODE_WRITE, &newstate);
if (ret != 0)
return ret;
@@ -1134,6 +1136,7 @@ static int nfs4_open_recover(struct nfs4_opendata *opendata, struct nfs4_state *
return -ESTALE;
}
if (state->n_rdonly != 0) {
+ clear_bit(NFS_O_RDONLY_STATE, &state->flags);
ret = nfs4_open_recover_helper(opendata, FMODE_READ, &newstate);
if (ret != 0)
return ret;