2023-06-27 18:35:16

by Benjamin Coddington

[permalink] [raw]
Subject: [PATCH 2/2] NFSv4: Fix dropped lock for racing OPEN and delegation return

Commmit f5ea16137a3f ("NFSv4: Retry LOCK on OLD_STATEID during delegation
return") attempted to solve this problem by using nfs4's generic async error
handling, but introduced a regression where v4.0 lock recovery would hang.
The additional complexity introduced by overloading that error handling is
not necessary for this case.

The problem as originally explained in the above commit is:

There's a small window where a LOCK sent during a delegation return can
race with another OPEN on client, but the open stateid has not yet been
updated. In this case, the client doesn't handle the OLD_STATEID error
from the server and will lose this lock, emitting:
"NFS: nfs4_handle_delegation_recall_error: unhandled error -10024".

We want a fix that is much more focused to the original problem. Fix this
issue by returning -EAGAIN from the nfs4_handle_delegation_recall_error() on
OLD_STATEID, and use that as a signal for the delegation return code to
retry the LOCK operation. We should at this point be able to send along
the updated stateid.

Signed-off-by: Benjamin Coddington <[email protected]>
---
fs/nfs/delegation.c | 4 +++-
fs/nfs/nfs4proc.c | 1 +
2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
index cf7365581031..23aeb02319a5 100644
--- a/fs/nfs/delegation.c
+++ b/fs/nfs/delegation.c
@@ -160,7 +160,9 @@ static int nfs_delegation_claim_locks(struct nfs4_state *state, const nfs4_state
if (nfs_file_open_context(fl->fl_file)->state != state)
continue;
spin_unlock(&flctx->flc_lock);
- status = nfs4_lock_delegation_recall(fl, state, stateid);
+ do {
+ status = nfs4_lock_delegation_recall(fl, state, stateid);
+ } while (status == -EAGAIN);
if (status < 0)
goto out;
spin_lock(&flctx->flc_lock);
diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 6bb14f6cfbc0..399db73a57f4 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -2262,6 +2262,7 @@ static int nfs4_handle_delegation_recall_error(struct nfs_server *server, struct
case -NFS4ERR_BAD_HIGH_SLOT:
case -NFS4ERR_CONN_NOT_BOUND_TO_SESSION:
case -NFS4ERR_DEADSESSION:
+ case -NFS4ERR_OLD_STATEID:
return -EAGAIN;
case -NFS4ERR_STALE_CLIENTID:
case -NFS4ERR_STALE_STATEID:
--
2.40.1



2023-06-29 18:34:01

by Trond Myklebust

[permalink] [raw]
Subject: Re: [PATCH 2/2] NFSv4: Fix dropped lock for racing OPEN and delegation return

On Tue, 2023-06-27 at 14:31 -0400, Benjamin Coddington wrote:
> Commmit f5ea16137a3f ("NFSv4: Retry LOCK on OLD_STATEID during
> delegation
> return") attempted to solve this problem by using nfs4's generic
> async error
> handling, but introduced a regression where v4.0 lock recovery would
> hang.
> The additional complexity introduced by overloading that error
> handling is
> not necessary for this case.
>
> The problem as originally explained in the above commit is:
>
>     There's a small window where a LOCK sent during a delegation
> return can
>     race with another OPEN on client, but the open stateid has not
> yet been
>     updated.  In this case, the client doesn't handle the OLD_STATEID
> error
>     from the server and will lose this lock, emitting:
>     "NFS: nfs4_handle_delegation_recall_error: unhandled error -
> 10024".
>
> We want a fix that is much more focused to the original problem.  Fix
> this
> issue by returning -EAGAIN from the
> nfs4_handle_delegation_recall_error() on
> OLD_STATEID, and use that as a signal for the delegation return code
> to
> retry the LOCK operation.  We should at this point be able to send
> along
> the updated stateid.
>
> Signed-off-by: Benjamin Coddington <[email protected]>
> ---
>  fs/nfs/delegation.c | 4 +++-
>  fs/nfs/nfs4proc.c   | 1 +
>  2 files changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
> index cf7365581031..23aeb02319a5 100644
> --- a/fs/nfs/delegation.c
> +++ b/fs/nfs/delegation.c
> @@ -160,7 +160,9 @@ static int nfs_delegation_claim_locks(struct
> nfs4_state *state, const nfs4_state
>                 if (nfs_file_open_context(fl->fl_file)->state !=
> state)
>                         continue;
>                 spin_unlock(&flctx->flc_lock);
> -               status = nfs4_lock_delegation_recall(fl, state,
> stateid);
> +               do {
> +                       status = nfs4_lock_delegation_recall(fl,
> state, stateid);
> +               } while (status == -EAGAIN);
>                 if (status < 0)
>                         goto out;
>                 spin_lock(&flctx->flc_lock);
> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
> index 6bb14f6cfbc0..399db73a57f4 100644
> --- a/fs/nfs/nfs4proc.c
> +++ b/fs/nfs/nfs4proc.c
> @@ -2262,6 +2262,7 @@ static int
> nfs4_handle_delegation_recall_error(struct nfs_server *server, struct
>                 case -NFS4ERR_BAD_HIGH_SLOT:
>                 case -NFS4ERR_CONN_NOT_BOUND_TO_SESSION:
>                 case -NFS4ERR_DEADSESSION:
> +               case -NFS4ERR_OLD_STATEID:
>                         return -EAGAIN;

Hmm... Rather than issuing a blanket EAGAIN, we really should be
looking at using either nfs4_refresh_lock_old_stateid() or
nfs4_refresh_open_old_stateid(), depending on whether the stateid that
saw the NFS4ERR_OLD_STATEID was a lock stateid or an open stateid.

>                 case -NFS4ERR_STALE_CLIENTID:
>                 case -NFS4ERR_STALE_STATEID:

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2023-06-29 19:04:20

by Benjamin Coddington

[permalink] [raw]
Subject: Re: [PATCH 2/2] NFSv4: Fix dropped lock for racing OPEN and delegation return

On 29 Jun 2023, at 14:33, Trond Myklebust wrote:

> On Tue, 2023-06-27 at 14:31 -0400, Benjamin Coddington wrote:
>> Commmit f5ea16137a3f ("NFSv4: Retry LOCK on OLD_STATEID during
>> delegation
>> return") attempted to solve this problem by using nfs4's generic
>> async error
>> handling, but introduced a regression where v4.0 lock recovery would
>> hang.
>> The additional complexity introduced by overloading that error
>> handling is
>> not necessary for this case.
>>
>> The problem as originally explained in the above commit is:
>>
>>     There's a small window where a LOCK sent during a delegation
>> return can
>>     race with another OPEN on client, but the open stateid has not
>> yet been
>>     updated.  In this case, the client doesn't handle the OLD_STATEID
>> error
>>     from the server and will lose this lock, emitting:
>>     "NFS: nfs4_handle_delegation_recall_error: unhandled error -
>> 10024".
>>
>> We want a fix that is much more focused to the original problem.  Fix
>> this
>> issue by returning -EAGAIN from the
>> nfs4_handle_delegation_recall_error() on
>> OLD_STATEID, and use that as a signal for the delegation return code
>> to
>> retry the LOCK operation.  We should at this point be able to send
>> along
>> the updated stateid.
>>
>> Signed-off-by: Benjamin Coddington <[email protected]>
>> ---
>>  fs/nfs/delegation.c | 4 +++-
>>  fs/nfs/nfs4proc.c   | 1 +
>>  2 files changed, 4 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/nfs/delegation.c b/fs/nfs/delegation.c
>> index cf7365581031..23aeb02319a5 100644
>> --- a/fs/nfs/delegation.c
>> +++ b/fs/nfs/delegation.c
>> @@ -160,7 +160,9 @@ static int nfs_delegation_claim_locks(struct
>> nfs4_state *state, const nfs4_state
>>                 if (nfs_file_open_context(fl->fl_file)->state !=
>> state)
>>                         continue;
>>                 spin_unlock(&flctx->flc_lock);
>> -               status = nfs4_lock_delegation_recall(fl, state,
>> stateid);
>> +               do {
>> +                       status = nfs4_lock_delegation_recall(fl,
>> state, stateid);
>> +               } while (status == -EAGAIN);
>>                 if (status < 0)
>>                         goto out;
>>                 spin_lock(&flctx->flc_lock);
>> diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
>> index 6bb14f6cfbc0..399db73a57f4 100644
>> --- a/fs/nfs/nfs4proc.c
>> +++ b/fs/nfs/nfs4proc.c
>> @@ -2262,6 +2262,7 @@ static int
>> nfs4_handle_delegation_recall_error(struct nfs_server *server, struct
>>                 case -NFS4ERR_BAD_HIGH_SLOT:
>>                 case -NFS4ERR_CONN_NOT_BOUND_TO_SESSION:
>>                 case -NFS4ERR_DEADSESSION:
>> +               case -NFS4ERR_OLD_STATEID:
>>                         return -EAGAIN;
>
> Hmm... Rather than issuing a blanket EAGAIN, we really should be
> looking at using either nfs4_refresh_lock_old_stateid() or
> nfs4_refresh_open_old_stateid(), depending on whether the stateid that
> saw the NFS4ERR_OLD_STATEID was a lock stateid or an open stateid.

I figured if there was client race that would trigger the OLD_STATEID on
open, we'd have heard from the "unhandled error" printk by now, so I rushed
this out.. But I also don't like the overloading for open error handling
here. I'll work it up as you suggest.

The revert can go without a fix (IMO). The fix is worse than original bug
which was really hard to hit. I couldn't reproduce it without artificial
delays.

Ben