2013-11-15 21:36:22

by Andy Adamson

Subject: [PATCH Version 3 1/1] NFSv4 wait on recovery for async session errors

From: Andy Adamson <[email protected]>

When the state manager is processing the NFS4CLNT_DELEGRETURN flag, session
draining is off, but DELEGRETURN can still get a session error.
The async handler calls nfs4_schedule_session_recovery and returns -EAGAIN, and
the DELEGRETURN done callback then restarts the RPC task in the prepare state.
With the state manager still processing the NFS4CLNT_DELEGRETURN flag with
session draining off, these DELEGRETURNs will cycle with errors filling up the
session slots.

This prevents OPEN reclaims (from nfs_delegation_claim_opens) required by the
NFS4CLNT_DELEGRETURN state manager processing from completing, hanging the
state manager in the __rpc_wait_for_completion_task in nfs4_run_open_task
as seen in this kernel thread dump:

kernel: 4.12.32.53-ma D 0000000000000000 0 3393 2 0x00000000
kernel: ffff88013995fb60 0000000000000046 ffff880138cc5400 ffff88013a9df140
kernel: ffff8800000265c0 ffffffff8116eef0 ffff88013fc10080 0000000300000001
kernel: ffff88013a4ad058 ffff88013995ffd8 000000000000fbc8 ffff88013a4ad058
kernel: Call Trace:
kernel: [<ffffffff8116eef0>] ? cache_alloc_refill+0x1c0/0x240
kernel: [<ffffffffa0358110>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
kernel: [<ffffffffa0358152>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
kernel: [<ffffffff8152914f>] __wait_on_bit+0x5f/0x90
kernel: [<ffffffffa0358110>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
kernel: [<ffffffff815291f8>] out_of_line_wait_on_bit+0x78/0x90
kernel: [<ffffffff8109b520>] ? wake_bit_function+0x0/0x50
kernel: [<ffffffffa035810d>] __rpc_wait_for_completion_task+0x2d/0x30 [sunrpc]
kernel: [<ffffffffa040d44c>] nfs4_run_open_task+0x11c/0x160 [nfs]
kernel: [<ffffffffa04114e7>] nfs4_open_recover_helper+0x87/0x120 [nfs]
kernel: [<ffffffffa0411646>] nfs4_open_recover+0xc6/0x150 [nfs]
kernel: [<ffffffffa040cc6f>] ? nfs4_open_recoverdata_alloc+0x2f/0x60 [nfs]
kernel: [<ffffffffa0414e1a>] nfs4_open_delegation_recall+0x6a/0xa0 [nfs]
kernel: [<ffffffffa0424020>] nfs_end_delegation_return+0x120/0x2e0 [nfs]
kernel: [<ffffffff8109580f>] ? queue_work+0x1f/0x30
kernel: [<ffffffffa0424347>] nfs_client_return_marked_delegations+0xd7/0x110 [nfs]
kernel: [<ffffffffa04225d8>] nfs4_run_state_manager+0x548/0x620 [nfs]
kernel: [<ffffffffa0422090>] ? nfs4_run_state_manager+0x0/0x620 [nfs]
kernel: [<ffffffff8109b0f6>] kthread+0x96/0xa0
kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
kernel: [<ffffffff8109b060>] ? kthread+0x0/0xa0
kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20

The state manager therefore cannot process the DELEGRETURN session errors.
Change the async handler to wait for recovery on session errors.

Signed-off-by: Andy Adamson <[email protected]>
---
fs/nfs/nfs4proc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index 5ab33c0..1f4edfb 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -4803,7 +4803,7 @@ nfs4_async_handle_error(struct rpc_task *task, const struct nfs_server *server,
 			dprintk("%s ERROR %d, Reset session\n", __func__,
 				task->tk_status);
 			nfs4_schedule_session_recovery(clp->cl_session, task->tk_status);
-			goto restart_call;
+			goto wait_on_recovery;
 #endif /* CONFIG_NFS_V4_1 */
 		case -NFS4ERR_DELAY:
 			nfs_inc_server_stats(server, NFSIOS_DELAY);
--
1.8.3.1
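
The effect of switching from restart_call to wait_on_recovery can be
illustrated with a small user-space model. This is only a sketch under stated
assumptions, not kernel code: open_reclaim_gets_slot, enum policy, and
enum task_state are hypothetical names, every DELEGRETURN is assumed to fail
with a session error, and DELEGRETURN tasks are assumed to be queued for
session slots ahead of the OPEN reclaim.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_TASKS 16

/* Hypothetical model types; these are not kernel definitions. */
enum policy { RESTART_CALL, WAIT_ON_RECOVERY };
enum task_state { WANTS_SLOT, HOLDS_SLOT, PARKED };

/*
 * Returns true if the OPEN reclaim obtains a session slot within `rounds`
 * scheduling passes, given `ntasks` DELEGRETURN tasks that all complete
 * with a session error.
 */
bool open_reclaim_gets_slot(enum policy policy, int nslots, int ntasks,
			    int rounds)
{
	enum task_state task[MAX_TASKS];
	int free_slots = nslots;

	assert(ntasks > 0 && ntasks <= MAX_TASKS && nslots > 0);
	for (int i = 0; i < ntasks; i++)
		task[i] = WANTS_SLOT;

	for (int r = 0; r < rounds; r++) {
		/* Assumption: DELEGRETURNs are served before the OPEN. */
		for (int i = 0; i < ntasks && free_slots > 0; i++) {
			if (task[i] == WANTS_SLOT) {
				task[i] = HOLDS_SLOT;
				free_slots--;
			}
		}

		/* Any slot left over goes to the OPEN reclaim. */
		if (free_slots > 0)
			return true;

		/* Every in-flight DELEGRETURN hits a session error. */
		for (int i = 0; i < ntasks; i++) {
			if (task[i] != HOLDS_SLOT)
				continue;
			free_slots++;
			/* Old path: retry at once. New path: sleep. */
			task[i] = (policy == RESTART_CALL) ? WANTS_SLOT
							   : PARKED;
		}
	}
	return false;
}
```

In this model, with 4 slots and 8 DELEGRETURN tasks, the RESTART_CALL policy
keeps the same tasks cycling through all slots, so the OPEN reclaim never
runs, while WAIT_ON_RECOVERY parks the failed tasks and leaves a slot free.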



2013-11-21 23:48:58

by William Dauchy

Subject: Re: [PATCH Version 3 1/1] NFSv4 wait on recovery for async session errors

On Wed, Nov 20, 2013 at 10:29 AM, William Dauchy <[email protected]> wrote:
> ba64b36 NFSv4: Update list of irrecoverable errors on DELEGRETURN
> is cc'ed for stable
>
> If these two patches are connected, why the first one:
> d3b173a NFSv4 wait on recovery for async session errors
> is not cc'ed for stable as well?

your last push
4a82fd7 NFSv4 wait on recovery for async session errors
is now cc'ed for stable

Thanks,
--
William

2013-11-20 09:30:10

by William Dauchy

Subject: Re: [PATCH Version 3 1/1] NFSv4 wait on recovery for async session errors

Hi Trond,

On Tue, Nov 19, 2013 at 10:49 PM, Myklebust, Trond
<[email protected]> wrote:
> There is a second patch that goes with this problem. Please see the
> following attachment.

ba64b36 NFSv4: Update list of irrecoverable errors on DELEGRETURN
is cc'ed for stable

If these two patches are connected, why the first one:
d3b173a NFSv4 wait on recovery for async session errors
is not cc'ed for stable as well?

For example in stable v3.10.x we currently have:

	case -NFS4ERR_SEQ_MISORDERED:
		dprintk("%s ERROR %d, Reset session\n", __func__,
			task->tk_status);
		nfs4_schedule_session_recovery(clp->cl_session, task->tk_status);
		task->tk_status = 0;
		return -EAGAIN;

but with the last patches we now have in mainline:

	case -NFS4ERR_SEQ_MISORDERED:
		dprintk("%s ERROR: %d Reset session\n", __func__,
			errorcode);
		nfs4_schedule_session_recovery(clp->cl_session, errorcode);
		goto wait_on_recovery;


and this was originally changed because of commit
f1478c1 NFS: Re-use exit code in nfs4_async_handle_error()

The result in stable v3.10.x doesn't really make sense to me. Did I
miss something?

Regards,
--
William

2013-11-19 21:50:03

by Myklebust, Trond

Subject: Re: [PATCH Version 3 1/1] NFSv4 wait on recovery for async session errors

On Fri, 2013-11-15 at 16:36 -0500, [email protected] wrote:
> From: Andy Adamson <[email protected]>
>
> When the state manager is processing the NFS4CLNT_DELEGRETURN flag, session
> draining is off, but DELEGRETURN can still get a session error.
> The async handler calls nfs4_schedule_session_recovery returns -EAGAIN, and
> the DELEGRETURN done then restarts the RPC task in the prepare state.
> With the state manager still processing the NFS4CLNT_DELEGRETURN flag with
> session draining off, these DELEGRETURNs will cycle with errors filling up the
> session slots.
>
> This prevents OPEN reclaims (from nfs_delegation_claim_opens) required by the
> NFS4CLNT_DELEGRETURN state manager processing from completing, hanging the
> state manager in the __rpc_wait_for_completion_task in nfs4_run_open_task
> as seen in this kernel thread dump:
>

Hi Andy,

There is a second patch that goes with this problem. Please see the
following attachment.

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


Attachments:
0001-NFSv4-Update-list-of-irrecoverable-errors-on-DELEGRE.patch (1.37 kB)

2013-11-19 22:43:31

by Adamson, Andy

Subject: Re: [PATCH Version 3 1/1] NFSv4 wait on recovery for async session errors

Why not recover the lease? Shouldn't NFS4ERR_EXPIRED be left for the async handler?

-->Andy

On Nov 19, 2013, at 5:34 PM, "Myklebust, Trond" <[email protected]>
wrote:

> On Tue, 2013-11-19 at 16:49 -0500, Trond Myklebust wrote:
>> There is a second patch that goes with this problem. Please see the
>> following attachment.
>
> V2: Don't return an error when we know that the stateid is no longer
> valid.
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
> <0001-NFSv4-Update-list-of-irrecoverable-errors-on-DELEGRE.patch>


2013-11-19 22:47:13

by Myklebust, Trond

Subject: Re: [PATCH Version 3 1/1] NFSv4 wait on recovery for async session errors

On Tue, 2013-11-19 at 22:43 +0000, Adamson, Andy wrote:
> Why not recover the lease? Shouldn't NFS4ERR_EXPIRED be left for the async handler?

No, because then it will retry the operation. Since it is impossible to
recover the delegation at this point, then that will result in another
failure.
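
The reasoning above can be sketched as a completion-time classifier. This is
a minimal illustration under stated assumptions, not the attached patch: the
enum values are stand-ins rather than protocol error numbers, and
delegreturn_done_action is a hypothetical name. In this model, anything that
is not handled locally is handed to a generic handler that always retries.

```c
#include <assert.h>

/* Stand-in status codes for illustration; not the NFSv4 wire values. */
enum model_err { ERR_OK, ERR_EXPIRED, ERR_DELAY };

/* What the hypothetical done callback decides to do with the task. */
enum done_action { DONE_FINISH, DONE_RETRY };

/*
 * Decide how a DELEGRETURN completion is handled. An expired lease means
 * the delegation can no longer be recovered, so retrying would only fail
 * again; the status is cleared and the task finishes.
 */
enum done_action delegreturn_done_action(enum model_err *status)
{
	switch (*status) {
	case ERR_OK:
		return DONE_FINISH;
	case ERR_EXPIRED:
		/* The stateid is known to be invalid: report no error. */
		*status = ERR_OK;
		return DONE_FINISH;
	default:
		/* Assumption: the generic async handler retries these. */
		return DONE_RETRY;
	}
}
```

Handling the expired case before the generic handler is what avoids the
retry loop; any other status still goes through the usual retry path.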

> -->Andy
>
> On Nov 19, 2013, at 5:34 PM, "Myklebust, Trond" <[email protected]>
> wrote:
>
> > On Tue, 2013-11-19 at 16:49 -0500, Trond Myklebust wrote:
> >> There is a second patch that goes with this problem. Please see the
> >> following attachment.
> >
> > V2: Don't return an error when we know that the stateid is no longer
> > valid.
> >
> > --
> > Trond Myklebust
> > Linux NFS client maintainer
> >
> > NetApp
> > [email protected]
> > http://www.netapp.com
> > <0001-NFSv4-Update-list-of-irrecoverable-errors-on-DELEGRE.patch>
>

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2013-11-19 22:34:32

by Myklebust, Trond

Subject: Re: [PATCH Version 3 1/1] NFSv4 wait on recovery for async session errors

On Tue, 2013-11-19 at 16:49 -0500, Trond Myklebust wrote:
> There is a second patch that goes with this problem. Please see the
> following attachment.

V2: Don't return an error when we know that the stateid is no longer
valid.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


Attachments:
0001-NFSv4-Update-list-of-irrecoverable-errors-on-DELEGRE.patch (1.47 kB)