2013-11-25 17:57:55

by Andy Adamson

[permalink] [raw]
Subject: [PATCH 0/1] NFSv4.1 fix a kswap nfs4_state_manger race

From: Andy Adamson <[email protected]>

This is a three way race between the state manager, kswapd and sys_open. We are
hitting this regularily in our long term testing. This patch should fix the
race - but before we test with this patch, I'd like comments from the list.

The state manager is waiting in __rpc_wait_for_completion_task for a
recovery OPEN to complete:

kernel: Call Trace:
kernel: [<ffffffff81054a39>] ? __wake_up_common+0x59/0x90
kernel: [<ffffffffa0358110>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
kernel: [<ffffffffa0358152>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
kernel: [<ffffffff8152914f>] __wait_on_bit+0x5f/0x90
kernel: [<ffffffffa0358110>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
kernel: [<ffffffff815291f8>] out_of_line_wait_on_bit+0x78/0x90
kernel: [<ffffffff8109b520>] ? wake_bit_function+0x0/0x50
kernel: [<ffffffffa035810d>] __rpc_wait_for_completion_task+0x2d/0x30 [sunrpc]
kernel: [<ffffffffa040d44c>] nfs4_run_open_task+0x11c/0x160 [nfs]
kernel: [<ffffffffa04114d7>] nfs4_open_recover_helper+0x87/0x120 [nfs]
kernel: [<ffffffffa0411636>] nfs4_open_recover+0xc6/0x150 [nfs]
kernel: [<ffffffffa040cc6f>] ? nfs4_open_recoverdata_alloc+0x2f/0x60 [nfs]
kernel: [<ffffffffa041192d>] nfs4_open_reclaim+0xad/0x140 [nfs]
kernel: [<ffffffffa0421bfb>] nfs4_do_reclaim+0x15b/0x5e0 [nfs]
kernel: [<ffffffffa042afc3>] ? pnfs_destroy_layout+0x63/0x80 [nfs]
kernel: [<ffffffffa04224cb>] nfs4_run_state_manager+0x44b/0x620 [nfs]
kernel: [<ffffffffa0422080>] ? nfs4_run_state_manager+0x0/0x620 [nfs]
kernel: [<ffffffff8109b0f6>] kthread+0x96/0xa0
kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
kernel: [<ffffffff8109b060>] ? kthread+0x0/0xa0
kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20


Kswapd is shrinking the inode cache, and waiting for a layoutreturn:

kernel: Call Trace:
kernel: [<ffffffffa0358110>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
kernel: [<ffffffffa0358152>] rpc_wait_bit_killable+0x42/0xa0 [sunrpc]
kernel: [<ffffffff8152914f>] __wait_on_bit+0x5f/0x90
kernel: [<ffffffff8152aacb>] ? _spin_unlock_bh+0x1b/0x20
kernel: [<ffffffffa0358110>] ? rpc_wait_bit_killable+0x0/0xa0 [sunrpc]
kernel: [<ffffffff815291f8>] out_of_line_wait_on_bit+0x78/0x90
kernel: [<ffffffff8109b520>] ? wake_bit_function+0x0/0x50
kernel: [<ffffffffa0357b90>] ? rpc_exit_task+0x0/0x60 [sunrpc]
kernel: [<ffffffffa0358695>] __rpc_execute+0xf5/0x350 [sunrpc]
kernel: [<ffffffff8109b327>] ? bit_waitqueue+0x17/0xd0
kernel: [<ffffffffa0358951>] rpc_execute+0x61/0xa0 [sunrpc]
kernel: [<ffffffffa034f3a5>] rpc_run_task+0x75/0x90 [sunrpc]
kernel: [<ffffffffa040b86c>] nfs4_proc_layoutreturn+0x9c/0x110 [nfs]
kernel: [<ffffffffa042b22e>] _pnfs_return_layout+0x11e/0x1e0 [nfs]
kernel: [<ffffffffa03f3ef4>] nfs4_clear_inode+0x44/0x70 [nfs]
kernel: [<ffffffff811a5c7c>] clear_inode+0xac/0x140
kernel: [<ffffffff811a5d50>] dispose_list+0x40/0x120
kernel: [<ffffffff811a60a4>] shrink_icache_memory+0x274/0x2e0
kernel: [<ffffffff81138cca>] shrink_slab+0x12a/0x1a0
kernel: [<ffffffff8113c10a>] balance_pgdat+0x59a/0x820
kernel: [<ffffffff8113c4c4>] kswapd+0x134/0x3b0
kernel: [<ffffffff8109b4a0>] ? autoremove_wake_function+0x0/0x40
kernel: [<ffffffff8113c390>] ? kswapd+0x0/0x3b0
kernel: [<ffffffff8109b0f6>] kthread+0x96/0xa0
kernel: [<ffffffff8100c20a>] child_rip+0xa/0x20
kernel: [<ffffffff8109b060>] ? kthread+0x0/0xa0
kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20

The layoutreturn is on the cl_rpcwaitq waiting for the state manager to
complete:

kernel: 14628 0a80 0 ffff88013c8a3a00 (null) 0 ffffffffa0430580 nfsv4 LAYOUTRETURN a:rpc_prepare_task q:NFS client

Meanwhile, a sys_open is waiting in __wait_on_freeing_inode for kswapd to
complete the inode deletion. Note that this OPEN RPC has almost completed - it
is stuck processing nfs4_opendata_to_nfs4_state, but it has yet to call
nfs_release_seqid:

kernel: Call Trace:
kernel: [<ffffffff81224590>] ? user_match+0x0/0x20
kernel: [<ffffffff8109b7ce>] ? prepare_to_wait+0x4e/0x80
kernel: [<ffffffff811a55b8>] __wait_on_freeing_inode+0x98/0xc0
kernel: [<ffffffff8109b520>] ? wake_bit_function+0x0/0x50
kernel: [<ffffffffa03f3d80>] ? nfs_find_actor+0x0/0x90 [nfs]
kernel: [<ffffffff811a5764>] find_inode+0x64/0x90
kernel: [<ffffffffa03f3d80>] ? nfs_find_actor+0x0/0x90 [nfs]
kernel: [<ffffffff811a68ad>] ifind+0x4d/0xd0
kernel: [<ffffffffa03f3d80>] ? nfs_find_actor+0x0/0x90 [nfs]
kernel: [<ffffffff811a6d29>] iget5_locked+0x59/0x1b0
kernel: [<ffffffffa03f3280>] ? nfs_init_locked+0x0/0x40 [nfs]
kernel: [<ffffffffa03f54f6>] nfs_fhget+0xc6/0x6c0 [nfs]
kernel: [<ffffffffa040def1>] nfs4_opendata_to_nfs4_state+0x1c1/0x330 [nfs]
kernel: [<ffffffffa040ec3c>] _nfs4_do_open+0x21c/0x4f0 [nfs]
kernel: [<ffffffffa035ac05>] ? rpcauth_lookup_credcache+0xc5/0x260 [sunrpc]
kernel: [<ffffffffa040ef95>] nfs4_do_open+0x85/0x170 [nfs]
kernel: [<ffffffffa040f0a8>] nfs4_atomic_open+0x28/0x50 [nfs]
kernel: [<ffffffffa03ee9fd>] nfs_atomic_lookup+0x15d/0x310 [nfs]
kernel: [<ffffffff81198ae5>] do_lookup+0x1a5/0x230
kernel: [<ffffffff811993fc>] __link_path_walk+0x78c/0xfe0
kernel: [<ffffffff81121f20>] ? __generic_file_aio_write+0x260/0x490
kernel: [<ffffffffa0357d30>] ? rpc_do_put_task+0x30/0x40 [sunrpc]
kernel: [<ffffffff81199f1a>] path_walk+0x6a/0xe0
kernel: [<ffffffff8119a12b>] filename_lookup+0x6b/0xc0
kernel: [<ffffffff81226466>] ? security_file_alloc+0x16/0x20
kernel: [<ffffffff8119b5f4>] do_filp_open+0x104/0xd20
kernel: [<ffffffff8109b4a0>] ? autoremove_wake_function+0x0/0x40
kernel: [<ffffffff8118e9a4>] ? cp_new_stat+0xe4/0x100
kernel: [<ffffffff811a82b2>] ? alloc_fd+0x92/0x160
kernel: [<ffffffff81185f19>] do_sys_open+0x69/0x140
kernel: [<ffffffff81189a61>] ? sys_write+0x51/0x90
kernel: [<ffffffff81186030>] sys_open+0x20/0x30
kernel: [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b

The OPEN from the state manager (this is an educated guess) is waiting for the
above open to release the seqid - so it is waiting on the Seqid_waitqueue

kernel: 11683 0081 0 ffff880037827c00 (null) 0 ffffffffa0430180 nfsv4 OPEN a:rpc_prepare_task q:Seqid_waitqueue

Turning off error handling for layoutreturn calls that come from nfs4_evict_inode will prevent the race. It would be more accurate to only turn off this error handling when kswapd and the state manager are running, but that seemed too complicated to worry about as layoutreturn already passes in a NULL state to nfs4_async_handle_errors and so does not handle a good number errors.


Andy Adamson (1):
NFSv4.1 Don't handle layoutreturn errors when state manager is running

fs/nfs/nfs4proc.c | 6 ++++++
fs/nfs/pnfs.c | 5 ++++-
include/linux/nfs_xdr.h | 1 +
3 files changed, 11 insertions(+), 1 deletion(-)

--
1.8.3.1



2013-11-25 20:51:15

by Adamson, Andy

[permalink] [raw]
Subject: Re: [PATCH 1/1] NFSv4.1 fix a kswap nfs4_state_manger race


On Nov 25, 2013, at 3:29 PM, "Adamson, Andy" <[email protected]>
wrote:

>
> On Nov 25, 2013, at 3:20 PM, "Myklebust, Trond" <[email protected]>
> wrote:
>
>>
>> On Nov 25, 2013, at 15:10, Adamson, Andy <[email protected]> wrote:
>>
>>>
>>> On Nov 25, 2013, at 2:53 PM, "Myklebust, Trond" <[email protected]>
>>> wrote:
>>>
>>>>
>>>> On Nov 25, 2013, at 14:27, Adamson, Andy <[email protected]> wrote:
>>>>
>>>>>
>>>>> On Nov 25, 2013, at 1:33 PM, "Myklebust, Trond" <[email protected]>
>>>>> wrote:
>>>>>
>>>>>>
>>>>>> On Nov 25, 2013, at 13:13, Myklebust, Trond <[email protected]> wrote:
>>>>>>
>>>>>>>
>>>>>>> On Nov 25, 2013, at 12:57, <[email protected]> <[email protected]> wrote:
>>>>>>>
>>>>>>>> From: Andy Adamson <[email protected]>
>>>>>>>>
>>>>>>>> The state manager is recovering expired state and recovery OPENs are being
>>>>>>>> processed. If kswapd is pruning inodes at the same time, a deadlock can occur
>>>>>>>> when kswapd calls evict_inode on an NFSv4.1 inode with a layout, and the
>>>>>>>> resultant layoutreturn gets an error that the state mangager is to handle,
>>>>>>>> causing the layoutreturn to wait on the (NFS client) cl_rpcwaitq.
>>>>>>>>
>>>>>>>> At the same time an open is waiting for the inode deletion to complete in
>>>>>>>> __wait_on_freeing_inode.
>>>>>>>>
>>>>>>>> If the open is either the open called by the state manager, or an open from
>>>>>>>> the same open owner that is holding the NFSv4.0 sequence id which causes the
>>>>>>>> OPEN from the state manager to wait for the sequence id on the Seqid_waitqueue,
>>>>>>>> then the state is deadlocked with kswapd.
>>>>>>>>
>>>>>>>> Do not handle LAYOUTRETURN errors when called from nfs4_evict_inode.
>>>>>>>
>>>>>>> Why are we waiting for recovery in LAYOUTRETURN at all? Layouts are automatically lost when the server reboots or when the lease is otherwise lost.
>>>>>>>
>>>>>>> IOW: Is there any reason why we need to special-case nfs4_evict_inode? Shouldn?t we just bail out on error in _all_ cases?
>>>>>>
>>>>>> BTW: Is it possible that we might have a similar problem with delegreturn? That too can be called from nfs4_evict_inode?
>>>>>
>>>>> Yes, good point. kswapd could be waiting for a delegation to return which has an error along with the same scenario with sys_open and the state manager running.
>>>>>
>>>>> With delegreturn, we most definately want to limit 'no error handling' to the evict inode case.
>>>>
>>>> Ah? I forgot that the delegreturn in nfs4_evict_inode is asynchronous and doesn?t wait for completion, so it shouldn?t be a problem here.
>>>
>>> Except we just changed that to fix a different state manager hang:
>>>
>>> commit 4a82fd7c4e78a1b7a224f9ae8bb7e1fd95f670e0
>>> Author: Andy Adamson <[email protected]>
>>> Date: Fri Nov 15 16:36:16 2013 -0500
>>>
>>> NFSv4 wait on recovery for async session errors
>>
>> Right, but that won?t prevent nfs4_evict_inode from completing,
>
> Ah - I was thinking of the synchronous handlers call to nfs4_wait_clnt_recover - so yes, no problem

In fact, this issue is NOT an upstream issue! RHEL6.5-pre has nfs4_proc_layoutreturn as as SYNC rpc call, and _that_ is the bug that is fixed upstream.

Really sorry for the confusion. I'll back port a solution for RHEL6.5

-->Andy


>
> -->Andy
>
>> and hence the OPEN that is waiting in nfs_fhget() can also complete, and so there is no deadlock with the state manager thread.
>
>>
>> Cheers
>> Trond
>> --
>> Trond Myklebust
>> Linux NFS client maintainer
>>
>> NetApp
>> [email protected]
>> http://www.netapp.com
>>
>


2013-11-25 18:33:23

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 1/1] NFSv4.1 fix a kswap nfs4_state_manger race


On Nov 25, 2013, at 13:13, Myklebust, Trond <[email protected]> wrote:

>
> On Nov 25, 2013, at 12:57, <[email protected]> <[email protected]> wrote:
>
>> From: Andy Adamson <[email protected]>
>>
>> The state manager is recovering expired state and recovery OPENs are being
>> processed. If kswapd is pruning inodes at the same time, a deadlock can occur
>> when kswapd calls evict_inode on an NFSv4.1 inode with a layout, and the
>> resultant layoutreturn gets an error that the state mangager is to handle,
>> causing the layoutreturn to wait on the (NFS client) cl_rpcwaitq.
>>
>> At the same time an open is waiting for the inode deletion to complete in
>> __wait_on_freeing_inode.
>>
>> If the open is either the open called by the state manager, or an open from
>> the same open owner that is holding the NFSv4.0 sequence id which causes the
>> OPEN from the state manager to wait for the sequence id on the Seqid_waitqueue,
>> then the state is deadlocked with kswapd.
>>
>> Do not handle LAYOUTRETURN errors when called from nfs4_evict_inode.
>
> Why are we waiting for recovery in LAYOUTRETURN at all? Layouts are automatically lost when the server reboots or when the lease is otherwise lost.
>
> IOW: Is there any reason why we need to special-case nfs4_evict_inode? Shouldn?t we just bail out on error in _all_ cases?

BTW: Is it possible that we might have a similar problem with delegreturn? That too can be called from nfs4_evict_inode...

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2013-11-25 20:54:07

by Adamson, Andy

[permalink] [raw]
Subject: Re: [PATCH 1/1] NFSv4.1 fix a kswap nfs4_state_manger race


On Nov 25, 2013, at 3:51 PM, "Adamson, Andy" <[email protected]>
wrote:

>
> On Nov 25, 2013, at 3:29 PM, "Adamson, Andy" <[email protected]>
> wrote:
>
>>
>> On Nov 25, 2013, at 3:20 PM, "Myklebust, Trond" <[email protected]>
>> wrote:
>>
>>>
>>> On Nov 25, 2013, at 15:10, Adamson, Andy <[email protected]> wrote:
>>>
>>>>
>>>> On Nov 25, 2013, at 2:53 PM, "Myklebust, Trond" <[email protected]>
>>>> wrote:
>>>>
>>>>>
>>>>> On Nov 25, 2013, at 14:27, Adamson, Andy <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>> On Nov 25, 2013, at 1:33 PM, "Myklebust, Trond" <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> On Nov 25, 2013, at 13:13, Myklebust, Trond <[email protected]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Nov 25, 2013, at 12:57, <[email protected]> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> From: Andy Adamson <[email protected]>
>>>>>>>>>
>>>>>>>>> The state manager is recovering expired state and recovery OPENs are being
>>>>>>>>> processed. If kswapd is pruning inodes at the same time, a deadlock can occur
>>>>>>>>> when kswapd calls evict_inode on an NFSv4.1 inode with a layout, and the
>>>>>>>>> resultant layoutreturn gets an error that the state mangager is to handle,
>>>>>>>>> causing the layoutreturn to wait on the (NFS client) cl_rpcwaitq.
>>>>>>>>>
>>>>>>>>> At the same time an open is waiting for the inode deletion to complete in
>>>>>>>>> __wait_on_freeing_inode.
>>>>>>>>>
>>>>>>>>> If the open is either the open called by the state manager, or an open from
>>>>>>>>> the same open owner that is holding the NFSv4.0 sequence id which causes the
>>>>>>>>> OPEN from the state manager to wait for the sequence id on the Seqid_waitqueue,
>>>>>>>>> then the state is deadlocked with kswapd.
>>>>>>>>>
>>>>>>>>> Do not handle LAYOUTRETURN errors when called from nfs4_evict_inode.
>>>>>>>>
>>>>>>>> Why are we waiting for recovery in LAYOUTRETURN at all? Layouts are automatically lost when the server reboots or when the lease is otherwise lost.
>>>>>>>>
>>>>>>>> IOW: Is there any reason why we need to special-case nfs4_evict_inode? Shouldn?t we just bail out on error in _all_ cases?
>>>>>>>
>>>>>>> BTW: Is it possible that we might have a similar problem with delegreturn? That too can be called from nfs4_evict_inode?
>>>>>>
>>>>>> Yes, good point. kswapd could be waiting for a delegation to return which has an error along with the same scenario with sys_open and the state manager running.
>>>>>>
>>>>>> With delegreturn, we most definately want to limit 'no error handling' to the evict inode case.
>>>>>
>>>>> Ah? I forgot that the delegreturn in nfs4_evict_inode is asynchronous and doesn?t wait for completion, so it shouldn?t be a problem here.
>>>>
>>>> Except we just changed that to fix a different state manager hang:
>>>>
>>>> commit 4a82fd7c4e78a1b7a224f9ae8bb7e1fd95f670e0
>>>> Author: Andy Adamson <[email protected]>
>>>> Date: Fri Nov 15 16:36:16 2013 -0500
>>>>
>>>> NFSv4 wait on recovery for async session errors
>>>
>>> Right, but that won?t prevent nfs4_evict_inode from completing,
>>
>> Ah - I was thinking of the synchronous handlers call to nfs4_wait_clnt_recover - so yes, no problem
>
> In fact, this issue is NOT an upstream issue! RHEL6.5-pre has nfs4_proc_layoutreturn as as SYNC rpc call, and _that_ is the bug that is fixed upstream.
>
> Really sorry for the confusion. I'll back port a solution for RHEL6.5


please ignore this message.


>
> -->Andy
>
>
>>
>> -->Andy
>>
>>> and hence the OPEN that is waiting in nfs_fhget() can also complete, and so there is no deadlock with the state manager thread.
>>
>>>
>>> Cheers
>>> Trond
>>> --
>>> Trond Myklebust
>>> Linux NFS client maintainer
>>>
>>> NetApp
>>> [email protected]
>>> http://www.netapp.com
>>>
>>
>


2013-11-25 20:58:00

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 1/1] NFSv4.1 fix a kswap nfs4_state_manger race


On Nov 25, 2013, at 15:51, Adamson, Andy <[email protected]> wrote:

>
> On Nov 25, 2013, at 3:29 PM, "Adamson, Andy" <[email protected]>
> wrote:
>
>>
>> On Nov 25, 2013, at 3:20 PM, "Myklebust, Trond" <[email protected]>
>> wrote:
>>
>>>
>>> On Nov 25, 2013, at 15:10, Adamson, Andy <[email protected]> wrote:
>>>
>>>>
>>>> On Nov 25, 2013, at 2:53 PM, "Myklebust, Trond" <[email protected]>
>>>> wrote:
>>>>
>>>>>
>>>>> On Nov 25, 2013, at 14:27, Adamson, Andy <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>> On Nov 25, 2013, at 1:33 PM, "Myklebust, Trond" <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>>
>>>>>>> On Nov 25, 2013, at 13:13, Myklebust, Trond <[email protected]> wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Nov 25, 2013, at 12:57, <[email protected]> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> From: Andy Adamson <[email protected]>
>>>>>>>>>
>>>>>>>>> The state manager is recovering expired state and recovery OPENs are being
>>>>>>>>> processed. If kswapd is pruning inodes at the same time, a deadlock can occur
>>>>>>>>> when kswapd calls evict_inode on an NFSv4.1 inode with a layout, and the
>>>>>>>>> resultant layoutreturn gets an error that the state mangager is to handle,
>>>>>>>>> causing the layoutreturn to wait on the (NFS client) cl_rpcwaitq.
>>>>>>>>>
>>>>>>>>> At the same time an open is waiting for the inode deletion to complete in
>>>>>>>>> __wait_on_freeing_inode.
>>>>>>>>>
>>>>>>>>> If the open is either the open called by the state manager, or an open from
>>>>>>>>> the same open owner that is holding the NFSv4.0 sequence id which causes the
>>>>>>>>> OPEN from the state manager to wait for the sequence id on the Seqid_waitqueue,
>>>>>>>>> then the state is deadlocked with kswapd.
>>>>>>>>>
>>>>>>>>> Do not handle LAYOUTRETURN errors when called from nfs4_evict_inode.
>>>>>>>>
>>>>>>>> Why are we waiting for recovery in LAYOUTRETURN at all? Layouts are automatically lost when the server reboots or when the lease is otherwise lost.
>>>>>>>>
>>>>>>>> IOW: Is there any reason why we need to special-case nfs4_evict_inode? Shouldn?t we just bail out on error in _all_ cases?
>>>>>>>
>>>>>>> BTW: Is it possible that we might have a similar problem with delegreturn? That too can be called from nfs4_evict_inode?
>>>>>>
>>>>>> Yes, good point. kswapd could be waiting for a delegation to return which has an error along with the same scenario with sys_open and the state manager running.
>>>>>>
>>>>>> With delegreturn, we most definately want to limit 'no error handling' to the evict inode case.
>>>>>
>>>>> Ah? I forgot that the delegreturn in nfs4_evict_inode is asynchronous and doesn?t wait for completion, so it shouldn?t be a problem here.
>>>>
>>>> Except we just changed that to fix a different state manager hang:
>>>>
>>>> commit 4a82fd7c4e78a1b7a224f9ae8bb7e1fd95f670e0
>>>> Author: Andy Adamson <[email protected]>
>>>> Date: Fri Nov 15 16:36:16 2013 -0500
>>>>
>>>> NFSv4 wait on recovery for async session errors
>>>
>>> Right, but that won?t prevent nfs4_evict_inode from completing,
>>
>> Ah - I was thinking of the synchronous handlers call to nfs4_wait_clnt_recover - so yes, no problem
>
> In fact, this issue is NOT an upstream issue! RHEL6.5-pre has nfs4_proc_layoutreturn as as SYNC rpc call, and _that_ is the bug that is fixed upstream.
>
> Really sorry for the confusion. I'll back port a solution for RHEL6.5

Are you sure? As far as I can tell, the upstream nfs4_proc_layoutreturn is also synchronous. I therefore suspect that we still have the same problem.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2013-11-25 18:13:19

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 1/1] NFSv4.1 fix a kswap nfs4_state_manger race


On Nov 25, 2013, at 12:57, <[email protected]> <[email protected]> wrote:

> From: Andy Adamson <[email protected]>
>
> The state manager is recovering expired state and recovery OPENs are being
> processed. If kswapd is pruning inodes at the same time, a deadlock can occur
> when kswapd calls evict_inode on an NFSv4.1 inode with a layout, and the
> resultant layoutreturn gets an error that the state mangager is to handle,
> causing the layoutreturn to wait on the (NFS client) cl_rpcwaitq.
>
> At the same time an open is waiting for the inode deletion to complete in
> __wait_on_freeing_inode.
>
> If the open is either the open called by the state manager, or an open from
> the same open owner that is holding the NFSv4.0 sequence id which causes the
> OPEN from the state manager to wait for the sequence id on the Seqid_waitqueue,
> then the state is deadlocked with kswapd.
>
> Do not handle LAYOUTRETURN errors when called from nfs4_evict_inode.

Why are we waiting for recovery in LAYOUTRETURN at all? Layouts are automatically lost when the server reboots or when the lease is otherwise lost.

IOW: Is there any reason why we need to special-case nfs4_evict_inode? Shouldn?t we just bail out on error in _all_ cases?

Cheers
Trond

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2013-11-25 20:10:31

by Adamson, Andy

[permalink] [raw]
Subject: Re: [PATCH 1/1] NFSv4.1 fix a kswap nfs4_state_manger race


On Nov 25, 2013, at 2:53 PM, "Myklebust, Trond" <[email protected]>
wrote:

>
> On Nov 25, 2013, at 14:27, Adamson, Andy <[email protected]> wrote:
>
>>
>> On Nov 25, 2013, at 1:33 PM, "Myklebust, Trond" <[email protected]>
>> wrote:
>>
>>>
>>> On Nov 25, 2013, at 13:13, Myklebust, Trond <[email protected]> wrote:
>>>
>>>>
>>>> On Nov 25, 2013, at 12:57, <[email protected]> <[email protected]> wrote:
>>>>
>>>>> From: Andy Adamson <[email protected]>
>>>>>
>>>>> The state manager is recovering expired state and recovery OPENs are being
>>>>> processed. If kswapd is pruning inodes at the same time, a deadlock can occur
>>>>> when kswapd calls evict_inode on an NFSv4.1 inode with a layout, and the
>>>>> resultant layoutreturn gets an error that the state mangager is to handle,
>>>>> causing the layoutreturn to wait on the (NFS client) cl_rpcwaitq.
>>>>>
>>>>> At the same time an open is waiting for the inode deletion to complete in
>>>>> __wait_on_freeing_inode.
>>>>>
>>>>> If the open is either the open called by the state manager, or an open from
>>>>> the same open owner that is holding the NFSv4.0 sequence id which causes the
>>>>> OPEN from the state manager to wait for the sequence id on the Seqid_waitqueue,
>>>>> then the state is deadlocked with kswapd.
>>>>>
>>>>> Do not handle LAYOUTRETURN errors when called from nfs4_evict_inode.
>>>>
>>>> Why are we waiting for recovery in LAYOUTRETURN at all? Layouts are automatically lost when the server reboots or when the lease is otherwise lost.
>>>>
>>>> IOW: Is there any reason why we need to special-case nfs4_evict_inode? Shouldn?t we just bail out on error in _all_ cases?
>>>
>>> BTW: Is it possible that we might have a similar problem with delegreturn? That too can be called from nfs4_evict_inode?
>>
>> Yes, good point. kswapd could be waiting for a delegation to return which has an error along with the same scenario with sys_open and the state manager running.
>>
>> With delegreturn, we most definately want to limit 'no error handling' to the evict inode case.
>
> Ah? I forgot that the delegreturn in nfs4_evict_inode is asynchronous and doesn?t wait for completion, so it shouldn?t be a problem here.

Except we just changed that to fix a different state manager hang:

commit 4a82fd7c4e78a1b7a224f9ae8bb7e1fd95f670e0
Author: Andy Adamson <[email protected]>
Date: Fri Nov 15 16:36:16 2013 -0500

NFSv4 wait on recovery for async session errors

>
> Cheers
> Trond


2013-11-25 19:54:00

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 1/1] NFSv4.1 fix a kswap nfs4_state_manger race


On Nov 25, 2013, at 14:27, Adamson, Andy <[email protected]> wrote:

>
> On Nov 25, 2013, at 1:33 PM, "Myklebust, Trond" <[email protected]>
> wrote:
>
>>
>> On Nov 25, 2013, at 13:13, Myklebust, Trond <[email protected]> wrote:
>>
>>>
>>> On Nov 25, 2013, at 12:57, <[email protected]> <[email protected]> wrote:
>>>
>>>> From: Andy Adamson <[email protected]>
>>>>
>>>> The state manager is recovering expired state and recovery OPENs are being
>>>> processed. If kswapd is pruning inodes at the same time, a deadlock can occur
>>>> when kswapd calls evict_inode on an NFSv4.1 inode with a layout, and the
>>>> resultant layoutreturn gets an error that the state mangager is to handle,
>>>> causing the layoutreturn to wait on the (NFS client) cl_rpcwaitq.
>>>>
>>>> At the same time an open is waiting for the inode deletion to complete in
>>>> __wait_on_freeing_inode.
>>>>
>>>> If the open is either the open called by the state manager, or an open from
>>>> the same open owner that is holding the NFSv4.0 sequence id which causes the
>>>> OPEN from the state manager to wait for the sequence id on the Seqid_waitqueue,
>>>> then the state is deadlocked with kswapd.
>>>>
>>>> Do not handle LAYOUTRETURN errors when called from nfs4_evict_inode.
>>>
>>> Why are we waiting for recovery in LAYOUTRETURN at all? Layouts are automatically lost when the server reboots or when the lease is otherwise lost.
>>>
>>> IOW: Is there any reason why we need to special-case nfs4_evict_inode? Shouldn?t we just bail out on error in _all_ cases?
>>
>> BTW: Is it possible that we might have a similar problem with delegreturn? That too can be called from nfs4_evict_inode?
>
> Yes, good point. kswapd could be waiting for a delegation to return which has an error along with the same scenario with sys_open and the state manager running.
>
> With delegreturn, we most definately want to limit 'no error handling' to the evict inode case.

Ah? I forgot that the delegreturn in nfs4_evict_inode is asynchronous and doesn?t wait for completion, so it shouldn?t be a problem here.

Cheers
Trond

2013-11-25 20:29:47

by Adamson, Andy

[permalink] [raw]
Subject: Re: [PATCH 1/1] NFSv4.1 fix a kswap nfs4_state_manger race


On Nov 25, 2013, at 3:20 PM, "Myklebust, Trond" <[email protected]>
wrote:

>
> On Nov 25, 2013, at 15:10, Adamson, Andy <[email protected]> wrote:
>
>>
>> On Nov 25, 2013, at 2:53 PM, "Myklebust, Trond" <[email protected]>
>> wrote:
>>
>>>
>>> On Nov 25, 2013, at 14:27, Adamson, Andy <[email protected]> wrote:
>>>
>>>>
>>>> On Nov 25, 2013, at 1:33 PM, "Myklebust, Trond" <[email protected]>
>>>> wrote:
>>>>
>>>>>
>>>>> On Nov 25, 2013, at 13:13, Myklebust, Trond <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>> On Nov 25, 2013, at 12:57, <[email protected]> <[email protected]> wrote:
>>>>>>
>>>>>>> From: Andy Adamson <[email protected]>
>>>>>>>
>>>>>>> The state manager is recovering expired state and recovery OPENs are being
>>>>>>> processed. If kswapd is pruning inodes at the same time, a deadlock can occur
>>>>>>> when kswapd calls evict_inode on an NFSv4.1 inode with a layout, and the
>>>>>>> resultant layoutreturn gets an error that the state mangager is to handle,
>>>>>>> causing the layoutreturn to wait on the (NFS client) cl_rpcwaitq.
>>>>>>>
>>>>>>> At the same time an open is waiting for the inode deletion to complete in
>>>>>>> __wait_on_freeing_inode.
>>>>>>>
>>>>>>> If the open is either the open called by the state manager, or an open from
>>>>>>> the same open owner that is holding the NFSv4.0 sequence id which causes the
>>>>>>> OPEN from the state manager to wait for the sequence id on the Seqid_waitqueue,
>>>>>>> then the state is deadlocked with kswapd.
>>>>>>>
>>>>>>> Do not handle LAYOUTRETURN errors when called from nfs4_evict_inode.
>>>>>>
>>>>>> Why are we waiting for recovery in LAYOUTRETURN at all? Layouts are automatically lost when the server reboots or when the lease is otherwise lost.
>>>>>>
>>>>>> IOW: Is there any reason why we need to special-case nfs4_evict_inode? Shouldn?t we just bail out on error in _all_ cases?
>>>>>
>>>>> BTW: Is it possible that we might have a similar problem with delegreturn? That too can be called from nfs4_evict_inode?
>>>>
>>>> Yes, good point. kswapd could be waiting for a delegation to return which has an error along with the same scenario with sys_open and the state manager running.
>>>>
>>>> With delegreturn, we most definately want to limit 'no error handling' to the evict inode case.
>>>
>>> Ah? I forgot that the delegreturn in nfs4_evict_inode is asynchronous and doesn?t wait for completion, so it shouldn?t be a problem here.
>>
>> Except we just changed that to fix a different state manager hang:
>>
>> commit 4a82fd7c4e78a1b7a224f9ae8bb7e1fd95f670e0
>> Author: Andy Adamson <[email protected]>
>> Date: Fri Nov 15 16:36:16 2013 -0500
>>
>> NFSv4 wait on recovery for async session errors
>
> Right, but that won?t prevent nfs4_evict_inode from completing,

Ah - I was thinking of the synchronous handlers call to nfs4_wait_clnt_recover - so yes, no problem

-->Andy

> and hence the OPEN that is waiting in nfs_fhget() can also complete, and so there is no deadlock with the state manager thread.

>
> Cheers
> Trond
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
>


2013-11-25 20:21:07

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 1/1] NFSv4.1 fix a kswap nfs4_state_manger race


On Nov 25, 2013, at 15:10, Adamson, Andy <[email protected]> wrote:

>
> On Nov 25, 2013, at 2:53 PM, "Myklebust, Trond" <[email protected]>
> wrote:
>
>>
>> On Nov 25, 2013, at 14:27, Adamson, Andy <[email protected]> wrote:
>>
>>>
>>> On Nov 25, 2013, at 1:33 PM, "Myklebust, Trond" <[email protected]>
>>> wrote:
>>>
>>>>
>>>> On Nov 25, 2013, at 13:13, Myklebust, Trond <[email protected]> wrote:
>>>>
>>>>>
>>>>> On Nov 25, 2013, at 12:57, <[email protected]> <[email protected]> wrote:
>>>>>
>>>>>> From: Andy Adamson <[email protected]>
>>>>>>
>>>>>> The state manager is recovering expired state and recovery OPENs are being
>>>>>> processed. If kswapd is pruning inodes at the same time, a deadlock can occur
>>>>>> when kswapd calls evict_inode on an NFSv4.1 inode with a layout, and the
>>>>>> resultant layoutreturn gets an error that the state mangager is to handle,
>>>>>> causing the layoutreturn to wait on the (NFS client) cl_rpcwaitq.
>>>>>>
>>>>>> At the same time an open is waiting for the inode deletion to complete in
>>>>>> __wait_on_freeing_inode.
>>>>>>
>>>>>> If the open is either the open called by the state manager, or an open from
>>>>>> the same open owner that is holding the NFSv4.0 sequence id which causes the
>>>>>> OPEN from the state manager to wait for the sequence id on the Seqid_waitqueue,
>>>>>> then the state is deadlocked with kswapd.
>>>>>>
>>>>>> Do not handle LAYOUTRETURN errors when called from nfs4_evict_inode.
>>>>>
>>>>> Why are we waiting for recovery in LAYOUTRETURN at all? Layouts are automatically lost when the server reboots or when the lease is otherwise lost.
>>>>>
>>>>> IOW: Is there any reason why we need to special-case nfs4_evict_inode? Shouldn?t we just bail out on error in _all_ cases?
>>>>
>>>> BTW: Is it possible that we might have a similar problem with delegreturn? That too can be called from nfs4_evict_inode?
>>>
>>> Yes, good point. kswapd could be waiting for a delegation to return which has an error along with the same scenario with sys_open and the state manager running.
>>>
>>> With delegreturn, we most definately want to limit 'no error handling' to the evict inode case.
>>
>> Ah? I forgot that the delegreturn in nfs4_evict_inode is asynchronous and doesn?t wait for completion, so it shouldn?t be a problem here.
>
> Except we just changed that to fix a different state manager hang:
>
> commit 4a82fd7c4e78a1b7a224f9ae8bb7e1fd95f670e0
> Author: Andy Adamson <[email protected]>
> Date: Fri Nov 15 16:36:16 2013 -0500
>
> NFSv4 wait on recovery for async session errors

Right, but that won?t prevent nfs4_evict_inode from completing, and hence the OPEN that is waiting in nfs_fhget() can also complete, and so there is no deadlock with the state manager thread.

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2013-11-25 18:17:52

by Adamson, Andy

[permalink] [raw]
Subject: Re: [PATCH 1/1] NFSv4.1 fix a kswap nfs4_state_manger race


On Nov 25, 2013, at 1:13 PM, "Myklebust, Trond" <[email protected]>
wrote:

>
> On Nov 25, 2013, at 12:57, <[email protected]> <[email protected]> wrote:
>
>> From: Andy Adamson <[email protected]>
>>
>> The state manager is recovering expired state and recovery OPENs are being
>> processed. If kswapd is pruning inodes at the same time, a deadlock can occur
>> when kswapd calls evict_inode on an NFSv4.1 inode with a layout, and the
>> resultant layoutreturn gets an error that the state mangager is to handle,
>> causing the layoutreturn to wait on the (NFS client) cl_rpcwaitq.
>>
>> At the same time an open is waiting for the inode deletion to complete in
>> __wait_on_freeing_inode.
>>
>> If the open is either the open called by the state manager, or an open from
>> the same open owner that is holding the NFSv4.0 sequence id which causes the
>> OPEN from the state manager to wait for the sequence id on the Seqid_waitqueue,
>> then the state is deadlocked with kswapd.
>>
>> Do not handle LAYOUTRETURN errors when called from nfs4_evict_inode.
>
> Why are we waiting for recovery in LAYOUTRETURN at all? Layouts are automatically lost when the server reboots or when the lease is otherwise lost.
>
> IOW: Is there any reason why we need to special-case nfs4_evict_inode? Shouldn?t we just bail out on error in _all_ cases?

Yeah, I was thinking about this as well - perhaps recovering from session-level errors or grace/delay errors would be useful for the block client.

-->Andy

>
> Cheers
> Trond
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
>


2013-11-25 18:28:18

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 1/1] NFSv4.1 fix a kswap nfs4_state_manger race


On Nov 25, 2013, at 13:17, Adamson, Andy <[email protected]> wrote:

>
> On Nov 25, 2013, at 1:13 PM, "Myklebust, Trond" <[email protected]>
> wrote:
>
>>
>> On Nov 25, 2013, at 12:57, <[email protected]> <[email protected]> wrote:
>>
>>> From: Andy Adamson <[email protected]>
>>>
>>> The state manager is recovering expired state and recovery OPENs are being
>>> processed. If kswapd is pruning inodes at the same time, a deadlock can occur
>>> when kswapd calls evict_inode on an NFSv4.1 inode with a layout, and the
>>> resultant layoutreturn gets an error that the state mangager is to handle,
>>> causing the layoutreturn to wait on the (NFS client) cl_rpcwaitq.
>>>
>>> At the same time an open is waiting for the inode deletion to complete in
>>> __wait_on_freeing_inode.
>>>
>>> If the open is either the open called by the state manager, or an open from
>>> the same open owner that is holding the NFSv4.0 sequence id which causes the
>>> OPEN from the state manager to wait for the sequence id on the Seqid_waitqueue,
>>> then the state is deadlocked with kswapd.
>>>
>>> Do not handle LAYOUTRETURN errors when called from nfs4_evict_inode.
>>
>> Why are we waiting for recovery in LAYOUTRETURN at all? Layouts are automatically lost when the server reboots or when the lease is otherwise lost.
>>
>> IOW: Is there any reason why we need to special-case nfs4_evict_inode? Shouldn?t we just bail out on error in _all_ cases?
>
> Yeah, I was thinking about this as well - perhaps recovering from session-level errors or grace/delay errors would be useful for the block client.

NFS4ERR_DELAY, probably, yes.

NFS4ERR_GRACE, no? That?s a reboot situation

As for session level errors, I?d say that complicates things too much, since several of those can basically end up masking a NFS4ERR_STALE_CLIENTID error.


Either way, all the layout types (including blocks) should be able to continue on even if we miss a layout return or two. The server has to be coded to expect a forgetful client.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


2013-11-25 17:57:56

by Andy Adamson

[permalink] [raw]
Subject: [PATCH 1/1] NFSv4.1 fix a kswap nfs4_state_manger race

From: Andy Adamson <[email protected]>

The state manager is recovering expired state and recovery OPENs are being
processed. If kswapd is pruning inodes at the same time, a deadlock can occur
when kswapd calls evict_inode on an NFSv4.1 inode with a layout, and the
resultant layoutreturn gets an error that the state mangager is to handle,
causing the layoutreturn to wait on the (NFS client) cl_rpcwaitq.

At the same time an open is waiting for the inode deletion to complete in
__wait_on_freeing_inode.

If the open is either the open called by the state manager, or an open from
the same open owner that is holding the NFSv4.0 sequence id which causes the
OPEN from the state manager to wait for the sequence id on the Seqid_waitqueue,
then the state is deadlocked with kswapd.

Do not handle LAYOUTRETURN errors when called from nfs4_evict_inode.

Signed-off-by: Andy Adamson <[email protected]>
---
fs/nfs/nfs4proc.c | 6 ++++++
fs/nfs/pnfs.c | 5 ++++-
include/linux/nfs_xdr.h | 1 +
3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/nfs/nfs4proc.c b/fs/nfs/nfs4proc.c
index ca36d0d..fbeadf3 100644
--- a/fs/nfs/nfs4proc.c
+++ b/fs/nfs/nfs4proc.c
@@ -7596,6 +7596,12 @@ static void nfs4_layoutreturn_done(struct rpc_task *task, void *calldata)
return;

server = NFS_SERVER(lrp->args.inode);
+
+ /* Error handling can dead lock the state manager running open
+ * recovery when kswapd is also pruning inodes. */
+ if (lrp->ino_freeing)
+ return;
+
if (nfs4_async_handle_error(task, server, NULL) == -EAGAIN) {
rpc_restart_call_prepare(task);
return;
diff --git a/fs/nfs/pnfs.c b/fs/nfs/pnfs.c
index d75d938..a55ebf2 100644
--- a/fs/nfs/pnfs.c
+++ b/fs/nfs/pnfs.c
@@ -819,7 +819,7 @@ _pnfs_return_layout(struct inode *ino)
LIST_HEAD(tmp_list);
struct nfs4_layoutreturn *lrp;
nfs4_stateid stateid;
- int status = 0, empty;
+ int status = 0, empty, freeing = 0;

dprintk("NFS: %s for inode %lu\n", __func__, ino->i_ino);

@@ -844,6 +844,8 @@ _pnfs_return_layout(struct inode *ino)
goto out;
}
lo->plh_block_lgets++;
+ if (ino->i_state & I_FREEING)
+ freeing = 1;
spin_unlock(&ino->i_lock);
pnfs_free_lseg_list(&tmp_list);

@@ -863,6 +865,7 @@ _pnfs_return_layout(struct inode *ino)
lrp->args.layout = lo;
lrp->clp = NFS_SERVER(ino)->nfs_client;
lrp->cred = lo->plh_lc_cred;
+ lrp->ino_freeing = freeing;

status = nfs4_proc_layoutreturn(lrp);
out:
diff --git a/include/linux/nfs_xdr.h b/include/linux/nfs_xdr.h
index 3ccfcec..b815345 100644
--- a/include/linux/nfs_xdr.h
+++ b/include/linux/nfs_xdr.h
@@ -316,6 +316,7 @@ struct nfs4_layoutreturn {
struct nfs4_layoutreturn_res res;
struct rpc_cred *cred;
struct nfs_client *clp;
+ int ino_freeing;
int rpc_status;
};

--
1.8.3.1


2013-11-25 18:32:40

by Adamson, Andy

[permalink] [raw]
Subject: Re: [PATCH 1/1] NFSv4.1 fix a kswap nfs4_state_manger race


On Nov 25, 2013, at 1:28 PM, "Myklebust, Trond" <[email protected]>
wrote:

>
> On Nov 25, 2013, at 13:17, Adamson, Andy <[email protected]> wrote:
>
>>
>> On Nov 25, 2013, at 1:13 PM, "Myklebust, Trond" <[email protected]>
>> wrote:
>>
>>>
>>> On Nov 25, 2013, at 12:57, <[email protected]> <[email protected]> wrote:
>>>
>>>> From: Andy Adamson <[email protected]>
>>>>
>>>> The state manager is recovering expired state and recovery OPENs are being
>>>> processed. If kswapd is pruning inodes at the same time, a deadlock can occur
>>>> when kswapd calls evict_inode on an NFSv4.1 inode with a layout, and the
>>>> resultant layoutreturn gets an error that the state mangager is to handle,
>>>> causing the layoutreturn to wait on the (NFS client) cl_rpcwaitq.
>>>>
>>>> At the same time an open is waiting for the inode deletion to complete in
>>>> __wait_on_freeing_inode.
>>>>
>>>> If the open is either the open called by the state manager, or an open from
>>>> the same open owner that is holding the NFSv4.0 sequence id which causes the
>>>> OPEN from the state manager to wait for the sequence id on the Seqid_waitqueue,
>>>> then the state is deadlocked with kswapd.
>>>>
>>>> Do not handle LAYOUTRETURN errors when called from nfs4_evict_inode.
>>>
>>> Why are we waiting for recovery in LAYOUTRETURN at all? Layouts are automatically lost when the server reboots or when the lease is otherwise lost.
>>>
>>> IOW: Is there any reason why we need to special-case nfs4_evict_inode? Shouldn?t we just bail out on error in _all_ cases?
>>
>> Yeah, I was thinking about this as well - perhaps recovering from session-level errors or grace/delay errors would be useful for the block client.
>
> NFS4ERR_DELAY, probably, yes.
>
> NFS4ERR_GRACE, no? That?s a reboot situation
>
> As for session level errors, I?d say that complicates things too much, since several of those can basically end up masking a NFS4ERR_STALE_CLIENTID error.
>
>
> Either way, all the layout types (including blocks) should be able to continue on even if we miss a layout return or two. The server has to be coded to expect a forgetful client.

OK - I'll resend the patch.

-->Andy
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
>


2013-11-25 21:06:48

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 1/1] NFSv4.1 fix a kswap nfs4_state_manger race

On Mon, 2013-11-25 at 20:29 +-0000, Adamson, Andy wrote:
+AD4- On Nov 25, 2013, at 3:20 PM, +ACI-Myklebust, Trond+ACI- +ADw-Trond.Myklebust+AEA-netapp.com+AD4-
+AD4- wrote:
+AD4-
+AD4- +AD4-
+AD4- +AD4- On Nov 25, 2013, at 15:10, Adamson, Andy +ADw-William.Adamson+AEA-netapp.com+AD4- wrote:
+AD4- +AD4-
+AD4- +AD4APg-
+AD4- +AD4APg- On Nov 25, 2013, at 2:53 PM, +ACI-Myklebust, Trond+ACI- +ADw-Trond.Myklebust+AEA-netapp.com+AD4-
+AD4- +AD4APg- wrote:
+AD4- +AD4APg-
+AD4- +AD4APgA+-
+AD4- +AD4APgA+- On Nov 25, 2013, at 14:27, Adamson, Andy +ADw-William.Adamson+AEA-netapp.com+AD4- wrote:
+AD4- +AD4APgA+-
+AD4- +AD4APgA+AD4-
+AD4- +AD4APgA+AD4- On Nov 25, 2013, at 1:33 PM, +ACI-Myklebust, Trond+ACI- +ADw-Trond.Myklebust+AEA-netapp.com+AD4-
+AD4- +AD4APgA+AD4- wrote:
+AD4- +AD4APgA+AD4-
+AD4- +AD4APgA+AD4APg-
+AD4- +AD4APgA+AD4APg- On Nov 25, 2013, at 13:13, Myklebust, Trond +ADw-Trond.Myklebust+AEA-netapp.com+AD4- wrote:
+AD4- +AD4APgA+AD4APg-
+AD4- +AD4APgA+AD4APgA+-
+AD4- +AD4APgA+AD4APgA+- On Nov 25, 2013, at 12:57, +ADw-andros+AEA-netapp.com+AD4- +ADw-andros+AEA-netapp.com+AD4- wrote:
+AD4- +AD4APgA+AD4APgA+-
+AD4- +AD4APgA+AD4APgA+AD4- From: Andy Adamson +ADw-andros+AEA-netapp.com+AD4-
+AD4- +AD4APgA+AD4APgA+AD4-
+AD4- +AD4APgA+AD4APgA+AD4- The state manager is recovering expired state and recovery OPENs are being
+AD4- +AD4APgA+AD4APgA+AD4- processed. If kswapd is pruning inodes at the same time, a deadlock can occur
+AD4- +AD4APgA+AD4APgA+AD4- when kswapd calls evict+AF8-inode on an NFSv4.1 inode with a layout, and the
+AD4- +AD4APgA+AD4APgA+AD4- resultant layoutreturn gets an error that the state mangager is to handle,
+AD4- +AD4APgA+AD4APgA+AD4- causing the layoutreturn to wait on the (NFS client) cl+AF8-rpcwaitq.
+AD4- +AD4APgA+AD4APgA+AD4-
+AD4- +AD4APgA+AD4APgA+AD4- At the same time an open is waiting for the inode deletion to complete in
+AD4- +AD4APgA+AD4APgA+AD4- +AF8AXw-wait+AF8-on+AF8-freeing+AF8-inode.
+AD4- +AD4APgA+AD4APgA+AD4-
+AD4- +AD4APgA+AD4APgA+AD4- If the open is either the open called by the state manager, or an open from
+AD4- +AD4APgA+AD4APgA+AD4- the same open owner that is holding the NFSv4.0 sequence id which causes the
+AD4- +AD4APgA+AD4APgA+AD4- OPEN from the state manager to wait for the sequence id on the Seqid+AF8-waitqueue,
+AD4- +AD4APgA+AD4APgA+AD4- then the state is deadlocked with kswapd.
+AD4- +AD4APgA+AD4APgA+AD4-
+AD4- +AD4APgA+AD4APgA+AD4- Do not handle LAYOUTRETURN errors when called from nfs4+AF8-evict+AF8-inode.
+AD4- +AD4APgA+AD4APgA+-
+AD4- +AD4APgA+AD4APgA+- Why are we waiting for recovery in LAYOUTRETURN at all? Layouts are automatically lost when the server reboots or when the lease is otherwise lost.
+AD4- +AD4APgA+AD4APgA+-
+AD4- +AD4APgA+AD4APgA+- IOW: Is there any reason why we need to special-case nfs4+AF8-evict+AF8-inode? Shouldn+IBk-t we just bail out on error in +AF8-all+AF8- cases?
+AD4- +AD4APgA+AD4APg-
+AD4- +AD4APgA+AD4APg- BTW: Is it possible that we might have a similar problem with delegreturn? That too can be called from nfs4+AF8-evict+AF8-inode+ICY-
+AD4- +AD4APgA+AD4-
+AD4- +AD4APgA+AD4- Yes, good point. kswapd could be waiting for a delegation to return which has an error along with the same scenario with sys+AF8-open and the state manager running.
+AD4- +AD4APgA+AD4-
+AD4- +AD4APgA+AD4- With delegreturn, we most definately want to limit 'no error handling' to the evict inode case.
+AD4- +AD4APgA+-
+AD4- +AD4APgA+- Ah+ICY- I forgot that the delegreturn in nfs4+AF8-evict+AF8-inode is asynchronous and doesn+IBk-t wait for completion, so it shouldn+IBk-t be a problem here.
+AD4- +AD4APg-
+AD4- +AD4APg- Except we just changed that to fix a different state manager hang:
+AD4- +AD4APg-
+AD4- +AD4APg- commit 4a82fd7c4e78a1b7a224f9ae8bb7e1fd95f670e0
+AD4- +AD4APg- Author: Andy Adamson +ADw-andros+AEA-netapp.com+AD4-
+AD4- +AD4APg- Date: Fri Nov 15 16:36:16 2013 -0500
+AD4- +AD4APg-
+AD4- +AD4APg- NFSv4 wait on recovery for async session errors
+AD4- +AD4-
+AD4- +AD4- Right, but that won+IBk-t prevent nfs4+AF8-evict+AF8-inode from completing,
+AD4-
+AD4- Ah - I was thinking of the synchronous handlers call to nfs4+AF8-wait+AF8-clnt+AF8-recover - so yes, no problem
+AD4-
+AD4- --+AD4-Andy
+AD4-
+AD4- +AD4- and hence the OPEN that is waiting in nfs+AF8-fhget() can also complete, and so there is no deadlock with the state manager thread.

How about something like the attached...

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust+AEA-netapp.com
http://www.netapp.com


Attachments:
patch.dif (604.00 B)
patch.dif

2013-11-25 19:27:14

by Adamson, Andy

[permalink] [raw]
Subject: Re: [PATCH 1/1] NFSv4.1 fix a kswap nfs4_state_manger race


On Nov 25, 2013, at 1:33 PM, "Myklebust, Trond" <[email protected]>
wrote:

>
> On Nov 25, 2013, at 13:13, Myklebust, Trond <[email protected]> wrote:
>
>>
>> On Nov 25, 2013, at 12:57, <[email protected]> <[email protected]> wrote:
>>
>>> From: Andy Adamson <[email protected]>
>>>
>>> The state manager is recovering expired state and recovery OPENs are being
>>> processed. If kswapd is pruning inodes at the same time, a deadlock can occur
>>> when kswapd calls evict_inode on an NFSv4.1 inode with a layout, and the
>>> resultant layoutreturn gets an error that the state mangager is to handle,
>>> causing the layoutreturn to wait on the (NFS client) cl_rpcwaitq.
>>>
>>> At the same time an open is waiting for the inode deletion to complete in
>>> __wait_on_freeing_inode.
>>>
>>> If the open is either the open called by the state manager, or an open from
>>> the same open owner that is holding the NFSv4.0 sequence id which causes the
>>> OPEN from the state manager to wait for the sequence id on the Seqid_waitqueue,
>>> then the state is deadlocked with kswapd.
>>>
>>> Do not handle LAYOUTRETURN errors when called from nfs4_evict_inode.
>>
>> Why are we waiting for recovery in LAYOUTRETURN at all? Layouts are automatically lost when the server reboots or when the lease is otherwise lost.
>>
>> IOW: Is there any reason why we need to special-case nfs4_evict_inode? Shouldn?t we just bail out on error in _all_ cases?
>
> BTW: Is it possible that we might have a similar problem with delegreturn? That too can be called from nfs4_evict_inode?

Yes, good point. kswapd could be waiting for a delegation to return which has an error along with the same scenario with sys_open and the state manager running.

With delegreturn, we most definately want to limit 'no error handling' to the evict inode case.

-->Andy
>
> --
> Trond Myklebust
> Linux NFS client maintainer
>
> NetApp
> [email protected]
> http://www.netapp.com
>