Calling nfs_kill_super from an RPC task callback would result in a deadlock
where nfs_free_server (via rpc_shutdown_client) tries to kill all
RPC tasks associated with that connection - including itself!
Instead of calling nfs_kill_super directly, queue a job on the nfsiod
workqueue.
Signed-off-by: Weston Andros Adamson <[email protected]>
---
This fixes the current incarnation of the lockup I've been tracking down for
some time now. I still have to go back and see why the reproducer changed
behavior a few weeks ago - tasks used to get stuck in rpc_prepare_task, but
now (before this patch) are stuck in rpc_exit.
The reproducer works against a server with write delegations:
./nfsometer.py -m v4 server:/path dd_100m_100k
which is basically:
- mount
- dd if=/dev/zero of=./dd_file.100m_100k bs=102400 count=1024
- umount
- break if /proc/fs/nfsfs/servers still has an entry after 5 seconds (in this
   case it NEVER goes away)
There are clearly other ways to trigger this deadlock, like a v4.1 CLOSE - the
done handler calls nfs_sb_deactive...
I've tested this approach with 10 runs X 3 nfs versions X 5 workloads
(dd_100m_100k, dd_100m_1k, python, kernel, cthon), so I'm pretty confident
it's correct.
One question for the list: should nfs_free_server *always* be scheduled on
the nfsiod workqueue? It's called in error paths in several locations.
After looking at them, I don't think my approach would break anything, but
some might have style objections.
-dros
fs/nfs/client.c | 20 +++++++++++++++++---
include/linux/nfs_fs_sb.h | 1 +
2 files changed, 18 insertions(+), 3 deletions(-)
diff --git a/fs/nfs/client.c b/fs/nfs/client.c
index c285e0a..9186a96 100644
--- a/fs/nfs/client.c
+++ b/fs/nfs/client.c
@@ -1010,9 +1010,11 @@ EXPORT_SYMBOL_GPL(nfs_alloc_server);
/*
* Free up a server record
*/
-void nfs_free_server(struct nfs_server *server)
+static void nfs_free_server_schedule_work(struct work_struct *work)
{
- dprintk("--> nfs_free_server()\n");
+ struct nfs_server *server = container_of(work, struct nfs_server, work);
+
+ dprintk("--> %s\n", __func__);
nfs_server_remove_lists(server);
@@ -1032,7 +1034,19 @@ void nfs_free_server(struct nfs_server *server)
bdi_destroy(&server->backing_dev_info);
kfree(server);
nfs_release_automount_timer();
- dprintk("<-- nfs_free_server()\n");
+ dprintk("<-- %s\n", __func__);
+}
+
+/*
+ * Queue work on nfsiod workqueue to free up a server record.
+ * This avoids a deadlock when an RPC task scheduled from the rpciod
+ * workqueue tries to kill itself.
+ */
+void nfs_free_server(struct nfs_server *server)
+{
+ WARN_ON_ONCE(work_pending(&server->work));
+ INIT_WORK(&server->work, nfs_free_server_schedule_work);
+ queue_work(nfsiod_workqueue, &server->work);
}
EXPORT_SYMBOL_GPL(nfs_free_server);
diff --git a/include/linux/nfs_fs_sb.h b/include/linux/nfs_fs_sb.h
index a9e76ee..a607886 100644
--- a/include/linux/nfs_fs_sb.h
+++ b/include/linux/nfs_fs_sb.h
@@ -171,6 +171,7 @@ struct nfs_server {
void (*destroy)(struct nfs_server *);
atomic_t active; /* Keep trace of any activity to this server */
+ struct work_struct work; /* used to schedule free */
/* mountd-related mount options */
struct sockaddr_storage mountd_address;
--
1.7.9.6 (Apple Git-31.1)
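
(Aside, for anyone skimming the diff above: here is a minimal, self-contained
sketch of the pattern it uses. All of the names - demo_server, demo_cleanup_wq,
demo_free_server, and so on - are made up for illustration and are not the NFS
code. The idea is just that cleanup which may be invoked from a workqueue
callback gets deferred to a *different* workqueue, so it never has to wait for
work running on its own queue.)

/* Hypothetical illustration only - not the NFS code. */
#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/slab.h>

struct demo_server {
	struct work_struct work;
	/* ... resources that must be torn down ... */
};

/* Stand-in for nfsiod: a queue that never runs the tasks being killed. */
static struct workqueue_struct *demo_cleanup_wq;
static struct demo_server *demo;

static void demo_free_server_work(struct work_struct *work)
{
	struct demo_server *server =
		container_of(work, struct demo_server, work);

	/*
	 * Running here (not on the callers' workqueue), it is safe to wait
	 * for or kill work that runs elsewhere, then free the record.
	 */
	kfree(server);
}

/* Callers may be in a workqueue callback themselves; defer instead of freeing. */
static void demo_free_server(struct demo_server *server)
{
	INIT_WORK(&server->work, demo_free_server_work);
	queue_work(demo_cleanup_wq, &server->work);
}

static int __init demo_init(void)
{
	demo_cleanup_wq = alloc_workqueue("demo_cleanup", 0, 0);
	if (!demo_cleanup_wq)
		return -ENOMEM;
	demo = kzalloc(sizeof(*demo), GFP_KERNEL);
	if (!demo) {
		destroy_workqueue(demo_cleanup_wq);
		return -ENOMEM;
	}
	return 0;
}

static void __exit demo_exit(void)
{
	demo_free_server(demo);		    /* deferred to demo_cleanup_wq */
	destroy_workqueue(demo_cleanup_wq); /* drains the queued free first */
}

module_init(demo_init);
module_exit(demo_exit);
MODULE_LICENSE("GPL");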
On Oct 25, 2012, at 2:17 PM, "Myklebust, Trond" <[email protected]> wrote:
> On Thu, 2012-10-25 at 14:02 -0400, Weston Andros Adamson wrote:
>> Calling nfs_kill_super from an RPC task callback would result in a deadlock
>> where nfs_free_server (via rpc_shutdown_client) tries to kill all
>> RPC tasks associated with that connection - including itself!
>>
>> Instead of calling nfs_kill_super directly, queue a job on the nfsiod
>> workqueue.
>>
>> Signed-off-by: Weston Andros Adamson <[email protected]>
>> ---
>>
>> This fixes the current incarnation of the lockup I've been tracking down for
>> some time now. I still have to go back and see why the reproducer changed
>> behavior a few weeks ago - tasks used to get stuck in rpc_prepare_task, but
>> now (before this patch) are stuck in rpc_exit.
>>
>> The reproducer works against a server with write delegations:
>>
>> ./nfsometer.py -m v4 server:/path dd_100m_100k
>>
>> which is basically:
>> - mount
>> - dd if=/dev/zero of=./dd_file.100m_100k bs=102400 count=1024
>> - umount
>> - break if /proc/fs/nfsfs/servers still has entry after 5 seconds (in this
>> case it NEVER goes away)
>>
>> There are clearly other ways to trigger this deadlock, like a v4.1 CLOSE - the
>> done handler calls nfs_sb_deactivate...
>>
>> I've tested this approach with 10 runs X 3 nfs versions X 5 workloads
>> (dd_100m_100k, dd_100m_1k, python, kernel, cthon), so I'm pretty confident
>> its correct.
>>
>> One question for the list: should nfs_free_server *always* be scheduled on
>> the nfsiod workqueue? It's called in error paths in several locations.
>> After looking at them, I don't think my approach would break anything, but
>> some might have style objections.
>>
>
> This doesn't add up. There should be nothing calling nfs_sb_deactive()
> from a rpc_call_done() callback. If so, then that would be the bug.
>
> All calls to things like rpc_put_task(), put_nfs_open_context(), dput(),
> or nfs_sb_deactive() should occur in the rpc_call_release() callback if
> they can't be done in a process context. In both those cases, the
> rpc_task will be invisible to rpc_killall_tasks and rpc_shutdown_client.
Ah, I misunderstood what was going on here.
nfs_kill_super *is* being called from the rpc_release_calldata callback:
The kworker stuck in rpc_killall_tasks forever:
[ 34.552600] [<ffffffffa00868e6>] rpc_killall_tasks+0x2d/0xcd [sunrpc]
[ 34.552608] [<ffffffffa00883e4>] rpc_shutdown_client+0x4a/0xec [sunrpc]
[ 34.552615] [<ffffffffa002b973>] nfs_free_server+0xcf/0x133 [nfs]
[ 34.552625] [<ffffffffa0033193>] nfs_kill_super+0x37/0x3c [nfs]
[ 34.552629] [<ffffffff81136c68>] deactivate_locked_super+0x37/0x63
[ 34.552633] [<ffffffff8113785f>] deactivate_super+0x37/0x3b
[ 34.552642] [<ffffffffa0034fc1>] nfs_sb_deactive+0x23/0x25 [nfs]
[ 34.552649] [<ffffffffa00dbba2>] nfs4_free_closedata+0x53/0x63 [nfsv4]
[ 34.552661] [<ffffffffa008f997>] rpc_release_calldata+0x17/0x19 [sunrpc]
[ 34.552671] [<ffffffffa008f9f5>] rpc_free_task+0x5c/0x65 [sunrpc]
[ 34.552680] [<ffffffffa008fe07>] rpc_async_release+0x15/0x17 [sunrpc]
[ 34.552684] [<ffffffff810632b7>] process_one_work+0x192/0x2a0
[ 34.552693] [<ffffffffa008fdf2>] ? rpc_async_schedule+0x33/0x33 [sunrpc]
[ 34.552697] [<ffffffff81064169>] worker_thread+0x140/0x1d7
[ 34.552700] [<ffffffff81064029>] ? manage_workers+0x23b/0x23b
[ 34.552704] [<ffffffff81067d21>] kthread+0x8d/0x95
[ 34.552708] [<ffffffff81067c94>] ? kthread_freezable_should_stop+0x43/0x43
[ 34.552713] [<ffffffff814ef1ac>] ret_from_fork+0x7c/0xb0
[ 34.552717] [<ffffffff81067c94>] ? kthread_freezable_should_stop+0x43/0x43
And the client's task list:
[ 174.574006] -pid- flgs status -client- --rqstp- -timeout ---ops--
[ 174.574019] 1664 0181 -5 ffff880226474600 (null) 0 ffffffffa00f7ce0 nfsv4 DELEGRETURN a:rpc_exit_task [sunrpc] q:none
So it looks like a CLOSE's rpc_release_calldata is triggering nfs_kill_super, and that path is stuck trying to kill the DELEGRETURN task - which never gets run.
I've debugged this from the workqueue side: the DELEGRETURN work is scheduled, but ends up having insert_wq_barrier called on it. It seems the workqueue is enforcing an ordering in which the CLOSE work must complete before the DELEGRETURN work can proceed -- and *that* is the deadlock (CLOSE waits until DELEGRETURN is dead, while DELEGRETURN can't run until CLOSE completes). A toy analogue of this ordering follows the quoted comment below.
This would also explain our (Trond's and my) failed attempts at canceling / rescheduling jobs in rpc_killall_tasks -- insert_wq_barrier's comment states:
* Currently, a queued barrier can't be canceled. This is because
* try_to_grab_pending() can't determine whether the work to be
* grabbed is at the head of the queue and thus can't clear LINKED
* flag of the previous work while there must be a valid next work
* after a work with LINKED flag set.
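
To make that ordering concrete, here is a toy module with the same shape of
problem. Every name is invented (demo_wq stands in for rpciod, close_fn and
delegreturn_fn for the CLOSE release and the DELEGRETURN work), and it uses an
ordered workqueue plus flush_work() as a stand-in for the barrier behaviour -
it's an analogy, not the actual sunrpc code path:

/* Illustration only: loading this intentionally wedges one kworker. */
#include <linux/module.h>
#include <linux/workqueue.h>

static struct workqueue_struct *demo_wq;	/* stands in for rpciod */
static struct work_struct close_work;		/* stands in for the CLOSE release */
static struct work_struct delegreturn_work;	/* stands in for DELEGRETURN */

static void delegreturn_fn(struct work_struct *work)
{
	/* Never runs: queued behind close_fn on the same single-slot queue. */
}

static void close_fn(struct work_struct *work)
{
	/*
	 * Deadlock: flush_work() waits for delegreturn_work to finish, but
	 * on an ordered workqueue delegreturn_work cannot start until this
	 * function returns.
	 */
	flush_work(&delegreturn_work);
}

static int __init demo_init(void)
{
	demo_wq = alloc_ordered_workqueue("demo_rpciod", 0);
	if (!demo_wq)
		return -ENOMEM;
	INIT_WORK(&close_work, close_fn);
	INIT_WORK(&delegreturn_work, delegreturn_fn);
	queue_work(demo_wq, &close_work);
	queue_work(demo_wq, &delegreturn_work);
	return 0;
}

module_init(demo_init);
MODULE_LICENSE("GPL");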
Now that I have a better understanding of what's happening, I'll go back to the drawing board.
Thanks!
-dros