2012-12-21 15:33:54

by Dave Jones

Subject: nfsd oops on Linus' current tree.

Did a mount from a client (also running Linus current), and the
server spat this out..

[ 6936.306135] ------------[ cut here ]------------
[ 6936.306154] WARNING: at net/sunrpc/clnt.c:617 rpc_shutdown_client+0x12a/0x1b0 [sunrpc]()
[ 6936.306156] Hardware name:
[ 6936.306157] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables xfs coretemp iTCO_wdt iTCO_vendor_support snd_emu10k1 microcode snd_util_mem snd_ac97_codec ac97_bus snd_hwdep snd_seq snd_pcm snd_page_alloc snd_timer e1000e snd_rawmidi snd_seq_device snd emu10k1_gp pcspkr i2c_i801 soundcore gameport lpc_ich mfd_core i82975x_edac edac_core vhost_net tun macvtap macvlan kvm_intel kvm binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc btrfs libcrc32c zlib_deflate usb_storage firewire_ohci firewire_core sata_sil crc_itu_t radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core floppy
[ 6936.306214] Pid: 52, comm: kworker/u:2 Not tainted 3.7.0+ #34
[ 6936.306216] Call Trace:
[ 6936.306224] [<ffffffff8106badf>] warn_slowpath_common+0x7f/0xc0
[ 6936.306227] [<ffffffff8106bb3a>] warn_slowpath_null+0x1a/0x20
[ 6936.306235] [<ffffffffa02c62ca>] rpc_shutdown_client+0x12a/0x1b0 [sunrpc]
[ 6936.306240] [<ffffffff81368318>] ? delay_tsc+0x98/0xf0
[ 6936.306252] [<ffffffffa034a60b>] nfsd4_process_cb_update.isra.16+0x4b/0x230 [nfsd]
[ 6936.306256] [<ffffffff8109677c>] ? __rcu_read_unlock+0x5c/0xa0
[ 6936.306260] [<ffffffff81370d46>] ? debug_object_deactivate+0x46/0x130
[ 6936.306269] [<ffffffffa034a87d>] nfsd4_do_callback_rpc+0x8d/0xa0 [nfsd]
[ 6936.306272] [<ffffffff810900f7>] process_one_work+0x207/0x760
[ 6936.306274] [<ffffffff81090087>] ? process_one_work+0x197/0x760
[ 6936.306277] [<ffffffff81090afe>] ? worker_thread+0x21e/0x440
[ 6936.306285] [<ffffffffa034a7f0>] ? nfsd4_process_cb_update.isra.16+0x230/0x230 [nfsd]
[ 6936.306289] [<ffffffff81090a3e>] worker_thread+0x15e/0x440
[ 6936.306292] [<ffffffff810908e0>] ? rescuer_thread+0x250/0x250
[ 6936.306295] [<ffffffff8109b16d>] kthread+0xed/0x100
[ 6936.306299] [<ffffffff810dd86e>] ? put_lock_stats.isra.25+0xe/0x40
[ 6936.306302] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160
[ 6936.306307] [<ffffffff81711e2c>] ret_from_fork+0x7c/0xb0
[ 6936.306310] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160
[ 6936.306312] ---[ end trace 5bab69e086ae3c6f ]---
[ 6936.363213] ------------[ cut here ]------------
[ 6936.363226] WARNING: at fs/nfsd/vfs.c:937 nfsd_vfs_read.isra.13+0x197/0x1b0 [nfsd]()
[ 6936.363229] Hardware name:
[ 6936.363230] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables xfs coretemp iTCO_wdt iTCO_vendor_support snd_emu10k1 microcode snd_util_mem snd_ac97_codec ac97_bus snd_hwdep snd_seq snd_pcm snd_page_alloc snd_timer e1000e snd_rawmidi snd_seq_device snd emu10k1_gp pcspkr i2c_i801 soundcore gameport lpc_ich mfd_core i82975x_edac edac_core vhost_net tun macvtap macvlan kvm_intel kvm binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc btrfs libcrc32c zlib_deflate usb_storage firewire_ohci firewire_core sata_sil crc_itu_t radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core floppy
[ 6936.363284] Pid: 699, comm: nfsd Tainted: G W 3.7.0+ #34
[ 6936.363286] Call Trace:
[ 6936.363293] [<ffffffff8106badf>] warn_slowpath_common+0x7f/0xc0
[ 6936.363296] [<ffffffff8106bb3a>] warn_slowpath_null+0x1a/0x20
[ 6936.363302] [<ffffffffa031ef77>] nfsd_vfs_read.isra.13+0x197/0x1b0 [nfsd]
[ 6936.363310] [<ffffffffa0321948>] nfsd_read_file+0x88/0xb0 [nfsd]
[ 6936.363317] [<ffffffffa0332956>] nfsd4_encode_read+0x186/0x260 [nfsd]
[ 6936.363325] [<ffffffffa03391cc>] nfsd4_encode_operation+0x5c/0xa0 [nfsd]
[ 6936.363333] [<ffffffffa032e5a9>] nfsd4_proc_compound+0x289/0x780 [nfsd]
[ 6936.363339] [<ffffffffa0319e5b>] nfsd_dispatch+0xeb/0x230 [nfsd]
[ 6936.363355] [<ffffffffa02d3d38>] svc_process_common+0x328/0x6d0 [sunrpc]
[ 6936.363365] [<ffffffffa02d4433>] svc_process+0x103/0x160 [sunrpc]
[ 6936.363371] [<ffffffffa031921b>] nfsd+0xdb/0x160 [nfsd]
[ 6936.363378] [<ffffffffa0319140>] ? nfsd_destroy+0x210/0x210 [nfsd]
[ 6936.363381] [<ffffffff8109b16d>] kthread+0xed/0x100
[ 6936.363385] [<ffffffff810dd86e>] ? put_lock_stats.isra.25+0xe/0x40
[ 6936.363388] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160
[ 6936.363393] [<ffffffff81711e2c>] ret_from_fork+0x7c/0xb0
[ 6936.363396] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160
[ 6936.363398] ---[ end trace 5bab69e086ae3c70 ]---



2012-12-21 23:08:51

by J. Bruce Fields

Subject: Re: nfsd oops on Linus' current tree.

On Fri, Dec 21, 2012 at 06:40:54PM +0000, Myklebust, Trond wrote:
> On Fri, 2012-12-21 at 13:08 -0500, J. Bruce Fields wrote:
> > On Fri, Dec 21, 2012 at 10:33:48AM -0500, Dave Jones wrote:
> > > Did a mount from a client (also running Linus current), and the
> > > server spat this out..
> > >
> > > [ 6936.306135] ------------[ cut here ]------------
> > > [ 6936.306154] WARNING: at net/sunrpc/clnt.c:617 rpc_shutdown_client+0x12a/0x1b0 [sunrpc]()
> >
> > This is a warning added by 168e4b39d1afb79a7e3ea6c3bb246b4c82c6bdb9
> > "SUNRPC: add WARN_ON_ONCE for potential deadlock", pointing out that
> > nfsd is calling shutdown_client from a workqueue, which is a problem
> > because shutdown_client has to wait on rpc tasks that run on a
> > workqueue.
> >
> > I don't believe there's any circular dependency among the workqueues
> > (we're calling shutdown_client from callback_wq, not rpciod_workqueue),
>
> We were getting deadlocks with rpciod when calling rpc_shutdown_client
> from the nfsiod workqueue.
>
> The problem here is that the workqueues all run using the same pool of
> threads, and so you can get "interesting" deadlocks when one of these
> threads has to wait for another one.

OK, coming back after reviewing Documentation/workqueue.txt: the
workqueues potentially involved here are rpciod and the server's
laundromat and callback workqueues.

The former is created with

alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 1);

and the latter two are both created with
create_singlethread_workqueue(), which also sets WQ_MEM_RECLAIM.
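
For reference, those creation calls look roughly like this (the
variable names and the two queue name strings below are illustrative,
not the exact identifiers in the tree):

static struct workqueue_struct *rpciod_wq, *laundry_wq, *callback_wq;

static int demo_create_workqueues(void)
{
	rpciod_wq = alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 1);

	/* ordered, single-worker queues; as noted above, these also
	 * get WQ_MEM_RECLAIM set on their behalf */
	laundry_wq  = create_singlethread_workqueue("nfsd4-laundromat");
	callback_wq = create_singlethread_workqueue("nfsd4-callbacks");

	if (!rpciod_wq || !laundry_wq || !callback_wq)
		return -ENOMEM;
	return 0;
}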

As I understand it, that ensures that each of the three has at least one
"rescuer" thread dedicated to it--so there shouldn't be any deadlock as
long as there's no circular dependency among the three.

So I think this is a false positive--am I missing something?

> > but 168e4b39d1afb.. says that we could get a deadlock if both are
> > running on the same kworker thread.
> >
> > I'm not sure what to do about that.
> >
>
> The question is if you really do need the call to rpc_killall_tasks and
> the synchronous wait for completion of old tasks? If you don't care,
> then we could just have you call rpc_release_client() in order to
> release your reference on the rpc_client.

No, the waiting's intentional: for example, the laundromat wq is what's
responsible for reaping expired clients, and it wants to wait for
callback rpc's to exit. (If I remember right they may reference data
structures we're about to clean up).
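
Purely for illustration (made-up names, and not the actual nfsd call
chain), the dependency being described has roughly this shape:

#include <linux/workqueue.h>
#include <linux/slab.h>
#include <linux/sunrpc/clnt.h>

struct demo_client {
	struct list_head	cl_list;
	struct rpc_clnt		*cl_cb_client;	/* callback channel */
};

static LIST_HEAD(expired_clients);		/* illustrative only */
static struct workqueue_struct *laundry_wq;
static void laundromat_main(struct work_struct *work);
static DECLARE_DELAYED_WORK(laundromat_work, laundromat_main);

static void laundromat_main(struct work_struct *work)
{
	struct demo_client *clp, *next;

	list_for_each_entry_safe(clp, next, &expired_clients, cl_list) {
		list_del(&clp->cl_list);
		/* waits for the client's outstanding callback RPCs,
		 * which may still reference the state we free below */
		rpc_shutdown_client(clp->cl_cb_client);
		kfree(clp);
	}
	queue_delayed_work(laundry_wq, &laundromat_work, 60 * HZ);
}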

--b.

2012-12-21 23:15:42

by Myklebust, Trond

Subject: RE: nfsd oops on Linus' current tree.

Apologies for top-posting. The SSD on my laptop died, and so I'm stuck using webmail for this account...

Our experience with nfsiod is that the WQ_MEM_RECLAIM option still deadlocks despite the "rescuer thread". The CPU that is running the workqueue will deadlock with any rpciod task that is assigned to the same CPU. Interestingly enough, the WQ_UNBOUND option also appears able to deadlock in the same situation.

Sorry, I have no explanation why...

Cheers,
Trond


________________________________________
From: J. Bruce Fields [[email protected]]
Sent: Friday, December 21, 2012 6:08 PM
To: Myklebust, Trond
Cc: Dave Jones; Linux Kernel; [email protected]; Adamson, Dros
Subject: Re: nfsd oops on Linus' current tree.

On Fri, Dec 21, 2012 at 06:40:54PM +0000, Myklebust, Trond wrote:
> On Fri, 2012-12-21 at 13:08 -0500, J. Bruce Fields wrote:
> > On Fri, Dec 21, 2012 at 10:33:48AM -0500, Dave Jones wrote:
> > > Did a mount from a client (also running Linus current), and the
> > > server spat this out..
> > >
> > > [ 6936.306135] ------------[ cut here ]------------
> > > [ 6936.306154] WARNING: at net/sunrpc/clnt.c:617 rpc_shutdown_client+0x12a/0x1b0 [sunrpc]()
> >
> > This is a warning added by 168e4b39d1afb79a7e3ea6c3bb246b4c82c6bdb9
> > "SUNRPC: add WARN_ON_ONCE for potential deadlock", pointing out that
> > nfsd is calling shutdown_client from a workqueue, which is a problem
> > because shutdown_client has to wait on rpc tasks that run on a
> > workqueue.
> >
> > I don't believe there's any circular dependency among the workqueues
> > (we're calling shutdown_client from callback_wq, not rpciod_workqueue),
>
> We were getting deadlocks with rpciod when calling rpc_shutdown_client
> from the nfsiod workqueue.
>
> The problem here is that the workqueues all run using the same pool of
> threads, and so you can get "interesting" deadlocks when one of these
> threads has to wait for another one.

OK, coming back after reviewing Documentation/workqueue.txt: the
workqueues potentially involved here are rpciod and the server's
laundromat and callback workqueues.

The former is created with

alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 1);

and the latter two are both created with
create_singlethread_workqueue(), which also sets WQ_MEM_RECLAIM.

As I understand it, that ensures that each of the three has at least one
"rescuer" thread dedicated to it--so there shouldn't be any deadlock as
long as there's no circular dependency among the three.

So I think this is a false positive--am I missing something?

> > but 168e4b39d1afb.. says that we could get a deadlock if both are
> > running on the same kworker thread.
> >
> > I'm not sure what to do about that.
> >
>
> The question is if you really do need the call to rpc_killall_tasks and
> the synchronous wait for completion of old tasks? If you don't care,
> then we could just have you call rpc_release_client() in order to
> release your reference on the rpc_client.

No, the waiting's intentional: for example, the laundromat wq is what's
responsible for reaping expired clients, and it wants to wait for
callback rpc's to exit. (If I remember right they may reference data
structures we're about to clean up).

--b.

2012-12-22 00:49:03

by J. Bruce Fields

Subject: [PATCH] Revert "nfsd: warn on odd reply state in nfsd_vfs_read"

From: "J. Bruce Fields" <[email protected]>

This reverts commit 79f77bf9a4e3dd5ead006b8f17e7c4ff07d8374e.

This is obviously wrong, and I have no idea how I missed seeing the
warning in testing: I must just not have looked at the right logs. The
caller bumps rq_resused/rq_next_page, so it will always be hit on a
large enough read.

Reported-by: Dave Jones <[email protected]>
Signed-off-by: J. Bruce Fields <[email protected]>
---
fs/nfsd/vfs.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/fs/nfsd/vfs.c b/fs/nfsd/vfs.c
index f0a6d88..d586117 100644
--- a/fs/nfsd/vfs.c
+++ b/fs/nfsd/vfs.c
@@ -934,7 +934,6 @@ nfsd_vfs_read(struct svc_rqst *rqstp, struct svc_fh *fhp, struct file *file,
 			.u.data		= rqstp,
 		};
 
-		WARN_ON_ONCE(rqstp->rq_next_page != rqstp->rq_respages + 1);
 		rqstp->rq_next_page = rqstp->rq_respages + 1;
 		host_err = splice_direct_to_actor(file, &sd, nfsd_direct_splice_actor);
 	} else {
--
1.7.11.7


2012-12-21 23:36:53

by Myklebust, Trond

Subject: RE: nfsd oops on Linus' current tree.

Please reread what I said. There was no obvious circular dependency, because nfsiod and rpciod are separate workqueues, both created with WQ_MEM_RECLAIM. Dros' experience shows, however, that a call to rpc_shutdown_client in an nfsiod work item will deadlock with rpciod if the RPC task's work item has been assigned to the same CPU as the one running the rpc_shutdown_client work item.

I can't tell right now if that is intentional (in which case the WARN_ON in the rpc code is correct), or if it is a bug in the workqueue code. For now, we're assuming the former.

________________________________________
From: J. Bruce Fields [[email protected]]
Sent: Friday, December 21, 2012 6:26 PM
To: Myklebust, Trond
Cc: Dave Jones; Linux Kernel; [email protected]; Adamson, Dros
Subject: Re: nfsd oops on Linus' current tree.

On Fri, Dec 21, 2012 at 11:15:40PM +0000, Myklebust, Trond wrote:
> Apologies for top-posting. The SSD on my laptop died, and so I'm stuck using webmail for this account...

Fun! If that happens to me on this trip, I've got a week trying to hack
the kernel from my cell phone....

> Our experience with nfsiod is that the WQ_MEM_RECLAIM option still deadlocks despite the "rescuer thread". The CPU that is running the workqueue will deadlock with any rpciod task that is assigned to the same CPU. Interestingly enough, the WQ_UNBOUND option also appears able to deadlock in the same situation.
>
> Sorry, I have no explanation why...

As I said:

> there shouldn't be any deadlock as long as there's no circular
> dependency among the three.

There was a circular dependency (of rpciod on itself), so having a
dedicated rpciod rescuer thread wouldn't help--once the rescuer thread
is waiting for work queued to the same queue, you're asking for
trouble.

The last argument in

alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 1);

ensures that it will never allow more than 1 piece of work to run per
CPU, so the deadlock should be pretty easy to hit.

And with UNBOUND that's only one piece of work globally, so yeah all you
need is an rpc at shutdown time and it should deadlock every time.

--b.

2012-12-21 18:41:15

by Myklebust, Trond

Subject: Re: nfsd oops on Linus' current tree.

On Fri, 2012-12-21 at 13:08 -0500, J. Bruce Fields wrote:
> On Fri, Dec 21, 2012 at 10:33:48AM -0500, Dave Jones wrote:
> > Did a mount from a client (also running Linus current), and the
> > server spat this out..
> >
> > [ 6936.306135] ------------[ cut here ]------------
> > [ 6936.306154] WARNING: at net/sunrpc/clnt.c:617 rpc_shutdown_client+0x12a/0x1b0 [sunrpc]()
>
> This is a warning added by 168e4b39d1afb79a7e3ea6c3bb246b4c82c6bdb9
> "SUNRPC: add WARN_ON_ONCE for potential deadlock", pointing out that
> nfsd is calling shutdown_client from a workqueue, which is a problem
> because shutdown_client has to wait on rpc tasks that run on a
> workqueue.
>
> I don't believe there's any circular dependency among the workqueues
> (we're calling shutdown_client from callback_wq, not rpciod_workqueue),

We were getting deadlocks with rpciod when calling rpc_shutdown_client
from the nfsiod workqueue.

The problem here is that the workqueues all run using the same pool of
threads, and so you can get "interesting" deadlocks when one of these
threads has to wait for another one.

> but 168e4b39d1afb.. says that we could get a deadlock if both are
> running on the same kworker thread.
>
> I'm not sure what to do about that.
>

The question is if you really do need the call to rpc_killall_tasks and
the synchronous wait for completion of old tasks? If you don't care,
then we could just have you call rpc_release_client() in order to
release your reference on the rpc_client.

> > [ 6936.306156] Hardware name:
> > [ 6936.306157] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables xfs coretemp iTCO_wdt iTCO_vendor_support snd_emu10k1 microcode snd_util_mem snd_ac97_codec ac97_bus snd_hwdep snd_seq snd_pcm snd_page_alloc snd_timer e1000e snd_rawmidi snd_seq_device snd emu10k1_gp pcspkr i2c_i801 soundcore gameport lpc_ich mfd_core i82975x_edac edac_core vhost_net tun macvtap macvlan kvm_intel kvm binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc btrfs libcrc32c zlib_deflate usb_storage firewire_ohci firewire_core sata_sil crc_itu_t radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core floppy
> > [ 6936.306214] Pid: 52, comm: kworker/u:2 Not tainted 3.7.0+ #34
> > [ 6936.306216] Call Trace:
> > [ 6936.306224] [<ffffffff8106badf>] warn_slowpath_common+0x7f/0xc0
> > [ 6936.306227] [<ffffffff8106bb3a>] warn_slowpath_null+0x1a/0x20
> > [ 6936.306235] [<ffffffffa02c62ca>] rpc_shutdown_client+0x12a/0x1b0 [sunrpc]
> > [ 6936.306240] [<ffffffff81368318>] ? delay_tsc+0x98/0xf0
> > [ 6936.306252] [<ffffffffa034a60b>] nfsd4_process_cb_update.isra.16+0x4b/0x230 [nfsd]
> > [ 6936.306256] [<ffffffff8109677c>] ? __rcu_read_unlock+0x5c/0xa0
> > [ 6936.306260] [<ffffffff81370d46>] ? debug_object_deactivate+0x46/0x130
> > [ 6936.306269] [<ffffffffa034a87d>] nfsd4_do_callback_rpc+0x8d/0xa0 [nfsd]
> > [ 6936.306272] [<ffffffff810900f7>] process_one_work+0x207/0x760
> > [ 6936.306274] [<ffffffff81090087>] ? process_one_work+0x197/0x760
> > [ 6936.306277] [<ffffffff81090afe>] ? worker_thread+0x21e/0x440
> > [ 6936.306285] [<ffffffffa034a7f0>] ? nfsd4_process_cb_update.isra.16+0x230/0x230 [nfsd]
> > [ 6936.306289] [<ffffffff81090a3e>] worker_thread+0x15e/0x440
> > [ 6936.306292] [<ffffffff810908e0>] ? rescuer_thread+0x250/0x250
> > [ 6936.306295] [<ffffffff8109b16d>] kthread+0xed/0x100
> > [ 6936.306299] [<ffffffff810dd86e>] ? put_lock_stats.isra.25+0xe/0x40
> > [ 6936.306302] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160
> > [ 6936.306307] [<ffffffff81711e2c>] ret_from_fork+0x7c/0xb0
> > [ 6936.306310] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160
> > [ 6936.306312] ---[ end trace 5bab69e086ae3c6f ]---
> > [ 6936.363213] ------------[ cut here ]------------
> > [ 6936.363226] WARNING: at fs/nfsd/vfs.c:937 nfsd_vfs_read.isra.13+0x197/0x1b0 [nfsd]()
>
> This warning is unrelated, and is probably just carelessness on my part:
> I couldn't see why this condition would happen, and I stuck the warning
> in there without looking much harder. Probably we should just revert
> 79f77bf9a4e3dd5ead006b8f17e7c4ff07d8374e "nfsd: warn on odd reply state
> in nfsd_vfs_read" while I go stare at the code.
>
> --b.
>
> > [ 6936.363229] Hardware name:
> > [ 6936.363230] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables xfs coretemp iTCO_wdt iTCO_vendor_support snd_emu10k1 microcode snd_util_mem snd_ac97_codec ac97_bus snd_hwdep snd_seq snd_pcm snd_page_alloc snd_timer e1000e snd_rawmidi snd_seq_device snd emu10k1_gp pcspkr i2c_i801 soundcore gameport lpc_ich mfd_core i82975x_edac edac_core vhost_net tun macvtap macvlan kvm_intel kvm binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc btrfs libcrc32c zlib_deflate usb_storage firewire_ohci firewire_core sata_sil crc_itu_t radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core floppy
> > [ 6936.363284] Pid: 699, comm: nfsd Tainted: G W 3.7.0+ #34
> > [ 6936.363286] Call Trace:
> > [ 6936.363293] [<ffffffff8106badf>] warn_slowpath_common+0x7f/0xc0
> > [ 6936.363296] [<ffffffff8106bb3a>] warn_slowpath_null+0x1a/0x20
> > [ 6936.363302] [<ffffffffa031ef77>] nfsd_vfs_read.isra.13+0x197/0x1b0 [nfsd]
> > [ 6936.363310] [<ffffffffa0321948>] nfsd_read_file+0x88/0xb0 [nfsd]
> > [ 6936.363317] [<ffffffffa0332956>] nfsd4_encode_read+0x186/0x260 [nfsd]
> > [ 6936.363325] [<ffffffffa03391cc>] nfsd4_encode_operation+0x5c/0xa0 [nfsd]
> > [ 6936.363333] [<ffffffffa032e5a9>] nfsd4_proc_compound+0x289/0x780 [nfsd]
> > [ 6936.363339] [<ffffffffa0319e5b>] nfsd_dispatch+0xeb/0x230 [nfsd]
> > [ 6936.363355] [<ffffffffa02d3d38>] svc_process_common+0x328/0x6d0 [sunrpc]
> > [ 6936.363365] [<ffffffffa02d4433>] svc_process+0x103/0x160 [sunrpc]
> > [ 6936.363371] [<ffffffffa031921b>] nfsd+0xdb/0x160 [nfsd]
> > [ 6936.363378] [<ffffffffa0319140>] ? nfsd_destroy+0x210/0x210 [nfsd]
> > [ 6936.363381] [<ffffffff8109b16d>] kthread+0xed/0x100
> > [ 6936.363385] [<ffffffff810dd86e>] ? put_lock_stats.isra.25+0xe/0x40
> > [ 6936.363388] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160
> > [ 6936.363393] [<ffffffff81711e2c>] ret_from_fork+0x7c/0xb0
> > [ 6936.363396] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160
> > [ 6936.363398] ---[ end trace 5bab69e086ae3c70 ]---
> >

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2012-12-21 23:26:10

by J. Bruce Fields

Subject: Re: nfsd oops on Linus' current tree.

On Fri, Dec 21, 2012 at 11:15:40PM +0000, Myklebust, Trond wrote:
> Apologies for top-posting. The SSD on my laptop died, and so I'm stuck using webmail for this account...

Fun! If that happens to me on this trip, I've got a week trying to hack
the kernel from my cell phone....

> Our experience with nfsiod is that the WQ_MEM_RECLAIM option still deadlocks despite the "rescuer thread". The CPU that is running the workqueue will deadlock with any rpciod task that is assigned to the same CPU. Interestingly enough, the WQ_UNBOUND option also appears able to deadlock in the same situation.
>
> Sorry, I have no explanation why...

As I said:

> there shouldn't be any deadlock as long as there's no circular
> dependency among the three.

There was a circular dependency (of rpciod on itself), so having a
dedicated rpciod rescuer thread wouldn't help--once the rescuer thread
is waiting for work queued to the same queue, you're asking for
trouble.

The last argument in

alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 1);

ensures that it will never allow more than 1 piece of work to run per
CPU, so the deadlock should be pretty easy to hit.

And with UNBOUND that's only one piece of work globally, so yeah all you
need is an rpc at shutdown time and it should deadlock every time.
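
To spell out the two variants being contrasted here (illustrative
only, restating the above):

/* bound, max_active == 1: at most one rpciod work item per CPU */
alloc_workqueue("rpciod", WQ_MEM_RECLAIM, 1);

/* unbound, max_active == 1: at most one rpciod work item in total */
alloc_workqueue("rpciod", WQ_MEM_RECLAIM | WQ_UNBOUND, 1);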

--b.

2012-12-21 23:45:33

by J. Bruce Fields

Subject: Re: nfsd oops on Linus' current tree.

On Fri, Dec 21, 2012 at 11:36:51PM +0000, Myklebust, Trond wrote:
> Please reread what I said. There was no obvious circular dependency,
> because nfsiod and rpciod are separate workqueues, both created with
> WQ_MEM_RECLAIM.

Oh, sorry, I read "rpciod" as "nfsiod"!

> Dros' experience shows, however that a call to
> rpc_shutdown_client in an nfsiod work item will deadlock with rpciod
> if the RPC task's work item has been assigned to the same CPU as the
> one running the rpc_shutdown_client work item.
>
> I can't tell right now if that is intentional (in which case the
> WARN_ON in the rpc code is correct), or if it is a bug in the
> workqueue code. For now, we're assuming the former.

Well, Documentation/workqueue.txt says:

"Each wq with WQ_MEM_RECLAIM set has an execution context
reserved for it. If there is dependency among multiple work
items used during memory reclaim, they should be queued to
separate wq each with WQ_MEM_RECLAIM."

And I can't see how it could have been safe to convert
create_singlethread_workqueue() callers otherwise.

--b.

2012-12-21 18:08:30

by J. Bruce Fields

Subject: Re: nfsd oops on Linus' current tree.

On Fri, Dec 21, 2012 at 10:33:48AM -0500, Dave Jones wrote:
> Did a mount from a client (also running Linus current), and the
> server spat this out..
>
> [ 6936.306135] ------------[ cut here ]------------
> [ 6936.306154] WARNING: at net/sunrpc/clnt.c:617 rpc_shutdown_client+0x12a/0x1b0 [sunrpc]()

This is a warning added by 168e4b39d1afb79a7e3ea6c3bb246b4c82c6bdb9
"SUNRPC: add WARN_ON_ONCE for potential deadlock", pointing out that
nfsd is calling shutdown_client from a workqueue, which is a problem
because shutdown_client has to wait on rpc tasks that run on a
workqueue.

I don't believe there's any circular dependency among the workqueues
(we're calling shutdown_client from callback_wq, not rpciod_workqueue),
but 168e4b39d1afb.. says that we could get a deadlock if both are
running on the same kworker thread.

I'm not sure what to do about that.
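
To make the pattern concrete, the offending work item looks roughly
like the following (the struct and function names are made up for
illustration, not the actual nfsd code):

struct demo_cb {
	struct work_struct	cb_work;
	struct rpc_clnt		*cb_client;
};

static void demo_cb_update(struct work_struct *work)
{
	struct demo_cb *cb = container_of(work, struct demo_cb, cb_work);

	/* rpc_shutdown_client() sleeps until all of the client's rpc
	 * tasks have run their cleanup -- and that cleanup also runs
	 * from a workqueue, which is what the new warning is about */
	rpc_shutdown_client(cb->cb_client);
}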

> [ 6936.306156] Hardware name:
> [ 6936.306157] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables xfs coretemp iTCO_wdt iTCO_vendor_support snd_emu10k1 microcode snd_util_mem snd_ac97_codec ac97_bus snd_hwdep snd_seq snd_pcm snd_page_alloc snd_timer e1000e snd_rawmidi snd_seq_device snd emu10k1_gp pcspkr i2c_i801 soundcore gameport lpc_ich mfd_core i82975x_edac edac_core vhost_net tun macvtap macvlan kvm_intel kvm binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc btrfs libcrc32c zlib_deflate usb_storage firewire_ohci firewire_core sata_sil crc_itu_t radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core floppy
> [ 6936.306214] Pid: 52, comm: kworker/u:2 Not tainted 3.7.0+ #34
> [ 6936.306216] Call Trace:
> [ 6936.306224] [<ffffffff8106badf>] warn_slowpath_common+0x7f/0xc0
> [ 6936.306227] [<ffffffff8106bb3a>] warn_slowpath_null+0x1a/0x20
> [ 6936.306235] [<ffffffffa02c62ca>] rpc_shutdown_client+0x12a/0x1b0 [sunrpc]
> [ 6936.306240] [<ffffffff81368318>] ? delay_tsc+0x98/0xf0
> [ 6936.306252] [<ffffffffa034a60b>] nfsd4_process_cb_update.isra.16+0x4b/0x230 [nfsd]
> [ 6936.306256] [<ffffffff8109677c>] ? __rcu_read_unlock+0x5c/0xa0
> [ 6936.306260] [<ffffffff81370d46>] ? debug_object_deactivate+0x46/0x130
> [ 6936.306269] [<ffffffffa034a87d>] nfsd4_do_callback_rpc+0x8d/0xa0 [nfsd]
> [ 6936.306272] [<ffffffff810900f7>] process_one_work+0x207/0x760
> [ 6936.306274] [<ffffffff81090087>] ? process_one_work+0x197/0x760
> [ 6936.306277] [<ffffffff81090afe>] ? worker_thread+0x21e/0x440
> [ 6936.306285] [<ffffffffa034a7f0>] ? nfsd4_process_cb_update.isra.16+0x230/0x230 [nfsd]
> [ 6936.306289] [<ffffffff81090a3e>] worker_thread+0x15e/0x440
> [ 6936.306292] [<ffffffff810908e0>] ? rescuer_thread+0x250/0x250
> [ 6936.306295] [<ffffffff8109b16d>] kthread+0xed/0x100
> [ 6936.306299] [<ffffffff810dd86e>] ? put_lock_stats.isra.25+0xe/0x40
> [ 6936.306302] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160
> [ 6936.306307] [<ffffffff81711e2c>] ret_from_fork+0x7c/0xb0
> [ 6936.306310] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160
> [ 6936.306312] ---[ end trace 5bab69e086ae3c6f ]---
> [ 6936.363213] ------------[ cut here ]------------
> [ 6936.363226] WARNING: at fs/nfsd/vfs.c:937 nfsd_vfs_read.isra.13+0x197/0x1b0 [nfsd]()

This warning is unrelated, and is probably just carelessness on my part:
I couldn't see why this condition would happen, and I stuck the warning
in there without looking much harder. Probably we should just revert
79f77bf9a4e3dd5ead006b8f17e7c4ff07d8374e "nfsd: warn on odd reply state
in nfsd_vfs_read" while I go stare at the code.

--b.

> [ 6936.363229] Hardware name:
> [ 6936.363230] Modules linked in: ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_conntrack nf_conntrack ip6table_filter ip6_tables xfs coretemp iTCO_wdt iTCO_vendor_support snd_emu10k1 microcode snd_util_mem snd_ac97_codec ac97_bus snd_hwdep snd_seq snd_pcm snd_page_alloc snd_timer e1000e snd_rawmidi snd_seq_device snd emu10k1_gp pcspkr i2c_i801 soundcore gameport lpc_ich mfd_core i82975x_edac edac_core vhost_net tun macvtap macvlan kvm_intel kvm binfmt_misc nfsd auth_rpcgss nfs_acl lockd sunrpc btrfs libcrc32c zlib_deflate usb_storage firewire_ohci firewire_core sata_sil crc_itu_t radeon i2c_algo_bit drm_kms_helper ttm drm i2c_core floppy
> [ 6936.363284] Pid: 699, comm: nfsd Tainted: G W 3.7.0+ #34
> [ 6936.363286] Call Trace:
> [ 6936.363293] [<ffffffff8106badf>] warn_slowpath_common+0x7f/0xc0
> [ 6936.363296] [<ffffffff8106bb3a>] warn_slowpath_null+0x1a/0x20
> [ 6936.363302] [<ffffffffa031ef77>] nfsd_vfs_read.isra.13+0x197/0x1b0 [nfsd]
> [ 6936.363310] [<ffffffffa0321948>] nfsd_read_file+0x88/0xb0 [nfsd]
> [ 6936.363317] [<ffffffffa0332956>] nfsd4_encode_read+0x186/0x260 [nfsd]
> [ 6936.363325] [<ffffffffa03391cc>] nfsd4_encode_operation+0x5c/0xa0 [nfsd]
> [ 6936.363333] [<ffffffffa032e5a9>] nfsd4_proc_compound+0x289/0x780 [nfsd]
> [ 6936.363339] [<ffffffffa0319e5b>] nfsd_dispatch+0xeb/0x230 [nfsd]
> [ 6936.363355] [<ffffffffa02d3d38>] svc_process_common+0x328/0x6d0 [sunrpc]
> [ 6936.363365] [<ffffffffa02d4433>] svc_process+0x103/0x160 [sunrpc]
> [ 6936.363371] [<ffffffffa031921b>] nfsd+0xdb/0x160 [nfsd]
> [ 6936.363378] [<ffffffffa0319140>] ? nfsd_destroy+0x210/0x210 [nfsd]
> [ 6936.363381] [<ffffffff8109b16d>] kthread+0xed/0x100
> [ 6936.363385] [<ffffffff810dd86e>] ? put_lock_stats.isra.25+0xe/0x40
> [ 6936.363388] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160
> [ 6936.363393] [<ffffffff81711e2c>] ret_from_fork+0x7c/0xb0
> [ 6936.363396] [<ffffffff8109b080>] ? kthread_create_on_node+0x160/0x160
> [ 6936.363398] ---[ end trace 5bab69e086ae3c70 ]---
>

2013-01-03 20:11:24

by J. Bruce Fields

Subject: Re: nfsd oops on Linus' current tree.

On Thu, Jan 03, 2013 at 04:28:37PM +0000, Adamson, Dros wrote:
> Hey, sorry for the late response, I've been on vacation.
>
> On Dec 21, 2012, at 6:45 PM, J. Bruce Fields <[email protected]>
> wrote:
>
> > On Fri, Dec 21, 2012 at 11:36:51PM +0000, Myklebust, Trond wrote:
> >> Please reread what I said. There was no obvious circular
> >> dependency, because nfsiod and rpciod are separate workqueues, both
> >> created with WQ_MEM_RECLAIM.
> >
> > Oh, sorry, I read "rpciod" as "nfsiod"!
> >
> >> Dros' experience shows, however that a call to rpc_shutdown_client
> >> in an nfsiod work item will deadlock with rpciod if the RPC task's
> >> work item has been assigned to the same CPU as the one running the
> >> rpc_shutdown_client work item.
> >>
> >> I can't tell right now if that is intentional (in which case the
> >> WARN_ON in the rpc code is correct), or if it is a bug in the
> >> workqueue code. For now, we're assuming the former.
> >
> > Well, Documentation/workqueue.txt says:
> >
> > "Each wq with WQ_MEM_RECLAIM set has an execution context
> > reserved for it. If there is dependency among multiple work
> > items used during memory reclaim, they should be queued to
> > separate wq each with WQ_MEM_RECLAIM."
>
> The deadlock we were seeing was:
>
> - task A gets queued on rpciod workqueue and assigned kworker-0:0
> - task B gets queued on rpciod workqueue and assigned the same kworker
>   (kworker-0:0)
> - task A gets run, calls rpc_shutdown_client(), which will loop forever
>   waiting for task B to run rpc_async_release()
> - task B will never run rpc_async_release() - it can't run until
>   kworker-0:0 is free, which won't happen until task A
>   (rpc_shutdown_client) is done
>
> The same deadlock happened when we tried queuing the tasks on a
> different workqueues -- queue_work() assigns the task to a kworker
> thread and it's luck of the draw if it's the same kworker as task A.
> We tried the different workqueue options, but nothing changed this
> behavior.
>
> Once a work struct is queued, there is no way to back out of the
> deadlock. From kernel/workqueue.c:insert_wq_barrier comment:
>
>  * Currently, a queued barrier can't be canceled.  This is because
>  * try_to_grab_pending() can't determine whether the work to be
>  * grabbed is at the head of the queue and thus can't clear LINKED
>  * flag of the previous work while there must be a valid next work
>  * after a work with LINKED flag set.
>
> So once a work struct is queued and there is an ordering dependency
> (i.e. task A is before task B), there is no way to back task B out -
> so we can't just call cancel_work() or something on task B in
> rpc_shutdown_client.
>
> The root of our issue is that rpc_shutdown_client is never safe to
> call from a workqueue context - it loops until there are no more
> tasks, marking tasks as killed and waiting for them to be cleaned up
> in each task's own workqueue context. Any tasks that have already been
> assigned to the same kworker thread will never have a chance to run
> this cleanup stage.
>
> When fixing this deadlock, Trond and I discussed changing how
> rpc_shutdown_client works (making it workqueue safe), but Trond felt
> that it'd be better to just not call it from a workqueue context and
> print a warning if it is.
>
> IIRC we tried using different workqueues with WQ_MEM_RECLAIM (with no
> success), but I'd argue that even if that did work it would still be
> very easy to call rpc_shutdown_client from the wrong context and MUCH
> harder to detect it. It's also unclear to me if setting rpciod
> workqueue to WQ_MEM_RECLAIM would limit it to one kworker, etc...

Both rpciod and nfsiod already set WQ_MEM_RECLAIM.

But, right, looking at kernel/workqueue.c, it seems that the dedicated
"rescuer" threads are invoked only in the case when work is stalled
because a new worker thread isn't allocated quickly enough.

So, what to do that's simplest enough that it would work for
post-rc2/stable? I was happy having just a simple dedicated
thread--these are only started when nfsd is, so there's no real thread
proliferation problem.

I'll go quietly weep for a little while and then think about it some
more....

--b.

2013-01-04 19:12:56

by J. Bruce Fields

Subject: Re: [PATCH 2/3] NFS: Ensure that we free the rpc_task after read and write cleanups are done

On Fri, Jan 04, 2013 at 06:52:12PM +0000, Myklebust, Trond wrote:
> On Fri, 2013-01-04 at 13:29 -0500, Bruce Fields wrote:
> > On Fri, Jan 04, 2013 at 01:15:01PM -0500, Trond Myklebust wrote:
> > > This patch ensures that we free the rpc_task after the cleanup callbacks
> > > are done in order to avoid a deadlock problem that can be triggered if
> > > the callback needs to wait for another workqueue item to complete.
> >
> > Makes sense to me!
> >
> > (Dumb question: so read and write data are the only two cases where the
> > calldata embeds an rpc task? Why is that?)
>
> nfs_commit_data and nfs_layoutcommit_data do too. The idea is to improve
> reliability when writing back dirty data in low memory conditions. The
> struct nfs_write_data and nfs_commit_data have their own mempool in
> order to guarantee a minimum number of available slots. By embedding the
> rpc_task, we can extend that guarantee to cover (part of) the RPC call
> too.
>
> The only reason why nfs_read_data has the same embedding is for symmetry
> with nfs_write_data.

OK, thanks for the explanation.

--b.

2013-01-04 18:15:10

by Myklebust, Trond

Subject: [PATCH 1/3] SUNRPC: Ensure that we free the rpc_task after cleanups are done

This patch ensures that we free the rpc_task after the cleanup callbacks
are done in order to avoid a deadlock problem that can be triggered if
the callback needs to wait for another workqueue item to complete.

Signed-off-by: Trond Myklebust <[email protected]>
Cc: Weston Andros Adamson <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Bruce Fields <[email protected]>
Cc: [email protected]
---
net/sunrpc/sched.c | 27 +++++++++++++++++++++++----
1 file changed, 23 insertions(+), 4 deletions(-)

diff --git a/net/sunrpc/sched.c b/net/sunrpc/sched.c
index d17a704..b4133bd 100644
--- a/net/sunrpc/sched.c
+++ b/net/sunrpc/sched.c
@@ -934,16 +934,35 @@ struct rpc_task *rpc_new_task(const struct rpc_task_setup *setup_data)
 	return task;
 }
 
+/*
+ * rpc_free_task - release rpc task and perform cleanups
+ *
+ * Note that we free up the rpc_task _after_ rpc_release_calldata()
+ * in order to work around a workqueue dependency issue.
+ *
+ * Tejun Heo states:
+ * "Workqueue currently considers two work items to be the same if they're
+ * on the same address and won't execute them concurrently - ie. it
+ * makes a work item which is queued again while being executed wait
+ * for the previous execution to complete.
+ *
+ * If a work function frees the work item, and then waits for an event
+ * which should be performed by another work item and *that* work item
+ * recycles the freed work item, it can create a false dependency loop.
+ * There really is no reliable way to detect this short of verifying
+ * every memory free."
+ *
+ */
 static void rpc_free_task(struct rpc_task *task)
 {
-	const struct rpc_call_ops *tk_ops = task->tk_ops;
-	void *calldata = task->tk_calldata;
+	unsigned short tk_flags = task->tk_flags;
+
+	rpc_release_calldata(task->tk_ops, task->tk_calldata);
 
-	if (task->tk_flags & RPC_TASK_DYNAMIC) {
+	if (tk_flags & RPC_TASK_DYNAMIC) {
 		dprintk("RPC: %5u freeing task\n", task->tk_pid);
 		mempool_free(task, rpc_task_mempool);
 	}
-	rpc_release_calldata(tk_ops, calldata);
 }
 
 static void rpc_async_release(struct work_struct *work)
--
1.7.11.7


2013-01-03 23:26:59

by Myklebust, Trond

Subject: Re: nfsd oops on Linus' current tree.

On Thu, 2013-01-03 at 18:11 -0500, Trond Myklebust wrote:
> On Thu, 2013-01-03 at 17:26 -0500, Tejun Heo wrote:
> > Ooh, BTW, there was a bug where workqueue code created a false
> > dependency between two work items. Workqueue currently considers two
> > work items to be the same if they're on the same address and won't
> > execute them concurrently - ie. it makes a work item which is queued
> > again while being executed wait for the previous execution to
> > complete.
> >
> > If a work function frees the work item, and then waits for an event
> > which should be performed by another work item and *that* work item
> > recycles the freed work item, it can create a false dependency loop.
> > There really is no reliable way to detect this short of verifying
> > every memory free. A patch is queued to make such occurrences less
> > likely (work functions should also match for two work items considered
> > the same), but if you're seeing this, the best thing to do is freeing
> > the work item at the end of the work function.
>
> That's interesting... I wonder if we may have been hitting that issue.
>
> From what I can see, we do actually free the write RPC task (and hence
> the work_struct) before we call the asynchronous unlink completion...
>
> Dros, can you see if reverting commit
> 324d003b0cd82151adbaecefef57b73f7959a469 + commit
> 168e4b39d1afb79a7e3ea6c3bb246b4c82c6bdb9 and then applying the attached
> patch also fixes the hang on a pristine 3.7.x kernel?

Actually, we probably also need to look at rpc_free_task, so the
following patch, instead...

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com


Attachments:
gnurr2.dif (2.01 kB)

2013-01-04 17:11:07

by Adamson, Dros

Subject: Re: nfsd oops on Linus' current tree.


On Jan 3, 2013, at 6:26 PM, "Myklebust, Trond" <[email protected]> wrote:

> On Thu, 2013-01-03 at 18:11 -0500, Trond Myklebust wrote:
>> On Thu, 2013-01-03 at 17:26 -0500, Tejun Heo wrote:
>>> Ooh, BTW, there was a bug where workqueue code created a false
>>> dependency between two work items. Workqueue currently considers two
>>> work items to be the same if they're on the same address and won't
>>> execute them concurrently - ie. it makes a work item which is queued
>>> again while being executed wait for the previous execution to
>>> complete.
>>>
>>> If a work function frees the work item, and then waits for an event
>>> which should be performed by another work item and *that* work item
>>> recycles the freed work item, it can create a false dependency loop.
>>> There really is no reliable way to detect this short of verifying
>>> every memory free. A patch is queued to make such occurrences less
>>> likely (work functions should also match for two work items considered
>>> the same), but if you're seeing this, the best thing to do is freeing
>>> the work item at the end of the work function.
>>
>> That's interesting... I wonder if we may have been hitting that issue.
>>
>> From what I can see, we do actually free the write RPC task (and hence
>> the work_struct) before we call the asynchronous unlink completion...
>>
>> Dros, can you see if reverting commit
>> 324d003b0cd82151adbaecefef57b73f7959a469 + commit
>> 168e4b39d1afb79a7e3ea6c3bb246b4c82c6bdb9 and then applying the attached
>> patch also fixes the hang on a pristine 3.7.x kernel?
>
> Actually, we probably also need to look at rpc_free_task, so the
> following patch, instead...

Yes, this patch fixes the hang!

Thank you for the explanation Tejun - that makes a lot of sense and explains the workqueue behavior that we were seeing.

-dros

2013-01-04 18:30:00

by J. Bruce Fields

Subject: Re: [PATCH 2/3] NFS: Ensure that we free the rpc_task after read and write cleanups are done

On Fri, Jan 04, 2013 at 01:15:01PM -0500, Trond Myklebust wrote:
> This patch ensures that we free the rpc_task after the cleanup callbacks
> are done in order to avoid a deadlock problem that can be triggered if
> the callback needs to wait for another workqueue item to complete.

Makes sense to me!

(Dumb question: so read and write data are the only two cases where the
calldata embeds an rpc task? Why is that?)

--b.

>
> Signed-off-by: Trond Myklebust <[email protected]>
> Cc: Weston Andros Adamson <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Bruce Fields <[email protected]>
> Cc: [email protected] [>= 3.5]
> ---
> fs/nfs/read.c | 10 +++++++---
> fs/nfs/write.c | 10 +++++++---
> 2 files changed, 14 insertions(+), 6 deletions(-)
>
> diff --git a/fs/nfs/read.c b/fs/nfs/read.c
> index b6bdb18..a5e5d98 100644
> --- a/fs/nfs/read.c
> +++ b/fs/nfs/read.c
> @@ -91,12 +91,16 @@ void nfs_readdata_release(struct nfs_read_data *rdata)
>  	put_nfs_open_context(rdata->args.context);
>  	if (rdata->pages.pagevec != rdata->pages.page_array)
>  		kfree(rdata->pages.pagevec);
> -	if (rdata != &read_header->rpc_data)
> -		kfree(rdata);
> -	else
> +	if (rdata == &read_header->rpc_data) {
>  		rdata->header = NULL;
> +		rdata = NULL;
> +	}
>  	if (atomic_dec_and_test(&hdr->refcnt))
>  		hdr->completion_ops->completion(hdr);
> +	/* Note: we only free the rpc_task after callbacks are done.
> +	 * See the comment in rpc_free_task() for why
> +	 */
> +	kfree(rdata);
>  }
>  EXPORT_SYMBOL_GPL(nfs_readdata_release);
>
> diff --git a/fs/nfs/write.c b/fs/nfs/write.c
> index b673be3..c483cc5 100644
> --- a/fs/nfs/write.c
> +++ b/fs/nfs/write.c
> @@ -126,12 +126,16 @@ void nfs_writedata_release(struct nfs_write_data *wdata)
>  	put_nfs_open_context(wdata->args.context);
>  	if (wdata->pages.pagevec != wdata->pages.page_array)
>  		kfree(wdata->pages.pagevec);
> -	if (wdata != &write_header->rpc_data)
> -		kfree(wdata);
> -	else
> +	if (wdata == &write_header->rpc_data) {
>  		wdata->header = NULL;
> +		wdata = NULL;
> +	}
>  	if (atomic_dec_and_test(&hdr->refcnt))
>  		hdr->completion_ops->completion(hdr);
> +	/* Note: we only free the rpc_task after callbacks are done.
> +	 * See the comment in rpc_free_task() for why
> +	 */
> +	kfree(wdata);
>  }
>  EXPORT_SYMBOL_GPL(nfs_writedata_release);
>
> --
> 1.7.11.7
>

2013-01-03 22:22:52

by Tejun Heo

Subject: Re: nfsd oops on Linus' current tree.

It's getting a bit repetitive but I really wanna steer people away
from implementing separate kthreads for wrong reasons. kthread is
surprisingly difficult to get right especially around freezing /
exiting / hotplugs and people get it subtly wrong very often.

On Thu, Jan 03, 2013 at 03:52:21PM -0500, J. Bruce Fields wrote:
> No, I meant I was happy having my workqueue handled by a single
> dedicated thread, which I don't see a trivial way to do any more.

Because you don't need that. The old workqueue didn't give you
anything better than the current one. It was usually more difficult
to get right in terms of execution dependency because it introduced a
lot of unintended execution dependencies through
one-(per-cpu)-worker-per-workqueue rule.

If anyone can show me workqueue is deadlocking when it shouldn't I'll
be happy to look into it and fix it, but, up until now, most if not
all reported cases were through incorrect usage, and things definitely
don't work as described in this thread.

Thanks.

--
tejun

2013-01-04 18:52:15

by Myklebust, Trond

Subject: Re: [PATCH 2/3] NFS: Ensure that we free the rpc_task after read and write cleanups are done

On Fri, 2013-01-04 at 13:29 -0500, Bruce Fields wrote:
> On Fri, Jan 04, 2013 at 01:15:01PM -0500, Trond Myklebust wrote:
> > This patch ensures that we free the rpc_task after the cleanup callbacks
> > are done in order to avoid a deadlock problem that can be triggered if
> > the callback needs to wait for another workqueue item to complete.
>
> Makes sense to me!
>
> (Dumb question: so read and write data are the only two cases where the
> calldata embeds an rpc task? Why is that?)

nfs_commit_data and nfs_layoutcommit_data do too. The idea is to improve
reliability when writing back dirty data in low memory conditions. The
struct nfs_write_data and nfs_commit_data have their own mempool in
order to guarantee a minimum number of available slots. By embedding the
rpc_task, we can extend that guarantee to cover (part of) the RPC call
too.

The only reason why nfs_read_data has the same embedding is for symmetry
with nfs_write_data.
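
Schematically (simplified, with made-up names rather than the real
nfs_write_data definition):

#include <linux/mempool.h>
#include <linux/sunrpc/sched.h>

struct demo_write_data {
	struct rpc_task	task;	/* embedded, so the mempool-backed
				 * allocation also covers the rpc_task */
	/* ... write arguments, page vector, etc. ... */
};

static mempool_t *demo_wdata_mempool;

static int demo_mempool_init(void)
{
	/* the preallocated minimum guarantees that writeback can still
	 * get a slot when the system is under memory pressure */
	demo_wdata_mempool = mempool_create_kmalloc_pool(16,
					sizeof(struct demo_write_data));
	return demo_wdata_mempool ? 0 : -ENOMEM;
}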

--
Trond Myklebust
Linux NFS client maintainer

NetApp
[email protected]
http://www.netapp.com

2013-01-03 20:52:24

by J. Bruce Fields

Subject: Re: nfsd oops on Linus' current tree.

On Thu, Jan 03, 2013 at 08:27:20PM +0000, Adamson, Dros wrote:
>
> On Jan 3, 2013, at 3:11 PM, J. Bruce Fields <[email protected]> wrote:
>
> > On Thu, Jan 03, 2013 at 04:28:37PM +0000, Adamson, Dros wrote:
> >> Hey, sorry for the late response, I've been on vacation.
> >>
> >> On Dec 21, 2012, at 6:45 PM, J. Bruce Fields <[email protected]>
> >> wrote:
> >>
> >>> On Fri, Dec 21, 2012 at 11:36:51PM +0000, Myklebust, Trond wrote:
> >>>> Please reread what I said. There was no obvious circular
> >>>> dependency, because nfsiod and rpciod are separate workqueues, both
> >>>> created with WQ_MEM_RECLAIM.
> >>>
> >>> Oh, sorry, I read "rpciod" as "nfsiod"!
> >>>
> >>>> Dros' experience shows, however that a call to rpc_shutdown_client
> >>>> in an nfsiod work item will deadlock with rpciod if the RPC task's
> >>>> work item has been assigned to the same CPU as the one running the
> >>>> rpc_shutdown_client work item.
> >>>>
> >>>> I can't tell right now if that is intentional (in which case the
> >>>> WARN_ON in the rpc code is correct), or if it is a bug in the
> >>>> workqueue code. For now, we're assuming the former.
> >>>
> >>> Well, Documentation/workqueue.txt says:
> >>>
> >>> "Each wq with WQ_MEM_RECLAIM set has an execution context
> >>> reserved for it. If there is dependency among multiple work
> >>> items used during memory reclaim, they should be queued to
> >>> separate wq each with WQ_MEM_RECLAIM."
> >>
> >> The deadlock we were seeing was:
> >>
> >> - task A gets queued on rpciod workqueue and assigned kworker-0:0
> >> - task B gets queued on rpciod workqueue and assigned the same kworker
> >>   (kworker-0:0)
> >> - task A gets run, calls rpc_shutdown_client(), which will loop forever
> >>   waiting for task B to run rpc_async_release()
> >> - task B will never run rpc_async_release() - it can't run until
> >>   kworker-0:0 is free, which won't happen until task A
> >>   (rpc_shutdown_client) is done
> >>
> >> The same deadlock happened when we tried queuing the tasks on a
> >> different workqueues -- queue_work() assigns the task to a kworker
> >> thread and it's luck of the draw if it's the same kworker as task A.
> >> We tried the different workqueue options, but nothing changed this
> >> behavior.
> >>
> >> Once a work struct is queued, there is no way to back out of the
> >> deadlock. From kernel/workqueue.c:insert_wq_barrier comment:
> >>
> >>  * Currently, a queued barrier can't be canceled.  This is because
> >>  * try_to_grab_pending() can't determine whether the work to be
> >>  * grabbed is at the head of the queue and thus can't clear LINKED
> >>  * flag of the previous work while there must be a valid next work
> >>  * after a work with LINKED flag set.
> >>
> >> So once a work struct is queued and there is an ordering dependency
> >> (i.e. task A is before task B), there is no way to back task B out -
> >> so we can't just call cancel_work() or something on task B in
> >> rpc_shutdown_client.
> >>
> >> The root of our issue is that rpc_shutdown_client is never safe to
> >> call from a workqueue context - it loops until there are no more
> >> tasks, marking tasks as killed and waiting for them to be cleaned up
> >> in each task's own workqueue context. Any tasks that have already been
> >> assigned to the same kworker thread will never have a chance to run
> >> this cleanup stage.
> >>
> >> When fixing this deadlock, Trond and I discussed changing how
> >> rpc_shutdown_client works (making it workqueue safe), but Trond felt
> >> that it'd be better to just not call it from a workqueue context and
> >> print a warning if it is.
> >>
> >> IIRC we tried using different workqueues with WQ_MEM_RECLAIM (with no
> >> success), but I'd argue that even if that did work it would still be
> >> very easy to call rpc_shutdown_client from the wrong context and MUCH
> >> harder to detect it. It's also unclear to me if setting rpciod
> >> workqueue to WQ_MEM_RECLAIM would limit it to one kworker, etc...
> >
> > Both rpciod and nfsiod already set WQ_MEM_RECLAIM.
>
> Heh, oh yeah :)
>
> >
> > But, right, looking at kernel/workqueue.c, it seems that the dedicated
> > "rescuer" threads are invoked only in the case when work is stalled
> > because a new worker thread isn't allocated quickly enough.
> >
> > So, what to do that's simplest enough that it would work for
> > post-rc2/stable? I was happy having just a simple dedicated
> > thread--these are only started when nfsd is, so there's no real thread
> > proliferation problem.
>
> That should work fine.

No, I meant I was happy having my workqueue handled by a single
dedicated thread, which I don't see a trivial way to do any more.

> The client went this way and spawns a new kthread before calling rpc_shutdown_client() from a workqueue context. This should happen very infrequently.

I don't think I can do that, as I was really counting on being able to
wait for the client to shut down. I'll think about it.

--b.

2013-01-03 22:34:47

by Tejun Heo

Subject: Re: nfsd oops on Linus' current tree.

On Thu, Jan 03, 2013 at 05:26:39PM -0500, Tejun Heo wrote:
> Hello, Trond.
>
> On Thu, Jan 03, 2013 at 10:12:32PM +0000, Myklebust, Trond wrote:
> > > The analysis is likely completely wrong, so please don't go off doing
> > > something unnecessary. Please take look at what's causing the
> > > deadlocks again.
> >
> > The analysis is a no-brainer:
> > We see a deadlock due to one work item waiting for completion of another
> > work item that is queued on the same CPU. There is no other dependency
> > between the two work items.
>
> What do you mean "waiting for completion of"? Is one flushing the
> other? Or is one waiting for the other to take certain action? How
> are the two work items related? Are they queued on the same
> workqueue? Can you create a minimal repro case of the observed
> deadlock?

Ooh, BTW, there was a bug where workqueue code created a false
dependency between two work items. Workqueue currently considers two
work items to be the same if they're on the same address and won't
execute them concurrently - ie. it makes a work item which is queued
again while being executed wait for the previous execution to
complete.

If a work function frees the work item, and then waits for an event
which should be performed by another work item and *that* work item
recycles the freed work item, it can create a false dependency loop.
There really is no reliable way to detect this short of verifying
every memory free. A patch is queued to make such occurrences less
likely (work functions should also match for two work items considered
the same), but if you're seeing this, the best thing to do is freeing
the work item at the end of the work function.
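
In code, that last recommendation amounts to something like the
following (hypothetical names):

struct demo_item {
	struct work_struct	work;
	/* ... */
};

static void demo_work_fn(struct work_struct *work)
{
	struct demo_item *item = container_of(work, struct demo_item, work);

	do_the_real_work(item);			/* hypothetical helper */
	wait_for_dependent_work(item);		/* hypothetical; may sleep */

	/* free the item only as the very last step, so that a recycled
	 * allocation at this address can't be mistaken for this still
	 * running work item */
	kfree(item);
}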

Thanks.

--
tejun

2013-01-07 17:47:24

by Tejun Heo

Subject: Re: [PATCH 1/3] SUNRPC: Ensure that we free the rpc_task after cleanups are done

On Fri, Jan 04, 2013 at 01:15:00PM -0500, Trond Myklebust wrote:
> This patch ensures that we free the rpc_task after the cleanup callbacks
> are done in order to avoid a deadlock problem that can be triggered if
> the callback needs to wait for another workqueue item to complete.
>
> Signed-off-by: Trond Myklebust <[email protected]>
> Cc: Weston Andros Adamson <[email protected]>
> Cc: Tejun Heo <[email protected]>
> Cc: Bruce Fields <[email protected]>
> Cc: [email protected]

Acked-by: Tejun Heo <[email protected]>

> + * Tejun Heo states:
> + * "Workqueue currently considers two work items to be the same if they're
> + * on the same address and won't execute them concurrently - ie. it
> + * makes a work item which is queued again while being executed wait
> + * for the previous execution to complete.
> + *
> + * If a work function frees the work item, and then waits for an event
> + * which should be performed by another work item and *that* work item
> + * recycles the freed work item, it can create a false dependency loop.
> + * There really is no reliable way to detect this short of verifying
> + * every memory free."

I will document it in Documentation/workqueue.txt.

Thanks!

--
tejun

2013-01-03 22:12:34

by Myklebust, Trond

Subject: Re: nfsd oops on Linus' current tree.

On Thu, 2013-01-03 at 17:08 -0500, Tejun Heo wrote:
> Hello,
>
> On Thu, Jan 03, 2013 at 03:11:20PM -0500, J. Bruce Fields wrote:
> > Both rpciod and nfsiod already set WQ_MEM_RECLAIM.
> >
> > But, right, looking at kernel/workqueue.c, it seems that the dedicated
> > "rescuer" threads are invoked only in the case when work is stalled
> > because a new worker thread isn't allocated quickly enough.
>
> Because that's the *only* case where progress can't be guaranteed
> otherwise.
>
> > So, what to do that's simplest enough that it would work for
> > post-rc2/stable? I was happy having just a simple dedicated
> > thread--these are only started when nfsd is, so there's no real thread
> > proliferation problem.
>
> The analysis is likely completely wrong, so please don't go off doing
> something unnecessary. Please take a look at what's causing the
> deadlocks again.

The analysis is a no-brainer:
We see a deadlock due to one work item waiting for completion of another
work item that is queued on the same CPU. There is no other dependency
between the two work items.

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
http://www.netapp.com

2013-01-03 22:08:20

by Tejun Heo

[permalink] [raw]
Subject: Re: nfsd oops on Linus' current tree.

Hello,

On Thu, Jan 03, 2013 at 03:11:20PM -0500, J. Bruce Fields wrote:
> Both rpciod and nfsiod already set WQ_MEM_RECLAIM.
>
> But, right, looking at kernel/workqueue.c, it seems that the dedicated
> "rescuer" threads are invoked only in the case when work is stalled
> because a new worker thread isn't allocated quickly enough.

Because that's the *only* case where progress can't be guaranteed
otherwise.

> So, what to do that's simplest enough that it would work for
> post-rc2/stable? I was happy having just a simple dedicated
> thread--these are only started when nfsd is, so there's no real thread
> proliferation problem.

The analysis is likely completely wrong, so please don't go off doing
something unnecessary. Please take a look at what's causing the
deadlocks again.

Thanks.

--
tejun

2013-01-03 23:11:34

by Myklebust, Trond

[permalink] [raw]
Subject: Re: nfsd oops on Linus' current tree.

On Thu, 2013-01-03 at 17:26 -0500, Tejun Heo wrote:
> Hello, Trond.
>
> On Thu, Jan 03, 2013 at 10:12:32PM +0000, Myklebust, Trond wrote:
> > > The analysis is likely completely wrong, so please don't go off doing
> > > something unnecessary. Please take look at what's causing the
> > > deadlocks again.
> >
> > The analysis is a no-brainer:
> > We see a deadlock due to one work item waiting for completion of another
> > work item that is queued on the same CPU. There is no other dependency
> > between the two work items.
>
> What do you mean "waiting for completion of"? Is one flushing the
> other? Or is one waiting for the other to take certain action? How
> are the two work items related? Are they queued on the same
> workqueue? Can you create a minimal repro case of the observed
> deadlock?

The two work items are running on different work queues. One is running
on the nfsiod work queue, and is waiting for the other to complete on
the rpciod work queue. Basically, the nfsiod work item is trying to shut
down an RPC session, and it is waiting for each outstanding RPC call to
finish running a clean-up routine.

We can't call flush_work(), since we don't have a way to pin the
work_struct for any long period of time, so we queue all the RPC calls
up, then sleep on a global wait queue for 1 second or until the last RPC
call wakes us up (see rpc_shutdown_client()).

In the deadlock scenario, it looks as if one (or more) of the RPC calls
are getting queued on the same CPU (but on the rpciod workqueue) as the
shutdown process (running on nfsiod).
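
For context, the shutdown/wake pattern being described is roughly the
following (a simplified sketch with invented names; not the actual
rpc_shutdown_client()/rpc_async_release() code):

#include <linux/workqueue.h>
#include <linux/wait.h>
#include <linux/atomic.h>
#include <linux/jiffies.h>

static DECLARE_WAIT_QUEUE_HEAD(shutdown_waitq); /* the "global wait queue" */
static atomic_t outstanding_calls = ATOMIC_INIT(0);

/* Runs on rpciod: per-call cleanup; the last call to finish wakes the
 * shutdown side. */
static void call_cleanup_fn(struct work_struct *work)
{
        /* ... tear down one RPC call ... */
        if (atomic_dec_and_test(&outstanding_calls))
                wake_up(&shutdown_waitq);
}

/* Runs on nfsiod: after marking all outstanding calls as killed (which
 * queues their cleanup work on rpciod), sleep for up to a second at a
 * time until the last cleanup has run. */
static void shutdown_side(void)
{
        while (atomic_read(&outstanding_calls) != 0)
                wait_event_timeout(shutdown_waitq,
                                   atomic_read(&outstanding_calls) == 0, HZ);
}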


> Ooh, BTW, there was a bug where workqueue code created a false
> dependency between two work items. Workqueue currently considers two
> work items to be the same if they're on the same address and won't
> execute them concurrently - ie. it makes a work item which is queued
> again while being executed wait for the previous execution to
> complete.
>
> If a work function frees the work item, and then waits for an event
> which should be performed by another work item and *that* work item
> recycles the freed work item, it can create a false dependency loop.
> There really is no reliable way to detect this short of verifying
> every memory free. A patch is queued to make such occurrences less
> likely (work functions should also match for two work items considered
> the same), but if you're seeing this, the best thing to do is freeing
> the work item at the end of the work function.

That's interesting... I wonder if we may have been hitting that issue.

From what I can see, we do actually free the write RPC task (and hence
the work_struct) before we call the asynchronous unlink completion...

Dros, can you see if reverting commit
324d003b0cd82151adbaecefef57b73f7959a469 + commit
168e4b39d1afb79a7e3ea6c3bb246b4c82c6bdb9 and then applying the attached
patch also fixes the hang on a pristine 3.7.x kernel?

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
http://www.netapp.com


Attachments:
gnurr.dif (1.25 kB)

2013-01-07 17:54:06

by Myklebust, Trond

[permalink] [raw]
Subject: Re: [PATCH 1/3] SUNRPC: Ensure that we free the rpc_task after cleanups are done

On Mon, 2013-01-07 at 09:47 -0800, Tejun Heo wrote:
> On Fri, Jan 04, 2013 at 01:15:00PM -0500, Trond Myklebust wrote:
> > + * Tejun Heo states:
> > + * "Workqueue currently considers two work items to be the same if they're
> > + * on the same address and won't execute them concurrently - ie. it
> > + * makes a work item which is queued again while being executed wait
> > + * for the previous execution to complete.
> > + *
> > + * If a work function frees the work item, and then waits for an event
> > + * which should be performed by another work item and *that* work item
> > + * recycles the freed work item, it can create a false dependency loop.
> > + * There really is no reliable way to detect this short of verifying
> > + * every memory free."
>
> I will document it in Documentation/workqueue.txt.

Thanks Tejun!

--
Trond Myklebust
Linux NFS client maintainer

NetApp
Trond.Myklebust@netapp.com
http://www.netapp.com

2013-01-03 23:08:14

by J. Bruce Fields

[permalink] [raw]
Subject: Re: nfsd oops on Linus' current tree.

On Thu, Jan 03, 2013 at 05:03:09PM -0500, Tejun Heo wrote:
> Hello, guys.
>
> On Thu, Jan 03, 2013 at 04:28:37PM +0000, Adamson, Dros wrote:
> > The deadlock we were seeing was:
> >
> > - task A gets queued on rpciod workqueue and assigned kworker-0:0
> > - task B gets queued on rpciod workqueue and assigned the same kworker (kworker-0:0)
> > - task A gets run, calls rpc_shutdown_client(), which will loop forever waiting for task B to run rpc_async_release()
> > - task B will never run rpc_async_release() - it can't run until kworker-0:0 is free, which won't happen until task A (rpc_shutdown_client) is done
> >
> > The same deadlock happened when we tried queuing the tasks on a
> > different workqueues -- queue_work() assigns the task to a kworker
> > thread and it's luck of the draw if it's the same kworker as task A.
> > We tried the different workqueue options, but nothing changed this
> > behavior.
>
> Work items don't get assigned to workers on queueing. Idle workers
> pick up work items.

Oh, so that's why the case where we can't create a new worker is the
only case we should need the rescuers for. Got it. I think.

--b.

> A work item is directly assigned to a specific
> worker iff the worker is already executing that specific work item or
> the new work item is "linked" to the one it's currently executing.
> Currently, the only case where a linked work item is used is when
> flushing which is guaranteed to not introduce dependency the other way
> around.
>
> So, your diagnosis looks wrong to me. If such problem existed, we
> would be seeing deadlocks all over the place.
>
> > Once a work struct is queued, there is no way to back out of the
> > deadlock. From kernel/workqueue.c:insert_wq_barrier comment:
>
> Yes, there are. cancel_work[_sync]() do exactly that.
>
> > * Currently, a queued barrier can't be canceled. This is because
> > * try_to_grab_pending() can't determine whether the work to be
> > * grabbed is at the head of the queue and thus can't clear LINKED
> > * flag of the previous work while there must be a valid next work
> > * after a work with LINKED flag set.
> >
> > So once a work struct is queued and there is an ordering dependency
> > (i.e. task A is before task B), there is no way to back task B out -
> > so we can't just call cancel_work() or something on task B in
> > rpc_shutdown_client.
>
> A *barrier* can't be canceled. A barrier is used only to flush work
> items. The above comment means that we currently don't (or can't)
> support canceling flush_work(). It has *nothing* to do with canceling
> regular work items. You can cancel work items fine.
>
> > The root of our issue is that rpc_shutdown_client is never safe to
> > call from a workqueue context - it loops until there are no more
> > tasks, marking tasks as killed and waiting for them to be cleaned up
> > in each task's own workqueue context. Any tasks that have already
> > been assigned to the same kworker thread will never have a chance to
> > run this cleanup stage.
> >
> > When fixing this deadlock, Trond and I discussed changing how
> > rpc_shutdown_client works (making it workqueue safe), but Trond felt
> > that it'd be better to just not call it from a workqueue context and
> > print a warning if it is.
> >
> > IIRC we tried using different workqueues with WQ_MEM_RECLAIM (with
> > no success), but I'd argue that even if that did work it would still
> > be very easy to call rpc_shutdown_client from the wrong context and
> > MUCH harder to detect it. It's also unclear to me if setting rpciod
> > workqueue to WQ_MEM_RECLAIM would limit it to one kworker, etc...
>
> It looks like you guys ended up in a weird place misled by wrong
> analysis. Unless you require more than one concurrent execution on
> the same workqueue, WQ_MEM_RECLAIM guarantees forward progress. It
> won't deadlock because "a different work item is queued to the same
> worker". The whole thing is designed *exactly* to avoid problems like
> that. So, I'd strongly recommend looking again at why the deadlocks
> are occurring.
>
> Thanks.
>
> --
> tejun

2013-01-03 22:26:44

by Tejun Heo

[permalink] [raw]
Subject: Re: nfsd oops on Linus' current tree.

Hello, Trond.

On Thu, Jan 03, 2013 at 10:12:32PM +0000, Myklebust, Trond wrote:
> > The analysis is likely completely wrong, so please don't go off doing
> > something unnecessary. Please take look at what's causing the
> > deadlocks again.
>
> The analysis is a no-brainer:
> We see a deadlock due to one work item waiting for completion of another
> work item that is queued on the same CPU. There is no other dependency
> between the two work items.

What do you mean "waiting for completion of"? Is one flushing the
other? Or is one waiting for the other to take certain action? How
are the two work items related? Are they queued on the same
workqueue? Can you create a minimal repro case of the observed
deadlock?

Thanks.

--
tejun

2013-01-03 20:27:23

by Adamson, Dros

[permalink] [raw]
Subject: Re: nfsd oops on Linus' current tree.


On Jan 3, 2013, at 3:11 PM, J. Bruce Fields <[email protected]> wrote:

> On Thu, Jan 03, 2013 at 04:28:37PM +0000, Adamson, Dros wrote:
>> Hey, sorry for the late response, I've been on vacation.
>>
>> On Dec 21, 2012, at 6:45 PM, J. Bruce Fields <[email protected]>
>> wrote:
>>
>>> On Fri, Dec 21, 2012 at 11:36:51PM +0000, Myklebust, Trond wrote:
>>>> Please reread what I said. There was no obvious circular
>>>> dependency, because nfsiod and rpciod are separate workqueues, both
>>>> created with WQ_MEM_RECLAIM.
>>>
>>> Oh, sorry, I read "rpciod" as "nfsiod"!
>>>
>>>> Dros' experience shows, however that a call to rpc_shutdown_client
>>>> in an nfsiod work item will deadlock with rpciod if the RPC task's
>>>> work item has been assigned to the same CPU as the one running the
>>>> rpc_shutdown_client work item.
>>>>
>>>> I can't tell right now if that is intentional (in which case the
>>>> WARN_ON in the rpc code is correct), or if it is a bug in the
>>>> workqueue code. For now, we're assuming the former.
>>>
>>> Well, Documentation/workqueue.txt says:
>>>
>>> "Each wq with WQ_MEM_RECLAIM set has an execution context
>>> reserved for it. If there is dependency among multiple work
>>> items used during memory reclaim, they should be queued to
>>> separate wq each with WQ_MEM_RECLAIM."
>>
>> The deadlock we were seeing was:
>>
>> - task A gets queued on rpciod workqueue and assigned kworker-0:0
>> - task B gets queued on rpciod workqueue and assigned the same kworker (kworker-0:0)
>> - task A gets run, calls rpc_shutdown_client(), which will loop forever waiting for task B to run rpc_async_release()
>> - task B will never run rpc_async_release() - it can't run until kworker-0:0 is free, which won't happen until task A (rpc_shutdown_client) is done
>>
>> The same deadlock happened when we tried queuing the tasks on a
>> different workqueues -- queue_work() assigns the task to a kworker
>> thread and it's luck of the draw if it's the same kworker as task A.
>> We tried the different workqueue options, but nothing changed this
>> behavior.
>>
>> Once a work struct is queued, there is no way to back out of the
>> deadlock. From kernel/workqueue.c:insert_wq_barrier comment:
>>
>> * Currently, a queued barrier can't be canceled. This is because
>> * try_to_grab_pending() can't determine whether the work to be
>> * grabbed is at the head of the queue and thus can't clear LINKED
>> * flag of the previous work while there must be a valid next work
>> * after a work with LINKED flag set.
>>
>> So once a work struct is queued and there is an ordering dependency
>> (i.e. task A is before task B), there is no way to back task B out -
>> so we can't just call cancel_work() or something on task B in
>> rpc_shutdown_client.
>>
>> The root of our issue is that rpc_shutdown_client is never safe to
>> call from a workqueue context - it loops until there are no more
>> tasks, marking tasks as killed and waiting for them to be cleaned up
>> in each task's own workqueue context. Any tasks that have already been
>> assigned to the same kworker thread will never have a chance to run
>> this cleanup stage.
>>
>> When fixing this deadlock, Trond and I discussed changing how
>> rpc_shutdown_client works (making it workqueue safe), but Trond felt
>> that it'd be better to just not call it from a workqueue context and
>> print a warning if it is.
>>
>> IIRC we tried using different workqueues with WQ_MEM_RECLAIM (with no
>> success), but I'd argue that even if that did work it would still be
>> very easy to call rpc_shutdown_client from the wrong context and MUCH
>> harder to detect it. It's also unclear to me if setting rpciod
>> workqueue to WQ_MEM_RECLAIM would limit it to one kworker, etc...
>
> Both rpciod and nfsiod already set WQ_MEM_RECLAIM.

Heh, oh yeah :)

>
> But, right, looking at kernel/workqueue.c, it seems that the dedicated
> "rescuer" threads are invoked only in the case when work is stalled
> because a new worker thread isn't allocated quickly enough.
>
> So, what to do that's simplest enough that it would work for
> post-rc2/stable? I was happy having just a simple dedicated
> thread--these are only started when nfsd is, so there's no real thread
> proliferation problem.

That should work fine. The client went this way: it spawns a new kthread to call rpc_shutdown_client() instead of calling it from a workqueue context. This should happen very infrequently.
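
Roughly, that approach looks like the sketch below (the struct, helper and
work function names are invented; only rpc_shutdown_client() is real):

#include <linux/kernel.h>
#include <linux/kthread.h>
#include <linux/workqueue.h>
#include <linux/err.h>
#include <linux/printk.h>
#include <linux/sunrpc/clnt.h>

struct clnt_work {                      /* hypothetical */
        struct work_struct work;
        struct rpc_clnt *clnt;
};

/* Hand the shutdown off to a short-lived kthread so the blocking wait in
 * rpc_shutdown_client() never runs on a kworker. */
static int clnt_shutdown_threadfn(void *data)
{
        struct rpc_clnt *clnt = data;

        rpc_shutdown_client(clnt);
        return 0;
}

static void some_work_fn(struct work_struct *work)      /* hypothetical */
{
        struct clnt_work *cw = container_of(work, struct clnt_work, work);
        struct task_struct *t;

        t = kthread_run(clnt_shutdown_threadfn, cw->clnt, "rpc-shutdown");
        if (IS_ERR(t))
                pr_warn("could not start shutdown thread\n");
}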

-dros

>
> I'll go quietly weep for a little while and then think about it some
> more....




2013-01-03 22:03:15

by Tejun Heo

[permalink] [raw]
Subject: Re: nfsd oops on Linus' current tree.

Hello, guys.

On Thu, Jan 03, 2013 at 04:28:37PM +0000, Adamson, Dros wrote:
> The deadlock we were seeing was:
>
> - task A gets queued on rpciod workqueue and assigned kworker-0:0
> - task B gets queued on rpciod workqueue and assigned the same kworker (kworker-0:0)
> - task A gets run, calls rpc_shutdown_client(), which will loop forever waiting for task B to run rpc_async_release()
> - task B will never run rpc_async_release() - it can't run until kworker-0:0 is free, which won't happen until task A (rpc_shutdown_client) is done
>
> The same deadlock happened when we tried queuing the tasks on a
> different workqueues -- queue_work() assigns the task to a kworker
> thread and it's luck of the draw if it's the same kworker as task A.
> We tried the different workqueue options, but nothing changed this
> behavior.

Work items don't get assigned to workers on queueing. Idle workers
pick up work items. A work item is directly assigned to a specific
worker iff the worker is already executing that specific work item or
the new work item is "linked" to the one it's currently executing.
Currently, the only case where a linked work item is used is when
flushing, which is guaranteed not to introduce a dependency the other
way around.

So, your diagnosis looks wrong to me. If such a problem existed, we
would be seeing deadlocks all over the place.

> Once a work struct is queued, there is no way to back out of the
> deadlock. From kernel/workqueue.c:insert_wq_barrier comment:

Yes, there are. cancel_work[_sync]() do exactly that.

> * Currently, a queued barrier can't be canceled. This is because
> * try_to_grab_pending() can't determine whether the work to be
> * grabbed is at the head of the queue and thus can't clear LINKED
> * flag of the previous work while there must be a valid next work
> * after a work with LINKED flag set.
>
> So once a work struct is queued and there is an ordering dependency
> (i.e. task A is before task B), there is no way to back task B out -
> so we can't just call cancel_work() or something on task B in
> rpc_shutdown_client.

A *barrier* can't be canceled. A barrier is used only to flush work
items. The above comment means that we currently don't (or can't)
support canceling flush_work(). It has *nothing* to do with canceling
regular work items. You can cancel work items fine.
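
For illustration, cancelling an ordinary (non-barrier) work item is just
the following (a sketch; my_work and my_work_fn are made-up names):

#include <linux/workqueue.h>
#include <linux/printk.h>

static void my_work_fn(struct work_struct *work)
{
        /* ... */
}

static DECLARE_WORK(my_work, my_work_fn);

static void example(struct workqueue_struct *wq)
{
        queue_work(wq, &my_work);

        /* cancel_work_sync() returns true if the item was still pending;
         * either way, any execution in flight has finished when it
         * returns. */
        if (cancel_work_sync(&my_work))
                pr_debug("my_work cancelled before it ran\n");
}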

> The root of our issue is that rpc_shutdown_client is never safe to
> call from a workqueue context - it loops until there are no more
> tasks, marking tasks as killed and waiting for them to be cleaned up
> in each task's own workqueue context. Any tasks that have already
> been assigned to the same kworker thread will never have a chance to
> run this cleanup stage.
>
> When fixing this deadlock, Trond and I discussed changing how
> rpc_shutdown_client works (making it workqueue safe), but Trond felt
> that it'd be better to just not call it from a workqueue context and
> print a warning if it is.
>
> IIRC we tried using different workqueues with WQ_MEM_RECLAIM (with
> no success), but I'd argue that even if that did work it would still
> be very easy to call rpc_shutdown_client from the wrong context and
> MUCH harder to detect it. It's also unclear to me if setting rpciod
> workqueue to WQ_MEM_RECLAIM would limit it to one kworker, etc...

It looks like you guys ended up in a weird place misled by wrong
analysis. Unless you require more than one concurrent execution on
the same workqueue, WQ_MEM_RECLAIM guarantees forward progress. It
won't deadlock because "a different work item is queued to the same
worker". The whole thing is designed *exactly* to avoid problems like
that. So, I'd strongly recommend looking again at why the deadlocks
are occurring.
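
For reference, a workqueue with that forward-progress guarantee is created
along these lines (illustrative sketch; the real rpciod/nfsiod setup in the
sunrpc and nfs modules may use different names and max_active values):

#include <linux/workqueue.h>
#include <linux/errno.h>

static struct workqueue_struct *my_wq;

static int my_wq_setup(void)
{
        /* WQ_MEM_RECLAIM reserves a rescuer thread, so at least one work
         * item queued here can always make forward progress even when no
         * new kworker can be created under memory pressure. */
        my_wq = alloc_workqueue("my_wq", WQ_MEM_RECLAIM, 0);
        if (!my_wq)
                return -ENOMEM;
        return 0;
}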

Thanks.

--
tejun

2013-01-04 18:15:11

by Myklebust, Trond

[permalink] [raw]
Subject: [PATCH 2/3] NFS: Ensure that we free the rpc_task after read and write cleanups are done

This patch ensures that we free the rpc_task after the cleanup callbacks
are done in order to avoid a deadlock problem that can be triggered if
the callback needs to wait for another workqueue item to complete.

Signed-off-by: Trond Myklebust <[email protected]>
Cc: Weston Andros Adamson <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Bruce Fields <[email protected]>
Cc: [email protected] [>= 3.5]
---
fs/nfs/read.c | 10 +++++++---
fs/nfs/write.c | 10 +++++++---
2 files changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/nfs/read.c b/fs/nfs/read.c
index b6bdb18..a5e5d98 100644
--- a/fs/nfs/read.c
+++ b/fs/nfs/read.c
@@ -91,12 +91,16 @@ void nfs_readdata_release(struct nfs_read_data *rdata)
        put_nfs_open_context(rdata->args.context);
        if (rdata->pages.pagevec != rdata->pages.page_array)
                kfree(rdata->pages.pagevec);
-       if (rdata != &read_header->rpc_data)
-               kfree(rdata);
-       else
+       if (rdata == &read_header->rpc_data) {
                rdata->header = NULL;
+               rdata = NULL;
+       }
        if (atomic_dec_and_test(&hdr->refcnt))
                hdr->completion_ops->completion(hdr);
+       /* Note: we only free the rpc_task after callbacks are done.
+        * See the comment in rpc_free_task() for why
+        */
+       kfree(rdata);
 }
 EXPORT_SYMBOL_GPL(nfs_readdata_release);

diff --git a/fs/nfs/write.c b/fs/nfs/write.c
index b673be3..c483cc5 100644
--- a/fs/nfs/write.c
+++ b/fs/nfs/write.c
@@ -126,12 +126,16 @@ void nfs_writedata_release(struct nfs_write_data *wdata)
        put_nfs_open_context(wdata->args.context);
        if (wdata->pages.pagevec != wdata->pages.page_array)
                kfree(wdata->pages.pagevec);
-       if (wdata != &write_header->rpc_data)
-               kfree(wdata);
-       else
+       if (wdata == &write_header->rpc_data) {
                wdata->header = NULL;
+               wdata = NULL;
+       }
        if (atomic_dec_and_test(&hdr->refcnt))
                hdr->completion_ops->completion(hdr);
+       /* Note: we only free the rpc_task after callbacks are done.
+        * See the comment in rpc_free_task() for why
+        */
+       kfree(wdata);
 }
 EXPORT_SYMBOL_GPL(nfs_writedata_release);

--
1.7.11.7


2013-01-03 16:28:40

by Adamson, Dros

[permalink] [raw]
Subject: Re: nfsd oops on Linus' current tree.

Hey, sorry for the late response, I've been on vacation.

On Dec 21, 2012, at 6:45 PM, J. Bruce Fields <[email protected]> wrote:

> On Fri, Dec 21, 2012 at 11:36:51PM +0000, Myklebust, Trond wrote:
>> Please reread what I said. There was no obvious circular dependency,
>> because nfsiod and rpciod are separate workqueues, both created with
>> WQ_MEM_RECLAIM.
>
> Oh, sorry, I read "rpciod" as "nfsiod"!
>
>> Dros' experience shows, however that a call to
>> rpc_shutdown_client in an nfsiod work item will deadlock with rpciod
>> if the RPC task's work item has been assigned to the same CPU as the
>> one running the rpc_shutdown_client work item.
>>
>> I can't tell right now if that is intentional (in which case the
>> WARN_ON in the rpc code is correct), or if it is a bug in the
>> workqueue code. For now, we're assuming the former.
>
> Well, Documentation/workqueue.txt says:
>
> "Each wq with WQ_MEM_RECLAIM set has an execution context
> reserved for it. If there is dependency among multiple work
> items used during memory reclaim, they should be queued to
> separate wq each with WQ_MEM_RECLAIM."

The deadlock we were seeing was:

- task A gets queued on rpciod workqueue and assigned kworker-0:0
- task B gets queued on rpciod workqueue and assigned the same kworker (kworker-0:0)
- task A gets run, calls rpc_shutdown_client(), which will loop forever waiting for task B to run rpc_async_release()
- task B will never run rpc_async_release() - it can't run until kworker-0:0 is free, which won't happen until task A (rpc_shutdown_client) is done

The same deadlock happened when we tried queuing the tasks on a different workqueues -- queue_work() assigns the task to a kworker thread and it's luck of the draw if it's the same kworker as task A. We tried the different workqueue options, but nothing changed this behavior.

Once a work struct is queued, there is no way to back out of the deadlock. From kernel/workqueue.c:insert_wq_barrier comment:

* Currently, a queued barrier can't be canceled. This is because
* try_to_grab_pending() can't determine whether the work to be
* grabbed is at the head of the queue and thus can't clear LINKED
* flag of the previous work while there must be a valid next work
* after a work with LINKED flag set.

So once a work struct is queued and there is an ordering dependency (i.e. task A is before task B), there is no way to back task B out - so we can't just call cancel_work() or something on task B in rpc_shutdown_client.

The root of our issue is that rpc_shutdown_client is never safe to call from a workqueue context - it loops until there are no more tasks, marking tasks as killed and waiting for them to be cleaned up in each task's own workqueue context. Any tasks that have already been assigned to the same kworker thread will never have a chance to run this cleanup stage.

When fixing this deadlock, Trond and I discussed changing how rpc_shutdown_client works (making it workqueue safe), but Trond felt that it'd be better to just not call it from a workqueue context and print a warning if it is.

IIRC we tried using different workqueues with WQ_MEM_RECLAIM (with no success), but I'd argue that even if that did work it would still be very easy to call rpc_shutdown_client from the wrong context and MUCH harder to detect it. It's also unclear to me if setting rpciod workqueue to WQ_MEM_RECLAIM would limit it to one kworker, etc...

-dros


2013-01-04 18:15:12

by Myklebust, Trond

[permalink] [raw]
Subject: [PATCH 3/3] SUNRPC: Partial revert of commit 168e4b39d1afb79a7e3ea6c3bb246b4c82c6bdb9

Partially revert commit 168e4b39d1afb79a7e3ea6c3bb246b4c82c6bdb9 (SUNRPC: add
WARN_ON_ONCE for potential deadlock). The looping behaviour has been tracked
down to a known issue with workqueues, and a workaround has now been
implemented.

Signed-off-by: Trond Myklebust <[email protected]>
Cc: Weston Andros Adamson <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Bruce Fields <[email protected]>
Cc: [email protected] [>= 3.7]
---
net/sunrpc/clnt.c | 5 -----
1 file changed, 5 deletions(-)

diff --git a/net/sunrpc/clnt.c b/net/sunrpc/clnt.c
index 822f020..1915ffe 100644
--- a/net/sunrpc/clnt.c
+++ b/net/sunrpc/clnt.c
@@ -610,11 +610,6 @@ EXPORT_SYMBOL_GPL(rpc_killall_tasks);
  */
 void rpc_shutdown_client(struct rpc_clnt *clnt)
 {
-       /*
-        * To avoid deadlock, never call rpc_shutdown_client from a
-        * workqueue context!
-        */
-       WARN_ON_ONCE(current->flags & PF_WQ_WORKER);
        might_sleep();
 
        dprintk_rcu("RPC: shutting down %s client for %s\n",
--
1.7.11.7