Return-Path: linux-nfs-owner@vger.kernel.org Received: from cantor2.suse.de ([195.135.220.15]:58489 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S965159AbaDJVZG (ORCPT ); Thu, 10 Apr 2014 17:25:06 -0400 Date: Thu, 10 Apr 2014 23:25:03 +0200 From: Jan Kara To: Trond Myklebust Cc: Jan Kara , Brown Neil , Viro Alexander , NFS Subject: Re: NFS deadlock between 'sync' and commit after unmount.... Message-ID: <20140410212503.GA12339@quack.suse.cz> References: <20140407135001.56ef9f36@notabene.brown> <20140407202750.GE23670@quack.suse.cz> <1396908136.5563.9.camel@leira.trondhjem.org> <20140407223541.GB1125@quack.suse.cz> <6664F8BA-1598-467A-824B-C59729B29E00@primarydata.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii In-Reply-To: <6664F8BA-1598-467A-824B-C59729B29E00@primarydata.com> Sender: linux-nfs-owner@vger.kernel.org List-ID: On Mon 07-04-14 19:07:35, Trond Myklebust wrote: > On Apr 7, 2014, at 18:35, Jan Kara wrote: > > > On Mon 07-04-14 18:02:16, Trond Myklebust wrote: > >> On Mon, 2014-04-07 at 22:27 +0200, Jan Kara wrote: > >>> On Mon 07-04-14 10:10:27, Trond Myklebust wrote: > >>>> On Apr 6, 2014, at 23:50, NeilBrown wrote: > >>>>> I've just hit a deadlock in NFS that seems very strange. > >>>>> The kernel is 3.14-rc8 which some local changes which shouldn't affect the > >>>>> deadlocking code. > >>>>> > >>>>> Shortly after umounting the NFS filesystem with "umount -f" (though I don't > >>>>> think the -f is important), I ran "sync". > >>>>> > >>>>> The sync is now stuck in > >>>>> > >>>>> [] sync_inodes_sb+0xa1/0x1c0 > >>>>> [] sync_inodes_one_sb+0x19/0x20 > >>>>> [] iterate_supers+0xb2/0x110 > >>>>> [] sys_sync+0x30/0x90 > >>>>> [] system_call_fastpath+0x16/0x1b > >>>>> [] 0xffffffffffffffff > >>>>> > >>>>> while kworker/u16:1 is stuck: > >>>>> > >>>>> [] call_rwsem_down_write_failed+0x13/0x20 > >>>>> [] deactivate_super+0x39/0x60 > >>>>> [] nfs_sb_deactive+0x21/0x30 > >>>>> [] __put_nfs_open_context+0xc9/0x100 > >>>>> [] put_nfs_open_context+0xb/0x10 > >>>>> [] nfs_commitdata_release+0x14/0x30 > >>>>> [] nfs_commit_release+0x1a/0x20 > >>>>> [] rpc_free_task+0x25/0x70 > >>>>> [] rpc_do_put_task+0x78/0x80 > >>>>> [] rpc_put_task+0xb/0x10 > >>>>> [] nfs_initiate_commit+0xce/0x110 > >>>>> [] nfs_commit_list+0x62/0x90 > >>>>> [] nfs_commit_inode+0xa6/0x170 > >>>>> [] nfs_write_inode+0x5d/0xa0 > >>>>> [] nfs4_write_inode+0x9/0x10 > >>>>> [] __writeback_single_inode+0x10c/0x2c0 > >>>>> [] writeback_sb_inodes+0x2ca/0x450 > >>>>> [] wb_writeback+0xec/0x320 > >>>>> [] bdi_writeback_workfn+0x115/0x4c0 > >>>>> [] process_one_work+0x16b/0x430 > >>>>> [] worker_thread+0x119/0x3a0 > >>>>> [] kthread+0xcd/0xf0 > >>>>> [] ret_from_fork+0x7c/0xb0 > >>>>> [] 0xffffffffffffffff > >>>>> > >>>>> > >>>>> So sync is holding sb->s_umount, queued some bdi work on the filesystem > >>>>> and is waiting for it to complete. Mean while, that work has (I think) > >>>>> submitted a 'commit' (via ->write_inode) and that commit wants to > >>>>> deactivate_super and so needs to get ->s_umount. > >>>>> > >>>>> I suspect this could happen even more easily with a lazy unmount. > >>>>> > >>>>> It seems that this commit request is that last thing that is keeping > >>>>> ->s_active elevated and it deadlocks trying to drop the last s_active. > >>>>> > >>>>> I have no idea how to fix it.... help? > >>>>> > >>>> > >>>> The problem seems to be the use of iterate_supers(), which grabs a > >>>> passive reference, and conflicts with our use of an active reference in > >>>> the open context. > >>> Yeah, we cannot really do otherwise in iterate_supers() - we have to grab > >>> some superblock reference and we don't really want to get an active one > >>> since that would result in spurious EBUSY returns from umount. > >>> > >>> Cannot we just punt the deactivate_super() call to a workqueue to avoid > >>> this deadlock? It's a bit ugly but it should do the trick. Or is > >>> nfs_sb_deactive() called too often and we'd see some adverse effects for > >>> that? We could also offload it to workqueue only in the special case where > >>> sb->s_active == 1. That should be really rare then but it's a bit ugly > >>> poking in VFS internals. > >> > >> The activate/deactivate super is basically there to save our bacon when > >> NFS file state extends beyond the usual VFS path walk, open() and > >> close(). Examples include sillyrename and NFSv4 delegations. Even > >> ordinary read and write state can extend beyond close() if the user > >> decides to 'kill -9' in the wrong places. > >> In most of these situations, we need to keep a dentry around until we're > >> finished, which means that we want to keep the super block alive too. > > Yeah, that makes sense. But offloading dropping of sb reference to a > > workqueue would work then, wouldn't it? > > Could we perhaps have a helper in the VFS that can optimise away the case > where s->s_active > 1? I'm not sure how you'd imagine the optimisation in VFS. But what I had in mind was something like: void nfs_deactivate_super(struct super_block *sb) { if (!atomic_add_unless(&sb->s_active, -1, 1)) { /* * Postpone deactivation to workqueue to avoid deadlocking * on s_umount semaphore - we can get here when trying to * complete sync(2) request for forcefully unmounted * filesystem. */ schedule_work(&NFS_SB(sb)->deactivate_work); } } static void nfs_deactivate_sb_work(struct work_struct *work) { struct super_block *sb = container_of(work, struct nfs_server, deactivate_work)->super; deactivate_super(sb); } in nfs_initialise_sb(): INIT_WORK(&NFS_SB(sb)->deactivate_work, nfs_deactivate_sb_work); and then use nfs_deactive_super() instead of deactivate_super(). That should do the trick and do the offloading only if we are really dropping the last reference. Honza -- Jan Kara SUSE Labs, CR