From: Kinglong Mee <kinglongmee@gmail.com>
Date: Wed, 29 Apr 2015 16:45:16 +0800
To: NeilBrown
CC: "J. Bruce Fields", linux-nfs@vger.kernel.org
Subject: Re: [PATCH RFC] NFSD: fix cannot umounting mount points under pseudo root

On 4/29/2015 10:57 AM, NeilBrown wrote:
> On Mon, 27 Apr 2015 20:11:48 +0800 Kinglong Mee wrote:
>
>> On 4/24/2015 11:00 AM, NeilBrown wrote:
>>> On Thu, 23 Apr 2015 20:52:40 +0800 Kinglong Mee wrote:
>>>
>>>> On 4/23/2015 7:44 AM, NeilBrown wrote:
>>>>> On Wed, 22 Apr 2015 11:07:03 -0400 "J. Bruce Fields" wrote:
>>>>>>> A reference on the dentry/mnt is like a cache: it avoids
>>>>>>> re-finding them, so it is necessary to store them in svc_export.
>>>>>>>
>>>>>>> Neil points out another way, 'fs_pin'; I will check that.
>>>>>>
>>>>>> Yes, that'd be interesting.  On a very quick look--I don't understand
>>>>>> what it's meant to do at all.  But if it does provide a way to get a
>>>>>> callback on umount, that'd certainly be interesting....
>>>>>
>>>>> Yeah, on a quick look it isn't really obvious at all.
>>>>>
>>>>> But I didn't read the code.  I read
>>>>>
>>>>>    https://lwn.net/Articles/636730/
>>>>>
>>>>> which says:
>>>>>
>>>>>    In its place is a mechanism by which an object can be added to a
>>>>>    vfsmount structure (which represents a mounted filesystem); that
>>>>>    object supports only one method: kill().  These "pin" objects hang
>>>>>    around until the final reference to the vfsmount goes away, at
>>>>>    which point each one's kill() function is called.  It thus is a
>>>>>    clean mechanism for performing cleanup when a filesystem goes away.
>>>>>
>>>>> This is used to close "BSD process accounting" files on umount, and
>>>>> sounds like the perfect interface for flushing things from the sunrpc
>>>>> cache on umount.
>>>>
>>>> I have reviewed that code in fs/fs_pin.c and kernel/acct.c.
>>>> 'fs_pin' is used to fix the race between acct's file->f_path.mnt and
>>>> umount: file->f_path.mnt only holds a reference on the vfsmount, it
>>>> does not keep the mount busy in any useful way, yet umount sees the
>>>> extra reference and returns EBUSY.
>>>>
>>>> The logic is the same as the sunrpc cache for exports.
>>>>
>>>> Before commit 3064c3563b ("death to mnt_pinned"), acct used a hack:
>>>> it got a pin on the vfsmount and then put its reference, so umount
>>>> found no other references, called back the pins on the vfsmount, and
>>>> finally succeeded.
>>>>
>>>> After that commit, besides pinning the original vfsmount and putting
>>>> its reference, acct clones the vfsmount and stores the clone in
>>>> file->f_path.mnt.
>>>>
>>>> The difference between the two methods is the value of
>>>> file->f_path.mnt: with the first, it contains the original vfsmount
>>>> without holding a reference; with the second, it contains a cloned
>>>> vfsmount with a reference.
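>>>>
>>>> To make the mechanism concrete, here is a sketch of how exports could
>>>> take a pin in the style of the first method.  The fs_pin calls are the
>>>> real 4.0-era interface from include/linux/fs_pin.h, but the names
>>>> export_pin, export_pin_kill, export_pin_add and the cache-flush step
>>>> are made up for illustration, not from any real patch:
>>>>
>>>>    #include <linux/fs_pin.h>
>>>>    #include <linux/mount.h>
>>>>    #include <linux/kernel.h>	/* container_of() */
>>>>
>>>>    struct svc_export;		/* from fs/nfsd/export.h */
>>>>
>>>>    struct export_pin {
>>>>            struct fs_pin pin;	/* kill() runs at the final mntput() */
>>>>            struct svc_export *exp;
>>>>    };
>>>>
>>>>    static void export_pin_kill(struct fs_pin *p)
>>>>    {
>>>>            struct export_pin *epin =
>>>>                    container_of(p, struct export_pin, pin);
>>>>
>>>>            /* ... flush epin->exp from the sunrpc cache here ... */
>>>>            pin_remove(p);	/* drop off the vfsmount's pin list */
>>>>    }
>>>>
>>>>    static void export_pin_add(struct export_pin *epin,
>>>>                               struct vfsmount *mnt)
>>>>    {
>>>>            init_fs_pin(&epin->pin, export_pin_kill);
>>>>            pin_insert(&epin->pin, mnt);
>>>>    }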
>>>>
>>>> I have tested the first method, taking pins on the vfsmount for all
>>>> entries in the exports cache.  nfsd works, but sometimes delays or
>>>> hangs at rpc.mountd's call of name_to_handle_at, and I can't find
>>>> the reason.
>>>
>>> A kernel stack trace of exactly where it is hanging would help a lot.
>>
>> With fs_pin added to exports, it seems there is a mutex deadlock
>> between nfsd and rpc.mountd over locking the parent inode.
>>
>> Apr 27 19:57:03 ntest kernel: rpc.mountd D ffff88006ac5fc28 0 1152 1 0x00000000
>> Apr 27 19:57:03 ntest kernel: ffff88006ac5fc28 ffff880068c38970 ffff880068c3de60 ffff88006ac5fc08
>> Apr 27 19:57:03 ntest kernel: ffff88006ac60000 ffff880035ecf694 ffff880068c3de60 00000000ffffffff
>> Apr 27 19:57:03 ntest kernel: ffff880035ecf698 ffff88006ac5fc48 ffffffff8177ffd7 0000000000000000
>> Apr 27 19:57:03 ntest kernel: Call Trace:
>> Apr 27 19:57:03 ntest kernel: [] schedule+0x37/0x90
>> Apr 27 19:57:03 ntest kernel: [] schedule_preempt_disabled+0xe/0x10
>> Apr 27 19:57:03 ntest kernel: [] __mutex_lock_slowpath+0xb2/0x120
>> Apr 27 19:57:03 ntest kernel: [] mutex_lock+0x23/0x40
>> Apr 27 19:57:03 ntest kernel: [] lookup_slow+0x34/0xc0
>> Apr 27 19:57:03 ntest kernel: [] path_lookupat+0x89e/0xc60
>> Apr 27 19:57:03 ntest kernel: [] ? kmem_cache_alloc+0x1e2/0x260
>> Apr 27 19:57:03 ntest kernel: [] ? getname_flags+0x56/0x200
>> Apr 27 19:57:03 ntest kernel: [] filename_lookup+0x27/0xc0
>> Apr 27 19:57:03 ntest kernel: [] user_path_at_empty+0x63/0xd0
>> Apr 27 19:57:03 ntest kernel: [] ? put_object+0x32/0x60
>> Apr 27 19:57:03 ntest kernel: [] ? delete_object_full+0x2d/0x40
>> Apr 27 19:57:03 ntest kernel: [] ? dput+0x33/0x230
>> Apr 27 19:57:03 ntest kernel: [] user_path_at+0x11/0x20
>> Apr 27 19:57:03 ntest kernel: [] SyS_name_to_handle_at+0x59/0x200
>> Apr 27 19:57:03 ntest kernel: [] system_call_fastpath+0x12/0x71
>>
>> Apr 27 19:57:03 ntest kernel: nfsd S ffff88006e92b708 0 1170 2 0x00000000
>> Apr 27 19:57:03 ntest kernel: ffff88006e92b708 ffffffff81c12480 ffff88006acbcb80 ffff88006e92b758
>> Apr 27 19:57:03 ntest kernel: ffff88006e92c000 ffffffff82290040 ffff88006e92b758 ffffffff82290040
>> Apr 27 19:57:03 ntest kernel: 00000000fffffff5 ffff88006e92b728 ffffffff8177ffd7 0000000100043a09
>> Apr 27 19:57:03 ntest kernel: Call Trace:
>> Apr 27 19:57:03 ntest kernel: [] schedule+0x37/0x90
>> Apr 27 19:57:03 ntest kernel: [] schedule_timeout+0x117/0x230
>> Apr 27 19:57:03 ntest kernel: [] ? internal_add_timer+0xb0/0xb0
>> Apr 27 19:57:03 ntest kernel: [] wait_for_completion_interruptible_timeout+0xf3/0x150
>> Apr 27 19:57:03 ntest kernel: [] ? wake_up_state+0x20/0x20
>> Apr 27 19:57:03 ntest kernel: [] cache_wait_req.isra.10+0x9c/0x110 [sunrpc]
>> Apr 27 19:57:03 ntest kernel: [] ? sunrpc_init_cache_detail+0xc0/0xc0 [sunrpc]
>> Apr 27 19:57:03 ntest kernel: [] cache_check+0x1d4/0x380 [sunrpc]
>> Apr 27 19:57:03 ntest kernel: [] ? inode_doinit_with_dentry+0x48c/0x6a0
>> Apr 27 19:57:03 ntest kernel: [] exp_get_by_name+0x82/0xb0 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [] ? kmemleak_free+0x3a/0xa0
>> Apr 27 19:57:03 ntest kernel: [] ? inode_doinit_with_dentry+0x210/0x6a0
>> Apr 27 19:57:03 ntest kernel: [] ? selinux_d_instantiate+0x1c/0x20
>> Apr 27 19:57:03 ntest kernel: [] ? _d_rehash+0x37/0x40
>> Apr 27 19:57:03 ntest kernel: [] ? d_splice_alias+0xa6/0x2d0
>> Apr 27 19:57:03 ntest kernel: [] ? ext4_lookup+0xdb/0x160
>> Apr 27 19:57:03 ntest kernel: [] rqst_exp_get_by_name+0x64/0x140 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [] nfsd_cross_mnt+0x76/0x1b0 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [] nfsd4_encode_dirent+0x160/0x3d0 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [] ? nfsd4_encode_getattr+0x40/0x40 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [] nfsd_readdir+0x1c1/0x2a0 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [] ? nfsd_direct_splice_actor+0x20/0x20 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [] nfsd4_encode_readdir+0x120/0x220 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [] nfsd4_encode_operation+0x7d/0x190 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [] nfsd4_proc_compound+0x24d/0x6f0 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [] nfsd_dispatch+0xc3/0x220 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [] svc_process_common+0x43b/0x690 [sunrpc]
>> Apr 27 19:57:03 ntest kernel: [] svc_process+0x103/0x1b0 [sunrpc]
>> Apr 27 19:57:03 ntest kernel: [] nfsd+0xff/0x170 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [] ? nfsd_destroy+0x80/0x80 [nfsd]
>> Apr 27 19:57:03 ntest kernel: [] kthread+0xd8/0xf0
>> Apr 27 19:57:03 ntest kernel: [] ? kthread_worker_fn+0x180/0x180
>> Apr 27 19:57:03 ntest kernel: [] ret_from_fork+0x42/0x70
>> Apr 27 19:57:03 ntest kernel: [] ? kthread_worker_fn+0x180/0x180
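>>
>> My reading of the interplay between the two tasks, reconstructed by
>> hand from the two traces above:
>>
>>    nfsd                                  rpc.mountd
>>    ----                                  ----------
>>    readdir path takes $DIR's i_mutex
>>    nfsd4_encode_dirent
>>      nfsd_cross_mnt
>>        exp_get_by_name
>>          cache_check: upcall to
>>          mountd, wait for reply          name_to_handle_at($DIR/name)
>>                                            child not in dcache
>>                                            lookup_slow
>>                                              mutex_lock($DIR's i_mutex)
>>                                              blocks: held by nfsd
>>    (the reply never comes: deadlock)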
>
> Hmm... That looks bad.  I wonder why we haven't hit the deadlock before.

It appears after nfs-utils commit 6091c0a4c4 ("mountd: add support for
case-insensitive file names") started using name_to_handle_at, which
locks the parent directory during the lookup.

> nfsd gets a LOOKUP request for $DIR/name which is a mountpoint, or a
> READDIR request for $DIR which contains 'name' which is a mountpoint.
>
> nfsd_lookup_dentry or nfsd_buffered_readdir locks i_mutex on $DIR, then
> does the mountpoint check.  If the mounted filesystem is not exported,
> but the $DIR filesystem has 'crossmnt', an upcall to mountd asks for
> export information.  When mountd tries to do a lookup on the name, it
> works fine if lookup_fast finds the child, but it deadlocks on the
> parent's i_mutex if it has to go to lookup_slow.

I have sent a patch for it, "[PATCH] NFSD: Avoid race of locking
parent's mutex at cross mount".

> It should only go to lookup_slow() if:
>  - the child isn't in cache .. but as it is mounted on, it must be

Yes, it is.  If, after some operations on those files, they are cached,
the race does not appear.

>  - d_revalidate returns 0
>
> This probably means that $DIR is on an NFS filesystem and it failed the
> revalidation for some reason....

It is caused by readdir of the pseudo root; there are many unexported
files in it.

> Kinglong: can you reproduce this easily?  If so can you enable nfs
> debugging and collect the output?
> I want to confirm that the dfprintk in nfs_lookup_revalidate is saying
> that the name is invalid.

Yes, I will send the reproduction method in another thread and CC you.

thanks,
Kinglong Mee
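
P.S. For anyone trying to reproduce this: the mountd side is just the
name_to_handle_at(2) syscall.  A minimal userspace caller (nothing
mountd-specific, only an illustration of the call that triggers the
lookup) looks like:

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
            struct file_handle *fh;
            int mount_id;

            if (argc != 2) {
                    fprintf(stderr, "usage: %s <path>\n", argv[0]);
                    return 1;
            }

            fh = malloc(sizeof(*fh) + MAX_HANDLE_SZ);
            if (!fh)
                    return 1;
            fh->handle_bytes = MAX_HANDLE_SZ;

            /*
             * Resolving the path may take the parent directory's i_mutex
             * (lookup_slow) when the last component is not in the dcache;
             * that is where rpc.mountd blocked in the trace above.
             */
            if (name_to_handle_at(AT_FDCWD, argv[1], fh, &mount_id, 0) < 0) {
                    perror("name_to_handle_at");
                    return 1;
            }

            printf("handle_bytes=%u handle_type=%d mount_id=%d\n",
                   fh->handle_bytes, fh->handle_type, mount_id);
            free(fh);
            return 0;
    }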