Date: Sat, 29 Jun 2019 05:08:44 +0100
From: Al Viro
To: "Tobin C. Harding"
Cc: "Tobin C. Harding",
Harding" , Andrew Morton , Roman Gushchin , Alexander Viro , Christoph Hellwig , Pekka Enberg , David Rientjes , Joonsoo Kim , Christopher Lameter , Matthew Wilcox , Miklos Szeredi , Andreas Dilger , Waiman Long , Tycho Andersen , Theodore Ts'o , Andi Kleen , David Chinner , Nick Piggin , Rik van Riel , Hugh Dickins , Jonathan Corbet , linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, Linus Torvalds Subject: Re: [RFC PATCH v3 14/15] dcache: Implement partial shrink via Slab Movable Objects Message-ID: <20190629040844.GS17978@ZenIV.linux.org.uk> References: <20190411013441.5415-1-tobin@kernel.org> <20190411013441.5415-15-tobin@kernel.org> <20190411023322.GD2217@ZenIV.linux.org.uk> <20190411024821.GB6941@eros.localdomain> <20190411044746.GE2217@ZenIV.linux.org.uk> <20190411210200.GH2217@ZenIV.linux.org.uk> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20190411210200.GH2217@ZenIV.linux.org.uk> User-Agent: Mutt/1.11.3 (2019-02-01) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Apr 11, 2019 at 10:02:00PM +0100, Al Viro wrote: > Aaaarrgghhh... No, we can't. Look: we get one candidate dentry in isolate > phase. We put it into shrink list. umount(2) comes and calls > shrink_dcache_for_umount(), which calls shrink_dcache_parent(root). > In the meanwhile, shrink_dentry_list() is run and does __dentry_kill() on > that one dentry. Fine, it's gone - before shrink_dcache_parent() even > sees it. Now shrink_dentry_list() holds a reference to its parent and > is about to drop it in > dentry = parent; > while (dentry && !lockref_put_or_lock(&dentry->d_lockref)) > dentry = dentry_kill(dentry); > And dropped it will be, but... shrink_dcache_parent() has finished the > scan, without finding *anything* with zero refcount - the thing that used > to be on the shrink list was already gone before shrink_dcache_parent() > has gotten there and the reference to parent was not dropped yet. So > shrink_dcache_for_umount() plows past shrink_dcache_parent(), walks the > tree and complains loudly about "busy" dentries (that parent we hadn't > finished dropping), and then we proceed with filesystem shutdown. > In the meanwhile, dentry_kill() finally gets to killing dentry and > triggers an unexpected late call of ->d_iput() on a filesystem that > has already been far enough into shutdown - far enough to destroy the > data structures needed for that sucker. > > The reason we don't hit that problem with regular memory shrinker is > this: > unregister_shrinker(&s->s_shrink); > fs->kill_sb(s); > in deactivate_locked_super(). IOW, shrinker for this fs is gone > before we get around to shutdown. And so are all normal sources > of dentry eviction for that fs. > > Your earlier variants all suffer the same problem - picking a page > shared by dentries from several superblocks can run into trouble > if it overlaps with umount of one of those. FWIW, I think I see a kinda-sorta sane solution. 
FWIW, I think I see a kinda-sorta sane solution.  Namely, add

static void __dput_to_list(struct dentry *dentry, struct list_head *list)
{
        if (dentry->d_flags & DCACHE_SHRINK_LIST) {
                /* let the owner of the list it's on deal with it */
                --dentry->d_lockref.count;
        } else {
                if (dentry->d_flags & DCACHE_LRU_LIST)
                        d_lru_del(dentry);
                if (!--dentry->d_lockref.count)
                        d_shrink_add(dentry, list);
        }
}

and have shrink_dentry_list() do this at the end of its loop:

                d_shrink_del(dentry);
                parent = dentry->d_parent;
                /* both dentry and parent are locked at that point */
                if (parent != dentry) {
                        /*
                         * We need to prune ancestors too. This is necessary to
                         * prevent quadratic behavior of shrink_dcache_parent(),
                         * but is also expected to be beneficial in reducing
                         * dentry cache fragmentation.
                         */
                        __dput_to_list(parent, list);
                }
                __dentry_kill(dentry);
        }

instead of the

                d_shrink_del(dentry);
                parent = dentry->d_parent;
                __dentry_kill(dentry);
                if (parent == dentry)
                        continue;
                /*
                 * We need to prune ancestors too. This is necessary to prevent
                 * quadratic behavior of shrink_dcache_parent(), but is also
                 * expected to be beneficial in reducing dentry cache
                 * fragmentation.
                 */
                dentry = parent;
                while (dentry && !lockref_put_or_lock(&dentry->d_lockref))
                        dentry = dentry_kill(dentry);
        }

we have there now.

Linus, do you see any problems with that change?  AFAICS, it should avoid the
problem described above.  Moreover, it seems to allow a fun API addition:

void dput_to_list(struct dentry *dentry, struct list_head *list)
{
        rcu_read_lock();
        if (likely(fast_dput(dentry))) {
                rcu_read_unlock();
                return;
        }
        rcu_read_unlock();
        if (!retain_dentry(dentry))
                __dput_to_list(dentry, list);
        spin_unlock(&dentry->d_lock);
}

allowing a caller to take an empty list, do a bunch of dput_to_list() calls (under
spinlocks, etc.) and then, once it is in better locking conditions, have
shrink_dentry_list() take them all out.  I can see applications for that in e.g.
fs/namespace.c - quite a few of the kludges around ->mnt_ex_mountpoint would become
killable that way, and there would be a chance to transfer the contribution to the
mountpoint's ->d_count from struct mount to struct mountpoint (i.e. make any number
of mounts on the same mountpoint dentry contribute only 1 to its ->d_count, rather
than one per mount).
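To make the intended calling pattern concrete, a user of that API could look roughly
like the sketch below (purely illustrative - "struct foo", evict_foo_dentries() and
the lock passed in are made up; only dput_to_list() and shrink_dentry_list() are the
interfaces discussed above):

struct foo {
        struct list_head        entry;
        struct dentry           *dentry;
};

static void evict_foo_dentries(struct list_head *foos, spinlock_t *lock)
{
        struct foo *f;
        LIST_HEAD(dispose);

        spin_lock(lock);
        list_for_each_entry(f, foos, entry) {
                if (!f->dentry)
                        continue;
                /*
                 * The final eviction can block (->d_iput(), iput(), ...),
                 * so just park the dentry on a private list for now.
                 */
                dput_to_list(f->dentry, &dispose);
                f->dentry = NULL;
        }
        spin_unlock(lock);

        /* out from under the spinlock: kill everything we parked */
        shrink_dentry_list(&dispose);
}

The fs/namespace.c users mentioned above would presumably follow the same pattern:
collect the dentries to be dropped while the mount locks are held, then dispose of
the whole list once those locks have been released.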