Received: by 2002:a05:6a10:1a4d:0:0:0:0 with SMTP id nk13csp2498104pxb; Wed, 9 Feb 2022 21:44:46 -0800 (PST) X-Google-Smtp-Source: ABdhPJzHZYSnNJrob/BBZfNhFLuYnhas/ke+XblpykvyXZPMS6QUPMLqbYOMEe3spccqWIGNxPxw X-Received: by 2002:a05:6402:2993:: with SMTP id eq19mr6520819edb.233.1644471886602; Wed, 09 Feb 2022 21:44:46 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1644471886; cv=none; d=google.com; s=arc-20160816; b=Wk2aPhAhZqSuxSBSl4v0cuka30TqySEIW+bDCgBSk9ZiSLTKmCdt2D9g8gUBA9yqCk NFS/uoeyjztieUtDp3CRGiW99PxPE7duhH0sstfUrDazLQEHCYzRdS5XoVIp27gb2Ac4 pEdihJyzAT1jWtU/qM4SMQh8rPdE5kuN+W9f20xCsnhiyX8F732LsonAv9aNqjILb/Us wJwwEA92t35kcFkE2C969Zy16qIDJLEJtjOjEcHDS5WHbNiJVblco+mpcxJ/Uzv23wz/ H/kxsq0ijwYmyk/RU8oD67pMIokXgDTGS3X3q8ozMzB5MAWXmLnVVspb+eRaajhqEpkB yCjQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:in-reply-to:content-disposition :mime-version:references:message-id:subject:cc:to:from:date; bh=Q26VGWE6TXeXm4K35ekEh5qSHRDXgbvWUZEKheICCec=; b=MndYPmJf/AehW/k0Ifgu+NNxDGpd27MHzCMvQSTULzTfStSyH1iue249QrurV2TZKY rRmQmwHgWg/EhS+LJ0zY6X0aCE1+ZWmsQmKPEcQmZvkn5LZVFFDKnp8QL0laxfX/XhbY XUQse8stufyzyoxsqoT/D/MNm6nwlestFqTATUbZpOKmi9Dvf2N9k+npQLLCY+bzbdgU i2CnyQxaeQFMo99msBj23Uj+CofmiRZ+CP8ezGFELickuciHh/XH4bqwBrhCc9AsKES8 ecQSGtAwhLFMaU1OfVNuvyjU3cENTTmr9QlBUBPaWaRhRvUJfZlaHB+WycOPboA3bQjY 3YdA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id x13si12486125edj.626.2022.02.09.21.44.22; Wed, 09 Feb 2022 21:44:46 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233902AbiBJFd1 (ORCPT + 99 others); Thu, 10 Feb 2022 00:33:27 -0500 Received: from mxb-00190b01.gslb.pphosted.com ([23.128.96.19]:33316 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233826AbiBJFdZ (ORCPT ); Thu, 10 Feb 2022 00:33:25 -0500 Received: from zeniv-ca.linux.org.uk (zeniv-ca.linux.org.uk [IPv6:2607:5300:60:148a::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 9B02410C0; Wed, 9 Feb 2022 21:33:25 -0800 (PST) Received: from viro by zeniv-ca.linux.org.uk with local (Exim 4.94.2 #2 (Red Hat Linux)) id 1nI255-000qlO-OE; Thu, 10 Feb 2022 05:33:23 +0000 Date: Thu, 10 Feb 2022 05:33:23 +0000 From: Al Viro To: Stephen Brennan Cc: linux-kernel@vger.kernel.org, Luis Chamberlain , Andrew Morton , Jan Kara , linux-fsdevel@vger.kernel.org, Arnd Bergmann , Amir Goldstein Subject: Re: [PATCH v2 1/4] dcache: sweep cached negative dentries to the end of list of siblings Message-ID: References: <20220209231406.187668-1-stephen.s.brennan@oracle.com> <20220209231406.187668-2-stephen.s.brennan@oracle.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20220209231406.187668-2-stephen.s.brennan@oracle.com> Sender: Al Viro X-Spam-Status: No, score=-4.2 required=5.0 tests=BAYES_00,RCVD_IN_DNSWL_MED, SPF_HELO_NONE,SPF_NONE,T_SCC_BODY_TEXT_LINE autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Feb 09, 2022 at 03:14:03PM -0800, Stephen Brennan wrote: > +static void sweep_negative(struct dentry *dentry) > +{ > + struct dentry *parent; > + > + rcu_read_lock(); > + parent = lock_parent(dentry); > + if (!parent) { > + rcu_read_unlock(); > + return; > + } > + > + /* > + * If we did not hold a reference to dentry (as in the case of dput), > + * and dentry->d_lock was dropped in lock_parent(), then we could now be > + * holding onto a dead dentry. Be careful to check d_count and unlock > + * before dropping RCU lock, otherwise we could corrupt freed memory. > + */ > + if (!d_count(dentry) && d_is_negative(dentry) && > + !d_is_tail_negative(dentry)) { > + dentry->d_flags |= DCACHE_TAIL_NEGATIVE; > + list_move_tail(&dentry->d_child, &parent->d_subdirs); > + } > + > + spin_unlock(&parent->d_lock); > + spin_unlock(&dentry->d_lock); > + rcu_read_unlock(); > +} I'm not sure if it came up the last time you'd posted this series (and I apologize if it had and I forgot the explanation), but... consider the comment in dentry_unlist(). What's to prevent the race described there making d_walk() skip a part of tree, by replacing the "lseek moving cursor in just the wrong moment" with "dput moving the negative dentry right next to the one being killed to the tail of the list"? The race in question: d_walk() is leaving a subdirectory. We are here: rcu_read_lock(); ascend: if (this_parent != parent) { It isn't - we are not back to the root of tree being walked. At this point this_parent is the directory we'd just finished looking into. struct dentry *child = this_parent; this_parent = child->d_parent; ... and now child points to it, and this_parent points to its parent. spin_unlock(&child->d_lock); No locks held. Another CPU gets through successful rmdir(). child gets unhashed and dropped. It's off the ->d_subdirs of this_parent; its ->d_child.next is still pointing where it used to, and whatever it points to won't be physically freed until rcu_read_unlock(). Moreover, in the meanwhile this next sibling (negative, pinned) got dput(). And had been moved to the tail of the this_parent->d_subdirs. Since its ->d_child.prev does *NOT* point to child (which is off-list, about to be freed shortly, etc.), child->d_dchild.next is not modified - it still points to that (now moved) sibling. spin_lock(&this_parent->d_lock); Got it. /* might go back up the wrong parent if we have had a rename. */ if (need_seqretry(&rename_lock, seq)) goto rename_retry; Nope, hadn't happened. /* go into the first sibling still alive */ do { next = child->d_child.next; ... and this is the moved sibling, now in the end of the ->d_subdirs. if (next == &this_parent->d_subdirs) goto ascend; No, it is not - it's the last element of the list, not its anchor. child = list_entry(next, struct dentry, d_child); Our moved negative dentry. } while (unlikely(child->d_flags & DCACHE_DENTRY_KILLED)); Not killed, that one. rcu_read_unlock(); goto resume; ... and since that sucker has no children, we proceed to look at it, ascend and now we are at the end of this_parent->d_subdirs. And we ascend out of it, having entirely skipped all branches that used to be between the rmdir victim and the end of the parent's ->d_subdirs. What am I missing here? Unlike the trick we used with cursors (see dentry_unlist()) we can't predict who won't get moved in this case... Note that treating "child is has DCACHE_DENTRY_KILLED" same as we do for rename_lock mismatches would not work unless you grab the spinlock component of rename_lock every time dentry becomes positive. Which is obviously not feasible (it's a system-wide lock and cacheline pingpong alone would hurt us very badly, not to mention the contention issues due to the frequency of grabbing it going up by several orders of magnitude).