Received: by 2002:a25:31c3:0:0:0:0:0 with SMTP id x186csp2058867ybx; Sat, 2 Nov 2019 11:17:55 -0700 (PDT) X-Google-Smtp-Source: APXvYqwZSN4218aEK/BZA8Pc9LJ1GwEGVjmDJsGDikfbtq8/68ijD40uVw2+LHApGJPSE3FZ0scJ X-Received: by 2002:a17:906:4346:: with SMTP id z6mr8979943ejm.71.1572718675441; Sat, 02 Nov 2019 11:17:55 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1572718675; cv=none; d=google.com; s=arc-20160816; b=ViQMlUwOfIHrwp49Eva1FQI4y1S1Bzlt1o3M/SKbweUTsxq1wGnp8RMyJEfHStZTMV yVwlpKftTJmIAqivjIouKdcI2ZxQqWR/WwwClnDsDpDhWn8r8LNm5ElQGMSLQnqgdLfG xjUKdUVIJvn+pOPPG7hMCzIsDC8rtwZPAB7SEvsIez+7Yq3Yh7x1bRg3Y1THk9D8hw07 h9moLcx4qmeBIBV2o9CIPFoWRihyPeSBZS10z34cFWjP8VUmCA1k7dXB4T5EOEvd5hkB d48zuPuiSXVMQNzmGPIdhEFZYUv3Ev1LRNLO9wDcukvSgLaGHA8Ej3V1CjTKA25+QypC qM0Q== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date; bh=wk041R1rNQimOcSIkVkYmmVpl785qSLo1PPe1X/HzmA=; b=mLgPJ6pfCFBCyaa38rh2PvkKHLMLLkirjxoxJLFEpGaRuJM73BwOq8mrm9U1PJqC1r 2roqN53/nXqHInn5FMV1bfbbwZHGdUxEvvh9TxWQNrpee61wkW5W4aee/Mpl9hid6fj9 mRDJlVIL0cZLaj9/eExU+PKXLo1Noyw3OELNvynWa2GCXfMc4eOjmBR7917kQx0mPSH8 IhHIOLm8ChE32SaZeG5PPsl55Lh6TDuY1Xe8YfeSaMsdZar8agYC7fNKobCKHLwyNfDB byujCmO+avKo1BFdyIB0YZzyVg8sQiiEQ7Cs0ivcpXdbJzoMYdqYHLqF/YNIF+zbU+Un r6xA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id j9si8132197ejt.25.2019.11.02.11.16.56; Sat, 02 Nov 2019 11:17:55 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1726841AbfKBSIr (ORCPT + 99 others); Sat, 2 Nov 2019 14:08:47 -0400 Received: from zeniv.linux.org.uk ([195.92.253.2]:53744 "EHLO ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726523AbfKBSIr (ORCPT ); Sat, 2 Nov 2019 14:08:47 -0400 Received: from viro by ZenIV.linux.org.uk with local (Exim 4.92.3 #3 (Red Hat Linux)) id 1iQxpK-00056q-97; Sat, 02 Nov 2019 18:08:42 +0000 Date: Sat, 2 Nov 2019 18:08:42 +0000 From: Al Viro To: "Paul E. McKenney" Cc: Ritesh Harjani , linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, wugyuan@cn.ibm.com, jlayton@kernel.org, hsiangkao@aol.com, Jan Kara , Linus Torvalds Subject: Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast Message-ID: <20191102180842.GN26530@ZenIV.linux.org.uk> References: <20190927044243.18856-1-riteshh@linux.ibm.com> <20191015040730.6A84742047@d06av24.portsmouth.uk.ibm.com> <20191022133855.B1B4752050@d06av21.portsmouth.uk.ibm.com> <20191022143736.GX26530@ZenIV.linux.org.uk> <20191022201131.GZ26530@ZenIV.linux.org.uk> <20191023110551.D04AE4C044@d06av22.portsmouth.uk.ibm.com> <20191101234622.GM26530@ZenIV.linux.org.uk> <20191102172229.GT20975@paulmck-ThinkPad-P72> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20191102172229.GT20975@paulmck-ThinkPad-P72> User-Agent: Mutt/1.12.1 (2019-06-15) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sat, Nov 02, 2019 at 10:22:29AM -0700, Paul E. McKenney wrote: > Ignoring the possibility of the more exotic compiler optimizations, if > the first task's load into f sees the value stored by the second task, > then the pair of memory barriers guarantee that the first task's load > into d will see the second task's store. The question was about the load into i being also safe. > In fact, you could instead say this in recent kernels: > > f = READ_ONCE(fdt[n]) // provides dependency ordering via mb on Alpha > mb Er... that mb comes from expanded READ_ONCE(), actually - the call chain is fdget_pos() -> __fdget_pos() -> __fdget() -> __fget_light() -> __fcheck_files(), either directly or via __fget() -> fcheck_files() -> __fcheck_files() rcu_dereference_raw() -> READ_ONCE() -> smp_read_barrier_depends() which yields mb on alpha. > d = f->f_path.dentry > i = d->d_inode // But this is OK only if ->f_path.entry is > // constant throughout Yes, it is - once you hold a reference to a positive dentry, it can't be made negative by anybody else. See d_delete() for details; basically, if you have refcount > 1, dentry will be unhashed, but not made negative. > The result of the first task's load into i requires information outside > of the two code fragments. > > Or am I missing your point? My point is that barriers sufficient to guarantee visibility of *f in the reader will also suffice to guarantee visibility of *f->f_path.dentry. On alpha it boils down to having load of d->d_inode when opening the file orders before the barrier prior to storing the reference to f in the descriptor table, so if it observes the store to d->d_inode done by the same CPU, that store is ordered before the barrier due to processor instruction order constraints and if it observes the store to d->d_inode done by some other CPU, that store is ordered before the load and before the barrier by transitivity. So in either case that store is ordered before the store into descriptor table. IOW, the reader that has enough barriers to guarantee seing ->f_path.dentry will be guaranteed to see ->f_path.dentry->d_inode. And yes, we will need some barriers added near some positivity checks in pathname resolution - that's what has started the entire thread. This part ("any place looking at file->f_path.dentry will have ->d_inode and mode bits of ->d_flags visible and stable") covers quite a few places that come up in the analysis... This morning catch, BTW: audit_get_nd(): don't unlock parent too early if the child has been negative and just went positive under us, we want coherent d_is_positive() and ->d_inode. Don't unlock the parent until we'd done that work... Signed-off-by: Al Viro diff --git a/kernel/audit_watch.c b/kernel/audit_watch.c index 1f31c2f1e6fc..4508d5e0cf69 100644 --- a/kernel/audit_watch.c +++ b/kernel/audit_watch.c @@ -351,12 +351,12 @@ static int audit_get_nd(struct audit_watch *watch, struct path *parent) struct dentry *d = kern_path_locked(watch->path, parent); if (IS_ERR(d)) return PTR_ERR(d); - inode_unlock(d_backing_inode(parent->dentry)); if (d_is_positive(d)) { /* update watch filter fields */ watch->dev = d->d_sb->s_dev; watch->ino = d_backing_inode(d)->i_ino; } + inode_unlock(d_backing_inode(parent->dentry)); dput(d); return 0; } For other fun bits and pieces see ceph bugs caught this week and crap in dget_parent() (not posted yet). The former had been ceph violating the "turning a previously observable dentry positive requires exclusive lock on parent" rule, the latter - genuine insufficient barriers in the fast path of dget_parent(). It is converging to a reasonably small and understandable surface, actually, most of that being in core pathname resolution. Two big piles of nightmares left to review - overlayfs and (somewhat surprisingly) setxattr call chains, the latter due to IMA/EVM/LSM insanity... There's also some secondary stuff dropping out of that (e.g. ceph seeding dcache on readdir and blindly unhashing dentries it sees stale instead of doing d_invalidate() as it ought to - leads to fun results if you had something mounted on a subdirectory that got removed/recreated on server), but that's a separate pile of joy - doesn't affect this analysis, so it'll have to be dealt with later. It had been an interesting couple of weeks...