Received: by 2002:a25:d7c1:0:0:0:0:0 with SMTP id o184csp5735647ybg; Tue, 22 Oct 2019 07:39:01 -0700 (PDT) X-Google-Smtp-Source: APXvYqy1bG4GlhgXTfzOtt4dDBj2Ol9Ep1h7HOArdW6F5YiuqddN6DKVjqdPJ5aCjqqoEzIqc2fw X-Received: by 2002:a05:6402:1359:: with SMTP id y25mr31457818edw.183.1571755141398; Tue, 22 Oct 2019 07:39:01 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1571755141; cv=none; d=google.com; s=arc-20160816; b=Zf6K3TGDgeFrQ4cHacFzwPVpt5VG/jSLWDYiZhEvEsn/j4tbKSpBDDNbp4GBWfqiNt LJsVZRrQhYnUuzo3+Yq/Ay8dkjUcsKnDQMch7pguZFMZuKvWZIa45OAPvMaM+zm7HqGO wE5En0QWhcea7tcRlbx/+CCALbUCyuZ9gbPaBF/3o7sFVrO5DirlwEe5pSIyHH48EWAL K+l1Jrf4kBMtyM4qbIRUKET2n4IqoWO02EKyAUN+ltTEpwQH+iw3CnDsuHiu0mWkn+HE GQgViXe39qDFcti1k2p/awTB/qLhmRRHcL+esDcoz5rl6iIXUh/ZCcNGCM52eAQLniuC 3EOw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date; bh=dkKTJcHhmedcYAEZF7s9DGeOHOG/dXr82jPDp7JmoAQ=; b=iZMj8JG6RUQ9Ct7Bc8HPdh7EmRmqmybO/12R44I3aO/1tc5rIIF+aw2yuzHa25i7yN Y3zqdY3gsX8+eHVfEQUB3zX72P4DZHGVdCjr+dxTVQ6tM6T8+re1n75p5rAambf/IZfB p4Zldmv/Jgm0xClMi1ziM4C/g5jhKrZLIifsWzxLvoIuoGpJReNiXEkPVjabJZtNABv2 keUxsGvRwIxrhmUtWKtAtAr/CR3VzJALPaXPl4bVEbL4JSOXLcpJ8pQHYkglvDn1GfLs 67g1/gzpxRFAfIDh5iwD6yD2R1U/wf5f4jgCjUq3ArRHvvd7kZ6D0mGNk2YgtoIgZBjE aOAA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id p14si11403536edc.227.2019.10.22.07.38.37; Tue, 22 Oct 2019 07:39:01 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730283AbfJVOhj (ORCPT + 99 others); Tue, 22 Oct 2019 10:37:39 -0400 Received: from zeniv.linux.org.uk ([195.92.253.2]:38780 "EHLO ZenIV.linux.org.uk" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1727194AbfJVOhj (ORCPT ); Tue, 22 Oct 2019 10:37:39 -0400 Received: from viro by ZenIV.linux.org.uk with local (Exim 4.92.3 #3 (Red Hat Linux)) id 1iMvI0-0008W2-Pe; Tue, 22 Oct 2019 14:37:36 +0000 Date: Tue, 22 Oct 2019 15:37:36 +0100 From: Al Viro To: Ritesh Harjani Cc: linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org, wugyuan@cn.ibm.com, jlayton@kernel.org, hsiangkao@aol.com, Jan Kara , Linus Torvalds Subject: Re: [PATCH RESEND 1/1] vfs: Really check for inode ptr in lookup_fast Message-ID: <20191022143736.GX26530@ZenIV.linux.org.uk> References: <20190927044243.18856-1-riteshh@linux.ibm.com> <20191015040730.6A84742047@d06av24.portsmouth.uk.ibm.com> <20191022133855.B1B4752050@d06av21.portsmouth.uk.ibm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20191022133855.B1B4752050@d06av21.portsmouth.uk.ibm.com> User-Agent: Mutt/1.12.1 (2019-06-15) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Oct 22, 2019 at 07:08:54PM +0530, Ritesh Harjani wrote: > I think we have still not taken this patch. Al? I'm still not sure it's a good way to deal with the whole mess, TBH ;-/ Consider e.g. walk_component(), with its if (unlikely(d_is_negative(path.dentry))) { path_to_nameidata(&path, nd); return -ENOENT; } seq = 0; /* we are already out of RCU mode */ inode = d_backing_inode(path.dentry); or mountpoint_last() if (d_is_negative(path.dentry)) { dput(path.dentry); return -ENOENT; } path.mnt = nd->path.mnt; return step_into(nd, &path, 0, d_backing_inode(path.dentry), 0); or last_open() if (unlikely(d_is_negative(path.dentry))) { path_to_nameidata(&path, nd); return -ENOENT; } /* * create/update audit record if it already exists. */ audit_inode(nd->name, path.dentry, 0); if (unlikely((open_flag & (O_EXCL | O_CREAT)) == (O_EXCL | O_CREAT))) { path_to_nameidata(&path, nd); return -EEXIST; } seq = 0; /* out of RCU mode, so the value doesn't matter */ inode = d_backing_inode(path.dentry); or, for that matter, any callers of filename_lookup() assuming that the lack of ENOENT means that the last call of walk_component() (from lookup_last()) has not failed with the same ENOENT and thus the result has been observed positive. You've picked the easiest one to hit, but on e.g. KVM setups you can have the host thread representing the CPU where __d_set_inode_and_type() runs get preempted (by the host kernel), leaving others with much wider window. Sure, we can do that to all callers of d_is_negative/d_is_positive, but... the same goes for any places that assumes that d_is_dir() implies that the sucker is positive, etc. What we have guaranteed is * ->d_lock serializes ->d_flags/->d_inode changes * ->d_seq is bumped before/after such changes * positive dentry never changes ->d_inode as long as you hold a reference (negative dentry *can* become positive right under you) So there are 3 classes of valid users: those holding ->d_lock, those sampling and rechecking ->d_seq and those relying upon having observed the sucker they've pinned to be positive. What you've been hitting is "we have it pinned, ->d_flags says it's positive but we still observe the value of ->d_inode from before the store to ->d_flags that has made it look positive". I'm somewhat tempted to make __d_set_inode_and_type() do smp_store_release() for setting ->d_flags and __d_entry_type() - smp_load_acquire(); that would be pretty much free for x86 (as well as sparc, s390 and itanic) and reasonably cheap on ppc and arm64. How badly would e.g. SMP arm suffer from the below (completely untested)? diff --git a/fs/dcache.c b/fs/dcache.c index e88cf0554e65..35368316c582 100644 --- a/fs/dcache.c +++ b/fs/dcache.c @@ -319,7 +319,7 @@ static inline void __d_set_inode_and_type(struct dentry *dentry, flags = READ_ONCE(dentry->d_flags); flags &= ~(DCACHE_ENTRY_TYPE | DCACHE_FALLTHRU); flags |= type_flags; - WRITE_ONCE(dentry->d_flags, flags); + smp_store_release(&dentry->d_flags, flags); } static inline void __d_clear_type_and_inode(struct dentry *dentry) diff --git a/include/linux/dcache.h b/include/linux/dcache.h index 10090f11ab95..bee28076a3fd 100644 --- a/include/linux/dcache.h +++ b/include/linux/dcache.h @@ -386,7 +386,7 @@ static inline bool d_mountpoint(const struct dentry *dentry) */ static inline unsigned __d_entry_type(const struct dentry *dentry) { - return dentry->d_flags & DCACHE_ENTRY_TYPE; + return smp_load_acquire(&dentry->d_flags) & DCACHE_ENTRY_TYPE; } static inline bool d_is_miss(const struct dentry *dentry)