Received: by 2002:a25:d7c1:0:0:0:0:0 with SMTP id o184csp4016310ybg; Fri, 25 Oct 2019 12:08:15 -0700 (PDT) X-Google-Smtp-Source: APXvYqyQiD43rS1FKKlXZawstPAU2qr2pQd7QUi40bH70OS+4eIAhQmeiU+muQRX7BRdowa7Wvf+ X-Received: by 2002:a17:906:70d2:: with SMTP id g18mr5171936ejk.18.1572030495433; Fri, 25 Oct 2019 12:08:15 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1572030495; cv=none; d=google.com; s=arc-20160816; b=LHAteclnot9J61YaU/aTb5JdCOiM83A3RieZfa+ZZ5g9E4wzKx4dtINwnrcxMhAXM9 VNRQIdrP9x9xjbkcDnHeUHrxWlZkPDkCnDZnOaEYj9u+B/NOEiSsBs0ShNTRnyxfAVBf mXjTtwOMr75QZG2vvCt55Unw5EUO31JOXLpY+gQ3P6J9gjpK6KIsUV36MxnM5Gby/Zs/ 7ZmxjJNUfUBFizB5Ht+GHPE42rlwysSQlX4b0Aek5BIw6QPot1+LtpRK3N52Ez0VNOxh 0kV4VENxV22jeTOp4QjsroFeC3IWq8ru5fKHTKhUfttJo4jlERjD2+3T2SEyCgnlF4KM vuzw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date; bh=ESo/Rbh4ELRx/H5cGu1xnkED6xVPsweCfm3RCa96lqo=; b=UvgCV9rBlcB5uJ0kmfNZwE5eErjza0gIDjUyrOT66JEnfomM2CA3BF5cnSqRQd+qwV 95uEl2FIkVkv/2Lqrz1SLp+bnmnd0hcCuOGQ/OVoPkrKmQ1PYTNAspj2LikSRd3VkpdY xvMozNFYap3gOiwlhmywussrA8fItqqi7nr1q6mK3ujD5ETZMnAml9nxqPvQ7qA2YMIi i7c9BP8YoX3NsQ/vJedki046Z9uCFAnQXJLKWlFSwpaWQ2FMxVKYmLivIgm/2SkNm34M c/kvWehteuDU8j7HU1Zodw57JbcELWSFuvhymnvUWYEL/QMsFuXJYmhL+NOFnlF9Stsq UGcA== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-ext4-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id b12si2061714edk.16.2019.10.25.12.07.51; Fri, 25 Oct 2019 12:08:15 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-ext4-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-ext4-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-ext4-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1729418AbfJXVfQ (ORCPT + 99 others); Thu, 24 Oct 2019 17:35:16 -0400 Received: from mail104.syd.optusnet.com.au ([211.29.132.246]:45749 "EHLO mail104.syd.optusnet.com.au" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726386AbfJXVfQ (ORCPT ); Thu, 24 Oct 2019 17:35:16 -0400 Received: from dread.disaster.area (pa49-181-161-154.pa.nsw.optusnet.com.au [49.181.161.154]) by mail104.syd.optusnet.com.au (Postfix) with ESMTPS id 7386F43F07B; Fri, 25 Oct 2019 08:35:09 +1100 (AEDT) Received: from dave by dread.disaster.area with local (Exim 4.92.3) (envelope-from ) id 1iNklA-0006Nu-K4; Fri, 25 Oct 2019 08:35:08 +1100 Date: Fri, 25 Oct 2019 08:35:08 +1100 From: Dave Chinner To: Boaz Harrosh Cc: ira.weiny@intel.com, linux-kernel@vger.kernel.org, Alexander Viro , "Darrick J. Wong" , Dan Williams , Christoph Hellwig , "Theodore Y. Ts'o" , Jan Kara , linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org Subject: Re: [PATCH 0/5] Enable per-file/directory DAX operations Message-ID: <20191024213508.GB4614@dread.disaster.area> References: <20191020155935.12297-1-ira.weiny@intel.com> <20191023221332.GE2044@dread.disaster.area> <20191024073446.GA4614@dread.disaster.area> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.10.1 (2018-07-13) X-Optus-CM-Score: 0 X-Optus-CM-Analysis: v=2.2 cv=G6BsK5s5 c=1 sm=1 tr=0 a=l3vQdJ1SkhDHY1nke8Lmag==:117 a=l3vQdJ1SkhDHY1nke8Lmag==:17 a=jpOVt7BSZ2e4Z31A5e1TngXxSK0=:19 a=kj9zAlcOel0A:10 a=XobE76Q3jBoA:10 a=7-415B0cAAAA:8 a=Qt9MKOuts2txuSNu_AQA:9 a=CjuIK1q_8ugA:10 a=biEYGPWJfzWAr4FL6Ov7:22 Sender: linux-ext4-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-ext4@vger.kernel.org On Thu, Oct 24, 2019 at 05:05:45PM +0300, Boaz Harrosh wrote: > On 24/10/2019 10:34, Dave Chinner wrote: > > On Thu, Oct 24, 2019 at 05:31:13AM +0300, Boaz Harrosh wrote: > <> > > > > The on disk DAX flag is inherited from the parent directory at > > create time. Hence an admin only need to set it on the data > > directory of the application when first configuring it, and > > everything the app creates will be configured for DAX access > > automatically. > > > > Yes I said that as well. You said "it must be set between creation and first write", stating the requirement for an on-disk flag to work. I'm decribing how that requirement is actually implemented. i.e. what you are stating is something we actually implemented years ago... > > I also seem > > to recall that there was a need to take some vm level lock to really > > prevent page fault races, and that we can't safely take that in a > > safe combination with all the filesystem locks we need. > > > > We do not really care with page fault races in the Kernel as long Oh yes we do. A write fault is a 2-part operation - a read fault to populate the pte and mapping, then a write fault (->page_mkwrite) to do all the filesystem work needed to dirty the page and pte. The read fault sets up the state for the write fault, and if we change the aops between these two operations, then the ->page_mkwrite implementation goes kaboom. This isn't a theoretical problem - this is exactly the race condition that lead us to disabling the flag in the first place. There is no serialisation between the read and write parts of the page fault iand the filesystem changing the DAX flag and ops vector, and so fixing this problem requires hold yet more locks in the filesystem path to completely lock out page fault processing on the inode's mapping. > as I protect the xarray access and these are protected well if we > take truncate locking. But we have a bigger problem that you pointed > out with the change of the operations vector pointer. > > I was thinking about this last night. One way to do this is with > file-exclusive-lock. Correct me if I'm wrong: > file-exclusive-readwrite-lock means any other openers will fail and > if there are openers already the lock will fail. Which is what we want > no? The filesystem ioctls and page faults have no visibility of file locks. They don't know and can't find out in a sane manner that an inode has a single -user- reference. And it introduces a new problem for any application using the fssetxattr() ioctl - accidentally not setting the S_DAX flag to be unmodified will now fail, and that means such a change breaks existing applications. Sure, you can say they are "buggy applications", but the fact is this user API change breaks them. Hence I don't think we can change the user API for setting/clearing this flag like this. Cheers, Dave. -- Dave Chinner david@fromorbit.com