From: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>
Subject: Re: Question about Experimental of Filesystem DAX.
Date: Thu, 7 Jun 2018 16:38:38 +0200
Message-ID: <20180607143838.6gdms3qlok6ikzrf@quack2.suse.cz>
References: <20180531112731.0CAC.E1E9C6FF@jp.fujitsu.com>
 <20180531150716.GA19764@linux.intel.com>
 <CAPcyv4jpffaM60tTdLTojhh0CDwcbVUdS8td7R6LKWM3bb15dw@mail.gmail.com>
 <20180531174636.GA7825@magnolia>
 <CAPcyv4gS1GCN9TFjngTmmHu83-uMgRpvvx842ZGnvxB3PbgU+Q@mail.gmail.com>
 <20180531230550.GN10363@dastard>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7bit
Cc: Jan Kara <jack-AlSwsSmVLrQ@public.gmane.org>, "Darrick J. Wong" <darrick.wong-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>,
 NVDIMM-ML <linux-nvdimm-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org>, linux-xfs <linux-xfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>,
 Yasunori Goto <y-goto-+CUm20s59erQFUHtdCDX3A@public.gmane.org>, linux-ext4 <linux-ext4-u79uwXL29TY76Z2rM5mHXA@public.gmane.org>
To: Dave Chinner <david-FqsqvQoI3Ljby3iVrkZq2A@public.gmane.org>
Content-Disposition: inline
In-Reply-To: <20180531230550.GN10363@dastard>
Errors-To: linux-nvdimm-bounces-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org
Sender: "Linux-nvdimm" <linux-nvdimm-bounces-hn68Rpc1hR1g9hUCZPvPmw@public.gmane.org>

On Fri 01-06-18 09:05:50, Dave Chinner wrote:
> On Thu, May 31, 2018 at 11:26:43AM -0700, Dan Williams wrote:
> > On Thu, May 31, 2018 at 10:46 AM, Darrick J. Wong
> > > The recent thread between Jan and Dan makes me wonder if making mappings
> > > share struct pages is going to be a nightmare to add to the mm code,
> > > though...
> > 
> > It's going to be a bit messy because a singular page->mapping
> > association is fundamentally incompatible with DAX. Perhaps a linked
> > list of mapping "siblings"?
> 
> I'd much prefer the filesystem allocate/control the struct page that
> is inserted into mapping trees so we can have multiple struct pages
> pointing at the one physical page.  That way we can just insert
> these dynamic struct pages into the relevant mappings and it works
> the same way for both DAX and shared page cache pages.

Yes, that's one option but the overhead in terms of memory and CPU is
non-trivial and, as Dan writes, there are assumptions in MM code that the
PFN<->struct page relationship is 1:1 (or possibly 1:0, as struct page can
be missing for certain types of pfns). But at this point I think the exact
data structure layout is not that important (whether there will be dynamic
struct pages or some other linkage). I think we first need to settle on
how responsibilities between MM and filesystems are going to be split.

> i.e. the filesystem knows they are shared physical blocks, the
> filesystem controls COW of physical blocks, the filesystem controls
> truncate/invalidation of physical blocks, the filesystem controls
> cache state of the physical blocks. So why are we designing
> infrastructure around the virtual memory and caching infrastructure
> that bypasses the layer that manages and arbitrates access to the
> physical storage?

I agree with this but let's see whether we are on the same page.  What MM
mostly cares about and in particular what Dan needs to solve is "given
struct page, give me all page tables that map this page". And it seems
completely fair to me to maintain such translation within MM code as the
filesystem generally does not care in much detail about memory mappings
of files.  When something is going to happen to the page (like breaking COW,
truncate, whatever), the filesystem just tells MM to invalidate all page
tables for the page and subsequent page faults will fill in updated
information - nothing new here.
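
To make the first question concrete, here is a minimal userspace sketch of
"given struct page, give me all page tables that map this page". It is not
kernel code and not how rmap is actually implemented: toy_page, toy_pte,
toy_rmap, toy_map_page() and toy_unmap_all() are hypothetical names invented
for illustration. It only shows the split of responsibilities: MM keeps the
page -> mappings list, and the filesystem's only involvement is to ask for
everything to be torn down.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for one page table entry that maps a page. */
struct toy_pte {
	int present;			/* 1 while the mapping is installed */
};

/* One reverse-map link: "this PTE maps that page". */
struct toy_rmap {
	struct toy_pte *pte;
	struct toy_rmap *next;
};

/* Hypothetical stand-in for struct page; it only carries the reverse map. */
struct toy_page {
	struct toy_rmap *rmap;		/* all PTEs currently mapping this page */
};

/* MM side: record that pte maps page. Returns 0 on success, -1 on OOM. */
int toy_map_page(struct toy_page *page, struct toy_pte *pte)
{
	struct toy_rmap *link;

	assert(page && pte);
	link = malloc(sizeof(*link));
	if (!link)
		return -1;
	pte->present = 1;
	link->pte = pte;
	link->next = page->rmap;
	page->rmap = link;
	return 0;
}

/*
 * What the filesystem asks MM to do on truncate or COW break: clear every
 * mapping of the page. Later faults would repopulate the page tables.
 * Returns the number of PTEs that were cleared.
 */
int toy_unmap_all(struct toy_page *page)
{
	int cleared = 0;

	assert(page);
	while (page->rmap) {
		struct toy_rmap *link = page->rmap;

		page->rmap = link->next;
		link->pte->present = 0;
		free(link);
		cleared++;
	}
	return cleared;
}
```

The filesystem never walks the list itself in this sketch; it just calls
toy_unmap_all() and lets subsequent faults fill in updated information.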

Then there's a second slightly different question - and I suspect you are
speaking about that one - "given struct page, give me all radix trees
pointing to this page". This is more a filesystem caching question (even in
case of DAX, you can think of this as caching of inode + logical offset =>
physical block translations). Currently this translation is maintained by
the page cache, and again the filesystem is mostly ignorant of the details
of what gets cached and when; it just tells the MM when it should throw away
the cached information (through truncate/invalidate_inode_pages()).

With reflink and DAX / shared page cache these questions get more complex
to answer but in principle I don't see why the mapping from the struct page
to radix trees should not be maintained (with the help of the filesystem) in
generic code. Sure, the interface now needs to be more flexible, and in that
sense filesystems will be more in control of what the page cache is doing -
e.g. the ->readpage callback would probably need to be passed just inode +
logical offset and *the filesystem* will now need to find / allocate the
appropriate page, fill it up, and tell the page cache that this page is now
caching the given offset in the file. And the page cache will add this inode
+ offset to a list of things caching this page. Do you agree, or did you
have something different in mind?
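
As a rough illustration of that split, here is a userspace sketch under
heavy assumptions. Nothing in it is an existing kernel interface:
shared_page, toy_cacher, toy_cache_*(), toy_fs_map() and toy_fs_readpage()
are all hypothetical, and the "filesystem" simply pretends inodes 1 and 2
are reflinked copies sharing the same physical blocks. The point is only
that the filesystem picks the page for an inode + offset, while generic
code keeps the list of inode + offset pairs caching that page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define TOY_NR_BLOCKS 4

/* One "this inode + offset is cached by that page" record. */
struct toy_cacher {
	unsigned long ino;
	unsigned long index;		/* logical offset in pages */
	struct toy_cacher *next;
};

/* Hypothetical page that may back several inode + offset pairs. */
struct shared_page {
	struct toy_cacher *cachers;
};

/* One page per physical block, zero-initialized. */
struct shared_page toy_pages[TOY_NR_BLOCKS];

/* Generic side: remember that (ino, index) is cached by page. */
int toy_cache_add(struct shared_page *page, unsigned long ino,
		  unsigned long index)
{
	struct toy_cacher *c;

	assert(page);
	for (c = page->cachers; c; c = c->next)
		if (c->ino == ino && c->index == index)
			return 0;	/* already recorded */
	c = malloc(sizeof(*c));
	if (!c)
		return -1;
	c->ino = ino;
	c->index = index;
	c->next = page->cachers;
	page->cachers = c;
	return 0;
}

/* Generic side: how many inode + offset pairs cache this page? */
int toy_cache_count(const struct shared_page *page)
{
	const struct toy_cacher *c;
	int n = 0;

	for (c = page->cachers; c; c = c->next)
		n++;
	return n;
}

/* Generic side: drop all records for the page, return how many. */
int toy_cache_invalidate(struct shared_page *page)
{
	int n = 0;

	while (page->cachers) {
		struct toy_cacher *c = page->cachers;

		page->cachers = c->next;
		free(c);
		n++;
	}
	return n;
}

/* Filesystem side (assumed layout): inodes 1 and 2 share all blocks. */
long toy_fs_map(unsigned long ino, unsigned long index)
{
	if ((ino == 1 || ino == 2) && index < TOY_NR_BLOCKS)
		return (long)index;
	return -1;			/* hole or unknown inode */
}

/*
 * Filesystem side of a ->readpage-like callback: given only inode +
 * logical offset, find the page and tell the cache what it now caches.
 */
struct shared_page *toy_fs_readpage(unsigned long ino, unsigned long index)
{
	long pblock = toy_fs_map(ino, index);

	if (pblock < 0)
		return NULL;
	if (toy_cache_add(&toy_pages[pblock], ino, index) != 0)
		return NULL;
	return &toy_pages[pblock];
}
```

Reading offset 3 of inode 1 and of inode 2 returns the same shared_page
here, with two entries on its list, and toy_cache_invalidate() stands in
for the filesystem telling generic code to forget all of them at once.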

								Honza
-- 
Jan Kara <jack-IBi9RG/b67k@public.gmane.org>
SUSE Labs, CR