From: Dave Chinner Subject: Re: [PATCH v3 0/3] Add XIP support to ext4 Date: Mon, 23 Dec 2013 15:14:47 +1100 Message-ID: <20131223041447.GG3220@dastard> References: <20131217223050.GB20579@dastard> <20131218023143.GA24491@parisc-linux.org> <20131218123319.GH31386@dastard> <20131218152234.GB9207@parisc-linux.org> <20131219004831.GU31386@dastard> <20131219010530.GA17545@parisc-linux.org> <20131219015843.GE20579@dastard> <20131219153213.GC19166@parisc-linux.org> <20131219234653.GD31386@dastard> <20131220164529.GE19166@parisc-linux.org> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Matthew Wilcox , linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org To: Matthew Wilcox Return-path: Content-Disposition: inline In-Reply-To: <20131220164529.GE19166@parisc-linux.org> Sender: linux-fsdevel-owner@vger.kernel.org List-Id: linux-ext4.vger.kernel.org On Fri, Dec 20, 2013 at 09:45:30AM -0700, Matthew Wilcox wrote: > On Fri, Dec 20, 2013 at 10:46:54AM +1100, Dave Chinner wrote: > > Then perhaps we need to get rid of the xip_sparse_mutex first? :/ > > Yeah, already done in my tree. Just finishing up a few other pieces. > > > > > And that solves the unwritten extent problem for the IO path. Now we > > > > just need to solve it for the mmap path. That, I suspect will > > > > require a custom .page_mkwrite handler.... > > > > > > No, page_mkwrite() never gets called. At this point, I'm thinking a > > > custom ->fault handler that looks something like this: > > > > And that's another difference to the normal filesystem mmap paths. > > .fault is a read-only operation for filesystems and > > .page-mkwrite is the write-fault modification path. i.e. > > .fault is only supposed to populate the page into the page > > cache by reading it from disk via ->readpage(s). It does not do > > block allocation - if the fault is over a hole then a new, zeroed > > page is placed in the page cache regardless of whether it is a read > > or write page fault. > > > > ->page_mkwrite is then used by page fault infrstructure to inform > > filesystem that a write fault has occurred and they may need to > > allocate backing store for the page, or convert unwritten extents to > > written. > > > > What xip_file_fault() does is ask the fielsystem to allocate blocks > > for writeable mappings, rather than just inserting a sparse page > > over holes and unwritten extents. That fails to handle unwritten > > extents correctly - they remain unwritten despite the fact that > > userspace can now write to the page. > > I agree with you up to this point. But xip_file_fault() uses the same > get_block_t callback to allocate blocks that block_page_mkwrite() does. > So there's no real difference from the fs' point of view. I beg to differ. You're adding a new allocation path from page faults into the filesystem code. We already have one - it's called page_mkwrite. > > IOWs, xip_file_fault() needs to drop the allocation of blocks and > > only ever insert mappings for pages that have data in them or sprase > > pages for holes and unwritten extents. Then the filesystem needs to > > provide it's own ->page_mkwrite callout to do block allocation and > > unwritten extent conversion on write page faults, and the XIP code > > needs to provide a helper function to replace the sparse page in the > > mappings with the correct page mapped from the backing device after > > allocation or unwritten extent conversion. > > > > That will make XIP behave almost identically to the normal page > > cache based page fault path, requiring only a small addition to the > > filesystem page_mkwrite handler to support XIP... > > I decided to see if there was anything particularly hard about the XFS > code in this area. I really think it's just this for you: > > +++ b/fs/xfs/xfs_file.c > @@ -957,12 +957,27 @@ xfs_file_readdir( > return 0; > } > > +static int xfs_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf) > +{ > + return xip_fault(vma, vmf, xfs_get_blocks); > +} > + > +static const struct vm_operations_struct xfs_xip_vm_ops = { > + .fault = xfs_xip_fault, > + .remap_pages = generic_file_remap_pages, > +}; No, it doesn't. ->page_mkwrite re-enters the filesystem back in ->write_begin and ->write_end to do zeroing and allocation and to set up tracking to trigger unwritten extent conversion, etc. So, off the top of my head, how does xip_fault(): - avoid triggering delayed allocation? - trigger unwritten extent conversion to occur? - ensure that partial blocks at EOF are zeroed correctly? - deal with partial write failure? (e.g. unwritten extent conversion fails) AFAICT, it can't. This is all stuff that is specific to the filesystem imlpementation, and stuff that is already handled by the page_mkwrite paths.... > > > static int ext4_xip_fault(struct vm_area_struct *vma, struct vm_fault *vmf) > > > { > > > return xip_fault(vma, vmf, ext4_get_block_write, ext4_end_io_dio); > > > } > > > > I think the xip fault handler should be generic as there's no reason > > for it to do anything other that read-only operations. It's the > > page_mkwrite callout that needs custom code for each filesystem. > > With no struct page for the XIP memory, it's just not feasible to do it > that way. Yet we are making the major assumption that XIP memory is made up of pages that can be mapped directly into page tables. So, it's pages and we optimise filesystems to deal with native page sizes, but XIP says it's not pages and so we have to deal with it completely differently? Perhaps XIP pages need to have struct pages allocated for them at device initialisation time, such that all xip_get_mem turns into xip_find_get_page() and the page gets inserted directly into the mapping tree on the inode? All of a sudden, XIP looks almost the same as the normal mmap path, and filesystems only need to ensure that they zero the page and/or convert it if necessary before the page fault returns... Seriously, just because the current XIP code uses some hack to work with *one device*, it doesn't mean that's the best way to approach the problem. XIP is not something magic and different - it's just a different way of inserting a page into the page cache.... Cheers, Dave. -- Dave Chinner david@fromorbit.com