From: Ross Zwisler Subject: Re: [PATCH v6 4/7] dax: add support for fsync/msync Date: Tue, 5 Jan 2016 08:50:39 -0700 Message-ID: <20160105155039.GA6462@linux.intel.com> References: <1450899560-26708-1-git-send-email-ross.zwisler@linux.intel.com> <1450899560-26708-5-git-send-email-ross.zwisler@linux.intel.com> <20160105111346.GC2724@quack.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Dan Williams , Ross Zwisler , "linux-kernel@vger.kernel.org" , "H. Peter Anvin" , "J. Bruce Fields" , Theodore Ts'o , Alexander Viro , Andreas Dilger , Dave Chinner , Ingo Molnar , Jan Kara , Jeff Layton , Matthew Wilcox , Thomas Gleixner , linux-ext4 , linux-fsdevel , Linux MM , "linux-nvdimm@lists.01.org" , X86 ML , XFS Developers , Andrew Morton , Matthew Wilcox , D To: Jan Kara Return-path: Content-Disposition: inline In-Reply-To: <20160105111346.GC2724@quack.suse.cz> Sender: linux-fsdevel-owner@vger.kernel.org List-Id: linux-ext4.vger.kernel.org On Tue, Jan 05, 2016 at 12:13:46PM +0100, Jan Kara wrote: > On Sun 03-01-16 10:13:06, Dan Williams wrote: > > On Wed, Dec 23, 2015 at 11:39 AM, Ross Zwisler > > wrote: > > > To properly handle fsync/msync in an efficient way DAX needs to track dirty > > > pages so it is able to flush them durably to media on demand. > > > > > > The tracking of dirty pages is done via the radix tree in struct > > > address_space. This radix tree is already used by the page writeback > > > infrastructure for tracking dirty pages associated with an open file, and > > > it already has support for exceptional (non struct page*) entries. We > > > build upon these features to add exceptional entries to the radix tree for > > > DAX dirty PMD or PTE pages at fault time. > > > > > > Signed-off-by: Ross Zwisler > > > > I'm hitting the following report with the ndctl dax test [1] on > > next-20151231. I bisected it to > > commit 3cb108f941de "dax-add-support-for-fsync-sync-v6". I'll take a > > closer look tomorrow, but in case someone can beat me to it, here's > > the back-trace: > > > > ------------[ cut here ]------------ > > kernel BUG at fs/inode.c:497! > > I suppose this is the check that mapping->nr_exceptional is zero, isn't it? > Hum, I don't see how that could happen given we call > truncate_inode_pages_final() just before the clear_inode() call which > removes all the exceptional entries from the radix tree. And there's not > much room for a race during umount... Does the radix tree really contain > any entry or is it an accounting bug? > > Honza I think this is a bug with the existing way that we handle PMD faults. The issue is that the PMD path doesn't properly remove radix tree entries for zero pages covering holes. The PMD path calls unmap_mapping_range() to unmap the range out of the struct address_space, but it is missing a call to truncate_inode_pages_range() or similar to clear out those entries in the radix tree. Up until now we didn't notice, we just had an orphaned entry in the radix tree, but with my code we then find the page entry in the radix tree when handling a PMD fault, we remove it and add in a PMD entry. This causes us to be off on both our mapping->nrpages and mapping->nrexceptional counts. In the PTE path we properly remove the pages from the radix tree when upgrading from a hole to a real DAX entry via the delete_from_page_cache() call, which eventually calls page_cache_tree_delete(). I'm working on a fix now (and making sure all the above is correct). - Ross