From: Ross Zwisler Subject: Re: [PATCH 16/18] dax: New fault locking Date: Wed, 11 May 2016 13:26:32 -0600 Message-ID: <20160511192632.GA8841@linux.intel.com> References: <1461015341-20153-1-git-send-email-jack@suse.cz> <1461015341-20153-17-git-send-email-jack@suse.cz> <20160506041350.GA29628@linux.intel.com> <20160510122715.GK11897@quack2.suse.cz> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: Ross Zwisler , linux-fsdevel@vger.kernel.org, linux-ext4@vger.kernel.org, linux-mm@kvack.org, Dan Williams , linux-nvdimm@lists.01.org, Matthew Wilcox To: Jan Kara Return-path: Content-Disposition: inline In-Reply-To: <20160510122715.GK11897@quack2.suse.cz> Sender: owner-linux-mm@kvack.org List-Id: linux-ext4.vger.kernel.org On Tue, May 10, 2016 at 02:27:15PM +0200, Jan Kara wrote: > On Thu 05-05-16 22:13:50, Ross Zwisler wrote: > > On Mon, Apr 18, 2016 at 11:35:39PM +0200, Jan Kara wrote: > > > +/* > > > + * Find radix tree entry at given index. If it points to a page, return with > > > + * the page locked. If it points to the exceptional entry, return with the > > > + * radix tree entry locked. If the radix tree doesn't contain given index, > > > + * create empty exceptional entry for the index and return with it locked. > > > + * > > > + * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For > > > + * persistent memory the benefit is doubtful. We can add that later if we can > > > + * show it helps. > > > + */ > > > +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index) > > > +{ > > > + void *ret, **slot; > > > + > > > +restart: > > > + spin_lock_irq(&mapping->tree_lock); > > > + ret = get_unlocked_mapping_entry(mapping, index, &slot); > > > + /* No entry for given index? Make sure radix tree is big enough. */ > > > + if (!ret) { > > > + int err; > > > + > > > + spin_unlock_irq(&mapping->tree_lock); > > > + err = radix_tree_preload( > > > + mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM); > > > > In the conversation about v2 of this series you said: > > > > > Note that we take the hit for dropping the lock only if we really need to > > > allocate new radix tree node so about once per 64 new entries. So it is not > > > too bad. > > > > I think this is incorrect. We get here whenever we get a NULL return from > > __radix_tree_lookup(). I believe that this happens if we don't have a node, > > in which case we need an allocation, but I think it also happens in the case > > where we do have a node and we just have a NULL slot in that node. > > > > For the behavior you're looking for (only preload if you need to do an > > allocation), you probably need to check the 'slot' we get back from > > get_unlocked_mapping_entry(), yea? > > You are correct. However currently __radix_tree_lookup() doesn't return a > slot pointer if entry was not found so it is not easy to fix. So I'd leave > the code as is for now and we can later optimize the case where we don't > need to grow the radix tree... Ah, you're right. Sure, that plan sounds good. > > > +/* > > > + * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree > > > + * entry to get unlocked before deleting it. > > > + */ > > > +int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index) > > > +{ > > > + void *entry; > > > + > > > + spin_lock_irq(&mapping->tree_lock); > > > + entry = get_unlocked_mapping_entry(mapping, index, NULL); > > > + /* > > > + * Caller should make sure radix tree modifications don't race and > > > + * we have seen exceptional entry here before. > > > + */ > > > + if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry))) { > > > > dax_delete_mapping_entry() is only called from clear_exceptional_entry(). > > With this new code we've changed the behavior of that call path a little. > > > > In the various places where clear_exceptional_entry() is called, the code > > batches up a bunch of entries in a pvec via pagevec_lookup_entries(). We > > don't hold the mapping->tree_lock between the time this lookup happens and the > > time that the entry is passed to clear_exceptional_entry(). This is why the > > old code did a verification that the entry passed in matched what was still > > currently present in the radix tree. This was done in the DAX case via > > radix_tree_delete_item(), and it was open coded in clear_exceptional_entry() > > for the page cache case. In both cases if the entry didn't match what was > > currently in the tree, we bailed without doing anything. > > > > This new code doesn't verify against the 'entry' passed to > > clear_exceptional_entry(), but instead makes sure it is an exceptional entry > > before removing, and if not it does a WARN_ON_ONCE(). > > > > This changes things because: > > > > a) If the exceptional entry changed, say from a plain lock entry to an actual > > DAX entry, we wouldn't notice, and we would just clear the latter out. My > > guess is that this is fine, I just wanted to call it out. > > > > b) If we have a non-exceptional entry here now, say because our lock entry has > > been swapped out for a zero page, we will WARN_ON_ONCE() and return without a > > removal. I think we may want to silence the WARN_ON_ONCE(), as I believe this > > could happen during normal operation and we don't want to scare anyone. :) > > So your concerns are exactly why I have added a comment to > dax_delete_mapping_entry() that: > > /* > * Caller should make sure radix tree modifications don't race and > * we have seen exceptional entry here before. > */ > > The thing is dax_delete_mapping_entry() is called only from truncate / > punch hole path. Those should hold i_mmap_sem for writing and thus there > should be no modifications of the radix tree. If anything changes, between > what truncate_inode_pages() (or similar functions) finds and what > dax_delete_mapping_entry() sees, we have a locking bug and I want to know > about it :). Any suggestion how I should expand the comment so that this is > clearer? Ah, I didn't understand all that. :) Given a bit more context the comment seems fine - if anything it could be a bit more specific, and include the text: "dax_delete_mapping_entry() is called only from truncate / punch hole path. Those should hold i_mmap_sem for writing and thus there should be no modifications of the radix tree." Either way - thanks for explaining. > > > +/* > > > * The user has performed a load from a hole in the file. Allocating > > > * a new page in the file would cause excessive storage usage for > > > * workloads with sparse files. We allocate a page cache page instead. > > > @@ -307,15 +584,24 @@ EXPORT_SYMBOL_GPL(dax_do_io); > > > * otherwise it will simply fall out of the page cache under memory > > > * pressure without ever having been dirtied. > > > */ > > > -static int dax_load_hole(struct address_space *mapping, struct page *page, > > > - struct vm_fault *vmf) > > > +static int dax_load_hole(struct address_space *mapping, void *entry, > > > + struct vm_fault *vmf) > > > { > > > - if (!page) > > > - page = find_or_create_page(mapping, vmf->pgoff, > > > - GFP_KERNEL | __GFP_ZERO); > > > - if (!page) > > > - return VM_FAULT_OOM; > > > + struct page *page; > > > + > > > + /* Hole page already exists? Return it... */ > > > + if (!radix_tree_exceptional_entry(entry)) { > > > + vmf->page = entry; > > > + return VM_FAULT_LOCKED; > > > + } > > > > > > + /* This will replace locked radix tree entry with a hole page */ > > > + page = find_or_create_page(mapping, vmf->pgoff, > > > + vmf->gfp_mask | __GFP_ZERO); > > > > This replacement happens via page_cache_tree_insert(), correct? In this case, > > who wakes up anyone waiting on the old lock entry that we just killed? In the > > non-hole case we would traverse through put_locked_mapping_entry(), but I > > don't see that in the hole case. > > Ha, good catch. We miss the wakeup. Fixed. > > Attached is the diff resulting from your review of this patch. I still have > to hunt down that strange interaction with workingset code you've reported... At the end of this mail I've attached one small fixup for the incremental diff you sent. Aside from that, I think that you've addressed all my review feedback, thanks! Reviewed-by: Ross Zwisler I'm going to try and get more info on the working set test failure. --- diff --git a/fs/dax.c b/fs/dax.c index f496854..c4cb69b 100644 --- a/fs/dax.c +++ b/fs/dax.c @@ -406,7 +406,7 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping, init_wait(&ewait.wait); ewait.wait.func = wake_exceptional_entry_func; - ewait.key.root = &mapping->page_tree; + ewait.key.mapping = mapping; ewait.key.index = index; for (;;) { -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: email@kvack.org