This patch series adds a slimmed down version of fsync/msync support to
DAX. The major change versus v2 of this patch series is that we no longer
remove DAX entries from the radix tree during fsync/msync calls. Instead,
the list of DAX entries in the radix tree grows for the lifetime of the
mapping. We reclaim DAX entries from the radix tree via
clear_exceptional_entry() for truncate, when the filesystem is unmounted,
etc.
This change was made because if we try to remove radix tree entries during
writeback operations there are a number of race conditions that exist
between those writeback operations and page faults. In the non-DAX case
these races are dealt with using the page lock, but we don't have a good
replacement lock with the same granularity. These races could leave us in
a place where we have a DAX page that is dirty and writeable from userspace
but no longer in the radix tree. This page would then be skipped during
subsequent writeback operations, which is unacceptable.
I do plan to continue trying to solve these race conditions so that we can
have a more optimal fsync/msync solution for DAX, but I wanted to get this
set out for v4.5 consideration while that work continues. While suboptimal,
the solution in this series gives us correct behavior for DAX fsync/msync
and seems like a reasonable short-term compromise.
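For anyone new to the thread, this is the userspace-visible behavior the
series is about. The example below is purely illustrative (error handling
omitted, and /mnt/pmem/data is a hypothetical file on a DAX mount): once
msync() returns, the stored bytes must be durable on media.

	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/mnt/pmem/data", O_RDWR);
		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);

		strcpy(p, "hello");		/* dirty a DAX-mapped page */
		msync(p, 4096, MS_SYNC);	/* flush CPU caches to media */

		munmap(p, 4096);
		close(fd);
		return 0;
	}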
This series is built upon v4.4-rc4 plus the recent ext4 DAX series from Jan
Kara (http://www.spinics.net/lists/linux-ext4/msg49951.html) and a recent
XFS fix from Dave Chinner (https://lkml.org/lkml/2015/12/2/923). The tree
with all this working can be found here:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=fsync_v3
Other changes versus v2:
- Renamed dax_fsync() to dax_writeback_mapping_range(). (Dave Chinner)
- Removed REQ_FUA/REQ_FLUSH support from the PMEM driver and instead just
make the call to wmb_pmem() in dax_writeback_mapping_range(). (Dan)
- Reworked some BUG_ON() calls to be a WARN_ON() followed by an error
return.
- Moved call to dax_writeback_mapping_range() from the filesystems down
into filemap_write_and_wait_range(). (Dave Chinner)
- Fixed handling of DAX read faults so they create a radix tree entry but
  don't mark it as dirty until the follow-up dax_pfn_mkwrite() call (see
  the short sketch after this list).
- Updated clear_exceptional_entry() and dax_writeback_one() so they
  validate the DAX radix tree entry before they use it. (Dave Chinner)
- Added a comment to find_get_entries_tag() to explain the restart
condition. (Dave Chinner)
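Here is the sketch referenced in the read-fault item above. It only outlines
the intended use of the dax_radix_entry() helper added in patch 4; it is not
code lifted verbatim from the patches:

	/* On a DAX page fault (read or write), insert a radix tree entry.
	 * The entry is only tagged dirty if this was a write fault: */
	error = dax_radix_entry(mapping, vmf->pgoff, addr, false,
				vmf->flags & FAULT_FLAG_WRITE);

	/* On the first write to a page that was previously read-faulted,
	 * dax_pfn_mkwrite() marks the existing entry as dirty: */
	dax_radix_entry(file->f_mapping, vmf->pgoff, NULL, false, true);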
Ross Zwisler (7):
pmem: add wb_cache_pmem() to the PMEM API
dax: support dirty DAX entries in radix tree
mm: add find_get_entries_tag()
dax: add support for fsync/sync
ext2: call dax_pfn_mkwrite() for DAX fsync/msync
ext4: call dax_pfn_mkwrite() for DAX fsync/msync
xfs: call dax_pfn_mkwrite() for DAX fsync/msync
arch/x86/include/asm/pmem.h | 11 ++--
fs/block_dev.c | 3 +-
fs/dax.c | 147 ++++++++++++++++++++++++++++++++++++++++++--
fs/ext2/file.c | 4 +-
fs/ext4/file.c | 4 +-
fs/inode.c | 1 +
fs/xfs/xfs_file.c | 7 ++-
include/linux/dax.h | 7 +++
include/linux/fs.h | 1 +
include/linux/pagemap.h | 3 +
include/linux/pmem.h | 22 ++++++-
include/linux/radix-tree.h | 9 +++
mm/filemap.c | 84 ++++++++++++++++++++++++-
mm/truncate.c | 64 +++++++++++--------
14 files changed, 319 insertions(+), 48 deletions(-)
--
2.5.0
The function __arch_wb_cache_pmem() was already an internal implementation
detail of the x86 PMEM API, but this functionality needs to be exported as
part of the general PMEM API to handle the fsync/msync case for DAX mmaps.
One thing worth noting is that we really do want this to be part of the
PMEM API as opposed to a stand-alone function like clflush_cache_range()
because of ordering restrictions. By having wb_cache_pmem() as part of the
PMEM API we can leave it unordered, call it multiple times to write back
large amounts of memory, and then order the multiple calls with a single
wmb_pmem().
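As a usage sketch (the addresses and lengths below are placeholders), the
intended pattern is a series of unordered write-backs followed by a single
ordering point:

	wb_cache_pmem(addr1, len1);	/* unordered cache write-back */
	wb_cache_pmem(addr2, len2);
	wb_cache_pmem(addr3, len3);
	wmb_pmem();			/* one ordering point for all of the above */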
Signed-off-by: Ross Zwisler <[email protected]>
---
arch/x86/include/asm/pmem.h | 11 ++++++-----
include/linux/pmem.h | 22 +++++++++++++++++++++-
2 files changed, 27 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/pmem.h b/arch/x86/include/asm/pmem.h
index d8ce3ec..6c7ade0 100644
--- a/arch/x86/include/asm/pmem.h
+++ b/arch/x86/include/asm/pmem.h
@@ -67,18 +67,19 @@ static inline void arch_wmb_pmem(void)
}
/**
- * __arch_wb_cache_pmem - write back a cache range with CLWB
+ * arch_wb_cache_pmem - write back a cache range with CLWB
* @vaddr: virtual start address
* @size: number of bytes to write back
*
* Write back a cache range using the CLWB (cache line write back)
* instruction. This function requires explicit ordering with an
- * arch_wmb_pmem() call. This API is internal to the x86 PMEM implementation.
+ * arch_wmb_pmem() call.
*/
-static inline void __arch_wb_cache_pmem(void *vaddr, size_t size)
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
{
u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
unsigned long clflush_mask = x86_clflush_size - 1;
+ void *vaddr = (void __force *)addr;
void *vend = vaddr + size;
void *p;
@@ -115,7 +116,7 @@ static inline size_t arch_copy_from_iter_pmem(void __pmem *addr, size_t bytes,
len = copy_from_iter_nocache(vaddr, bytes, i);
if (__iter_needs_pmem_wb(i))
- __arch_wb_cache_pmem(vaddr, bytes);
+ arch_wb_cache_pmem(addr, bytes);
return len;
}
@@ -138,7 +139,7 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
else
memset(vaddr, 0, size);
- __arch_wb_cache_pmem(vaddr, size);
+ arch_wb_cache_pmem(addr, size);
}
static inline bool __arch_has_wmb_pmem(void)
diff --git a/include/linux/pmem.h b/include/linux/pmem.h
index acfea8c..7c3d11a 100644
--- a/include/linux/pmem.h
+++ b/include/linux/pmem.h
@@ -53,12 +53,18 @@ static inline void arch_clear_pmem(void __pmem *addr, size_t size)
{
BUG();
}
+
+static inline void arch_wb_cache_pmem(void __pmem *addr, size_t size)
+{
+ BUG();
+}
#endif
/*
* Architectures that define ARCH_HAS_PMEM_API must provide
* implementations for arch_memcpy_to_pmem(), arch_wmb_pmem(),
- * arch_copy_from_iter_pmem(), arch_clear_pmem() and arch_has_wmb_pmem().
+ * arch_copy_from_iter_pmem(), arch_clear_pmem(), arch_wb_cache_pmem()
+ * and arch_has_wmb_pmem().
*/
static inline void memcpy_from_pmem(void *dst, void __pmem const *src, size_t size)
{
@@ -178,4 +184,18 @@ static inline void clear_pmem(void __pmem *addr, size_t size)
else
default_clear_pmem(addr, size);
}
+
+/**
+ * wb_cache_pmem - write back processor cache for PMEM memory range
+ * @addr: virtual start address
+ * @size: number of bytes to write back
+ *
+ * Write back the processor cache range starting at 'addr' for 'size' bytes.
+ * This function requires explicit ordering with a wmb_pmem() call.
+ */
+static inline void wb_cache_pmem(void __pmem *addr, size_t size)
+{
+ if (arch_has_pmem_api())
+ arch_wb_cache_pmem(addr, size);
+}
#endif /* __PMEM_H__ */
--
2.5.0
Add support for tracking dirty DAX entries in the struct address_space
radix tree. This tree is already used for dirty page writeback, and it
already supports the use of exceptional (non struct page*) entries.
In order to properly track dirty DAX pages we will insert new exceptional
entries into the radix tree that represent dirty DAX PTE or PMD pages.
These exceptional entries will also contain the writeback addresses for the
PTE or PMD faults that we can use at fsync/msync time.
There are currently two types of exceptional entries (shmem and shadow)
that can be placed into the radix tree, and this adds a third. There
shouldn't be any collisions between these various exceptional entries
because only one type of exceptional entry should be able to be found in a
radix tree at a time depending on how it is being used.
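To make the encoding concrete, here is a rough sketch of how an entry is
built and later consumed (kaddr stands for a hypothetical __pmem address
recorded at fault time; the macros are introduced in this patch and the
consumer side mirrors what patch 4 does at writeback time):

	void *entry = RADIX_DAX_ENTRY(kaddr, false);	/* PTE-sized DAX entry */

	if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
		wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE);
	else
		wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE);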
Signed-off-by: Ross Zwisler <[email protected]>
---
fs/block_dev.c | 3 ++-
fs/inode.c | 1 +
include/linux/dax.h | 5 ++++
include/linux/fs.h | 1 +
include/linux/radix-tree.h | 9 +++++++
mm/filemap.c | 13 +++++++---
mm/truncate.c | 64 +++++++++++++++++++++++++++-------------------
7 files changed, 65 insertions(+), 31 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index c25639e..226dacc 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -75,7 +75,8 @@ void kill_bdev(struct block_device *bdev)
{
struct address_space *mapping = bdev->bd_inode->i_mapping;
- if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+ if (mapping->nrpages == 0 && mapping->nrshadows == 0 &&
+ mapping->nrdax == 0)
return;
invalidate_bh_lrus();
diff --git a/fs/inode.c b/fs/inode.c
index 1be5f90..79d828f 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -496,6 +496,7 @@ void clear_inode(struct inode *inode)
spin_lock_irq(&inode->i_data.tree_lock);
BUG_ON(inode->i_data.nrpages);
BUG_ON(inode->i_data.nrshadows);
+ BUG_ON(inode->i_data.nrdax);
spin_unlock_irq(&inode->i_data.tree_lock);
BUG_ON(!list_empty(&inode->i_data.private_list));
BUG_ON(!(inode->i_state & I_FREEING));
diff --git a/include/linux/dax.h b/include/linux/dax.h
index b415e52..e9d57f68 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -36,4 +36,9 @@ static inline bool vma_is_dax(struct vm_area_struct *vma)
{
return vma->vm_file && IS_DAX(vma->vm_file->f_mapping->host);
}
+
+static inline bool dax_mapping(struct address_space *mapping)
+{
+ return mapping->host && IS_DAX(mapping->host);
+}
#endif
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 3aa5142..b9ac534 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -433,6 +433,7 @@ struct address_space {
/* Protected by tree_lock together with the radix tree */
unsigned long nrpages; /* number of total pages */
unsigned long nrshadows; /* number of shadow entries */
+ unsigned long nrdax; /* number of DAX entries */
pgoff_t writeback_index;/* writeback starts here */
const struct address_space_operations *a_ops; /* methods */
unsigned long flags; /* error bits/gfp mask */
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 33170db..f793c99 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -51,6 +51,15 @@
#define RADIX_TREE_EXCEPTIONAL_ENTRY 2
#define RADIX_TREE_EXCEPTIONAL_SHIFT 2
+#define RADIX_DAX_MASK 0xf
+#define RADIX_DAX_PTE (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_PMD (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_TYPE(entry) ((__force unsigned long)entry & RADIX_DAX_MASK)
+#define RADIX_DAX_ADDR(entry) ((void __pmem *)((unsigned long)entry & \
+ ~RADIX_DAX_MASK))
+#define RADIX_DAX_ENTRY(addr, pmd) ((void *)((__force unsigned long)addr | \
+ (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
+
static inline int radix_tree_is_indirect_ptr(void *ptr)
{
return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
diff --git a/mm/filemap.c b/mm/filemap.c
index 1bb0076..167a4d9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -11,6 +11,7 @@
*/
#include <linux/export.h>
#include <linux/compiler.h>
+#include <linux/dax.h>
#include <linux/fs.h>
#include <linux/uaccess.h>
#include <linux/capability.h>
@@ -579,6 +580,12 @@ static int page_cache_tree_insert(struct address_space *mapping,
p = radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
if (!radix_tree_exceptional_entry(p))
return -EEXIST;
+
+ if (dax_mapping(mapping)) {
+ WARN_ON(1);
+ return -EINVAL;
+ }
+
if (shadowp)
*shadowp = p;
mapping->nrshadows--;
@@ -1242,9 +1249,9 @@ repeat:
if (radix_tree_deref_retry(page))
goto restart;
/*
- * A shadow entry of a recently evicted page,
- * or a swap entry from shmem/tmpfs. Return
- * it without attempting to raise page count.
+ * A shadow entry of a recently evicted page, a swap
+ * entry from shmem/tmpfs or a DAX entry. Return it
+ * without attempting to raise page count.
*/
goto export;
}
diff --git a/mm/truncate.c b/mm/truncate.c
index 76e35ad..1dc9f29 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -9,6 +9,7 @@
#include <linux/kernel.h>
#include <linux/backing-dev.h>
+#include <linux/dax.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/swap.h>
@@ -34,31 +35,39 @@ static void clear_exceptional_entry(struct address_space *mapping,
return;
spin_lock_irq(&mapping->tree_lock);
- /*
- * Regular page slots are stabilized by the page lock even
- * without the tree itself locked. These unlocked entries
- * need verification under the tree lock.
- */
- if (!__radix_tree_lookup(&mapping->page_tree, index, &node, &slot))
- goto unlock;
- if (*slot != entry)
- goto unlock;
- radix_tree_replace_slot(slot, NULL);
- mapping->nrshadows--;
- if (!node)
- goto unlock;
- workingset_node_shadows_dec(node);
- /*
- * Don't track node without shadow entries.
- *
- * Avoid acquiring the list_lru lock if already untracked.
- * The list_empty() test is safe as node->private_list is
- * protected by mapping->tree_lock.
- */
- if (!workingset_node_shadows(node) &&
- !list_empty(&node->private_list))
- list_lru_del(&workingset_shadow_nodes, &node->private_list);
- __radix_tree_delete_node(&mapping->page_tree, node);
+
+ if (dax_mapping(mapping)) {
+ if (radix_tree_delete_item(&mapping->page_tree, index, entry))
+ mapping->nrdax--;
+ } else {
+ /*
+ * Regular page slots are stabilized by the page lock even
+ * without the tree itself locked. These unlocked entries
+ * need verification under the tree lock.
+ */
+ if (!__radix_tree_lookup(&mapping->page_tree, index, &node,
+ &slot))
+ goto unlock;
+ if (*slot != entry)
+ goto unlock;
+ radix_tree_replace_slot(slot, NULL);
+ mapping->nrshadows--;
+ if (!node)
+ goto unlock;
+ workingset_node_shadows_dec(node);
+ /*
+ * Don't track node without shadow entries.
+ *
+ * Avoid acquiring the list_lru lock if already untracked.
+ * The list_empty() test is safe as node->private_list is
+ * protected by mapping->tree_lock.
+ */
+ if (!workingset_node_shadows(node) &&
+ !list_empty(&node->private_list))
+ list_lru_del(&workingset_shadow_nodes,
+ &node->private_list);
+ __radix_tree_delete_node(&mapping->page_tree, node);
+ }
unlock:
spin_unlock_irq(&mapping->tree_lock);
}
@@ -228,7 +237,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
int i;
cleancache_invalidate_inode(mapping);
- if (mapping->nrpages == 0 && mapping->nrshadows == 0)
+ if (mapping->nrpages == 0 && mapping->nrshadows == 0 &&
+ mapping->nrdax == 0)
return;
/* Offsets within partial pages */
@@ -423,7 +433,7 @@ void truncate_inode_pages_final(struct address_space *mapping)
smp_rmb();
nrshadows = mapping->nrshadows;
- if (nrpages || nrshadows) {
+ if (nrpages || nrshadows || mapping->nrdax) {
/*
* As truncation uses a lockless tree lookup, cycle
* the tree lock to make sure any ongoing tree
--
2.5.0
Add find_get_entries_tag() to the family of functions that include
find_get_entries(), find_get_pages() and find_get_pages_tag(). This is
needed for DAX dirty page handling because we need a list of both page
offsets and radix tree entries ('indices' and 'entries' in this function)
that are marked with the PAGECACHE_TAG_TOWRITE tag.
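For reference, the expected calling pattern looks roughly like this. It is a
sketch modeled on the DAX writeback loop added later in this series;
handle_entry() and start_index are placeholders:

	pgoff_t indices[PAGEVEC_SIZE];
	struct pagevec pvec;
	int i;

	pagevec_init(&pvec, 0);
	while (1) {
		pvec.nr = find_get_entries_tag(mapping, start_index,
				PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
				pvec.pages, indices);
		if (pvec.nr == 0)
			break;

		for (i = 0; i < pvec.nr; i++)
			handle_entry(mapping, indices[i], pvec.pages[i]);
	}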
Signed-off-by: Ross Zwisler <[email protected]>
---
include/linux/pagemap.h | 3 +++
mm/filemap.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 71 insertions(+)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index 26eabf5..4db0425 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -361,6 +361,9 @@ unsigned find_get_pages_contig(struct address_space *mapping, pgoff_t start,
unsigned int nr_pages, struct page **pages);
unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
int tag, unsigned int nr_pages, struct page **pages);
+unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
+ int tag, unsigned int nr_entries,
+ struct page **entries, pgoff_t *indices);
struct page *grab_cache_page_write_begin(struct address_space *mapping,
pgoff_t index, unsigned flags);
diff --git a/mm/filemap.c b/mm/filemap.c
index 167a4d9..99dfbc9 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1498,6 +1498,74 @@ repeat:
}
EXPORT_SYMBOL(find_get_pages_tag);
+/**
+ * find_get_entries_tag - find and return entries that match @tag
+ * @mapping: the address_space to search
+ * @start: the starting page cache index
+ * @tag: the tag index
+ * @nr_entries: the maximum number of entries
+ * @entries: where the resulting entries are placed
+ * @indices: the cache indices corresponding to the entries in @entries
+ *
+ * Like find_get_entries, except we only return entries which are tagged with
+ * @tag.
+ */
+unsigned find_get_entries_tag(struct address_space *mapping, pgoff_t start,
+ int tag, unsigned int nr_entries,
+ struct page **entries, pgoff_t *indices)
+{
+ void **slot;
+ unsigned int ret = 0;
+ struct radix_tree_iter iter;
+
+ if (!nr_entries)
+ return 0;
+
+ rcu_read_lock();
+restart:
+ radix_tree_for_each_tagged(slot, &mapping->page_tree,
+ &iter, start, tag) {
+ struct page *page;
+repeat:
+ page = radix_tree_deref_slot(slot);
+ if (unlikely(!page))
+ continue;
+ if (radix_tree_exception(page)) {
+ if (radix_tree_deref_retry(page)) {
+ /*
+ * Transient condition which can only trigger
+ * when entry at index 0 moves out of or back
+ * to root: none yet gotten, safe to restart.
+ */
+ goto restart;
+ }
+
+ /*
+ * A shadow entry of a recently evicted page, a swap
+ * entry from shmem/tmpfs or a DAX entry. Return it
+ * without attempting to raise page count.
+ */
+ goto export;
+ }
+ if (!page_cache_get_speculative(page))
+ goto repeat;
+
+ /* Has the page moved? */
+ if (unlikely(page != *slot)) {
+ page_cache_release(page);
+ goto repeat;
+ }
+export:
+ indices[ret] = iter.index;
+ entries[ret] = page;
+ if (++ret == nr_entries)
+ break;
+ }
+ rcu_read_unlock();
+ return ret;
+}
+EXPORT_SYMBOL(find_get_entries_tag);
+
/*
* CD/DVDs are error prone. When a medium error occurs, the driver may fail
* a _large_ part of the i/o request. Imagine the worst scenario:
--
2.5.0
To properly handle fsync/msync in an efficient way, DAX needs to track dirty
pages so that it can flush them durably to media on demand.
The tracking of dirty pages is done via the radix tree in struct
address_space. This radix tree is already used by the page writeback
infrastructure for tracking dirty pages associated with an open file, and
it already has support for exceptional (non struct page*) entries. We
build upon these features to add exceptional entries to the radix tree for
DAX dirty PMD or PTE pages at fault time.
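At a high level, the flush path added here is reached from the normal data
integrity machinery roughly as follows (a simplified sketch of the call flow,
not a literal call graph):

	/*
	 * fsync()/msync()
	 *   filemap_write_and_wait_range()
	 *     dax_writeback_mapping_range()    if dax_mapping() && nrdax
	 *       tag_pages_for_writeback()      tag dirty entries TOWRITE
	 *       find_get_entries_tag()         collect the tagged entries
	 *       dax_writeback_one()            per entry: clear TOWRITE, then
	 *                                      wb_cache_pmem() the PTE/PMD range
	 *       wmb_pmem()                     single ordering point at the end
	 */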
Signed-off-by: Ross Zwisler <[email protected]>
---
fs/dax.c | 147 +++++++++++++++++++++++++++++++++++++++++++++++++---
include/linux/dax.h | 2 +
mm/filemap.c | 3 ++
3 files changed, 146 insertions(+), 6 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 43671b6..c65b0e4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -24,6 +24,7 @@
#include <linux/memcontrol.h>
#include <linux/mm.h>
#include <linux/mutex.h>
+#include <linux/pagevec.h>
#include <linux/pmem.h>
#include <linux/sched.h>
#include <linux/uio.h>
@@ -289,6 +290,131 @@ static int copy_user_bh(struct page *to, struct buffer_head *bh,
return 0;
}
+static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
+ void __pmem *addr, bool pmd_entry, bool dirty)
+{
+ struct radix_tree_root *page_tree = &mapping->page_tree;
+ int error = 0;
+ void *entry;
+
+ __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+ spin_lock_irq(&mapping->tree_lock);
+ entry = radix_tree_lookup(page_tree, index);
+
+ if (entry) {
+ if (!pmd_entry || RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+ goto dirty;
+ radix_tree_delete(&mapping->page_tree, index);
+ mapping->nrdax--;
+ }
+
+ if (!addr || RADIX_DAX_TYPE(addr)) {
+ WARN_ONCE(1, "%s: invalid address %p\n", __func__, addr);
+ goto unlock;
+ }
+
+ error = radix_tree_insert(page_tree, index,
+ RADIX_DAX_ENTRY(addr, pmd_entry));
+ if (error)
+ goto unlock;
+
+ mapping->nrdax++;
+ dirty:
+ if (dirty)
+ radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
+ unlock:
+ spin_unlock_irq(&mapping->tree_lock);
+ return error;
+}
+
+static void dax_writeback_one(struct address_space *mapping, pgoff_t index,
+ void *entry)
+{
+ struct radix_tree_root *page_tree = &mapping->page_tree;
+ int type = RADIX_DAX_TYPE(entry);
+ struct radix_tree_node *node;
+ void **slot;
+
+ if (type != RADIX_DAX_PTE && type != RADIX_DAX_PMD) {
+ WARN_ON_ONCE(1);
+ return;
+ }
+
+ spin_lock_irq(&mapping->tree_lock);
+ /*
+ * Regular page slots are stabilized by the page lock even
+ * without the tree itself locked. These unlocked entries
+ * need verification under the tree lock.
+ */
+ if (!__radix_tree_lookup(page_tree, index, &node, &slot))
+ goto unlock;
+ if (*slot != entry)
+ goto unlock;
+
+ /* another fsync thread may have already written back this entry */
+ if (!radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
+ goto unlock;
+
+ radix_tree_tag_clear(page_tree, index, PAGECACHE_TAG_TOWRITE);
+
+ if (type == RADIX_DAX_PMD)
+ wb_cache_pmem(RADIX_DAX_ADDR(entry), PMD_SIZE);
+ else
+ wb_cache_pmem(RADIX_DAX_ADDR(entry), PAGE_SIZE);
+ unlock:
+ spin_unlock_irq(&mapping->tree_lock);
+}
+
+/*
+ * Flush the mapping to the persistent domain within the byte range of [start,
+ * end]. This is required by data integrity operations to ensure file data is
+ * on persistent storage prior to completion of the operation.
+ */
+void dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
+ loff_t end)
+{
+ struct inode *inode = mapping->host;
+ pgoff_t indices[PAGEVEC_SIZE];
+ pgoff_t start_page, end_page;
+ struct pagevec pvec;
+ void *entry;
+ int i;
+
+ if (inode->i_blkbits != PAGE_SHIFT) {
+ WARN_ON_ONCE(1);
+ return;
+ }
+
+ rcu_read_lock();
+ entry = radix_tree_lookup(&mapping->page_tree, start & PMD_MASK);
+ rcu_read_unlock();
+
+ /* see if the start of our range is covered by a PMD entry */
+ if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+ start &= PMD_MASK;
+
+ start_page = start >> PAGE_CACHE_SHIFT;
+ end_page = end >> PAGE_CACHE_SHIFT;
+
+ tag_pages_for_writeback(mapping, start_page, end_page);
+
+ pagevec_init(&pvec, 0);
+ while (1) {
+ pvec.nr = find_get_entries_tag(mapping, start_page,
+ PAGECACHE_TAG_TOWRITE, PAGEVEC_SIZE,
+ pvec.pages, indices);
+
+ if (pvec.nr == 0)
+ break;
+
+ for (i = 0; i < pvec.nr; i++)
+ dax_writeback_one(mapping, indices[i], pvec.pages[i]);
+ }
+ wmb_pmem();
+}
+EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
+
static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
struct vm_area_struct *vma, struct vm_fault *vmf)
{
@@ -329,7 +455,11 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
}
error = vm_insert_mixed(vma, vaddr, pfn);
+ if (error)
+ goto out;
+ error = dax_radix_entry(mapping, vmf->pgoff, addr, false,
+ vmf->flags & FAULT_FLAG_WRITE);
out:
i_mmap_unlock_read(mapping);
@@ -452,6 +582,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
delete_from_page_cache(page);
unlock_page(page);
page_cache_release(page);
+ page = NULL;
}
/*
@@ -539,7 +670,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
pgoff_t size, pgoff;
sector_t block, sector;
unsigned long pfn;
- int result = 0;
+ int error, result = 0;
/* dax pmd mappings are broken wrt gup and fork */
if (!IS_ENABLED(CONFIG_FS_DAX_PMD))
@@ -651,6 +782,13 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
}
result |= vmf_insert_pfn_pmd(vma, address, pmd, pfn, write);
+
+ if (write) {
+ error = dax_radix_entry(mapping, pgoff, kaddr, true,
+ true);
+ if (error)
+ goto fallback;
+ }
}
out:
@@ -702,15 +840,12 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
* dax_pfn_mkwrite - handle first write to DAX page
* @vma: The virtual memory area where the fault occurred
* @vmf: The description of the fault
- *
*/
int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
- struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+ struct file *file = vma->vm_file;
- sb_start_pagefault(sb);
- file_update_time(vma->vm_file);
- sb_end_pagefault(sb);
+ dax_radix_entry(file->f_mapping, vmf->pgoff, NULL, false, true);
return VM_FAULT_NOPAGE;
}
EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e9d57f68..11eb183 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -41,4 +41,6 @@ static inline bool dax_mapping(struct address_space *mapping)
{
return mapping->host && IS_DAX(mapping->host);
}
+void dax_writeback_mapping_range(struct address_space *mapping, loff_t start,
+ loff_t end);
#endif
diff --git a/mm/filemap.c b/mm/filemap.c
index 99dfbc9..9577783 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -482,6 +482,9 @@ int filemap_write_and_wait_range(struct address_space *mapping,
{
int err = 0;
+ if (dax_mapping(mapping) && mapping->nrdax)
+ dax_writeback_mapping_range(mapping, lstart, lend);
+
if (mapping->nrpages) {
err = __filemap_fdatawrite_range(mapping, lstart, lend,
WB_SYNC_ALL);
--
2.5.0
To properly support the new DAX fsync/msync infrastructure, filesystems
need to call dax_pfn_mkwrite() so that DAX can track when user pages are
dirtied.
Signed-off-by: Ross Zwisler <[email protected]>
---
fs/ext2/file.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 11a42c5..2c88d68 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -102,8 +102,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
{
struct inode *inode = file_inode(vma->vm_file);
struct ext2_inode_info *ei = EXT2_I(inode);
- int ret = VM_FAULT_NOPAGE;
loff_t size;
+ int ret;
sb_start_pagefault(inode->i_sb);
file_update_time(vma->vm_file);
@@ -113,6 +113,8 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
if (vmf->pgoff >= size)
ret = VM_FAULT_SIGBUS;
+ else
+ ret = dax_pfn_mkwrite(vma, vmf);
up_read(&ei->dax_sem);
sb_end_pagefault(inode->i_sb);
--
2.5.0
To properly support the new DAX fsync/msync infrastructure, filesystems
need to call dax_pfn_mkwrite() so that DAX can track when user pages are
dirtied.
Signed-off-by: Ross Zwisler <[email protected]>
---
fs/ext4/file.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 749b222..8c8965c 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -291,8 +291,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
{
struct inode *inode = file_inode(vma->vm_file);
struct super_block *sb = inode->i_sb;
- int ret = VM_FAULT_NOPAGE;
loff_t size;
+ int ret;
sb_start_pagefault(sb);
file_update_time(vma->vm_file);
@@ -300,6 +300,8 @@ static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
if (vmf->pgoff >= size)
ret = VM_FAULT_SIGBUS;
+ else
+ ret = dax_pfn_mkwrite(vma, vmf);
up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(sb);
--
2.5.0
To properly support the new DAX fsync/msync infrastructure, filesystems
need to call dax_pfn_mkwrite() so that DAX can track when user pages are
dirtied.
Signed-off-by: Ross Zwisler <[email protected]>
---
fs/xfs/xfs_file.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index f5392ab..40ffbb1 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1603,9 +1603,8 @@ xfs_filemap_pmd_fault(
/*
* pfn_mkwrite was originally inteneded to ensure we capture time stamp
* updates on write faults. In reality, it's need to serialise against
- * truncate similar to page_mkwrite. Hence we open-code dax_pfn_mkwrite()
- * here and cycle the XFS_MMAPLOCK_SHARED to ensure we serialise the fault
- * barrier in place.
+ * truncate similar to page_mkwrite. Hence we cycle the XFS_MMAPLOCK_SHARED
+ * to ensure we serialise the fault barrier in place.
*/
static int
xfs_filemap_pfn_mkwrite(
@@ -1628,6 +1627,8 @@ xfs_filemap_pfn_mkwrite(
size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
if (vmf->pgoff >= size)
ret = VM_FAULT_SIGBUS;
+ else if (IS_DAX(inode))
+ ret = dax_pfn_mkwrite(vma, vmf);
xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
sb_end_pagefault(inode->i_sb);
return ret;
--
2.5.0
On Tue, Dec 8, 2015 at 11:18 AM, Ross Zwisler
<[email protected]> wrote:
> Add find_get_entries_tag() to the family of functions that include
> find_get_entries(), find_get_pages() and find_get_pages_tag(). This is
> needed for DAX dirty page handling because we need a list of both page
> offsets and radix tree entries ('indices' and 'entries' in this function)
> that are marked with the PAGECACHE_TAG_TOWRITE tag.
>
> Signed-off-by: Ross Zwisler <[email protected]>
<>
Why does this mostly duplicate find_get_entries()?
Surely find_get_entries() can be implemented as a special case of
find_get_entries_tag().
On Wed, Dec 09, 2015 at 11:44:16AM -0800, Dan Williams wrote:
> On Tue, Dec 8, 2015 at 11:18 AM, Ross Zwisler
> <[email protected]> wrote:
> > Add find_get_entries_tag() to the family of functions that include
> > find_get_entries(), find_get_pages() and find_get_pages_tag(). This is
> > needed for DAX dirty page handling because we need a list of both page
> > offsets and radix tree entries ('indices' and 'entries' in this function)
> > that are marked with the PAGECACHE_TAG_TOWRITE tag.
> >
> > Signed-off-by: Ross Zwisler <[email protected]>
<>
> Why does this mostly duplicate find_get_entries()?
>
> Surely find_get_entries() can be implemented as a special case of
> find_get_entries_tag().
I'm adding find_get_entries_tag() to the family of functions that already
exist and include find_get_entries(), find_get_pages(),
find_get_pages_contig() and find_get_pages_tag().
These functions all contain very similar code with small changes to the
internal looping based on whether you're looking through all radix slots or
only the ones that match a certain tag (radix_tree_for_each_slot() vs
radix_tree_for_each_tagged()).
We already have find_get_pages() to get all pages in a range and
find_get_pages_tag() to get all pages in the range with a certain tag. We
have find_get_entries() to get all pages and indices for a given range, but we
are currently missing find_get_entries_tag() to do that same search based on a
tag, which is what I'm adding.
I agree that we could probably figure out a way to combine the code for
find_get_entries() with find_get_entries_tag(), as we could do for the
existing functions find_get_pages() and find_get_pages_tag(). I think we
should probably add find_get_entries_tag() per this patch, though, and then
decide whether to do any combining later as a separate step.
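For reference, the two iteration patterns being compared look like this
(standard radix tree iterator macros; the loop bodies are placeholders for
the shared deref/retry/export logic):

	struct radix_tree_iter iter;
	void **slot;

	/* visit every populated slot starting at 'start' */
	radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
		/* ... */
	}

	/* visit only the slots that have 'tag' set */
	radix_tree_for_each_tagged(slot, &mapping->page_tree, &iter, start, tag) {
		/* ... */
	}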
On Thu, Dec 10, 2015 at 12:24 PM, Ross Zwisler
<[email protected]> wrote:
> On Wed, Dec 09, 2015 at 11:44:16AM -0800, Dan Williams wrote:
>> On Tue, Dec 8, 2015 at 11:18 AM, Ross Zwisler
>> <[email protected]> wrote:
> <>
>> Why does this mostly duplicate find_get_entries()?
>>
>> Surely find_get_entries() can be implemented as a special case of
>> find_get_entries_tag().
>
> I'm adding find_get_entries_tag() to the family of functions that already
> exist and include find_get_entries(), find_get_pages(),
> find_get_pages_contig() and find_get_pages_tag().
>
> These functions all contain very similar code with small changes to the
> internal looping based on whether you're looking through all radix slots or
> only the ones that match a certain tag (radix_tree_for_each_slot() vs
> radix_tree_for_each_tagged()).
>
> We already have find_get_pages() to get all pages in a range and
> find_get_pages_tag() to get all pages in the range with a certain tag. We
> have find_get_entries() to get all pages and indices for a given range, but we
> are currently missing find_get_entries_tag() to do that same search based on a
> tag, which is what I'm adding.
>
> I agree that we could probably figure out a way to combine the code for
> find_get_entries() with find_get_entries_tag(), as we could do for the
> existing functions find_get_pages() and find_get_pages_tag(). I think we
> should probably add find_get_entries_tag() per this patch, though, and then
> decide whether to do any combining later as a separate step.
Ok, sounds good to me.
On Tue 08-12-15 12:18:40, Ross Zwisler wrote:
> Add support for tracking dirty DAX entries in the struct address_space
> radix tree. This tree is already used for dirty page writeback, and it
> already supports the use of exceptional (non struct page*) entries.
>
> In order to properly track dirty DAX pages we will insert new exceptional
> entries into the radix tree that represent dirty DAX PTE or PMD pages.
> These exceptional entries will also contain the writeback addresses for the
> PTE or PMD faults that we can use at fsync/msync time.
>
> There are currently two types of exceptional entries (shmem and shadow)
> that can be placed into the radix tree, and this adds a third. There
> shouldn't be any collisions between these various exceptional entries
> because only one type of exceptional entry should be able to be found in a
> radix tree at a time depending on how it is being used.
I was thinking about this and I'm not sure the different uses of exceptional
entries cannot collide. DAX uses the page cache for read mappings of holes.
When memory pressure happens, such a page can get evicted again and its entry
in the radix tree will get replaced with a shadow entry. So shadow entries
*can* exist in DAX mappings. Thus at least your change to
clear_exceptional_entry() looks wrong to me.
Also, when we'd like to insert a DAX radix tree entry, we have to account for
the fact that there can be a shadow entry in place and we have to tear it
down properly.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue 08-12-15 12:18:41, Ross Zwisler wrote:
> Add find_get_entries_tag() to the family of functions that include
> find_get_entries(), find_get_pages() and find_get_pages_tag(). This is
> needed for DAX dirty page handling because we need a list of both page
> offsets and radix tree entries ('indices' and 'entries' in this function)
> that are marked with the PAGECACHE_TAG_TOWRITE tag.
>
> Signed-off-by: Ross Zwisler <[email protected]>
The patch looks good to me. You can add:
Reviewed-by: Jan Kara <[email protected]>
But I agree with Daniel that some refactoring to remove common code would
be good.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri, Dec 18, 2015 at 10:01:10AM +0100, Jan Kara wrote:
> On Tue 08-12-15 12:18:40, Ross Zwisler wrote:
<>
>
> I was thinking about this and I'm not sure the different uses of exceptional
> entries cannot collide. DAX uses the page cache for read mappings of holes.
> When memory pressure happens, such a page can get evicted again and its entry
> in the radix tree will get replaced with a shadow entry. So shadow entries
> *can* exist in DAX mappings. Thus at least your change to
> clear_exceptional_entry() looks wrong to me.
>
> Also, when we'd like to insert a DAX radix tree entry, we have to account for
> the fact that there can be a shadow entry in place and we have to tear it
> down properly.
You are right, thank you for catching this.
I think the correct thing to do is to just explicitly disallow shadow entries
in the radix trees of DAX mappings. As you say, the only usage is to track
zero-page mappings for reading holes, which get minimal benefit from shadow
entries, and this choice makes the logic both in clear_exceptional_entry()
and in the rest of the DAX code simpler.
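One minimal way to do that (a sketch only, not necessarily what v5 ends up
doing) is to skip installing a shadow entry whenever a page is evicted from a
DAX mapping, so the radix tree of a DAX mapping only ever holds struct page
pointers or DAX exceptional entries:

	/* hypothetical hunk in the page cache eviction path */
	if (dax_mapping(mapping))
		shadow = NULL;	/* never store shadow entries for DAX */
	page_cache_tree_delete(mapping, page, shadow);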
I've sent out a v5 that fixes this issue.