2016-12-12 16:47:02

by Jan Kara

Subject: [PATCH 0/6 v3] dax: Page invalidation fixes

Hello,

this is the third revision of my fixes for races when invalidating hole pages in
DAX mappings. See the changelogs for details. The series is based on my patches
to write-protect DAX PTEs, which are currently carried in the mm tree. This is a
hard dependency because we really need to closely track dirtiness (and
cleanness!) of radix tree entries in DAX mappings in order to avoid discarding
valid dirty bits, which would lead to missed cache flushes on fsync(2).

The series has passed xfstests for xfs and ext4 in both DAX and non-DAX modes.

Johannes, are you OK with patch 2/6 in its current form? I'd like to push these
patches to some tree once the DAX write-protection patches are merged. I'm
hoping to get at least the first three patches merged for 4.10-rc2... Thanks!

Changes since v2:
* Added Reviewed-by tags
* Fixed commit message of patch 3
* Slightly simplified dax_iomap_pmd_fault()
* Renamed truncation functions to better express what they do

Changes since v1:
* Rebased on top of patches in mm tree
* Added some Reviewed-by tags
* Renamed some functions based on review feedback

Honza



2016-12-12 16:47:03

by Jan Kara

Subject: [PATCH 1/6] ext2: Return BH_New buffers for zeroed blocks

So far we did not return BH_New buffers from ext2_get_blocks() when we
allocated and zeroed out a block for a DAX inode, in order to avoid racy
zeroing in the DAX code. That zeroing is gone these days, so we can remove
the workaround.

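For context, here is roughly how this propagates once the iomap path is used
(a simplified sketch of the assumed ext2_iomap_begin() plumbing; this is not
part of the diff below):

        /* sketch: in ext2_iomap_begin(), assumed plumbing */
        bool new = false, boundary = false;
        u32 bno;
        int ret;

        ret = ext2_get_blocks(inode, first_block, max_blocks, &bno,
                              &new, &boundary, flags & IOMAP_WRITE);
        if (ret < 0)
                return ret;
        /* ... fill in the other iomap fields ... */
        if (new)
                iomap->flags |= IOMAP_F_NEW;    /* consumed by patch 3/6 */
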
Reviewed-by: Ross Zwisler <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext2/inode.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 046b642f3585..e626fe892c01 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -754,9 +754,8 @@ static int ext2_get_blocks(struct inode *inode,
mutex_unlock(&ei->truncate_mutex);
goto cleanup;
}
- } else {
- *new = true;
}
+ *new = true;

ext2_splice_branch(inode, iblock, partial, indirect_blks, count);
mutex_unlock(&ei->truncate_mutex);
--
2.10.2


2016-12-12 16:47:05

by Jan Kara

Subject: [PATCH 3/6] dax: Avoid page invalidation races and unnecessary radix tree traversals

Currently dax_iomap_rw() takes care of invalidating page tables and
evicting hole pages from the radix tree when a write(2) to the file
happens. This invalidation is only necessary when the write(2) results in
some block allocation. Furthermore, in its current place the invalidation
is racy with respect to a page fault instantiating a hole page just after
we have invalidated it.

So perform the page invalidation inside dax_iomap_actor(), where we can
do it only when really necessary and after blocks have been allocated, so
nobody will be instantiating new hole pages anymore.

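To spell out the race being closed here (an illustrative timeline, not code
from this patch):

        /*
         * Old placement, invalidation done in dax_iomap_rw() before any
         * block allocation:
         *
         *   CPU0: write(2)                      CPU1: read fault
         *
         *   invalidate_inode_pages2_range()
         *                                       instantiates a hole page,
         *                                       inserts it into radix tree
         *   ->iomap_begin() allocates block
         *   data copied in via DAX
         *
         * The freshly instantiated hole page now shadows the block that
         * write(2) filled, so mmap readers keep seeing zeroes.  Invalidating
         * inside dax_iomap_actor(), after the IOMAP_F_NEW block exists,
         * closes this window: faults can no longer get a hole there.
         */
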
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 28 +++++++++++-----------------
1 file changed, 11 insertions(+), 17 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 71863f0d51f7..97858dd5dab6 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -986,6 +986,17 @@ dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
if (WARN_ON_ONCE(iomap->type != IOMAP_MAPPED))
return -EIO;

+ /*
+ * Write can allocate block for an area which has a hole page mapped
+ * into page tables. We have to tear down these mappings so that data
+ * written by write(2) is visible in mmap.
+ */
+ if ((iomap->flags & IOMAP_F_NEW) && inode->i_mapping->nrpages) {
+ invalidate_inode_pages2_range(inode->i_mapping,
+ pos >> PAGE_SHIFT,
+ (end - 1) >> PAGE_SHIFT);
+ }
+
while (pos < end) {
unsigned offset = pos & (PAGE_SIZE - 1);
struct blk_dax_ctl dax = { 0 };
@@ -1044,23 +1055,6 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
if (iov_iter_rw(iter) == WRITE)
flags |= IOMAP_WRITE;

- /*
- * Yes, even DAX files can have page cache attached to them: A zeroed
- * page is inserted into the pagecache when we have to serve a write
- * fault on a hole. It should never be dirtied and can simply be
- * dropped from the pagecache once we get real data for the page.
- *
- * XXX: This is racy against mmap, and there's nothing we can do about
- * it. We'll eventually need to shift this down even further so that
- * we can check if we allocated blocks over a hole first.
- */
- if (mapping->nrpages) {
- ret = invalidate_inode_pages2_range(mapping,
- pos >> PAGE_SHIFT,
- (pos + iov_iter_count(iter) - 1) >> PAGE_SHIFT);
- WARN_ON_ONCE(ret);
- }
-
while (iov_iter_count(iter)) {
ret = iomap_apply(inode, pos, iov_iter_count(iter), flags, ops,
iter, dax_iomap_actor);
--
2.10.2


2016-12-12 16:47:25

by Jan Kara

Subject: [PATCH 5/6] dax: Call ->iomap_begin without entry lock during dax fault

Currently the ->iomap_begin() handler is called with the entry lock held. If
the filesystem held any locks between ->iomap_begin() and ->iomap_end()
(such as ext4, which will want to hold a transaction open), this would cause
a lock inversion with iomap_apply() from the standard IO path, which first
calls ->iomap_begin() and only then calls the ->actor() callback that grabs
entry locks for DAX (if it faults while copying from/to user-provided
buffers).

Fix the problem by nesting the grabbing of the entry lock inside the
->iomap_begin() - ->iomap_end() pair.

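The resulting ordering in dax_iomap_fault() then looks roughly like this
(simplified sketch of the code below; error paths and the PMD variant
omitted):

        error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
        if (error)
                return dax_fault_return(error);

        entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
        /* ... handle the fault: insert mapping, load hole, COW ... */
        put_locked_mapping_entry(mapping, vmf->pgoff, entry);

        if (ops->iomap_end)
                ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap);
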
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 121 ++++++++++++++++++++++++++++++++++-----------------------------
1 file changed, 66 insertions(+), 55 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index e186bba0a642..51b03e91d3e2 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1079,6 +1079,15 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
}
EXPORT_SYMBOL_GPL(dax_iomap_rw);

+static int dax_fault_return(int error)
+{
+ if (error == 0)
+ return VM_FAULT_NOPAGE;
+ if (error == -ENOMEM)
+ return VM_FAULT_OOM;
+ return VM_FAULT_SIGBUS;
+}
+
/**
* dax_iomap_fault - handle a page fault on a DAX file
* @vma: The virtual memory area where the fault occurred
@@ -1111,12 +1120,6 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
if (pos >= i_size_read(inode))
return VM_FAULT_SIGBUS;

- entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
- if (IS_ERR(entry)) {
- error = PTR_ERR(entry);
- goto out;
- }
-
if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
flags |= IOMAP_WRITE;

@@ -1127,9 +1130,15 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
*/
error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
if (error)
- goto unlock_entry;
+ return dax_fault_return(error);
if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
- error = -EIO; /* fs corruption? */
+ vmf_ret = dax_fault_return(-EIO); /* fs corruption? */
+ goto finish_iomap;
+ }
+
+ entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
+ if (IS_ERR(entry)) {
+ vmf_ret = dax_fault_return(PTR_ERR(entry));
goto finish_iomap;
}

@@ -1152,13 +1161,13 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
}

if (error)
- goto finish_iomap;
+ goto error_unlock_entry;

__SetPageUptodate(vmf->cow_page);
vmf_ret = finish_fault(vmf);
if (!vmf_ret)
vmf_ret = VM_FAULT_DONE_COW;
- goto finish_iomap;
+ goto unlock_entry;
}

switch (iomap.type) {
@@ -1170,12 +1179,15 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
}
error = dax_insert_mapping(mapping, iomap.bdev, sector,
PAGE_SIZE, &entry, vma, vmf);
+ /* -EBUSY is fine, somebody else faulted on the same PTE */
+ if (error == -EBUSY)
+ error = 0;
break;
case IOMAP_UNWRITTEN:
case IOMAP_HOLE:
if (!(vmf->flags & FAULT_FLAG_WRITE)) {
vmf_ret = dax_load_hole(mapping, &entry, vmf);
- goto finish_iomap;
+ goto unlock_entry;
}
/*FALLTHRU*/
default:
@@ -1184,30 +1196,25 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
break;
}

- finish_iomap:
- if (ops->iomap_end) {
- if (error || (vmf_ret & VM_FAULT_ERROR)) {
- /* keep previous error */
- ops->iomap_end(inode, pos, PAGE_SIZE, 0, flags,
- &iomap);
- } else {
- error = ops->iomap_end(inode, pos, PAGE_SIZE,
- PAGE_SIZE, flags, &iomap);
- }
- }
+ error_unlock_entry:
+ vmf_ret = dax_fault_return(error) | major;
unlock_entry:
put_locked_mapping_entry(mapping, vmf->pgoff, entry);
- out:
- if (error == -ENOMEM)
- return VM_FAULT_OOM | major;
- /* -EBUSY is fine, somebody else faulted on the same PTE */
- if (error < 0 && error != -EBUSY)
- return VM_FAULT_SIGBUS | major;
- if (vmf_ret) {
- WARN_ON_ONCE(error); /* -EBUSY from ops->iomap_end? */
- return vmf_ret;
+ finish_iomap:
+ if (ops->iomap_end) {
+ int copied = PAGE_SIZE;
+
+ if (vmf_ret & VM_FAULT_ERROR)
+ copied = 0;
+ /*
+ * The fault is done by now and there's no way back (other
+ * thread may be already happily using PTE we have installed).
+ * Just ignore error from ->iomap_end since we cannot do much
+ * with it.
+ */
+ ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap);
}
- return VM_FAULT_NOPAGE | major;
+ return vmf_ret;
}
EXPORT_SYMBOL_GPL(dax_iomap_fault);

@@ -1332,16 +1339,6 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
goto fallback;

/*
- * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
- * PMD or a HZP entry. If it can't (because a 4k page is already in
- * the tree, for instance), it will return -EEXIST and we just fall
- * back to 4k entries.
- */
- entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
- if (IS_ERR(entry))
- goto fallback;
-
- /*
* Note that we don't use iomap_apply here. We aren't doing I/O, only
* setting up a mapping, so really we're using iomap_begin() as a way
* to look up our filesystem block.
@@ -1349,10 +1346,21 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
pos = (loff_t)pgoff << PAGE_SHIFT;
error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
if (error)
- goto unlock_entry;
+ goto fallback;
+
if (iomap.offset + iomap.length < pos + PMD_SIZE)
goto finish_iomap;

+ /*
+ * grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
+ * PMD or a HZP entry. If it can't (because a 4k page is already in
+ * the tree, for instance), it will return -EEXIST and we just fall
+ * back to 4k entries.
+ */
+ entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
+ if (IS_ERR(entry))
+ goto finish_iomap;
+
vmf.pgoff = pgoff;
vmf.flags = flags;
vmf.gfp_mask = mapping_gfp_mask(mapping) | __GFP_IO;
@@ -1365,7 +1373,7 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
case IOMAP_UNWRITTEN:
case IOMAP_HOLE:
if (WARN_ON_ONCE(write))
- goto finish_iomap;
+ goto unlock_entry;
result = dax_pmd_load_hole(vma, pmd, &vmf, address, &iomap,
&entry);
break;
@@ -1374,20 +1382,23 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
break;
}

+ unlock_entry:
+ put_locked_mapping_entry(mapping, pgoff, entry);
finish_iomap:
if (ops->iomap_end) {
- if (result == VM_FAULT_FALLBACK) {
- ops->iomap_end(inode, pos, PMD_SIZE, 0, iomap_flags,
- &iomap);
- } else {
- error = ops->iomap_end(inode, pos, PMD_SIZE, PMD_SIZE,
- iomap_flags, &iomap);
- if (error)
- result = VM_FAULT_FALLBACK;
- }
+ int copied = PMD_SIZE;
+
+ if (result == VM_FAULT_FALLBACK)
+ copied = 0;
+ /*
+ * The fault is done by now and there's no way back (other
+ * thread may be already happily using PMD we have installed).
+ * Just ignore error from ->iomap_end since we cannot do much
+ * with it.
+ */
+ ops->iomap_end(inode, pos, PMD_SIZE, copied, iomap_flags,
+ &iomap);
}
- unlock_entry:
- put_locked_mapping_entry(mapping, pgoff, entry);
fallback:
if (result == VM_FAULT_FALLBACK) {
split_huge_pmd(vma, pmd, address);
--
2.10.2


2016-12-12 16:47:25

by Jan Kara

Subject: [PATCH 6/6] ext4: Simplify DAX fault path

Now that dax_iomap_fault() calls ->iomap_begin() without the entry lock
held, we can start a transaction in ext4_iomap_begin() and thus simplify
ext4_dax_fault(). This also provides us with proper retries in case of
ENOSPC.

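For write faults, ext4_iomap_begin() can then do its own journal handling,
roughly as follows (an illustrative sketch only; 'credits' stands in for the
actual credits calculation and none of this is code from the diff below):

        if (flags & IOMAP_WRITE) {
                /*
                 * No DAX entry lock is held here anymore, so it is safe
                 * to block on the journal (and to retry on ENOSPC).
                 */
                handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
                                            credits);
                if (IS_ERR(handle))
                        return PTR_ERR(handle);
        }
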
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/file.c | 48 ++++++++++--------------------------------------
1 file changed, 10 insertions(+), 38 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b5f184493c57..d663d3d7c81c 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -258,7 +258,6 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
int result;
- handle_t *handle = NULL;
struct inode *inode = file_inode(vma->vm_file);
struct super_block *sb = inode->i_sb;
bool write = vmf->flags & FAULT_FLAG_WRITE;
@@ -266,24 +265,12 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
if (write) {
sb_start_pagefault(sb);
file_update_time(vma->vm_file);
- down_read(&EXT4_I(inode)->i_mmap_sem);
- handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
- EXT4_DATA_TRANS_BLOCKS(sb));
- } else
- down_read(&EXT4_I(inode)->i_mmap_sem);
-
- if (IS_ERR(handle))
- result = VM_FAULT_SIGBUS;
- else
- result = dax_iomap_fault(vma, vmf, &ext4_iomap_ops);
-
- if (write) {
- if (!IS_ERR(handle))
- ext4_journal_stop(handle);
- up_read(&EXT4_I(inode)->i_mmap_sem);
+ }
+ down_read(&EXT4_I(inode)->i_mmap_sem);
+ result = dax_iomap_fault(vma, vmf, &ext4_iomap_ops);
+ up_read(&EXT4_I(inode)->i_mmap_sem);
+ if (write)
sb_end_pagefault(sb);
- } else
- up_read(&EXT4_I(inode)->i_mmap_sem);

return result;
}
@@ -292,7 +279,6 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, unsigned int flags)
{
int result;
- handle_t *handle = NULL;
struct inode *inode = file_inode(vma->vm_file);
struct super_block *sb = inode->i_sb;
bool write = flags & FAULT_FLAG_WRITE;
@@ -300,27 +286,13 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
if (write) {
sb_start_pagefault(sb);
file_update_time(vma->vm_file);
- down_read(&EXT4_I(inode)->i_mmap_sem);
- handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
- ext4_chunk_trans_blocks(inode,
- PMD_SIZE / PAGE_SIZE));
- } else
- down_read(&EXT4_I(inode)->i_mmap_sem);
-
- if (IS_ERR(handle))
- result = VM_FAULT_SIGBUS;
- else {
- result = dax_iomap_pmd_fault(vma, addr, pmd, flags,
- &ext4_iomap_ops);
}

2016-12-12 16:47:06

by Jan Kara

Subject: [PATCH 4/6] dax: Finish fault completely when loading holes

The only case in which we do not finish the page fault completely is when
we are loading hole pages into the radix tree. Avoid this special case and
finish the fault inside the DAX fault handler in that case as well. This
will allow for easier iomap handling.

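In other words, instead of returning VM_FAULT_LOCKED and leaving completion
to the generic fault code, dax_load_hole() now completes the fault itself;
condensed from the change below:

        vmf->page = page;
        ret = finish_fault(vmf);        /* install the PTE ourselves */
        vmf->page = NULL;
        if (!ret) {
                /* grab reference for the PTE now referencing the page */
                get_page(page);
                return VM_FAULT_NOPAGE;
        }
        return ret;
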
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 27 ++++++++++++++++++---------
1 file changed, 18 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 97858dd5dab6..e186bba0a642 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -540,15 +540,16 @@ int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
* otherwise it will simply fall out of the page cache under memory
* pressure without ever having been dirtied.
*/
-static int dax_load_hole(struct address_space *mapping, void *entry,
+static int dax_load_hole(struct address_space *mapping, void **entry,
struct vm_fault *vmf)
{
struct page *page;
+ int ret;

/* Hole page already exists? Return it... */
- if (!radix_tree_exceptional_entry(entry)) {
- vmf->page = entry;
- return VM_FAULT_LOCKED;
+ if (!radix_tree_exceptional_entry(*entry)) {
+ page = *entry;
+ goto out;
}

/* This will replace locked radix tree entry with a hole page */
@@ -556,8 +557,17 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
vmf->gfp_mask | __GFP_ZERO);
if (!page)
return VM_FAULT_OOM;
+ out:
vmf->page = page;
- return VM_FAULT_LOCKED;
+ ret = finish_fault(vmf);
+ vmf->page = NULL;
+ *entry = page;
+ if (!ret) {
+ /* Grab reference for PTE that is now referencing the page */
+ get_page(page);
+ return VM_FAULT_NOPAGE;
+ }
+ return ret;
}

static int copy_user_dax(struct block_device *bdev, sector_t sector, size_t size,
@@ -1164,8 +1174,8 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
case IOMAP_UNWRITTEN:
case IOMAP_HOLE:
if (!(vmf->flags & FAULT_FLAG_WRITE)) {
- vmf_ret = dax_load_hole(mapping, entry, vmf);
- break;
+ vmf_ret = dax_load_hole(mapping, &entry, vmf);
+ goto finish_iomap;
}
/*FALLTHRU*/
default:
@@ -1186,8 +1196,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
}
}
unlock_entry:
- if (vmf_ret != VM_FAULT_LOCKED || error)
- put_locked_mapping_entry(mapping, vmf->pgoff, entry);
+ put_locked_mapping_entry(mapping, vmf->pgoff, entry);
out:
if (error == -ENOMEM)
return VM_FAULT_OOM | major;
--
2.10.2


2016-12-12 16:47:04

by Jan Kara

Subject: [PATCH 2/6] mm: Invalidate DAX radix tree entries only if appropriate

Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
just delete all exceptional radix tree entries they find. For DAX this
is not desirable, as we track cache dirtiness in these entries and when
they are evicted, we may fail to flush caches even though it is necessary.
This can manifest, for example, when we write to the same block both via
mmap and via write(2) (at different offsets) and fsync(2) then does not
properly flush CPU caches because the modification via write(2) was the
last one.

Create appropriate DAX functions to handle invalidation of DAX entries
for invalidate_inode_pages2_range() and invalidate_mapping_pages() and
wire them up into the corresponding mm functions.

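The failure mode, spelled out (an illustrative sequence; the dirty tags come
from the DAX write-protection series this depends on):

        /*
         * 1) store via mmap dirties block B
         *      -> radix tree entry tagged PAGECACHE_TAG_DIRTY
         * 2) write(2) to another offset in the same block
         *      -> invalidate_inode_pages2_range()
         *      -> old code deleted the exceptional entry unconditionally,
         *         losing the dirty tag
         * 3) fsync(2) finds no dirty entry
         *      -> CPU caches are never flushed and the mmap store can be
         *         lost on power failure
         */
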
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 71 +++++++++++++++++++++++++++++++++++++++++++-------
include/linux/dax.h | 3 +++
mm/truncate.c | 75 +++++++++++++++++++++++++++++++++++++++++++----------
3 files changed, 125 insertions(+), 24 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index b1fe228cd609..71863f0d51f7 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -452,16 +452,37 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
}

+static int __dax_invalidate_mapping_entry(struct address_space *mapping,
+ pgoff_t index, bool trunc)
+{
+ int ret = 0;
+ void *entry;
+ struct radix_tree_root *page_tree = &mapping->page_tree;
+
+ spin_lock_irq(&mapping->tree_lock);
+ entry = get_unlocked_mapping_entry(mapping, index, NULL);
+ if (!entry || !radix_tree_exceptional_entry(entry))
+ goto out;
+ if (!trunc &&
+ (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
+ radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE)))
+ goto out;
+ radix_tree_delete(page_tree, index);
+ mapping->nrexceptional--;
+ ret = 1;
+out:
+ put_unlocked_mapping_entry(mapping, index, entry);
+ spin_unlock_irq(&mapping->tree_lock);
+ return ret;
+}
/*
* Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
* entry to get unlocked before deleting it.
*/
int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
{
- void *entry;
+ int ret = __dax_invalidate_mapping_entry(mapping, index, true);

- spin_lock_irq(&mapping->tree_lock);
- entry = get_unlocked_mapping_entry(mapping, index, NULL);
/*
* This gets called from truncate / punch_hole path. As such, the caller
* must hold locks protecting against concurrent modifications of the
@@ -469,16 +490,46 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
* caller has seen exceptional entry for this index, we better find it
* at that index as well...
*/
- if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry))) {
- spin_unlock_irq(&mapping->tree_lock);
- return 0;
- }
- radix_tree_delete(&mapping->page_tree, index);
+ WARN_ON_ONCE(!ret);
+ return ret;
+}
+
+/*
+ * Invalidate exceptional DAX entry if easily possible. This handles DAX
+ * entries for invalidate_inode_pages() so we evict the entry only if we can
+ * do so without blocking.
+ */
+int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+ int ret = 0;
+ void *entry, **slot;
+ struct radix_tree_root *page_tree = &mapping->page_tree;
+
+ spin_lock_irq(&mapping->tree_lock);
+ entry = __radix_tree_lookup(page_tree, index, NULL, &slot);
+ if (!entry || !radix_tree_exceptional_entry(entry) ||
+ slot_locked(mapping, slot))
+ goto out;
+ if (radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_DIRTY) ||
+ radix_tree_tag_get(page_tree, index, PAGECACHE_TAG_TOWRITE))
+ goto out;
+ radix_tree_delete(page_tree, index);
mapping->nrexceptional--;
+ ret = 1;
+out:
spin_unlock_irq(&mapping->tree_lock);
- dax_wake_mapping_entry_waiter(mapping, index, entry, true);
+ if (ret)
+ dax_wake_mapping_entry_waiter(mapping, index, entry, true);
+ return ret;
+}

- return 1;
+/*
+ * Invalidate exceptional DAX entry if it is clean.
+ */
+int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
+ pgoff_t index)
+{
+ return __dax_invalidate_mapping_entry(mapping, index, false);
}

/*
diff --git a/include/linux/dax.h b/include/linux/dax.h
index f97bcfe79472..24ad71173995 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -41,6 +41,9 @@ ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
struct iomap_ops *ops);
int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
+int dax_invalidate_mapping_entry(struct address_space *mapping, pgoff_t index);
+int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
+ pgoff_t index);
void dax_wake_mapping_entry_waiter(struct address_space *mapping,
pgoff_t index, void *entry, bool wake_all);

diff --git a/mm/truncate.c b/mm/truncate.c
index fd97f1dbce29..dd7b24e083c5 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -24,20 +24,12 @@
#include <linux/rmap.h>
#include "internal.h"

-static void clear_exceptional_entry(struct address_space *mapping,
- pgoff_t index, void *entry)
+static void clear_shadow_entry(struct address_space *mapping, pgoff_t index,
+ void *entry)
{
struct radix_tree_node *node;
void **slot;

- /* Handled by shmem itself */
- if (shmem_mapping(mapping))
- return;
-
- if (dax_mapping(mapping)) {
- dax_delete_mapping_entry(mapping, index);
- return;
- }
spin_lock_irq(&mapping->tree_lock);
/*
* Regular page slots are stabilized by the page lock even
@@ -55,6 +47,56 @@ static void clear_exceptional_entry(struct address_space *mapping,
spin_unlock_irq(&mapping->tree_lock);
}

+/*
+ * Unconditionally remove exceptional entry. Usually called from truncate path.
+ */
+static void truncate_exceptional_entry(struct address_space *mapping,
+ pgoff_t index, void *entry)
+{
+ /* Handled by shmem itself */
+ if (shmem_mapping(mapping))
+ return;
+
+ if (dax_mapping(mapping)) {
+ dax_delete_mapping_entry(mapping, index);
+ return;
+ }
+ clear_shadow_entry(mapping, index, entry);
+}
+
+/*
+ * Invalidate exceptional entry if easily possible. This handles exceptional
+ * entries for invalidate_inode_pages() so for DAX it evicts only unlocked and
+ * clean entries.
+ */
+static int invalidate_exceptional_entry(struct address_space *mapping,
+ pgoff_t index, void *entry)
+{
+ /* Handled by shmem itself */
+ if (shmem_mapping(mapping))
+ return 1;
+ if (dax_mapping(mapping))
+ return dax_invalidate_mapping_entry(mapping, index);
+ clear_shadow_entry(mapping, index, entry);
+ return 1;
+}
+
+/*
+ * Invalidate exceptional entry if clean. This handles exceptional entries for
+ * invalidate_inode_pages2() so for DAX it evicts only clean entries.
+ */
+static int invalidate_exceptional_entry2(struct address_space *mapping,
+ pgoff_t index, void *entry)
+{
+ /* Handled by shmem itself */
+ if (shmem_mapping(mapping))
+ return 1;
+ if (dax_mapping(mapping))
+ return dax_invalidate_mapping_entry_sync(mapping, index);
+ clear_shadow_entry(mapping, index, entry);
+ return 1;
+}
+
/**
* do_invalidatepage - invalidate part or all of a page
* @page: the page which is affected
@@ -262,7 +304,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
break;

if (radix_tree_exceptional_entry(page)) {
- clear_exceptional_entry(mapping, index, page);
+ truncate_exceptional_entry(mapping, index,
+ page);
continue;
}

@@ -351,7 +394,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
}

if (radix_tree_exceptional_entry(page)) {
- clear_exceptional_entry(mapping, index, page);
+ truncate_exceptional_entry(mapping, index,
+ page);
continue;
}

@@ -470,7 +514,8 @@ unsigned long invalidate_mapping_pages(struct address_space *mapping,
break;

if (radix_tree_exceptional_entry(page)) {
- clear_exceptional_entry(mapping, index, page);
+ invalidate_exceptional_entry(mapping, index,
+ page);
continue;
}

@@ -592,7 +637,9 @@ int invalidate_inode_pages2_range(struct address_space *mapping,
break;

if (radix_tree_exceptional_entry(page)) {
- clear_exceptional_entry(mapping, index, page);
+ if (!invalidate_exceptional_entry2(mapping,
+ index, page))
+ ret = -EBUSY;
continue;
}

--
2.10.2


2016-12-12 17:50:55

by Johannes Weiner

Subject: Re: [PATCH 2/6] mm: Invalidate DAX radix tree entries only if appropriate

On Mon, Dec 12, 2016 at 05:47:04PM +0100, Jan Kara wrote:
> Currently invalidate_inode_pages2_range() and invalidate_mapping_pages()
> just delete all exceptional radix tree entries they find. For DAX this
> is not desirable as we track cache dirtiness in these entries and when
> they are evicted, we may not flush caches although it is necessary. This
> can for example manifest when we write to the same block both via mmap
> and via write(2) (to different offsets) and fsync(2) then does not
> properly flush CPU caches when modification via write(2) was the last
> one.
>
> Create appropriate DAX functions to handle invalidation of DAX entries
> for invalidate_inode_pages2_range() and invalidate_mapping_pages() and
> wire them up into the corresponding mm functions.
>
> Reviewed-by: Ross Zwisler <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>

Acked-by: Johannes Weiner <[email protected]>

2016-12-12 17:51:37

by Johannes Weiner

Subject: Re: [PATCH 0/6 v3] dax: Page invalidation fixes

On Mon, Dec 12, 2016 at 05:47:02PM +0100, Jan Kara wrote:
> Hello,
>
> this is the third revision of my fixes of races when invalidating hole pages in
> DAX mappings. See changelogs for details. The series is based on my patches to
> write-protect DAX PTEs which are currently carried in mm tree. This is a hard
> dependency because we really need to closely track dirtiness (and cleanness!)
> of radix tree entries in DAX mappings in order to avoid discarding valid dirty
> bits leading to missed cache flushes on fsync(2).
>
> The tests have passed xfstests for xfs and ext4 in DAX and non-DAX mode.
>
> Johannes, are you OK with patch 2/6 in its current form? I'd like to push these
> patches to some tree once DAX write-protection patches are merged. I'm hoping
> to get at least first three patches merged for 4.10-rc2... Thanks!

LGTM, thanks Jan

2016-12-13 11:52:09

by Jan Kara

Subject: Re: [PATCH 0/6 v3] dax: Page invalidation fixes

On Mon 12-12-16 17:47:02, Jan Kara wrote:
> Hello,
>
> this is the third revision of my fixes of races when invalidating hole pages in
> DAX mappings. See changelogs for details. The series is based on my patches to
> write-protect DAX PTEs which are currently carried in mm tree. This is a hard
> dependency because we really need to closely track dirtiness (and cleanness!)
> of radix tree entries in DAX mappings in order to avoid discarding valid dirty
> bits leading to missed cache flushes on fsync(2).
>
> The tests have passed xfstests for xfs and ext4 in DAX and non-DAX mode.
>
> Johannes, are you OK with patch 2/6 in its current form? I'd like to push these
> patches to some tree once DAX write-protection patches are merged. I'm hoping
> to get at least first three patches merged for 4.10-rc2... Thanks!

OK, with the final ack from Johannes, and since this is mostly DAX stuff,
can we take this through the NVDIMM tree and push it to Linus either late in
the merge window or for -rc2? These patches require my DAX patches sitting in
the mm tree, so they can be included in any git tree only once those patches
land in Linus' tree (which may happen only once Dave and Ted push out their
stuff - this is the most convoluted merge window I've ever had to deal with
;-)... Dan?

Honza

>
> Changes since v2:
> * Added Reviewed-by tags
> * Fixed commit message of patch 3
> * Slightly simplified dax_iomap_pmd_fault()
> * Renamed truncation functions to express better what they do
>
> Changes since v1:
> * Rebased on top of patches in mm tree
> * Added some Reviewed-by tags
> * renamed some functions based on review feedback
>
> Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


2016-12-13 18:57:23

by Dan Williams

Subject: Re: [PATCH 0/6 v3] dax: Page invalidation fixes

On Tue, Dec 13, 2016 at 3:52 AM, Jan Kara <[email protected]> wrote:
> On Mon 12-12-16 17:47:02, Jan Kara wrote:
>> Hello,
>>
>> this is the third revision of my fixes of races when invalidating hole pages in
>> DAX mappings. See changelogs for details. The series is based on my patches to
>> write-protect DAX PTEs which are currently carried in mm tree. This is a hard
>> dependency because we really need to closely track dirtiness (and cleanness!)
>> of radix tree entries in DAX mappings in order to avoid discarding valid dirty
>> bits leading to missed cache flushes on fsync(2).
>>
>> The tests have passed xfstests for xfs and ext4 in DAX and non-DAX mode.
>>
>> Johannes, are you OK with patch 2/6 in its current form? I'd like to push these
>> patches to some tree once DAX write-protection patches are merged. I'm hoping
>> to get at least first three patches merged for 4.10-rc2... Thanks!
>
> OK, with the final ack from Johannes and since this is mostly DAX stuff,
> can we take this through NVDIMM tree and push to Linus either late in the
> merge window or for -rc2? These patches require my DAX patches sitting in mm
> tree so they can be included in any git tree only once those patches land
> in Linus' tree (which may happen only once Dave and Ted push out their
> stuff - this is the most convoluted merge window I'd ever to deal with ;-)...
> Dan?
>

I like the -rc2 plan better than sending a pull request based on some
random point in the middle of the merge window. I can give Linus a
heads up in my initial nvdimm pull request for -rc1 that for
coordination purposes we'll be sending this set of follow-on DAX
cleanups for -rc2.

2016-12-13 20:01:57

by Dave Chinner

Subject: Re: [PATCH 0/6 v3] dax: Page invalidation fixes

On Tue, Dec 13, 2016 at 12:52:09PM +0100, Jan Kara wrote:
> On Mon 12-12-16 17:47:02, Jan Kara wrote:
> > Hello,
> >
> > this is the third revision of my fixes of races when invalidating hole pages in
> > DAX mappings. See changelogs for details. The series is based on my patches to
> > write-protect DAX PTEs which are currently carried in mm tree. This is a hard
> > dependency because we really need to closely track dirtiness (and cleanness!)
> > of radix tree entries in DAX mappings in order to avoid discarding valid dirty
> > bits leading to missed cache flushes on fsync(2).
> >
> > The tests have passed xfstests for xfs and ext4 in DAX and non-DAX mode.
> >
> > Johannes, are you OK with patch 2/6 in its current form? I'd like to push these
> > patches to some tree once DAX write-protection patches are merged. I'm hoping
> > to get at least first three patches merged for 4.10-rc2... Thanks!
>
> OK, with the final ack from Johannes and since this is mostly DAX stuff,
> can we take this through NVDIMM tree and push to Linus either late in the
> merge window or for -rc2? These patches require my DAX patches sitting in mm
> tree so they can be included in any git tree only once those patches land
> in Linus' tree (which may happen only once Dave and Ted push out their
> stuff - this is the most convoluted merge window I'd ever to deal with ;-)...

And I'm waiting on Jens and the block tree before I send Linus
a pull request for all the stuff I have queued, because of the conflicts
with the iomap direct IO patches I've also got in the XFS tree... :/

Cheers,

Dave.
--
Dave Chinner
[email protected]


2016-12-13 20:42:31

by Theodore Ts'o

Subject: Re: [PATCH 0/6 v3] dax: Page invalidation fixes

On Tue, Dec 13, 2016 at 12:52:09PM +0100, Jan Kara wrote:
> OK, with the final ack from Johannes and since this is mostly DAX stuff,
> can we take this through NVDIMM tree and push to Linus either late in the
> merge window or for -rc2? These patches require my DAX patches sitting in mm
> tree so they can be included in any git tree only once those patches land
> in Linus' tree (which may happen only once Dave and Ted push out their
> stuff - this is the most convoluted merge window I'd ever to deal with ;-)...
> Dan?

I've sent out the pull request for ext4... which includes the
dax-4.10-iomap-pmd and fscrypt branches. Yes, convoluted. :-)

- Ted


2016-12-17 01:35:35

by Dan Williams

Subject: Re: [PATCH 0/6 v3] dax: Page invalidation fixes

On Tue, Dec 13, 2016 at 10:57 AM, Dan Williams <[email protected]> wrote:
> On Tue, Dec 13, 2016 at 3:52 AM, Jan Kara <[email protected]> wrote:
>> On Mon 12-12-16 17:47:02, Jan Kara wrote:
>>> Hello,
>>>
>>> this is the third revision of my fixes of races when invalidating hole pages in
>>> DAX mappings. See changelogs for details. The series is based on my patches to
>>> write-protect DAX PTEs which are currently carried in mm tree. This is a hard
>>> dependency because we really need to closely track dirtiness (and cleanness!)
>>> of radix tree entries in DAX mappings in order to avoid discarding valid dirty
>>> bits leading to missed cache flushes on fsync(2).
>>>
>>> The tests have passed xfstests for xfs and ext4 in DAX and non-DAX mode.
>>>
>>> Johannes, are you OK with patch 2/6 in its current form? I'd like to push these
>>> patches to some tree once DAX write-protection patches are merged. I'm hoping
>>> to get at least first three patches merged for 4.10-rc2... Thanks!
>>
>> OK, with the final ack from Johannes and since this is mostly DAX stuff,
>> can we take this through NVDIMM tree and push to Linus either late in the
>> merge window or for -rc2? These patches require my DAX patches sitting in mm
>> tree so they can be included in any git tree only once those patches land
>> in Linus' tree (which may happen only once Dave and Ted push out their
>> stuff - this is the most convoluted merge window I'd ever to deal with ;-)...
>> Dan?
>>
>
> I like the -rc2 plan better than sending a pull request based on some
> random point in the middle of the merge window. I can give Linus a
> heads up in my initial nvdimm pull request for -rc1 that for
> coordination purposes we'll be sending this set of follow-on DAX
> cleanups for -rc2.

So what's still pending for -rc2? I want to be explicit about what I'm
requesting Linus be prepared to receive after -rc1. The libnvdimm pull
request is very light this time around since I ended up deferring the
device-dax-subdivision topic until 4.11 and sub-section memory hotplug
didn't make the cutoff for -mm. We can spend some of that goodwill on
your patches ;-).

I can roll them into libnvdimm-for-next now for the integration
testing coverage, rebase to -rc1 when it's out, wait for your thumbs
up on the testing and send a pull request on the 23rd.

2016-12-17 01:49:47

by Dan Williams

Subject: Re: [PATCH 0/6 v3] dax: Page invalidation fixes

On Fri, Dec 16, 2016 at 5:35 PM, Dan Williams <[email protected]> wrote:
> On Tue, Dec 13, 2016 at 10:57 AM, Dan Williams <[email protected]> wrote:
>> On Tue, Dec 13, 2016 at 3:52 AM, Jan Kara <[email protected]> wrote:
>>> On Mon 12-12-16 17:47:02, Jan Kara wrote:
>>>> Hello,
>>>>
>>>> this is the third revision of my fixes of races when invalidating hole pages in
>>>> DAX mappings. See changelogs for details. The series is based on my patches to
>>>> write-protect DAX PTEs which are currently carried in mm tree. This is a hard
>>>> dependency because we really need to closely track dirtiness (and cleanness!)
>>>> of radix tree entries in DAX mappings in order to avoid discarding valid dirty
>>>> bits leading to missed cache flushes on fsync(2).
>>>>
>>>> The tests have passed xfstests for xfs and ext4 in DAX and non-DAX mode.
>>>>
>>>> Johannes, are you OK with patch 2/6 in its current form? I'd like to push these
>>>> patches to some tree once DAX write-protection patches are merged. I'm hoping
>>>> to get at least first three patches merged for 4.10-rc2... Thanks!
>>>
>>> OK, with the final ack from Johannes and since this is mostly DAX stuff,
>>> can we take this through NVDIMM tree and push to Linus either late in the
>>> merge window or for -rc2? These patches require my DAX patches sitting in mm
>>> tree so they can be included in any git tree only once those patches land
>>> in Linus' tree (which may happen only once Dave and Ted push out their
>>> stuff - this is the most convoluted merge window I'd ever to deal with ;-)...
>>> Dan?
>>>
>>
>> I like the -rc2 plan better than sending a pull request based on some
>> random point in the middle of the merge window. I can give Linus a
>> heads up in my initial nvdimm pull request for -rc1 that for
>> coordination purposes we'll be sending this set of follow-on DAX
>> cleanups for -rc2.
>
> So what's still pending for -rc2? I want to be explicit about what I'm
> requesting Linus be prepared to receive after -rc1. The libnvdimm pull
> request is very light this time around since I ended up deferring the
> device-dax-subdivision topic until 4.11 and sub-section memory hotplug
> didn't make the cutoff for -mm. We can spend some of that goodwill on
> your patches ;-).
>
> I can roll them into libnvdimm-for-next now for the integration
> testing coverage, rebase to -rc1 when it's out, wait for your thumbs
> up on the testing and send a pull request on the 23rd.

Sorry, I meant the 30th of December.


2016-12-19 09:56:23

by Jan Kara

Subject: Re: [PATCH 0/6 v3] dax: Page invalidation fixes

On Fri 16-12-16 17:35:35, Dan Williams wrote:
> On Tue, Dec 13, 2016 at 10:57 AM, Dan Williams <[email protected]> wrote:
> > On Tue, Dec 13, 2016 at 3:52 AM, Jan Kara <[email protected]> wrote:
> >> On Mon 12-12-16 17:47:02, Jan Kara wrote:
> >>> Hello,
> >>>
> >>> this is the third revision of my fixes of races when invalidating hole pages in
> >>> DAX mappings. See changelogs for details. The series is based on my patches to
> >>> write-protect DAX PTEs which are currently carried in mm tree. This is a hard
> >>> dependency because we really need to closely track dirtiness (and cleanness!)
> >>> of radix tree entries in DAX mappings in order to avoid discarding valid dirty
> >>> bits leading to missed cache flushes on fsync(2).
> >>>
> >>> The tests have passed xfstests for xfs and ext4 in DAX and non-DAX mode.
> >>>
> >>> Johannes, are you OK with patch 2/6 in its current form? I'd like to push these
> >>> patches to some tree once DAX write-protection patches are merged. I'm hoping
> >>> to get at least first three patches merged for 4.10-rc2... Thanks!
> >>
> >> OK, with the final ack from Johannes and since this is mostly DAX stuff,
> >> can we take this through NVDIMM tree and push to Linus either late in the
> >> merge window or for -rc2? These patches require my DAX patches sitting in mm
> >> tree so they can be included in any git tree only once those patches land
> >> in Linus' tree (which may happen only once Dave and Ted push out their
> >> stuff - this is the most convoluted merge window I'd ever to deal with ;-)...
> >> Dan?
> >>
> >
> > I like the -rc2 plan better than sending a pull request based on some
> > random point in the middle of the merge window. I can give Linus a
> > heads up in my initial nvdimm pull request for -rc1 that for
> > coordination purposes we'll be sending this set of follow-on DAX
> > cleanups for -rc2.
>
> So what's still pending for -rc2? I want to be explicit about what I'm
> requesting Linus be prepared to receive after -rc1. The libnvdimm pull
> request is very light this time around since I ended up deferring the
> device-dax-subdivision topic until 4.11 and sub-section memory hotplug
> didn't make the cutoff for -mm. We can spend some of that goodwill on
> your patches ;-).

;-) So I'd like all six of these patches to go in for -rc2. The first three
patches fix invalidation of exceptional DAX entries (a bug which has been
there for a long time) - without these patches, data loss can occur on power
failure even though the user called fsync(2). The other three patches change
the locking of DAX faults so that ->iomap_begin() is called in a more relaxed
locking context and we are safe to start a transaction there for ext4.

> I can roll them into libnvdimm-for-next now for the integration
> testing coverage, rebase to -rc1 when it's out, wait for your thumbs
> up on the testing and send a pull request on the 23rd.

Yup, all prerequisites are merged now so you can pick these patches up.
Thanks! Note that I'll be on vacation on Dec 23 - Jan 1.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


2016-12-19 21:51:53

by Dan Williams

Subject: Re: [PATCH 0/6 v3] dax: Page invalidation fixes

On Mon, Dec 19, 2016 at 1:56 AM, Jan Kara <[email protected]> wrote:
> On Fri 16-12-16 17:35:35, Dan Williams wrote:
>> On Tue, Dec 13, 2016 at 10:57 AM, Dan Williams <[email protected]> wrote:
>> > On Tue, Dec 13, 2016 at 3:52 AM, Jan Kara <[email protected]> wrote:
>> >> On Mon 12-12-16 17:47:02, Jan Kara wrote:
>> >>> Hello,
>> >>>
>> >>> this is the third revision of my fixes of races when invalidating hole pages in
>> >>> DAX mappings. See changelogs for details. The series is based on my patches to
>> >>> write-protect DAX PTEs which are currently carried in mm tree. This is a hard
>> >>> dependency because we really need to closely track dirtiness (and cleanness!)
>> >>> of radix tree entries in DAX mappings in order to avoid discarding valid dirty
>> >>> bits leading to missed cache flushes on fsync(2).
>> >>>
>> >>> The tests have passed xfstests for xfs and ext4 in DAX and non-DAX mode.
>> >>>
>> >>> Johannes, are you OK with patch 2/6 in its current form? I'd like to push these
>> >>> patches to some tree once DAX write-protection patches are merged. I'm hoping
>> >>> to get at least first three patches merged for 4.10-rc2... Thanks!
>> >>
>> >> OK, with the final ack from Johannes and since this is mostly DAX stuff,
>> >> can we take this through NVDIMM tree and push to Linus either late in the
>> >> merge window or for -rc2? These patches require my DAX patches sitting in mm
>> >> tree so they can be included in any git tree only once those patches land
>> >> in Linus' tree (which may happen only once Dave and Ted push out their
>> >> stuff - this is the most convoluted merge window I'd ever to deal with ;-)...
>> >> Dan?
>> >>
>> >
>> > I like the -rc2 plan better than sending a pull request based on some
>> > random point in the middle of the merge window. I can give Linus a
>> > heads up in my initial nvdimm pull request for -rc1 that for
>> > coordination purposes we'll be sending this set of follow-on DAX
>> > cleanups for -rc2.
>>
>> So what's still pending for -rc2? I want to be explicit about what I'm
>> requesting Linus be prepared to receive after -rc1. The libnvdimm pull
>> request is very light this time around since I ended up deferring the
>> device-dax-subdivision topic until 4.11 and sub-section memory hotplug
>> didn't make the cutoff for -mm. We can spend some of that goodwill on
>> your patches ;-).
>
> ;-) So I'd like all these 6 patches to go for rc2. The first three patches
> fix invalidation of exceptional DAX entries (a bug which is there for a
> long time) - without these patches data loss can occur on power failure
> even though user called fsync(2). The other three patches change locking of
> DAX faults so that ->iomap_begin() is called in a more relaxed locking
> context and we are safe to start a transaction there for ext4.
>
>> I can roll them into libnvdimm-for-next now for the integration
>> testing coverage, rebase to -rc1 when it's out, wait for your thumbs
>> up on the testing and send a pull request on the 23rd.
>
> Yup, all prerequisites are merged now so you can pick these patches up.
> Thanks! Note that I'll be on vacation on Dec 23 - Jan 1.

Sounds good. The contents are now out on libnvdimm-pending awaiting a
0day run before moving them over to libnvdimm-for-next. Also, it's down
to 5 patches, since it seems that the "dax: Fix sleep in atomic contex
in grab_mapping_entry()" change went upstream already.


2016-12-20 07:59:42

by Jan Kara

Subject: Re: [PATCH 0/6 v3] dax: Page invalidation fixes

On Mon 19-12-16 13:51:53, Dan Williams wrote:
> On Mon, Dec 19, 2016 at 1:56 AM, Jan Kara <[email protected]> wrote:
> > On Fri 16-12-16 17:35:35, Dan Williams wrote:
> >> On Tue, Dec 13, 2016 at 10:57 AM, Dan Williams <[email protected]> wrote:
> >> > On Tue, Dec 13, 2016 at 3:52 AM, Jan Kara <[email protected]> wrote:
> >> >> On Mon 12-12-16 17:47:02, Jan Kara wrote:
> >> >>> Hello,
> >> >>>
> >> >>> this is the third revision of my fixes of races when invalidating hole pages in
> >> >>> DAX mappings. See changelogs for details. The series is based on my patches to
> >> >>> write-protect DAX PTEs which are currently carried in mm tree. This is a hard
> >> >>> dependency because we really need to closely track dirtiness (and cleanness!)
> >> >>> of radix tree entries in DAX mappings in order to avoid discarding valid dirty
> >> >>> bits leading to missed cache flushes on fsync(2).
> >> >>>
> >> >>> The tests have passed xfstests for xfs and ext4 in DAX and non-DAX mode.
> >> >>>
> >> >>> Johannes, are you OK with patch 2/6 in its current form? I'd like to push these
> >> >>> patches to some tree once DAX write-protection patches are merged. I'm hoping
> >> >>> to get at least first three patches merged for 4.10-rc2... Thanks!
> >> >>
> >> >> OK, with the final ack from Johannes and since this is mostly DAX stuff,
> >> >> can we take this through NVDIMM tree and push to Linus either late in the
> >> >> merge window or for -rc2? These patches require my DAX patches sitting in mm
> >> >> tree so they can be included in any git tree only once those patches land
> >> >> in Linus' tree (which may happen only once Dave and Ted push out their
> >> >> stuff - this is the most convoluted merge window I'd ever to deal with ;-)...
> >> >> Dan?
> >> >>
> >> >
> >> > I like the -rc2 plan better than sending a pull request based on some
> >> > random point in the middle of the merge window. I can give Linus a
> >> > heads up in my initial nvdimm pull request for -rc1 that for
> >> > coordination purposes we'll be sending this set of follow-on DAX
> >> > cleanups for -rc2.
> >>
> >> So what's still pending for -rc2? I want to be explicit about what I'm
> >> requesting Linus be prepared to receive after -rc1. The libnvdimm pull
> >> request is very light this time around since I ended up deferring the
> >> device-dax-subdivision topic until 4.11 and sub-section memory hotplug
> >> didn't make the cutoff for -mm. We can spend some of that goodwill on
> >> your patches ;-).
> >
> > ;-) So I'd like all these 6 patches to go for rc2. The first three patches
> > fix invalidation of exceptional DAX entries (a bug which is there for a
> > long time) - without these patches data loss can occur on power failure
> > even though user called fsync(2). The other three patches change locking of
> > DAX faults so that ->iomap_begin() is called in a more relaxed locking
> > context and we are safe to start a transaction there for ext4.
> >
> >> I can roll them into libnvdimm-for-next now for the integration
> >> testing coverage, rebase to -rc1 when it's out, wait for your thumbs
> >> up on the testing and send a pull request on the 23rd.
> >
> > Yup, all prerequisites are merged now so you can pick these patches up.
> > Thanks! Note that I'll be on vacation on Dec 23 - Jan 1.
>
> Sounds good, the contents are now out on libnvdimm-pending awaiting
> 0day-run before moving them over to libnvdimm-for-next, also it's down
> to 5 patches since it seems that the "dax: Fix sleep in atomic contex
> in grab_mapping_entry()" change went upstream already.

Yes, but I've accounted for that. Checking the libnvdimm-pending branch, I
see you missed "ext2: Return BH_New buffers for zeroed blocks", which was
the first patch in the series. The subject is a slight misnomer, since these
days it is about setting the IOMAP_F_NEW flag instead, but it is still
needed... Otherwise the DAX invalidation code would not properly invalidate
zero pages in the radix tree in response to writes for ext2.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-12-20 20:09:45

by Dan Williams

Subject: Re: [PATCH 0/6 v3] dax: Page invalidation fixes

On Mon, Dec 19, 2016 at 11:59 PM, Jan Kara <[email protected]> wrote:
> On Mon 19-12-16 13:51:53, Dan Williams wrote:
[..]
> Yes, but I've accounted for that. Checking the libnvdimm-pending branch I
> see you missed "ext2: Return BH_New buffers for zeroed blocks" which was
> the first patch in the series. The subject is a slight misnomer since it is
> about setting IOMAP_F_NEW flag instead these days but still it is needed...
> Otherwise the DAX invalidation code would not propely invalidate zero pages
> in the radix tree in response to writes for ext2.

Ok, thanks. Updated libnvdimm-pending pushed out.