2016-11-24 09:46:30

by Jan Kara

Subject: [PATCH 0/6 v2] dax: Page invalidation fixes

Hello,

this is the second revision of my fixes for races when invalidating hole pages
in DAX mappings. See the changelogs for details. The series is based on my
patches to write-protect DAX PTEs, which are currently carried in the mm tree.
This is a hard dependency because we really need to closely track dirtiness
(and cleanness!) of radix tree entries in DAX mappings in order to avoid
discarding valid dirty bits, which would lead to missed cache flushes on
fsync(2).

The series has passed xfstests for xfs and ext4 in both DAX and non-DAX modes.

I'd like to get some review of the patches (MM/FS people, please check whether
you like the direction the changes in mm/truncate.c take in patch 2/6 - I
added Johannes to CC since he was touching related code recently) so that
these patches can land in some tree once the DAX write-protection patches are
merged. I'm hoping to get at least the first three patches merged for
4.10-rc2... Thanks!

Changes since v1:
* Rebased on top of patches in the mm tree
* Added some Reviewed-by tags
* Renamed some functions based on review feedback

Honza


2016-11-24 09:46:31

by Jan Kara

Subject: [PATCH 1/6] ext2: Return BH_New buffers for zeroed blocks

So far we did not return BH_New buffers from ext2_get_blocks() when we
allocated and zeroed out a block for a DAX inode, in order to avoid racy
zeroing in the DAX code. This zeroing is gone these days, so we can remove
the workaround.
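
For context, a minimal sketch of how a get_block caller typically consumes
BH_New (a hypothetical consumer fragment, not part of this patch): a set
BH_New bit tells the caller that the block was freshly allocated, so its
contents must be zeroed before they become visible:

	struct buffer_head bh = {};

	bh.b_size = inode->i_sb->s_blocksize;
	/* create == 1: allocate the block if it does not exist yet */
	if (ext2_get_block(inode, iblock, &bh, 1) == 0 && buffer_new(&bh)) {
		/*
		 * Freshly allocated block: it is now the caller's job
		 * (here, the DAX fault/zeroing path) to zero it.
		 */
	}

With the racy zeroing removed from the DAX code there is no longer a reason
to hide BH_New from it, which is what the diff below restores.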

Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext2/inode.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 046b642f3585..e626fe892c01 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -754,9 +754,8 @@ static int ext2_get_blocks(struct inode *inode,
mutex_unlock(&ei->truncate_mutex);
goto cleanup;
}
- } else {
- *new = true;
}
+ *new = true;

ext2_splice_branch(inode, iblock, partial, indirect_blks, count);
mutex_unlock(&ei->truncate_mutex);
--
2.6.6


2016-11-24 09:46:35

by Jan Kara

Subject: [PATCH 5/6] dax: Call ->iomap_begin without entry lock during dax fault

Currently the ->iomap_begin() handler is called with the entry lock held. If
the filesystem held any locks between ->iomap_begin() and ->iomap_end() (such
as ext4, which will want to hold a transaction open), this would cause a lock
inversion with iomap_apply() from the standard I/O path, which first calls
->iomap_begin() and only then the ->actor() callback, which grabs entry locks
for DAX.

Fix the problem by nesting the grabbing of the entry lock inside the
->iomap_begin() / ->iomap_end() pair.
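
Schematically, the fault path ordering after this patch (simplified from the
diff below, not literal code):

	ops->iomap_begin()		/* fs may take locks / open a transaction */
		grab_mapping_entry()	/* the DAX entry lock now nests inside */
		... insert the mapping / install the PTE ...
		put_locked_mapping_entry()
	ops->iomap_end()

This matches iomap_apply() on the regular I/O path, where any DAX entry
locking happens in ->actor(), strictly between ->iomap_begin() and
->iomap_end().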

Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 120 ++++++++++++++++++++++++++++++++++-----------------------------
1 file changed, 65 insertions(+), 55 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 38f996976ebf..be39633d346e 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1077,6 +1077,15 @@ dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
}
EXPORT_SYMBOL_GPL(dax_iomap_rw);

+static int dax_fault_return(int error)
+{
+ if (error == 0)
+ return VM_FAULT_NOPAGE;
+ if (error == -ENOMEM)
+ return VM_FAULT_OOM;
+ return VM_FAULT_SIGBUS;
+}
+
/**
* dax_iomap_fault - handle a page fault on a DAX file
* @vma: The virtual memory area where the fault occurred
@@ -1109,12 +1118,6 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
if (pos >= i_size_read(inode))
return VM_FAULT_SIGBUS;

- entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
- if (IS_ERR(entry)) {
- error = PTR_ERR(entry);
- goto out;
- }
-
if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
flags |= IOMAP_WRITE;

@@ -1125,9 +1128,15 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
*/
error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
if (error)
- goto unlock_entry;
+ return dax_fault_return(error);
if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
- error = -EIO; /* fs corruption? */
+ vmf_ret = dax_fault_return(-EIO); /* fs corruption? */
+ goto finish_iomap;
+ }
+
+ entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
+ if (IS_ERR(entry)) {
+ vmf_ret = dax_fault_return(PTR_ERR(entry));
goto finish_iomap;
}

@@ -1150,13 +1159,13 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
}

if (error)
- goto finish_iomap;
+ goto error_unlock_entry;

__SetPageUptodate(vmf->cow_page);
vmf_ret = finish_fault(vmf);
if (!vmf_ret)
vmf_ret = VM_FAULT_DONE_COW;
- goto finish_iomap;
+ goto unlock_entry;
}

switch (iomap.type) {
@@ -1168,12 +1177,15 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
}
error = dax_insert_mapping(mapping, iomap.bdev, sector,
PAGE_SIZE, &entry, vma, vmf);
+ /* -EBUSY is fine, somebody else faulted on the same PTE */
+ if (error == -EBUSY)
+ error = 0;
break;
case IOMAP_UNWRITTEN:
case IOMAP_HOLE:
if (!(vmf->flags & FAULT_FLAG_WRITE)) {
vmf_ret = dax_load_hole(mapping, &entry, vmf);
- goto finish_iomap;
+ goto unlock_entry;
}
/*FALLTHRU*/
default:
@@ -1182,30 +1194,25 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
break;
}

- finish_iomap:
- if (ops->iomap_end) {
- if (error || (vmf_ret & VM_FAULT_ERROR)) {
- /* keep previous error */
- ops->iomap_end(inode, pos, PAGE_SIZE, 0, flags,
- &iomap);
- } else {
- error = ops->iomap_end(inode, pos, PAGE_SIZE,
- PAGE_SIZE, flags, &iomap);
- }
- }
+ error_unlock_entry:
+ vmf_ret = dax_fault_return(error) | major;
unlock_entry:
put_locked_mapping_entry(mapping, vmf->pgoff, entry);
- out:
- if (error == -ENOMEM)
- return VM_FAULT_OOM | major;
- /* -EBUSY is fine, somebody else faulted on the same PTE */
- if (error < 0 && error != -EBUSY)
- return VM_FAULT_SIGBUS | major;
- if (vmf_ret) {
- WARN_ON_ONCE(error); /* -EBUSY from ops->iomap_end? */
- return vmf_ret;
+ finish_iomap:
+ if (ops->iomap_end) {
+ int copied = PAGE_SIZE;
+
+ if (vmf_ret & VM_FAULT_ERROR)
+ copied = 0;
+ /*
+ * The fault is done by now and there's no way back (other
+ * thread may be already happily using PTE we have installed).
+ * Just ignore error from ->iomap_end since we cannot do much
+ * with it.
+ */
+ ops->iomap_end(inode, pos, PAGE_SIZE, copied, flags, &iomap);
}
- return VM_FAULT_NOPAGE | major;
+ return vmf_ret;
}
EXPORT_SYMBOL_GPL(dax_iomap_fault);

@@ -1330,6 +1337,15 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
goto fallback;

/*
+ * Note that we don't use iomap_apply here. We aren't doing I/O, only
+ * setting up a mapping, so really we're using iomap_begin() as a way
+ * to look up our filesystem block.
+ */
+ pos = (loff_t)pgoff << PAGE_SHIFT;
+ error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
+ if (error)
+ goto fallback;
+ /*
* grab_mapping_entry() will make sure we get a 2M empty entry, a DAX
* PMD or a HZP entry. If it can't (because a 4k page is already in
* the tree, for instance), it will return -EEXIST and we just fall
@@ -1337,19 +1353,10 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
*/
entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
if (IS_ERR(entry))
- goto fallback;
+ goto finish_iomap;

- /*
- * Note that we don't use iomap_apply here. We aren't doing I/O, only
- * setting up a mapping, so really we're using iomap_begin() as a way
- * to look up our filesystem block.
- */
- pos = (loff_t)pgoff << PAGE_SHIFT;
- error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
- if (error)
- goto unlock_entry;
if (iomap.offset + iomap.length < pos + PMD_SIZE)
- goto finish_iomap;
+ goto unlock_entry;

vmf.pgoff = pgoff;
vmf.flags = flags;
@@ -1363,7 +1370,7 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
case IOMAP_UNWRITTEN:
case IOMAP_HOLE:
if (WARN_ON_ONCE(write))
- goto finish_iomap;
+ goto unlock_entry;
result = dax_pmd_load_hole(vma, pmd, &vmf, address, &iomap,
&entry);
break;
@@ -1372,20 +1379,23 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
break;
}

+ unlock_entry:
+ put_locked_mapping_entry(mapping, pgoff, entry);
finish_iomap:
if (ops->iomap_end) {
- if (result == VM_FAULT_FALLBACK) {
- ops->iomap_end(inode, pos, PMD_SIZE, 0, iomap_flags,
- &iomap);
- } else {
- error = ops->iomap_end(inode, pos, PMD_SIZE, PMD_SIZE,
- iomap_flags, &iomap);
- if (error)
- result = VM_FAULT_FALLBACK;
- }
+ int copied = PMD_SIZE;
+
+ if (result == VM_FAULT_FALLBACK)
+ copied = 0;
+ /*
+ * The fault is done by now and there's no way back (other
+ * thread may be already happily using PMD we have installed).
+ * Just ignore error from ->iomap_end since we cannot do much
+ * with it.
+ */
+ ops->iomap_end(inode, pos, PMD_SIZE, copied, iomap_flags,
+ &iomap);
}
- unlock_entry:
- put_locked_mapping_entry(mapping, pgoff, entry);
fallback:
if (result == VM_FAULT_FALLBACK) {
split_huge_pmd(vma, pmd, address);
--
2.6.6


2016-11-29 17:48:11

by Ross Zwisler

Subject: Re: [PATCH 1/6] ext2: Return BH_New buffers for zeroed blocks

On Thu, Nov 24, 2016 at 10:46:31AM +0100, Jan Kara wrote:
> So far we did not return BH_New buffers from ext2_get_blocks() when we
> allocated and zeroed out a block for a DAX inode, in order to avoid racy
> zeroing in the DAX code. This zeroing is gone these days, so we can remove
> the workaround.
>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>

Reviewed-by: Ross Zwisler <[email protected]>


2016-12-01 22:24:49

by Ross Zwisler

Subject: Re: [PATCH 5/6] dax: Call ->iomap_begin without entry lock during dax fault

On Thu, Nov 24, 2016 at 10:46:35AM +0100, Jan Kara wrote:
> Currently the ->iomap_begin() handler is called with the entry lock held. If
> the filesystem held any locks between ->iomap_begin() and ->iomap_end() (such
> as ext4, which will want to hold a transaction open), this would cause a lock
> inversion with iomap_apply() from the standard I/O path, which first calls
> ->iomap_begin() and only then the ->actor() callback, which grabs entry locks
> for DAX.

I don't see dax_iomap_actor() grabbing any entry locks for DAX? Is this an
issue currently, or are you just trying to make the code consistent so we
don't run into issues in the future?

2016-12-01 23:27:04

by Ross Zwisler

Subject: Re: [PATCH 5/6] dax: Call ->iomap_begin without entry lock during dax fault

On Thu, Dec 01, 2016 at 03:24:47PM -0700, Ross Zwisler wrote:
> On Thu, Nov 24, 2016 at 10:46:35AM +0100, Jan Kara wrote:
> > Currently the ->iomap_begin() handler is called with the entry lock held. If
> > the filesystem held any locks between ->iomap_begin() and ->iomap_end() (such
> > as ext4, which will want to hold a transaction open), this would cause a lock
> > inversion with iomap_apply() from the standard I/O path, which first calls
> > ->iomap_begin() and only then the ->actor() callback, which grabs entry locks
> > for DAX.
>
> I don't see dax_iomap_actor() grabbing any entry locks for DAX? Is this an
> issue currently, or are you just trying to make the code consistent so we
> don't run into issues in the future?

Ah, I see that you use this new ordering in patch 6/6 so that you can change
your interaction with the ext4 journal. I'm still curious if we have a lock
ordering inversion within DAX, but if this ordering helps you with ext4, good
enough.

One quick comment:

> @@ -1337,19 +1353,10 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> */
> entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
> if (IS_ERR(entry))
> - goto fallback;
> + goto finish_iomap;
>
> - /*
> - * Note that we don't use iomap_apply here. We aren't doing I/O, only
> - * setting up a mapping, so really we're using iomap_begin() as a way
> - * to look up our filesystem block.
> - */
> - pos = (loff_t)pgoff << PAGE_SHIFT;
> - error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
> - if (error)
> - goto unlock_entry;
> if (iomap.offset + iomap.length < pos + PMD_SIZE)
> - goto finish_iomap;
> + goto unlock_entry;

I think this offset+length bounds check could be moved along with the
iomap_begin() call up above the grab_mapping_entry(). You would then goto
'finish_iomap' if you hit this error condition, allowing you to avoid grabbing
and releasing the mapping entry.
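
That is, schematically (the suggested reordering, assembled from the hunks
quoted above, not code as posted):

	pos = (loff_t)pgoff << PAGE_SHIFT;
	error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
	if (error)
		goto fallback;
	if (iomap.offset + iomap.length < pos + PMD_SIZE)
		goto finish_iomap;

	entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
	if (IS_ERR(entry))
		goto finish_iomap;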

Other than that one small nit, this looks fine to me:
Reviewed-by: Ross Zwisler <[email protected]>


2016-12-02 10:08:18

by Jan Kara

Subject: Re: [PATCH 5/6] dax: Call ->iomap_begin without entry lock during dax fault

On Thu 01-12-16 15:24:47, Ross Zwisler wrote:
> On Thu, Nov 24, 2016 at 10:46:35AM +0100, Jan Kara wrote:
> > Currently the ->iomap_begin() handler is called with the entry lock held. If
> > the filesystem held any locks between ->iomap_begin() and ->iomap_end() (such
> > as ext4, which will want to hold a transaction open), this would cause a lock
> > inversion with iomap_apply() from the standard I/O path, which first calls
> > ->iomap_begin() and only then the ->actor() callback, which grabs entry locks
> > for DAX.
>
> I don't see dax_iomap_actor() grabbing any entry locks for DAX? Is this an
> issue currently, or are you just trying to make the code consistent so we
> don't run into issues in the future?

So dax_iomap_actor() copies data from / to a user-provided buffer. That copy
can fault, and if the buffer happens to be an mmapped file on a DAX
filesystem, the fault will end up grabbing entry locks while the write path
already sits between ->iomap_begin() and ->iomap_end(). Sample evil test:

fd = open("some_file", O_RDWR);
buf = mmap(NULL, 65536, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
/* The copy-in faults on buf, which is backed by the same DAX file. */
write(fd, buf, 4096);

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


2016-12-02 10:12:54

by Jan Kara

Subject: Re: [PATCH 5/6] dax: Call ->iomap_begin without entry lock during dax fault

On Thu 01-12-16 16:27:04, Ross Zwisler wrote:
> On Thu, Dec 01, 2016 at 03:24:47PM -0700, Ross Zwisler wrote:
> > On Thu, Nov 24, 2016 at 10:46:35AM +0100, Jan Kara wrote:
> > > Currently the ->iomap_begin() handler is called with the entry lock held. If
> > > the filesystem held any locks between ->iomap_begin() and ->iomap_end() (such
> > > as ext4, which will want to hold a transaction open), this would cause a lock
> > > inversion with iomap_apply() from the standard I/O path, which first calls
> > > ->iomap_begin() and only then the ->actor() callback, which grabs entry locks
> > > for DAX.
> >
> > I don't see dax_iomap_actor() grabbing any entry locks for DAX? Is this an
> > issue currently, or are you just trying to make the code consistent so we
> > don't run into issues in the future?
>
> Ah, I see that you use this new ordering in patch 6/6 so that you can change
> your interaction with the ext4 journal. I'm still curious if we have a lock
> ordering inversion within DAX, but if this ordering helps you with ext4, good
> enough.
>
> One quick comment:
>
> > @@ -1337,19 +1353,10 @@ int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > */
> > entry = grab_mapping_entry(mapping, pgoff, RADIX_DAX_PMD);
> > if (IS_ERR(entry))
> > - goto fallback;
> > + goto finish_iomap;
> >
> > - /*
> > - * Note that we don't use iomap_apply here. We aren't doing I/O, only
> > - * setting up a mapping, so really we're using iomap_begin() as a way
> > - * to look up our filesystem block.
> > - */
> > - pos = (loff_t)pgoff << PAGE_SHIFT;
> > - error = ops->iomap_begin(inode, pos, PMD_SIZE, iomap_flags, &iomap);
> > - if (error)
> > - goto unlock_entry;
> > if (iomap.offset + iomap.length < pos + PMD_SIZE)
> > - goto finish_iomap;
> > + goto unlock_entry;
>
> I think this offset+length bounds check could be moved along with the
> iomap_begin() call up above the grab_mapping_entry(). You would then goto
> 'finish_iomap' if you hit this error condition, allowing you to avoid grabbing
> and releasing the mapping entry.

Yes, that is nicer. Changed.

> Other than that one small nit, this looks fine to me:
> Reviewed-by: Ross Zwisler <[email protected]>

Thanks.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
