2018-07-10 19:10:29

by Ross Zwisler

Subject: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

Changes since v3:
* Added an ext4_break_layouts() call to ext4_insert_range() to ensure
that the {ext4,xfs}_break_layouts() calls have the same meaning.
(Dave, Darrick and Jan)

---

This series from Dan:

https://lists.01.org/pipermail/linux-nvdimm/2018-March/014913.html

added synchronization between DAX dma and truncate/hole-punch in XFS.
This short series adds analogous support to ext4.

I've added calls to ext4_break_layouts() everywhere that ext4 removes
blocks from an inode's map.

The timings in XFS are such that it's difficult to hit this race. Dan
was able to show the race by manually introducing delays in the direct
I/O path.

For ext4, though, it's trivial to hit this race, and a hit triggers this
WARN_ON_ONCE() in dax_disassociate_entry():

WARN_ON_ONCE(trunc && page_ref_count(page) > 1);

I've made an xfstest which tests all the paths where we now call
ext4_break_layouts(). Each of the four paths easily hits this race many
times in my test setup with the xfstest. You can find that test here:

https://lists.01.org/pipermail/linux-nvdimm/2018-June/016435.html

With these patches applied, I've still seen occasional hits of the above
WARN_ON_ONCE(), which tells me that we still have some work to do. I'll
continue looking at these more rare hits.

Ross Zwisler (2):
dax: dax_layout_busy_page() warn on !exceptional
ext4: handle layout changes to pinned DAX mappings

 fs/dax.c           | 10 +++++++++-
 fs/ext4/ext4.h     |  1 +
 fs/ext4/extents.c  | 17 +++++++++++++++++
 fs/ext4/inode.c    | 46 ++++++++++++++++++++++++++++++++++++++++++++++
 fs/ext4/truncate.h |  4 ++++
 5 files changed, 77 insertions(+), 1 deletion(-)

--
2.14.4


2018-07-11 08:17:41

by Jan Kara

Subject: Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

On Tue 10-07-18 13:10:29, Ross Zwisler wrote:
> Changes since v3:
> * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
> that the {ext4,xfs}_break_layouts() calls have the same meaning.
> (Dave, Darrick and Jan)

How about the occasional WARN_ON_ONCE hits you mention below? Were you able
to hunt them down?

Honza
> ---
>
> This series from Dan:
>
> https://lists.01.org/pipermail/linux-nvdimm/2018-March/014913.html
>
> added synchronization between DAX dma and truncate/hole-punch in XFS.
> This short series adds analogous support to ext4.
>
> I've added calls to ext4_break_layouts() everywhere that ext4 removes
> blocks from an inode's map.
>
> The timings in XFS are such that it's difficult to hit this race. Dan
> was able to show the race by manually introducing delays in the direct
> I/O path.
>
> For ext4, though, its trivial to hit this race, and a hit will result in
> a trigger of this WARN_ON_ONCE() in dax_disassociate_entry():
>
> WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
>
> I've made an xfstest which tests all the paths where we now call
> ext4_break_layouts(). Each of the four paths easily hits this race many
> times in my test setup with the xfstest. You can find that test here:
>
> https://lists.01.org/pipermail/linux-nvdimm/2018-June/016435.html
>
> With these patches applied, I've still seen occasional hits of the above
> WARN_ON_ONCE(), which tells me that we still have some work to do. I'll
> continue looking at these more rare hits.
>
> Ross Zwisler (2):
> dax: dax_layout_busy_page() warn on !exceptional
> ext4: handle layout changes to pinned DAX mappings
>
> fs/dax.c | 10 +++++++++-
> fs/ext4/ext4.h | 1 +
> fs/ext4/extents.c | 17 +++++++++++++++++
> fs/ext4/inode.c | 46 ++++++++++++++++++++++++++++++++++++++++++++++
> fs/ext4/truncate.h | 4 ++++
> 5 files changed, 77 insertions(+), 1 deletion(-)
>
> --
> 2.14.4
>
--
Jan Kara <jack-IBi9RG/[email protected]>
SUSE Labs, CR

2018-07-11 15:41:24

by Ross Zwisler

Subject: Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote:
> On Tue 10-07-18 13:10:29, Ross Zwisler wrote:
> > Changes since v3:
> > * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
> > that the {ext4,xfs}_break_layouts() calls have the same meaning.
> > (Dave, Darrick and Jan)
>
> How about the occasional WARN_ON_ONCE you mention below. Were you able to
> hunt them down?

That's next on my list of things to look at. :)

2018-07-25 22:28:39

by Ross Zwisler

Subject: Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote:
> On Tue 10-07-18 13:10:29, Ross Zwisler wrote:
> > Changes since v3:
> > * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
> > that the {ext4,xfs}_break_layouts() calls have the same meaning.
> > (Dave, Darrick and Jan)
>
> How about the occasional WARN_ON_ONCE you mention below. Were you able to
> hunt them down?

The root cause of this issue is that while the ei->i_mmap_sem provides
synchronization between ext4_break_layouts() and page faults, it doesn't
synchronize us with the direct I/O path. AFAICT this exact same issue exists
in XFS, with the synchronization tool there being the XFS_MMAPLOCK.

This allows the direct I/O path to do I/O and raise & lower page->_refcount
while we're executing a truncate/hole punch. This leads to us trying to free
a page with an elevated refcount.

Here's one instance of the race:

CPU 0                                   CPU 1
-----                                   -----
ext4_punch_hole()
  ext4_break_layouts() # all pages have refcount=1

                                        ext4_direct_IO()
                                          ... lots of layers ...
                                          follow_page_pte()
                                            get_page() # elevates refcount

truncate_pagecache_range()
  ... a few layers ...
  dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE()

A similar race occurs when the refcount is being dropped while we're running
ext4_break_layouts(), and this is the one that my test was actually hitting:

CPU 0                                   CPU 1
-----                                   -----
ext4_direct_IO()
  ... lots of layers ...
  follow_page_pte()
    get_page()
    # elevates refcount of page X

                                        ext4_punch_hole()
                                          ext4_break_layouts()
                                          # two pages, X & Y, have refcount == 2
                                          __wait_var_event() # called for page X

__put_devmap_managed_page()
# drops refcount of X to 1

# ___wait_var_event() checks X's refcount in "if (condition)", and breaks.
# We never actually called ext4_wait_dax_page(), so 'retry' in
# ext4_break_layouts() is still false. Exit do/while loop in
# ext4_break_layouts(), never attempting to wait on page Y which still has an
# elevated refcount of 2.

                                        truncate_pagecache_range()
                                          ... a few layers ...
                                          dax_disassociate_entry()
                                          # sees elevated refcount for Y,
                                          # WARN_ON_ONCE()

This second race can be fixed with the patch at the end of this mail, which
I think should go in, unless there is a benefit to the current retry scheme
which relies on the 'retry' variable in {ext4,xfs}_break_layouts()? With
this patch applied I've been able to run my unit test through thousands of
iterations, where it used to fail consistently within 10 or so.

Even so, I wonder if the real solution is to add synchronization between
the direct I/O path and {ext4,xfs}_break_layouts()? Other ideas on how
this should be handled?

--- >8 ---

From a4519b0f40362f0a63ae96acaf986092aff0f0d3 Mon Sep 17 00:00:00 2001
From: Ross Zwisler <[email protected]>
Date: Wed, 25 Jul 2018 16:16:05 -0600
Subject: [PATCH] ext4: Close race between direct IO and ext4_break_layouts()

If the refcount of a page is lowered between the time that it is returned
by dax_layout_busy_page() and when the refcount is again checked in
ext4_break_layouts() => ___wait_var_event(), the waiting function
ext4_wait_dax_page() will never be called. This means that
ext4_break_layouts() will still have 'retry' set to false, so we'll stop
looping and never check the refcount of other pages in this inode.

Instead, always continue looping as long as dax_layout_busy_page() gives us
a page which it found with an elevated refcount.

Note that this works around the race exposed by my unit test, but I think
that there is another race that needs to be addressed, probably with
additional synchronization added between direct I/O and
{ext4,xfs}_break_layouts().

Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/[email protected]>
---
fs/ext4/inode.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 27e9643bc13b..fedb88104bbf 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4193,9 +4193,8 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
 	return 0;
 }
 
-static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock)
+static void ext4_wait_dax_page(struct ext4_inode_info *ei)
 {
-	*did_unlock = true;
 	up_write(&ei->i_mmap_sem);
 	schedule();
 	down_write(&ei->i_mmap_sem);
@@ -4205,14 +4204,12 @@ int ext4_break_layouts(struct inode *inode)
 {
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	struct page *page;
-	bool retry;
 	int error;
 
 	if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem)))
 		return -EINVAL;
 
 	do {
-		retry = false;
 		page = dax_layout_busy_page(inode->i_mapping);
 		if (!page)
 			return 0;
@@ -4220,8 +4217,8 @@ int ext4_break_layouts(struct inode *inode)
 		error = ___wait_var_event(&page->_refcount,
 					  atomic_read(&page->_refcount) == 1,
 					  TASK_INTERRUPTIBLE, 0, 0,
-					  ext4_wait_dax_page(ei, &retry));
-	} while (error == 0 && retry);
+					  ext4_wait_dax_page(ei));
+	} while (error == 0);
 
 	return error;
 }
--
2.14.4

2018-07-27 16:28:51

by Ross Zwisler

Subject: Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

+ fsdevel and the xfs list.

On Wed, Jul 25, 2018 at 4:28 PM Ross Zwisler
<ross.zwisler-VuQAYsv1563Yd54FQh9/[email protected]> wrote:
> On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote:
> > On Tue 10-07-18 13:10:29, Ross Zwisler wrote:
> > > Changes since v3:
> > > * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
> > > that the {ext4,xfs}_break_layouts() calls have the same meaning.
> > > (Dave, Darrick and Jan)
> >
> > How about the occasional WARN_ON_ONCE you mention below. Were you able to
> > hunt them down?
>
> The root cause of this issue is that while the ei->i_mmap_sem provides
> synchronization between ext4_break_layouts() and page faults, it doesn't
> provide synchronize us with the direct I/O path. This exact same issue exists
> in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK.
>
> This allows the direct I/O path to do I/O and raise & lower page->_refcount
> while we're executing a truncate/hole punch. This leads to us trying to free
> a page with an elevated refcount.
>
> Here's one instance of the race:
>
> CPU 0 CPU 1
> ----- -----
> ext4_punch_hole()
> ext4_break_layouts() # all pages have refcount=1
>
> ext4_direct_IO()
> ... lots of layers ...
> follow_page_pte()
> get_page() # elevates refcount
>
> truncate_pagecache_range()
> ... a few layers ...
> dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE()
>
> A similar race occurs when the refcount is being dropped while we're running
> ext4_break_layouts(), and this is the one that my test was actually hitting:
>
> CPU 0 CPU 1
> ----- -----
> ext4_direct_IO()
> ... lots of layers ...
> follow_page_pte()
> get_page()
> # elevates refcount of page X
> ext4_punch_hole()
> ext4_break_layouts() # two pages, X & Y, have refcount == 2
> __wait_var_event() # called for page X
>
> __put_devmap_managed_page()
> # drops refcount of X to 1
>
> # __wait_var_events() checks X's refcount in "if (condition)", and breaks.
> # We never actually called ext4_wait_dax_page(), so 'retry' in
> # ext4_break_layouts() is still false. Exit do/while loop in
> # ext4_break_layouts, never attempting to wait on page Y which still has an
> # elevated refcount of 2.
>
> truncate_pagecache_range()
> ... a few layers ...
> dax_disassociate_entry() # sees elevated refcount for Y, WARN_ON_ONCE()
>
> This second race can be fixed with the patch at the end of this function,
> which I think should go in, unless there is a benfit to the current retry
> scheme which relies on the 'retry' variable in {ext4,xfs}_break_layouts()?
> With this patch applied I've been able to run my unit test through
> thousands of iterations, where it used to failed consistently within 10 or
> so.
>
> Even so, I wonder if the real solution is to add synchronization between
> the direct I/O path and {ext4,xfs}_break_layouts()? Other ideas on how
> this should be handled?
>
> --- >8 ---
>
> From a4519b0f40362f0a63ae96acaf986092aff0f0d3 Mon Sep 17 00:00:00 2001
> From: Ross Zwisler <[email protected]>
> Date: Wed, 25 Jul 2018 16:16:05 -0600
> Subject: [PATCH] ext4: Close race between direct IO and ext4_break_layouts()
>
> If the refcount of a page is lowered between the time that it is returned
> by dax_busy_page() and when the refcount is again checked in
> ext4_break_layouts() => ___wait_var_event(), the waiting function
> ext4_wait_dax_page() will never be called. This means that
> ext4_break_layouts() will still have 'retry' set to false, so we'll stop
> looping and never check the refcount of other pages in this inode.
>
> Instead, always continue looping as long as dax_layout_busy_page() gives us
> a page which it found with an elevated refcount.
>
> Note that this works around the race exposed by my unit test, but I think
> that there is another race that needs to be addressed, probably with
> additional synchronization added between direct I/O and
> {ext4,xfs}_break_layouts().
>
> Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/[email protected]>
> ---
> fs/ext4/inode.c | 9 +++------
> 1 file changed, 3 insertions(+), 6 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 27e9643bc13b..fedb88104bbf 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4193,9 +4193,8 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
> return 0;
> }
>
> -static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock)
> +static void ext4_wait_dax_page(struct ext4_inode_info *ei)
> {
> - *did_unlock = true;
> up_write(&ei->i_mmap_sem);
> schedule();
> down_write(&ei->i_mmap_sem);
> @@ -4205,14 +4204,12 @@ int ext4_break_layouts(struct inode *inode)
> {
> struct ext4_inode_info *ei = EXT4_I(inode);
> struct page *page;
> - bool retry;
> int error;
>
> if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem)))
> return -EINVAL;
>
> do {
> - retry = false;
> page = dax_layout_busy_page(inode->i_mapping);
> if (!page)
> return 0;
> @@ -4220,8 +4217,8 @@ int ext4_break_layouts(struct inode *inode)
> error = ___wait_var_event(&page->_refcount,
> atomic_read(&page->_refcount) == 1,
> TASK_INTERRUPTIBLE, 0, 0,
> - ext4_wait_dax_page(ei, &retry));
> - } while (error == 0 && retry);
> + ext4_wait_dax_page(ei));
> + } while (error == 0);
>
> return error;
> }

2018-07-31 19:44:20

by Ross Zwisler

Subject: Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

On Tue, Jul 10, 2018 at 01:10:29PM -0600, Ross Zwisler wrote:
> Changes since v3:
> * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
> that the {ext4,xfs}_break_layouts() calls have the same meaning.
> (Dave, Darrick and Jan)
>
> ---
>
> This series from Dan:
>
> https://lists.01.org/pipermail/linux-nvdimm/2018-March/014913.html
>
> added synchronization between DAX dma and truncate/hole-punch in XFS.
> This short series adds analogous support to ext4.
>
> I've added calls to ext4_break_layouts() everywhere that ext4 removes
> blocks from an inode's map.
>
> The timings in XFS are such that it's difficult to hit this race. Dan
> was able to show the race by manually introducing delays in the direct
> I/O path.
>
> For ext4, though, its trivial to hit this race, and a hit will result in
> a trigger of this WARN_ON_ONCE() in dax_disassociate_entry():
>
> WARN_ON_ONCE(trunc && page_ref_count(page) > 1);
>
> I've made an xfstest which tests all the paths where we now call
> ext4_break_layouts(). Each of the four paths easily hits this race many
> times in my test setup with the xfstest. You can find that test here:
>
> https://lists.01.org/pipermail/linux-nvdimm/2018-June/016435.html
>
> With these patches applied, I've still seen occasional hits of the above
> WARN_ON_ONCE(), which tells me that we still have some work to do. I'll
> continue looking at these more rare hits.

One last ping on this - do we want to merge this for v4.19? I've tracked down
the more rare warnings, and have reported the race I'm seeing here:

https://lists.01.org/pipermail/linux-nvdimm/2018-July/017205.html

AFAICT the race exists equally for XFS and ext4, and I'm not sure how to solve
it easily. Essentially it seems like we need to synchronize not just page
faults but calls to get_page() with truncate/hole punch operations, else we
can have the reference count for a given DAX page going up and down while we
are in the middle of a truncate. I'm not sure if this is even feasible.

The equivalent code for this series already exists in XFS, so taking the
series now gets ext4 and XFS on the same footing moving forward, so I guess
I'm in favor of merging it now, though I can see the argument that it's not a
complete solution.

Thoughts?

2018-08-06 03:55:50

by Dave Chinner

Subject: Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

On Fri, Jul 27, 2018 at 10:28:51AM -0600, Ross Zwisler wrote:
> + fsdevel and the xfs list.
>
> On Wed, Jul 25, 2018 at 4:28 PM Ross Zwisler
> <ross.zwisler-VuQAYsv1563Yd54FQh9/[email protected]> wrote:
> > On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote:
> > > On Tue 10-07-18 13:10:29, Ross Zwisler wrote:
> > > > Changes since v3:
> > > > * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
> > > > that the {ext4,xfs}_break_layouts() calls have the same meaning.
> > > > (Dave, Darrick and Jan)
> > >
> > > How about the occasional WARN_ON_ONCE you mention below. Were you able to
> > > hunt them down?
> >
> > The root cause of this issue is that while the ei->i_mmap_sem provides
> > synchronization between ext4_break_layouts() and page faults, it doesn't
> > provide synchronize us with the direct I/O path. This exact same issue exists
> > in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK.

That's not correct for XFS.

Truncate/holepunch in XFS hold *both* the MMAPLOCK and the IOLOCK in
exclusive mode. Holding the MMAPLOCK serialises against page faults,
holding the IOLOCK serialises against buffered IO and Direct IO
submission. The inode_dio_wait() call that truncate/holepunch does
after gaining the IOLOCK ensures that we wait for all direct IO in
progress to complete (e.g. AIO) before the truncate/holepunch goes
ahead. e.g:

STATIC long
xfs_file_fallocate(
	struct file		*file,
	int			mode,
	loff_t			offset,
	loff_t			len)
{
	.....
	uint			iolock = XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL;
	.....
	xfs_ilock(ip, iolock);
	error = xfs_break_layouts(inode, &iolock, BREAK_UNMAP);
	.....
	if (mode & FALLOC_FL_PUNCH_HOLE) {
		error = xfs_free_file_space(ip, offset, len);


and then xfs_free_file_space() calls xfs_flush_unmap_range(), which waits
for direct IO to complete, flushes all the dirty cached pages and then
invalidates the page cache before doing any extent manipulation. There
cannot be any IO or page faults in progress after this point.....

truncate (through xfs_vn_setattr()) essentially does the same thing, except
that it's called with the IOLOCK already held exclusively (remember, the
IOLOCK is the vfs inode->i_rwsem).

> > This allows the direct I/O path to do I/O and raise & lower page->_refcount
> > while we're executing a truncate/hole punch. This leads to us trying to free
> > a page with an elevated refcount.

I don't see how this is possible in XFS - maybe I'm missing
something, but "direct IO submission during truncate" is not
something that should ever be happening in XFS, DAX or not.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2018-08-06 15:49:43

by Christoph Hellwig

Subject: Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

> > > This allows the direct I/O path to do I/O and raise & lower page->_refcount
> > > while we're executing a truncate/hole punch. This leads to us trying to free
> > > a page with an elevated refcount.
>
> I don't see how this is possible in XFS - maybe I'm missing
> something, but "direct IO submission during truncate" is not
> something that should ever be happening in XFS, DAX or not.

The pages involved in a direct I/O are not those of the file that the
direct I/O read/write syscalls are called on, but those of the memory
regions the direct I/O read/write syscalls operate on. Those pages could
be file backed and undergo a truncate at the same time.

2018-08-06 22:25:01

by Dave Chinner

Subject: Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

On Mon, Aug 06, 2018 at 05:49:43PM +0200, Christoph Hellwig wrote:
> > > > This allows the direct I/O path to do I/O and raise & lower page->_refcount
> > > > while we're executing a truncate/hole punch. This leads to us trying to free
> > > > a page with an elevated refcount.
> >
> > I don't see how this is possible in XFS - maybe I'm missing
> > something, but "direct IO submission during truncate" is not
> > something that should ever be happening in XFS, DAX or not.
>
> The pages involved in a direct I/O are not that of the file that
> the direct I/O read/write syscalls are called on, but those of the
> memory regions the direct I/O read/write syscalls operate on.
> Those pages could be file backed and undergo a truncate at the
> same time.

So let me get this straight. First, mmap() file A, then fault it all
in, then use the mmapped range of file A as the user buffer for
direct IO to file B, then concurrently truncate file A down so the
destination buffer for the file B dio will be beyond EOF and so we
need to invalidate it. But waiting for gup references in truncate
can race with other new page references via gup because gup does not
serialise access to the file backed pages in any way?

i.e. we hold no fs locks at all on file A when gup takes page
references during direct IO to file B unless we have to fault in the
page. This doesn't seem like a problem that the filesystem can
solve, but it does indicate to me a potential solution. i.e. we
take the MMAPLOCK during page faults, and so we can use that to
serialise gup against the invalidation in progress on file A.

i.e. it would seem to me that gup needs to refault file-backed pages
rather than just blindly take a reference to them so that it
triggers serialisation of the page references against in-progress
invalidation operations.

Thoughts?

-Dave.
--
Dave Chinner
[email protected]

2018-08-07 08:45:45

by Jan Kara

Subject: Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

On Fri 27-07-18 10:28:51, Ross Zwisler wrote:
> + fsdevel and the xfs list.
>
> On Wed, Jul 25, 2018 at 4:28 PM Ross Zwisler
> <ross.zwisler-VuQAYsv1563Yd54FQh9/[email protected]> wrote:
> > On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote:
> > > On Tue 10-07-18 13:10:29, Ross Zwisler wrote:
> > > > Changes since v3:
> > > > * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
> > > > that the {ext4,xfs}_break_layouts() calls have the same meaning.
> > > > (Dave, Darrick and Jan)
> > >
> > > How about the occasional WARN_ON_ONCE you mention below. Were you able to
> > > hunt them down?
> >
> > The root cause of this issue is that while the ei->i_mmap_sem provides
> > synchronization between ext4_break_layouts() and page faults, it doesn't
> > provide synchronize us with the direct I/O path. This exact same issue exists
> > in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK.
> >
> > This allows the direct I/O path to do I/O and raise & lower page->_refcount
> > while we're executing a truncate/hole punch. This leads to us trying to free
> > a page with an elevated refcount.
> >
> > Here's one instance of the race:
> >
> > CPU 0 CPU 1
> > ----- -----
> > ext4_punch_hole()
> > ext4_break_layouts() # all pages have refcount=1
> >
> > ext4_direct_IO()
> > ... lots of layers ...
> > follow_page_pte()
> > get_page() # elevates refcount
> >
> > truncate_pagecache_range()
> > ... a few layers ...
> > dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE()
> >

So this is a very different race from the one below, and it should be
impossible to happen. This race is exactly the reason why
dax_layout_busy_page() has an unmap_mapping_range() call to force GUP to
fault, which blocks on ei->i_mmap_sem / XFS_MMAPLOCK and thus avoids the
race.

> > A similar race occurs when the refcount is being dropped while we're running
> > ext4_break_layouts(), and this is the one that my test was actually hitting:
> >
> > CPU 0 CPU 1
> > ----- -----
> > ext4_direct_IO()
> > ... lots of layers ...
> > follow_page_pte()
> > get_page()
> > # elevates refcount of page X
> > ext4_punch_hole()
> > ext4_break_layouts() # two pages, X & Y, have refcount == 2
> > __wait_var_event() # called for page X
> >
> > __put_devmap_managed_page()
> > # drops refcount of X to 1
> >
> > # __wait_var_events() checks X's refcount in "if (condition)", and breaks.
> > # We never actually called ext4_wait_dax_page(), so 'retry' in
> > # ext4_break_layouts() is still false. Exit do/while loop in
> > # ext4_break_layouts, never attempting to wait on page Y which still has an
> > # elevated refcount of 2.
> >
> > truncate_pagecache_range()
> > ... a few layers ...
> > dax_disassociate_entry() # sees elevated refcount for Y, WARN_ON_ONCE()
> >
> > This second race can be fixed with the patch at the end of this function,
> > which I think should go in, unless there is a benfit to the current retry
> > scheme which relies on the 'retry' variable in {ext4,xfs}_break_layouts()?
> > With this patch applied I've been able to run my unit test through
> > thousands of iterations, where it used to failed consistently within 10 or
> > so.
> >
> > Even so, I wonder if the real solution is to add synchronization between
> > the direct I/O path and {ext4,xfs}_break_layouts()? Other ideas on how
> > this should be handled?
> >
> > --- >8 ---
> >
> > From a4519b0f40362f0a63ae96acaf986092aff0f0d3 Mon Sep 17 00:00:00 2001
> > From: Ross Zwisler <[email protected]>
> > Date: Wed, 25 Jul 2018 16:16:05 -0600
> > Subject: [PATCH] ext4: Close race between direct IO and ext4_break_layouts()
> >
> > If the refcount of a page is lowered between the time that it is returned
> > by dax_busy_page() and when the refcount is again checked in
> > ext4_break_layouts() => ___wait_var_event(), the waiting function
> > ext4_wait_dax_page() will never be called. This means that
> > ext4_break_layouts() will still have 'retry' set to false, so we'll stop
> > looping and never check the refcount of other pages in this inode.
> >
> > Instead, always continue looping as long as dax_layout_busy_page() gives us
> > a page which it found with an elevated refcount.
> >
> > Note that this works around the race exposed by my unit test, but I think
> > that there is another race that needs to be addressed, probably with
> > additional synchronization added between direct I/O and
> > {ext4,xfs}_break_layouts().
> >
> > Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/[email protected]>

OK, this is a good catch and the patch looks good. You can add:

Reviewed-by: Jan Kara <[email protected]>

Also please post this fix officially to Ted to include it in his tree (I
can see that he has all your other patches queued for the merge window).

AFAICS the XFS implementation of xfs_break_dax_layouts() has the same
problem. Can you please fix it? Thanks for working on this!

Honza

> > ---
> > fs/ext4/inode.c | 9 +++------
> > 1 file changed, 3 insertions(+), 6 deletions(-)
> >
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 27e9643bc13b..fedb88104bbf 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -4193,9 +4193,8 @@ int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
> > return 0;
> > }
> >
> > -static void ext4_wait_dax_page(struct ext4_inode_info *ei, bool *did_unlock)
> > +static void ext4_wait_dax_page(struct ext4_inode_info *ei)
> > {
> > - *did_unlock = true;
> > up_write(&ei->i_mmap_sem);
> > schedule();
> > down_write(&ei->i_mmap_sem);
> > @@ -4205,14 +4204,12 @@ int ext4_break_layouts(struct inode *inode)
> > {
> > struct ext4_inode_info *ei = EXT4_I(inode);
> > struct page *page;
> > - bool retry;
> > int error;
> >
> > if (WARN_ON_ONCE(!rwsem_is_locked(&ei->i_mmap_sem)))
> > return -EINVAL;
> >
> > do {
> > - retry = false;
> > page = dax_layout_busy_page(inode->i_mapping);
> > if (!page)
> > return 0;
> > @@ -4220,8 +4217,8 @@ int ext4_break_layouts(struct inode *inode)
> > error = ___wait_var_event(&page->_refcount,
> > atomic_read(&page->_refcount) == 1,
> > TASK_INTERRUPTIBLE, 0, 0,
> > - ext4_wait_dax_page(ei, &retry));
> > - } while (error == 0 && retry);
> > + ext4_wait_dax_page(ei));
> > + } while (error == 0);
> >
> > return error;
> > }
--
Jan Kara <jack-IBi9RG/[email protected]>
SUSE Labs, CR

2018-09-10 22:18:49

by Eric Sandeen

Subject: Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

On 8/7/18 3:45 AM, Jan Kara wrote:
> On Fri 27-07-18 10:28:51, Ross Zwisler wrote:
>> + fsdevel and the xfs list.
>>
>> On Wed, Jul 25, 2018 at 4:28 PM Ross Zwisler
>> <ross.zwisler-VuQAYsv1563Yd54FQh9/[email protected]> wrote:
>>> On Wed, Jul 11, 2018 at 10:17:41AM +0200, Jan Kara wrote:
>>>> On Tue 10-07-18 13:10:29, Ross Zwisler wrote:
>>>>> Changes since v3:
>>>>> * Added an ext4_break_layouts() call to ext4_insert_range() to ensure
>>>>> that the {ext4,xfs}_break_layouts() calls have the same meaning.
>>>>> (Dave, Darrick and Jan)
>>>>
>>>> How about the occasional WARN_ON_ONCE you mention below. Were you able to
>>>> hunt them down?
>>>
>>> The root cause of this issue is that while the ei->i_mmap_sem provides
>>> synchronization between ext4_break_layouts() and page faults, it doesn't
>>> provide synchronize us with the direct I/O path. This exact same issue exists
>>> in XFS AFAICT, with the synchronization tool there being the XFS_MMAPLOCK.
>>>
>>> This allows the direct I/O path to do I/O and raise & lower page->_refcount
>>> while we're executing a truncate/hole punch. This leads to us trying to free
>>> a page with an elevated refcount.
>>>
>>> Here's one instance of the race:
>>>
>>> CPU 0 CPU 1
>>> ----- -----
>>> ext4_punch_hole()
>>> ext4_break_layouts() # all pages have refcount=1
>>>
>>> ext4_direct_IO()
>>> ... lots of layers ...
>>> follow_page_pte()
>>> get_page() # elevates refcount
>>>
>>> truncate_pagecache_range()
>>> ... a few layers ...
>>> dax_disassociate_entry() # sees elevated refcount, WARN_ON_ONCE()
>>>
>
> So this is a very different race from the one below. And it should be
> impossible to happen. This race is exactly the reason why
> dax_layout_busy_page() has unmap_mapping_range() call to force GUP to fault
> which blocks on ei->i_mmap_sem / XFS_MMAPLOCK and thus avoids the race.
>
>>> A similar race occurs when the refcount is being dropped while we're running
>>> ext4_break_layouts(), and this is the one that my test was actually hitting:
>>>
>>> CPU 0 CPU 1
>>> ----- -----
>>> ext4_direct_IO()
>>> ... lots of layers ...
>>> follow_page_pte()
>>> get_page()
>>> # elevates refcount of page X
>>> ext4_punch_hole()
>>> ext4_break_layouts() # two pages, X & Y, have refcount == 2
>>> __wait_var_event() # called for page X
>>>
>>> __put_devmap_managed_page()
>>> # drops refcount of X to 1
>>>
>>> # __wait_var_events() checks X's refcount in "if (condition)", and breaks.
>>> # We never actually called ext4_wait_dax_page(), so 'retry' in
>>> # ext4_break_layouts() is still false. Exit do/while loop in
>>> # ext4_break_layouts, never attempting to wait on page Y which still has an
>>> # elevated refcount of 2.
>>>
>>> truncate_pagecache_range()
>>> ... a few layers ...
>>> dax_disassociate_entry() # sees elevated refcount for Y, WARN_ON_ONCE()
>>>
>>> This second race can be fixed with the patch at the end of this mail,
>>> which I think should go in, unless there is a benefit to the current retry
>>> scheme which relies on the 'retry' variable in {ext4,xfs}_break_layouts()?
>>> With this patch applied I've been able to run my unit test through
>>> thousands of iterations, where it used to fail consistently within 10 or
>>> so.
>>>
>>> Even so, I wonder if the real solution is to add synchronization between
>>> the direct I/O path and {ext4,xfs}_break_layouts()? Other ideas on how
>>> this should be handled?
>>>
>>> --- >8 ---
>>>
>>> From a4519b0f40362f0a63ae96acaf986092aff0f0d3 Mon Sep 17 00:00:00 2001
>>> From: Ross Zwisler <[email protected]>
>>> Date: Wed, 25 Jul 2018 16:16:05 -0600
>>> Subject: [PATCH] ext4: Close race between direct IO and ext4_break_layouts()
>>>
>>> If the refcount of a page is lowered between the time that it is returned
>>> by dax_layout_busy_page() and when the refcount is again checked in
>>> ext4_break_layouts() => ___wait_var_event(), the waiting function
>>> ext4_wait_dax_page() will never be called. This means that
>>> ext4_break_layouts() will still have 'retry' set to false, so we'll stop
>>> looping and never check the refcount of other pages in this inode.
>>>
>>> Instead, always continue looping as long as dax_layout_busy_page() gives us
>>> a page which it found with an elevated refcount.
>>>
>>> Note that this works around the race exposed by my unit test, but I think
>>> that there is another race that needs to be addressed, probably with
>>> additional synchronization added between direct I/O and
>>> {ext4,xfs}_break_layouts().
>>>
>>> Signed-off-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/[email protected]>
>
> OK, this is a good catch and the patch looks good. You can add:
>
> Reviewed-by: Jan Kara <[email protected]>
>
> Also please post this fix officially to Ted to include it in his tree (I
> can see that he has all your other patches queued for the merge window).

Did these ever get on Ted's radar? I don't see it upstream yet.

Thanks,
-Eric

2018-09-11 15:14:21

by Jan Kara

Subject: Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

On Mon 10-09-18 17:18:49, Eric Sandeen wrote:
> On 8/7/18 3:45 AM, Jan Kara wrote:
> > OK, this is a good catch and the patch looks good. You can add:
> >
> > Reviewed-by: Jan Kara <[email protected]>
> >
> > Also please post this fix officially to Ted to include it in his tree (I
> > can see that he has all your other patches queued for the merge window).
>
> Did these ever get on Ted's radar? I don't see it upstream yet.

Hum, it seems Ted never picked this patch up. I guess I'll gather the two
fixes you pointed out and resend them to Ted.

Honza
--
Jan Kara <jack-IBi9RG/[email protected]>
SUSE Labs, CR

2018-09-11 15:20:24

by Jan Kara

Subject: Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

On Tue 11-09-18 17:14:21, Jan Kara wrote:
> On Mon 10-09-18 17:18:49, Eric Sandeen wrote:
> > On 8/7/18 3:45 AM, Jan Kara wrote:
> > >
> > > OK, this is a good catch and the patch looks good. You can add:
> > >
> > > Reviewed-by: Jan Kara <[email protected]>
> > >
> > > Also please post this fix officially to Ted to include it in his tree (I
> > > can see that he has all your other patches queued for the merge window).
> >
> > Did these ever get on Ted's radar? I don't see it upstream yet.
>
> Hum, it seems Ted never picked this patch up. I guess I'll gather the two
> fixes you pointed out and resend them to Ted.

Actually only one patch when looking into it now.

Honza
--
Jan Kara <jack-IBi9RG/[email protected]>
SUSE Labs, CR

2018-09-11 17:28:06

by Theodore Y. Ts'o

Subject: Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

On Tue, Sep 11, 2018 at 05:20:24PM +0200, Jan Kara wrote:
> > Hum, it seems Ted never picked this patch up. I guess I'll gather the two
> > fixes you pointed out and resend them to Ted.
>
> Actually only one patch when looking into it now.

I believe both are in the ext4 git tree.

- Ted


2018-09-11 18:21:19

by Eric Sandeen

Subject: Re: [PATCH v4 0/2] ext4: fix DAX dma vs truncate/hole-punch

On 9/11/18 12:28 PM, Theodore Y. Ts'o wrote:
> On Tue, Sep 11, 2018 at 05:20:24PM +0200, Jan Kara wrote:
>>> Hum, it seems Ted never picked this patch up. I guess I'll gather the two
>>> fixes you pointed out and resend them to Ted.
>>
>> Actually only one patch when looking into it now.
>
> I believe both are in the ext4 git tree.
>
> - Ted

Ok, right:

dax: dax_layout_busy_page() warn on !exceptional
https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=cdbf8897cb09b7baf2b8a7e78051a35a872b01d5

ext4: handle layout changes to pinned DAX mappings
https://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4.git/commit/?h=dev&id=430657b6be896db57d974375cc499ca212c7f01d

I think the one I meant to ask about was:

[PATCH v2 1/2] ext4: Close race between direct IO and ext4_break_layouts()

https://patchwork.kernel.org/patch/10560393/

Sorry for the confusion.

-Eric