2016-07-27 12:07:45

by Jan Kara

Subject: Subtle races between DAX mmap fault and write path

Hi,

when testing my latest changes to DAX fault handling code I have hit the
following interesting race between the fault and write path (I'll show
function names for ext4 but xfs has the same issue AFAICT).

We have a file 'f' which has a hole at offset 0.

Process 0                       Process 1

data = mmap('f');
read data[0]
  -> fault, we map a hole page

                                pwrite('f', buf, len, 0)
                                  -> ext4_file_write_iter
                                       inode_lock(inode);
                                       __generic_file_write_iter()
                                         generic_file_direct_write()
                                           invalidate_inode_pages2_range()
                                             - drops hole page from
                                               the radix tree
                                           ext4_direct_IO()
                                             dax_do_io()
                                               - allocates block for
                                                 offset 0
data[0] = 1
  -> page_mkwrite fault
    -> ext4_dax_fault()
         down_read(&EXT4_I(inode)->i_mmap_sem);
         __dax_fault()
           grab_mapping_entry()
             - creates locked radix tree entry
           - maps block into PTE
           put_locked_mapping_entry()

                                           invalidate_inode_pages2_range()
                                             - removes dax entry from
                                               the radix tree

So we have just lost the information that block 0 is mapped and needs its
caches flushed.

Also the fact that the consistency of data as viewed by mmap and
dax_do_io() relies on invalidate_inode_pages2_range() is somewhat
unexpected to me and we should document it somewhere.

The question is how to best fix this. I see three options:

1) Lock out faults during writes via exclusive i_mmap_sem. That is rather
harsh but should work - we call filemap_write_and_wait() in
generic_file_direct_write() so we flush out all caches for the relevant
area before dropping radix tree entries.

2) Call filemap_write_and_wait() after we return from ->direct_IO before we
call invalidate_inode_pages2_range() and hold i_mmap_sem exclusively only
for those two calls. Lock hold time will be shorter than 1) but it will
require additional flush and we'd probably have to stop using
generic_file_direct_write() for DAX writes to allow for all this special
hackery.

3) Remodel dax_do_io() to work more like buffered IO and use radix tree
entry locks to protect against similar races. That has likely better
scalability than 1) but may actually be slower in the uncontended case (due
to all the radix tree operations).
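
To make 2) concrete, a rough and untested sketch (the function shape is
hypothetical and ext4_dio_get_block just stands in for whatever get_block
callback would be used):

	/*
	 * Hypothetical sketch of 2), not a patch: do the write via
	 * dax_do_io() directly (bypassing generic_file_direct_write())
	 * and then flush and invalidate under exclusive i_mmap_sem only.
	 */
	static ssize_t ext4_dax_write(struct kiocb *iocb, struct iov_iter *from)
	{
		struct inode *inode = file_inode(iocb->ki_filp);
		struct address_space *mapping = inode->i_mapping;
		ssize_t ret;

		inode_lock(inode);
		ret = dax_do_io(iocb, inode, from, ext4_dio_get_block, NULL, 0);
		if (ret > 0) {
			loff_t start = iocb->ki_pos - ret;

			down_write(&EXT4_I(inode)->i_mmap_sem);
			/* write out dirty radix tree entries so the fsync
			 * state of mmap writes over this range is not lost */
			filemap_write_and_wait_range(mapping, start,
						     iocb->ki_pos - 1);
			/* now it is safe to drop hole pages / dax entries */
			invalidate_inode_pages2_range(mapping,
					start >> PAGE_SHIFT,
					(iocb->ki_pos - 1) >> PAGE_SHIFT);
			up_write(&EXT4_I(inode)->i_mmap_sem);
		}
		inode_unlock(inode);
		return ret;
	}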

Any opinions on this?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


2016-07-27 21:11:12

by Ross Zwisler

Subject: Re: Subtle races between DAX mmap fault and write path

On Wed, Jul 27, 2016 at 02:07:45PM +0200, Jan Kara wrote:
> Hi,
>
> when testing my latest changes to DAX fault handling code I have hit the
> following interesting race between the fault and write path (I'll show
> function names for ext4 but xfs has the same issue AFAICT).
>
> We have a file 'f' which has a hole at offset 0.
>
> Process 0                       Process 1
>
> data = mmap('f');
> read data[0]
>   -> fault, we map a hole page
>
>                                 pwrite('f', buf, len, 0)
>                                   -> ext4_file_write_iter
>                                        inode_lock(inode);
>                                        __generic_file_write_iter()
>                                          generic_file_direct_write()
>                                            invalidate_inode_pages2_range()
>                                              - drops hole page from
>                                                the radix tree
>                                            ext4_direct_IO()
>                                              dax_do_io()
>                                                - allocates block for
>                                                  offset 0
> data[0] = 1
>   -> page_mkwrite fault
>     -> ext4_dax_fault()
>          down_read(&EXT4_I(inode)->i_mmap_sem);
>          __dax_fault()
>            grab_mapping_entry()
>              - creates locked radix tree entry
>            - maps block into PTE
>            put_locked_mapping_entry()
>
>                                            invalidate_inode_pages2_range()
>                                              - removes dax entry from
>                                                the radix tree
>
> So we have just lost the information that block 0 is mapped and needs its
> caches flushed.
>
> Also the fact that the consistency of data as viewed by mmap and
> dax_do_io() relies on invalidate_inode_pages2_range() is somewhat
> unexpected to me and we should document it somewhere.
>
> The question is how to best fix this. I see three options:
>
> 1) Lock out faults during writes via exclusive i_mmap_sem. That is rather
> harsh but should work - we call filemap_write_and_wait() in
> generic_file_direct_write() so we flush out all caches for the relevant
> area before dropping radix tree entries.
>
> 2) Call filemap_write_and_wait() after we return from ->direct_IO before we
> call invalidate_inode_pages2_range() and hold i_mmap_sem exclusively only
> for those two calls. Lock hold time will be shorter than 1) but it will
> require additional flush and we'd probably have to stop using
> generic_file_direct_write() for DAX writes to allow for all this special
> hackery.
>
> 3) Remodel dax_do_io() to work more like buffered IO and use radix tree
> entry locks to protect against similar races. That has likely better
> scalability than 1) but may actually be slower in the uncontended case (due
> to all the radix tree operations).
>
> Any opinions on this?

Can we just skip the two calls to invalidate_inode_pages2_range() in
generic_file_direct_write() for DAX I/O?

These calls are there for the direct I/O path because for direct I/O there is
a failure scenario where we have clean pages in the page cache which are stale
compared to the newly written data on media. If we read from these clean
pages instead of reading from media, we get data corruption.

This failure case doesn't exist for DAX - we don't care if there are radix
tree entries for the data region that the ->direct_IO() call is about to
write.

Similarly, for DAX I don't think we actually need to do the
filemap_write_and_wait_range() call in generic_file_direct_write() either.
It's a similar scenario - for direct I/O we are trying to make sure that any
dirty data in the page cache is written out to media before the ->direct_IO()
call happens. For DAX I don't think we care. If a user does an mmap() write
which creates a dirty radix tree entry, then does a write(), we should be able
to happily overwrite the old data with the new without flushing, and just
leave the dirty radix tree entry in place.
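
Roughly, the kind of guard this would mean (a hypothetical sketch only,
untested):

	/*
	 * Sketch only: DAX has no clean-but-stale page cache pages, so
	 * generic_file_direct_write() could skip both the pre-write
	 * flush and the post-write invalidate for DAX inodes.
	 */
	static bool dio_needs_pagecache_sync(struct address_space *mapping)
	{
		return !IS_DAX(mapping->host);
	}

generic_file_direct_write() would then test this before its
filemap_write_and_wait_range() / invalidate_inode_pages2_range() pairs.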

I realize this adds even more special case DAX code to mm/filemap.c, but if we
can avoid the race without adding any more locking (and by simplifying our
logic), it seems like it's worth it to me.

Does this break in some way I'm not seeing?

- Ross

2016-07-27 21:38:43

by Dan Williams

Subject: Re: Subtle races between DAX mmap fault and write path

[ Adding Eric ]

On Wed, Jul 27, 2016 at 5:07 AM, Jan Kara <[email protected]> wrote:
> Hi,
>
> when testing my latest changes to DAX fault handling code I have hit the
> following interesting race between the fault and write path (I'll show
> function names for ext4 but xfs has the same issue AFAICT).
>
> We have a file 'f' which has a hole at offset 0.
>
> Process 0                       Process 1
>
> data = mmap('f');
> read data[0]
>   -> fault, we map a hole page
>
>                                 pwrite('f', buf, len, 0)
>                                   -> ext4_file_write_iter
>                                        inode_lock(inode);
>                                        __generic_file_write_iter()
>                                          generic_file_direct_write()
>                                            invalidate_inode_pages2_range()
>                                              - drops hole page from
>                                                the radix tree
>                                            ext4_direct_IO()
>                                              dax_do_io()
>                                                - allocates block for
>                                                  offset 0
> data[0] = 1
>   -> page_mkwrite fault
>     -> ext4_dax_fault()
>          down_read(&EXT4_I(inode)->i_mmap_sem);
>          __dax_fault()
>            grab_mapping_entry()
>              - creates locked radix tree entry
>            - maps block into PTE
>            put_locked_mapping_entry()
>
>                                            invalidate_inode_pages2_range()
>                                              - removes dax entry from
>                                                the radix tree
>
> So we have just lost the information that block 0 is mapped and needs its
> caches flushed.
>
> Also the fact that the consistency of data as viewed by mmap and
> dax_do_io() relies on invalidate_inode_pages2_range() is somewhat
> unexpected to me and we should document it somewhere.
>
> The question is how to best fix this. I see three options:
>
> 1) Lock out faults during writes via exclusive i_mmap_sem. That is rather
> harsh but should work - we call filemap_write_and_wait() in
> generic_file_direct_write() so we flush out all caches for the relevant
> area before dropping radix tree entries.
>
> 2) Call filemap_write_and_wait() after we return from ->direct_IO before we
> call invalidate_inode_pages2_range() and hold i_mmap_sem exclusively only
> for those two calls. Lock hold time will be shorter than 1) but it will
> require additional flush and we'd probably have to stop using
> generic_file_direct_write() for DAX writes to allow for all this special
> hackery.
>
> 3) Remodel dax_do_io() to work more like buffered IO and use radix tree
> entry locks to protect against similar races. That has likely better
> scalability than 1) but may actually be slower in the uncontended case (due
> to all the radix tree operations).

In general, re-modeling dax_do_io() has come up before in the context
of error handling [1] and code readability [2]. I think there are
benefits outside of this immediate bug fix to make dax_do_io() less
special.

[1]: "PATCH v2 5/5] dax: handle media errors in dax_do_io" where we
debated direct-I/O writes triggering error clearing
[2]: commit "023954351fae dax: fix offset overflow in dax_io" where we
found an off by one bug was hiding in the range checking

2016-07-27 22:19:49

by Dave Chinner

Subject: Re: Subtle races between DAX mmap fault and write path

On Wed, Jul 27, 2016 at 03:10:39PM -0600, Ross Zwisler wrote:
> On Wed, Jul 27, 2016 at 02:07:45PM +0200, Jan Kara wrote:
> > Hi,
> >
> > when testing my latest changes to DAX fault handling code I have hit the
> > following interesting race between the fault and write path (I'll show
> > function names for ext4 but xfs has the same issue AFAICT).

The XFS update I just pushed to Linus contains a rework of the XFS
DAX IO path. It no longer shares the XFS direct IO path, so it
doesn't contain any of the direct IO warts anymore.

> > We have a file 'f' which has a hole at offset 0.
> >
> > Process 0                       Process 1
> >
> > data = mmap('f');
> > read data[0]
> >   -> fault, we map a hole page
> >
> >                                 pwrite('f', buf, len, 0)
> >                                   -> ext4_file_write_iter
> >                                        inode_lock(inode);
> >                                        __generic_file_write_iter()
> >                                          generic_file_direct_write()
> >                                            invalidate_inode_pages2_range()
> >                                              - drops hole page from
> >                                                the radix tree
> >                                            ext4_direct_IO()
> >                                              dax_do_io()
> >                                                - allocates block for
> >                                                  offset 0
> > data[0] = 1
> >   -> page_mkwrite fault
> >     -> ext4_dax_fault()
> >          down_read(&EXT4_I(inode)->i_mmap_sem);
> >          __dax_fault()
> >            grab_mapping_entry()
> >              - creates locked radix tree entry
> >            - maps block into PTE
> >            put_locked_mapping_entry()
> >
> >                                            invalidate_inode_pages2_range()
> >                                              - removes dax entry from
> >                                                the radix tree

In XFS, we don't call __generic_file_write_iter or
generic_file_direct_write(), and xfs_file_dax_write() does not have
this trailing call to invalidate_inode_pages2_range() anymore. It's
DAX - there's nothing to invalidate, right?

So I think Christoph just (accidentally) removed this race condition
from XFS....

> > So we have just lost the information that block 0 is mapped and needs its
> > caches flushed.
> >
> > Also the fact that the consistency of data as viewed by mmap and
> > dax_do_io() relies on invalidate_inode_pages2_range() is somewhat
> > unexpected to me and we should document it somewhere.

I don't think it does - what, in DAX, is incoherent? If anything,
it's the data in the direct IO buffer, not the view the mmap will
see. i.e. the post-write invalidate is to ensure that applications
that have mmapped the file see the data written by direct IO. That's
not necessary with DAX.

> > The question is how to best fix this. I see three options:
> >
> > 1) Lock out faults during writes via exclusive i_mmap_sem. That is rather
> > harsh but should work - we call filemap_write_and_wait() in
> > generic_file_direct_write() so we flush out all caches for the relevant
> > area before dropping radix tree entries.

We don't call filemap_write_and_wait() in xfs_file_dax_write()
anymore, either. It's DAX - we don't need to flush anything to read
the correct data, and there's nothing cached that becomes stale when
we overwrite the destination memory.

[snip]

> > Any opinions on this?
>
> Can we just skip the two calls to invalidate_inode_pages2_range() in
> generic_file_direct_write() for DAX I/O?

The first invalidate is still there in XFS. The comment above it
in the new XFS code says:

	/*
	 * Yes, even DAX files can have page cache attached to them: A zeroed
	 * page is inserted into the pagecache when we have to serve a write
	 * fault on a hole. It should never be dirtied and can simply be
	 * dropped from the pagecache once we get real data for the page.
	 */
	if (mapping->nrpages) {
		ret = invalidate_inode_pages2(mapping);
		WARN_ON_ONCE(ret);
	}


> Similarly, for DAX I don't think we actually need to do the
> filemap_write_and_wait_range() call in generic_file_direct_write() either.
> It's a similar scenario - for direct I/O we are trying to make sure that any
> dirty data in the page cache is written out to media before the ->direct_IO()
> call happens. For DAX I don't think we care. If a user does an mmap() write
> which creates a dirty radix tree entry, then does a write(), we should be able
> to happily overwrite the old data with the new without flushing, and just
> leave the dirty radix tree entry in place.

Yup, that's pretty much how XFS treats DAX now - it's not direct IO,
but it's not buffered IO, either...

I don't think there's a problem that needs to be fixed in the DAX
code - the issue is all about the surrounding IO context, i.e.
whether dax_do_io() is automatically coherent with mmap or not,
because mmap directly exposes the CPU cache coherent memory that
dax_do_io() modifies.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-07-28 08:10:33

by Jan Kara

Subject: Re: Subtle races between DAX mmap fault and write path

On Thu 28-07-16 08:19:49, Dave Chinner wrote:
> On Wed, Jul 27, 2016 at 03:10:39PM -0600, Ross Zwisler wrote:
> > On Wed, Jul 27, 2016 at 02:07:45PM +0200, Jan Kara wrote:
> > > Hi,
> > >
> > > when testing my latest changes to DAX fault handling code I have hit the
> > > following interesting race between the fault and write path (I'll show
> > > function names for ext4 but xfs has the same issue AFAICT).
>
> The XFS update I just pushed to Linus contains a rework of the XFS
> DAX IO path. It no longer shares the XFS direct IO path, so it
> doesn't contain any of the direct IO warts anymore.

Ah, OK. I knew Christoph rewrote it but forgot to check your tree when
looking into this bug.

> > > We have a file 'f' which has a hole at offset 0.
> > >
> > > Process 0                       Process 1
> > >
> > > data = mmap('f');
> > > read data[0]
> > >   -> fault, we map a hole page
> > >
> > >                                 pwrite('f', buf, len, 0)
> > >                                   -> ext4_file_write_iter
> > >                                        inode_lock(inode);
> > >                                        __generic_file_write_iter()
> > >                                          generic_file_direct_write()
> > >                                            invalidate_inode_pages2_range()
> > >                                              - drops hole page from
> > >                                                the radix tree
> > >                                            ext4_direct_IO()
> > >                                              dax_do_io()
> > >                                                - allocates block for
> > >                                                  offset 0
> > > data[0] = 1
> > >   -> page_mkwrite fault
> > >     -> ext4_dax_fault()
> > >          down_read(&EXT4_I(inode)->i_mmap_sem);
> > >          __dax_fault()
> > >            grab_mapping_entry()
> > >              - creates locked radix tree entry
> > >            - maps block into PTE
> > >            put_locked_mapping_entry()
> > >
> > >                                            invalidate_inode_pages2_range()
> > >                                              - removes dax entry from
> > >                                                the radix tree
>
> In XFS, we don't call __generic_file_write_iter or
> generic_file_direct_write(), and xfs_file_dax_write() does not have
> this trailing call to invalidate_inode_pages2_range() anymore. It's
> DAX - there's nothing to invalidate, right?
>
> So I think Christoph just (accidentally) removed this race condition
> from XFS....

So with current XFS code what prevents the following:

Process 0                       Process 1

data = mmap('f');
                                pwrite('f', buf, len, 0)
                                  -> xfs_file_dax_write
                                       xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
                                       invalidate_inode_pages2()
                                         - drops hole page from the
                                           radix tree
                                       xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
read data[0]
  -> fault
    -> xfs_filemap_fault
         xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
         dax_fault()
           - installs hole in the radix tree and PTE

                                       dax_do_io()
                                         - allocates block for offset 0

And now Process 0 doesn't see the data Process 1 has written until a fault
on that address is triggered again for some reason.

Heck, looking at the code in xfs_file_dax_write() you call
invalidate_inode_pages2() which clears the *whole* radix tree. So you have
just lost the dirtiness information for the whole file and thus fsync(2)
will not flush any data written via mmap.

> > > So we have just lost the information that block 0 is mapped and needs its
> > > caches flushed.
> > >
> > > Also the fact that the consistency of data as viewed by mmap and
> > > dax_do_io() relies on invalidate_inode_pages2_range() is somewhat
> > > unexpected to me and we should document it somewhere.
>
> I don't think it does - what, in DAX, is incoherent? If anything,
> it's the data in the direct IO buffer, not the view the mmap will
> see. i.e. the post-write invalidate is to ensure that applications
> that have mmapped the file see the data written by direct IO. That's
> not necessary with DAX.

The coherency issues are very similar to direct IO because of hole pages.
Generally we need to maintain coherency between what is pointed to from page
tables and what is really backing file data (as viewed by inode's extent
tree). So whenever block allocation / deallocation happens for the file, we
may need to update page tables which map this offset as well. For
deallocation (truncate, punch hole, ...) we take care of this under
protection of MMAPLOCK so things are fine. For allocation we currently
don't use this serialization mechanism.

> > > The question is how to best fix this. I see three options:
> > >
> > > 1) Lock out faults during writes via exclusive i_mmap_sem. That is rather
> > > harsh but should work - we call filemap_write_and_wait() in
> > > generic_file_direct_write() so we flush out all caches for the relevant
> > > area before dropping radix tree entries.
>
> We don't call filemap_write_and_wait() in xfs_file_dax_write()
> anymore, either. It's DAX - we don't need to flush anything to read
> the correct data, and there's nothing cached that becomes stale when
> we overwrite the destination memory.

So DAX doesn't need flushing to maintain a consistent view of the data, but it
does need flushing to make sure fsync(2) results in data written via mmap
reaching persistent storage. Look at the following example: assume you
write via mmap to offsets 0-1023 in the page and via write(2) to offsets
1024-2047 in the same page. Then, after the write(2), you may still have data
in volatile caches for offsets 0-1023 of the page. So your write(2) either
has to keep the corresponding entry in the radix tree dirty (and currently
XFS's call to invalidate_inode_pages2() in xfs_file_dax_write() doesn't do
this) or you have to flush the page via filemap_write_and_wait().
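
To make this concrete, the user-visible sequence is (minimal sketch; it
assumes 'f' exists on a DAX filesystem and is at least one page long,
and error handling is omitted):

	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("f", O_RDWR);
		char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		char buf[1024];

		memset(p, 'a', 1024);		/* mmap write: dirties CPU caches */
		memset(buf, 'b', sizeof(buf));
		pwrite(fd, buf, sizeof(buf), 1024);	/* write(2) path */
		/* fsync() must make offsets 0-1023 durable too, so either
		 * the radix tree entry stays dirty across the write(2),
		 * or the write(2) path itself must flush the page */
		fsync(fd);
		return 0;
	}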

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-07-28 08:47:54

by Jan Kara

Subject: Re: Subtle races between DAX mmap fault and write path

On Wed 27-07-16 15:10:39, Ross Zwisler wrote:
> On Wed, Jul 27, 2016 at 02:07:45PM +0200, Jan Kara wrote:
> > Hi,
> >
> > when testing my latest changes to DAX fault handling code I have hit the
> > following interesting race between the fault and write path (I'll show
> > function names for ext4 but xfs has the same issue AFAICT).
> >
> > We have a file 'f' which has a hole at offset 0.
> >
> > Process 0                       Process 1
> >
> > data = mmap('f');
> > read data[0]
> >   -> fault, we map a hole page
> >
> >                                 pwrite('f', buf, len, 0)
> >                                   -> ext4_file_write_iter
> >                                        inode_lock(inode);
> >                                        __generic_file_write_iter()
> >                                          generic_file_direct_write()
> >                                            invalidate_inode_pages2_range()
> >                                              - drops hole page from
> >                                                the radix tree
> >                                            ext4_direct_IO()
> >                                              dax_do_io()
> >                                                - allocates block for
> >                                                  offset 0
> > data[0] = 1
> >   -> page_mkwrite fault
> >     -> ext4_dax_fault()
> >          down_read(&EXT4_I(inode)->i_mmap_sem);
> >          __dax_fault()
> >            grab_mapping_entry()
> >              - creates locked radix tree entry
> >            - maps block into PTE
> >            put_locked_mapping_entry()
> >
> >                                            invalidate_inode_pages2_range()
> >                                              - removes dax entry from
> >                                                the radix tree
> >
> > So we have just lost the information that block 0 is mapped and needs its
> > caches flushed.
> >
> > Also the fact that the consistency of data as viewed by mmap and
> > dax_do_io() relies on invalidate_inode_pages2_range() is somewhat
> > unexpected to me and we should document it somewhere.
> >
> > The question is how to best fix this. I see three options:
> >
> > 1) Lock out faults during writes via exclusive i_mmap_sem. That is rather
> > harsh but should work - we call filemap_write_and_wait() in
> > generic_file_direct_write() so we flush out all caches for the relevant
> > area before dropping radix tree entries.
> >
> > 2) Call filemap_write_and_wait() after we return from ->direct_IO before we
> > call invalidate_inode_pages2_range() and hold i_mmap_sem exclusively only
> > for those two calls. Lock hold time will be shorter than 1) but it will
> > require additional flush and we'd probably have to stop using
> > generic_file_direct_write() for DAX writes to allow for all this special
> > hackery.
> >
> > 3) Remodel dax_do_io() to work more like buffered IO and use radix tree
> > entry locks to protect against similar races. That has likely better
> > scalability than 1) but may actually be slower in the uncontended case (due
> > to all the radix tree operations).
> >
> > Any opinions on this?
>
> Can we just skip the two calls to invalidate_inode_pages2_range() in
> generic_file_direct_write() for DAX I/O?
>
> These calls are there for the direct I/O path because for direct I/O there is
> a failure scenario where we have clean pages in the page cache which are stale
> compared to the newly written data on media. If we read from these clean
> pages instead of reading from media, we get data corruption.
>
> This failure case doesn't exist for DAX - we don't care if there are radix
> tree entries for the data region that the ->direct_IO() call is about to
> write.
>
> Similarly, for DAX I don't think we actually need to do the
> filemap_write_and_wait_range() call in generic_file_direct_write() either.
> It's a similar scenario - for direct I/O we are trying to make sure that any
> dirty data in the page cache is written out to media before the ->direct_IO()
> call happens. For DAX I don't think we care. If a user does an mmap() write
> which creates a dirty radix tree entry, then does a write(), we should be able
> to happily overwrite the old data with the new without flushing, and just
> leave the dirty radix tree entry in place.

See my email to Dave for details but, to put it shortly, a write(2) which
allocates a block has to make sure the hole page for that offset is unmapped
from page tables and freed, so at least one invalidate_inode_pages2_range()
call is necessary even for DAX. And because that call will currently remove
dirty radix tree entries as well, flushing is currently necessary too. If we
modified invalidate_inode_pages2_range() to keep dirty radix tree entries
(which makes sense because invalidate_inode_pages2_range() does not discard
dirty pages in the first place), flushing won't be necessary. That is true.
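
i.e. something like this in the invalidate walk (hypothetical sketch,
the helper name is made up):

	/*
	 * Hypothetical sketch: invalidate_inode_pages2_range() would keep
	 * dirty DAX radix tree entries, mirroring how it already refuses
	 * to discard dirty pages. A dirty entry carries fsync() state.
	 */
	static bool dax_entry_droppable(struct address_space *mapping,
					pgoff_t index)
	{
		return !radix_tree_tag_get(&mapping->page_tree, index,
					   PAGECACHE_TAG_DIRTY);
	}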

> I realize this adds even more special case DAX code to mm/filemap.c, but
> if we can avoid the race without adding any more locking (and by
> simplifying our logic), it seems like it's worth it to me.

Well, we could always decouple the DAX write path from the direct IO write
path. XFS already did this, and if the generic DIO path won't be suitable for
DAX on ext4, we can do the same for it.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-07-29 02:21:52

by Dave Chinner

Subject: Re: Subtle races between DAX mmap fault and write path

On Thu, Jul 28, 2016 at 10:10:33AM +0200, Jan Kara wrote:
> On Thu 28-07-16 08:19:49, Dave Chinner wrote:
> > On Wed, Jul 27, 2016 at 03:10:39PM -0600, Ross Zwisler wrote:
> > > On Wed, Jul 27, 2016 at 02:07:45PM +0200, Jan Kara wrote:
> > > > Hi,
> > > >
> > > > when testing my latest changes to DAX fault handling code I have hit the
> > > > following interesting race between the fault and write path (I'll show
> > > > function names for ext4 but xfs has the same issue AFAICT).
> >
> > The XFS update I just pushed to Linus contains a rework of the XFS
> > DAX IO path. It no longer shares the XFS direct IO path, so it
> > doesn't contain any of the direct IO warts anymore.
>
> Ah, OK. I knew Christoph rewrote it but forgot to check your tree when
> looking into this bug.
>
> > > > We have a file 'f' which has a hole at offset 0.
> > > >
> > > > Process 0                       Process 1
> > > >
> > > > data = mmap('f');
> > > > read data[0]
> > > >   -> fault, we map a hole page
> > > >
> > > >                                 pwrite('f', buf, len, 0)
> > > >                                   -> ext4_file_write_iter
> > > >                                        inode_lock(inode);
> > > >                                        __generic_file_write_iter()
> > > >                                          generic_file_direct_write()
> > > >                                            invalidate_inode_pages2_range()
> > > >                                              - drops hole page from
> > > >                                                the radix tree
> > > >                                            ext4_direct_IO()
> > > >                                              dax_do_io()
> > > >                                                - allocates block for
> > > >                                                  offset 0
> > > > data[0] = 1
> > > >   -> page_mkwrite fault
> > > >     -> ext4_dax_fault()
> > > >          down_read(&EXT4_I(inode)->i_mmap_sem);
> > > >          __dax_fault()
> > > >            grab_mapping_entry()
> > > >              - creates locked radix tree entry
> > > >            - maps block into PTE
> > > >            put_locked_mapping_entry()
> > > >
> > > >                                            invalidate_inode_pages2_range()
> > > >                                              - removes dax entry from
> > > >                                                the radix tree
> >
> > In XFS, we don't call __generic_file_write_iter or
> > generic_file_direct_write(), and xfs_file_dax_write() does not have
> > this trailing call to invalidate_inode_pages2_range() anymore. It's
> > DAX - there's nothing to invalidate, right?
> >
> > So I think Christoph just (accidentally) removed this race condition
> > from XFS....
>
> So with current XFS code what prevents the following:
>
> Process 0                       Process 1
>
> data = mmap('f');
>                                 pwrite('f', buf, len, 0)
>                                   -> xfs_file_dax_write
>                                        xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
>                                        invalidate_inode_pages2()
>                                          - drops hole page from the
>                                            radix tree
>                                        xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
> read data[0]
>   -> fault
>     -> xfs_filemap_fault
>          xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
>          dax_fault()
>            - installs hole in the radix tree and PTE
>
>                                        dax_do_io()
>                                          - allocates block for offset 0
>
> And now Process 0 doesn't see the data Process 1 has written until a fault
> on that address is triggered again for some reason.

IMO, that's a problem internal to dax_do_io(), not an XFS problem.
XFS has serialised everything it can (the block lookup/allocation)
between mmap and the IO path, yet the internal DAX radix tree and
mappings are now inconsistent.

If dax_do_io() allocates a new block, it clearly makes the existing
mapping tree contents for that block invalid. Hence it needs to be
invalidating mappings and PTEs over a physical offset that it knows
it just changed. The filesystem shouldn't do this - it would be a
horrible layering violation to require get_block() to invalidate
PTEs....

> Heck, looking at the code in xfs_file_dax_write() you call
> invalidate_inode_pages2() which clears the *whole* radix tree. So you have
> just lost the dirtiness information for the whole file and thus fsync(2)
> will not flush any data written via mmap.

Ok, so that's a bug, but it has no impact on what is being discussed
here. It's also not something any of our regression tests have
picked up....

> > > > So we have just lost the information that block 0 is mapped and needs its
> > > > caches flushed.
> > > >
> > > > Also the fact that the consistency of data as viewed by mmap and
> > > > dax_do_io() relies on invalidate_inode_pages2_range() is somewhat
> > > > unexpected to me and we should document it somewhere.
> >
> > I don't think it does - what, in DAX, is incoherent? If anything,
> > it's the data in the direct IO buffer, not the view the mmap will
> > see. i.e. the post-write invalidate is to ensure that applications
> > that have mmapped the file see the data written by direct IO. That's
> > not necessary with DAX.
>
> The coherency issues are very similar to direct IO because of hole pages.
> Generally we need to maintain coherency between what is pointed to from page
> tables and what is really backing file data (as viewed by inode's extent
> tree). So whenever block allocation / deallocation happens for the file, we
> may need to update page tables which map this offset as well. For
> deallocation (truncate, punch hole, ...) we take care of this under
> protection of MMAPLOCK so things are fine. For allocation we currently
> don't use this serialization mechanism.

And that's because we can't take locks in the IO path that might be
taken in the page fault path because of page fault recursion
deadlocks during copy operations.

We do, however, return buffer_new() when allocations occur, which
tells the dax_do_io() code it needs to do something special with the
buffer....
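
For illustration, the hook point could look something like this
(hypothetical sketch, the helper is made up):

	/*
	 * Hypothetical sketch: inside the dax_do_io() loop, a freshly
	 * allocated block is flagged by get_block() via buffer_new(),
	 * and that is where invalidation of the mappings over the
	 * changed range could hook in.
	 */
	static void dax_new_block_invalidate(struct inode *inode, loff_t pos)
	{
		/* unmap any hole page mmap may have installed over this
		 * offset before exposing the newly allocated block */
		unmap_mapping_range(inode->i_mapping, pos & PAGE_MASK,
				    PAGE_SIZE, 0);
	}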

> > > > The question is how to best fix this. I see three options:
> > > >
> > > > 1) Lock out faults during writes via exclusive i_mmap_sem. That is rather
> > > > harsh but should work - we call filemap_write_and_wait() in
> > > > generic_file_direct_write() so we flush out all caches for the relevant
> > > > area before dropping radix tree entries.
> >
> > We don't call filemap_write_and_wait() in xfs_file_dax_write()
> > anymore, either. It's DAX - we don't need to flush anything to read
> > the correct data, and there's nothing cached that becomes stale when
> > we overwrite the destination memory.
>
> So DAX doesn't need flushing to maintain a consistent view of the data, but it
> does need flushing to make sure fsync(2) results in data written via mmap
> reaching persistent storage.

I thought this all changed with the removal of the pcommit
instruction and wmb_pmem() going away. Isn't it now a platform
requirement that dirty cache lines over persistent memory ranges
are either guaranteed to be flushed to persistent storage on power
fail or when required by REQ_FLUSH?

https://lkml.org/lkml/2016/7/9/131

And part of that is the wmb_pmem() calls are going away?

https://lkml.org/lkml/2016/7/9/136
https://lkml.org/lkml/2016/7/9/140

i.e. fsync on pmem only needs to take care of writing filesystem
metadata now, and the pmem driver handles the rest when it gets a
REQ_FLUSH bio from fsync?

https://lkml.org/lkml/2016/7/9/134

Or have we somehow ended up with the fucked up situation where
dax_do_io() writes are (effectively) immediately persistent and
untracked by internal infrastructure, whilst mmap() writes
require internal dirty tracking and fsync() to flush caches via
writeback?

Seriously: everything inside DAX needs to operate the same way and
use the same tracking infrastructure, otherwise we are going to
always have fucked up coherency problems and races like this. If
mmap writes are volatile, then dax_do_io writes should be volatile,
too. And both should maintain that dirty state and mapping coherency
via the internal mapping infrastructure, just like the page cache
does for mmap vs buffered IO.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-07-29 14:44:25

by Dan Williams

Subject: Re: Subtle races between DAX mmap fault and write path

On Thu, Jul 28, 2016 at 7:21 PM, Dave Chinner <[email protected]> wrote:
> On Thu, Jul 28, 2016 at 10:10:33AM +0200, Jan Kara wrote:
>> On Thu 28-07-16 08:19:49, Dave Chinner wrote:
[..]
>> So DAX doesn't need flushing to maintain a consistent view of the data, but it
>> does need flushing to make sure fsync(2) results in data written via mmap
>> reaching persistent storage.
>
> I thought this all changed with the removal of the pcommit
> instruction and wmb_pmem() going away. Isn't it now a platform
> requirement that dirty cache lines over persistent memory ranges
> are either guaranteed to be flushed to persistent storage on power
> fail or when required by REQ_FLUSH?

No, nothing automates cache flushing. The path of a write is:

cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media

The ADR mechanism and the wpq-flush facility flush data through the
imc (integrated memory controller) to media. dax_do_io() gets writes
to the imc, but we still need a posted-write-buffer flush mechanism to
guarantee data makes it out to media.


> https://lkml.org/lkml/2016/7/9/131
>
> And part of that is the wmb_pmem() calls are going away?
>
> https://lkml.org/lkml/2016/7/9/136
> https://lkml.org/lkml/2016/7/9/140
>
> i.e. fsync on pmem only needs to take care of writing filesystem
> metadata now, and the pmem driver handles the rest when it gets a
> REQ_FLUSH bio from fsync?
>
> https://lkml.org/lkml/2016/7/9/134
>
> Or have we somehow ended up with the fucked up situation where
> dax_do_io() writes are (effectively) immediately persistent and
> untracked by internal infrastructure, whilst mmap() writes
> require internal dirty tracking and fsync() to flush caches via
> writeback?

dax_do_io() writes are not immediately persistent. They bypass the
cpu-cache and cpu-write-buffer and are ready to be flushed to media
by REQ_FLUSH or power-fail on an ADR system.

2016-07-30 00:12:49

by Dave Chinner

Subject: Re: Subtle races between DAX mmap fault and write path

On Fri, Jul 29, 2016 at 07:44:25AM -0700, Dan Williams wrote:
> On Thu, Jul 28, 2016 at 7:21 PM, Dave Chinner <[email protected]> wrote:
> > On Thu, Jul 28, 2016 at 10:10:33AM +0200, Jan Kara wrote:
> >> On Thu 28-07-16 08:19:49, Dave Chinner wrote:
> [..]
> >> So DAX doesn't need flushing to maintain a consistent view of the data, but it
> >> does need flushing to make sure fsync(2) results in data written via mmap
> >> reaching persistent storage.
> >
> > I thought this all changed with the removal of the pcommit
> > instruction and wmb_pmem() going away. Isn't it now a platform
> > requirement that dirty cache lines over persistent memory ranges
> > are either guaranteed to be flushed to persistent storage on power
> > fail or when required by REQ_FLUSH?
>
> No, nothing automates cache flushing. The path of a write is:
>
> cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media
>
> The ADR mechanism and the wpq-flush facility flush data through the
> imc (integrated memory controller) to media. dax_do_io() gets writes
> to the imc, but we still need a posted-write-buffer flush mechanism to
> guarantee data makes it out to media.

So what you are saying is that on an ADR machine, we have these
domains w.r.t. power fail:

cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media

|-------------volatile-------------------|-----persistent--------------|

because anything that gets to the IMC is guaranteed to be flushed to
stable media on power fail.

But on a posted-write-buffer system, we have this:

cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media

|-------------volatile-------------------------------------------|--persistent--|

IOWs, only things already posted to the media via REQ_FLUSH are
considered stable on persistent media. What happens in this case
when power fails during a media update? Incomplete writes?

> > Or have we somehow ended up with the fucked up situation where
> > dax_do_io() writes are (effectively) immediately persistent and
> > untracked by internal infrastructure, whilst mmap() writes
> > require internal dirty tracking and fsync() to flush caches via
> > writeback?
>
> dax_do_io() writes are not immediately persistent. They bypass the
> cpu-cache and cpu-write-buffer and are ready to be flushed to media
> by REQ_FLUSH or power-fail on an ADR system.

IOWs, on an ADR system write is /effectively/ immediately persistent
because if power fails ADR guarantees it will be flushed to stable
media, while on a posted write system it is volatile and will be
lost. Right?

If so, that's even worse than just having mmap/write behave
differently - now writes will behave differently depending on the
specific hardware installed. I think this makes it even more
important for the DAX code to hide this behaviour from the
filesystems by treating everything as volatile.

If we track the dirty blocks from write in the radix tree like we do
for mmap, then we can just use a normal memcpy() in dax_do_io(),
getting rid of the slow cache bypass that is currently run. Radix
tree updates are much less expensive than a slow memcpy of large
amounts of data, and fsync can then take care of persistence, just
like we do for mmap.

We should just make the design assumption that all persistent memory
is volatile, track where we dirty it in all paths, and use the
fastest volatile memcpy primitives available to us in the IO path.
We'll end up with a faster fastpath than if we use CPU cache bypass
copies, dax_do_io() and mmap will be coherent and synchronised, and
fsync() will have the same requirements and overhead regardless of
the way the application modifies the pmem or the hardware platform
used to implement the pmem.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-07-30 00:53:07

by Dan Williams

Subject: Re: Subtle races between DAX mmap fault and write path

On Fri, Jul 29, 2016 at 5:12 PM, Dave Chinner <[email protected]> wrote:
> On Fri, Jul 29, 2016 at 07:44:25AM -0700, Dan Williams wrote:
>> On Thu, Jul 28, 2016 at 7:21 PM, Dave Chinner <[email protected]> wrote:
>> > On Thu, Jul 28, 2016 at 10:10:33AM +0200, Jan Kara wrote:
>> >> On Thu 28-07-16 08:19:49, Dave Chinner wrote:
>> [..]
>> >> So DAX doesn't need flushing to maintain a consistent view of the data, but it
>> >> does need flushing to make sure fsync(2) results in data written via mmap
>> >> reaching persistent storage.
>> >
>> > I thought this all changed with the removal of the pcommit
>> > instruction and wmb_pmem() going away. Isn't it now a platform
>> > requirement that dirty cache lines over persistent memory ranges
>> > are either guaranteed to be flushed to persistent storage on power
>> > fail or when required by REQ_FLUSH?
>>
>> No, nothing automates cache flushing. The path of a write is:
>>
>> cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media
>>
>> The ADR mechanism and the wpq-flush facility flush data through the
>> imc (integrated memory controller) to media. dax_do_io() gets writes
>> to the imc, but we still need a posted-write-buffer flush mechanism to
>> guarantee data makes it out to media.
>
> So what you are saying is that on an ADR machine, we have these
> domains w.r.t. power fail:
>
> cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media
>
> |-------------volatile-------------------|-----persistent--------------|
>
> because anything that gets to the IMC is guaranteed to be flushed to
> stable media on power fail.
>
> But on a posted-write-buffer system, we have this:
>
> cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media
>
> |-------------volatile-------------------------------------------|--persistent--|
>
> IOWs, only things already posted to the media via REQ_FLUSH are
> considered stable on persistent media. What happens in this case
> when power fails during a media update? Incomplete writes?

Yes, power failure during a media update will end up with incomplete
writes on an 8-byte boundary.

>
>> > Or have we somehow ended up with the fucked up situation where
>> > dax_do_io() writes are (effectively) immediately persistent and
>> > untracked by internal infrastructure, whilst mmap() writes
>> > require internal dirty tracking and fsync() to flush caches via
>> > writeback?
>>
>> dax_do_io() writes are not immediately persistent. They bypass the
>> cpu-cache and cpu-write-buffer and are ready to be flushed to media
>> by REQ_FLUSH or power-fail on an ADR system.
>
> IOWs, on an ADR system write is /effectively/ immediately persistent
> because if power fails ADR guarantees it will be flushed to stable
> media, while on a posted write system it is volatile and will be
> lost. Right?

Right.

>
> If so, that's even worse than just having mmap/write behave
> differently - now writes will behave differently depending on the
> specific hardware installed. I think this makes it even more
> important for the DAX code to hide this behaviour from the
> filesystems by treating everything as volatile.

The symmetry does sound appealing...

> If we track the dirty blocks from write in the radix tree like we do
> for mmap, then we can just use a normal memcpy() in dax_do_io(),
> getting rid of the slow cache bypass that is currently run. Radix
> tree updates are much less expensive than a slow memcpy of large
> amounts of data, and fsync can then take care of persistence, just
> like we do for mmap.

If we go this route to increase the amount of dirty-data tracking in
the radix tree, it raises the priority of one of the items on the backlog;
namely, determining the crossover point where wbinvd of the entire cache
is faster than a clflush / clwb loop.
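
For illustration, the choice would look something like this (a sketch
only; the crossover value is a placeholder that would have to be
measured):

	/* Hypothetical sketch: pick a flush strategy by range size. */
	#define WBINVD_CROSSOVER_BYTES	(8UL << 20)	/* placeholder value */

	static void flush_dirty_range(void *addr, size_t len)
	{
		if (len >= WBINVD_CROSSOVER_BYTES)
			wbinvd();			/* flush entire CPU cache */
		else
			clflush_cache_range(addr, len);	/* per-cacheline flush */
	}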

> We should just make the design assumption that all persistent memory
> is volatile, track where we dirty it in all paths, and use the
> fastest volatile memcpy primitives available to us in the IO path.
> We'll end up with a faster fastpath than if we use CPU cache bypass
> copies, dax_do_io() and mmap will be coherent and synchronised, and
> fsync() will have the same requirements and overhead regardless of
> the way the application modifies the pmem or the hardware platform
> used to implement the pmem.

I like the direction. I'd still want to measure where/whether it's
actually faster given the writes may have evicted hot data, and the
amortized cost of the cache flushing loop.

2016-08-01 03:13:23

by Keith Packard

Subject: Re: Subtle races between DAX mmap fault and write path

Dave Chinner <[email protected]> writes:

> So we'd see that from the point of view of a torn single sector
> write. Ok, so we better limit DAX to CRC enabled filesystems to
> ensure these sorts of events are always caught by the filesystem.

Which is the same lack of guarantee that we already get on rotating
media. Flash media seems to work harder to provide sector atomicity; I
guess that's a feature?

The alternative is to hide metadata behind a translation layer, and
while that can be done for lame file systems, I'd like to see the raw
hardware capabilities exposed and then make free software that
constructs a reliable system on top of that.

--
-keith



2016-08-01 04:39:38

by Dan Williams

Subject: Re: Subtle races between DAX mmap fault and write path

On Sun, Jul 31, 2016 at 9:07 PM, Dave Chinner <[email protected]> wrote:
> On Sun, Jul 31, 2016 at 08:13:23PM -0700, Keith Packard wrote:
>> Dave Chinner <[email protected]> writes:
>>
>> > So we'd see that from the point of view of a torn single sector
>> > write. Ok, so we better limit DAX to CRC enabled filesystems to
>> > ensure these sorts of events are always caught by the filesystem.
>>
>> Which is the same lack of guarantee that we already get on rotating
>> media. Flash media seems to work harder to provide sector atomicity; I
>> guess that's a feature?
>
> No, that's not the case. Existing sector based storage guarantees
> atomic sector writes - rotating or solid state. I haven't seen a
> corruption caused by a torn sector write in 15 years. The BTT layer was
> written for pmem to provide this guarantee, but you can't use DAX
> through that layer.
>
> In XFS, the only place we really care about torn sector writes is in
> the journal - we have CRCs there to detect such problems and CRC
> enabled filesystems will prevent recovery of checkpoints containing
> torn writes. Non-CRC enable filesystems just warn and continue on
> their merry way (compatibility reasons - older kernels don't emit
> CRCs in log writes), so we really do need to restrict DAX to CRC
> enabled filesystems.
>
>> The alternative is to hide metadata behind a translation layer, and
>> while that can be done for lame file systems,
>
> No. These "lame" filesystems just need to use their existing
> metadata buffering layer to hide this and the journalling subsystem
> should protect against torn metadata writes.
>
>> I'd like to see the raw
>> hardware capabilities exposed and then make free software that
>> constructs a reliable system on top of that.
>
> Not as easy as it sounds. A couple of weeks ago I tried converting
> the XFS journal to use DAX to avoid the IO layer and resultant
> memcpy(), and I found out that there's a major problem with using
> DAX for metadata access on existing filesystems: pmem is not
> physically contiguous. I found that out the hard way - the ramdisk
> is DAX capable, but each 4k page that is allocated to it has a
> different memory address.
>
> Filesystems are designed around the fact that the block address
> space they are working with is contiguous - that sector N and sector
> N+1 can be accessed in the same IO. This is not true for
> direct access of pmem - while the sectors might be logically
> contiguous, the physical memory that is directly accessed is not.
> i.e. when we cross a page boundary, the next page could be on a
> different node, on a different device (e.g. RAID0), etc.
> Traditional remapping devices (i.e. DM, md, etc) hide the physical
> discontiguities from the filesystem - they present a contiguous LBA
> and remap/split/combine under the covers where the filesystem is not
> aware of it at all.
>
> The reason this is important is that if the filesystem has metadata
> constructs larger than a single page it can't use DAX to access them
> as a single object because they may lie across a physical
> discontiguity in the memory map. Filesystems aren't usually exposed
> to this - sectors of a block device are indexed by "Logical Block
> Address" for good reason - the LBA address space is supposed to hide
> the physical layout of the storage from the layer above the block
> device.
>
> OTOH, DAX directly exposes the physical layout to the filesystem.
> And because it's DAX-based pmem and not cached struct pages, we
> can't run vm_map_ram() to virtually map the range we need to see as
> a contiguous range, as we do in XFS for large objects such as directory
> blocks and log buffers. For other large objects such as inode
> clusters, we can directly map each page as the objects within the
> clusters are page aligned and never overlap page boundaries, but
> that only works for inode and dquot buffers. Hence DAX as it stands
> makes it extremely difficult to "retrofit" DAX into all aspects of
> existing filesystems because exposing physical discontiguities breaks
> code that assumes they don't exist.

On this specific point about page remapping, the administrator can
configure struct pages for pmem and you can detect whether they are
present in the filesystem with pfn_t_has_page(). I.e. you could
require pages be present for XFS, if that helps...

2016-08-01 01:46:45

by Dave Chinner

Subject: Re: Subtle races between DAX mmap fault and write path

On Fri, Jul 29, 2016 at 05:53:07PM -0700, Dan Williams wrote:
> On Fri, Jul 29, 2016 at 5:12 PM, Dave Chinner <[email protected]> wrote:
....
> > So what you are saying is that on an ADR machine, we have these
> > domains w.r.t. power fail:
> >
> > cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media
> >
> > |-------------volatile-------------------|-----persistent--------------|
> >
> > because anything that gets to the IMC is guaranteed to be flushed to
> > stable media on power fail.
> >
> > But on a posted-write-buffer system, we have this:
> >
> > cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media
> >
> > |-------------volatile-------------------------------------------|--persistent--|
> >
> > IOWs, only things already posted to the media via REQ_FLUSH are
> > considered stable on persistent media. What happens in this case
> > when power fails during a media update? Incomplete writes?
>
> Yes, power failure during a media update will end up with incomplete
> writes on an 8-byte boundary.

So we'd see that from the point of view of a torn single sector
write. Ok, so we better limit DAX to CRC enabled filesystems to
ensure these sorts of events are always caught by the filesystem.

> >> > Or have we somehow ended up with the fucked up situation where
> >> > dax_do_io() writes are (effectively) immediately persistent and
> >> > untracked by internal infrastructure, whilst mmap() writes
> >> > require internal dirty tracking and fsync() to flush caches via
> >> > writeback?
> >>
> >> dax_do_io() writes are not immediately persistent. They bypass the
> >> cpu-cache and cpu-write-buffer and are ready to be flushed to media
> >> by REQ_FLUSH or power-fail on an ADR system.
> >
> > IOWs, on an ADR system write is /effectively/ immediately persistent
> > because if power fails ADR guarantees it will be flushed to stable
> > media, while on a posted write system it is volatile and will be
> > lost. Right?
>
> Right.

Thanks for the clarification.

> > If we track the dirty blocks from write in the radix tree like we do
> > for mmap, then we can just use a normal memcpy() in dax_do_io(),
> > getting rid of the slow cache bypass that is currently run. Radix
> > tree updates are much less expensive than a slow memcpy of large
> > amounts of data, and fsync can then take care of persistence, just
> > like we do for mmap.
>
> If we go this route to increase the amount of dirty-data tracking in
> the radix tree, it raises the priority of one of the items on the backlog;
> namely, determining the crossover point where wbinvd of the entire cache
> is faster than a clflush / clwb loop.

Actually, I'd look at it from the other perspective - at what point
does fine-grained dirty tracking run faster than the brute force
flush? If the gains are only marginal, then we need to question
whether fine-grained tracking is worth the complexity at all...

Cheers,

Dave.
--
Dave Chinner
[email protected]


2016-08-01 04:07:37

by Dave Chinner

Subject: Re: Subtle races between DAX mmap fault and write path

On Sun, Jul 31, 2016 at 08:13:23PM -0700, Keith Packard wrote:
> Dave Chinner <[email protected]> writes:
>
> > So we'd see that from the point of view of a torn single sector
> > write. Ok, so we better limit DAX to CRC enabled filesystems to
> > ensure these sorts of events are always caught by the filesystem.
>
> Which is the same lack of guarantee that we already get on rotating
> media. Flash media seems to work harder to provide sector atomicity; I
> guess that's a feature?

No, that's not the case. Existing sector based storage guarantees
atomic sector writes - rotating or solid state. I haven't seen a
corruption caused by a torn sector write in 15 years. The BTT layer was
written for pmem to provide this guarantee, but you can't use DAX
through that layer.

In XFS, the only place we really care about torn sector writes is in
the journal - we have CRCs there to detect such problems and CRC
enabled filesystems will prevent recovery of checkpoints containing
torn writes. Non-CRC enable filesystems just warn and continue on
their merry way (compatibility reasons - older kernels don't emit
CRCs in log writes), so we really do need to restrict DAX to CRC
enabled filesystems.

> The alternative is to hide metadata behind a translation layer, and
> while that can be done for lame file systems,

No. These "lame" filesystems just need to use their existing
metadata buffering layer to hide this and the journalling subsystem
should protect against torn metadata writes.

> I'd like to see the raw
> hardware capabilities exposed and then make free software that
> constructs a reliable system on top of that.

Not as easy as it sounds. A couple of weeks ago I tried converting
the XFS journal to use DAX to avoid the IO layer and resultant
memcpy(), and I found out that there's a major problem with using
DAX for metadata access on existing filesystems: pmem is not
physically contiguous. I found that out the hard way - the ramdisk
is DAX capable, but each 4k page that is allocated to it has a
different memory address.

Filesystems are designed around the fact that the block address
space they are working with is contiguous - that sector N and sector
N+1 can be accessed in the same IO. This is not true for
direct access of pmem - while the sectors might be logically
contiguous, the physical memory that is directly accessed is not.
i.e. when we cross a page boundary, the next page could be on a
different node, on a different device (e.g. RAID0), etc.
Traditional remapping devices (i.e. DM, md, etc) hide the physical
discontiguities from the filesystem - they present a contiguous LBA
and remap/split/combine under the covers where the filesystem is not
aware of it at all.

The reason this is important is that if the filesystem has metadata
constructs larger than a single page it can't use DAX to access them
as a single object because they may lie across a physical
discontiguity in the memory map. Filesystems aren't usually exposed
to this - sectors of a block device are indexed by "Logical Block
Address" for good reason - the LBA address space is supposed to hide
the physical layout of the storage from the layer above the block
device.

OTOH, DAX directly exposes the physical layout to the filesystem.
And because it's DAX-based pmem and not cached struct pages, we
can't run vm_map_ram() to virtually map the range we need to see as
a contiguous range, as we do in XFS for large objects such as directory
blocks and log buffers. For other large objects such as inode
clusters, we can directly map each page as the objects within the
clusters are page aligned and never overlap page boundaries, but
that only works for inode and dquot buffers. Hence DAX as it stands
makes it extremely difficult to "retrofit" DAX into all aspects of
existing filesystems because exposing physical discontiguities breaks
code that assumes they don't exist.

I've been saying this from the start: we can't make use of all the
capabilities of pmem with existing filesystems and DAX. DAX is
supposed to be a *stopgap measure* until pmem native solutions are
built and mature. Finding limitations like the above only serves to
highlight the fact that DAX on ext4/XFS is only a partial solution.

The real problem is, as always, a lack of resources to implement
everything we want to be able to do. Building a new filesystem is
hard, takes a long time, and all the people we have that might be
able to do it are fully occupied by maintaining and enhancing the
existing Linux filesystems to support things like DAX or other
functionality that users want (e.g. rmap, reflink, copy offload,
etc).

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-08-01 07:39:06

by Dave Chinner

Subject: Re: Subtle races between DAX mmap fault and write path

On Sun, Jul 31, 2016 at 09:39:38PM -0700, Dan Williams wrote:
> On Sun, Jul 31, 2016 at 9:07 PM, Dave Chinner <[email protected]> wrote:
> > OTOH, DAX directly exposes the physical layout to the filesystem.
> > And because it's DAX-based pmem and not cached struct pages, we
> > can't run vm_map_ram() to virtually map the range we need to see as
> > a contiguous range, as we do in XFS for large objects such as directory
> > blocks and log buffers. For other large objects such as inode
> > clusters, we can directly map each page as the objects within the
> > clusters are page aligned and never overlap page boundaries, but
> > that only works for inode and dquot buffers. Hence DAX as it stands
> > makes it extremely difficult to "retrofit" DAX into all aspects of
> > existing filesystems because exposing physical discontiguities breaks
> > code that assumes they don't exist.
>
> On this specific point about page remapping, the administrator can
> configure struct pages for pmem and you can detect whether they are
> present in the filesystem with pfn_t_has_page(). I.e. you could
> require pages be present for XFS, if that helps...

It's kinda silly to require struct pages for the entire pmem device
if they are only needed for accessing a (comparatively) small amount
of metadata.

Besides, now that I look at it more deeply, we can't use virtually
mapped pmem for the log buffers. We can't allocate memory at the
point in time where we work out what LBA in the log we need to map
to physical pmem for the current log write. Hence calls to
vm_map_ram() can't be used, and so that rules out using mapped page
based pmem for log buffers.

I'll probably have to rewrite the xlog_write() engine completely to
be able to handle discontiguous pages in the iclog buffers before we
can consider mapping them via DAX now, and I'm really not sure it's
worth the effort. I'd much prefer to spend time designing a native
pmem filesystem....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-08-01 10:13:45

by Boaz Harrosh

[permalink] [raw]
Subject: Re: Subtle races between DAX mmap fault and write path

On 07/30/2016 03:12 AM, Dave Chinner wrote:
<>
>
> If we track the dirty blocks from write in the radix tree like we
> do for mmap, then we can just use a normal memcpy() in dax_do_io(),
> getting rid of the slow cache bypass that is currently run. Radix
> tree updates are much less expensive than a slow memcpy of large
> amounts of data, and fsync can then take care of persistence, just
> like we do for mmap.
>

No!

mov_nt instructions: that "slow cache bypass that is currently run" above
is actually faster than cached writes by 20%, and if you add the dirty
tracking and cl_flush instructions it becomes 2x slower in the most
optimal case and 3x slower in the DAX case.

The network guys noticed the mov_nt instructions' superior performance
for years before we pushed DAX into the tree. Look for the users of
copy_from_iter_nocache and the comments from when they were introduced:
those were used before DAX, and had nothing at all to do with persistence.

So what you are suggesting is fine, only 3 times slower in the current
implementation.

> We should just make the design assumption that all persistent memory
> is volatile, track where we dirty it in all paths, and use the
> fastest volatile memcpy primitives available to us in the IO path.

The "fastest volatile memcpy primitives available" is what we do
today with the mov_nt instructions.
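
[ For concreteness, this is the shape of the copy being discussed: a
minimal sketch, not the kernel's actual copy_from_iter_nocache();
it assumes 8-byte-aligned buffers whose length is a multiple of 8. ]

#include <immintrin.h>
#include <stddef.h>

static void memcpy_movnt(void *dst, const void *src, size_t len)
{
	long long *d = dst;
	const long long *s = src;

	while (len >= sizeof(long long)) {
		_mm_stream_si64(d++, *s++);	/* MOVNTI: store, bypassing the CPU cache */
		len -= sizeof(long long);
	}
	_mm_sfence();	/* fence the weakly-ordered non-temporal stores */
}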

> We'll end up with a faster fastpath that if we use CPU cache bypass
> copies, dax_do_io() and mmap will be coherent and synchronised, and
> fsync() will have the same requirements and overhead regardless of
> the way the application modifies the pmem or the hardware platform
> used to implement the pmem.
>

I measured; there are tests running in our labs every night. Your
suggestion, on an ADR system, is 3 times slower to reach persistence.

That is why I was pushing for MMAP_PMEM_AWARE: a smart mmap application
in user mode uses mov_nt anyway, because it wants that 20% gain regardless
of what the kernel will do. Then it calls fsync() and the kernel will burn
2x more CPU, just for the sake of burning CPU, because the data is already
persistent from the get-go.

> Cheers,
> Dave.

Like you, I do not care for DAX very much, but please let's keep the
physical facts straight.

Cheers indeed
Boaz

2016-08-01 17:47:08

by Dan Williams

[permalink] [raw]
Subject: Re: Subtle races between DAX mmap fault and write path

On Fri, Jul 29, 2016 at 5:12 PM, Dave Chinner <[email protected]> wrote:
> On Fri, Jul 29, 2016 at 07:44:25AM -0700, Dan Williams wrote:
>> On Thu, Jul 28, 2016 at 7:21 PM, Dave Chinner <[email protected]> wrote:
>> > On Thu, Jul 28, 2016 at 10:10:33AM +0200, Jan Kara wrote:
>> >> On Thu 28-07-16 08:19:49, Dave Chinner wrote:
>> [..]
>> >> So DAX doesn't need flushing to maintain consistent view of the data but it
>> >> does need flushing to make sure fsync(2) results in data written via mmap
>> >> to reach persistent storage.
>> >
>> > I thought this all changed with the removal of the pcommit
>> > instruction and wmb_pmem() going away. Isn't it now a platform
>> > requirement now that dirty cache lines over persistent memory ranges
>> > are either guaranteed to be flushed to persistent storage on power
>> > fail or when required by REQ_FLUSH?
>>
>> No, nothing automates cache flushing. The path of a write is:
>>
>> cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media
>>
>> The ADR mechanism and the wpq-flush facility flush data thorough the
>> imc (integrated memory controller) to media. dax_do_io() gets writes
>> to the imc, but we still need a posted-write-buffer flush mechanism to
>> guarantee data makes it out to media.
>
> So what you are saying is that on and ADR machine, we have these
> domains w.r.t. power fail:
>
> cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media
>
> |-------------volatile-------------------|-----persistent--------------|
>
> because anything that gets to the IMC is guaranteed to be flushed to
> stable media on power fail.
>
> But on a posted-write-buffer system, we have this:
>
> cpu-cache -> cpu-write-buffer -> bus -> imc -> imc-write-buffer -> media
>
> |-------------volatile-------------------------------------------|--persistent--|
>
> IOWs, only things already posted to the media via REQ_FLUSH are
> considered stable on persistent media. What happens in this case
> when power fails during a media update? Incomplete writes?
>
>> > Or have we somehow ended up with the fucked up situation where
>> > dax_do_io() writes are (effectively) immediately persistent and
>> > untracked by internal infrastructure, whilst mmap() writes
>> > require internal dirty tracking and fsync() to flush caches via
>> > writeback?
>>
>> dax_do_io() writes are not immediately persistent. They bypass the
>> cpu-cache and cpu-write-bufffer and are ready to be flushed to media
>> by REQ_FLUSH or power-fail on an ADR system.
>
> IOWs, on an ADR system write is /effectively/ immediately persistent
> because if power fails ADR guarantees it will be flushed to stable
> media, while on a posted write system it is volatile and will be
> lost. Right?
>
> If so, that's even worse than just having mmap/write behave
> differently - now writes will behave differently depending on the
> specific hardware installed. I think this makes it even more
> important for the DAX code to hide this behaviour from the
> fielsystems by treating everything as volatile.

Sorry, I confused things above by implying that Linux will need to
consider NVDIMM platforms without ADR.

ADR is already required for present day NVDIMM platforms and that
requirement continues. The explicit flushing allowed by REQ_FLUSH is
an optional mechanism to backstop ADR, but is not required and will
not be used as an alternative to ADR. See pages 21 and 22 of the
latest driver writer's guide if you want more details [1].

Long story short, we should always consider writes that enter the
persistence domain (movnt + sfence) as persistent regardless of the
presence of WPQ-flush.

[1]: http://pmem.io/documents/NVDIMM_DriverWritersGuide-July-2016.pdf

2016-08-02 00:21:44

by Dave Chinner

[permalink] [raw]
Subject: Re: Subtle races between DAX mmap fault and write path

On Mon, Aug 01, 2016 at 01:13:45PM +0300, Boaz Harrosh wrote:
> On 07/30/2016 03:12 AM, Dave Chinner wrote:
> <>
> >
> > If we track the dirty blocks from write in the radix tree like we
> > do for mmap, then we can just use a normal memcpy() in dax_do_io(),
> > getting rid of the slow cache bypass that is currently run. Radix
> > tree updates are much less expensive than a slow memcpy of large
> > amounts of data, and fsync can then take care of persistence, just
> > like we do for mmap.
> >
>
> No!
>
> mov_nt instructions, That "slow cache bypass that is currently run" above
> is actually faster than cached writes by 20%, and if you add the dirty
> tracking and cl_flush instructions it becomes x2 slower in the most
> optimal case and 3 times slower in the DAX case.

IOWs, we'd expect writing to a file with DAX to be faster than when
buffered through the page cache and fsync()d, right?

The numbers I get say otherwise. Filesystem on 8GB pmem block device:

$ sudo mkfs.xfs -f /dev/pmem1
meta-data=/dev/pmem1 isize=512 agcount=4, agsize=524288 blks
= sectsz=4096 attr=2, projid32bit=1
= crc=1 finobt=1, sparse=0, rmapbt=0, reflink=0
data = bsize=4096 blocks=2097152, imaxpct=25
= sunit=0 swidth=0 blks
naming =version 2 bsize=4096 ascii-ci=0 ftype=1
log =internal log bsize=4096 blocks=2560, version=2
= sectsz=4096 sunit=1 blks, lazy-count=1
realtime =none extsz=4096 blocks=0, rtextents=0

Test command that writes 1GB to the filesystem:

$ sudo time xfs_io -f -c "pwrite 0 1g" -c "sync" /mnt/scratch/foo
wrote 1073741824/1073741824 bytes at offset 0
1 GiB, 262144 ops; 0:00:01.00 (880.040 MiB/sec and 225290.3317 ops/sec)
0.02user 1.13system 0:02.27elapsed 51%CPU (0avgtext+0avgdata 2344maxresident)k
0inputs+0outputs (0major+109minor)pagefaults 0swaps

Results:

        pwrite B/W (MiB/s)        runtime
run     no DAX      DAX       no DAX   DAX
 1      880.040   236.352      2.27s  4.34s
 2      857.094   257.297      2.18s  3.99s
 3      865.820   236.087      2.13s  4.34s

It is quite clear that *DAX is much slower* than normal buffered
IO through the page cache followed by a fsync().

Stop and think why that might be. We're only doing one copy with
DAX, so why is the pwrite() speed 4x lower than for a copy into the
page cache? We're not copying 4x the data here. We're copying it
once. But there's another uncached write to each page during
allocation to zero each block first, so we're actually doing two
uncached writes to the page. And we're doing an allocation per page
with DAX, whereas we're using delayed allocation in the buffered IO
case which has much less overhead.

The only thing we can do here to speed the DAX case up is do cached
memcpy so that the data copy after zeroing runs at L1 cache speed
(i.e. 50x faster than it currently does).

Let's take the allocation out of it, eh? Let's do overwrite instead,
fsync in the buffered IO case, no fsync for DAX:

        pwrite B/W (MiB/s)        runtime
run     no DAX      DAX       no DAX   DAX
 1       1119       1125       1.85s  0.93s
 2       1113       1121       1.83s  0.91s
 3       1128       1078       1.80s  0.94s

So, pwrite speeds are no different for DAX vs page cache IO. Also,
now we can see the overhead of writeback - a second data copy to
the pmem for the IO during fsync. If I take the fsync() away from
the buffered IO, the runtime drops to 0.89-0.91s, which is identical
to the DAX code. Given the DAX code has a shorter IO path than
buffered IO, it's not showing any speed advantage for using uncached
IO....

Let's go back to the allocation case, but this time take advantage
of the new iomap-based IO path in XFS to amortise the DAX allocation
overhead by using a 16MB IO size instead of 4k:

$ sudo time xfs_io -f -c "pwrite 0 1g -b 16m" -c sync /mnt/scratch/foo


        pwrite B/W (MiB/s)        runtime
run     no DAX      DAX       no DAX   DAX
 1       1344       1028       1.63s  1.03s
 2       1410        980       1.62s  1.06s
 3       1399       1032       1.72s  0.99s

So, pwrite bandwidth of the copy into the page cache is still much
higher than that of the DAX path, but now the allocation overhead
is minimised and hence the double copy in the buffered IO writeback
path shows up. For completeness, let's just run the overwrite case
here, which is effectively just competing memcpy implementations,
fsync for buffered, no fsync for DAX:

        pwrite B/W (MiB/s)        runtime
run     no DAX      DAX       no DAX   DAX
 1       1791       1727       1.53s  0.59s
 2       1768       1726       1.57s  0.59s
 3       1799       1729       1.55s  0.59s

Again, runtime shows the overhead of the double copy in the buffered
IO/writeback path. It also shows the overhead in the DAX path of the
allocation zeroing vs overwrite. If I drop the fsync from the
buffered IO path, bandwidth remains the same but runtime drops to
0.55-0.57s, so again the buffered IO write path is faster than DAX
while doing more work.

IOWs, the overhead of dirty page tracking in the page cache mapping
tree is not significant in terms of write() performance. Hence
I fail to see why it should be significant in the DAX path - it will
probably have less overhead because we have less to account for in
the DAX write path. The only performance penalty for dirty tracking
is in the fsync writeback path itself, and that's a separate issue
for optimisation.

Quite frankly, what I see here is that whatever optimisations that
have been made to make DAX fast don't show any real world benefit.
Further, the claims that dirty tracking has too much overhead are
*completely shot down* by the fact that buffered write IO through
the page cache is *faster* than the current DAX write IO path.

> The network guys noticed the mov_nt instructions' superior
> performance for years before we pushed DAX into the tree. Look for
> the users of copy_from_iter_nocache and the comments from when they
> were introduced: those were used before DAX, and had nothing at all to do
> with persistence.
>
> So what you are suggesting is fine, only 3 times slower in the current
> implementation.

What is optimal for one use case does not mean it is optimal for
all.

High level operation performance measurement disagrees with the
assertion that we're using the *best* method of copying data in the
DAX path available right now. Understand how data moves through the
system, then optimise the data flow. What we are seeing here is that
optimising for the fastest single data movement can result in lower
overall performance where the code path requires multiple data
movements to the same location....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-08-04 18:40:42

by Toshi Kani

[permalink] [raw]
Subject: Re: Subtle races between DAX mmap fault and write path

On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:
> On Mon, Aug 01, 2016 at 01:13:45PM +0300, Boaz Harrosh wrote:
> >
> > On 07/30/2016 03:12 AM, Dave Chinner wrote:
> > <>
> > > If we track the dirty blocks from write in the radix tree like we
> > > do for mmap, then we can just use a normal memcpy() in dax_do_io(),
> > > getting rid of the slow cache bypass that is currently run. Radix
> > > tree updates are much less expensive than a slow memcpy of large
> > > amounts of data, and fsync can then take care of persistence, just
> > > like we do for mmap.
> > >
> >
> > No! 
> >
> > mov_nt instructions, That "slow cache bypass that is currently run" above
> > is actually faster than cached writes by 20%, and if you add the dirty
> > tracking and cl_flush instructions it becomes x2 slower in the most
> > optimal case and 3 times slower in the DAX case.
>
> IOWs, we'd expect writing to a file with DAX to be faster than when
> buffered through the page cache and fsync()d, right?
>
> The numbers I get say otherwise. Filesystem on 8GB pmem block device:
>
> $ sudo mkfs.xfs -f /dev/pmem1
> meta-data=/dev/pmem1             isize=512    agcount=4, agsize=524288 blks
>          =                       sectsz=4096  attr=2, projid32bit=1
>          =                       crc=1        finobt=1, sparse=0, rmapbt=0, reflink=0
> data     =                       bsize=4096   blocks=2097152, imaxpct=25
>          =                       sunit=0      swidth=0 blks
> naming   =version 2              bsize=4096   ascii-ci=0 ftype=1
> log      =internal log           bsize=4096   blocks=2560, version=2
>          =                       sectsz=4096  sunit=1 blks, lazy-count=1
> realtime =none                   extsz=4096   blocks=0, rtextents=0
>
> Test command that writes 1GB to the filesystem:
>
> $ sudo time xfs_io -f -c "pwrite 0 1g" -c "sync" /mnt/scratch/foo
> wrote 1073741824/1073741824 bytes at offset 0
> 1 GiB, 262144 ops; 0:00:01.00 (880.040 MiB/sec and 225290.3317 ops/sec)
> 0.02user 1.13system 0:02.27elapsed 51%CPU (0avgtext+0avgdata 2344maxresident)k
> 0inputs+0outputs (0major+109minor)pagefaults 0swaps
>
> Results:
>
>         pwrite B/W (MiB/s)        runtime
> run     no DAX      DAX       no DAX   DAX
>  1      880.040   236.352      2.27s  4.34s
>  2      857.094   257.297      2.18s  3.99s
>  3      865.820   236.087      2.13s  4.34s
>
> It is quite clear that *DAX is much slower* than normal buffered
> IO through the page cache followed by a fsync().
>
> Stop and think why that might be. We're only doing one copy with
> DAX, so why is the pwrite() speed 4x lower than for a copy into the
> page cache? We're not copying 4x the data here. We're copying it
> once. But there's another uncached write to each page during
> allocation to zero each block first, so we're actually doing two
> uncached writes to the page. And we're doing an allocation per page
> with DAX, whereas we're using delayed allocation in the buffered IO
> case which has much less overhead.
>
> The only thing we can do here to speed the DAX case up is do cached
> memcpy so that the data copy after zeroing runs at L1 cache speed
> (i.e. 50x faster than it currently does).
>
> Let's take the allocation out of it, eh? Let's do overwrite instead,
> fsync in the buffered IO case, no fsync for DAX:
>
>         pwrite B/W (MiB/s)        runtime
> run     no DAX      DAX       no DAX   DAX
>  1       1119       1125       1.85s  0.93s
>  2       1113       1121       1.83s  0.91s
>  3       1128       1078       1.80s  0.94s
>
> So, pwrite speeds are no different for DAX vs page cache IO. Also,
> now we can see the overhead of writeback - a second data copy to
> the pmem for the IO during fsync. If I take the fsync() away from
> the buffered IO, the runtime drops to 0.89-0.91s, which is identical
> to the DAX code. Given the DAX code has a shorter IO path than
> buffered IO, it's not showing any speed advantage for using uncached
> IO....
>
> Let's go back to the allocation case, but this time take advantage
> of the new iomap-based IO path in XFS to amortise the DAX allocation
> overhead by using a 16MB IO size instead of 4k:
>
> $ sudo time xfs_io -f -c "pwrite 0 1g -b 16m" -c sync /mnt/scratch/foo
>
>
>         pwrite B/W (MiB/s)        runtime
> run     no DAX      DAX       no DAX   DAX
>  1       1344       1028       1.63s  1.03s
>  2       1410        980       1.62s  1.06s
>  3       1399       1032       1.72s  0.99s
>
> So, pwrite bandwidth of the copy into the page cache is still much
> higher than that of the DAX path, but now the allocation overhead
> is minimised and hence the double copy in the buffered IO writeback
> path shows up. For completeness, let's just run the overwrite case
> here, which is effectively just competing memcpy implementations,
> fsync for buffered, no fsync for DAX:
>
>         pwrite B/W (MiB/s)        runtime
> run     no DAX      DAX       no DAX   DAX
>  1       1791       1727       1.53s  0.59s
>  2       1768       1726       1.57s  0.59s
>  3       1799       1729       1.55s  0.59s
>
> Again, runtime shows the overhead of the double copy in the buffered
> IO/writeback path. It also shows the overhead in the DAX path of the
> allocation zeroing vs overwrite. If I drop the fsync from the
> buffered IO path, bandwidth remains the same but runtime drops to
> 0.55-0.57s, so again the buffered IO write path is faster than DAX
> while doing more work.
>
> IOWs, the overhead of dirty page tracking in the page cache mapping
> tree is not significant in terms of write() performance. Hence
> I fail to see why it should be significant in the DAX path - it will
> probably have less overhead because we have less to account for in
> the DAX write path. The only performance penalty for dirty tracking
> is in the fsync writeback path itself, and that's a separate issue
> for optimisation.

I mostly agree with the analysis, but I have a few comments about the use of
cached / uncached (movnt) copy to PMEM.

I do not think the test results are relevant on this point because both
buffered and dax write() paths use uncached copy to avoid clflush.  The
buffered path uses cached copy to the page cache and then uses uncached
copy to PMEM via writeback.  Therefore, the buffered IO path also benefits
from using uncached copy to avoid clflush.

Cached copy (req movq) is slightly faster than uncached copy, and should be
used for writing to the page cache.  For writing to PMEM, however, additional
clflush can be expensive, and allocating cachelines for PMEM leads to
evicting the application's cachelines.
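
[ For comparison, the cached-copy alternative being discussed would look
something like this: a hypothetical sketch, assuming 64-byte cachelines.
No trailing fence is used since plain CLFLUSH executions are serialized,
unlike CLFLUSHOPT. ]

#include <emmintrin.h>
#include <stdint.h>
#include <string.h>

#define CLSIZE 64

static void memcpy_cached_flush(void *dst, const void *src, size_t len)
{
	uintptr_t p = (uintptr_t)dst & ~(uintptr_t)(CLSIZE - 1);
	uintptr_t end = (uintptr_t)dst + len;

	memcpy(dst, src, len);			/* cache-allocating stores */
	for (; p < end; p += CLSIZE)
		_mm_clflush((const void *)p);	/* write each dirtied line back */
}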

Thanks,
-Toshi
_______________________________________________
Linux-nvdimm mailing list
[email protected]
https://lists.01.org/mailman/listinfo/linux-nvdimm

2016-08-05 11:27:39

by Dave Chinner

[permalink] [raw]
Subject: Re: Subtle races between DAX mmap fault and write path

[ cut to just the important points ]
On Thu, Aug 04, 2016 at 06:40:42PM +0000, Kani, Toshimitsu wrote:
> On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:
> > If I drop the fsync from the
> > buffered IO path, bandwidth remains the same but runtime drops to
> > 0.55-0.57s, so again the buffered IO write path is faster than DAX
> > while doing more work.
>
> I do not think the test results are relevant on this point because both
> buffered and dax write() paths use uncached copy to avoid clflush. The
> buffered path uses cached copy to the page cache and then uses uncached copy to
> PMEM via writeback. Therefore, the buffered IO path also benefits from using
> uncached copy to avoid clflush.

Except that I tested without the writeback path for buffered IO, so
there was a direct comparison for single cached copy vs single
uncached copy.

The undeniable fact is that a write() with a single cached copy with
all the overhead of dirty page tracking is /faster/ than a much
shorter, simpler IO path that uses an uncached copy. That's what the
numbers say....

> Cached copy (req movq) is slightly faster than uncached copy,

Not according to Boaz - he claims that uncached is 20% faster than
cached. How about you two get together, do some benchmarking and get
your story straight, eh?

> and should be
> used for writing to the page cache. For writing to PMEM, however, additional
> clflush can be expensive, and allocating cachelines for PMEM leads to
> evicting the application's cachelines.

I keep hearing people tell me why cached copies are slower, but
no-one is providing numbers to back up their statements. The only
numbers we have are the ones I've published showing cached copies w/
full dirty tracking is faster than uncached copy w/o dirty tracking.

Show me the numbers that back up your statements, then I'll listen
to you.

-Dave.
--
Dave Chinner
[email protected]

2016-08-05 15:18:14

by Toshi Kani

[permalink] [raw]
Subject: Re: Subtle races between DAX mmap fault and write path

On Fri, 2016-08-05 at 21:27 +1000, Dave Chinner wrote:
> [ cut to just the important points ]
> On Thu, Aug 04, 2016 at 06:40:42PM +0000, Kani, Toshimitsu wrote:
> >
> > On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:
> > >
> > > If I drop the fsync from the
> > > buffered IO path, bandwidth remains the same but runtime drops to
> > > 0.55-0.57s, so again the buffered IO write path is faster than DAX
> > > while doing more work.
> >
> > I do not think the test results are relevant on this point because both
> > buffered and dax write() paths use uncached copy to avoid clflush.  The
> > buffered path uses cached copy to the page cache and then use uncached
> > copy to PMEM via writeback.  Therefore, the buffered IO path also benefits
> > from using uncached copy to avoid clflush.
>
> Except that I tested without the writeback path for buffered IO, so
> there was a direct comparison for single cached copy vs single
> uncached copy.

I agree that the result showed a tentative comparison for cached copy vs
uncached copy.  My point, however, is that writes to PMEM need to persist,
unlike writes to the page cache.  So for PMEM, the comparison should be (cached copy +
clflush) vs uncached copy.

> The undeniable fact is that a write() with a single cached copy with
> all the overhead of dirty page tracking is /faster/ than a much
> shorter, simpler IO path that uses an uncached copy. That's what the
> numbers say....

This cost evaluation needs to include the cost of clflush for cached copy.
Also, page cache allocation tends to be faster than disk block allocation.

> >
> > Cached copy (req movq) is slightly faster than uncached copy,
>
> Not according to Boaz - he claims that uncached is 20% faster than
> cached. How about you two get together, do some benchmarking and get
> your story straight, eh?

I vaguely remember seeing such results, but I may be wrong about that.  Here
are performance test results Robert Elliott conducted before.
https://lkml.org/lkml/2015/4/2/453

To quote the results relevant to this topic:

- Cached copy               2.5 M
- Uncached copy w/ MOVNTI   2.6 M
- Uncached copy w/ MOVNTDQ  3.5 M

Note that we use MOVNTI today, not MOVNTDQ.  We instrumented a MOVNTDQ copy
function for this test.  We can further improve the copy performance by using
MOVNTDQ.
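
[ For reference, the two store flavours as compiler intrinsics,
illustrative only: MOVNTI moves 8 bytes from a general-purpose register,
MOVNTDQ moves 16 bytes from an XMM register; the wider store is one
plausible source of the difference above. ]

#include <emmintrin.h>	/* SSE2 intrinsics */

/* MOVNTI: 8-byte non-temporal store from a general-purpose register */
void store_movnti(long long *dst, long long v)
{
	_mm_stream_si64(dst, v);
}

/* MOVNTDQ: 16-byte non-temporal store; dst must be 16-byte aligned */
void store_movntdq(__m128i *dst, __m128i v)
{
	_mm_stream_si128(dst, v);
}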

> > and should be used for writing to the page cache.  For writing to PMEM,
> > however, additional clflush can be expensive, and allocating cachelines
> > for PMEM leads to evict application's cachelines.
>
> I keep hearing people tell me why cached copies are slower, but
> no-one is providing numbers to back up their statements. The only
> numbers we have are the ones I've published showing cached copies w/
> full dirty tracking is faster than uncached copy w/o dirty tracking.
>
> Show me the numbers that back up your statements, then I'll listen
> to you.

Please see above.  Cached copy requires clflush on top of that.

Thanks,
-Toshi
_______________________________________________
Linux-nvdimm mailing list
[email protected]
https://lists.01.org/mailman/listinfo/linux-nvdimm

2016-08-05 19:58:33

by Boylston, Brian

[permalink] [raw]
Subject: RE: Subtle races between DAX mmap fault and write path

Dave Chinner wrote on 2016-08-05:
> [ cut to just the important points ]
> On Thu, Aug 04, 2016 at 06:40:42PM +0000, Kani, Toshimitsu wrote:
>> On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:
>>> If I drop the fsync from the
>>> buffered IO path, bandwidth remains the same but runtime drops to
>>> 0.55-0.57s, so again the buffered IO write path is faster than DAX
>>> while doing more work.
>>
>> I do not think the test results are relevant on this point because both
>> buffered and dax write() paths use uncached copy to avoid clflush. The
>> buffered path uses cached copy to the page cache and then uses uncached copy to
>> PMEM via writeback. Therefore, the buffered IO path also benefits from using
>> uncached copy to avoid clflush.
>
> Except that I tested without the writeback path for buffered IO, so
> there was a direct comparison for single cached copy vs single
> uncached copy.
>
> The undeniable fact is that a write() with a single cached copy with
> all the overhead of dirty page tracking is /faster/ than a much
> shorter, simpler IO path that uses an uncached copy. That's what the
> numbers say....
>
>> Cached copy (req movq) is slightly faster than uncached copy,
>
> Not according to Boaz - he claims that uncached is 20% faster than
> cached. How about you two get together, do some benchmarking and get
> your story straight, eh?
>
>> and should be
>> used for writing to the page cache. For writing to PMEM, however, additional
>> clflush can be expensive, and allocating cachelines for PMEM leads to
>> evicting the application's cachelines.
>
> I keep hearing people tell me why cached copies are slower, but
> no-one is providing numbers to back up their statements. The only
> numbers we have are the ones I've published showing cached copies w/
> full dirty tracking is faster than uncached copy w/o dirty tracking.
>
> Show me the numbers that back up your statements, then I'll listen
> to you.

Here are some numbers for a particular scenario, and the code is below.

Time (in seconds) to copy a 16KiB buffer 1M times to a 4MiB NVDIMM buffer
(1M total memcpy()s). For the cached+clflush case, the flushes are done
every 4MiB (which seems slightly faster than flushing every 16KiB):

                NUMA local    NUMA remote
Cached+clflush     13.5           37.1
movnt               1.0            1.3

In the code below, pmem_persist() does the CLFLUSH(es) on the given range,
and pmem_memcpy_persist() does non-temporal MOVs with an SFENCE:

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <libpmem.h>

/*
 * gcc -Wall -O2 -m64 -mcx16 -o memcpyperf memcpyperf.c -lpmem
 *
 * Not sure if -mcx16 allows gcc to use faster memcpy bits?
 */

/*
 * our source buffer. we'll copy this much at a time.
 * align it so that memcpy() doesn't have to do anything funny.
 */
char __attribute__((aligned(0x100))) src[4 * 4096];

int
main(
	int argc,
	char* argv[]
)
{
	char* path;
	char mode;
	int nloops;
	char* dstbase;
	size_t dstsz;
	int ispmem;
	int cpysz;
	char* dst;

	if (argc != 4) {
		fprintf(stderr, "ERROR: usage: "
			"memcpyperf [cached | nt] PATH NLOOPS\n");
		exit(1);
	}
	mode = argv[1][0];
	path = argv[2];
	nloops = atoi(argv[3]);

	dstbase = pmem_map_file(path, 0, 0, 0, &dstsz, &ispmem);
	if (NULL == dstbase) {
		perror(path);
		exit(1);
	}

	if (!ispmem)
		fprintf(stderr, "WARNING: %s is not pmem\n", path);

	if (dstsz < sizeof(src))
		cpysz = dstsz;
	else
		cpysz = sizeof(src);

	fprintf(stderr, "INFO: dst %p src %p dstsz %zu cpysz %d\n",
		dstbase, src, dstsz, cpysz);

	dst = dstbase;
	while (nloops--) {
		if (mode == 'c') {
			memcpy(dst, src, cpysz);
			/*
			 * we could do the clflush here on the 16KiB we just
			 * wrote, but with a 4MiB file (dst buffer) and 16KiB
			 * src buffer, it seems slightly faster to flush the
			 * entire 4MiB below
			 */
			//pmem_persist(dst, cpysz);
		}
		else {
			pmem_memcpy_persist(dst, src, cpysz);
		}
		dst += cpysz;
		if ((dst + cpysz) - dstbase > dstsz) {
			dst = dstbase;
			/* see note above */
			if (mode == 'c')
				pmem_persist(dst, dstsz);
		}
	}

	exit(0);

} /* main() */
EOF

Sample runs:

$ numactl -N0 time -p ./memcpyperf c /pmem0/brian/cpt 1000000
INFO: dst 0x7f3f1a000000 src 0x601200 dstsz 4194304 cpysz 16384
real 13.53
user 13.53
sys 0.00
$ numactl -N0 time -p ./memcpyperf n /pmem0/brian/cpt 1000000
INFO: dst 0x7f2b54600000 src 0x601200 dstsz 4194304 cpysz 16384
real 1.04
user 1.04
sys 0.00
$ numactl -N1 time -p ./memcpyperf c /pmem0/brian/cpt 1000000
INFO: dst 0x7f8f8c200000 src 0x601200 dstsz 4194304 cpysz 16384
real 37.13
user 37.15
sys 0.00
$ numactl -N1 time -p ./memcpyperf n /pmem0/brian/cpt 1000000
INFO: dst 0x7f77f7400000 src 0x601200 dstsz 4194304 cpysz 16384
real 1.24
user 1.24
sys 0.00


Brian

2016-08-08 09:26:55

by Jan Kara

[permalink] [raw]
Subject: Re: Subtle races between DAX mmap fault and write path

On Fri 05-08-16 19:58:33, Boylston, Brian wrote:
> Dave Chinner wrote on 2016-08-05:
> > [ cut to just the important points ]
> > On Thu, Aug 04, 2016 at 06:40:42PM +0000, Kani, Toshimitsu wrote:
> >> On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:
> >>> If I drop the fsync from the
> >>> buffered IO path, bandwidth remains the same but runtime drops to
> >>> 0.55-0.57s, so again the buffered IO write path is faster than DAX
> >>> while doing more work.
> >>
> >> I do not think the test results are relevant on this point because both
> >> buffered and dax write() paths use uncached copy to avoid clflush. The
> >> buffered path uses cached copy to the page cache and then uses uncached copy to
> >> PMEM via writeback. Therefore, the buffered IO path also benefits from using
> >> uncached copy to avoid clflush.
> >
> > Except that I tested without the writeback path for buffered IO, so
> > there was a direct comparison for single cached copy vs single
> > uncached copy.
> >
> > The undeniable fact is that a write() with a single cached copy with
> > all the overhead of dirty page tracking is /faster/ than a much
> > shorter, simpler IO path that uses an uncached copy. That's what the
> > numbers say....
> >
> >> Cached copy (req movq) is slightly faster than uncached copy,
> >
> > Not according to Boaz - he claims that uncached is 20% faster than
> > cached. How about you two get together, do some benchmarking and get
> > your story straight, eh?
> >
> >> and should be
> >> used for writing to the page cache. For writing to PMEM, however, additional
> >> clflush can be expensive, and allocating cachelines for PMEM leads to
> >> evicting the application's cachelines.
> >
> > I keep hearing people tell me why cached copies are slower, but
> > no-one is providing numbers to back up their statements. The only
> > numbers we have are the ones I've published showing cached copies w/
> > full dirty tracking is faster than uncached copy w/o dirty tracking.
> >
> > Show me the numbers that back up your statements, then I'll listen
> > to you.
>
> Here are some numbers for a particular scenario, and the code is below.
>
> Time (in seconds) to copy a 16KiB buffer 1M times to a 4MiB NVDIMM buffer
> (1M total memcpy()s). For the cached+clflush case, the flushes are done
> every 4MiB (which seems slightly faster than flushing every 16KiB):
>
>                 NUMA local    NUMA remote
> Cached+clflush     13.5           37.1
> movnt               1.0            1.3

Thanks for the test Brian. But looking at the current source of libpmem
this seems to be comparing apples to oranges. Let me explain the details
below:

> In the code below, pmem_persist() does the CLFLUSH(es) on the given range,
> and pmem_memcpy_persist() does non-temporal MOVs with an SFENCE:

Yes. libpmem does what you describe above and the name
pmem_memcpy_persist() is thus currently misleading because it is not
guaranteed to be persistent with the current implementation of DAX in the
kernel.

It is important to know which kernel version and what filesystem you have
used for the test to be able to judge the details, but generally pmem_persist()
does properly tell the filesystem to flush all metadata associated with the
file, commit open transactions etc. That's the full cost of persistence.

pmem_memcpy_persist() makes sure the data writes have reached persistent
storage but nothing guarantees associated metadata changes have reached
persistent storage as well. To assure that, fsync() (or pmem_persist() if
you wish) is currently the only way from userspace. At which point you've
lost most of the advantages of using movnt. Ross is researching possibilities
for allowing a more efficient userspace implementation but currently there
are none.
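
[ In other words, the only crash-safe sequence from userspace today is
still something like the following sketch; the function name is made up,
and the plain memcpy() stands in for whatever movnt-based copy is used. ]

#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int write_persistent(int fd, const void *buf, size_t len)
{
	void *dst = mmap(NULL, len, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);

	if (dst == MAP_FAILED)
		return -1;
	memcpy(dst, buf, len);	/* stand-in for an movnt-based copy */
	munmap(dst, len);
	/*
	 * The data itself may already be durable on an ADR platform, but
	 * the metadata needed to find it is not, until the filesystem is
	 * told to sync:
	 */
	return fsync(fd);
}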

Honza
--
Jan Kara <jack-IBi9RG/[email protected]>
SUSE Labs, CR

2016-08-08 12:30:18

by Boylston, Brian

[permalink] [raw]
Subject: RE: Subtle races between DAX mmap fault and write path

Jan Kara wrote on 2016-08-08:
> On Fri 05-08-16 19:58:33, Boylston, Brian wrote:
>> Dave Chinner wrote on 2016-08-05:
>>> [ cut to just the important points ]
>>> On Thu, Aug 04, 2016 at 06:40:42PM +0000, Kani, Toshimitsu wrote:
>>>> On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:
>>>>> If I drop the fsync from the
>>>>> buffered IO path, bandwidth remains the same but runtime drops to
>>>>> 0.55-0.57s, so again the buffered IO write path is faster than DAX
>>>>> while doing more work.
>>>>
>>>> I do not think the test results are relevant on this point because both
>>>> buffered and dax write() paths use uncached copy to avoid clflush. The
>>>> buffered path uses cached copy to the page cache and then uses uncached copy to
>>>> PMEM via writeback. Therefore, the buffered IO path also benefits from using
>>>> uncached copy to avoid clflush.
>>>
>>> Except that I tested without the writeback path for buffered IO, so
>>> there was a direct comparison for single cached copy vs single
>>> uncached copy.
>>>
>>> The undeniable fact is that a write() with a single cached copy with
>>> all the overhead of dirty page tracking is /faster/ than a much
>>> shorter, simpler IO path that uses an uncached copy. That's what the
>>> numbers say....
>>>
>>>> Cached copy (req movq) is slightly faster than uncached copy,
>>>
>>> Not according to Boaz - he claims that uncached is 20% faster than
>>> cached. How about you two get together, do some benchmarking and get
>>> your story straight, eh?
>>>
>>>> and should be
>>>> used for writing to the page cache. For writing to PMEM, however, additional
>>>> clflush can be expensive, and allocating cachelines for PMEM leads to
>>>> evicting the application's cachelines.
>>>
>>> I keep hearing people tell me why cached copies are slower, but
>>> no-one is providing numbers to back up their statements. The only
>>> numbers we have are the ones I've published showing cached copies w/
>>> full dirty tracking is faster than uncached copy w/o dirty tracking.
>>>
>>> Show me the numbers that back up your statements, then I'll listen
>>> to you.
>>
>> Here are some numbers for a particular scenario, and the code is below.
>>
>> Time (in seconds) to copy a 16KiB buffer 1M times to a 4MiB NVDIMM buffer
>> (1M total memcpy()s). For the cached+clflush case, the flushes are done
>> every 4MiB (which seems slightly faster than flushing every 16KiB):
>>
>>                 NUMA local    NUMA remote
>> Cached+clflush     13.5           37.1
>> movnt               1.0            1.3
>
> Thanks for the test Brian. But looking at the current source of libpmem
> this seems to be comparing apples to oranges. Let me explain the details
> below:
>
>> In the code below, pmem_persist() does the CLFLUSH(es) on the given range,
>> and pmem_memcpy_persist() does non-temporal MOVs with an SFENCE:
>
> Yes. libpmem does what you describe above and the name
> pmem_memcpy_persist() is thus currently misleading because it is not
> guaranteed to be persistent with the current implementation of DAX in
> the kernel.
>
> It is important to know which kernel version and what filesystem you have
> used for the test to be able to judge the details, but generally pmem_persist()
> does properly tell the filesystem to flush all metadata associated with the
> file, commit open transactions etc. That's the full cost of persistence.

I used NVML 1.1 for the measurements. In this version and with the hardware
that I used, the pmem_persist() flow is:

pmem_persist()
  pmem_flush()
    Func_flush() == flush_clflush
      CLFLUSH
  pmem_drain()
    Func_predrain_fence() == predrain_fence_empty
      no-op

So, I don't think that pmem_persist() does anything to cause the filesystem
to flush metadata as it doesn't make any system calls?

> pmem_memcpy_persist() makes sure the data writes have reached persistent
> storage but nothing guarantees associated metadata changes have reached
> persistent storage as well.

While metadata is certainly important, my goal with this specific test was
to measure the "raw" performance of cached+flush vs uncached, without
anything else in the way.

> To assure that, fsync() (or pmem_persist()
> if you wish) is currently the only way from userspace.

Perhaps you mean pmem_msync() here? pmem_msync() calls msync(), but
pmem_persist() does not.
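
[ To make the difference concrete, a sketch assuming the libpmem API as
documented for NVML 1.1; the wrapper names are made up. ]

#include <stddef.h>
#include <libpmem.h>

/* flushes CPU caches from userspace; makes no system call, so the
 * filesystem never learns about the write */
void data_only(void *addr, size_t len)
{
	pmem_persist(addr, len);
}

/* wraps msync(MS_SYNC), so the filesystem also persists metadata */
int data_and_metadata(void *addr, size_t len)
{
	return pmem_msync(addr, len);
}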

> At which point
> you've lost most of the advantages using movnt. Ross researches into
> possibilities of allowing more efficient userspace implementation but
> currently there are none.

Apart from the current performance discussion, if the metadata for a file
is already established (file created, space allocated by explicit writes(),
and everything synced), then if I map it and do pmem_memcpy_persist(),
are there any "ongoing" metadata updates that would need to be flushed
(besides timestamps)?


Brian

2016-08-08 13:11:26

by Christoph Hellwig

[permalink] [raw]
Subject: Re: Subtle races between DAX mmap fault and write path

On Mon, Aug 08, 2016 at 12:30:18PM +0000, Boylston, Brian wrote:
> I used NVML 1.1 for the measurements. In this version and with the hardware
> that I used, the pmem_persist() flow is:

Please don't use crap like NVML, given that the people behind it don't
seem to understand persistency at all.

> Perhaps you mean pmem_msync() here? pmem_msync() calls msync(), but
> pmem_persist() does not.

pmem_persist is misnamed then, don't use it.

> > At which point
> > you've lost most of the advantages using movnt. Ross researches into
> > possibilities of allowing more efficient userspace implementation but
> > currently there are none.
>
> Apart from the current performance discussion, if the metadata for a file
> is already established (file created, space allocated by explicit writes(),
> and everything synced), then if I map it and do pmem_memcpy_persist(),
> are there any "ongoing" metadata updates that would need to be flushed
> (besides timestamps)?

Yes. For example, because every write might mean a new space allocation
if using reflinks or a COW filesystem.

_______________________________________________
xfs mailing list
[email protected]
http://oss.sgi.com/mailman/listinfo/xfs

2016-08-08 18:28:27

by Jan Kara

[permalink] [raw]
Subject: Re: Subtle races between DAX mmap fault and write path

On Mon 08-08-16 12:30:18, Boylston, Brian wrote:
> Jan Kara wrote on 2016-08-08:
> > On Fri 05-08-16 19:58:33, Boylston, Brian wrote:
> >> Dave Chinner wrote on 2016-08-05:
> >>> [ cut to just the important points ]
> >>> On Thu, Aug 04, 2016 at 06:40:42PM +0000, Kani, Toshimitsu wrote:
> >>>> On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:
> >>>>> If I drop the fsync from the
> >>>>> buffered IO path, bandwidth remains the same but runtime drops to
> >>>>> 0.55-0.57s, so again the buffered IO write path is faster than DAX
> >>>>> while doing more work.
> >>>>
> >>>> I do not think the test results are relevant on this point because both
> >>>> buffered and dax write() paths use uncached copy to avoid clflush. The
> >>>> buffered path uses cached copy to the page cache and then uses uncached copy to
> >>>> PMEM via writeback. Therefore, the buffered IO path also benefits from using
> >>>> uncached copy to avoid clflush.
> >>>
> >>> Except that I tested without the writeback path for buffered IO, so
> >>> there was a direct comparison for single cached copy vs single
> >>> uncached copy.
> >>>
> >>> The undeniable fact is that a write() with a single cached copy with
> >>> all the overhead of dirty page tracking is /faster/ than a much
> >>> shorter, simpler IO path that uses an uncached copy. That's what the
> >>> numbers say....
> >>>
> >>>> Cached copy (req movq) is slightly faster than uncached copy,
> >>>
> >>> Not according to Boaz - he claims that uncached is 20% faster than
> >>> cached. How about you two get together, do some benchmarking and get
> >>> your story straight, eh?
> >>>
> >>>> and should be
> >>>> used for writing to the page cache. For writing to PMEM, however, additional
> >>>> clflush can be expensive, and allocating cachelines for PMEM leads to
> >>>> evicting the application's cachelines.
> >>>
> >>> I keep hearing people tell me why cached copies are slower, but
> >>> no-one is providing numbers to back up their statements. The only
> >>> numbers we have are the ones I've published showing cached copies w/
> >>> full dirty tracking is faster than uncached copy w/o dirty tracking.
> >>>
> >>> Show me the numbers that back up your statements, then I'll listen
> >>> to you.
> >>
> >> Here are some numbers for a particular scenario, and the code is below.
> >>
> >> Time (in seconds) to copy a 16KiB buffer 1M times to a 4MiB NVDIMM buffer
> >> (1M total memcpy()s). For the cached+clflush case, the flushes are done
> >> every 4MiB (which seems slightly faster than flushing every 16KiB):
> >>
> >>                 NUMA local    NUMA remote
> >> Cached+clflush     13.5           37.1
> >> movnt               1.0            1.3
> >
> > Thanks for the test Brian. But looking at the current source of libpmem
> > this seems to be comparing apples to oranges. Let me explain the details
> > below:
> >
> >> In the code below, pmem_persist() does the CLFLUSH(es) on the given range,
> >> and pmem_memcpy_persist() does non-temporal MOVs with an SFENCE:
> >
> > Yes. libpmem does what you describe above and the name
> > pmem_memcpy_persist() is thus currently misleading because it is not
> > guaranteed to be persistent with the current implementation of DAX in
> > the kernel.
> >
> > It is important to know which kernel version and what filesystem you have
> > used for the test to be able to judge the details, but generally pmem_persist()
> > does properly tell the filesystem to flush all metadata associated with the
> > file, commit open transactions etc. That's the full cost of persistence.
>
> I used NVML 1.1 for the measurements. In this version and with the hardware
> that I used, the pmem_persist() flow is:
>
> pmem_persist()
>   pmem_flush()
>     Func_flush() == flush_clflush
>       CLFLUSH
>   pmem_drain()
>     Func_predrain_fence() == predrain_fence_empty
>       no-op
>
> So, I don't think that pmem_persist() does anything to cause the filesystem
> to flush metadata as it doesn't make any system calls?

Ah, you are right. I somehow misread what is in the NVML sources. I agree with
Christoph that the _persist suffix is then misleading for the reasons he
stated, but that's irrelevant to the test you did.

So it indeed seems that in your test movnt + sfence is an order of
magnitude faster than cached memcpy + clflush + sfence. I'm surprised, I have
to say.

> > At which point
> > you've lost most of the advantages of using movnt. Ross is researching
> > possibilities for allowing a more efficient userspace implementation but
> > currently there are none.
>
> Apart from the current performance discussion, if the metadata for a file
> is already established (file created, space allocated by explicit writes(),
> and everything synced), then if I map it and do pmem_memcpy_persist(),
> are there any "ongoing" metadata updates that would need to be flushed
> (besides timestamps)?

As Christoph wrote, currently there is no way for userspace to know, and the
filesystem may be doing all sorts of interesting stuff underneath that
userspace doesn't know about. The only obligation the filesystem has is that
in response to fsync() it has to make sure all the data written before
fsync() is visible after a crash...

Honza
--
Jan Kara <jack-IBi9RG/[email protected]>
SUSE Labs, CR

2016-08-08 19:32:47

by Toshi Kani

[permalink] [raw]
Subject: RE: Subtle races between DAX mmap fault and write path

> > Jan Kara wrote on 2016-08-08:
> > > On Fri 05-08-16 19:58:33, Boylston, Brian wrote:
> >
> > I used NVML 1.1 for the measurements. In this version and with the
> > hardware that I used, the pmem_persist() flow is:
> >
> > pmem_persist()
> >   pmem_flush()
> >     Func_flush() == flush_clflush
> >       CLFLUSH
> >   pmem_drain()
> >     Func_predrain_fence() == predrain_fence_empty
> >       no-op
> >
> > So, I don't think that pmem_persist() does anything to cause the filesystem
> > to flush metadata as it doesn't make any system calls?
>
> Ah, you are right. I somehow misread what is in NVML sources. I agree with
> Christoph that _persist suffix is then misleading for the reasons he stated
> but that's irrelevant to the test you did.
>
> So it indeed seems that in your test movnt + sfence is an order of
> magnitude faster than cached memcpy + clflush + sfence. I'm surprised, I have
> to say.

movnt stores are posted to the WC buffer, which is asynchronously evicted
to memory as each line is filled.

clflush, on the other hand, must be serialized. So, it has to synchronously
evict line-by-line. clflushopt, when supported by new CPUs, should be a lot
faster as it can execute concurrently and does not have to wait line-by-line.
It'd still be slower than uncached copy, though.
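
[ A sketch of the flush loop that difference implies; a hypothetical
helper, assuming 64-byte cachelines and a CPU built with clflushopt
support (gcc -mclflushopt). ]

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

static void flush_range_clflushopt(const void *addr, size_t len)
{
	uintptr_t p = (uintptr_t)addr & ~(uintptr_t)63;
	uintptr_t end = (uintptr_t)addr + len;

	for (; p < end; p += 64)
		_mm_clflushopt((void *)p);	/* weakly ordered, can overlap */
	_mm_sfence();	/* one fence orders all the flushes at the end */
}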

-Toshi

2016-08-08 23:12:25

by Dave Chinner

[permalink] [raw]
Subject: Re: Subtle races between DAX mmap fault and write path

On Fri, Aug 05, 2016 at 07:58:33PM +0000, Boylston, Brian wrote:
> Dave Chinner wrote on 2016-08-05:
> > [ cut to just the important points ]
> > On Thu, Aug 04, 2016 at 06:40:42PM +0000, Kani, Toshimitsu wrote:
> >> On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:
> >>> If I drop the fsync from the
> >>> buffered IO path, bandwidth remains the same but runtime drops to
> >>> 0.55-0.57s, so again the buffered IO write path is faster than DAX
> >>> while doing more work.
> >>
> >> I do not think the test results are relevant on this point because both
> >> buffered and dax write() paths use uncached copy to avoid clflush. The
> >> buffered path uses cached copy to the page cache and then uses uncached copy to
> >> PMEM via writeback. Therefore, the buffered IO path also benefits from using
> >> uncached copy to avoid clflush.
> >
> > Except that I tested without the writeback path for buffered IO, so
> > there was a direct comparison for single cached copy vs single
> > uncached copy.
> >
> > The undeniable fact is that a write() with a single cached copy with
> > all the overhead of dirty page tracking is /faster/ than a much
> > shorter, simpler IO path that uses an uncached copy. That's what the
> > numbers say....
> >
> >> Cached copy (req movq) is slightly faster than uncached copy,
> >
> > Not according to Boaz - he claims that uncached is 20% faster than
> > cached. How about you two get together, do some benchmarking and get
> > your story straight, eh?
> >
> >> and should be
> >> used for writing to the page cache. For writing to PMEM, however, additional
> >> clflush can be expensive, and allocating cachelines for PMEM leads to
> >> evicting the application's cachelines.
> >
> > I keep hearing people tell me why cached copies are slower, but
> > no-one is providing numbers to back up their statements. The only
> > numbers we have are the ones I've published showing cached copies w/
> > full dirty tracking is faster than uncached copy w/o dirty tracking.
> >
> > Show me the numbers that back up your statements, then I'll listen
> > to you.
>
> Here are some numbers for a particular scenario, and the code is below.
>
> Time (in seconds) to copy a 16KiB buffer 1M times to a 4MiB NVDIMM buffer
> (1M total memcpy()s). For the cached+clflush case, the flushes are done
> every 4MiB (which seems slightly faster than flushing every 16KiB):
>
>                 NUMA local    NUMA remote
> Cached+clflush     13.5           37.1
> movnt               1.0            1.3

So let's put that in memory bandwidth terms. You wrote 16GB to the
NVDIMM. That means:

                NUMA local    NUMA remote
Cached+clflush    1.2GB/s       0.43GB/s
movnt            16.0GB/s       12.3GB/s

That smells wrong. The DAX code (using movnt) is not 1-2 orders of
magnitude faster than a page cache copy, so I don't believe your
benchmark reflects what I'm proposing.

What I think you're getting wrong is that we are not doing a clflush
after every 16k write when we use the page cache, nor will we do
that if we use cached copies, dirty tracking and clflush on fsync().
IOWs, the correct equivalent "cached + clflush" loop to a volatile
copy with dirty tracking + fsync would be:

dstp = dst;
while (--nloops) {
	memcpy(dstp, src, src_sz);	// pwrite();
	dstp += src_sz;
}
pmem_persist(dst, dstsz);		// fsync();

i.e. the cache flushes occur only at the user-defined
synchronisation point, not on every syscall.

Yes, if you want to make your copy slow and safe, use O_SYNC to
trigger clflush on every write() call - that's what we do for
existing storage and the mechanisms are already there; we just need
the dirty tracking to optimise it.
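
[ That is, the "slow and safe" variant is just the following,
illustrative only: ]

#include <fcntl.h>
#include <unistd.h>

int open_for_sync_writes(const char *path)
{
	/* O_SYNC makes every write(2) a data-integrity sync point, so
	 * the kernel flushes for us on each call instead of at fsync() */
	return open(path, O_WRONLY | O_SYNC);
}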

Put simple: we should only care about cache flush synchronisation at
user defined data integrity synchronisation points. That's the IO
model the kernel has always exposed to users, and pmem storage is no
different.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-08-09 01:00:30

by Toshi Kani

[permalink] [raw]
Subject: Re: Subtle races between DAX mmap fault and write path

On Tue, 2016-08-09 at 09:12 +1000, Dave Chinner wrote:
> On Fri, Aug 05, 2016 at 07:58:33PM +0000, Boylston, Brian wrote:
> >
> > Dave Chinner wrote on 2016-08-05:
> > >
> > > [ cut to just the important points ]
> > > On Thu, Aug 04, 2016 at 06:40:42PM +0000, Kani, Toshimitsu wrote:
> > > >
> > > > On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:
> > > > >
> > > > > If I drop the fsync from the
> > > > > buffered IO path, bandwidth remains the same but runtime
> > > > > drops to 0.55-0.57s, so again the buffered IO write path is
> > > > > faster than DAX while doing more work.
> > > >
> > > > I do not think the test results are relevant on this point
> > > > because both buffered and dax write() paths use uncached copy
> > > > to avoid clflush.  The buffered path uses cached copy to the
> > > > page cache and then uses uncached copy to PMEM via writeback.
> > > > Therefore, the buffered IO path also benefits from using
> > > > uncached copy to avoid clflush.
> > >
> > > Except that I tested without the writeback path for buffered IO,
> > > so there was a direct comparison for single cached copy vs single
> > > uncached copy.
> > >
> > > The undeniable fact is that a write() with a single cached copy
> > > with all the overhead of dirty page tracking is /faster/ than a
> > > much shorter, simpler IO path that uses an uncached copy. That's
> > > what the numbers say....
> > >
> > > >
> > > > Cached copy (req movq) is slightly faster than uncached copy,
> > >
> > > Not according to Boaz - he claims that uncached is 20% faster
> > > than cached. How about you two get together, do some benchmarking
> > > and get your story straight, eh?
> > >
> > > > and should be used for writing to the page cache.  For writing
> > > > to PMEM, however, additional clflush can be expensive, and
> > > > allocating cachelines for PMEM leads to evicting the
> > > > application's cachelines.
> > >
> > > I keep hearing people tell me why cached copies are slower, but
> > > no-one is providing numbers to back up their statements. The only
> > > numbers we have are the ones I've published showing cached copies
> > > w/ full dirty tracking is faster than uncached copy w/o dirty
> > > tracking.
> > >
> > > Show me the numbers that back up your statements, then I'll
> > > listen to you.
> >
> > Here are some numbers for a particular scenario, and the code is
> > below.
> >
> > Time (in seconds) to copy a 16KiB buffer 1M times to a 4MiB NVDIMM
> > buffer (1M total memcpy()s).  For the cached+clflush case, the
> > flushes are done every 4MiB (which seems slightly faster than
> > flushing every 16KiB):
> >
> >                   NUMA local    NUMA remote
> > Cached+clflush      13.5           37.1
> > movnt                1.0            1.3 
>
> So let's put that in memory bandwidth terms. You wrote 16GB to the
> NVDIMM.  That means:
>
>                   NUMA local    NUMA remote
> Cached+clflush      1.2GB/s         0.43GB/s
> movnt              16.0GB/s         12.3GB/s
>
> That smells wrong.  The DAX code (using movnt) is not 1-2 orders of
> magnitude faster than a page cache copy, so I don't believe your
> benchmark reflects what I'm proposing.
>
> What I think you're getting wrong is that we are not doing a clflush
> after every 16k write when we use the page cache, nor will we do
> that if we use cached copies, dirty tracking and clflush on fsync().

As I mentioned before, we do not use clflush on the write path.  So,
your tests did not issue clflush at all.

> IOWs, the correct equivalent "cached + clflush" loop to a volatile
> copy with dirty tracking + fsync would be:
>
> > dstp = dst;
> > while (--nloops) {
> > 	memcpy(dstp, src, src_sz);	// pwrite();
> > 	dstp += src_sz;
> > }
> > pmem_persist(dst, dstsz);		// fsync();
>
> > i.e. the cache flushes occur only at the user-defined
> > synchronisation point, not on every syscall.

Brian's test is (16 KiB pwrite + fsync) repeated 1M times.  It compared
two approaches in the case of 16 KiB persistent writes.  I do not
consider it wrong, but it indicated that cached copy + clflush will lead to
much higher overhead when sync'd at a finer granularity.

I agree that it should have less overhead in total when clflush is done
all at once, since it only has to evict at most the cache size.

> Yes, if you want to make your copy slow and safe, use O_SYNC to
> trigger clflush on every write() call - that's what we do for
> existing storage and the mechanisms are already there; we just need
> the dirty tracking to optimise it.

Perhaps you are referring to flushing the disk write cache?  I do not
think clflush, as an x86 instruction, is used for existing storage.

> Put simple: we should only care about cache flush synchronisation at
> user defined data integrity synchronisation points. That's the IO
> model the kernel has always exposed to users, and pmem storage is no
> different.

Thanks,
-Toshi
_______________________________________________
Linux-nvdimm mailing list
[email protected]
https://lists.01.org/mailman/listinfo/linux-nvdimm

2016-08-09 05:58:48

by Dave Chinner

[permalink] [raw]
Subject: Re: Subtle races between DAX mmap fault and write path

On Tue, Aug 09, 2016 at 01:00:30AM +0000, Kani, Toshimitsu wrote:
> On Tue, 2016-08-09 at 09:12 +1000, Dave Chinner wrote:
> > On Fri, Aug 05, 2016 at 07:58:33PM +0000, Boylston, Brian wrote:
> > >
> > > Dave Chinner wrote on 2016-08-05:
> > > >
> > > > [ cut to just the important points ]
> > > > On Thu, Aug 04, 2016 at 06:40:42PM +0000, Kani, Toshimitsu wrote:
> > > > >
> > > > > On Tue, 2016-08-02 at 10:21 +1000, Dave Chinner wrote:
> > > > > >
> > > > > > If I drop the fsync from the
> > > > > > buffered IO path, bandwidth remains the same but runtime
> > > > > > drops to 0.55-0.57s, so again the buffered IO write path is
> > > > > > faster than DAX while doing more work.
> > > > >
> > > > > I do not think the test results are relevant on this point
> > > > > because both buffered and dax write() paths use uncached copy
> > > > > to avoid clflush. The buffered path uses cached copy to the
> > > > > page cache and then uses uncached copy to PMEM via writeback.
> > > > > Therefore, the buffered IO path also benefits from using
> > > > > uncached copy to avoid clflush.
> > > >
> > > > Except that I tested without the writeback path for buffered IO,
> > > > so there was a direct comparison for single cached copy vs single
> > > > uncached copy.
> > > >
> > > > The undeniable fact is that a write() with a single cached copy
> > > > with all the overhead of dirty page tracking is /faster/ than a
> > > > much shorter, simpler IO path that uses an uncached copy. That's
> > > > what the numbers say....
> > > >
> > > > >
> > > > > Cached copy (req movq) is slightly faster than uncached copy,
> > > >
> > > > Not according to Boaz - he claims that uncached is 20% faster
> > > > than cached. How about you two get together, do some benchmarking
> > > > and get your story straight, eh?
> > > >
> > > > > and should be used for writing to the page cache.  For writing
> > > > > to PMEM, however, additional clflush can be expensive, and
> > > > > allocating cachelines for PMEM leads to evicting the
> > > > > application's cachelines.
> > > >
> > > > I keep hearing people tell me why cached copies are slower, but
> > > > no-one is providing numbers to back up their statements. The only
> > > > numbers we have are the ones I've published showing cached copies
> > > > w/ full dirty tracking is faster than uncached copy w/o dirty
> > > > tracking.
> > > >
> > > > Show me the numbers that back up your statements, then I'll
> > > > listen to you.
> > >
> > > Here are some numbers for a particular scenario, and the code is
> > > below.
> > >
> > > Time (in seconds) to copy a 16KiB buffer 1M times to a 4MiB NVDIMM
> > > buffer (1M total memcpy()s).  For the cached+clflush case, the
> > > flushes are done every 4MiB (which seems slightly faster than
> > > flushing every 16KiB):
> > >
> > >                   NUMA local    NUMA remote
> > > Cached+clflush      13.5           37.1
> > > movnt                1.0            1.3
> >
> > So let's put that in memory bandwidth terms. You wrote 16GB to the
> > NVDIMM.  That means:
> >
> >                   NUMA local    NUMA remote
> > Cached+clflush      1.2GB/s        0.43GB/s
> > movnt              16.0GB/s       12.3GB/s
> >
> > That smells wrong.  The DAX code (using movnt) is not 1-2 orders of
> > magnitude faster than a page cache copy, so I don't believe your
> > benchmark reflects what I'm proposing.
> >
> > What I think you're getting wrong is that we are not doing a clflush
> > after every 16k write when we use the page cache, nor will we do
> > that if we use cached copies, dirty tracking and clflush on fsync().
>
> As I mentioned before, we do not use clflush on the write path.  So,
> your tests did not issue clflush at all.

Uh, yes, I made that clear by saying "using volatile, cached
copies through the page cache, then using fsync() for integrity".
This results in the page cache being written to the nvdimm via
the writeback paths through bios, which the pmem driver does via
movnt() instructions.

And then we send a REQ_FLUSH bio to the device during fsync(), and
that runs nvdimm_flush() which makes sure all the posted writes
are flushed to persistent storage.

I'll also point out that the writeback of the page cache doubled
the runtime of the test, so the second memcpy() using movnt had
basically the same cost as the volatile page cache copy up front.
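
In userspace terms, the buffered side of that test is essentially the
following (a minimal sketch; the mount point, file name and sizes are
illustrative, and error handling is omitted):

  #include <fcntl.h>
  #include <unistd.h>

  static void buffered_then_sync(const char *buf, size_t sz, long nloops)
  {
          int fd = open("/mnt/pmem/f", O_WRONLY | O_CREAT | O_TRUNC, 0644);

          while (nloops--)
                  write(fd, buf, sz);   /* volatile, cached copy into
                                           the page cache */
          fsync(fd);                    /* writeback (movnt) + REQ_FLUSH */
          close(fd);
  }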

> > IOWs, the correct equivalent "cached + clflush" loop to a volatile
> > copy with dirty tracking + fsync would be:
> >
> > dstp = dst;
> > while (--nloops) {
> >         memcpy(dstp, src, src_sz); // pwrite();
> >         dstp += src_sz;
> > }
> > pmem_persist(dst, dstsz); // fsync();
> >
> > i.e. The cache flushes occur only at the user defined
> > synchronisation point not on every syscall.
>
> Brian's test is (16 KiB pwrite + fsync) repeated 1M times.  It compared
> two approaches in the case of a 16 KiB persistent write.

Yes, Brian tested synchronous writes. But:

THAT WAS NOT WHAT I WAS TESTING, PROPOSING OR DISCUSSING!

Seriously, compare apples to apples, don't try to justify comparing
apples to oranges by saying "but we like oranges better".

> I do not
> consider it wrong, but it indicated that cached copy + clflush will lead
> to much higher overhead when sync'd at a finer granularity.

That's true, but only until the next gen CPUs optimise clflush, and
then it will make a negligible difference in overhead. IOWs, you are
advocating we optimise for existing, sub-optimal CPU design
constraints, rather than architect a sane data integrity model. A model
which, not by coincidence, will be much better suited to the
capabilities of the next generation of CPUs and so will perform
better than any micro-optimisations we could make now for existing
CPUs.

> I agree that it should have less overhead in total when clflush is done
> at once, since it only has to evict at most a cache's worth of data.
>
> > Yes, if you want to make your copy slow and safe, use O_SYNC to
> > trigger clflush on every write() call - that's what we do for
> > existing storage and the mechanisms are already there; we just need
> > the dirty tracking to optimise it.
>
> Perhaps you are referring to flushing the disk write cache?  I do not
> think clflush, as an x86 instruction, is used for existing storage.

I'm talking about *whatever volatile caches are in layers below the
filesystem*. Stop thinking that pmem is some special little
snowflake - it's not. It's no different to the existing storage
architectures we have to deal with - it has volatile domains whose
contents we have to ensure are flushed to the persistent domain when a
data integrity synchronisation point occurs.

For DAX we use the kernel writeback infrastructure to issue clflush.
We use REQ_FLUSH to get the pmem driver to guarantee persistence of the
posted writes that result from the clflush. That's no different to
writing data via a bio to the SSD and then send REQ_FLUSH to ensure
it is on stable media inside the SSD. These are /generic primitives/
we use to guarantee data integrity, and they apply equally to pmem
as they do to all other block and network based storage.

And let's not forget that we can't guarantee data persistence just
through a cache flush or synchronous data write. There is no
guarantee that the filesystem metadata points to the location the
data was written to until a data synchronisation command is given to
the filesystem. The filesystem may have allocated a new block (e.g.
due to a copy-on-write event in a reflinked file) for that data
write so even overwrites are not guaranteed to be persistent until a
fsync/msync/sync/O_[D]SYNC action takes place.
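
The same applies to mmap() access: even an in-place overwrite through a
DAX mapping is only guaranteed persistent once the filesystem has been
asked to synchronise it. A minimal sketch (path and length are
illustrative, error handling omitted):

  #include <fcntl.h>
  #include <string.h>
  #include <sys/mman.h>
  #include <unistd.h>

  static void overwrite_and_sync(const char *src, size_t len)
  {
          int fd = open("/mnt/pmem/f", O_RDWR);
          char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

          memcpy(p, src, len);          /* data may land in a newly
                                           allocated block */
          msync(p, len, MS_SYNC);       /* make the data and the metadata
                                           that points at it durable */
          munmap(p, len);
          close(fd);
  }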

Arguments that movnt is the fastest way to copy data into the
persistent domain completely ignore the context of the copy in
the data integrity model filesystems present to applications.
Microbenchmarks are good for comparing micro-optimisations, but they
are useless for comparing the merits of high level architecture
decisions.

Cheers,

Dave.
--
Dave Chinner
[email protected]