2016-09-29 22:49:18

by Ross Zwisler

Subject: [PATCH v4 00/12] re-enable DAX PMD support

DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking. This series allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled.

Ted, can you please take the ext2 + ext4 patches through your tree? Dave,
can you please take the rest through the XFS tree?

Changes since v3:
- Corrected dax iomap code namespace for functions defined in fs/dax.c.
(Dave Chinner)
- Added leading "dax" namespace to new static functions in fs/dax.c.
(Dave Chinner)
- Made all DAX PMD code in fs/dax.c conditionally compiled based on
CONFIG_FS_DAX_PMD. Otherwise a stub in include/linux/dax.h that just
returns VM_FAULT_FALLBACK will be used; see the sketch after this
list. (Dave Chinner)
- Removed dynamic debugging messages from DAX PMD fault path. Debugging
tracepoints will be added later to both the PTE and PMD paths via a
later patch set. (Dave Chinner)
- Added a comment to ext2_dax_vm_ops explaining why we don't support DAX
PMD faults in ext2. (Dave Chinner)
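
A minimal sketch of what that stub might look like (the prototype is
inferred from the dax_iomap_pmd_fault() call site in patch 11, so treat
the exact signature as an assumption rather than a quote from the series):

#ifdef CONFIG_FS_DAX_PMD
int dax_iomap_pmd_fault(struct vm_area_struct *vma, unsigned long address,
		pmd_t *pmd, unsigned int flags, struct iomap_ops *ops);
#else
/* Sketch only: without CONFIG_FS_DAX_PMD, PMD faults fall back to PTEs. */
static inline int dax_iomap_pmd_fault(struct vm_area_struct *vma,
		unsigned long address, pmd_t *pmd, unsigned int flags,
		struct iomap_ops *ops)
{
	return VM_FAULT_FALLBACK;
}
#endif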

This was built upon xfs/for-next with PMD performance fixes from Toshi Kani
and Dan Williams. Dan's patch has already been merged for v4.8, and
Toshi's patches are currently queued in Andrew Morton's mm tree for v4.9
inclusion.

Here is a tree containing my changes and all the fixes that I've been testing:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax_pmd_v4

This tree has passed xfstests for ext2, ext4 and XFS both with and without DAX,
and has passed targeted testing where I inserted, removed and flushed DAX PTEs
and PMDs in every combination I could think of.

Previously reported performance numbers:

In some simple mmap I/O testing with FIO the use of PMD faults more than
doubles I/O performance as compared with PTE faults. Here is the FIO script I
used for my testing:

[global]
bs=4k
size=2G
directory=/mnt/pmem0
ioengine=mmap
[randrw]
rw=randrw

Here are the performance results with XFS using only pte faults:
READ: io=1022.7MB, aggrb=557610KB/s, minb=557610KB/s, maxb=557610KB/s, mint=1878msec, maxt=1878msec
WRITE: io=1025.4MB, aggrb=559084KB/s, minb=559084KB/s, maxb=559084KB/s, mint=1878msec, maxt=1878msec

Here are performance numbers for that same test using PMD faults:
READ: io=1022.7MB, aggrb=1406.7MB/s, minb=1406.7MB/s, maxb=1406.7MB/s, mint=727msec, maxt=727msec
WRITE: io=1025.4MB, aggrb=1410.4MB/s, minb=1410.4MB/s, maxb=1410.4MB/s, mint=727msec, maxt=727msec

This was done on a random lab machine with a PMEM device made from memmap'd
RAM. To get XFS to use PMD faults, I did the following:

mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0
mount -o dax /dev/pmem0 /mnt/pmem0
xfs_io -c "extsize 2m" /mnt/pmem0

Ross Zwisler (12):
ext4: allow DAX writeback for hole punch
ext4: tell DAX the size of allocation holes
dax: remove buffer_size_valid()
ext2: remove support for DAX PMD faults
dax: make 'wait_table' global variable static
dax: consistent variable naming for DAX entries
dax: coordinate locking for offsets in PMD range
dax: remove dax_pmd_fault()
dax: correct dax iomap code namespace
dax: add struct iomap based DAX PMD support
xfs: use struct iomap based DAX PMD fault path
dax: remove "depends on BROKEN" from FS_DAX_PMD

fs/Kconfig | 1 -
fs/dax.c | 650 +++++++++++++++++++++++++++-------------------------
fs/ext2/file.c | 35 +--
fs/ext4/inode.c | 7 +-
fs/xfs/xfs_aops.c | 25 +-
fs/xfs/xfs_aops.h | 3 -
fs/xfs/xfs_file.c | 10 +-
include/linux/dax.h | 48 +++-
mm/filemap.c | 6 +-
9 files changed, 402 insertions(+), 383 deletions(-)

--
2.7.4


2016-09-29 22:49:20

by Ross Zwisler

Subject: [PATCH v4 02/12] ext4: tell DAX the size of allocation holes

When DAX calls _ext4_get_block() and the file offset points to a hole, we
currently don't set bh->b_size. This is worked around via
buffer_size_valid() in fs/dax.c.

_ext4_get_block() has the hole size information from ext4_map_blocks(), so
populate bh->b_size so we can remove buffer_size_valid() in a later patch.

Signed-off-by: Ross Zwisler <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0900cb4..9075fac 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -759,6 +759,9 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
ext4_update_bh_state(bh, map.m_flags);
bh->b_size = inode->i_sb->s_blocksize * map.m_len;
ret = 0;
+ } else if (ret == 0) {
+ /* hole case, need to fill in bh->b_size */
+ bh->b_size = inode->i_sb->s_blocksize * map.m_len;
}
return ret;
}
--
2.7.4


2016-09-29 22:49:21

by Ross Zwisler

Subject: [PATCH v4 03/12] dax: remove buffer_size_valid()

Now that ext4 properly sets bh.b_size when we call get_block() for a hole,
rely on that value and remove the buffer_size_valid() sanity check.

Signed-off-by: Ross Zwisler <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/dax.c | 22 +---------------------
1 file changed, 1 insertion(+), 21 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index cc025f8..9b9be8a 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -123,19 +123,6 @@ static bool buffer_written(struct buffer_head *bh)
return buffer_mapped(bh) && !buffer_unwritten(bh);
}

-/*
- * When ext4 encounters a hole, it returns without modifying the buffer_head
- * which means that we can't trust b_size. To cope with this, we set b_state
- * to 0 before calling get_block and, if any bit is set, we know we can trust
- * b_size. Unfortunate, really, since ext4 knows precisely how long a hole is
- * and would save us time calling get_block repeatedly.
- */
-static bool buffer_size_valid(struct buffer_head *bh)
-{
- return bh->b_state != 0;
-}
-
-
static sector_t to_sector(const struct buffer_head *bh,
const struct inode *inode)
{
@@ -177,8 +164,6 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
rc = get_block(inode, block, bh, rw == WRITE);
if (rc)
break;
- if (!buffer_size_valid(bh))
- bh->b_size = 1 << blkbits;
bh_max = pos - first + bh->b_size;
bdev = bh->b_bdev;
/*
@@ -1012,12 +997,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,

bdev = bh.b_bdev;

- /*
- * If the filesystem isn't willing to tell us the length of a hole,
- * just fall back to PTEs. Calling get_block 512 times in a loop
- * would be silly.
- */
- if (!buffer_size_valid(&bh) || bh.b_size < PMD_SIZE) {
+ if (bh.b_size < PMD_SIZE) {
dax_pmd_dbg(&bh, address, "allocated block too small");
return VM_FAULT_FALLBACK;
}
--
2.7.4


2016-09-29 22:49:29

by Ross Zwisler

Subject: [PATCH v4 11/12] xfs: use struct iomap based DAX PMD fault path

Switch xfs_filemap_pmd_fault() from using dax_pmd_fault() to the new and
improved dax_iomap_pmd_fault(). Also, now that it has no more users,
remove xfs_get_blocks_dax_fault().

Signed-off-by: Ross Zwisler <[email protected]>
---
fs/xfs/xfs_aops.c | 25 +++++--------------------
fs/xfs/xfs_aops.h | 3 ---
fs/xfs/xfs_file.c | 2 +-
3 files changed, 6 insertions(+), 24 deletions(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 4a28fa9..39c754f 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -1170,8 +1170,7 @@ __xfs_get_blocks(
sector_t iblock,
struct buffer_head *bh_result,
int create,
- bool direct,
- bool dax_fault)
+ bool direct)
{
struct xfs_inode *ip = XFS_I(inode);
struct xfs_mount *mp = ip->i_mount;
@@ -1265,12 +1264,8 @@ __xfs_get_blocks(
if (ISUNWRITTEN(&imap))
set_buffer_unwritten(bh_result);
/* direct IO needs special help */
- if (create) {
- if (dax_fault)
- ASSERT(!ISUNWRITTEN(&imap));
- else
- xfs_map_direct(inode, bh_result, &imap, offset);
- }
+ if (create)
+ xfs_map_direct(inode, bh_result, &imap, offset);
}

/*
@@ -1310,7 +1305,7 @@ xfs_get_blocks(
struct buffer_head *bh_result,
int create)
{
- return __xfs_get_blocks(inode, iblock, bh_result, create, false, false);
+ return __xfs_get_blocks(inode, iblock, bh_result, create, false);
}

int
@@ -1320,17 +1315,7 @@ xfs_get_blocks_direct(
struct buffer_head *bh_result,
int create)
{
- return __xfs_get_blocks(inode, iblock, bh_result, create, true, false);
-}
-
-int
-xfs_get_blocks_dax_fault(
- struct inode *inode,
- sector_t iblock,
- struct buffer_head *bh_result,
- int create)
-{
- return __xfs_get_blocks(inode, iblock, bh_result, create, true, true);
+ return __xfs_get_blocks(inode, iblock, bh_result, create, true);
}

/*
diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
index 1950e3b..6779e9d 100644
--- a/fs/xfs/xfs_aops.h
+++ b/fs/xfs/xfs_aops.h
@@ -57,9 +57,6 @@ int xfs_get_blocks(struct inode *inode, sector_t offset,
struct buffer_head *map_bh, int create);
int xfs_get_blocks_direct(struct inode *inode, sector_t offset,
struct buffer_head *map_bh, int create);
-int xfs_get_blocks_dax_fault(struct inode *inode, sector_t offset,
- struct buffer_head *map_bh, int create);
-
int xfs_end_io_direct_write(struct kiocb *iocb, loff_t offset,
ssize_t size, void *private);
int xfs_setfilesize(struct xfs_inode *ip, xfs_off_t offset, size_t size);
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 00293d2..ab8f652 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1539,7 +1539,7 @@ xfs_filemap_pmd_fault(
}

xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
- ret = dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault);
+ ret = dax_iomap_pmd_fault(vma, addr, pmd, flags, &xfs_iomap_ops);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);

if (flags & FAULT_FLAG_WRITE)
--
2.7.4


2016-09-29 23:43:45

by Dave Chinner

Subject: Re: [PATCH v4 00/12] re-enable DAX PMD support

On Thu, Sep 29, 2016 at 04:49:18PM -0600, Ross Zwisler wrote:
> DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> locking. This series allows DAX PMDs to participate in the DAX radix tree
> based locking scheme so that they can be re-enabled.
>
> Ted, can you please take the ext2 + ext4 patches through your tree? Dave,
> can you please take the rest through the XFS tree?

That will make merging this difficult, because later patches in the
series are dependent on the ext2/ext4 patches early in the series.
I'd much prefer they all go through the one tree to avoid issues
like this.

>
> Changes since v3:
> - Corrected dax iomap code namespace for functions defined in fs/dax.c.
> (Dave Chinner)
> - Added leading "dax" namespace to new static functions in fs/dax.c.
> (Dave Chinner)
> - Made all DAX PMD code in fs/dax.c conditionally compiled based on
> CONFIG_FS_DAX_PMD. Otherwise a stub in include/linux/dax.h that just
> returns VM_FAULT_FALLBACK will be used. (Dave Chinner)
> - Removed dynamic debugging messages from DAX PMD fault path. Debugging
> tracepoints will be added later to both the PTE and PMD paths via a
> later patch set. (Dave Chinner)
> - Added a comment to ext2_dax_vm_ops explaining why we don't support DAX
> PMD faults in ext2. (Dave Chinner)
>
> This was built upon xfs/for-next with PMD performance fixes from Toshi Kani
> and Dan Williams. Dan's patch has already been merged for v4.8, and
> Toshi's patches are currently queued in Andrew Morton's mm tree for v4.9
> inclusion.

the xfs/for-next branch is not a stable branch - it can rebase at
any time just like linux-next can. The topic branches that I merge
into the for-next branch, OTOH, are usually stable. i.e. The topic
branch you should be basing this on is "iomap-4.9-dax".

And then you'll also see that there are already ext2 patches in this
topic branch to convert it to iomap for DAX. So I'm quite happy to
take the ext2/4 patches in this series in the same way.

The patches from Dan and Toshi: is your patch series dependent on
them? Do I need to take them as well?

Finally: none of the patches in your tree have reviewed-by tags.
That says to me that none of this code has been reviewed yet.
Reviewed-by tags are a non-negotiable requirement for anything going
through my trees. I don't have time right now to review this code,
so you're going to need to chase up other reviewers before merging.

And, really, this is getting very late in the cycle to be merging
new code - we're less than one working day away from the merge
window opening and we've missed the last linux-next build. I'd
suggest that we might be best served by slipping this PMD support
code to the next cycle when there's no time pressure for
review and we can get a decent linux-next soak on the code.

> This tree has passed xfstests for ext2, ext4 and XFS both with and without DAX,
> and has passed targeted testing where I inserted, removed and flushed DAX PTEs
> and PMDs in every combination I could think of.
>
> Previously reported performance numbers:
>
> [global]
> bs=4k
> size=2G
> directory=/mnt/pmem0
> ioengine=mmap
> [randrw]
> rw=randrw
>
> Here are the performance results with XFS using only pte faults:
> READ: io=1022.7MB, aggrb=557610KB/s, minb=557610KB/s, maxb=557610KB/s, mint=1878msec, maxt=1878msec
> WRITE: io=1025.4MB, aggrb=559084KB/s, minb=559084KB/s, maxb=559084KB/s, mint=1878msec, maxt=1878msec
>
> Here are performance numbers for that same test using PMD faults:
> READ: io=1022.7MB, aggrb=1406.7MB/s, minb=1406.7MB/s, maxb=1406.7MB/s, mint=727msec, maxt=727msec
> WRITE: io=1025.4MB, aggrb=1410.4MB/s, minb=1410.4MB/s, maxb=1410.4MB/s, mint=727msec, maxt=727msec

The numbers look good - how much of that is from lower filesystem
allocation overhead and how much of it is from fewer page faults?
You can probably determine this by creating the test file with
write() to ensure it is fully allocated and so all the filesystem
is doing on both the read and write paths is mapping allocated
regions....
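
A minimal sketch of one way to do that preallocation with write() - the
file name assumes fio's default <job>.<jobnum>.<filenum> naming and the
2 GiB size comes from the job file above, so treat both as assumptions:

/*
 * Sketch only: fully allocate the 2 GiB test file with write() so the
 * fio run measures fault/mapping overhead rather than block allocation.
 */
#include <fcntl.h>
#include <unistd.h>

int main(void)
{
	static char buf[1 << 20];	/* 1 MiB of zeroes (static => zeroed) */
	long i;
	/* Path and size are assumptions based on the fio job above. */
	int fd = open("/mnt/pmem0/randrw.0.0", O_CREAT | O_WRONLY, 0644);

	if (fd < 0)
		return 1;
	for (i = 0; i < 2048; i++)	/* 2048 x 1 MiB = 2 GiB */
		if (write(fd, buf, sizeof(buf)) != (ssize_t)sizeof(buf))
			return 1;
	fsync(fd);
	close(fd);
	return 0;
}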

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-09-30 03:03:43

by Ross Zwisler

Subject: Re: [PATCH v4 00/12] re-enable DAX PMD support

On Fri, Sep 30, 2016 at 09:43:45AM +1000, Dave Chinner wrote:
> Finally: none of the patches in your tree have reviewed-by tags.
> That says to me that none of this code has been reviewed yet.
> Reviewed-by tags are non-negotiable requirement for anything going
> through my trees. I don't have time right now to review this code,
> so you're going to need to chase up other reviewers before merging.
>
> And, really, this is getting very late in the cycle to be merging
> new code - we're less than one working day away from the merge
> window opening and we've missed the last linux-next build. I'd
> suggest that we'd might be best served by slipping this to the PMD
> support code to the next cycle when there's no time pressure for
> review and we can get a decent linux-next soak on the code.

I absolutely support your policy of only sending code to Linux that has passed
peer review.

However, I do feel compelled to point out that this is not new code. I didn't
just spring it on everyone in the hours before the v4.8 merge window. I
posted the first version of this patch set on August 15th, *seven weeks ago*:

https://lkml.org/lkml/2016/8/15/613

This was the day after v4.7-rc2 was released.

Since then I have responded promptly to the little review feedback that I've
received. I've also reviewed and tested other DAX changes, like the struct
iomap changes from Christoph. Those changes were first posted to the mailing
list on September 9th, four weeks after mine. Nevertheless, I was happy to
rebase my changes on top of his, which meant a full rewrite of the DAX PMD
fault handler so it would be based on struct iomap. His changes are going to
be merged for v4.9, and mine are not.

Please, help me understand what I can do to get my code reviewed. Do I need
to more aggressively ping my patch series, asking people by name for reviews?
Do we need to rework our code flow to Linus so that the DAX changes go through
a filesystem tree like XFS or ext4, and ask the developers of that filesystem
to help with reviews? Something else?

I'm honestly very frustrated by this because I've done my best to be open to
constructive criticism and I've tried to respond promptly to the feedback that
I've received. In the end, though, a system where it's a requirement that all
upstreamed code be peer reviewed but in which I can't get any feedback is
essentially a system where I'm not allowed to contribute.


2016-09-30 04:00:55

by Darrick J. Wong

Subject: Re: [PATCH v4 00/12] re-enable DAX PMD support

On Thu, Sep 29, 2016 at 09:03:43PM -0600, Ross Zwisler wrote:
> On Fri, Sep 30, 2016 at 09:43:45AM +1000, Dave Chinner wrote:
> > Finally: none of the patches in your tree have reviewed-by tags.
> > That says to me that none of this code has been reviewed yet.
> > Reviewed-by tags are non-negotiable requirement for anything going
> > through my trees. I don't have time right now to review this code,
> > so you're going to need to chase up other reviewers before merging.
> >
> > And, really, this is getting very late in the cycle to be merging
> > new code - we're less than one working day away from the merge
> > window opening and we've missed the last linux-next build. I'd
> > suggest that we'd might be best served by slipping this to the PMD
> > support code to the next cycle when there's no time pressure for
> > review and we can get a decent linux-next soak on the code.
>
> I absolutely support your policy of only sending code to Linux that has passed
> peer review.
>
> However, I do feel compelled to point out that this is not new code. I didn't
> just spring it on everyone in the hours before the v4.8 merge window. I
> posted the first version of this patch set on August 15th, *seven weeks ago*:
>
> https://lkml.org/lkml/2016/8/15/613
>
> This was the day after v4.7-rc2 was released.
>
> Since then I have responded promptly to the little review feedback
> that I've received. I've also reviewed and tested other DAX changes,
> like the struct iomap changes from Christoph. Those changes were
> first posted to the mailing list on September 9th, four weeks after
> mine. Nevertheless, I was happy to rebase my changes on top of his,
> which meant a full rewrite of the DAX PMD fault handler so it would be
> based on struct iomap. His changes are going to be merged for v4.9,
> and mine are not.

I'm not knocking the iomap migration, but it did cause a fair amount of
churn in the XFS reflink patchset -- and that's for a filesystem that
already /had/ iomap implemented. It'd be neat to have all(?) the DAX
filesystems (ext[24], XFS) move over to iomap so that you wouldn't have
to support multiple ways of talking to FSes. AFAICT ext4 hasn't gotten
iomap, which complicates things. But that's my opinion, maybe you're
fine with supporting iomap and not-iomap.

The thing that (personally) makes it harder to review these
multi-subsystem patches is that I'm not a domain expert in some of those
subsystems -- memory in this case. I get to the point where I'm
thinking "Uh, this looks ok, and it seems to work on my test VM, but is
that enough to stick my neck out and Reviewed-by?" and then get stuck.
It's hard to get unstuck with a complex piece of machinery.

> Please, help me understand what I can do to get my code reviewed. Do
> I need to more aggressively ping my patch series, asking people by
> name for reviews? Do we need to rework our code flow to Linus so that
> the DAX changes go through a filesystem tree like XFS or ext4, and ask
> the developers of that filesystem to help with reviews? Something
> else?

FWIW, I /think/ it looks fine, though I'm afraid enough of the memory
manager that I haven't said anything yet. I'll look it over more
tomorrow when my brain is fresher. If reflink for XFS lands in 4.9 I'll
start looking again at pagecache sharing and/or dax+reflink.

> I'm honestly very frustrated by this because I've done my best to be
> open to constructive criticism and I've tried to respond promptly to
> the feedback that I've received. In the end, though, a system where
> it's a requirement that all upstreamed code be peer reviewed but in
> which I can't get any feedback is essentially a system where I'm not
> allowed to contribute.

I have the same frustrations with getting non-XFS/non-ext4 patches
reviewed and upstreamed by whoever the maintainer is. I wish we had a
broader range of people who knew both FS and MM, but wow is that a long
onboarding process. :/

--D


2016-09-30 06:48:58

by Dave Chinner

Subject: Re: [PATCH v4 00/12] re-enable DAX PMD support

On Thu, Sep 29, 2016 at 09:03:43PM -0600, Ross Zwisler wrote:
> On Fri, Sep 30, 2016 at 09:43:45AM +1000, Dave Chinner wrote:
> > Finally: none of the patches in your tree have reviewed-by tags.
> > That says to me that none of this code has been reviewed yet.
> > Reviewed-by tags are non-negotiable requirement for anything going
> > through my trees. I don't have time right now to review this code,
> > so you're going to need to chase up other reviewers before merging.
> >
> > And, really, this is getting very late in the cycle to be merging
> > new code - we're less than one working day away from the merge
> > window opening and we've missed the last linux-next build. I'd
> > suggest that we'd might be best served by slipping this to the PMD
> > support code to the next cycle when there's no time pressure for
> > review and we can get a decent linux-next soak on the code.
>
> I absolutely support your policy of only sending code to Linux that has passed
> peer review.
>
> However, I do feel compelled to point out that this is not new code. I didn't
> just spring it on everyone in the hours before the v4.8 merge window. I
> posted the first version of this patch set on August 15th, *seven weeks ago*:
>
> https://lkml.org/lkml/2016/8/15/613
>
> This was the day after v4.7-rc2 was released.

I understand, Ross. We've all been there - welcome to the club. :/

The problem is there are lots of people writing code and very few
people who spend time reviewing it. And for many of those reviewers,
they also have to spend time on other code bases, whether it be the
distro's they maintain, the userspace that the kernel code needs to
function, or even the test code needed to make this all work
properly.

Stuff that spans multiple subsystems such as DAX (i.e. vm,
filesystems and block devices) are particularly troublesome in this
respect because there are very few people with expertise in all
three aspects and hence able to review a change that spans all three
subsystems. Those people also tend to be the busiest and have to
prioritise what they do.

Consider the XFS side of review in recent times: in the past 4
months, there's been ~30,000 lines of code change between kernel and
userspace. And there's already another 15,000+ lines of code in the
backlog for the next 2-3 months. That review load is falling on 3-4
people, who all have other work to do as well. This is for work that
was started well over a year ago, and that picked up from work that
was originally done 3 years ago. We're swamped on pure XFS review
right now, let alone all the other infrastructure stuff we need to
get reviewed at the same time...

Having someone say "let's get sorted after the merge window" is far
better than having your patches ignored - it tells you someone wants
your code and is actually planning to review it in the near term!
Have patience. Keep the patches up to date, keep building what you
need to build on top of them. Missing a merge window is not the end
of the world.

> Since then I have responded promptly to the little review feedback that I've
> received. I've also reviewed and tested other DAX changes, like the struct
> iomap changes from Christoph.

And I'm grateful for your doing that - it sped the process up a lot
because they then weren't blocked waiting for me to get to them. As
a result, I owe you some review time in return but unfortunately I
can't fully return the favour immediately. If more people treated
review as a selfless act that should be returned in kind, then we
wouldn't have a review bottleneck like we do...

> Those changes were first posted to the mailing
> list on September 9th, four weeks after mine. Nevertheless, I was happy to
> rebase my changes on top of his, which meant a full rewrite of the DAX PMD
> fault handler so it would be based on struct iomap. His changes are going to
> be merged for v4.9, and mine are not.

Yes, this can happen, too - core infrastructure changes can appear
suddenly and be implemented very quickly, but that does not mean
there hasn't been a lot of background work and effort put into the
code. The iomap code goes way back. Back to early 2010, in fact:

http://lkml.iu.edu/hypermail/linux/kernel/1005.1/02720.html

At that time I implemented a working multipage write IO path for
XFS, and Christoph integrated that int various OEM products shortly
afterwards. Yes, there have been iomap based XFS implementations out
there in production for over 5 years now, but that code was not
clean enough to even consider merging.

Another reference in 2013, when someone proposed a hack for embedded
systems to optimise the write path:

https://lkml.org/lkml/2013/7/23/809

Then Christoph introduced the struct iomap for pNFS and the XFS
block layout driver in late 2013, and when DAX first came along I
really wanted iomaps to be used up front rather than buffer heads
for block mapping.

Now we've finally got iomaps in the IO path and that's rapidly
cascading through all the XFS IO interfaces and into other
filesystems. This is exactly what we first talked about 6 years ago.

So while it might look like the iomap infrastructure has come out of
nowhere, it's really been a long, long road to get to this point. We
work to a different time scale over here - it's not uncommon to be
planning 5 years ahead for new XFS features. We know how long it
takes to develop, test, review and stabilise significant new
features, so while it might look like something appears and is
committed quickly, that's because you haven't seen the work that has
been done leading up to patches being presented for review and
merge.

Hopefully this will give you some more perspective on why I think
slipping a single release isn't something to get worried about. :)

> Please, help me understand what I can do to get my code reviewed. Do I need
> to more aggressively ping my patch series, asking people by name for reviews?

On the XFS and fstests lists, if nobody has responded within a few
days (usually a week) then it's OK to ping it and see if anyone has
time to review it. In general, a single ping is enough if the
patchset is in good shape. Fix it all, repost, ping in a week if
there's no followup. Repeat until merge.

> Do we need to rework our code flow to Linus so that the DAX changes go through
> a filesystem tree like XFS or ext4, and ask the developers of that filesystem
> to help with reviews? Something else?

The question we have to ask here is whether a lack of development
support from a filesystem should stop us from driving the DAX
implementation forward. I've said it before, and I'll say it again:
I'm happy to drive DAX on XFS forwards at the rate at which we can
sustain review via the XFS tree and I don't care if we break support
on ext2 or ext4. If we keep having to wait for ext4 to fix stuff to
catch up with what we want/need to do then progress will continue to
be sporadic and frustrating. Perhaps it's time to stop waiting for
ext4 to play catchup every time we take a step forwards....

> I'm honestly very frustrated by this because I've done my best to be open to
> constructive criticism and I've tried to respond promptly to the feedback that
> I've received. In the end, though, a system where it's a requirement that all
> upstreamed code be peer reviewed but in which I can't get any feedback is
> essentially a system where I'm not allowed to contribute.

There's always been a review bottleneck, and everyone ends up on the
receiving end of that frustration from time to time. Delays will happen - it's
just part of the process we all need to deal with. I used to get
frustrated, too, but now I just accept it, roll with it and we've
made it an acceptable part of the process to ping patches when it
looks like they have been forgotten...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2016-09-30 08:49:34

by Christoph Hellwig

Subject: Re: [PATCH v4 03/12] dax: remove buffer_size_valid()

Looks fine,

Reviewed-by: Christoph Hellwig <[email protected]>

2016-09-30 11:46:56

by Christoph Hellwig

Subject: Re: [PATCH v4 00/12] re-enable DAX PMD support

On Thu, Sep 29, 2016 at 04:49:18PM -0600, Ross Zwisler wrote:
> DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> locking. This series allows DAX PMDs to participate in the DAX radix tree
> based locking scheme so that they can be re-enabled.

This seems to pass xfstests fine, so with the various cleanups noted
in the individual patches this looks fine to me.


2016-10-03 18:54:03

by Ross Zwisler

Subject: Re: [PATCH v4 00/12] re-enable DAX PMD support

On Thu, Sep 29, 2016 at 09:00:55PM -0700, Darrick J. Wong wrote:
> On Thu, Sep 29, 2016 at 09:03:43PM -0600, Ross Zwisler wrote:
> > On Fri, Sep 30, 2016 at 09:43:45AM +1000, Dave Chinner wrote:
> > > Finally: none of the patches in your tree have reviewed-by tags.
> > > That says to me that none of this code has been reviewed yet.
> > > Reviewed-by tags are non-negotiable requirement for anything going
> > > through my trees. I don't have time right now to review this code,
> > > so you're going to need to chase up other reviewers before merging.
> > >
> > > And, really, this is getting very late in the cycle to be merging
> > > new code - we're less than one working day away from the merge
> > > window opening and we've missed the last linux-next build. I'd
> > > suggest that we'd might be best served by slipping this to the PMD
> > > support code to the next cycle when there's no time pressure for
> > > review and we can get a decent linux-next soak on the code.
> >
> > I absolutely support your policy of only sending code to Linux that has passed
> > peer review.
> >
> > However, I do feel compelled to point out that this is not new code. I didn't
> > just spring it on everyone in the hours before the v4.8 merge window. I
> > posted the first version of this patch set on August 15th, *seven weeks ago*:
> >
> > https://lkml.org/lkml/2016/8/15/613
> >
> > This was the day after v4.7-rc2 was released.
> >
> > Since then I have responded promptly to the little review feedback
> > that I've received. I've also reviewed and tested other DAX changes,
> > like the struct iomap changes from Christoph. Those changes were
> > first posted to the mailing list on September 9th, four weeks after
> > mine. Nevertheless, I was happy to rebase my changes on top of his,
> > which meant a full rewrite of the DAX PMD fault handler so it would be
> > based on struct iomap. His changes are going to be merged for v4.9,
> > and mine are not.
>
> I'm not knocking the iomap migration, but it did cause a fair amount of
> churn in the XFS reflink patchset -- and that's for a filesystem that
> already /had/ iomap implemented. It'd be neat to have all(?) the DAX
> filesystems (ext[24], XFS) move over to iomap so that you wouldn't have
> to support multiple ways of talking to FSes. AFAICT ext4 hasn't gotten
> iomap, which complicates things. But that's my opinion, maybe you're
> fine with supporting iomap and not-iomap.

I agree that we want to move everything over to be iomap based. I think
Christoph is already working on moving ext4 over, but as of this set PMD
support explicitly depends on the iomap interface, and I'm itching to remove
the struct buffer_head + get_block_t PTE path and I/O path as well.
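
For context, the two block-mapping interfaces being contrasted look
roughly like this - these signatures are recalled from the 4.8-era
headers rather than taken from this series, so treat them as a sketch:

/* Legacy interface: one buffer_head-sized mapping per call. */
typedef int (get_block_t)(struct inode *inode, sector_t iblock,
		struct buffer_head *bh_result, int create);

/*
 * iomap interface: the filesystem describes a whole extent at once,
 * which is what lets DAX set up a 2 MiB mapping for a PMD fault.
 */
struct iomap_ops {
	int (*iomap_begin)(struct inode *inode, loff_t pos, loff_t length,
			unsigned flags, struct iomap *iomap);
	int (*iomap_end)(struct inode *inode, loff_t pos, loff_t length,
			ssize_t written, unsigned flags, struct iomap *iomap);
};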


2016-10-03 21:11:11

by Ross Zwisler

Subject: Re: [PATCH v4 00/12] re-enable DAX PMD support

On Fri, Sep 30, 2016 at 04:48:58PM +1000, Dave Chinner wrote:
> On Thu, Sep 29, 2016 at 09:03:43PM -0600, Ross Zwisler wrote:
> > On Fri, Sep 30, 2016 at 09:43:45AM +1000, Dave Chinner wrote:
> > > Finally: none of the patches in your tree have reviewed-by tags.
> > > That says to me that none of this code has been reviewed yet.
> > > Reviewed-by tags are non-negotiable requirement for anything going
> > > through my trees. I don't have time right now to review this code,
> > > so you're going to need to chase up other reviewers before merging.
> > >
> > > And, really, this is getting very late in the cycle to be merging
> > > new code - we're less than one working day away from the merge
> > > window opening and we've missed the last linux-next build. I'd
> > > suggest that we'd might be best served by slipping this to the PMD
> > > support code to the next cycle when there's no time pressure for
> > > review and we can get a decent linux-next soak on the code.
> >
> > I absolutely support your policy of only sending code to Linux that has passed
> > peer review.
> >
> > However, I do feel compelled to point out that this is not new code. I didn't
> > just spring it on everyone in the hours before the v4.8 merge window. I
> > posted the first version of this patch set on August 15th, *seven weeks ago*:
> >
> > https://lkml.org/lkml/2016/8/15/613
> >
> > This was the day after v4.7-rc2 was released.
>
> I understand, Ross. We've all been there - welcome to the club. :/
>
> The problem is there are lots of people writing code and very few
> people who spent time reviewing it. And for many of those reviewers,
> they also have to spend time on other code bases, whether it be the
> distro's they maintain, the userspace that the kernel code needs to
> function, or even the test code needed to make this all work
> properly.
>
> Stuff that spans multiple subsystems such as DAX (i.e. vm,
> filesystems and block devices) are particularly troublesome in this
> respect because there are very few people with expertise in all
> three aspects and hence able to review a change that spans all three
> subsystems. Those people also tend to be the busiest and have to
> prioritise what they do.
>
> Consider the XFS side of review in recent times: in the past 4
> months, there's been ~30,000 lines of code change between kernel and
> userspace. And there's already another 15,000+ lines of code in the
> backlog for the next 2-3 months. That review load is falling on 3-4
> people, who all have other work to do as well. This is for work that
> was started well over a year ago, and that picked up from work that
> was originally done 3 years ago. We're swamped on pure XFS review
> right now, let alone all the other infratructure stuff we need to
> get reviewed at the same time...
>
> Having someone say "lets get sorted after the merge window" is far
> better than having your patches ignored - it tells you someone wants
> your code and is actually planning to review it in the near term!
> Have patience. Keep the patches up to date, keep building what you
> need to build on top of them. Missing a merge window is not the end
> of the world.
>
> > Since then I have responded promptly to the little review feedback that I've
> > received. I've also reviewed and tested other DAX changes, like the struct
> > iomap changes from Christoph.
>
> And I'm grateful for your doing that - it sped the process up a lot
> because they then weren't blocked waiting for me to get to them. As
> a result, I owe you some review time in return but unfortunately I
> can't fully return the favour immediately. If more people treated
> review as a selfless act that should be returned in kind, then we
> wouldn't have a review bottleneck like we do...
>
> > Those changes were first posted to the mailing
> > list on September 9th, four weeks after mine. Nevertheless, I was happy to
> > rebase my changes on top of his, which meant a full rewrite of the DAX PMD
> > fault handler so it would be based on struct iomap. His changes are going to
> > be merged for v4.9, and mine are not.
>
> Yes, this can happen, too - core infrastructure changes can appear
> suddenly and be implemented very quickly, but that does not mean
> there hasn't been a lot of background work and effort put into the
> code. The iomap code goes way back. Back to early 2010, in fact:
>
> http://lkml.iu.edu/hypermail/linux/kernel/1005.1/02720.html
>
> At that time I implemented a working multipage write IO path for
> XFS, and Christoph integrated that int various OEM products shortly
> afterwards. Yes, there have been iomap based XFS implementations out
> there in production for over 5 years now, but that code was not
> clean enough to even consider merging.
>
> Another reference in 2013, when someone proposed a hack for embedded
> systems to optimise the write path:
>
> https://lkml.org/lkml/2013/7/23/809
>
> Then Christoph introduced the struct iomap for pNFS and the XFS
> block layout driver in late 2013, and when DAX first came along I
> really wanted iomaps to be used up front rather than buffer heads
> for block mapping.
>
> Now we've finally got iomaps in the IO path and that's rapidly
> cascading through all the XFS IO interfaces and into other
> filesystems. This is exactly what we first talked about 6 years ago.
>
> So while it might look like the iomap infrastructure has come out of
> nowhere, it's really been a long, long road to get to this point. We
> work to a different time scale over here - it's not uncommon to be
> planning 5 years ahead for new XFS features. We know how long it
> takes to develop, test, review and stabilise significant new
> features, so while it might look like something appears and is
> committed quickly, that's because you haven't seen the work that has
> been done leading up to patches being presented for review and
> merge.
>
> Hopefully this will give you some more perspective on why I think
> slipping a single release isn't something to get worried about. :)
>
> > Please, help me understand what I can do to get my code reviewed. Do I need
> > to more aggressively ping my patch series, asking people by name for reviews?
>
> On the XFS and fstests lists, if nobody has responded within a few
> days (usually a week) then it's OK to ping it and see if anyone has
> time to review it. In general, a single ping is enough if the
> patchset is in good shape. Fix it all, repost, ping in a week if
> there's no followup. Repeat until merge.
>
> > Do we need to rework our code flow to Linus so that the DAX changes go through
> > a filesystem tree like XFS or ext4, and ask the developers of that filesystem
> > to help with reviews? Something else?
>
> The question we have to ask here is whether a lack of development
> support from a filesystem stop us from driving the DAX
> implementation forward? I've said it before, and I'll say it again:
> I'm happy to drive DAX on XFS forwards at the rate at which we can
> sustain review via the XFS tree and I don't care if we break support
> on ext2 or ext4. If we keep having to wait for ext4 to fix stuff to
> catch up with what we want/need to do then progress will continue to
> be sporadic and frustrating. Perhaps it's time to stop waiting for
> ext4 to play catchup every time we take a step forwards....
>
> > I'm honestly very frustrated by this because I've done my best to be open to
> > constructive criticism and I've tried to respond promptly to the feedback that
> > I've received. In the end, though, a system where it's a requirement that all
> > upstreamed code be peer reviewed but in which I can't get any feedback is
> > essentially a system where I'm not allowed to contribute.
>
> There's always been a review bottleneck, and everyone ends up on the
> end of that frustration from time to time. Delays will happen - it's
> just part of the process we all need to deal with. I used to get
> frustrated, too, but now I just accept it, roll with it and we've
> made it an acceptible part of the process to ping patches when it
> looks like they have been forgotten...
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

Thank you for the advice and perspective, Dave. :)

I've gotten a bunch of review feedback from Christoph and Jan (thank you!), so
I'm busily addressing that. Since we're now targeting v4.10 I'll also
incorporate tracepoints so we can observe what's going on in our various fault
paths.


2016-10-03 23:05:06

by Ross Zwisler

Subject: Re: [PATCH v4 00/12] re-enable DAX PMD support

On Fri, Sep 30, 2016 at 09:43:45AM +1000, Dave Chinner wrote:
> On Thu, Sep 29, 2016 at 04:49:18PM -0600, Ross Zwisler wrote:
> > DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
> > locking. This series allows DAX PMDs to participate in the DAX radix tree
> > based locking scheme so that they can be re-enabled.
> >
> > Ted, can you please take the ext2 + ext4 patches through your tree? Dave,
> > can you please take the rest through the XFS tree?
>
> That will make merging this difficult, because later patches in the
> series are dependent on the ext2/ext4 patches early in the series.
> I'd much prefer they all go through the one tree to avoid issues
> like this.

I think that at least some of the ext2 patches will be merged in v4.9
through Ted's tree because they are just bugfixes. For whatever is left I'm
happy to have it merged to v4.10 through the XFS tree, thank you.

> >
> > Changes since v3:
> > - Corrected dax iomap code namespace for functions defined in fs/dax.c.
> > (Dave Chinner)
> > - Added leading "dax" namespace to new static functions in fs/dax.c.
> > (Dave Chinner)
> > - Made all DAX PMD code in fs/dax.c conditionally compiled based on
> > CONFIG_FS_DAX_PMD. Otherwise a stub in include/linux/dax.h that just
> > returns VM_FAULT_FALLBACK will be used. (Dave Chinner)
> > - Removed dynamic debugging messages from DAX PMD fault path. Debugging
> > tracepoints will be added later to both the PTE and PMD paths via a
> > later patch set. (Dave Chinner)
> > - Added a comment to ext2_dax_vm_ops explaining why we don't support DAX
> > PMD faults in ext2. (Dave Chinner)
> >
> > This was built upon xfs/for-next with PMD performance fixes from Toshi Kani
> > and Dan Williams. Dan's patch has already been merged for v4.8, and
> > Toshi's patches are currently queued in Andrew Morton's mm tree for v4.9
> > inclusion.
>
> the xfs/for-next branch is not a stable branch - it can rebase at
> any time just like linux-next can. The topic branches that I merge
> into the for-next branch, OTOH, are usually stable. i.e. The topic
> branch you should be basing this on is "iomap-4.9-dax".

Thanks, I didn't realize that. I'll rebase onto iomap-4.9-dax, then on a
v4.9-rc when things become more stable.

> And then you'll also see that there are already ext2 patches in this
> topic branch to convert it to iomap for DAX. So I'm quite happy to
> take the ext2/4 patches in this series in the same way.
>
> The patches from Dan and Toshi: is you patch series dependent on
> them? Do I need to take them as well?

Nope, Dan and Toshi's patches are just performance fixes, and should be merged
through other trees for v4.9.

> > Previously reported performance numbers:
> >
> > [global]
> > bs=4k
> > size=2G
> > directory=/mnt/pmem0
> > ioengine=mmap
> > [randrw]
> > rw=randrw
> >
> > Here are the performance results with XFS using only pte faults:
> > READ: io=1022.7MB, aggrb=557610KB/s, minb=557610KB/s, maxb=557610KB/s, mint=1878msec, maxt=1878msec
> > WRITE: io=1025.4MB, aggrb=559084KB/s, minb=559084KB/s, maxb=559084KB/s, mint=1878msec, maxt=1878msec
> >
> > Here are performance numbers for that same test using PMD faults:
> > READ: io=1022.7MB, aggrb=1406.7MB/s, minb=1406.7MB/s, maxb=1406.7MB/s, mint=727msec, maxt=727msec
> > WRITE: io=1025.4MB, aggrb=1410.4MB/s, minb=1410.4MB/s, maxb=1410.4MB/s, mint=727msec, maxt=727msec
>
> The numbers look good - how much of that is from lower filesystem
> allocation overhead and how much of it is from fewer page faults?
> You can probably determine this by creating the test file with
> write() to ensure it is fully allocated and so all the filesystem
> is doing on both the read and write paths is mapping allocated
> regions....

Perhaps I'm doing something wrong, but I think that the first time I run the
test I get all allocating writes because that's when the file is first
created. The file stays around between runs, though, so every run thereafter
should already have a fully allocated file, so the only overhead difference
should be page faults? Does this sound correct?

If that's true, the performance gain seems to be almost entirely due to fewer
page faults. For both the 4k case and the 2M case there isn't a noticeable
performance difference between the unallocated vs fully allocated cases.
