2017-08-17 16:08:02

by Jan Kara

[permalink] [raw]
Subject: [RFC PATCH 0/13 v2] dax, ext4: Synchronous page faults

Hello,

here is second version of my patches to implement synchronous page faults for
DAX mappings to make flushing of DAX mappings possible from userspace so that
they can be flushed on finer than page granularity and also avoid the overhead
of a syscall.

We use a new mmap flag MAP_SYNC to indicate that page faults for the mapping
should be synchronous. The guarantee provided by this flag is: While a block
is writeably mapped into page tables of this mapping, it is guaranteed to be
visible in the file at that offset also after a crash.

How I implement this is that ->iomap_begin() indicates by a flag that inode
block mapping metadata is unstable and may need flushing (use the same test as
whether fdatasync() has metadata to write). If yes, DAX fault handler refrains
from inserting / write-enabling the page table entry and returns special flag
VM_FAULT_NEEDDSYNC together with a PFN to map to the filesystem fault handler.
The handler then calls fdatasync() (vfs_fsync_range()) for the affected range
and after that calls DAX code to update the page table entry appropriately.

>From my (fairly limited) knowledge of XFS it seems XFS should be able to do the
same and it should be even possible for filesystem to implement safe remapping
of a file offset to a different block (i.e. break reflink, do defrag, or
similar stuff) like:

1) Block page faults
2) fdatasync() remapped range (there can be outstanding data modifications
not yet flushed)
3) unmap_mapping_range()
4) Now remap blocks
5) Unblock page faults

Basically we do the same on events like punch hole so there is not much new
there.

Note that the implementation of MAP_SYNC flag is pretty crude for now just to
enable testing since Dan is working in the same area to implement another mmap
flag. Once the decision on how to implement new mmap flag is settled, I can
clean up that patch.

I did some basic performance testing on the patches over ramdisk - timed
latency of page faults when faulting 512 pages. I did several tests: with file
preallocated / with file empty, with background file copying going on / without
it, with / without MAP_SYNC (so that we get comparison). The results are
(numbers are in microseconds):

File preallocated, no background load no MAP_SYNC:
min=5 avg=6 max=42
4 - 7 us: 398
8 - 15 us: 110
16 - 31 us: 2
32 - 63 us: 2

File preallocated, no background load, MAP_SYNC:
min=10 avg=10 max=43
8 - 15 us: 509
16 - 31 us: 2
32 - 63 us: 1

File empty, no background load, no MAP_SYNC:
min=21 avg=23 max=76
16 - 31 us: 503
32 - 63 us: 8
64 - 127 us: 1

File empty, no background load, MAP_SYNC:
min=91 avg=108 max=234
64 - 127 us: 467
128 - 255 us: 45

File empty, background load, no MAP_SYNC:
min=21 avg=23 max=67
16 - 31 us: 507
32 - 63 us: 4
64 - 127 us: 1

File empty, background load, MAP_SYNC:
min=94 avg=112 max=181
64 - 127 us: 489
128 - 255 us: 23

So here we can see the difference between MAP_SYNC vs non MAP_SYNC is about
100-200 us when we need to wait for transaction commit in this setup.

Anyway, here are the patches, comments are welcome.

Changes since v1:
* switched to using mmap flag MAP_SYNC
* cleaned up fault handlers to avoid passing pfn in vmf->orig_pte
* switched to not touching page tables before we are ready to insert final
entry as it was unnecessary and not really simplifying anything
* renamed fault flag to VM_FAULT_NEEDDSYNC
* other smaller fixes found by reviewers

Honza


2017-08-17 16:08:10

by Jan Kara

[permalink] [raw]
Subject: [PATCH 08/13] dax: Fix comment describing dax_iomap_fault()

Add missing argument description.

Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/dax.c b/fs/dax.c
index 54d2a9161f71..85ea49bbbdbf 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1414,7 +1414,8 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
/**
* dax_iomap_fault - handle a page fault on a DAX file
* @vmf: The description of the fault
- * @ops: iomap ops passed from the file system
+ * @pe_size: Size of the page to fault in
+ * @ops: Iomap ops passed from the file system
*
* When a page fault occurs, filesystems may call this helper in
* their fault handler for DAX files. dax_iomap_fault() assumes the caller
--
2.12.3

2017-08-17 16:08:15

by Jan Kara

[permalink] [raw]
Subject: [PATCH 13/13] ext4: Support for synchronous DAX faults

We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
synchronous write fault when inode has some uncommitted metadata
changes. In the fault handler ext4_dax_fault() we then detect this case,
call vfs_fsync_range() to make sure all metadata is committed, and call
dax_pfn_mkwrite() to mark PTE as writeable. Note that this will also
dirty corresponding radix tree entry which is what we want - fsync(2)
will still provide data integrity guarantees for applications not using
userspace flushing. And applications using userspace flushing can avoid
calling fsync(2) and thus avoid the performance overhead.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/file.c | 36 ++++++++++++++++++++++++++++++------
fs/ext4/inode.c | 4 ++++
fs/jbd2/journal.c | 17 +++++++++++++++++
include/linux/jbd2.h | 1 +
4 files changed, 52 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 850037e140d7..3765c4ed1368 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -280,6 +280,7 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
struct inode *inode = file_inode(vmf->vma->vm_file);
struct super_block *sb = inode->i_sb;
bool write = vmf->flags & FAULT_FLAG_WRITE;
+ pfn_t pfn;

if (write) {
sb_start_pagefault(sb);
@@ -287,16 +288,39 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
down_read(&EXT4_I(inode)->i_mmap_sem);
handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
EXT4_DATA_TRANS_BLOCKS(sb));
+ if (IS_ERR(handle)) {
+ up_read(&EXT4_I(inode)->i_mmap_sem);
+ sb_end_pagefault(sb);
+ return VM_FAULT_SIGBUS;
+ }
} else {
down_read(&EXT4_I(inode)->i_mmap_sem);
}
- if (!IS_ERR(handle))
- result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops, NULL);
- else
- result = VM_FAULT_SIGBUS;
+ result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops, &pfn);
if (write) {
- if (!IS_ERR(handle))
- ext4_journal_stop(handle);
+ ext4_journal_stop(handle);
+ /* Write fault but PFN mapped only RO? */
+ if (result & VM_FAULT_NEEDDSYNC) {
+ int err;
+ loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT;
+ size_t len = 0;
+
+ if (pe_size == PE_SIZE_PTE)
+ len = PAGE_SIZE;
+#ifdef CONFIG_FS_DAX_PMD
+ else if (pe_size == PE_SIZE_PMD)
+ len = HPAGE_PMD_SIZE;
+#endif
+ else
+ WARN_ON_ONCE(1);
+ err = vfs_fsync_range(vmf->vma->vm_file, start,
+ start + len - 1, 1);
+ if (err)
+ result = VM_FAULT_SIGBUS;
+ else
+ result = dax_insert_pfn_mkwrite(vmf, pe_size,
+ pfn);
+ }
up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(sb);
} else {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3c600f02673f..7a7529c3f0c8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3429,6 +3429,10 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
}

iomap->flags = 0;
+ if ((flags & IOMAP_WRITE) &&
+ !jbd2_transaction_committed(EXT4_SB(inode->i_sb)->s_journal,
+ EXT4_I(inode)->i_datasync_tid))
+ iomap->flags |= IOMAP_F_NEEDDSYNC;
bdev = inode->i_sb->s_bdev;
iomap->bdev = bdev;
if (blk_queue_dax(bdev->bd_queue))
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 7d5ef3bf3f3e..fa8cde498b4b 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -738,6 +738,23 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
return err;
}

+/* Return 1 when transaction with given tid has already committed. */
+int jbd2_transaction_committed(journal_t *journal, tid_t tid)
+{
+ int ret = 1;
+
+ read_lock(&journal->j_state_lock);
+ if (journal->j_running_transaction &&
+ journal->j_running_transaction->t_tid == tid)
+ ret = 0;
+ if (journal->j_committing_transaction &&
+ journal->j_committing_transaction->t_tid == tid)
+ ret = 0;
+ read_unlock(&journal->j_state_lock);
+ return ret;
+}
+EXPORT_SYMBOL(jbd2_transaction_committed);
+
/*
* When this function returns the transaction corresponding to tid
* will be completed. If the transaction has currently running, start
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 606b6bce3a5b..296d1e0ea87b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1367,6 +1367,7 @@ int jbd2_log_start_commit(journal_t *journal, tid_t tid);
int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
+int jbd2_transaction_committed(journal_t *journal, tid_t tid);
int jbd2_complete_transaction(journal_t *journal, tid_t tid);
int jbd2_log_do_checkpoint(journal_t *journal);
int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
--
2.12.3

2017-08-18 22:12:59

by Ross Zwisler

[permalink] [raw]
Subject: Re: [PATCH 08/13] dax: Fix comment describing dax_iomap_fault()

On Thu, Aug 17, 2017 at 06:08:10PM +0200, Jan Kara wrote:
> Add missing argument description.
>
> Signed-off-by: Jan Kara <[email protected]>

Reviewed-by: Ross Zwisler <[email protected]>

2017-08-21 19:19:50

by Ross Zwisler

[permalink] [raw]
Subject: Re: [PATCH 13/13] ext4: Support for synchronous DAX faults

On Thu, Aug 17, 2017 at 06:08:15PM +0200, Jan Kara wrote:
> We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
> synchronous write fault when inode has some uncommitted metadata
> changes. In the fault handler ext4_dax_fault() we then detect this case,
> call vfs_fsync_range() to make sure all metadata is committed, and call
> dax_pfn_mkwrite() to mark PTE as writeable. Note that this will also

Need to fix up the above line a little -
s/dax_pfn_mkwrite/dax_insert_pfn_mkwrite/, and we insert the PTE as well as
make it writeable.

> dirty corresponding radix tree entry which is what we want - fsync(2)
> will still provide data integrity guarantees for applications not using
> userspace flushing. And applications using userspace flushing can avoid
> calling fsync(2) and thus avoid the performance overhead.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/ext4/file.c | 36 ++++++++++++++++++++++++++++++------
> fs/ext4/inode.c | 4 ++++
> fs/jbd2/journal.c | 17 +++++++++++++++++
> include/linux/jbd2.h | 1 +
> 4 files changed, 52 insertions(+), 6 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 850037e140d7..3765c4ed1368 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -280,6 +280,7 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
> struct inode *inode = file_inode(vmf->vma->vm_file);
> struct super_block *sb = inode->i_sb;
> bool write = vmf->flags & FAULT_FLAG_WRITE;
> + pfn_t pfn;
>
> if (write) {
> sb_start_pagefault(sb);
> @@ -287,16 +288,39 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
> down_read(&EXT4_I(inode)->i_mmap_sem);
> handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
> EXT4_DATA_TRANS_BLOCKS(sb));
> + if (IS_ERR(handle)) {
> + up_read(&EXT4_I(inode)->i_mmap_sem);
> + sb_end_pagefault(sb);
> + return VM_FAULT_SIGBUS;
> + }
> } else {
> down_read(&EXT4_I(inode)->i_mmap_sem);
> }
> - if (!IS_ERR(handle))
> - result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops, NULL);
> - else
> - result = VM_FAULT_SIGBUS;
> + result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops, &pfn);
> if (write) {
> - if (!IS_ERR(handle))
> - ext4_journal_stop(handle);
> + ext4_journal_stop(handle);
> + /* Write fault but PFN mapped only RO? */

The above comment is out of date.

> + if (result & VM_FAULT_NEEDDSYNC) {
> + int err;
> + loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT;
> + size_t len = 0;
> +
> + if (pe_size == PE_SIZE_PTE)
> + len = PAGE_SIZE;
> +#ifdef CONFIG_FS_DAX_PMD
> + else if (pe_size == PE_SIZE_PMD)
> + len = HPAGE_PMD_SIZE;

In fs/dax.c we always use PMD_SIZE. It looks like HPAGE_PMD_SIZE and PMD_SIZE
are always the same (from include/linux/huge_mm.h, the only defintion of
HPAGE_PMD_SIZE):

#define HPAGE_PMD_SHIFT PMD_SHIFT
#define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)

and AFAICT PMD_SIZE is defined to be 1<<PMD_SHIFT for all architectures as
well. I don't understand why we have both?

In any case, neither HPAGE_PMD_SIZE nor PMD_SIZE are used anywhere else in the
ext4 code, so can we use PMD_SIZE here for consistency? If they ever did
manage to be different, I think we'd want PMD_SIZE anyway.

With those nits and an updated changelog:

Reviewed-by: Ross Zwisler <[email protected]>

2017-08-22 10:18:04

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 13/13] ext4: Support for synchronous DAX faults

On Mon 21-08-17 13:19:48, Ross Zwisler wrote:
> On Thu, Aug 17, 2017 at 06:08:15PM +0200, Jan Kara wrote:
> > We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
> > synchronous write fault when inode has some uncommitted metadata
> > changes. In the fault handler ext4_dax_fault() we then detect this case,
> > call vfs_fsync_range() to make sure all metadata is committed, and call
> > dax_pfn_mkwrite() to mark PTE as writeable. Note that this will also
>
> Need to fix up the above line a little -
> s/dax_pfn_mkwrite/dax_insert_pfn_mkwrite/, and we insert the PTE as well as
> make it writeable.

Fixed up, thanks.

> > if (write) {
> > - if (!IS_ERR(handle))
> > - ext4_journal_stop(handle);
> > + ext4_journal_stop(handle);
> > + /* Write fault but PFN mapped only RO? */
>
> The above comment is out of date.

Fixed.

> > + if (result & VM_FAULT_NEEDDSYNC) {
> > + int err;
> > + loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT;
> > + size_t len = 0;
> > +
> > + if (pe_size == PE_SIZE_PTE)
> > + len = PAGE_SIZE;
> > +#ifdef CONFIG_FS_DAX_PMD
> > + else if (pe_size == PE_SIZE_PMD)
> > + len = HPAGE_PMD_SIZE;
>
> In fs/dax.c we always use PMD_SIZE. It looks like HPAGE_PMD_SIZE and PMD_SIZE
> are always the same (from include/linux/huge_mm.h, the only defintion of
> HPAGE_PMD_SIZE):
>
> #define HPAGE_PMD_SHIFT PMD_SHIFT
> #define HPAGE_PMD_SIZE ((1UL) << HPAGE_PMD_SHIFT)
>
> and AFAICT PMD_SIZE is defined to be 1<<PMD_SHIFT for all architectures as
> well. I don't understand why we have both?
>
> In any case, neither HPAGE_PMD_SIZE nor PMD_SIZE are used anywhere else in the
> ext4 code, so can we use PMD_SIZE here for consistency? If they ever did
> manage to be different, I think we'd want PMD_SIZE anyway.

Yeah, I've changed that to PMD_SIZE.

> With those nits and an updated changelog:
>
> Reviewed-by: Ross Zwisler <ross.zwisler-VuQAYsv1563Yd54FQh9/[email protected]>

Thanks!

Honza

--
Jan Kara <jack-IBi9RG/[email protected]>
SUSE Labs, CR

2017-08-23 18:32:22

by Christoph Hellwig

[permalink] [raw]

2017-08-23 18:37:15

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 13/13] ext4: Support for synchronous DAX faults

> + pfn_t pfn;
>
> if (write) {
> sb_start_pagefault(sb);
> @@ -287,16 +288,39 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
> down_read(&EXT4_I(inode)->i_mmap_sem);
> handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
> EXT4_DATA_TRANS_BLOCKS(sb));
> + if (IS_ERR(handle)) {
> + up_read(&EXT4_I(inode)->i_mmap_sem);
> + sb_end_pagefault(sb);
> + return VM_FAULT_SIGBUS;
> + }
> } else {
> down_read(&EXT4_I(inode)->i_mmap_sem);
> }
> - if (!IS_ERR(handle))
> - result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops, NULL);
> - else
> - result = VM_FAULT_SIGBUS;
> + result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops, &pfn);

Maybe split the error handling refactor into a simple prep patch to
make this one more readable?

> + /* Write fault but PFN mapped only RO? */
> + if (result & VM_FAULT_NEEDDSYNC) {
> + int err;
> + loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT;
> + size_t len = 0;
> +
> + if (pe_size == PE_SIZE_PTE)
> + len = PAGE_SIZE;
> +#ifdef CONFIG_FS_DAX_PMD
> + else if (pe_size == PE_SIZE_PMD)
> + len = HPAGE_PMD_SIZE;
> +#endif
> + else
> + WARN_ON_ONCE(1);
> + err = vfs_fsync_range(vmf->vma->vm_file, start,
> + start + len - 1, 1);
> + if (err)
> + result = VM_FAULT_SIGBUS;
> + else
> + result = dax_insert_pfn_mkwrite(vmf, pe_size,
> + pfn);
> + }

I think this needs to become a helper exported from the DAX code,
way too much magic inside the file system as-is.

2017-08-24 07:18:19

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 13/13] ext4: Support for synchronous DAX faults

On Wed 23-08-17 11:37:14, Christoph Hellwig wrote:
> > + pfn_t pfn;
> >
> > if (write) {
> > sb_start_pagefault(sb);
> > @@ -287,16 +288,39 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
> > down_read(&EXT4_I(inode)->i_mmap_sem);
> > handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
> > EXT4_DATA_TRANS_BLOCKS(sb));
> > + if (IS_ERR(handle)) {
> > + up_read(&EXT4_I(inode)->i_mmap_sem);
> > + sb_end_pagefault(sb);
> > + return VM_FAULT_SIGBUS;
> > + }
> > } else {
> > down_read(&EXT4_I(inode)->i_mmap_sem);
> > }
> > - if (!IS_ERR(handle))
> > - result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops, NULL);
> > - else
> > - result = VM_FAULT_SIGBUS;
> > + result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops, &pfn);
>
> Maybe split the error handling refactor into a simple prep patch to
> make this one more readable?

OK, will do.

> > + /* Write fault but PFN mapped only RO? */
> > + if (result & VM_FAULT_NEEDDSYNC) {
> > + int err;
> > + loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT;
> > + size_t len = 0;
> > +
> > + if (pe_size == PE_SIZE_PTE)
> > + len = PAGE_SIZE;
> > +#ifdef CONFIG_FS_DAX_PMD
> > + else if (pe_size == PE_SIZE_PMD)
> > + len = HPAGE_PMD_SIZE;
> > +#endif
> > + else
> > + WARN_ON_ONCE(1);
> > + err = vfs_fsync_range(vmf->vma->vm_file, start,
> > + start + len - 1, 1);
> > + if (err)
> > + result = VM_FAULT_SIGBUS;
> > + else
> > + result = dax_insert_pfn_mkwrite(vmf, pe_size,
> > + pfn);
> > + }
>
> I think this needs to become a helper exported from the DAX code,
> way too much magic inside the file system as-is.

Good point, there isn't anything fs specific in there.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-08-24 12:31:27

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 13/13] ext4: Support for synchronous DAX faults

On Thu, Aug 17, 2017 at 06:08:15PM +0200, Jan Kara wrote:
> We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
> synchronous write fault when inode has some uncommitted metadata
> changes. In the fault handler ext4_dax_fault() we then detect this case,
> call vfs_fsync_range() to make sure all metadata is committed, and call
> dax_pfn_mkwrite() to mark PTE as writeable. Note that this will also
> dirty corresponding radix tree entry which is what we want - fsync(2)
> will still provide data integrity guarantees for applications not using
> userspace flushing. And applications using userspace flushing can avoid
> calling fsync(2) and thus avoid the performance overhead.

Why is this only wiered up for the huge_fault handler and not the
regular?

2017-08-24 12:34:52

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 13/13] ext4: Support for synchronous DAX faults

On Thu, Aug 24, 2017 at 05:31:26AM -0700, Christoph Hellwig wrote:
> On Thu, Aug 17, 2017 at 06:08:15PM +0200, Jan Kara wrote:
> > We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
> > synchronous write fault when inode has some uncommitted metadata
> > changes. In the fault handler ext4_dax_fault() we then detect this case,
> > call vfs_fsync_range() to make sure all metadata is committed, and call
> > dax_pfn_mkwrite() to mark PTE as writeable. Note that this will also
> > dirty corresponding radix tree entry which is what we want - fsync(2)
> > will still provide data integrity guarantees for applications not using
> > userspace flushing. And applications using userspace flushing can avoid
> > calling fsync(2) and thus avoid the performance overhead.
>
> Why is this only wiered up for the huge_fault handler and not the
> regular?

Ah, turns out ext4 implements ->fault in terms of ->huge_fault.

We'll really need to sort out this mess of fault handlers before
doing too much surgery here..

2017-08-24 12:36:15

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 13/13] ext4: Support for synchronous DAX faults

On Thu 24-08-17 05:31:26, Christoph Hellwig wrote:
> On Thu, Aug 17, 2017 at 06:08:15PM +0200, Jan Kara wrote:
> > We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
> > synchronous write fault when inode has some uncommitted metadata
> > changes. In the fault handler ext4_dax_fault() we then detect this case,
> > call vfs_fsync_range() to make sure all metadata is committed, and call
> > dax_pfn_mkwrite() to mark PTE as writeable. Note that this will also
> > dirty corresponding radix tree entry which is what we want - fsync(2)
> > will still provide data integrity guarantees for applications not using
> > userspace flushing. And applications using userspace flushing can avoid
> > calling fsync(2) and thus avoid the performance overhead.
>
> Why is this only wiered up for the huge_fault handler and not the
> regular?

We do handle both. Just ext4 naming is a bit confusing and ext4_dax_fault()
uses ext4_dax_huge_fault() for handling.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR