2017-07-27 13:12:38

by Jan Kara

[permalink] [raw]
Subject: [RFC PATCH 0/7] dax, ext4: Synchronous page faults

Hello,

after last discussions about whether / how to make flushing of DAX mappings
possible from userspace so that they can be flushed on finer than page
granularity and also avoid the overhead of a syscall, I've decided to give
a stab at implementing "synchronous page faults" idea for ext4 so that
we can see whether that is reasonably possible to implement and how would
such implementation look like. This patch set is the result.

So the functionality this patches implement: We have an inode flag (currently
I abuse S_SYNC inode flag for this and IMHO it kind of makes sense but if
people hate that I'm certainly open to using new flag in the final
implementation) that marks inode as requiring synchronous page faults.
The guarantee provided by this flag on inode is: While a block is writeably
mapped into page tables, it is guaranteed to be visible in the file at that
offset also after a crash.

How I implement this is that ->iomap_begin() indicates by a flag that inode
block mapping metadata is unstable and may need flushing (use the same test as
whether fdatasync() has metadata to write). If yes, DAX maps page table entries
read-only and returns special flag VM_FAULT_RO to the filesystem fault handler.
The handler then calls fdatasync() (vfs_fsync_range()) for the affected range
and after that calls DAX code to write-enable the page table entries.

>From my (fairly limited) knowledge of XFS it seems XFS should be able to do the
same and it should be even possible for filesystem to implement safe remapping
of a file offset to a different block (i.e. break reflink, do defrag, or
similar stuff) like:

1) Block page faults
2) fdatasync() remapped range (there can be outstanding data modifications
not yet flushed)
3) unmap_mapping_range()
4) Now remap blocks
5) Unblock page faults

Basically we do the same on events like punch hole so there is not much new
there.

There are couple of open questions with this implementation:

1) Is it worth the hassle?
2) Is S_SYNC good flag to use or should we use a new inode flag?
3) VM_FAULT_RO and especially passing of resulting 'pfn' from
dax_iomap_fault() through filesystem fault handler to dax_pfn_mkwrite() in
vmf->orig_pte is a bit of a hack. So far I'm not sure how to refactor
things to make this cleaner.

Anyway, here are the patches, comments are welcome.

Honza


2017-07-27 13:12:49

by Jan Kara

[permalink] [raw]
Subject: [PATCH 2/7] dax: Add sync argument to dax_iomap_fault()

Add 'sync' argument to dax_iomap_fault(). It will be used to communicate
the fact that synchronous fault is requested.

Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 12 ++++++------
fs/ext2/file.c | 2 +-
fs/ext4/file.c | 2 +-
fs/xfs/xfs_file.c | 8 ++++----
include/linux/dax.h | 2 +-
5 files changed, 13 insertions(+), 13 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 75760f7bdf5d..44d9cce4399b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1069,7 +1069,7 @@ static int dax_fault_return(int error)
return VM_FAULT_SIGBUS;
}

-static int dax_iomap_pte_fault(struct vm_fault *vmf,
+static int dax_iomap_pte_fault(struct vm_fault *vmf, bool sync,
const struct iomap_ops *ops)
{
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
@@ -1300,7 +1300,7 @@ static int dax_pmd_load_hole(struct vm_fault *vmf, struct iomap *iomap,
return VM_FAULT_FALLBACK;
}

-static int dax_iomap_pmd_fault(struct vm_fault *vmf,
+static int dax_iomap_pmd_fault(struct vm_fault *vmf, bool sync,
const struct iomap_ops *ops)
{
struct vm_area_struct *vma = vmf->vma;
@@ -1422,7 +1422,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
return result;
}
#else
-static int dax_iomap_pmd_fault(struct vm_fault *vmf,
+static int dax_iomap_pmd_fault(struct vm_fault *vmf, bool sync,
const struct iomap_ops *ops)
{
return VM_FAULT_FALLBACK;
@@ -1440,13 +1440,13 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
* successfully.
*/
int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
- const struct iomap_ops *ops)
+ bool sync, const struct iomap_ops *ops)
{
switch (pe_size) {
case PE_SIZE_PTE:
- return dax_iomap_pte_fault(vmf, ops);
+ return dax_iomap_pte_fault(vmf, sync, ops);
case PE_SIZE_PMD:
- return dax_iomap_pmd_fault(vmf, ops);
+ return dax_iomap_pmd_fault(vmf, sync, ops);
default:
return VM_FAULT_FALLBACK;
}
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index ff3a3636a5ca..430e6ecf4e0e 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -99,7 +99,7 @@ static int ext2_dax_fault(struct vm_fault *vmf)
}
down_read(&ei->dax_sem);

- ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &ext2_iomap_ops);
+ ret = dax_iomap_fault(vmf, PE_SIZE_PTE, false, &ext2_iomap_ops);

up_read(&ei->dax_sem);
if (vmf->flags & FAULT_FLAG_WRITE)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 9dda70edba74..d401403e5095 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -291,7 +291,7 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
down_read(&EXT4_I(inode)->i_mmap_sem);
}
if (!IS_ERR(handle))
- result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops);
+ result = dax_iomap_fault(vmf, pe_size, false, &ext4_iomap_ops);
else
result = VM_FAULT_SIGBUS;
if (write) {
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 62db8ffa83b9..9fc8b5668578 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1032,7 +1032,7 @@ xfs_filemap_page_mkwrite(
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);

if (IS_DAX(inode)) {
- ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &xfs_iomap_ops);
+ ret = dax_iomap_fault(vmf, PE_SIZE_PTE, false, &xfs_iomap_ops);
} else {
ret = iomap_page_mkwrite(vmf, &xfs_iomap_ops);
ret = block_page_mkwrite_return(ret);
@@ -1059,7 +1059,7 @@ xfs_filemap_fault(

xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
if (IS_DAX(inode))
- ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &xfs_iomap_ops);
+ ret = dax_iomap_fault(vmf, PE_SIZE_PTE, false, &xfs_iomap_ops);
else
ret = filemap_fault(vmf);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
@@ -1094,7 +1094,7 @@ xfs_filemap_huge_fault(
}

xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
- ret = dax_iomap_fault(vmf, pe_size, &xfs_iomap_ops);
+ ret = dax_iomap_fault(vmf, pe_size, false, &xfs_iomap_ops);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);

if (vmf->flags & FAULT_FLAG_WRITE)
@@ -1130,7 +1130,7 @@ xfs_filemap_pfn_mkwrite(
if (vmf->pgoff >= size)
ret = VM_FAULT_SIGBUS;
else if (IS_DAX(inode))
- ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &xfs_iomap_ops);
+ ret = dax_iomap_fault(vmf, PE_SIZE_PTE, false, &xfs_iomap_ops);
xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
sb_end_pagefault(inode->i_sb);
return ret;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index d0e32729ad1e..98950f4d127e 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -91,7 +91,7 @@ void dax_write_cache(struct dax_device *dax_dev, bool wc);
ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops);
int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
- const struct iomap_ops *ops);
+ bool sync, const struct iomap_ops *ops);
int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
pgoff_t index);
--
2.12.3

2017-07-27 13:12:51

by Jan Kara

[permalink] [raw]
Subject: [PATCH 7/7] ext4: Support for synchronous DAX faults

We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
synchronous write fault when inode has some uncommitted metadata
changes. In the fault handler ext4_dax_fault() we then detect this case,
call vfs_fsync_range() to make sure all metadata is committed, and call
dax_pfn_mkwrite() to mark PTE as writeable. Note that this will also
dirty corresponding radix tree entry which is what we want - fsync(2)
will still provide data integrity guarantees for applications not using
userspace flushing. And applications using userspace flushing can avoid
calling fsync(2) and thus avoid the performance overhead.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/file.c | 35 +++++++++++++++++++++++++++++------
fs/ext4/inode.c | 4 ++++
fs/jbd2/journal.c | 16 ++++++++++++++++
include/linux/jbd2.h | 1 +
4 files changed, 50 insertions(+), 6 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index d401403e5095..b221d0b546b0 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -287,16 +287,39 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
down_read(&EXT4_I(inode)->i_mmap_sem);
handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
EXT4_DATA_TRANS_BLOCKS(sb));
+ if (IS_ERR(handle)) {
+ up_read(&EXT4_I(inode)->i_mmap_sem);
+ sb_end_pagefault(sb);
+ return VM_FAULT_SIGBUS;
+ }
} else {
down_read(&EXT4_I(inode)->i_mmap_sem);
}
- if (!IS_ERR(handle))
- result = dax_iomap_fault(vmf, pe_size, false, &ext4_iomap_ops);
- else
- result = VM_FAULT_SIGBUS;
+ result = dax_iomap_fault(vmf, pe_size, IS_SYNC(inode), &ext4_iomap_ops);
if (write) {
- if (!IS_ERR(handle))
- ext4_journal_stop(handle);
+ ext4_journal_stop(handle);
+ /* Write fault but PFN mapped only RO? */
+ if (result & VM_FAULT_RO) {
+ int err;
+ loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT;
+ size_t len = 0;
+
+ if (pe_size == PE_SIZE_PTE)
+ len = PAGE_SIZE;
+#ifdef CONFIG_FS_DAX_PMD
+ else if (pe_size == PE_SIZE_PMD)
+ len = HPAGE_PMD_SIZE;
+ else
+ WARN_ON_ONCE(1);
+#endif
+ WARN_ON_ONCE(!IS_SYNC(inode));
+ err = vfs_fsync_range(vmf->vma->vm_file, start,
+ start + len - 1, 1);
+ if (err)
+ result = VM_FAULT_SIGBUS;
+ else
+ result = dax_pfn_mkwrite(vmf, pe_size);
+ }
up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(sb);
} else {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3c600f02673f..e68231bb227c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3429,6 +3429,10 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
}

iomap->flags = 0;
+ if ((flags & IOMAP_FAULT) && (flags & IOMAP_WRITE) && IS_SYNC(inode) &&
+ !jbd2_transaction_committed(EXT4_SB(inode->i_sb)->s_journal,
+ EXT4_I(inode)->i_datasync_tid))
+ iomap->flags |= IOMAP_F_NEEDDSYNC;
bdev = inode->i_sb->s_bdev;
iomap->bdev = bdev;
if (blk_queue_dax(bdev->bd_queue))
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 7d5ef3bf3f3e..e0f436c42d67 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -738,6 +738,22 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
return err;
}

+/* Return 1 when transaction with given tid has already committed. */
+int jbd2_transaction_committed(journal_t *journal, tid_t tid)
+{
+ int ret = 1;
+
+ read_lock(&journal->j_state_lock);
+ if (journal->j_running_transaction &&
+ journal->j_running_transaction->t_tid == tid)
+ ret = 0;
+ if (journal->j_committing_transaction &&
+ journal->j_committing_transaction->t_tid == tid)
+ ret = 0;
+ read_unlock(&journal->j_state_lock);
+ return ret;
+}
+
/*
* When this function returns the transaction corresponding to tid
* will be completed. If the transaction has currently running, start
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 606b6bce3a5b..296d1e0ea87b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1367,6 +1367,7 @@ int jbd2_log_start_commit(journal_t *journal, tid_t tid);
int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
+int jbd2_transaction_committed(journal_t *journal, tid_t tid);
int jbd2_complete_transaction(journal_t *journal, tid_t tid);
int jbd2_log_do_checkpoint(journal_t *journal);
int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
--
2.12.3

2017-07-27 14:09:09

by Jeff Moyer

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults

Jan Kara <[email protected]> writes:

Hi, Jan,

Thanks for looking into this!

> There are couple of open questions with this implementation:
>
> 1) Is it worth the hassle?
> 2) Is S_SYNC good flag to use or should we use a new inode flag?
> 3) VM_FAULT_RO and especially passing of resulting 'pfn' from
> dax_iomap_fault() through filesystem fault handler to dax_pfn_mkwrite() in
> vmf->orig_pte is a bit of a hack. So far I'm not sure how to refactor
> things to make this cleaner.

4) How does an application discover that it is safe to flush from
userspace?

-Jeff

2017-07-27 21:57:13

by Ross Zwisler

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults

On Thu, Jul 27, 2017 at 10:09:07AM -0400, Jeff Moyer wrote:
> Jan Kara <[email protected]> writes:
>
> Hi, Jan,
>
> Thanks for looking into this!
>
> > There are couple of open questions with this implementation:
> >
> > 1) Is it worth the hassle?
> > 2) Is S_SYNC good flag to use or should we use a new inode flag?
> > 3) VM_FAULT_RO and especially passing of resulting 'pfn' from
> > dax_iomap_fault() through filesystem fault handler to dax_pfn_mkwrite() in
> > vmf->orig_pte is a bit of a hack. So far I'm not sure how to refactor
> > things to make this cleaner.
>
> 4) How does an application discover that it is safe to flush from
> userspace?

I think that we would be best off with a new flag available via
lsattr(1)/chattr(1). This would have the following advantages:

1) We could only set the flag if the inode supported DAX (either via the mount
option or via the individual DAX flag). This would give NVML et al. one
central way to detect whether it was safe to flush from userspace because the
FS supported synchronous faults.

2) Defining a new flag prevents any confusion about whether the kernel version
you have supports sync faults. Otherwise NVML would have to do something like
look at the trio of (kernel version, S_SYNC flag, mount/inode option for DAX)
which is complex and of course breaks for OS kernel versions.

3) Defining the flag in a generic way via lsattr/chattr opens the door for the
same API and flag to be used by other filesystems in the future.

2017-07-27 22:06:37

by Ross Zwisler

[permalink] [raw]
Subject: Re: [PATCH 2/7] dax: Add sync argument to dax_iomap_fault()

On Thu, Jul 27, 2017 at 03:12:40PM +0200, Jan Kara wrote:
> Add 'sync' argument to dax_iomap_fault(). It will be used to communicate
> the fact that synchronous fault is requested.

I don't actually think you need to pass this 'sync' parameter around. I think
you can completely rely on IOMAP_F_NEEDSYNC being set in iomap.flags. The DAX
fault handlers can call ops->iomap_begin() and use that flag for all the
tests and make it our once source of truth.

That flag also tells us that we are doing a write fault (from
ext4_iomap_begin()):

if ((flags & IOMAP_FAULT) && (flags & IOMAP_WRITE) && IS_SYNC(inode) &&
!jbd2_transaction_committed(EXT4_SB(inode->i_sb)->s_journal,
EXT4_I(inode)->i_datasync_tid))
iomap->flags |= IOMAP_F_NEEDDSYNC;

So conditionals like this from dax_iomap_pte_fault():

force_ro = (vmf->flags & FAULT_FLAG_WRITE) && sync &&
(iomap.flags & IOMAP_F_NEEDDSYNC);

can be simplified to:

force_ro = (iomap.flags & IOMAP_F_NEEDDSYNC);

2017-07-27 22:57:39

by Ross Zwisler

[permalink] [raw]
Subject: Re: [PATCH 7/7] ext4: Support for synchronous DAX faults

On Thu, Jul 27, 2017 at 03:12:45PM +0200, Jan Kara wrote:
> We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
> synchronous write fault when inode has some uncommitted metadata
> changes. In the fault handler ext4_dax_fault() we then detect this case,
> call vfs_fsync_range() to make sure all metadata is committed, and call
> dax_pfn_mkwrite() to mark PTE as writeable. Note that this will also
> dirty corresponding radix tree entry which is what we want - fsync(2)
> will still provide data integrity guarantees for applications not using
> userspace flushing. And applications using userspace flushing can avoid
> calling fsync(2) and thus avoid the performance overhead.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/ext4/file.c | 35 +++++++++++++++++++++++++++++------
> fs/ext4/inode.c | 4 ++++
> fs/jbd2/journal.c | 16 ++++++++++++++++
> include/linux/jbd2.h | 1 +
> 4 files changed, 50 insertions(+), 6 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index d401403e5095..b221d0b546b0 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -287,16 +287,39 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
> down_read(&EXT4_I(inode)->i_mmap_sem);
> handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
> EXT4_DATA_TRANS_BLOCKS(sb));
> + if (IS_ERR(handle)) {
> + up_read(&EXT4_I(inode)->i_mmap_sem);
> + sb_end_pagefault(sb);
> + return VM_FAULT_SIGBUS;
> + }

Yay, this error handling seems cleaner to me anyway.

> } else {
> down_read(&EXT4_I(inode)->i_mmap_sem);
> }
> - if (!IS_ERR(handle))
> - result = dax_iomap_fault(vmf, pe_size, false, &ext4_iomap_ops);
> - else
> - result = VM_FAULT_SIGBUS;
> + result = dax_iomap_fault(vmf, pe_size, IS_SYNC(inode), &ext4_iomap_ops);
> if (write) {
> - if (!IS_ERR(handle))
> - ext4_journal_stop(handle);
> + ext4_journal_stop(handle);
> + /* Write fault but PFN mapped only RO? */
> + if (result & VM_FAULT_RO) {
> + int err;
> + loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT;
> + size_t len = 0;
> +
> + if (pe_size == PE_SIZE_PTE)
> + len = PAGE_SIZE;
> +#ifdef CONFIG_FS_DAX_PMD
> + else if (pe_size == PE_SIZE_PMD)
> + len = HPAGE_PMD_SIZE;
> + else
> + WARN_ON_ONCE(1);

I think this "else WARN_ON_ONCE(1);" should live outside of the
CONFIG_FS_DAX_PMD so that we get warned in all configs if we get an
unsupported pe_size.

> +#endif
> + WARN_ON_ONCE(!IS_SYNC(inode));
> + err = vfs_fsync_range(vmf->vma->vm_file, start,
> + start + len - 1, 1);
> + if (err)
> + result = VM_FAULT_SIGBUS;
> + else
> + result = dax_pfn_mkwrite(vmf, pe_size);
> + }
> up_read(&EXT4_I(inode)->i_mmap_sem);
> sb_end_pagefault(sb);
> } else {
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 3c600f02673f..e68231bb227c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3429,6 +3429,10 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> }
>
> iomap->flags = 0;
> + if ((flags & IOMAP_FAULT) && (flags & IOMAP_WRITE) && IS_SYNC(inode) &&
> + !jbd2_transaction_committed(EXT4_SB(inode->i_sb)->s_journal,
> + EXT4_I(inode)->i_datasync_tid))
> + iomap->flags |= IOMAP_F_NEEDDSYNC;

Do we need to check for (flags & IOMAP_FAULT), or can we rely on the fact that
we are in ext4_iomap_begin()?

2017-07-28 02:05:47

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults

On Thu, Jul 27, 2017 at 2:57 PM, Ross Zwisler
<[email protected]> wrote:
> On Thu, Jul 27, 2017 at 10:09:07AM -0400, Jeff Moyer wrote:
>> Jan Kara <[email protected]> writes:
>>
>> Hi, Jan,
>>
>> Thanks for looking into this!
>>
>> > There are couple of open questions with this implementation:
>> >
>> > 1) Is it worth the hassle?
>> > 2) Is S_SYNC good flag to use or should we use a new inode flag?
>> > 3) VM_FAULT_RO and especially passing of resulting 'pfn' from
>> > dax_iomap_fault() through filesystem fault handler to dax_pfn_mkwrite() in
>> > vmf->orig_pte is a bit of a hack. So far I'm not sure how to refactor
>> > things to make this cleaner.
>>
>> 4) How does an application discover that it is safe to flush from
>> userspace?
>
> I think that we would be best off with a new flag available via
> lsattr(1)/chattr(1). This would have the following advantages:
>
> 1) We could only set the flag if the inode supported DAX (either via the mount
> option or via the individual DAX flag). This would give NVML et al. one
> central way to detect whether it was safe to flush from userspace because the
> FS supported synchronous faults.
>
> 2) Defining a new flag prevents any confusion about whether the kernel version
> you have supports sync faults. Otherwise NVML would have to do something like
> look at the trio of (kernel version, S_SYNC flag, mount/inode option for DAX)
> which is complex and of course breaks for OS kernel versions.
>
> 3) Defining the flag in a generic way via lsattr/chattr opens the door for the
> same API and flag to be used by other filesystems in the future.

I would advocate using a new fcntl() instead of lsattr for the
following reason: ISTM the fact that it's an *inode* flag in this
patchset is a bit of an implementation detail. I can easily imagine a
future implementation that makes it per-struct-file instead. A
fcntl() that asks "can I flush from userspace" would still work under
than scenario.

2017-07-28 09:38:21

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults

On Thu 27-07-17 19:05:24, Andy Lutomirski wrote:
> On Thu, Jul 27, 2017 at 2:57 PM, Ross Zwisler
> <ross.zwisler-VuQAYsv1563Yd54FQh9/[email protected]> wrote:
> > On Thu, Jul 27, 2017 at 10:09:07AM -0400, Jeff Moyer wrote:
> >> Jan Kara <[email protected]> writes:
> >>
> >> Hi, Jan,
> >>
> >> Thanks for looking into this!
> >>
> >> > There are couple of open questions with this implementation:
> >> >
> >> > 1) Is it worth the hassle?
> >> > 2) Is S_SYNC good flag to use or should we use a new inode flag?
> >> > 3) VM_FAULT_RO and especially passing of resulting 'pfn' from
> >> > dax_iomap_fault() through filesystem fault handler to dax_pfn_mkwrite() in
> >> > vmf->orig_pte is a bit of a hack. So far I'm not sure how to refactor
> >> > things to make this cleaner.
> >>
> >> 4) How does an application discover that it is safe to flush from
> >> userspace?
> >
> > I think that we would be best off with a new flag available via
> > lsattr(1)/chattr(1). This would have the following advantages:
> >
> > 1) We could only set the flag if the inode supported DAX (either via the mount
> > option or via the individual DAX flag). This would give NVML et al. one
> > central way to detect whether it was safe to flush from userspace because the
> > FS supported synchronous faults.
> >
> > 2) Defining a new flag prevents any confusion about whether the kernel version
> > you have supports sync faults. Otherwise NVML would have to do something like
> > look at the trio of (kernel version, S_SYNC flag, mount/inode option for DAX)
> > which is complex and of course breaks for OS kernel versions.
> >
> > 3) Defining the flag in a generic way via lsattr/chattr opens the door for the
> > same API and flag to be used by other filesystems in the future.
>
> I would advocate using a new fcntl() instead of lsattr for the
> following reason: ISTM the fact that it's an *inode* flag in this
> patchset is a bit of an implementation detail. I can easily imagine a
> future implementation that makes it per-struct-file instead. A
> fcntl() that asks "can I flush from userspace" would still work under
> than scenario.

Well, you are right I can make the implementation work with struct file
flag as well - let's call it O_DAXDSYNC. However there are filesystem
operations where you may need to answer question: Is there any fd with
O_DAXDSYNC open against this inode (for operations that change file offset
-> block mapping)? And in that case inode flag is straightforward while
file flag is a bit awkward (you need to implement counter of fd's with that
flag in the inode).

Honza
--
Jan Kara <jack-IBi9RG/[email protected]>
SUSE Labs, CR

2017-07-28 09:40:03

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/7] dax: Add sync argument to dax_iomap_fault()

On Thu 27-07-17 16:06:37, Ross Zwisler wrote:
> On Thu, Jul 27, 2017 at 03:12:40PM +0200, Jan Kara wrote:
> > Add 'sync' argument to dax_iomap_fault(). It will be used to communicate
> > the fact that synchronous fault is requested.
>
> I don't actually think you need to pass this 'sync' parameter around. I think
> you can completely rely on IOMAP_F_NEEDSYNC being set in iomap.flags. The DAX
> fault handlers can call ops->iomap_begin() and use that flag for all the
> tests and make it our once source of truth.
>
> That flag also tells us that we are doing a write fault (from
> ext4_iomap_begin()):
>
> if ((flags & IOMAP_FAULT) && (flags & IOMAP_WRITE) && IS_SYNC(inode) &&
> !jbd2_transaction_committed(EXT4_SB(inode->i_sb)->s_journal,
> EXT4_I(inode)->i_datasync_tid))
> iomap->flags |= IOMAP_F_NEEDDSYNC;
>
> So conditionals like this from dax_iomap_pte_fault():
>
> force_ro = (vmf->flags & FAULT_FLAG_WRITE) && sync &&
> (iomap.flags & IOMAP_F_NEEDDSYNC);
>
> can be simplified to:
>
> force_ro = (iomap.flags & IOMAP_F_NEEDDSYNC);

Yeah, probably you're right. I'll look into changing this.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-08-01 10:52:19

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults

On Thu, Jul 27, 2017 at 03:12:38PM +0200, Jan Kara wrote:
> So the functionality this patches implement: We have an inode flag (currently
> I abuse S_SYNC inode flag for this and IMHO it kind of makes sense but if
> people hate that I'm certainly open to using new flag in the final
> implementation) that marks inode as requiring synchronous page faults.
> The guarantee provided by this flag on inode is: While a block is writeably
> mapped into page tables, it is guaranteed to be visible in the file at that
> offset also after a crash.

I think the right interface for page fault behavior is a mmap
flag, MAP_SYNC or similar, which will be optional and a failure of
a MAP_SYNC mmap will indicated that this behavior can't be provided
for the given file descriptor.

> >From my (fairly limited) knowledge of XFS it seems XFS should be able to do the
> same and it should be even possible for filesystem to implement safe remapping
> of a file offset to a different block (i.e. break reflink, do defrag, or
> similar stuff) like:

It should. But what I'm worried about for both ext4 and XFS is the
worst case behavior that the page faul path can now hit, e.g. flushing
a potentially full log. Do you have any numbers of how long your
ext4 page faults take with this in the worst case?

> There are couple of open questions with this implementation:
>
> 1) Is it worth the hassle?

For that I'd really like to see performance numbers. And compared to
the immutable nightmare that Dan proposed this looks orders of magnitude
better.

> 2) Is S_SYNC good flag to use or should we use a new inode flag?

I think the right interface is mmap as said above. But even if not
we should not simply reuse existing flags with a well defined (although
not particular useful) behavior.

> 3) VM_FAULT_RO and especially passing of resulting 'pfn' from
> dax_iomap_fault() through filesystem fault handler to dax_pfn_mkwrite() in
> vmf->orig_pte is a bit of a hack. So far I'm not sure how to refactor
> things to make this cleaner.

I'll take a look.

2017-08-01 11:02:41

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults

On Fri, Jul 28, 2017 at 11:38:21AM +0200, Jan Kara wrote:
> Well, you are right I can make the implementation work with struct file
> flag as well - let's call it O_DAXDSYNC. However there are filesystem
> operations where you may need to answer question: Is there any fd with
> O_DAXDSYNC open against this inode (for operations that change file offset
> -> block mapping)? And in that case inode flag is straightforward while
> file flag is a bit awkward (you need to implement counter of fd's with that
> flag in the inode).

We can still keep and inode flag as the internal implementation
detail. As mentioned earlier the right flag to control behavior
of a mapping is an mmap flag. And the initial naive implementation
would simply mark the inode as sync once the first MAP_SYNC open happens
on it. We could then move to more precise tracking if/when needed.

2017-08-01 11:26:03

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults

On Tue 01-08-17 04:02:41, Christoph Hellwig wrote:
> On Fri, Jul 28, 2017 at 11:38:21AM +0200, Jan Kara wrote:
> > Well, you are right I can make the implementation work with struct file
> > flag as well - let's call it O_DAXDSYNC. However there are filesystem
> > operations where you may need to answer question: Is there any fd with
> > O_DAXDSYNC open against this inode (for operations that change file offset
> > -> block mapping)? And in that case inode flag is straightforward while
> > file flag is a bit awkward (you need to implement counter of fd's with that
> > flag in the inode).
>
> We can still keep and inode flag as the internal implementation
> detail. As mentioned earlier the right flag to control behavior
> of a mapping is an mmap flag. And the initial naive implementation
> would simply mark the inode as sync once the first MAP_SYNC open happens
> on it. We could then move to more precise tracking if/when needed.

OK, makes sense and I like the MAP_SYNC proposal. I'll change it in my
implementation.

Honza
--
Jan Kara <jack-IBi9RG/[email protected]>
SUSE Labs, CR

2017-08-08 00:24:09

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults

On Tue, Aug 1, 2017 at 4:26 AM, Jan Kara <[email protected]> wrote:
> On Tue 01-08-17 04:02:41, Christoph Hellwig wrote:
>> On Fri, Jul 28, 2017 at 11:38:21AM +0200, Jan Kara wrote:
>> > Well, you are right I can make the implementation work with struct file
>> > flag as well - let's call it O_DAXDSYNC. However there are filesystem
>> > operations where you may need to answer question: Is there any fd with
>> > O_DAXDSYNC open against this inode (for operations that change file offset
>> > -> block mapping)? And in that case inode flag is straightforward while
>> > file flag is a bit awkward (you need to implement counter of fd's with that
>> > flag in the inode).
>>
>> We can still keep and inode flag as the internal implementation
>> detail. As mentioned earlier the right flag to control behavior
>> of a mapping is an mmap flag. And the initial naive implementation
>> would simply mark the inode as sync once the first MAP_SYNC open happens
>> on it. We could then move to more precise tracking if/when needed.
>
> OK, makes sense and I like the MAP_SYNC proposal. I'll change it in my
> implementation.

Does sys_mmap() reject unknown flag values today? I'm either not
looking in the right place or it's missing and we'll need some
interface/mechanism to check if MAP_SYNC is honored.

2017-08-11 10:03:27

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults

On Mon, Aug 07, 2017 at 05:24:08PM -0700, Dan Williams wrote:
> Does sys_mmap() reject unknown flag values today? I'm either not
> looking in the right place or it's missing and we'll need some
> interface/mechanism to check if MAP_SYNC is honored.

It doesn't seem to reject unknown flags.

The lack of flags checking strikes back once more, sighṫ.
_______________________________________________
Linux-nvdimm mailing list
[email protected]
https://lists.01.org/mailman/listinfo/linux-nvdimm

2017-08-13 02:44:14

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults

On Fri, Aug 11, 2017 at 3:03 AM, Christoph Hellwig <[email protected]> wrote:
> On Mon, Aug 07, 2017 at 05:24:08PM -0700, Dan Williams wrote:
>> Does sys_mmap() reject unknown flag values today? I'm either not
>> looking in the right place or it's missing and we'll need some
>> interface/mechanism to check if MAP_SYNC is honored.
>
> It doesn't seem to reject unknown flags.
>
> The lack of flags checking strikes back once more, sighṫ.

How about MAP_SYNC == (MAP_SHARED|MAP_PRIVATE)? On older kernels that
should get -EINVAL, and on new kernels it means SYNC+SHARED.
_______________________________________________
Linux-nvdimm mailing list
[email protected]
https://lists.01.org/mailman/listinfo/linux-nvdimm

2017-08-13 09:25:56

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults

On Sat, Aug 12, 2017 at 07:44:14PM -0700, Dan Williams wrote:
> How about MAP_SYNC == (MAP_SHARED|MAP_PRIVATE)? On older kernels that
> should get -EINVAL, and on new kernels it means SYNC+SHARED.

Cute trick, but I'd hate to waster it just for our little flag.

How about:

#define __MAP_VALIDATE MAP_SHARED|MAP_PRIVATE
#define MAP_SYNC 0x??? | __MAP_VALIDATE

so that we can reuse that trick for any new flag?

2017-08-13 17:08:44

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults

On Sun, Aug 13, 2017 at 2:25 AM, Christoph Hellwig <[email protected]> wrote:
> On Sat, Aug 12, 2017 at 07:44:14PM -0700, Dan Williams wrote:
>> How about MAP_SYNC == (MAP_SHARED|MAP_PRIVATE)? On older kernels that
>> should get -EINVAL, and on new kernels it means SYNC+SHARED.
>
> Cute trick, but I'd hate to waster it just for our little flag.
>
> How about:
>
> #define __MAP_VALIDATE MAP_SHARED|MAP_PRIVATE
> #define MAP_SYNC 0x??? | __MAP_VALIDATE
>
> so that we can reuse that trick for any new flag?

Yes, even better. I think that effectively kicks the need for mmap3()
down the road a bit.

2017-08-14 08:30:12

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults

On Sun 13-08-17 02:25:56, Christoph Hellwig wrote:
> On Sat, Aug 12, 2017 at 07:44:14PM -0700, Dan Williams wrote:
> > How about MAP_SYNC == (MAP_SHARED|MAP_PRIVATE)? On older kernels that
> > should get -EINVAL, and on new kernels it means SYNC+SHARED.
>
> Cute trick, but I'd hate to waster it just for our little flag.
>
> How about:
>
> #define __MAP_VALIDATE MAP_SHARED|MAP_PRIVATE
> #define MAP_SYNC 0x??? | __MAP_VALIDATE
>
> so that we can reuse that trick for any new flag?

Heh, that's such an ugly hack that it is cute ;)... I'll use this
definition of MAP_SYNC in my patches and send new version of the series
(probably today in the afternoon if the testing goes fine).

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-08-14 14:04:36

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults


Thank you Jan, I'm patiently waiting for this MAP_SYNC flag since I asked for
it in 2014. I'm so glad its time is finally do.

Thank you for working on this. Please CC me on future patches.
(note the new Netapp email)

On 13/08/17 12:25, Christoph Hellwig wrote:
> On Sat, Aug 12, 2017 at 07:44:14PM -0700, Dan Williams wrote:
>> How about MAP_SYNC == (MAP_SHARED|MAP_PRIVATE)? On older kernels that
>> should get -EINVAL, and on new kernels it means SYNC+SHARED.
>
> Cute trick, but I'd hate to waster it just for our little flag.
>
> How about:
>
> #define __MAP_VALIDATE MAP_SHARED|MAP_PRIVATE
> #define MAP_SYNC 0x??? | __MAP_VALIDATE
>
> so that we can reuse that trick for any new flag?
>

YES! And please create a mask for all new flags and in validation
code if ((m_flags & __MAP_VALIDATE) == __MAP_VALIDATE) then you
want that (m_flags & __MAP_NEWFLAGS) does not come empty, this
way you actually preserve the old check that SHARED and PRIVATE
do not co exist.

Few Comments on this new MAP_ flag

0] The name at least needs to be MAP_MSYNC because only meta-data is
synced not the data pointed to. That is the responsibility of the app

1] This flag you have named MAP_SYNC but it is very much related to
dax and the ability for user-mode to "flush" the data pointed by this
now "synced" meta data.
For example in ext4, this flag set on an inode that is *not* IS_DAX
should fail the mmap. Because there is no point of synced meta if the
data is actually in page-cache and we know for sure it was not yet synced,
And there is no way for user-mode to directly "sync" the data as well.

2] The code should be constructed that the default check for the MAP_SYNC
should fail, and only Hopped in FSs are allowed.
(So not to modify all Implementations of file_operations->mmap() )

3] /dev/pmem could start serving DAX pages in mmap, if asked for MAP_MSYNC
(which is also an API that says "I know I need to cl_flush". See 1. )

4] Once we have this flag. And properly implemented at least in one FS
and optionally in /dev/pmemX we no longer have any justification for
/dev/daxX and it can die a slow and happy death.

Thanks, Cheers
Boaz

2017-08-14 16:03:59

by Dan Williams

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults

On Mon, Aug 14, 2017 at 7:04 AM, Boaz Harrosh <[email protected]> wrote:
>
> Thank you Jan, I'm patiently waiting for this MAP_SYNC flag since I asked for
> it in 2014. I'm so glad its time is finally do.
>
> Thank you for working on this. Please CC me on future patches.
> (note the new Netapp email)
>
> On 13/08/17 12:25, Christoph Hellwig wrote:
>> On Sat, Aug 12, 2017 at 07:44:14PM -0700, Dan Williams wrote:
>>> How about MAP_SYNC == (MAP_SHARED|MAP_PRIVATE)? On older kernels that
>>> should get -EINVAL, and on new kernels it means SYNC+SHARED.
>>
>> Cute trick, but I'd hate to waster it just for our little flag.
>>
>> How about:
>>
>> #define __MAP_VALIDATE MAP_SHARED|MAP_PRIVATE
>> #define MAP_SYNC 0x??? | __MAP_VALIDATE
>>
>> so that we can reuse that trick for any new flag?
>>
>
> YES! And please create a mask for all new flags and in validation
> code if ((m_flags & __MAP_VALIDATE) == __MAP_VALIDATE) then you
> want that (m_flags & __MAP_NEWFLAGS) does not come empty, this
> way you actually preserve the old check that SHARED and PRIVATE
> do not co exist.
>
> Few Comments on this new MAP_ flag
>
> 0] The name at least needs to be MAP_MSYNC because only meta-data is
> synced not the data pointed to. That is the responsibility of the app
>
> 1] This flag you have named MAP_SYNC but it is very much related to
> dax and the ability for user-mode to "flush" the data pointed by this
> now "synced" meta data.
> For example in ext4, this flag set on an inode that is *not* IS_DAX
> should fail the mmap. Because there is no point of synced meta if the
> data is actually in page-cache and we know for sure it was not yet synced,
> And there is no way for user-mode to directly "sync" the data as well.
>
> 2] The code should be constructed that the default check for the MAP_SYNC
> should fail, and only Hopped in FSs are allowed.
> (So not to modify all Implementations of file_operations->mmap() )
>
> 3] /dev/pmem could start serving DAX pages in mmap, if asked for MAP_MSYNC
> (which is also an API that says "I know I need to cl_flush". See 1. )
>
> 4] Once we have this flag. And properly implemented at least in one FS
> and optionally in /dev/pmemX we no longer have any justification for
> /dev/daxX and it can die a slow and happy death.

I'm all for replacing /dev/dax with filesystem equivalent
functionality, but I don't think MAP_SYNC gets us fully there. That's
what the MAP_DIRECT proposal [1] is meant to address.

[1]: https://lkml.org/lkml/2017/8/13/160

2017-08-15 09:06:02

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults

On 14/08/17 19:03, Dan Williams wrote:
> On Mon, Aug 14, 2017 at 7:04 AM, Boaz Harrosh <[email protected]> wrote:
>>
>> Thank you Jan, I'm patiently waiting for this MAP_SYNC flag since I asked for
>> it in 2014. I'm so glad its time is finally do.
>>

<>

>> 4] Once we have this flag. And properly implemented at least in one FS
>> and optionally in /dev/pmemX we no longer have any justification for
>> /dev/daxX and it can die a slow and happy death.
>
> I'm all for replacing /dev/dax with filesystem equivalent
> functionality, but I don't think MAP_SYNC gets us fully there. That's
> what the MAP_DIRECT proposal [1] is meant to address.
>

OK This is true. Could you please summarise for us the exact semantics of both
proposed flags?

That said I think the big difference is the movability of physical blocks
underneath the mmap mapping. Now for swap files that is a problem. because
of the deadlocks that can happen with memory needed if blocks start moving.
But for an application like nvml? why does it care. Why can't an nvml image file
not be cloned and COWed underneath the NVM application transparently.

Sorry for being slow but I don't see why you need MAP_DIRECT from user-mode
If you have MAP_SYNC. Please advise

(not that the immutable patchset is not a very needed fixing)

> [1]: https://lkml.org/lkml/2017/8/13/160
>

Thanks
Boaz

2017-08-15 09:44:40

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults

On 15/08/17 12:06, Boaz Harrosh wrote:
> On 14/08/17 19:03, Dan Williams wrote:
>> On Mon, Aug 14, 2017 at 7:04 AM, Boaz Harrosh <[email protected]> wrote:
>>>
>>> Thank you Jan, I'm patiently waiting for this MAP_SYNC flag since I asked for
>>> it in 2014. I'm so glad its time is finally do.
>>>
>
> <>
>
>>> 4] Once we have this flag. And properly implemented at least in one FS
>>> and optionally in /dev/pmemX we no longer have any justification for
>>> /dev/daxX and it can die a slow and happy death.
>>
>> I'm all for replacing /dev/dax with filesystem equivalent
>> functionality, but I don't think MAP_SYNC gets us fully there. That's
>> what the MAP_DIRECT proposal [1] is meant to address.
>>
>
> OK This is true. Could you please summarise for us the exact semantics of both
> proposed flags?
>
> That said I think the big difference is the movability of physical blocks
> underneath the mmap mapping. Now for swap files that is a problem. because
> of the deadlocks that can happen with memory needed if blocks start moving.
> But for an application like nvml? why does it care. Why can't an nvml image file
> not be cloned and COWed underneath the NVM application transparently.
>

OK Sorry didn't do my homework. Struck this out, it is all about RDMA and friends
from an "immutable" file.

I'll go dig into this now.

Thanks
Boaz

> Sorry for being slow but I don't see why you need MAP_DIRECT from user-mode
> If you have MAP_SYNC. Please advise
>
> (not that the immutable patchset is not a very needed fixing)
>
>> [1]: https://lkml.org/lkml/2017/8/13/160
>>
>
> Thanks
> Boaz
>

2017-08-17 16:16:32

by Jan Kara

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults

On Mon 14-08-17 17:04:17, Boaz Harrosh wrote:
>
> Thank you Jan, I'm patiently waiting for this MAP_SYNC flag since I asked for
> it in 2014. I'm so glad its time is finally do.
>
> Thank you for working on this. Please CC me on future patches.
> (note the new Netapp email)
>
> On 13/08/17 12:25, Christoph Hellwig wrote:
> > On Sat, Aug 12, 2017 at 07:44:14PM -0700, Dan Williams wrote:
> >> How about MAP_SYNC == (MAP_SHARED|MAP_PRIVATE)? On older kernels that
> >> should get -EINVAL, and on new kernels it means SYNC+SHARED.
> >
> > Cute trick, but I'd hate to waster it just for our little flag.
> >
> > How about:
> >
> > #define __MAP_VALIDATE MAP_SHARED|MAP_PRIVATE
> > #define MAP_SYNC 0x??? | __MAP_VALIDATE
> >
> > so that we can reuse that trick for any new flag?
> >
>
> YES! And please create a mask for all new flags and in validation
> code if ((m_flags & __MAP_VALIDATE) == __MAP_VALIDATE) then you
> want that (m_flags & __MAP_NEWFLAGS) does not come empty, this
> way you actually preserve the old check that SHARED and PRIVATE
> do not co exist.

For now I did just a crude hack. Dan is working on new mmap syscall which
checks flags which will be cleaner...

> Few Comments on this new MAP_ flag
>
> 0] The name at least needs to be MAP_MSYNC because only meta-data is
> synced not the data pointed to. That is the responsibility of the app

So we actually do normal fdatasync() call so we do flush data as well. This
way we don't have to be afraid of stale data exposure or other strange
effects. So I've kept the name to be MAP_SYNC.

> 1] This flag you have named MAP_SYNC but it is very much related to
> dax and the ability for user-mode to "flush" the data pointed by this
> now "synced" meta data.
> For example in ext4, this flag set on an inode that is *not* IS_DAX
> should fail the mmap. Because there is no point of synced meta if the
> data is actually in page-cache and we know for sure it was not yet synced,
> And there is no way for user-mode to directly "sync" the data as well.

Yes, done.

> 2] The code should be constructed that the default check for the MAP_SYNC
> should fail, and only Hopped in FSs are allowed.
> (So not to modify all Implementations of file_operations->mmap() )

Agreed but for now I've skipped this as I wait for new mmap syscall and
how Dan implements flag checking there.

> 3] /dev/pmem could start serving DAX pages in mmap, if asked for MAP_MSYNC
> (which is also an API that says "I know I need to cl_flush". See 1. )

MAP_SYNC is rather more like: I can also use clflush instead of
fdatasync(2). And this is rather important as all legacy applications are
100% safe in the new scheme.

> 4] Once we have this flag. And properly implemented at least in one FS
> and optionally in /dev/pmemX we no longer have any justification for
> /dev/daxX and it can die a slow and happy death.

This will be more complex I guess - see MAP_DIRECT proposal...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-08-21 19:57:51

by Ross Zwisler

[permalink] [raw]
Subject: Re: [RFC PATCH 0/7] dax, ext4: Synchronous page faults

On Mon, Aug 14, 2017 at 09:03:59AM -0700, Dan Williams wrote:
> On Mon, Aug 14, 2017 at 7:04 AM, Boaz Harrosh <[email protected]> wrote:
> >
> > Thank you Jan, I'm patiently waiting for this MAP_SYNC flag since I asked for
> > it in 2014. I'm so glad its time is finally do.
> >
> > Thank you for working on this. Please CC me on future patches.
> > (note the new Netapp email)
> >
> > On 13/08/17 12:25, Christoph Hellwig wrote:
> >> On Sat, Aug 12, 2017 at 07:44:14PM -0700, Dan Williams wrote:
> >>> How about MAP_SYNC == (MAP_SHARED|MAP_PRIVATE)? On older kernels that
> >>> should get -EINVAL, and on new kernels it means SYNC+SHARED.
> >>
> >> Cute trick, but I'd hate to waster it just for our little flag.
> >>
> >> How about:
> >>
> >> #define __MAP_VALIDATE MAP_SHARED|MAP_PRIVATE
> >> #define MAP_SYNC 0x??? | __MAP_VALIDATE
> >>
> >> so that we can reuse that trick for any new flag?
> >>
> >
> > YES! And please create a mask for all new flags and in validation
> > code if ((m_flags & __MAP_VALIDATE) == __MAP_VALIDATE) then you
> > want that (m_flags & __MAP_NEWFLAGS) does not come empty, this
> > way you actually preserve the old check that SHARED and PRIVATE
> > do not co exist.
> >
> > Few Comments on this new MAP_ flag
> >
> > 0] The name at least needs to be MAP_MSYNC because only meta-data is
> > synced not the data pointed to. That is the responsibility of the app
> >
> > 1] This flag you have named MAP_SYNC but it is very much related to
> > dax and the ability for user-mode to "flush" the data pointed by this
> > now "synced" meta data.
> > For example in ext4, this flag set on an inode that is *not* IS_DAX
> > should fail the mmap. Because there is no point of synced meta if the
> > data is actually in page-cache and we know for sure it was not yet synced,
> > And there is no way for user-mode to directly "sync" the data as well.
> >
> > 2] The code should be constructed that the default check for the MAP_SYNC
> > should fail, and only Hopped in FSs are allowed.
> > (So not to modify all Implementations of file_operations->mmap() )
> >
> > 3] /dev/pmem could start serving DAX pages in mmap, if asked for MAP_MSYNC
> > (which is also an API that says "I know I need to cl_flush". See 1. )
> >
> > 4] Once we have this flag. And properly implemented at least in one FS
> > and optionally in /dev/pmemX we no longer have any justification for
> > /dev/daxX and it can die a slow and happy death.
>
> I'm all for replacing /dev/dax with filesystem equivalent
> functionality, but I don't think MAP_SYNC gets us fully there. That's
> what the MAP_DIRECT proposal [1] is meant to address.
>
> [1]: https://lkml.org/lkml/2017/8/13/160

I think we may also need to get DAX working on raw block devices again before
/dev/dax can go away. AFAIK some customers still prefer to use DAX on a
per-device basis and not have a filesystem.