2017-10-24 15:25:30

by Jan Kara

Subject: [PATCH 0/17 v5] dax, ext4, xfs: Synchronous page faults

Hello,

here is the fifth version of my patches to implement synchronous page faults
for DAX mappings. They make it possible to flush DAX mappings from userspace,
so that flushing can happen at finer-than-page granularity and without the
overhead of a syscall.

We use a new mmap flag, MAP_SYNC, to indicate that page faults for the mapping
should be synchronous. The guarantee provided by this flag is: while a block
is writeably mapped into the page tables of this mapping, it is guaranteed to
be visible in the file at that offset even after a crash.

The implementation works as follows: ->iomap_begin() indicates via a flag that
the inode's block mapping metadata is unstable and may need flushing (using the
same test as for whether fdatasync() has metadata to write). If so, the DAX
fault handler refrains from inserting / write-enabling the page table entry and
instead returns the special flag VM_FAULT_NEEDDSYNC, together with the PFN to
map, to the filesystem fault handler. That handler then calls fdatasync()
(vfs_fsync_range()) for the affected range and afterwards calls back into the
DAX code to update the page table entry appropriately.

I did some basic performance testing of the patches over a ramdisk - I timed
the latency of page faults when faulting 512 pages. I ran several tests: with
the file preallocated / with the file empty, with background file copying
going on / without it, and with / without MAP_SYNC (for comparison). The
results are (numbers are in microseconds):

File preallocated, no background load, no MAP_SYNC:
min=9 avg=10 max=46
8 - 15 us: 508
16 - 31 us: 3
32 - 63 us: 1

File preallocated, no background load, MAP_SYNC:
min=9 avg=10 max=47
8 - 15 us: 508
16 - 31 us: 2
32 - 63 us: 2

File empty, no background load, no MAP_SYNC:
min=21 avg=22 max=70
16 - 31 us: 506
32 - 63 us: 5
64 - 127 us: 1

File empty, no background load, MAP_SYNC:
min=40 avg=124 max=242
32 - 63 us: 1
64 - 127 us: 333
128 - 255 us: 178

File empty, background load, no MAP_SYNC:
min=21 avg=23 max=67
16 - 31 us: 507
32 - 63 us: 4
64 - 127 us: 1

File empty, background load, MAP_SYNC:
min=94 avg=112 max=181
64 - 127 us: 489
128 - 255 us: 23

So here we can see that the difference between MAP_SYNC and non-MAP_SYNC
faults is about 100-200 us when we need to wait for a transaction commit in
this setup.

Anyway, here are the patches. Since Ross has already posted his patches to
test the functionality, I think we are ready to get this merged. I've talked
with Dan and he said he could take the patches through his tree; I'd just like
to get a final ack from Christoph on the patch modifying mmap(2). Comments are
welcome.

Changes since v4:
* fixed a couple of minor things in the manpage
* made legacy mmap flags always supported and removed them from the mask
declared to be supported by ext4 and xfs

Changes since v3:
* updated some changelogs
* folded fs support for VM_SYNC flag into patches implementing the
functionality
* removed ->mmap_validate, use ->mmap_supported_flags instead
* added some Reviewed-by tags
* added manpage patch

Changes since v2:
* avoid unnecessary flushing of the faulted page (Ross) - I've realized it
makes no sense to remeasure my benchmark results (after actually doing that
and seeing no difference, sigh) since I use a ramdisk rather than real PMEM
HW, so flushes are ignored.
* handle nojournal mode of ext4
* other smaller cleanups & fixes (Ross)
* factor larger part of finishing of synchronous fault into a helper (Christoph)
* reorder pfnp argument of dax_iomap_fault() (Christoph)
* add XFS support from Christoph
* use proper MAP_SYNC support in mmap(2)
* rebased on top of 4.14-rc4

Changes since v1:
* switched to using mmap flag MAP_SYNC
* cleaned up fault handlers to avoid passing pfn in vmf->orig_pte
* switched to not touching page tables before we are ready to insert final
entry as it was unnecessary and not really simplifying anything
* renamed fault flag to VM_FAULT_NEEDDSYNC
* other smaller fixes found by reviewers

Honza


2017-10-24 15:24:01

by Jan Kara

Subject: [PATCH 04/17] dax: Factor out getting of pfn out of iomap

Factor out the code that obtains a pfn from an iomap, which is shared between
the PTE and PMD fault paths.

Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 83 +++++++++++++++++++++++++++++++++-------------------------------
1 file changed, 43 insertions(+), 40 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 0bc42ac294ca..116eef8d6c69 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -825,30 +825,53 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
}

-static int dax_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
- loff_t pos, void *entry)
+static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
+ pfn_t *pfnp)
{
const sector_t sector = dax_iomap_sector(iomap, pos);
- struct vm_area_struct *vma = vmf->vma;
- struct address_space *mapping = vma->vm_file->f_mapping;
- unsigned long vaddr = vmf->address;
- void *ret, *kaddr;
pgoff_t pgoff;
+ void *kaddr;
int id, rc;
- pfn_t pfn;
+ long length;

- rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, &pgoff);
+ rc = bdev_dax_pgoff(iomap->bdev, sector, size, &pgoff);
if (rc)
return rc;
-
id = dax_read_lock();
- rc = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(PAGE_SIZE),
- &kaddr, &pfn);
- if (rc < 0) {
- dax_read_unlock(id);
- return rc;
+ length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
+ &kaddr, pfnp);
+ if (length < 0) {
+ rc = length;
+ goto out;
}
+ rc = -EINVAL;
+ if (PFN_PHYS(length) < size)
+ goto out;
+ if (pfn_t_to_pfn(*pfnp) & (PHYS_PFN(size)-1))
+ goto out;
+ /* For larger pages we need devmap */
+ if (length > 1 && !pfn_t_devmap(*pfnp))
+ goto out;
+ rc = 0;
+out:
dax_read_unlock(id);
+ return rc;
+}
+
+static int dax_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
+ loff_t pos, void *entry)
+{
+ const sector_t sector = dax_iomap_sector(iomap, pos);
+ struct vm_area_struct *vma = vmf->vma;
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ unsigned long vaddr = vmf->address;
+ void *ret;
+ int rc;
+ pfn_t pfn;
+
+ rc = dax_iomap_pfn(iomap, pos, PAGE_SIZE, &pfn);
+ if (rc < 0)
+ return rc;

ret = dax_insert_mapping_entry(mapping, vmf, entry, sector, 0);
if (IS_ERR(ret))
@@ -1223,46 +1246,26 @@ static int dax_pmd_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
{
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
const sector_t sector = dax_iomap_sector(iomap, pos);
- struct dax_device *dax_dev = iomap->dax_dev;
- struct block_device *bdev = iomap->bdev;
struct inode *inode = mapping->host;
- const size_t size = PMD_SIZE;
- void *ret = NULL, *kaddr;
- long length = 0;
- pgoff_t pgoff;
+ void *ret = NULL;
pfn_t pfn = {};
- int id;
+ int rc;

- if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
+ rc = dax_iomap_pfn(iomap, pos, PMD_SIZE, &pfn);
+ if (rc < 0)
goto fallback;

- id = dax_read_lock();
- length = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), &kaddr, &pfn);
- if (length < 0)
- goto unlock_fallback;
- length = PFN_PHYS(length);
-
- if (length < size)
- goto unlock_fallback;
- if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
- goto unlock_fallback;
- if (!pfn_t_devmap(pfn))
- goto unlock_fallback;
- dax_read_unlock(id);
-
ret = dax_insert_mapping_entry(mapping, vmf, entry, sector,
RADIX_DAX_PMD);
if (IS_ERR(ret))
goto fallback;

- trace_dax_pmd_insert_mapping(inode, vmf, length, pfn, ret);
+ trace_dax_pmd_insert_mapping(inode, vmf, PMD_SIZE, pfn, ret);
return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd,
pfn, vmf->flags & FAULT_FLAG_WRITE);

-unlock_fallback:
- dax_read_unlock(id);
fallback:
- trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);
+ trace_dax_pmd_insert_mapping_fallback(inode, vmf, PMD_SIZE, pfn, ret);
return VM_FAULT_FALLBACK;
}

--
2.12.3

2017-10-24 15:25:32

by Jan Kara

Subject: [PATCH 11/17] dax: Allow tuning whether dax_insert_mapping_entry() dirties entry

Currently we dirty the radix tree entry whenever dax_insert_mapping_entry()
gets called for a write fault. With synchronous page faults we would like to
insert a clean radix tree entry and dirty it only once we call fdatasync() and
update the page tables, to save some unnecessary cache flushing. Add a 'dirty'
argument to dax_insert_mapping_entry() for that.

Reviewed-by: Ross Zwisler <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5ddf15161390..efc210ff6665 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -526,13 +526,13 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
static void *dax_insert_mapping_entry(struct address_space *mapping,
struct vm_fault *vmf,
void *entry, sector_t sector,
- unsigned long flags)
+ unsigned long flags, bool dirty)
{
struct radix_tree_root *page_tree = &mapping->page_tree;
void *new_entry;
pgoff_t index = vmf->pgoff;

- if (vmf->flags & FAULT_FLAG_WRITE)
+ if (dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);

if (dax_is_zero_entry(entry) && !(flags & RADIX_DAX_ZERO_PAGE)) {
@@ -569,7 +569,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
entry = new_entry;
}

- if (vmf->flags & FAULT_FLAG_WRITE)
+ if (dirty)
radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);

spin_unlock_irq(&mapping->tree_lock);
@@ -881,7 +881,7 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
}

entry2 = dax_insert_mapping_entry(mapping, vmf, entry, 0,
- RADIX_DAX_ZERO_PAGE);
+ RADIX_DAX_ZERO_PAGE, false);
if (IS_ERR(entry2)) {
ret = VM_FAULT_SIGBUS;
goto out;
@@ -1182,7 +1182,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,

entry = dax_insert_mapping_entry(mapping, vmf, entry,
dax_iomap_sector(&iomap, pos),
- 0);
+ 0, write);
if (IS_ERR(entry)) {
error = PTR_ERR(entry);
goto error_finish_iomap;
@@ -1258,7 +1258,7 @@ static int dax_pmd_load_hole(struct vm_fault *vmf, struct iomap *iomap,
goto fallback;

ret = dax_insert_mapping_entry(mapping, vmf, entry, 0,
- RADIX_DAX_PMD | RADIX_DAX_ZERO_PAGE);
+ RADIX_DAX_PMD | RADIX_DAX_ZERO_PAGE, false);
if (IS_ERR(ret))
goto fallback;

@@ -1379,7 +1379,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,

entry = dax_insert_mapping_entry(mapping, vmf, entry,
dax_iomap_sector(&iomap, pos),
- RADIX_DAX_PMD);
+ RADIX_DAX_PMD, write);
if (IS_ERR(entry))
goto finish_iomap;

--
2.12.3

2017-10-24 15:24:11

by Jan Kara

Subject: [PATCH 14/17] dax: Implement dax_finish_sync_fault()

Implement a function that filesystems can call to finish handling of
synchronous page faults. It takes care of syncing the appropriate file range
and of inserting the page table entry.

Reviewed-by: Ross Zwisler <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 83 +++++++++++++++++++++++++++++++++++++++++++
include/linux/dax.h | 2 ++
include/trace/events/fs_dax.h | 2 ++
3 files changed, 87 insertions(+)

diff --git a/fs/dax.c b/fs/dax.c
index bb9ff907738c..78233c716757 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1492,3 +1492,86 @@ int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
}
}
EXPORT_SYMBOL_GPL(dax_iomap_fault);
+
+/**
+ * dax_insert_pfn_mkwrite - insert PTE or PMD entry into page tables
+ * @vmf: The description of the fault
+ * @pe_size: Size of entry to be inserted
+ * @pfn: PFN to insert
+ *
+ * This function inserts writeable PTE or PMD entry into page tables for mmaped
+ * DAX file. It takes care of marking corresponding radix tree entry as dirty
+ * as well.
+ */
+static int dax_insert_pfn_mkwrite(struct vm_fault *vmf,
+ enum page_entry_size pe_size,
+ pfn_t pfn)
+{
+ struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+ void *entry, **slot;
+ pgoff_t index = vmf->pgoff;
+ int vmf_ret, error;
+
+ spin_lock_irq(&mapping->tree_lock);
+ entry = get_unlocked_mapping_entry(mapping, index, &slot);
+ /* Did we race with someone splitting entry or so? */
+ if (!entry ||
+ (pe_size == PE_SIZE_PTE && !dax_is_pte_entry(entry)) ||
+ (pe_size == PE_SIZE_PMD && !dax_is_pmd_entry(entry))) {
+ put_unlocked_mapping_entry(mapping, index, entry);
+ spin_unlock_irq(&mapping->tree_lock);
+ trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
+ VM_FAULT_NOPAGE);
+ return VM_FAULT_NOPAGE;
+ }
+ radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
+ entry = lock_slot(mapping, slot);
+ spin_unlock_irq(&mapping->tree_lock);
+ switch (pe_size) {
+ case PE_SIZE_PTE:
+ error = vm_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
+ vmf_ret = dax_fault_return(error);
+ break;
+#ifdef CONFIG_FS_DAX_PMD
+ case PE_SIZE_PMD:
+ vmf_ret = vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd,
+ pfn, true);
+ break;
+#endif
+ default:
+ vmf_ret = VM_FAULT_FALLBACK;
+ }
+ put_locked_mapping_entry(mapping, index);
+ trace_dax_insert_pfn_mkwrite(mapping->host, vmf, vmf_ret);
+ return vmf_ret;
+}
+
+/**
+ * dax_finish_sync_fault - finish synchronous page fault
+ * @vmf: The description of the fault
+ * @pe_size: Size of entry to be inserted
+ * @pfn: PFN to insert
+ *
+ * This function ensures that the file range touched by the page fault is
+ * stored persistently on the media and handles inserting of appropriate page
+ * table entry.
+ */
+int dax_finish_sync_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
+ pfn_t pfn)
+{
+ int err;
+ loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT;
+ size_t len = 0;
+
+ if (pe_size == PE_SIZE_PTE)
+ len = PAGE_SIZE;
+ else if (pe_size == PE_SIZE_PMD)
+ len = PMD_SIZE;
+ else
+ WARN_ON_ONCE(1);
+ err = vfs_fsync_range(vmf->vma->vm_file, start, start + len - 1, 1);
+ if (err)
+ return VM_FAULT_SIGBUS;
+ return dax_insert_pfn_mkwrite(vmf, pe_size, pfn);
+}
+EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e7fa4b8f45bc..d403f78b706c 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -96,6 +96,8 @@ ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops);
int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
pfn_t *pfnp, const struct iomap_ops *ops);
+int dax_finish_sync_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
+ pfn_t pfn);
int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
pgoff_t index);
diff --git a/include/trace/events/fs_dax.h b/include/trace/events/fs_dax.h
index 88a9d19b8ff8..7725459fafef 100644
--- a/include/trace/events/fs_dax.h
+++ b/include/trace/events/fs_dax.h
@@ -190,6 +190,8 @@ DEFINE_EVENT(dax_pte_fault_class, name, \
DEFINE_PTE_FAULT_EVENT(dax_pte_fault);
DEFINE_PTE_FAULT_EVENT(dax_pte_fault_done);
DEFINE_PTE_FAULT_EVENT(dax_load_hole);
+DEFINE_PTE_FAULT_EVENT(dax_insert_pfn_mkwrite_no_entry);
+DEFINE_PTE_FAULT_EVENT(dax_insert_pfn_mkwrite);

TRACE_EVENT(dax_insert_mapping,
TP_PROTO(struct inode *inode, struct vm_fault *vmf, void *radix_entry),
--
2.12.3

2017-10-24 15:24:14

by Jan Kara

Subject: [PATCH 17/17] xfs: support for synchronous DAX faults

From: Christoph Hellwig <[email protected]>

Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare
blocks for writing and the inode is pinned, and has dirty fields other
than the timestamps. In __xfs_filemap_fault() we then detect this case
and call dax_finish_sync_fault() to make sure all metadata is committed,
and to insert the page table entry.

Note that this will also dirty the corresponding radix tree entry, which is
what we want - fsync(2) will still provide data integrity guarantees for
applications not using userspace flushing. And applications using userspace
flushing can avoid calling fsync(2) and thus avoid the performance overhead.

[JK: Added VM_SYNC flag handling]

Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/xfs/xfs_file.c | 15 ++++++++++++++-
fs/xfs/xfs_iomap.c | 5 +++++
2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7c6b8def6eed..02093df4b314 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -44,6 +44,7 @@
#include <linux/falloc.h>
#include <linux/pagevec.h>
#include <linux/backing-dev.h>
+#include <linux/mman.h>

static const struct vm_operations_struct xfs_file_vm_ops;

@@ -1040,7 +1041,11 @@ __xfs_filemap_fault(

xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
if (IS_DAX(inode)) {
- ret = dax_iomap_fault(vmf, pe_size, NULL, &xfs_iomap_ops);
+ pfn_t pfn;
+
+ ret = dax_iomap_fault(vmf, pe_size, &pfn, &xfs_iomap_ops);
+ if (ret & VM_FAULT_NEEDDSYNC)
+ ret = dax_finish_sync_fault(vmf, pe_size, pfn);
} else {
if (write_fault)
ret = iomap_page_mkwrite(vmf, &xfs_iomap_ops);
@@ -1131,6 +1136,13 @@ xfs_file_mmap(
struct file *filp,
struct vm_area_struct *vma)
{
+ /*
+ * We don't support synchronous mappings for non-DAX files. At least
+ * until someone comes with a sensible use case.
+ */
+ if (!IS_DAX(file_inode(filp)) && (vma->vm_flags & VM_SYNC))
+ return -EOPNOTSUPP;
+
file_accessed(filp);
vma->vm_ops = &xfs_file_vm_ops;
if (IS_DAX(file_inode(filp)))
@@ -1149,6 +1161,7 @@ const struct file_operations xfs_file_operations = {
.compat_ioctl = xfs_file_compat_ioctl,
#endif
.mmap = xfs_file_mmap,
+ .mmap_supported_flags = MAP_SYNC,
.open = xfs_file_open,
.release = xfs_file_release,
.fsync = xfs_file_fsync,
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index f179bdf1644d..b43be199fbdf 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -33,6 +33,7 @@
#include "xfs_error.h"
#include "xfs_trans.h"
#include "xfs_trans_space.h"
+#include "xfs_inode_item.h"
#include "xfs_iomap.h"
#include "xfs_trace.h"
#include "xfs_icache.h"
@@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
trace_xfs_iomap_found(ip, offset, length, 0, &imap);
}

+ if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
+ (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
+ iomap->flags |= IOMAP_F_DIRTY;
+
xfs_bmbt_to_iomap(ip, iomap, &imap);

if (shared)
--
2.12.3

2017-10-24 21:29:08

by Ross Zwisler

Subject: Re: [PATCH 17/17] xfs: support for synchronous DAX faults

On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote:
> From: Christoph Hellwig <[email protected]>
>
> Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare
> blocks for writing and the inode is pinned, and has dirty fields other
> than the timestamps. In __xfs_filemap_fault() we then detect this case
> and call dax_finish_sync_fault() to make sure all metadata is committed,
> and to insert the page table entry.
>
> Note that this will also dirty corresponding radix tree entry which is
> what we want - fsync(2) will still provide data integrity guarantees for
> applications not using userspace flushing. And applications using
> userspace flushing can avoid calling fsync(2) and thus avoid the
> performance overhead.
>
> [JK: Added VM_SYNC flag handling]
>
> Signed-off-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>

I don't know enough about XFS dirty inode handling to be able to comment on
the changes in xfs_file_iomap_begin(), but the rest looks good.

Reviewed-by: Ross Zwisler <[email protected]>

2017-10-24 22:23:22

by Dave Chinner

Subject: Re: [PATCH 17/17] xfs: support for synchronous DAX faults

On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote:
> From: Christoph Hellwig <[email protected]>
>
> Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare
> blocks for writing and the inode is pinned, and has dirty fields other
> than the timestamps.

That's "fdatasync dirty", not "fsync dirty".

IOMAP_F_DIRTY needs a far better description of its semantics than
"/* block mapping is not yet on persistent storage */" so we know
exactly what filesystems are supposed to be implementing here. I
suspect that what it really is meant to say is:

/*
* IOMAP_F_DIRTY indicates the inode has uncommitted metadata to
* written data and requires fdatasync to commit to persistent storage.
*/

[....]

> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index f179bdf1644d..b43be199fbdf 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -33,6 +33,7 @@
> #include "xfs_error.h"
> #include "xfs_trans.h"
> #include "xfs_trans_space.h"
> +#include "xfs_inode_item.h"
> #include "xfs_iomap.h"
> #include "xfs_trace.h"
> #include "xfs_icache.h"
> @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
> trace_xfs_iomap_found(ip, offset, length, 0, &imap);
> }
>
> + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> + iomap->flags |= IOMAP_F_DIRTY;

This is the very definition of an inode that is "fdatasync dirty".

Hmmmm, shouldn't this also be set for read faults, too?

Cheers,

Dave.
--
Dave Chinner
[email protected]

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: [email protected]

2017-10-26 15:48:04

by Jan Kara

Subject: Re: [PATCH 17/17] xfs: support for synchronous DAX faults

On Wed 25-10-17 09:23:22, Dave Chinner wrote:
> On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote:
> > From: Christoph Hellwig <[email protected]>
> >
> > Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare
> > blocks for writing and the inode is pinned, and has dirty fields other
> > than the timestamps.
>
> That's "fdatasync dirty", not "fsync dirty".

Correct.

> IOMAP_F_DIRTY needs a far better description of it's semantics than
> "/* block mapping is not yet on persistent storage */" so we know
> exactly what filesystems are supposed to be implementing here. I
> suspect that what it really is meant to say is:
>
> /*
> * IOMAP_F_DIRTY indicates the inode has uncommitted metadata to
> * written data and requires fdatasync to commit to persistent storage.
> */

I'll update the comment. Thanks!

> [....]
>
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index f179bdf1644d..b43be199fbdf 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -33,6 +33,7 @@
> > #include "xfs_error.h"
> > #include "xfs_trans.h"
> > #include "xfs_trans_space.h"
> > +#include "xfs_inode_item.h"
> > #include "xfs_iomap.h"
> > #include "xfs_trace.h"
> > #include "xfs_icache.h"
> > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
> > trace_xfs_iomap_found(ip, offset, length, 0, &imap);
> > }
> >
> > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > + iomap->flags |= IOMAP_F_DIRTY;
>
> This is the very definition of an inode that is "fdatasync dirty".
>
> Hmmmm, shouldn't this also be set for read faults, too?

No, read faults don't need to set IOMAP_F_DIRTY since the user cannot write
any data to the page that he'd then like to be persistent. The only reason I
thought for a while that it could be useful was that it would be nice to have
MAP_SYNC mappings provide the guarantee that the data you see now is the data
you'll see after a crash, but we cannot provide that guarantee for an RO
mapping anyway if someone else has the page mapped as well. So I just decided
not to return IOMAP_F_DIRTY for read faults.

But now that I look at the XFS implementation again, it misses handling
of VM_FAULT_NEEDDSYNC in xfs_filemap_pfn_mkwrite() (ext4 gets this right).
I'll fix this by using __xfs_filemap_fault() for xfs_filemap_pfn_mkwrite()
as well, since it mostly duplicates it anyway... Thanks for inquiring!

Honza
--
Jan Kara <jack-IBi9RG/[email protected]>
SUSE Labs, CR

2017-10-26 21:16:11

by Dave Chinner

Subject: Re: [PATCH 17/17] xfs: support for synchronous DAX faults

On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
> On Wed 25-10-17 09:23:22, Dave Chinner wrote:
> > On Tue, Oct 24, 2017 at 05:24:14PM +0200, Jan Kara wrote:
> > > From: Christoph Hellwig <[email protected]>
> > >
> > > Return IOMAP_F_DIRTY from xfs_file_iomap_begin() when asked to prepare
> > > blocks for writing and the inode is pinned, and has dirty fields other
> > > than the timestamps.
> >
> > That's "fdatasync dirty", not "fsync dirty".
>
> Correct.
>
> > IOMAP_F_DIRTY needs a far better description of it's semantics than
> > "/* block mapping is not yet on persistent storage */" so we know
> > exactly what filesystems are supposed to be implementing here. I
> > suspect that what it really is meant to say is:
> >
> > /*
> > * IOMAP_F_DIRTY indicates the inode has uncommitted metadata to
> > * written data and requires fdatasync to commit to persistent storage.
> > */
>
> I'll update the comment. Thanks!
>
> > [....]
> >
> > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > index f179bdf1644d..b43be199fbdf 100644
> > > --- a/fs/xfs/xfs_iomap.c
> > > +++ b/fs/xfs/xfs_iomap.c
> > > @@ -33,6 +33,7 @@
> > > #include "xfs_error.h"
> > > #include "xfs_trans.h"
> > > #include "xfs_trans_space.h"
> > > +#include "xfs_inode_item.h"
> > > #include "xfs_iomap.h"
> > > #include "xfs_trace.h"
> > > #include "xfs_icache.h"
> > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
> > > trace_xfs_iomap_found(ip, offset, length, 0, &imap);
> > > }
> > >
> > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > > + iomap->flags |= IOMAP_F_DIRTY;
> >
> > This is the very definition of an inode that is "fdatasync dirty".
> >
> > Hmmmm, shouldn't this also be set for read faults, too?
>
> No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any
> data to the page which he'd then like to be persistent. The only reason why
> I thought it could be useful for a while was that it would be nice to make
> MAP_SYNC mapping provide the guarantee that data you see now is the data
> you'll see after a crash

Isn't that the entire point of MAP_SYNC? i.e. That when we return
from a page fault, the app knows that the data and it's underlying
extent is on persistent storage?

> but we cannot provide that guarantee for RO
> mapping anyway if someone else has the page mapped as well. So I just
> decided not to return IOMAP_F_DIRTY for read faults.

If there are multiple MAP_SYNC mappings to the inode, I would have
expected that they all sync all of the data/metadata on every page
fault, regardless of who dirtied the inode. An RO mapping doesn't
mean the data/metadata on the inode can't change, it just means it
can't change through that mapping. Running fsync() to guarantee the
persistence of that data/metadata doesn't actually change any
data....

IOWs, if read faults don't guarantee the mapped range has stable
extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken
because it's not giving consistent guarantees to userspace. Yes, it
works fine when only one MAP_SYNC mapping is modifying the inode,
but the moment we have concurrent operations on the inode that
aren't MAP_SYNC or O_SYNC this goes out the window....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2017-10-27 06:43:01

by Christoph Hellwig

Subject: Re: [PATCH 17/17] xfs: support for synchronous DAX faults

On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
> But now that I look at XFS implementation again, it misses handling
> of VM_FAULT_NEEDSYNC in xfs_filemap_pfn_mkwrite() (ext4 gets this right).
> I'll fix this by using __xfs_filemap_fault() for xfs_filemap_pfn_mkwrite()
> as well since it mostly duplicates it anyway... Thanks for inquiring!

My first patches moved xfs_filemap_pfn_mkwrite to use __xfs_filemap_fault,
but that didn't work. I wish I could remember why, though.


2017-10-27 09:13:01

by Jan Kara

Subject: Re: [PATCH 17/17] xfs: support for synchronous DAX faults

On Fri 27-10-17 08:43:01, Christoph Hellwig wrote:
> On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
> > But now that I look at XFS implementation again, it misses handling
> > of VM_FAULT_NEEDSYNC in xfs_filemap_pfn_mkwrite() (ext4 gets this right).
> > I'll fix this by using __xfs_filemap_fault() for xfs_filemap_pfn_mkwrite()
> > as well since it mostly duplicates it anyway... Thanks for inquiring!
>
> My first patches move xfs_filemap_pfn_mkwrite to use __xfs_filemap_fault,
> but that didn't work. Wish I'd remember why, though.

Maybe due to the additional check on IS_DAX(inode) in __xfs_filemap_fault(),
which could do the wrong thing if the per-inode DAX flag is switched? Because
otherwise __xfs_filemap_fault(vmf, PE_SIZE_PTE, true) does exactly the same
thing as xfs_filemap_pfn_mkwrite() did.

If we do care about per-inode DAX flag switching, I can just fix up
xfs_filemap_pfn_mkwrite(), but my understanding was that we ditched that idea
at least until someone comes up with a reliable way to implement it...

Honza
--
Jan Kara <jack-IBi9RG/[email protected]>
SUSE Labs, CR

2017-10-27 10:08:37

by Jan Kara

Subject: Re: [PATCH 17/17] xfs: support for synchronous DAX faults

On Fri 27-10-17 08:16:11, Dave Chinner wrote:
> On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
> > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > > index f179bdf1644d..b43be199fbdf 100644
> > > > --- a/fs/xfs/xfs_iomap.c
> > > > +++ b/fs/xfs/xfs_iomap.c
> > > > @@ -33,6 +33,7 @@
> > > > #include "xfs_error.h"
> > > > #include "xfs_trans.h"
> > > > #include "xfs_trans_space.h"
> > > > +#include "xfs_inode_item.h"
> > > > #include "xfs_iomap.h"
> > > > #include "xfs_trace.h"
> > > > #include "xfs_icache.h"
> > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
> > > > trace_xfs_iomap_found(ip, offset, length, 0, &imap);
> > > > }
> > > >
> > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> > > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > > > + iomap->flags |= IOMAP_F_DIRTY;
> > >
> > > This is the very definition of an inode that is "fdatasync dirty".
> > >
> > > Hmmmm, shouldn't this also be set for read faults, too?
> >
> > No, read faults don't need to set IOMAP_F_DIRTY since user cannot write any
> > data to the page which he'd then like to be persistent. The only reason why
> > I thought it could be useful for a while was that it would be nice to make
> > MAP_SYNC mapping provide the guarantee that data you see now is the data
> > you'll see after a crash
>
> Isn't that the entire point of MAP_SYNC? i.e. That when we return
> from a page fault, the app knows that the data and it's underlying
> extent is on persistent storage?
>
> > but we cannot provide that guarantee for RO
> > mapping anyway if someone else has the page mapped as well. So I just
> > decided not to return IOMAP_F_DIRTY for read faults.
>
> If there are multiple MAP_SYNC mappings to the inode, I would have
> expected that they all sync all of the data/metadata on every page
> fault, regardless of who dirtied the inode. An RO mapping doesn't

Well, they all do sync on every *write* fault, regardless of who dirtied the
inode.

> mean the data/metadata on the inode can't change, it just means it
> can't change through that mapping. Running fsync() to guarantee the
> persistence of that data/metadata doesn't actually changing any
> data....
>
> IOWs, if read faults don't guarantee the mapped range has stable
> extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken
> because it's not giving consistent guarantees to userspace. Yes, it
> works fine when only one MAP_SYNC mapping is modifying the inode,
> but the moment we have concurrent operations on the inode that
> aren't MAP_SYNC or O_SYNC this goes out the window....

MAP_SYNC as I've implemented it provides guarantees only for data the
process has actually written. I agree with that, and it was a conscious
decision. In my opinion that covers most use cases, provides reasonably
simple semantics (i.e., if you write data through a MAP_SYNC mapping, you
can persist it just using CPU instructions), and reasonable performance.

Now you seem to suggest the semantics should be: "Data you have read from
or written to a MAP_SYNC mapping can be persisted using CPU instructions."
From an implementation POV we can do that rather easily (just rip out the
IOMAP_WRITE checks). But I'm unsure whether this additional guarantee would
be useful enough to justify the slowdown of read faults. I was not able to
come up with a good use case, so I've decided on the current semantics.
What do other people think?

And now that I've spelled out the exact semantics, I don't think your
comparison - that you can fsync() data you didn't write - quite matches:
with MAP_SYNC you at least have to read the data to be able to persist it,
whereas you don't have that requirement for fsync()...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2017-10-31 15:19:07

by Jan Kara

Subject: Re: [PATCH 17/17] xfs: support for synchronous DAX faults

On Fri 27-10-17 12:08:34, Jan Kara wrote:
> On Fri 27-10-17 08:16:11, Dave Chinner wrote:
> > On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
> > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > > > > index f179bdf1644d..b43be199fbdf 100644
> > > > > --- a/fs/xfs/xfs_iomap.c
> > > > > +++ b/fs/xfs/xfs_iomap.c
> > > > > @@ -33,6 +33,7 @@
> > > > > #include "xfs_error.h"
> > > > > #include "xfs_trans.h"
> > > > > #include "xfs_trans_space.h"
> > > > > +#include "xfs_inode_item.h"
> > > > > #include "xfs_iomap.h"
> > > > > #include "xfs_trace.h"
> > > > > #include "xfs_icache.h"
> > > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
> > > > > trace_xfs_iomap_found(ip, offset, length, 0, &imap);
> > > > > }
> > > > >
> > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> > > > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> > > > > + iomap->flags |= IOMAP_F_DIRTY;
> > > >
> > > > This is the very definition of an inode that is "fdatasync dirty".
> > > >
> > > > Hmmmm, shouldn't this also be set for read faults, too?
> > >
> > > No, read faults don't need to set IOMAP_F_DIRTY since the user cannot
> > > write any data to the page that they'd then want to be persistent. The only
> > > reason why I thought it could be useful for a while was that it would be
> > > nice to make a MAP_SYNC mapping provide the guarantee that the data you see
> > > now is the data you'll see after a crash
> >
> > Isn't that the entire point of MAP_SYNC? i.e. That when we return
> > from a page fault, the app knows that the data and its underlying
> > extent is on persistent storage?
> >
> > > but we cannot provide that guarantee for RO
> > > mapping anyway if someone else has the page mapped as well. So I just
> > > decided not to return IOMAP_F_DIRTY for read faults.
> >
> > If there are multiple MAP_SYNC mappings to the inode, I would have
> > expected that they all sync all of the data/metadata on every page
> > fault, regardless of who dirtied the inode. An RO mapping doesn't
>
> Well, they all do sync on every *write* fault, regardless of who dirtied
> the inode.
>
> > mean the data/metadata on the inode can't change, it just means it
> > can't change through that mapping. Running fsync() to guarantee the
> > persistence of that data/metadata doesn't actually change any
> > data....
> >
> > IOWs, if read faults don't guarantee the mapped range has stable
> > extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken
> > because it's not giving consistent guarantees to userspace. Yes, it
> > works fine when only one MAP_SYNC mapping is modifying the inode,
> > but the moment we have concurrent operations on the inode that
> > aren't MAP_SYNC or O_SYNC this goes out the window....
>
> MAP_SYNC as I've implemented it provides guarantees only for data the
> process has actually written. I agree with that, and it was a conscious
> decision. In my opinion that covers most use cases, provides reasonably
> simple semantics (i.e., if you write data through a MAP_SYNC mapping, you
> can persist it just using CPU instructions), and reasonable performance.
>
> Now you seem to suggest the semantics should be: "Data you have read from
> or written to a MAP_SYNC mapping can be persisted using CPU instructions."
> From an implementation POV we can do that rather easily (just rip out the
> IOMAP_WRITE checks). But I'm unsure whether this additional guarantee would
> be useful enough to justify the slowdown of read faults. I was not able to
> come up with a good use case, so I've decided on the current semantics.
> What do other people think?

Nobody has commented on this for a couple of days, so how do we proceed? I
would prefer to go with a guarantee just for written data; we can always
make the guarantee stronger (i.e., apply it also to read data) when some
user comes up with a good use case.

Honza
--
Jan Kara <jack-IBi9RG/[email protected]>
SUSE Labs, CR

2017-10-31 21:50:01

by Dan Williams

Subject: Re: [PATCH 17/17] xfs: support for synchronous DAX faults

On Tue, Oct 31, 2017 at 8:19 AM, Jan Kara <[email protected]> wrote:
> On Fri 27-10-17 12:08:34, Jan Kara wrote:
>> On Fri 27-10-17 08:16:11, Dave Chinner wrote:
>> > On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
>> > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> > > > > index f179bdf1644d..b43be199fbdf 100644
>> > > > > --- a/fs/xfs/xfs_iomap.c
>> > > > > +++ b/fs/xfs/xfs_iomap.c
>> > > > > @@ -33,6 +33,7 @@
>> > > > > #include "xfs_error.h"
>> > > > > #include "xfs_trans.h"
>> > > > > #include "xfs_trans_space.h"
>> > > > > +#include "xfs_inode_item.h"
>> > > > > #include "xfs_iomap.h"
>> > > > > #include "xfs_trace.h"
>> > > > > #include "xfs_icache.h"
>> > > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
>> > > > > trace_xfs_iomap_found(ip, offset, length, 0, &imap);
>> > > > > }
>> > > > >
>> > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
>> > > > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
>> > > > > + iomap->flags |= IOMAP_F_DIRTY;
>> > > >
>> > > > This is the very definition of an inode that is "fdatasync dirty".
>> > > >
>> > > > Hmmmm, shouldn't this also be set for read faults, too?
>> > >
>> > > No, read faults don't need to set IOMAP_F_DIRTY since the user cannot
>> > > write any data to the page that they'd then want to be persistent. The only
>> > > reason why I thought it could be useful for a while was that it would be
>> > > nice to make a MAP_SYNC mapping provide the guarantee that the data you see
>> > > now is the data you'll see after a crash
>> >
>> > Isn't that the entire point of MAP_SYNC? i.e. That when we return
>> > from a page fault, the app knows that the data and its underlying
>> > extent is on persistent storage?
>> >
>> > > but we cannot provide that guarantee for RO
>> > > mapping anyway if someone else has the page mapped as well. So I just
>> > > decided not to return IOMAP_F_DIRTY for read faults.
>> >
>> > If there are multiple MAP_SYNC mappings to the inode, I would have
>> > expected that they all sync all of the data/metadata on every page
>> > fault, regardless of who dirtied the inode. An RO mapping doesn't
>>
>> Well, they all do sync on every *write* fault, regardless of who dirtied
>> the inode.
>>
>> > mean the data/metadata on the inode can't change, it just means it
>> > can't change through that mapping. Running fsync() to guarantee the
>> > persistence of that data/metadata doesn't actually change any
>> > data....
>> >
>> > IOWs, if read faults don't guarantee the mapped range has stable
>> > extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken
>> > because it's not giving consistent guarantees to userspace. Yes, it
>> > works fine when only one MAP_SYNC mapping is modifying the inode,
>> > but the moment we have concurrent operations on the inode that
>> > aren't MAP_SYNC or O_SYNC this goes out the window....
>>
>> MAP_SYNC as I've implemented it provides guarantees only for data the
>> process has actually written. I agree with that, and it was a conscious
>> decision. In my opinion that covers most use cases, provides reasonably
>> simple semantics (i.e., if you write data through a MAP_SYNC mapping, you
>> can persist it just using CPU instructions), and reasonable performance.
>>
>> Now you seem to suggest the semantics should be: "Data you have read from
>> or written to a MAP_SYNC mapping can be persisted using CPU instructions."
>> From an implementation POV we can do that rather easily (just rip out the
>> IOMAP_WRITE checks). But I'm unsure whether this additional guarantee would
>> be useful enough to justify the slowdown of read faults. I was not able to
>> come up with a good use case, so I've decided on the current semantics.
>> What do other people think?
>
> Nobody has commented on this for a couple of days, so how do we proceed? I
> would prefer to go with a guarantee just for written data; we can always
> make the guarantee stronger (i.e., apply it also to read data) when some
> user comes up with a good use case.

I think it is easier to strengthen the guarantee later than to loosen it,
especially since it is not yet clear that we have a use case for the
stronger semantics. At least the initial motivation for MAP_SYNC was
for writers.

2017-11-01 03:47:07

by Ross Zwisler

Subject: Re: [PATCH 17/17] xfs: support for synchronous DAX faults

On Tue, Oct 31, 2017 at 02:50:01PM -0700, Dan Williams wrote:
> On Tue, Oct 31, 2017 at 8:19 AM, Jan Kara <[email protected]> wrote:
> > On Fri 27-10-17 12:08:34, Jan Kara wrote:
> >> On Fri 27-10-17 08:16:11, Dave Chinner wrote:
> >> > On Thu, Oct 26, 2017 at 05:48:04PM +0200, Jan Kara wrote:
> >> > > > > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> >> > > > > index f179bdf1644d..b43be199fbdf 100644
> >> > > > > --- a/fs/xfs/xfs_iomap.c
> >> > > > > +++ b/fs/xfs/xfs_iomap.c
> >> > > > > @@ -33,6 +33,7 @@
> >> > > > > #include "xfs_error.h"
> >> > > > > #include "xfs_trans.h"
> >> > > > > #include "xfs_trans_space.h"
> >> > > > > +#include "xfs_inode_item.h"
> >> > > > > #include "xfs_iomap.h"
> >> > > > > #include "xfs_trace.h"
> >> > > > > #include "xfs_icache.h"
> >> > > > > @@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
> >> > > > > trace_xfs_iomap_found(ip, offset, length, 0, &imap);
> >> > > > > }
> >> > > > >
> >> > > > > + if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
> >> > > > > + (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
> >> > > > > + iomap->flags |= IOMAP_F_DIRTY;
> >> > > >
> >> > > > This is the very definition of an inode that is "fdatasync dirty".
> >> > > >
> >> > > > Hmmmm, shouldn't this also be set for read faults, too?
> >> > >
> >> > > No, read faults don't need to set IOMAP_F_DIRTY since the user cannot
> >> > > write any data to the page that they'd then want to be persistent. The only
> >> > > reason why I thought it could be useful for a while was that it would be
> >> > > nice to make a MAP_SYNC mapping provide the guarantee that the data you see
> >> > > now is the data you'll see after a crash
> >> >
> >> > Isn't that the entire point of MAP_SYNC? i.e. That when we return
> >> > from a page fault, the app knows that the data and its underlying
> >> > extent is on persistent storage?
> >> >
> >> > > but we cannot provide that guarantee for RO
> >> > > mapping anyway if someone else has the page mapped as well. So I just
> >> > > decided not to return IOMAP_F_DIRTY for read faults.
> >> >
> >> > If there are multiple MAP_SYNC mappings to the inode, I would have
> >> > expected that they all sync all of the data/metadata on every page
> >> > fault, regardless of who dirtied the inode. An RO mapping doesn't
> >>
> >> Well, they all do sync on every *write* fault, regardless of who dirtied
> >> the inode.
> >>
> >> > mean the data/metadata on the inode can't change, it just means it
> >> > can't change through that mapping. Running fsync() to guarantee the
> >> > persistence of that data/metadata doesn't actually change any
> >> > data....
> >> >
> >> > IOWs, if read faults don't guarantee the mapped range has stable
> >> > extents on a MAP_SYNC mapping, then I think MAP_SYNC is broken
> >> > because it's not giving consistent guarantees to userspace. Yes, it
> >> > works fine when only one MAP_SYNC mapping is modifying the inode,
> >> > but the moment we have concurrent operations on the inode that
> >> > aren't MAP_SYNC or O_SYNC this goes out the window....
> >>
> >> MAP_SYNC as I've implemented it provides guarantees only for data the
> >> process has actually written. I agree with that, and it was a conscious
> >> decision. In my opinion that covers most use cases, provides reasonably
> >> simple semantics (i.e., if you write data through a MAP_SYNC mapping, you
> >> can persist it just using CPU instructions), and reasonable performance.
> >>
> >> Now you seem to suggest the semantics should be: "Data you have read from
> >> or written to a MAP_SYNC mapping can be persisted using CPU instructions."
> >> From an implementation POV we can do that rather easily (just rip out the
> >> IOMAP_WRITE checks). But I'm unsure whether this additional guarantee would
> >> be useful enough to justify the slowdown of read faults. I was not able to
> >> come up with a good use case, so I've decided on the current semantics.
> >> What do other people think?
> >
> > Nobody has commented on this for a couple of days, so how do we proceed? I
> > would prefer to go with a guarantee just for written data; we can always
> > make the guarantee stronger (i.e., apply it also to read data) when some
> > user comes up with a good use case.
>
> I think it is easier to strengthen the guarantee than loosen it later
> especially since it is not yet clear that we have a use case for the
> stronger semantic. At least the initial motivation for MAP_SYNC was
> for writers.

I agree. It seems like all threads/processes in a given application need to
use MAP_SYNC consistently so they can be sure that data that is written (and
then possibly read) will be durable on media. I think what you have is a good
starting point, and we can adjust later if necessary.