Hello,
here is the third version of my patches implementing synchronous page faults
for DAX mappings. They make it possible to flush DAX mappings from userspace,
so that flushing can happen at finer than page granularity and without the
overhead of a syscall.
We use a new mmap(2) flag, MAP_SYNC, to indicate that page faults for the
mapping should be synchronous. The guarantee provided by this flag is: while a
block is writeably mapped into the page tables of this mapping, it is
guaranteed to be visible in the file at that offset even after a crash.
The implementation works as follows: ->iomap_begin() indicates with a flag that
the inode's block mapping metadata is unstable and may need flushing (using the
same test as whether fdatasync() has metadata to write). If so, the DAX fault
handler refrains from inserting / write-enabling the page table entry and
instead returns the special flag VM_FAULT_NEEDDSYNC, together with a PFN, to
the filesystem fault handler. The handler then calls fdatasync()
(vfs_fsync_range()) for the affected range and afterwards calls back into DAX
code to update the page table entry appropriately.
The first patch in this series is taken from Dan Williams' series for
MAP_DIRECT so that we get a reliable way of detecting whether MAP_SYNC is
supported or not.
I did some basic performance testing of the patches over a ramdisk - I timed
the latency of page faults when faulting 512 pages. I ran several tests: with
the file preallocated / empty, with / without background file copying going
on, and with / without MAP_SYNC (for comparison). The results below are in
microseconds:
File preallocated, no background load no MAP_SYNC:
min=9 avg=10 max=46
8 - 15 us: 508
16 - 31 us: 3
32 - 63 us: 1
File preallocated, no background load, MAP_SYNC:
min=9 avg=10 max=47
8 - 15 us: 508
16 - 31 us: 2
32 - 63 us: 2
File empty, no background load, no MAP_SYNC:
min=21 avg=22 max=70
16 - 31 us: 506
32 - 63 us: 5
64 - 127 us: 1
File empty, no background load, MAP_SYNC:
min=40 avg=124 max=242
32 - 63 us: 1
64 - 127 us: 333
128 - 255 us: 178
File empty, background load, no MAP_SYNC:
min=21 avg=23 max=67
16 - 31 us: 507
32 - 63 us: 4
64 - 127 us: 1
File empty, background load, MAP_SYNC:
min=94 avg=112 max=181
64 - 127 us: 489
128 - 255 us: 23
So we can see that the difference between MAP_SYNC and non-MAP_SYNC is about
100-200 us in this setup when we need to wait for a transaction commit.
Anyway, here are the patches and AFAICT the series is pretty much complete,
so we can start thinking about how to merge this. Changes to ext4 / XFS are
pretty minimal so either tree is fine I guess. Comments are welcome.
Changes since v2:
* avoid unnecessary flushing of faulted page (Ross) - I've realized it makes no
sense to remeasure my benchmark results (after actually doing that and seeing
no difference, sigh) since I use a ramdisk and not real PMEM HW, so flushes
are ignored.
* handle nojournal mode of ext4
* other smaller cleanups & fixes (Ross)
* factor a larger part of finishing a synchronous fault into a helper (Christoph)
* reorder pfnp argument of dax_iomap_fault() (Christoph)
* add XFS support from Christoph
* use proper MAP_SYNC support in mmap(2)
* rebased on top of 4.14-rc4
Changes since v1:
* switched to using mmap flag MAP_SYNC
* cleaned up fault handlers to avoid passing pfn in vmf->orig_pte
* switched to not touching page tables before we are ready to insert final
entry as it was unnecessary and not really simplifying anything
* renamed fault flag to VM_FAULT_NEEDDSYNC
* other smaller fixes found by reviewers
Honza
There are already two users and more are coming.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index c09465884bbe..5ea71381dba0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1116,6 +1116,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
struct iomap iomap = { 0 };
unsigned flags = IOMAP_FAULT;
int error, major = 0;
+ bool write = vmf->flags & FAULT_FLAG_WRITE;
int vmf_ret = 0;
void *entry;
@@ -1130,7 +1131,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
goto out;
}
- if ((vmf->flags & FAULT_FLAG_WRITE) && !vmf->cow_page)
+ if (write && !vmf->cow_page)
flags |= IOMAP_WRITE;
entry = grab_mapping_entry(mapping, vmf->pgoff, 0);
@@ -1207,7 +1208,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
break;
case IOMAP_UNWRITTEN:
case IOMAP_HOLE:
- if (!(vmf->flags & FAULT_FLAG_WRITE)) {
+ if (!write) {
vmf_ret = dax_load_hole(mapping, entry, vmf);
goto finish_iomap;
}
--
2.12.3
It is unused.
Reviewed-by: Ross Zwisler <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
include/linux/mm.h | 2 --
1 file changed, 2 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 38f6ed954dde..55a9d2d109d9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1182,8 +1182,6 @@ static inline void clear_page_pfmemalloc(struct page *page)
#define VM_FAULT_FALLBACK 0x0800 /* huge page fault failed, fall back to small */
#define VM_FAULT_DONE_COW 0x1000 /* ->fault has fully handled COW */
-#define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
From: Dan Williams <[email protected]>
The mmap(2) syscall suffers from the ABI anti-pattern of not validating
unknown flags. However, proposals like MAP_SYNC and MAP_DIRECT need a
mechanism to define new behavior that is known to fail on older kernels
without the support. Define a new MAP_SHARED_VALIDATE flag pattern that
is guaranteed to fail on all legacy mmap implementations.
It is worth noting that the original proposal was for a standalone
MAP_VALIDATE flag. However, when that could not be supported by all
architectures, Linus observed:
I see why you *think* you want a bitmap. You think you want
a bitmap because you want to make MAP_VALIDATE be part of MAP_SYNC
etc, so that people can do
ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED
| MAP_SYNC, fd, 0);
and "know" that MAP_SYNC actually takes.
And I'm saying that whole wish is bogus. You're fundamentally
depending on special semantics, just make it explicit. It's already
not portable, so don't try to make it so.
Rename that MAP_VALIDATE as MAP_SHARED_VALIDATE, make it have a value
of 0x3, and make people do
ret = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED_VALIDATE
| MAP_SYNC, fd, 0);
and then the kernel side is easier too (none of that random garbage
playing games with looking at the "MAP_VALIDATE bit", but just another
case statement in that map type thing.
Boom. Done.
Similar to ->fallocate() we also want the ability to validate the
support for new flags on a per ->mmap() 'struct file_operations'
instance basis. Towards that end, arrange for flags to be generically
validated via a new ->mmap_validate() operation in 'struct
file_operations'. By default all existing flags are implicitly
supported, but new flags require MAP_SHARED_VALIDATE and a
per-instance opt-in.
Cc: Jan Kara <[email protected]>
Cc: Arnd Bergmann <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andrew Morton <[email protected]>
Suggested-by: Christoph Hellwig <[email protected]>
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Dan Williams <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
arch/alpha/include/uapi/asm/mman.h | 1 +
arch/mips/include/uapi/asm/mman.h | 1 +
arch/mips/kernel/vdso.c | 2 +-
arch/parisc/include/uapi/asm/mman.h | 1 +
arch/tile/mm/elf.c | 3 ++-
arch/xtensa/include/uapi/asm/mman.h | 1 +
include/linux/fs.h | 2 ++
include/linux/mm.h | 2 +-
include/linux/mman.h | 39 ++++++++++++++++++++++++++++
include/uapi/asm-generic/mman-common.h | 1 +
mm/mmap.c | 21 ++++++++++++---
tools/include/uapi/asm-generic/mman-common.h | 1 +
12 files changed, 69 insertions(+), 6 deletions(-)
diff --git a/arch/alpha/include/uapi/asm/mman.h b/arch/alpha/include/uapi/asm/mman.h
index 3b26cc62dadb..92823f24890b 100644
--- a/arch/alpha/include/uapi/asm/mman.h
+++ b/arch/alpha/include/uapi/asm/mman.h
@@ -14,6 +14,7 @@
#define MAP_TYPE 0x0f /* Mask for type of mapping (OSF/1 is _wrong_) */
#define MAP_FIXED 0x100 /* Interpret addr exactly */
#define MAP_ANONYMOUS 0x10 /* don't use a file */
+#define MAP_SHARED_VALIDATE 0x3 /* share + validate extension flags */
/* not used by linux, but here to make sure we don't clash with OSF/1 defines */
#define _MAP_HASSEMAPHORE 0x0200
diff --git a/arch/mips/include/uapi/asm/mman.h b/arch/mips/include/uapi/asm/mman.h
index da3216007fe0..c77689076577 100644
--- a/arch/mips/include/uapi/asm/mman.h
+++ b/arch/mips/include/uapi/asm/mman.h
@@ -30,6 +30,7 @@
#define MAP_PRIVATE 0x002 /* Changes are private */
#define MAP_TYPE 0x00f /* Mask for type of mapping */
#define MAP_FIXED 0x010 /* Interpret addr exactly */
+#define MAP_SHARED_VALIDATE 0x3 /* share + validate extension flags */
/* not used by linux, but here to make sure we don't clash with ABI defines */
#define MAP_RENAME 0x020 /* Assign page to file */
diff --git a/arch/mips/kernel/vdso.c b/arch/mips/kernel/vdso.c
index 019035d7225c..cf10654477a9 100644
--- a/arch/mips/kernel/vdso.c
+++ b/arch/mips/kernel/vdso.c
@@ -110,7 +110,7 @@ int arch_setup_additional_pages(struct linux_binprm *bprm, int uses_interp)
base = mmap_region(NULL, STACK_TOP, PAGE_SIZE,
VM_READ|VM_WRITE|VM_EXEC|
VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC,
- 0, NULL);
+ 0, NULL, 0);
if (IS_ERR_VALUE(base)) {
ret = base;
goto out;
diff --git a/arch/parisc/include/uapi/asm/mman.h b/arch/parisc/include/uapi/asm/mman.h
index 775b5d5e41a1..36b688d52de3 100644
--- a/arch/parisc/include/uapi/asm/mman.h
+++ b/arch/parisc/include/uapi/asm/mman.h
@@ -14,6 +14,7 @@
#define MAP_TYPE 0x03 /* Mask for type of mapping */
#define MAP_FIXED 0x04 /* Interpret addr exactly */
#define MAP_ANONYMOUS 0x10 /* don't use a file */
+#define MAP_SHARED_VALIDATE 0x3 /* share + validate extension flags */
#define MAP_DENYWRITE 0x0800 /* ETXTBSY */
#define MAP_EXECUTABLE 0x1000 /* mark it as an executable */
diff --git a/arch/tile/mm/elf.c b/arch/tile/mm/elf.c
index 889901824400..5ffcbe76aef9 100644
--- a/arch/tile/mm/elf.c
+++ b/arch/tile/mm/elf.c
@@ -143,7 +143,8 @@ int arch_setup_additional_pages(struct linux_binprm *bprm,
unsigned long addr = MEM_USER_INTRPT;
addr = mmap_region(NULL, addr, INTRPT_SIZE,
VM_READ|VM_EXEC|
- VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0, NULL);
+ VM_MAYREAD|VM_MAYWRITE|VM_MAYEXEC, 0,
+ NULL, 0);
if (addr > (unsigned long) -PAGE_SIZE)
retval = (int) addr;
}
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index b15b278aa314..ec597900eec7 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -37,6 +37,7 @@
#define MAP_PRIVATE 0x002 /* Changes are private */
#define MAP_TYPE 0x00f /* Mask for type of mapping */
#define MAP_FIXED 0x010 /* Interpret addr exactly */
+#define MAP_SHARED_VALIDATE 0x3 /* share + validate extension flags */
/* not used by linux, but here to make sure we don't clash with ABI defines */
#define MAP_RENAME 0x020 /* Assign page to file */
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 13dab191a23e..5aee97d64cae 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1701,6 +1701,8 @@ struct file_operations {
long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
int (*mmap) (struct file *, struct vm_area_struct *);
+ int (*mmap_validate) (struct file *, struct vm_area_struct *,
+ unsigned long);
int (*open) (struct inode *, struct file *);
int (*flush) (struct file *, fl_owner_t id);
int (*release) (struct inode *, struct file *);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 065d99deb847..38f6ed954dde 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2133,7 +2133,7 @@ extern unsigned long get_unmapped_area(struct file *, unsigned long, unsigned lo
extern unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
- struct list_head *uf);
+ struct list_head *uf, unsigned long map_flags);
extern unsigned long do_mmap(struct file *file, unsigned long addr,
unsigned long len, unsigned long prot, unsigned long flags,
vm_flags_t vm_flags, unsigned long pgoff, unsigned long *populate,
diff --git a/include/linux/mman.h b/include/linux/mman.h
index c8367041fafd..94b63b4d71ff 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -7,6 +7,45 @@
#include <linux/atomic.h>
#include <uapi/linux/mman.h>
+/*
+ * Arrange for legacy / undefined architecture specific flags to be
+ * ignored by default in LEGACY_MAP_MASK.
+ */
+#ifndef MAP_32BIT
+#define MAP_32BIT 0
+#endif
+#ifndef MAP_HUGE_2MB
+#define MAP_HUGE_2MB 0
+#endif
+#ifndef MAP_HUGE_1GB
+#define MAP_HUGE_1GB 0
+#endif
+#ifndef MAP_UNINITIALIZED
+#define MAP_UNINITIALIZED 0
+#endif
+
+/*
+ * The historical set of flags that all mmap implementations implicitly
+ * support when a ->mmap_validate() op is not provided in file_operations.
+ */
+#define LEGACY_MAP_MASK (MAP_SHARED \
+ | MAP_PRIVATE \
+ | MAP_FIXED \
+ | MAP_ANONYMOUS \
+ | MAP_DENYWRITE \
+ | MAP_EXECUTABLE \
+ | MAP_UNINITIALIZED \
+ | MAP_GROWSDOWN \
+ | MAP_LOCKED \
+ | MAP_NORESERVE \
+ | MAP_POPULATE \
+ | MAP_NONBLOCK \
+ | MAP_STACK \
+ | MAP_HUGETLB \
+ | MAP_32BIT \
+ | MAP_HUGE_2MB \
+ | MAP_HUGE_1GB)
+
extern int sysctl_overcommit_memory;
extern int sysctl_overcommit_ratio;
extern unsigned long sysctl_overcommit_kbytes;
diff --git a/include/uapi/asm-generic/mman-common.h b/include/uapi/asm-generic/mman-common.h
index 203268f9231e..ac55d1c0ec0f 100644
--- a/include/uapi/asm-generic/mman-common.h
+++ b/include/uapi/asm-generic/mman-common.h
@@ -24,6 +24,7 @@
#else
# define MAP_UNINITIALIZED 0x0 /* Don't support this flag */
#endif
+#define MAP_SHARED_VALIDATE 0x3 /* share + validate extension flags */
/*
* Flags for mlock
diff --git a/mm/mmap.c b/mm/mmap.c
index 680506faceae..a1bcaa9eff42 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1389,6 +1389,18 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
struct inode *inode = file_inode(file);
switch (flags & MAP_TYPE) {
+ case (MAP_SHARED_VALIDATE):
+ if ((flags & ~LEGACY_MAP_MASK) == 0) {
+ /*
+ * If all legacy mmap flags, downgrade
+ * to MAP_SHARED, i.e. invoke ->mmap()
+ * instead of ->mmap_validate()
+ */
+ flags &= ~MAP_TYPE;
+ flags |= MAP_SHARED;
+ } else if (!file->f_op->mmap_validate)
+ return -EOPNOTSUPP;
+ /* fall through */
case MAP_SHARED:
if ((prot&PROT_WRITE) && !(file->f_mode&FMODE_WRITE))
return -EACCES;
@@ -1465,7 +1477,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
vm_flags |= VM_NORESERVE;
}
- addr = mmap_region(file, addr, len, vm_flags, pgoff, uf);
+ addr = mmap_region(file, addr, len, vm_flags, pgoff, uf, flags);
if (!IS_ERR_VALUE(addr) &&
((vm_flags & VM_LOCKED) ||
(flags & (MAP_POPULATE | MAP_NONBLOCK)) == MAP_POPULATE))
@@ -1602,7 +1614,7 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
unsigned long mmap_region(struct file *file, unsigned long addr,
unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
- struct list_head *uf)
+ struct list_head *uf, unsigned long map_flags)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma, *prev;
@@ -1687,7 +1699,10 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
* new file must not have been exposed to user-space, yet.
*/
vma->vm_file = get_file(file);
- error = call_mmap(file, vma);
+ if ((map_flags & MAP_TYPE) == MAP_SHARED_VALIDATE)
+ error = file->f_op->mmap_validate(file, vma, map_flags);
+ else
+ error = call_mmap(file, vma);
if (error)
goto unmap_and_free_vma;
diff --git a/tools/include/uapi/asm-generic/mman-common.h b/tools/include/uapi/asm-generic/mman-common.h
index 203268f9231e..ac55d1c0ec0f 100644
--- a/tools/include/uapi/asm-generic/mman-common.h
+++ b/tools/include/uapi/asm-generic/mman-common.h
@@ -24,6 +24,7 @@
#else
# define MAP_UNINITIALIZED 0x0 /* Don't support this flag */
#endif
+#define MAP_SHARED_VALIDATE 0x3 /* share + validate extension flags */
/*
* Flags for mlock
--
2.12.3
Now that everything is prepared, add support in ext4 to accept MAP_SYNC
as an mmap(2) flag.
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/file.c | 20 ++++++++++++++++++++
1 file changed, 20 insertions(+)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 61a8788168f3..f013cda84b3d 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -364,6 +364,25 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
return 0;
}
+#define EXT4_SUPPORTED_MAP_FLAGS (LEGACY_MAP_MASK | MAP_SYNC)
+
+static int ext4_file_mmap_validate(struct file *file,
+ struct vm_area_struct *vma,
+ unsigned long map_flags)
+{
+ if (map_flags & ~EXT4_SUPPORTED_MAP_FLAGS)
+ return -EOPNOTSUPP;
+
+ /*
+ * We don't support synchronous mappings for non-DAX files. At least
+ * until someone comes with a sensible use case.
+ */
+ if (!IS_DAX(file_inode(file)) && (map_flags & MAP_SYNC))
+ return -EOPNOTSUPP;
+
+ return ext4_file_mmap(file, vma);
+}
+
static int ext4_file_open(struct inode * inode, struct file * filp)
{
struct super_block *sb = inode->i_sb;
@@ -723,6 +742,7 @@ const struct file_operations ext4_file_operations = {
.compat_ioctl = ext4_compat_ioctl,
#endif
.mmap = ext4_file_mmap,
+ .mmap_validate = ext4_file_mmap_validate,
.open = ext4_file_open,
.release = ext4_release_file,
.fsync = ext4_sync_file,
--
2.12.3
From: Christoph Hellwig <[email protected]>
Return IOMAP_F_DIRTY from xfs_file_iomap_begin() for a synchronous
write fault when the inode is pinned and has dirty fields other than the
timestamps. In __xfs_filemap_fault() we then detect this case and
call dax_finish_sync_fault() to make sure all metadata is committed, and
to insert the page table entry.
Note that this will also dirty the corresponding radix tree entry, which is
what we want - fsync(2) will still provide data integrity guarantees for
applications not using userspace flushing. And applications using
userspace flushing can avoid calling fsync(2) and thus avoid its
performance overhead.
Signed-off-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/xfs/xfs_file.c | 6 +++++-
fs/xfs/xfs_iomap.c | 5 +++++
2 files changed, 10 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 7c6b8def6eed..c45f24ffab22 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1040,7 +1040,11 @@ __xfs_filemap_fault(
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
if (IS_DAX(inode)) {
- ret = dax_iomap_fault(vmf, pe_size, NULL, &xfs_iomap_ops);
+ pfn_t pfn;
+
+ ret = dax_iomap_fault(vmf, pe_size, &pfn, &xfs_iomap_ops);
+ if (ret & VM_FAULT_NEEDDSYNC)
+ ret = dax_finish_sync_fault(vmf, pe_size, pfn);
} else {
if (write_fault)
ret = iomap_page_mkwrite(vmf, &xfs_iomap_ops);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index f179bdf1644d..b43be199fbdf 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -33,6 +33,7 @@
#include "xfs_error.h"
#include "xfs_trans.h"
#include "xfs_trans_space.h"
+#include "xfs_inode_item.h"
#include "xfs_iomap.h"
#include "xfs_trace.h"
#include "xfs_icache.h"
@@ -1086,6 +1087,10 @@ xfs_file_iomap_begin(
trace_xfs_iomap_found(ip, offset, length, 0, &imap);
}
+ if ((flags & IOMAP_WRITE) && xfs_ipincount(ip) &&
+ (ip->i_itemp->ili_fsync_fields & ~XFS_ILOG_TIMESTAMP))
+ iomap->flags |= IOMAP_F_DIRTY;
+
xfs_bmbt_to_iomap(ip, iomap, &imap);
if (shared)
--
2.12.3
If starting a transaction fails, just bail out of the function immediately
instead of checking for that condition throughout the function.
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/file.c | 13 +++++++------
1 file changed, 7 insertions(+), 6 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 3cec0b95672f..208adfc3e673 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -302,16 +302,17 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
down_read(&EXT4_I(inode)->i_mmap_sem);
handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
EXT4_DATA_TRANS_BLOCKS(sb));
+ if (IS_ERR(handle)) {
+ up_read(&EXT4_I(inode)->i_mmap_sem);
+ sb_end_pagefault(sb);
+ return VM_FAULT_SIGBUS;
+ }
} else {
down_read(&EXT4_I(inode)->i_mmap_sem);
}
- if (!IS_ERR(handle))
- result = dax_iomap_fault(vmf, pe_size, NULL, &ext4_iomap_ops);
- else
- result = VM_FAULT_SIGBUS;
+ result = dax_iomap_fault(vmf, pe_size, NULL, &ext4_iomap_ops);
if (write) {
- if (!IS_ERR(handle))
- ext4_journal_stop(handle);
+ ext4_journal_stop(handle);
up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(sb);
} else {
--
2.12.3
We return the IOMAP_F_DIRTY flag from ext4_iomap_begin() for a
synchronous write fault when the inode has some uncommitted metadata
changes. In the fault handler ext4_dax_huge_fault() we then detect this
case and call dax_finish_sync_fault() to make sure all metadata is
committed (via vfs_fsync_range()) and to insert the page table entry.
Note that this will also dirty the corresponding radix tree entry, which is
what we want - fsync(2) will still provide data integrity guarantees for
applications not using userspace flushing. And applications using userspace
flushing can avoid calling fsync(2) and thus avoid its performance overhead.
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/file.c | 6 +++++-
fs/ext4/inode.c | 15 +++++++++++++++
fs/jbd2/journal.c | 17 +++++++++++++++++
include/linux/jbd2.h | 1 +
4 files changed, 38 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 208adfc3e673..61a8788168f3 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -295,6 +295,7 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
*/
bool write = (vmf->flags & FAULT_FLAG_WRITE) &&
(vmf->vma->vm_flags & VM_SHARED);
+ pfn_t pfn;
if (write) {
sb_start_pagefault(sb);
@@ -310,9 +311,12 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
} else {
down_read(&EXT4_I(inode)->i_mmap_sem);
}
- result = dax_iomap_fault(vmf, pe_size, NULL, &ext4_iomap_ops);
+ result = dax_iomap_fault(vmf, pe_size, &pfn, &ext4_iomap_ops);
if (write) {
ext4_journal_stop(handle);
+ /* Handling synchronous page fault? */
+ if (result & VM_FAULT_NEEDDSYNC)
+ result = dax_finish_sync_fault(vmf, pe_size, pfn);
up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(sb);
} else {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 31db875bc7a1..13a198924a0f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3394,6 +3394,19 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
}
#ifdef CONFIG_FS_DAX
+static bool ext4_inode_datasync_dirty(struct inode *inode)
+{
+ journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+
+ if (journal)
+ return !jbd2_transaction_committed(journal,
+ EXT4_I(inode)->i_datasync_tid);
+ /* Any metadata buffers to write? */
+ if (!list_empty(&inode->i_mapping->private_list))
+ return true;
+ return inode->i_state & I_DIRTY_DATASYNC;
+}
+
static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap)
{
@@ -3466,6 +3479,8 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
}
iomap->flags = 0;
+ if ((flags & IOMAP_WRITE) && ext4_inode_datasync_dirty(inode))
+ iomap->flags |= IOMAP_F_DIRTY;
iomap->bdev = inode->i_sb->s_bdev;
iomap->dax_dev = sbi->s_daxdev;
iomap->offset = first_block << blkbits;
diff --git a/fs/jbd2/journal.c b/fs/jbd2/journal.c
index 7d5ef3bf3f3e..fa8cde498b4b 100644
--- a/fs/jbd2/journal.c
+++ b/fs/jbd2/journal.c
@@ -738,6 +738,23 @@ int jbd2_log_wait_commit(journal_t *journal, tid_t tid)
return err;
}
+/* Return 1 when transaction with given tid has already committed. */
+int jbd2_transaction_committed(journal_t *journal, tid_t tid)
+{
+ int ret = 1;
+
+ read_lock(&journal->j_state_lock);
+ if (journal->j_running_transaction &&
+ journal->j_running_transaction->t_tid == tid)
+ ret = 0;
+ if (journal->j_committing_transaction &&
+ journal->j_committing_transaction->t_tid == tid)
+ ret = 0;
+ read_unlock(&journal->j_state_lock);
+ return ret;
+}
+EXPORT_SYMBOL(jbd2_transaction_committed);
+
/*
* When this function returns the transaction corresponding to tid
* will be completed. If the transaction has currently running, start
diff --git a/include/linux/jbd2.h b/include/linux/jbd2.h
index 606b6bce3a5b..296d1e0ea87b 100644
--- a/include/linux/jbd2.h
+++ b/include/linux/jbd2.h
@@ -1367,6 +1367,7 @@ int jbd2_log_start_commit(journal_t *journal, tid_t tid);
int __jbd2_log_start_commit(journal_t *journal, tid_t tid);
int jbd2_journal_start_commit(journal_t *journal, tid_t *tid);
int jbd2_log_wait_commit(journal_t *journal, tid_t tid);
+int jbd2_transaction_committed(journal_t *journal, tid_t tid);
int jbd2_complete_transaction(journal_t *journal, tid_t tid);
int jbd2_log_do_checkpoint(journal_t *journal);
int jbd2_trans_will_send_data_barrier(journal_t *journal, tid_t tid);
--
2.12.3
Define new MAP_SYNC flag and corresponding VMA VM_SYNC flag. As the
MAP_SYNC flag is not part of LEGACY_MAP_MASK, currently it will be
refused by all MAP_SHARED_VALIDATE map attempts and silently ignored for
everything else.
Signed-off-by: Jan Kara <[email protected]>
---
arch/xtensa/include/uapi/asm/mman.h | 1 +
fs/proc/task_mmu.c | 1 +
include/linux/mm.h | 1 +
include/linux/mman.h | 3 ++-
include/uapi/asm-generic/mman.h | 1 +
5 files changed, 6 insertions(+), 1 deletion(-)
diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
index ec597900eec7..b62a7ce166fb 100644
--- a/arch/xtensa/include/uapi/asm/mman.h
+++ b/arch/xtensa/include/uapi/asm/mman.h
@@ -56,6 +56,7 @@
#define MAP_NONBLOCK 0x20000 /* do not block on IO */
#define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x80000 /* create a huge page mapping */
+#define MAP_SYNC 0x100000 /* perform synchronous page faults for the mapping */
#ifdef CONFIG_MMAP_ALLOW_UNINITIALIZED
# define MAP_UNINITIALIZED 0x4000000 /* For anonymous mmap, memory could be
* uninitialized */
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 5589b4bd4b85..ea78b37deeaa 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -664,6 +664,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
[ilog2(VM_ACCOUNT)] = "ac",
[ilog2(VM_NORESERVE)] = "nr",
[ilog2(VM_HUGETLB)] = "ht",
+ [ilog2(VM_SYNC)] = "sf",
[ilog2(VM_ARCH_1)] = "ar",
[ilog2(VM_WIPEONFORK)] = "wf",
[ilog2(VM_DONTDUMP)] = "dd",
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 55a9d2d109d9..cb62f2127314 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -189,6 +189,7 @@ extern unsigned int kobjsize(const void *objp);
#define VM_ACCOUNT 0x00100000 /* Is a VM accounted object */
#define VM_NORESERVE 0x00200000 /* should the VM suppress accounting */
#define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
+#define VM_SYNC 0x00800000 /* Synchronous page faults */
#define VM_ARCH_1 0x01000000 /* Architecture-specific flag */
#define VM_WIPEONFORK 0x02000000 /* Wipe VMA contents in child. */
#define VM_DONTDUMP 0x04000000 /* Do not include in the core dump */
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 94b63b4d71ff..2442abc7bb74 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -125,7 +125,8 @@ calc_vm_flag_bits(unsigned long flags)
{
return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
_calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) |
- _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED );
+ _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
+ _calc_vm_trans(flags, MAP_SYNC, VM_SYNC );
}
unsigned long vm_commit_limit(void);
diff --git a/include/uapi/asm-generic/mman.h b/include/uapi/asm-generic/mman.h
index 7162cd4cca73..00e55627d2df 100644
--- a/include/uapi/asm-generic/mman.h
+++ b/include/uapi/asm-generic/mman.h
@@ -12,6 +12,7 @@
#define MAP_NONBLOCK 0x10000 /* do not block on IO */
#define MAP_STACK 0x20000 /* give out an address that is best suited for process/thread stacks */
#define MAP_HUGETLB 0x40000 /* create a huge page mapping */
+#define MAP_SYNC 0x80000 /* perform synchronous page faults for the mapping */
/* Bits [26:31] are reserved, see mman-common.h for MAP_HUGETLB usage */
--
2.12.3
Implement a function that filesystems can call to finish handling of
synchronous page faults. It takes care of syncing the appropriate file range
and of inserting the page table entry.
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 83 +++++++++++++++++++++++++++++++++++++++++++
include/linux/dax.h | 2 ++
include/trace/events/fs_dax.h | 2 ++
3 files changed, 87 insertions(+)
diff --git a/fs/dax.c b/fs/dax.c
index bb9ff907738c..98840043b88d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1492,3 +1492,86 @@ int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
}
}
EXPORT_SYMBOL_GPL(dax_iomap_fault);
+
+/**
+ * dax_insert_pfn_mkwrite - insert PTE or PMD entry into page tables
+ * @vmf: The description of the fault
+ * @pe_size: Size of entry to be inserted
+ * @pfn: PFN to insert
+ *
+ * This function inserts writeable PTE or PMD entry into page tables for mmaped
+ * DAX file. It takes care of marking corresponding radix tree entry as dirty
+ * as well.
+ */
+static int dax_insert_pfn_mkwrite(struct vm_fault *vmf,
+ enum page_entry_size pe_size,
+ pfn_t pfn)
+{
+ struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+ void *entry, **slot;
+ pgoff_t index = vmf->pgoff;
+ int vmf_ret, error;
+
+ spin_lock_irq(&mapping->tree_lock);
+ entry = get_unlocked_mapping_entry(mapping, index, &slot);
+ /* Did we race with someone splitting entry or so? */
+ if (!entry || (pe_size == PE_SIZE_PTE && !dax_is_pte_entry(entry)) ||
+ (pe_size == PE_SIZE_PMD && !dax_is_pmd_entry(entry))) {
+ put_unlocked_mapping_entry(mapping, index, entry);
+ spin_unlock_irq(&mapping->tree_lock);
+ trace_dax_insert_pfn_mkwrite_no_entry(mapping->host, vmf,
+ VM_FAULT_NOPAGE);
+ return VM_FAULT_NOPAGE;
+ }
+ radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
+ entry = lock_slot(mapping, slot);
+ spin_unlock_irq(&mapping->tree_lock);
+ switch (pe_size) {
+ case PE_SIZE_PTE:
+ error = vm_insert_mixed_mkwrite(vmf->vma, vmf->address, pfn);
+ vmf_ret = dax_fault_return(error);
+ break;
+#ifdef CONFIG_FS_DAX_PMD
+ case PE_SIZE_PMD:
+ vmf_ret = vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd,
+ pfn, true);
+ break;
+#endif
+ default:
+ vmf_ret = VM_FAULT_FALLBACK;
+ }
+ put_locked_mapping_entry(mapping, index);
+ trace_dax_insert_pfn_mkwrite(mapping->host, vmf, vmf_ret);
+ return vmf_ret;
+}
+
+/**
+ * dax_finish_sync_fault - finish synchronous page fault
+ * @vmf: The description of the fault
+ * @pe_size: Size of entry to be inserted
+ * @pfn: PFN to insert
+ *
+ * This function calls fdatasync() on the range of file touched by the page
+ * fault and handles inserting of appropriate page table entry.
+ */
+int dax_finish_sync_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
+ pfn_t pfn)
+{
+ int err;
+ loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT;
+ size_t len = 0;
+
+ if (pe_size == PE_SIZE_PTE)
+ len = PAGE_SIZE;
+#ifdef CONFIG_FS_DAX_PMD
+ else if (pe_size == PE_SIZE_PMD)
+ len = PMD_SIZE;
+#endif
+ else
+ WARN_ON_ONCE(1);
+ err = vfs_fsync_range(vmf->vma->vm_file, start, start + len - 1, 1);
+ if (err)
+ return VM_FAULT_SIGBUS;
+ return dax_insert_pfn_mkwrite(vmf, pe_size, pfn);
+}
+EXPORT_SYMBOL_GPL(dax_finish_sync_fault);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index e7fa4b8f45bc..d403f78b706c 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -96,6 +96,8 @@ ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops);
int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
pfn_t *pfnp, const struct iomap_ops *ops);
+int dax_finish_sync_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
+ pfn_t pfn);
int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
pgoff_t index);
diff --git a/include/trace/events/fs_dax.h b/include/trace/events/fs_dax.h
index 88a9d19b8ff8..7725459fafef 100644
--- a/include/trace/events/fs_dax.h
+++ b/include/trace/events/fs_dax.h
@@ -190,6 +190,8 @@ DEFINE_EVENT(dax_pte_fault_class, name, \
DEFINE_PTE_FAULT_EVENT(dax_pte_fault);
DEFINE_PTE_FAULT_EVENT(dax_pte_fault_done);
DEFINE_PTE_FAULT_EVENT(dax_load_hole);
+DEFINE_PTE_FAULT_EVENT(dax_insert_pfn_mkwrite_no_entry);
+DEFINE_PTE_FAULT_EVENT(dax_insert_pfn_mkwrite);
TRACE_EVENT(dax_insert_mapping,
TP_PROTO(struct inode *inode, struct vm_fault *vmf, void *radix_entry),
--
2.12.3
Factor out the code to get a pfn out of an iomap that is shared between the
PTE and PMD fault paths.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 83 +++++++++++++++++++++++++++++++++-------------------------------
1 file changed, 43 insertions(+), 40 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 0bc42ac294ca..116eef8d6c69 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -825,30 +825,53 @@ static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
}
-static int dax_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
- loff_t pos, void *entry)
+static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
+ pfn_t *pfnp)
{
const sector_t sector = dax_iomap_sector(iomap, pos);
- struct vm_area_struct *vma = vmf->vma;
- struct address_space *mapping = vma->vm_file->f_mapping;
- unsigned long vaddr = vmf->address;
- void *ret, *kaddr;
pgoff_t pgoff;
+ void *kaddr;
int id, rc;
- pfn_t pfn;
+ long length;
- rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, &pgoff);
+ rc = bdev_dax_pgoff(iomap->bdev, sector, size, &pgoff);
if (rc)
return rc;
-
id = dax_read_lock();
- rc = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(PAGE_SIZE),
- &kaddr, &pfn);
- if (rc < 0) {
- dax_read_unlock(id);
- return rc;
+ length = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(size),
+ &kaddr, pfnp);
+ if (length < 0) {
+ rc = length;
+ goto out;
}
+ rc = -EINVAL;
+ if (PFN_PHYS(length) < size)
+ goto out;
+ if (pfn_t_to_pfn(*pfnp) & (PHYS_PFN(size)-1))
+ goto out;
+ /* For larger pages we need devmap */
+ if (length > 1 && !pfn_t_devmap(*pfnp))
+ goto out;
+ rc = 0;
+out:
dax_read_unlock(id);
+ return rc;
+}
+
+static int dax_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
+ loff_t pos, void *entry)
+{
+ const sector_t sector = dax_iomap_sector(iomap, pos);
+ struct vm_area_struct *vma = vmf->vma;
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ unsigned long vaddr = vmf->address;
+ void *ret;
+ int rc;
+ pfn_t pfn;
+
+ rc = dax_iomap_pfn(iomap, pos, PAGE_SIZE, &pfn);
+ if (rc < 0)
+ return rc;
ret = dax_insert_mapping_entry(mapping, vmf, entry, sector, 0);
if (IS_ERR(ret))
@@ -1223,46 +1246,26 @@ static int dax_pmd_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
{
struct address_space *mapping = vmf->vma->vm_file->f_mapping;
const sector_t sector = dax_iomap_sector(iomap, pos);
- struct dax_device *dax_dev = iomap->dax_dev;
- struct block_device *bdev = iomap->bdev;
struct inode *inode = mapping->host;
- const size_t size = PMD_SIZE;
- void *ret = NULL, *kaddr;
- long length = 0;
- pgoff_t pgoff;
+ void *ret = NULL;
pfn_t pfn = {};
- int id;
+ int rc;
- if (bdev_dax_pgoff(bdev, sector, size, &pgoff) != 0)
+ rc = dax_iomap_pfn(iomap, pos, PMD_SIZE, &pfn);
+ if (rc < 0)
goto fallback;
- id = dax_read_lock();
- length = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), &kaddr, &pfn);
- if (length < 0)
- goto unlock_fallback;
- length = PFN_PHYS(length);
-
- if (length < size)
- goto unlock_fallback;
- if (pfn_t_to_pfn(pfn) & PG_PMD_COLOUR)
- goto unlock_fallback;
- if (!pfn_t_devmap(pfn))
- goto unlock_fallback;
- dax_read_unlock(id);
-
ret = dax_insert_mapping_entry(mapping, vmf, entry, sector,
RADIX_DAX_PMD);
if (IS_ERR(ret))
goto fallback;
- trace_dax_pmd_insert_mapping(inode, vmf, length, pfn, ret);
+ trace_dax_pmd_insert_mapping(inode, vmf, PMD_SIZE, pfn, ret);
return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd,
pfn, vmf->flags & FAULT_FLAG_WRITE);
-unlock_fallback:
- dax_read_unlock(id);
fallback:
- trace_dax_pmd_insert_mapping_fallback(inode, vmf, length, pfn, ret);
+ trace_dax_pmd_insert_mapping_fallback(inode, vmf, PMD_SIZE, pfn, ret);
return VM_FAULT_FALLBACK;
}
--
2.12.3
Add a flag to the iomap interface informing the caller that the inode needs
fdatasync(2) for the returned extent to become persistent and use it in the
DAX fault code so that we don't map such extents into page tables
immediately. Instead we propagate the information that fdatasync(2) is
necessary from dax_iomap_fault() with a new VM_FAULT_NEEDDSYNC flag.
The filesystem fault handler is then responsible for calling fdatasync(2)
and inserting the pfn into page tables.
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 39 +++++++++++++++++++++++++++++++++++++--
include/linux/iomap.h | 1 +
include/linux/mm.h | 6 +++++-
3 files changed, 43 insertions(+), 3 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index efc210ff6665..bb9ff907738c 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1091,6 +1091,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
unsigned flags = IOMAP_FAULT;
int error, major = 0;
bool write = vmf->flags & FAULT_FLAG_WRITE;
+ bool sync;
int vmf_ret = 0;
void *entry;
pfn_t pfn;
@@ -1169,6 +1170,8 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
goto finish_iomap;
}
+ sync = (vma->vm_flags & VM_SYNC) && (iomap.flags & IOMAP_F_DIRTY);
+
switch (iomap.type) {
case IOMAP_MAPPED:
if (iomap.flags & IOMAP_F_NEW) {
@@ -1182,12 +1185,27 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
entry = dax_insert_mapping_entry(mapping, vmf, entry,
dax_iomap_sector(&iomap, pos),
- 0, write);
+ 0, write && !sync);
if (IS_ERR(entry)) {
error = PTR_ERR(entry);
goto error_finish_iomap;
}
+ /*
+ * If we are doing synchronous page fault and inode needs fsync,
+ * we can insert PTE into page tables only after that happens.
+ * Skip insertion for now and return the pfn so that caller can
+ * insert it after fsync is done.
+ */
+ if (sync) {
+ if (WARN_ON_ONCE(!pfnp)) {
+ error = -EIO;
+ goto error_finish_iomap;
+ }
+ *pfnp = pfn;
+ vmf_ret = VM_FAULT_NEEDDSYNC | major;
+ goto finish_iomap;
+ }
trace_dax_insert_mapping(inode, vmf, entry);
if (write)
error = vm_insert_mixed_mkwrite(vma, vaddr, pfn);
@@ -1287,6 +1305,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
struct address_space *mapping = vma->vm_file->f_mapping;
unsigned long pmd_addr = vmf->address & PMD_MASK;
bool write = vmf->flags & FAULT_FLAG_WRITE;
+ bool sync;
unsigned int iomap_flags = (write ? IOMAP_WRITE : 0) | IOMAP_FAULT;
struct inode *inode = mapping->host;
int result = VM_FAULT_FALLBACK;
@@ -1371,6 +1390,8 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
if (iomap.offset + iomap.length < pos + PMD_SIZE)
goto finish_iomap;
+ sync = (vma->vm_flags & VM_SYNC) && (iomap.flags & IOMAP_F_DIRTY);
+
switch (iomap.type) {
case IOMAP_MAPPED:
error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn);
@@ -1379,10 +1400,24 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
entry = dax_insert_mapping_entry(mapping, vmf, entry,
dax_iomap_sector(&iomap, pos),
- RADIX_DAX_PMD, write);
+ RADIX_DAX_PMD, write && !sync);
if (IS_ERR(entry))
goto finish_iomap;
+ /*
+ * If we are doing synchronous page fault and inode needs fsync,
+ * we can insert PMD into page tables only after that happens.
+ * Skip insertion for now and return the pfn so that caller can
+ * insert it after fsync is done.
+ */
+ if (sync) {
+ if (WARN_ON_ONCE(!pfnp))
+ goto finish_iomap;
+ *pfnp = pfn;
+ result = VM_FAULT_NEEDDSYNC;
+ goto finish_iomap;
+ }
+
trace_dax_pmd_insert_mapping(inode, vmf, PMD_SIZE, pfn, entry);
result = vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
write);
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index f64dc6ce5161..4bc0a6fe3b15 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -22,6 +22,7 @@ struct vm_fault;
* Flags for all iomap mappings:
*/
#define IOMAP_F_NEW 0x01 /* blocks have been newly allocated */
+#define IOMAP_F_DIRTY 0x02 /* block mapping is not yet on persistent storage */
/*
* Flags that only need to be reported for IOMAP_REPORT requests:
diff --git a/include/linux/mm.h b/include/linux/mm.h
index cb62f2127314..44c4c0da2da1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1182,6 +1182,9 @@ static inline void clear_page_pfmemalloc(struct page *page)
#define VM_FAULT_RETRY 0x0400 /* ->fault blocked, must retry */
#define VM_FAULT_FALLBACK 0x0800 /* huge page fault failed, fall back to small */
#define VM_FAULT_DONE_COW 0x1000 /* ->fault has fully handled COW */
+#define VM_FAULT_NEEDDSYNC 0x2000 /* ->fault did not modify page tables
+ * and needs fsync() to complete (for
+ * synchronous page faults in DAX) */
#define VM_FAULT_ERROR (VM_FAULT_OOM | VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV | \
VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE | \
@@ -1199,7 +1202,8 @@ static inline void clear_page_pfmemalloc(struct page *page)
{ VM_FAULT_LOCKED, "LOCKED" }, \
{ VM_FAULT_RETRY, "RETRY" }, \
{ VM_FAULT_FALLBACK, "FALLBACK" }, \
- { VM_FAULT_DONE_COW, "DONE_COW" }
+ { VM_FAULT_DONE_COW, "DONE_COW" }, \
+ { VM_FAULT_NEEDDSYNC, "NEEDDSYNC" }
/* Encode hstate index for a hwpoisoned large page */
#define VM_FAULT_SET_HINDEX(x) ((x) << 12)
--
2.12.3
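The two-phase protocol this patch introduces — the fault handler returns VM_FAULT_NEEDDSYNC together with a PFN, and the filesystem then fsyncs and completes the page table insertion — can be sketched as a userspace simulation. All fake_* names and the bookkeeping struct below are hypothetical stand-ins for the kernel structures; only the control flow mirrors the patch:

```c
#include <assert.h>
#include <stdbool.h>

/* Local stand-in for the kernel's VM_FAULT_NEEDDSYNC bit (0x2000). */
#define FAULT_NEEDDSYNC 0x2000

struct fake_fault {
	unsigned long pfn;	/* PFN handed back for deferred insertion */
	bool metadata_dirty;	/* iomap reported IOMAP_F_DIRTY */
	bool map_sync;		/* VMA has VM_SYNC */
	bool pte_inserted;
	bool fsynced;
};

/*
 * Mirrors dax_iomap_fault(): when a synchronous fault hits an extent whose
 * metadata is not yet persistent, skip the PTE insertion and return the PFN.
 */
static int fake_dax_fault(struct fake_fault *f, unsigned long pfn)
{
	if (f->map_sync && f->metadata_dirty) {
		f->pfn = pfn;		/* returned via *pfnp in the patch */
		return FAULT_NEEDDSYNC;
	}
	f->pte_inserted = true;
	return 0;
}

/* Mirrors dax_finish_sync_fault(): fsync first, only then insert the entry. */
static void fake_finish_sync_fault(struct fake_fault *f)
{
	f->fsynced = true;	/* stands in for vfs_fsync_range() */
	f->pte_inserted = true;
}

/* Mirrors a filesystem fault handler reacting to the new flag. */
static int fake_filemap_fault(struct fake_fault *f)
{
	int ret = fake_dax_fault(f, 42);

	if (ret & FAULT_NEEDDSYNC)
		fake_finish_sync_fault(f);
	return 0;
}
```

The essential invariant is visible even in the simulation: a write-enabled entry never exists before the fsync step when both VM_SYNC and dirty metadata are in play.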
For synchronous page fault dax_iomap_fault() will need to return PFN
which will then need to be inserted into page tables after fsync()
completes. Add necessary parameter to dax_iomap_fault().
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 13 +++++++------
fs/ext2/file.c | 2 +-
fs/ext4/file.c | 2 +-
fs/xfs/xfs_file.c | 4 ++--
include/linux/dax.h | 2 +-
5 files changed, 12 insertions(+), 11 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 5214ed9ba508..5ddf15161390 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1079,7 +1079,7 @@ static int dax_fault_return(int error)
return VM_FAULT_SIGBUS;
}
-static int dax_iomap_pte_fault(struct vm_fault *vmf,
+static int dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
const struct iomap_ops *ops)
{
struct vm_area_struct *vma = vmf->vma;
@@ -1280,7 +1280,7 @@ static int dax_pmd_load_hole(struct vm_fault *vmf, struct iomap *iomap,
return VM_FAULT_FALLBACK;
}
-static int dax_iomap_pmd_fault(struct vm_fault *vmf,
+static int dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
const struct iomap_ops *ops)
{
struct vm_area_struct *vma = vmf->vma;
@@ -1425,7 +1425,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
return result;
}
#else
-static int dax_iomap_pmd_fault(struct vm_fault *vmf,
+static int dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
const struct iomap_ops *ops)
{
return VM_FAULT_FALLBACK;
@@ -1436,6 +1436,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
* dax_iomap_fault - handle a page fault on a DAX file
* @vmf: The description of the fault
* @pe_size: Size of the page to fault in
+ * @pfnp: PFN to insert for synchronous faults if fsync is required
* @ops: Iomap ops passed from the file system
*
* When a page fault occurs, filesystems may call this helper in
@@ -1444,13 +1445,13 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
* successfully.
*/
int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
- const struct iomap_ops *ops)
+ pfn_t *pfnp, const struct iomap_ops *ops)
{
switch (pe_size) {
case PE_SIZE_PTE:
- return dax_iomap_pte_fault(vmf, ops);
+ return dax_iomap_pte_fault(vmf, pfnp, ops);
case PE_SIZE_PMD:
- return dax_iomap_pmd_fault(vmf, ops);
+ return dax_iomap_pmd_fault(vmf, pfnp, ops);
default:
return VM_FAULT_FALLBACK;
}
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index ff3a3636a5ca..d2bb7c96307d 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -99,7 +99,7 @@ static int ext2_dax_fault(struct vm_fault *vmf)
}
down_read(&ei->dax_sem);
- ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &ext2_iomap_ops);
+ ret = dax_iomap_fault(vmf, PE_SIZE_PTE, NULL, &ext2_iomap_ops);
up_read(&ei->dax_sem);
if (vmf->flags & FAULT_FLAG_WRITE)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b1da660ac3bc..3cec0b95672f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -306,7 +306,7 @@ static int ext4_dax_huge_fault(struct vm_fault *vmf,
down_read(&EXT4_I(inode)->i_mmap_sem);
}
if (!IS_ERR(handle))
- result = dax_iomap_fault(vmf, pe_size, &ext4_iomap_ops);
+ result = dax_iomap_fault(vmf, pe_size, NULL, &ext4_iomap_ops);
else
result = VM_FAULT_SIGBUS;
if (write) {
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 309e26c9dddb..7c6b8def6eed 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1040,7 +1040,7 @@ __xfs_filemap_fault(
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
if (IS_DAX(inode)) {
- ret = dax_iomap_fault(vmf, pe_size, &xfs_iomap_ops);
+ ret = dax_iomap_fault(vmf, pe_size, NULL, &xfs_iomap_ops);
} else {
if (write_fault)
ret = iomap_page_mkwrite(vmf, &xfs_iomap_ops);
@@ -1111,7 +1111,7 @@ xfs_filemap_pfn_mkwrite(
if (vmf->pgoff >= size)
ret = VM_FAULT_SIGBUS;
else if (IS_DAX(inode))
- ret = dax_iomap_fault(vmf, PE_SIZE_PTE, &xfs_iomap_ops);
+ ret = dax_iomap_fault(vmf, PE_SIZE_PTE, NULL, &xfs_iomap_ops);
xfs_iunlock(ip, XFS_MMAPLOCK_SHARED);
sb_end_pagefault(inode->i_sb);
return ret;
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 122197124b9d..e7fa4b8f45bc 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -95,7 +95,7 @@ bool dax_write_cache_enabled(struct dax_device *dax_dev);
ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops);
int dax_iomap_fault(struct vm_fault *vmf, enum page_entry_size pe_size,
- const struct iomap_ops *ops);
+ pfn_t *pfnp, const struct iomap_ops *ops);
int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
int dax_invalidate_mapping_entry_sync(struct address_space *mapping,
pgoff_t index);
--
2.12.3
dax_pmd_insert_mapping() has only one callsite and we will need to
further fine-tune what it does for synchronous faults. Just inline it
into the callsite so that we don't have to pass awkward bools around.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 47 +++++++++++++++++--------------------------
include/trace/events/fs_dax.h | 1 -
2 files changed, 19 insertions(+), 29 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 5b20c6456926..675fab8ec41f 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1235,33 +1235,11 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
}
#ifdef CONFIG_FS_DAX_PMD
-static int dax_pmd_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
- loff_t pos, void *entry)
-{
- struct address_space *mapping = vmf->vma->vm_file->f_mapping;
- const sector_t sector = dax_iomap_sector(iomap, pos);
- struct inode *inode = mapping->host;
- void *ret = NULL;
- pfn_t pfn = {};
- int rc;
-
- rc = dax_iomap_pfn(iomap, pos, PMD_SIZE, &pfn);
- if (rc < 0)
- goto fallback;
-
- ret = dax_insert_mapping_entry(mapping, vmf, entry, sector,
- RADIX_DAX_PMD);
- if (IS_ERR(ret))
- goto fallback;
-
- trace_dax_pmd_insert_mapping(inode, vmf, PMD_SIZE, pfn, ret);
- return vmf_insert_pfn_pmd(vmf->vma, vmf->address, vmf->pmd,
- pfn, vmf->flags & FAULT_FLAG_WRITE);
-
-fallback:
- trace_dax_pmd_insert_mapping_fallback(inode, vmf, PMD_SIZE, pfn, ret);
- return VM_FAULT_FALLBACK;
-}
+/*
+ * The 'colour' (ie low bits) within a PMD of a page offset. This comes up
+ * more often than one might expect in the below functions.
+ */
+#define PG_PMD_COLOUR ((PMD_SIZE >> PAGE_SHIFT) - 1)
static int dax_pmd_load_hole(struct vm_fault *vmf, struct iomap *iomap,
void *entry)
@@ -1317,6 +1295,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
void *entry;
loff_t pos;
int error;
+ pfn_t pfn;
/*
* Check whether offset isn't beyond end of file now. Caller is
@@ -1394,7 +1373,19 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
switch (iomap.type) {
case IOMAP_MAPPED:
- result = dax_pmd_insert_mapping(vmf, &iomap, pos, entry);
+ error = dax_iomap_pfn(&iomap, pos, PMD_SIZE, &pfn);
+ if (error < 0)
+ goto finish_iomap;
+
+ entry = dax_insert_mapping_entry(mapping, vmf, entry,
+ dax_iomap_sector(&iomap, pos),
+ RADIX_DAX_PMD);
+ if (IS_ERR(entry))
+ goto finish_iomap;
+
+ trace_dax_pmd_insert_mapping(inode, vmf, PMD_SIZE, pfn, entry);
+ result = vmf_insert_pfn_pmd(vma, vmf->address, vmf->pmd, pfn,
+ write);
break;
case IOMAP_UNWRITTEN:
case IOMAP_HOLE:
diff --git a/include/trace/events/fs_dax.h b/include/trace/events/fs_dax.h
index fbc4a06f7310..88a9d19b8ff8 100644
--- a/include/trace/events/fs_dax.h
+++ b/include/trace/events/fs_dax.h
@@ -148,7 +148,6 @@ DEFINE_EVENT(dax_pmd_insert_mapping_class, name, \
TP_ARGS(inode, vmf, length, pfn, radix_entry))
DEFINE_PMD_INSERT_MAPPING_EVENT(dax_pmd_insert_mapping);
-DEFINE_PMD_INSERT_MAPPING_EVENT(dax_pmd_insert_mapping_fallback);
DECLARE_EVENT_CLASS(dax_pte_fault_class,
TP_PROTO(struct inode *inode, struct vm_fault *vmf, int result),
--
2.12.3
Now that everything is prepared, add support in xfs to accept MAP_SYNC
as an mmap(2) flag.
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/file.c | 1 +
fs/xfs/xfs_file.c | 23 +++++++++++++++++++++++
2 files changed, 24 insertions(+)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index f013cda84b3d..6b597cc6b29d 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -26,6 +26,7 @@
#include <linux/quotaops.h>
#include <linux/pagevec.h>
#include <linux/uio.h>
+#include <linux/mman.h>
#include "ext4.h"
#include "ext4_jbd2.h"
#include "xattr.h"
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index c45f24ffab22..fb135224476d 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -44,6 +44,7 @@
#include <linux/falloc.h>
#include <linux/pagevec.h>
#include <linux/backing-dev.h>
+#include <linux/mman.h>
static const struct vm_operations_struct xfs_file_vm_ops;
@@ -1142,6 +1143,27 @@ xfs_file_mmap(
return 0;
}
+#define XFS_MAP_SUPPORTED (LEGACY_MAP_MASK | MAP_SYNC)
+
+static int
+xfs_file_mmap_validate(
+ struct file *filp,
+ struct vm_area_struct *vma,
+ unsigned long map_flags)
+{
+ if (map_flags & ~XFS_MAP_SUPPORTED)
+ return -EOPNOTSUPP;
+
+ /*
+ * We don't support synchronous mappings for non-DAX files. At least
+ * until someone comes up with a sensible use case.
+ */
+ if (!IS_DAX(file_inode(filp)) && (map_flags & MAP_SYNC))
+ return -EOPNOTSUPP;
+
+ return xfs_file_mmap(filp, vma);
+}
+
const struct file_operations xfs_file_operations = {
.llseek = xfs_file_llseek,
.read_iter = xfs_file_read_iter,
@@ -1153,6 +1175,7 @@ const struct file_operations xfs_file_operations = {
.compat_ioctl = xfs_file_compat_ioctl,
#endif
.mmap = xfs_file_mmap,
+ .mmap_validate = xfs_file_mmap_validate,
.open = xfs_file_open,
.release = xfs_file_release,
.fsync = xfs_file_fsync,
--
2.12.3
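From userspace, the reliable way to use the flag wired up above is mmap(2) with MAP_SHARED_VALIDATE | MAP_SYNC plus a fallback path: MAP_SHARED_VALIDATE makes the kernel reject unknown flags (and, with this series, MAP_SYNC on non-DAX files) instead of silently ignoring them. A sketch follows; map_with_sync() is a hypothetical helper, and the flag values are the x86 uapi numbers, guarded in case the libc headers predate them:

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#ifndef MAP_SHARED_VALIDATE
#define MAP_SHARED_VALIDATE 0x03	/* uapi value, for older headers */
#endif
#ifndef MAP_SYNC
#define MAP_SYNC 0x80000		/* uapi value, for older headers */
#endif

/*
 * Try to map fd with synchronous page faults. Returns 1 for a MAP_SYNC
 * mapping, 0 after falling back to a plain shared mapping (caller must
 * then msync() for durability), -1 on error.
 */
int map_with_sync(int fd, size_t len, void **out)
{
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
	if (p != MAP_FAILED) {
		*out = p;
		return 1;
	}
	/* EOPNOTSUPP: not a DAX file; EINVAL: pre-MAP_SHARED_VALIDATE kernel */
	if (errno != EOPNOTSUPP && errno != EINVAL)
		return -1;
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (p == MAP_FAILED)
		return -1;
	*out = p;
	return 0;
}
```

This matches the detection scheme the cover letter borrows from the MAP_DIRECT series: applications get a hard error or a definite answer, never a silently ignored flag.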
Currently we dirty the radix tree entry whenever dax_insert_mapping_entry()
gets called for a write fault. With synchronous page faults we would
like to insert a clean radix tree entry and dirty it only once we call
fdatasync() and update page tables, to save some unnecessary cache
flushing. Add a 'dirty' argument to dax_insert_mapping_entry() for that.
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 14 +++++++-------
1 file changed, 7 insertions(+), 7 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 5ddf15161390..efc210ff6665 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -526,13 +526,13 @@ static int copy_user_dax(struct block_device *bdev, struct dax_device *dax_dev,
static void *dax_insert_mapping_entry(struct address_space *mapping,
struct vm_fault *vmf,
void *entry, sector_t sector,
- unsigned long flags)
+ unsigned long flags, bool dirty)
{
struct radix_tree_root *page_tree = &mapping->page_tree;
void *new_entry;
pgoff_t index = vmf->pgoff;
- if (vmf->flags & FAULT_FLAG_WRITE)
+ if (dirty)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
if (dax_is_zero_entry(entry) && !(flags & RADIX_DAX_ZERO_PAGE)) {
@@ -569,7 +569,7 @@ static void *dax_insert_mapping_entry(struct address_space *mapping,
entry = new_entry;
}
- if (vmf->flags & FAULT_FLAG_WRITE)
+ if (dirty)
radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
spin_unlock_irq(&mapping->tree_lock);
@@ -881,7 +881,7 @@ static int dax_load_hole(struct address_space *mapping, void *entry,
}
entry2 = dax_insert_mapping_entry(mapping, vmf, entry, 0,
- RADIX_DAX_ZERO_PAGE);
+ RADIX_DAX_ZERO_PAGE, false);
if (IS_ERR(entry2)) {
ret = VM_FAULT_SIGBUS;
goto out;
@@ -1182,7 +1182,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf, pfn_t *pfnp,
entry = dax_insert_mapping_entry(mapping, vmf, entry,
dax_iomap_sector(&iomap, pos),
- 0);
+ 0, write);
if (IS_ERR(entry)) {
error = PTR_ERR(entry);
goto error_finish_iomap;
@@ -1258,7 +1258,7 @@ static int dax_pmd_load_hole(struct vm_fault *vmf, struct iomap *iomap,
goto fallback;
ret = dax_insert_mapping_entry(mapping, vmf, entry, 0,
- RADIX_DAX_PMD | RADIX_DAX_ZERO_PAGE);
+ RADIX_DAX_PMD | RADIX_DAX_ZERO_PAGE, false);
if (IS_ERR(ret))
goto fallback;
@@ -1379,7 +1379,7 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf, pfn_t *pfnp,
entry = dax_insert_mapping_entry(mapping, vmf, entry,
dax_iomap_sector(&iomap, pos),
- RADIX_DAX_PMD);
+ RADIX_DAX_PMD, write);
if (IS_ERR(entry))
goto finish_iomap;
--
2.12.3
There are already two users and more are coming.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 116eef8d6c69..c09465884bbe 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1108,7 +1108,8 @@ static int dax_fault_return(int error)
static int dax_iomap_pte_fault(struct vm_fault *vmf,
const struct iomap_ops *ops)
{
- struct address_space *mapping = vmf->vma->vm_file->f_mapping;
+ struct vm_area_struct *vma = vmf->vma;
+ struct address_space *mapping = vma->vm_file->f_mapping;
struct inode *inode = mapping->host;
unsigned long vaddr = vmf->address;
loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
@@ -1196,7 +1197,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
case IOMAP_MAPPED:
if (iomap.flags & IOMAP_F_NEW) {
count_vm_event(PGMAJFAULT);
- count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
+ count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
major = VM_FAULT_MAJOR;
}
error = dax_insert_mapping(vmf, &iomap, pos, entry);
--
2.12.3
dax_insert_mapping() has only one callsite and we will need to further
fine-tune what it does for synchronous faults. Just inline it into the
callsite so that we don't have to pass awkward bools around.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 46 +++++++++++++++++++---------------------------
1 file changed, 19 insertions(+), 27 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 5ea71381dba0..5b20c6456926 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -858,32 +858,6 @@ static int dax_iomap_pfn(struct iomap *iomap, loff_t pos, size_t size,
return rc;
}
-static int dax_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
- loff_t pos, void *entry)
-{
- const sector_t sector = dax_iomap_sector(iomap, pos);
- struct vm_area_struct *vma = vmf->vma;
- struct address_space *mapping = vma->vm_file->f_mapping;
- unsigned long vaddr = vmf->address;
- void *ret;
- int rc;
- pfn_t pfn;
-
- rc = dax_iomap_pfn(iomap, pos, PAGE_SIZE, &pfn);
- if (rc < 0)
- return rc;
-
- ret = dax_insert_mapping_entry(mapping, vmf, entry, sector, 0);
- if (IS_ERR(ret))
- return PTR_ERR(ret);
-
- trace_dax_insert_mapping(mapping->host, vmf, ret);
- if (vmf->flags & FAULT_FLAG_WRITE)
- return vm_insert_mixed_mkwrite(vma, vaddr, pfn);
- else
- return vm_insert_mixed(vma, vaddr, pfn);
-}
-
/*
* The user has performed a load from a hole in the file. Allocating a new
* page in the file would cause excessive storage usage for workloads with
@@ -1119,6 +1093,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
bool write = vmf->flags & FAULT_FLAG_WRITE;
int vmf_ret = 0;
void *entry;
+ pfn_t pfn;
trace_dax_pte_fault(inode, vmf, vmf_ret);
/*
@@ -1201,7 +1176,24 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
count_memcg_event_mm(vma->vm_mm, PGMAJFAULT);
major = VM_FAULT_MAJOR;
}
- error = dax_insert_mapping(vmf, &iomap, pos, entry);
+ error = dax_iomap_pfn(&iomap, pos, PAGE_SIZE, &pfn);
+ if (error < 0)
+ goto error_finish_iomap;
+
+ entry = dax_insert_mapping_entry(mapping, vmf, entry,
+ dax_iomap_sector(&iomap, pos),
+ 0);
+ if (IS_ERR(entry)) {
+ error = PTR_ERR(entry);
+ goto error_finish_iomap;
+ }
+
+ trace_dax_insert_mapping(inode, vmf, entry);
+ if (write)
+ error = vm_insert_mixed_mkwrite(vma, vaddr, pfn);
+ else
+ error = vm_insert_mixed(vma, vaddr, pfn);
+
/* -EBUSY is fine, somebody else faulted on the same PTE */
if (error == -EBUSY)
error = 0;
--
2.12.3
dax_insert_mapping() has lots of arguments and a lot of them are actually
duplicated by passing the vm_fault structure as well. Change the function to
take the same arguments as dax_pmd_insert_mapping().
Reviewed-by: Ross Zwisler <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index f001d8c72a06..0bc42ac294ca 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -820,23 +820,30 @@ int dax_writeback_mapping_range(struct address_space *mapping,
}
EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
-static int dax_insert_mapping(struct address_space *mapping,
- struct block_device *bdev, struct dax_device *dax_dev,
- sector_t sector, size_t size, void *entry,
- struct vm_area_struct *vma, struct vm_fault *vmf)
+static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
+{
+ return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
+}
+
+static int dax_insert_mapping(struct vm_fault *vmf, struct iomap *iomap,
+ loff_t pos, void *entry)
{
+ const sector_t sector = dax_iomap_sector(iomap, pos);
+ struct vm_area_struct *vma = vmf->vma;
+ struct address_space *mapping = vma->vm_file->f_mapping;
unsigned long vaddr = vmf->address;
void *ret, *kaddr;
pgoff_t pgoff;
int id, rc;
pfn_t pfn;
- rc = bdev_dax_pgoff(bdev, sector, size, &pgoff);
+ rc = bdev_dax_pgoff(iomap->bdev, sector, PAGE_SIZE, &pgoff);
if (rc)
return rc;
id = dax_read_lock();
- rc = dax_direct_access(dax_dev, pgoff, PHYS_PFN(size), &kaddr, &pfn);
+ rc = dax_direct_access(iomap->dax_dev, pgoff, PHYS_PFN(PAGE_SIZE),
+ &kaddr, &pfn);
if (rc < 0) {
dax_read_unlock(id);
return rc;
@@ -936,11 +943,6 @@ int __dax_zero_page_range(struct block_device *bdev,
}
EXPORT_SYMBOL_GPL(__dax_zero_page_range);
-static sector_t dax_iomap_sector(struct iomap *iomap, loff_t pos)
-{
- return iomap->blkno + (((pos & PAGE_MASK) - iomap->offset) >> 9);
-}
-
static loff_t
dax_iomap_actor(struct inode *inode, loff_t pos, loff_t length, void *data,
struct iomap *iomap)
@@ -1087,7 +1089,6 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
struct inode *inode = mapping->host;
unsigned long vaddr = vmf->address;
loff_t pos = (loff_t)vmf->pgoff << PAGE_SHIFT;
- sector_t sector;
struct iomap iomap = { 0 };
unsigned flags = IOMAP_FAULT;
int error, major = 0;
@@ -1140,9 +1141,9 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
goto error_finish_iomap;
}
- sector = dax_iomap_sector(&iomap, pos);
-
if (vmf->cow_page) {
+ sector_t sector = dax_iomap_sector(&iomap, pos);
+
switch (iomap.type) {
case IOMAP_HOLE:
case IOMAP_UNWRITTEN:
@@ -1175,8 +1176,7 @@ static int dax_iomap_pte_fault(struct vm_fault *vmf,
count_memcg_event_mm(vmf->vma->vm_mm, PGMAJFAULT);
major = VM_FAULT_MAJOR;
}
- error = dax_insert_mapping(mapping, iomap.bdev, iomap.dax_dev,
- sector, PAGE_SIZE, entry, vmf->vma, vmf);
+ error = dax_insert_mapping(vmf, &iomap, pos, entry);
/* -EBUSY is fine, somebody else faulted on the same PTE */
if (error == -EBUSY)
error = 0;
--
2.12.3
Add missing argument description.
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/dax.c b/fs/dax.c
index 675fab8ec41f..5214ed9ba508 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1435,7 +1435,8 @@ static int dax_iomap_pmd_fault(struct vm_fault *vmf,
/**
* dax_iomap_fault - handle a page fault on a DAX file
* @vmf: The description of the fault
- * @ops: iomap ops passed from the file system
+ * @pe_size: Size of the page to fault in
+ * @ops: Iomap ops passed from the file system
*
* When a page fault occurs, filesystems may call this helper in
* their fault handler for DAX files. dax_iomap_fault() assumes the caller
--
2.12.3
On Wed, Oct 11, 2017 at 1:05 PM, Jan Kara <[email protected]> wrote:
> Hello,
>
> here is the third version of my patches to implement synchronous page faults
> for DAX mappings to make flushing of DAX mappings possible from userspace so
> that they can be flushed on finer than page granularity and also avoid the
> overhead of a syscall.
>
> We use a new mmap flag MAP_SYNC to indicate that page faults for the mapping
> should be synchronous. The guarantee provided by this flag is: While a block
> is writeably mapped into page tables of this mapping, it is guaranteed to be
> visible in the file at that offset also after a crash.
>
> How I implement this is that ->iomap_begin() indicates by a flag that inode
> block mapping metadata is unstable and may need flushing (use the same test as
> whether fdatasync() has metadata to write). If yes, DAX fault handler refrains
> from inserting / write-enabling the page table entry and returns special flag
> VM_FAULT_NEEDDSYNC together with a PFN to map to the filesystem fault handler.
> The handler then calls fdatasync() (vfs_fsync_range()) for the affected range
> and after that calls DAX code to update the page table entry appropriately.
>
> The first patch in this series is taken from Dan Williams' series for
> MAP_DIRECT so that we get a reliable way of detecting whether MAP_SYNC is
> supported or not.
>
> I did some basic performance testing on the patches over ramdisk - timed
> latency of page faults when faulting 512 pages. I did several tests: with file
> preallocated / with file empty, with background file copying going on / without
> it, with / without MAP_SYNC (so that we get comparison). The results are
> (numbers are in microseconds):
>
> File preallocated, no background load no MAP_SYNC:
> min=9 avg=10 max=46
> 8 - 15 us: 508
> 16 - 31 us: 3
> 32 - 63 us: 1
>
> File preallocated, no background load, MAP_SYNC:
> min=9 avg=10 max=47
> 8 - 15 us: 508
> 16 - 31 us: 2
> 32 - 63 us: 2
>
> File empty, no background load, no MAP_SYNC:
> min=21 avg=22 max=70
> 16 - 31 us: 506
> 32 - 63 us: 5
> 64 - 127 us: 1
>
> File empty, no background load, MAP_SYNC:
> min=40 avg=124 max=242
> 32 - 63 us: 1
> 64 - 127 us: 333
> 128 - 255 us: 178
>
> File empty, background load, no MAP_SYNC:
> min=21 avg=23 max=67
> 16 - 31 us: 507
> 32 - 63 us: 4
> 64 - 127 us: 1
>
> File empty, background load, MAP_SYNC:
> min=94 avg=112 max=181
> 64 - 127 us: 489
> 128 - 255 us: 23
>
> So here we can see the difference between MAP_SYNC vs non MAP_SYNC is about
> 100-200 us when we need to wait for transaction commit in this setup.
>
> Anyway, here are the patches and AFAICT the series is pretty much complete
> so we can start thinking how to merge this. Changes to ext4 / XFS are pretty
> minimal so either tree is fine I guess. Comments are welcome.
I'd like to propose taking this through the nvdimm tree. Some of these
changes make the MAP_DIRECT support for ext4 easier, so I'd like to
rebase that support on top and carry both.
On Wed, Oct 11, 2017 at 1:06 PM, Jan Kara <[email protected]> wrote:
> Now when everything is prepared, add support in ext4 to accept MAP_SYNC
> as an mmap(2) flag.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/ext4/file.c | 20 ++++++++++++++++++++
> 1 file changed, 20 insertions(+)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 61a8788168f3..f013cda84b3d 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -364,6 +364,25 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
> return 0;
> }
>
> +#define EXT4_SUPPORTED_MAP_FLAGS (LEGACY_MAP_MASK | MAP_SYNC)
> +
> +static int ext4_file_mmap_validate(struct file *file,
> + struct vm_area_struct *vma,
> + unsigned long map_flags)
> +{
> + if (map_flags & ~EXT4_SUPPORTED_MAP_FLAGS)
> + return -EOPNOTSUPP;
> +
> + /*
> + * We don't support synchronous mappings for non-DAX files. At least
> + * until someone comes with a sensible use case.
> + */
> + if (!IS_DAX(file_inode(file)) && (map_flags & MAP_SYNC))
> + return -EOPNOTSUPP;
Perhaps EPERM instead to differentiate the unsupported flags case?
There's precedent for mmap returning EPERM for reasons other than
incompatible PROT flags.
On Wed, Oct 11, 2017 at 1:06 PM, Jan Kara <[email protected]> wrote:
> We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin()
I assume this is a stale changelog from a previous version and should be
VM_FAULT_NEEDDSYNC?
On Wed, Oct 11, 2017 at 02:18:06PM -0700, Dan Williams wrote:
> On Wed, Oct 11, 2017 at 1:05 PM, Jan Kara <[email protected]> wrote:
> > Anyway, here are the patches and AFAICT the series is pretty much complete
> > so we can start thinking how to merge this. Changes to ext4 / XFS are pretty
> > minimal so either tree is fine I guess. Comments are welcome.
>
> I'd like to propose taking this through the nvdimm tree. Some of these
> changes make the MAP_DIRECT support for ext4 easier, so I'd like to
> rebase that support on top and carry both.
This functionality needs to be exercised and stressed by fstests
in some way before it gets merged.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed, Oct 11, 2017 at 1:06 PM, Jan Kara <[email protected]> wrote:
> Now when everything is prepared, add support in xfs to accept MAP_SYNC
> as an mmap(2) flag.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/ext4/file.c | 1 +
> fs/xfs/xfs_file.c | 23 +++++++++++++++++++++++
> 2 files changed, 24 insertions(+)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index f013cda84b3d..6b597cc6b29d 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -26,6 +26,7 @@
> #include <linux/quotaops.h>
> #include <linux/pagevec.h>
> #include <linux/uio.h>
> +#include <linux/mman.h>
> #include "ext4.h"
> #include "ext4_jbd2.h"
> #include "xattr.h"
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index c45f24ffab22..fb135224476d 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -44,6 +44,7 @@
> #include <linux/falloc.h>
> #include <linux/pagevec.h>
> #include <linux/backing-dev.h>
> +#include <linux/mman.h>
>
> static const struct vm_operations_struct xfs_file_vm_ops;
>
> @@ -1142,6 +1143,27 @@ xfs_file_mmap(
> return 0;
> }
>
> +#define XFS_MAP_SUPPORTED (LEGACY_MAP_MASK | MAP_SYNC)
> +
> +static int
> +xfs_file_mmap_validate(
> + struct file *filp,
> + struct vm_area_struct *vma,
> + unsigned long map_flags)
> +{
> + if (map_flags & ~XFS_MAP_SUPPORTED)
> + return -EOPNOTSUPP;
> +
> + /*
> + * We don't support synchronous mappings for non-DAX files. At least
> + * until someone comes with a sensible use case.
> + */
> + if (!IS_DAX(file_inode(filp)) && (map_flags & MAP_SYNC))
> + return -EOPNOTSUPP;
Same comment about using EPERM here. That's also what I'm returning in
the MAP_DIRECT case when the inode is a reflink inode and does not
support MAP_DIRECT.
On Wed, Oct 11, 2017 at 1:06 PM, Jan Kara <[email protected]> wrote:
> Now when everything is prepared, add support in xfs to accept MAP_SYNC
> as an mmap(2) flag.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/ext4/file.c | 1 +
> fs/xfs/xfs_file.c | 23 +++++++++++++++++++++++
> 2 files changed, 24 insertions(+)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index f013cda84b3d..6b597cc6b29d 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -26,6 +26,7 @@
> #include <linux/quotaops.h>
> #include <linux/pagevec.h>
> #include <linux/uio.h>
> +#include <linux/mman.h>
I assume you wanted this change earlier in the series.
On Wed, Oct 11, 2017 at 3:43 PM, Dave Chinner <[email protected]> wrote:
> On Wed, Oct 11, 2017 at 02:18:06PM -0700, Dan Williams wrote:
>> On Wed, Oct 11, 2017 at 1:05 PM, Jan Kara <[email protected]> wrote:
>> > Anyway, here are the patches and AFAICT the series is pretty much complete
>> > so we can start thinking how to merge this. Changes to ext4 / XFS are pretty
>> > minimal so either tree is fine I guess. Comments are welcome.
>>
>> I'd like to propose taking this through the nvdimm tree. Some of these
>> changes make the MAP_DIRECT support for ext4 easier, so I'd like to
>> rebase that support on top and carry both.
>
> This functionality needs to be exercised and stressed by fstests
> in some way before it gets merged.
Yes, and man page updates for mmap(2).
On Wed 11-10-17 15:11:21, Dan Williams wrote:
> On Wed, Oct 11, 2017 at 1:06 PM, Jan Kara <[email protected]> wrote:
> > Now when everything is prepared, add support in ext4 to accept MAP_SYNC
> > as an mmap(2) flag.
> >
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/ext4/file.c | 20 ++++++++++++++++++++
> > 1 file changed, 20 insertions(+)
> >
> > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> > index 61a8788168f3..f013cda84b3d 100644
> > --- a/fs/ext4/file.c
> > +++ b/fs/ext4/file.c
> > @@ -364,6 +364,25 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
> > return 0;
> > }
> >
> > +#define EXT4_SUPPORTED_MAP_FLAGS (LEGACY_MAP_MASK | MAP_SYNC)
> > +
> > +static int ext4_file_mmap_validate(struct file *file,
> > + struct vm_area_struct *vma,
> > + unsigned long map_flags)
> > +{
> > + if (map_flags & ~EXT4_SUPPORTED_MAP_FLAGS)
> > + return -EOPNOTSUPP;
> > +
> > + /*
> > + * We don't support synchronous mappings for non-DAX files. At least
> > + * until someone comes with a sensible use case.
> > + */
> > + if (!IS_DAX(file_inode(file)) && (map_flags & MAP_SYNC))
> > + return -EOPNOTSUPP;
>
> Perhaps EPERM instead to differentiate the unsupported flags case?
> There's precedent for mmap returning EPERM for reasons other than
> incompatible PROT flags.
Hum, I could make it EINVAL. EPERM looks just too bogus to me.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed 11-10-17 15:23:21, Dan Williams wrote:
> On Wed, Oct 11, 2017 at 1:06 PM, Jan Kara <[email protected]> wrote:
> > We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin()
>
> I assume this is a stale changelog from a previous version and should be
> VM_FAULT_NEEDDSYNC?
Yeah, stale changelog. It should be IOMAP_F_DIRTY. Thanks for catching
this.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu, Oct 12, 2017 at 6:42 AM, Jan Kara <[email protected]> wrote:
> On Wed 11-10-17 15:11:21, Dan Williams wrote:
>> On Wed, Oct 11, 2017 at 1:06 PM, Jan Kara <[email protected]> wrote:
>> > Now when everything is prepared, add support in ext4 to accept MAP_SYNC
>> > as an mmap(2) flag.
>> >
>> > Signed-off-by: Jan Kara <[email protected]>
>> > ---
>> > fs/ext4/file.c | 20 ++++++++++++++++++++
>> > 1 file changed, 20 insertions(+)
>> >
>> > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> > index 61a8788168f3..f013cda84b3d 100644
>> > --- a/fs/ext4/file.c
>> > +++ b/fs/ext4/file.c
>> > @@ -364,6 +364,25 @@ static int ext4_file_mmap(struct file *file, struct vm_area_struct *vma)
>> > return 0;
>> > }
>> >
>> > +#define EXT4_SUPPORTED_MAP_FLAGS (LEGACY_MAP_MASK | MAP_SYNC)
>> > +
>> > +static int ext4_file_mmap_validate(struct file *file,
>> > + struct vm_area_struct *vma,
>> > + unsigned long map_flags)
>> > +{
>> > + if (map_flags & ~EXT4_SUPPORTED_MAP_FLAGS)
>> > + return -EOPNOTSUPP;
>> > +
>> > + /*
>> > + * We don't support synchronous mappings for non-DAX files. At least
>> > + * until someone comes with a sensible use case.
>> > + */
>> > + if (!IS_DAX(file_inode(file)) && (map_flags & MAP_SYNC))
>> > + return -EOPNOTSUPP;
>>
>> Perhaps EPERM instead to differentiate the unsupported flags case?
>> There's precedent for mmap returning EPERM for reasons other than
>> incompatible PROT flags.
>
> Hum, I could make it EINVAL. EPERM looks just too bogus to me.
EINVAL is indistinguishable from a kernel that does not support
MAP_SHARED_VALIDATE...
So did we settle on the new mmap_validate vs adding a new argument
to ->mmap for real now? I have to say I'd much prefer passing an
additional argument instead, but if higher powers rule that out
this version is ok.
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 13dab191a23e..5aee97d64cae 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -1701,6 +1701,8 @@ struct file_operations {
> long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
> long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
> int (*mmap) (struct file *, struct vm_area_struct *);
> + int (*mmap_validate) (struct file *, struct vm_area_struct *,
> + unsigned long);
Can we make this return a bool for ok vs not ok? That way we only
need to have the error code discussion in one place instead of every
file system.
Looks fine,
Reviewed-by: Christoph Hellwig <[email protected]>
Looks fine,
Reviewed-by: Christoph Hellwig <[email protected]>
Looks fine,
Reviewed-by: Christoph Hellwig <[email protected]>
On Wed, Oct 11, 2017 at 10:05:58PM +0200, Jan Kara wrote:
> Implement a function that filesystems can call to finish handling of
> synchronous page faults. It takes care of syncing appropriate file range
> and insertion of page table entry.
>
> Signed-off-by: Jan Kara <[email protected]>
Looks fine except for a few nitpicks below:
Reviewed-by: Christoph Hellwig <[email protected]>
> + spin_lock_irq(&mapping->tree_lock);
> + entry = get_unlocked_mapping_entry(mapping, index, &slot);
> + /* Did we race with someone splitting entry or so? */
> + if (!entry || (pe_size == PE_SIZE_PTE && !dax_is_pte_entry(entry)) ||
> + (pe_size == PE_SIZE_PMD && !dax_is_pmd_entry(entry))) {
Minor nitpick:
Maybe keep each of the three conditions on a separate line to make it a
little easier to read.
> + * This function calls fdatasync() on the range of file touched by the page
> + * fault and handles inserting of appropriate page table entry.
Nitpick: we don't call fdatasync, as that doesn't even exist in the
kernel, but vfs_fsync_range. But documenting which functions we call
doesn't seem all that useful to start with, so if you really want to
document it I'd say something like:
"Ensures that the file range touched by the page fault is stored
persistently on the media"
or similar.
> + int err;
> + loff_t start = ((loff_t)vmf->pgoff) << PAGE_SHIFT;
> + size_t len = 0;
> +
> + if (pe_size == PE_SIZE_PTE)
> + len = PAGE_SIZE;
> +#ifdef CONFIG_FS_DAX_PMD
> + else if (pe_size == PE_SIZE_PMD)
> + len = PMD_SIZE;
> +#endif
Is there really any point to ifdef out this single branch?
> + else
> + WARN_ON_ONCE(1);
> + err = vfs_fsync_range(vmf->vma->vm_file, start, start + len - 1, 1);
(this just reminds me that I hate the vfs_fsync_range / ->fsync calling
conventions... offset + len would be way too easy)
This really should be part of the previous patch.
On Wed, Oct 11, 2017 at 03:11:21PM -0700, Dan Williams wrote:
> > + /*
> > + * We don't support synchronous mappings for non-DAX files. At least
> > + * until someone comes with a sensible use case.
> > + */
> > + if (!IS_DAX(file_inode(file)) && (map_flags & MAP_SYNC))
> > + return -EOPNOTSUPP;
>
> Perhaps EPERM instead to differentiate the unsupported flags case?
> There's precedent for mmap returning EPERM for reasons other than
> incompatible PROT flags.
Why would we want to voluntarily use arcane errors for an entirely
new interface under our control?
-EOPNOTSUPP is nice and explicit about what is going on.
This should be merged into the previous patch.
Also I just noticed how I misread ->mmap_validate in the first patch - it
doesn't just validate the mmap arguments but also replaces ->mmap
itself.
Which means that a) the name for it is very confusing, and b) we should
at the very least have it fully replace ->mmap for file systems that
implement it, that is, ext4 and xfs should not have to implement both
but just the new one.
On Fri, Oct 13, 2017 at 12:12 AM, Christoph Hellwig <[email protected]> wrote:
> So did we settle on the new mmap_validate vs adding a new argument
> to ->mmap for real now? I have to say I'd much prefer passing an
> additional argument instead, but if higher powers rule that out
> this version is ok.
MAP_DIRECT also needs the fd now, so it would be two arguments. Before
you say "that's bogus" I want to understand what the alternative looks
like for notifying userspace that its get_user_pages() registration
lease is being invalidated.
>
>> diff --git a/include/linux/fs.h b/include/linux/fs.h
>> index 13dab191a23e..5aee97d64cae 100644
>> --- a/include/linux/fs.h
>> +++ b/include/linux/fs.h
>> @@ -1701,6 +1701,8 @@ struct file_operations {
>> long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
>> long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
>> int (*mmap) (struct file *, struct vm_area_struct *);
>> + int (*mmap_validate) (struct file *, struct vm_area_struct *,
>> + unsigned long);
>
> Can we make this return a bool for ok vs not ok? That way we only
> need to have the error code discussion in one place instead of every
> file system.
How does userspace figure out what went wrong if we lose the ability
to return different failure reasons from the ops owner? Maybe an enum
to cut down the possible error codes to 'flag not supported', 'flag
supported but failed to setup the mapping', and 'success'.
On Fri, Oct 13, 2017 at 12:22 AM, Christoph Hellwig <[email protected]> wrote:
> On Wed, Oct 11, 2017 at 03:11:21PM -0700, Dan Williams wrote:
>> > + /*
>> > + * We don't support synchronous mappings for non-DAX files. At least
>> > + * until someone comes with a sensible use case.
>> > + */
>> > + if (!IS_DAX(file_inode(file)) && (map_flags & MAP_SYNC))
>> > + return -EOPNOTSUPP;
>>
>> Perhaps EPERM instead to differentiate the unsupported flags case?
>> There's precedent for mmap returning EPERM for reasons other than
>> incompatible PROT flags.
>
> Why would we want to voluntarily use arcane errors for an entirely
> new interface under our control?
>
> -EOPNOTSUPP is nice and explicit about what is going on.
We have this general and perennial problem of parsing why the kernel
is throwing an error. It saves a debug step if the error code is
unique, and in this case would indicate that the filesystem supports
MAP_SYNC but the administrator failed to arrange for the device to be
in DAX mode.
On Fri, Oct 13, 2017 at 8:44 AM, Dan Williams <[email protected]> wrote:
> On Fri, Oct 13, 2017 at 12:12 AM, Christoph Hellwig <[email protected]> wrote:
>> So did we settle on the new mmap_validate vs adding a new argument
>> to ->mmap for real now? I have to say I'd much prefer passing an
>> additional argument instead, but if higher powers rule that out
>> this version is ok.
>
> MAP_DIRECT also needs the fd now, so it would be two arguments. Before
> you say "that's bogus" I want to understand what the alternative looks
> like for notifying userspace that its get_user_pages() registration
> lease is being invalidated.
>
Jason straightened me out about how RDMA applications expect to get
notifications and now I see why the fd based scheme is lacking.
On Wed, Oct 11, 2017 at 10:05:55PM +0200, Jan Kara wrote:
> Currently we dirty radix tree entry whenever dax_insert_mapping_entry()
> gets called for a write fault. With synchronous page faults we would
> like to insert clean radix tree entry and dirty it only once we call
> fdatasync() and update page tables to save some unnecessary cache
> flushing. Add 'dirty' argument to dax_insert_mapping_entry() for that.
>
> Signed-off-by: Jan Kara <[email protected]>
Looks good.
Reviewed-by: Ross Zwisler <[email protected]>
On Wed, Oct 11, 2017 at 10:05:56PM +0200, Jan Kara wrote:
> Define new MAP_SYNC flag and corresponding VMA VM_SYNC flag. As the
> MAP_SYNC flag is not part of LEGACY_MAP_MASK, currently it will be
> refused by all MAP_SHARED_VALIDATE map attempts and silently ignored for
> everything else.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> arch/xtensa/include/uapi/asm/mman.h | 1 +
> fs/proc/task_mmu.c | 1 +
> include/linux/mm.h | 1 +
> include/linux/mman.h | 3 ++-
> include/uapi/asm-generic/mman.h | 1 +
> 5 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> index ec597900eec7..b62a7ce166fb 100644
> --- a/arch/xtensa/include/uapi/asm/mman.h
> +++ b/arch/xtensa/include/uapi/asm/mman.h
> @@ -56,6 +56,7 @@
> #define MAP_NONBLOCK 0x20000 /* do not block on IO */
> #define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
> #define MAP_HUGETLB 0x80000 /* create a huge page mapping */
> +#define MAP_SYNC 0x100000 /* perform synchronous page faults for the mapping */
Why define MAP_SYNC for this one architecture, but not for the rest that have
their own arch/*/include/uapi/asm/mman.h? AFAIK xtensa doesn't support DAX,
so has no need for MAP_SYNC?
Other than this one question, this patch looks fine:
Reviewed-by: Ross Zwisler <[email protected]>
On Wed, Oct 11, 2017 at 10:05:58PM +0200, Jan Kara wrote:
> Implement a function that filesystems can call to finish handling of
> synchronous page faults. It takes care of syncing appropriare file range
appropriate
> and insertion of page table entry.
>
> Signed-off-by: Jan Kara <[email protected]>
Looks good:
Reviewed-by: Ross Zwisler <[email protected]>
On Wed, Oct 11, 2017 at 10:05:59PM +0200, Jan Kara wrote:
> If transaction starting fails, just bail out of the function immediately
> instead of checking for that condition throughout the function.
>
> Signed-off-by: Jan Kara <[email protected]>
Yep, looks correct and is cleaner:
Reviewed-by: Ross Zwisler <[email protected]>
On Wed, Oct 11, 2017 at 10:06:00PM +0200, Jan Kara wrote:
> We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
> synchronous write fault when inode has some uncommitted metadata
> changes. In the fault handler ext4_dax_fault() we then detect this case,
> call vfs_fsync_range() to make sure all metadata is committed, and call
> dax_insert_pfn_mkwrite() to insert page table entry. Note that this will
> also dirty corresponding radix tree entry which is what we want -
> fsync(2) will still provide data integrity guarantees for applications
> not using userspace flushing. And applications using userspace flushing
> can avoid calling fsync(2) and thus avoid the performance overhead.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/ext4/file.c | 6 +++++-
> fs/ext4/inode.c | 15 +++++++++++++++
> fs/jbd2/journal.c | 17 +++++++++++++++++
> include/linux/jbd2.h | 1 +
> 4 files changed, 38 insertions(+), 1 deletion(-)
<>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 31db875bc7a1..13a198924a0f 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3394,6 +3394,19 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
> }
>
> #ifdef CONFIG_FS_DAX
> +static bool ext4_inode_datasync_dirty(struct inode *inode)
> +{
> + journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> +
> + if (journal)
> + return !jbd2_transaction_committed(journal,
> + EXT4_I(inode)->i_datasync_tid);
> + /* Any metadata buffers to write? */
> + if (!list_empty(&inode->i_mapping->private_list))
> + return true;
> + return inode->i_state & I_DIRTY_DATASYNC;
> +}
I just had 2 quick questions on this:
1) Does ext4 actually use inode->i_mapping->private_list to keep track of
dirty metadata buffers? The comment above ext4_write_end() leads me to
believe that this list is unused?
* ext4 never places buffers on inode->i_mapping->private_list. metadata
* buffers are managed internally.
Or does the above comment only apply to ext4 with a journal?
2) Where is I_DIRTY_DATASYNC set in inode->i_state? I poked around a bit and
couldn't see it.
The rest of the patch looks good to me, and you can add:
Reviewed-by: Ross Zwisler <[email protected]>
On Thu, Oct 12, 2017 at 09:43:31AM +1100, Dave Chinner wrote:
> On Wed, Oct 11, 2017 at 02:18:06PM -0700, Dan Williams wrote:
> > On Wed, Oct 11, 2017 at 1:05 PM, Jan Kara <[email protected]> wrote:
> > > Anyway, here are the patches and AFAICT the series is pretty much complete
> > > so we can start thinking how to merge this. Changes to ext4 / XFS are pretty
> > > minimal so either tree is fine I guess. Comments are welcome.
> >
> > I'd like to propose taking this through the nvdimm tree. Some of these
> > changes make the MAP_DIRECT support for ext4 easier, so I'd like to
> > rebase that support on top and carry both.
>
> This functionality needs to be exercised and stressed by fstests
> in some way before it gets merged.
I can work on making an fstest for this set, unless someone else has already
started.
On Fri, 2017-10-13 at 00:12 -0700, Christoph Hellwig wrote:
> So did we settle on the new mmap_validate vs adding a new argument
> to ->mmap for real now? I have to say I'd much prefer passing an
> additional argument instead, but if higher powers rule that out
> this version is ok.
Even if we decide to add a parameter to ->mmap() I think that should be
done after we merge this version. Otherwise there's no way to stage
these changes in advance of the merge window since we need to run the
"add parameter" coccinelle script near or after -rc1 when there's no
risk of new ->mmap() users being added.
>
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index 13dab191a23e..5aee97d64cae 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -1701,6 +1701,8 @@ struct file_operations {
> > long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
> > long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
> > int (*mmap) (struct file *, struct vm_area_struct *);
> > + int (*mmap_validate) (struct file *, struct vm_area_struct *,
> > + unsigned long);
>
> Can we make this return a bool for ok vs not ok? That way we only
> need to have the error code discussion in one place instead of every
> file system.
How about the following incremental update? It allows ->mmap_validate()
to be used as a full replacement for ->mmap() and it limits the error
code freedom to a centralized mmap_status_errno() routine:
---
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 5aee97d64cae..e07fcac17ba7 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -1685,6 +1685,13 @@ struct block_device_operations;
 #define NOMMU_VMFLAGS \
	(NOMMU_MAP_READ | NOMMU_MAP_WRITE | NOMMU_MAP_EXEC)

+enum mmap_status {
+	MMAP_SUCCESS,
+	MMAP_UNSUPPORTED_FLAGS,
+	MMAP_INVALID_FILE,
+	MMAP_RESOURCE_FAILURE,
+};
+typedef enum mmap_status mmap_status_t;

 struct iov_iter;

@@ -1701,7 +1708,7 @@ struct file_operations {
	long (*unlocked_ioctl) (struct file *, unsigned int, unsigned long);
	long (*compat_ioctl) (struct file *, unsigned int, unsigned long);
	int (*mmap) (struct file *, struct vm_area_struct *);
-	int (*mmap_validate) (struct file *, struct vm_area_struct *,
+	mmap_status_t (*mmap_validate) (struct file *, struct vm_area_struct *,
			unsigned long);
	int (*open) (struct inode *, struct file *);
	int (*flush) (struct file *, fl_owner_t id);
diff --git a/mm/mmap.c b/mm/mmap.c
index 2649c00581a0..c1b6a8c25ce7 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1432,7 +1432,7 @@ unsigned long do_mmap(struct file *file, unsigned long addr,
		vm_flags &= ~VM_MAYEXEC;
	}

-	if (!file->f_op->mmap)
+	if (!file->f_op->mmap && !file->f_op->mmap_validate)
		return -ENODEV;
	if (vm_flags & (VM_GROWSDOWN|VM_GROWSUP))
		return -EINVAL;
@@ -1612,6 +1612,27 @@ static inline int accountable_mapping(struct file *file, vm_flags_t vm_flags)
	return (vm_flags & (VM_NORESERVE | VM_SHARED | VM_WRITE)) == VM_WRITE;
 }

+static int mmap_status_errno(mmap_status_t stat)
+{
+	static const int rc[] = {
+		[MMAP_SUCCESS] = 0,
+		[MMAP_UNSUPPORTED_FLAGS] = -EOPNOTSUPP,
+		[MMAP_INVALID_FILE] = -ENOTTY,
+		[MMAP_RESOURCE_FAILURE] = -ENOMEM,
+	};
+
+	switch (stat) {
+	case MMAP_SUCCESS:
+	case MMAP_UNSUPPORTED_FLAGS:
+	case MMAP_INVALID_FILE:
+	case MMAP_RESOURCE_FAILURE:
+		return rc[stat];
+	default:
+		/* unknown mmap_status */
+		return rc[MMAP_UNSUPPORTED_FLAGS];
+	}
+}
+
 unsigned long mmap_region(struct file *file, unsigned long addr,
		unsigned long len, vm_flags_t vm_flags, unsigned long pgoff,
		struct list_head *uf, unsigned long map_flags)
@@ -1619,6 +1640,7 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
	struct mm_struct *mm = current->mm;
	struct vm_area_struct *vma, *prev;
	int error;
+	mmap_status_t status;
	struct rb_node **rb_link, *rb_parent;
	unsigned long charged = 0;
@@ -1697,11 +1719,19 @@ unsigned long mmap_region(struct file *file, unsigned long addr,
	 * vma_link() below can deny write-access if VM_DENYWRITE is set
	 * and map writably if VM_SHARED is set. This usually means the
	 * new file must not have been exposed to user-space, yet.
+	 *
+	 * We require ->mmap_validate in the MAP_SHARED_VALIDATE
+	 * case, prefer ->mmap_validate over ->mmap, and
+	 * otherwise fallback to legacy ->mmap.
	 */
	vma->vm_file = get_file(file);
-	if ((map_flags & MAP_TYPE) == MAP_SHARED_VALIDATE)
-		error = file->f_op->mmap_validate(file, vma, map_flags);
-	else
+	if ((map_flags & MAP_TYPE) == MAP_SHARED_VALIDATE) {
+		status = file->f_op->mmap_validate(file, vma, map_flags);
+		error = mmap_status_errno(status);
+	} else if (file->f_op->mmap_validate) {
+		status = file->f_op->mmap_validate(file, vma, map_flags);
+		error = mmap_status_errno(status);
+	} else
		error = call_mmap(file, vma);
	if (error)
		goto unmap_and_free_vma;
> How about the following incremental update? It allows ->mmap_validate()
> to be used as a full replacement for ->mmap() and it limits the error
> code freedom to a centralized mmap_status_errno() routine:
Nah - my earlier comment was simply misinformed because I didn't
read the whole patch and the _validate name misled me.
So I think the current calling conventions are ok, I'd just like a
better name (mmap_flags maybe?) and to avoid the need for the file
system to also implement ->mmap.
On Fri 13-10-17 16:53:04, Ross Zwisler wrote:
> On Thu, Oct 12, 2017 at 09:43:31AM +1100, Dave Chinner wrote:
> > On Wed, Oct 11, 2017 at 02:18:06PM -0700, Dan Williams wrote:
> > > On Wed, Oct 11, 2017 at 1:05 PM, Jan Kara <[email protected]> wrote:
> > > > Anyway, here are the patches and AFAICT the series is pretty much complete
> > > > so we can start thinking how to merge this. Changes to ext4 / XFS are pretty
> > > > minimal so either tree is fine I guess. Comments are welcome.
> > >
> > > I'd like to propose taking this through the nvdimm tree. Some of these
> > > changes make the MAP_DIRECT support for ext4 easier, so I'd like to
> > > rebase that support on top and carry both.
> >
> > This functionality needs to be exercised and stressed by fstests
> > in some way before it gets merged.
>
> I can work on making an fstest for this set, unless someone else has already
> started.
I haven't started yet so feel free to have a look. Otherwise I'll try to
find some time later this week.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 13-10-17 00:21:23, Christoph Hellwig wrote:
> This really should be part of the previous patch.
Fair enough, I'll fold it.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 13-10-17 13:44:44, Ross Zwisler wrote:
> On Wed, Oct 11, 2017 at 10:05:56PM +0200, Jan Kara wrote:
> > Define new MAP_SYNC flag and corresponding VMA VM_SYNC flag. As the
> > MAP_SYNC flag is not part of LEGACY_MAP_MASK, currently it will be
> > refused by all MAP_SHARED_VALIDATE map attempts and silently ignored for
> > everything else.
> >
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > arch/xtensa/include/uapi/asm/mman.h | 1 +
> > fs/proc/task_mmu.c | 1 +
> > include/linux/mm.h | 1 +
> > include/linux/mman.h | 3 ++-
> > include/uapi/asm-generic/mman.h | 1 +
> > 5 files changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/arch/xtensa/include/uapi/asm/mman.h b/arch/xtensa/include/uapi/asm/mman.h
> > index ec597900eec7..b62a7ce166fb 100644
> > --- a/arch/xtensa/include/uapi/asm/mman.h
> > +++ b/arch/xtensa/include/uapi/asm/mman.h
> > @@ -56,6 +56,7 @@
> > #define MAP_NONBLOCK 0x20000 /* do not block on IO */
> > #define MAP_STACK 0x40000 /* give out an address that is best suited for process/thread stacks */
> > #define MAP_HUGETLB 0x80000 /* create a huge page mapping */
> > +#define MAP_SYNC 0x100000 /* perform synchronous page faults for the mapping */
>
> Why define MAP_SYNC for this one architecture, but not for the rest that have
> their own arch/*/include/uapi/asm/mman.h? AFAIK xtensa doesn't support DAX,
> so has no need for MAP_SYNC?
Yeah, good point. I guess I'll just define it in asm-generic (which covers
x86 as well) and define MAP_SYNC to 0 if not defined in linux/mman.h.
That should make the code compile and make sure userspace gets an error if
it tries to use MAP_SYNC on an arch without its definition. Attached is a
new version of the patch.
> Other than this one question, this patch looks fine:
>
> Reviewed-by: Ross Zwisler <[email protected]>
Thanks.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 13-10-17 00:21:01, Christoph Hellwig wrote:
> On Wed, Oct 11, 2017 at 10:05:58PM +0200, Jan Kara wrote:
> > Implement a function that filesystems can call to finish handling of
> > synchronous page faults. It takes care of syncing appropriate file range
> > and insertion of page table entry.
> >
> > Signed-off-by: Jan Kara <[email protected]>
>
> Looks fine except for a few nitpicks below:
>
> Reviewed-by: Christoph Hellwig <[email protected]>
Thanks! Implemented all your comments.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 13-10-17 14:58:54, Ross Zwisler wrote:
> On Wed, Oct 11, 2017 at 10:06:00PM +0200, Jan Kara wrote:
> > We return IOMAP_F_NEEDDSYNC flag from ext4_iomap_begin() for a
> > synchronous write fault when inode has some uncommitted metadata
> > changes. In the fault handler ext4_dax_fault() we then detect this case,
> > call vfs_fsync_range() to make sure all metadata is committed, and call
> > dax_insert_pfn_mkwrite() to insert page table entry. Note that this will
> > also dirty corresponding radix tree entry which is what we want -
> > fsync(2) will still provide data integrity guarantees for applications
> > not using userspace flushing. And applications using userspace flushing
> > can avoid calling fsync(2) and thus avoid the performance overhead.
> >
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/ext4/file.c | 6 +++++-
> > fs/ext4/inode.c | 15 +++++++++++++++
> > fs/jbd2/journal.c | 17 +++++++++++++++++
> > include/linux/jbd2.h | 1 +
> > 4 files changed, 38 insertions(+), 1 deletion(-)
>
> <>
>
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 31db875bc7a1..13a198924a0f 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3394,6 +3394,19 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
> > }
> >
> > #ifdef CONFIG_FS_DAX
> > +static bool ext4_inode_datasync_dirty(struct inode *inode)
> > +{
> > + journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> > +
> > + if (journal)
> > + return !jbd2_transaction_committed(journal,
> > + EXT4_I(inode)->i_datasync_tid);
> > + /* Any metadata buffers to write? */
> > + if (!list_empty(&inode->i_mapping->private_list))
> > + return true;
> > + return inode->i_state & I_DIRTY_DATASYNC;
> > +}
>
> I just had 2 quick questions on this:
>
> 1) Does ext4 actually use inode->i_mapping->private_list to keep track of
> dirty metadata buffers? The comment above ext4_write_end() leads me to
> believe that this list is unused?
>
> * ext4 never places buffers on inode->i_mapping->private_list. metadata
> * buffers are managed internally.
>
> Or does the above comment only apply to ext4 with a journal?
Yes, the above comment applies only to ext4 with a journal. ext4 without a
journal uses inode->i_mapping->private_list for metadata tracking, and DAX
can be used without the journal just fine...
> 2) Where is I_DIRTY_DATASYNC set in inode->i_state? I poked around a bit and
> couldn't see it.
Never directly (at least for ext4). But it will get set by
mark_inode_dirty() as I_DIRTY contains I_DIRTY_DATASYNC.
> The rest of the patch looks good to me, and you can add:
>
> Reviewed-by: Ross Zwisler <[email protected]>
Thanks!
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 13-10-17 08:52:28, Dan Williams wrote:
> On Fri, Oct 13, 2017 at 12:22 AM, Christoph Hellwig <[email protected]> wrote:
> > On Wed, Oct 11, 2017 at 03:11:21PM -0700, Dan Williams wrote:
> >> > + /*
> >> > + * We don't support synchronous mappings for non-DAX files. At least
> >> > + * until someone comes up with a sensible use case.
> >> > + */
> >> > + if (!IS_DAX(file_inode(file)) && (map_flags & MAP_SYNC))
> >> > + return -EOPNOTSUPP;
> >>
> >> Perhaps EPERM instead to differentiate the unsupported flags case?
> >> There's precedent for mmap returning EPERM for reasons other than
> >> incompatible PROT flags.
> >
> > Why would we want to voluntarily use arcane errors for an entirely
> > new interface under our control?
> >
> > -EOPNOTSUPP is nice and explicit about what is going on.
>
> We have this general and perennial problem of parsing why the kernel
> is throwing an error. It saves a debug step if the error code is
> unique, and in this case would indicate that the filesystem supports
> MAP_SYNC but the administrator failed to arrange for the device to be
> in DAX mode.
So I understand this wish, however I think the error codes should also
reflect the nature of the problem. And EPERM is about access permissions,
not about whether a file supports DAX access.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon 16-10-17 00:45:04, Christoph Hellwig wrote:
> > How about the following incremental update? It allows ->mmap_validate()
> > to be used as a full replacement for ->mmap() and it limits the error
> > code freedom to a centralized mmap_status_errno() routine:
>
> Nah - my earlier comment was simply misinformed because I didn't
> read the whole patch and the _validate name misled me.
>
> So I think the current calling conventions are ok, I'd just like a
> better name (mmap_flags maybe?) and avoid the need the file system
> also has to implement ->mmap.
OK, I can do that. But I had just realized that if MAP_DIRECT isn't going
to end up using mmap(2) interface but something else (and I'm not sure
where discussions on this matter ended), we don't need flags argument for
->mmap at all. MAP_SYNC uses a VMA flag anyway and thus it is fine with the
current ->mmap interface. We still need some opt-in mechanism for
MAP_SHARED_VALIDATE though (probably supported mmap flags as Dan had in one
version of his patch). Thoughts on which way to go for now?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, Oct 17, 2017 at 4:50 AM, Jan Kara <[email protected]> wrote:
> On Mon 16-10-17 00:45:04, Christoph Hellwig wrote:
>> > How about the following incremental update? It allows ->mmap_validate()
>> > to be used as a full replacement for ->mmap() and it limits the error
>> > code freedom to a centralized mmap_status_errno() routine:
>>
>> Nah - my earlier comment was simply misinformed because I didn't
>> read the whole patch and the _validate name misled me.
>>
>> So I think the current calling conventions are ok, I'd just like a
>> better name (mmap_flags maybe?) and avoid the need the file system
>> also has to implement ->mmap.
>
> OK, I can do that. But I had just realized that if MAP_DIRECT isn't going
> to end up using mmap(2) interface but something else (and I'm not sure
> where discussions on this matter ended), we don't need flags argument for
> ->mmap at all. MAP_SYNC uses a VMA flag anyway and thus it is fine with the
> current ->mmap interface. We still need some opt-in mechanism for
> MAP_SHARED_VALIDATE though (probably supported mmap flags as Dan had in one
> version of his patch). Thoughts on which way to go for now?
The "supported mmap flags" approach also solves the problem you raised
about MAP_SYNC being silently accepted by an ->mmap() handler that
does not know about the new flag, i.e. leading userspace to potentially
assume an invalid data consistency model. I'll revive that approach
now that the MAP_DIRECT problem is going to be solved via a different
interface.
On Tue, Oct 17, 2017 at 01:50:47PM +0200, Jan Kara wrote:
> OK, I can do that. But I had just realized that if MAP_DIRECT isn't going
> to end up using mmap(2) interface but something else (and I'm not sure
> where discussions on this matter ended), we don't need flags argument for
> ->mmap at all. MAP_SYNC uses a VMA flag anyway and thus it is fine with the
> current ->mmap interface. We still need some opt-in mechanism for
> MAP_SHARED_VALIDATE though (probably supported mmap flags as Dan had in one
> version of his patch). Thoughts on which way to go for now?
Yes, I'd much prefer the mmap_flags in file_operations. The other
option would be a new FMODE_* flag, which is what Al did for various
other optional features, but I generally think that is a confusing
interface.