2016-10-07 21:08:47

by Ross Zwisler

Subject: [PATCH v5 00/17] re-enable DAX PMD support

DAX PMDs have been disabled since Jan Kara introduced DAX radix tree based
locking. This series allows DAX PMDs to participate in the DAX radix tree
based locking scheme so that they can be re-enabled.

Dave, can you please take this through the XFS tree as we discussed during
the v4 review?

Changes since v4:
- Reworked the DAX flags handling to simplify things and get rid of
RADIX_DAX_PTE. (Jan & Christoph)
- Moved RADIX_DAX_* macros to be inline functions in include/linux/dax.h.
(Christoph)
- Got rid of unneeded macros RADIX_DAX_HZP_ENTRY() and
RADIX_DAX_EMPTY_ENTRY(), and instead just pass arbitrary flags to
radix_dax_entry().
- Re-ordered the arguments to dax_wake_mapping_entry_waiter() to be more
consistent with the rest of the code. (Jan)
- Moved radix_dax_order() inside of the #ifdef CONFIG_FS_DAX_PMD block.
This was causing a build error on various systems that don't define
PMD_SHIFT.
- Patch 5 fixes what I believe is a missing error return in
ext2_iomap_end().
- Fixed the page_start calculation for PMDs that was previously found in
dax_entry_start(). (Jan) This code is now included directly in
dax_entry_waitqueue(). (Christoph)
- dax_entry_waitqueue() now sets up the caller's struct exceptional_entry_key
as a service, to reduce code duplication. (Christoph)
- In grab_mapping_entry() we now hold the radix tree entry lock for PMD
downgrades while we release the tree_lock and do an
unmap_mapping_range(). (Jan)
- Removed our last BUG_ON() in dax.c, replacing it with a WARN_ON_ONCE()
and an error return.
- The dax_iomap_fault() and dax_iomap_pmd_fault() handlers both now call
ops->iomap_end() to ensure that we properly balance the
ops->iomap_begin() calls with respect to locking, allocations, etc.
(Jan)
- Removed __GFP_FS from the vmf.gfp_mask used in dax_iomap_pmd_fault().
(Jan)

Thank you again to Jan, Christoph and Dave for their review feedback.

Here are some related things that are not included in this patch set, but
which I plan on doing in the near future:
- Add tracepoint support for the PTE and PMD based DAX fault handlers.
(Dave)
- Move the DAX 4k zero page handling to use a single 4k zero page instead
of allocating pages on demand. This will mirror the way that things are
done for the 2 MiB case, and will reduce the amount of memory we use
when reading 4k holes in DAX.
- Change the API to the PMD fault handler so that it takes a vmf, and at a
layer above DAX make sure that the vmf.gfp_mask given to DAX for both
PMD and PTE faults doesn't include __GFP_FS. (Jan)

These work items will happen after review & integration with Jan's patch
set for DAX radix tree cleaning.

This series was built upon xfs/xfs-4.9-reflink with PMD performance fixes
from Toshi Kani and Dan Williams. Dan's patch has already been merged for
v4.8, and Toshi's patches are currently queued in Andrew Morton's mm tree
for v4.9 inclusion. These patches are not needed for correct operation,
only for good performance.

Here is a tree containing my changes:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=dax_pmd_v5

This tree has passed xfstests for ext2, ext4 and XFS both with and without
DAX, and has passed targeted testing where I inserted, removed and flushed
DAX PTEs and PMDs in every combination I could think of.

Previously reported performance numbers:

In some simple mmap I/O testing with FIO the use of PMD faults more than
doubles I/O performance as compared with PTE faults. Here is the FIO
script I used for my testing:

[global]
bs=4k
size=2G
directory=/mnt/pmem0
ioengine=mmap
[randrw]
rw=randrw
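
The script can be run directly with fio; the job-file name below is
illustrative, not from the original runs:

fio randrw.fio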

Here are the performance results with XFS using only pte faults:
READ: io=1022.7MB, aggrb=557610KB/s, minb=557610KB/s, maxb=557610KB/s, mint=1878msec, maxt=1878msec
WRITE: io=1025.4MB, aggrb=559084KB/s, minb=559084KB/s, maxb=559084KB/s, mint=1878msec, maxt=1878msec

Here are performance numbers for that same test using PMD faults:
READ: io=1022.7MB, aggrb=1406.7MB/s, minb=1406.7MB/s, maxb=1406.7MB/s, mint=727msec, maxt=727msec
WRITE: io=1025.4MB, aggrb=1410.4MB/s, minb=1410.4MB/s, maxb=1410.4MB/s, mint=727msec, maxt=727msec

This was done on a random lab machine with a PMEM device made from memmap'd
RAM. To get XFS to use PMD faults, I did the following:

mkfs.xfs -f -d su=2m,sw=1 /dev/pmem0
mount -o dax /dev/pmem0 /mnt/pmem0
xfs_io -c "extsize 2m" /mnt/pmem0

Ross Zwisler (17):
ext4: allow DAX writeback for hole punch
ext4: tell DAX the size of allocation holes
dax: remove buffer_size_valid()
ext2: remove support for DAX PMD faults
ext2: return -EIO on ext2_iomap_end() failure
dax: make 'wait_table' global variable static
dax: remove the last BUG_ON() from fs/dax.c
dax: consistent variable naming for DAX entries
dax: coordinate locking for offsets in PMD range
dax: remove dax_pmd_fault()
dax: correct dax iomap code namespace
dax: add dax_iomap_sector() helper function
dax: dax_iomap_fault() needs to call iomap_end()
dax: move RADIX_DAX_* defines to dax.h
dax: add struct iomap based DAX PMD support
xfs: use struct iomap based DAX PMD fault path
dax: remove "depends on BROKEN" from FS_DAX_PMD

fs/Kconfig | 1 -
fs/dax.c | 718 ++++++++++++++++++++++++++++------------------------
fs/ext2/file.c | 35 +--
fs/ext2/inode.c | 4 +-
fs/ext4/inode.c | 7 +-
fs/xfs/xfs_aops.c | 26 +-
fs/xfs/xfs_aops.h | 3 -
fs/xfs/xfs_file.c | 10 +-
include/linux/dax.h | 60 ++++-
mm/filemap.c | 6 +-
10 files changed, 466 insertions(+), 404 deletions(-)

--
2.7.4



2016-10-07 21:08:51

by Ross Zwisler

Subject: [PATCH v5 04/17] ext2: remove support for DAX PMD faults

DAX PMD support was added via the following commit:

commit e7b1ea2ad658 ("ext2: huge page fault support")

I believe this path to be untested as ext2 doesn't reliably provide block
allocations that are aligned to 2MiB. In my testing I've been unable to
get ext2 to actually fault in a PMD. It always fails with a "pfn
unaligned" message because the sector returned by ext2_get_block() isn't
aligned.

I've tried various settings for the "stride" and "stripe_width" extended
options to mkfs.ext2, without any luck.

Since we can't reliably get PMDs, remove support so that we don't have an
untested code path that we may someday traverse when we happen to get an
aligned block allocation. This should also make 4k DAX faults in ext2 a
bit faster since they will no longer have to call the PMD fault handler
only to get a response of VM_FAULT_FALLBACK.
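
For background, a PMD fault can only install a huge mapping when the pfn
backing the fault is naturally aligned to the PMD size. A minimal sketch
of that alignment test (illustrative names, not the exact kernel code):

/* with 4k pages and 2 MiB PMDs, the low 9 pfn bits must be zero */
#define PG_PMD_COLOUR ((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1)

static bool pfn_is_pmd_aligned(unsigned long pfn)
{
	return (pfn & PG_PMD_COLOUR) == 0;
}

ext2_get_block() makes no such alignment guarantee, so this test fails and
the fault falls back to PTEs.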

Signed-off-by: Ross Zwisler <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext2/file.c | 29 ++++++-----------------------
1 file changed, 6 insertions(+), 23 deletions(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 0ca363d..0f257f8 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -107,27 +107,6 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return ret;
}

-static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
- pmd_t *pmd, unsigned int flags)
-{
- struct inode *inode = file_inode(vma->vm_file);
- struct ext2_inode_info *ei = EXT2_I(inode);
- int ret;
-
- if (flags & FAULT_FLAG_WRITE) {
- sb_start_pagefault(inode->i_sb);
- file_update_time(vma->vm_file);
- }
- down_read(&ei->dax_sem);
-
- ret = dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block);
-
- up_read(&ei->dax_sem);
- if (flags & FAULT_FLAG_WRITE)
- sb_end_pagefault(inode->i_sb);
- return ret;
-}
-
static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,
struct vm_fault *vmf)
{
@@ -154,7 +133,11 @@ static int ext2_dax_pfn_mkwrite(struct vm_area_struct *vma,

static const struct vm_operations_struct ext2_dax_vm_ops = {
.fault = ext2_dax_fault,
- .pmd_fault = ext2_dax_pmd_fault,
+ /*
+ * .pmd_fault is not supported for DAX because allocation in ext2
+ * cannot be reliably aligned to huge page sizes and so pmd faults
+ * will always fail and fall back to regular faults.
+ */
.page_mkwrite = ext2_dax_fault,
.pfn_mkwrite = ext2_dax_pfn_mkwrite,
};
@@ -166,7 +149,7 @@ static int ext2_file_mmap(struct file *file, struct vm_area_struct *vma)

file_accessed(file);
vma->vm_ops = &ext2_dax_vm_ops;
- vma->vm_flags |= VM_MIXEDMAP | VM_HUGEPAGE;
+ vma->vm_flags |= VM_MIXEDMAP;
return 0;
}
#else
--
2.7.4


2016-10-07 21:08:56

by Ross Zwisler

Subject: [PATCH v5 09/17] dax: coordinate locking for offsets in PMD range

DAX radix tree locking currently locks entries based on the unique
combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
This works for PTEs, but as we move to PMDs we will need all the
offsets within the range covered by the PMD to map to the same bit lock.
To accomplish this, for ranges covered by a PMD entry we will instead lock
based on the page offset of the beginning of the PMD entry. The 'mapping'
pointer is still used in the same way.
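
As an illustration (my example, assuming 4k pages and 2 MiB PMDs, so
PMD_SHIFT - PAGE_SHIFT == 9):

pgoff_t index = 1000;
pgoff_t pmd_start = index & ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
/* pmd_start == 512: indices 512..1023 all key the same wait queue */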

Signed-off-by: Ross Zwisler <[email protected]>
---
fs/dax.c | 58 +++++++++++++++++++++++++++++++----------------------
include/linux/dax.h | 2 +-
mm/filemap.c | 2 +-
3 files changed, 36 insertions(+), 26 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 152a6e1..608cee9 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -64,14 +64,6 @@ static int __init init_dax_wait_table(void)
}
fs_initcall(init_dax_wait_table);

-static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
- pgoff_t index)
-{
- unsigned long hash = hash_long((unsigned long)mapping ^ index,
- DAX_WAIT_TABLE_BITS);
- return wait_table + hash;
-}
-
static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
{
struct request_queue *q = bdev->bd_queue;
@@ -285,7 +277,7 @@ EXPORT_SYMBOL_GPL(dax_do_io);
*/
struct exceptional_entry_key {
struct address_space *mapping;
- unsigned long index;
+ pgoff_t entry_start;
};

struct wait_exceptional_entry_queue {
@@ -293,6 +285,26 @@ struct wait_exceptional_entry_queue {
struct exceptional_entry_key key;
};

+static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
+ pgoff_t index, void *entry, struct exceptional_entry_key *key)
+{
+ unsigned long hash;
+
+ /*
+ * If 'entry' is a PMD, align the 'index' that we use for the wait
+ * queue to the start of that PMD. This ensures that all offsets in
+ * the range covered by the PMD map to the same bit lock.
+ */
+ if (RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD)
+ index &= ~((1UL << (PMD_SHIFT - PAGE_SHIFT)) - 1);
+
+ key->mapping = mapping;
+ key->entry_start = index;
+
+ hash = hash_long((unsigned long)mapping ^ index, DAX_WAIT_TABLE_BITS);
+ return wait_table + hash;
+}
+
static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
int sync, void *keyp)
{
@@ -301,7 +313,7 @@ static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
container_of(wait, struct wait_exceptional_entry_queue, wait);

if (key->mapping != ewait->key.mapping ||
- key->index != ewait->key.index)
+ key->entry_start != ewait->key.entry_start)
return 0;
return autoremove_wake_function(wait, mode, sync, NULL);
}
@@ -359,12 +371,10 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
{
void *entry, **slot;
struct wait_exceptional_entry_queue ewait;
- wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+ wait_queue_head_t *wq;

init_wait(&ewait.wait);
ewait.wait.func = wake_exceptional_entry_func;
- ewait.key.mapping = mapping;
- ewait.key.index = index;

for (;;) {
entry = __radix_tree_lookup(&mapping->page_tree, index, NULL,
@@ -375,6 +385,8 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
*slotp = slot;
return entry;
}
+
+ wq = dax_entry_waitqueue(mapping, index, entry, &ewait.key);
prepare_to_wait_exclusive(wq, &ewait.wait,
TASK_UNINTERRUPTIBLE);
spin_unlock_irq(&mapping->tree_lock);
@@ -448,9 +460,12 @@ restart:
}

void dax_wake_mapping_entry_waiter(struct address_space *mapping,
- pgoff_t index, bool wake_all)
+ pgoff_t index, void *entry, bool wake_all)
{
- wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+ struct exceptional_entry_key key;
+ wait_queue_head_t *wq;
+
+ wq = dax_entry_waitqueue(mapping, index, entry, &key);

/*
* Checking for locked entry and prepare_to_wait_exclusive() happens
@@ -458,13 +473,8 @@ void dax_wake_mapping_entry_waiter(struct address_space *mapping,
* So at this point all tasks that could have seen our entry locked
* must be in the waitqueue and the following check will see them.
*/
- if (waitqueue_active(wq)) {
- struct exceptional_entry_key key;
-
- key.mapping = mapping;
- key.index = index;
+ if (waitqueue_active(wq))
__wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
- }
}

void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
@@ -480,7 +490,7 @@ void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
}
unlock_slot(mapping, slot);
spin_unlock_irq(&mapping->tree_lock);
- dax_wake_mapping_entry_waiter(mapping, index, false);
+ dax_wake_mapping_entry_waiter(mapping, index, entry, false);
}

static void put_locked_mapping_entry(struct address_space *mapping,
@@ -505,7 +515,7 @@ static void put_unlocked_mapping_entry(struct address_space *mapping,
return;

/* We have to wake up next waiter for the radix tree entry lock */
- dax_wake_mapping_entry_waiter(mapping, index, false);
+ dax_wake_mapping_entry_waiter(mapping, index, entry, false);
}

/*
@@ -532,7 +542,7 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
radix_tree_delete(&mapping->page_tree, index);
mapping->nrexceptional--;
spin_unlock_irq(&mapping->tree_lock);
- dax_wake_mapping_entry_waiter(mapping, index, true);
+ dax_wake_mapping_entry_waiter(mapping, index, entry, true);

return 1;
}
diff --git a/include/linux/dax.h b/include/linux/dax.h
index add6c4b..a41a747 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -22,7 +22,7 @@ int iomap_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
void dax_wake_mapping_entry_waiter(struct address_space *mapping,
- pgoff_t index, bool wake_all);
+ pgoff_t index, void *entry, bool wake_all);

#ifdef CONFIG_FS_DAX
struct page *read_dax_sector(struct block_device *bdev, sector_t n);
diff --git a/mm/filemap.c b/mm/filemap.c
index 2f1175e..a596462 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -617,7 +617,7 @@ static int page_cache_tree_insert(struct address_space *mapping,
if (node)
workingset_node_pages_dec(node);
/* Wakeup waiters for exceptional entry lock */
- dax_wake_mapping_entry_waiter(mapping, page->index,
+ dax_wake_mapping_entry_waiter(mapping, page->index, p,
false);
}
}
--
2.7.4


2016-10-07 21:09:00

by Ross Zwisler

Subject: [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()

Currently iomap_end() doesn't do anything for DAX page faults in either ext2
or XFS. ext2_iomap_end() just checks for a write underrun, and
xfs_file_iomap_end() checks to see if it needs to finish a delayed
allocation. However, in the future iomap_end() calls might be needed to
make sure we have balanced allocations, locks, etc. So, add calls to
iomap_end() with appropriate error handling to dax_iomap_fault().
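
The contract being balanced is roughly as follows (a sketch of the
calling convention, not the exact fs/dax.c code):

/* every successful ->iomap_begin() must be paired with ->iomap_end() */
error = ops->iomap_begin(inode, pos, PAGE_SIZE, flags, &iomap);
if (!error) {
	/* ... handle the fault using 'iomap' ... */
	if (ops->iomap_end)
		error = ops->iomap_end(inode, pos, PAGE_SIZE,
				PAGE_SIZE, flags, &iomap);
}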

Signed-off-by: Ross Zwisler <[email protected]>
Suggested-by: Jan Kara <[email protected]>
---
fs/dax.c | 31 ++++++++++++++++++++++++++++---
1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 7689ab0..5e8febe 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -1187,7 +1187,7 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
goto unlock_entry;
if (WARN_ON_ONCE(iomap.offset + iomap.length < pos + PAGE_SIZE)) {
error = -EIO; /* fs corruption? */
- goto unlock_entry;
+ goto finish_iomap;
}

sector = dax_iomap_sector(&iomap, pos);
@@ -1209,7 +1209,14 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
}

if (error)
- goto unlock_entry;
+ goto finish_iomap;
+
+ if (ops->iomap_end) {
+ error = ops->iomap_end(inode, pos, PAGE_SIZE,
+ PAGE_SIZE, flags, &iomap);
+ if (error)
+ goto unlock_entry;
+ }
if (!radix_tree_exceptional_entry(entry)) {
vmf->page = entry;
return VM_FAULT_LOCKED;
@@ -1230,8 +1237,15 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
break;
case IOMAP_UNWRITTEN:
case IOMAP_HOLE:
- if (!(vmf->flags & FAULT_FLAG_WRITE))
+ if (!(vmf->flags & FAULT_FLAG_WRITE)) {
+ if (ops->iomap_end) {
+ error = ops->iomap_end(inode, pos, PAGE_SIZE,
+ PAGE_SIZE, flags, &iomap);
+ if (error)
+ goto unlock_entry;
+ }
return dax_load_hole(mapping, entry, vmf);
+ }
/*FALLTHRU*/
default:
WARN_ON_ONCE(1);
@@ -1239,6 +1253,17 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
break;
}

+ finish_iomap:
+ if (ops->iomap_end) {
+ if (error) {
+ /* keep previous error */
+ ops->iomap_end(inode, pos, PAGE_SIZE, PAGE_SIZE, flags,
+ &iomap);
+ } else {
+ error = ops->iomap_end(inode, pos, PAGE_SIZE,
+ PAGE_SIZE, flags, &iomap);
+ }
+ }
unlock_entry:
put_locked_mapping_entry(mapping, vmf->pgoff, entry);
out:
--
2.7.4


2016-10-07 21:09:01

by Ross Zwisler

Subject: [PATCH v5 14/17] dax: move RADIX_DAX_* defines to dax.h

The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
mm/filemap.c. When we add PMD support, though, mm/filemap.c will also need
access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
empty entry.

Instead of shifting the defines between dax.c and dax.h as they are
individually used in other code, just move them wholesale to dax.h so
they'll be available when we need them.
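
For reference, a round trip through this encoding looks like the following
(illustrative usage, not code added by this patch):

void *entry = RADIX_DAX_ENTRY(sector, false);	/* PTE-sized entry */
/* RADIX_DAX_TYPE(entry) == RADIX_DAX_PTE */
/* RADIX_DAX_SECTOR(entry) == sector */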

Signed-off-by: Ross Zwisler <[email protected]>
---
fs/dax.c | 14 --------------
include/linux/dax.h | 15 ++++++++++++++-
2 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 5e8febe..ac3cd05 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -34,20 +34,6 @@
#include <linux/iomap.h>
#include "internal.h"

-/*
- * We use lowest available bit in exceptional entry for locking, other two
- * bits to determine entry type. In total 3 special bits.
- */
-#define RADIX_DAX_SHIFT (RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
-#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
-#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
-#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
- RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
- RADIX_TREE_EXCEPTIONAL_ENTRY))
-
/* We choose 4096 entries - same as per-zone page wait tables */
#define DAX_WAIT_TABLE_BITS 12
#define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index a3dfee4..e9ea78c 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -8,8 +8,21 @@

struct iomap_ops;

-/* We use lowest available exceptional entry bit for locking */
+/*
+ * We use lowest available bit in exceptional entry for locking, other two
+ * bits to determine entry type. In total 3 special bits.
+ */
+#define RADIX_DAX_SHIFT (RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
#define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
+#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
+#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
+ RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
+ RADIX_TREE_EXCEPTIONAL_ENTRY))
+

ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
struct iomap_ops *ops);
--
2.7.4


2016-10-10 15:50:04

by Christoph Hellwig

Subject: Re: [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()

On Fri, Oct 07, 2016 at 03:09:00PM -0600, Ross Zwisler wrote:
> Currently iomap_end() doesn't do anything for DAX page faults in either ext2
> or XFS. ext2_iomap_end() just checks for a write underrun, and
> xfs_file_iomap_end() checks to see if it needs to finish a delayed
> allocation. However, in the future iomap_end() calls might be needed to
> make sure we have balanced allocations, locks, etc. So, add calls to
> iomap_end() with appropriate error handling to dax_iomap_fault().

Is there a way to just have a single call to iomap_end at the end of
the function, after which we just return a previously set up return
value?

e.g.

out:
	if (ops->iomap_end) {
		error = ops->iomap_end(inode, pos, PAGE_SIZE,
				PAGE_SIZE, flags, &iomap);
	}

	if (error == -ENOMEM)
		return VM_FAULT_OOM | major;
	if (error < 0 && error != -EBUSY)
		return VM_FAULT_SIGBUS | major;
	return ret;


2016-10-10 15:50:29

by Christoph Hellwig

Subject: Re: [PATCH v5 14/17] dax: move RADIX_DAX_* defines to dax.h

On Fri, Oct 07, 2016 at 03:09:01PM -0600, Ross Zwisler wrote:
> The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
> RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
> mm/filemap.c. When we add PMD support, though, mm/filemap.c will also need
> access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
> empty entry.
>
> Instead of shifting the defines between dax.c and dax.h as they are
> individually used in other code, just move them wholesale to dax.h so
> they'll be available when we need them.

Looks fine, assuming that the macros get cleaned up in the next patches..

Reviewed-by: Christoph Hellwig <[email protected]>

2016-10-10 22:05:50

by Ross Zwisler

Subject: Re: [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()

On Mon, Oct 10, 2016 at 05:50:04PM +0200, Christoph Hellwig wrote:
> On Fri, Oct 07, 2016 at 03:09:00PM -0600, Ross Zwisler wrote:
> > Currently iomap_end() doesn't do anything for DAX page faults in either ext2
> > or XFS. ext2_iomap_end() just checks for a write underrun, and
> > xfs_file_iomap_end() checks to see if it needs to finish a delayed
> > allocation. However, in the future iomap_end() calls might be needed to
> > make sure we have balanced allocations, locks, etc. So, add calls to
> > iomap_end() with appropriate error handling to dax_iomap_fault().
>
> Is there a way to just have a single call to iomap_end at the end of
> the function, after which we just return a previously set up return
> value?
>
> e.g.
>
> out:
> 	if (ops->iomap_end) {
> 		error = ops->iomap_end(inode, pos, PAGE_SIZE,
> 				PAGE_SIZE, flags, &iomap);
> 	}
>
> 	if (error == -ENOMEM)
> 		return VM_FAULT_OOM | major;
> 	if (error < 0 && error != -EBUSY)
> 		return VM_FAULT_SIGBUS | major;
> 	return ret;

Sure, will do.

2016-10-11 07:23:27

by Jan Kara

Subject: Re: [PATCH v5 14/17] dax: move RADIX_DAX_* defines to dax.h

On Fri 07-10-16 15:09:01, Ross Zwisler wrote:
> The RADIX_DAX_* defines currently mostly live in fs/dax.c, with just
> RADIX_DAX_ENTRY_LOCK being in include/linux/dax.h so it can be used in
> mm/filemap.c. When we add PMD support, though, mm/filemap.c will also need
> access to the RADIX_DAX_PTE type so it can properly construct a 4k sized
> empty entry.
>
> Instead of shifting the defines between dax.c and dax.h as they are
> individually used in other code, just move them wholesale to dax.h so
> they'll be available when we need them.
>
> Signed-off-by: Ross Zwisler <[email protected]>

Looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/dax.c | 14 --------------
> include/linux/dax.h | 15 ++++++++++++++-
> 2 files changed, 14 insertions(+), 15 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 5e8febe..ac3cd05 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -34,20 +34,6 @@
> #include <linux/iomap.h>
> #include "internal.h"
>
> -/*
> - * We use lowest available bit in exceptional entry for locking, other two
> - * bits to determine entry type. In total 3 special bits.
> - */
> -#define RADIX_DAX_SHIFT (RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
> -#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
> -#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
> -#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
> -#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
> -#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> -#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> - RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
> - RADIX_TREE_EXCEPTIONAL_ENTRY))
> -
> /* We choose 4096 entries - same as per-zone page wait tables */
> #define DAX_WAIT_TABLE_BITS 12
> #define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index a3dfee4..e9ea78c 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -8,8 +8,21 @@
>
> struct iomap_ops;
>
> -/* We use lowest available exceptional entry bit for locking */
> +/*
> + * We use lowest available bit in exceptional entry for locking, other two
> + * bits to determine entry type. In total 3 special bits.
> + */
> +#define RADIX_DAX_SHIFT (RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
> #define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
> +#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
> +#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
> +#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
> +#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
> +#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
> +#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
> + RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
> + RADIX_TREE_EXCEPTIONAL_ENTRY))
> +
>
> ssize_t dax_iomap_rw(struct kiocb *iocb, struct iov_iter *iter,
> struct iomap_ops *ops);
> --
> 2.7.4
>
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-10-11 07:04:09

by Jan Kara

Subject: Re: [PATCH v5 09/17] dax: coordinate locking for offsets in PMD range

On Fri 07-10-16 15:08:56, Ross Zwisler wrote:
> DAX radix tree locking currently locks entries based on the unique
> combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
> This works for PTEs, but as we move to PMDs we will need all the
> offsets within the range covered by the PMD to map to the same bit lock.
> To accomplish this, for ranges covered by a PMD entry we will instead lock
> based on the page offset of the beginning of the PMD entry. The 'mapping'
> pointer is still used in the same way.
>
> Signed-off-by: Ross Zwisler <[email protected]>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Just one thing which IMO deserves a comment below:

> @@ -448,9 +460,12 @@ restart:
> }
>
> void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> - pgoff_t index, bool wake_all)
> + pgoff_t index, void *entry, bool wake_all)
> {
> - wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> + struct exceptional_entry_key key;
> + wait_queue_head_t *wq;
> +
> + wq = dax_entry_waitqueue(mapping, index, entry, &key);

So I believe we should add a comment above this function noting that the
'entry' it gets may be invalid by the time it is used (we call it without
tree_lock held, so the passed entry may have changed in the radix tree as
we work), but that we use it only to find the appropriate waitqueue where
tasks sleep waiting for that old entry to unlock, so we indeed wake up all
the tasks we need.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


2016-10-11 07:21:47

by Jan Kara

Subject: Re: [PATCH v5 13/17] dax: dax_iomap_fault() needs to call iomap_end()

On Fri 07-10-16 15:09:00, Ross Zwisler wrote:
> Currently iomap_end() doesn't do anything for DAX page faults in either ext2
> or XFS. ext2_iomap_end() just checks for a write underrun, and
> xfs_file_iomap_end() checks to see if it needs to finish a delayed
> allocation. However, in the future iomap_end() calls might be needed to
> make sure we have balanced allocations, locks, etc. So, add calls to
> iomap_end() with appropriate error handling to dax_iomap_fault().
>
> Signed-off-by: Ross Zwisler <[email protected]>
> Suggested-by: Jan Kara <[email protected]>
...
> @@ -1239,6 +1253,17 @@ int dax_iomap_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> break;
> }
>
> + finish_iomap:
> + if (ops->iomap_end) {
> + if (error) {
> + /* keep previous error */
> + ops->iomap_end(inode, pos, PAGE_SIZE, PAGE_SIZE, flags,
> + &iomap);

I think for the error case we should set the number of 'written' bytes to 0
to tell the fs to cancel what it has prepared. This is mostly cosmetic since
the only case where I can imagine this mattering is a shared write fault,
and there we currently have no error path, but it could still bite us in the
future.
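
I.e. something like (an untested sketch of the suggestion):

	if (error) {
		/* pass 0 as 'written' so the fs cancels what it prepared */
		ops->iomap_end(inode, pos, PAGE_SIZE, 0, flags, &iomap);
	}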

Other than that the patch looks good so you can add:

Reviewed-by: Jan Kara <[email protected]>

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2016-10-11 21:18:04

by Ross Zwisler

Subject: Re: [PATCH v5 09/17] dax: coordinate locking for offsets in PMD range

On Tue, Oct 11, 2016 at 09:04:09AM +0200, Jan Kara wrote:
> On Fri 07-10-16 15:08:56, Ross Zwisler wrote:
> > DAX radix tree locking currently locks entries based on the unique
> > combination of the 'mapping' pointer and the pgoff_t 'index' for the entry.
> > This works for PTEs, but as we move to PMDs we will need all the
> > offsets within the range covered by the PMD to map to the same bit lock.
> > To accomplish this, for ranges covered by a PMD entry we will instead lock
> > based on the page offset of the beginning of the PMD entry. The 'mapping'
> > pointer is still used in the same way.
> >
> > Signed-off-by: Ross Zwisler <[email protected]>
>
> The patch looks good to me. You can add:
>
> Reviewed-by: Jan Kara <[email protected]>
>
> Just one thing which IMO deserves a comment below:
>
> > @@ -448,9 +460,12 @@ restart:
> > }
> >
> > void dax_wake_mapping_entry_waiter(struct address_space *mapping,
> > - pgoff_t index, bool wake_all)
> > + pgoff_t index, void *entry, bool wake_all)
> > {
> > - wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
> > + struct exceptional_entry_key key;
> > + wait_queue_head_t *wq;
> > +
> > + wq = dax_entry_waitqueue(mapping, index, entry, &key);
>
> So I believe we should add a comment above this function noting that the
> 'entry' it gets may be invalid by the time it is used (we call it without
> tree_lock held, so the passed entry may have changed in the radix tree as
> we work), but that we use it only to find the appropriate waitqueue where
> tasks sleep waiting for that old entry to unlock, so we indeed wake up all
> the tasks we need.

Added, thanks for the suggestion.