Hello,
this is my third attempt at a DAX page fault locking rewrite. The patch set has
passed xfstests for me both with and without the DAX mount option on ext4 and
xfs, and it has also survived additional page fault beating using the new page
fault stress tests I have added to xfstests. So I'd be grateful if you guys
could have a closer look at the patches so that they can be merged. Thanks.
Changes since v2:
- a lot of additional ext4 fixes and cleanups
- make PMD page faults depend on CONFIG_BROKEN instead of #if 0
- fixed page reference leak when replacing hole page with a pfn
- added some reviewed-by tags
- rebased on top of current Linus' tree
Changes since v1:
- handle wakeups of exclusive waiters properly
- fix cow fault races
- other minor stuff
General description
The basic idea is that we use a bit in an exceptional radix tree entry as
a lock bit and use it similarly to how page lock is used for normal faults.
That way we fix races between hole instantiation and read faults of the
same index. For now I have disabled PMD faults since the issues with page fault
locking are even worse there. Now that Matthew's multi-order radix tree has
landed, I can look into using it for proper locking of PMD faults, but first I
want normal pages sorted out.
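To illustrate the idea (a minimal sketch only - the helper names below are made
up for illustration; the real code in the last patch uses lock_slot() /
unlock_slot(), which operate on the radix tree slot under mapping->tree_lock
and wake waiters through a hashed wait queue):

#define RADIX_DAX_ENTRY_LOCK    (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)

/* Is the exceptional entry for this index currently locked? */
static inline bool dax_entry_locked(void *entry)
{
        return (unsigned long)entry & RADIX_DAX_ENTRY_LOCK;
}

/* Return the entry value with the lock bit set / cleared */
static inline void *dax_entry_lock(void *entry)
{
        return (void *)((unsigned long)entry | RADIX_DAX_ENTRY_LOCK);
}

static inline void *dax_entry_unlock(void *entry)
{
        return (void *)((unsigned long)entry &
                        ~(unsigned long)RADIX_DAX_ENTRY_LOCK);
}

A fault that finds the bit set sleeps on a hashed waitqueue and is woken when
the holder clears the bit, much like lock_page() / unlock_page().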
In the end I have decided to implement the bit locking directly in the DAX
code. Originally I was thinking we could provide something generic directly
in the radix tree code, but the functions DAX needs are rather specific.
Maybe someone else will have a good idea how to distill some generally useful
functions out of what I've implemented for DAX, but for now I didn't bother
with that.
Honza
ext4_dax_mmap_get_block() was accidentally omitted when the get_block
handlers were fixed to properly handle transient ENOSPC errors. Fix it now to
use the ext4_get_block_trans() helper, which takes care of these errors.
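For context: ext4_get_block_trans() itself is not part of this hunk. Judging
from how it is used in this series, it is expected to look roughly like the
following (a sketch only; request-size trimming and other details omitted):

static int ext4_get_block_trans(struct inode *inode, sector_t iblock,
                                struct buffer_head *bh_result, int flags)
{
        handle_t *handle;
        int dio_credits;
        int retries = 0;
        int ret;

        dio_credits = ext4_chunk_trans_blocks(inode,
                                bh_result->b_size >> inode->i_blkbits);
retry:
        /* Start a transaction with enough credits for the mapping */
        handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, dio_credits);
        if (IS_ERR(handle))
                return PTR_ERR(handle);

        ret = _ext4_get_block(inode, iblock, bh_result, flags);
        ext4_journal_stop(handle);

        /* Transient ENOSPC: force a commit to free blocks and retry */
        if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
                goto retry;
        return ret;
}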
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/inode.c | 75 +++++++++++++++------------------------------------------
1 file changed, 20 insertions(+), 55 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 981a1fc30eaa..3e0c06028668 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3218,72 +3218,37 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
- int ret, err;
- int credits;
- struct ext4_map_blocks map;
- handle_t *handle = NULL;
- int flags = 0;
+ int ret;
ext4_debug("ext4_dax_mmap_get_block: inode %lu, create flag %d\n",
inode->i_ino, create);
- map.m_lblk = iblock;
- map.m_len = bh_result->b_size >> inode->i_blkbits;
- credits = ext4_chunk_trans_blocks(inode, map.m_len);
- if (create) {
- flags |= EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_CREATE_ZERO;
- handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- return ret;
- }
- }
+ if (!create)
+ return _ext4_get_block(inode, iblock, bh_result, 0);
- ret = ext4_map_blocks(handle, inode, &map, flags);
- if (create) {
- err = ext4_journal_stop(handle);
- if (ret >= 0 && err < 0)
- ret = err;
- }
- if (ret <= 0)
- goto out;
- if (map.m_flags & EXT4_MAP_UNWRITTEN) {
- int err2;
+ ret = ext4_get_block_trans(inode, iblock, bh_result,
+ EXT4_GET_BLOCKS_PRE_IO |
+ EXT4_GET_BLOCKS_CREATE_ZERO);
+ if (ret < 0)
+ return ret;
+ if (buffer_unwritten(bh_result)) {
/*
* We are protected by i_mmap_sem so we know block cannot go
* away from under us even though we dropped i_data_sem.
* Convert extent to written and write zeros there.
- *
- * Note: We may get here even when create == 0.
*/
- handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out;
- }
-
- err = ext4_map_blocks(handle, inode, &map,
- EXT4_GET_BLOCKS_CONVERT | EXT4_GET_BLOCKS_CREATE_ZERO);
- if (err < 0)
- ret = err;
- err2 = ext4_journal_stop(handle);
- if (err2 < 0 && ret > 0)
- ret = err2;
- }
-out:
- WARN_ON_ONCE(ret == 0 && create);
- if (ret > 0) {
- map_bh(bh_result, inode->i_sb, map.m_pblk);
- /*
- * At least for now we have to clear BH_New so that DAX code
- * doesn't attempt to zero blocks again in a racy way.
- */
- map.m_flags &= ~EXT4_MAP_NEW;
- ext4_update_bh_state(bh_result, map.m_flags);
- bh_result->b_size = map.m_len << inode->i_blkbits;
- ret = 0;
+ ret = ext4_get_block_trans(inode, iblock, bh_result,
+ EXT4_GET_BLOCKS_CONVERT |
+ EXT4_GET_BLOCKS_CREATE_ZERO);
+ if (ret < 0)
+ return ret;
}
- return ret;
+ /*
+ * At least for now we have to clear BH_New so that DAX code
+ * doesn't attempt to zero blocks again in a racy way.
+ */
+ clear_buffer_new(bh_result);
+ return 0;
}
#endif
--
2.6.6
When there are blocks to free in the running transaction, the block
allocator can return ENOSPC even though the filesystem has some blocks to
free. We use ext4_should_retry_alloc() to force a commit of the current
transaction and return whether anything was committed, so that the caller
knows whether a retry of the allocation makes sense. However, the transaction
may get committed after the block allocation fails but before we call
ext4_should_retry_alloc(). In that case ext4_should_retry_alloc() returns
false because there is nothing to commit and we wrongly return ENOSPC.

Fix the race by unconditionally returning 1 from ext4_should_retry_alloc()
when we have tried to commit a transaction. This should not add any
unnecessary retries since a transaction was running a while ago when we tried
to allocate blocks, and we want to retry the allocation once that transaction
has committed anyway.
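For context, only three lines of the function appear in the hunk below; after
this patch ext4_should_retry_alloc() should look roughly like this (a sketch -
the free-cluster and retry-count checks are assumed from the existing code):

int ext4_should_retry_alloc(struct super_block *sb, int *retries)
{
        /* Nothing to free, too many retries, or no journal: give up */
        if (!ext4_has_free_clusters(EXT4_SB(sb), 1, 0) ||
            (*retries)++ > 3 ||
            !EXT4_SB(sb)->s_journal)
                return 0;

        jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id);

        /*
         * Kick a commit of the running transaction and always report that
         * a retry makes sense - even if the commit already happened before
         * we got here, retrying once it has finished is still worthwhile.
         */
        jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal);
        return 1;
}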
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/balloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/balloc.c b/fs/ext4/balloc.c
index fe1f50fe764f..3020fd70c392 100644
--- a/fs/ext4/balloc.c
+++ b/fs/ext4/balloc.c
@@ -610,7 +610,8 @@ int ext4_should_retry_alloc(struct super_block *sb, int *retries)
jbd_debug(1, "%s: retrying operation after ENOSPC\n", sb->s_id);
- return jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal);
+ jbd2_journal_force_commit_nested(EXT4_SB(sb)->s_journal);
+ return 1;
}
/*
--
2.6.6
From: NeilBrown <[email protected]>
These don't belong in radix-tree.h any more than PAGECACHE_TAG_* do.
Let's try to maintain the idea that radix-tree simply implements an
abstract data type.
Acked-by: Ross Zwisler <[email protected]>
Reviewed-by: Matthew Wilcox <[email protected]>
Signed-off-by: NeilBrown <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 9 +++++++++
include/linux/radix-tree.h | 9 ---------
2 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 75ba46d82a76..08799a510b4d 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -32,6 +32,15 @@
#include <linux/pfn_t.h>
#include <linux/sizes.h>
+#define RADIX_DAX_MASK 0xf
+#define RADIX_DAX_SHIFT 4
+#define RADIX_DAX_PTE (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_PMD (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
+#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
+#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
+ RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
+
static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
{
struct request_queue *q = bdev->bd_queue;
diff --git a/include/linux/radix-tree.h b/include/linux/radix-tree.h
index 51a97ac8bfbf..d08d6ec3bf53 100644
--- a/include/linux/radix-tree.h
+++ b/include/linux/radix-tree.h
@@ -52,15 +52,6 @@
#define RADIX_TREE_EXCEPTIONAL_ENTRY 2
#define RADIX_TREE_EXCEPTIONAL_SHIFT 2
-#define RADIX_DAX_MASK 0xf
-#define RADIX_DAX_SHIFT 4
-#define RADIX_DAX_PTE (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
-#define RADIX_DAX_PMD (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
-#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
-#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
- RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
-
static inline int radix_tree_is_indirect_ptr(void *ptr)
{
return (int)((unsigned long)ptr & RADIX_TREE_INDIRECT_PTR);
--
2.6.6
Currently ext4 treats DAX IO the same way as direct IO. I.e., it
allocates unwritten extents before IO is done and converts unwritten
extents afterwards. However, this way DAX IO can race with a page fault to
the same area:
ext4_ext_direct_IO()                            dax_fault()
  dax_io()
    get_block() - allocates unwritten extent
    copy_from_iter_pmem()
                                                  get_block() - converts
                                                    unwritten block to
                                                    written and zeroes it
                                                    out
  ext4_convert_unwritten_extents()
So data written with DAX IO gets lost. Similarly, dax_new_buf() called
from dax_io() can overwrite data that has already been written to the
block via mmap.

Fix the problem by using pre-zeroed blocks for DAX IO the same way as we
use them for DAX mmap. The downside of this solution is that every
allocating write writes each block twice (once zeros, once data). Fixing
the race with locking is possible as well, however we would need to
lock out faults for the whole range written to by the DAX IO, and that is
not easy to do without locking out faults for the whole file, which seems
too aggressive.
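Concretely, the get_block selection for DAX writes in ext4_direct_IO_write()
(see the hunk below) boils down to:

        if (IS_DAX(inode)) {
                /*
                 * Aligned writes fully beyond EOF cannot race with page
                 * faults and do not touch partial blocks, so they may skip
                 * the pre-zeroing; everything else goes through
                 * ext4_dax_get_block() which hands out pre-zeroed blocks.
                 */
                if (round_down(offset, 1 << inode->i_blkbits) >= inode->i_size &&
                    ext4_aligned_io(inode, offset, count))
                        get_block_func = ext4_dio_get_block;
                else
                        get_block_func = ext4_dax_get_block;
                dio_flags = DIO_LOCKING;
        }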
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 11 +++++++++--
fs/ext4/file.c | 4 ++--
fs/ext4/inode.c | 42 +++++++++++++++++++++++++++++++++---------
3 files changed, 44 insertions(+), 13 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 35792b430fb6..173da8faff81 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2521,8 +2521,8 @@ struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
-int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create);
+int ext4_dax_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create);
int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
int ext4_dio_get_block(struct inode *inode, sector_t iblock,
@@ -3328,6 +3328,13 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
}
}
+static inline bool ext4_aligned_io(struct inode *inode, loff_t off, loff_t len)
+{
+ int blksize = 1 << inode->i_blkbits;
+
+ return IS_ALIGNED(off, blksize) && IS_ALIGNED(off + len, blksize);
+}
+
#endif /* __KERNEL__ */
#define EFSBADCRC EBADMSG /* Bad CRC detected */
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index b3a9c6eeadbc..2e9aa49a95fa 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -207,7 +207,7 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
if (IS_ERR(handle))
result = VM_FAULT_SIGBUS;
else
- result = __dax_fault(vma, vmf, ext4_dax_mmap_get_block);
+ result = __dax_fault(vma, vmf, ext4_dax_get_block);
if (write) {
if (!IS_ERR(handle))
@@ -243,7 +243,7 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
result = VM_FAULT_SIGBUS;
else
result = __dax_pmd_fault(vma, addr, pmd, flags,
- ext4_dax_mmap_get_block);
+ ext4_dax_get_block);
if (write) {
if (!IS_ERR(handle))
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 23fd0e0a9223..6d5d5c1db293 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3215,12 +3215,17 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
}
#ifdef CONFIG_FS_DAX
-int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create)
+/*
+ * Get block function for DAX IO and mmap faults. It takes care of converting
+ * unwritten extents to written ones and initializes new / converted blocks
+ * to zeros.
+ */
+int ext4_dax_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create)
{
int ret;
- ext4_debug("ext4_dax_mmap_get_block: inode %lu, create flag %d\n",
+ ext4_debug("ext4_dax_get_block: inode %lu, create flag %d\n",
inode->i_ino, create);
if (!create)
return _ext4_get_block(inode, iblock, bh_result, 0);
@@ -3233,9 +3238,9 @@ int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
if (buffer_unwritten(bh_result)) {
/*
- * We are protected by i_mmap_sem so we know block cannot go
- * away from under us even though we dropped i_data_sem.
- * Convert extent to written and write zeros there.
+ * We are protected by i_mmap_sem or i_mutex so we know block
+ * cannot go away from under us even though we dropped
+ * i_data_sem. Convert extent to written and write zeros there.
*/
ret = ext4_get_block_trans(inode, iblock, bh_result,
EXT4_GET_BLOCKS_CONVERT |
@@ -3250,6 +3255,14 @@ int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
clear_buffer_new(bh_result);
return 0;
}
+#else
+/* Just define empty function, it will never get called. */
+int ext4_dax_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create)
+{
+ BUG();
+ return 0;
+}
#endif
static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
@@ -3371,8 +3384,20 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter,
iocb->private = NULL;
if (overwrite)
get_block_func = ext4_dio_get_block_overwrite;
- else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
- round_down(offset, 1 << inode->i_blkbits) >= inode->i_size) {
+ else if (IS_DAX(inode)) {
+ /*
+ * We can avoid zeroing for aligned DAX writes beyond EOF. Other
+ * writes need zeroing either because they can race with page
+ * faults or because they use partial blocks.
+ */
+ if (round_down(offset, 1<<inode->i_blkbits) >= inode->i_size &&
+ ext4_aligned_io(inode, offset, count))
+ get_block_func = ext4_dio_get_block;
+ else
+ get_block_func = ext4_dax_get_block;
+ dio_flags = DIO_LOCKING;
+ } else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
+ round_down(offset, 1 << inode->i_blkbits) >= inode->i_size) {
get_block_func = ext4_dio_get_block;
dio_flags = DIO_LOCKING | DIO_SKIP_HOLES;
} else if (is_sync_kiocb(iocb)) {
@@ -3386,7 +3411,6 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter,
BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));
#endif
if (IS_DAX(inode)) {
- dio_flags &= ~DIO_SKIP_HOLES;
ret = dax_do_io(iocb, inode, iter, offset, get_block_func,
ext4_end_io_dio, dio_flags);
} else
--
2.6.6
Currently ext4 direct IO handling is split between ext4_ext_direct_IO()
and ext4_ind_direct_IO(). However, the extent-based function calls into
the indirect-based one for some cases and, for example, it is not able to
handle file extending. It also did not properly handle retries in case of
ENOSPC errors. With DAX things would get even more contrived, so just
refactor the direct IO code and, instead of the indirect / extent split,
split it into read and write paths.
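The end result is that ext4_direct_IO() simply dispatches on the IO direction
(this matches the final hunk of the diff below):

        if (iov_iter_rw(iter) == READ)
                ret = ext4_direct_IO_read(iocb, iter, offset);
        else
                ret = ext4_direct_IO_write(iocb, iter, offset);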
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 2 -
fs/ext4/indirect.c | 127 ---------------------------------------------------
fs/ext4/inode.c | 131 ++++++++++++++++++++++++++++++++++++++++++++++-------
3 files changed, 114 insertions(+), 146 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 349afebe21ee..35792b430fb6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2581,8 +2581,6 @@ extern int ext4_get_next_extent(struct inode *inode, ext4_lblk_t lblk,
/* indirect.c */
extern int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
struct ext4_map_blocks *map, int flags);
-extern ssize_t ext4_ind_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
- loff_t offset);
extern int ext4_ind_calc_metadata_amount(struct inode *inode, sector_t lblock);
extern int ext4_ind_trans_blocks(struct inode *inode, int nrblocks);
extern void ext4_ind_truncate(handle_t *, struct inode *inode);
diff --git a/fs/ext4/indirect.c b/fs/ext4/indirect.c
index 3027fa681de5..bc15c2c17633 100644
--- a/fs/ext4/indirect.c
+++ b/fs/ext4/indirect.c
@@ -649,133 +649,6 @@ out:
}
/*
- * O_DIRECT for ext3 (or indirect map) based files
- *
- * If the O_DIRECT write will extend the file then add this inode to the
- * orphan list. So recovery will truncate it back to the original size
- * if the machine crashes during the write.
- *
- * If the O_DIRECT write is intantiating holes inside i_size and the machine
- * crashes then stale disk data _may_ be exposed inside the file. But current
- * VFS code falls back into buffered path in that case so we are safe.
- */
-ssize_t ext4_ind_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
- loff_t offset)
-{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
- struct ext4_inode_info *ei = EXT4_I(inode);
- handle_t *handle;
- ssize_t ret;
- int orphan = 0;
- size_t count = iov_iter_count(iter);
- int retries = 0;
-
- if (iov_iter_rw(iter) == WRITE) {
- loff_t final_size = offset + count;
-
- if (final_size > inode->i_size) {
- /* Credits for sb + inode write */
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out;
- }
- ret = ext4_orphan_add(handle, inode);
- if (ret) {
- ext4_journal_stop(handle);
- goto out;
- }
- orphan = 1;
- ei->i_disksize = inode->i_size;
- ext4_journal_stop(handle);
- }
- }
-
-retry:
- if (iov_iter_rw(iter) == READ && ext4_should_dioread_nolock(inode)) {
- /*
- * Nolock dioread optimization may be dynamically disabled
- * via ext4_inode_block_unlocked_dio(). Check inode's state
- * while holding extra i_dio_count ref.
- */
- inode_dio_begin(inode);
- smp_mb();
- if (unlikely(ext4_test_inode_state(inode,
- EXT4_STATE_DIOREAD_LOCK))) {
- inode_dio_end(inode);
- goto locked;
- }
- if (IS_DAX(inode))
- ret = dax_do_io(iocb, inode, iter, offset,
- ext4_dio_get_block, NULL, 0);
- else
- ret = __blockdev_direct_IO(iocb, inode,
- inode->i_sb->s_bdev, iter,
- offset, ext4_dio_get_block,
- NULL, NULL, 0);
- inode_dio_end(inode);
- } else {
-locked:
- if (IS_DAX(inode))
- ret = dax_do_io(iocb, inode, iter, offset,
- ext4_dio_get_block, NULL, DIO_LOCKING);
- else
- ret = blockdev_direct_IO(iocb, inode, iter, offset,
- ext4_dio_get_block);
-
- if (unlikely(iov_iter_rw(iter) == WRITE && ret < 0)) {
- loff_t isize = i_size_read(inode);
- loff_t end = offset + count;
-
- if (end > isize)
- ext4_truncate_failed_write(inode);
- }
- }
- if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
- goto retry;
-
- if (orphan) {
- int err;
-
- /* Credits for sb + inode write */
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- /* This is really bad luck. We've written the data
- * but cannot extend i_size. Bail out and pretend
- * the write failed... */
- ret = PTR_ERR(handle);
- if (inode->i_nlink)
- ext4_orphan_del(NULL, inode);
-
- goto out;
- }
- if (inode->i_nlink)
- ext4_orphan_del(handle, inode);
- if (ret > 0) {
- loff_t end = offset + ret;
- if (end > inode->i_size) {
- ei->i_disksize = end;
- i_size_write(inode, end);
- /*
- * We're going to return a positive `ret'
- * here due to non-zero-length I/O, so there's
- * no way of reporting error returns from
- * ext4_mark_inode_dirty() to userspace. So
- * ignore it.
- */
- ext4_mark_inode_dirty(handle, inode);
- }
- }
- err = ext4_journal_stop(handle);
- if (ret == 0)
- ret = err;
- }
-out:
- return ret;
-}
-
-/*
* Calculate the number of metadata blocks need to reserve
* to allocate a new block at @lblocks for non extent file based file
*/
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3e0c06028668..23fd0e0a9223 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3281,7 +3281,9 @@ static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
}
/*
- * For ext4 extent files, ext4 will do direct-io write to holes,
+ * Handling of direct IO writes.
+ *
+ * For ext4 extent files, ext4 will do direct-io write even to holes,
* preallocated extents, and those write extend the file, no need to
* fall back to buffered IO.
*
@@ -3299,21 +3301,37 @@ static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
* if the machine crashes during the write.
*
*/
-static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
- loff_t offset)
+static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t offset)
{
struct file *file = iocb->ki_filp;
struct inode *inode = file->f_mapping->host;
+ struct ext4_inode_info *ei = EXT4_I(inode);
ssize_t ret;
size_t count = iov_iter_count(iter);
int overwrite = 0;
get_block_t *get_block_func = NULL;
int dio_flags = 0;
loff_t final_size = offset + count;
+ int orphan = 0;
+ handle_t *handle;
- /* Use the old path for reads and writes beyond i_size. */
- if (iov_iter_rw(iter) != WRITE || final_size > inode->i_size)
- return ext4_ind_direct_IO(iocb, iter, offset);
+ if (final_size > inode->i_size) {
+ /* Credits for sb + inode write */
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+ ret = ext4_orphan_add(handle, inode);
+ if (ret) {
+ ext4_journal_stop(handle);
+ goto out;
+ }
+ orphan = 1;
+ ei->i_disksize = inode->i_size;
+ ext4_journal_stop(handle);
+ }
BUG_ON(iocb->private == NULL);
@@ -3322,8 +3340,7 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
* conversion. This also disallows race between truncate() and
* overwrite DIO as i_dio_count needs to be incremented under i_mutex.
*/
- if (iov_iter_rw(iter) == WRITE)
- inode_dio_begin(inode);
+ inode_dio_begin(inode);
/* If we do a overwrite dio, i_mutex locking can be released */
overwrite = *((int *)iocb->private);
@@ -3332,7 +3349,7 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
inode_unlock(inode);
/*
- * We could direct write to holes and fallocate.
+ * For extent mapped files we could direct write to holes and fallocate.
*
* Allocated blocks to fill the hole are marked as unwritten to prevent
* parallel buffered read to expose the stale data before DIO complete
@@ -3354,7 +3371,11 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
iocb->private = NULL;
if (overwrite)
get_block_func = ext4_dio_get_block_overwrite;
- else if (is_sync_kiocb(iocb)) {
+ else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
+ round_down(offset, 1 << inode->i_blkbits) >= inode->i_size) {
+ get_block_func = ext4_dio_get_block;
+ dio_flags = DIO_LOCKING | DIO_SKIP_HOLES;
+ } else if (is_sync_kiocb(iocb)) {
get_block_func = ext4_dio_get_block_unwritten_sync;
dio_flags = DIO_LOCKING;
} else {
@@ -3364,10 +3385,11 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
#ifdef CONFIG_EXT4_FS_ENCRYPTION
BUG_ON(ext4_encrypted_inode(inode) && S_ISREG(inode->i_mode));
#endif
- if (IS_DAX(inode))
+ if (IS_DAX(inode)) {
+ dio_flags &= ~DIO_SKIP_HOLES;
ret = dax_do_io(iocb, inode, iter, offset, get_block_func,
ext4_end_io_dio, dio_flags);
- else
+ } else
ret = __blockdev_direct_IO(iocb, inode,
inode->i_sb->s_bdev, iter, offset,
get_block_func,
@@ -3387,12 +3409,87 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
ext4_clear_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
}
- if (iov_iter_rw(iter) == WRITE)
- inode_dio_end(inode);
+ inode_dio_end(inode);
/* take i_mutex locking again if we do a ovewrite dio */
if (overwrite)
inode_lock(inode);
+ if (ret < 0 && final_size > inode->i_size)
+ ext4_truncate_failed_write(inode);
+
+ /* Handle extending of i_size after direct IO write */
+ if (orphan) {
+ int err;
+
+ /* Credits for sb + inode write */
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ /* This is really bad luck. We've written the data
+ * but cannot extend i_size. Bail out and pretend
+ * the write failed... */
+ ret = PTR_ERR(handle);
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+
+ goto out;
+ }
+ if (inode->i_nlink)
+ ext4_orphan_del(handle, inode);
+ if (ret > 0) {
+ loff_t end = offset + ret;
+ if (end > inode->i_size) {
+ ei->i_disksize = end;
+ i_size_write(inode, end);
+ /*
+ * We're going to return a positive `ret'
+ * here due to non-zero-length I/O, so there's
+ * no way of reporting error returns from
+ * ext4_mark_inode_dirty() to userspace. So
+ * ignore it.
+ */
+ ext4_mark_inode_dirty(handle, inode);
+ }
+ }
+ err = ext4_journal_stop(handle);
+ if (ret == 0)
+ ret = err;
+ }
+out:
+ return ret;
+}
+
+static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter,
+ loff_t offset)
+{
+ int unlocked = 0;
+ struct inode *inode = iocb->ki_filp->f_mapping->host;
+ ssize_t ret;
+
+ if (ext4_should_dioread_nolock(inode)) {
+ /*
+ * Nolock dioread optimization may be dynamically disabled
+ * via ext4_inode_block_unlocked_dio(). Check inode's state
+ * while holding extra i_dio_count ref.
+ */
+ inode_dio_begin(inode);
+ smp_mb();
+ if (unlikely(ext4_test_inode_state(inode,
+ EXT4_STATE_DIOREAD_LOCK)))
+ inode_dio_end(inode);
+ else
+ unlocked = 1;
+ }
+ if (IS_DAX(inode)) {
+ ret = dax_do_io(iocb, inode, iter, offset, ext4_dio_get_block,
+ NULL, unlocked ? 0 : DIO_LOCKING);
+ } else {
+ ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
+ iter, offset, ext4_dio_get_block,
+ NULL, NULL,
+ unlocked ? 0 : DIO_LOCKING);
+ }
+ if (unlocked)
+ inode_dio_end(inode);
return ret;
}
@@ -3420,10 +3517,10 @@ static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
return 0;
trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
- if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
- ret = ext4_ext_direct_IO(iocb, iter, offset);
+ if (iov_iter_rw(iter) == READ)
+ ret = ext4_direct_IO_read(iocb, iter, offset);
else
- ret = ext4_ind_direct_IO(iocb, iter, offset);
+ ret = ext4_direct_IO_write(iocb, iter, offset);
trace_ext4_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), ret);
return ret;
}
--
2.6.6
Currently the handling of huge pages for DAX is racy. For example the
following can happen:
CPU0 (THP write fault)                          CPU1 (normal read fault)

__dax_pmd_fault()                               __dax_fault()
  get_block(inode, block, &bh, 0) -> not mapped
                                                  get_block(inode, block, &bh, 0)
                                                    -> not mapped
  if (!buffer_mapped(&bh) && write)
    get_block(inode, block, &bh, 1) -> allocates blocks
  truncate_pagecache_range(inode, lstart, lend);
                                                  dax_load_hole();
This results in data corruption since the process on CPU1 won't see the
changes to the file done by CPU0.

The race can happen even if two normal faults race; however, with THP the
situation is even worse because the two faults don't operate on the same
entries in the radix tree and we want to use these entries for
serialization. So make THP support in the DAX code depend on CONFIG_BROKEN
for now.
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 2 +-
include/linux/dax.h | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index d7addfab2094..388327f56fa8 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -707,7 +707,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
}
EXPORT_SYMBOL_GPL(dax_fault);
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_BROKEN)
/*
* The 'colour' (ie low bits) within a PMD of a page offset. This comes up
* more often than one might expect in the below function.
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 7c45ac7ea1d1..0591f4853228 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -23,7 +23,7 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
}
#endif
-#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_BROKEN)
int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
unsigned int flags, get_block_t);
int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
--
2.6.6
Currently ext2 zeroes any data blocks allocated for a DAX inode, however it
still returns them as BH_New. Thus the DAX code zeroes them again in
dax_insert_mapping(), which can possibly overwrite data that has already
been stored to those blocks by a racing dax_io(). Avoid marking pre-zeroed
buffers as new.
Signed-off-by: Jan Kara <[email protected]>
---
fs/ext2/inode.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 6bd58e6ff038..1f07b758b968 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -745,11 +745,11 @@ static int ext2_get_blocks(struct inode *inode,
mutex_unlock(&ei->truncate_mutex);
goto cleanup;
}
- }
+ } else
+ set_buffer_new(bh_result);
ext2_splice_branch(inode, iblock, partial, indirect_blks, count);
mutex_unlock(&ei->truncate_mutex);
- set_buffer_new(bh_result);
got_it:
map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
if (count > blocks_to_boundary)
--
2.6.6
Fault handlers currently take a complete_unwritten argument to convert
unwritten extents after PTEs are updated. However, no filesystem uses this
anymore as the code is racy. Remove the unused argument.
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/block_dev.c | 4 ++--
fs/dax.c | 43 +++++++++----------------------------------
fs/ext2/file.c | 4 ++--
fs/ext4/file.c | 4 ++--
fs/xfs/xfs_file.c | 7 +++----
include/linux/dax.h | 17 +++++++----------
include/linux/fs.h | 1 -
7 files changed, 25 insertions(+), 55 deletions(-)
diff --git a/fs/block_dev.c b/fs/block_dev.c
index 20a2c02b77c4..b25bb230b28a 100644
--- a/fs/block_dev.c
+++ b/fs/block_dev.c
@@ -1746,7 +1746,7 @@ static const struct address_space_operations def_blk_aops = {
*/
static int blkdev_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
- return __dax_fault(vma, vmf, blkdev_get_block, NULL);
+ return __dax_fault(vma, vmf, blkdev_get_block);
}
static int blkdev_dax_pfn_mkwrite(struct vm_area_struct *vma,
@@ -1758,7 +1758,7 @@ static int blkdev_dax_pfn_mkwrite(struct vm_area_struct *vma,
static int blkdev_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd, unsigned int flags)
{
- return __dax_pmd_fault(vma, addr, pmd, flags, blkdev_get_block, NULL);
+ return __dax_pmd_fault(vma, addr, pmd, flags, blkdev_get_block);
}
static const struct vm_operations_struct blkdev_dax_vm_ops = {
diff --git a/fs/dax.c b/fs/dax.c
index 08799a510b4d..c5ccf745d279 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -612,19 +612,13 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
* @vma: The virtual memory area where the fault occurred
* @vmf: The description of the fault
* @get_block: The filesystem method used to translate file offsets to blocks
- * @complete_unwritten: The filesystem method used to convert unwritten blocks
- * to written so the data written to them is exposed. This is required for
- * required by write faults for filesystems that will return unwritten
- * extent mappings from @get_block, but it is optional for reads as
- * dax_insert_mapping() will always zero unwritten blocks. If the fs does
- * not support unwritten extents, the it should pass NULL.
*
* When a page fault occurs, filesystems may call this helper in their
* fault handler for DAX files. __dax_fault() assumes the caller has done all
* the necessary locking for the page fault to proceed successfully.
*/
int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
- get_block_t get_block, dax_iodone_t complete_unwritten)
+ get_block_t get_block)
{
struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
@@ -727,23 +721,9 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
page = NULL;
}
- /*
- * If we successfully insert the new mapping over an unwritten extent,
- * we need to ensure we convert the unwritten extent. If there is an
- * error inserting the mapping, the filesystem needs to leave it as
- * unwritten to prevent exposure of the stale underlying data to
- * userspace, but we still need to call the completion function so
- * the private resources on the mapping buffer can be released. We
- * indicate what the callback should do via the uptodate variable, same
- * as for normal BH based IO completions.
- */
+ /* Filesystem should not return unwritten buffers to us! */
+ WARN_ON_ONCE(buffer_unwritten(&bh));
error = dax_insert_mapping(inode, &bh, vma, vmf);
- if (buffer_unwritten(&bh)) {
- if (complete_unwritten)
- complete_unwritten(&bh, !error);
- else
- WARN_ON_ONCE(!(vmf->flags & FAULT_FLAG_WRITE));
- }
out:
if (error == -ENOMEM)
@@ -772,7 +752,7 @@ EXPORT_SYMBOL(__dax_fault);
* fault handler for DAX files.
*/
int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
- get_block_t get_block, dax_iodone_t complete_unwritten)
+ get_block_t get_block)
{
int result;
struct super_block *sb = file_inode(vma->vm_file)->i_sb;
@@ -781,7 +761,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
sb_start_pagefault(sb);
file_update_time(vma->vm_file);
}
- result = __dax_fault(vma, vmf, get_block, complete_unwritten);
+ result = __dax_fault(vma, vmf, get_block);
if (vmf->flags & FAULT_FLAG_WRITE)
sb_end_pagefault(sb);
@@ -815,8 +795,7 @@ static void __dax_dbg(struct buffer_head *bh, unsigned long address,
#define dax_pmd_dbg(bh, address, reason) __dax_dbg(bh, address, reason, "dax_pmd")
int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmd, unsigned int flags, get_block_t get_block,
- dax_iodone_t complete_unwritten)
+ pmd_t *pmd, unsigned int flags, get_block_t get_block)
{
struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
@@ -875,6 +854,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
if (get_block(inode, block, &bh, 1) != 0)
return VM_FAULT_SIGBUS;
alloc = true;
+ WARN_ON_ONCE(buffer_unwritten(&bh));
}
bdev = bh.b_bdev;
@@ -1020,9 +1000,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
out:
i_mmap_unlock_read(mapping);
- if (buffer_unwritten(&bh))
- complete_unwritten(&bh, !(result & VM_FAULT_ERROR));
-
return result;
fallback:
@@ -1042,8 +1019,7 @@ EXPORT_SYMBOL_GPL(__dax_pmd_fault);
* pmd_fault handler for DAX files.
*/
int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmd, unsigned int flags, get_block_t get_block,
- dax_iodone_t complete_unwritten)
+ pmd_t *pmd, unsigned int flags, get_block_t get_block)
{
int result;
struct super_block *sb = file_inode(vma->vm_file)->i_sb;
@@ -1052,8 +1028,7 @@ int dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
sb_start_pagefault(sb);
file_update_time(vma->vm_file);
}
- result = __dax_pmd_fault(vma, address, pmd, flags, get_block,
- complete_unwritten);
+ result = __dax_pmd_fault(vma, address, pmd, flags, get_block);
if (flags & FAULT_FLAG_WRITE)
sb_end_pagefault(sb);
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index c1400b109805..868c02317b05 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -51,7 +51,7 @@ static int ext2_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
}
down_read(&ei->dax_sem);
- ret = __dax_fault(vma, vmf, ext2_get_block, NULL);
+ ret = __dax_fault(vma, vmf, ext2_get_block);
up_read(&ei->dax_sem);
if (vmf->flags & FAULT_FLAG_WRITE)
@@ -72,7 +72,7 @@ static int ext2_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
}
down_read(&ei->dax_sem);
- ret = __dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block, NULL);
+ ret = __dax_pmd_fault(vma, addr, pmd, flags, ext2_get_block);
up_read(&ei->dax_sem);
if (flags & FAULT_FLAG_WRITE)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index fa2208bae2e1..b3a9c6eeadbc 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -207,7 +207,7 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
if (IS_ERR(handle))
result = VM_FAULT_SIGBUS;
else
- result = __dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);
+ result = __dax_fault(vma, vmf, ext4_dax_mmap_get_block);
if (write) {
if (!IS_ERR(handle))
@@ -243,7 +243,7 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
result = VM_FAULT_SIGBUS;
else
result = __dax_pmd_fault(vma, addr, pmd, flags,
- ext4_dax_mmap_get_block, NULL);
+ ext4_dax_mmap_get_block);
if (write) {
if (!IS_ERR(handle))
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 569938a4a357..c2946f436a3a 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1558,7 +1558,7 @@ xfs_filemap_page_mkwrite(
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
if (IS_DAX(inode)) {
- ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault, NULL);
+ ret = __dax_mkwrite(vma, vmf, xfs_get_blocks_dax_fault);
} else {
ret = block_page_mkwrite(vma, vmf, xfs_get_blocks);
ret = block_page_mkwrite_return(ret);
@@ -1592,7 +1592,7 @@ xfs_filemap_fault(
* changes to xfs_get_blocks_direct() to map unwritten extent
* ioend for conversion on read-only mappings.
*/
- ret = __dax_fault(vma, vmf, xfs_get_blocks_dax_fault, NULL);
+ ret = __dax_fault(vma, vmf, xfs_get_blocks_dax_fault);
} else
ret = filemap_fault(vma, vmf);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
@@ -1629,8 +1629,7 @@ xfs_filemap_pmd_fault(
}
xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
- ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault,
- NULL);
+ ret = __dax_pmd_fault(vma, addr, pmd, flags, xfs_get_blocks_dax_fault);
xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
if (flags & FAULT_FLAG_WRITE)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 636dd59ab505..7c45ac7ea1d1 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -10,10 +10,8 @@ ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
-int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
- dax_iodone_t);
-int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t,
- dax_iodone_t);
+int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
#ifdef CONFIG_FS_DAX
struct page *read_dax_sector(struct block_device *bdev, sector_t n);
@@ -27,21 +25,20 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
- unsigned int flags, get_block_t, dax_iodone_t);
+ unsigned int flags, get_block_t);
int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
- unsigned int flags, get_block_t, dax_iodone_t);
+ unsigned int flags, get_block_t);
#else
static inline int dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
- pmd_t *pmd, unsigned int flags, get_block_t gb,
- dax_iodone_t di)
+ pmd_t *pmd, unsigned int flags, get_block_t gb)
{
return VM_FAULT_FALLBACK;
}
#define __dax_pmd_fault dax_pmd_fault
#endif
int dax_pfn_mkwrite(struct vm_area_struct *, struct vm_fault *);
-#define dax_mkwrite(vma, vmf, gb, iod) dax_fault(vma, vmf, gb, iod)
-#define __dax_mkwrite(vma, vmf, gb, iod) __dax_fault(vma, vmf, gb, iod)
+#define dax_mkwrite(vma, vmf, gb) dax_fault(vma, vmf, gb)
+#define __dax_mkwrite(vma, vmf, gb) __dax_fault(vma, vmf, gb)
static inline bool vma_is_dax(struct vm_area_struct *vma)
{
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 70e61b58baaf..9f2813090d1b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -74,7 +74,6 @@ typedef int (get_block_t)(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
typedef int (dio_iodone_t)(struct kiocb *iocb, loff_t offset,
ssize_t bytes, void *private);
-typedef void (dax_iodone_t)(struct buffer_head *bh_map, int uptodate);
#define MAY_EXEC 0x00000001
#define MAY_WRITE 0x00000002
--
2.6.6
dax_do_io() calls filemap_write_and_wait_range() if the DIO_LOCKING flag is
set. Presumably this was copied over from the direct IO code. However, DAX
inodes have no pagecache pages to write, so the call is pointless. Remove
it.
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 10 +---------
1 file changed, 1 insertion(+), 9 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 7c0036dd1570..237581441bc1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -268,15 +268,8 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
memset(&bh, 0, sizeof(bh));
bh.b_bdev = inode->i_sb->s_bdev;
- if ((flags & DIO_LOCKING) && iov_iter_rw(iter) == READ) {
- struct address_space *mapping = inode->i_mapping;
+ if ((flags & DIO_LOCKING) && iov_iter_rw(iter) == READ)
inode_lock(inode);
- retval = filemap_write_and_wait_range(mapping, pos, end - 1);
- if (retval) {
- inode_unlock(inode);
- goto out;
- }
- }
/* Protects against truncate */
if (!(flags & DIO_SKIP_DIO_COUNT))
@@ -297,7 +290,6 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
if (!(flags & DIO_SKIP_DIO_COUNT))
inode_dio_end(inode);
- out:
return retval;
}
EXPORT_SYMBOL_GPL(dax_do_io);
--
2.6.6
We will use the lowest available bit in the radix tree exceptional entry for
locking of the entry. Define it. Also clean up the definitions of the DAX
entry type bits in DAX exceptional entries to use defined constants instead
of hardcoded numbers, and clean up the checking of these bits so that it does
not rely on how the other bits in the entry are set.
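For reference, the resulting bit layout of a DAX exceptional entry (derived
from the new macros below with RADIX_TREE_EXCEPTIONAL_SHIFT == 2; bit 0 is
assumed to be the radix tree's internal indirect-pointer flag):

/*
 *  bit 0      radix tree internal (indirect pointer flag)
 *  bit 1      RADIX_TREE_EXCEPTIONAL_ENTRY
 *  bit 2      RADIX_DAX_ENTRY_LOCK  (the new lock bit)
 *  bit 3      RADIX_DAX_PTE
 *  bit 4      RADIX_DAX_PMD
 *  bits 5..   sector, i.e. RADIX_DAX_SECTOR(entry) == entry >> RADIX_DAX_SHIFT
 */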
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 17 +++++++++++------
include/linux/dax.h | 3 +++
2 files changed, 14 insertions(+), 6 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 388327f56fa8..3e491eb37bc4 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -32,14 +32,19 @@
#include <linux/pfn_t.h>
#include <linux/sizes.h>
-#define RADIX_DAX_MASK 0xf
-#define RADIX_DAX_SHIFT 4
-#define RADIX_DAX_PTE (0x4 | RADIX_TREE_EXCEPTIONAL_ENTRY)
-#define RADIX_DAX_PMD (0x8 | RADIX_TREE_EXCEPTIONAL_ENTRY)
-#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_MASK)
+/*
+ * We use lowest available bit in exceptional entry for locking, other two
+ * bits to determine entry type. In total 3 special bits.
+ */
+#define RADIX_DAX_SHIFT (RADIX_TREE_EXCEPTIONAL_SHIFT + 3)
+#define RADIX_DAX_PTE (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 1))
+#define RADIX_DAX_PMD (1 << (RADIX_TREE_EXCEPTIONAL_SHIFT + 2))
+#define RADIX_DAX_TYPE_MASK (RADIX_DAX_PTE | RADIX_DAX_PMD)
+#define RADIX_DAX_TYPE(entry) ((unsigned long)entry & RADIX_DAX_TYPE_MASK)
#define RADIX_DAX_SECTOR(entry) (((unsigned long)entry >> RADIX_DAX_SHIFT))
#define RADIX_DAX_ENTRY(sector, pmd) ((void *)((unsigned long)sector << \
- RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE)))
+ RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
+ RADIX_TREE_EXCEPTIONAL_ENTRY))
static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
{
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 0591f4853228..bef5c44f71b3 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -5,6 +5,9 @@
#include <linux/mm.h>
#include <asm/pgtable.h>
+/* We use lowest available exceptional entry bit for locking */
+#define RADIX_DAX_ENTRY_LOCK (1 << RADIX_TREE_EXCEPTIONAL_SHIFT)
+
ssize_t dax_do_io(struct kiocb *, struct inode *, struct iov_iter *, loff_t,
get_block_t, dio_iodone_t, int flags);
int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size);
--
2.6.6
All filesystems now zero blocks themselves for DAX IO to avoid races
between dax_io() and dax_fault(). Remove the zeroing code from dax_io()
and add a warning to catch the case when somebody unexpectedly returns a
new or unwritten buffer.
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 28 ++++++++++------------------
1 file changed, 10 insertions(+), 18 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index ccb8bc399d78..7c0036dd1570 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -119,18 +119,6 @@ int dax_clear_sectors(struct block_device *bdev, sector_t _sector, long _size)
}
EXPORT_SYMBOL_GPL(dax_clear_sectors);
-/* the clear_pmem() calls are ordered by a wmb_pmem() in the caller */
-static void dax_new_buf(void __pmem *addr, unsigned size, unsigned first,
- loff_t pos, loff_t end)
-{
- loff_t final = end - pos + first; /* The final byte of the buffer */
-
- if (first > 0)
- clear_pmem(addr, first);
- if (final < size)
- clear_pmem(addr + final, size - final);
-}
-
static bool buffer_written(struct buffer_head *bh)
{
return buffer_mapped(bh) && !buffer_unwritten(bh);
@@ -169,6 +157,9 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
struct blk_dax_ctl dax = {
.addr = (void __pmem *) ERR_PTR(-EIO),
};
+ unsigned blkbits = inode->i_blkbits;
+ sector_t file_blks = (i_size_read(inode) + (1 << blkbits) - 1)
+ >> blkbits;
if (rw == READ)
end = min(end, i_size_read(inode));
@@ -176,7 +167,6 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
while (pos < end) {
size_t len;
if (pos == max) {
- unsigned blkbits = inode->i_blkbits;
long page = pos >> PAGE_SHIFT;
sector_t block = page << (PAGE_SHIFT - blkbits);
unsigned first = pos - (block << blkbits);
@@ -192,6 +182,13 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
bh->b_size = 1 << blkbits;
bh_max = pos - first + bh->b_size;
bdev = bh->b_bdev;
+ /*
+ * We allow uninitialized buffers for writes
+ * beyond EOF as those cannot race with faults
+ */
+ WARN_ON_ONCE(
+ (buffer_new(bh) && block < file_blks) ||
+ (rw == WRITE && buffer_unwritten(bh)));
} else {
unsigned done = bh->b_size -
(bh_max - (pos - first));
@@ -211,11 +208,6 @@ static ssize_t dax_io(struct inode *inode, struct iov_iter *iter,
rc = map_len;
break;
}
- if (buffer_unwritten(bh) || buffer_new(bh)) {
- dax_new_buf(dax.addr, map_len, first,
- pos, end);
- need_wmb = true;
- }
dax.addr += first;
size = map_len - first;
}
--
2.6.6
Currently dax_pmd_fault() decides to fill a PMD-sized hole only if the
returned buffer has BH_Uptodate set. However, that flag doesn't get set for
any mapping buffer, so that branch is actually dead code. The BH_Uptodate
check doesn't make any sense, so just remove it.
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/fs/dax.c b/fs/dax.c
index 237581441bc1..42bf65b4e752 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -878,7 +878,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
goto fallback;
}
- if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
+ if (!write && !buffer_mapped(&bh)) {
spinlock_t *ptl;
pmd_t entry;
struct page *zero_page = get_huge_zero_page();
--
2.6.6
Currently DAX page fault locking is racy.
CPU0 (write fault)                              CPU1 (read fault)

__dax_fault()                                   __dax_fault()
  get_block(inode, block, &bh, 0) -> not mapped
                                                  get_block(inode, block, &bh, 0)
                                                    -> not mapped
  if (!buffer_mapped(&bh))
    if (vmf->flags & FAULT_FLAG_WRITE)
      get_block(inode, block, &bh, 1) -> allocates blocks
  if (page) -> no
                                                  if (!buffer_mapped(&bh))
                                                    if (vmf->flags & FAULT_FLAG_WRITE) {
                                                    } else {
                                                      dax_load_hole();
                                                    }
  dax_insert_mapping()
And we are in a situation where we fail in dax_radix_entry() with -EIO.

Another problem with the current DAX page fault locking is that there is
no race-free way to clear the dirty tag in the radix tree. We can always
end up with a clean radix tree and dirty data in the CPU cache.

We fix the first problem by introducing locking of exceptional radix
tree entries in DAX mappings, acting very similarly to the page lock and
thus properly synchronizing faults against the same mapping index. The same
lock can later be used to avoid races when clearing the radix tree dirty
tag.
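A sketch of how a fault handler is expected to use the new helpers from the
patch below (error handling simplified for illustration):

        void *entry;

        /* Waits if another fault holds the entry lock for this index */
        entry = grab_mapping_entry(mapping, vmf->pgoff);
        if (IS_ERR(entry))
                return VM_FAULT_SIGBUS;         /* simplified */

        /* ... map the block or load a hole page and install the PTE ... */

        /* Unlocks the entry (or hole page) and wakes the next waiter */
        put_locked_mapping_entry(mapping, vmf->pgoff, entry);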
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 536 ++++++++++++++++++++++++++++++++++++++--------------
include/linux/dax.h | 1 +
mm/truncate.c | 62 +++---
3 files changed, 429 insertions(+), 170 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 3e491eb37bc4..be68d18a98c1 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -46,6 +46,30 @@
RADIX_DAX_SHIFT | (pmd ? RADIX_DAX_PMD : RADIX_DAX_PTE) | \
RADIX_TREE_EXCEPTIONAL_ENTRY))
+/* We choose 4096 entries - same as per-zone page wait tables */
+#define DAX_WAIT_TABLE_BITS 12
+#define DAX_WAIT_TABLE_ENTRIES (1 << DAX_WAIT_TABLE_BITS)
+
+wait_queue_head_t wait_table[DAX_WAIT_TABLE_ENTRIES];
+
+static int __init init_dax_wait_table(void)
+{
+ int i;
+
+ for (i = 0; i < DAX_WAIT_TABLE_ENTRIES; i++)
+ init_waitqueue_head(wait_table + i);
+ return 0;
+}
+fs_initcall(init_dax_wait_table);
+
+static wait_queue_head_t *dax_entry_waitqueue(struct address_space *mapping,
+ pgoff_t index)
+{
+ unsigned long hash = hash_long((unsigned long)mapping ^ index,
+ DAX_WAIT_TABLE_BITS);
+ return wait_table + hash;
+}
+
static long dax_map_atomic(struct block_device *bdev, struct blk_dax_ctl *dax)
{
struct request_queue *q = bdev->bd_queue;
@@ -300,6 +324,259 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
EXPORT_SYMBOL_GPL(dax_do_io);
/*
+ * DAX radix tree locking
+ */
+struct exceptional_entry_key {
+ struct radix_tree_root *root;
+ unsigned long index;
+};
+
+struct wait_exceptional_entry_queue {
+ wait_queue_t wait;
+ struct exceptional_entry_key key;
+};
+
+static int wake_exceptional_entry_func(wait_queue_t *wait, unsigned int mode,
+ int sync, void *keyp)
+{
+ struct exceptional_entry_key *key = keyp;
+ struct wait_exceptional_entry_queue *ewait =
+ container_of(wait, struct wait_exceptional_entry_queue, wait);
+
+ if (key->root != ewait->key.root || key->index != ewait->key.index)
+ return 0;
+ return autoremove_wake_function(wait, mode, sync, NULL);
+}
+
+/*
+ * Check whether the given slot is locked. The function must be called with
+ * mapping->tree_lock held
+ */
+static inline int slot_locked(struct address_space *mapping, void **slot)
+{
+ unsigned long entry = (unsigned long)
+ radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
+ return entry & RADIX_DAX_ENTRY_LOCK;
+}
+
+/*
+ * Mark the given slot is locked. The function must be called with
+ * mapping->tree_lock held
+ */
+static inline void *lock_slot(struct address_space *mapping, void **slot)
+{
+ unsigned long entry = (unsigned long)
+ radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
+
+ entry |= RADIX_DAX_ENTRY_LOCK;
+ radix_tree_replace_slot(slot, (void *)entry);
+ return (void *)entry;
+}
+
+/*
+ * Mark the given slot is unlocked. The function must be called with
+ * mapping->tree_lock held
+ */
+static inline void *unlock_slot(struct address_space *mapping, void **slot)
+{
+ unsigned long entry = (unsigned long)
+ radix_tree_deref_slot_protected(slot, &mapping->tree_lock);
+
+ entry &= ~(unsigned long)RADIX_DAX_ENTRY_LOCK;
+ radix_tree_replace_slot(slot, (void *)entry);
+ return (void *)entry;
+}
+
+/*
+ * Lookup entry in radix tree, wait for it to become unlocked if it is
+ * exceptional entry and return it. The caller must call
+ * put_unlocked_mapping_entry() when he decided not to lock the entry or
+ * put_locked_mapping_entry() when he locked the entry and now wants to
+ * unlock it.
+ *
+ * The function must be called with mapping->tree_lock held.
+ */
+static void *get_unlocked_mapping_entry(struct address_space *mapping,
+ pgoff_t index, void ***slotp)
+{
+ void *ret, **slot;
+ struct wait_exceptional_entry_queue ewait;
+ wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+
+ init_wait(&ewait.wait);
+ ewait.wait.func = wake_exceptional_entry_func;
+ ewait.key.root = &mapping->page_tree;
+ ewait.key.index = index;
+
+ for (;;) {
+ ret = __radix_tree_lookup(&mapping->page_tree, index, NULL,
+ &slot);
+ if (!ret || !radix_tree_exceptional_entry(ret) ||
+ !slot_locked(mapping, slot)) {
+ if (slotp)
+ *slotp = slot;
+ return ret;
+ }
+ prepare_to_wait_exclusive(wq, &ewait.wait,
+ TASK_UNINTERRUPTIBLE);
+ spin_unlock_irq(&mapping->tree_lock);
+ schedule();
+ finish_wait(wq, &ewait.wait);
+ spin_lock_irq(&mapping->tree_lock);
+ }
+}
+
+/*
+ * Find radix tree entry at given index. If it points to a page, return with
+ * the page locked. If it points to the exceptional entry, return with the
+ * radix tree entry locked. If the radix tree doesn't contain given index,
+ * create empty exceptional entry for the index and return with it locked.
+ *
+ * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
+ * persistent memory the benefit is doubtful. We can add that later if we can
+ * show it helps.
+ */
+static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+ void *ret, **slot;
+
+restart:
+ spin_lock_irq(&mapping->tree_lock);
+ ret = get_unlocked_mapping_entry(mapping, index, &slot);
+ /* No entry for given index? Make sure radix tree is big enough. */
+ if (!ret) {
+ int err;
+
+ spin_unlock_irq(&mapping->tree_lock);
+ err = radix_tree_preload(
+ mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
+ if (err)
+ return ERR_PTR(err);
+ ret = (void *)(RADIX_TREE_EXCEPTIONAL_ENTRY |
+ RADIX_DAX_ENTRY_LOCK);
+ spin_lock_irq(&mapping->tree_lock);
+ err = radix_tree_insert(&mapping->page_tree, index, ret);
+ radix_tree_preload_end();
+ if (err) {
+ spin_unlock_irq(&mapping->tree_lock);
+ /* Someone already created the entry? */
+ if (err == -EEXIST)
+ goto restart;
+ return ERR_PTR(err);
+ }
+ /* Good, we have inserted empty locked entry into the tree. */
+ mapping->nrexceptional++;
+ spin_unlock_irq(&mapping->tree_lock);
+ return ret;
+ }
+ /* Normal page in radix tree? */
+ if (!radix_tree_exceptional_entry(ret)) {
+ struct page *page = ret;
+
+ get_page(page);
+ spin_unlock_irq(&mapping->tree_lock);
+ lock_page(page);
+ /* Page got truncated? Retry... */
+ if (unlikely(page->mapping != mapping)) {
+ unlock_page(page);
+ put_page(page);
+ goto restart;
+ }
+ return page;
+ }
+ ret = lock_slot(mapping, slot);
+ spin_unlock_irq(&mapping->tree_lock);
+ return ret;
+}
+
+static void wake_mapping_entry_waiter(struct address_space *mapping,
+ pgoff_t index, bool wake_all)
+{
+ wait_queue_head_t *wq = dax_entry_waitqueue(mapping, index);
+
+ /*
+ * Checking for locked entry and prepare_to_wait_exclusive() happens
+ * under mapping->tree_lock, ditto for entry handling in our callers.
+ * So at this point all tasks that could have seen our entry locked
+ * must be in the waitqueue and the following check will see them.
+ */
+ if (waitqueue_active(wq)) {
+ struct exceptional_entry_key key;
+
+ key.root = &mapping->page_tree;
+ key.index = index;
+ __wake_up(wq, TASK_NORMAL, wake_all ? 0 : 1, &key);
+ }
+}
+
+static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+ void *ret, **slot;
+
+ spin_lock_irq(&mapping->tree_lock);
+ ret = __radix_tree_lookup(&mapping->page_tree, index, NULL, &slot);
+ if (WARN_ON_ONCE(!ret || !radix_tree_exceptional_entry(ret) ||
+ !slot_locked(mapping, slot))) {
+ spin_unlock_irq(&mapping->tree_lock);
+ return;
+ }
+ unlock_slot(mapping, slot);
+ spin_unlock_irq(&mapping->tree_lock);
+ wake_mapping_entry_waiter(mapping, index, false);
+}
+
+static void put_locked_mapping_entry(struct address_space *mapping,
+ pgoff_t index, void *entry)
+{
+ if (!radix_tree_exceptional_entry(entry)) {
+ unlock_page(entry);
+ put_page(entry);
+ } else {
+ unlock_mapping_entry(mapping, index);
+ }
+}
+
+/*
+ * Called when we are done with radix tree entry we looked up via
+ * get_unlocked_mapping_entry() and which we didn't lock in the end.
+ */
+static void put_unlocked_mapping_entry(struct address_space *mapping,
+ pgoff_t index, void *entry)
+{
+ if (!radix_tree_exceptional_entry(entry))
+ return;
+
+ /* We have to wake up next waiter for the radix tree entry lock */
+ wake_mapping_entry_waiter(mapping, index, false);
+}
+
+/*
+ * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
+ * entry to get unlocked before deleting it.
+ */
+int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
+{
+ void *entry;
+
+ spin_lock_irq(&mapping->tree_lock);
+ entry = get_unlocked_mapping_entry(mapping, index, NULL);
+ /*
+ * Caller should make sure radix tree modifications don't race and
+ * we have seen exceptional entry here before.
+ */
+ if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry))) {
+ spin_unlock_irq(&mapping->tree_lock);
+ return 0;
+ }
+ radix_tree_delete(&mapping->page_tree, index);
+ mapping->nrexceptional--;
+ spin_unlock_irq(&mapping->tree_lock);
+ wake_mapping_entry_waiter(mapping, index, true);
+
+ return 1;
+}
+
+/*
* The user has performed a load from a hole in the file. Allocating
* a new page in the file would cause excessive storage usage for
* workloads with sparse files. We allocate a page cache page instead.
@@ -307,15 +584,24 @@ EXPORT_SYMBOL_GPL(dax_do_io);
* otherwise it will simply fall out of the page cache under memory
* pressure without ever having been dirtied.
*/
-static int dax_load_hole(struct address_space *mapping, struct page *page,
- struct vm_fault *vmf)
+static int dax_load_hole(struct address_space *mapping, void *entry,
+ struct vm_fault *vmf)
{
- if (!page)
- page = find_or_create_page(mapping, vmf->pgoff,
- GFP_KERNEL | __GFP_ZERO);
- if (!page)
- return VM_FAULT_OOM;
+ struct page *page;
+
+ /* Hole page already exists? Return it... */
+ if (!radix_tree_exceptional_entry(entry)) {
+ vmf->page = entry;
+ return VM_FAULT_LOCKED;
+ }
+ /* This will replace locked radix tree entry with a hole page */
+ page = find_or_create_page(mapping, vmf->pgoff,
+ vmf->gfp_mask | __GFP_ZERO);
+ if (!page) {
+ put_locked_mapping_entry(mapping, vmf->pgoff, entry);
+ return VM_FAULT_OOM;
+ }
vmf->page = page;
return VM_FAULT_LOCKED;
}
@@ -339,77 +625,72 @@ static int copy_user_bh(struct page *to, struct inode *inode,
return 0;
}
-#define NO_SECTOR -1
#define DAX_PMD_INDEX(page_index) (page_index & (PMD_MASK >> PAGE_SHIFT))
-static int dax_radix_entry(struct address_space *mapping, pgoff_t index,
- sector_t sector, bool pmd_entry, bool dirty)
+static void *dax_insert_mapping_entry(struct address_space *mapping,
+ struct vm_fault *vmf,
+ void *entry, sector_t sector)
{
struct radix_tree_root *page_tree = &mapping->page_tree;
- pgoff_t pmd_index = DAX_PMD_INDEX(index);
- int type, error = 0;
- void *entry;
+ int error = 0;
+ bool hole_fill = false;
+ void *new_entry;
+ pgoff_t index = vmf->pgoff;
- WARN_ON_ONCE(pmd_entry && !dirty);
- if (dirty)
+ if (vmf->flags & FAULT_FLAG_WRITE)
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
- spin_lock_irq(&mapping->tree_lock);
-
- entry = radix_tree_lookup(page_tree, pmd_index);
- if (entry && RADIX_DAX_TYPE(entry) == RADIX_DAX_PMD) {
- index = pmd_index;
- goto dirty;
+ /* Replacing hole page with block mapping? */
+ if (!radix_tree_exceptional_entry(entry)) {
+ hole_fill = true;
+ /*
+ * Unmap the page now before we remove it from page cache below.
+ * The page is locked so it cannot be faulted in again.
+ */
+ unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
+ PAGE_SIZE, 0);
+ error = radix_tree_preload(vmf->gfp_mask & ~__GFP_HIGHMEM);
+ if (error)
+ return ERR_PTR(error);
}
- entry = radix_tree_lookup(page_tree, index);
- if (entry) {
- type = RADIX_DAX_TYPE(entry);
- if (WARN_ON_ONCE(type != RADIX_DAX_PTE &&
- type != RADIX_DAX_PMD)) {
- error = -EIO;
+ spin_lock_irq(&mapping->tree_lock);
+ new_entry = (void *)((unsigned long)RADIX_DAX_ENTRY(sector, false) |
+ RADIX_DAX_ENTRY_LOCK);
+ if (hole_fill) {
+ __delete_from_page_cache(entry, NULL);
+ /* Drop pagecache reference */
+ put_page(entry);
+ error = radix_tree_insert(page_tree, index, new_entry);
+ if (error) {
+ new_entry = ERR_PTR(error);
goto unlock;
}
+ mapping->nrexceptional++;
+ } else {
+ void **slot;
+ void *ret;
- if (!pmd_entry || type == RADIX_DAX_PMD)
- goto dirty;
-
- /*
- * We only insert dirty PMD entries into the radix tree. This
- * means we don't need to worry about removing a dirty PTE
- * entry and inserting a clean PMD entry, thus reducing the
- * range we would flush with a follow-up fsync/msync call.
- */
- radix_tree_delete(&mapping->page_tree, index);
- mapping->nrexceptional--;
- }
-
- if (sector == NO_SECTOR) {
- /*
- * This can happen during correct operation if our pfn_mkwrite
- * fault raced against a hole punch operation. If this
- * happens the pte that was hole punched will have been
- * unmapped and the radix tree entry will have been removed by
- * the time we are called, but the call will still happen. We
- * will return all the way up to wp_pfn_shared(), where the
- * pte_same() check will fail, eventually causing page fault
- * to be retried by the CPU.
- */
- goto unlock;
+ ret = __radix_tree_lookup(page_tree, index, NULL, &slot);
+ WARN_ON_ONCE(ret != entry);
+ radix_tree_replace_slot(slot, new_entry);
}
-
- error = radix_tree_insert(page_tree, index,
- RADIX_DAX_ENTRY(sector, pmd_entry));
- if (error)
- goto unlock;
-
- mapping->nrexceptional++;
- dirty:
- if (dirty)
+ if (vmf->flags & FAULT_FLAG_WRITE)
radix_tree_tag_set(page_tree, index, PAGECACHE_TAG_DIRTY);
unlock:
spin_unlock_irq(&mapping->tree_lock);
- return error;
+ if (hole_fill) {
+ radix_tree_preload_end();
+ /*
+ * We don't need hole page anymore, it has been replaced with
+ * locked radix tree entry now.
+ */
+ if (mapping->a_ops->freepage)
+ mapping->a_ops->freepage(entry);
+ unlock_page(entry);
+ put_page(entry);
+ }
+ return new_entry;
}
static int dax_writeback_one(struct block_device *bdev,
@@ -535,17 +816,19 @@ int dax_writeback_mapping_range(struct address_space *mapping,
}
EXPORT_SYMBOL_GPL(dax_writeback_mapping_range);
-static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
+static int dax_insert_mapping(struct address_space *mapping,
+ struct buffer_head *bh, void **entryp,
struct vm_area_struct *vma, struct vm_fault *vmf)
{
unsigned long vaddr = (unsigned long)vmf->virtual_address;
- struct address_space *mapping = inode->i_mapping;
struct block_device *bdev = bh->b_bdev;
struct blk_dax_ctl dax = {
- .sector = to_sector(bh, inode),
+ .sector = to_sector(bh, mapping->host),
.size = bh->b_size,
};
int error;
+ void *ret;
+ void *entry = *entryp;
i_mmap_lock_read(mapping);
@@ -555,16 +838,16 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
}
dax_unmap_atomic(bdev, &dax);
- error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
- vmf->flags & FAULT_FLAG_WRITE);
- if (error)
+ ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector);
+ if (IS_ERR(ret)) {
+ error = PTR_ERR(ret);
goto out;
+ }
+ *entryp = ret;
error = vm_insert_mixed(vma, vaddr, dax.pfn);
-
out:
i_mmap_unlock_read(mapping);
-
return error;
}
@@ -584,7 +867,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
struct file *file = vma->vm_file;
struct address_space *mapping = file->f_mapping;
struct inode *inode = mapping->host;
- struct page *page;
+ void *entry;
struct buffer_head bh;
unsigned long vaddr = (unsigned long)vmf->virtual_address;
unsigned blkbits = inode->i_blkbits;
@@ -593,6 +876,11 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
int error;
int major = 0;
+ /*
+ * Check whether offset isn't beyond end of file now. Caller is supposed
+ * to hold locks serializing us with truncate / punch hole so this is
+ * a reliable test.
+ */
size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
if (vmf->pgoff >= size)
return VM_FAULT_SIGBUS;
@@ -602,40 +890,17 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
bh.b_bdev = inode->i_sb->s_bdev;
bh.b_size = PAGE_SIZE;
- repeat:
- page = find_get_page(mapping, vmf->pgoff);
- if (page) {
- if (!lock_page_or_retry(page, vma->vm_mm, vmf->flags)) {
- put_page(page);
- return VM_FAULT_RETRY;
- }
- if (unlikely(page->mapping != mapping)) {
- unlock_page(page);
- put_page(page);
- goto repeat;
- }
+ entry = grab_mapping_entry(mapping, vmf->pgoff);
+ if (IS_ERR(entry)) {
+ error = PTR_ERR(entry);
+ goto out;
}
error = get_block(inode, block, &bh, 0);
if (!error && (bh.b_size < PAGE_SIZE))
error = -EIO; /* fs corruption? */
if (error)
- goto unlock_page;
-
- if (!buffer_mapped(&bh) && !vmf->cow_page) {
- if (vmf->flags & FAULT_FLAG_WRITE) {
- error = get_block(inode, block, &bh, 1);
- count_vm_event(PGMAJFAULT);
- mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
- major = VM_FAULT_MAJOR;
- if (!error && (bh.b_size < PAGE_SIZE))
- error = -EIO;
- if (error)
- goto unlock_page;
- } else {
- return dax_load_hole(mapping, page, vmf);
- }
- }
+ goto unlock_entry;
if (vmf->cow_page) {
struct page *new_page = vmf->cow_page;
@@ -644,30 +909,37 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
else
clear_user_highpage(new_page, vaddr);
if (error)
- goto unlock_page;
- vmf->page = page;
- if (!page)
+ goto unlock_entry;
+ if (!radix_tree_exceptional_entry(entry)) {
+ vmf->page = entry;
+ } else {
+ unlock_mapping_entry(mapping, vmf->pgoff);
i_mmap_lock_read(mapping);
+ vmf->page = NULL;
+ }
return VM_FAULT_LOCKED;
}
- /* Check we didn't race with a read fault installing a new page */
- if (!page && major)
- page = find_lock_page(mapping, vmf->pgoff);
-
- if (page) {
- unmap_mapping_range(mapping, vmf->pgoff << PAGE_SHIFT,
- PAGE_SIZE, 0);
- delete_from_page_cache(page);
- unlock_page(page);
- put_page(page);
- page = NULL;
+ if (!buffer_mapped(&bh)) {
+ if (vmf->flags & FAULT_FLAG_WRITE) {
+ error = get_block(inode, block, &bh, 1);
+ count_vm_event(PGMAJFAULT);
+ mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
+ major = VM_FAULT_MAJOR;
+ if (!error && (bh.b_size < PAGE_SIZE))
+ error = -EIO;
+ if (error)
+ goto unlock_entry;
+ } else {
+ return dax_load_hole(mapping, entry, vmf);
+ }
}
/* Filesystem should not return unwritten buffers to us! */
WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
- error = dax_insert_mapping(inode, &bh, vma, vmf);
-
+ error = dax_insert_mapping(mapping, &bh, &entry, vma, vmf);
+ unlock_entry:
+ put_locked_mapping_entry(mapping, vmf->pgoff, entry);
out:
if (error == -ENOMEM)
return VM_FAULT_OOM | major;
@@ -675,13 +947,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
if ((error < 0) && (error != -EBUSY))
return VM_FAULT_SIGBUS | major;
return VM_FAULT_NOPAGE | major;
-
- unlock_page:
- if (page) {
- unlock_page(page);
- put_page(page);
- }
- goto out;
}
EXPORT_SYMBOL(__dax_fault);
@@ -963,23 +1228,18 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
struct file *file = vma->vm_file;
- int error;
-
- /*
- * We pass NO_SECTOR to dax_radix_entry() because we expect that a
- * RADIX_DAX_PTE entry already exists in the radix tree from a
- * previous call to __dax_fault(). We just want to look up that PTE
- * entry using vmf->pgoff and make sure the dirty tag is set. This
- * saves us from having to make a call to get_block() here to look
- * up the sector.
- */
- error = dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false,
- true);
+ struct address_space *mapping = file->f_mapping;
+ void *entry;
+ pgoff_t index = vmf->pgoff;
- if (error == -ENOMEM)
- return VM_FAULT_OOM;
- if (error)
- return VM_FAULT_SIGBUS;
+ spin_lock_irq(&mapping->tree_lock);
+ entry = get_unlocked_mapping_entry(mapping, index, NULL);
+ if (!entry || !radix_tree_exceptional_entry(entry))
+ goto out;
+ radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
+ put_unlocked_mapping_entry(mapping, index, entry);
+out:
+ spin_unlock_irq(&mapping->tree_lock);
return VM_FAULT_NOPAGE;
}
EXPORT_SYMBOL_GPL(dax_pfn_mkwrite);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index fd4aeae65ed7..c5522f912344 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -16,6 +16,7 @@ int dax_zero_page_range(struct inode *, loff_t from, unsigned len, get_block_t);
int dax_truncate_page(struct inode *, loff_t from, get_block_t);
int dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
int __dax_fault(struct vm_area_struct *, struct vm_fault *, get_block_t);
+int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
#ifdef CONFIG_FS_DAX
struct page *read_dax_sector(struct block_device *bdev, sector_t n);
diff --git a/mm/truncate.c b/mm/truncate.c
index b00272810871..4064f8f53daa 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -34,40 +34,38 @@ static void clear_exceptional_entry(struct address_space *mapping,
if (shmem_mapping(mapping))
return;
- spin_lock_irq(&mapping->tree_lock);
-
if (dax_mapping(mapping)) {
- if (radix_tree_delete_item(&mapping->page_tree, index, entry))
- mapping->nrexceptional--;
- } else {
- /*
- * Regular page slots are stabilized by the page lock even
- * without the tree itself locked. These unlocked entries
- * need verification under the tree lock.
- */
- if (!__radix_tree_lookup(&mapping->page_tree, index, &node,
- &slot))
- goto unlock;
- if (*slot != entry)
- goto unlock;
- radix_tree_replace_slot(slot, NULL);
- mapping->nrexceptional--;
- if (!node)
- goto unlock;
- workingset_node_shadows_dec(node);
- /*
- * Don't track node without shadow entries.
- *
- * Avoid acquiring the list_lru lock if already untracked.
- * The list_empty() test is safe as node->private_list is
- * protected by mapping->tree_lock.
- */
- if (!workingset_node_shadows(node) &&
- !list_empty(&node->private_list))
- list_lru_del(&workingset_shadow_nodes,
- &node->private_list);
- __radix_tree_delete_node(&mapping->page_tree, node);
+ dax_delete_mapping_entry(mapping, index);
+ return;
}
+ spin_lock_irq(&mapping->tree_lock);
+ /*
+ * Regular page slots are stabilized by the page lock even
+ * without the tree itself locked. These unlocked entries
+ * need verification under the tree lock.
+ */
+ if (!__radix_tree_lookup(&mapping->page_tree, index, &node,
+ &slot))
+ goto unlock;
+ if (*slot != entry)
+ goto unlock;
+ radix_tree_replace_slot(slot, NULL);
+ mapping->nrexceptional--;
+ if (!node)
+ goto unlock;
+ workingset_node_shadows_dec(node);
+ /*
+ * Don't track node without shadow entries.
+ *
+ * Avoid acquiring the list_lru lock if already untracked.
+ * The list_empty() test is safe as node->private_list is
+ * protected by mapping->tree_lock.
+ */
+ if (!workingset_node_shadows(node) &&
+ !list_empty(&node->private_list))
+ list_lru_del(&workingset_shadow_nodes,
+ &node->private_list);
+ __radix_tree_delete_node(&mapping->page_tree, node);
unlock:
spin_unlock_irq(&mapping->tree_lock);
}
--
2.6.6
Now that all filesystems zero out blocks allocated for a fault handler,
we can just remove the zeroing from the handler itself. Also add checks
that no filesystem returns unwritten or new buffers to us.
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 19 +++----------------
1 file changed, 3 insertions(+), 16 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index c5ccf745d279..ccb8bc399d78 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -587,11 +587,6 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
error = PTR_ERR(dax.addr);
goto out;
}
-
- if (buffer_unwritten(bh) || buffer_new(bh)) {
- clear_pmem(dax.addr, PAGE_SIZE);
- wmb_pmem();
- }
dax_unmap_atomic(bdev, &dax);
error = dax_radix_entry(mapping, vmf->pgoff, dax.sector, false,
@@ -670,7 +665,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
if (error)
goto unlock_page;
- if (!buffer_mapped(&bh) && !buffer_unwritten(&bh) && !vmf->cow_page) {
+ if (!buffer_mapped(&bh) && !vmf->cow_page) {
if (vmf->flags & FAULT_FLAG_WRITE) {
error = get_block(inode, block, &bh, 1);
count_vm_event(PGMAJFAULT);
@@ -722,7 +717,7 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
}
/* Filesystem should not return unwritten buffers to us! */
- WARN_ON_ONCE(buffer_unwritten(&bh));
+ WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
error = dax_insert_mapping(inode, &bh, vma, vmf);
out:
@@ -854,7 +849,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
if (get_block(inode, block, &bh, 1) != 0)
return VM_FAULT_SIGBUS;
alloc = true;
- WARN_ON_ONCE(buffer_unwritten(&bh));
+ WARN_ON_ONCE(buffer_unwritten(&bh) || buffer_new(&bh));
}
bdev = bh.b_bdev;
@@ -953,14 +948,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
dax_pmd_dbg(&bh, address, "pfn not in memmap");
goto fallback;
}
-
- if (buffer_unwritten(&bh) || buffer_new(&bh)) {
- clear_pmem(dax.addr, PMD_SIZE);
- wmb_pmem();
- count_vm_event(PGMAJFAULT);
- mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
- result |= VM_FAULT_MAJOR;
- }
dax_unmap_atomic(bdev, &dax);
/*
--
2.6.6
Callers of the DAX fault handlers must make sure these calls cannot race
with truncate. Thus it is enough to check the inode size when entering the
function and we don't have to recheck it later in the handler.
Note that the inode size itself can be decreased while the fault handler
runs, but filesystem locking protects against any radix tree or block
mapping information changes resulting from the truncate, and that is what
we really care about.
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 60 +-----------------------------------------------------------
1 file changed, 1 insertion(+), 59 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index 42bf65b4e752..d7addfab2094 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -305,20 +305,11 @@ EXPORT_SYMBOL_GPL(dax_do_io);
static int dax_load_hole(struct address_space *mapping, struct page *page,
struct vm_fault *vmf)
{
- unsigned long size;
- struct inode *inode = mapping->host;
if (!page)
page = find_or_create_page(mapping, vmf->pgoff,
GFP_KERNEL | __GFP_ZERO);
if (!page)
return VM_FAULT_OOM;
- /* Recheck i_size under page lock to avoid truncate race */
- size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
- if (vmf->pgoff >= size) {
- unlock_page(page);
- put_page(page);
- return VM_FAULT_SIGBUS;
- }
vmf->page = page;
return VM_FAULT_LOCKED;
@@ -549,24 +540,10 @@ static int dax_insert_mapping(struct inode *inode, struct buffer_head *bh,
.sector = to_sector(bh, inode),
.size = bh->b_size,
};
- pgoff_t size;
int error;
i_mmap_lock_read(mapping);
- /*
- * Check truncate didn't happen while we were allocating a block.
- * If it did, this block may or may not be still allocated to the
- * file. We can't tell the filesystem to free it because we can't
- * take i_mutex here. In the worst case, the file still has blocks
- * allocated past the end of the file.
- */
- size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
- if (unlikely(vmf->pgoff >= size)) {
- error = -EIO;
- goto out;
- }
-
if (dax_map_atomic(bdev, &dax) < 0) {
error = PTR_ERR(dax.addr);
goto out;
@@ -632,15 +609,6 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
put_page(page);
goto repeat;
}
- size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
- if (unlikely(vmf->pgoff >= size)) {
- /*
- * We have a struct page covering a hole in the file
- * from a read fault and we've raced with a truncate
- */
- error = -EIO;
- goto unlock_page;
- }
}
error = get_block(inode, block, &bh, 0);
@@ -673,17 +641,8 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
if (error)
goto unlock_page;
vmf->page = page;
- if (!page) {
+ if (!page)
i_mmap_lock_read(mapping);
- /* Check we didn't race with truncate */
- size = (i_size_read(inode) + PAGE_SIZE - 1) >>
- PAGE_SHIFT;
- if (vmf->pgoff >= size) {
- i_mmap_unlock_read(mapping);
- error = -EIO;
- goto out;
- }
- }
return VM_FAULT_LOCKED;
}
@@ -861,23 +820,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
i_mmap_lock_read(mapping);
- /*
- * If a truncate happened while we were allocating blocks, we may
- * leave blocks allocated to the file that are beyond EOF. We can't
- * take i_mutex here, so just leave them hanging; they'll be freed
- * when the file is deleted.
- */
- size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
- if (pgoff >= size) {
- result = VM_FAULT_SIGBUS;
- goto out;
- }
- if ((pgoff | PG_PMD_COLOUR) >= size) {
- dax_pmd_dbg(&bh, address,
- "offset + huge page size > file size");
- goto fallback;
- }
Currently faults are protected against truncate by the filesystem-specific
i_mmap_sem and, in the case of a hole page, by the page lock. COW faults are
protected by DAX radix tree entry locking. So there's no need for i_mmap_lock
in the DAX code. Remove it.
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 24 +++++-------------------
mm/memory.c | 2 --
2 files changed, 5 insertions(+), 21 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index d907bf8b07a0..5a34f086b4ca 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -826,29 +826,19 @@ static int dax_insert_mapping(struct address_space *mapping,
.sector = to_sector(bh, mapping->host),
.size = bh->b_size,
};
- int error;
void *ret;
void *entry = *entryp;
- i_mmap_lock_read(mapping);
-
- if (dax_map_atomic(bdev, &dax) < 0) {
- error = PTR_ERR(dax.addr);
- goto out;
- }
+ if (dax_map_atomic(bdev, &dax) < 0)
+ return PTR_ERR(dax.addr);
dax_unmap_atomic(bdev, &dax);
ret = dax_insert_mapping_entry(mapping, vmf, entry, dax.sector);
- if (IS_ERR(ret)) {
- error = PTR_ERR(ret);
- goto out;
- }
+ if (IS_ERR(ret))
+ return PTR_ERR(ret);
*entryp = ret;
- error = vm_insert_mixed(vma, vaddr, dax.pfn);
- out:
- i_mmap_unlock_read(mapping);
- return error;
+ return vm_insert_mixed(vma, vaddr, dax.pfn);
}
/**
@@ -1086,8 +1076,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
truncate_pagecache_range(inode, lstart, lend);
}
- i_mmap_lock_read(mapping);
-
if (!write && !buffer_mapped(&bh)) {
spinlock_t *ptl;
pmd_t entry;
@@ -1179,8 +1167,6 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
}
out:
- i_mmap_unlock_read(mapping);
-
return result;
fallback:
diff --git a/mm/memory.c b/mm/memory.c
index f09cdb8d48fa..06f552504e79 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2453,8 +2453,6 @@ void unmap_mapping_range(struct address_space *mapping,
if (details.last_index < details.first_index)
details.last_index = ULONG_MAX;
Currently we forbid page_cache_tree_insert() to replace exceptional radix
tree entries for DAX inodes. However, to make DAX faults race free we will
lock radix tree entries, and when a hole is created we need to replace
such a locked radix tree entry with a hole page. So modify
page_cache_tree_insert() to allow that.
Signed-off-by: Jan Kara <[email protected]>
---
include/linux/dax.h | 1 +
mm/filemap.c | 18 +++++++++++-------
2 files changed, 12 insertions(+), 7 deletions(-)
diff --git a/include/linux/dax.h b/include/linux/dax.h
index bef5c44f71b3..fd4aeae65ed7 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -3,6 +3,7 @@
#include <linux/fs.h>
#include <linux/mm.h>
+#include <linux/radix-tree.h>
#include <asm/pgtable.h>
/* We use lowest available exceptional entry bit for locking */
diff --git a/mm/filemap.c b/mm/filemap.c
index f2479af09da9..3effd5c8f2f6 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -597,14 +597,18 @@ static int page_cache_tree_insert(struct address_space *mapping,
if (!radix_tree_exceptional_entry(p))
return -EEXIST;
- if (WARN_ON(dax_mapping(mapping)))
- return -EINVAL;
When doing COW faults we cannot directly fill in the PTE as we do for other
faults, because we rely on generic code to do proper accounting of the cowed
page. We also have no page to lock to protect against races with truncate as
other faults have, and we need the protection to extend until the moment the
generic code inserts the cowed page into the PTE, by which point the
fs-specific i_mmap_sem no longer protects us. So far we relied on i_mmap_lock
for this protection, but that is completely special to COW faults. To make
fault locking more uniform, use the DAX entry lock instead.
Reviewed-by: Ross Zwisler <[email protected]>
Signed-off-by: Jan Kara <[email protected]>
---
fs/dax.c | 12 +++++-------
include/linux/dax.h | 7 +++++++
include/linux/mm.h | 7 +++++++
mm/memory.c | 38 ++++++++++++++++++--------------------
4 files changed, 37 insertions(+), 27 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index be68d18a98c1..d907bf8b07a0 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -509,7 +509,7 @@ static void wake_mapping_entry_waiter(struct address_space *mapping,
}
}
-static void unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
+void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index)
{
void *ret, **slot;
@@ -532,7 +532,7 @@ static void put_locked_mapping_entry(struct address_space *mapping,
unlock_page(entry);
put_page(entry);
} else {
- unlock_mapping_entry(mapping, index);
+ dax_unlock_mapping_entry(mapping, index);
}
}
@@ -912,12 +912,10 @@ int __dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
goto unlock_entry;
if (!radix_tree_exceptional_entry(entry)) {
vmf->page = entry;
- } else {
- unlock_mapping_entry(mapping, vmf->pgoff);
- i_mmap_lock_read(mapping);
- vmf->page = NULL;
+ return VM_FAULT_LOCKED;
}
- return VM_FAULT_LOCKED;
+ vmf->entry = entry;
+ return VM_FAULT_DAX_LOCKED;
}
if (!buffer_mapped(&bh)) {
diff --git a/include/linux/dax.h b/include/linux/dax.h
index c5522f912344..ef94fa71368c 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -20,12 +20,19 @@ int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index);
#ifdef CONFIG_FS_DAX
struct page *read_dax_sector(struct block_device *bdev, sector_t n);
+void dax_unlock_mapping_entry(struct address_space *mapping, pgoff_t index);
#else
static inline struct page *read_dax_sector(struct block_device *bdev,
sector_t n)
{
return ERR_PTR(-ENXIO);
}
+/* Shouldn't ever be called when dax is disabled. */
+static inline void dax_unlock_mapping_entry(struct address_space *mapping,
+ pgoff_t index)
+{
+ BUG();
+}
#endif
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_BROKEN)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index a55e5be0894f..0ef9dc720ec3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -299,6 +299,12 @@ struct vm_fault {
* is set (which is also implied by
* VM_FAULT_ERROR).
*/
+ void *entry; /* ->fault handler can alternatively
+ * return locked DAX entry. In that
+ * case handler should return
+ * VM_FAULT_DAX_LOCKED and fill in
+ * entry here.
+ */
/* for ->map_pages() only */
pgoff_t max_pgoff; /* map pages for offset from pgoff till
* max_pgoff inclusive */
@@ -1084,6 +1090,7 @@ static inline void clear_page_pfmemalloc(struct page *page)
#define VM_FAULT_LOCKED 0x0200 /* ->fault locked the returned page */
#define VM_FAULT_RETRY 0x0400 /* ->fault blocked, must retry */
#define VM_FAULT_FALLBACK 0x0800 /* huge page fault failed, fall back to small */
+#define VM_FAULT_DAX_LOCKED 0x1000 /* ->fault has locked DAX entry */
#define VM_FAULT_HWPOISON_LARGE_MASK 0xf000 /* encodes hpage index for large hwpoison */
diff --git a/mm/memory.c b/mm/memory.c
index 93897f23cc11..f09cdb8d48fa 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -63,6 +63,7 @@
#include <linux/dma-debug.h>
#include <linux/debugfs.h>
#include <linux/userfaultfd_k.h>
+#include <linux/dax.h>
#include <asm/io.h>
#include <asm/mmu_context.h>
@@ -2785,7 +2786,8 @@ oom:
*/
static int __do_fault(struct vm_area_struct *vma, unsigned long address,
pgoff_t pgoff, unsigned int flags,
- struct page *cow_page, struct page **page)
+ struct page *cow_page, struct page **page,
+ void **entry)
{
struct vm_fault vmf;
int ret;
@@ -2800,8 +2802,10 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
ret = vma->vm_ops->fault(vma, &vmf);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
- if (!vmf.page)
- goto out;
+ if (ret & VM_FAULT_DAX_LOCKED) {
+ *entry = vmf.entry;
+ return ret;
+ }
if (unlikely(PageHWPoison(vmf.page))) {
if (ret & VM_FAULT_LOCKED)
@@ -2815,7 +2819,6 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
else
VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);
- out:
*page = vmf.page;
return ret;
}
@@ -2987,7 +2990,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
pte_unmap_unlock(pte, ptl);
}
- ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
+ ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page, NULL);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
@@ -3010,6 +3013,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
{
struct page *fault_page, *new_page;
+ void *fault_entry;
struct mem_cgroup *memcg;
spinlock_t *ptl;
pte_t *pte;
@@ -3027,26 +3031,24 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return VM_FAULT_OOM;
}
- ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
+ ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page,
+ &fault_entry);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
goto uncharge_out;
- if (fault_page)
+ if (!(ret & VM_FAULT_DAX_LOCKED))
copy_user_highpage(new_page, fault_page, address, vma);
__SetPageUptodate(new_page);
pte = pte_offset_map_lock(mm, pmd, address, &ptl);
if (unlikely(!pte_same(*pte, orig_pte))) {
pte_unmap_unlock(pte, ptl);
- if (fault_page) {
+ if (!(ret & VM_FAULT_DAX_LOCKED)) {
unlock_page(fault_page);
put_page(fault_page);
} else {
- /*
- * The fault handler has no page to lock, so it holds
- * i_mmap_lock for read to protect against truncate.
- */
- i_mmap_unlock_read(vma->vm_file->f_mapping);
+ dax_unlock_mapping_entry(vma->vm_file->f_mapping,
+ pgoff);
}
goto uncharge_out;
}
@@ -3054,15 +3056,11 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
mem_cgroup_commit_charge(new_page, memcg, false, false);
lru_cache_add_active_or_unevictable(new_page, vma);
pte_unmap_unlock(pte, ptl);
- if (fault_page) {
+ if (!(ret & VM_FAULT_DAX_LOCKED)) {
unlock_page(fault_page);
put_page(fault_page);
} else {
- /*
- * The fault handler has no page to lock, so it holds
- * i_mmap_lock for read to protect against truncate.
- */
- i_mmap_unlock_read(vma->vm_file->f_mapping);
+ dax_unlock_mapping_entry(vma->vm_file->f_mapping, pgoff);
}
return ret;
uncharge_out:
@@ -3082,7 +3080,7 @@ static int do_shared_fault(struct mm_struct *mm, struct vm_area_struct *vma,
int dirtied = 0;
int ret, tmp;
- ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
+ ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page, NULL);
if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
return ret;
--
2.6.6
On Mon, Apr 18, 2016 at 11:35:40PM +0200, Jan Kara wrote:
> When doing COW faults we cannot directly fill in the PTE as we do for other
> faults, because we rely on generic code to do proper accounting of the cowed
> page. We also have no page to lock to protect against races with truncate as
> other faults have, and we need the protection to extend until the moment the
> generic code inserts the cowed page into the PTE, by which point the
> fs-specific i_mmap_sem no longer protects us. So far we relied on i_mmap_lock
> for this protection, but that is completely special to COW faults. To make
> fault locking more uniform, use the DAX entry lock instead.
>
> Reviewed-by: Ross Zwisler <[email protected]>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/dax.c | 12 +++++-------
> include/linux/dax.h | 7 +++++++
> include/linux/mm.h | 7 +++++++
> mm/memory.c | 38 ++++++++++++++++++--------------------
> 4 files changed, 37 insertions(+), 27 deletions(-)
>
[...]
> diff --git a/mm/memory.c b/mm/memory.c
> index 93897f23cc11..f09cdb8d48fa 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -63,6 +63,7 @@
> #include <linux/dma-debug.h>
> #include <linux/debugfs.h>
> #include <linux/userfaultfd_k.h>
> +#include <linux/dax.h>
>
> #include <asm/io.h>
> #include <asm/mmu_context.h>
> @@ -2785,7 +2786,8 @@ oom:
> */
> static int __do_fault(struct vm_area_struct *vma, unsigned long address,
> pgoff_t pgoff, unsigned int flags,
> - struct page *cow_page, struct page **page)
> + struct page *cow_page, struct page **page,
> + void **entry)
> {
> struct vm_fault vmf;
> int ret;
> @@ -2800,8 +2802,10 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
> ret = vma->vm_ops->fault(vma, &vmf);
> if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> return ret;
> - if (!vmf.page)
> - goto out;
Removing the above sounds seriously bogus to me as it means that below
if (unlikely(PageHWPoison(vmf.page))) could dereference a NULL pointer.
> + if (ret & VM_FAULT_DAX_LOCKED) {
> + *entry = vmf.entry;
> + return ret;
> + }
I see that below you call __do_fault() with NULL for entry. If I understand
properly, you will never get VM_FAULT_DAX_LOCKED set in those cases, so this
should be fine, but maybe a BUG_ON() would be worth it.
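Something along these lines (untested, just to illustrate the idea):

	if (ret & VM_FAULT_DAX_LOCKED) {
		/* Callers passing entry == NULL never expect a locked DAX entry */
		BUG_ON(!entry);
		*entry = vmf.entry;
		return ret;
	}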
>
> if (unlikely(PageHWPoison(vmf.page))) {
> if (ret & VM_FAULT_LOCKED)
> @@ -2815,7 +2819,6 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
> else
> VM_BUG_ON_PAGE(!PageLocked(vmf.page), vmf.page);
>
> - out:
> *page = vmf.page;
> return ret;
> }
> @@ -2987,7 +2990,7 @@ static int do_read_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> pte_unmap_unlock(pte, ptl);
> }
>
> - ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page);
> + ret = __do_fault(vma, address, pgoff, flags, NULL, &fault_page, NULL);
> if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> return ret;
>
> @@ -3010,6 +3013,7 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> pgoff_t pgoff, unsigned int flags, pte_t orig_pte)
> {
> struct page *fault_page, *new_page;
> + void *fault_entry;
> struct mem_cgroup *memcg;
> spinlock_t *ptl;
> pte_t *pte;
> @@ -3027,26 +3031,24 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> return VM_FAULT_OOM;
> }
>
> - ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page);
> + ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page,
> + &fault_entry);
> if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> goto uncharge_out;
>
> - if (fault_page)
> + if (!(ret & VM_FAULT_DAX_LOCKED))
> copy_user_highpage(new_page, fault_page, address, vma);
Again, removing the check for a non-NULL page looks bogus to me. I think there
are still cases where you will get !(ret & VM_FAULT_DAX_LOCKED) and fault_page
== NULL, for instance from a device file mapping. To me it seems that what you
want is to set fault_page = NULL when VM_FAULT_DAX_LOCKED is set.
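I.e. something like this (again just a sketch of what I mean):

	ret = __do_fault(vma, address, pgoff, flags, new_page, &fault_page,
			 &fault_entry);
	if (ret & VM_FAULT_DAX_LOCKED)
		fault_page = NULL;

and then the later cleanups could keep testing 'if (fault_page)' as before.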
> __SetPageUptodate(new_page);
>
> pte = pte_offset_map_lock(mm, pmd, address, &ptl);
> if (unlikely(!pte_same(*pte, orig_pte))) {
> pte_unmap_unlock(pte, ptl);
> - if (fault_page) {
> + if (!(ret & VM_FAULT_DAX_LOCKED)) {
Same as above.
> unlock_page(fault_page);
> put_page(fault_page);
> } else {
> - /*
> - * The fault handler has no page to lock, so it holds
> - * i_mmap_lock for read to protect against truncate.
> - */
> - i_mmap_unlock_read(vma->vm_file->f_mapping);
> + dax_unlock_mapping_entry(vma->vm_file->f_mapping,
> + pgoff);
> }
> goto uncharge_out;
> }
> @@ -3054,15 +3056,11 @@ static int do_cow_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> mem_cgroup_commit_charge(new_page, memcg, false, false);
> lru_cache_add_active_or_unevictable(new_page, vma);
> pte_unmap_unlock(pte, ptl);
> - if (fault_page) {
> + if (!(ret & VM_FAULT_DAX_LOCKED)) {
Again fault_page might be NULL while VM_FAULT_DAX_LOCKED is not set.
> unlock_page(fault_page);
> put_page(fault_page);
> } else {
> - /*
> - * The fault handler has no page to lock, so it holds
> - * i_mmap_lock for read to protect against truncate.
> - */
> - i_mmap_unlock_read(vma->vm_file->f_mapping);
> + dax_unlock_mapping_entry(vma->vm_file->f_mapping, pgoff);
> }
> return ret;
> uncharge_out:
On Tue 19-04-16 07:46:09, Jerome Glisse wrote:
> On Mon, Apr 18, 2016 at 11:35:40PM +0200, Jan Kara wrote:
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 93897f23cc11..f09cdb8d48fa 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -63,6 +63,7 @@
> > #include <linux/dma-debug.h>
> > #include <linux/debugfs.h>
> > #include <linux/userfaultfd_k.h>
> > +#include <linux/dax.h>
> >
> > #include <asm/io.h>
> > #include <asm/mmu_context.h>
> > @@ -2785,7 +2786,8 @@ oom:
> > */
> > static int __do_fault(struct vm_area_struct *vma, unsigned long address,
> > pgoff_t pgoff, unsigned int flags,
> > - struct page *cow_page, struct page **page)
> > + struct page *cow_page, struct page **page,
> > + void **entry)
> > {
> > struct vm_fault vmf;
> > int ret;
> > @@ -2800,8 +2802,10 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
> > ret = vma->vm_ops->fault(vma, &vmf);
> > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> > return ret;
> > - if (!vmf.page)
> > - goto out;
>
> Removing the above sounds seriously bogus to me as it means that below
> if (unlikely(PageHWPoison(vmf.page))) could dereference a NULL pointer.
If you do not return a valid page, you must return an appropriate return code
from the ->fault handler: VM_FAULT_NOPAGE, VM_FAULT_DAX_LOCKED, or some error.
That has always been the case except for the DAX abuse which was added by
commit 2e4cdab0584f "mm: allow page fault handlers to perform the COW" about
a year ago. And my patch fixes this abuse.
I'm not aware of any other code that would start abusing the return value
from the ->fault handler. If some such code indeed got merged during the
last year, it should be fixed as well.
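Roughly, the possible outcomes of a ->fault handler look like this (just a
sketch, not actual code):

	/* hand back a locked page */
	vmf->page = page;
	return VM_FAULT_LOCKED;

	/* or install the mapping / fail without handing back a page */
	return VM_FAULT_NOPAGE;		/* or VM_FAULT_SIGBUS, VM_FAULT_OOM, ... */

	/* or, with this series, hand back a locked DAX entry */
	vmf->entry = entry;
	return VM_FAULT_DAX_LOCKED;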
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, Apr 19, 2016 at 04:33:43PM +0200, Jan Kara wrote:
> On Tue 19-04-16 07:46:09, Jerome Glisse wrote:
> > On Mon, Apr 18, 2016 at 11:35:40PM +0200, Jan Kara wrote:
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 93897f23cc11..f09cdb8d48fa 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -63,6 +63,7 @@
> > > #include <linux/dma-debug.h>
> > > #include <linux/debugfs.h>
> > > #include <linux/userfaultfd_k.h>
> > > +#include <linux/dax.h>
> > >
> > > #include <asm/io.h>
> > > #include <asm/mmu_context.h>
> > > @@ -2785,7 +2786,8 @@ oom:
> > > */
> > > static int __do_fault(struct vm_area_struct *vma, unsigned long address,
> > > pgoff_t pgoff, unsigned int flags,
> > > - struct page *cow_page, struct page **page)
> > > + struct page *cow_page, struct page **page,
> > > + void **entry)
> > > {
> > > struct vm_fault vmf;
> > > int ret;
> > > @@ -2800,8 +2802,10 @@ static int __do_fault(struct vm_area_struct *vma, unsigned long address,
> > > ret = vma->vm_ops->fault(vma, &vmf);
> > > if (unlikely(ret & (VM_FAULT_ERROR | VM_FAULT_NOPAGE | VM_FAULT_RETRY)))
> > > return ret;
> > > - if (!vmf.page)
> > > - goto out;
> >
> > Removing the above sounds seriously bogus to me as it means that below
> > if (unlikely(PageHWPoison(vmf.page))) could dereference a NULL pointer.
>
> If you do not return a valid page, you must return an appropriate return code
> from the ->fault handler: VM_FAULT_NOPAGE, VM_FAULT_DAX_LOCKED, or some error.
> That has always been the case except for the DAX abuse which was added by
> commit 2e4cdab0584f "mm: allow page fault handlers to perform the COW" about
> a year ago. And my patch fixes this abuse.
>
> I'm not aware of any other code that would start abusing the return value
> from the ->fault handler. If some such code indeed got merged during the
> last year, it should be fixed as well.
>
Ok, my bad, I missed that.
Cheers,
Jérôme
On Tue, Apr 19 2016, Jan Kara wrote:
> Currently DAX page fault locking is racy.
>
> CPU0 (write fault) CPU1 (read fault)
>
> __dax_fault() __dax_fault()
> get_block(inode, block, &bh, 0) -> not mapped
> get_block(inode, block, &bh, 0)
> -> not mapped
> if (!buffer_mapped(&bh))
> if (vmf->flags & FAULT_FLAG_WRITE)
> get_block(inode, block, &bh, 1) -> allocates blocks
> if (page) -> no
> if (!buffer_mapped(&bh))
> if (vmf->flags & FAULT_FLAG_WRITE) {
> } else {
> dax_load_hole();
> }
> dax_insert_mapping()
>
> And we are in a situation where we fail in dax_radix_entry() with -EIO.
>
> Another problem with the current DAX page fault locking is that there is
> no race-free way to clear dirty tag in the radix tree. We can always
> end up with clean radix tree and dirty data in CPU cache.
>
> We fix the first problem by introducing locking of exceptional radix
> tree entries in DAX mappings acting very similarly to page lock and thus
> synchronizing properly faults against the same mapping index. The same
> lock can later be used to avoid races when clearing radix tree dirty
> tag.
>
> Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: NeilBrown <[email protected]> (for the exception locking bits)
I really like how you have structured the code - makes it fairly
obviously correct!
Thanks,
NeilBrown
On Mon, Apr 18, 2016 at 11:35:28PM +0200, Jan Kara wrote:
> Currently ext2 zeroes any data blocks allocated for DAX inode however it
> still returns them as BH_New. Thus DAX code zeroes them again in
> dax_insert_mapping() which can possibly overwrite the data that has been
> already stored to those blocks by a racing dax_io(). Avoid marking
> pre-zeroed buffers as new.
>
> Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
On Mon, Apr 18, 2016 at 11:35:29PM +0200, Jan Kara wrote:
> Now that all filesystems zero out blocks allocated for a fault handler,
> we can just remove the zeroing from the handler itself. Also add checks
> that no filesystem returns unwritten or new buffers to us.
>
> Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
On Mon, Apr 18, 2016 at 11:35:31PM +0200, Jan Kara wrote:
> Currently ext4 treats DAX IO the same way as direct IO. I.e., it
> allocates unwritten extents before IO is done and converts unwritten
> extents afterwards. However this way DAX IO can race with page fault to
> the same area:
>
> ext4_ext_direct_IO() dax_fault()
> dax_io()
> get_block() - allocates unwritten extent
> copy_from_iter_pmem()
> get_block() - converts
> unwritten block to
> written and zeroes it
> out
> ext4_convert_unwritten_extents()
>
> So data written with DAX IO gets lost. Similarly dax_new_buf() called
> from dax_io() can overwrite data that has been already written to the
> block via mmap.
>
> Fix the problem by using pre-zeroed blocks for DAX IO the same way as we
> use them for DAX mmap. The downside of this solution is that every
> allocating write writes each block twice (once zeros, once data). Fixing
> the race with locking is possible as well however we would need to
> lock-out faults for the whole range written to by DAX IO. And that is
> not easy to do without locking-out faults for the whole file which seems
> too aggressive.
>
> Signed-off-by: Jan Kara <[email protected]>
Just a couple of simplifications - feel free to ignore them if you don't think
they are worth the effort.
> ---
> fs/ext4/ext4.h | 11 +++++++++--
> fs/ext4/file.c | 4 ++--
> fs/ext4/inode.c | 42 +++++++++++++++++++++++++++++++++---------
> 3 files changed, 44 insertions(+), 13 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 35792b430fb6..173da8faff81 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -2521,8 +2521,8 @@ struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
> struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
> int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
> struct buffer_head *bh_result, int create);
> -int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
> - struct buffer_head *bh_result, int create);
> +int ext4_dax_get_block(struct inode *inode, sector_t iblock,
> + struct buffer_head *bh_result, int create);
> int ext4_get_block(struct inode *inode, sector_t iblock,
> struct buffer_head *bh_result, int create);
> int ext4_dio_get_block(struct inode *inode, sector_t iblock,
> @@ -3328,6 +3328,13 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
> }
> }
>
> +static inline bool ext4_aligned_io(struct inode *inode, loff_t off, loff_t len)
> +{
> + int blksize = 1 << inode->i_blkbits;
> +
> + return IS_ALIGNED(off, blksize) && IS_ALIGNED(off + len, blksize);
This could be just a tiny bit simpler by doing
return IS_ALIGNED(off, blksize) && IS_ALIGNED(len, blksize);
^^^
You've already made sure 'off' is aligned, so if 'len' is aligned 'off+len'
will be aligned.
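I.e. the helper could read:

	static inline bool ext4_aligned_io(struct inode *inode, loff_t off,
					   loff_t len)
	{
		int blksize = 1 << inode->i_blkbits;

		return IS_ALIGNED(off, blksize) && IS_ALIGNED(len, blksize);
	}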
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 23fd0e0a9223..6d5d5c1db293 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3215,12 +3215,17 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
> }
>
> #ifdef CONFIG_FS_DAX
> -int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
> - struct buffer_head *bh_result, int create)
> +/*
> + * Get block function for DAX IO and mmap faults. It takes care of converting
> + * unwritten extents to written ones and initializes new / converted blocks
> + * to zeros.
> + */
> +int ext4_dax_get_block(struct inode *inode, sector_t iblock,
> + struct buffer_head *bh_result, int create)
> {
> int ret;
>
> - ext4_debug("ext4_dax_mmap_get_block: inode %lu, create flag %d\n",
> + ext4_debug("ext4_dax_get_block: inode %lu, create flag %d\n",
> inode->i_ino, create);
This pattern could be improved by using "%s" and __func__ for the function
name. That way you don't have to hunt through all your debug code and update
strings when you rename a function. More importantly it prevents the strings
from getting out of sync with the function name, resulting in confusing debug
messages.
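E.g.:

	ext4_debug("%s: inode %lu, create flag %d\n", __func__,
		   inode->i_ino, create);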
> if (!create)
> return _ext4_get_block(inode, iblock, bh_result, 0);
> @@ -3233,9 +3238,9 @@ int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
>
> if (buffer_unwritten(bh_result)) {
> /*
> - * We are protected by i_mmap_sem so we know block cannot go
> - * away from under us even though we dropped i_data_sem.
> - * Convert extent to written and write zeros there.
> + * We are protected by i_mmap_sem or i_mutex so we know block
> + * cannot go away from under us even though we dropped
> + * i_data_sem. Convert extent to written and write zeros there.
> */
> ret = ext4_get_block_trans(inode, iblock, bh_result,
> EXT4_GET_BLOCKS_CONVERT |
> @@ -3250,6 +3255,14 @@ int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
> clear_buffer_new(bh_result);
> return 0;
> }
> +#else
> +/* Just define empty function, it will never get called. */
> +int ext4_dax_get_block(struct inode *inode, sector_t iblock,
> + struct buffer_head *bh_result, int create)
> +{
> + BUG();
> + return 0;
> +}
You don't need this stub. All the uses of ext4_dax_get_block() are either
within their own '#ifdef CONFIG_FS_DAX' sections, or they are in an
"if (IS_DAX)" conditional. The latter will also be compiled out if
CONFIG_FS_DAX isn't defined. This is because of the way that S_DAX is
defined:
#define S_DAX 8192 /* Direct Access, avoiding the page cache */
#else
#define S_DAX 0 /* Make all the DAX code disappear */
#endif
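So a call site like (just illustrating the point, not actual tree code):

	if (IS_DAX(inode))
		ret = ext4_dax_get_block(inode, iblock, &bh, create);

turns into a constant-false branch when CONFIG_FS_DAX is off, the compiler
drops it, and no reference to ext4_dax_get_block() ever reaches the linker.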
On Mon, Apr 18, 2016 at 11:35:32PM +0200, Jan Kara wrote:
> All the filesystems are now zeroing blocks themselves for DAX IO to avoid
> races between dax_io() and dax_fault(). Remove the zeroing code from
> dax_io() and add warning to catch the case when somebody unexpectedly
> returns new or unwritten buffer.
>
> Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
On Mon, Apr 18, 2016 at 11:35:33PM +0200, Jan Kara wrote:
> dax_do_io() is calling filemap_write_and_wait() if DIO_LOCKING flags is
> set. Presumably this was copied over from direct IO code. However DAX
> inodes have no pagecache pages to write so the call is pointless. Remove
> it.
>
> Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
On Mon, Apr 18, 2016 at 11:35:34PM +0200, Jan Kara wrote:
> Currently dax_pmd_fault() decides to fill a PMD-sized hole only if
> returned buffer has BH_Uptodate set. However that doesn't get set for
> any mapping buffer so that branch is actually a dead code. The
> BH_Uptodate check doesn't make any sense so just remove it.
I'm not sure about this one. In my testing (which was a while ago) I was
also never able to exercise this code path and create huge zero pages. My
concern is that by removing the buffer_uptodate() check, we will all of a
sudden start running through a code path that was previously unreachable.
AFAICT the buffer_uptodate() was part of the original PMD commit. Did we ever
get buffers with BH_Uptodate set? Has this code ever been run? Does it work?
I suppose this concern is mitigated by the fact that later in this series you
disable the PMD path entirely, but maybe we should just leave it as is and
turn it off, then clean it up if/when we reenable it when we add multi-order
radix tree locking for PMDs?
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/dax.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index 237581441bc1..42bf65b4e752 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -878,7 +878,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> goto fallback;
> }
>
> - if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
> + if (!write && !buffer_mapped(&bh)) {
> spinlock_t *ptl;
> pmd_t entry;
> struct page *zero_page = get_huge_zero_page();
> --
> 2.6.6
>
On Mon, Apr 18, 2016 at 11:35:36PM +0200, Jan Kara wrote:
> Currently the handling of huge pages for DAX is racy. For example the
> following can happen:
>
> CPU0 (THP write fault) CPU1 (normal read fault)
>
> __dax_pmd_fault() __dax_fault()
> get_block(inode, block, &bh, 0) -> not mapped
> get_block(inode, block, &bh, 0)
> -> not mapped
> if (!buffer_mapped(&bh) && write)
> get_block(inode, block, &bh, 1) -> allocates blocks
> truncate_pagecache_range(inode, lstart, lend);
> dax_load_hole();
>
> This results in data corruption since process on CPU1 won't see changes
> into the file done by CPU0.
>
> The race can happen even if two normal faults race however with THP the
> situation is even worse because the two faults don't operate on the same
> entries in the radix tree and we want to use these entries for
> serialization. So make THP support in DAX code depend on CONFIG_BROKEN
> for now.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/dax.c | 2 +-
> include/linux/dax.h | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/fs/dax.c b/fs/dax.c
> index d7addfab2094..388327f56fa8 100644
> --- a/fs/dax.c
> +++ b/fs/dax.c
> @@ -707,7 +707,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> }
> EXPORT_SYMBOL_GPL(dax_fault);
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_BROKEN)
> /*
> * The 'colour' (ie low bits) within a PMD of a page offset. This comes up
> * more often than one might expect in the below function.
> diff --git a/include/linux/dax.h b/include/linux/dax.h
> index 7c45ac7ea1d1..0591f4853228 100644
> --- a/include/linux/dax.h
> +++ b/include/linux/dax.h
> @@ -23,7 +23,7 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
> }
> #endif
>
> -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_BROKEN)
> int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> unsigned int flags, get_block_t);
> int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> --
> 2.6.6
Ah, sorry, I think by "make PMD depend on CONFIG_BROKEN" Dan & I meant this:
diff --git a/fs/Kconfig b/fs/Kconfig
index 6725f59..b8fcb41 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -52,6 +52,7 @@ config FS_DAX_PMD
depends on FS_DAX
depends on ZONE_DEVICE
depends on TRANSPARENT_HUGEPAGE
+ depends on BROKEN
endif # BLOCK
This has the benefit that you have only one place to quickly reenable PMD code
if/when we want to work on it.
On Mon, Apr 18, 2016 at 11:35:37PM +0200, Jan Kara wrote:
> We will use lowest available bit in the radix tree exceptional entry for
> locking of the entry. Define it. Also clean up definitions of DAX entry
> type bits in DAX exceptional entries to use defined constants instead of
> hardcoding numbers and cleanup checking of these bits to not rely on how
> other bits in the entry are set.
>
> Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
On Mon, Apr 18, 2016 at 11:35:38PM +0200, Jan Kara wrote:
> Currently we forbid page_cache_tree_insert() to replace exceptional radix
> tree entries for DAX inodes. However, to make DAX faults race free we will
> lock radix tree entries, and when a hole is created we need to replace
> such a locked radix tree entry with a hole page. So modify
> page_cache_tree_insert() to allow that.
>
> Signed-off-by: Jan Kara <[email protected]>
Reviewed-by: Ross Zwisler <[email protected]>
On Fri 29-04-16 12:01:58, Ross Zwisler wrote:
> On Mon, Apr 18, 2016 at 11:35:31PM +0200, Jan Kara wrote:
> > ---
> > fs/ext4/ext4.h | 11 +++++++++--
> > fs/ext4/file.c | 4 ++--
> > fs/ext4/inode.c | 42 +++++++++++++++++++++++++++++++++---------
> > 3 files changed, 44 insertions(+), 13 deletions(-)
> >
> > diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> > index 35792b430fb6..173da8faff81 100644
> > --- a/fs/ext4/ext4.h
> > +++ b/fs/ext4/ext4.h
> > @@ -2521,8 +2521,8 @@ struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
> > struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
> > int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
> > struct buffer_head *bh_result, int create);
> > -int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
> > - struct buffer_head *bh_result, int create);
> > +int ext4_dax_get_block(struct inode *inode, sector_t iblock,
> > + struct buffer_head *bh_result, int create);
> > int ext4_get_block(struct inode *inode, sector_t iblock,
> > struct buffer_head *bh_result, int create);
> > int ext4_dio_get_block(struct inode *inode, sector_t iblock,
> > @@ -3328,6 +3328,13 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
> > }
> > }
> >
> > +static inline bool ext4_aligned_io(struct inode *inode, loff_t off, loff_t len)
> > +{
> > + int blksize = 1 << inode->i_blkbits;
> > +
> > + return IS_ALIGNED(off, blksize) && IS_ALIGNED(off + len, blksize);
>
> This could be just a tiny bit simpler by doing
>
> return IS_ALIGNED(off, blksize) && IS_ALIGNED(len, blksize);
> ^^^
>
> You've already made sure 'off' is aligned, so if 'len' is aligned 'off+len'
> will be aligned.
Good point, done.
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 23fd0e0a9223..6d5d5c1db293 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -3215,12 +3215,17 @@ static int ext4_releasepage(struct page *page, gfp_t wait)
> > }
> >
> > #ifdef CONFIG_FS_DAX
> > -int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
> > - struct buffer_head *bh_result, int create)
> > +/*
> > + * Get block function for DAX IO and mmap faults. It takes care of converting
> > + * unwritten extents to written ones and initializes new / converted blocks
> > + * to zeros.
> > + */
> > +int ext4_dax_get_block(struct inode *inode, sector_t iblock,
> > + struct buffer_head *bh_result, int create)
> > {
> > int ret;
> >
> > - ext4_debug("ext4_dax_mmap_get_block: inode %lu, create flag %d\n",
> > + ext4_debug("ext4_dax_get_block: inode %lu, create flag %d\n",
> > inode->i_ino, create);
>
> This pattern could be improved by using "%s" and __func__ for the function
> name. That way you don't have to hunt through all your debug code and update
> strings when you rename a function. More importantly it prevents the strings
> from getting out of sync with the function name, resulting in confusing debug
> messages.
Actually, ext4_debug() already prepends the function name automatically, so
I've just dropped it from the format string.
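The debug line now ends up as just (sketch):

	ext4_debug("inode %lu, create flag %d\n", inode->i_ino, create);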
> > if (!create)
> > return _ext4_get_block(inode, iblock, bh_result, 0);
> > @@ -3233,9 +3238,9 @@ int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
> >
> > if (buffer_unwritten(bh_result)) {
> > /*
> > - * We are protected by i_mmap_sem so we know block cannot go
> > - * away from under us even though we dropped i_data_sem.
> > - * Convert extent to written and write zeros there.
> > + * We are protected by i_mmap_sem or i_mutex so we know block
> > + * cannot go away from under us even though we dropped
> > + * i_data_sem. Convert extent to written and write zeros there.
> > */
> > ret = ext4_get_block_trans(inode, iblock, bh_result,
> > EXT4_GET_BLOCKS_CONVERT |
> > @@ -3250,6 +3255,14 @@ int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
> > clear_buffer_new(bh_result);
> > return 0;
> > }
> > +#else
> > +/* Just define empty function, it will never get called. */
> > +int ext4_dax_get_block(struct inode *inode, sector_t iblock,
> > + struct buffer_head *bh_result, int create)
> > +{
> > + BUG();
> > + return 0;
> > +}
>
> You don't need this stub. All the uses of ext4_dax_get_block() are either
> within their own '#ifdef CONFIG_FS_DAX' sections, or they are in an
> "if (IS_DAX)" conditional. The latter will also be compiled out if
> CONFIG_FS_DAX isn't defined. This is because of the way that S_DAX is
> defined:
>
> #define S_DAX 8192 /* Direct Access, avoiding the page cache */
> #else
> #define S_DAX 0 /* Make all the DAX code disappear */
> #endif
OK, I agree it's likely not needed, but I'm somewhat wary of relying on this
compiler optimization. In more complex cases some compilers may not be able
to infer that the code is actually dead and you'd end up with a build error.
IMO that risk isn't worth saving those 7 lines of trivial code... So I've
kept this.
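For the record, the pattern Ross is relying on is something like the
following (a made-up call site, purely to illustrate the dead-code argument):

	/* IS_DAX() expands to (i_flags & 0), i.e. constant 0, without CONFIG_FS_DAX */
	if (IS_DAX(inode))
		ret = ext4_dax_get_block(inode, iblock, &bh, create);
	else
		ret = ext4_get_block(inode, iblock, &bh, create);

The DAX branch is provably dead there, so the reference to
ext4_dax_get_block() should get discarded; I'd just rather not have the
build depend on every compiler getting that right.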
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 29-04-16 13:08:05, Ross Zwisler wrote:
> On Mon, Apr 18, 2016 at 11:35:34PM +0200, Jan Kara wrote:
> > Currently dax_pmd_fault() decides to fill a PMD-sized hole only if
> > returned buffer has BH_Uptodate set. However that doesn't get set for
> > any mapping buffer, so that branch is actually dead code. The
> > BH_Uptodate check doesn't make any sense, so just remove it.
>
> I'm not sure about this one. In my testing (which was a while ago) I was
> also never able to exercise this code path and create huge zero pages. My
> concern is that by removing the buffer_uptodate() check, we will all of a
> sudden start running through a code path that was previously unreachable.
>
> AFAICT the buffer_uptodate() check was part of the original PMD commit. Did we ever
> get buffers with BH_Uptodate set? Has this code ever been run? Does it work?
>
> I suppose this concern is mitigated by the fact that later in this series you
> disable the PMD path entirely, but maybe we should just leave it as is and
> turn it off, then clean it up if/when we reenable it when we add multi-order
> radix tree locking for PMDs?
Well, I did this as a separate commit exactly because I'm not sure about
the impact, since nobody was actually able to test this code, so we can
easily bisect later if we find issues. The code just didn't make sense to
me as it was, and later in the series I update it to use radix tree locking,
which would be hard to do without code that actually makes sense. So I'd
prefer to keep this change...
Honza
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/dax.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index 237581441bc1..42bf65b4e752 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -878,7 +878,7 @@ int __dax_pmd_fault(struct vm_area_struct *vma, unsigned long address,
> > goto fallback;
> > }
> >
> > - if (!write && !buffer_mapped(&bh) && buffer_uptodate(&bh)) {
> > + if (!write && !buffer_mapped(&bh)) {
> > spinlock_t *ptl;
> > pmd_t entry;
> > struct page *zero_page = get_huge_zero_page();
> > --
> > 2.6.6
> >
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri 29-04-16 13:53:40, Ross Zwisler wrote:
> On Mon, Apr 18, 2016 at 11:35:36PM +0200, Jan Kara wrote:
> > Currently the handling of huge pages for DAX is racy. For example the
> > following can happen:
> >
> > CPU0 (THP write fault) CPU1 (normal read fault)
> >
> > __dax_pmd_fault() __dax_fault()
> > get_block(inode, block, &bh, 0) -> not mapped
> > get_block(inode, block, &bh, 0)
> > -> not mapped
> > if (!buffer_mapped(&bh) && write)
> > get_block(inode, block, &bh, 1) -> allocates blocks
> > truncate_pagecache_range(inode, lstart, lend);
> > dax_load_hole();
> >
> > This results in data corruption since the process on CPU1 won't see changes
> > to the file done by CPU0.
> >
> > The race can happen even if two normal faults race; however, with THP the
> > situation is even worse because the two faults don't operate on the same
> > entries in the radix tree and we want to use these entries for
> > serialization. So make THP support in the DAX code depend on CONFIG_BROKEN
> > for now.
> >
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/dax.c | 2 +-
> > include/linux/dax.h | 2 +-
> > 2 files changed, 2 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/dax.c b/fs/dax.c
> > index d7addfab2094..388327f56fa8 100644
> > --- a/fs/dax.c
> > +++ b/fs/dax.c
> > @@ -707,7 +707,7 @@ int dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf,
> > }
> > EXPORT_SYMBOL_GPL(dax_fault);
> >
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_BROKEN)
> > /*
> > * The 'colour' (ie low bits) within a PMD of a page offset. This comes up
> > * more often than one might expect in the below function.
> > diff --git a/include/linux/dax.h b/include/linux/dax.h
> > index 7c45ac7ea1d1..0591f4853228 100644
> > --- a/include/linux/dax.h
> > +++ b/include/linux/dax.h
> > @@ -23,7 +23,7 @@ static inline struct page *read_dax_sector(struct block_device *bdev,
> > }
> > #endif
> >
> > -#ifdef CONFIG_TRANSPARENT_HUGEPAGE
> > +#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_BROKEN)
> > int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> > unsigned int flags, get_block_t);
> > int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
> > --
> > 2.6.6
>
> Ah, sorry, I think by "make PMD depend on CONFIG_BROKEN" Dan & I meant this:
>
> diff --git a/fs/Kconfig b/fs/Kconfig
> index 6725f59..b8fcb41 100644
> --- a/fs/Kconfig
> +++ b/fs/Kconfig
> @@ -52,6 +52,7 @@ config FS_DAX_PMD
> depends on FS_DAX
> depends on ZONE_DEVICE
> depends on TRANSPARENT_HUGEPAGE
> + depends on BROKEN
>
> endif # BLOCK
>
> This has the benefit that you have only one place to quickly reenable PMD code
> if/when we want to work on it.
OK, makes sense. I've changed this.
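The guards in the code can then presumably go back to keying off the single
Kconfig symbol, e.g. for the declarations in include/linux/dax.h (sketch,
assuming we reuse the existing CONFIG_FS_DAX_PMD symbol):

#ifdef CONFIG_FS_DAX_PMD
int dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
		unsigned int flags, get_block_t);
int __dax_pmd_fault(struct vm_area_struct *, unsigned long addr, pmd_t *,
		unsigned int flags, get_block_t);
#endif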
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon, Apr 18, 2016 at 11:35:23PM +0200, Jan Kara wrote:
> Hello,
>
> this is my third attempt at DAX page fault locking rewrite. The patch set has
> passed xfstests both with and without DAX mount option on ext4 and xfs for
> me and also additional page fault beating using the new page fault stress
> tests I have added to xfstests. So I'd be grateful if you guys could have a
> closer look at the patches so that they can be merged. Thanks.
>
> Changes since v2:
> - lot of additional ext4 fixes and cleanups
> - make PMD page faults depend on CONFIG_BROKEN instead of #if 0
> - fixed page reference leak when replacing hole page with a pfn
> - added some reviewed-by tags
> - rebased on top of current Linus' tree
>
> Changes since v1:
> - handle wakeups of exclusive waiters properly
> - fix cow fault races
> - other minor stuff
>
> General description
>
> The basic idea is that we use a bit in an exceptional radix tree entry as
> a lock bit and use it similarly to how page lock is used for normal faults.
> That way we fix races between hole instantiation and read faults of the
> same index. For now I have disabled PMD faults since there the issues with
> page fault locking are even worse. Now that Matthew's multi-order radix tree
> has landed, I can have a look into using that for proper locking of PMD faults
> but first I want normal pages sorted out.
>
> In the end I have decided to implement the bit locking directly in the DAX
> code. Originally I was thinking we could provide something generic directly
> in the radix tree code but the functions DAX needs are rather specific.
> Maybe someone else will have a good idea how to distill some generally useful
> functions out of what I've implemented for DAX but for now I didn't bother
> with that.
>
> Honza
Hey Jan,
I've been testing with this a bit today, and I hit the following issue with
generic/231. I was able to reproduce it 100% of the time with both ext4 and
XFS.
Here's the test:
# ./check generic/231
FSTYP -- ext4
PLATFORM -- Linux/x86_64 lorwyn 4.6.0-rc5+
MKFS_OPTIONS -- /dev/pmem0p2
MOUNT_OPTIONS -- -o dax -o context=system_u:object_r:nfs_t:s0 /dev/pmem0p2 /mnt/xfstests_scratch
generic/231 28s ..../check: line 542: 1545 Segmentation fault ./$seq > $tmp.rawout 2>&1
[failed, exit status 139] - output mismatch (see /root/xfstests/results//generic/231.out.bad)
--- tests/generic/231.out 2016-01-12 09:24:26.420085531 -0700
+++ /root/xfstests/results//generic/231.out.bad 2016-05-05 21:25:18.629675139 -0600
@@ -1,16 +1,3 @@
QA output created by 231
=== FSX Standard Mode, Memory Mapping, 1 Tasks ===
All 20000 operations completed A-OK!
-Comparing user usage
-Comparing group usage
-=== FSX Standard Mode, Memory Mapping, 4 Tasks ===
-All 20000 operations completed A-OK!
...
(Run 'diff -u tests/generic/231.out /root/xfstests/results//generic/231.out.bad' to see the entire diff)
And the log, passed through kasan_symbolize.py:
kernel BUG at mm/workingset.c:423!
invalid opcode: 0000 [#1] SMP
Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
CPU: 1 PID: 1545 Comm: 231 Not tainted 4.6.0-rc5+ #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
task: ffff880505853180 ti: ffff880504f08000 task.ti: ffff880504f08000
RIP: 0010:[<ffffffff81207b93>] [<ffffffff81207b93>] shadow_lru_isolate+0x183/0x1a0
RSP: 0018:ffff880504f0bbe8 EFLAGS: 00010006
RAX: ffff880094f304b8 RBX: ffff880094f304a8 RCX: ffff880094f306b8
RDX: 0000000000000077 RSI: 0000000000000000 RDI: ffff8800b8483000
RBP: ffff880504f0bc10 R08: 0000000000000004 R09: 0000000000000000
R10: ffff88009ba51ec8 R11: 0000000000000080 R12: ffff8800b8483000
R13: ffff88009ba51eb0 R14: ffff88009ba51e98 R15: ffff8800b8483048
FS: 00007f3332e14700(0000) GS:ffff88051a200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f3332e2c000 CR3: 000000051425c000 CR4: 00000000000006e0
Stack:
ffff8800b8483000 ffff8800b8483048 ffff880504f0bd18 ffff880094f11b78
ffff880094f304a8 ffff880504f0bc60 ffffffff81206c7f 0000000000000000
0000000000000000 ffffffff81207a10 ffff880504f0bd10 0000000000000000
Call Trace:
[<ffffffff81206c7f>] __list_lru_walk_one.isra.3+0x9f/0x150 mm/list_lru.c:223
[<ffffffff81206d53>] list_lru_walk_one+0x23/0x30 mm/list_lru.c:263
[< inline >] list_lru_shrink_walk include/linux/list_lru.h:170
[<ffffffff81207bea>] scan_shadow_nodes+0x3a/0x50 mm/workingset.c:457
[< inline >] do_shrink_slab mm/vmscan.c:344
[<ffffffff811ea37e>] shrink_slab.part.40+0x1fe/0x420 mm/vmscan.c:442
[<ffffffff811ea5c9>] shrink_slab+0x29/0x30 mm/vmscan.c:406
[<ffffffff811ec831>] drop_slab_node+0x31/0x60 mm/vmscan.c:460
[<ffffffff811ec89f>] drop_slab+0x3f/0x70 mm/vmscan.c:471
[<ffffffff812d8c39>] drop_caches_sysctl_handler+0x69/0xb0 fs/drop_caches.c:58
[<ffffffff812f2937>] proc_sys_call_handler+0xe7/0x100 fs/proc/proc_sysctl.c:543
[<ffffffff812f2964>] proc_sys_write+0x14/0x20 fs/proc/proc_sysctl.c:561
[<ffffffff81269aa7>] __vfs_write+0x37/0x120 fs/read_write.c:529
[<ffffffff8126a3fc>] vfs_write+0xac/0x1a0 fs/read_write.c:578
[< inline >] SYSC_write fs/read_write.c:625
[<ffffffff8126b8d8>] SyS_write+0x58/0xd0 fs/read_write.c:617
[<ffffffff81a92a3c>] entry_SYSCALL_64_fastpath+0x1f/0xbd arch/x86/entry/entry_64.S:207
Code: 66 90 66 66 90 e8 4e 53 88 00 fa 66 66 90 66 66 90 e8 52 5c ef ff 4c 89 e7 e8 ba a1 88 00 89 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 66 66 66 66 66 66 2e
RIP [<ffffffff81207b93>] shadow_lru_isolate+0x183/0x1a0 mm/workingset.c:448
RSP <ffff880504f0bbe8>
---[ end trace 8e4a52e5c9e07c83 ]---
This passes 100% of the time with my baseline, which was just v4.6-rc5.
For convenience I've pushed a working tree of what I was testing here:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=jan_testing
My setup is just a pair of PMEM ramdisks for my test device and scratch device.
Let me know if you have any trouble reproducing this result.
Thanks,
- Ross
On Mon, Apr 18, 2016 at 11:35:39PM +0200, Jan Kara wrote:
> Currently DAX page fault locking is racy.
>
> CPU0 (write fault) CPU1 (read fault)
>
> __dax_fault() __dax_fault()
> get_block(inode, block, &bh, 0) -> not mapped
> get_block(inode, block, &bh, 0)
> -> not mapped
> if (!buffer_mapped(&bh))
> if (vmf->flags & FAULT_FLAG_WRITE)
> get_block(inode, block, &bh, 1) -> allocates blocks
> if (page) -> no
> if (!buffer_mapped(&bh))
> if (vmf->flags & FAULT_FLAG_WRITE) {
> } else {
> dax_load_hole();
> }
> dax_insert_mapping()
>
> And we are in a situation where we fail in dax_radix_entry() with -EIO.
>
> Another problem with the current DAX page fault locking is that there is
> no race-free way to clear dirty tag in the radix tree. We can always
> end up with clean radix tree and dirty data in CPU cache.
>
> We fix the first problem by introducing locking of exceptional radix
> tree entries in DAX mappings acting very similarly to page lock and thus
> synchronizing properly faults against the same mapping index. The same
> lock can later be used to avoid races when clearing radix tree dirty
> tag.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> @@ -300,6 +324,259 @@ ssize_t dax_do_io(struct kiocb *iocb, struct inode *inode,
> EXPORT_SYMBOL_GPL(dax_do_io);
>
> /*
> + * DAX radix tree locking
> + */
> +struct exceptional_entry_key {
> + struct radix_tree_root *root;
> + unsigned long index;
> +};
I believe that we basically just need the struct exceptional_entry_key to
uniquely identify an entry, correct? I agree that we get this with the pair
[struct radix_tree_root, index], but we also get it with
[struct address_space, index], and we might want to use the latter here since
that's the pair that is used to look up the wait queue in
dax_entry_waitqueue(). Functionally I don't think it matters (correct me if
I'm wrong), but it makes for a nicer symmetry.
> +/*
> + * Find radix tree entry at given index. If it points to a page, return with
> + * the page locked. If it points to the exceptional entry, return with the
> + * radix tree entry locked. If the radix tree doesn't contain given index,
> + * create empty exceptional entry for the index and return with it locked.
> + *
> + * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
> + * persistent memory the benefit is doubtful. We can add that later if we can
> + * show it helps.
> + */
> +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> +{
> + void *ret, **slot;
> +
> +restart:
> + spin_lock_irq(&mapping->tree_lock);
> + ret = get_unlocked_mapping_entry(mapping, index, &slot);
> + /* No entry for given index? Make sure radix tree is big enough. */
> + if (!ret) {
> + int err;
> +
> + spin_unlock_irq(&mapping->tree_lock);
> + err = radix_tree_preload(
> + mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
In the conversation about v2 of this series you said:
> Note that we take the hit for dropping the lock only if we really need to
> allocate new radix tree node so about once per 64 new entries. So it is not
> too bad.
I think this is incorrect. We get here whenever we get a NULL return from
__radix_tree_lookup(). I believe that this happens if we don't have a node,
in which case we need an allocation, but I think it also happens in the case
where we do have a node and we just have a NULL slot in that node.
For the behavior you're looking for (only preload if you need to do an
allocation), you probably need to check the 'slot' we get back from
get_unlocked_mapping_entry(), yea?
> +/*
> + * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
> + * entry to get unlocked before deleting it.
> + */
> +int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
> +{
> + void *entry;
> +
> + spin_lock_irq(&mapping->tree_lock);
> + entry = get_unlocked_mapping_entry(mapping, index, NULL);
> + /*
> + * Caller should make sure radix tree modifications don't race and
> + * we have seen exceptional entry here before.
> + */
> + if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry))) {
dax_delete_mapping_entry() is only called from clear_exceptional_entry().
With this new code we've changed the behavior of that call path a little.
In the various places where clear_exceptional_entry() is called, the code
batches up a bunch of entries in a pvec via pagevec_lookup_entries(). We
don't hold the mapping->tree_lock between the time this lookup happens and the
time that the entry is passed to clear_exceptional_entry(). This is why the
old code did a verification that the entry passed in matched what was still
currently present in the radix tree. This was done in the DAX case via
radix_tree_delete_item(), and it was open coded in clear_exceptional_entry()
for the page cache case. In both cases if the entry didn't match what was
currently in the tree, we bailed without doing anything.
This new code doesn't verify against the 'entry' passed to
clear_exceptional_entry(), but instead makes sure it is an exceptional entry
before removing, and if not it does a WARN_ON_ONCE().
This changes things because:
a) If the exceptional entry changed, say from a plain lock entry to an actual
DAX entry, we wouldn't notice, and we would just clear the latter out. My
guess is that this is fine, I just wanted to call it out.
b) If we have a non-exceptional entry here now, say because our lock entry has
been swapped out for a zero page, we will WARN_ON_ONCE() and return without a
removal. I think we may want to silence the WARN_ON_ONCE(), as I believe this
could happen during normal operation and we don't want to scare anyone. :)
> +/*
> * The user has performed a load from a hole in the file. Allocating
> * a new page in the file would cause excessive storage usage for
> * workloads with sparse files. We allocate a page cache page instead.
> @@ -307,15 +584,24 @@ EXPORT_SYMBOL_GPL(dax_do_io);
> * otherwise it will simply fall out of the page cache under memory
> * pressure without ever having been dirtied.
> */
> -static int dax_load_hole(struct address_space *mapping, struct page *page,
> - struct vm_fault *vmf)
> +static int dax_load_hole(struct address_space *mapping, void *entry,
> + struct vm_fault *vmf)
> {
> - if (!page)
> - page = find_or_create_page(mapping, vmf->pgoff,
> - GFP_KERNEL | __GFP_ZERO);
> - if (!page)
> - return VM_FAULT_OOM;
> + struct page *page;
> +
> + /* Hole page already exists? Return it... */
> + if (!radix_tree_exceptional_entry(entry)) {
> + vmf->page = entry;
> + return VM_FAULT_LOCKED;
> + }
>
> + /* This will replace locked radix tree entry with a hole page */
> + page = find_or_create_page(mapping, vmf->pgoff,
> + vmf->gfp_mask | __GFP_ZERO);
This replacement happens via page_cache_tree_insert(), correct? In this case,
who wakes up anyone waiting on the old lock entry that we just killed? In the
non-hole case we would traverse through put_locked_mapping_entry(), but I
don't see that in the hole case.
> @@ -963,23 +1228,18 @@ EXPORT_SYMBOL_GPL(dax_pmd_fault);
> int dax_pfn_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
> struct file *file = vma->vm_file;
> - int error;
> -
> - /*
> - * We pass NO_SECTOR to dax_radix_entry() because we expect that a
> - * RADIX_DAX_PTE entry already exists in the radix tree from a
> - * previous call to __dax_fault(). We just want to look up that PTE
> - * entry using vmf->pgoff and make sure the dirty tag is set. This
> - * saves us from having to make a call to get_block() here to look
> - * up the sector.
> - */
> - error = dax_radix_entry(file->f_mapping, vmf->pgoff, NO_SECTOR, false,
> - true);
> + struct address_space *mapping = file->f_mapping;
> + void *entry;
> + pgoff_t index = vmf->pgoff;
>
> - if (error == -ENOMEM)
> - return VM_FAULT_OOM;
> - if (error)
> - return VM_FAULT_SIGBUS;
> + spin_lock_irq(&mapping->tree_lock);
> + entry = get_unlocked_mapping_entry(mapping, index, NULL);
> + if (!entry || !radix_tree_exceptional_entry(entry))
> + goto out;
> + radix_tree_tag_set(&mapping->page_tree, index, PAGECACHE_TAG_DIRTY);
> + put_unlocked_mapping_entry(mapping, index, entry);
I really like how simple this function has become. :)
On Mon, Apr 18, 2016 at 11:35:23PM +0200, Jan Kara wrote:
> Hello,
>
> this is my third attempt at DAX page fault locking rewrite. The patch set has
> passed xfstests both with and without DAX mount option on ext4 and xfs for
> me and also additional page fault beating using the new page fault stress
> tests I have added to xfstests. So I'd be grateful if you guys could have a
> closer look at the patches so that they can be merged. Thanks.
>
> Changes since v2:
> - lot of additional ext4 fixes and cleanups
> - make PMD page faults depend on CONFIG_BROKEN instead of #if 0
> - fixed page reference leak when replacing hole page with a pfn
> - added some reviewed-by tags
> - rebased on top of current Linus' tree
>
> Changes since v1:
> - handle wakeups of exclusive waiters properly
> - fix cow fault races
> - other minor stuff
>
> General description
>
> The basic idea is that we use a bit in an exceptional radix tree entry as
> a lock bit and use it similarly to how page lock is used for normal faults.
> That way we fix races between hole instantiation and read faults of the
> same index. For now I have disabled PMD faults since there the issues with
> page fault locking are even worse. Now that Matthew's multi-order radix tree
> has landed, I can have a look into using that for proper locking of PMD faults
> but first I want normal pages sorted out.
>
> In the end I have decided to implement the bit locking directly in the DAX
> code. Originally I was thinking we could provide something generic directly
> in the radix tree code but the functions DAX needs are rather specific.
> Maybe someone else will have a good idea how to distill some generally useful
> functions out of what I've implemented for DAX but for now I didn't bother
> with that.
>
> Honza
Hey Jan,
Another hit in testing, which may or may not be related to the last one. The
BUG is a few lines off from the previous report:
kernel BUG at mm/workingset.c:423!
vs
kernel BUG at mm/workingset.c:435!
I've been able to consistently hit this one using DAX + ext4 with generic/086.
For some reason generic/086 always passes when run by itself, but fails
consistently if you run it after a set of other tests. Here is a relatively
fast set that reproduces it:
# ./check generic/070 generic/071 generic/072 generic/073 generic/074 generic/075 generic/076 generic/077 generic/078 generic/079 generic/080 generic/082 generic/083 generic/084 generic/085 generic/086
FSTYP -- ext4
PLATFORM -- Linux/x86_64 lorwyn 4.6.0-rc5+
MKFS_OPTIONS -- /dev/pmem0p2
MOUNT_OPTIONS -- -o dax -o context=system_u:object_r:nfs_t:s0 /dev/pmem0p2 /mnt/xfstests_scratch
generic/070 1s ... 2s
generic/071 1s ... 0s
generic/072 3s ... 4s
generic/073 1s ... [not run] Cannot run tests with DAX on dmflakey devices
generic/074 91s ... 85s
generic/075 1s ... 2s
generic/076 1s ... 2s
generic/077 3s ... 2s
generic/078 1s ... 1s
generic/079 0s ... 1s
generic/080 3s ... 2s
generic/082 0s ... 1s
generic/083 1s ... 4s
generic/084 6s ... 6s
generic/085 5s ... 4s
generic/086 1s ..../check: line 519: 19233 Segmentation fault ./$seq > $tmp.rawout 2>&1
[failed, exit status 139] - output mismatch (see /root/xfstests/results//generic/086.out.bad)
--- tests/generic/086.out 2015-10-02 10:19:36.799795846 -0600
+++ /root/xfstests/results//generic/086.out.bad 2016-05-06 13:38:17.682976273 -0600
@@ -1,14 +1 @@
QA output created by 086
-00000000 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
-*
-00001000 aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa aa |................|
-*
-00001800 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
-*
...
(Run 'diff -u tests/generic/086.out /root/xfstests/results//generic/086.out.bad' to see the entire diff)
And the kernel log, passed through kasan_symbolize.py:
------------[ cut here ]------------
kernel BUG at mm/workingset.c:435!
invalid opcode: 0000 [#1] SMP
Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
CPU: 1 PID: 19233 Comm: 086 Not tainted 4.6.0-rc5+ #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.2-20150714_191134- 04/01/2014
task: ffff8800ba88ca40 ti: ffff88040bde4000 task.ti: ffff88040bde4000
RIP: 0010:[<ffffffff81207b9d>] [<ffffffff81207b9d>] shadow_lru_isolate+0x18d/0x1a0
RSP: 0018:ffff88040bde7be8 EFLAGS: 00010006
RAX: 0000000000000100 RBX: ffff8801f2c91dc0 RCX: ffff8801f2c91fd0
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff880410b2ac00
RBP: ffff88040bde7c10 R08: 0000000000000004 R09: 0000000000000000
R10: ffff8801fa31b038 R11: 0000000000000080 R12: ffff880410b2ac00
R13: ffff8801fa31b020 R14: ffff8801fa31b008 R15: ffff880410b2ac48
FS: 00007f67ccee8700(0000) GS:ffff88041a200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f67ccf00000 CR3: 000000040e434000 CR4: 00000000000006e0
Stack:
ffff880410b2ac00 ffff880410b2ac48 ffff88040bde7d18 ffff8801fa2cfb78
ffff8801f2c91dc0 ffff88040bde7c60 ffffffff81206c7f 0000000000000000
0000000000000000 ffffffff81207a10 ffff88040bde7d10 0000000000000000
Call Trace:
[<ffffffff81206c7f>] __list_lru_walk_one.isra.3+0x9f/0x150 mm/list_lru.c:223
[<ffffffff81206d53>] list_lru_walk_one+0x23/0x30 mm/list_lru.c:263
[< inline >] list_lru_shrink_walk include/linux/list_lru.h:170
[<ffffffff81207bea>] scan_shadow_nodes+0x3a/0x50 mm/workingset.c:457
[< inline >] do_shrink_slab mm/vmscan.c:344
[<ffffffff811ea37e>] shrink_slab.part.40+0x1fe/0x420 mm/vmscan.c:442
[<ffffffff811ea5c9>] shrink_slab+0x29/0x30 mm/vmscan.c:406
[<ffffffff811ec831>] drop_slab_node+0x31/0x60 mm/vmscan.c:460
[<ffffffff811ec89f>] drop_slab+0x3f/0x70 mm/vmscan.c:471
[<ffffffff812d8c39>] drop_caches_sysctl_handler+0x69/0xb0 fs/drop_caches.c:58
[<ffffffff812f2937>] proc_sys_call_handler+0xe7/0x100 fs/proc/proc_sysctl.c:543
[<ffffffff812f2964>] proc_sys_write+0x14/0x20 fs/proc/proc_sysctl.c:561
[<ffffffff81269aa7>] __vfs_write+0x37/0x120 fs/read_write.c:529
[<ffffffff8126a3fc>] vfs_write+0xac/0x1a0 fs/read_write.c:578
[< inline >] SYSC_write fs/read_write.c:625
[<ffffffff8126b8d8>] SyS_write+0x58/0xd0 fs/read_write.c:617
[<ffffffff81a92a3c>] entry_SYSCALL_64_fastpath+0x1f/0xbd arch/x86/entry/entry_64.S:207
Code: fa 66 66 90 66 66 90 e8 52 5c ef ff 4c 89 e7 e8 ba a1 88 00 89 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b <0f> 0b 0f 0b 66 66 66 66 66 66 2e 0f 1f 84 00 00 00 00 00 66 66
RIP [<ffffffff81207b9d>] shadow_lru_isolate+0x18d/0x1a0 mm/workingset.c:422
RSP <ffff88040bde7be8>
Same setup as last time, a pair of PMEM ramdisks. This was created with the
same working tree:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=jan_testing
Thanks!
- Ross
On Fri 06-05-16 14:33:08, Ross Zwisler wrote:
> On Mon, Apr 18, 2016 at 11:35:23PM +0200, Jan Kara wrote:
> > Hello,
> >
> > this is my third attempt at DAX page fault locking rewrite. The patch set has
> > passed xfstests both with and without DAX mount option on ext4 and xfs for
> > me and also additional page fault beating using the new page fault stress
> > tests I have added to xfstests. So I'd be grateful if you guys could have a
> > closer look at the patches so that they can be merged. Thanks.
> >
> > Changes since v2:
> > - lot of additional ext4 fixes and cleanups
> > - make PMD page faults depend on CONFIG_BROKEN instead of #if 0
> > - fixed page reference leak when replacing hole page with a pfn
> > - added some reviewed-by tags
> > - rebased on top of current Linus' tree
> >
> > Changes since v1:
> > - handle wakeups of exclusive waiters properly
> > - fix cow fault races
> > - other minor stuff
> >
> > General description
> >
> > The basic idea is that we use a bit in an exceptional radix tree entry as
> > a lock bit and use it similarly to how page lock is used for normal faults.
> > That way we fix races between hole instantiation and read faults of the
> > same index. For now I have disabled PMD faults since there the issues with
> > page fault locking are even worse. Now that Matthew's multi-order radix tree
> > has landed, I can have a look into using that for proper locking of PMD faults
> > but first I want normal pages sorted out.
> >
> > In the end I have decided to implement the bit locking directly in the DAX
> > code. Originally I was thinking we could provide something generic directly
> > in the radix tree code but the functions DAX needs are rather specific.
> > Maybe someone else will have a good idea how to distill some generally useful
> > functions out of what I've implemented for DAX but for now I didn't bother
> > with that.
> >
> > Honza
>
> Hey Jan,
>
> Another hit in testing, which may or may not be related to the last one. The
> BUG is a few lines off from the previous report:
> kernel BUG at mm/workingset.c:423!
> vs
> kernel BUG at mm/workingset.c:435!
>
> I've been able to consistently hit this one using DAX + ext4 with generic/086.
> For some reason generic/086 always passes when run by itself, but fails
> consistently if you run it after a set of other tests. Here is a relatively
> fast set that reproduces it:
Thanks for the reports! It is strange that I didn't see this happening, but
I've been testing against a somewhat older base so maybe something has
changed. Anyway, the culprit seems to be that the workingset tracking code
messes with a radix tree which is managed by DAX; these two were never meant
to coexist, so assertions naturally trip. In particular, we should not add
the radix tree node to the workingset list of nodes for eviction in
page_cache_tree_delete() for DAX inodes. However, that seems to happen in
your case and so far I don't quite understand why...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon, 2016-04-18 at 23:35 +0200, Jan Kara wrote:
> <>
Hi Jan,
I've noticed that patches 1 through 12 of your series are relatively
independent, and are probably more stable than the remaining part of
the series that actually changes locking.
My dax error handling series also depends on the patches that change
zeroing in DAX (patches 5, 6, 9).
To allow the error handling stuff to move faster, can we split these
into two patchsets?
I was hoping to send the dax error handling series through the nvdimm
tree, and if you'd like, I can also prepend my series with your patches
1-12.
Let me know your preference.
Thanks,
-Vishal
Hi!
On Mon 09-05-16 21:28:06, Verma, Vishal L wrote:
> On Mon, 2016-04-18 at 23:35 +0200, Jan Kara wrote:
> I've noticed that patches 1 through 12 of your series are relatively
> independent, and are probably more stable than the remaining part of
> the series that actually changes locking.
Yes.
> My dax error handling series also depends on the patches that change
> zeroing in DAX (patches 5, 6, 9).
>
> To allow the error handling stuff to move faster, can we split these
> into two patchsets?
>
> I was hoping to send the dax error handling series through the nvdimm
> tree, and if you'd like, I can also prepend your patches 1-12 with my
> series.
So I'm thinking about how to best merge this. There are some ext4 patches
which are not trivial (mainly "ext4: Refactor direct IO code"). These can go
in as far as I'm concerned, but there is a potential for conflicts in the
ext4 tree and I'd definitely want to give them a full test run there. The
best I can think of is to pull the ext4-related changes into a stable branch
in the ext4 tree and then pull that branch into the nvdimm tree. Ted, what do
you think? If you agree, I can separate the patches into three parts: one for
the ext4 tree, stable patches for the nvdimm tree, and then the remaining
patches.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu 05-05-16 22:13:50, Ross Zwisler wrote:
> On Mon, Apr 18, 2016 at 11:35:39PM +0200, Jan Kara wrote:
> > /*
> > + * DAX radix tree locking
> > + */
> > +struct exceptional_entry_key {
> > + struct radix_tree_root *root;
> > + unsigned long index;
> > +};
>
> I believe that we basically just need the struct exceptional_entry_key to
> uniquely identify an entry, correct? I agree that we get this with the pair
> [struct radix_tree_root, index], but we also get it with
> [struct address_space, index], and we might want to use the latter here since
> that's the pair that is used to look up the wait queue in
> dax_entry_waitqueue(). Functionally I don't think it matters (correct me if
> I'm wrong), but it makes for a nicer symmetry.
OK, makes sense. Changed.
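i.e. the key now looks something like this (sketch; naming may still
change):

struct exceptional_entry_key {
	struct address_space *mapping;
	unsigned long index;
};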
> > +/*
> > + * Find radix tree entry at given index. If it points to a page, return with
> > + * the page locked. If it points to the exceptional entry, return with the
> > + * radix tree entry locked. If the radix tree doesn't contain given index,
> > + * create empty exceptional entry for the index and return with it locked.
> > + *
> > + * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
> > + * persistent memory the benefit is doubtful. We can add that later if we can
> > + * show it helps.
> > + */
> > +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> > +{
> > + void *ret, **slot;
> > +
> > +restart:
> > + spin_lock_irq(&mapping->tree_lock);
> > + ret = get_unlocked_mapping_entry(mapping, index, &slot);
> > + /* No entry for given index? Make sure radix tree is big enough. */
> > + if (!ret) {
> > + int err;
> > +
> > + spin_unlock_irq(&mapping->tree_lock);
> > + err = radix_tree_preload(
> > + mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
>
> In the conversation about v2 of this series you said:
>
> > Note that we take the hit for dropping the lock only if we really need to
> > allocate new radix tree node so about once per 64 new entries. So it is not
> > too bad.
>
> I think this is incorrect. We get here whenever we get a NULL return from
> __radix_tree_lookup(). I believe that this happens if we don't have a node,
> in which case we need an allocation, but I think it also happens in the case
> where we do have a node and we just have a NULL slot in that node.
>
> For the behavior you're looking for (only preload if you need to do an
> allocation), you probably need to check the 'slot' we get back from
> get_unlocked_mapping_entry(), yea?
You are correct. However, __radix_tree_lookup() currently doesn't return a
slot pointer if the entry was not found, so it is not easy to fix. I'd leave
the code as is for now and we can later optimize the case where we don't
need to grow the radix tree...
>
> > +/*
> > + * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
> > + * entry to get unlocked before deleting it.
> > + */
> > +int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
> > +{
> > + void *entry;
> > +
> > + spin_lock_irq(&mapping->tree_lock);
> > + entry = get_unlocked_mapping_entry(mapping, index, NULL);
> > + /*
> > + * Caller should make sure radix tree modifications don't race and
> > + * we have seen exceptional entry here before.
> > + */
> > + if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry))) {
>
> dax_delete_mapping_entry() is only called from clear_exceptional_entry().
> With this new code we've changed the behavior of that call path a little.
>
> In the various places where clear_exceptional_entry() is called, the code
> batches up a bunch of entries in a pvec via pagevec_lookup_entries(). We
> don't hold the mapping->tree_lock between the time this lookup happens and the
> time that the entry is passed to clear_exceptional_entry(). This is why the
> old code did a verification that the entry passed in matched what was still
> currently present in the radix tree. This was done in the DAX case via
> radix_tree_delete_item(), and it was open coded in clear_exceptional_entry()
> for the page cache case. In both cases if the entry didn't match what was
> currently in the tree, we bailed without doing anything.
>
> This new code doesn't verify against the 'entry' passed to
> clear_exceptional_entry(), but instead makes sure it is an exceptional entry
> before removing, and if not it does a WARN_ON_ONCE().
>
> This changes things because:
>
> a) If the exceptional entry changed, say from a plain lock entry to an actual
> DAX entry, we wouldn't notice, and we would just clear the latter out. My
> guess is that this is fine, I just wanted to call it out.
>
> b) If we have a non-exceptional entry here now, say because our lock entry has
> been swapped out for a zero page, we will WARN_ON_ONCE() and return without a
> removal. I think we may want to silence the WARN_ON_ONCE(), as I believe this
> could happen during normal operation and we don't want to scare anyone. :)
So your concerns are exactly why I have added a comment to
dax_delete_mapping_entry() that:
/*
* Caller should make sure radix tree modifications don't race and
* we have seen exceptional entry here before.
*/
The thing is that dax_delete_mapping_entry() is called only from the
truncate / punch hole path. Those callers should hold i_mmap_sem for writing
and thus there should be no modifications of the radix tree. If anything
changes between what truncate_inode_pages() (or similar functions) finds and
what dax_delete_mapping_entry() sees, we have a locking bug and I want to
know about it :). Any suggestion how I should expand the comment so that
this is clearer?
> > +/*
> > * The user has performed a load from a hole in the file. Allocating
> > * a new page in the file would cause excessive storage usage for
> > * workloads with sparse files. We allocate a page cache page instead.
> > @@ -307,15 +584,24 @@ EXPORT_SYMBOL_GPL(dax_do_io);
> > * otherwise it will simply fall out of the page cache under memory
> > * pressure without ever having been dirtied.
> > */
> > -static int dax_load_hole(struct address_space *mapping, struct page *page,
> > - struct vm_fault *vmf)
> > +static int dax_load_hole(struct address_space *mapping, void *entry,
> > + struct vm_fault *vmf)
> > {
> > - if (!page)
> > - page = find_or_create_page(mapping, vmf->pgoff,
> > - GFP_KERNEL | __GFP_ZERO);
> > - if (!page)
> > - return VM_FAULT_OOM;
> > + struct page *page;
> > +
> > + /* Hole page already exists? Return it... */
> > + if (!radix_tree_exceptional_entry(entry)) {
> > + vmf->page = entry;
> > + return VM_FAULT_LOCKED;
> > + }
> >
> > + /* This will replace locked radix tree entry with a hole page */
> > + page = find_or_create_page(mapping, vmf->pgoff,
> > + vmf->gfp_mask | __GFP_ZERO);
>
> This replacement happens via page_cache_tree_insert(), correct? In this case,
> who wakes up anyone waiting on the old lock entry that we just killed? In the
> non-hole case we would traverse through put_locked_mapping_entry(), but I
> don't see that in the hole case.
Ha, good catch. We miss the wakeup. Fixed.
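The fix is essentially to wake waiters when page_cache_tree_insert()
replaces the locked DAX entry with the hole page, something like the
following sketch (assuming the wakeup helper from this series, or an
equivalent, is made available to mm/filemap.c):

	if (dax_mapping(mapping)) {
		/* Wake anyone waiting on the locked entry we just replaced */
		dax_wake_mapping_entry_waiter(mapping, page->index, false);
	}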
Attached is the diff resulting from your review of this patch. I still have
to hunt down that strange interaction with workingset code you've reported...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon 09-05-16 11:38:28, Jan Kara wrote:
> On Fri 06-05-16 14:33:08, Ross Zwisler wrote:
> > On Mon, Apr 18, 2016 at 11:35:23PM +0200, Jan Kara wrote:
> > > Hello,
> > >
> > > this is my third attempt at DAX page fault locking rewrite. The patch set has
> > > passed xfstests both with and without DAX mount option on ext4 and xfs for
> > > me and also additional page fault beating using the new page fault stress
> > > tests I have added to xfstests. So I'd be grateful if you guys could have a
> > > closer look at the patches so that they can be merged. Thanks.
> > >
> > > Changes since v2:
> > > - lot of additional ext4 fixes and cleanups
> > > - make PMD page faults depend on CONFIG_BROKEN instead of #if 0
> > > - fixed page reference leak when replacing hole page with a pfn
> > > - added some reviewed-by tags
> > > - rebased on top of current Linus' tree
> > >
> > > Changes since v1:
> > > - handle wakeups of exclusive waiters properly
> > > - fix cow fault races
> > > - other minor stuff
> > >
> > > General description
> > >
> > > The basic idea is that we use a bit in an exceptional radix tree entry as
> > > a lock bit and use it similarly to how page lock is used for normal faults.
> > > That way we fix races between hole instantiation and read faults of the
> > > same index. For now I have disabled PMD faults since there the issues with
> > > page fault locking are even worse. Now that Matthew's multi-order radix tree
> > > has landed, I can have a look into using that for proper locking of PMD faults
> > > but first I want normal pages sorted out.
> > >
> > > In the end I have decided to implement the bit locking directly in the DAX
> > > code. Originally I was thinking we could provide something generic directly
> > > in the radix tree code but the functions DAX needs are rather specific.
> > > Maybe someone else will have a good idea how to distill some generally useful
> > > functions out of what I've implemented for DAX but for now I didn't bother
> > > with that.
> > >
> > > Honza
> >
> > Hey Jan,
> >
> > Another hit in testing, which may or may not be related to the last one. The
> > BUG is a few lines off from the previous report:
> > kernel BUG at mm/workingset.c:423!
> > vs
> > kernel BUG at mm/workingset.c:435!
> >
> > I've been able to consistently hit this one using DAX + ext4 with generic/086.
> > For some reason generic/086 always passes when run by itself, but fails
> > consistently if you run it after a set of other tests. Here is a relatively
> > fast set that reproduces it:
>
> Thanks for reports! It is strange that I didn't see this happening but I've
> been testing against somewhat older base so maybe something has changed.
> Anyway the culprit seems to be that workingset tracking code messes with
> radix tree which is managed by DAX and these two were never meant to
> coexist so assertions naturally trip. In particular we should not add radix
> tree node to working-set list of nodes for eviction in
> page_cache_tree_delete() for DAX inodes. However that seems to happen in
> your case and so far I don't quite understand why...
Somehow, I'm not able to reproduce the warnings... Anyway, I think I see
what's going on. Can you check whether the warning goes away when you
change the condition at the end of page_cache_tree_delete() to:
if (!dax_mapping(mapping) && !workingset_node_pages(node) &&
list_empty(&node->private_list)) {
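In context, the tail of the function would then read roughly (against
current mainline, untested):

	if (!dax_mapping(mapping) && !workingset_node_pages(node) &&
	    list_empty(&node->private_list)) {
		node->private_data = mapping;
		list_lru_add(&workingset_shadow_nodes, &node->private_list);
	}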
Thanks!
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, May 10, 2016 at 05:28:14PM +0200, Jan Kara wrote:
> On Mon 09-05-16 11:38:28, Jan Kara wrote:
> Somehow, I'm not able to reproduce the warnings... Anyway, I think I see
> what's going on. Can you check whether the warning goes away when you
> change the condition at the end of page_cache_tree_delete() to:
>
> if (!dax_mapping(mapping) && !workingset_node_pages(node) &&
> list_empty(&node->private_list)) {
Yep, this took care of both of the issues that I reported. I'll restart my
testing with this in my baseline, but as of this fix I don't have any more
open testing issues. :)
On Tue, May 10, 2016 at 02:30:03PM -0600, Ross Zwisler wrote:
> On Tue, May 10, 2016 at 05:28:14PM +0200, Jan Kara wrote:
> > On Mon 09-05-16 11:38:28, Jan Kara wrote:
> > Somehow, I'm not able to reproduce the warnings... Anyway, I think I see
> > what's going on. Can you check whether the warning goes away when you
> > change the condition at the end of page_cache_tree_delete() to:
> >
> > if (!dax_mapping(mapping) && !workingset_node_pages(node) &&
> > list_empty(&node->private_list)) {
>
> Yep, this took care of both of the issues that I reported. I'll restart my
> testing with this in my baseline, but as of this fix I don't have any more
> open testing issues. :)
Well, looks like I spoke too soon. The two tests that were failing for me are
now passing, but I can still create what looks like a related failure using
XFS, DAX, and the two xfstests generic/231 and generic/232 run back-to-back.
Here's the shell:
# ./check generic/231 generic/232
FSTYP -- xfs (debug)
PLATFORM -- Linux/x86_64 alara 4.6.0-rc5jan_testing_2+
MKFS_OPTIONS -- -f -bsize=4096 /dev/pmem0p2
MOUNT_OPTIONS -- -o dax -o context=system_u:object_r:nfs_t:s0 /dev/pmem0p2 /mnt/xfstests_scratch
generic/231 88s ... 88s
generic/232 2s ..../check: line 543: 9105 Segmentation fault ./$seq > $tmp.rawout 2>&1
[failed, exit status 139] - output mismatch (see /root/xfstests/results//generic/232.out.bad)
--- tests/generic/232.out 2015-10-02 10:19:36.806795894 -0600
+++ /root/xfstests/results//generic/232.out.bad 2016-05-10 16:17:54.805637876 -0600
@@ -3,5 +3,3 @@
Testing fsstress
seed = S
-Comparing user usage
-Comparing group usage
...
(Run 'diff -u tests/generic/232.out /root/xfstests/results//generic/232.out.bad' to see the entire diff)
and the serial log:
run fstests generic/232 at 2016-05-10 16:17:53
XFS (pmem0p2): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
XFS (pmem0p2): Mounting V5 Filesystem
XFS (pmem0p2): Ending clean mount
XFS (pmem0p2): Quotacheck needed: Please wait.
XFS (pmem0p2): Quotacheck: Done.
------------[ cut here ]------------
kernel BUG at mm/workingset.c:423!
invalid opcode: 0000 [#1] SMP
Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
CPU: 1 PID: 9105 Comm: 232 Not tainted 4.6.0-rc5jan_testing_2+ #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.9.1-0-gb3ef39f-prebuilt.qemu-project.org 04/01/2014
task: ffff8801f4eb98c0 ti: ffff88040e5f0000 task.ti: ffff88040e5f0000
RIP: 0010:[<ffffffff81207b93>] [<ffffffff81207b93>] shadow_lru_isolate+0x183/0x1a0
RSP: 0018:ffff88040e5f3be8 EFLAGS: 00010006
RAX: ffff880401f68270 RBX: ffff880401f68260 RCX: ffff880401f68470
RDX: 000000000000006c RSI: 0000000000000000 RDI: ffff880410b2bd80
RBP: ffff88040e5f3c10 R08: 0000000000000008 R09: 0000000000000000
R10: ffff8800b59eb840 R11: 0000000000000080 R12: ffff880410b2bd80
R13: ffff8800b59eb828 R14: ffff8800b59eb810 R15: ffff880410b2bdc8
FS: 00007fb73c58c700(0000) GS:ffff88041a200000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fb73c5a4000 CR3: 000000040e139000 CR4: 00000000000006e0
Stack:
ffff880410b2bd80 ffff880410b2bdc8 ffff88040e5f3d18 ffff880410b2bdc8
ffff880401f68260 ffff88040e5f3c60 ffffffff81206c7f 0000000000000000
0000000000000000 ffffffff81207a10 ffff88040e5f3d10 0000000000000000
Call Trace:
[<ffffffff81206c7f>] __list_lru_walk_one.isra.3+0x9f/0x150 mm/list_lru.c:223
[<ffffffff81206d53>] list_lru_walk_one+0x23/0x30 mm/list_lru.c:263
[< inline >] list_lru_shrink_walk include/linux/list_lru.h:170
[<ffffffff81207bea>] scan_shadow_nodes+0x3a/0x50 mm/workingset.c:457
[< inline >] do_shrink_slab mm/vmscan.c:344
[<ffffffff811ea37e>] shrink_slab.part.40+0x1fe/0x420 mm/vmscan.c:442
[<ffffffff811ea5c9>] shrink_slab+0x29/0x30 mm/vmscan.c:406
[<ffffffff811ec831>] drop_slab_node+0x31/0x60 mm/vmscan.c:460
[<ffffffff811ec89f>] drop_slab+0x3f/0x70 mm/vmscan.c:471
[<ffffffff812d8c39>] drop_caches_sysctl_handler+0x69/0xb0 fs/drop_caches.c:58
[<ffffffff812f2937>] proc_sys_call_handler+0xe7/0x100 fs/proc/proc_sysctl.c:543
[<ffffffff812f2964>] proc_sys_write+0x14/0x20 fs/proc/proc_sysctl.c:561
[<ffffffff81269aa7>] __vfs_write+0x37/0x120 fs/read_write.c:529
[<ffffffff8126a3fc>] vfs_write+0xac/0x1a0 fs/read_write.c:578
[< inline >] SYSC_write fs/read_write.c:625
[<ffffffff8126b8d8>] SyS_write+0x58/0xd0 fs/read_write.c:617
[<ffffffff81a92a3c>] entry_SYSCALL_64_fastpath+0x1f/0xbd arch/x86/entry/entry_64.S:207
Code: 66 90 66 66 90 e8 4e 53 88 00 fa 66 66 90 66 66 90 e8 52 5c ef ff 4c 89 e7 e8 ba a1 88 00 89 d8 5b 41 5c 41 5d 41 5e 41 5f 5d c3 <0f> 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 0f 0b 66 66 66 66 66 66 2e
RIP [<ffffffff81207b93>] shadow_lru_isolate+0x183/0x1a0 mm/workingset.c:448
RSP <ffff88040e5f3be8>
---[ end trace c4ff9bc94605ec45 ]---
This was against a tree with your most recent fix. The full tree can be found
here:
https://git.kernel.org/cgit/linux/kernel/git/zwisler/linux.git/log/?h=jan_testing
This only recreates on about 1/2 of the runs of these tests in my system.
Thanks,
- Ross
On Tue 10-05-16 16:39:37, Ross Zwisler wrote:
> On Tue, May 10, 2016 at 02:30:03PM -0600, Ross Zwisler wrote:
> > On Tue, May 10, 2016 at 05:28:14PM +0200, Jan Kara wrote:
> > > On Mon 09-05-16 11:38:28, Jan Kara wrote:
> > > Somehow, I'm not able to reproduce the warnings... Anyway, I think I see
> > > what's going on. Can you check whether the warning goes away when you
> > > change the condition at the end of page_cache_tree_delete() to:
> > >
> > > if (!dax_mapping(mapping) && !workingset_node_pages(node) &&
> > > list_empty(&node->private_list)) {
> >
> > Yep, this took care of both of the issues that I reported. I'll restart my
> > testing with this in my baseline, but as of this fix I don't have any more
> > open testing issues. :)
>
> Well, looks like I spoke too soon. The two tests that were failing for me are
> now passing, but I can still create what looks like a related failure using
> XFS, DAX, and the two xfstests generic/231 and generic/232 run back-to-back.
Hum, full xfstests run completes for me just fine. Can you reproduce the
issue with the attached debug patch? Thanks!
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, May 11, 2016 at 11:19:30AM +0200, Jan Kara wrote:
> On Tue 10-05-16 16:39:37, Ross Zwisler wrote:
> > On Tue, May 10, 2016 at 02:30:03PM -0600, Ross Zwisler wrote:
> > > On Tue, May 10, 2016 at 05:28:14PM +0200, Jan Kara wrote:
> > > > On Mon 09-05-16 11:38:28, Jan Kara wrote:
> > > > Somehow, I'm not able to reproduce the warnings... Anyway, I think I see
> > > > what's going on. Can you check whether the warning goes away when you
> > > > change the condition at the end of page_cache_tree_delete() to:
> > > >
> > > > if (!dax_mapping(mapping) && !workingset_node_pages(node) &&
> > > > list_empty(&node->private_list)) {
> > >
> > > Yep, this took care of both of the issues that I reported. I'll restart my
> > > testing with this in my baseline, but as of this fix I don't have any more
> > > open testing issues. :)
> >
> > Well, looks like I spoke too soon. The two tests that were failing for me are
> > now passing, but I can still create what looks like a related failure using
> > XFS, DAX, and the two xfstests generic/231 and generic/232 run back-to-back.
>
> Hum, full xfstests run completes for me just fine. Can you reproduce the
> issue with the attached debug patch? Thanks!
Here's the resulting debug:
[ 212.541923] Wrong node->count 244.
[ 212.542316] Host sb pmem0p2 ino 2097257
[ 212.542696] Node dump: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
On Tue, May 10, 2016 at 02:27:15PM +0200, Jan Kara wrote:
> On Thu 05-05-16 22:13:50, Ross Zwisler wrote:
> > On Mon, Apr 18, 2016 at 11:35:39PM +0200, Jan Kara wrote:
> > > +/*
> > > + * Find radix tree entry at given index. If it points to a page, return with
> > > + * the page locked. If it points to the exceptional entry, return with the
> > > + * radix tree entry locked. If the radix tree doesn't contain given index,
> > > + * create empty exceptional entry for the index and return with it locked.
> > > + *
> > > + * Note: Unlike filemap_fault() we don't honor FAULT_FLAG_RETRY flags. For
> > > + * persistent memory the benefit is doubtful. We can add that later if we can
> > > + * show it helps.
> > > + */
> > > +static void *grab_mapping_entry(struct address_space *mapping, pgoff_t index)
> > > +{
> > > + void *ret, **slot;
> > > +
> > > +restart:
> > > + spin_lock_irq(&mapping->tree_lock);
> > > + ret = get_unlocked_mapping_entry(mapping, index, &slot);
> > > + /* No entry for given index? Make sure radix tree is big enough. */
> > > + if (!ret) {
> > > + int err;
> > > +
> > > + spin_unlock_irq(&mapping->tree_lock);
> > > + err = radix_tree_preload(
> > > + mapping_gfp_mask(mapping) & ~__GFP_HIGHMEM);
> >
> > In the conversation about v2 of this series you said:
> >
> > > Note that we take the hit for dropping the lock only if we really need to
> > > allocate new radix tree node so about once per 64 new entries. So it is not
> > > too bad.
> >
> > I think this is incorrect. We get here whenever we get a NULL return from
> > __radix_tree_lookup(). I believe that this happens if we don't have a node,
> > in which case we need an allocation, but I think it also happens in the case
> > where we do have a node and we just have a NULL slot in that node.
> >
> > For the behavior you're looking for (only preload if you need to do an
> > allocation), you probably need to check the 'slot' we get back from
> > get_unlocked_mapping_entry(), yea?
>
> You are correct. However currently __radix_tree_lookup() doesn't return a
> slot pointer if entry was not found so it is not easy to fix. So I'd leave
> the code as is for now and we can later optimize the case where we don't
> need to grow the radix tree...
Ah, you're right. Sure, that plan sounds good.
> > > +/*
> > > + * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
> > > + * entry to get unlocked before deleting it.
> > > + */
> > > +int dax_delete_mapping_entry(struct address_space *mapping, pgoff_t index)
> > > +{
> > > + void *entry;
> > > +
> > > + spin_lock_irq(&mapping->tree_lock);
> > > + entry = get_unlocked_mapping_entry(mapping, index, NULL);
> > > + /*
> > > + * Caller should make sure radix tree modifications don't race and
> > > + * we have seen exceptional entry here before.
> > > + */
> > > + if (WARN_ON_ONCE(!entry || !radix_tree_exceptional_entry(entry))) {
> >
> > dax_delete_mapping_entry() is only called from clear_exceptional_entry().
> > With this new code we've changed the behavior of that call path a little.
> >
> > In the various places where clear_exceptional_entry() is called, the code
> > batches up a bunch of entries in a pvec via pagevec_lookup_entries(). We
> > don't hold the mapping->tree_lock between the time this lookup happens and the
> > time that the entry is passed to clear_exceptional_entry(). This is why the
> > old code did a verification that the entry passed in matched what was still
> > currently present in the radix tree. This was done in the DAX case via
> > radix_tree_delete_item(), and it was open coded in clear_exceptional_entry()
> > for the page cache case. In both cases if the entry didn't match what was
> > currently in the tree, we bailed without doing anything.
> >
> > This new code doesn't verify against the 'entry' passed to
> > clear_exceptional_entry(), but instead makes sure it is an exceptional entry
> > before removing, and if not it does a WARN_ON_ONCE().
> >
> > This changes things because:
> >
> > a) If the exceptional entry changed, say from a plain lock entry to an actual
> > DAX entry, we wouldn't notice, and we would just clear the latter out. My
> > guess is that this is fine, I just wanted to call it out.
> >
> > b) If we have a non-exceptional entry here now, say because our lock entry has
> > been swapped out for a zero page, we will WARN_ON_ONCE() and return without a
> > removal. I think we may want to silence the WARN_ON_ONCE(), as I believe this
> > could happen during normal operation and we don't want to scare anyone. :)
>
> So your concerns are exactly why I have added a comment to
> dax_delete_mapping_entry() that:
>
> /*
> * Caller should make sure radix tree modifications don't race and
> * we have seen exceptional entry here before.
> */
>
> The thing is, dax_delete_mapping_entry() is called only from truncate /
> punch hole path. Those should hold i_mmap_sem for writing and thus there
> should be no modifications of the radix tree. If anything changes between
> what truncate_inode_pages() (or similar functions) finds and what
> dax_delete_mapping_entry() sees, we have a locking bug and I want to know
> about it :). Any suggestion on how I should expand the comment so that this
> is clearer?
Ah, I didn't understand all that. :) Given a bit more context the comment
seems fine - if anything it could be a bit more specific, and include the
text: "dax_delete_mapping_entry() is called only from truncate / punch hole
path. Those should hold i_mmap_sem for writing and thus there should be no
modifications of the radix tree." Either way - thanks for explaining.
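To spell out the locking picture for anyone else reading along, it is roughly
the following (a sketch using ext4's naming; the exact __dax_fault() signature
at this point in the series may differ):

	/* Truncate / punch hole side: exclusive, so no page fault can
	 * install or change radix tree entries while we delete them. */
	down_write(&EXT4_I(inode)->i_mmap_sem);
	truncate_pagecache(inode, newsize);	/* -> dax_delete_mapping_entry() */
	up_write(&EXT4_I(inode)->i_mmap_sem);

	/* Fault side: shared, so faults may run concurrently with each
	 * other but never against truncate / punch hole. */
	down_read(&EXT4_I(inode)->i_mmap_sem);
	ret = __dax_fault(vma, vmf, ext4_dax_mmap_get_block);
	up_read(&EXT4_I(inode)->i_mmap_sem);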
> > > +/*
> > > * The user has performed a load from a hole in the file. Allocating
> > > * a new page in the file would cause excessive storage usage for
> > > * workloads with sparse files. We allocate a page cache page instead.
> > > @@ -307,15 +584,24 @@ EXPORT_SYMBOL_GPL(dax_do_io);
> > > * otherwise it will simply fall out of the page cache under memory
> > > * pressure without ever having been dirtied.
> > > */
> > > -static int dax_load_hole(struct address_space *mapping, struct page *page,
> > > - struct vm_fault *vmf)
> > > +static int dax_load_hole(struct address_space *mapping, void *entry,
> > > + struct vm_fault *vmf)
> > > {
> > > - if (!page)
> > > - page = find_or_create_page(mapping, vmf->pgoff,
> > > - GFP_KERNEL | __GFP_ZERO);
> > > - if (!page)
> > > - return VM_FAULT_OOM;
> > > + struct page *page;
> > > +
> > > + /* Hole page already exists? Return it... */
> > > + if (!radix_tree_exceptional_entry(entry)) {
> > > + vmf->page = entry;
> > > + return VM_FAULT_LOCKED;
> > > + }
> > >
> > > + /* This will replace locked radix tree entry with a hole page */
> > > + page = find_or_create_page(mapping, vmf->pgoff,
> > > + vmf->gfp_mask | __GFP_ZERO);
> >
> > This replacement happens via page_cache_tree_insert(), correct? In this case,
> > who wakes up anyone waiting on the old lock entry that we just killed? In the
> > non-hole case we would traverse through put_locked_mapping_entry(), but I
> > don't see that in the hole case.
>
> Ha, good catch. We miss the wakeup. Fixed.
>
> Attached is the diff resulting from your review of this patch. I still have
> to hunt down that strange interaction with workingset code you've reported...
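Presumably the wakeup fix is something along these lines (a sketch, not the
actual incremental diff; dax_wake_mapping_entry_waiter() is a stand-in name
for whatever helper the series uses to wake waiters of an exceptional entry):

	/* In page_cache_tree_insert(), when the slot already holds the DAX
	 * exceptional entry that the hole page is replacing: */
	if (dax_mapping(mapping)) {
		/*
		 * The locked empty entry disappears together with its lock
		 * bit, so waiters queued on (mapping, index) must be woken
		 * here or they would sleep forever.
		 */
		dax_wake_mapping_entry_waiter(mapping, page->index, false);
	}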
At the end of this mail I've attached one small fixup for the incremental diff
you sent. Aside from that, I think that you've addressed all my review
feedback, thanks!
Reviewed-by: Ross Zwisler <[email protected]>
I'm going to try and get more info on the working set test failure.
---
diff --git a/fs/dax.c b/fs/dax.c
index f496854..c4cb69b 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -406,7 +406,7 @@ static void *get_unlocked_mapping_entry(struct address_space *mapping,
init_wait(&ewait.wait);
ewait.wait.func = wake_exceptional_entry_func;
- ewait.key.root = &mapping->page_tree;
+ ewait.key.mapping = mapping;
ewait.key.index = index;
for (;;) {
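(For context, the wait key this hunk touches presumably looks something like
the struct below; the field names are inferred from the assignments above and
the struct name is a guess. Keying waiters on the mapping plus index lets the
wake function wake only those waiting for this particular entry.)

	struct exceptional_entry_key {
		struct address_space *mapping;
		pgoff_t index;
	};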
On Wed 11-05-16 13:26:32, Ross Zwisler wrote:
> > > In the various places where clear_exceptional_entry() is called, the code
> > > batches up a bunch of entries in a pvec via pagevec_lookup_entries(). We
> > > don't hold the mapping->tree_lock between the time this lookup happens and the
> > > time that the entry is passed to clear_exceptional_entry(). This is why the
> > > old code did a verification that the entry passed in matched what was still
> > > currently present in the radix tree. This was done in the DAX case via
> > > radix_tree_delete_item(), and it was open coded in clear_exceptional_entry()
> > > for the page cache case. In both cases if the entry didn't match what was
> > > currently in the tree, we bailed without doing anything.
> > >
> > > This new code doesn't verify against the 'entry' passed to
> > > clear_exceptional_entry(), but instead makes sure it is an exceptional entry
> > > before removing, and if not it does a WARN_ON_ONCE().
> > >
> > > This changes things because:
> > >
> > > a) If the exceptional entry changed, say from a plain lock entry to an actual
> > > DAX entry, we wouldn't notice, and we would just clear the latter out. My
> > > guess is that this is fine, I just wanted to call it out.
> > >
> > > b) If we have a non-exceptional entry here now, say because our lock entry has
> > > been swapped out for a zero page, we will WARN_ON_ONCE() and return without a
> > > removal. I think we may want to silence the WARN_ON_ONCE(), as I believe this
> > > could happen during normal operation and we don't want to scare anyone. :)
> >
> > So your concerns are exactly why I have added a comment to
> > dax_delete_mapping_entry() that:
> >
> > /*
> > * Caller should make sure radix tree modifications don't race and
> > * we have seen exceptional entry here before.
> > */
> >
> > The thing is, dax_delete_mapping_entry() is called only from truncate /
> > punch hole path. Those should hold i_mmap_sem for writing and thus there
> > should be no modifications of the radix tree. If anything changes between
> > what truncate_inode_pages() (or similar functions) finds and what
> > dax_delete_mapping_entry() sees, we have a locking bug and I want to know
> > about it :). Any suggestion on how I should expand the comment so that
> > this is clearer?
>
> Ah, I didn't understand all that. :) Given a bit more context the comment
> seems fine - if anything it could be a bit more specific, and include the
> text: "dax_delete_mapping_entry() is called only from truncate / punch hole
> path. Those should hold i_mmap_sem for writing and thus there should be no
> modifications of the radix tree." Either way - thanks for explaining.
OK, I've made the comment more detailed.
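Something along these lines (the final wording may differ):

/*
 * Delete exceptional DAX entry at @index from @mapping. Wait for radix tree
 * entry to get unlocked before deleting it.
 *
 * This function is called only from the truncate / punch hole path, which
 * holds i_mmap_sem (or the filesystem's equivalent) for writing, so no radix
 * tree modifications can race with it. If the entry found here is not the
 * exceptional entry the caller saw earlier, that is a locking bug.
 */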
> At the end of this mail I've attached one small fixup for the incremental diff
> you sent. Aside from that, I think that you've addressed all my review
> feedback, thanks!
Yup, I found that out as well when compiling the new version.
> Reviewed-by: Ross Zwisler <[email protected]>
Thanks.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR