2015-10-22 08:16:21

by Jan Kara

[permalink] [raw]
Subject: [PATCH 0/9 v2] ext4: Punch hole and DAX fixes

Hello,

This series fixes a long standing problem of racing punch hole and page fault
resulting in possible filesystem corruption or stale data exposure. We fix the
problem by using a new inode-private rw_semaphore i_mmap_sem to synchronize
page faults with truncate and punch hole operations.

When having this exclusion, the only remaining problem with DAX implementation
are races between two page faults zeroing out same block concurrently (where
the data written after the first fault finishes are possibly overwritten by
the second fault still doing zeroing).

Patch 1 introduces i_mmap_sem lock in ext4 inode and uses it to properly
serialize extent manipulation operations and page faults.

Patch 2 is mostly a preparatory cleanup patch which also avoids double lock /
unlock in unlocked DIO protections (currently harmless but nasty surprise).

Patches 3-4 fix further races of extent manipulation functions (such as zero
range, collapse range, insert range) with buffered IO, page writeback

Patch 5 documents locking order of ext4 filesystem locks.

Patch 6 removes locking abuse of i_data_sem from the get_blocks() path when
dioread_nolock is enabled since it is not needed anymore.

Patches 7-9 implement allocation of pre-zeroed blocks in ext4_map_blocks()
callback and use such blocks for allocations from DAX page faults.

The patches survived xfstests run both in dax and non-dax mode.

Honza


2015-10-22 08:16:21

by Jan Kara

[permalink] [raw]
Subject: [PATCH 2/9] ext4: Move unlocked dio protection from ext4_alloc_file_blocks()

Currently ext4_alloc_file_blocks() was handling protection against
unlocked DIO. However we now need to sometimes call it under i_mmap_sem
and sometimes not and DIO protection ranks above it (although strictly
speaking this cannot currently create any deadlocks). Also
ext4_zero_range() was actually getting & releasing unlocked DIO
protection twice in some cases. Luckily it didn't introduce any real bug
but it was a land mine waiting to be stepped on. So move DIO protection
out from ext4_alloc_file_blocks() into the two callsites.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/extents.c | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 6e0c5d37a232..66ab89b58c1f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4681,10 +4681,6 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
if (len <= EXT_UNWRITTEN_MAX_LEN)
flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;

- /* Wait all existing dio workers, newcomers will block on i_mutex */
- ext4_inode_block_unlocked_dio(inode);
- inode_dio_wait(inode);
-
/*
* credits to insert 1 extent into extent tree
*/
@@ -4748,8 +4744,6 @@ retry:
goto retry;
}

- ext4_inode_resume_unlocked_dio(inode);
-
return ret > 0 ? ret2 : ret;
}

@@ -4823,6 +4817,10 @@ static long ext4_zero_range(struct file *file, loff_t offset,
if (mode & FALLOC_FL_KEEP_SIZE)
flags |= EXT4_GET_BLOCKS_KEEP_SIZE;

+ /* Wait all existing dio workers, newcomers will block on i_mutex */
+ ext4_inode_block_unlocked_dio(inode);
+ inode_dio_wait(inode);
+
/* Preallocate the range including the unaligned edges */
if (partial_begin || partial_end) {
ret = ext4_alloc_file_blocks(file,
@@ -4831,7 +4829,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
round_down(offset, 1 << blkbits)) >> blkbits,
new_size, flags, mode);
if (ret)
- goto out_mutex;
+ goto out_dio;

}

@@ -4840,10 +4838,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
EXT4_EX_NOCACHE);

- /* Wait all existing dio workers, newcomers will block on i_mutex */
- ext4_inode_block_unlocked_dio(inode);
- inode_dio_wait(inode);

2015-10-22 08:16:21

by Jan Kara

[permalink] [raw]
Subject: [PATCH 4/9] ext4: Fix races of writeback with punch hole and zero range

When doing delayed allocation, update of on-disk inode size is postponed
until IO submission time. However hole punch or zero range fallocate
calls can end up discarding the tail page cache page and thus on-disk
inode size would never be properly updated.

Make sure the on-disk inode size is updated before truncating page
cache.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 3 +++
fs/ext4/extents.c | 5 +++++
fs/ext4/inode.c | 35 ++++++++++++++++++++++++++++++++++-
3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c19ff61ccbdf..c86546efec30 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2691,6 +2691,9 @@ static inline int ext4_update_inode_size(struct inode *inode, loff_t newsize)
return changed;
}

+int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
+ loff_t len);
+
struct ext4_group_info {
unsigned long bb_state;
struct rb_root bb_free_root;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 892245a55c53..9977b557d50d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4843,6 +4843,11 @@ static long ext4_zero_range(struct file *file, loff_t offset,
* released from page cache.
*/
down_write(&EXT4_I(inode)->i_mmap_sem);
+ ret = ext4_update_disksize_before_punch(inode, offset, len);
+ if (ret) {
+ up_write(&EXT4_I(inode)->i_mmap_sem);
+ goto out_dio;
+ }
/* Now release the pages and zero block aligned part of pages */
truncate_pagecache_range(inode, start, end - 1);
inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 36ad45906d26..f72212ef1fee 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3517,6 +3517,35 @@ int ext4_can_truncate(struct inode *inode)
}

/*
+ * We have to make sure i_disksize gets properly updated before we truncate
+ * page cache due to hole punching or zero range. Otherwise i_disksize update
+ * can get lost as it may have been postponed to submission of writeback but
+ * that will never happen after we truncate page cache.
+ */
+int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
+ loff_t len)
+{
+ handle_t *handle;
+ loff_t size = i_size_read(inode);
+
+ WARN_ON(!mutex_is_locked(&inode->i_mutex));
+ if (offset > size || offset + len < size)
+ return 0;
+
+ if (EXT4_I(inode)->i_disksize >= size)
+ return 0;
+
+ handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+ ext4_update_i_disksize(inode, size);
+ ext4_mark_inode_dirty(handle, inode);
+ ext4_journal_stop(handle);
+
+ return 0;
+}
+
+/*
* ext4_punch_hole: punches a hole in a file by releaseing the blocks
* associated with the given offset and length
*
@@ -3594,9 +3623,13 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
last_block_offset = round_down((offset + length), sb->s_blocksize) - 1;

/* Now release the pages and zero block aligned part of pages*/
- if (last_block_offset > first_block_offset)
+ if (last_block_offset > first_block_offset) {
+ ret = ext4_update_disksize_before_punch(inode, offset, length);
+ if (ret)
+ goto out_dio;
truncate_pagecache_range(inode, first_block_offset,
last_block_offset);
+ }

if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
credits = ext4_writepage_trans_blocks(inode);
--
2.1.4


2015-10-22 08:16:24

by Jan Kara

[permalink] [raw]
Subject: [PATCH 9/9] ext4: Use pre-zeroed blocks for DAX page faults

Make DAX fault path use pre-zeroed blocks to avoid races with extent
conversion and zeroing when two page faults to the same block happen.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 4 +--
fs/ext4/file.c | 20 +++-----------
fs/ext4/inode.c | 81 +++++++++++++++++++++++++++++++++++++++++++++------------
3 files changed, 69 insertions(+), 36 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0a11a34d54c9..9b6e5813968b 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2284,8 +2284,8 @@ struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
int ext4_get_block_write(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
-int ext4_get_block_dax(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create);
+int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create);
int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 0d24ebcd7c9e..749b222e6498 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -193,18 +193,6 @@ out:
}

#ifdef CONFIG_FS_DAX
-static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
-{
- struct inode *inode = bh->b_assoc_map->host;
- /* XXX: breaks on 32-bit > 16TB. Is that even supported? */
- loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
- int err;
- if (!uptodate)
- return;
- WARN_ON(!buffer_unwritten(bh));
- err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);
-}
-
static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
int result;
@@ -225,8 +213,7 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
if (IS_ERR(handle))
result = VM_FAULT_SIGBUS;
else
- result = __dax_fault(vma, vmf, ext4_get_block_dax,
- ext4_end_io_unwritten);
+ result = __dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);

if (write) {
if (!IS_ERR(handle))
@@ -262,7 +249,7 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
result = VM_FAULT_SIGBUS;
else
result = __dax_pmd_fault(vma, addr, pmd, flags,
- ext4_get_block_dax, ext4_end_io_unwritten);
+ ext4_dax_mmap_get_block, NULL);

if (write) {
if (!IS_ERR(handle))
@@ -283,8 +270,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
sb_start_pagefault(inode->i_sb);
file_update_time(vma->vm_file);
down_read(&EXT4_I(inode)->i_mmap_sem);
- err = __dax_mkwrite(vma, vmf, ext4_get_block_dax,
- ext4_end_io_unwritten);
+ err = __dax_mkwrite(vma, vmf, ext4_dax_mmap_get_block, NULL);
up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(inode->i_sb);

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 73c6214084d4..c2a476a35464 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -718,16 +718,6 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,

map_bh(bh, inode->i_sb, map.m_pblk);
bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
- if (IS_DAX(inode) && buffer_unwritten(bh)) {
- /*
- * dgc: I suspect unwritten conversion on ext4+DAX is
- * fundamentally broken here when there are concurrent
- * read/write in progress on this inode.
- */
- WARN_ON_ONCE(io_end);
- bh->b_assoc_map = inode->i_mapping;
- bh->b_private = (void *)(unsigned long)iblock;
- }
if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN)
set_buffer_defer_completion(bh);
bh->b_size = inode->i_sb->s_blocksize * map.m_len;
@@ -3050,17 +3040,74 @@ static int ext4_get_block_overwrite(struct inode *inode, sector_t iblock,
return ret;
}

-int ext4_get_block_dax(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create)
+#ifdef CONFIG_FS_DAX
+int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create)
{
- int flags = EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_UNWRIT_EXT;
+ int ret, err;
+ int credits;
+ struct ext4_map_blocks map;
+ handle_t *handle = NULL;
+ int flags = 0;

- if (create)
- flags |= EXT4_GET_BLOCKS_CREATE;
- ext4_debug("ext4_get_block_dax: inode %lu, create flag %d\n",
+ ext4_debug("ext4_dax_mmap_get_block: inode %lu, create flag %d\n",
inode->i_ino, create);
- return _ext4_get_block(inode, iblock, bh_result, flags);
+ map.m_lblk = iblock;
+ map.m_len = bh_result->b_size >> inode->i_blkbits;
+ credits = ext4_chunk_trans_blocks(inode, map.m_len);
+ if (create) {
+ flags |= EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_CREATE_ZERO;
+ handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ return ret;
+ }
+ }
+
+ ret = ext4_map_blocks(handle, inode, &map, flags);
+ if (create) {
+ err = ext4_journal_stop(handle);
+ if (ret >= 0 && err < 0)
+ ret = err;
+ }
+ if (ret <= 0)
+ goto out;
+ if (map.m_flags & EXT4_MAP_UNWRITTEN) {
+ int err2;
+
+ /*
+ * We are protected by i_mmap_sem so we know block cannot go
+ * away from under us even though we dropped i_data_sem.
+ * Convert extent to written and write zeros there.
+ *
+ * Note: We may get here even when create == 0.
+ */
+ handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+
+ err = ext4_map_blocks(handle, inode, &map,
+ EXT4_GET_BLOCKS_CONVERT | EXT4_GET_BLOCKS_CREATE_ZERO);
+ if (err < 0)
+ ret = err;
+ err2 = ext4_journal_stop(handle);
+ if (err2 < 0 && ret > 0)
+ ret = err2;
+ }
+out:
+ WARN_ON_ONCE(ret == 0 && create);
+ if (ret > 0) {
+ map_bh(bh_result, inode->i_sb, map.m_pblk);
+ bh_result->b_state = (bh_result->b_state & ~EXT4_MAP_FLAGS) |
+ map.m_flags;
+ bh_result->b_size = map.m_len << inode->i_blkbits;
+ ret = 0;
+ }
+ return ret;
}
+#endif

static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
ssize_t size, void *private)
--
2.1.4


2015-10-22 08:16:21

by Jan Kara

[permalink] [raw]
Subject: [PATCH 1/9] ext4: Fix races between page faults and hole punching

Currently, page faults and hole punching are completely unsynchronized.
This can result in page fault faulting in a page into a range that we
are punching after truncate_pagecache_range() has been called and thus
we can end up with a page mapped to disk blocks that will be shortly
freed. Filesystem corruption will shortly follow. Note that the same
race is avoided for truncate by checking page fault offset against
i_size but there isn't similar mechanism available for punching holes.

Fix the problem by creating new rw semaphore i_mmap_sem in inode and
grab it for writing over truncate, hole punching, and other functions
removing blocks from extent tree and for read over page faults. We
cannot easily use i_data_sem for this since that ranks below transaction
start and we need something ranking above it so that it can be held over
the whole truncate / hole punching operation. Also remove various
workarounds we had in the code to reduce race window when page fault
could have created pages with stale mapping information.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 10 +++++++++
fs/ext4/extents.c | 54 ++++++++++++++++++++++++--------------------
fs/ext4/file.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++--------
fs/ext4/inode.c | 36 +++++++++++++++++++++--------
fs/ext4/super.c | 1 +
fs/ext4/truncate.h | 2 ++
6 files changed, 127 insertions(+), 42 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index fd1f28be5296..c19ff61ccbdf 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -869,6 +869,15 @@ struct ext4_inode_info {
* by other means, so we have i_data_sem.
*/
struct rw_semaphore i_data_sem;
+ /*
+ * i_mmap_sem is for serializing page faults with truncate / punch hole
+ * operations. We have to make sure that new page cannot be faulted in
+ * a section of the inode that is being punched. We cannot easily use
+ * i_data_sem for this since we need protection for the whole punch
+ * operation and i_data_sem ranks below transaction start so we have
+ * to occasionally drop it.
+ */
+ struct rw_semaphore i_mmap_sem;
struct inode vfs_inode;
struct jbd2_inode *jinode;

@@ -2316,6 +2325,7 @@ extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
loff_t lstart, loff_t lend);
extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
+extern int ext4_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
extern qsize_t *ext4_get_reserved_space(struct inode *inode);
extern void ext4_da_update_reserve_space(struct inode *inode,
int used, int quota_claim);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 2553aa8b608d..6e0c5d37a232 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4766,7 +4766,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
int partial_begin, partial_end;
loff_t start, end;
ext4_lblk_t lblk;
- struct address_space *mapping = inode->i_mapping;
unsigned int blkbits = inode->i_blkbits;

trace_ext4_zero_range(inode, offset, len, mode);
@@ -4782,17 +4781,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
}

/*
- * Write out all dirty pages to avoid race conditions
- * Then release them.
- */
- if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
- ret = filemap_write_and_wait_range(mapping, offset,
- offset + len - 1);
- if (ret)
- return ret;
- }
-
- /*
* Round up offset. This is not fallocate, we neet to zero out
* blocks, so convert interior block aligned part of the range to
* unwritten and possibly manually zero out unaligned parts of the
@@ -4852,16 +4840,22 @@ static long ext4_zero_range(struct file *file, loff_t offset,
flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
EXT4_EX_NOCACHE);

- /* Now release the pages and zero block aligned part of pages*/
- truncate_pagecache_range(inode, start, end - 1);
- inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
-
/* Wait all existing dio workers, newcomers will block on i_mutex */
ext4_inode_block_unlocked_dio(inode);
inode_dio_wait(inode);

+ /*
+ * Prevent page faults from reinstantiating pages we have
+ * released from page cache.
+ */
+ down_write(&EXT4_I(inode)->i_mmap_sem);
+ /* Now release the pages and zero block aligned part of pages */
+ truncate_pagecache_range(inode, start, end - 1);
+ inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
+
ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
flags, mode);
+ up_write(&EXT4_I(inode)->i_mmap_sem);
if (ret)
goto out_dio;
}
@@ -5520,17 +5514,22 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
goto out_mutex;
}

- truncate_pagecache(inode, ioffset);
-
/* Wait for existing dio to complete */
ext4_inode_block_unlocked_dio(inode);
inode_dio_wait(inode);

+ /*
+ * Prevent page faults from reinstantiating pages we have released from
+ * page cache.
+ */
+ down_write(&EXT4_I(inode)->i_mmap_sem);
+ truncate_pagecache(inode, ioffset);
+
credits = ext4_writepage_trans_blocks(inode);
handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
- goto out_dio;
+ goto out_mmap;
}

down_write(&EXT4_I(inode)->i_data_sem);
@@ -5569,7 +5568,8 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)

out_stop:
ext4_journal_stop(handle);
-out_dio:
+out_mmap:
+ up_write(&EXT4_I(inode)->i_mmap_sem);
ext4_inode_resume_unlocked_dio(inode);
out_mutex:
mutex_unlock(&inode->i_mutex);
@@ -5656,17 +5656,22 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
goto out_mutex;
}

- truncate_pagecache(inode, ioffset);
-
/* Wait for existing dio to complete */
ext4_inode_block_unlocked_dio(inode);
inode_dio_wait(inode);

+ /*
+ * Prevent page faults from reinstantiating pages we have released from
+ * page cache.
+ */
+ down_write(&EXT4_I(inode)->i_mmap_sem);
+ truncate_pagecache(inode, ioffset);
+
credits = ext4_writepage_trans_blocks(inode);
handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
- goto out_dio;
+ goto out_mmap;
}

/* Expand file to avoid data loss if there is error while shifting */
@@ -5737,7 +5742,8 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)

out_stop:
ext4_journal_stop(handle);
-out_dio:
+out_mmap:
+ up_write(&EXT4_I(inode)->i_mmap_sem);
ext4_inode_resume_unlocked_dio(inode);
out_mutex:
mutex_unlock(&inode->i_mutex);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 113837e7ba98..0d24ebcd7c9e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -209,15 +209,18 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
int result;
handle_t *handle = NULL;
- struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+ struct inode *inode = file_inode(vma->vm_file);
+ struct super_block *sb = inode->i_sb;
bool write = vmf->flags & FAULT_FLAG_WRITE;

if (write) {
sb_start_pagefault(sb);
file_update_time(vma->vm_file);
+ down_read(&EXT4_I(inode)->i_mmap_sem);
handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
EXT4_DATA_TRANS_BLOCKS(sb));
- }
+ } else
+ down_read(&EXT4_I(inode)->i_mmap_sem);

if (IS_ERR(handle))
result = VM_FAULT_SIGBUS;
@@ -228,8 +231,10 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
if (write) {
if (!IS_ERR(handle))
ext4_journal_stop(handle);
+ up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(sb);
- }
+ } else
+ up_read(&EXT4_I(inode)->i_mmap_sem);

return result;
}
@@ -246,10 +251,12 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
if (write) {
sb_start_pagefault(sb);
file_update_time(vma->vm_file);
+ down_read(&EXT4_I(inode)->i_mmap_sem);
handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
ext4_chunk_trans_blocks(inode,
PMD_SIZE / PAGE_SIZE));
- }
+ } else
+ down_read(&EXT4_I(inode)->i_mmap_sem);

if (IS_ERR(handle))
result = VM_FAULT_SIGBUS;
@@ -260,30 +267,71 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
if (write) {
if (!IS_ERR(handle))
ext4_journal_stop(handle);
+ up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(sb);
- }
+ } else
+ up_read(&EXT4_I(inode)->i_mmap_sem);

return result;
}

static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
- return dax_mkwrite(vma, vmf, ext4_get_block_dax,
- ext4_end_io_unwritten);
+ int err;
+ struct inode *inode = file_inode(vma->vm_file);
+
+ sb_start_pagefault(inode->i_sb);
+ file_update_time(vma->vm_file);
+ down_read(&EXT4_I(inode)->i_mmap_sem);
+ err = __dax_mkwrite(vma, vmf, ext4_get_block_dax,
+ ext4_end_io_unwritten);
+ up_read(&EXT4_I(inode)->i_mmap_sem);
+ sb_end_pagefault(inode->i_sb);
+
+ return err;
+}
+
+/*
+ * Handle write fault for VM_MIXEDMAP mappings. Similarly to ext4_dax_mkwrite()
+ * handler we check for races agaist truncate. Note that since we cycle through
+ * i_mmap_sem, we are sure that also any hole punching that began before we
+ * were called is finished by now and so if it included part of the file we
+ * are working on, our pte will get unmapped and the check for pte_same() in
+ * wp_pfn_shared() fails. Thus fault gets retried and things work out as
+ * desired.
+ */
+static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
+ struct vm_fault *vmf)
+{
+ struct inode *inode = file_inode(vma->vm_file);
+ struct super_block *sb = inode->i_sb;
+ int ret = VM_FAULT_NOPAGE;
+ loff_t size;
+
+ sb_start_pagefault(sb);
+ file_update_time(vma->vm_file);
+ down_read(&EXT4_I(inode)->i_mmap_sem);
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (vmf->pgoff >= size)
+ ret = VM_FAULT_SIGBUS;
+ up_read(&EXT4_I(inode)->i_mmap_sem);
+ sb_end_pagefault(sb);
+
+ return ret;
}

static const struct vm_operations_struct ext4_dax_vm_ops = {
.fault = ext4_dax_fault,
.pmd_fault = ext4_dax_pmd_fault,
.page_mkwrite = ext4_dax_mkwrite,
- .pfn_mkwrite = dax_pfn_mkwrite,
+ .pfn_mkwrite = ext4_dax_pfn_mkwrite,
};
#else
#define ext4_dax_vm_ops ext4_file_vm_ops
#endif

static const struct vm_operations_struct ext4_file_vm_ops = {
- .fault = filemap_fault,
+ .fault = ext4_filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = ext4_page_mkwrite,
};
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 612fbcf76b5c..36ad45906d26 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3581,6 +3581,15 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)

}

+ /* Wait all existing dio workers, newcomers will block on i_mutex */
+ ext4_inode_block_unlocked_dio(inode);
+ inode_dio_wait(inode);
+
+ /*
+ * Prevent page faults from reinstantiating pages we have released from
+ * page cache.
+ */
+ down_write(&EXT4_I(inode)->i_mmap_sem);
first_block_offset = round_up(offset, sb->s_blocksize);
last_block_offset = round_down((offset + length), sb->s_blocksize) - 1;

@@ -3589,10 +3598,6 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
truncate_pagecache_range(inode, first_block_offset,
last_block_offset);

- /* Wait all existing dio workers, newcomers will block on i_mutex */
- ext4_inode_block_unlocked_dio(inode);
- inode_dio_wait(inode);
-
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
credits = ext4_writepage_trans_blocks(inode);
else
@@ -3638,16 +3643,12 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
if (IS_SYNC(inode))
ext4_handle_sync(handle);

- /* Now release the pages again to reduce race window */
- if (last_block_offset > first_block_offset)
- truncate_pagecache_range(inode, first_block_offset,
- last_block_offset);
-
inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
ext4_mark_inode_dirty(handle, inode);
out_stop:
ext4_journal_stop(handle);
out_dio:
+ up_write(&EXT4_I(inode)->i_mmap_sem);
ext4_inode_resume_unlocked_dio(inode);
out_mutex:
mutex_unlock(&inode->i_mutex);
@@ -4784,6 +4785,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
} else
ext4_wait_for_tail_page_commit(inode);
}
+ down_write(&EXT4_I(inode)->i_mmap_sem);
/*
* Truncate pagecache after we've waited for commit
* in data=journal mode to make pages freeable.
@@ -4791,6 +4793,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
truncate_pagecache(inode, inode->i_size);
if (shrink)
ext4_truncate(inode);
+ up_write(&EXT4_I(inode)->i_mmap_sem);
}

if (!rc) {
@@ -5239,6 +5242,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)

sb_start_pagefault(inode->i_sb);
file_update_time(vma->vm_file);
+
+ down_read(&EXT4_I(inode)->i_mmap_sem);
/* Delalloc case is easy... */
if (test_opt(inode->i_sb, DELALLOC) &&
!ext4_should_journal_data(inode) &&
@@ -5308,6 +5313,19 @@ retry_alloc:
out_ret:
ret = block_page_mkwrite_return(ret);
out:
+ up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(inode->i_sb);
return ret;
}
+
+int ext4_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct inode *inode = file_inode(vma->vm_file);
+ int err;
+
+ down_read(&EXT4_I(inode)->i_mmap_sem);
+ err = filemap_fault(vma, vmf);
+ up_read(&EXT4_I(inode)->i_mmap_sem);
+
+ return err;
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a63c7b0a10cf..61fad9e33bb9 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -955,6 +955,7 @@ static void init_once(void *foo)
INIT_LIST_HEAD(&ei->i_orphan);
init_rwsem(&ei->xattr_sem);
init_rwsem(&ei->i_data_sem);
+ init_rwsem(&ei->i_mmap_sem);
inode_init_once(&ei->vfs_inode);
}

diff --git a/fs/ext4/truncate.h b/fs/ext4/truncate.h
index 011ba6670d99..c70d06a383e2 100644
--- a/fs/ext4/truncate.h
+++ b/fs/ext4/truncate.h
@@ -10,8 +10,10 @@
*/
static inline void ext4_truncate_failed_write(struct inode *inode)
{
+ down_write(&EXT4_I(inode)->i_mmap_sem);
truncate_inode_pages(inode->i_mapping, inode->i_size);
ext4_truncate(inode);
+ up_write(&EXT4_I(inode)->i_mmap_sem);
}

/*
--
2.1.4


2015-10-22 08:16:24

by Jan Kara

[permalink] [raw]
Subject: [PATCH 5/9] ext4: Document lock ordering

We have enough locks that it's probably worth documenting the lock
ordering rules we have in ext4.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/super.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 61fad9e33bb9..152f719a0c62 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -85,6 +85,36 @@ static void ext4_unregister_li_request(struct super_block *sb);
static void ext4_clear_request_list(void);
static int ext4_reserve_clusters(struct ext4_sb_info *, ext4_fsblk_t);

+/*
+ * Lock ordering
+ *
+ * Note the difference between i_mmap_sem (EXT4_I(inode)->i_mmap_sem) and
+ * i_mmap_rwsem (inode->i_mmap_rwsem)!
+ *
+ * page fault path:
+ * mmap_sem -> sb_start_pagefault -> i_mmap_sem (r) -> transaction start ->
+ * page lock -> i_data_sem (rw)
+ *
+ * buffered write path:
+ * sb_start_write -> i_mutex -> mmap_sem
+ * sb_start_write -> i_mutex -> transaction start -> page lock ->
+ * i_data_sem (rw)
+ *
+ * truncate:
+ * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (w) -> i_mmap_sem (w) ->
+ * i_mmap_rwsem (w) -> page lock
+ * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (w) -> i_mmap_sem (w) ->
+ * transaction start -> i_data_sem (rw)
+ *
+ * direct IO:
+ * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (r) -> mmap_sem
+ * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (r) ->
+ * transaction start -> i_data_sem (rw)
+ *
+ * writepages:
+ * transaction start -> page lock(s) -> i_data_sem (rw)
+ */
+
#if !defined(CONFIG_EXT2_FS) && !defined(CONFIG_EXT2_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT2)
static struct file_system_type ext2_fs_type = {
.owner = THIS_MODULE,
--
2.1.4


2015-10-22 08:16:26

by Jan Kara

[permalink] [raw]
Subject: [PATCH 7/9] ext4: Provide ext4_issue_zeroout()

Create new function ext4_issue_zeroout() to zeroout contiguous (both
logically and physically) part of inode data. We will need to issue
zeroout when extent structure is not readily available and this function
will allow us to do it without making up fake extent structures.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/crypto.c | 6 ++----
fs/ext4/ext4.h | 5 ++++-
fs/ext4/extents.c | 12 ++----------
fs/ext4/inode.c | 15 +++++++++++++++
4 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/crypto.c b/fs/ext4/crypto.c
index 45731558138c..6b63d1e9f05a 100644
--- a/fs/ext4/crypto.c
+++ b/fs/ext4/crypto.c
@@ -403,14 +403,12 @@ int ext4_decrypt_one(struct inode *inode, struct page *page)
return ret;
}

-int ext4_encrypted_zeroout(struct inode *inode, struct ext4_extent *ex)
+int ext4_encrypted_zeroout(struct inode *inode, ext4_lblk_t lblk,
+ ext4_fsblk_t pblk, ext4_lblk_t len)
{
struct ext4_crypto_ctx *ctx;
struct page *ciphertext_page = NULL;
struct bio *bio;
- ext4_lblk_t lblk = ex->ee_block;
- ext4_fsblk_t pblk = ext4_ext_pblock(ex);
- unsigned int len = ext4_ext_get_actual_len(ex);
int err = 0;

BUG_ON(inode->i_sb->s_blocksize != PAGE_CACHE_SIZE);
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 142e935d1f3d..785737a342e6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2072,7 +2072,8 @@ struct page *ext4_encrypt(struct inode *inode,
struct page *plaintext_page);
int ext4_decrypt(struct ext4_crypto_ctx *ctx, struct page *page);
int ext4_decrypt_one(struct inode *inode, struct page *page);
-int ext4_encrypted_zeroout(struct inode *inode, struct ext4_extent *ex);
+int ext4_encrypted_zeroout(struct inode *inode, ext4_lblk_t lblk,
+ ext4_fsblk_t pblk, ext4_lblk_t len);

#ifdef CONFIG_EXT4_FS_ENCRYPTION
int ext4_init_crypto(void);
@@ -2327,6 +2328,8 @@ extern int ext4_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
extern qsize_t *ext4_get_reserved_space(struct inode *inode);
extern void ext4_da_update_reserve_space(struct inode *inode,
int used, int quota_claim);
+extern int ext4_issue_zeroout(struct inode *inode, ext4_lblk_t lblk,
+ ext4_fsblk_t pblk, ext4_lblk_t len);

/* indirect.c */
extern int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 9977b557d50d..273bf40cca23 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3118,19 +3118,11 @@ static int ext4_ext_zeroout(struct inode *inode, struct ext4_extent *ex)
{
ext4_fsblk_t ee_pblock;
unsigned int ee_len;
- int ret;

ee_len = ext4_ext_get_actual_len(ex);
ee_pblock = ext4_ext_pblock(ex);
-
- if (ext4_encrypted_inode(inode))
- return ext4_encrypted_zeroout(inode, ex);
-
- ret = sb_issue_zeroout(inode->i_sb, ee_pblock, ee_len, GFP_NOFS);
- if (ret > 0)
- ret = 0;
-
- return ret;
+ return ext4_issue_zeroout(inode, le32_to_cpu(ex->ee_block), ee_pblock,
+ ee_len);
}

/*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 39da009cbc89..5e5795bde6b4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -383,6 +383,21 @@ static int __check_block_validity(struct inode *inode, const char *func,
return 0;
}

+int ext4_issue_zeroout(struct inode *inode, ext4_lblk_t lblk, ext4_fsblk_t pblk,
+ ext4_lblk_t len)
+{
+ int ret;
+
+ if (ext4_encrypted_inode(inode))
+ return ext4_encrypted_zeroout(inode, lblk, pblk, len);
+
+ ret = sb_issue_zeroout(inode->i_sb, pblk, len, GFP_NOFS);
+ if (ret > 0)
+ ret = 0;
+
+ return ret;
+}
+
#define check_block_validity(inode, map) \
__check_block_validity((inode), __func__, __LINE__, (map))

--
2.1.4


2015-10-22 08:16:26

by Jan Kara

[permalink] [raw]
Subject: [PATCH 6/9] ext4: Get rid of EXT4_GET_BLOCKS_NO_LOCK flag

When dioread_nolock mode is enabled, we grab i_data_sem in
ext4_ext_direct_IO() and therefore we need to instruct _ext4_get_block()
not to grab i_data_sem again using EXT4_GET_BLOCKS_NO_LOCK. However
holding i_data_sem over overwrite direct IO isn't needed these days. We
have exclusion against truncate / hole punching because we increase
i_dio_count under i_mutex in ext4_ext_direct_IO() so once
ext4_file_write_iter() verifies blocks are allocated & written, they are
guaranteed to stay so during the whole direct IO even after we drop
i_mutex.

So we can just remove this locking abuse and the no longer necessary
EXT4_GET_BLOCKS_NO_LOCK flag.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 4 +---
fs/ext4/inode.c | 43 ++++++++++++++++++++-----------------------
include/trace/events/ext4.h | 3 +--
3 files changed, 22 insertions(+), 28 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c86546efec30..142e935d1f3d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -551,10 +551,8 @@ enum {
#define EXT4_GET_BLOCKS_NO_NORMALIZE 0x0040
/* Request will not result in inode size update (user for fallocate) */
#define EXT4_GET_BLOCKS_KEEP_SIZE 0x0080
- /* Do not take i_data_sem locking in ext4_map_blocks */
-#define EXT4_GET_BLOCKS_NO_LOCK 0x0100
/* Convert written extents to unwritten */
-#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN 0x0200
+#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN 0x0100

/*
* The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f72212ef1fee..39da009cbc89 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -403,8 +403,7 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
* out taking i_data_sem. So at the time the unwritten extent
* could be converted.
*/
- if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
- down_read(&EXT4_I(inode)->i_data_sem);
+ down_read(&EXT4_I(inode)->i_data_sem);
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
retval = ext4_ext_map_blocks(handle, inode, map, flags &
EXT4_GET_BLOCKS_KEEP_SIZE);
@@ -412,8 +411,7 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
retval = ext4_ind_map_blocks(handle, inode, map, flags &
EXT4_GET_BLOCKS_KEEP_SIZE);
}
- if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
- up_read((&EXT4_I(inode)->i_data_sem));
+ up_read((&EXT4_I(inode)->i_data_sem));

/*
* We don't check m_len because extent will be collpased in status
@@ -509,8 +507,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
* Try to see if we can get the block without requesting a new
* file system block.
*/
- if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
- down_read(&EXT4_I(inode)->i_data_sem);
+ down_read(&EXT4_I(inode)->i_data_sem);
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
retval = ext4_ext_map_blocks(handle, inode, map, flags &
EXT4_GET_BLOCKS_KEEP_SIZE);
@@ -541,8 +538,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
if (ret < 0)
retval = ret;
}
- if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
- up_read((&EXT4_I(inode)->i_data_sem));
+ up_read((&EXT4_I(inode)->i_data_sem));

found:
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
@@ -674,7 +670,7 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
map.m_lblk = iblock;
map.m_len = bh->b_size >> inode->i_blkbits;

- if (flags && !(flags & EXT4_GET_BLOCKS_NO_LOCK) && !handle) {
+ if (flags && !handle) {
/* Direct IO write... */
if (map.m_len > DIO_MAX_BLOCKS)
map.m_len = DIO_MAX_BLOCKS;
@@ -879,9 +875,6 @@ int do_journal_get_write_access(handle_t *handle,
return ret;
}

-static int ext4_get_block_write_nolock(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create);
-
#ifdef CONFIG_EXT4_FS_ENCRYPTION
static int ext4_block_write_begin(struct page *page, loff_t pos, unsigned len,
get_block_t *get_block)
@@ -3012,13 +3005,21 @@ int ext4_get_block_write(struct inode *inode, sector_t iblock,
EXT4_GET_BLOCKS_IO_CREATE_EXT);
}

-static int ext4_get_block_write_nolock(struct inode *inode, sector_t iblock,
+static int ext4_get_block_overwrite(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
- ext4_debug("ext4_get_block_write_nolock: inode %lu, create flag %d\n",
+ int ret;
+
+ ext4_debug("ext4_get_block_overwrite: inode %lu, create flag %d\n",
inode->i_ino, create);
- return _ext4_get_block(inode, iblock, bh_result,
- EXT4_GET_BLOCKS_NO_LOCK);
+ ret = _ext4_get_block(inode, iblock, bh_result, 0);
+ /*
+ * Blocks should have been preallocated! ext4_file_write_iter() checks
+ * that.
+ */
+ WARN_ON_ONCE(ret == 0);
+
+ return ret;
}

int ext4_get_block_dax(struct inode *inode, sector_t iblock,
@@ -3101,10 +3102,8 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
/* If we do a overwrite dio, i_mutex locking can be released */
overwrite = *((int *)iocb->private);

- if (overwrite) {
- down_read(&EXT4_I(inode)->i_data_sem);
+ if (overwrite)
mutex_unlock(&inode->i_mutex);
- }

/*
* We could direct write to holes and fallocate.
@@ -3147,7 +3146,7 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
}

if (overwrite) {
- get_block_func = ext4_get_block_write_nolock;
+ get_block_func = ext4_get_block_overwrite;
} else {
get_block_func = ext4_get_block_write;
dio_flags = DIO_LOCKING;
@@ -3203,10 +3202,8 @@ retake_lock:
if (iov_iter_rw(iter) == WRITE)
inode_dio_end(inode);
/* take i_mutex locking again if we do a ovewrite dio */
- if (overwrite) {
- up_read(&EXT4_I(inode)->i_data_sem);
+ if (overwrite)
mutex_lock(&inode->i_mutex);
- }

return ret;
}
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 594b4b29a224..5f2ace56efc5 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -42,8 +42,7 @@ struct extent_status;
{ EXT4_GET_BLOCKS_CONVERT, "CONVERT" }, \
{ EXT4_GET_BLOCKS_METADATA_NOFAIL, "METADATA_NOFAIL" }, \
{ EXT4_GET_BLOCKS_NO_NORMALIZE, "NO_NORMALIZE" }, \
- { EXT4_GET_BLOCKS_KEEP_SIZE, "KEEP_SIZE" }, \
- { EXT4_GET_BLOCKS_NO_LOCK, "NO_LOCK" })
+ { EXT4_GET_BLOCKS_KEEP_SIZE, "KEEP_SIZE" })

#define show_mflags(flags) __print_flags(flags, "", \
{ EXT4_MAP_NEW, "N" }, \
--
2.1.4


2015-10-22 08:16:21

by Jan Kara

[permalink] [raw]
Subject: [PATCH 3/9] ext4: Fix races between buffered IO and collapse / insert range

Current code implementing FALLOC_FL_COLLAPSE_RANGE and
FALLOC_FL_INSERT_RANGE is prove to races with buffered writes and page
faults. If buffered write or write via mmap manages to squeeze between
filemap_write_and_wait_range() and truncate_pagecache() in the fallocate
implementations, the written data is simply discarded by
truncate_pagecache() although it should have been shifted.

Fix the problem by moving filemap_write_and_wait_range() call inside
i_mutex and i_mmap_sem. That way we are protected against races with
both buffered writes and page faults.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/extents.c | 62 +++++++++++++++++++++++++++++--------------------------
1 file changed, 33 insertions(+), 29 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 66ab89b58c1f..892245a55c53 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5483,21 +5483,7 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
return ret;
}

- /*
- * Need to round down offset to be aligned with page size boundary
- * for page size > block size.
- */
- ioffset = round_down(offset, PAGE_SIZE);
-
- /* Write out all dirty pages */
- ret = filemap_write_and_wait_range(inode->i_mapping, ioffset,
- LLONG_MAX);
- if (ret)
- return ret;
-
- /* Take mutex lock */
mutex_lock(&inode->i_mutex);
-
/*
* There is no need to overlap collapse range with EOF, in which case
* it is effectively a truncate operation
@@ -5518,10 +5504,32 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
inode_dio_wait(inode);

/*
- * Prevent page faults from reinstantiating pages we have released from
+ * Prevent page faults from reinstantiating we have released from
* page cache.
*/
down_write(&EXT4_I(inode)->i_mmap_sem);
+ /*
+ * Need to round down offset to be aligned with page size boundary
+ * for page size > block size.
+ */
+ ioffset = round_down(offset, PAGE_SIZE);
+ /*
+ * Write tail of last page before removed range since it will get
+ * removed from page cache below.
+ */
+ ret = filemap_write_and_wait_range(inode->i_mapping, ioffset,
+ offset - ioffset);
+ if (ret)
+ goto out_mmap;
+ /*
+ * Write data that will be shifted to preserve them when discarding
+ * page cache below. We are also protected from pages becoming dirty
+ * by i_mmap_sem.
+ */
+ ret = filemap_write_and_wait_range(inode->i_mapping, offset + len,
+ LLONG_MAX);
+ if (ret)
+ goto out_mmap;
truncate_pagecache(inode, ioffset);

credits = ext4_writepage_trans_blocks(inode);
@@ -5622,21 +5630,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
return ret;
}

- /*
- * Need to round down to align start offset to page size boundary
- * for page size > block size.
- */
- ioffset = round_down(offset, PAGE_SIZE);
-
- /* Write out all dirty pages */
- ret = filemap_write_and_wait_range(inode->i_mapping, ioffset,
- LLONG_MAX);
- if (ret)
- return ret;
-
- /* Take mutex lock */
mutex_lock(&inode->i_mutex);

2015-10-22 08:16:26

by Jan Kara

[permalink] [raw]
Subject: [PATCH 8/9] ext4: Implement allocation of pre-zeroed blocks

DAX page fault path needs to get blocks that are pre-zeroed to avoid
races when two concurrent page faults happen in the same block of a
file. Implement support for this in ext4_map_blocks().

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 4 ++++
fs/ext4/extents.c | 8 ++++++++
fs/ext4/inode.c | 20 +++++++++++++++++---
include/trace/events/ext4.h | 3 ++-
4 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 785737a342e6..0a11a34d54c9 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -553,6 +553,10 @@ enum {
#define EXT4_GET_BLOCKS_KEEP_SIZE 0x0080
/* Convert written extents to unwritten */
#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN 0x0100
+ /* Write zeros to newly created written extents */
+#define EXT4_GET_BLOCKS_ZERO 0x0200
+#define EXT4_GET_BLOCKS_CREATE_ZERO (EXT4_GET_BLOCKS_CREATE |\
+ EXT4_GET_BLOCKS_ZERO)

/*
* The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 273bf40cca23..8e0d853a8044 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4040,6 +4040,14 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
}
/* IO end_io complete, convert the filled extent to written */
if (flags & EXT4_GET_BLOCKS_CONVERT) {
+ if (flags & EXT4_GET_BLOCKS_ZERO) {
+ if (allocated > map->m_len)
+ allocated = map->m_len;
+ err = ext4_issue_zeroout(inode, map->m_lblk, newblock,
+ allocated);
+ if (err < 0)
+ goto out2;
+ }
ret = ext4_convert_unwritten_extents_endio(handle, inode, map,
ppath);
if (ret >= 0) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5e5795bde6b4..73c6214084d4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -643,7 +643,7 @@ found:
if ((flags & EXT4_GET_BLOCKS_PRE_IO) &&
ext4_es_lookup_extent(inode, map->m_lblk, &es)) {
if (ext4_es_is_written(&es))
- goto has_zeroout;
+ goto out_sem;
}
status = map->m_flags & EXT4_MAP_UNWRITTEN ?
EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
@@ -654,11 +654,24 @@ found:
status |= EXTENT_STATUS_DELAYED;
ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
map->m_pblk, status);
- if (ret < 0)
+ if (ret < 0) {
retval = ret;
+ goto out_sem;
+ }
+
+ if (flags & EXT4_GET_BLOCKS_ZERO &&
+ map->m_flags & EXT4_MAP_MAPPED &&
+ map->m_flags & EXT4_MAP_NEW) {
+ ret = ext4_issue_zeroout(inode, map->m_lblk,
+ map->m_pblk, map->m_len);
+ if (ret) {
+ retval = ret;
+ goto out_sem;
+ }
+ }
}

-has_zeroout:
+out_sem:
up_write((&EXT4_I(inode)->i_data_sem));
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
ret = check_block_validity(inode, map);
@@ -3041,6 +3054,7 @@ int ext4_get_block_dax(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
int flags = EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_UNWRIT_EXT;
+
if (create)
flags |= EXT4_GET_BLOCKS_CREATE;
ext4_debug("ext4_get_block_dax: inode %lu, create flag %d\n",
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 5f2ace56efc5..4e4b2fa78609 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -42,7 +42,8 @@ struct extent_status;
{ EXT4_GET_BLOCKS_CONVERT, "CONVERT" }, \
{ EXT4_GET_BLOCKS_METADATA_NOFAIL, "METADATA_NOFAIL" }, \
{ EXT4_GET_BLOCKS_NO_NORMALIZE, "NO_NORMALIZE" }, \
- { EXT4_GET_BLOCKS_KEEP_SIZE, "KEEP_SIZE" })
+ { EXT4_GET_BLOCKS_KEEP_SIZE, "KEEP_SIZE" }, \
+ { EXT4_GET_BLOCKS_ZERO, "ZERO" })

#define show_mflags(flags) __print_flags(flags, "", \
{ EXT4_MAP_NEW, "N" }, \
--
2.1.4


2015-10-22 18:09:12

by Ross Zwisler

[permalink] [raw]
Subject: Re: [PATCH 0/9 v2] ext4: Punch hole and DAX fixes

On Thu, Oct 22, 2015 at 10:15:52AM +0200, Jan Kara wrote:
> Hello,
>
> This series fixes a long standing problem of racing punch hole and page fault
> resulting in possible filesystem corruption or stale data exposure. We fix the
> problem by using a new inode-private rw_semaphore i_mmap_sem to synchronize
> page faults with truncate and punch hole operations.
>
> When having this exclusion, the only remaining problem with DAX implementation
> are races between two page faults zeroing out same block concurrently (where
> the data written after the first fault finishes are possibly overwritten by
> the second fault still doing zeroing).
>
> Patch 1 introduces i_mmap_sem lock in ext4 inode and uses it to properly
> serialize extent manipulation operations and page faults.
>
> Patch 2 is mostly a preparatory cleanup patch which also avoids double lock /
> unlock in unlocked DIO protections (currently harmless but nasty surprise).
>
> Patches 3-4 fix further races of extent manipulation functions (such as zero
> range, collapse range, insert range) with buffered IO, page writeback
>
> Patch 5 documents locking order of ext4 filesystem locks.
>
> Patch 6 removes locking abuse of i_data_sem from the get_blocks() path when
> dioread_nolock is enabled since it is not needed anymore.
>
> Patches 7-9 implement allocation of pre-zeroed blocks in ext4_map_blocks()
> callback and use such blocks for allocations from DAX page faults.
>
> The patches survived xfstests run both in dax and non-dax mode.
>
> Honza

In my setup generic/081 passes, but causes all sorts of errors reported in
syslog. I think these errors are a normal part of the test because they happen
in the non-DAX case as well, but in the DAX case I think we end up in a bad
state. After this, all other tests fail.

Here is a run of generic/081 and generic/082, where generic/082 dies as a
bystander of the bad state left by generic/081:

# ./check generic/081 generic/082
FSTYP -- ext4
PLATFORM -- Linux/x86_64 lorwyn 4.3.0-rc6_ext4_dax_v2+
MKFS_OPTIONS -- /dev/pmem0p2
MOUNT_OPTIONS -- -o dax -o context=system_u:object_r:nfs_t:s0 /dev/pmem0p2 /mnt/xfstests_scratch

generic/081 2s ... 1s
generic/082 0s ... [failed, exit status 1] - output mismatch (see /root/xfstests/results//generic/082.out.bad)
--- tests/generic/082.out 2015-10-02 10:20:40.816712128 -0600
+++ /root/xfstests/results//generic/082.out.bad 2015-10-22 12:07:06.009024134 -0600
@@ -1,3 +1,9 @@
QA output created by 082
-group quota on SCRATCH_MNT (SCRATCH_DEV) is on
-user quota on SCRATCH_MNT (SCRATCH_DEV) is on
+mount: /dev/pmem0p2 is already mounted or /mnt/xfstests_scratch busy
+quotaon: Mountpoint (or device) /mnt/xfstests_scratch not found or has no quota enabled.
+mount: /mnt/xfstests_scratch not mounted or bad option
+
...
(Run 'diff -u tests/generic/082.out /root/xfstests/results//generic/082.out.bad' to see the entire diff)
Ran: generic/081 generic/082
Failures: generic/082
Failed 1 of 2 tests

When run by itself on a clean system with DAX generic/082 passes repeatedly.

After you get into this bad state, all other tests fail to run:

# ./check generic/082
FSTYP -- ext4
PLATFORM -- Linux/x86_64 lorwyn 4.3.0-rc6_ext4_dax_v2+
MKFS_OPTIONS -- /dev/pmem0p2
MOUNT_OPTIONS -- -o dax -o context=system_u:object_r:nfs_t:s0 /dev/pmem0p2 /mnt/xfstests_scratch

our local _scratch_mkfs routine ...
mke2fs 1.42.12 (29-Aug-2014)
/dev/pmem0p2 is apparently in use by the system; will not make a filesystem here!
check: failed to mkfs $SCRATCH_DEV using specified options
Passed all 0 tests

2015-10-22 21:14:40

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/9 v2] ext4: Punch hole and DAX fixes

On Thu 22-10-15 12:09:10, Ross Zwisler wrote:
> On Thu, Oct 22, 2015 at 10:15:52AM +0200, Jan Kara wrote:
> > Hello,
> >
> > This series fixes a long standing problem of racing punch hole and page fault
> > resulting in possible filesystem corruption or stale data exposure. We fix the
> > problem by using a new inode-private rw_semaphore i_mmap_sem to synchronize
> > page faults with truncate and punch hole operations.
> >
> > When having this exclusion, the only remaining problem with DAX implementation
> > are races between two page faults zeroing out same block concurrently (where
> > the data written after the first fault finishes are possibly overwritten by
> > the second fault still doing zeroing).
> >
> > Patch 1 introduces i_mmap_sem lock in ext4 inode and uses it to properly
> > serialize extent manipulation operations and page faults.
> >
> > Patch 2 is mostly a preparatory cleanup patch which also avoids double lock /
> > unlock in unlocked DIO protections (currently harmless but nasty surprise).
> >
> > Patches 3-4 fix further races of extent manipulation functions (such as zero
> > range, collapse range, insert range) with buffered IO, page writeback
> >
> > Patch 5 documents locking order of ext4 filesystem locks.
> >
> > Patch 6 removes locking abuse of i_data_sem from the get_blocks() path when
> > dioread_nolock is enabled since it is not needed anymore.
> >
> > Patches 7-9 implement allocation of pre-zeroed blocks in ext4_map_blocks()
> > callback and use such blocks for allocations from DAX page faults.
> >
> > The patches survived xfstests run both in dax and non-dax mode.
> >
> > Honza
>
> In my setup generic/081 passes, but causes all sorts of errors reported in
> syslog. I think these errors are a normal part of the test because they happen
> in the non-DAX case as well, but in the DAX case I think we end up in a bad
> state. After this, all other tests fail.

Thanks for report! I was using ramdisk for testing and test generic/081 got
skipped because the device doesn't have "sane flush". I'll retest with pmem
and debug what's going on.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-10-23 03:36:50

by Eryu Guan

[permalink] [raw]
Subject: Re: [PATCH 0/9 v2] ext4: Punch hole and DAX fixes

On Thu, Oct 22, 2015 at 12:09:10PM -0600, Ross Zwisler wrote:
> On Thu, Oct 22, 2015 at 10:15:52AM +0200, Jan Kara wrote:
> > Hello,
> >
> > This series fixes a long standing problem of racing punch hole and page fault
> > resulting in possible filesystem corruption or stale data exposure. We fix the
> > problem by using a new inode-private rw_semaphore i_mmap_sem to synchronize
> > page faults with truncate and punch hole operations.
> >
> > When having this exclusion, the only remaining problem with DAX implementation
> > are races between two page faults zeroing out same block concurrently (where
> > the data written after the first fault finishes are possibly overwritten by
> > the second fault still doing zeroing).
> >
> > Patch 1 introduces i_mmap_sem lock in ext4 inode and uses it to properly
> > serialize extent manipulation operations and page faults.
> >
> > Patch 2 is mostly a preparatory cleanup patch which also avoids double lock /
> > unlock in unlocked DIO protections (currently harmless but nasty surprise).
> >
> > Patches 3-4 fix further races of extent manipulation functions (such as zero
> > range, collapse range, insert range) with buffered IO, page writeback
> >
> > Patch 5 documents locking order of ext4 filesystem locks.
> >
> > Patch 6 removes locking abuse of i_data_sem from the get_blocks() path when
> > dioread_nolock is enabled since it is not needed anymore.
> >
> > Patches 7-9 implement allocation of pre-zeroed blocks in ext4_map_blocks()
> > callback and use such blocks for allocations from DAX page faults.
> >
> > The patches survived xfstests run both in dax and non-dax mode.
> >
> > Honza
>
> In my setup generic/081 passes, but causes all sorts of errors reported in
> syslog. I think these errors are a normal part of the test because they happen
> in the non-DAX case as well, but in the DAX case I think we end up in a bad
> state. After this, all other tests fail.
>
> Here is a run of generic/081 and generic/082, where generic/082 dies as a
> bystander of the bad state left by generic/081:
>
> # ./check generic/081 generic/082
> FSTYP -- ext4
> PLATFORM -- Linux/x86_64 lorwyn 4.3.0-rc6_ext4_dax_v2+
> MKFS_OPTIONS -- /dev/pmem0p2
> MOUNT_OPTIONS -- -o dax -o context=system_u:object_r:nfs_t:s0 /dev/pmem0p2 /mnt/xfstests_scratch
>
> generic/081 2s ... 1s
> generic/082 0s ... [failed, exit status 1] - output mismatch (see /root/xfstests/results//generic/082.out.bad)
> --- tests/generic/082.out 2015-10-02 10:20:40.816712128 -0600
> +++ /root/xfstests/results//generic/082.out.bad 2015-10-22 12:07:06.009024134 -0600
> @@ -1,3 +1,9 @@
> QA output created by 082
> -group quota on SCRATCH_MNT (SCRATCH_DEV) is on
> -user quota on SCRATCH_MNT (SCRATCH_DEV) is on
> +mount: /dev/pmem0p2 is already mounted or /mnt/xfstests_scratch busy
> +quotaon: Mountpoint (or device) /mnt/xfstests_scratch not found or has no quota enabled.
> +mount: /mnt/xfstests_scratch not mounted or bad option
> +
> ...
> (Run 'diff -u tests/generic/082.out /root/xfstests/results//generic/082.out.bad' to see the entire diff)
> Ran: generic/081 generic/082
> Failures: generic/082
> Failed 1 of 2 tests
>
> When run by itself on a clean system with DAX generic/082 passes repeatedly.
>
> After you get into this bad state, all other tests fail to run:
>
> # ./check generic/082
> FSTYP -- ext4
> PLATFORM -- Linux/x86_64 lorwyn 4.3.0-rc6_ext4_dax_v2+
> MKFS_OPTIONS -- /dev/pmem0p2
> MOUNT_OPTIONS -- -o dax -o context=system_u:object_r:nfs_t:s0 /dev/pmem0p2 /mnt/xfstests_scratch
>
> our local _scratch_mkfs routine ...
> mke2fs 1.42.12 (29-Aug-2014)
> /dev/pmem0p2 is apparently in use by the system; will not make a filesystem here!

Sometimes generic/081 fails to remove test vg/lv on SCRATCH_DEV (EBUSY
iirc) and leaves SCRATCH_DEV in use. And Dave suggests that's a lvm bug,
because the device is already umounted, nothing should stop the removal
of vg/lv.

As a workaround, you can remove "-c fsync" from the xfs_io command line,
but I'm not sure if that works for pmem.

Thanks,
Eryu

> check: failed to mkfs $SCRATCH_DEV using specified options
> Passed all 0 tests
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2015-10-24 10:51:34

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 1/9] ext4: Fix races between page faults and hole punching

On Thu, Oct 22, 2015 at 10:15:53AM +0200, Jan Kara wrote:
> Currently, page faults and hole punching are completely unsynchronized.
> This can result in page fault faulting in a page into a range that we
> are punching after truncate_pagecache_range() has been called and thus
> we can end up with a page mapped to disk blocks that will be shortly
> freed. Filesystem corruption will shortly follow. Note that the same
> race is avoided for truncate by checking page fault offset against
> i_size but there isn't similar mechanism available for punching holes.
>
> Fix the problem by creating new rw semaphore i_mmap_sem in inode and
> grab it for writing over truncate, hole punching, and other functions
> removing blocks from extent tree and for read over page faults. We
> cannot easily use i_data_sem for this since that ranks below transaction
> start and we need something ranking above it so that it can be held over
> the whole truncate / hole punching operation. Also remove various
> workarounds we had in the code to reduce race window when page fault
> could have created pages with stale mapping information.
>
> Signed-off-by: Jan Kara <[email protected]>

This patch is causing ext4/001 to fail even using the standard 4k
non-DAX test configuration. You had mentioned that extent zeroing was
getting suppressed for DAX file systems, but it looks like it's
getting suppressed even in the non-DAX configuration:

% kvm-xfstests -c 4k ext4/001
...
ext4/001 [21:11:29][ 7.796142] run fstests ext4/001 at 2015-10-23 21:11:29
[21:11:32] - output mismatch (see /results/results-4k/ext4/001.out.bad)
--- tests/ext4/001.out 2015-10-18 23:46:49.000000000 -0400
+++ /results/results-4k/ext4/001.out.bad 2015-10-23 21:11:32.104276540 -0400
@@ -131,14 +131,10 @@
2: [32..39]: hole
daa100df6e6711906b61c9ab5aa16032
11. data -> hole -> data
-0: [0..7]: data
-1: [8..31]: unwritten
-2: [32..39]: data
+0: [0..39]: data
...
(Run 'diff -u tests/ext4/001.out /results/results-4k/ext4/001.out.bad' to see the entire diff)


- Ted

2015-10-24 10:51:35

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 3/9] ext4: Fix races between buffered IO and collapse / insert range

On Thu, Oct 22, 2015 at 10:15:55AM +0200, Jan Kara wrote:
> Current code implementing FALLOC_FL_COLLAPSE_RANGE and
> FALLOC_FL_INSERT_RANGE is prove to races with buffered writes and page
> faults. If buffered write or write via mmap manages to squeeze between
> filemap_write_and_wait_range() and truncate_pagecache() in the fallocate
> implementations, the written data is simply discarded by
> truncate_pagecache() although it should have been shifted.
>
> Fix the problem by moving filemap_write_and_wait_range() call inside
> i_mutex and i_mmap_sem. That way we are protected against races with
> both buffered writes and page faults.
>
> Signed-off-by: Jan Kara <[email protected]>

This patch is causing generic/091 to fail with a 1k block size.

BEGIN TEST 1k: Ext4 1k block Sat Oct 24 00:41:32 EDT 2015
DEVICE: /dev/vdd
MK2FS OPTIONS: -q -b 1024
MOUNT OPTIONS: -o block_validity
FSTYP -- ext4
PLATFORM -- Linux/i686 kvm-xfstests 4.3.0-rc2ext4-00028-g53834d8
MKFS_OPTIONS -- -q -b 1024 /dev/vdc
MOUNT_OPTIONS -- -o acl,user_xattr -o block_validity /dev/vdc /vdc

generic/091 [00:41:35][ 8.942840] run fstests generic/091 at 2015-10-24 00:41:35
[ 9.471531] xfs_io (3160) used greatest stack depth: 5636 bytes left
[ 13.467847] fsx (3172) used greatest stack depth: 5632 bytes left
[00:41:39] [failed, exit status 1] - output mismatch (see /results/results-1k/generic/091.out.bad)
--- tests/generic/091.out 2015-10-18 23:46:49.000000000 -0400
+++ /results/results-1k/generic/091.out.bad 2015-10-24 00:41:39.981112671 -0400
@@ -1,7 +1,7124 @@
QA output created by 091
fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 8192 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 32768 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
-fsx -N 10000 -o 128000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -W
...
(Run 'diff -u tests/generic/091.out /results/results-1k/generic/091.out.bad' to see the entire diff)

An examination of results-1k/generic/091.full finds:

fsx -N 10000 -l 500000 -r PSIZE -t BSIZE -w BSIZE -Z -R -W
mapped writes DISABLED
skipping insert range behind EOF
skipping insert range behind EOF
truncating to largest ever: 0x11e00
truncating to largest ever: 0x2fa00
zero_range to largest ever: 0x4e869
truncating to largest ever: 0x68000
truncating to largest ever: 0x6c000
truncating to largest ever: 0x70200
truncating to largest ever: 0x74400
truncating to largest ever: 0x74e00
fallocating to largest ever: 0x75f20
fallocating to largest ever: 0x7a120
skipping insert range behind EOF
skipping collapse range behind EOF
skipping zero length zero range
skipping zero length zero range
READ BAD DATA: offset = 0x28000, size = 0xb000, fname = /vdd/junk
OFFSET GOOD BAD RANGE
0x28b0f 0x0000 0xa793 0x 0
operation# (mod 256) for the bad data may be 147
0x28b10 0x0000 0x9398 0x 1
operation# (mod 256) for the bad data may be 147
0x28b11 0x0000 0x9893 0x 2
operation# (mod 256) for the bad data may be 147
0x28b12 0x0000 0x93d4 0x 3
operation# (mod 256) for the bad data may be 147
0x28b13 0x0000 0xd493 0x 4
...

Could you take a look?

- Ted


2015-10-24 10:51:34

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 3/9] ext4: Fix races between buffered IO and collapse / insert range

On Thu, Oct 22, 2015 at 10:15:55AM +0200, Jan Kara wrote:
> Current code implementing FALLOC_FL_COLLAPSE_RANGE and
> FALLOC_FL_INSERT_RANGE is prove to races with buffered writes and page

Minor nit:

s/prove/prone/

- Ted

2015-10-25 04:59:07

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/9] ext4: Fix races between page faults and hole punching

On Fri 23-10-15 21:21:35, Ted Tso wrote:
> On Thu, Oct 22, 2015 at 10:15:53AM +0200, Jan Kara wrote:
> > Currently, page faults and hole punching are completely unsynchronized.
> > This can result in page fault faulting in a page into a range that we
> > are punching after truncate_pagecache_range() has been called and thus
> > we can end up with a page mapped to disk blocks that will be shortly
> > freed. Filesystem corruption will shortly follow. Note that the same
> > race is avoided for truncate by checking page fault offset against
> > i_size but there isn't similar mechanism available for punching holes.
> >
> > Fix the problem by creating new rw semaphore i_mmap_sem in inode and
> > grab it for writing over truncate, hole punching, and other functions
> > removing blocks from extent tree and for read over page faults. We
> > cannot easily use i_data_sem for this since that ranks below transaction
> > start and we need something ranking above it so that it can be held over
> > the whole truncate / hole punching operation. Also remove various
> > workarounds we had in the code to reduce race window when page fault
> > could have created pages with stale mapping information.
> >
> > Signed-off-by: Jan Kara <[email protected]>
>
> This patch is causing ext4/001 to fail even using the standard 4k
> non-DAX test configuration. You had mentioned that extent zeroing was
> getting suppressed for DAX file systems, but it looks like it's
> getting suppressed even in the non-DAX configuration:

I'll verify this in detail but if I remember correctly, this was caused by
some thing like that with my patches we don't bother to first write out pages
that aregoing to be zeroed out shortly because that's pretty pointless. But
as I said, I'll check again to be sure.

Honza
>
> % kvm-xfstests -c 4k ext4/001
> ...
> ext4/001 [21:11:29][ 7.796142] run fstests ext4/001 at 2015-10-23 21:11:29
> [21:11:32] - output mismatch (see /results/results-4k/ext4/001.out.bad)
> --- tests/ext4/001.out 2015-10-18 23:46:49.000000000 -0400
> +++ /results/results-4k/ext4/001.out.bad 2015-10-23 21:11:32.104276540 -0400
> @@ -131,14 +131,10 @@
> 2: [32..39]: hole
> daa100df6e6711906b61c9ab5aa16032
> 11. data -> hole -> data
> -0: [0..7]: data
> -1: [8..31]: unwritten
> -2: [32..39]: data
> +0: [0..39]: data
> ...
> (Run 'diff -u tests/ext4/001.out /results/results-4k/ext4/001.out.bad' to see the entire diff)
>
>
> - Ted
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-10-25 09:23:34

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 9/9] ext4: Use pre-zeroed blocks for DAX page faults

Here's the list of new test failures that I found after applying this
patch series:

CMDLINE: full
FSTESTVER: e2fsprogs v1.43-WIP-2015-05-18-64-g2334bd3 (Sat, 5 Sep 2015 22:21:35 -0400)
FSTESTVER: fio fio-2.1.3 (Tue, 24 Sep 2013 08:42:24 -0600)
FSTESTVER: quota 022aabf (Wed, 26 Nov 2014 10:22:08 +0100)
FSTESTVER: xfsprogs v4.2.0 (Mon, 7 Sep 2015 10:14:31 +1000)
FSTESTVER: xfstests-bld dbb0a26 (Sun, 18 Oct 2015 08:34:14 -0400)
FSTESTVER: xfstests linux-v3.8-779-gc17e548 (Sun, 18 Oct 2015 23:40:39 -0400)
FSTESTVER: kernel 4.3.0-rc2-ext4-00034-gf29d88e8 #115 SMP Sat Oct 24 20:20:43 EDT 2015 x86_64
FSTESTCFG: "4k 1k ext3 encrypt nojournal ext3conv dioread_nolock data_journal_noleak inline bigalloc bigalloc_1k"
FSTESTSET: "-g auto"
FSTESTEXC: ""
FSTESTOPT: "aex"
MNTOPTS: ""
CPUS: "2"
MEM: "7477.52"
MEM: 7680 MB (Max capacity)
BEGIN TEST 4k: Ext4 4k block Sat Oct 24 20:53:11 EDT 2015
Failures: ext4/001
BEGIN TEST 1k: Ext4 1k block Sat Oct 24 21:35:55 EDT 2015
Failures: ext4/001 generic/075 generic/091 generic/112 generic/127 generic/263
BEGIN TEST ext3: Ext4 4k block emulating ext3 Sat Oct 24 22:20:37 EDT 2015
BEGIN TEST encrypt: Ext4 encryption Sat Oct 24 23:03:38 EDT 2015
BEGIN TEST nojournal: Ext4 4k block w/ no journal Sat Oct 24 23:40:09 EDT 2015
Failures: ext4/001
BEGIN TEST ext3conv: Ext4 4k block w/nodelalloc and no flex_bg Sun Oct 25 00:20:14 EDT 2015
Failures: generic/269
BEGIN TEST dioread_nolock: Ext4 4k block w/dioread_nolock Sun Oct 25 01:01:48 EDT 2015
Failures: ext4/001 generic/091
BEGIN TEST data_journal_noleak: Ext4 4k block w/data=journal Sun Oct 25 01:45:39 EDT 2015
BEGIN TEST inline: Ext4 4k block w/inline Sun Oct 25 02:37:44 EDT 2015
Failures: ext4/001
BEGIN TEST bigalloc: Ext4 4k block w/bigalloc Sun Oct 25 03:22:02 EDT 2015
Failures: ext4/001
BEGIN TEST bigalloc_1k: Ext4 1k block w/bigalloc Sun Oct 25 04:08:26 EDT 2015
Failures: ext4/001

Cheers,

- Ted