2015-11-04 16:18:51

by Jan Kara

Subject: [PATCH 0/9 v3] ext4: Punch hole and DAX fixes

Hello,

Another version of my ext4 fixes. I've fixed up all the failures Ted reported
except for the ext4/001 failures, which are false positives (I will send fixes
for that test shortly), and generic/269 in nodelalloc mode, which I just
wasn't able to reproduce.

Note that testing with 1 KB blocksize on a ramdisk is broken since brd has a
buggy discard implementation. It took me quite some time to figure this out.
A fix has been submitted, but bear this in mind just in case.

Changes since v2:
* Fixed collapse range to truncate pagecache properly with blocksize < pagesize
* Fixed assertion in ext4_get_block_overwrite

Patch set description

This series fixes a long-standing problem where racing punch hole and page
fault operations can result in filesystem corruption or stale data exposure.
We fix the problem by using a new inode-private rw_semaphore, i_mmap_sem, to
synchronize page faults with truncate and punch hole operations.
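
The core of the fix is the classic reader/writer exclusion sketched below.
This is only a minimal sketch with illustrative variable names; the real
fault and punch-hole paths, including DIO draining and error handling, are
in patches 1, 3 and 4:

	/* page fault side: shared access */
	down_read(&EXT4_I(inode)->i_mmap_sem);
	/* ... fault the page in, map blocks ... */
	up_read(&EXT4_I(inode)->i_mmap_sem);

	/* truncate / punch hole / zero range side: exclusive access */
	down_write(&EXT4_I(inode)->i_mmap_sem);
	truncate_pagecache_range(inode, first_block_offset, last_block_offset);
	/* ... remove or convert blocks in the extent tree ... */
	up_write(&EXT4_I(inode)->i_mmap_sem);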

With this exclusion in place, the only remaining problem in the DAX
implementation is races between two page faults zeroing out the same block
concurrently (where data written after the first fault finishes may be
overwritten by the second fault, which is still doing the zeroing).

Patch 1 introduces i_mmap_sem lock in ext4 inode and uses it to properly
serialize extent manipulation operations and page faults.

Patch 2 is mostly a preparatory cleanup patch which also avoids a double lock /
unlock of the unlocked DIO protection (currently harmless, but a nasty surprise
waiting to happen).

Patches 3-4 fix further races of extent manipulation functions (such as zero
range, collapse range, insert range) with buffered IO and page writeback.

Patch 5 documents locking order of ext4 filesystem locks.

Patch 6 removes locking abuse of i_data_sem from the get_blocks() path when
dioread_nolock is enabled since it is not needed anymore.

Patches 7-9 implement allocation of pre-zeroed blocks in ext4_map_blocks()
callback and use such blocks for allocations from DAX page faults.
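
In outline, a DAX fault now maps blocks roughly as in the sketch below
(condensed from the ext4_dax_mmap_get_block() added in patch 9; error
handling and the unwritten-extent conversion path are omitted, so treat it
as an illustration rather than the actual code):

	struct ext4_map_blocks map = {
		.m_lblk = iblock,	/* faulting block from the get_block callback */
		.m_len = bh_result->b_size >> inode->i_blkbits,
	};
	handle_t *handle;
	int ret;

	handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
				    ext4_chunk_trans_blocks(inode, map.m_len));
	/*
	 * CREATE_ZERO = CREATE + ZERO: newly allocated blocks are zeroed out
	 * (using ext4_issue_zeroout() from patch 7) before they are exposed
	 * as written, so a second fault to the same block cannot wipe out
	 * data written through the first mapping.
	 */
	ret = ext4_map_blocks(handle, inode, &map,
			      EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_CREATE_ZERO);
	ext4_journal_stop(handle);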

The patches survived an xfstests run in both DAX and non-DAX modes.

Honza


2015-11-04 16:18:50

by Jan Kara

Subject: [PATCH 2/9] ext4: Move unlocked dio protection from ext4_alloc_file_blocks()

Currently ext4_alloc_file_blocks() handles the protection against unlocked
DIO. However, we now need to call it sometimes under i_mmap_sem and
sometimes not, and DIO protection ranks above i_mmap_sem (although strictly
speaking this cannot currently create any deadlocks). Also, ext4_zero_range()
was actually getting & releasing the unlocked DIO protection twice in some
cases. Luckily that didn't introduce any real bug, but it was a land mine
waiting to be stepped on. So move the DIO protection out of
ext4_alloc_file_blocks() and into the two call sites.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/extents.c | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 6e0c5d37a232..66ab89b58c1f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4681,10 +4681,6 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
if (len <= EXT_UNWRITTEN_MAX_LEN)
flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;

- /* Wait all existing dio workers, newcomers will block on i_mutex */
- ext4_inode_block_unlocked_dio(inode);
- inode_dio_wait(inode);
-
/*
* credits to insert 1 extent into extent tree
*/
@@ -4748,8 +4744,6 @@ retry:
goto retry;
}

- ext4_inode_resume_unlocked_dio(inode);
-
return ret > 0 ? ret2 : ret;
}

@@ -4823,6 +4817,10 @@ static long ext4_zero_range(struct file *file, loff_t offset,
if (mode & FALLOC_FL_KEEP_SIZE)
flags |= EXT4_GET_BLOCKS_KEEP_SIZE;

+ /* Wait all existing dio workers, newcomers will block on i_mutex */
+ ext4_inode_block_unlocked_dio(inode);
+ inode_dio_wait(inode);
+
/* Preallocate the range including the unaligned edges */
if (partial_begin || partial_end) {
ret = ext4_alloc_file_blocks(file,
@@ -4831,7 +4829,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
round_down(offset, 1 << blkbits)) >> blkbits,
new_size, flags, mode);
if (ret)
- goto out_mutex;
+ goto out_dio;

}

@@ -4840,10 +4838,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
EXT4_EX_NOCACHE);

- /* Wait all existing dio workers, newcomers will block on i_mutex */
- ext4_inode_block_unlocked_dio(inode);
- inode_dio_wait(inode);

2015-11-04 16:18:52

by Jan Kara

Subject: [PATCH 5/9] ext4: Document lock ordering

We have enough locks that it's probably worth documenting the lock
ordering rules we have in ext4.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/super.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 61fad9e33bb9..152f719a0c62 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -85,6 +85,36 @@ static void ext4_unregister_li_request(struct super_block *sb);
static void ext4_clear_request_list(void);
static int ext4_reserve_clusters(struct ext4_sb_info *, ext4_fsblk_t);

+/*
+ * Lock ordering
+ *
+ * Note the difference between i_mmap_sem (EXT4_I(inode)->i_mmap_sem) and
+ * i_mmap_rwsem (inode->i_mapping->i_mmap_rwsem)!
+ *
+ * page fault path:
+ * mmap_sem -> sb_start_pagefault -> i_mmap_sem (r) -> transaction start ->
+ * page lock -> i_data_sem (rw)
+ *
+ * buffered write path:
+ * sb_start_write -> i_mutex -> mmap_sem
+ * sb_start_write -> i_mutex -> transaction start -> page lock ->
+ * i_data_sem (rw)
+ *
+ * truncate:
+ * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (w) -> i_mmap_sem (w) ->
+ * i_mmap_rwsem (w) -> page lock
+ * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (w) -> i_mmap_sem (w) ->
+ * transaction start -> i_data_sem (rw)
+ *
+ * direct IO:
+ * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (r) -> mmap_sem
+ * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (r) ->
+ * transaction start -> i_data_sem (rw)
+ *
+ * writepages:
+ * transaction start -> page lock(s) -> i_data_sem (rw)
+ */
+
#if !defined(CONFIG_EXT2_FS) && !defined(CONFIG_EXT2_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT2)
static struct file_system_type ext2_fs_type = {
.owner = THIS_MODULE,
--
2.1.4
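
As a worked example of the page fault ordering documented above, the DAX
write fault path (see the ext4_dax_fault() hunks in patches 1 and 9 in this
thread) takes the locks in exactly this sequence; a trimmed sketch with
error handling left out:

	/* mmap_sem is already held by the generic page fault code */
	sb_start_pagefault(sb);
	file_update_time(vma->vm_file);
	down_read(&EXT4_I(inode)->i_mmap_sem);		/* i_mmap_sem (r) */
	handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
				EXT4_DATA_TRANS_BLOCKS(sb));	/* transaction start */
	result = __dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);
	/* page lock and i_data_sem (rw) are taken further down the stack */
	ext4_journal_stop(handle);
	up_read(&EXT4_I(inode)->i_mmap_sem);
	sb_end_pagefault(sb);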


2015-11-04 16:18:51

by Jan Kara

Subject: [PATCH 3/9] ext4: Fix races between buffered IO and collapse / insert range

The current code implementing FALLOC_FL_COLLAPSE_RANGE and
FALLOC_FL_INSERT_RANGE is prone to races with buffered writes and page
faults. If a buffered write or a write via mmap manages to squeeze between
filemap_write_and_wait_range() and truncate_pagecache() in the fallocate
implementations, the written data is simply discarded by
truncate_pagecache() although it should have been shifted.

Fix the problem by moving the filemap_write_and_wait_range() calls inside
i_mutex and i_mmap_sem. That way we are protected against races with
both buffered writes and page faults.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/extents.c | 61 +++++++++++++++++++++++++++++--------------------------
1 file changed, 32 insertions(+), 29 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 66ab89b58c1f..a333c9cc651f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5483,21 +5483,7 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
return ret;
}

- /*
- * Need to round down offset to be aligned with page size boundary
- * for page size > block size.
- */
- ioffset = round_down(offset, PAGE_SIZE);
-
- /* Write out all dirty pages */
- ret = filemap_write_and_wait_range(inode->i_mapping, ioffset,
- LLONG_MAX);
- if (ret)
- return ret;
-
- /* Take mutex lock */
mutex_lock(&inode->i_mutex);
-
/*
* There is no need to overlap collapse range with EOF, in which case
* it is effectively a truncate operation
@@ -5518,10 +5504,31 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
inode_dio_wait(inode);

/*
* Prevent page faults from reinstantiating pages we have released from
* page cache.
*/
down_write(&EXT4_I(inode)->i_mmap_sem);
+ /*
+ * Need to round down offset to be aligned with page size boundary
+ * for page size > block size.
+ */
+ ioffset = round_down(offset, PAGE_SIZE);
+ /*
+ * Write tail of the last page before removed range since it will get
+ * removed from the page cache below.
+ */
+ ret = filemap_write_and_wait_range(inode->i_mapping, ioffset, offset);
+ if (ret)
+ goto out_mmap;
+ /*
+ * Write data that will be shifted to preserve them when discarding
+ * page cache below. We are also protected from pages becoming dirty
+ * by i_mmap_sem.
+ */
+ ret = filemap_write_and_wait_range(inode->i_mapping, offset + len,
+ LLONG_MAX);
+ if (ret)
+ goto out_mmap;
truncate_pagecache(inode, ioffset);

credits = ext4_writepage_trans_blocks(inode);
@@ -5622,21 +5629,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
return ret;
}

- /*
- * Need to round down to align start offset to page size boundary
- * for page size > block size.
- */
- ioffset = round_down(offset, PAGE_SIZE);
-
- /* Write out all dirty pages */
- ret = filemap_write_and_wait_range(inode->i_mapping, ioffset,
- LLONG_MAX);
- if (ret)
- return ret;
-
- /* Take mutex lock */
mutex_lock(&inode->i_mutex);

2015-11-04 16:18:53

by Jan Kara

Subject: [PATCH 6/9] ext4: Get rid of EXT4_GET_BLOCKS_NO_LOCK flag

When dioread_nolock mode is enabled, we grab i_data_sem in
ext4_ext_direct_IO() and therefore have to instruct _ext4_get_block()
not to grab i_data_sem again by using EXT4_GET_BLOCKS_NO_LOCK. However,
holding i_data_sem over an overwrite direct IO isn't needed these days.
We have exclusion against truncate / hole punching because we increase
i_dio_count under i_mutex in ext4_ext_direct_IO(), so once
ext4_file_write_iter() verifies that blocks are allocated & written,
they are guaranteed to stay that way for the whole direct IO, even after
we drop i_mutex.

So we can just remove this locking abuse and the no longer necessary
EXT4_GET_BLOCKS_NO_LOCK flag.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 4 +---
fs/ext4/inode.c | 43 ++++++++++++++++++++-----------------------
include/trace/events/ext4.h | 3 +--
3 files changed, 22 insertions(+), 28 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c86546efec30..142e935d1f3d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -551,10 +551,8 @@ enum {
#define EXT4_GET_BLOCKS_NO_NORMALIZE 0x0040
/* Request will not result in inode size update (user for fallocate) */
#define EXT4_GET_BLOCKS_KEEP_SIZE 0x0080
- /* Do not take i_data_sem locking in ext4_map_blocks */
-#define EXT4_GET_BLOCKS_NO_LOCK 0x0100
/* Convert written extents to unwritten */
-#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN 0x0200
+#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN 0x0100

/*
* The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f72212ef1fee..f220dfa6dbc3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -403,8 +403,7 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
* out taking i_data_sem. So at the time the unwritten extent
* could be converted.
*/
- if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
- down_read(&EXT4_I(inode)->i_data_sem);
+ down_read(&EXT4_I(inode)->i_data_sem);
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
retval = ext4_ext_map_blocks(handle, inode, map, flags &
EXT4_GET_BLOCKS_KEEP_SIZE);
@@ -412,8 +411,7 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
retval = ext4_ind_map_blocks(handle, inode, map, flags &
EXT4_GET_BLOCKS_KEEP_SIZE);
}
- if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
- up_read((&EXT4_I(inode)->i_data_sem));
+ up_read((&EXT4_I(inode)->i_data_sem));

/*
* We don't check m_len because extent will be collpased in status
@@ -509,8 +507,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
* Try to see if we can get the block without requesting a new
* file system block.
*/
- if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
- down_read(&EXT4_I(inode)->i_data_sem);
+ down_read(&EXT4_I(inode)->i_data_sem);
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
retval = ext4_ext_map_blocks(handle, inode, map, flags &
EXT4_GET_BLOCKS_KEEP_SIZE);
@@ -541,8 +538,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
if (ret < 0)
retval = ret;
}
- if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
- up_read((&EXT4_I(inode)->i_data_sem));
+ up_read((&EXT4_I(inode)->i_data_sem));

found:
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
@@ -674,7 +670,7 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
map.m_lblk = iblock;
map.m_len = bh->b_size >> inode->i_blkbits;

- if (flags && !(flags & EXT4_GET_BLOCKS_NO_LOCK) && !handle) {
+ if (flags && !handle) {
/* Direct IO write... */
if (map.m_len > DIO_MAX_BLOCKS)
map.m_len = DIO_MAX_BLOCKS;
@@ -879,9 +875,6 @@ int do_journal_get_write_access(handle_t *handle,
return ret;
}

-static int ext4_get_block_write_nolock(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create);
-
#ifdef CONFIG_EXT4_FS_ENCRYPTION
static int ext4_block_write_begin(struct page *page, loff_t pos, unsigned len,
get_block_t *get_block)
@@ -3012,13 +3005,21 @@ int ext4_get_block_write(struct inode *inode, sector_t iblock,
EXT4_GET_BLOCKS_IO_CREATE_EXT);
}

-static int ext4_get_block_write_nolock(struct inode *inode, sector_t iblock,
+static int ext4_get_block_overwrite(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
- ext4_debug("ext4_get_block_write_nolock: inode %lu, create flag %d\n",
+ int ret;
+
+ ext4_debug("ext4_get_block_overwrite: inode %lu, create flag %d\n",
inode->i_ino, create);
- return _ext4_get_block(inode, iblock, bh_result,
- EXT4_GET_BLOCKS_NO_LOCK);
+ ret = _ext4_get_block(inode, iblock, bh_result, 0);
+ /*
+ * Blocks should have been preallocated! ext4_file_write_iter() checks
+ * that.
+ */
+ WARN_ON_ONCE(!buffer_mapped(bh_result));
+
+ return ret;
}

int ext4_get_block_dax(struct inode *inode, sector_t iblock,
@@ -3101,10 +3102,8 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
/* If we do a overwrite dio, i_mutex locking can be released */
overwrite = *((int *)iocb->private);

- if (overwrite) {
- down_read(&EXT4_I(inode)->i_data_sem);
+ if (overwrite)
mutex_unlock(&inode->i_mutex);
- }

/*
* We could direct write to holes and fallocate.
@@ -3147,7 +3146,7 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
}

if (overwrite) {
- get_block_func = ext4_get_block_write_nolock;
+ get_block_func = ext4_get_block_overwrite;
} else {
get_block_func = ext4_get_block_write;
dio_flags = DIO_LOCKING;
@@ -3203,10 +3202,8 @@ retake_lock:
if (iov_iter_rw(iter) == WRITE)
inode_dio_end(inode);
/* take i_mutex locking again if we do a ovewrite dio */
- if (overwrite) {
- up_read(&EXT4_I(inode)->i_data_sem);
+ if (overwrite)
mutex_lock(&inode->i_mutex);
- }

return ret;
}
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 594b4b29a224..5f2ace56efc5 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -42,8 +42,7 @@ struct extent_status;
{ EXT4_GET_BLOCKS_CONVERT, "CONVERT" }, \
{ EXT4_GET_BLOCKS_METADATA_NOFAIL, "METADATA_NOFAIL" }, \
{ EXT4_GET_BLOCKS_NO_NORMALIZE, "NO_NORMALIZE" }, \
- { EXT4_GET_BLOCKS_KEEP_SIZE, "KEEP_SIZE" }, \
- { EXT4_GET_BLOCKS_NO_LOCK, "NO_LOCK" })
+ { EXT4_GET_BLOCKS_KEEP_SIZE, "KEEP_SIZE" })

#define show_mflags(flags) __print_flags(flags, "", \
{ EXT4_MAP_NEW, "N" }, \
--
2.1.4


2015-11-04 16:18:53

by Jan Kara

Subject: [PATCH 7/9] ext4: Provide ext4_issue_zeroout()

Create a new function, ext4_issue_zeroout(), to zero out a contiguous
(both logically and physically) part of inode data. We will need to
issue a zeroout when the extent structure is not readily available, and
this function will allow us to do it without making up fake extent
structures.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/crypto.c | 6 ++----
fs/ext4/ext4.h | 5 ++++-
fs/ext4/extents.c | 12 ++----------
fs/ext4/inode.c | 15 +++++++++++++++
4 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/crypto.c b/fs/ext4/crypto.c
index 45731558138c..6b63d1e9f05a 100644
--- a/fs/ext4/crypto.c
+++ b/fs/ext4/crypto.c
@@ -403,14 +403,12 @@ int ext4_decrypt_one(struct inode *inode, struct page *page)
return ret;
}

-int ext4_encrypted_zeroout(struct inode *inode, struct ext4_extent *ex)
+int ext4_encrypted_zeroout(struct inode *inode, ext4_lblk_t lblk,
+ ext4_fsblk_t pblk, ext4_lblk_t len)
{
struct ext4_crypto_ctx *ctx;
struct page *ciphertext_page = NULL;
struct bio *bio;
- ext4_lblk_t lblk = ex->ee_block;
- ext4_fsblk_t pblk = ext4_ext_pblock(ex);
- unsigned int len = ext4_ext_get_actual_len(ex);
int err = 0;

BUG_ON(inode->i_sb->s_blocksize != PAGE_CACHE_SIZE);
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 142e935d1f3d..785737a342e6 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2072,7 +2072,8 @@ struct page *ext4_encrypt(struct inode *inode,
struct page *plaintext_page);
int ext4_decrypt(struct ext4_crypto_ctx *ctx, struct page *page);
int ext4_decrypt_one(struct inode *inode, struct page *page);
-int ext4_encrypted_zeroout(struct inode *inode, struct ext4_extent *ex);
+int ext4_encrypted_zeroout(struct inode *inode, ext4_lblk_t lblk,
+ ext4_fsblk_t pblk, ext4_lblk_t len);

#ifdef CONFIG_EXT4_FS_ENCRYPTION
int ext4_init_crypto(void);
@@ -2327,6 +2328,8 @@ extern int ext4_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
extern qsize_t *ext4_get_reserved_space(struct inode *inode);
extern void ext4_da_update_reserve_space(struct inode *inode,
int used, int quota_claim);
+extern int ext4_issue_zeroout(struct inode *inode, ext4_lblk_t lblk,
+ ext4_fsblk_t pblk, ext4_lblk_t len);

/* indirect.c */
extern int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index be5496c14f6b..e13669b09dc1 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3118,19 +3118,11 @@ static int ext4_ext_zeroout(struct inode *inode, struct ext4_extent *ex)
{
ext4_fsblk_t ee_pblock;
unsigned int ee_len;
- int ret;

ee_len = ext4_ext_get_actual_len(ex);
ee_pblock = ext4_ext_pblock(ex);
-
- if (ext4_encrypted_inode(inode))
- return ext4_encrypted_zeroout(inode, ex);
-
- ret = sb_issue_zeroout(inode->i_sb, ee_pblock, ee_len, GFP_NOFS);
- if (ret > 0)
- ret = 0;
-
- return ret;
+ return ext4_issue_zeroout(inode, le32_to_cpu(ex->ee_block), ee_pblock,
+ ee_len);
}

/*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f220dfa6dbc3..5c92b0096f85 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -383,6 +383,21 @@ static int __check_block_validity(struct inode *inode, const char *func,
return 0;
}

+int ext4_issue_zeroout(struct inode *inode, ext4_lblk_t lblk, ext4_fsblk_t pblk,
+ ext4_lblk_t len)
+{
+ int ret;
+
+ if (ext4_encrypted_inode(inode))
+ return ext4_encrypted_zeroout(inode, lblk, pblk, len);
+
+ ret = sb_issue_zeroout(inode->i_sb, pblk, len, GFP_NOFS);
+ if (ret > 0)
+ ret = 0;
+
+ return ret;
+}
+
#define check_block_validity(inode, map) \
__check_block_validity((inode), __func__, __LINE__, (map))

--
2.1.4


2015-11-04 16:18:53

by Jan Kara

Subject: [PATCH 9/9] ext4: Use pre-zeroed blocks for DAX page faults

Make the DAX fault path use pre-zeroed blocks to avoid races with extent
conversion and zeroing when two page faults to the same block happen.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 4 +--
fs/ext4/file.c | 20 +++-----------
fs/ext4/inode.c | 81 +++++++++++++++++++++++++++++++++++++++++++++------------
3 files changed, 69 insertions(+), 36 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0a11a34d54c9..9b6e5813968b 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2284,8 +2284,8 @@ struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
int ext4_get_block_write(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
-int ext4_get_block_dax(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create);
+int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create);
int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 0d24ebcd7c9e..749b222e6498 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -193,18 +193,6 @@ out:
}

#ifdef CONFIG_FS_DAX
-static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
-{
- struct inode *inode = bh->b_assoc_map->host;
- /* XXX: breaks on 32-bit > 16TB. Is that even supported? */
- loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
- int err;
- if (!uptodate)
- return;
- WARN_ON(!buffer_unwritten(bh));
- err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);
-}
-
static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
int result;
@@ -225,8 +213,7 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
if (IS_ERR(handle))
result = VM_FAULT_SIGBUS;
else
- result = __dax_fault(vma, vmf, ext4_get_block_dax,
- ext4_end_io_unwritten);
+ result = __dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);

if (write) {
if (!IS_ERR(handle))
@@ -262,7 +249,7 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
result = VM_FAULT_SIGBUS;
else
result = __dax_pmd_fault(vma, addr, pmd, flags,
- ext4_get_block_dax, ext4_end_io_unwritten);
+ ext4_dax_mmap_get_block, NULL);

if (write) {
if (!IS_ERR(handle))
@@ -283,8 +270,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
sb_start_pagefault(inode->i_sb);
file_update_time(vma->vm_file);
down_read(&EXT4_I(inode)->i_mmap_sem);
- err = __dax_mkwrite(vma, vmf, ext4_get_block_dax,
- ext4_end_io_unwritten);
+ err = __dax_mkwrite(vma, vmf, ext4_dax_mmap_get_block, NULL);
up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(inode->i_sb);

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 74543c892b16..73d00d92bb4d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -718,16 +718,6 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,

map_bh(bh, inode->i_sb, map.m_pblk);
bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
- if (IS_DAX(inode) && buffer_unwritten(bh)) {
- /*
- * dgc: I suspect unwritten conversion on ext4+DAX is
- * fundamentally broken here when there are concurrent
- * read/write in progress on this inode.
- */
- WARN_ON_ONCE(io_end);
- bh->b_assoc_map = inode->i_mapping;
- bh->b_private = (void *)(unsigned long)iblock;
- }
if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN)
set_buffer_defer_completion(bh);
bh->b_size = inode->i_sb->s_blocksize * map.m_len;
@@ -3050,17 +3040,74 @@ static int ext4_get_block_overwrite(struct inode *inode, sector_t iblock,
return ret;
}

-int ext4_get_block_dax(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create)
+#ifdef CONFIG_FS_DAX
+int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create)
{
- int flags = EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_UNWRIT_EXT;
+ int ret, err;
+ int credits;
+ struct ext4_map_blocks map;
+ handle_t *handle = NULL;
+ int flags = 0;

- if (create)
- flags |= EXT4_GET_BLOCKS_CREATE;
- ext4_debug("ext4_get_block_dax: inode %lu, create flag %d\n",
+ ext4_debug("ext4_dax_mmap_get_block: inode %lu, create flag %d\n",
inode->i_ino, create);
- return _ext4_get_block(inode, iblock, bh_result, flags);
+ map.m_lblk = iblock;
+ map.m_len = bh_result->b_size >> inode->i_blkbits;
+ credits = ext4_chunk_trans_blocks(inode, map.m_len);
+ if (create) {
+ flags |= EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_CREATE_ZERO;
+ handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ return ret;
+ }
+ }
+
+ ret = ext4_map_blocks(handle, inode, &map, flags);
+ if (create) {
+ err = ext4_journal_stop(handle);
+ if (ret >= 0 && err < 0)
+ ret = err;
+ }
+ if (ret <= 0)
+ goto out;
+ if (map.m_flags & EXT4_MAP_UNWRITTEN) {
+ int err2;
+
+ /*
+ * We are protected by i_mmap_sem so we know block cannot go
+ * away from under us even though we dropped i_data_sem.
+ * Convert extent to written and write zeros there.
+ *
+ * Note: We may get here even when create == 0.
+ */
+ handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+
+ err = ext4_map_blocks(handle, inode, &map,
+ EXT4_GET_BLOCKS_CONVERT | EXT4_GET_BLOCKS_CREATE_ZERO);
+ if (err < 0)
+ ret = err;
+ err2 = ext4_journal_stop(handle);
+ if (err2 < 0 && ret > 0)
+ ret = err2;
+ }
+out:
+ WARN_ON_ONCE(ret == 0 && create);
+ if (ret > 0) {
+ map_bh(bh_result, inode->i_sb, map.m_pblk);
+ bh_result->b_state = (bh_result->b_state & ~EXT4_MAP_FLAGS) |
+ map.m_flags;
+ bh_result->b_size = map.m_len << inode->i_blkbits;
+ ret = 0;
+ }
+ return ret;
}
+#endif

static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
ssize_t size, void *private)
--
2.1.4


2015-11-04 16:18:52

by Jan Kara

Subject: [PATCH 8/9] ext4: Implement allocation of pre-zeroed blocks

The DAX page fault path needs to get blocks that are pre-zeroed to avoid
races when two concurrent page faults happen in the same block of a
file. Implement support for this in ext4_map_blocks().

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 4 ++++
fs/ext4/extents.c | 8 ++++++++
fs/ext4/inode.c | 20 +++++++++++++++++---
include/trace/events/ext4.h | 3 ++-
4 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 785737a342e6..0a11a34d54c9 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -553,6 +553,10 @@ enum {
#define EXT4_GET_BLOCKS_KEEP_SIZE 0x0080
/* Convert written extents to unwritten */
#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN 0x0100
+ /* Write zeros to newly created written extents */
+#define EXT4_GET_BLOCKS_ZERO 0x0200
+#define EXT4_GET_BLOCKS_CREATE_ZERO (EXT4_GET_BLOCKS_CREATE |\
+ EXT4_GET_BLOCKS_ZERO)

/*
* The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index e13669b09dc1..76113dbdfcf7 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4040,6 +4040,14 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
}
/* IO end_io complete, convert the filled extent to written */
if (flags & EXT4_GET_BLOCKS_CONVERT) {
+ if (flags & EXT4_GET_BLOCKS_ZERO) {
+ if (allocated > map->m_len)
+ allocated = map->m_len;
+ err = ext4_issue_zeroout(inode, map->m_lblk, newblock,
+ allocated);
+ if (err < 0)
+ goto out2;
+ }
ret = ext4_convert_unwritten_extents_endio(handle, inode, map,
ppath);
if (ret >= 0) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5c92b0096f85..74543c892b16 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -643,7 +643,7 @@ found:
if ((flags & EXT4_GET_BLOCKS_PRE_IO) &&
ext4_es_lookup_extent(inode, map->m_lblk, &es)) {
if (ext4_es_is_written(&es))
- goto has_zeroout;
+ goto out_sem;
}
status = map->m_flags & EXT4_MAP_UNWRITTEN ?
EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
@@ -654,11 +654,24 @@ found:
status |= EXTENT_STATUS_DELAYED;
ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
map->m_pblk, status);
- if (ret < 0)
+ if (ret < 0) {
retval = ret;
+ goto out_sem;
+ }
+
+ if (flags & EXT4_GET_BLOCKS_ZERO &&
+ map->m_flags & EXT4_MAP_MAPPED &&
+ map->m_flags & EXT4_MAP_NEW) {
+ ret = ext4_issue_zeroout(inode, map->m_lblk,
+ map->m_pblk, map->m_len);
+ if (ret) {
+ retval = ret;
+ goto out_sem;
+ }
+ }
}

-has_zeroout:
+out_sem:
up_write((&EXT4_I(inode)->i_data_sem));
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
ret = check_block_validity(inode, map);
@@ -3041,6 +3054,7 @@ int ext4_get_block_dax(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
int flags = EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_UNWRIT_EXT;
+
if (create)
flags |= EXT4_GET_BLOCKS_CREATE;
ext4_debug("ext4_get_block_dax: inode %lu, create flag %d\n",
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 5f2ace56efc5..4e4b2fa78609 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -42,7 +42,8 @@ struct extent_status;
{ EXT4_GET_BLOCKS_CONVERT, "CONVERT" }, \
{ EXT4_GET_BLOCKS_METADATA_NOFAIL, "METADATA_NOFAIL" }, \
{ EXT4_GET_BLOCKS_NO_NORMALIZE, "NO_NORMALIZE" }, \
- { EXT4_GET_BLOCKS_KEEP_SIZE, "KEEP_SIZE" })
+ { EXT4_GET_BLOCKS_KEEP_SIZE, "KEEP_SIZE" }, \
+ { EXT4_GET_BLOCKS_ZERO, "ZERO" })

#define show_mflags(flags) __print_flags(flags, "", \
{ EXT4_MAP_NEW, "N" }, \
--
2.1.4


2015-11-04 16:18:51

by Jan Kara

Subject: [PATCH 4/9] ext4: Fix races of writeback with punch hole and zero range

When doing delayed allocation, the update of the on-disk inode size is
postponed until IO submission time. However, hole punch or zero range
fallocate calls can end up discarding the tail page cache page, and thus
the on-disk inode size would never be properly updated.

Make sure the on-disk inode size is updated before truncating the page
cache.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 3 +++
fs/ext4/extents.c | 5 +++++
fs/ext4/inode.c | 35 ++++++++++++++++++++++++++++++++++-
3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c19ff61ccbdf..c86546efec30 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2691,6 +2691,9 @@ static inline int ext4_update_inode_size(struct inode *inode, loff_t newsize)
return changed;
}

+int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
+ loff_t len);
+
struct ext4_group_info {
unsigned long bb_state;
struct rb_root bb_free_root;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index a333c9cc651f..be5496c14f6b 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4843,6 +4843,11 @@ static long ext4_zero_range(struct file *file, loff_t offset,
* released from page cache.
*/
down_write(&EXT4_I(inode)->i_mmap_sem);
+ ret = ext4_update_disksize_before_punch(inode, offset, len);
+ if (ret) {
+ up_write(&EXT4_I(inode)->i_mmap_sem);
+ goto out_dio;
+ }
/* Now release the pages and zero block aligned part of pages */
truncate_pagecache_range(inode, start, end - 1);
inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 36ad45906d26..f72212ef1fee 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3517,6 +3517,35 @@ int ext4_can_truncate(struct inode *inode)
}

/*
+ * We have to make sure i_disksize gets properly updated before we truncate
+ * page cache due to hole punching or zero range. Otherwise i_disksize update
+ * can get lost as it may have been postponed to submission of writeback but
+ * that will never happen after we truncate page cache.
+ */
+int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
+ loff_t len)
+{
+ handle_t *handle;
+ loff_t size = i_size_read(inode);
+
+ WARN_ON(!mutex_is_locked(&inode->i_mutex));
+ if (offset > size || offset + len < size)
+ return 0;
+
+ if (EXT4_I(inode)->i_disksize >= size)
+ return 0;
+
+ handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+ ext4_update_i_disksize(inode, size);
+ ext4_mark_inode_dirty(handle, inode);
+ ext4_journal_stop(handle);
+
+ return 0;
+}
+
+/*
* ext4_punch_hole: punches a hole in a file by releaseing the blocks
* associated with the given offset and length
*
@@ -3594,9 +3623,13 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
last_block_offset = round_down((offset + length), sb->s_blocksize) - 1;

/* Now release the pages and zero block aligned part of pages*/
- if (last_block_offset > first_block_offset)
+ if (last_block_offset > first_block_offset) {
+ ret = ext4_update_disksize_before_punch(inode, offset, length);
+ if (ret)
+ goto out_dio;
truncate_pagecache_range(inode, first_block_offset,
last_block_offset);
+ }

if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
credits = ext4_writepage_trans_blocks(inode);
--
2.1.4


2015-11-04 16:18:51

by Jan Kara

Subject: [PATCH 1/9] ext4: Fix races between page faults and hole punching

Currently, page faults and hole punching are completely unsynchronized.
This can result in a page fault faulting a page into a range that we are
punching after truncate_pagecache_range() has been called, and thus we
can end up with a page mapped to disk blocks that will shortly be freed.
Filesystem corruption will follow. Note that the same race is avoided
for truncate by checking the page fault offset against i_size, but there
is no similar mechanism available for punching holes.

Fix the problem by creating a new rw_semaphore, i_mmap_sem, in the inode
and grabbing it for writing over truncate, hole punching, and other
functions removing blocks from the extent tree, and for reading over
page faults. We cannot easily use i_data_sem for this since it ranks
below transaction start, and we need something ranking above transaction
start so that it can be held over the whole truncate / hole punching
operation. Also remove the various workarounds we had in the code to
reduce the race window when a page fault could have created pages with
stale mapping information.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 10 +++++++++
fs/ext4/extents.c | 54 ++++++++++++++++++++++++--------------------
fs/ext4/file.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++--------
fs/ext4/inode.c | 36 +++++++++++++++++++++--------
fs/ext4/super.c | 1 +
fs/ext4/truncate.h | 2 ++
6 files changed, 127 insertions(+), 42 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index fd1f28be5296..c19ff61ccbdf 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -869,6 +869,15 @@ struct ext4_inode_info {
* by other means, so we have i_data_sem.
*/
struct rw_semaphore i_data_sem;
+ /*
+ * i_mmap_sem is for serializing page faults with truncate / punch hole
+ * operations. We have to make sure that new page cannot be faulted in
+ * a section of the inode that is being punched. We cannot easily use
+ * i_data_sem for this since we need protection for the whole punch
+ * operation and i_data_sem ranks below transaction start so we have
+ * to occasionally drop it.
+ */
+ struct rw_semaphore i_mmap_sem;
struct inode vfs_inode;
struct jbd2_inode *jinode;

@@ -2316,6 +2325,7 @@ extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
loff_t lstart, loff_t lend);
extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
+extern int ext4_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
extern qsize_t *ext4_get_reserved_space(struct inode *inode);
extern void ext4_da_update_reserve_space(struct inode *inode,
int used, int quota_claim);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 2553aa8b608d..6e0c5d37a232 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4766,7 +4766,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
int partial_begin, partial_end;
loff_t start, end;
ext4_lblk_t lblk;
- struct address_space *mapping = inode->i_mapping;
unsigned int blkbits = inode->i_blkbits;

trace_ext4_zero_range(inode, offset, len, mode);
@@ -4782,17 +4781,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
}

/*
- * Write out all dirty pages to avoid race conditions
- * Then release them.
- */
- if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
- ret = filemap_write_and_wait_range(mapping, offset,
- offset + len - 1);
- if (ret)
- return ret;
- }
-
- /*
* Round up offset. This is not fallocate, we neet to zero out
* blocks, so convert interior block aligned part of the range to
* unwritten and possibly manually zero out unaligned parts of the
@@ -4852,16 +4840,22 @@ static long ext4_zero_range(struct file *file, loff_t offset,
flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
EXT4_EX_NOCACHE);

- /* Now release the pages and zero block aligned part of pages*/
- truncate_pagecache_range(inode, start, end - 1);
- inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
-
/* Wait all existing dio workers, newcomers will block on i_mutex */
ext4_inode_block_unlocked_dio(inode);
inode_dio_wait(inode);

+ /*
+ * Prevent page faults from reinstantiating pages we have
+ * released from page cache.
+ */
+ down_write(&EXT4_I(inode)->i_mmap_sem);
+ /* Now release the pages and zero block aligned part of pages */
+ truncate_pagecache_range(inode, start, end - 1);
+ inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
+
ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
flags, mode);
+ up_write(&EXT4_I(inode)->i_mmap_sem);
if (ret)
goto out_dio;
}
@@ -5520,17 +5514,22 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
goto out_mutex;
}

- truncate_pagecache(inode, ioffset);
-
/* Wait for existing dio to complete */
ext4_inode_block_unlocked_dio(inode);
inode_dio_wait(inode);

+ /*
+ * Prevent page faults from reinstantiating pages we have released from
+ * page cache.
+ */
+ down_write(&EXT4_I(inode)->i_mmap_sem);
+ truncate_pagecache(inode, ioffset);
+
credits = ext4_writepage_trans_blocks(inode);
handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
- goto out_dio;
+ goto out_mmap;
}

down_write(&EXT4_I(inode)->i_data_sem);
@@ -5569,7 +5568,8 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)

out_stop:
ext4_journal_stop(handle);
-out_dio:
+out_mmap:
+ up_write(&EXT4_I(inode)->i_mmap_sem);
ext4_inode_resume_unlocked_dio(inode);
out_mutex:
mutex_unlock(&inode->i_mutex);
@@ -5656,17 +5656,22 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
goto out_mutex;
}

- truncate_pagecache(inode, ioffset);
-
/* Wait for existing dio to complete */
ext4_inode_block_unlocked_dio(inode);
inode_dio_wait(inode);

+ /*
+ * Prevent page faults from reinstantiating pages we have released from
+ * page cache.
+ */
+ down_write(&EXT4_I(inode)->i_mmap_sem);
+ truncate_pagecache(inode, ioffset);
+
credits = ext4_writepage_trans_blocks(inode);
handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
- goto out_dio;
+ goto out_mmap;
}

/* Expand file to avoid data loss if there is error while shifting */
@@ -5737,7 +5742,8 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)

out_stop:
ext4_journal_stop(handle);
-out_dio:
+out_mmap:
+ up_write(&EXT4_I(inode)->i_mmap_sem);
ext4_inode_resume_unlocked_dio(inode);
out_mutex:
mutex_unlock(&inode->i_mutex);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 113837e7ba98..0d24ebcd7c9e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -209,15 +209,18 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
int result;
handle_t *handle = NULL;
- struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+ struct inode *inode = file_inode(vma->vm_file);
+ struct super_block *sb = inode->i_sb;
bool write = vmf->flags & FAULT_FLAG_WRITE;

if (write) {
sb_start_pagefault(sb);
file_update_time(vma->vm_file);
+ down_read(&EXT4_I(inode)->i_mmap_sem);
handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
EXT4_DATA_TRANS_BLOCKS(sb));
- }
+ } else
+ down_read(&EXT4_I(inode)->i_mmap_sem);

if (IS_ERR(handle))
result = VM_FAULT_SIGBUS;
@@ -228,8 +231,10 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
if (write) {
if (!IS_ERR(handle))
ext4_journal_stop(handle);
+ up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(sb);
- }
+ } else
+ up_read(&EXT4_I(inode)->i_mmap_sem);

return result;
}
@@ -246,10 +251,12 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
if (write) {
sb_start_pagefault(sb);
file_update_time(vma->vm_file);
+ down_read(&EXT4_I(inode)->i_mmap_sem);
handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
ext4_chunk_trans_blocks(inode,
PMD_SIZE / PAGE_SIZE));
- }
+ } else
+ down_read(&EXT4_I(inode)->i_mmap_sem);

if (IS_ERR(handle))
result = VM_FAULT_SIGBUS;
@@ -260,30 +267,71 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
if (write) {
if (!IS_ERR(handle))
ext4_journal_stop(handle);
+ up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(sb);
- }
+ } else
+ up_read(&EXT4_I(inode)->i_mmap_sem);

return result;
}

static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
- return dax_mkwrite(vma, vmf, ext4_get_block_dax,
- ext4_end_io_unwritten);
+ int err;
+ struct inode *inode = file_inode(vma->vm_file);
+
+ sb_start_pagefault(inode->i_sb);
+ file_update_time(vma->vm_file);
+ down_read(&EXT4_I(inode)->i_mmap_sem);
+ err = __dax_mkwrite(vma, vmf, ext4_get_block_dax,
+ ext4_end_io_unwritten);
+ up_read(&EXT4_I(inode)->i_mmap_sem);
+ sb_end_pagefault(inode->i_sb);
+
+ return err;
+}
+
+/*
+ * Handle write fault for VM_MIXEDMAP mappings. Similarly to ext4_dax_mkwrite()
+ * handler we check for races against truncate. Note that since we cycle through
+ * i_mmap_sem, we are sure that also any hole punching that began before we
+ * were called is finished by now and so if it included part of the file we
+ * are working on, our pte will get unmapped and the check for pte_same() in
+ * wp_pfn_shared() fails. Thus fault gets retried and things work out as
+ * desired.
+ */
+static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
+ struct vm_fault *vmf)
+{
+ struct inode *inode = file_inode(vma->vm_file);
+ struct super_block *sb = inode->i_sb;
+ int ret = VM_FAULT_NOPAGE;
+ loff_t size;
+
+ sb_start_pagefault(sb);
+ file_update_time(vma->vm_file);
+ down_read(&EXT4_I(inode)->i_mmap_sem);
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (vmf->pgoff >= size)
+ ret = VM_FAULT_SIGBUS;
+ up_read(&EXT4_I(inode)->i_mmap_sem);
+ sb_end_pagefault(sb);
+
+ return ret;
}

static const struct vm_operations_struct ext4_dax_vm_ops = {
.fault = ext4_dax_fault,
.pmd_fault = ext4_dax_pmd_fault,
.page_mkwrite = ext4_dax_mkwrite,
- .pfn_mkwrite = dax_pfn_mkwrite,
+ .pfn_mkwrite = ext4_dax_pfn_mkwrite,
};
#else
#define ext4_dax_vm_ops ext4_file_vm_ops
#endif

static const struct vm_operations_struct ext4_file_vm_ops = {
- .fault = filemap_fault,
+ .fault = ext4_filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = ext4_page_mkwrite,
};
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 612fbcf76b5c..36ad45906d26 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3581,6 +3581,15 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)

}

+ /* Wait all existing dio workers, newcomers will block on i_mutex */
+ ext4_inode_block_unlocked_dio(inode);
+ inode_dio_wait(inode);
+
+ /*
+ * Prevent page faults from reinstantiating pages we have released from
+ * page cache.
+ */
+ down_write(&EXT4_I(inode)->i_mmap_sem);
first_block_offset = round_up(offset, sb->s_blocksize);
last_block_offset = round_down((offset + length), sb->s_blocksize) - 1;

@@ -3589,10 +3598,6 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
truncate_pagecache_range(inode, first_block_offset,
last_block_offset);

- /* Wait all existing dio workers, newcomers will block on i_mutex */
- ext4_inode_block_unlocked_dio(inode);
- inode_dio_wait(inode);
-
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
credits = ext4_writepage_trans_blocks(inode);
else
@@ -3638,16 +3643,12 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
if (IS_SYNC(inode))
ext4_handle_sync(handle);

- /* Now release the pages again to reduce race window */
- if (last_block_offset > first_block_offset)
- truncate_pagecache_range(inode, first_block_offset,
- last_block_offset);
-
inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
ext4_mark_inode_dirty(handle, inode);
out_stop:
ext4_journal_stop(handle);
out_dio:
+ up_write(&EXT4_I(inode)->i_mmap_sem);
ext4_inode_resume_unlocked_dio(inode);
out_mutex:
mutex_unlock(&inode->i_mutex);
@@ -4784,6 +4785,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
} else
ext4_wait_for_tail_page_commit(inode);
}
+ down_write(&EXT4_I(inode)->i_mmap_sem);
/*
* Truncate pagecache after we've waited for commit
* in data=journal mode to make pages freeable.
@@ -4791,6 +4793,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
truncate_pagecache(inode, inode->i_size);
if (shrink)
ext4_truncate(inode);
+ up_write(&EXT4_I(inode)->i_mmap_sem);
}

if (!rc) {
@@ -5239,6 +5242,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)

sb_start_pagefault(inode->i_sb);
file_update_time(vma->vm_file);
+
+ down_read(&EXT4_I(inode)->i_mmap_sem);
/* Delalloc case is easy... */
if (test_opt(inode->i_sb, DELALLOC) &&
!ext4_should_journal_data(inode) &&
@@ -5308,6 +5313,19 @@ retry_alloc:
out_ret:
ret = block_page_mkwrite_return(ret);
out:
+ up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(inode->i_sb);
return ret;
}
+
+int ext4_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct inode *inode = file_inode(vma->vm_file);
+ int err;
+
+ down_read(&EXT4_I(inode)->i_mmap_sem);
+ err = filemap_fault(vma, vmf);
+ up_read(&EXT4_I(inode)->i_mmap_sem);
+
+ return err;
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a63c7b0a10cf..61fad9e33bb9 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -955,6 +955,7 @@ static void init_once(void *foo)
INIT_LIST_HEAD(&ei->i_orphan);
init_rwsem(&ei->xattr_sem);
init_rwsem(&ei->i_data_sem);
+ init_rwsem(&ei->i_mmap_sem);
inode_init_once(&ei->vfs_inode);
}

diff --git a/fs/ext4/truncate.h b/fs/ext4/truncate.h
index 011ba6670d99..c70d06a383e2 100644
--- a/fs/ext4/truncate.h
+++ b/fs/ext4/truncate.h
@@ -10,8 +10,10 @@
*/
static inline void ext4_truncate_failed_write(struct inode *inode)
{
+ down_write(&EXT4_I(inode)->i_mmap_sem);
truncate_inode_pages(inode->i_mapping, inode->i_size);
ext4_truncate(inode);
+ up_write(&EXT4_I(inode)->i_mmap_sem);
}

/*
--
2.1.4


2015-11-04 18:51:03

by Ross Zwisler

Subject: Re: [PATCH 0/9 v3] ext4: Punch hole and DAX fixes

On Wed, Nov 04, 2015 at 05:18:31PM +0100, Jan Kara wrote:
> Hello,
>
> Another version of my ext4 fixes. I've fixed up all the failures Ted reported
> except for ext4/001 failures which are false positive (will send fixes for that
> test shortly) and generic/269 in nodelalloc mode which I just wasn't able to
> reproduce.
>
> Note that testing with 1 KB blocksize on ramdisk is broken since brd has
> buggy discard implementation. It took me quite some time to figure this out.
> Fix is submitted but bear this in mind just in case.
>
> Changes since v2:
> * Fixed collaps range to truncate pagecache properly with blocksize < pagesize
> * Fixed assertion in ext4_get_blocks_overwrite
>
> Patch set description
>
> This series fixes a long standing problem of racing punch hole and page fault
> resulting in possible filesystem corruption or stale data exposure. We fix the
> problem by using a new inode-private rw_semaphore i_mmap_sem to synchronize
> page faults with truncate and punch hole operations.
>
> When having this exclusion, the only remaining problem with DAX implementation
> are races between two page faults zeroing out same block concurrently (where
> the data written after the first fault finishes are possibly overwritten by
> the second fault still doing zeroing).
>
> Patch 1 introduces i_mmap_sem lock in ext4 inode and uses it to properly
> serialize extent manipulation operations and page faults.
>
> Patch 2 is mostly a preparatory cleanup patch which also avoids double lock /
> unlock in unlocked DIO protections (currently harmless but nasty surprise).
>
> Patches 3-4 fix further races of extent manipulation functions (such as zero
> range, collapse range, insert range) with buffered IO, page writeback
>
> Patch 5 documents locking order of ext4 filesystem locks.
>
> Patch 6 removes locking abuse of i_data_sem from the get_blocks() path when
> dioread_nolock is enabled since it is not needed anymore.
>
> Patches 7-9 implement allocation of pre-zeroed blocks in ext4_map_blocks()
> callback and use such blocks for allocations from DAX page faults.
>
> The patches survived xfstests run both in dax and non-dax mode.
>
> Honza

This passed all my testing as well.

Tested-by: Ross Zwisler <[email protected]>

2015-11-06 17:58:08

by Boylston, Brian

Subject: RE: [PATCH 0/9 v3] ext4: Punch hole and DAX fixes

Hi,

I've written a test tool (included below) that exercises page faults on
hole-y portions of an mmapped file. The file is created, sized using
various methods, mmapped, and then two threads race to write a marker to
different offsets within each mapped page. Once the threads have
finished marking each page, the pages are checked for the presence of
the markers.

With vanilla 4.2 and 4.3 kernels, this test easily exposes corruption on
pmem-backed, DAX-mounted xfs and ext4 file systems.

With 4.3 and this ext4 patch set, the data corruption is still seen:

$ ./holetest -f /pmem1/brian/holetest 1000
holetest r207

INFO: zero-filled test...
INFO: sz = 3e800000, npages = 256000
INFO: vastart = 00007f2ad0bd0000
INFO: thread 0 is 7f2ad0bcf700
INFO: thread 1 is 7f2ad03ce700
INFO: 0 error(s) detected

INFO: posix_fallocate test...
INFO: sz = 3e800000, npages = 256000
INFO: vastart = 00007f2ad0bd0000
INFO: thread 0 is 7f2ad03ce700
INFO: thread 1 is 7f2ad0bcf700
INFO: 0 error(s) detected

INFO: fallocate test...
INFO: sz = 3e800000, npages = 256000
INFO: vastart = 00007f2ad0bd0000
INFO: thread 0 is 7f2ad0bcf700
INFO: thread 1 is 7f2ad03ce700
INFO: 0 error(s) detected

INFO: ftruncate test...
INFO: sz = 3e800000, npages = 256000
INFO: vastart = 00007f2ad0bd0000
INFO: thread 0 is 7f2ad03ce700
INFO: thread 1 is 7f2ad0bcf700
ERROR: thread 0, offset 01001c00, 00000000 != 7f2ad03ce700
ERROR: thread 0, offset 01801c00, 00000000 != 7f2ad03ce700
ERROR: thread 0, offset 02001c00, 00000000 != 7f2ad03ce700
ERROR: thread 0, offset 02807c00, 00000000 != 7f2ad03ce700
ERROR: thread 0, offset 0281dc00, 00000000 != 7f2ad03ce700
ERROR: thread 0, offset 03001c00, 00000000 != 7f2ad03ce700
ERROR: thread 0, offset 03023c00, 00000000 != 7f2ad03ce700
ERROR: thread 0, offset 03801c00, 00000000 != 7f2ad03ce700
ERROR: thread 0, offset 03804c00, 00000000 != 7f2ad03ce700
ERROR: thread 0, offset 04001c00, 00000000 != 7f2ad03ce700
ERROR: thread 0, offset 04801c00, 00000000 != 7f2ad03ce700
ERROR: thread 0, offset 05001c00, 00000000 != 7f2ad03ce700
ERROR: thread 1, offset 0e001400, 00000000 != 7f2ad0bcf700
ERROR: thread 1, offset 16001400, 00000000 != 7f2ad0bcf700
ERROR: thread 1, offset 1b001400, 00000000 != 7f2ad0bcf700
ERROR: thread 1, offset 2a802400, 00000000 != 7f2ad0bcf700
ERROR: thread 1, offset 31005400, 00000000 != 7f2ad0bcf700
ERROR: thread 0, offset 3e6b3c00, 00000000 != 7f2ad03ce700
INFO: 18 error(s) detected
$


Thanks,
Brian



/*
* holetest -- test simultaneous page faults on hole-backed pages
* Copyright (C) 2015 Hewlett Packard Enterprise Development LP
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
* as published by the Free Software Foundation; either version 2
* of the License, or (at your option) any later version.
*
* This program is distributed in the hope that it will be useful,
* but WITHOUT ANY WARRANTY; without even the implied warranty of
* MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
* GNU General Public License for more details.
*
* You should have received a copy of the GNU General Public License
* along with this program; if not, write to the Free Software Foundation,
* Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301, USA.
*/


/*
* holetest
*
* gcc -Wall -pthread -o holetest holetest.c
*
* This test tool exercises page faults on hole-y portions of an mmapped
* file. The file is created, sized using various methods, mmapped, and
* then two threads race to write a marker to different offsets within
* each mapped page. Once the threads have finished marking each page,
* the pages are checked for the presence of the markers.
*
* The file is sized four different ways: explicitly zero-filled by the
* test, posix_fallocate(), fallocate(), and ftruncate(). The explicit
* zero-fill does not really test simultaneous page faults on hole-backed
* pages, but rather serves as control of sorts.
*
* Usage:
*
* holetest [-f] FILENAME FILESIZEinMB
*
* Where:
*
* FILENAME is the name of a non-existent test file to create
*
* FILESIZEinMB is the desired size of the test file in MiB
*
* If the test is successful, FILENAME will be unlinked. By default,
* if the test detects an error in the page markers, then the test exits
* immediately and FILENAME is left. If -f is given, then the test
* continues after a marker error and FILENAME is unlinked, but will
* still exit with a non-0 status.
*/


/* for fallocate(2) */
#define _GNU_SOURCE

#include <stdlib.h>
#include <stdio.h>
#include <errno.h>
#include <inttypes.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <unistd.h>
#include <pthread.h>
#include <string.h>


#ifndef HOLETEST_REVISION
#define HOLETEST_REVISION "0"
#endif


#define PGSZ (4096)


void*
pt_page_marker(
void* args
)
{
intptr_t* a = args;
char* va = (char*)(a[0]);
int npages = (int)(a[1]);
int pgoff = (int)(a[2]);
uint64_t tid = (uint64_t)(pthread_self());

va += pgoff;

/* mark pages */
for (; npages > 0; va += PGSZ, npages--) {
*(uint64_t*)(va) = tid;
}

return NULL;

} /* pt_page_marker() */


int
test_this(
int fd,
int sz
)
{
int npages;
char* vastart;
char* va;
intptr_t targs[6];
pthread_t t[2];
uint64_t tid[2];
int errcnt;

npages = sz / PGSZ;
printf("INFO: sz = %08x, npages = %d\n", sz, npages);

/* mmap it */
vastart = mmap(NULL, sz, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (MAP_FAILED == vastart) {
perror("mmap()");
exit(20);
}
printf("INFO: vastart = %016lx\n", (uintptr_t)vastart);

/* prepare the thread args
*
* thread 1:
*/
targs[0] = (intptr_t)vastart;
targs[1] = (intptr_t)npages;
targs[2] = (intptr_t)(3072);
/* thread 2: */
targs[3] = (intptr_t)vastart;
targs[4] = (intptr_t)npages;
targs[5] = (intptr_t)(1024);

/* start two threads */
if (0 != pthread_create(&(t[0]), NULL, pt_page_marker, &(targs[0]))) {
perror("pthread_create(1)");
exit(21);
}
if (0 != pthread_create(&(t[1]), NULL, pt_page_marker, &(targs[3]))) {
perror("pthread_create(2)");
exit(22);
}
tid[0] = (uint64_t)t[0];
tid[1] = (uint64_t)t[1];
printf("INFO: thread 0 is %08lx\n", t[0]);
printf("INFO: thread 1 is %08lx\n", t[1]);

/* wait for them to finish */
(void)pthread_join(t[0], NULL);
(void)pthread_join(t[1], NULL);

/* check markers on each page */
errcnt = 0;
for (va = vastart; npages > 0; va += PGSZ, npages--) {
if (*(uint64_t*)(va + 3072) != tid[0]) {
printf("ERROR: thread 0, "
"offset %08lx, %08lx != %08lx\n",
(va + 3072 - vastart),
*(uint64_t*)(va + 3072), tid[0]);
errcnt += 1;
}
if (*(uint64_t*)(va + 1024) != tid[1]) {
printf("ERROR: thread 1, "
"offset %08lx, %08lx != %08lx\n",
(va + 1024 - vastart),
*(uint64_t*)(va + 1024), tid[1]);
errcnt += 1;
}
}

printf("INFO: %d error(s) detected\n", errcnt);

(void)munmap(vastart, sz);

return errcnt;

} /* test_this() */


int
main(
int argc,
char* argv[]
)
{
int stoponerror = 1;
char* path;
int sz;
int fd;
int errcnt;
int toterr = 0;

printf("holetest r%s\n", HOLETEST_REVISION);

/* process command line */
argc--; argv++;
/* ignore errors? */
if ((3 == argc) && (0 == strcmp(argv[0], "-f"))) {
stoponerror = 0;
argc--;
argv++;
}
/* file name and size */
if ((2 != argc) || (argv[0][0] == '-')) {
fprintf(stderr, "ERROR: usage: holetest [-f] "
"FILENAME FILESIZEinMB\n");
exit(1);
}
path = argv[0];
sz = atoi(argv[1]) << 20;
if (1 > sz) {
fprintf(stderr, "ERROR: bad FILESIZEinMB\n");
exit(1);
}


/*
* we're going to run our test in several different ways:
*
* 1. explictly zero-filled
* 2. posix_fallocated
* 3. fallocated
* 4. ftruncated
*/


/*
* explicitly zero-filled
*/
printf("\nINFO: zero-filled test...\n");
/* create the file */
fd = open(path, O_RDWR | O_EXCL | O_CREAT, 0644);
if (0 > fd) {
perror(path);
exit(2);
}
/* truncate it to size */
if (0 != ftruncate(fd, sz)) {
perror("ftruncate()");
exit(3);
}
/* explicitly zero-fill */
{
char* va = mmap(NULL, sz, PROT_READ | PROT_WRITE,
MAP_SHARED, fd, 0);
if (MAP_FAILED == va) {
perror("mmap()");
exit(4);
}
memset(va, 0, sz);
munmap(va, sz);
}
/* test it */
errcnt = test_this(fd, sz);
toterr += errcnt;
close(fd);
if (stoponerror && (0 < errcnt))
exit(5);
/* cleanup */
if (0 != unlink(path)) {
perror("unlink()");
exit(6);
}


/*
* posix_fallocated
*/
printf("\nINFO: posix_fallocate test...\n");
/* create the file */
fd = open(path, O_RDWR | O_EXCL | O_CREAT, 0644);
if (0 > fd) {
perror(path);
exit(7);
}
/* fill it to size */
if (0 != posix_fallocate(fd, 0, sz)) {
perror("posix_fallocate()");
exit(8);
}
/* test it */
errcnt = test_this(fd, sz);
toterr += errcnt;
close(fd);
if (stoponerror && (0 < errcnt))
exit(9);
/* cleanup */
if (0 != unlink(path)) {
perror("unlink()");
exit(10);
}


/*
* fallocated
*/
printf("\nINFO: fallocate test...\n");
/* create the file */
fd = open(path, O_RDWR | O_EXCL | O_CREAT, 0644);
if (0 > fd) {
perror(path);
exit(11);
}
/* fill it to size */
if (0 != fallocate(fd, 0, 0, sz)) {
perror("fallocate()");
exit(12);
}
/* test it */
errcnt = test_this(fd, sz);
toterr += errcnt;
close(fd);
if (stoponerror && (0 < errcnt))
exit(13);
/* cleanup */
if (0 != unlink(path)) {
perror("unlink()");
exit(14);
}


/*
* ftruncated
*/
printf("\nINFO: ftruncate test...\n");
/* create the file */
fd = open(path, O_RDWR | O_EXCL | O_CREAT, 0644);
if (0 > fd) {
perror(path);
exit(15);
}
/* truncate it to size */
if (0 != ftruncate(fd, sz)) {
perror("ftruncate()");
exit(16);
}
/* test it */
errcnt = test_this(fd, sz);
toterr += errcnt;
close(fd);
if (stoponerror && (0 < errcnt))
exit(17);
/* cleanup */
if (0 != unlink(path)) {
perror("unlink()");
exit(18);
}


/* done */
if (0 < toterr)
exit(19);
else
return 0;

} /* main() */
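
For reference, the tool above should build on Linux with gcc or clang and
pthread support; something along these lines is enough (the exact flags are
just an illustration):

$ cc -Wall -O2 -pthread -o holetest holetest.c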




2015-11-06 22:17:30

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/9 v3] ext4: Punch hole and DAX fixes

On Fri, Nov 06, 2015 at 05:57:04PM +0000, Boylston, Brian wrote:
> Hi,
>
> I've written a test tool (included below) that exercises page faults on
> hole-y portions of an mmapped file. The file is created, sized using
> various methods, mmapped, and then two threads race to write a marker to
> different offsets within each mapped page. Once the threads have
> finished marking each page, the pages are checked for the presence of
> the markers.
>
> With vanilla 4.2 and 4.3 kernels, this test easily exposes corruption on
> pmem-backed, DAX-mounted xfs and ext4 file systems.

4.2/4.3 kernels do not have a fix for this race problem in them, so
it's no surprise that they fail the test. The fixes for this problem
in XFS are currently in the for-next branch, and will be merged into
4.4 during the merge window.

And FWIW, with XFS on a ramdisk on a 4.3+for-next branch kernel,
this test passes just fine:

$ ./dax-hole-test -f /mnt/scratch/holetest 1000
holetest r0

INFO: zero-filled test...
INFO: sz = 3e800000, npages = 256000
INFO: vastart = 00007fc5a90bd000
INFO: thread 0 is 7fc5a90bc700
INFO: thread 1 is 7fc5a88bb700
INFO: 0 error(s) detected

INFO: posix_fallocate test...
INFO: sz = 3e800000, npages = 256000
INFO: vastart = 00007fc5a90bd000
INFO: thread 0 is 7fc5a88bb700
INFO: thread 1 is 7fc5a90bc700
INFO: 0 error(s) detected

INFO: fallocate test...
INFO: sz = 3e800000, npages = 256000
INFO: vastart = 00007fc5a90bd000
INFO: thread 0 is 7fc5a90bc700
INFO: thread 1 is 7fc5a88bb700
INFO: 0 error(s) detected

INFO: ftruncate test...
INFO: sz = 3e800000, npages = 256000
INFO: vastart = 00007fc5a90bd000
INFO: thread 0 is 7fc5a88bb700
INFO: thread 1 is 7fc5a90bc700
INFO: 0 error(s) detected

That tends to indicate a bug in the ext4 patchset, not that the
method we've taken to solve the problem is fundamentally wrong.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-11-09 16:23:01

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/9 v3] ext4: Punch hole and DAX fixes

Hi,

On Fri 06-11-15 17:57:04, Boylston, Brian wrote:
> I've written a test tool (included below) that exercises page faults on
> hole-y portions of an mmapped file. The file is created, sized using
> various methods, mmapped, and then two threads race to write a marker to
> different offsets within each mapped page. Once the threads have
> finished marking each page, the pages are checked for the presence of
> the markers.
>
> With vanilla 4.2 and 4.3 kernels, this test easily exposes corruption on
> pmem-backed, DAX-mounted xfs and ext4 file systems.
>
> With 4.3 and this ext4 patch set, the data corruption is still seen:
>
> $ ./holetest -f /pmem1/brian/holetest 1000
> holetest r207

Thanks for the test. I'll try to reproduce locally and have a look why
my block zeroing patch didn't work as expected.

Honza
>
> INFO: zero-filled test...
> INFO: sz = 3e800000, npages = 256000
> INFO: vastart = 00007f2ad0bd0000
> INFO: thread 0 is 7f2ad0bcf700
> INFO: thread 1 is 7f2ad03ce700
> INFO: 0 error(s) detected
>
> INFO: posix_fallocate test...
> INFO: sz = 3e800000, npages = 256000
> INFO: vastart = 00007f2ad0bd0000
> INFO: thread 0 is 7f2ad03ce700
> INFO: thread 1 is 7f2ad0bcf700
> INFO: 0 error(s) detected
>
> INFO: fallocate test...
> INFO: sz = 3e800000, npages = 256000
> INFO: vastart = 00007f2ad0bd0000
> INFO: thread 0 is 7f2ad0bcf700
> INFO: thread 1 is 7f2ad03ce700
> INFO: 0 error(s) detected
>
> INFO: ftruncate test...
> INFO: sz = 3e800000, npages = 256000
> INFO: vastart = 00007f2ad0bd0000
> INFO: thread 0 is 7f2ad03ce700
> INFO: thread 1 is 7f2ad0bcf700
> ERROR: thread 0, offset 01001c00, 00000000 != 7f2ad03ce700
> ERROR: thread 0, offset 01801c00, 00000000 != 7f2ad03ce700
> ERROR: thread 0, offset 02001c00, 00000000 != 7f2ad03ce700
> ERROR: thread 0, offset 02807c00, 00000000 != 7f2ad03ce700
> ERROR: thread 0, offset 0281dc00, 00000000 != 7f2ad03ce700
> ERROR: thread 0, offset 03001c00, 00000000 != 7f2ad03ce700
> ERROR: thread 0, offset 03023c00, 00000000 != 7f2ad03ce700
> ERROR: thread 0, offset 03801c00, 00000000 != 7f2ad03ce700
> ERROR: thread 0, offset 03804c00, 00000000 != 7f2ad03ce700
> ERROR: thread 0, offset 04001c00, 00000000 != 7f2ad03ce700
> ERROR: thread 0, offset 04801c00, 00000000 != 7f2ad03ce700
> ERROR: thread 0, offset 05001c00, 00000000 != 7f2ad03ce700
> ERROR: thread 1, offset 0e001400, 00000000 != 7f2ad0bcf700
> ERROR: thread 1, offset 16001400, 00000000 != 7f2ad0bcf700
> ERROR: thread 1, offset 1b001400, 00000000 != 7f2ad0bcf700
> ERROR: thread 1, offset 2a802400, 00000000 != 7f2ad0bcf700
> ERROR: thread 1, offset 31005400, 00000000 != 7f2ad0bcf700
> ERROR: thread 0, offset 3e6b3c00, 00000000 != 7f2ad03ce700
> INFO: 18 error(s) detected
> $
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-11-09 16:51:54

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/9 v3] ext4: Punch hole and DAX fixes

On Mon 09-11-15 17:22:56, Jan Kara wrote:
> Hi,
>
> On Fri 06-11-15 17:57:04, Boylston, Brian wrote:
> > I've written a test tool (included below) that exercises page faults on
> > hole-y portions of an mmapped file. The file is created, sized using
> > various methods, mmapped, and then two threads race to write a marker to
> > different offsets within each mapped page. Once the threads have
> > finished marking each page, the pages are checked for the presence of
> > the markers.
> >
> > With vanilla 4.2 and 4.3 kernels, this test easily exposes corruption on
> > pmem-backed, DAX-mounted xfs and ext4 file systems.
> >
> > With 4.3 and this ext4 patch set, the data corruption is still seen:
> >
> > $ ./holetest -f /pmem1/brian/holetest 1000
> > holetest r207
>
> Thanks for the test. I'll try to reproduce locally and have a look why
> my block zeroing patch didn't work as expected.

Ah, OK, I see what's going on. So ext4 with my patches still returns
buffer_new buffer even though it is zeroed out and thus generic DAX code
still tries to zero out the buffer again which indeed causes the corruption
(will test everything tomorrow with that code disabled). Now I have
decided that block mapping function should return buffer_new buffer even
though it is zeroed out because e.g. if block zeroing was used for page
cache writes, we'd still need code in fs/buffer.c to do proper zeroing of
parts of the block that are not written. And that happens based on
buffer_new flag.
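
To illustrate the point (a simplified sketch of the fs/buffer.c logic, not a
verbatim excerpt; block_start/block_end stand for the buffer's byte range
within the page and from/to for the range being written):

if (buffer_new(bh) && !PageUptodate(page))
	/* zero the parts of this block that the write does not cover */
	zero_user_segments(page, block_start, from, to, block_end);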

The zeroing code in __dax_fault() needs to go away anyway so whether we
return buffer_new buffer is not really substantial but I'd like to get some
agreement and consistency among filesystems in with which flags zeroed
blocks are returned. Thoughts?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-11-10 00:00:34

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/9 v3] ext4: Punch hole and DAX fixes

On Mon, Nov 09, 2015 at 05:51:49PM +0100, Jan Kara wrote:
> On Mon 09-11-15 17:22:56, Jan Kara wrote:
> > Hi,
> >
> > On Fri 06-11-15 17:57:04, Boylston, Brian wrote:
> > > I've written a test tool (included below) that exercises page faults on
> > > hole-y portions of an mmapped file. The file is created, sized using
> > > various methods, mmapped, and then two threads race to write a marker to
> > > different offsets within each mapped page. Once the threads have
> > > finished marking each page, the pages are checked for the presence of
> > > the markers.
> > >
> > > With vanilla 4.2 and 4.3 kernels, this test easily exposes corruption on
> > > pmem-backed, DAX-mounted xfs and ext4 file systems.
> > >
> > > With 4.3 and this ext4 patch set, the data corruption is still seen:
> > >
> > > $ ./holetest -f /pmem1/brian/holetest 1000
> > > holetest r207
> >
> > Thanks for the test. I'll try to reproduce locally and have a look why
> > my block zeroing patch didn't work as expected.
>
> Ah, OK, I see what's going on. So ext4 with my patches still returns
> buffer_new buffer even though it is zeroed out and thus generic DAX code
> still tries to zero out the buffer again which indeed causes the corruption
> (will test everything tomorrow with that code disabled). Now I have
> decided that block mapping function should return buffer_new buffer even
> though it is zeroed out because e.g. if block zeroing was used for page
> cache writes, we'd still need code in fs/buffer.c to do proper zeroing of
> parts of the block that are not written. And that happens based on
> buffer_new flag.

XFS special cases this for DAX in __xfs_get_blocks():

if (IS_DAX(inode) && create) {
ASSERT(!ISUNWRITTEN(&imap));
/* zeroing is not needed at a higher layer */
new = 0;
}

And so will not set buffer_new() for the DAX case as we've
already directly zeroed the region the DAX code is about to write
into...

> The zeroing code in __dax_fault() needs to go away anyway so whether we
> return buffer_new buffer is not really substantial but I'd like to get some
> agreement and consistency among filesystems in with which flags zeroed
> blocks are returned. Thoughts?

There is no consistency to begin with, especially w.r.t. unwritten
extent behaviour as the upper layers don't all understand that
buffer_unwritten is a valid flag for getblock to return. Hence we
have hacks in XFS setting buffer_new() in strange cases to get the
upper level code to zero stuff that really needs zeroing...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-11-10 10:02:58

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/9 v3] ext4: Punch hole and DAX fixes

On Tue 10-11-15 11:00:27, Dave Chinner wrote:
> On Mon, Nov 09, 2015 at 05:51:49PM +0100, Jan Kara wrote:
> > On Mon 09-11-15 17:22:56, Jan Kara wrote:
> > > Hi,
> > >
> > > On Fri 06-11-15 17:57:04, Boylston, Brian wrote:
> > > > I've written a test tool (included below) that exercises page faults on
> > > > hole-y portions of an mmapped file. The file is created, sized using
> > > > various methods, mmapped, and then two threads race to write a marker to
> > > > different offsets within each mapped page. Once the threads have
> > > > finished marking each page, the pages are checked for the presence of
> > > > the markers.
> > > >
> > > > With vanilla 4.2 and 4.3 kernels, this test easily exposes corruption on
> > > > pmem-backed, DAX-mounted xfs and ext4 file systems.
> > > >
> > > > With 4.3 and this ext4 patch set, the data corruption is still seen:
> > > >
> > > > $ ./holetest -f /pmem1/brian/holetest 1000
> > > > holetest r207
> > >
> > > Thanks for the test. I'll try to reproduce locally and have a look why
> > > my block zeroing patch didn't work as expected.
> >
> > Ah, OK, I see what's going on. So ext4 with my patches still returns
> > buffer_new buffer even though it is zeroed out and thus generic DAX code
> > still tries to zero out the buffer again which indeed causes the corruption
> > (will test everything tomorrow with that code disabled). Now I have
> > decided that block mapping function should return buffer_new buffer even
> > though it is zeroed out because e.g. if block zeroing was used for page
> > cache writes, we'd still need code in fs/buffer.c to do proper zeroing of
> > parts of the block that are not written. And that happens based on
> > buffer_new flag.
>
> XFS special cases this for DAX in __xfs_get_blocks():
>
> if (IS_DAX(inode) && create) {
> ASSERT(!ISUNWRITTEN(&imap));
> /* zeroing is not needed at a higher layer */
> new = 0;
> }
>
> And so will not set buffer_new() for the DAX case as we've
> already directly zeroed the region the DAX code is about to write
> into...

OK, for now I did something similar in the ext4 mapping function for DAX
faults.

> > The zeroing code in __dax_fault() needs to go away anyway so whether we
> > return buffer_new buffer is not really substantial but I'd like to get some
> > agreement and consistency among filesystems in with which flags zeroed
> > blocks are returned. Thoughts?
>
> There is no consistency to begin with, especially w.r.t. unwritten
> extent behaviour as the upper layers don't all understand that
> buffer_unwritten is a valid flag for getblock to return. Hence we
> have hacks in XFS setting buffer_new() in strange cases to get the
> upper level code to zero stuff that really needs zeroing...

In ext4 we set buffer as new in two cases:

1) When it was freshly allocated (regardless whether into unwritten or
normal extent).
2) When it was converted from unwritten to written state.

This seems to do the job...
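
Schematically (an illustration of those two cases, not an excerpt from the
actual ext4 code; the two conditions are just placeholders):

	if (just_allocated_blocks || converted_from_unwritten)
		map->m_flags |= EXT4_MAP_NEW;	/* becomes buffer_new() in get_block */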

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-11-10 22:31:43

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 0/9 v3] ext4: Punch hole and DAX fixes

On Tue, Nov 10, 2015 at 11:02:53AM +0100, Jan Kara wrote:
> On Tue 10-11-15 11:00:27, Dave Chinner wrote:
> > On Mon, Nov 09, 2015 at 05:51:49PM +0100, Jan Kara wrote:
> > > The zeroing code in __dax_fault() needs to go away anyway so whether we
> > > return buffer_new buffer is not really substantial but I'd like to get some
> > > agreement and consistency among filesystems in with which flags zeroed
> > > blocks are returned. Thoughts?
> >
> > There is no consistency to begin with, especially w.r.t. unwritten
> > extent behaviour as the upper layers don't all understand that
> > buffer_unwritten is a valid flag for getblock to return. Hence we
> > have hacks in XFS setting buffer_new() in strange cases to get the
> > upper level code to zero stuff that really needs zeroing...
>
> In ext4 we set buffer as new in two cases:
>
> 1) When it was freshly allocated (regardless whether into unwritten or
> normal extent).

Which only works for freshly allocated unwritten extents, not for
writes into preallocated extents.

> 2) When it was converted from unwritten to written state.

Which, for direct IO, is no good when we do the conversion after IO
completion because we need to know at IO submission if we need to do
sub-block zeroing (i.e. dio_zero_block() will skip the zeroing if
buffer_new() is not set on unwritten blocks).

Like I said, there's no consistency of behaviour between
filesystems; we do what works for the specific filesystem
algorithms...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-11-11 15:25:46

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/9 v3] ext4: Punch hole and DAX fixes

On Wed 11-11-15 09:31:39, Dave Chinner wrote:
> On Tue, Nov 10, 2015 at 11:02:53AM +0100, Jan Kara wrote:
> > On Tue 10-11-15 11:00:27, Dave Chinner wrote:
> > > On Mon, Nov 09, 2015 at 05:51:49PM +0100, Jan Kara wrote:
> > > > The zeroing code in __dax_fault() needs to go away anyway so whether we
> > > > return buffer_new buffer is not really substantial but I'd like to get some
> > > > agreement and consistency among filesystems in with which flags zeroed
> > > > blocks are returned. Thoughts?
> > >
> > > There is no consistency to begin with, especially w.r.t. unwritten
> > > extent behaviour as the upper layers don't all understand that
> > > buffer_unwritten is a valid flag for getblock to return. Hence we
> > > have hacks in XFS setting buffer_new() in strange cases to get the
> > > upper level code to zero stuff that really needs zeroing...
> >
> > In ext4 we set buffer as new in two cases:
> >
> > 1) When it was freshly allocated (regardless whether into unwritten or
> > normal extent).
>
> Which only works for freshly allocated unwritten extents, not for
> writes into preallocated extents.
>
> > 2) When it was converted from unwritten to written state.
>
> Which, for direct IO, is no good when we do the conversion after IO
> completion because we need to know at IO submission if we need to do
> sub-block zeroing (i.e. dio_zero_block() will skip the zeroing if
> buffer_new() is not set on unwritten blocks).

OK, sorry, I parsed the code flow wrong: ext4 sets buffer_new for any
unwritten extent it returns (when create == 1 only but that doesn't really
make a difference). And then for any freshly allocated block. This seems
pretty consistent to me. There are two places in ext4 that set returned
buffer as new (those two cases mentioned above) and that seems to be
enough. Maybe XFS needs something different but at least from looking at
ext4, setting of buffer_new is not complex...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR