2015-11-10 19:51:16

by Jan Kara

[permalink] [raw]
Subject: [PATCH 0/9 v4] ext4: Punch hole and DAX fixes

Hello,

Another version of my ext4 fixes. Since previous version I have fixed DAX block
mapping to really avoid races for parallel page faults so that the test program
by Brian passes. Note that you'll see ext4/001 failures - xfstests updates were
submitted. Also note that testing with 1 KB blocksize on ramdisk is broken
since brd has buggy discard implementation - Jens has a fix queued.

Change since v3:
* Fixed ext4_dax_mmap_get_block() to not return buffer_new buffer and thus
avoid racy zeroing in generic dax code
* Fixed ext4_map_blocks() to zeroout blocks before inserting entry into
extent status tree to avoid racy lookups of blocks.

Changes since v2:
* Fixed collaps range to truncate pagecache properly with blocksize < pagesize
* Fixed assertion in ext4_get_blocks_overwrite

Patch set description

This series fixes a long standing problem of racing punch hole and page fault
resulting in possible filesystem corruption or stale data exposure. We fix the
problem by using a new inode-private rw_semaphore i_mmap_sem to synchronize
page faults with truncate and punch hole operations.

When having this exclusion, the only remaining problem with DAX implementation
are races between two page faults zeroing out same block concurrently (where
the data written after the first fault finishes are possibly overwritten by
the second fault still doing zeroing).

Patch 1 introduces i_mmap_sem lock in ext4 inode and uses it to properly
serialize extent manipulation operations and page faults.

Patch 2 is mostly a preparatory cleanup patch which also avoids double lock /
unlock in unlocked DIO protections (currently harmless but nasty surprise).

Patches 3-4 fix further races of extent manipulation functions (such as zero
range, collapse range, insert range) with buffered IO, page writeback

Patch 5 documents locking order of ext4 filesystem locks.

Patch 6 removes locking abuse of i_data_sem from the get_blocks() path when
dioread_nolock is enabled since it is not needed anymore.

Patches 7-9 implement allocation of pre-zeroed blocks in ext4_map_blocks()
callback and use such blocks for allocations from DAX page faults.

The patches survived xfstests run both in dax and non-dax mode.

Honza


2015-11-10 19:51:16

by Jan Kara

[permalink] [raw]
Subject: [PATCH 3/9] ext4: Fix races between buffered IO and collapse / insert range

Current code implementing FALLOC_FL_COLLAPSE_RANGE and
FALLOC_FL_INSERT_RANGE is prone to races with buffered writes and page
faults. If buffered write or write via mmap manages to squeeze between
filemap_write_and_wait_range() and truncate_pagecache() in the fallocate
implementations, the written data is simply discarded by
truncate_pagecache() although it should have been shifted.

Fix the problem by moving filemap_write_and_wait_range() call inside
i_mutex and i_mmap_sem. That way we are protected against races with
both buffered writes and page faults.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/extents.c | 61 +++++++++++++++++++++++++++++--------------------------
1 file changed, 32 insertions(+), 29 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 65b5ada2833f..3fafdb0a17ed 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -5487,21 +5487,7 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
return ret;
}

- /*
- * Need to round down offset to be aligned with page size boundary
- * for page size > block size.
- */
- ioffset = round_down(offset, PAGE_SIZE);
-
- /* Write out all dirty pages */
- ret = filemap_write_and_wait_range(inode->i_mapping, ioffset,
- LLONG_MAX);
- if (ret)
- return ret;
-
- /* Take mutex lock */
mutex_lock(&inode->i_mutex);
-
/*
* There is no need to overlap collapse range with EOF, in which case
* it is effectively a truncate operation
@@ -5522,10 +5508,31 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
inode_dio_wait(inode);

/*
- * Prevent page faults from reinstantiating pages we have released from
+ * Prevent page faults from reinstantiating we have released from
* page cache.
*/
down_write(&EXT4_I(inode)->i_mmap_sem);
+ /*
+ * Need to round down offset to be aligned with page size boundary
+ * for page size > block size.
+ */
+ ioffset = round_down(offset, PAGE_SIZE);
+ /*
+ * Write tail of the last page before removed range since it will get
+ * removed from the page cache below.
+ */
+ ret = filemap_write_and_wait_range(inode->i_mapping, ioffset, offset);
+ if (ret)
+ goto out_mmap;
+ /*
+ * Write data that will be shifted to preserve them when discarding
+ * page cache below. We are also protected from pages becoming dirty
+ * by i_mmap_sem.
+ */
+ ret = filemap_write_and_wait_range(inode->i_mapping, offset + len,
+ LLONG_MAX);
+ if (ret)
+ goto out_mmap;
truncate_pagecache(inode, ioffset);

credits = ext4_writepage_trans_blocks(inode);
@@ -5626,21 +5633,7 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
return ret;
}

- /*
- * Need to round down to align start offset to page size boundary
- * for page size > block size.
- */
- ioffset = round_down(offset, PAGE_SIZE);
-
- /* Write out all dirty pages */
- ret = filemap_write_and_wait_range(inode->i_mapping, ioffset,
- LLONG_MAX);
- if (ret)
- return ret;
-
- /* Take mutex lock */
mutex_lock(&inode->i_mutex);

2015-11-10 19:51:16

by Jan Kara

[permalink] [raw]
Subject: [PATCH 2/9] ext4: Move unlocked dio protection from ext4_alloc_file_blocks()

Currently ext4_alloc_file_blocks() was handling protection against
unlocked DIO. However we now need to sometimes call it under i_mmap_sem
and sometimes not and DIO protection ranks above it (although strictly
speaking this cannot currently create any deadlocks). Also
ext4_zero_range() was actually getting & releasing unlocked DIO
protection twice in some cases. Luckily it didn't introduce any real bug
but it was a land mine waiting to be stepped on. So move DIO protection
out from ext4_alloc_file_blocks() into the two callsites.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/extents.c | 21 ++++++++++-----------
1 file changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 5be9ca5a8a7a..65b5ada2833f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4685,10 +4685,6 @@ static int ext4_alloc_file_blocks(struct file *file, ext4_lblk_t offset,
if (len <= EXT_UNWRITTEN_MAX_LEN)
flags |= EXT4_GET_BLOCKS_NO_NORMALIZE;

- /* Wait all existing dio workers, newcomers will block on i_mutex */
- ext4_inode_block_unlocked_dio(inode);
- inode_dio_wait(inode);
-
/*
* credits to insert 1 extent into extent tree
*/
@@ -4752,8 +4748,6 @@ retry:
goto retry;
}

- ext4_inode_resume_unlocked_dio(inode);
-
return ret > 0 ? ret2 : ret;
}

@@ -4827,6 +4821,10 @@ static long ext4_zero_range(struct file *file, loff_t offset,
if (mode & FALLOC_FL_KEEP_SIZE)
flags |= EXT4_GET_BLOCKS_KEEP_SIZE;

+ /* Wait all existing dio workers, newcomers will block on i_mutex */
+ ext4_inode_block_unlocked_dio(inode);
+ inode_dio_wait(inode);
+
/* Preallocate the range including the unaligned edges */
if (partial_begin || partial_end) {
ret = ext4_alloc_file_blocks(file,
@@ -4835,7 +4833,7 @@ static long ext4_zero_range(struct file *file, loff_t offset,
round_down(offset, 1 << blkbits)) >> blkbits,
new_size, flags, mode);
if (ret)
- goto out_mutex;
+ goto out_dio;

}

@@ -4844,10 +4842,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
EXT4_EX_NOCACHE);

- /* Wait all existing dio workers, newcomers will block on i_mutex */
- ext4_inode_block_unlocked_dio(inode);
- inode_dio_wait(inode);

2015-11-10 19:51:17

by Jan Kara

[permalink] [raw]
Subject: [PATCH 7/9] ext4: Provide ext4_issue_zeroout()

Create new function ext4_issue_zeroout() to zeroout contiguous (both
logically and physically) part of inode data. We will need to issue
zeroout when extent structure is not readily available and this function
will allow us to do it without making up fake extent structures.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/crypto.c | 6 ++----
fs/ext4/ext4.h | 5 ++++-
fs/ext4/extents.c | 12 ++----------
fs/ext4/inode.c | 15 +++++++++++++++
4 files changed, 23 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/crypto.c b/fs/ext4/crypto.c
index af06830bfc00..c8021208a7eb 100644
--- a/fs/ext4/crypto.c
+++ b/fs/ext4/crypto.c
@@ -384,14 +384,12 @@ int ext4_decrypt(struct page *page)
EXT4_DECRYPT, page->index, page, page);
}

-int ext4_encrypted_zeroout(struct inode *inode, struct ext4_extent *ex)
+int ext4_encrypted_zeroout(struct inode *inode, ext4_lblk_t lblk,
+ ext4_fsblk_t pblk, ext4_lblk_t len)
{
struct ext4_crypto_ctx *ctx;
struct page *ciphertext_page = NULL;
struct bio *bio;
- ext4_lblk_t lblk = ex->ee_block;
- ext4_fsblk_t pblk = ext4_ext_pblock(ex);
- unsigned int len = ext4_ext_get_actual_len(ex);
int ret, err = 0;

#if 0
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 1d09345c9793..353089c35485 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2204,7 +2204,8 @@ void ext4_restore_control_page(struct page *data_page);
struct page *ext4_encrypt(struct inode *inode,
struct page *plaintext_page);
int ext4_decrypt(struct page *page);
-int ext4_encrypted_zeroout(struct inode *inode, struct ext4_extent *ex);
+int ext4_encrypted_zeroout(struct inode *inode, ext4_lblk_t lblk,
+ ext4_fsblk_t pblk, ext4_lblk_t len);

#ifdef CONFIG_EXT4_FS_ENCRYPTION
int ext4_init_crypto(void);
@@ -2458,6 +2459,8 @@ extern int ext4_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
extern qsize_t *ext4_get_reserved_space(struct inode *inode);
extern void ext4_da_update_reserve_space(struct inode *inode,
int used, int quota_claim);
+extern int ext4_issue_zeroout(struct inode *inode, ext4_lblk_t lblk,
+ ext4_fsblk_t pblk, ext4_lblk_t len);

/* indirect.c */
extern int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index f76db53eddc4..c0c7650c35ba 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3119,19 +3119,11 @@ static int ext4_ext_zeroout(struct inode *inode, struct ext4_extent *ex)
{
ext4_fsblk_t ee_pblock;
unsigned int ee_len;
- int ret;

ee_len = ext4_ext_get_actual_len(ex);
ee_pblock = ext4_ext_pblock(ex);
-
- if (ext4_encrypted_inode(inode))
- return ext4_encrypted_zeroout(inode, ex);
-
- ret = sb_issue_zeroout(inode->i_sb, ee_pblock, ee_len, GFP_NOFS);
- if (ret > 0)
- ret = 0;
-
- return ret;
+ return ext4_issue_zeroout(inode, le32_to_cpu(ex->ee_block), ee_pblock,
+ ee_len);
}

/*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 528d1a0aebff..39cdad9e2338 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -383,6 +383,21 @@ static int __check_block_validity(struct inode *inode, const char *func,
return 0;
}

+int ext4_issue_zeroout(struct inode *inode, ext4_lblk_t lblk, ext4_fsblk_t pblk,
+ ext4_lblk_t len)
+{
+ int ret;
+
+ if (ext4_encrypted_inode(inode))
+ return ext4_encrypted_zeroout(inode, lblk, pblk, len);
+
+ ret = sb_issue_zeroout(inode->i_sb, pblk, len, GFP_NOFS);
+ if (ret > 0)
+ ret = 0;
+
+ return ret;
+}
+
#define check_block_validity(inode, map) \
__check_block_validity((inode), __func__, __LINE__, (map))

--
2.1.4


2015-11-10 19:51:17

by Jan Kara

[permalink] [raw]
Subject: [PATCH 5/9] ext4: Document lock ordering

We have enough locks that it's probably worth documenting the lock
ordering rules we have in ext4.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/super.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 6d307dac9622..05a426e0fa1f 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -80,6 +80,36 @@ static void ext4_destroy_lazyinit_thread(void);
static void ext4_unregister_li_request(struct super_block *sb);
static void ext4_clear_request_list(void);

+/*
+ * Lock ordering
+ *
+ * Note the difference between i_mmap_sem (EXT4_I(inode)->i_mmap_sem) and
+ * i_mmap_rwsem (inode->i_mmap_rwsem)!
+ *
+ * page fault path:
+ * mmap_sem -> sb_start_pagefault -> i_mmap_sem (r) -> transaction start ->
+ * page lock -> i_data_sem (rw)
+ *
+ * buffered write path:
+ * sb_start_write -> i_mutex -> mmap_sem
+ * sb_start_write -> i_mutex -> transaction start -> page lock ->
+ * i_data_sem (rw)
+ *
+ * truncate:
+ * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (w) -> i_mmap_sem (w) ->
+ * i_mmap_rwsem (w) -> page lock
+ * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (w) -> i_mmap_sem (w) ->
+ * transaction start -> i_data_sem (rw)
+ *
+ * direct IO:
+ * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (r) -> mmap_sem
+ * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (r) ->
+ * transaction start -> i_data_sem (rw)
+ *
+ * writepages:
+ * transaction start -> page lock(s) -> i_data_sem (rw)
+ */
+
#if !defined(CONFIG_EXT2_FS) && !defined(CONFIG_EXT2_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT2)
static struct file_system_type ext2_fs_type = {
.owner = THIS_MODULE,
--
2.1.4


2015-11-10 19:51:17

by Jan Kara

[permalink] [raw]
Subject: [PATCH 6/9] ext4: Get rid of EXT4_GET_BLOCKS_NO_LOCK flag

When dioread_nolock mode is enabled, we grab i_data_sem in
ext4_ext_direct_IO() and therefore we need to instruct _ext4_get_block()
not to grab i_data_sem again using EXT4_GET_BLOCKS_NO_LOCK. However
holding i_data_sem over overwrite direct IO isn't needed these days. We
have exclusion against truncate / hole punching because we increase
i_dio_count under i_mutex in ext4_ext_direct_IO() so once
ext4_file_write_iter() verifies blocks are allocated & written, they are
guaranteed to stay so during the whole direct IO even after we drop
i_mutex.

So we can just remove this locking abuse and the no longer necessary
EXT4_GET_BLOCKS_NO_LOCK flag.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 4 +---
fs/ext4/inode.c | 43 ++++++++++++++++++++-----------------------
include/trace/events/ext4.h | 3 +--
3 files changed, 22 insertions(+), 28 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d049fc76efcd..1d09345c9793 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -554,10 +554,8 @@ enum {
#define EXT4_GET_BLOCKS_NO_NORMALIZE 0x0040
/* Request will not result in inode size update (user for fallocate) */
#define EXT4_GET_BLOCKS_KEEP_SIZE 0x0080
- /* Do not take i_data_sem locking in ext4_map_blocks */
-#define EXT4_GET_BLOCKS_NO_LOCK 0x0100
/* Convert written extents to unwritten */
-#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN 0x0200
+#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN 0x0100

/*
* The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index fcd4ab065e4a..528d1a0aebff 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -403,8 +403,7 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
* out taking i_data_sem. So at the time the unwritten extent
* could be converted.
*/
- if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
- down_read(&EXT4_I(inode)->i_data_sem);
+ down_read(&EXT4_I(inode)->i_data_sem);
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
retval = ext4_ext_map_blocks(handle, inode, map, flags &
EXT4_GET_BLOCKS_KEEP_SIZE);
@@ -412,8 +411,7 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
retval = ext4_ind_map_blocks(handle, inode, map, flags &
EXT4_GET_BLOCKS_KEEP_SIZE);
}
- if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
- up_read((&EXT4_I(inode)->i_data_sem));
+ up_read((&EXT4_I(inode)->i_data_sem));

/*
* We don't check m_len because extent will be collpased in status
@@ -509,8 +507,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
* Try to see if we can get the block without requesting a new
* file system block.
*/
- if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
- down_read(&EXT4_I(inode)->i_data_sem);
+ down_read(&EXT4_I(inode)->i_data_sem);
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
retval = ext4_ext_map_blocks(handle, inode, map, flags &
EXT4_GET_BLOCKS_KEEP_SIZE);
@@ -541,8 +538,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
if (ret < 0)
retval = ret;
}
- if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
- up_read((&EXT4_I(inode)->i_data_sem));
+ up_read((&EXT4_I(inode)->i_data_sem));

found:
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
@@ -674,7 +670,7 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
map.m_lblk = iblock;
map.m_len = bh->b_size >> inode->i_blkbits;

- if (flags && !(flags & EXT4_GET_BLOCKS_NO_LOCK) && !handle) {
+ if (flags && !handle) {
/* Direct IO write... */
if (map.m_len > DIO_MAX_BLOCKS)
map.m_len = DIO_MAX_BLOCKS;
@@ -879,9 +875,6 @@ int do_journal_get_write_access(handle_t *handle,
return ret;
}

-static int ext4_get_block_write_nolock(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create);
-
#ifdef CONFIG_EXT4_FS_ENCRYPTION
static int ext4_block_write_begin(struct page *page, loff_t pos, unsigned len,
get_block_t *get_block)
@@ -3054,13 +3047,21 @@ int ext4_get_block_write(struct inode *inode, sector_t iblock,
EXT4_GET_BLOCKS_IO_CREATE_EXT);
}

-static int ext4_get_block_write_nolock(struct inode *inode, sector_t iblock,
+static int ext4_get_block_overwrite(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
- ext4_debug("ext4_get_block_write_nolock: inode %lu, create flag %d\n",
+ int ret;
+
+ ext4_debug("ext4_get_block_overwrite: inode %lu, create flag %d\n",
inode->i_ino, create);
- return _ext4_get_block(inode, iblock, bh_result,
- EXT4_GET_BLOCKS_NO_LOCK);
+ ret = _ext4_get_block(inode, iblock, bh_result, 0);
+ /*
+ * Blocks should have been preallocated! ext4_file_write_iter() checks
+ * that.
+ */
+ WARN_ON_ONCE(!buffer_mapped(bh_result));
+
+ return ret;
}

int ext4_get_block_dax(struct inode *inode, sector_t iblock,
@@ -3143,10 +3144,8 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
/* If we do a overwrite dio, i_mutex locking can be released */
overwrite = *((int *)iocb->private);

- if (overwrite) {
- down_read(&EXT4_I(inode)->i_data_sem);
+ if (overwrite)
mutex_unlock(&inode->i_mutex);
- }

/*
* We could direct write to holes and fallocate.
@@ -3189,7 +3188,7 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
}

if (overwrite) {
- get_block_func = ext4_get_block_write_nolock;
+ get_block_func = ext4_get_block_overwrite;
} else {
get_block_func = ext4_get_block_write;
dio_flags = DIO_LOCKING;
@@ -3245,10 +3244,8 @@ retake_lock:
if (iov_iter_rw(iter) == WRITE)
inode_dio_end(inode);
/* take i_mutex locking again if we do a ovewrite dio */
- if (overwrite) {
- up_read(&EXT4_I(inode)->i_data_sem);
+ if (overwrite)
mutex_lock(&inode->i_mutex);
- }

return ret;
}
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 594b4b29a224..5f2ace56efc5 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -42,8 +42,7 @@ struct extent_status;
{ EXT4_GET_BLOCKS_CONVERT, "CONVERT" }, \
{ EXT4_GET_BLOCKS_METADATA_NOFAIL, "METADATA_NOFAIL" }, \
{ EXT4_GET_BLOCKS_NO_NORMALIZE, "NO_NORMALIZE" }, \
- { EXT4_GET_BLOCKS_KEEP_SIZE, "KEEP_SIZE" }, \
- { EXT4_GET_BLOCKS_NO_LOCK, "NO_LOCK" })
+ { EXT4_GET_BLOCKS_KEEP_SIZE, "KEEP_SIZE" })

#define show_mflags(flags) __print_flags(flags, "", \
{ EXT4_MAP_NEW, "N" }, \
--
2.1.4


2015-11-10 19:51:18

by Jan Kara

[permalink] [raw]
Subject: [PATCH 8/9] ext4: Implement allocation of pre-zeroed blocks

DAX page fault path needs to get blocks that are pre-zeroed to avoid
races when two concurrent page faults happen in the same block of a
file. Implement support for this in ext4_map_blocks().

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 4 ++++
fs/ext4/extents.c | 8 ++++++++
fs/ext4/inode.c | 25 ++++++++++++++++++++++---
include/trace/events/ext4.h | 3 ++-
4 files changed, 36 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 353089c35485..e667a7947b48 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -556,6 +556,10 @@ enum {
#define EXT4_GET_BLOCKS_KEEP_SIZE 0x0080
/* Convert written extents to unwritten */
#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN 0x0100
+ /* Write zeros to newly created written extents */
+#define EXT4_GET_BLOCKS_ZERO 0x0200
+#define EXT4_GET_BLOCKS_CREATE_ZERO (EXT4_GET_BLOCKS_CREATE |\
+ EXT4_GET_BLOCKS_ZERO)

/*
* The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index c0c7650c35ba..f7df02a15a18 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4044,6 +4044,14 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
}
/* IO end_io complete, convert the filled extent to written */
if (flags & EXT4_GET_BLOCKS_CONVERT) {
+ if (flags & EXT4_GET_BLOCKS_ZERO) {
+ if (allocated > map->m_len)
+ allocated = map->m_len;
+ err = ext4_issue_zeroout(inode, map->m_lblk, newblock,
+ allocated);
+ if (err < 0)
+ goto out2;
+ }
ret = ext4_convert_unwritten_extents_endio(handle, inode, map,
ppath);
if (ret >= 0) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 39cdad9e2338..32e643a82890 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -637,13 +637,29 @@ found:
}

/*
+ * We have to zeroout blocks before inserting them into extent
+ * status tree. Otherwise someone could look them up there and
+ * use them before they are really zeroed.
+ */
+ if (flags & EXT4_GET_BLOCKS_ZERO &&
+ map->m_flags & EXT4_MAP_MAPPED &&
+ map->m_flags & EXT4_MAP_NEW) {
+ ret = ext4_issue_zeroout(inode, map->m_lblk,
+ map->m_pblk, map->m_len);
+ if (ret) {
+ retval = ret;
+ goto out_sem;
+ }
+ }
+
+ /*
* If the extent has been zeroed out, we don't need to update
* extent status tree.
*/
if ((flags & EXT4_GET_BLOCKS_PRE_IO) &&
ext4_es_lookup_extent(inode, map->m_lblk, &es)) {
if (ext4_es_is_written(&es))
- goto has_zeroout;
+ goto out_sem;
}
status = map->m_flags & EXT4_MAP_UNWRITTEN ?
EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
@@ -654,11 +670,13 @@ found:
status |= EXTENT_STATUS_DELAYED;
ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
map->m_pblk, status);
- if (ret < 0)
+ if (ret < 0) {
retval = ret;
+ goto out_sem;
+ }
}

-has_zeroout:
+out_sem:
up_write((&EXT4_I(inode)->i_data_sem));
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
ret = check_block_validity(inode, map);
@@ -3083,6 +3101,7 @@ int ext4_get_block_dax(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
int flags = EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_UNWRIT_EXT;
+
if (create)
flags |= EXT4_GET_BLOCKS_CREATE;
ext4_debug("ext4_get_block_dax: inode %lu, create flag %d\n",
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 5f2ace56efc5..4e4b2fa78609 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -42,7 +42,8 @@ struct extent_status;
{ EXT4_GET_BLOCKS_CONVERT, "CONVERT" }, \
{ EXT4_GET_BLOCKS_METADATA_NOFAIL, "METADATA_NOFAIL" }, \
{ EXT4_GET_BLOCKS_NO_NORMALIZE, "NO_NORMALIZE" }, \
- { EXT4_GET_BLOCKS_KEEP_SIZE, "KEEP_SIZE" })
+ { EXT4_GET_BLOCKS_KEEP_SIZE, "KEEP_SIZE" }, \
+ { EXT4_GET_BLOCKS_ZERO, "ZERO" })

#define show_mflags(flags) __print_flags(flags, "", \
{ EXT4_MAP_NEW, "N" }, \
--
2.1.4


2015-11-10 19:51:18

by Jan Kara

[permalink] [raw]
Subject: [PATCH 9/9] ext4: Use pre-zeroed blocks for DAX page faults

Make DAX fault path use pre-zeroed blocks to avoid races with extent
conversion and zeroing when two page faults to the same block happen.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 4 +--
fs/ext4/file.c | 20 ++------------
fs/ext4/inode.c | 86 +++++++++++++++++++++++++++++++++++++++++++++------------
3 files changed, 74 insertions(+), 36 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e667a7947b48..1a6592e764f9 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2415,8 +2415,8 @@ struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
int ext4_get_block_write(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
-int ext4_get_block_dax(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create);
+int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create);
int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 0d24ebcd7c9e..749b222e6498 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -193,18 +193,6 @@ out:
}

#ifdef CONFIG_FS_DAX
-static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
-{
- struct inode *inode = bh->b_assoc_map->host;
- /* XXX: breaks on 32-bit > 16TB. Is that even supported? */
- loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
- int err;
- if (!uptodate)
- return;
- WARN_ON(!buffer_unwritten(bh));
- err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);
-}
-
static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
int result;
@@ -225,8 +213,7 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
if (IS_ERR(handle))
result = VM_FAULT_SIGBUS;
else
- result = __dax_fault(vma, vmf, ext4_get_block_dax,
- ext4_end_io_unwritten);
+ result = __dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);

if (write) {
if (!IS_ERR(handle))
@@ -262,7 +249,7 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
result = VM_FAULT_SIGBUS;
else
result = __dax_pmd_fault(vma, addr, pmd, flags,
- ext4_get_block_dax, ext4_end_io_unwritten);
+ ext4_dax_mmap_get_block, NULL);

if (write) {
if (!IS_ERR(handle))
@@ -283,8 +270,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
sb_start_pagefault(inode->i_sb);
file_update_time(vma->vm_file);
down_read(&EXT4_I(inode)->i_mmap_sem);
- err = __dax_mkwrite(vma, vmf, ext4_get_block_dax,
- ext4_end_io_unwritten);
+ err = __dax_mkwrite(vma, vmf, ext4_dax_mmap_get_block, NULL);
up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(inode->i_sb);

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 32e643a82890..7bf08b1b2c0a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -723,16 +723,6 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,

map_bh(bh, inode->i_sb, map.m_pblk);
bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
- if (IS_DAX(inode) && buffer_unwritten(bh)) {
- /*
- * dgc: I suspect unwritten conversion on ext4+DAX is
- * fundamentally broken here when there are concurrent
- * read/write in progress on this inode.
- */
- WARN_ON_ONCE(io_end);
- bh->b_assoc_map = inode->i_mapping;
- bh->b_private = (void *)(unsigned long)iblock;
- }
if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN)
set_buffer_defer_completion(bh);
bh->b_size = inode->i_sb->s_blocksize * map.m_len;
@@ -3097,17 +3087,79 @@ static int ext4_get_block_overwrite(struct inode *inode, sector_t iblock,
return ret;
}

-int ext4_get_block_dax(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create)
+#ifdef CONFIG_FS_DAX
+int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create)
{
- int flags = EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_UNWRIT_EXT;
+ int ret, err;
+ int credits;
+ struct ext4_map_blocks map;
+ handle_t *handle = NULL;
+ int flags = 0;

- if (create)
- flags |= EXT4_GET_BLOCKS_CREATE;
- ext4_debug("ext4_get_block_dax: inode %lu, create flag %d\n",
+ ext4_debug("ext4_dax_mmap_get_block: inode %lu, create flag %d\n",
inode->i_ino, create);
- return _ext4_get_block(inode, iblock, bh_result, flags);
+ map.m_lblk = iblock;
+ map.m_len = bh_result->b_size >> inode->i_blkbits;
+ credits = ext4_chunk_trans_blocks(inode, map.m_len);
+ if (create) {
+ flags |= EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_CREATE_ZERO;
+ handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ return ret;
+ }
+ }
+
+ ret = ext4_map_blocks(handle, inode, &map, flags);
+ if (create) {
+ err = ext4_journal_stop(handle);
+ if (ret >= 0 && err < 0)
+ ret = err;
+ }
+ if (ret <= 0)
+ goto out;
+ if (map.m_flags & EXT4_MAP_UNWRITTEN) {
+ int err2;
+
+ /*
+ * We are protected by i_mmap_sem so we know block cannot go
+ * away from under us even though we dropped i_data_sem.
+ * Convert extent to written and write zeros there.
+ *
+ * Note: We may get here even when create == 0.
+ */
+ handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+
+ err = ext4_map_blocks(handle, inode, &map,
+ EXT4_GET_BLOCKS_CONVERT | EXT4_GET_BLOCKS_CREATE_ZERO);
+ if (err < 0)
+ ret = err;
+ err2 = ext4_journal_stop(handle);
+ if (err2 < 0 && ret > 0)
+ ret = err2;
+ }
+out:
+ WARN_ON_ONCE(ret == 0 && create);
+ if (ret > 0) {
+ map_bh(bh_result, inode->i_sb, map.m_pblk);
+ bh_result->b_state = (bh_result->b_state & ~EXT4_MAP_FLAGS) |
+ map.m_flags;
+ /*
+ * At least for now we have to clear BH_New so that DAX code
+ * doesn't attempt to zero blocks again in a racy way.
+ */
+ bh_result->b_state &= ~(1 << BH_New);
+ bh_result->b_size = map.m_len << inode->i_blkbits;
+ ret = 0;
+ }
+ return ret;
}
+#endif

static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
ssize_t size, void *private)
--
2.1.4


2015-11-10 19:51:16

by Jan Kara

[permalink] [raw]
Subject: [PATCH 4/9] ext4: Fix races of writeback with punch hole and zero range

When doing delayed allocation, update of on-disk inode size is postponed
until IO submission time. However hole punch or zero range fallocate
calls can end up discarding the tail page cache page and thus on-disk
inode size would never be properly updated.

Make sure the on-disk inode size is updated before truncating page
cache.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 3 +++
fs/ext4/extents.c | 5 +++++
fs/ext4/inode.c | 35 ++++++++++++++++++++++++++++++++++-
3 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 0e42b9d69f18..d049fc76efcd 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2821,6 +2821,9 @@ static inline int ext4_update_inode_size(struct inode *inode, loff_t newsize)
return changed;
}

+int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
+ loff_t len);
+
struct ext4_group_info {
unsigned long bb_state;
struct rb_root bb_free_root;
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 3fafdb0a17ed..f76db53eddc4 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4847,6 +4847,11 @@ static long ext4_zero_range(struct file *file, loff_t offset,
* released from page cache.
*/
down_write(&EXT4_I(inode)->i_mmap_sem);
+ ret = ext4_update_disksize_before_punch(inode, offset, len);
+ if (ret) {
+ up_write(&EXT4_I(inode)->i_mmap_sem);
+ goto out_dio;
+ }
/* Now release the pages and zero block aligned part of pages */
truncate_pagecache_range(inode, start, end - 1);
inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7292d3b0af9c..fcd4ab065e4a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3559,6 +3559,35 @@ int ext4_can_truncate(struct inode *inode)
}

/*
+ * We have to make sure i_disksize gets properly updated before we truncate
+ * page cache due to hole punching or zero range. Otherwise i_disksize update
+ * can get lost as it may have been postponed to submission of writeback but
+ * that will never happen after we truncate page cache.
+ */
+int ext4_update_disksize_before_punch(struct inode *inode, loff_t offset,
+ loff_t len)
+{
+ handle_t *handle;
+ loff_t size = i_size_read(inode);
+
+ WARN_ON(!mutex_is_locked(&inode->i_mutex));
+ if (offset > size || offset + len < size)
+ return 0;
+
+ if (EXT4_I(inode)->i_disksize >= size)
+ return 0;
+
+ handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+ ext4_update_i_disksize(inode, size);
+ ext4_mark_inode_dirty(handle, inode);
+ ext4_journal_stop(handle);
+
+ return 0;
+}
+
+/*
* ext4_punch_hole: punches a hole in a file by releaseing the blocks
* associated with the given offset and length
*
@@ -3636,9 +3665,13 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
last_block_offset = round_down((offset + length), sb->s_blocksize) - 1;

/* Now release the pages and zero block aligned part of pages*/
- if (last_block_offset > first_block_offset)
+ if (last_block_offset > first_block_offset) {
+ ret = ext4_update_disksize_before_punch(inode, offset, length);
+ if (ret)
+ goto out_dio;
truncate_pagecache_range(inode, first_block_offset,
last_block_offset);
+ }

if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
credits = ext4_writepage_trans_blocks(inode);
--
2.1.4


2015-11-10 19:51:16

by Jan Kara

[permalink] [raw]
Subject: [PATCH 1/9] ext4: Fix races between page faults and hole punching

Currently, page faults and hole punching are completely unsynchronized.
This can result in page fault faulting in a page into a range that we
are punching after truncate_pagecache_range() has been called and thus
we can end up with a page mapped to disk blocks that will be shortly
freed. Filesystem corruption will shortly follow. Note that the same
race is avoided for truncate by checking page fault offset against
i_size but there isn't similar mechanism available for punching holes.

Fix the problem by creating new rw semaphore i_mmap_sem in inode and
grab it for writing over truncate, hole punching, and other functions
removing blocks from extent tree and for read over page faults. We
cannot easily use i_data_sem for this since that ranks below transaction
start and we need something ranking above it so that it can be held over
the whole truncate / hole punching operation. Also remove various
workarounds we had in the code to reduce race window when page fault
could have created pages with stale mapping information.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 10 +++++++++
fs/ext4/extents.c | 54 ++++++++++++++++++++++++--------------------
fs/ext4/file.c | 66 ++++++++++++++++++++++++++++++++++++++++++++++--------
fs/ext4/inode.c | 36 +++++++++++++++++++++--------
fs/ext4/super.c | 1 +
fs/ext4/truncate.h | 2 ++
6 files changed, 127 insertions(+), 42 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 750063f7a50c..0e42b9d69f18 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -873,6 +873,15 @@ struct ext4_inode_info {
* by other means, so we have i_data_sem.
*/
struct rw_semaphore i_data_sem;
+ /*
+ * i_mmap_sem is for serializing page faults with truncate / punch hole
+ * operations. We have to make sure that new page cannot be faulted in
+ * a section of the inode that is being punched. We cannot easily use
+ * i_data_sem for this since we need protection for the whole punch
+ * operation and i_data_sem ranks below transaction start so we have
+ * to occasionally drop it.
+ */
+ struct rw_semaphore i_mmap_sem;
struct inode vfs_inode;
struct jbd2_inode *jinode;

@@ -2447,6 +2456,7 @@ extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
loff_t lstart, loff_t lend);
extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
+extern int ext4_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
extern qsize_t *ext4_get_reserved_space(struct inode *inode);
extern void ext4_da_update_reserve_space(struct inode *inode,
int used, int quota_claim);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 551353b1b17a..5be9ca5a8a7a 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4770,7 +4770,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
int partial_begin, partial_end;
loff_t start, end;
ext4_lblk_t lblk;
- struct address_space *mapping = inode->i_mapping;
unsigned int blkbits = inode->i_blkbits;

trace_ext4_zero_range(inode, offset, len, mode);
@@ -4786,17 +4785,6 @@ static long ext4_zero_range(struct file *file, loff_t offset,
}

/*
- * Write out all dirty pages to avoid race conditions
- * Then release them.
- */
- if (mapping->nrpages && mapping_tagged(mapping, PAGECACHE_TAG_DIRTY)) {
- ret = filemap_write_and_wait_range(mapping, offset,
- offset + len - 1);
- if (ret)
- return ret;
- }
-
- /*
* Round up offset. This is not fallocate, we neet to zero out
* blocks, so convert interior block aligned part of the range to
* unwritten and possibly manually zero out unaligned parts of the
@@ -4856,16 +4844,22 @@ static long ext4_zero_range(struct file *file, loff_t offset,
flags |= (EXT4_GET_BLOCKS_CONVERT_UNWRITTEN |
EXT4_EX_NOCACHE);

- /* Now release the pages and zero block aligned part of pages*/
- truncate_pagecache_range(inode, start, end - 1);
- inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
-
/* Wait all existing dio workers, newcomers will block on i_mutex */
ext4_inode_block_unlocked_dio(inode);
inode_dio_wait(inode);

+ /*
+ * Prevent page faults from reinstantiating pages we have
+ * released from page cache.
+ */
+ down_write(&EXT4_I(inode)->i_mmap_sem);
+ /* Now release the pages and zero block aligned part of pages */
+ truncate_pagecache_range(inode, start, end - 1);
+ inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
+
ret = ext4_alloc_file_blocks(file, lblk, max_blocks, new_size,
flags, mode);
+ up_write(&EXT4_I(inode)->i_mmap_sem);
if (ret)
goto out_dio;
}
@@ -5524,17 +5518,22 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)
goto out_mutex;
}

- truncate_pagecache(inode, ioffset);
-
/* Wait for existing dio to complete */
ext4_inode_block_unlocked_dio(inode);
inode_dio_wait(inode);

+ /*
+ * Prevent page faults from reinstantiating pages we have released from
+ * page cache.
+ */
+ down_write(&EXT4_I(inode)->i_mmap_sem);
+ truncate_pagecache(inode, ioffset);
+
credits = ext4_writepage_trans_blocks(inode);
handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
- goto out_dio;
+ goto out_mmap;
}

down_write(&EXT4_I(inode)->i_data_sem);
@@ -5573,7 +5572,8 @@ int ext4_collapse_range(struct inode *inode, loff_t offset, loff_t len)

out_stop:
ext4_journal_stop(handle);
-out_dio:
+out_mmap:
+ up_write(&EXT4_I(inode)->i_mmap_sem);
ext4_inode_resume_unlocked_dio(inode);
out_mutex:
mutex_unlock(&inode->i_mutex);
@@ -5660,17 +5660,22 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)
goto out_mutex;
}

- truncate_pagecache(inode, ioffset);
-
/* Wait for existing dio to complete */
ext4_inode_block_unlocked_dio(inode);
inode_dio_wait(inode);

+ /*
+ * Prevent page faults from reinstantiating pages we have released from
+ * page cache.
+ */
+ down_write(&EXT4_I(inode)->i_mmap_sem);
+ truncate_pagecache(inode, ioffset);
+
credits = ext4_writepage_trans_blocks(inode);
handle = ext4_journal_start(inode, EXT4_HT_TRUNCATE, credits);
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
- goto out_dio;
+ goto out_mmap;
}

/* Expand file to avoid data loss if there is error while shifting */
@@ -5741,7 +5746,8 @@ int ext4_insert_range(struct inode *inode, loff_t offset, loff_t len)

out_stop:
ext4_journal_stop(handle);
-out_dio:
+out_mmap:
+ up_write(&EXT4_I(inode)->i_mmap_sem);
ext4_inode_resume_unlocked_dio(inode);
out_mutex:
mutex_unlock(&inode->i_mutex);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 113837e7ba98..0d24ebcd7c9e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -209,15 +209,18 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
int result;
handle_t *handle = NULL;
- struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+ struct inode *inode = file_inode(vma->vm_file);
+ struct super_block *sb = inode->i_sb;
bool write = vmf->flags & FAULT_FLAG_WRITE;

if (write) {
sb_start_pagefault(sb);
file_update_time(vma->vm_file);
+ down_read(&EXT4_I(inode)->i_mmap_sem);
handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
EXT4_DATA_TRANS_BLOCKS(sb));
- }
+ } else
+ down_read(&EXT4_I(inode)->i_mmap_sem);

if (IS_ERR(handle))
result = VM_FAULT_SIGBUS;
@@ -228,8 +231,10 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
if (write) {
if (!IS_ERR(handle))
ext4_journal_stop(handle);
+ up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(sb);
- }
+ } else
+ up_read(&EXT4_I(inode)->i_mmap_sem);

return result;
}
@@ -246,10 +251,12 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
if (write) {
sb_start_pagefault(sb);
file_update_time(vma->vm_file);
+ down_read(&EXT4_I(inode)->i_mmap_sem);
handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
ext4_chunk_trans_blocks(inode,
PMD_SIZE / PAGE_SIZE));
- }
+ } else
+ down_read(&EXT4_I(inode)->i_mmap_sem);

if (IS_ERR(handle))
result = VM_FAULT_SIGBUS;
@@ -260,30 +267,71 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
if (write) {
if (!IS_ERR(handle))
ext4_journal_stop(handle);
+ up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(sb);
- }
+ } else
+ up_read(&EXT4_I(inode)->i_mmap_sem);

return result;
}

static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
- return dax_mkwrite(vma, vmf, ext4_get_block_dax,
- ext4_end_io_unwritten);
+ int err;
+ struct inode *inode = file_inode(vma->vm_file);
+
+ sb_start_pagefault(inode->i_sb);
+ file_update_time(vma->vm_file);
+ down_read(&EXT4_I(inode)->i_mmap_sem);
+ err = __dax_mkwrite(vma, vmf, ext4_get_block_dax,
+ ext4_end_io_unwritten);
+ up_read(&EXT4_I(inode)->i_mmap_sem);
+ sb_end_pagefault(inode->i_sb);
+
+ return err;
+}
+
+/*
+ * Handle write fault for VM_MIXEDMAP mappings. Similarly to ext4_dax_mkwrite()
+ * handler we check for races agaist truncate. Note that since we cycle through
+ * i_mmap_sem, we are sure that also any hole punching that began before we
+ * were called is finished by now and so if it included part of the file we
+ * are working on, our pte will get unmapped and the check for pte_same() in
+ * wp_pfn_shared() fails. Thus fault gets retried and things work out as
+ * desired.
+ */
+static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
+ struct vm_fault *vmf)
+{
+ struct inode *inode = file_inode(vma->vm_file);
+ struct super_block *sb = inode->i_sb;
+ int ret = VM_FAULT_NOPAGE;
+ loff_t size;
+
+ sb_start_pagefault(sb);
+ file_update_time(vma->vm_file);
+ down_read(&EXT4_I(inode)->i_mmap_sem);
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (vmf->pgoff >= size)
+ ret = VM_FAULT_SIGBUS;
+ up_read(&EXT4_I(inode)->i_mmap_sem);
+ sb_end_pagefault(sb);
+
+ return ret;
}

static const struct vm_operations_struct ext4_dax_vm_ops = {
.fault = ext4_dax_fault,
.pmd_fault = ext4_dax_pmd_fault,
.page_mkwrite = ext4_dax_mkwrite,
- .pfn_mkwrite = dax_pfn_mkwrite,
+ .pfn_mkwrite = ext4_dax_pfn_mkwrite,
};
#else
#define ext4_dax_vm_ops ext4_file_vm_ops
#endif

static const struct vm_operations_struct ext4_file_vm_ops = {
- .fault = filemap_fault,
+ .fault = ext4_filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = ext4_page_mkwrite,
};
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7d1aad1d9313..7292d3b0af9c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3623,6 +3623,15 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)

}

+ /* Wait all existing dio workers, newcomers will block on i_mutex */
+ ext4_inode_block_unlocked_dio(inode);
+ inode_dio_wait(inode);
+
+ /*
+ * Prevent page faults from reinstantiating pages we have released from
+ * page cache.
+ */
+ down_write(&EXT4_I(inode)->i_mmap_sem);
first_block_offset = round_up(offset, sb->s_blocksize);
last_block_offset = round_down((offset + length), sb->s_blocksize) - 1;

@@ -3631,10 +3640,6 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
truncate_pagecache_range(inode, first_block_offset,
last_block_offset);

- /* Wait all existing dio workers, newcomers will block on i_mutex */
- ext4_inode_block_unlocked_dio(inode);
- inode_dio_wait(inode);
-
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
credits = ext4_writepage_trans_blocks(inode);
else
@@ -3680,16 +3685,12 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
if (IS_SYNC(inode))
ext4_handle_sync(handle);

- /* Now release the pages again to reduce race window */
- if (last_block_offset > first_block_offset)
- truncate_pagecache_range(inode, first_block_offset,
- last_block_offset);
-
inode->i_mtime = inode->i_ctime = ext4_current_time(inode);
ext4_mark_inode_dirty(handle, inode);
out_stop:
ext4_journal_stop(handle);
out_dio:
+ up_write(&EXT4_I(inode)->i_mmap_sem);
ext4_inode_resume_unlocked_dio(inode);
out_mutex:
mutex_unlock(&inode->i_mutex);
@@ -4823,6 +4824,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
} else
ext4_wait_for_tail_page_commit(inode);
}
+ down_write(&EXT4_I(inode)->i_mmap_sem);
/*
* Truncate pagecache after we've waited for commit
* in data=journal mode to make pages freeable.
@@ -4830,6 +4832,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
truncate_pagecache(inode, inode->i_size);
if (shrink)
ext4_truncate(inode);
+ up_write(&EXT4_I(inode)->i_mmap_sem);
}

if (!rc) {
@@ -5278,6 +5281,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)

sb_start_pagefault(inode->i_sb);
file_update_time(vma->vm_file);
+
+ down_read(&EXT4_I(inode)->i_mmap_sem);
/* Delalloc case is easy... */
if (test_opt(inode->i_sb, DELALLOC) &&
!ext4_should_journal_data(inode) &&
@@ -5347,6 +5352,19 @@ retry_alloc:
out_ret:
ret = block_page_mkwrite_return(ret);
out:
+ up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(inode->i_sb);
return ret;
}
+
+int ext4_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct inode *inode = file_inode(vma->vm_file);
+ int err;
+
+ down_read(&EXT4_I(inode)->i_mmap_sem);
+ err = filemap_fault(vma, vmf);
+ up_read(&EXT4_I(inode)->i_mmap_sem);
+
+ return err;
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 753f4e68b820..6d307dac9622 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -958,6 +958,7 @@ static void init_once(void *foo)
INIT_LIST_HEAD(&ei->i_orphan);
init_rwsem(&ei->xattr_sem);
init_rwsem(&ei->i_data_sem);
+ init_rwsem(&ei->i_mmap_sem);
inode_init_once(&ei->vfs_inode);
}

diff --git a/fs/ext4/truncate.h b/fs/ext4/truncate.h
index 011ba6670d99..c70d06a383e2 100644
--- a/fs/ext4/truncate.h
+++ b/fs/ext4/truncate.h
@@ -10,8 +10,10 @@
*/
static inline void ext4_truncate_failed_write(struct inode *inode)
{
+ down_write(&EXT4_I(inode)->i_mmap_sem);
truncate_inode_pages(inode->i_mapping, inode->i_size);
ext4_truncate(inode);
+ up_write(&EXT4_I(inode)->i_mmap_sem);
}

/*
--
2.1.4


2015-11-17 17:47:18

by Boylston, Brian

[permalink] [raw]
Subject: RE: [PATCH 0/9 v4] ext4: Punch hole and DAX fixes

On Tue, Nov 10, 2015 at 2:51 PM, Jan Kara wrote:
> Another version of my ext4 fixes. Since previous version I have fixed DAX block
> mapping to really avoid races for parallel page faults so that the test program
> by Brian passes. Note that you'll see ext4/001 failures - xfstests updates were

Thanks for the updated patches!

> submitted. Also note that testing with 1 KB blocksize on ramdisk is broken
> since brd has buggy discard implementation - Jens has a fix queued.
>
> Change since v3:
> * Fixed ext4_dax_mmap_get_block() to not return buffer_new buffer and thus
> avoid racy zeroing in generic dax code
> * Fixed ext4_map_blocks() to zeroout blocks before inserting entry into
> extent status tree to avoid racy lookups of blocks.
>
> Changes since v2:
> * Fixed collaps range to truncate pagecache properly with blocksize < pagesize
> * Fixed assertion in ext4_get_blocks_overwrite
>
> Patch set description
>
> This series fixes a long standing problem of racing punch hole and page fault
> resulting in possible filesystem corruption or stale data exposure. We fix the
> problem by using a new inode-private rw_semaphore i_mmap_sem to synchronize
> page faults with truncate and punch hole operations.
>
> When having this exclusion, the only remaining problem with DAX implementation
> are races between two page faults zeroing out same block concurrently (where
> the data written after the first fault finishes are possibly overwritten by
> the second fault still doing zeroing).

Is this still a problem for this version of the patch set?

Thanks!
Brian

> Patch 1 introduces i_mmap_sem lock in ext4 inode and uses it to properly
> serialize extent manipulation operations and page faults.
>
> Patch 2 is mostly a preparatory cleanup patch which also avoids double lock /
> unlock in unlocked DIO protections (currently harmless but nasty surprise).
>
> Patches 3-4 fix further races of extent manipulation functions (such as zero
> range, collapse range, insert range) with buffered IO, page writeback
>
> Patch 5 documents locking order of ext4 filesystem locks.
>
> Patch 6 removes locking abuse of i_data_sem from the get_blocks() path when
> dioread_nolock is enabled since it is not needed anymore.
>
> Patches 7-9 implement allocation of pre-zeroed blocks in ext4_map_blocks()
> callback and use such blocks for allocations from DAX page faults.
>
> The patches survived xfstests run both in dax and non-dax mode.


Subject: RE: [PATCH 3/9] ext4: Fix races between buffered IO and collapse / insert range


> -----Original Message-----
> From: [email protected] [mailto:linux-ext4-
> [email protected]] On Behalf Of Jan Kara
> Sent: Tuesday, November 10, 2015 1:51 PM
> Subject: [PATCH 3/9] ext4: Fix races between buffered IO and collapse /
> insert range
>
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> /*
> - * Prevent page faults from reinstantiating pages we have released from
> + * Prevent page faults from reinstantiating we have released from
> * page cache.

I don't think you meant to delete that word...



2015-11-18 15:13:05

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/9 v4] ext4: Punch hole and DAX fixes

On Tue 17-11-15 17:41:55, Boylston, Brian wrote:
> On Tue, Nov 10, 2015 at 2:51 PM, Jan Kara wrote:
> > submitted. Also note that testing with 1 KB blocksize on ramdisk is broken
> > since brd has buggy discard implementation - Jens has a fix queued.
> >
> > Change since v3:
> > * Fixed ext4_dax_mmap_get_block() to not return buffer_new buffer and thus
> > avoid racy zeroing in generic dax code
> > * Fixed ext4_map_blocks() to zeroout blocks before inserting entry into
> > extent status tree to avoid racy lookups of blocks.
> >
> > Changes since v2:
> > * Fixed collaps range to truncate pagecache properly with blocksize < pagesize
> > * Fixed assertion in ext4_get_blocks_overwrite
> >
> > Patch set description
> >
> > This series fixes a long standing problem of racing punch hole and page fault
> > resulting in possible filesystem corruption or stale data exposure. We fix the
> > problem by using a new inode-private rw_semaphore i_mmap_sem to synchronize
> > page faults with truncate and punch hole operations.
> >
> > When having this exclusion, the only remaining problem with DAX implementation
> > are races between two page faults zeroing out same block concurrently (where
> > the data written after the first fault finishes are possibly overwritten by
> > the second fault still doing zeroing).
>
> Is this still a problem for this version of the patch set?

No, the patch set fixes this problem. The paragraph meant to say that first
few patches fix hole punching issues and then a few patches fix this DAX
issue. But the wording is strange, I agree.

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-11-18 15:17:00

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 3/9] ext4: Fix races between buffered IO and collapse / insert range

On Wed 18-11-15 01:39:37, Elliott, Robert (Persistent Memory) wrote:
>
> > -----Original Message-----
> > From: [email protected] [mailto:linux-ext4-
> > [email protected]] On Behalf Of Jan Kara
> > Sent: Tuesday, November 10, 2015 1:51 PM
> > Subject: [PATCH 3/9] ext4: Fix races between buffered IO and collapse /
> > insert range
> >
> > diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> > /*
> > - * Prevent page faults from reinstantiating pages we have released from
> > + * Prevent page faults from reinstantiating we have released from
> > * page cache.
>
> I don't think you meant to delete that word...

Right. Thanks for catching that. Attached is a fixed version of the patch.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
(No filename) (780.00 B)
0001-ext4-Fix-races-between-buffered-IO-and-collapse-inse.patch (3.69 kB)
Download all attachments

2015-12-08 01:08:50

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/9 v4] ext4: Punch hole and DAX fixes

On Tue, Nov 10, 2015 at 08:50:50PM +0100, Jan Kara wrote:
> Hello,
>
> Another version of my ext4 fixes. Since previous version I have fixed DAX block
> mapping to really avoid races for parallel page faults so that the test program
> by Brian passes. Note that you'll see ext4/001 failures - xfstests updates were
> submitted. Also note that testing with 1 KB blocksize on ramdisk is broken
> since brd has buggy discard implementation - Jens has a fix queued.

Thanks, I've applied these patches to the ext4 git tree and they are
on the dev branch.

- Ted

2015-12-10 16:26:44

by Theodore Ts'o

[permalink] [raw]
Subject: Re: [PATCH 0/9 v4] ext4: Punch hole and DAX fixes

On Wed, Dec 09, 2015 at 04:55:18PM -0700, Ross Zwisler wrote:
> Hey Ted,
>
> I'm working on some fsync/msync patches that I'm hoping to merge for v4.5 and
> my patches build upon the work done in this series by Jan.
>
> I see Jan's patches in your dev branch - do you expect that branch to be part
> of a pull request (are the commit IDs stable)? Or will patches from the dev
> branch get moved into another branch before you send the pull request to
> Linus?

Thanks for asking. I assume your work is not just ext4 specific (in
which I could just manage it using patches in the ext4 patch queue).,
but where you want to start a git tree and then push to Linus
independently, yes?

> I'm just trying to figure out what I should use as a baseline for my changes
> since I need to build on Jan's patches but obviously don't want to include new
> ext4 code with my pull request (which I will send after you send yours). :)

Please use the "master" branch from the ext4 tree as your baseline.
Normally dev is a rewinding branch, but for these sorts of situations
I'll guarantee that everything between origin..master will be stable,
while commits between master..dev may be rewound or dropped to fix
bugs/regressions:

* dde86e3 - (HEAD -> guilt/origin, dev) ext4 crypto: add missing locking for keyring_key access (10 hours ago)
* 0969c39 - ext4 crypto: add ioctls to allow backup of encryption metadata (10 hours ago)
* 843848f - ext4 crypto: add ciphertext_access mount option (10 hours ago)
* ba5843f - (master) ext4: use pre-zeroed blocks for DAX page faults (3 days ago)
* c86d8db - ext4: implement allocation of pre-zeroed blocks (3 days ago)
* 53085fa - ext4: provide ext4_issue_zeroout() (3 days ago)
* 2dcba47 - ext4: get rid of EXT4_GET_BLOCKS_NO_LOCK flag (3 days ago)
* e74031f - ext4: document lock ordering (3 days ago)
* 0112784 - ext4: fix races of writeback with punch hole and zero range (3 days ago)
* 32ebffd - ext4: fix races between buffered IO and collapse / insert range (3 days ago)
* 17048e8 - ext4: move unlocked dio protection from ext4_alloc_file_blocks() (3 days ago)
* ea3d720 - ext4: fix races between page faults and hole punching (3 days ago)

Jan's patches have gone through multiple rounds of testing and I'm
quite confident there are no bugs or regressions that can't be dealt
with using follow up patches.

Cheers,

- Ted

2015-12-10 17:11:46

by Ross Zwisler

[permalink] [raw]
Subject: Re: [PATCH 0/9 v4] ext4: Punch hole and DAX fixes

On Thu, Dec 10, 2015 at 11:26:42AM -0500, Theodore Ts'o wrote:
> On Wed, Dec 09, 2015 at 04:55:18PM -0700, Ross Zwisler wrote:
> > Hey Ted,
> >
> > I'm working on some fsync/msync patches that I'm hoping to merge for v4.5 and
> > my patches build upon the work done in this series by Jan.
> >
> > I see Jan's patches in your dev branch - do you expect that branch to be part
> > of a pull request (are the commit IDs stable)? Or will patches from the dev
> > branch get moved into another branch before you send the pull request to
> > Linus?
>
> Thanks for asking. I assume your work is not just ext4 specific (in
> which I could just manage it using patches in the ext4 patch queue).,
> but where you want to start a git tree and then push to Linus
> independently, yes?

Correct. Here's the series I'm talking about (sorry for not including a link
in the original mail):

https://lists.01.org/pipermail/linux-nvdimm/2015-December/003259.html

The only patch that touches ext4 is patch 6/7, which is a 4 line change to
hook into the ext4_dax_pfn_mkwrite() function and have it call
dax_pfn_mkwrite(). Jan's series added ext4_dax_pfn_mkwrite(), which is my
dependency.

> > I'm just trying to figure out what I should use as a baseline for my changes
> > since I need to build on Jan's patches but obviously don't want to include new
> > ext4 code with my pull request (which I will send after you send yours). :)
>
> Please use the "master" branch from the ext4 tree as your baseline.
> Normally dev is a rewinding branch, but for these sorts of situations
> I'll guarantee that everything between origin..master will be stable,
> while commits between master..dev may be rewound or dropped to fix
> bugs/regressions:
>
> * dde86e3 - (HEAD -> guilt/origin, dev) ext4 crypto: add missing locking for keyring_key access (10 hours ago)
> * 0969c39 - ext4 crypto: add ioctls to allow backup of encryption metadata (10 hours ago)
> * 843848f - ext4 crypto: add ciphertext_access mount option (10 hours ago)
> * ba5843f - (master) ext4: use pre-zeroed blocks for DAX page faults (3 days ago)
> * c86d8db - ext4: implement allocation of pre-zeroed blocks (3 days ago)
> * 53085fa - ext4: provide ext4_issue_zeroout() (3 days ago)
> * 2dcba47 - ext4: get rid of EXT4_GET_BLOCKS_NO_LOCK flag (3 days ago)
> * e74031f - ext4: document lock ordering (3 days ago)
> * 0112784 - ext4: fix races of writeback with punch hole and zero range (3 days ago)
> * 32ebffd - ext4: fix races between buffered IO and collapse / insert range (3 days ago)
> * 17048e8 - ext4: move unlocked dio protection from ext4_alloc_file_blocks() (3 days ago)
> * ea3d720 - ext4: fix races between page faults and hole punching (3 days ago)
>
> Jan's patches have gone through multiple rounds of testing and I'm
> quite confident there are no bugs or regressions that can't be dealt
> with using follow up patches.

Perfect, thank you. :)