2015-10-14 11:30:42

by Jan Kara

[permalink] [raw]
Subject: [PATCH 0/6] ext4: Punch hole and DAX fixes

Hello,

This series fixes a long standing problem of racing punch hole and page fault
resulting in possible filesystem corruption or stale data exposure. We fix the
problem by using a new inode-private rw_semaphore i_mmap_sem to synchronize
page faults with truncate and punch hole operations.

When having this exclusion, the only remaining problem with DAX implementation
are races between two page faults zeroing out same block concurrently (where
the data written after the first fault finishes are possibly overwritten by
the second fault still doing zeroing). That is dealt with by patches 5 and 6
in this series where we implement block zeroing directly in ext4_map_blocks().

The patches survived xfstests run both in dax and non-dax mode.

Honza


2015-10-14 11:30:42

by Jan Kara

[permalink] [raw]
Subject: [PATCH 2/6] ext4: Document lock ordering

We have enough locks that it's probably worth documenting the lock
ordering rules we have in ext4.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/super.c | 30 ++++++++++++++++++++++++++++++
1 file changed, 30 insertions(+)

diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index 61fad9e33bb9..152f719a0c62 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -85,6 +85,36 @@ static void ext4_unregister_li_request(struct super_block *sb);
static void ext4_clear_request_list(void);
static int ext4_reserve_clusters(struct ext4_sb_info *, ext4_fsblk_t);

+/*
+ * Lock ordering
+ *
+ * Note the difference between i_mmap_sem (EXT4_I(inode)->i_mmap_sem) and
+ * i_mmap_rwsem (inode->i_mmap_rwsem)!
+ *
+ * page fault path:
+ * mmap_sem -> sb_start_pagefault -> i_mmap_sem (r) -> transaction start ->
+ * page lock -> i_data_sem (rw)
+ *
+ * buffered write path:
+ * sb_start_write -> i_mutex -> mmap_sem
+ * sb_start_write -> i_mutex -> transaction start -> page lock ->
+ * i_data_sem (rw)
+ *
+ * truncate:
+ * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (w) -> i_mmap_sem (w) ->
+ * i_mmap_rwsem (w) -> page lock
+ * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (w) -> i_mmap_sem (w) ->
+ * transaction start -> i_data_sem (rw)
+ *
+ * direct IO:
+ * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (r) -> mmap_sem
+ * sb_start_write -> i_mutex -> EXT4_STATE_DIOREAD_LOCK (r) ->
+ * transaction start -> i_data_sem (rw)
+ *
+ * writepages:
+ * transaction start -> page lock(s) -> i_data_sem (rw)
+ */
+
#if !defined(CONFIG_EXT2_FS) && !defined(CONFIG_EXT2_FS_MODULE) && defined(CONFIG_EXT4_USE_FOR_EXT2)
static struct file_system_type ext2_fs_type = {
.owner = THIS_MODULE,
--
2.1.4


2015-10-14 11:30:43

by Jan Kara

[permalink] [raw]
Subject: [PATCH 3/6] ext4: Get rid of EXT4_GET_BLOCKS_NO_LOCK flag

When dioread_nolock mode is enabled, we grab i_data_sem in
ext4_ext_direct_IO() and therefore we need to instruct _ext4_get_block()
not to grab i_data_sem again using EXT4_GET_BLOCKS_NO_LOCK. However
holding i_data_sem over overwrite direct IO isn't needed these days. We
have exclusion against truncate / hole punching because we increase
i_dio_count under i_mutex in ext4_ext_direct_IO() so once
ext4_file_write_iter() verifies blocks are allocated & written, they are
guaranteed to stay so during the whole direct IO even after we drop
i_mutex.

So we can just remove this locking abuse and the no longer necessary
EXT4_GET_BLOCKS_NO_LOCK flag.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 4 +---
fs/ext4/inode.c | 43 ++++++++++++++++++++-----------------------
include/trace/events/ext4.h | 3 +--
3 files changed, 22 insertions(+), 28 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index c19ff61ccbdf..8c51e9bc7877 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -551,10 +551,8 @@ enum {
#define EXT4_GET_BLOCKS_NO_NORMALIZE 0x0040
/* Request will not result in inode size update (user for fallocate) */
#define EXT4_GET_BLOCKS_KEEP_SIZE 0x0080
- /* Do not take i_data_sem locking in ext4_map_blocks */
-#define EXT4_GET_BLOCKS_NO_LOCK 0x0100
/* Convert written extents to unwritten */
-#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN 0x0200
+#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN 0x0100

/*
* The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9b20563e10f2..cd304948a85f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -403,8 +403,7 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
* out taking i_data_sem. So at the time the unwritten extent
* could be converted.
*/
- if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
- down_read(&EXT4_I(inode)->i_data_sem);
+ down_read(&EXT4_I(inode)->i_data_sem);
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
retval = ext4_ext_map_blocks(handle, inode, map, flags &
EXT4_GET_BLOCKS_KEEP_SIZE);
@@ -412,8 +411,7 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
retval = ext4_ind_map_blocks(handle, inode, map, flags &
EXT4_GET_BLOCKS_KEEP_SIZE);
}
- if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
- up_read((&EXT4_I(inode)->i_data_sem));
+ up_read((&EXT4_I(inode)->i_data_sem));

/*
* We don't check m_len because extent will be collpased in status
@@ -509,8 +507,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
* Try to see if we can get the block without requesting a new
* file system block.
*/
- if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
- down_read(&EXT4_I(inode)->i_data_sem);
+ down_read(&EXT4_I(inode)->i_data_sem);
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) {
retval = ext4_ext_map_blocks(handle, inode, map, flags &
EXT4_GET_BLOCKS_KEEP_SIZE);
@@ -541,8 +538,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
if (ret < 0)
retval = ret;
}
- if (!(flags & EXT4_GET_BLOCKS_NO_LOCK))
- up_read((&EXT4_I(inode)->i_data_sem));
+ up_read((&EXT4_I(inode)->i_data_sem));

found:
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
@@ -674,7 +670,7 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,
map.m_lblk = iblock;
map.m_len = bh->b_size >> inode->i_blkbits;

- if (flags && !(flags & EXT4_GET_BLOCKS_NO_LOCK) && !handle) {
+ if (flags && !handle) {
/* Direct IO write... */
if (map.m_len > DIO_MAX_BLOCKS)
map.m_len = DIO_MAX_BLOCKS;
@@ -879,9 +875,6 @@ int do_journal_get_write_access(handle_t *handle,
return ret;
}

-static int ext4_get_block_write_nolock(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create);
-
#ifdef CONFIG_EXT4_FS_ENCRYPTION
static int ext4_block_write_begin(struct page *page, loff_t pos, unsigned len,
get_block_t *get_block)
@@ -3012,13 +3005,21 @@ int ext4_get_block_write(struct inode *inode, sector_t iblock,
EXT4_GET_BLOCKS_IO_CREATE_EXT);
}

-static int ext4_get_block_write_nolock(struct inode *inode, sector_t iblock,
+static int ext4_get_block_overwrite(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
- ext4_debug("ext4_get_block_write_nolock: inode %lu, create flag %d\n",
+ int ret;
+
+ ext4_debug("ext4_get_block_overwrite: inode %lu, create flag %d\n",
inode->i_ino, create);
- return _ext4_get_block(inode, iblock, bh_result,
- EXT4_GET_BLOCKS_NO_LOCK);
+ ret = _ext4_get_block(inode, iblock, bh_result, 0);
+ /*
+ * Blocks should have been preallocated! ext4_file_write_iter() checks
+ * that.
+ */
+ WARN_ON_ONCE(ret == 0);
+
+ return ret;
}

int ext4_get_block_dax(struct inode *inode, sector_t iblock,
@@ -3101,10 +3102,8 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
/* If we do a overwrite dio, i_mutex locking can be released */
overwrite = *((int *)iocb->private);

- if (overwrite) {
- down_read(&EXT4_I(inode)->i_data_sem);
+ if (overwrite)
mutex_unlock(&inode->i_mutex);
- }

/*
* We could direct write to holes and fallocate.
@@ -3147,7 +3146,7 @@ static ssize_t ext4_ext_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
}

if (overwrite) {
- get_block_func = ext4_get_block_write_nolock;
+ get_block_func = ext4_get_block_overwrite;
} else {
get_block_func = ext4_get_block_write;
dio_flags = DIO_LOCKING;
@@ -3203,10 +3202,8 @@ retake_lock:
if (iov_iter_rw(iter) == WRITE)
inode_dio_end(inode);
/* take i_mutex locking again if we do a ovewrite dio */
- if (overwrite) {
- up_read(&EXT4_I(inode)->i_data_sem);
+ if (overwrite)
mutex_lock(&inode->i_mutex);
- }

return ret;
}
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 594b4b29a224..5f2ace56efc5 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -42,8 +42,7 @@ struct extent_status;
{ EXT4_GET_BLOCKS_CONVERT, "CONVERT" }, \
{ EXT4_GET_BLOCKS_METADATA_NOFAIL, "METADATA_NOFAIL" }, \
{ EXT4_GET_BLOCKS_NO_NORMALIZE, "NO_NORMALIZE" }, \
- { EXT4_GET_BLOCKS_KEEP_SIZE, "KEEP_SIZE" }, \
- { EXT4_GET_BLOCKS_NO_LOCK, "NO_LOCK" })
+ { EXT4_GET_BLOCKS_KEEP_SIZE, "KEEP_SIZE" })

#define show_mflags(flags) __print_flags(flags, "", \
{ EXT4_MAP_NEW, "N" }, \
--
2.1.4


2015-10-14 11:30:45

by Jan Kara

[permalink] [raw]
Subject: [PATCH 5/6] ext4: Implement allocation of pre-zeroed blocks

DAX page fault path needs to get blocks that are pre-zeroed to avoid
races when two concurrent page faults happen in the same block of a
file. Implement support for this in ext4_map_blocks().

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 4 ++++
fs/ext4/extents.c | 8 ++++++++
fs/ext4/inode.c | 20 +++++++++++++++++---
include/trace/events/ext4.h | 3 ++-
4 files changed, 31 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index e1bfcfff4655..bc3192a88b5c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -553,6 +553,10 @@ enum {
#define EXT4_GET_BLOCKS_KEEP_SIZE 0x0080
/* Convert written extents to unwritten */
#define EXT4_GET_BLOCKS_CONVERT_UNWRITTEN 0x0100
+ /* Write zeros to newly created written extents */
+#define EXT4_GET_BLOCKS_ZERO 0x0200
+#define EXT4_GET_BLOCKS_CREATE_ZERO (EXT4_GET_BLOCKS_CREATE |\
+ EXT4_GET_BLOCKS_ZERO)

/*
* The bit position of these flags must not overlap with any of the
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 0fac5a21145f..ab0779b8bd5d 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4039,6 +4039,14 @@ ext4_ext_handle_unwritten_extents(handle_t *handle, struct inode *inode,
}
/* IO end_io complete, convert the filled extent to written */
if (flags & EXT4_GET_BLOCKS_CONVERT) {
+ if (flags & EXT4_GET_BLOCKS_ZERO) {
+ if (allocated > map->m_len)
+ allocated = map->m_len;
+ err = ext4_issue_zeroout(inode, map->m_lblk, newblock,
+ allocated);
+ if (err < 0)
+ goto out2;
+ }
ret = ext4_convert_unwritten_extents_endio(handle, inode, map,
ppath);
if (ret >= 0) {
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 56d743ea4084..dd120ba4ece9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -643,7 +643,7 @@ found:
if ((flags & EXT4_GET_BLOCKS_PRE_IO) &&
ext4_es_lookup_extent(inode, map->m_lblk, &es)) {
if (ext4_es_is_written(&es))
- goto has_zeroout;
+ goto out_sem;
}
status = map->m_flags & EXT4_MAP_UNWRITTEN ?
EXTENT_STATUS_UNWRITTEN : EXTENT_STATUS_WRITTEN;
@@ -654,11 +654,24 @@ found:
status |= EXTENT_STATUS_DELAYED;
ret = ext4_es_insert_extent(inode, map->m_lblk, map->m_len,
map->m_pblk, status);
- if (ret < 0)
+ if (ret < 0) {
retval = ret;
+ goto out_sem;
+ }
+
+ if (flags & EXT4_GET_BLOCKS_ZERO &&
+ map->m_flags & EXT4_MAP_MAPPED &&
+ map->m_flags & EXT4_MAP_NEW) {
+ ret = ext4_issue_zeroout(inode, map->m_lblk,
+ map->m_pblk, map->m_len);
+ if (ret) {
+ retval = ret;
+ goto out_sem;
+ }
+ }
}

-has_zeroout:
+out_sem:
up_write((&EXT4_I(inode)->i_data_sem));
if (retval > 0 && map->m_flags & EXT4_MAP_MAPPED) {
ret = check_block_validity(inode, map);
@@ -3041,6 +3054,7 @@ int ext4_get_block_dax(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create)
{
int flags = EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_UNWRIT_EXT;
+
if (create)
flags |= EXT4_GET_BLOCKS_CREATE;
ext4_debug("ext4_get_block_dax: inode %lu, create flag %d\n",
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 5f2ace56efc5..4e4b2fa78609 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -42,7 +42,8 @@ struct extent_status;
{ EXT4_GET_BLOCKS_CONVERT, "CONVERT" }, \
{ EXT4_GET_BLOCKS_METADATA_NOFAIL, "METADATA_NOFAIL" }, \
{ EXT4_GET_BLOCKS_NO_NORMALIZE, "NO_NORMALIZE" }, \
- { EXT4_GET_BLOCKS_KEEP_SIZE, "KEEP_SIZE" })
+ { EXT4_GET_BLOCKS_KEEP_SIZE, "KEEP_SIZE" }, \
+ { EXT4_GET_BLOCKS_ZERO, "ZERO" })

#define show_mflags(flags) __print_flags(flags, "", \
{ EXT4_MAP_NEW, "N" }, \
--
2.1.4


2015-10-14 11:30:43

by Jan Kara

[permalink] [raw]
Subject: [PATCH 4/6] ext4: Provide ext4_issue_zeroout()

Create new function ext4_issue_zeroout() to zeroout contiguous (both
logically and physically) part of inode data. We will need to issue
zeroout when extent structure is not readily available and this function
will allow us to do it without making up fake extent structures.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/crypto.c | 6 ++----
fs/ext4/ext4.h | 5 ++++-
fs/ext4/extents.c | 11 +----------
fs/ext4/inode.c | 15 +++++++++++++++
4 files changed, 22 insertions(+), 15 deletions(-)

diff --git a/fs/ext4/crypto.c b/fs/ext4/crypto.c
index 45731558138c..6b63d1e9f05a 100644
--- a/fs/ext4/crypto.c
+++ b/fs/ext4/crypto.c
@@ -403,14 +403,12 @@ int ext4_decrypt_one(struct inode *inode, struct page *page)
return ret;
}

-int ext4_encrypted_zeroout(struct inode *inode, struct ext4_extent *ex)
+int ext4_encrypted_zeroout(struct inode *inode, ext4_lblk_t lblk,
+ ext4_fsblk_t pblk, ext4_lblk_t len)
{
struct ext4_crypto_ctx *ctx;
struct page *ciphertext_page = NULL;
struct bio *bio;
- ext4_lblk_t lblk = ex->ee_block;
- ext4_fsblk_t pblk = ext4_ext_pblock(ex);
- unsigned int len = ext4_ext_get_actual_len(ex);
int err = 0;

BUG_ON(inode->i_sb->s_blocksize != PAGE_CACHE_SIZE);
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 8c51e9bc7877..e1bfcfff4655 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2072,7 +2072,8 @@ struct page *ext4_encrypt(struct inode *inode,
struct page *plaintext_page);
int ext4_decrypt(struct ext4_crypto_ctx *ctx, struct page *page);
int ext4_decrypt_one(struct inode *inode, struct page *page);
-int ext4_encrypted_zeroout(struct inode *inode, struct ext4_extent *ex);
+int ext4_encrypted_zeroout(struct inode *inode, ext4_lblk_t lblk,
+ ext4_fsblk_t pblk, ext4_lblk_t len);

#ifdef CONFIG_EXT4_FS_ENCRYPTION
int ext4_init_crypto(void);
@@ -2327,6 +2328,8 @@ extern int ext4_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
extern qsize_t *ext4_get_reserved_space(struct inode *inode);
extern void ext4_da_update_reserve_space(struct inode *inode,
int used, int quota_claim);
+extern int ext4_issue_zeroout(struct inode *inode, ext4_lblk_t lblk,
+ ext4_fsblk_t pblk, ext4_lblk_t len);

/* indirect.c */
extern int ext4_ind_map_blocks(handle_t *handle, struct inode *inode,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 2553aa8b608d..0fac5a21145f 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -3118,19 +3118,10 @@ static int ext4_ext_zeroout(struct inode *inode, struct ext4_extent *ex)
{
ext4_fsblk_t ee_pblock;
unsigned int ee_len;
- int ret;

ee_len = ext4_ext_get_actual_len(ex);
ee_pblock = ext4_ext_pblock(ex);
-
- if (ext4_encrypted_inode(inode))
- return ext4_encrypted_zeroout(inode, ex);
-
- ret = sb_issue_zeroout(inode->i_sb, ee_pblock, ee_len, GFP_NOFS);
- if (ret > 0)
- ret = 0;
-
- return ret;
+ return ext4_issue_zeroout(inode, ex->ee_block, ee_pblock, ee_len);
}

/*
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index cd304948a85f..56d743ea4084 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -383,6 +383,21 @@ static int __check_block_validity(struct inode *inode, const char *func,
return 0;
}

+int ext4_issue_zeroout(struct inode *inode, ext4_lblk_t lblk, ext4_fsblk_t pblk,
+ ext4_lblk_t len)
+{
+ int ret;
+
+ if (ext4_encrypted_inode(inode))
+ return ext4_encrypted_zeroout(inode, lblk, pblk, len);
+
+ ret = sb_issue_zeroout(inode->i_sb, pblk, len, GFP_NOFS);
+ if (ret > 0)
+ ret = 0;
+
+ return ret;
+}
+
#define check_block_validity(inode, map) \
__check_block_validity((inode), __func__, __LINE__, (map))

--
2.1.4


2015-10-14 11:30:43

by Jan Kara

[permalink] [raw]
Subject: [PATCH 1/6] ext4: Fix races between page faults and hole punching

Currently, page faults and hole punching are completely unsynchronized.
This can result in page fault faulting in a page into a range that we
are punching after truncate_pagecache_range() has been called and thus
we can end up with a page mapped to disk blocks that will be shortly
freed. Filesystem corruption will shortly follow. Note that the same
race is avoided for truncate by checking page fault offset against
i_size but there isn't similar mechanism available for punching holes.

Fix the problem by creating new rw semaphore i_mmap_sem in inode and
grab it for writing over truncate and hole punching and for read over
page faults. We cannot easily use i_data_sem for this since that ranks
below transaction start and we need something ranking above it so that
it can be held over the whole truncate / hole punching operation.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 10 +++++++++
fs/ext4/file.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++--------
fs/ext4/inode.c | 27 +++++++++++++++++++----
fs/ext4/super.c | 1 +
4 files changed, 91 insertions(+), 13 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index fd1f28be5296..c19ff61ccbdf 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -869,6 +869,15 @@ struct ext4_inode_info {
* by other means, so we have i_data_sem.
*/
struct rw_semaphore i_data_sem;
+ /*
+ * i_mmap_sem is for serializing page faults with truncate / punch hole
+ * operations. We have to make sure that new page cannot be faulted in
+ * a section of the inode that is being punched. We cannot easily use
+ * i_data_sem for this since we need protection for the whole punch
+ * operation and i_data_sem ranks below transaction start so we have
+ * to occasionally drop it.
+ */
+ struct rw_semaphore i_mmap_sem;
struct inode vfs_inode;
struct jbd2_inode *jinode;

@@ -2316,6 +2325,7 @@ extern int ext4_chunk_trans_blocks(struct inode *, int nrblocks);
extern int ext4_zero_partial_blocks(handle_t *handle, struct inode *inode,
loff_t lstart, loff_t lend);
extern int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf);
+extern int ext4_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf);
extern qsize_t *ext4_get_reserved_space(struct inode *inode);
extern void ext4_da_update_reserve_space(struct inode *inode,
int used, int quota_claim);
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 113837e7ba98..0d24ebcd7c9e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -209,15 +209,18 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
int result;
handle_t *handle = NULL;
- struct super_block *sb = file_inode(vma->vm_file)->i_sb;
+ struct inode *inode = file_inode(vma->vm_file);
+ struct super_block *sb = inode->i_sb;
bool write = vmf->flags & FAULT_FLAG_WRITE;

if (write) {
sb_start_pagefault(sb);
file_update_time(vma->vm_file);
+ down_read(&EXT4_I(inode)->i_mmap_sem);
handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
EXT4_DATA_TRANS_BLOCKS(sb));
- }
+ } else
+ down_read(&EXT4_I(inode)->i_mmap_sem);

if (IS_ERR(handle))
result = VM_FAULT_SIGBUS;
@@ -228,8 +231,10 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
if (write) {
if (!IS_ERR(handle))
ext4_journal_stop(handle);
+ up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(sb);
- }
+ } else
+ up_read(&EXT4_I(inode)->i_mmap_sem);

return result;
}
@@ -246,10 +251,12 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
if (write) {
sb_start_pagefault(sb);
file_update_time(vma->vm_file);
+ down_read(&EXT4_I(inode)->i_mmap_sem);
handle = ext4_journal_start_sb(sb, EXT4_HT_WRITE_PAGE,
ext4_chunk_trans_blocks(inode,
PMD_SIZE / PAGE_SIZE));
- }
+ } else
+ down_read(&EXT4_I(inode)->i_mmap_sem);

if (IS_ERR(handle))
result = VM_FAULT_SIGBUS;
@@ -260,30 +267,71 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
if (write) {
if (!IS_ERR(handle))
ext4_journal_stop(handle);
+ up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(sb);
- }
+ } else
+ up_read(&EXT4_I(inode)->i_mmap_sem);

return result;
}

static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
- return dax_mkwrite(vma, vmf, ext4_get_block_dax,
- ext4_end_io_unwritten);
+ int err;
+ struct inode *inode = file_inode(vma->vm_file);
+
+ sb_start_pagefault(inode->i_sb);
+ file_update_time(vma->vm_file);
+ down_read(&EXT4_I(inode)->i_mmap_sem);
+ err = __dax_mkwrite(vma, vmf, ext4_get_block_dax,
+ ext4_end_io_unwritten);
+ up_read(&EXT4_I(inode)->i_mmap_sem);
+ sb_end_pagefault(inode->i_sb);
+
+ return err;
+}
+
+/*
+ * Handle write fault for VM_MIXEDMAP mappings. Similarly to ext4_dax_mkwrite()
+ * handler we check for races agaist truncate. Note that since we cycle through
+ * i_mmap_sem, we are sure that also any hole punching that began before we
+ * were called is finished by now and so if it included part of the file we
+ * are working on, our pte will get unmapped and the check for pte_same() in
+ * wp_pfn_shared() fails. Thus fault gets retried and things work out as
+ * desired.
+ */
+static int ext4_dax_pfn_mkwrite(struct vm_area_struct *vma,
+ struct vm_fault *vmf)
+{
+ struct inode *inode = file_inode(vma->vm_file);
+ struct super_block *sb = inode->i_sb;
+ int ret = VM_FAULT_NOPAGE;
+ loff_t size;
+
+ sb_start_pagefault(sb);
+ file_update_time(vma->vm_file);
+ down_read(&EXT4_I(inode)->i_mmap_sem);
+ size = (i_size_read(inode) + PAGE_SIZE - 1) >> PAGE_SHIFT;
+ if (vmf->pgoff >= size)
+ ret = VM_FAULT_SIGBUS;
+ up_read(&EXT4_I(inode)->i_mmap_sem);
+ sb_end_pagefault(sb);
+
+ return ret;
}

static const struct vm_operations_struct ext4_dax_vm_ops = {
.fault = ext4_dax_fault,
.pmd_fault = ext4_dax_pmd_fault,
.page_mkwrite = ext4_dax_mkwrite,
- .pfn_mkwrite = dax_pfn_mkwrite,
+ .pfn_mkwrite = ext4_dax_pfn_mkwrite,
};
#else
#define ext4_dax_vm_ops ext4_file_vm_ops
#endif

static const struct vm_operations_struct ext4_file_vm_ops = {
- .fault = filemap_fault,
+ .fault = ext4_filemap_fault,
.map_pages = filemap_map_pages,
.page_mkwrite = ext4_page_mkwrite,
};
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 612fbcf76b5c..9b20563e10f2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3581,6 +3581,11 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)

}

+ /* Wait all existing dio workers, newcomers will block on i_mutex */
+ ext4_inode_block_unlocked_dio(inode);
+ inode_dio_wait(inode);
+
+ down_write(&EXT4_I(inode)->i_mmap_sem);
first_block_offset = round_up(offset, sb->s_blocksize);
last_block_offset = round_down((offset + length), sb->s_blocksize) - 1;

@@ -3589,10 +3594,6 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
truncate_pagecache_range(inode, first_block_offset,
last_block_offset);

- /* Wait all existing dio workers, newcomers will block on i_mutex */
- ext4_inode_block_unlocked_dio(inode);
- inode_dio_wait(inode);
-
if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
credits = ext4_writepage_trans_blocks(inode);
else
@@ -3648,6 +3649,7 @@ int ext4_punch_hole(struct inode *inode, loff_t offset, loff_t length)
out_stop:
ext4_journal_stop(handle);
out_dio:
+ up_write(&EXT4_I(inode)->i_mmap_sem);
ext4_inode_resume_unlocked_dio(inode);
out_mutex:
mutex_unlock(&inode->i_mutex);
@@ -4784,6 +4786,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
} else
ext4_wait_for_tail_page_commit(inode);
}
+ down_write(&EXT4_I(inode)->i_mmap_sem);
/*
* Truncate pagecache after we've waited for commit
* in data=journal mode to make pages freeable.
@@ -4791,6 +4794,7 @@ int ext4_setattr(struct dentry *dentry, struct iattr *attr)
truncate_pagecache(inode, inode->i_size);
if (shrink)
ext4_truncate(inode);
+ up_write(&EXT4_I(inode)->i_mmap_sem);
}

if (!rc) {
@@ -5239,6 +5243,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)

sb_start_pagefault(inode->i_sb);
file_update_time(vma->vm_file);
+
+ down_read(&EXT4_I(inode)->i_mmap_sem);
/* Delalloc case is easy... */
if (test_opt(inode->i_sb, DELALLOC) &&
!ext4_should_journal_data(inode) &&
@@ -5308,6 +5314,19 @@ retry_alloc:
out_ret:
ret = block_page_mkwrite_return(ret);
out:
+ up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(inode->i_sb);
return ret;
}
+
+int ext4_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct inode *inode = file_inode(vma->vm_file);
+ int err;
+
+ down_read(&EXT4_I(inode)->i_mmap_sem);
+ err = filemap_fault(vma, vmf);
+ up_read(&EXT4_I(inode)->i_mmap_sem);
+
+ return err;
+}
diff --git a/fs/ext4/super.c b/fs/ext4/super.c
index a63c7b0a10cf..61fad9e33bb9 100644
--- a/fs/ext4/super.c
+++ b/fs/ext4/super.c
@@ -955,6 +955,7 @@ static void init_once(void *foo)
INIT_LIST_HEAD(&ei->i_orphan);
init_rwsem(&ei->xattr_sem);
init_rwsem(&ei->i_data_sem);
+ init_rwsem(&ei->i_mmap_sem);
inode_init_once(&ei->vfs_inode);
}

--
2.1.4


2015-10-14 11:30:45

by Jan Kara

[permalink] [raw]
Subject: [PATCH 6/6] ext4: Use pre-zeroed blocks for DAX page faults

Make DAX fault path use pre-zeroed blocks to avoid races with extent
conversion and zeroing when two page faults to the same block happen.

Signed-off-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 4 +--
fs/ext4/file.c | 20 +++-----------
fs/ext4/inode.c | 82 +++++++++++++++++++++++++++++++++++++++++++++------------
3 files changed, 70 insertions(+), 36 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bc3192a88b5c..d084fff2ee3a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -2284,8 +2284,8 @@ struct buffer_head *ext4_getblk(handle_t *, struct inode *, ext4_lblk_t, int);
struct buffer_head *ext4_bread(handle_t *, struct inode *, ext4_lblk_t, int);
int ext4_get_block_write(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
-int ext4_get_block_dax(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create);
+int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create);
int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 0d24ebcd7c9e..749b222e6498 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -193,18 +193,6 @@ out:
}

#ifdef CONFIG_FS_DAX
-static void ext4_end_io_unwritten(struct buffer_head *bh, int uptodate)
-{
- struct inode *inode = bh->b_assoc_map->host;
- /* XXX: breaks on 32-bit > 16TB. Is that even supported? */
- loff_t offset = (loff_t)(uintptr_t)bh->b_private << inode->i_blkbits;
- int err;
- if (!uptodate)
- return;
- WARN_ON(!buffer_unwritten(bh));
- err = ext4_convert_unwritten_extents(NULL, inode, offset, bh->b_size);
-}
-
static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
int result;
@@ -225,8 +213,7 @@ static int ext4_dax_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
if (IS_ERR(handle))
result = VM_FAULT_SIGBUS;
else
- result = __dax_fault(vma, vmf, ext4_get_block_dax,
- ext4_end_io_unwritten);
+ result = __dax_fault(vma, vmf, ext4_dax_mmap_get_block, NULL);

if (write) {
if (!IS_ERR(handle))
@@ -262,7 +249,7 @@ static int ext4_dax_pmd_fault(struct vm_area_struct *vma, unsigned long addr,
result = VM_FAULT_SIGBUS;
else
result = __dax_pmd_fault(vma, addr, pmd, flags,
- ext4_get_block_dax, ext4_end_io_unwritten);
+ ext4_dax_mmap_get_block, NULL);

if (write) {
if (!IS_ERR(handle))
@@ -283,8 +270,7 @@ static int ext4_dax_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
sb_start_pagefault(inode->i_sb);
file_update_time(vma->vm_file);
down_read(&EXT4_I(inode)->i_mmap_sem);
- err = __dax_mkwrite(vma, vmf, ext4_get_block_dax,
- ext4_end_io_unwritten);
+ err = __dax_mkwrite(vma, vmf, ext4_dax_mmap_get_block, NULL);
up_read(&EXT4_I(inode)->i_mmap_sem);
sb_end_pagefault(inode->i_sb);

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index dd120ba4ece9..4ec4bc1dc949 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -718,16 +718,6 @@ static int _ext4_get_block(struct inode *inode, sector_t iblock,

map_bh(bh, inode->i_sb, map.m_pblk);
bh->b_state = (bh->b_state & ~EXT4_MAP_FLAGS) | map.m_flags;
- if (IS_DAX(inode) && buffer_unwritten(bh)) {
- /*
- * dgc: I suspect unwritten conversion on ext4+DAX is
- * fundamentally broken here when there are concurrent
- * read/write in progress on this inode.
- */
- WARN_ON_ONCE(io_end);
- bh->b_assoc_map = inode->i_mapping;
- bh->b_private = (void *)(unsigned long)iblock;
- }
if (io_end && io_end->flag & EXT4_IO_END_UNWRITTEN)
set_buffer_defer_completion(bh);
bh->b_size = inode->i_sb->s_blocksize * map.m_len;
@@ -3050,17 +3040,75 @@ static int ext4_get_block_overwrite(struct inode *inode, sector_t iblock,
return ret;
}

-int ext4_get_block_dax(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create)
+#ifdef CONFIG_FS_DAX
+int ext4_dax_mmap_get_block(struct inode *inode, sector_t iblock,
+ struct buffer_head *bh_result, int create)
{
- int flags = EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_UNWRIT_EXT;
+ int ret, err;
+ int credits;
+ struct ext4_map_blocks map;
+ handle_t *handle = NULL;
+ int flags = 0;

- if (create)
- flags |= EXT4_GET_BLOCKS_CREATE;
- ext4_debug("ext4_get_block_dax: inode %lu, create flag %d\n",
+ ext4_debug("ext4_dax_mmap_get_block: inode %lu, create flag %d\n",
inode->i_ino, create);
- return _ext4_get_block(inode, iblock, bh_result, flags);
+ map.m_lblk = iblock;
+ map.m_len = bh_result->b_size >> inode->i_blkbits;
+ BUG_ON(map.m_len != 1);
+ credits = ext4_chunk_trans_blocks(inode, map.m_len);
+ if (create) {
+ flags |= EXT4_GET_BLOCKS_PRE_IO | EXT4_GET_BLOCKS_CREATE_ZERO;
+ handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ return ret;
+ }
+ }
+
+ ret = ext4_map_blocks(handle, inode, &map, flags);
+ if (create) {
+ err = ext4_journal_stop(handle);
+ if (ret > 0 && err < 0)
+ ret = err;
+ }
+ if (ret <= 0)
+ goto out;
+ if (map.m_flags & EXT4_MAP_UNWRITTEN) {
+ int err2;
+
+ /*
+ * We are protected by i_mmap_sem so we know block cannot go
+ * away from under us even though we dropped i_data_sem.
+ * Convert extent to written and write zeros there.
+ *
+ * Note: We may get here even when create == 0.
+ */
+ handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, credits);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+
+ err = ext4_map_blocks(handle, inode, &map,
+ EXT4_GET_BLOCKS_CONVERT | EXT4_GET_BLOCKS_CREATE_ZERO);
+ if (err < 0)
+ ret = err;
+ err2 = ext4_journal_stop(handle);
+ if (err2 < 0 && ret > 0)
+ ret = err2;
+ }
+out:
+ WARN_ON_ONCE(ret == 0 && create);
+ if (ret > 0) {
+ map_bh(bh_result, inode->i_sb, map.m_pblk);
+ bh_result->b_state = (bh_result->b_state & ~EXT4_MAP_FLAGS) |
+ map.m_flags;
+ bh_result->b_size = map.m_len << inode->i_blkbits;
+ ret = 0;
+ }
+ return ret;
}
+#endif

static void ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
ssize_t size, void *private)
--
2.1.4


2015-10-14 13:19:27

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH 4/6] ext4: Provide ext4_issue_zeroout()

Hi Jan,

[auto build test WARNING on v4.3-rc5 -- if it's inappropriate base, please suggest rules for selecting the more suitable base]

url: https://github.com/0day-ci/linux/commits/Jan-Kara/ext4-Punch-hole-and-DAX-fixes/20151014-193637
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__


sparse warnings: (new ones prefixed by >>)

>> fs/ext4/extents.c:3124:44: sparse: incorrect type in argument 2 (different base types)
fs/ext4/extents.c:3124:44: expected unsigned int [unsigned] [usertype] lblk
fs/ext4/extents.c:3124:44: got restricted __le32 [usertype] ee_block

vim +3124 fs/ext4/extents.c

3108
3109 if (ee_len == 0)
3110 return 0;
3111
3112 return ext4_es_insert_extent(inode, ee_block, ee_len, ee_pblock,
3113 EXTENT_STATUS_WRITTEN);
3114 }
3115
3116 /* FIXME!! we need to try to merge to left or right after zero-out */
3117 static int ext4_ext_zeroout(struct inode *inode, struct ext4_extent *ex)
3118 {
3119 ext4_fsblk_t ee_pblock;
3120 unsigned int ee_len;
3121
3122 ee_len = ext4_ext_get_actual_len(ex);
3123 ee_pblock = ext4_ext_pblock(ex);
> 3124 return ext4_issue_zeroout(inode, ex->ee_block, ee_pblock, ee_len);
3125 }
3126
3127 /*
3128 * ext4_split_extent_at() splits an extent at given block.
3129 *
3130 * @handle: the journal handle
3131 * @inode: the file inode
3132 * @path: the path to the extent

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation

2015-10-14 18:06:23

by Ross Zwisler

[permalink] [raw]
Subject: Re: [PATCH 0/6] ext4: Punch hole and DAX fixes

On Wed, Oct 14, 2015 at 01:30:21PM +0200, Jan Kara wrote:
> Hello,
>
> This series fixes a long standing problem of racing punch hole and page fault
> resulting in possible filesystem corruption or stale data exposure. We fix the
> problem by using a new inode-private rw_semaphore i_mmap_sem to synchronize
> page faults with truncate and punch hole operations.
>
> When having this exclusion, the only remaining problem with DAX implementation
> are races between two page faults zeroing out same block concurrently (where
> the data written after the first fault finishes are possibly overwritten by
> the second fault still doing zeroing). That is dealt with by patches 5 and 6
> in this series where we implement block zeroing directly in ext4_map_blocks().
>
> The patches survived xfstests run both in dax and non-dax mode.
>
> Honza

I threw these patches in to my own test setup and was able to hit the
following with generic/198:

[ 57.453855] run fstests generic/198 at 2015-10-14 11:41:21
[ 57.585886] ------------[ cut here ]------------
[ 57.585890] kernel BUG at fs/ext4/inode.c:3057!
[ 57.585892] invalid opcode: 0000 [#1] SMP
[ 57.585894] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
[ 57.585898] CPU: 0 PID: 3099 Comm: aiodio_sparse2 Not tainted 4.3.0-rc4_ext4_locking+ #1
[ 57.585899] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
[ 57.585900] task: ffff88040ea3b400 ti: ffff8800ba5e8000 task.ti: ffff8800ba5e8000
[ 57.585901] RIP: 0010:[<ffffffff812f7b79>] [<ffffffff812f7b79>] ext4_dax_mmap_get_block+0x1f9/0x250
[ 57.585906] RSP: 0000:ffff8800ba5ebc48 EFLAGS: 00010216
[ 57.585907] RAX: 0000000000000200 RBX: ffff8800ba5ebcf8 RCX: 000000000000000c
[ 57.585908] RDX: ffff8800ba5ebcf8 RSI: 000000000000004b RDI: ffff8801fbdb13c0
[ 57.585909] RBP: ffff8800ba5ebca0 R08: ffffffff812f7980 R09: ffff8801fbdb15e0
[ 57.585910] R10: 00007f7921c00000 R11: 000000000000000c R12: ffff8801fbdb13c0
[ 57.585911] R13: 0000000000000001 R14: ffff8800b153c870 R15: 0000000000000000
[ 57.585912] FS: 00007f7928999700(0000) GS:ffff880411000000(0000) knlGS:0000000000000000
[ 57.585913] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 57.585914] CR2: 00007f7921c00000 CR3: 00000000bb1b5000 CR4: 00000000000406f0
[ 57.585917] Stack:
[ 57.585918] ffff8800ba351000 ffff8800bafca028 0000000000000000 000002000000004b
[ 57.585920] ffff8800ba5ebd08 000000004edcd27a ffff8800baf47f18 ffff8801fbdb13c0
[ 57.585922] 00007f7921c00000 ffff8800b153c870 0000000000000000 ffff8800ba5ebd90
[ 57.585924] Call Trace:
[ 57.585928] [<ffffffff812a061c>] __dax_pmd_fault+0x15c/0x590
[ 57.585931] [<ffffffff81224d46>] ? kmem_cache_alloc+0xd6/0x210
[ 57.585934] [<ffffffff812f06ab>] ext4_dax_pmd_fault+0xbb/0x140
[ 57.585936] [<ffffffff811f6d0b>] handle_mm_fault+0x2cb/0x1960
[ 57.585938] [<ffffffff811f6a9a>] ? handle_mm_fault+0x5a/0x1960
[ 57.585941] [<ffffffff81067731>] __do_page_fault+0x191/0x3f0
[ 57.585943] [<ffffffff81067a5f>] trace_do_page_fault+0x4f/0x120
[ 57.585944] [<ffffffff8106217a>] do_async_page_fault+0x1a/0xa0
[ 57.585947] [<ffffffff81a23dc8>] async_page_fault+0x28/0x30
[ 57.585948] Code: e8 3d ef ff ff 85 c0 41 89 c5 4c 89 fa be 18 0c 00 00 48 c7 c7 50 0a c3 81 78 31 e8 82 17 03 00 85 c0 44 0f 48 f0 e9 98 fe ff ff <0f> 0b be 1d 0c 00 00 48 c7 c7 f8 6e f2 81 e8 74 b6 da ff c6 05
[ 57.585969] RIP [<ffffffff812f7b79>] ext4_dax_mmap_get_block+0x1f9/0x250
[ 57.585971] RSP <ffff8800ba5ebc48>
[ 57.585973] ---[ end trace 356a2226c61d7104 ]---

I'll spend some time today trying to figure out what's going on, but thought
I'd let you know early in case you were already debugging this.

My test setup is a pair of 4 GiB PMEM partitions in a KVM guest, one for the
TEST_DEV and the other for the SCRATCH_DEV. I'm setting "-o dax" with
MOUNT_OPTIONS:

# ./check generic/198
FSTYP -- ext4
PLATFORM -- Linux/x86_64 alara 4.3.0-rc4_ext4_locking+
MKFS_OPTIONS -- /dev/pmem0p2
MOUNT_OPTIONS -- -o dax -o context=system_u:object_r:nfs_t:s0 /dev/pmem0p2 /mnt/xfstests_scratch

generic/198 0s ... [failed, exit status 139] - output mismatch (see /root/xfstests/results//generic/198.out.bad)
--- tests/generic/198.out 2015-10-02 10:19:36.804795880 -0600
+++ /root/xfstests/results//generic/198.out.bad 2015-10-14 11:41:21.234972535 -0600
@@ -1,2 +1,3 @@
QA output created by 198
Silence is golden.
+./tests/generic/198: line 55: 3099 Segmentation fault $AIO_TEST "$TEST_DIR/aiodio_sparse"
...
(Run 'diff -u tests/generic/198.out /root/xfstests/results//generic/198.out.bad' to see the entire diff)

2015-10-14 21:07:08

by Ross Zwisler

[permalink] [raw]
Subject: Re: [PATCH 0/6] ext4: Punch hole and DAX fixes

On Wed, Oct 14, 2015 at 12:06:04PM -0600, Ross Zwisler wrote:
> On Wed, Oct 14, 2015 at 01:30:21PM +0200, Jan Kara wrote:
> > Hello,
> >
> > This series fixes a long standing problem of racing punch hole and page fault
> > resulting in possible filesystem corruption or stale data exposure. We fix the
> > problem by using a new inode-private rw_semaphore i_mmap_sem to synchronize
> > page faults with truncate and punch hole operations.
> >
> > When having this exclusion, the only remaining problem with DAX implementation
> > are races between two page faults zeroing out same block concurrently (where
> > the data written after the first fault finishes are possibly overwritten by
> > the second fault still doing zeroing). That is dealt with by patches 5 and 6
> > in this series where we implement block zeroing directly in ext4_map_blocks().
> >
> > The patches survived xfstests run both in dax and non-dax mode.
> >
> > Honza
>
> I threw these patches in to my own test setup and was able to hit the
> following with generic/198:
>
> [ 57.453855] run fstests generic/198 at 2015-10-14 11:41:21
> [ 57.585886] ------------[ cut here ]------------
> [ 57.585890] kernel BUG at fs/ext4/inode.c:3057!
> [ 57.585892] invalid opcode: 0000 [#1] SMP
> [ 57.585894] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
> [ 57.585898] CPU: 0 PID: 3099 Comm: aiodio_sparse2 Not tainted 4.3.0-rc4_ext4_locking+ #1
> [ 57.585899] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
> [ 57.585900] task: ffff88040ea3b400 ti: ffff8800ba5e8000 task.ti: ffff8800ba5e8000
> [ 57.585901] RIP: 0010:[<ffffffff812f7b79>] [<ffffffff812f7b79>] ext4_dax_mmap_get_block+0x1f9/0x250
> [ 57.585906] RSP: 0000:ffff8800ba5ebc48 EFLAGS: 00010216
> [ 57.585907] RAX: 0000000000000200 RBX: ffff8800ba5ebcf8 RCX: 000000000000000c
> [ 57.585908] RDX: ffff8800ba5ebcf8 RSI: 000000000000004b RDI: ffff8801fbdb13c0
> [ 57.585909] RBP: ffff8800ba5ebca0 R08: ffffffff812f7980 R09: ffff8801fbdb15e0
> [ 57.585910] R10: 00007f7921c00000 R11: 000000000000000c R12: ffff8801fbdb13c0
> [ 57.585911] R13: 0000000000000001 R14: ffff8800b153c870 R15: 0000000000000000
> [ 57.585912] FS: 00007f7928999700(0000) GS:ffff880411000000(0000) knlGS:0000000000000000
> [ 57.585913] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 57.585914] CR2: 00007f7921c00000 CR3: 00000000bb1b5000 CR4: 00000000000406f0
> [ 57.585917] Stack:
> [ 57.585918] ffff8800ba351000 ffff8800bafca028 0000000000000000 000002000000004b
> [ 57.585920] ffff8800ba5ebd08 000000004edcd27a ffff8800baf47f18 ffff8801fbdb13c0
> [ 57.585922] 00007f7921c00000 ffff8800b153c870 0000000000000000 ffff8800ba5ebd90
> [ 57.585924] Call Trace:
> [ 57.585928] [<ffffffff812a061c>] __dax_pmd_fault+0x15c/0x590
> [ 57.585931] [<ffffffff81224d46>] ? kmem_cache_alloc+0xd6/0x210
> [ 57.585934] [<ffffffff812f06ab>] ext4_dax_pmd_fault+0xbb/0x140
> [ 57.585936] [<ffffffff811f6d0b>] handle_mm_fault+0x2cb/0x1960
> [ 57.585938] [<ffffffff811f6a9a>] ? handle_mm_fault+0x5a/0x1960
> [ 57.585941] [<ffffffff81067731>] __do_page_fault+0x191/0x3f0
> [ 57.585943] [<ffffffff81067a5f>] trace_do_page_fault+0x4f/0x120
> [ 57.585944] [<ffffffff8106217a>] do_async_page_fault+0x1a/0xa0
> [ 57.585947] [<ffffffff81a23dc8>] async_page_fault+0x28/0x30
> [ 57.585948] Code: e8 3d ef ff ff 85 c0 41 89 c5 4c 89 fa be 18 0c 00 00 48 c7 c7 50 0a c3 81 78 31 e8 82 17 03 00 85 c0 44 0f 48 f0 e9 98 fe ff ff <0f> 0b be 1d 0c 00 00 48 c7 c7 f8 6e f2 81 e8 74 b6 da ff c6 05
> [ 57.585969] RIP [<ffffffff812f7b79>] ext4_dax_mmap_get_block+0x1f9/0x250
> [ 57.585971] RSP <ffff8800ba5ebc48>
> [ 57.585973] ---[ end trace 356a2226c61d7104 ]---
>
> I'll spend some time today trying to figure out what's going on, but thought
> I'd let you know early in case you were already debugging this.
>
> My test setup is a pair of 4 GiB PMEM partitions in a KVM guest, one for the
> TEST_DEV and the other for the SCRATCH_DEV. I'm setting "-o dax" with
> MOUNT_OPTIONS:
>
> # ./check generic/198
> FSTYP -- ext4
> PLATFORM -- Linux/x86_64 alara 4.3.0-rc4_ext4_locking+
> MKFS_OPTIONS -- /dev/pmem0p2
> MOUNT_OPTIONS -- -o dax -o context=system_u:object_r:nfs_t:s0 /dev/pmem0p2 /mnt/xfstests_scratch
>
> generic/198 0s ... [failed, exit status 139] - output mismatch (see /root/xfstests/results//generic/198.out.bad)
> --- tests/generic/198.out 2015-10-02 10:19:36.804795880 -0600
> +++ /root/xfstests/results//generic/198.out.bad 2015-10-14 11:41:21.234972535 -0600
> @@ -1,2 +1,3 @@
> QA output created by 198
> Silence is golden.
> +./tests/generic/198: line 55: 3099 Segmentation fault $AIO_TEST "$TEST_DIR/aiodio_sparse"
> ...
> (Run 'diff -u tests/generic/198.out /root/xfstests/results//generic/198.out.bad' to see the entire diff)

It looks like the issue is that we're running through the DAX PMD fault path,
so we are trying to map 2 MiB at once. ext4_dax_mmap_get_block() enforces an
assumption that it will only be called for a single block at a time with

BUG_ON(map.m_len != 1);

When running through the PMD path, though, map.m_len is 512. I'm guessing
that you didn't hit this in your testing because things weren't set up in such
a way that you received PMD faults.

To be able to get this to work in my setup I reserve PMEM starting at an
aligned address (I have it at a 2GiB alignment, something smaller might be
sufficient) and I set up my partitions to be on 1GiB alignments as well.
There may be more precise ways to get this to to work. :)

For ease of copy and paste, my mmap param for PMEM is "memmap=8G!8G" (I have
16 GiB of RAM in my VM), and here is the bash function I use to set up my test
environment:

function xfstests_init {
umount /dev/pmem* 2>/dev/null
parted -s $xfstests_dev mktable msdos
parted -s -a optimal $xfstests_dev mkpart Primary 1 4GiB
parted -s -a optimal $xfstests_dev mkpart Primary 4GiB 7GiB

if [[ $test_fs == "xfs" ]]; then
mkfs.xfs -f $TEST_DEV
mkfs.xfs -f $SCRATCH_DEV
elif [[ $test_fs == "ext4" ]]; then
mkfs.ext4 -F $TEST_DEV
mkfs.ext4 -F $SCRATCH_DEV
elif [[ $test_fs == "ext2" ]]; then
mkfs.ext2 -F $TEST_DEV
mkfs.ext2 -F $SCRATCH_DEV
else
echo "Invalid test fs '$test_fs'"
fi
}

2015-10-15 03:01:01

by Ross Zwisler

[permalink] [raw]
Subject: Re: [PATCH 1/6] ext4: Fix races between page faults and hole punching

On Wed, Oct 14, 2015 at 01:30:22PM +0200, Jan Kara wrote:
> Currently, page faults and hole punching are completely unsynchronized.
> This can result in page fault faulting in a page into a range that we
> are punching after truncate_pagecache_range() has been called and thus
> we can end up with a page mapped to disk blocks that will be shortly
> freed. Filesystem corruption will shortly follow. Note that the same
> race is avoided for truncate by checking page fault offset against
> i_size but there isn't similar mechanism available for punching holes.
>
> Fix the problem by creating new rw semaphore i_mmap_sem in inode and
> grab it for writing over truncate and hole punching and for read over
> page faults. We cannot easily use i_data_sem for this since that ranks
> below transaction start and we need something ranking above it so that
> it can be held over the whole truncate / hole punching operation.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/ext4/ext4.h | 10 +++++++++
> fs/ext4/file.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++--------
> fs/ext4/inode.c | 27 +++++++++++++++++++----
> fs/ext4/super.c | 1 +
> 4 files changed, 91 insertions(+), 13 deletions(-)

I wonder if there are a few other operations in ext4_fallocate() that
we may need to protect in addition to ext4_punch_hole()?

Do ext4_collapse_range(), ext4_insert_range() and maybe even ext4_zero_range()
need protection?

For what it's worth the rest of the locking looks good to me. The lock
ordering is the same as with ext2 and XFS, and all the DAX fault handlers look
correct to me.

2015-10-15 09:13:15

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 0/6] ext4: Punch hole and DAX fixes

On Wed 14-10-15 15:07:07, Ross Zwisler wrote:
> On Wed, Oct 14, 2015 at 12:06:04PM -0600, Ross Zwisler wrote:
> > On Wed, Oct 14, 2015 at 01:30:21PM +0200, Jan Kara wrote:
> > > Hello,
> > >
> > > This series fixes a long standing problem of racing punch hole and page fault
> > > resulting in possible filesystem corruption or stale data exposure. We fix the
> > > problem by using a new inode-private rw_semaphore i_mmap_sem to synchronize
> > > page faults with truncate and punch hole operations.
> > >
> > > When having this exclusion, the only remaining problem with DAX implementation
> > > are races between two page faults zeroing out same block concurrently (where
> > > the data written after the first fault finishes are possibly overwritten by
> > > the second fault still doing zeroing). That is dealt with by patches 5 and 6
> > > in this series where we implement block zeroing directly in ext4_map_blocks().
> > >
> > > The patches survived xfstests run both in dax and non-dax mode.
> > >
> > > Honza
> >
> > I threw these patches in to my own test setup and was able to hit the
> > following with generic/198:
> >
> > [ 57.453855] run fstests generic/198 at 2015-10-14 11:41:21
> > [ 57.585886] ------------[ cut here ]------------
> > [ 57.585890] kernel BUG at fs/ext4/inode.c:3057!
> > [ 57.585892] invalid opcode: 0000 [#1] SMP
> > [ 57.585894] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
> > [ 57.585898] CPU: 0 PID: 3099 Comm: aiodio_sparse2 Not tainted 4.3.0-rc4_ext4_locking+ #1
> > [ 57.585899] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153950- 04/01/2014
> > [ 57.585900] task: ffff88040ea3b400 ti: ffff8800ba5e8000 task.ti: ffff8800ba5e8000
> > [ 57.585901] RIP: 0010:[<ffffffff812f7b79>] [<ffffffff812f7b79>] ext4_dax_mmap_get_block+0x1f9/0x250
> > [ 57.585906] RSP: 0000:ffff8800ba5ebc48 EFLAGS: 00010216
> > [ 57.585907] RAX: 0000000000000200 RBX: ffff8800ba5ebcf8 RCX: 000000000000000c
> > [ 57.585908] RDX: ffff8800ba5ebcf8 RSI: 000000000000004b RDI: ffff8801fbdb13c0
> > [ 57.585909] RBP: ffff8800ba5ebca0 R08: ffffffff812f7980 R09: ffff8801fbdb15e0
> > [ 57.585910] R10: 00007f7921c00000 R11: 000000000000000c R12: ffff8801fbdb13c0
> > [ 57.585911] R13: 0000000000000001 R14: ffff8800b153c870 R15: 0000000000000000
> > [ 57.585912] FS: 00007f7928999700(0000) GS:ffff880411000000(0000) knlGS:0000000000000000
> > [ 57.585913] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 57.585914] CR2: 00007f7921c00000 CR3: 00000000bb1b5000 CR4: 00000000000406f0
> > [ 57.585917] Stack:
> > [ 57.585918] ffff8800ba351000 ffff8800bafca028 0000000000000000 000002000000004b
> > [ 57.585920] ffff8800ba5ebd08 000000004edcd27a ffff8800baf47f18 ffff8801fbdb13c0
> > [ 57.585922] 00007f7921c00000 ffff8800b153c870 0000000000000000 ffff8800ba5ebd90
> > [ 57.585924] Call Trace:
> > [ 57.585928] [<ffffffff812a061c>] __dax_pmd_fault+0x15c/0x590
> > [ 57.585931] [<ffffffff81224d46>] ? kmem_cache_alloc+0xd6/0x210
> > [ 57.585934] [<ffffffff812f06ab>] ext4_dax_pmd_fault+0xbb/0x140
> > [ 57.585936] [<ffffffff811f6d0b>] handle_mm_fault+0x2cb/0x1960
> > [ 57.585938] [<ffffffff811f6a9a>] ? handle_mm_fault+0x5a/0x1960
> > [ 57.585941] [<ffffffff81067731>] __do_page_fault+0x191/0x3f0
> > [ 57.585943] [<ffffffff81067a5f>] trace_do_page_fault+0x4f/0x120
> > [ 57.585944] [<ffffffff8106217a>] do_async_page_fault+0x1a/0xa0
> > [ 57.585947] [<ffffffff81a23dc8>] async_page_fault+0x28/0x30
> > [ 57.585948] Code: e8 3d ef ff ff 85 c0 41 89 c5 4c 89 fa be 18 0c 00 00 48 c7 c7 50 0a c3 81 78 31 e8 82 17 03 00 85 c0 44 0f 48 f0 e9 98 fe ff ff <0f> 0b be 1d 0c 00 00 48 c7 c7 f8 6e f2 81 e8 74 b6 da ff c6 05
> > [ 57.585969] RIP [<ffffffff812f7b79>] ext4_dax_mmap_get_block+0x1f9/0x250
> > [ 57.585971] RSP <ffff8800ba5ebc48>
> > [ 57.585973] ---[ end trace 356a2226c61d7104 ]---
> >
> > I'll spend some time today trying to figure out what's going on, but thought
> > I'd let you know early in case you were already debugging this.
> >
> > My test setup is a pair of 4 GiB PMEM partitions in a KVM guest, one for the
> > TEST_DEV and the other for the SCRATCH_DEV. I'm setting "-o dax" with
> > MOUNT_OPTIONS:
> >
> > # ./check generic/198
> > FSTYP -- ext4
> > PLATFORM -- Linux/x86_64 alara 4.3.0-rc4_ext4_locking+
> > MKFS_OPTIONS -- /dev/pmem0p2
> > MOUNT_OPTIONS -- -o dax -o context=system_u:object_r:nfs_t:s0 /dev/pmem0p2 /mnt/xfstests_scratch
> >
> > generic/198 0s ... [failed, exit status 139] - output mismatch (see /root/xfstests/results//generic/198.out.bad)
> > --- tests/generic/198.out 2015-10-02 10:19:36.804795880 -0600
> > +++ /root/xfstests/results//generic/198.out.bad 2015-10-14 11:41:21.234972535 -0600
> > @@ -1,2 +1,3 @@
> > QA output created by 198
> > Silence is golden.
> > +./tests/generic/198: line 55: 3099 Segmentation fault $AIO_TEST "$TEST_DIR/aiodio_sparse"
> > ...
> > (Run 'diff -u tests/generic/198.out /root/xfstests/results//generic/198.out.bad' to see the entire diff)
>
> It looks like the issue is that we're running through the DAX PMD fault
> path, so we are trying to map 2 MiB at once. ext4_dax_mmap_get_block()
> enforces an assumption that it will only be called for a single block at
> a time with
>
> BUG_ON(map.m_len != 1);
>
> When running through the PMD path, though, map.m_len is 512. I'm
> guessing that you didn't hit this in your testing because things weren't
> set up in such a way that you received PMD faults.

Yeah, actually I was running on older version of your locking fixes in
generic DAX code which completely disabled PMD faults. Furthermore I've
used ordinary ramdisk as a backing device so alignment could have been an
issue as well.

> To be able to get this to work in my setup I reserve PMEM starting at an
> aligned address (I have it at a 2GiB alignment, something smaller might be
> sufficient) and I set up my partitions to be on 1GiB alignments as well.
> There may be more precise ways to get this to to work. :)
>
> For ease of copy and paste, my mmap param for PMEM is "memmap=8G!8G" (I
> have 16 GiB of RAM in my VM), and here is the bash function I use to set
> up my test environment:
>
> function xfstests_init {
> umount /dev/pmem* 2>/dev/null
> parted -s $xfstests_dev mktable msdos
> parted -s -a optimal $xfstests_dev mkpart Primary 1 4GiB
> parted -s -a optimal $xfstests_dev mkpart Primary 4GiB 7GiB
>
> if [[ $test_fs == "xfs" ]]; then
> mkfs.xfs -f $TEST_DEV
> mkfs.xfs -f $SCRATCH_DEV
> elif [[ $test_fs == "ext4" ]]; then
> mkfs.ext4 -F $TEST_DEV
> mkfs.ext4 -F $SCRATCH_DEV
> elif [[ $test_fs == "ext2" ]]; then
> mkfs.ext2 -F $TEST_DEV
> mkfs.ext2 -F $SCRATCH_DEV
> else
> echo "Invalid test fs '$test_fs'"
> fi
> }

Thanks. I'll fix the bug (and recheck whether something actually depends on
that assertion) and try testing with PMD faults enabled.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-10-15 09:14:18

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/6] ext4: Fix races between page faults and hole punching

On Wed 14-10-15 21:00:59, Ross Zwisler wrote:
> On Wed, Oct 14, 2015 at 01:30:22PM +0200, Jan Kara wrote:
> > Currently, page faults and hole punching are completely unsynchronized.
> > This can result in page fault faulting in a page into a range that we
> > are punching after truncate_pagecache_range() has been called and thus
> > we can end up with a page mapped to disk blocks that will be shortly
> > freed. Filesystem corruption will shortly follow. Note that the same
> > race is avoided for truncate by checking page fault offset against
> > i_size but there isn't similar mechanism available for punching holes.
> >
> > Fix the problem by creating new rw semaphore i_mmap_sem in inode and
> > grab it for writing over truncate and hole punching and for read over
> > page faults. We cannot easily use i_data_sem for this since that ranks
> > below transaction start and we need something ranking above it so that
> > it can be held over the whole truncate / hole punching operation.
> >
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/ext4/ext4.h | 10 +++++++++
> > fs/ext4/file.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++--------
> > fs/ext4/inode.c | 27 +++++++++++++++++++----
> > fs/ext4/super.c | 1 +
> > 4 files changed, 91 insertions(+), 13 deletions(-)
>
> I wonder if there are a few other operations in ext4_fallocate() that
> we may need to protect in addition to ext4_punch_hole()?
>
> Do ext4_collapse_range(), ext4_insert_range() and maybe even
> ext4_zero_range() need protection?

Yeah, they do. I'll recheck what have I missed... Thanks for catching this!

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2015-10-15 20:22:39

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 1/6] ext4: Fix races between page faults and hole punching

On Wed, Oct 14, 2015 at 09:00:59PM -0600, Ross Zwisler wrote:
> On Wed, Oct 14, 2015 at 01:30:22PM +0200, Jan Kara wrote:
> > Currently, page faults and hole punching are completely unsynchronized.
> > This can result in page fault faulting in a page into a range that we
> > are punching after truncate_pagecache_range() has been called and thus
> > we can end up with a page mapped to disk blocks that will be shortly
> > freed. Filesystem corruption will shortly follow. Note that the same
> > race is avoided for truncate by checking page fault offset against
> > i_size but there isn't similar mechanism available for punching holes.
> >
> > Fix the problem by creating new rw semaphore i_mmap_sem in inode and
> > grab it for writing over truncate and hole punching and for read over
> > page faults. We cannot easily use i_data_sem for this since that ranks
> > below transaction start and we need something ranking above it so that
> > it can be held over the whole truncate / hole punching operation.
> >
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/ext4/ext4.h | 10 +++++++++
> > fs/ext4/file.c | 66 +++++++++++++++++++++++++++++++++++++++++++++++++--------
> > fs/ext4/inode.c | 27 +++++++++++++++++++----
> > fs/ext4/super.c | 1 +
> > 4 files changed, 91 insertions(+), 13 deletions(-)
>
> I wonder if there are a few other operations in ext4_fallocate() that
> we may need to protect in addition to ext4_punch_hole()?
>
> Do ext4_collapse_range(), ext4_insert_range() and maybe even ext4_zero_range()
> need protection?

Yes, they do. Anything that does direct extent manipulation needs
to invalidate current mappings across the range and that requires
page fault serialisation.

Cheers,

Dave.
--
Dave Chinner
[email protected]