2019-10-03 11:34:19

by Matthew Bobrowski

[permalink] [raw]
Subject: [PATCH v4 0/8] ext4: port direct I/O to iomap infrastructure

This patch series ports the ext4 direct I/O paths over to the iomap
infrastructure. The legacy buffer_head based direct I/O paths have
subsequently been removed as they're no longer in use. The result of
this change is that ext4 now uses the newer iomap framework for direct
I/O operations, which results in an overall cleaner implementation and
keeps the code isolated from buffer_head internals. In addition to
this, a slight performance boost may be expected while using O_SYNC |
O_DIRECT I/O.

The changes within this patch series have been tested via xfstests in
both DAX and non-DAX modes using the various filesystem configuration
options i.e. 4k, dioread_nolock, etc.

Changes since v3:

- Introduced a couple preparation patches around refactoring the ext4
iomap code. This involved splitting chunks of the existing
->iomap_begin() callback into separate helpers.

- Moved out the orphan handling code into a higher level caller. It
used to be within ext4_iomap_begin(), but is now within
ext4_dio_write_iter() and similarily ext4_dax_write_iter().

- Renamed the helper function from ext4_dio_checks() to
ext4_dio_supported(). Overall, it just reads better when using this
helper throughout the code.

- Cleaned up the ->end_io() handler. This was a result of refactoring
ext4_handle_inode_extension() and allowing it to perform clean up
routines for extension cases rather than calling
ext4_handle_failed_inode_extension() explicitly.

- Added a couple comments here and there to bits of logic that aren't
immediately obvious.

- Rather than having the clean up code in a separate patch at the end
of the series, I've incorporated the clean up into the patches
directly.

Thank you to all that took the time to review the patch series and
provide very valuable feedback. This includes Jan Kara, Christoph
Hellwig, Ritesh Harjani, and anybody else that I may have missed.

Matthew Bobrowski (8):
ext4: move out iomap field population into separate helper
ext4: move out IOMAP_WRITE path into separate helper
ext4: introduce new callback for IOMAP_REPORT operations
ext4: introduce direct I/O read path using iomap infrastructure
ext4: move inode extension/truncate code out from ->iomap_end()
callback
ext4: move inode extension checks out from ext4_iomap_alloc()
ext4: reorder map.m_flags checks in ext4_set_iomap()
ext4: introduce direct I/O write path using iomap infrastructure

fs/ext4/ext4.h | 4 +-
fs/ext4/extents.c | 11 +-
fs/ext4/file.c | 387 ++++++++++++++++++++-----
fs/ext4/inode.c | 709 +++++++++++-----------------------------------
4 files changed, 484 insertions(+), 627 deletions(-)

--
2.20.1


2019-10-03 11:34:36

by Matthew Bobrowski

[permalink] [raw]
Subject: [PATCH v4 1/8] ext4: move out iomap field population into separate helper

Separate the iomap field population chunk into a separate simple
helper routine. This helps reducing the overall clutter within the
ext4_iomap_begin() callback, especially as we move across more code to
make use of iomap infrastructure.

Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/inode.c | 65 ++++++++++++++++++++++++++++---------------------
1 file changed, 37 insertions(+), 28 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 516faa280ced..1ccdc14c4d69 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3406,10 +3406,43 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
return inode->i_state & I_DIRTY_DATASYNC;
}

+static int ext4_set_iomap(struct inode *inode, struct iomap *iomap, u16 type,
+ unsigned long first_block, struct ext4_map_blocks *map)
+{
+ u8 blkbits = inode->i_blkbits;
+
+ iomap->flags = 0;
+ if (ext4_inode_datasync_dirty(inode))
+ iomap->flags |= IOMAP_F_DIRTY;
+ iomap->bdev = inode->i_sb->s_bdev;
+ iomap->dax_dev = EXT4_SB(inode->i_sb)->s_daxdev;
+ iomap->offset = (u64) first_block << blkbits;
+ iomap->length = (u64) map->m_len << blkbits;
+
+ if (type) {
+ iomap->type = type;
+ iomap->addr = IOMAP_NULL_ADDR;
+ } else {
+ if (map->m_flags & EXT4_MAP_MAPPED) {
+ iomap->type = IOMAP_MAPPED;
+ } else if (map->m_flags & EXT4_MAP_UNWRITTEN) {
+ iomap->type = IOMAP_UNWRITTEN;
+ } else {
+ WARN_ON_ONCE(1);
+ return -EIO;
+ }
+ iomap->addr = (u64) map->m_pblk << blkbits;
+ }
+
+ if (map->m_flags & EXT4_MAP_NEW)
+ iomap->flags |= IOMAP_F_NEW;
+ return 0;
+}
+
static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap)
{
- struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
+ u16 type = 0;
unsigned int blkbits = inode->i_blkbits;
unsigned long first_block, last_block;
struct ext4_map_blocks map;
@@ -3523,33 +3556,9 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
return ret;
}

- iomap->flags = 0;
- if (ext4_inode_datasync_dirty(inode))
- iomap->flags |= IOMAP_F_DIRTY;
- iomap->bdev = inode->i_sb->s_bdev;
- iomap->dax_dev = sbi->s_daxdev;
- iomap->offset = (u64)first_block << blkbits;
- iomap->length = (u64)map.m_len << blkbits;
-
- if (ret == 0) {
- iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
- iomap->addr = IOMAP_NULL_ADDR;
- } else {
- if (map.m_flags & EXT4_MAP_MAPPED) {
- iomap->type = IOMAP_MAPPED;
- } else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
- iomap->type = IOMAP_UNWRITTEN;
- } else {
- WARN_ON_ONCE(1);
- return -EIO;
- }
- iomap->addr = (u64)map.m_pblk << blkbits;
- }
-
- if (map.m_flags & EXT4_MAP_NEW)
- iomap->flags |= IOMAP_F_NEW;
-
- return 0;
+ if (!ret)
+ type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
+ return ext4_set_iomap(inode, iomap, type, first_block, &map);
}

static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
--
2.20.1

2019-10-03 11:35:35

by Matthew Bobrowski

[permalink] [raw]
Subject: [PATCH v4 2/8] ext4: move out IOMAP_WRITE path into separate helper

In preparation for porting across the direct I/O path to iomap, split
out the IOMAP_WRITE logic into a separate helper. This way, we don't
need to clutter the ext4_iomap_begin() callback.

Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/inode.c | 110 ++++++++++++++++++++++++++----------------------
1 file changed, 60 insertions(+), 50 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1ccdc14c4d69..caeb3dec0dec 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3439,6 +3439,62 @@ static int ext4_set_iomap(struct inode *inode, struct iomap *iomap, u16 type,
return 0;
}

+static int ext4_iomap_alloc(struct inode *inode,
+ unsigned flags,
+ unsigned long first_block,
+ struct ext4_map_blocks *map)
+{
+ handle_t *handle;
+ u8 blkbits = inode->i_blkbits;
+ int ret, dio_credits, retries = 0;
+
+ /*
+ * Trim mapping request to the maximum value that we can map
+ * at once for direct I/O.
+ */
+ if (map->m_len > DIO_MAX_BLOCKS)
+ map->m_len = DIO_MAX_BLOCKS;
+ dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
+retry:
+ /*
+ * Either we allocate blocks and then we don't get unwritten
+ * extent so we have reserved enough credits, or the blocks
+ * are already allocated and unwritten. In that case, the
+ * extent conversion fits in the credits as well.
+ */
+ handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, dio_credits);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
+ ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
+ if (ret < 0)
+ goto journal_stop;
+
+ /*
+ * If we have allocated blocks beyond the EOF, we need to make
+ * sure that they get truncate if we crash before updating the
+ * inode size metadata in ext4_iomap_end(). For faults, we
+ * don't need to do that (and cannot due to the orphan list
+ * operations needing an inode_lock()). If we happen to
+ * instantiate blocks beyond EOF, it is because we race with a
+ * truncate operation, which already has added the inode onto
+ * the orphan list.
+ */
+ if (!(flags & IOMAP_FAULT) && first_block + map->m_len >
+ (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
+ int err;
+
+ err = ext4_orphan_add(handle, inode);
+ if (err < 0)
+ ret = err;
+ }
+journal_stop:
+ ext4_journal_stop(handle);
+ if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+ goto retry;
+ return ret;
+}
+
static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap)
{
@@ -3500,62 +3556,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
}
}
} else if (flags & IOMAP_WRITE) {
- int dio_credits;
- handle_t *handle;
- int retries = 0;
-
- /* Trim mapping request to maximum we can map at once for DIO */
- if (map.m_len > DIO_MAX_BLOCKS)
- map.m_len = DIO_MAX_BLOCKS;
- dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
-retry:
- /*
- * Either we allocate blocks and then we don't get unwritten
- * extent so we have reserved enough credits, or the blocks
- * are already allocated and unwritten and in that case
- * extent conversion fits in the credits as well.
- */
- handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
- dio_credits);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
-
- ret = ext4_map_blocks(handle, inode, &map,
- EXT4_GET_BLOCKS_CREATE_ZERO);
- if (ret < 0) {
- ext4_journal_stop(handle);
- if (ret == -ENOSPC &&
- ext4_should_retry_alloc(inode->i_sb, &retries))
- goto retry;
- return ret;
- }
-
- /*
- * If we added blocks beyond i_size, we need to make sure they
- * will get truncated if we crash before updating i_size in
- * ext4_iomap_end(). For faults we don't need to do that (and
- * even cannot because for orphan list operations inode_lock is
- * required) - if we happen to instantiate block beyond i_size,
- * it is because we race with truncate which has already added
- * the inode to the orphan list.
- */
- if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
- (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
- int err;
-
- err = ext4_orphan_add(handle, inode);
- if (err < 0) {
- ext4_journal_stop(handle);
- return err;
- }
- }
- ext4_journal_stop(handle);
+ ret = ext4_iomap_alloc(inode, flags, first_block, &map);
} else {
ret = ext4_map_blocks(NULL, inode, &map, 0);
if (ret < 0)
return ret;
}

+ if (ret < 0)
+ return ret;
+
if (!ret)
type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
return ext4_set_iomap(inode, iomap, type, first_block, &map);
--
2.20.1

2019-10-03 11:35:46

by Matthew Bobrowski

[permalink] [raw]
Subject: [PATCH v4 4/8] ext4: introduce direct I/O read path using iomap infrastructure

This patch introduces a new direct I/O read path that makes use of the
iomap infrastructure.

The new function ext4_dio_read_iter() is responsible for calling into
the iomap infrastructure via iomap_dio_rw(). If the read operation
being performed on the inode does not pass the preliminary checks
performed within ext4_dio_supported(), then we simply fallback to
buffered I/O in order to fulfil the request.

Existing direct I/O read buffer_head code has been removed as it's now
redundant.

Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/file.c | 58 +++++++++++++++++++++++++++++++++++++++++++++----
fs/ext4/inode.c | 32 +--------------------------
2 files changed, 55 insertions(+), 35 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index ab75aee3e687..69ac042fb74b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -34,6 +34,53 @@
#include "xattr.h"
#include "acl.h"

+static bool ext4_dio_supported(struct inode *inode)
+{
+ if (IS_ENABLED(CONFIG_FS_ENCRYPTION) && IS_ENCRYPTED(inode))
+ return false;
+ if (fsverity_active(inode))
+ return false;
+ if (ext4_should_journal_data(inode))
+ return false;
+ if (ext4_has_inline_data(inode))
+ return false;
+ return true;
+}
+
+static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+ ssize_t ret;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ /*
+ * Get exclusion from truncate and other inode operations.
+ */
+ if (!inode_trylock_shared(inode)) {
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return -EAGAIN;
+ inode_lock_shared(inode);
+ }
+
+ if (!ext4_dio_supported(inode)) {
+ inode_unlock_shared(inode);
+ /*
+ * Fallback to buffered I/O if the operation being
+ * performed on the inode is not supported by direct
+ * I/O. The IOCB_DIRECT flag needs to be cleared here
+ * in order to ensure that the direct I/O path withiin
+ * generic_file_read_iter() is not taken.
+ */
+ iocb->ki_flags &= ~IOCB_DIRECT;
+ return generic_file_read_iter(iocb, to);
+ }
+
+ ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL);
+ inode_unlock_shared(inode);
+
+ file_accessed(iocb->ki_filp);
+ return ret;
+}
+
#ifdef CONFIG_FS_DAX
static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
@@ -64,16 +111,19 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)

static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
- if (unlikely(ext4_forced_shutdown(EXT4_SB(file_inode(iocb->ki_filp)->i_sb))))
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
return -EIO;

if (!iov_iter_count(to))
return 0; /* skip atime */

-#ifdef CONFIG_FS_DAX
- if (IS_DAX(file_inode(iocb->ki_filp)))
+ if (IS_DAX(inode))
return ext4_dax_read_iter(iocb, to);
-#endif
+
+ if (iocb->ki_flags & IOCB_DIRECT)
+ return ext4_dio_read_iter(iocb, to);
return generic_file_read_iter(iocb, to);
}

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1dace576b8bd..159ffb92f82d 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -863,9 +863,6 @@ int ext4_dio_get_block(struct inode *inode, sector_t iblock,
{
/* We don't expect handle for direct IO */
WARN_ON_ONCE(ext4_journal_current_handle());
-
- if (!create)
- return _ext4_get_block(inode, iblock, bh, 0);
return ext4_get_block_trans(inode, iblock, bh, EXT4_GET_BLOCKS_CREATE);
}

@@ -3855,30 +3852,6 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
return ret;
}

-static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct address_space *mapping = iocb->ki_filp->f_mapping;
- struct inode *inode = mapping->host;
- size_t count = iov_iter_count(iter);
- ssize_t ret;
-
- /*
- * Shared inode_lock is enough for us - it protects against concurrent
- * writes & truncates and since we take care of writing back page cache,
- * we are protected against page writeback as well.
- */
- inode_lock_shared(inode);
- ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,
- iocb->ki_pos + count - 1);
- if (ret)
- goto out_unlock;
- ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
- iter, ext4_dio_get_block, NULL, NULL, 0);
-out_unlock:
- inode_unlock_shared(inode);
- return ret;
-}
-
static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
struct file *file = iocb->ki_filp;
@@ -3905,10 +3878,7 @@ static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
return 0;

trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
- if (iov_iter_rw(iter) == READ)
- ret = ext4_direct_IO_read(iocb, iter);
- else
- ret = ext4_direct_IO_write(iocb, iter);
+ ret = ext4_direct_IO_write(iocb, iter);
trace_ext4_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), ret);
return ret;
}
--
2.20.1

2019-10-03 11:35:47

by Matthew Bobrowski

[permalink] [raw]
Subject: [PATCH v4 6/8] ext4: move inode extension checks out from ext4_iomap_alloc()

We lift the inode extension/orphan list handling logic out from
ext4_iomap_alloc() and place it within the caller
ext4_dax_write_iter().

Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/file.c | 17 +++++++++++++++++
fs/ext4/inode.c | 22 ----------------------
2 files changed, 17 insertions(+), 22 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 2883711e8a33..f64da0c590b2 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -309,6 +309,7 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
ssize_t ret;
size_t count;
loff_t offset;
+ handle_t *handle;
struct inode *inode = file_inode(iocb->ki_filp);

if (!inode_trylock(inode)) {
@@ -328,6 +329,22 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)

offset = iocb->ki_pos;
count = iov_iter_count(from);
+
+ if (offset + count > EXT4_I(inode)->i_disksize) {
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+
+ ret = ext4_orphan_add(handle, inode);
+ if (ret) {
+ ext4_journal_stop(handle);
+ goto out;
+ }
+ ext4_journal_stop(handle);
+ }
+
ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);

error = ext4_handle_inode_extension(inode, offset, ret, count);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d616062b603e..e133dda55063 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3508,7 +3508,6 @@ static int ext4_iomap_alloc(struct inode *inode,
struct ext4_map_blocks *map)
{
handle_t *handle;
- u8 blkbits = inode->i_blkbits;
int ret, dio_credits, retries = 0;

/*
@@ -3530,28 +3529,7 @@ static int ext4_iomap_alloc(struct inode *inode,
return PTR_ERR(handle);

ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
- if (ret < 0)
- goto journal_stop;

- /*
- * If we have allocated blocks beyond the EOF, we need to make
- * sure that they get truncate if we crash before updating the
- * inode size metadata in ext4_iomap_end(). For faults, we
- * don't need to do that (and cannot due to the orphan list
- * operations needing an inode_lock()). If we happen to
- * instantiate blocks beyond EOF, it is because we race with a
- * truncate operation, which already has added the inode onto
- * the orphan list.
- */
- if (!(flags & IOMAP_FAULT) && first_block + map->m_len >
- (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
- int err;
-
- err = ext4_orphan_add(handle, inode);
- if (err < 0)
- ret = err;
- }
-journal_stop:
ext4_journal_stop(handle);
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
goto retry;
--
2.20.1

2019-10-03 11:37:16

by Matthew Bobrowski

[permalink] [raw]
Subject: [PATCH v4 7/8] ext4: reorder map.m_flags checks in ext4_set_iomap()

For iomap direct IO write code path changes, we need to accommodate
for the case where the block mapping flags passed to ext4_map_blocks()
will result in m_flags having both EXT4_MAP_MAPPED and
EXT4_MAP_UNWRITTEN bits set. In order for the allocated unwritten
extents to be converted properly in the end_io handler, iomap->type
must be set to IOMAP_UNWRITTEN, so we need to reshuffle the
conditional statement in order to achieve this.

This change is a no-op for DAX code path as the block mapping flag
passed to ext4_map_blocks() when IS_DAX(inode) never results in
EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN being set at once.

Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/inode.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e133dda55063..63ad23ae05b8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3420,10 +3420,20 @@ static int ext4_set_iomap(struct inode *inode, struct iomap *iomap, u16 type,
iomap->type = type;
iomap->addr = IOMAP_NULL_ADDR;
} else {
- if (map->m_flags & EXT4_MAP_MAPPED) {
- iomap->type = IOMAP_MAPPED;
- } else if (map->m_flags & EXT4_MAP_UNWRITTEN) {
+ /*
+ * Flags passed to ext4_map_blocks() for direct I/O
+ * writes can result in map->m_flags having both
+ * EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits set. In
+ * order for allocated extents to be converted to
+ * written extents in the ->end_io handler correctly,
+ * we need to ensure that the iomap->type is set
+ * approprately. Thus, we need to check whether
+ * EXT4_MAP_UNWRITTEN is set first.
+ */
+ if (map->m_flags & EXT4_MAP_UNWRITTEN) {
iomap->type = IOMAP_UNWRITTEN;
+ } else if (map->m_flags & EXT4_MAP_MAPPED) {
+ iomap->type = IOMAP_MAPPED;
} else {
WARN_ON_ONCE(1);
return -EIO;
--
2.20.1

2019-10-03 11:37:20

by Matthew Bobrowski

[permalink] [raw]
Subject: [PATCH v4 3/8] ext4: introduce new callback for IOMAP_REPORT operations

As part of ext4_iomap_begin() cleanups and port across direct I/O path
to make use of iomap infrastructure, we split IOMAP_REPORT operations
into a separate ->iomap_begin() handler.

Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/file.c | 6 ++-
fs/ext4/inode.c | 129 ++++++++++++++++++++++++++++--------------------
3 files changed, 80 insertions(+), 56 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 03db3e71676c..d0d88f411a44 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3379,6 +3379,7 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
}

extern const struct iomap_ops ext4_iomap_ops;
+extern const struct iomap_ops ext4_iomap_report_ops;

static inline int ext4_buffer_uptodate(struct buffer_head *bh)
{
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 8d2bbcc2d813..ab75aee3e687 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -494,12 +494,14 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int whence)
maxbytes, i_size_read(inode));
case SEEK_HOLE:
inode_lock_shared(inode);
- offset = iomap_seek_hole(inode, offset, &ext4_iomap_ops);
+ offset = iomap_seek_hole(inode, offset,
+ &ext4_iomap_report_ops);
inode_unlock_shared(inode);
break;
case SEEK_DATA:
inode_lock_shared(inode);
- offset = iomap_seek_data(inode, offset, &ext4_iomap_ops);
+ offset = iomap_seek_data(inode, offset,
+ &ext4_iomap_report_ops);
inode_unlock_shared(inode);
break;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index caeb3dec0dec..1dace576b8bd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3439,6 +3439,72 @@ static int ext4_set_iomap(struct inode *inode, struct iomap *iomap, u16 type,
return 0;
}

+static u16 ext4_iomap_check_delalloc(struct inode *inode,
+ struct ext4_map_blocks *map)
+{
+ struct extent_status es;
+ ext4_lblk_t end = map->m_lblk + map->m_len - 1;
+
+ ext4_es_find_extent_range(inode, &ext4_es_is_delayed, map->m_lblk,
+ end, &es);
+
+ /* Entire range is a hole */
+ if (!es.es_len || es.es_lblk > end)
+ return IOMAP_HOLE;
+ if (es.es_lblk <= map->m_lblk) {
+ ext4_lblk_t offset = 0;
+
+ if (es.es_lblk < map->m_lblk)
+ offset = map->m_lblk - es.es_lblk;
+ map->m_lblk = es.es_lblk + offset;
+ map->m_len = es.es_len - offset;
+ return IOMAP_DELALLOC;
+ }
+
+ /* Range starts with a hole */
+ map->m_len = es.es_lblk - map->m_lblk;
+ return IOMAP_HOLE;
+}
+
+static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
+ loff_t length, unsigned flags,
+ struct iomap *iomap)
+{
+ int ret;
+ u16 type = 0;
+ struct ext4_map_blocks map;
+ u8 blkbits = inode->i_blkbits;
+ unsigned long first_block, last_block;
+
+ if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
+ return -EINVAL;
+ first_block = offset >> blkbits;
+ last_block = min_t(loff_t, (offset + length - 1) >> blkbits,
+ EXT4_MAX_LOGICAL_BLOCK);
+
+ if (ext4_has_inline_data(inode)) {
+ ret = ext4_inline_data_iomap(inode, iomap);
+ if (ret != -EAGAIN) {
+ if (ret == 0 && offset >= iomap->length)
+ ret = -ENOENT;
+ return ret;
+ }
+ }
+
+ map.m_lblk = first_block;
+ map.m_len = last_block = first_block + 1;
+ ret = ext4_map_blocks(NULL, inode, &map, 0);
+ if (ret < 0)
+ return ret;
+ if (ret == 0)
+ type = ext4_iomap_check_delalloc(inode, &map);
+ return ext4_set_iomap(inode, iomap, type, first_block, &map);
+}
+
+const struct iomap_ops ext4_iomap_report_ops = {
+ .iomap_begin = ext4_iomap_begin_report,
+};
+
static int ext4_iomap_alloc(struct inode *inode,
unsigned flags,
unsigned long first_block,
@@ -3498,12 +3564,10 @@ static int ext4_iomap_alloc(struct inode *inode,
static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap)
{
- u16 type = 0;
- unsigned int blkbits = inode->i_blkbits;
- unsigned long first_block, last_block;
- struct ext4_map_blocks map;
- bool delalloc = false;
int ret;
+ struct ext4_map_blocks map;
+ u8 blkbits = inode->i_blkbits;
+ unsigned long first_block, last_block;

if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
return -EINVAL;
@@ -3511,64 +3575,21 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
last_block = min_t(loff_t, (offset + length - 1) >> blkbits,
EXT4_MAX_LOGICAL_BLOCK);

- if (flags & IOMAP_REPORT) {
- if (ext4_has_inline_data(inode)) {
- ret = ext4_inline_data_iomap(inode, iomap);
- if (ret != -EAGAIN) {
- if (ret == 0 && offset >= iomap->length)
- ret = -ENOENT;
- return ret;
- }
- }
- } else {
- if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
- return -ERANGE;
- }
+ if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
+ return -ERANGE;

map.m_lblk = first_block;
map.m_len = last_block - first_block + 1;

- if (flags & IOMAP_REPORT) {
- ret = ext4_map_blocks(NULL, inode, &map, 0);
- if (ret < 0)
- return ret;
-
- if (ret == 0) {
- ext4_lblk_t end = map.m_lblk + map.m_len - 1;
- struct extent_status es;
-
- ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
- map.m_lblk, end, &es);
-
- if (!es.es_len || es.es_lblk > end) {
- /* entire range is a hole */
- } else if (es.es_lblk > map.m_lblk) {
- /* range starts with a hole */
- map.m_len = es.es_lblk - map.m_lblk;
- } else {
- ext4_lblk_t offs = 0;
-
- if (es.es_lblk < map.m_lblk)
- offs = map.m_lblk - es.es_lblk;
- map.m_lblk = es.es_lblk + offs;
- map.m_len = es.es_len - offs;
- delalloc = true;
- }
- }
- } else if (flags & IOMAP_WRITE) {
+ if (flags & IOMAP_WRITE)
ret = ext4_iomap_alloc(inode, flags, first_block, &map);
- } else {
+ else
ret = ext4_map_blocks(NULL, inode, &map, 0);
- if (ret < 0)
- return ret;
- }

if (ret < 0)
return ret;
-
- if (!ret)
- type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
- return ext4_set_iomap(inode, iomap, type, first_block, &map);
+ return ext4_set_iomap(inode, iomap, ret ? 0 : IOMAP_HOLE, first_block,
+ &map);
}

static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
--
2.20.1

2019-10-03 11:38:18

by Matthew Bobrowski

[permalink] [raw]
Subject: [PATCH v4 8/8] ext4: introduce direct I/O write path using iomap infrastructure

This patch introduces a new direct I/O write code path implementation
that makes use of the iomap infrastructure.

All direct I/O write operations are now passed from the ->write_iter()
callback to the new function ext4_dio_write_iter(). This function is
responsible for calling into iomap infrastructure via
iomap_dio_rw(). Snippets of the direct I/O code from within
ext4_file_write_iter(), such as checking whether the IO request is
unaligned asynchronous I/O, or whether it will ber overwriting
allocated and initialized blocks has been moved out and into
ext4_dio_write_iter().

The block mapping flags that are passed to ext4_map_blocks() from
within ext4_dio_get_block() and friends have effectively been taken
out and introduced within the ext4_iomap_alloc().

For inode extension cases, the ->end_io() callback
ext4_dio_write_end_io() is responsible for calling into
ext4_handle_inode_extension() and performing the necessary metadata
updates. Failed writes that we're intended to be extend the inode will
have blocks truncated accordingly. The ->end_io() handler is also
responsible for converting allocated unwritten extents to written
extents.

In the instance of a short write, we fallback to buffered I/O and use
that method to complete whatever is left over in 'iter'. Any blocks
that have been allocated in preparation for direct I/O write will be
reused by buffered I/O, so there's no issue with leaving allocated
blocks beyond EOF.

Existing direct I/O write buffer_head code has been removed as it's
now redundant.

Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/ext4.h | 3 -
fs/ext4/extents.c | 11 +-
fs/ext4/file.c | 245 ++++++++++++++++++---------
fs/ext4/inode.c | 411 +++++-----------------------------------------
4 files changed, 214 insertions(+), 456 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d0d88f411a44..fdab3420539d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1579,7 +1579,6 @@ enum {
EXT4_STATE_NO_EXPAND, /* No space for expansion */
EXT4_STATE_DA_ALLOC_CLOSE, /* Alloc DA blks on close */
EXT4_STATE_EXT_MIGRATE, /* Inode is migrating */
- EXT4_STATE_DIO_UNWRITTEN, /* need convert on dio done*/
EXT4_STATE_NEWENTRY, /* File just added to dir */
EXT4_STATE_MAY_INLINE_DATA, /* may have in-inode data */
EXT4_STATE_EXT_PRECACHED, /* extents have been precached */
@@ -2560,8 +2559,6 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
-int ext4_dio_get_block(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create);
int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
struct buffer_head *bh, int create);
int ext4_walk_page_buffers(handle_t *handle,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index fb0f99dc8c22..df0629de3667 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1753,16 +1753,9 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
*/
if (ext1_ee_len + ext2_ee_len > EXT_INIT_MAX_LEN)
return 0;
- /*
- * The check for IO to unwritten extent is somewhat racy as we
- * increment i_unwritten / set EXT4_STATE_DIO_UNWRITTEN only after
- * dropping i_data_sem. But reserved blocks should save us in that
- * case.
- */
+
if (ext4_ext_is_unwritten(ex1) &&
- (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) ||
- atomic_read(&EXT4_I(inode)->i_unwritten) ||
- (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)))
+ ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)
return 0;
#ifdef AGGRESSIVE_TEST
if (ext1_ee_len >= 4)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index f64da0c590b2..a6ad747709cf 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -29,6 +29,7 @@
#include <linux/pagevec.h>
#include <linux/uio.h>
#include <linux/mman.h>
+#include <linux/backing-dev.h>
#include "ext4.h"
#include "ext4_jbd2.h"
#include "xattr.h"
@@ -154,13 +155,6 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
return 0;
}

-static void ext4_unwritten_wait(struct inode *inode)
-{
- wait_queue_head_t *wq = ext4_ioend_wq(inode);
-
- wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_unwritten) == 0));
-}
-
/*
* This tests whether the IO in question is block-aligned or not.
* Ext4 utilizes unwritten extents when hole-filling during direct IO, and they
@@ -213,13 +207,13 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;

+ if (unlikely(IS_IMMUTABLE(inode)))
+ return -EPERM;
+
ret = generic_write_checks(iocb, from);
if (ret <= 0)
return ret;

- if (unlikely(IS_IMMUTABLE(inode)))
- return -EPERM;
-
/*
* If we have encountered a bitmap-format file, the size limit
* is smaller than s_maxbytes, which is for extent-mapped files.
@@ -231,9 +225,39 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
return -EFBIG;
iov_iter_truncate(from, sbi->s_bitmap_maxbytes - iocb->ki_pos);
}
+
+ ret = file_modified(iocb->ki_filp);
+ if (ret)
+ return ret;
return iov_iter_count(from);
}

+static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
+ struct iov_iter *from)
+{
+ ssize_t ret;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return -EOPNOTSUPP;
+
+ inode_lock(inode);
+ ret = ext4_write_checks(iocb, from);
+ if (ret <= 0)
+ goto out;
+
+ current->backing_dev_info = inode_to_bdi(inode);
+ ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
+ current->backing_dev_info = NULL;
+out:
+ inode_unlock(inode);
+ if (likely(ret > 0)) {
+ iocb->ki_pos += ret;
+ ret = generic_write_sync(iocb, ret);
+ }
+ return ret;
+}
+
static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
ssize_t written, size_t count)
{
@@ -301,34 +325,77 @@ static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
return ret;
}

-#ifdef CONFIG_FS_DAX
-static ssize_t
-ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
+static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
+ int error, unsigned int flags)
+{
+ loff_t offset = iocb->ki_pos;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (!error && (flags & IOMAP_DIO_UNWRITTEN))
+ error = ext4_convert_unwritten_extents(NULL, inode, offset,
+ size);
+ return ext4_handle_inode_extension(inode, offset, error ? : size,
+ size);
+}
+
+static const struct iomap_dio_ops ext4_dio_write_ops = {
+ .end_io = ext4_dio_write_end_io,
+};
+
+static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
- int error;
ssize_t ret;
size_t count;
loff_t offset;
handle_t *handle;
struct inode *inode = file_inode(iocb->ki_filp);
+ bool extend = false, overwrite = false, unaligned_aio = false;

if (!inode_trylock(inode)) {
if (iocb->ki_flags & IOCB_NOWAIT)
return -EAGAIN;
inode_lock(inode);
}
+
+ if (!ext4_dio_supported(inode)) {
+ inode_unlock(inode);
+ /*
+ * Fallback to buffered I/O if the operation on the
+ * inode is not supported by direct I/O.
+ */
+ return ext4_buffered_write_iter(iocb, from);
+ }
+
ret = ext4_write_checks(iocb, from);
- if (ret <= 0)
- goto out;
- ret = file_remove_privs(iocb->ki_filp);
- if (ret)
- goto out;
- ret = file_update_time(iocb->ki_filp);
- if (ret)
- goto out;
+ if (ret <= 0) {
+ inode_unlock(inode);
+ return ret;
+ }

+ /*
+ * Unaligned direct asynchronous I/O must be serialized among
+ * each other as the zeroing of partial blocks of two
+ * competing unaligned asynchronous I/O writes can result in
+ * data corruption.
+ */
offset = iocb->ki_pos;
count = iov_iter_count(from);
+ if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
+ !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
+ unaligned_aio = true;
+ inode_dio_wait(inode);
+ }
+
+ /*
+ * Determine whether the I/O will overwrite allocated and
+ * initialized blocks. If so, check to see whether it is
+ * possible to take the dioread_nolock path.
+ */
+ if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
+ ext4_should_dioread_nolock(inode)) {
+ overwrite = true;
+ downgrade_write(&inode->i_rwsem);
+ }

if (offset + count > EXT4_I(inode)->i_disksize) {
handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
@@ -342,38 +409,64 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
ext4_journal_stop(handle);
goto out;
}
+
+ extend = true;
ext4_journal_stop(handle);
}

- ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
+ ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, &ext4_dio_write_ops);

- error = ext4_handle_inode_extension(inode, offset, ret, count);
- if (error)
- ret = error;
+ /*
+ * Unaligned direct asynchronous I/O must be the only I/O in
+ * flight or else any overlapping aligned I/O after unaligned
+ * I/O might result in data corruption. We also need to wait
+ * here in the case where the inode is being extended so that
+ * inode extension routines in the ->end_io handler are
+ * covered by the inode_lock().
+ */
+ if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
+ inode_dio_wait(inode);
+
+ /*
+ * The ->end_io handler may have failed to remove the inode
+ * from the orphan list in the case that the i_disksize got
+ * updated due to a delalloc writeback while the direct I/O
+ * was running.
+ */
+ if (extend && !list_empty(&EXT4_I(inode)->i_orphan)) {
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+ goto out;
+ }
+
+ if (inode->i_nlink)
+ ext4_orphan_del(handle, inode);
+ ext4_journal_stop(handle);
+ }
out:
- inode_unlock(inode);
- if (ret > 0)
- ret = generic_write_sync(iocb, ret);
+ if (overwrite)
+ inode_unlock_shared(inode);
+ else
+ inode_unlock(inode);
+
+ if (ret >= 0 && iov_iter_count(from))
+ return ext4_buffered_write_iter(iocb, from);
return ret;
}
-#endif

+#ifdef CONFIG_FS_DAX
static ssize_t
-ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
- struct inode *inode = file_inode(iocb->ki_filp);
- int o_direct = iocb->ki_flags & IOCB_DIRECT;
- int unaligned_aio = 0;
- int overwrite = 0;
+ int error;
ssize_t ret;
-
- if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
- return -EIO;
-
-#ifdef CONFIG_FS_DAX
- if (IS_DAX(inode))
- return ext4_dax_write_iter(iocb, from);
-#endif
+ size_t count;
+ loff_t offset;
+ handle_t *handle;
+ struct inode *inode = file_inode(iocb->ki_filp);

if (!inode_trylock(inode)) {
if (iocb->ki_flags & IOCB_NOWAIT)
@@ -385,48 +478,50 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (ret <= 0)
goto out;

- /*
- * Unaligned direct AIO must be serialized among each other as zeroing
- * of partial blocks of two competing unaligned AIOs can result in data
- * corruption.
- */
- if (o_direct && ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
- !is_sync_kiocb(iocb) &&
- ext4_unaligned_aio(inode, from, iocb->ki_pos)) {
- unaligned_aio = 1;
- ext4_unwritten_wait(inode);
- }
+ offset = iocb->ki_pos;
+ count = iov_iter_count(from);
+ if (offset + count > EXT4_I(inode)->i_disksize) {
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }

- iocb->private = &overwrite;
- /* Check whether we do a DIO overwrite or not */
- if (o_direct && !unaligned_aio) {
- if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
- if (ext4_should_dioread_nolock(inode))
- overwrite = 1;
- } else if (iocb->ki_flags & IOCB_NOWAIT) {
- ret = -EAGAIN;
+ ret = ext4_orphan_add(handle, inode);
+ if (ret) {
+ ext4_journal_stop(handle);
goto out;
}
+ ext4_journal_stop(handle);
}

- ret = __generic_file_write_iter(iocb, from);
- /*
- * Unaligned direct AIO must be the only IO in flight. Otherwise
- * overlapping aligned IO after unaligned might result in data
- * corruption.
- */
- if (ret == -EIOCBQUEUED && unaligned_aio)
- ext4_unwritten_wait(inode);
- inode_unlock(inode);
+ ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);

+ error = ext4_handle_inode_extension(inode, offset, ret, count);
+ if (error)
+ ret = error;
+out:
+ inode_unlock(inode);
if (ret > 0)
ret = generic_write_sync(iocb, ret);
-
return ret;
+}
+#endif

-out:
- inode_unlock(inode);
- return ret;
+static ssize_t
+ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+ return -EIO;
+
+ if (IS_DAX(inode))
+ return ext4_dax_write_iter(iocb, from);
+
+ if (iocb->ki_flags & IOCB_DIRECT)
+ return ext4_dio_write_iter(iocb, from);
+ return ext4_buffered_write_iter(iocb, from);
}

#ifdef CONFIG_FS_DAX
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 63ad23ae05b8..83c5daf7e0c4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -826,133 +826,6 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
/* Maximum number of blocks we map for direct IO at once. */
#define DIO_MAX_BLOCKS 4096

-/*
- * Get blocks function for the cases that need to start a transaction -
- * generally difference cases of direct IO and DAX IO. It also handles retries
- * in case of ENOSPC.
- */
-static int ext4_get_block_trans(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int flags)
-{
- int dio_credits;
- handle_t *handle;
- int retries = 0;
- int ret;
-
- /* Trim mapping request to maximum we can map at once for DIO */
- if (bh_result->b_size >> inode->i_blkbits > DIO_MAX_BLOCKS)
- bh_result->b_size = DIO_MAX_BLOCKS << inode->i_blkbits;
- dio_credits = ext4_chunk_trans_blocks(inode,
- bh_result->b_size >> inode->i_blkbits);
-retry:
- handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, dio_credits);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
-
- ret = _ext4_get_block(inode, iblock, bh_result, flags);
- ext4_journal_stop(handle);
-
- if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
- goto retry;
- return ret;
-}
-
-/* Get block function for DIO reads and writes to inodes without extents */
-int ext4_dio_get_block(struct inode *inode, sector_t iblock,
- struct buffer_head *bh, int create)
-{
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
- return ext4_get_block_trans(inode, iblock, bh, EXT4_GET_BLOCKS_CREATE);
-}
-
-/*
- * Get block function for AIO DIO writes when we create unwritten extent if
- * blocks are not allocated yet. The extent will be converted to written
- * after IO is complete.
- */
-static int ext4_dio_get_block_unwritten_async(struct inode *inode,
- sector_t iblock, struct buffer_head *bh_result, int create)
-{
- int ret;
-
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- ret = ext4_get_block_trans(inode, iblock, bh_result,
- EXT4_GET_BLOCKS_IO_CREATE_EXT);
-
- /*
- * When doing DIO using unwritten extents, we need io_end to convert
- * unwritten extents to written on IO completion. We allocate io_end
- * once we spot unwritten extent and store it in b_private. Generic
- * DIO code keeps b_private set and furthermore passes the value to
- * our completion callback in 'private' argument.
- */
- if (!ret && buffer_unwritten(bh_result)) {
- if (!bh_result->b_private) {
- ext4_io_end_t *io_end;
-
- io_end = ext4_init_io_end(inode, GFP_KERNEL);
- if (!io_end)
- return -ENOMEM;
- bh_result->b_private = io_end;
- ext4_set_io_unwritten_flag(inode, io_end);
- }
- set_buffer_defer_completion(bh_result);
- }
-
- return ret;
-}
-
-/*
- * Get block function for non-AIO DIO writes when we create unwritten extent if
- * blocks are not allocated yet. The extent will be converted to written
- * after IO is complete by ext4_direct_IO_write().
- */
-static int ext4_dio_get_block_unwritten_sync(struct inode *inode,
- sector_t iblock, struct buffer_head *bh_result, int create)
-{
- int ret;
-
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- ret = ext4_get_block_trans(inode, iblock, bh_result,
- EXT4_GET_BLOCKS_IO_CREATE_EXT);
-
- /*
- * Mark inode as having pending DIO writes to unwritten extents.
- * ext4_direct_IO_write() checks this flag and converts extents to
- * written.
- */
- if (!ret && buffer_unwritten(bh_result))
- ext4_set_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
-
- return ret;
-}
-
-static int ext4_dio_get_block_overwrite(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create)
-{
- int ret;
-
- ext4_debug("ext4_dio_get_block_overwrite: inode %lu, create flag %d\n",
- inode->i_ino, create);
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- ret = _ext4_get_block(inode, iblock, bh_result, 0);
- /*
- * Blocks should have been preallocated! ext4_file_write_iter() checks
- * that.
- */
- WARN_ON_ONCE(!buffer_mapped(bh_result) || buffer_unwritten(bh_result));
-
- return ret;
-}
-
-
/*
* `handle' can be NULL if create is zero
*/
@@ -3518,7 +3391,8 @@ static int ext4_iomap_alloc(struct inode *inode,
struct ext4_map_blocks *map)
{
handle_t *handle;
- int ret, dio_credits, retries = 0;
+ u8 blkbits = inode->i_blkbits;
+ int ret, dio_credits, m_flags = 0, retries = 0;

/*
* Trim mapping request to the maximum value that we can map
@@ -3538,7 +3412,34 @@ static int ext4_iomap_alloc(struct inode *inode,
if (IS_ERR(handle))
return PTR_ERR(handle);

- ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
+ /*
+ * DAX and direct I/O are the only two operations that are
+ * currently supported with IOMAP_WRITE.
+ */
+ WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
+ if (IS_DAX(inode))
+ m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
+ /*
+ * We use i_size instead of i_disksize here because delalloc
+ * writeback can complete at any point and subsequently push
+ * the i_disksize out to i_size. This could be beyond where
+ * the direct I/O is happening and thus expose allocated
+ * blocks to direct I/O reads.
+ */
+ else if ((first_block * (1 << blkbits)) >= i_size_read(inode))
+ m_flags = EXT4_GET_BLOCKS_CREATE;
+ else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+ m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
+
+ ret = ext4_map_blocks(handle, inode, map, m_flags);
+
+ /*
+ * We cannot fill holes in indirect tree based inodes as that
+ * could expose stale data in the case of a crash. Use the
+ * magic error code to fallback to buffered I/O.
+ */
+ if (!m_flags && !ret)
+ ret = -ENOTBLK;

ext4_journal_stop(handle);
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
@@ -3580,6 +3481,15 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ssize_t written, unsigned flags, struct iomap *iomap)
{
+ /*
+ * Check to see whether an error occurred while writing out
+ * data to the allocated blocks. If so, return the magic error
+ * code so that we fallback to buffered I/O. Any blocks that
+ * have been allocated in preparation for the direct I/O write
+ * will be reused during buffered I/O.
+ */
+ if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
+ return -ENOTBLK;
return 0;
}

@@ -3588,243 +3498,6 @@ const struct iomap_ops ext4_iomap_ops = {
.iomap_end = ext4_iomap_end,
};

-static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
- ssize_t size, void *private)
-{
- ext4_io_end_t *io_end = private;
-
- /* if not async direct IO just return */
- if (!io_end)
- return 0;
-
- ext_debug("ext4_end_io_dio(): io_end 0x%p "
- "for inode %lu, iocb 0x%p, offset %llu, size %zd\n",
- io_end, io_end->inode->i_ino, iocb, offset, size);
-
- /*
- * Error during AIO DIO. We cannot convert unwritten extents as the
- * data was not written. Just clear the unwritten flag and drop io_end.
- */
- if (size <= 0) {
- ext4_clear_io_unwritten_flag(io_end);
- size = 0;
- }
- io_end->offset = offset;
- io_end->size = size;
- ext4_put_io_end(io_end);
-
- return 0;
-}
-
-/*
- * Handling of direct IO writes.
- *
- * For ext4 extent files, ext4 will do direct-io write even to holes,
- * preallocated extents, and those write extend the file, no need to
- * fall back to buffered IO.
- *
- * For holes, we fallocate those blocks, mark them as unwritten
- * If those blocks were preallocated, we mark sure they are split, but
- * still keep the range to write as unwritten.
- *
- * The unwritten extents will be converted to written when DIO is completed.
- * For async direct IO, since the IO may still pending when return, we
- * set up an end_io call back function, which will do the conversion
- * when async direct IO completed.
- *
- * If the O_DIRECT write will extend the file then add this inode to the
- * orphan list. So recovery will truncate it back to the original size
- * if the machine crashes during the write.
- *
- */
-static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
- struct ext4_inode_info *ei = EXT4_I(inode);
- ssize_t ret;
- loff_t offset = iocb->ki_pos;
- size_t count = iov_iter_count(iter);
- int overwrite = 0;
- get_block_t *get_block_func = NULL;
- int dio_flags = 0;
- loff_t final_size = offset + count;
- int orphan = 0;
- handle_t *handle;
-
- if (final_size > inode->i_size || final_size > ei->i_disksize) {
- /* Credits for sb + inode write */
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out;
- }
- ret = ext4_orphan_add(handle, inode);
- if (ret) {
- ext4_journal_stop(handle);
- goto out;
- }
- orphan = 1;
- ext4_update_i_disksize(inode, inode->i_size);
- ext4_journal_stop(handle);
- }
-
- BUG_ON(iocb->private == NULL);
-
- /*
- * Make all waiters for direct IO properly wait also for extent
- * conversion. This also disallows race between truncate() and
- * overwrite DIO as i_dio_count needs to be incremented under i_mutex.
- */
- inode_dio_begin(inode);
-
- /* If we do a overwrite dio, i_mutex locking can be released */
- overwrite = *((int *)iocb->private);
-
- if (overwrite)
- inode_unlock(inode);
-
- /*
- * For extent mapped files we could direct write to holes and fallocate.
- *
- * Allocated blocks to fill the hole are marked as unwritten to prevent
- * parallel buffered read to expose the stale data before DIO complete
- * the data IO.
- *
- * As to previously fallocated extents, ext4 get_block will just simply
- * mark the buffer mapped but still keep the extents unwritten.
- *
- * For non AIO case, we will convert those unwritten extents to written
- * after return back from blockdev_direct_IO. That way we save us from
- * allocating io_end structure and also the overhead of offloading
- * the extent convertion to a workqueue.
- *
- * For async DIO, the conversion needs to be deferred when the
- * IO is completed. The ext4 end_io callback function will be
- * called to take care of the conversion work. Here for async
- * case, we allocate an io_end structure to hook to the iocb.
- */
- iocb->private = NULL;
- if (overwrite)
- get_block_func = ext4_dio_get_block_overwrite;
- else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
- round_down(offset, i_blocksize(inode)) >= inode->i_size) {
- get_block_func = ext4_dio_get_block;
- dio_flags = DIO_LOCKING | DIO_SKIP_HOLES;
- } else if (is_sync_kiocb(iocb)) {
- get_block_func = ext4_dio_get_block_unwritten_sync;
- dio_flags = DIO_LOCKING;
- } else {
- get_block_func = ext4_dio_get_block_unwritten_async;
- dio_flags = DIO_LOCKING;
- }
- ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, iter,
- get_block_func, ext4_end_io_dio, NULL,
- dio_flags);
-
- if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
- EXT4_STATE_DIO_UNWRITTEN)) {
- int err;
- /*
- * for non AIO case, since the IO is already
- * completed, we could do the conversion right here
- */
- err = ext4_convert_unwritten_extents(NULL, inode,
- offset, ret);
- if (err < 0)
- ret = err;
- ext4_clear_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
- }
-
- inode_dio_end(inode);
- /* take i_mutex locking again if we do a ovewrite dio */
- if (overwrite)
- inode_lock(inode);
-
- if (ret < 0 && final_size > inode->i_size)
- ext4_truncate_failed_write(inode);
-
- /* Handle extending of i_size after direct IO write */
- if (orphan) {
- int err;
-
- /* Credits for sb + inode write */
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- /*
- * We wrote the data but cannot extend
- * i_size. Bail out. In async io case, we do
- * not return error here because we have
- * already submmitted the corresponding
- * bio. Returning error here makes the caller
- * think that this IO is done and failed
- * resulting in race with bio's completion
- * handler.
- */
- if (!ret)
- ret = PTR_ERR(handle);
- if (inode->i_nlink)
- ext4_orphan_del(NULL, inode);
-
- goto out;
- }
- if (inode->i_nlink)
- ext4_orphan_del(handle, inode);
- if (ret > 0) {
- loff_t end = offset + ret;
- if (end > inode->i_size || end > ei->i_disksize) {
- ext4_update_i_disksize(inode, end);
- if (end > inode->i_size)
- i_size_write(inode, end);
- /*
- * We're going to return a positive `ret'
- * here due to non-zero-length I/O, so there's
- * no way of reporting error returns from
- * ext4_mark_inode_dirty() to userspace. So
- * ignore it.
- */
- ext4_mark_inode_dirty(handle, inode);
- }
- }
- err = ext4_journal_stop(handle);
- if (ret == 0)
- ret = err;
- }
-out:
- return ret;
-}
-
-static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
- size_t count = iov_iter_count(iter);
- loff_t offset = iocb->ki_pos;
- ssize_t ret;
-
-#ifdef CONFIG_FS_ENCRYPTION
- if (IS_ENCRYPTED(inode) && S_ISREG(inode->i_mode))
- return 0;
-#endif
- if (fsverity_active(inode))
- return 0;
-
- /*
- * If we are doing data journalling we don't support O_DIRECT
- */
- if (ext4_should_journal_data(inode))
- return 0;
-
- /* Let buffer I/O handle the inline data case. */
- if (ext4_has_inline_data(inode))
- return 0;
-
- trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
- ret = ext4_direct_IO_write(iocb, iter);
- trace_ext4_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), ret);
- return ret;
-}
-
/*
* Pages can be marked dirty completely asynchronously from ext4's journalling
* activity. By filemap_sync_pte(), try_to_unmap_one(), etc. We cannot do
@@ -3862,7 +3535,7 @@ static const struct address_space_operations ext4_aops = {
.bmap = ext4_bmap,
.invalidatepage = ext4_invalidatepage,
.releasepage = ext4_releasepage,
- .direct_IO = ext4_direct_IO,
+ .direct_IO = noop_direct_IO,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
@@ -3879,7 +3552,7 @@ static const struct address_space_operations ext4_journalled_aops = {
.bmap = ext4_bmap,
.invalidatepage = ext4_journalled_invalidatepage,
.releasepage = ext4_releasepage,
- .direct_IO = ext4_direct_IO,
+ .direct_IO = noop_direct_IO,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
};
@@ -3895,7 +3568,7 @@ static const struct address_space_operations ext4_da_aops = {
.bmap = ext4_bmap,
.invalidatepage = ext4_invalidatepage,
.releasepage = ext4_releasepage,
- .direct_IO = ext4_direct_IO,
+ .direct_IO = noop_direct_IO,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
--
2.20.1

2019-10-03 11:38:20

by Matthew Bobrowski

[permalink] [raw]
Subject: [PATCH v4 5/8] ext4: move inode extension/truncate code out from ->iomap_end() callback

In preparation for implementing the iomap direct I/O write path
modifications, the inode extension/truncate code needs to be moved out
from ext4_iomap_end(). For direct I/O, if the current code remained
within ext4_iomap_end() it would behave incorrectly. Updating the
inode size prior to converting unwritten extents to written extents
will potentially allow a racing direct I/O read operation to find
unwritten extents before they've been correctly converted.

The inode extension/truncate code has been moved out into a new helper
ext4_handle_inode_extension(). This function has been designed so that
it can be used by both DAX and direct I/O paths.

Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/file.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++-
fs/ext4/inode.c | 48 +-----------------------------
2 files changed, 79 insertions(+), 48 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 69ac042fb74b..2883711e8a33 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -33,6 +33,7 @@
#include "ext4_jbd2.h"
#include "xattr.h"
#include "acl.h"
+#include "truncate.h"

static bool ext4_dio_supported(struct inode *inode)
{
@@ -233,12 +234,82 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
return iov_iter_count(from);
}

+static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
+ ssize_t written, size_t count)
+{
+ int ret = 0;
+ handle_t *handle;
+ bool truncate = false;
+ u8 blkbits = inode->i_blkbits;
+ ext4_lblk_t written_blk, end_blk;
+
+ /*
+ * Note that EXT4_I(inode)->i_disksize can get extended up to
+ * inode->i_size while the IO was running due to writeback of
+ * delalloc blocks. But the code in ext4_iomap_alloc() is careful
+ * to use zeroed / unwritten extents if this is possible and thus
+ * we won't leave uninitialized blocks in a file even if we didn't
+ * succeed in writing as much as we planned.
+ */
+ WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
+ if (offset + count <= EXT4_I(inode)->i_disksize)
+ return written < 0 ? written : 0;
+
+ if (written < 0) {
+ ret = written;
+ goto truncate;
+ }
+
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto truncate;
+ }
+
+ if (ext4_update_inode_size(inode, offset + written))
+ ext4_mark_inode_dirty(handle, inode);
+
+ /*
+ * We may need to truncate allocated but not written blocks
+ * beyond EOF.
+ */
+ written_blk = ALIGN(offset + written, 1 << blkbits);
+ end_blk = ALIGN(offset + count, 1 << blkbits);
+ if (written_blk < end_blk && ext4_can_truncate(inode))
+ truncate = true;
+
+ /*
+ * Remove the inode from the orphan list if it has been
+ * extended and everything went OK.
+ */
+ if (!truncate && inode->i_nlink)
+ ext4_orphan_del(handle, inode);
+ ext4_journal_stop(handle);
+
+ if (truncate) {
+truncate:
+ ext4_truncate_failed_write(inode);
+ /*
+ * If the truncate operation failed early, then the
+ * inode may still be on the orphan list. In that
+ * case, we need to try remove the inode from the
+ * in-memory linked list.
+ */
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+ }
+ return ret;
+}
+
#ifdef CONFIG_FS_DAX
static ssize_t
ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
- struct inode *inode = file_inode(iocb->ki_filp);
+ int error;
ssize_t ret;
+ size_t count;
+ loff_t offset;
+ struct inode *inode = file_inode(iocb->ki_filp);

if (!inode_trylock(inode)) {
if (iocb->ki_flags & IOCB_NOWAIT)
@@ -255,7 +326,13 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (ret)
goto out;

+ offset = iocb->ki_pos;
+ count = iov_iter_count(from);
ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
+
+ error = ext4_handle_inode_extension(inode, offset, ret, count);
+ if (error)
+ ret = error;
out:
inode_unlock(inode);
if (ret > 0)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 159ffb92f82d..d616062b603e 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3592,53 +3592,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ssize_t written, unsigned flags, struct iomap *iomap)
{
- int ret = 0;
- handle_t *handle;
- int blkbits = inode->i_blkbits;
- bool truncate = false;
-
- if (!(flags & IOMAP_WRITE) || (flags & IOMAP_FAULT))
- return 0;
-
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto orphan_del;
- }
- if (ext4_update_inode_size(inode, offset + written))
- ext4_mark_inode_dirty(handle, inode);
- /*
- * We may need to truncate allocated but not written blocks beyond EOF.
- */
- if (iomap->offset + iomap->length >
- ALIGN(inode->i_size, 1 << blkbits)) {
- ext4_lblk_t written_blk, end_blk;
-
- written_blk = (offset + written) >> blkbits;
- end_blk = (offset + length) >> blkbits;
- if (written_blk < end_blk && ext4_can_truncate(inode))
- truncate = true;
- }
- /*
- * Remove inode from orphan list if we were extending a inode and
- * everything went fine.
- */
- if (!truncate && inode->i_nlink &&
- !list_empty(&EXT4_I(inode)->i_orphan))
- ext4_orphan_del(handle, inode);
- ext4_journal_stop(handle);
- if (truncate) {
- ext4_truncate_failed_write(inode);
-orphan_del:
- /*
- * If truncate failed early the inode might still be on the
- * orphan list; we need to make sure the inode is removed from
- * the orphan list in that case.
- */
- if (inode->i_nlink)
- ext4_orphan_del(NULL, inode);
- }
- return ret;
+ return 0;
}

const struct iomap_ops ext4_iomap_ops = {
--
2.20.1

2019-10-08 10:28:52

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 1/8] ext4: move out iomap field population into separate helper

On Thu 03-10-19 21:33:09, Matthew Bobrowski wrote:
> Separate the iomap field population chunk into a separate simple
> helper routine. This helps reducing the overall clutter within the
> ext4_iomap_begin() callback, especially as we move across more code to
> make use of iomap infrastructure.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
> ---
> fs/ext4/inode.c | 65 ++++++++++++++++++++++++++++---------------------
> 1 file changed, 37 insertions(+), 28 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 516faa280ced..1ccdc14c4d69 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3406,10 +3406,43 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
> return inode->i_state & I_DIRTY_DATASYNC;
> }
>
> +static int ext4_set_iomap(struct inode *inode, struct iomap *iomap, u16 type,
> + unsigned long first_block, struct ext4_map_blocks *map)
> +{
> + u8 blkbits = inode->i_blkbits;
> +
> + iomap->flags = 0;
> + if (ext4_inode_datasync_dirty(inode))
> + iomap->flags |= IOMAP_F_DIRTY;
> + iomap->bdev = inode->i_sb->s_bdev;
> + iomap->dax_dev = EXT4_SB(inode->i_sb)->s_daxdev;
> + iomap->offset = (u64) first_block << blkbits;
> + iomap->length = (u64) map->m_len << blkbits;
> +
> + if (type) {
> + iomap->type = type;
> + iomap->addr = IOMAP_NULL_ADDR;
> + } else {
> + if (map->m_flags & EXT4_MAP_MAPPED) {
> + iomap->type = IOMAP_MAPPED;
> + } else if (map->m_flags & EXT4_MAP_UNWRITTEN) {
> + iomap->type = IOMAP_UNWRITTEN;
> + } else {
> + WARN_ON_ONCE(1);
> + return -EIO;
> + }
> + iomap->addr = (u64) map->m_pblk << blkbits;
> + }

Looking at this function now, the 'type' argument looks a bit weird. Can we
perhaps just remove the 'type' argument and change the above to:

if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN)) {
if (map->m_flags & EXT4_MAP_MAPPED)
iomap->type = IOMAP_MAPPED;
else if (map->m_flags & EXT4_MAP_UNWRITTEN)
iomap->type = IOMAP_UNWRITTEN;
iomap->addr = (u64) map->m_pblk << blkbits;
} else {
iomap->type = IOMAP_HOLE;
iomap->addr = IOMAP_NULL_ADDR;
}

And then in ext4_iomap_begin() we overwrite the type to:

if (delalloc && iomap->type == IOMAP_HOLE)
iomap->type = IOMAP_DELALLOC;

That would IMO make ext4_set_iomap() arguments harder to get wrong.

Honza

> +
> + if (map->m_flags & EXT4_MAP_NEW)
> + iomap->flags |= IOMAP_F_NEW;
> + return 0;
> +}
> +
> static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> unsigned flags, struct iomap *iomap)
> {
> - struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> + u16 type = 0;
> unsigned int blkbits = inode->i_blkbits;
> unsigned long first_block, last_block;
> struct ext4_map_blocks map;
> @@ -3523,33 +3556,9 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> return ret;
> }
>
> - iomap->flags = 0;
> - if (ext4_inode_datasync_dirty(inode))
> - iomap->flags |= IOMAP_F_DIRTY;
> - iomap->bdev = inode->i_sb->s_bdev;
> - iomap->dax_dev = sbi->s_daxdev;
> - iomap->offset = (u64)first_block << blkbits;
> - iomap->length = (u64)map.m_len << blkbits;
> -
> - if (ret == 0) {
> - iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
> - iomap->addr = IOMAP_NULL_ADDR;
> - } else {
> - if (map.m_flags & EXT4_MAP_MAPPED) {
> - iomap->type = IOMAP_MAPPED;
> - } else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> - iomap->type = IOMAP_UNWRITTEN;
> - } else {
> - WARN_ON_ONCE(1);
> - return -EIO;
> - }
> - iomap->addr = (u64)map.m_pblk << blkbits;
> - }
> -
> - if (map.m_flags & EXT4_MAP_NEW)
> - iomap->flags |= IOMAP_F_NEW;
> -
> - return 0;
> + if (!ret)
> + type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
> + return ext4_set_iomap(inode, iomap, type, first_block, &map);
> }
>
> static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> --
> 2.20.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-10-08 10:34:27

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 2/8] ext4: move out IOMAP_WRITE path into separate helper

On Thu 03-10-19 21:33:29, Matthew Bobrowski wrote:
> In preparation for porting across the direct I/O path to iomap, split
> out the IOMAP_WRITE logic into a separate helper. This way, we don't
> need to clutter the ext4_iomap_begin() callback.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Just please reformat the comments to use full 80 column lines. Your Emacs
still doesn't seem to get it :)

Honza

> ---
> fs/ext4/inode.c | 110 ++++++++++++++++++++++++++----------------------
> 1 file changed, 60 insertions(+), 50 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1ccdc14c4d69..caeb3dec0dec 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3439,6 +3439,62 @@ static int ext4_set_iomap(struct inode *inode, struct iomap *iomap, u16 type,
> return 0;
> }
>
> +static int ext4_iomap_alloc(struct inode *inode,
> + unsigned flags,
> + unsigned long first_block,
> + struct ext4_map_blocks *map)
> +{
> + handle_t *handle;
> + u8 blkbits = inode->i_blkbits;
> + int ret, dio_credits, retries = 0;
> +
> + /*
> + * Trim mapping request to the maximum value that we can map
> + * at once for direct I/O.
> + */
> + if (map->m_len > DIO_MAX_BLOCKS)
> + map->m_len = DIO_MAX_BLOCKS;
> + dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
> +retry:
> + /*
> + * Either we allocate blocks and then we don't get unwritten
> + * extent so we have reserved enough credits, or the blocks
> + * are already allocated and unwritten. In that case, the
> + * extent conversion fits in the credits as well.
> + */
> + handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, dio_credits);
> + if (IS_ERR(handle))
> + return PTR_ERR(handle);
> +
> + ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
> + if (ret < 0)
> + goto journal_stop;
> +
> + /*
> + * If we have allocated blocks beyond the EOF, we need to make
> + * sure that they get truncate if we crash before updating the
> + * inode size metadata in ext4_iomap_end(). For faults, we
> + * don't need to do that (and cannot due to the orphan list
> + * operations needing an inode_lock()). If we happen to
> + * instantiate blocks beyond EOF, it is because we race with a
> + * truncate operation, which already has added the inode onto
> + * the orphan list.
> + */
> + if (!(flags & IOMAP_FAULT) && first_block + map->m_len >
> + (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
> + int err;
> +
> + err = ext4_orphan_add(handle, inode);
> + if (err < 0)
> + ret = err;
> + }
> +journal_stop:
> + ext4_journal_stop(handle);
> + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> + goto retry;
> + return ret;
> +}
> +
> static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> unsigned flags, struct iomap *iomap)
> {
> @@ -3500,62 +3556,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> }
> }
> } else if (flags & IOMAP_WRITE) {
> - int dio_credits;
> - handle_t *handle;
> - int retries = 0;
> -
> - /* Trim mapping request to maximum we can map at once for DIO */
> - if (map.m_len > DIO_MAX_BLOCKS)
> - map.m_len = DIO_MAX_BLOCKS;
> - dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
> -retry:
> - /*
> - * Either we allocate blocks and then we don't get unwritten
> - * extent so we have reserved enough credits, or the blocks
> - * are already allocated and unwritten and in that case
> - * extent conversion fits in the credits as well.
> - */
> - handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
> - dio_credits);
> - if (IS_ERR(handle))
> - return PTR_ERR(handle);
> -
> - ret = ext4_map_blocks(handle, inode, &map,
> - EXT4_GET_BLOCKS_CREATE_ZERO);
> - if (ret < 0) {
> - ext4_journal_stop(handle);
> - if (ret == -ENOSPC &&
> - ext4_should_retry_alloc(inode->i_sb, &retries))
> - goto retry;
> - return ret;
> - }
> -
> - /*
> - * If we added blocks beyond i_size, we need to make sure they
> - * will get truncated if we crash before updating i_size in
> - * ext4_iomap_end(). For faults we don't need to do that (and
> - * even cannot because for orphan list operations inode_lock is
> - * required) - if we happen to instantiate block beyond i_size,
> - * it is because we race with truncate which has already added
> - * the inode to the orphan list.
> - */
> - if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
> - (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
> - int err;
> -
> - err = ext4_orphan_add(handle, inode);
> - if (err < 0) {
> - ext4_journal_stop(handle);
> - return err;
> - }
> - }
> - ext4_journal_stop(handle);
> + ret = ext4_iomap_alloc(inode, flags, first_block, &map);
> } else {
> ret = ext4_map_blocks(NULL, inode, &map, 0);
> if (ret < 0)
> return ret;
> }
>
> + if (ret < 0)
> + return ret;
> +
> if (!ret)
> type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
> return ext4_set_iomap(inode, iomap, type, first_block, &map);
> --
> 2.20.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-10-08 10:42:41

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 3/8] ext4: introduce new callback for IOMAP_REPORT operations

On Thu 03-10-19 21:33:45, Matthew Bobrowski wrote:
> As part of ext4_iomap_begin() cleanups and port across direct I/O path
> to make use of iomap infrastructure, we split IOMAP_REPORT operations
> into a separate ->iomap_begin() handler.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

It would just need small adjustments if you change patch 1 as I suggested:

> +static u16 ext4_iomap_check_delalloc(struct inode *inode,
> + struct ext4_map_blocks *map)
> +{
> + struct extent_status es;
> + ext4_lblk_t end = map->m_lblk + map->m_len - 1;
> +
> + ext4_es_find_extent_range(inode, &ext4_es_is_delayed, map->m_lblk,
> + end, &es);
> +
> + /* Entire range is a hole */
> + if (!es.es_len || es.es_lblk > end)
> + return IOMAP_HOLE;
> + if (es.es_lblk <= map->m_lblk) {
> + ext4_lblk_t offset = 0;
> +
> + if (es.es_lblk < map->m_lblk)
> + offset = map->m_lblk - es.es_lblk;
> + map->m_lblk = es.es_lblk + offset;
> + map->m_len = es.es_len - offset;
> + return IOMAP_DELALLOC;
> + }
> +
> + /* Range starts with a hole */
> + map->m_len = es.es_lblk - map->m_lblk;
> + return IOMAP_HOLE;
> +}

This function would then be IMO better off to directly update 'iomap' as
needed after ext4_set_iomap() sets hole there.

Honza

> +
> +static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
> + loff_t length, unsigned flags,
> + struct iomap *iomap)
> +{
> + int ret;
> + u16 type = 0;
> + struct ext4_map_blocks map;
> + u8 blkbits = inode->i_blkbits;
> + unsigned long first_block, last_block;
> +
> + if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> + return -EINVAL;
> + first_block = offset >> blkbits;
> + last_block = min_t(loff_t, (offset + length - 1) >> blkbits,
> + EXT4_MAX_LOGICAL_BLOCK);
> +
> + if (ext4_has_inline_data(inode)) {
> + ret = ext4_inline_data_iomap(inode, iomap);
> + if (ret != -EAGAIN) {
> + if (ret == 0 && offset >= iomap->length)
> + ret = -ENOENT;
> + return ret;
> + }
> + }
> +
> + map.m_lblk = first_block;
> + map.m_len = last_block = first_block + 1;
> + ret = ext4_map_blocks(NULL, inode, &map, 0);
> + if (ret < 0)
> + return ret;
> + if (ret == 0)
> + type = ext4_iomap_check_delalloc(inode, &map);
> + return ext4_set_iomap(inode, iomap, type, first_block, &map);
> +}
> +
> +const struct iomap_ops ext4_iomap_report_ops = {
> + .iomap_begin = ext4_iomap_begin_report,
> +};
> +
> static int ext4_iomap_alloc(struct inode *inode,
> unsigned flags,
> unsigned long first_block,
> @@ -3498,12 +3564,10 @@ static int ext4_iomap_alloc(struct inode *inode,
> static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> unsigned flags, struct iomap *iomap)
> {
> - u16 type = 0;
> - unsigned int blkbits = inode->i_blkbits;
> - unsigned long first_block, last_block;
> - struct ext4_map_blocks map;
> - bool delalloc = false;
> int ret;
> + struct ext4_map_blocks map;
> + u8 blkbits = inode->i_blkbits;
> + unsigned long first_block, last_block;
>
> if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> return -EINVAL;
> @@ -3511,64 +3575,21 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> last_block = min_t(loff_t, (offset + length - 1) >> blkbits,
> EXT4_MAX_LOGICAL_BLOCK);
>
> - if (flags & IOMAP_REPORT) {
> - if (ext4_has_inline_data(inode)) {
> - ret = ext4_inline_data_iomap(inode, iomap);
> - if (ret != -EAGAIN) {
> - if (ret == 0 && offset >= iomap->length)
> - ret = -ENOENT;
> - return ret;
> - }
> - }
> - } else {
> - if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> - return -ERANGE;
> - }
> + if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> + return -ERANGE;
>
> map.m_lblk = first_block;
> map.m_len = last_block - first_block + 1;
>
> - if (flags & IOMAP_REPORT) {
> - ret = ext4_map_blocks(NULL, inode, &map, 0);
> - if (ret < 0)
> - return ret;
> -
> - if (ret == 0) {
> - ext4_lblk_t end = map.m_lblk + map.m_len - 1;
> - struct extent_status es;
> -
> - ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
> - map.m_lblk, end, &es);
> -
> - if (!es.es_len || es.es_lblk > end) {
> - /* entire range is a hole */
> - } else if (es.es_lblk > map.m_lblk) {
> - /* range starts with a hole */
> - map.m_len = es.es_lblk - map.m_lblk;
> - } else {
> - ext4_lblk_t offs = 0;
> -
> - if (es.es_lblk < map.m_lblk)
> - offs = map.m_lblk - es.es_lblk;
> - map.m_lblk = es.es_lblk + offs;
> - map.m_len = es.es_len - offs;
> - delalloc = true;
> - }
> - }
> - } else if (flags & IOMAP_WRITE) {
> + if (flags & IOMAP_WRITE)
> ret = ext4_iomap_alloc(inode, flags, first_block, &map);
> - } else {
> + else
> ret = ext4_map_blocks(NULL, inode, &map, 0);
> - if (ret < 0)
> - return ret;
> - }
>
> if (ret < 0)
> return ret;
> -
> - if (!ret)
> - type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
> - return ext4_set_iomap(inode, iomap, type, first_block, &map);
> + return ext4_set_iomap(inode, iomap, ret ? 0 : IOMAP_HOLE, first_block,
> + &map);
> }
>
> static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> --
> 2.20.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-10-08 10:53:09

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 4/8] ext4: introduce direct I/O read path using iomap infrastructure

On Thu 03-10-19 21:34:00, Matthew Bobrowski wrote:
> This patch introduces a new direct I/O read path that makes use of the
> iomap infrastructure.
>
> The new function ext4_dio_read_iter() is responsible for calling into
> the iomap infrastructure via iomap_dio_rw(). If the read operation
> being performed on the inode does not pass the preliminary checks
> performed within ext4_dio_supported(), then we simply fallback to
> buffered I/O in order to fulfil the request.
>
> Existing direct I/O read buffer_head code has been removed as it's now
> redundant.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>

The patch looks good to me. Just one small nit below. With that fixed, you
can add:

Reviewed-by: Jan Kara <[email protected]>

> + /*
> + * Get exclusion from truncate and other inode operations.
> + */
> + if (!inode_trylock_shared(inode)) {
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EAGAIN;
> + inode_lock_shared(inode);
> + }

I've noticed here you actually introduce new trylock pattern - previously
we had unconditional inode_lock_shared() in ext4_direct_IO_read(). So the
cleanest would be to just use unconditional inode_lock_shared() here and
then fixup IOCB_NOWAIT handling (I agree that was missing in the original
code) in a separate patch. And the pattern should rather look like:

if (iocb->ki_flags & IOCB_NOWAIT) {
if (!inode_trylock_shared(inode))
return -EAGAIN;
} else {
inode_lock_shared(inode);
}

to avoid two atomical operations instead of one in the fast path. No need
to repeat old mistakes when we know better :).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-10-08 11:25:52

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 5/8] ext4: move inode extension/truncate code out from ->iomap_end() callback

On Thu 03-10-19 21:34:18, Matthew Bobrowski wrote:
> In preparation for implementing the iomap direct I/O write path
> modifications, the inode extension/truncate code needs to be moved out
> from ext4_iomap_end(). For direct I/O, if the current code remained
> within ext4_iomap_end() it would behave incorrectly. Updating the
> inode size prior to converting unwritten extents to written extents
> will potentially allow a racing direct I/O read operation to find
> unwritten extents before they've been correctly converted.
>
> The inode extension/truncate code has been moved out into a new helper
> ext4_handle_inode_extension(). This function has been designed so that
> it can be used by both DAX and direct I/O paths.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>

Looks good to me. Fell free to add:

Reviewed-by: Jan Kara <[email protected]>

Just small nits below:

> +static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> + ssize_t written, size_t count)
> +{
> + int ret = 0;

I think both the function and callsites may be slightly simpler if you let
the function return 'written' or error (not 0 or error). But I'll leave
that decision upto you.

> + handle_t *handle;
> + bool truncate = false;
> + u8 blkbits = inode->i_blkbits;
> + ext4_lblk_t written_blk, end_blk;
> +
> + /*
> + * Note that EXT4_I(inode)->i_disksize can get extended up to
> + * inode->i_size while the IO was running due to writeback of
> + * delalloc blocks. But the code in ext4_iomap_alloc() is careful
> + * to use zeroed / unwritten extents if this is possible and thus
> + * we won't leave uninitialized blocks in a file even if we didn't
> + * succeed in writing as much as we planned.
> + */

Whitespace damaged here...

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-10-08 11:28:39

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] ext4: move inode extension checks out from ext4_iomap_alloc()

On Thu 03-10-19 21:34:36, Matthew Bobrowski wrote:
> We lift the inode extension/orphan list handling logic out from
> ext4_iomap_alloc() and place it within the caller
> ext4_dax_write_iter().
>
> Signed-off-by: Matthew Bobrowski <[email protected]>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/file.c | 17 +++++++++++++++++
> fs/ext4/inode.c | 22 ----------------------
> 2 files changed, 17 insertions(+), 22 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 2883711e8a33..f64da0c590b2 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -309,6 +309,7 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> ssize_t ret;
> size_t count;
> loff_t offset;
> + handle_t *handle;
> struct inode *inode = file_inode(iocb->ki_filp);
>
> if (!inode_trylock(inode)) {
> @@ -328,6 +329,22 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
>
> offset = iocb->ki_pos;
> count = iov_iter_count(from);
> +
> + if (offset + count > EXT4_I(inode)->i_disksize) {
> + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + goto out;
> + }
> +
> + ret = ext4_orphan_add(handle, inode);
> + if (ret) {
> + ext4_journal_stop(handle);
> + goto out;
> + }
> + ext4_journal_stop(handle);
> + }
> +
> ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
>
> error = ext4_handle_inode_extension(inode, offset, ret, count);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d616062b603e..e133dda55063 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3508,7 +3508,6 @@ static int ext4_iomap_alloc(struct inode *inode,
> struct ext4_map_blocks *map)
> {
> handle_t *handle;
> - u8 blkbits = inode->i_blkbits;
> int ret, dio_credits, retries = 0;
>
> /*
> @@ -3530,28 +3529,7 @@ static int ext4_iomap_alloc(struct inode *inode,
> return PTR_ERR(handle);
>
> ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
> - if (ret < 0)
> - goto journal_stop;
>
> - /*
> - * If we have allocated blocks beyond the EOF, we need to make
> - * sure that they get truncate if we crash before updating the
> - * inode size metadata in ext4_iomap_end(). For faults, we
> - * don't need to do that (and cannot due to the orphan list
> - * operations needing an inode_lock()). If we happen to
> - * instantiate blocks beyond EOF, it is because we race with a
> - * truncate operation, which already has added the inode onto
> - * the orphan list.
> - */
> - if (!(flags & IOMAP_FAULT) && first_block + map->m_len >
> - (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
> - int err;
> -
> - err = ext4_orphan_add(handle, inode);
> - if (err < 0)
> - ret = err;
> - }
> -journal_stop:
> ext4_journal_stop(handle);
> if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> goto retry;
> --
> 2.20.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-10-08 11:32:12

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 7/8] ext4: reorder map.m_flags checks in ext4_set_iomap()

On Thu 03-10-19 21:34:51, Matthew Bobrowski wrote:
> For iomap direct IO write code path changes, we need to accommodate
> for the case where the block mapping flags passed to ext4_map_blocks()
> will result in m_flags having both EXT4_MAP_MAPPED and
> EXT4_MAP_UNWRITTEN bits set. In order for the allocated unwritten
> extents to be converted properly in the end_io handler, iomap->type
> must be set to IOMAP_UNWRITTEN, so we need to reshuffle the
> conditional statement in order to achieve this.
>
> This change is a no-op for DAX code path as the block mapping flag
> passed to ext4_map_blocks() when IS_DAX(inode) never results in
> EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN being set at once.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Just those 72 columns limited comment lines... ;)

Honza

> ---
> fs/ext4/inode.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e133dda55063..63ad23ae05b8 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3420,10 +3420,20 @@ static int ext4_set_iomap(struct inode *inode, struct iomap *iomap, u16 type,
> iomap->type = type;
> iomap->addr = IOMAP_NULL_ADDR;
> } else {
> - if (map->m_flags & EXT4_MAP_MAPPED) {
> - iomap->type = IOMAP_MAPPED;
> - } else if (map->m_flags & EXT4_MAP_UNWRITTEN) {
> + /*
> + * Flags passed to ext4_map_blocks() for direct I/O
> + * writes can result in map->m_flags having both
> + * EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits set. In
> + * order for allocated extents to be converted to
> + * written extents in the ->end_io handler correctly,
> + * we need to ensure that the iomap->type is set
> + * approprately. Thus, we need to check whether
> + * EXT4_MAP_UNWRITTEN is set first.
> + */
> + if (map->m_flags & EXT4_MAP_UNWRITTEN) {
> iomap->type = IOMAP_UNWRITTEN;
> + } else if (map->m_flags & EXT4_MAP_MAPPED) {
> + iomap->type = IOMAP_MAPPED;
> } else {
> WARN_ON_ONCE(1);
> return -EIO;
> --
> 2.20.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-10-08 15:13:43

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 8/8] ext4: introduce direct I/O write path using iomap infrastructure

On Thu 03-10-19 21:35:05, Matthew Bobrowski wrote:
> This patch introduces a new direct I/O write code path implementation
> that makes use of the iomap infrastructure.
>
> All direct I/O write operations are now passed from the ->write_iter()
> callback to the new function ext4_dio_write_iter(). This function is
> responsible for calling into iomap infrastructure via
> iomap_dio_rw(). Snippets of the direct I/O code from within
> ext4_file_write_iter(), such as checking whether the IO request is
> unaligned asynchronous I/O, or whether it will ber overwriting
> allocated and initialized blocks has been moved out and into
> ext4_dio_write_iter().
>
> The block mapping flags that are passed to ext4_map_blocks() from
> within ext4_dio_get_block() and friends have effectively been taken
> out and introduced within the ext4_iomap_alloc().
>
> For inode extension cases, the ->end_io() callback
> ext4_dio_write_end_io() is responsible for calling into
> ext4_handle_inode_extension() and performing the necessary metadata
> updates. Failed writes that we're intended to be extend the inode will
> have blocks truncated accordingly. The ->end_io() handler is also
> responsible for converting allocated unwritten extents to written
> extents.
>
> In the instance of a short write, we fallback to buffered I/O and use
> that method to complete whatever is left over in 'iter'.

> Any blocks
> that have been allocated in preparation for direct I/O write will be
> reused by buffered I/O, so there's no issue with leaving allocated
> blocks beyond EOF.

This actually is not true as ext4_truncate_failed_write() will trim blocks
beyond EOF. Also this would not be 100% reliable as if we crash between DIO
short write succeeding and buffered write happening, we would leave inode
with blocks beyond EOF. So I'd just remove this sentence.

> Existing direct I/O write buffer_head code has been removed as it's
> now redundant.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>

...

> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> + int error, unsigned int flags)
> +{
> + loff_t offset = iocb->ki_pos;
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + if (!error && (flags & IOMAP_DIO_UNWRITTEN))
> + error = ext4_convert_unwritten_extents(NULL, inode, offset,
> + size);
> + return ext4_handle_inode_extension(inode, offset, error ? : size,
> + size);
> +}

I was pondering about this and I don't think we have it quite correct.
Still :-|. The problem is that iomap_dio_complete() will pass dio->size as
'size', which is the amount of submitted IO but not necessarily the amount
of blocks that were mapped (that can be larger). Thus
ext4_handle_inode_extension() can miss the fact that there are blocks
beyond EOF that need to be trimmed. And we have no way of finding that out
inside our ->end_io handler. Even iomap_dio_complete() doesn't have that
information so we'd need to add 'original length' to struct iomap_dio and
then pass it do ->end_io.

Seeing how difficult it is when a filesystem wants to complete the iocb
synchronously (regardless whether it is async or sync) and have all the
information in one place for further processing, I think it would be the
easiest to provide iomap_dio_rw_wait() that forces waiting for the iocb to
complete *and* returns the appropriate return value instead of pretty
useless EIOCBQUEUED. It is actually pretty trivial (patch attached). With
this we can then just call iomap_dio_rw_sync() for the inode extension case
with ->end_io doing just the unwritten extent processing and then call
ext4_handle_inode_extension() from ext4_direct_write_iter() where we would
have all the information we need.

Christoph, Darrick, what do you think about extending iomap like in the
attached patch (plus sample use in XFS)?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
(No filename) (4.03 kB)
0001-iomap-Allow-forcing-of-waiting-for-running-DIO-in-io.patch (2.97 kB)
0002-xfs-Use-iomap_dio_rw_wait.patch (1.34 kB)
Download all attachments

2019-10-09 06:01:17

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH v4 3/8] ext4: introduce new callback for IOMAP_REPORT operations



On 10/3/19 5:03 PM, Matthew Bobrowski wrote:
> As part of ext4_iomap_begin() cleanups and port across direct I/O path
> to make use of iomap infrastructure, we split IOMAP_REPORT operations
> into a separate ->iomap_begin() handler.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
> ---
> fs/ext4/ext4.h | 1 +
> fs/ext4/file.c | 6 ++-
> fs/ext4/inode.c | 129 ++++++++++++++++++++++++++++--------------------
> 3 files changed, 80 insertions(+), 56 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 03db3e71676c..d0d88f411a44 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3379,6 +3379,7 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
> }
>
> extern const struct iomap_ops ext4_iomap_ops;
> +extern const struct iomap_ops ext4_iomap_report_ops;
>
> static inline int ext4_buffer_uptodate(struct buffer_head *bh)
> {
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 8d2bbcc2d813..ab75aee3e687 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -494,12 +494,14 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int whence)
> maxbytes, i_size_read(inode));
> case SEEK_HOLE:
> inode_lock_shared(inode);
> - offset = iomap_seek_hole(inode, offset, &ext4_iomap_ops);
> + offset = iomap_seek_hole(inode, offset,
> + &ext4_iomap_report_ops);
> inode_unlock_shared(inode);
> break;
> case SEEK_DATA:
> inode_lock_shared(inode);
> - offset = iomap_seek_data(inode, offset, &ext4_iomap_ops);
> + offset = iomap_seek_data(inode, offset,
> + &ext4_iomap_report_ops);
> inode_unlock_shared(inode);
> break;
> }
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index caeb3dec0dec..1dace576b8bd 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3439,6 +3439,72 @@ static int ext4_set_iomap(struct inode *inode, struct iomap *iomap, u16 type,
> return 0;
> }
>
> +static u16 ext4_iomap_check_delalloc(struct inode *inode,
> + struct ext4_map_blocks *map)
> +{
> + struct extent_status es;
> + ext4_lblk_t end = map->m_lblk + map->m_len - 1;
> +
> + ext4_es_find_extent_range(inode, &ext4_es_is_delayed, map->m_lblk,
> + end, &es);
> +
> + /* Entire range is a hole */
> + if (!es.es_len || es.es_lblk > end)
> + return IOMAP_HOLE;
> + if (es.es_lblk <= map->m_lblk) {
> + ext4_lblk_t offset = 0;
> +
> + if (es.es_lblk < map->m_lblk)
> + offset = map->m_lblk - es.es_lblk;
> + map->m_lblk = es.es_lblk + offset;
This looks redundant no? map->m_lblk never changes actually.
So this is not needed here.


> + map->m_len = es.es_len - offset;
> + return IOMAP_DELALLOC;
> + }
> +
> + /* Range starts with a hole */
> + map->m_len = es.es_lblk - map->m_lblk;
> + return IOMAP_HOLE;
> +}
> +
> +static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
> + loff_t length, unsigned flags,
> + struct iomap *iomap)
> +{
> + int ret;
> + u16 type = 0;
> + struct ext4_map_blocks map;
> + u8 blkbits = inode->i_blkbits;
> + unsigned long first_block, last_block;
> +
> + if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> + return -EINVAL;
> + first_block = offset >> blkbits;
> + last_block = min_t(loff_t, (offset + length - 1) >> blkbits,
> + EXT4_MAX_LOGICAL_BLOCK);
> +
> + if (ext4_has_inline_data(inode)) {
> + ret = ext4_inline_data_iomap(inode, iomap);
> + if (ret != -EAGAIN) {
> + if (ret == 0 && offset >= iomap->length)
> + ret = -ENOENT;
> + return ret;
> + }
> + }
> +
> + map.m_lblk = first_block;
> + map.m_len = last_block = first_block + 1;
> + ret = ext4_map_blocks(NULL, inode, &map, 0);
> + if (ret < 0)
> + return ret;
> + if (ret == 0)
> + type = ext4_iomap_check_delalloc(inode, &map);
> + return ext4_set_iomap(inode, iomap, type, first_block, &map);
We don't need to send first_block here. Since map->m_lblk
is same as first_block.
Also with Jan comment, we don't even need 'type' parameter.
Then we should be able to rename the function
ext4_set_iomap ==> ext4_map_to_iomap. This better reflects what it is
doing. Thoughts?


> +}
> +
> +const struct iomap_ops ext4_iomap_report_ops = {
> + .iomap_begin = ext4_iomap_begin_report,
> +};
> +
> static int ext4_iomap_alloc(struct inode *inode,
> unsigned flags,
> unsigned long first_block,
> @@ -3498,12 +3564,10 @@ static int ext4_iomap_alloc(struct inode *inode,
> static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> unsigned flags, struct iomap *iomap)
> {
> - u16 type = 0;
> - unsigned int blkbits = inode->i_blkbits;
> - unsigned long first_block, last_block;
> - struct ext4_map_blocks map;
> - bool delalloc = false;
> int ret;
> + struct ext4_map_blocks map;
> + u8 blkbits = inode->i_blkbits;
> + unsigned long first_block, last_block;
>
> if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> return -EINVAL;
> @@ -3511,64 +3575,21 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> last_block = min_t(loff_t, (offset + length - 1) >> blkbits,
> EXT4_MAX_LOGICAL_BLOCK);
>
> - if (flags & IOMAP_REPORT) {
> - if (ext4_has_inline_data(inode)) {
> - ret = ext4_inline_data_iomap(inode, iomap);
> - if (ret != -EAGAIN) {
> - if (ret == 0 && offset >= iomap->length)
> - ret = -ENOENT;
> - return ret;
> - }
> - }
> - } else {
> - if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> - return -ERANGE;
> - }
> + if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> + return -ERANGE;
>
> map.m_lblk = first_block;
> map.m_len = last_block - first_block + 1;
>
> - if (flags & IOMAP_REPORT) {
> - ret = ext4_map_blocks(NULL, inode, &map, 0);
> - if (ret < 0)
> - return ret;
> -
> - if (ret == 0) {
> - ext4_lblk_t end = map.m_lblk + map.m_len - 1;
> - struct extent_status es;
> -
> - ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
> - map.m_lblk, end, &es);
> -
> - if (!es.es_len || es.es_lblk > end) {
> - /* entire range is a hole */
> - } else if (es.es_lblk > map.m_lblk) {
> - /* range starts with a hole */
> - map.m_len = es.es_lblk - map.m_lblk;
> - } else {
> - ext4_lblk_t offs = 0;
> -
> - if (es.es_lblk < map.m_lblk)
> - offs = map.m_lblk - es.es_lblk;
> - map.m_lblk = es.es_lblk + offs;
> - map.m_len = es.es_len - offs;
> - delalloc = true;
> - }
> - }
> - } else if (flags & IOMAP_WRITE) {
> + if (flags & IOMAP_WRITE)
> ret = ext4_iomap_alloc(inode, flags, first_block, &map);
> - } else {
> + else
> ret = ext4_map_blocks(NULL, inode, &map, 0);
> - if (ret < 0)
> - return ret;
> - }
>
> if (ret < 0)
> return ret;
> -
> - if (!ret)
> - type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
> - return ext4_set_iomap(inode, iomap, type, first_block, &map);
> + return ext4_set_iomap(inode, iomap, ret ? 0 : IOMAP_HOLE, first_block,
> + &map);
> }
>
> static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
>

2019-10-09 06:03:36

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH v4 1/8] ext4: move out iomap field population into separate helper

Some minor comments apart from Jan comments.

On 10/3/19 5:03 PM, Matthew Bobrowski wrote:
> Separate the iomap field population chunk into a separate simple
> helper routine. This helps reducing the overall clutter within the
> ext4_iomap_begin() callback, especially as we move across more code to
> make use of iomap infrastructure.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
> ---
> fs/ext4/inode.c | 65 ++++++++++++++++++++++++++++---------------------
> 1 file changed, 37 insertions(+), 28 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 516faa280ced..1ccdc14c4d69 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3406,10 +3406,43 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
> return inode->i_state & I_DIRTY_DATASYNC;
> }
>
> +static int ext4_set_iomap(struct inode *inode, struct iomap *iomap, u16 type,
> + unsigned long first_block, struct ext4_map_blocks *map)

Line beyond 80 chars. Please check with checkpatch once.

We can also get rid of "first_block" argument here.
Also the "type" argument also looks confusing.
Please see comment on patch 3.


> +{
> + u8 blkbits = inode->i_blkbits;
> +
> + iomap->flags = 0;
> + if (ext4_inode_datasync_dirty(inode))
> + iomap->flags |= IOMAP_F_DIRTY;
> + iomap->bdev = inode->i_sb->s_bdev;
> + iomap->dax_dev = EXT4_SB(inode->i_sb)->s_daxdev;
> + iomap->offset = (u64) first_block << blkbits;
> + iomap->length = (u64) map->m_len << blkbits;
> +
> + if (type) {
> + iomap->type = type;
> + iomap->addr = IOMAP_NULL_ADDR;
> + } else {
> + if (map->m_flags & EXT4_MAP_MAPPED) {
> + iomap->type = IOMAP_MAPPED;
> + } else if (map->m_flags & EXT4_MAP_UNWRITTEN) {
> + iomap->type = IOMAP_UNWRITTEN;
> + } else {
> + WARN_ON_ONCE(1);
> + return -EIO;
> + }
> + iomap->addr = (u64) map->m_pblk << blkbits;
> + }
> +
> + if (map->m_flags & EXT4_MAP_NEW)
> + iomap->flags |= IOMAP_F_NEW;
> + return 0;
> +}
> +
> static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> unsigned flags, struct iomap *iomap)
> {
> - struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> + u16 type = 0;
> unsigned int blkbits = inode->i_blkbits;
> unsigned long first_block, last_block;
> struct ext4_map_blocks map;
> @@ -3523,33 +3556,9 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> return ret;
> }
>
> - iomap->flags = 0;
> - if (ext4_inode_datasync_dirty(inode))
> - iomap->flags |= IOMAP_F_DIRTY;
> - iomap->bdev = inode->i_sb->s_bdev;
> - iomap->dax_dev = sbi->s_daxdev;
> - iomap->offset = (u64)first_block << blkbits;
> - iomap->length = (u64)map.m_len << blkbits;
> -
> - if (ret == 0) {
> - iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
> - iomap->addr = IOMAP_NULL_ADDR;
> - } else {
> - if (map.m_flags & EXT4_MAP_MAPPED) {
> - iomap->type = IOMAP_MAPPED;
> - } else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> - iomap->type = IOMAP_UNWRITTEN;
> - } else {
> - WARN_ON_ONCE(1);
> - return -EIO;
> - }
> - iomap->addr = (u64)map.m_pblk << blkbits;
> - }
> -
> - if (map.m_flags & EXT4_MAP_NEW)
> - iomap->flags |= IOMAP_F_NEW;
> -
> - return 0;
> + if (!ret)
> + type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
> + return ext4_set_iomap(inode, iomap, type, first_block, &map);
> }
>
> static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
>

2019-10-09 06:23:44

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH v4 2/8] ext4: move out IOMAP_WRITE path into separate helper



On 10/3/19 5:03 PM, Matthew Bobrowski wrote:
> In preparation for porting across the direct I/O path to iomap, split
> out the IOMAP_WRITE logic into a separate helper. This way, we don't
> need to clutter the ext4_iomap_begin() callback.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>

Minor comment, but otherwise.
Patch looks good to me. You may add:

Reviewed-by: Ritesh Harjani <[email protected]>


> ---
> fs/ext4/inode.c | 110 ++++++++++++++++++++++++++----------------------
> 1 file changed, 60 insertions(+), 50 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1ccdc14c4d69..caeb3dec0dec 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3439,6 +3439,62 @@ static int ext4_set_iomap(struct inode *inode, struct iomap *iomap, u16 type,
> return 0;
> }
>
> +static int ext4_iomap_alloc(struct inode *inode,
> + unsigned flags,
> + unsigned long first_block,
> + struct ext4_map_blocks *map)
> +{
> + handle_t *handle;
> + u8 blkbits = inode->i_blkbits;
> + int ret, dio_credits, retries = 0;
> +
> + /*
> + * Trim mapping request to the maximum value that we can map
> + * at once for direct I/O.
> + */
> + if (map->m_len > DIO_MAX_BLOCKS)
> + map->m_len = DIO_MAX_BLOCKS;
> + dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
> +retry:
> + /*
> + * Either we allocate blocks and then we don't get unwritten
> + * extent so we have reserved enough credits, or the blocks
> + * are already allocated and unwritten. In that case, the
> + * extent conversion fits in the credits as well.
> + */
> + handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, dio_credits);
> + if (IS_ERR(handle))
> + return PTR_ERR(handle);
> +
> + ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
> + if (ret < 0)
> + goto journal_stop;
> +
> + /*
> + * If we have allocated blocks beyond the EOF, we need to make
> + * sure that they get truncate if we crash before updating the
> + * inode size metadata in ext4_iomap_end(). For faults, we
> + * don't need to do that (and cannot due to the orphan list
> + * operations needing an inode_lock()). If we happen to
> + * instantiate blocks beyond EOF, it is because we race with a
> + * truncate operation, which already has added the inode onto
> + * the orphan list.
> + */
> + if (!(flags & IOMAP_FAULT) && first_block + map->m_len >
> + (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
> + int err;
> +
> + err = ext4_orphan_add(handle, inode);
> + if (err < 0)
> + ret = err;
> + }
> +journal_stop:
> + ext4_journal_stop(handle);
> + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> + goto retry;
> + return ret;
> +}
> +
> static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> unsigned flags, struct iomap *iomap)
> {
> @@ -3500,62 +3556,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> }
> }
> } else if (flags & IOMAP_WRITE) {
> - int dio_credits;
> - handle_t *handle;
> - int retries = 0;
> -
> - /* Trim mapping request to maximum we can map at once for DIO */
> - if (map.m_len > DIO_MAX_BLOCKS)
> - map.m_len = DIO_MAX_BLOCKS;
> - dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
> -retry:
> - /*
> - * Either we allocate blocks and then we don't get unwritten
> - * extent so we have reserved enough credits, or the blocks
> - * are already allocated and unwritten and in that case
> - * extent conversion fits in the credits as well.
> - */
> - handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
> - dio_credits);
> - if (IS_ERR(handle))
> - return PTR_ERR(handle);
> -
> - ret = ext4_map_blocks(handle, inode, &map,
> - EXT4_GET_BLOCKS_CREATE_ZERO);
> - if (ret < 0) {
> - ext4_journal_stop(handle);
> - if (ret == -ENOSPC &&
> - ext4_should_retry_alloc(inode->i_sb, &retries))
> - goto retry;
> - return ret;
> - }
> -
> - /*
> - * If we added blocks beyond i_size, we need to make sure they
> - * will get truncated if we crash before updating i_size in
> - * ext4_iomap_end(). For faults we don't need to do that (and
> - * even cannot because for orphan list operations inode_lock is
> - * required) - if we happen to instantiate block beyond i_size,
> - * it is because we race with truncate which has already added
> - * the inode to the orphan list.
> - */
> - if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
> - (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
> - int err;
> -
> - err = ext4_orphan_add(handle, inode);
> - if (err < 0) {
> - ext4_journal_stop(handle);
> - return err;
> - }
> - }
> - ext4_journal_stop(handle);
> + ret = ext4_iomap_alloc(inode, flags, first_block, &map);

We don't need "first_block" argument here. Since
map->m_lblk saves first_block directly above in the same function.

No strong objection against ext4_iomap_alloc, but
maybe ext4_iomap_map_write sounds better?
Either way is fine though.


> } else {
> ret = ext4_map_blocks(NULL, inode, &map, 0);
> if (ret < 0)
> return ret;
> }
>
> + if (ret < 0)
> + return ret;
> +
> if (!ret)
> type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
> return ext4_set_iomap(inode, iomap, type, first_block, &map);
>

2019-10-09 06:27:56

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH v4 5/8] ext4: move inode extension/truncate code out from ->iomap_end() callback



On 10/3/19 5:04 PM, Matthew Bobrowski wrote:
> In preparation for implementing the iomap direct I/O write path
> modifications, the inode extension/truncate code needs to be moved out
> from ext4_iomap_end(). For direct I/O, if the current code remained
> within ext4_iomap_end() it would behave incorrectly. Updating the
> inode size prior to converting unwritten extents to written extents
> will potentially allow a racing direct I/O read operation to find
> unwritten extents before they've been correctly converted.
>
> The inode extension/truncate code has been moved out into a new helper
> ext4_handle_inode_extension(). This function has been designed so that
> it can be used by both DAX and direct I/O paths.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>

checkpatch shows some whitespaces error in your comments
in this patch.
But apart from that, patch looks good to me.
You may add:

Reviewed-by: Ritesh Harjani <[email protected]>


> ---
> fs/ext4/file.c | 79 ++++++++++++++++++++++++++++++++++++++++++++++++-
> fs/ext4/inode.c | 48 +-----------------------------
> 2 files changed, 79 insertions(+), 48 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 69ac042fb74b..2883711e8a33 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -33,6 +33,7 @@
> #include "ext4_jbd2.h"
> #include "xattr.h"
> #include "acl.h"
> +#include "truncate.h"
>
> static bool ext4_dio_supported(struct inode *inode)
> {
> @@ -233,12 +234,82 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> return iov_iter_count(from);
> }
>
> +static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> + ssize_t written, size_t count)
> +{
> + int ret = 0;
> + handle_t *handle;
> + bool truncate = false;
> + u8 blkbits = inode->i_blkbits;
> + ext4_lblk_t written_blk, end_blk;
> +
> + /*
> + * Note that EXT4_I(inode)->i_disksize can get extended up to
> + * inode->i_size while the IO was running due to writeback of
> + * delalloc blocks. But the code in ext4_iomap_alloc() is careful
> + * to use zeroed / unwritten extents if this is possible and thus
> + * we won't leave uninitialized blocks in a file even if we didn't
> + * succeed in writing as much as we planned.
> + */
> + WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
> + if (offset + count <= EXT4_I(inode)->i_disksize)
> + return written < 0 ? written : 0;
> +
> + if (written < 0) {
> + ret = written;
> + goto truncate;
> + }
> +
> + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + goto truncate;
> + }
> +
> + if (ext4_update_inode_size(inode, offset + written))
> + ext4_mark_inode_dirty(handle, inode);
> +
> + /*
> + * We may need to truncate allocated but not written blocks
> + * beyond EOF.
> + */
> + written_blk = ALIGN(offset + written, 1 << blkbits);
> + end_blk = ALIGN(offset + count, 1 << blkbits);
> + if (written_blk < end_blk && ext4_can_truncate(inode))
> + truncate = true;
> +
> + /*
> + * Remove the inode from the orphan list if it has been
> + * extended and everything went OK.
> + */
> + if (!truncate && inode->i_nlink)
> + ext4_orphan_del(handle, inode);
> + ext4_journal_stop(handle);
> +
> + if (truncate) {
> +truncate:
> + ext4_truncate_failed_write(inode);
> + /*
> + * If the truncate operation failed early, then the
> + * inode may still be on the orphan list. In that
> + * case, we need to try remove the inode from the
> + * in-memory linked list.
> + */
> + if (inode->i_nlink)
> + ext4_orphan_del(NULL, inode);
> + }
> + return ret;
> +}
> +
> #ifdef CONFIG_FS_DAX
> static ssize_t
> ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
> - struct inode *inode = file_inode(iocb->ki_filp);
> + int error;
> ssize_t ret;
> + size_t count;
> + loff_t offset;
> + struct inode *inode = file_inode(iocb->ki_filp);
>
> if (!inode_trylock(inode)) {
> if (iocb->ki_flags & IOCB_NOWAIT)
> @@ -255,7 +326,13 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> if (ret)
> goto out;
>
> + offset = iocb->ki_pos;
> + count = iov_iter_count(from);
> ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
> +
> + error = ext4_handle_inode_extension(inode, offset, ret, count);
> + if (error)
> + ret = error;
> out:
> inode_unlock(inode);
> if (ret > 0)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 159ffb92f82d..d616062b603e 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3592,53 +3592,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> ssize_t written, unsigned flags, struct iomap *iomap)
> {
> - int ret = 0;
> - handle_t *handle;
> - int blkbits = inode->i_blkbits;
> - bool truncate = false;
> -
> - if (!(flags & IOMAP_WRITE) || (flags & IOMAP_FAULT))
> - return 0;
> -
> - handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> - if (IS_ERR(handle)) {
> - ret = PTR_ERR(handle);
> - goto orphan_del;
> - }
> - if (ext4_update_inode_size(inode, offset + written))
> - ext4_mark_inode_dirty(handle, inode);
> - /*
> - * We may need to truncate allocated but not written blocks beyond EOF.
> - */
> - if (iomap->offset + iomap->length >
> - ALIGN(inode->i_size, 1 << blkbits)) {
> - ext4_lblk_t written_blk, end_blk;
> -
> - written_blk = (offset + written) >> blkbits;
> - end_blk = (offset + length) >> blkbits;
> - if (written_blk < end_blk && ext4_can_truncate(inode))
> - truncate = true;
> - }
> - /*
> - * Remove inode from orphan list if we were extending a inode and
> - * everything went fine.
> - */
> - if (!truncate && inode->i_nlink &&
> - !list_empty(&EXT4_I(inode)->i_orphan))
> - ext4_orphan_del(handle, inode);
> - ext4_journal_stop(handle);
> - if (truncate) {
> - ext4_truncate_failed_write(inode);
> -orphan_del:
> - /*
> - * If truncate failed early the inode might still be on the
> - * orphan list; we need to make sure the inode is removed from
> - * the orphan list in that case.
> - */
> - if (inode->i_nlink)
> - ext4_orphan_del(NULL, inode);
> - }
> - return ret;
> + return 0;
> }
>
> const struct iomap_ops ext4_iomap_ops = {
>

2019-10-09 06:31:06

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] ext4: move inode extension checks out from ext4_iomap_alloc()



On 10/3/19 5:04 PM, Matthew Bobrowski wrote:
> We lift the inode extension/orphan list handling logic out from
> ext4_iomap_alloc() and place it within the caller
> ext4_dax_write_iter().
>
> Signed-off-by: Matthew Bobrowski <[email protected]>

This looks good. Should solve our previous lengthy discussion
about orphan handling :)

You may add:
Reviewed-by: Ritesh Harjani <[email protected]>


> ---
> fs/ext4/file.c | 17 +++++++++++++++++
> fs/ext4/inode.c | 22 ----------------------
> 2 files changed, 17 insertions(+), 22 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 2883711e8a33..f64da0c590b2 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -309,6 +309,7 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> ssize_t ret;
> size_t count;
> loff_t offset;
> + handle_t *handle;
> struct inode *inode = file_inode(iocb->ki_filp);
>
> if (!inode_trylock(inode)) {
> @@ -328,6 +329,22 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
>
> offset = iocb->ki_pos;
> count = iov_iter_count(from);
> +
> + if (offset + count > EXT4_I(inode)->i_disksize) {
> + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + goto out;
> + }
> +
> + ret = ext4_orphan_add(handle, inode);
> + if (ret) {
> + ext4_journal_stop(handle);
> + goto out;
> + }
> + ext4_journal_stop(handle);
> + }
> +
> ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
>
> error = ext4_handle_inode_extension(inode, offset, ret, count);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index d616062b603e..e133dda55063 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3508,7 +3508,6 @@ static int ext4_iomap_alloc(struct inode *inode,
> struct ext4_map_blocks *map)
> {
> handle_t *handle;
> - u8 blkbits = inode->i_blkbits;
> int ret, dio_credits, retries = 0;
>
> /*
> @@ -3530,28 +3529,7 @@ static int ext4_iomap_alloc(struct inode *inode,
> return PTR_ERR(handle);
>
> ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
> - if (ret < 0)
> - goto journal_stop;
>
> - /*
> - * If we have allocated blocks beyond the EOF, we need to make
> - * sure that they get truncate if we crash before updating the
> - * inode size metadata in ext4_iomap_end(). For faults, we
> - * don't need to do that (and cannot due to the orphan list
> - * operations needing an inode_lock()). If we happen to
> - * instantiate blocks beyond EOF, it is because we race with a
> - * truncate operation, which already has added the inode onto
> - * the orphan list.
> - */
> - if (!(flags & IOMAP_FAULT) && first_block + map->m_len >
> - (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
> - int err;
> -
> - err = ext4_orphan_add(handle, inode);
> - if (err < 0)
> - ret = err;
> - }
> -journal_stop:
> ext4_journal_stop(handle);
> if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> goto retry;
>

2019-10-09 06:36:31

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH v4 7/8] ext4: reorder map.m_flags checks in ext4_set_iomap()



On 10/3/19 5:04 PM, Matthew Bobrowski wrote:
> For iomap direct IO write code path changes, we need to accommodate
> for the case where the block mapping flags passed to ext4_map_blocks()
> will result in m_flags having both EXT4_MAP_MAPPED and
> EXT4_MAP_UNWRITTEN bits set. In order for the allocated unwritten
> extents to be converted properly in the end_io handler, iomap->type
> must be set to IOMAP_UNWRITTEN, so we need to reshuffle the
> conditional statement in order to achieve this.
>
> This change is a no-op for DAX code path as the block mapping flag
> passed to ext4_map_blocks() when IS_DAX(inode) never results in
> EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN being set at once.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>

You may be changing the function parameters & name here,
(in ext4_set_iomap)
But functionality wise the patch looks good to me.

You may add:
Reviewed-by: Ritesh Harjani <[email protected]>


> ---
> fs/ext4/inode.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e133dda55063..63ad23ae05b8 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3420,10 +3420,20 @@ static int ext4_set_iomap(struct inode *inode, struct iomap *iomap, u16 type,
> iomap->type = type;
> iomap->addr = IOMAP_NULL_ADDR;
> } else {
> - if (map->m_flags & EXT4_MAP_MAPPED) {
> - iomap->type = IOMAP_MAPPED;
> - } else if (map->m_flags & EXT4_MAP_UNWRITTEN) {
> + /*
> + * Flags passed to ext4_map_blocks() for direct I/O
> + * writes can result in map->m_flags having both
> + * EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits set. In
> + * order for allocated extents to be converted to
> + * written extents in the ->end_io handler correctly,
> + * we need to ensure that the iomap->type is set
> + * approprately. Thus, we need to check whether
> + * EXT4_MAP_UNWRITTEN is set first.
> + */
> + if (map->m_flags & EXT4_MAP_UNWRITTEN) {
> iomap->type = IOMAP_UNWRITTEN;
> + } else if (map->m_flags & EXT4_MAP_MAPPED) {
> + iomap->type = IOMAP_MAPPED;
> } else {
> WARN_ON_ONCE(1);
> return -EIO;
>

2019-10-09 06:42:38

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH v4 4/8] ext4: introduce direct I/O read path using iomap infrastructure



On 10/3/19 5:04 PM, Matthew Bobrowski wrote:
> This patch introduces a new direct I/O read path that makes use of the
> iomap infrastructure.
>
> The new function ext4_dio_read_iter() is responsible for calling into
> the iomap infrastructure via iomap_dio_rw(). If the read operation
> being performed on the inode does not pass the preliminary checks
> performed within ext4_dio_supported(), then we simply fallback to
> buffered I/O in order to fulfil the request.
>
> Existing direct I/O read buffer_head code has been removed as it's now
> redundant.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
> ---
> fs/ext4/file.c | 58 +++++++++++++++++++++++++++++++++++++++++++++----
> fs/ext4/inode.c | 32 +--------------------------
> 2 files changed, 55 insertions(+), 35 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index ab75aee3e687..69ac042fb74b 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -34,6 +34,53 @@
> #include "xattr.h"
> #include "acl.h"
>
> +static bool ext4_dio_supported(struct inode *inode)
> +{
> + if (IS_ENABLED(CONFIG_FS_ENCRYPTION) && IS_ENCRYPTED(inode))
> + return false;
> + if (fsverity_active(inode))
> + return false;
> + if (ext4_should_journal_data(inode))
> + return false;
> + if (ext4_has_inline_data(inode))
> + return false;
> + return true;
> +}
> +
> +static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> + ssize_t ret;
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + /*
> + * Get exclusion from truncate and other inode operations.
> + */
> + if (!inode_trylock_shared(inode)) {
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EAGAIN;
> + inode_lock_shared(inode);
> + }
Same comments here.
Let's follow as per the discussion here
https://patchwork.kernel.org/patch/11141577/


> +
> + if (!ext4_dio_supported(inode)) {
> + inode_unlock_shared(inode);
> + /*
> + * Fallback to buffered I/O if the operation being
> + * performed on the inode is not supported by direct
> + * I/O. The IOCB_DIRECT flag needs to be cleared here
> + * in order to ensure that the direct I/O path withiin
> + * generic_file_read_iter() is not taken.
> + */
> + iocb->ki_flags &= ~IOCB_DIRECT;
> + return generic_file_read_iter(iocb, to);
> + }
> +
> + ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL);
> + inode_unlock_shared(inode);
> +
> + file_accessed(iocb->ki_filp);
> + return ret;
> +}
> +
> #ifdef CONFIG_FS_DAX
> static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
> {
> @@ -64,16 +111,19 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
>
> static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> {
> - if (unlikely(ext4_forced_shutdown(EXT4_SB(file_inode(iocb->ki_filp)->i_sb))))
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
> return -EIO;
>
> if (!iov_iter_count(to))
> return 0; /* skip atime */
>
> -#ifdef CONFIG_FS_DAX
> - if (IS_DAX(file_inode(iocb->ki_filp)))
> + if (IS_DAX(inode))
> return ext4_dax_read_iter(iocb, to);
> -#endif
> +
> + if (iocb->ki_flags & IOCB_DIRECT)
> + return ext4_dio_read_iter(iocb, to);
> return generic_file_read_iter(iocb, to);
> }
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1dace576b8bd..159ffb92f82d 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -863,9 +863,6 @@ int ext4_dio_get_block(struct inode *inode, sector_t iblock,
> {
> /* We don't expect handle for direct IO */
> WARN_ON_ONCE(ext4_journal_current_handle());
> -
> - if (!create)
> - return _ext4_get_block(inode, iblock, bh, 0);
> return ext4_get_block_trans(inode, iblock, bh, EXT4_GET_BLOCKS_CREATE);
> }
>
> @@ -3855,30 +3852,6 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
> return ret;
> }
>
> -static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter)
> -{
> - struct address_space *mapping = iocb->ki_filp->f_mapping;
> - struct inode *inode = mapping->host;
> - size_t count = iov_iter_count(iter);
> - ssize_t ret;
> -
> - /*
> - * Shared inode_lock is enough for us - it protects against concurrent
> - * writes & truncates and since we take care of writing back page cache,
> - * we are protected against page writeback as well.
> - */
> - inode_lock_shared(inode);
> - ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,
> - iocb->ki_pos + count - 1);
> - if (ret)
> - goto out_unlock;
> - ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
> - iter, ext4_dio_get_block, NULL, NULL, 0);
> -out_unlock:
> - inode_unlock_shared(inode);
> - return ret;
> -}
> -
> static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> {
> struct file *file = iocb->ki_filp;
> @@ -3905,10 +3878,7 @@ static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> return 0;
>
> trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
> - if (iov_iter_rw(iter) == READ)
> - ret = ext4_direct_IO_read(iocb, iter);
> - else
> - ret = ext4_direct_IO_write(iocb, iter);
> + ret = ext4_direct_IO_write(iocb, iter);
> trace_ext4_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), ret);
> return ret;
> }
>

2019-10-09 07:09:49

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 1/8] ext4: move out iomap field population into separate helper

On Wed, Oct 09, 2019 at 11:32:54AM +0530, Ritesh Harjani wrote:
> We can also get rid of "first_block" argument here.

That would just duplicate filling it out in all callers, so why?

2019-10-09 07:13:05

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v4 8/8] ext4: introduce direct I/O write path using iomap infrastructure

On Tue, Oct 08, 2019 at 05:12:38PM +0200, Jan Kara wrote:
> Seeing how difficult it is when a filesystem wants to complete the iocb
> synchronously (regardless whether it is async or sync) and have all the
> information in one place for further processing, I think it would be the
> easiest to provide iomap_dio_rw_wait() that forces waiting for the iocb to
> complete *and* returns the appropriate return value instead of pretty
> useless EIOCBQUEUED. It is actually pretty trivial (patch attached). With
> this we can then just call iomap_dio_rw_sync() for the inode extension case
> with ->end_io doing just the unwritten extent processing and then call
> ext4_handle_inode_extension() from ext4_direct_write_iter() where we would
> have all the information we need.
>
> Christoph, Darrick, what do you think about extending iomap like in the
> attached patch (plus sample use in XFS)?

I vaguely remember suggesting something like this but Brian or Dave
convinced me it wasn't a good idea. This will require a trip to the
xfs or fsdevel archives from when the inode_dio_wait was added in XFS.

But if we decide it actully works this time around please don't add the
__ variant but just add the parameter to iomap_dio_rw directly.

2019-10-09 07:51:06

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH v4 1/8] ext4: move out iomap field population into separate helper



On 10/9/19 12:37 PM, Christoph Hellwig wrote:
> On Wed, Oct 09, 2019 at 11:32:54AM +0530, Ritesh Harjani wrote:
>> We can also get rid of "first_block" argument here.
>
> That would just duplicate filling it out in all callers, so why?
>

What I meant is "first_block" is same as map->m_lblk.
(unless ext4_map_blocks can change map->m_lblk, which AFAICT, it should
not).
So why pass an extra argument when we are already passing 'map'
structure.
So we can fill iomap->offset using map->m_lblk in ext4_set_iomap()
function.

-ritesh

2019-10-09 08:58:14

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v4 1/8] ext4: move out iomap field population into separate helper

On Tue, Oct 08, 2019 at 12:27:09PM +0200, Jan Kara wrote:
> On Thu 03-10-19 21:33:09, Matthew Bobrowski wrote:
> > +static int ext4_set_iomap(struct inode *inode, struct iomap *iomap, u16 type,
> > + unsigned long first_block, struct ext4_map_blocks *map)
> > +{
> > + u8 blkbits = inode->i_blkbits;
> > +
> > + iomap->flags = 0;
> > + if (ext4_inode_datasync_dirty(inode))
> > + iomap->flags |= IOMAP_F_DIRTY;
> > + iomap->bdev = inode->i_sb->s_bdev;
> > + iomap->dax_dev = EXT4_SB(inode->i_sb)->s_daxdev;
> > + iomap->offset = (u64) first_block << blkbits;
> > + iomap->length = (u64) map->m_len << blkbits;
> > +
> > + if (type) {
> > + iomap->type = type;
> > + iomap->addr = IOMAP_NULL_ADDR;
> > + } else {
> > + if (map->m_flags & EXT4_MAP_MAPPED) {
> > + iomap->type = IOMAP_MAPPED;
> > + } else if (map->m_flags & EXT4_MAP_UNWRITTEN) {
> > + iomap->type = IOMAP_UNWRITTEN;
> > + } else {
> > + WARN_ON_ONCE(1);
> > + return -EIO;
> > + }
> > + iomap->addr = (u64) map->m_pblk << blkbits;
> > + }
>
> Looking at this function now, the 'type' argument looks a bit weird. Can we
> perhaps just remove the 'type' argument and change the above to:

We can, but refer to the point below.

> if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN)) {
> if (map->m_flags & EXT4_MAP_MAPPED)
> iomap->type = IOMAP_MAPPED;
> else if (map->m_flags & EXT4_MAP_UNWRITTEN)
> iomap->type = IOMAP_UNWRITTEN;
> iomap->addr = (u64) map->m_pblk << blkbits;
> } else {
> iomap->type = IOMAP_HOLE;
> iomap->addr = IOMAP_NULL_ADDR;
> }
>
> And then in ext4_iomap_begin() we overwrite the type to:
>
> if (delalloc && iomap->type == IOMAP_HOLE)
> iomap->type = IOMAP_DELALLOC;
>
> That would IMO make ext4_set_iomap() arguments harder to get wrong.

I was thinking about this while doing a bunch of other things at work
today. I'm kind of aligned with what Christoph mentioned around
possibly duplicating some of the post 'iomap->type' setting from both
current and any future ext4_set_iomap() callers. In addition to this,
my thought was that if we're populating the iomap structure with
values respectively, then it would make most sense to encapsulate
those routines, if possible, within the ext4_set_iomap() as that's the
sole purpose of the function.

However, no real strong objections for dropping 'type', but I just
wanted to share my thoughts.

Also, yes, we probably can drop 'first_block' from the list of
arguments here as we can derive that from 'map' and set 'iomap->type'
accordingly...

--<M>--


2019-10-09 09:09:43

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v4 1/8] ext4: move out iomap field population into separate helper

On Wed, Oct 09, 2019 at 01:20:17PM +0530, Ritesh Harjani wrote:
> On 10/9/19 12:37 PM, Christoph Hellwig wrote:
> > On Wed, Oct 09, 2019 at 11:32:54AM +0530, Ritesh Harjani wrote:
> > > We can also get rid of "first_block" argument here.
> >
> > That would just duplicate filling it out in all callers, so why?
> >
>
> What I meant is "first_block" is same as map->m_lblk.
> (unless ext4_map_blocks can change map->m_lblk, which AFAICT, it should
> not).
> So why pass an extra argument when we are already passing 'map'
> structure.
> So we can fill iomap->offset using map->m_lblk in ext4_set_iomap()
> function.

What you're saying makes sense Ritesh and I will update it as such. I
believe what Christoph was actually referring to was what I explained
to Jan to the email that I just sent out. This was around the possible
code duplication and having some iomap value setting related logic
outside ext4_set_iomap(), when in fact there's no real reason why it
can't be inside.

--<M>--

2019-10-09 09:20:06

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v4 2/8] ext4: move out IOMAP_WRITE path into separate helper

On Tue, Oct 08, 2019 at 12:31:37PM +0200, Jan Kara wrote:
> On Thu 03-10-19 21:33:29, Matthew Bobrowski wrote:
> > In preparation for porting across the direct I/O path to iomap, split
> > out the IOMAP_WRITE logic into a separate helper. This way, we don't
> > need to clutter the ext4_iomap_begin() callback.
> >
> > Signed-off-by: Matthew Bobrowski <[email protected]>
>
> The patch looks good to me. You can add:
>
> Reviewed-by: Jan Kara <[email protected]>
>
> Just please reformat the comments to use full 80 column lines. Your Emacs
> still doesn't seem to get it :)

*nod* :)

--<M>--

2019-10-09 09:32:30

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v4 2/8] ext4: move out IOMAP_WRITE path into separate helper

On Wed, Oct 09, 2019 at 11:52:41AM +0530, Ritesh Harjani wrote:
> On 10/3/19 5:03 PM, Matthew Bobrowski wrote:
> Minor comment, but otherwise.
> Patch looks good to me. You may add:
>
> Reviewed-by: Ritesh Harjani <[email protected]>

*nod* - Thank you!

> > static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> > unsigned flags, struct iomap *iomap)
> > {
> > @@ -3500,62 +3556,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> > }
> > }
> > } else if (flags & IOMAP_WRITE) {
> > - int dio_credits;
> > - handle_t *handle;
> > - int retries = 0;
> > -
> > - /* Trim mapping request to maximum we can map at once for DIO */
> > - if (map.m_len > DIO_MAX_BLOCKS)
> > - map.m_len = DIO_MAX_BLOCKS;
> > - dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
> > -retry:
> > - /*
> > - * Either we allocate blocks and then we don't get unwritten
> > - * extent so we have reserved enough credits, or the blocks
> > - * are already allocated and unwritten and in that case
> > - * extent conversion fits in the credits as well.
> > - */
> > - handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
> > - dio_credits);
> > - if (IS_ERR(handle))
> > - return PTR_ERR(handle);
> > -
> > - ret = ext4_map_blocks(handle, inode, &map,
> > - EXT4_GET_BLOCKS_CREATE_ZERO);
> > - if (ret < 0) {
> > - ext4_journal_stop(handle);
> > - if (ret == -ENOSPC &&
> > - ext4_should_retry_alloc(inode->i_sb, &retries))
> > - goto retry;
> > - return ret;
> > - }
> > -
> > - /*
> > - * If we added blocks beyond i_size, we need to make sure they
> > - * will get truncated if we crash before updating i_size in
> > - * ext4_iomap_end(). For faults we don't need to do that (and
> > - * even cannot because for orphan list operations inode_lock is
> > - * required) - if we happen to instantiate block beyond i_size,
> > - * it is because we race with truncate which has already added
> > - * the inode to the orphan list.
> > - */
> > - if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
> > - (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
> > - int err;
> > -
> > - err = ext4_orphan_add(handle, inode);
> > - if (err < 0) {
> > - ext4_journal_stop(handle);
> > - return err;
> > - }
> > - }
> > - ext4_journal_stop(handle);
> > + ret = ext4_iomap_alloc(inode, flags, first_block, &map);
>
> We don't need "first_block" argument here. Since
> map->m_lblk saves first_block directly above in the same function.

You're right. I will change that.

> No strong objection against ext4_iomap_alloc, but
> maybe ext4_iomap_map_write sounds better?
> Either way is fine though.

I like 'ext4_iomap_alloc', because it's performing allocation in
preparation for a write being performed on behalf of iomap. :)

--<M>--

2019-10-09 09:42:11

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v4 3/8] ext4: introduce new callback for IOMAP_REPORT operations

On Tue, Oct 08, 2019 at 12:42:09PM +0200, Jan Kara wrote:
> On Thu 03-10-19 21:33:45, Matthew Bobrowski wrote:
> The patch looks good to me. You can add:
>
> Reviewed-by: Jan Kara <[email protected]>

Thanks Jan! :)

> It would just need small adjustments if you change patch 1 as I suggested:

I will await what you say in response to what my thoughts were aronud
ext4_set_iomap() before doing any updates here.

> > +static u16 ext4_iomap_check_delalloc(struct inode *inode,
> > + struct ext4_map_blocks *map)
> > +{
> > + struct extent_status es;
> > + ext4_lblk_t end = map->m_lblk + map->m_len - 1;
> > +
> > + ext4_es_find_extent_range(inode, &ext4_es_is_delayed, map->m_lblk,
> > + end, &es);
> > +
> > + /* Entire range is a hole */
> > + if (!es.es_len || es.es_lblk > end)
> > + return IOMAP_HOLE;
> > + if (es.es_lblk <= map->m_lblk) {
> > + ext4_lblk_t offset = 0;
> > +
> > + if (es.es_lblk < map->m_lblk)
> > + offset = map->m_lblk - es.es_lblk;
> > + map->m_lblk = es.es_lblk + offset;
> > + map->m_len = es.es_len - offset;
> > + return IOMAP_DELALLOC;
> > + }
> > +
> > + /* Range starts with a hole */
> > + map->m_len = es.es_lblk - map->m_lblk;
> > + return IOMAP_HOLE;
> > +}
>
> This function would then be IMO better off to directly update 'iomap' as
> needed after ext4_set_iomap() sets hole there.

As mentioned in 1/8, it would be nice to leave all iomap setting up to
ext4_set_iomap(), but if we're strongly against passing 'type', then
I'm happy to change it and update this to pass an 'iomap'.

--<M>--

2019-10-09 10:20:55

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v4 5/8] ext4: move inode extension/truncate code out from ->iomap_end() callback

On Wed, Oct 09, 2019 at 11:57:04AM +0530, Ritesh Harjani wrote:
> On 10/3/19 5:04 PM, Matthew Bobrowski wrote:
> > In preparation for implementing the iomap direct I/O write path
> > modifications, the inode extension/truncate code needs to be moved out
> > from ext4_iomap_end(). For direct I/O, if the current code remained
> > within ext4_iomap_end() it would behave incorrectly. Updating the
> > inode size prior to converting unwritten extents to written extents
> > will potentially allow a racing direct I/O read operation to find
> > unwritten extents before they've been correctly converted.
> >
> > The inode extension/truncate code has been moved out into a new helper
> > ext4_handle_inode_extension(). This function has been designed so that
> > it can be used by both DAX and direct I/O paths.
> >
> > Signed-off-by: Matthew Bobrowski <[email protected]>
>
> checkpatch shows some whitespaces error in your comments
> in this patch.
> But apart from that, patch looks good to me.

Yeah, I will fix those.

> You may add:
>
> Reviewed-by: Ritesh Harjani <[email protected]>

Thanks!

--<M>--

2019-10-09 10:21:46

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] ext4: move inode extension checks out from ext4_iomap_alloc()

On Tue, Oct 08, 2019 at 01:27:06PM +0200, Jan Kara wrote:
> On Thu 03-10-19 21:34:36, Matthew Bobrowski wrote:
> > We lift the inode extension/orphan list handling logic out from
> > ext4_iomap_alloc() and place it within the caller
> > ext4_dax_write_iter().
> >
> > Signed-off-by: Matthew Bobrowski <[email protected]>
>
> The patch looks good to me. You can add:
>
> Reviewed-by: Jan Kara <[email protected]>

Thanks Jan! :)

--<M>--

2019-10-09 10:22:54

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v4 5/8] ext4: move inode extension/truncate code out from ->iomap_end() callback

On Tue, Oct 08, 2019 at 01:25:12PM +0200, Jan Kara wrote:
> On Thu 03-10-19 21:34:18, Matthew Bobrowski wrote:
> Looks good to me. Fell free to add:
>
> Reviewed-by: Jan Kara <[email protected]>

Thanks Jan!

> Just small nits below:
>
> > +static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> > + ssize_t written, size_t count)
> > +{
> > + int ret = 0;
>
> I think both the function and callsites may be slightly simpler if you let
> the function return 'written' or error (not 0 or error). But I'll leave
> that decision upto you.

Hm, don't we actually need to return 0 for success cases so that
iomap_dio_complete() behaves correctly i.e. increments iocb->ki_pos,
etc?

> > + handle_t *handle;
> > + bool truncate = false;
> > + u8 blkbits = inode->i_blkbits;
> > + ext4_lblk_t written_blk, end_blk;
> > +
> > + /*
> > + * Note that EXT4_I(inode)->i_disksize can get extended up to
> > + * inode->i_size while the IO was running due to writeback of
> > + * delalloc blocks. But the code in ext4_iomap_alloc() is careful
> > + * to use zeroed / unwritten extents if this is possible and thus
> > + * we won't leave uninitialized blocks in a file even if we didn't
> > + * succeed in writing as much as we planned.
> > + */
>
> Whitespace damaged here...

I'll fix this.

--<M>--

2019-10-09 10:40:20

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v4 6/8] ext4: move inode extension checks out from ext4_iomap_alloc()

On Wed, Oct 09, 2019 at 12:00:22PM +0530, Ritesh Harjani wrote:
> On 10/3/19 5:04 PM, Matthew Bobrowski wrote:
> This looks good. Should solve our previous lengthy discussion
> about orphan handling :)

Yeah, although I'm still not the biggest fan of this approach.

> You may add:
> Reviewed-by: Ritesh Harjani <[email protected]>

Thanks Ritesh!

--<M>--

2019-10-09 10:43:04

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v4 7/8] ext4: reorder map.m_flags checks in ext4_set_iomap()

On Tue, Oct 08, 2019 at 01:30:12PM +0200, Jan Kara wrote:
> The patch looks good to me. You can add:
>
> Reviewed-by: Jan Kara <[email protected]>

Thanks Jan!

> Just those 72 columns limited comment lines... ;)

*nod* - will be fixed!

--<M>--

2019-10-09 10:44:41

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v4 7/8] ext4: reorder map.m_flags checks in ext4_set_iomap()

On Wed, Oct 09, 2019 at 12:05:47PM +0530, Ritesh Harjani wrote:
> You may be changing the function parameters & name here,
> (in ext4_set_iomap)

Hm, maybe. :P It all depends on the outcome of what we discuss in 1/8.

> But functionality wise the patch looks good to me.
>
> You may add:
> Reviewed-by: Ritesh Harjani <[email protected]>

Great, thanks Ritesh!

--<M>--

2019-10-09 10:57:48

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v4 4/8] ext4: introduce direct I/O read path using iomap infrastructure

On Tue, Oct 08, 2019 at 12:52:07PM +0200, Jan Kara wrote:
> On Thu 03-10-19 21:34:00, Matthew Bobrowski wrote:
> > This patch introduces a new direct I/O read path that makes use of the
> > iomap infrastructure.
> >
> > The new function ext4_dio_read_iter() is responsible for calling into
> > the iomap infrastructure via iomap_dio_rw(). If the read operation
> > being performed on the inode does not pass the preliminary checks
> > performed within ext4_dio_supported(), then we simply fallback to
> > buffered I/O in order to fulfil the request.
> >
> > Existing direct I/O read buffer_head code has been removed as it's now
> > redundant.
> >
> > Signed-off-by: Matthew Bobrowski <[email protected]>
>
> The patch looks good to me. Just one small nit below. With that fixed, you
> can add:
>
> Reviewed-by: Jan Kara <[email protected]>

Cool, I'll fix it!

> > + /*
> > + * Get exclusion from truncate and other inode operations.
> > + */
> > + if (!inode_trylock_shared(inode)) {
> > + if (iocb->ki_flags & IOCB_NOWAIT)
> > + return -EAGAIN;
> > + inode_lock_shared(inode);
> > + }
>
> I've noticed here you actually introduce new trylock pattern - previously
> we had unconditional inode_lock_shared() in ext4_direct_IO_read(). So the
> cleanest would be to just use unconditional inode_lock_shared() here and
> then fixup IOCB_NOWAIT handling (I agree that was missing in the original
> code) in a separate patch.

Right, so I will just have an unconditional call to
inode_lock_shared() and in the patch that follows I will fix it up to
apply the new pattern.

> And the pattern should rather look like:
>
> if (iocb->ki_flags & IOCB_NOWAIT) {
> if (!inode_trylock_shared(inode))
> return -EAGAIN;
> } else {
> inode_lock_shared(inode);
> }
>
> to avoid two atomical operations instead of one in the fast path. No need
> to repeat old mistakes when we know better :).

Yes, also agree.

--<M>--

2019-10-09 11:55:02

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v4 8/8] ext4: introduce direct I/O write path using iomap infrastructure

On Tue, Oct 08, 2019 at 05:12:38PM +0200, Jan Kara wrote:
> On Thu 03-10-19 21:35:05, Matthew Bobrowski wrote:
> > Any blocks
> > that have been allocated in preparation for direct I/O write will be
> > reused by buffered I/O, so there's no issue with leaving allocated
> > blocks beyond EOF.
>
> This actually is not true as ext4_truncate_failed_write() will trim blocks
> beyond EOF. Also this would not be 100% reliable as if we crash between DIO
> short write succeeding and buffered write happening, we would leave inode
> with blocks beyond EOF. So I'd just remove this sentence.

OK.

> > Existing direct I/O write buffer_head code has been removed as it's
> > now redundant.
> >
> > Signed-off-by: Matthew Bobrowski <[email protected]>
> ...
> > +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> > + int error, unsigned int flags)
> > +{
> > + loff_t offset = iocb->ki_pos;
> > + struct inode *inode = file_inode(iocb->ki_filp);
> > +
> > + if (!error && (flags & IOMAP_DIO_UNWRITTEN))
> > + error = ext4_convert_unwritten_extents(NULL, inode, offset,
> > + size);
> > + return ext4_handle_inode_extension(inode, offset, error ? : size,
> > + size);
> > +}
>
> I was pondering about this and I don't think we have it quite correct.
> Still :-|. The problem is that iomap_dio_complete() will pass dio->size as
> 'size', which is the amount of submitted IO but not necessarily the amount
> of blocks that were mapped (that can be larger). Thus
> ext4_handle_inode_extension() can miss the fact that there are blocks
> beyond EOF that need to be trimmed. And we have no way of finding that out
> inside our ->end_io handler. Even iomap_dio_complete() doesn't have that
> information so we'd need to add 'original length' to struct iomap_dio and
> then pass it do ->end_io.

Yes, I remember having a discussion around this in the past. The
answer to this problem at the time was that any blocks that may have
been allocated in preparation for the direct I/O write and we're not
used would in fact be reused when we fell back to buffered I/O. The
case that you're describing above, based on my understanding, would
have to be a result of short write?

> Seeing how difficult it is when a filesystem wants to complete the iocb
> synchronously (regardless whether it is async or sync) and have all the
> information in one place for further processing, I think it would be the
> easiest to provide iomap_dio_rw_wait() that forces waiting for the iocb to
> complete *and* returns the appropriate return value instead of pretty
> useless EIOCBQUEUED. It is actually pretty trivial (patch attached). With
> this we can then just call iomap_dio_rw_sync() for the inode extension case
> with ->end_io doing just the unwritten extent processing and then call
> ext4_handle_inode_extension() from ext4_direct_write_iter() where we would
> have all the information we need.

This could also work, nicely. But, if this isn't an option for
whatever reason then we could go with what you suggested above? After
all, I think it would make sense to pass such information about the
write all the way through to the end, especially the ->end_io handler,
seeing as though that's we're clean up should be performed in the
instance of a failure, or a short write.

However, note that:

a) likely(my understanding may be wrong)

b) Someone a lot smarter than I has probably already thought this
through and there's a real good reason why we don't cram such
information about the write within the iomap structures and have
them passed all the way through to the end...

:)

--<M>--

2019-10-09 12:09:46

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v4 3/8] ext4: introduce new callback for IOMAP_REPORT operations

On Wed, Oct 09, 2019 at 11:30:21AM +0530, Ritesh Harjani wrote:
> > +static u16 ext4_iomap_check_delalloc(struct inode *inode,
> > + struct ext4_map_blocks *map)
> > +{
> > + struct extent_status es;
> > + ext4_lblk_t end = map->m_lblk + map->m_len - 1;
> > +
> > + ext4_es_find_extent_range(inode, &ext4_es_is_delayed, map->m_lblk,
> > + end, &es);
> > +
> > + /* Entire range is a hole */
> > + if (!es.es_len || es.es_lblk > end)
> > + return IOMAP_HOLE;
> > + if (es.es_lblk <= map->m_lblk) {
> > + ext4_lblk_t offset = 0;
> > +
> > + if (es.es_lblk < map->m_lblk)
> > + offset = map->m_lblk - es.es_lblk;
> > + map->m_lblk = es.es_lblk + offset;
> This looks redundant no? map->m_lblk never changes actually.
> So this is not needed here.

Well, it depends if map->m_lblk == es.es_lblk + offset prior to the
assignment? If that's always true, then sure, it'd be redundant. But
honestly, I don't know what the downstream effect would be if this was
removed. I'd have to look at the code, perform some tests, and figure
it out.

> > + map.m_lblk = first_block;
> > + map.m_len = last_block = first_block + 1;
> > + ret = ext4_map_blocks(NULL, inode, &map, 0);
> > + if (ret < 0)
> > + return ret;
> > + if (ret == 0)
> > + type = ext4_iomap_check_delalloc(inode, &map);
> > + return ext4_set_iomap(inode, iomap, type, first_block, &map);
> We don't need to send first_block here. Since map->m_lblk
> is same as first_block.
> Also with Jan comment, we don't even need 'type' parameter.
> Then we should be able to rename the function
> ext4_set_iomap ==> ext4_map_to_iomap. This better reflects what it is
> doing. Thoughts?

Depends on what we conclude in 1/8. :)

I'm for removing 'first_block', but still not convinced removing
'type' is heading down the right track if I were to forward think a
little.

--<M>--

2019-10-09 12:52:29

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 5/8] ext4: move inode extension/truncate code out from ->iomap_end() callback

On Wed 09-10-19 21:18:50, Matthew Bobrowski wrote:
> > Just small nits below:
> >
> > > +static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> > > + ssize_t written, size_t count)
> > > +{
> > > + int ret = 0;
> >
> > I think both the function and callsites may be slightly simpler if you let
> > the function return 'written' or error (not 0 or error). But I'll leave
> > that decision upto you.
>
> Hm, don't we actually need to return 0 for success cases so that
> iomap_dio_complete() behaves correctly i.e. increments iocb->ki_pos,
> etc?

Correct, iomap_dio_complete() expects 0 on success. So if we keep calling
ext4_handle_inode_extension() from ->end_io handler, we'd need some
specialcasing there and I agree that changing ext4_handle_inode_extension()
return convention isn't then very beneficial. If we stop calling
ext4_handle_inode_extension() from ->end_io handler (patch 8/8 discussion
pending), then the change would be a clear win.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-10-09 13:06:44

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 1/8] ext4: move out iomap field population into separate helper

On Wed 09-10-19 19:57:23, Matthew Bobrowski wrote:
> On Tue, Oct 08, 2019 at 12:27:09PM +0200, Jan Kara wrote:
> > On Thu 03-10-19 21:33:09, Matthew Bobrowski wrote:
> > > +static int ext4_set_iomap(struct inode *inode, struct iomap *iomap, u16 type,
> > > + unsigned long first_block, struct ext4_map_blocks *map)
> > > +{
> > > + u8 blkbits = inode->i_blkbits;
> > > +
> > > + iomap->flags = 0;
> > > + if (ext4_inode_datasync_dirty(inode))
> > > + iomap->flags |= IOMAP_F_DIRTY;
> > > + iomap->bdev = inode->i_sb->s_bdev;
> > > + iomap->dax_dev = EXT4_SB(inode->i_sb)->s_daxdev;
> > > + iomap->offset = (u64) first_block << blkbits;
> > > + iomap->length = (u64) map->m_len << blkbits;
> > > +
> > > + if (type) {
> > > + iomap->type = type;
> > > + iomap->addr = IOMAP_NULL_ADDR;
> > > + } else {
> > > + if (map->m_flags & EXT4_MAP_MAPPED) {
> > > + iomap->type = IOMAP_MAPPED;
> > > + } else if (map->m_flags & EXT4_MAP_UNWRITTEN) {
> > > + iomap->type = IOMAP_UNWRITTEN;
> > > + } else {
> > > + WARN_ON_ONCE(1);
> > > + return -EIO;
> > > + }
> > > + iomap->addr = (u64) map->m_pblk << blkbits;
> > > + }
> >
> > Looking at this function now, the 'type' argument looks a bit weird. Can we
> > perhaps just remove the 'type' argument and change the above to:
>
> We can, but refer to the point below.
>
> > if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN)) {
> > if (map->m_flags & EXT4_MAP_MAPPED)
> > iomap->type = IOMAP_MAPPED;
> > else if (map->m_flags & EXT4_MAP_UNWRITTEN)
> > iomap->type = IOMAP_UNWRITTEN;
> > iomap->addr = (u64) map->m_pblk << blkbits;
> > } else {
> > iomap->type = IOMAP_HOLE;
> > iomap->addr = IOMAP_NULL_ADDR;
> > }
> >
> > And then in ext4_iomap_begin() we overwrite the type to:
> >
> > if (delalloc && iomap->type == IOMAP_HOLE)
> > iomap->type = IOMAP_DELALLOC;
> >
> > That would IMO make ext4_set_iomap() arguments harder to get wrong.
>
> I was thinking about this while doing a bunch of other things at work
> today. I'm kind of aligned with what Christoph mentioned around
> possibly duplicating some of the post 'iomap->type' setting from both
> current and any future ext4_set_iomap() callers. In addition to this,
> my thought was that if we're populating the iomap structure with
> values respectively, then it would make most sense to encapsulate
> those routines, if possible, within the ext4_set_iomap() as that's the
> sole purpose of the function.

Well, what I dislike about 'type' argument is the inconsistency in it's
handling. It is useful only for HOLE/DELALLOC, anything else will just give
you invalid iomap and you have to be careful to pass 0 in that case.

I understand the concern about possible duplication but since only
IOMAP_REPORT cares about IOMAP_DELALLOC, I'm not much concerned about it.
But another sensible API would be to optionally pass 'struct extent_status
*es' argument to ext4_set_iomap() and if this argument is non-NULL,
ext4_set_iomap() will handle the intersection of ext4_map_blocks and
extent_status and deduce appropriate resulting type from that.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-10-09 13:15:20

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH v4 3/8] ext4: introduce new callback for IOMAP_REPORT operations



On 10/9/19 5:38 PM, Matthew Bobrowski wrote:
> On Wed, Oct 09, 2019 at 11:30:21AM +0530, Ritesh Harjani wrote:
>>> +static u16 ext4_iomap_check_delalloc(struct inode *inode,
>>> + struct ext4_map_blocks *map)
>>> +{
>>> + struct extent_status es;
>>> + ext4_lblk_t end = map->m_lblk + map->m_len - 1;
>>> +
>>> + ext4_es_find_extent_range(inode, &ext4_es_is_delayed, map->m_lblk,
>>> + end, &es);
>>> +
>>> + /* Entire range is a hole */
>>> + if (!es.es_len || es.es_lblk > end)
>>> + return IOMAP_HOLE;
>>> + if (es.es_lblk <= map->m_lblk) {
>>> + ext4_lblk_t offset = 0;
>>> +
>>> + if (es.es_lblk < map->m_lblk)
>>> + offset = map->m_lblk - es.es_lblk;
>>> + map->m_lblk = es.es_lblk + offset;
>> This looks redundant no? map->m_lblk never changes actually.
>> So this is not needed here.
>
> Well, it depends if map->m_lblk == es.es_lblk + offset prior to the
> assignment? If that's always true, then sure, it'd be redundant. But
> honestly, I don't know what the downstream effect would be if this was
> removed. I'd have to look at the code, perform some tests, and figure
> it out.

<snip>
3334 if (es.es_lblk <= map->m_lblk) {
3335 ext4_lblk_t offset = 0;
3336
3337 if (es.es_lblk < map->m_lblk)
3338 offset = map->m_lblk - es.es_lblk;
3339 map->m_lblk = es.es_lblk + offset;
3340 map->m_len = es.es_len - offset;
3341 return IOMAP_DELALLOC;
3342 }

I saw it this way-

In condition "if (es.es_lblk <= map->m_lblk)" there are 2 cases.

Case 1: es.es_lblk is equal to map->m_lblk (equality)
For this case, "offset" will remain 0.
So map->lblk = es.es_lblk + 0 (but since es.es_lblk is same as
map->m_lblk in equality case, so it is redundant).


Case 2: es.es_lblk < map->m_lblk (less than)
In this case "offset = map->m_lblk - es.es_lblk"
Now replacing this val of offset in "map->m_lblk = es.es_lblk + offset"
map->m_lblk = es.es_lblk + map->m_lblk - es.es_lblk
which again is map->m_lblk = map->m_lblk - again redundant.


What did I miss?
But sure feel free to test as per your convenience.


>
>>> + map.m_lblk = first_block;
>>> + map.m_len = last_block = first_block + 1;
>>> + ret = ext4_map_blocks(NULL, inode, &map, 0);
>>> + if (ret < 0)
>>> + return ret;
>>> + if (ret == 0)
>>> + type = ext4_iomap_check_delalloc(inode, &map);
>>> + return ext4_set_iomap(inode, iomap, type, first_block, &map);
>> We don't need to send first_block here. Since map->m_lblk
>> is same as first_block.
>> Also with Jan comment, we don't even need 'type' parameter.
>> Then we should be able to rename the function
>> ext4_set_iomap ==> ext4_map_to_iomap. This better reflects what it is
>> doing. Thoughts?
>
> Depends on what we conclude in 1/8. :)
>
> I'm for removing 'first_block', but still not convinced removing
> 'type' is heading down the right track if I were to forward think a
> little.

Only once you are convinced that map->m_lblk will not change even in
function ext4_iomap_check_delalloc(), then only you should
drop "first_block" argument from ext4_set_iomap.

Please check above comments once.

-ritesh

2019-10-09 13:42:20

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v4 8/8] ext4: introduce direct I/O write path using iomap infrastructure

On Wed 09-10-19 00:11:45, Christoph Hellwig wrote:
> On Tue, Oct 08, 2019 at 05:12:38PM +0200, Jan Kara wrote:
> > Seeing how difficult it is when a filesystem wants to complete the iocb
> > synchronously (regardless whether it is async or sync) and have all the
> > information in one place for further processing, I think it would be the
> > easiest to provide iomap_dio_rw_wait() that forces waiting for the iocb to
> > complete *and* returns the appropriate return value instead of pretty
> > useless EIOCBQUEUED. It is actually pretty trivial (patch attached). With
> > this we can then just call iomap_dio_rw_sync() for the inode extension case
> > with ->end_io doing just the unwritten extent processing and then call
> > ext4_handle_inode_extension() from ext4_direct_write_iter() where we would
> > have all the information we need.
> >
> > Christoph, Darrick, what do you think about extending iomap like in the
> > attached patch (plus sample use in XFS)?
>
> I vaguely remember suggesting something like this but Brian or Dave
> convinced me it wasn't a good idea. This will require a trip to the
> xfs or fsdevel archives from when the inode_dio_wait was added in XFS.

I think you refer to this [1] message from Brian:

It's not quite that simple..

FWIW, the discussion (between Dave and I) for how best to solve this
started offline prior to sending the patch and pretty much started with
the idea of changing the async I/O to sync as you suggest here. I backed
off from that because it's too subtle given the semantics between the
higher level aio code and lower level dio code for async I/O. By that I
mean either can be responsible for calling the ->ki_complete() callback
in the iocb on I/O completion.

IOW, if we receive an async direct I/O, clear ->ki_complete() as you
describe above and submit it, the dio code will wait on I/O and return
the size of the I/O on successful completion. It will not have called
->ki_complete(), however. Rather, the >0 return value indicates that
aio_rw_done() must call ->ki_complete() after xfs_file_write_iter()
returns, but we would have already cleared the function pointer.

I think it is technically possible to use this technique by clearing and
restoring ->ki_complete(), but in general we've visited this "change the
I/O type" approach twice now and we've (collectively) got it wrong both
times (the first error in thinking was that XFS would need to call
->ki_complete()). IMO, this demonstrates that it's not worth the
complexity to insert ourselves into this dependency chain when we can
accomplish the same thing with a simple dio wait call.

---

Which is fair enough. I've been looking at the same and arrived at similar
conclusion ;) (BTW, funnily enough ocfs2 seems to do this dance with
clearing and restoring ki_complete). But what I propose is something
different - just wait for IO in iomap_dio_rw() which avoids the need to
clear & restore ->ki_complete() while at the same time while providing
fully-sync IO experience to the caller. So Brians objection doesn't apply
here.

> But if we decide it actully works this time around please don't add the
> __ variant but just add the parameter to iomap_dio_rw directly.

Yeah, I was undecided on this one as well. Will change this and post the
patches to fsdevel & xfs lists.

Honza

[1] https://lore.kernel.org/linux-xfs/20190325135124.GD52167@bfoster/
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-10-10 05:40:09

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v4 1/8] ext4: move out iomap field population into separate helper

On Wed, Oct 09, 2019 at 03:06:09PM +0200, Jan Kara wrote:
> On Wed 09-10-19 19:57:23, Matthew Bobrowski wrote:
> > On Tue, Oct 08, 2019 at 12:27:09PM +0200, Jan Kara wrote:
> > > On Thu 03-10-19 21:33:09, Matthew Bobrowski wrote:
> > > > +static int ext4_set_iomap(struct inode *inode, struct iomap *iomap, u16 type,
> > > > + unsigned long first_block, struct ext4_map_blocks *map)
> > > > +{
> > > > + u8 blkbits = inode->i_blkbits;
> > > > +
> > > > + iomap->flags = 0;
> > > > + if (ext4_inode_datasync_dirty(inode))
> > > > + iomap->flags |= IOMAP_F_DIRTY;
> > > > + iomap->bdev = inode->i_sb->s_bdev;
> > > > + iomap->dax_dev = EXT4_SB(inode->i_sb)->s_daxdev;
> > > > + iomap->offset = (u64) first_block << blkbits;
> > > > + iomap->length = (u64) map->m_len << blkbits;
> > > > +
> > > > + if (type) {
> > > > + iomap->type = type;
> > > > + iomap->addr = IOMAP_NULL_ADDR;
> > > > + } else {
> > > > + if (map->m_flags & EXT4_MAP_MAPPED) {
> > > > + iomap->type = IOMAP_MAPPED;
> > > > + } else if (map->m_flags & EXT4_MAP_UNWRITTEN) {
> > > > + iomap->type = IOMAP_UNWRITTEN;
> > > > + } else {
> > > > + WARN_ON_ONCE(1);
> > > > + return -EIO;
> > > > + }
> > > > + iomap->addr = (u64) map->m_pblk << blkbits;
> > > > + }
> > >
> > > Looking at this function now, the 'type' argument looks a bit weird. Can we
> > > perhaps just remove the 'type' argument and change the above to:
> >
> > We can, but refer to the point below.
> >
> > > if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN)) {
> > > if (map->m_flags & EXT4_MAP_MAPPED)
> > > iomap->type = IOMAP_MAPPED;
> > > else if (map->m_flags & EXT4_MAP_UNWRITTEN)
> > > iomap->type = IOMAP_UNWRITTEN;
> > > iomap->addr = (u64) map->m_pblk << blkbits;
> > > } else {
> > > iomap->type = IOMAP_HOLE;
> > > iomap->addr = IOMAP_NULL_ADDR;
> > > }
> > >
> > > And then in ext4_iomap_begin() we overwrite the type to:
> > >
> > > if (delalloc && iomap->type == IOMAP_HOLE)
> > > iomap->type = IOMAP_DELALLOC;
> > >
> > > That would IMO make ext4_set_iomap() arguments harder to get wrong.
> >
> > I was thinking about this while doing a bunch of other things at work
> > today. I'm kind of aligned with what Christoph mentioned around
> > possibly duplicating some of the post 'iomap->type' setting from both
> > current and any future ext4_set_iomap() callers. In addition to this,
> > my thought was that if we're populating the iomap structure with
> > values respectively, then it would make most sense to encapsulate
> > those routines, if possible, within the ext4_set_iomap() as that's the
> > sole purpose of the function.
>
> Well, what I dislike about 'type' argument is the inconsistency in it's
> handling. It is useful only for HOLE/DELALLOC, anything else will just give
> you invalid iomap and you have to be careful to pass 0 in that case.
>
> I understand the concern about possible duplication but since only
> IOMAP_REPORT cares about IOMAP_DELALLOC, I'm not much concerned about it.
> But another sensible API would be to optionally pass 'struct extent_status
> *es' argument to ext4_set_iomap() and if this argument is non-NULL,
> ext4_set_iomap() will handle the intersection of ext4_map_blocks and
> extent_status and deduce appropriate resulting type from that.

I see and thank you for sharing your view point. Let's go with what you
originally proposed, which is dropping the 'type' argument and then handling
IOMAP_DELALLOC within ext4_set_iomap(). No real hard objections. :)

--<M>--

2019-10-10 05:45:14

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v4 5/8] ext4: move inode extension/truncate code out from ->iomap_end() callback

On Wed, Oct 09, 2019 at 02:51:32PM +0200, Jan Kara wrote:
> On Wed 09-10-19 21:18:50, Matthew Bobrowski wrote:
> > > Just small nits below:
> > >
> > > > +static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> > > > + ssize_t written, size_t count)
> > > > +{
> > > > + int ret = 0;
> > >
> > > I think both the function and callsites may be slightly simpler if you let
> > > the function return 'written' or error (not 0 or error). But I'll leave
> > > that decision upto you.
> >
> > Hm, don't we actually need to return 0 for success cases so that
> > iomap_dio_complete() behaves correctly i.e. increments iocb->ki_pos,
> > etc?
>
> Correct, iomap_dio_complete() expects 0 on success. So if we keep calling
> ext4_handle_inode_extension() from ->end_io handler, we'd need some
> specialcasing there and I agree that changing ext4_handle_inode_extension()
> return convention isn't then very beneficial. If we stop calling
> ext4_handle_inode_extension() from ->end_io handler (patch 8/8 discussion
> pending), then the change would be a clear win.

Agreed. Well, I think we've got some movement in the right direction in 8/8,
so it looks like changing up the return values is what we'll go ahead with.

--<M>--