2019-09-12 11:05:18

by Matthew Bobrowski

[permalink] [raw]
Subject: [PATCH v3 0/6] ext4: port direct IO to iomap infrastructure

This patch series ports the ext4 direct IO paths to make use of the
iomap infrastructure. The legacy buffer_head based direct IO paths
have subsequently been removed as they're now no longer in use. The
result of this change is that the direct IO implementation is much
cleaner and keeps the code isolated from the buffer_head internals. In
addition to this, a slight performance boost could be expected while
using O_SYNC | O_DIRECT IO.

The changes have been tested using xfstests in both DAX and non-DAX
modes using various filesystem configurations i.e. 4k, dioread_nolock.

Changes since v2:

- Simplified ext4_write_checks() by using file_modified() as oppose
to calling file_remove_privs() and file_update_time() explicitly.

- Other minor cleanups that have been suggested by Ritesh Harjani in
the previous round of feedback.

In the original patch series, there was a relatively lengthy
discussion around the merging of unwritten extents. The buffer_head
direct IO implementation made use of inode and end_io flags to track
whether an unwritten extent is eligible to be merged, or not. We don't
make use of these flags in the new direct IO iomap implementation,
effectively making the extent merging checks that make use these flags
redundant. However, it appears that even if additional merges and
splits are performed, it isn't deemed problematic as such and that's
due to the way that the filesystem now accommdates for unexpected
extent splits. The only real concern is the potential wasted
performance due to the unnecessary merge and split performed under
specific workloads. The full discussion that goes through these
details starts from here:
https://www.spinics.net/lists/linux-ext4/msg67173.html.

Thank you to all that took the time to review and provide constructive
feedback for previous patch series, highly appreciated.

Matthew Bobrowski (6):
ext4: introduce direct IO read path using iomap infrastructure
ext4: move inode extension/truncate code out from ext4_iomap_end()
iomap: split size and error for iomap_dio_rw ->end_io
ext4: reorder map.m_flags checks in ext4_iomap_begin()
ext4: introduce direct IO write path using iomap infrastructure
ext4: cleanup legacy buffer_head direct IO code

fs/ext4/ext4.h | 3 -
fs/ext4/extents.c | 11 +-
fs/ext4/file.c | 364 +++++++++++++++++++++++------
fs/ext4/inode.c | 516 +++++-------------------------------------
fs/iomap/direct-io.c | 9 +-
fs/xfs/xfs_file.c | 8 +-
include/linux/iomap.h | 4 +-
7 files changed, 363 insertions(+), 552 deletions(-)

--
2.20.1


2019-09-12 11:06:00

by Matthew Bobrowski

[permalink] [raw]
Subject: [PATCH v3 2/6] ext4: move inode extension/truncate code out from ext4_iomap_end()

In preparation for implementing the iomap direct IO write path
modifications, the inode extension/truncate code needs to be moved out
from ext4_iomap_end(). For direct IO, if the current code remained
within ext4_iomap_end() it would behave incorrectly. If we update the
inode size prior to converting unwritten extents we run the risk of
allowing a racing direct IO read operation to find unwritten extents
before they are converted.

The inode extension/truncate code has been moved out into a new helper
ext4_handle_inode_extension(). This helper has been designed so that
it can be used by both DAX and direct IO paths in the instance that
the result of the write is extending the inode.

Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/file.c | 92 ++++++++++++++++++++++++++++++++++++++++++++++++-
fs/ext4/inode.c | 48 +-------------------------
2 files changed, 92 insertions(+), 48 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index e52e3928dc25..6457c629b8cf 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -33,6 +33,7 @@
#include "ext4_jbd2.h"
#include "xattr.h"
#include "acl.h"
+#include "truncate.h"

static bool ext4_dio_checks(struct inode *inode)
{
@@ -233,12 +234,90 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
return iov_iter_count(from);
}

+static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
+ ssize_t len, size_t count)
+{
+ handle_t *handle;
+ bool truncate = false;
+ ext4_lblk_t written_blk, end_blk;
+ int ret = 0, blkbits = inode->i_blkbits;
+
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto orphan_del;
+ }
+
+ if (ext4_update_inode_size(inode, offset + len))
+ ext4_mark_inode_dirty(handle, inode);
+
+ /*
+ * We may need truncate allocated but not written blocks
+ * beyond EOF.
+ */
+ written_blk = ALIGN(offset + len, 1 << blkbits);
+ end_blk = ALIGN(offset + count, 1 << blkbits);
+ if (written_blk < end_blk && ext4_can_truncate(inode))
+ truncate = true;
+
+ /*
+ * Remove the inode from the orphan list if it has been
+ * extended and everything went OK.
+ */
+ if (!truncate && inode->i_nlink)
+ ext4_orphan_del(handle, inode);
+ ext4_journal_stop(handle);
+
+ if (truncate) {
+ ext4_truncate_failed_write(inode);
+orphan_del:
+ /*
+ * If the truncate operation failed early the inode
+ * may still be on the orphan list. In that case, we
+ * need try remove the inode from the linked list in
+ * memory.
+ */
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+ }
+ return ret;
+}
+
+/*
+ * The inode may have been placed onto the orphan list or has had
+ * blocks allocated beyond EOF as a result of an extension. We need to
+ * ensure that any necessary cleanup routines are performed if the
+ * error path has been taken for a write.
+ */
+static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
+{
+ handle_t *handle;
+
+ if (size > i_size_read(inode))
+ ext4_truncate_failed_write(inode);
+
+ if (!list_empty(&EXT4_I(inode)->i_orphan)) {
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+ return PTR_ERR(handle);
+ }
+ if (inode->i_nlink)
+ ext4_orphan_del(handle, inode);
+ ext4_journal_stop(handle);
+ }
+ return 0;
+}
+
#ifdef CONFIG_FS_DAX
static ssize_t
ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
- struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;
+ int error = 0;
+ loff_t offset;
+ struct inode *inode = file_inode(iocb->ki_filp);

if (!inode_trylock(inode)) {
if (iocb->ki_flags & IOCB_NOWAIT)
@@ -255,7 +334,18 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (ret)
goto out;

+ offset = iocb->ki_pos;
ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
+ if (ret > 0 && iocb->ki_pos > i_size_read(inode))
+ error = ext4_handle_inode_extension(inode, offset, ret,
+ iov_iter_count(from));
+
+ if (ret < 0)
+ error = ext4_handle_failed_inode_extension(inode,
+ iocb->ki_pos);
+
+ if (error)
+ ret = error;
out:
inode_unlock(inode);
if (ret > 0)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 420fe3deed39..761ce6286b05 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3601,53 +3601,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ssize_t written, unsigned flags, struct iomap *iomap)
{
- int ret = 0;
- handle_t *handle;
- int blkbits = inode->i_blkbits;
- bool truncate = false;
-
- if (!(flags & IOMAP_WRITE) || (flags & IOMAP_FAULT))
- return 0;
-
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto orphan_del;
- }
- if (ext4_update_inode_size(inode, offset + written))
- ext4_mark_inode_dirty(handle, inode);
- /*
- * We may need to truncate allocated but not written blocks beyond EOF.
- */
- if (iomap->offset + iomap->length >
- ALIGN(inode->i_size, 1 << blkbits)) {
- ext4_lblk_t written_blk, end_blk;
-
- written_blk = (offset + written) >> blkbits;
- end_blk = (offset + length) >> blkbits;
- if (written_blk < end_blk && ext4_can_truncate(inode))
- truncate = true;
- }
- /*
- * Remove inode from orphan list if we were extending a inode and
- * everything went fine.
- */
- if (!truncate && inode->i_nlink &&
- !list_empty(&EXT4_I(inode)->i_orphan))
- ext4_orphan_del(handle, inode);
- ext4_journal_stop(handle);
- if (truncate) {
- ext4_truncate_failed_write(inode);
-orphan_del:
- /*
- * If truncate failed early the inode might still be on the
- * orphan list; we need to make sure the inode is removed from
- * the orphan list in that case.
- */
- if (inode->i_nlink)
- ext4_orphan_del(NULL, inode);
- }
- return ret;
+ return 0;
}

const struct iomap_ops ext4_iomap_ops = {
--
2.20.1

2019-09-12 11:06:05

by Matthew Bobrowski

[permalink] [raw]
Subject: [PATCH v3 3/6] iomap: split size and error for iomap_dio_rw ->end_io

Modify the calling convention for the iomap_dio_rw ->end_io() callback.
Rather than passing either dio->error or dio->size as the 'size' argument,
instead pass both the dio->error and the dio->size value separately.

In the instance that an error occurred during a write, we currently cannot
determine whether any blocks have been allocated beyond the current EOF and
data has subsequently been written to these blocks within the ->end_io()
callback. As a result, we cannot judge whether we should take the truncate
failed write path. Having both dio->error and dio->size will allow us to
perform such checks within this callback.

Signed-off-by: Matthew Bobrowski <[email protected]>
[hch: minor cleanups]
Signed-off-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Darrick J. Wong <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
---

Note, this patch is already queued in the 'iomap-5.4-merge' branch here:
https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/log/?h=iomap-5.4-merge.

Just adding it within this patch series to highlight that it's a
dependency.

fs/iomap/direct-io.c | 9 +++------
fs/xfs/xfs_file.c | 8 +++++---
include/linux/iomap.h | 4 ++--
3 files changed, 10 insertions(+), 11 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 10517cea9682..2ccf1c6460d4 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -77,13 +77,10 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
loff_t offset = iocb->ki_pos;
ssize_t ret;

- if (dio->end_io) {
- ret = dio->end_io(iocb,
- dio->error ? dio->error : dio->size,
- dio->flags);
- } else {
+ if (dio->end_io)
+ ret = dio->end_io(iocb, dio->size, dio->error, dio->flags);
+ else
ret = dio->error;
- }

if (likely(!ret)) {
ret = dio->size;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 28101bbc0b78..74411296f6b5 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -369,21 +369,23 @@ static int
xfs_dio_write_end_io(
struct kiocb *iocb,
ssize_t size,
+ int error,
unsigned flags)
{
struct inode *inode = file_inode(iocb->ki_filp);
struct xfs_inode *ip = XFS_I(inode);
loff_t offset = iocb->ki_pos;
unsigned int nofs_flag;
- int error = 0;

trace_xfs_end_io_direct_write(ip, offset, size);

if (XFS_FORCED_SHUTDOWN(ip->i_mount))
return -EIO;

- if (size <= 0)
- return size;
+ if (error)
+ return error;
+ if (!size)
+ return 0;

/*
* Capture amount written on completion as we can't reliably account
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index bc499ceae392..50bb746d2216 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -188,8 +188,8 @@ sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
*/
#define IOMAP_DIO_UNWRITTEN (1 << 0) /* covers unwritten extent(s) */
#define IOMAP_DIO_COW (1 << 1) /* covers COW extent(s) */
-typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t ret,
- unsigned flags);
+typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t size, int error,
+ unsigned int flags);
ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops, iomap_dio_end_io_t end_io);
int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
--
2.20.1

2019-09-12 11:06:06

by Matthew Bobrowski

[permalink] [raw]
Subject: [PATCH v3 4/6] ext4: reorder map.m_flags checks in ext4_iomap_begin()

For iomap direct IO write code path changes, we need to accommodate
for the case where the block mapping flags passed to ext4_map_blocks()
will result in m_flags having both EXT4_MAP_MAPPED and
EXT4_MAP_UNWRITTEN bits set. In order for the allocated unwritten
extents to be converted properly in the end_io handler, iomap->type
must be set to IOMAP_UNWRITTEN, so we need to reshuffle the
conditional statement in order to achieve this.

This change is a no-op for DAX code path as the block mapping flag
passed to ext4_map_blocks() when IS_DAX(inode) never results in
EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN being set at once.

Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/inode.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 761ce6286b05..efb184928e51 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3581,10 +3581,21 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
iomap->addr = IOMAP_NULL_ADDR;
} else {
- if (map.m_flags & EXT4_MAP_MAPPED) {
- iomap->type = IOMAP_MAPPED;
- } else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
+ /*
+ * Flags passed to ext4_map_blocks() for direct IO
+ * writes can result in m_flags having both
+ * EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits set. In
+ * order for allocated unwritten extents to be
+ * converted to written extents in the end_io handler
+ * correctly, we need to ensure that the iomap->type
+ * is also set appropriately in that case. Thus, we
+ * need to check whether EXT4_MAP_UNWRITTEN is set
+ * first.
+ */
+ if (map.m_flags & EXT4_MAP_UNWRITTEN) {
iomap->type = IOMAP_UNWRITTEN;
+ } else if (map.m_flags & EXT4_MAP_MAPPED) {
+ iomap->type = IOMAP_MAPPED;
} else {
WARN_ON_ONCE(1);
return -EIO;
--
2.20.1

2019-09-12 11:06:41

by Matthew Bobrowski

[permalink] [raw]
Subject: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure

This patch introduces a new direct IO write code path implementation
that makes use of the iomap infrastructure.

All direct IO write operations are now passed from the ->write_iter()
callback to the new function ext4_dio_write_iter(). This function is
responsible for calling into iomap infrastructure via
iomap_dio_rw(). Snippets of the direct IO code from within
ext4_file_write_iter(), such as checking whether the IO request is
unaligned asynchronous IO, or whether it will ber overwriting
allocated and initialized blocks has been moved out and into
ext4_dio_write_iter().

The block mapping flags that are passed to ext4_map_blocks() from
within ext4_dio_get_block() and friends have effectively been taken
out and introduced within the ext4_iomap_begin(). If ext4_map_blocks()
happens to have instantiated blocks beyond the i_size, then we attempt
to place the inode onto the orphan list. Despite being able to perform
i_size extension checking earlier on in the direct IO code path, it
makes most sense to perform this bit post successful block allocation.

The ->end_io() callback ext4_dio_write_end_io() is responsible for
removing the inode from the orphan list and determining if we should
truncate a failed write in the case of an error. We also convert a
range of unwritten extents to written if IOMAP_DIO_UNWRITTEN is set
and perform the necessary i_size/i_disksize extension if the
iocb->ki_pos + dio->size > i_size_read(inode).

In the instance of a short write, we fallback to buffered IO and
complete whatever is left the 'iter'. Any blocks that may have been
allocated in preparation for direct IO will be reused by buffered IO,
so there's no issue with leaving allocated blocks beyond EOF.

Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/file.c | 219 +++++++++++++++++++++++++++++++++---------------
fs/ext4/inode.c | 57 ++++++++++---
2 files changed, 196 insertions(+), 80 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6457c629b8cf..413c7895aa9e 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -29,6 +29,7 @@
#include <linux/pagevec.h>
#include <linux/uio.h>
#include <linux/mman.h>
+#include <linux/backing-dev.h>
#include "ext4.h"
#include "ext4_jbd2.h"
#include "xattr.h"
@@ -213,12 +214,16 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;

+ if (unlikely(IS_IMMUTABLE(inode)))
+ return -EPERM;
+
ret = generic_write_checks(iocb, from);
if (ret <= 0)
return ret;

- if (unlikely(IS_IMMUTABLE(inode)))
- return -EPERM;
+ ret = file_modified(iocb->ki_filp);
+ if (ret)
+ return 0;

/*
* If we have encountered a bitmap-format file, the size limit
@@ -234,6 +239,32 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
return iov_iter_count(from);
}

+static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
+ struct iov_iter *from)
+{
+ ssize_t ret;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return -EOPNOTSUPP;
+
+ inode_lock(inode);
+ ret = ext4_write_checks(iocb, from);
+ if (ret <= 0)
+ goto out;
+
+ current->backing_dev_info = inode_to_bdi(inode);
+ ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
+ current->backing_dev_info = NULL;
+out:
+ inode_unlock(inode);
+ if (likely(ret > 0)) {
+ iocb->ki_pos += ret;
+ ret = generic_write_sync(iocb, ret);
+ }
+ return ret;
+}
+
static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
ssize_t len, size_t count)
{
@@ -310,6 +341,120 @@ static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
return 0;
}

+/*
+ * For a write that extends the inode size, ext4_dio_write_iter() will
+ * wait for the write to complete. Consequently, operations performed
+ * within this function are still covered by the inode_lock(). On
+ * success, this function returns 0.
+ */
+static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
+ unsigned int flags)
+{
+ int ret;
+ loff_t offset = iocb->ki_pos;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (error) {
+ ret = ext4_handle_failed_inode_extension(inode, offset + size);
+ return ret ? ret : error;
+ }
+
+ if (flags & IOMAP_DIO_UNWRITTEN) {
+ ret = ext4_convert_unwritten_extents(NULL, inode,
+ offset, size);
+ if (ret)
+ return ret;
+ }
+
+ if (offset + size > i_size_read(inode)) {
+ ret = ext4_handle_inode_extension(inode, offset, size, 0);
+ if (ret)
+ return ret;
+ }
+ return 0;
+}
+
+static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+ ssize_t ret;
+ size_t count;
+ loff_t offset = iocb->ki_pos;
+ struct inode *inode = file_inode(iocb->ki_filp);
+ bool extend = false, overwrite = false, unaligned_aio = false;
+
+ if (!inode_trylock(inode)) {
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return -EAGAIN;
+ inode_lock(inode);
+ }
+
+ if (!ext4_dio_checks(inode)) {
+ inode_unlock(inode);
+ /*
+ * Fallback to buffered IO if the operation on the
+ * inode is not supported by direct IO.
+ */
+ return ext4_buffered_write_iter(iocb, from);
+ }
+
+ ret = ext4_write_checks(iocb, from);
+ if (ret <= 0) {
+ inode_unlock(inode);
+ return ret;
+ }
+
+ /*
+ * Unaligned direct AIO must be serialized among each other as
+ * the zeroing of partial blocks of two competing unaligned
+ * AIOs can result in data corruption.
+ */
+ if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
+ !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
+ unaligned_aio = true;
+ inode_dio_wait(inode);
+ }
+
+ /*
+ * Determine whether the IO operation will overwrite allocated
+ * and initialized blocks. If so, check to see whether it is
+ * possible to take the dioread_nolock path.
+ */
+ count = iov_iter_count(from);
+ if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
+ ext4_should_dioread_nolock(inode)) {
+ overwrite = true;
+ downgrade_write(&inode->i_rwsem);
+ }
+
+ if (offset + count > i_size_read(inode) ||
+ offset + count > EXT4_I(inode)->i_disksize) {
+ ext4_update_i_disksize(inode, inode->i_size);
+ extend = true;
+ }
+
+ ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);
+
+ /*
+ * Unaligned direct AIO must be the only IO in flight or else
+ * any overlapping aligned IO after unaligned IO might result
+ * in data corruption. We also need to wait here in the case
+ * where the inode is being extended so that inode extension
+ * routines in ext4_dio_write_end_io() are covered by the
+ * inode_lock().
+ */
+ if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
+ inode_dio_wait(inode);
+
+ if (overwrite)
+ inode_unlock_shared(inode);
+ else
+ inode_unlock(inode);
+
+ if (ret >= 0 && iov_iter_count(from))
+ return ext4_buffered_write_iter(iocb, from);
+ return ret;
+}
+
#ifdef CONFIG_FS_DAX
static ssize_t
ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
@@ -324,15 +469,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
return -EAGAIN;
inode_lock(inode);
}
+
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
- ret = file_remove_privs(iocb->ki_filp);
- if (ret)
- goto out;
- ret = file_update_time(iocb->ki_filp);
- if (ret)
- goto out;

offset = iocb->ki_pos;
ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
@@ -358,73 +498,16 @@ static ssize_t
ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct inode *inode = file_inode(iocb->ki_filp);
- int o_direct = iocb->ki_flags & IOCB_DIRECT;
- int unaligned_aio = 0;
- int overwrite = 0;
- ssize_t ret;

if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
return -EIO;

-#ifdef CONFIG_FS_DAX
if (IS_DAX(inode))
return ext4_dax_write_iter(iocb, from);
-#endif
- if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
- return -EOPNOTSUPP;

- if (!inode_trylock(inode)) {
- if (iocb->ki_flags & IOCB_NOWAIT)
- return -EAGAIN;
- inode_lock(inode);
- }
-
- ret = ext4_write_checks(iocb, from);
- if (ret <= 0)
- goto out;
-
- /*
- * Unaligned direct AIO must be serialized among each other as zeroing
- * of partial blocks of two competing unaligned AIOs can result in data
- * corruption.
- */
- if (o_direct && ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
- !is_sync_kiocb(iocb) &&
- ext4_unaligned_aio(inode, from, iocb->ki_pos)) {
- unaligned_aio = 1;
- ext4_unwritten_wait(inode);
- }
-
- iocb->private = &overwrite;
- /* Check whether we do a DIO overwrite or not */
- if (o_direct && !unaligned_aio) {
- if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
- if (ext4_should_dioread_nolock(inode))
- overwrite = 1;
- } else if (iocb->ki_flags & IOCB_NOWAIT) {
- ret = -EAGAIN;
- goto out;
- }
- }
-
- ret = __generic_file_write_iter(iocb, from);
- /*
- * Unaligned direct AIO must be the only IO in flight. Otherwise
- * overlapping aligned IO after unaligned might result in data
- * corruption.
- */
- if (ret == -EIOCBQUEUED && unaligned_aio)
- ext4_unwritten_wait(inode);
- inode_unlock(inode);
-
- if (ret > 0)
- ret = generic_write_sync(iocb, ret);
-
- return ret;
-
-out:
- inode_unlock(inode);
- return ret;
+ if (iocb->ki_flags & IOCB_DIRECT)
+ return ext4_dio_write_iter(iocb, from);
+ return ext4_buffered_write_iter(iocb, from);
}

#ifdef CONFIG_FS_DAX
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index efb184928e51..f52ad3065236 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3513,11 +3513,13 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
}
}
} else if (flags & IOMAP_WRITE) {
- int dio_credits;
handle_t *handle;
- int retries = 0;
+ int dio_credits, retries = 0, m_flags = 0;

- /* Trim mapping request to maximum we can map at once for DIO */
+ /*
+ * Trim mapping request to maximum we can map at once
+ * for DIO.
+ */
if (map.m_len > DIO_MAX_BLOCKS)
map.m_len = DIO_MAX_BLOCKS;
dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
@@ -3533,8 +3535,30 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
if (IS_ERR(handle))
return PTR_ERR(handle);

- ret = ext4_map_blocks(handle, inode, &map,
- EXT4_GET_BLOCKS_CREATE_ZERO);
+ /*
+ * DAX and direct IO are the only two operations that
+ * are currently supported with IOMAP_WRITE.
+ */
+ WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
+ if (IS_DAX(inode))
+ m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
+ else if (round_down(offset, i_blocksize(inode)) >=
+ i_size_read(inode))
+ m_flags = EXT4_GET_BLOCKS_CREATE;
+ else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+ m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
+
+ ret = ext4_map_blocks(handle, inode, &map, m_flags);
+
+ /*
+ * We cannot fill holes in indirect tree based inodes
+ * as that could expose stale data in the case of a
+ * crash. Use the magic error code to fallback to
+ * buffered IO.
+ */
+ if (!m_flags && !ret)
+ ret = -ENOTBLK;
+
if (ret < 0) {
ext4_journal_stop(handle);
if (ret == -ENOSPC &&
@@ -3544,13 +3568,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
}

/*
- * If we added blocks beyond i_size, we need to make sure they
- * will get truncated if we crash before updating i_size in
- * ext4_iomap_end(). For faults we don't need to do that (and
- * even cannot because for orphan list operations inode_lock is
- * required) - if we happen to instantiate block beyond i_size,
- * it is because we race with truncate which has already added
- * the inode to the orphan list.
+ * If we added blocks beyond i_size, we need to make
+ * sure they will get truncated if we crash before
+ * updating the i_size. For faults we don't need to do
+ * that (and even cannot because for orphan list
+ * operations inode_lock is required) - if we happen
+ * to instantiate block beyond i_size, it is because
+ * we race with truncate which has already added the
+ * inode to the orphan list.
*/
if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
(i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
@@ -3612,6 +3637,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ssize_t written, unsigned flags, struct iomap *iomap)
{
+ /*
+ * Check to see whether an error occurred while writing data
+ * out to allocated blocks. If so, return the magic error code
+ * so that we fallback to buffered IO and reuse the blocks
+ * that were allocated in preparation for the direct IO write.
+ */
+ if (flags & IOMAP_DIRECT && written == 0)
+ return -ENOTBLK;
return 0;
}

--
2.20.1

2019-09-12 11:07:43

by Matthew Bobrowski

[permalink] [raw]
Subject: [PATCH v3 1/6] ext4: introduce direct IO read path using iomap infrastructure

This patch introduces a new direct IO read path that makes use of the
iomap infrastructure.

The new function ext4_dio_read_iter() is responsible for calling into
the iomap infrastructure via iomap_dio_rw(). If the inode in question
does not pass preliminary checks in ext4_dio_checks(), then we simply
fallback to buffered IO and take that path to fulfil the request. It's
imperative that we drop the IOCB_DIRECT flag from iocb->ki_flags in
order to prevent generic_file_read_iter() from trying to take the
direct IO code path again.

Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/file.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 54 insertions(+), 4 deletions(-)

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 70b0438dbc94..e52e3928dc25 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -34,6 +34,53 @@
#include "xattr.h"
#include "acl.h"

+static bool ext4_dio_checks(struct inode *inode)
+{
+#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
+ if (IS_ENCRYPTED(inode))
+ return false;
+#endif
+ if (ext4_should_journal_data(inode))
+ return false;
+ if (ext4_has_inline_data(inode))
+ return false;
+ return true;
+}
+
+static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+ ssize_t ret;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ /*
+ * Get exclusion from truncate and other inode operations.
+ */
+ if (!inode_trylock_shared(inode)) {
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return -EAGAIN;
+ inode_lock_shared(inode);
+ }
+
+ if (!ext4_dio_checks(inode)) {
+ inode_unlock_shared(inode);
+ /*
+ * Fallback to buffered IO if the operation being
+ * performed on the inode is not supported by direct
+ * IO. The IOCB_DIRECT flag flags needs to be cleared
+ * here to ensure that the direct IO code path within
+ * generic_file_read_iter() is not taken again.
+ */
+ iocb->ki_flags &= ~IOCB_DIRECT;
+ return generic_file_read_iter(iocb, to);
+ }
+
+ ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL);
+ inode_unlock_shared(inode);
+
+ file_accessed(iocb->ki_filp);
+ return ret;
+}
+
#ifdef CONFIG_FS_DAX
static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
@@ -64,16 +111,19 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)

static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
- if (unlikely(ext4_forced_shutdown(EXT4_SB(file_inode(iocb->ki_filp)->i_sb))))
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
return -EIO;

if (!iov_iter_count(to))
return 0; /* skip atime */

-#ifdef CONFIG_FS_DAX
- if (IS_DAX(file_inode(iocb->ki_filp)))
+ if (IS_DAX(inode))
return ext4_dax_read_iter(iocb, to);
-#endif
+
+ if (iocb->ki_flags & IOCB_DIRECT)
+ return ext4_dio_read_iter(iocb, to);
return generic_file_read_iter(iocb, to);
}

--
2.20.1

2019-09-12 11:08:38

by Matthew Bobrowski

[permalink] [raw]
Subject: [PATCH v3 6/6] ext4: cleanup legacy buffer_head direct IO code

Remove buffer_head direct IO code that is now redundant as a result of
porting across the read/write paths to use iomap infrastructure.

Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/ext4.h | 3 -
fs/ext4/extents.c | 11 +-
fs/ext4/file.c | 7 -
fs/ext4/inode.c | 398 +---------------------------------------------
4 files changed, 5 insertions(+), 414 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bf660aa7a9e0..2ab91815f52d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1555,7 +1555,6 @@ enum {
EXT4_STATE_NO_EXPAND, /* No space for expansion */
EXT4_STATE_DA_ALLOC_CLOSE, /* Alloc DA blks on close */
EXT4_STATE_EXT_MIGRATE, /* Inode is migrating */
- EXT4_STATE_DIO_UNWRITTEN, /* need convert on dio done*/
EXT4_STATE_NEWENTRY, /* File just added to dir */
EXT4_STATE_MAY_INLINE_DATA, /* may have in-inode data */
EXT4_STATE_EXT_PRECACHED, /* extents have been precached */
@@ -2522,8 +2521,6 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
-int ext4_dio_get_block(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create);
int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
struct buffer_head *bh, int create);
int ext4_walk_page_buffers(handle_t *handle,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 92266a2da7d6..a869e206bd81 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1753,16 +1753,9 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
*/
if (ext1_ee_len + ext2_ee_len > EXT_INIT_MAX_LEN)
return 0;
- /*
- * The check for IO to unwritten extent is somewhat racy as we
- * increment i_unwritten / set EXT4_STATE_DIO_UNWRITTEN only after
- * dropping i_data_sem. But reserved blocks should save us in that
- * case.
- */
+
if (ext4_ext_is_unwritten(ex1) &&
- (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) ||
- atomic_read(&EXT4_I(inode)->i_unwritten) ||
- (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)))
+ (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN))
return 0;
#ifdef AGGRESSIVE_TEST
if (ext1_ee_len >= 4)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 413c7895aa9e..d2ff383a8b9f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -155,13 +155,6 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
return 0;
}

-static void ext4_unwritten_wait(struct inode *inode)
-{
- wait_queue_head_t *wq = ext4_ioend_wq(inode);
-
- wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_unwritten) == 0));
-}
-
/*
* This tests whether the IO in question is block-aligned or not.
* Ext4 utilizes unwritten extents when hole-filling during direct IO, and they
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f52ad3065236..a4f0749527c7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -826,136 +826,6 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
/* Maximum number of blocks we map for direct IO at once. */
#define DIO_MAX_BLOCKS 4096

-/*
- * Get blocks function for the cases that need to start a transaction -
- * generally difference cases of direct IO and DAX IO. It also handles retries
- * in case of ENOSPC.
- */
-static int ext4_get_block_trans(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int flags)
-{
- int dio_credits;
- handle_t *handle;
- int retries = 0;
- int ret;
-
- /* Trim mapping request to maximum we can map at once for DIO */
- if (bh_result->b_size >> inode->i_blkbits > DIO_MAX_BLOCKS)
- bh_result->b_size = DIO_MAX_BLOCKS << inode->i_blkbits;
- dio_credits = ext4_chunk_trans_blocks(inode,
- bh_result->b_size >> inode->i_blkbits);
-retry:
- handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, dio_credits);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
-
- ret = _ext4_get_block(inode, iblock, bh_result, flags);
- ext4_journal_stop(handle);
-
- if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
- goto retry;
- return ret;
-}
-
-/* Get block function for DIO reads and writes to inodes without extents */
-int ext4_dio_get_block(struct inode *inode, sector_t iblock,
- struct buffer_head *bh, int create)
-{
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- if (!create)
- return _ext4_get_block(inode, iblock, bh, 0);
- return ext4_get_block_trans(inode, iblock, bh, EXT4_GET_BLOCKS_CREATE);
-}
-
-/*
- * Get block function for AIO DIO writes when we create unwritten extent if
- * blocks are not allocated yet. The extent will be converted to written
- * after IO is complete.
- */
-static int ext4_dio_get_block_unwritten_async(struct inode *inode,
- sector_t iblock, struct buffer_head *bh_result, int create)
-{
- int ret;
-
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- ret = ext4_get_block_trans(inode, iblock, bh_result,
- EXT4_GET_BLOCKS_IO_CREATE_EXT);
-
- /*
- * When doing DIO using unwritten extents, we need io_end to convert
- * unwritten extents to written on IO completion. We allocate io_end
- * once we spot unwritten extent and store it in b_private. Generic
- * DIO code keeps b_private set and furthermore passes the value to
- * our completion callback in 'private' argument.
- */
- if (!ret && buffer_unwritten(bh_result)) {
- if (!bh_result->b_private) {
- ext4_io_end_t *io_end;
-
- io_end = ext4_init_io_end(inode, GFP_KERNEL);
- if (!io_end)
- return -ENOMEM;
- bh_result->b_private = io_end;
- ext4_set_io_unwritten_flag(inode, io_end);
- }
- set_buffer_defer_completion(bh_result);
- }
-
- return ret;
-}
-
-/*
- * Get block function for non-AIO DIO writes when we create unwritten extent if
- * blocks are not allocated yet. The extent will be converted to written
- * after IO is complete by ext4_direct_IO_write().
- */
-static int ext4_dio_get_block_unwritten_sync(struct inode *inode,
- sector_t iblock, struct buffer_head *bh_result, int create)
-{
- int ret;
-
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- ret = ext4_get_block_trans(inode, iblock, bh_result,
- EXT4_GET_BLOCKS_IO_CREATE_EXT);
-
- /*
- * Mark inode as having pending DIO writes to unwritten extents.
- * ext4_direct_IO_write() checks this flag and converts extents to
- * written.
- */
- if (!ret && buffer_unwritten(bh_result))
- ext4_set_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
-
- return ret;
-}
-
-static int ext4_dio_get_block_overwrite(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create)
-{
- int ret;
-
- ext4_debug("ext4_dio_get_block_overwrite: inode %lu, create flag %d\n",
- inode->i_ino, create);
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- ret = _ext4_get_block(inode, iblock, bh_result, 0);
- /*
- * Blocks should have been preallocated! ext4_file_write_iter() checks
- * that.
- */
- WARN_ON_ONCE(!buffer_mapped(bh_result) || buffer_unwritten(bh_result));
-
- return ret;
-}
-
-
/*
* `handle' can be NULL if create is zero
*/
@@ -3653,268 +3523,6 @@ const struct iomap_ops ext4_iomap_ops = {
.iomap_end = ext4_iomap_end,
};

-static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
- ssize_t size, void *private)
-{
- ext4_io_end_t *io_end = private;
-
- /* if not async direct IO just return */
- if (!io_end)
- return 0;
-
- ext_debug("ext4_end_io_dio(): io_end 0x%p "
- "for inode %lu, iocb 0x%p, offset %llu, size %zd\n",
- io_end, io_end->inode->i_ino, iocb, offset, size);
-
- /*
- * Error during AIO DIO. We cannot convert unwritten extents as the
- * data was not written. Just clear the unwritten flag and drop io_end.
- */
- if (size <= 0) {
- ext4_clear_io_unwritten_flag(io_end);
- size = 0;
- }
- io_end->offset = offset;
- io_end->size = size;
- ext4_put_io_end(io_end);
-
- return 0;
-}
-
-/*
- * Handling of direct IO writes.
- *
- * For ext4 extent files, ext4 will do direct-io write even to holes,
- * preallocated extents, and those write extend the file, no need to
- * fall back to buffered IO.
- *
- * For holes, we fallocate those blocks, mark them as unwritten
- * If those blocks were preallocated, we mark sure they are split, but
- * still keep the range to write as unwritten.
- *
- * The unwritten extents will be converted to written when DIO is completed.
- * For async direct IO, since the IO may still pending when return, we
- * set up an end_io call back function, which will do the conversion
- * when async direct IO completed.
- *
- * If the O_DIRECT write will extend the file then add this inode to the
- * orphan list. So recovery will truncate it back to the original size
- * if the machine crashes during the write.
- *
- */
-static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
- struct ext4_inode_info *ei = EXT4_I(inode);
- ssize_t ret;
- loff_t offset = iocb->ki_pos;
- size_t count = iov_iter_count(iter);
- int overwrite = 0;
- get_block_t *get_block_func = NULL;
- int dio_flags = 0;
- loff_t final_size = offset + count;
- int orphan = 0;
- handle_t *handle;
-
- if (final_size > inode->i_size || final_size > ei->i_disksize) {
- /* Credits for sb + inode write */
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out;
- }
- ret = ext4_orphan_add(handle, inode);
- if (ret) {
- ext4_journal_stop(handle);
- goto out;
- }
- orphan = 1;
- ext4_update_i_disksize(inode, inode->i_size);
- ext4_journal_stop(handle);
- }
-
- BUG_ON(iocb->private == NULL);
-
- /*
- * Make all waiters for direct IO properly wait also for extent
- * conversion. This also disallows race between truncate() and
- * overwrite DIO as i_dio_count needs to be incremented under i_mutex.
- */
- inode_dio_begin(inode);
-
- /* If we do a overwrite dio, i_mutex locking can be released */
- overwrite = *((int *)iocb->private);
-
- if (overwrite)
- inode_unlock(inode);
-
- /*
- * For extent mapped files we could direct write to holes and fallocate.
- *
- * Allocated blocks to fill the hole are marked as unwritten to prevent
- * parallel buffered read to expose the stale data before DIO complete
- * the data IO.
- *
- * As to previously fallocated extents, ext4 get_block will just simply
- * mark the buffer mapped but still keep the extents unwritten.
- *
- * For non AIO case, we will convert those unwritten extents to written
- * after return back from blockdev_direct_IO. That way we save us from
- * allocating io_end structure and also the overhead of offloading
- * the extent convertion to a workqueue.
- *
- * For async DIO, the conversion needs to be deferred when the
- * IO is completed. The ext4 end_io callback function will be
- * called to take care of the conversion work. Here for async
- * case, we allocate an io_end structure to hook to the iocb.
- */
- iocb->private = NULL;
- if (overwrite)
- get_block_func = ext4_dio_get_block_overwrite;
- else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
- round_down(offset, i_blocksize(inode)) >= inode->i_size) {
- get_block_func = ext4_dio_get_block;
- dio_flags = DIO_LOCKING | DIO_SKIP_HOLES;
- } else if (is_sync_kiocb(iocb)) {
- get_block_func = ext4_dio_get_block_unwritten_sync;
- dio_flags = DIO_LOCKING;
- } else {
- get_block_func = ext4_dio_get_block_unwritten_async;
- dio_flags = DIO_LOCKING;
- }
- ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, iter,
- get_block_func, ext4_end_io_dio, NULL,
- dio_flags);
-
- if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
- EXT4_STATE_DIO_UNWRITTEN)) {
- int err;
- /*
- * for non AIO case, since the IO is already
- * completed, we could do the conversion right here
- */
- err = ext4_convert_unwritten_extents(NULL, inode,
- offset, ret);
- if (err < 0)
- ret = err;
- ext4_clear_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
- }
-
- inode_dio_end(inode);
- /* take i_mutex locking again if we do a ovewrite dio */
- if (overwrite)
- inode_lock(inode);
-
- if (ret < 0 && final_size > inode->i_size)
- ext4_truncate_failed_write(inode);
-
- /* Handle extending of i_size after direct IO write */
- if (orphan) {
- int err;
-
- /* Credits for sb + inode write */
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- /*
- * We wrote the data but cannot extend
- * i_size. Bail out. In async io case, we do
- * not return error here because we have
- * already submmitted the corresponding
- * bio. Returning error here makes the caller
- * think that this IO is done and failed
- * resulting in race with bio's completion
- * handler.
- */
- if (!ret)
- ret = PTR_ERR(handle);
- if (inode->i_nlink)
- ext4_orphan_del(NULL, inode);
-
- goto out;
- }
- if (inode->i_nlink)
- ext4_orphan_del(handle, inode);
- if (ret > 0) {
- loff_t end = offset + ret;
- if (end > inode->i_size || end > ei->i_disksize) {
- ext4_update_i_disksize(inode, end);
- if (end > inode->i_size)
- i_size_write(inode, end);
- /*
- * We're going to return a positive `ret'
- * here due to non-zero-length I/O, so there's
- * no way of reporting error returns from
- * ext4_mark_inode_dirty() to userspace. So
- * ignore it.
- */
- ext4_mark_inode_dirty(handle, inode);
- }
- }
- err = ext4_journal_stop(handle);
- if (ret == 0)
- ret = err;
- }
-out:
- return ret;
-}
-
-static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct address_space *mapping = iocb->ki_filp->f_mapping;
- struct inode *inode = mapping->host;
- size_t count = iov_iter_count(iter);
- ssize_t ret;
-
- /*
- * Shared inode_lock is enough for us - it protects against concurrent
- * writes & truncates and since we take care of writing back page cache,
- * we are protected against page writeback as well.
- */
- inode_lock_shared(inode);
- ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,
- iocb->ki_pos + count - 1);
- if (ret)
- goto out_unlock;
- ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
- iter, ext4_dio_get_block, NULL, NULL, 0);
-out_unlock:
- inode_unlock_shared(inode);
- return ret;
-}
-
-static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
- size_t count = iov_iter_count(iter);
- loff_t offset = iocb->ki_pos;
- ssize_t ret;
-
-#ifdef CONFIG_FS_ENCRYPTION
- if (IS_ENCRYPTED(inode) && S_ISREG(inode->i_mode))
- return 0;
-#endif
-
- /*
- * If we are doing data journalling we don't support O_DIRECT
- */
- if (ext4_should_journal_data(inode))
- return 0;
-
- /* Let buffer I/O handle the inline data case. */
- if (ext4_has_inline_data(inode))
- return 0;
-
- trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
- if (iov_iter_rw(iter) == READ)
- ret = ext4_direct_IO_read(iocb, iter);
- else
- ret = ext4_direct_IO_write(iocb, iter);
- trace_ext4_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), ret);
- return ret;
-}
-
/*
* Pages can be marked dirty completely asynchronously from ext4's journalling
* activity. By filemap_sync_pte(), try_to_unmap_one(), etc. We cannot do
@@ -3952,7 +3560,7 @@ static const struct address_space_operations ext4_aops = {
.bmap = ext4_bmap,
.invalidatepage = ext4_invalidatepage,
.releasepage = ext4_releasepage,
- .direct_IO = ext4_direct_IO,
+ .direct_IO = noop_direct_IO,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
@@ -3969,7 +3577,7 @@ static const struct address_space_operations ext4_journalled_aops = {
.bmap = ext4_bmap,
.invalidatepage = ext4_journalled_invalidatepage,
.releasepage = ext4_releasepage,
- .direct_IO = ext4_direct_IO,
+ .direct_IO = noop_direct_IO,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
};
@@ -3985,7 +3593,7 @@ static const struct address_space_operations ext4_da_aops = {
.bmap = ext4_bmap,
.invalidatepage = ext4_da_invalidatepage,
.releasepage = ext4_releasepage,
- .direct_IO = ext4_direct_IO,
+ .direct_IO = noop_direct_IO,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
--
2.20.1

2019-09-16 07:09:52

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure

Hello,

On 9/12/19 4:34 PM, Matthew Bobrowski wrote:
> This patch introduces a new direct IO write code path implementation
> that makes use of the iomap infrastructure.
>
> All direct IO write operations are now passed from the ->write_iter()
> callback to the new function ext4_dio_write_iter(). This function is
> responsible for calling into iomap infrastructure via
> iomap_dio_rw(). Snippets of the direct IO code from within
> ext4_file_write_iter(), such as checking whether the IO request is
> unaligned asynchronous IO, or whether it will ber overwriting
> allocated and initialized blocks has been moved out and into
> ext4_dio_write_iter().
>
> The block mapping flags that are passed to ext4_map_blocks() from
> within ext4_dio_get_block() and friends have effectively been taken
> out and introduced within the ext4_iomap_begin(). If ext4_map_blocks()
> happens to have instantiated blocks beyond the i_size, then we attempt
> to place the inode onto the orphan list. Despite being able to perform
> i_size extension checking earlier on in the direct IO code path, it
> makes most sense to perform this bit post successful block allocation.
>
> The ->end_io() callback ext4_dio_write_end_io() is responsible for
> removing the inode from the orphan list and determining if we should
> truncate a failed write in the case of an error. We also convert a
> range of unwritten extents to written if IOMAP_DIO_UNWRITTEN is set
> and perform the necessary i_size/i_disksize extension if the
> iocb->ki_pos + dio->size > i_size_read(inode).
>
> In the instance of a short write, we fallback to buffered IO and
> complete whatever is left the 'iter'. Any blocks that may have been
> allocated in preparation for direct IO will be reused by buffered IO,
> so there's no issue with leaving allocated blocks beyond EOF.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
> Reviewed-by: Ritesh Harjani <[email protected]>
> ---
> fs/ext4/file.c | 219 +++++++++++++++++++++++++++++++++---------------
> fs/ext4/inode.c | 57 ++++++++++---
> 2 files changed, 196 insertions(+), 80 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 6457c629b8cf..413c7895aa9e 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -29,6 +29,7 @@
> #include <linux/pagevec.h>
> #include <linux/uio.h>
> #include <linux/mman.h>
> +#include <linux/backing-dev.h>
> #include "ext4.h"
> #include "ext4_jbd2.h"
> #include "xattr.h"
> @@ -213,12 +214,16 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> struct inode *inode = file_inode(iocb->ki_filp);
> ssize_t ret;
>
> + if (unlikely(IS_IMMUTABLE(inode)))
> + return -EPERM;
> +
> ret = generic_write_checks(iocb, from);
> if (ret <= 0)
> return ret;
>
> - if (unlikely(IS_IMMUTABLE(inode)))
> - return -EPERM;
> + ret = file_modified(iocb->ki_filp);
> + if (ret)
> + return 0;

Why not return ret directly, otherwise we will be returning the wrong
error code to user space. Thoughts?

Do you think simplification/restructuring of this API
"ext4_write_checks" can be a separate patch, so that this patch
only focuses on conversion of DIO write path to iomap?


Also, I think we can make the function (ext4_write_checks())
like below.
This way we call for file_modified() only after we have checked for
write limits, at one place.

static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter
*from)
{
struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;

if (unlikely(IS_IMMUTABLE(inode)))
return -EPERM;

ret = generic_write_checks(iocb, from);
if (ret <= 0)
_ return ret;
/*
* If we have encountered a bitmap-format file, the size limit
* is smaller than s_maxbytes, which is for extent-mapped files.
*/
if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);

if (iocb->ki_pos >= sbi->s_bitmap_maxbytes)
return -EFBIG;
iov_iter_truncate(from, sbi->s_bitmap_maxbytes -
iocb->ki_pos);
}
+
+ ret = file_modified(iocb->ki_filp);
+ if (ret)
+ return ret;
+
return iov_iter_count(from);
}


-ritesh


>
> /*
> * If we have encountered a bitmap-format file, the size limit
> @@ -234,6 +239,32 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> return iov_iter_count(from);
> }
>
> +static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> + struct iov_iter *from)
> +{
> + ssize_t ret;
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EOPNOTSUPP;
> +
> + inode_lock(inode);
> + ret = ext4_write_checks(iocb, from);
> + if (ret <= 0)
> + goto out;
> +
> + current->backing_dev_info = inode_to_bdi(inode);
> + ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
> + current->backing_dev_info = NULL;
> +out:
> + inode_unlock(inode);
> + if (likely(ret > 0)) {
> + iocb->ki_pos += ret;
> + ret = generic_write_sync(iocb, ret);
> + }
> + return ret;
> +}
> +
> static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> ssize_t len, size_t count)
> {
> @@ -310,6 +341,120 @@ static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
> return 0;
> }
>
> +/*
> + * For a write that extends the inode size, ext4_dio_write_iter() will
> + * wait for the write to complete. Consequently, operations performed
> + * within this function are still covered by the inode_lock(). On
> + * success, this function returns 0.
> + */
> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
> + unsigned int flags)
> +{
> + int ret;
> + loff_t offset = iocb->ki_pos;
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + if (error) {
> + ret = ext4_handle_failed_inode_extension(inode, offset + size);
> + return ret ? ret : error;
> + }
> +
> + if (flags & IOMAP_DIO_UNWRITTEN) {
> + ret = ext4_convert_unwritten_extents(NULL, inode,
> + offset, size);
> + if (ret)
> + return ret;
> + }
> +
> + if (offset + size > i_size_read(inode)) {
> + ret = ext4_handle_inode_extension(inode, offset, size, 0);
> + if (ret)
> + return ret;
> + }
> + return 0;
> +}
> +
> +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> + ssize_t ret;
> + size_t count;
> + loff_t offset = iocb->ki_pos;
> + struct inode *inode = file_inode(iocb->ki_filp);
> + bool extend = false, overwrite = false, unaligned_aio = false;
> +
> + if (!inode_trylock(inode)) {
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EAGAIN;
> + inode_lock(inode);
> + }
> +
> + if (!ext4_dio_checks(inode)) {
> + inode_unlock(inode);
> + /*
> + * Fallback to buffered IO if the operation on the
> + * inode is not supported by direct IO.
> + */
> + return ext4_buffered_write_iter(iocb, from);
> + }
> +
> + ret = ext4_write_checks(iocb, from);
> + if (ret <= 0) {
> + inode_unlock(inode);
> + return ret;
> + }
> +
> + /*
> + * Unaligned direct AIO must be serialized among each other as
> + * the zeroing of partial blocks of two competing unaligned
> + * AIOs can result in data corruption.
> + */
> + if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> + !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
> + unaligned_aio = true;
> + inode_dio_wait(inode);
> + }
> +
> + /*
> + * Determine whether the IO operation will overwrite allocated
> + * and initialized blocks. If so, check to see whether it is
> + * possible to take the dioread_nolock path.
> + */
> + count = iov_iter_count(from);
> + if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
> + ext4_should_dioread_nolock(inode)) {
> + overwrite = true;
> + downgrade_write(&inode->i_rwsem);
> + }
> +
> + if (offset + count > i_size_read(inode) ||
> + offset + count > EXT4_I(inode)->i_disksize) {
> + ext4_update_i_disksize(inode, inode->i_size);
> + extend = true;
> + }
> +
> + ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);
> +
> + /*
> + * Unaligned direct AIO must be the only IO in flight or else
> + * any overlapping aligned IO after unaligned IO might result
> + * in data corruption. We also need to wait here in the case
> + * where the inode is being extended so that inode extension
> + * routines in ext4_dio_write_end_io() are covered by the
> + * inode_lock().
> + */
> + if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
> + inode_dio_wait(inode);
> +
> + if (overwrite)
> + inode_unlock_shared(inode);
> + else
> + inode_unlock(inode);
> +
> + if (ret >= 0 && iov_iter_count(from))
> + return ext4_buffered_write_iter(iocb, from);
> + return ret;
> +}
> +
> #ifdef CONFIG_FS_DAX
> static ssize_t
> ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> @@ -324,15 +469,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> return -EAGAIN;
> inode_lock(inode);
> }
> +
> ret = ext4_write_checks(iocb, from);
> if (ret <= 0)
> goto out;
> - ret = file_remove_privs(iocb->ki_filp);
> - if (ret)
> - goto out;
> - ret = file_update_time(iocb->ki_filp);
> - if (ret)
> - goto out;
>
> offset = iocb->ki_pos;
> ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
> @@ -358,73 +498,16 @@ static ssize_t
> ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
> struct inode *inode = file_inode(iocb->ki_filp);
> - int o_direct = iocb->ki_flags & IOCB_DIRECT;
> - int unaligned_aio = 0;
> - int overwrite = 0;
> - ssize_t ret;
>
> if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
> return -EIO;
>
> -#ifdef CONFIG_FS_DAX
> if (IS_DAX(inode))
> return ext4_dax_write_iter(iocb, from);
> -#endif
> - if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
> - return -EOPNOTSUPP;
>
> - if (!inode_trylock(inode)) {
> - if (iocb->ki_flags & IOCB_NOWAIT)
> - return -EAGAIN;
> - inode_lock(inode);
> - }
> -
> - ret = ext4_write_checks(iocb, from);
> - if (ret <= 0)
> - goto out;
> -
> - /*
> - * Unaligned direct AIO must be serialized among each other as zeroing
> - * of partial blocks of two competing unaligned AIOs can result in data
> - * corruption.
> - */
> - if (o_direct && ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> - !is_sync_kiocb(iocb) &&
> - ext4_unaligned_aio(inode, from, iocb->ki_pos)) {
> - unaligned_aio = 1;
> - ext4_unwritten_wait(inode);
> - }
> -
> - iocb->private = &overwrite;
> - /* Check whether we do a DIO overwrite or not */
> - if (o_direct && !unaligned_aio) {
> - if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
> - if (ext4_should_dioread_nolock(inode))
> - overwrite = 1;
> - } else if (iocb->ki_flags & IOCB_NOWAIT) {
> - ret = -EAGAIN;
> - goto out;
> - }
> - }
> -
> - ret = __generic_file_write_iter(iocb, from);
> - /*
> - * Unaligned direct AIO must be the only IO in flight. Otherwise
> - * overlapping aligned IO after unaligned might result in data
> - * corruption.
> - */
> - if (ret == -EIOCBQUEUED && unaligned_aio)
> - ext4_unwritten_wait(inode);
> - inode_unlock(inode);
> -
> - if (ret > 0)
> - ret = generic_write_sync(iocb, ret);
> -
> - return ret;
> -
> -out:
> - inode_unlock(inode);
> - return ret;
> + if (iocb->ki_flags & IOCB_DIRECT)
> + return ext4_dio_write_iter(iocb, from);
> + return ext4_buffered_write_iter(iocb, from);
> }
>
> #ifdef CONFIG_FS_DAX
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index efb184928e51..f52ad3065236 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3513,11 +3513,13 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> }
> }
> } else if (flags & IOMAP_WRITE) {
> - int dio_credits;
> handle_t *handle;
> - int retries = 0;
> + int dio_credits, retries = 0, m_flags = 0;
>
> - /* Trim mapping request to maximum we can map at once for DIO */
> + /*
> + * Trim mapping request to maximum we can map at once
> + * for DIO.
> + */
> if (map.m_len > DIO_MAX_BLOCKS)
> map.m_len = DIO_MAX_BLOCKS;
> dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
> @@ -3533,8 +3535,30 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> if (IS_ERR(handle))
> return PTR_ERR(handle);
>
> - ret = ext4_map_blocks(handle, inode, &map,
> - EXT4_GET_BLOCKS_CREATE_ZERO);
> + /*
> + * DAX and direct IO are the only two operations that
> + * are currently supported with IOMAP_WRITE.
> + */
> + WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
> + if (IS_DAX(inode))
> + m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> + else if (round_down(offset, i_blocksize(inode)) >=
> + i_size_read(inode))
> + m_flags = EXT4_GET_BLOCKS_CREATE;
> + else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> + m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> +
> + ret = ext4_map_blocks(handle, inode, &map, m_flags);
> +
> + /*
> + * We cannot fill holes in indirect tree based inodes
> + * as that could expose stale data in the case of a
> + * crash. Use the magic error code to fallback to
> + * buffered IO.
> + */
> + if (!m_flags && !ret)
> + ret = -ENOTBLK;
> +
> if (ret < 0) {
> ext4_journal_stop(handle);
> if (ret == -ENOSPC &&
> @@ -3544,13 +3568,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> }
>
> /*
> - * If we added blocks beyond i_size, we need to make sure they
> - * will get truncated if we crash before updating i_size in
> - * ext4_iomap_end(). For faults we don't need to do that (and
> - * even cannot because for orphan list operations inode_lock is
> - * required) - if we happen to instantiate block beyond i_size,
> - * it is because we race with truncate which has already added
> - * the inode to the orphan list.
> + * If we added blocks beyond i_size, we need to make
> + * sure they will get truncated if we crash before
> + * updating the i_size. For faults we don't need to do
> + * that (and even cannot because for orphan list
> + * operations inode_lock is required) - if we happen
> + * to instantiate block beyond i_size, it is because
> + * we race with truncate which has already added the
> + * inode to the orphan list.
> */
> if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
> (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
> @@ -3612,6 +3637,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> ssize_t written, unsigned flags, struct iomap *iomap)
> {
> + /*
> + * Check to see whether an error occurred while writing data
> + * out to allocated blocks. If so, return the magic error code
> + * so that we fallback to buffered IO and reuse the blocks
> + * that were allocated in preparation for the direct IO write.
> + */
> + if (flags & IOMAP_DIRECT && written == 0)
> + return -ENOTBLK;
> return 0;
> }
>

2019-09-16 10:55:24

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure

On Mon, Sep 16, 2019 at 10:07:41AM +0530, Ritesh Harjani wrote:
> > @@ -213,12 +214,16 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > struct inode *inode = file_inode(iocb->ki_filp);
> > ssize_t ret;
> >
> > + if (unlikely(IS_IMMUTABLE(inode)))
> > + return -EPERM;
> > +
> > ret = generic_write_checks(iocb, from);
> > if (ret <= 0)
> > return ret;
> >
> > - if (unlikely(IS_IMMUTABLE(inode)))
> > - return -EPERM;
> > + ret = file_modified(iocb->ki_filp);
> > + if (ret)
> > + return 0;
>
> Why not return ret directly, otherwise we will be returning the wrong
> error code to user space. Thoughts?

You're right. I can't remember exactly why I decided to return '0', however
looking at the code once again I don't see a reason why we don't just return
'ret', as any value other than '0' represents a failure in this case
anyway. Thanks for picking that up.

> Do you think simplification/restructuring of this API
> "ext4_write_checks" can be a separate patch, so that this patch
> only focuses on conversion of DIO write path to iomap?

Hm, if we split it up so that it comes before this patch then it becomes hairy
in the sense that a whole bunch of other changes would also need to come with
what looks to be such a miniscule modification
i.e. ext4_buffered_write_iter(), ext4_file_write_iter(), etc. Splitting it to
come after just doesn't make any sense. To be honest, I don't really have any
strong opinions around why we shouldn't split it up, nor do I have a strong
opinion around why we should, so I think we should just leave it for now.

> Also, I think we can make the function (ext4_write_checks())
> like below. This way we call for file_modified() only after we
> have checked for write limits, at one place.

No objections and I think it's a good idea.

> static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter
> *from)
> {
> struct inode *inode = file_inode(iocb->ki_filp);
> ssize_t ret;
>
> if (unlikely(IS_IMMUTABLE(inode)))
> return -EPERM;
>
> ret = generic_write_checks(iocb, from);
> if (ret <= 0)
> _ return ret;
> /*
> * If we have encountered a bitmap-format file, the size limit
> * is smaller than s_maxbytes, which is for extent-mapped files.
> */
> if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
> struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>
> if (iocb->ki_pos >= sbi->s_bitmap_maxbytes)
> return -EFBIG;
> iov_iter_truncate(from, sbi->s_bitmap_maxbytes -
> iocb->ki_pos);
> }
> +
> + ret = file_modified(iocb->ki_filp);
> + if (ret)
> + return ret;
> +
> return iov_iter_count(from);
> }

--<M>--

2019-09-16 12:01:23

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] ext4: introduce direct IO read path using iomap infrastructure

On Thu, Sep 12, 2019 at 09:03:44PM +1000, Matthew Bobrowski wrote:
> +static bool ext4_dio_checks(struct inode *inode)
> +{
> +#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
> + if (IS_ENCRYPTED(inode))
> + return false;
> +#endif
> + if (ext4_should_journal_data(inode))
> + return false;
> + if (ext4_has_inline_data(inode))
> + return false;
> + return true;

Shouldn't this function be called ext4_dio_supported or similar?

Also bonus points of adding a patch that defines a IS_ENCRYPTED stub
for !CONFIG_FS_ENCRYPTION to make this a little cleaner.

Also the iomap direct I/O code supports inline data, so the above
might not be required (at least with small updates elsewhere).

2019-09-16 12:07:03

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] ext4: reorder map.m_flags checks in ext4_iomap_begin()

On Thu, Sep 12, 2019 at 09:04:30PM +1000, Matthew Bobrowski wrote:
> @@ -3581,10 +3581,21 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
> iomap->addr = IOMAP_NULL_ADDR;
> } else {
> - if (map.m_flags & EXT4_MAP_MAPPED) {
> - iomap->type = IOMAP_MAPPED;
> - } else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> + /*
> + * Flags passed to ext4_map_blocks() for direct IO
> + * writes can result in m_flags having both
> + * EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits set. In
> + * order for allocated unwritten extents to be
> + * converted to written extents in the end_io handler
> + * correctly, we need to ensure that the iomap->type
> + * is also set appropriately in that case. Thus, we
> + * need to check whether EXT4_MAP_UNWRITTEN is set
> + * first.
> + */
> + if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> iomap->type = IOMAP_UNWRITTEN;
> + } else if (map.m_flags & EXT4_MAP_MAPPED) {
> + iomap->type = IOMAP_MAPPED;

I think much of this would benefit a lot from just being split up.
I hacked up a patch last week that split the ext4 direct I/O code
a bit, but this is completely untested and needs further splitup,
but maybe you can take it as an inspiration for your series?
E.g. at least one helper for filling out the iomap from the ext4
map data, and one for the seek unwritten extent reporting. The
split of the overall iomap ops seemed useful to me, but might not
be as important with the other cleanups:

---
From 7ac1a837e279e415882feae473e335b4a3d89c10 Mon Sep 17 00:00:00 2001
From: Christoph Hellwig <[email protected]>
Date: Sun, 8 Sep 2019 10:44:28 +0200
Subject: ext4: refactor the iomap code

Signed-off-by: Christoph Hellwig <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/file.c | 7 +-
fs/ext4/inode.c | 279 ++++++++++++++++++++++++++----------------------
3 files changed, 159 insertions(+), 128 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bf660aa7a9e0..c8e34fe3daba 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3333,6 +3333,7 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
}

extern const struct iomap_ops ext4_iomap_ops;
+extern const struct iomap_ops ext4_report_iomap_ops;

#endif /* __KERNEL__ */

diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 70b0438dbc94..cd2d41bc842b 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -328,7 +328,8 @@ static vm_fault_t ext4_dax_huge_fault(struct vm_fault *vmf,
} else {
down_read(&EXT4_I(inode)->i_mmap_sem);
}
- result = dax_iomap_fault(vmf, pe_size, &pfn, &error, &ext4_iomap_ops);
+ result = dax_iomap_fault(vmf, pe_size, &pfn, &error,
+ &ext4_iomap_ops);
if (write) {
ext4_journal_stop(handle);

@@ -492,12 +493,12 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int whence)
maxbytes, i_size_read(inode));
case SEEK_HOLE:
inode_lock_shared(inode);
- offset = iomap_seek_hole(inode, offset, &ext4_iomap_ops);
+ offset = iomap_seek_hole(inode, offset, &ext4_report_iomap_ops);
inode_unlock_shared(inode);
break;
case SEEK_DATA:
inode_lock_shared(inode);
- offset = iomap_seek_data(inode, offset, &ext4_iomap_ops);
+ offset = iomap_seek_data(inode, offset, &ext4_report_iomap_ops);
inode_unlock_shared(inode);
break;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 420fe3deed39..1c1b07f0cdbf 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3452,152 +3452,116 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
return inode->i_state & I_DIRTY_DATASYNC;
}

-static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
- unsigned flags, struct iomap *iomap)
+static int ext4_set_iomap(struct inode *inode, struct iomap *iomap, u16 type,
+ unsigned long first_block, struct ext4_map_blocks *map)
{
- struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
unsigned int blkbits = inode->i_blkbits;
- unsigned long first_block, last_block;
- struct ext4_map_blocks map;
- bool delalloc = false;
- int ret;
-
- if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
- return -EINVAL;
- first_block = offset >> blkbits;
- last_block = min_t(loff_t, (offset + length - 1) >> blkbits,
- EXT4_MAX_LOGICAL_BLOCK);
-
- if (flags & IOMAP_REPORT) {
- if (ext4_has_inline_data(inode)) {
- ret = ext4_inline_data_iomap(inode, iomap);
- if (ret != -EAGAIN) {
- if (ret == 0 && offset >= iomap->length)
- ret = -ENOENT;
- return ret;
- }
- }
- } else {
- if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
- return -ERANGE;
- }
-
- map.m_lblk = first_block;
- map.m_len = last_block - first_block + 1;
-
- if (flags & IOMAP_REPORT) {
- ret = ext4_map_blocks(NULL, inode, &map, 0);
- if (ret < 0)
- return ret;
-
- if (ret == 0) {
- ext4_lblk_t end = map.m_lblk + map.m_len - 1;
- struct extent_status es;
-
- ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
- map.m_lblk, end, &es);
-
- if (!es.es_len || es.es_lblk > end) {
- /* entire range is a hole */
- } else if (es.es_lblk > map.m_lblk) {
- /* range starts with a hole */
- map.m_len = es.es_lblk - map.m_lblk;
- } else {
- ext4_lblk_t offs = 0;
-
- if (es.es_lblk < map.m_lblk)
- offs = map.m_lblk - es.es_lblk;
- map.m_lblk = es.es_lblk + offs;
- map.m_len = es.es_len - offs;
- delalloc = true;
- }
- }
- } else if (flags & IOMAP_WRITE) {
- int dio_credits;
- handle_t *handle;
- int retries = 0;
-
- /* Trim mapping request to maximum we can map at once for DIO */
- if (map.m_len > DIO_MAX_BLOCKS)
- map.m_len = DIO_MAX_BLOCKS;
- dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
-retry:
- /*
- * Either we allocate blocks and then we don't get unwritten
- * extent so we have reserved enough credits, or the blocks
- * are already allocated and unwritten and in that case
- * extent conversion fits in the credits as well.
- */
- handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
- dio_credits);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
-
- ret = ext4_map_blocks(handle, inode, &map,
- EXT4_GET_BLOCKS_CREATE_ZERO);
- if (ret < 0) {
- ext4_journal_stop(handle);
- if (ret == -ENOSPC &&
- ext4_should_retry_alloc(inode->i_sb, &retries))
- goto retry;
- return ret;
- }
-
- /*
- * If we added blocks beyond i_size, we need to make sure they
- * will get truncated if we crash before updating i_size in
- * ext4_iomap_end(). For faults we don't need to do that (and
- * even cannot because for orphan list operations inode_lock is
- * required) - if we happen to instantiate block beyond i_size,
- * it is because we race with truncate which has already added
- * the inode to the orphan list.
- */
- if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
- (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
- int err;
-
- err = ext4_orphan_add(handle, inode);
- if (err < 0) {
- ext4_journal_stop(handle);
- return err;
- }
- }
- ext4_journal_stop(handle);
- } else {
- ret = ext4_map_blocks(NULL, inode, &map, 0);
- if (ret < 0)
- return ret;
- }

iomap->flags = 0;
if (ext4_inode_datasync_dirty(inode))
iomap->flags |= IOMAP_F_DIRTY;
iomap->bdev = inode->i_sb->s_bdev;
- iomap->dax_dev = sbi->s_daxdev;
+ iomap->dax_dev = EXT4_SB(inode->i_sb)->s_daxdev;
iomap->offset = (u64)first_block << blkbits;
- iomap->length = (u64)map.m_len << blkbits;
-
- if (ret == 0) {
- iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
+ iomap->length = (u64)map->m_len << blkbits;
+ if (type) {
+ iomap->type = type;
iomap->addr = IOMAP_NULL_ADDR;
} else {
- if (map.m_flags & EXT4_MAP_MAPPED) {
+ if (map->m_flags & EXT4_MAP_MAPPED) {
iomap->type = IOMAP_MAPPED;
- } else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
+ } else if (map->m_flags & EXT4_MAP_UNWRITTEN) {
iomap->type = IOMAP_UNWRITTEN;
} else {
WARN_ON_ONCE(1);
return -EIO;
}
- iomap->addr = (u64)map.m_pblk << blkbits;
+ iomap->addr = (u64)map->m_pblk << blkbits;
}
-
- if (map.m_flags & EXT4_MAP_NEW)
+ if (map->m_flags & EXT4_MAP_NEW)
iomap->flags |= IOMAP_F_NEW;
-
return 0;
}

+#ifdef CONFIG_FS_DAX
+static int ext4_iomap_alloc(struct inode *inode, unsigned flags,
+ unsigned long first_block, struct ext4_map_blocks *map)
+{
+ unsigned int blkbits = inode->i_blkbits;
+ int dio_credits, ret, retries = 0;
+ handle_t *handle;
+
+ /* Trim mapping request to maximum we can map at once for DIO */
+ if (map->m_len > DIO_MAX_BLOCKS)
+ map->m_len = DIO_MAX_BLOCKS;
+ dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
+retry:
+ /*
+ * Either we allocate blocks and then we don't get unwritten extent so
+ * we have reserved enough credits, or the blocks are already allocated
+ * and unwritten and in that case extent conversion fits in the credits
+ as well.
+ */
+ handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, dio_credits);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+ ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
+ if (ret < 0)
+ goto journal_stop;
+
+ /*
+ * If we added blocks beyond i_size, we need to make sure they will get
+ * truncated if we crash before updating i_size in ext4_iomap_end().
+ * For faults we don't need to do that (and even cannot because for
+ * orphan list operations inode_lock is required) - if we happen to
+ * instantiate block beyond i_size, it is because we race with truncate
+ * which has already added the inode to the orphan list.
+ */
+ if (!(flags & IOMAP_FAULT) && first_block + map->m_len >
+ (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
+ int err;
+
+ err = ext4_orphan_add(handle, inode);
+ if (err < 0)
+ ret = err;
+ }
+journal_stop:
+ ext4_journal_stop(handle);
+ if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+ goto retry;
+ return ret;
+}
+
+static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
+ unsigned flags, struct iomap *iomap)
+{
+ unsigned int blkbits = inode->i_blkbits;
+ unsigned long first_block, last_block;
+ struct ext4_map_blocks map;
+ int ret;
+
+ if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
+ return -EINVAL;
+ first_block = offset >> blkbits;
+ last_block = min_t(loff_t, (offset + length - 1) >> blkbits,
+ EXT4_MAX_LOGICAL_BLOCK);
+
+ if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
+ return -ERANGE;
+
+ map.m_lblk = first_block;
+ map.m_len = last_block - first_block + 1;
+ if (flags & IOMAP_WRITE)
+ ret = ext4_iomap_alloc(inode, flags, first_block, &map);
+ else
+ ret = ext4_map_blocks(NULL, inode, &map, 0);
+
+ if (ret < 0)
+ return ret;
+ return ext4_set_iomap(inode, iomap, ret ? 0 : IOMAP_HOLE, first_block,
+ &map);
+}
+
static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ssize_t written, unsigned flags, struct iomap *iomap)
{
@@ -3654,6 +3618,71 @@ const struct iomap_ops ext4_iomap_ops = {
.iomap_begin = ext4_iomap_begin,
.iomap_end = ext4_iomap_end,
};
+#endif /* CONFIG_FS_DAX */
+
+static u16 ext4_iomap_check_delalloc(struct inode *inode,
+ struct ext4_map_blocks *map)
+{
+ ext4_lblk_t end = map->m_lblk + map->m_len - 1;
+ struct extent_status es;
+
+ ext4_es_find_extent_range(inode, &ext4_es_is_delayed, map->m_lblk, end,
+ &es);
+
+ /* entire range is a hole */
+ if (!es.es_len || es.es_lblk > end)
+ return IOMAP_HOLE;
+ if (es.es_lblk <= map->m_lblk) {
+ ext4_lblk_t offs = 0;
+
+ if (es.es_lblk < map->m_lblk)
+ offs = map->m_lblk - es.es_lblk;
+ map->m_lblk = es.es_lblk + offs;
+ map->m_len = es.es_len - offs;
+ return IOMAP_DELALLOC;
+ }
+ /* range starts with a hole */
+ map->m_len = es.es_lblk - map->m_lblk;
+ return IOMAP_HOLE;
+}
+
+static int ext4_report_iomap_begin(struct inode *inode, loff_t offset,
+ loff_t length, unsigned flags, struct iomap *iomap)
+{
+ unsigned int blkbits = inode->i_blkbits;
+ unsigned long first_block, last_block;
+ struct ext4_map_blocks map;
+ u16 type = 0;
+ int ret;
+
+ if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
+ return -EINVAL;
+ first_block = offset >> blkbits;
+ last_block = min_t(loff_t, (offset + length - 1) >> blkbits,
+ EXT4_MAX_LOGICAL_BLOCK);
+
+ if (ext4_has_inline_data(inode)) {
+ ret = ext4_inline_data_iomap(inode, iomap);
+ if (ret != -EAGAIN) {
+ if (ret == 0 && offset >= iomap->length)
+ ret = -ENOENT;
+ return ret;
+ }
+ }
+
+ map.m_lblk = first_block;
+ map.m_len = last_block - first_block + 1;
+ ret = ext4_map_blocks(NULL, inode, &map, 0);
+ if (ret < 0)
+ return ret;
+ if (ret == 0)
+ type = ext4_iomap_check_delalloc(inode, &map);
+ return ext4_set_iomap(inode, iomap, type, first_block, &map);
+}
+
+const struct iomap_ops ext4_report_iomap_ops = {
+ .iomap_begin = ext4_report_iomap_begin,
+};

static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
ssize_t size, void *private)
--
2.20.1

2019-09-16 12:07:41

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] ext4: cleanup legacy buffer_head direct IO code

On Thu, Sep 12, 2019 at 09:05:03PM +1000, Matthew Bobrowski wrote:
> Remove buffer_head direct IO code that is now redundant as a result of
> porting across the read/write paths to use iomap infrastructure.

I still think moving the removal of the not required code into the
actual patches that make the code obsolete is a better idea. That
makes it very clear what is added to remove what kind and amount of
old code.

2019-09-16 12:13:22

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure

On Thu, Sep 12, 2019 at 09:04:46PM +1000, Matthew Bobrowski wrote:
> @@ -213,12 +214,16 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> struct inode *inode = file_inode(iocb->ki_filp);
> ssize_t ret;
>
> + if (unlikely(IS_IMMUTABLE(inode)))
> + return -EPERM;
> +
> ret = generic_write_checks(iocb, from);
> if (ret <= 0)
> return ret;
>
> - if (unlikely(IS_IMMUTABLE(inode)))
> - return -EPERM;
> + ret = file_modified(iocb->ki_filp);
> + if (ret)
> + return 0;
>
> /*
> * If we have encountered a bitmap-format file, the size limit

Independent of the error return issue you probably want to split
modifying ext4_write_checks into a separate preparation patch.

> +/*
> + * For a write that extends the inode size, ext4_dio_write_iter() will
> + * wait for the write to complete. Consequently, operations performed
> + * within this function are still covered by the inode_lock(). On
> + * success, this function returns 0.
> + */
> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
> + unsigned int flags)
> +{
> + int ret;
> + loff_t offset = iocb->ki_pos;
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + if (error) {
> + ret = ext4_handle_failed_inode_extension(inode, offset + size);
> + return ret ? ret : error;
> + }

Just a personal opinion, but I find the use of the ternary operator
here a little weird.

A plain old:

ret = ext4_handle_failed_inode_extension(inode, offset + size);
if (ret)
return ret;
return error;

flow much easier.

> + if (!inode_trylock(inode)) {
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EAGAIN;
> + inode_lock(inode);
> + }
> +
> + if (!ext4_dio_checks(inode)) {
> + inode_unlock(inode);
> + /*
> + * Fallback to buffered IO if the operation on the
> + * inode is not supported by direct IO.
> + */
> + return ext4_buffered_write_iter(iocb, from);

I think you want to lift the locking into the caller of this function
so that you don't have to unlock and relock for the buffered write
fallback.

> + if (offset + count > i_size_read(inode) ||
> + offset + count > EXT4_I(inode)->i_disksize) {
> + ext4_update_i_disksize(inode, inode->i_size);
> + extend = true;

Doesn't the ext4_update_i_disksize need to be under an open journal
handle?

2019-09-16 12:57:41

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v3 6/6] ext4: cleanup legacy buffer_head direct IO code

On Mon, Sep 16, 2019 at 05:06:32AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 12, 2019 at 09:05:03PM +1000, Matthew Bobrowski wrote:
> > Remove buffer_head direct IO code that is now redundant as a result of
> > porting across the read/write paths to use iomap infrastructure.
>
> I still think moving the removal of the not required code into the
> actual patches that make the code obsolete is a better idea. That
> makes it very clear what is added to remove what kind and amount of
> old code.

Yeah, you're right, I completely agree with that now that you mention it. I
will do this for the next version of this patch series.

--<M>--

2019-09-16 13:08:48

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v3 1/6] ext4: introduce direct IO read path using iomap infrastructure

On Mon, Sep 16, 2019 at 05:00:32AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 12, 2019 at 09:03:44PM +1000, Matthew Bobrowski wrote:
> > +static bool ext4_dio_checks(struct inode *inode)
> > +{
> > +#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
> > + if (IS_ENCRYPTED(inode))
> > + return false;
> > +#endif
> > + if (ext4_should_journal_data(inode))
> > + return false;
> > + if (ext4_has_inline_data(inode))
> > + return false;
> > + return true;
>
> Shouldn't this function be called ext4_dio_supported or similar?

Yeah, let's run with your suggestion. I think 'ext4_dio_supported' reads far
better than what I've named this helper.

> Also bonus points of adding a patch that defines a IS_ENCRYPTED stub
> for !CONFIG_FS_ENCRYPTION to make this a little cleaner.

I like this idea and I will try to do this.

> Also the iomap direct I/O code supports inline data, so the above
> might not be required (at least with small updates elsewhere).

I did see this, but to be perfectly honest I haven't looked at what needs to
be done on the ext4 side of things to get it all plumbed up and working. I
wanted to get clear of these main bits and revisit after the fact within a
separate patch series.

--<M>--

2019-09-16 23:07:45

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure

On Mon, Sep 16, 2019 at 05:12:48AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 12, 2019 at 09:04:46PM +1000, Matthew Bobrowski wrote:
> > @@ -213,12 +214,16 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > struct inode *inode = file_inode(iocb->ki_filp);
> > ssize_t ret;
> >
> > + if (unlikely(IS_IMMUTABLE(inode)))
> > + return -EPERM;
> > +
> > ret = generic_write_checks(iocb, from);
> > if (ret <= 0)
> > return ret;
> >
> > - if (unlikely(IS_IMMUTABLE(inode)))
> > - return -EPERM;
> > + ret = file_modified(iocb->ki_filp);
> > + if (ret)
> > + return 0;
> >
> > /*
> > * If we have encountered a bitmap-format file, the size limit
>
> Independent of the error return issue you probably want to split
> modifying ext4_write_checks into a separate preparation patch.

Providing that there's no objections to introducing a possible performance
change with this separate preparation patch (overhead of calling
file_remove_privs/file_update_time twice), then I have no issues in doing so.

> > +/*
> > + * For a write that extends the inode size, ext4_dio_write_iter() will
> > + * wait for the write to complete. Consequently, operations performed
> > + * within this function are still covered by the inode_lock(). On
> > + * success, this function returns 0.
> > + */
> > +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
> > + unsigned int flags)
> > +{
> > + int ret;
> > + loff_t offset = iocb->ki_pos;
> > + struct inode *inode = file_inode(iocb->ki_filp);
> > +
> > + if (error) {
> > + ret = ext4_handle_failed_inode_extension(inode, offset + size);
> > + return ret ? ret : error;
> > + }
>
> Just a personal opinion, but I find the use of the ternary operator
> here a little weird.
>
> A plain old:
>
> ret = ext4_handle_failed_inode_extension(inode, offset + size);
> if (ret)
> return ret;
> return error;
>
> flow much easier.

Agree, much cleaner.

> > + if (!inode_trylock(inode)) {
> > + if (iocb->ki_flags & IOCB_NOWAIT)
> > + return -EAGAIN;
> > + inode_lock(inode);
> > + }
> > +
> > + if (!ext4_dio_checks(inode)) {
> > + inode_unlock(inode);
> > + /*
> > + * Fallback to buffered IO if the operation on the
> > + * inode is not supported by direct IO.
> > + */
> > + return ext4_buffered_write_iter(iocb, from);
>
> I think you want to lift the locking into the caller of this function
> so that you don't have to unlock and relock for the buffered write
> fallback.

I don't exactly know what you really mean by "lift the locking into the caller
of this function". I'm interpreting that as moving the inode_unlock()
operation into ext4_buffered_write_iter(), but I can't see how that would be
any different from doing it directly here? Wouldn't this also run the risk of
the locks becoming unbalanced as we'd need to add checks around whether the
resource is being contended? Maybe I'm misunderstanding something here...

> > + if (offset + count > i_size_read(inode) ||
> > + offset + count > EXT4_I(inode)->i_disksize) {
> > + ext4_update_i_disksize(inode, inode->i_size);
> > + extend = true;
>
> Doesn't the ext4_update_i_disksize need to be under an open journal
> handle?

After all, it is a metadata update, which should go through an open journal
handle.

Thank you for the review Christoph!

--<M>--

2019-09-17 09:55:05

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure

Hello,

On 9/17/19 4:07 AM, Matthew Bobrowski wrote:
> On Mon, Sep 16, 2019 at 05:12:48AM -0700, Christoph Hellwig wrote:
>> On Thu, Sep 12, 2019 at 09:04:46PM +1000, Matthew Bobrowski wrote:
>>> @@ -213,12 +214,16 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
>>> struct inode *inode = file_inode(iocb->ki_filp);
>>> ssize_t ret;
>>>
>>> + if (unlikely(IS_IMMUTABLE(inode)))
>>> + return -EPERM;
>>> +
>>> ret = generic_write_checks(iocb, from);
>>> if (ret <= 0)
>>> return ret;
>>>
>>> - if (unlikely(IS_IMMUTABLE(inode)))
>>> - return -EPERM;
>>> + ret = file_modified(iocb->ki_filp);
>>> + if (ret)
>>> + return 0;
>>>
>>> /*
>>> * If we have encountered a bitmap-format file, the size limit
>>
>> Independent of the error return issue you probably want to split
>> modifying ext4_write_checks into a separate preparation patch.
>
> Providing that there's no objections to introducing a possible performance
> change with this separate preparation patch (overhead of calling
> file_remove_privs/file_update_time twice), then I have no issues in doing so.
>
>>> +/*
>>> + * For a write that extends the inode size, ext4_dio_write_iter() will
>>> + * wait for the write to complete. Consequently, operations performed
>>> + * within this function are still covered by the inode_lock(). On
>>> + * success, this function returns 0.
>>> + */
>>> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
>>> + unsigned int flags)
>>> +{
>>> + int ret;
>>> + loff_t offset = iocb->ki_pos;
>>> + struct inode *inode = file_inode(iocb->ki_filp);
>>> +
>>> + if (error) {
>>> + ret = ext4_handle_failed_inode_extension(inode, offset + size);
>>> + return ret ? ret : error;
>>> + }
>>
>> Just a personal opinion, but I find the use of the ternary operator
>> here a little weird.
>>
>> A plain old:
>>
>> ret = ext4_handle_failed_inode_extension(inode, offset + size);
>> if (ret)
>> return ret;
>> return error;
>>
>> flow much easier.
>
> Agree, much cleaner.
>
>>> + if (!inode_trylock(inode)) {
>>> + if (iocb->ki_flags & IOCB_NOWAIT)
>>> + return -EAGAIN;
>>> + inode_lock(inode);
>>> + }
>>> +
>>> + if (!ext4_dio_checks(inode)) {
>>> + inode_unlock(inode);
>>> + /*
>>> + * Fallback to buffered IO if the operation on the
>>> + * inode is not supported by direct IO.
>>> + */
>>> + return ext4_buffered_write_iter(iocb, from);
>>
>> I think you want to lift the locking into the caller of this function
>> so that you don't have to unlock and relock for the buffered write
>> fallback.
>
> I don't exactly know what you really mean by "lift the locking into the caller
> of this function". I'm interpreting that as moving the inode_unlock()
> operation into ext4_buffered_write_iter(), but I can't see how that would be
> any different from doing it directly here? Wouldn't this also run the risk of
> the locks becoming unbalanced as we'd need to add checks around whether the
> resource is being contended? Maybe I'm misunderstanding something here...
>
>>> + if (offset + count > i_size_read(inode) ||
>>> + offset + count > EXT4_I(inode)->i_disksize) {
>>> + ext4_update_i_disksize(inode, inode->i_size);
>>> + extend = true;
>>
>> Doesn't the ext4_update_i_disksize need to be under an open journal
>> handle?
>
> After all, it is a metadata update, which should go through an open journal
> handle.

Hmmm, it seems like a race here. But I am not sure if this is just due
to not updating i_disksize under open journal handle.


So if we have a delayed buffered write to a file,
in that case we first only update inode->i_size and update
i_disksize at writeback time
(i.e. during block allocation).
In that case when we call for ext4_dio_write_iter
since offset + len > i_disksize, we call for ext4_update_i_disksize().

Now if writeback for some reason failed. And the system crashes, during
the DIO writes, after the blocks are allocated. Then during reboot we
may have an inconsistent inode, since we did not add the inode into the
orphan list before we updated the inode->i_disksize. And journal replay
may not succeed.

1. Can above actually happen? I am still not able to figure out the
race/inconsistency completely.
2. Can you please help explain under what other cases
it was necessary to call ext4_update_i_disksize() in DIO write paths?
3. When will i_disksize be out-of-sync with i_size during DIO writes?


-ritesh

2019-09-17 09:55:14

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure

On Tue, Sep 17, 2019 at 02:30:15PM +0530, Ritesh Harjani wrote:
> So if we have a delayed buffered write to a file,
> in that case we first only update inode->i_size and update
> i_disksize at writeback time
> (i.e. during block allocation).
> In that case when we call for ext4_dio_write_iter
> since offset + len > i_disksize, we call for ext4_update_i_disksize().
>
> Now if writeback for some reason failed. And the system crashes, during the
> DIO writes, after the blocks are allocated. Then during reboot we may have
> an inconsistent inode, since we did not add the inode into the
> orphan list before we updated the inode->i_disksize. And journal replay
> may not succeed.
>
> 1. Can above actually happen? I am still not able to figure out the
> race/inconsistency completely.
> 2. Can you please help explain under what other cases
> it was necessary to call ext4_update_i_disksize() in DIO write paths?
> 3. When will i_disksize be out-of-sync with i_size during DIO writes?

None of the above seems new in this patchset, does it? That being said
I found the early size update odd. XFS updates the on-disk size only
at I/O completion time to deal with various races including the
potential exposure of stale data.

2019-09-17 09:55:22

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure

On Tue, Sep 17, 2019 at 08:37:41AM +1000, Matthew Bobrowski wrote:
> > Independent of the error return issue you probably want to split
> > modifying ext4_write_checks into a separate preparation patch.
>
> Providing that there's no objections to introducing a possible performance
> change with this separate preparation patch (overhead of calling
> file_remove_privs/file_update_time twice), then I have no issues in doing so.

Well, we should avoid calling it twice. But what caught my eye is that
the buffered I/O path also called this function, so we are changing it as
well here. If that actually is safe (I didn't review these bits carefully
and don't know ext4 that well) the overall refactoring of the write
flow might belong into a separate prep patch (that is not relying
on ->direct_IO, the checks changes, etc).

> > > + if (!inode_trylock(inode)) {
> > > + if (iocb->ki_flags & IOCB_NOWAIT)
> > > + return -EAGAIN;
> > > + inode_lock(inode);
> > > + }
> > > +
> > > + if (!ext4_dio_checks(inode)) {
> > > + inode_unlock(inode);
> > > + /*
> > > + * Fallback to buffered IO if the operation on the
> > > + * inode is not supported by direct IO.
> > > + */
> > > + return ext4_buffered_write_iter(iocb, from);
> >
> > I think you want to lift the locking into the caller of this function
> > so that you don't have to unlock and relock for the buffered write
> > fallback.
>
> I don't exactly know what you really mean by "lift the locking into the caller
> of this function". I'm interpreting that as moving the inode_unlock()
> operation into ext4_buffered_write_iter(), but I can't see how that would be
> any different from doing it directly here? Wouldn't this also run the risk of
> the locks becoming unbalanced as we'd need to add checks around whether the
> resource is being contended? Maybe I'm misunderstanding something here...

With that I mean to acquire the inode lock in ext4_file_write_iter
instead of the low-level buffered I/O or direct I/O routines.

2019-09-17 10:43:11

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure



On 9/17/19 2:32 PM, Christoph Hellwig wrote:
> On Tue, Sep 17, 2019 at 02:30:15PM +0530, Ritesh Harjani wrote:
>> So if we have a delayed buffered write to a file,
>> in that case we first only update inode->i_size and update
>> i_disksize at writeback time
>> (i.e. during block allocation).
>> In that case when we call for ext4_dio_write_iter
>> since offset + len > i_disksize, we call for ext4_update_i_disksize().
>>
>> Now if writeback for some reason failed. And the system crashes, during the
>> DIO writes, after the blocks are allocated. Then during reboot we may have
>> an inconsistent inode, since we did not add the inode into the
>> orphan list before we updated the inode->i_disksize. And journal replay
>> may not succeed.
>>
>> 1. Can above actually happen? I am still not able to figure out the
>> race/inconsistency completely.
>> 2. Can you please help explain under what other cases
>> it was necessary to call ext4_update_i_disksize() in DIO write paths?
>> 3. When will i_disksize be out-of-sync with i_size during DIO writes?
>
> None of the above seems new in this patchset, does it? That being said


In original code before updating i_disksize in ext4_direct_IO_write,
we used to add the inode into the orphan list (which will mark the iloc
dirty and also update the ondisk inode size). Only then we update the
i_disksize to inode->i_size (which still I don't understand the
reason to put inside open journal handle).

So in case if the crash happens, then in the recovery, we can replay the
journal and we truncate any extra blocks beyond i_size.
(ext4_orphan_cleanup()).


In new iomap implementation (i.e. this patchset), we are doing this in
reverse.

We first call for ext4_update_i_disksize() in ext4_dio_write_iter(),
and then in ext4_iomap_begin() after ext4_map_blocks(),
we add the inode to orphan list, which I am not really sure whether it
is really consistent with on disk size??



> I found the early size update odd. XFS updates the on-disk size only
> at I/O completion time to deal with various races including the
> potential exposure of stale data.
>

Yes, can't really say why it is the case in ext4.
That's mostly what I wanted to understand from previous queries.


-ritesh

2019-09-17 11:32:04

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure

On Tue, Sep 17, 2019 at 02:06:13AM -0700, Christoph Hellwig wrote:
> On Tue, Sep 17, 2019 at 08:37:41AM +1000, Matthew Bobrowski wrote:
> > > Independent of the error return issue you probably want to split
> > > modifying ext4_write_checks into a separate preparation patch.
> >
> > Providing that there's no objections to introducing a possible performance
> > change with this separate preparation patch (overhead of calling
> > file_remove_privs/file_update_time twice), then I have no issues in doing so.
>
> Well, we should avoid calling it twice. But what caught my eye is that
> the buffered I/O path also called this function, so we are changing it as
> well here. If that actually is safe (I didn't review these bits carefully
> and don't know ext4 that well) the overall refactoring of the write
> flow might belong into a separate prep patch (that is not relying
> on ->direct_IO, the checks changes, etc).

Yeah, I see what you're saying. From memory, in order to get this right, there
was a whole bunch of additional changes that needed to be done that would
effectively be removed in a subsequent patch. But, let me revisit this again
and see what I can do.

> > > > + if (!inode_trylock(inode)) {
> > > > + if (iocb->ki_flags & IOCB_NOWAIT)
> > > > + return -EAGAIN;
> > > > + inode_lock(inode);
> > > > + }
> > > > +
> > > > + if (!ext4_dio_checks(inode)) {
> > > > + inode_unlock(inode);
> > > > + /*
> > > > + * Fallback to buffered IO if the operation on the
> > > > + * inode is not supported by direct IO.
> > > > + */
> > > > + return ext4_buffered_write_iter(iocb, from);
> > >
> > > I think you want to lift the locking into the caller of this function
> > > so that you don't have to unlock and relock for the buffered write
> > > fallback.
> >
> > I don't exactly know what you really mean by "lift the locking into the caller
> > of this function". I'm interpreting that as moving the inode_unlock()
> > operation into ext4_buffered_write_iter(), but I can't see how that would be
> > any different from doing it directly here? Wouldn't this also run the risk of
> > the locks becoming unbalanced as we'd need to add checks around whether the
> > resource is being contended? Maybe I'm misunderstanding something here...
>
> With that I mean to acquire the inode lock in ext4_file_write_iter
> instead of the low-level buffered I/O or direct I/O routines.

Oh, I didn't think of that! But yes, that would in fact be nice and I cannot
see why we shouldn't be doing that at this point. It also helps with reducing
all the code duplication going on in the low-level buffered, direct, dax I/O
routines.

--<M>--

2019-09-17 12:39:51

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure

On Tue, Sep 17, 2019 at 02:02:33AM -0700, Christoph Hellwig wrote:
> On Tue, Sep 17, 2019 at 02:30:15PM +0530, Ritesh Harjani wrote:
> > So if we have a delayed buffered write to a file,
> > in that case we first only update inode->i_size and update
> > i_disksize at writeback time
> > (i.e. during block allocation).
> > In that case when we call for ext4_dio_write_iter
> > since offset + len > i_disksize, we call for ext4_update_i_disksize().
> >
> > Now if writeback for some reason failed. And the system crashes, during the
> > DIO writes, after the blocks are allocated. Then during reboot we may have
> > an inconsistent inode, since we did not add the inode into the
> > orphan list before we updated the inode->i_disksize. And journal replay
> > may not succeed.
> >
> > 1. Can above actually happen? I am still not able to figure out the
> > race/inconsistency completely.
> > 2. Can you please help explain under what other cases
> > it was necessary to call ext4_update_i_disksize() in DIO write paths?
> > 3. When will i_disksize be out-of-sync with i_size during DIO writes?
>
> None of the above seems new in this patchset, does it?

That's correct.

*Ritesh - FWIW, I think you'll find the answers to your questions above by
referring to the following commits:

1) 73fdad00b208b
2) 45d8ec4d9fd54

If you drop the check (offset + count > EXT4_I(inode)->i_disksize) and the
call to ext4_update_i_disksize(), under some workloads i.e. "generic/475"
you'll generally end up with metadata corruption.

> That being said I found the early size update odd. XFS updates the on-disk
> size only at I/O completion time to deal with various races including the
> potential exposure of stale data.

Indeed a little odd. But, I think delalloc/writeback implementation is
possibly to blame here (based on what's detailed in 45d8ec4d9fd54)? Ideally, I
had the same approach as XFS in mind, but I couldn't do it.

--<M>--

2019-09-17 12:50:13

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] ext4: reorder map.m_flags checks in ext4_iomap_begin()

On Mon, Sep 16, 2019 at 05:05:33AM -0700, Christoph Hellwig wrote:
> On Thu, Sep 12, 2019 at 09:04:30PM +1000, Matthew Bobrowski wrote:
> > @@ -3581,10 +3581,21 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> > iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
> > iomap->addr = IOMAP_NULL_ADDR;
> > } else {
> > - if (map.m_flags & EXT4_MAP_MAPPED) {
> > - iomap->type = IOMAP_MAPPED;
> > - } else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> > + /*
> > + * Flags passed to ext4_map_blocks() for direct IO
> > + * writes can result in m_flags having both
> > + * EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits set. In
> > + * order for allocated unwritten extents to be
> > + * converted to written extents in the end_io handler
> > + * correctly, we need to ensure that the iomap->type
> > + * is also set appropriately in that case. Thus, we
> > + * need to check whether EXT4_MAP_UNWRITTEN is set
> > + * first.
> > + */
> > + if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> > iomap->type = IOMAP_UNWRITTEN;
> > + } else if (map.m_flags & EXT4_MAP_MAPPED) {
> > + iomap->type = IOMAP_MAPPED;
>
> I think much of this would benefit a lot from just being split up.
> I hacked up a patch last week that split the ext4 direct I/O code
> a bit, but this is completely untested and needs further splitup,
> but maybe you can take it as an inspiration for your series?

Nice, I really like this! :-)

The ext4_iomap_begin() callback is kind of already getting larger than it
should have to be and I can only see it growing moving forward, so why not
split it up now.

> E.g. at least one helper for filling out the iomap from the ext4
> map data, and one for the seek unwritten extent reporting. The
> split of the overall iomap ops seemed useful to me, but might not
> be as important with the other cleanups:

Yeah, I think I'll leave the iomap operations as they are for now. Something
to definitely consider at a later point though.

--<M>--

2019-09-20 19:51:11

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure

On Tue, Sep 17, 2019 at 02:06:13AM -0700, Christoph Hellwig wrote:
> On Tue, Sep 17, 2019 at 08:37:41AM +1000, Matthew Bobrowski wrote:
> > > Independent of the error return issue you probably want to split
> > > modifying ext4_write_checks into a separate preparation patch.
> >
> > Providing that there's no objections to introducing a possible performance
> > change with this separate preparation patch (overhead of calling
> > file_remove_privs/file_update_time twice), then I have no issues in doing so.
>
> Well, we should avoid calling it twice. But what caught my eye is that
> the buffered I/O path also called this function, so we are changing it as
> well here. If that actually is safe (I didn't review these bits carefully
> and don't know ext4 that well) the overall refactoring of the write
> flow might belong into a separate prep patch (that is not relying
> on ->direct_IO, the checks changes, etc).

Yeah, no. Revisiting this again now and trying to implement the
ext4_write_checks() modifications as a pre-patch is a nightmare so to
speak. This is purely due to the way that ext4_file_write_iter() is currently
written and how both the current buffered I/O and direct I/O paths traverse
through and make use of it.

If anything, the changes applied to ext4_write_checks() should be a separate
patch that comes *after* the refactoring of the buffered and direct I/O write
flow. However, even then, there'd be code that we essentially introduce in the
write flow changes and then subsequently removed after the fact. Providing
that's OK, then sure, I can put this within a separate patch.

--<M>--

2019-09-25 15:14:00

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] ext4: reorder map.m_flags checks in ext4_iomap_begin()

On Thu 12-09-19 21:04:30, Matthew Bobrowski wrote:
> For iomap direct IO write code path changes, we need to accommodate
> for the case where the block mapping flags passed to ext4_map_blocks()
> will result in m_flags having both EXT4_MAP_MAPPED and
> EXT4_MAP_UNWRITTEN bits set. In order for the allocated unwritten
> extents to be converted properly in the end_io handler, iomap->type
> must be set to IOMAP_UNWRITTEN, so we need to reshuffle the
> conditional statement in order to achieve this.
>
> This change is a no-op for DAX code path as the block mapping flag
> passed to ext4_map_blocks() when IS_DAX(inode) never results in
> EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN being set at once.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
> Reviewed-by: Ritesh Harjani <[email protected]>

The patch looks good to me. You can add:

Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
> fs/ext4/inode.c | 17 ++++++++++++++---
> 1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 761ce6286b05..efb184928e51 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3581,10 +3581,21 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
> iomap->addr = IOMAP_NULL_ADDR;
> } else {
> - if (map.m_flags & EXT4_MAP_MAPPED) {
> - iomap->type = IOMAP_MAPPED;
> - } else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> + /*
> + * Flags passed to ext4_map_blocks() for direct IO
> + * writes can result in m_flags having both
> + * EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits set. In
> + * order for allocated unwritten extents to be
> + * converted to written extents in the end_io handler
> + * correctly, we need to ensure that the iomap->type
> + * is also set appropriately in that case. Thus, we
> + * need to check whether EXT4_MAP_UNWRITTEN is set
> + * first.
> + */
> + if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> iomap->type = IOMAP_UNWRITTEN;
> + } else if (map.m_flags & EXT4_MAP_MAPPED) {
> + iomap->type = IOMAP_MAPPED;
> } else {
> WARN_ON_ONCE(1);
> return -EIO;
> --
> 2.20.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-09-25 16:19:47

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] ext4: move inode extension/truncate code out from ext4_iomap_end()

On Thu 12-09-19 21:04:00, Matthew Bobrowski wrote:
> @@ -233,12 +234,90 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> return iov_iter_count(from);
> }
>
> +static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> + ssize_t len, size_t count)

Traditionally, we call one of the length arguments 'copied' or 'written' to
denote actual amount of data processed and the original length is called
'len' or 'length' in iomap code. Can you please rename the arguments to
follow this convention?

> +{
> + handle_t *handle;
> + bool truncate = false;
> + ext4_lblk_t written_blk, end_blk;
> + int ret = 0, blkbits = inode->i_blkbits;
> +
> + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + goto orphan_del;
> + }
> +
> + if (ext4_update_inode_size(inode, offset + len))
> + ext4_mark_inode_dirty(handle, inode);
> +
> + /*
> + * We may need truncate allocated but not written blocks
> + * beyond EOF.
> + */
> + written_blk = ALIGN(offset + len, 1 << blkbits);
> + end_blk = ALIGN(offset + count, 1 << blkbits);
> + if (written_blk < end_blk && ext4_can_truncate(inode))
> + truncate = true;
> +
> + /*
> + * Remove the inode from the orphan list if it has been
> + * extended and everything went OK.
> + */
> + if (!truncate && inode->i_nlink)
> + ext4_orphan_del(handle, inode);
> + ext4_journal_stop(handle);
> +
> + if (truncate) {
> + ext4_truncate_failed_write(inode);
> +orphan_del:
> + /*
> + * If the truncate operation failed early the inode
> + * may still be on the orphan list. In that case, we
> + * need try remove the inode from the linked list in
> + * memory.
> + */
> + if (inode->i_nlink)
> + ext4_orphan_del(NULL, inode);
> + }
> + return ret;
> +}
> +
> +/*
> + * The inode may have been placed onto the orphan list or has had
> + * blocks allocated beyond EOF as a result of an extension. We need to
> + * ensure that any necessary cleanup routines are performed if the
> + * error path has been taken for a write.
> + */
> +static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
> +{
> + handle_t *handle;
> +
> + if (size > i_size_read(inode))
> + ext4_truncate_failed_write(inode);
> +
> + if (!list_empty(&EXT4_I(inode)->i_orphan)) {
> + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> + if (IS_ERR(handle)) {
> + if (inode->i_nlink)
> + ext4_orphan_del(NULL, inode);
> + return PTR_ERR(handle);
> + }
> + if (inode->i_nlink)
> + ext4_orphan_del(handle, inode);
> + ext4_journal_stop(handle);
> + }
> + return 0;
> +}
> +

After some thought, I'd just drop this function and fold the functionality
into ext4_handle_inode_extension() by making it accept negative 'len'
argument indicating error. The code just happens to be the simplest in that
case (see below).

> @@ -255,7 +334,18 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> if (ret)
> goto out;
>
> + offset = iocb->ki_pos;
> ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
> + if (ret > 0 && iocb->ki_pos > i_size_read(inode))
> + error = ext4_handle_inode_extension(inode, offset, ret,
> + iov_iter_count(from));

You need to sample iov_iter_count(from) before calling dax_iomap_rw(). At
this point iov_iter_count(from) is just what's left in the iter after
writing what we could.

Also I don't think the condition iocb->ki_pos > i_size_read(inode) is
correct here. Because it may happen that offset + count > i_size so we
allocate some blocks beyond i_size but then we manage to copy only less so
offset + ret == iocb->ki_pos <= i_size and you will not call
ext4_handle_inode_extension() to truncate allocated blocks beyond i_size.

So I'd just call ext4_handle_inode_extension() unconditionally like:

error = ext4_handle_inode_extension(inode, offset, ret, len);

and have a quick check at the beginning of that function to avoid starting
transaction when there isn't anything to do. Something like:

/*
* DIO and DAX writes get exclusion from truncate (i_rwsem) and
* page writeback (i_rwsem and flushing all dirty pages).
*/
WARN_ON_ONCE(i_size_read(inode) != EXT4_I(inode)->i_disksize);
if (offset + count <= i_size_read(inode))
return 0;
if (len < 0)
goto truncate;

... do the heavylifting with transaction start, inode size update,
and orphan handling...

if (truncate) {
truncate:
ext4_truncate_failed_write(inode);
orphan_del:
/*
* If the truncate operation failed early the inode
* may still be on the orphan list. In that case, we
* need try remove the inode from the linked list in
* memory.
*/
if (inode->i_nlink)
ext4_orphan_del(NULL, inode);
}

> +
> + if (ret < 0)
> + error = ext4_handle_failed_inode_extension(inode,
> + iocb->ki_pos);
> +
> + if (error)
> + ret = error;
> out:
> inode_unlock(inode);
> if (ret > 0)

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-09-25 23:49:58

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure

I'll try to comment just on top of refactoring Christoph has suggested...

On Thu 12-09-19 21:04:46, Matthew Bobrowski wrote:
> @@ -310,6 +341,120 @@ static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
> return 0;
> }
>
> +/*
> + * For a write that extends the inode size, ext4_dio_write_iter() will
> + * wait for the write to complete. Consequently, operations performed
> + * within this function are still covered by the inode_lock(). On
> + * success, this function returns 0.
> + */
> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
> + unsigned int flags)
> +{
> + int ret;
> + loff_t offset = iocb->ki_pos;
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + if (error) {
> + ret = ext4_handle_failed_inode_extension(inode, offset + size);
> + return ret ? ret : error;
> + }
> +
> + if (flags & IOMAP_DIO_UNWRITTEN) {
> + ret = ext4_convert_unwritten_extents(NULL, inode,
> + offset, size);
> + if (ret)
> + return ret;
> + }
> +
> + if (offset + size > i_size_read(inode)) {
> + ret = ext4_handle_inode_extension(inode, offset, size, 0);
> + if (ret)
> + return ret;
> + }

With the suggestions I made to your patch 3/6 this could be simplified to:

if (!error && flags & IOMAP_DIO_UNWRITTEN) {
error = ext4_convert_unwritten_extents(NULL, inode, offset,
size);
}
return ext4_handle_inode_extension(inode, offset, error ? : size, size);


Note the change that when ext4_convert_unwritten_extents() fails (although
this should not really happen unless there's some corruption going on), we
do properly truncate possible extents beyond i_size.

> +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> + ssize_t ret;
> + size_t count;
> + loff_t offset = iocb->ki_pos;
> + struct inode *inode = file_inode(iocb->ki_filp);
> + bool extend = false, overwrite = false, unaligned_aio = false;
> +
> + if (!inode_trylock(inode)) {
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EAGAIN;
> + inode_lock(inode);
> + }
> +
> + if (!ext4_dio_checks(inode)) {
> + inode_unlock(inode);
> + /*
> + * Fallback to buffered IO if the operation on the
> + * inode is not supported by direct IO.
> + */
> + return ext4_buffered_write_iter(iocb, from);
> + }
> +
> + ret = ext4_write_checks(iocb, from);
> + if (ret <= 0) {
> + inode_unlock(inode);
> + return ret;
> + }
> +
> + /*
> + * Unaligned direct AIO must be serialized among each other as
> + * the zeroing of partial blocks of two competing unaligned
> + * AIOs can result in data corruption.
> + */
> + if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> + !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
> + unaligned_aio = true;
> + inode_dio_wait(inode);
> + }
> +
> + /*
> + * Determine whether the IO operation will overwrite allocated
> + * and initialized blocks. If so, check to see whether it is
> + * possible to take the dioread_nolock path.
> + */
> + count = iov_iter_count(from);
> + if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
> + ext4_should_dioread_nolock(inode)) {
> + overwrite = true;
> + downgrade_write(&inode->i_rwsem);
> + }
> +
> + if (offset + count > i_size_read(inode) ||
> + offset + count > EXT4_I(inode)->i_disksize) {
> + ext4_update_i_disksize(inode, inode->i_size);
> + extend = true;
> + }

This call to ext4_update_i_disksize() is definitely wrong. If nothing else,
you need to also have transaction started and call ext4_mark_inode_dirty()
to actually journal the change of i_disksize (ext4_update_i_disksize()
updates only the in-memory copy of the entry). Also the direct IO code
needs to add the inode to the orphan list so that in case of crash, blocks
allocated beyond EOF get truncated on next mount. That is the whole point
of this excercise with i_disksize after all.

But I'm wondering if i_disksize update is needed. Truncate cannot be in
progress (we hold i_rwsem) and dirty pages will be flushed by
iomap_dio_rw() before we start to allocate any blocks. So it should be
enough to have here:

if (offset + count > i_size_read(inode)) {
/*
* Add inode to orphan list so that blocks allocated beyond
* EOF get properly truncated in case of crash.
*/
start transaction handle
add inode to orphan list
stop transaction handle
}

And just leave i_disksize at whatever it currently is.

> +
> + ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);
> +
> + /*
> + * Unaligned direct AIO must be the only IO in flight or else
> + * any overlapping aligned IO after unaligned IO might result
> + * in data corruption. We also need to wait here in the case
> + * where the inode is being extended so that inode extension
> + * routines in ext4_dio_write_end_io() are covered by the
> + * inode_lock().
> + */
> + if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
> + inode_dio_wait(inode);
> +
> + if (overwrite)
> + inode_unlock_shared(inode);
> + else
> + inode_unlock(inode);
> +
> + if (ret >= 0 && iov_iter_count(from))
> + return ext4_buffered_write_iter(iocb, from);
> + return ret;
> +}
> +
...
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index efb184928e51..f52ad3065236 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3513,11 +3513,13 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> }
> }
> } else if (flags & IOMAP_WRITE) {
> - int dio_credits;
> handle_t *handle;
> - int retries = 0;
> + int dio_credits, retries = 0, m_flags = 0;
>
> - /* Trim mapping request to maximum we can map at once for DIO */
> + /*
> + * Trim mapping request to maximum we can map at once
> + * for DIO.
> + */
> if (map.m_len > DIO_MAX_BLOCKS)
> map.m_len = DIO_MAX_BLOCKS;
> dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
> @@ -3533,8 +3535,30 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> if (IS_ERR(handle))
> return PTR_ERR(handle);
>
> - ret = ext4_map_blocks(handle, inode, &map,
> - EXT4_GET_BLOCKS_CREATE_ZERO);
> + /*
> + * DAX and direct IO are the only two operations that
> + * are currently supported with IOMAP_WRITE.
> + */
> + WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
> + if (IS_DAX(inode))
> + m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> + else if (round_down(offset, i_blocksize(inode)) >=
> + i_size_read(inode))
> + m_flags = EXT4_GET_BLOCKS_CREATE;
> + else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> + m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> +
> + ret = ext4_map_blocks(handle, inode, &map, m_flags);
> +
> + /*
> + * We cannot fill holes in indirect tree based inodes
> + * as that could expose stale data in the case of a
> + * crash. Use the magic error code to fallback to
> + * buffered IO.
> + */
> + if (!m_flags && !ret)
> + ret = -ENOTBLK;
> +
> if (ret < 0) {
> ext4_journal_stop(handle);
> if (ret == -ENOSPC &&
> @@ -3544,13 +3568,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> }
>
> /*
> - * If we added blocks beyond i_size, we need to make sure they
> - * will get truncated if we crash before updating i_size in
> - * ext4_iomap_end(). For faults we don't need to do that (and
> - * even cannot because for orphan list operations inode_lock is
> - * required) - if we happen to instantiate block beyond i_size,
> - * it is because we race with truncate which has already added
> - * the inode to the orphan list.
> + * If we added blocks beyond i_size, we need to make
> + * sure they will get truncated if we crash before
> + * updating the i_size. For faults we don't need to do
> + * that (and even cannot because for orphan list
> + * operations inode_lock is required) - if we happen
> + * to instantiate block beyond i_size, it is because
> + * we race with truncate which has already added the
> + * inode to the orphan list.
> */

Just a nit but it would be nice to use full width of 80 columns when
formatting comments so that they don't get unnecessarily long.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-09-26 00:58:11

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v3 4/6] ext4: reorder map.m_flags checks in ext4_iomap_begin()

On Mon, Sep 23, 2019 at 05:08:56PM +0200, Jan Kara wrote:
> On Thu 12-09-19 21:04:30, Matthew Bobrowski wrote:
> > For iomap direct IO write code path changes, we need to accommodate
> > for the case where the block mapping flags passed to ext4_map_blocks()
> > will result in m_flags having both EXT4_MAP_MAPPED and
> > EXT4_MAP_UNWRITTEN bits set. In order for the allocated unwritten
> > extents to be converted properly in the end_io handler, iomap->type
> > must be set to IOMAP_UNWRITTEN, so we need to reshuffle the
> > conditional statement in order to achieve this.
> >
> > This change is a no-op for DAX code path as the block mapping flag
> > passed to ext4_map_blocks() when IS_DAX(inode) never results in
> > EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN being set at once.
> >
> > Signed-off-by: Matthew Bobrowski <[email protected]>
> > Reviewed-by: Ritesh Harjani <[email protected]>
>
> The patch looks good to me. You can add:
>
> Reviewed-by: Jan Kara <[email protected]>

Thank you Jan!

Note that the updated patch series that I'll send through shortly has slightly
changed in the sense that I've split up a whole bunch of the iomap code. The
idea behind this was to reduce the overall clutter that exists within the
->iomap_begin() callback. This was recommended by Christoph, and I really like
the idea because moving forward it makes complete sense to do so. This
specific patch will remain as is though.

--<M>--

2019-09-26 01:04:15

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] ext4: move inode extension/truncate code out from ext4_iomap_end()

On Mon, Sep 23, 2019 at 06:21:15PM +0200, Jan Kara wrote:
> On Thu 12-09-19 21:04:00, Matthew Bobrowski wrote:
> > @@ -233,12 +234,90 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > return iov_iter_count(from);
> > }
> >
> > +static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> > + ssize_t len, size_t count)
>
> Traditionally, we call one of the length arguments 'copied' or 'written' to
> denote actual amount of data processed and the original length is called
> 'len' or 'length' in iomap code. Can you please rename the arguments to
> follow this convention?

Sure, I will go with 'written'.

> > +/*
> > + * The inode may have been placed onto the orphan list or has had
> > + * blocks allocated beyond EOF as a result of an extension. We need to
> > + * ensure that any necessary cleanup routines are performed if the
> > + * error path has been taken for a write.
> > + */
> > +static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
> > +{
> > + handle_t *handle;
> > +
> > + if (size > i_size_read(inode))
> > + ext4_truncate_failed_write(inode);
> > +
> > + if (!list_empty(&EXT4_I(inode)->i_orphan)) {
> > + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> > + if (IS_ERR(handle)) {
> > + if (inode->i_nlink)
> > + ext4_orphan_del(NULL, inode);
> > + return PTR_ERR(handle);
> > + }
> > + if (inode->i_nlink)
> > + ext4_orphan_del(handle, inode);
> > + ext4_journal_stop(handle);
> > + }
> > + return 0;
> > +}
> > +
>
> After some thought, I'd just drop this function and fold the functionality
> into ext4_handle_inode_extension() by making it accept negative 'len'
> argument indicating error. The code just happens to be the simplest in that
> case (see below).
>
> > @@ -255,7 +334,18 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > if (ret)
> > goto out;
> >
> > + offset = iocb->ki_pos;
> > ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
> > + if (ret > 0 && iocb->ki_pos > i_size_read(inode))
> > + error = ext4_handle_inode_extension(inode, offset, ret,
> > + iov_iter_count(from));
>
> You need to sample iov_iter_count(from) before calling dax_iomap_rw(). At
> this point iov_iter_count(from) is just what's left in the iter after
> writing what we could.
>
> Also I don't think the condition iocb->ki_pos > i_size_read(inode) is
> correct here. Because it may happen that offset + count > i_size so we
> allocate some blocks beyond i_size but then we manage to copy only less so
> offset + ret == iocb->ki_pos <= i_size and you will not call
> ext4_handle_inode_extension() to truncate allocated blocks beyond i_size.

Yeah, this makes sense. I will implement it this way. Once again, appreciate
your suggestions.

> So I'd just call ext4_handle_inode_extension() unconditionally like:
>
> error = ext4_handle_inode_extension(inode, offset, ret, len);
>
> and have a quick check at the beginning of that function to avoid starting
> transaction when there isn't anything to do. Something like:
>
> /*
> * DIO and DAX writes get exclusion from truncate (i_rwsem) and
> * page writeback (i_rwsem and flushing all dirty pages).
> */
> WARN_ON_ONCE(i_size_read(inode) != EXT4_I(inode)->i_disksize);
> if (offset + count <= i_size_read(inode))
> return 0;
> if (len < 0)
> goto truncate;
>
> ... do the heavylifting with transaction start, inode size update,
> and orphan handling...
>
> if (truncate) {
> truncate:
> ext4_truncate_failed_write(inode);
> orphan_del:
> /*
> * If the truncate operation failed early the inode
> * may still be on the orphan list. In that case, we
> * need try remove the inode from the linked list in
> * memory.
> */
> if (inode->i_nlink)
> ext4_orphan_del(NULL, inode);
> }

--<M>--

2019-09-26 01:14:50

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure

On Mon, Sep 23, 2019 at 11:10:11PM +0200, Jan Kara wrote:
> On Thu 12-09-19 21:04:46, Matthew Bobrowski wrote:
> > +/*
> > + * For a write that extends the inode size, ext4_dio_write_iter() will
> > + * wait for the write to complete. Consequently, operations performed
> > + * within this function are still covered by the inode_lock(). On
> > + * success, this function returns 0.
> > + */
> > +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
> > + unsigned int flags)
> > +{
> > + int ret;
> > + loff_t offset = iocb->ki_pos;
> > + struct inode *inode = file_inode(iocb->ki_filp);
> > +
> > + if (error) {
> > + ret = ext4_handle_failed_inode_extension(inode, offset + size);
> > + return ret ? ret : error;
> > + }
> > +
> > + if (flags & IOMAP_DIO_UNWRITTEN) {
> > + ret = ext4_convert_unwritten_extents(NULL, inode,
> > + offset, size);
> > + if (ret)
> > + return ret;
> > + }
> > +
> > + if (offset + size > i_size_read(inode)) {
> > + ret = ext4_handle_inode_extension(inode, offset, size, 0);
> > + if (ret)
> > + return ret;
> > + }
>
> With the suggestions I made to your patch 3/6 this could be simplified to:
>
> if (!error && flags & IOMAP_DIO_UNWRITTEN) {
> error = ext4_convert_unwritten_extents(NULL, inode, offset,
> size);
> }
> return ext4_handle_inode_extension(inode, offset, error ? : size, size);
>
>
> Note the change that when ext4_convert_unwritten_extents() fails (although
> this should not really happen unless there's some corruption going on), we
> do properly truncate possible extents beyond i_size.

This sounds good to me. I like this idea.

> > +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > +{
> > + ssize_t ret;
> > + size_t count;
> > + loff_t offset = iocb->ki_pos;
> > + struct inode *inode = file_inode(iocb->ki_filp);
> > + bool extend = false, overwrite = false, unaligned_aio = false;
> > +
> > + if (!inode_trylock(inode)) {
> > + if (iocb->ki_flags & IOCB_NOWAIT)
> > + return -EAGAIN;
> > + inode_lock(inode);
> > + }
> > +
> > + if (!ext4_dio_checks(inode)) {
> > + inode_unlock(inode);
> > + /*
> > + * Fallback to buffered IO if the operation on the
> > + * inode is not supported by direct IO.
> > + */
> > + return ext4_buffered_write_iter(iocb, from);
> > + }
> > +
> > + ret = ext4_write_checks(iocb, from);
> > + if (ret <= 0) {
> > + inode_unlock(inode);
> > + return ret;
> > + }
> > +
> > + /*
> > + * Unaligned direct AIO must be serialized among each other as
> > + * the zeroing of partial blocks of two competing unaligned
> > + * AIOs can result in data corruption.
> > + */
> > + if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> > + !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
> > + unaligned_aio = true;
> > + inode_dio_wait(inode);
> > + }
> > +
> > + /*
> > + * Determine whether the IO operation will overwrite allocated
> > + * and initialized blocks. If so, check to see whether it is
> > + * possible to take the dioread_nolock path.
> > + */
> > + count = iov_iter_count(from);
> > + if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
> > + ext4_should_dioread_nolock(inode)) {
> > + overwrite = true;
> > + downgrade_write(&inode->i_rwsem);
> > + }
> > +
> > + if (offset + count > i_size_read(inode) ||
> > + offset + count > EXT4_I(inode)->i_disksize) {
> > + ext4_update_i_disksize(inode, inode->i_size);
> > + extend = true;
> > + }
>
> This call to ext4_update_i_disksize() is definitely wrong. If nothing else,
> you need to also have transaction started and call ext4_mark_inode_dirty()
> to actually journal the change of i_disksize (ext4_update_i_disksize()
> updates only the in-memory copy of the entry). Also the direct IO code
> needs to add the inode to the orphan list so that in case of crash, blocks
> allocated beyond EOF get truncated on next mount. That is the whole point
> of this excercise with i_disksize after all.
>
> But I'm wondering if i_disksize update is needed. Truncate cannot be in
> progress (we hold i_rwsem) and dirty pages will be flushed by
> iomap_dio_rw() before we start to allocate any blocks. So it should be
> enough to have here:

Well, I initially thought the same, however doing some research shows that we
have the following edge case:
- 45d8ec4d9fd54
and
- 73fdad00b208b

In fact you can reproduce the exact same i_size corruption issue by running
the generic/475 xfstests mutitple times, as articulated within
45d8ec4d9fd54. So with that, I'm kind of confused and thinking that there may
be a problem that resides elsewhere that may need addressing?

> if (offset + count > i_size_read(inode)) {
> /*
> * Add inode to orphan list so that blocks allocated beyond
> * EOF get properly truncated in case of crash.
> */
> start transaction handle
> add inode to orphan list
> stop transaction handle
> }
>
> And just leave i_disksize at whatever it currently is.

I originally had the code which added the inode to the orphan list here, but
then I thought to myself that it'd make more sense to actually do this step
closer to the point where we've managed to successfully allocate the required
blocks for the write. This prevents the need to spray orphan list clean up
code all over the place just to cover the case that a write which had intended
to extend the inode beyond i_size had failed prematurely (i.e. before block
allocation). So, hence the reason why I thought having it in
ext4_iomap_begin() would make more sense, because at that point in the write
path, there is enough/or more assurance to make the call around whether we
will in fact be able to perform the write which will be extending beyond
i_size, or not and consequently whether the inode should be placed onto the
orphan list?

Ideally I'd like to turn this statement into:

if (offset + count > i_size_read(inode))
extend = true;

Maybe I'm missing something here and there's actually a really good reason for
doing this nice and early? What are your thoughts about what I've mentioned
above?

> > - * If we added blocks beyond i_size, we need to make sure they
> > - * will get truncated if we crash before updating i_size in
> > - * ext4_iomap_end(). For faults we don't need to do that (and
> > - * even cannot because for orphan list operations inode_lock is
> > - * required) - if we happen to instantiate block beyond i_size,
> > - * it is because we race with truncate which has already added
> > - * the inode to the orphan list.
> > + * If we added blocks beyond i_size, we need to make
> > + * sure they will get truncated if we crash before
> > + * updating the i_size. For faults we don't need to do
> > + * that (and even cannot because for orphan list
> > + * operations inode_lock is required) - if we happen
> > + * to instantiate block beyond i_size, it is because
> > + * we race with truncate which has already added the
> > + * inode to the orphan list.
> > */
>
> Just a nit but it would be nice to use full width of 80 columns when
> formatting comments so that they don't get unnecessarily long.

Sure. I'm blaming emacs for this. :P

--<M>--

2019-09-26 08:04:57

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure

On Tue 17-09-19 14:30:15, Ritesh Harjani wrote:
> On 9/17/19 4:07 AM, Matthew Bobrowski wrote:
> > On Mon, Sep 16, 2019 at 05:12:48AM -0700, Christoph Hellwig wrote:
> > > > + if (offset + count > i_size_read(inode) ||
> > > > + offset + count > EXT4_I(inode)->i_disksize) {
> > > > + ext4_update_i_disksize(inode, inode->i_size);
> > > > + extend = true;
> > >
> > > Doesn't the ext4_update_i_disksize need to be under an open journal
> > > handle?
> >
> > After all, it is a metadata update, which should go through an open journal
> > handle.
>
> Hmmm, it seems like a race here. But I am not sure if this is just due to
> not updating i_disksize under open journal handle.
>
>
> So if we have a delayed buffered write to a file,
> in that case we first only update inode->i_size and update
> i_disksize at writeback time
> (i.e. during block allocation).
> In that case when we call for ext4_dio_write_iter
> since offset + len > i_disksize, we call for ext4_update_i_disksize().
>
> Now if writeback for some reason failed. And the system crashes, during the
> DIO writes, after the blocks are allocated. Then during reboot we may have
> an inconsistent inode, since we did not add the inode into the
> orphan list before we updated the inode->i_disksize. And journal replay
> may not succeed.

OK, let me clear out some confusion here.

> 1. Can above actually happen? I am still not able to figure out the
> race/inconsistency completely.
> 2. Can you please help explain under what other cases
> it was necessary to call ext4_update_i_disksize() in DIO write paths?
> 3. When will i_disksize be out-of-sync with i_size during DIO writes?

So as I commented in my other reply to this patch, the code is definitely
wrong as is. The update of i_disksize in the original DIO code is connected
with the addition to the orphan list - we want to make sure orphan cleanup
in case of crash will truncate inode to the current i_size so we set
i_disksize to i_size. That being said I cannot find a reason why the update
would be actually necessary because at that point i_size should be equal to
i_disksize anyway. So the best theory I have is that it was added "just to
be sure" (the code got inherited from original ext3 codebase from before
git times).

To answer your other questions: If we decided to leave i_disksize update
in, the proper use would look like:

handle = ext4_journal_start(...);
if (ext4_update_i_disksize(inode))
ext4_mark_inode_dirty(handle, inode);
ext4_orphan_add(handle, inode);
ext4_journal_stop();

Now the race you ask about in 1) can happen but it would be harmless.
As you mention buffered writeback does update i_disksize after submitting
IO to freshly allocated blocks (in the same transaction as block allocation
is) but stale data exposure is impossible because transaction commit will
wait for page writeback to complete.

To answer 3), i_disksize is not in sync with i_size either during truncate
(or similar operations generally protected by i_rwsem for writing) or
when there are pending delay allocated writes.

Now to answer 2) I don't think update of i_disksize is ever needed for DIO
path - once iomap_dio_rw() flushes pending dirty pages, i_disksize should
be equal to i_size since we hold i_rwsem at least for reading.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-09-26 08:05:39

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 2/6] ext4: move inode extension/truncate code out from ext4_iomap_end()

On Mon 23-09-19 18:21:15, Jan Kara wrote:
> On Thu 12-09-19 21:04:00, Matthew Bobrowski wrote:
> So I'd just call ext4_handle_inode_extension() unconditionally like:
>
> error = ext4_handle_inode_extension(inode, offset, ret, len);
>
> and have a quick check at the beginning of that function to avoid starting
> transaction when there isn't anything to do. Something like:
>
> /*
> * DIO and DAX writes get exclusion from truncate (i_rwsem) and
> * page writeback (i_rwsem and flushing all dirty pages).
> */
> WARN_ON_ONCE(i_size_read(inode) != EXT4_I(inode)->i_disksize);
> if (offset + count <= i_size_read(inode))
> return 0;

One correction here. With commit 45d8ec4d9fd54 in mind this should be:

/*
* DIO and DAX writes get exclusion from truncate (i_rwsem). There
* can be pending delalloc writeback beyond the range written by
* DIO though.
*/
WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
if (offset + count <= EXT4_I(inode)->i_disksize)
return 0;

Honza

> if (len < 0)
> goto truncate;
>
> ... do the heavylifting with transaction start, inode size update,
> and orphan handling...
>
> if (truncate) {
> truncate:
> ext4_truncate_failed_write(inode);
> orphan_del:
> /*
> * If the truncate operation failed early the inode
> * may still be on the orphan list. In that case, we
> * need try remove the inode from the linked list in
> * memory.
> */
> if (inode->i_nlink)
> ext4_orphan_del(NULL, inode);
> }
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-09-26 08:14:03

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure

On Tue 24-09-19 20:29:26, Matthew Bobrowski wrote:
> On Mon, Sep 23, 2019 at 11:10:11PM +0200, Jan Kara wrote:
> > On Thu 12-09-19 21:04:46, Matthew Bobrowski wrote:
> > > +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> > > +{
> > > + ssize_t ret;
> > > + size_t count;
> > > + loff_t offset = iocb->ki_pos;
> > > + struct inode *inode = file_inode(iocb->ki_filp);
> > > + bool extend = false, overwrite = false, unaligned_aio = false;
> > > +
> > > + if (!inode_trylock(inode)) {
> > > + if (iocb->ki_flags & IOCB_NOWAIT)
> > > + return -EAGAIN;
> > > + inode_lock(inode);
> > > + }
> > > +
> > > + if (!ext4_dio_checks(inode)) {
> > > + inode_unlock(inode);
> > > + /*
> > > + * Fallback to buffered IO if the operation on the
> > > + * inode is not supported by direct IO.
> > > + */
> > > + return ext4_buffered_write_iter(iocb, from);
> > > + }
> > > +
> > > + ret = ext4_write_checks(iocb, from);
> > > + if (ret <= 0) {
> > > + inode_unlock(inode);
> > > + return ret;
> > > + }
> > > +
> > > + /*
> > > + * Unaligned direct AIO must be serialized among each other as
> > > + * the zeroing of partial blocks of two competing unaligned
> > > + * AIOs can result in data corruption.
> > > + */
> > > + if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> > > + !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
> > > + unaligned_aio = true;
> > > + inode_dio_wait(inode);
> > > + }
> > > +
> > > + /*
> > > + * Determine whether the IO operation will overwrite allocated
> > > + * and initialized blocks. If so, check to see whether it is
> > > + * possible to take the dioread_nolock path.
> > > + */
> > > + count = iov_iter_count(from);
> > > + if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
> > > + ext4_should_dioread_nolock(inode)) {
> > > + overwrite = true;
> > > + downgrade_write(&inode->i_rwsem);
> > > + }
> > > +
> > > + if (offset + count > i_size_read(inode) ||
> > > + offset + count > EXT4_I(inode)->i_disksize) {
> > > + ext4_update_i_disksize(inode, inode->i_size);
> > > + extend = true;
> > > + }
> >
> > This call to ext4_update_i_disksize() is definitely wrong. If nothing else,
> > you need to also have transaction started and call ext4_mark_inode_dirty()
> > to actually journal the change of i_disksize (ext4_update_i_disksize()
> > updates only the in-memory copy of the entry). Also the direct IO code
> > needs to add the inode to the orphan list so that in case of crash, blocks
> > allocated beyond EOF get truncated on next mount. That is the whole point
> > of this excercise with i_disksize after all.
> >
> > But I'm wondering if i_disksize update is needed. Truncate cannot be in
> > progress (we hold i_rwsem) and dirty pages will be flushed by
> > iomap_dio_rw() before we start to allocate any blocks. So it should be
> > enough to have here:
>
> Well, I initially thought the same, however doing some research shows that we
> have the following edge case:
> - 45d8ec4d9fd54
> and
> - 73fdad00b208b
>
> In fact you can reproduce the exact same i_size corruption issue by running
> the generic/475 xfstests mutitple times, as articulated within
> 45d8ec4d9fd54. So with that, I'm kind of confused and thinking that there may
> be a problem that resides elsewhere that may need addressing?

Right, I forgot about the special case explained in 45d8ec4d9fd54 where
there's unwritted delalloc write beyond range where DIO write happens.

> > if (offset + count > i_size_read(inode)) {
> > /*
> > * Add inode to orphan list so that blocks allocated beyond
> > * EOF get properly truncated in case of crash.
> > */
> > start transaction handle
> > add inode to orphan list
> > stop transaction handle
> > }
> >
> > And just leave i_disksize at whatever it currently is.
>
> I originally had the code which added the inode to the orphan list here, but
> then I thought to myself that it'd make more sense to actually do this step
> closer to the point where we've managed to successfully allocate the required
> blocks for the write. This prevents the need to spray orphan list clean up
> code all over the place just to cover the case that a write which had intended
> to extend the inode beyond i_size had failed prematurely (i.e. before block
> allocation). So, hence the reason why I thought having it in
> ext4_iomap_begin() would make more sense, because at that point in the write
> path, there is enough/or more assurance to make the call around whether we
> will in fact be able to perform the write which will be extending beyond
> i_size, or not and consequently whether the inode should be placed onto the
> orphan list?
>
> Ideally I'd like to turn this statement into:
>
> if (offset + count > i_size_read(inode))
> extend = true;
>
> Maybe I'm missing something here and there's actually a really good reason for
> doing this nice and early? What are your thoughts about what I've mentioned
> above?

Well, the slight trouble with adding inode to orphan list in
ext4_iomap_begin() is that then it is somewhat difficult to tell whether
you need to remove it when IO is done because there's no way how to
propagate that information from ext4_iomap_begin() and checking against
i_disksize is unreliable because it can change (due to writeback of
delalloc pages) while direct IO is running. But I think we can overcome
that by splitting our end_io functions to two - ext4_dio_write_end_io() and
ext4_dio_extend_write_end_io(). So:

WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
/*
* Need to check against i_disksize as there may be dellalloc writes
* pending.
*/
if (offset + count > EXT4_I(inode)->i_disksize)
extend = true;

...
iomap_dio_rw(...,
extend ? ext4_dio_extend_write_end_io : ext4_dio_write_end_io);

and ext4_dio_write_end_io() will just take care of conversion of unwritten
extents on successful IO completion, while ext4_dio_extend_write_end_io()
will take care of all the complex stuff with orphan handling, extension
of inode size, and truncation of blocks beyond EOF - and it can do that
because it is guaranteed to run under the protection of i_rwsem held in
ext4_dio_write_iter().

Alternatively, we could also just pass NULL instead of
ext4_dio_extend_write_end_io() and just do all the work explicitely in
ext4_dio_write_iter() in the 'extend' case. That might be actually the most
transparent option...

But at this point there are so many suggestions in flight that I need to
see current state of the code again to be able to tell anything useful :).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2019-09-26 09:17:33

by Matthew Bobrowski

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure

On Tue, Sep 24, 2019 at 04:13:21PM +0200, Jan Kara wrote:
> On Tue 24-09-19 20:29:26, Matthew Bobrowski wrote:
> > On Mon, Sep 23, 2019 at 11:10:11PM +0200, Jan Kara wrote:
> > > On Thu 12-09-19 21:04:46, Matthew Bobrowski wrote:
> > > > + if (offset + count > i_size_read(inode) ||
> > > > + offset + count > EXT4_I(inode)->i_disksize) {
> > > > + ext4_update_i_disksize(inode, inode->i_size);
> > > > + extend = true;
> > > > + }
> > >
> > > This call to ext4_update_i_disksize() is definitely wrong. If nothing else,
> > > you need to also have transaction started and call ext4_mark_inode_dirty()
> > > to actually journal the change of i_disksize (ext4_update_i_disksize()
> > > updates only the in-memory copy of the entry). Also the direct IO code
> > > needs to add the inode to the orphan list so that in case of crash, blocks
> > > allocated beyond EOF get truncated on next mount. That is the whole point
> > > of this excercise with i_disksize after all.
> > >
> > > But I'm wondering if i_disksize update is needed. Truncate cannot be in
> > > progress (we hold i_rwsem) and dirty pages will be flushed by
> > > iomap_dio_rw() before we start to allocate any blocks. So it should be
> > > enough to have here:
> >
> > Well, I initially thought the same, however doing some research shows that we
> > have the following edge case:
> > - 45d8ec4d9fd54
> > and
> > - 73fdad00b208b
> >
> > In fact you can reproduce the exact same i_size corruption issue by running
> > the generic/475 xfstests mutitple times, as articulated within
> > 45d8ec4d9fd54. So with that, I'm kind of confused and thinking that there may
> > be a problem that resides elsewhere that may need addressing?
>
> Right, I forgot about the special case explained in 45d8ec4d9fd54 where
> there's unwritted delalloc write beyond range where DIO write happens.
>
> > > if (offset + count > i_size_read(inode)) {
> > > /*
> > > * Add inode to orphan list so that blocks allocated beyond
> > > * EOF get properly truncated in case of crash.
> > > */
> > > start transaction handle
> > > add inode to orphan list
> > > stop transaction handle
> > > }
> > >
> > > And just leave i_disksize at whatever it currently is.
> >
> > I originally had the code which added the inode to the orphan list here, but
> > then I thought to myself that it'd make more sense to actually do this step
> > closer to the point where we've managed to successfully allocate the required
> > blocks for the write. This prevents the need to spray orphan list clean up
> > code all over the place just to cover the case that a write which had intended
> > to extend the inode beyond i_size had failed prematurely (i.e. before block
> > allocation). So, hence the reason why I thought having it in
> > ext4_iomap_begin() would make more sense, because at that point in the write
> > path, there is enough/or more assurance to make the call around whether we
> > will in fact be able to perform the write which will be extending beyond
> > i_size, or not and consequently whether the inode should be placed onto the
> > orphan list?
> >
> > Ideally I'd like to turn this statement into:
> >
> > if (offset + count > i_size_read(inode))
> > extend = true;
> >
> > Maybe I'm missing something here and there's actually a really good reason for
> > doing this nice and early? What are your thoughts about what I've mentioned
> > above?
>
> Well, the slight trouble with adding inode to orphan list in
> ext4_iomap_begin() is that then it is somewhat difficult to tell whether
> you need to remove it when IO is done because there's no way how to
> propagate that information from ext4_iomap_begin() and checking against
> i_disksize is unreliable because it can change (due to writeback of
> delalloc pages) while direct IO is running. But I think we can overcome
> that by splitting our end_io functions to two - ext4_dio_write_end_io() and
> ext4_dio_extend_write_end_io(). So:
>
> WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
> /*
> * Need to check against i_disksize as there may be dellalloc writes
> * pending.
> */
> if (offset + count > EXT4_I(inode)->i_disksize)
> extend = true;

Hm... I'm not entirely convinced that EXT4_I(inode)->i_disksize is what we
should be using to determine whether an extension will be performed or
not? Because, my understanding is that i_size is what holds the actual value
of what the file size is expected to be and hence the reason why we previously
updated the i_disksize to i_size using ext4_update_i_disksize().

Also, there are cases where offset + count > EXT4_I(inode)->i_disksize,
however offset + count < i_size_read(inode). So in that case we may take an
incorrect path somewhere i.e. below where extend clause is true. Also, I feel
as though we should try stick to using one value as the reference to determine
whether we're performing an extension and not use i_disksize here and then
i_size over there kind of thing as that leads to unnecessary confusion?

> ...
> iomap_dio_rw(...,
> extend ? ext4_dio_extend_write_end_io : ext4_dio_write_end_io);
>
> and ext4_dio_write_end_io() will just take care of conversion of unwritten
> extents on successful IO completion, while ext4_dio_extend_write_end_io()
> will take care of all the complex stuff with orphan handling, extension
> of inode size, and truncation of blocks beyond EOF - and it can do that
> because it is guaranteed to run under the protection of i_rwsem held in
> ext4_dio_write_iter().
>
> Alternatively, we could also just pass NULL instead of
> ext4_dio_extend_write_end_io() and just do all the work explicitely in
> ext4_dio_write_iter() in the 'extend' case. That might be actually the most
> transparent option...

Well, with the changes to ext4_handle_inode_extension() conditions that you
recommended in patch 2/6, then I can't see why we'd need two separate
->end_io() handlers as we'd just abort early if we're not extending?

> But at this point there are so many suggestions in flight that I need to
> see current state of the code again to be able to tell anything useful :).

Heh, true. I will post through an updated patch series taking into account
most of the recommendations put forward for this series version and then we
can have a discussion based on that. :)

--<M>--

2019-09-26 09:21:33

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH v3 5/6] ext4: introduce direct IO write path using iomap infrastructure

On Wed 25-09-19 17:14:29, Matthew Bobrowski wrote:
> On Tue, Sep 24, 2019 at 04:13:21PM +0200, Jan Kara wrote:
> > On Tue 24-09-19 20:29:26, Matthew Bobrowski wrote:
> > > On Mon, Sep 23, 2019 at 11:10:11PM +0200, Jan Kara wrote:
> > > > On Thu 12-09-19 21:04:46, Matthew Bobrowski wrote:
> > > > > + if (offset + count > i_size_read(inode) ||
> > > > > + offset + count > EXT4_I(inode)->i_disksize) {
> > > > > + ext4_update_i_disksize(inode, inode->i_size);
> > > > > + extend = true;
> > > > > + }
> > > >
> > > > This call to ext4_update_i_disksize() is definitely wrong. If nothing else,
> > > > you need to also have transaction started and call ext4_mark_inode_dirty()
> > > > to actually journal the change of i_disksize (ext4_update_i_disksize()
> > > > updates only the in-memory copy of the entry). Also the direct IO code
> > > > needs to add the inode to the orphan list so that in case of crash, blocks
> > > > allocated beyond EOF get truncated on next mount. That is the whole point
> > > > of this excercise with i_disksize after all.
> > > >
> > > > But I'm wondering if i_disksize update is needed. Truncate cannot be in
> > > > progress (we hold i_rwsem) and dirty pages will be flushed by
> > > > iomap_dio_rw() before we start to allocate any blocks. So it should be
> > > > enough to have here:
> > >
> > > Well, I initially thought the same, however doing some research shows that we
> > > have the following edge case:
> > > - 45d8ec4d9fd54
> > > and
> > > - 73fdad00b208b
> > >
> > > In fact you can reproduce the exact same i_size corruption issue by running
> > > the generic/475 xfstests mutitple times, as articulated within
> > > 45d8ec4d9fd54. So with that, I'm kind of confused and thinking that there may
> > > be a problem that resides elsewhere that may need addressing?
> >
> > Right, I forgot about the special case explained in 45d8ec4d9fd54 where
> > there's unwritted delalloc write beyond range where DIO write happens.
> >
> > > > if (offset + count > i_size_read(inode)) {
> > > > /*
> > > > * Add inode to orphan list so that blocks allocated beyond
> > > > * EOF get properly truncated in case of crash.
> > > > */
> > > > start transaction handle
> > > > add inode to orphan list
> > > > stop transaction handle
> > > > }
> > > >
> > > > And just leave i_disksize at whatever it currently is.
> > >
> > > I originally had the code which added the inode to the orphan list here, but
> > > then I thought to myself that it'd make more sense to actually do this step
> > > closer to the point where we've managed to successfully allocate the required
> > > blocks for the write. This prevents the need to spray orphan list clean up
> > > code all over the place just to cover the case that a write which had intended
> > > to extend the inode beyond i_size had failed prematurely (i.e. before block
> > > allocation). So, hence the reason why I thought having it in
> > > ext4_iomap_begin() would make more sense, because at that point in the write
> > > path, there is enough/or more assurance to make the call around whether we
> > > will in fact be able to perform the write which will be extending beyond
> > > i_size, or not and consequently whether the inode should be placed onto the
> > > orphan list?
> > >
> > > Ideally I'd like to turn this statement into:
> > >
> > > if (offset + count > i_size_read(inode))
> > > extend = true;
> > >
> > > Maybe I'm missing something here and there's actually a really good reason for
> > > doing this nice and early? What are your thoughts about what I've mentioned
> > > above?
> >
> > Well, the slight trouble with adding inode to orphan list in
> > ext4_iomap_begin() is that then it is somewhat difficult to tell whether
> > you need to remove it when IO is done because there's no way how to
> > propagate that information from ext4_iomap_begin() and checking against
> > i_disksize is unreliable because it can change (due to writeback of
> > delalloc pages) while direct IO is running. But I think we can overcome
> > that by splitting our end_io functions to two - ext4_dio_write_end_io() and
> > ext4_dio_extend_write_end_io(). So:
> >
> > WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
> > /*
> > * Need to check against i_disksize as there may be dellalloc writes
> > * pending.
> > */
> > if (offset + count > EXT4_I(inode)->i_disksize)
> > extend = true;
>
> Hm... I'm not entirely convinced that EXT4_I(inode)->i_disksize is what we
> should be using to determine whether an extension will be performed or
> not? Because, my understanding is that i_size is what holds the actual value
> of what the file size is expected to be and hence the reason why we previously
> updated the i_disksize to i_size using ext4_update_i_disksize().

So i_size is how inode size actually appears to user. i_disksize is what
inode size really is on disk. And because of orphan handling and similar
stuff requiring i_rwsem protection, we need to use the slow path waiting
for DIO to complete whenever the direct IO is beyond the on-disk version of
inode size. Possibly there are other places in DIO path that need to use
i_disksize instead of i_size as well - I'll check that once I see new
version of your patches.

> Also, there are cases where offset + count > EXT4_I(inode)->i_disksize,
> however offset + count < i_size_read(inode). So in that case we may take an
> incorrect path somewhere i.e. below where extend clause is true. Also, I feel
> as though we should try stick to using one value as the reference to determine
> whether we're performing an extension and not use i_disksize here and then
> i_size over there kind of thing as that leads to unnecessary confusion?

Well, when DIO extends past i_disksize, it needs to add inode to orphan
list, call truncate in case of failed write, etc. So extension of
i_disksize is what actually matters. I was speaking in the past about
i_size because I thought i_size == i_disksize for the cases we care for but
as you properly pointed out that isn't necessarily the case.

> > ...
> > iomap_dio_rw(...,
> > extend ? ext4_dio_extend_write_end_io : ext4_dio_write_end_io);
> >
> > and ext4_dio_write_end_io() will just take care of conversion of unwritten
> > extents on successful IO completion, while ext4_dio_extend_write_end_io()
> > will take care of all the complex stuff with orphan handling, extension
> > of inode size, and truncation of blocks beyond EOF - and it can do that
> > because it is guaranteed to run under the protection of i_rwsem held in
> > ext4_dio_write_iter().
> >
> > Alternatively, we could also just pass NULL instead of
> > ext4_dio_extend_write_end_io() and just do all the work explicitely in
> > ext4_dio_write_iter() in the 'extend' case. That might be actually the most
> > transparent option...
>
> Well, with the changes to ext4_handle_inode_extension() conditions that you
> recommended in patch 2/6, then I can't see why we'd need two separate
> ->end_io() handlers as we'd just abort early if we're not extending?

The problem is that the condition I've suggested for patch 2/6 will be
actually racy if we use i_disksize for comparison. Consider the following
situation:

CPU1 CPU2
fd1 = open("file"); fd2 = open("file", O_DIRECT);
/* Delalloc write */
pwrite(fd1, buf, 4096, 16384);
/* O_DIRECT write */
pwrite(fd2, buf, 4096, 4096)
i_disksize == 0 so we have to add
inode to orphan list
submit DIO
writeback happens, i_disksize extended
to 20480.
DIO completes ->
ext4_dio_write_end_io() - sees
big i_disksize so does not
touch orphan list and inode is
wrongly left there.


And we cannot remove inode unconditionally from the orphan list in
ext4_dio_write_end_io() as when DIO starts already below i_disksize,
ext4_dio_write_end_io() may get called without i_rwsem protection.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR