This patch series ports the ext4 direct IO paths to make use of the
iomap infrastructure. The legacy buffer_head based direct IO paths
have subsequently been removed as they are no longer in use. The
result of this change is that the direct IO implementation is much
cleaner and keeps the code isolated from the buffer_head internals. In
addition, a slight performance boost can be expected when using
O_SYNC | O_DIRECT IO.
The changes have been tested using xfstests in both DAX and non-DAX
modes across various filesystem configurations, e.g. 4k block size
and dioread_nolock.
Changes since v1:
- The error handling code within ext4_dio_write_end_io() has been
moved out into a common helper ext4_handle_failed_inode_extension(),
which is now used by both DAX and direct IO paths.
- Simplified the conditional statement in ext4_iomap_begin() that
determined which flags needed to be passed to ext4_map_blocks().
- Split up patch 4/5 by taking out the hunk of code in
ext4_iomap_begin() that determined the flag that should be assigned
to iomap->type.
- Introduced comments for snippets of code that are not immediately
obvious and applied other minor cleanups based on the feedback that
was received.
In the previous patch series, there was a relatively lengthy
discussion around the merging of unwritten extents. The buffer_head
direct IO implementation made use of inode state and end_io flags to
track whether an unwritten extent was eligible to be merged. We don't
make use of these flags in the new iomap based direct IO
implementation, which effectively renders the extent merging checks
that rely on these flags redundant. However, even if additional
merges and splits are performed, this isn't deemed problematic, as
the filesystem now accommodates unexpected extent splits. The only
real concern is the performance potentially wasted on unnecessary
merge and split operations under specific workloads. The full
discussion that goes through these details starts here:
https://www.spinics.net/lists/linux-ext4/msg67173.html.
Thank you to all who took the time to review and provide constructive
feedback on the previous patch series; it is highly appreciated.
Matthew Bobrowski (6):
ext4: introduce direct IO read path using iomap infrastructure
ext4: move inode extension/truncate code out from ext4_iomap_end()
iomap: modify ->end_io() calling convention
ext4: reorder map.m_flags checks in ext4_iomap_begin()
ext4: introduce direct IO write path using iomap infrastructure
ext4: cleanup legacy buffer_head direct IO code
fs/ext4/ext4.h | 3 -
fs/ext4/extents.c | 11 +-
fs/ext4/file.c | 363 +++++++++++++++++++++++------
fs/ext4/inode.c | 516 +++++-------------------------------------
fs/iomap/direct-io.c | 9 +-
fs/xfs/xfs_file.c | 8 +-
include/linux/iomap.h | 4 +-
7 files changed, 365 insertions(+), 549 deletions(-)
--
2.20.1
This patch introduces a new direct IO read path that makes use of the
iomap infrastructure.
The new function ext4_dio_read_iter() is responsible for calling into
the iomap infrastructure via iomap_dio_rw(). If the inode in question
does not pass preliminary checks in ext4_dio_checks(), then we simply
fallback to buffered IO and take that path to fulfil the request. It's
imperative that we drop the IOCB_DIRECT flag from iocb->ki_flags in
order to prevent generic_file_read_iter() from trying to take the
direct IO code path again.
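For reference, below is a minimal userspace sketch (not part of this
patch; the file path and 4k alignment are assumptions) that exercises
the new read path on an ext4 mount:

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	ssize_t ret;
	/* hypothetical file residing on an ext4 filesystem */
	int fd = open("/mnt/ext4/testfile", O_RDONLY | O_DIRECT);

	if (fd < 0)
		return 1;

	/* O_DIRECT requires an aligned buffer, offset and length */
	if (posix_memalign(&buf, 4096, 4096))
		return 1;

	/* this read is serviced by ext4_dio_read_iter() -> iomap_dio_rw() */
	ret = pread(fd, buf, 4096, 0);

	free(buf);
	close(fd);
	return ret < 0 ? 1 : 0;
}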
Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/file.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++----
1 file changed, 54 insertions(+), 4 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 70b0438dbc94..e52e3928dc25 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -34,6 +34,53 @@
#include "xattr.h"
#include "acl.h"
+static bool ext4_dio_checks(struct inode *inode)
+{
+#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
+ if (IS_ENCRYPTED(inode))
+ return false;
+#endif
+ if (ext4_should_journal_data(inode))
+ return false;
+ if (ext4_has_inline_data(inode))
+ return false;
+ return true;
+}
+
+static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+ ssize_t ret;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ /*
+ * Get exclusion from truncate and other inode operations.
+ */
+ if (!inode_trylock_shared(inode)) {
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return -EAGAIN;
+ inode_lock_shared(inode);
+ }
+
+ if (!ext4_dio_checks(inode)) {
+ inode_unlock_shared(inode);
+ /*
+ * Fallback to buffered IO if the operation being
+ * performed on the inode is not supported by direct
+ * IO. The IOCB_DIRECT flag needs to be cleared
+ * here to ensure that the direct IO code path within
+ * generic_file_read_iter() is not taken again.
+ */
+ iocb->ki_flags &= ~IOCB_DIRECT;
+ return generic_file_read_iter(iocb, to);
+ }
+
+ ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL);
+ inode_unlock_shared(inode);
+
+ file_accessed(iocb->ki_filp);
+ return ret;
+}
+
#ifdef CONFIG_FS_DAX
static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
@@ -64,16 +111,19 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
- if (unlikely(ext4_forced_shutdown(EXT4_SB(file_inode(iocb->ki_filp)->i_sb))))
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
return -EIO;
if (!iov_iter_count(to))
return 0; /* skip atime */
-#ifdef CONFIG_FS_DAX
- if (IS_DAX(file_inode(iocb->ki_filp)))
+ if (IS_DAX(inode))
return ext4_dax_read_iter(iocb, to);
-#endif
+
+ if (iocb->ki_flags & IOCB_DIRECT)
+ return ext4_dio_read_iter(iocb, to);
return generic_file_read_iter(iocb, to);
}
--
2.20.1
In preparation for implementing the iomap direct IO write path
modifications, the inode extension/truncate code needs to be moved out
from ext4_iomap_end(). For direct IO, if the current code remained
within ext4_iomap_end() it would behave incorrectly. If we update the
inode size prior to converting unwritten extents we run the risk of
allowing a racing direct IO read operation to find unwritten extents
before they are converted.
The inode extension/truncate code has been moved out into a new helper
ext4_handle_inode_extension(). This helper has been designed so that
it can be used by both DAX and direct IO paths in the instance that
the result of the write is extending the inode.
Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/file.c | 93 ++++++++++++++++++++++++++++++++++++++++++++++++-
fs/ext4/inode.c | 48 +------------------------
2 files changed, 93 insertions(+), 48 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index e52e3928dc25..8e586198f6e6 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -33,6 +33,7 @@
#include "ext4_jbd2.h"
#include "xattr.h"
#include "acl.h"
+#include "truncate.h"
static bool ext4_dio_checks(struct inode *inode)
{
@@ -233,12 +234,91 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
return iov_iter_count(from);
}
+static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
+ ssize_t len, size_t count)
+{
+ handle_t *handle;
+ bool truncate = false;
+ ext4_lblk_t written_blk, end_blk;
+ int ret = 0, blkbits = inode->i_blkbits;
+
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto orphan_del;
+ }
+
+ if (ext4_update_inode_size(inode, offset + len))
+ ext4_mark_inode_dirty(handle, inode);
+
+ /*
+ * We may need to truncate allocated but not written blocks
+ * beyond EOF.
+ */
+ written_blk = ALIGN(offset + len, 1 << blkbits);
+ end_blk = ALIGN(offset + len + count, 1 << blkbits);
+ if (written_blk < end_blk && ext4_can_truncate(inode))
+ truncate = true;
+
+ /*
+ * Remove the inode from the orphan list if it has been
+ * extended and everything went OK.
+ */
+ if (!truncate && inode->i_nlink)
+ ext4_orphan_del(handle, inode);
+ ext4_journal_stop(handle);
+
+ if (truncate) {
+ ext4_truncate_failed_write(inode);
+orphan_del:
+ /*
+ * If the truncate operation failed early the inode
+ * may still be on the orphan list. In that case, we
+ * need to try to remove the inode from the in-memory
+ * orphan list.
+ */
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+ }
+ return ret;
+}
+
+/*
+ * The inode may have been placed onto the orphan list or has had
+ * blocks allocated beyond EOF as a result of an extension. We need to
+ * ensure that any necessary cleanup routines are performed if the
+ * error path has been taken for a write.
+ */
+static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
+{
+ int ret = 0;
+ handle_t *handle;
+
+ if (size > i_size_read(inode))
+ ext4_truncate_failed_write(inode);
+
+ if (!list_empty(&EXT4_I(inode)->i_orphan)) {
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+ return PTR_ERR(handle);
+ }
+ if (inode->i_nlink)
+ ext4_orphan_del(handle, inode);
+ ext4_journal_stop(handle);
+ }
+ return ret;
+}
+
#ifdef CONFIG_FS_DAX
static ssize_t
ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
- struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;
+ int error = 0;
+ loff_t offset;
+ struct inode *inode = file_inode(iocb->ki_filp);
if (!inode_trylock(inode)) {
if (iocb->ki_flags & IOCB_NOWAIT)
@@ -255,7 +335,18 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (ret)
goto out;
+ offset = iocb->ki_pos;
ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
+ if (ret > 0 && iocb->ki_pos > i_size_read(inode))
+ error = ext4_handle_inode_extension(inode, offset, ret,
+ iov_iter_count(from));
+
+ if (ret < 0)
+ error = ext4_handle_failed_inode_extension(inode,
+ iocb->ki_pos);
+
+ if (error)
+ ret = error;
out:
inode_unlock(inode);
if (ret > 0)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 420fe3deed39..761ce6286b05 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3601,53 +3601,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ssize_t written, unsigned flags, struct iomap *iomap)
{
- int ret = 0;
- handle_t *handle;
- int blkbits = inode->i_blkbits;
- bool truncate = false;
-
- if (!(flags & IOMAP_WRITE) || (flags & IOMAP_FAULT))
- return 0;
-
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto orphan_del;
- }
- if (ext4_update_inode_size(inode, offset + written))
- ext4_mark_inode_dirty(handle, inode);
- /*
- * We may need to truncate allocated but not written blocks beyond EOF.
- */
- if (iomap->offset + iomap->length >
- ALIGN(inode->i_size, 1 << blkbits)) {
- ext4_lblk_t written_blk, end_blk;
-
- written_blk = (offset + written) >> blkbits;
- end_blk = (offset + length) >> blkbits;
- if (written_blk < end_blk && ext4_can_truncate(inode))
- truncate = true;
- }
- /*
- * Remove inode from orphan list if we were extending a inode and
- * everything went fine.
- */
- if (!truncate && inode->i_nlink &&
- !list_empty(&EXT4_I(inode)->i_orphan))
- ext4_orphan_del(handle, inode);
- ext4_journal_stop(handle);
- if (truncate) {
- ext4_truncate_failed_write(inode);
-orphan_del:
- /*
- * If truncate failed early the inode might still be on the
- * orphan list; we need to make sure the inode is removed from
- * the orphan list in that case.
- */
- if (inode->i_nlink)
- ext4_orphan_del(NULL, inode);
- }
- return ret;
+ return 0;
}
const struct iomap_ops ext4_iomap_ops = {
--
2.20.1
For iomap direct IO write code path changes, we need to accommodate
for the case where the block mapping flags passed to ext4_map_blocks()
will result in m_flags having both EXT4_MAP_MAPPED and
EXT4_MAP_UNWRITTEN bits set. In order for the allocated unwritten
extents to be converted properly in the end_io handler, iomap->type
must be set to IOMAP_UNWRITTEN, so we need to reshuffle the
conditional statement in order to achieve this.
This change is a no-op for DAX code path as the block mapping flag
passed to ext4_map_blocks() when IS_DAX(inode) never results in
EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN being set at once.
Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/inode.c | 17 ++++++++++++++---
1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 761ce6286b05..efb184928e51 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3581,10 +3581,21 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
iomap->addr = IOMAP_NULL_ADDR;
} else {
- if (map.m_flags & EXT4_MAP_MAPPED) {
- iomap->type = IOMAP_MAPPED;
- } else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
+ /*
+ * Flags passed to ext4_map_blocks() for direct IO
+ * writes can result in m_flags having both
+ * EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits set. In
+ * order for allocated unwritten extents to be
+ * converted to written extents in the end_io handler
+ * correctly, we need to ensure that the iomap->type
+ * is also set appropriately in that case. Thus, we
+ * need to check whether EXT4_MAP_UNWRITTEN is set
+ * first.
+ */
+ if (map.m_flags & EXT4_MAP_UNWRITTEN) {
iomap->type = IOMAP_UNWRITTEN;
+ } else if (map.m_flags & EXT4_MAP_MAPPED) {
+ iomap->type = IOMAP_MAPPED;
} else {
WARN_ON_ONCE(1);
return -EIO;
--
2.20.1
This patch modifies the calling convention for the iomap ->end_io()
callback. Rather than passing either dio->error or dio->size as the
'size' argument, we instead pass both dio->error and dio->size values
separately.
In the instance that an error occurred during a write, we currently
cannot determine whether any blocks have been allocated beyond the
current EOF and data has subsequently been written to these blocks
within the ->end_io() callback. As a result, we cannot judge whether
we should take the truncate failed write path. Having both dio->error
and dio->size will allow us to perform such checks within this
callback.
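To make the new convention concrete, here is a minimal, illustrative
sketch of an ->end_io() handler (a hypothetical example, not taken
from this series; assumes <linux/fs.h> and <linux/iomap.h>). The real
handlers are in the ext4 and XFS hunks:

static int example_dio_write_end_io(struct kiocb *iocb, ssize_t size,
				    int error, unsigned int flags)
{
	/* The write failed outright; nothing was transferred. */
	if (error)
		return error;

	/* Nothing was written, so no post-processing is required. */
	if (!size)
		return 0;

	/*
	 * The range [iocb->ki_pos, iocb->ki_pos + size) covers unwritten
	 * extents, so the filesystem would convert them to written here.
	 */
	if (flags & IOMAP_DIO_UNWRITTEN) {
		/* filesystem specific unwritten extent conversion */
	}

	return 0;
}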
Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/iomap/direct-io.c | 9 +++------
fs/xfs/xfs_file.c | 8 +++++---
include/linux/iomap.h | 4 ++--
3 files changed, 10 insertions(+), 11 deletions(-)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 10517cea9682..2ccf1c6460d4 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -77,13 +77,10 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
loff_t offset = iocb->ki_pos;
ssize_t ret;
- if (dio->end_io) {
- ret = dio->end_io(iocb,
- dio->error ? dio->error : dio->size,
- dio->flags);
- } else {
+ if (dio->end_io)
+ ret = dio->end_io(iocb, dio->size, dio->error, dio->flags);
+ else
ret = dio->error;
- }
if (likely(!ret)) {
ret = dio->size;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 28101bbc0b78..d49759008c54 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -369,21 +369,23 @@ static int
xfs_dio_write_end_io(
struct kiocb *iocb,
ssize_t size,
+ int error,
unsigned flags)
{
struct inode *inode = file_inode(iocb->ki_filp);
struct xfs_inode *ip = XFS_I(inode);
loff_t offset = iocb->ki_pos;
unsigned int nofs_flag;
- int error = 0;
trace_xfs_end_io_direct_write(ip, offset, size);
if (XFS_FORCED_SHUTDOWN(ip->i_mount))
return -EIO;
- if (size <= 0)
- return size;
+ if (error)
+ return error;
+ if (!size)
+ return 0;
/*
* Capture amount written on completion as we can't reliably account
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index bc499ceae392..d983cdcf2e72 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -188,8 +188,8 @@ sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
*/
#define IOMAP_DIO_UNWRITTEN (1 << 0) /* covers unwritten extent(s) */
#define IOMAP_DIO_COW (1 << 1) /* covers COW extent(s) */
-typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t ret,
- unsigned flags);
+typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t size,
+ int error, unsigned int flags);
ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops, iomap_dio_end_io_t end_io);
int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
--
2.20.1
This patch introduces a new direct IO write code path implementation
that makes use of the iomap infrastructure.
All direct IO write operations are now passed from the ->write_iter()
callback to the new function ext4_dio_write_iter(). This function is
responsible for calling into iomap infrastructure via
iomap_dio_rw(). Snippets of the direct IO code from within
ext4_file_write_iter(), such as checking whether the IO request is
unaligned asynchronous IO, or whether it will be overwriting
allocated and initialized blocks, have been moved out into
ext4_dio_write_iter().
The block mapping flags that are passed to ext4_map_blocks() from
within ext4_dio_get_block() and friends have effectively been taken
out and introduced within ext4_iomap_begin(). If ext4_map_blocks()
happens to have instantiated blocks beyond i_size, then we attempt to
place the inode onto the orphan list. Despite being able to perform
i_size extension checking earlier on in the direct IO code path, it
makes the most sense to perform this step after a successful block
allocation.
The ->end_io() callback ext4_dio_write_end_io() is responsible for
removing the inode from the orphan list and determining if we should
truncate a failed write in the case of an error. We also convert a
range of unwritten extents to written if IOMAP_DIO_UNWRITTEN is set
and perform the necessary i_size/i_disksize extension if
iocb->ki_pos + dio->size > i_size_read(inode).
In the instance of a short write, we fall back to buffered IO and
complete whatever is left in the 'iter'. Any blocks that may have been
allocated in preparation for direct IO will be reused by buffered IO,
so there's no issue with leaving allocated blocks beyond EOF.
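For reference, a minimal userspace sketch (not part of this patch; the
file path and 4k alignment are assumptions) of an extending
O_SYNC | O_DIRECT write that now travels through ext4_dio_write_iter()
and ext4_dio_write_end_io():

#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
	void *buf;
	ssize_t ret;
	/* hypothetical file residing on an ext4 filesystem */
	int fd = open("/mnt/ext4/dio-write", O_WRONLY | O_CREAT | O_TRUNC |
		      O_DIRECT | O_SYNC, 0644);

	if (fd < 0)
		return 1;

	if (posix_memalign(&buf, 4096, 4096))
		return 1;
	memset(buf, 0xab, 4096);

	/* extending write: the inode size is updated in ->end_io() */
	ret = pwrite(fd, buf, 4096, 0);

	free(buf);
	close(fd);
	return ret == 4096 ? 0 : 1;
}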
Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/file.c | 219 +++++++++++++++++++++++++++++++++---------------
fs/ext4/inode.c | 57 ++++++++++---
2 files changed, 198 insertions(+), 78 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 8e586198f6e6..bf22425a6a6f 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -29,6 +29,7 @@
#include <linux/pagevec.h>
#include <linux/uio.h>
#include <linux/mman.h>
+#include <linux/backing-dev.h>
#include "ext4.h"
#include "ext4_jbd2.h"
#include "xattr.h"
@@ -217,6 +218,14 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
if (ret <= 0)
return ret;
+ ret = file_remove_privs(iocb->ki_filp);
+ if (ret)
+ return 0;
+
+ ret = file_update_time(iocb->ki_filp);
+ if (ret)
+ return 0;
+
if (unlikely(IS_IMMUTABLE(inode)))
return -EPERM;
@@ -234,6 +243,34 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
return iov_iter_count(from);
}
+static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
+ struct iov_iter *from)
+{
+ ssize_t ret;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return -EOPNOTSUPP;
+
+ if (!inode_trylock(inode))
+ inode_lock(inode);
+
+ ret = ext4_write_checks(iocb, from);
+ if (ret <= 0)
+ goto out;
+
+ current->backing_dev_info = inode_to_bdi(inode);
+ ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
+ current->backing_dev_info = NULL;
+out:
+ inode_unlock(inode);
+ if (likely(ret > 0)) {
+ iocb->ki_pos += ret;
+ ret = generic_write_sync(iocb, ret);
+ }
+ return ret;
+}
+
static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
ssize_t len, size_t count)
{
@@ -311,6 +348,118 @@ static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
return ret;
}
+/*
+ * For a write that extends the inode size, ext4_dio_write_iter() will
+ * wait for the write to complete. Consequently, operations performed
+ * within this function are still covered by the inode_lock().
+ */
+static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
+ unsigned int flags)
+{
+ int ret = 0;
+ loff_t offset = iocb->ki_pos;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (error) {
+ ret = ext4_handle_failed_inode_extension(inode, offset + size);
+ return ret ? ret : error;
+ }
+
+ if (flags & IOMAP_DIO_UNWRITTEN) {
+ ret = ext4_convert_unwritten_extents(NULL, inode,
+ offset, size);
+ if (ret)
+ return ret;
+ }
+
+ if (offset + size > i_size_read(inode)) {
+ ret = ext4_handle_inode_extension(inode, offset, size, 0);
+ if (ret)
+ return ret;
+ }
+ return ret;
+}
+
+static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+ ssize_t ret;
+ loff_t offset = iocb->ki_pos;
+ size_t count = iov_iter_count(from);
+ struct inode *inode = file_inode(iocb->ki_filp);
+ bool extend = false, overwrite = false, unaligned_aio = false;
+
+ if (!inode_trylock(inode)) {
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return -EAGAIN;
+ inode_lock(inode);
+ }
+
+ if (!ext4_dio_checks(inode)) {
+ inode_unlock(inode);
+ /*
+ * Fallback to buffered IO if the operation on the
+ * inode is not supported by direct IO.
+ */
+ return ext4_buffered_write_iter(iocb, from);
+ }
+
+ ret = ext4_write_checks(iocb, from);
+ if (ret <= 0) {
+ inode_unlock(inode);
+ return ret;
+ }
+
+ /*
+ * Unaligned direct AIO must be serialized among each other as
+ * the zeroing of partial blocks of two competing unaligned
+ * AIOs can result in data corruption.
+ */
+ if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
+ !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
+ unaligned_aio = true;
+ inode_dio_wait(inode);
+ }
+
+ /*
+ * Determine whether the IO operation will overwrite allocated
+ * and initialized blocks. If so, check to see whether it is
+ * possible to take the dioread_nolock path.
+ */
+ if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
+ ext4_should_dioread_nolock(inode)) {
+ overwrite = true;
+ downgrade_write(&inode->i_rwsem);
+ }
+
+ if (offset + count > i_size_read(inode) ||
+ offset + count > EXT4_I(inode)->i_disksize) {
+ ext4_update_i_disksize(inode, inode->i_size);
+ extend = true;
+ }
+
+ ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);
+
+ /*
+ * Unaligned direct AIO must be the only IO in flight or else
+ * any overlapping aligned IO after unaligned IO might result
+ * in data corruption. We also need to wait here in the case
+ * where the inode is being extended so that inode extension
+ * routines in ext4_dio_write_end_io() are covered by the
+ * inode_lock().
+ */
+ if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
+ inode_dio_wait(inode);
+
+ if (overwrite)
+ inode_unlock_shared(inode);
+ else
+ inode_unlock(inode);
+
+ if (ret >= 0 && iov_iter_count(from))
+ return ext4_buffered_write_iter(iocb, from);
+ return ret;
+}
+
#ifdef CONFIG_FS_DAX
static ssize_t
ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
@@ -325,15 +474,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
return -EAGAIN;
inode_lock(inode);
}
+
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
- ret = file_remove_privs(iocb->ki_filp);
- if (ret)
- goto out;
- ret = file_update_time(iocb->ki_filp);
- if (ret)
- goto out;
offset = iocb->ki_pos;
ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
@@ -359,73 +503,16 @@ static ssize_t
ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct inode *inode = file_inode(iocb->ki_filp);
- int o_direct = iocb->ki_flags & IOCB_DIRECT;
- int unaligned_aio = 0;
- int overwrite = 0;
- ssize_t ret;
if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
return -EIO;
-#ifdef CONFIG_FS_DAX
if (IS_DAX(inode))
return ext4_dax_write_iter(iocb, from);
-#endif
- if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
- return -EOPNOTSUPP;
- if (!inode_trylock(inode)) {
- if (iocb->ki_flags & IOCB_NOWAIT)
- return -EAGAIN;
- inode_lock(inode);
- }
-
- ret = ext4_write_checks(iocb, from);
- if (ret <= 0)
- goto out;
-
- /*
- * Unaligned direct AIO must be serialized among each other as zeroing
- * of partial blocks of two competing unaligned AIOs can result in data
- * corruption.
- */
- if (o_direct && ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
- !is_sync_kiocb(iocb) &&
- ext4_unaligned_aio(inode, from, iocb->ki_pos)) {
- unaligned_aio = 1;
- ext4_unwritten_wait(inode);
- }
-
- iocb->private = &overwrite;
- /* Check whether we do a DIO overwrite or not */
- if (o_direct && !unaligned_aio) {
- if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
- if (ext4_should_dioread_nolock(inode))
- overwrite = 1;
- } else if (iocb->ki_flags & IOCB_NOWAIT) {
- ret = -EAGAIN;
- goto out;
- }
- }
-
- ret = __generic_file_write_iter(iocb, from);
- /*
- * Unaligned direct AIO must be the only IO in flight. Otherwise
- * overlapping aligned IO after unaligned might result in data
- * corruption.
- */
- if (ret == -EIOCBQUEUED && unaligned_aio)
- ext4_unwritten_wait(inode);
- inode_unlock(inode);
-
- if (ret > 0)
- ret = generic_write_sync(iocb, ret);
-
- return ret;
-
-out:
- inode_unlock(inode);
- return ret;
+ if (iocb->ki_flags & IOCB_DIRECT)
+ return ext4_dio_write_iter(iocb, from);
+ return ext4_buffered_write_iter(iocb, from);
}
#ifdef CONFIG_FS_DAX
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index efb184928e51..f52ad3065236 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3513,11 +3513,13 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
}
}
} else if (flags & IOMAP_WRITE) {
- int dio_credits;
handle_t *handle;
- int retries = 0;
+ int dio_credits, retries = 0, m_flags = 0;
- /* Trim mapping request to maximum we can map at once for DIO */
+ /*
+ * Trim mapping request to maximum we can map at once
+ * for DIO.
+ */
if (map.m_len > DIO_MAX_BLOCKS)
map.m_len = DIO_MAX_BLOCKS;
dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
@@ -3533,8 +3535,30 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
if (IS_ERR(handle))
return PTR_ERR(handle);
- ret = ext4_map_blocks(handle, inode, &map,
- EXT4_GET_BLOCKS_CREATE_ZERO);
+ /*
+ * DAX and direct IO are the only two operations that
+ * are currently supported with IOMAP_WRITE.
+ */
+ WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
+ if (IS_DAX(inode))
+ m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
+ else if (round_down(offset, i_blocksize(inode)) >=
+ i_size_read(inode))
+ m_flags = EXT4_GET_BLOCKS_CREATE;
+ else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+ m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
+
+ ret = ext4_map_blocks(handle, inode, &map, m_flags);
+
+ /*
+ * We cannot fill holes in indirect tree based inodes
+ * as that could expose stale data in the case of a
+ * crash. Use the magic error code to fallback to
+ * buffered IO.
+ */
+ if (!m_flags && !ret)
+ ret = -ENOTBLK;
+
if (ret < 0) {
ext4_journal_stop(handle);
if (ret == -ENOSPC &&
@@ -3544,13 +3568,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
}
/*
- * If we added blocks beyond i_size, we need to make sure they
- * will get truncated if we crash before updating i_size in
- * ext4_iomap_end(). For faults we don't need to do that (and
- * even cannot because for orphan list operations inode_lock is
- * required) - if we happen to instantiate block beyond i_size,
- * it is because we race with truncate which has already added
- * the inode to the orphan list.
+ * If we added blocks beyond i_size, we need to make
+ * sure they will get truncated if we crash before
+ * updating the i_size. For faults we don't need to do
+ * that (and even cannot because for orphan list
+ * operations inode_lock is required) - if we happen
+ * to instantiate block beyond i_size, it is because
+ * we race with truncate which has already added the
+ * inode to the orphan list.
*/
if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
(i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
@@ -3612,6 +3637,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ssize_t written, unsigned flags, struct iomap *iomap)
{
+ /*
+ * Check to see whether an error occurred while writing data
+ * out to allocated blocks. If so, return the magic error code
+ * so that we fallback to buffered IO and reuse the blocks
+ * that were allocated in preparation for the direct IO write.
+ */
+ if (flags & IOMAP_DIRECT && written == 0)
+ return -ENOTBLK;
return 0;
}
--
2.20.1
Remove the buffer_head based direct IO code that is now redundant as a
result of porting the read/write paths across to the iomap
infrastructure.
Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/ext4.h | 3 -
fs/ext4/extents.c | 11 +-
fs/ext4/file.c | 7 -
fs/ext4/inode.c | 398 +---------------------------------------------
4 files changed, 5 insertions(+), 414 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index bf660aa7a9e0..2ab91815f52d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1555,7 +1555,6 @@ enum {
EXT4_STATE_NO_EXPAND, /* No space for expansion */
EXT4_STATE_DA_ALLOC_CLOSE, /* Alloc DA blks on close */
EXT4_STATE_EXT_MIGRATE, /* Inode is migrating */
- EXT4_STATE_DIO_UNWRITTEN, /* need convert on dio done*/
EXT4_STATE_NEWENTRY, /* File just added to dir */
EXT4_STATE_MAY_INLINE_DATA, /* may have in-inode data */
EXT4_STATE_EXT_PRECACHED, /* extents have been precached */
@@ -2522,8 +2521,6 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
-int ext4_dio_get_block(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create);
int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
struct buffer_head *bh, int create);
int ext4_walk_page_buffers(handle_t *handle,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 92266a2da7d6..a869e206bd81 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1753,16 +1753,9 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
*/
if (ext1_ee_len + ext2_ee_len > EXT_INIT_MAX_LEN)
return 0;
- /*
- * The check for IO to unwritten extent is somewhat racy as we
- * increment i_unwritten / set EXT4_STATE_DIO_UNWRITTEN only after
- * dropping i_data_sem. But reserved blocks should save us in that
- * case.
- */
+
if (ext4_ext_is_unwritten(ex1) &&
- (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) ||
- atomic_read(&EXT4_I(inode)->i_unwritten) ||
- (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)))
+ (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN))
return 0;
#ifdef AGGRESSIVE_TEST
if (ext1_ee_len >= 4)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index bf22425a6a6f..d4b9a82aed6c 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -155,13 +155,6 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
return 0;
}
-static void ext4_unwritten_wait(struct inode *inode)
-{
- wait_queue_head_t *wq = ext4_ioend_wq(inode);
-
- wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_unwritten) == 0));
-}
-
/*
* This tests whether the IO in question is block-aligned or not.
* Ext4 utilizes unwritten extents when hole-filling during direct IO, and they
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f52ad3065236..a4f0749527c7 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -826,136 +826,6 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
/* Maximum number of blocks we map for direct IO at once. */
#define DIO_MAX_BLOCKS 4096
-/*
- * Get blocks function for the cases that need to start a transaction -
- * generally difference cases of direct IO and DAX IO. It also handles retries
- * in case of ENOSPC.
- */
-static int ext4_get_block_trans(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int flags)
-{
- int dio_credits;
- handle_t *handle;
- int retries = 0;
- int ret;
-
- /* Trim mapping request to maximum we can map at once for DIO */
- if (bh_result->b_size >> inode->i_blkbits > DIO_MAX_BLOCKS)
- bh_result->b_size = DIO_MAX_BLOCKS << inode->i_blkbits;
- dio_credits = ext4_chunk_trans_blocks(inode,
- bh_result->b_size >> inode->i_blkbits);
-retry:
- handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, dio_credits);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
-
- ret = _ext4_get_block(inode, iblock, bh_result, flags);
- ext4_journal_stop(handle);
-
- if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
- goto retry;
- return ret;
-}
-
-/* Get block function for DIO reads and writes to inodes without extents */
-int ext4_dio_get_block(struct inode *inode, sector_t iblock,
- struct buffer_head *bh, int create)
-{
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- if (!create)
- return _ext4_get_block(inode, iblock, bh, 0);
- return ext4_get_block_trans(inode, iblock, bh, EXT4_GET_BLOCKS_CREATE);
-}
-
-/*
- * Get block function for AIO DIO writes when we create unwritten extent if
- * blocks are not allocated yet. The extent will be converted to written
- * after IO is complete.
- */
-static int ext4_dio_get_block_unwritten_async(struct inode *inode,
- sector_t iblock, struct buffer_head *bh_result, int create)
-{
- int ret;
-
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- ret = ext4_get_block_trans(inode, iblock, bh_result,
- EXT4_GET_BLOCKS_IO_CREATE_EXT);
-
- /*
- * When doing DIO using unwritten extents, we need io_end to convert
- * unwritten extents to written on IO completion. We allocate io_end
- * once we spot unwritten extent and store it in b_private. Generic
- * DIO code keeps b_private set and furthermore passes the value to
- * our completion callback in 'private' argument.
- */
- if (!ret && buffer_unwritten(bh_result)) {
- if (!bh_result->b_private) {
- ext4_io_end_t *io_end;
-
- io_end = ext4_init_io_end(inode, GFP_KERNEL);
- if (!io_end)
- return -ENOMEM;
- bh_result->b_private = io_end;
- ext4_set_io_unwritten_flag(inode, io_end);
- }
- set_buffer_defer_completion(bh_result);
- }
-
- return ret;
-}
-
-/*
- * Get block function for non-AIO DIO writes when we create unwritten extent if
- * blocks are not allocated yet. The extent will be converted to written
- * after IO is complete by ext4_direct_IO_write().
- */
-static int ext4_dio_get_block_unwritten_sync(struct inode *inode,
- sector_t iblock, struct buffer_head *bh_result, int create)
-{
- int ret;
-
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- ret = ext4_get_block_trans(inode, iblock, bh_result,
- EXT4_GET_BLOCKS_IO_CREATE_EXT);
-
- /*
- * Mark inode as having pending DIO writes to unwritten extents.
- * ext4_direct_IO_write() checks this flag and converts extents to
- * written.
- */
- if (!ret && buffer_unwritten(bh_result))
- ext4_set_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
-
- return ret;
-}
-
-static int ext4_dio_get_block_overwrite(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create)
-{
- int ret;
-
- ext4_debug("ext4_dio_get_block_overwrite: inode %lu, create flag %d\n",
- inode->i_ino, create);
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- ret = _ext4_get_block(inode, iblock, bh_result, 0);
- /*
- * Blocks should have been preallocated! ext4_file_write_iter() checks
- * that.
- */
- WARN_ON_ONCE(!buffer_mapped(bh_result) || buffer_unwritten(bh_result));
-
- return ret;
-}
-
-
/*
* `handle' can be NULL if create is zero
*/
@@ -3653,268 +3523,6 @@ const struct iomap_ops ext4_iomap_ops = {
.iomap_end = ext4_iomap_end,
};
-static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
- ssize_t size, void *private)
-{
- ext4_io_end_t *io_end = private;
-
- /* if not async direct IO just return */
- if (!io_end)
- return 0;
-
- ext_debug("ext4_end_io_dio(): io_end 0x%p "
- "for inode %lu, iocb 0x%p, offset %llu, size %zd\n",
- io_end, io_end->inode->i_ino, iocb, offset, size);
-
- /*
- * Error during AIO DIO. We cannot convert unwritten extents as the
- * data was not written. Just clear the unwritten flag and drop io_end.
- */
- if (size <= 0) {
- ext4_clear_io_unwritten_flag(io_end);
- size = 0;
- }
- io_end->offset = offset;
- io_end->size = size;
- ext4_put_io_end(io_end);
-
- return 0;
-}
-
-/*
- * Handling of direct IO writes.
- *
- * For ext4 extent files, ext4 will do direct-io write even to holes,
- * preallocated extents, and those write extend the file, no need to
- * fall back to buffered IO.
- *
- * For holes, we fallocate those blocks, mark them as unwritten
- * If those blocks were preallocated, we mark sure they are split, but
- * still keep the range to write as unwritten.
- *
- * The unwritten extents will be converted to written when DIO is completed.
- * For async direct IO, since the IO may still pending when return, we
- * set up an end_io call back function, which will do the conversion
- * when async direct IO completed.
- *
- * If the O_DIRECT write will extend the file then add this inode to the
- * orphan list. So recovery will truncate it back to the original size
- * if the machine crashes during the write.
- *
- */
-static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
- struct ext4_inode_info *ei = EXT4_I(inode);
- ssize_t ret;
- loff_t offset = iocb->ki_pos;
- size_t count = iov_iter_count(iter);
- int overwrite = 0;
- get_block_t *get_block_func = NULL;
- int dio_flags = 0;
- loff_t final_size = offset + count;
- int orphan = 0;
- handle_t *handle;
-
- if (final_size > inode->i_size || final_size > ei->i_disksize) {
- /* Credits for sb + inode write */
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out;
- }
- ret = ext4_orphan_add(handle, inode);
- if (ret) {
- ext4_journal_stop(handle);
- goto out;
- }
- orphan = 1;
- ext4_update_i_disksize(inode, inode->i_size);
- ext4_journal_stop(handle);
- }
-
- BUG_ON(iocb->private == NULL);
-
- /*
- * Make all waiters for direct IO properly wait also for extent
- * conversion. This also disallows race between truncate() and
- * overwrite DIO as i_dio_count needs to be incremented under i_mutex.
- */
- inode_dio_begin(inode);
-
- /* If we do a overwrite dio, i_mutex locking can be released */
- overwrite = *((int *)iocb->private);
-
- if (overwrite)
- inode_unlock(inode);
-
- /*
- * For extent mapped files we could direct write to holes and fallocate.
- *
- * Allocated blocks to fill the hole are marked as unwritten to prevent
- * parallel buffered read to expose the stale data before DIO complete
- * the data IO.
- *
- * As to previously fallocated extents, ext4 get_block will just simply
- * mark the buffer mapped but still keep the extents unwritten.
- *
- * For non AIO case, we will convert those unwritten extents to written
- * after return back from blockdev_direct_IO. That way we save us from
- * allocating io_end structure and also the overhead of offloading
- * the extent convertion to a workqueue.
- *
- * For async DIO, the conversion needs to be deferred when the
- * IO is completed. The ext4 end_io callback function will be
- * called to take care of the conversion work. Here for async
- * case, we allocate an io_end structure to hook to the iocb.
- */
- iocb->private = NULL;
- if (overwrite)
- get_block_func = ext4_dio_get_block_overwrite;
- else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
- round_down(offset, i_blocksize(inode)) >= inode->i_size) {
- get_block_func = ext4_dio_get_block;
- dio_flags = DIO_LOCKING | DIO_SKIP_HOLES;
- } else if (is_sync_kiocb(iocb)) {
- get_block_func = ext4_dio_get_block_unwritten_sync;
- dio_flags = DIO_LOCKING;
- } else {
- get_block_func = ext4_dio_get_block_unwritten_async;
- dio_flags = DIO_LOCKING;
- }
- ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, iter,
- get_block_func, ext4_end_io_dio, NULL,
- dio_flags);
-
- if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
- EXT4_STATE_DIO_UNWRITTEN)) {
- int err;
- /*
- * for non AIO case, since the IO is already
- * completed, we could do the conversion right here
- */
- err = ext4_convert_unwritten_extents(NULL, inode,
- offset, ret);
- if (err < 0)
- ret = err;
- ext4_clear_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
- }
-
- inode_dio_end(inode);
- /* take i_mutex locking again if we do a ovewrite dio */
- if (overwrite)
- inode_lock(inode);
-
- if (ret < 0 && final_size > inode->i_size)
- ext4_truncate_failed_write(inode);
-
- /* Handle extending of i_size after direct IO write */
- if (orphan) {
- int err;
-
- /* Credits for sb + inode write */
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- /*
- * We wrote the data but cannot extend
- * i_size. Bail out. In async io case, we do
- * not return error here because we have
- * already submmitted the corresponding
- * bio. Returning error here makes the caller
- * think that this IO is done and failed
- * resulting in race with bio's completion
- * handler.
- */
- if (!ret)
- ret = PTR_ERR(handle);
- if (inode->i_nlink)
- ext4_orphan_del(NULL, inode);
-
- goto out;
- }
- if (inode->i_nlink)
- ext4_orphan_del(handle, inode);
- if (ret > 0) {
- loff_t end = offset + ret;
- if (end > inode->i_size || end > ei->i_disksize) {
- ext4_update_i_disksize(inode, end);
- if (end > inode->i_size)
- i_size_write(inode, end);
- /*
- * We're going to return a positive `ret'
- * here due to non-zero-length I/O, so there's
- * no way of reporting error returns from
- * ext4_mark_inode_dirty() to userspace. So
- * ignore it.
- */
- ext4_mark_inode_dirty(handle, inode);
- }
- }
- err = ext4_journal_stop(handle);
- if (ret == 0)
- ret = err;
- }
-out:
- return ret;
-}
-
-static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct address_space *mapping = iocb->ki_filp->f_mapping;
- struct inode *inode = mapping->host;
- size_t count = iov_iter_count(iter);
- ssize_t ret;
-
- /*
- * Shared inode_lock is enough for us - it protects against concurrent
- * writes & truncates and since we take care of writing back page cache,
- * we are protected against page writeback as well.
- */
- inode_lock_shared(inode);
- ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,
- iocb->ki_pos + count - 1);
- if (ret)
- goto out_unlock;
- ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
- iter, ext4_dio_get_block, NULL, NULL, 0);
-out_unlock:
- inode_unlock_shared(inode);
- return ret;
-}
-
-static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
- size_t count = iov_iter_count(iter);
- loff_t offset = iocb->ki_pos;
- ssize_t ret;
-
-#ifdef CONFIG_FS_ENCRYPTION
- if (IS_ENCRYPTED(inode) && S_ISREG(inode->i_mode))
- return 0;
-#endif
-
- /*
- * If we are doing data journalling we don't support O_DIRECT
- */
- if (ext4_should_journal_data(inode))
- return 0;
-
- /* Let buffer I/O handle the inline data case. */
- if (ext4_has_inline_data(inode))
- return 0;
-
- trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
- if (iov_iter_rw(iter) == READ)
- ret = ext4_direct_IO_read(iocb, iter);
- else
- ret = ext4_direct_IO_write(iocb, iter);
- trace_ext4_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), ret);
- return ret;
-}
-
/*
* Pages can be marked dirty completely asynchronously from ext4's journalling
* activity. By filemap_sync_pte(), try_to_unmap_one(), etc. We cannot do
@@ -3952,7 +3560,7 @@ static const struct address_space_operations ext4_aops = {
.bmap = ext4_bmap,
.invalidatepage = ext4_invalidatepage,
.releasepage = ext4_releasepage,
- .direct_IO = ext4_direct_IO,
+ .direct_IO = noop_direct_IO,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
@@ -3969,7 +3577,7 @@ static const struct address_space_operations ext4_journalled_aops = {
.bmap = ext4_bmap,
.invalidatepage = ext4_journalled_invalidatepage,
.releasepage = ext4_releasepage,
- .direct_IO = ext4_direct_IO,
+ .direct_IO = noop_direct_IO,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
};
@@ -3985,7 +3593,7 @@ static const struct address_space_operations ext4_da_aops = {
.bmap = ext4_bmap,
.invalidatepage = ext4_da_invalidatepage,
.releasepage = ext4_releasepage,
- .direct_IO = ext4_direct_IO,
+ .direct_IO = noop_direct_IO,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
--
2.20.1
+ CC: [email protected]
On Mon, Sep 09, 2019 at 09:19:31AM +1000, Matthew Bobrowski wrote:
> This patch modifies the calling convention for the iomap ->end_io()
> callback. Rather than passing either dio->error or dio->size as the
> 'size' argument, we instead pass both dio->error and dio->size values
> separately.
>
> In the instance that an error occurred during a write, we currently
> cannot determine whether any blocks have been allocated beyond the
> current EOF and data has subsequently been written to these blocks
> within the ->end_io() callback. As a result, we cannot judge whether
> we should take the truncate failed write path. Having both dio->error
> and dio->size will allow us to perform such checks within this
> callback.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
Please note, this patch is now part of the recent end_io improvements
posted by Christoph, and can be found here:
https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?h=iomap-5.4-merge&id=db10bd824bc0353702231f1294b58903cb66bac7.
> ---
> fs/iomap/direct-io.c | 9 +++------
> fs/xfs/xfs_file.c | 8 +++++---
> include/linux/iomap.h | 4 ++--
> 3 files changed, 10 insertions(+), 11 deletions(-)
>
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 10517cea9682..2ccf1c6460d4 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -77,13 +77,10 @@ static ssize_t iomap_dio_complete(struct iomap_dio *dio)
> loff_t offset = iocb->ki_pos;
> ssize_t ret;
>
> - if (dio->end_io) {
> - ret = dio->end_io(iocb,
> - dio->error ? dio->error : dio->size,
> - dio->flags);
> - } else {
> + if (dio->end_io)
> + ret = dio->end_io(iocb, dio->size, dio->error, dio->flags);
> + else
> ret = dio->error;
> - }
>
> if (likely(!ret)) {
> ret = dio->size;
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 28101bbc0b78..d49759008c54 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -369,21 +369,23 @@ static int
> xfs_dio_write_end_io(
> struct kiocb *iocb,
> ssize_t size,
> + int error,
> unsigned flags)
> {
> struct inode *inode = file_inode(iocb->ki_filp);
> struct xfs_inode *ip = XFS_I(inode);
> loff_t offset = iocb->ki_pos;
> unsigned int nofs_flag;
> - int error = 0;
>
> trace_xfs_end_io_direct_write(ip, offset, size);
>
> if (XFS_FORCED_SHUTDOWN(ip->i_mount))
> return -EIO;
>
> - if (size <= 0)
> - return size;
> + if (error)
> + return error;
> + if (!size)
> + return 0;
>
> /*
> * Capture amount written on completion as we can't reliably account
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index bc499ceae392..d983cdcf2e72 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -188,8 +188,8 @@ sector_t iomap_bmap(struct address_space *mapping, sector_t bno,
> */
> #define IOMAP_DIO_UNWRITTEN (1 << 0) /* covers unwritten extent(s) */
> #define IOMAP_DIO_COW (1 << 1) /* covers COW extent(s) */
> -typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t ret,
> - unsigned flags);
> +typedef int (iomap_dio_end_io_t)(struct kiocb *iocb, ssize_t size,
> + int error, unsigned int flags);
> ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> const struct iomap_ops *ops, iomap_dio_end_io_t end_io);
> int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
> --
> 2.20.1
>
On 9/9/19 4:48 AM, Matthew Bobrowski wrote:
> This patch introduces a new direct IO read path that makes use of the
> iomap infrastructure.
>
> The new function ext4_dio_read_iter() is responsible for calling into
> the iomap infrastructure via iomap_dio_rw(). If the inode in question
> does not pass preliminary checks in ext4_dio_checks(), then we simply
> fallback to buffered IO and take that path to fulfil the request. It's
> imperative that we drop the IOCB_DIRECT flag from iocb->ki_flags in
> order to prevent generic_file_read_iter() from trying to take the
> direct IO code path again.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
Looks good to me.
Reviewed-by: Ritesh Harjani <[email protected]>
> ---
> fs/ext4/file.c | 58 ++++++++++++++++++++++++++++++++++++++++++++++----
> 1 file changed, 54 insertions(+), 4 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 70b0438dbc94..e52e3928dc25 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -34,6 +34,53 @@
> #include "xattr.h"
> #include "acl.h"
>
> +static bool ext4_dio_checks(struct inode *inode)
> +{
> +#if IS_ENABLED(CONFIG_FS_ENCRYPTION)
> + if (IS_ENCRYPTED(inode))
> + return false;
> +#endif
> + if (ext4_should_journal_data(inode))
> + return false;
> + if (ext4_has_inline_data(inode))
> + return false;
> + return true;
> +}
> +
> +static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> + ssize_t ret;
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + /*
> + * Get exclusion from truncate and other inode operations.
> + */
> + if (!inode_trylock_shared(inode)) {
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EAGAIN;
> + inode_lock_shared(inode);
> + }
> +
> + if (!ext4_dio_checks(inode)) {
> + inode_unlock_shared(inode);
> + /*
> + * Fallback to buffered IO if the operation being
> + * performed on the inode is not supported by direct
> + * IO. The IOCB_DIRECT flag needs to be cleared
> + * here to ensure that the direct IO code path within
> + * generic_file_read_iter() is not taken again.
> + */
> + iocb->ki_flags &= ~IOCB_DIRECT;
> + return generic_file_read_iter(iocb, to);
> + }
> +
> + ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL);
> + inode_unlock_shared(inode);
> +
> + file_accessed(iocb->ki_filp);
> + return ret;
> +}
> +
> #ifdef CONFIG_FS_DAX
> static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
> {
> @@ -64,16 +111,19 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
>
> static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> {
> - if (unlikely(ext4_forced_shutdown(EXT4_SB(file_inode(iocb->ki_filp)->i_sb))))
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
> return -EIO;
>
> if (!iov_iter_count(to))
> return 0; /* skip atime */
>
> -#ifdef CONFIG_FS_DAX
> - if (IS_DAX(file_inode(iocb->ki_filp)))
> + if (IS_DAX(inode))
> return ext4_dax_read_iter(iocb, to);
> -#endif
> +
> + if (iocb->ki_flags & IOCB_DIRECT)
> + return ext4_dio_read_iter(iocb, to);
> return generic_file_read_iter(iocb, to);
> }
>
On 9/9/19 4:49 AM, Matthew Bobrowski wrote:
> In preparation for implementing the iomap direct IO write path
> modifications, the inode extension/truncate code needs to be moved out
> from ext4_iomap_end(). For direct IO, if the current code remained
> within ext4_iomap_end() it would behave incorrectly. If we update the
> inode size prior to converting unwritten extents we run the risk of
> allowing a racing direct IO read operation to find unwritten extents
> before they are converted.
>
> The inode extension/truncate code has been moved out into a new helper
> ext4_handle_inode_extension(). This helper has been designed so that
> it can be used by both DAX and direct IO paths in the instance that
> the result of the write is extending the inode.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
> ---
> fs/ext4/file.c | 93 ++++++++++++++++++++++++++++++++++++++++++++++++-
> fs/ext4/inode.c | 48 +------------------------
> 2 files changed, 93 insertions(+), 48 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index e52e3928dc25..8e586198f6e6 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -33,6 +33,7 @@
> #include "ext4_jbd2.h"
> #include "xattr.h"
> #include "acl.h"
> +#include "truncate.h"
>
> static bool ext4_dio_checks(struct inode *inode)
> {
> @@ -233,12 +234,91 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> return iov_iter_count(from);
> }
>
> +static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> + ssize_t len, size_t count)
> +{
> + handle_t *handle;
> + bool truncate = false;
> + ext4_lblk_t written_blk, end_blk;
> + int ret = 0, blkbits = inode->i_blkbits;
> +
> + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + goto orphan_del;
> + }
> +
> + if (ext4_update_inode_size(inode, offset + len))
> + ext4_mark_inode_dirty(handle, inode);
> +
> + /*
> + * We may need to truncate allocated but not written blocks
> + * beyond EOF.
> + */
> + written_blk = ALIGN(offset + len, 1 << blkbits);
> + end_blk = ALIGN(offset + len + count, 1 << blkbits);
Why add len in the end_blk calculation?
Shouldn't it be as below?
end_blk = ALIGN(offset + count, 1 << blkbits);
> + if (written_blk < end_blk && ext4_can_truncate(inode))
> + truncate = true;
> +
> + /*
> + * Remove the inode from the orphan list if it has been
> + * extended and everything went OK.
> + */
> + if (!truncate && inode->i_nlink)
> + ext4_orphan_del(handle, inode);
> + ext4_journal_stop(handle);
> +
> + if (truncate) {
> + ext4_truncate_failed_write(inode);
> +orphan_del:
> + /*
> + * If the truncate operation failed early the inode
> + * may still be on the orphan list. In that case, we
> + * need to try to remove the inode from the in-memory
> + * orphan list.
> + */
> + if (inode->i_nlink)
> + ext4_orphan_del(NULL, inode);
> + }
> + return ret;
> +}
> +
> +/*
> + * The inode may have been placed onto the orphan list or has had
> + * blocks allocated beyond EOF as a result of an extension. We need to
> + * ensure that any necessary cleanup routines are performed if the
> + * error path has been taken for a write.
> + */
> +static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
> +{
> + int ret = 0;
No need for ret here anyway.
> + handle_t *handle;
> +
> + if (size > i_size_read(inode))
> + ext4_truncate_failed_write(inode);
> +
> + if (!list_empty(&EXT4_I(inode)->i_orphan)) {
> + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> + if (IS_ERR(handle)) {
> + if (inode->i_nlink)
> + ext4_orphan_del(NULL, inode);
> + return PTR_ERR(handle);
> + }
> + if (inode->i_nlink)
> + ext4_orphan_del(handle, inode);
> + ext4_journal_stop(handle);
> + }
> + return ret;
We could directly `return 0;` here.
> +}
> +
> #ifdef CONFIG_FS_DAX
> static ssize_t
> ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
> - struct inode *inode = file_inode(iocb->ki_filp);
> ssize_t ret;
> + int error = 0;
> + loff_t offset;
> + struct inode *inode = file_inode(iocb->ki_filp);
>
> if (!inode_trylock(inode)) {
> if (iocb->ki_flags & IOCB_NOWAIT)
> @@ -255,7 +335,18 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> if (ret)
> goto out;
>
> + offset = iocb->ki_pos;
> ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
> + if (ret > 0 && iocb->ki_pos > i_size_read(inode))
> + error = ext4_handle_inode_extension(inode, offset, ret,
> + iov_iter_count(from));
> +
> + if (ret < 0)
> + error = ext4_handle_failed_inode_extension(inode,
> + iocb->ki_pos);
> +
> + if (error)
> + ret = error;
> out:
> inode_unlock(inode);
> if (ret > 0)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 420fe3deed39..761ce6286b05 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3601,53 +3601,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> ssize_t written, unsigned flags, struct iomap *iomap)
> {
> - int ret = 0;
> - handle_t *handle;
> - int blkbits = inode->i_blkbits;
> - bool truncate = false;
> -
> - if (!(flags & IOMAP_WRITE) || (flags & IOMAP_FAULT))
> - return 0;
> -
> - handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> - if (IS_ERR(handle)) {
> - ret = PTR_ERR(handle);
> - goto orphan_del;
> - }
> - if (ext4_update_inode_size(inode, offset + written))
> - ext4_mark_inode_dirty(handle, inode);
> - /*
> - * We may need to truncate allocated but not written blocks beyond EOF.
> - */
> - if (iomap->offset + iomap->length >
> - ALIGN(inode->i_size, 1 << blkbits)) {
> - ext4_lblk_t written_blk, end_blk;
> -
> - written_blk = (offset + written) >> blkbits;
> - end_blk = (offset + length) >> blkbits;
> - if (written_blk < end_blk && ext4_can_truncate(inode))
> - truncate = true;
> - }
> - /*
> - * Remove inode from orphan list if we were extending a inode and
> - * everything went fine.
> - */
> - if (!truncate && inode->i_nlink &&
> - !list_empty(&EXT4_I(inode)->i_orphan))
> - ext4_orphan_del(handle, inode);
> - ext4_journal_stop(handle);
> - if (truncate) {
> - ext4_truncate_failed_write(inode);
> -orphan_del:
> - /*
> - * If truncate failed early the inode might still be on the
> - * orphan list; we need to make sure the inode is removed from
> - * the orphan list in that case.
> - */
> - if (inode->i_nlink)
> - ext4_orphan_del(NULL, inode);
> - }
> - return ret;
> + return 0;
> }
>
> const struct iomap_ops ext4_iomap_ops = {
>
On 9/9/19 4:49 AM, Matthew Bobrowski wrote:
> For iomap direct IO write code path changes, we need to accommodate
> for the case where the block mapping flags passed to ext4_map_blocks()
> will result in m_flags having both EXT4_MAP_MAPPED and
> EXT4_MAP_UNWRITTEN bits set. In order for the allocated unwritten
> extents to be converted properly in the end_io handler, iomap->type
> must be set to IOMAP_UNWRITTEN, so we need to reshuffle the
> conditional statement in order to achieve this.
>
> This change is a no-op for DAX code path as the block mapping flag
> passed to ext4_map_blocks() when IS_DAX(inode) never results in
> EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN being set at once.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
Looks good to me.
Reviewed-by: Ritesh Harjani <[email protected]>
> ---
> fs/ext4/inode.c | 17 ++++++++++++++---
> 1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 761ce6286b05..efb184928e51 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3581,10 +3581,21 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
> iomap->addr = IOMAP_NULL_ADDR;
> } else {
> - if (map.m_flags & EXT4_MAP_MAPPED) {
> - iomap->type = IOMAP_MAPPED;
> - } else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> + /*
> + * Flags passed to ext4_map_blocks() for direct IO
> + * writes can result in m_flags having both
> + * EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits set. In
> + * order for allocated unwritten extents to be
> + * converted to written extents in the end_io handler
> + * correctly, we need to ensure that the iomap->type
> + * is also set appropriately in that case. Thus, we
> + * need to check whether EXT4_MAP_UNWRITTEN is set
> + * first.
> + */
> + if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> iomap->type = IOMAP_UNWRITTEN;
> + } else if (map.m_flags & EXT4_MAP_MAPPED) {
> + iomap->type = IOMAP_MAPPED;
> } else {
> WARN_ON_ONCE(1);
> return -EIO;
>
On 9/9/19 4:49 AM, Matthew Bobrowski wrote:
> This patch introduces a new direct IO write code path implementation
> that makes use of the iomap infrastructure.
>
> All direct IO write operations are now passed from the ->write_iter()
> callback to the new function ext4_dio_write_iter(). This function is
> responsible for calling into iomap infrastructure via
> iomap_dio_rw(). Snippets of the direct IO code from within
> ext4_file_write_iter(), such as checking whether the IO request is
> unaligned asynchronous IO, or whether it will be overwriting
> allocated and initialized blocks, have been moved out and into
> ext4_dio_write_iter().
>
> The block mapping flags that are passed to ext4_map_blocks() from
> within ext4_dio_get_block() and friends have effectively been taken
> out and introduced within the ext4_iomap_begin(). If ext4_map_blocks()
> happens to have instantiated blocks beyond the i_size, then we attempt
> to place the inode onto the orphan list. Despite being able to perform
> i_size extension checking earlier on in the direct IO code path, it
> makes most sense to perform this bit post successful block allocation.
>
> The ->end_io() callback ext4_dio_write_end_io() is responsible for
> removing the inode from the orphan list and determining if we should
> truncate a failed write in the case of an error. We also convert a
> range of unwritten extents to written if IOMAP_DIO_UNWRITTEN is set
> and perform the necessary i_size/i_disksize extension if the
> iocb->ki_pos + dio->size > i_size_read(inode).
>
> In the instance of a short write, we fallback to buffered IO and
> complete whatever is left in the 'iter'. Any blocks that may have been
> allocated in preparation for direct IO will be reused by buffered IO,
> so there's no issue with leaving allocated blocks beyond EOF.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
Looks good to me.
Reviewed-by: Ritesh Harjani <[email protected]>
> ---
> fs/ext4/file.c | 219 +++++++++++++++++++++++++++++++++---------------
> fs/ext4/inode.c | 57 ++++++++++---
> 2 files changed, 198 insertions(+), 78 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 8e586198f6e6..bf22425a6a6f 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -29,6 +29,7 @@
> #include <linux/pagevec.h>
> #include <linux/uio.h>
> #include <linux/mman.h>
> +#include <linux/backing-dev.h>
> #include "ext4.h"
> #include "ext4_jbd2.h"
> #include "xattr.h"
> @@ -217,6 +218,14 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> if (ret <= 0)
> return ret;
>
> + ret = file_remove_privs(iocb->ki_filp);
> + if (ret)
> + return 0;
> +
> + ret = file_update_time(iocb->ki_filp);
> + if (ret)
> + return 0;
> +
> if (unlikely(IS_IMMUTABLE(inode)))
> return -EPERM;
>
> @@ -234,6 +243,34 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> return iov_iter_count(from);
> }
>
> +static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> + struct iov_iter *from)
> +{
> + ssize_t ret;
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EOPNOTSUPP;
> +
> + if (!inode_trylock(inode))
> + inode_lock(inode);
> +
> + ret = ext4_write_checks(iocb, from);
> + if (ret <= 0)
> + goto out;
> +
> + current->backing_dev_info = inode_to_bdi(inode);
> + ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
> + current->backing_dev_info = NULL;
> +out:
> + inode_unlock(inode);
> + if (likely(ret > 0)) {
> + iocb->ki_pos += ret;
> + ret = generic_write_sync(iocb, ret);
> + }
> + return ret;
> +}
> +
> static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> ssize_t len, size_t count)
> {
> @@ -311,6 +348,118 @@ static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
> return ret;
> }
>
> +/*
> + * For a write that extends the inode size, ext4_dio_write_iter() will
> + * wait for the write to complete. Consequently, operations performed
> + * within this function are still covered by the inode_lock().
> + */
> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
> + unsigned int flags)
> +{
> + int ret = 0;
> + loff_t offset = iocb->ki_pos;
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + if (error) {
> + ret = ext4_handle_failed_inode_extension(inode, offset + size);
> + return ret ? ret : error;
> + }
> +
> + if (flags & IOMAP_DIO_UNWRITTEN) {
> + ret = ext4_convert_unwritten_extents(NULL, inode,
> + offset, size);
> + if (ret)
> + return ret;
> + }
> +
> + if (offset + size > i_size_read(inode)) {
> + ret = ext4_handle_inode_extension(inode, offset, size, 0);
> + if (ret)
> + return ret;
> + }
> + return ret;
> +}
> +
> +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> + ssize_t ret;
> + loff_t offset = iocb->ki_pos;
> + size_t count = iov_iter_count(from);
> + struct inode *inode = file_inode(iocb->ki_filp);
> + bool extend = false, overwrite = false, unaligned_aio = false;
> +
> + if (!inode_trylock(inode)) {
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EAGAIN;
> + inode_lock(inode);
> + }
> +
> + if (!ext4_dio_checks(inode)) {
> + inode_unlock(inode);
> + /*
> + * Fallback to buffered IO if the operation on the
> + * inode is not supported by direct IO.
> + */
> + return ext4_buffered_write_iter(iocb, from);
> + }
> +
> + ret = ext4_write_checks(iocb, from);
> + if (ret <= 0) {
> + inode_unlock(inode);
> + return ret;
> + }
> +
> + /*
> + * Unaligned direct AIO must be serialized among each other as
> + * the zeroing of partial blocks of two competing unaligned
> + * AIOs can result in data corruption.
> + */
> + if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> + !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
> + unaligned_aio = true;
> + inode_dio_wait(inode);
> + }
> +
> + /*
> + * Determine whether the IO operation will overwrite allocated
> + * and initialized blocks. If so, check to see whether it is
> + * possible to take the dioread_nolock path.
> + */
> + if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
> + ext4_should_dioread_nolock(inode)) {
> + overwrite = true;
> + downgrade_write(&inode->i_rwsem);
> + }
> +
> + if (offset + count > i_size_read(inode) ||
> + offset + count > EXT4_I(inode)->i_disksize) {
> + ext4_update_i_disksize(inode, inode->i_size);
> + extend = true;
> + }
> +
> + ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);
> +
> + /*
> + * Unaligned direct AIO must be the only IO in flight or else
> + * any overlapping aligned IO after unaligned IO might result
> + * in data corruption. We also need to wait here in the case
> + * where the inode is being extended so that inode extension
> + * routines in ext4_dio_write_end_io() are covered by the
> + * inode_lock().
> + */
> + if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
> + inode_dio_wait(inode);
> +
> + if (overwrite)
> + inode_unlock_shared(inode);
> + else
> + inode_unlock(inode);
> +
> + if (ret >= 0 && iov_iter_count(from))
> + return ext4_buffered_write_iter(iocb, from);
> + return ret;
> +}
> +
> #ifdef CONFIG_FS_DAX
> static ssize_t
> ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> @@ -325,15 +474,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> return -EAGAIN;
> inode_lock(inode);
> }
> +
> ret = ext4_write_checks(iocb, from);
> if (ret <= 0)
> goto out;
> - ret = file_remove_privs(iocb->ki_filp);
> - if (ret)
> - goto out;
> - ret = file_update_time(iocb->ki_filp);
> - if (ret)
> - goto out;
>
> offset = iocb->ki_pos;
> ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
> @@ -359,73 +503,16 @@ static ssize_t
> ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
> struct inode *inode = file_inode(iocb->ki_filp);
> - int o_direct = iocb->ki_flags & IOCB_DIRECT;
> - int unaligned_aio = 0;
> - int overwrite = 0;
> - ssize_t ret;
>
> if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
> return -EIO;
>
> -#ifdef CONFIG_FS_DAX
> if (IS_DAX(inode))
> return ext4_dax_write_iter(iocb, from);
> -#endif
> - if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
> - return -EOPNOTSUPP;
>
> - if (!inode_trylock(inode)) {
> - if (iocb->ki_flags & IOCB_NOWAIT)
> - return -EAGAIN;
> - inode_lock(inode);
> - }
> -
> - ret = ext4_write_checks(iocb, from);
> - if (ret <= 0)
> - goto out;
> -
> - /*
> - * Unaligned direct AIO must be serialized among each other as zeroing
> - * of partial blocks of two competing unaligned AIOs can result in data
> - * corruption.
> - */
> - if (o_direct && ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> - !is_sync_kiocb(iocb) &&
> - ext4_unaligned_aio(inode, from, iocb->ki_pos)) {
> - unaligned_aio = 1;
> - ext4_unwritten_wait(inode);
> - }
> -
> - iocb->private = &overwrite;
> - /* Check whether we do a DIO overwrite or not */
> - if (o_direct && !unaligned_aio) {
> - if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
> - if (ext4_should_dioread_nolock(inode))
> - overwrite = 1;
> - } else if (iocb->ki_flags & IOCB_NOWAIT) {
> - ret = -EAGAIN;
> - goto out;
> - }
> - }
> -
> - ret = __generic_file_write_iter(iocb, from);
> - /*
> - * Unaligned direct AIO must be the only IO in flight. Otherwise
> - * overlapping aligned IO after unaligned might result in data
> - * corruption.
> - */
> - if (ret == -EIOCBQUEUED && unaligned_aio)
> - ext4_unwritten_wait(inode);
> - inode_unlock(inode);
> -
> - if (ret > 0)
> - ret = generic_write_sync(iocb, ret);
> -
> - return ret;
> -
> -out:
> - inode_unlock(inode);
> - return ret;
> + if (iocb->ki_flags & IOCB_DIRECT)
> + return ext4_dio_write_iter(iocb, from);
> + return ext4_buffered_write_iter(iocb, from);
> }
>
> #ifdef CONFIG_FS_DAX
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index efb184928e51..f52ad3065236 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3513,11 +3513,13 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> }
> }
> } else if (flags & IOMAP_WRITE) {
> - int dio_credits;
> handle_t *handle;
> - int retries = 0;
> + int dio_credits, retries = 0, m_flags = 0;
>
> - /* Trim mapping request to maximum we can map at once for DIO */
> + /*
> + * Trim mapping request to maximum we can map at once
> + * for DIO.
> + */
> if (map.m_len > DIO_MAX_BLOCKS)
> map.m_len = DIO_MAX_BLOCKS;
> dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
> @@ -3533,8 +3535,30 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> if (IS_ERR(handle))
> return PTR_ERR(handle);
>
> - ret = ext4_map_blocks(handle, inode, &map,
> - EXT4_GET_BLOCKS_CREATE_ZERO);
> + /*
> + * DAX and direct IO are the only two operations that
> + * are currently supported with IOMAP_WRITE.
> + */
> + WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
> + if (IS_DAX(inode))
> + m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> + else if (round_down(offset, i_blocksize(inode)) >=
> + i_size_read(inode))
> + m_flags = EXT4_GET_BLOCKS_CREATE;
> + else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> + m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> +
> + ret = ext4_map_blocks(handle, inode, &map, m_flags);
> +
> + /*
> + * We cannot fill holes in indirect tree based inodes
> + * as that could expose stale data in the case of a
> + * crash. Use the magic error code to fallback to
> + * buffered IO.
> + */
> + if (!m_flags && !ret)
> + ret = -ENOTBLK;
> +
> if (ret < 0) {
> ext4_journal_stop(handle);
> if (ret == -ENOSPC &&
> @@ -3544,13 +3568,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> }
>
> /*
> - * If we added blocks beyond i_size, we need to make sure they
> - * will get truncated if we crash before updating i_size in
> - * ext4_iomap_end(). For faults we don't need to do that (and
> - * even cannot because for orphan list operations inode_lock is
> - * required) - if we happen to instantiate block beyond i_size,
> - * it is because we race with truncate which has already added
> - * the inode to the orphan list.
> + * If we added blocks beyond i_size, we need to make
> + * sure they will get truncated if we crash before
> + * updating the i_size. For faults we don't need to do
> + * that (and even cannot because for orphan list
> + * operations inode_lock is required) - if we happen
> + * to instantiate block beyond i_size, it is because
> + * we race with truncate which has already added the
> + * inode to the orphan list.
> */
> if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
> (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
> @@ -3612,6 +3637,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> ssize_t written, unsigned flags, struct iomap *iomap)
> {
> + /*
> + * Check to see whether an error occurred while writing data
> + * out to allocated blocks. If so, return the magic error code
> + * so that we fallback to buffered IO and reuse the blocks
> + * that were allocated in preparation for the direct IO write.
> + */
> + if (flags & IOMAP_DIRECT && written == 0)
> + return -ENOTBLK;
> return 0;
> }
>
On 9/9/19 4:49 AM, Matthew Bobrowski wrote:
> This patch introduces a new direct IO write code path implementation
> that makes use of the iomap infrastructure.
>
> All direct IO write operations are now passed from the ->write_iter()
> callback to the new function ext4_dio_write_iter(). This function is
> responsible for calling into iomap infrastructure via
> iomap_dio_rw(). Snippets of the direct IO code from within
> ext4_file_write_iter(), such as checking whether the IO request is
> unaligned asynchronous IO, or whether it will be overwriting
> allocated and initialized blocks, have been moved out and into
> ext4_dio_write_iter().
>
> The block mapping flags that are passed to ext4_map_blocks() from
> within ext4_dio_get_block() and friends have effectively been taken
> out and introduced within the ext4_iomap_begin(). If ext4_map_blocks()
> happens to have instantiated blocks beyond the i_size, then we attempt
> to place the inode onto the orphan list. Despite being able to perform
> i_size extension checking earlier on in the direct IO code path, it
> makes most sense to perform this bit post successful block allocation.
>
> The ->end_io() callback ext4_dio_write_end_io() is responsible for
> removing the inode from the orphan list and determining if we should
> truncate a failed write in the case of an error. We also convert a
> range of unwritten extents to written if IOMAP_DIO_UNWRITTEN is set
> and perform the necessary i_size/i_disksize extension if the
> iocb->ki_pos + dio->size > i_size_read(inode).
>
> In the instance of a short write, we fallback to buffered IO and
> complete whatever is left in the 'iter'. Any blocks that may have been
> allocated in preparation for direct IO will be reused by buffered IO,
> so there's no issue with leaving allocated blocks beyond EOF.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
Sorry some minor simplification comments. Forgot to respond in previous
email.
Otherwise looks good.
Reviewed-by: Ritesh Harjani <[email protected]>
> ---
> fs/ext4/file.c | 219 +++++++++++++++++++++++++++++++++---------------
> fs/ext4/inode.c | 57 ++++++++++---
> 2 files changed, 198 insertions(+), 78 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 8e586198f6e6..bf22425a6a6f 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -29,6 +29,7 @@
> #include <linux/pagevec.h>
> #include <linux/uio.h>
> #include <linux/mman.h>
> +#include <linux/backing-dev.h>
> #include "ext4.h"
> #include "ext4_jbd2.h"
> #include "xattr.h"
> @@ -217,6 +218,14 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> if (ret <= 0)
> return ret;
>
> + ret = file_remove_privs(iocb->ki_filp);
> + if (ret)
> + return 0;
> +
> + ret = file_update_time(iocb->ki_filp);
> + if (ret)
> + return 0;
> +
> if (unlikely(IS_IMMUTABLE(inode)))
> return -EPERM;
>
> @@ -234,6 +243,34 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> return iov_iter_count(from);
> }
>
> +static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> + struct iov_iter *from)
> +{
> + ssize_t ret;
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EOPNOTSUPP;
> +
> + if (!inode_trylock(inode))
> + inode_lock(inode);
> +
> + ret = ext4_write_checks(iocb, from);
> + if (ret <= 0)
> + goto out;
> +
> + current->backing_dev_info = inode_to_bdi(inode);
> + ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
> + current->backing_dev_info = NULL;
> +out:
> + inode_unlock(inode);
> + if (likely(ret > 0)) {
> + iocb->ki_pos += ret;
> + ret = generic_write_sync(iocb, ret);
> + }
> + return ret;
> +}
> +
> static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> ssize_t len, size_t count)
> {
> @@ -311,6 +348,118 @@ static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
> return ret;
> }
>
> +/*
> + * For a write that extends the inode size, ext4_dio_write_iter() will
> + * wait for the write to complete. Consequently, operations performed
> + * within this function are still covered by the inode_lock().
> + */
Maybe add a comment that on success this returns 0.
> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
> + unsigned int flags)
> +{
> + int ret = 0;
No need to initialize ret.
> + loff_t offset = iocb->ki_pos;
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + if (error) {
> + ret = ext4_handle_failed_inode_extension(inode, offset + size);
> + return ret ? ret : error;
> + }
> +
> + if (flags & IOMAP_DIO_UNWRITTEN) {
> + ret = ext4_convert_unwritten_extents(NULL, inode,
> + offset, size);
> + if (ret)
> + return ret;
> + }
> +
> + if (offset + size > i_size_read(inode)) {
> + ret = ext4_handle_inode_extension(inode, offset, size, 0);
> + if (ret)
> + return ret;
> + }
> + return ret;
Directly return 0, since if we fall through to here it means it's the success case.
You are already returning the error from the error paths above.
> +}
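For completeness, a rough (untested) sketch combining the above suggestions:

/*
 * Returns 0 on success, negative error code on failure.
 */
static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
				 unsigned int flags)
{
	int ret;
	loff_t offset = iocb->ki_pos;
	struct inode *inode = file_inode(iocb->ki_filp);

	if (error) {
		/* Undo any orphan/extension state left behind by the failed write. */
		ret = ext4_handle_failed_inode_extension(inode, offset + size);
		return ret ? ret : error;
	}

	if (flags & IOMAP_DIO_UNWRITTEN) {
		ret = ext4_convert_unwritten_extents(NULL, inode, offset, size);
		if (ret)
			return ret;
	}

	if (offset + size > i_size_read(inode)) {
		ret = ext4_handle_inode_extension(inode, offset, size, 0);
		if (ret)
			return ret;
	}
	return 0;
}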
> +
> +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> + ssize_t ret;
> + loff_t offset = iocb->ki_pos;
> + size_t count = iov_iter_count(from);
> + struct inode *inode = file_inode(iocb->ki_filp);
> + bool extend = false, overwrite = false, unaligned_aio = false;
> +
> + if (!inode_trylock(inode)) {
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EAGAIN;
> + inode_lock(inode);
> + }
> +
> + if (!ext4_dio_checks(inode)) {
> + inode_unlock(inode);
> + /*
> + * Fallback to buffered IO if the operation on the
> + * inode is not supported by direct IO.
> + */
> + return ext4_buffered_write_iter(iocb, from);
> + }
> +
> + ret = ext4_write_checks(iocb, from);
> + if (ret <= 0) {
> + inode_unlock(inode);
> + return ret;
> + }
> +
> + /*
> + * Unaligned direct AIO must be serialized among each other as
> + * the zeroing of partial blocks of two competing unaligned
> + * AIOs can result in data corruption.
> + */
> + if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> + !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
> + unaligned_aio = true;
> + inode_dio_wait(inode);
> + }
> +
> + /*
> + * Determine whether the IO operation will overwrite allocated
> + * and initialized blocks. If so, check to see whether it is
> + * possible to take the dioread_nolock path.
> + */
> + if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
> + ext4_should_dioread_nolock(inode)) {
> + overwrite = true;
> + downgrade_write(&inode->i_rwsem);
> + }
> +
> + if (offset + count > i_size_read(inode) ||
> + offset + count > EXT4_I(inode)->i_disksize) {
> + ext4_update_i_disksize(inode, inode->i_size);
> + extend = true;
> + }
> +
> + ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, ext4_dio_write_end_io);
> +
> + /*
> + * Unaligned direct AIO must be the only IO in flight or else
> + * any overlapping aligned IO after unaligned IO might result
> + * in data corruption. We also need to wait here in the case
> + * where the inode is being extended so that inode extension
> + * routines in ext4_dio_write_end_io() are covered by the
> + * inode_lock().
> + */
> + if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
> + inode_dio_wait(inode);
> +
> + if (overwrite)
> + inode_unlock_shared(inode);
> + else
> + inode_unlock(inode);
> +
> + if (ret >= 0 && iov_iter_count(from))
> + return ext4_buffered_write_iter(iocb, from);
> + return ret;
> +}
> +
> #ifdef CONFIG_FS_DAX
> static ssize_t
> ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> @@ -325,15 +474,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> return -EAGAIN;
> inode_lock(inode);
> }
> +
> ret = ext4_write_checks(iocb, from);
> if (ret <= 0)
> goto out;
> - ret = file_remove_privs(iocb->ki_filp);
> - if (ret)
> - goto out;
> - ret = file_update_time(iocb->ki_filp);
> - if (ret)
> - goto out;
>
> offset = iocb->ki_pos;
> ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
> @@ -359,73 +503,16 @@ static ssize_t
> ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
> struct inode *inode = file_inode(iocb->ki_filp);
> - int o_direct = iocb->ki_flags & IOCB_DIRECT;
> - int unaligned_aio = 0;
> - int overwrite = 0;
> - ssize_t ret;
>
> if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
> return -EIO;
>
> -#ifdef CONFIG_FS_DAX
> if (IS_DAX(inode))
> return ext4_dax_write_iter(iocb, from);
> -#endif
> - if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
> - return -EOPNOTSUPP;
>
> - if (!inode_trylock(inode)) {
> - if (iocb->ki_flags & IOCB_NOWAIT)
> - return -EAGAIN;
> - inode_lock(inode);
> - }
> -
> - ret = ext4_write_checks(iocb, from);
> - if (ret <= 0)
> - goto out;
> -
> - /*
> - * Unaligned direct AIO must be serialized among each other as zeroing
> - * of partial blocks of two competing unaligned AIOs can result in data
> - * corruption.
> - */
> - if (o_direct && ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> - !is_sync_kiocb(iocb) &&
> - ext4_unaligned_aio(inode, from, iocb->ki_pos)) {
> - unaligned_aio = 1;
> - ext4_unwritten_wait(inode);
> - }
> -
> - iocb->private = &overwrite;
> - /* Check whether we do a DIO overwrite or not */
> - if (o_direct && !unaligned_aio) {
> - if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
> - if (ext4_should_dioread_nolock(inode))
> - overwrite = 1;
> - } else if (iocb->ki_flags & IOCB_NOWAIT) {
> - ret = -EAGAIN;
> - goto out;
> - }
> - }
> -
> - ret = __generic_file_write_iter(iocb, from);
> - /*
> - * Unaligned direct AIO must be the only IO in flight. Otherwise
> - * overlapping aligned IO after unaligned might result in data
> - * corruption.
> - */
> - if (ret == -EIOCBQUEUED && unaligned_aio)
> - ext4_unwritten_wait(inode);
> - inode_unlock(inode);
> -
> - if (ret > 0)
> - ret = generic_write_sync(iocb, ret);
> -
> - return ret;
> -
> -out:
> - inode_unlock(inode);
> - return ret;
> + if (iocb->ki_flags & IOCB_DIRECT)
> + return ext4_dio_write_iter(iocb, from);
> + return ext4_buffered_write_iter(iocb, from);
> }
>
> #ifdef CONFIG_FS_DAX
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index efb184928e51..f52ad3065236 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3513,11 +3513,13 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> }
> }
> } else if (flags & IOMAP_WRITE) {
> - int dio_credits;
> handle_t *handle;
> - int retries = 0;
> + int dio_credits, retries = 0, m_flags = 0;
>
> - /* Trim mapping request to maximum we can map at once for DIO */
> + /*
> + * Trim mapping request to maximum we can map at once
> + * for DIO.
> + */
> if (map.m_len > DIO_MAX_BLOCKS)
> map.m_len = DIO_MAX_BLOCKS;
> dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
> @@ -3533,8 +3535,30 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> if (IS_ERR(handle))
> return PTR_ERR(handle);
>
> - ret = ext4_map_blocks(handle, inode, &map,
> - EXT4_GET_BLOCKS_CREATE_ZERO);
> + /*
> + * DAX and direct IO are the only two operations that
> + * are currently supported with IOMAP_WRITE.
> + */
> + WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
> + if (IS_DAX(inode))
> + m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> + else if (round_down(offset, i_blocksize(inode)) >=
> + i_size_read(inode))
> + m_flags = EXT4_GET_BLOCKS_CREATE;
> + else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> + m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> +
> + ret = ext4_map_blocks(handle, inode, &map, m_flags);
> +
> + /*
> + * We cannot fill holes in indirect tree based inodes
> + * as that could expose stale data in the case of a
> + * crash. Use the magic error code to fallback to
> + * buffered IO.
> + */
I like this comment ;)
It helps others understand what is really going on here.
> + if (!m_flags && !ret)
> + ret = -ENOTBLK;
> +
> if (ret < 0) {
> ext4_journal_stop(handle);
> if (ret == -ENOSPC &&
> @@ -3544,13 +3568,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> }
>
> /*
> - * If we added blocks beyond i_size, we need to make sure they
> - * will get truncated if we crash before updating i_size in
> - * ext4_iomap_end(). For faults we don't need to do that (and
> - * even cannot because for orphan list operations inode_lock is
> - * required) - if we happen to instantiate block beyond i_size,
> - * it is because we race with truncate which has already added
> - * the inode to the orphan list.
> + * If we added blocks beyond i_size, we need to make
> + * sure they will get truncated if we crash before
> + * updating the i_size. For faults we don't need to do
> + * that (and even cannot because for orphan list
> + * operations inode_lock is required) - if we happen
> + * to instantiate block beyond i_size, it is because
> + * we race with truncate which has already added the
> + * inode to the orphan list.
> */
> if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
> (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
> @@ -3612,6 +3637,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> ssize_t written, unsigned flags, struct iomap *iomap)
> {
> + /*
> + * Check to see whether an error occurred while writing data
> + * out to allocated blocks. If so, return the magic error code
> + * so that we fallback to buffered IO and reuse the blocks
> + * that were allocated in preparation for the direct IO write.
> + */
> + if (flags & IOMAP_DIRECT && written == 0)
> + return -ENOTBLK;
> return 0;
> }
>
On 9/9/19 4:50 AM, Matthew Bobrowski wrote:
> Remove buffer_head direct IO code that is now redundant as a result of
> porting across the read/write paths to use iomap infrastructure.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
Looks good to me.
Reviewed-by: Ritesh Harjani <[email protected]>
> ---
> fs/ext4/ext4.h | 3 -
> fs/ext4/extents.c | 11 +-
> fs/ext4/file.c | 7 -
> fs/ext4/inode.c | 398 +---------------------------------------------
> 4 files changed, 5 insertions(+), 414 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index bf660aa7a9e0..2ab91815f52d 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1555,7 +1555,6 @@ enum {
> EXT4_STATE_NO_EXPAND, /* No space for expansion */
> EXT4_STATE_DA_ALLOC_CLOSE, /* Alloc DA blks on close */
> EXT4_STATE_EXT_MIGRATE, /* Inode is migrating */
> - EXT4_STATE_DIO_UNWRITTEN, /* need convert on dio done*/
> EXT4_STATE_NEWENTRY, /* File just added to dir */
> EXT4_STATE_MAY_INLINE_DATA, /* may have in-inode data */
> EXT4_STATE_EXT_PRECACHED, /* extents have been precached */
> @@ -2522,8 +2521,6 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
> struct buffer_head *bh_result, int create);
> int ext4_get_block(struct inode *inode, sector_t iblock,
> struct buffer_head *bh_result, int create);
> -int ext4_dio_get_block(struct inode *inode, sector_t iblock,
> - struct buffer_head *bh_result, int create);
> int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
> struct buffer_head *bh, int create);
> int ext4_walk_page_buffers(handle_t *handle,
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index 92266a2da7d6..a869e206bd81 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -1753,16 +1753,9 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
> */
> if (ext1_ee_len + ext2_ee_len > EXT_INIT_MAX_LEN)
> return 0;
> - /*
> - * The check for IO to unwritten extent is somewhat racy as we
> - * increment i_unwritten / set EXT4_STATE_DIO_UNWRITTEN only after
> - * dropping i_data_sem. But reserved blocks should save us in that
> - * case.
> - */
> +
> if (ext4_ext_is_unwritten(ex1) &&
> - (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) ||
> - atomic_read(&EXT4_I(inode)->i_unwritten) ||
> - (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)))
> + (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN))
> return 0;
> #ifdef AGGRESSIVE_TEST
> if (ext1_ee_len >= 4)
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index bf22425a6a6f..d4b9a82aed6c 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -155,13 +155,6 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
> return 0;
> }
>
> -static void ext4_unwritten_wait(struct inode *inode)
> -{
> - wait_queue_head_t *wq = ext4_ioend_wq(inode);
> -
> - wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_unwritten) == 0));
> -}
> -
> /*
> * This tests whether the IO in question is block-aligned or not.
> * Ext4 utilizes unwritten extents when hole-filling during direct IO, and they
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index f52ad3065236..a4f0749527c7 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -826,136 +826,6 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
> /* Maximum number of blocks we map for direct IO at once. */
> #define DIO_MAX_BLOCKS 4096
>
> -/*
> - * Get blocks function for the cases that need to start a transaction -
> - * generally difference cases of direct IO and DAX IO. It also handles retries
> - * in case of ENOSPC.
> - */
> -static int ext4_get_block_trans(struct inode *inode, sector_t iblock,
> - struct buffer_head *bh_result, int flags)
> -{
> - int dio_credits;
> - handle_t *handle;
> - int retries = 0;
> - int ret;
> -
> - /* Trim mapping request to maximum we can map at once for DIO */
> - if (bh_result->b_size >> inode->i_blkbits > DIO_MAX_BLOCKS)
> - bh_result->b_size = DIO_MAX_BLOCKS << inode->i_blkbits;
> - dio_credits = ext4_chunk_trans_blocks(inode,
> - bh_result->b_size >> inode->i_blkbits);
> -retry:
> - handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, dio_credits);
> - if (IS_ERR(handle))
> - return PTR_ERR(handle);
> -
> - ret = _ext4_get_block(inode, iblock, bh_result, flags);
> - ext4_journal_stop(handle);
> -
> - if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> - goto retry;
> - return ret;
> -}
> -
> -/* Get block function for DIO reads and writes to inodes without extents */
> -int ext4_dio_get_block(struct inode *inode, sector_t iblock,
> - struct buffer_head *bh, int create)
> -{
> - /* We don't expect handle for direct IO */
> - WARN_ON_ONCE(ext4_journal_current_handle());
> -
> - if (!create)
> - return _ext4_get_block(inode, iblock, bh, 0);
> - return ext4_get_block_trans(inode, iblock, bh, EXT4_GET_BLOCKS_CREATE);
> -}
> -
> -/*
> - * Get block function for AIO DIO writes when we create unwritten extent if
> - * blocks are not allocated yet. The extent will be converted to written
> - * after IO is complete.
> - */
> -static int ext4_dio_get_block_unwritten_async(struct inode *inode,
> - sector_t iblock, struct buffer_head *bh_result, int create)
> -{
> - int ret;
> -
> - /* We don't expect handle for direct IO */
> - WARN_ON_ONCE(ext4_journal_current_handle());
> -
> - ret = ext4_get_block_trans(inode, iblock, bh_result,
> - EXT4_GET_BLOCKS_IO_CREATE_EXT);
> -
> - /*
> - * When doing DIO using unwritten extents, we need io_end to convert
> - * unwritten extents to written on IO completion. We allocate io_end
> - * once we spot unwritten extent and store it in b_private. Generic
> - * DIO code keeps b_private set and furthermore passes the value to
> - * our completion callback in 'private' argument.
> - */
> - if (!ret && buffer_unwritten(bh_result)) {
> - if (!bh_result->b_private) {
> - ext4_io_end_t *io_end;
> -
> - io_end = ext4_init_io_end(inode, GFP_KERNEL);
> - if (!io_end)
> - return -ENOMEM;
> - bh_result->b_private = io_end;
> - ext4_set_io_unwritten_flag(inode, io_end);
> - }
> - set_buffer_defer_completion(bh_result);
> - }
> -
> - return ret;
> -}
> -
> -/*
> - * Get block function for non-AIO DIO writes when we create unwritten extent if
> - * blocks are not allocated yet. The extent will be converted to written
> - * after IO is complete by ext4_direct_IO_write().
> - */
> -static int ext4_dio_get_block_unwritten_sync(struct inode *inode,
> - sector_t iblock, struct buffer_head *bh_result, int create)
> -{
> - int ret;
> -
> - /* We don't expect handle for direct IO */
> - WARN_ON_ONCE(ext4_journal_current_handle());
> -
> - ret = ext4_get_block_trans(inode, iblock, bh_result,
> - EXT4_GET_BLOCKS_IO_CREATE_EXT);
> -
> - /*
> - * Mark inode as having pending DIO writes to unwritten extents.
> - * ext4_direct_IO_write() checks this flag and converts extents to
> - * written.
> - */
> - if (!ret && buffer_unwritten(bh_result))
> - ext4_set_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
> -
> - return ret;
> -}
> -
> -static int ext4_dio_get_block_overwrite(struct inode *inode, sector_t iblock,
> - struct buffer_head *bh_result, int create)
> -{
> - int ret;
> -
> - ext4_debug("ext4_dio_get_block_overwrite: inode %lu, create flag %d\n",
> - inode->i_ino, create);
> - /* We don't expect handle for direct IO */
> - WARN_ON_ONCE(ext4_journal_current_handle());
> -
> - ret = _ext4_get_block(inode, iblock, bh_result, 0);
> - /*
> - * Blocks should have been preallocated! ext4_file_write_iter() checks
> - * that.
> - */
> - WARN_ON_ONCE(!buffer_mapped(bh_result) || buffer_unwritten(bh_result));
> -
> - return ret;
> -}
> -
> -
> /*
> * `handle' can be NULL if create is zero
> */
> @@ -3653,268 +3523,6 @@ const struct iomap_ops ext4_iomap_ops = {
> .iomap_end = ext4_iomap_end,
> };
>
> -static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
> - ssize_t size, void *private)
> -{
> - ext4_io_end_t *io_end = private;
> -
> - /* if not async direct IO just return */
> - if (!io_end)
> - return 0;
> -
> - ext_debug("ext4_end_io_dio(): io_end 0x%p "
> - "for inode %lu, iocb 0x%p, offset %llu, size %zd\n",
> - io_end, io_end->inode->i_ino, iocb, offset, size);
> -
> - /*
> - * Error during AIO DIO. We cannot convert unwritten extents as the
> - * data was not written. Just clear the unwritten flag and drop io_end.
> - */
> - if (size <= 0) {
> - ext4_clear_io_unwritten_flag(io_end);
> - size = 0;
> - }
> - io_end->offset = offset;
> - io_end->size = size;
> - ext4_put_io_end(io_end);
> -
> - return 0;
> -}
> -
> -/*
> - * Handling of direct IO writes.
> - *
> - * For ext4 extent files, ext4 will do direct-io write even to holes,
> - * preallocated extents, and those write extend the file, no need to
> - * fall back to buffered IO.
> - *
> - * For holes, we fallocate those blocks, mark them as unwritten
> - * If those blocks were preallocated, we mark sure they are split, but
> - * still keep the range to write as unwritten.
> - *
> - * The unwritten extents will be converted to written when DIO is completed.
> - * For async direct IO, since the IO may still pending when return, we
> - * set up an end_io call back function, which will do the conversion
> - * when async direct IO completed.
> - *
> - * If the O_DIRECT write will extend the file then add this inode to the
> - * orphan list. So recovery will truncate it back to the original size
> - * if the machine crashes during the write.
> - *
> - */
> -static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
> -{
> - struct file *file = iocb->ki_filp;
> - struct inode *inode = file->f_mapping->host;
> - struct ext4_inode_info *ei = EXT4_I(inode);
> - ssize_t ret;
> - loff_t offset = iocb->ki_pos;
> - size_t count = iov_iter_count(iter);
> - int overwrite = 0;
> - get_block_t *get_block_func = NULL;
> - int dio_flags = 0;
> - loff_t final_size = offset + count;
> - int orphan = 0;
> - handle_t *handle;
> -
> - if (final_size > inode->i_size || final_size > ei->i_disksize) {
> - /* Credits for sb + inode write */
> - handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> - if (IS_ERR(handle)) {
> - ret = PTR_ERR(handle);
> - goto out;
> - }
> - ret = ext4_orphan_add(handle, inode);
> - if (ret) {
> - ext4_journal_stop(handle);
> - goto out;
> - }
> - orphan = 1;
> - ext4_update_i_disksize(inode, inode->i_size);
> - ext4_journal_stop(handle);
> - }
> -
> - BUG_ON(iocb->private == NULL);
> -
> - /*
> - * Make all waiters for direct IO properly wait also for extent
> - * conversion. This also disallows race between truncate() and
> - * overwrite DIO as i_dio_count needs to be incremented under i_mutex.
> - */
> - inode_dio_begin(inode);
> -
> - /* If we do a overwrite dio, i_mutex locking can be released */
> - overwrite = *((int *)iocb->private);
> -
> - if (overwrite)
> - inode_unlock(inode);
> -
> - /*
> - * For extent mapped files we could direct write to holes and fallocate.
> - *
> - * Allocated blocks to fill the hole are marked as unwritten to prevent
> - * parallel buffered read to expose the stale data before DIO complete
> - * the data IO.
> - *
> - * As to previously fallocated extents, ext4 get_block will just simply
> - * mark the buffer mapped but still keep the extents unwritten.
> - *
> - * For non AIO case, we will convert those unwritten extents to written
> - * after return back from blockdev_direct_IO. That way we save us from
> - * allocating io_end structure and also the overhead of offloading
> - * the extent convertion to a workqueue.
> - *
> - * For async DIO, the conversion needs to be deferred when the
> - * IO is completed. The ext4 end_io callback function will be
> - * called to take care of the conversion work. Here for async
> - * case, we allocate an io_end structure to hook to the iocb.
> - */
> - iocb->private = NULL;
> - if (overwrite)
> - get_block_func = ext4_dio_get_block_overwrite;
> - else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
> - round_down(offset, i_blocksize(inode)) >= inode->i_size) {
> - get_block_func = ext4_dio_get_block;
> - dio_flags = DIO_LOCKING | DIO_SKIP_HOLES;
> - } else if (is_sync_kiocb(iocb)) {
> - get_block_func = ext4_dio_get_block_unwritten_sync;
> - dio_flags = DIO_LOCKING;
> - } else {
> - get_block_func = ext4_dio_get_block_unwritten_async;
> - dio_flags = DIO_LOCKING;
> - }
> - ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, iter,
> - get_block_func, ext4_end_io_dio, NULL,
> - dio_flags);
> -
> - if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
> - EXT4_STATE_DIO_UNWRITTEN)) {
> - int err;
> - /*
> - * for non AIO case, since the IO is already
> - * completed, we could do the conversion right here
> - */
> - err = ext4_convert_unwritten_extents(NULL, inode,
> - offset, ret);
> - if (err < 0)
> - ret = err;
> - ext4_clear_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
> - }
> -
> - inode_dio_end(inode);
> - /* take i_mutex locking again if we do a ovewrite dio */
> - if (overwrite)
> - inode_lock(inode);
> -
> - if (ret < 0 && final_size > inode->i_size)
> - ext4_truncate_failed_write(inode);
> -
> - /* Handle extending of i_size after direct IO write */
> - if (orphan) {
> - int err;
> -
> - /* Credits for sb + inode write */
> - handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> - if (IS_ERR(handle)) {
> - /*
> - * We wrote the data but cannot extend
> - * i_size. Bail out. In async io case, we do
> - * not return error here because we have
> - * already submmitted the corresponding
> - * bio. Returning error here makes the caller
> - * think that this IO is done and failed
> - * resulting in race with bio's completion
> - * handler.
> - */
> - if (!ret)
> - ret = PTR_ERR(handle);
> - if (inode->i_nlink)
> - ext4_orphan_del(NULL, inode);
> -
> - goto out;
> - }
> - if (inode->i_nlink)
> - ext4_orphan_del(handle, inode);
> - if (ret > 0) {
> - loff_t end = offset + ret;
> - if (end > inode->i_size || end > ei->i_disksize) {
> - ext4_update_i_disksize(inode, end);
> - if (end > inode->i_size)
> - i_size_write(inode, end);
> - /*
> - * We're going to return a positive `ret'
> - * here due to non-zero-length I/O, so there's
> - * no way of reporting error returns from
> - * ext4_mark_inode_dirty() to userspace. So
> - * ignore it.
> - */
> - ext4_mark_inode_dirty(handle, inode);
> - }
> - }
> - err = ext4_journal_stop(handle);
> - if (ret == 0)
> - ret = err;
> - }
> -out:
> - return ret;
> -}
> -
> -static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter)
> -{
> - struct address_space *mapping = iocb->ki_filp->f_mapping;
> - struct inode *inode = mapping->host;
> - size_t count = iov_iter_count(iter);
> - ssize_t ret;
> -
> - /*
> - * Shared inode_lock is enough for us - it protects against concurrent
> - * writes & truncates and since we take care of writing back page cache,
> - * we are protected against page writeback as well.
> - */
> - inode_lock_shared(inode);
> - ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,
> - iocb->ki_pos + count - 1);
> - if (ret)
> - goto out_unlock;
> - ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
> - iter, ext4_dio_get_block, NULL, NULL, 0);
> -out_unlock:
> - inode_unlock_shared(inode);
> - return ret;
> -}
> -
> -static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> -{
> - struct file *file = iocb->ki_filp;
> - struct inode *inode = file->f_mapping->host;
> - size_t count = iov_iter_count(iter);
> - loff_t offset = iocb->ki_pos;
> - ssize_t ret;
> -
> -#ifdef CONFIG_FS_ENCRYPTION
> - if (IS_ENCRYPTED(inode) && S_ISREG(inode->i_mode))
> - return 0;
> -#endif
> -
> - /*
> - * If we are doing data journalling we don't support O_DIRECT
> - */
> - if (ext4_should_journal_data(inode))
> - return 0;
> -
> - /* Let buffer I/O handle the inline data case. */
> - if (ext4_has_inline_data(inode))
> - return 0;
> -
> - trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
> - if (iov_iter_rw(iter) == READ)
> - ret = ext4_direct_IO_read(iocb, iter);
> - else
> - ret = ext4_direct_IO_write(iocb, iter);
> - trace_ext4_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), ret);
> - return ret;
> -}
> -
> /*
> * Pages can be marked dirty completely asynchronously from ext4's journalling
> * activity. By filemap_sync_pte(), try_to_unmap_one(), etc. We cannot do
> @@ -3952,7 +3560,7 @@ static const struct address_space_operations ext4_aops = {
> .bmap = ext4_bmap,
> .invalidatepage = ext4_invalidatepage,
> .releasepage = ext4_releasepage,
> - .direct_IO = ext4_direct_IO,
> + .direct_IO = noop_direct_IO,
> .migratepage = buffer_migrate_page,
> .is_partially_uptodate = block_is_partially_uptodate,
> .error_remove_page = generic_error_remove_page,
> @@ -3969,7 +3577,7 @@ static const struct address_space_operations ext4_journalled_aops = {
> .bmap = ext4_bmap,
> .invalidatepage = ext4_journalled_invalidatepage,
> .releasepage = ext4_releasepage,
> - .direct_IO = ext4_direct_IO,
> + .direct_IO = noop_direct_IO,
> .is_partially_uptodate = block_is_partially_uptodate,
> .error_remove_page = generic_error_remove_page,
> };
> @@ -3985,7 +3593,7 @@ static const struct address_space_operations ext4_da_aops = {
> .bmap = ext4_bmap,
> .invalidatepage = ext4_da_invalidatepage,
> .releasepage = ext4_releasepage,
> - .direct_IO = ext4_direct_IO,
> + .direct_IO = noop_direct_IO,
> .migratepage = buffer_migrate_page,
> .is_partially_uptodate = block_is_partially_uptodate,
> .error_remove_page = generic_error_remove_page,
>
Please excuse my queries here again.
Hopefully they will also be useful for others trying to understand
what is happening under the hood here (for ext4 & XFS) :)
On 9/9/19 2:56 PM, Ritesh Harjani wrote:
>
>
> On 9/9/19 4:49 AM, Matthew Bobrowski wrote:
>> This patch introduces a new direct IO write code path implementation
>> that makes use of the iomap infrastructure.
>>
>> All direct IO write operations are now passed from the ->write_iter()
>> callback to the new function ext4_dio_write_iter(). This function is
>> responsible for calling into iomap infrastructure via
>> iomap_dio_rw(). Snippets of the direct IO code from within
>> ext4_file_write_iter(), such as checking whether the IO request is
>> unaligned asynchronous IO, or whether it will be overwriting
>> allocated and initialized blocks, have been moved out and into
>> ext4_dio_write_iter().
>>
>> The block mapping flags that are passed to ext4_map_blocks() from
>> within ext4_dio_get_block() and friends have effectively been taken
>> out and introduced within the ext4_iomap_begin(). If ext4_map_blocks()
>> happens to have instantiated blocks beyond the i_size, then we attempt
>> to place the inode onto the orphan list. Despite being able to perform
>> i_size extension checking earlier on in the direct IO code path, it
>> makes most sense to perform this bit post successful block allocation.
>>
>> The ->end_io() callback ext4_dio_write_end_io() is responsible for
>> removing the inode from the orphan list and determining if we should
>> truncate a failed write in the case of an error. We also convert a
>> range of unwritten extents to written if IOMAP_DIO_UNWRITTEN is set
>> and perform the necessary i_size/i_disksize extension if the
>> iocb->ki_pos + dio->size > i_size_read(inode).
>>
>> In the instance of a short write, we fallback to buffered IO and
>> complete whatever is left in the 'iter'. Any blocks that may have been
>> allocated in preparation for direct IO will be reused by buffered IO,
>> so there's no issue with leaving allocated blocks beyond EOF.
>>
>> Signed-off-by: Matthew Bobrowski <[email protected]>
>
> Sorry some minor simplification comments. Forgot to respond in previous
> email.
>
> Otherwise looks good.
>
> Reviewed-by: Ritesh Harjani <[email protected]>
>
>
>
>> ---
>> fs/ext4/file.c | 219 +++++++++++++++++++++++++++++++++---------------
>> fs/ext4/inode.c | 57 ++++++++++---
>> 2 files changed, 198 insertions(+), 78 deletions(-)
>>
>> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> index 8e586198f6e6..bf22425a6a6f 100644
>> --- a/fs/ext4/file.c
>> +++ b/fs/ext4/file.c
>> @@ -29,6 +29,7 @@
>> #include <linux/pagevec.h>
>> #include <linux/uio.h>
>> #include <linux/mman.h>
>> +#include <linux/backing-dev.h>
>> #include "ext4.h"
>> #include "ext4_jbd2.h"
>> #include "xattr.h"
>> @@ -217,6 +218,14 @@ static ssize_t ext4_write_checks(struct kiocb
>> *iocb, struct iov_iter *from)
>> if (ret <= 0)
>> return ret;
>>
>> + ret = file_remove_privs(iocb->ki_filp);
>> + if (ret)
>> + return 0;
>> +
>> + ret = file_update_time(iocb->ki_filp);
>> + if (ret)
>> + return 0;
>> +
>> if (unlikely(IS_IMMUTABLE(inode)))
>> return -EPERM;
>>
>> @@ -234,6 +243,34 @@ static ssize_t ext4_write_checks(struct kiocb
>> *iocb, struct iov_iter *from)
>> return iov_iter_count(from);
>> }
>>
>> +static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>> + struct iov_iter *from)
>> +{
>> + ssize_t ret;
>> + struct inode *inode = file_inode(iocb->ki_filp);
>> +
>> + if (iocb->ki_flags & IOCB_NOWAIT)
>> + return -EOPNOTSUPP;
>> +
>> + if (!inode_trylock(inode))
>> + inode_lock(inode);
>> +
>> + ret = ext4_write_checks(iocb, from);
>> + if (ret <= 0)
>> + goto out;
>> +
>> + current->backing_dev_info = inode_to_bdi(inode);
>> + ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
>> + current->backing_dev_info = NULL;
>> +out:
>> + inode_unlock(inode);
>> + if (likely(ret > 0)) {
>> + iocb->ki_pos += ret;
>> + ret = generic_write_sync(iocb, ret);
>> + }
>> + return ret;
>> +}
>> +
>> static int ext4_handle_inode_extension(struct inode *inode, loff_t
>> offset,
>> ssize_t len, size_t count)
>> {
>> @@ -311,6 +348,118 @@ static int
>> ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
>> return ret;
>> }
>>
>> +/*
>> + * For a write that extends the inode size, ext4_dio_write_iter() will
>> + * wait for the write to complete. Consequently, operations performed
>> + * within this function are still covered by the inode_lock().
>> + */
> Maybe add a comment that on success this returns 0.
>
>> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
>> int error,
>> + unsigned int flags)
>> +{
>> + int ret = 0;
> No need to initialize ret.
>
>
>> + loff_t offset = iocb->ki_pos;
>> + struct inode *inode = file_inode(iocb->ki_filp);
>> +
>> + if (error) {
>> + ret = ext4_handle_failed_inode_extension(inode, offset + size);
>> + return ret ? ret : error;
>> + }
>> +
>> + if (flags & IOMAP_DIO_UNWRITTEN) {
>> + ret = ext4_convert_unwritten_extents(NULL, inode,
>> + offset, size);
>> + if (ret)
>> + return ret;
>> + }
>> +
>> + if (offset + size > i_size_read(inode)) {
>> + ret = ext4_handle_inode_extension(inode, offset, size, 0);
>> + if (ret)
>> + return ret;
>> + }
>> + return ret;
> Directly return 0, since if we fall through to here it means it's the success case.
> You are already returning the error from the error paths above.
>
>> +}
>> +
>> +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct
>> iov_iter *from)
>> +{
>> + ssize_t ret;
>> + loff_t offset = iocb->ki_pos;
>> + size_t count = iov_iter_count(from);
>> + struct inode *inode = file_inode(iocb->ki_filp);
>> + bool extend = false, overwrite = false, unaligned_aio = false;
>> +
>> + if (!inode_trylock(inode)) {
>> + if (iocb->ki_flags & IOCB_NOWAIT)
>> + return -EAGAIN;
>> + inode_lock(inode);
>> + }
>> +
>> + if (!ext4_dio_checks(inode)) {
>> + inode_unlock(inode);
>> + /*
>> + * Fallback to buffered IO if the operation on the
>> + * inode is not supported by direct IO.
>> + */
>> + return ext4_buffered_write_iter(iocb, from);
>> + }
>> +
>> + ret = ext4_write_checks(iocb, from);
>> + if (ret <= 0) {
>> + inode_unlock(inode);
>> + return ret;
>> + }
>> +
>> + /*
>> + * Unaligned direct AIO must be serialized among each other as
>> + * the zeroing of partial blocks of two competing unaligned
>> + * AIOs can result in data corruption.
>> + */
>> + if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
>> + !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from,
>> offset)) {
>> + unaligned_aio = true;
>> + inode_dio_wait(inode);
>> + }
>> +
>> + /*
>> + * Determine whether the IO operation will overwrite allocated
>> + * and initialized blocks. If so, check to see whether it is
>> + * possible to take the dioread_nolock path.
>> + */
>> + if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
>> + ext4_should_dioread_nolock(inode)) {
>> + overwrite = true;
>> + downgrade_write(&inode->i_rwsem);
>> + }
>> +
>> + if (offset + count > i_size_read(inode) ||
>> + offset + count > EXT4_I(inode)->i_disksize) {
>> + ext4_update_i_disksize(inode, inode->i_size);
>> + extend = true;
>> + }
>> +
>> + ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops,
>> ext4_dio_write_end_io);
>> +
>> + /*
>> + * Unaligned direct AIO must be the only IO in flight or else
>> + * any overlapping aligned IO after unaligned IO might result
>> + * in data corruption. We also need to wait here in the case
>> + * where the inode is being extended so that inode extension
>> + * routines in ext4_dio_write_end_io() are covered by the
>> + * inode_lock().
>> + */
>> + if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
>> + inode_dio_wait(inode);
So, I looked more closely into your AIO DIO write extend case here.
AFAICT, this looks good in the sense that it follows the behaviour we
used to have with __blockdev_direct_IO, which waited for AIO DIO
writes beyond EOF; the iomap framework does not do that for us. So
waiting in the case of writes beyond EOF should be the right thing to
do here for ext4 (following the legacy code).
But I would like to confirm the exact race this extend case is
protecting against here.
Since writes beyond EOF require updating the inode's i_size
(ext4_update_inode_size()), which in turn requires holding the
inode_lock in exclusive mode, we must wait in the extend case here,
even for AIO DIO writes.
Q1. Is the above understanding correct?
Q2. Or is there anything else it is protecting that I am missing? Do
we need to hold the exclusive inode lock for ext4_orphan_del() as
well?
Q3. How about XFS then?
(I do see some tricks with the IOLOCK handling in the case of
ki_pos > i_size, used to zero out the buffer space between the old
i_size and ki_pos.) But if we consider only the above case of
extending AIO DIO writes beyond EOF, XFS only takes a shared lock.
Why?
Looking into the XFS code, I see that they have both an IOLOCK and an
ILOCK. So I guess the inode->i_size update is protected by the ILOCK
(in xfs_dio_write_end_io() -> xfs_iomap_write_unwritten(), or by
ip->i_flags_lock in the non-unwritten case), while the IOLOCK covers
the IO itself and can therefore be taken in shared mode. Is this a
correct understanding for XFS?
-ritesh
>> +
>> + if (overwrite)
>> + inode_unlock_shared(inode);
>> + else
>> + inode_unlock(inode);
>> +
>> + if (ret >= 0 && iov_iter_count(from))
>> + return ext4_buffered_write_iter(iocb, from);
>> + return ret;
>> +}
>> +
>> #ifdef CONFIG_FS_DAX
>> static ssize_t
>> ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> @@ -325,15 +474,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct
>> iov_iter *from)
>> return -EAGAIN;
>> inode_lock(inode);
>> }
>> +
>> ret = ext4_write_checks(iocb, from);
>> if (ret <= 0)
>> goto out;
>> - ret = file_remove_privs(iocb->ki_filp);
>> - if (ret)
>> - goto out;
>> - ret = file_update_time(iocb->ki_filp);
>> - if (ret)
>> - goto out;
>>
>> offset = iocb->ki_pos;
>> ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
>> @@ -359,73 +503,16 @@ static ssize_t
>> ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> {
>> struct inode *inode = file_inode(iocb->ki_filp);
>> - int o_direct = iocb->ki_flags & IOCB_DIRECT;
>> - int unaligned_aio = 0;
>> - int overwrite = 0;
>> - ssize_t ret;
>>
>> if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
>> return -EIO;
>>
>> -#ifdef CONFIG_FS_DAX
>> if (IS_DAX(inode))
>> return ext4_dax_write_iter(iocb, from);
>> -#endif
>> - if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
>> - return -EOPNOTSUPP;
>>
>> - if (!inode_trylock(inode)) {
>> - if (iocb->ki_flags & IOCB_NOWAIT)
>> - return -EAGAIN;
>> - inode_lock(inode);
>> - }
>> -
>> - ret = ext4_write_checks(iocb, from);
>> - if (ret <= 0)
>> - goto out;
>> -
>> - /*
>> - * Unaligned direct AIO must be serialized among each other as
>> zeroing
>> - * of partial blocks of two competing unaligned AIOs can result
>> in data
>> - * corruption.
>> - */
>> - if (o_direct && ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
>> - !is_sync_kiocb(iocb) &&
>> - ext4_unaligned_aio(inode, from, iocb->ki_pos)) {
>> - unaligned_aio = 1;
>> - ext4_unwritten_wait(inode);
>> - }
>> -
>> - iocb->private = &overwrite;
>> - /* Check whether we do a DIO overwrite or not */
>> - if (o_direct && !unaligned_aio) {
>> - if (ext4_overwrite_io(inode, iocb->ki_pos,
>> iov_iter_count(from))) {
>> - if (ext4_should_dioread_nolock(inode))
>> - overwrite = 1;
>> - } else if (iocb->ki_flags & IOCB_NOWAIT) {
>> - ret = -EAGAIN;
>> - goto out;
>> - }
>> - }
>> -
>> - ret = __generic_file_write_iter(iocb, from);
>> - /*
>> - * Unaligned direct AIO must be the only IO in flight. Otherwise
>> - * overlapping aligned IO after unaligned might result in data
>> - * corruption.
>> - */
>> - if (ret == -EIOCBQUEUED && unaligned_aio)
>> - ext4_unwritten_wait(inode);
>> - inode_unlock(inode);
>> -
>> - if (ret > 0)
>> - ret = generic_write_sync(iocb, ret);
>> -
>> - return ret;
>> -
>> -out:
>> - inode_unlock(inode);
>> - return ret;
>> + if (iocb->ki_flags & IOCB_DIRECT)
>> + return ext4_dio_write_iter(iocb, from);
>> + return ext4_buffered_write_iter(iocb, from);
>> }
>>
>> #ifdef CONFIG_FS_DAX
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index efb184928e51..f52ad3065236 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3513,11 +3513,13 @@ static int ext4_iomap_begin(struct inode
>> *inode, loff_t offset, loff_t length,
>> }
>> }
>> } else if (flags & IOMAP_WRITE) {
>> - int dio_credits;
>> handle_t *handle;
>> - int retries = 0;
>> + int dio_credits, retries = 0, m_flags = 0;
>>
>> - /* Trim mapping request to maximum we can map at once for DIO */
>> + /*
>> + * Trim mapping request to maximum we can map at once
>> + * for DIO.
>> + */
>> if (map.m_len > DIO_MAX_BLOCKS)
>> map.m_len = DIO_MAX_BLOCKS;
>> dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
>> @@ -3533,8 +3535,30 @@ static int ext4_iomap_begin(struct inode
>> *inode, loff_t offset, loff_t length,
>> if (IS_ERR(handle))
>> return PTR_ERR(handle);
>>
>> - ret = ext4_map_blocks(handle, inode, &map,
>> - EXT4_GET_BLOCKS_CREATE_ZERO);
>> + /*
>> + * DAX and direct IO are the only two operations that
>> + * are currently supported with IOMAP_WRITE.
>> + */
>> + WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
>> + if (IS_DAX(inode))
>> + m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
>> + else if (round_down(offset, i_blocksize(inode)) >=
>> + i_size_read(inode))
>> + m_flags = EXT4_GET_BLOCKS_CREATE;
>> + else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>> + m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
>> +
>> + ret = ext4_map_blocks(handle, inode, &map, m_flags);
>> +
>> + /*
>> + * We cannot fill holes in indirect tree based inodes
>> + * as that could expose stale data in the case of a
>> + * crash. Use the magic error code to fallback to
>> + * buffered IO.
>> + */
>
> I like this comment ;)
> It helps others to understand what is really going on here.
>
>> + if (!m_flags && !ret)
>> + ret = -ENOTBLK;
>> +
>> if (ret < 0) {
>> ext4_journal_stop(handle);
>> if (ret == -ENOSPC &&
>> @@ -3544,13 +3568,14 @@ static int ext4_iomap_begin(struct inode
>> *inode, loff_t offset, loff_t length,
>> }
>>
>> /*
>> - * If we added blocks beyond i_size, we need to make sure they
>> - * will get truncated if we crash before updating i_size in
>> - * ext4_iomap_end(). For faults we don't need to do that (and
>> - * even cannot because for orphan list operations inode_lock is
>> - * required) - if we happen to instantiate block beyond i_size,
>> - * it is because we race with truncate which has already added
>> - * the inode to the orphan list.
>> + * If we added blocks beyond i_size, we need to make
>> + * sure they will get truncated if we crash before
>> + * updating the i_size. For faults we don't need to do
>> + * that (and even cannot because for orphan list
>> + * operations inode_lock is required) - if we happen
>> + * to instantiate block beyond i_size, it is because
>> + * we race with truncate which has already added the
>> + * inode to the orphan list.
>> */
>> if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
>> (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
>> @@ -3612,6 +3637,14 @@ static int ext4_iomap_begin(struct inode
>> *inode, loff_t offset, loff_t length,
>> static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t
>> length,
>> ssize_t written, unsigned flags, struct iomap *iomap)
>> {
>> + /*
>> + * Check to see whether an error occurred while writing data
>> + * out to allocated blocks. If so, return the magic error code
>> + * so that we fallback to buffered IO and reuse the blocks
>> + * that were allocated in preparation for the direct IO write.
>> + */
>> + if (flags & IOMAP_DIRECT && written == 0)
>> + return -ENOTBLK;
>> return 0;
>> }
>>
>
On Mon, Sep 09, 2019 at 01:47:28PM +0530, Ritesh Harjani wrote:
> On 9/9/19 4:49 AM, Matthew Bobrowski wrote:
> > +static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> > + ssize_t len, size_t count)
> > +{
> > + handle_t *handle;
> > + bool truncate = false;
> > + ext4_lblk_t written_blk, end_blk;
> > + int ret = 0, blkbits = inode->i_blkbits;
> > +
> > + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> > + if (IS_ERR(handle)) {
> > + ret = PTR_ERR(handle);
> > + goto orphan_del;
> > + }
> > +
> > + if (ext4_update_inode_size(inode, offset + len))
> > + ext4_mark_inode_dirty(handle, inode);
> > +
> > + /*
> > + * We may need truncate allocated but not written blocks
> > + * beyond EOF.
> > + */
> > + written_blk = ALIGN(offset + len, 1 << blkbits);
> > + end_blk = ALIGN(offset + len + count, 1 << blkbits);
>
> why add len in end_blk calculation?
> shouldn't this be like below?
> end_blk = ALIGN(offset + count, 1 << blkbits);
I don't believe that would be entirely correct. The reason is that
'end_blk' is meant to represent the last logical block which we should expect
to have used for the write operation. So, we have the 'offset' which
represents the starting point, 'len' which is the amount of data that has been
written, and 'count' which is the amount of data that we still have left over
in the 'iter', if any.
The count in the 'iter' is decremented as that data is copied from it. So if
we did use 'offset' + 'count', in the instance of a short write, we
potentially wouldn't truncate any of the allocated but not written blocks. I
guess this would hold true for the DAX code path at this point, seeing as
though for the DIO case we pass in '0'.
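To make the short write case concrete, here's a small worked example using
the same expressions as the patch (hypothetical numbers, assuming a 4k block
size; purely illustrative, not taken from the patch itself):

	/*
	 * Hypothetical short DAX write with 4k blocks: offset = 0, the
	 * caller asked for 10000 bytes and blocks may have been allocated
	 * up to byte 12288, but only len = 5000 bytes were copied, leaving
	 * count = 5000 in the iter.
	 */
	written_blk = ALIGN(0 + 5000, 4096);		/* = 8192  */
	end_blk     = ALIGN(0 + 5000 + 5000, 4096);	/* = 12288 */

	/*
	 * end_blk > written_blk, so the block that was allocated but never
	 * written gets truncated. Using ALIGN(offset + count) instead would
	 * yield 8192 == written_blk and nothing would be truncated.
	 */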
> > +/*
> > + * The inode may have been placed onto the orphan list or has had
> > + * blocks allocated beyond EOF as a result of an extension. We need to
> > + * ensure that any necessary cleanup routines are performed if the
> > + * error path has been taken for a write.
> > + */
> > +static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
> > +{
> > + int ret = 0;
>
> No need of ret anyways.
>
>
> > + handle_t *handle;
> > +
> > + if (size > i_size_read(inode))
> > + ext4_truncate_failed_write(inode);
> > +
> > + if (!list_empty(&EXT4_I(inode)->i_orphan)) {
> > + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> > + if (IS_ERR(handle)) {
> > + if (inode->i_nlink)
> > + ext4_orphan_del(NULL, inode);
> > + return PTR_ERR(handle);
> > + }
> > + if (inode->i_nlink)
> > + ext4_orphan_del(handle, inode);
> > + ext4_journal_stop(handle);
> > + }
> > + return ret;
>
> can directly `return 0;` here.
True.
--<M>--
On Mon, Sep 09, 2019 at 02:56:15PM +0530, Ritesh Harjani wrote:
> On 9/9/19 4:49 AM, Matthew Bobrowski wrote:
> > +/*
> > + * For a write that extends the inode size, ext4_dio_write_iter() will
> > + * wait for the write to complete. Consequently, operations performed
> > + * within this function are still covered by the inode_lock().
> > + */
> Maybe add a comment that on success this returns 0.
OK, can do.
> > +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size, int error,
> > + unsigned int flags)
> > +{
> > + int ret = 0;
> No need to initialize ret.
>
>
> > + loff_t offset = iocb->ki_pos;
> > + struct inode *inode = file_inode(iocb->ki_filp);
> > +
> > + if (error) {
> > + ret = ext4_handle_failed_inode_extension(inode, offset + size);
> > + return ret ? ret : error;
> > + }
> > +
> > + if (flags & IOMAP_DIO_UNWRITTEN) {
> > + ret = ext4_convert_unwritten_extents(NULL, inode,
> > + offset, size);
> > + if (ret)
> > + return ret;
> > + }
> > +
> > + if (offset + size > i_size_read(inode)) {
> > + ret = ext4_handle_inode_extension(inode, offset, size, 0);
> > + if (ret)
> > + return ret;
> > + }
> > + return ret;
> Directly return 0, since if it falls through to here it means it is a
> success case. You are anyway returning errors from the error paths above.
OK, sure.
--<M>--
On 9/10/19 3:56 PM, Matthew Bobrowski wrote:
> On Mon, Sep 09, 2019 at 01:47:28PM +0530, Ritesh Harjani wrote:
>> On 9/9/19 4:49 AM, Matthew Bobrowski wrote:
>>> +static int ext4_handle_inode_extension(struct inode *inode, loff_t offset,
>>> + ssize_t len, size_t count)
>>> +{
>>> + handle_t *handle;
>>> + bool truncate = false;
>>> + ext4_lblk_t written_blk, end_blk;
>>> + int ret = 0, blkbits = inode->i_blkbits;
>>> +
>>> + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
>>> + if (IS_ERR(handle)) {
>>> + ret = PTR_ERR(handle);
>>> + goto orphan_del;
>>> + }
>>> +
>>> + if (ext4_update_inode_size(inode, offset + len))
>>> + ext4_mark_inode_dirty(handle, inode);
>>> +
>>> + /*
>>> + * We may need truncate allocated but not written blocks
>>> + * beyond EOF.
>>> + */
>>> + written_blk = ALIGN(offset + len, 1 << blkbits);
>>> + end_blk = ALIGN(offset + len + count, 1 << blkbits);
>>
>> why add len in end_blk calculation?
>> shouldn't this be like below?
>> end_blk = ALIGN(offset + count, 1 << blkbits);
>
> I don't believe that would be entirely correct. The reason is that
> 'end_blk' is meant to represent the last logical block which we should expect
> to have used for the write operation. So, we have the 'offset' which
> represents the starting point, 'len' which is the amount of data that has been
> written, and 'count' which is the amount of data that we still have left over
> in the 'iter', if any.
>
Agreed. Yes, I see that you are passing iov_iter_count(from) as a param
after the DAX write.
> The count in the 'iter' is decremented as that data is copied from it. So if
> we did use 'offset' + 'count', in the instance of a short write, we
> potentially wouldn't truncate any of the allocated but not written blocks. I
> guess this would hold true for the DAX code path at this point, seeing as
> though for the DIO case we pass in '0'.
Agreed.
>
>>> +/*
>>> + * The inode may have been placed onto the orphan list or has had
>>> + * blocks allocated beyond EOF as a result of an extension. We need to
>>> + * ensure that any necessary cleanup routines are performed if the
>>> + * error path has been taken for a write.
>>> + */
>>> +static int ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
>>> +{
>>> + int ret = 0;
>>
>> No need of ret anyways.
>>
>>
>>> + handle_t *handle;
>>> +
>>> + if (size > i_size_read(inode))
>>> + ext4_truncate_failed_write(inode);
>>> +
>>> + if (!list_empty(&EXT4_I(inode)->i_orphan)) {
>>> + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
>>> + if (IS_ERR(handle)) {
>>> + if (inode->i_nlink)
>>> + ext4_orphan_del(NULL, inode);
>>> + return PTR_ERR(handle);
>>> + }
>>> + if (inode->i_nlink)
>>> + ext4_orphan_del(handle, inode);
>>> + ext4_journal_stop(handle);
>>> + }
>>> + return ret;
>>
>> can directly `return 0;` here.
>
> True.
>
> --<M>--
>
On Mon, Sep 09, 2019 at 08:02:13PM +0530, Ritesh Harjani wrote:
> On 9/9/19 2:56 PM, Ritesh Harjani wrote:
> > > +	ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops,
> > > ext4_dio_write_end_io);
> > > +
> > > +	/*
> > > +	 * Unaligned direct AIO must be the only IO in flight or else
> > > +	 * any overlapping aligned IO after unaligned IO might result
> > > +	 * in data corruption. We also need to wait here in the case
> > > +	 * where the inode is being extended so that inode extension
> > > +	 * routines in ext4_dio_write_end_io() are covered by the
> > > +	 * inode_lock().
> > > +	 */
> > > +	if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
> > > +		inode_dio_wait(inode);
>
> So, I looked more closely into the AIO DIO write extend case
> of yours here. AFAICT, this looks good in the sense that it follows
> the behavior we used to have before with __blockdev_direct_IO,
> in that it used to wait for AIO DIO writes beyond EOF, whereas the iomap
> framework does not. So waiting in the case of writes beyond EOF
> should be the right thing to do here for ext4 (following the legacy code).
>
> But I would like to confirm the exact race this extend case
> is protecting against here.
> Since writes beyond EOF will require an update of the inode i_size
> (ext4_update_inode_size()), which requires us to hold the inode_lock
> in exclusive mode, we must wait in the extend case here,
> even for AIO DIO writes.
>
> Q1. Is the above understanding completely correct?
Yes, that's essentially correct.
> Q2. Or is it also protecting anything else which I am missing?
No, I think that's it...
> Do we need to hold the inode lock exclusively for ext4_orphan_del() as well?
Yes, I believe so.
> Q3. How about XFS then?
> (I do see some tricks done with IOLOCK handling in the case of ki_pos > i_size
> & to zero out the buffer space between the old i_size & ki_pos).
>
> But if we talk only about the above case of extending AIO DIO writes beyond
> EOF, XFS only takes a shared lock. Why?
>
> Looking into the XFS code, I see that they have IOLOCK & ILOCK.
> So I guess for protecting the inode->i_size update they use the ILOCK (in
> xfs_dio_write_end_io() -> xfs_iomap_write_unwritten()
> or the ip->i_flags_lock in the NON-UNWRITTEN case). And for the IO part the
> IOLOCK is used, hence the IOLOCK can be taken in shared mode. Is this a
> correct understanding for XFS?
* David/Christoph/Darrick
I haven't looked at the intricate XFS locking semantics, so I can't really
comment until I've looked at the code to be perfectly honest. Perhaps asking
one of the XFS maintainers could get you the answer you're looking for on
this.
--<M>--
Hello,
A few more small things noted.
Please check once.
On 9/9/19 2:56 PM, Ritesh Harjani wrote:
>
>
> On 9/9/19 4:49 AM, Matthew Bobrowski wrote:
>> This patch introduces a new direct IO write code path implementation
>> that makes use of the iomap infrastructure.
>>
>> All direct IO write operations are now passed from the ->write_iter()
>> callback to the new function ext4_dio_write_iter(). This function is
>> responsible for calling into iomap infrastructure via
>> iomap_dio_rw(). Snippets of the direct IO code from within
>> ext4_file_write_iter(), such as checking whether the IO request is
>> unaligned asynchronous IO, or whether it will be overwriting
>> allocated and initialized blocks, have been moved out and into
>> ext4_dio_write_iter().
>>
>> The block mapping flags that are passed to ext4_map_blocks() from
>> within ext4_dio_get_block() and friends have effectively been taken
>> out and introduced within the ext4_iomap_begin(). If ext4_map_blocks()
>> happens to have instantiated blocks beyond the i_size, then we attempt
>> to place the inode onto the orphan list. Despite being able to perform
>> i_size extension checking earlier on in the direct IO code path, it
>> makes the most sense to perform this step after a successful block allocation.
>>
>> The ->end_io() callback ext4_dio_write_end_io() is responsible for
>> removing the inode from the orphan list and determining if we should
>> truncate a failed write in the case of an error. We also convert a
>> range of unwritten extents to written if IOMAP_DIO_UNWRITTEN is set
>> and perform the necessary i_size/i_disksize extension if the
>> iocb->ki_pos + dio->size > i_size_read(inode).
>>
>> In the instance of a short write, we fallback to buffered IO and
>> complete whatever is left in the 'iter'. Any blocks that may have been
>> allocated in preparation for direct IO will be reused by buffered IO,
>> so there's no issue with leaving allocated blocks beyond EOF.
>>
>> Signed-off-by: Matthew Bobrowski <[email protected]>
>
> Sorry, some minor simplification comments. I forgot to respond in the
> previous email.
>
> Otherwise looks good.
>
> Reviewed-by: Ritesh Harjani <[email protected]>
>
>
>
>> ---
>> fs/ext4/file.c | 219 +++++++++++++++++++++++++++++++++---------------
>> fs/ext4/inode.c | 57 ++++++++++---
>> 2 files changed, 198 insertions(+), 78 deletions(-)
>>
>> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
>> index 8e586198f6e6..bf22425a6a6f 100644
>> --- a/fs/ext4/file.c
>> +++ b/fs/ext4/file.c
>> @@ -29,6 +29,7 @@
>> #include <linux/pagevec.h>
>> #include <linux/uio.h>
>> #include <linux/mman.h>
>> +#include <linux/backing-dev.h>
>> #include "ext4.h"
>> #include "ext4_jbd2.h"
>> #include "xattr.h"
>> @@ -217,6 +218,14 @@ static ssize_t ext4_write_checks(struct kiocb
>> *iocb, struct iov_iter *from)
>> if (ret <= 0)
>> return ret;
>>
>> + ret = file_remove_privs(iocb->ki_filp);
>> + if (ret)
>> + return 0;
>> +
>> + ret = file_update_time(iocb->ki_filp);
>> + if (ret)
>> + return 0;
>> +
>> if (unlikely(IS_IMMUTABLE(inode)))
>> return -EPERM;
Maybe we can move this up. If the file is IMMUTABLE there is no point in
calling the above actions (file_remove_privs()/file_update_time()).
Also, why not use the file_modified() API, which does the same?
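For what it's worth, a minimal sketch of what that simplification could look
like in fs/ext4/file.c, assuming file_modified() is available to the
filesystem; this only shows the generic pieces being discussed and omits
anything else the real ext4_write_checks() may do:

static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
{
	struct inode *inode = file_inode(iocb->ki_filp);
	ssize_t ret;

	/* Bail out early, before touching timestamps or privileges. */
	if (unlikely(IS_IMMUTABLE(inode)))
		return -EPERM;

	ret = generic_write_checks(iocb, from);
	if (ret <= 0)
		return ret;

	/* file_modified() covers file_remove_privs() + file_update_time(). */
	ret = file_modified(iocb->ki_filp);
	if (ret)
		return ret;

	return iov_iter_count(from);
}

Note that this sketch propagates the file_modified() error instead of
returning 0, which is a deliberate difference from the quoted hunk.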
>>
>> @@ -234,6 +243,34 @@ static ssize_t ext4_write_checks(struct kiocb
>> *iocb, struct iov_iter *from)
>> return iov_iter_count(from);
>> }
>>
>> +static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
>> + struct iov_iter *from)
>> +{
>> + ssize_t ret;
>> + struct inode *inode = file_inode(iocb->ki_filp);
>> +
>> + if (iocb->ki_flags & IOCB_NOWAIT)
>> + return -EOPNOTSUPP;
>> +
>> + if (!inode_trylock(inode))
>> + inode_lock(inode);
Is it really necessary to check for trylock first?
We can directly call inode_lock() here.
>> +
>> + ret = ext4_write_checks(iocb, from);
>> + if (ret <= 0)
>> + goto out;
>> +
>> + current->backing_dev_info = inode_to_bdi(inode);
>> + ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
>> + current->backing_dev_info = NULL;
>> +out:
>> + inode_unlock(inode);
>> + if (likely(ret > 0)) {
>> + iocb->ki_pos += ret;
>> + ret = generic_write_sync(iocb, ret);
>> + }
>> + return ret;
>> +}
>> +
>> static int ext4_handle_inode_extension(struct inode *inode, loff_t
>> offset,
>> ssize_t len, size_t count)
>> {
>> @@ -311,6 +348,118 @@ static int
>> ext4_handle_failed_inode_extension(struct inode *inode, loff_t size)
>> return ret;
>> }
>>
>> +/*
>> + * For a write that extends the inode size, ext4_dio_write_iter() will
>> + * wait for the write to complete. Consequently, operations performed
>> + * within this function are still covered by the inode_lock().
>> + */
> Maybe add a comment that on success this returns 0.
>
>> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
>> int error,
>> + unsigned int flags)
>> +{
>> + int ret = 0;
> No need to initialize ret.
>
>
>> + loff_t offset = iocb->ki_pos;
>> + struct inode *inode = file_inode(iocb->ki_filp);
>> +
>> + if (error) {
>> + ret = ext4_handle_failed_inode_extension(inode, offset + size);
>> + return ret ? ret : error;
>> + }
>> +
>> + if (flags & IOMAP_DIO_UNWRITTEN) {
>> + ret = ext4_convert_unwritten_extents(NULL, inode,
>> + offset, size);
>> + if (ret)
>> + return ret;
>> + }
>> +
>> + if (offset + size > i_size_read(inode)) {
>> + ret = ext4_handle_inode_extension(inode, offset, size, 0);
>> + if (ret)
>> + return ret;
>> + }
>> + return ret;
> Directly return 0, since if it falls through to here it means it is a
> success case. You are anyway returning errors from the error paths above.
>
>> +}
>> +
>> +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct
>> iov_iter *from)
>> +{
>> + ssize_t ret;
>> + loff_t offset = iocb->ki_pos;
>> + size_t count = iov_iter_count(from);
>> + struct inode *inode = file_inode(iocb->ki_filp);
>> + bool extend = false, overwrite = false, unaligned_aio = false;
>> +
>> + if (!inode_trylock(inode)) {
>> + if (iocb->ki_flags & IOCB_NOWAIT)
>> + return -EAGAIN;
>> + inode_lock(inode);
>> + }
>> +
>> + if (!ext4_dio_checks(inode)) {
>> + inode_unlock(inode);
>> + /*
>> + * Fallback to buffered IO if the operation on the
>> + * inode is not supported by direct IO.
>> + */
>> + return ext4_buffered_write_iter(iocb, from);
>> + }
>> +
>> + ret = ext4_write_checks(iocb, from);
This can modify the count in iov_iter *from.
>> + if (ret <= 0) {
>> + inode_unlock(inode);
>> + return ret;
>> + }
let's recalculate count = iov_iter_count(from);
>> +
>> + /*
>> + * Unaligned direct AIO must be serialized among each other as
>> + * the zeroing of partial blocks of two competing unaligned
>> + * AIOs can result in data corruption.
>> + */
>> + if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
>> + !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from,
>> offset)) {
>> + unaligned_aio = true;
>> + inode_dio_wait(inode);
>> + }
>> +
>> + /*
>> + * Determine whether the IO operation will overwrite allocated
>> + * and initialized blocks. If so, check to see whether it is
>> + * possible to take the dioread_nolock path.
>> + */
>> + if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
count here could be the old one.
>> + ext4_should_dioread_nolock(inode)) {
>> + overwrite = true;
>> + downgrade_write(&inode->i_rwsem);
>> + }
>> +
>> + if (offset + count > i_size_read(inode) ||
>> + offset + count > EXT4_I(inode)->i_disksize) {
ditto.
>> + ext4_update_i_disksize(inode, inode->i_size);
>> + extend = true;
>> + }
>> +
>> + ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops,
>> ext4_dio_write_end_io);
>> +
>> + /*
>> + * Unaligned direct AIO must be the only IO in flight or else
>> + * any overlapping aligned IO after unaligned IO might result
>> + * in data corruption. We also need to wait here in the case
>> + * where the inode is being extended so that inode extension
>> + * routines in ext4_dio_write_end_io() are covered by the
>> + * inode_lock().
>> + */
>> + if (ret == -EIOCBQUEUED && (unaligned_aio || extend))
>> + inode_dio_wait(inode);
>> +
>> + if (overwrite)
>> + inode_unlock_shared(inode);
>> + else
>> + inode_unlock(inode);
>> +
>> + if (ret >= 0 && iov_iter_count(from))
>> + return ext4_buffered_write_iter(iocb, from);
>> + return ret;
>> +}
>> +
>> #ifdef CONFIG_FS_DAX
>> static ssize_t
>> ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> @@ -325,15 +474,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct
>> iov_iter *from)
>> return -EAGAIN;
>> inode_lock(inode);
>> }
>> +
>> ret = ext4_write_checks(iocb, from);
>> if (ret <= 0)
>> goto out;
>> - ret = file_remove_privs(iocb->ki_filp);
>> - if (ret)
>> - goto out;
>> - ret = file_update_time(iocb->ki_filp);
>> - if (ret)
>> - goto out;
>>
>> offset = iocb->ki_pos;
>> ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
>> @@ -359,73 +503,16 @@ static ssize_t
>> ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
>> {
>> struct inode *inode = file_inode(iocb->ki_filp);
>> - int o_direct = iocb->ki_flags & IOCB_DIRECT;
>> - int unaligned_aio = 0;
>> - int overwrite = 0;
>> - ssize_t ret;
>>
>> if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
>> return -EIO;
>>
>> -#ifdef CONFIG_FS_DAX
>> if (IS_DAX(inode))
>> return ext4_dax_write_iter(iocb, from);
>> -#endif
>> - if (!o_direct && (iocb->ki_flags & IOCB_NOWAIT))
>> - return -EOPNOTSUPP;
>>
>> - if (!inode_trylock(inode)) {
>> - if (iocb->ki_flags & IOCB_NOWAIT)
>> - return -EAGAIN;
>> - inode_lock(inode);
>> - }
>> -
>> - ret = ext4_write_checks(iocb, from);
>> - if (ret <= 0)
>> - goto out;
>> -
>> - /*
>> - * Unaligned direct AIO must be serialized among each other as
>> zeroing
>> - * of partial blocks of two competing unaligned AIOs can result
>> in data
>> - * corruption.
>> - */
>> - if (o_direct && ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
>> - !is_sync_kiocb(iocb) &&
>> - ext4_unaligned_aio(inode, from, iocb->ki_pos)) {
>> - unaligned_aio = 1;
>> - ext4_unwritten_wait(inode);
>> - }
>> -
>> - iocb->private = &overwrite;
>> - /* Check whether we do a DIO overwrite or not */
>> - if (o_direct && !unaligned_aio) {
>> - if (ext4_overwrite_io(inode, iocb->ki_pos,
>> iov_iter_count(from))) {
>> - if (ext4_should_dioread_nolock(inode))
>> - overwrite = 1;
>> - } else if (iocb->ki_flags & IOCB_NOWAIT) {
>> - ret = -EAGAIN;
>> - goto out;
>> - }
>> - }
>> -
>> - ret = __generic_file_write_iter(iocb, from);
>> - /*
>> - * Unaligned direct AIO must be the only IO in flight. Otherwise
>> - * overlapping aligned IO after unaligned might result in data
>> - * corruption.
>> - */
>> - if (ret == -EIOCBQUEUED && unaligned_aio)
>> - ext4_unwritten_wait(inode);
>> - inode_unlock(inode);
>> -
>> - if (ret > 0)
>> - ret = generic_write_sync(iocb, ret);
>> -
>> - return ret;
>> -
>> -out:
>> - inode_unlock(inode);
>> - return ret;
>> + if (iocb->ki_flags & IOCB_DIRECT)
>> + return ext4_dio_write_iter(iocb, from);
>> + return ext4_buffered_write_iter(iocb, from);
>> }
>>
>> #ifdef CONFIG_FS_DAX
>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>> index efb184928e51..f52ad3065236 100644
>> --- a/fs/ext4/inode.c
>> +++ b/fs/ext4/inode.c
>> @@ -3513,11 +3513,13 @@ static int ext4_iomap_begin(struct inode
>> *inode, loff_t offset, loff_t length,
>> }
>> }
>> } else if (flags & IOMAP_WRITE) {
>> - int dio_credits;
>> handle_t *handle;
>> - int retries = 0;
>> + int dio_credits, retries = 0, m_flags = 0;
>>
>> - /* Trim mapping request to maximum we can map at once for DIO */
>> + /*
>> + * Trim mapping request to maximum we can map at once
>> + * for DIO.
>> + */
>> if (map.m_len > DIO_MAX_BLOCKS)
>> map.m_len = DIO_MAX_BLOCKS;
>> dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
>> @@ -3533,8 +3535,30 @@ static int ext4_iomap_begin(struct inode
>> *inode, loff_t offset, loff_t length,
>> if (IS_ERR(handle))
>> return PTR_ERR(handle);
>>
>> - ret = ext4_map_blocks(handle, inode, &map,
>> - EXT4_GET_BLOCKS_CREATE_ZERO);
>> + /*
>> + * DAX and direct IO are the only two operations that
>> + * are currently supported with IOMAP_WRITE.
>> + */
>> + WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
>> + if (IS_DAX(inode))
>> + m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
>> + else if (round_down(offset, i_blocksize(inode)) >=
>> + i_size_read(inode))
>> + m_flags = EXT4_GET_BLOCKS_CREATE;
>> + else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
>> + m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
>> +
>> + ret = ext4_map_blocks(handle, inode, &map, m_flags);
>> +
>> + /*
>> + * We cannot fill holes in indirect tree based inodes
>> + * as that could expose stale data in the case of a
>> + * crash. Use the magic error code to fallback to
>> + * buffered IO.
>> + */
>
> I like this comment ;)
> It helps others to understand what is really going on here.
>
>> + if (!m_flags && !ret)
>> + ret = -ENOTBLK;
>> +
>> if (ret < 0) {
>> ext4_journal_stop(handle);
>> if (ret == -ENOSPC &&
>> @@ -3544,13 +3568,14 @@ static int ext4_iomap_begin(struct inode
>> *inode, loff_t offset, loff_t length,
>> }
>>
>> /*
>> - * If we added blocks beyond i_size, we need to make sure they
>> - * will get truncated if we crash before updating i_size in
>> - * ext4_iomap_end(). For faults we don't need to do that (and
>> - * even cannot because for orphan list operations inode_lock is
>> - * required) - if we happen to instantiate block beyond i_size,
>> - * it is because we race with truncate which has already added
>> - * the inode to the orphan list.
>> + * If we added blocks beyond i_size, we need to make
>> + * sure they will get truncated if we crash before
>> + * updating the i_size. For faults we don't need to do
>> + * that (and even cannot because for orphan list
>> + * operations inode_lock is required) - if we happen
>> + * to instantiate block beyond i_size, it is because
>> + * we race with truncate which has already added the
>> + * inode to the orphan list.
>> */
>> if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
>> (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
>> @@ -3612,6 +3637,14 @@ static int ext4_iomap_begin(struct inode
>> *inode, loff_t offset, loff_t length,
>> static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t
>> length,
>> ssize_t written, unsigned flags, struct iomap *iomap)
>> {
>> + /*
>> + * Check to see whether an error occurred while writing data
>> + * out to allocated blocks. If so, return the magic error code
>> + * so that we fallback to buffered IO and reuse the blocks
>> + * that were allocated in preparation for the direct IO write.
>> + */
>> + if (flags & IOMAP_DIRECT && written == 0)
>> + return -ENOTBLK;
>> return 0;
>> }
>>
>
On Wed, Sep 11, 2019 at 01:38:52PM +0530, Ritesh Harjani wrote:
> On 9/9/19 2:56 PM, Ritesh Harjani wrote:
> > On 9/9/19 4:49 AM, Matthew Bobrowski wrote:
> > > @@ -217,6 +218,14 @@ static ssize_t ext4_write_checks(struct kiocb
> > > *iocb, struct iov_iter *from)
> > >  	if (ret <= 0)
> > >  		return ret;
> > >
> > > +	ret = file_remove_privs(iocb->ki_filp);
> > > +	if (ret)
> > > +		return 0;
> > > +
> > > +	ret = file_update_time(iocb->ki_filp);
> > > +	if (ret)
> > > +		return 0;
> > > +
> > >  	if (unlikely(IS_IMMUTABLE(inode)))
> > >  		return -EPERM;
>
> Maybe we can move this up. If the file is IMMUTABLE there is no point in
> calling the above actions (file_remove_privs()/file_update_time()).
Yep, sure could do this. In fact, I think we could put this above
generic_write_checks().
> Also, why not use the file_modified() API, which does the same?
Ah, nice. Indeed we can, thanks for simplifying it.
> > > @@ -234,6 +243,34 @@ static ssize_t ext4_write_checks(struct kiocb
> > > *iocb, struct iov_iter *from)
> > >  	return iov_iter_count(from);
> > >  }
> > >
> > > +static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> > > +					struct iov_iter *from)
> > > +{
> > > +	ssize_t ret;
> > > +	struct inode *inode = file_inode(iocb->ki_filp);
> > > +
> > > +	if (iocb->ki_flags & IOCB_NOWAIT)
> > > +		return -EOPNOTSUPP;
> > > +
> > > +	if (!inode_trylock(inode))
> > > +		inode_lock(inode);
>
> Is it really necessary to check for trylock first?
> We can directly call inode_lock() here.
You're right, no need to do this dance. We can call inode_lock() directly.
> > > +
> > > +	ret = ext4_write_checks(iocb, from);
> > > +	if (ret <= 0)
> > > +		goto out;
> > > +
> > > +	current->backing_dev_info = inode_to_bdi(inode);
> > > +	ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
> > > +	current->backing_dev_info = NULL;
> > > +out:
> > > +	inode_unlock(inode);
> > > +	if (likely(ret > 0)) {
> > > +		iocb->ki_pos += ret;
> > > +		ret = generic_write_sync(iocb, ret);
> > > +	}
> > > +	return ret;
> > > +}
> > > +
> > > +	if (!ext4_dio_checks(inode)) {
> > > +		inode_unlock(inode);
> > > +		/*
> > > +		 * Fallback to buffered IO if the operation on the
> > > +		 * inode is not supported by direct IO.
> > > +		 */
> > > +		return ext4_buffered_write_iter(iocb, from);
> > > +	}
> > > +
> > > +	ret = ext4_write_checks(iocb, from);
> This can modify the count in iov_iter *from.
Good point. We'll recalculate the iter 'count' again.
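Roughly, the fix being discussed would amount to something like the snippet
below in ext4_dio_write_iter(). This is only a sketch of the idea, reusing
the variable names from the quoted patch, not the final code:

	ret = ext4_write_checks(iocb, from);
	if (ret <= 0) {
		inode_unlock(inode);
		return ret;
	}

	/*
	 * ext4_write_checks() may adjust ki_pos and shrink the iter, so
	 * refresh the local copies before the overwrite/extension checks.
	 */
	offset = iocb->ki_pos;
	count = iov_iter_count(from);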
Thank you for the review/suggestions, highly appreciated.
--<M>--