Hi all,
This is an updated version of the patch series that ports ext4 direct
I/O to the iomap infrastructure. It includes some minor updates and
fixes for issues identified in the preceding version of the series.
Changes since v6:
- Removed duplicate map->m_flags check in ext4_set_iomap(), which
also eliminated some unnecessary levels of indentation.
- Fixed an issue with the buffered I/O fallback path within
ext4_dio_write_iter(). Previously, we only returned the value that
ext4_buffered_write_iter() would return without taking into account
anything that may already have been written by the direct I/O. This
meant that we'd return incorrect values to userspace.
- Added the missing fsync + page cache invalidation for the written
I/O range after the buffered I/O fallback. This was missing from my
original patch series, but it is needed in order to preserve direct
I/O semantics.
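The return value fix in the second item can be pictured with a small
userspace sketch. The helper name toy_fallback_result() and its
parameters are hypothetical, and the policy shown (report a buffered I/O
error as-is, otherwise add the two byte counts) is an assumption about
the intent rather than a copy of ext4_dio_write_iter():

```c
#include <sys/types.h>

/*
 * Hypothetical helper: combine the bytes already written by direct I/O
 * with the result of the buffered I/O fallback.
 */
static ssize_t toy_fallback_result(ssize_t dio_written, ssize_t buffered_ret)
{
	/* Assumed policy: a failed fallback reports its error. */
	if (buffered_ret < 0)
		return buffered_ret;

	/* Otherwise userspace must see the total from both paths. */
	return dio_written + buffered_ret;
}
```

Returning only buffered_ret in the success case would under-report the
write by whatever the direct I/O had already completed.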
The original cover letter for this series has been provided below for
reference.
---
This patch series ports the ext4 direct IO paths to make use of the
iomap infrastructure. The legacy buffer_head based direct IO paths
have subsequently been removed as they're now no longer in use. The
result of this change is that the direct IO implementation is much
cleaner and keeps the code isolated from the buffer_head internals. In
addition to this, a slight performance boost could be expected while
using O_SYNC | O_DIRECT IO.
The changes have been tested using xfstests in both DAX and non-DAX
modes using various filesystem configurations i.e. 4k, dioread_nolock,
nojournal, ext3.
Matthew Bobrowski (11):
ext4: reorder map.m_flags checks within ext4_iomap_begin()
ext4: update direct I/O read lock pattern for IOCB_NOWAIT
ext4: iomap that extends beyond EOF should be marked dirty
ext4: move set iomap routines into a separate helper ext4_set_iomap()
ext4: split IOMAP_WRITE branch in ext4_iomap_begin() into helper
ext4: introduce new callback for IOMAP_REPORT
ext4: introduce direct I/O read using iomap infrastructure
ext4: move inode extension/truncate code out from ->iomap_end()
callback
ext4: move inode extension check out from ext4_iomap_alloc()
ext4: update ext4_sync_file() to not use __generic_file_fsync()
ext4: introduce direct I/O write using iomap infrastructure
fs/ext4/ext4.h | 4 +-
fs/ext4/extents.c | 11 +-
fs/ext4/file.c | 412 +++++++++++++++++++++-----
fs/ext4/fsync.c | 72 +++--
fs/ext4/inode.c | 720 +++++++++++-----------------------------------
5 files changed, 563 insertions(+), 656 deletions(-)
--
2.20.1
For the direct I/O changes that follow in this patch series, we need
to accommodate the case where the block mapping flags passed through
to ext4_map_blocks() result in m_flags having both EXT4_MAP_MAPPED and
EXT4_MAP_UNWRITTEN bits set. In order for any allocated unwritten
extents to be converted correctly in the ->end_io() handler,
iomap->type must be set to IOMAP_UNWRITTEN whenever the
EXT4_MAP_UNWRITTEN bit has been set within m_flags. This is why the
conditional statement is reordered to check EXT4_MAP_UNWRITTEN first.
This change is a no-op for DAX, as the block mapping flags passed
through to ext4_map_blocks(), i.e. EXT4_GET_BLOCKS_CREATE_ZERO, never
result in both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN being set at
once.
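As a rough illustration of why the order of the checks matters, here is
a minimal userspace sketch. The TOY_* constants and toy_classify() are
hypothetical stand-ins for the ext4/iomap definitions, not the kernel
values:

```c
/* Hypothetical stand-ins for EXT4_MAP_MAPPED / EXT4_MAP_UNWRITTEN. */
#define TOY_MAP_MAPPED		0x1u
#define TOY_MAP_UNWRITTEN	0x2u

enum toy_type {
	TOY_TYPE_NONE,		/* neither bit set */
	TOY_TYPE_MAPPED,
	TOY_TYPE_UNWRITTEN,
};

/* Check the unwritten bit first so that it wins when both bits are set. */
static enum toy_type toy_classify(unsigned int m_flags)
{
	if (m_flags & TOY_MAP_UNWRITTEN)
		return TOY_TYPE_UNWRITTEN;
	if (m_flags & TOY_MAP_MAPPED)
		return TOY_TYPE_MAPPED;
	return TOY_TYPE_NONE;
}
```

With the checks in the opposite order, a mapping carrying both bits
would be reported as mapped and the unwritten state would be lost.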
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/inode.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0d8971b819e9..e4b0722717b3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3577,10 +3577,20 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
iomap->addr = IOMAP_NULL_ADDR;
} else {
- if (map.m_flags & EXT4_MAP_MAPPED) {
- iomap->type = IOMAP_MAPPED;
- } else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
+ /*
+ * Flags passed into ext4_map_blocks() for direct I/O writes
+ * can result in m_flags having both EXT4_MAP_MAPPED and
+ * EXT4_MAP_UNWRITTEN bits set. In order for any allocated
+ * unwritten extents to be converted into written extents
+ * correctly within the ->end_io() handler, we need to ensure
+ * that the iomap->type is set appropriately. Hence the reason
+ * why we need to check whether EXT4_MAP_UNWRITTEN is set
+ * first.
+ */
+ if (map.m_flags & EXT4_MAP_UNWRITTEN) {
iomap->type = IOMAP_UNWRITTEN;
+ } else if (map.m_flags & EXT4_MAP_MAPPED) {
+ iomap->type = IOMAP_MAPPED;
} else {
WARN_ON_ONCE(1);
return -EIO;
--
2.20.1
This patch updates the lock pattern in ext4_direct_IO_read() to not
block on inode lock in cases of IOCB_NOWAIT direct I/O reads. The
locking condition implemented here is similar to that of 942491c9e6d6
("xfs: fix AIM7 regression").
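The locking pattern described above can be sketched in userspace as
follows. This is a toy model only: toy_rwlock, TOY_NOWAIT and the helper
names are hypothetical stand-ins for the inode lock, IOCB_NOWAIT and the
inode_trylock_shared()/inode_lock_shared() helpers, and the blocking
path busy-waits instead of sleeping:

```c
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy lock state: -1 means held exclusively, n >= 0 means n readers. */
struct toy_rwlock {
	atomic_int state;
};

#define TOY_NOWAIT 0x1	/* hypothetical stand-in for IOCB_NOWAIT */

static bool toy_trylock_shared(struct toy_rwlock *lock)
{
	int cur = atomic_load(&lock->state);

	while (cur >= 0) {
		/* On failure, cur is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&lock->state, &cur, cur + 1))
			return true;
	}
	return false;	/* a writer holds the lock */
}

static void toy_lock_shared(struct toy_rwlock *lock)
{
	/* Busy-wait purely for illustration; a real lock would sleep. */
	while (!toy_trylock_shared(lock))
		;
}

static void toy_unlock_shared(struct toy_rwlock *lock)
{
	atomic_fetch_sub(&lock->state, 1);
}

/* Returns 0 with the lock held shared, or -EAGAIN if it would block. */
static int toy_read_lock(struct toy_rwlock *lock, int flags)
{
	if (flags & TOY_NOWAIT) {
		if (!toy_trylock_shared(lock))
			return -EAGAIN;
	} else {
		toy_lock_shared(lock);
	}
	return 0;
}
```

The point of the pattern is that a caller which asked not to wait gets
-EAGAIN back immediately instead of sleeping behind a writer.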
Fixes: 16c54688592c ("ext4: Allow parallel DIO reads")
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/inode.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e4b0722717b3..f33fa86fff67 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3881,7 +3881,13 @@ static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter)
* writes & truncates and since we take care of writing back page cache,
* we are protected against page writeback as well.
*/
- inode_lock_shared(inode);
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!inode_trylock_shared(inode))
+ return -EAGAIN;
+ } else {
+ inode_lock_shared(inode);
+ }
+
ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,
iocb->ki_pos + count - 1);
if (ret)
--
2.20.1
This patch addresses what Dave Chinner had discovered and fixed within
commit 7684e2c4384d. This change does not have any user visible
impact for ext4 as none of the current users of ext4_iomap_begin()
that extend files depend on IOMAP_F_DIRTY.
When doing a direct IO that spans the current EOF, and there are
written blocks beyond EOF that extend beyond the current write, the
only metadata update that needs to be done is a file size extension.
However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that
there are IO completion metadata updates required, and hence we may
fail to correctly sync file size extensions made in IO completion when
O_DSYNC writes are being used and the hardware supports FUA.
Hence when setting IOMAP_F_DIRTY, we need to also take into account
whether the iomap spans the current EOF. If it does, then we need to
mark it dirty so that IO completion will call generic_write_sync() to
flush the inode size update to stable storage correctly.
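The resulting condition can be sketched as a small standalone
predicate. The function name toy_iomap_dirty() and its parameters are
hypothetical; datasync_dirty stands in for the result of
ext4_inode_datasync_dirty() and i_size for i_size_read():

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical predicate: should the mapping be flagged dirty so that
 * I/O completion flushes metadata for O_DSYNC?
 */
static bool toy_iomap_dirty(bool datasync_dirty, int64_t offset,
			    int64_t length, int64_t i_size)
{
	/* A write that ends beyond the current EOF will update the size. */
	return datasync_dirty || offset + length > i_size;
}
```

A write that ends exactly at EOF does not extend the file, so it is only
treated as dirty if the inode already has pending datasync metadata.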
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/inode.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f33fa86fff67..b422d9b8c0bd 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3565,8 +3565,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
return ret;
}
+ /*
+ * Writes that span EOF might trigger an I/O size update on completion,
+ * so consider them to be dirty for the purposes of O_DSYNC, even if
+ * there is no other metadata changes being made or are pending here.
+ */
iomap->flags = 0;
- if (ext4_inode_datasync_dirty(inode))
+ if (ext4_inode_datasync_dirty(inode) ||
+ offset + length > i_size_read(inode))
iomap->flags |= IOMAP_F_DIRTY;
iomap->bdev = inode->i_sb->s_bdev;
iomap->dax_dev = sbi->s_daxdev;
--
2.20.1
This patch introduces a new direct I/O read path which makes use of
the iomap infrastructure.
The new function ext4_dio_read_iter() is responsible for calling into
the iomap infrastructure via iomap_dio_rw(). If the read operation
performed on the inode is not supported, which is checked via
ext4_dio_supported(), then we simply fall back and complete the I/O
using buffered I/O.
The existing direct I/O read code path has been removed, as it is now
redundant.
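The fallback decision can be modelled in userspace roughly as follows.
struct toy_inode, its fields and toy_pick_read_path() are hypothetical;
the four conditions simply mirror the checks made by
ext4_dio_supported() in the diff below:

```c
#include <stdbool.h>

/* Hypothetical model of the inode properties that rule out direct I/O. */
struct toy_inode {
	bool encrypted;
	bool verity;
	bool journal_data;
	bool inline_data;
};

enum toy_read_path {
	TOY_READ_BUFFERED,
	TOY_READ_DIRECT,
};

static bool toy_dio_supported(const struct toy_inode *inode)
{
	return !inode->encrypted && !inode->verity &&
	       !inode->journal_data && !inode->inline_data;
}

/* A direct read is only honoured when the inode supports it. */
static enum toy_read_path toy_pick_read_path(const struct toy_inode *inode,
					     bool direct_requested)
{
	if (direct_requested && toy_dio_supported(inode))
		return TOY_READ_DIRECT;
	return TOY_READ_BUFFERED;
}
```

In the sketch, an unsupported inode silently takes the buffered path
even when direct I/O was requested, which matches the fallback described
above.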
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/file.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++--
fs/ext4/inode.c | 38 +---------------------------------
2 files changed, 54 insertions(+), 39 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index ab75aee3e687..440f4c6ba4ee 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -34,6 +34,52 @@
#include "xattr.h"
#include "acl.h"
+static bool ext4_dio_supported(struct inode *inode)
+{
+ if (IS_ENABLED(CONFIG_FS_ENCRYPTION) && IS_ENCRYPTED(inode))
+ return false;
+ if (fsverity_active(inode))
+ return false;
+ if (ext4_should_journal_data(inode))
+ return false;
+ if (ext4_has_inline_data(inode))
+ return false;
+ return true;
+}
+
+static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+ ssize_t ret;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!inode_trylock_shared(inode))
+ return -EAGAIN;
+ } else {
+ inode_lock_shared(inode);
+ }
+
+ if (!ext4_dio_supported(inode)) {
+ inode_unlock_shared(inode);
+ /*
+ * Fallback to buffered I/O if the operation being performed on
+ * the inode is not supported by direct I/O. The IOCB_DIRECT
+ * flag needs to be cleared here in order to ensure that the
+ * direct I/O path within generic_file_read_iter() is not
+ * taken.
+ */
+ iocb->ki_flags &= ~IOCB_DIRECT;
+ return generic_file_read_iter(iocb, to);
+ }
+
+ ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL,
+ is_sync_kiocb(iocb));
+ inode_unlock_shared(inode);
+
+ file_accessed(iocb->ki_filp);
+ return ret;
+}
+
#ifdef CONFIG_FS_DAX
static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
@@ -64,16 +110,21 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
- if (unlikely(ext4_forced_shutdown(EXT4_SB(file_inode(iocb->ki_filp)->i_sb))))
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
return -EIO;
if (!iov_iter_count(to))
return 0; /* skip atime */
#ifdef CONFIG_FS_DAX
- if (IS_DAX(file_inode(iocb->ki_filp)))
+ if (IS_DAX(inode))
return ext4_dax_read_iter(iocb, to);
#endif
+ if (iocb->ki_flags & IOCB_DIRECT)
+ return ext4_dio_read_iter(iocb, to);
+
return generic_file_read_iter(iocb, to);
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b5ba6767b276..9bd80df6b856 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -863,9 +863,6 @@ int ext4_dio_get_block(struct inode *inode, sector_t iblock,
{
/* We don't expect handle for direct IO */
WARN_ON_ONCE(ext4_journal_current_handle());
-
- if (!create)
- return _ext4_get_block(inode, iblock, bh, 0);
return ext4_get_block_trans(inode, iblock, bh, EXT4_GET_BLOCKS_CREATE);
}
@@ -3916,36 +3913,6 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
return ret;
}
-static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct address_space *mapping = iocb->ki_filp->f_mapping;
- struct inode *inode = mapping->host;
- size_t count = iov_iter_count(iter);
- ssize_t ret;
-
- /*
- * Shared inode_lock is enough for us - it protects against concurrent
- * writes & truncates and since we take care of writing back page cache,
- * we are protected against page writeback as well.
- */
- if (iocb->ki_flags & IOCB_NOWAIT) {
- if (!inode_trylock_shared(inode))
- return -EAGAIN;
- } else {
- inode_lock_shared(inode);
- }
-
- ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,
- iocb->ki_pos + count - 1);
- if (ret)
- goto out_unlock;
- ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
- iter, ext4_dio_get_block, NULL, NULL, 0);
-out_unlock:
- inode_unlock_shared(inode);
- return ret;
-}
-
static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
struct file *file = iocb->ki_filp;
@@ -3972,10 +3939,7 @@ static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
return 0;
trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
- if (iov_iter_rw(iter) == READ)
- ret = ext4_direct_IO_read(iocb, iter);
- else
- ret = ext4_direct_IO_write(iocb, iter);
+ ret = ext4_direct_IO_write(iocb, iter);
trace_ext4_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), ret);
return ret;
}
--
2.20.1
Separate the iomap field population code that is currently within
ext4_iomap_begin() into a separate helper ext4_set_iomap(). The intent
of this function is self-explanatory; the rationale behind taking this
step is to reduce the overall clutter that we currently have within
the ext4_iomap_begin() callback.
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/inode.c | 90 ++++++++++++++++++++++++++-----------------------
1 file changed, 48 insertions(+), 42 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b422d9b8c0bd..9e1ac9fe816b 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3448,10 +3448,54 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
return inode->i_state & I_DIRTY_DATASYNC;
}
+static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
+ struct ext4_map_blocks *map, loff_t offset,
+ loff_t length)
+{
+ u8 blkbits = inode->i_blkbits;
+
+ /*
+ * Writes that span EOF might trigger an I/O size update on completion,
+ * so consider them to be dirty for the purpose of O_DSYNC, even if
+ * there is no other metadata changes being made or are pending.
+ */
+ iomap->flags = 0;
+ if (ext4_inode_datasync_dirty(inode) ||
+ offset + length > i_size_read(inode))
+ iomap->flags |= IOMAP_F_DIRTY;
+
+ if (map->m_flags & EXT4_MAP_NEW)
+ iomap->flags |= IOMAP_F_NEW;
+
+ iomap->bdev = inode->i_sb->s_bdev;
+ iomap->dax_dev = EXT4_SB(inode->i_sb)->s_daxdev;
+ iomap->offset = (u64) map->m_lblk << blkbits;
+ iomap->length = (u64) map->m_len << blkbits;
+
+ /*
+ * Flags passed to ext4_map_blocks() for direct I/O writes can result
+ * in m_flags having both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits
+ * set. In order for any allocated unwritten extents to be converted
+ * into written extents correctly within the ->end_io() handler, we
+ * need to ensure that the iomap->type is set appropriately. Hence, the
+ * reason why we need to check whether the EXT4_MAP_UNWRITTEN bit has
+ * been set first.
+ */
+ if (map->m_flags & EXT4_MAP_UNWRITTEN) {
+ iomap->type = IOMAP_UNWRITTEN;
+ iomap->addr = (u64) map->m_pblk << blkbits;
+ } else if (map->m_flags & EXT4_MAP_MAPPED) {
+ iomap->type = IOMAP_MAPPED;
+ iomap->addr = (u64) map->m_pblk << blkbits;
+ } else {
+ iomap->type = IOMAP_HOLE;
+ iomap->addr = IOMAP_NULL_ADDR;
+ }
+}
+
static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap, struct iomap *srcmap)
{
- struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
unsigned int blkbits = inode->i_blkbits;
unsigned long first_block, last_block;
struct ext4_map_blocks map;
@@ -3565,47 +3609,9 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
return ret;
}
- /*
- * Writes that span EOF might trigger an I/O size update on completion,
- * so consider them to be dirty for the purposes of O_DSYNC, even if
- * there is no other metadata changes being made or are pending here.
- */
- iomap->flags = 0;
- if (ext4_inode_datasync_dirty(inode) ||
- offset + length > i_size_read(inode))
- iomap->flags |= IOMAP_F_DIRTY;
- iomap->bdev = inode->i_sb->s_bdev;
- iomap->dax_dev = sbi->s_daxdev;
- iomap->offset = (u64)first_block << blkbits;
- iomap->length = (u64)map.m_len << blkbits;
-
- if (ret == 0) {
- iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
- iomap->addr = IOMAP_NULL_ADDR;
- } else {
- /*
- * Flags passed into ext4_map_blocks() for direct I/O writes
- * can result in m_flags having both EXT4_MAP_MAPPED and
- * EXT4_MAP_UNWRITTEN bits set. In order for any allocated
- * unwritten extents to be converted into written extents
- * correctly within the ->end_io() handler, we need to ensure
- * that the iomap->type is set appropriately. Hence the reason
- * why we need to check whether EXT4_MAP_UNWRITTEN is set
- * first.
- */
- if (map.m_flags & EXT4_MAP_UNWRITTEN) {
- iomap->type = IOMAP_UNWRITTEN;
- } else if (map.m_flags & EXT4_MAP_MAPPED) {
- iomap->type = IOMAP_MAPPED;
- } else {
- WARN_ON_ONCE(1);
- return -EIO;
- }
- iomap->addr = (u64)map.m_pblk << blkbits;
- }
-
- if (map.m_flags & EXT4_MAP_NEW)
- iomap->flags |= IOMAP_F_NEW;
+ ext4_set_iomap(inode, iomap, &map, offset, length);
+ if (delalloc && iomap->type == IOMAP_HOLE)
+ iomap->type = IOMAP_DELALLOC;
return 0;
}
--
2.20.1
In preparation for implementing the iomap direct I/O modifications,
the inode extension/truncate code needs to be moved out from the
ext4_iomap_end() callback. For direct I/O, if the current code
remained, it would behave incorrectly. Updating the inode size prior
to converting unwritten extents would potentially allow a racing
direct I/O read to find unwritten extents before they have been
converted correctly.
The inode extension/truncate code now resides within a new helper
ext4_handle_inode_extension(). This function has been designed so that
it can accommodate both DAX and direct I/O extension/truncate
operations.
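The short-write check that the new helper performs can be sketched in
userspace as follows. toy_align_up() and toy_needs_truncate() are
hypothetical names, and the sketch leaves out the ext4_can_truncate()
test that the real helper also applies:

```c
#include <stdbool.h>
#include <stdint.h>

/* Round x up to a multiple of a, where a is a power of two. */
static uint64_t toy_align_up(uint64_t x, uint64_t a)
{
	return (x + a - 1) & ~(a - 1);
}

/*
 * Hypothetical check: after an extending write, are there whole blocks
 * between the end of what was written and the end of what was requested?
 */
static bool toy_needs_truncate(uint64_t offset, uint64_t written,
			       uint64_t count, unsigned int blkbits)
{
	uint64_t blksize = (uint64_t)1 << blkbits;

	return toy_align_up(offset + written, blksize) <
	       toy_align_up(offset + count, blksize);
}
```

A write that falls short within the final block needs no truncate,
because both ends round up to the same block boundary.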
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/file.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++-
fs/ext4/inode.c | 48 +-------------------------
2 files changed, 89 insertions(+), 48 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 440f4c6ba4ee..ec54fec96a81 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -33,6 +33,7 @@
#include "ext4_jbd2.h"
#include "xattr.h"
#include "acl.h"
+#include "truncate.h"
static bool ext4_dio_supported(struct inode *inode)
{
@@ -234,12 +235,95 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
return iov_iter_count(from);
}
+static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
+ ssize_t written, size_t count)
+{
+ handle_t *handle;
+ bool truncate = false;
+ u8 blkbits = inode->i_blkbits;
+ ext4_lblk_t written_blk, end_blk;
+
+ /*
+ * Note that EXT4_I(inode)->i_disksize can get extended up to
+ * inode->i_size while the I/O was running due to writeback of delalloc
+ * blocks. But, the code in ext4_iomap_alloc() is careful to use
+ * zeroed/unwritten extents if this is possible; thus we won't leave
+ * uninitialized blocks in a file even if we didn't succeed in writing
+ * as much as we intended.
+ */
+ WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
+ if (offset + count <= EXT4_I(inode)->i_disksize) {
+ /*
+ * We need to ensure that the inode is removed from the orphan
+ * list if it has been added prematurely, due to writeback of
+ * delalloc blocks.
+ */
+ if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+
+ if (IS_ERR(handle)) {
+ ext4_orphan_del(NULL, inode);
+ return PTR_ERR(handle);
+ }
+
+ ext4_orphan_del(handle, inode);
+ ext4_journal_stop(handle);
+ }
+
+ return written;
+ }
+
+ if (written < 0)
+ goto truncate;
+
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ written = PTR_ERR(handle);
+ goto truncate;
+ }
+
+ if (ext4_update_inode_size(inode, offset + written))
+ ext4_mark_inode_dirty(handle, inode);
+
+ /*
+ * We may need to truncate allocated but not written blocks beyond EOF.
+ */
+ written_blk = ALIGN(offset + written, 1 << blkbits);
+ end_blk = ALIGN(offset + count, 1 << blkbits);
+ if (written_blk < end_blk && ext4_can_truncate(inode))
+ truncate = true;
+
+ /*
+ * Remove the inode from the orphan list if it has been extended and
+ * everything went OK.
+ */
+ if (!truncate && inode->i_nlink)
+ ext4_orphan_del(handle, inode);
+ ext4_journal_stop(handle);
+
+ if (truncate) {
+truncate:
+ ext4_truncate_failed_write(inode);
+ /*
+ * If the truncate operation failed early, then the inode may
+ * still be on the orphan list. In that case, we need to try
+ * remove the inode from the in-memory linked list.
+ */
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+ }
+
+ return written;
+}
+
#ifdef CONFIG_FS_DAX
static ssize_t
ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
- struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;
+ size_t count;
+ loff_t offset;
+ struct inode *inode = file_inode(iocb->ki_filp);
if (!inode_trylock(inode)) {
if (iocb->ki_flags & IOCB_NOWAIT)
@@ -256,7 +340,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (ret)
goto out;
+ offset = iocb->ki_pos;
+ count = iov_iter_count(from);
ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
+ ret = ext4_handle_inode_extension(inode, offset, ret, count);
out:
inode_unlock(inode);
if (ret > 0)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9bd80df6b856..071a1f976aab 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3583,53 +3583,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ssize_t written, unsigned flags, struct iomap *iomap)
{
- int ret = 0;
- handle_t *handle;
- int blkbits = inode->i_blkbits;
- bool truncate = false;
-
- if (!(flags & IOMAP_WRITE) || (flags & IOMAP_FAULT))
- return 0;
-
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto orphan_del;
- }
- if (ext4_update_inode_size(inode, offset + written))
- ext4_mark_inode_dirty(handle, inode);
- /*
- * We may need to truncate allocated but not written blocks beyond EOF.
- */
- if (iomap->offset + iomap->length >
- ALIGN(inode->i_size, 1 << blkbits)) {
- ext4_lblk_t written_blk, end_blk;
-
- written_blk = (offset + written) >> blkbits;
- end_blk = (offset + length) >> blkbits;
- if (written_blk < end_blk && ext4_can_truncate(inode))
- truncate = true;
- }
- /*
- * Remove inode from orphan list if we were extending a inode and
- * everything went fine.
- */
- if (!truncate && inode->i_nlink &&
- !list_empty(&EXT4_I(inode)->i_orphan))
- ext4_orphan_del(handle, inode);
- ext4_journal_stop(handle);
- if (truncate) {
- ext4_truncate_failed_write(inode);
-orphan_del:
- /*
- * If truncate failed early the inode might still be on the
- * orphan list; we need to make sure the inode is removed from
- * the orphan list in that case.
- */
- if (inode->i_nlink)
- ext4_orphan_del(NULL, inode);
- }
- return ret;
+ return 0;
}
const struct iomap_ops ext4_iomap_ops = {
--
2.20.1
In preparation for porting the ext4 direct I/O path over to the
iomap infrastructure, split up the IOMAP_WRITE branch that's currently
within ext4_iomap_begin() into a separate helper
ext4_iomap_alloc(). This way, when we add in the necessary code for
direct I/O, we don't end up with ext4_iomap_begin() becoming a
monstrous twisty maze.
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/inode.c | 113 ++++++++++++++++++++++++++----------------------
1 file changed, 61 insertions(+), 52 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9e1ac9fe816b..b540f2903faa 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3493,6 +3493,63 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
}
}
+static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
+ unsigned int flags)
+{
+ handle_t *handle;
+ u8 blkbits = inode->i_blkbits;
+ int ret, dio_credits, retries = 0;
+
+ /*
+ * Trim the mapping request to the maximum value that we can map at
+ * once for direct I/O.
+ */
+ if (map->m_len > DIO_MAX_BLOCKS)
+ map->m_len = DIO_MAX_BLOCKS;
+ dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
+
+retry:
+ /*
+ * Either we allocate blocks and then don't get an unwritten extent, so
+ * in that case we have reserved enough credits. Or, the blocks are
+ * already allocated and unwritten. In that case, the extent conversion
+ * fits into the credits as well.
+ */
+ handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, dio_credits);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
+ ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
+ if (ret < 0)
+ goto journal_stop;
+
+ /*
+ * If we've allocated blocks beyond EOF, we need to ensure that they're
+ * truncated if we crash before updating the inode size metadata within
+ * ext4_iomap_end(). For faults, we don't need to do that (and cannot
+ * due to orphan list operations needing an inode_lock()). If we happen
+ * to instantiate blocks beyond EOF, it is because we race with a
+ * truncate operation, which already has added the inode onto the
+ * orphan list.
+ */
+ if (!(flags & IOMAP_FAULT) && map->m_lblk + map->m_len >
+ (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
+ int err;
+
+ err = ext4_orphan_add(handle, inode);
+ if (err < 0)
+ ret = err;
+ }
+
+journal_stop:
+ ext4_journal_stop(handle);
+ if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+ goto retry;
+
+ return ret;
+}
+
+
static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap, struct iomap *srcmap)
{
@@ -3553,62 +3610,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
}
}
} else if (flags & IOMAP_WRITE) {
- int dio_credits;
- handle_t *handle;
- int retries = 0;
-
- /* Trim mapping request to maximum we can map at once for DIO */
- if (map.m_len > DIO_MAX_BLOCKS)
- map.m_len = DIO_MAX_BLOCKS;
- dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
-retry:
- /*
- * Either we allocate blocks and then we don't get unwritten
- * extent so we have reserved enough credits, or the blocks
- * are already allocated and unwritten and in that case
- * extent conversion fits in the credits as well.
- */
- handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
- dio_credits);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
-
- ret = ext4_map_blocks(handle, inode, &map,
- EXT4_GET_BLOCKS_CREATE_ZERO);
- if (ret < 0) {
- ext4_journal_stop(handle);
- if (ret == -ENOSPC &&
- ext4_should_retry_alloc(inode->i_sb, &retries))
- goto retry;
- return ret;
- }
-
- /*
- * If we added blocks beyond i_size, we need to make sure they
- * will get truncated if we crash before updating i_size in
- * ext4_iomap_end(). For faults we don't need to do that (and
- * even cannot because for orphan list operations inode_lock is
- * required) - if we happen to instantiate block beyond i_size,
- * it is because we race with truncate which has already added
- * the inode to the orphan list.
- */
- if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
- (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
- int err;
-
- err = ext4_orphan_add(handle, inode);
- if (err < 0) {
- ext4_journal_stop(handle);
- return err;
- }
- }
- ext4_journal_stop(handle);
+ ret = ext4_iomap_alloc(inode, &map, flags);
} else {
ret = ext4_map_blocks(NULL, inode, &map, 0);
- if (ret < 0)
- return ret;
}
+ if (ret < 0)
+ return ret;
+
ext4_set_iomap(inode, iomap, &map, offset, length);
if (delalloc && iomap->type == IOMAP_HOLE)
iomap->type = IOMAP_DELALLOC;
--
2.20.1
Lift the inode extension/orphan list handling code out from
ext4_iomap_alloc() and apply it within ext4_dax_write_iter().
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/file.c | 24 +++++++++++++++++++++++-
fs/ext4/inode.c | 22 ----------------------
2 files changed, 23 insertions(+), 23 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index ec54fec96a81..83ef9c9ed208 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -323,6 +323,8 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
ssize_t ret;
size_t count;
loff_t offset;
+ handle_t *handle;
+ bool extend = false;
struct inode *inode = file_inode(iocb->ki_filp);
if (!inode_trylock(inode)) {
@@ -342,8 +344,28 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
offset = iocb->ki_pos;
count = iov_iter_count(from);
+
+ if (offset + count > EXT4_I(inode)->i_disksize) {
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+
+ ret = ext4_orphan_add(handle, inode);
+ if (ret) {
+ ext4_journal_stop(handle);
+ goto out;
+ }
+
+ extend = true;
+ ext4_journal_stop(handle);
+ }
+
ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
- ret = ext4_handle_inode_extension(inode, offset, ret, count);
+
+ if (extend)
+ ret = ext4_handle_inode_extension(inode, offset, ret, count);
out:
inode_unlock(inode);
if (ret > 0)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 071a1f976aab..392085aa7809 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3494,7 +3494,6 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
unsigned int flags)
{
handle_t *handle;
- u8 blkbits = inode->i_blkbits;
int ret, dio_credits, retries = 0;
/*
@@ -3517,28 +3516,7 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
return PTR_ERR(handle);
ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
- if (ret < 0)
- goto journal_stop;
-
- /*
- * If we've allocated blocks beyond EOF, we need to ensure that they're
- * truncated if we crash before updating the inode size metadata within
- * ext4_iomap_end(). For faults, we don't need to do that (and cannot
- * due to orphan list operations needing an inode_lock()). If we happen
- * to instantiate blocks beyond EOF, it is because we race with a
- * truncate operation, which already has added the inode onto the
- * orphan list.
- */
- if (!(flags & IOMAP_FAULT) && map->m_lblk + map->m_len >
- (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
- int err;
-
- err = ext4_orphan_add(handle, inode);
- if (err < 0)
- ret = err;
- }
-journal_stop:
ext4_journal_stop(handle);
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
goto retry;
--
2.20.1
When the filesystem is created without a journal, we eventually call
into __generic_file_fsync() in order to write out all the modified
in-core data to the permanent storage device. This function happens to
try and obtain an inode_lock() while synchronizing the file's buffers
and its associated metadata.
Generally, this is fine; however, it becomes a problem when there is
higher level code that has already obtained an inode_lock() as this
leads to a recursive lock situation. This case is especially true when
porting direct I/O across to the iomap infrastructure as we obtain an
inode_lock() early on in the I/O within ext4_dio_write_iter() and hold
it until the I/O has been completed. Consequently, to not run into
this specific issue, we move away from calling into
__generic_file_fsync() and perform the necessary synchronization tasks
within ext4_sync_file().
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext4/fsync.c | 72 ++++++++++++++++++++++++++++++++-----------------
1 file changed, 47 insertions(+), 25 deletions(-)
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 5508baa11bb6..e10206e7f4bb 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -80,6 +80,43 @@ static int ext4_sync_parent(struct inode *inode)
return ret;
}
+static int ext4_fsync_nojournal(struct inode *inode, bool datasync,
+ bool *needs_barrier)
+{
+ int ret, err;
+
+ ret = sync_mapping_buffers(inode->i_mapping);
+ if (!(inode->i_state & I_DIRTY_ALL))
+ return ret;
+ if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
+ return ret;
+
+ err = sync_inode_metadata(inode, 1);
+ if (!ret)
+ ret = err;
+
+ if (!ret)
+ ret = ext4_sync_parent(inode);
+ if (test_opt(inode->i_sb, BARRIER))
+ *needs_barrier = true;
+
+ return ret;
+}
+
+static int ext4_fsync_journal(struct inode *inode, bool datasync,
+ bool *needs_barrier)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ tid_t commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
+
+ if (journal->j_flags & JBD2_BARRIER &&
+ !jbd2_trans_will_send_data_barrier(journal, commit_tid))
+ *needs_barrier = true;
+
+ return jbd2_complete_transaction(journal, commit_tid);
+}
+
/*
* akpm: A new design for ext4_sync_file().
*
@@ -91,17 +128,14 @@ static int ext4_sync_parent(struct inode *inode)
* What we do is just kick off a commit and wait on it. This will snapshot the
* inode to disk.
*/
-
int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
{
- struct inode *inode = file->f_mapping->host;
- struct ext4_inode_info *ei = EXT4_I(inode);
- journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
int ret = 0, err;
- tid_t commit_tid;
bool needs_barrier = false;
+ struct inode *inode = file->f_mapping->host;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
- if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+ if (unlikely(ext4_forced_shutdown(sbi)))
return -EIO;
J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -111,23 +145,15 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
if (sb_rdonly(inode->i_sb)) {
/* Make sure that we read updated s_mount_flags value */
smp_rmb();
- if (EXT4_SB(inode->i_sb)->s_mount_flags & EXT4_MF_FS_ABORTED)
+ if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED)
ret = -EROFS;
goto out;
}
- if (!journal) {
- ret = __generic_file_fsync(file, start, end, datasync);
- if (!ret)
- ret = ext4_sync_parent(inode);
- if (test_opt(inode->i_sb, BARRIER))
- goto issue_flush;
- goto out;
- }
-
ret = file_write_and_wait_range(file, start, end);
if (ret)
return ret;
+
/*
* data=writeback,ordered:
* The caller's filemap_fdatawrite()/wait will sync the data.
@@ -142,18 +168,14 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
* (they were dirtied by commit). But that's OK - the blocks are
* safe in-journal, which is all fsync() needs to ensure.
*/
- if (ext4_should_journal_data(inode)) {
+ if (!sbi->s_journal)
+ ret = ext4_fsync_nojournal(inode, datasync, &needs_barrier);
+ else if (ext4_should_journal_data(inode))
ret = ext4_force_commit(inode->i_sb);
- goto out;
- }
+ else
+ ret = ext4_fsync_journal(inode, datasync, &needs_barrier);
- commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
- if (journal->j_flags & JBD2_BARRIER &&
- !jbd2_trans_will_send_data_barrier(journal, commit_tid))
- needs_barrier = true;
- ret = jbd2_complete_transaction(journal, commit_tid);
if (needs_barrier) {
- issue_flush:
err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
if (!ret)
ret = err;
--
2.20.1
This patch introduces a new direct I/O write path which makes use of
the iomap infrastructure.
All direct I/O writes are now passed from the ->write_iter() callback
through to the new direct I/O handler ext4_dio_write_iter(). This
function is responsible for calling into the iomap infrastructure via
iomap_dio_rw().
Code snippets from the existing direct I/O write code within
ext4_file_write_iter(), such as checking whether the I/O request is
unaligned asynchronous I/O or whether the write will result in an
overwrite, have effectively been moved out and into the new direct I/O
->write_iter() handler.
The block mapping flags that are eventually passed down to
ext4_map_blocks() from the *_get_block_*() suite of routines have been
taken out and introduced within ext4_iomap_alloc().
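
The flag selection can be sketched in userspace roughly as follows. This
is an illustrative model only: the sketch_* names are hypothetical
stand-ins for the EXT4_GET_BLOCKS_* flags and for the inode state that
ext4_iomap_alloc() consults, and the function does not call into any
kernel interface.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the EXT4_GET_BLOCKS_* mapping flags. */
enum sketch_map_flags {
	SKETCH_NONE = 0,	/* no flag chosen: fall back to buffered I/O */
	SKETCH_CREATE,		/* models EXT4_GET_BLOCKS_CREATE */
	SKETCH_CREATE_ZERO,	/* models EXT4_GET_BLOCKS_CREATE_ZERO */
	SKETCH_IO_CREATE_EXT,	/* models EXT4_GET_BLOCKS_IO_CREATE_EXT */
};

/*
 * Pick a mapping flag in the same order as the patch: DAX first, then
 * writes starting at or beyond i_size, then extent-mapped inodes.
 */
static enum sketch_map_flags
sketch_pick_map_flags(bool is_dax, uint64_t lblk, unsigned int blkbits,
		      uint64_t i_size, bool has_extents)
{
	if (is_dax)
		return SKETCH_CREATE_ZERO;
	if ((lblk << blkbits) >= i_size)
		return SKETCH_CREATE;
	if (has_extents)
		return SKETCH_IO_CREATE_EXT;
	return SKETCH_NONE;
}
```

The SKETCH_NONE result corresponds to the indirect-tree hole-filling
case, where the patch returns -ENOTBLK so that the caller falls back to
buffered I/O.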
For inode extension cases, ext4_handle_inode_extension() is
effectively the function responsible for performing such metadata
updates. This is called after iomap_dio_rw() has returned so that we
can safely determine whether we need to potentially truncate any
allocated blocks that may have been prepared for this direct I/O
write. We don't perform the inode extension, or truncate operations
from the ->end_io() handler as we don't have the original I/O 'length'
available there. The ->end_io() handler, however, is responsible for
converting allocated unwritten extents to written extents.
In the instance of a short write, we fall back and complete the
remainder of the I/O using buffered I/O via
ext4_buffered_write_iter().
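
The accounting involved in that fallback can be sketched in userspace as
below. This is a simplified model under stated assumptions: the sketch_*
helpers are hypothetical, SKETCH_PAGE_SHIFT assumes 4 KiB pages, and
plain integers stand in for the kernel's ssize_t and loff_t.

```c
#include <assert.h>

#define SKETCH_PAGE_SHIFT 12	/* assumption: 4 KiB pages */

/*
 * Combine the bytes written by direct I/O with the result of the
 * buffered fallback. A negative buffered result is an error and is
 * returned as-is; otherwise both byte counts are reported together.
 */
static long long sketch_combine_written(long long dio_written,
					long long buffered_ret)
{
	if (buffered_ret < 0)
		return buffered_ret;
	return dio_written + buffered_ret;
}

/* Page cache index holding a given byte offset. */
static unsigned long long sketch_page_index(unsigned long long byte)
{
	return byte >> SKETCH_PAGE_SHIFT;
}
```

For example, a byte range running from offset 4096 to end byte 12287
covers page indices 1 through 2, which is the range that would be
written back and then invalidated.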
The existing buffer_head direct I/O implementation has been removed as
it's now redundant.
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/ext4.h | 3 -
fs/ext4/extents.c | 11 +-
fs/ext4/file.c | 246 +++++++++++++++++++--------
fs/ext4/inode.c | 413 +++++-----------------------------------------
4 files changed, 218 insertions(+), 455 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 5c6c4acea8b1..24f79035c731 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1584,7 +1584,6 @@ enum {
EXT4_STATE_NO_EXPAND, /* No space for expansion */
EXT4_STATE_DA_ALLOC_CLOSE, /* Alloc DA blks on close */
EXT4_STATE_EXT_MIGRATE, /* Inode is migrating */
- EXT4_STATE_DIO_UNWRITTEN, /* need convert on dio done*/
EXT4_STATE_NEWENTRY, /* File just added to dir */
EXT4_STATE_MAY_INLINE_DATA, /* may have in-inode data */
EXT4_STATE_EXT_PRECACHED, /* extents have been precached */
@@ -2565,8 +2564,6 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
-int ext4_dio_get_block(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create);
int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
struct buffer_head *bh, int create);
int ext4_walk_page_buffers(handle_t *handle,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index cf6c5f64cb58..56a4cee00fb7 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1753,16 +1753,9 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
*/
if (ext1_ee_len + ext2_ee_len > EXT_INIT_MAX_LEN)
return 0;
- /*
- * The check for IO to unwritten extent is somewhat racy as we
- * increment i_unwritten / set EXT4_STATE_DIO_UNWRITTEN only after
- * dropping i_data_sem. But reserved blocks should save us in that
- * case.
- */
+
if (ext4_ext_is_unwritten(ex1) &&
- (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) ||
- atomic_read(&EXT4_I(inode)->i_unwritten) ||
- (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)))
+ ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)
return 0;
#ifdef AGGRESSIVE_TEST
if (ext1_ee_len >= 4)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 83ef9c9ed208..3a8423bec372 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -29,6 +29,7 @@
#include <linux/pagevec.h>
#include <linux/uio.h>
#include <linux/mman.h>
+#include <linux/backing-dev.h>
#include "ext4.h"
#include "ext4_jbd2.h"
#include "xattr.h"
@@ -155,13 +156,6 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
return 0;
}
-static void ext4_unwritten_wait(struct inode *inode)
-{
- wait_queue_head_t *wq = ext4_ioend_wq(inode);
-
- wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_unwritten) == 0));
-}
-
/*
* This tests whether the IO in question is block-aligned or not.
* Ext4 utilizes unwritten extents when hole-filling during direct IO, and they
@@ -214,13 +208,13 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;
+ if (unlikely(IS_IMMUTABLE(inode)))
+ return -EPERM;
+
ret = generic_write_checks(iocb, from);
if (ret <= 0)
return ret;
- if (unlikely(IS_IMMUTABLE(inode)))
- return -EPERM;
-
/*
* If we have encountered a bitmap-format file, the size limit
* is smaller than s_maxbytes, which is for extent-mapped files.
@@ -232,9 +226,42 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
return -EFBIG;
iov_iter_truncate(from, sbi->s_bitmap_maxbytes - iocb->ki_pos);
}
+
+ ret = file_modified(iocb->ki_filp);
+ if (ret)
+ return ret;
+
return iov_iter_count(from);
}
+static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
+ struct iov_iter *from)
+{
+ ssize_t ret;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return -EOPNOTSUPP;
+
+ inode_lock(inode);
+ ret = ext4_write_checks(iocb, from);
+ if (ret <= 0)
+ goto out;
+
+ current->backing_dev_info = inode_to_bdi(inode);
+ ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
+ current->backing_dev_info = NULL;
+
+out:
+ inode_unlock(inode);
+ if (likely(ret > 0)) {
+ iocb->ki_pos += ret;
+ ret = generic_write_sync(iocb, ret);
+ }
+
+ return ret;
+}
+
static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
ssize_t written, size_t count)
{
@@ -316,6 +343,139 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
return written;
}
+static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
+ int error, unsigned int flags)
+{
+ loff_t offset = iocb->ki_pos;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (error)
+ return error;
+
+ if (size && flags & IOMAP_DIO_UNWRITTEN)
+ return ext4_convert_unwritten_extents(NULL, inode,
+ offset, size);
+
+ return 0;
+}
+
+static const struct iomap_dio_ops ext4_dio_write_ops = {
+ .end_io = ext4_dio_write_end_io,
+};
+
+static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+ ssize_t ret;
+ size_t count;
+ loff_t offset;
+ handle_t *handle;
+ struct inode *inode = file_inode(iocb->ki_filp);
+ bool extend = false, overwrite = false, unaligned_aio = false;
+
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!inode_trylock(inode))
+ return -EAGAIN;
+ } else {
+ inode_lock(inode);
+ }
+
+ if (!ext4_dio_supported(inode)) {
+ inode_unlock(inode);
+ /*
+ * Fallback to buffered I/O if the inode does not support
+ * direct I/O.
+ */
+ return ext4_buffered_write_iter(iocb, from);
+ }
+
+ ret = ext4_write_checks(iocb, from);
+ if (ret <= 0) {
+ inode_unlock(inode);
+ return ret;
+ }
+
+ /*
+ * Unaligned asynchronous direct I/O must be serialized among each
+ * other as the zeroing of partial blocks of two competing unaligned
+ * asynchronous direct I/O writes can result in data corruption.
+ */
+ offset = iocb->ki_pos;
+ count = iov_iter_count(from);
+ if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
+ !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
+ unaligned_aio = true;
+ inode_dio_wait(inode);
+ }
+
+ /*
+ * Determine whether the I/O will overwrite allocated and initialized
+ * blocks. If so, check to see whether it is possible to take the
+ * dioread_nolock path.
+ */
+ if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
+ ext4_should_dioread_nolock(inode)) {
+ overwrite = true;
+ downgrade_write(&inode->i_rwsem);
+ }
+
+ if (offset + count > EXT4_I(inode)->i_disksize) {
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+
+ ret = ext4_orphan_add(handle, inode);
+ if (ret) {
+ ext4_journal_stop(handle);
+ goto out;
+ }
+
+ extend = true;
+ ext4_journal_stop(handle);
+ }
+
+ ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, &ext4_dio_write_ops,
+ is_sync_kiocb(iocb) || unaligned_aio || extend);
+
+ if (extend)
+ ret = ext4_handle_inode_extension(inode, offset, ret, count);
+
+out:
+ if (overwrite)
+ inode_unlock_shared(inode);
+ else
+ inode_unlock(inode);
+
+ if (ret >= 0 && iov_iter_count(from)) {
+ ssize_t err;
+ loff_t endbyte;
+
+ offset = iocb->ki_pos;
+ err = ext4_buffered_write_iter(iocb, from);
+ if (err < 0)
+ return err;
+
+ /*
+ * We need to ensure that the pages within the page cache for
+ * the range covered by this I/O are written to disk and
+ * invalidated. This is in attempt to preserve the expected
+ * direct I/O semantics in the case we fallback to buffered I/O
+ * to complete off the I/O request.
+ */
+ ret += err;
+ endbyte = offset + ret - 1;
+ err = filemap_write_and_wait_range(iocb->ki_filp->f_mapping,
+ offset, endbyte);
+ if (!err)
+ invalidate_mapping_pages(iocb->ki_filp->f_mapping,
+ offset >> PAGE_SHIFT,
+ endbyte >> PAGE_SHIFT);
+ }
+
+ return ret;
+}
+
#ifdef CONFIG_FS_DAX
static ssize_t
ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
@@ -332,15 +492,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
return -EAGAIN;
inode_lock(inode);
}
+
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
- ret = file_remove_privs(iocb->ki_filp);
- if (ret)
- goto out;
- ret = file_update_time(iocb->ki_filp);
- if (ret)
- goto out;
offset = iocb->ki_pos;
count = iov_iter_count(from);
@@ -378,10 +533,6 @@ static ssize_t
ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct inode *inode = file_inode(iocb->ki_filp);
- int o_direct = iocb->ki_flags & IOCB_DIRECT;
- int unaligned_aio = 0;
- int overwrite = 0;
- ssize_t ret;
if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
return -EIO;
@@ -390,59 +541,10 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (IS_DAX(inode))
return ext4_dax_write_iter(iocb, from);
#endif
+ if (iocb->ki_flags & IOCB_DIRECT)
+ return ext4_dio_write_iter(iocb, from);
- if (!inode_trylock(inode)) {
- if (iocb->ki_flags & IOCB_NOWAIT)
- return -EAGAIN;
- inode_lock(inode);
- }
-
- ret = ext4_write_checks(iocb, from);
- if (ret <= 0)
- goto out;
-
- /*
- * Unaligned direct AIO must be serialized among each other as zeroing
- * of partial blocks of two competing unaligned AIOs can result in data
- * corruption.
- */
- if (o_direct && ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
- !is_sync_kiocb(iocb) &&
- ext4_unaligned_aio(inode, from, iocb->ki_pos)) {
- unaligned_aio = 1;
- ext4_unwritten_wait(inode);
- }
-
- iocb->private = &overwrite;
- /* Check whether we do a DIO overwrite or not */
- if (o_direct && !unaligned_aio) {
- if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
- if (ext4_should_dioread_nolock(inode))
- overwrite = 1;
- } else if (iocb->ki_flags & IOCB_NOWAIT) {
- ret = -EAGAIN;
- goto out;
- }
- }
-
- ret = __generic_file_write_iter(iocb, from);
- /*
- * Unaligned direct AIO must be the only IO in flight. Otherwise
- * overlapping aligned IO after unaligned might result in data
- * corruption.
- */
- if (ret == -EIOCBQUEUED && unaligned_aio)
- ext4_unwritten_wait(inode);
- inode_unlock(inode);
-
- if (ret > 0)
- ret = generic_write_sync(iocb, ret);
-
- return ret;
-
-out:
- inode_unlock(inode);
- return ret;
+ return ext4_buffered_write_iter(iocb, from);
}
#ifdef CONFIG_FS_DAX
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 392085aa7809..c103362b9cf9 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -826,133 +826,6 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
/* Maximum number of blocks we map for direct IO at once. */
#define DIO_MAX_BLOCKS 4096
-/*
- * Get blocks function for the cases that need to start a transaction -
- * generally difference cases of direct IO and DAX IO. It also handles retries
- * in case of ENOSPC.
- */
-static int ext4_get_block_trans(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int flags)
-{
- int dio_credits;
- handle_t *handle;
- int retries = 0;
- int ret;
-
- /* Trim mapping request to maximum we can map at once for DIO */
- if (bh_result->b_size >> inode->i_blkbits > DIO_MAX_BLOCKS)
- bh_result->b_size = DIO_MAX_BLOCKS << inode->i_blkbits;
- dio_credits = ext4_chunk_trans_blocks(inode,
- bh_result->b_size >> inode->i_blkbits);
-retry:
- handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, dio_credits);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
-
- ret = _ext4_get_block(inode, iblock, bh_result, flags);
- ext4_journal_stop(handle);
-
- if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
- goto retry;
- return ret;
-}
-
-/* Get block function for DIO reads and writes to inodes without extents */
-int ext4_dio_get_block(struct inode *inode, sector_t iblock,
- struct buffer_head *bh, int create)
-{
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
- return ext4_get_block_trans(inode, iblock, bh, EXT4_GET_BLOCKS_CREATE);
-}
-
-/*
- * Get block function for AIO DIO writes when we create unwritten extent if
- * blocks are not allocated yet. The extent will be converted to written
- * after IO is complete.
- */
-static int ext4_dio_get_block_unwritten_async(struct inode *inode,
- sector_t iblock, struct buffer_head *bh_result, int create)
-{
- int ret;
-
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- ret = ext4_get_block_trans(inode, iblock, bh_result,
- EXT4_GET_BLOCKS_IO_CREATE_EXT);
-
- /*
- * When doing DIO using unwritten extents, we need io_end to convert
- * unwritten extents to written on IO completion. We allocate io_end
- * once we spot unwritten extent and store it in b_private. Generic
- * DIO code keeps b_private set and furthermore passes the value to
- * our completion callback in 'private' argument.
- */
- if (!ret && buffer_unwritten(bh_result)) {
- if (!bh_result->b_private) {
- ext4_io_end_t *io_end;
-
- io_end = ext4_init_io_end(inode, GFP_KERNEL);
- if (!io_end)
- return -ENOMEM;
- bh_result->b_private = io_end;
- ext4_set_io_unwritten_flag(inode, io_end);
- }
- set_buffer_defer_completion(bh_result);
- }
-
- return ret;
-}
-
-/*
- * Get block function for non-AIO DIO writes when we create unwritten extent if
- * blocks are not allocated yet. The extent will be converted to written
- * after IO is complete by ext4_direct_IO_write().
- */
-static int ext4_dio_get_block_unwritten_sync(struct inode *inode,
- sector_t iblock, struct buffer_head *bh_result, int create)
-{
- int ret;
-
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- ret = ext4_get_block_trans(inode, iblock, bh_result,
- EXT4_GET_BLOCKS_IO_CREATE_EXT);
-
- /*
- * Mark inode as having pending DIO writes to unwritten extents.
- * ext4_direct_IO_write() checks this flag and converts extents to
- * written.
- */
- if (!ret && buffer_unwritten(bh_result))
- ext4_set_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
-
- return ret;
-}
-
-static int ext4_dio_get_block_overwrite(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create)
-{
- int ret;
-
- ext4_debug("ext4_dio_get_block_overwrite: inode %lu, create flag %d\n",
- inode->i_ino, create);
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- ret = _ext4_get_block(inode, iblock, bh_result, 0);
- /*
- * Blocks should have been preallocated! ext4_file_write_iter() checks
- * that.
- */
- WARN_ON_ONCE(!buffer_mapped(bh_result) || buffer_unwritten(bh_result));
-
- return ret;
-}
-
-
/*
* `handle' can be NULL if create is zero
*/
@@ -3494,7 +3367,8 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
unsigned int flags)
{
handle_t *handle;
- int ret, dio_credits, retries = 0;
+ u8 blkbits = inode->i_blkbits;
+ int ret, dio_credits, m_flags = 0, retries = 0;
/*
* Trim the mapping request to the maximum value that we can map at
@@ -3515,7 +3389,33 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
if (IS_ERR(handle))
return PTR_ERR(handle);
- ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
+ /*
+ * DAX and direct I/O are the only two operations that are currently
+ * supported with IOMAP_WRITE.
+ */
+ WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
+ if (IS_DAX(inode))
+ m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
+ /*
+ * We use i_size instead of i_disksize here because delalloc writeback
+ * can complete at any point during the I/O and subsequently push the
+ * i_disksize out to i_size. This could be beyond where direct I/O is
+ * happening and thus expose allocated blocks to direct I/O reads.
+ */
+ else if ((map->m_lblk * (1 << blkbits)) >= i_size_read(inode))
+ m_flags = EXT4_GET_BLOCKS_CREATE;
+ else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+ m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
+
+ ret = ext4_map_blocks(handle, inode, map, m_flags);
+
+ /*
+ * We cannot fill holes in indirect tree based inodes as that could
+ * expose stale data in the case of a crash. Use the magic error code
+ * to fallback to buffered I/O.
+ */
+ if (!m_flags && !ret)
+ ret = -ENOTBLK;
ext4_journal_stop(handle);
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
@@ -3561,6 +3461,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ssize_t written, unsigned flags, struct iomap *iomap)
{
+ /*
+ * Check to see whether an error occurred while writing out the data to
+ * the allocated blocks. If so, return the magic error code so that we
+ * fallback to buffered I/O and attempt to complete the remainder of
+ * the I/O. Any blocks that may have been allocated in preparation for
+ * the direct I/O will be reused during buffered I/O.
+ */
+ if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
+ return -ENOTBLK;
+
return 0;
}
@@ -3637,245 +3547,6 @@ const struct iomap_ops ext4_iomap_report_ops = {
.iomap_begin = ext4_iomap_begin_report,
};
-static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
- ssize_t size, void *private)
-{
- ext4_io_end_t *io_end = private;
- struct ext4_io_end_vec *io_end_vec;
-
- /* if not async direct IO just return */
- if (!io_end)
- return 0;
-
- ext_debug("ext4_end_io_dio(): io_end 0x%p "
- "for inode %lu, iocb 0x%p, offset %llu, size %zd\n",
- io_end, io_end->inode->i_ino, iocb, offset, size);
-
- /*
- * Error during AIO DIO. We cannot convert unwritten extents as the
- * data was not written. Just clear the unwritten flag and drop io_end.
- */
- if (size <= 0) {
- ext4_clear_io_unwritten_flag(io_end);
- size = 0;
- }
- io_end_vec = ext4_alloc_io_end_vec(io_end);
- io_end_vec->offset = offset;
- io_end_vec->size = size;
- ext4_put_io_end(io_end);
-
- return 0;
-}
-
-/*
- * Handling of direct IO writes.
- *
- * For ext4 extent files, ext4 will do direct-io write even to holes,
- * preallocated extents, and those write extend the file, no need to
- * fall back to buffered IO.
- *
- * For holes, we fallocate those blocks, mark them as unwritten
- * If those blocks were preallocated, we mark sure they are split, but
- * still keep the range to write as unwritten.
- *
- * The unwritten extents will be converted to written when DIO is completed.
- * For async direct IO, since the IO may still pending when return, we
- * set up an end_io call back function, which will do the conversion
- * when async direct IO completed.
- *
- * If the O_DIRECT write will extend the file then add this inode to the
- * orphan list. So recovery will truncate it back to the original size
- * if the machine crashes during the write.
- *
- */
-static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
- struct ext4_inode_info *ei = EXT4_I(inode);
- ssize_t ret;
- loff_t offset = iocb->ki_pos;
- size_t count = iov_iter_count(iter);
- int overwrite = 0;
- get_block_t *get_block_func = NULL;
- int dio_flags = 0;
- loff_t final_size = offset + count;
- int orphan = 0;
- handle_t *handle;
-
- if (final_size > inode->i_size || final_size > ei->i_disksize) {
- /* Credits for sb + inode write */
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out;
- }
- ret = ext4_orphan_add(handle, inode);
- if (ret) {
- ext4_journal_stop(handle);
- goto out;
- }
- orphan = 1;
- ext4_update_i_disksize(inode, inode->i_size);
- ext4_journal_stop(handle);
- }
-
- BUG_ON(iocb->private == NULL);
-
- /*
- * Make all waiters for direct IO properly wait also for extent
- * conversion. This also disallows race between truncate() and
- * overwrite DIO as i_dio_count needs to be incremented under i_mutex.
- */
- inode_dio_begin(inode);
-
- /* If we do a overwrite dio, i_mutex locking can be released */
- overwrite = *((int *)iocb->private);
-
- if (overwrite)
- inode_unlock(inode);
-
- /*
- * For extent mapped files we could direct write to holes and fallocate.
- *
- * Allocated blocks to fill the hole are marked as unwritten to prevent
- * parallel buffered read to expose the stale data before DIO complete
- * the data IO.
- *
- * As to previously fallocated extents, ext4 get_block will just simply
- * mark the buffer mapped but still keep the extents unwritten.
- *
- * For non AIO case, we will convert those unwritten extents to written
- * after return back from blockdev_direct_IO. That way we save us from
- * allocating io_end structure and also the overhead of offloading
- * the extent convertion to a workqueue.
- *
- * For async DIO, the conversion needs to be deferred when the
- * IO is completed. The ext4 end_io callback function will be
- * called to take care of the conversion work. Here for async
- * case, we allocate an io_end structure to hook to the iocb.
- */
- iocb->private = NULL;
- if (overwrite)
- get_block_func = ext4_dio_get_block_overwrite;
- else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
- round_down(offset, i_blocksize(inode)) >= inode->i_size) {
- get_block_func = ext4_dio_get_block;
- dio_flags = DIO_LOCKING | DIO_SKIP_HOLES;
- } else if (is_sync_kiocb(iocb)) {
- get_block_func = ext4_dio_get_block_unwritten_sync;
- dio_flags = DIO_LOCKING;
- } else {
- get_block_func = ext4_dio_get_block_unwritten_async;
- dio_flags = DIO_LOCKING;
- }
- ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, iter,
- get_block_func, ext4_end_io_dio, NULL,
- dio_flags);
-
- if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
- EXT4_STATE_DIO_UNWRITTEN)) {
- int err;
- /*
- * for non AIO case, since the IO is already
- * completed, we could do the conversion right here
- */
- err = ext4_convert_unwritten_extents(NULL, inode,
- offset, ret);
- if (err < 0)
- ret = err;
- ext4_clear_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
- }
-
- inode_dio_end(inode);
- /* take i_mutex locking again if we do a ovewrite dio */
- if (overwrite)
- inode_lock(inode);
-
- if (ret < 0 && final_size > inode->i_size)
- ext4_truncate_failed_write(inode);
-
- /* Handle extending of i_size after direct IO write */
- if (orphan) {
- int err;
-
- /* Credits for sb + inode write */
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- /*
- * We wrote the data but cannot extend
- * i_size. Bail out. In async io case, we do
- * not return error here because we have
- * already submmitted the corresponding
- * bio. Returning error here makes the caller
- * think that this IO is done and failed
- * resulting in race with bio's completion
- * handler.
- */
- if (!ret)
- ret = PTR_ERR(handle);
- if (inode->i_nlink)
- ext4_orphan_del(NULL, inode);
-
- goto out;
- }
- if (inode->i_nlink)
- ext4_orphan_del(handle, inode);
- if (ret > 0) {
- loff_t end = offset + ret;
- if (end > inode->i_size || end > ei->i_disksize) {
- ext4_update_i_disksize(inode, end);
- if (end > inode->i_size)
- i_size_write(inode, end);
- /*
- * We're going to return a positive `ret'
- * here due to non-zero-length I/O, so there's
- * no way of reporting error returns from
- * ext4_mark_inode_dirty() to userspace. So
- * ignore it.
- */
- ext4_mark_inode_dirty(handle, inode);
- }
- }
- err = ext4_journal_stop(handle);
- if (ret == 0)
- ret = err;
- }
-out:
- return ret;
-}
-
-static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
- size_t count = iov_iter_count(iter);
- loff_t offset = iocb->ki_pos;
- ssize_t ret;
-
-#ifdef CONFIG_FS_ENCRYPTION
- if (IS_ENCRYPTED(inode) && S_ISREG(inode->i_mode))
- return 0;
-#endif
- if (fsverity_active(inode))
- return 0;
-
- /*
- * If we are doing data journalling we don't support O_DIRECT
- */
- if (ext4_should_journal_data(inode))
- return 0;
-
- /* Let buffer I/O handle the inline data case. */
- if (ext4_has_inline_data(inode))
- return 0;
-
- trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
- ret = ext4_direct_IO_write(iocb, iter);
- trace_ext4_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), ret);
- return ret;
-}
-
/*
* Pages can be marked dirty completely asynchronously from ext4's journalling
* activity. By filemap_sync_pte(), try_to_unmap_one(), etc. We cannot do
@@ -3913,7 +3584,7 @@ static const struct address_space_operations ext4_aops = {
.bmap = ext4_bmap,
.invalidatepage = ext4_invalidatepage,
.releasepage = ext4_releasepage,
- .direct_IO = ext4_direct_IO,
+ .direct_IO = noop_direct_IO,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
@@ -3930,7 +3601,7 @@ static const struct address_space_operations ext4_journalled_aops = {
.bmap = ext4_bmap,
.invalidatepage = ext4_journalled_invalidatepage,
.releasepage = ext4_releasepage,
- .direct_IO = ext4_direct_IO,
+ .direct_IO = noop_direct_IO,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
};
@@ -3946,7 +3617,7 @@ static const struct address_space_operations ext4_da_aops = {
.bmap = ext4_bmap,
.invalidatepage = ext4_invalidatepage,
.releasepage = ext4_releasepage,
- .direct_IO = ext4_direct_IO,
+ .direct_IO = noop_direct_IO,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
--
2.20.1
As part of the ext4_iomap_begin() cleanups that precede this patch, we
also split up the IOMAP_REPORT branch into a completely separate
->iomap_begin() callback named ext4_iomap_begin_report(). Again, the
rationale for this change is to reduce the overall clutter within
ext4_iomap_begin().
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/file.c | 6 ++-
fs/ext4/inode.c | 134 +++++++++++++++++++++++++++++-------------------
3 files changed, 85 insertions(+), 56 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 3616f1b0c987..5c6c4acea8b1 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3388,6 +3388,7 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
}
extern const struct iomap_ops ext4_iomap_ops;
+extern const struct iomap_ops ext4_iomap_report_ops;
static inline int ext4_buffer_uptodate(struct buffer_head *bh)
{
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 8d2bbcc2d813..ab75aee3e687 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -494,12 +494,14 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int whence)
maxbytes, i_size_read(inode));
case SEEK_HOLE:
inode_lock_shared(inode);
- offset = iomap_seek_hole(inode, offset, &ext4_iomap_ops);
+ offset = iomap_seek_hole(inode, offset,
+ &ext4_iomap_report_ops);
inode_unlock_shared(inode);
break;
case SEEK_DATA:
inode_lock_shared(inode);
- offset = iomap_seek_data(inode, offset, &ext4_iomap_ops);
+ offset = iomap_seek_data(inode, offset,
+ &ext4_iomap_report_ops);
inode_unlock_shared(inode);
break;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b540f2903faa..b5ba6767b276 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3553,74 +3553,32 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap, struct iomap *srcmap)
{
- unsigned int blkbits = inode->i_blkbits;
- unsigned long first_block, last_block;
- struct ext4_map_blocks map;
- bool delalloc = false;
int ret;
+ struct ext4_map_blocks map;
+ u8 blkbits = inode->i_blkbits;
if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
return -EINVAL;
- first_block = offset >> blkbits;
- last_block = min_t(loff_t, (offset + length - 1) >> blkbits,
- EXT4_MAX_LOGICAL_BLOCK);
-
- if (flags & IOMAP_REPORT) {
- if (ext4_has_inline_data(inode)) {
- ret = ext4_inline_data_iomap(inode, iomap);
- if (ret != -EAGAIN) {
- if (ret == 0 && offset >= iomap->length)
- ret = -ENOENT;
- return ret;
- }
- }
- } else {
- if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
- return -ERANGE;
- }
- map.m_lblk = first_block;
- map.m_len = last_block - first_block + 1;
-
- if (flags & IOMAP_REPORT) {
- ret = ext4_map_blocks(NULL, inode, &map, 0);
- if (ret < 0)
- return ret;
-
- if (ret == 0) {
- ext4_lblk_t end = map.m_lblk + map.m_len - 1;
- struct extent_status es;
-
- ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
- map.m_lblk, end, &es);
+ if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
+ return -ERANGE;
- if (!es.es_len || es.es_lblk > end) {
- /* entire range is a hole */
- } else if (es.es_lblk > map.m_lblk) {
- /* range starts with a hole */
- map.m_len = es.es_lblk - map.m_lblk;
- } else {
- ext4_lblk_t offs = 0;
+ /*
+ * Calculate the first and last logical blocks respectively.
+ */
+ map.m_lblk = offset >> blkbits;
+ map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
+ EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
- if (es.es_lblk < map.m_lblk)
- offs = map.m_lblk - es.es_lblk;
- map.m_lblk = es.es_lblk + offs;
- map.m_len = es.es_len - offs;
- delalloc = true;
- }
- }
- } else if (flags & IOMAP_WRITE) {
+ if (flags & IOMAP_WRITE)
ret = ext4_iomap_alloc(inode, &map, flags);
- } else {
+ else
ret = ext4_map_blocks(NULL, inode, &map, 0);
- }
if (ret < 0)
return ret;
ext4_set_iomap(inode, iomap, &map, offset, length);
- if (delalloc && iomap->type == IOMAP_HOLE)
- iomap->type = IOMAP_DELALLOC;
return 0;
}
@@ -3682,6 +3640,74 @@ const struct iomap_ops ext4_iomap_ops = {
.iomap_end = ext4_iomap_end,
};
+static bool ext4_iomap_is_delalloc(struct inode *inode,
+ struct ext4_map_blocks *map)
+{
+ struct extent_status es;
+ ext4_lblk_t offset = 0, end = map->m_lblk + map->m_len - 1;
+
+ ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
+ map->m_lblk, end, &es);
+
+ if (!es.es_len || es.es_lblk > end)
+ return false;
+
+ if (es.es_lblk > map->m_lblk) {
+ map->m_len = es.es_lblk - map->m_lblk;
+ return false;
+ }
+
+ offset = map->m_lblk - es.es_lblk;
+ map->m_len = es.es_len - offset;
+
+ return true;
+}
+
+static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
+ loff_t length, unsigned int flags,
+ struct iomap *iomap, struct iomap *srcmap)
+{
+ int ret;
+ bool delalloc = false;
+ struct ext4_map_blocks map;
+ u8 blkbits = inode->i_blkbits;
+
+ if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
+ return -EINVAL;
+
+ if (ext4_has_inline_data(inode)) {
+ ret = ext4_inline_data_iomap(inode, iomap);
+ if (ret != -EAGAIN) {
+ if (ret == 0 && offset >= iomap->length)
+ ret = -ENOENT;
+ return ret;
+ }
+ }
+
+ /*
+ * Calculate the first and last logical block respectively.
+ */
+ map.m_lblk = offset >> blkbits;
+ map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
+ EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
+
+ ret = ext4_map_blocks(NULL, inode, &map, 0);
+ if (ret < 0)
+ return ret;
+ if (ret == 0)
+ delalloc = ext4_iomap_is_delalloc(inode, &map);
+
+ ext4_set_iomap(inode, iomap, &map, offset, length);
+ if (delalloc && iomap->type == IOMAP_HOLE)
+ iomap->type = IOMAP_DELALLOC;
+
+ return 0;
+}
+
+const struct iomap_ops ext4_iomap_report_ops = {
+ .iomap_begin = ext4_iomap_begin_report,
+};
+
static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
ssize_t size, void *private)
{
--
2.20.1
On Tue 05-11-19 23:02:39, Matthew Bobrowski wrote:
> + if (ret >= 0 && iov_iter_count(from)) {
> + ssize_t err;
> + loff_t endbyte;
> +
> + offset = iocb->ki_pos;
> + err = ext4_buffered_write_iter(iocb, from);
> + if (err < 0)
> + return err;
> +
> + /*
> + * We need to ensure that the pages within the page cache for
> + * the range covered by this I/O are written to disk and
> + * invalidated. This is an attempt to preserve the expected
> + * direct I/O semantics in the case where we fall back to buffered
> + * I/O to complete the I/O request.
> + */
> + ret += err;
> + endbyte = offset + ret - 1;
^^ err here?
Otherwise you would write out and invalidate too much AFAICT - the 'offset'
is position just before we fall back to buffered IO. Otherwise this hunk
looks good to me.
> + err = filemap_write_and_wait_range(iocb->ki_filp->f_mapping,
> + offset, endbyte);
> + if (!err)
> + invalidate_mapping_pages(iocb->ki_filp->f_mapping,
> + offset >> PAGE_SHIFT,
> + endbyte >> PAGE_SHIFT);
> + }
> +
> + return ret;
> +}
> +
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, Nov 05, 2019 at 11:03:31PM +1100, Matthew Bobrowski wrote:
> As part of the ext4_iomap_begin() cleanups that precede this patch, we
> also split up the IOMAP_REPORT branch into a completely separate
> ->iomap_begin() callback named ext4_iomap_begin_report(). Again, the
> rationale for this change is to reduce the overall clutter within
> ext4_iomap_begin().
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> Reviewed-by: Ritesh Harjani <[email protected]>
> ---
> fs/ext4/ext4.h | 1 +
> fs/ext4/file.c | 6 ++-
> fs/ext4/inode.c | 134 +++++++++++++++++++++++++++++-------------------
> 3 files changed, 85 insertions(+), 56 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 3616f1b0c987..5c6c4acea8b1 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3388,6 +3388,7 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
> }
>
> extern const struct iomap_ops ext4_iomap_ops;
> +extern const struct iomap_ops ext4_iomap_report_ops;
>
> static inline int ext4_buffer_uptodate(struct buffer_head *bh)
> {
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 8d2bbcc2d813..ab75aee3e687 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -494,12 +494,14 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int whence)
> maxbytes, i_size_read(inode));
> case SEEK_HOLE:
> inode_lock_shared(inode);
> - offset = iomap_seek_hole(inode, offset, &ext4_iomap_ops);
> + offset = iomap_seek_hole(inode, offset,
> + &ext4_iomap_report_ops);
> inode_unlock_shared(inode);
> break;
> case SEEK_DATA:
> inode_lock_shared(inode);
> - offset = iomap_seek_data(inode, offset, &ext4_iomap_ops);
> + offset = iomap_seek_data(inode, offset,
> + &ext4_iomap_report_ops);
> inode_unlock_shared(inode);
> break;
> }
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b540f2903faa..b5ba6767b276 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3553,74 +3553,32 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> unsigned flags, struct iomap *iomap, struct iomap *srcmap)
> {
> - unsigned int blkbits = inode->i_blkbits;
> - unsigned long first_block, last_block;
> - struct ext4_map_blocks map;
> - bool delalloc = false;
> int ret;
> + struct ext4_map_blocks map;
> + u8 blkbits = inode->i_blkbits;
>
> if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> return -EINVAL;
> - first_block = offset >> blkbits;
> - last_block = min_t(loff_t, (offset + length - 1) >> blkbits,
> - EXT4_MAX_LOGICAL_BLOCK);
> -
> - if (flags & IOMAP_REPORT) {
> - if (ext4_has_inline_data(inode)) {
> - ret = ext4_inline_data_iomap(inode, iomap);
> - if (ret != -EAGAIN) {
> - if (ret == 0 && offset >= iomap->length)
> - ret = -ENOENT;
> - return ret;
> - }
> - }
> - } else {
> - if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> - return -ERANGE;
> - }
>
> - map.m_lblk = first_block;
> - map.m_len = last_block - first_block + 1;
> -
> - if (flags & IOMAP_REPORT) {
> - ret = ext4_map_blocks(NULL, inode, &map, 0);
> - if (ret < 0)
> - return ret;
> -
> - if (ret == 0) {
> - ext4_lblk_t end = map.m_lblk + map.m_len - 1;
> - struct extent_status es;
> -
> - ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
> - map.m_lblk, end, &es);
> + if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> + return -ERANGE;
>
> - if (!es.es_len || es.es_lblk > end) {
> - /* entire range is a hole */
> - } else if (es.es_lblk > map.m_lblk) {
> - /* range starts with a hole */
> - map.m_len = es.es_lblk - map.m_lblk;
> - } else {
> - ext4_lblk_t offs = 0;
> + /*
> + * Calculate the first and last logical blocks respectively.
> + */
> + map.m_lblk = offset >> blkbits;
> + map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> + EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
>
> - if (es.es_lblk < map.m_lblk)
> - offs = map.m_lblk - es.es_lblk;
> - map.m_lblk = es.es_lblk + offs;
> - map.m_len = es.es_len - offs;
> - delalloc = true;
> - }
> - }
> - } else if (flags & IOMAP_WRITE) {
> + if (flags & IOMAP_WRITE)
> ret = ext4_iomap_alloc(inode, &map, flags);
FWIW you could even split non-buffered read and write into separate iomap
ops and avoid this split... but that's a cleanup that can wait until
after the main series lands.
> - } else {
> + else
> ret = ext4_map_blocks(NULL, inode, &map, 0);
> - }
>
> if (ret < 0)
> return ret;
>
> ext4_set_iomap(inode, iomap, &map, offset, length);
> - if (delalloc && iomap->type == IOMAP_HOLE)
> - iomap->type = IOMAP_DELALLOC;
>
> return 0;
> }
> @@ -3682,6 +3640,74 @@ const struct iomap_ops ext4_iomap_ops = {
> .iomap_end = ext4_iomap_end,
> };
>
> +static bool ext4_iomap_is_delalloc(struct inode *inode,
> + struct ext4_map_blocks *map)
> +{
> + struct extent_status es;
> + ext4_lblk_t offset = 0, end = map->m_lblk + map->m_len - 1;
> +
> + ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
> + map->m_lblk, end, &es);
> +
> + if (!es.es_len || es.es_lblk > end)
> + return false;
> +
> + if (es.es_lblk > map->m_lblk) {
> + map->m_len = es.es_lblk - map->m_lblk;
> + return false;
> + }
> +
> + offset = map->m_lblk - es.es_lblk;
> + map->m_len = es.es_len - offset;
> +
> + return true;
> +}
> +
> +static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
> + loff_t length, unsigned int flags,
> + struct iomap *iomap, struct iomap *srcmap)
> +{
> + int ret;
> + bool delalloc = false;
> + struct ext4_map_blocks map;
> + u8 blkbits = inode->i_blkbits;
> +
> + if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> + return -EINVAL;
> +
> + if (ext4_has_inline_data(inode)) {
> + ret = ext4_inline_data_iomap(inode, iomap);
> + if (ret != -EAGAIN) {
> + if (ret == 0 && offset >= iomap->length)
> + ret = -ENOENT;
> + return ret;
> + }
> + }
> +
> + /*
> + * Calculate the first and last logical block respectively.
> + */
> + map.m_lblk = offset >> blkbits;
> + map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> + EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> +
> + ret = ext4_map_blocks(NULL, inode, &map, 0);
> + if (ret < 0)
> + return ret;
> + if (ret == 0)
> + delalloc = ext4_iomap_is_delalloc(inode, &map);
If you can tell that a mapping is delalloc from @inode and @map, how
about pushing the ext4_iomap_is_delalloc call into ext4_set_iomap?
Oh, humm, the _is_delalloc function isn't a predicate after all; it
modifies @map. Urrk.
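
To make the side effect concrete, here is a minimal userspace sketch of the same control flow. It assumes the delayed extent has already been looked up (the real code gets it from ext4_es_find_extent_range()) and that any extent found ends at or after map->m_lblk. The toy_* types and names are stand-ins made up for illustration; they are not the ext4 structures.

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified stand-ins for the ext4 structures; only the fields used here. */
struct toy_map {
	uint32_t m_lblk;	/* first logical block of the queried range */
	uint32_t m_len;		/* length of the range in blocks */
};

struct toy_extent {
	uint32_t es_lblk;	/* first logical block of the delayed extent */
	uint32_t es_len;	/* extent length in blocks; 0 means none found */
};

/*
 * Mirrors the control flow of ext4_iomap_is_delalloc() from the patch.
 * Returns true when the range starts inside a delayed extent, and trims
 * map->m_len as a side effect in two of the three cases.
 */
static bool toy_is_delalloc(struct toy_map *map, const struct toy_extent *es)
{
	uint32_t end = map->m_lblk + map->m_len - 1;

	/* No delayed extent inside the range: the entire range is a hole. */
	if (!es->es_len || es->es_lblk > end)
		return false;

	/* Range starts with a hole: trim it to stop where the extent begins. */
	if (es->es_lblk > map->m_lblk) {
		map->m_len = es->es_lblk - map->m_lblk;
		return false;
	}

	/* Range starts inside the extent: trim it to the extent's remainder. */
	map->m_len = es->es_len - (map->m_lblk - es->es_lblk);
	return true;
}
```

With a map of blocks 10..17, a delayed extent starting at block 14 trims m_len to 4 and returns false, while a delayed extent covering blocks 8..12 trims m_len to 3 and returns true. So the return value alone doesn't tell the caller everything that happened to @map.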
--D
> +
> + ext4_set_iomap(inode, iomap, &map, offset, length);
> + if (delalloc && iomap->type == IOMAP_HOLE)
> + iomap->type = IOMAP_DELALLOC;
> +
> + return 0;
> +}
> +
> +const struct iomap_ops ext4_iomap_report_ops = {
> + .iomap_begin = ext4_iomap_begin_report,
> +};
> +
> static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
> ssize_t size, void *private)
> {
> --
> 2.20.1
>
On Tue, Nov 05, 2019 at 11:01:51PM +1100, Matthew Bobrowski wrote:
> In preparation for implementing the iomap direct I/O modifications,
> the inode extension/truncate code needs to be moved out from the
> ext4_iomap_end() callback. For direct I/O, if the current code
> remained, it would behave incorrectly. Updating the inode size prior
> to converting unwritten extents would potentially allow a racing
> direct I/O read to find unwritten extents before they have been
> converted correctly.
>
> The inode extension/truncate code now resides within a new helper
> ext4_handle_inode_extension(). This function has been designed so that
> it can accommodate both DAX and direct I/O extension/truncate
> operations.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> Reviewed-by: Ritesh Harjani <[email protected]>
> ---
> fs/ext4/file.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++-
> fs/ext4/inode.c | 48 +-------------------------
> 2 files changed, 89 insertions(+), 48 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 440f4c6ba4ee..ec54fec96a81 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -33,6 +33,7 @@
> #include "ext4_jbd2.h"
> #include "xattr.h"
> #include "acl.h"
> +#include "truncate.h"
>
> static bool ext4_dio_supported(struct inode *inode)
> {
> @@ -234,12 +235,95 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> return iov_iter_count(from);
> }
>
> +static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> + ssize_t written, size_t count)
> +{
> + handle_t *handle;
> + bool truncate = false;
> + u8 blkbits = inode->i_blkbits;
> + ext4_lblk_t written_blk, end_blk;
> +
> + /*
> + * Note that EXT4_I(inode)->i_disksize can get extended up to
> + * inode->i_size while the I/O was running due to writeback of delalloc
> + * blocks. But, the code in ext4_iomap_alloc() is careful to use
> + * zeroed/unwritten extents if this is possible; thus we won't leave
> + * uninitialized blocks in a file even if we didn't succeed in writing
> + * as much as we intended.
> + */
> + WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
> + if (offset + count <= EXT4_I(inode)->i_disksize) {
> + /*
> + * We need to ensure that the inode is removed from the orphan
> + * list if it has been added prematurely, due to writeback of
> + * delalloc blocks.
> + */
> + if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
> + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> +
> + if (IS_ERR(handle)) {
> + ext4_orphan_del(NULL, inode);
> + return PTR_ERR(handle);
> + }
> +
> + ext4_orphan_del(handle, inode);
> + ext4_journal_stop(handle);
I keep seeing this chunk (and the ext4_orphan_add chunk) bouncing around
through this patchset, which causes me to wonder -- would it be useful
to refactor these into small helpers? Or is it really just the same two
orphan_add/del chunks bouncing around multiple places?
--D
> + }
> +
> + return written;
> + }
> +
> + if (written < 0)
> + goto truncate;
> +
> + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> + if (IS_ERR(handle)) {
> + written = PTR_ERR(handle);
> + goto truncate;
> + }
> +
> + if (ext4_update_inode_size(inode, offset + written))
> + ext4_mark_inode_dirty(handle, inode);
> +
> + /*
> + * We may need to truncate allocated but not written blocks beyond EOF.
> + */
> + written_blk = ALIGN(offset + written, 1 << blkbits);
> + end_blk = ALIGN(offset + count, 1 << blkbits);
> + if (written_blk < end_blk && ext4_can_truncate(inode))
> + truncate = true;
> +
> + /*
> + * Remove the inode from the orphan list if it has been extended and
> + * everything went OK.
> + */
> + if (!truncate && inode->i_nlink)
> + ext4_orphan_del(handle, inode);
> + ext4_journal_stop(handle);
> +
> + if (truncate) {
> +truncate:
> + ext4_truncate_failed_write(inode);
> + /*
> + * If the truncate operation failed early, then the inode may
> + * still be on the orphan list. In that case, we need to try
> + * to remove the inode from the in-memory linked list.
> + */
> + if (inode->i_nlink)
> + ext4_orphan_del(NULL, inode);
> + }
> +
> + return written;
> +}
> +
> #ifdef CONFIG_FS_DAX
> static ssize_t
> ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
> - struct inode *inode = file_inode(iocb->ki_filp);
> ssize_t ret;
> + size_t count;
> + loff_t offset;
> + struct inode *inode = file_inode(iocb->ki_filp);
>
> if (!inode_trylock(inode)) {
> if (iocb->ki_flags & IOCB_NOWAIT)
> @@ -256,7 +340,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> if (ret)
> goto out;
>
> + offset = iocb->ki_pos;
> + count = iov_iter_count(from);
> ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
> + ret = ext4_handle_inode_extension(inode, offset, ret, count);
> out:
> inode_unlock(inode);
> if (ret > 0)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 9bd80df6b856..071a1f976aab 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3583,53 +3583,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> ssize_t written, unsigned flags, struct iomap *iomap)
> {
> - int ret = 0;
> - handle_t *handle;
> - int blkbits = inode->i_blkbits;
> - bool truncate = false;
> -
> - if (!(flags & IOMAP_WRITE) || (flags & IOMAP_FAULT))
> - return 0;
> -
> - handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> - if (IS_ERR(handle)) {
> - ret = PTR_ERR(handle);
> - goto orphan_del;
> - }
> - if (ext4_update_inode_size(inode, offset + written))
> - ext4_mark_inode_dirty(handle, inode);
> - /*
> - * We may need to truncate allocated but not written blocks beyond EOF.
> - */
> - if (iomap->offset + iomap->length >
> - ALIGN(inode->i_size, 1 << blkbits)) {
> - ext4_lblk_t written_blk, end_blk;
> -
> - written_blk = (offset + written) >> blkbits;
> - end_blk = (offset + length) >> blkbits;
> - if (written_blk < end_blk && ext4_can_truncate(inode))
> - truncate = true;
> - }
> - /*
> - * Remove inode from orphan list if we were extending a inode and
> - * everything went fine.
> - */
> - if (!truncate && inode->i_nlink &&
> - !list_empty(&EXT4_I(inode)->i_orphan))
> - ext4_orphan_del(handle, inode);
> - ext4_journal_stop(handle);
> - if (truncate) {
> - ext4_truncate_failed_write(inode);
> -orphan_del:
> - /*
> - * If truncate failed early the inode might still be on the
> - * orphan list; we need to make sure the inode is removed from
> - * the orphan list in that case.
> - */
> - if (inode->i_nlink)
> - ext4_orphan_del(NULL, inode);
> - }
> - return ret;
> + return 0;
> }
>
> const struct iomap_ops ext4_iomap_ops = {
> --
> 2.20.1
>
On Tue, Nov 05, 2019 at 11:02:39PM +1100, Matthew Bobrowski wrote:
> + ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, &ext4_dio_write_ops,
> + is_sync_kiocb(iocb) || unaligned_aio || extend);
> +
> + if (extend)
> + ret = ext4_handle_inode_extension(inode, offset, ret, count);
> +
Can we do a slight optimization here like this?
ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, &ext4_dio_write_ops,
is_sync_kiocb(iocb) || unaligned_aio || extend);
if (extend && ret != -EIOCBQUEUED)
ret = ext4_handle_inode_extension(inode, offset, ret, count);
If iomap_dio_rw() returns -EIOCBQUEUED, there's no need to do any of
the ext4_handle_inode_extension --- in particular, there's no need to
call ext4_truncate_failed_write(), which has a bunch of extra
overhead, including taking and releasing i_data_sem.
- Ted
On Tue, Nov 05, 2019 at 02:59:32PM +0100, Jan Kara wrote:
> On Tue 05-11-19 23:02:39, Matthew Bobrowski wrote:
> > + if (ret >= 0 && iov_iter_count(from)) {
> > + ssize_t err;
> > + loff_t endbyte;
> > +
> > + offset = iocb->ki_pos;
> > + err = ext4_buffered_write_iter(iocb, from);
> > + if (err < 0)
> > + return err;
> > +
> > + /*
> > + * We need to ensure that the pages within the page cache for
> > + * the range covered by this I/O are written to disk and
> > + * invalidated. This is an attempt to preserve the expected
> > + * direct I/O semantics in the case where we fall back to buffered
> > + * I/O to complete the I/O request.
> > + */
> > + ret += err;
> > + endbyte = offset + ret - 1;
> ^^ err here?
>
> Otherwise you would write out and invalidate too much AFAICT - the 'offset'
> is position just before we fall back to buffered IO. Otherwise this hunk
> looks good to me.
Er, yes. That's right, it should rather be 'err' instead or else we
would write/invalidate too much. I actually had this originally, but I
must've muddled it up while rewriting this patch on my other computer.
Thanks for picking that up!
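
For reference, a minimal userspace sketch of the range arithmetic being discussed. It assumes the fix is simply to compute the end byte from 'err' (the bytes written by the buffered fallback) rather than from the accumulated 'ret'. The struct and helper names below are made up for illustration and are not part of the patch.

```c
/* Hypothetical type: an inclusive byte range in the file. */
struct byte_range {
	long long start;
	long long end;
};

/*
 * Range that needs writeback + invalidation after the buffered fallback.
 * 'offset' is the file position just before falling back to buffered I/O,
 * 'buffered' is the (positive) byte count the buffered write returned.
 */
static struct byte_range fallback_range(long long offset, long long buffered)
{
	struct byte_range r = { offset, offset + buffered - 1 };

	return r;
}

/*
 * End byte as computed in the posted hunk, where the accumulated total
 * (direct I/O bytes + buffered bytes) is added to 'offset' instead.
 */
static long long overwide_end(long long offset, long long dio_written,
			      long long buffered)
{
	return offset + (dio_written + buffered) - 1;
}
```

For a write at position 0 where 4096 bytes complete via direct I/O and the remaining 8192 via buffered I/O, 'offset' is 4096 at the fallback point. fallback_range(4096, 8192) covers bytes 4096..12287, whereas overwide_end(4096, 4096, 8192) is 16383, i.e. 4096 bytes past what the buffered write touched.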
/M
On Wed, Nov 06, 2019 at 07:32:00AM +1100, Matthew Bobrowski wrote:
> > Otherwise you would write out and invalidate too much AFAICT - the 'offset'
> > is position just before we fall back to buffered IO. Otherwise this hunk
> > looks good to me.
>
> Er, yes. That's right, it should rather be 'err' instead or else we
> would write/invalidate too much. I actually had this originally, but I
> must've muddled it up while rewriting this patch on my other computer.
>
> Thanks for picking that up!
I can fix that up in my tree, unless there are any other changes that
we need to make.
- Ted
On Tue, Nov 05, 2019 at 11:28:55AM -0500, Theodore Y. Ts'o wrote:
> On Tue, Nov 05, 2019 at 11:02:39PM +1100, Matthew Bobrowski wrote:
> > + ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, &ext4_dio_write_ops,
> > + is_sync_kiocb(iocb) || unaligned_aio || extend);
> > +
> > + if (extend)
> > + ret = ext4_handle_inode_extension(inode, offset, ret, count);
> > +
>
> Can we do a slight optimization here like this?
>
> ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, &ext4_dio_write_ops,
> is_sync_kiocb(iocb) || unaligned_aio || extend);
>
> if (extend && ret != -EIOCBQUEUED)
> ret = ext4_handle_inode_extension(inode, offset, ret, count);
>
>
> If iomap_dio_rw() returns -EIOCBQUEUED, there's no need to do any of
> the ext4_handle_inode_extension --- in particular, there's no need to
> call ext4_truncate_failed_write(), which has a bunch of extra
> overhead, including taking and releasing i_data_sem.
Hm, but for extension, unaligned asynchronous IO, or synchronous IO
cases, 'wait_for_completion' within iomap_dio_rw() is set to true and
as a result we'd never expect to receive -EIOCBQUEUED from
iomap_dio_rw()?
So, with that said, would the above change be necessary, seeing as
we'd never expect ret == -EIOCBQUEUED when extend == true?
Maybe I'm missing something?
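
To spell out the reasoning, here is a toy userspace model. It assumes iomap_dio_rw() only reports "queued" when I/O is still in flight and the caller did not ask to wait, as described above. The toy_* names and the TOY_EIOCBQUEUED constant are placeholders for illustration, not the kernel's definitions.

```c
#include <stdbool.h>

/* Placeholder error value for illustration only. */
#define TOY_EIOCBQUEUED 529

/* The wait flag as computed in the quoted hunk. */
static bool toy_wait_flag(bool sync_kiocb, bool unaligned_aio, bool extend)
{
	return sync_kiocb || unaligned_aio || extend;
}

/*
 * Toy model of the tail of a direct I/O submission: if I/O is still in
 * flight and the caller did not ask to wait, report "queued"; otherwise
 * the final result is returned synchronously.
 */
static long toy_dio_rw(bool io_in_flight, bool wait_for_completion,
		       long result)
{
	if (io_in_flight && !wait_for_completion)
		return -TOY_EIOCBQUEUED;
	return result;
}
```

Under that assumption, whenever extend is true the wait flag is true, so toy_dio_rw() can never hand back -TOY_EIOCBQUEUED on the extending path. The extra check would only matter if the wait condition were ever relaxed.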
/M
On Tue, Nov 05, 2019 at 03:53:03PM -0500, Theodore Y. Ts'o wrote:
> On Wed, Nov 06, 2019 at 07:32:00AM +1100, Matthew Bobrowski wrote:
> > > Otherwise you would write out and invalidate too much AFAICT - the 'offset'
> > > is position just before we fall back to buffered IO. Otherwise this hunk
> > > looks good to me.
> >
> > Er, yes. That's right, it should rather be 'err' instead or else we
> > would write/invalidate too much. I actually had this originally, but I
> > must've muddled it up while rewriting this patch on my other computer.
> >
> > Thanks for picking that up!
>
> I can fix that up in my tree, unless there are any other changes that
> we need to make.
If you could, that would be super awesome as I don't really see
anything else changing in this series. I'll probably send through some
minor optimisations/refactoring cleanups after this series lands, but
that can come at a later point.
/M
On Tue, Nov 05, 2019 at 07:49:50AM -0800, Darrick J. Wong wrote:
> On Tue, Nov 05, 2019 at 11:01:51PM +1100, Matthew Bobrowski wrote:
> > In preparation for implementing the iomap direct I/O modifications,
> > the inode extension/truncate code needs to be moved out from the
> > ext4_iomap_end() callback. For direct I/O, if the current code
> > remained, it would behave incorrectly. Updating the inode size prior
> > to converting unwritten extents would potentially allow a racing
> > direct I/O read to find unwritten extents before they have been
> > converted correctly.
> >
> > The inode extension/truncate code now resides within a new helper
> > ext4_handle_inode_extension(). This function has been designed so that
> > it can accommodate both DAX and direct I/O extension/truncate
> > operations.
> >
> > Signed-off-by: Matthew Bobrowski <[email protected]>
> > Reviewed-by: Jan Kara <[email protected]>
> > Reviewed-by: Ritesh Harjani <[email protected]>
> > ---
> > fs/ext4/file.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++-
> > fs/ext4/inode.c | 48 +-------------------------
> > 2 files changed, 89 insertions(+), 48 deletions(-)
> >
> > diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> > index 440f4c6ba4ee..ec54fec96a81 100644
> > --- a/fs/ext4/file.c
> > +++ b/fs/ext4/file.c
> > @@ -33,6 +33,7 @@
> > #include "ext4_jbd2.h"
> > #include "xattr.h"
> > #include "acl.h"
> > +#include "truncate.h"
> >
> > static bool ext4_dio_supported(struct inode *inode)
> > {
> > @@ -234,12 +235,95 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> > return iov_iter_count(from);
> > }
> >
> > +static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> > + ssize_t written, size_t count)
> > +{
> > + handle_t *handle;
> > + bool truncate = false;
> > + u8 blkbits = inode->i_blkbits;
> > + ext4_lblk_t written_blk, end_blk;
> > +
> > + /*
> > + * Note that EXT4_I(inode)->i_disksize can get extended up to
> > + * inode->i_size while the I/O was running due to writeback of delalloc
> > + * blocks. But, the code in ext4_iomap_alloc() is careful to use
> > + * zeroed/unwritten extents if this is possible; thus we won't leave
> > + * uninitialized blocks in a file even if we didn't succeed in writing
> > + * as much as we intended.
> > + */
> > + WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
> > + if (offset + count <= EXT4_I(inode)->i_disksize) {
> > + /*
> > + * We need to ensure that the inode is removed from the orphan
> > + * list if it has been added prematurely, due to writeback of
> > + * delalloc blocks.
> > + */
> > + if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
> > + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> > +
> > + if (IS_ERR(handle)) {
> > + ext4_orphan_del(NULL, inode);
> > + return PTR_ERR(handle);
> > + }
> > +
> > + ext4_orphan_del(handle, inode);
> > + ext4_journal_stop(handle);
>
> I keep seeing this chunk (and the ext4_orphan_add chunk) bouncing around
> through this patchset, which causes me to wonder -- would it be useful
> to refactor these into small helpers? Or is it really just the same two
> orphan_add/del chunks bouncing around multiple places?
No, you're right. This pattern and the other one are repeated throughout
the patchset, and possibly throughout some other parts of the ext4 code
(I think) that I haven't touched here. So, my plan is to introduce a
small, separate cleanup patchset that does exactly that, i.e. moves
these duplicated orphan_add/orphan_del chunks into small helpers.
/M
On Wed, Nov 06, 2019 at 08:00:44AM +1100, Matthew Bobrowski wrote:
> On Tue, Nov 05, 2019 at 03:53:03PM -0500, Theodore Y. Ts'o wrote:
> > On Wed, Nov 06, 2019 at 07:32:00AM +1100, Matthew Bobrowski wrote:
> > > > Otherwise you would write out and invalidate too much AFAICT - the 'offset'
> > > > is position just before we fall back to buffered IO. Otherwise this hunk
> > > > looks good to me.
> > >
> > > Er, yes. That's right, it should rather be 'err' instead or else we
> > > would write/invalidate too much. I actually had this originally, but I
> > > must've muddled it up while rewriting this patch on my other computer.
> > >
> > > Thanks for picking that up!
> >
> > I can fix that up in my tree, unless there are any other changes that
> > we need to make.
>
> If you could, that would be super awesome as I don't really see
> anything else changing in this series. I'll probably send through some
> minor optimisations/refactoring cleanups after this series lands, but
> that can come at a later point.
Done. I've just pushed out the ext4.git tree, with both the master
branch (which should never rewind) and the dev branch (which can
rewind) advanced to include this patch series.
Many thanks for your work on this patch series!
- Ted