Hi!
OK, so I think we're now coming very close to having this patch series
ready, if not already there. I've made some slight amendments to this
patch series as a result of the last round of feedback. This patch
series is based on Darrick's iomap-for-next branch, as it contains
iomap API changes that we depend on.
Changes since v5:
- Fixed up calling into __generic_file_fsync() from
ext4_sync_file(). Previously this caused a recursive lock situation
when the filesystem was created without a journal.
- Rolled up the orphan handling code from ext4_dio_write_iter() into
ext4_handle_inode_extension().
- Dropped redundant conditional statement and expression from
ext4_iomap_is_delalloc().
- Fixed up a couple of comments and changelogs, as well as some other
really minor grammatical corrections.
This patch series ports the ext4 direct I/O paths over to the iomap
infrastructure. The legacy buffer_head based direct I/O implementation
has subsequently been removed as it's no longer in use. The result of
this change is that ext4 now uses the newer iomap framework for direct
I/O read/write operations. Overall, this results in a much cleaner
direct I/O implementation and keeps this code isolated from the
buffer_head internals. In addition, a slight performance boost can be
expected when using O_SYNC | O_DIRECT.
The changes within this patch series have been tested via xfstests in
both DAX and non-DAX modes using the various filesystem configuration
options, including: 4k, dioread_nolock, nojournal.
Matthew Bobrowski (11):
ext4: reorder map.m_flags checks within ext4_iomap_begin()
ext4: update direct I/O read lock pattern for IOCB_NOWAIT
ext4: iomap that extends beyond EOF should be marked dirty
ext4: move set iomap routines into a separate helper ext4_set_iomap()
ext4: split IOMAP_WRITE branch in ext4_iomap_begin() into helper
ext4: introduce new callback for IOMAP_REPORT
ext4: introduce direct I/O read using iomap infrastructure
ext4: move inode extension/truncate code out from ->iomap_end()
callback
ext4: move inode extension check out from ext4_iomap_alloc()
ext4: update ext4_sync_file() to not use __generic_file_fsync()
ext4: introduce direct I/O write using iomap infrastructure
fs/ext4/ext4.h | 4 +-
fs/ext4/extents.c | 11 +-
fs/ext4/file.c | 387 ++++++++++++++++++++-----
fs/ext4/fsync.c | 72 +++--
fs/ext4/inode.c | 714 +++++++++++-----------------------------------
5 files changed, 537 insertions(+), 651 deletions(-)
--
2.20.1
For the direct I/O changes that follow in this patch series, we need
to accommodate the case where the block mapping flags passed through
to ext4_map_blocks() result in m_flags having both EXT4_MAP_MAPPED
and EXT4_MAP_UNWRITTEN bits set. In order for any allocated unwritten
extents to be converted correctly in the ->end_io() handler,
iomap->type must be set to IOMAP_UNWRITTEN whenever the
EXT4_MAP_UNWRITTEN bit is set within m_flags, hence the need to
reorder this conditional statement.
This change is a no-op for DAX as the block mapping flags passed
through to ext4_map_blocks() i.e. EXT4_GET_BLOCKS_CREATE_ZERO never
results in both EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN being set at
once.
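The ordering requirement can be sketched in isolation. This is a
userspace toy, not ext4 code: the MAP_* constants, the enum, and
classify() are illustrative stand-ins for the EXT4_MAP_* flags and the
iomap type selection in ext4_iomap_begin(); only the check order
mirrors the patch.

```c
#include <assert.h>

/* Illustrative stand-ins for the EXT4_MAP_* bits in fs/ext4/ext4.h. */
#define MAP_MAPPED    0x1
#define MAP_UNWRITTEN 0x2

enum iomap_type { TYPE_HOLE, TYPE_MAPPED, TYPE_UNWRITTEN };

/*
 * Unwritten must win when both bits are set, so that the ->end_io()
 * handler knows to convert the extent once the write completes.
 */
static enum iomap_type classify(unsigned int m_flags)
{
	if (m_flags & MAP_UNWRITTEN)
		return TYPE_UNWRITTEN;
	if (m_flags & MAP_MAPPED)
		return TYPE_MAPPED;
	return TYPE_HOLE;
}
```

With the pre-patch ordering, the MAPPED check would match first on a
combined mapping and the extent would never be converted.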
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/inode.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index abaaf7d96ca4..ee116344c420 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3535,10 +3535,20 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
iomap->addr = IOMAP_NULL_ADDR;
} else {
- if (map.m_flags & EXT4_MAP_MAPPED) {
- iomap->type = IOMAP_MAPPED;
- } else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
+ /*
+ * Flags passed into ext4_map_blocks() for direct I/O writes
+ * can result in m_flags having both EXT4_MAP_MAPPED and
+ * EXT4_MAP_UNWRITTEN bits set. In order for any allocated
+ * unwritten extents to be converted into written extents
+ * correctly within the ->end_io() handler, we need to ensure
+ * that the iomap->type is set appropriately. Hence the reason
+ * why we need to check whether EXT4_MAP_UNWRITTEN is set
+ * first.
+ */
+ if (map.m_flags & EXT4_MAP_UNWRITTEN) {
iomap->type = IOMAP_UNWRITTEN;
+ } else if (map.m_flags & EXT4_MAP_MAPPED) {
+ iomap->type = IOMAP_MAPPED;
} else {
WARN_ON_ONCE(1);
return -EIO;
--
2.20.1
This patch updates the lock pattern in ext4_direct_IO_read() so that
it does not block on the inode lock for IOCB_NOWAIT direct I/O reads.
The locking condition implemented here is similar to that of commit
942491c9e6d6 ("xfs: fix AIM7 regression").
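The pattern can be modelled outside the kernel. This sketch uses a toy
integer "lock" and an illustrative KIOCB_NOWAIT constant in place of
the real inode rwsem and IOCB_NOWAIT flag; only the trylock-or-EAGAIN
shape mirrors the patch.

```c
#include <assert.h>
#include <errno.h>

#define KIOCB_NOWAIT 0x1	/* illustrative stand-in for IOCB_NOWAIT */

/* Toy lock model: 0 = free, >0 = reader count, <0 = write-locked. */
static int inode_trylock_shared(int *lock)
{
	if (*lock < 0)
		return 0;	/* contended: taking it would block */
	(*lock)++;
	return 1;
}

/*
 * With IOCB_NOWAIT we must bail out with -EAGAIN rather than sleep on
 * a contended lock; otherwise we take the blocking path as before.
 */
static int dio_read_lock(int *lock, int ki_flags)
{
	if (ki_flags & KIOCB_NOWAIT) {
		if (!inode_trylock_shared(lock))
			return -EAGAIN;
		return 0;
	}
	(*lock)++;	/* blocking path: inode_lock_shared() in the kernel */
	return 0;
}
```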
Fixes: 16c54688592c ("ext4: Allow parallel DIO reads")
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/inode.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ee116344c420..0a9ea291cfab 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3837,7 +3837,13 @@ static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter)
* writes & truncates and since we take care of writing back page cache,
* we are protected against page writeback as well.
*/
- inode_lock_shared(inode);
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!inode_trylock_shared(inode))
+ return -EAGAIN;
+ } else {
+ inode_lock_shared(inode);
+ }
+
ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,
iocb->ki_pos + count - 1);
if (ret)
--
2.20.1
This patch addresses what Dave Chinner had discovered and fixed within
commit 7684e2c4384d. This change does not have any user visible
impact for ext4, as none of the current users of ext4_iomap_begin()
that extend files depend on IOMAP_F_DIRTY.
When doing a direct IO that spans the current EOF, and there are
written blocks beyond EOF that extend beyond the current write, the
only metadata update that needs to be done is a file size extension.
However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that
there are IO completion metadata updates required, and hence we may
fail to correctly sync file size extensions made in IO completion when
O_DSYNC writes are being used and the hardware supports FUA.
Hence when setting IOMAP_F_DIRTY, we need to also take into account
whether the iomap spans the current EOF. If it does, then we need to
mark it dirty so that IO completion will call generic_write_sync() to
flush the inode size update to stable storage correctly.
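The dirty decision reduces to a small predicate. This is a hedged
userspace sketch, not the kernel function: iomap_dirty_flags() and its
parameters are illustrative names standing in for the checks done in
ext4_iomap_begin() against ext4_inode_datasync_dirty() and
i_size_read().

```c
#include <assert.h>
#include <stdbool.h>

#define IOMAP_F_DIRTY 0x1	/* illustrative flag value */

/*
 * An iomap is dirty if the inode already has unsynced metadata, or if
 * the mapped range extends past EOF: the size update performed at I/O
 * completion must then be forced to stable storage for O_DSYNC.
 */
static unsigned int iomap_dirty_flags(bool datasync_dirty,
				      long long offset, long long length,
				      long long i_size)
{
	unsigned int flags = 0;

	if (datasync_dirty || offset + length > i_size)
		flags |= IOMAP_F_DIRTY;
	return flags;
}
```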
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/inode.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0a9ea291cfab..da2ca81e3d9c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3523,8 +3523,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
return ret;
}
+ /*
+ * Writes that span EOF might trigger an I/O size update on completion,
+ * so consider them to be dirty for the purposes of O_DSYNC, even if
+ * there is no other metadata changes being made or are pending here.
+ */
iomap->flags = 0;
- if (ext4_inode_datasync_dirty(inode))
+ if (ext4_inode_datasync_dirty(inode) ||
+ offset + length > i_size_read(inode))
iomap->flags |= IOMAP_F_DIRTY;
iomap->bdev = inode->i_sb->s_bdev;
iomap->dax_dev = sbi->s_daxdev;
--
2.20.1
In preparation for porting the ext4 direct I/O path over to the iomap
infrastructure, split up the IOMAP_WRITE branch that's currently
within ext4_iomap_begin() into a separate helper
ext4_iomap_alloc(). This way, when we add in the necessary code for
direct I/O, we don't end up with ext4_iomap_begin() becoming a
monstrous twisty maze.
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/inode.c | 113 ++++++++++++++++++++++++++----------------------
1 file changed, 61 insertions(+), 52 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 073b7c873bb2..325abba6482c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3453,6 +3453,63 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
}
}
+static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
+ unsigned int flags)
+{
+ handle_t *handle;
+ u8 blkbits = inode->i_blkbits;
+ int ret, dio_credits, retries = 0;
+
+ /*
+ * Trim the mapping request to the maximum value that we can map at
+ * once for direct I/O.
+ */
+ if (map->m_len > DIO_MAX_BLOCKS)
+ map->m_len = DIO_MAX_BLOCKS;
+ dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
+
+retry:
+ /*
+ * Either we allocate blocks and then don't get an unwritten extent, so
+ * in that case we have reserved enough credits. Or, the blocks are
+ * already allocated and unwritten. In that case, the extent conversion
+ * fits into the credits as well.
+ */
+ handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, dio_credits);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
+ ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
+ if (ret < 0)
+ goto journal_stop;
+
+ /*
+ * If we've allocated blocks beyond EOF, we need to ensure that they're
+ * truncated if we crash before updating the inode size metadata within
+ * ext4_iomap_end(). For faults, we don't need to do that (and cannot
+ * due to orphan list operations needing an inode_lock()). If we happen
+ * to instantiate blocks beyond EOF, it is because we race with a
+ * truncate operation, which already has added the inode onto the
+ * orphan list.
+ */
+ if (!(flags & IOMAP_FAULT) && map->m_lblk + map->m_len >
+ (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
+ int err;
+
+ err = ext4_orphan_add(handle, inode);
+ if (err < 0)
+ ret = err;
+ }
+
+journal_stop:
+ ext4_journal_stop(handle);
+ if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+ goto retry;
+
+ return ret;
+}
+
+
static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap, struct iomap *srcmap)
{
@@ -3513,62 +3570,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
}
}
} else if (flags & IOMAP_WRITE) {
- int dio_credits;
- handle_t *handle;
- int retries = 0;
-
- /* Trim mapping request to maximum we can map at once for DIO */
- if (map.m_len > DIO_MAX_BLOCKS)
- map.m_len = DIO_MAX_BLOCKS;
- dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
-retry:
- /*
- * Either we allocate blocks and then we don't get unwritten
- * extent so we have reserved enough credits, or the blocks
- * are already allocated and unwritten and in that case
- * extent conversion fits in the credits as well.
- */
- handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
- dio_credits);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
-
- ret = ext4_map_blocks(handle, inode, &map,
- EXT4_GET_BLOCKS_CREATE_ZERO);
- if (ret < 0) {
- ext4_journal_stop(handle);
- if (ret == -ENOSPC &&
- ext4_should_retry_alloc(inode->i_sb, &retries))
- goto retry;
- return ret;
- }
-
- /*
- * If we added blocks beyond i_size, we need to make sure they
- * will get truncated if we crash before updating i_size in
- * ext4_iomap_end(). For faults we don't need to do that (and
- * even cannot because for orphan list operations inode_lock is
- * required) - if we happen to instantiate block beyond i_size,
- * it is because we race with truncate which has already added
- * the inode to the orphan list.
- */
- if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
- (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
- int err;
-
- err = ext4_orphan_add(handle, inode);
- if (err < 0) {
- ext4_journal_stop(handle);
- return err;
- }
- }
- ext4_journal_stop(handle);
+ ret = ext4_iomap_alloc(inode, &map, flags);
} else {
ret = ext4_map_blocks(NULL, inode, &map, 0);
- if (ret < 0)
- return ret;
}
+ if (ret < 0)
+ return ret;
+
ext4_set_iomap(inode, iomap, &map, offset, length);
if (delalloc && iomap->type == IOMAP_HOLE)
iomap->type = IOMAP_DELALLOC;
--
2.20.1
Separate the iomap field population code that is currently within
ext4_iomap_begin() into a dedicated helper ext4_set_iomap(). The
intent of this function is self explanatory; the rationale behind
taking this step is to reduce the overall clutter that we currently
have within the ext4_iomap_begin() callback.
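The core of the helper is the block-to-byte conversion of the mapping.
A minimal sketch, with an illustrative struct in place of struct iomap;
only the shift arithmetic mirrors ext4_set_iomap():

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative subset of struct iomap. */
struct simple_iomap {
	uint64_t offset;	/* byte offset of the mapping */
	uint64_t length;	/* byte length of the mapping */
};

/*
 * Logical block number and block count are shifted up by the
 * filesystem block size (blkbits) to yield byte units, widening to
 * 64 bits first so large files don't overflow.
 */
static void set_iomap(struct simple_iomap *iomap, uint32_t m_lblk,
		      uint32_t m_len, uint8_t blkbits)
{
	iomap->offset = (uint64_t)m_lblk << blkbits;
	iomap->length = (uint64_t)m_len << blkbits;
}
```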
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/inode.c | 92 +++++++++++++++++++++++++++----------------------
1 file changed, 50 insertions(+), 42 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index da2ca81e3d9c..073b7c873bb2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3406,10 +3406,56 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
return inode->i_state & I_DIRTY_DATASYNC;
}
+static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
+ struct ext4_map_blocks *map, loff_t offset,
+ loff_t length)
+{
+ u8 blkbits = inode->i_blkbits;
+
+ /*
+ * Writes that span EOF might trigger an I/O size update on completion,
+ * so consider them to be dirty for the purpose of O_DSYNC, even if
+ * there is no other metadata changes being made or are pending.
+ */
+ iomap->flags = 0;
+ if (ext4_inode_datasync_dirty(inode) ||
+ offset + length > i_size_read(inode))
+ iomap->flags |= IOMAP_F_DIRTY;
+
+ if (map->m_flags & EXT4_MAP_NEW)
+ iomap->flags |= IOMAP_F_NEW;
+
+ iomap->bdev = inode->i_sb->s_bdev;
+ iomap->dax_dev = EXT4_SB(inode->i_sb)->s_daxdev;
+ iomap->offset = (u64) map->m_lblk << blkbits;
+ iomap->length = (u64) map->m_len << blkbits;
+
+ if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN)) {
+ /*
+ * Flags passed to ext4_map_blocks() for direct I/O writes can
+ * result in m_flags having both EXT4_MAP_MAPPED and
+ * EXT4_MAP_UNWRITTEN bits set. In order for any allocated
+ * unwritten extents to be converted into written extents
+ * correctly within the ->end_io() handler, we need to ensure
+ * that the iomap->type is set appropriately. Hence, the reason
+ * why we need to check whether the EXT4_MAP_UNWRITTEN bit has
+ * been set first.
+ */
+ if (map->m_flags & EXT4_MAP_UNWRITTEN)
+ iomap->type = IOMAP_UNWRITTEN;
+ else if (map->m_flags & EXT4_MAP_MAPPED)
+ iomap->type = IOMAP_MAPPED;
+
+ iomap->addr = (u64) map->m_pblk << blkbits;
+ } else {
+ iomap->type = IOMAP_HOLE;
+ iomap->addr = IOMAP_NULL_ADDR;
+ }
+}
+
static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap, struct iomap *srcmap)
{
- struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
unsigned int blkbits = inode->i_blkbits;
unsigned long first_block, last_block;
struct ext4_map_blocks map;
@@ -3523,47 +3569,9 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
return ret;
}
- /*
- * Writes that span EOF might trigger an I/O size update on completion,
- * so consider them to be dirty for the purposes of O_DSYNC, even if
- * there is no other metadata changes being made or are pending here.
- */
- iomap->flags = 0;
- if (ext4_inode_datasync_dirty(inode) ||
- offset + length > i_size_read(inode))
- iomap->flags |= IOMAP_F_DIRTY;
- iomap->bdev = inode->i_sb->s_bdev;
- iomap->dax_dev = sbi->s_daxdev;
- iomap->offset = (u64)first_block << blkbits;
- iomap->length = (u64)map.m_len << blkbits;
-
- if (ret == 0) {
- iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
- iomap->addr = IOMAP_NULL_ADDR;
- } else {
- /*
- * Flags passed into ext4_map_blocks() for direct I/O writes
- * can result in m_flags having both EXT4_MAP_MAPPED and
- * EXT4_MAP_UNWRITTEN bits set. In order for any allocated
- * unwritten extents to be converted into written extents
- * correctly within the ->end_io() handler, we need to ensure
- * that the iomap->type is set appropriately. Hence the reason
- * why we need to check whether EXT4_MAP_UNWRITTEN is set
- * first.
- */
- if (map.m_flags & EXT4_MAP_UNWRITTEN) {
- iomap->type = IOMAP_UNWRITTEN;
- } else if (map.m_flags & EXT4_MAP_MAPPED) {
- iomap->type = IOMAP_MAPPED;
- } else {
- WARN_ON_ONCE(1);
- return -EIO;
- }
- iomap->addr = (u64)map.m_pblk << blkbits;
- }
-
- if (map.m_flags & EXT4_MAP_NEW)
- iomap->flags |= IOMAP_F_NEW;
+ ext4_set_iomap(inode, iomap, &map, offset, length);
+ if (delalloc && iomap->type == IOMAP_HOLE)
+ iomap->type = IOMAP_DELALLOC;
return 0;
}
--
2.20.1
As part of the ext4_iomap_begin() cleanups that precede this patch, we
also split up the IOMAP_REPORT branch into a completely separate
->iomap_begin() callback named ext4_iomap_begin_report(). Again, the
rationale for this change is to reduce the overall clutter within
ext4_iomap_begin().
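The delalloc classification that moves into ext4_iomap_is_delalloc()
can be modelled on its own. This is a userspace toy: the extent struct
and range_is_delalloc() are illustrative stand-ins for struct
extent_status and the real helper, but the trimming logic follows the
same three cases (hole, leading hole, delayed extent covering the
start).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for the first delayed extent found in range. */
struct extent {
	uint32_t lblk;
	uint32_t len;	/* len == 0 means no delayed extent found */
};

static bool range_is_delalloc(uint32_t *m_lblk, uint32_t *m_len,
			      const struct extent *es)
{
	uint32_t end = *m_lblk + *m_len - 1;

	if (!es->len || es->lblk > end)
		return false;	/* entire range is a hole */

	if (es->lblk > *m_lblk) {
		*m_len = es->lblk - *m_lblk;	/* range starts with a hole */
		return false;
	}

	/* Delayed extent covers the start: trim to its remainder. */
	*m_len = es->len - (*m_lblk - es->lblk);
	return true;
}
```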
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/file.c | 6 ++-
fs/ext4/inode.c | 134 +++++++++++++++++++++++++++++-------------------
3 files changed, 85 insertions(+), 56 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 03db3e71676c..d0d88f411a44 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3379,6 +3379,7 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
}
extern const struct iomap_ops ext4_iomap_ops;
+extern const struct iomap_ops ext4_iomap_report_ops;
static inline int ext4_buffer_uptodate(struct buffer_head *bh)
{
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 8d2bbcc2d813..ab75aee3e687 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -494,12 +494,14 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int whence)
maxbytes, i_size_read(inode));
case SEEK_HOLE:
inode_lock_shared(inode);
- offset = iomap_seek_hole(inode, offset, &ext4_iomap_ops);
+ offset = iomap_seek_hole(inode, offset,
+ &ext4_iomap_report_ops);
inode_unlock_shared(inode);
break;
case SEEK_DATA:
inode_lock_shared(inode);
- offset = iomap_seek_data(inode, offset, &ext4_iomap_ops);
+ offset = iomap_seek_data(inode, offset,
+ &ext4_iomap_report_ops);
inode_unlock_shared(inode);
break;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 325abba6482c..50b4835cd927 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3513,74 +3513,32 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap, struct iomap *srcmap)
{
- unsigned int blkbits = inode->i_blkbits;
- unsigned long first_block, last_block;
- struct ext4_map_blocks map;
- bool delalloc = false;
int ret;
+ struct ext4_map_blocks map;
+ u8 blkbits = inode->i_blkbits;
if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
return -EINVAL;
- first_block = offset >> blkbits;
- last_block = min_t(loff_t, (offset + length - 1) >> blkbits,
- EXT4_MAX_LOGICAL_BLOCK);
-
- if (flags & IOMAP_REPORT) {
- if (ext4_has_inline_data(inode)) {
- ret = ext4_inline_data_iomap(inode, iomap);
- if (ret != -EAGAIN) {
- if (ret == 0 && offset >= iomap->length)
- ret = -ENOENT;
- return ret;
- }
- }
- } else {
- if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
- return -ERANGE;
- }
- map.m_lblk = first_block;
- map.m_len = last_block - first_block + 1;
-
- if (flags & IOMAP_REPORT) {
- ret = ext4_map_blocks(NULL, inode, &map, 0);
- if (ret < 0)
- return ret;
-
- if (ret == 0) {
- ext4_lblk_t end = map.m_lblk + map.m_len - 1;
- struct extent_status es;
-
- ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
- map.m_lblk, end, &es);
+ if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
+ return -ERANGE;
- if (!es.es_len || es.es_lblk > end) {
- /* entire range is a hole */
- } else if (es.es_lblk > map.m_lblk) {
- /* range starts with a hole */
- map.m_len = es.es_lblk - map.m_lblk;
- } else {
- ext4_lblk_t offs = 0;
+ /*
+ * Calculate the first and last logical blocks respectively.
+ */
+ map.m_lblk = offset >> blkbits;
+ map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
+ EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
- if (es.es_lblk < map.m_lblk)
- offs = map.m_lblk - es.es_lblk;
- map.m_lblk = es.es_lblk + offs;
- map.m_len = es.es_len - offs;
- delalloc = true;
- }
- }
- } else if (flags & IOMAP_WRITE) {
+ if (flags & IOMAP_WRITE)
ret = ext4_iomap_alloc(inode, &map, flags);
- } else {
+ else
ret = ext4_map_blocks(NULL, inode, &map, 0);
- }
if (ret < 0)
return ret;
ext4_set_iomap(inode, iomap, &map, offset, length);
- if (delalloc && iomap->type == IOMAP_HOLE)
- iomap->type = IOMAP_DELALLOC;
return 0;
}
@@ -3642,6 +3600,74 @@ const struct iomap_ops ext4_iomap_ops = {
.iomap_end = ext4_iomap_end,
};
+static bool ext4_iomap_is_delalloc(struct inode *inode,
+ struct ext4_map_blocks *map)
+{
+ struct extent_status es;
+ ext4_lblk_t offset = 0, end = map->m_lblk + map->m_len - 1;
+
+ ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
+ map->m_lblk, end, &es);
+
+ if (!es.es_len || es.es_lblk > end)
+ return false;
+
+ if (es.es_lblk > map->m_lblk) {
+ map->m_len = es.es_lblk - map->m_lblk;
+ return false;
+ }
+
+ offset = map->m_lblk - es.es_lblk;
+ map->m_len = es.es_len - offset;
+
+ return true;
+}
+
+static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
+ loff_t length, unsigned int flags,
+ struct iomap *iomap, struct iomap *srcmap)
+{
+ int ret;
+ bool delalloc = false;
+ struct ext4_map_blocks map;
+ u8 blkbits = inode->i_blkbits;
+
+ if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
+ return -EINVAL;
+
+ if (ext4_has_inline_data(inode)) {
+ ret = ext4_inline_data_iomap(inode, iomap);
+ if (ret != -EAGAIN) {
+ if (ret == 0 && offset >= iomap->length)
+ ret = -ENOENT;
+ return ret;
+ }
+ }
+
+ /*
+ * Calculate the first and last logical block respectively.
+ */
+ map.m_lblk = offset >> blkbits;
+ map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
+ EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
+
+ ret = ext4_map_blocks(NULL, inode, &map, 0);
+ if (ret < 0)
+ return ret;
+ if (ret == 0)
+ delalloc = ext4_iomap_is_delalloc(inode, &map);
+
+ ext4_set_iomap(inode, iomap, &map, offset, length);
+ if (delalloc && iomap->type == IOMAP_HOLE)
+ iomap->type = IOMAP_DELALLOC;
+
+ return 0;
+}
+
+const struct iomap_ops ext4_iomap_report_ops = {
+ .iomap_begin = ext4_iomap_begin_report,
+};
+
static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
ssize_t size, void *private)
{
--
2.20.1
This patch introduces a new direct I/O read path which makes use of
the iomap infrastructure.
The new function ext4_dio_read_iter() is responsible for calling into
the iomap infrastructure via iomap_dio_rw(). If the read operation
being performed on the inode is not supported, which is checked via
ext4_dio_supported(), then we simply fall back and complete the I/O
using buffered I/O.
The existing direct I/O read code path has been removed, as it is now
redundant.
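The fallback hinges on clearing the direct I/O flag before handing the
iocb to the generic read path. A hedged sketch with an illustrative
flag value; only the clear-then-fall-back shape mirrors
ext4_dio_read_iter():

```c
#include <assert.h>
#include <stdbool.h>

#define KIOCB_DIRECT 0x4	/* illustrative stand-in for IOCB_DIRECT */

/*
 * When direct I/O isn't supported on the inode, IOCB_DIRECT must be
 * cleared so that generic_file_read_iter() takes the buffered path
 * instead of looping back into the direct I/O code.
 */
static int prepare_read_flags(int ki_flags, bool dio_supported)
{
	if ((ki_flags & KIOCB_DIRECT) && !dio_supported)
		ki_flags &= ~KIOCB_DIRECT;
	return ki_flags;
}
```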
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/file.c | 55 +++++++++++++++++++++++++++++++++++++++++++++++--
fs/ext4/inode.c | 38 +---------------------------------
2 files changed, 54 insertions(+), 39 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index ab75aee3e687..440f4c6ba4ee 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -34,6 +34,52 @@
#include "xattr.h"
#include "acl.h"
+static bool ext4_dio_supported(struct inode *inode)
+{
+ if (IS_ENABLED(CONFIG_FS_ENCRYPTION) && IS_ENCRYPTED(inode))
+ return false;
+ if (fsverity_active(inode))
+ return false;
+ if (ext4_should_journal_data(inode))
+ return false;
+ if (ext4_has_inline_data(inode))
+ return false;
+ return true;
+}
+
+static ssize_t ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+ ssize_t ret;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!inode_trylock_shared(inode))
+ return -EAGAIN;
+ } else {
+ inode_lock_shared(inode);
+ }
+
+ if (!ext4_dio_supported(inode)) {
+ inode_unlock_shared(inode);
+ /*
+ * Fallback to buffered I/O if the operation being performed on
+ * the inode is not supported by direct I/O. The IOCB_DIRECT
+ * flag needs to be cleared here in order to ensure that the
+ * direct I/O path within generic_file_read_iter() is not
+ * taken.
+ */
+ iocb->ki_flags &= ~IOCB_DIRECT;
+ return generic_file_read_iter(iocb, to);
+ }
+
+ ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL,
+ is_sync_kiocb(iocb));
+ inode_unlock_shared(inode);
+
+ file_accessed(iocb->ki_filp);
+ return ret;
+}
+
#ifdef CONFIG_FS_DAX
static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
@@ -64,16 +110,21 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
- if (unlikely(ext4_forced_shutdown(EXT4_SB(file_inode(iocb->ki_filp)->i_sb))))
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
return -EIO;
if (!iov_iter_count(to))
return 0; /* skip atime */
#ifdef CONFIG_FS_DAX
- if (IS_DAX(file_inode(iocb->ki_filp)))
+ if (IS_DAX(inode))
return ext4_dax_read_iter(iocb, to);
#endif
+ if (iocb->ki_flags & IOCB_DIRECT)
+ return ext4_dio_read_iter(iocb, to);
+
return generic_file_read_iter(iocb, to);
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 50b4835cd927..e44b3b1dbbc4 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -863,9 +863,6 @@ int ext4_dio_get_block(struct inode *inode, sector_t iblock,
{
/* We don't expect handle for direct IO */
WARN_ON_ONCE(ext4_journal_current_handle());
-
- if (!create)
- return _ext4_get_block(inode, iblock, bh, 0);
return ext4_get_block_trans(inode, iblock, bh, EXT4_GET_BLOCKS_CREATE);
}
@@ -3874,36 +3871,6 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
return ret;
}
-static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct address_space *mapping = iocb->ki_filp->f_mapping;
- struct inode *inode = mapping->host;
- size_t count = iov_iter_count(iter);
- ssize_t ret;
-
- /*
- * Shared inode_lock is enough for us - it protects against concurrent
- * writes & truncates and since we take care of writing back page cache,
- * we are protected against page writeback as well.
- */
- if (iocb->ki_flags & IOCB_NOWAIT) {
- if (!inode_trylock_shared(inode))
- return -EAGAIN;
- } else {
- inode_lock_shared(inode);
- }
-
- ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,
- iocb->ki_pos + count - 1);
- if (ret)
- goto out_unlock;
- ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
- iter, ext4_dio_get_block, NULL, NULL, 0);
-out_unlock:
- inode_unlock_shared(inode);
- return ret;
-}
-
static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
struct file *file = iocb->ki_filp;
@@ -3930,10 +3897,7 @@ static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
return 0;
trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
- if (iov_iter_rw(iter) == READ)
- ret = ext4_direct_IO_read(iocb, iter);
- else
- ret = ext4_direct_IO_write(iocb, iter);
+ ret = ext4_direct_IO_write(iocb, iter);
trace_ext4_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), ret);
return ret;
}
--
2.20.1
In preparation for implementing the iomap direct I/O modifications,
the inode extension/truncate code needs to be moved out from the
ext4_iomap_end() callback. For direct I/O, if the current code
remained, it would behave incorrectly. Updating the inode size prior
to converting unwritten extents would potentially allow a racing
direct I/O read to find unwritten extents before they have been
converted correctly.
The inode extension/truncate code now resides within a new helper
ext4_handle_inode_extension(). This function has been designed so that
it can accommodate both DAX and direct I/O extension/truncate
operations.
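Part of the helper's job is deciding whether allocated-but-unwritten
tail blocks need truncating after a short write. A sketch under stated
assumptions: ALIGN_UP and need_truncate() are illustrative names, and
the real code additionally gates this on ext4_can_truncate(); only the
block-rounding comparison mirrors ext4_handle_inode_extension().

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Kernel-style ALIGN: round x up to a power-of-two boundary a. */
#define ALIGN_UP(x, a) (((x) + (a) - 1) & ~((uint64_t)(a) - 1))

/*
 * Blocks were allocated out to the end of the requested range, so if
 * the bytes actually written stop in an earlier block, the tail
 * blocks beyond EOF must be truncated away.
 */
static bool need_truncate(uint64_t offset, uint64_t written,
			  uint64_t count, uint8_t blkbits)
{
	uint64_t blksz = 1ULL << blkbits;
	uint64_t written_blk = ALIGN_UP(offset + written, blksz);
	uint64_t end_blk = ALIGN_UP(offset + count, blksz);

	return written_blk < end_blk;
}
```

A write that stops one block short of the requested range triggers the
truncate; a write that merely ends mid-block within the same final
block does not.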
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext4/file.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++-
fs/ext4/inode.c | 48 +-------------------------
2 files changed, 89 insertions(+), 48 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 440f4c6ba4ee..ec54fec96a81 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -33,6 +33,7 @@
#include "ext4_jbd2.h"
#include "xattr.h"
#include "acl.h"
+#include "truncate.h"
static bool ext4_dio_supported(struct inode *inode)
{
@@ -234,12 +235,95 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
return iov_iter_count(from);
}
+static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
+ ssize_t written, size_t count)
+{
+ handle_t *handle;
+ bool truncate = false;
+ u8 blkbits = inode->i_blkbits;
+ ext4_lblk_t written_blk, end_blk;
+
+ /*
+ * Note that EXT4_I(inode)->i_disksize can get extended up to
+ * inode->i_size while the I/O was running due to writeback of delalloc
+ * blocks. But, the code in ext4_iomap_alloc() is careful to use
+ * zeroed/unwritten extents if this is possible; thus we won't leave
+ * uninitialized blocks in a file even if we didn't succeed in writing
+ * as much as we intended.
+ */
+ WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
+ if (offset + count <= EXT4_I(inode)->i_disksize) {
+ /*
+ * We need to ensure that the inode is removed from the orphan
+ * list if it has been added prematurely, due to writeback of
+ * delalloc blocks.
+ */
+ if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+
+ if (IS_ERR(handle)) {
+ ext4_orphan_del(NULL, inode);
+ return PTR_ERR(handle);
+ }
+
+ ext4_orphan_del(handle, inode);
+ ext4_journal_stop(handle);
+ }
+
+ return written;
+ }
+
+ if (written < 0)
+ goto truncate;
+
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ written = PTR_ERR(handle);
+ goto truncate;
+ }
+
+ if (ext4_update_inode_size(inode, offset + written))
+ ext4_mark_inode_dirty(handle, inode);
+
+ /*
+ * We may need to truncate allocated but not written blocks beyond EOF.
+ */
+ written_blk = ALIGN(offset + written, 1 << blkbits);
+ end_blk = ALIGN(offset + count, 1 << blkbits);
+ if (written_blk < end_blk && ext4_can_truncate(inode))
+ truncate = true;
+
+ /*
+ * Remove the inode from the orphan list if it has been extended and
+ * everything went OK.
+ */
+ if (!truncate && inode->i_nlink)
+ ext4_orphan_del(handle, inode);
+ ext4_journal_stop(handle);
+
+ if (truncate) {
+truncate:
+ ext4_truncate_failed_write(inode);
+ /*
+ * If the truncate operation failed early, then the inode may
+ * still be on the orphan list. In that case, we need to try
+ * remove the inode from the in-memory linked list.
+ */
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+ }
+
+ return written;
+}
+
#ifdef CONFIG_FS_DAX
static ssize_t
ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
- struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;
+ size_t count;
+ loff_t offset;
+ struct inode *inode = file_inode(iocb->ki_filp);
if (!inode_trylock(inode)) {
if (iocb->ki_flags & IOCB_NOWAIT)
@@ -256,7 +340,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (ret)
goto out;
+ offset = iocb->ki_pos;
+ count = iov_iter_count(from);
ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
+ ret = ext4_handle_inode_extension(inode, offset, ret, count);
out:
inode_unlock(inode);
if (ret > 0)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e44b3b1dbbc4..7c21028760ee 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3543,53 +3543,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ssize_t written, unsigned flags, struct iomap *iomap)
{
- int ret = 0;
- handle_t *handle;
- int blkbits = inode->i_blkbits;
- bool truncate = false;
-
- if (!(flags & IOMAP_WRITE) || (flags & IOMAP_FAULT))
- return 0;
-
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto orphan_del;
- }
- if (ext4_update_inode_size(inode, offset + written))
- ext4_mark_inode_dirty(handle, inode);
- /*
- * We may need to truncate allocated but not written blocks beyond EOF.
- */
- if (iomap->offset + iomap->length >
- ALIGN(inode->i_size, 1 << blkbits)) {
- ext4_lblk_t written_blk, end_blk;
-
- written_blk = (offset + written) >> blkbits;
- end_blk = (offset + length) >> blkbits;
- if (written_blk < end_blk && ext4_can_truncate(inode))
- truncate = true;
- }
- /*
- * Remove inode from orphan list if we were extending a inode and
- * everything went fine.
- */
- if (!truncate && inode->i_nlink &&
- !list_empty(&EXT4_I(inode)->i_orphan))
- ext4_orphan_del(handle, inode);
- ext4_journal_stop(handle);
- if (truncate) {
- ext4_truncate_failed_write(inode);
-orphan_del:
- /*
- * If truncate failed early the inode might still be on the
- * orphan list; we need to make sure the inode is removed from
- * the orphan list in that case.
- */
- if (inode->i_nlink)
- ext4_orphan_del(NULL, inode);
- }
- return ret;
+ return 0;
}
const struct iomap_ops ext4_iomap_ops = {
--
2.20.1
Lift the inode extension/orphan list handling code out from
ext4_iomap_alloc() and apply it within ext4_dax_write_iter().
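As a rough userspace sketch of the extension handling being moved here (all names, and the 4k block size used below, are hypothetical), the decision ext4_handle_inode_extension() makes about truncating allocated-but-unwritten blocks beyond EOF reduces to comparing the block-rounded end of what was written against the block-rounded end of what was requested:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round x up to the next multiple of the (power-of-two) block size,
 * mirroring the kernel's ALIGN() macro. */
static uint64_t blk_align(uint64_t x, uint64_t blocksize)
{
	return (x + blocksize - 1) & ~(blocksize - 1);
}

/* Hypothetical model: blocks were allocated to cover 'count' bytes at
 * 'offset', but only 'written' bytes completed.  If the last written
 * block falls short of the last allocated block, the surplus allocation
 * beyond EOF must be truncated away. */
static bool needs_truncate(uint64_t offset, uint64_t written,
			   uint64_t count, uint64_t blocksize)
{
	return blk_align(offset + written, blocksize) <
	       blk_align(offset + count, blocksize);
}
```

In the kernel this result is additionally gated on ext4_can_truncate(), and the truncate itself is carried out by ext4_truncate_failed_write().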
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/file.c | 24 +++++++++++++++++++++++-
fs/ext4/inode.c | 22 ----------------------
2 files changed, 23 insertions(+), 23 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index ec54fec96a81..83ef9c9ed208 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -323,6 +323,8 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
ssize_t ret;
size_t count;
loff_t offset;
+ handle_t *handle;
+ bool extend = false;
struct inode *inode = file_inode(iocb->ki_filp);
if (!inode_trylock(inode)) {
@@ -342,8 +344,28 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
offset = iocb->ki_pos;
count = iov_iter_count(from);
+
+ if (offset + count > EXT4_I(inode)->i_disksize) {
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+
+ ret = ext4_orphan_add(handle, inode);
+ if (ret) {
+ ext4_journal_stop(handle);
+ goto out;
+ }
+
+ extend = true;
+ ext4_journal_stop(handle);
+ }
+
ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
- ret = ext4_handle_inode_extension(inode, offset, ret, count);
+
+ if (extend)
+ ret = ext4_handle_inode_extension(inode, offset, ret, count);
out:
inode_unlock(inode);
if (ret > 0)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 7c21028760ee..2ca2e6e69344 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3454,7 +3454,6 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
unsigned int flags)
{
handle_t *handle;
- u8 blkbits = inode->i_blkbits;
int ret, dio_credits, retries = 0;
/*
@@ -3477,28 +3476,7 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
return PTR_ERR(handle);
ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
- if (ret < 0)
- goto journal_stop;
-
- /*
- * If we've allocated blocks beyond EOF, we need to ensure that they're
- * truncated if we crash before updating the inode size metadata within
- * ext4_iomap_end(). For faults, we don't need to do that (and cannot
- * due to orphan list operations needing an inode_lock()). If we happen
- * to instantiate blocks beyond EOF, it is because we race with a
- * truncate operation, which already has added the inode onto the
- * orphan list.
- */
- if (!(flags & IOMAP_FAULT) && map->m_lblk + map->m_len >
- (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
- int err;
-
- err = ext4_orphan_add(handle, inode);
- if (err < 0)
- ret = err;
- }
-journal_stop:
ext4_journal_stop(handle);
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
goto retry;
--
2.20.1
This patch introduces a new direct I/O write path which makes use of
the iomap infrastructure.
All direct I/O writes are now passed from the ->write_iter() callback
through to the new direct I/O handler ext4_dio_write_iter(). This
function is responsible for calling into the iomap infrastructure via
iomap_dio_rw().
Code snippets from the existing direct I/O write path within
ext4_file_write_iter(), such as checking whether the I/O request is
unaligned asynchronous I/O or whether the write will result in an
overwrite, have effectively been moved into the new direct I/O
->write_iter() handler.
The block mapping flags that are eventually passed down to
ext4_map_blocks() from the *_get_block_*() suite of routines have been
moved into ext4_iomap_alloc().
For inode extension cases, ext4_handle_inode_extension() is
responsible for performing the necessary metadata updates. It is
called after iomap_dio_rw() has returned so that we can safely
determine whether we need to truncate any allocated blocks that may
have been prepared for this direct I/O write. We don't perform the
inode extension or truncate operations from the ->end_io() handler, as
we don't have the original I/O 'length' available there. The
->end_io() handler is, however, responsible for converting allocated
unwritten extents to written extents.
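A minimal userspace model of that ->end_io() conversion decision (the flag value below is an illustrative stand-in, not the kernel's definition of IOMAP_DIO_UNWRITTEN): unwritten extents are converted only when the write actually transferred data, iomap flagged the range as unwritten, and no error occurred.

```c
#include <assert.h>
#include <stdbool.h>

#define DIO_UNWRITTEN 0x1	/* illustrative stand-in for IOMAP_DIO_UNWRITTEN */

/* Sketch of the completion-time check: failed I/O leaves extents
 * unwritten; a successful write over an unwritten mapping triggers the
 * unwritten-to-written conversion. */
static bool should_convert_unwritten(long size, int error, unsigned int flags)
{
	if (error)
		return false;
	return size > 0 && (flags & DIO_UNWRITTEN);
}
```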
In the instance of a short write, we fall back and complete the
remainder of the I/O using buffered I/O via
ext4_buffered_write_iter().
The existing buffer_head direct I/O implementation has been removed as
it's now redundant.
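The unaligned-write check mentioned above can be sketched as a pure alignment predicate. This is a simplification of ext4_unaligned_aio() (the real helper derives the buffer alignment from the iov_iter, and the 4k block size in the checks below is only an example):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A write is treated as "unaligned" when its start or end does not land
 * on a filesystem block boundary; two such asynchronous writes into the
 * same block could race on the partial-block zeroing, so they must be
 * serialized against each other. */
static bool dio_is_unaligned(uint64_t pos, uint64_t len, uint64_t blocksize)
{
	uint64_t blockmask = blocksize - 1;

	return ((pos | (pos + len)) & blockmask) != 0;
}
```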
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
---
fs/ext4/ext4.h | 3 -
fs/ext4/extents.c | 11 +-
fs/ext4/file.c | 221 +++++++++++++++++--------
fs/ext4/inode.c | 411 +++++-----------------------------------------
4 files changed, 193 insertions(+), 453 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d0d88f411a44..fdab3420539d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1579,7 +1579,6 @@ enum {
EXT4_STATE_NO_EXPAND, /* No space for expansion */
EXT4_STATE_DA_ALLOC_CLOSE, /* Alloc DA blks on close */
EXT4_STATE_EXT_MIGRATE, /* Inode is migrating */
- EXT4_STATE_DIO_UNWRITTEN, /* need convert on dio done*/
EXT4_STATE_NEWENTRY, /* File just added to dir */
EXT4_STATE_MAY_INLINE_DATA, /* may have in-inode data */
EXT4_STATE_EXT_PRECACHED, /* extents have been precached */
@@ -2560,8 +2559,6 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
-int ext4_dio_get_block(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create);
int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
struct buffer_head *bh, int create);
int ext4_walk_page_buffers(handle_t *handle,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index fb0f99dc8c22..df0629de3667 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1753,16 +1753,9 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
*/
if (ext1_ee_len + ext2_ee_len > EXT_INIT_MAX_LEN)
return 0;
- /*
- * The check for IO to unwritten extent is somewhat racy as we
- * increment i_unwritten / set EXT4_STATE_DIO_UNWRITTEN only after
- * dropping i_data_sem. But reserved blocks should save us in that
- * case.
- */
+
if (ext4_ext_is_unwritten(ex1) &&
- (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) ||
- atomic_read(&EXT4_I(inode)->i_unwritten) ||
- (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)))
+ ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)
return 0;
#ifdef AGGRESSIVE_TEST
if (ext1_ee_len >= 4)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 83ef9c9ed208..0df9d5191ed0 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -29,6 +29,7 @@
#include <linux/pagevec.h>
#include <linux/uio.h>
#include <linux/mman.h>
+#include <linux/backing-dev.h>
#include "ext4.h"
#include "ext4_jbd2.h"
#include "xattr.h"
@@ -155,13 +156,6 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
return 0;
}
-static void ext4_unwritten_wait(struct inode *inode)
-{
- wait_queue_head_t *wq = ext4_ioend_wq(inode);
-
- wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_unwritten) == 0));
-}
-
/*
* This tests whether the IO in question is block-aligned or not.
* Ext4 utilizes unwritten extents when hole-filling during direct IO, and they
@@ -214,13 +208,13 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;
+ if (unlikely(IS_IMMUTABLE(inode)))
+ return -EPERM;
+
ret = generic_write_checks(iocb, from);
if (ret <= 0)
return ret;
- if (unlikely(IS_IMMUTABLE(inode)))
- return -EPERM;
-
/*
* If we have encountered a bitmap-format file, the size limit
* is smaller than s_maxbytes, which is for extent-mapped files.
@@ -232,9 +226,42 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
return -EFBIG;
iov_iter_truncate(from, sbi->s_bitmap_maxbytes - iocb->ki_pos);
}
+
+ ret = file_modified(iocb->ki_filp);
+ if (ret)
+ return ret;
+
return iov_iter_count(from);
}
+static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
+ struct iov_iter *from)
+{
+ ssize_t ret;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return -EOPNOTSUPP;
+
+ inode_lock(inode);
+ ret = ext4_write_checks(iocb, from);
+ if (ret <= 0)
+ goto out;
+
+ current->backing_dev_info = inode_to_bdi(inode);
+ ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
+ current->backing_dev_info = NULL;
+
+out:
+ inode_unlock(inode);
+ if (likely(ret > 0)) {
+ iocb->ki_pos += ret;
+ ret = generic_write_sync(iocb, ret);
+ }
+
+ return ret;
+}
+
static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
ssize_t written, size_t count)
{
@@ -316,6 +343,114 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
return written;
}
+static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
+ int error, unsigned int flags)
+{
+ loff_t offset = iocb->ki_pos;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (error)
+ return error;
+
+ if (size && flags & IOMAP_DIO_UNWRITTEN)
+ return ext4_convert_unwritten_extents(NULL, inode,
+ offset, size);
+
+ return 0;
+}
+
+static const struct iomap_dio_ops ext4_dio_write_ops = {
+ .end_io = ext4_dio_write_end_io,
+};
+
+static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+ ssize_t ret;
+ size_t count;
+ loff_t offset;
+ handle_t *handle;
+ struct inode *inode = file_inode(iocb->ki_filp);
+ bool extend = false, overwrite = false, unaligned_aio = false;
+
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!inode_trylock(inode))
+ return -EAGAIN;
+ } else {
+ inode_lock(inode);
+ }
+
+ if (!ext4_dio_supported(inode)) {
+ inode_unlock(inode);
+ /*
+ * Fallback to buffered I/O if the inode does not support
+ * direct I/O.
+ */
+ return ext4_buffered_write_iter(iocb, from);
+ }
+
+ ret = ext4_write_checks(iocb, from);
+ if (ret <= 0) {
+ inode_unlock(inode);
+ return ret;
+ }
+
+ /*
+ * Unaligned direct asynchronous I/O must be serialized among each
+ * other as the zeroing of partial blocks of two competing unaligned
+ * asynchronous I/O writes can result in data corruption.
+ */
+ offset = iocb->ki_pos;
+ count = iov_iter_count(from);
+ if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
+ !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
+ unaligned_aio = true;
+ inode_dio_wait(inode);
+ }
+
+ /*
+ * Determine whether the I/O will overwrite allocated and initialized
+ * blocks. If so, check to see whether it is possible to take the
+ * dioread_nolock path.
+ */
+ if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
+ ext4_should_dioread_nolock(inode)) {
+ overwrite = true;
+ downgrade_write(&inode->i_rwsem);
+ }
+
+ if (offset + count > EXT4_I(inode)->i_disksize) {
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+
+ ret = ext4_orphan_add(handle, inode);
+ if (ret) {
+ ext4_journal_stop(handle);
+ goto out;
+ }
+
+ extend = true;
+ ext4_journal_stop(handle);
+ }
+
+ ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, &ext4_dio_write_ops,
+ is_sync_kiocb(iocb) || unaligned_aio || extend);
+
+ if (extend)
+ ret = ext4_handle_inode_extension(inode, offset, ret, count);
+out:
+ if (overwrite)
+ inode_unlock_shared(inode);
+ else
+ inode_unlock(inode);
+
+ if (ret >= 0 && iov_iter_count(from))
+ return ext4_buffered_write_iter(iocb, from);
+ return ret;
+}
+
#ifdef CONFIG_FS_DAX
static ssize_t
ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
@@ -332,15 +467,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
return -EAGAIN;
inode_lock(inode);
}
+
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
- ret = file_remove_privs(iocb->ki_filp);
- if (ret)
- goto out;
- ret = file_update_time(iocb->ki_filp);
- if (ret)
- goto out;
offset = iocb->ki_pos;
count = iov_iter_count(from);
@@ -378,10 +508,6 @@ static ssize_t
ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct inode *inode = file_inode(iocb->ki_filp);
- int o_direct = iocb->ki_flags & IOCB_DIRECT;
- int unaligned_aio = 0;
- int overwrite = 0;
- ssize_t ret;
if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
return -EIO;
@@ -390,59 +516,10 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (IS_DAX(inode))
return ext4_dax_write_iter(iocb, from);
#endif
+ if (iocb->ki_flags & IOCB_DIRECT)
+ return ext4_dio_write_iter(iocb, from);
- if (!inode_trylock(inode)) {
- if (iocb->ki_flags & IOCB_NOWAIT)
- return -EAGAIN;
- inode_lock(inode);
- }
-
- ret = ext4_write_checks(iocb, from);
- if (ret <= 0)
- goto out;
-
- /*
- * Unaligned direct AIO must be serialized among each other as zeroing
- * of partial blocks of two competing unaligned AIOs can result in data
- * corruption.
- */
- if (o_direct && ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
- !is_sync_kiocb(iocb) &&
- ext4_unaligned_aio(inode, from, iocb->ki_pos)) {
- unaligned_aio = 1;
- ext4_unwritten_wait(inode);
- }
-
- iocb->private = &overwrite;
- /* Check whether we do a DIO overwrite or not */
- if (o_direct && !unaligned_aio) {
- if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
- if (ext4_should_dioread_nolock(inode))
- overwrite = 1;
- } else if (iocb->ki_flags & IOCB_NOWAIT) {
- ret = -EAGAIN;
- goto out;
- }
- }
-
- ret = __generic_file_write_iter(iocb, from);
- /*
- * Unaligned direct AIO must be the only IO in flight. Otherwise
- * overlapping aligned IO after unaligned might result in data
- * corruption.
- */
- if (ret == -EIOCBQUEUED && unaligned_aio)
- ext4_unwritten_wait(inode);
- inode_unlock(inode);
-
- if (ret > 0)
- ret = generic_write_sync(iocb, ret);
-
- return ret;
-
-out:
- inode_unlock(inode);
- return ret;
+ return ext4_buffered_write_iter(iocb, from);
}
#ifdef CONFIG_FS_DAX
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 2ca2e6e69344..ff683d918112 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -826,133 +826,6 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
/* Maximum number of blocks we map for direct IO at once. */
#define DIO_MAX_BLOCKS 4096
-/*
- * Get blocks function for the cases that need to start a transaction -
- * generally difference cases of direct IO and DAX IO. It also handles retries
- * in case of ENOSPC.
- */
-static int ext4_get_block_trans(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int flags)
-{
- int dio_credits;
- handle_t *handle;
- int retries = 0;
- int ret;
-
- /* Trim mapping request to maximum we can map at once for DIO */
- if (bh_result->b_size >> inode->i_blkbits > DIO_MAX_BLOCKS)
- bh_result->b_size = DIO_MAX_BLOCKS << inode->i_blkbits;
- dio_credits = ext4_chunk_trans_blocks(inode,
- bh_result->b_size >> inode->i_blkbits);
-retry:
- handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, dio_credits);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
-
- ret = _ext4_get_block(inode, iblock, bh_result, flags);
- ext4_journal_stop(handle);
-
- if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
- goto retry;
- return ret;
-}
-
-/* Get block function for DIO reads and writes to inodes without extents */
-int ext4_dio_get_block(struct inode *inode, sector_t iblock,
- struct buffer_head *bh, int create)
-{
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
- return ext4_get_block_trans(inode, iblock, bh, EXT4_GET_BLOCKS_CREATE);
-}
-
-/*
- * Get block function for AIO DIO writes when we create unwritten extent if
- * blocks are not allocated yet. The extent will be converted to written
- * after IO is complete.
- */
-static int ext4_dio_get_block_unwritten_async(struct inode *inode,
- sector_t iblock, struct buffer_head *bh_result, int create)
-{
- int ret;
-
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- ret = ext4_get_block_trans(inode, iblock, bh_result,
- EXT4_GET_BLOCKS_IO_CREATE_EXT);
-
- /*
- * When doing DIO using unwritten extents, we need io_end to convert
- * unwritten extents to written on IO completion. We allocate io_end
- * once we spot unwritten extent and store it in b_private. Generic
- * DIO code keeps b_private set and furthermore passes the value to
- * our completion callback in 'private' argument.
- */
- if (!ret && buffer_unwritten(bh_result)) {
- if (!bh_result->b_private) {
- ext4_io_end_t *io_end;
-
- io_end = ext4_init_io_end(inode, GFP_KERNEL);
- if (!io_end)
- return -ENOMEM;
- bh_result->b_private = io_end;
- ext4_set_io_unwritten_flag(inode, io_end);
- }
- set_buffer_defer_completion(bh_result);
- }
-
- return ret;
-}
-
-/*
- * Get block function for non-AIO DIO writes when we create unwritten extent if
- * blocks are not allocated yet. The extent will be converted to written
- * after IO is complete by ext4_direct_IO_write().
- */
-static int ext4_dio_get_block_unwritten_sync(struct inode *inode,
- sector_t iblock, struct buffer_head *bh_result, int create)
-{
- int ret;
-
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- ret = ext4_get_block_trans(inode, iblock, bh_result,
- EXT4_GET_BLOCKS_IO_CREATE_EXT);
-
- /*
- * Mark inode as having pending DIO writes to unwritten extents.
- * ext4_direct_IO_write() checks this flag and converts extents to
- * written.
- */
- if (!ret && buffer_unwritten(bh_result))
- ext4_set_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
-
- return ret;
-}
-
-static int ext4_dio_get_block_overwrite(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create)
-{
- int ret;
-
- ext4_debug("ext4_dio_get_block_overwrite: inode %lu, create flag %d\n",
- inode->i_ino, create);
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- ret = _ext4_get_block(inode, iblock, bh_result, 0);
- /*
- * Blocks should have been preallocated! ext4_file_write_iter() checks
- * that.
- */
- WARN_ON_ONCE(!buffer_mapped(bh_result) || buffer_unwritten(bh_result));
-
- return ret;
-}
-
-
/*
* `handle' can be NULL if create is zero
*/
@@ -3454,7 +3327,8 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
unsigned int flags)
{
handle_t *handle;
- int ret, dio_credits, retries = 0;
+ u8 blkbits = inode->i_blkbits;
+ int ret, dio_credits, m_flags = 0, retries = 0;
/*
* Trim the mapping request to the maximum value that we can map at
@@ -3475,7 +3349,33 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
if (IS_ERR(handle))
return PTR_ERR(handle);
- ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
+ /*
+ * DAX and direct I/O are the only two operations that are currently
+ * supported with IOMAP_WRITE.
+ */
+ WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
+ if (IS_DAX(inode))
+ m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
+ /*
+ * We use i_size instead of i_disksize here because delalloc writeback
+ * can complete at any point and subsequently push the i_disksize out
+ * to i_size. This could be beyond where the direct I/O is happening
+ * and thus expose allocated blocks to direct I/O reads.
+ */
+ else if ((map->m_lblk * (1 << blkbits)) >= i_size_read(inode))
+ m_flags = EXT4_GET_BLOCKS_CREATE;
+ else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+ m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
+
+ ret = ext4_map_blocks(handle, inode, map, m_flags);
+
+ /*
+ * We cannot fill holes in indirect tree based inodes as that could
+ * expose stale data in the case of a crash. Use the magic error code
+ * to fallback to buffered I/O.
+ */
+ if (!m_flags && !ret)
+ ret = -ENOTBLK;
ext4_journal_stop(handle);
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
@@ -3521,6 +3421,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ssize_t written, unsigned flags, struct iomap *iomap)
{
+ /*
+ * Check to see whether an error occurred while writing out the data to
+ * the allocated blocks. If so, return the magic error code so that we
+ * fallback to buffered I/O and attempt to complete the remainder of
+ * the I/O. Any blocks that may have been allocated in preparation for
+ * the direct I/O write will be reused during the buffered I/O.
+ */
+ if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
+ return -ENOTBLK;
+
return 0;
}
@@ -3597,243 +3507,6 @@ const struct iomap_ops ext4_iomap_report_ops = {
.iomap_begin = ext4_iomap_begin_report,
};
-static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
- ssize_t size, void *private)
-{
- ext4_io_end_t *io_end = private;
-
- /* if not async direct IO just return */
- if (!io_end)
- return 0;
-
- ext_debug("ext4_end_io_dio(): io_end 0x%p "
- "for inode %lu, iocb 0x%p, offset %llu, size %zd\n",
- io_end, io_end->inode->i_ino, iocb, offset, size);
-
- /*
- * Error during AIO DIO. We cannot convert unwritten extents as the
- * data was not written. Just clear the unwritten flag and drop io_end.
- */
- if (size <= 0) {
- ext4_clear_io_unwritten_flag(io_end);
- size = 0;
- }
- io_end->offset = offset;
- io_end->size = size;
- ext4_put_io_end(io_end);
-
- return 0;
-}
-
-/*
- * Handling of direct IO writes.
- *
- * For ext4 extent files, ext4 will do direct-io write even to holes,
- * preallocated extents, and those write extend the file, no need to
- * fall back to buffered IO.
- *
- * For holes, we fallocate those blocks, mark them as unwritten
- * If those blocks were preallocated, we mark sure they are split, but
- * still keep the range to write as unwritten.
- *
- * The unwritten extents will be converted to written when DIO is completed.
- * For async direct IO, since the IO may still pending when return, we
- * set up an end_io call back function, which will do the conversion
- * when async direct IO completed.
- *
- * If the O_DIRECT write will extend the file then add this inode to the
- * orphan list. So recovery will truncate it back to the original size
- * if the machine crashes during the write.
- *
- */
-static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
- struct ext4_inode_info *ei = EXT4_I(inode);
- ssize_t ret;
- loff_t offset = iocb->ki_pos;
- size_t count = iov_iter_count(iter);
- int overwrite = 0;
- get_block_t *get_block_func = NULL;
- int dio_flags = 0;
- loff_t final_size = offset + count;
- int orphan = 0;
- handle_t *handle;
-
- if (final_size > inode->i_size || final_size > ei->i_disksize) {
- /* Credits for sb + inode write */
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out;
- }
- ret = ext4_orphan_add(handle, inode);
- if (ret) {
- ext4_journal_stop(handle);
- goto out;
- }
- orphan = 1;
- ext4_update_i_disksize(inode, inode->i_size);
- ext4_journal_stop(handle);
- }
-
- BUG_ON(iocb->private == NULL);
-
- /*
- * Make all waiters for direct IO properly wait also for extent
- * conversion. This also disallows race between truncate() and
- * overwrite DIO as i_dio_count needs to be incremented under i_mutex.
- */
- inode_dio_begin(inode);
-
- /* If we do a overwrite dio, i_mutex locking can be released */
- overwrite = *((int *)iocb->private);
-
- if (overwrite)
- inode_unlock(inode);
-
- /*
- * For extent mapped files we could direct write to holes and fallocate.
- *
- * Allocated blocks to fill the hole are marked as unwritten to prevent
- * parallel buffered read to expose the stale data before DIO complete
- * the data IO.
- *
- * As to previously fallocated extents, ext4 get_block will just simply
- * mark the buffer mapped but still keep the extents unwritten.
- *
- * For non AIO case, we will convert those unwritten extents to written
- * after return back from blockdev_direct_IO. That way we save us from
- * allocating io_end structure and also the overhead of offloading
- * the extent convertion to a workqueue.
- *
- * For async DIO, the conversion needs to be deferred when the
- * IO is completed. The ext4 end_io callback function will be
- * called to take care of the conversion work. Here for async
- * case, we allocate an io_end structure to hook to the iocb.
- */
- iocb->private = NULL;
- if (overwrite)
- get_block_func = ext4_dio_get_block_overwrite;
- else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
- round_down(offset, i_blocksize(inode)) >= inode->i_size) {
- get_block_func = ext4_dio_get_block;
- dio_flags = DIO_LOCKING | DIO_SKIP_HOLES;
- } else if (is_sync_kiocb(iocb)) {
- get_block_func = ext4_dio_get_block_unwritten_sync;
- dio_flags = DIO_LOCKING;
- } else {
- get_block_func = ext4_dio_get_block_unwritten_async;
- dio_flags = DIO_LOCKING;
- }
- ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, iter,
- get_block_func, ext4_end_io_dio, NULL,
- dio_flags);
-
- if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
- EXT4_STATE_DIO_UNWRITTEN)) {
- int err;
- /*
- * for non AIO case, since the IO is already
- * completed, we could do the conversion right here
- */
- err = ext4_convert_unwritten_extents(NULL, inode,
- offset, ret);
- if (err < 0)
- ret = err;
- ext4_clear_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
- }
-
- inode_dio_end(inode);
- /* take i_mutex locking again if we do a ovewrite dio */
- if (overwrite)
- inode_lock(inode);
-
- if (ret < 0 && final_size > inode->i_size)
- ext4_truncate_failed_write(inode);
-
- /* Handle extending of i_size after direct IO write */
- if (orphan) {
- int err;
-
- /* Credits for sb + inode write */
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- /*
- * We wrote the data but cannot extend
- * i_size. Bail out. In async io case, we do
- * not return error here because we have
- * already submmitted the corresponding
- * bio. Returning error here makes the caller
- * think that this IO is done and failed
- * resulting in race with bio's completion
- * handler.
- */
- if (!ret)
- ret = PTR_ERR(handle);
- if (inode->i_nlink)
- ext4_orphan_del(NULL, inode);
-
- goto out;
- }
- if (inode->i_nlink)
- ext4_orphan_del(handle, inode);
- if (ret > 0) {
- loff_t end = offset + ret;
- if (end > inode->i_size || end > ei->i_disksize) {
- ext4_update_i_disksize(inode, end);
- if (end > inode->i_size)
- i_size_write(inode, end);
- /*
- * We're going to return a positive `ret'
- * here due to non-zero-length I/O, so there's
- * no way of reporting error returns from
- * ext4_mark_inode_dirty() to userspace. So
- * ignore it.
- */
- ext4_mark_inode_dirty(handle, inode);
- }
- }
- err = ext4_journal_stop(handle);
- if (ret == 0)
- ret = err;
- }
-out:
- return ret;
-}
-
-static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
- size_t count = iov_iter_count(iter);
- loff_t offset = iocb->ki_pos;
- ssize_t ret;
-
-#ifdef CONFIG_FS_ENCRYPTION
- if (IS_ENCRYPTED(inode) && S_ISREG(inode->i_mode))
- return 0;
-#endif
- if (fsverity_active(inode))
- return 0;
-
- /*
- * If we are doing data journalling we don't support O_DIRECT
- */
- if (ext4_should_journal_data(inode))
- return 0;
-
- /* Let buffer I/O handle the inline data case. */
- if (ext4_has_inline_data(inode))
- return 0;
-
- trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
- ret = ext4_direct_IO_write(iocb, iter);
- trace_ext4_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), ret);
- return ret;
-}
-
/*
* Pages can be marked dirty completely asynchronously from ext4's journalling
* activity. By filemap_sync_pte(), try_to_unmap_one(), etc. We cannot do
@@ -3871,7 +3544,7 @@ static const struct address_space_operations ext4_aops = {
.bmap = ext4_bmap,
.invalidatepage = ext4_invalidatepage,
.releasepage = ext4_releasepage,
- .direct_IO = ext4_direct_IO,
+ .direct_IO = noop_direct_IO,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
@@ -3888,7 +3561,7 @@ static const struct address_space_operations ext4_journalled_aops = {
.bmap = ext4_bmap,
.invalidatepage = ext4_journalled_invalidatepage,
.releasepage = ext4_releasepage,
- .direct_IO = ext4_direct_IO,
+ .direct_IO = noop_direct_IO,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
};
@@ -3904,7 +3577,7 @@ static const struct address_space_operations ext4_da_aops = {
.bmap = ext4_bmap,
.invalidatepage = ext4_invalidatepage,
.releasepage = ext4_releasepage,
- .direct_IO = ext4_direct_IO,
+ .direct_IO = noop_direct_IO,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
--
2.20.1
When the filesystem is created without a journal, we eventually call
into __generic_file_fsync() in order to write out all the modified
in-core data to the permanent storage device. This function happens to
try to obtain an inode_lock() while synchronizing the file's buffers
and its associated metadata.
Generally, this is fine; however, it becomes a problem when higher
level code has already obtained an inode_lock(), as this leads to a
recursive lock situation. This case is especially true when porting
direct I/O over to the iomap infrastructure, as we obtain an
inode_lock() early on in the I/O within ext4_dio_write_iter() and hold
it until the I/O has been completed. Consequently, to avoid this
issue, we move away from calling into __generic_file_fsync() and
perform the necessary synchronization tasks within ext4_sync_file().
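The dirty-state short-circuiting performed by the new ext4_fsync_nojournal() helper can be modelled in userspace as follows (the flag values are illustrative stand-ins for the kernel's I_DIRTY_* bits):

```c
#include <assert.h>
#include <stdbool.h>

#define DIRTY_SYNC	0x1	/* timestamp-style metadata only */
#define DIRTY_DATASYNC	0x2	/* metadata needed for data integrity */
#define DIRTY_PAGES	0x4	/* dirty data pages */
#define DIRTY_ALL	(DIRTY_SYNC | DIRTY_DATASYNC | DIRTY_PAGES)

/* Sketch: after the data buffers are synced, inode metadata is written
 * out only if something is actually dirty, and fdatasync() may
 * additionally skip metadata not required for data integrity. */
static bool must_sync_inode_metadata(unsigned int i_state, bool datasync)
{
	if (!(i_state & DIRTY_ALL))
		return false;
	if (datasync && !(i_state & DIRTY_DATASYNC))
		return false;
	return true;
}
```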
Signed-off-by: Matthew Bobrowski <[email protected]>
---
Thanks Jan and Christoph for the suggestion on this one, highly
appreciated.
fs/ext4/fsync.c | 72 ++++++++++++++++++++++++++++++++-----------------
1 file changed, 47 insertions(+), 25 deletions(-)
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 5508baa11bb6..e10206e7f4bb 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -80,6 +80,43 @@ static int ext4_sync_parent(struct inode *inode)
return ret;
}
+static int ext4_fsync_nojournal(struct inode *inode, bool datasync,
+ bool *needs_barrier)
+{
+ int ret, err;
+
+ ret = sync_mapping_buffers(inode->i_mapping);
+ if (!(inode->i_state & I_DIRTY_ALL))
+ return ret;
+ if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
+ return ret;
+
+ err = sync_inode_metadata(inode, 1);
+ if (!ret)
+ ret = err;
+
+ if (!ret)
+ ret = ext4_sync_parent(inode);
+ if (test_opt(inode->i_sb, BARRIER))
+ *needs_barrier = true;
+
+ return ret;
+}
+
+static int ext4_fsync_journal(struct inode *inode, bool datasync,
+ bool *needs_barrier)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ tid_t commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
+
+ if (journal->j_flags & JBD2_BARRIER &&
+ !jbd2_trans_will_send_data_barrier(journal, commit_tid))
+ *needs_barrier = true;
+
+ return jbd2_complete_transaction(journal, commit_tid);
+}
+
/*
* akpm: A new design for ext4_sync_file().
*
@@ -91,17 +128,14 @@ static int ext4_sync_parent(struct inode *inode)
* What we do is just kick off a commit and wait on it. This will snapshot the
* inode to disk.
*/
-
int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
{
- struct inode *inode = file->f_mapping->host;
- struct ext4_inode_info *ei = EXT4_I(inode);
- journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
int ret = 0, err;
- tid_t commit_tid;
bool needs_barrier = false;
+ struct inode *inode = file->f_mapping->host;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
- if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+ if (unlikely(ext4_forced_shutdown(sbi)))
return -EIO;
J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -111,23 +145,15 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
if (sb_rdonly(inode->i_sb)) {
/* Make sure that we read updated s_mount_flags value */
smp_rmb();
- if (EXT4_SB(inode->i_sb)->s_mount_flags & EXT4_MF_FS_ABORTED)
+ if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED)
ret = -EROFS;
goto out;
}
- if (!journal) {
- ret = __generic_file_fsync(file, start, end, datasync);
- if (!ret)
- ret = ext4_sync_parent(inode);
- if (test_opt(inode->i_sb, BARRIER))
- goto issue_flush;
- goto out;
- }
-
ret = file_write_and_wait_range(file, start, end);
if (ret)
return ret;
+
/*
* data=writeback,ordered:
* The caller's filemap_fdatawrite()/wait will sync the data.
@@ -142,18 +168,14 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
* (they were dirtied by commit). But that's OK - the blocks are
* safe in-journal, which is all fsync() needs to ensure.
*/
- if (ext4_should_journal_data(inode)) {
+ if (!sbi->s_journal)
+ ret = ext4_fsync_nojournal(inode, datasync, &needs_barrier);
+ else if (ext4_should_journal_data(inode))
ret = ext4_force_commit(inode->i_sb);
- goto out;
- }
+ else
+ ret = ext4_fsync_journal(inode, datasync, &needs_barrier);
- commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
- if (journal->j_flags & JBD2_BARRIER &&
- !jbd2_trans_will_send_data_barrier(journal, commit_tid))
- needs_barrier = true;
- ret = jbd2_complete_transaction(journal, commit_tid);
if (needs_barrier) {
- issue_flush:
err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
if (!ret)
ret = err;
--
2.20.1
On Mon, Oct 28, 2019 at 09:51:31PM +1100, Matthew Bobrowski wrote:
> Separate the iomap field population code that is currently within
> ext4_iomap_begin() into a separate helper ext4_set_iomap(). The intent
> of this function is self-explanatory, however the rationale behind
> taking this step is to reduce the overall clutter that we currently
> have within the ext4_iomap_begin() callback.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
> Reviewed-by: Ritesh Harjani <[email protected]>
> ---
> fs/ext4/inode.c | 92 +++++++++++++++++++++++++++----------------------
> 1 file changed, 50 insertions(+), 42 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index da2ca81e3d9c..073b7c873bb2 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3406,10 +3406,56 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
> return inode->i_state & I_DIRTY_DATASYNC;
> }
>
> +static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> + struct ext4_map_blocks *map, loff_t offset,
> + loff_t length)
> +{
> + u8 blkbits = inode->i_blkbits;
> +
> + /*
> + * Writes that span EOF might trigger an I/O size update on completion,
> + * so consider them to be dirty for the purpose of O_DSYNC, even if
> + * there is no other metadata changes being made or are pending.
> + */
> + iomap->flags = 0;
> + if (ext4_inode_datasync_dirty(inode) ||
> + offset + length > i_size_read(inode))
> + iomap->flags |= IOMAP_F_DIRTY;
> +
> + if (map->m_flags & EXT4_MAP_NEW)
> + iomap->flags |= IOMAP_F_NEW;
> +
> + iomap->bdev = inode->i_sb->s_bdev;
> + iomap->dax_dev = EXT4_SB(inode->i_sb)->s_daxdev;
> + iomap->offset = (u64) map->m_lblk << blkbits;
> + iomap->length = (u64) map->m_len << blkbits;
> +
> + if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN)) {
/me wonders if this would be easier to follow if it was less indenty:
/*
* <giant comment from below>
*/
if (m_flags & EXT4_MAP_UNWRITTEN) {
iomap->type = IOMAP_UNWRITTEN;
iomap->addr = ...
} else if (m_flags & EXT4_MAP_MAPPED) {
iomap->type = IOAMP_MAPPED;
iomap->addr = ...
} else {
iomap->type = IOMAP_HOLE;
iomap->addr = IOMAP_NULL_ADDR;
}
Rather than double-checking m_flags?
Otherwise looks fine to me...
--D
> + /*
> + * Flags passed to ext4_map_blocks() for direct I/O writes can
> + * result in m_flags having both EXT4_MAP_MAPPED and
> + * EXT4_MAP_UNWRITTEN bits set. In order for any allocated
> + * unwritten extents to be converted into written extents
> + * correctly within the ->end_io() handler, we need to ensure
> + * that the iomap->type is set appropriately. Hence, the reason
> + * why we need to check whether the EXT4_MAP_UNWRITTEN bit has
> + * been set first.
> + */
> + if (map->m_flags & EXT4_MAP_UNWRITTEN)
> + iomap->type = IOMAP_UNWRITTEN;
> + else if (map->m_flags & EXT4_MAP_MAPPED)
> + iomap->type = IOMAP_MAPPED;
> +
> + iomap->addr = (u64) map->m_pblk << blkbits;
> + } else {
> + iomap->type = IOMAP_HOLE;
> + iomap->addr = IOMAP_NULL_ADDR;
> + }
> +}
> +
> static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> unsigned flags, struct iomap *iomap, struct iomap *srcmap)
> {
> - struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> unsigned int blkbits = inode->i_blkbits;
> unsigned long first_block, last_block;
> struct ext4_map_blocks map;
> @@ -3523,47 +3569,9 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> return ret;
> }
>
> - /*
> - * Writes that span EOF might trigger an I/O size update on completion,
> - * so consider them to be dirty for the purposes of O_DSYNC, even if
> - * there is no other metadata changes being made or are pending here.
> - */
> - iomap->flags = 0;
> - if (ext4_inode_datasync_dirty(inode) ||
> - offset + length > i_size_read(inode))
> - iomap->flags |= IOMAP_F_DIRTY;
> - iomap->bdev = inode->i_sb->s_bdev;
> - iomap->dax_dev = sbi->s_daxdev;
> - iomap->offset = (u64)first_block << blkbits;
> - iomap->length = (u64)map.m_len << blkbits;
> -
> - if (ret == 0) {
> - iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
> - iomap->addr = IOMAP_NULL_ADDR;
> - } else {
> - /*
> - * Flags passed into ext4_map_blocks() for direct I/O writes
> - * can result in m_flags having both EXT4_MAP_MAPPED and
> - * EXT4_MAP_UNWRITTEN bits set. In order for any allocated
> - * unwritten extents to be converted into written extents
> - * correctly within the ->end_io() handler, we need to ensure
> - * that the iomap->type is set appropriately. Hence the reason
> - * why we need to check whether EXT4_MAP_UNWRITTEN is set
> - * first.
> - */
> - if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> - iomap->type = IOMAP_UNWRITTEN;
> - } else if (map.m_flags & EXT4_MAP_MAPPED) {
> - iomap->type = IOMAP_MAPPED;
> - } else {
> - WARN_ON_ONCE(1);
> - return -EIO;
> - }
> - iomap->addr = (u64)map.m_pblk << blkbits;
> - }
> -
> - if (map.m_flags & EXT4_MAP_NEW)
> - iomap->flags |= IOMAP_F_NEW;
> + ext4_set_iomap(inode, iomap, &map, offset, length);
> + if (delalloc && iomap->type == IOMAP_HOLE)
> + iomap->type = IOMAP_DELALLOC;
>
> return 0;
> }
> --
> 2.20.1
>
On Mon, Oct 28, 2019 at 10:03:48AM -0700, Darrick J. Wong wrote:
> On Mon, Oct 28, 2019 at 09:51:31PM +1100, Matthew Bobrowski wrote:
> > +static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> > + struct ext4_map_blocks *map, loff_t offset,
> > + loff_t length)
> > +{
> > + u8 blkbits = inode->i_blkbits;
> > +
> > + /*
> > + * Writes that span EOF might trigger an I/O size update on completion,
> > + * so consider them to be dirty for the purpose of O_DSYNC, even if
> > + * there is no other metadata changes being made or are pending.
> > + */
> > + iomap->flags = 0;
> > + if (ext4_inode_datasync_dirty(inode) ||
> > + offset + length > i_size_read(inode))
> > + iomap->flags |= IOMAP_F_DIRTY;
> > +
> > + if (map->m_flags & EXT4_MAP_NEW)
> > + iomap->flags |= IOMAP_F_NEW;
> > +
> > + iomap->bdev = inode->i_sb->s_bdev;
> > + iomap->dax_dev = EXT4_SB(inode->i_sb)->s_daxdev;
> > + iomap->offset = (u64) map->m_lblk << blkbits;
> > + iomap->length = (u64) map->m_len << blkbits;
> > +
> > + if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN)) {
>
> /me wonders if this would be easier to follow if it was less indenty:
>
> /*
> * <giant comment from below>
> */
> if (m_flags & EXT4_MAP_UNWRITTEN) {
> iomap->type = IOMAP_UNWRITTEN;
> iomap->addr = ...
> } else if (m_flags & EXT4_MAP_MAPPED) {
> iomap->type = IOAMP_MAPPED;
> iomap->addr = ...
> } else {
> iomap->type = IOMAP_HOLE;
> iomap->addr = IOMAP_NULL_ADDR;
> }
>
> Rather than double-checking m_flags?
Yeah, you're right. The extra checks and levels of indentation aren't really
necessary and can be simplified further, as you've suggested above.
Thanks for looking over this for me.
/me adds this to the TODO for v7.
--<M>--
On Tue, Oct 29, 2019 at 07:36:41AM +1100, Matthew Bobrowski wrote:
> On Mon, Oct 28, 2019 at 10:03:48AM -0700, Darrick J. Wong wrote:
> > On Mon, Oct 28, 2019 at 09:51:31PM +1100, Matthew Bobrowski wrote:
> > > +static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> > > + struct ext4_map_blocks *map, loff_t offset,
> > > + loff_t length)
> > > +{
> > > + u8 blkbits = inode->i_blkbits;
> > > +
> > > + /*
> > > + * Writes that span EOF might trigger an I/O size update on completion,
> > > + * so consider them to be dirty for the purpose of O_DSYNC, even if
> > > + * there is no other metadata changes being made or are pending.
> > > + */
> > > + iomap->flags = 0;
> > > + if (ext4_inode_datasync_dirty(inode) ||
> > > + offset + length > i_size_read(inode))
> > > + iomap->flags |= IOMAP_F_DIRTY;
> > > +
> > > + if (map->m_flags & EXT4_MAP_NEW)
> > > + iomap->flags |= IOMAP_F_NEW;
> > > +
> > > + iomap->bdev = inode->i_sb->s_bdev;
> > > + iomap->dax_dev = EXT4_SB(inode->i_sb)->s_daxdev;
> > > + iomap->offset = (u64) map->m_lblk << blkbits;
> > > + iomap->length = (u64) map->m_len << blkbits;
> > > +
> > > + if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN)) {
> >
> > /me wonders if this would be easier to follow if it was less indenty:
> >
> > /*
> > * <giant comment from below>
> > */
> > if (m_flags & EXT4_MAP_UNWRITTEN) {
> > iomap->type = IOMAP_UNWRITTEN;
> > iomap->addr = ...
> > } else if (m_flags & EXT4_MAP_MAPPED) {
> > iomap->type = IOAMP_MAPPED;
> > iomap->addr = ...
> > } else {
> > iomap->type = IOMAP_HOLE;
> > iomap->addr = IOMAP_NULL_ADDR;
> > }
> >
> > Rather than double-checking m_flags?
>
> Yeah, you're right. The extra checks and levels of indentation aren't really
> necessary and can be simplified further, as you've suggested above.
>
> Thanks for looking over this for me.
>
> /me adds this to the TODO for v7.
<nod> I didn't see anything else obviously wrong, FWIW.
--D
> --<M>--
On 10/28/19 4:22 PM, Matthew Bobrowski wrote:
> In preparation for implementing the iomap direct I/O modifications,
> the inode extension/truncate code needs to be moved out from the
> ext4_iomap_end() callback. For direct I/O, if the current code
> remained, it would behave incorrectly. Updating the inode size prior
> to converting unwritten extents would potentially allow a racing
> direct I/O read to find unwritten extents before they have been
> converted correctly.
>
> The inode extension/truncate code now resides within a new helper
> ext4_handle_inode_extension(). This function has been designed so that
> it can accommodate both DAX and direct I/O extension/truncate
> operations.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
Thanks for the patch. Looks good to me.
You may add:
Reviewed-by: Ritesh Harjani <[email protected]>
> ---
> fs/ext4/file.c | 89 ++++++++++++++++++++++++++++++++++++++++++++++++-
> fs/ext4/inode.c | 48 +-------------------------
> 2 files changed, 89 insertions(+), 48 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 440f4c6ba4ee..ec54fec96a81 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -33,6 +33,7 @@
> #include "ext4_jbd2.h"
> #include "xattr.h"
> #include "acl.h"
> +#include "truncate.h"
>
> static bool ext4_dio_supported(struct inode *inode)
> {
> @@ -234,12 +235,95 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> return iov_iter_count(from);
> }
>
> +static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> + ssize_t written, size_t count)
> +{
> + handle_t *handle;
> + bool truncate = false;
> + u8 blkbits = inode->i_blkbits;
> + ext4_lblk_t written_blk, end_blk;
> +
> + /*
> + * Note that EXT4_I(inode)->i_disksize can get extended up to
> + * inode->i_size while the I/O was running due to writeback of delalloc
> + * blocks. But, the code in ext4_iomap_alloc() is careful to use
> + * zeroed/unwritten extents if this is possible; thus we won't leave
> + * uninitialized blocks in a file even if we didn't succeed in writing
> + * as much as we intended.
> + */
> + WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
> + if (offset + count <= EXT4_I(inode)->i_disksize) {
> + /*
> + * We need to ensure that the inode is removed from the orphan
> + * list if it has been added prematurely, due to writeback of
> + * delalloc blocks.
> + */
> + if (!list_empty(&EXT4_I(inode)->i_orphan) && inode->i_nlink) {
> + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> +
> + if (IS_ERR(handle)) {
> + ext4_orphan_del(NULL, inode);
> + return PTR_ERR(handle);
> + }
> +
> + ext4_orphan_del(handle, inode);
> + ext4_journal_stop(handle);
> + }
> +
> + return written;
> + }
> +
> + if (written < 0)
> + goto truncate;
> +
> + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> + if (IS_ERR(handle)) {
> + written = PTR_ERR(handle);
> + goto truncate;
> + }
> +
> + if (ext4_update_inode_size(inode, offset + written))
> + ext4_mark_inode_dirty(handle, inode);
> +
> + /*
> + * We may need to truncate allocated but not written blocks beyond EOF.
> + */
> + written_blk = ALIGN(offset + written, 1 << blkbits);
> + end_blk = ALIGN(offset + count, 1 << blkbits);
> + if (written_blk < end_blk && ext4_can_truncate(inode))
> + truncate = true;
> +
> + /*
> + * Remove the inode from the orphan list if it has been extended and
> + * everything went OK.
> + */
> + if (!truncate && inode->i_nlink)
> + ext4_orphan_del(handle, inode);
> + ext4_journal_stop(handle);
> +
> + if (truncate) {
> +truncate:
> + ext4_truncate_failed_write(inode);
> + /*
> + * If the truncate operation failed early, then the inode may
> + * still be on the orphan list. In that case, we need to try
> + * remove the inode from the in-memory linked list.
> + */
> + if (inode->i_nlink)
> + ext4_orphan_del(NULL, inode);
> + }
> +
> + return written;
> +}
> +
> #ifdef CONFIG_FS_DAX
> static ssize_t
> ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
> - struct inode *inode = file_inode(iocb->ki_filp);
> ssize_t ret;
> + size_t count;
> + loff_t offset;
> + struct inode *inode = file_inode(iocb->ki_filp);
>
> if (!inode_trylock(inode)) {
> if (iocb->ki_flags & IOCB_NOWAIT)
> @@ -256,7 +340,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> if (ret)
> goto out;
>
> + offset = iocb->ki_pos;
> + count = iov_iter_count(from);
> ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
> + ret = ext4_handle_inode_extension(inode, offset, ret, count);
> out:
> inode_unlock(inode);
> if (ret > 0)
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e44b3b1dbbc4..7c21028760ee 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3543,53 +3543,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> ssize_t written, unsigned flags, struct iomap *iomap)
> {
> - int ret = 0;
> - handle_t *handle;
> - int blkbits = inode->i_blkbits;
> - bool truncate = false;
> -
> - if (!(flags & IOMAP_WRITE) || (flags & IOMAP_FAULT))
> - return 0;
> -
> - handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> - if (IS_ERR(handle)) {
> - ret = PTR_ERR(handle);
> - goto orphan_del;
> - }
> - if (ext4_update_inode_size(inode, offset + written))
> - ext4_mark_inode_dirty(handle, inode);
> - /*
> - * We may need to truncate allocated but not written blocks beyond EOF.
> - */
> - if (iomap->offset + iomap->length >
> - ALIGN(inode->i_size, 1 << blkbits)) {
> - ext4_lblk_t written_blk, end_blk;
> -
> - written_blk = (offset + written) >> blkbits;
> - end_blk = (offset + length) >> blkbits;
> - if (written_blk < end_blk && ext4_can_truncate(inode))
> - truncate = true;
> - }
> - /*
> - * Remove inode from orphan list if we were extending a inode and
> - * everything went fine.
> - */
> - if (!truncate && inode->i_nlink &&
> - !list_empty(&EXT4_I(inode)->i_orphan))
> - ext4_orphan_del(handle, inode);
> - ext4_journal_stop(handle);
> - if (truncate) {
> - ext4_truncate_failed_write(inode);
> -orphan_del:
> - /*
> - * If truncate failed early the inode might still be on the
> - * orphan list; we need to make sure the inode is removed from
> - * the orphan list in that case.
> - */
> - if (inode->i_nlink)
> - ext4_orphan_del(NULL, inode);
> - }
> - return ret;
> + return 0;
> }
>
> const struct iomap_ops ext4_iomap_ops = {
>
On 10/28/19 4:23 PM, Matthew Bobrowski wrote:
> This patch introduces a new direct I/O write path which makes use of
> the iomap infrastructure.
>
> All direct I/O writes are now passed from the ->write_iter() callback
> through to the new direct I/O handler ext4_dio_write_iter(). This
> function is responsible for calling into the iomap infrastructure via
> iomap_dio_rw().
>
> Code snippets from the existing direct I/O write code within
> ext4_file_write_iter(), such as checking whether the I/O request is
> unaligned asynchronous I/O or whether the write will result in an
> overwrite, have effectively been moved into the new direct I/O
> ->write_iter() handler.
>
> The block mapping flags that are eventually passed down to
> ext4_map_blocks() from the *_get_block_*() suite of routines have been
> taken out and introduced within ext4_iomap_alloc().
>
> For inode extension cases, ext4_handle_inode_extension() is
> effectively the function responsible for performing such metadata
> updates. This is called after iomap_dio_rw() has returned so that we
> can safely determine whether we need to potentially truncate any
> allocated blocks that may have been prepared for this direct I/O
> write. We don't perform the inode extension or truncate operations
> from the ->end_io() handler as we don't have the original I/O 'length'
> available there. The ->end_io() handler, however, is responsible for
> converting allocated unwritten extents to written extents.
>
> In the instance of a short write, we fall back and complete the
> remainder of the I/O using buffered I/O via
> ext4_buffered_write_iter().
>
> The existing buffer_head direct I/O implementation has been removed as
> it's now redundant.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
Thanks for the patch. This looks good to me.
You may add:
Reviewed-by: Ritesh Harjani <[email protected]>
> ---
> fs/ext4/ext4.h | 3 -
> fs/ext4/extents.c | 11 +-
> fs/ext4/file.c | 221 +++++++++++++++++--------
> fs/ext4/inode.c | 411 +++++-----------------------------------------
> 4 files changed, 193 insertions(+), 453 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index d0d88f411a44..fdab3420539d 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1579,7 +1579,6 @@ enum {
> EXT4_STATE_NO_EXPAND, /* No space for expansion */
> EXT4_STATE_DA_ALLOC_CLOSE, /* Alloc DA blks on close */
> EXT4_STATE_EXT_MIGRATE, /* Inode is migrating */
> - EXT4_STATE_DIO_UNWRITTEN, /* need convert on dio done*/
> EXT4_STATE_NEWENTRY, /* File just added to dir */
> EXT4_STATE_MAY_INLINE_DATA, /* may have in-inode data */
> EXT4_STATE_EXT_PRECACHED, /* extents have been precached */
> @@ -2560,8 +2559,6 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
> struct buffer_head *bh_result, int create);
> int ext4_get_block(struct inode *inode, sector_t iblock,
> struct buffer_head *bh_result, int create);
> -int ext4_dio_get_block(struct inode *inode, sector_t iblock,
> - struct buffer_head *bh_result, int create);
> int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
> struct buffer_head *bh, int create);
> int ext4_walk_page_buffers(handle_t *handle,
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index fb0f99dc8c22..df0629de3667 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -1753,16 +1753,9 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
> */
> if (ext1_ee_len + ext2_ee_len > EXT_INIT_MAX_LEN)
> return 0;
> - /*
> - * The check for IO to unwritten extent is somewhat racy as we
> - * increment i_unwritten / set EXT4_STATE_DIO_UNWRITTEN only after
> - * dropping i_data_sem. But reserved blocks should save us in that
> - * case.
> - */
> +
> if (ext4_ext_is_unwritten(ex1) &&
> - (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) ||
> - atomic_read(&EXT4_I(inode)->i_unwritten) ||
> - (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)))
> + ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)
> return 0;
> #ifdef AGGRESSIVE_TEST
> if (ext1_ee_len >= 4)
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 83ef9c9ed208..0df9d5191ed0 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -29,6 +29,7 @@
> #include <linux/pagevec.h>
> #include <linux/uio.h>
> #include <linux/mman.h>
> +#include <linux/backing-dev.h>
> #include "ext4.h"
> #include "ext4_jbd2.h"
> #include "xattr.h"
> @@ -155,13 +156,6 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
> return 0;
> }
>
> -static void ext4_unwritten_wait(struct inode *inode)
> -{
> - wait_queue_head_t *wq = ext4_ioend_wq(inode);
> -
> - wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_unwritten) == 0));
> -}
> -
> /*
> * This tests whether the IO in question is block-aligned or not.
> * Ext4 utilizes unwritten extents when hole-filling during direct IO, and they
> @@ -214,13 +208,13 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> struct inode *inode = file_inode(iocb->ki_filp);
> ssize_t ret;
>
> + if (unlikely(IS_IMMUTABLE(inode)))
> + return -EPERM;
> +
> ret = generic_write_checks(iocb, from);
> if (ret <= 0)
> return ret;
>
> - if (unlikely(IS_IMMUTABLE(inode)))
> - return -EPERM;
> -
> /*
> * If we have encountered a bitmap-format file, the size limit
> * is smaller than s_maxbytes, which is for extent-mapped files.
> @@ -232,9 +226,42 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
> return -EFBIG;
> iov_iter_truncate(from, sbi->s_bitmap_maxbytes - iocb->ki_pos);
> }
> +
> + ret = file_modified(iocb->ki_filp);
> + if (ret)
> + return ret;
> +
> return iov_iter_count(from);
> }
>
> +static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
> + struct iov_iter *from)
> +{
> + ssize_t ret;
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + if (iocb->ki_flags & IOCB_NOWAIT)
> + return -EOPNOTSUPP;
> +
> + inode_lock(inode);
> + ret = ext4_write_checks(iocb, from);
> + if (ret <= 0)
> + goto out;
> +
> + current->backing_dev_info = inode_to_bdi(inode);
> + ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
> + current->backing_dev_info = NULL;
> +
> +out:
> + inode_unlock(inode);
> + if (likely(ret > 0)) {
> + iocb->ki_pos += ret;
> + ret = generic_write_sync(iocb, ret);
> + }
> +
> + return ret;
> +}
> +
> static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> ssize_t written, size_t count)
> {
> @@ -316,6 +343,114 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, loff_t offset,
> return written;
> }
>
> +static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
> + int error, unsigned int flags)
> +{
> + loff_t offset = iocb->ki_pos;
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + if (error)
> + return error;
> +
> + if (size && flags & IOMAP_DIO_UNWRITTEN)
> + return ext4_convert_unwritten_extents(NULL, inode,
> + offset, size);
> +
> + return 0;
> +}
> +
> +static const struct iomap_dio_ops ext4_dio_write_ops = {
> + .end_io = ext4_dio_write_end_io,
> +};
> +
> +static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> + ssize_t ret;
> + size_t count;
> + loff_t offset;
> + handle_t *handle;
> + struct inode *inode = file_inode(iocb->ki_filp);
> + bool extend = false, overwrite = false, unaligned_aio = false;
> +
> + if (iocb->ki_flags & IOCB_NOWAIT) {
> + if (!inode_trylock(inode))
> + return -EAGAIN;
> + } else {
> + inode_lock(inode);
> + }
> +
> + if (!ext4_dio_supported(inode)) {
> + inode_unlock(inode);
> + /*
> + * Fallback to buffered I/O if the inode does not support
> + * direct I/O.
> + */
> + return ext4_buffered_write_iter(iocb, from);
> + }
> +
> + ret = ext4_write_checks(iocb, from);
> + if (ret <= 0) {
> + inode_unlock(inode);
> + return ret;
> + }
> +
> + /*
> + * Unaligned direct asynchronous I/O must be serialized among each
> + * other as the zeroing of partial blocks of two competing unaligned
> + * asynchronous I/O writes can result in data corruption.
> + */
> + offset = iocb->ki_pos;
> + count = iov_iter_count(from);
> + if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> + !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
> + unaligned_aio = true;
> + inode_dio_wait(inode);
> + }
> +
> + /*
> + * Determine whether the I/O will overwrite allocated and initialized
> + * blocks. If so, check to see whether it is possible to take the
> + * dioread_nolock path.
> + */
> + if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
> + ext4_should_dioread_nolock(inode)) {
> + overwrite = true;
> + downgrade_write(&inode->i_rwsem);
> + }
> +
> + if (offset + count > EXT4_I(inode)->i_disksize) {
> + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + goto out;
> + }
> +
> + ret = ext4_orphan_add(handle, inode);
> + if (ret) {
> + ext4_journal_stop(handle);
> + goto out;
> + }
> +
> + extend = true;
> + ext4_journal_stop(handle);
> + }
> +
> + ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, &ext4_dio_write_ops,
> + is_sync_kiocb(iocb) || unaligned_aio || extend);
> +
> + if (extend)
> + ret = ext4_handle_inode_extension(inode, offset, ret, count);
> +out:
> + if (overwrite)
> + inode_unlock_shared(inode);
> + else
> + inode_unlock(inode);
> +
> + if (ret >= 0 && iov_iter_count(from))
> + return ext4_buffered_write_iter(iocb, from);
> + return ret;
> +}
> +
> #ifdef CONFIG_FS_DAX
> static ssize_t
> ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> @@ -332,15 +467,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
> return -EAGAIN;
> inode_lock(inode);
> }
> +
> ret = ext4_write_checks(iocb, from);
> if (ret <= 0)
> goto out;
> - ret = file_remove_privs(iocb->ki_filp);
> - if (ret)
> - goto out;
> - ret = file_update_time(iocb->ki_filp);
> - if (ret)
> - goto out;
>
> offset = iocb->ki_pos;
> count = iov_iter_count(from);
> @@ -378,10 +508,6 @@ static ssize_t
> ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> {
> struct inode *inode = file_inode(iocb->ki_filp);
> - int o_direct = iocb->ki_flags & IOCB_DIRECT;
> - int unaligned_aio = 0;
> - int overwrite = 0;
> - ssize_t ret;
>
> if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
> return -EIO;
> @@ -390,59 +516,10 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
> if (IS_DAX(inode))
> return ext4_dax_write_iter(iocb, from);
> #endif
> + if (iocb->ki_flags & IOCB_DIRECT)
> + return ext4_dio_write_iter(iocb, from);
>
> - if (!inode_trylock(inode)) {
> - if (iocb->ki_flags & IOCB_NOWAIT)
> - return -EAGAIN;
> - inode_lock(inode);
> - }
> -
> - ret = ext4_write_checks(iocb, from);
> - if (ret <= 0)
> - goto out;
> -
> - /*
> - * Unaligned direct AIO must be serialized among each other as zeroing
> - * of partial blocks of two competing unaligned AIOs can result in data
> - * corruption.
> - */
> - if (o_direct && ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
> - !is_sync_kiocb(iocb) &&
> - ext4_unaligned_aio(inode, from, iocb->ki_pos)) {
> - unaligned_aio = 1;
> - ext4_unwritten_wait(inode);
> - }
> -
> - iocb->private = &overwrite;
> - /* Check whether we do a DIO overwrite or not */
> - if (o_direct && !unaligned_aio) {
> - if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
> - if (ext4_should_dioread_nolock(inode))
> - overwrite = 1;
> - } else if (iocb->ki_flags & IOCB_NOWAIT) {
> - ret = -EAGAIN;
> - goto out;
> - }
> - }
> -
> - ret = __generic_file_write_iter(iocb, from);
> - /*
> - * Unaligned direct AIO must be the only IO in flight. Otherwise
> - * overlapping aligned IO after unaligned might result in data
> - * corruption.
> - */
> - if (ret == -EIOCBQUEUED && unaligned_aio)
> - ext4_unwritten_wait(inode);
> - inode_unlock(inode);
> -
> - if (ret > 0)
> - ret = generic_write_sync(iocb, ret);
> -
> - return ret;
> -
> -out:
> - inode_unlock(inode);
> - return ret;
> + return ext4_buffered_write_iter(iocb, from);
> }
>
> #ifdef CONFIG_FS_DAX
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 2ca2e6e69344..ff683d918112 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -826,133 +826,6 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
> /* Maximum number of blocks we map for direct IO at once. */
> #define DIO_MAX_BLOCKS 4096
>
> -/*
> - * Get blocks function for the cases that need to start a transaction -
> - * generally difference cases of direct IO and DAX IO. It also handles retries
> - * in case of ENOSPC.
> - */
> -static int ext4_get_block_trans(struct inode *inode, sector_t iblock,
> - struct buffer_head *bh_result, int flags)
> -{
> - int dio_credits;
> - handle_t *handle;
> - int retries = 0;
> - int ret;
> -
> - /* Trim mapping request to maximum we can map at once for DIO */
> - if (bh_result->b_size >> inode->i_blkbits > DIO_MAX_BLOCKS)
> - bh_result->b_size = DIO_MAX_BLOCKS << inode->i_blkbits;
> - dio_credits = ext4_chunk_trans_blocks(inode,
> - bh_result->b_size >> inode->i_blkbits);
> -retry:
> - handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, dio_credits);
> - if (IS_ERR(handle))
> - return PTR_ERR(handle);
> -
> - ret = _ext4_get_block(inode, iblock, bh_result, flags);
> - ext4_journal_stop(handle);
> -
> - if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> - goto retry;
> - return ret;
> -}
> -
> -/* Get block function for DIO reads and writes to inodes without extents */
> -int ext4_dio_get_block(struct inode *inode, sector_t iblock,
> - struct buffer_head *bh, int create)
> -{
> - /* We don't expect handle for direct IO */
> - WARN_ON_ONCE(ext4_journal_current_handle());
> - return ext4_get_block_trans(inode, iblock, bh, EXT4_GET_BLOCKS_CREATE);
> -}
> -
> -/*
> - * Get block function for AIO DIO writes when we create unwritten extent if
> - * blocks are not allocated yet. The extent will be converted to written
> - * after IO is complete.
> - */
> -static int ext4_dio_get_block_unwritten_async(struct inode *inode,
> - sector_t iblock, struct buffer_head *bh_result, int create)
> -{
> - int ret;
> -
> - /* We don't expect handle for direct IO */
> - WARN_ON_ONCE(ext4_journal_current_handle());
> -
> - ret = ext4_get_block_trans(inode, iblock, bh_result,
> - EXT4_GET_BLOCKS_IO_CREATE_EXT);
> -
> - /*
> - * When doing DIO using unwritten extents, we need io_end to convert
> - * unwritten extents to written on IO completion. We allocate io_end
> - * once we spot unwritten extent and store it in b_private. Generic
> - * DIO code keeps b_private set and furthermore passes the value to
> - * our completion callback in 'private' argument.
> - */
> - if (!ret && buffer_unwritten(bh_result)) {
> - if (!bh_result->b_private) {
> - ext4_io_end_t *io_end;
> -
> - io_end = ext4_init_io_end(inode, GFP_KERNEL);
> - if (!io_end)
> - return -ENOMEM;
> - bh_result->b_private = io_end;
> - ext4_set_io_unwritten_flag(inode, io_end);
> - }
> - set_buffer_defer_completion(bh_result);
> - }
> -
> - return ret;
> -}
> -
> -/*
> - * Get block function for non-AIO DIO writes when we create unwritten extent if
> - * blocks are not allocated yet. The extent will be converted to written
> - * after IO is complete by ext4_direct_IO_write().
> - */
> -static int ext4_dio_get_block_unwritten_sync(struct inode *inode,
> - sector_t iblock, struct buffer_head *bh_result, int create)
> -{
> - int ret;
> -
> - /* We don't expect handle for direct IO */
> - WARN_ON_ONCE(ext4_journal_current_handle());
> -
> - ret = ext4_get_block_trans(inode, iblock, bh_result,
> - EXT4_GET_BLOCKS_IO_CREATE_EXT);
> -
> - /*
> - * Mark inode as having pending DIO writes to unwritten extents.
> - * ext4_direct_IO_write() checks this flag and converts extents to
> - * written.
> - */
> - if (!ret && buffer_unwritten(bh_result))
> - ext4_set_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
> -
> - return ret;
> -}
> -
> -static int ext4_dio_get_block_overwrite(struct inode *inode, sector_t iblock,
> - struct buffer_head *bh_result, int create)
> -{
> - int ret;
> -
> - ext4_debug("ext4_dio_get_block_overwrite: inode %lu, create flag %d\n",
> - inode->i_ino, create);
> - /* We don't expect handle for direct IO */
> - WARN_ON_ONCE(ext4_journal_current_handle());
> -
> - ret = _ext4_get_block(inode, iblock, bh_result, 0);
> - /*
> - * Blocks should have been preallocated! ext4_file_write_iter() checks
> - * that.
> - */
> - WARN_ON_ONCE(!buffer_mapped(bh_result) || buffer_unwritten(bh_result));
> -
> - return ret;
> -}
> -
> -
> /*
> * `handle' can be NULL if create is zero
> */
> @@ -3454,7 +3327,8 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> unsigned int flags)
> {
> handle_t *handle;
> - int ret, dio_credits, retries = 0;
> + u8 blkbits = inode->i_blkbits;
> + int ret, dio_credits, m_flags = 0, retries = 0;
>
> /*
> * Trim the mapping request to the maximum value that we can map at
> @@ -3475,7 +3349,33 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> if (IS_ERR(handle))
> return PTR_ERR(handle);
>
> - ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
> + /*
> + * DAX and direct I/O are the only two operations that are currently
> + * supported with IOMAP_WRITE.
> + */
> + WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
> + if (IS_DAX(inode))
> + m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
> + /*
> + * We use i_size instead of i_disksize here because delalloc writeback
> + * can complete at any point and subsequently push the i_disksize out
> + * to i_size. This could be beyond where the direct I/O is happening
> + * and thus expose allocated blocks to direct I/O reads.
> + */
> + else if ((map->m_lblk * (1 << blkbits)) >= i_size_read(inode))
> + m_flags = EXT4_GET_BLOCKS_CREATE;
> + else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
> + m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
> +
> + ret = ext4_map_blocks(handle, inode, map, m_flags);
> +
> + /*
> + * We cannot fill holes in indirect tree based inodes as that could
> + * expose stale data in the case of a crash. Use the magic error code
> + * to fallback to buffered I/O.
> + */
> + if (!m_flags && !ret)
> + ret = -ENOTBLK;
>
> ext4_journal_stop(handle);
> if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> @@ -3521,6 +3421,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> ssize_t written, unsigned flags, struct iomap *iomap)
> {
> + /*
> + * Check to see whether an error occurred while writing out the data to
> + * the allocated blocks. If so, return the magic error code so that we
> + * fallback to buffered I/O and attempt to complete the remainder of
> + * the I/O. Any blocks that may have been allocated in preparation for
> + * the direct I/O write will be reused during the buffered I/O.
> + */
> + if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
> + return -ENOTBLK;
> +
> return 0;
> }
>
> @@ -3597,243 +3507,6 @@ const struct iomap_ops ext4_iomap_report_ops = {
> .iomap_begin = ext4_iomap_begin_report,
> };
>
> -static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
> - ssize_t size, void *private)
> -{
> - ext4_io_end_t *io_end = private;
> -
> - /* if not async direct IO just return */
> - if (!io_end)
> - return 0;
> -
> - ext_debug("ext4_end_io_dio(): io_end 0x%p "
> - "for inode %lu, iocb 0x%p, offset %llu, size %zd\n",
> - io_end, io_end->inode->i_ino, iocb, offset, size);
> -
> - /*
> - * Error during AIO DIO. We cannot convert unwritten extents as the
> - * data was not written. Just clear the unwritten flag and drop io_end.
> - */
> - if (size <= 0) {
> - ext4_clear_io_unwritten_flag(io_end);
> - size = 0;
> - }
> - io_end->offset = offset;
> - io_end->size = size;
> - ext4_put_io_end(io_end);
> -
> - return 0;
> -}
> -
> -/*
> - * Handling of direct IO writes.
> - *
> - * For ext4 extent files, ext4 will do direct-io write even to holes,
> - * preallocated extents, and those write extend the file, no need to
> - * fall back to buffered IO.
> - *
> - * For holes, we fallocate those blocks, mark them as unwritten
> - * If those blocks were preallocated, we mark sure they are split, but
> - * still keep the range to write as unwritten.
> - *
> - * The unwritten extents will be converted to written when DIO is completed.
> - * For async direct IO, since the IO may still pending when return, we
> - * set up an end_io call back function, which will do the conversion
> - * when async direct IO completed.
> - *
> - * If the O_DIRECT write will extend the file then add this inode to the
> - * orphan list. So recovery will truncate it back to the original size
> - * if the machine crashes during the write.
> - *
> - */
> -static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
> -{
> - struct file *file = iocb->ki_filp;
> - struct inode *inode = file->f_mapping->host;
> - struct ext4_inode_info *ei = EXT4_I(inode);
> - ssize_t ret;
> - loff_t offset = iocb->ki_pos;
> - size_t count = iov_iter_count(iter);
> - int overwrite = 0;
> - get_block_t *get_block_func = NULL;
> - int dio_flags = 0;
> - loff_t final_size = offset + count;
> - int orphan = 0;
> - handle_t *handle;
> -
> - if (final_size > inode->i_size || final_size > ei->i_disksize) {
> - /* Credits for sb + inode write */
> - handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> - if (IS_ERR(handle)) {
> - ret = PTR_ERR(handle);
> - goto out;
> - }
> - ret = ext4_orphan_add(handle, inode);
> - if (ret) {
> - ext4_journal_stop(handle);
> - goto out;
> - }
> - orphan = 1;
> - ext4_update_i_disksize(inode, inode->i_size);
> - ext4_journal_stop(handle);
> - }
> -
> - BUG_ON(iocb->private == NULL);
> -
> - /*
> - * Make all waiters for direct IO properly wait also for extent
> - * conversion. This also disallows race between truncate() and
> - * overwrite DIO as i_dio_count needs to be incremented under i_mutex.
> - */
> - inode_dio_begin(inode);
> -
> - /* If we do a overwrite dio, i_mutex locking can be released */
> - overwrite = *((int *)iocb->private);
> -
> - if (overwrite)
> - inode_unlock(inode);
> -
> - /*
> - * For extent mapped files we could direct write to holes and fallocate.
> - *
> - * Allocated blocks to fill the hole are marked as unwritten to prevent
> - * parallel buffered read to expose the stale data before DIO complete
> - * the data IO.
> - *
> - * As to previously fallocated extents, ext4 get_block will just simply
> - * mark the buffer mapped but still keep the extents unwritten.
> - *
> - * For non AIO case, we will convert those unwritten extents to written
> - * after return back from blockdev_direct_IO. That way we save us from
> - * allocating io_end structure and also the overhead of offloading
> - * the extent convertion to a workqueue.
> - *
> - * For async DIO, the conversion needs to be deferred when the
> - * IO is completed. The ext4 end_io callback function will be
> - * called to take care of the conversion work. Here for async
> - * case, we allocate an io_end structure to hook to the iocb.
> - */
> - iocb->private = NULL;
> - if (overwrite)
> - get_block_func = ext4_dio_get_block_overwrite;
> - else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
> - round_down(offset, i_blocksize(inode)) >= inode->i_size) {
> - get_block_func = ext4_dio_get_block;
> - dio_flags = DIO_LOCKING | DIO_SKIP_HOLES;
> - } else if (is_sync_kiocb(iocb)) {
> - get_block_func = ext4_dio_get_block_unwritten_sync;
> - dio_flags = DIO_LOCKING;
> - } else {
> - get_block_func = ext4_dio_get_block_unwritten_async;
> - dio_flags = DIO_LOCKING;
> - }
> - ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, iter,
> - get_block_func, ext4_end_io_dio, NULL,
> - dio_flags);
> -
> - if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
> - EXT4_STATE_DIO_UNWRITTEN)) {
> - int err;
> - /*
> - * for non AIO case, since the IO is already
> - * completed, we could do the conversion right here
> - */
> - err = ext4_convert_unwritten_extents(NULL, inode,
> - offset, ret);
> - if (err < 0)
> - ret = err;
> - ext4_clear_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
> - }
> -
> - inode_dio_end(inode);
> - /* take i_mutex locking again if we do a ovewrite dio */
> - if (overwrite)
> - inode_lock(inode);
> -
> - if (ret < 0 && final_size > inode->i_size)
> - ext4_truncate_failed_write(inode);
> -
> - /* Handle extending of i_size after direct IO write */
> - if (orphan) {
> - int err;
> -
> - /* Credits for sb + inode write */
> - handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> - if (IS_ERR(handle)) {
> - /*
> - * We wrote the data but cannot extend
> - * i_size. Bail out. In async io case, we do
> - * not return error here because we have
> - * already submmitted the corresponding
> - * bio. Returning error here makes the caller
> - * think that this IO is done and failed
> - * resulting in race with bio's completion
> - * handler.
> - */
> - if (!ret)
> - ret = PTR_ERR(handle);
> - if (inode->i_nlink)
> - ext4_orphan_del(NULL, inode);
> -
> - goto out;
> - }
> - if (inode->i_nlink)
> - ext4_orphan_del(handle, inode);
> - if (ret > 0) {
> - loff_t end = offset + ret;
> - if (end > inode->i_size || end > ei->i_disksize) {
> - ext4_update_i_disksize(inode, end);
> - if (end > inode->i_size)
> - i_size_write(inode, end);
> - /*
> - * We're going to return a positive `ret'
> - * here due to non-zero-length I/O, so there's
> - * no way of reporting error returns from
> - * ext4_mark_inode_dirty() to userspace. So
> - * ignore it.
> - */
> - ext4_mark_inode_dirty(handle, inode);
> - }
> - }
> - err = ext4_journal_stop(handle);
> - if (ret == 0)
> - ret = err;
> - }
> -out:
> - return ret;
> -}
> -
> -static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> -{
> - struct file *file = iocb->ki_filp;
> - struct inode *inode = file->f_mapping->host;
> - size_t count = iov_iter_count(iter);
> - loff_t offset = iocb->ki_pos;
> - ssize_t ret;
> -
> -#ifdef CONFIG_FS_ENCRYPTION
> - if (IS_ENCRYPTED(inode) && S_ISREG(inode->i_mode))
> - return 0;
> -#endif
> - if (fsverity_active(inode))
> - return 0;
> -
> - /*
> - * If we are doing data journalling we don't support O_DIRECT
> - */
> - if (ext4_should_journal_data(inode))
> - return 0;
> -
> - /* Let buffer I/O handle the inline data case. */
> - if (ext4_has_inline_data(inode))
> - return 0;
> -
> - trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
> - ret = ext4_direct_IO_write(iocb, iter);
> - trace_ext4_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), ret);
> - return ret;
> -}
> -
> /*
> * Pages can be marked dirty completely asynchronously from ext4's journalling
> * activity. By filemap_sync_pte(), try_to_unmap_one(), etc. We cannot do
> @@ -3871,7 +3544,7 @@ static const struct address_space_operations ext4_aops = {
> .bmap = ext4_bmap,
> .invalidatepage = ext4_invalidatepage,
> .releasepage = ext4_releasepage,
> - .direct_IO = ext4_direct_IO,
> + .direct_IO = noop_direct_IO,
> .migratepage = buffer_migrate_page,
> .is_partially_uptodate = block_is_partially_uptodate,
> .error_remove_page = generic_error_remove_page,
> @@ -3888,7 +3561,7 @@ static const struct address_space_operations ext4_journalled_aops = {
> .bmap = ext4_bmap,
> .invalidatepage = ext4_journalled_invalidatepage,
> .releasepage = ext4_releasepage,
> - .direct_IO = ext4_direct_IO,
> + .direct_IO = noop_direct_IO,
> .is_partially_uptodate = block_is_partially_uptodate,
> .error_remove_page = generic_error_remove_page,
> };
> @@ -3904,7 +3577,7 @@ static const struct address_space_operations ext4_da_aops = {
> .bmap = ext4_bmap,
> .invalidatepage = ext4_invalidatepage,
> .releasepage = ext4_releasepage,
> - .direct_IO = ext4_direct_IO,
> + .direct_IO = noop_direct_IO,
> .migratepage = buffer_migrate_page,
> .is_partially_uptodate = block_is_partially_uptodate,
> .error_remove_page = generic_error_remove_page,
>
On 10/28/19 4:22 PM, Matthew Bobrowski wrote:
> As part of the ext4_iomap_begin() cleanups that precede this patch, we
> also split up the IOMAP_REPORT branch into a completely separate
> ->iomap_begin() callback named ext4_iomap_begin_report(). Again, the
> rationale for this change is to reduce the overall clutter within
> ext4_iomap_begin().
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
> Reviewed-by: Jan Kara <[email protected]>
Thanks for the patch. Looks good to me.
You may add:
Reviewed-by: Ritesh Harjani <[email protected]>
> ---
> fs/ext4/ext4.h | 1 +
> fs/ext4/file.c | 6 ++-
> fs/ext4/inode.c | 134 +++++++++++++++++++++++++++++-------------------
> 3 files changed, 85 insertions(+), 56 deletions(-)
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index 03db3e71676c..d0d88f411a44 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -3379,6 +3379,7 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
> }
>
> extern const struct iomap_ops ext4_iomap_ops;
> +extern const struct iomap_ops ext4_iomap_report_ops;
>
> static inline int ext4_buffer_uptodate(struct buffer_head *bh)
> {
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 8d2bbcc2d813..ab75aee3e687 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -494,12 +494,14 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int whence)
> maxbytes, i_size_read(inode));
> case SEEK_HOLE:
> inode_lock_shared(inode);
> - offset = iomap_seek_hole(inode, offset, &ext4_iomap_ops);
> + offset = iomap_seek_hole(inode, offset,
> + &ext4_iomap_report_ops);
> inode_unlock_shared(inode);
> break;
> case SEEK_DATA:
> inode_lock_shared(inode);
> - offset = iomap_seek_data(inode, offset, &ext4_iomap_ops);
> + offset = iomap_seek_data(inode, offset,
> + &ext4_iomap_report_ops);
> inode_unlock_shared(inode);
> break;
> }
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 325abba6482c..50b4835cd927 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3513,74 +3513,32 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> unsigned flags, struct iomap *iomap, struct iomap *srcmap)
> {
> - unsigned int blkbits = inode->i_blkbits;
> - unsigned long first_block, last_block;
> - struct ext4_map_blocks map;
> - bool delalloc = false;
> int ret;
> + struct ext4_map_blocks map;
> + u8 blkbits = inode->i_blkbits;
>
> if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> return -EINVAL;
> - first_block = offset >> blkbits;
> - last_block = min_t(loff_t, (offset + length - 1) >> blkbits,
> - EXT4_MAX_LOGICAL_BLOCK);
> -
> - if (flags & IOMAP_REPORT) {
> - if (ext4_has_inline_data(inode)) {
> - ret = ext4_inline_data_iomap(inode, iomap);
> - if (ret != -EAGAIN) {
> - if (ret == 0 && offset >= iomap->length)
> - ret = -ENOENT;
> - return ret;
> - }
> - }
> - } else {
> - if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> - return -ERANGE;
> - }
>
> - map.m_lblk = first_block;
> - map.m_len = last_block - first_block + 1;
> -
> - if (flags & IOMAP_REPORT) {
> - ret = ext4_map_blocks(NULL, inode, &map, 0);
> - if (ret < 0)
> - return ret;
> -
> - if (ret == 0) {
> - ext4_lblk_t end = map.m_lblk + map.m_len - 1;
> - struct extent_status es;
> -
> - ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
> - map.m_lblk, end, &es);
> + if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
> + return -ERANGE;
>
> - if (!es.es_len || es.es_lblk > end) {
> - /* entire range is a hole */
> - } else if (es.es_lblk > map.m_lblk) {
> - /* range starts with a hole */
> - map.m_len = es.es_lblk - map.m_lblk;
> - } else {
> - ext4_lblk_t offs = 0;
> + /*
> + * Calculate the first and last logical blocks respectively.
> + */
> + map.m_lblk = offset >> blkbits;
> + map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> + EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
>
> - if (es.es_lblk < map.m_lblk)
> - offs = map.m_lblk - es.es_lblk;
> - map.m_lblk = es.es_lblk + offs;
> - map.m_len = es.es_len - offs;
> - delalloc = true;
> - }
> - }
> - } else if (flags & IOMAP_WRITE) {
> + if (flags & IOMAP_WRITE)
> ret = ext4_iomap_alloc(inode, &map, flags);
> - } else {
> + else
> ret = ext4_map_blocks(NULL, inode, &map, 0);
> - }
>
> if (ret < 0)
> return ret;
>
> ext4_set_iomap(inode, iomap, &map, offset, length);
> - if (delalloc && iomap->type == IOMAP_HOLE)
> - iomap->type = IOMAP_DELALLOC;
>
> return 0;
> }
> @@ -3642,6 +3600,74 @@ const struct iomap_ops ext4_iomap_ops = {
> .iomap_end = ext4_iomap_end,
> };
>
> +static bool ext4_iomap_is_delalloc(struct inode *inode,
> + struct ext4_map_blocks *map)
> +{
> + struct extent_status es;
> + ext4_lblk_t offset = 0, end = map->m_lblk + map->m_len - 1;
> +
> + ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
> + map->m_lblk, end, &es);
> +
> + if (!es.es_len || es.es_lblk > end)
> + return false;
> +
> + if (es.es_lblk > map->m_lblk) {
> + map->m_len = es.es_lblk - map->m_lblk;
> + return false;
> + }
> +
> + offset = map->m_lblk - es.es_lblk;
> + map->m_len = es.es_len - offset;
> +
> + return true;
> +}
> +
> +static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
> + loff_t length, unsigned int flags,
> + struct iomap *iomap, struct iomap *srcmap)
> +{
> + int ret;
> + bool delalloc = false;
> + struct ext4_map_blocks map;
> + u8 blkbits = inode->i_blkbits;
> +
> + if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
> + return -EINVAL;
> +
> + if (ext4_has_inline_data(inode)) {
> + ret = ext4_inline_data_iomap(inode, iomap);
> + if (ret != -EAGAIN) {
> + if (ret == 0 && offset >= iomap->length)
> + ret = -ENOENT;
> + return ret;
> + }
> + }
> +
> + /*
> + * Calculate the first and last logical block respectively.
> + */
> + map.m_lblk = offset >> blkbits;
> + map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
> + EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
> +
> + ret = ext4_map_blocks(NULL, inode, &map, 0);
> + if (ret < 0)
> + return ret;
> + if (ret == 0)
> + delalloc = ext4_iomap_is_delalloc(inode, &map);
> +
> + ext4_set_iomap(inode, iomap, &map, offset, length);
> + if (delalloc && iomap->type == IOMAP_HOLE)
> + iomap->type = IOMAP_DELALLOC;
> +
> + return 0;
> +}
> +
> +const struct iomap_ops ext4_iomap_report_ops = {
> + .iomap_begin = ext4_iomap_begin_report,
> +};
> +
> static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
> ssize_t size, void *private)
> {
>
On 10/28/19 4:23 PM, Matthew Bobrowski wrote:
> When the filesystem is created without a journal, we eventually call
> into __generic_file_fsync() in order to write out all the modified
> in-core data to the permanent storage device. This function happens to
> try to obtain an inode_lock() while synchronizing the file's buffers
> and its associated metadata.
>
> Generally, this is fine, however it becomes a problem when there is
> higher level code that has already obtained an inode_lock() as this
> leads to a recursive lock situation. This case is especially true when
> porting direct I/O across to the iomap infrastructure, as we obtain an
> inode_lock() early on in the I/O within ext4_dio_write_iter() and hold
> it until the I/O has been completed. Consequently, to not run into
> this specific issue, we move away from calling into
> __generic_file_fsync() and perform the necessary synchronization tasks
> within ext4_sync_file().
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
Thanks for the patch. Looks good to me.
You may add:
Reviewed-by: Ritesh Harjani <[email protected]>
> ---
>
> Thanks Jan and Christoph for the suggestion on this one, highly
> appreciated.
>
> fs/ext4/fsync.c | 72 ++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 47 insertions(+), 25 deletions(-)
>
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 5508baa11bb6..e10206e7f4bb 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -80,6 +80,43 @@ static int ext4_sync_parent(struct inode *inode)
> return ret;
> }
>
> +static int ext4_fsync_nojournal(struct inode *inode, bool datasync,
> + bool *needs_barrier)
> +{
> + int ret, err;
> +
> + ret = sync_mapping_buffers(inode->i_mapping);
> + if (!(inode->i_state & I_DIRTY_ALL))
> + return ret;
> + if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> + return ret;
> +
> + err = sync_inode_metadata(inode, 1);
> + if (!ret)
> + ret = err;
> +
> + if (!ret)
> + ret = ext4_sync_parent(inode);
> + if (test_opt(inode->i_sb, BARRIER))
> + *needs_barrier = true;
> +
> + return ret;
> +}
> +
> +static int ext4_fsync_journal(struct inode *inode, bool datasync,
> + bool *needs_barrier)
> +{
> + struct ext4_inode_info *ei = EXT4_I(inode);
> + journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> + tid_t commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
> +
> + if (journal->j_flags & JBD2_BARRIER &&
> + !jbd2_trans_will_send_data_barrier(journal, commit_tid))
> + *needs_barrier = true;
> +
> + return jbd2_complete_transaction(journal, commit_tid);
> +}
> +
> /*
> * akpm: A new design for ext4_sync_file().
> *
> @@ -91,17 +128,14 @@ static int ext4_sync_parent(struct inode *inode)
> * What we do is just kick off a commit and wait on it. This will snapshot the
> * inode to disk.
> */
> -
> int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> {
> - struct inode *inode = file->f_mapping->host;
> - struct ext4_inode_info *ei = EXT4_I(inode);
> - journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> int ret = 0, err;
> - tid_t commit_tid;
> bool needs_barrier = false;
> + struct inode *inode = file->f_mapping->host;
> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>
> - if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
> + if (unlikely(ext4_forced_shutdown(sbi)))
> return -EIO;
>
> J_ASSERT(ext4_journal_current_handle() == NULL);
> @@ -111,23 +145,15 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> if (sb_rdonly(inode->i_sb)) {
> /* Make sure that we read updated s_mount_flags value */
> smp_rmb();
> - if (EXT4_SB(inode->i_sb)->s_mount_flags & EXT4_MF_FS_ABORTED)
> + if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED)
> ret = -EROFS;
> goto out;
> }
>
> - if (!journal) {
> - ret = __generic_file_fsync(file, start, end, datasync);
> - if (!ret)
> - ret = ext4_sync_parent(inode);
> - if (test_opt(inode->i_sb, BARRIER))
> - goto issue_flush;
> - goto out;
> - }
> -
> ret = file_write_and_wait_range(file, start, end);
> if (ret)
> return ret;
> +
> /*
> * data=writeback,ordered:
> * The caller's filemap_fdatawrite()/wait will sync the data.
> @@ -142,18 +168,14 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> * (they were dirtied by commit). But that's OK - the blocks are
> * safe in-journal, which is all fsync() needs to ensure.
> */
> - if (ext4_should_journal_data(inode)) {
> + if (!sbi->s_journal)
> + ret = ext4_fsync_nojournal(inode, datasync, &needs_barrier);
> + else if (ext4_should_journal_data(inode))
> ret = ext4_force_commit(inode->i_sb);
> - goto out;
> - }
> + else
> + ret = ext4_fsync_journal(inode, datasync, &needs_barrier);
>
> - commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
> - if (journal->j_flags & JBD2_BARRIER &&
> - !jbd2_trans_will_send_data_barrier(journal, commit_tid))
> - needs_barrier = true;
> - ret = jbd2_complete_transaction(journal, commit_tid);
> if (needs_barrier) {
> - issue_flush:
> err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
> if (!ret)
> ret = err;
>
Hi Matthew, it looks like there are a number of problems with this
patch series when using the ext3 backwards compatibility mode (e.g.,
no extents enabled).
So the following configurations are failing:
kvm-xfstests -c ext3 generic/091 generic/240 generic/263
It looks like the main issue is related to fsx failing. On v5.4-rc3
the following command:
fsx -N 10000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -W /vdd/junk
... has the following result:
root@kvm-xfstests:~# mount /vdd
[ 9.366568] EXT4-fs (vdd): mounting ext3 file system using the ext4 subsystem
[ 9.385537] EXT4-fs (vdd): recovery complete
[ 9.389219] EXT4-fs (vdd): mounted filesystem with ordered data mode. Opts: (null)
root@kvm-xfstests:~# fsx -N 10000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -W /vdd/junk
mapped writes DISABLED
Seed set to 1
main: filesystem does not support fallocate mode 0, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_KEEP_SIZE, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_ZERO_RANGE, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_COLLAPSE_RANGE, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_INSERT_RANGE, disabling!
main: filesystem does not support clone range, disabling!
main: filesystem does not support dedupe range, disabling!
truncating to largest ever: 0xe400
copying to largest ever: 0x6de00
copying to largest ever: 0x76a00
copying to largest ever: 0x78200
copying to largest ever: 0x78400
copying to largest ever: 0x78c00
truncating to largest ever: 0x79200
truncating to largest ever: 0x79600
copying to largest ever: 0x79800
All 10000 operations completed A-OK!
root@kvm-xfstests:~#
However, with this patch series applied, the fsx command fails with a
"short write":
root@kvm-xfstests:~# mount /vdd
[ 7.854352] EXT4-fs (vdd): mounting ext3 file system using the ext4 subsystem
[ 7.892418] EXT4-fs (vdd): recovery complete
[ 7.896480] EXT4-fs (vdd): mounted filesystem with ordered data mode. Opts: (null)
root@kvm-xfstests:~# fsx -N 10000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -W /vdd/junk
mapped writes DISABLED
Seed set to 1
main: filesystem does not support fallocate mode 0, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_KEEP_SIZE, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_ZERO_RANGE, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_COLLAPSE_RANGE, disabling!
main: filesystem does not support fallocate mode FALLOC_FL_INSERT_RANGE, disabling!
main: filesystem does not support clone range, disabling!
main: filesystem does not support dedupe range, disabling!
truncating to largest ever: 0xe400
copying to largest ever: 0x6de00
copying to largest ever: 0x76a00
short write: 0xc00 bytes instead of 0x9c00
LOG DUMP (60 total operations):
1( 1 mod 256): SKIPPED (no operation)
2( 2 mod 256): SKIPPED (no operation)
3( 3 mod 256): SKIPPED (no operation)
4( 4 mod 256): TRUNCATE UP from 0x0 to 0xe400
5( 5 mod 256): SKIPPED (no operation)
6( 6 mod 256): SKIPPED (no operation)
7( 7 mod 256): SKIPPED (no operation)
8( 8 mod 256): WRITE 0x65a00 thru 0x665ff (0xc00 bytes) HOLE
9( 9 mod 256): WRITE 0x17400 thru 0x1e7ff (0x7400 bytes)
10( 10 mod 256): SKIPPED (no operation)
11( 11 mod 256): WRITE 0xa200 thru 0x12bff (0x8a00 bytes)
12( 12 mod 256): SKIPPED (no operation)
13( 13 mod 256): SKIPPED (no operation)
14( 14 mod 256): SKIPPED (no operation)
15( 15 mod 256): SKIPPED (no operation)
16( 16 mod 256): SKIPPED (no operation)
17( 17 mod 256): SKIPPED (no operation)
18( 18 mod 256): COPY 0x8000 thru 0x13fff (0xc000 bytes) to 0x61e00 thru 0x6ddff
19( 19 mod 256): SKIPPED (no operation)
20( 20 mod 256): COPY 0x3a000 thru 0x46fff (0xd000 bytes) to 0x69a00 thru 0x769ff
21( 21 mod 256): READ 0x34000 thru 0x3efff (0xb000 bytes)
22( 22 mod 256): WRITE 0x1f200 thru 0x2cbff (0xda00 bytes)
23( 23 mod 256): READ 0x55000 thru 0x5bfff (0x7000 bytes)
24( 24 mod 256): WRITE 0x23000 thru 0x285ff (0x5600 bytes)
25( 25 mod 256): WRITE 0x47800 thru 0x4b1ff (0x3a00 bytes)
26( 26 mod 256): SKIPPED (no operation)
27( 27 mod 256): READ 0x16000 thru 0x1afff (0x5000 bytes)
28( 28 mod 256): SKIPPED (no operation)
29( 29 mod 256): SKIPPED (no operation)
30( 30 mod 256): SKIPPED (no operation)
31( 31 mod 256): SKIPPED (no operation)
32( 32 mod 256): SKIPPED (no operation)
33( 33 mod 256): SKIPPED (no operation)
34( 34 mod 256): SKIPPED (no operation)
35( 35 mod 256): READ 0x23000 thru 0x2afff (0x8000 bytes)
36( 36 mod 256): SKIPPED (no operation)
37( 37 mod 256): PUNCH 0x11100 thru 0x18049 (0x6f4a bytes)
38( 38 mod 256): READ 0x3000 thru 0x5fff (0x3000 bytes)
39( 39 mod 256): SKIPPED (no operation)
40( 40 mod 256): COPY 0x5b000 thru 0x65fff (0xb000 bytes) to 0x66600 thru 0x715ff
41( 41 mod 256): SKIPPED (no operation)
42( 42 mod 256): WRITE 0x36c00 thru 0x3fdff (0x9200 bytes)
43( 43 mod 256): SKIPPED (no operation)
44( 44 mod 256): PUNCH 0x3199d thru 0x3feaf (0xe513 bytes)
45( 45 mod 256): SKIPPED (no operation)
46( 46 mod 256): SKIPPED (no operation)
47( 47 mod 256): COPY 0x71000 thru 0x75fff (0x5000 bytes) to 0x38800 thru 0x3d7ff
48( 48 mod 256): SKIPPED (no operation)
49( 49 mod 256): SKIPPED (no operation)
50( 50 mod 256): SKIPPED (no operation)
51( 51 mod 256): READ 0x3a000 thru 0x43fff (0xa000 bytes)
52( 52 mod 256): SKIPPED (no operation)
53( 53 mod 256): READ 0x10000 thru 0x16fff (0x7000 bytes)
54( 54 mod 256): SKIPPED (no operation)
55( 55 mod 256): SKIPPED (no operation)
56( 56 mod 256): PUNCH 0x8a16 thru 0x1845d (0xfa48 bytes)
57( 57 mod 256): WRITE 0x12800 thru 0x207ff (0xe000 bytes)
58( 58 mod 256): SKIPPED (no operation)
59( 59 mod 256): COPY 0x28000 thru 0x36fff (0xf000 bytes) to 0x9600 thru 0x185ff
60( 60 mod 256): WRITE 0x24000 thru 0x2dbff (0x9c00 bytes)
Log of operations saved to "/vdd/junk.fsxops"; replay with --replay-ops
Correct content saved for comparison
(maybe hexdump "/vdd/junk" vs "/vdd/junk.fsxgood")
Could you take a look? Thanks!!
- Ted
On Tue, Oct 29, 2019 at 07:31:59PM -0400, Theodore Y. Ts'o wrote:
> Hi Matthew, it looks like there are a number of problems with this
> patch series when using the ext3 backwards compatibility mode (e.g.,
> no extents enabled).
>
> So the following configurations are failing:
>
> kvm-xfstests -c ext3 generic/091 generic/240 generic/263
Here are the details of the generic/240 failure:
root@kvm-xfstests:~# diff -u /root/xfstests/tests/generic/240.out /results/ext4/results-ext3/generic/240.out.bad
--- /root/xfstests/tests/generic/240.out 2019-10-21 15:29:56.000000000 -0400
+++ /results/ext4/results-ext3/generic/240.out.bad 2019-10-29 19:32:29.166850310 -0400
@@ -1,2 +1,7 @@
QA output created by 240
Silence is golden.
+AIO write offset 4608 expected 4096 got 512
+AIO write offset 8704 expected 4096 got 512
+non one buffer at buf[0] => 0x00,00,00,00
+non-one read at offset 12800
+*** WARNING *** /vdd/aiodio_sparse has not been unlinked; if you don't rm it manually first, it may influence the next run
- Ted
On Tue, Oct 29, 2019 at 07:34:01PM -0400, Theodore Y. Ts'o wrote:
> On Tue, Oct 29, 2019 at 07:31:59PM -0400, Theodore Y. Ts'o wrote:
> > Hi Matthew, it looks like there are a number of problems with this
> > patch series when using the ext3 backwards compatibility mode (e.g.,
> > no extents enabled).
> >
> > So the following configurations are failing:
> >
> > kvm-xfstests -c ext3 generic/091 generic/240 generic/263
This is one mode that I didn't get around to testing. Let me take a
look at the above and get back to you.
--<M>--
On Mon 28-10-19 21:53:52, Matthew Bobrowski wrote:
> When the filesystem is created without a journal, we eventually call
> into __generic_file_fsync() in order to write out all the modified
> in-core data to the permanent storage device. This function happens
> to try to obtain an inode_lock() while synchronizing the file's
> buffers and its associated metadata.
>
> Generally, this is fine; however, it becomes a problem when higher
> level code has already obtained an inode_lock(), as this leads to a
> recursive lock situation. This is especially the case when porting
> direct I/O across to the iomap infrastructure, as we obtain an
> inode_lock() early on in the I/O within ext4_dio_write_iter() and
> hold it until the I/O has been completed. Consequently, to avoid
> this specific issue, we move away from calling into
> __generic_file_fsync() and instead perform the necessary
> synchronization tasks within ext4_sync_file().
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
Nice! You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
>
> Thanks Jan and Christoph for the suggestion on this one, highly
> appreciated.
>
> fs/ext4/fsync.c | 72 ++++++++++++++++++++++++++++++++-----------------
> 1 file changed, 47 insertions(+), 25 deletions(-)
>
> diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
> index 5508baa11bb6..e10206e7f4bb 100644
> --- a/fs/ext4/fsync.c
> +++ b/fs/ext4/fsync.c
> @@ -80,6 +80,43 @@ static int ext4_sync_parent(struct inode *inode)
> return ret;
> }
>
> +static int ext4_fsync_nojournal(struct inode *inode, bool datasync,
> + bool *needs_barrier)
> +{
> + int ret, err;
> +
> + ret = sync_mapping_buffers(inode->i_mapping);
> + if (!(inode->i_state & I_DIRTY_ALL))
> + return ret;
> + if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> + return ret;
> +
> + err = sync_inode_metadata(inode, 1);
> + if (!ret)
> + ret = err;
> +
> + if (!ret)
> + ret = ext4_sync_parent(inode);
> + if (test_opt(inode->i_sb, BARRIER))
> + *needs_barrier = true;
> +
> + return ret;
> +}
> +
> +static int ext4_fsync_journal(struct inode *inode, bool datasync,
> + bool *needs_barrier)
> +{
> + struct ext4_inode_info *ei = EXT4_I(inode);
> + journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> + tid_t commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
> +
> + if (journal->j_flags & JBD2_BARRIER &&
> + !jbd2_trans_will_send_data_barrier(journal, commit_tid))
> + *needs_barrier = true;
> +
> + return jbd2_complete_transaction(journal, commit_tid);
> +}
> +
> /*
> * akpm: A new design for ext4_sync_file().
> *
> @@ -91,17 +128,14 @@ static int ext4_sync_parent(struct inode *inode)
> * What we do is just kick off a commit and wait on it. This will snapshot the
> * inode to disk.
> */
> -
> int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> {
> - struct inode *inode = file->f_mapping->host;
> - struct ext4_inode_info *ei = EXT4_I(inode);
> - journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
> int ret = 0, err;
> - tid_t commit_tid;
> bool needs_barrier = false;
> + struct inode *inode = file->f_mapping->host;
> + struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
>
> - if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
> + if (unlikely(ext4_forced_shutdown(sbi)))
> return -EIO;
>
> J_ASSERT(ext4_journal_current_handle() == NULL);
> @@ -111,23 +145,15 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> if (sb_rdonly(inode->i_sb)) {
> /* Make sure that we read updated s_mount_flags value */
> smp_rmb();
> - if (EXT4_SB(inode->i_sb)->s_mount_flags & EXT4_MF_FS_ABORTED)
> + if (sbi->s_mount_flags & EXT4_MF_FS_ABORTED)
> ret = -EROFS;
> goto out;
> }
>
> - if (!journal) {
> - ret = __generic_file_fsync(file, start, end, datasync);
> - if (!ret)
> - ret = ext4_sync_parent(inode);
> - if (test_opt(inode->i_sb, BARRIER))
> - goto issue_flush;
> - goto out;
> - }
> -
> ret = file_write_and_wait_range(file, start, end);
> if (ret)
> return ret;
> +
> /*
> * data=writeback,ordered:
> * The caller's filemap_fdatawrite()/wait will sync the data.
> @@ -142,18 +168,14 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
> * (they were dirtied by commit). But that's OK - the blocks are
> * safe in-journal, which is all fsync() needs to ensure.
> */
> - if (ext4_should_journal_data(inode)) {
> + if (!sbi->s_journal)
> + ret = ext4_fsync_nojournal(inode, datasync, &needs_barrier);
> + else if (ext4_should_journal_data(inode))
> ret = ext4_force_commit(inode->i_sb);
> - goto out;
> - }
> + else
> + ret = ext4_fsync_journal(inode, datasync, &needs_barrier);
>
> - commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
> - if (journal->j_flags & JBD2_BARRIER &&
> - !jbd2_trans_will_send_data_barrier(journal, commit_tid))
> - needs_barrier = true;
> - ret = jbd2_complete_transaction(journal, commit_tid);
> if (needs_barrier) {
> - issue_flush:
> err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
> if (!ret)
> ret = err;
> --
> 2.20.1
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed 30-10-19 13:00:24, Matthew Bobrowski wrote:
> On Tue, Oct 29, 2019 at 07:34:01PM -0400, Theodore Y. Ts'o wrote:
> > On Tue, Oct 29, 2019 at 07:31:59PM -0400, Theodore Y. Ts'o wrote:
> > > Hi Matthew, it looks like there are a number of problems with this
> > > patch series when using the ext3 backwards compatibility mode (e.g.,
> > > no extents enabled).
> > >
> > > So the following configurations are failing:
> > >
> > > kvm-xfstests -c ext3 generic/091 generic/240 generic/263
>
> This is one mode that I didn't get around to testing. Let me take a
> look at the above and get back to you.
If I should guess, I'd start looking at what that -ENOTBLK fallback from
direct IO ends up doing as we seem to be hitting that path...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed 30-10-19 12:26:52, Jan Kara wrote:
> On Wed 30-10-19 13:00:24, Matthew Bobrowski wrote:
> > On Tue, Oct 29, 2019 at 07:34:01PM -0400, Theodore Y. Ts'o wrote:
> > > On Tue, Oct 29, 2019 at 07:31:59PM -0400, Theodore Y. Ts'o wrote:
> > > > Hi Matthew, it looks like there are a number of problems with this
> > > > patch series when using the ext3 backwards compatibility mode (e.g.,
> > > > no extents enabled).
> > > >
> > > > So the following configurations are failing:
> > > >
> > > > kvm-xfstests -c ext3 generic/091 generic/240 generic/263
> >
> > This is one mode that I didn't get around to testing. Let me take a
> > look at the above and get back to you.
>
> If I should guess, I'd start looking at what that -ENOTBLK fallback from
> direct IO ends up doing as we seem to be hitting that path...
Hum, actually no. This write from fsx output:
24( 24 mod 256): WRITE 0x23000 thru 0x285ff (0x5600 bytes)
should have allocated blocks to where the failed write was going (0x24000).
But still, I'd expect some interaction between buffered writes to
holes and subsequent direct IO writes... One of the subtle differences
we introduced with the iomap conversion is that the old code in
__generic_file_write_iter() did an fsync & invalidate of the written
range after the buffered write fallback, and we don't seem to do that
now (this should probably be fixed regardless of its relation to this
bug).
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, Oct 30, 2019 at 12:39:18PM +0100, Jan Kara wrote:
> On Wed 30-10-19 12:26:52, Jan Kara wrote:
> > On Wed 30-10-19 13:00:24, Matthew Bobrowski wrote:
> > > On Tue, Oct 29, 2019 at 07:34:01PM -0400, Theodore Y. Ts'o wrote:
> > > > On Tue, Oct 29, 2019 at 07:31:59PM -0400, Theodore Y. Ts'o wrote:
> > > > > Hi Matthew, it looks like there are a number of problems with this
> > > > > patch series when using the ext3 backwards compatibility mode (e.g.,
> > > > > no extents enabled).
> > > > >
> > > > > So the following configurations are failing:
> > > > >
> > > > > kvm-xfstests -c ext3 generic/091 generic/240 generic/263
> > >
> > > This is one mode that I didn't get around to testing. Let me take a
> > > look at the above and get back to you.
> >
> > If I should guess, I'd start looking at what that -ENOTBLK fallback from
> > direct IO ends up doing as we seem to be hitting that path...
>
> Hum, actually no. This write from fsx output:
>
> 24( 24 mod 256): WRITE 0x23000 thru 0x285ff (0x5600 bytes)
>
> should have allocated blocks to where the failed write was going (0x24000).
> But still I'd expect some interaction between how buffered writes to holes
> interact with following direct IO writes... One of the subtle differences
> we have introduced with iomap conversion is that the old code in
> __generic_file_write_iter() did fsync & invalidate written range after
> buffered write fallback and we don't seem to do that now (probably should
> be fixed regardless of relation to this bug).
After performing some debugging this afternoon, I quickly realised
that the fix for this is rather trivial. Within the previous direct
I/O implementation, we passed EXT4_GET_BLOCKS_CREATE to
ext4_map_blocks() for any writes to inodes without extents. I seem to
have missed that here, and consequently block allocation for such
writes wasn't being performed correctly.
Also, I agree, the fsync + page cache invalidation bits need to be
implemented. I'm thinking of branching out within
ext4_buffered_write_iter() and implementing those bits there, i.e.
...
ret = generic_perform_write();
if (ret > 0 && iocb->ki_flags & IOCB_DIRECT) {
err = filemap_write_and_wait_range();
if (!err)
invalidate_mapping_pages();
...
AFAICT, this would be the most appropriate place to put it? Or, did
you have something else in mind?
--<M>--
On Thu 31-10-19 20:16:41, Matthew Bobrowski wrote:
> On Wed, Oct 30, 2019 at 12:39:18PM +0100, Jan Kara wrote:
> > On Wed 30-10-19 12:26:52, Jan Kara wrote:
> > > On Wed 30-10-19 13:00:24, Matthew Bobrowski wrote:
> > > > On Tue, Oct 29, 2019 at 07:34:01PM -0400, Theodore Y. Ts'o wrote:
> > > > > On Tue, Oct 29, 2019 at 07:31:59PM -0400, Theodore Y. Ts'o wrote:
> > > > > > Hi Matthew, it looks like there are a number of problems with this
> > > > > > patch series when using the ext3 backwards compatibility mode (e.g.,
> > > > > > no extents enabled).
> > > > > >
> > > > > > So the following configurations are failing:
> > > > > >
> > > > > > kvm-xfstests -c ext3 generic/091 generic/240 generic/263
> > > >
> > > > This is one mode that I didn't get around to testing. Let me take a
> > > > look at the above and get back to you.
> > >
> > > If I should guess, I'd start looking at what that -ENOTBLK fallback from
> > > direct IO ends up doing as we seem to be hitting that path...
> >
> > Hum, actually no. This write from fsx output:
> >
> > 24( 24 mod 256): WRITE 0x23000 thru 0x285ff (0x5600 bytes)
> >
> > should have allocated blocks to where the failed write was going (0x24000).
> > But still I'd expect some interaction between how buffered writes to holes
> > interact with following direct IO writes... One of the subtle differences
> > we have introduced with iomap conversion is that the old code in
> > __generic_file_write_iter() did fsync & invalidate written range after
> > buffered write fallback and we don't seem to do that now (probably should
> > be fixed regardless of relation to this bug).
>
> After performing some debugging this afternoon, I quickly realised
> that the fix for this is rather trivial. Within the previous direct
> I/O implementation, we passed EXT4_GET_BLOCKS_CREATE to
> ext4_map_blocks() for any writes to inodes without extents. I seem to
> have missed that here and consequently block allocation for a write
> wasn't performing correctly in such cases.
No, this is not correct. For inodes without extents we used
ext4_dio_get_block() and we passed DIO_SKIP_HOLES to
__blockdev_direct_IO(). DIO_SKIP_HOLES means that if the starting
block is within i_size, we pass 'create == 0' to the get_blocks()
function, and thus ext4_dio_get_block() uses a '0' flags argument to
ext4_map_blocks(), similarly to what you do.
And indeed, for inodes without extents we must fall back to buffered
IO when filling holes inside a file, to avoid stale data exposure (a
racing DIO read could read the block's contents before data is
written to it if we used EXT4_GET_BLOCKS_CREATE).
> Also, I agree, the fsync + page cache invalidation bits need to be
> implemented. I'm just thinking to branch out within
> ext4_buffered_write_iter() and implement those bits there i.e.
>
> ...
> ret = generic_perform_write();
>
> if (ret > 0 && iocb->ki_flags & IOCB_DIRECT) {
> err = filemap_write_and_wait_range();
>
> if (!err)
> invalidate_mapping_pages();
> ...
>
> AFAICT, this would be the most appropriate place to put it? Or, did
> you have something else in mind?
Yes, either this, or maybe in ext4_dio_write_iter() after returning from
ext4_buffered_write_iter() would be even more logical.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu, Oct 31, 2019 at 05:54:16PM +0100, Jan Kara wrote:
> On Thu 31-10-19 20:16:41, Matthew Bobrowski wrote:
> > On Wed, Oct 30, 2019 at 12:39:18PM +0100, Jan Kara wrote:
> > > On Wed 30-10-19 12:26:52, Jan Kara wrote:
> > > Hum, actually no. This write from fsx output:
> > >
> > > 24( 24 mod 256): WRITE 0x23000 thru 0x285ff (0x5600 bytes)
> > >
> > > should have allocated blocks to where the failed write was going (0x24000).
> > > But still I'd expect some interaction between how buffered writes to holes
> > > interact with following direct IO writes... One of the subtle differences
> > > we have introduced with iomap conversion is that the old code in
> > > __generic_file_write_iter() did fsync & invalidate written range after
> > > buffered write fallback and we don't seem to do that now (probably should
> > > be fixed regardless of relation to this bug).
> >
> > After performing some debugging this afternoon, I quickly realised
> > that the fix for this is rather trivial. Within the previous direct
> > I/O implementation, we passed EXT4_GET_BLOCKS_CREATE to
> > ext4_map_blocks() for any writes to inodes without extents. I seem to
> > have missed that here and consequently block allocation for a write
> > wasn't performing correctly in such cases.
>
> No, this is not correct. For inodes without extents we used
> ext4_dio_get_block() and we pass DIO_SKIP_HOLES to __blockdev_direct_IO().
> Now DIO_SKIP_HOLES means that if starting block is within i_size, we pass
> 'create == 0' to get_blocks() function and thus ext4_dio_get_block() uses
> '0' argument to ext4_map_blocks() similarly to what you do.
Ah right, I missed that part. :(
> And indeed for inodes without extents we must fallback to buffered IO for
> filling holes inside a file to avoid stale data exposure (racing DIO read
> could read block contents before data is written to it if we used
> EXT4_GET_BLOCKS_CREATE).
Well, in this case I'm pretty sure I know exactly where the problem
resides. I seem to be falling back to buffered I/O from
ext4_dio_write_iter() without actually taking into account any data
that may have already been partially written via direct I/O. So, the
byte count returned to userspace is whatever is returned by
ext4_buffered_write_iter(), which may not necessarily be the number
of bytes that were expected; it should instead be the sum of what was
written by ext4_dio_write_iter() and ext4_buffered_write_iter()...
> > Also, I agree, the fsync + page cache invalidation bits need to be
> > implemented. I'm just thinking to branch out within
> > ext4_buffered_write_iter() and implement those bits there i.e.
> >
> > ...
> > ret = generic_perform_write();
> >
> > if (ret > 0 && iocb->ki_flags & IOCB_DIRECT) {
> > err = filemap_write_and_wait_range();
> >
> > if (!err)
> > invalidate_mapping_pages();
> > ...
> >
> > AFAICT, this would be the most appropriate place to put it? Or, did
> > you have something else in mind?
>
> Yes, either this, or maybe in ext4_dio_write_iter() after returning from
> ext4_buffered_write_iter() would be even more logical.
Yes, let's stick with doing it within ext4_dio_write_iter().
--<M>--
Hi Matthew, could you do me a favor? For the next (and hopefully
final :-) spin of this patch series, could you base it on ext4.git's
master branch? Then pull in Darrick's iomap-for-next branch, and then
apply your patches on top of that.
I attempted to do this with the v6 patch series --- see the tt/mb-dio
branch --- and as I described on another e-mail thread, I appear to
have screwed up the patch conflicts, since it's causing a failure with
dioread_nolock using a 1k block size. Since this wasn't something
that worked when you were first working on the patch set, this isn't
something I'm going to consider blocking, especially since it's a
flaky test failure which happens 7% of the time, and using
dioread_nolock with a sub-page blocksize isn't something that is going
to be all that common (since it wasn't working at all up until now).
Still, I'm hoping that either Ritesh or you can figure out how my
simple-minded handling of the patch conflict between your and his
patch series can be addressed properly.
Thanks!!
- Ted
Howdy Ted!
On Sun, Nov 03, 2019 at 02:20:40PM -0500, Theodore Y. Ts'o wrote:
> Hi Matthew, could you do me a favor? For the next (and hopefully
> final :-) spin of this patch series, could you base it on ext4.git's
> master branch? Then pull in Darrick's iomap-for-next branch, and then
> apply your patches on top of that.
>
> I attempted to do this with the v6 patch series --- see the tt/mb-dio
> branch --- and as I described on another e-mail thread, I appear to
> have screwed up the patch conflicts, since it's causing a failure with
> dioread_nolock using a 1k block size. Since this wasn't something
> that worked when you were first working on the patch set, this isn't
> something I'm going to consider blocking, especially since it's a
> flaky test failure which happens 7% of the time, and using
> dioread_nolock with a sub-page blocksize isn't something that is going
> to be all that common (since it wasn't working at all up until now).
>
> Still, I'm hoping that either Ritesh or you can figure out how my
> simple-minded handling of the patch conflict between your and his
> patch series can be addressed properly.
OK, I will try to get around to this tonight. :)
--<M>--