Hello!
I managed to get some time over the weekend to fix up the stuff that
came out from the last review. Please feel free to review and provide
any necessary constructive criticism.
This patch series ports the ext4 direct I/O paths over to the iomap
infrastructure. The legacy buffer_head based direct I/O implementation
has subsequently been removed as it's no longer in use. The result of
this change is that ext4 now uses the newer iomap framework for direct
I/O read/write operations. Overall, this results in a much cleaner
direct I/O implementation and keeps this code isolated from the
buffer_head internals. In addition to this, a slight performance boost
may be expected while using O_SYNC | O_DIRECT.
The changes within this patch series have been tested via xfstests in
both DAX and non-DAX modes using the various filesystem configuration
options i.e. 4k, dioread_nolock, etc.
Changes since v4:
- Removed the ext4_handle_inode_extension() function out from the
->end_io() callback so that we can realiably truncate any allocated
blocks as we have all the information we need in
ext4_dio_write_iter().
- Fixed a couple comment formatting issues in addition to spelling
mistakes here and there.
- Incorporated the fix for the issue that Dave Chinner found around
writes extending beyond and not marking the iomap dirty.
Thank you all who took the time to review and provide feedback. :)
Jan Kara (2):
iomap: Allow forcing of waiting for running DIO in iomap_dio_rw()
xfs: Use iomap_dio_rw_wait()
Matthew Bobrowski (10):
ext4: move set iomap routines into separate helper ext4_set_iomap()
ext4: iomap that extends beyond EOF should be marked dirty
ext4: split IOMAP_WRITE branch in ext4_iomap_begin() into helper
ext4: introduce new callback for IOMAP_REPORT
ext4: introduce direct I/O read using iomap infrastructure
ext4: update direct I/O read to do trylock in IOCB_NOWAIT cases
ext4: move inode extension/truncate code out from ->iomap_end()
callback
ext4: move inode extension check out from ext4_iomap_alloc()
ext4: reorder map->m_flags checks in ext4_set_iomap()
ext4: introduce direct I/O write using iomap infrastructure
fs/ext4/ext4.h | 4 +-
fs/ext4/extents.c | 4 +-
fs/ext4/file.c | 378 +++++++++++++++++-----
fs/ext4/inode.c | 716 +++++++++++-------------------------------
fs/gfs2/file.c | 6 +-
fs/iomap/direct-io.c | 7 +-
fs/xfs/xfs_file.c | 13 +-
include/linux/iomap.h | 3 +-
8 files changed, 498 insertions(+), 633 deletions(-)
--
2.20.1
--<M>--
As part of the ext4_iomap_begin() cleanups that precede this patch,
here we also split up the IOMAP_REPORT branch into a completely
separate ->iomap_begin() callback named
ext4_iomap_begin_report(). Again, the raionale for this change is to
reduce the overall clutter that's starting to become apparent as we
start to port more functionality over to the iomap infrastructure.
Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/ext4.h | 1 +
fs/ext4/file.c | 6 ++-
fs/ext4/inode.c | 137 +++++++++++++++++++++++++++++-------------------
3 files changed, 88 insertions(+), 56 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 03db3e71676c..d0d88f411a44 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -3379,6 +3379,7 @@ static inline void ext4_clear_io_unwritten_flag(ext4_io_end_t *io_end)
}
extern const struct iomap_ops ext4_iomap_ops;
+extern const struct iomap_ops ext4_iomap_report_ops;
static inline int ext4_buffer_uptodate(struct buffer_head *bh)
{
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 8d2bbcc2d813..ab75aee3e687 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -494,12 +494,14 @@ loff_t ext4_llseek(struct file *file, loff_t offset, int whence)
maxbytes, i_size_read(inode));
case SEEK_HOLE:
inode_lock_shared(inode);
- offset = iomap_seek_hole(inode, offset, &ext4_iomap_ops);
+ offset = iomap_seek_hole(inode, offset,
+ &ext4_iomap_report_ops);
inode_unlock_shared(inode);
break;
case SEEK_DATA:
inode_lock_shared(inode);
- offset = iomap_seek_data(inode, offset, &ext4_iomap_ops);
+ offset = iomap_seek_data(inode, offset,
+ &ext4_iomap_report_ops);
inode_unlock_shared(inode);
break;
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3dc92bd8a944..ebeedbf3900f 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3501,74 +3501,32 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap)
{
- unsigned int blkbits = inode->i_blkbits;
- unsigned long first_block, last_block;
- struct ext4_map_blocks map;
- bool delalloc = false;
int ret;
+ struct ext4_map_blocks map;
+ u8 blkbits = inode->i_blkbits;
if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
return -EINVAL;
- first_block = offset >> blkbits;
- last_block = min_t(loff_t, (offset + length - 1) >> blkbits,
- EXT4_MAX_LOGICAL_BLOCK);
-
- if (flags & IOMAP_REPORT) {
- if (ext4_has_inline_data(inode)) {
- ret = ext4_inline_data_iomap(inode, iomap);
- if (ret != -EAGAIN) {
- if (ret == 0 && offset >= iomap->length)
- ret = -ENOENT;
- return ret;
- }
- }
- } else {
- if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
- return -ERANGE;
- }
- map.m_lblk = first_block;
- map.m_len = last_block - first_block + 1;
-
- if (flags & IOMAP_REPORT) {
- ret = ext4_map_blocks(NULL, inode, &map, 0);
- if (ret < 0)
- return ret;
-
- if (ret == 0) {
- ext4_lblk_t end = map.m_lblk + map.m_len - 1;
- struct extent_status es;
-
- ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
- map.m_lblk, end, &es);
+ if (WARN_ON_ONCE(ext4_has_inline_data(inode)))
+ return -ERANGE;
- if (!es.es_len || es.es_lblk > end) {
- /* entire range is a hole */
- } else if (es.es_lblk > map.m_lblk) {
- /* range starts with a hole */
- map.m_len = es.es_lblk - map.m_lblk;
- } else {
- ext4_lblk_t offs = 0;
+ /*
+ * Calculate the first and last logical blocks respectively.
+ */
+ map.m_lblk = offset >> blkbits;
+ map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
+ EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
- if (es.es_lblk < map.m_lblk)
- offs = map.m_lblk - es.es_lblk;
- map.m_lblk = es.es_lblk + offs;
- map.m_len = es.es_len - offs;
- delalloc = true;
- }
- }
- } else if (flags & IOMAP_WRITE) {
+ if (flags & IOMAP_WRITE)
ret = ext4_iomap_alloc(inode, &map, flags);
- } else {
+ else
ret = ext4_map_blocks(NULL, inode, &map, 0);
- }
if (ret < 0)
return ret;
ext4_set_iomap(inode, iomap, &map, offset, length);
- if (delalloc && iomap->type == IOMAP_HOLE)
- iomap->type = IOMAP_DELALLOC;
return 0;
}
@@ -3630,6 +3588,77 @@ const struct iomap_ops ext4_iomap_ops = {
.iomap_end = ext4_iomap_end,
};
+static bool ext4_iomap_is_delalloc(struct inode *inode,
+ struct ext4_map_blocks *map)
+{
+ struct extent_status es;
+ ext4_lblk_t offset = 0, end = map->m_lblk + map->m_len - 1;
+
+ ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
+ map->m_lblk, end, &es);
+
+ if (!es.es_len || es.es_lblk > end)
+ return false;
+
+ if (es.es_lblk > map->m_lblk) {
+ map->m_len = es.es_lblk - map->m_lblk;
+ return false;
+ }
+
+ if (es.es_lblk <= map->m_lblk)
+ offset = map->m_lblk - es.es_lblk;
+
+ map->m_lblk = es.es_lblk + offset;
+ map->m_len = es.es_len - offset;
+
+ return true;
+}
+
+static int ext4_iomap_begin_report(struct inode *inode, loff_t offset,
+ loff_t length, unsigned int flags,
+ struct iomap *iomap)
+{
+ int ret;
+ bool delalloc = false;
+ struct ext4_map_blocks map;
+ u8 blkbits = inode->i_blkbits;
+
+ if ((offset >> blkbits) > EXT4_MAX_LOGICAL_BLOCK)
+ return -EINVAL;
+
+ if (ext4_has_inline_data(inode)) {
+ ret = ext4_inline_data_iomap(inode, iomap);
+ if (ret != -EAGAIN) {
+ if (ret == 0 && offset >= iomap->length)
+ ret = -ENOENT;
+ return ret;
+ }
+ }
+
+ /*
+ * Calculate the first and last logical blocks respectively.
+ */
+ map.m_lblk = offset >> blkbits;
+ map.m_len = min_t(loff_t, (offset + length - 1) >> blkbits,
+ EXT4_MAX_LOGICAL_BLOCK) - map.m_lblk + 1;
+
+ ret = ext4_map_blocks(NULL, inode, &map, 0);
+ if (ret < 0)
+ return ret;
+ if (ret == 0)
+ delalloc = ext4_iomap_is_delalloc(inode, &map);
+
+ ext4_set_iomap(inode, iomap, &map, offset, length);
+ if (delalloc && iomap->type == IOMAP_HOLE)
+ iomap->type = IOMAP_DELALLOC;
+
+ return 0;
+}
+
+const struct iomap_ops ext4_iomap_report_ops = {
+ .iomap_begin = ext4_iomap_begin_report,
+};
+
static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
ssize_t size, void *private)
{
--
2.20.1
--<M>--
This patch introduces a new direct I/O read path which makes use of
the iomap infrastructure.
The new function ext4_do_read_iter() is responsible for calling into
the iomap infrastructure via iomap_dio_rw(). If the read operation
performed on the inode is not supported, which is checked via
ext4_dio_supported(), then we simply fallback and complete the I/O
using buffered I/O.
Existing direct I/O read code path has been removed, as it is no
longer required.
Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/file.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
fs/ext4/inode.c | 32 +-------------------------------
2 files changed, 46 insertions(+), 32 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index ab75aee3e687..6ea7e00e0204 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -34,6 +34,46 @@
#include "xattr.h"
#include "acl.h"
+static bool ext4_dio_supported(struct inode *inode)
+{
+ if (IS_ENABLED(CONFIG_FS_ENCRYPTION) && IS_ENCRYPTED(inode))
+ return false;
+ if (fsverity_active(inode))
+ return false;
+ if (ext4_should_journal_data(inode))
+ return false;
+ if (ext4_has_inline_data(inode))
+ return false;
+ return true;
+}
+
+static int ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+ ssize_t ret;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ inode_lock_shared(inode);
+ if (!ext4_dio_supported(inode)) {
+ inode_unlock_shared(inode);
+ /*
+ * Fallback to buffered I/O if the operation being performed on
+ * the inode is not supported by direct I/O. The IOCB_DIRECT
+ * flag needs to be cleared here in order to ensure that the
+ * direct I/O path within generic_file_read_iter() is not
+ * taken.
+ */
+ iocb->ki_flags &= ~IOCB_DIRECT;
+ return generic_file_read_iter(iocb, to);
+ }
+
+ ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL,
+ is_sync_kiocb(iocb));
+ inode_unlock_shared(inode);
+
+ file_accessed(iocb->ki_filp);
+ return ret;
+}
+
#ifdef CONFIG_FS_DAX
static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
@@ -64,7 +104,9 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
- if (unlikely(ext4_forced_shutdown(EXT4_SB(file_inode(iocb->ki_filp)->i_sb))))
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
return -EIO;
if (!iov_iter_count(to))
@@ -74,6 +116,8 @@ static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
if (IS_DAX(file_inode(iocb->ki_filp)))
return ext4_dax_read_iter(iocb, to);
#endif
+ if (iocb->ki_flags & IOCB_DIRECT)
+ return ext4_dio_read_iter(iocb, to);
return generic_file_read_iter(iocb, to);
}
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index ebeedbf3900f..03a9e2b85e46 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -863,9 +863,6 @@ int ext4_dio_get_block(struct inode *inode, sector_t iblock,
{
/* We don't expect handle for direct IO */
WARN_ON_ONCE(ext4_journal_current_handle());
-
- if (!create)
- return _ext4_get_block(inode, iblock, bh, 0);
return ext4_get_block_trans(inode, iblock, bh, EXT4_GET_BLOCKS_CREATE);
}
@@ -3865,30 +3862,6 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
return ret;
}
-static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct address_space *mapping = iocb->ki_filp->f_mapping;
- struct inode *inode = mapping->host;
- size_t count = iov_iter_count(iter);
- ssize_t ret;
-
- /*
- * Shared inode_lock is enough for us - it protects against concurrent
- * writes & truncates and since we take care of writing back page cache,
- * we are protected against page writeback as well.
- */
- inode_lock_shared(inode);
- ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,
- iocb->ki_pos + count - 1);
- if (ret)
- goto out_unlock;
- ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
- iter, ext4_dio_get_block, NULL, NULL, 0);
-out_unlock:
- inode_unlock_shared(inode);
- return ret;
-}
-
static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
{
struct file *file = iocb->ki_filp;
@@ -3915,10 +3888,7 @@ static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
return 0;
trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
- if (iov_iter_rw(iter) == READ)
- ret = ext4_direct_IO_read(iocb, iter);
- else
- ret = ext4_direct_IO_write(iocb, iter);
+ ret = ext4_direct_IO_write(iocb, iter);
trace_ext4_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), ret);
return ret;
}
--
2.20.1
--<M>--
This patch updates the lock pattern in ext4_dio_read_iter() to only
perform the trylock in IOCB_NOWAIT cases.
Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/file.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6ea7e00e0204..8420686b90f5 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -52,7 +52,13 @@ static int ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
ssize_t ret;
struct inode *inode = file_inode(iocb->ki_filp);
- inode_lock_shared(inode);
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!inode_trylock_shared(inode))
+ return -EAGAIN;
+ } else {
+ inode_lock_shared(inode);
+ }
+
if (!ext4_dio_supported(inode)) {
inode_unlock_shared(inode);
/*
--
2.20.1
--<M>--
Separate the iomap field population chunk of code that is currently
within ext4_iomap_begin() into a new helper called
ext4_set_iomap(). The intent of this function is self explanatory,
however the rationale behind doing so is to also reduce the overall
clutter that we currently have within the ext4_iomap_begin() callback.
Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/inode.c | 59 +++++++++++++++++++++++++++----------------------
1 file changed, 33 insertions(+), 26 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 516faa280ced..158eea9a1944 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3406,10 +3406,39 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
return inode->i_state & I_DIRTY_DATASYNC;
}
+static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
+ struct ext4_map_blocks *map, loff_t offset,
+ loff_t length)
+{
+ u8 blkbits = inode->i_blkbits;
+
+ iomap->flags = 0;
+ if (ext4_inode_datasync_dirty(inode))
+ iomap->flags |= IOMAP_F_DIRTY;
+
+ if (map->m_flags & EXT4_MAP_NEW)
+ iomap->flags |= IOMAP_F_NEW;
+
+ iomap->bdev = inode->i_sb->s_bdev;
+ iomap->dax_dev = EXT4_SB(inode->i_sb)->s_daxdev;
+ iomap->offset = (u64) map->m_lblk << blkbits;
+ iomap->length = (u64) map->m_len << blkbits;
+
+ if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN)) {
+ if (map->m_flags & EXT4_MAP_MAPPED)
+ iomap->type = IOMAP_MAPPED;
+ else if (map->m_flags & EXT4_MAP_UNWRITTEN)
+ iomap->type = IOMAP_UNWRITTEN;
+ iomap->addr = (u64) map->m_pblk << blkbits;
+ } else {
+ iomap->type = IOMAP_HOLE;
+ iomap->addr = IOMAP_NULL_ADDR;
+ }
+}
+
static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap)
{
- struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
unsigned int blkbits = inode->i_blkbits;
unsigned long first_block, last_block;
struct ext4_map_blocks map;
@@ -3523,31 +3552,9 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
return ret;
}
- iomap->flags = 0;
- if (ext4_inode_datasync_dirty(inode))
- iomap->flags |= IOMAP_F_DIRTY;
- iomap->bdev = inode->i_sb->s_bdev;
- iomap->dax_dev = sbi->s_daxdev;
- iomap->offset = (u64)first_block << blkbits;
- iomap->length = (u64)map.m_len << blkbits;
-
- if (ret == 0) {
- iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
- iomap->addr = IOMAP_NULL_ADDR;
- } else {
- if (map.m_flags & EXT4_MAP_MAPPED) {
- iomap->type = IOMAP_MAPPED;
- } else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
- iomap->type = IOMAP_UNWRITTEN;
- } else {
- WARN_ON_ONCE(1);
- return -EIO;
- }
- iomap->addr = (u64)map.m_pblk << blkbits;
- }
-
- if (map.m_flags & EXT4_MAP_NEW)
- iomap->flags |= IOMAP_F_NEW;
+ ext4_set_iomap(inode, iomap, &map, offset, length);
+ if (delalloc && iomap->type == IOMAP_HOLE)
+ iomap->type = IOMAP_DELALLOC;
return 0;
}
--
2.20.1
--<M>--
This patch is effectively addressed what Dave Chinner had found and
fixed within this commit: 8a23414ee345. Justification for needing this
modification has been provided below:
When doing a direct IO that spans the current EOF, and there are
written blocks beyond EOF that extend beyond the current write, the
only metadata update that needs to be done is a file size extension.
However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that
there is IO completion metadata updates required, and hence we may
fail to correctly sync file size extensions made in IO completion when
O_DSYNC writes are being used and the hardware supports FUA.
Hence when setting IOMAP_F_DIRTY, we need to also take into account
whether the iomap spans the current EOF. If it does, then we need to
mark it dirty so that IO completion will call generic_write_sync() to
flush the inode size update to stable storage correctly.
Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/inode.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 158eea9a1944..0dd29ae5cc8c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3412,8 +3412,14 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
{
u8 blkbits = inode->i_blkbits;
+ /*
+ * Writes that span EOF might trigger an I/O size update on completion,
+ * so consider them to be dirty for the purposes of O_DSYNC, even if
+ * there is no other metadata changes being made or are pending here.
+ */
iomap->flags = 0;
- if (ext4_inode_datasync_dirty(inode))
+ if (ext4_inode_datasync_dirty(inode) ||
+ offset + length > i_size_read(inode))
iomap->flags |= IOMAP_F_DIRTY;
if (map->m_flags & EXT4_MAP_NEW)
--
2.20.1
--<M>--
In preparation for porting across the ext4 direct I/O path for reads
and writes over to the iomap infrastructure, split up the IOMAP_WRITE
chunk of code into a separate helper ext4_alloc_iomap(). This way,
when we add the necessary bits for direct I/O, we don't end up with
ext4_iomap_begin() becoming a behemoth twisty maze.
Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/inode.c | 112 ++++++++++++++++++++++++++----------------------
1 file changed, 60 insertions(+), 52 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 0dd29ae5cc8c..3dc92bd8a944 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3442,6 +3442,62 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
}
}
+static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
+ unsigned int flags)
+{
+ handle_t *handle;
+ u8 blkbits = inode->i_blkbits;
+ int ret, dio_credits, retries = 0;
+
+ /*
+ * Trim the mapping request to the maximum value that we can map at
+ * once for direct I/O.
+ */
+ if (map->m_len > DIO_MAX_BLOCKS)
+ map->m_len = DIO_MAX_BLOCKS;
+ dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
+
+retry:
+ /*
+ * Either we allocate blocks and then don't get an unwritten extent, so
+ * in that case we have reserved enough credits. Or, the blocks are
+ * already allocated and and unwritten. In that case, the extent
+ * conversion fits into the credits too.
+ */
+ handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, dio_credits);
+ if (IS_ERR(handle))
+ return PTR_ERR(handle);
+
+ ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
+ if (ret < 0)
+ goto journal_stop;
+
+ /*
+ * If we have allocated blocks beyond EOF, we need to ensure that
+ * they're truncated if we crash before updating the inode size
+ * metadata within ext4_iomap_end(). For faults, we don't need to do
+ * that (and cannot due to the orphan list operations needing an
+ * inode_lock()). If we happen to instantiate blocks beyond EOF, it is
+ * because we race with a truncate operation, which already has added
+ * the inode onto the orphan list.
+ */
+ if (!(flags & IOMAP_FAULT) && map->m_lblk + map->m_len >
+ (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
+ int err;
+
+ err = ext4_orphan_add(handle, inode);
+ if (err < 0)
+ ret = err;
+ }
+
+journal_stop:
+ ext4_journal_stop(handle);
+ if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
+ goto retry;
+
+ return ret;
+}
+
static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
unsigned flags, struct iomap *iomap)
{
@@ -3502,62 +3558,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
}
}
} else if (flags & IOMAP_WRITE) {
- int dio_credits;
- handle_t *handle;
- int retries = 0;
-
- /* Trim mapping request to maximum we can map at once for DIO */
- if (map.m_len > DIO_MAX_BLOCKS)
- map.m_len = DIO_MAX_BLOCKS;
- dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
-retry:
- /*
- * Either we allocate blocks and then we don't get unwritten
- * extent so we have reserved enough credits, or the blocks
- * are already allocated and unwritten and in that case
- * extent conversion fits in the credits as well.
- */
- handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
- dio_credits);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
-
- ret = ext4_map_blocks(handle, inode, &map,
- EXT4_GET_BLOCKS_CREATE_ZERO);
- if (ret < 0) {
- ext4_journal_stop(handle);
- if (ret == -ENOSPC &&
- ext4_should_retry_alloc(inode->i_sb, &retries))
- goto retry;
- return ret;
- }
-
- /*
- * If we added blocks beyond i_size, we need to make sure they
- * will get truncated if we crash before updating i_size in
- * ext4_iomap_end(). For faults we don't need to do that (and
- * even cannot because for orphan list operations inode_lock is
- * required) - if we happen to instantiate block beyond i_size,
- * it is because we race with truncate which has already added
- * the inode to the orphan list.
- */
- if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
- (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
- int err;
-
- err = ext4_orphan_add(handle, inode);
- if (err < 0) {
- ext4_journal_stop(handle);
- return err;
- }
- }
- ext4_journal_stop(handle);
+ ret = ext4_iomap_alloc(inode, &map, flags);
} else {
ret = ext4_map_blocks(NULL, inode, &map, 0);
- if (ret < 0)
- return ret;
}
+ if (ret < 0)
+ return ret;
+
ext4_set_iomap(inode, iomap, &map, offset, length);
if (delalloc && iomap->type == IOMAP_HOLE)
iomap->type = IOMAP_DELALLOC;
--
2.20.1
--<M>--
Filesystems do not support doing IO as asynchronous in some cases. For
example in case of unaligned writes or in case file size needs to be
extended (e.g. for ext4). Instead of forcing filesystem to wait for AIO
in such cases, add argument to iomap_dio_rw() which makes the function
wait for IO completion. This also results in executing
iomap_dio_complete() inline in iomap_dio_rw() providing its return value
to the caller as for ordinary sync IO.
Signed-off-by: Jan Kara <[email protected]>
---
This patch has already been posted through by Jan, but I've just
included it within this patch series to mark that it's a clear
dependency.
fs/gfs2/file.c | 6 ++++--
fs/iomap/direct-io.c | 7 +++++--
fs/xfs/xfs_file.c | 5 +++--
include/linux/iomap.h | 3 ++-
4 files changed, 14 insertions(+), 7 deletions(-)
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 997b326247e2..f0caee2b7c00 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -732,7 +732,8 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to)
if (ret)
goto out_uninit;
- ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL);
+ ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL,
+ is_sync_kiocb(iocb));
gfs2_glock_dq(&gh);
out_uninit:
@@ -767,7 +768,8 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from)
if (offset + len > i_size_read(&ip->i_inode))
goto out;
- ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL);
+ ret = iomap_dio_rw(iocb, from, &gfs2_iomap_ops, NULL,
+ is_sync_kiocb(iocb));
out:
gfs2_glock_dq(&gh);
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 1fc28c2da279..da124cee1783 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -392,7 +392,8 @@ iomap_dio_actor(struct inode *inode, loff_t pos, loff_t length,
*/
ssize_t
iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
- const struct iomap_ops *ops, const struct iomap_dio_ops *dops)
+ const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
+ bool wait_for_completion)
{
struct address_space *mapping = iocb->ki_filp->f_mapping;
struct inode *inode = file_inode(iocb->ki_filp);
@@ -400,7 +401,6 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
loff_t pos = iocb->ki_pos, start = pos;
loff_t end = iocb->ki_pos + count - 1, ret = 0;
unsigned int flags = IOMAP_DIRECT;
- bool wait_for_completion = is_sync_kiocb(iocb);
struct blk_plug plug;
struct iomap_dio *dio;
@@ -409,6 +409,9 @@ iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
if (!count)
return 0;
+ if (WARN_ON(is_sync_kiocb(iocb) && !wait_for_completion))
+ return -EIO;
+
dio = kmalloc(sizeof(*dio), GFP_KERNEL);
if (!dio)
return -ENOMEM;
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 1ffb179f35d2..0739ba72a82e 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -188,7 +188,7 @@ xfs_file_dio_aio_read(
file_accessed(iocb->ki_filp);
xfs_ilock(ip, XFS_IOLOCK_SHARED);
- ret = iomap_dio_rw(iocb, to, &xfs_iomap_ops, NULL);
+ ret = iomap_dio_rw(iocb, to, &xfs_iomap_ops, NULL, is_sync_kiocb(iocb));
xfs_iunlock(ip, XFS_IOLOCK_SHARED);
return ret;
@@ -547,7 +547,8 @@ xfs_file_dio_aio_write(
}
trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
- ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, &xfs_dio_write_ops);
+ ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, &xfs_dio_write_ops,
+ is_sync_kiocb(iocb));
/*
* If unaligned, this is the only IO in-flight. If it has not yet
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 7aa5d6117936..76b14cb729dc 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -195,7 +195,8 @@ struct iomap_dio_ops {
};
ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
- const struct iomap_ops *ops, const struct iomap_dio_ops *dops);
+ const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
+ bool wait_for_completion);
int iomap_dio_iopoll(struct kiocb *kiocb, bool spin);
#ifdef CONFIG_SWAP
--
2.20.1
--<M>--
For the direct I/O iomap write changes that follow in this patch
series, we need to accommodate for the case where the block mapping
flags passed to ext4_map_blocks() result in map->m_flags having both
EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN bits set. In order for
allocated unwritten extents to be converted correctly in the
->end_io() handler, the iomap->type must be set to IOMAP_UNWRITTEN for
cases where map->m_flags has EXT4_MAP_UNWRITTEN set. Hence the reason
why we reshuffle the conditional statement in this patch.
This change is a no-op for DAX as the block mapping flags passed to
ext4_map_blocks() when the inode IS_DAX never results in both
EXT4_MAP_MAPPED and EXT4_MAP_UNWRITTEN being set at once.
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/inode.c | 15 ++++++++++++---
1 file changed, 12 insertions(+), 3 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index a37112efe3fb..70ddcb9c2c1c 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3428,10 +3428,19 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
iomap->length = (u64) map->m_len << blkbits;
if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN)) {
- if (map->m_flags & EXT4_MAP_MAPPED)
- iomap->type = IOMAP_MAPPED;
- else if (map->m_flags & EXT4_MAP_UNWRITTEN)
+ /*
+ * Flags passed into ext4_map_blocks() for direct I/O writes
+ * can result in map->m_flags having both EXT4_MAP_MAPPED and
+ * EXT4_MAP_UNWRITTEN bits set. In order for any allocated
+ * extents to be converted to written extents correctly in the
+ * ->end_io() handler, we need to ensure that the iomap->type
+ * is set appropriately. Hence the reason why we need to check
+ * whether EXT4_MAP_UNWRITTEN is set first.
+ */
+ if (map->m_flags & EXT4_MAP_UNWRITTEN)
iomap->type = IOMAP_UNWRITTEN;
+ else if (map->m_flags & EXT4_MAP_MAPPED)
+ iomap->type = IOMAP_MAPPED;
iomap->addr = (u64) map->m_pblk << blkbits;
} else {
iomap->type = IOMAP_HOLE;
--
2.20.1
--<M>--
Lift the inode extension/orphan list handling code out from
ext4_iomap_alloc() and apply it within the ext4_dax_write_iter()
function.
Signed-off-by: Matthew Bobrowski <[email protected]>
Reviewed-by: Jan Kara <[email protected]>
Reviewed-by: Ritesh Harjani <[email protected]>
---
fs/ext4/file.c | 21 ++++++++++++++++++++-
fs/ext4/inode.c | 22 ----------------------
2 files changed, 20 insertions(+), 23 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6ddf00265988..65e758ae02d0 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -304,6 +304,8 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
ssize_t ret;
size_t count;
loff_t offset;
+ handle_t *handle;
+ bool extend = false;
struct inode *inode = file_inode(iocb->ki_filp);
if (!inode_trylock(inode)) {
@@ -323,8 +325,25 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
offset = iocb->ki_pos;
count = iov_iter_count(from);
+ if (offset + count > EXT4_I(inode)->i_disksize) {
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+
+ ret = ext4_orphan_add(handle, inode);
+ if (ret) {
+ ext4_journal_stop(handle);
+ goto out;
+ }
+ extend = true;
+ ext4_journal_stop(handle);
+ }
+
ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
- ret = ext4_handle_inode_extension(inode, ret, offset, count);
+ if (extend)
+ ret = ext4_handle_inode_extension(inode, ret, offset, count);
out:
inode_unlock(inode);
if (ret > 0)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f79d15e8d3c6..a37112efe3fb 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3443,7 +3443,6 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
unsigned int flags)
{
handle_t *handle;
- u8 blkbits = inode->i_blkbits;
int ret, dio_credits, retries = 0;
/*
@@ -3466,28 +3465,7 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
return PTR_ERR(handle);
ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
- if (ret < 0)
- goto journal_stop;
-
- /*
- * If we have allocated blocks beyond EOF, we need to ensure that
- * they're truncated if we crash before updating the inode size
- * metadata within ext4_iomap_end(). For faults, we don't need to do
- * that (and cannot due to the orphan list operations needing an
- * inode_lock()). If we happen to instantiate blocks beyond EOF, it is
- * because we race with a truncate operation, which already has added
- * the inode onto the orphan list.
- */
- if (!(flags & IOMAP_FAULT) && map->m_lblk + map->m_len >
- (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
- int err;
-
- err = ext4_orphan_add(handle, inode);
- if (err < 0)
- ret = err;
- }
-journal_stop:
ext4_journal_stop(handle);
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
goto retry;
--
2.20.1
--<M>--
Use iomap_dio_rw() to wait for unaligned direct IO instead of opencoding
the wait.
Signed-off-by: Jan Kara <[email protected]>
---
This patch has already been posted through by Jan, but I've just
included it within this patch series to mark that it's a clear
dependency.
fs/xfs/xfs_file.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)
diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index 0739ba72a82e..c0620135a279 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -547,16 +547,12 @@ xfs_file_dio_aio_write(
}
trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
- ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, &xfs_dio_write_ops,
- is_sync_kiocb(iocb));
-
/*
- * If unaligned, this is the only IO in-flight. If it has not yet
- * completed, wait on it before we release the iolock to prevent
- * subsequent overlapping IO.
+ * If unaligned, this is the only IO in-flight. Wait on it before we
+ * release the iolock to prevent subsequent overlapping IO.
*/
- if (ret == -EIOCBQUEUED && unaligned_io)
- inode_dio_wait(inode);
+ ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, &xfs_dio_write_ops,
+ is_sync_kiocb(iocb) || unaligned_io);
out:
xfs_iunlock(ip, iolock);
--
2.20.1
--<M>--
This patch introduces a new direct I/O write path which makes use of
the iomap infrastructure.
All direct I/O writes are now passed from the ->write_iter() callback
through to the new direct I/O handler ext4_dio_write_iter(). This
function is responsible for calling into the iomap infrastructure via
iomap_dio_rw().
Code snippets from the existing direct I/O write code within
ext4_file_write_iter() such as, checking whether the I/O request is
unaligned asynchronous I/O, or whether the write will result in an
overwrite have effectively been moved out and into the new direct I/O
->write_iter() handler.
The block mapping flags that are eventually passed down to
ext4_map_blocks() from the *_get_block_*() suite of routines have been
taken out and introduced within ext4_iomap_alloc().
For inode extension cases, ext4_handle_inode_extension() is
effectively the function responsible for performing such metadata
updates. This is called after iomap_dio_rw() has returned so that we
can safely determine whether we need to potentially truncate any
allocated blocks that may have been prepared for this direct I/O
write. We don't perform the inode extension, or truncate operations
from the ->end_io() handler as we don't have the original I/O 'length'
available there. The ->end_io() however is responsible fo converting
allocated unwritten extents to written extents.
In the instance of a short write, we fallback and complete the
remainder of the I/O using buffered I/O via
ext4_buffered_write_iter().
The existing buffer_head direct I/O implementation has been removed as
it's now redundant.
Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/ext4.h | 3 -
fs/ext4/extents.c | 4 +-
fs/ext4/file.c | 236 ++++++++++++++++++--------
fs/ext4/inode.c | 411 +++++-----------------------------------------
4 files changed, 207 insertions(+), 447 deletions(-)
diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index d0d88f411a44..fdab3420539d 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -1579,7 +1579,6 @@ enum {
EXT4_STATE_NO_EXPAND, /* No space for expansion */
EXT4_STATE_DA_ALLOC_CLOSE, /* Alloc DA blks on close */
EXT4_STATE_EXT_MIGRATE, /* Inode is migrating */
- EXT4_STATE_DIO_UNWRITTEN, /* need convert on dio done*/
EXT4_STATE_NEWENTRY, /* File just added to dir */
EXT4_STATE_MAY_INLINE_DATA, /* may have in-inode data */
EXT4_STATE_EXT_PRECACHED, /* extents have been precached */
@@ -2560,8 +2559,6 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
int ext4_get_block(struct inode *inode, sector_t iblock,
struct buffer_head *bh_result, int create);
-int ext4_dio_get_block(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create);
int ext4_da_get_block_prep(struct inode *inode, sector_t iblock,
struct buffer_head *bh, int create);
int ext4_walk_page_buffers(handle_t *handle,
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index fb0f99dc8c22..3abc60a0b292 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -1760,9 +1760,7 @@ ext4_can_extents_be_merged(struct inode *inode, struct ext4_extent *ex1,
* case.
*/
if (ext4_ext_is_unwritten(ex1) &&
- (ext4_test_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN) ||
- atomic_read(&EXT4_I(inode)->i_unwritten) ||
- (ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)))
+ ext1_ee_len + ext2_ee_len > EXT_UNWRITTEN_MAX_LEN)
return 0;
#ifdef AGGRESSIVE_TEST
if (ext1_ee_len >= 4)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 65e758ae02d0..49fdfde155f1 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -29,6 +29,7 @@
#include <linux/pagevec.h>
#include <linux/uio.h>
#include <linux/mman.h>
+#include <linux/backing-dev.h>
#include "ext4.h"
#include "ext4_jbd2.h"
#include "xattr.h"
@@ -154,13 +155,6 @@ static int ext4_release_file(struct inode *inode, struct file *filp)
return 0;
}
-static void ext4_unwritten_wait(struct inode *inode)
-{
- wait_queue_head_t *wq = ext4_ioend_wq(inode);
-
- wait_event(*wq, (atomic_read(&EXT4_I(inode)->i_unwritten) == 0));
-}
-
/*
* This tests whether the IO in question is block-aligned or not.
* Ext4 utilizes unwritten extents when hole-filling during direct IO, and they
@@ -213,13 +207,13 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;
+ if (unlikely(IS_IMMUTABLE(inode)))
+ return -EPERM;
+
ret = generic_write_checks(iocb, from);
if (ret <= 0)
return ret;
- if (unlikely(IS_IMMUTABLE(inode)))
- return -EPERM;
-
/*
* If we have encountered a bitmap-format file, the size limit
* is smaller than s_maxbytes, which is for extent-mapped files.
@@ -231,9 +225,42 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
return -EFBIG;
iov_iter_truncate(from, sbi->s_bitmap_maxbytes - iocb->ki_pos);
}
+
+ ret = file_modified(iocb->ki_filp);
+ if (ret)
+ return ret;
+
return iov_iter_count(from);
}
+static ssize_t ext4_buffered_write_iter(struct kiocb *iocb,
+ struct iov_iter *from)
+{
+ ssize_t ret;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (iocb->ki_flags & IOCB_NOWAIT)
+ return -EOPNOTSUPP;
+
+ inode_lock(inode);
+ ret = ext4_write_checks(iocb, from);
+ if (ret <= 0)
+ goto out;
+
+ current->backing_dev_info = inode_to_bdi(inode);
+ ret = generic_perform_write(iocb->ki_filp, from, iocb->ki_pos);
+ current->backing_dev_info = NULL;
+
+out:
+ inode_unlock(inode);
+ if (likely(ret > 0)) {
+ iocb->ki_pos += ret;
+ ret = generic_write_sync(iocb, ret);
+ }
+
+ return ret;
+}
+
static ssize_t ext4_handle_inode_extension(struct inode *inode, ssize_t written,
loff_t offset, size_t count)
{
@@ -297,6 +324,129 @@ static ssize_t ext4_handle_inode_extension(struct inode *inode, ssize_t written,
return written;
}
+static int ext4_dio_write_end_io(struct kiocb *iocb, ssize_t size,
+ int error, unsigned int flags)
+{
+ loff_t offset = iocb->ki_pos;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (error)
+ return error;
+
+ if (size && flags & IOMAP_DIO_UNWRITTEN)
+ return ext4_convert_unwritten_extents(NULL, inode,
+ offset, size);
+
+ return 0;
+}
+
+static const struct iomap_dio_ops ext4_dio_write_ops = {
+ .end_io = ext4_dio_write_end_io,
+};
+
+static ssize_t ext4_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+ ssize_t ret;
+ size_t count;
+ loff_t offset;
+ handle_t *handle;
+ struct inode *inode = file_inode(iocb->ki_filp);
+ bool extend = false, overwrite = false, unaligned_aio = false;
+
+ if (iocb->ki_flags & IOCB_NOWAIT) {
+ if (!inode_trylock(inode))
+ return -EAGAIN;
+ } else {
+ inode_lock(inode);
+ }
+
+ ret = ext4_write_checks(iocb, from);
+ if (ret <= 0) {
+ inode_unlock(inode);
+ return ret;
+ }
+
+ /*
+ * Unaligned direct asynchronous I/O must be serialized among each
+ * other as the zeroing of partial blocks of two competing unaligned
+ * asynchronous I/O writes can result in data corruption.
+ */
+ offset = iocb->ki_pos;
+ count = iov_iter_count(from);
+ if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
+ !is_sync_kiocb(iocb) && ext4_unaligned_aio(inode, from, offset)) {
+ unaligned_aio = true;
+ inode_dio_wait(inode);
+ }
+
+ /*
+ * Determine whether the I/O will overwrite allocated and initialized
+ * blocks. If so, check to see whether it is possible to take the
+ * dioread_nolock path.
+ */
+ if (!unaligned_aio && ext4_overwrite_io(inode, offset, count) &&
+ ext4_should_dioread_nolock(inode)) {
+ overwrite = true;
+ downgrade_write(&inode->i_rwsem);
+ }
+
+ if (offset + count > EXT4_I(inode)->i_disksize) {
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ goto out;
+ }
+
+ ret = ext4_orphan_add(handle, inode);
+ if (ret) {
+ ext4_journal_stop(handle);
+ goto out;
+ }
+
+ extend = true;
+ ext4_journal_stop(handle);
+ }
+
+ ret = iomap_dio_rw(iocb, from, &ext4_iomap_ops, &ext4_dio_write_ops,
+ is_sync_kiocb(iocb) || unaligned_aio || extend);
+
+ if (extend) {
+ ret = ext4_handle_inode_extension(inode, ret, offset, count);
+
+ /*
+ * We may have failed to remove the inode from the orphan list
+ * in the case that the i_disksize got update due to delalloc
+ * writeback while the direct I/O was running. We need to make
+ * sure we remove it from the orphan list as if we've
+ * prematurely popped it onto the list.
+ */
+ if (!list_empty(&EXT4_I(inode)->i_orphan)) {
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ ret = PTR_ERR(handle);
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+ goto out;
+ }
+
+ if (inode->i_nlink)
+ ext4_orphan_del(handle, inode);
+ ext4_journal_stop(handle);
+ }
+ }
+
+out:
+ if (overwrite)
+ inode_unlock_shared(inode);
+ else
+ inode_unlock(inode);
+
+ if (ret >= 0 && iov_iter_count(from))
+ return ext4_buffered_write_iter(iocb, from);
+
+ return ret;
+}
+
#ifdef CONFIG_FS_DAX
static ssize_t
ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
@@ -313,15 +463,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
return -EAGAIN;
inode_lock(inode);
}
+
ret = ext4_write_checks(iocb, from);
if (ret <= 0)
goto out;
- ret = file_remove_privs(iocb->ki_filp);
- if (ret)
- goto out;
- ret = file_update_time(iocb->ki_filp);
- if (ret)
- goto out;
offset = iocb->ki_pos;
count = iov_iter_count(from);
@@ -356,10 +501,6 @@ static ssize_t
ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
struct inode *inode = file_inode(iocb->ki_filp);
- int o_direct = iocb->ki_flags & IOCB_DIRECT;
- int unaligned_aio = 0;
- int overwrite = 0;
- ssize_t ret;
if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
return -EIO;
@@ -368,59 +509,10 @@ ext4_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (IS_DAX(inode))
return ext4_dax_write_iter(iocb, from);
#endif
+ if (iocb->ki_flags & IOCB_DIRECT)
+ return ext4_dio_write_iter(iocb, from);
- if (!inode_trylock(inode)) {
- if (iocb->ki_flags & IOCB_NOWAIT)
- return -EAGAIN;
- inode_lock(inode);
- }
-
- ret = ext4_write_checks(iocb, from);
- if (ret <= 0)
- goto out;
-
- /*
- * Unaligned direct AIO must be serialized among each other as zeroing
- * of partial blocks of two competing unaligned AIOs can result in data
- * corruption.
- */
- if (o_direct && ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) &&
- !is_sync_kiocb(iocb) &&
- ext4_unaligned_aio(inode, from, iocb->ki_pos)) {
- unaligned_aio = 1;
- ext4_unwritten_wait(inode);
- }
-
- iocb->private = &overwrite;
- /* Check whether we do a DIO overwrite or not */
- if (o_direct && !unaligned_aio) {
- if (ext4_overwrite_io(inode, iocb->ki_pos, iov_iter_count(from))) {
- if (ext4_should_dioread_nolock(inode))
- overwrite = 1;
- } else if (iocb->ki_flags & IOCB_NOWAIT) {
- ret = -EAGAIN;
- goto out;
- }
- }
-
- ret = __generic_file_write_iter(iocb, from);
- /*
- * Unaligned direct AIO must be the only IO in flight. Otherwise
- * overlapping aligned IO after unaligned might result in data
- * corruption.
- */
- if (ret == -EIOCBQUEUED && unaligned_aio)
- ext4_unwritten_wait(inode);
- inode_unlock(inode);
-
- if (ret > 0)
- ret = generic_write_sync(iocb, ret);
-
- return ret;
-
-out:
- inode_unlock(inode);
- return ret;
+ return ext4_buffered_write_iter(iocb, from);
}
#ifdef CONFIG_FS_DAX
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 70ddcb9c2c1c..1adc265d72fc 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -826,133 +826,6 @@ int ext4_get_block_unwritten(struct inode *inode, sector_t iblock,
/* Maximum number of blocks we map for direct IO at once. */
#define DIO_MAX_BLOCKS 4096
-/*
- * Get blocks function for the cases that need to start a transaction -
- * generally difference cases of direct IO and DAX IO. It also handles retries
- * in case of ENOSPC.
- */
-static int ext4_get_block_trans(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int flags)
-{
- int dio_credits;
- handle_t *handle;
- int retries = 0;
- int ret;
-
- /* Trim mapping request to maximum we can map at once for DIO */
- if (bh_result->b_size >> inode->i_blkbits > DIO_MAX_BLOCKS)
- bh_result->b_size = DIO_MAX_BLOCKS << inode->i_blkbits;
- dio_credits = ext4_chunk_trans_blocks(inode,
- bh_result->b_size >> inode->i_blkbits);
-retry:
- handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, dio_credits);
- if (IS_ERR(handle))
- return PTR_ERR(handle);
-
- ret = _ext4_get_block(inode, iblock, bh_result, flags);
- ext4_journal_stop(handle);
-
- if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
- goto retry;
- return ret;
-}
-
-/* Get block function for DIO reads and writes to inodes without extents */
-int ext4_dio_get_block(struct inode *inode, sector_t iblock,
- struct buffer_head *bh, int create)
-{
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
- return ext4_get_block_trans(inode, iblock, bh, EXT4_GET_BLOCKS_CREATE);
-}
-
-/*
- * Get block function for AIO DIO writes when we create unwritten extent if
- * blocks are not allocated yet. The extent will be converted to written
- * after IO is complete.
- */
-static int ext4_dio_get_block_unwritten_async(struct inode *inode,
- sector_t iblock, struct buffer_head *bh_result, int create)
-{
- int ret;
-
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- ret = ext4_get_block_trans(inode, iblock, bh_result,
- EXT4_GET_BLOCKS_IO_CREATE_EXT);
-
- /*
- * When doing DIO using unwritten extents, we need io_end to convert
- * unwritten extents to written on IO completion. We allocate io_end
- * once we spot unwritten extent and store it in b_private. Generic
- * DIO code keeps b_private set and furthermore passes the value to
- * our completion callback in 'private' argument.
- */
- if (!ret && buffer_unwritten(bh_result)) {
- if (!bh_result->b_private) {
- ext4_io_end_t *io_end;
-
- io_end = ext4_init_io_end(inode, GFP_KERNEL);
- if (!io_end)
- return -ENOMEM;
- bh_result->b_private = io_end;
- ext4_set_io_unwritten_flag(inode, io_end);
- }
- set_buffer_defer_completion(bh_result);
- }
-
- return ret;
-}
-
-/*
- * Get block function for non-AIO DIO writes when we create unwritten extent if
- * blocks are not allocated yet. The extent will be converted to written
- * after IO is complete by ext4_direct_IO_write().
- */
-static int ext4_dio_get_block_unwritten_sync(struct inode *inode,
- sector_t iblock, struct buffer_head *bh_result, int create)
-{
- int ret;
-
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- ret = ext4_get_block_trans(inode, iblock, bh_result,
- EXT4_GET_BLOCKS_IO_CREATE_EXT);
-
- /*
- * Mark inode as having pending DIO writes to unwritten extents.
- * ext4_direct_IO_write() checks this flag and converts extents to
- * written.
- */
- if (!ret && buffer_unwritten(bh_result))
- ext4_set_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
-
- return ret;
-}
-
-static int ext4_dio_get_block_overwrite(struct inode *inode, sector_t iblock,
- struct buffer_head *bh_result, int create)
-{
- int ret;
-
- ext4_debug("ext4_dio_get_block_overwrite: inode %lu, create flag %d\n",
- inode->i_ino, create);
- /* We don't expect handle for direct IO */
- WARN_ON_ONCE(ext4_journal_current_handle());
-
- ret = _ext4_get_block(inode, iblock, bh_result, 0);
- /*
- * Blocks should have been preallocated! ext4_file_write_iter() checks
- * that.
- */
- WARN_ON_ONCE(!buffer_mapped(bh_result) || buffer_unwritten(bh_result));
-
- return ret;
-}
-
-
/*
* `handle' can be NULL if create is zero
*/
@@ -3452,7 +3325,8 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
unsigned int flags)
{
handle_t *handle;
- int ret, dio_credits, retries = 0;
+ u8 blkbits = inode->i_blkbits;
+ int ret, dio_credits, m_flags = 0, retries = 0;
/*
* Trim the mapping request to the maximum value that we can map at
@@ -3473,7 +3347,33 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
if (IS_ERR(handle))
return PTR_ERR(handle);
- ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
+ /*
+ * DAX and direct I/O are the only two operations that are currently
+ * supported with IOMAP_WRITE.
+ */
+ WARN_ON(!IS_DAX(inode) && !(flags & IOMAP_DIRECT));
+ if (IS_DAX(inode))
+ m_flags = EXT4_GET_BLOCKS_CREATE_ZERO;
+ /*
+ * We use i_size instead of i_disksize here because delalloc writeback
+ * can complete at any point and subsequently push the i_disksize out
+ * to i_size. This could be beyond where the direct I/O is happening
+ * and thus expose allocated blocks to direct I/O reads.
+ */
+ else if ((map->m_lblk * (1 << blkbits)) >= i_size_read(inode))
+ m_flags = EXT4_GET_BLOCKS_CREATE;
+ else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
+ m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
+
+ ret = ext4_map_blocks(handle, inode, map, m_flags);
+
+ /*
+ * We cannot fill holes in indirect tree based inodes as that could
+ * expose stale data in the case of a crash. Use the magic error code
+ * to fallback to buffered I/O.
+ */
+ if (!m_flags && !ret)
+ ret = -ENOTBLK;
ext4_journal_stop(handle);
if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
@@ -3518,6 +3418,16 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ssize_t written, unsigned flags, struct iomap *iomap)
{
+ /*
+ * Check to see whether an error while writing out the data to the
+ * allocated blocks. If so, return the magic error code so that we
+ * fallback to buffered I/O. Any blocks that have been allocated in
+ * preparation for the direct I/O write will be reused during the
+ * buffered I/O.
+ */
+ if (flags & (IOMAP_WRITE | IOMAP_DIRECT) && written == 0)
+ return -ENOTBLK;
+
return 0;
}
@@ -3597,243 +3507,6 @@ const struct iomap_ops ext4_iomap_report_ops = {
.iomap_begin = ext4_iomap_begin_report,
};
-static int ext4_end_io_dio(struct kiocb *iocb, loff_t offset,
- ssize_t size, void *private)
-{
- ext4_io_end_t *io_end = private;
-
- /* if not async direct IO just return */
- if (!io_end)
- return 0;
-
- ext_debug("ext4_end_io_dio(): io_end 0x%p "
- "for inode %lu, iocb 0x%p, offset %llu, size %zd\n",
- io_end, io_end->inode->i_ino, iocb, offset, size);
-
- /*
- * Error during AIO DIO. We cannot convert unwritten extents as the
- * data was not written. Just clear the unwritten flag and drop io_end.
- */
- if (size <= 0) {
- ext4_clear_io_unwritten_flag(io_end);
- size = 0;
- }
- io_end->offset = offset;
- io_end->size = size;
- ext4_put_io_end(io_end);
-
- return 0;
-}
-
-/*
- * Handling of direct IO writes.
- *
- * For ext4 extent files, ext4 will do direct-io write even to holes,
- * preallocated extents, and those write extend the file, no need to
- * fall back to buffered IO.
- *
- * For holes, we fallocate those blocks, mark them as unwritten
- * If those blocks were preallocated, we mark sure they are split, but
- * still keep the range to write as unwritten.
- *
- * The unwritten extents will be converted to written when DIO is completed.
- * For async direct IO, since the IO may still pending when return, we
- * set up an end_io call back function, which will do the conversion
- * when async direct IO completed.
- *
- * If the O_DIRECT write will extend the file then add this inode to the
- * orphan list. So recovery will truncate it back to the original size
- * if the machine crashes during the write.
- *
- */
-static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
- struct ext4_inode_info *ei = EXT4_I(inode);
- ssize_t ret;
- loff_t offset = iocb->ki_pos;
- size_t count = iov_iter_count(iter);
- int overwrite = 0;
- get_block_t *get_block_func = NULL;
- int dio_flags = 0;
- loff_t final_size = offset + count;
- int orphan = 0;
- handle_t *handle;
-
- if (final_size > inode->i_size || final_size > ei->i_disksize) {
- /* Credits for sb + inode write */
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto out;
- }
- ret = ext4_orphan_add(handle, inode);
- if (ret) {
- ext4_journal_stop(handle);
- goto out;
- }
- orphan = 1;
- ext4_update_i_disksize(inode, inode->i_size);
- ext4_journal_stop(handle);
- }
-
- BUG_ON(iocb->private == NULL);
-
- /*
- * Make all waiters for direct IO properly wait also for extent
- * conversion. This also disallows race between truncate() and
- * overwrite DIO as i_dio_count needs to be incremented under i_mutex.
- */
- inode_dio_begin(inode);
-
- /* If we do a overwrite dio, i_mutex locking can be released */
- overwrite = *((int *)iocb->private);
-
- if (overwrite)
- inode_unlock(inode);
-
- /*
- * For extent mapped files we could direct write to holes and fallocate.
- *
- * Allocated blocks to fill the hole are marked as unwritten to prevent
- * parallel buffered read to expose the stale data before DIO complete
- * the data IO.
- *
- * As to previously fallocated extents, ext4 get_block will just simply
- * mark the buffer mapped but still keep the extents unwritten.
- *
- * For non AIO case, we will convert those unwritten extents to written
- * after return back from blockdev_direct_IO. That way we save us from
- * allocating io_end structure and also the overhead of offloading
- * the extent convertion to a workqueue.
- *
- * For async DIO, the conversion needs to be deferred when the
- * IO is completed. The ext4 end_io callback function will be
- * called to take care of the conversion work. Here for async
- * case, we allocate an io_end structure to hook to the iocb.
- */
- iocb->private = NULL;
- if (overwrite)
- get_block_func = ext4_dio_get_block_overwrite;
- else if (!ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS) ||
- round_down(offset, i_blocksize(inode)) >= inode->i_size) {
- get_block_func = ext4_dio_get_block;
- dio_flags = DIO_LOCKING | DIO_SKIP_HOLES;
- } else if (is_sync_kiocb(iocb)) {
- get_block_func = ext4_dio_get_block_unwritten_sync;
- dio_flags = DIO_LOCKING;
- } else {
- get_block_func = ext4_dio_get_block_unwritten_async;
- dio_flags = DIO_LOCKING;
- }
- ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev, iter,
- get_block_func, ext4_end_io_dio, NULL,
- dio_flags);
-
- if (ret > 0 && !overwrite && ext4_test_inode_state(inode,
- EXT4_STATE_DIO_UNWRITTEN)) {
- int err;
- /*
- * for non AIO case, since the IO is already
- * completed, we could do the conversion right here
- */
- err = ext4_convert_unwritten_extents(NULL, inode,
- offset, ret);
- if (err < 0)
- ret = err;
- ext4_clear_inode_state(inode, EXT4_STATE_DIO_UNWRITTEN);
- }
-
- inode_dio_end(inode);
- /* take i_mutex locking again if we do a ovewrite dio */
- if (overwrite)
- inode_lock(inode);
-
- if (ret < 0 && final_size > inode->i_size)
- ext4_truncate_failed_write(inode);
-
- /* Handle extending of i_size after direct IO write */
- if (orphan) {
- int err;
-
- /* Credits for sb + inode write */
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- /*
- * We wrote the data but cannot extend
- * i_size. Bail out. In async io case, we do
- * not return error here because we have
- * already submmitted the corresponding
- * bio. Returning error here makes the caller
- * think that this IO is done and failed
- * resulting in race with bio's completion
- * handler.
- */
- if (!ret)
- ret = PTR_ERR(handle);
- if (inode->i_nlink)
- ext4_orphan_del(NULL, inode);
-
- goto out;
- }
- if (inode->i_nlink)
- ext4_orphan_del(handle, inode);
- if (ret > 0) {
- loff_t end = offset + ret;
- if (end > inode->i_size || end > ei->i_disksize) {
- ext4_update_i_disksize(inode, end);
- if (end > inode->i_size)
- i_size_write(inode, end);
- /*
- * We're going to return a positive `ret'
- * here due to non-zero-length I/O, so there's
- * no way of reporting error returns from
- * ext4_mark_inode_dirty() to userspace. So
- * ignore it.
- */
- ext4_mark_inode_dirty(handle, inode);
- }
- }
- err = ext4_journal_stop(handle);
- if (ret == 0)
- ret = err;
- }
-out:
- return ret;
-}
-
-static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct file *file = iocb->ki_filp;
- struct inode *inode = file->f_mapping->host;
- size_t count = iov_iter_count(iter);
- loff_t offset = iocb->ki_pos;
- ssize_t ret;
-
-#ifdef CONFIG_FS_ENCRYPTION
- if (IS_ENCRYPTED(inode) && S_ISREG(inode->i_mode))
- return 0;
-#endif
- if (fsverity_active(inode))
- return 0;
-
- /*
- * If we are doing data journalling we don't support O_DIRECT
- */
- if (ext4_should_journal_data(inode))
- return 0;
-
- /* Let buffer I/O handle the inline data case. */
- if (ext4_has_inline_data(inode))
- return 0;
-
- trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
- ret = ext4_direct_IO_write(iocb, iter);
- trace_ext4_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), ret);
- return ret;
-}
-
/*
* Pages can be marked dirty completely asynchronously from ext4's journalling
* activity. By filemap_sync_pte(), try_to_unmap_one(), etc. We cannot do
@@ -3871,7 +3544,7 @@ static const struct address_space_operations ext4_aops = {
.bmap = ext4_bmap,
.invalidatepage = ext4_invalidatepage,
.releasepage = ext4_releasepage,
- .direct_IO = ext4_direct_IO,
+ .direct_IO = noop_direct_IO,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
@@ -3888,7 +3561,7 @@ static const struct address_space_operations ext4_journalled_aops = {
.bmap = ext4_bmap,
.invalidatepage = ext4_journalled_invalidatepage,
.releasepage = ext4_releasepage,
- .direct_IO = ext4_direct_IO,
+ .direct_IO = noop_direct_IO,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
};
@@ -3904,7 +3577,7 @@ static const struct address_space_operations ext4_da_aops = {
.bmap = ext4_bmap,
.invalidatepage = ext4_invalidatepage,
.releasepage = ext4_releasepage,
- .direct_IO = ext4_direct_IO,
+ .direct_IO = noop_direct_IO,
.migratepage = buffer_migrate_page,
.is_partially_uptodate = block_is_partially_uptodate,
.error_remove_page = generic_error_remove_page,
--
2.20.1
--<M>--
In preparation for implementing the iomap direct I/O modifications,
the inode extension/truncate code needs to be moved out from the
ext4_iomap_end() callback. For direct I/O, if the current code
remained, it would behave incorrrectly. Updating the inode size prior
to converting unwritten extents would potentially allow a racing
direct I/O read to find unwritten extents before being converted
correctly.
The inode extension/truncate code now resides within a new helper
ext4_handle_inode_extension(). This function has been designed so that
it can accommodate for both DAX and direct I/O extension/truncate
operations.
Signed-off-by: Matthew Bobrowski <[email protected]>
---
fs/ext4/file.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++-
fs/ext4/inode.c | 48 +--------------------------------
2 files changed, 71 insertions(+), 48 deletions(-)
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 8420686b90f5..6ddf00265988 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -33,6 +33,7 @@
#include "ext4_jbd2.h"
#include "xattr.h"
#include "acl.h"
+#include "truncate.h"
static bool ext4_dio_supported(struct inode *inode)
{
@@ -233,12 +234,77 @@ static ssize_t ext4_write_checks(struct kiocb *iocb, struct iov_iter *from)
return iov_iter_count(from);
}
+static ssize_t ext4_handle_inode_extension(struct inode *inode, ssize_t written,
+ loff_t offset, size_t count)
+{
+ handle_t *handle;
+ bool truncate = false;
+ u8 blkbits = inode->i_blkbits;
+ ext4_lblk_t written_blk, end_blk;
+
+ /*
+ * Note that EXT4_I(inode)->i_disksize can get extended up to
+ * inode->i_size while the I/O was running due to the writeback of
+ * delalloc blocks. But, the code in ext4_iomap_alloc() is careful to
+ * use zeroed/unwritten extents if this is possible and thus we won't
+ * leave uninitialized blocks in a file even if we didn't succeed in
+ * writing as much as we planned.
+ */
+ WARN_ON_ONCE(i_size_read(inode) < EXT4_I(inode)->i_disksize);
+ if (offset + count <= EXT4_I(inode)->i_disksize)
+ return written;
+
+ if (written < 0)
+ goto truncate;
+
+ handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
+ if (IS_ERR(handle)) {
+ written = PTR_ERR(handle);
+ goto truncate;
+ }
+
+ if (ext4_update_inode_size(inode, offset + written))
+ ext4_mark_inode_dirty(handle, inode);
+
+ /*
+ * We may need to truncate allocated but not written blocks beyond EOF.
+ */
+ written_blk = ALIGN(offset + written, 1 << blkbits);
+ end_blk = ALIGN(offset + count, 1 << blkbits);
+ if (written_blk < end_blk && ext4_can_truncate(inode))
+ truncate = true;
+
+ /*
+ * Remove the inode from the orphan list if it has been extended and
+ * everything went OK.
+ */
+ if (!truncate && inode->i_nlink)
+ ext4_orphan_del(handle, inode);
+ ext4_journal_stop(handle);
+
+ if (truncate) {
+truncate:
+ ext4_truncate_failed_write(inode);
+ /*
+ * If the truncate operation failed early, then the inode may
+ * still be on the orphan list. In that case, we need to try
+ * remove the inode from the in-memory linked list.
+ */
+ if (inode->i_nlink)
+ ext4_orphan_del(NULL, inode);
+ }
+
+ return written;
+}
+
#ifdef CONFIG_FS_DAX
static ssize_t
ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
{
- struct inode *inode = file_inode(iocb->ki_filp);
ssize_t ret;
+ size_t count;
+ loff_t offset;
+ struct inode *inode = file_inode(iocb->ki_filp);
if (!inode_trylock(inode)) {
if (iocb->ki_flags & IOCB_NOWAIT)
@@ -255,7 +321,10 @@ ext4_dax_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (ret)
goto out;
+ offset = iocb->ki_pos;
+ count = iov_iter_count(from);
ret = dax_iomap_rw(iocb, from, &ext4_iomap_ops);
+ ret = ext4_handle_inode_extension(inode, ret, offset, count);
out:
inode_unlock(inode);
if (ret > 0)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 03a9e2b85e46..f79d15e8d3c6 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -3531,53 +3531,7 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
static int ext4_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ssize_t written, unsigned flags, struct iomap *iomap)
{
- int ret = 0;
- handle_t *handle;
- int blkbits = inode->i_blkbits;
- bool truncate = false;
-
- if (!(flags & IOMAP_WRITE) || (flags & IOMAP_FAULT))
- return 0;
-
- handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
- if (IS_ERR(handle)) {
- ret = PTR_ERR(handle);
- goto orphan_del;
- }
- if (ext4_update_inode_size(inode, offset + written))
- ext4_mark_inode_dirty(handle, inode);
- /*
- * We may need to truncate allocated but not written blocks beyond EOF.
- */
- if (iomap->offset + iomap->length >
- ALIGN(inode->i_size, 1 << blkbits)) {
- ext4_lblk_t written_blk, end_blk;
-
- written_blk = (offset + written) >> blkbits;
- end_blk = (offset + length) >> blkbits;
- if (written_blk < end_blk && ext4_can_truncate(inode))
- truncate = true;
- }
- /*
- * Remove inode from orphan list if we were extending a inode and
- * everything went fine.
- */
- if (!truncate && inode->i_nlink &&
- !list_empty(&EXT4_I(inode)->i_orphan))
- ext4_orphan_del(handle, inode);
- ext4_journal_stop(handle);
- if (truncate) {
- ext4_truncate_failed_write(inode);
-orphan_del:
- /*
- * If truncate failed early the inode might still be on the
- * orphan list; we need to make sure the inode is removed from
- * the orphan list in that case.
- */
- if (inode->i_nlink)
- ext4_orphan_del(NULL, inode);
- }
- return ret;
+ return 0;
}
const struct iomap_ops ext4_iomap_ops = {
--
2.20.1
--<M>--
On Mon 21-10-19 20:17:31, Matthew Bobrowski wrote:
> Separate the iomap field population chunk of code that is currently
> within ext4_iomap_begin() into a new helper called
> ext4_set_iomap(). The intent of this function is self explanatory,
> however the rationale behind doing so is to also reduce the overall
> clutter that we currently have within the ext4_iomap_begin() callback.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
> ---
> fs/ext4/inode.c | 59 +++++++++++++++++++++++++++----------------------
> 1 file changed, 33 insertions(+), 26 deletions(-)
The patch looks good to me. Feel free to add:
Reviewed-by: Jan Kara <[email protected]>
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon 21-10-19 20:17:46, Matthew Bobrowski wrote:
> This patch is effectively addressed what Dave Chinner had found and
> fixed within this commit: 8a23414ee345. Justification for needing this
> modification has been provided below:
>
> When doing a direct IO that spans the current EOF, and there are
> written blocks beyond EOF that extend beyond the current write, the
> only metadata update that needs to be done is a file size extension.
>
> However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that
> there is IO completion metadata updates required, and hence we may
> fail to correctly sync file size extensions made in IO completion when
> O_DSYNC writes are being used and the hardware supports FUA.
>
> Hence when setting IOMAP_F_DIRTY, we need to also take into account
> whether the iomap spans the current EOF. If it does, then we need to
> mark it dirty so that IO completion will call generic_write_sync() to
> flush the inode size update to stable storage correctly.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
Looks good to me. You could possibly also comment in the changelog that
currently, this change doesn't have user visible impact for ext4 as none of
current users of ext4_iomap_begin() that extend files depend of
IOMAP_F_DIRTY.
Also this patch would make slightly more sense to be before 1/12 so that
you don't have there those two strange unused arguments. But these are just
small nits.
Feel free to add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/ext4/inode.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 158eea9a1944..0dd29ae5cc8c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3412,8 +3412,14 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> {
> u8 blkbits = inode->i_blkbits;
>
> + /*
> + * Writes that span EOF might trigger an I/O size update on completion,
> + * so consider them to be dirty for the purposes of O_DSYNC, even if
> + * there is no other metadata changes being made or are pending here.
> + */
> iomap->flags = 0;
> - if (ext4_inode_datasync_dirty(inode))
> + if (ext4_inode_datasync_dirty(inode) ||
> + offset + length > i_size_read(inode))
> iomap->flags |= IOMAP_F_DIRTY;
>
> if (map->m_flags & EXT4_MAP_NEW)
> --
> 2.20.1
>
> --<M>--
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon 21-10-19 20:18:00, Matthew Bobrowski wrote:
> In preparation for porting across the ext4 direct I/O path for reads
> and writes over to the iomap infrastructure, split up the IOMAP_WRITE
> chunk of code into a separate helper ext4_alloc_iomap(). This way,
> when we add the necessary bits for direct I/O, we don't end up with
> ext4_iomap_begin() becoming a behemoth twisty maze.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
Looks good to me. You can add:
Reviewed-by: Jan Kara <[email protected]>
Honza
> ---
> fs/ext4/inode.c | 112 ++++++++++++++++++++++++++----------------------
> 1 file changed, 60 insertions(+), 52 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 0dd29ae5cc8c..3dc92bd8a944 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3442,6 +3442,62 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> }
> }
>
> +static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> + unsigned int flags)
> +{
> + handle_t *handle;
> + u8 blkbits = inode->i_blkbits;
> + int ret, dio_credits, retries = 0;
> +
> + /*
> + * Trim the mapping request to the maximum value that we can map at
> + * once for direct I/O.
> + */
> + if (map->m_len > DIO_MAX_BLOCKS)
> + map->m_len = DIO_MAX_BLOCKS;
> + dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
> +
> +retry:
> + /*
> + * Either we allocate blocks and then don't get an unwritten extent, so
> + * in that case we have reserved enough credits. Or, the blocks are
> + * already allocated and and unwritten. In that case, the extent
> + * conversion fits into the credits too.
> + */
> + handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, dio_credits);
> + if (IS_ERR(handle))
> + return PTR_ERR(handle);
> +
> + ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
> + if (ret < 0)
> + goto journal_stop;
> +
> + /*
> + * If we have allocated blocks beyond EOF, we need to ensure that
> + * they're truncated if we crash before updating the inode size
> + * metadata within ext4_iomap_end(). For faults, we don't need to do
> + * that (and cannot due to the orphan list operations needing an
> + * inode_lock()). If we happen to instantiate blocks beyond EOF, it is
> + * because we race with a truncate operation, which already has added
> + * the inode onto the orphan list.
> + */
> + if (!(flags & IOMAP_FAULT) && map->m_lblk + map->m_len >
> + (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
> + int err;
> +
> + err = ext4_orphan_add(handle, inode);
> + if (err < 0)
> + ret = err;
> + }
> +
> +journal_stop:
> + ext4_journal_stop(handle);
> + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> + goto retry;
> +
> + return ret;
> +}
> +
> static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> unsigned flags, struct iomap *iomap)
> {
> @@ -3502,62 +3558,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> }
> }
> } else if (flags & IOMAP_WRITE) {
> - int dio_credits;
> - handle_t *handle;
> - int retries = 0;
> -
> - /* Trim mapping request to maximum we can map at once for DIO */
> - if (map.m_len > DIO_MAX_BLOCKS)
> - map.m_len = DIO_MAX_BLOCKS;
> - dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
> -retry:
> - /*
> - * Either we allocate blocks and then we don't get unwritten
> - * extent so we have reserved enough credits, or the blocks
> - * are already allocated and unwritten and in that case
> - * extent conversion fits in the credits as well.
> - */
> - handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
> - dio_credits);
> - if (IS_ERR(handle))
> - return PTR_ERR(handle);
> -
> - ret = ext4_map_blocks(handle, inode, &map,
> - EXT4_GET_BLOCKS_CREATE_ZERO);
> - if (ret < 0) {
> - ext4_journal_stop(handle);
> - if (ret == -ENOSPC &&
> - ext4_should_retry_alloc(inode->i_sb, &retries))
> - goto retry;
> - return ret;
> - }
> -
> - /*
> - * If we added blocks beyond i_size, we need to make sure they
> - * will get truncated if we crash before updating i_size in
> - * ext4_iomap_end(). For faults we don't need to do that (and
> - * even cannot because for orphan list operations inode_lock is
> - * required) - if we happen to instantiate block beyond i_size,
> - * it is because we race with truncate which has already added
> - * the inode to the orphan list.
> - */
> - if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
> - (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
> - int err;
> -
> - err = ext4_orphan_add(handle, inode);
> - if (err < 0) {
> - ext4_journal_stop(handle);
> - return err;
> - }
> - }
> - ext4_journal_stop(handle);
> + ret = ext4_iomap_alloc(inode, &map, flags);
> } else {
> ret = ext4_map_blocks(NULL, inode, &map, 0);
> - if (ret < 0)
> - return ret;
> }
>
> + if (ret < 0)
> + return ret;
> +
> ext4_set_iomap(inode, iomap, &map, offset, length);
> if (delalloc && iomap->type == IOMAP_HOLE)
> iomap->type = IOMAP_DELALLOC;
> --
> 2.20.1
>
> --<M>--
--
Jan Kara <[email protected]>
SUSE Labs, CR
Hi Matthew, thanks for your work on this patch series!
I applied it against 4c3, and ran a quick test run on it, and found
the following locking problem. To reproduce:
kvm-xfstests -c nojournal generic/113
generic/113 [09:27:19][ 5.841937] run fstests generic/113 at 2019-10-21 09:27:19
[ 7.959477]
[ 7.959798] ============================================
[ 7.960518] WARNING: possible recursive locking detected
[ 7.961225] 5.4.0-rc3-xfstests-00012-g7fe6ea084e48 #1238 Not tainted
[ 7.961991] --------------------------------------------
[ 7.962569] aio-stress/1516 is trying to acquire lock:
[ 7.963129] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: __generic_file_fsync+0x3e/0xb0
[ 7.964109]
[ 7.964109] but task is already holding lock:
[ 7.964740] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: ext4_dio_write_iter+0x15b/0x430
[ 7.965763]
[ 7.965763] other info that might help us debug this:
[ 7.966630] Possible unsafe locking scenario:
[ 7.966630]
[ 7.967424] CPU0
[ 7.967760] ----
[ 7.968097] lock(&sb->s_type->i_mutex_key#12);
[ 7.968827] lock(&sb->s_type->i_mutex_key#12);
[ 7.969558]
[ 7.969558] *** DEADLOCK ***
[ 7.969558]
[ 7.970518] May be due to missing lock nesting notation
[ 7.970518]
[ 7.971592] 1 lock held by aio-stress/1516:
[ 7.972267] #0: ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: ext4_dio_write_iter+0x15b/0x430
[ 7.973807]
[ 7.973807] stack backtrace:
[ 7.974510] CPU: 0 PID: 1516 Comm: aio-stress Not tainted 5.4.0-rc3-xfstests-00012-g7fe6ea084e48 #1238
[ 7.976053] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 7.977327] Call Trace:
[ 7.977700] dump_stack+0x67/0x90
[ 7.978198] __lock_acquire.cold+0x130/0x1f7
[ 7.978829] ? __switch_to_asm+0x40/0x70
[ 7.979659] lock_acquire+0x9a/0x160
[ 7.980320] ? __generic_file_fsync+0x3e/0xb0
[ 7.981014] down_write+0x40/0x110
[ 7.981717] ? __generic_file_fsync+0x3e/0xb0
[ 7.982676] __generic_file_fsync+0x3e/0xb0
[ 7.983454] ext4_sync_file+0x277/0x4e0
[ 7.984188] iomap_dio_complete+0x112/0x130
[ 7.984971] ? iomap_dio_rw+0x3a0/0x4b0
[ 7.985647] iomap_dio_rw+0x419/0x4b0
[ 7.986317] ? ext4_dio_write_iter+0x296/0x430
[ 7.987039] ext4_dio_write_iter+0x296/0x430
[ 7.987786] aio_write+0xef/0x1c0
[ 7.988284] ? kvm_sched_clock_read+0x14/0x30
[ 7.988822] ? sched_clock+0x5/0x10
[ 7.989234] ? sched_clock_cpu+0xc/0xc0
[ 7.989719] __io_submit_one.constprop.0+0x399/0x5f0
[ 7.990315] ? kvm_sched_clock_read+0x14/0x30
[ 7.990917] ? sched_clock+0x5/0x10
[ 7.991473] ? sched_clock_cpu+0xc/0xc0
[ 7.992097] ? io_submit_one+0x141/0x5a0
[ 7.992741] io_submit_one+0x141/0x5a0
[ 7.993354] __x64_sys_io_submit+0x9a/0x290
[ 7.993853] ? do_syscall_64+0x50/0x1b0
[ 7.994250] ? __ia32_sys_io_destroy+0x10/0x10
[ 7.994748] do_syscall_64+0x50/0x1b0
[ 7.995175] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[ 7.995761] RIP: 0033:0x55d1268c2d17
[ 7.996270] Code: 00 75 08 8b 47 0c 39 47 08 74 08 e9 b3 ff ff ff 0f 1f 00 31 c0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 b8 d1 00 00 00 0f 05 <c3> 0f 1f 84 00 00 00 00 00 48 63 ff b8 ce 00 00 00 0f 05 c3 0f 1f
[ 7.999131] RSP: 002b:00007f090fb0bd88 EFLAGS: 00000202 ORIG_RAX: 00000000000000d1
[ 7.999994] RAX: ffffffffffffffda RBX: 000055d128135010 RCX: 000055d1268c2d17
[ 8.000881] RDX: 000055d128135010 RSI: 0000000000000008 RDI: 00007f0907263000
[ 8.001765] RBP: 000055d128129560 R08: 00007fff421ae080 R09: 00007f090fb0bd68
[ 8.002824] R10: 00007f090fb0bd60 R11: 0000000000000202 R12: 0000000000000008
[ 8.004016] R13: 00007f090fb0bdb0 R14: 00007f090fb0bda0 R15: 000055d128129560
I'm other test configurations are still running, and I haven't had a
chance to do a detailed review on it, but I'll try to get to it this
week.
Thanks,
- Ted
On Mon 21-10-19 20:18:09, Matthew Bobrowski wrote:
> As part of the ext4_iomap_begin() cleanups that precede this patch,
> here we also split up the IOMAP_REPORT branch into a completely
> separate ->iomap_begin() callback named
> ext4_iomap_begin_report(). Again, the raionale for this change is to
> reduce the overall clutter that's starting to become apparent as we
> start to port more functionality over to the iomap infrastructure.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
The patch looks good to me. You can add:
Reviewed-by: Jan Kara <[email protected]>
One nit below.
> + ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
> + map->m_lblk, end, &es);
> +
> + if (!es.es_len || es.es_lblk > end)
> + return false;
> +
> + if (es.es_lblk > map->m_lblk) {
> + map->m_len = es.es_lblk - map->m_lblk;
> + return false;
> + }
> +
> + if (es.es_lblk <= map->m_lblk)
This condition must be always true AFAICT.
> + offset = map->m_lblk - es.es_lblk;
> +
> + map->m_lblk = es.es_lblk + offset;
> + map->m_len = es.es_len - offset;
> +
> + return true;
> +}
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon 21-10-19 20:18:27, [email protected] wrote:
> Use iomap_dio_rw() to wait for unaligned direct IO instead of opencoding
> the wait.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
>
> This patch has already been posted through by Jan, but I've just
> included it within this patch series to mark that it's a clear
> dependency.
This patch actually is not a dependency. But that doesn't really matter...
Honza
>
> fs/xfs/xfs_file.c | 12 ++++--------
> 1 file changed, 4 insertions(+), 8 deletions(-)
>
> diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
> index 0739ba72a82e..c0620135a279 100644
> --- a/fs/xfs/xfs_file.c
> +++ b/fs/xfs/xfs_file.c
> @@ -547,16 +547,12 @@ xfs_file_dio_aio_write(
> }
>
> trace_xfs_file_direct_write(ip, count, iocb->ki_pos);
> - ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, &xfs_dio_write_ops,
> - is_sync_kiocb(iocb));
> -
> /*
> - * If unaligned, this is the only IO in-flight. If it has not yet
> - * completed, wait on it before we release the iolock to prevent
> - * subsequent overlapping IO.
> + * If unaligned, this is the only IO in-flight. Wait on it before we
> + * release the iolock to prevent subsequent overlapping IO.
> */
> - if (ret == -EIOCBQUEUED && unaligned_io)
> - inode_dio_wait(inode);
> + ret = iomap_dio_rw(iocb, from, &xfs_iomap_ops, &xfs_dio_write_ops,
> + is_sync_kiocb(iocb) || unaligned_io);
> out:
> xfs_iunlock(ip, iolock);
>
> --
> 2.20.1
>
> --<M>--
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon 21-10-19 20:18:37, Matthew Bobrowski wrote:
> This patch introduces a new direct I/O read path which makes use of
> the iomap infrastructure.
>
> The new function ext4_do_read_iter() is responsible for calling into
> the iomap infrastructure via iomap_dio_rw(). If the read operation
> performed on the inode is not supported, which is checked via
> ext4_dio_supported(), then we simply fallback and complete the I/O
> using buffered I/O.
>
> Existing direct I/O read code path has been removed, as it is no
> longer required.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
Looks good to me. You can add:
Reviewed-by: Jan Kara <[email protected]>
BTW, I think I gave you my Reviewed-by tag for this patch also last time
and this patch didn't change since then. You can just include the tag in
your posting in that case.
Honza
> ---
> fs/ext4/file.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
> fs/ext4/inode.c | 32 +-------------------------------
> 2 files changed, 46 insertions(+), 32 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index ab75aee3e687..6ea7e00e0204 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -34,6 +34,46 @@
> #include "xattr.h"
> #include "acl.h"
>
> +static bool ext4_dio_supported(struct inode *inode)
> +{
> + if (IS_ENABLED(CONFIG_FS_ENCRYPTION) && IS_ENCRYPTED(inode))
> + return false;
> + if (fsverity_active(inode))
> + return false;
> + if (ext4_should_journal_data(inode))
> + return false;
> + if (ext4_has_inline_data(inode))
> + return false;
> + return true;
> +}
> +
> +static int ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> + ssize_t ret;
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + inode_lock_shared(inode);
> + if (!ext4_dio_supported(inode)) {
> + inode_unlock_shared(inode);
> + /*
> + * Fallback to buffered I/O if the operation being performed on
> + * the inode is not supported by direct I/O. The IOCB_DIRECT
> + * flag needs to be cleared here in order to ensure that the
> + * direct I/O path within generic_file_read_iter() is not
> + * taken.
> + */
> + iocb->ki_flags &= ~IOCB_DIRECT;
> + return generic_file_read_iter(iocb, to);
> + }
> +
> + ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL,
> + is_sync_kiocb(iocb));
> + inode_unlock_shared(inode);
> +
> + file_accessed(iocb->ki_filp);
> + return ret;
> +}
> +
> #ifdef CONFIG_FS_DAX
> static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
> {
> @@ -64,7 +104,9 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
>
> static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> {
> - if (unlikely(ext4_forced_shutdown(EXT4_SB(file_inode(iocb->ki_filp)->i_sb))))
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
> return -EIO;
>
> if (!iov_iter_count(to))
> @@ -74,6 +116,8 @@ static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> if (IS_DAX(file_inode(iocb->ki_filp)))
> return ext4_dax_read_iter(iocb, to);
> #endif
> + if (iocb->ki_flags & IOCB_DIRECT)
> + return ext4_dio_read_iter(iocb, to);
> return generic_file_read_iter(iocb, to);
> }
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index ebeedbf3900f..03a9e2b85e46 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -863,9 +863,6 @@ int ext4_dio_get_block(struct inode *inode, sector_t iblock,
> {
> /* We don't expect handle for direct IO */
> WARN_ON_ONCE(ext4_journal_current_handle());
> -
> - if (!create)
> - return _ext4_get_block(inode, iblock, bh, 0);
> return ext4_get_block_trans(inode, iblock, bh, EXT4_GET_BLOCKS_CREATE);
> }
>
> @@ -3865,30 +3862,6 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
> return ret;
> }
>
> -static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter)
> -{
> - struct address_space *mapping = iocb->ki_filp->f_mapping;
> - struct inode *inode = mapping->host;
> - size_t count = iov_iter_count(iter);
> - ssize_t ret;
> -
> - /*
> - * Shared inode_lock is enough for us - it protects against concurrent
> - * writes & truncates and since we take care of writing back page cache,
> - * we are protected against page writeback as well.
> - */
> - inode_lock_shared(inode);
> - ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,
> - iocb->ki_pos + count - 1);
> - if (ret)
> - goto out_unlock;
> - ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
> - iter, ext4_dio_get_block, NULL, NULL, 0);
> -out_unlock:
> - inode_unlock_shared(inode);
> - return ret;
> -}
> -
> static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> {
> struct file *file = iocb->ki_filp;
> @@ -3915,10 +3888,7 @@ static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> return 0;
>
> trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
> - if (iov_iter_rw(iter) == READ)
> - ret = ext4_direct_IO_read(iocb, iter);
> - else
> - ret = ext4_direct_IO_write(iocb, iter);
> + ret = ext4_direct_IO_write(iocb, iter);
> trace_ext4_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), ret);
> return ret;
> }
> --
> 2.20.1
>
> --<M>--
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon 21-10-19 20:18:46, Matthew Bobrowski wrote:
> This patch updates the lock pattern in ext4_dio_read_iter() to only
> perform the trylock in IOCB_NOWAIT cases.
The changelog is actually misleading. It should say something like "This
patch updates the lock pattern in ext4_dio_read_iter() to not block on
inode lock in case of IOCB_NOWAIT direct IO reads."
Also to ease backporting of easy fixes, we try to put patches like this
early in the series (fixing code in ext4_direct_IO_read(), and then the
fixed code would just carry over to ext4_dio_read_iter()).
The change itself looks good to me.
Honza
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
> ---
> fs/ext4/file.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 6ea7e00e0204..8420686b90f5 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -52,7 +52,13 @@ static int ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
> ssize_t ret;
> struct inode *inode = file_inode(iocb->ki_filp);
>
> - inode_lock_shared(inode);
> + if (iocb->ki_flags & IOCB_NOWAIT) {
> + if (!inode_trylock_shared(inode))
> + return -EAGAIN;
> + } else {
> + inode_lock_shared(inode);
> + }
> +
> if (!ext4_dio_supported(inode)) {
> inode_unlock_shared(inode);
> /*
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon 21-10-19 20:18:56, Matthew Bobrowski wrote:
> In preparation for implementing the iomap direct I/O modifications,
> the inode extension/truncate code needs to be moved out from the
> ext4_iomap_end() callback. For direct I/O, if the current code
> remained, it would behave incorrrectly. Updating the inode size prior
> to converting unwritten extents would potentially allow a racing
> direct I/O read to find unwritten extents before being converted
> correctly.
>
> The inode extension/truncate code now resides within a new helper
> ext4_handle_inode_extension(). This function has been designed so that
> it can accommodate for both DAX and direct I/O extension/truncate
> operations.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
> ---
> fs/ext4/file.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++-
> fs/ext4/inode.c | 48 +--------------------------------
> 2 files changed, 71 insertions(+), 48 deletions(-)
>
The patch looks good to me. You can add:
Reviewed-by: Jan Kara <[email protected]>
One nit below:
> +static ssize_t ext4_handle_inode_extension(struct inode *inode, ssize_t written,
> + loff_t offset, size_t count)
IMHO a bit more logical ordering of arguments would be 'inode, offset,
written, count'...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon 21-10-19 20:20:20, Matthew Bobrowski wrote:
> This patch introduces a new direct I/O write path which makes use of
> the iomap infrastructure.
>
> All direct I/O writes are now passed from the ->write_iter() callback
> through to the new direct I/O handler ext4_dio_write_iter(). This
> function is responsible for calling into the iomap infrastructure via
> iomap_dio_rw().
>
> Code snippets from the existing direct I/O write code within
> ext4_file_write_iter() such as, checking whether the I/O request is
> unaligned asynchronous I/O, or whether the write will result in an
> overwrite have effectively been moved out and into the new direct I/O
> ->write_iter() handler.
>
> The block mapping flags that are eventually passed down to
> ext4_map_blocks() from the *_get_block_*() suite of routines have been
> taken out and introduced within ext4_iomap_alloc().
>
> For inode extension cases, ext4_handle_inode_extension() is
> effectively the function responsible for performing such metadata
> updates. This is called after iomap_dio_rw() has returned so that we
> can safely determine whether we need to potentially truncate any
> allocated blocks that may have been prepared for this direct I/O
> write. We don't perform the inode extension, or truncate operations
> from the ->end_io() handler as we don't have the original I/O 'length'
> available there. The ->end_io() however is responsible fo converting
> allocated unwritten extents to written extents.
>
> In the instance of a short write, we fallback and complete the
> remainder of the I/O using buffered I/O via
> ext4_buffered_write_iter().
>
> The existing buffer_head direct I/O implementation has been removed as
> it's now redundant.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
> ---
> fs/ext4/ext4.h | 3 -
> fs/ext4/extents.c | 4 +-
> fs/ext4/file.c | 236 ++++++++++++++++++--------
> fs/ext4/inode.c | 411 +++++-----------------------------------------
> 4 files changed, 207 insertions(+), 447 deletions(-)
The patch looks good to me! You can add:
Reviewed-by: Jan Kara <[email protected]>
One nitpick below:
> + if (extend) {
> + ret = ext4_handle_inode_extension(inode, ret, offset, count);
> +
> + /*
> + * We may have failed to remove the inode from the orphan list
> + * in the case that the i_disksize got update due to delalloc
> + * writeback while the direct I/O was running. We need to make
> + * sure we remove it from the orphan list as if we've
> + * prematurely popped it onto the list.
> + */
> + if (!list_empty(&EXT4_I(inode)->i_orphan)) {
> + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> + if (IS_ERR(handle)) {
> + ret = PTR_ERR(handle);
> + if (inode->i_nlink)
> + ext4_orphan_del(NULL, inode);
> + goto out;
> + }
> +
> + if (inode->i_nlink)
This check can be joined with the list_empty() check above to save us from
unnecessarily starting a transaction. Also I was wondering whether it would
not make more sense have this orphan handling bit also in
ext4_handle_inode_extension(). ext4_dax_write_iter() doesn't strictly
need it (as for DAX i_disksize cannot currently change while
ext4_dax_write_iter() is running) but it would look more robust to me for
the future users and it certainly doesn't hurt ext4_dax_write_iter() case.
> + ext4_orphan_del(handle, inode);
> + ext4_journal_stop(handle);
> + }
> + }
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon 21-10-19 09:31:12, Theodore Y. Ts'o wrote:
> Hi Matthew, thanks for your work on this patch series!
>
> I applied it against 4c3, and ran a quick test run on it, and found
> the following locking problem. To reproduce:
>
> kvm-xfstests -c nojournal generic/113
>
> generic/113 [09:27:19][ 5.841937] run fstests generic/113 at 2019-10-21 09:27:19
> [ 7.959477]
> [ 7.959798] ============================================
> [ 7.960518] WARNING: possible recursive locking detected
> [ 7.961225] 5.4.0-rc3-xfstests-00012-g7fe6ea084e48 #1238 Not tainted
> [ 7.961991] --------------------------------------------
> [ 7.962569] aio-stress/1516 is trying to acquire lock:
> [ 7.963129] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: __generic_file_fsync+0x3e/0xb0
> [ 7.964109]
> [ 7.964109] but task is already holding lock:
> [ 7.964740] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: ext4_dio_write_iter+0x15b/0x430
This is going to be a tricky one. With iomap, the inode locking is handled
by the filesystem while calling generic_write_sync() is done by
iomap_dio_rw(). I would really prefer to avoid tweaking iomap_dio_rw() not
to call generic_write_sync(). So we need to remove inode_lock from
__generic_file_fsync() (used from ext4_sync_file()). This locking is mostly
for legacy purposes and we don't need this in ext4 AFAICT - but removing
the lock from __generic_file_fsync() would mean auditing all legacy
filesystems that use this to make sure flushing inode & its metadata buffer
list while it is possibly changing cannot result in something unexpected. I
don't want to clutter this series with it so we are left with
reimplementing __generic_file_fsync() inside ext4 without inode_lock. Not
too bad but not great either. Thoughts?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon, Oct 21, 2019 at 09:43:30PM +0200, Jan Kara wrote:
> On Mon 21-10-19 09:31:12, Theodore Y. Ts'o wrote:
> > Hi Matthew, thanks for your work on this patch series!
> >
> > I applied it against 4c3, and ran a quick test run on it, and found
> > the following locking problem. To reproduce:
> >
> > kvm-xfstests -c nojournal generic/113
> >
> > generic/113 [09:27:19][ 5.841937] run fstests generic/113 at 2019-10-21 09:27:19
> > [ 7.959477]
> > [ 7.959798] ============================================
> > [ 7.960518] WARNING: possible recursive locking detected
> > [ 7.961225] 5.4.0-rc3-xfstests-00012-g7fe6ea084e48 #1238 Not tainted
> > [ 7.961991] --------------------------------------------
> > [ 7.962569] aio-stress/1516 is trying to acquire lock:
> > [ 7.963129] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: __generic_file_fsync+0x3e/0xb0
> > [ 7.964109]
> > [ 7.964109] but task is already holding lock:
> > [ 7.964740] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: ext4_dio_write_iter+0x15b/0x430
>
> This is going to be a tricky one. With iomap, the inode locking is handled
> by the filesystem while calling generic_write_sync() is done by
> iomap_dio_rw(). I would really prefer to avoid tweaking iomap_dio_rw() not
> to call generic_write_sync().
You can't remove it from there, because that will break O_DSYNC
AIO+DIO. i.e. generic_write_sync() needs to be called before
iocb->ki_complete() is called in the AIO completion path, and that
means filesystems using iomap_dio_rw need to be be able to run
generic_write_sync() without taking the inode_lock().
> So we need to remove inode_lock from
> __generic_file_fsync() (used from ext4_sync_file()). This locking is mostly
> for legacy purposes and we don't need this in ext4 AFAICT - but removing
> the lock from __generic_file_fsync() would mean auditing all legacy
> filesystems that use this to make sure flushing inode & its metadata buffer
> list while it is possibly changing cannot result in something unexpected. I
> don't want to clutter this series with it so we are left with
> reimplementing __generic_file_fsync() inside ext4 without inode_lock. Not
> too bad but not great either. Thoughts?
XFS has implemented it's own ->fsync operation pretty much forever
without issues. It's basically:
1. flush dirty cached data w/ WB_SYNC_ALL
2. flush dirty cached metadata (i.e. journal force)
3. flush device caches if journal force didn't, keeping in
mind the requirements of data and journal being placed on
different devices.
The ext4 variant shouldn't need to be any more complex than that...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Mon, Oct 21, 2019 at 03:28:18PM +0200, Jan Kara wrote:
> On Mon 21-10-19 20:17:46, Matthew Bobrowski wrote:
> > This patch is effectively addressed what Dave Chinner had found and
> > fixed within this commit: 8a23414ee345. Justification for needing this
> > modification has been provided below:
> >
> > When doing a direct IO that spans the current EOF, and there are
> > written blocks beyond EOF that extend beyond the current write, the
> > only metadata update that needs to be done is a file size extension.
> >
> > However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that
> > there is IO completion metadata updates required, and hence we may
> > fail to correctly sync file size extensions made in IO completion when
> > O_DSYNC writes are being used and the hardware supports FUA.
> >
> > Hence when setting IOMAP_F_DIRTY, we need to also take into account
> > whether the iomap spans the current EOF. If it does, then we need to
> > mark it dirty so that IO completion will call generic_write_sync() to
> > flush the inode size update to stable storage correctly.
> >
> > Signed-off-by: Matthew Bobrowski <[email protected]>
>
> Looks good to me. You could possibly also comment in the changelog that
> currently, this change doesn't have user visible impact for ext4 as none of
> current users of ext4_iomap_begin() that extend files depend of
> IOMAP_F_DIRTY.
Sure, I will add this.
> Also this patch would make slightly more sense to be before 1/12 so that
> you don't have there those two strange unused arguments. But these are just
> small nits.
You're right. I will rearrange it in v6 so that this patch comes
first.
> Feel free to add:
>
> Reviewed-by: Jan Kara <[email protected]>
Thanks Jan!
--<M>--
On Mon, Oct 21, 2019 at 03:37:15PM +0200, Jan Kara wrote:
> On Mon 21-10-19 20:18:09, Matthew Bobrowski wrote:
> One nit below.
> > + ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
> > + map->m_lblk, end, &es);
> > +
> > + if (!es.es_len || es.es_lblk > end)
> > + return false;
> > +
> > + if (es.es_lblk > map->m_lblk) {
> > + map->m_len = es.es_lblk - map->m_lblk;
> > + return false;
> > + }
> > +
> > + if (es.es_lblk <= map->m_lblk)
>
> This condition must be always true AFAICT.
That would make sense. I will remove this condition in v6.
> > + offset = map->m_lblk - es.es_lblk;
> > +
> > + map->m_lblk = es.es_lblk + offset;
> > + map->m_len = es.es_len - offset;
> > +
> > + return true;
> > +}
--<M>--
On Mon, Oct 21, 2019 at 03:41:31PM +0200, Jan Kara wrote:
> On Mon 21-10-19 20:18:37, Matthew Bobrowski wrote:
> > This patch introduces a new direct I/O read path which makes use of
> > the iomap infrastructure.
> >
> > The new function ext4_do_read_iter() is responsible for calling into
> > the iomap infrastructure via iomap_dio_rw(). If the read operation
> > performed on the inode is not supported, which is checked via
> > ext4_dio_supported(), then we simply fallback and complete the I/O
> > using buffered I/O.
> >
> > Existing direct I/O read code path has been removed, as it is no
> > longer required.
> >
> > Signed-off-by: Matthew Bobrowski <[email protected]>
>
> Looks good to me. You can add:
>
> Reviewed-by: Jan Kara <[email protected]>
>
> BTW, I think I gave you my Reviewed-by tag for this patch also last time
> and this patch didn't change since then. You can just include the tag in
> your posting in that case.
The lock pattern was dropped here, so I thought I'd just have you
review it again to be sure that you didn't object. But anyway, thank
you!
--<M>--
On Mon, Oct 21, 2019 at 03:48:17PM +0200, Jan Kara wrote:
> On Mon 21-10-19 20:18:46, Matthew Bobrowski wrote:
> > This patch updates the lock pattern in ext4_dio_read_iter() to only
> > perform the trylock in IOCB_NOWAIT cases.
>
> The changelog is actually misleading. It should say something like "This
> patch updates the lock pattern in ext4_dio_read_iter() to not block on
> inode lock in case of IOCB_NOWAIT direct IO reads."
>
> Also to ease backporting of easy fixes, we try to put patches like this
> early in the series (fixing code in ext4_direct_IO_read(), and then the
> fixed code would just carry over to ext4_dio_read_iter()).
OK, understood. Now I know this for next time. :)
Providing that I have this patch precede the ext4_dio_read_iter()
patch and implement this lock pattern in ext4_direct_IO_read(), am I
OK to add the 'Reviewed-by' tag?
--<M>--
On Mon, Oct 21, 2019 at 03:53:37PM +0200, Jan Kara wrote:
> On Mon 21-10-19 20:18:56, Matthew Bobrowski wrote:
> > In preparation for implementing the iomap direct I/O modifications,
> > the inode extension/truncate code needs to be moved out from the
> > ext4_iomap_end() callback. For direct I/O, if the current code
> > remained, it would behave incorrrectly. Updating the inode size prior
> > to converting unwritten extents would potentially allow a racing
> > direct I/O read to find unwritten extents before being converted
> > correctly.
> >
> > The inode extension/truncate code now resides within a new helper
> > ext4_handle_inode_extension(). This function has been designed so that
> > it can accommodate for both DAX and direct I/O extension/truncate
> > operations.
> >
> > Signed-off-by: Matthew Bobrowski <[email protected]>
> > ---
> > fs/ext4/file.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++-
> > fs/ext4/inode.c | 48 +--------------------------------
> > 2 files changed, 71 insertions(+), 48 deletions(-)
> >
>
> The patch looks good to me. You can add:
>
> Reviewed-by: Jan Kara <[email protected]>
Thank you!
> > +static ssize_t ext4_handle_inode_extension(struct inode *inode, ssize_t written,
> > + loff_t offset, size_t count)
>
> IMHO a bit more logical ordering of arguments would be 'inode, offset,
> written, count'...
Funnily enough, I originally had the arguments ordered as you've
suggested, but then decided to reorder them this way last minute. No
objections to reshuffling them around.
--<M>--
On Mon, Oct 21, 2019 at 06:18:48PM +0200, Jan Kara wrote:
> On Mon 21-10-19 20:20:20, Matthew Bobrowski wrote:
> > This patch introduces a new direct I/O write path which makes use of
> > the iomap infrastructure.
> >
> > All direct I/O writes are now passed from the ->write_iter() callback
> > through to the new direct I/O handler ext4_dio_write_iter(). This
> > function is responsible for calling into the iomap infrastructure via
> > iomap_dio_rw().
> >
> > Code snippets from the existing direct I/O write code within
> > ext4_file_write_iter() such as, checking whether the I/O request is
> > unaligned asynchronous I/O, or whether the write will result in an
> > overwrite have effectively been moved out and into the new direct I/O
> > ->write_iter() handler.
> >
> > The block mapping flags that are eventually passed down to
> > ext4_map_blocks() from the *_get_block_*() suite of routines have been
> > taken out and introduced within ext4_iomap_alloc().
> >
> > For inode extension cases, ext4_handle_inode_extension() is
> > effectively the function responsible for performing such metadata
> > updates. This is called after iomap_dio_rw() has returned so that we
> > can safely determine whether we need to potentially truncate any
> > allocated blocks that may have been prepared for this direct I/O
> > write. We don't perform the inode extension, or truncate operations
> > from the ->end_io() handler as we don't have the original I/O 'length'
> > available there. The ->end_io() however is responsible fo converting
> > allocated unwritten extents to written extents.
> >
> > In the instance of a short write, we fallback and complete the
> > remainder of the I/O using buffered I/O via
> > ext4_buffered_write_iter().
> >
> > The existing buffer_head direct I/O implementation has been removed as
> > it's now redundant.
> >
> > Signed-off-by: Matthew Bobrowski <[email protected]>
> > ---
> > fs/ext4/ext4.h | 3 -
> > fs/ext4/extents.c | 4 +-
> > fs/ext4/file.c | 236 ++++++++++++++++++--------
> > fs/ext4/inode.c | 411 +++++-----------------------------------------
> > 4 files changed, 207 insertions(+), 447 deletions(-)
>
> The patch looks good to me! You can add:
>
> Reviewed-by: Jan Kara <[email protected]>
Thanks Jan! :)
> One nitpick below:
>
> > + if (extend) {
> > + ret = ext4_handle_inode_extension(inode, ret, offset, count);
> > +
> > + /*
> > + * We may have failed to remove the inode from the orphan list
> > + * in the case that the i_disksize got update due to delalloc
> > + * writeback while the direct I/O was running. We need to make
> > + * sure we remove it from the orphan list as if we've
> > + * prematurely popped it onto the list.
> > + */
> > + if (!list_empty(&EXT4_I(inode)->i_orphan)) {
> > + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> > + if (IS_ERR(handle)) {
> > + ret = PTR_ERR(handle);
> > + if (inode->i_nlink)
> > + ext4_orphan_del(NULL, inode);
> > + goto out;
> > + }
> > +
> > + if (inode->i_nlink)
>
> This check can be joined with the list_empty() check above to save us from
> unnecessarily starting a transaction.
Yes, easy done.
> Also I was wondering whether it would not make more sense have this
> orphan handling bit also in
> ext4_handle_inode_extension(). ext4_dax_write_iter() doesn't
> strictly need it (as for DAX i_disksize cannot currently change
> while ext4_dax_write_iter() is running) but it would look more
> robust to me for the future users and it certainly doesn't hurt
> ext4_dax_write_iter() case.
I was thinking the same, but to be honest I wasn't entirely sure how
it would pan out for the DAX code path. However, seeing as though you
don't forsee there being any problems, then I can't really think of a
reason not to roll this up into ext4_handle_inode_extension().
So, in ext4_handle_inode_extension() for the initial check against
i_disksize, rather than returning 'written' and then having
ext4_dio_write_iter() perform the cleanup, we could simply jump to a
chunk of code in ext4_handle_inode_extension() and deal with it there,
or quite literally just cleanup if that branch is taken there and then
seeing as though it's not really needed in any other case? What do you
think?
--<M>--
On Tue 22-10-19 09:38:19, Dave Chinner wrote:
> On Mon, Oct 21, 2019 at 09:43:30PM +0200, Jan Kara wrote:
> > On Mon 21-10-19 09:31:12, Theodore Y. Ts'o wrote:
> > > Hi Matthew, thanks for your work on this patch series!
> > >
> > > I applied it against 4c3, and ran a quick test run on it, and found
> > > the following locking problem. To reproduce:
> > >
> > > kvm-xfstests -c nojournal generic/113
> > >
> > > generic/113 [09:27:19][ 5.841937] run fstests generic/113 at 2019-10-21 09:27:19
> > > [ 7.959477]
> > > [ 7.959798] ============================================
> > > [ 7.960518] WARNING: possible recursive locking detected
> > > [ 7.961225] 5.4.0-rc3-xfstests-00012-g7fe6ea084e48 #1238 Not tainted
> > > [ 7.961991] --------------------------------------------
> > > [ 7.962569] aio-stress/1516 is trying to acquire lock:
> > > [ 7.963129] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: __generic_file_fsync+0x3e/0xb0
> > > [ 7.964109]
> > > [ 7.964109] but task is already holding lock:
> > > [ 7.964740] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: ext4_dio_write_iter+0x15b/0x430
> >
> > This is going to be a tricky one. With iomap, the inode locking is handled
> > by the filesystem while calling generic_write_sync() is done by
> > iomap_dio_rw(). I would really prefer to avoid tweaking iomap_dio_rw() not
> > to call generic_write_sync().
>
> You can't remove it from there, because that will break O_DSYNC
> AIO+DIO. i.e. generic_write_sync() needs to be called before
> iocb->ki_complete() is called in the AIO completion path, and that
> means filesystems using iomap_dio_rw need to be be able to run
> generic_write_sync() without taking the inode_lock().
>
> > So we need to remove inode_lock from
> > __generic_file_fsync() (used from ext4_sync_file()). This locking is mostly
> > for legacy purposes and we don't need this in ext4 AFAICT - but removing
> > the lock from __generic_file_fsync() would mean auditing all legacy
> > filesystems that use this to make sure flushing inode & its metadata buffer
> > list while it is possibly changing cannot result in something unexpected. I
> > don't want to clutter this series with it so we are left with
> > reimplementing __generic_file_fsync() inside ext4 without inode_lock. Not
> > too bad but not great either. Thoughts?
>
> XFS has implemented it's own ->fsync operation pretty much forever
> without issues. It's basically:
>
> 1. flush dirty cached data w/ WB_SYNC_ALL
> 2. flush dirty cached metadata (i.e. journal force)
> 3. flush device caches if journal force didn't, keeping in
> mind the requirements of data and journal being placed on
> different devices.
>
> The ext4 variant shouldn't need to be any more complex than that...
Yeah, that's what we do for the common case as well. But when the
filesystem is created without a journal (i.e., ext2 compatibility mode) we
currently use the old fsync implementation including
__generic_file_fsync(). But as I wrote above, duplicating those ~5 lines
out of __generic_file_fsync() we really care about is not a big deal.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue 22-10-19 13:04:21, Matthew Bobrowski wrote:
> On Mon, Oct 21, 2019 at 03:48:17PM +0200, Jan Kara wrote:
> > On Mon 21-10-19 20:18:46, Matthew Bobrowski wrote:
> > > This patch updates the lock pattern in ext4_dio_read_iter() to only
> > > perform the trylock in IOCB_NOWAIT cases.
> >
> > The changelog is actually misleading. It should say something like "This
> > patch updates the lock pattern in ext4_dio_read_iter() to not block on
> > inode lock in case of IOCB_NOWAIT direct IO reads."
> >
> > Also to ease backporting of easy fixes, we try to put patches like this
> > early in the series (fixing code in ext4_direct_IO_read(), and then the
> > fixed code would just carry over to ext4_dio_read_iter()).
>
> OK, understood. Now I know this for next time. :)
>
> Providing that I have this patch precede the ext4_dio_read_iter()
> patch and implement this lock pattern in ext4_direct_IO_read(), am I
> OK to add the 'Reviewed-by' tag?
Yes.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue 22-10-19 14:02:35, Matthew Bobrowski wrote:
> On Mon, Oct 21, 2019 at 06:18:48PM +0200, Jan Kara wrote:
> > > + if (extend) {
> > > + ret = ext4_handle_inode_extension(inode, ret, offset, count);
> > > +
> > > + /*
> > > + * We may have failed to remove the inode from the orphan list
> > > + * in the case that the i_disksize got update due to delalloc
> > > + * writeback while the direct I/O was running. We need to make
> > > + * sure we remove it from the orphan list as if we've
> > > + * prematurely popped it onto the list.
> > > + */
> > > + if (!list_empty(&EXT4_I(inode)->i_orphan)) {
> > > + handle = ext4_journal_start(inode, EXT4_HT_INODE, 2);
> > > + if (IS_ERR(handle)) {
> > > + ret = PTR_ERR(handle);
> > > + if (inode->i_nlink)
> > > + ext4_orphan_del(NULL, inode);
> > > + goto out;
> > > + }
> > > +
> > > + if (inode->i_nlink)
> >
> > This check can be joined with the list_empty() check above to save us from
> > unnecessarily starting a transaction.
>
> Yes, easy done.
>
> > Also I was wondering whether it would not make more sense have this
> > orphan handling bit also in
> > ext4_handle_inode_extension(). ext4_dax_write_iter() doesn't
> > strictly need it (as for DAX i_disksize cannot currently change
> > while ext4_dax_write_iter() is running) but it would look more
> > robust to me for the future users and it certainly doesn't hurt
> > ext4_dax_write_iter() case.
>
> I was thinking the same, but to be honest I wasn't entirely sure how
> it would pan out for the DAX code path. However, seeing as though you
> don't forsee there being any problems, then I can't really think of a
> reason not to roll this up into ext4_handle_inode_extension().
>
> So, in ext4_handle_inode_extension() for the initial check against
> i_disksize, rather than returning 'written' and then having
> ext4_dio_write_iter() perform the cleanup, we could simply jump to a
> chunk of code in ext4_handle_inode_extension() and deal with it there,
> or quite literally just cleanup if that branch is taken there and then
> seeing as though it's not really needed in any other case? What do you
> think?
Yeah, the last option makes the most sense to me.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon, Oct 21, 2019 at 09:43:30PM +0200, Jan Kara wrote:
> On Mon 21-10-19 09:31:12, Theodore Y. Ts'o wrote:
> > Hi Matthew, thanks for your work on this patch series!
> >
> > I applied it against 4c3, and ran a quick test run on it, and found
> > the following locking problem. To reproduce:
> >
> > kvm-xfstests -c nojournal generic/113
> >
> > generic/113 [09:27:19][ 5.841937] run fstests generic/113 at 2019-10-21 09:27:19
> > [ 7.959477]
> > [ 7.959798] ============================================
> > [ 7.960518] WARNING: possible recursive locking detected
> > [ 7.961225] 5.4.0-rc3-xfstests-00012-g7fe6ea084e48 #1238 Not tainted
> > [ 7.961991] --------------------------------------------
> > [ 7.962569] aio-stress/1516 is trying to acquire lock:
> > [ 7.963129] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: __generic_file_fsync+0x3e/0xb0
> > [ 7.964109]
> > [ 7.964109] but task is already holding lock:
> > [ 7.964740] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: ext4_dio_write_iter+0x15b/0x430
>
> This is going to be a tricky one. With iomap, the inode locking is handled
> by the filesystem while calling generic_write_sync() is done by
> iomap_dio_rw(). I would really prefer to avoid tweaking iomap_dio_rw() not
> to call generic_write_sync(). So we need to remove inode_lock from
> __generic_file_fsync() (used from ext4_sync_file()). This locking is mostly
> for legacy purposes and we don't need this in ext4 AFAICT - but removing
> the lock from __generic_file_fsync() would mean auditing all legacy
> filesystems that use this to make sure flushing inode & its metadata buffer
> list while it is possibly changing cannot result in something unexpected. I
> don't want to clutter this series with it so we are left with
> reimplementing __generic_file_fsync() inside ext4 without inode_lock. Not
> too bad but not great either. Thoughts?
So, I just looked at this on my lunch break and I think the simplest approach
would be to just transfer the necessary chunks of code from within
__generic_file_fsync() into ext4_sync_file() for !journal cases, minus the
inode lock, and minus calling into __generic_file_fsync(). I don't forsee this
causing any issues, but feel free to correct me if I'm wrong.
If this is deemed to be OK, then I will go ahead and include this as a
separate patch in my series.
--<M>--
On 10/21/19 2:47 PM, Matthew Bobrowski wrote:
> Separate the iomap field population chunk of code that is currently
> within ext4_iomap_begin() into a new helper called
> ext4_set_iomap(). The intent of this function is self explanatory,
> however the rationale behind doing so is to also reduce the overall
> clutter that we currently have within the ext4_iomap_begin() callback.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
Could you please re-arrange patch sequence in this fashion.
1. Patch-11 (re-ordering of unwritten flags)
2. Patch-8 (trylock in IOCB_NOWAIT cases)
3. Patch-2 (should explain offset & len in this patch)
4. Patch-1 (this patch).
This is so that some of these are anyway fixes or refactoring
which can be picked up easily, either for backporting or
sometimes this helps in getting some of the patches in, if the patch
series gets bigger.
Also others (like me) can also pick some of these changes then to meet
their dependency. :)
This patch looks good to me. You may add:
Reviewed-by: Ritesh Harjani <[email protected]>
> ---
> fs/ext4/inode.c | 59 +++++++++++++++++++++++++++----------------------
> 1 file changed, 33 insertions(+), 26 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 516faa280ced..158eea9a1944 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3406,10 +3406,39 @@ static bool ext4_inode_datasync_dirty(struct inode *inode)
> return inode->i_state & I_DIRTY_DATASYNC;
> }
>
> +static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> + struct ext4_map_blocks *map, loff_t offset,
> + loff_t length)
> +{
> + u8 blkbits = inode->i_blkbits;
> +
> + iomap->flags = 0;
> + if (ext4_inode_datasync_dirty(inode))
> + iomap->flags |= IOMAP_F_DIRTY;
> +
> + if (map->m_flags & EXT4_MAP_NEW)
> + iomap->flags |= IOMAP_F_NEW;
> +
> + iomap->bdev = inode->i_sb->s_bdev;
> + iomap->dax_dev = EXT4_SB(inode->i_sb)->s_daxdev;
> + iomap->offset = (u64) map->m_lblk << blkbits;
> + iomap->length = (u64) map->m_len << blkbits;
> +
> + if (map->m_flags & (EXT4_MAP_MAPPED | EXT4_MAP_UNWRITTEN)) {
> + if (map->m_flags & EXT4_MAP_MAPPED)
> + iomap->type = IOMAP_MAPPED;
> + else if (map->m_flags & EXT4_MAP_UNWRITTEN)
> + iomap->type = IOMAP_UNWRITTEN;
> + iomap->addr = (u64) map->m_pblk << blkbits;
> + } else {
> + iomap->type = IOMAP_HOLE;
> + iomap->addr = IOMAP_NULL_ADDR;
> + }
> +}
> +
> static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> unsigned flags, struct iomap *iomap)
> {
> - struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
> unsigned int blkbits = inode->i_blkbits;
> unsigned long first_block, last_block;
> struct ext4_map_blocks map;
> @@ -3523,31 +3552,9 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> return ret;
> }
>
> - iomap->flags = 0;
> - if (ext4_inode_datasync_dirty(inode))
> - iomap->flags |= IOMAP_F_DIRTY;
> - iomap->bdev = inode->i_sb->s_bdev;
> - iomap->dax_dev = sbi->s_daxdev;
> - iomap->offset = (u64)first_block << blkbits;
> - iomap->length = (u64)map.m_len << blkbits;
> -
> - if (ret == 0) {
> - iomap->type = delalloc ? IOMAP_DELALLOC : IOMAP_HOLE;
> - iomap->addr = IOMAP_NULL_ADDR;
> - } else {
> - if (map.m_flags & EXT4_MAP_MAPPED) {
> - iomap->type = IOMAP_MAPPED;
> - } else if (map.m_flags & EXT4_MAP_UNWRITTEN) {
> - iomap->type = IOMAP_UNWRITTEN;
> - } else {
> - WARN_ON_ONCE(1);
> - return -EIO;
> - }
> - iomap->addr = (u64)map.m_pblk << blkbits;
> - }
> -
> - if (map.m_flags & EXT4_MAP_NEW)
> - iomap->flags |= IOMAP_F_NEW;
> + ext4_set_iomap(inode, iomap, &map, offset, length);
> + if (delalloc && iomap->type == IOMAP_HOLE)
> + iomap->type = IOMAP_DELALLOC;
>
> return 0;
> }
>
On 10/21/19 2:47 PM, Matthew Bobrowski wrote:
> This patch is effectively addressed what Dave Chinner had found and
> fixed within this commit: 8a23414ee345. Justification for needing this
> modification has been provided below:
Not sure if this is a valid commit id. I couldn't find it.
>
> When doing a direct IO that spans the current EOF, and there are
> written blocks beyond EOF that extend beyond the current write, the
> only metadata update that needs to be done is a file size extension.
>
> However, we don't mark such iomaps as IOMAP_F_DIRTY to indicate that
> there is IO completion metadata updates required, and hence we may
> fail to correctly sync file size extensions made in IO completion when
> O_DSYNC writes are being used and the hardware supports FUA.
>
> Hence when setting IOMAP_F_DIRTY, we need to also take into account
> whether the iomap spans the current EOF. If it does, then we need to
> mark it dirty so that IO completion will call generic_write_sync() to
> flush the inode size update to stable storage correctly.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
Otherwise, this patch looks good to me. You may add:
Reviewed-by: Ritesh Harjani <[email protected]>
> ---
> fs/ext4/inode.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 158eea9a1944..0dd29ae5cc8c 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3412,8 +3412,14 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> {
> u8 blkbits = inode->i_blkbits;
>
> + /*
> + * Writes that span EOF might trigger an I/O size update on completion,
> + * so consider them to be dirty for the purposes of O_DSYNC, even if
> + * there is no other metadata changes being made or are pending here.
> + */
> iomap->flags = 0;
> - if (ext4_inode_datasync_dirty(inode))
> + if (ext4_inode_datasync_dirty(inode) ||
> + offset + length > i_size_read(inode))
> iomap->flags |= IOMAP_F_DIRTY;
>
> if (map->m_flags & EXT4_MAP_NEW)
>
On 10/21/19 2:48 PM, Matthew Bobrowski wrote:
> In preparation for porting across the ext4 direct I/O path for reads
> and writes over to the iomap infrastructure, split up the IOMAP_WRITE
> chunk of code into a separate helper ext4_alloc_iomap(). This way,
> when we add the necessary bits for direct I/O, we don't end up with
> ext4_iomap_begin() becoming a behemoth twisty maze.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
Patch looks good to me. You may add:
Reviewed-by: Ritesh Harjani <[email protected]>
> ---
> fs/ext4/inode.c | 112 ++++++++++++++++++++++++++----------------------
> 1 file changed, 60 insertions(+), 52 deletions(-)
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 0dd29ae5cc8c..3dc92bd8a944 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -3442,6 +3442,62 @@ static void ext4_set_iomap(struct inode *inode, struct iomap *iomap,
> }
> }
>
> +static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
> + unsigned int flags)
> +{
> + handle_t *handle;
> + u8 blkbits = inode->i_blkbits;
> + int ret, dio_credits, retries = 0;
> +
> + /*
> + * Trim the mapping request to the maximum value that we can map at
> + * once for direct I/O.
> + */
> + if (map->m_len > DIO_MAX_BLOCKS)
> + map->m_len = DIO_MAX_BLOCKS;
> + dio_credits = ext4_chunk_trans_blocks(inode, map->m_len);
> +
> +retry:
> + /*
> + * Either we allocate blocks and then don't get an unwritten extent, so
> + * in that case we have reserved enough credits. Or, the blocks are
> + * already allocated and and unwritten. In that case, the extent
> + * conversion fits into the credits too.
> + */
> + handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS, dio_credits);
> + if (IS_ERR(handle))
> + return PTR_ERR(handle);
> +
> + ret = ext4_map_blocks(handle, inode, map, EXT4_GET_BLOCKS_CREATE_ZERO);
> + if (ret < 0)
> + goto journal_stop;
> +
> + /*
> + * If we have allocated blocks beyond EOF, we need to ensure that
> + * they're truncated if we crash before updating the inode size
> + * metadata within ext4_iomap_end(). For faults, we don't need to do
> + * that (and cannot due to the orphan list operations needing an
> + * inode_lock()). If we happen to instantiate blocks beyond EOF, it is
> + * because we race with a truncate operation, which already has added
> + * the inode onto the orphan list.
> + */
> + if (!(flags & IOMAP_FAULT) && map->m_lblk + map->m_len >
> + (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
> + int err;
> +
> + err = ext4_orphan_add(handle, inode);
> + if (err < 0)
> + ret = err;
> + }
> +
> +journal_stop:
> + ext4_journal_stop(handle);
> + if (ret == -ENOSPC && ext4_should_retry_alloc(inode->i_sb, &retries))
> + goto retry;
> +
> + return ret;
> +}
> +
> static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> unsigned flags, struct iomap *iomap)
> {
> @@ -3502,62 +3558,14 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
> }
> }
> } else if (flags & IOMAP_WRITE) {
> - int dio_credits;
> - handle_t *handle;
> - int retries = 0;
> -
> - /* Trim mapping request to maximum we can map at once for DIO */
> - if (map.m_len > DIO_MAX_BLOCKS)
> - map.m_len = DIO_MAX_BLOCKS;
> - dio_credits = ext4_chunk_trans_blocks(inode, map.m_len);
> -retry:
> - /*
> - * Either we allocate blocks and then we don't get unwritten
> - * extent so we have reserved enough credits, or the blocks
> - * are already allocated and unwritten and in that case
> - * extent conversion fits in the credits as well.
> - */
> - handle = ext4_journal_start(inode, EXT4_HT_MAP_BLOCKS,
> - dio_credits);
> - if (IS_ERR(handle))
> - return PTR_ERR(handle);
> -
> - ret = ext4_map_blocks(handle, inode, &map,
> - EXT4_GET_BLOCKS_CREATE_ZERO);
> - if (ret < 0) {
> - ext4_journal_stop(handle);
> - if (ret == -ENOSPC &&
> - ext4_should_retry_alloc(inode->i_sb, &retries))
> - goto retry;
> - return ret;
> - }
> -
> - /*
> - * If we added blocks beyond i_size, we need to make sure they
> - * will get truncated if we crash before updating i_size in
> - * ext4_iomap_end(). For faults we don't need to do that (and
> - * even cannot because for orphan list operations inode_lock is
> - * required) - if we happen to instantiate block beyond i_size,
> - * it is because we race with truncate which has already added
> - * the inode to the orphan list.
> - */
> - if (!(flags & IOMAP_FAULT) && first_block + map.m_len >
> - (i_size_read(inode) + (1 << blkbits) - 1) >> blkbits) {
> - int err;
> -
> - err = ext4_orphan_add(handle, inode);
> - if (err < 0) {
> - ext4_journal_stop(handle);
> - return err;
> - }
> - }
> - ext4_journal_stop(handle);
> + ret = ext4_iomap_alloc(inode, &map, flags);
> } else {
> ret = ext4_map_blocks(NULL, inode, &map, 0);
> - if (ret < 0)
> - return ret;
> }
>
> + if (ret < 0)
> + return ret;
> +
> ext4_set_iomap(inode, iomap, &map, offset, length);
> if (delalloc && iomap->type == IOMAP_HOLE)
> iomap->type = IOMAP_DELALLOC;
>
On 10/22/19 7:25 AM, Matthew Bobrowski wrote:
> On Mon, Oct 21, 2019 at 03:37:15PM +0200, Jan Kara wrote:
>> On Mon 21-10-19 20:18:09, Matthew Bobrowski wrote:
>> One nit below.
>>> + ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
>>> + map->m_lblk, end, &es);
>>> +
>>> + if (!es.es_len || es.es_lblk > end)
>>> + return false;
>>> +
>>> + if (es.es_lblk > map->m_lblk) {
>>> + map->m_len = es.es_lblk - map->m_lblk;
>>> + return false;
>>> + }
>>> +
>>> + if (es.es_lblk <= map->m_lblk)
>>
>> This condition must be always true AFAICT.
>
> That would make sense. I will remove this condition in v6.
>
>>> + offset = map->m_lblk - es.es_lblk;
>>> +
>>> + map->m_lblk = es.es_lblk + offset;
And so, this above line will also be redundant.
>>> + map->m_len = es.es_len - offset;
>>> +
>>> + return true;
>>> +}
>
> --<M>--
>
On 10/21/19 2:48 PM, Matthew Bobrowski wrote:
> This patch introduces a new direct I/O read path which makes use of
> the iomap infrastructure.
>
> The new function ext4_do_read_iter() is responsible for calling into
> the iomap infrastructure via iomap_dio_rw(). If the read operation
> performed on the inode is not supported, which is checked via
> ext4_dio_supported(), then we simply fallback and complete the I/O
> using buffered I/O.
>
> Existing direct I/O read code path has been removed, as it is no
> longer required.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
Patch looks good to me. You may add:
Reviewed-by: Ritesh Harjani <[email protected]>
> ---
> fs/ext4/file.c | 46 +++++++++++++++++++++++++++++++++++++++++++++-
> fs/ext4/inode.c | 32 +-------------------------------
> 2 files changed, 46 insertions(+), 32 deletions(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index ab75aee3e687..6ea7e00e0204 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -34,6 +34,46 @@
> #include "xattr.h"
> #include "acl.h"
>
> +static bool ext4_dio_supported(struct inode *inode)
> +{
> + if (IS_ENABLED(CONFIG_FS_ENCRYPTION) && IS_ENCRYPTED(inode))
> + return false;
> + if (fsverity_active(inode))
> + return false;
> + if (ext4_should_journal_data(inode))
> + return false;
> + if (ext4_has_inline_data(inode))
> + return false;
> + return true;
> +}
> +
> +static int ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> + ssize_t ret;
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + inode_lock_shared(inode);
> + if (!ext4_dio_supported(inode)) {
> + inode_unlock_shared(inode);
> + /*
> + * Fallback to buffered I/O if the operation being performed on
> + * the inode is not supported by direct I/O. The IOCB_DIRECT
> + * flag needs to be cleared here in order to ensure that the
> + * direct I/O path within generic_file_read_iter() is not
> + * taken.
> + */
> + iocb->ki_flags &= ~IOCB_DIRECT;
> + return generic_file_read_iter(iocb, to);
> + }
> +
> + ret = iomap_dio_rw(iocb, to, &ext4_iomap_ops, NULL,
> + is_sync_kiocb(iocb));
> + inode_unlock_shared(inode);
> +
> + file_accessed(iocb->ki_filp);
> + return ret;
> +}
> +
> #ifdef CONFIG_FS_DAX
> static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
> {
> @@ -64,7 +104,9 @@ static ssize_t ext4_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
>
> static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> {
> - if (unlikely(ext4_forced_shutdown(EXT4_SB(file_inode(iocb->ki_filp)->i_sb))))
> + struct inode *inode = file_inode(iocb->ki_filp);
> +
> + if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
> return -EIO;
>
> if (!iov_iter_count(to))
> @@ -74,6 +116,8 @@ static ssize_t ext4_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> if (IS_DAX(file_inode(iocb->ki_filp)))
> return ext4_dax_read_iter(iocb, to);
> #endif
> + if (iocb->ki_flags & IOCB_DIRECT)
> + return ext4_dio_read_iter(iocb, to);
> return generic_file_read_iter(iocb, to);
> }
>
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index ebeedbf3900f..03a9e2b85e46 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -863,9 +863,6 @@ int ext4_dio_get_block(struct inode *inode, sector_t iblock,
> {
> /* We don't expect handle for direct IO */
> WARN_ON_ONCE(ext4_journal_current_handle());
> -
> - if (!create)
> - return _ext4_get_block(inode, iblock, bh, 0);
> return ext4_get_block_trans(inode, iblock, bh, EXT4_GET_BLOCKS_CREATE);
> }
>
> @@ -3865,30 +3862,6 @@ static ssize_t ext4_direct_IO_write(struct kiocb *iocb, struct iov_iter *iter)
> return ret;
> }
>
> -static ssize_t ext4_direct_IO_read(struct kiocb *iocb, struct iov_iter *iter)
> -{
> - struct address_space *mapping = iocb->ki_filp->f_mapping;
> - struct inode *inode = mapping->host;
> - size_t count = iov_iter_count(iter);
> - ssize_t ret;
> -
> - /*
> - * Shared inode_lock is enough for us - it protects against concurrent
> - * writes & truncates and since we take care of writing back page cache,
> - * we are protected against page writeback as well.
> - */
> - inode_lock_shared(inode);
> - ret = filemap_write_and_wait_range(mapping, iocb->ki_pos,
> - iocb->ki_pos + count - 1);
> - if (ret)
> - goto out_unlock;
> - ret = __blockdev_direct_IO(iocb, inode, inode->i_sb->s_bdev,
> - iter, ext4_dio_get_block, NULL, NULL, 0);
> -out_unlock:
> - inode_unlock_shared(inode);
> - return ret;
> -}
> -
> static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> {
> struct file *file = iocb->ki_filp;
> @@ -3915,10 +3888,7 @@ static ssize_t ext4_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
> return 0;
>
> trace_ext4_direct_IO_enter(inode, offset, count, iov_iter_rw(iter));
> - if (iov_iter_rw(iter) == READ)
> - ret = ext4_direct_IO_read(iocb, iter);
> - else
> - ret = ext4_direct_IO_write(iocb, iter);
> + ret = ext4_direct_IO_write(iocb, iter);
> trace_ext4_direct_IO_exit(inode, offset, count, iov_iter_rw(iter), ret);
> return ret;
> }
>
On 10/21/19 2:48 PM, Matthew Bobrowski wrote:
> This patch updates the lock pattern in ext4_dio_read_iter() to only
> perform the trylock in IOCB_NOWAIT cases.
>
> Signed-off-by: Matthew Bobrowski <[email protected]>
This shall need a rebase (as mentioned in patch-1 discussion).
It will be good to mention that the locking condition
is kept similar to 942491c9e6d6 ("xfs: fix AIM7 regression").
Also, I think it may also need a fixes tag.
It seems unconditional inode_lock_shared was added by below commit.
Fixes: 16c54688592c ("ext4: Allow parallel DIO reads")
Otherwise, this patch looks good to me. You may add:
Reviewed-by: Ritesh Harjani <[email protected]>
> ---
> fs/ext4/file.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/fs/ext4/file.c b/fs/ext4/file.c
> index 6ea7e00e0204..8420686b90f5 100644
> --- a/fs/ext4/file.c
> +++ b/fs/ext4/file.c
> @@ -52,7 +52,13 @@ static int ext4_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
> ssize_t ret;
> struct inode *inode = file_inode(iocb->ki_filp);
>
> - inode_lock_shared(inode);
> + if (iocb->ki_flags & IOCB_NOWAIT) {
> + if (!inode_trylock_shared(inode))
> + return -EAGAIN;
> + } else {
> + inode_lock_shared(inode);
> + }
> +
> if (!ext4_dio_supported(inode)) {
> inode_unlock_shared(inode);
> /*
>
On Wed 23-10-19 13:35:19, Matthew Bobrowski wrote:
> On Mon, Oct 21, 2019 at 09:43:30PM +0200, Jan Kara wrote:
> > On Mon 21-10-19 09:31:12, Theodore Y. Ts'o wrote:
> > > Hi Matthew, thanks for your work on this patch series!
> > >
> > > I applied it against 4c3, and ran a quick test run on it, and found
> > > the following locking problem. To reproduce:
> > >
> > > kvm-xfstests -c nojournal generic/113
> > >
> > > generic/113 [09:27:19][ 5.841937] run fstests generic/113 at 2019-10-21 09:27:19
> > > [ 7.959477]
> > > [ 7.959798] ============================================
> > > [ 7.960518] WARNING: possible recursive locking detected
> > > [ 7.961225] 5.4.0-rc3-xfstests-00012-g7fe6ea084e48 #1238 Not tainted
> > > [ 7.961991] --------------------------------------------
> > > [ 7.962569] aio-stress/1516 is trying to acquire lock:
> > > [ 7.963129] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: __generic_file_fsync+0x3e/0xb0
> > > [ 7.964109]
> > > [ 7.964109] but task is already holding lock:
> > > [ 7.964740] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: ext4_dio_write_iter+0x15b/0x430
> >
> > This is going to be a tricky one. With iomap, the inode locking is handled
> > by the filesystem while calling generic_write_sync() is done by
> > iomap_dio_rw(). I would really prefer to avoid tweaking iomap_dio_rw() not
> > to call generic_write_sync(). So we need to remove inode_lock from
> > __generic_file_fsync() (used from ext4_sync_file()). This locking is mostly
> > for legacy purposes and we don't need this in ext4 AFAICT - but removing
> > the lock from __generic_file_fsync() would mean auditing all legacy
> > filesystems that use this to make sure flushing inode & its metadata buffer
> > list while it is possibly changing cannot result in something unexpected. I
> > don't want to clutter this series with it so we are left with
> > reimplementing __generic_file_fsync() inside ext4 without inode_lock. Not
> > too bad but not great either. Thoughts?
>
> So, I just looked at this on my lunch break and I think the simplest
> approach would be to just transfer the necessary chunks of code from
> within __generic_file_fsync() into ext4_sync_file() for !journal cases,
> minus the inode lock, and minus calling into __generic_file_fsync(). I
> don't forsee this causing any issues, but feel free to correct me if I'm
> wrong.
Yes, that's what I'd suggest as well. In fact when doing that you can share
file_write_and_wait_range() call with the one already in ext4_sync_file()
use for other cases. Similarly with file_check_and_advance_wb_err(). So the
copied bit will be really only:
ret = sync_mapping_buffers(inode->i_mapping);
if (!(inode->i_state & I_DIRTY_ALL))
goto out;
if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
goto out;
err = sync_inode_metadata(inode, 1);
if (ret == 0)
ret = err;
> If this is deemed to be OK, then I will go ahead and include this as a
> separate patch in my series.
Yes, please.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, Oct 23, 2019 at 12:01:53PM +0200, Jan Kara wrote:
> On Wed 23-10-19 13:35:19, Matthew Bobrowski wrote:
> > On Mon, Oct 21, 2019 at 09:43:30PM +0200, Jan Kara wrote:
> > > On Mon 21-10-19 09:31:12, Theodore Y. Ts'o wrote:
> > > > Hi Matthew, thanks for your work on this patch series!
> > > >
> > > > I applied it against 4c3, and ran a quick test run on it, and found
> > > > the following locking problem. To reproduce:
> > > >
> > > > kvm-xfstests -c nojournal generic/113
> > > >
> > > > generic/113 [09:27:19][ 5.841937] run fstests generic/113 at 2019-10-21 09:27:19
> > > > [ 7.959477]
> > > > [ 7.959798] ============================================
> > > > [ 7.960518] WARNING: possible recursive locking detected
> > > > [ 7.961225] 5.4.0-rc3-xfstests-00012-g7fe6ea084e48 #1238 Not tainted
> > > > [ 7.961991] --------------------------------------------
> > > > [ 7.962569] aio-stress/1516 is trying to acquire lock:
> > > > [ 7.963129] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: __generic_file_fsync+0x3e/0xb0
> > > > [ 7.964109]
> > > > [ 7.964109] but task is already holding lock:
> > > > [ 7.964740] ffff9fd4791148c8 (&sb->s_type->i_mutex_key#12){++++}, at: ext4_dio_write_iter+0x15b/0x430
> > >
> > > This is going to be a tricky one. With iomap, the inode locking is handled
> > > by the filesystem while calling generic_write_sync() is done by
> > > iomap_dio_rw(). I would really prefer to avoid tweaking iomap_dio_rw() not
> > > to call generic_write_sync(). So we need to remove inode_lock from
> > > __generic_file_fsync() (used from ext4_sync_file()). This locking is mostly
> > > for legacy purposes and we don't need this in ext4 AFAICT - but removing
> > > the lock from __generic_file_fsync() would mean auditing all legacy
> > > filesystems that use this to make sure flushing inode & its metadata buffer
> > > list while it is possibly changing cannot result in something unexpected. I
> > > don't want to clutter this series with it so we are left with
> > > reimplementing __generic_file_fsync() inside ext4 without inode_lock. Not
> > > too bad but not great either. Thoughts?
> >
> > So, I just looked at this on my lunch break and I think the simplest
> > approach would be to just transfer the necessary chunks of code from
> > within __generic_file_fsync() into ext4_sync_file() for !journal cases,
> > minus the inode lock, and minus calling into __generic_file_fsync(). I
> > don't forsee this causing any issues, but feel free to correct me if I'm
> > wrong.
>
> Yes, that's what I'd suggest as well. In fact when doing that you can share
> file_write_and_wait_range() call with the one already in ext4_sync_file()
> use for other cases. Similarly with file_check_and_advance_wb_err(). So the
> copied bit will be really only:
>
> ret = sync_mapping_buffers(inode->i_mapping);
> if (!(inode->i_state & I_DIRTY_ALL))
> goto out;
> if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> goto out;
>
> err = sync_inode_metadata(inode, 1);
> if (ret == 0)
> ret = err;
>
> > If this is deemed to be OK, then I will go ahead and include this as a
> > separate patch in my series.
>
> Yes, please.
Heh!
I just finished writing and testing it and exactly what I've done
(attached). Anyway, I will include it in v6. :)
--<M>--
On Wed, Oct 23, 2019 at 12:01:51PM +0530, Ritesh Harjani wrote:
>
>
> On 10/21/19 2:47 PM, Matthew Bobrowski wrote:
> > Separate the iomap field population chunk of code that is currently
> > within ext4_iomap_begin() into a new helper called
> > ext4_set_iomap(). The intent of this function is self explanatory,
> > however the rationale behind doing so is to also reduce the overall
> > clutter that we currently have within the ext4_iomap_begin() callback.
> >
> > Signed-off-by: Matthew Bobrowski <[email protected]>
>
> Could you please re-arrange patch sequence in this fashion.
>
> 1. Patch-11 (re-ordering of unwritten flags)
> 2. Patch-8 (trylock in IOCB_NOWAIT cases)
> 3. Patch-2 (should explain offset & len in this patch)
> 4. Patch-1 (this patch).
No objections to this. Just needing to do a little shuffle here and there.
> This is so that some of these are anyway fixes or refactoring
> which can be picked up easily, either for backporting or
> sometimes this helps in getting some of the patches in, if the patch
> series gets bigger.
> Also others (like me) can also pick some of these changes then to meet
> their dependency. :)
Sure, thanks for educating me and making me aware of this.
> This patch looks good to me. You may add:
>
> Reviewed-by: Ritesh Harjani <[email protected]>
Thanks Ritesh!
--<M>--
On Wed, Oct 23, 2019 at 12:05:55PM +0530, Ritesh Harjani wrote:
> On 10/21/19 2:47 PM, Matthew Bobrowski wrote:
> > This patch is effectively addressed what Dave Chinner had found and
> > fixed within this commit: 8a23414ee345. Justification for needing this
> > modification has been provided below:
> Not sure if this is a valid commit id. I couldn't find it.
Ah, oops! I plucked that from somewhere, but where (some thread)? Hm, anyway,
this is what I was referring to:
https://git.kernel.org/pub/scm/fs/xfs/xfs-linux.git/commit/?h=xfs-5.5-merge&id=7684e2c4384d5d1f884b01ab8bff2369e4db0bff
This is queued for 5.5, so I will add this commit hash to my changelog in v6.
--<M>--
On Wed, Oct 23, 2019 at 12:09:24PM +0530, Ritesh Harjani wrote:
> On 10/22/19 7:25 AM, Matthew Bobrowski wrote:
> > On Mon, Oct 21, 2019 at 03:37:15PM +0200, Jan Kara wrote:
> > > On Mon 21-10-19 20:18:09, Matthew Bobrowski wrote:
> > > One nit below.
> > > > + ext4_es_find_extent_range(inode, &ext4_es_is_delayed,
> > > > + map->m_lblk, end, &es);
> > > > +
> > > > + if (!es.es_len || es.es_lblk > end)
> > > > + return false;
> > > > +
> > > > + if (es.es_lblk > map->m_lblk) {
> > > > + map->m_len = es.es_lblk - map->m_lblk;
> > > > + return false;
> > > > + }
> > > > +
> > > > + if (es.es_lblk <= map->m_lblk)
> > > > + offset = map->m_lblk - es.es_lblk;
> > > > + map->m_lblk = es.es_lblk + offset;
>
> And so, this above line will also be redundant.
And, you're absolutely right. Nice catch. Who knew basic math could go such a
long way? :P
Thanks Ritesh!
--<M>--
On Mon, Oct 21, 2019 at 08:18:18PM +1100, [email protected] wrote:
> This patch has already been posted through by Jan, but I've just
> included it within this patch series to mark that it's a clear
> dependency.
You probably want to resend the next iteration against the
iomap-for-next branch, which includes this plus an iomap_begin
API change. Darrick plans to have it as a stable branch, so that it can
also be pulled into the ext4 tree.
Maybe something like the version below which declutter the thing a bit?
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 5508baa11bb6..ab601c6779d2 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -80,6 +80,41 @@ static int ext4_sync_parent(struct inode *inode)
return ret;
}
+static int ext4_fsync_nojournal(struct inode *inode, bool datasync,
+ bool *needs_barrier)
+{
+ int ret, err;
+
+ ret = sync_mapping_buffers(inode->i_mapping);
+ if (!(inode->i_state & I_DIRTY_ALL))
+ return ret;
+ if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
+ return ret;
+
+ err = sync_inode_metadata(inode, 1);
+ if (!ret)
+ ret = err;
+
+ if (!ret)
+ ret = ext4_sync_parent(inode);
+ if (test_opt(inode->i_sb, BARRIER))
+ *needs_barrier = true;
+ return ret;
+}
+
+static int ext4_fsync_journal(struct inode *inode, bool datasync,
+ bool *needs_barrier)
+{
+ struct ext4_inode_info *ei = EXT4_I(inode);
+ journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ tid_t commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
+
+ if (journal->j_flags & JBD2_BARRIER &&
+ !jbd2_trans_will_send_data_barrier(journal, commit_tid))
+ *needs_barrier = true;
+ return jbd2_complete_transaction(journal, commit_tid);
+}
+
/*
* akpm: A new design for ext4_sync_file().
*
@@ -95,13 +130,11 @@ static int ext4_sync_parent(struct inode *inode)
int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
{
struct inode *inode = file->f_mapping->host;
- struct ext4_inode_info *ei = EXT4_I(inode);
- journal_t *journal = EXT4_SB(inode->i_sb)->s_journal;
+ struct ext4_sb_info *sbi = EXT4_SB(inode->i_sb);
int ret = 0, err;
- tid_t commit_tid;
bool needs_barrier = false;
- if (unlikely(ext4_forced_shutdown(EXT4_SB(inode->i_sb))))
+ if (unlikely(ext4_forced_shutdown(sbi)))
return -EIO;
J_ASSERT(ext4_journal_current_handle() == NULL);
@@ -116,18 +149,10 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
goto out;
}
- if (!journal) {
- ret = __generic_file_fsync(file, start, end, datasync);
- if (!ret)
- ret = ext4_sync_parent(inode);
- if (test_opt(inode->i_sb, BARRIER))
- goto issue_flush;
- goto out;
- }
-
ret = file_write_and_wait_range(file, start, end);
if (ret)
return ret;
+
/*
* data=writeback,ordered:
* The caller's filemap_fdatawrite()/wait will sync the data.
@@ -142,18 +167,14 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
* (they were dirtied by commit). But that's OK - the blocks are
* safe in-journal, which is all fsync() needs to ensure.
*/
- if (ext4_should_journal_data(inode)) {
+ if (!sbi->s_journal)
+ ret = ext4_fsync_nojournal(inode, datasync, &needs_barrier);
+ else if (ext4_should_journal_data(inode))
ret = ext4_force_commit(inode->i_sb);
- goto out;
- }
+ else
+ ret = ext4_fsync_journal(inode, datasync, &needs_barrier);
- commit_tid = datasync ? ei->i_datasync_tid : ei->i_sync_tid;
- if (journal->j_flags & JBD2_BARRIER &&
- !jbd2_trans_will_send_data_barrier(journal, commit_tid))
- needs_barrier = true;
- ret = jbd2_complete_transaction(journal, commit_tid);
if (needs_barrier) {
- issue_flush:
err = blkdev_issue_flush(inode->i_sb->s_bdev, GFP_KERNEL, NULL);
if (!ret)
ret = err;
On Wed, Oct 23, 2019 at 06:58:57PM -0700, Christoph Hellwig wrote:
> Maybe something like the version below which declutter the thing a bit?
Yeah, I like this suggestion Christoph. Super awesome.
Thank you!
--<M>--
On Wed, Oct 23, 2019 at 06:41:53PM -0700, Christoph Hellwig wrote:
> On Mon, Oct 21, 2019 at 08:18:18PM +1100, [email protected] wrote:
> > This patch has already been posted through by Jan, but I've just
> > included it within this patch series to mark that it's a clear
> > dependency.
>
> You probably want to resend the next iteration against the
> iomap-for-next branch, which includes this plus an iomap_begin
> API change. Darrick plans to have it as a stable branch, so that it can
> also be pulled into the ext4 tree.
OK, noted! Checking out and rebasing. :)
--<M>--