Hello All,
Please find the series which rewrites ext2 direct-io path to use modern
iomap interface.
RFCv4 -> PATCHv5:
=================
1. Added trace_iomap_dio_rw_begin tracepoint in __iomap_dio_rw()
2. Added Reviewed-by tags from Christoph
RFCv3 -> RFCV4:
===============
1. Renamed __generic_file_fsync_nolock() from libfs to generic_buffer_fsync() in
fs/buffer.c
(Review comment from Christoph)
2. Fixed s/EVENTD/EVENTFD/ in TRACE_IOCB_STRINGS
3. Fixed few data types for parameters in ext2 trace patch (size_t && ssize_t)
4. Killed this patch "Minor refactor of iomap_dio_rw"
5. Changed iomap tracepoint patch and fixed the data types (size_t && ssize_t)
(addressed review comments from Christoph)
RFCv2 -> RFCv3:
===============
1. Addressed minor review comments related to extern, parameter naming in
function declaration, removing not required braces and shorting overly long
lines.
2. Added Reviewed-by from various reviewers.
3. Fixed a warning & couple of compilation errors in Patch-7 (ext2 trace points)
related to CFLAGS_trace & second related to unable to find function
definition for iov_iter_count(). (requires uio.h file)
CFLAGS_trace is required in Makefile so that it can find trace.h file from
tracepoint infrastructure.
4. Changed naming of IOCB_STRINGS TO TRACE_IOCB_STRINGS.
5. Shortened naming of tracepoint events for ext2 dio.
6. Added iomap DIO tracepoint events.
7. Disha tested this series internally against Power with "auto" group for 4k
and 64k blocksize configuration. Added her "Tested-by" tag in all DIO
related patches. No new failures were reported.
Thanks everyone for the review and test. The series is looking good to me now.
It has been tested on x86 and Power with different configurations.
Please let me know if anything else is required on this.
v2: https://lore.kernel.org/all/[email protected]/
Ritesh Harjani (IBM) (9):
ext2/dax: Fix ext2_setsize when len is page aligned
fs/buffer.c: Add generic_buffer_fsync implementation
ext4: Use generic_buffer_fsync() implementation
ext2: Use generic_buffer_fsync() implementation
ext2: Move direct-io to use iomap
fs.h: Add TRACE_IOCB_STRINGS for use in trace points
ext2: Add direct-io trace points
iomap: Remove IOMAP_DIO_NOSYNC unused dio flag
iomap: Add DIO tracepoints
fs/buffer.c | 43 ++++++++++++
fs/ext2/Makefile | 5 +-
fs/ext2/ext2.h | 1 +
fs/ext2/file.c | 128 +++++++++++++++++++++++++++++++++++-
fs/ext2/inode.c | 58 +++++++++-------
fs/ext2/trace.c | 6 ++
fs/ext2/trace.h | 94 ++++++++++++++++++++++++++
fs/ext4/fsync.c | 32 ++++-----
fs/iomap/direct-io.c | 9 ++-
fs/iomap/trace.c | 1 +
fs/iomap/trace.h | 78 ++++++++++++++++++++++
include/linux/buffer_head.h | 2 +
include/linux/fs.h | 14 ++++
include/linux/iomap.h | 6 --
14 files changed, 429 insertions(+), 48 deletions(-)
create mode 100644 fs/ext2/trace.c
create mode 100644 fs/ext2/trace.h
--
2.39.2
Some of the higher layers like iomap takes inode_lock() when calling
generic_write_sync().
Also writeback already happens from other paths without inode lock,
so it's difficult to say that we really need sync_mapping_buffers() to
take any inode locking here. Having said that, let's add
generic_buffer_fsync() implementation in buffer.c with no
inode_lock/unlock() for now so that filesystems like ext2 and
ext4's nojournal mode can use it.
Ext4 when got converted to iomap for direct-io already copied it's own
variant of __generic_file_fsync() without lock. Hence let's add a helper
API and use it both in ext2 and ext4.
Later we can review other filesystems as well to see if we can make
generic_buffer_fsync() which does not take any inode_lock() as the
default path.
Tested-by: Disha Goel <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/buffer.c | 43 +++++++++++++++++++++++++++++++++++++
include/linux/buffer_head.h | 2 ++
2 files changed, 45 insertions(+)
diff --git a/fs/buffer.c b/fs/buffer.c
index 9e1e2add541e..df98f1966a71 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -593,6 +593,49 @@ int sync_mapping_buffers(struct address_space *mapping)
}
EXPORT_SYMBOL(sync_mapping_buffers);
+/**
+ * generic_buffer_fsync - generic buffer fsync implementation
+ * for simple filesystems with no inode lock
+ *
+ * @file: file to synchronize
+ * @start: start offset in bytes
+ * @end: end offset in bytes (inclusive)
+ * @datasync: only synchronize essential metadata if true
+ *
+ * This is a generic implementation of the fsync method for simple
+ * filesystems which track all non-inode metadata in the buffers list
+ * hanging off the address_space structure.
+ */
+int generic_buffer_fsync(struct file *file, loff_t start, loff_t end,
+ bool datasync)
+{
+ struct inode *inode = file->f_mapping->host;
+ int err;
+ int ret;
+
+ err = file_write_and_wait_range(file, start, end);
+ if (err)
+ return err;
+
+ ret = sync_mapping_buffers(inode->i_mapping);
+ if (!(inode->i_state & I_DIRTY_ALL))
+ goto out;
+ if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
+ goto out;
+
+ err = sync_inode_metadata(inode, 1);
+ if (ret == 0)
+ ret = err;
+
+out:
+ /* check and advance again to catch errors after syncing out buffers */
+ err = file_check_and_advance_wb_err(file);
+ if (ret == 0)
+ ret = err;
+ return ret;
+}
+EXPORT_SYMBOL(generic_buffer_fsync);
+
/*
* Called when we've recently written block `bblock', and it is known that
* `bblock' was for a buffer_boundary() buffer. This means that the block at
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 8f14dca5fed7..3170d0792d52 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -211,6 +211,8 @@ int inode_has_buffers(struct inode *);
void invalidate_inode_buffers(struct inode *);
int remove_inode_buffers(struct inode *inode);
int sync_mapping_buffers(struct address_space *mapping);
+int generic_buffer_fsync(struct file *file, loff_t start, loff_t end,
+ bool datasync);
void clean_bdev_aliases(struct block_device *bdev, sector_t block,
sector_t len);
static inline void clean_bdev_bh_alias(struct buffer_head *bh)
--
2.39.2
PAGE_ALIGN(x) macro gives the next highest value which is multiple of
pagesize. But if x is already page aligned then it simply returns x.
So, if x passed is 0 in dax_zero_range() function, that means the
length gets passed as 0 to ->iomap_begin().
In ext2 it then calls ext2_get_blocks -> max_blocks as 0 and hits bug_on
here in ext2_get_blocks().
BUG_ON(maxblocks == 0);
Instead we should be calling dax_truncate_page() here which takes
care of it. i.e. it only calls dax_zero_range if the offset is not
page/block aligned.
This can be easily triggered with following on fsdax mounted pmem
device.
dd if=/dev/zero of=file count=1 bs=512
truncate -s 0 file
[79.525838] EXT2-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
[79.529376] ext2 filesystem being mounted at /mnt1/test supports timestamps until 2038 (0x7fffffff)
[93.793207] ------------[ cut here ]------------
[93.795102] kernel BUG at fs/ext2/inode.c:637!
[93.796904] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[93.798659] CPU: 0 PID: 1192 Comm: truncate Not tainted 6.3.0-rc2-xfstests-00056-g131086faa369 #139
[93.806459] RIP: 0010:ext2_get_blocks.constprop.0+0x524/0x610
<...>
[93.835298] Call Trace:
[93.836253] <TASK>
[93.837103] ? lock_acquire+0xf8/0x110
[93.838479] ? d_lookup+0x69/0xd0
[93.839779] ext2_iomap_begin+0xa7/0x1c0
[93.841154] iomap_iter+0xc7/0x150
[93.842425] dax_zero_range+0x6e/0xa0
[93.843813] ext2_setsize+0x176/0x1b0
[93.845164] ext2_setattr+0x151/0x200
[93.846467] notify_change+0x341/0x4e0
[93.847805] ? lock_acquire+0xf8/0x110
[93.849143] ? do_truncate+0x74/0xe0
[93.850452] ? do_truncate+0x84/0xe0
[93.851739] do_truncate+0x84/0xe0
[93.852974] do_sys_ftruncate+0x2b4/0x2f0
[93.854404] do_syscall_64+0x3f/0x90
[93.855789] entry_SYSCALL_64_after_hwframe+0x72/0xdc
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext2/inode.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 26f135e7ffce..dc76147e7b07 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1259,9 +1259,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
inode_dio_wait(inode);
if (IS_DAX(inode))
- error = dax_zero_range(inode, newsize,
- PAGE_ALIGN(newsize) - newsize, NULL,
- &ext2_iomap_ops);
+ error = dax_truncate_page(inode, newsize, NULL,
+ &ext2_iomap_ops);
else
error = block_truncate_page(inode->i_mapping,
newsize, ext2_get_block);
--
2.39.2
ext4 when got converted to iomap for dio, it copied __generic_file_fsync
implementation to avoid taking inode_lock in order to avoid any deadlock
(since iomap takes an inode_lock while calling generic_write_sync()).
The previous patch already added generic_buffer_fsync() which does not
take any inode_lock(). Hence kill the redundant code and use
generic_buffer_fsync() function instead.
Tested-by: Disha Goel <[email protected]>
Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext4/fsync.c | 32 ++++++++++++++++----------------
1 file changed, 16 insertions(+), 16 deletions(-)
diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 027a7d7037a0..4f2af43f8b0f 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -28,6 +28,7 @@
#include <linux/sched.h>
#include <linux/writeback.h>
#include <linux/blkdev.h>
+#include <linux/buffer_head.h>
#include "ext4.h"
#include "ext4_jbd2.h"
@@ -78,21 +79,13 @@ static int ext4_sync_parent(struct inode *inode)
return ret;
}
-static int ext4_fsync_nojournal(struct inode *inode, bool datasync,
- bool *needs_barrier)
+static int ext4_fsync_nojournal(struct file *file, loff_t start, loff_t end,
+ int datasync, bool *needs_barrier)
{
- int ret, err;
-
- ret = sync_mapping_buffers(inode->i_mapping);
- if (!(inode->i_state & I_DIRTY_ALL))
- return ret;
- if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- return ret;
-
- err = sync_inode_metadata(inode, 1);
- if (!ret)
- ret = err;
+ struct inode *inode = file->f_inode;
+ int ret;
+ ret = generic_buffer_fsync(file, start, end, datasync);
if (!ret)
ret = ext4_sync_parent(inode);
if (test_opt(inode->i_sb, BARRIER))
@@ -148,6 +141,14 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
goto out;
}
+ if (!sbi->s_journal) {
+ ret = ext4_fsync_nojournal(file, start, end, datasync,
+ &needs_barrier);
+ if (needs_barrier)
+ goto issue_flush;
+ goto out;
+ }
+
ret = file_write_and_wait_range(file, start, end);
if (ret)
goto out;
@@ -166,13 +167,12 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
* (they were dirtied by commit). But that's OK - the blocks are
* safe in-journal, which is all fsync() needs to ensure.
*/
- if (!sbi->s_journal)
- ret = ext4_fsync_nojournal(inode, datasync, &needs_barrier);
- else if (ext4_should_journal_data(inode))
+ if (ext4_should_journal_data(inode))
ret = ext4_force_commit(inode->i_sb);
else
ret = ext4_fsync_journal(inode, datasync, &needs_barrier);
+issue_flush:
if (needs_barrier) {
err = blkdev_issue_flush(inode->i_sb->s_bdev);
if (!ret)
--
2.39.2
Add TRACE_IOCB_STRINGS macro which can be used in the trace point patch to
print different flag values with meaningful string output.
Tested-by: Disha Goel <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
include/linux/fs.h | 14 ++++++++++++++
1 file changed, 14 insertions(+)
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c85916e9f7db..bdc1f7ed2aba 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -340,6 +340,20 @@ enum rw_hint {
/* can use bio alloc cache */
#define IOCB_ALLOC_CACHE (1 << 21)
+/* for use in trace events */
+#define TRACE_IOCB_STRINGS \
+ { IOCB_HIPRI, "HIPRI" }, \
+ { IOCB_DSYNC, "DSYNC" }, \
+ { IOCB_SYNC, "SYNC" }, \
+ { IOCB_NOWAIT, "NOWAIT" }, \
+ { IOCB_APPEND, "APPEND" }, \
+ { IOCB_EVENTFD, "EVENTFD"}, \
+ { IOCB_DIRECT, "DIRECT" }, \
+ { IOCB_WRITE, "WRITE" }, \
+ { IOCB_WAITQ, "WAITQ" }, \
+ { IOCB_NOIO, "NOIO" }, \
+ { IOCB_ALLOC_CACHE, "ALLOC_CACHE" }
+
struct kiocb {
struct file *ki_filp;
loff_t ki_pos;
--
2.39.2
Next patch converts ext2 to use iomap interface for DIO.
iomap layer can call generic_write_sync() -> ext2_fsync() from
iomap_dio_complete while still holding the inode_lock().
Now writeback from other paths doesn't need inode_lock().
It seems there is also no need of an inode_lock() for
sync_mapping_buffers(). It uses it's own mapping->private_lock
for it's buffer list handling.
Hence this patch is in preparation to move ext2 to iomap.
This uses generic_buffer_fsync() which does not take any inode_lock()
in ext2_fsync().
Tested-by: Disha Goel <[email protected]>
Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext2/file.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 6b4bebe982ca..7603427fb38f 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -25,6 +25,7 @@
#include <linux/quotaops.h>
#include <linux/iomap.h>
#include <linux/uio.h>
+#include <linux/buffer_head.h>
#include "ext2.h"
#include "xattr.h"
#include "acl.h"
@@ -153,7 +154,9 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
int ret;
struct super_block *sb = file->f_mapping->host->i_sb;
- ret = generic_file_fsync(file, start, end, datasync);
+ ret = generic_buffer_fsync(file, start, end, datasync);
+ if (!ret)
+ ret = blkdev_issue_flush(sb->s_bdev);
if (ret == -EIO)
/* We don't really know where the IO error happened... */
ext2_error(sb, __func__,
--
2.39.2
IOMAP_DIO_NOSYNC earlier was added for use in btrfs. But it seems for
aio dsync writes this is not useful anyway. For aio dsync case, we
we queue the request and return -EIOCBQUEUED. Now, since IOMAP_DIO_NOSYNC
doesn't let iomap_dio_complete() to call generic_write_sync(),
hence we may lose the sync write.
Hence kill this flag as it is not in use by any FS now.
Tested-by: Disha Goel <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Darrick J. Wong <[email protected]>
Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/iomap/direct-io.c | 2 +-
include/linux/iomap.h | 6 ------
2 files changed, 1 insertion(+), 7 deletions(-)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f771001574d0..36ab1152dbea 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -541,7 +541,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
}
/* for data sync or sync, we need sync completion processing */
- if (iocb_is_dsync(iocb) && !(dio_flags & IOMAP_DIO_NOSYNC)) {
+ if (iocb_is_dsync(iocb)) {
dio->flags |= IOMAP_DIO_NEED_SYNC;
/*
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0f8123504e5e..e2b836c2e119 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -377,12 +377,6 @@ struct iomap_dio_ops {
*/
#define IOMAP_DIO_PARTIAL (1 << 2)
-/*
- * The caller will sync the write if needed; do not sync it within
- * iomap_dio_rw. Overrides IOMAP_DIO_FORCE_WAIT.
- */
-#define IOMAP_DIO_NOSYNC (1 << 3)
-
ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
unsigned int dio_flags, void *private, size_t done_before);
--
2.39.2
This patch converts ext2 direct-io path to iomap interface.
- This also takes care of DIO_SKIP_HOLES part in which we return -ENOTBLK
from ext2_iomap_begin(), in case if the write is done on a hole.
- This fallbacks to buffered-io in case of DIO_SKIP_HOLES or in case of
a partial write or if any error is detected in ext2_iomap_end().
We try to return -ENOTBLK in such cases.
- For any unaligned or extending DIO writes, we pass
IOMAP_DIO_FORCE_WAIT flag to ensure synchronous writes.
- For extending writes we set IOMAP_F_DIRTY in ext2_iomap_begin because
otherwise with dsync writes on devices that support FUA, generic_write_sync
won't be called and we might miss inode metadata updates.
- Since ext2 already now uses _nolock vartiant of sync write. Hence
there is no inode lock problem with iomap in this patch.
- ext2_iomap_ops are now being shared by DIO, DAX & fiemap path
Tested-by: Disha Goel <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext2/ext2.h | 1 +
fs/ext2/file.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext2/inode.c | 53 ++++++++++++++--------
3 files changed, 150 insertions(+), 19 deletions(-)
diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index cb78d7dcfb95..00168447d48c 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -753,6 +753,7 @@ extern unsigned long ext2_count_free (struct buffer_head *, unsigned);
extern struct inode *ext2_iget (struct super_block *, unsigned long);
extern int ext2_write_inode (struct inode *, struct writeback_control *);
extern void ext2_evict_inode(struct inode *);
+void ext2_write_failed(struct address_space *mapping, loff_t to);
extern int ext2_get_block(struct inode *, sector_t, struct buffer_head *, int);
extern int ext2_setattr (struct mnt_idmap *, struct dentry *, struct iattr *);
extern int ext2_getattr (struct mnt_idmap *, const struct path *,
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 7603427fb38f..704abe0a79cb 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -164,12 +164,124 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
return ret;
}
+static ssize_t ext2_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file->f_mapping->host;
+ ssize_t ret;
+
+ inode_lock_shared(inode);
+ ret = iomap_dio_rw(iocb, to, &ext2_iomap_ops, NULL, 0, NULL, 0);
+ inode_unlock_shared(inode);
+
+ return ret;
+}
+
+static int ext2_dio_write_end_io(struct kiocb *iocb, ssize_t size,
+ int error, unsigned int flags)
+{
+ loff_t pos = iocb->ki_pos;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (error)
+ goto out;
+
+ /*
+ * If we are extending the file, we have to update i_size here before
+ * page cache gets invalidated in iomap_dio_rw(). This prevents racing
+ * buffered reads from zeroing out too much from page cache pages.
+ * Note that all extending writes always happens synchronously with
+ * inode lock held by ext2_dio_write_iter(). So it is safe to update
+ * inode size here for extending file writes.
+ */
+ pos += size;
+ if (pos > i_size_read(inode)) {
+ i_size_write(inode, pos);
+ mark_inode_dirty(inode);
+ }
+out:
+ return error;
+}
+
+static const struct iomap_dio_ops ext2_dio_write_ops = {
+ .end_io = ext2_dio_write_end_io,
+};
+
+static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file->f_mapping->host;
+ ssize_t ret;
+ unsigned int flags = 0;
+ unsigned long blocksize = inode->i_sb->s_blocksize;
+ loff_t offset = iocb->ki_pos;
+ loff_t count = iov_iter_count(from);
+
+ inode_lock(inode);
+ ret = generic_write_checks(iocb, from);
+ if (ret <= 0)
+ goto out_unlock;
+
+ ret = kiocb_modified(iocb);
+ if (ret)
+ goto out_unlock;
+
+ /* use IOMAP_DIO_FORCE_WAIT for unaligned or extending writes */
+ if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) ||
+ (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize)))
+ flags |= IOMAP_DIO_FORCE_WAIT;
+
+ ret = iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops,
+ flags, NULL, 0);
+
+ /* ENOTBLK is magic return value for fallback to buffered-io */
+ if (ret == -ENOTBLK)
+ ret = 0;
+
+ if (ret < 0 && ret != -EIOCBQUEUED)
+ ext2_write_failed(inode->i_mapping, offset + count);
+
+ /* handle case for partial write and for fallback to buffered write */
+ if (ret >= 0 && iov_iter_count(from)) {
+ loff_t pos, endbyte;
+ ssize_t status;
+ int ret2;
+
+ iocb->ki_flags &= ~IOCB_DIRECT;
+ pos = iocb->ki_pos;
+ status = generic_perform_write(iocb, from);
+ if (unlikely(status < 0)) {
+ ret = status;
+ goto out_unlock;
+ }
+
+ iocb->ki_pos += status;
+ ret += status;
+ endbyte = pos + status - 1;
+ ret2 = filemap_write_and_wait_range(inode->i_mapping, pos,
+ endbyte);
+ if (!ret2)
+ invalidate_mapping_pages(inode->i_mapping,
+ pos >> PAGE_SHIFT,
+ endbyte >> PAGE_SHIFT);
+ if (ret > 0)
+ generic_write_sync(iocb, ret);
+ }
+
+out_unlock:
+ inode_unlock(inode);
+ return ret;
+}
+
static ssize_t ext2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
#ifdef CONFIG_FS_DAX
if (IS_DAX(iocb->ki_filp->f_mapping->host))
return ext2_dax_read_iter(iocb, to);
#endif
+ if (iocb->ki_flags & IOCB_DIRECT)
+ return ext2_dio_read_iter(iocb, to);
+
return generic_file_read_iter(iocb, to);
}
@@ -179,6 +291,9 @@ static ssize_t ext2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (IS_DAX(iocb->ki_filp->f_mapping->host))
return ext2_dax_write_iter(iocb, from);
#endif
+ if (iocb->ki_flags & IOCB_DIRECT)
+ return ext2_dio_write_iter(iocb, from);
+
return generic_file_write_iter(iocb, from);
}
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index dc76147e7b07..75983215c7a1 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -56,7 +56,7 @@ static inline int ext2_inode_is_fast_symlink(struct inode *inode)
static void ext2_truncate_blocks(struct inode *inode, loff_t offset);
-static void ext2_write_failed(struct address_space *mapping, loff_t to)
+void ext2_write_failed(struct address_space *mapping, loff_t to)
{
struct inode *inode = mapping->host;
@@ -809,9 +809,27 @@ static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
bool new = false, boundary = false;
u32 bno;
int ret;
+ bool create = flags & IOMAP_WRITE;
+
+ /*
+ * For writes that could fill holes inside i_size on a
+ * DIO_SKIP_HOLES filesystem we forbid block creations: only
+ * overwrites are permitted.
+ */
+ if ((flags & IOMAP_DIRECT) &&
+ (first_block << blkbits) < i_size_read(inode))
+ create = 0;
+
+ /*
+ * Writes that span EOF might trigger an IO size update on completion,
+ * so consider them to be dirty for the purposes of O_DSYNC even if
+ * there is no other metadata changes pending or have been made here.
+ */
+ if ((flags & IOMAP_WRITE) && offset + length > i_size_read(inode))
+ iomap->flags |= IOMAP_F_DIRTY;
ret = ext2_get_blocks(inode, first_block, max_blocks,
- &bno, &new, &boundary, flags & IOMAP_WRITE);
+ &bno, &new, &boundary, create);
if (ret < 0)
return ret;
@@ -823,6 +841,12 @@ static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
iomap->bdev = inode->i_sb->s_bdev;
if (ret == 0) {
+ /*
+ * Switch to buffered-io for writing to holes in a non-extent
+ * based filesystem to avoid stale data exposure problem.
+ */
+ if (!create && (flags & IOMAP_WRITE) && (flags & IOMAP_DIRECT))
+ return -ENOTBLK;
iomap->type = IOMAP_HOLE;
iomap->addr = IOMAP_NULL_ADDR;
iomap->length = 1 << blkbits;
@@ -844,6 +868,13 @@ static int
ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ssize_t written, unsigned flags, struct iomap *iomap)
{
+ /*
+ * Switch to buffered-io in case of any error.
+ * Blocks allocated can be used by the buffered-io path.
+ */
+ if ((flags & IOMAP_DIRECT) && (flags & IOMAP_WRITE) && written == 0)
+ return -ENOTBLK;
+
if (iomap->type == IOMAP_MAPPED &&
written < length &&
(flags & IOMAP_WRITE))
@@ -908,22 +939,6 @@ static sector_t ext2_bmap(struct address_space *mapping, sector_t block)
return generic_block_bmap(mapping,block,ext2_get_block);
}
-static ssize_t
-ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct file *file = iocb->ki_filp;
- struct address_space *mapping = file->f_mapping;
- struct inode *inode = mapping->host;
- size_t count = iov_iter_count(iter);
- loff_t offset = iocb->ki_pos;
- ssize_t ret;
-
- ret = blockdev_direct_IO(iocb, inode, iter, ext2_get_block);
- if (ret < 0 && iov_iter_rw(iter) == WRITE)
- ext2_write_failed(mapping, offset + count);
- return ret;
-}
-
static int
ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
@@ -946,7 +961,7 @@ const struct address_space_operations ext2_aops = {
.write_begin = ext2_write_begin,
.write_end = ext2_write_end,
.bmap = ext2_bmap,
- .direct_IO = ext2_direct_IO,
+ .direct_IO = noop_direct_IO,
.writepages = ext2_writepages,
.migrate_folio = buffer_migrate_folio,
.is_partially_uptodate = block_is_partially_uptodate,
--
2.39.2
This patch adds the trace point to ext2 direct-io apis
in fs/ext2/file.c
Here is how the output looks like
a.out-467865 [006] 6758.170968: ext2_dio_write_begin: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT|WRITE aio 1 ret 0
a.out-467865 [006] 6758.171061: ext2_dio_write_end: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 0 flags DIRECT|WRITE aio 1 ret -529
kworker/3:153-444162 [003] 6758.171252: ext2_dio_write_endio: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT|WRITE aio 1 ret 0
a.out-468222 [001] 6761.628924: ext2_dio_read_begin: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT aio 1 ret 0
a.out-468222 [001] 6761.629063: ext2_dio_read_end: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 0 flags DIRECT aio 1 ret -529
a.out-468428 [005] 6763.937454: ext2_dio_write_begin: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT aio 0 ret 0
a.out-468428 [005] 6763.937829: ext2_dio_write_endio: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT aio 0 ret 0
a.out-468428 [005] 6763.937847: ext2_dio_write_end: dev 7:12 ino 0xe isize 0x1000 pos 0x1000 len 0 flags DIRECT aio 0 ret 4096
a.out-468609 [000] 6765.702878: ext2_dio_read_begin: dev 7:12 ino 0xe isize 0x1000 pos 0x0 len 4096 flags DIRECT aio 0 ret 0
a.out-468609 [000] 6765.703243: ext2_dio_read_end: dev 7:12 ino 0xe isize 0x1000 pos 0x1000 len 0 flags DIRECT aio 0 ret 4096
Reported-and-tested-by: Disha Goel <[email protected]>
[Need to add CFLAGS_trace for fixing unable to find trace file problem]
Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext2/Makefile | 5 ++-
fs/ext2/file.c | 10 +++++-
fs/ext2/trace.c | 6 ++++
fs/ext2/trace.h | 94 ++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 113 insertions(+), 2 deletions(-)
create mode 100644 fs/ext2/trace.c
create mode 100644 fs/ext2/trace.h
diff --git a/fs/ext2/Makefile b/fs/ext2/Makefile
index 311479d864a7..8860948ef9ca 100644
--- a/fs/ext2/Makefile
+++ b/fs/ext2/Makefile
@@ -6,7 +6,10 @@
obj-$(CONFIG_EXT2_FS) += ext2.o
ext2-y := balloc.o dir.o file.o ialloc.o inode.o \
- ioctl.o namei.o super.o symlink.o
+ ioctl.o namei.o super.o symlink.o trace.o
+
+# For tracepoints to include our trace.h from tracepoint infrastructure
+CFLAGS_trace.o := -I$(src)
ext2-$(CONFIG_EXT2_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext2-$(CONFIG_EXT2_FS_POSIX_ACL) += acl.o
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 704abe0a79cb..c3e8d8cc6792 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -29,6 +29,7 @@
#include "ext2.h"
#include "xattr.h"
#include "acl.h"
+#include "trace.h"
#ifdef CONFIG_FS_DAX
static ssize_t ext2_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
@@ -170,9 +171,11 @@ static ssize_t ext2_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
struct inode *inode = file->f_mapping->host;
ssize_t ret;
+ trace_ext2_dio_read_begin(iocb, to, 0);
inode_lock_shared(inode);
ret = iomap_dio_rw(iocb, to, &ext2_iomap_ops, NULL, 0, NULL, 0);
inode_unlock_shared(inode);
+ trace_ext2_dio_read_end(iocb, to, ret);
return ret;
}
@@ -200,6 +203,7 @@ static int ext2_dio_write_end_io(struct kiocb *iocb, ssize_t size,
mark_inode_dirty(inode);
}
out:
+ trace_ext2_dio_write_endio(iocb, size, error);
return error;
}
@@ -216,7 +220,9 @@ static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
unsigned long blocksize = inode->i_sb->s_blocksize;
loff_t offset = iocb->ki_pos;
loff_t count = iov_iter_count(from);
+ ssize_t status = 0;
+ trace_ext2_dio_write_begin(iocb, from, 0);
inode_lock(inode);
ret = generic_write_checks(iocb, from);
if (ret <= 0)
@@ -244,7 +250,6 @@ static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
/* handle case for partial write and for fallback to buffered write */
if (ret >= 0 && iov_iter_count(from)) {
loff_t pos, endbyte;
- ssize_t status;
int ret2;
iocb->ki_flags &= ~IOCB_DIRECT;
@@ -270,6 +275,9 @@ static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
out_unlock:
inode_unlock(inode);
+ if (status)
+ trace_ext2_dio_write_buff_end(iocb, from, status);
+ trace_ext2_dio_write_end(iocb, from, ret);
return ret;
}
diff --git a/fs/ext2/trace.c b/fs/ext2/trace.c
new file mode 100644
index 000000000000..b01cdf6526fd
--- /dev/null
+++ b/fs/ext2/trace.c
@@ -0,0 +1,6 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "ext2.h"
+#include <linux/uio.h>
+
+#define CREATE_TRACE_POINTS
+#include "trace.h"
diff --git a/fs/ext2/trace.h b/fs/ext2/trace.h
new file mode 100644
index 000000000000..7d230e13576e
--- /dev/null
+++ b/fs/ext2/trace.h
@@ -0,0 +1,94 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM ext2
+
+#if !defined(_EXT2_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _EXT2_TRACE_H
+
+#include <linux/tracepoint.h>
+
+DECLARE_EVENT_CLASS(ext2_dio_class,
+ TP_PROTO(struct kiocb *iocb, struct iov_iter *iter, ssize_t ret),
+ TP_ARGS(iocb, iter, ret),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(ino_t, ino)
+ __field(loff_t, isize)
+ __field(loff_t, pos)
+ __field(size_t, count)
+ __field(int, ki_flags)
+ __field(bool, aio)
+ __field(ssize_t, ret)
+ ),
+ TP_fast_assign(
+ __entry->dev = file_inode(iocb->ki_filp)->i_sb->s_dev;
+ __entry->ino = file_inode(iocb->ki_filp)->i_ino;
+ __entry->isize = file_inode(iocb->ki_filp)->i_size;
+ __entry->pos = iocb->ki_pos;
+ __entry->count = iov_iter_count(iter);
+ __entry->ki_flags = iocb->ki_flags;
+ __entry->aio = !is_sync_kiocb(iocb);
+ __entry->ret = ret;
+ ),
+ TP_printk("dev %d:%d ino 0x%lx isize 0x%llx pos 0x%llx len %zu flags %s aio %d ret %zd",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ino,
+ __entry->isize,
+ __entry->pos,
+ __entry->count,
+ __print_flags(__entry->ki_flags, "|", TRACE_IOCB_STRINGS),
+ __entry->aio,
+ __entry->ret)
+);
+
+#define DEFINE_DIO_RW_EVENT(name) \
+DEFINE_EVENT(ext2_dio_class, name, \
+ TP_PROTO(struct kiocb *iocb, struct iov_iter *iter, ssize_t ret), \
+ TP_ARGS(iocb, iter, ret))
+DEFINE_DIO_RW_EVENT(ext2_dio_write_begin);
+DEFINE_DIO_RW_EVENT(ext2_dio_write_end);
+DEFINE_DIO_RW_EVENT(ext2_dio_write_buff_end);
+DEFINE_DIO_RW_EVENT(ext2_dio_read_begin);
+DEFINE_DIO_RW_EVENT(ext2_dio_read_end);
+
+TRACE_EVENT(ext2_dio_write_endio,
+ TP_PROTO(struct kiocb *iocb, ssize_t size, int ret),
+ TP_ARGS(iocb, size, ret),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(ino_t, ino)
+ __field(loff_t, isize)
+ __field(loff_t, pos)
+ __field(ssize_t, size)
+ __field(int, ki_flags)
+ __field(bool, aio)
+ __field(int, ret)
+ ),
+ TP_fast_assign(
+ __entry->dev = file_inode(iocb->ki_filp)->i_sb->s_dev;
+ __entry->ino = file_inode(iocb->ki_filp)->i_ino;
+ __entry->isize = file_inode(iocb->ki_filp)->i_size;
+ __entry->pos = iocb->ki_pos;
+ __entry->size = size;
+ __entry->ki_flags = iocb->ki_flags;
+ __entry->aio = !is_sync_kiocb(iocb);
+ __entry->ret = ret;
+ ),
+ TP_printk("dev %d:%d ino 0x%lx isize 0x%llx pos 0x%llx len %zd flags %s aio %d ret %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ino,
+ __entry->isize,
+ __entry->pos,
+ __entry->size,
+ __print_flags(__entry->ki_flags, "|", TRACE_IOCB_STRINGS),
+ __entry->aio,
+ __entry->ret)
+);
+
+#endif /* _EXT2_TRACE_H */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE trace
+#include <trace/define_trace.h>
--
2.39.2
Add trace_iomap_dio_rw_begin, trace_iomap_dio_rw_queued and
trace_iomap_dio_complete tracepoint.
trace_iomap_dio_rw_queued is mostly only to know that the request was
queued and -EIOCBQUEUED was returned. It is mostly trace_iomap_dio_rw_begin
& trace_iomap_dio_complete which has all the details.
<example output log>
a.out-2073 [006] 134.225717: iomap_dio_rw_begin: dev 7:7 ino 0xe size 0x0 offset 0x0 length 0x1000 done_before 0x0 flags DIRECT|WRITE dio_flags DIO_FORCE_WAIT aio 1
a.out-2073 [006] 134.226234: iomap_dio_complete: dev 7:7 ino 0xe size 0x1000 offset 0x1000 flags DIRECT|WRITE aio 1 error 0 ret 4096
a.out-2074 [006] 136.225975: iomap_dio_rw_begin: dev 7:7 ino 0xe size 0x1000 offset 0x0 length 0x1000 done_before 0x0 flags DIRECT dio_flags aio 1
a.out-2074 [006] 136.226173: iomap_dio_rw_queued: dev 7:7 ino 0xe size 0x1000 offset 0x1000 length 0x0
ksoftirqd/3-31 [003] 136.226389: iomap_dio_complete: dev 7:7 ino 0xe size 0x1000 offset 0x1000 flags DIRECT aio 1 error 0 ret 4096
a.out-2075 [003] 141.674969: iomap_dio_rw_begin: dev 7:7 ino 0xe size 0x1000 offset 0x0 length 0x1000 done_before 0x0 flags DIRECT|WRITE dio_flags aio 1
a.out-2075 [003] 141.676085: iomap_dio_rw_queued: dev 7:7 ino 0xe size 0x1000 offset 0x1000 length 0x0
kworker/2:0-27 [002] 141.676432: iomap_dio_complete: dev 7:7 ino 0xe size 0x1000 offset 0x1000 flags DIRECT|WRITE aio 1 error 0 ret 4096
a.out-2077 [006] 143.443746: iomap_dio_rw_begin: dev 7:7 ino 0xe size 0x1000 offset 0x0 length 0x1000 done_before 0x0 flags DIRECT dio_flags aio 1
a.out-2077 [006] 143.443866: iomap_dio_rw_queued: dev 7:7 ino 0xe size 0x1000 offset 0x1000 length 0x0
ksoftirqd/5-41 [005] 143.444134: iomap_dio_complete: dev 7:7 ino 0xe size 0x1000 offset 0x1000 flags DIRECT aio 1 error 0 ret 4096
a.out-2078 [007] 146.716833: iomap_dio_rw_begin: dev 7:7 ino 0xe size 0x1000 offset 0x0 length 0x1000 done_before 0x0 flags DIRECT dio_flags aio 0
a.out-2078 [007] 146.717639: iomap_dio_complete: dev 7:7 ino 0xe size 0x1000 offset 0x1000 flags DIRECT aio 0 error 0 ret 4096
a.out-2079 [006] 148.972605: iomap_dio_rw_begin: dev 7:7 ino 0xe size 0x1000 offset 0x0 length 0x1000 done_before 0x0 flags DIRECT dio_flags aio 0
a.out-2079 [006] 148.973099: iomap_dio_complete: dev 7:7 ino 0xe size 0x1000 offset 0x1000 flags DIRECT aio 0 error 0 ret 4096
Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/iomap/direct-io.c | 7 +++-
fs/iomap/trace.c | 1 +
fs/iomap/trace.h | 78 ++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 85 insertions(+), 1 deletion(-)
diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 36ab1152dbea..019cc87d0fb3 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -130,6 +130,7 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
if (ret > 0)
ret += dio->done_before;
+ trace_iomap_dio_complete(iocb, dio->error, ret);
kfree(dio);
return ret;
@@ -493,6 +494,8 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
struct blk_plug plug;
struct iomap_dio *dio;
+ trace_iomap_dio_rw_begin(iocb, iter, dio_flags, done_before);
+
if (!iomi.len)
return NULL;
@@ -650,8 +653,10 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
*/
dio->wait_for_completion = wait_for_completion;
if (!atomic_dec_and_test(&dio->ref)) {
- if (!wait_for_completion)
+ if (!wait_for_completion) {
+ trace_iomap_dio_rw_queued(inode, iomi.pos, iomi.len);
return ERR_PTR(-EIOCBQUEUED);
+ }
for (;;) {
set_current_state(TASK_UNINTERRUPTIBLE);
diff --git a/fs/iomap/trace.c b/fs/iomap/trace.c
index da217246b1a9..728d5443daf5 100644
--- a/fs/iomap/trace.c
+++ b/fs/iomap/trace.c
@@ -3,6 +3,7 @@
* Copyright (c) 2019 Christoph Hellwig
*/
#include <linux/iomap.h>
+#include <linux/uio.h>
/*
* We include this last to have the helpers above available for the trace
diff --git a/fs/iomap/trace.h b/fs/iomap/trace.h
index f6ea9540d082..448b82d16c0b 100644
--- a/fs/iomap/trace.h
+++ b/fs/iomap/trace.h
@@ -83,6 +83,7 @@ DEFINE_RANGE_EVENT(iomap_writepage);
DEFINE_RANGE_EVENT(iomap_release_folio);
DEFINE_RANGE_EVENT(iomap_invalidate_folio);
DEFINE_RANGE_EVENT(iomap_dio_invalidate_fail);
+DEFINE_RANGE_EVENT(iomap_dio_rw_queued);
#define IOMAP_TYPE_STRINGS \
{ IOMAP_HOLE, "HOLE" }, \
@@ -107,6 +108,11 @@ DEFINE_RANGE_EVENT(iomap_dio_invalidate_fail);
{ IOMAP_F_BUFFER_HEAD, "BH" }, \
{ IOMAP_F_SIZE_CHANGED, "SIZE_CHANGED" }
+#define IOMAP_DIO_STRINGS \
+ {IOMAP_DIO_FORCE_WAIT, "DIO_FORCE_WAIT" }, \
+ {IOMAP_DIO_OVERWRITE_ONLY, "DIO_OVERWRITE_ONLY" }, \
+ {IOMAP_DIO_PARTIAL, "DIO_PARTIAL" }
+
DECLARE_EVENT_CLASS(iomap_class,
TP_PROTO(struct inode *inode, struct iomap *iomap),
TP_ARGS(inode, iomap),
@@ -183,6 +189,78 @@ TRACE_EVENT(iomap_iter,
(void *)__entry->caller)
);
+TRACE_EVENT(iomap_dio_rw_begin,
+ TP_PROTO(struct kiocb *iocb, struct iov_iter *iter,
+ unsigned int dio_flags, size_t done_before),
+ TP_ARGS(iocb, iter, dio_flags, done_before),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(ino_t, ino)
+ __field(loff_t, isize)
+ __field(loff_t, pos)
+ __field(size_t, count)
+ __field(size_t, done_before)
+ __field(int, ki_flags)
+ __field(unsigned int, dio_flags)
+ __field(bool, aio)
+ ),
+ TP_fast_assign(
+ __entry->dev = file_inode(iocb->ki_filp)->i_sb->s_dev;
+ __entry->ino = file_inode(iocb->ki_filp)->i_ino;
+ __entry->isize = file_inode(iocb->ki_filp)->i_size;
+ __entry->pos = iocb->ki_pos;
+ __entry->count = iov_iter_count(iter);
+ __entry->done_before = done_before;
+ __entry->ki_flags = iocb->ki_flags;
+ __entry->dio_flags = dio_flags;
+ __entry->aio = !is_sync_kiocb(iocb);
+ ),
+ TP_printk("dev %d:%d ino 0x%lx size 0x%llx offset 0x%llx length 0x%zx done_before 0x%zx flags %s dio_flags %s aio %d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ino,
+ __entry->isize,
+ __entry->pos,
+ __entry->count,
+ __entry->done_before,
+ __print_flags(__entry->ki_flags, "|", TRACE_IOCB_STRINGS),
+ __print_flags(__entry->dio_flags, "|", IOMAP_DIO_STRINGS),
+ __entry->aio)
+);
+
+TRACE_EVENT(iomap_dio_complete,
+ TP_PROTO(struct kiocb *iocb, int error, ssize_t ret),
+ TP_ARGS(iocb, error, ret),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(ino_t, ino)
+ __field(loff_t, isize)
+ __field(loff_t, pos)
+ __field(int, ki_flags)
+ __field(bool, aio)
+ __field(int, error)
+ __field(ssize_t, ret)
+ ),
+ TP_fast_assign(
+ __entry->dev = file_inode(iocb->ki_filp)->i_sb->s_dev;
+ __entry->ino = file_inode(iocb->ki_filp)->i_ino;
+ __entry->isize = file_inode(iocb->ki_filp)->i_size;
+ __entry->pos = iocb->ki_pos;
+ __entry->ki_flags = iocb->ki_flags;
+ __entry->aio = !is_sync_kiocb(iocb);
+ __entry->error = error;
+ __entry->ret = ret;
+ ),
+ TP_printk("dev %d:%d ino 0x%lx size 0x%llx offset 0x%llx flags %s aio %d error %d ret %zd",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ino,
+ __entry->isize,
+ __entry->pos,
+ __print_flags(__entry->ki_flags, "|", TRACE_IOCB_STRINGS),
+ __entry->aio,
+ __entry->error,
+ __entry->ret)
+);
+
#endif /* _IOMAP_TRACE_H */
#undef TRACE_INCLUDE_PATH
--
2.39.2
Looks good:
Reviewed-by: Christoph Hellwig <[email protected]>
On Sun 16-04-23 15:38:37, Ritesh Harjani (IBM) wrote:
> Some of the higher layers like iomap takes inode_lock() when calling
> generic_write_sync().
> Also writeback already happens from other paths without inode lock,
> so it's difficult to say that we really need sync_mapping_buffers() to
> take any inode locking here. Having said that, let's add
> generic_buffer_fsync() implementation in buffer.c with no
> inode_lock/unlock() for now so that filesystems like ext2 and
> ext4's nojournal mode can use it.
>
> Ext4 when got converted to iomap for direct-io already copied it's own
> variant of __generic_file_fsync() without lock. Hence let's add a helper
> API and use it both in ext2 and ext4.
>
> Later we can review other filesystems as well to see if we can make
> generic_buffer_fsync() which does not take any inode_lock() as the
> default path.
>
> Tested-by: Disha Goel <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
There is a problem with generic_buffer_fsync() that it does not call
blkdev_issue_flush() so the caller is responsible for doing that. That's
necessary for ext2 & ext4 so fine for now. But historically this was the
case with generic_file_fsync() as well and that led to many filesystem
forgetting to flush caches from fsync(2). What is our transition plan for
these filesystems that currently do the cache flush from
generic_file_fsync()? Do we want to eventually keep generic_file_fsync()
doing the cache flush and call generic_buffer_fsync() instead of
__generic_buffer_fsync() from it?
Honza
> ---
> fs/buffer.c | 43 +++++++++++++++++++++++++++++++++++++
> include/linux/buffer_head.h | 2 ++
> 2 files changed, 45 insertions(+)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 9e1e2add541e..df98f1966a71 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -593,6 +593,49 @@ int sync_mapping_buffers(struct address_space *mapping)
> }
> EXPORT_SYMBOL(sync_mapping_buffers);
>
> +/**
> + * generic_buffer_fsync - generic buffer fsync implementation
> + * for simple filesystems with no inode lock
> + *
> + * @file: file to synchronize
> + * @start: start offset in bytes
> + * @end: end offset in bytes (inclusive)
> + * @datasync: only synchronize essential metadata if true
> + *
> + * This is a generic implementation of the fsync method for simple
> + * filesystems which track all non-inode metadata in the buffers list
> + * hanging off the address_space structure.
> + */
> +int generic_buffer_fsync(struct file *file, loff_t start, loff_t end,
> + bool datasync)
> +{
> + struct inode *inode = file->f_mapping->host;
> + int err;
> + int ret;
> +
> + err = file_write_and_wait_range(file, start, end);
> + if (err)
> + return err;
> +
> + ret = sync_mapping_buffers(inode->i_mapping);
> + if (!(inode->i_state & I_DIRTY_ALL))
> + goto out;
> + if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
> + goto out;
> +
> + err = sync_inode_metadata(inode, 1);
> + if (ret == 0)
> + ret = err;
> +
> +out:
> + /* check and advance again to catch errors after syncing out buffers */
> + err = file_check_and_advance_wb_err(file);
> + if (ret == 0)
> + ret = err;
> + return ret;
> +}
> +EXPORT_SYMBOL(generic_buffer_fsync);
> +
> /*
> * Called when we've recently written block `bblock', and it is known that
> * `bblock' was for a buffer_boundary() buffer. This means that the block at
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 8f14dca5fed7..3170d0792d52 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -211,6 +211,8 @@ int inode_has_buffers(struct inode *);
> void invalidate_inode_buffers(struct inode *);
> int remove_inode_buffers(struct inode *inode);
> int sync_mapping_buffers(struct address_space *mapping);
> +int generic_buffer_fsync(struct file *file, loff_t start, loff_t end,
> + bool datasync);
> void clean_bdev_aliases(struct block_device *bdev, sector_t block,
> sector_t len);
> static inline void clean_bdev_bh_alias(struct buffer_head *bh)
> --
> 2.39.2
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon 17-04-23 13:01:49, Jan Kara wrote:
> On Sun 16-04-23 15:38:37, Ritesh Harjani (IBM) wrote:
> > Some of the higher layers like iomap takes inode_lock() when calling
> > generic_write_sync().
> > Also writeback already happens from other paths without inode lock,
> > so it's difficult to say that we really need sync_mapping_buffers() to
> > take any inode locking here. Having said that, let's add
> > generic_buffer_fsync() implementation in buffer.c with no
> > inode_lock/unlock() for now so that filesystems like ext2 and
> > ext4's nojournal mode can use it.
> >
> > Ext4 when got converted to iomap for direct-io already copied it's own
> > variant of __generic_file_fsync() without lock. Hence let's add a helper
> > API and use it both in ext2 and ext4.
> >
> > Later we can review other filesystems as well to see if we can make
> > generic_buffer_fsync() which does not take any inode_lock() as the
> > default path.
> >
> > Tested-by: Disha Goel <[email protected]>
> > Reviewed-by: Christoph Hellwig <[email protected]>
> > Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
>
> There is a problem with generic_buffer_fsync() that it does not call
> blkdev_issue_flush() so the caller is responsible for doing that. That's
> necessary for ext2 & ext4 so fine for now.
Actually a slight correction: ext2 could use a variant of
generic_buffer_fsync() that flushes disk caches.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Sun 16-04-23 15:38:40, Ritesh Harjani (IBM) wrote:
> This patch converts ext2 direct-io path to iomap interface.
> - This also takes care of DIO_SKIP_HOLES part in which we return -ENOTBLK
> from ext2_iomap_begin(), in case if the write is done on a hole.
> - This fallbacks to buffered-io in case of DIO_SKIP_HOLES or in case of
> a partial write or if any error is detected in ext2_iomap_end().
> We try to return -ENOTBLK in such cases.
> - For any unaligned or extending DIO writes, we pass
> IOMAP_DIO_FORCE_WAIT flag to ensure synchronous writes.
> - For extending writes we set IOMAP_F_DIRTY in ext2_iomap_begin because
> otherwise with dsync writes on devices that support FUA, generic_write_sync
> won't be called and we might miss inode metadata updates.
> - Since ext2 already now uses _nolock vartiant of sync write. Hence
> there is no inode lock problem with iomap in this patch.
> - ext2_iomap_ops are now being shared by DIO, DAX & fiemap path
>
> Tested-by: Disha Goel <[email protected]>
> Reviewed-by: Christoph Hellwig <[email protected]>
> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
One comment below:
> @@ -844,6 +868,13 @@ static int
> ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
> ssize_t written, unsigned flags, struct iomap *iomap)
> {
> + /*
> + * Switch to buffered-io in case of any error.
> + * Blocks allocated can be used by the buffered-io path.
> + */
> + if ((flags & IOMAP_DIRECT) && (flags & IOMAP_WRITE) && written == 0)
> + return -ENOTBLK;
> +
> if (iomap->type == IOMAP_MAPPED &&
> written < length &&
> (flags & IOMAP_WRITE))
Is this really needed? What for?
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Jan Kara <[email protected]> writes:
> On Sun 16-04-23 15:38:37, Ritesh Harjani (IBM) wrote:
>> Some of the higher layers like iomap takes inode_lock() when calling
>> generic_write_sync().
>> Also writeback already happens from other paths without inode lock,
>> so it's difficult to say that we really need sync_mapping_buffers() to
>> take any inode locking here. Having said that, let's add
>> generic_buffer_fsync() implementation in buffer.c with no
>> inode_lock/unlock() for now so that filesystems like ext2 and
>> ext4's nojournal mode can use it.
>>
>> Ext4 when got converted to iomap for direct-io already copied it's own
>> variant of __generic_file_fsync() without lock. Hence let's add a helper
>> API and use it both in ext2 and ext4.
>>
>> Later we can review other filesystems as well to see if we can make
>> generic_buffer_fsync() which does not take any inode_lock() as the
>> default path.
>>
>> Tested-by: Disha Goel <[email protected]>
>> Reviewed-by: Christoph Hellwig <[email protected]>
>> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
>
> There is a problem with generic_buffer_fsync() that it does not call
> blkdev_issue_flush() so the caller is responsible for doing that. That's
> necessary for ext2 & ext4 so fine for now. But historically this was the
> case with generic_file_fsync() as well and that led to many filesystem
> forgetting to flush caches from fsync(2).
Ok, thanks for the details.
> What is our transition plan for
> these filesystems that currently do the cache flush from
> generic_file_fsync()? Do we want to eventually keep generic_file_fsync()
> doing the cache flush and call generic_buffer_fsync() instead of
> __generic_buffer_fsync() from it?
Frankly speaking, I was thinking we will come back to this question
maybe when we start working on those changes. At this point in time
I only looked at it from ext2 DIO changes perspective.
But since you asked, here is what I think we could do -
Rename generic_file_fsync => generic_buffers_sync() to fs/buffers.c
Then
generic_buffers_sync() {
ret = generic_buffers_fsync()
if (!ret)
blkdev_issue_flush()
}
generic_buffers_fsync() is same as in this patch which does not have the
cache flush operation.
(will rename from generic_buffer_fsync() to generic_buffers_fsync())
Note: The naming is kept such that-
- sync means it will do fsync followed by cache flush.
- fsync means it will only do the file fsync
As I understand - we would eventually like to kill the
inode_lock() variants of generic_file_fsync() and __generic_file_fsync()
after auditing other filesystem code, right?
Then for now what we need is generic_buffers_sync() function which does
not take an inode_lock() and also does cache flush which is required for ext2.
And generic_buffers_fsync() which does not do any cache flush operations
required by filesystem like ext4.
Does that sound good to you? Is the naming also proper?
Is yes, then I can rename the below function to generic_buffers_fsync()
and also create implementation of generic_buffers_sync().
Then let ext2 and ext4 use them.
-ritesh
>
> Honza
>
>> ---
>> fs/buffer.c | 43 +++++++++++++++++++++++++++++++++++++
>> include/linux/buffer_head.h | 2 ++
>> 2 files changed, 45 insertions(+)
>>
>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index 9e1e2add541e..df98f1966a71 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -593,6 +593,49 @@ int sync_mapping_buffers(struct address_space *mapping)
>> }
>> EXPORT_SYMBOL(sync_mapping_buffers);
>>
>> +/**
>> + * generic_buffer_fsync - generic buffer fsync implementation
>> + * for simple filesystems with no inode lock
>> + *
>> + * @file: file to synchronize
>> + * @start: start offset in bytes
>> + * @end: end offset in bytes (inclusive)
>> + * @datasync: only synchronize essential metadata if true
>> + *
>> + * This is a generic implementation of the fsync method for simple
>> + * filesystems which track all non-inode metadata in the buffers list
>> + * hanging off the address_space structure.
>> + */
>> +int generic_buffer_fsync(struct file *file, loff_t start, loff_t end,
>> + bool datasync)
>> +{
>> + struct inode *inode = file->f_mapping->host;
>> + int err;
>> + int ret;
>> +
>> + err = file_write_and_wait_range(file, start, end);
>> + if (err)
>> + return err;
>> +
>> + ret = sync_mapping_buffers(inode->i_mapping);
>> + if (!(inode->i_state & I_DIRTY_ALL))
>> + goto out;
>> + if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
>> + goto out;
>> +
>> + err = sync_inode_metadata(inode, 1);
>> + if (ret == 0)
>> + ret = err;
>> +
>> +out:
>> + /* check and advance again to catch errors after syncing out buffers */
>> + err = file_check_and_advance_wb_err(file);
>> + if (ret == 0)
>> + ret = err;
>> + return ret;
>> +}
>> +EXPORT_SYMBOL(generic_buffer_fsync);
>> +
>> /*
>> * Called when we've recently written block `bblock', and it is known that
>> * `bblock' was for a buffer_boundary() buffer. This means that the block at
>> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
>> index 8f14dca5fed7..3170d0792d52 100644
>> --- a/include/linux/buffer_head.h
>> +++ b/include/linux/buffer_head.h
>> @@ -211,6 +211,8 @@ int inode_has_buffers(struct inode *);
>> void invalidate_inode_buffers(struct inode *);
>> int remove_inode_buffers(struct inode *inode);
>> int sync_mapping_buffers(struct address_space *mapping);
>> +int generic_buffer_fsync(struct file *file, loff_t start, loff_t end,
>> + bool datasync);
>> void clean_bdev_aliases(struct block_device *bdev, sector_t block,
>> sector_t len);
>> static inline void clean_bdev_bh_alias(struct buffer_head *bh)
>> --
>> 2.39.2
>>
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
On Mon 17-04-23 17:08:57, Ritesh Harjani wrote:
> Jan Kara <[email protected]> writes:
>
> > On Sun 16-04-23 15:38:37, Ritesh Harjani (IBM) wrote:
> >> Some of the higher layers like iomap takes inode_lock() when calling
> >> generic_write_sync().
> >> Also writeback already happens from other paths without inode lock,
> >> so it's difficult to say that we really need sync_mapping_buffers() to
> >> take any inode locking here. Having said that, let's add
> >> generic_buffer_fsync() implementation in buffer.c with no
> >> inode_lock/unlock() for now so that filesystems like ext2 and
> >> ext4's nojournal mode can use it.
> >>
> >> Ext4 when got converted to iomap for direct-io already copied it's own
> >> variant of __generic_file_fsync() without lock. Hence let's add a helper
> >> API and use it both in ext2 and ext4.
> >>
> >> Later we can review other filesystems as well to see if we can make
> >> generic_buffer_fsync() which does not take any inode_lock() as the
> >> default path.
> >>
> >> Tested-by: Disha Goel <[email protected]>
> >> Reviewed-by: Christoph Hellwig <[email protected]>
> >> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
> >
> > There is a problem with generic_buffer_fsync() that it does not call
> > blkdev_issue_flush() so the caller is responsible for doing that. That's
> > necessary for ext2 & ext4 so fine for now. But historically this was the
> > case with generic_file_fsync() as well and that led to many filesystem
> > forgetting to flush caches from fsync(2).
>
> Ok, thanks for the details.
>
> > What is our transition plan for
> > these filesystems that currently do the cache flush from
> > generic_file_fsync()? Do we want to eventually keep generic_file_fsync()
> > doing the cache flush and call generic_buffer_fsync() instead of
> > __generic_buffer_fsync() from it?
>
> Frankly speaking, I was thinking we will come back to this question
> maybe when we start working on those changes. At this point in time
> I only looked at it from ext2 DIO changes perspective.
Yes, we can return to this later. The only thing I wanted to kind of make
sure is we don't have to rename the function again when adding support for
other filesystems (although even that would not be a big issue given there
are two callers).
> But since you asked, here is what I think we could do -
>
> Rename generic_file_fsync => generic_buffers_sync() to fs/buffers.c
> Then
> generic_buffers_sync() {
> ret = generic_buffers_fsync()
> if (!ret)
> blkdev_issue_flush()
> }
>
> generic_buffers_fsync() is same as in this patch which does not have the
> cache flush operation.
> (will rename from generic_buffer_fsync() to generic_buffers_fsync())
>
> Note: The naming is kept such that-
> - sync means it will do fsync followed by cache flush.
> - fsync means it will only do the file fsync
Hum, I think the difference sync vs fsync is too subtle and non-obvious.
I can see sensible pairs like:
__generic_buffers_fsync() - "__" indicates you should know what you
are doing when calling this
generic_buffers_fsync()
or
generic_buffers_fsync()
generic_file_fsync() - difficult at this point as there's name
clash
or
generic_buffers_fsync_noflush()
generic_buffers_fsync() - obvious what the default "safe" choice
is.
or something like that.
> As I understand - we would eventually like to kill the
> inode_lock() variants of generic_file_fsync() and __generic_file_fsync()
> after auditing other filesystem code, right?
Yes.
> Then for now what we need is generic_buffers_sync() function which does
> not take an inode_lock() and also does cache flush which is required for ext2.
> And generic_buffers_fsync() which does not do any cache flush operations
> required by filesystem like ext4.
>
> Does that sound good to you? Is the naming also proper?
I agree with the plan, just the naming is hard :)
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon, Apr 17, 2023 at 06:45:50PM +0200, Jan Kara wrote:
> Hum, I think the difference sync vs fsync is too subtle and non-obvious.
Agreed.
> I can see sensible pairs like:
>
> __generic_buffers_fsync() - "__" indicates you should know what you
> are doing when calling this
> generic_buffers_fsync()
>
> or
>
> generic_buffers_fsync()
> generic_file_fsync() - difficult at this point as there's name
> clash
>
> or
>
> generic_buffers_fsync_noflush()
> generic_buffers_fsync() - obvious what the default "safe" choice
> is.
>
> or something like that.
I'd prefer the last option as the most explicit one.
Christoph Hellwig <[email protected]> writes:
> On Mon, Apr 17, 2023 at 06:45:50PM +0200, Jan Kara wrote:
>> Hum, I think the difference sync vs fsync is too subtle and non-obvious.
>
> Agreed.
>
>> I can see sensible pairs like:
>>
>> __generic_buffers_fsync() - "__" indicates you should know what you
>> are doing when calling this
>> generic_buffers_fsync()
>>
>> or
>>
>> generic_buffers_fsync()
>> generic_file_fsync() - difficult at this point as there's name
>> clash
>>
>> or
>>
>> generic_buffers_fsync_noflush()
>> generic_buffers_fsync() - obvious what the default "safe" choice
>> is.
>>
>> or something like that.
>
> I'd prefer the last option as the most explicit one.
Yes. I was going to use this one as this is more explicit.
Thanks Jan & Christoph,
I will spin a new revision soon with the suggested changes.
-ritesh
Jan Kara <[email protected]> writes:
> On Sun 16-04-23 15:38:40, Ritesh Harjani (IBM) wrote:
>> This patch converts ext2 direct-io path to iomap interface.
>> - This also takes care of DIO_SKIP_HOLES part in which we return -ENOTBLK
>> from ext2_iomap_begin(), in case if the write is done on a hole.
>> - This fallbacks to buffered-io in case of DIO_SKIP_HOLES or in case of
>> a partial write or if any error is detected in ext2_iomap_end().
>> We try to return -ENOTBLK in such cases.
>> - For any unaligned or extending DIO writes, we pass
>> IOMAP_DIO_FORCE_WAIT flag to ensure synchronous writes.
>> - For extending writes we set IOMAP_F_DIRTY in ext2_iomap_begin because
>> otherwise with dsync writes on devices that support FUA, generic_write_sync
>> won't be called and we might miss inode metadata updates.
>> - Since ext2 already now uses _nolock vartiant of sync write. Hence
>> there is no inode lock problem with iomap in this patch.
>> - ext2_iomap_ops are now being shared by DIO, DAX & fiemap path
>>
>> Tested-by: Disha Goel <[email protected]>
>> Reviewed-by: Christoph Hellwig <[email protected]>
>> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
>
> One comment below:
>
>> @@ -844,6 +868,13 @@ static int
>> ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
>> ssize_t written, unsigned flags, struct iomap *iomap)
>> {
>> + /*
>> + * Switch to buffered-io in case of any error.
>> + * Blocks allocated can be used by the buffered-io path.
>> + */
>> + if ((flags & IOMAP_DIRECT) && (flags & IOMAP_WRITE) && written == 0)
>> + return -ENOTBLK;
>> +
>> if (iomap->type == IOMAP_MAPPED &&
>> written < length &&
>> (flags & IOMAP_WRITE))
>
> Is this really needed? What for?
>
Sorry Jan, I got caught into something else so couldn't respond on this
earlier. Thanks a lot for review.
I don't think this will be called for IOMAP_DIRECT for write case.
I mostly see this code was already present for IOMAP_DAX path.
It is to truncate the blocks in case if the iomap dax write failed to
write but the blocks might have been allocated in ->iomap_begin
function.
Is there a specific query that you would like me to check and verify?
I can check more by probing this path to see what happens when this gets
called. But my understanding was it is used for truncating blocks as I
mentioned above.
Thanks
-ritesh