2023-04-11 05:25:50

by Ritesh Harjani

[permalink] [raw]
Subject: [RFCv2 0/8] ext2: DIO to use iomap

Please find the series which moves ext2 direct-io to use modern iomap interface.

Here are some more details -
1. Patch-1: Fixes a kernel bug_on problem with ext2 dax code (found during code
review and testing).
2. Patch-2: Adds a __generic_file_fsync_nolock implementation as we had
discussed.
3. Patch-3 & Patch-4: Moves ext4 nojournal and ext2 to use _nolock method.
4. Patch-5: This is the main patch which moves ext2 direct-io to use iomap.
(more details can be found in the patch)
5. Patch-6: Kills IOMAP_DIO_NOSYNC flag as it is not in use by any filesystem.
6. Patch-7: adds IOCB_STRINGS macro for use in trace events for better trace
output of iocb flags.
7. Patch-8: Add ext2 trace point for DIO.

Testing:
=========
This passes ext2 "auto" group testing for fstests. There were no new failures
with this patches.


Ritesh Harjani (IBM) (8):
ext2/dax: Fix ext2_setsize when len is page aligned
libfs: Add __generic_file_fsync_nolock implementation
ext4: Use __generic_file_fsync_nolock implementation
ext2: Use __generic_file_fsync_nolock implementation
ext2: Move direct-io to use iomap
iomap: Remove IOMAP_DIO_NOSYNC unused dio flag
fs.h: Add IOCB_STRINGS for use in trace points
ext2: Add direct-io trace points

fs/ext2/Makefile | 2 +-
fs/ext2/ext2.h | 1 +
fs/ext2/file.c | 127 +++++++++++++++++++++++++++++++++++++++++-
fs/ext2/inode.c | 57 +++++++++++--------
fs/ext2/trace.c | 5 ++
fs/ext2/trace.h | 61 ++++++++++++++++++++
fs/ext4/fsync.c | 31 +++++------
fs/iomap/direct-io.c | 2 +-
fs/libfs.c | 43 ++++++++++++++
include/linux/fs.h | 15 +++++
include/linux/iomap.h | 6 --
11 files changed, 303 insertions(+), 47 deletions(-)
create mode 100644 fs/ext2/trace.c
create mode 100644 fs/ext2/trace.h

--
2.39.2


2023-04-11 05:31:24

by Ritesh Harjani

[permalink] [raw]
Subject: [RFCv2 3/8] ext4: Use __generic_file_fsync_nolock implementation

ext4 when got converted to iomap for dio, it copied __generic_file_fsync
implementation to avoid taking inode_lock in order to avoid any deadlock
(since iomap takes an inode_lock while calling generic_write_sync()).

The previous patch already added it's _nolock variant. Hence kill the
redundant code and use __generic_file_fsync_nolock() function instead.

Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext4/fsync.c | 31 +++++++++++++++----------------
1 file changed, 15 insertions(+), 16 deletions(-)

diff --git a/fs/ext4/fsync.c b/fs/ext4/fsync.c
index 027a7d7037a0..bf1d1b7f4ec7 100644
--- a/fs/ext4/fsync.c
+++ b/fs/ext4/fsync.c
@@ -78,21 +78,13 @@ static int ext4_sync_parent(struct inode *inode)
return ret;
}

-static int ext4_fsync_nojournal(struct inode *inode, bool datasync,
- bool *needs_barrier)
+static int ext4_fsync_nojournal(struct file *file, loff_t start, loff_t end,
+ int datasync, bool *needs_barrier)
{
- int ret, err;
-
- ret = sync_mapping_buffers(inode->i_mapping);
- if (!(inode->i_state & I_DIRTY_ALL))
- return ret;
- if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
- return ret;
-
- err = sync_inode_metadata(inode, 1);
- if (!ret)
- ret = err;
+ struct inode *inode = file->f_inode;
+ int ret;

+ ret = __generic_file_fsync_nolock(file, start, end, datasync);
if (!ret)
ret = ext4_sync_parent(inode);
if (test_opt(inode->i_sb, BARRIER))
@@ -148,6 +140,14 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
goto out;
}

+ if (!sbi->s_journal) {
+ ret = ext4_fsync_nojournal(file, start, end, datasync,
+ &needs_barrier);
+ if (needs_barrier)
+ goto issue_flush;
+ goto out;
+ }
+
ret = file_write_and_wait_range(file, start, end);
if (ret)
goto out;
@@ -166,13 +166,12 @@ int ext4_sync_file(struct file *file, loff_t start, loff_t end, int datasync)
* (they were dirtied by commit). But that's OK - the blocks are
* safe in-journal, which is all fsync() needs to ensure.
*/
- if (!sbi->s_journal)
- ret = ext4_fsync_nojournal(inode, datasync, &needs_barrier);
- else if (ext4_should_journal_data(inode))
+ if (ext4_should_journal_data(inode))
ret = ext4_force_commit(inode->i_sb);
else
ret = ext4_fsync_journal(inode, datasync, &needs_barrier);

+issue_flush:
if (needs_barrier) {
err = blkdev_issue_flush(inode->i_sb->s_bdev);
if (!ret)
--
2.39.2

2023-04-11 05:32:37

by Ritesh Harjani

[permalink] [raw]
Subject: [RFCv2 7/8] fs.h: Add IOCB_STRINGS for use in trace points

Add IOCB_STRINGS macro which can be used in the trace point patch to
print different flag values with meaningful string output.

Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
include/linux/fs.h | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 21d2b5670308..496159869af1 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -340,6 +340,20 @@ enum rw_hint {
/* can use bio alloc cache */
#define IOCB_ALLOC_CACHE (1 << 21)

+/* for use in trace events */
+#define IOCB_STRINGS \
+ { IOCB_HIPRI, "HIPRI" }, \
+ { IOCB_DSYNC, "DSYNC" }, \
+ { IOCB_SYNC, "SYNC" }, \
+ { IOCB_NOWAIT, "NOWAIT" }, \
+ { IOCB_APPEND, "APPEND" }, \
+ { IOCB_EVENTFD, "EVENTD"}, \
+ { IOCB_DIRECT, "DIRECT" }, \
+ { IOCB_WRITE, "WRITE" }, \
+ { IOCB_WAITQ, "WAITQ" }, \
+ { IOCB_NOIO, "NOIO" }, \
+ { IOCB_ALLOC_CACHE, "ALLOC_CACHE" }
+
struct kiocb {
struct file *ki_filp;
loff_t ki_pos;
--
2.39.2

2023-04-11 05:34:29

by Ritesh Harjani

[permalink] [raw]
Subject: [RFCv2 2/8] libfs: Add __generic_file_fsync_nolock implementation

Some of the higher layers like iomap takes inode_lock() when calling
generic_write_sync().
Also writeback already happens from other paths without inode lock,
so it's difficult to say that we really need sync_mapping_buffers() to
take any inode locking here. Having said that, let's add a _nolock
variant of this function in libfs for now so that filesystems like
ext2 and ext4's nojournal mode can use it.

Ext4 when got converted to iomap for direct-io already copied it's own
variant of __generic_file_fsync() without lock. Hence let's add a helper
API and use it both in ext2 and ext4.

Later we can review other filesystems as well to see if we can make
_nolock as the default path if inode_lock() is not necessary here.

Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/libfs.c | 43 +++++++++++++++++++++++++++++++++++++++++++
include/linux/fs.h | 1 +
2 files changed, 44 insertions(+)

diff --git a/fs/libfs.c b/fs/libfs.c
index 4eda519c3002..d2dfb72e3cf8 100644
--- a/fs/libfs.c
+++ b/fs/libfs.c
@@ -1110,6 +1110,49 @@ struct dentry *generic_fh_to_parent(struct super_block *sb, struct fid *fid,
}
EXPORT_SYMBOL_GPL(generic_fh_to_parent);

+/**
+ * __generic_file_fsync_nolock - generic fsync implementation for simple
+ * filesystems with no inode lock
+ *
+ * @file: file to synchronize
+ * @start: start offset in bytes
+ * @end: end offset in bytes (inclusive)
+ * @datasync: only synchronize essential metadata if true
+ *
+ * This is a generic implementation of the fsync method for simple
+ * filesystems which track all non-inode metadata in the buffers list
+ * hanging off the address_space structure.
+ */
+int __generic_file_fsync_nolock(struct file *file, loff_t start, loff_t end,
+ int datasync)
+{
+ struct inode *inode = file->f_mapping->host;
+ int err;
+ int ret;
+
+ err = file_write_and_wait_range(file, start, end);
+ if (err)
+ return err;
+
+ ret = sync_mapping_buffers(inode->i_mapping);
+ if (!(inode->i_state & I_DIRTY_ALL))
+ goto out;
+ if (datasync && !(inode->i_state & I_DIRTY_DATASYNC))
+ goto out;
+
+ err = sync_inode_metadata(inode, 1);
+ if (ret == 0)
+ ret = err;
+
+out:
+ /* check and advance again to catch errors after syncing out buffers */
+ err = file_check_and_advance_wb_err(file);
+ if (ret == 0)
+ ret = err;
+ return ret;
+}
+EXPORT_SYMBOL(__generic_file_fsync_nolock);
+
/**
* __generic_file_fsync - generic fsync implementation for simple filesystems
*
diff --git a/include/linux/fs.h b/include/linux/fs.h
index c85916e9f7db..21d2b5670308 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2935,6 +2935,7 @@ extern ssize_t simple_read_from_buffer(void __user *to, size_t count,
extern ssize_t simple_write_to_buffer(void *to, size_t available, loff_t *ppos,
const void __user *from, size_t count);

+extern int __generic_file_fsync_nolock(struct file *, loff_t, loff_t, int);
extern int __generic_file_fsync(struct file *, loff_t, loff_t, int);
extern int generic_file_fsync(struct file *, loff_t, loff_t, int);

--
2.39.2

2023-04-11 05:34:29

by Ritesh Harjani

[permalink] [raw]
Subject: [RFCv2 8/8] ext2: Add direct-io trace points

This patch adds the trace point to ext2 direct-io apis
in fs/ext2/file.c

Here is how the output looks like
xfs_io-4099 [004] 81.431800: ext2_dio_write_iter_start: dev 7:7 ino 0xc isize 0x4000 pos 0x4000 count 4096 flags DIRECT aio 0 ret=0
xfs_io-4099 [004] 81.432002: ext2_dio_write_end_io: dev 7:7 ino 0xc isize 0x5000 pos 0x4000 count 4096 flags DIRECT aio 0 ret=0
xfs_io-4099 [004] 81.432014: ext2_dio_write_iter_dio_end: dev 7:7 ino 0xc isize 0x5000 pos 0x5000 count 0 flags DIRECT aio 0 ret=4096
aio-dio-fcntl-r-4103 [005] 81.461123: ext2_dio_write_iter_start: dev 7:7 ino 0xc isize 0x400 pos 0x200 count 512 flags DIRECT|WRITE aio 1 ret=0
aio-dio-fcntl-r-4103 [005] 81.462099: ext2_dio_write_end_io: dev 7:7 ino 0xc isize 0x400 pos 0x200 count 512 flags DIRECT|WRITE aio 1 ret=0
aio-dio-fcntl-r-4103 [005] 81.462133: ext2_dio_write_iter_dio_end: dev 7:7 ino 0xc isize 0x400 pos 0x400 count 0 flags DIRECT|WRITE aio 1 ret=512

Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext2/Makefile | 2 +-
fs/ext2/file.c | 10 +++++++-
fs/ext2/trace.c | 5 ++++
fs/ext2/trace.h | 61 ++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 76 insertions(+), 2 deletions(-)
create mode 100644 fs/ext2/trace.c
create mode 100644 fs/ext2/trace.h

diff --git a/fs/ext2/Makefile b/fs/ext2/Makefile
index 311479d864a7..ae600f011826 100644
--- a/fs/ext2/Makefile
+++ b/fs/ext2/Makefile
@@ -6,7 +6,7 @@
obj-$(CONFIG_EXT2_FS) += ext2.o

ext2-y := balloc.o dir.o file.o ialloc.o inode.o \
- ioctl.o namei.o super.o symlink.o
+ ioctl.o namei.o super.o symlink.o trace.o

ext2-$(CONFIG_EXT2_FS_XATTR) += xattr.o xattr_user.o xattr_trusted.o
ext2-$(CONFIG_EXT2_FS_POSIX_ACL) += acl.o
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 3511ef85379f..322bfa2d5a28 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -28,6 +28,7 @@
#include "ext2.h"
#include "xattr.h"
#include "acl.h"
+#include "trace.h"

#ifdef CONFIG_FS_DAX
static ssize_t ext2_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
@@ -169,9 +170,11 @@ static ssize_t ext2_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
struct inode *inode = file->f_mapping->host;
ssize_t ret;

+ trace_ext2_dio_read_iter_start(iocb, to, 0);
inode_lock_shared(inode);
ret = iomap_dio_rw(iocb, to, &ext2_iomap_ops, NULL, 0, NULL, 0);
inode_unlock_shared(inode);
+ trace_ext2_dio_read_iter_end(iocb, to, ret);

return ret;
}
@@ -199,6 +202,7 @@ static int ext2_dio_write_end_io(struct kiocb *iocb, ssize_t size,
mark_inode_dirty(inode);
}

+ trace_ext2_dio_write_end_io(iocb, NULL, size);
return 0;
}

@@ -215,7 +219,9 @@ static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
unsigned long blocksize = inode->i_sb->s_blocksize;
loff_t offset = iocb->ki_pos;
loff_t count = iov_iter_count(from);
+ ssize_t status = 0;

+ trace_ext2_dio_write_iter_start(iocb, from, 0);
inode_lock(inode);
ret = generic_write_checks(iocb, from);
if (ret <= 0)
@@ -243,7 +249,6 @@ static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
/* handle case for partial write and for fallback to buffered write */
if (ret >= 0 && iov_iter_count(from)) {
loff_t pos, endbyte;
- ssize_t status;
int ret2;

iocb->ki_flags &= ~IOCB_DIRECT;
@@ -269,6 +274,9 @@ static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)

out_unlock:
inode_unlock(inode);
+ if (status)
+ trace_ext2_dio_write_iter_buff_end(iocb, from, status);
+ trace_ext2_dio_write_iter_dio_end(iocb, from, ret);
return ret;
}

diff --git a/fs/ext2/trace.c b/fs/ext2/trace.c
new file mode 100644
index 000000000000..2d5a3a5109f0
--- /dev/null
+++ b/fs/ext2/trace.c
@@ -0,0 +1,5 @@
+// SPDX-License-Identifier: GPL-2.0
+#include "ext2.h"
+
+#define CREATE_TRACE_POINTS
+#include "trace.h"
diff --git a/fs/ext2/trace.h b/fs/ext2/trace.h
new file mode 100644
index 000000000000..26b53de86f00
--- /dev/null
+++ b/fs/ext2/trace.h
@@ -0,0 +1,61 @@
+// SPDX-License-Identifier: GPL-2.0
+
+#undef TRACE_SYSTEM
+#define TRACE_SYSTEM ext2
+
+#if !defined(_EXT2_TRACE_H) || defined(TRACE_HEADER_MULTI_READ)
+#define _EXT2_TRACE_H
+
+#include <linux/tracepoint.h>
+
+DECLARE_EVENT_CLASS(ext2_dio_class,
+ TP_PROTO(struct kiocb *iocb, struct iov_iter *iter, int ret),
+ TP_ARGS(iocb, iter, ret),
+ TP_STRUCT__entry(
+ __field(dev_t, dev)
+ __field(ino_t, ino)
+ __field(loff_t, isize)
+ __field(loff_t, pos)
+ __field(size_t, count)
+ __field(int, ki_flags)
+ __field(int, aio)
+ __field(int, ret)
+ ),
+ TP_fast_assign(
+ __entry->dev = file_inode(iocb->ki_filp)->i_sb->s_dev;
+ __entry->ino = file_inode(iocb->ki_filp)->i_ino;
+ __entry->isize = file_inode(iocb->ki_filp)->i_size;
+ __entry->pos = iocb->ki_pos;
+ __entry->count = iter ? iov_iter_count(iter) : ret;
+ __entry->ki_flags = iocb->ki_flags;
+ __entry->aio = !is_sync_kiocb(iocb);
+ __entry->ret = iter ? ret : 0;
+ ),
+ TP_printk("dev %d:%d ino 0x%lx isize 0x%llx pos 0x%llx count %ld flags %s aio %d ret=%d",
+ MAJOR(__entry->dev), MINOR(__entry->dev),
+ __entry->ino,
+ __entry->isize,
+ __entry->pos,
+ __entry->count,
+ __print_flags(__entry->ki_flags, "|", IOCB_STRINGS),
+ __entry->aio,
+ __entry->ret)
+)
+
+#define DEFINE_RW_EVENT(name) \
+DEFINE_EVENT(ext2_dio_class, name, \
+ TP_PROTO(struct kiocb *iocb, struct iov_iter *iter, int ret), \
+ TP_ARGS(iocb, iter, ret))
+DEFINE_RW_EVENT(ext2_dio_write_iter_start);
+DEFINE_RW_EVENT(ext2_dio_write_iter_dio_end);
+DEFINE_RW_EVENT(ext2_dio_write_iter_buff_end);
+DEFINE_RW_EVENT(ext2_dio_write_end_io);
+DEFINE_RW_EVENT(ext2_dio_read_iter_start);
+DEFINE_RW_EVENT(ext2_dio_read_iter_end);
+
+#endif /* _EXT2_TRACE_H */
+
+#undef TRACE_INCLUDE_PATH
+#define TRACE_INCLUDE_PATH .
+#define TRACE_INCLUDE_FILE trace
+#include <trace/define_trace.h>
--
2.39.2

2023-04-11 05:34:43

by Ritesh Harjani

[permalink] [raw]
Subject: [RFCv2 1/8] ext2/dax: Fix ext2_setsize when len is page aligned

PAGE_ALIGN(x) macro gives the next highest value which is multiple of
pagesize. But if x is already page aligned then it simply returns x.
So, if x passed is 0 in dax_zero_range() function, that means the
length gets passed as 0 to ->iomap_begin().

In ext2 it then calls ext2_get_blocks -> max_blocks as 0 and hits bug_on
here in ext2_get_blocks().
BUG_ON(maxblocks == 0);

Instead we should be calling dax_truncate_page() here which takes
care of it. i.e. it only calls dax_zero_range if the offset is not
page/block aligned.

This can be easily triggered with following on fsdax mounted pmem
device.

dd if=/dev/zero of=file count=1 bs=512
truncate -s 0 file

[79.525838] EXT2-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
[79.529376] ext2 filesystem being mounted at /mnt1/test supports timestamps until 2038 (0x7fffffff)
[93.793207] ------------[ cut here ]------------
[93.795102] kernel BUG at fs/ext2/inode.c:637!
[93.796904] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[93.798659] CPU: 0 PID: 1192 Comm: truncate Not tainted 6.3.0-rc2-xfstests-00056-g131086faa369 #139
[93.806459] RIP: 0010:ext2_get_blocks.constprop.0+0x524/0x610
<...>
[93.835298] Call Trace:
[93.836253] <TASK>
[93.837103] ? lock_acquire+0xf8/0x110
[93.838479] ? d_lookup+0x69/0xd0
[93.839779] ext2_iomap_begin+0xa7/0x1c0
[93.841154] iomap_iter+0xc7/0x150
[93.842425] dax_zero_range+0x6e/0xa0
[93.843813] ext2_setsize+0x176/0x1b0
[93.845164] ext2_setattr+0x151/0x200
[93.846467] notify_change+0x341/0x4e0
[93.847805] ? lock_acquire+0xf8/0x110
[93.849143] ? do_truncate+0x74/0xe0
[93.850452] ? do_truncate+0x84/0xe0
[93.851739] do_truncate+0x84/0xe0
[93.852974] do_sys_ftruncate+0x2b4/0x2f0
[93.854404] do_syscall_64+0x3f/0x90
[93.855789] entry_SYSCALL_64_after_hwframe+0x72/0xdc

Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext2/inode.c | 5 ++---
1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 26f135e7ffce..dc76147e7b07 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1259,9 +1259,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
inode_dio_wait(inode);

if (IS_DAX(inode))
- error = dax_zero_range(inode, newsize,
- PAGE_ALIGN(newsize) - newsize, NULL,
- &ext2_iomap_ops);
+ error = dax_truncate_page(inode, newsize, NULL,
+ &ext2_iomap_ops);
else
error = block_truncate_page(inode->i_mapping,
newsize, ext2_get_block);
--
2.39.2

2023-04-11 05:35:47

by Ritesh Harjani

[permalink] [raw]
Subject: [RFCv2 6/8] iomap: Remove IOMAP_DIO_NOSYNC unused dio flag

IOMAP_DIO_NOSYNC earlier was added for use in btrfs. But it seems for
aio dsync writes this is not useful anyway. For aio dsync case, we
we queue the request and return -EIOCBQUEUED. Now, since IOMAP_DIO_NOSYNC
doesn't let iomap_dio_complete() to call generic_write_sync(),
hence we may lose the sync write.

Hence kill this flag as it is not in use by any FS now.

Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/iomap/direct-io.c | 2 +-
include/linux/iomap.h | 6 ------
2 files changed, 1 insertion(+), 7 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index f771001574d0..36ab1152dbea 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -541,7 +541,7 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
}

/* for data sync or sync, we need sync completion processing */
- if (iocb_is_dsync(iocb) && !(dio_flags & IOMAP_DIO_NOSYNC)) {
+ if (iocb_is_dsync(iocb)) {
dio->flags |= IOMAP_DIO_NEED_SYNC;

/*
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 0f8123504e5e..e2b836c2e119 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -377,12 +377,6 @@ struct iomap_dio_ops {
*/
#define IOMAP_DIO_PARTIAL (1 << 2)

-/*
- * The caller will sync the write if needed; do not sync it within
- * iomap_dio_rw. Overrides IOMAP_DIO_FORCE_WAIT.
- */
-#define IOMAP_DIO_NOSYNC (1 << 3)
-
ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
unsigned int dio_flags, void *private, size_t done_before);
--
2.39.2

2023-04-11 05:35:49

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFCv2 2/8] libfs: Add __generic_file_fsync_nolock implementation

On Tue, Apr 11, 2023 at 10:51:50AM +0530, Ritesh Harjani (IBM) wrote:
> +/**
> + * __generic_file_fsync_nolock - generic fsync implementation for simple
> + * filesystems with no inode lock

No reallz need for the __ prefix in the name.

> +extern int __generic_file_fsync_nolock(struct file *, loff_t, loff_t, int);

No need for the extern. And at least I personally prefer to spell out
the parameter names to make the prototype much more readable.

2023-04-11 05:35:58

by Ritesh Harjani

[permalink] [raw]
Subject: [RFCv2 4/8] ext2: Use __generic_file_fsync_nolock implementation

Next patch converts ext2 to use iomap interface for DIO.
iomap layer can call generic_write_sync() -> ext2_fsync() from
iomap_dio_complete while still holding the inode_lock().

Now writeback from other paths doesn't need inode_lock().
It seems there is also no need of an inode_lock for
sync_mapping_buffers(). It uses it's own mapping->private_lock
for it's buffer list handling.
Hence this patch is in preparation to move ext2 to iomap.
This uses __generic_file_fsync_nolock() variant in ext2_fsync().

Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext2/file.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 6b4bebe982ca..1d0bc3fc88bb 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -153,7 +153,9 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
int ret;
struct super_block *sb = file->f_mapping->host->i_sb;

- ret = generic_file_fsync(file, start, end, datasync);
+ ret = __generic_file_fsync_nolock(file, start, end, datasync);
+ if (!ret)
+ ret = blkdev_issue_flush(sb->s_bdev);
if (ret == -EIO)
/* We don't really know where the IO error happened... */
ext2_error(sb, __func__,
--
2.39.2

2023-04-11 05:36:31

by Ritesh Harjani

[permalink] [raw]
Subject: [RFCv2 5/8] ext2: Move direct-io to use iomap

This patch converts ext2 direct-io path to iomap interface.
- This also takes care of DIO_SKIP_HOLES part in which we return -ENOTBLK
from ext2_iomap_begin(), in case if the write is done on a hole.
- This fallbacks to buffered-io in case of DIO_SKIP_HOLES or in case of
a partial write or if any error is detected in ext2_iomap_end().
We try to return -ENOTBLK in such cases.
- For any unaligned or extending DIO writes, we pass
IOMAP_DIO_FORCE_WAIT flag to ensure synchronous writes.
- For extending writes we set IOMAP_F_DIRTY in ext2_iomap_begin because
otherwise with dsync writes on devices that support FUA, generic_write_sync
won't be called and we might miss inode metadata updates.
- Since ext2 already now uses _nolock vartiant of sync write. Hence
there is no inode lock problem with iomap in this patch.
- ext2_iomap_ops are not being shared by DIO, DAX & fiemap path

Signed-off-by: Ritesh Harjani (IBM) <[email protected]>
---
fs/ext2/ext2.h | 1 +
fs/ext2/file.c | 115 ++++++++++++++++++++++++++++++++++++++++++++++++
fs/ext2/inode.c | 52 ++++++++++++++--------
3 files changed, 149 insertions(+), 19 deletions(-)

diff --git a/fs/ext2/ext2.h b/fs/ext2/ext2.h
index cb78d7dcfb95..cb5e309fe040 100644
--- a/fs/ext2/ext2.h
+++ b/fs/ext2/ext2.h
@@ -753,6 +753,7 @@ extern unsigned long ext2_count_free (struct buffer_head *, unsigned);
extern struct inode *ext2_iget (struct super_block *, unsigned long);
extern int ext2_write_inode (struct inode *, struct writeback_control *);
extern void ext2_evict_inode(struct inode *);
+extern void ext2_write_failed(struct address_space *mapping, loff_t to);
extern int ext2_get_block(struct inode *, sector_t, struct buffer_head *, int);
extern int ext2_setattr (struct mnt_idmap *, struct dentry *, struct iattr *);
extern int ext2_getattr (struct mnt_idmap *, const struct path *,
diff --git a/fs/ext2/file.c b/fs/ext2/file.c
index 1d0bc3fc88bb..3511ef85379f 100644
--- a/fs/ext2/file.c
+++ b/fs/ext2/file.c
@@ -163,12 +163,124 @@ int ext2_fsync(struct file *file, loff_t start, loff_t end, int datasync)
return ret;
}

+static ssize_t ext2_dio_read_iter(struct kiocb *iocb, struct iov_iter *to)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file->f_mapping->host;
+ ssize_t ret;
+
+ inode_lock_shared(inode);
+ ret = iomap_dio_rw(iocb, to, &ext2_iomap_ops, NULL, 0, NULL, 0);
+ inode_unlock_shared(inode);
+
+ return ret;
+}
+
+static int ext2_dio_write_end_io(struct kiocb *iocb, ssize_t size,
+ int error, unsigned int flags)
+{
+ loff_t pos = iocb->ki_pos;
+ struct inode *inode = file_inode(iocb->ki_filp);
+
+ if (error)
+ return error;
+
+ /*
+ * If we are extending the file, we have to update i_size here before
+ * page cache gets invalidated in iomap_dio_rw(). This prevents racing
+ * buffered reads from zeroing out too much from page cache pages.
+ * Note that all extending writes always happens synchronously with
+ * inode lock held by ext2_dio_write_iter(). So it is safe to update
+ * inode size here for extending file writes.
+ */
+ pos += size;
+ if (pos > i_size_read(inode)) {
+ i_size_write(inode, pos);
+ mark_inode_dirty(inode);
+ }
+
+ return 0;
+}
+
+static const struct iomap_dio_ops ext2_dio_write_ops = {
+ .end_io = ext2_dio_write_end_io,
+};
+
+static ssize_t ext2_dio_write_iter(struct kiocb *iocb, struct iov_iter *from)
+{
+ struct file *file = iocb->ki_filp;
+ struct inode *inode = file->f_mapping->host;
+ ssize_t ret;
+ unsigned int flags = 0;
+ unsigned long blocksize = inode->i_sb->s_blocksize;
+ loff_t offset = iocb->ki_pos;
+ loff_t count = iov_iter_count(from);
+
+ inode_lock(inode);
+ ret = generic_write_checks(iocb, from);
+ if (ret <= 0)
+ goto out_unlock;
+
+ ret = kiocb_modified(iocb);
+ if (ret)
+ goto out_unlock;
+
+ /* use IOMAP_DIO_FORCE_WAIT for unaligned or extending writes */
+ if (iocb->ki_pos + iov_iter_count(from) > i_size_read(inode) ||
+ (!IS_ALIGNED(iocb->ki_pos | iov_iter_alignment(from), blocksize)))
+ flags |= IOMAP_DIO_FORCE_WAIT;
+
+ ret = iomap_dio_rw(iocb, from, &ext2_iomap_ops, &ext2_dio_write_ops,
+ flags, NULL, 0);
+
+ /* ENOTBLK is magic return value for fallback to buffered-io */
+ if (ret == -ENOTBLK)
+ ret = 0;
+
+ if (ret < 0 && ret != -EIOCBQUEUED)
+ ext2_write_failed(inode->i_mapping, offset + count);
+
+ /* handle case for partial write and for fallback to buffered write */
+ if (ret >= 0 && iov_iter_count(from)) {
+ loff_t pos, endbyte;
+ ssize_t status;
+ int ret2;
+
+ iocb->ki_flags &= ~IOCB_DIRECT;
+ pos = iocb->ki_pos;
+ status = generic_perform_write(iocb, from);
+ if (unlikely(status < 0)) {
+ ret = status;
+ goto out_unlock;
+ }
+
+ iocb->ki_pos += status;
+ ret += status;
+ endbyte = pos + status - 1;
+ ret2 = filemap_write_and_wait_range(inode->i_mapping, pos,
+ endbyte);
+ if (!ret2)
+ invalidate_mapping_pages(inode->i_mapping,
+ pos >> PAGE_SHIFT,
+ endbyte >> PAGE_SHIFT);
+ if (ret > 0)
+ generic_write_sync(iocb, ret);
+ }
+
+out_unlock:
+ inode_unlock(inode);
+ return ret;
+}
+
static ssize_t ext2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
#ifdef CONFIG_FS_DAX
if (IS_DAX(iocb->ki_filp->f_mapping->host))
return ext2_dax_read_iter(iocb, to);
#endif
+ if (iocb->ki_flags & IOCB_DIRECT)
+ return ext2_dio_read_iter(iocb, to);
+
return generic_file_read_iter(iocb, to);
}

@@ -178,6 +290,9 @@ static ssize_t ext2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
if (IS_DAX(iocb->ki_filp->f_mapping->host))
return ext2_dax_write_iter(iocb, from);
#endif
+ if (iocb->ki_flags & IOCB_DIRECT)
+ return ext2_dio_write_iter(iocb, from);
+
return generic_file_write_iter(iocb, from);
}

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index dc76147e7b07..e8e66156fc5e 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -56,7 +56,7 @@ static inline int ext2_inode_is_fast_symlink(struct inode *inode)

static void ext2_truncate_blocks(struct inode *inode, loff_t offset);

-static void ext2_write_failed(struct address_space *mapping, loff_t to)
+void ext2_write_failed(struct address_space *mapping, loff_t to)
{
struct inode *inode = mapping->host;

@@ -809,9 +809,26 @@ static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
bool new = false, boundary = false;
u32 bno;
int ret;
+ bool create = flags & IOMAP_WRITE;
+
+ /*
+ * For writes that could fill holes inside i_size on a
+ * DIO_SKIP_HOLES filesystem we forbid block creations: only
+ * overwrites are permitted.
+ */
+ if ((flags & IOMAP_DIRECT) && (first_block << blkbits < i_size_read(inode)))
+ create = 0;
+
+ /*
+ * Writes that span EOF might trigger an IO size update on completion,
+ * so consider them to be dirty for the purposes of O_DSYNC even if
+ * there is no other metadata changes pending or have been made here.
+ */
+ if ((flags & IOMAP_WRITE) && (offset + length > i_size_read(inode)))
+ iomap->flags |= IOMAP_F_DIRTY;

ret = ext2_get_blocks(inode, first_block, max_blocks,
- &bno, &new, &boundary, flags & IOMAP_WRITE);
+ &bno, &new, &boundary, create);
if (ret < 0)
return ret;

@@ -823,6 +840,12 @@ static int ext2_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
iomap->bdev = inode->i_sb->s_bdev;

if (ret == 0) {
+ /*
+ * Switch to buffered-io for writing to holes in a non-extent
+ * based filesystem to avoid stale data exposure problem.
+ */
+ if (!create && (flags & IOMAP_WRITE) && (flags & IOMAP_DIRECT))
+ return -ENOTBLK;
iomap->type = IOMAP_HOLE;
iomap->addr = IOMAP_NULL_ADDR;
iomap->length = 1 << blkbits;
@@ -844,6 +867,13 @@ static int
ext2_iomap_end(struct inode *inode, loff_t offset, loff_t length,
ssize_t written, unsigned flags, struct iomap *iomap)
{
+ /*
+ * Switch to buffered-io in case of any error.
+ * Blocks allocated can be used by the buffered-io path.
+ */
+ if ((flags & IOMAP_DIRECT) && (flags & IOMAP_WRITE) && written == 0)
+ return -ENOTBLK;
+
if (iomap->type == IOMAP_MAPPED &&
written < length &&
(flags & IOMAP_WRITE))
@@ -908,22 +938,6 @@ static sector_t ext2_bmap(struct address_space *mapping, sector_t block)
return generic_block_bmap(mapping,block,ext2_get_block);
}

-static ssize_t
-ext2_direct_IO(struct kiocb *iocb, struct iov_iter *iter)
-{
- struct file *file = iocb->ki_filp;
- struct address_space *mapping = file->f_mapping;
- struct inode *inode = mapping->host;
- size_t count = iov_iter_count(iter);
- loff_t offset = iocb->ki_pos;
- ssize_t ret;
-
- ret = blockdev_direct_IO(iocb, inode, iter, ext2_get_block);
- if (ret < 0 && iov_iter_rw(iter) == WRITE)
- ext2_write_failed(mapping, offset + count);
- return ret;
-}
-
static int
ext2_writepages(struct address_space *mapping, struct writeback_control *wbc)
{
@@ -946,7 +960,7 @@ const struct address_space_operations ext2_aops = {
.write_begin = ext2_write_begin,
.write_end = ext2_write_end,
.bmap = ext2_bmap,
- .direct_IO = ext2_direct_IO,
+ .direct_IO = noop_direct_IO,
.writepages = ext2_writepages,
.migrate_folio = buffer_migrate_folio,
.is_partially_uptodate = block_is_partially_uptodate,
--
2.39.2

2023-04-11 05:39:26

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFCv2 7/8] fs.h: Add IOCB_STRINGS for use in trace points

Looks good:

Reviewed-by: Christoph Hellwig <[email protected]>

2023-04-11 05:39:41

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFCv2 8/8] ext2: Add direct-io trace points

On Tue, Apr 11, 2023 at 10:51:56AM +0530, Ritesh Harjani (IBM) wrote:
> This patch adds the trace point to ext2 direct-io apis
> in fs/ext2/file.c

Wouldn't it make more sense to add this to iomap instead?

2023-04-11 05:48:47

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFCv2 5/8] ext2: Move direct-io to use iomap

On Tue, Apr 11, 2023 at 10:51:53AM +0530, Ritesh Harjani (IBM) wrote:
> +extern void ext2_write_failed(struct address_space *mapping, loff_t to);

No need for the extern.

> + /* handle case for partial write and for fallback to buffered write */
> + if (ret >= 0 && iov_iter_count(from)) {
> + loff_t pos, endbyte;
> + ssize_t status;
> + int ret2;
> +
> + iocb->ki_flags &= ~IOCB_DIRECT;
> + pos = iocb->ki_pos;
> + status = generic_perform_write(iocb, from);
> + if (unlikely(status < 0)) {
> + ret = status;
> + goto out_unlock;
> + }
> +
> + iocb->ki_pos += status;
> + ret += status;
> + endbyte = pos + status - 1;
> + ret2 = filemap_write_and_wait_range(inode->i_mapping, pos,
> + endbyte);
> + if (!ret2)
> + invalidate_mapping_pages(inode->i_mapping,
> + pos >> PAGE_SHIFT,
> + endbyte >> PAGE_SHIFT);
> + if (ret > 0)
> + generic_write_sync(iocb, ret);
> + }

Nit, but to me it would seem cleaner if all the fallback handling
was moved into a separate helper function. Or in fact by not
using generic_file_write_iter even for buffered I/O and at doing
the pre-I/O checks and the final generic_write_sync in common code in
ext2 for direct and buffered I/O.

> + /*
> + * For writes that could fill holes inside i_size on a
> + * DIO_SKIP_HOLES filesystem we forbid block creations: only
> + * overwrites are permitted.
> + */
> + if ((flags & IOMAP_DIRECT) && (first_block << blkbits < i_size_read(inode)))
> + create = 0;

No need for braes around the < operation, but I think you might need
them around the shift.

Also an overly long line here.

> + if ((flags & IOMAP_WRITE) && (offset + length > i_size_read(inode)))

No need for the second set of inner braces here either.

2023-04-11 12:37:25

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [RFCv2 2/8] libfs: Add __generic_file_fsync_nolock implementation

On Mon, Apr 10, 2023 at 10:27:10PM -0700, Christoph Hellwig wrote:
> On Tue, Apr 11, 2023 at 10:51:50AM +0530, Ritesh Harjani (IBM) wrote:
> > +/**
> > + * __generic_file_fsync_nolock - generic fsync implementation for simple
> > + * filesystems with no inode lock
>
> No reallz need for the __ prefix in the name.

It kind of makes sense though.

generic_file_fsync does the flush
__generic_file_fsync doesn't do the flush
__generic_file_fsync_nolock doesn't do the flush and doesn't lock/unlock

> > +extern int __generic_file_fsync_nolock(struct file *, loff_t, loff_t, int);
>
> No need for the extern. And at least I personally prefer to spell out
> the parameter names to make the prototype much more readable.

Agreed, although I make an exception for the 'struct file *'. Naming that
parameter adds no value, but a plain int is just obscene.

int __generic_file_fsync_nolock(struct file *, loff_t start, loff_t end,
bool datasync);

(yes, the other variants don't use a bool for datasync, but they should)

2023-04-11 14:37:08

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [RFCv2 1/8] ext2/dax: Fix ext2_setsize when len is page aligned

On Tue, Apr 11, 2023 at 10:51:49AM +0530, Ritesh Harjani (IBM) wrote:
> PAGE_ALIGN(x) macro gives the next highest value which is multiple of
> pagesize. But if x is already page aligned then it simply returns x.
> So, if x passed is 0 in dax_zero_range() function, that means the
> length gets passed as 0 to ->iomap_begin().
>
> In ext2 it then calls ext2_get_blocks -> max_blocks as 0 and hits bug_on
> here in ext2_get_blocks().
> BUG_ON(maxblocks == 0);
>
> Instead we should be calling dax_truncate_page() here which takes
> care of it. i.e. it only calls dax_zero_range if the offset is not
> page/block aligned.
>
> This can be easily triggered with following on fsdax mounted pmem
> device.
>
> dd if=/dev/zero of=file count=1 bs=512
> truncate -s 0 file
>
> [79.525838] EXT2-fs (pmem0): DAX enabled. Warning: EXPERIMENTAL, use at your own risk
> [79.529376] ext2 filesystem being mounted at /mnt1/test supports timestamps until 2038 (0x7fffffff)
> [93.793207] ------------[ cut here ]------------
> [93.795102] kernel BUG at fs/ext2/inode.c:637!
> [93.796904] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> [93.798659] CPU: 0 PID: 1192 Comm: truncate Not tainted 6.3.0-rc2-xfstests-00056-g131086faa369 #139
> [93.806459] RIP: 0010:ext2_get_blocks.constprop.0+0x524/0x610
> <...>
> [93.835298] Call Trace:
> [93.836253] <TASK>
> [93.837103] ? lock_acquire+0xf8/0x110
> [93.838479] ? d_lookup+0x69/0xd0
> [93.839779] ext2_iomap_begin+0xa7/0x1c0
> [93.841154] iomap_iter+0xc7/0x150
> [93.842425] dax_zero_range+0x6e/0xa0
> [93.843813] ext2_setsize+0x176/0x1b0
> [93.845164] ext2_setattr+0x151/0x200
> [93.846467] notify_change+0x341/0x4e0
> [93.847805] ? lock_acquire+0xf8/0x110
> [93.849143] ? do_truncate+0x74/0xe0
> [93.850452] ? do_truncate+0x84/0xe0
> [93.851739] do_truncate+0x84/0xe0
> [93.852974] do_sys_ftruncate+0x2b4/0x2f0
> [93.854404] do_syscall_64+0x3f/0x90
> [93.855789] entry_SYSCALL_64_after_hwframe+0x72/0xdc
>
> Signed-off-by: Ritesh Harjani (IBM) <[email protected]>

Would seem to make sense...
Reviewed-by: Darrick J. Wong <[email protected]>

--D

> ---
> fs/ext2/inode.c | 5 ++---
> 1 file changed, 2 insertions(+), 3 deletions(-)
>
> diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
> index 26f135e7ffce..dc76147e7b07 100644
> --- a/fs/ext2/inode.c
> +++ b/fs/ext2/inode.c
> @@ -1259,9 +1259,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
> inode_dio_wait(inode);
>
> if (IS_DAX(inode))
> - error = dax_zero_range(inode, newsize,
> - PAGE_ALIGN(newsize) - newsize, NULL,
> - &ext2_iomap_ops);
> + error = dax_truncate_page(inode, newsize, NULL,
> + &ext2_iomap_ops);
> else
> error = block_truncate_page(inode->i_mapping,
> newsize, ext2_get_block);
> --
> 2.39.2
>

2023-04-11 14:37:51

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [RFCv2 8/8] ext2: Add direct-io trace points

On Mon, Apr 10, 2023 at 10:38:38PM -0700, Christoph Hellwig wrote:
> On Tue, Apr 11, 2023 at 10:51:56AM +0530, Ritesh Harjani (IBM) wrote:
> > This patch adds the trace point to ext2 direct-io apis
> > in fs/ext2/file.c
>
> Wouldn't it make more sense to add this to iomap instead?

Yes please. :)

--D

2023-04-11 15:13:05

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFCv2 8/8] ext2: Add direct-io trace points

"Darrick J. Wong" <[email protected]> writes:

> On Mon, Apr 10, 2023 at 10:38:38PM -0700, Christoph Hellwig wrote:
>> On Tue, Apr 11, 2023 at 10:51:56AM +0530, Ritesh Harjani (IBM) wrote:
>> > This patch adds the trace point to ext2 direct-io apis
>> > in fs/ext2/file.c
>>
>> Wouldn't it make more sense to add this to iomap instead?
>
> Yes please. :)

Sure. Let me also add a trace event for iomap DIO as well in the next revision.
However I would like to keep ext2 dio trace point as is :)

-ritesh

2023-04-11 15:16:06

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFCv2 2/8] libfs: Add __generic_file_fsync_nolock implementation

Matthew Wilcox <[email protected]> writes:

> On Mon, Apr 10, 2023 at 10:27:10PM -0700, Christoph Hellwig wrote:
>> On Tue, Apr 11, 2023 at 10:51:50AM +0530, Ritesh Harjani (IBM) wrote:
>> > +/**
>> > + * __generic_file_fsync_nolock - generic fsync implementation for simple
>> > + * filesystems with no inode lock
>>
>> No reallz need for the __ prefix in the name.
>
> It kind of makes sense though.
>
> generic_file_fsync does the flush
> __generic_file_fsync doesn't do the flush
> __generic_file_fsync_nolock doesn't do the flush and doesn't lock/unlock

Yes.

>
>> > +extern int __generic_file_fsync_nolock(struct file *, loff_t, loff_t, int);
>>
>> No need for the extern. And at least I personally prefer to spell out
>> the parameter names to make the prototype much more readable.
>
> Agreed, although I make an exception for the 'struct file *'. Naming that
> parameter adds no value, but a plain int is just obscene.
>
> int __generic_file_fsync_nolock(struct file *, loff_t start, loff_t end,
> bool datasync);
>
> (yes, the other variants don't use a bool for datasync, but they should)

Sure. Will make the change.

-ritesh

2023-04-11 15:30:49

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFCv2 5/8] ext2: Move direct-io to use iomap

Christoph Hellwig <[email protected]> writes:

> On Tue, Apr 11, 2023 at 10:51:53AM +0530, Ritesh Harjani (IBM) wrote:
>> +extern void ext2_write_failed(struct address_space *mapping, loff_t to);
>
> No need for the extern.
>

Sure will drop it.

>> + /* handle case for partial write and for fallback to buffered write */
>> + if (ret >= 0 && iov_iter_count(from)) {
>> + loff_t pos, endbyte;
>> + ssize_t status;
>> + int ret2;
>> +
>> + iocb->ki_flags &= ~IOCB_DIRECT;
>> + pos = iocb->ki_pos;
>> + status = generic_perform_write(iocb, from);
>> + if (unlikely(status < 0)) {
>> + ret = status;
>> + goto out_unlock;
>> + }
>> +
>> + iocb->ki_pos += status;
>> + ret += status;
>> + endbyte = pos + status - 1;
>> + ret2 = filemap_write_and_wait_range(inode->i_mapping, pos,
>> + endbyte);
>> + if (!ret2)
>> + invalidate_mapping_pages(inode->i_mapping,
>> + pos >> PAGE_SHIFT,
>> + endbyte >> PAGE_SHIFT);
>> + if (ret > 0)
>> + generic_write_sync(iocb, ret);
>> + }
>
> Nit, but to me it would seem cleaner if all the fallback handling
> was moved into a separate helper function. Or in fact by not
> using generic_file_write_iter even for buffered I/O and at doing
> the pre-I/O checks and the final generic_write_sync in common code in
> ext2 for direct and buffered I/O.
>

Make sense. However, since we are on the path to modify ext2 buffered-io
code as well to move to iomap interface, I wouldn't bother too much as
of now for this code as, all of this is going to go away anyways.


>> + /*
>> + * For writes that could fill holes inside i_size on a
>> + * DIO_SKIP_HOLES filesystem we forbid block creations: only
>> + * overwrites are permitted.
>> + */
>> + if ((flags & IOMAP_DIRECT) && (first_block << blkbits < i_size_read(inode)))
>> + create = 0;
>
> No need for braes around the < operation, but I think you might need
> them around the shift.

left-shift has a higher precedence. But let me make it more clear in
next rev.

>
> Also an overly long line here.
>

Sure, will see to it.

>> + if ((flags & IOMAP_WRITE) && (offset + length > i_size_read(inode)))
>
> No need for the second set of inner braces here either.

It's just avoids any confusion this way.

-ritesh

2023-04-12 11:51:12

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFCv2 2/8] libfs: Add __generic_file_fsync_nolock implementation

On Tue, Apr 11, 2023 at 01:33:17PM +0100, Matthew Wilcox wrote:
> On Mon, Apr 10, 2023 at 10:27:10PM -0700, Christoph Hellwig wrote:
> > On Tue, Apr 11, 2023 at 10:51:50AM +0530, Ritesh Harjani (IBM) wrote:
> > > +/**
> > > + * __generic_file_fsync_nolock - generic fsync implementation for simple
> > > + * filesystems with no inode lock
> >
> > No reallz need for the __ prefix in the name.
>
> It kind of makes sense though.
>
> generic_file_fsync does the flush
> __generic_file_fsync doesn't do the flush
> __generic_file_fsync_nolock doesn't do the flush and doesn't lock/unlock

Indeed. Part of it is that the naming is a bit horrible.
Maybe it should move to buffer.c and be called generic_buffer_fsync,
or generic_block_fsync which still wouldn't be perfect but match the
buffer.c naming scheme.

>
> > > +extern int __generic_file_fsync_nolock(struct file *, loff_t, loff_t, int);
> >
> > No need for the extern. And at least I personally prefer to spell out
> > the parameter names to make the prototype much more readable.
>
> Agreed, although I make an exception for the 'struct file *'. Naming that
> parameter adds no value, but a plain int is just obscene.
>
> int __generic_file_fsync_nolock(struct file *, loff_t start, loff_t end,
> bool datasync);

While I agree that it's not needed for the file, leaving it out is a bit
silly.

> (yes, the other variants don't use a bool for datasync, but they should)

.. including the ->fsync prototype to make it work ..

2023-04-12 11:51:55

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFCv2 5/8] ext2: Move direct-io to use iomap

On Tue, Apr 11, 2023 at 08:51:24PM +0530, Ritesh Harjani wrote:
> >> + if ((flags & IOMAP_WRITE) && (offset + length > i_size_read(inode)))
> >
> > No need for the second set of inner braces here either.
>
> It's just avoids any confusion this way.

Does it? Except for some really weird places it is very unusual in
the kernel. To me it really does add confusion.

2023-04-12 11:51:55

by Jan Kara

[permalink] [raw]
Subject: Re: [RFCv2 0/8] ext2: DIO to use iomap

Hi Ritesh!

On Tue 11-04-23 10:51:48, Ritesh Harjani (IBM) wrote:
> Please find the series which moves ext2 direct-io to use modern iomap interface.
>
> Here are some more details -
> 1. Patch-1: Fixes a kernel bug_on problem with ext2 dax code (found during code
> review and testing).
> 2. Patch-2: Adds a __generic_file_fsync_nolock implementation as we had
> discussed.
> 3. Patch-3 & Patch-4: Moves ext4 nojournal and ext2 to use _nolock method.
> 4. Patch-5: This is the main patch which moves ext2 direct-io to use iomap.
> (more details can be found in the patch)
> 5. Patch-6: Kills IOMAP_DIO_NOSYNC flag as it is not in use by any filesystem.
> 6. Patch-7: adds IOCB_STRINGS macro for use in trace events for better trace
> output of iocb flags.
> 7. Patch-8: Add ext2 trace point for DIO.
>
> Testing:
> =========
> This passes ext2 "auto" group testing for fstests. There were no new failures
> with this patches.

I went through the patches and I have no further comments besides what has
been already said. So feel free to update patches based on other feedback,
I'll do the last round of review and can queue the patches to my tree.

Honza

>
>
> Ritesh Harjani (IBM) (8):
> ext2/dax: Fix ext2_setsize when len is page aligned
> libfs: Add __generic_file_fsync_nolock implementation
> ext4: Use __generic_file_fsync_nolock implementation
> ext2: Use __generic_file_fsync_nolock implementation
> ext2: Move direct-io to use iomap
> iomap: Remove IOMAP_DIO_NOSYNC unused dio flag
> fs.h: Add IOCB_STRINGS for use in trace points
> ext2: Add direct-io trace points
>
> fs/ext2/Makefile | 2 +-
> fs/ext2/ext2.h | 1 +
> fs/ext2/file.c | 127 +++++++++++++++++++++++++++++++++++++++++-
> fs/ext2/inode.c | 57 +++++++++++--------
> fs/ext2/trace.c | 5 ++
> fs/ext2/trace.h | 61 ++++++++++++++++++++
> fs/ext4/fsync.c | 31 +++++------
> fs/iomap/direct-io.c | 2 +-
> fs/libfs.c | 43 ++++++++++++++
> include/linux/fs.h | 15 +++++
> include/linux/iomap.h | 6 --
> 11 files changed, 303 insertions(+), 47 deletions(-)
> create mode 100644 fs/ext2/trace.c
> create mode 100644 fs/ext2/trace.h
>
> --
> 2.39.2
>
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-04-12 11:52:31

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [RFCv2 8/8] ext2: Add direct-io trace points

On Tue, Apr 11, 2023 at 08:41:05PM +0530, Ritesh Harjani wrote:
> Sure. Let me also add a trace event for iomap DIO as well in the next revision.
> However I would like to keep ext2 dio trace point as is :)

Am I confused, or does this patch only add new trace points anyway?

>
> -ritesh
---end quoted text---

2023-04-12 14:11:49

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFCv2 2/8] libfs: Add __generic_file_fsync_nolock implementation

Christoph Hellwig <[email protected]> writes:

> On Tue, Apr 11, 2023 at 01:33:17PM +0100, Matthew Wilcox wrote:
>> On Mon, Apr 10, 2023 at 10:27:10PM -0700, Christoph Hellwig wrote:
>> > On Tue, Apr 11, 2023 at 10:51:50AM +0530, Ritesh Harjani (IBM) wrote:
>> > > +/**
>> > > + * __generic_file_fsync_nolock - generic fsync implementation for simple
>> > > + * filesystems with no inode lock
>> >
>> > No reallz need for the __ prefix in the name.
>>
>> It kind of makes sense though.
>>
>> generic_file_fsync does the flush
>> __generic_file_fsync doesn't do the flush
>> __generic_file_fsync_nolock doesn't do the flush and doesn't lock/unlock
>
> Indeed. Part of it is that the naming is a bit horrible.
> Maybe it should move to buffer.c and be called generic_buffer_fsync,
> or generic_block_fsync which still wouldn't be perfect but match the
> buffer.c naming scheme.
>

Eventually it anyways needs some work to see if we can kill the lock
variant all together. I didn't do that in this series which is
focused on ext2 conversion of iomap.
So, if it's not that bad, I would like to keep both function
definitions at one place so that it can be worked out later.

>>
>> > > +extern int __generic_file_fsync_nolock(struct file *, loff_t, loff_t, int);
>> >
>> > No need for the extern. And at least I personally prefer to spell out
>> > the parameter names to make the prototype much more readable.
>>
>> Agreed, although I make an exception for the 'struct file *'. Naming that
>> parameter adds no value, but a plain int is just obscene.
>>
>> int __generic_file_fsync_nolock(struct file *, loff_t start, loff_t end,
>> bool datasync);
>
> While I agree that it's not needed for the file, leaving it out is a bit
> silly.
>

Sure. Will fix it.

>> (yes, the other variants don't use a bool for datasync, but they should)
>
> .. including the ->fsync prototype to make it work ..

Sure, this work should go as a seperate series.

-ritesh

2023-04-12 14:12:33

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFCv2 5/8] ext2: Move direct-io to use iomap

Christoph Hellwig <[email protected]> writes:

> On Tue, Apr 11, 2023 at 08:51:24PM +0530, Ritesh Harjani wrote:
>> >> + if ((flags & IOMAP_WRITE) && (offset + length > i_size_read(inode)))
>> >
>> > No need for the second set of inner braces here either.
>>
>> It's just avoids any confusion this way.
>
> Does it? Except for some really weird places it is very unusual in
> the kernel. To me it really does add confusion.

Ok, In that case, will change it.

Thanks for the review!
-ritesh

2023-04-12 14:14:25

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFCv2 8/8] ext2: Add direct-io trace points

Christoph Hellwig <[email protected]> writes:

> On Tue, Apr 11, 2023 at 08:41:05PM +0530, Ritesh Harjani wrote:
>> Sure. Let me also add a trace event for iomap DIO as well in the next revision.
>> However I would like to keep ext2 dio trace point as is :)
>
> Am I confused, or does this patch only add new trace points anyway?

I meant to keep ext2 trace point as well. One of the reason for
that would be if someone wants to just enable all ext2 related tracing
logs, it would be easier.

$ trace-cmd -e ext2
$ trace-cmd report

But since iomap dio also doesn't have a tracepoint, I am looking to
add that in the next revision.

-ritesh

2023-04-12 14:15:24

by Ritesh Harjani

[permalink] [raw]
Subject: Re: [RFCv2 0/8] ext2: DIO to use iomap

Jan Kara <[email protected]> writes:

> Hi Ritesh!
>
> On Tue 11-04-23 10:51:48, Ritesh Harjani (IBM) wrote:
>> Please find the series which moves ext2 direct-io to use modern iomap interface.
>>
>> Here are some more details -
>> 1. Patch-1: Fixes a kernel bug_on problem with ext2 dax code (found during code
>> review and testing).
>> 2. Patch-2: Adds a __generic_file_fsync_nolock implementation as we had
>> discussed.
>> 3. Patch-3 & Patch-4: Moves ext4 nojournal and ext2 to use _nolock method.
>> 4. Patch-5: This is the main patch which moves ext2 direct-io to use iomap.
>> (more details can be found in the patch)
>> 5. Patch-6: Kills IOMAP_DIO_NOSYNC flag as it is not in use by any filesystem.
>> 6. Patch-7: adds IOCB_STRINGS macro for use in trace events for better trace
>> output of iocb flags.
>> 7. Patch-8: Add ext2 trace point for DIO.
>>
>> Testing:
>> =========
>> This passes ext2 "auto" group testing for fstests. There were no new failures
>> with this patches.
>
> I went through the patches and I have no further comments besides what has
> been already said. So feel free to update patches based on other feedback,
> I'll do the last round of review and can queue the patches to my tree.
>

Sure, thanks Jan for going over the patches.
I will send the next rev soon addressing review comments along with some
other minor changes.

-ritesh

2023-04-13 09:55:11

by Christian Brauner

[permalink] [raw]
Subject: Re: [RFCv2 2/8] libfs: Add __generic_file_fsync_nolock implementation

On Wed, Apr 12, 2023 at 04:43:03AM -0700, Christoph Hellwig wrote:
> On Tue, Apr 11, 2023 at 01:33:17PM +0100, Matthew Wilcox wrote:
> > On Mon, Apr 10, 2023 at 10:27:10PM -0700, Christoph Hellwig wrote:
> > > On Tue, Apr 11, 2023 at 10:51:50AM +0530, Ritesh Harjani (IBM) wrote:
> > > > +/**
> > > > + * __generic_file_fsync_nolock - generic fsync implementation for simple
> > > > + * filesystems with no inode lock
> > >
> > > No reallz need for the __ prefix in the name.
> >
> > It kind of makes sense though.
> >
> > generic_file_fsync does the flush
> > __generic_file_fsync doesn't do the flush
> > __generic_file_fsync_nolock doesn't do the flush and doesn't lock/unlock
>
> Indeed. Part of it is that the naming is a bit horrible.
> Maybe it should move to buffer.c and be called generic_buffer_fsync,
> or generic_block_fsync which still wouldn't be perfect but match the
> buffer.c naming scheme.
>
> >
> > > > +extern int __generic_file_fsync_nolock(struct file *, loff_t, loff_t, int);
> > >
> > > No need for the extern. And at least I personally prefer to spell out
> > > the parameter names to make the prototype much more readable.
> >
> > Agreed, although I make an exception for the 'struct file *'. Naming that
> > parameter adds no value, but a plain int is just obscene.
> >
> > int __generic_file_fsync_nolock(struct file *, loff_t start, loff_t end,
> > bool datasync);
>
> While I agree that it's not needed for the file, leaving it out is a bit
> silly.

I think we should just be consistent and try to enforce that the
parameter name is added in new patches. It's often easier for grepping
and there's really not a lot of value in leaving it out in general.