2023-12-07 06:51:35

by Theodore Ts'o

Subject: Re: [PATCH] [RFC] iomap: Use FUA for pure data O_DSYNC DIO writes

On Thu, Mar 01, 2018 at 12:41:44PM +1100, Dave Chinner wrote:
> From: Dave Chinner <[email protected]>
>
> If we are doing direct IO writes with datasync semantics, we often
> have to flush metadata changes along with the data write. However,
> if we are overwriting existing data, there are no metadata changes
> that we need to flush. In this case, optimising the IO by using
> FUA write makes sense.
>
> We know from the IOMAP_F_DIRTY flag whether a specific inode
> requires a metadata flush - this is currently used by DAX to ensure
> extent modification is stable in page fault operations. For direct
> IO writes, we can use it to determine if we need to flush metadata
> or not once the data is on disk.

Hi,

I've gotten an inquiry from some engineers at Microsoft who would
really like it if ext4 could use FUA writes when doing O_DSYNC writes,
since this is something that SQL Server uses. In the discussion for
this patch series back in 2018[1], ext4 hadn't yet converted over to
iomap for Direct I/O, and so adding this feature for ext4 wasn't
really practical.

[1] https://lore.kernel.org/all/[email protected]/

Today, ext4 does use iomap for DIO, but an experiment seems to
indicate that something hasn't been wired up to enable FUA for O_DSYNC
writes. I've looked at fs/iomap/direct-io.c and it wasn't immediately
obvious what I need to add to enable this feature.

I was wondering if you could give me some quick hints about what and
where I should be looking?

Many thanks!

- Ted


2023-12-07 07:32:47

by Christoph Hellwig

Subject: Re: [PATCH] [RFC] iomap: Use FUA for pure data O_DSYNC DIO writes

On Thu, Dec 07, 2023 at 01:50:46AM -0500, Theodore Ts'o wrote:
> Today, ext4 does use iomap for DIO, but an experiment seems to
> indicate that something hasn't been wired up to enable FUA for O_DSYNC
> writes. I've looked at fs/iomap/direct-io.c and it wasn't immediately
> obvious what I need to add to enable this feature.
>
> I was wondering if you could give me some quick hints about what and
> where I should be looking?

There isn't really anything strange going on. First, your device needs
to support it: for NVMe the feature is mandatory, but we still disable
it globally for SATA. Second, remember that FUA only makes sense for
O_DSYNC-style writes, not O_SYNC, which also writes out the inode.

Then you need to make sure neither IOMAP_F_SHARED nor IOMAP_F_DIRTY
is set on the iomap. IOMAP_F_SHARED is never set by ext4, so if none
of the above is the culprit, I'd look into whether IOMAP_F_DIRTY ends
up being set for your test I/O.


2023-12-07 23:03:21

by Dave Chinner

Subject: Re: [PATCH] [RFC] iomap: Use FUA for pure data O_DSYNC DIO writes

On Thu, Dec 07, 2023 at 01:50:46AM -0500, Theodore Ts'o wrote:
> On Thu, Mar 01, 2018 at 12:41:44PM +1100, Dave Chinner wrote:
> > From: Dave Chinner <[email protected]>
> >
> > If we are doing direct IO writes with datasync semantics, we often
> > have to flush metadata changes along with the data write. However,
> > if we are overwriting existing data, there are no metadata changes
> > that we need to flush. In this case, optimising the IO by using
> > FUA write makes sense.
> >
> > We know from the IOMAP_F_DIRTY flag whether a specific inode
> > requires a metadata flush - this is currently used by DAX to ensure
> > extent modification is stable in page fault operations. For direct
> > IO writes, we can use it to determine if we need to flush metadata
> > or not once the data is on disk.
>
> Hi,
>
> I've gotten an inquiry from some engineers at Microsoft who would
> really like it if ext4 could use FUA writes when doing O_DSYNC writes,
> since this is something that SQL Server uses. In the discussion for
> this patch series back in 2018[1], ext4 hadn't yet converted over to
> iomap for Direct I/O, and so adding this feature for ext4 wasn't
> really practical.
>
> [1] https://lore.kernel.org/all/[email protected]/
>
> Today, ext4 does use iomap for DIO, but an experiment seems to
> indicate that something hasn't been wired up to enable FUA for O_DSYNC
> writes. I've looked at fs/iomap/direct-io.c and it wasn't immediately
> obvious what I need to add to enable this feature.

I've actually looked briefly into this in the past for ext4 - IIRC
the problem is that mtime updates during the write are causing the
inode to get marked dirty.

Yeah, typical trace from a RWF_DSYNC direct IO through ext4 is:

xfs_io-6002 [007] 2986.814841: ext4_es_lookup_extent_enter: dev 7,0 ino 12 lblk 0
xfs_io-6002 [007] 2986.814848: ext4_es_lookup_extent_exit: dev 7,0 ino 12 found 1 [0/30720) 34304 W
xfs_io-6002 [007] 2986.814856: ext4_journal_start_inode: dev 7,0 blocks 2, rsv_blocks 0, revoke_creds 8, type 1, ino 12, caller ext4_dirty_inode+0x39
xfs_io-6002 [007] 2986.814875: ext4_mark_inode_dirty: dev 7,0 ino 12 caller ext4_dirty_inode+0x5c
xfs_io-6002 [007] 2986.814917: iomap_dio_rw_begin: dev 7:0 ino 0xc size 0x40000000 offset 0x0 length 0x20000 done_before 0x0 flags DIRECT dio_flags aio 0
xfs_io-6002 [007] 2986.814922: iomap_iter: dev 7:0 ino 0xc pos 0x0 length 0x0 flags WRITE|DIRECT (0x11) ops ext4_iomap_overwrite_ops caller __iomap_dio_rw+0x1c2

So I think the problem may be that file_modified() from
ext4_dio_write_checks() triggers mtime updates, which then come back
through mark_inode_dirty() and end up with the inode marked dirty and
tracked by the journal.

Yup, there it is in generic_update_time():

	if (updated & (S_ATIME|S_MTIME|S_CTIME))
		dirty_flags = inode->i_sb->s_flags & SB_LAZYTIME ? I_DIRTY_TIME : I_DIRTY_SYNC;

it's setting the inode I_DIRTY_SYNC for mtime updates.

XFS does not do this. It implements the ->update_time() method, so it
doesn't use generic_update_time(). XFS tracks pure timestamp updates
separately from other dirty inode metadata. We then elide "timestamp
dirty" state from the IOMAP_F_DIRTY checks because O_DSYNC doesn't
require timestamps to be persisted.

FWIW, the patch below allows testing FUA optimisations on loop
devices....

-Dave.
--
Dave Chinner
[email protected]

loop: add support for FUA writes

From: Dave Chinner <[email protected]>

Any file-backed loop device on a filesystem that supports the ->fsync
operation can do FUA writes. All a FUA write requires from the loop
device is RWF_DSYNC semantics for the specific REQ_FUA request that
is being processed, i.e. that the data is on stable storage before the
request completes.

This allows simple testing of things like iomap FUA optimisations,
and should also provide better performance for loop devices in
situations where the upper filesystem is issuing FUA writes to try
to avoid needing journal flushes. Using RWF_DSYNC will also allow
the lower filesystem to attempt to use REQ_FUA on loop devices using
direct IO and potentially improve the IO patterns all the way down
the stack as a result.

Prep:

# losetup --direct-io=on /dev/loop0 /mnt/scratch/foo.img
# mkfs.xfs /dev/loop0
# mount /dev/loop0 /mnt/test
# xfs_io -fdc "pwrite -b 2m 0 1g" -c fsync /mnt/test/foo
# sync

Measure:

# xfs_io -fdc "pwrite -V 1 -D -b 128k 0 1m" -c fsync /mnt/test/foo

Before:

# grep . /sys/block/loop0/queue/*
....
/sys/block/loop0/queue/fua:0
....

# trace-cmd stream -e block* |grep xfs_io
.....
<...>-4654 [007] 809.018173: block_rq_issue: 7,0 WS 131072 () 192 + 256 [xfs_io]
<...>-4654 [007] 809.025910: block_bio_queue: 7,0 WS 448 + 256 [xfs_io]
<...>-4654 [007] 809.025917: block_getrq: 7,0 WS 448 + 256 [xfs_io]
....

After:

# grep . /sys/block/loop0/queue/*
....
/sys/block/loop0/queue/fua:1
....

# trace-cmd stream -e block* |grep xfs_io
.....
<...>-4648 [006] 793.968169: block_rq_issue: 7,0 WFS 131072 () 192 + 256 [xfs_io]
<...>-4648 [006] 793.977521: block_bio_queue: 7,0 WFS 448 + 256 [xfs_io]
<...>-4648 [006] 793.977531: block_getrq: 7,0 WFS 448 + 256 [xfs_io]

The upper XFS filesystem is issuing REQ_FUA writes to the loop
device now.

Signed-off-by: Dave Chinner <[email protected]>
---
drivers/block/loop.c | 29 ++++++++++++++++++++---------
1 file changed, 20 insertions(+), 9 deletions(-)

diff --git a/drivers/block/loop.c b/drivers/block/loop.c
index 9f2d412fc560..6f2df09b3179 100644
--- a/drivers/block/loop.c
+++ b/drivers/block/loop.c
@@ -238,7 +238,8 @@ static void loop_set_size(struct loop_device *lo, loff_t size)
kobject_uevent(&disk_to_dev(lo->lo_disk)->kobj, KOBJ_CHANGE);
}

-static int lo_write_bvec(struct file *file, struct bio_vec *bvec, loff_t *ppos)
+static int lo_write_bvec(struct file *file, struct bio_vec *bvec, loff_t *ppos,
+ rwf_t rw_flags)
{
struct iov_iter i;
ssize_t bw;
@@ -246,7 +247,7 @@ static int lo_write_bvec(struct file *file, struct bio_vec *bvec, loff_t *ppos)
iov_iter_bvec(&i, ITER_SOURCE, bvec, 1, bvec->bv_len);

file_start_write(file);
- bw = vfs_iter_write(file, &i, ppos, 0);
+ bw = vfs_iter_write(file, &i, ppos, rw_flags);
file_end_write(file);

if (likely(bw == bvec->bv_len))
@@ -261,14 +262,14 @@ static int lo_write_bvec(struct file *file, struct bio_vec *bvec, loff_t *ppos)
}

static int lo_write_simple(struct loop_device *lo, struct request *rq,
- loff_t pos)
+ loff_t pos, rwf_t rw_flags)
{
struct bio_vec bvec;
struct req_iterator iter;
int ret = 0;

rq_for_each_segment(bvec, rq, iter) {
- ret = lo_write_bvec(lo->lo_backing_file, &bvec, &pos);
+ ret = lo_write_bvec(lo->lo_backing_file, &bvec, &pos, rw_flags);
if (ret < 0)
break;
cond_resched();
@@ -392,7 +393,7 @@ static void lo_rw_aio_complete(struct kiocb *iocb, long ret)
}

static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
- loff_t pos, int rw)
+ loff_t pos, int rw, rwf_t rw_flags)
{
struct iov_iter iter;
struct req_iterator rq_iter;
@@ -446,6 +447,7 @@ static int lo_rw_aio(struct loop_device *lo, struct loop_cmd *cmd,
cmd->iocb.ki_filp = file;
cmd->iocb.ki_complete = lo_rw_aio_complete;
cmd->iocb.ki_flags = IOCB_DIRECT;
+ kiocb_set_rw_flags(&cmd->iocb, rw_flags);
cmd->iocb.ki_ioprio = IOPRIO_PRIO_VALUE(IOPRIO_CLASS_NONE, 0);

if (rw == ITER_SOURCE)
@@ -464,6 +466,15 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
{
struct loop_cmd *cmd = blk_mq_rq_to_pdu(rq);
loff_t pos = ((loff_t) blk_rq_pos(rq) << 9) + lo->lo_offset;
+ rwf_t rw_flags = 0;
+
+ /*
+ * If we are being asked to do a REQ_FUA write, then we convert that to
+ * a datasync write so that the contents of the write are on stable
+ * storage before the write completes.
+ */
+ if (rq->cmd_flags & REQ_FUA)
+ rw_flags = RWF_DSYNC;

/*
* lo_write_simple and lo_read_simple should have been covered
@@ -490,12 +501,12 @@ static int do_req_filebacked(struct loop_device *lo, struct request *rq)
return lo_fallocate(lo, rq, pos, FALLOC_FL_PUNCH_HOLE);
case REQ_OP_WRITE:
if (cmd->use_aio)
- return lo_rw_aio(lo, cmd, pos, ITER_SOURCE);
+ return lo_rw_aio(lo, cmd, pos, ITER_SOURCE, rw_flags);
else
- return lo_write_simple(lo, rq, pos);
+ return lo_write_simple(lo, rq, pos, rw_flags);
case REQ_OP_READ:
if (cmd->use_aio)
- return lo_rw_aio(lo, cmd, pos, ITER_DEST);
+ return lo_rw_aio(lo, cmd, pos, ITER_DEST, 0);
else
return lo_read_simple(lo, rq, pos);
default:
@@ -1077,7 +1088,7 @@ static int loop_configure(struct loop_device *lo, blk_mode_t mode,
mapping_set_gfp_mask(mapping, lo->old_gfp_mask & ~(__GFP_IO|__GFP_FS));

if (!(lo->lo_flags & LO_FLAGS_READ_ONLY) && file->f_op->fsync)
- blk_queue_write_cache(lo->lo_queue, true, false);
+ blk_queue_write_cache(lo->lo_queue, true, true);

if (config->block_size)
bsize = config->block_size;