2024-02-29 11:46:25

by Zhihao Cheng

[permalink] [raw]
Subject: [PATCH RFC 1/2] iomap: Add a IOMAP_DIO_MAY_INLINE_COMP flag

It will be more efficient to execute quick endio process(eg. non-sync
overwriting case) under irq process rather than starting a worker to
do it.
Add a flag to control DIO to be finished inline(under irq context), which
can be used for non-sync overwriting case.
Besides, skip invalidating pages if DIO is finished inline, which will
keep the same logic with dio_bio_end_aio in non-sync overwriting case.

Signed-off-by: Zhihao Cheng <[email protected]>
---
fs/iomap/direct-io.c | 10 ++++++++--
include/linux/iomap.h | 6 ++++++
2 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index bcd3f8cf5ea4..221715b38ce2 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -110,7 +110,8 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
* ->end_io() when necessary, otherwise a racing buffer read would cache
* zeros from unwritten extents.
*/
- if (!dio->error && dio->size && (dio->flags & IOMAP_DIO_WRITE))
+ if (!dio->error && dio->size && (dio->flags & IOMAP_DIO_WRITE) &&
+ !(dio->flags & IOMAP_DIO_INLINE_COMP))
kiocb_invalidate_post_direct_write(iocb, dio->size);

inode_dio_end(file_inode(iocb->ki_filp));
@@ -122,8 +123,10 @@ ssize_t iomap_dio_complete(struct iomap_dio *dio)
* If this is a DSYNC write, make sure we push it to stable
* storage now that we've written data.
*/
- if (dio->flags & IOMAP_DIO_NEED_SYNC)
+ if (dio->flags & IOMAP_DIO_NEED_SYNC) {
+ WARN_ON_ONCE(dio->flags & IOMAP_DIO_INLINE_COMP);
ret = generic_write_sync(iocb, ret);
+ }
if (ret > 0)
ret += dio->done_before;
}
@@ -628,6 +631,9 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
*/
if (!(iocb->ki_flags & IOCB_SYNC))
dio->flags |= IOMAP_DIO_WRITE_THROUGH;
+ } else if (dio_flags & IOMAP_DIO_MAY_INLINE_COMP) {
+ /* writes could complete inline */
+ dio->flags |= IOMAP_DIO_INLINE_COMP;
}

/*
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 96dd0acbba44..f292b10028d0 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -382,6 +382,12 @@ struct iomap_dio_ops {
*/
#define IOMAP_DIO_PARTIAL (1 << 2)

+/*
+ * DIO will be completed inline unless sync operation is needed after io is
+ * finished.
+ */
+#define IOMAP_DIO_MAY_INLINE_COMP (1 << 3)
+
ssize_t iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
const struct iomap_ops *ops, const struct iomap_dio_ops *dops,
unsigned int dio_flags, void *private, size_t done_before);
--
2.39.2



2024-03-01 10:04:39

by Zhihao Cheng

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] iomap: Add a IOMAP_DIO_MAY_INLINE_COMP flag

?? 2024/3/1 8:40, Dave Chinner д??:

Hi Dave, thanks for your detailed and nice suggestions, I have a few
questions below.
> On Thu, Feb 29, 2024 at 07:38:48PM +0800, Zhihao Cheng wrote:
>> It will be more efficient to execute quick endio process(eg. non-sync
>> overwriting case) under irq process rather than starting a worker to
>> do it.
>> Add a flag to control DIO to be finished inline(under irq context), which
>> can be used for non-sync overwriting case.
>> Besides, skip invalidating pages if DIO is finished inline, which will
>> keep the same logic with dio_bio_end_aio in non-sync overwriting case.
>>
>> Signed-off-by: Zhihao Cheng <[email protected]>
>
> A nice idea, but I don't think an ext4 specific API flag is the
> right way to go about enabling this. The iomap dio code knows if
> the write is pure overwrite already - we have the IOMAP_F_DIRTY flag
> for that, and we combine this with IOMAP_DIO_WRITE_THROUGH to do the
> pure overwrite FUA optimisations.
>
> That is:
>
> /*
> * Use a FUA write if we need datasync semantics, this is a pure
> * data IO that doesn't require any metadata updates (including
> * after IO completion such as unwritten extent conversion) and
> * the underlying device either supports FUA or doesn't have
> * a volatile write cache. This allows us to avoid cache flushes
> * on IO completion. If we can't use writethrough and need to
> * sync, disable in-task completions as dio completion will
> * need to call generic_write_sync() which will do a blocking
> * fsync / cache flush call.
> */
> if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
> (dio->flags & IOMAP_DIO_WRITE_THROUGH) &&
> (bdev_fua(iomap->bdev) || !bdev_write_cache(iomap->bdev)))
> use_fua = true;
>
> Hence if we want to optimise pure overwrites that have no data sync
> requirements, we already have the detection and triggers in place to
> do this. We just need to change the way we set up the IO flags to
> allow write-through (i.e. non-blocking IO completions) to use inline
> completions.
>
> In __iomap_dio_rw():
>
> + /* Always try to complete inline. */
> + dio->flags |= IOMAP_DIO_INLINE_COMP;
> if (iov_iter_rw(iter) == READ) {
> - /* reads can always complete inline */
> - dio->flags |= IOMAP_DIO_INLINE_COMP;
> ....
>
> } else {
> + /* Always try write-through semantics. If we can't
> + * use writethough, it will be disabled along with
> + * IOMAP_DIO_INLINE_COMP before dio completion is run
> + * so it can be deferred to a task completion context
> + * appropriately.
> + */
> + dio->flags |= IOMAP_DIO_WRITE | IOMAP_DIO_WRITE_THROUGH;

There is a behavior change here, if we set IOMAP_DIO_WRITE_THROUGH
unconditionally, non-datasync IO will be set with REQ_FUA, which means
that device will flush writecache for each IO, will it affect the
performance in non-sync dio case?
> iomi.flags |= IOMAP_WRITE;
> - dio->flags |= IOMAP_DIO_WRITE;
> .....
> /* for data sync or sync, we need sync completion processing */
> if (iocb_is_dsync(iocb)) {
> dio->flags |= IOMAP_DIO_NEED_SYNC;
>
> - /*
> - * For datasync only writes, we optimistically try using
> - * WRITE_THROUGH for this IO. This flag requires either
> - * FUA writes through the device's write cache, or a
> - * normal write to a device without a volatile write
> - * cache. For the former, Any non-FUA write that occurs
> - * will clear this flag, hence we know before completion
> - * whether a cache flush is necessary.
> - */
> - if (!(iocb->ki_flags & IOCB_SYNC))
> - dio->flags |= IOMAP_DIO_WRITE_THROUGH;
> + * For sync writes we know we are going to need
> + * blocking completion processing, so turn off
> + * writethrough now.
> + */
> if (iocb->ki_flags & IOCB_SYNC) {
> dio->flags &= ~(IOMAP_DIO_WRITE_THROUGH |
> IOMAP_DIO_INLINE_COMP);
> }
> }
>

[...]
>
> However, this does mean that any spinlock taken in the ->end_io()
> callbacks now needs to be irq safe. e.g. in xfs_dio_write_end_io()
> the spinlock protection around inode size updates will need to use
> an irq safe locking, as will the locking in the DIO submission path
> that it serialises against in xfs_file_write_checks(). That probably
> is best implemented as a separate spinlock.
>
> There will also be other filesystems that need to set IOMAP_F_DIRTY
> unconditionally (e.g. zonefs) because they always take blocking
> locks in their ->end_io callbacks and so must always run in task
> context...
Should we add a new flag(eg. IOMAP_F_ENDIO_IRQ ?) to indicate that the
endio cannot be done under irq? Because I think IOMAP_F_DIRTY means that
the metadata needs to be written, if we add a new semantics(endio must
be done in defered work) for this flag, the code will looks a little
complicated.


2024-03-11 07:55:47

by Zhihao Cheng

[permalink] [raw]
Subject: Re: [PATCH RFC 1/2] iomap: Add a IOMAP_DIO_MAY_INLINE_COMP flag

在 2024/3/1 18:02, Zhihao Cheng 写道:
> 在 2024/3/1 8:40, Dave Chinner 写道:
>
Hi Dave. Friendly ping.
> Hi Dave, thanks for your detailed and nice suggestions, I have a few
> questions below.
>> On Thu, Feb 29, 2024 at 07:38:48PM +0800, Zhihao Cheng wrote:
>>> It will be more efficient to execute quick endio process(eg. non-sync
>>> overwriting case) under irq process rather than starting a worker to
>>> do it.
>>> Add a flag to control DIO to be finished inline(under irq context),
>>> which
>>> can be used for non-sync overwriting case.
>>> Besides, skip invalidating pages if DIO is finished inline, which will
>>> keep the same logic with dio_bio_end_aio in non-sync overwriting case.
>>>
>>> Signed-off-by: Zhihao Cheng <[email protected]>
>>
>> A nice idea, but I don't think an ext4 specific API flag is the
>> right way to go about enabling this. The iomap dio code knows if
>> the write is pure overwrite already - we have the IOMAP_F_DIRTY flag
>> for that, and we combine this with IOMAP_DIO_WRITE_THROUGH to do the
>> pure overwrite FUA optimisations.
>>
>> That is:
>>
>>         /*
>>                   * Use a FUA write if we need datasync semantics,
>> this is a pure
>>                   * data IO that doesn't require any metadata updates
>> (including
>>                   * after IO completion such as unwritten extent
>> conversion) and
>>                   * the underlying device either supports FUA or
>> doesn't have
>>                   * a volatile write cache. This allows us to avoid
>> cache flushes
>>                   * on IO completion. If we can't use writethrough and
>> need to
>>                   * sync, disable in-task completions as dio
>> completion will
>>                   * need to call generic_write_sync() which will do a
>> blocking
>>                   * fsync / cache flush call.
>>                   */
>>                  if (!(iomap->flags & (IOMAP_F_SHARED|IOMAP_F_DIRTY)) &&
>>                      (dio->flags & IOMAP_DIO_WRITE_THROUGH) &&
>>                      (bdev_fua(iomap->bdev) ||
>> !bdev_write_cache(iomap->bdev)))
>>                          use_fua = true;
>>
>> Hence if we want to optimise pure overwrites that have no data sync
>> requirements, we already have the detection and triggers in place to
>> do this. We just need to change the way we set up the IO flags to
>> allow write-through (i.e. non-blocking IO completions) to use inline
>> completions.
>>
>> In __iomap_dio_rw():
>>
>> +    /* Always try to complete inline. */
>> +    dio->flags |= IOMAP_DIO_INLINE_COMP;
>>     if (iov_iter_rw(iter) == READ) {
>> -               /* reads can always complete inline */
>> -               dio->flags |= IOMAP_DIO_INLINE_COMP;
>> ....
>>
>>     } else {
>> +        /* Always try write-through semantics. If we can't
>> +         * use writethough, it will be disabled along with
>> +         * IOMAP_DIO_INLINE_COMP before dio completion is run
>> +         * so it can be deferred to a task completion context
>> +         * appropriately.
>> +         */
>> +               dio->flags |= IOMAP_DIO_WRITE | IOMAP_DIO_WRITE_THROUGH;
>
> There is a behavior change here, if we set IOMAP_DIO_WRITE_THROUGH
> unconditionally, non-datasync IO will be set with REQ_FUA, which means
> that device will flush writecache for each IO, will it affect the
> performance in non-sync dio case?
>>         iomi.flags |= IOMAP_WRITE;
>> -               dio->flags |= IOMAP_DIO_WRITE;
>> .....
>>         /* for data sync or sync, we need sync completion processing */
>>                  if (iocb_is_dsync(iocb)) {
>>                          dio->flags |= IOMAP_DIO_NEED_SYNC;
>>
>> -                      /*
>> -                       * For datasync only writes, we optimistically
>> try using
>> -                       * WRITE_THROUGH for this IO. This flag
>> requires either
>> -                       * FUA writes through the device's write cache,
>> or a
>> -                       * normal write to a device without a volatile
>> write
>> -                       * cache. For the former, Any non-FUA write
>> that occurs
>> -                       * will clear this flag, hence we know before
>> completion
>> -                       * whether a cache flush is necessary.
>> -                       */
>> -                       if (!(iocb->ki_flags & IOCB_SYNC))
>> -                               dio->flags |= IOMAP_DIO_WRITE_THROUGH;
>> +            * For sync writes we know we are going to need
>> +            * blocking completion processing, so turn off
>> +            * writethrough now.
>> +            */
>>             if (iocb->ki_flags & IOCB_SYNC) {
>>                 dio->flags &= ~(IOMAP_DIO_WRITE_THROUGH |
>>                         IOMAP_DIO_INLINE_COMP);
>>             }
>>                  }
>>
>
> [...]
>>
>> However, this does mean that any spinlock taken in the ->end_io()
>> callbacks now needs to be irq safe. e.g. in xfs_dio_write_end_io()
>> the spinlock protection around inode size updates will need to use
>> an irq safe locking, as will the locking in the DIO submission path
>> that it serialises against in xfs_file_write_checks(). That probably
>> is best implemented as a separate spinlock.
>>
>> There will also be other filesystems that need to set IOMAP_F_DIRTY
>> unconditionally (e.g. zonefs) because they always take blocking
>> locks in their ->end_io callbacks and so must always run in task
>> context...
> Should we add a new flag(eg. IOMAP_F_ENDIO_IRQ ?) to indicate that the
> endio cannot be done under irq? Because I think IOMAP_F_DIRTY means that
> the metadata needs to be written, if we add a new semantics(endio must
> be done in defered work) for this flag, the code will looks a little
> complicated.
>
>
> .