2021-01-07 16:45:44

by Mingkai Dong

Subject: Re: Expense of read_iter

Hi Matthew,

We have also discovered the expense of `->read_iter` in our study on Ext4-DAX.
In single-threaded 4K reads, the `->read` version can outperform `->read_iter`
by 41.6% in terms of throughput.

According to our observations and evaluation, at least for Ext4-DAX, the cost
also comes from the invocation of `->iomap_begin` (`ext4_iomap_begin`),
which may not be avoidable simply by adding a new iter_type.
The slowdown is more significant when multiple threads read different files
concurrently, due to a scalability issue (grabbing a read lock to check the
status of the journal) in `ext4_iomap_begin`.

In our solution, we implemented the `->read` and `->write` interfaces for
Ext4-DAX. Thus, we also think it would be good if both `->read` and `->read_iter`
could exist.
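
As a rough illustration (a sketch only, not our actual patch), coexisting
methods could look like the following in a DAX-specific file_operations.
The `ext4_dax_read` name matches the symbol in the perf traces later in
this thread; `ext4_dax_file_operations` is hypothetical (ext4 currently
uses a single `ext4_file_operations`), and the `__kernel_read`/`__kernel_write`
checks quoted below would have to be relaxed before both methods may be set.

/* Sketch only: the table name is hypothetical; the handlers are the
 * existing ext4 ones plus our ->read fast path. */
static ssize_t ext4_dax_read(struct file *filp, char __user *buf,
                             size_t len, loff_t *ppos);

static const struct file_operations ext4_dax_file_operations = {
        .llseek         = ext4_llseek,
        .read           = ext4_dax_read,        /* single-buffer fast path */
        .read_iter      = ext4_file_read_iter,  /* generic iov_iter path */
        .write_iter     = ext4_file_write_iter,
        .mmap           = ext4_file_mmap,
        .fsync          = ext4_sync_file,
};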

By the way, besides the implementation of `->read` and `->write`, we have
some other optimizations for Ext4-DAX and would like to share them once our
patches are prepared.

Thanks,
Mingkai


> On Jan 7, 2021, at 23:11, Matthew Wilcox <[email protected]> wrote:
>
> On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
>> I'd like to ask about this piece of code in __kernel_read:
>> if (unlikely(!file->f_op->read_iter || file->f_op->read))
>> return warn_unsupported...
>> and __kernel_write:
>> if (unlikely(!file->f_op->write_iter || file->f_op->write))
>> return warn_unsupported...
>>
>> - It exits with an error if both read_iter and read or write_iter and
>> write are present.
>>
>> I found out that on NVFS, reading a file with the read method has 10%
>> better performance than the read_iter method. The benchmark just reads the
>> same 4k page over and over again - and the cost of creating and parsing
>> the kiocb and iov_iter structures is just that high.
>
> Which part of it is so expensive? Is it worth, eg adding an iov_iter
> type that points to a single buffer instead of a single-member iov?
>
> +++ b/include/linux/uio.h
> @@ -19,6 +19,7 @@ struct kvec {
>
> enum iter_type {
> /* iter types */
> + ITER_UBUF = 2,
> ITER_IOVEC = 4,
> ITER_KVEC = 8,
> ITER_BVEC = 16,
> @@ -36,6 +36,7 @@ struct iov_iter {
> size_t iov_offset;
> size_t count;
> union {
> + void __user *buf;
> const struct iovec *iov;
> const struct kvec *kvec;
> const struct bio_vec *bvec;
>
> and then doing all the appropriate changes to make that work.


2021-01-12 13:58:10

by Zhongwei Cai

Subject: Re: Expense of read_iter


I'm working with Mingkai on optimizations for Ext4-dax.
We think that optimizing the read-iter method cannot achieve the
same performance as the read method for Ext4-dax.
We tried Mikulas's benchmark on Ext4-dax. The overall time and perf
results are listed below:

Overall time of 2^26 4KB read.

Method Time
read 26.782s
read-iter 36.477s
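
For reference, the workload is essentially the following userspace loop
(a sketch, not the exact benchmark source; the file path is illustrative):

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
        static char buf[4096];
        int fd = open("/mnt/pmem/testfile", O_RDONLY);  /* illustrative path */
        long i;

        if (fd < 0) {
                perror("open");
                return 1;
        }
        /* 2^26 reads of the same 4KB page at offset 0 */
        for (i = 0; i < (1L << 26); i++) {
                if (pread(fd, buf, sizeof(buf), 0) != (ssize_t)sizeof(buf)) {
                        perror("pread");
                        return 1;
                }
        }
        close(fd);
        return 0;
}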

Perf result, using the read_iter method:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1K of event 'cycles'
# Event count (approx.): 13379476464
#
# Overhead Command Shared Object Symbol
# ........ ....... ................ .......................................
#
20.09% pread [kernel.vmlinux] [k] copy_user_generic_string
6.58% pread [kernel.vmlinux] [k] iomap_apply
6.01% pread [kernel.vmlinux] [k] syscall_return_via_sysret
4.85% pread libc-2.31.so [.] __libc_pread
3.61% pread [kernel.vmlinux] [k] entry_SYSCALL_64_after_hwframe
3.25% pread [kernel.vmlinux] [k] _raw_read_lock
2.80% pread [kernel.vmlinux] [k] entry_SYSCALL_64
2.71% pread [ext4] [k] ext4_es_lookup_extent
2.71% pread [kernel.vmlinux] [k] __fsnotify_parent
2.63% pread [kernel.vmlinux] [k] __srcu_read_unlock
2.55% pread [kernel.vmlinux] [k] new_sync_read
2.39% pread [ext4] [k] ext4_iomap_begin
2.38% pread [kernel.vmlinux] [k] vfs_read
2.30% pread [kernel.vmlinux] [k] dax_iomap_actor
2.30% pread [kernel.vmlinux] [k] __srcu_read_lock
2.14% pread [ext4] [k] ext4_inode_block_valid
1.97% pread [kernel.vmlinux] [k] _copy_mc_to_iter
1.97% pread [ext4] [k] ext4_map_blocks
1.89% pread [kernel.vmlinux] [k] down_read
1.89% pread [kernel.vmlinux] [k] up_read
1.65% pread [ext4] [k] ext4_file_read_iter
1.48% pread [kernel.vmlinux] [k] dax_iomap_rw
1.48% pread [jbd2] [k] jbd2_transaction_committed
1.15% pread [nd_pmem] [k] __pmem_direct_access
1.15% pread [kernel.vmlinux] [k] ksys_pread64
1.15% pread [kernel.vmlinux] [k] __fget_light
1.15% pread [ext4] [k] ext4_set_iomap
1.07% pread [kernel.vmlinux] [k] atime_needs_update
0.82% pread pread [.] main
0.82% pread [kernel.vmlinux] [k] do_syscall_64
0.74% pread [kernel.vmlinux] [k] entry_SYSCALL_64_safe_stack
0.66% pread [kernel.vmlinux] [k] __x86_indirect_thunk_rax
0.66% pread [nd_pmem] [k] 0x00000000000001d0
0.59% pread [kernel.vmlinux] [k] dax_direct_access
0.58% pread [nd_pmem] [k] 0x00000000000001de
0.58% pread [kernel.vmlinux] [k] bdev_dax_pgoff
0.49% pread [kernel.vmlinux] [k] syscall_enter_from_user_mode
0.49% pread [kernel.vmlinux] [k] exit_to_user_mode_prepare
0.49% pread [kernel.vmlinux] [k] syscall_exit_to_user_mode
0.41% pread [kernel.vmlinux] [k] syscall_exit_to_user_mode_prepare
0.33% pread [nd_pmem] [k] 0x0000000000001083
0.33% pread [kernel.vmlinux] [k] dax_get_private
0.33% pread [kernel.vmlinux] [k] timestamp_truncate
0.33% pread [kernel.vmlinux] [k] percpu_counter_add_batch
0.33% pread [kernel.vmlinux] [k] copyout_mc
0.33% pread [ext4] [k] __check_block_validity.constprop.80
0.33% pread [kernel.vmlinux] [k] touch_atime
0.25% pread [nd_pmem] [k] 0x000000000000107f
0.25% pread [kernel.vmlinux] [k] rw_verify_area
0.25% pread [ext4] [k] ext4_iomap_end
0.25% pread [kernel.vmlinux] [k] _cond_resched
0.25% pread [kernel.vmlinux] [k] rcu_all_qs
0.16% pread [kernel.vmlinux] [k] __fdget
0.16% pread [kernel.vmlinux] [k] ktime_get_coarse_real_ts64
0.16% pread [kernel.vmlinux] [k] iov_iter_init
0.16% pread [kernel.vmlinux] [k] current_time
0.16% pread [nd_pmem] [k] 0x0000000000001075
0.16% pread [ext4] [k] ext4_inode_datasync_dirty
0.16% pread [kernel.vmlinux] [k] copy_mc_to_user
0.08% pread pread [.] pread@plt
0.08% pread [kernel.vmlinux] [k] __x86_indirect_thunk_r11
0.08% pread [kernel.vmlinux] [k] security_file_permission
0.08% pread [kernel.vmlinux] [k] dax_read_unlock
0.08% pread [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
0.08% pread [nd_pmem] [k] 0x000000000000108f
0.08% pread [nd_pmem] [k] 0x0000000000001095
0.08% pread [kernel.vmlinux] [k] rcu_read_unlock_strict
0.00% pread [kernel.vmlinux] [k] native_write_msr


#
# (Tip: Show current config key-value pairs: perf config --list)
#

Perf result, using the read method we added for Ext4-dax:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 1K of event 'cycles'
# Event count (approx.): 13364755903
#
# Overhead Command Shared Object Symbol
# ........ ....... ................ .......................................
#
28.65% pread [kernel.vmlinux] [k] copy_user_generic_string
7.99% pread [ext4] [k] ext4_dax_read
6.50% pread [kernel.vmlinux] [k] syscall_return_via_sysret
5.43% pread libc-2.31.so [.] __libc_pread
4.45% pread [kernel.vmlinux] [k] entry_SYSCALL_64
4.20% pread [kernel.vmlinux] [k] down_read
3.38% pread [kernel.vmlinux] [k] _raw_read_lock
3.13% pread [ext4] [k] ext4_es_lookup_extent
3.05% pread [kernel.vmlinux] [k] __srcu_read_lock
2.72% pread [kernel.vmlinux] [k] __fsnotify_parent
2.55% pread [kernel.vmlinux] [k] __srcu_read_unlock
2.47% pread [kernel.vmlinux] [k] vfs_read
2.31% pread [kernel.vmlinux] [k] entry_SYSCALL_64_after_hwframe
1.89% pread [kernel.vmlinux] [k] up_read
1.73% pread [ext4] [k] ext4_map_blocks
1.65% pread pread [.] main
1.56% pread [kernel.vmlinux] [k] __fget_light
1.48% pread [ext4] [k] ext4_inode_block_valid
1.34% pread [kernel.vmlinux] [k] ksys_pread64
1.23% pread [kernel.vmlinux] [k] entry_SYSCALL_64_safe_stack
1.08% pread [kernel.vmlinux] [k] syscall_exit_to_user_mode
1.07% pread [nd_pmem] [k] __pmem_direct_access
0.99% pread [kernel.vmlinux] [k] atime_needs_update
0.91% pread [kernel.vmlinux] [k] security_file_permission
0.91% pread [kernel.vmlinux] [k] syscall_enter_from_user_mode
0.66% pread [kernel.vmlinux] [k] timestamp_truncate
0.58% pread [kernel.vmlinux] [k] ktime_get_coarse_real_ts64
0.49% pread pread [.] pread@plt
0.41% pread [kernel.vmlinux] [k] current_time
0.41% pread [kernel.vmlinux] [k] dax_direct_access
0.41% pread [kernel.vmlinux] [k] do_syscall_64
0.41% pread [kernel.vmlinux] [k] exit_to_user_mode_prepare
0.41% pread [kernel.vmlinux] [k] percpu_counter_add_batch
0.33% pread [kernel.vmlinux] [k] touch_atime
0.33% pread [ext4] [k] __check_block_validity.constprop.80
0.33% pread [kernel.vmlinux] [k] copy_mc_to_user
0.25% pread [kernel.vmlinux] [k] dax_get_private
0.25% pread [kernel.vmlinux] [k] rcu_all_qs
0.25% pread [nd_pmem] [k] 0x0000000000001095
0.16% pread [kernel.vmlinux] [k] _raw_spin_lock_irqsave
0.16% pread [kernel.vmlinux] [k] syscall_exit_to_user_mode_prepare
0.16% pread [nd_pmem] [k] 0x0000000000001083
0.16% pread [kernel.vmlinux] [k] rw_verify_area
0.16% pread [kernel.vmlinux] [k] _raw_spin_unlock_irqrestore
0.16% pread [kernel.vmlinux] [k] __fdget
0.16% pread [kernel.vmlinux] [k] dax_read_lock
0.16% pread [kernel.vmlinux] [k] __x86_indirect_thunk_rax
0.08% pread [kernel.vmlinux] [k] rcu_read_unlock_strict
0.08% pread [kernel.vmlinux] [k] dax_read_unlock
0.08% pread [kernel.vmlinux] [k] update_irq_load_avg
0.08% pread [nd_pmem] [k] 0x000000000000109d
0.08% pread [nd_pmem] [k] 0x000000000000107a
0.08% pread [kernel.vmlinux] [k] __x64_sys_pread64
0.00% pread [kernel.vmlinux] [k] native_write_msr


#
# (Tip: Sample related events with: perf record -e '{cycles,instructions}:S')
#

Note that the overall time of the read method is 73.42% of that of the
read-iter method. If we sum up the percentages of the read-iter-specific
functions (ext4_file_read_iter, iomap_apply, dax_iomap_actor,
_copy_mc_to_iter, ext4_iomap_begin, jbd2_transaction_committed,
new_sync_read, dax_iomap_rw, ext4_set_iomap, ext4_iomap_end and
iov_iter_init), we get 20.81%. In the second trace, ext4_dax_read, which
replaces all of these functions, consumes only 7.99%.

The overhead mainly consists of two parts. The first is constructing
struct iov_iter and iterating it (i.e., new_sync_read, _copy_mc_to_iter and
iov_iter_init). The second is the dax io mechanism provided by VFS (i.e.,
dax_iomap_rw, iomap_apply and ext4_iomap_begin).
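
For clarity, the call chain behind these numbers is roughly the following
(a simplified sketch based on the symbols in the perf output, not verbatim
kernel source; example_dax_read_iter stands in for ext4's DAX read_iter
handler):

/*
 *   new_sync_read()                - builds iovec/kiocb/iov_iter   (part 1)
 *     -> ext4_file_read_iter()
 *       -> dax_iomap_rw()          - generic DAX machinery         (part 2)
 *         -> iomap_apply()
 *           -> ext4_iomap_begin()  - extent lookup, journal check
 *           -> dax_iomap_actor()   - _copy_mc_to_iter() into the iter
 */
static ssize_t example_dax_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
        struct inode *inode = file_inode(iocb->ki_filp);
        ssize_t ret;

        inode_lock_shared(inode);
        /* every call pays for ->iomap_begin/->iomap_end around the copy */
        ret = dax_iomap_rw(iocb, to, &ext4_iomap_ops);
        inode_unlock_shared(inode);

        file_accessed(iocb->ki_filp);
        return ret;
}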

There could be two approaches to optimization: 1) implementing the read
method without the complexity of iterators and dax_iomap_rw; or 2) optimizing
both the iterators and how dax_iomap_rw works. Since dax_iomap_rw requires
ext4_iomap_begin, which in turn involves the iomap structure and other
machinery (e.g., journaling-status locks in Ext4), we think implementing
the read method would be easier.

Thanks,
Zhongwei

2021-01-12 14:11:27

by David Laight

Subject: RE: Expense of read_iter

From: Zhongwei Cai
> Sent: 12 January 2021 13:45
..
> The overhead mainly consists of two parts. The first is constructing
> struct iov_iter and iterating it (i.e., new_sync_read, _copy_mc_to_iter and
> iov_iter_init). The second is the dax io mechanism provided by VFS (i.e.,
> dax_iomap_rw, iomap_apply and ext4_iomap_begin).

Setting up an iov_iter with a single buffer ought to be relatively
cheap - compared to a file system read.

The iteration should be over the total length,
calling copy_from/to_iter() for 'chunks' whose size
doesn't depend on the iov[] fragments.

So copy_to/from_iter() should directly replace the copy_to/from_user()
calls in the 'read' method.
For a single buffer this really ought to be noise as well.
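
Something like this, purely as a sketch (example_data and both helpers are
made-up names; partial copies are ignored for brevity):

static const char example_data[PAGE_SIZE];      /* stand-in data source */

/* ->read style: the destination is a single user pointer */
static ssize_t example_read(struct file *file, char __user *buf,
                            size_t len, loff_t *ppos)
{
        len = min_t(size_t, len, sizeof(example_data));
        if (copy_to_user(buf, example_data, len))
                return -EFAULT;
        return len;
}

/* ->read_iter style: the same copy, destination described by an iov_iter */
static ssize_t example_read_iter(struct kiocb *iocb, struct iov_iter *to)
{
        size_t len = min_t(size_t, iov_iter_count(to), sizeof(example_data));

        if (copy_to_iter(example_data, len, to) != len)
                return -EFAULT;
        return len;
}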

Clearly, if the iov[] has a lot of short fragments the copy
will be more expensive.

Access to /dev/null and /dev/zero is much more likely to show
the additional costs of the iov_iter code than access through fs code is.


David


2021-01-13 16:47:48

by Mikulas Patocka

Subject: Re: Expense of read_iter



On Tue, 12 Jan 2021, Zhongwei Cai wrote:

>
> I'm working with Mingkai on optimizations for Ext4-dax.

What specific patch are you working on? Please, post it somewhere.

> We think that optimizing the read-iter method cannot achieve the
> same performance as the read method for Ext4-dax.
> We tried Mikulas's benchmark on Ext4-dax. The overall time and perf
> results are listed below:
>
> Overall time of 2^26 4KB read.
>
> Method Time
> read 26.782s
> read-iter 36.477s

What happens if you use this trick ( https://lkml.org/lkml/2021/1/11/1612 )
- detect in the "read_iter" method that there is just one segment and
treat it like a "read" method. I think that it should improve performance
for your case.
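
Roughly something like this (a sketch of the idea only, not the patch
behind the link; example_dax_read() is a hypothetical ->read-style helper):

static ssize_t example_read_iter_fast(struct kiocb *iocb, struct iov_iter *to)
{
        /* exactly one user buffer: the same information a ->read call gets */
        if (iter_is_iovec(to) && to->nr_segs == 1) {
                char __user *buf = to->iov->iov_base + to->iov_offset;

                return example_dax_read(iocb->ki_filp, buf,
                                        iov_iter_count(to), &iocb->ki_pos);
        }

        /* anything more complex keeps using the generic iov_iter path */
        return dax_iomap_rw(iocb, to, &ext4_iomap_ops);
}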

Mikulas

2021-01-15 09:43:52

by Zhongwei Cai

Subject: Re: Expense of read_iter

On Thu, 14 Jan 2021, Mikulas wrote:

>> I'm working with Mingkai on optimizations for Ext4-dax.
>
> What specific patch are you working on? Please, post it somewhere.

Here is the work-in-progress patch: https://ipads.se.sjtu.edu.cn:1312/opensource/linux/-/tree/ext4-read
It only contains the "read" implementation for Ext4-dax for now; we will
add our other optimizations to it later.

> What happens if you use this trick ( https://lkml.org/lkml/2021/1/11/1612 )
> - detect in the "read_iter" method that there is just one segment and
> treat it like a "read" method. I think that it should improve performance
> for your case.

Note that the original Ext4-dax does not implement the "read" method; instead,
it calls the "dax_iomap_rw" helper provided by the VFS. So, as a baseline for
comparison, we first rewrote the "read_iter" method to iterate over struct
iov_iter and call our "read" method.

Overall time of 2^26 4KB reads:
"read_iter" method with dax_iomap_rw (original)             - 36.477s
"read_iter" method wrapping our "read" method               - 28.950s
"read_iter" method with the one-segment test from Mikulas   - 27.947s
"read" method                                               - 26.899s

As we mentioned in the previous email (https://lkml.org/lkml/2021/1/12/710),
the overhead mainly consists of two parts. The first is constructing
struct iov_iter and iterating it (i.e., new_sync_read, _copy_mc_to_iter and
iov_iter_init). The second is the dax io mechanism provided by VFS (i.e.,
dax_iomap_rw, iomap_apply and ext4_iomap_begin).

For Ext4-dax, the overhead of dax_iomap_rw is significant
compared to the overhead of struct iov_iter. Although methods
proposed by Mikulas can eliminate the overhead of iov_iter
well, they can not be applied in Ext4-dax unless we implement an
internal "read" method in Ext4-dax.

For Ext4-dax, there could be two approaches to optimizing:
1) implementing the internal "read" method without the complexity
of iterators and dax_iomap_rw; 2) optimizing how dax_iomap_rw works.
Since dax_iomap_rw requires ext4_iomap_begin, which further involves
the iomap structure and others (e.g., journaling status locks in Ext4),
we think implementing the internal "read" method would be easier.

As for whether the external .read interface in the VFS should be kept:
since there is still a performance gap (3.9%) between the "read" method
and the optimized "read_iter" method, we think keeping it is better.

Thanks,
Zhongwei

2021-01-20 04:53:18

by Dave Chinner

Subject: Re: Expense of read_iter

On Fri, Jan 15, 2021 at 05:40:43PM +0800, Zhongwei Cai wrote:
> On Thu, 14 Jan 2021, Mikulas wrote:
> For Ext4-dax, the overhead of dax_iomap_rw is significant
> compared to the overhead of struct iov_iter. Although methods
> proposed by Mikulas can eliminate the overhead of iov_iter
> well, they can not be applied in Ext4-dax unless we implement an
> internal "read" method in Ext4-dax.
>
> For Ext4-dax, there could be two approaches to optimizing:
> 1) implementing the internal "read" method without the complexity
> of iterators and dax_iomap_rw;

Please do not go and re-invent the wheel just for ext4. If there's a
problem in a shared path - ext2, FUSE and XFS all use dax_iomap_rw()
as well, so any improvements to that path benefit all DAX users, not
just ext4.

> 2) optimizing how dax_iomap_rw works.
> Since dax_iomap_rw requires ext4_iomap_begin, which further involves
> the iomap structure and others (e.g., journaling status locks in Ext4),
> we think implementing the internal "read" method would be easier.

Maybe it is, but it's also very selfish. The DAX iomap path was
written to be correct for all users, not necessarily to provide
optimal performance. There will be lots of things that could be done
to optimise it, so rather than creating a special snowflake in ext4
that makes DAX in ext4 much harder to maintain for non-ext4 DAX
developers, please work to improve the common DAX IO path and so
provide the same benefit to all the filesystems that use it.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-01-20 14:29:48

by Jan Kara

Subject: Re: Expense of read_iter

On Wed 20-01-21 15:47:00, Dave Chinner wrote:
> On Fri, Jan 15, 2021 at 05:40:43PM +0800, Zhongwei Cai wrote:
> > On Thu, 14 Jan 2021, Mikulas wrote:
> > For Ext4-dax, the overhead of dax_iomap_rw is significant
> > compared to the overhead of struct iov_iter. Although methods
> > proposed by Mikulas can eliminate the overhead of iov_iter
> > well, they can not be applied in Ext4-dax unless we implement an
> > internal "read" method in Ext4-dax.
> >
> > For Ext4-dax, there could be two approaches to optimizing:
> > 1) implementing the internal "read" method without the complexity
> > of iterators and dax_iomap_rw;
>
> Please do not go an re-invent the wheel just for ext4. If there's a
> problem in a shared path - ext2, FUSE and XFS all use dax_iomap_rw()
> as well, so any improvements to that path benefit all DAX users, not
> just ext4.
>
> > 2) optimizing how dax_iomap_rw works.
> > Since dax_iomap_rw requires ext4_iomap_begin, which further involves
> > the iomap structure and others (e.g., journaling status locks in Ext4),
> > we think implementing the internal "read" method would be easier.
>
> Maybe it is, but it's also very selfish. The DAX iomap path was
> written to be correct for all users, not inecessarily provide
> optimal performance. There will be lots of things that could be done
> to optimise it, so rather than creating a special snowflake in ext4
> that makes DAX in ext4 much harder to maintain for non-ext4 DAX
> developers, please work to improve the common DAX IO path and so
> provide the same benefit to all the filesystems that use it.

Yeah, I agree. I'm against an ext4-private solution for this read problem. And
I'm also against duplicating ->read_iter functionality in the ->read handler.
The maintenance burden of this code duplication is IMHO just too big. We
rather need to improve the generic code so that the fast path is faster.
And every filesystem will benefit because this is not an ext4-specific
problem.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2021-01-20 15:28:59

by Mikulas Patocka

Subject: Re: Expense of read_iter



On Wed, 20 Jan 2021, Jan Kara wrote:

> Yeah, I agree. I'm against an ext4-private solution for this read problem. And
> I'm also against duplicating ->read_iter functionality in the ->read handler.
> The maintenance burden of this code duplication is IMHO just too big. We
> rather need to improve the generic code so that the fast path is faster.
> And every filesystem will benefit because this is not an ext4-specific
> problem.
>
> Honza

Do you have some idea how to optimize the generic code that calls
->read_iter?

vfs_read calls ->read if it is present. If not, it calls new_sync_read.
new_sync_read's frame size is 128 bytes - it holds the structures iovec,
kiocb and iov_iter. new_sync_read calls ->read_iter.

I have found that the cost of calling new_sync_read is 3.3%; Zhongwei
found 3.9%. (The benchmark repeatedly reads the same 4k page.)

I don't see any way to optimize new_sync_read or to reduce its
frame size. Do you?
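
For reference, new_sync_read is roughly the following (paraphrased from
fs/read_write.c, details trimmed) - the three on-stack structures are the
128 bytes of frame, and they exist only to re-describe one (buf, len) pair:

static ssize_t new_sync_read(struct file *filp, char __user *buf,
                             size_t len, loff_t *ppos)
{
        struct iovec iov = { .iov_base = buf, .iov_len = len };
        struct kiocb kiocb;
        struct iov_iter iter;
        ssize_t ret;

        init_sync_kiocb(&kiocb, filp);
        kiocb.ki_pos = ppos ? *ppos : 0;
        iov_iter_init(&iter, READ, &iov, 1, len);

        ret = call_read_iter(filp, &kiocb, &iter);
        if (ppos)
                *ppos = kiocb.ki_pos;
        return ret;
}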

Mikulas

2021-01-20 15:54:07

by David Laight

Subject: RE: Expense of read_iter

From: Mikulas Patocka
> Sent: 20 January 2021 15:12
>
> On Wed, 20 Jan 2021, Jan Kara wrote:
>
> > Yeah, I agree. I'm against an ext4-private solution for this read problem. And
> > I'm also against duplicating ->read_iter functionality in the ->read handler.
> > The maintenance burden of this code duplication is IMHO just too big. We
> > rather need to improve the generic code so that the fast path is faster.
> > And every filesystem will benefit because this is not an ext4-specific
> > problem.
> >
> > Honza
>
> Do you have some idea how to optimize the generic code that calls
> ->read_iter?
>
> vfs_read calls ->read if it is present. If not, it calls new_sync_read.
> new_sync_read's frame size is 128 bytes - it holds the structures iovec,
> kiocb and iov_iter. new_sync_read calls ->read_iter.
>
> I have found that the cost of calling new_sync_read is 3.3%; Zhongwei
> found 3.9%. (The benchmark repeatedly reads the same 4k page.)
>
> I don't see any way to optimize new_sync_read or to reduce its
> frame size. Do you?

Why is the 'read_iter' path not just the same as the 'read' one,
but calling copy_to_iter() instead of copy_to_user()?

For a single-fragment iov[] the difference might just be
measurable for a single-byte read.
But by the time you are transferring 4k it ought to be minuscule.

It isn't as though you have the cost of reading the iov[] from userspace.
(That hits sendmsg() v send().)

David


2021-01-21 16:32:54

by Zhongwei Cai

Subject: Re: Expense of read_iter


On Wed, 20 Jan 2021, Jan Kara wrote:
> On Wed 20-01-21 15:47:00, Dave Chinner wrote:
> > On Fri, Jan 15, 2021 at 05:40:43PM +0800, Zhongwei Cai wrote:
> > > On Thu, 14 Jan 2021, Mikulas wrote:
> > > For Ext4-dax, the overhead of dax_iomap_rw is significant
> > > compared to the overhead of struct iov_iter. Although methods
> > > proposed by Mikulas can eliminate the overhead of iov_iter
> > > well, they can not be applied in Ext4-dax unless we implement an
> > > internal "read" method in Ext4-dax.
> > >
> > > For Ext4-dax, there could be two approaches to optimizing:
> > > 1) implementing the internal "read" method without the complexity
> > > of iterators and dax_iomap_rw;
> >
> > Please do not go and re-invent the wheel just for ext4. If there's a
> > problem in a shared path - ext2, FUSE and XFS all use dax_iomap_rw()
> > as well, so any improvements to that path benefit all DAX users, not
> > just ext4.
> >
> > > 2) optimizing how dax_iomap_rw works.
> > > Since dax_iomap_rw requires ext4_iomap_begin, which further involves
> > > the iomap structure and others (e.g., journaling status locks in Ext4),
> > > we think implementing the internal "read" method would be easier.
> >
> > Maybe it is, but it's also very selfish. The DAX iomap path was
> > written to be correct for all users, not necessarily to provide
> > optimal performance. There will be lots of things that could be done
> > to optimise it, so rather than creating a special snowflake in ext4
> > that makes DAX in ext4 much harder to maintain for non-ext4 DAX
> > developers, please work to improve the common DAX IO path and so
> > provide the same benefit to all the filesystems that use it.
>
> Yeah, I agree. I'm against an ext4-private solution for this read problem. And
> I'm also against duplicating ->read_iter functionality in the ->read handler.
> The maintenance burden of this code duplication is IMHO just too big. We
> rather need to improve the generic code so that the fast path is faster.
> And every filesystem will benefit because this is not an ext4-specific
> problem.
>
> Honza

We agree that the maintenance burden could be a problem here, so we will
take your suggestion and try to optimize the generic path. But as Mikulas
said ( https://lkml.org/lkml/2021/1/20/618 ), some parts of the overhead,
such as new_sync_read, seem hard to avoid, and we are concerned that
optimizing the generic path will have limited effect. Nevertheless, we
will try and see how much we can improve.

Zhongwei

2021-01-21 17:01:35

by Matthew Wilcox

Subject: Re: Expense of read_iter

On Wed, Jan 20, 2021 at 10:12:01AM -0500, Mikulas Patocka wrote:
> Do you have some idea how to optimize the generic code that calls
> ->read_iter?

Yes.

> It might be better to maintain an f_iocb_flags in the
> struct file and just copy that unconditionally. We'd need to remember
> to update it in fcntl(F_SETFL), but I think that's the only place.

Want to give that a try?
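
Roughly something like this, as a sketch only (f_iocb_flags doesn't exist
yet; the placement and context lines are approximate):

 /* include/linux/fs.h: cache iocb_flags() in struct file */
 struct file {
        ...
+       unsigned int            f_iocb_flags;
        ...
 };

 /* init_sync_kiocb(): copy the cached value instead of recomputing it */
-               .ki_flags = iocb_flags(filp),
+               .ki_flags = filp->f_iocb_flags,

 /* fs/fcntl.c: setfl() must refresh the cache whenever f_flags changes */
+       filp->f_iocb_flags = iocb_flags(filp);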