2021-01-07 19:03:52

by Mikulas Patocka

Subject: Re: Expense of read_iter



On Thu, 7 Jan 2021, Matthew Wilcox wrote:

> On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > I'd like to ask about this piece of code in __kernel_read:
> > if (unlikely(!file->f_op->read_iter || file->f_op->read))
> > return warn_unsupported...
> > and __kernel_write:
> > if (unlikely(!file->f_op->write_iter || file->f_op->write))
> > return warn_unsupported...
> >
> > - It exits with an error if both read_iter and read (or both write_iter
> > and write) are present.
> >
> > I found out that on NVFS, reading a file with the read method has 10%
> > better performance than the read_iter method. The benchmark just reads the
> > same 4k page over and over again - and the cost of creating and parsing
> > the kiocb and iov_iter structures is just that high.
>
> Which part of it is so expensive?

The read_iter path is much bigger:
vfs_read - 0x160 bytes
new_sync_read - 0x160 bytes
nvfs_rw_iter - 0x100 bytes
nvfs_rw_iter_locked - 0x4a0 bytes
iov_iter_advance - 0x300 bytes

If we go with the "read" method, there's just:
vfs_read - 0x160 bytes
nvfs_read - 0x200 bytes

> Is it worth, eg adding an iov_iter
> type that points to a single buffer instead of a single-member iov?
>
> +++ b/include/linux/uio.h
> @@ -19,6 +19,7 @@ struct kvec {
>
>  enum iter_type {
>  	/* iter types */
> +	ITER_UBUF = 2,
>  	ITER_IOVEC = 4,
>  	ITER_KVEC = 8,
>  	ITER_BVEC = 16,
> @@ -36,6 +36,7 @@ struct iov_iter {
>  	size_t iov_offset;
>  	size_t count;
>  	union {
> +		void __user *buf;
>  		const struct iovec *iov;
>  		const struct kvec *kvec;
>  		const struct bio_vec *bvec;
>
> and then doing all the appropriate changes to make that work.
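
For illustration, a minimal sketch of what an initializer for the proposed
ITER_UBUF type might look like, assuming the union member from the diff above
plus the existing type/count layout (this is not existing API, just the shape
of the idea):

/* Hypothetical helper: build an iov_iter around a single user buffer,
 * skipping the struct iovec indirection entirely.  ITER_UBUF and the
 * .buf union member are the proposed additions quoted above. */
static inline void iov_iter_ubuf(struct iov_iter *i, unsigned int direction,
				 void __user *buf, size_t count)
{
	*i = (struct iov_iter) {
		.type		= ITER_UBUF | direction,
		.buf		= buf,
		.iov_offset	= 0,
		.count		= count,
	};
}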


I tried this benchmark on nvfs:

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
	unsigned long i;
	unsigned long l = 1UL << 38;
	unsigned s = 4096;
	/* read the same 4k block over and over from fd 0 (the test file) */
	void *a = valloc(s);
	if (!a) perror("malloc"), exit(1);
	for (i = 0; i < l; i += s) {
		if (pread(0, a, s, 0) != s) perror("read"), exit(1);
	}
	return 0;
}


Result, using the read_iter method:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3K of event 'cycles'
# Event count (approx.): 1049885560
#
# Overhead  Command  Shared Object     Symbol
# ........  .......  ................  .....................................
#
    47.32%  pread    [kernel.vmlinux]  [k] copy_user_generic_string
     7.83%  pread    [kernel.vmlinux]  [k] current_time
     6.57%  pread    [nvfs]            [k] nvfs_rw_iter_locked
     5.59%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64
     4.23%  pread    libc-2.31.so      [.] __libc_pread
     3.51%  pread    [kernel.vmlinux]  [k] syscall_return_via_sysret
     2.34%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     2.34%  pread    [kernel.vmlinux]  [k] vfs_read
     2.34%  pread    [kernel.vmlinux]  [k] __fsnotify_parent
     2.31%  pread    [kernel.vmlinux]  [k] new_sync_read
     2.21%  pread    [nvfs]            [k] nvfs_bmap
     1.89%  pread    [kernel.vmlinux]  [k] iov_iter_advance
     1.71%  pread    [kernel.vmlinux]  [k] __x64_sys_pread64
     1.59%  pread    [kernel.vmlinux]  [k] atime_needs_update
     1.24%  pread    [nvfs]            [k] nvfs_rw_iter
     0.94%  pread    [kernel.vmlinux]  [k] touch_atime
     0.75%  pread    [kernel.vmlinux]  [k] syscall_enter_from_user_mode
     0.72%  pread    [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
     0.68%  pread    [kernel.vmlinux]  [k] down_read
     0.62%  pread    [kernel.vmlinux]  [k] exit_to_user_mode_prepare
     0.52%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     0.49%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
     0.47%  pread    [kernel.vmlinux]  [k] __fget_light
     0.46%  pread    [kernel.vmlinux]  [k] do_syscall_64
     0.42%  pread    pread             [.] main
     0.33%  pread    [kernel.vmlinux]  [k] up_read
     0.29%  pread    [kernel.vmlinux]  [k] iov_iter_init
     0.16%  pread    [kernel.vmlinux]  [k] __fdget
     0.10%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
     0.03%  pread    pread             [.] pread@plt
     0.00%  perf     [kernel.vmlinux]  [k] x86_pmu_enable_all


#
# (Tip: Use --symfs <dir> if your symbol files are in non-standard locations)
#



Result, using the read method:

# To display the perf.data header info, please use --header/--header-only options.
#
#
# Total Lost Samples: 0
#
# Samples: 3K of event 'cycles'
# Event count (approx.): 1312158116
#
# Overhead  Command  Shared Object     Symbol
# ........  .......  ................  .....................................
#
    60.77%  pread    [kernel.vmlinux]  [k] copy_user_generic_string
     6.14%  pread    [kernel.vmlinux]  [k] current_time
     3.88%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64
     3.55%  pread    libc-2.31.so      [.] __libc_pread
     3.04%  pread    [nvfs]            [k] nvfs_bmap
     2.84%  pread    [kernel.vmlinux]  [k] syscall_return_via_sysret
     2.71%  pread    [nvfs]            [k] nvfs_read
     2.56%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_after_hwframe
     2.00%  pread    [kernel.vmlinux]  [k] __x64_sys_pread64
     1.98%  pread    [kernel.vmlinux]  [k] __fsnotify_parent
     1.77%  pread    [kernel.vmlinux]  [k] vfs_read
     1.35%  pread    [kernel.vmlinux]  [k] atime_needs_update
     0.94%  pread    [kernel.vmlinux]  [k] exit_to_user_mode_prepare
     0.91%  pread    [kernel.vmlinux]  [k] __fget_light
     0.83%  pread    [kernel.vmlinux]  [k] syscall_enter_from_user_mode
     0.70%  pread    [kernel.vmlinux]  [k] down_read
     0.70%  pread    [kernel.vmlinux]  [k] touch_atime
     0.65%  pread    [kernel.vmlinux]  [k] ktime_get_coarse_real_ts64
     0.55%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode
     0.49%  pread    [kernel.vmlinux]  [k] up_read
     0.44%  pread    [kernel.vmlinux]  [k] do_syscall_64
     0.39%  pread    [kernel.vmlinux]  [k] syscall_exit_to_user_mode_prepare
     0.34%  pread    pread             [.] main
     0.26%  pread    [kernel.vmlinux]  [k] __fdget
     0.10%  pread    pread             [.] pread@plt
     0.10%  pread    [kernel.vmlinux]  [k] entry_SYSCALL_64_safe_stack
     0.00%  perf     [kernel.vmlinux]  [k] x86_pmu_enable_all


#
# (Tip: To set sample time separation other than 100ms with --sort time use --time-quantum)
#


Note that if we sum the percentages of nvfs_rw_iter_locked, new_sync_read,
iov_iter_advance and nvfs_rw_iter, we get 12.01%. On the other hand, in the
second trace, nvfs_read consumes just 2.71% - and it replaces the
functionality of all these functions.

That is the reason for the 10% degradation with read_iter.

Mikulas


2021-01-10 06:17:53

by Matthew Wilcox

Subject: Re: Expense of read_iter

On Thu, Jan 07, 2021 at 01:59:01PM -0500, Mikulas Patocka wrote:
> On Thu, 7 Jan 2021, Matthew Wilcox wrote:
> > On Thu, Jan 07, 2021 at 08:15:41AM -0500, Mikulas Patocka wrote:
> > > I'd like to ask about this piece of code in __kernel_read:
> > > if (unlikely(!file->f_op->read_iter || file->f_op->read))
> > > return warn_unsupported...
> > > and __kernel_write:
> > > if (unlikely(!file->f_op->write_iter || file->f_op->write))
> > > return warn_unsupported...
> > >
> > > - It exits with an error if both read_iter and read (or both write_iter
> > > and write) are present.
> > >
> > > I found out that on NVFS, reading a file with the read method has 10%
> > > better performance than the read_iter method. The benchmark just reads the
> > > same 4k page over and over again - and the cost of creating and parsing
> > > the kiocb and iov_iter structures is just that high.
> >
> > Which part of it is so expensive?
>
> The read_iter path is much bigger:
> vfs_read - 0x160 bytes
> new_sync_read - 0x160 bytes
> nvfs_rw_iter - 0x100 bytes
> nvfs_rw_iter_locked - 0x4a0 bytes
> iov_iter_advance - 0x300 bytes

Number of bytes in a function isn't really correlated with how expensive
a particular function is. That said, looking at new_sync_read() shows
one part that's particularly bad, init_sync_kiocb():

static inline int iocb_flags(struct file *file)
{
	int res = 0;
	if (file->f_flags & O_APPEND)
		res |= IOCB_APPEND;
     7ec:  8b 57 40              mov    0x40(%rdi),%edx
     7ef:  48 89 75 80           mov    %rsi,-0x80(%rbp)
	if (file->f_flags & O_DIRECT)
     7f3:  89 d0                 mov    %edx,%eax
     7f5:  c1 e8 06              shr    $0x6,%eax
     7f8:  83 e0 10              and    $0x10,%eax
		res |= IOCB_DIRECT;
	if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))
     7fb:  89 c1                 mov    %eax,%ecx
     7fd:  81 c9 00 00 02 00     or     $0x20000,%ecx
     803:  f6 c6 40              test   $0x40,%dh
     806:  0f 45 c1              cmovne %ecx,%eax
		res |= IOCB_DSYNC;
     809:  f6 c6 10              test   $0x10,%dh
     80c:  75 18                 jne    826 <new_sync_read+0x66>
     80e:  48 8b 8f d8 00 00 00  mov    0xd8(%rdi),%rcx
     815:  48 8b 09              mov    (%rcx),%rcx
     818:  48 8b 71 28           mov    0x28(%rcx),%rsi
     81c:  f6 46 50 10           testb  $0x10,0x50(%rsi)
     820:  0f 84 e2 00 00 00     je     908 <new_sync_read+0x148>
	if (file->f_flags & __O_SYNC)
     826:  83 c8 02              or     $0x2,%eax
		res |= IOCB_SYNC;
	return res;
     829:  89 c1                 mov    %eax,%ecx
     82b:  83 c9 04              or     $0x4,%ecx
     82e:  81 e2 00 00 10 00     and    $0x100000,%edx

We could optimise this by, eg, checking for (__O_SYNC | O_DIRECT |
O_APPEND) and returning 0 if none of them are set, since they're all
pretty rare. It might be better to maintain an f_iocb_flags in the
struct file and just copy that unconditionally. We'd need to remember
to update it in fcntl(F_SETFL), but I think that's the only place.
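
A minimal sketch of that caching idea, assuming a new f_iocb_flags field in
struct file (the field and helper names here are illustrative, not existing
code):

/* Translate f_flags into IOCB_* flags once, instead of on every
 * new_sync_read()/new_sync_write() call via init_sync_kiocb(). */
static int file_calc_iocb_flags(struct file *file)
{
	int res = 0;

	if (file->f_flags & O_APPEND)
		res |= IOCB_APPEND;
	if (file->f_flags & O_DIRECT)
		res |= IOCB_DIRECT;
	if ((file->f_flags & O_DSYNC) || IS_SYNC(file->f_mapping->host))
		res |= IOCB_DSYNC;
	if (file->f_flags & __O_SYNC)
		res |= IOCB_SYNC;
	return res;
}

/* Recomputed at open time and from fcntl(F_SETFL) after f_flags changes;
 * init_sync_kiocb() would then just copy file->f_iocb_flags. */
static void file_update_iocb_flags(struct file *file)
{
	file->f_iocb_flags = file_calc_iocb_flags(file);
}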


> If we go with the "read" method, there's just:
> vfs_read - 0x160 bytes
> nvfs_read - 0x200 bytes
>
> > Is it worth, eg adding an iov_iter
> > type that points to a single buffer instead of a single-member iov?

> 6.57% pread [nvfs] [k] nvfs_rw_iter_locked
> 2.31% pread [kernel.vmlinux] [k] new_sync_read
> 1.89% pread [kernel.vmlinux] [k] iov_iter_advance
> 1.24% pread [nvfs] [k] nvfs_rw_iter
> 0.29% pread [kernel.vmlinux] [k] iov_iter_init

> 2.71% pread [nvfs] [k] nvfs_read

> Note that if we sum the percentages of nvfs_rw_iter_locked, new_sync_read,
> iov_iter_advance and nvfs_rw_iter, we get 12.01%. On the other hand, in the
> second trace, nvfs_read consumes just 2.71% - and it replaces the
> functionality of all these functions.
>
> That is the reason for the 10% degradation with read_iter.

You seem to be focusing on your argument for "let's just permit
filesystems to implement both ->read and ->read_iter". My suggestion
is that we need to optimise the ->read_iter path, but to do that we need
to know what's expensive.

nvfs_rw_iter_locked() looks very complicated. I suspect it can
be simplified. Of course new_sync_read() needs to be improved too,
as do the other functions here, but fully a third of the difference
between read() and read_iter() is the difference between nvfs_read()
and nvfs_rw_iter_locked().

2021-01-10 21:22:31

by Mikulas Patocka

Subject: Re: Expense of read_iter



On Sun, 10 Jan 2021, Matthew Wilcox wrote:

> > That is the reason for the 10% degradation with read_iter.
>
> You seem to be focusing on your argument for "let's just permit
> filesystems to implement both ->read and ->read_iter". My suggestion
> is that we need to optimise the ->read_iter path, but to do that we need
> to know what's expensive.
>
> nvfs_rw_iter_locked() looks very complicated. I suspect it can
> be simplified.

I split it into separate read and write functions and it improved
performance by 1.3%. Using Al Viro's read_iter improves performance by 3%.

> Of course new_sync_read() needs to be improved too,
> as do the other functions here, but fully a third of the difference
> between read() and read_iter() is the difference between nvfs_read()
> and nvfs_rw_iter_locked().

I put counters into vfs_read and vfs_readv.

After a fresh boot of the virtual machine, the counters show "13385 4".
After a kernel compilation they show "4475220 8".

So, the readv path is almost unused.
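
A sketch of the kind of instrumentation meant here (the actual counters were
not posted; the names below are made up):

/* Ad-hoc counters bumped at the top of vfs_read() and vfs_readv(),
 * dumped later via printk() or a debugfs file to get numbers like
 * the "13385 4" pair quoted above. */
static atomic_long_t vfs_read_calls  = ATOMIC_LONG_INIT(0);
static atomic_long_t vfs_readv_calls = ATOMIC_LONG_INIT(0);

static inline void count_vfs_read(void)
{
	atomic_long_inc(&vfs_read_calls);	/* first number in the dump */
}

static inline void count_vfs_readv(void)
{
	atomic_long_inc(&vfs_readv_calls);	/* second number in the dump */
}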

My reasoning was that we should optimize for the "read" path and glue the
"readv" path on top of that. Currently, the kernel is doing the
opposite - optimizing for "readv" and gluing "read" on top of it.

Mikulas

2021-01-11 00:20:35

by Matthew Wilcox

Subject: Re: Expense of read_iter

On Sun, Jan 10, 2021 at 04:19:15PM -0500, Mikulas Patocka wrote:
> I put counters into vfs_read and vfs_readv.
>
> After a fresh boot of the virtual machine, the counters show "13385 4".
> After a kernel compilation they show "4475220 8".
>
> So, the readv path is almost unused.
>
> My reasoning was that we should optimize for the "read" path and glue the
> "readv" path on top of that. Currently, the kernel is doing the
> opposite - optimizing for "readv" and gluing "read" on top of it.

But it's not about optimising for read vs readv. read_iter handles
a host of other cases, such as pread(), preadv(), AIO reads, splice,
and reads to in-kernel buffers.

Some device drivers abused read() vs readv() to actually return different
information, depending which you called. That's why there's now a
prohibition against both.

So let's figure out how to make read_iter() perform well for sys_read().

2021-01-11 13:05:43

by David Laight

Subject: RE: Expense of read_iter

From: Matthew Wilcox
> Sent: 10 January 2021 06:13
...
> nvfs_rw_iter_locked() looks very complicated. I suspect it can
> be simplified. Of course new_sync_read() needs to be improved too,
> as do the other functions here, but fully a third of the difference
> between read() and read_iter() is the difference between nvfs_read()
> and nvfs_rw_iter_locked().

There is also the non-zero cost of import_iovec().
I've got some slight speedups, but haven't measured an
old kernel yet to see how much slower 5.11-rc1 made it.

Basic test is:
	fd = open("/dev/null", O_RDWR);
	for (i = 0; i < 10000; i++) {
		start = rdtsc();
		writev(fd, iovec, count);
		histogram[rdtsc() - start]++;
	}

This doesn't actually copy any data - the iovec
isn't iterated.
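
For reference, a self-contained approximation of that loop (not the actual
test program; the segment count, buffer sizes, rdtsc intrinsic and histogram
bound are assumptions made here):

#include <stdio.h>
#include <stdint.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/uio.h>
#include <x86intrin.h>

#define ITERS   10000
#define SEGS    8
#define BUCKETS 100000

static unsigned histogram[BUCKETS];
static char bufs[SEGS][64];

int main(void)
{
	struct iovec iovec[SEGS];
	int fd = open("/dev/null", O_RDWR);
	unsigned i;

	if (fd < 0)
		return 1;
	for (i = 0; i < SEGS; i++) {
		iovec[i].iov_base = bufs[i];
		iovec[i].iov_len = sizeof(bufs[i]);
	}
	for (i = 0; i < ITERS; i++) {
		uint64_t start = __rdtsc();
		writev(fd, iovec, SEGS);	/* /dev/null: no data is copied */
		uint64_t cycles = __rdtsc() - start;
		if (cycles < BUCKETS)
			histogram[cycles]++;
	}
	/* print only the populated buckets */
	for (i = 0; i < BUCKETS; i++)
		if (histogram[i])
			printf("%u cycles: %u\n", i, histogram[i]);
	return 0;
}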

I'm seeing pretty stable counts for most of the 10000 iterations.
But different program runs can give massively different timings.
I'm guessing that depends on cache collisions due to the addresses
(virtual or physical?) selected for some items.

For 5.11-rc2, -mx32 is slightly faster than 64-bit.
-m32 has a much slower syscall entry/exit path,
but the difference between gettid() and writev() is smaller.
The compat code for import_iovec() is actually faster.
This isn't really surprising since copy_from_user() is
absolutely horrid these days - especially with userspace hardening.

David


2021-01-11 21:13:29

by Mikulas Patocka

Subject: Re: Expense of read_iter



On Mon, 11 Jan 2021, Matthew Wilcox wrote:

> On Sun, Jan 10, 2021 at 04:19:15PM -0500, Mikulas Patocka wrote:
> > I put counters into vfs_read and vfs_readv.
> >
> > After a fresh boot of the virtual machine, the counters show "13385 4".
> > After a kernel compilation they show "4475220 8".
> >
> > So, the readv path is almost unused.
> >
> > My reasoning was that we should optimize for the "read" path and glue the
> > "readv" path on top of that. Currently, the kernel is doing the
> > opposite - optimizing for "readv" and gluing "read" on top of it.
>
> But it's not about optimising for read vs readv. read_iter handles
> a host of other cases, such as pread(), preadv(), AIO reads, splice,
> and reads to in-kernel buffers.

These things are used rarely compared to "read" and "pread". (BTW, "pread"
could be handled by the read method too.)

Why do you think that the "read" syscall should use the "read_iter" code
path? Is it because duplicating the logic is discouraged? Or because of
code size? Or something else?

> Some device drivers abused read() vs readv() to actually return different
> information, depending which you called. That's why there's now a
> prohibition against both.
>
> So let's figure out how to make read_iter() perform well for sys_read().

I've got another idea - in nvfs_read_iter, test if the iovec contains just
one entry and call nvfs_read_locked if it does.

diff --git a/file.c b/file.c
index f4b8a1a..e4d87b2 100644
--- a/file.c
+++ b/file.c
@@ -460,6 +460,10 @@ static ssize_t nvfs_read_iter(struct kiocb *iocb, struct iov_iter *iov)
 	if (!IS_DAX(&nmi->vfs_inode)) {
 		r = generic_file_read_iter(iocb, iov);
 	} else {
+		if (likely(iter_is_iovec(iov)) && likely(!iov->iov_offset) && likely(iov->nr_segs == 1)) {
+			r = nvfs_read_locked(nmi, iocb->ki_filp, iov->iov->iov_base, iov->count, true, &iocb->ki_pos);
+			goto unlock_ret;
+		}
 #if 1
 		r = nvfs_rw_iter_locked(iocb, iov, false);
 #else
@@ -467,6 +471,7 @@ static ssize_t nvfs_read_iter(struct kiocb *iocb, struct iov_iter *iov)
 #endif
 	}
 
+unlock_ret:
 	inode_unlock_shared(&nmi->vfs_inode);
 
 	return r;



The result is:

nvfs_read_iter - 7.307s
Al Viro's read_iter_locked - 7.147s
test for just one entry - 7.010s
the read method - 6.782s

So far, this is the best way to do it, but it's still 3.3% worse than
the read method. There isn't anything more that can be optimized at the
filesystem level - the rest of the optimization must be done in the VFS.

Mikulas