2021-05-03 18:09:48

by Jens Axboe

[permalink] [raw]
Subject: [PATCH] eventfd: convert to using ->write_iter()

Had a report on writing to eventfd with io_uring is slower than it
should be, and it's the usual case of if a file type doesn't support
->write_iter(), then io_uring cannot rely on IOCB_NOWAIT being honored
alongside O_NONBLOCK for whether or not this is a non-blocking write
attempt. That means io_uring will punt the operation to an io thread,
which will slow us down unnecessarily.

Convert eventfd to using fops->write_iter() instead of fops->write().

Signed-off-by: Jens Axboe <[email protected]>

---

diff --git a/fs/eventfd.c b/fs/eventfd.c
index e265b6dd4f34..02c55e5e1a3e 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -264,17 +264,18 @@ static ssize_t eventfd_read(struct kiocb *iocb, struct iov_iter *to)
return sizeof(ucnt);
}

-static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t count,
- loff_t *ppos)
+static ssize_t eventfd_write(struct kiocb *kiocb, struct iov_iter *from)
{
+ struct file *file = kiocb->ki_filp;
struct eventfd_ctx *ctx = file->private_data;
+ size_t count = iov_iter_count(from);
ssize_t res;
__u64 ucnt;
DECLARE_WAITQUEUE(wait, current);

if (count < sizeof(ucnt))
return -EINVAL;
- if (copy_from_user(&ucnt, buf, sizeof(ucnt)))
+ if (copy_from_iter(&ucnt, count, from) != count)
return -EFAULT;
if (ucnt == ULLONG_MAX)
return -EINVAL;
@@ -282,7 +283,8 @@ static ssize_t eventfd_write(struct file *file, const char __user *buf, size_t c
res = -EAGAIN;
if (ULLONG_MAX - ctx->count > ucnt)
res = sizeof(ucnt);
- else if (!(file->f_flags & O_NONBLOCK)) {
+ else if (!(file->f_flags & O_NONBLOCK) &&
+ !(kiocb->ki_flags & IOCB_NOWAIT)) {
__add_wait_queue(&ctx->wqh, &wait);
for (res = 0;;) {
set_current_state(TASK_INTERRUPTIBLE);
@@ -331,7 +333,7 @@ static const struct file_operations eventfd_fops = {
.release = eventfd_release,
.poll = eventfd_poll,
.read_iter = eventfd_read,
- .write = eventfd_write,
+ .write_iter = eventfd_write,
.llseek = noop_llseek,
};


--
Jens Axboe


2021-05-03 18:11:56

by David Laight

[permalink] [raw]
Subject: RE: [PATCH] eventfd: convert to using ->write_iter()

From: Jens Axboe
> Sent: 03 May 2021 15:58
>
> Had a report on writing to eventfd with io_uring is slower than it
> should be, and it's the usual case of if a file type doesn't support
> ->write_iter(), then io_uring cannot rely on IOCB_NOWAIT being honored
> alongside O_NONBLOCK for whether or not this is a non-blocking write
> attempt. That means io_uring will punt the operation to an io thread,
> which will slow us down unnecessarily.
>
> Convert eventfd to using fops->write_iter() instead of fops->write().

Won't this have a measurable performance degradation on normal
code that does write(event_fd, &one, 4);

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2021-05-03 18:15:24

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] eventfd: convert to using ->write_iter()

On 5/3/21 12:02 PM, Matthew Wilcox wrote:
> On Mon, May 03, 2021 at 11:57:08AM -0600, Jens Axboe wrote:
>> On 5/3/21 10:12 AM, David Laight wrote:
>>> From: Jens Axboe
>>>> Sent: 03 May 2021 15:58
>>>>
>>>> Had a report on writing to eventfd with io_uring is slower than it
>>>> should be, and it's the usual case of if a file type doesn't support
>>>> ->write_iter(), then io_uring cannot rely on IOCB_NOWAIT being honored
>>>> alongside O_NONBLOCK for whether or not this is a non-blocking write
>>>> attempt. That means io_uring will punt the operation to an io thread,
>>>> which will slow us down unnecessarily.
>>>>
>>>> Convert eventfd to using fops->write_iter() instead of fops->write().
>>>
>>> Won't this have a measurable performance degradation on normal
>>> code that does write(event_fd, &one, 4);
>>
>> If ->write_iter() or ->read_iter() is much slower than the non-iov
>> versions, then I think we have generic issues that should be solved.
>
> We do!
>
> https://lore.kernel.org/linux-fsdevel/[email protected]/
> is one thread on it. There have been others.

But then we really must get that fixed, imho ->read() and ->write()
should go away, and if the iter variants are 10% slower, then that should
get fixed up.

I'll go over that thread.

--
Jens Axboe

2021-05-04 08:09:38

by David Laight

[permalink] [raw]
Subject: RE: [PATCH] eventfd: convert to using ->write_iter()

From: Jens Axboe
> Sent: 03 May 2021 19:05
>
> On 5/3/21 12:02 PM, Matthew Wilcox wrote:
> > On Mon, May 03, 2021 at 11:57:08AM -0600, Jens Axboe wrote:
> >> On 5/3/21 10:12 AM, David Laight wrote:
> >>> From: Jens Axboe
> >>>> Sent: 03 May 2021 15:58
> >>>>
> >>>> Had a report on writing to eventfd with io_uring is slower than it
> >>>> should be, and it's the usual case of if a file type doesn't support
> >>>> ->write_iter(), then io_uring cannot rely on IOCB_NOWAIT being honored
> >>>> alongside O_NONBLOCK for whether or not this is a non-blocking write
> >>>> attempt. That means io_uring will punt the operation to an io thread,
> >>>> which will slow us down unnecessarily.
> >>>>
> >>>> Convert eventfd to using fops->write_iter() instead of fops->write().
> >>>
> >>> Won't this have a measurable performance degradation on normal
> >>> code that does write(event_fd, &one, 4);
> >>
> >> If ->write_iter() or ->read_iter() is much slower than the non-iov
> >> versions, then I think we have generic issues that should be solved.
> >
> > We do!
> >
> > https://lore.kernel.org/linux-fsdevel/[email protected]/
> > is one thread on it. There have been others.
>
> But then we really must get that fixed, imho ->read() and ->write()
> should go away, and if the iter variants are 10% slower, then that should
> get fixed up.

I think there are two separate issues.
(Although I've not looked in detail into the really bad cases.)

1) I suspect some of the fs code is using entirely different paths for the
'single fragment' and 'iter' variants.

2) For trivial drivers the cost of setting up the iov_iter[] and then
iterating it becomes significant (or at least measurable).

I haven't tried to undo the morass of #defines in the iter code.
But I suspect they could be optimised for the common case of
copying an entire single-fragment to/from userspace in one call.

Not related to this code path, but I've some patches that give a
few % speedup for writev() to /dev/null.
That is all about copying the iov[] from user - it doesn't get 'iterated'.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2021-05-09 13:07:52

by kernel test robot

[permalink] [raw]
Subject: [eventfd] cd8a8dd187: WARNING:at_include/linux/thread_info.h:#eventfd_write



Greeting,

FYI, we noticed the following commit (built with gcc-9):

commit: cd8a8dd187e6e7578434025f88daaf40fe0e1ef8 ("[PATCH] eventfd: convert to using ->write_iter()")
url: https://github.com/0day-ci/linux/commits/Jens-Axboe/eventfd-convert-to-using-write_iter/20210503-225846
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 9ccce092fc64d19504fa54de4fd659e279cc92e7

in testcase: trinity
version: trinity-x86_64-03f10b67-1_20210506
with following parameters:

runtime: 300s

test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/


on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):


+-------------------------------------------------------+------------+------------+
| | 9ccce092fc | cd8a8dd187 |
+-------------------------------------------------------+------------+------------+
| boot_successes | 18 | 0 |
| boot_failures | 0 | 25 |
| WARNING:at_include/linux/thread_info.h:#eventfd_write | 0 | 25 |
| RIP:eventfd_write | 0 | 25 |
+-------------------------------------------------------+------------+------------+


If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


[ 12.437252] WARNING: CPU: 1 PID: 543 at include/linux/thread_info.h:199 eventfd_write (kbuild/src/x86_64/include/linux/thread_info.h:199 kbuild/src/x86_64/include/linux/thread_info.h:208 kbuild/src/x86_64/include/linux/uio.h:151 kbuild/src/x86_64/fs/eventfd.c:278)
[ 12.438625] Modules linked in: hidp bnep rfcomm bluetooth ecdh_generic ecc rfkill can_bcm can_raw can crypto_user ib_core nfnetlink scsi_transport_iscsi atm sctp ip6_udp_tunnel udp_tunnel libcrc32c bochs_drm drm_vram_helper drm_ttm_helper ttm sr_mod cdrom drm_kms_helper sg intel_rapl_msr ppdev intel_rapl_common ata_generic crct10dif_pclmul crc32_pclmul crc32c_intel syscopyarea ghash_clmulni_intel sysfillrect rapl sysimgblt fb_sys_fops parport_pc parport ata_piix drm libata joydev ipmi_devintf ipmi_msghandler serio_raw i2c_piix4 ip_tables
[ 12.449452] CPU: 1 PID: 543 Comm: trinity-c1 Not tainted 5.12.0-13584-gcd8a8dd187e6 #3
[ 12.455041] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
[ 12.456585] RIP: 0010:eventfd_write (kbuild/src/x86_64/include/linux/thread_info.h:199 kbuild/src/x86_64/include/linux/thread_info.h:208 kbuild/src/x86_64/include/linux/uio.h:151 kbuild/src/x86_64/fs/eventfd.c:278)
[ 12.457766] Code: 65 ff 0d 28 7f 06 5f e8 63 07 87 00 48 89 ef e8 9b 6e 87 00 eb 9a 4c 89 c2 be 08 00 00 00 48 c7 c7 78 5d 15 a2 e8 6b 96 80 00 <0f> 0b 49 c7 c6 f2 ff ff ff e9 d1 fe ff ff 49 c7 c6 00 fe ff ff 48
All code
========
0: 65 ff 0d 28 7f 06 5f decl %gs:0x5f067f28(%rip) # 0x5f067f2f
7: e8 63 07 87 00 callq 0x87076f
c: 48 89 ef mov %rbp,%rdi
f: e8 9b 6e 87 00 callq 0x876eaf
14: eb 9a jmp 0xffffffffffffffb0
16: 4c 89 c2 mov %r8,%rdx
19: be 08 00 00 00 mov $0x8,%esi
1e: 48 c7 c7 78 5d 15 a2 mov $0xffffffffa2155d78,%rdi
25: e8 6b 96 80 00 callq 0x809695
2a:* 0f 0b ud2 <-- trapping instruction
2c: 49 c7 c6 f2 ff ff ff mov $0xfffffffffffffff2,%r14
33: e9 d1 fe ff ff jmpq 0xffffffffffffff09
38: 49 c7 c6 00 fe ff ff mov $0xfffffffffffffe00,%r14
3f: 48 rex.W

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: 49 c7 c6 f2 ff ff ff mov $0xfffffffffffffff2,%r14
9: e9 d1 fe ff ff jmpq 0xfffffffffffffedf
e: 49 c7 c6 00 fe ff ff mov $0xfffffffffffffe00,%r14
15: 48 rex.W
[ 12.464454] RSP: 0018:ffffac1400a17dd8 EFLAGS: 00010286
[ 12.466157] RAX: 0000000000000000 RBX: ffff9f6aecb8f200 RCX: 0000000000000000
[ 12.468092] RDX: ffff9f6defd27a40 RSI: ffff9f6defd17bf0 RDI: ffff9f6defd17bf0
[ 12.469897] RBP: ffffac1400a17f08 R08: ffff9f6defd17bf0 R09: ffffac1400a17bf8
[ 12.471441] R10: 0000000000000001 R11: 0000000000000001 R12: ffffffffffffffea
[ 12.472811] R13: ffff9f6aeedcad00 R14: ffffac1400a17f08 R15: 00000000000003a7
[ 12.474176] FS: 00007fb9adc2c740(0000) GS:ffff9f6defd00000(0000) knlGS:0000000000000000
[ 12.475606] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 12.476866] CR2: 00007fb9ad16f3fc CR3: 0000000134eb6000 CR4: 00000000000406e0
[ 12.478215] DR0: 00007fb9abfc1000 DR1: 0000000000000000 DR2: 0000000000000000
[ 12.479569] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000600
[ 12.480894] Call Trace:
[ 12.482775] ? wake_up_q (kbuild/src/x86_64/kernel/sched/core.c:5545)
[ 12.483933] new_sync_write (kbuild/src/x86_64/fs/read_write.c:519 (discriminator 1))
[ 12.485037] vfs_write (kbuild/src/x86_64/fs/read_write.c:605)
[ 12.506233] ksys_write (kbuild/src/x86_64/fs/read_write.c:658)
[ 12.507375] do_syscall_64 (kbuild/src/x86_64/arch/x86/entry/common.c:47)
[ 12.508431] entry_SYSCALL_64_after_hwframe (kbuild/src/x86_64/arch/x86/entry/entry_64.S:112)
[ 12.509641] RIP: 0033:0x7fb9add43f59
[ 12.510719] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 07 6f 0c 00 f7 d8 64 89 01 48
All code
========
0: 00 c3 add %al,%bl
2: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
9: 00 00 00
c: 0f 1f 44 00 00 nopl 0x0(%rax,%rax,1)
11: 48 89 f8 mov %rdi,%rax
14: 48 89 f7 mov %rsi,%rdi
17: 48 89 d6 mov %rdx,%rsi
1a: 48 89 ca mov %rcx,%rdx
1d: 4d 89 c2 mov %r8,%r10
20: 4d 89 c8 mov %r9,%r8
23: 4c 8b 4c 24 08 mov 0x8(%rsp),%r9
28: 0f 05 syscall
2a:* 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax <-- trapping instruction
30: 73 01 jae 0x33
32: c3 retq
33: 48 8b 0d 07 6f 0c 00 mov 0xc6f07(%rip),%rcx # 0xc6f41
3a: f7 d8 neg %eax
3c: 64 89 01 mov %eax,%fs:(%rcx)
3f: 48 rex.W

Code starting with the faulting instruction
===========================================
0: 48 3d 01 f0 ff ff cmp $0xfffffffffffff001,%rax
6: 73 01 jae 0x9
8: c3 retq
9: 48 8b 0d 07 6f 0c 00 mov 0xc6f07(%rip),%rcx # 0xc6f17
10: f7 d8 neg %eax
12: 64 89 01 mov %eax,%fs:(%rcx)
15: 48 rex.W
[ 12.513697] RSP: 002b:00007ffeddefa1a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 12.515116] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007fb9add43f59
[ 12.516497] RDX: 00000000000003a7 RSI: 0000558e21026d30 RDI: 000000000000001a
[ 12.517906] RBP: 0000000000000001 R08: 000000000000008b R09: 0000000000000004
[ 12.519243] R10: 0000000000004bfa R11: 0000000000000246 R12: 0000000000000002
[ 12.520622] R13: 00007fb9ac6ef058 R14: 00007fb9adc2c6c0 R15: 00007fb9ac6ef000
[ 12.538117] ---[ end trace cef3c60a6b1ee0eb ]---
[ 17.126345] Kernel tests: Boot OK!
[ 17.126352]
[ 22.248625] install debs round one: dpkg -i --force-confdef --force-depends /opt/deb/gawk_1%3a4.2.1+dfsg-1_amd64.deb
[ 22.248636]
[ 22.254219] Selecting previously unselected package gawk.
[ 22.254227]
[ 22.259934] (Reading database ... 16553 files and directories currently installed.)
[ 22.259940]
[ 22.265322] Preparing to unpack .../gawk_1%3a4.2.1+dfsg-1_amd64.deb ...
[ 22.265328]
[ 22.270126] Unpacking gawk (1:4.2.1+dfsg-1) ...
[ 22.270131]
[ 22.274405] Setting up gawk (1:4.2.1+dfsg-1) ...
[ 22.274411]
[ 22.278069] /lkp/lkp/src/bin/run-lkp
[ 22.278073]
[ 24.022481] RESULT_ROOT=/result/trinity/300s/vm-snb/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/cd8a8dd187e6e7578434025f88daaf40fe0e1ef8/8
[ 24.022506]
[ 24.737426] job=/lkp/jobs/scheduled/vm-snb-51/trinity-300s-debian-10.4-x86_64-20200603.cgz-cd8a8dd187e6e7578434025f88daaf40fe0e1ef8-20210508-44336-1249u1s-8.yaml
[ 24.737434]
[ 29.416142] result_service: raw_upload, RESULT_MNT: /internal-lkp-server/result, RESULT_ROOT: /internal-lkp-server/result/trinity/300s/vm-snb/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/cd8a8dd187e6e7578434025f88daaf40fe0e1ef8/8, TMP_RESULT_ROOT: /tmp/lkp/result
[ 29.416150]
[ 29.427370] run-job /lkp/jobs/scheduled/vm-snb-51/trinity-300s-debian-10.4-x86_64-20200603.cgz-cd8a8dd187e6e7578434025f88daaf40fe0e1ef8-20210508-44336-1249u1s-8.yaml
[ 29.427378]
[ 31.203695] /usr/bin/wget -q --timeout=1800 --tries=1 --local-encoding=UTF-8 http://internal-lkp-server:80/~lkp/cgi-bin/lkp-jobfile-append-var?job_file=/lkp/jobs/scheduled/vm-snb-51/trinity-300s-debian-10.4-x86_64-20200603.cgz-cd8a8dd187e6e7578434025f88daaf40fe0e1ef8-20210508-44336-1249u1s-8.yaml&job_state=running -O /dev/null
[ 31.203704]
[ 31.210937] target ucode:
[ 31.210941]
[ 31.215033] Seeding trinity based on x86_64-rhel-8.3
[ 31.215038]
[ 31.222073] 2021-05-08 01:36:06 chroot --userspec nobody:nogroup / trinity -q -q -l off -s 1655450980 -x get_robust_list -x remap_file_pages -N 999999999
[ 31.222080]
[ 31.228500] Trinity 2019.06 Dave Jones <[email protected]>
[ 31.228505]
[ 31.233044] shm:0x7fb9ade1d000-0x7fb9baa19d00 (4 pages)
[ 31.233048]
[ 31.238386] [main] Marking syscall get_robust_list (64bit:274 32bit:312) as to be disabled.
[ 31.238392]
[ 31.244057] [main] Marking syscall remap_file_pages (64bit:216 32bit:257) as to be disabled.
[ 31.244062]
[ 31.248408] [main] Couldn't chmod tmp/ to 0777.
[ 31.248413]
[ 31.254379] [main] Using user passed random seed: 1655450980.
[ 31.254396]
[ 31.258610] Marking all syscalls as enabled.
[ 31.258615]
[ 31.263663] [main] Disabling syscalls marked as disabled by command line options
[ 31.263669]
[ 31.268754] [main] Marked 64-bit syscall remap_file_pages (216) as deactivated.
[ 31.268759]
[ 31.273859] [main] Marked 64-bit syscall get_robust_list (274) as deactivated.
[ 31.273865]
[ 31.280261] [main] Marked 32-bit syscall remap_file_pages (257) as deactivated.
[ 31.280267]
[ 31.285582] [main] Marked 32-bit syscall get_robust_list (312) as deactivated.
[ 31.285588]
[ 31.291580] [main] 32-bit syscalls: 426 enabled, 3 disabled. 64-bit syscalls: 352 enabled, 91 disabled.
[ 31.291586]
[ 31.295866] [main] Using pid_max = 32768
[ 31.295870]
[ 31.299736] [main] futex: 0 owner:0 global:1
[ 31.299740]
[ 31.303778] [main] futex: 0 owner:0 global:1
[ 31.303782]


To reproduce:

# build kernel
cd linux
cp config-5.12.0-13584-gcd8a8dd187e6 .config
make HOSTCC=gcc-9 CC=gcc-9 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email



---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang


Attachments:
(No filename) (11.50 kB)
config-5.12.0-13584-gcd8a8dd187e6 (176.60 kB)
job-script (4.48 kB)
dmesg.xz (15.91 kB)
trinity (2.01 kB)
Download all attachments