2022-05-08 20:32:48

by Hao Xu

Subject: [PATCH v2 0/5] fast poll multishot mode

Let fast poll support multishot mode; currently accept is added as its
first consumer.
theoretical analysis:
1) when connections come in fast
   - singleshot:
     add accept sqe(userspace) --> accept inline
             ^                           |
             |---------------------------|
   - multishot:
     add accept sqe(userspace) --> accept inline
                                      ^     |
                                      |--*--|

we do accept repeatedly at the * until we get EAGAIN

2) when connections come in at low pressure
similar to 1): we save a lot of userspace-kernel context
switches and useless vfs_poll() calls


tests:
Did some tests that go like this:

    server                  client(multiple)
    accept                  connect
    read                    write
    write                   read
    close                   close

Basically, it spins up a number of clients (on the same machine as the
server) that connect to the server and write some data to it; the server
writes the data back to each client after receiving it, then closes the
connection once the write returns. The client then reads the data and
closes the connection. Here I test 10000 clients connecting to one
server, with a data size of 128 bytes. Each client runs in its own
goroutine, so they all hit the server within a short time.
Tested 20 times before/after this patchset; time spent (unit: cycles,
i.e. the return value of clock()):
before:
(1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
+1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
+1934226+1914385)/20.0 = 1927633.75
after:
(1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
+1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
+1871324+1940803)/20.0 = 1894750.45

(1927633.75 - 1894750.45) / 1927633.75 ≈ 1.71%
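
For illustration, the timing method boils down to this sketch, where
run_workload() is a hypothetical stand-in for one full 10000-client
accept/echo run:

#include <stdio.h>
#include <time.h>

/* Hypothetical stand-in for one run of the accept/echo test above. */
extern void run_workload(void);

int main(void)
{
	clock_t start = clock();

	run_workload();
	/* clock() reports processor time in clock ticks; the raw
	 * difference is what the sums above average over 20 runs. */
	printf("%ld\n", (long)(clock() - start));
	return 0;
}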


A liburing test is here:
https://github.com/HowHsu/liburing/blob/multishot_accept/test/accept.c
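
For reference, a minimal sketch of what the consumer side looks like
from userspace. It assumes the io_uring_prep_multishot_accept() helper
from the liburing branch above (on a liburing without it, the
equivalent is io_uring_prep_accept() plus setting
IORING_ACCEPT_MULTISHOT in sqe->ioprio by hand); handle_conn() is a
hypothetical per-connection handler:

#include <liburing.h>

/* Hypothetical handler for each accepted connection fd. */
extern void handle_conn(int fd);

static int accept_loop(int listen_fd)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	struct io_uring_sqe *sqe;
	int ret = io_uring_queue_init(64, &ring, 0);

	if (ret < 0)
		return ret;

	/* One SQE arms accept for many incoming connections. */
	sqe = io_uring_get_sqe(&ring);
	io_uring_prep_multishot_accept(sqe, listen_fd, NULL, NULL, 0);
	io_uring_submit(&ring);

	while (1) {
		ret = io_uring_wait_cqe(&ring, &cqe);
		if (ret < 0)
			break;
		ret = cqe->res;
		if (ret < 0) {
			/* An error terminates the multishot request. */
			io_uring_cqe_seen(&ring, cqe);
			break;
		}
		handle_conn(ret);	/* cqe->res is the accepted fd */
		if (!(cqe->flags & IORING_CQE_F_MORE)) {
			/* Request finished; re-arm with a fresh SQE. */
			sqe = io_uring_get_sqe(&ring);
			io_uring_prep_multishot_accept(sqe, listen_fd,
						       NULL, NULL, 0);
			io_uring_submit(&ring);
		}
		io_uring_cqe_seen(&ring, cqe);
	}
	io_uring_queue_exit(&ring);
	return ret;
}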

v1->v2:
- re-implement it against the reworked poll code

Hao Xu (5):
io_uring: add IORING_ACCEPT_MULTISHOT for accept
io_uring: add REQ_F_APOLL_MULTISHOT for requests
io_uring: let fast poll support multishot
io_uring: add a helper for poll clean
io_uring: implement multishot mode for accept

fs/io_uring.c | 121 ++++++++++++++++++++++++++--------
include/uapi/linux/io_uring.h | 5 ++
2 files changed, 100 insertions(+), 26 deletions(-)


base-commit: f2e030dd7aaea5a937a2547dc980fab418fbc5e7
--
2.36.0



2022-05-09 02:29:33

by Jens Axboe

Subject: Re: [PATCH v2 0/5] fast poll multishot mode

On 5/6/22 4:23 PM, Jens Axboe wrote:
> On 5/6/22 1:00 AM, Hao Xu wrote:
>> [snip]
>
> Wish I had seen that, I wrote my own! But maybe that's good, you tend to
> find other issues through that.
>
> Anyway, works for me in testing, and I can see this being a nice win for
> accept intensive workloads. I pushed a bunch of cleanup patches that
> should just get folded in. Can you fold them into your patches and
> address the other feedback, and post a v3? I pushed the test branch
> here:
>
> https://git.kernel.dk/cgit/linux-block/log/?h=fastpoll-mshot

Quick benchmark here, accepting 10k connections:

Stock kernel
real 0m0.728s
user 0m0.009s
sys 0m0.192s

Patched
real 0m0.684s
user 0m0.018s
sys 0m0.102s

Looks like a nice win for a highly synthetic benchmark. Nothing
scientific, was just curious.

--
Jens Axboe


2022-05-09 02:34:38

by Jens Axboe

Subject: Re: [PATCH v2 0/5] fast poll multishot mode

On 5/6/22 5:26 PM, Jens Axboe wrote:
> On 5/6/22 4:23 PM, Jens Axboe wrote:
>> On 5/6/22 1:00 AM, Hao Xu wrote:
>>> [snip]
>>
>> [snip]
>
> Quick benchmark here, accepting 10k connections:
>
> Stock kernel
> real 0m0.728s
> user 0m0.009s
> sys 0m0.192s
>
> Patched
> real 0m0.684s
> user 0m0.018s
> sys 0m0.102s
>
> Looks like a nice win for a highly synthetic benchmark. Nothing
> scientific, was just curious.

One more thought on this - how is it supposed to work with
accept-direct? One idea would be to make the fixed file slot
incrementally increasing. But we need a good story for that; if
multishot is exclusive to non-direct files, then it's a lot less
interesting, as direct files are a really nice win when dealing with
lots of files. If we can combine the two, even better.

--
Jens Axboe


2022-05-09 02:40:24

by Hao Xu

Subject: [PATCH 3/5] io_uring: let fast poll support multishot

From: Hao Xu <[email protected]>

For operations like accept, multishot is a useful feature: we can
reduce the number of accept sqes we submit. Let's integrate it with
fast poll; it may be useful for other operations in the future.

Signed-off-by: Hao Xu <[email protected]>
---
fs/io_uring.c | 41 ++++++++++++++++++++++++++---------------
1 file changed, 26 insertions(+), 15 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 8ebb1a794e36..d33777575faf 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -5952,7 +5952,7 @@ static void io_poll_remove_entries(struct io_kiocb *req)
  * either spurious wakeup or multishot CQE is served. 0 when it's done with
  * the request, then the mask is stored in req->cqe.res.
  */
-static int io_poll_check_events(struct io_kiocb *req, bool locked)
+static int io_poll_check_events(struct io_kiocb *req, bool *locked)
 {
 	struct io_ring_ctx *ctx = req->ctx;
 	int v;
@@ -5981,17 +5981,26 @@ static int io_poll_check_events(struct io_kiocb *req, bool locked)
 
 	/* multishot, just fill an CQE and proceed */
 	if (req->cqe.res && !(req->apoll_events & EPOLLONESHOT)) {
-		__poll_t mask = mangle_poll(req->cqe.res & req->apoll_events);
-		bool filled;
-
-		spin_lock(&ctx->completion_lock);
-		filled = io_fill_cqe_aux(ctx, req->cqe.user_data, mask,
-					 IORING_CQE_F_MORE);
-		io_commit_cqring(ctx);
-		spin_unlock(&ctx->completion_lock);
-		if (unlikely(!filled))
-			return -ECANCELED;
-		io_cqring_ev_posted(ctx);
+		if (req->flags & REQ_F_APOLL_MULTISHOT) {
+			io_tw_lock(req->ctx, locked);
+			if (likely(!(req->task->flags & PF_EXITING)))
+				io_queue_sqe(req);
+			else
+				return -EFAULT;
+		} else {
+			__poll_t mask = mangle_poll(req->cqe.res &
+						    req->apoll_events);
+			bool filled;
+
+			spin_lock(&ctx->completion_lock);
+			filled = io_fill_cqe_aux(ctx, req->cqe.user_data,
+						 mask, IORING_CQE_F_MORE);
+			io_commit_cqring(ctx);
+			spin_unlock(&ctx->completion_lock);
+			if (unlikely(!filled))
+				return -ECANCELED;
+			io_cqring_ev_posted(ctx);
+		}
 	} else if (req->cqe.res) {
 		return 0;
 	}
@@ -6010,7 +6019,7 @@ static void io_poll_task_func(struct io_kiocb *req, bool *locked)
 	struct io_ring_ctx *ctx = req->ctx;
 	int ret;
 
-	ret = io_poll_check_events(req, *locked);
+	ret = io_poll_check_events(req, locked);
 	if (ret > 0)
 		return;
 
@@ -6035,7 +6044,7 @@ static void io_apoll_task_func(struct io_kiocb *req, bool *locked)
 	struct io_ring_ctx *ctx = req->ctx;
 	int ret;
 
-	ret = io_poll_check_events(req, *locked);
+	ret = io_poll_check_events(req, locked);
 	if (ret > 0)
 		return;
 
@@ -6275,7 +6284,7 @@ static int io_arm_poll_handler(struct io_kiocb *req, unsigned issue_flags)
 	struct io_ring_ctx *ctx = req->ctx;
 	struct async_poll *apoll;
 	struct io_poll_table ipt;
-	__poll_t mask = EPOLLONESHOT | POLLERR | POLLPRI;
+	__poll_t mask = POLLERR | POLLPRI;
 	int ret;
 
 	if (!def->pollin && !def->pollout)
@@ -6284,6 +6293,8 @@ static int io_arm_poll_handler(struct io_kiocb *req, unsigned issue_flags)
 		return IO_APOLL_ABORTED;
 	if ((req->flags & (REQ_F_POLLED|REQ_F_PARTIAL_IO)) == REQ_F_POLLED)
 		return IO_APOLL_ABORTED;
+	if (!(req->flags & REQ_F_APOLL_MULTISHOT))
+		mask |= EPOLLONESHOT;
 
 	if (def->pollin) {
 		mask |= POLLIN | POLLRDNORM;
--
2.36.0
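
A simplified sketch of the control flow this patch enables, with
locking, refcounting and cancellation paths elided (the accept side
lands in patch 5):

/*
 * io_accept() returns -EAGAIN
 *   -> io_arm_poll_handler() arms poll; for REQ_F_APOLL_MULTISHOT
 *      requests, EPOLLONESHOT is left out of the mask
 * poll wakeup
 *   -> task work -> io_poll_check_events()
 *        -> multishot: io_tw_lock() + io_queue_sqe(req)
 *             -> io_accept() runs inline again, posting one CQE with
 *                IORING_CQE_F_MORE per accepted connection until it
 *                hits -EAGAIN, while the poll entry stays armed for
 *                the next wakeup
 */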


2022-05-09 03:59:32

by kernel test robot

Subject: Re: [PATCH 3/5] io_uring: let fast poll support multishot

Hi Hao,

Thank you for the patch! Perhaps something to improve:

[auto build test WARNING on f2e030dd7aaea5a937a2547dc980fab418fbc5e7]

url: https://github.com/intel-lab-lkp/linux/commits/Hao-Xu/fast-poll-multishot-mode/20220506-150750
base: f2e030dd7aaea5a937a2547dc980fab418fbc5e7
config: x86_64-randconfig-s021 (https://download.01.org/0day-ci/archive/20220507/[email protected]/config)
compiler: gcc-11 (Debian 11.2.0-20) 11.2.0
reproduce:
# apt-get install sparse
# sparse version: v0.6.4-dirty
# https://github.com/intel-lab-lkp/linux/commit/6001c3e95550875d4328aa2ca8b342c42b0e644e
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Hao-Xu/fast-poll-multishot-mode/20220506-150750
git checkout 6001c3e95550875d4328aa2ca8b342c42b0e644e
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 C=1 CF='-fdiagnostic-prefix -D__CHECK_ENDIAN__' O=build_dir ARCH=x86_64 SHELL=/bin/bash

If you fix the issue, kindly add following tag as appropriate
Reported-by: kernel test robot <[email protected]>


sparse warnings: (new ones prefixed by >>)
fs/io_uring.c: note: in included file (through include/trace/trace_events.h, include/trace/define_trace.h, include/trace/events/io_uring.h):
include/trace/events/io_uring.h:488:1: sparse: sparse: incorrect type in assignment (different base types) @@ expected unsigned int [usertype] op_flags @@ got restricted __kernel_rwf_t const [usertype] rw_flags @@
include/trace/events/io_uring.h:488:1: sparse: expected unsigned int [usertype] op_flags
include/trace/events/io_uring.h:488:1: sparse: got restricted __kernel_rwf_t const [usertype] rw_flags
fs/io_uring.c: note: in included file (through include/trace/perf.h, include/trace/define_trace.h, include/trace/events/io_uring.h):
include/trace/events/io_uring.h:488:1: sparse: sparse: incorrect type in assignment (different base types) @@ expected unsigned int [usertype] op_flags @@ got restricted __kernel_rwf_t const [usertype] rw_flags @@
include/trace/events/io_uring.h:488:1: sparse: expected unsigned int [usertype] op_flags
include/trace/events/io_uring.h:488:1: sparse: got restricted __kernel_rwf_t const [usertype] rw_flags
fs/io_uring.c:3280:23: sparse: sparse: incorrect type in assignment (different base types) @@ expected unsigned int [usertype] flags @@ got restricted __kernel_rwf_t @@
fs/io_uring.c:3280:23: sparse: expected unsigned int [usertype] flags
fs/io_uring.c:3280:23: sparse: got restricted __kernel_rwf_t
fs/io_uring.c:3477:24: sparse: sparse: incorrect type in return expression (different address spaces) @@ expected void [noderef] __user * @@ got struct io_buffer *[assigned] kbuf @@
fs/io_uring.c:3477:24: sparse: expected void [noderef] __user *
fs/io_uring.c:3477:24: sparse: got struct io_buffer *[assigned] kbuf
fs/io_uring.c:3864:48: sparse: sparse: incorrect type in argument 2 (different base types) @@ expected restricted __kernel_rwf_t [usertype] flags @@ got unsigned int [usertype] flags @@
fs/io_uring.c:3864:48: sparse: expected restricted __kernel_rwf_t [usertype] flags
fs/io_uring.c:3864:48: sparse: got unsigned int [usertype] flags
fs/io_uring.c:5187:14: sparse: sparse: incorrect type in assignment (different address spaces) @@ expected struct file *file @@ got struct file [noderef] __rcu * @@
fs/io_uring.c:5187:14: sparse: expected struct file *file
fs/io_uring.c:5187:14: sparse: got struct file [noderef] __rcu *
fs/io_uring.c:5974:68: sparse: sparse: incorrect type in initializer (different base types) @@ expected restricted __poll_t [usertype] _key @@ got int apoll_events @@
fs/io_uring.c:5974:68: sparse: expected restricted __poll_t [usertype] _key
fs/io_uring.c:5974:68: sparse: got int apoll_events
fs/io_uring.c:5979:48: sparse: sparse: restricted __poll_t degrades to integer
fs/io_uring.c:5983:59: sparse: sparse: restricted __poll_t degrades to integer
fs/io_uring.c:5991:74: sparse: sparse: incorrect type in argument 1 (different base types) @@ expected restricted __poll_t [usertype] val @@ got int @@
fs/io_uring.c:5991:74: sparse: expected restricted __poll_t [usertype] val
fs/io_uring.c:5991:74: sparse: got int
fs/io_uring.c:5991:60: sparse: sparse: incorrect type in initializer (different base types) @@ expected restricted __poll_t [usertype] mask @@ got unsigned short @@
fs/io_uring.c:5991:60: sparse: expected restricted __poll_t [usertype] mask
fs/io_uring.c:5991:60: sparse: got unsigned short
fs/io_uring.c:5997:58: sparse: sparse: incorrect type in argument 3 (different base types) @@ expected signed int [usertype] res @@ got restricted __poll_t [usertype] mask @@
fs/io_uring.c:5997:58: sparse: expected signed int [usertype] res
fs/io_uring.c:5997:58: sparse: got restricted __poll_t [usertype] mask
fs/io_uring.c:6027:68: sparse: sparse: restricted __poll_t degrades to integer
fs/io_uring.c:6027:57: sparse: sparse: incorrect type in argument 1 (different base types) @@ expected restricted __poll_t [usertype] val @@ got unsigned int @@
fs/io_uring.c:6027:57: sparse: expected restricted __poll_t [usertype] val
fs/io_uring.c:6027:57: sparse: got unsigned int
fs/io_uring.c:6108:45: sparse: sparse: incorrect type in argument 3 (different base types) @@ expected int events @@ got restricted __poll_t [usertype] events @@
fs/io_uring.c:6108:45: sparse: expected int events
fs/io_uring.c:6108:45: sparse: got restricted __poll_t [usertype] events
fs/io_uring.c:6143:40: sparse: sparse: incorrect type in argument 2 (different base types) @@ expected int mask @@ got restricted __poll_t [usertype] mask @@
fs/io_uring.c:6143:40: sparse: expected int mask
fs/io_uring.c:6143:40: sparse: got restricted __poll_t [usertype] mask
fs/io_uring.c:6143:50: sparse: sparse: incorrect type in argument 3 (different base types) @@ expected int events @@ got restricted __poll_t [usertype] events @@
fs/io_uring.c:6143:50: sparse: expected int events
fs/io_uring.c:6143:50: sparse: got restricted __poll_t [usertype] events
fs/io_uring.c:6235:24: sparse: sparse: incorrect type in return expression (different base types) @@ expected int @@ got restricted __poll_t [assigned] [usertype] mask @@
fs/io_uring.c:6235:24: sparse: expected int
fs/io_uring.c:6235:24: sparse: got restricted __poll_t [assigned] [usertype] mask
fs/io_uring.c:6252:40: sparse: sparse: incorrect type in argument 2 (different base types) @@ expected int mask @@ got restricted __poll_t [assigned] [usertype] mask @@
fs/io_uring.c:6252:40: sparse: expected int mask
fs/io_uring.c:6252:40: sparse: got restricted __poll_t [assigned] [usertype] mask
fs/io_uring.c:6252:50: sparse: sparse: incorrect type in argument 3 (different base types) @@ expected int events @@ got restricted __poll_t [usertype] events @@
fs/io_uring.c:6252:50: sparse: expected int events
fs/io_uring.c:6252:50: sparse: got restricted __poll_t [usertype] events
fs/io_uring.c:6262:47: sparse: sparse: incorrect type in argument 3 (different base types) @@ expected int events @@ got restricted __poll_t [usertype] events @@
fs/io_uring.c:6262:47: sparse: expected int events
fs/io_uring.c:6262:47: sparse: got restricted __poll_t [usertype] events
>> fs/io_uring.c:6287:33: sparse: sparse: incorrect type in initializer (different base types) @@ expected restricted __poll_t [usertype] mask @@ got int @@
fs/io_uring.c:6287:33: sparse: expected restricted __poll_t [usertype] mask
fs/io_uring.c:6287:33: sparse: got int
fs/io_uring.c:6300:22: sparse: sparse: invalid assignment: |=
fs/io_uring.c:6300:22: sparse: left side has type restricted __poll_t
fs/io_uring.c:6300:22: sparse: right side has type int
fs/io_uring.c:6305:30: sparse: sparse: invalid assignment: &=
fs/io_uring.c:6305:30: sparse: left side has type restricted __poll_t
fs/io_uring.c:6305:30: sparse: right side has type int
fs/io_uring.c:6307:22: sparse: sparse: invalid assignment: |=
fs/io_uring.c:6307:22: sparse: left side has type restricted __poll_t
fs/io_uring.c:6307:22: sparse: right side has type int
fs/io_uring.c:6335:33: sparse: sparse: incorrect type in argument 5 (different base types) @@ expected int mask @@ got restricted __poll_t [assigned] [usertype] mask @@
fs/io_uring.c:6335:33: sparse: expected int mask
fs/io_uring.c:6335:33: sparse: got restricted __poll_t [assigned] [usertype] mask
fs/io_uring.c:6335:50: sparse: sparse: incorrect type in argument 6 (different base types) @@ expected int events @@ got restricted __poll_t [usertype] events @@
fs/io_uring.c:6335:50: sparse: expected int events
fs/io_uring.c:6335:50: sparse: got restricted __poll_t [usertype] events
fs/io_uring.c:6449:24: sparse: sparse: invalid assignment: |=
fs/io_uring.c:6449:24: sparse: left side has type unsigned int
fs/io_uring.c:6449:24: sparse: right side has type restricted __poll_t
fs/io_uring.c:6450:65: sparse: sparse: restricted __poll_t degrades to integer
fs/io_uring.c:6450:29: sparse: sparse: restricted __poll_t degrades to integer
fs/io_uring.c:6450:38: sparse: sparse: incorrect type in return expression (different base types) @@ expected restricted __poll_t @@ got unsigned int @@
fs/io_uring.c:6450:38: sparse: expected restricted __poll_t
fs/io_uring.c:6450:38: sparse: got unsigned int
fs/io_uring.c:6502:27: sparse: sparse: incorrect type in assignment (different base types) @@ expected int apoll_events @@ got restricted __poll_t [usertype] events @@
fs/io_uring.c:6502:27: sparse: expected int apoll_events
fs/io_uring.c:6502:27: sparse: got restricted __poll_t [usertype] events
fs/io_uring.c:6541:43: sparse: sparse: invalid assignment: &=
fs/io_uring.c:6541:43: sparse: left side has type restricted __poll_t
fs/io_uring.c:6541:43: sparse: right side has type int
fs/io_uring.c:6542:62: sparse: sparse: restricted __poll_t degrades to integer
fs/io_uring.c:6542:43: sparse: sparse: invalid assignment: |=
fs/io_uring.c:6542:43: sparse: left side has type restricted __poll_t
fs/io_uring.c:6542:43: sparse: right side has type unsigned int
fs/io_uring.c:2536:17: sparse: sparse: context imbalance in 'handle_prev_tw_list' - different lock contexts for basic block
fs/io_uring.c:7610:39: sparse: sparse: marked inline, but without a definition
fs/io_uring.c:7610:39: sparse: sparse: marked inline, but without a definition
fs/io_uring.c:7610:39: sparse: sparse: marked inline, but without a definition
fs/io_uring.c:7610:39: sparse: sparse: marked inline, but without a definition

vim +6287 fs/io_uring.c

  6280
  6281	static int io_arm_poll_handler(struct io_kiocb *req, unsigned issue_flags)
  6282	{
  6283		const struct io_op_def *def = &io_op_defs[req->opcode];
  6284		struct io_ring_ctx *ctx = req->ctx;
  6285		struct async_poll *apoll;
  6286		struct io_poll_table ipt;
> 6287		__poll_t mask = POLLERR | POLLPRI;
  6288		int ret;
  6289
  6290		if (!def->pollin && !def->pollout)
  6291			return IO_APOLL_ABORTED;
  6292		if (!file_can_poll(req->file))
  6293			return IO_APOLL_ABORTED;
  6294		if ((req->flags & (REQ_F_POLLED|REQ_F_PARTIAL_IO)) == REQ_F_POLLED)
  6295			return IO_APOLL_ABORTED;
  6296		if (!(req->flags & REQ_F_APOLL_MULTISHOT))
  6297			mask |= EPOLLONESHOT;
  6298
  6299		if (def->pollin) {
  6300			mask |= POLLIN | POLLRDNORM;
  6301
  6302			/* If reading from MSG_ERRQUEUE using recvmsg, ignore POLLIN */
  6303			if ((req->opcode == IORING_OP_RECVMSG) &&
  6304			    (req->sr_msg.msg_flags & MSG_ERRQUEUE))
  6305				mask &= ~POLLIN;
  6306		} else {
  6307			mask |= POLLOUT | POLLWRNORM;
  6308		}
  6309		if (def->poll_exclusive)
  6310			mask |= EPOLLEXCLUSIVE;
  6311		if (req->flags & REQ_F_POLLED) {
  6312			apoll = req->apoll;
  6313		} else if (!(issue_flags & IO_URING_F_UNLOCKED) &&
  6314			   !list_empty(&ctx->apoll_cache)) {
  6315			apoll = list_first_entry(&ctx->apoll_cache, struct async_poll,
  6316						 poll.wait.entry);
  6317			list_del_init(&apoll->poll.wait.entry);
  6318		} else {
  6319			apoll = kmalloc(sizeof(*apoll), GFP_ATOMIC);
  6320			if (unlikely(!apoll))
  6321				return IO_APOLL_ABORTED;
  6322		}
  6323		apoll->double_poll = NULL;
  6324		req->apoll = apoll;
  6325		req->flags |= REQ_F_POLLED;
  6326		ipt.pt._qproc = io_async_queue_proc;
  6327
  6328		io_kbuf_recycle(req, issue_flags);
  6329
  6330		ret = __io_arm_poll_handler(req, &apoll->poll, &ipt, mask);
  6331		if (ret || ipt.error)
  6332			return ret ? IO_APOLL_READY : IO_APOLL_ABORTED;
  6333
  6334		trace_io_uring_poll_arm(ctx, req, req->cqe.user_data, req->opcode,
  6335					mask, apoll->poll.events);
  6336		return IO_APOLL_OK;
  6337	}
  6338

--
0-DAY CI Kernel Test Service
https://01.org/lkp
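
The new warning at line 6287 comes from initializing a bitwise __poll_t
with the plain-integer POLL* constants. One possible fix, sketched here
as an illustration rather than a patch from this thread, is to use the
__poll_t-typed EPOLL* equivalents, which are defined with __force casts
in include/uapi/linux/eventpoll.h:

/* EPOLLERR/EPOLLPRI carry the (__force __poll_t) cast that sparse
 * wants, while POLLERR/POLLPRI are plain ints. */
__poll_t mask = EPOLLERR | EPOLLPRI;

if (def->pollin)
	mask |= EPOLLIN | EPOLLRDNORM;
else
	mask |= EPOLLOUT | EPOLLWRNORM;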

2022-05-09 06:40:02

by Jens Axboe

Subject: Re: [PATCH v2 0/5] fast poll multishot mode

On 5/6/22 8:33 PM, Jens Axboe wrote:
> On 5/6/22 5:26 PM, Jens Axboe wrote:
>> On 5/6/22 4:23 PM, Jens Axboe wrote:
>>> On 5/6/22 1:00 AM, Hao Xu wrote:
>>>> [snip]
>>>
>>> [snip]
>>
>> Quick benchmark here, accepting 10k connections:
>>
>> Stock kernel
>> real 0m0.728s
>> user 0m0.009s
>> sys 0m0.192s
>>
>> Patched
>> real 0m0.684s
>> user 0m0.018s
>> sys 0m0.102s
>>
>> Looks like a nice win for a highly synthetic benchmark. Nothing
>> scientific, was just curious.
>
> One more thought on this - how is it supposed to work with
> accept-direct? One idea would be to make it incrementally increasing.
> But we need a good story for that, if it's exclusive to non-direct
> files, then it's a lot less interesting as the latter is really nice win
> for lots of files. If we can combine the two, even better.

Running some quick testing, on an actual test box (previous numbers were
from a vm on my laptop):

Testing singleshot, normal files
Did 10000 accepts

________________________________________________________
Executed in  216.10 millis    fish           external
   usr time    9.32 millis  150.00 micros    9.17 millis
   sys time  110.06 millis   67.00 micros  109.99 millis

Testing multishot, fixed files
Did 10000 accepts

________________________________________________________
Executed in  189.04 millis    fish           external
   usr time   11.86 millis  159.00 micros   11.71 millis
   sys time   93.71 millis   70.00 micros   93.64 millis

That's about ~19 usec to accept a connection, pretty decent. Using
singleshot with fixed files, it shaves about ~8% off, ending at around
200 msec.

I think we can get away with using fixed files and multishot, attaching
the quick patch I did below to test it. We need something better than
this, otherwise once the space fills up, we'll likely end up with a
sparse space and the naive approach of just incrementing the next slot
won't work at all.
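
To make the sparse-space point concrete, here is an illustrative sketch
(not the quick patch mentioned above, and with a hypothetical
userspace-managed slot table): a bare "next_slot++" allocator hands out
already-occupied slots as soon as an earlier connection closes, whereas
scanning a bitmap for the first clear bit keeps working:

#include <limits.h>

#define NR_SLOTS	1024
#define BITS_PER_WORD	(sizeof(unsigned long) * CHAR_BIT)

static unsigned long slot_map[NR_SLOTS / BITS_PER_WORD];

static int alloc_slot(void)
{
	for (int i = 0; i < NR_SLOTS; i++) {
		unsigned long bit = 1UL << (i % BITS_PER_WORD);

		if (!(slot_map[i / BITS_PER_WORD] & bit)) {
			slot_map[i / BITS_PER_WORD] |= bit;
			return i;	/* candidate fixed-file index */
		}
	}
	return -1;	/* table full */
}

static void free_slot(int i)
{
	slot_map[i / BITS_PER_WORD] &= ~(1UL << (i % BITS_PER_WORD));
}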

--
Jens Axboe


2022-05-09 06:40:47

by Hao Xu

Subject: Re: [PATCH 3/5] io_uring: let fast poll support multishot

On 2022/5/7 6:02 AM, Jens Axboe wrote:
> On 5/6/22 11:19 AM, Pavel Begunkov wrote:
>> On 5/6/22 08:01, Hao Xu wrote:
>>> [snip]
>>> @@ -5981,17 +5981,26 @@ static int io_poll_check_events(struct io_kiocb *req, bool locked)
>>>   	/* multishot, just fill an CQE and proceed */
>>>   	if (req->cqe.res && !(req->apoll_events & EPOLLONESHOT)) {
>>> -		__poll_t mask = mangle_poll(req->cqe.res & req->apoll_events);
>>> -		bool filled;
>>> -
>>> -		spin_lock(&ctx->completion_lock);
>>> -		filled = io_fill_cqe_aux(ctx, req->cqe.user_data, mask,
>>> -					 IORING_CQE_F_MORE);
>>> -		io_commit_cqring(ctx);
>>> -		spin_unlock(&ctx->completion_lock);
>>> -		if (unlikely(!filled))
>>> -			return -ECANCELED;
>>> -		io_cqring_ev_posted(ctx);
>>> +		if (req->flags & REQ_F_APOLL_MULTISHOT) {
>>> +			io_tw_lock(req->ctx, locked);
>>> +			if (likely(!(req->task->flags & PF_EXITING)))
>>> +				io_queue_sqe(req);
>>
>> That looks dangerous, io_queue_sqe() usually takes the request
>> ownership and doesn't expect that someone, i.e.
>> io_poll_check_events(), may still be actively using it.
>
> I took a look at this, too. We do own the request at this point, but
> it's still on the poll list. If io_accept() fails, then we do run the
> poll_clean.
>
>> E.g. io_accept() fails on fd < 0, return an error, io_queue_sqe() ->
>> io_queue_async() -> io_req_complete_failed() kills it. Then
>> io_poll_check_events() and polling in general carry on using the freed
>> request => UAF. Didn't look at it too carefully, but there might other
>> similar cases.
>
> But we better have done poll_clean() before returning the error. What am
> I missing here?

Sorry, I don't get it. I've done the poll_clean() before returning the
error in io_accept().
>


2022-05-09 07:44:40

by Hao Xu

Subject: Re: [PATCH v2 0/5] fast poll multishot mode

On 2022/5/7 11:08 AM, Jens Axboe wrote:
> On 5/6/22 8:33 PM, Jens Axboe wrote:
>> On 5/6/22 5:26 PM, Jens Axboe wrote:
>>> On 5/6/22 4:23 PM, Jens Axboe wrote:
>>>> On 5/6/22 1:00 AM, Hao Xu wrote:
>>>>> [snip]
>>>>
>>>> [snip]
>>>
>>> [snip]
>>
>> [snip]
>
> Running some quick testing, on an actual test box (previous numbers were
> from a vm on my laptop):
>
> Testing singleshot, normal files
> Did 10000 accepts
>
> ________________________________________________________
> Executed in  216.10 millis    fish           external
>    usr time    9.32 millis  150.00 micros    9.17 millis
>    sys time  110.06 millis   67.00 micros  109.99 millis
>
> Testing multishot, fixed files
> Did 10000 accepts
>
> ________________________________________________________
> Executed in  189.04 millis    fish           external
>    usr time   11.86 millis  159.00 micros   11.71 millis
>    sys time   93.71 millis   70.00 micros   93.64 millis
>
> That's about ~19 usec to accept a connection, pretty decent. Using
> singleshot with fixed files, it shaves about ~8% off, ending at around
> 200 msec.
>
> I think we can get away with using fixed files and multishot, attaching
I'm not following: do you mean we shouldn't do multishot + fixed files,
or that we should use multishot + fixed files to make the result better?
> the quick patch I did below to test it. We need something better than
Sorry Jens, I didn't see the quick patch; is there anything I misunderstood?
> this, otherwise once the space fills up, we'll likely end up with a
> sparse space and the naive approach of just incrementing the next slot
> won't work at all.

>


2022-05-09 07:44:43

by Hao Xu

Subject: [PATCH 2/5] io_uring: add REQ_F_APOLL_MULTISHOT for requests

From: Hao Xu <[email protected]>

Add a flag to indicate multishot mode for fast poll. Currently only
accept uses it, but there may be more operations leveraging it in the
future.

Signed-off-by: Hao Xu <[email protected]>
---
fs/io_uring.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 1e7466079af7..8ebb1a794e36 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -808,6 +808,7 @@ enum {
 	REQ_F_SINGLE_POLL_BIT,
 	REQ_F_DOUBLE_POLL_BIT,
 	REQ_F_PARTIAL_IO_BIT,
+	REQ_F_APOLL_MULTISHOT_BIT,
 	/* keep async read/write and isreg together and in order */
 	REQ_F_SUPPORT_NOWAIT_BIT,
 	REQ_F_ISREG_BIT,
@@ -872,6 +873,8 @@ enum {
 	REQ_F_DOUBLE_POLL	= BIT(REQ_F_DOUBLE_POLL_BIT),
 	/* request has already done partial IO */
 	REQ_F_PARTIAL_IO	= BIT(REQ_F_PARTIAL_IO_BIT),
+	/* fast poll multishot mode */
+	REQ_F_APOLL_MULTISHOT	= BIT(REQ_F_APOLL_MULTISHOT_BIT),
 };

struct async_poll {
--
2.36.0
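
One note: patch 5 below tests REQ_F_APOLL_MULTI_POLLED, which is not
defined in the excerpts shown in this thread; presumably, as in the
cleanups that were eventually merged, it is the combination of this new
flag with REQ_F_POLLED:

/* Assumed definition (not shown in this excerpt): both bits set means
 * a multishot request whose poll handler has been armed. */
#define REQ_F_APOLL_MULTI_POLLED (REQ_F_APOLL_MULTISHOT | REQ_F_POLLED)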


2022-05-09 07:55:52

by Hao Xu

Subject: [PATCH 5/5] io_uring: implement multishot mode for accept

From: Hao Xu <[email protected]>

Refactor io_accept() to support multishot mode.

theoretical analysis:
1) when connections come in fast
   - singleshot:
     add accept sqe(userspace) --> accept inline
             ^                           |
             |---------------------------|
   - multishot:
     add accept sqe(userspace) --> accept inline
                                      ^     |
                                      |--*--|

we do accept repeatedly at the * until we get EAGAIN

2) when connections come in at low pressure
similar to 1): we save a lot of userspace-kernel context
switches and useless vfs_poll() calls

tests:
Did some tests that go like this:

    server                  client(multiple)
    accept                  connect
    read                    write
    write                   read
    close                   close

Basically, it spins up a number of clients (on the same machine as the
server) that connect to the server and write some data to it; the server
writes the data back to each client after receiving it, then closes the
connection once the write returns. The client then reads the data and
closes the connection. Here I test 10000 clients connecting to one
server, with a data size of 128 bytes. Each client runs in its own
goroutine, so they all hit the server within a short time.
Tested 20 times before/after this patchset; time spent (unit: cycles,
i.e. the return value of clock()):
before:
(1930136+1940725+1907981+1947601+1923812+1928226+1911087+1905897+1941075
+1934374+1906614+1912504+1949110+1908790+1909951+1941672+1969525+1934984
+1934226+1914385)/20.0 = 1927633.75
after:
(1858905+1917104+1895455+1963963+1892706+1889208+1874175+1904753+1874112
+1874985+1882706+1884642+1864694+1906508+1916150+1924250+1869060+1889506
+1871324+1940803)/20.0 = 1894750.45

(1927633.75 - 1894750.45) / 1927633.75 ≈ 1.71%

Signed-off-by: Hao Xu <[email protected]>
---
fs/io_uring.c | 54 +++++++++++++++++++++++++++++++++++++++++++++------
1 file changed, 48 insertions(+), 6 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 0a83ecc457d1..9febe7774dc3 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1254,6 +1254,7 @@ static int io_close_fixed(struct io_kiocb *req, unsigned int issue_flags);
 static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer);
 static void io_eventfd_signal(struct io_ring_ctx *ctx);
 static void io_req_tw_post_queue(struct io_kiocb *req, s32 res, u32 cflags);
+static void io_poll_remove_entries(struct io_kiocb *req);
 
 static struct kmem_cache *req_cachep;
 
@@ -5690,24 +5691,29 @@ static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
 static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
 {
 	struct io_accept *accept = &req->accept;
+	bool multishot;
 
 	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
 		return -EINVAL;
-	if (sqe->ioprio || sqe->len || sqe->buf_index)
+	if (sqe->len || sqe->buf_index)
 		return -EINVAL;
 
 	accept->addr = u64_to_user_ptr(READ_ONCE(sqe->addr));
 	accept->addr_len = u64_to_user_ptr(READ_ONCE(sqe->addr2));
 	accept->flags = READ_ONCE(sqe->accept_flags);
 	accept->nofile = rlimit(RLIMIT_NOFILE);
+	multishot = !!(READ_ONCE(sqe->ioprio) & IORING_ACCEPT_MULTISHOT);
 
 	accept->file_slot = READ_ONCE(sqe->file_index);
-	if (accept->file_slot && (accept->flags & SOCK_CLOEXEC))
+	if (accept->file_slot && ((accept->flags & SOCK_CLOEXEC) || multishot))
 		return -EINVAL;
 	if (accept->flags & ~(SOCK_CLOEXEC | SOCK_NONBLOCK))
 		return -EINVAL;
 	if (SOCK_NONBLOCK != O_NONBLOCK && (accept->flags & SOCK_NONBLOCK))
 		accept->flags = (accept->flags & ~SOCK_NONBLOCK) | O_NONBLOCK;
+	if (multishot)
+		req->flags |= REQ_F_APOLL_MULTISHOT;
+
 	return 0;
 }
 
@@ -5730,6 +5736,7 @@ static inline void io_poll_clean(struct io_kiocb *req)
 
 static int io_accept(struct io_kiocb *req, unsigned int issue_flags)
 {
+	struct io_ring_ctx *ctx = req->ctx;
 	struct io_accept *accept = &req->accept;
 	bool force_nonblock = issue_flags & IO_URING_F_NONBLOCK;
 	unsigned int file_flags = force_nonblock ? O_NONBLOCK : 0;
@@ -5737,10 +5744,13 @@ static int io_accept(struct io_kiocb *req, unsigned int issue_flags)
 	struct file *file;
 	int ret, fd;
 
+retry:
 	if (!fixed) {
 		fd = __get_unused_fd_flags(accept->flags, accept->nofile);
-		if (unlikely(fd < 0))
+		if (unlikely(fd < 0)) {
+			io_poll_clean(req);
 			return fd;
+		}
 	}
 	file = do_accept(req->file, file_flags, accept->addr, accept->addr_len,
 			 accept->flags);
@@ -5748,8 +5758,12 @@ static int io_accept(struct io_kiocb *req, unsigned int issue_flags)
 		if (!fixed)
 			put_unused_fd(fd);
 		ret = PTR_ERR(file);
-		if (ret == -EAGAIN && force_nonblock)
-			return -EAGAIN;
+		if (ret == -EAGAIN && force_nonblock) {
+			if ((req->flags & REQ_F_APOLL_MULTI_POLLED) ==
+			    REQ_F_APOLL_MULTI_POLLED)
+				ret = 0;
+			return ret;
+		}
 		if (ret == -ERESTARTSYS)
 			ret = -EINTR;
 		req_set_fail(req);
@@ -5760,7 +5774,35 @@ static int io_accept(struct io_kiocb *req, unsigned int issue_flags)
 		ret = io_install_fixed_file(req, file, issue_flags,
 					    accept->file_slot - 1);
 	}
-	__io_req_complete(req, issue_flags, ret, 0);
+
+	if (req->flags & REQ_F_APOLL_MULTISHOT) {
+		if (ret >= 0) {
+			bool filled;
+
+			spin_lock(&ctx->completion_lock);
+			filled = io_fill_cqe_aux(ctx, req->cqe.user_data, ret,
+						 IORING_CQE_F_MORE);
+			io_commit_cqring(ctx);
+			spin_unlock(&ctx->completion_lock);
+			if (unlikely(!filled)) {
+				io_poll_clean(req);
+				return -ECANCELED;
+			}
+			io_cqring_ev_posted(ctx);
+			goto retry;
+		} else {
+			/*
+			 * the apoll multishot req should handle poll
+			 * cancellation by itself since the upper layer
+			 * who called io_queue_sqe() cannot get errors
+			 * happened here.
+			 */
+			io_poll_clean(req);
+			return ret;
+		}
+	} else {
+		__io_req_complete(req, issue_flags, ret, 0);
+	}
 	return 0;
 }

--
2.36.0


2022-05-09 09:23:12

by Jens Axboe

Subject: Re: [PATCH v2 0/5] fast poll multishot mode

On 5/6/22 1:00 AM, Hao Xu wrote:
> [snip]
>
> A liburing test is here:
> https://github.com/HowHsu/liburing/blob/multishot_accept/test/accept.c

Wish I had seen that, I wrote my own! But maybe that's good, you tend to
find other issues through that.

Anyway, works for me in testing, and I can see this being a nice win for
accept intensive workloads. I pushed a bunch of cleanup patches that
should just get folded in. Can you fold them into your patches and
address the other feedback, and post a v3? I pushed the test branch
here:

https://git.kernel.dk/cgit/linux-block/log/?h=fastpoll-mshot

--
Jens Axboe


2022-05-09 11:15:36

by Jens Axboe

Subject: Re: [PATCH 5/5] io_uring: implement multishot mode for accept

On 5/6/22 1:01 AM, Hao Xu wrote:
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 0a83ecc457d1..9febe7774dc3 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -1254,6 +1254,7 @@ static int io_close_fixed(struct io_kiocb *req, unsigned int issue_flags);
>  static enum hrtimer_restart io_link_timeout_fn(struct hrtimer *timer);
>  static void io_eventfd_signal(struct io_ring_ctx *ctx);
>  static void io_req_tw_post_queue(struct io_kiocb *req, s32 res, u32 cflags);
> +static void io_poll_remove_entries(struct io_kiocb *req);
>  
>  static struct kmem_cache *req_cachep;
>  
> @@ -5690,24 +5691,29 @@ static int io_recv(struct io_kiocb *req, unsigned int issue_flags)
>  static int io_accept_prep(struct io_kiocb *req, const struct io_uring_sqe *sqe)
>  {
>  	struct io_accept *accept = &req->accept;
> +	bool multishot;
>  
>  	if (unlikely(req->ctx->flags & IORING_SETUP_IOPOLL))
>  		return -EINVAL;
> -	if (sqe->ioprio || sqe->len || sqe->buf_index)
> +	if (sqe->len || sqe->buf_index)
>  		return -EINVAL;
>  
>  	accept->addr = u64_to_user_ptr(READ_ONCE(sqe->addr));
>  	accept->addr_len = u64_to_user_ptr(READ_ONCE(sqe->addr2));
>  	accept->flags = READ_ONCE(sqe->accept_flags);
>  	accept->nofile = rlimit(RLIMIT_NOFILE);
> +	multishot = !!(READ_ONCE(sqe->ioprio) & IORING_ACCEPT_MULTISHOT);

I tend to like:

	multishot = (READ_ONCE(sqe->ioprio) & IORING_ACCEPT_MULTISHOT) != 0;

as I think it's more readable. But I think we really want it ala:

	u16 poll_flags;

	poll_flags = READ_ONCE(sqe->ioprio);
	if (poll_flags & ~IORING_ACCEPT_MULTISHOT)
		return -EINVAL;

	...

to ensure that we can add more flags later, hence only accepting this
single flag right now.

Do we need REQ_F_APOLL_MULTI_POLLED, or can we just store whether this
is a multishot request in struct io_accept?

> @@ -5760,7 +5774,35 @@ static int io_accept(struct io_kiocb *req, unsigned int issue_flags)
>  		ret = io_install_fixed_file(req, file, issue_flags,
>  					    accept->file_slot - 1);
>  	}
> -	__io_req_complete(req, issue_flags, ret, 0);
> +
> +	if (req->flags & REQ_F_APOLL_MULTISHOT) {
> +		if (ret >= 0) {
> +			bool filled;
> +
> +			spin_lock(&ctx->completion_lock);
> +			filled = io_fill_cqe_aux(ctx, req->cqe.user_data, ret,
> +						 IORING_CQE_F_MORE);
> +			io_commit_cqring(ctx);
> +			spin_unlock(&ctx->completion_lock);
> +			if (unlikely(!filled)) {
> +				io_poll_clean(req);
> +				return -ECANCELED;
> +			}
> +			io_cqring_ev_posted(ctx);
> +			goto retry;
> +		} else {
> +			/*
> +			 * the apoll multishot req should handle poll
> +			 * cancellation by itself since the upper layer
> +			 * who called io_queue_sqe() cannot get errors
> +			 * happened here.
> +			 */
> +			io_poll_clean(req);
> +			return ret;
> +		}
> +	} else {
> +		__io_req_complete(req, issue_flags, ret, 0);
> +	}
>  	return 0;
>  }

I'd probably just make that:

	if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
		__io_req_complete(req, issue_flags, ret, 0);
		return 0;
	}
	if (ret >= 0) {
		bool filled;

		spin_lock(&ctx->completion_lock);
		filled = io_fill_cqe_aux(ctx, req->cqe.user_data, ret,
					 IORING_CQE_F_MORE);
		io_commit_cqring(ctx);
		spin_unlock(&ctx->completion_lock);
		if (filled) {
			io_cqring_ev_posted(ctx);
			goto retry;
		}
		/* fall through to error case */
		ret = -ECANCELED;
	}

	/*
	 * the apoll multishot req should handle poll
	 * cancellation by itself since the upper layer
	 * who called io_queue_sqe() cannot get errors
	 * happened here.
	 */
	io_poll_clean(req);
	return ret;

which I think is a lot easier to read and keeps the indentation at a
manageable level and reduces duplicate code.

--
Jens Axboe