2022-06-30 10:00:16

by Dylan Yudaken

Subject: [PATCH v2 for-next 00/12] io_uring: multishot recv

This series adds support for multishot recv/recvmsg to io_uring.

The idea is that generally socket applications will be continually
enqueuing a new recv() when the previous one completes. This can be
improved on by allowing the application to queue a multishot receive,
which will post completions as and when data is available. It uses the
provided buffers feature to receive new data into a pool provided by
the application.
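As a rough sketch of the intended usage from userspace (hedged:
io_uring_prep_recv_multishot() is the liburing helper that later wrapped
this feature, and the buffer pool setup here is illustrative, not taken
from this series):

  #include <liburing.h>

  /* Arm one multishot recv against a provided-buffer pool. The single SQE
   * then produces a CQE per received chunk until the request terminates
   * (cancellation, error, or CQ overflow), signalled by IORING_CQE_F_MORE
   * being clear on the final CQE. */
  static int arm_multishot_recv(struct io_uring *ring, int sockfd,
                                void *pool, int buf_len, int nr_bufs, int bgid)
  {
          struct io_uring_sqe *sqe;

          /* hand the kernel nr_bufs buffers of buf_len bytes each */
          sqe = io_uring_get_sqe(ring);
          io_uring_prep_provide_buffers(sqe, pool, buf_len, nr_bufs, bgid, 0);

          /* no buffer is passed in the SQE; the kernel picks from group bgid */
          sqe = io_uring_get_sqe(ring);
          io_uring_prep_recv_multishot(sqe, sockfd, NULL, 0, 0);
          sqe->flags |= IOSQE_BUFFER_SELECT;
          sqe->buf_group = bgid;

          return io_uring_submit(ring);
  }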

This is more performant in a few ways:
* Subsequent receives are queued up straight away, without requiring the
application to finish a processing loop.
* If there is more data in the socket (say the provided buffer size is
smaller than the socket buffer), the data is returned immediately,
improving batching.
* Poll is only armed once and reused, saving CPU cycles.

Running a small network benchmark [1] shows a QPS improvement of ~6-8% over a range of loads.

[1]: https://github.com/DylanZA/netbench/tree/multishot_recv

While building this I noticed a small problem in multishot poll that
becomes a serious problem for receive: if CQEs overflow, they are
returned to the user out of order. This is annoying for the existing
poll and accept use cases but does not break them, since apart from the
IORING_CQE_F_MORE flag their results are not strictly ordered. For
receive, ordering is a hard requirement, as otherwise the application
would see data out of order.

To fix this, when a multishot CQE hits overflow we terminate the
multishot request. The application should then drain CQEs until it sees
that final CQE, notice that IORING_CQE_F_MORE is not set, and re-issue
the multishot request.
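In userspace the recovery path might look roughly like this (continuing
the sketch above; handle_data() and rearm_recv() are hypothetical
application helpers, and for brevity every CQE is assumed to belong to
the multishot recv):

  static void drain_and_rearm(struct io_uring *ring, int sockfd)
  {
          struct io_uring_cqe *cqe;

          while (io_uring_peek_cqe(ring, &cqe) == 0) {
                  if (cqe->res > 0)
                          handle_data(cqe);         /* consume the provided buffer */
                  if (!(cqe->flags & IORING_CQE_F_MORE))
                          rearm_recv(ring, sockfd); /* terminated: re-issue multishot */
                  io_uring_cqe_seen(ring, cqe);
          }
  }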

Patches:
1-3: relax restrictions around provided buffers to allow 0-size lengths
4: recycles more buffers on the kernel side in error conditions
5-6: clean up the multishot poll API a bit, allowing it to end with
successful error conditions
7-9: fix existing ordering problems with multishot poll and accept on
overflow
10: the multishot receive patch itself
11-12: small fixes to tracing of CQEs

v2:
* Added patches 6,7,8 (fixing multishot poll bugs)
* Added patches 10,11 (trace cleanups)
* Added io_recv_finish to reduce duplicated logic


Dylan Yudaken (12):
io_uring: allow 0 length for buffer select
io_uring: restore bgid in io_put_kbuf
io_uring: allow iov_len = 0 for recvmsg and buffer select
io_uring: recycle buffers on error
io_uring: clean up io_poll_check_events return values
io_uring: add IOU_STOP_MULTISHOT return code
io_uring: add allow_overflow to io_post_aux_cqe
io_uring: fix multishot poll on overflow
io_uring: fix multishot accept ordering
io_uring: multishot recv
io_uring: fix io_uring_cqe_overflow trace format
io_uring: only trace one of complete or overflow

include/trace/events/io_uring.h | 2 +-
include/uapi/linux/io_uring.h | 5 ++
io_uring/io_uring.c | 17 ++--
io_uring/io_uring.h | 20 +++--
io_uring/kbuf.c | 4 +-
io_uring/kbuf.h | 9 ++-
io_uring/msg_ring.c | 4 +-
io_uring/net.c | 139 ++++++++++++++++++++++++++------
io_uring/poll.c | 44 ++++++----
io_uring/rsrc.c | 4 +-
10 files changed, 190 insertions(+), 58 deletions(-)


base-commit: 864a15ca4f196184e3f44d72efc1782a7017cbbd
--
2.30.2


2022-06-30 10:01:08

by Dylan Yudaken

Subject: [PATCH v2 for-next 03/12] io_uring: allow iov_len = 0 for recvmsg and buffer select

When using BUFFER_SELECT there is no technical requirement that the user
actually provides an iovec, and skipping it removes one copy_from_user
call.

So allow iov_len to be 0.
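
For illustration, a userspace fragment that takes advantage of this (a
hedged sketch; the ring, sockfd and bgid setup are assumed and not part
of this patch):

  struct msghdr msg = { 0 };     /* msg_iov == NULL, msg_iovlen == 0 */
  struct io_uring_sqe *sqe;

  sqe = io_uring_get_sqe(ring);
  io_uring_prep_recvmsg(sqe, sockfd, &msg, 0);
  sqe->flags |= IOSQE_BUFFER_SELECT;
  sqe->buf_group = bgid;         /* buffer and its length come from the pool */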

Signed-off-by: Dylan Yudaken <[email protected]>
---
io_uring/net.c | 16 +++++++++++-----
1 file changed, 11 insertions(+), 5 deletions(-)

diff --git a/io_uring/net.c b/io_uring/net.c
index 19a805c3814c..5e84f7ab92a3 100644
--- a/io_uring/net.c
+++ b/io_uring/net.c
@@ -300,12 +300,18 @@ static int __io_recvmsg_copy_hdr(struct io_kiocb *req,
return ret;

if (req->flags & REQ_F_BUFFER_SELECT) {
- if (iov_len > 1)
+ if (iov_len == 0) {
+ sr->len = iomsg->fast_iov[0].iov_len = 0;
+ iomsg->fast_iov[0].iov_base = NULL;
+ iomsg->free_iov = NULL;
+ } else if (iov_len > 1) {
return -EINVAL;
- if (copy_from_user(iomsg->fast_iov, uiov, sizeof(*uiov)))
- return -EFAULT;
- sr->len = iomsg->fast_iov[0].iov_len;
- iomsg->free_iov = NULL;
+ } else {
+ if (copy_from_user(iomsg->fast_iov, uiov, sizeof(*uiov)))
+ return -EFAULT;
+ sr->len = iomsg->fast_iov[0].iov_len;
+ iomsg->free_iov = NULL;
+ }
} else {
iomsg->free_iov = iomsg->fast_iov;
ret = __import_iovec(READ, uiov, iov_len, UIO_FASTIOV,
--
2.30.2

2022-06-30 10:05:12

by Dylan Yudaken

Subject: [PATCH v2 for-next 05/12] io_uring: clean up io_poll_check_events return values

The values returned are a bit confusing: 0 and 1 have implied meanings,
so add named definitions for them.

Signed-off-by: Dylan Yudaken <[email protected]>
---
io_uring/poll.c | 27 ++++++++++++++++-----------
1 file changed, 16 insertions(+), 11 deletions(-)

diff --git a/io_uring/poll.c b/io_uring/poll.c
index fa25b88a7b93..922a3d1b2e31 100644
--- a/io_uring/poll.c
+++ b/io_uring/poll.c
@@ -192,13 +192,18 @@ static void io_poll_remove_entries(struct io_kiocb *req)
rcu_read_unlock();
}

+enum {
+ IOU_POLL_DONE = 0,
+ IOU_POLL_NO_ACTION = 1,
+};
+
/*
* All poll tw should go through this. Checks for poll events, manages
* references, does rewait, etc.
*
- * Returns a negative error on failure. >0 when no action require, which is
- * either spurious wakeup or multishot CQE is served. 0 when it's done with
- * the request, then the mask is stored in req->cqe.res.
+ * Returns a negative error on failure. IOU_POLL_NO_ACTION when no action require,
+ * which is either spurious wakeup or multishot CQE is served.
+ * IOU_POLL_DONE when it's done with the request, then the mask is stored in req->cqe.res.
*/
static int io_poll_check_events(struct io_kiocb *req, bool *locked)
{
@@ -214,10 +219,11 @@ static int io_poll_check_events(struct io_kiocb *req, bool *locked)

/* tw handler should be the owner, and so have some references */
if (WARN_ON_ONCE(!(v & IO_POLL_REF_MASK)))
- return 0;
+ return IOU_POLL_DONE;
if (v & IO_POLL_CANCEL_FLAG)
return -ECANCELED;

+ /* the mask was stashed in __io_poll_execute */
if (!req->cqe.res) {
struct poll_table_struct pt = { ._key = req->apoll_events };
req->cqe.res = vfs_poll(req->file, &pt) & req->apoll_events;
@@ -226,7 +232,7 @@ static int io_poll_check_events(struct io_kiocb *req, bool *locked)
if ((unlikely(!req->cqe.res)))
continue;
if (req->apoll_events & EPOLLONESHOT)
- return 0;
+ return IOU_POLL_DONE;

/* multishot, just fill a CQE and proceed */
if (!(req->flags & REQ_F_APOLL_MULTISHOT)) {
@@ -238,7 +244,7 @@ static int io_poll_check_events(struct io_kiocb *req, bool *locked)
return -ECANCELED;
} else {
ret = io_poll_issue(req, locked);
- if (ret)
+ if (ret < 0)
return ret;
}

@@ -248,7 +254,7 @@ static int io_poll_check_events(struct io_kiocb *req, bool *locked)
*/
} while (atomic_sub_return(v & IO_POLL_REF_MASK, &req->poll_refs));

- return 1;
+ return IOU_POLL_NO_ACTION;
}

static void io_poll_task_func(struct io_kiocb *req, bool *locked)
@@ -256,12 +262,11 @@ static void io_poll_task_func(struct io_kiocb *req, bool *locked)
int ret;

ret = io_poll_check_events(req, locked);
- if (ret > 0)
+ if (ret == IOU_POLL_NO_ACTION)
return;

- if (!ret) {
+ if (ret == IOU_POLL_DONE) {
struct io_poll *poll = io_kiocb_to_cmd(req);
-
req->cqe.res = mangle_poll(req->cqe.res & poll->events);
} else {
req->cqe.res = ret;
@@ -280,7 +285,7 @@ static void io_apoll_task_func(struct io_kiocb *req, bool *locked)
int ret;

ret = io_poll_check_events(req, locked);
- if (ret > 0)
+ if (ret == IOU_POLL_NO_ACTION)
return;

io_poll_remove_entries(req);
--
2.30.2

2022-06-30 10:05:17

by Dylan Yudaken

Subject: [PATCH v2 for-next 12/12] io_uring: only trace one of complete or overflow

On overflow we see a duplicate line in the trace, and in some cases
three lines (if the initial io_post_aux_cqe fails). Instead, trace just
once for each CQE.

Signed-off-by: Dylan Yudaken <[email protected]>
---
io_uring/io_uring.c | 3 ++-
io_uring/io_uring.h | 10 ++++++----
2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/io_uring/io_uring.c b/io_uring/io_uring.c
index 523b6ebad15a..caf979cd4327 100644
--- a/io_uring/io_uring.c
+++ b/io_uring/io_uring.c
@@ -742,7 +742,6 @@ static bool io_fill_cqe_aux(struct io_ring_ctx *ctx,
struct io_uring_cqe *cqe;

ctx->cq_extra++;
- trace_io_uring_complete(ctx, NULL, user_data, res, cflags, 0, 0);

/*
* If we can't get a cq entry, userspace overflowed the
@@ -751,6 +750,8 @@ static bool io_fill_cqe_aux(struct io_ring_ctx *ctx,
*/
cqe = io_get_cqe(ctx);
if (likely(cqe)) {
+ trace_io_uring_complete(ctx, NULL, user_data, res, cflags, 0, 0);
+
WRITE_ONCE(cqe->user_data, user_data);
WRITE_ONCE(cqe->res, res);
WRITE_ONCE(cqe->flags, cflags);
diff --git a/io_uring/io_uring.h b/io_uring/io_uring.h
index e022d71c177a..868f45d55543 100644
--- a/io_uring/io_uring.h
+++ b/io_uring/io_uring.h
@@ -101,10 +101,6 @@ static inline bool __io_fill_cqe_req(struct io_ring_ctx *ctx,
{
struct io_uring_cqe *cqe;

- trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
- req->cqe.res, req->cqe.flags,
- (req->flags & REQ_F_CQE32_INIT) ? req->extra1 : 0,
- (req->flags & REQ_F_CQE32_INIT) ? req->extra2 : 0);
/*
* If we can't get a cq entry, userspace overflowed the
* submission (by quite a lot). Increment the overflow count in
@@ -113,6 +109,12 @@ static inline bool __io_fill_cqe_req(struct io_ring_ctx *ctx,
cqe = io_get_cqe(ctx);
if (unlikely(!cqe))
return io_req_cqe_overflow(req);
+
+ trace_io_uring_complete(req->ctx, req, req->cqe.user_data,
+ req->cqe.res, req->cqe.flags,
+ (req->flags & REQ_F_CQE32_INIT) ? req->extra1 : 0,
+ (req->flags & REQ_F_CQE32_INIT) ? req->extra2 : 0);
+
memcpy(cqe, &req->cqe, sizeof(*cqe));

if (ctx->flags & IORING_SETUP_CQE32) {
--
2.30.2

2022-06-30 20:38:20

by Jens Axboe

Subject: Re: [PATCH v2 for-next 00/12] io_uring: multishot recv

On Thu, 30 Jun 2022 02:12:19 -0700, Dylan Yudaken wrote:
> This series adds support for multishot recv/recvmsg to io_uring.
>
> The idea is that generally socket applications will be continually
> enqueuing a new recv() when the previous one completes. This can be
> improved on by allowing the application to queue a multishot receive,
> which will post completions as and when data is available. It uses the
> provided buffers feature to receive new data into a pool provided by
> the application.
>
> [...]

Applied, thanks!

[01/12] io_uring: allow 0 length for buffer select
(no commit info)
[02/12] io_uring: restore bgid in io_put_kbuf
(no commit info)
[03/12] io_uring: allow iov_len = 0 for recvmsg and buffer select
(no commit info)
[04/12] io_uring: recycle buffers on error
(no commit info)
[05/12] io_uring: clean up io_poll_check_events return values
(no commit info)
[06/12] io_uring: add IOU_STOP_MULTISHOT return code
(no commit info)
[07/12] io_uring: add allow_overflow to io_post_aux_cqe
(no commit info)
[08/12] io_uring: fix multishot poll on overflow
(no commit info)
[09/12] io_uring: fix multishot accept ordering
(no commit info)
[10/12] io_uring: multishot recv
(no commit info)
[11/12] io_uring: fix io_uring_cqe_overflow trace format
(no commit info)
[12/12] io_uring: only trace one of complete or overflow
(no commit info)

Best regards,
--
Jens Axboe


2022-06-30 21:02:07

by Jens Axboe

Subject: Re: [PATCH v2 for-next 00/12] io_uring: multishot recv

On 6/30/22 3:12 AM, Dylan Yudaken wrote:
> This series adds support for multishot recv/recvmsg to io_uring.
>
> The idea is that generally socket applications will be continually
> enqueuing a new recv() when the previous one completes. This can be
> improved on by allowing the application to queue a multishot receive,
> which will post completions as and when data is available. It uses the
> provided buffers feature to receive new data into a pool provided by
> the application.
>
> This is more performant in a few ways:
> * Subsequent receives are queued up straight away, without requiring
> the application to finish a processing loop.
> * If there is more data in the socket (say the provided buffer size
> is smaller than the socket buffer), the data is returned immediately,
> improving batching.
> * Poll is only armed once and reused, saving CPU cycles.
>
> Running a small network benchmark [1] shows a QPS improvement of ~6-8%
> over a range of loads.

I have applied this, changing ->addr2 to ->ioprio for the flags bit as
per the io_uring-5.19 branch.
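
For reference, the flag ends up in the SQE's ioprio field, so arming
multishot without a liburing helper looks like this (a hedged sketch of
the uapi as it later landed):

  sqe->ioprio |= IORING_RECV_MULTISHOT;   /* v2 of this series used sqe->addr2 */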

Pretty excited about recv multishot. I think it's an elegant model, and
it has really nice performance improvements as well!

--
Jens Axboe