2020-07-13 23:44:21

by Pavel Begunkov

Subject: [PATCH 0/5] batch completion + freeing improvements

Different batching improvements, that's it.

Unfortunately, I don't have a decent SSD/setup at hand to
benchmark it properly.

p.s. if an extra 32 pointers on the stack turn out to be a problem, I've
long wanted to put submit_state into the ctx itself.

Pavel Begunkov (5):
io_uring: move io_req_complete() definition
io_uring: replace list with array for compl batch
io_uring: batch free in batched completion
tasks: add put_task_struct_many()
io_uring: batch put_task_struct()

fs/io_uring.c | 130 +++++++++++++++++++++++---------
include/linux/sched/task.h | 6 ++
2 files changed, 83 insertions(+), 53 deletions(-)

--
2.24.0


2020-07-13 23:44:41

by Pavel Begunkov

Subject: [PATCH 4/5] tasks: add put_task_struct_many()

put_task_struct_many() is the same as put_task_struct(), but puts several
references at once. Useful for batching.
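
For illustration only (not part of the patch; the variable names below are
made up), dropping N references to the same task goes from a loop of atomic
decrements to a single refcount_sub_and_test():

        /* before: one atomic decrement per reference */
        for (i = 0; i < nr_refs; i++)
                put_task_struct(task);

        /* after: all nr_refs references dropped with one atomic operation */
        put_task_struct_many(task, nr_refs);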

Signed-off-by: Pavel Begunkov <[email protected]>
---
include/linux/sched/task.h | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/include/linux/sched/task.h b/include/linux/sched/task.h
index 38359071236a..1301077f9c24 100644
--- a/include/linux/sched/task.h
+++ b/include/linux/sched/task.h
@@ -126,6 +126,12 @@ static inline void put_task_struct(struct task_struct *t)
__put_task_struct(t);
}

+static inline void put_task_struct_many(struct task_struct *t, int nr)
+{
+ if (refcount_sub_and_test(nr, &t->usage))
+ __put_task_struct(t);
+}
+
void put_task_struct_rcu_user(struct task_struct *task);

#ifdef CONFIG_ARCH_WANTS_DYNAMIC_TASK_STRUCT
--
2.24.0

2020-07-13 23:44:50

by Pavel Begunkov

Subject: [PATCH 3/5] io_uring: batch free in batched completion

io_submit_flush_completions() already batches completions, so batch the
freeing there as well, reusing batch_free from the iopoll path.
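
In other words, instead of a per-request io_put_req(), the flush loop drops
the reference itself and hands requests whose refcount hit zero to the batch
collector. A simplified sketch of the result (the real io_req_free_batch()
also handles linked requests):

        rb.to_free = 0;
        for (i = 0; i < nr; ++i) {
                req = cs->reqs[i];
                /* only requests that hit zero refs get collected */
                if (refcount_dec_and_test(&req->refs))
                        io_req_free_batch(&rb, req);
        }
        /* frees everything collected above in one go */
        io_req_free_batch_finish(ctx, &rb);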

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 3277a06e2fb6..6f767781351f 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1795,6 +1795,7 @@ static void io_req_free_batch(struct req_batch *rb, struct io_kiocb *req)

static void io_submit_flush_completions(struct io_comp_state *cs)
{
+ struct req_batch rb;
struct io_kiocb *req;
struct io_ring_ctx *ctx = cs->ctx;
int i, nr = cs->nr;
@@ -1808,8 +1809,13 @@ static void io_submit_flush_completions(struct io_comp_state *cs)
spin_unlock_irq(&ctx->completion_lock);
io_cqring_ev_posted(ctx);

- for (i = 0; i < nr; ++i)
- io_put_req(cs->reqs[i]);
+ rb.to_free = 0;
+ for (i = 0; i < nr; ++i) {
+ req = cs->reqs[i];
+ if (refcount_dec_and_test(&req->refs))
+ io_req_free_batch(&rb, req);
+ }
+ io_req_free_batch_finish(ctx, &rb);
cs->nr = 0;
}

--
2.24.0

2020-07-13 23:45:04

by Pavel Begunkov

Subject: [PATCH 5/5] io_uring: batch put_task_struct()

Each put_task_struct() is an atomic_dec. Do that in batches.

Tested with io_uring-bench (iopoll, QD=128) and a custom nullblk, where the
added ->iopoll() is not optimised at all:

before: 529504 IOPS
after: 538415 IOPS
diff: ~1.8%
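
The accumulation itself is a small amount of state in req_batch (simplified
sketch of the intent, see the hunks below): remember the current task and a
pending reference count, and only pay a single put_task_struct_many() when
the task changes or the batch is finished:

        if (req->task != rb->task) {
                /* switching tasks: flush refs accumulated for the old one */
                if (rb->task)
                        put_task_struct_many(rb->task, rb->task_refs);
                rb->task = req->task;
                rb->task_refs = 0;
        }
        rb->task_refs++;        /* one more deferred put for rb->task */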

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 29 +++++++++++++++++++++++++++--
1 file changed, 27 insertions(+), 2 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 6f767781351f..3216cc00061b 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1761,8 +1761,18 @@ static void io_free_req(struct io_kiocb *req)
struct req_batch {
void *reqs[IO_IOPOLL_BATCH];
int to_free;
+
+ struct task_struct *task;
+ int task_refs;
};

+static void io_init_req_batch(struct req_batch *rb)
+{
+ rb->to_free = 0;
+ rb->task_refs = 0;
+ rb->task = NULL;
+}
+
static void __io_req_free_batch_flush(struct io_ring_ctx *ctx,
struct req_batch *rb)
{
@@ -1776,6 +1786,10 @@ static void io_req_free_batch_finish(struct io_ring_ctx *ctx,
{
if (rb->to_free)
__io_req_free_batch_flush(ctx, rb);
+ if (rb->task) {
+ put_task_struct_many(rb->task, rb->task_refs);
+ rb->task = NULL;
+ }
}

static void io_req_free_batch(struct req_batch *rb, struct io_kiocb *req)
@@ -1787,6 +1801,17 @@ static void io_req_free_batch(struct req_batch *rb, struct io_kiocb *req)
if (req->flags & REQ_F_LINK_HEAD)
io_queue_next(req);

+ if (req->flags & REQ_F_TASK_PINNED) {
+ if (req->task != rb->task) {
+ if (rb->task)
+ put_task_struct_many(rb->task, rb->task_refs);
+ rb->task = req->task;
+ rb->task_refs = 0;
+ }
+ rb->task_refs++;
+ req->flags &= ~REQ_F_TASK_PINNED;
+ }
+
io_dismantle_req(req);
rb->reqs[rb->to_free++] = req;
if (unlikely(rb->to_free == ARRAY_SIZE(rb->reqs)))
@@ -1809,7 +1834,7 @@ static void io_submit_flush_completions(struct io_comp_state *cs)
spin_unlock_irq(&ctx->completion_lock);
io_cqring_ev_posted(ctx);

- rb.to_free = 0;
+ io_init_req_batch(&rb);
for (i = 0; i < nr; ++i) {
req = cs->reqs[i];
if (refcount_dec_and_test(&req->refs))
@@ -1973,7 +1998,7 @@ static void io_iopoll_complete(struct io_ring_ctx *ctx, unsigned int *nr_events,
/* order with ->result store in io_complete_rw_iopoll() */
smp_rmb();

- rb.to_free = 0;
+ io_init_req_batch(&rb);
while (!list_empty(done)) {
int cflags = 0;

--
2.24.0

2020-07-13 23:45:27

by Pavel Begunkov

Subject: [PATCH 2/5] io_uring: replace list with array for compl batch

We already limit how many requests we batch on the completion path, so use a
fixed-size array instead of a list for completion batching. That also allows
splitting io_submit_flush_completions() into 2 steps: the first fills CQEs,
the second actually frees the requests (a simplified sketch of this two-step
flush is shown after the list below).

There are plenty of benefits:
- avoids expensive list head tossing and drops the INIT_LIST_HEAD in state prep
- doesn't do extra unlock/lock to put a linked request
- filling CQEs first gives better latency
- will be used to handle list entry aliasing and add batch free there
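
A simplified sketch of the two-step flush mentioned above (locking as in the
hunk below, error handling trimmed):

        /* step 1: fill all CQEs under the completion lock */
        spin_lock_irq(&ctx->completion_lock);
        for (i = 0; i < nr; ++i) {
                req = cs->reqs[i];
                __io_cqring_fill_event(req, req->result, req->compl.cflags);
        }
        io_commit_cqring(ctx);
        spin_unlock_irq(&ctx->completion_lock);
        io_cqring_ev_posted(ctx);

        /* step 2: put requests without holding the lock */
        for (i = 0; i < nr; ++i)
                io_put_req(cs->reqs[i]);
        cs->nr = 0;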

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 33 +++++++++++++--------------------
1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 609c7da044d7..3277a06e2fb6 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -680,11 +680,12 @@ struct io_defer_entry {
};

#define IO_IOPOLL_BATCH 8
+#define IO_COMPL_BATCH 32

struct io_comp_state {
- unsigned int nr;
- struct list_head list;
struct io_ring_ctx *ctx;
+ unsigned int nr;
+ struct io_kiocb *reqs[IO_COMPL_BATCH];
};

struct io_submit_state {
@@ -1794,28 +1795,21 @@ static void io_req_free_batch(struct req_batch *rb, struct io_kiocb *req)

static void io_submit_flush_completions(struct io_comp_state *cs)
{
+ struct io_kiocb *req;
struct io_ring_ctx *ctx = cs->ctx;
+ int i, nr = cs->nr;

spin_lock_irq(&ctx->completion_lock);
- while (!list_empty(&cs->list)) {
- struct io_kiocb *req;
-
- req = list_first_entry(&cs->list, struct io_kiocb, compl.list);
- list_del(&req->compl.list);
+ for (i = 0; i < nr; ++i) {
+ req = cs->reqs[i];
__io_cqring_fill_event(req, req->result, req->compl.cflags);
- if (!(req->flags & REQ_F_LINK_HEAD)) {
- req->flags |= REQ_F_COMP_LOCKED;
- io_put_req(req);
- } else {
- spin_unlock_irq(&ctx->completion_lock);
- io_put_req(req);
- spin_lock_irq(&ctx->completion_lock);
- }
}
io_commit_cqring(ctx);
spin_unlock_irq(&ctx->completion_lock);
-
io_cqring_ev_posted(ctx);
+
+ for (i = 0; i < nr; ++i)
+ io_put_req(cs->reqs[i]);
cs->nr = 0;
}

@@ -1829,8 +1823,8 @@ static void __io_req_complete(struct io_kiocb *req, long res, unsigned cflags,
io_clean_op(req);
req->result = res;
req->compl.cflags = cflags;
- list_add_tail(&req->compl.list, &cs->list);
- if (++cs->nr >= 32)
+ cs->reqs[cs->nr++] = req;
+ if (cs->nr == IO_COMPL_BATCH)
io_submit_flush_completions(cs);
}
}
@@ -6118,7 +6112,7 @@ static int io_submit_sqe(struct io_kiocb *req, const struct io_uring_sqe *sqe,
*/
static void io_submit_state_end(struct io_submit_state *state)
{
- if (!list_empty(&state->comp.list))
+ if (state->comp.nr)
io_submit_flush_completions(&state->comp);
blk_finish_plug(&state->plug);
io_state_file_put(state);
@@ -6137,7 +6131,6 @@ static void io_submit_state_start(struct io_submit_state *state,
state->plug.nowait = true;
#endif
state->comp.nr = 0;
- INIT_LIST_HEAD(&state->comp.list);
state->comp.ctx = ctx;
state->free_reqs = 0;
state->file = NULL;
--
2.24.0

2020-07-13 23:46:28

by Pavel Begunkov

Subject: [PATCH 1/5] io_uring: move io_req_complete() definition

Just a preparation patch: move several functions (i.e. [__]io_req_complete()
and io_submit_flush_completions()) further down in the file to avoid extra
forward declarations.

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 96 +++++++++++++++++++++++++--------------------------
1 file changed, 48 insertions(+), 48 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 7038c4f08805..609c7da044d7 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -1428,54 +1428,6 @@ static void io_cqring_add_event(struct io_kiocb *req, long res, long cflags)
io_cqring_ev_posted(ctx);
}

-static void io_submit_flush_completions(struct io_comp_state *cs)
-{
- struct io_ring_ctx *ctx = cs->ctx;
-
- spin_lock_irq(&ctx->completion_lock);
- while (!list_empty(&cs->list)) {
- struct io_kiocb *req;
-
- req = list_first_entry(&cs->list, struct io_kiocb, compl.list);
- list_del(&req->compl.list);
- __io_cqring_fill_event(req, req->result, req->compl.cflags);
- if (!(req->flags & REQ_F_LINK_HEAD)) {
- req->flags |= REQ_F_COMP_LOCKED;
- io_put_req(req);
- } else {
- spin_unlock_irq(&ctx->completion_lock);
- io_put_req(req);
- spin_lock_irq(&ctx->completion_lock);
- }
- }
- io_commit_cqring(ctx);
- spin_unlock_irq(&ctx->completion_lock);
-
- io_cqring_ev_posted(ctx);
- cs->nr = 0;
-}
-
-static void __io_req_complete(struct io_kiocb *req, long res, unsigned cflags,
- struct io_comp_state *cs)
-{
- if (!cs) {
- io_cqring_add_event(req, res, cflags);
- io_put_req(req);
- } else {
- io_clean_op(req);
- req->result = res;
- req->compl.cflags = cflags;
- list_add_tail(&req->compl.list, &cs->list);
- if (++cs->nr >= 32)
- io_submit_flush_completions(cs);
- }
-}
-
-static void io_req_complete(struct io_kiocb *req, long res)
-{
- __io_req_complete(req, res, 0, NULL);
-}
-
static inline bool io_is_fallback_req(struct io_kiocb *req)
{
return req == (struct io_kiocb *)
@@ -1840,6 +1792,54 @@ static void io_req_free_batch(struct req_batch *rb, struct io_kiocb *req)
__io_req_free_batch_flush(req->ctx, rb);
}

+static void io_submit_flush_completions(struct io_comp_state *cs)
+{
+ struct io_ring_ctx *ctx = cs->ctx;
+
+ spin_lock_irq(&ctx->completion_lock);
+ while (!list_empty(&cs->list)) {
+ struct io_kiocb *req;
+
+ req = list_first_entry(&cs->list, struct io_kiocb, compl.list);
+ list_del(&req->compl.list);
+ __io_cqring_fill_event(req, req->result, req->compl.cflags);
+ if (!(req->flags & REQ_F_LINK_HEAD)) {
+ req->flags |= REQ_F_COMP_LOCKED;
+ io_put_req(req);
+ } else {
+ spin_unlock_irq(&ctx->completion_lock);
+ io_put_req(req);
+ spin_lock_irq(&ctx->completion_lock);
+ }
+ }
+ io_commit_cqring(ctx);
+ spin_unlock_irq(&ctx->completion_lock);
+
+ io_cqring_ev_posted(ctx);
+ cs->nr = 0;
+}
+
+static void __io_req_complete(struct io_kiocb *req, long res, unsigned cflags,
+ struct io_comp_state *cs)
+{
+ if (!cs) {
+ io_cqring_add_event(req, res, cflags);
+ io_put_req(req);
+ } else {
+ io_clean_op(req);
+ req->result = res;
+ req->compl.cflags = cflags;
+ list_add_tail(&req->compl.list, &cs->list);
+ if (++cs->nr >= 32)
+ io_submit_flush_completions(cs);
+ }
+}
+
+static void io_req_complete(struct io_kiocb *req, long res)
+{
+ __io_req_complete(req, res, 0, NULL);
+}
+
/*
* Drop reference to request, return next in chain (if there is one) if this
* was the last reference to this request.
--
2.24.0

2020-07-14 01:09:06

by Jens Axboe

Subject: Re: [PATCH 0/5] batch completion + freeing improvements

On 7/13/20 5:41 PM, Pavel Begunkov wrote:
> Different batching improvements, that's it.
>
> Unfortunately, I don't have a decent SSD/setup at hand to
> benchmark it properly.

I do though, but I'm not seeing any improvement with this, whereas
some of the previous series made nice improvements... If anything
maybe it's a bit slower.

> p.s. if an extra 32 pointers on the stack turn out to be a problem, I've
> long wanted to put submit_state into the ctx itself.

It's getting up there... But really depends on how early in the stack,
so 32 could _probably_ work, though usually batched on-stack counts
are a bit lower than that.

--
Jens Axboe

2020-07-14 07:20:13

by Pavel Begunkov

Subject: Re: [PATCH 0/5] batch completion + freeing improvements

On 14/07/2020 04:08, Jens Axboe wrote:
> On 7/13/20 5:41 PM, Pavel Begunkov wrote:
>> Different batching improvements, that's it.
>>
>> Unfortunately, I don't have a decent SSD/setup at hand to
>> benchmark it properly.
>
> I do though, but I'm not seeing any improvement with this, whereas
> some of the previous series made nice improvements... If anything
> maybe it's a bit slower.

Thanks for testing it, I appreciate that. Probably the array played badly
with your caches, or the 2-step approach is to blame. I'll try to refine
and/or resend parts after closer benchmarking.

>
>> p.s. if an extra 32 pointers on the stack turn out to be a problem, I've
>> long wanted to put submit_state into the ctx itself.
>
> It's getting up there... But really depends on how early in the stack,
> so 32 could _probably_ work, though usually batched on-stack counts
> are a bit lower than that.

With a fresh head, the extra ~256 bytes does look like too much, I agree,
especially considering that io_uring sits on top of the VFS or close to it,
and there are already fast_iovec/msg buffers on the stack.
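
For reference, the number being discussed is the new completion array alone
(back-of-the-envelope, 64-bit build):

        /* in io_comp_state, embedded in io_submit_state on the stack */
        struct io_kiocb *reqs[IO_COMPL_BATCH];  /* 32 * 8 = 256 bytes */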

--
Pavel Begunkov