2019-09-23 18:30:01

by Pavel Begunkov

Subject: [PATCH v2 0/2] Optimise io_uring completion waiting

From: Pavel Begunkov <[email protected]>

There can be a lot of overhead within the generic wait_event_*() helpers
when waiting for a large number of completions. This patchset removes much
of it by using a custom wait event (wait_threshold).

A synthetic test showed a ~40% performance boost (see patch 2).
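
In short, the waiting side changes shape like this (simplified from the
call sites in patch 2; the real condition also checks for timeouts):

    /* before: woken on every completion, recheck, maybe sleep again */
    ret = wait_event_interruptible(ctx->wait,
                io_cqring_events(rings) >= min_events);

    /* after: the waker passes the current event count, so waiters whose
     * threshold has not been met yet are not woken at all
     */
    ret = wait_threshold_interruptible(ctx->wait, min_events,
                io_cqring_events(rings) >= min_events);
    ...
    wake_up_threshold(&ctx->wait, nr_events);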

v2: rebase

Pavel Begunkov (2):
sched/wait: Add wait_threshold
io_uring: Optimise cq waiting with wait_threshold

fs/io_uring.c | 35 ++++++++++++------
include/linux/wait_threshold.h | 67 ++++++++++++++++++++++++++++++++++
kernel/sched/Makefile | 2 +-
kernel/sched/wait_threshold.c | 26 +++++++++++++
4 files changed, 118 insertions(+), 12 deletions(-)
create mode 100644 include/linux/wait_threshold.h
create mode 100644 kernel/sched/wait_threshold.c

--
2.23.0


2019-09-23 19:18:55

by Pavel Begunkov

Subject: [PATCH v2 2/2] io_uring: Optimise cq waiting with wait_threshold

From: Pavel Begunkov <[email protected]>

While waiting for completion events in io_cqring_wait(), the process is
woken up inside wait_event_interruptible() on any request completion,
checks the number of events in the completion queue, and potentially goes
back to sleep again.

There can be a lot of such spurious wakeups, with a lot of overhead
attached. The problem especially manifests itself when min_events is large
and completions arrive one by one or in small batches (which is usually
the case).

E.g. if the device completes requests one by one and io_uring_enter() is
waiting for 100 events, there will be ~99 spurious wakeups.

Use the new wait_threshold_*() primitives instead, which won't wake the
task up until the necessary number of events has been collected.
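
Patch 1 is not reproduced here; as a rough sketch of the mechanism it
adds (the exact contents of wait_threshold.h are an assumption, inferred
from the call sites in this patch): each waiter records its threshold in
an extended wait entry, the waker passes the current event count as the
wake key, and the filtering happens inside the wake function:

    /* hypothetical sketch, not the actual patch 1 */
    struct wait_threshold_queue_entry {
        struct wait_queue_entry wq_entry;
        unsigned int threshold;
    };

    static int wait_threshold_wake_function(struct wait_queue_entry *wq_entry,
                                            unsigned int mode, int sync,
                                            void *key)
    {
        struct wait_threshold_queue_entry *wtq_entry =
            container_of(wq_entry, struct wait_threshold_queue_entry,
                         wq_entry);
        unsigned int val = (unsigned int)(unsigned long)key;

        /* a key of WQ_THRESHOLD_WAKE_ALWAYS (e.g. UINT_MAX) passes
         * this check for any waiter
         */
        if (val < wtq_entry->threshold)
            return 0;    /* threshold not reached, don't wake */

        return autoremove_wake_function(wq_entry, mode, sync, key);
    }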

Performance test:
The first thread generates requests (QD=512) one by one, so they are
completed in a similar pattern. The second thread waits for 128 events
to complete.

Tested with null_blk with a 5us delay on a 3.8GHz Intel CPU.

throughput before: 270 KIOPS
throughput after: 370 KIOPS
That's a ~40% throughput boost; exaggerated, but it makes the point.

v2: wake always in io_timeout_fn() with WQ_THRESHOLD_WAKE_ALWAYS

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 35 ++++++++++++++++++++++++-----------
1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5c3f2bb81637..05f4391c7bbe 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -70,6 +70,7 @@
#include <linux/nospec.h>
#include <linux/sizes.h>
#include <linux/hugetlb.h>
+#include <linux/wait_threshold.h>

#include <uapi/linux/io_uring.h>

@@ -414,6 +415,13 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
return ctx;
}

+static unsigned int io_cqring_events(struct io_rings *rings)
+{
+ /* See comment at the top of this file */
+ smp_rmb();
+ return READ_ONCE(rings->cq.tail) - READ_ONCE(rings->cq.head);
+}
+
static inline bool io_sequence_defer(struct io_ring_ctx *ctx,
struct io_kiocb *req)
{
@@ -559,16 +567,27 @@ static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data,
}
}

-static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
+static void __io_cqring_ev_posted(struct io_ring_ctx *ctx,
+ unsigned int nr_events)
{
if (waitqueue_active(&ctx->wait))
- wake_up(&ctx->wait);
+ wake_up_threshold(&ctx->wait, nr_events);
if (waitqueue_active(&ctx->sqo_wait))
wake_up(&ctx->sqo_wait);
if (ctx->cq_ev_fd)
eventfd_signal(ctx->cq_ev_fd, 1);
}

+static inline void io_cqring_ev_posted(struct io_ring_ctx *ctx)
+{
+ __io_cqring_ev_posted(ctx, io_cqring_events(ctx->rings));
+}
+
+static inline void io_cqring_timeout_posted(struct io_ring_ctx *ctx)
+{
+ __io_cqring_ev_posted(ctx, WQ_THRESHOLD_WAKE_ALWAYS);
+}
+
static void io_cqring_add_event(struct io_ring_ctx *ctx, u64 user_data,
long res)
{
@@ -587,7 +606,7 @@ static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs)
percpu_ref_put_many(&ctx->refs, refs);

if (waitqueue_active(&ctx->wait))
- wake_up(&ctx->wait);
+ wake_up_threshold(&ctx->wait, io_cqring_events(ctx->rings));
}

static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
@@ -722,12 +741,6 @@ static void io_put_req(struct io_kiocb *req)
io_free_req(req);
}

-static unsigned io_cqring_events(struct io_rings *rings)
-{
- /* See comment at the top of this file */
- smp_rmb();
- return READ_ONCE(rings->cq.tail) - READ_ONCE(rings->cq.head);
-}

/*
* Find and free completed poll iocbs
@@ -1824,7 +1837,7 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer)
io_commit_cqring(ctx);
spin_unlock_irqrestore(&ctx->completion_lock, flags);

- io_cqring_ev_posted(ctx);
+ io_cqring_timeout_posted(ctx);

io_put_req(req);
return HRTIMER_NORESTART;
@@ -2723,7 +2736,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
* we started waiting. For timeouts, we always want to return to
* userspace.
*/
- ret = wait_event_interruptible(ctx->wait,
+ ret = wait_threshold_interruptible(ctx->wait, min_events,
io_cqring_events(rings) >= min_events ||
atomic_read(&ctx->cq_timeouts) != nr_timeouts);
restore_saved_sigmask_unless(ret == -ERESTARTSYS);
--
2.23.0

2019-09-25 20:04:52

by Pavel Begunkov

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

Sorry, I mixed up the threads.

>>
>> I'm not sure an extension is needed for such a special interface, why not
>> just put a ->threshold value next to the ctx->wait field and use either
>> the regular wait_event() APIs with the proper condition, or
>> wait_event_cmd() style APIs if you absolutely need something more complex
>> to happen inside?
Ingo,
io_uring works well without this patch, just using wait_event_*() with a
proper condition, but there are performance issues with spurious wakeups.
There is a detailed description in the previous mail.
Am I missing something?
Thanks


>>
>> Should result in a much lower linecount and no scheduler changes. :-)
>>
>> Thanks,
>>
>> Ingo
>>
>

--
Yours sincerely,
Pavel Begunkov



2019-09-26 00:36:56

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/23/19 2:48 PM, Jens Axboe wrote:
> On 9/23/19 10:32 AM, Pavel Begunkov wrote:
>> Sorry, I mixed up the threads.
>>
>>>>
>>>> I'm not sure an extension is needed for such a special interface, why not
>>>> just put a ->threshold value next to the ctx->wait field and use either
>>>> the regular wait_event() APIs with the proper condition, or
>>>> wait_event_cmd() style APIs if you absolutely need something more complex
>>>> to happen inside?
>> Ingo,
>> io_uring works well without this patch, just using wait_event_*() with a
>> proper condition, but there are performance issues with spurious wakeups.
>> There is a detailed description in the previous mail.
>> Am I missing something?
>
> I think we can do the same thing, just wrapping the waitqueue in a
> structure with a count in it, on the stack. Got some flight time
> coming up later today, let me try and cook up a patch.

Totally untested, and sent out 5 min before departure... But something
like this.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index ca7570aca430..c2f9e1da26dd 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2768,6 +2768,37 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
return submit;
}

+struct io_wait_queue {
+ struct wait_queue_entry wq;
+ struct io_ring_ctx *ctx;
+ struct task_struct *task;
+ unsigned to_wait;
+ unsigned nr_timeouts;
+};
+
+static inline bool io_should_wake(struct io_wait_queue *iowq)
+{
+ struct io_ring_ctx *ctx = iowq->ctx;
+
+ return io_cqring_events(ctx->rings) >= iowq->to_wait ||
+ atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
+}
+
+static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
+ int wake_flags, void *key)
+{
+ struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
+ wq);
+
+ if (io_should_wake(iowq)) {
+ list_del_init(&curr->entry);
+ wake_up_process(iowq->task);
+ return 1;
+ }
+
+ return -1;
+}
+
/*
* Wait until events become available, if we don't already have some. The
* application must reap them itself, as they reside on the shared cq ring.
@@ -2775,8 +2806,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
const sigset_t __user *sig, size_t sigsz)
{
+ struct io_wait_queue iowq = {
+ .wq = {
+ .func = io_wake_function,
+ .entry = LIST_HEAD_INIT(iowq.wq.entry),
+ },
+ .task = current,
+ .ctx = ctx,
+ .to_wait = min_events,
+ };
struct io_rings *rings = ctx->rings;
- unsigned nr_timeouts;
int ret;

if (io_cqring_events(rings) >= min_events)
@@ -2795,15 +2834,16 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
return ret;
}

- nr_timeouts = atomic_read(&ctx->cq_timeouts);
- /*
- * Return if we have enough events, or if a timeout occured since
- * we started waiting. For timeouts, we always want to return to
- * userspace.
- */
- ret = wait_event_interruptible(ctx->wait,
- io_cqring_events(rings) >= min_events ||
- atomic_read(&ctx->cq_timeouts) != nr_timeouts);
+ iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
+ prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
+ do {
+ if (io_should_wake(&iowq))
+ break;
+ schedule();
+ set_current_state(TASK_INTERRUPTIBLE);
+ } while (1);
+ finish_wait(&ctx->wait, &iowq.wq);
+
restore_saved_sigmask_unless(ret == -ERESTARTSYS);
if (ret == -ERESTARTSYS)
ret = -EINTR;

--
Jens Axboe

2019-09-26 00:44:15

by Pavel Begunkov

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 24/09/2019 02:00, Jens Axboe wrote:
>> I think we can do the same thing, just wrapping the waitqueue in a
>> structure with a count in it, on the stack. Got some flight time
>> coming up later today, let me try and cook up a patch.
>
> Totally untested, and sent out 5 min before departure... But something
> like this.
Hmm, reminds me of my first version. Basically that's the same thing but
with macros inlined. I wanted to make it reusable and self-contained,
though.

If you don't think it could be useful in other places, sure, we could do
something like that. Is that so?

>
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index ca7570aca430..c2f9e1da26dd 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2768,6 +2768,37 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
> return submit;
> }
>
> +struct io_wait_queue {
> + struct wait_queue_entry wq;
> + struct io_ring_ctx *ctx;
> + struct task_struct *task;
> + unsigned to_wait;
> + unsigned nr_timeouts;
> +};
> +
> +static inline bool io_should_wake(struct io_wait_queue *iowq)
> +{
> + struct io_ring_ctx *ctx = iowq->ctx;
> +
> + return io_cqring_events(ctx->rings) >= iowq->to_wait ||
> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
> +}
> +
> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
> + int wake_flags, void *key)
> +{
> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
> + wq);
> +
> + if (io_should_wake(iowq)) {
> + list_del_init(&curr->entry);
> + wake_up_process(iowq->task);
> + return 1;
> + }
> +
> + return -1;
> +}
> +
> /*
> * Wait until events become available, if we don't already have some. The
> * application must reap them itself, as they reside on the shared cq ring.
> @@ -2775,8 +2806,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
> static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> const sigset_t __user *sig, size_t sigsz)
> {
> + struct io_wait_queue iowq = {
> + .wq = {
> + .func = io_wake_function,
> + .entry = LIST_HEAD_INIT(iowq.wq.entry),
> + },
> + .task = current,
> + .ctx = ctx,
> + .to_wait = min_events,
> + };
> struct io_rings *rings = ctx->rings;
> - unsigned nr_timeouts;
> int ret;
>
> if (io_cqring_events(rings) >= min_events)
> @@ -2795,15 +2834,16 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> return ret;
> }
>
> - nr_timeouts = atomic_read(&ctx->cq_timeouts);
> - /*
> - * Return if we have enough events, or if a timeout occured since
> - * we started waiting. For timeouts, we always want to return to
> - * userspace.
> - */
> - ret = wait_event_interruptible(ctx->wait,
> - io_cqring_events(rings) >= min_events ||
> - atomic_read(&ctx->cq_timeouts) != nr_timeouts);
> + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
> + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
> + do {
> + if (io_should_wake(&iowq))
> + break;
> + schedule();
> + set_current_state(TASK_INTERRUPTIBLE);
> + } while (1);
> + finish_wait(&ctx->wait, &iowq.wq);
> +
> restore_saved_sigmask_unless(ret == -ERESTARTSYS);
> if (ret == -ERESTARTSYS)
> ret = -EINTR;
>

--
Yours sincerely,
Pavel Begunkov



2019-09-26 00:48:47

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 1:06 AM, Pavel Begunkov wrote:
> On 24/09/2019 02:00, Jens Axboe wrote:
>>> I think we can do the same thing, just wrapping the waitqueue in a
>>> structure with a count in it, on the stack. Got some flight time
>>> coming up later today, let me try and cook up a patch.
>>
>> Totally untested, and sent out 5 min before departure... But something
>> like this.
> Hmm, reminds me of my first version. Basically that's the same thing but
> with macros inlined. I wanted to make it reusable and self-contained,
> though.
>
> If you don't think it could be useful in other places, sure, we could do
> something like that. Is that so?

I totally agree it could be useful in other places. Maybe formalized and
used with wake_up_nr() instead of adding a new primitive? Haven't looked
into that, I may be talking nonsense.

In any case, I did get a chance to test it and it works for me. Here's
the "finished" version, slightly cleaned up and with a comment added
for good measure.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index ca7570aca430..14fae454cf75 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2768,6 +2768,42 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
return submit;
}

+struct io_wait_queue {
+ struct wait_queue_entry wq;
+ struct io_ring_ctx *ctx;
+ struct task_struct *task;
+ unsigned to_wait;
+ unsigned nr_timeouts;
+};
+
+static inline bool io_should_wake(struct io_wait_queue *iowq)
+{
+ struct io_ring_ctx *ctx = iowq->ctx;
+
+ /*
+ * Wake up if we have enough events, or if a timeout occurred since we
+ * started waiting. For timeouts, we always want to return to userspace,
+ * regardless of event count.
+ */
+ return io_cqring_events(ctx->rings) >= iowq->to_wait ||
+ atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
+}
+
+static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
+ int wake_flags, void *key)
+{
+ struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
+ wq);
+
+ if (io_should_wake(iowq)) {
+ list_del_init(&curr->entry);
+ wake_up_process(iowq->task);
+ return 1;
+ }
+
+ return -1;
+}
+
/*
* Wait until events become available, if we don't already have some. The
* application must reap them itself, as they reside on the shared cq ring.
@@ -2775,8 +2811,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
const sigset_t __user *sig, size_t sigsz)
{
+ struct io_wait_queue iowq = {
+ .wq = {
+ .func = io_wake_function,
+ .entry = LIST_HEAD_INIT(iowq.wq.entry),
+ },
+ .task = current,
+ .ctx = ctx,
+ .to_wait = min_events,
+ };
struct io_rings *rings = ctx->rings;
- unsigned nr_timeouts;
int ret;

if (io_cqring_events(rings) >= min_events)
@@ -2795,15 +2839,16 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
return ret;
}

- nr_timeouts = atomic_read(&ctx->cq_timeouts);
- /*
- * Return if we have enough events, or if a timeout occured since
- * we started waiting. For timeouts, we always want to return to
- * userspace.
- */
- ret = wait_event_interruptible(ctx->wait,
- io_cqring_events(rings) >= min_events ||
- atomic_read(&ctx->cq_timeouts) != nr_timeouts);
+ iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
+ prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
+ do {
+ if (io_should_wake(&iowq))
+ break;
+ schedule();
+ set_current_state(TASK_INTERRUPTIBLE);
+ } while (1);
+ finish_wait(&ctx->wait, &iowq.wq);
+
restore_saved_sigmask_unless(ret == -ERESTARTSYS);
if (ret == -ERESTARTSYS)
ret = -EINTR;

--
Jens Axboe

2019-09-26 00:51:46

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 2:27 AM, Jens Axboe wrote:
> On 9/24/19 2:02 AM, Jens Axboe wrote:
>> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>>> structure with a count in it, on the stack. Got some flight time
>>>>> coming up later today, let me try and cook up a patch.
>>>>
>>>> Totally untested, and sent out 5 min before departure... But something
>>>> like this.
>>> Hmm, reminds me of my first version. Basically that's the same thing but
>>> with macros inlined. I wanted to make it reusable and self-contained,
>>> though.
>>>
>>> If you don't think it could be useful in other places, sure, we could do
>>> something like that. Is that so?
>>
>> I totally agree it could be useful in other places. Maybe formalized and
>> used with wake_up_nr() instead of adding a new primitive? Haven't looked
>> into that, I may be talking nonsense.
>>
>> In any case, I did get a chance to test it and it works for me. Here's
>> the "finished" version, slightly cleaned up and with a comment added
>> for good measure.
>
> Notes:
>
> This version gets the ordering right, you need exclusive waits to get
> fifo ordering on the waitqueue.
>
> Both versions (yours and mine) suffer from the problem of potentially
> waking too many. I don't think this is a real issue, as generally we
> don't do threaded access to the io_urings. But if you had the following
> tasks wait on the cqring:
>
> [min_events = 32], [min_events = 8], [min_events = 8]
>
> and we reach the io_cqring_events() == threshold, we'll wake all three.
> I don't see a good solution to this, so I suspect we just live with it
> until proven an issue. Both versions are much better than what we have
> now.

Forgot an issue around signal handling, version below adds the
right check for that too.

Curious what your test case was for this?


diff --git a/fs/io_uring.c b/fs/io_uring.c
index ca7570aca430..3fbab5692f14 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2768,6 +2768,42 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
return submit;
}

+struct io_wait_queue {
+ struct wait_queue_entry wq;
+ struct io_ring_ctx *ctx;
+ struct task_struct *task;
+ unsigned to_wait;
+ unsigned nr_timeouts;
+};
+
+static inline bool io_should_wake(struct io_wait_queue *iowq)
+{
+ struct io_ring_ctx *ctx = iowq->ctx;
+
+ /*
+ * Wake up if we have enough events, or if a timeout occurred since we
+ * started waiting. For timeouts, we always want to return to userspace,
+ * regardless of event count.
+ */
+ return io_cqring_events(ctx->rings) >= iowq->to_wait ||
+ atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
+}
+
+static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
+ int wake_flags, void *key)
+{
+ struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
+ wq);
+
+ if (io_should_wake(iowq)) {
+ list_del_init(&curr->entry);
+ wake_up_process(iowq->task);
+ return 1;
+ }
+
+ return -1;
+}
+
/*
* Wait until events become available, if we don't already have some. The
* application must reap them itself, as they reside on the shared cq ring.
@@ -2775,8 +2811,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
const sigset_t __user *sig, size_t sigsz)
{
+ struct io_wait_queue iowq = {
+ .wq = {
+ .func = io_wake_function,
+ .entry = LIST_HEAD_INIT(iowq.wq.entry),
+ },
+ .task = current,
+ .ctx = ctx,
+ .to_wait = min_events,
+ };
struct io_rings *rings = ctx->rings;
- unsigned nr_timeouts;
int ret;

if (io_cqring_events(rings) >= min_events)
@@ -2795,15 +2839,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
return ret;
}

- nr_timeouts = atomic_read(&ctx->cq_timeouts);
- /*
- * Return if we have enough events, or if a timeout occured since
- * we started waiting. For timeouts, we always want to return to
- * userspace.
- */
- ret = wait_event_interruptible(ctx->wait,
- io_cqring_events(rings) >= min_events ||
- atomic_read(&ctx->cq_timeouts) != nr_timeouts);
+ iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
+ prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
+ do {
+ if (io_should_wake(&iowq))
+ break;
+ schedule();
+ if (signal_pending(current))
+ break;
+ set_current_state(TASK_INTERRUPTIBLE);
+ } while (1);
+ finish_wait(&ctx->wait, &iowq.wq);
+
restore_saved_sigmask_unless(ret == -ERESTARTSYS);
if (ret == -ERESTARTSYS)
ret = -EINTR;

--
Jens Axboe

2019-09-26 00:52:00

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 2:02 AM, Jens Axboe wrote:
> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>> structure with a count in it, on the stack. Got some flight time
>>>> coming up later today, let me try and cook up a patch.
>>>
>>> Totally untested, and sent out 5 min before departure... But something
>>> like this.
>> Hmm, reminds me of my first version. Basically that's the same thing but
>> with macros inlined. I wanted to make it reusable and self-contained,
>> though.
>>
>> If you don't think it could be useful in other places, sure, we could do
>> something like that. Is that so?
>
> I totally agree it could be useful in other places. Maybe formalized and
> used with wake_up_nr() instead of adding a new primitive? Haven't looked
> into that, I may be talking nonsense.
>
> In any case, I did get a chance to test it and it works for me. Here's
> the "finished" version, slightly cleaned up and with a comment added
> for good measure.

Notes:

This version gets the ordering right, you need exclusive waits to get
fifo ordering on the waitqueue.

Both versions (yours and mine) suffer from the problem of potentially
waking too many. I don't think this is a real issue, as generally we
don't do threaded access to the io_urings. But if you had the following
tasks wait on the cqring:

[min_events = 32], [min_events = 8], [min_events = 8]

and we reach the io_cqring_events() == threshold, we'll wake all three.
I don't see a good solution to this, so I suspect we just live with it
until proven an issue. Both versions are much better than what we have
now.

--
Jens Axboe

2019-09-26 00:56:13

by Pavel Begunkov

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting



On 24/09/2019 11:27, Jens Axboe wrote:
> On 9/24/19 2:02 AM, Jens Axboe wrote:
>> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>>> structure with a count in it, on the stack. Got some flight time
>>>>> coming up later today, let me try and cook up a patch.
>>>>
>>>> Totally untested, and sent out 5 min before departure... But something
>>>> like this.
>>> Hmm, reminds me of my first version. Basically that's the same thing but
>>> with macros inlined. I wanted to make it reusable and self-contained,
>>> though.
>>>
>>> If you don't think it could be useful in other places, sure, we could do
>>> something like that. Is that so?
>>
>> I totally agree it could be useful in other places. Maybe formalized and
>> used with wake_up_nr() instead of adding a new primitive? Haven't looked
>> into that, I may be talking nonsense.
>>
>> In any case, I did get a chance to test it and it works for me. Here's
>> the "finished" version, slightly cleaned up and with a comment added
>> for good measure.
>
> Notes:
>
> This version gets the ordering right, you need exclusive waits to get
> fifo ordering on the waitqueue.
>
> Both versions (yours and mine) suffer from the problem of potentially
> waking too many. I don't think this is a real issue, as generally we
> don't do threaded access to the io_urings. But if you had the following
> tasks wait on the cqring:
>
> [min_events = 32], [min_events = 8], [min_events = 8]
>
> and we reach the io_cqring_events() == threshold, we'll wake all three.
> I don't see a good solution to this, so I suspect we just live with it
> until proven an issue. Both versions are much better than what we have
> now.
>
If io_cqring_events() == 8, only the last two would be woken up in both
implementations, as to_wait/threshold is specified per waiter. Isn't that
right?

Agreed on the waiting; I don't see a good real-life case for that which
couldn't be done efficiently in userspace.


--
Yours sincerely,
Pavel Begunkov



2019-09-26 01:05:23

by Peter Zijlstra

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote:

> +struct io_wait_queue {
> + struct wait_queue_entry wq;
> + struct io_ring_ctx *ctx;
> + struct task_struct *task;

wq.private is where the normal waitqueue stores the task pointer.

(I'm going to rename that)
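
For reference, the entry layout in include/linux/wait.h:

    struct wait_queue_entry {
        unsigned int        flags;
        void                *private;    /* normally the waiting task */
        wait_queue_func_t   func;
        struct list_head    entry;
    };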

> + unsigned to_wait;
> + unsigned nr_timeouts;
> +};
> +
> +static inline bool io_should_wake(struct io_wait_queue *iowq)
> +{
> + struct io_ring_ctx *ctx = iowq->ctx;
> +
> + /*
> + * Wake up if we have enough events, or if a timeout occurred since we
> + * started waiting. For timeouts, we always want to return to userspace,
> + * regardless of event count.
> + */
> + return io_cqring_events(ctx->rings) >= iowq->to_wait ||
> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
> +}
> +
> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
> + int wake_flags, void *key)
> +{
> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
> + wq);
> +
> + if (io_should_wake(iowq)) {
> + list_del_init(&curr->entry);
> + wake_up_process(iowq->task);

Then you can use autoremove_wake_function() here.
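
For reference, autoremove_wake_function() in kernel/sched/wait.c is the
same remove+wake pair; it wakes the task through wq_entry->private, which
is why the task pointer would need to live there:

    int autoremove_wake_function(struct wait_queue_entry *wq_entry,
                                 unsigned mode, int sync, void *key)
    {
        int ret = default_wake_function(wq_entry, mode, sync, key);

        if (ret)
            list_del_init(&wq_entry->entry);

        return ret;
    }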

> + return 1;
> + }
> +
> + return -1;
> +}

Ideally we'd get wait_event()'s @cond in a custom wake function. Then we
can _always_ do this.

This is one I'd love to have lambda functions for. It would actually
work with GCC nested functions, because the wake function will always be
in scope, but we can't use those in the kernel for other reasons :/

2019-09-26 01:12:03

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 3:20 AM, Pavel Begunkov wrote:
>
>
> On 24/09/2019 11:27, Jens Axboe wrote:
>> On 9/24/19 2:02 AM, Jens Axboe wrote:
>>> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>>>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>>>> structure with a count in it, on the stack. Got some flight time
>>>>>> coming up later today, let me try and cook up a patch.
>>>>>
>>>>> Totally untested, and sent out 5 min before departure... But something
>>>>> like this.
>>>> Hmm, reminds me of my first version. Basically that's the same thing but
>>>> with macros inlined. I wanted to make it reusable and self-contained,
>>>> though.
>>>>
>>>> If you don't think it could be useful in other places, sure, we could do
>>>> something like that. Is that so?
>>>
>>> I totally agree it could be useful in other places. Maybe formalized and
>>> used with wake_up_nr() instead of adding a new primitive? Haven't looked
>>> into that, I may be talking nonsense.
>>>
>>> In any case, I did get a chance to test it and it works for me. Here's
>>> the "finished" version, slightly cleaned up and with a comment added
>>> for good measure.
>>
>> Notes:
>>
>> This version gets the ordering right, you need exclusive waits to get
>> fifo ordering on the waitqueue.
>>
>> Both versions (yours and mine) suffer from the problem of potentially
>> waking too many. I don't think this is a real issue, as generally we
>> don't do threaded access to the io_urings. But if you had the following
>> tasks wait on the cqring:
>>
>> [min_events = 32], [min_events = 8], [min_events = 8]
>>
>> and we reach the io_cqring_events() == threshold, we'll wake all three.
>> I don't see a good solution to this, so I suspect we just live with it
>> until proven an issue. Both versions are much better than what we have
>> now.
>>
> If io_cqring_events() == 8, only the last two would be woken up in both
> implementations, as to_wait/threshold is specified per waiter. Isn't that
> right?

If io_cqring_events() == 8, then none would be woken in my
implementation since the first one will break the wakeup loop.
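
For reference, the scan in __wake_up_common() (kernel/sched/wait.c,
paraphrased) aborts as soon as a wake function returns a negative value:

    list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
        unsigned flags = curr->flags;
        int ret = curr->func(curr, mode, wake_flags, key);

        if (ret < 0)
            break;    /* a waiter with an unmet threshold stops the scan */
        if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
            break;
    }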

> Agreed on the waiting; I don't see a good real-life case for that which
> couldn't be done efficiently in userspace.

Exactly

--
Jens Axboe

2019-09-26 01:37:21

by Pavel Begunkov

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 24/09/2019 13:34, Jens Axboe wrote:
> On 9/24/19 4:13 AM, Jens Axboe wrote:
>> On 9/24/19 3:49 AM, Peter Zijlstra wrote:
>>> On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote:
>>>
>>>> +struct io_wait_queue {
>>>> + struct wait_queue_entry wq;
>>>> + struct io_ring_ctx *ctx;
>>>> + struct task_struct *task;
>>>
>>> wq.private is where the normal waitqueue stores the task pointer.
>>>
>>> (I'm going to rename that)
>>
>> If you do that, then we can just base the io_uring parts on that.
>
> Just took a quick look at it, and ran into block/kyber-iosched.c that
> actually uses the private pointer for something that isn't a task
> struct...
>

Let's reuse autoremove_wake_function() anyway. I changed your patch a bit:


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5c3f2bb81637..a77971290fdd 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2690,6 +2690,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
return submit;
}

+struct io_wait_queue {
+ struct wait_queue_entry wq;
+ struct io_ring_ctx *ctx;
+ unsigned to_wait;
+ unsigned nr_timeouts;
+};
+
+static inline bool io_should_wake(struct io_wait_queue *iowq)
+{
+ struct io_ring_ctx *ctx = iowq->ctx;
+
+ /*
+ * Wake up if we have enough events, or if a timeout occurred since we
+ * started waiting. For timeouts, we always want to return to userspace,
+ * regardless of event count.
+ */
+ return io_cqring_events(ctx->rings) >= iowq->to_wait ||
+ atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
+}
+
+static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
+ int wake_flags, void *key)
+{
+ struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
+ wq);
+
+ if (!io_should_wake(iowq))
+ return -1;
+
+ return autoremove_wake_function(curr, mode, wake_flags, key);
+}
+
/*
* Wait until events become available, if we don't already have some. The
* application must reap them itself, as they reside on the shared cq ring.
@@ -2697,8 +2729,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
const sigset_t __user *sig, size_t sigsz)
{
+ struct io_wait_queue iowq = {
+ .wq = {
+ .private = current,
+ .func = io_wake_function,
+ .entry = LIST_HEAD_INIT(iowq.wq.entry),
+ },
+ .ctx = ctx,
+ .to_wait = min_events,
+ };
struct io_rings *rings = ctx->rings;
- unsigned nr_timeouts;
int ret;

if (io_cqring_events(rings) >= min_events)
@@ -2717,15 +2757,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
return ret;
}

- nr_timeouts = atomic_read(&ctx->cq_timeouts);
- /*
- * Return if we have enough events, or if a timeout occured since
- * we started waiting. For timeouts, we always want to return to
- * userspace.
- */
- ret = wait_event_interruptible(ctx->wait,
- io_cqring_events(rings) >= min_events ||
- atomic_read(&ctx->cq_timeouts) != nr_timeouts);
+ iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
+ prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
+ do {
+ if (io_should_wake(&iowq))
+ break;
+ schedule();
+ if (signal_pending(current))
+ break;
+ set_current_state(TASK_INTERRUPTIBLE);
+ } while (1);
+ finish_wait(&ctx->wait, &iowq.wq);
+
restore_saved_sigmask_unless(ret == -ERESTARTSYS);
if (ret == -ERESTARTSYS)
ret = -EINTR;


--
Yours sincerely,
Pavel Begunkov

2019-09-26 01:43:32

by Peter Zijlstra

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On Tue, Sep 24, 2019 at 12:34:17PM +0200, Jens Axboe wrote:

> Just took a quick look at it, and ran into block/kyber-iosched.c that
> actually uses the private pointer for something that isn't a task
> struct...

Argh... that's some 'creative' abuse of the waitqueue API :/

2019-09-26 01:46:07

by Peter Zijlstra

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On Tue, Sep 24, 2019 at 02:11:29PM +0300, Pavel Begunkov wrote:

> @@ -2717,15 +2757,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> return ret;
> }
>
> + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
> + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
> + do {
> + if (io_should_wake(&iowq))
> + break;
> + schedule();
> + if (signal_pending(current))
> + break;
> + set_current_state(TASK_INTERRUPTIBLE);
> + } while (1);
> + finish_wait(&ctx->wait, &iowq.wq);

It is likely OK, but for paranoia, I'd prefer this form:

for (;;) {
    prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
    if (io_should_wake(&iowq))
        break;
    schedule();
    if (signal_pending(current))
        break;
}
finish_wait(&ctx->wait, &iowq.wq);

The thing is, if we ever succeed with io_wake_function() (that CPU
observes io_should_wake()), but when waking up here we do not observe
io_should_wake() and go back to sleep, we might never wake up if we
don't put ourselves back on the wait-list again.
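
Spelled out, the interleaving of concern is roughly this (a hypothetical
trace, with the do/while form above):

    waiter                                  waker
    ------                                  -----
    prepare_to_wait_exclusive()
    io_should_wake() -> false
    schedule()
                                            io_wake_function():
                                              io_should_wake() -> true
                                              list_del_init() +
                                              wake_up_process()
    /* woken, but no longer on the wait-list */
    io_should_wake() -> false (stale view)
    schedule()  /* nobody left to wake us */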

2019-09-26 04:57:34

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/23/19 10:32 AM, Pavel Begunkov wrote:
> Sorry, I mixed up the threads.
>
>>>
>>> I'm not sure an extension is needed for such a special interface, why not
>>> just put a ->threshold value next to the ctx->wait field and use either
>>> the regular wait_event() APIs with the proper condition, or
>>> wait_event_cmd() style APIs if you absolutely need something more complex
>>> to happen inside?
> Ingo,
> io_uring works well without this patch, just using wait_event_*() with a
> proper condition, but there are performance issues with spurious wakeups.
> There is a detailed description in the previous mail.
> Am I missing something?

I think we can do the same thing, just wrapping the waitqueue in a
structure with a count in it, on the stack. Got some flight time
coming up later today, let me try and cook up a patch.

--
Jens Axboe

2019-09-26 07:46:25

by Pavel Begunkov

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting



On 24/09/2019 11:36, Jens Axboe wrote:
> On 9/24/19 2:27 AM, Jens Axboe wrote:
>> On 9/24/19 2:02 AM, Jens Axboe wrote:
>>> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>>>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>>>> structure with a count in it, on the stack. Got some flight time
>>>>>> coming up later today, let me try and cook up a patch.
>>>>>
>>>>> Totally untested, and sent out 5 min before departure... But something
>>>>> like this.
>>>> Hmm, reminds me of my first version. Basically that's the same thing but
>>>> with macros inlined. I wanted to make it reusable and self-contained,
>>>> though.
>>>>
>>>> If you don't think it could be useful in other places, sure, we could do
>>>> something like that. Is that so?
>>>
>>> I totally agree it could be useful in other places. Maybe formalized and
>>> used with wake_up_nr() instead of adding a new primitive? Haven't looked
>>> into that, I may be talking nonsense.
>>>
>>> In any case, I did get a chance to test it and it works for me. Here's
>>> the "finished" version, slightly cleaned up and with a comment added
>>> for good measure.
>>
>> Notes:
>>
>> This version gets the ordering right, you need exclusive waits to get
>> fifo ordering on the waitqueue.
>>
>> Both versions (yours and mine) suffer from the problem of potentially
>> waking too many. I don't think this is a real issue, as generally we
>> don't do threaded access to the io_urings. But if you had the following
>> tasks wait on the cqring:
>>
>> [min_events = 32], [min_events = 8], [min_events = 8]
>>
>> and we reach the io_cqring_events() == threshold, we'll wake all three.
>> I don't see a good solution to this, so I suspect we just live with it
>> until proven an issue. Both versions are much better than what we have
>> now.
>
> Forgot an issue around signal handling, version below adds the
> right check for that too.

That seems like a good reason not to keep reimplementing the
"prepare_to_wait*() + wait loop" pattern every time, but to keep it in
sched :)

>
> Curious what your test case was for this?
You mean a performance test case? It's briefly described in a comment
for the second patch. It's just a rewritten io_uring-bench, with
1. a thread generating 1 request per call in a loop,
2. and a second thread waiting for ~128 events.
Both are pinned to the same core.
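
In syscall terms the setup is roughly this (a sketch, not the actual
benchmark code):

    /* thread 1: submit one request per call, never waits */
    for (;;)
        io_uring_enter(ring_fd, 1, 0, 0, NULL);

    /* thread 2: block until ~128 completions have arrived */
    for (;;)
        io_uring_enter(ring_fd, 0, 128, IORING_ENTER_GETEVENTS, NULL);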

>
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index ca7570aca430..3fbab5692f14 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2768,6 +2768,42 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
> return submit;
> }
>
> +struct io_wait_queue {
> + struct wait_queue_entry wq;
> + struct io_ring_ctx *ctx;
> + struct task_struct *task;
> + unsigned to_wait;
> + unsigned nr_timeouts;
> +};
> +
> +static inline bool io_should_wake(struct io_wait_queue *iowq)
> +{
> + struct io_ring_ctx *ctx = iowq->ctx;
> +
> + /*
> + * Wake up if we have enough events, or if a timeout occurred since we
> + * started waiting. For timeouts, we always want to return to userspace,
> + * regardless of event count.
> + */
> + return io_cqring_events(ctx->rings) >= iowq->to_wait ||
> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
> +}
> +
> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
> + int wake_flags, void *key)
> +{
> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
> + wq);
> +
> + if (io_should_wake(iowq)) {
> + list_del_init(&curr->entry);
> + wake_up_process(iowq->task);
> + return 1;
> + }
> +
> + return -1;
> +}
> +
> /*
> * Wait until events become available, if we don't already have some. The
> * application must reap them itself, as they reside on the shared cq ring.
> @@ -2775,8 +2811,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
> static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> const sigset_t __user *sig, size_t sigsz)
> {
> + struct io_wait_queue iowq = {
> + .wq = {
> + .func = io_wake_function,
> + .entry = LIST_HEAD_INIT(iowq.wq.entry),
> + },
> + .task = current,
> + .ctx = ctx,
> + .to_wait = min_events,
> + };
> struct io_rings *rings = ctx->rings;
> - unsigned nr_timeouts;
> int ret;
>
> if (io_cqring_events(rings) >= min_events)
> @@ -2795,15 +2839,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> return ret;
> }
>
> - nr_timeouts = atomic_read(&ctx->cq_timeouts);
> - /*
> - * Return if we have enough events, or if a timeout occured since
> - * we started waiting. For timeouts, we always want to return to
> - * userspace.
> - */
> - ret = wait_event_interruptible(ctx->wait,
> - io_cqring_events(rings) >= min_events ||
> - atomic_read(&ctx->cq_timeouts) != nr_timeouts);
> + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
> + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
> + do {
> + if (io_should_wake(&iowq))
> + break;
> + schedule();
> + if (signal_pending(current))
> + break;
> + set_current_state(TASK_INTERRUPTIBLE);
> + } while (1);
> + finish_wait(&ctx->wait, &iowq.wq);
> +
> restore_saved_sigmask_unless(ret == -ERESTARTSYS);
> if (ret == -ERESTARTSYS)
> ret = -EINTR;
>

--
Yours sincerely,
Pavel Begunkov



2019-09-26 07:47:07

by Pavel Begunkov

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 24/09/2019 11:02, Jens Axboe wrote:
> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>> structure with a count in it, on the stack. Got some flight time
>>>> coming up later today, let me try and cook up a patch.
>>>
>>> Totally untested, and sent out 5 min before departure... But something
>>> like this.
>> Hmm, reminds me of my first version. Basically that's the same thing but
>> with macros inlined. I wanted to make it reusable and self-contained,
>> though.
>>
>> If you don't think it could be useful in other places, sure, we could do
>> something like that. Is that so?
>
> I totally agree it could be useful in other places. Maybe formalized and
> used with wake_up_nr() instead of adding a new primitive? Haven't looked
> into that, I may be talking nonsense.

@nr there is the number of tasks to wake up. AFAIK it doesn't solve the
problem.
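
For reference, from include/linux/wait.h:

    #define wake_up_nr(x, nr)    __wake_up(x, TASK_NORMAL, nr, NULL)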


>
> In any case, I did get a chance to test it and it works for me. Here's
> the "finished" version, slightly cleaned up and with a comment added
> for good measure.
>
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index ca7570aca430..14fae454cf75 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2768,6 +2768,42 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
> return submit;
> }
>
> +struct io_wait_queue {
> + struct wait_queue_entry wq;
> + struct io_ring_ctx *ctx;
> + struct task_struct *task;
> + unsigned to_wait;
> + unsigned nr_timeouts;
> +};
> +
> +static inline bool io_should_wake(struct io_wait_queue *iowq)
> +{
> + struct io_ring_ctx *ctx = iowq->ctx;
> +
> + /*
> + * Wake up if we have enough events, or if a timeout occurred since we
> + * started waiting. For timeouts, we always want to return to userspace,
> + * regardless of event count.
> + */
> + return io_cqring_events(ctx->rings) >= iowq->to_wait ||
> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
> +}
> +
> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
> + int wake_flags, void *key)
> +{
> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
> + wq);
> +
> + if (io_should_wake(iowq)) {
> + list_del_init(&curr->entry);
> + wake_up_process(iowq->task);
> + return 1;
> + }
> +
> + return -1;
> +}
> +
> /*
> * Wait until events become available, if we don't already have some. The
> * application must reap them itself, as they reside on the shared cq ring.
> @@ -2775,8 +2811,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
> static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> const sigset_t __user *sig, size_t sigsz)
> {
> + struct io_wait_queue iowq = {
> + .wq = {
> + .func = io_wake_function,
> + .entry = LIST_HEAD_INIT(iowq.wq.entry),
> + },
> + .task = current,
> + .ctx = ctx,
> + .to_wait = min_events,
> + };
> struct io_rings *rings = ctx->rings;
> - unsigned nr_timeouts;
> int ret;
>
> if (io_cqring_events(rings) >= min_events)
> @@ -2795,15 +2839,16 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> return ret;
> }
>
> - nr_timeouts = atomic_read(&ctx->cq_timeouts);
> - /*
> - * Return if we have enough events, or if a timeout occured since
> - * we started waiting. For timeouts, we always want to return to
> - * userspace.
> - */
> - ret = wait_event_interruptible(ctx->wait,
> - io_cqring_events(rings) >= min_events ||
> - atomic_read(&ctx->cq_timeouts) != nr_timeouts);
> + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
> + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
> + do {
> + if (io_should_wake(&iowq))
> + break;
> + schedule();
> + set_current_state(TASK_INTERRUPTIBLE);
> + } while (1);
> + finish_wait(&ctx->wait, &iowq.wq);
> +
> restore_saved_sigmask_unless(ret == -ERESTARTSYS);
> if (ret == -ERESTARTSYS)
> ret = -EINTR;
>

--
Yours sincerely,
Pavel Begunkov



2019-09-26 07:47:56

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 3:33 AM, Pavel Begunkov wrote:
>
>
> On 24/09/2019 11:36, Jens Axboe wrote:
>> On 9/24/19 2:27 AM, Jens Axboe wrote:
>>> On 9/24/19 2:02 AM, Jens Axboe wrote:
>>>> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>>>>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>>>>> structure with a count in it, on the stack. Got some flight time
>>>>>>> coming up later today, let me try and cook up a patch.
>>>>>>
>>>>>> Totally untested, and sent out 5 min before departure... But something
>>>>>> like this.
>>>>> Hmm, reminds me of my first version. Basically that's the same thing but
>>>>> with macros inlined. I wanted to make it reusable and self-contained,
>>>>> though.
>>>>>
>>>>> If you don't think it could be useful in other places, sure, we could do
>>>>> something like that. Is that so?
>>>>
>>>> I totally agree it could be useful in other places. Maybe formalized and
>>>> used with wake_up_nr() instead of adding a new primitive? Haven't looked
>>>> into that, I may be talking nonsense.
>>>>
>>>> In any case, I did get a chance to test it and it works for me. Here's
>>>> the "finished" version, slightly cleaned up and with a comment added
>>>> for good measure.
>>>
>>> Notes:
>>>
>>> This version gets the ordering right, you need exclusive waits to get
>>> fifo ordering on the waitqueue.
>>>
>>> Both versions (yours and mine) suffer from the problem of potentially
>>> waking too many. I don't think this is a real issue, as generally we
>>> don't do threaded access to the io_urings. But if you had the following
>>> tasks wait on the cqring:
>>>
>>> [min_events = 32], [min_events = 8], [min_events = 8]
>>>
>>> and we reach the io_cqring_events() == threshold, we'll wake all three.
>>> I don't see a good solution to this, so I suspect we just live with it
>>> until proven an issue. Both versions are much better than what we have
>>> now.
>>
>> Forgot an issue around signal handling, version below adds the
>> right check for that too.
>
> That seems like a good reason not to keep reimplementing the
> "prepare_to_wait*() + wait loop" pattern every time, but to keep it in
> sched :)

I think if we do the ->private cleanup that Peter mentioned, then
there's not much left in terms of consolidation. Not convinced the case
is interesting enough to warrant a special helper. If others show up,
it's easy enough to consolidate the use cases and unify them.

If you look at wake_up_nr(), I would have thought that would be more
widespread. But it really isn't.

>> Curious what your test case was for this?
> You mean a performance test case? It's briefly described in a comment
> for the second patch. It's just a rewritten io_uring-bench, with
> 1. a thread generating 1 request per call in a loop,
> 2. and a second thread waiting for ~128 events.
> Both are pinned to the same core.

Gotcha, thanks.

--
Jens Axboe

2019-09-26 07:48:21

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 4:13 AM, Jens Axboe wrote:
> On 9/24/19 3:49 AM, Peter Zijlstra wrote:
>> On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote:
>>
>>> +struct io_wait_queue {
>>> + struct wait_queue_entry wq;
>>> + struct io_ring_ctx *ctx;
>>> + struct task_struct *task;
>>
>> wq.private is where the normal waitqueue stores the task pointer.
>>
>> (I'm going to rename that)
>
> If you do that, then we can just base the io_uring parts on that.

Just took a quick look at it, and ran into block/kyber-iosched.c that
actually uses the private pointer for something that isn't a task
struct...

--
Jens Axboe

2019-09-26 07:48:31

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 3:21 AM, Pavel Begunkov wrote:
> On 24/09/2019 11:02, Jens Axboe wrote:
>> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>>> structure with a count in it, on the stack. Got some flight time
>>>>> coming up later today, let me try and cook up a patch.
>>>>
>>>> Totally untested, and sent out 5 min before departure... But something
>>>> like this.
>>> Hmm, reminds me of my first version. Basically that's the same thing but
>>> with macros inlined. I wanted to make it reusable and self-contained,
>>> though.
>>>
>>> If you don't think it could be useful in other places, sure, we could do
>>> something like that. Is that so?
>>
>> I totally agree it could be useful in other places. Maybe formalized and
>> used with wake_up_nr() instead of adding a new primitive? Haven't looked
>> into that, I may be talking nonsense.
>
> @nr there is the number of tasks to wake up. AFAIK it doesn't solve the
> problem.

Ah right, embarrassingly I'm actually the one who added that
functionality ages ago...

--
Jens Axboe

2019-09-26 07:48:37

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 3:49 AM, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote:
>
>> +struct io_wait_queue {
>> + struct wait_queue_entry wq;
>> + struct io_ring_ctx *ctx;
>> + struct task_struct *task;
>
> wq.private is where the normal waitqueue stores the task pointer.
>
> (I'm going to rename that)

If you do that, then we can just base the io_uring parts on that.

>> + unsigned to_wait;
>> + unsigned nr_timeouts;
>> +};
>> +
>> +static inline bool io_should_wake(struct io_wait_queue *iowq)
>> +{
>> + struct io_ring_ctx *ctx = iowq->ctx;
>> +
>> + /*
>> + * Wake up if we have enough events, or if a timeout occurred since we
>> + * started waiting. For timeouts, we always want to return to userspace,
>> + * regardless of event count.
>> + */
>> + return io_cqring_events(ctx->rings) >= iowq->to_wait ||
>> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
>> +}
>> +
>> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
>> + int wake_flags, void *key)
>> +{
>> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
>> + wq);
>> +
>> + if (io_should_wake(iowq)) {
>> + list_del_init(&curr->entry);
>> + wake_up_process(iowq->task);
>
> Then you can use autoremove_wake_function() here.
>
>> + return 1;
>> + }
>> +
>> + return -1;
>> +}
>
> Ideally we'd get wait_event()'s @cond in a custom wake function. Then we
> can _always_ do this.
>
> This is one I'd love to have lambda functions for. It would actually
> work with GCC nested functions, because the wake function will always be
> in scope, but we can't use those in the kernel for other reasons :/

I'll be happy enough if I can just call autoremove_wake_function(), I
think that will simplify the case enough for io_uring to not really make
me care too much about going further. I'll leave that to you, if you
have the desire :-)

--
Jens Axboe

2019-09-26 07:50:27

by Pavel Begunkov

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 24/09/2019 14:15, Jens Axboe wrote:
> On 9/24/19 5:11 AM, Pavel Begunkov wrote:
>> On 24/09/2019 13:34, Jens Axboe wrote:
>>> On 9/24/19 4:13 AM, Jens Axboe wrote:
>>>> On 9/24/19 3:49 AM, Peter Zijlstra wrote:
>>>>> On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote:
>>>>>
>>>>>> +struct io_wait_queue {
>>>>>> + struct wait_queue_entry wq;
>>>>>> + struct io_ring_ctx *ctx;
>>>>>> + struct task_struct *task;
>>>>>
>>>>> wq.private is where the normal waitqueue stores the task pointer.
>>>>>
>>>>> (I'm going to rename that)
>>>>
>>>> If you do that, then we can just base the io_uring parts on that.
>>>
>>> Just took a quick look at it, and ran into block/kyber-iosched.c that
>>> actually uses the private pointer for something that isn't a task
>>> struct...
>>>
>>
>> Let's reuse autoremove_wake_function() anyway. I changed your patch a bit:
>
> Yep that should do it, and saves 8 bytes of stack as well.
>
> BTW, did you test my patch, this one or the previous? Just curious if it
> worked for you.
>
Not yet, going to do that tonight

--
Yours sincerely,
Pavel Begunkov



2019-09-26 07:58:03

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 5:11 AM, Pavel Begunkov wrote:
> On 24/09/2019 13:34, Jens Axboe wrote:
>> On 9/24/19 4:13 AM, Jens Axboe wrote:
>>> On 9/24/19 3:49 AM, Peter Zijlstra wrote:
>>>> On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote:
>>>>
>>>>> +struct io_wait_queue {
>>>>> + struct wait_queue_entry wq;
>>>>> + struct io_ring_ctx *ctx;
>>>>> + struct task_struct *task;
>>>>
>>>> wq.private is where the normal waitqueue stores the task pointer.
>>>>
>>>> (I'm going to rename that)
>>>
>>> If you do that, then we can just base the io_uring parts on that.
>>
>> Just took a quick look at it, and ran into block/kyber-iosched.c that
>> actually uses the private pointer for something that isn't a task
>> struct...
>>
>
> Let's reuse autoremove_wake_function() anyway. I changed your patch a bit:

Yep that should do it, and saves 8 bytes of stack as well.

BTW, did you test my patch, this one or the previous? Just curious if it
worked for you.

--
Jens Axboe

2019-09-26 08:06:04

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 5:43 AM, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 02:11:29PM +0300, Pavel Begunkov wrote:
>
>> @@ -2717,15 +2757,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>> return ret;
>> }
>>
>> + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
>> + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
>> + do {
>> + if (io_should_wake(&iowq))
>> + break;
>> + schedule();
>> + if (signal_pending(current))
>> + break;
>> + set_current_state(TASK_INTERRUPTIBLE);
>> + } while (1);
>> + finish_wait(&ctx->wait, &iowq.wq);
>
> It it likely OK, but for paranoia, I'd prefer this form:
>
> for (;;) {
>     prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
>     if (io_should_wake(&iowq))
>         break;
>     schedule();
>     if (signal_pending(current))
>         break;
> }
> finish_wait(&ctx->wait, &iowq.wq);
>
> The thing is, if we ever succeed with io_wake_function() (that CPU
> observes io_should_wake()), but when waking here, we do not observe
> is_wake_function() and go sleep again, we might never wake up if we
> don't put ourselves back on the wait-list again.

Might be paranoia indeed, but especially after this change, we don't
expect to make frequent roundtrips there. So better safe than sorry,
I'll make the change.

--
Jens Axboe

2019-09-26 08:09:49

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 5:23 AM, Pavel Begunkov wrote:
>> Yep that should do it, and saves 8 bytes of stack as well.
>>
>> BTW, did you test my patch, this one or the previous? Just curious if it
>> worked for you.
>>
> Not yet, going to do that tonight

Thanks! For reference, the final version is below. There was still a
signal mishap in there; now it should all be correct afaict.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9b84232e5cc4..d2a86164d520 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
return submit;
}

+struct io_wait_queue {
+ struct wait_queue_entry wq;
+ struct io_ring_ctx *ctx;
+ unsigned to_wait;
+ unsigned nr_timeouts;
+};
+
+static inline bool io_should_wake(struct io_wait_queue *iowq)
+{
+ struct io_ring_ctx *ctx = iowq->ctx;
+
+ /*
+ * Wake up if we have enough events, or if a timeout occurred since we
+ * started waiting. For timeouts, we always want to return to userspace,
+ * regardless of event count.
+ */
+ return io_cqring_events(ctx->rings) >= iowq->to_wait ||
+ atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
+}
+
+static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
+ int wake_flags, void *key)
+{
+ struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
+ wq);
+
+ if (!io_should_wake(iowq))
+ return -1;
+
+ return autoremove_wake_function(curr, mode, wake_flags, key);
+}
+
/*
* Wait until events become available, if we don't already have some. The
* application must reap them itself, as they reside on the shared cq ring.
@@ -2775,8 +2807,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
const sigset_t __user *sig, size_t sigsz)
{
+ struct io_wait_queue iowq = {
+ .wq = {
+ .private = current,
+ .func = io_wake_function,
+ .entry = LIST_HEAD_INIT(iowq.wq.entry),
+ },
+ .ctx = ctx,
+ .to_wait = min_events,
+ };
struct io_rings *rings = ctx->rings;
- unsigned nr_timeouts;
int ret;

if (io_cqring_events(rings) >= min_events)
@@ -2795,15 +2835,20 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
return ret;
}

- nr_timeouts = atomic_read(&ctx->cq_timeouts);
- /*
- * Return if we have enough events, or if a timeout occured since
- * we started waiting. For timeouts, we always want to return to
- * userspace.
- */
- ret = wait_event_interruptible(ctx->wait,
- io_cqring_events(rings) >= min_events ||
- atomic_read(&ctx->cq_timeouts) != nr_timeouts);
+ iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
+ do {
+ prepare_to_wait_exclusive(&ctx->wait, &iowq.wq,
+ TASK_INTERRUPTIBLE);
+ if (io_should_wake(&iowq))
+ break;
+ schedule();
+ if (signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ break;
+ }
+ } while (1);
+ finish_wait(&ctx->wait, &iowq.wq);
+
restore_saved_sigmask_unless(ret == -ERESTARTSYS);
if (ret == -ERESTARTSYS)
ret = -EINTR;

--
Jens Axboe
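
For context, this wait path is driven from userspace through io_uring_enter(), where min_complete becomes min_events above. A minimal raw-syscall sketch of a caller that waits for at least 128 completions, as in the cover-letter benchmark (no liburing; __NR_io_uring_enter is 426 on x86-64, and error handling is elided):

	#define _GNU_SOURCE
	#include <sys/syscall.h>
	#include <unistd.h>

	#ifndef __NR_io_uring_enter
	#define __NR_io_uring_enter	426
	#endif

	#define IORING_ENTER_GETEVENTS	(1U << 0)

	/* Block until at least @min_complete CQEs are available to reap. */
	static long wait_cqes(int ring_fd, unsigned int min_complete)
	{
		return syscall(__NR_io_uring_enter, ring_fd, 0 /* to_submit */,
			       min_complete, IORING_ENTER_GETEVENTS,
			       NULL /* sigset */, 0 /* sigsz */);
	}

	/* e.g. wait_cqes(ring_fd, 128); then reap CQEs from the shared ring */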

2019-09-26 08:38:34

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 24/09/2019 20:46, Jens Axboe wrote:
> On 9/24/19 11:33 AM, Pavel Begunkov wrote:
>> On 24/09/2019 16:13, Jens Axboe wrote:
>>> On 9/24/19 5:23 AM, Pavel Begunkov wrote:
>>>>> Yep that should do it, and saves 8 bytes of stack as well.
>>>>>
>>>>> BTW, did you test my patch, this one or the previous? Just curious if it
>>>>> worked for you.
>>>>>
>>>> Not yet, going to do that tonight
>>>
>>> Thanks! For reference, the final version is below. There was still a
>>> signal mishap in there; now it should all be correct, afaict.
>>>
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 9b84232e5cc4..d2a86164d520 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
>>> return submit;
>>> }
>>>
>>> +struct io_wait_queue {
>>> + struct wait_queue_entry wq;
>>> + struct io_ring_ctx *ctx;
>>> + unsigned to_wait;
>>> + unsigned nr_timeouts;
>>> +};
>>> +
>>> +static inline bool io_should_wake(struct io_wait_queue *iowq)
>>> +{
>>> + struct io_ring_ctx *ctx = iowq->ctx;
>>> +
>>> + /*
>>> + * Wake up if we have enough events, or if a timeout occurred since we
>>> + * started waiting. For timeouts, we always want to return to userspace,
>>> + * regardless of event count.
>>> + */
>>> + return io_cqring_events(ctx->rings) >= iowq->to_wait ||
>>> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
>>> +}
>>> +
>>> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
>>> + int wake_flags, void *key)
>>> +{
>>> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
>>> + wq);
>>> +
>>> + if (!io_should_wake(iowq))
>>> + return -1;
>>
>> It would try to schedule only the first task in the wait list. Is that the
>> semantics you want?
>> E.g. for waiters=[32,8] and nr_events == 8, io_wake_function() returns
>> -1 at the 32-event waiter, and won't wake up the second one.
>
> Right, those are the semantics I want. We keep the list ordered by using
> the exclusive wait addition, which means that, for the case you list,
> waiters=32 came first and we should not wake others before that task
> gets the completions it wants. Otherwise we could potentially starve
> higher-count waiters if we always keep going and new waiters come in.
>
Yes. I think it would be better to document this in the userspace API. I
could imagine some crazy case deadlocking userspace, e.g.
thread 1: wait_events(8), reap_events
thread 2: wait_events(32), wait(thread 1), reap_events

Works well.
Reviewed-by: Pavel Begunkov <[email protected]>
Tested-by: Pavel Begunkov <[email protected]>

BTW, I searched for wait_event*(), and it seems there are plenty of
similar use cases. So a generic version would be useful, but that's for
later.



--
Yours sincerely,
Pavel Begunkov


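
A toy userspace model of those exclusive-wake semantics (not kernel code; all names invented for illustration): waiters sit in queue order [32, 8], events arrive one at a time, and the walk stops at the first waiter whose threshold isn't met, mirroring io_wake_function() returning -1.

	#include <stdio.h>

	struct waiter { unsigned int threshold; int woken; };

	int main(void)
	{
		struct waiter q[] = { { 32, 0 }, { 8, 0 } };	/* queue order */
		unsigned int nr_events = 0;

		for (int ev = 0; ev < 10; ev++) {
			nr_events++;			/* one completion arrives */
			for (int i = 0; i < 2; i++) {
				if (q[i].woken)
					continue;
				if (nr_events < q[i].threshold)
					break;		/* wake func returns -1 */
				q[i].woken = 1;		/* exclusive: wake one */
				break;
			}
		}
		printf("events=%u: waiter(32) %s, waiter(8) %s\n", nr_events,
		       q[0].woken ? "woken" : "asleep",
		       q[1].woken ? "woken" : "asleep");
		return 0;
	}

After 10 events this prints both waiters still asleep: the 8-event waiter is starved behind the 32-event waiter, which is the ordering the deadlock example above relies on.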

2019-09-26 10:49:41

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 24/09/2019 16:13, Jens Axboe wrote:
> On 9/24/19 5:23 AM, Pavel Begunkov wrote:
>>> Yep that should do it, and saves 8 bytes of stack as well.
>>>
>>> BTW, did you test my patch, this one or the previous? Just curious if it
>>> worked for you.
>>>
>> Not yet, going to do that tonight
>
> Thanks! For reference, the final version is below. There was still a
> signal mishap in there; now it should all be correct, afaict.
>
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 9b84232e5cc4..d2a86164d520 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
> return submit;
> }
>
> +struct io_wait_queue {
> + struct wait_queue_entry wq;
> + struct io_ring_ctx *ctx;
> + unsigned to_wait;
> + unsigned nr_timeouts;
> +};
> +
> +static inline bool io_should_wake(struct io_wait_queue *iowq)
> +{
> + struct io_ring_ctx *ctx = iowq->ctx;
> +
> + /*
> + * Wake up if we have enough events, or if a timeout occurred since we
> + * started waiting. For timeouts, we always want to return to userspace,
> + * regardless of event count.
> + */
> + return io_cqring_events(ctx->rings) >= iowq->to_wait ||
> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
> +}
> +
> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
> + int wake_flags, void *key)
> +{
> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
> + wq);
> +
> + if (!io_should_wake(iowq))
> + return -1;

It would try to schedule only the first task in the wait list. Is that the
semantics you want?
E.g. for waiters=[32,8] and nr_events == 8, io_wake_function() returns
-1 at the 32-event waiter, and won't wake up the second one.

> +
> + return autoremove_wake_function(curr, mode, wake_flags, key);
> +}
> +
> /*
> * Wait until events become available, if we don't already have some. The
> * application must reap them itself, as they reside on the shared cq ring.
> @@ -2775,8 +2807,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
> static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> const sigset_t __user *sig, size_t sigsz)
> {
> + struct io_wait_queue iowq = {
> + .wq = {
> + .private = current,
> + .func = io_wake_function,
> + .entry = LIST_HEAD_INIT(iowq.wq.entry),
> + },
> + .ctx = ctx,
> + .to_wait = min_events,
> + };
> struct io_rings *rings = ctx->rings;
> - unsigned nr_timeouts;
> int ret;
>
> if (io_cqring_events(rings) >= min_events)
> @@ -2795,15 +2835,20 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> return ret;
> }
>
> - nr_timeouts = atomic_read(&ctx->cq_timeouts);
> - /*
> - * Return if we have enough events, or if a timeout occured since
> - * we started waiting. For timeouts, we always want to return to
> - * userspace.
> - */
> - ret = wait_event_interruptible(ctx->wait,
> - io_cqring_events(rings) >= min_events ||
> - atomic_read(&ctx->cq_timeouts) != nr_timeouts);
> + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
> + do {
> + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq,
> + TASK_INTERRUPTIBLE);
> + if (io_should_wake(&iowq))
> + break;
> + schedule();
> + if (signal_pending(current)) {
> + ret = -ERESTARTSYS;
> + break;
> + }
> + } while (1);
> + finish_wait(&ctx->wait, &iowq.wq);
> +
> restore_saved_sigmask_unless(ret == -ERESTARTSYS);
> if (ret == -ERESTARTSYS)
> ret = -EINTR;
>

--
Yours sincerely,
Pavel Begunkov



2019-09-26 10:50:45

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 11:33 AM, Pavel Begunkov wrote:
> On 24/09/2019 16:13, Jens Axboe wrote:
>> On 9/24/19 5:23 AM, Pavel Begunkov wrote:
>>>> Yep that should do it, and saves 8 bytes of stack as well.
>>>>
>>>> BTW, did you test my patch, this one or the previous? Just curious if it
>>>> worked for you.
>>>>
>>> Not yet, going to do that tonight
>>
>> Thanks! For reference, the final version is below. There was still a
>> signal mishap in there; now it should all be correct, afaict.
>>
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 9b84232e5cc4..d2a86164d520 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
>> return submit;
>> }
>>
>> +struct io_wait_queue {
>> + struct wait_queue_entry wq;
>> + struct io_ring_ctx *ctx;
>> + unsigned to_wait;
>> + unsigned nr_timeouts;
>> +};
>> +
>> +static inline bool io_should_wake(struct io_wait_queue *iowq)
>> +{
>> + struct io_ring_ctx *ctx = iowq->ctx;
>> +
>> + /*
>> + * Wake up if we have enough events, or if a timeout occurred since we
>> + * started waiting. For timeouts, we always want to return to userspace,
>> + * regardless of event count.
>> + */
>> + return io_cqring_events(ctx->rings) >= iowq->to_wait ||
>> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
>> +}
>> +
>> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
>> + int wake_flags, void *key)
>> +{
>> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
>> + wq);
>> +
>> + if (!io_should_wake(iowq))
>> + return -1;
>
> It would try to schedule only the first task in the wait list. Is that the
> semantics you want?
> E.g. for waiters=[32,8] and nr_events == 8, io_wake_function() returns
> -1 at the 32-event waiter, and won't wake up the second one.

Right, those are the semantics I want. We keep the list ordered by using
the exclusive wait addition, which means that, for the case you list,
waiters=32 came first and we should not wake others before that task
gets the completions it wants. Otherwise we could potentially starve
higher-count waiters if we always keep going and new waiters come in.

--
Jens Axboe
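
For reference, what makes the -1 return stop the walk: a trimmed sketch of __wake_up_common() (kernel/sched/wait.c, v5.3-era; bookmark handling omitted):

	static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
				    int nr_exclusive, int wake_flags, void *key)
	{
		wait_queue_entry_t *curr, *next;

		list_for_each_entry_safe(curr, next, &wq_head->head, entry) {
			unsigned flags = curr->flags;
			int ret = curr->func(curr, mode, wake_flags, key);

			if (ret < 0)
				break;	/* e.g. io_wake_function() said "not yet" */
			if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
				break;	/* woke enough exclusive waiters */
		}
		return nr_exclusive;
	}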

2019-09-26 10:54:42

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 12:28 PM, Pavel Begunkov wrote:
> On 24/09/2019 20:46, Jens Axboe wrote:
>> On 9/24/19 11:33 AM, Pavel Begunkov wrote:
>>> On 24/09/2019 16:13, Jens Axboe wrote:
>>>> On 9/24/19 5:23 AM, Pavel Begunkov wrote:
>>>>>> Yep that should do it, and saves 8 bytes of stack as well.
>>>>>>
>>>>>> BTW, did you test my patch, this one or the previous? Just curious if it
>>>>>> worked for you.
>>>>>>
>>>>> Not yet, going to do that tonight
>>>>
>>>> Thanks! For reference, the final version is below. There was still a
>>>> signal mishap in there; now it should all be correct, afaict.
>>>>
>>>>
>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>> index 9b84232e5cc4..d2a86164d520 100644
>>>> --- a/fs/io_uring.c
>>>> +++ b/fs/io_uring.c
>>>> @@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
>>>> return submit;
>>>> }
>>>>
>>>> +struct io_wait_queue {
>>>> + struct wait_queue_entry wq;
>>>> + struct io_ring_ctx *ctx;
>>>> + unsigned to_wait;
>>>> + unsigned nr_timeouts;
>>>> +};
>>>> +
>>>> +static inline bool io_should_wake(struct io_wait_queue *iowq)
>>>> +{
>>>> + struct io_ring_ctx *ctx = iowq->ctx;
>>>> +
>>>> + /*
>>>> + * Wake up if we have enough events, or if a timeout occurred since we
>>>> + * started waiting. For timeouts, we always want to return to userspace,
>>>> + * regardless of event count.
>>>> + */
>>>> + return io_cqring_events(ctx->rings) >= iowq->to_wait ||
>>>> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
>>>> +}
>>>> +
>>>> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
>>>> + int wake_flags, void *key)
>>>> +{
>>>> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
>>>> + wq);
>>>> +
>>>> + if (!io_should_wake(iowq))
>>>> + return -1;
>>>
>>> It would try to schedule only the first task in the wait list. Is that the
>>> semantics you want?
>>> E.g. for waiters=[32,8] and nr_events == 8, io_wake_function() returns
>>> -1 at the 32-event waiter, and won't wake up the second one.
>>
>> Right, those are the semantics I want. We keep the list ordered by using
>> the exclusive wait addition, which means that, for the case you list,
>> waiters=32 came first and we should not wake others before that task
>> gets the completions it wants. Otherwise we could potentially starve
>> higher-count waiters if we always keep going and new waiters come in.
>>
> Yes. I think it would be better to document this in the userspace API. I
> could imagine some crazy case deadlocking userspace, e.g.
> thread 1: wait_events(8), reap_events
> thread 2: wait_events(32), wait(thread 1), reap_events

No matter how you handle cases like this, deadlocks will always be
possible... So I don't think that's a huge concern. It's more important
not to have inherent livelocks, which we would have if we always
allowed the lowest wait count to get woken and steal the budget
every time.

> Works well.
> Reviewed-by: Pavel Begunkov <[email protected]>
> Tested-by: Pavel Begunkov <[email protected]>

Thanks, will add!

> BTW, I searched for wait_event*(), and it seems there are plenty of
> similar use cases. So a generic version would be useful, but that's for
> later.

Agree, it would undoubtedly be useful.

--
Jens Axboe
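
A sketch of what such a generic variant might look like, building on the same wake-function trick and echoing the wake_up_threshold() from the original patchset (the bodies below are illustrative, not code from either series):

	struct threshold_waiter {
		struct wait_queue_entry	wq;
		unsigned int		threshold;
	};

	static int threshold_wake_function(struct wait_queue_entry *curr,
					   unsigned int mode, int wake_flags,
					   void *key)
	{
		struct threshold_waiter *tw =
			container_of(curr, struct threshold_waiter, wq);

		/* The waker passes the current event count via @key. */
		if ((unsigned long)key < tw->threshold)
			return -1;	/* not enough yet: stop the walk */
		return autoremove_wake_function(curr, mode, wake_flags, key);
	}

	/* Waker side, e.g. on each batch of completions: */
	#define wake_up_threshold(wq, nr) \
		__wake_up(wq, TASK_NORMAL, 1, (void *)(unsigned long)(nr))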