2019-09-23 18:30:01

by Pavel Begunkov

Subject: [PATCH v2 0/2] Optimise io_uring completion waiting

From: Pavel Begunkov <[email protected]>

There can be a lot of overhead within the generic wait_event_*() helpers
when waiting for a large number of completions. This patchset removes much
of it by using a custom wait event (wait_threshold).

A synthetic test showed a ~40% performance boost (see patch 2).
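
In short, the waiting side changes shape like this (simplified from the
call sites in patch 2; the real condition also checks for timeouts):

    /* before: woken on every completion, recheck, maybe sleep again */
    ret = wait_event_interruptible(ctx->wait,
                io_cqring_events(rings) >= min_events);

    /* after: the waker passes the current event count, so waiters whose
     * threshold has not been met yet are not woken at all
     */
    ret = wait_threshold_interruptible(ctx->wait, min_events,
                io_cqring_events(rings) >= min_events);
    ...
    wake_up_threshold(&ctx->wait, nr_events);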

v2: rebase

Pavel Begunkov (2):
sched/wait: Add wait_threshold
io_uring: Optimise cq waiting with wait_threshold

fs/io_uring.c | 35 ++++++++++++------
include/linux/wait_threshold.h | 67 ++++++++++++++++++++++++++++++++++
kernel/sched/Makefile | 2 +-
kernel/sched/wait_threshold.c | 26 +++++++++++++
4 files changed, 118 insertions(+), 12 deletions(-)
create mode 100644 include/linux/wait_threshold.h
create mode 100644 kernel/sched/wait_threshold.c

--
2.23.0


2019-09-23 19:18:55

by Pavel Begunkov

Subject: [PATCH v2 2/2] io_uring: Optimise cq waiting with wait_threshold

From: Pavel Begunkov <[email protected]>

While waiting for completion events in io_cqring_wait(), the process is
woken up inside wait_event_interruptible() on any request completion,
checks the number of events in the completion queue, and potentially goes
back to sleep again.

There can be a lot of such spurious wakeups, with a lot of overhead
attached. The problem especially manifests itself when min_events is large
and completions arrive one by one or in small batches (which is usually
the case).

E.g. if the device completes requests one by one and io_uring_enter() is
waiting for 100 events, there will be ~99 spurious wakeups.

Use the new wait_threshold_*() primitives instead, which won't wake the
task up until the necessary number of events has been collected.
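
Patch 1 is not reproduced here; as a rough sketch of the mechanism it
adds (the exact contents of wait_threshold.h are an assumption, inferred
from the call sites in this patch): each waiter records its threshold in
an extended wait entry, the waker passes the current event count as the
wake key, and the filtering happens inside the wake function:

    /* hypothetical sketch, not the actual patch 1 */
    struct wait_threshold_queue_entry {
        struct wait_queue_entry wq_entry;
        unsigned int threshold;
    };

    static int wait_threshold_wake_function(struct wait_queue_entry *wq_entry,
                                            unsigned int mode, int sync,
                                            void *key)
    {
        struct wait_threshold_queue_entry *wtq_entry =
            container_of(wq_entry, struct wait_threshold_queue_entry,
                         wq_entry);
        unsigned int val = (unsigned int)(unsigned long)key;

        /* a key of WQ_THRESHOLD_WAKE_ALWAYS (e.g. UINT_MAX) passes
         * this check for any waiter
         */
        if (val < wtq_entry->threshold)
            return 0;    /* threshold not reached, don't wake */

        return autoremove_wake_function(wq_entry, mode, sync, key);
    }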

Performance test:
The first thread generates requests (QD=512) one by one, so they are
completed in a similar pattern. The second thread waits for 128 events
to complete.

Tested with null_blk with a 5us delay on a 3.8GHz Intel CPU.

throughput before: 270 KIOPS
throughput after: 370 KIOPS
That's a ~40% throughput boost; exaggerated, but it makes the point.

v2: wake always in io_timeout_fn() with WQ_THRESHOLD_WAKE_ALWAYS

Signed-off-by: Pavel Begunkov <[email protected]>
---
fs/io_uring.c | 35 ++++++++++++++++++++++++-----------
1 file changed, 24 insertions(+), 11 deletions(-)

diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5c3f2bb81637..05f4391c7bbe 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -70,6 +70,7 @@
#include <linux/nospec.h>
#include <linux/sizes.h>
#include <linux/hugetlb.h>
+#include <linux/wait_threshold.h>

#include <uapi/linux/io_uring.h>

@@ -414,6 +415,13 @@ static struct io_ring_ctx *io_ring_ctx_alloc(struct io_uring_params *p)
return ctx;
}

+static unsigned int io_cqring_events(struct io_rings *rings)
+{
+ /* See comment at the top of this file */
+ smp_rmb();
+ return READ_ONCE(rings->cq.tail) - READ_ONCE(rings->cq.head);
+}
+
static inline bool io_sequence_defer(struct io_ring_ctx *ctx,
struct io_kiocb *req)
{
@@ -559,16 +567,27 @@ static void io_cqring_fill_event(struct io_ring_ctx *ctx, u64 ki_user_data,
}
}

-static void io_cqring_ev_posted(struct io_ring_ctx *ctx)
+static void __io_cqring_ev_posted(struct io_ring_ctx *ctx,
+ unsigned int nr_events)
{
if (waitqueue_active(&ctx->wait))
- wake_up(&ctx->wait);
+ wake_up_threshold(&ctx->wait, nr_events);
if (waitqueue_active(&ctx->sqo_wait))
wake_up(&ctx->sqo_wait);
if (ctx->cq_ev_fd)
eventfd_signal(ctx->cq_ev_fd, 1);
}

+static inline void io_cqring_ev_posted(struct io_ring_ctx *ctx)
+{
+ __io_cqring_ev_posted(ctx, io_cqring_events(ctx->rings));
+}
+
+static inline void io_cqring_timeout_posted(struct io_ring_ctx *ctx)
+{
+ __io_cqring_ev_posted(ctx, WQ_THRESHOLD_WAKE_ALWAYS);
+}
+
static void io_cqring_add_event(struct io_ring_ctx *ctx, u64 user_data,
long res)
{
@@ -587,7 +606,7 @@ static void io_ring_drop_ctx_refs(struct io_ring_ctx *ctx, unsigned refs)
percpu_ref_put_many(&ctx->refs, refs);

if (waitqueue_active(&ctx->wait))
- wake_up(&ctx->wait);
+ wake_up_threshold(&ctx->wait, io_cqring_events(ctx->rings));
}

static struct io_kiocb *io_get_req(struct io_ring_ctx *ctx,
@@ -722,12 +741,6 @@ static void io_put_req(struct io_kiocb *req)
io_free_req(req);
}

-static unsigned io_cqring_events(struct io_rings *rings)
-{
- /* See comment at the top of this file */
- smp_rmb();
- return READ_ONCE(rings->cq.tail) - READ_ONCE(rings->cq.head);
-}

/*
* Find and free completed poll iocbs
@@ -1824,7 +1837,7 @@ static enum hrtimer_restart io_timeout_fn(struct hrtimer *timer)
io_commit_cqring(ctx);
spin_unlock_irqrestore(&ctx->completion_lock, flags);

- io_cqring_ev_posted(ctx);
+ io_cqring_timeout_posted(ctx);

io_put_req(req);
return HRTIMER_NORESTART;
@@ -2723,7 +2736,7 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
* we started waiting. For timeouts, we always want to return to
* userspace.
*/
- ret = wait_event_interruptible(ctx->wait,
+ ret = wait_threshold_interruptible(ctx->wait, min_events,
io_cqring_events(rings) >= min_events ||
atomic_read(&ctx->cq_timeouts) != nr_timeouts);
restore_saved_sigmask_unless(ret == -ERESTARTSYS);
--
2.23.0

2019-09-25 20:04:52

by Pavel Begunkov

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

Sorry, I mixed up the threads.

>>
>> I'm not sure an extension is needed for such a special interface, why not
>> just put a ->threshold value next to the ctx->wait field and use either
>> the regular wait_event() APIs with the proper condition, or
>> wait_event_cmd() style APIs if you absolutely need something more complex
>> to happen inside?
Ingo,
io_uring works well without this patch, just using wait_event_*() with a
proper condition, but there are performance issues with spurious wakeups.
There is a detailed description in the previous mail.
Am I missing something?
Thanks


>>
>> Should result in a much lower linecount and no scheduler changes. :-)
>>
>> Thanks,
>>
>> Ingo
>>
>

--
Yours sincerely,
Pavel Begunkov



2019-09-26 00:36:56

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/23/19 2:48 PM, Jens Axboe wrote:
> On 9/23/19 10:32 AM, Pavel Begunkov wrote:
>> Sorry, I mixed up the threads.
>>
>>>>
>>>> I'm not sure an extension is needed for such a special interface, why not
>>>> just put a ->threshold value next to the ctx->wait field and use either
>>>> the regular wait_event() APIs with the proper condition, or
>>>> wait_event_cmd() style APIs if you absolutely need something more complex
>>>> to happen inside?
>> Ingo,
>> io_uring works well without this patch, just using wait_event_*() with a
>> proper condition, but there are performance issues with spurious wakeups.
>> There is a detailed description in the previous mail.
>> Am I missing something?
>
> I think we can do the same thing, just wrapping the waitqueue in a
> structure with a count in it, on the stack. Got some flight time
> coming up later today, let me try and cook up a patch.

Totally untested, and sent out 5 min before departure... But something
like this.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index ca7570aca430..c2f9e1da26dd 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2768,6 +2768,37 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
return submit;
}

+struct io_wait_queue {
+ struct wait_queue_entry wq;
+ struct io_ring_ctx *ctx;
+ struct task_struct *task;
+ unsigned to_wait;
+ unsigned nr_timeouts;
+};
+
+static inline bool io_should_wake(struct io_wait_queue *iowq)
+{
+ struct io_ring_ctx *ctx = iowq->ctx;
+
+ return io_cqring_events(ctx->rings) >= iowq->to_wait ||
+ atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
+}
+
+static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
+ int wake_flags, void *key)
+{
+ struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
+ wq);
+
+ if (io_should_wake(iowq)) {
+ list_del_init(&curr->entry);
+ wake_up_process(iowq->task);
+ return 1;
+ }
+
+ return -1;
+}
+
/*
* Wait until events become available, if we don't already have some. The
* application must reap them itself, as they reside on the shared cq ring.
@@ -2775,8 +2806,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
const sigset_t __user *sig, size_t sigsz)
{
+ struct io_wait_queue iowq = {
+ .wq = {
+ .func = io_wake_function,
+ .entry = LIST_HEAD_INIT(iowq.wq.entry),
+ },
+ .task = current,
+ .ctx = ctx,
+ .to_wait = min_events,
+ };
struct io_rings *rings = ctx->rings;
- unsigned nr_timeouts;
int ret;

if (io_cqring_events(rings) >= min_events)
@@ -2795,15 +2834,16 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
return ret;
}

- nr_timeouts = atomic_read(&ctx->cq_timeouts);
- /*
- * Return if we have enough events, or if a timeout occured since
- * we started waiting. For timeouts, we always want to return to
- * userspace.
- */
- ret = wait_event_interruptible(ctx->wait,
- io_cqring_events(rings) >= min_events ||
- atomic_read(&ctx->cq_timeouts) != nr_timeouts);
+ iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
+ prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
+ do {
+ if (io_should_wake(&iowq))
+ break;
+ schedule();
+ set_current_state(TASK_INTERRUPTIBLE);
+ } while (1);
+ finish_wait(&ctx->wait, &iowq.wq);
+
restore_saved_sigmask_unless(ret == -ERESTARTSYS);
if (ret == -ERESTARTSYS)
ret = -EINTR;

--
Jens Axboe

2019-09-26 00:44:15

by Pavel Begunkov

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 24/09/2019 02:00, Jens Axboe wrote:
>> I think we can do the same thing, just wrapping the waitqueue in a
>> structure with a count in it, on the stack. Got some flight time
>> coming up later today, let me try and cook up a patch.
>
> Totally untested, and sent out 5 min before departure... But something
> like this.
Hmm, reminds me of my first version. Basically that's the same thing but
with macros inlined. I wanted to make it reusable and self-contained,
though.

If you don't think it could be useful in other places, sure, we could do
something like that. Is that so?

>
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index ca7570aca430..c2f9e1da26dd 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2768,6 +2768,37 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
> return submit;
> }
>
> +struct io_wait_queue {
> + struct wait_queue_entry wq;
> + struct io_ring_ctx *ctx;
> + struct task_struct *task;
> + unsigned to_wait;
> + unsigned nr_timeouts;
> +};
> +
> +static inline bool io_should_wake(struct io_wait_queue *iowq)
> +{
> + struct io_ring_ctx *ctx = iowq->ctx;
> +
> + return io_cqring_events(ctx->rings) >= iowq->to_wait ||
> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
> +}
> +
> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
> + int wake_flags, void *key)
> +{
> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
> + wq);
> +
> + if (io_should_wake(iowq)) {
> + list_del_init(&curr->entry);
> + wake_up_process(iowq->task);
> + return 1;
> + }
> +
> + return -1;
> +}
> +
> /*
> * Wait until events become available, if we don't already have some. The
> * application must reap them itself, as they reside on the shared cq ring.
> @@ -2775,8 +2806,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
> static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> const sigset_t __user *sig, size_t sigsz)
> {
> + struct io_wait_queue iowq = {
> + .wq = {
> + .func = io_wake_function,
> + .entry = LIST_HEAD_INIT(iowq.wq.entry),
> + },
> + .task = current,
> + .ctx = ctx,
> + .to_wait = min_events,
> + };
> struct io_rings *rings = ctx->rings;
> - unsigned nr_timeouts;
> int ret;
>
> if (io_cqring_events(rings) >= min_events)
> @@ -2795,15 +2834,16 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> return ret;
> }
>
> - nr_timeouts = atomic_read(&ctx->cq_timeouts);
> - /*
> - * Return if we have enough events, or if a timeout occured since
> - * we started waiting. For timeouts, we always want to return to
> - * userspace.
> - */
> - ret = wait_event_interruptible(ctx->wait,
> - io_cqring_events(rings) >= min_events ||
> - atomic_read(&ctx->cq_timeouts) != nr_timeouts);
> + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
> + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
> + do {
> + if (io_should_wake(&iowq))
> + break;
> + schedule();
> + set_current_state(TASK_INTERRUPTIBLE);
> + } while (1);
> + finish_wait(&ctx->wait, &iowq.wq);
> +
> restore_saved_sigmask_unless(ret == -ERESTARTSYS);
> if (ret == -ERESTARTSYS)
> ret = -EINTR;
>

--
Yours sincerely,
Pavel Begunkov



2019-09-26 00:48:47

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 1:06 AM, Pavel Begunkov wrote:
> On 24/09/2019 02:00, Jens Axboe wrote:
>>> I think we can do the same thing, just wrapping the waitqueue in a
>>> structure with a count in it, on the stack. Got some flight time
>>> coming up later today, let me try and cook up a patch.
>>
>> Totally untested, and sent out 5 min before departure... But something
>> like this.
> Hmm, reminds me of my first version. Basically that's the same thing but
> with macros inlined. I wanted to make it reusable and self-contained,
> though.
>
> If you don't think it could be useful in other places, sure, we could do
> something like that. Is that so?

I totally agree it could be useful in other places. Maybe formalized and
used with wake_up_nr() instead of adding a new primitive? Haven't looked
into that, I may be talking nonsense.

In any case, I did get a chance to test it and it works for me. Here's
the "finished" version, slightly cleaned up and with a comment added
for good measure.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index ca7570aca430..14fae454cf75 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2768,6 +2768,42 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
return submit;
}

+struct io_wait_queue {
+ struct wait_queue_entry wq;
+ struct io_ring_ctx *ctx;
+ struct task_struct *task;
+ unsigned to_wait;
+ unsigned nr_timeouts;
+};
+
+static inline bool io_should_wake(struct io_wait_queue *iowq)
+{
+ struct io_ring_ctx *ctx = iowq->ctx;
+
+ /*
+ * Wake up if we have enough events, or if a timeout occurred since we
+ * started waiting. For timeouts, we always want to return to userspace,
+ * regardless of event count.
+ */
+ return io_cqring_events(ctx->rings) >= iowq->to_wait ||
+ atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
+}
+
+static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
+ int wake_flags, void *key)
+{
+ struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
+ wq);
+
+ if (io_should_wake(iowq)) {
+ list_del_init(&curr->entry);
+ wake_up_process(iowq->task);
+ return 1;
+ }
+
+ return -1;
+}
+
/*
* Wait until events become available, if we don't already have some. The
* application must reap them itself, as they reside on the shared cq ring.
@@ -2775,8 +2811,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
const sigset_t __user *sig, size_t sigsz)
{
+ struct io_wait_queue iowq = {
+ .wq = {
+ .func = io_wake_function,
+ .entry = LIST_HEAD_INIT(iowq.wq.entry),
+ },
+ .task = current,
+ .ctx = ctx,
+ .to_wait = min_events,
+ };
struct io_rings *rings = ctx->rings;
- unsigned nr_timeouts;
int ret;

if (io_cqring_events(rings) >= min_events)
@@ -2795,15 +2839,16 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
return ret;
}

- nr_timeouts = atomic_read(&ctx->cq_timeouts);
- /*
- * Return if we have enough events, or if a timeout occured since
- * we started waiting. For timeouts, we always want to return to
- * userspace.
- */
- ret = wait_event_interruptible(ctx->wait,
- io_cqring_events(rings) >= min_events ||
- atomic_read(&ctx->cq_timeouts) != nr_timeouts);
+ iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
+ prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
+ do {
+ if (io_should_wake(&iowq))
+ break;
+ schedule();
+ set_current_state(TASK_INTERRUPTIBLE);
+ } while (1);
+ finish_wait(&ctx->wait, &iowq.wq);
+
restore_saved_sigmask_unless(ret == -ERESTARTSYS);
if (ret == -ERESTARTSYS)
ret = -EINTR;

--
Jens Axboe

2019-09-26 00:51:46

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 2:27 AM, Jens Axboe wrote:
> On 9/24/19 2:02 AM, Jens Axboe wrote:
>> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>>> structure with a count in it, on the stack. Got some flight time
>>>>> coming up later today, let me try and cook up a patch.
>>>>
>>>> Totally untested, and sent out 5 min before departure... But something
>>>> like this.
>>> Hmm, reminds me of my first version. Basically that's the same thing but
>>> with macros inlined. I wanted to make it reusable and self-contained,
>>> though.
>>>
>>> If you don't think it could be useful in other places, sure, we could do
>>> something like that. Is that so?
>>
>> I totally agree it could be useful in other places. Maybe formalized and
>> used with wake_up_nr() instead of adding a new primitive? Haven't looked
>> into that, I may be talking nonsense.
>>
>> In any case, I did get a chance to test it and it works for me. Here's
>> the "finished" version, slightly cleaned up and with a comment added
>> for good measure.
>
> Notes:
>
> This version gets the ordering right, you need exclusive waits to get
> fifo ordering on the waitqueue.
>
> Both versions (yours and mine) suffer from the problem of potentially
> waking too many. I don't think this is a real issue, as generally we
> don't do threaded access to the io_urings. But if you had the following
> tasks wait on the cqring:
>
> [min_events = 32], [min_events = 8], [min_events = 8]
>
> and we reach the io_cqring_events() == threshold, we'll wake all three.
> I don't see a good solution to this, so I suspect we just live with it
> until proven an issue. Both versions are much better than what we have
> now.

Forgot an issue around signal handling, version below adds the
right check for that too.

Curious what your test case was for this?


diff --git a/fs/io_uring.c b/fs/io_uring.c
index ca7570aca430..3fbab5692f14 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2768,6 +2768,42 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
return submit;
}

+struct io_wait_queue {
+ struct wait_queue_entry wq;
+ struct io_ring_ctx *ctx;
+ struct task_struct *task;
+ unsigned to_wait;
+ unsigned nr_timeouts;
+};
+
+static inline bool io_should_wake(struct io_wait_queue *iowq)
+{
+ struct io_ring_ctx *ctx = iowq->ctx;
+
+ /*
+ * Wake up if we have enough events, or if a timeout occurred since we
+ * started waiting. For timeouts, we always want to return to userspace,
+ * regardless of event count.
+ */
+ return io_cqring_events(ctx->rings) >= iowq->to_wait ||
+ atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
+}
+
+static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
+ int wake_flags, void *key)
+{
+ struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
+ wq);
+
+ if (io_should_wake(iowq)) {
+ list_del_init(&curr->entry);
+ wake_up_process(iowq->task);
+ return 1;
+ }
+
+ return -1;
+}
+
/*
* Wait until events become available, if we don't already have some. The
* application must reap them itself, as they reside on the shared cq ring.
@@ -2775,8 +2811,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
const sigset_t __user *sig, size_t sigsz)
{
+ struct io_wait_queue iowq = {
+ .wq = {
+ .func = io_wake_function,
+ .entry = LIST_HEAD_INIT(iowq.wq.entry),
+ },
+ .task = current,
+ .ctx = ctx,
+ .to_wait = min_events,
+ };
struct io_rings *rings = ctx->rings;
- unsigned nr_timeouts;
int ret;

if (io_cqring_events(rings) >= min_events)
@@ -2795,15 +2839,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
return ret;
}

- nr_timeouts = atomic_read(&ctx->cq_timeouts);
- /*
- * Return if we have enough events, or if a timeout occured since
- * we started waiting. For timeouts, we always want to return to
- * userspace.
- */
- ret = wait_event_interruptible(ctx->wait,
- io_cqring_events(rings) >= min_events ||
- atomic_read(&ctx->cq_timeouts) != nr_timeouts);
+ iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
+ prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
+ do {
+ if (io_should_wake(&iowq))
+ break;
+ schedule();
+ if (signal_pending(current))
+ break;
+ set_current_state(TASK_INTERRUPTIBLE);
+ } while (1);
+ finish_wait(&ctx->wait, &iowq.wq);
+
restore_saved_sigmask_unless(ret == -ERESTARTSYS);
if (ret == -ERESTARTSYS)
ret = -EINTR;

--
Jens Axboe

2019-09-26 00:52:00

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 2:02 AM, Jens Axboe wrote:
> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>> structure with a count in it, on the stack. Got some flight time
>>>> coming up later today, let me try and cook up a patch.
>>>
>>> Totally untested, and sent out 5 min before departure... But something
>>> like this.
>> Hmm, reminds me of my first version. Basically that's the same thing but
>> with macros inlined. I wanted to make it reusable and self-contained,
>> though.
>>
>> If you don't think it could be useful in other places, sure, we could do
>> something like that. Is that so?
>
> I totally agree it could be useful in other places. Maybe formalized and
> used with wake_up_nr() instead of adding a new primitive? Haven't looked
> into that, I may be talking nonsense.
>
> In any case, I did get a chance to test it and it works for me. Here's
> the "finished" version, slightly cleaned up and with a comment added
> for good measure.

Notes:

This version gets the ordering right, you need exclusive waits to get
fifo ordering on the waitqueue.

Both versions (yours and mine) suffer from the problem of potentially
waking too many. I don't think this is a real issue, as generally we
don't do threaded access to the io_urings. But if you had the following
tasks wait on the cqring:

[min_events = 32], [min_events = 8], [min_events = 8]

and we reach the io_cqring_events() == threshold, we'll wake all three.
I don't see a good solution to this, so I suspect we just live with it
until proven an issue. Both versions are much better than what we have
now.

--
Jens Axboe

2019-09-26 00:56:13

by Pavel Begunkov

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting



On 24/09/2019 11:27, Jens Axboe wrote:
> On 9/24/19 2:02 AM, Jens Axboe wrote:
>> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>>> structure with a count in it, on the stack. Got some flight time
>>>>> coming up later today, let me try and cook up a patch.
>>>>
>>>> Totally untested, and sent out 5 min before departure... But something
>>>> like this.
>>> Hmm, reminds me of my first version. Basically that's the same thing but
>>> with macros inlined. I wanted to make it reusable and self-contained,
>>> though.
>>>
>>> If you don't think it could be useful in other places, sure, we could do
>>> something like that. Is that so?
>>
>> I totally agree it could be useful in other places. Maybe formalized and
>> used with wake_up_nr() instead of adding a new primitive? Haven't looked
>> into that, I may be talking nonsense.
>>
>> In any case, I did get a chance to test it and it works for me. Here's
>> the "finished" version, slightly cleaned up and with a comment added
>> for good measure.
>
> Notes:
>
> This version gets the ordering right, you need exclusive waits to get
> fifo ordering on the waitqueue.
>
> Both versions (yours and mine) suffer from the problem of potentially
> waking too many. I don't think this is a real issue, as generally we
> don't do threaded access to the io_urings. But if you had the following
> tasks wait on the cqring:
>
> [min_events = 32], [min_events = 8], [min_events = 8]
>
> and we reach the io_cqring_events() == threshold, we'll wake all three.
> I don't see a good solution to this, so I suspect we just live with it
> until proven an issue. Both versions are much better than what we have
> now.
>
If io_cqring_events() == 8, only the last two would be woken up in both
implementations, as to_wait/threshold is specified per waiter. Isn't that
right?

Agreed on the waiting; I don't see a good real-life case for that which
couldn't be done efficiently in userspace.


--
Yours sincerely,
Pavel Begunkov



2019-09-26 01:05:23

by Peter Zijlstra

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote:

> +struct io_wait_queue {
> + struct wait_queue_entry wq;
> + struct io_ring_ctx *ctx;
> + struct task_struct *task;

wq.private is where the normal waitqueue stores the task pointer.

(I'm going to rename that)
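
For reference, the entry layout in include/linux/wait.h:

    struct wait_queue_entry {
        unsigned int        flags;
        void                *private;    /* normally the waiting task */
        wait_queue_func_t   func;
        struct list_head    entry;
    };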

> + unsigned to_wait;
> + unsigned nr_timeouts;
> +};
> +
> +static inline bool io_should_wake(struct io_wait_queue *iowq)
> +{
> + struct io_ring_ctx *ctx = iowq->ctx;
> +
> + /*
> + * Wake up if we have enough events, or if a timeout occurred since we
> + * started waiting. For timeouts, we always want to return to userspace,
> + * regardless of event count.
> + */
> + return io_cqring_events(ctx->rings) >= iowq->to_wait ||
> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
> +}
> +
> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
> + int wake_flags, void *key)
> +{
> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
> + wq);
> +
> + if (io_should_wake(iowq)) {
> + list_del_init(&curr->entry);
> + wake_up_process(iowq->task);

Then you can use autoremove_wake_function() here.
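
For reference, autoremove_wake_function() in kernel/sched/wait.c is the
same remove+wake pair; it wakes the task through wq_entry->private, which
is why the task pointer would need to live there:

    int autoremove_wake_function(struct wait_queue_entry *wq_entry,
                                 unsigned mode, int sync, void *key)
    {
        int ret = default_wake_function(wq_entry, mode, sync, key);

        if (ret)
            list_del_init(&wq_entry->entry);

        return ret;
    }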

> + return 1;
> + }
> +
> + return -1;
> +}

Ideally we'd get wait_event()'s @cond in a custom wake function. Then we
can _always_ do this.

This is one I'd love to have lambda functions for. It would actually
work with GCC nested functions, because the wake function will always be
in scope, but we can't use those in the kernel for other reasons :/

2019-09-26 01:12:03

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 3:20 AM, Pavel Begunkov wrote:
>
>
> On 24/09/2019 11:27, Jens Axboe wrote:
>> On 9/24/19 2:02 AM, Jens Axboe wrote:
>>> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>>>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>>>> structure with a count in it, on the stack. Got some flight time
>>>>>> coming up later today, let me try and cook up a patch.
>>>>>
>>>>> Totally untested, and sent out 5 min before departure... But something
>>>>> like this.
>>>> Hmm, reminds me of my first version. Basically that's the same thing but
>>>> with macros inlined. I wanted to make it reusable and self-contained,
>>>> though.
>>>>
>>>> If you don't think it could be useful in other places, sure, we could do
>>>> something like that. Is that so?
>>>
>>> I totally agree it could be useful in other places. Maybe formalized and
>>> used with wake_up_nr() instead of adding a new primitive? Haven't looked
>>> into that, I may be talking nonsense.
>>>
>>> In any case, I did get a chance to test it and it works for me. Here's
>>> the "finished" version, slightly cleaned up and with a comment added
>>> for good measure.
>>
>> Notes:
>>
>> This version gets the ordering right, you need exclusive waits to get
>> fifo ordering on the waitqueue.
>>
>> Both versions (yours and mine) suffer from the problem of potentially
>> waking too many. I don't think this is a real issue, as generally we
>> don't do threaded access to the io_urings. But if you had the following
>> tasks wait on the cqring:
>>
>> [min_events = 32], [min_events = 8], [min_events = 8]
>>
>> and we reach the io_cqring_events() == threshold, we'll wake all three.
>> I don't see a good solution to this, so I suspect we just live with it
>> until proven an issue. Both versions are much better than what we have
>> now.
>>
> If io_cqring_events() == 8, only the last two would be woken up in both
> implementations, as to_wait/threshold is specified per waiter. Isn't that
> right?

If io_cqring_events() == 8, then none would be woken in my
implementation since the first one will break the wakeup loop.
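
For reference, the scan in __wake_up_common() (kernel/sched/wait.c,
paraphrased) aborts as soon as a wake function returns a negative value:

    list_for_each_entry_safe_from(curr, next, &wq_head->head, entry) {
        unsigned flags = curr->flags;
        int ret = curr->func(curr, mode, wake_flags, key);

        if (ret < 0)
            break;    /* a waiter with an unmet threshold stops the scan */
        if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
            break;
    }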

> Agreed on the waiting; I don't see a good real-life case for that which
> couldn't be done efficiently in userspace.

Exactly

--
Jens Axboe

2019-09-26 01:37:21

by Pavel Begunkov

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 24/09/2019 13:34, Jens Axboe wrote:
> On 9/24/19 4:13 AM, Jens Axboe wrote:
>> On 9/24/19 3:49 AM, Peter Zijlstra wrote:
>>> On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote:
>>>
>>>> +struct io_wait_queue {
>>>> + struct wait_queue_entry wq;
>>>> + struct io_ring_ctx *ctx;
>>>> + struct task_struct *task;
>>>
>>> wq.private is where the normal waitqueue stores the task pointer.
>>>
>>> (I'm going to rename that)
>>
>> If you do that, then we can just base the io_uring parts on that.
>
> Just took a quick look at it, and ran into block/kyber-iosched.c that
> actually uses the private pointer for something that isn't a task
> struct...
>

Let's reuse autoremove_wake_function() anyway. I changed your patch a bit:


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 5c3f2bb81637..a77971290fdd 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2690,6 +2690,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
return submit;
}

+struct io_wait_queue {
+ struct wait_queue_entry wq;
+ struct io_ring_ctx *ctx;
+ unsigned to_wait;
+ unsigned nr_timeouts;
+};
+
+static inline bool io_should_wake(struct io_wait_queue *iowq)
+{
+ struct io_ring_ctx *ctx = iowq->ctx;
+
+ /*
+ * Wake up if we have enough events, or if a timeout occurred since we
+ * started waiting. For timeouts, we always want to return to userspace,
+ * regardless of event count.
+ */
+ return io_cqring_events(ctx->rings) >= iowq->to_wait ||
+ atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
+}
+
+static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
+ int wake_flags, void *key)
+{
+ struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
+ wq);
+
+ if (!io_should_wake(iowq))
+ return -1;
+
+ return autoremove_wake_function(curr, mode, wake_flags, key);
+}
+
/*
* Wait until events become available, if we don't already have some. The
* application must reap them itself, as they reside on the shared cq ring.
@@ -2697,8 +2729,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
const sigset_t __user *sig, size_t sigsz)
{
+ struct io_wait_queue iowq = {
+ .wq = {
+ .private = current,
+ .func = io_wake_function,
+ .entry = LIST_HEAD_INIT(iowq.wq.entry),
+ },
+ .ctx = ctx,
+ .to_wait = min_events,
+ };
struct io_rings *rings = ctx->rings;
- unsigned nr_timeouts;
int ret;

if (io_cqring_events(rings) >= min_events)
@@ -2717,15 +2757,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
return ret;
}

- nr_timeouts = atomic_read(&ctx->cq_timeouts);
- /*
- * Return if we have enough events, or if a timeout occured since
- * we started waiting. For timeouts, we always want to return to
- * userspace.
- */
- ret = wait_event_interruptible(ctx->wait,
- io_cqring_events(rings) >= min_events ||
- atomic_read(&ctx->cq_timeouts) != nr_timeouts);
+ iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
+ prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
+ do {
+ if (io_should_wake(&iowq))
+ break;
+ schedule();
+ if (signal_pending(current))
+ break;
+ set_current_state(TASK_INTERRUPTIBLE);
+ } while (1);
+ finish_wait(&ctx->wait, &iowq.wq);
+
restore_saved_sigmask_unless(ret == -ERESTARTSYS);
if (ret == -ERESTARTSYS)
ret = -EINTR;


--
Yours sincerely,
Pavel Begunkov

2019-09-26 01:43:32

by Peter Zijlstra

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On Tue, Sep 24, 2019 at 12:34:17PM +0200, Jens Axboe wrote:

> Just took a quick look at it, and ran into block/kyber-iosched.c that
> actually uses the private pointer for something that isn't a task
> struct...

Argh... that's some 'creative' abuse of the waitqueue API :/

2019-09-26 01:46:07

by Peter Zijlstra

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On Tue, Sep 24, 2019 at 02:11:29PM +0300, Pavel Begunkov wrote:

> @@ -2717,15 +2757,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> return ret;
> }
>
> + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
> + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
> + do {
> + if (io_should_wake(&iowq))
> + break;
> + schedule();
> + if (signal_pending(current))
> + break;
> + set_current_state(TASK_INTERRUPTIBLE);
> + } while (1);
> + finish_wait(&ctx->wait, &iowq.wq);

It is likely OK, but for paranoia, I'd prefer this form:

for (;;) {
    prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
    if (io_should_wake(&iowq))
        break;
    schedule();
    if (signal_pending(current))
        break;
}
finish_wait(&ctx->wait, &iowq.wq);

The thing is, if we ever succeed with io_wake_function() (that CPU
observes io_should_wake()), but when waking up here we do not observe
io_should_wake() and go back to sleep, we might never wake up if we
don't put ourselves back on the wait-list again.
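
Spelled out, the interleaving of concern is roughly this (a hypothetical
trace, with the do/while form above):

    waiter                                  waker
    ------                                  -----
    prepare_to_wait_exclusive()
    io_should_wake() -> false
    schedule()
                                            io_wake_function():
                                              io_should_wake() -> true
                                              list_del_init() +
                                              wake_up_process()
    /* woken, but no longer on the wait-list */
    io_should_wake() -> false (stale view)
    schedule()  /* nobody left to wake us */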

2019-09-26 04:57:34

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/23/19 10:32 AM, Pavel Begunkov wrote:
> Sorry, I mixed up the threads.
>
>>>
>>> I'm not sure an extension is needed for such a special interface, why not
>>> just put a ->threshold value next to the ctx->wait field and use either
>>> the regular wait_event() APIs with the proper condition, or
>>> wait_event_cmd() style APIs if you absolutely need something more complex
>>> to happen inside?
> Ingo,
> io_uring works well without this patch, just using wait_event_*() with a
> proper condition, but there are performance issues with spurious wakeups.
> There is a detailed description in the previous mail.
> Am I missing something?

I think we can do the same thing, just wrapping the waitqueue in a
structure with a count in it, on the stack. Got some flight time
coming up later today, let me try and cook up a patch.

--
Jens Axboe

2019-09-26 07:46:25

by Pavel Begunkov

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting



On 24/09/2019 11:36, Jens Axboe wrote:
> On 9/24/19 2:27 AM, Jens Axboe wrote:
>> On 9/24/19 2:02 AM, Jens Axboe wrote:
>>> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>>>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>>>> structure with a count in it, on the stack. Got some flight time
>>>>>> coming up later today, let me try and cook up a patch.
>>>>>
>>>>> Totally untested, and sent out 5 min before departure... But something
>>>>> like this.
>>>> Hmm, reminds me of my first version. Basically that's the same thing but
>>>> with macros inlined. I wanted to make it reusable and self-contained,
>>>> though.
>>>>
>>>> If you don't think it could be useful in other places, sure, we could do
>>>> something like that. Is that so?
>>>
>>> I totally agree it could be useful in other places. Maybe formalized and
>>> used with wake_up_nr() instead of adding a new primitive? Haven't looked
>>> into that, I may be talking nonsense.
>>>
>>> In any case, I did get a chance to test it and it works for me. Here's
>>> the "finished" version, slightly cleaned up and with a comment added
>>> for good measure.
>>
>> Notes:
>>
>> This version gets the ordering right, you need exclusive waits to get
>> fifo ordering on the waitqueue.
>>
>> Both versions (yours and mine) suffer from the problem of potentially
>> waking too many. I don't think this is a real issue, as generally we
>> don't do threaded access to the io_urings. But if you had the following
>> tasks wait on the cqring:
>>
>> [min_events = 32], [min_events = 8], [min_events = 8]
>>
>> and we reach the io_cqring_events() == threshold, we'll wake all three.
>> I don't see a good solution to this, so I suspect we just live with it
>> until proven an issue. Both versions are much better than what we have
>> now.
>
> Forgot an issue around signal handling, version below adds the
> right check for that too.

That seems like a good reason not to keep reimplementing the
"prepare_to_wait*() + wait loop" pattern every time, but to keep it in
sched :)

>
> Curious what your test case was for this?
You mean a performance test case? It's briefly described in a comment
for the second patch. It's just a rewritten io_uring-bench, with
1. a thread generating 1 request per call in a loop,
2. and a second thread waiting for ~128 events.
Both are pinned to the same core.
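
In syscall terms the setup is roughly this (a sketch, not the actual
benchmark code):

    /* thread 1: submit one request per call, never waits */
    for (;;)
        io_uring_enter(ring_fd, 1, 0, 0, NULL);

    /* thread 2: block until ~128 completions have arrived */
    for (;;)
        io_uring_enter(ring_fd, 0, 128, IORING_ENTER_GETEVENTS, NULL);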

>
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index ca7570aca430..3fbab5692f14 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2768,6 +2768,42 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
> return submit;
> }
>
> +struct io_wait_queue {
> + struct wait_queue_entry wq;
> + struct io_ring_ctx *ctx;
> + struct task_struct *task;
> + unsigned to_wait;
> + unsigned nr_timeouts;
> +};
> +
> +static inline bool io_should_wake(struct io_wait_queue *iowq)
> +{
> + struct io_ring_ctx *ctx = iowq->ctx;
> +
> + /*
> + * Wake up if we have enough events, or if a timeout occurred since we
> + * started waiting. For timeouts, we always want to return to userspace,
> + * regardless of event count.
> + */
> + return io_cqring_events(ctx->rings) >= iowq->to_wait ||
> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
> +}
> +
> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
> + int wake_flags, void *key)
> +{
> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
> + wq);
> +
> + if (io_should_wake(iowq)) {
> + list_del_init(&curr->entry);
> + wake_up_process(iowq->task);
> + return 1;
> + }
> +
> + return -1;
> +}
> +
> /*
> * Wait until events become available, if we don't already have some. The
> * application must reap them itself, as they reside on the shared cq ring.
> @@ -2775,8 +2811,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
> static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> const sigset_t __user *sig, size_t sigsz)
> {
> + struct io_wait_queue iowq = {
> + .wq = {
> + .func = io_wake_function,
> + .entry = LIST_HEAD_INIT(iowq.wq.entry),
> + },
> + .task = current,
> + .ctx = ctx,
> + .to_wait = min_events,
> + };
> struct io_rings *rings = ctx->rings;
> - unsigned nr_timeouts;
> int ret;
>
> if (io_cqring_events(rings) >= min_events)
> @@ -2795,15 +2839,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> return ret;
> }
>
> - nr_timeouts = atomic_read(&ctx->cq_timeouts);
> - /*
> - * Return if we have enough events, or if a timeout occured since
> - * we started waiting. For timeouts, we always want to return to
> - * userspace.
> - */
> - ret = wait_event_interruptible(ctx->wait,
> - io_cqring_events(rings) >= min_events ||
> - atomic_read(&ctx->cq_timeouts) != nr_timeouts);
> + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
> + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
> + do {
> + if (io_should_wake(&iowq))
> + break;
> + schedule();
> + if (signal_pending(current))
> + break;
> + set_current_state(TASK_INTERRUPTIBLE);
> + } while (1);
> + finish_wait(&ctx->wait, &iowq.wq);
> +
> restore_saved_sigmask_unless(ret == -ERESTARTSYS);
> if (ret == -ERESTARTSYS)
> ret = -EINTR;
>

--
Yours sincerely,
Pavel Begunkov



2019-09-26 07:47:07

by Pavel Begunkov

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 24/09/2019 11:02, Jens Axboe wrote:
> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>> structure with a count in it, on the stack. Got some flight time
>>>> coming up later today, let me try and cook up a patch.
>>>
>>> Totally untested, and sent out 5 min before departure... But something
>>> like this.
>> Hmm, reminds me of my first version. Basically that's the same thing but
>> with macros inlined. I wanted to make it reusable and self-contained,
>> though.
>>
>> If you don't think it could be useful in other places, sure, we could do
>> something like that. Is that so?
>
> I totally agree it could be useful in other places. Maybe formalized and
> used with wake_up_nr() instead of adding a new primitive? Haven't looked
> into that, I may be talking nonsense.

@nr there is the number of tasks to wake up. AFAIK it doesn't solve the
problem.
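
For reference, from include/linux/wait.h:

    #define wake_up_nr(x, nr)    __wake_up(x, TASK_NORMAL, nr, NULL)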


>
> In any case, I did get a chance to test it and it works for me. Here's
> the "finished" version, slightly cleaned up and with a comment added
> for good measure.
>
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index ca7570aca430..14fae454cf75 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2768,6 +2768,42 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
> return submit;
> }
>
> +struct io_wait_queue {
> + struct wait_queue_entry wq;
> + struct io_ring_ctx *ctx;
> + struct task_struct *task;
> + unsigned to_wait;
> + unsigned nr_timeouts;
> +};
> +
> +static inline bool io_should_wake(struct io_wait_queue *iowq)
> +{
> + struct io_ring_ctx *ctx = iowq->ctx;
> +
> + /*
> + * Wake up if we have enough events, or if a timeout occurred since we
> + * started waiting. For timeouts, we always want to return to userspace,
> + * regardless of event count.
> + */
> + return io_cqring_events(ctx->rings) >= iowq->to_wait ||
> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
> +}
> +
> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
> + int wake_flags, void *key)
> +{
> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
> + wq);
> +
> + if (io_should_wake(iowq)) {
> + list_del_init(&curr->entry);
> + wake_up_process(iowq->task);
> + return 1;
> + }
> +
> + return -1;
> +}
> +
> /*
> * Wait until events become available, if we don't already have some. The
> * application must reap them itself, as they reside on the shared cq ring.
> @@ -2775,8 +2811,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
> static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> const sigset_t __user *sig, size_t sigsz)
> {
> + struct io_wait_queue iowq = {
> + .wq = {
> + .func = io_wake_function,
> + .entry = LIST_HEAD_INIT(iowq.wq.entry),
> + },
> + .task = current,
> + .ctx = ctx,
> + .to_wait = min_events,
> + };
> struct io_rings *rings = ctx->rings;
> - unsigned nr_timeouts;
> int ret;
>
> if (io_cqring_events(rings) >= min_events)
> @@ -2795,15 +2839,16 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> return ret;
> }
>
> - nr_timeouts = atomic_read(&ctx->cq_timeouts);
> - /*
> - * Return if we have enough events, or if a timeout occured since
> - * we started waiting. For timeouts, we always want to return to
> - * userspace.
> - */
> - ret = wait_event_interruptible(ctx->wait,
> - io_cqring_events(rings) >= min_events ||
> - atomic_read(&ctx->cq_timeouts) != nr_timeouts);
> + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
> + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
> + do {
> + if (io_should_wake(&iowq))
> + break;
> + schedule();
> + set_current_state(TASK_INTERRUPTIBLE);
> + } while (1);
> + finish_wait(&ctx->wait, &iowq.wq);
> +
> restore_saved_sigmask_unless(ret == -ERESTARTSYS);
> if (ret == -ERESTARTSYS)
> ret = -EINTR;
>

--
Yours sincerely,
Pavel Begunkov



2019-09-26 07:47:56

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 3:33 AM, Pavel Begunkov wrote:
>
>
> On 24/09/2019 11:36, Jens Axboe wrote:
>> On 9/24/19 2:27 AM, Jens Axboe wrote:
>>> On 9/24/19 2:02 AM, Jens Axboe wrote:
>>>> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>>>>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>>>>> structure with a count in it, on the stack. Got some flight time
>>>>>>> coming up later today, let me try and cook up a patch.
>>>>>>
>>>>>> Totally untested, and sent out 5 min before departure... But something
>>>>>> like this.
>>>>> Hmm, reminds me of my first version. Basically that's the same thing but
>>>>> with macros inlined. I wanted to make it reusable and self-contained,
>>>>> though.
>>>>>
>>>>> If you don't think it could be useful in other places, sure, we could do
>>>>> something like that. Is that so?
>>>>
>>>> I totally agree it could be useful in other places. Maybe formalized and
>>>> used with wake_up_nr() instead of adding a new primitive? Haven't looked
>>>> into that, I may be talking nonsense.
>>>>
>>>> In any case, I did get a chance to test it and it works for me. Here's
>>>> the "finished" version, slightly cleaned up and with a comment added
>>>> for good measure.
>>>
>>> Notes:
>>>
>>> This version gets the ordering right, you need exclusive waits to get
>>> fifo ordering on the waitqueue.
>>>
>>> Both versions (yours and mine) suffer from the problem of potentially
>>> waking too many. I don't think this is a real issue, as generally we
>>> don't do threaded access to the io_urings. But if you had the following
>>> tasks wait on the cqring:
>>>
>>> [min_events = 32], [min_events = 8], [min_events = 8]
>>>
>>> and we reach the io_cqring_events() == threshold, we'll wake all three.
>>> I don't see a good solution to this, so I suspect we just live with it
>>> until proven an issue. Both versions are much better than what we have
>>> now.
>>
>> Forgot an issue around signal handling, version below adds the
>> right check for that too.
>
> That seems like a good reason not to keep reimplementing the
> "prepare_to_wait*() + wait loop" pattern every time, but to keep it in
> sched :)

I think if we do the ->private cleanup that Peter mentioned, then
there's not much left in terms of consolidation. Not convinced the case
is interesting enough to warrant a special helper. If others show up,
it's easy enough to consolidate the use cases and unify them.

If you look at wake_up_nr(), I would have thought that would be more
widespread. But it really isn't.

>> Curious what your test case was for this?
> You mean a performance test case? It's briefly described in a comment
> for the second patch. It's just a rewritten io_uring-bench, with
> 1. a thread generating 1 request per call in a loop,
> 2. and a second thread waiting for ~128 events.
> Both are pinned to the same core.

Gotcha, thanks.

--
Jens Axboe

2019-09-26 07:48:21

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 4:13 AM, Jens Axboe wrote:
> On 9/24/19 3:49 AM, Peter Zijlstra wrote:
>> On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote:
>>
>>> +struct io_wait_queue {
>>> + struct wait_queue_entry wq;
>>> + struct io_ring_ctx *ctx;
>>> + struct task_struct *task;
>>
>> wq.private is where the normal waitqueue stores the task pointer.
>>
>> (I'm going to rename that)
>
> If you do that, then we can just base the io_uring parts on that.

Just took a quick look at it, and ran into block/kyber-iosched.c that
actually uses the private pointer for something that isn't a task
struct...

--
Jens Axboe

2019-09-26 07:48:31

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 3:21 AM, Pavel Begunkov wrote:
> On 24/09/2019 11:02, Jens Axboe wrote:
>> On 9/24/19 1:06 AM, Pavel Begunkov wrote:
>>> On 24/09/2019 02:00, Jens Axboe wrote:
>>>>> I think we can do the same thing, just wrapping the waitqueue in a
>>>>> structure with a count in it, on the stack. Got some flight time
>>>>> coming up later today, let me try and cook up a patch.
>>>>
>>>> Totally untested, and sent out 5 min before departure... But something
>>>> like this.
>>> Hmm, reminds me of my first version. Basically that's the same thing but
>>> with macros inlined. I wanted to make it reusable and self-contained,
>>> though.
>>>
>>> If you don't think it could be useful in other places, sure, we could do
>>> something like that. Is that so?
>>
>> I totally agree it could be useful in other places. Maybe formalized and
>> used with wake_up_nr() instead of adding a new primitive? Haven't looked
>> into that, I may be talking nonsense.
>
> @nr there is the number of tasks to wake up. AFAIK it doesn't solve the
> problem.

Ah right, embarrassingly I'm actually the one who added that
functionality ages ago...

--
Jens Axboe

2019-09-26 07:48:37

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 3:49 AM, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote:
>
>> +struct io_wait_queue {
>> + struct wait_queue_entry wq;
>> + struct io_ring_ctx *ctx;
>> + struct task_struct *task;
>
> wq.private is where the normal waitqueue stores the task pointer.
>
> (I'm going to rename that)

If you do that, then we can just base the io_uring parts on that.

>> + unsigned to_wait;
>> + unsigned nr_timeouts;
>> +};
>> +
>> +static inline bool io_should_wake(struct io_wait_queue *iowq)
>> +{
>> + struct io_ring_ctx *ctx = iowq->ctx;
>> +
>> + /*
>> + * Wake up if we have enough events, or if a timeout occurred since we
>> + * started waiting. For timeouts, we always want to return to userspace,
>> + * regardless of event count.
>> + */
>> + return io_cqring_events(ctx->rings) >= iowq->to_wait ||
>> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
>> +}
>> +
>> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
>> + int wake_flags, void *key)
>> +{
>> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
>> + wq);
>> +
>> + if (io_should_wake(iowq)) {
>> + list_del_init(&curr->entry);
>> + wake_up_process(iowq->task);
>
> Then you can use autoremove_wake_function() here.
>
>> + return 1;
>> + }
>> +
>> + return -1;
>> +}
>
> Ideally we'd get wait_event()'s @cond in a custom wake function. Then we
> can _always_ do this.
>
> This is one I'd love to have lambda functions for. It would actually
> work with GCC nested functions, because the wake function will always be
> in scope, but we can't use those in the kernel for other reasons :/

I'll be happy enough if I can just call autoremove_wake_function(), I
think that will simplify the case enough for io_uring to not really make
me care too much about going further. I'll leave that to you, if you
have the desire :-)

--
Jens Axboe

2019-09-26 07:50:27

by Pavel Begunkov

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 24/09/2019 14:15, Jens Axboe wrote:
> On 9/24/19 5:11 AM, Pavel Begunkov wrote:
>> On 24/09/2019 13:34, Jens Axboe wrote:
>>> On 9/24/19 4:13 AM, Jens Axboe wrote:
>>>> On 9/24/19 3:49 AM, Peter Zijlstra wrote:
>>>>> On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote:
>>>>>
>>>>>> +struct io_wait_queue {
>>>>>> + struct wait_queue_entry wq;
>>>>>> + struct io_ring_ctx *ctx;
>>>>>> + struct task_struct *task;
>>>>>
>>>>> wq.private is where the normal waitqueue stores the task pointer.
>>>>>
>>>>> (I'm going to rename that)
>>>>
>>>> If you do that, then we can just base the io_uring parts on that.
>>>
>>> Just took a quick look at it, and ran into block/kyber-iosched.c that
>>> actually uses the private pointer for something that isn't a task
>>> struct...
>>>
>>
>> Let's reuse autoremove_wake_function() anyway. I changed your patch a bit:
>
> Yep that should do it, and saves 8 bytes of stack as well.
>
> BTW, did you test my patch, this one or the previous? Just curious if it
> worked for you.
>
Not yet, going to do that tonight

--
Yours sincerely,
Pavel Begunkov



2019-09-26 07:58:03

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 5:11 AM, Pavel Begunkov wrote:
> On 24/09/2019 13:34, Jens Axboe wrote:
>> On 9/24/19 4:13 AM, Jens Axboe wrote:
>>> On 9/24/19 3:49 AM, Peter Zijlstra wrote:
>>>> On Tue, Sep 24, 2019 at 10:36:28AM +0200, Jens Axboe wrote:
>>>>
>>>>> +struct io_wait_queue {
>>>>> + struct wait_queue_entry wq;
>>>>> + struct io_ring_ctx *ctx;
>>>>> + struct task_struct *task;
>>>>
>>>> wq.private is where the normal waitqueue stores the task pointer.
>>>>
>>>> (I'm going to rename that)
>>>
>>> If you do that, then we can just base the io_uring parts on that.
>>
>> Just took a quick look at it, and ran into block/kyber-iosched.c that
>> actually uses the private pointer for something that isn't a task
>> struct...
>>
>
> Let's reuse autoremove_wake_function() anyway. I changed your patch a bit:

Yep that should do it, and saves 8 bytes of stack as well.

BTW, did you test my patch, this one or the previous? Just curious if it
worked for you.

--
Jens Axboe

2019-09-26 08:06:04

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 5:43 AM, Peter Zijlstra wrote:
> On Tue, Sep 24, 2019 at 02:11:29PM +0300, Pavel Begunkov wrote:
>
>> @@ -2717,15 +2757,18 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
>> return ret;
>> }
>>
>> + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
>> + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
>> + do {
>> + if (io_should_wake(&iowq))
>> + break;
>> + schedule();
>> + if (signal_pending(current))
>> + break;
>> + set_current_state(TASK_INTERRUPTIBLE);
>> + } while (1);
>> + finish_wait(&ctx->wait, &iowq.wq);
>
> It it likely OK, but for paranoia, I'd prefer this form:
>
> for (;;) {
>     prepare_to_wait_exclusive(&ctx->wait, &iowq.wq, TASK_INTERRUPTIBLE);
>     if (io_should_wake(&iowq))
>         break;
>     schedule();
>     if (signal_pending(current))
>         break;
> }
> finish_wait(&ctx->wait, &iowq.wq);
>
> The thing is, if we ever succeed with io_wake_function() (that CPU
> observes io_should_wake()), but when waking here, we do not observe
> is_wake_function() and go sleep again, we might never wake up if we
> don't put ourselves back on the wait-list again.

Might be paranoia indeed, but especially after this change, we don't
expect to make frequent roundtrips there. So better safe than sorry,
I'll make the change.

--
Jens Axboe

2019-09-26 08:09:49

by Jens Axboe

Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 5:23 AM, Pavel Begunkov wrote:
>> Yep that should do it, and saves 8 bytes of stack as well.
>>
>> BTW, did you test my patch, this one or the previous? Just curious if it
>> worked for you.
>>
> Not yet, going to do that tonight

Thanks! For reference, the final version is below. There was still a
signal mishap in there; now it should all be correct afaict.


diff --git a/fs/io_uring.c b/fs/io_uring.c
index 9b84232e5cc4..d2a86164d520 100644
--- a/fs/io_uring.c
+++ b/fs/io_uring.c
@@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
return submit;
}

+struct io_wait_queue {
+ struct wait_queue_entry wq;
+ struct io_ring_ctx *ctx;
+ unsigned to_wait;
+ unsigned nr_timeouts;
+};
+
+static inline bool io_should_wake(struct io_wait_queue *iowq)
+{
+ struct io_ring_ctx *ctx = iowq->ctx;
+
+ /*
+ * Wake up if we have enough events, or if a timeout occurred since we
+ * started waiting. For timeouts, we always want to return to userspace,
+ * regardless of event count.
+ */
+ return io_cqring_events(ctx->rings) >= iowq->to_wait ||
+ atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
+}
+
+static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
+ int wake_flags, void *key)
+{
+ struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
+ wq);
+
+ if (!io_should_wake(iowq))
+ return -1;
+
+ return autoremove_wake_function(curr, mode, wake_flags, key);
+}
+
/*
* Wait until events become available, if we don't already have some. The
* application must reap them itself, as they reside on the shared cq ring.
@@ -2775,8 +2807,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
const sigset_t __user *sig, size_t sigsz)
{
+ struct io_wait_queue iowq = {
+ .wq = {
+ .private = current,
+ .func = io_wake_function,
+ .entry = LIST_HEAD_INIT(iowq.wq.entry),
+ },
+ .ctx = ctx,
+ .to_wait = min_events,
+ };
struct io_rings *rings = ctx->rings;
- unsigned nr_timeouts;
int ret;

if (io_cqring_events(rings) >= min_events)
@@ -2795,15 +2835,20 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
return ret;
}

- nr_timeouts = atomic_read(&ctx->cq_timeouts);
- /*
- * Return if we have enough events, or if a timeout occured since
- * we started waiting. For timeouts, we always want to return to
- * userspace.
- */
- ret = wait_event_interruptible(ctx->wait,
- io_cqring_events(rings) >= min_events ||
- atomic_read(&ctx->cq_timeouts) != nr_timeouts);
+ iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
+ do {
+ prepare_to_wait_exclusive(&ctx->wait, &iowq.wq,
+ TASK_INTERRUPTIBLE);
+ if (io_should_wake(&iowq))
+ break;
+ schedule();
+ if (signal_pending(current)) {
+ ret = -ERESTARTSYS;
+ break;
+ }
+ } while (1);
+ finish_wait(&ctx->wait, &iowq.wq);
+
restore_saved_sigmask_unless(ret == -ERESTARTSYS);
if (ret == -ERESTARTSYS)
ret = -EINTR;

--
Jens Axboe
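
For context, this wait path is driven from userspace through io_uring_enter(), where min_complete becomes min_events above. A minimal raw-syscall sketch of a caller that waits for at least 128 completions, as in the cover-letter benchmark (no liburing; __NR_io_uring_enter is 426 on x86-64, and error handling is elided):

	#define _GNU_SOURCE
	#include <sys/syscall.h>
	#include <unistd.h>

	#ifndef __NR_io_uring_enter
	#define __NR_io_uring_enter	426
	#endif

	#define IORING_ENTER_GETEVENTS	(1U << 0)

	/* Block until at least @min_complete CQEs are available to reap. */
	static long wait_cqes(int ring_fd, unsigned int min_complete)
	{
		return syscall(__NR_io_uring_enter, ring_fd, 0 /* to_submit */,
			       min_complete, IORING_ENTER_GETEVENTS,
			       NULL /* sigset */, 0 /* sigsz */);
	}

	/* e.g. wait_cqes(ring_fd, 128); then reap CQEs from the shared ring */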

2019-09-26 08:38:34

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 24/09/2019 20:46, Jens Axboe wrote:
> On 9/24/19 11:33 AM, Pavel Begunkov wrote:
>> On 24/09/2019 16:13, Jens Axboe wrote:
>>> On 9/24/19 5:23 AM, Pavel Begunkov wrote:
>>>>> Yep that should do it, and saves 8 bytes of stack as well.
>>>>>
>>>>> BTW, did you test my patch, this one or the previous? Just curious if it
>>>>> worked for you.
>>>>>
>>>> Not yet, going to do that tonight
>>>
>>> Thanks! For reference, the final version is below. There was still a
>>> signal mishap in there; now it should all be correct, afaict.
>>>
>>>
>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>> index 9b84232e5cc4..d2a86164d520 100644
>>> --- a/fs/io_uring.c
>>> +++ b/fs/io_uring.c
>>> @@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
>>> return submit;
>>> }
>>>
>>> +struct io_wait_queue {
>>> + struct wait_queue_entry wq;
>>> + struct io_ring_ctx *ctx;
>>> + unsigned to_wait;
>>> + unsigned nr_timeouts;
>>> +};
>>> +
>>> +static inline bool io_should_wake(struct io_wait_queue *iowq)
>>> +{
>>> + struct io_ring_ctx *ctx = iowq->ctx;
>>> +
>>> + /*
>>> + * Wake up if we have enough events, or if a timeout occurred since we
>>> + * started waiting. For timeouts, we always want to return to userspace,
>>> + * regardless of event count.
>>> + */
>>> + return io_cqring_events(ctx->rings) >= iowq->to_wait ||
>>> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
>>> +}
>>> +
>>> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
>>> + int wake_flags, void *key)
>>> +{
>>> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
>>> + wq);
>>> +
>>> + if (!io_should_wake(iowq))
>>> + return -1;
>>
>> It would try to schedule only the first task in the wait list. Is that the
>> semantics you want?
>> E.g. for waiters=[32,8] and nr_events == 8, io_wake_function() returns
>> -1 at the 32-event waiter, and won't wake up the second one.
>
> Right, those are the semantics I want. We keep the list ordered by using
> the exclusive wait addition, which means that, for the case you list,
> waiters=32 came first and we should not wake others before that task
> gets the completions it wants. Otherwise we could potentially starve
> higher-count waiters if we always keep going and new waiters come in.
>
Yes. I think it would be better to document this in the userspace API. I
could imagine some crazy case deadlocking userspace, e.g.
thread 1: wait_events(8), reap_events
thread 2: wait_events(32), wait(thread 1), reap_events

Works well.
Reviewed-by: Pavel Begunkov <[email protected]>
Tested-by: Pavel Begunkov <[email protected]>

BTW, I searched for wait_event*(), and it seems there are plenty of
similar use cases. So a generic version would be useful, but that's for
later.



--
Yours sincerely,
Pavel Begunkov


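
A toy userspace model of those exclusive-wake semantics (not kernel code; all names invented for illustration): waiters sit in queue order [32, 8], events arrive one at a time, and the walk stops at the first waiter whose threshold isn't met, mirroring io_wake_function() returning -1.

	#include <stdio.h>

	struct waiter { unsigned int threshold; int woken; };

	int main(void)
	{
		struct waiter q[] = { { 32, 0 }, { 8, 0 } };	/* queue order */
		unsigned int nr_events = 0;

		for (int ev = 0; ev < 10; ev++) {
			nr_events++;			/* one completion arrives */
			for (int i = 0; i < 2; i++) {
				if (q[i].woken)
					continue;
				if (nr_events < q[i].threshold)
					break;		/* wake func returns -1 */
				q[i].woken = 1;		/* exclusive: wake one */
				break;
			}
		}
		printf("events=%u: waiter(32) %s, waiter(8) %s\n", nr_events,
		       q[0].woken ? "woken" : "asleep",
		       q[1].woken ? "woken" : "asleep");
		return 0;
	}

After 10 events this prints both waiters still asleep: the 8-event waiter is starved behind the 32-event waiter, which is the ordering the deadlock example above relies on.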

2019-09-26 10:49:41

by Pavel Begunkov

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 24/09/2019 16:13, Jens Axboe wrote:
> On 9/24/19 5:23 AM, Pavel Begunkov wrote:
>>> Yep that should do it, and saves 8 bytes of stack as well.
>>>
>>> BTW, did you test my patch, this one or the previous? Just curious if it
>>> worked for you.
>>>
>> Not yet, going to do that tonight
>
> Thanks! For reference, the final version is below. There was still a
> signal mishap in there; now it should all be correct, afaict.
>
>
> diff --git a/fs/io_uring.c b/fs/io_uring.c
> index 9b84232e5cc4..d2a86164d520 100644
> --- a/fs/io_uring.c
> +++ b/fs/io_uring.c
> @@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
> return submit;
> }
>
> +struct io_wait_queue {
> + struct wait_queue_entry wq;
> + struct io_ring_ctx *ctx;
> + unsigned to_wait;
> + unsigned nr_timeouts;
> +};
> +
> +static inline bool io_should_wake(struct io_wait_queue *iowq)
> +{
> + struct io_ring_ctx *ctx = iowq->ctx;
> +
> + /*
> + * Wake up if we have enough events, or if a timeout occurred since we
> + * started waiting. For timeouts, we always want to return to userspace,
> + * regardless of event count.
> + */
> + return io_cqring_events(ctx->rings) >= iowq->to_wait ||
> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
> +}
> +
> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
> + int wake_flags, void *key)
> +{
> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
> + wq);
> +
> + if (!io_should_wake(iowq))
> + return -1;

It would try to schedule only the first task in the wait list. Is that the
semantics you want?
E.g. for waiters=[32,8] and nr_events == 8, io_wake_function() returns
-1 at the 32-event waiter, and won't wake up the second one.

> +
> + return autoremove_wake_function(curr, mode, wake_flags, key);
> +}
> +
> /*
> * Wait until events become available, if we don't already have some. The
> * application must reap them itself, as they reside on the shared cq ring.
> @@ -2775,8 +2807,16 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
> static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> const sigset_t __user *sig, size_t sigsz)
> {
> + struct io_wait_queue iowq = {
> + .wq = {
> + .private = current,
> + .func = io_wake_function,
> + .entry = LIST_HEAD_INIT(iowq.wq.entry),
> + },
> + .ctx = ctx,
> + .to_wait = min_events,
> + };
> struct io_rings *rings = ctx->rings;
> - unsigned nr_timeouts;
> int ret;
>
> if (io_cqring_events(rings) >= min_events)
> @@ -2795,15 +2835,20 @@ static int io_cqring_wait(struct io_ring_ctx *ctx, int min_events,
> return ret;
> }
>
> - nr_timeouts = atomic_read(&ctx->cq_timeouts);
> - /*
> - * Return if we have enough events, or if a timeout occured since
> - * we started waiting. For timeouts, we always want to return to
> - * userspace.
> - */
> - ret = wait_event_interruptible(ctx->wait,
> - io_cqring_events(rings) >= min_events ||
> - atomic_read(&ctx->cq_timeouts) != nr_timeouts);
> + iowq.nr_timeouts = atomic_read(&ctx->cq_timeouts);
> + do {
> + prepare_to_wait_exclusive(&ctx->wait, &iowq.wq,
> + TASK_INTERRUPTIBLE);
> + if (io_should_wake(&iowq))
> + break;
> + schedule();
> + if (signal_pending(current)) {
> + ret = -ERESTARTSYS;
> + break;
> + }
> + } while (1);
> + finish_wait(&ctx->wait, &iowq.wq);
> +
> restore_saved_sigmask_unless(ret == -ERESTARTSYS);
> if (ret == -ERESTARTSYS)
> ret = -EINTR;
>

--
Yours sincerely,
Pavel Begunkov



2019-09-26 10:50:45

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 11:33 AM, Pavel Begunkov wrote:
> On 24/09/2019 16:13, Jens Axboe wrote:
>> On 9/24/19 5:23 AM, Pavel Begunkov wrote:
>>>> Yep that should do it, and saves 8 bytes of stack as well.
>>>>
>>>> BTW, did you test my patch, this one or the previous? Just curious if it
>>>> worked for you.
>>>>
>>> Not yet, going to do that tonight
>>
>> Thanks! For reference, the final version is below. There was still a
>> signal mishap in there; now it should all be correct, afaict.
>>
>>
>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>> index 9b84232e5cc4..d2a86164d520 100644
>> --- a/fs/io_uring.c
>> +++ b/fs/io_uring.c
>> @@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
>> return submit;
>> }
>>
>> +struct io_wait_queue {
>> + struct wait_queue_entry wq;
>> + struct io_ring_ctx *ctx;
>> + unsigned to_wait;
>> + unsigned nr_timeouts;
>> +};
>> +
>> +static inline bool io_should_wake(struct io_wait_queue *iowq)
>> +{
>> + struct io_ring_ctx *ctx = iowq->ctx;
>> +
>> + /*
>> + * Wake up if we have enough events, or if a timeout occurred since we
>> + * started waiting. For timeouts, we always want to return to userspace,
>> + * regardless of event count.
>> + */
>> + return io_cqring_events(ctx->rings) >= iowq->to_wait ||
>> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
>> +}
>> +
>> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
>> + int wake_flags, void *key)
>> +{
>> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
>> + wq);
>> +
>> + if (!io_should_wake(iowq))
>> + return -1;
>
> It would try to schedule only the first task in the wait list. Is that the
> semantics you want?
> E.g. for waiters=[32,8] and nr_events == 8, io_wake_function() returns
> -1 at the 32-event waiter, and won't wake up the second one.

Right, those are the semantics I want. We keep the list ordered by using
the exclusive wait addition, which means that, for the case you list,
waiters=32 came first and we should not wake others before that task
gets the completions it wants. Otherwise we could potentially starve
higher-count waiters if we always keep going and new waiters come in.

--
Jens Axboe
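
For reference, what makes the -1 return stop the walk: a trimmed sketch of __wake_up_common() (kernel/sched/wait.c, v5.3-era; bookmark handling omitted):

	static int __wake_up_common(struct wait_queue_head *wq_head, unsigned int mode,
				    int nr_exclusive, int wake_flags, void *key)
	{
		wait_queue_entry_t *curr, *next;

		list_for_each_entry_safe(curr, next, &wq_head->head, entry) {
			unsigned flags = curr->flags;
			int ret = curr->func(curr, mode, wake_flags, key);

			if (ret < 0)
				break;	/* e.g. io_wake_function() said "not yet" */
			if (ret && (flags & WQ_FLAG_EXCLUSIVE) && !--nr_exclusive)
				break;	/* woke enough exclusive waiters */
		}
		return nr_exclusive;
	}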

2019-09-26 10:54:42

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH v2 0/2] Optimise io_uring completion waiting

On 9/24/19 12:28 PM, Pavel Begunkov wrote:
> On 24/09/2019 20:46, Jens Axboe wrote:
>> On 9/24/19 11:33 AM, Pavel Begunkov wrote:
>>> On 24/09/2019 16:13, Jens Axboe wrote:
>>>> On 9/24/19 5:23 AM, Pavel Begunkov wrote:
>>>>>> Yep that should do it, and saves 8 bytes of stack as well.
>>>>>>
>>>>>> BTW, did you test my patch, this one or the previous? Just curious if it
>>>>>> worked for you.
>>>>>>
>>>>> Not yet, going to do that tonight
>>>>
>>>> Thanks! For reference, the final version is below. There was still a
>>>> signal mishap in there; now it should all be correct, afaict.
>>>>
>>>>
>>>> diff --git a/fs/io_uring.c b/fs/io_uring.c
>>>> index 9b84232e5cc4..d2a86164d520 100644
>>>> --- a/fs/io_uring.c
>>>> +++ b/fs/io_uring.c
>>>> @@ -2768,6 +2768,38 @@ static int io_ring_submit(struct io_ring_ctx *ctx, unsigned int to_submit,
>>>> return submit;
>>>> }
>>>>
>>>> +struct io_wait_queue {
>>>> + struct wait_queue_entry wq;
>>>> + struct io_ring_ctx *ctx;
>>>> + unsigned to_wait;
>>>> + unsigned nr_timeouts;
>>>> +};
>>>> +
>>>> +static inline bool io_should_wake(struct io_wait_queue *iowq)
>>>> +{
>>>> + struct io_ring_ctx *ctx = iowq->ctx;
>>>> +
>>>> + /*
>>>> + * Wake up if we have enough events, or if a timeout occurred since we
>>>> + * started waiting. For timeouts, we always want to return to userspace,
>>>> + * regardless of event count.
>>>> + */
>>>> + return io_cqring_events(ctx->rings) >= iowq->to_wait ||
>>>> + atomic_read(&ctx->cq_timeouts) != iowq->nr_timeouts;
>>>> +}
>>>> +
>>>> +static int io_wake_function(struct wait_queue_entry *curr, unsigned int mode,
>>>> + int wake_flags, void *key)
>>>> +{
>>>> + struct io_wait_queue *iowq = container_of(curr, struct io_wait_queue,
>>>> + wq);
>>>> +
>>>> + if (!io_should_wake(iowq))
>>>> + return -1;
>>>
>>> It would try to schedule only the first task in the wait list. Is that the
>>> semantics you want?
>>> E.g. for waiters=[32,8] and nr_events == 8, io_wake_function() returns
>>> -1 at the 32-event waiter, and won't wake up the second one.
>>
>> Right, those are the semantics I want. We keep the list ordered by using
>> the exclusive wait addition, which means that, for the case you list,
>> waiters=32 came first and we should not wake others before that task
>> gets the completions it wants. Otherwise we could potentially starve
>> higher-count waiters if we always keep going and new waiters come in.
>>
> Yes. I think it would be better to document this in the userspace API. I
> could imagine some crazy case deadlocking userspace, e.g.
> thread 1: wait_events(8), reap_events
> thread 2: wait_events(32), wait(thread 1), reap_events

No matter how you handle cases like this, deadlocks will always be
possible... So I don't think that's a huge concern. It's more important
not to have inherent livelocks, which we would have if we always
allowed the lowest wait count to get woken and steal the budget
every time.

> Works well.
> Reviewed-by: Pavel Begunkov <[email protected]>
> Tested-by: Pavel Begunkov <[email protected]>

Thanks, will add!

> BTW, I searched for wait_event*(), and it seems there are plenty of
> similar use cases. So a generic version would be useful, but that's for
> later.

Agree, it would undoubtedly be useful.

--
Jens Axboe
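
A sketch of what such a generic variant might look like, building on the same wake-function trick and echoing the wake_up_threshold() from the original patchset (the bodies below are illustrative, not code from either series):

	struct threshold_waiter {
		struct wait_queue_entry	wq;
		unsigned int		threshold;
	};

	static int threshold_wake_function(struct wait_queue_entry *curr,
					   unsigned int mode, int wake_flags,
					   void *key)
	{
		struct threshold_waiter *tw =
			container_of(curr, struct threshold_waiter, wq);

		/* The waker passes the current event count via @key. */
		if ((unsigned long)key < tw->threshold)
			return -1;	/* not enough yet: stop the walk */
		return autoremove_wake_function(curr, mode, wake_flags, key);
	}

	/* Waker side, e.g. on each batch of completions: */
	#define wake_up_threshold(wq, nr) \
		__wake_up(wq, TASK_NORMAL, 1, (void *)(unsigned long)(nr))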