2006-12-30 02:31:44

by Chen, Kenneth W

[permalink] [raw]
Subject: [patch] aio: add per task aio wait event condition

The AIO wake-up notification from aio_complete is really inefficient
in current AIO implementation in the presence of process waiting in
io_getevents().

For example, if app calls io_getevents with min_nr > 1, and aio event
queue doesn't have enough completed aio event, the process will block
in read_events(). However, aio_complete() will wake up the waiting
process for *each* complete I/O even though number of events that an
app is waiting for is much larger than 1. This makes excessive and
unnecessary context switch because the waiting process will just reap
one single event and goes back to sleep again. It is much more efficient
to wake up the waiting process when there are enough events for it to
reap.

This patch adds a wait condition to the wait queue and only wake-up
process when that condition meets. And this condition is added on a
per task base for handling multi-threaded app that shares single ioctx.

To show the effect of this patch, here is an vmstat output before and
after the patch. The app does random O_DIRECT AIO on 60 disks. Context
switch is reduced from 13 thousand+ down to just 40+, an significant
improvement.

Before:
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
0 0 0 3972608 7056 31312 0 0 14000 0 7840 13715 0 2 98 0
0 0 0 3972608 7056 31312 0 0 14300 0 7793 13641 0 2 98 0
0 0 0 3972608 7056 31312 0 0 14100 0 7885 13747 0 2 98 0

After:
0 0 0 3972608 7056 31312 0 0 14000 0 7840 49 0 2 98 0
0 0 0 3972608 7056 31312 0 0 13800 0 7793 53 0 2 98 0
0 0 0 3972608 7056 31312 0 0 13800 0 7885 42 0 2 98 0


Signed-off-by: Ken Chen <[email protected]>


--- ./fs/aio.c.orig 2006-12-24 16:41:39.000000000 -0800
+++ ./fs/aio.c 2006-12-24 16:52:15.000000000 -0800
@@ -193,6 +193,17 @@ static int aio_setup_ring(struct kioctx
kunmap_atomic((void *)((unsigned long)__event & PAGE_MASK), km); \
} while(0)

+struct aio_wait_queue {
+ int nr_wait; /* wake-up condition */
+ wait_queue_t wait;
+};
+
+static inline void aio_init_wait(struct aio_wait_queue *wait)
+{
+ wait->nr_wait = 0;
+ init_wait(&wait->wait);
+}
+
/* ioctx_alloc
* Allocates and initializes an ioctx. Returns an ERR_PTR if it failed.
*/
@@ -296,13 +307,14 @@ static void aio_cancel_all(struct kioctx
static void wait_for_all_aios(struct kioctx *ctx)
{
struct task_struct *tsk = current;
- DECLARE_WAITQUEUE(wait, tsk);
+ struct aio_wait_queue wait;

spin_lock_irq(&ctx->ctx_lock);
if (!ctx->reqs_active)
goto out;

- add_wait_queue(&ctx->wait, &wait);
+ aio_init_wait(&wait);
+ add_wait_queue(&ctx->wait, &wait.wait);
set_task_state(tsk, TASK_UNINTERRUPTIBLE);
while (ctx->reqs_active) {
spin_unlock_irq(&ctx->ctx_lock);
@@ -311,7 +323,7 @@ static void wait_for_all_aios(struct kio
spin_lock_irq(&ctx->ctx_lock);
}
__set_task_state(tsk, TASK_RUNNING);
- remove_wait_queue(&ctx->wait, &wait);
+ remove_wait_queue(&ctx->wait, &wait.wait);

out:
spin_unlock_irq(&ctx->ctx_lock);
@@ -932,6 +944,7 @@ int fastcall aio_complete(struct kiocb *
unsigned long flags;
unsigned long tail;
int ret;
+ int nr_evt = 0;

/*
* Special case handling for sync iocbs:
@@ -992,6 +1005,9 @@ int fastcall aio_complete(struct kiocb *
info->tail = tail;
ring->tail = tail;

+ nr_evt = ring->tail - ring->head;
+ if (nr_evt < 0)
+ nr_evt += info->nr;
put_aio_ring_event(event, KM_IRQ0);
kunmap_atomic(ring, KM_IRQ1);

@@ -1000,8 +1016,13 @@ put_rq:
/* everything turned out well, dispose of the aiocb. */
ret = __aio_put_req(ctx, iocb);

- if (waitqueue_active(&ctx->wait))
- wake_up(&ctx->wait);
+ if (waitqueue_active(&ctx->wait)) {
+ struct aio_wait_queue *wait;
+ wait = container_of(ctx->wait.task_list.next,
+ struct aio_wait_queue, wait.task_list);
+ if (nr_evt >= wait->nr_wait)
+ wake_up(&ctx->wait);
+ }

spin_unlock_irqrestore(&ctx->ctx_lock, flags);
return ret;
@@ -1094,7 +1115,7 @@ static int read_events(struct kioctx *ct
{
long start_jiffies = jiffies;
struct task_struct *tsk = current;
- DECLARE_WAITQUEUE(wait, tsk);
+ struct aio_wait_queue wait;
int ret;
int i = 0;
struct io_event ent;
@@ -1152,10 +1173,11 @@ retry:
set_timeout(start_jiffies, &to, &ts);
}

+ aio_init_wait(&wait);
while (likely(i < nr)) {
- add_wait_queue_exclusive(&ctx->wait, &wait);
do {
- set_task_state(tsk, TASK_INTERRUPTIBLE);
+ prepare_to_wait_exclusive(&ctx->wait, &wait.wait,
+ TASK_INTERRUPTIBLE);
ret = aio_read_evt(ctx, &ent);
if (ret)
break;
@@ -1164,6 +1186,7 @@ retry:
ret = 0;
if (to.timed_out) /* Only check after read evt */
break;
+ wait.nr_wait = min_nr - i;
schedule();
if (signal_pending(tsk)) {
ret = -EINTR;
@@ -1171,9 +1194,7 @@ retry:
}
/*ret = aio_read_evt(ctx, &ent);*/
} while (1) ;
-
- set_task_state(tsk, TASK_RUNNING);
- remove_wait_queue(&ctx->wait, &wait);
+ finish_wait(&ctx->wait, &wait.wait);

if (unlikely(ret <= 0))
break;


2007-01-03 00:49:24

by Zach Brown

[permalink] [raw]
Subject: Re: [patch] aio: add per task aio wait event condition


On Dec 29, 2006, at 6:31 PM, Chen, Kenneth W wrote:

> The AIO wake-up notification from aio_complete is really inefficient
> in current AIO implementation in the presence of process waiting in
> io_getevents().

Yeah, it's a real deficiency. Thanks for taking a stab at it.

> This patch adds a wait condition to the wait queue and only wake-up
> process when that condition meets. And this condition is added on a
> per task base for handling multi-threaded app that shares single
> ioctx.

But only one of the waiting tasks is tested, the one at the head of
the list. It looks like this change could starve a io_getevents()
with a low min_nr in the presence of another io_getevents() with a
larger min_nr.

> Before:
> 0 0 0 3972608 7056 31312 0 0 14100 0 7885
> 13747 0 2 98 0
> After:
> 0 0 0 3972608 7056 31312 0 0 13800 0 7885
> 42 0 2 98 0

Nice. What min_nr was used in this test?

> +struct aio_wait_queue {
> + int nr_wait; /* wake-up condition */

It appears that this is never assigned a negative? Can we make it
that explicit in the type so that we reviewers don't have to worry
about wrapping and signed comparisons?

> - DECLARE_WAITQUEUE(wait, tsk);
> + struct aio_wait_queue wait;

> + aio_init_wait(&wait);

This just changed from using default_wake_function() to
autoremove_wait_function(). Very sneaky! wait_for_all_aios() should
be adding the wait queue before going to sleep each time. (better
still to just use wait_event()).

Was this on purpose? I'm all for it as a way to reduce wakeups from
a stream of completions to a single waiter.

> + nr_evt = ring->tail - ring->head;
> + if (nr_evt < 0)
> + nr_evt += info->nr;

int = unsigned - unsigned;
if (int < 0)

My head already hurts. Can we clean this up so one doesn't have to
live and breath type conversion rules to tell if this code is correct?

> + if (waitqueue_active(&ctx->wait)) {
> + struct aio_wait_queue *wait;
> + wait = container_of(ctx->wait.task_list.next,
> + struct aio_wait_queue, wait.task_list);
> + if (nr_evt >= wait->nr_wait)
> + wake_up(&ctx->wait);
> + }

First is the fear of starvation as mentioned previously.

issue 2 ops
first io_getevents sleeps with a min_nr of 2
second io_getevents sleeps with min_nr of 3
2 ops complete but only test the second sleeper's min_nr of 3
first sleeper twiddles thumbs

This makes me think this elegant task_list approach is doomed. I
think this is what stopped Ben and I from being interested in this
last time we talked about it :).

Also, is that container_of() and dereference safe in the presence of
racing wake-ups? It looks like we could get deref a freed wait and
get a bogus nr_wait and decide not to wake.

Andrew, I fear we should remove this from -mm until it's fixed up.

- z

2007-01-03 01:16:57

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [patch] aio: add per task aio wait event condition

Zach Brown wrote on Tuesday, January 02, 2007 4:49 PM
> On Dec 29, 2006, at 6:31 PM, Chen, Kenneth W wrote:
> > This patch adds a wait condition to the wait queue and only wake-up
> > process when that condition meets. And this condition is added on a
> > per task base for handling multi-threaded app that shares single
> > ioctx.
>
> But only one of the waiting tasks is tested, the one at the head of
> the list. It looks like this change could starve a io_getevents()
> with a low min_nr in the presence of another io_getevents() with a
> larger min_nr.
>
> > + if (waitqueue_active(&ctx->wait)) {
> > + struct aio_wait_queue *wait;
> > + wait = container_of(ctx->wait.task_list.next,
> > + struct aio_wait_queue, wait.task_list);
> > + if (nr_evt >= wait->nr_wait)
> > + wake_up(&ctx->wait);
> > + }
>
> First is the fear of starvation as mentioned previously.
>
> issue 2 ops
> first io_getevents sleeps with a min_nr of 2
> second io_getevents sleeps with min_nr of 3
> 2 ops complete but only test the second sleeper's min_nr of 3
> first sleeper twiddles thumbs

That is not possible because when multiple tasks waiting for events, they
enter the wait queue in FIFO order, prepare_to_wait_exclusive() does
__add_wait_queue_tail(). So first io_getevents() with min_nr of 2 will
be woken up when 2 ops completes.

2007-01-03 01:23:37

by Zach Brown

[permalink] [raw]
Subject: Re: [patch] aio: add per task aio wait event condition

>
> That is not possible because when multiple tasks waiting for
> events, they
> enter the wait queue in FIFO order, prepare_to_wait_exclusive() does
> __add_wait_queue_tail(). So first io_getevents() with min_nr of 2
> will
> be woken up when 2 ops completes.

So switch the order of the two sleepers in the example?

The point is that there's no way to guarantee that the head of the
wait queue will be the lowest min_nr.

I got list_add() from the add_wait_queue() still being used in
wait_for_all_aios(), fwiw. My mistake.

- z

2007-01-03 01:50:55

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [patch] aio: add per task aio wait event condition

Zach Brown wrote on Tuesday, January 02, 2007 5:24 PM
> > That is not possible because when multiple tasks waiting for
> > events, they
> > enter the wait queue in FIFO order, prepare_to_wait_exclusive() does
> > __add_wait_queue_tail(). So first io_getevents() with min_nr of 2
> > will be woken up when 2 ops completes.
>
> So switch the order of the two sleepers in the example?

Not sure why that would be a problem though: whoever sleep first will
be woken up first.


> The point is that there's no way to guarantee that the head of the
> wait queue will be the lowest min_nr.

Before I challenge that semantics, I want to mention that in current
implementation, dribbling AIO events will be distributed in round robin
fashion to all pending tasks waiting in io_getevents. In the example you
gave earlier, task with min_nr of 2 will be woken up after 4 completed
events. I consider that as an undesirable behavior as well.

Going back to your counter argument, why do we need the lowest min_nr in
the head of the queue? These are tasks that shares one aio ctx and ioctx
is shareable only among threads. Any reason why round robin policy is
superior than FIFO? Also presumably, threads that shares ioctx should be
capable of handling events for the same ioctx.

>From wakeup order point of view, yes, tasks with lowest min_nr wakes up
first, but looking from io completion order, they are not. And these are
the source of excessive ctx switch.

2007-01-03 02:06:27

by Zach Brown

[permalink] [raw]
Subject: Re: [patch] aio: add per task aio wait event condition


On Jan 2, 2007, at 5:50 PM, Chen, Kenneth W wrote:

> Zach Brown wrote on Tuesday, January 02, 2007 5:24 PM
>>> That is not possible because when multiple tasks waiting for
>>> events, they
>>> enter the wait queue in FIFO order, prepare_to_wait_exclusive() does
>>> __add_wait_queue_tail(). So first io_getevents() with min_nr of 2
>>> will be woken up when 2 ops completes.
>>
>> So switch the order of the two sleepers in the example?
>
> Not sure why that would be a problem though: whoever sleep first will
> be woken up first.

Why would the min_nr = 3 sleeper be woken up in that case? Only 2
ios were issued.

Maybe the app was relying on the min_nr = 2 completion to issue 3
more ios for the min_nr = 3 sleeper, who knows.

> Before I challenge that semantics, I want to mention that in current
> implementation, dribbling AIO events will be distributed in round
> robin
> fashion to all pending tasks waiting in io_getevents.

Yeah, don't misunderstand me -- we agree that the current situation
is bad.

> In the example you
> gave earlier, task with min_nr of 2 will be woken up after 4 completed
> events.

I only gave 2 ios/events in that example.

Does that clear up the confusion?

- z

2007-01-03 02:32:27

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [patch] aio: add per task aio wait event condition

Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM
> On Jan 2, 2007, at 5:50 PM, Chen, Kenneth W wrote:
> > Zach Brown wrote on Tuesday, January 02, 2007 5:24 PM
> >>> That is not possible because when multiple tasks waiting for
> >>> events, they
> >>> enter the wait queue in FIFO order, prepare_to_wait_exclusive() does
> >>> __add_wait_queue_tail(). So first io_getevents() with min_nr of 2
> >>> will be woken up when 2 ops completes.
> >>
> >> So switch the order of the two sleepers in the example?
> >
> > Not sure why that would be a problem though: whoever sleep first will
> > be woken up first.
>
> Why would the min_nr = 3 sleeper be woken up in that case? Only 2
> ios were issued.
>
> Maybe the app was relying on the min_nr = 2 completion to issue 3
> more ios for the min_nr = 3 sleeper, who knows.
>
> Does that clear up the confusion?


Not really. I don't think I understand your concern. You gave an example:

issue 2 ops
first io_getevents sleeps with a min_nr of 2
second io_getevents sleeps with min_nr of 3
2 ops complete but only test the second sleeper's min_nr of 3
first sleeper twiddles thumbs

Or:

issue 2 ops
first io_getevents sleeps with a min_nr of 3
second io_getevents sleeps with min_nr of 2
2 ops complete but only test the second sleeper's min_nr of 2
first sleeper twiddles thumbs


First scenario doesn't exist because in the new scheme, we test first
sleeper (as in head of the queue) when 2 ops complete. It wakes up first.

2nd scenario is OK to me because first sleeper waiting for 3 events,
and there are only 2 ops completed, so it waits.

The one scenario that I can think of that breaks down is that one task
sleeps with min_nr of 100. Then 50 ops completed. Comes along 2nd
thread does a io_getevents and it will take all 50 events in the 2nd
thread. Is that what you are talking about? It doesn't involve two
sleepers. That I can fix by testing whether wait queue is active or
not at the beginning of fast path in read_events().

The bigger question is: what is the semantics on event reap order for
thread? Random, FIFO or round robin? It is not specified anywhere.
What would be the most optimal policy?

2007-01-03 06:36:30

by Chen, Kenneth W

[permalink] [raw]
Subject: RE: [patch] aio: add per task aio wait event condition

Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM
> > In the example you
> > gave earlier, task with min_nr of 2 will be woken up after 4 completed
> > events.
>
> I only gave 2 ios/events in that example.
>
> Does that clear up the confusion?

It occurs to me that people might not be aware how peculiar the
current io_getevent wakeup scheme is, to the extend of erratic
behavior.

In the blocking path of read_events(), we essentially doing the
following loop (simplified for clarity):

while (i < nr) {
add_wait_queue_exclusive(&ctx->wait, &wait);
do {
ret = aio_read_evt(ctx, &ent);
if (!ret)
schedule();
while (1);
remove_wait_queue(&ctx->wait, &wait);
copy_to_user(event, &ent, sizeof(ent));
}

Noticed that when thread comes out of schedule(), it removes itself
from the wait queue, and requeue itself at the end of the wait queue
for each and every event it reaps. So if there are multiple threads
waiting in io_getevents, completed I/O are handed out in round robin
scheme to all waiting threads.

To illustrate it in ascii graph, here is what happens:

thread 1 thread 2

queue at head
schedule()

queue at 2nd position
schedule

aio_complete
(event 1)
remove_wait_queue (now thread 2 is at head)
reap event 1
requeue at tail
schedule

aio_complete
(event 2)
remove_wait_queue (now thread 1 is at head)
reap event 2
requeue at tail
schedule

If thread 1 sleeps first with min_nr = 2, and thread 2 sleeps
second with min_nr = 3, then thread 1 wakes up on event _3_.
But if thread 2 sleeps first, thread 1 sleeps second, thread 1
wakes up on event _4_. If someone ask me to describe algorithm
of io_getevents wake-up scheme in the presence of multiple
waiters, I call it erratic and un-deterministic.

Looking back to the example Zach gave earlier, current
implementation behaves just like what described as an undesired
bug (modified and tortured):

issue 2 ops
first io_getevents sleeps with a min_nr of 2
second io_getevents sleeps with min_nr of 3
2 ops complete
first sleeper twiddles thumbs

So I can categorize my patchset as a bug fix instead of a
performance patch ;-) Let's be serious, this ought to be fixed
one way or the other.


- Ken

2007-01-03 18:50:13

by Zach Brown

[permalink] [raw]
Subject: Re: [patch] aio: add per task aio wait event condition


On Jan 2, 2007, at 10:36 PM, Chen, Kenneth W wrote:

> Zach Brown wrote on Tuesday, January 02, 2007 6:06 PM
>>> In the example you
>>> gave earlier, task with min_nr of 2 will be woken up after 4
>>> completed
>>> events.
>>
>> I only gave 2 ios/events in that example.
>>
>> Does that clear up the confusion?
>
> It occurs to me that people might not be aware how peculiar the
> current io_getevent wakeup scheme is, to the extend of erratic
> behavior.

Yeah. I have to admit that I hadn't fully grasped the behaviour of
the interface when we started this thread. I realized just how
insane it was as I was finishing off last night.

> wakes up on event _4_. If someone ask me to describe algorithm
> of io_getevents wake-up scheme in the presence of multiple
> waiters, I call it erratic and un-deterministic.

I agree completely. This leads me to assume that essentially *no
one* has multiple waiters on a context who care about min_nr. That's
good news because we can consider it the slow path for now :)

> So I can categorize my patchset as a bug fix instead of a
> performance patch ;-) Let's be serious, this ought to be fixed
> one way or the other.

Yeah. How about this:

Start with a cleanup of the event copying loop:

- prefault user event dest with fault_in_pages_writable()
- copy multiple events with _inatomic
- only copy events if min_nr are available
- back off and retry if _inatomic faults
- go back to sleep if too few events are available
- return once a copy doesn't fault

(I guess one could only fall back to prefaulting if the _inatomic
fails. I don't know if most event buffers fault or not, I'd guess not.)

This would get rid of the nonsense of consuming events (possibly
depriving other concurrent waiters) without returning them to
userspace in a timely fashion. We get rid of a lot of the per-event-
delivery overhead that the current loop suffers from.

The controversial part here happens when min_nr is larger than the
ring size. In that case I think we should consider min_nr to be
equal to the ring size. We'll return fewer events than userspace
asked for, but it'll still be a big batch.

Does io_getevents() returning +ve but < min_nr raise alarm bells for
anyone? It's already done in the timeout case, so arguably code
already handles it.

Now, on the waking and sleeping side:

- min_nr 0 and 1 as exclusive waiters in ctx->wait
- maintain a rbtree of with min_nr of waiters when > 1
- always wake ctx->wait if it's pending
- wake rb_first of the tree if nr_in_ring >= node->min_nr

The rbtree nodes would have a task_struct pointer and its waker would
use wake_up_process(). It'd be maintained under the ctx lock. It'll
be expensive to maintain in the case of lots of concurrent min_nr > 1
waiters, but we're declaring that the slow path.

The fast path of a single waiter with > 1 min_nr becomes a pointer
indirection through the root node of the rb tree. It should be just
like the current list_head use in the wait queue. If the wait queue
head and rbtree root node share a cacheline in the context then
checking them both (and almost always only finding one of them
populated) shouldn't cost the completion path much.

I don't have a strong opinion about using exclusive wake-ups or not
because I doubt we ever have concurrent waiters in the field. I'm
leaning towards avoiding the thundering herd, even if that risks the
event delivery stream waiting behind a woken waiter who might be
blocking while prefaulting their destination event buffer. We should
wake from io_getevents() before it sleeps or returns, then. I could
easily be convinced otherwise.

The wait queue lets us support lots of min_nr == 1 waiters without
the rbtree maintenance overhead.. they're not as self-evidently
broken in the current code as multiple min_nr > 1 waiters, so they
may well be in use? who knows. It's easy.

How's this all sound? I can throw this together if you'd prefer it
that way. Though I suspect you'll want to keep hacking on this :).

Finally, let's get a test case for multiple min_nr > 1 waiters into
autotest. I'd use our trivial example where the min_nr = 2 waiter
fires of 3 ios for the concurrent min_nr = 3 waiter to ensure that
things don't get stuck.

- z

2007-02-01 21:46:43

by Jeff Moyer

[permalink] [raw]
Subject: Re: [patch] aio: add per task aio wait event condition

==> On Wed, 3 Jan 2007 10:50:06 -0800, Zach Brown <[email protected]> said:

Zach> The controversial part here happens when min_nr is larger than the
Zach> ring size. In that case I think we should consider min_nr to be
Zach> equal to the ring size. We'll return fewer events than userspace
Zach> asked for, but it'll still be a big batch.

That sounds like a programming error, don't you think? Maybe
returning EINVAL is the right approach?

-Jeff

2007-02-01 22:25:31

by Zach Brown

[permalink] [raw]
Subject: Re: [patch] aio: add per task aio wait event condition

> That sounds like a programming error, don't you think? Maybe
> returning EINVAL is the right approach?

Maybe. I think I'd prefer to be permissive and queue as much as
possible, but it's not a strong preference. Returning EINVAL seems
ok, too.

- z