2008-11-25 10:16:32

by Suparna Bhattacharya

Subject: Re: kvm aio wishlist


[cc'ing lkml as well]

On Tue, Nov 25, 2008 at 11:30:51AM +0200, Avi Kivity wrote:
> Zach Brown wrote:
>>> I'm also worried about introducing threads. With direct I/O, we know
>>> we're going to block. The easiest thing is to slap the request onto a
>>> queue (blockdev or netdev) and unplug it.
>>>
>>
>> Is it really that easy? There's a non-trivial number of places it can
>> block before submitting the IO and making it to the async completion
>> phase. They show up as latency spikes in real-world loads.
>>
>> DIO is a good example. Using a kernel thread lets the entire path be
>> async. We don't have to go in and fold an async state machine under
>> pinning user space pages, performing file system block mapping lookups,
>> allocating block layer requests, on and on.
>>
>>
>
> Certainly, filesystem backed storage is much harder. Maybe we can use one
> of the fork-on-demand proposals to make the block mapping async, then queue
> the request+pinned pages.
>
>>> IIRC, the idea behind the *lets/*rils was that the calls are usually
>>> nonblocking, so you fork on block, no? I don't see that here. Of
>>> course, that's not the case in my wishlist; all requests will block
>>> without exception.
>>>
>>
>> Yeah. My thinking is that if someone wants to experiment with syslets
>> it'll be pretty easy for them to add a flag to the submission struct and
>> re-use most of the submission and completion framework. That's not my
>> priority. I want posix aio in glibc to work.
>>
>
> Why not extend io_submit() to use a thread pool when going through a
> non-aio-ready path? Yet another new interface, with another round of
> integration with the previous interfaces, is not a comforting thought. I
> still haven't gotten used to the fact that aio can work with fd polling.

Even paths that provide fop->aio_read/write can be synchronous (like non
O_DIRECT filesystem read/writes) underneath, and then there could be multiple
blocking points.

BTW, Ben had implemented a fallback approach that spawned kernel threads
- it was an initial patch and didn't do any thread pooling at that time.

I had a fallback path for pollable fds which did not require thread pools
http://lwn.net/Articles/216443/
(limited to fds which support non-blocking semantics)

OR

Maybe we could use a very simple version of syslets to do an io_submit
in libaio :)

Does the syslet approach of continuing in a different thread (different
thread id) affect kvm?

Regards
Suparna

>
>>> Actually without preadv/pwritev (and without changes in qemu; that has
>>> its own wishlist) we can't really make good use of this now.
>>>
>>
>> I could trivially add preadv and pwritev to the patch series. The vfs
>> paths already support it, it's just that we don't have a syscall entry
>> point which takes the file position from an argument instead of from the
>> file struct behind the fd.
>>
>> Would that make it an interesting experiment for you to work with?
>>
>
> Not really -- it doesn't add anything (at the moment) that a userspace
> thread pool doesn't have.
>
> The key here is in the richer interface to the scheduler. If we can get
> the async exec thread to stay on the same cpu as the user thread that
> launched it, and to start executing on the userspace thread's return to
> userspace, then I guess many of the problems of threads are eliminated.
>
> --
> error compiling committee.c: too many arguments to function
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-aio' in
> the body to [email protected]. For more info on Linux AIO,
> see: http://www.kvack.org/aio/
> Don't email: [email protected]


2008-11-25 10:49:37

by Avi Kivity

Subject: Re: kvm aio wishlist

Suparna Bhattacharya wrote:
>> Why not extend io_submit() to use a thread pool when going through a
>> non-aio-ready path? Yet another new interface, with another round of
>> integration with the previous interfaces, is not a comforting thought. I
>> still haven't gotten used to the fact that aio can work with fd polling.
>>
>
> Even paths that provide fop->aio_read/write can be synchronous (like non
> O_DIRECT filesystem read/writes) underneath, and then there could be multiple
> blocking points.
>

If they are known to be synchronous when execution starts, they could
just return -ENOSYS and fall back to threads, until someone implements a
truly async path.

> BTW, Ben had implemented a fallback approach that spawned kernel threads
> - it was an initial patch and didn't do any thread pooling at that time.
>
> I had a fallback path for pollable fds which did not require thread pools
> http://lwn.net/Articles/216443/
> (limited to fds which support non blocking semantics)
>

These are good solutions for the complex-blocking and never-blocking cases.

> OR
>
> Maybe we could use a very simple version of syslets to do an io_submit
> in libaio :)
>
> Does the syslet approach of continuing in a different thread (different
> thread id) affect kvm ?
>

Yes, we like to pthread_kill() threads from time to time, and even
expose the thread IDs to management tools so they can control pinning.

Perhaps a variant of syslet, that is kernel-only, and does:

- always allocate a new kernel stack at io_submit() time, but not a new
thread
- start executing the rarely-blocking path of the request (like block
mapping and get_user_pages_fast) on the new stack
- if we block here, clone a new thread and graft the stack onto it
- start the always-blocking portion of the call (enqueuing a bio)
- exit the new thread if we hit the slowpath, or deallocate the stack and
longjmp back to the main stack if we did not

This does not expose any new semantics to userspace. It does twist the
guts of the kernel in that we have to duplicate thread_info, but if
thread_info is only accessed from current, I think that is manageable.

(I think I just described fibrils, no? I think that was a good idea.
Why can't we go back to it?)

--
error compiling committee.c: too many arguments to function

2008-11-25 15:00:34

by Ingo Molnar

Subject: Re: kvm aio wishlist


* Avi Kivity <[email protected]> wrote:

> Perhaps a variant of syslet, that is kernel-only, and does:
>
> - always allocate a new kernel stack at io_submit() time, but not a
> new thread

such an N:M threading design is a loss - sooner or later we arrive at a
point where people actually start using it and then we want to
load-balance and schedule these entities.

So I'd suggest the kthread based async engine I wrote for syslets. It
worked well and for kernel-only entities it schedules super-fast - it
can do up to 20 million events per second on a 16-way box I'm testing
on. The objections about syslets were not related to the scheduling of
it but were mostly about the userspace API/ABI: you don't have to use
that.

Ingo

2008-11-25 15:13:15

by Jens Axboe

Subject: Re: kvm aio wishlist

On Tue, Nov 25 2008, Ingo Molnar wrote:
>
> * Avi Kivity <[email protected]> wrote:
>
> > Perhaps a variant of syslet, that is kernel-only, and does:
> >
> > - always allocate a new kernel stack at io_submit() time, but not a
> > new thread
>
> such an N:M threading design is a loss - sooner or later we arrive at a
> point where people actually start using it and then we want to
> load-balance and schedule these entities.
>
> So I'd suggest the kthread based async engine I wrote for syslets. It
> worked well and for kernel-only entities it schedules super-fast - it
> can do up to 20 million events per second on a 16-way box I'm testing
> on. The objections about syslets were not related to the scheduling of
> it but were mostly about the userspace API/ABI: you don't have to use
> that.

Still unsure why that stuff never got anywhere. Do you have a pointer to
the latest posting?

--
Jens Axboe

2008-11-25 15:26:17

by Zach Brown

Subject: Re: kvm aio wishlist


> Still unsure why that stuff never got anywhere.

Changing the tid of submitting tasks makes it unsuitable for sys_io_*()
or posix aio users as it stands. Maybe we could swap tids on the
switch, but we'd probably then have to audit the lifetime of
tid -> task_struct users in the kernel.

And there's still the question of what ptrace is supposed to do.

- z

2008-11-25 15:58:24

by Ingo Molnar

Subject: Re: kvm aio wishlist


* Zach Brown <[email protected]> wrote:

> > Still unsure why that stuff never got anywhere.
>
> Changing the tid of submitting tasks makes it unsuitable for
> sys_io_*() or posix aio users as it stands. Maybe we could swap
> tids on the switch, but we'd probably then have to audit the
> lifetime of tid -> task_struct users in the kernel.

doesn't look like a big thing affecting the fastpath materially.

> And there's still the question of what ptrace is supposed to do.

it's debug-only - I'm sure we can work something out.

Ingo

2008-11-25 16:54:48

by Avi Kivity

Subject: Re: kvm aio wishlist

Ingo Molnar wrote:
>
>> Perhaps a variant of syslet, that is kernel-only, and does:
>>
>> - always allocate a new kernel stack at io_submit() time, but not a
>> new thread
>>
>
> such an N:M threading design is a loss - sooner or later we arrive at a
> point where people actually start using it and then we want to
> load-balance and schedule these entities.
>

It's only N:M as long as it's nonblocking. If it blocks it becomes 1:1
again. If it doesn't, it's probably faster to do things on the same
cache as the caller.

> So I'd suggest the kthread based async engine I wrote for syslets. It
> worked well and for kernel-only entities it schedules super-fast - it
> can do up to 20 million events per second on a 16-way box I'm testing
> on. The objections about syslets were not related to the scheduling of
> it but were mostly about the userspace API/ABI: you don't have to use
> that.

I'd love to have something :)

I guess any cache and latency considerations could be fixed if
- we schedule a syslet for the first time when the thread that launched
it exits to userspace
- we queue it on the current cpu's runqueue

In that case, for the nonblocking case syslets and fibrils would have
very similar performance.

--
error compiling committee.c: too many arguments to function

2008-11-25 16:56:47

by Ingo Molnar

Subject: Re: kvm aio wishlist


* Avi Kivity <[email protected]> wrote:

> Ingo Molnar wrote:
>>
>>> Perhaps a variant of syslet, that is kernel-only, and does:
>>>
>>> - always allocate a new kernel stack at io_submit() time, but not a
>>> new thread
>>>
>>
>> such an N:M threading design is a loss - sooner or later we arrive at a
>> point where people actually start using it and then we want to
>> load-balance and schedule these entities.
>>
>
> It's only N:M as long as it's nonblocking. If it blocks it becomes 1:1
> again. If it doesn't, it's probably faster to do things on the same
> cache as the caller.
>
>> So I'd suggest the kthread based async engine I wrote for syslets. It
>> worked well and for kernel-only entities it schedules super-fast - it
>> can do up to 20 million events per second on a 16-way box I'm testing
>> on. The objections about syslets were not related to the scheduling of
>> it but were mostly about the userspace API/ABI: you don't have to use
>> that.
>
> I'd love to have something :)
>
> I guess any cache and latency considerations could be fixed if
> - we schedule a syslet for the first time when the thread that launched
> it exits to userspace
> - we queue it on the current cpu's runqueue
>
> In that case, for the nonblocking case syslets and fibrils would
> have very similar performance.

yes. Hence, given that fibrils have various tradeoffs, we should do
the syslet thread pool. The code is there and it works :)

Ingo

2008-11-25 16:57:15

by Avi Kivity

Subject: Re: kvm aio wishlist

Zach Brown wrote:
> And there's still the question of what ptrace is supposed to do.
>

If it's kernel-only (which I think is a good start for something like
this), then is ptrace relevant at all?

--
error compiling committee.c: too many arguments to function

2008-11-25 16:57:54

by Ingo Molnar

Subject: Re: kvm aio wishlist


* Avi Kivity <[email protected]> wrote:

> Zach Brown wrote:
>> And there's still the question of what ptrace is supposed to do.
>
> If it's kernel-only (which I think is a good start for something
> like this), then is ptrace relevant at all?

it's relevant wrt. details: to make sure that it's all transparent and
the ptrace engine is not confused by thread switching tricks.

Ingo