So, awhile back I posted about an extensible AIO attributes mechanism
I'd been cooking up: http://article.gmane.org/gmane.linux.kernel/1367969
Since then, more uses for the thing have been popping up, but I ran into
a roadblock - with the existing AIO api, return values for the
attributes were going to be, at best, considerably uglier than I
anticipated.
Some background: some attributes we'd like to implement need to be able
to return values with the io_event at completion time. Many of the
examples I know of are more or less tracing - returning how long the IO
took, whether it was a cache hit or miss (bcache, perhaps page cache
when buffered AIO is supported), etc.
Additionally, you probably want to be able to return whether the
attribute was supported/handled at all (because of differing kernel
versions, or because it was driver specific) and we need attribute
returns to be able to sanely handle that.
So my opinion is that the only really sane way to implement attribute
return values is to pass them back to userspace via the ringbuffer,
along with the struct io_event.
(For those not intimately familiar with the AIO implementation, on
completion the generated io_event is copied into a ringbuffer which
happens to be mapped into userspace, even though normally userspace will
get the io_event with io_getevents(). This ringbuffer constrains the
design quite a bit, though).
Trouble is, we (probably, there is some debate) can't really just change
the existing ringbuffer format - there's a version field in the existing
ringbuffer, but userspace can't check that until after the ringbuffer is
setup and mapped into userspace. There's no existing mechanism for
userspace to specify flags or options or versioning when setting up the
io context.
So, to do this requires new syscalls, and more or less forking most of
the existing AIO implementation. Also, returning variable length entries
via the ringbuffer turns out to require redesigning a substantial
fraction of the existing AIO implementation - so we might as well fix
everything else that needs fixing at the same time.
Where I'm at now - I've got a new syscall interface that changes enough
to support extensible AIO attributes prototyped; it looks almost
complete but I haven't started testing yet. Enough is there to see how
it all fits together, though - IMO the important bits are how we deal
with different types of kioctxs (I think it works out fairly nicely).
Code is available at http://evilpiepirate.org/git/linux-bcache.git/ aio-new-abi
(Definitely broken, don't even think about trying to run it yet).
We plan on rolling this out at Google in the near term with the minimal
set of changes (because we've got stuff blocked on this), but there's
more changes I'd like to make before this (hopefully) goes upstream.
So, what changes?
* Currently, we strictly limit outstanding kiocbs so as to avoid
overflowing the ringbuffer; this means that the size of the
ringubffer we allocate is determined by the nr_events userspace
passes to io_setup().
This approach doesn't work when ringbuffer entries are variable
length - we can still use a ringbuffer (and I think we want to), but
we need to have an overflow mechanism for when it fills up.
This is actually one of the backwards compatibility issues;
currently, it is possible for userspace to reap io_events without
ever calling into the kernel. But if we've got an overflow mechanism,
that's no longer possible - userspace has to call io_getevents() when
the ringbuffer's empty, or it'll never see events that might've been
on the overflow list - that or we need to put a flag in the
ringbuffer header.
Adding the overflow mechanism is an overall reduction in complexity
though, we can toss out a bunch of code elsewhere and ringbuffer size
isn't so important anymore.
* With the way the head/tail pointers are defined in the current
ringbuffer implentation, we can't do lockless reaping without being
subject to ABA. I've fixed this in my prototype - the head/tail
values use the full range of 32 bit integers, we only mod them by the
ringbuffer size when calculating the current position.
* The head/tail pointers, and also io_submit()/io_getevents() all work
in units of struct iocb/struct io_event. With attributes those
structs are now variable length, so it makes more sense to switch
all the units to bytes.
With these changes, the ringbuffer implementation is looking less and
less AIO specific. I've been wondering a bit whether it could be made
generic and merged with other ringbuffers (I'm not sure what else
there is offhand, besides tracing - tracing has substantially
different needs, but I'd be surprised if there aren't other similar
ringbuffers somewhere).
* The eventfd field should've never been added to struct iocb, imo -
it should've been added to the kioctx (You don't want to know when a
specific iocb is done, there isn't any way to check for that directly
- you want to know when there's events to reap). I'm fixing that.
* Adding a version parameter to io_setup2()
Those are the main changes (besides adding attributes, of course) that
I've made so far.
* Get rid of the parallel syscall interface
AIO really shouldn't be implementing its own slightly different
syscalls; it should be a mechanism for doing syscalls asynchronously.
If we don't have asynchronous implementations of most of our syscalls
right now, so what? Tying the interface to the implementation is
still stupid. And if we're lucky, someday we'll have a generic thread
pool implementation for all the syscalls that aren't worth special
casing (perhaps building off the work Ben LaHaise has been doing to
implement buffered AIO).
This is particularly important now with attributes - almost none of
the attributes we want to implement are actually AIO specific; we'd
like to be able to use them with arbitrary syscalls.
Well, if we turn AIO into a mechanism for doing arbitrary syscalls
asynchronously - it'll be really easy to add one syscall to issue an
iocb synchronously; at that point it'll just be an "issue this
syscall with attributes" syscall.
On Fri, 12 Apr 2013 15:28:56 -0700 Kent Overstreet <[email protected]> wrote:
> So, awhile back I posted about an extensible AIO attributes mechanism
> I'd been cooking up: http://article.gmane.org/gmane.linux.kernel/1367969
>
> Since then, more uses for the thing have been popping up, but I ran into
> a roadblock - with the existing AIO api, return values for the
> attributes were going to be, at best, considerably uglier than I
> anticipated.
>
> Some background: some attributes we'd like to implement need to be able
> to return values with the io_event at completion time. Many of the
> examples I know of are more or less tracing - returning how long the IO
> took, whether it was a cache hit or miss (bcache, perhaps page cache
> when buffered AIO is supported), etc.
>
> Additionally, you probably want to be able to return whether the
> attribute was supported/handled at all (because of differing kernel
> versions, or because it was driver specific) and we need attribute
> returns to be able to sanely handle that.
>
> So my opinion is that the only really sane way to implement attribute
> return values is to pass them back to userspace via the ringbuffer,
> along with the struct io_event.
>
> (For those not intimately familiar with the AIO implementation, on
> completion the generated io_event is copied into a ringbuffer which
> happens to be mapped into userspace, even though normally userspace will
> get the io_event with io_getevents(). This ringbuffer constrains the
> design quite a bit, though).
>
> Trouble is, we (probably, there is some debate) can't really just change
> the existing ringbuffer format - there's a version field in the existing
> ringbuffer, but userspace can't check that until after the ringbuffer is
> setup and mapped into userspace. There's no existing mechanism for
> userspace to specify flags or options or versioning when setting up the
> io context.
>
> So, to do this requires new syscalls, and more or less forking most of
> the existing AIO implementation. Also, returning variable length entries
> via the ringbuffer turns out to require redesigning a substantial
> fraction of the existing AIO implementation - so we might as well fix
> everything else that needs fixing at the same time.
This all sounds like a lot of work, risk, disruption, bloat, etc, etc.
That's not a problem per-se, but it should only be undertaken if the
payback makes it worthwhile.
Unfortunately your email contains only a terse description of this most
important factor: if we add all this stuff to Linux, what do we get in
return? "More or less tracing". Is that useful enough to justify the
changes? Please let's pay a lot more attention to this question before
getting further into implementation stuff! Sell it to us.
> Those are the main changes (besides adding attributes, of course) that
> I've made so far.
>
> * Get rid of the parallel syscall interface
>
> AIO really shouldn't be implementing its own slightly different
> syscalls; it should be a mechanism for doing syscalls asynchronously.
Yes. We got about a twelfth of the way there many years ago
(google("syslets")) but it died. A shame.
Andrew Morton <[email protected]> writes:
> On Fri, 12 Apr 2013 15:28:56 -0700 Kent Overstreet <[email protected]> wrote:
>> Those are the main changes (besides adding attributes, of course) that
>> I've made so far.
>>
>> * Get rid of the parallel syscall interface
>>
>> AIO really shouldn't be implementing its own slightly different
>> syscalls; it should be a mechanism for doing syscalls asynchronously.
>
> Yes. We got about a twelfth of the way there many years ago
> (google("syslets")) but it died. A shame.
Yeah, letting the current process keep waiting and creating a new one
which returns is a fascinating idea, but you really need to swizzle the
PIDs so that the "new" one is identical to the old. Otherwise the API
is unbearable...
Good luck!
Rusty.
On Tue 16-04-13 10:48:35, Rusty Russell wrote:
> Andrew Morton <[email protected]> writes:
> > On Fri, 12 Apr 2013 15:28:56 -0700 Kent Overstreet <[email protected]> wrote:
> >> Those are the main changes (besides adding attributes, of course) that
> >> I've made so far.
> >>
> >> * Get rid of the parallel syscall interface
> >>
> >> AIO really shouldn't be implementing its own slightly different
> >> syscalls; it should be a mechanism for doing syscalls asynchronously.
> >
> > Yes. We got about a twelfth of the way there many years ago
> > (google("syslets")) but it died. A shame.
>
> Yeah, letting the current process keep waiting and creating a new one
> which returns is a fascinating idea, but you really need to swizzle the
> PIDs so that the "new" one is identical to the old. Otherwise the API
> is unbearable...
But when we do crazy stuff like pid namespaces these days somehow
switching pids shouldn't be *that* hard... Should it? Just a crazy idea
that occured to me now :)
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR