2002-09-20 20:42:10

by Shailabh Nagar

Subject: [RFC] adding aio_readv/writev

Ben,

Currently there is no way to initiate an aio readv/writev in 2.5. There
were no aio_readv/writev calls in 2.4 either - I'm wondering if there
was any particular reason for excluding readv/writev operations from aio?

The read/readv paths have already been merged for raw/O_DIRECT and
regular file reads/writes. So why not expose the vector read/write to the
user by adding the IOCB_CMD_PREADV/IOCB_CMD_READV and
IOCB_CMD_PWRITEV/IOCB_CMD_WRITEV commands to the aio set? Without that,
raw/O_DIRECT readv users would need to unnecessarily cycle through their
iovecs at a library level submitting them individually.
For larger iovecs, user/library code would needlessly deal with multiple
completions. While I'm not sure of the performance impact of the absence
of aio_readv/writev, it seems easy enough to provide.
Most of the functions are already in place. We would only
need a way to pass the iovec through the iocb.

I was thinking of something like this:

struct iocb {
	...
+	union {
		__u64	aio_buf;
+		__u64	aio_iovp;	/* user pointer to the iovec array */
+	};
+	union {
		__u64	aio_nbytes;
+		__u64	aio_nsegs;	/* number of iovec segments */
+	};
	...
};

allowing the iovec * & nsegs to be passed into sys_io_submit. Some code
would be added (in the IOCB_CMD_READV case handling within
io_submit_one) to copy & verify the iovec pointers and then call
aio_readv/aio_writev (if it's defined for the fs).
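
To make the layout idea concrete, here is a minimal user-space sketch of it. It is only an illustration: `struct sketch_iocb`, `SKETCH_CMD_PREADV` and `sketch_prep_readv` are hypothetical names, the struct is a trimmed stand-in and not the kernel's real `struct iocb`, and nothing here reflects an actual patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical opcode value, chosen only for this sketch. */
enum { SKETCH_CMD_PREADV = 100 };

/* Hypothetical, trimmed-down stand-in for struct iocb. Only the two
 * unions from the proposal are modelled; other fields are simplified. */
struct sketch_iocb {
    uint64_t aio_data;          /* opaque cookie returned on completion */
    uint16_t aio_lio_opcode;    /* which command to run */
    uint32_t aio_fildes;        /* file descriptor */
    union {
        uint64_t aio_buf;       /* existing: address of the user buffer */
        uint64_t aio_iovp;      /* proposed: address of the iovec array */
    };
    union {
        uint64_t aio_nbytes;    /* existing: byte count */
        uint64_t aio_nsegs;     /* proposed: number of iovec entries */
    };
    int64_t aio_offset;         /* file offset */
};

/* The new names must alias the old fields, so existing users see the
 * same offsets and the same structure size. */
static_assert(offsetof(struct sketch_iocb, aio_buf) ==
              offsetof(struct sketch_iocb, aio_iovp), "aio_iovp aliases aio_buf");
static_assert(offsetof(struct sketch_iocb, aio_nbytes) ==
              offsetof(struct sketch_iocb, aio_nsegs), "aio_nsegs aliases aio_nbytes");

/* Fill one control block describing a whole vectored read. */
void sketch_prep_readv(struct sketch_iocb *cb, int fd,
                       const struct iovec *iov, unsigned nsegs,
                       int64_t offset)
{
    *cb = (struct sketch_iocb){0};
    cb->aio_lio_opcode = SKETCH_CMD_PREADV;
    cb->aio_fildes = (uint32_t)fd;
    cb->aio_iovp = (uint64_t)(uintptr_t)iov;
    cb->aio_nsegs = nsegs;
    cb->aio_offset = offset;
}
```

Because the unions only add second names for existing storage, a caller that keeps using aio_buf/aio_nbytes would be unaffected; the submit path would interpret the fields according to the opcode.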

What do you think? I wanted to get some feedback before trying to code
this up.

While we are on the topic of expanding aio operations, what about
providing IOCB_CMD_READ/WRITE, distinct from their pread/pwrite
counterparts? Do you think that's needed?

- Shailabh



2002-09-23 14:33:46

by Shailabh Nagar

Subject: Re: [RFC] adding aio_readv/writev

Stephen Hemminger wrote:

>Why not batch up multiple requests with one io_submit? It has the same
>effect, except there would be multiple responses.
>
Even though the multiple iocbs enter the kernel together, they still
get processed individually, so a fair amount of unnecessary data
transfer and function invocation still occurs in the submit
code path.
Depending on how long it takes for io_submit_one to return, there might
be a reduced probability of io requests being merged at the i/o scheduler.
Finally, the multiple responses need to be handled as you mentioned. I
suppose the application could wait for the last request (in the
io_submit list) and that would most probably ensure that the preceding
ones were complete as well, but it's not a guarantee offered by the aio
API, right?
Besides, the application needs the data (represented by multiple
requests) in one go, so partial completion isn't likely to be useful and
will only be an overhead.

While a quantitative assessment of the above tradeoffs is possible, it
will be difficult to make a good comparison before "true" aio
functionality is in place for 2.5. Such an assessment is unlikely to
happen before the feature freeze takes effect. So I'm making a case for
putting in async vector I/O interfaces for the following three reasons:
- the synchronous API does provide separate entry points for vector I/O.
Extending the same to the async interfaces, especially when it doesn't
even involve creating new syscalls, seems natural for completeness.
- underlying in-kernel infrastructure already supports it, so no major
changes are needed.
- there exists at least one major application class (databases) that uses
vectored I/O heavily and benefits from async I/O. Hence async vectored
I/O is also likely to be useful. Can anyone else with experience on
other OSes comment on this?

Comments, reasons for not doing async readv/writev directly welcome.

- Shailabh

>
>
>On Fri, 2002-09-20 at 13:39, Shailabh Nagar wrote:
>
>>Ben,
>>
>>Currently there is no way to initiate an aio readv/writev in 2.5. There
>>were no aio_readv/writev calls in 2.4 either - I'm wondering if there
>>was any particular reason for excluding readv/writev operations from aio ?
>>
>>The read/readv paths have anyway been merged for raw/O_DIRECT and
>>regular file read/writes. So why not expose the vector read/write to the
>>user by adding the IOCB_CMD_PREADV/IOCB_CMD_READV and
>>IOCB_CMD_PWRITEV/IOCB_CMD_WRITEV commands to the aio set. Without that,
>>raw/O_DIRECT readv users would need to unnecessarily cycle through their
>>iovecs at a library level submitting them individually.
>>For larger iovecs, user/library code would needlessly deal with multiple
>>completions. While I'm not sure of the performance impact of the absence
>>of aio_readv/writev, it seems easy enough to provide.
>>Most of the functions are already in place. We would only
>>need a way to pass the iovec through the iocb.
>>
>>I was thinking of something like this:
>>
>>struct iocb {
>>
>>+union {
>> __u64 aio_buf
>>+ __u64 aio_iovp
>>+}
>>+union {
>> __u64 aio_nbytes
>>+ __u64 aio_nsegs
>>+}
>>
>>allowing the iovec * & nsegs to be passed into sys_io_submit. Some code
>>would be added (within case handling of IOCB_CMD_READV within
>>io_submit_one) to copy & verify the iovec pointers and then call
>>aio_readv/aio_writev (if its defined for the fs).
>>
>>What do you think ? I wanted to get some feedback before trying to code
>>this up.
>>
>>While we are on the topic of expanding aio operations, what about
>>providing IOCB_CMD_READ/WRITE, distinct from their pread/pwrite
>>counterparts ? Do you think thats needed ?
>>
>>- Shailabh
>>

2002-09-23 17:54:52

by Chen, Kenneth W

Subject: RE: [RFC] adding aio_readv/writev

ben> Only db2 uses vectored io heavily. Oracle does not, and none of the
ben> open source databases do. Vectored io is pretty useless for most people.

That's not necessarily true. As far as I know, the reason oracle doesn't use
vectored io is because the real implementation is not there.

- Ken

2002-09-23 18:53:56

by Clement T. Cole

Subject: Re: [RFC] adding aio_readv/writev

>>Comments, reasons for not doing async readv/writev directly welcome.

How about the case for it... See pages 404-406 [Section 12.7] of
Richard Stevens' ``Advanced Programming in the UNIX Environment''
[aka APUE]. Richard measures almost a factor of 2 difference
in system time between using vectored I/O and not using it on
a Sun and on an x86.
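
For readers without the book at hand, the kind of comparison being referred to can be sketched as follows. This is an illustration only, not Stevens' test program; the helper names `put_two_writes` and `put_one_writev` are made up for this sketch. The first helper crosses into the kernel once per buffer, the second hands both buffers over in a single writev() call.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Two write() calls: one system call per buffer. */
ssize_t put_two_writes(int fd, const char *hdr, const char *body)
{
    assert(hdr != NULL && body != NULL);
    ssize_t a = write(fd, hdr, strlen(hdr));
    if (a < 0)
        return -1;
    ssize_t b = write(fd, body, strlen(body));
    if (b < 0)
        return -1;
    return a + b;
}

/* One writev() call: the kernel gathers both buffers in a single trip. */
ssize_t put_one_writev(int fd, const char *hdr, const char *body)
{
    assert(hdr != NULL && body != NULL);
    struct iovec iov[2];
    iov[0].iov_base = (void *)hdr;   /* writev() only reads from the buffers */
    iov[0].iov_len = strlen(hdr);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len = strlen(body);
    return writev(fd, iov, 2);
}
```

Both helpers put the same bytes on the descriptor; the difference a timing run would look at is the number of system calls needed to do it.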

>>- there exists at least one major application class (databases)
>> that uses vectored I/O heavily and benefits from async I/O.
>> Hence async vectored I/O is also likely to be useful. Can anyone
>> else with experience on other OS's comment on this ?

.... a number of other comments/arguments from other
.... responses removed to get to the meat of the discussion.

Be careful out there.....

Large commercial applications such as Oracle DB, IBM's DB2 or
Netscape Enterprise Server, for that matter, are very modular in
their interface to the OS because they have to be: they have been ported
and tuned to run on a number of different OSes and HW architectures.
When I see someone say something about ``Oracle'' doing X or Y -
I get a little worried.

Which version, which port etc... e.g. Oracle's DB running on VMS
has an I/O system interface that is different from any of
its Unix implementations...... oh yes - was it clustered or
not... did it have the X package etc...

The point is that the UNIX implementations of Oracle DB vary
widely..... This is also true of every >>major<< application package
whose insides I have worked on, consulted on, or seen (SAP, Informix,
Netscape, etc...).

Solaris and Tru64 [and I would expect AIX, HP-UX
etc... but I only know these two personally] each offer
a highly parallel, asynchronous (but proprietary) I/O interface.
Oracle's Sun group (or the old DEC group) exploits the >>private<<
interfaces -- if that makes the code work better, they do it.

That may or may not be what you have seen on some ``simple Un*x''
port - which is a starting point for them - that's not the code
they ship on the high end revenue systems.

Oracle/IBM/Netscape etc... do this because they want to grab customers
from their competitors (DB2, Informix, etc.)... they invest in
using the best interfaces available.... if they are available
AND if they can help them sell more copies of their product.


So... let's get back to the basic issue....

We know that vectored/scatter-gather I/O can help a number of real
applications ... Richard demonstrated that. We have some examples
[like DB2] that have used vectored I/O successfully. We also
know asynchronous I/O has been demonstrated to be useful and
know that some commercial folks have used that.

I gather from some of the comments that adding async/vectored
will make an already complex subsystem even more so [i.e. not
a resounding endorsement that this is easy].

So the question is whether async vectored I/O can be implemented
to have a positive gain, as it did for traditional (synchronous) I/O.
If the complexity is too high and it does not help much...then
maybe this is a Chimera to leave alone. But.... if it can be
done with some level of elegance... well.... the past history is
that the commercial folks have used those features.

I know this sounds a little bit like:
``if you build it - they will come.''

But I would say it's more to this point:
``if you build it and this new feature shows some real value
AND the application can exploit it ... in time, they will
because if they don't their competitors will.''

Clem Cole

2002-09-23 19:55:28

by Shailabh Nagar

Subject: Re: [RFC] adding aio_readv/writev

Clement T. Cole wrote:

>>>Comments, reasons for not doing async readv/writev directly welcome.
>>>
>
>How about the case for it... See Pages 404-406 [Section 12.7] of
>Richard Steven's ``Advanced Programming in the Unix Environment''
>[aka APUE]. Richard measures almost a factor of 2 difference
>in system time between using vectored I/O and not using it on
>a Sun and on a x86.
>
It would have been nice to have corresponding data for the async path.

><snip>
>
>So... let's get back to the basic issue....
>
>We know that vectored/scatter gather I/O can help a number of real
>applications ... Richard demonstrated that. We have some examples
>[like DB2] that have use vectored I/O successfully. We also
>know asynchronous I/O has been demonstrated to be useful and
>know that some commerical folks have used that.
>
>I'm gather from some of the comments, adding async/vectored
>will make an already complex subsystem, even more so [i.e. not
>a resounding endorsement for sure this is easy].
>

I wouldn't say so. Adding async vectored I/O to the 2.5 code won't make
it more complex since the underlying functions
handle iovecs anyway.

>
>
>So the question is can async vectored I/O be implemented
>to have a positive gain, such as it did within the traditonal one.
>If the complexity is too high and it does not help much...then
>maybe this is a Chimera to leave alone. But.... if it can be
>done with some level of elegance... well.... the past history is
>that the commerical folks have used those features.
>

It seems to be a case of "complexity is low, benefits are unknown". I
guess the best thing is to develop a patch and see what people think
about the complexity part. The benefits part will become clear only when
the async interfaces are reasonably functional and we can compare the
following:

- call async readv directly
vs
- multiple calls to io_submit using one iocb (each call corresponds to
one element of user's vector)
vs
- single call to io_submit using multiple iocbs (each iocb corresponds
to one element of user's vector)
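
The second and third options both start from the same step: turning the user's vector into one control block per element. A rough sketch of that step is below. It assumes the `struct iocb` and `IOCB_CMD_PREAD` definitions from <linux/aio_abi.h>; the helper name `iov_to_iocbs` is hypothetical, and the sketch stops short of actually submitting anything.

```c
#include <assert.h>
#include <linux/aio_abi.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>

/* Build one IOCB_CMD_PREAD control block per iovec element, reading
 * consecutive file ranges starting at `offset`. `cbs` and `list` must
 * each have room for `nsegs` entries. Returns the number of control
 * blocks built, which is also the number of completion events the
 * caller will have to reap. */
unsigned iov_to_iocbs(int fd, const struct iovec *iov, unsigned nsegs,
                      long long offset, struct iocb *cbs, struct iocb **list)
{
    assert(iov != NULL && cbs != NULL && list != NULL);
    for (unsigned i = 0; i < nsegs; i++) {
        memset(&cbs[i], 0, sizeof cbs[i]);
        cbs[i].aio_lio_opcode = IOCB_CMD_PREAD;
        cbs[i].aio_fildes = (uint32_t)fd;
        cbs[i].aio_buf = (uint64_t)(uintptr_t)iov[i].iov_base;
        cbs[i].aio_nbytes = iov[i].iov_len;
        cbs[i].aio_offset = offset;
        cbs[i].aio_data = i;          /* lets the caller match completions */
        list[i] = &cbs[i];
        offset += (long long)iov[i].iov_len;
    }
    /* Option 2 would now call io_submit(ctx, 1, &list[i]) once per i;
     * option 3 would call io_submit(ctx, nsegs, list) a single time.
     * Either way, nsegs separate completions come back. */
    return nsegs;
}
```

A direct async readv (the first option) would replace all of this with a single control block and a single completion, which is the overhead difference the comparison is meant to measure.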

Since the raw/O_DIRECT interfaces offer asynchrony (through Badari
Pulavarty & Mingming Cao's patches), it should be possible to test this
out.

More on this shortly,
- Shailabh

2002-09-23 20:32:39

by Clement T. Cole

Subject: Re: [RFC] adding aio_readv/writev

>>It would have been nice to have corresponding data for the async path.

Agreed... I'll let you know if I learn anything. When Richard wrote
APUE, aio was not defined by Posix. There were only the select/poll and
O_*SYNC flag hacks from BSD and SVR4. I don't think you will
learn much from that.

As I said, many/most of the commercial Un*x folks added their own
proprietary (and slightly different) versions of aio years ago. Then
they agreed on the Posix interface and most [if not all] have offered
that. Most of the major ISVs that used the proprietary ones
have switched to, or are in the process of switching to, the Posix
interface for simplicity [if they could - there are sometimes reasons
why they can not - not always technical reasons BTW].

I personally started to monitor this mailing list because I was
interested in Ben's work on aio for Linux and what I'm researching
needs to follow what Linux is doing in this area.

For whatever it's worth to this list: I have local implementations
of the Posix async I/O for a Sun and *BSD. I'm trying to get my hands
on an Alpha and SVR5 [<-- bits secured for the latter but no HW at
the moment to try it]. If you have any aio test cases, let me know.
As I do my research, if I can learn anything useful I'll be willing
to pass it on if you think it will help.

I'm currently thinking up/trying some examples and there are some
worrisome issues with the Posix spec IMHO. I know that you
folks are not trying to be Posix compliant - which is both
a blessing and a curse.

In my case, I need to follow Posix, since that's
what the ISVs really use as their guide. My assumption is that
there will be a mapping layer between your final interface and
the Posix interface. I can offer any extensions as needed/appropriate if
I can show that it helps [which in this case it might].

Clem

2002-09-24 13:15:41

by John Myers

Subject: Re: [RFC] adding aio_readv/writev



Benjamin LaHaise wrote:

>Only db2 uses vectored io heavily. Oracle does not, and none of the open
>source databases do. Vectored io is pretty useless for most people.
>
>
writev is extremely important for networking as it avoids small packets.

Why do people have such tunnel vision around aio to disk? Aio to
network is far more important, as networks are much slower than disks.



2002-09-24 13:47:23

by Stephen C. Tweedie

Subject: Re: [RFC] adding aio_readv/writev

Hi,

On Tue, Sep 24, 2002 at 06:20:45AM -0700, John Gardiner Myers wrote:

> Benjamin LaHaise wrote:
>
> >Only db2 uses vectored io heavily. Oracle does not, and none of the open
> >source databases do. Vectored io is pretty useless for most people.
> >
> >
> writev is extremely important for networking as it avoids small packets.

No, all you can infer from that is that "some method for avoiding
small packets is important for networking." TCP_CORK already does
that in Linux, for tcp at least, without requiring writev. (Of
course, normal nonblocking writev is still there if you want it.)
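
As a rough illustration of the TCP_CORK approach, a sketch along these lines might be used. It assumes a Linux TCP socket; the helper names `set_cork` and `send_corked` are hypothetical, and short writes are not retried.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Set (on = 1) or clear (on = 0) TCP_CORK on a TCP socket.
 * Returns 0 on success, -1 on error, like setsockopt(). */
int set_cork(int sock, int on)
{
    return setsockopt(sock, IPPROTO_TCP, TCP_CORK, &on, sizeof on);
}

/* Send a header and a body with two plain write() calls while the
 * socket is corked, so the kernel can coalesce them into full-sized
 * segments. Returns the total bytes written, or -1 on error. */
ssize_t send_corked(int sock, const void *hdr, size_t hlen,
                    const void *body, size_t blen)
{
    assert(hdr != NULL && body != NULL);
    if (set_cork(sock, 1) < 0)
        return -1;
    ssize_t a = write(sock, hdr, hlen);
    ssize_t b = (a < 0) ? -1 : write(sock, body, blen);
    set_cork(sock, 0);               /* uncorking pushes out queued data */
    return (a < 0 || b < 0) ? -1 : a + b;
}
```

This avoids the small packets without writev, but it still costs one system call per buffer plus the two setsockopt() calls.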

--Stephen

2002-09-24 14:08:38

by John Myers

Subject: Re: [RFC] adding aio_readv/writev



Stephen C. Tweedie wrote:

>No, all you can infer from that is that "some method for avoiding
>small packets is important for networking." TCP_CORK already does
>that in Linux, for tcp at least, without requiring writev. (Of
>course, normal nonblocking writev is still there if you want it.)
>
TCP_CORK is indeed effective for avoiding small packets. Be that as it
may, the source data for network writes are frequently in discontiguous
buffers and writev is nonetheless still important for networking. The
alternative in the aio model is to waste a lot of resources delivering
io completions the application doesn't care about.

