2004-10-14 20:16:55

by Yasushi Saito

Subject: [PATCH 1/2] aio: add vectored I/O support

This is a patch against 2.6.9-rc3-mm3 that adds support for vectored
async I/O. It adds two additional commands, IO_CMD_PREADV and
IO_CMD_PWRITEV to libaio.h. Below is roughly what I did:

* add "aio_readv" and "aio_writev" to struct file_operations.
* aio_abi.h: add IOCB_CMD_PREADV and PWRITEV.
Also make struct iocb compatible with the latest glibc libaio.h.
* aio.c: change kiocb to support vectored operations.
IOCB_CMD_PREAD and PWRITE are now simply
implemented as degenerate variations of PREADV and PWRITEV.
* block_dev.c, file.c, {ext2,ext3,jfs,reiserfs}/file.c:
Add implementations of the aio_readv and aio_writev methods, route
aio_read and aio_write to the ...v methods.
They are reasonably straightforward since the low-level code
already supports vectored I/O.

This feature has been tested on ext3 and a raw block device, but it
should also work with ext2, nfs, jfs and reiserfs.
The patch is divided into two parts, one that just fixes bugs in the
existing code, and the other that adds the vectored I/O feature. A bit
more detailed explanation, along with a sample program, can be found at
http://www.hpl.hp.com/personal/Yasushi_Saito/linux-aiov
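
A minimal sketch of the intended usage (assuming, for illustration, that
the iovec array and its segment count travel in the iocb's u.c.buf and
u.c.nbytes fields; the sample program at the URL above is authoritative):

#include <libaio.h>
#include <sys/uio.h>
#include <string.h>

/* sketch: one iocb describes a read into several buffers; the opcode
 * and field usage follow the description above and are assumptions,
 * not a definitive ABI */
static int submit_preadv(io_context_t ctx, int fd, struct iovec *iov,
                         int nr_segs, long long offset)
{
        struct iocb cb, *cbs[1] = { &cb };

        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes     = fd;
        cb.aio_lio_opcode = IO_CMD_PREADV;   /* new opcode */
        cb.u.c.buf        = iov;             /* iovec array */
        cb.u.c.nbytes     = nr_segs;         /* number of segments */
        cb.u.c.offset     = offset;

        return io_submit(ctx, 1, cbs);       /* one iocb, many buffers */
}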
Comments are appreciated.

yaz

This patch fixes two bugs: (1) aio_free_ring: don't attempt to free pages
that were never set up when cleaning up after a do_mmap failure in io_setup;
(2) __generic_file_aio_read: properly abort and retry when more than one
iovec is passed and the data is not yet ready.

Signed-off-by: Yasushi Saito <[email protected]>

--- .pc/aio-bugfixes.patch/fs/aio.c 2004-10-14 12:58:39 -07:00
+++ fs/aio.c 2004-10-14 12:58:39 -07:00
@@ -86,7 +86,8 @@ static void aio_free_ring(struct kioctx
long i;

for (i=0; i<info->nr_pages; i++)
- put_page(info->ring_pages[i]);
+ if (info->ring_pages[i])
+ put_page(info->ring_pages[i]);

if (info->mmap_size) {
down_write(&ctx->mm->mmap_sem);
--- .pc/aio-bugfixes.patch/mm/filemap.c 2004-10-14 12:58:39 -07:00
+++ mm/filemap.c 2004-10-14 12:58:39 -07:00
@@ -976,9 +976,11 @@ __generic_file_aio_read(struct kiocb *io
desc.error = 0;
do_generic_file_read(filp,ppos,&desc,file_read_actor);
retval += desc.written;
- if (!retval) {
- retval = desc.error;
- break;
+
+ if (desc.written < iov[seg].iov_len) {
+ if (retval == 0)
+ retval = desc.error;
+ break;
}
}
}




2004-10-16 03:13:18

by Joel Becker

Subject: Re: [PATCH 1/2] aio: add vectored I/O support

On Thu, Oct 14, 2004 at 01:10:01PM -0700, Yasushi Saito wrote:
> This is a patch against 2.6.9-rc3-mm3 that adds support for vectored
> async I/O. It adds two additional commands, IO_CMD_PREADV and
> IO_CMD_PWRITEV to libaio.h. Below is roughly what I did:

How does this differ substantially from lio_listio() of each I/O
range? Does it have some significant performance win, or is it just
aiming for a completeness that POSIX doesn't (to my knowledge) specify?
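
For concreteness, the lio_listio() route is roughly the following (POSIX
AIO, error handling omitted; the caller fills in one aiocb per range):

#include <aio.h>
#include <stddef.h>

/* one aiocb per I/O range, queued with a single call; each cbs[i] has
 * aio_fildes, aio_buf, aio_nbytes, aio_offset and
 * aio_lio_opcode = LIO_READ filled in by the caller */
static int read_ranges(struct aiocb *cbs[], int nranges)
{
        return lio_listio(LIO_NOWAIT, cbs, nranges, NULL);
}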

Joel

--

Life's Little Instruction Book #464

"Don't miss the magic of the moment by focusing on what's
to come."

http://www.jlbec.org/
[email protected]

2004-10-16 05:18:51

by Avi Kivity

Subject: Re: [PATCH 1/2] aio: add vectored I/O support

Joel Becker wrote:

>On Thu, Oct 14, 2004 at 01:10:01PM -0700, Yasushi Saito wrote:
>
>>This is a patch against 2.6.9-rc3-mm3 that adds support for vectored
>>async I/O. It adds two additional commands, IO_CMD_PREADV and
>>IO_CMD_PWRITEV to libaio.h. Below is roughly what I did:
>
> How does this differ substantially from lio_listio() of each I/O
>range? Does it have some significant performance win, or is it just
>aiming for a completeness that POSIX doesn't (to my knowledge) specify?
>
It is a huge performance win, at least on the 2.4-based RHEL kernel.
Large reads (~256K) using 4K iocbs are very slow on a large RAID, while
after I coded a similar patch I got a substantial speedup.

I don't know the cause for the slowdown; maybe request merging only
works for queued requests, and as the RAID has a large TCQ depth,
requests didn't have much of a chance to queue in the kernel.


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2004-10-16 05:37:28

by Joel Becker

Subject: Re: [PATCH 1/2] aio: add vectored I/O support

On Sat, Oct 16, 2004 at 07:18:45AM +0200, Avi Kivity wrote:
> It is a huge performance win, at least on the 2.4-based RHEL kernel.
> Large reads (~256K) using 4K iocbs are very slow on a large RAID, while
> after I coded a similar patch I got a substantial speedup.

I'd think we should fix the submission path instead. Why create
iovs _and_ iocbs when we only need to create one? And even if we
decided aio_readv() was still nice to keep, we'd want to fix this
inefficiency in io_submit().

Joel

--

"Nothing is wrong with California that a rise in the ocean level
wouldn't cure."
- Ross MacDonald

http://www.jlbec.org/
[email protected]

2004-10-16 08:43:11

by Avi Kivity

Subject: Re: [PATCH 1/2] aio: add vectored I/O support

Joel Becker wrote:

>On Sat, Oct 16, 2004 at 07:18:45AM +0200, Avi Kivity wrote:
>
>>It is a huge performance win, at least on the 2.4-based RHEL kernel.
>>Large reads (~256K) using 4K iocbs are very slow on a large RAID, while
>>after I coded a similar patch I got a substantial speedup.
>
> I'd think we should fix the submission path instead. Why create
>iovs _and_ iocbs when we only need to create one? And even if we
>decided aio_readv() was still nice to keep, we'd want to fix this
>inefficiency in io_submit().
>
Using IO_CMD_READ for a vector entails

- converting the userspace structure (which might well be an iovec) to iocbs
- copying all those iocbs to the kernel
- merging the iocbs
- generating multiple completions for the merged request
- copying the completions to userspace
- coalescing the multiple completions in userspace to a single completion

error handling is difficult as well. one would expect that a bad sector
with multiple iocbs would only fail one of the requests. it seems to be
non-trivial to implement this correctly.

IO_CMD_PREADV, by contrast, is very simple, intuitive, and efficient.
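
Schematically, the per-segment route looks something like this in libaio
terms (a sketch only; error handling and buffer lifetimes glossed over).
With IO_CMD_PREADV the loop and the per-segment completions go away:

#include <libaio.h>
#include <sys/uio.h>

/* the plain-PREAD route: one iocb per segment, so one copy into the
 * kernel and one completion event per segment */
static int submit_as_preads(io_context_t ctx, int fd, struct iovec *iov,
                            int nr_segs, long long offset)
{
        struct iocb cb[nr_segs];
        struct iocb *cbs[nr_segs];
        int i;

        for (i = 0; i < nr_segs; i++) {
                io_prep_pread(&cb[i], fd, iov[i].iov_base,
                              iov[i].iov_len, offset);
                offset += iov[i].iov_len;
                cbs[i] = &cb[i];
        }
        return io_submit(ctx, nr_segs, cbs);    /* nr_segs events to reap
                                                   and coalesce in userspace */
}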

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2004-10-16 12:11:39

by William Lee Irwin III

Subject: Re: [PATCH 1/2] aio: add vectored I/O support

On Sat, Oct 16, 2004 at 07:18:45AM +0200, Avi Kivity wrote:
>> It is a huge performance win, at least on the 2.4-based RHEL kernel.
>> Large reads (~256K) using 4K iocbs are very slow on a large RAID, while
>> after I coded a similar patch I got a substantial speedup.

On Sat, Oct 16, 2004 at 06:37:21AM +0100, Joel Becker wrote:
> I'd think we should fix the submission path instead. Why create
> iovs _and_ iocbs when we only need to create one? And even if we
> decided aio_readv() was still nice to keep, we'd want to fix this
> inefficiency in io_submit().

Disk I/O can vector across iocbs; however, network I/O requires
temporal ordering not imposed by sequential iocb submission, barring
unusual extensions to the semantics.


-- wli

2004-10-16 16:28:46

by Joel Becker

Subject: Re: [PATCH 1/2] aio: add vectored I/O support

On Sat, Oct 16, 2004 at 10:43:04AM +0200, Avi Kivity wrote:
> Using IO_CMD_READ for a vector entails
>
> - converting the userspace structure (which might well be an iovec) to iocbs

Why create an iov if you don't need to?

> - merging the iocbs

I don't see how this is different than merging iovs. Whether an
I/O range is represented by two segments of an iov or by two iocbs, the
elevator is going to merge them. If the userspace program had the
knowledge to merge them up front, it should have submitted one larger
segment.

> - generating multiple completions for the merged request

Fair enough.

> - coalescing the multiple completions in userspace to a single completion

You generally have to do this anyway. In fact, it is often far
more efficient and performant to have a pattern of:

submit 10;
reap 3; submit 3 more;
reap 6; submit 6 more;
repeat until you are done;

than to wait on all 10 before you can submit 10 again.
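
In libaio terms that pattern is roughly the following (a sketch; short
submits and errors are ignored, and preparing the iocbs is left to the
caller):

#include <libaio.h>

/* keep the device busy: reap whatever has completed and immediately top
 * the queue back up, rather than draining and refilling in bulk */
static void pump(io_context_t ctx, struct iocb **todo, long total)
{
        struct io_event events[10];
        long submitted = 0, completed = 0, n, want;

        submitted += io_submit(ctx, total < 10 ? total : 10, todo);
        while (completed < total) {
                n = io_getevents(ctx, 1, 10, events, NULL);   /* reap 1..10 */
                completed += n;
                want = total - submitted;
                if (want > n)
                        want = n;                             /* refill by n */
                if (want > 0)
                        submitted += io_submit(ctx, want, todo + submitted);
        }
}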

> error handling is difficult as well. one would expect that a bad sector
> with multiple iocbs would only fail one of the requests. it seems to be
> non-trivial to implement this correctly.

I don't follow this. If you mean that you want all io from
later segments in an iov to fail if one segment has a bad sector, I
don't know that we can enforce it without running one segment at a
time. That's terribly slow.
Again, even if READV is a good idea, we need to fix whatever
inefficiencies io_submit() has. Copying to/from userspace just can't be
that slow.

Joel

--

"When choosing between two evils, I always like to try the one
I've never tried before."
- Mae West

http://www.jlbec.org/
[email protected]

2004-10-16 17:29:21

by Avi Kivity

Subject: Re: [PATCH 1/2] aio: add vectored I/O support

Joel Becker wrote:

>On Sat, Oct 16, 2004 at 10:43:04AM +0200, Avi Kivity wrote:
>
>>Using IO_CMD_READ for a vector entails
>>
>>- converting the userspace structure (which might well be an iovec) to iocbs
>
> Why create an iov if you don't need to?
>
If you aren't writing directly to the kernel API, an iovec is very
convenient. It need not be an iovec, but surely you need _some_ vector.


>>- merging the iocbs
>
> I don't see how this is different than merging iovs. Whether an
>I/O range is represented by two segments of an iov or by two iocbs, the
>elevator is going to merge them. If the userspace program had the
>knowledge to merge them up front, it should have submitted one larger
>segment.
>
No. An iovec is already merged; it is known that adjacent segments of an
iovec have adjacent offsets. a single IO_CMD_READV iovec can generate a
single bio without any merging.

The app did not submit a single large segment for the same reason
non-aio readv is used: because app memory is paged. in my case, a
userspace filesystem has a paged cache; large, disk-contiguous reads go
into many small noncontiguous memory pages. or it might be a database
performing a sequential scan and reading a large block into multiple
block buffers, which are usually discontiguous.

>>- coalescing the multiple completions in userspace to a single completion
>
> You generally have to do this anyway. In fact, it is often far
>more efficient and performant to have a pattern of:
>
> submit 10;
> reap 3; submit 3 more;
> reap 6; submit 6 more;
> repeat until you are done;
>
>than to wait on all 10 before you can submit 10 again.
>
If the data is physically contiguous, it will (should) be merged, and
thus completed in a single event anyway. All 10 completions will happen
at the same time.

I might divide a 1M read into 4 iocbs to get the effect you mention, *if*
I wanted to do anything with partial data. I don't want to be forced into
dividing it by virtual address into 256 4K iocbs.

>>error handling is difficult as well. one would expect that a bad sector
>>with multiple iocbs would only fail one of the requests. it seems to be
>>non-trivial to implement this correctly.
>
> I don't follow this. If you mean that you want all io from
>later segments in an iov to fail if one segment has a bad sector, I
>don't know that we can enforce it without running one segment at a
>time. That's terribly slow.
>
That's not what I meant. If you submit 16 iocbs which are merged by the
kernel, and there is an error somewhere within the seventh iocb, I would
expect to get 15 success completions and one error completion. so error
information from the merged iocb must be demultiplexed into the originals.

If you have a single iocb, then any error simply fails that iocb.

> Again, even if READV is a good idea, we need to fix whatever
>inefficiencies io_submit() has. Copying to/from userspace just can't be
>that slow.
>
The inefficiencies I referred to were disk inefficiencies, not processor ones.

I think what happened was that the submitted iocbs (64 iocbs of 4K each)
did not merge because the device queue depth was very large; no queuing
occurred because (I imagine) merging happens while a request is waiting
for disk readiness.

Decreasing the queue depth is not an option, because I might want to do
random reads of small iovecs later.

Of course, it is better to copy less than to copy more; so that is an
additional win for PREADV.

--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.

2004-10-17 00:16:04

by Joel Becker

Subject: Re: [PATCH 1/2] aio: add vectored I/O support

On Sat, Oct 16, 2004 at 07:29:03PM +0200, Avi Kivity wrote:
> No. An iovec is already merged; it is known that adjacent segments of an
> iovec have adjacent offsets. a single IO_CMD_READV iovec can generate a
> single bio without any merging.

A slightly embarrassed glance at the manpage later, I recall
that fact. Something I never liked about writev(), but that's a moot
point.

> That's not what I meant. If you submit 16 iocbs which are merged by the
> kernel, and there is an error somewhere within the seventh iocb, I would
> expect to get 15 success completions and one error completion. so error
> information from the merged iocb must be demultiplexed into the originals.

This takes no effort, really, for the kernel. The block layer
handles it.

> If you have a single iocb, then any error simply fails that iocb.

True, but in some senses (and this is application specific) you
want to know that 15/16ths succeeded. But we're talking about
application needs, not kernel design. So it's moot for the technical
argument of whether READV is a useful operation. I just wanted to have
the discussion. It wasn't an objection per se.

> I think what happened was that the submitted iocbs (64 iocbs of 4K each)
> did not merge because the device queue depth was very large; no queuing
> occurred because (I imagine) merging happens while a request is waiting
> for disk readiness.

Why did you submit 64 iocbs of 4K? Was every page virtually
discontiguous, or did you arbitrarily decide to create a worst-case?

Joel

--

Life's Little Instruction Book #510

"Count your blessings."

http://www.jlbec.org/
[email protected]

2004-10-17 06:25:39

by Avi Kivity

Subject: Re: [PATCH 1/2] aio: add vectored I/O support

Joel Becker wrote:

>>I think what happened was that the submitted iocbs (64 iocbs of 4K each)
>>did not merge because the device queue depth was very large; no queuing
>>occurred because (I imagine) merging happens while a request is waiting
>>for disk readiness.
>
> Why did you submit 64 iocbs of 4K? Was every page virtually
>discontiguous, or did you arbitrarily decide to create a worst-case?
>
The application (a userspace filesystem with its own cache) manages
memory in 4K pages, but can perform much larger I/Os, for example during
readahead and after merging writes. After a very short while memory is
completely fragmented.


--
Do not meddle in the internals of kernels, for they are subtle and quick to panic.