Reading files with O_DIRECT works very nicely for me off a single drive
(for video streaming, so I don't want caching), but is extremely slow on
software raid0 devices and striped lvm volumes. Basically a striped
raid device reads at much the same speed as a single device with O_DIRECT,
while reading the same file without O_DIRECT gives the expected performance
(but with unwanted caching).

Raw devices behave similarly (though if you are using them you can probably
do your own raid0).

My guess is that this is because the md blocksizes are 1024 rather than
4096: is this the case, and is there a fix? (My quick hack at md.c to try
to make this happen didn't work.)
Justin
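
For concreteness, a minimal sketch of the kind of O_DIRECT streaming read
loop described above; the file name, the 4 MB chunk size and the 4096-byte
alignment are assumptions, not part of the original setup:

/* Minimal O_DIRECT streaming read sketch.  O_DIRECT requires the buffer
 * (and usually the offset and length) to be aligned to the device block
 * size; 4096 bytes is assumed here. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define CHUNK (4 * 1024 * 1024)		/* a few MB per read syscall */

int main(void)
{
	void *buf;
	int fd = open("/video/stream.mpg", O_RDONLY | O_DIRECT);

	if (fd < 0) {
		perror("open");
		return 1;
	}
	if (posix_memalign(&buf, 4096, CHUNK) != 0) {
		fprintf(stderr, "posix_memalign failed\n");
		return 1;
	}
	for (;;) {
		ssize_t n = read(fd, buf, CHUNK);
		if (n <= 0)
			break;			/* EOF or error */
		/* ... hand the chunk to the video pipeline ... */
	}
	free(buf);
	close(fd);
	return 0;
}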
Justin Cormack wrote:
>
> Reading files with O_DIRECT works very nicely for me off a single drive
> (for video streaming, so I don't want caching), but is extremely slow on
> software raid0 devices and striped lvm volumes. Basically a striped
> raid device reads at much the same speed as a single device with O_DIRECT,
> while reading the same file without O_DIRECT gives the expected performance
> (but with unwanted caching).
>
> Raw devices behave similarly (though if you are using them you can probably
> do your own raid0).
>
> My guess is that this is because the md blocksizes are 1024 rather than
> 4096: is this the case, and is there a fix? (My quick hack at md.c to try
> to make this happen didn't work.)
Well, not exactly. Raid0 is faster due to readahead (e.g. you read one
block and the kernel sets the OTHER disk also working in parallel in
anticipation of you using that). O_DIRECT is of course directly in
conflict with this, as you tell the kernel that you DON'T want any
optimisations....

Greetings,
Arjan van de Ven
On Fri, Jan 18, 2002 at 05:50:53PM +0000, Arjan van de Ven wrote:
> Justin Cormack wrote:
> >
> > Reading files with O_DIRECT works very nicely for me off a single drive
> > (for video streaming, so I don't want caching), but is extremely slow on
> > software raid0 devices and striped lvm volumes. Basically a striped
> > raid device reads at much the same speed as a single device with O_DIRECT,
> > while reading the same file without O_DIRECT gives the expected performance
> > (but with unwanted caching).
> >
> > Raw devices behave similarly (though if you are using them you can probably
> > do your own raid0).
> >
> > My guess is that this is because the md blocksizes are 1024 rather than
> > 4096: is this the case, and is there a fix? (My quick hack at md.c to try
> > to make this happen didn't work.)
>
> Well, not exactly. Raid0 is faster due to readahead (e.g. you read one
> block and the kernel sets the OTHER disk also working in parallel in
> anticipation of you using that). O_DIRECT is of course directly in
> conflict with this, as you tell the kernel that you DON'T want any
> optimisations....
If you read in chunks of a few megabytes per read syscall, the lack of
readahead shouldn't make much difference (this is true for both raid and
standalone devices). If there's a relevant difference it's more likely an
issue with the blocksize.
Andrea
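
One way to check this is to time the same O_DIRECT read loop with a small
and a large chunk size, on both the single disk and the raid0 device. A
rough sketch of such a measurement; the helper name is made up, and it
assumes fd was opened with O_DIRECT and buf is a suitably aligned buffer
of at least 'chunk' bytes:

#include <sys/time.h>
#include <unistd.h>

/* Return read throughput in MB/s for one chunk size. */
static double mb_per_sec(int fd, void *buf, size_t chunk)
{
	struct timeval t0, t1;
	long long total = 0;
	ssize_t n;
	double secs;

	gettimeofday(&t0, NULL);
	while ((n = read(fd, buf, chunk)) > 0)
		total += n;
	gettimeofday(&t1, NULL);

	secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6;
	return total / (1024.0 * 1024.0) / secs;
}

If the few-megabyte case is still much slower on raid0 than on a single
disk, the difference is probably not just the missing readahead.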
Andrea Arcangeli <[email protected]> writes:
>
> If you read in chunks of a few megabytes per read syscall, the lack of
> readahead shouldn't make much difference (this is true for both raid and
> standalone devices). If there's a relevant difference it's more likely an
> issue with the blocksize.
The problem with that is that doing overlapping IO requires much more
effort (you need threads in user space). If you don't do overlapping
IO you add a latency bubble for each round trip to user space, after
you read one big chunk and before you submit the request for the next
big chunk. Your disk will not be constantly streaming, because of these
pauses where it doesn't have a request to process.
The application could do it using some aio setup, but it gets rather
complicated and the kernel already knows how to do that well.
I think an optional readahead mode for O_DIRECT would be useful.
-Andi
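
A rough sketch of the userspace-threading workaround described above: a
reader thread keeps a small ring of buffers filled ahead of the consumer,
so the disk always has the next request queued. The buffer count, chunk
size and semaphore scheme are all assumptions:

#include <pthread.h>
#include <semaphore.h>
#include <unistd.h>

#define NBUF  2
#define CHUNK (4 * 1024 * 1024)

struct slot {
	void    *buf;	/* aligned buffer, allocated with posix_memalign */
	ssize_t  len;	/* bytes actually read, <= 0 on EOF or error */
};

static struct slot ring[NBUF];
static sem_t free_slots, full_slots;
static int direct_fd;			/* opened with O_RDONLY | O_DIRECT */

static void *reader(void *arg)
{
	int i = 0;

	(void)arg;
	for (;;) {
		sem_wait(&free_slots);	/* wait for an empty buffer */
		ring[i].len = read(direct_fd, ring[i].buf, CHUNK);
		sem_post(&full_slots);	/* hand it to the consumer */
		if (ring[i].len <= 0)
			break;		/* EOF or error */
		i = (i + 1) % NBUF;
	}
	return NULL;
}

/* Consumer: while one buffer is being played, the reader thread is
 * already blocked in read() for the next one, so the latency bubble
 * between chunks disappears. */
static void consume_all(void)
{
	int i = 0;

	for (;;) {
		sem_wait(&full_slots);
		if (ring[i].len <= 0)
			break;
		/* ... feed ring[i].buf (ring[i].len bytes) to the player ... */
		sem_post(&free_slots);
		i = (i + 1) % NBUF;
	}
}

main() would sem_init(&free_slots, 0, NBUF) and sem_init(&full_slots, 0, 0),
allocate the buffers, open direct_fd and start reader() with pthread_create()
before calling consume_all(). It works, but it costs an extra thread and the
associated scheduling.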
In article <[email protected]> you wrote:
> I think an optional readahead mode for O_DIRECT would be useful.
I disagree. O_DIRECT says "do not cache. period. I know what I'm doing"
and the kernel should respect that imho. After all we have sys_readahead for
the other part...
[email protected] writes:
> In article <[email protected]> you wrote:
>
> > I think an optional readahead mode for O_DIRECT would be useful.
>
> I disagree. O_DIRECT says "do not cache. period. I know what I'm doing"
> and the kernel should respect that imho. After all we have sys_readahead for
> the other part...
The problem with sys_readahead is that it doesn't work for big IO sizes;
e.g. you read in big blocks. You have to do

	readahead(next block);
	read(directfd, ..., big-block);

The readahead comes too early in this case; it would be better if it
were done in the middle of the read of big-block, based on the request
size. Otherwise you risk additional seeks when you overflow the 'read
window', which is all too easy this way.
-Andi
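
Spelled out, the pattern above looks roughly like this (the block size and
the use of a second, non-O_DIRECT descriptor for the readahead hint are
assumptions). The hint for block n+1 is issued before the read of block n
has even started, which is why it comes too early:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

#define BIG_BLOCK (4 * 1024 * 1024)

static void stream(int direct_fd, int cached_fd, void *buf)
{
	off_t off = 0;
	ssize_t n;

	do {
		/* hint the kernel about the *next* block... */
		readahead(cached_fd, off + BIG_BLOCK, BIG_BLOCK);
		/* ...then read the current one with O_DIRECT */
		n = pread(direct_fd, buf, BIG_BLOCK, off);
		off += BIG_BLOCK;
	} while (n == BIG_BLOCK);
}

Issuing the hint from inside the kernel, partway through the large read and
sized from the request, would avoid racing ahead of the read window like this.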
On Sun, Jan 20, 2002 at 10:28:21PM +0100, Andi Kleen wrote:
> Andrea Arcangeli <[email protected]> writes:
> >
> > If you read in chunks of a few megabytes per read syscall, the lack of
> > readahead shouldn't make much difference (this is true for both raid and
> > standalone devices). If there's a relevant difference it's more likely an
> > issue with the blocksize.
>
> The problem with that is that doing overlapping IO requires much more
> effort (you need threads in user space). If you don't do overlapping
> IO you add a latency bubble for each round trip to user space, after
> you read one big chunk and before you submit the request for the next
> big chunk. Your disk will not be constantly streaming, because of these
> pauses where it doesn't have a request to process.
Correct, we can't keep the pipeline always full; the larger the size of
the read/write, the less it will matter. This is the only way to hide
the pipeline stall at the moment (as with rawio).
> The application could do it using some aio setup, but it gets rather
> complicated and the kernel already knows how to do that well.
yes, in short the API to allow the userspace to keep the I/O pipeline
full with a ring of user buffers is not available at the moment.
As you say, one could try to work around it by threading the I/O in
userspace, but it would get rather dirty (and would add scheduling
overhead).
>
> I think an optional readahead mode for O_DIRECT would be useful.
To do transparent readahead we'd need to use the pagecache, so we'd have
to make copies of pages with the CPU between user memory and the
pagecache. But the nicest part of O_DIRECT is that it skips the costly
CPU copies on the membus, so I generally disagree with trying to allow
O_DIRECT to support readahead. I believe that if you need readahead, you
probably shouldn't use O_DIRECT in the first place.

Andrea
On Mon, Jan 21, 2002 at 02:12:24AM +0100, Andrea Arcangeli wrote:
> yes, in short the API to allow the userspace to keep the I/O pipeline
> full with a ring of user buffers is not available at the moment.
See http://www.kvack.org/~blah/aio/ . Seems to work pretty nicely
for raw io.
-ben
On Mon, Jan 21, 2002 at 12:35:52AM -0500, Benjamin LaHaise wrote:
> On Mon, Jan 21, 2002 at 02:12:24AM +0100, Andrea Arcangeli wrote:
> > yes, in short the API to allow the userspace to keep the I/O pipeline
> > full with a ring of user buffers is not available at the moment.
>
> See http://www.kvack.org/~blah/aio/ . Seems to work pretty nicely
> for raw io.
Of course, an async-io API is the right fix for keeping the I/O pipeline
always full. Thanks for pointing it out.
Andrea
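
For reference, keeping the pipeline full with a ring of user buffers looks
roughly like the following under the kernel async-io interface that grew out
of this work, shown here with the later libaio wrapper (the exact API of the
patch at the URL above may have differed, and the buffer count and chunk size
are assumptions; error checking is omitted for brevity):

#include <libaio.h>
#include <stdlib.h>

#define NBUF  2
#define CHUNK (4 * 1024 * 1024)

static void stream(int direct_fd)
{
	io_context_t ctx = 0;
	struct iocb cbs[NBUF], *cbp[1];
	struct io_event ev;
	void *buf[NBUF];
	off_t next = 0;
	int i;

	io_setup(NBUF, &ctx);

	/* prime the pipeline: one read in flight per buffer */
	for (i = 0; i < NBUF; i++) {
		posix_memalign(&buf[i], 4096, CHUNK);
		io_prep_pread(&cbs[i], direct_fd, buf[i], CHUNK, next);
		next += CHUNK;
		cbp[0] = &cbs[i];
		io_submit(ctx, 1, cbp);
	}

	/* as each read completes, consume it and immediately resubmit,
	 * so the device never sits idle between chunks; stop at EOF
	 * (ev.res == 0) */
	while (io_getevents(ctx, 1, 1, &ev, NULL) == 1 && ev.res > 0) {
		struct iocb *done = ev.obj;

		/* ... hand done->u.c.buf (ev.res bytes) to the player ... */
		io_prep_pread(done, direct_fd, done->u.c.buf, CHUNK, next);
		next += CHUNK;
		cbp[0] = done;
		io_submit(ctx, 1, cbp);
	}

	io_destroy(ctx);
}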