2003-08-07 09:55:51

by Suparna Bhattacharya

Subject: [PATCH][2.6-mm] Readahead issues and AIO read speedup

I noticed a problem with the way do_generic_mapping_read
and readahead work for the case of large reads, especially
random reads. This was leading to very inefficient behaviour
for a stream of AIO reads. (See the results a little later
in this note.)

1) We should be reading ahead at least the pages that are
required by the current read request (even if the ra window
is maximally shrunk). I think we had this in 2.4 - we
seem to have lost it in 2.5.
The result is that sometimes (large random reads) we end
up doing reads one page at a time, waiting for each read to
complete before reading the next page and so on, even for a
large read (until we build up a readahead window again) -
roughly the per-page behaviour sketched after point 2 below.

2) Once the ra window is maximally shrunk, the responsibility
for reading the pages and re-building the window is shifted
to the slow path in read, which breaks down in the case of
a stream of AIO reads where multiple iocbs submit reads
to the same file rather than serialising on i/o
completion.
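
To make the effect concrete, here is a grossly simplified
sketch (illustration only, not the actual kernel code; error
handling and page refcounting are omitted) of what the read
path effectively ends up doing per page once the window is
shrunk and nothing is in flight:

for (index = first; index < last; index++) {
        struct page *page = find_get_page(mapping, index);

        if (!page) {
                page = page_cache_alloc_cold(mapping);
                add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
                /* one 4K read is issued ... */
                mapping->a_ops->readpage(filp, page);
                /* ... and waited for, before we move to the next page */
                wait_on_page_locked(page);
        }
        /* copy this page to userspace, then on to the next one */
}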

So here is a patch that fixes this by making sure we do (1),
and by moving the handle_ra_miss calls for the maximally
shrunk case up before the loop that waits for I/O completion.

Does it make a difference ? A lot, actually.

Here's a summary of 64 KB random read throughput results
running aio-stress for a 1GB file on 2.6.0-test2,
2.6.0-test2-mm4, and 2.6.0-test2-mm4 with this patch, on a
4-way test system:

./aio-stress -o3 testdir/rwfile5
file size 1024MB, record size 64KB, depth 64, ios per iteration 8
max io_submit 8, buffer alignment set to 4KB

2.6.0-test2:
-----------
random read on testdir/rwfile5 (8.54 MB/s) 1024.00 MB in 119.91s
(not true aio btw - here aio_read is actually fully synchronous
since buffered fs aio support isn't in)

2.6.0-test2-mm4:
---------------
random read on testdir/rwfile5 (2.40 MB/s) 1024.00 MB in 426.10s
(and this is with buffered fs aio support i.e. true aio
which demonstrates just how bad the problem is)

And with
2.6.0-test2-mm4 + the attached patch (read-speedup.patch)
-------------------------------------
random read on testdir/rwfile5 (21.45 MB/s) 1024.00 MB in 47.74s
(Throughput is now 2.5x that in vanilla 2.6.0-test2)

Just as a reference, here are the throughput results for
O_DIRECT aio ( ./aio-stress -O -o3 testdir/rwfile5) on the
same kernel:
random read on testdir/rwfile5 (17.71 MB/s) 1024.00 MB in 57.81s

Note: aio-stress is available at Chris Mason's ftp site
ftp.suse.com/pub/people/mason/utils/aio-stress.c
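
For anyone unfamiliar with the submission pattern involved,
here is a minimal userspace sketch (hypothetical, libaio-based,
not aio-stress itself, link with -laio; error handling is kept
to a minimum) of a batch of 64KB random reads against one file,
all submitted before any completion is reaped:

#include <libaio.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NR_IOS  8
#define RECSZ   (64 * 1024)

int main(int argc, char **argv)
{
        io_context_t ctx = 0;
        struct iocb iocbs[NR_IOS], *iocbp[NR_IOS];
        struct io_event events[NR_IOS];
        off_t fsize;
        int fd, i;

        if (argc < 2 || (fd = open(argv[1], O_RDONLY)) < 0)
                return 1;
        if (io_setup(NR_IOS, &ctx))
                return 1;
        fsize = lseek(fd, 0, SEEK_END);
        if (fsize < RECSZ)
                return 1;

        for (i = 0; i < NR_IOS; i++) {
                void *buf;
                off_t off = (random() % (fsize / RECSZ)) * RECSZ;

                if (posix_memalign(&buf, 4096, RECSZ))
                        return 1;
                io_prep_pread(&iocbs[i], fd, buf, RECSZ, off);
                iocbp[i] = &iocbs[i];
        }
        /* all NR_IOS reads go in at once ... */
        if (io_submit(ctx, NR_IOS, iocbp) != NR_IOS)
                return 1;
        /* ... and only now do we wait for the completions */
        if (io_getevents(ctx, NR_IOS, NR_IOS, events, NULL) != NR_IOS)
                return 1;

        io_destroy(ctx);
        close(fd);
        return 0;
}

Each of these iocbs goes down the read path for the same file
without waiting on the others, which is exactly the situation
point (2) above describes.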

Now, another way to solve the problem would be to
modify page_cache_readahead to let it know about the
size of the request, and to make it handle re-enablement
of readahead, and have handle_ra_miss only deal with
misses due to VM eviction. Perhaps this is what we should
do in the long run ?
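
To make that alternative concrete, the interface change could
look something like the hypothetical sketch below (not part of
the attached patch; the req_size parameter and the way the
window is re-grown here are assumptions, and the usual
sequential-window logic is elided):

void page_cache_readahead(struct address_space *mapping,
                          struct file_ra_state *ra, struct file *filp,
                          unsigned long index, unsigned long req_size)
{
        if (ra->next_size == -1UL) {
                /*
                 * The window was maximally shrunk: read at least what
                 * the caller needs right now, and let the window grow
                 * again starting from the request size.
                 */
                do_page_cache_readahead(mapping, filp, index, req_size);
                ra->start = index;
                ra->size = req_size;
                ra->next_size = req_size;
                return;
        }
        /* ... usual sequential readahead window logic ... */
}

Callers would pass in the number of pages covered by the current
read or iocb, and handle_ra_miss would then be left to deal only
with pages lost to VM eviction.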

Comments/feedback welcome.

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India


--- pure-mm4/mm/filemap.c   Wed Aug 6 22:54:24 2003
+++ linux-2.6.0-test2/mm/filemap.c  Wed Aug 6 22:56:11 2003
@@ -608,13 +608,13 @@
                              read_actor_t actor)
 {
         struct inode *inode = mapping->host;
-        unsigned long index, offset, last, end_index;
+        unsigned long index, offset, first, last, end_index;
         struct page *cached_page;
         loff_t isize = i_size_read(inode);
         int error;

         cached_page = NULL;
-        index = *ppos >> PAGE_CACHE_SHIFT;
+        first = *ppos >> PAGE_CACHE_SHIFT;
         offset = *ppos & ~PAGE_CACHE_MASK;

         last = (*ppos + desc->count) >> PAGE_CACHE_SHIFT;
@@ -632,11 +632,19 @@
          * Let the readahead logic know upfront about all
          * the pages we'll need to satisfy this request
          */
-        for (; index < last; index++)
+        for (index = first ; index < last; index++)
                 page_cache_readahead(mapping, ra, filp, index);
-        index = *ppos >> PAGE_CACHE_SHIFT;
+
+        if (ra->next_size == -1UL) {
+                /* the readahead window was maximally shrunk */
+                /* explicitly readahead at least what is needed now */
+                for (index = first; index < last; index++)
+                        handle_ra_miss(mapping, ra, index);
+                do_page_cache_readahead(mapping, filp, first, last - first);
+        }

 done_readahead:
+        index = first;
         for (;;) {
                 struct page *page;
                 unsigned long nr, ret;


2003-08-07 16:12:56

by Badari Pulavarty

Subject: Re: [PATCH][2.6-mm] Readahead issues and AIO read speedup

Suparna,

I noticed the exact same thing while testing a database benchmark
on filesystems (without AIO). I added instrumentation in the scsi
layer to record the IO pattern and found that we are doing lots
(4 million) of 4K reads in my benchmark run. I traced that and
found that all those reads are generated by the slow read path,
since the readahead window is maximally shrunk. When I forced the
readahead code to read 16k (my database pagesize) in case the ra
window is closed, I saw a 20% improvement in my benchmark. I asked
"Ramchandra Pai" ([email protected]) to investigate it further.

Thanks,
Badari

On Thursday 07 August 2003 03:01 am, Suparna Bhattacharya wrote:
> I noticed a problem with the way do_generic_mapping_read
> and readahead works for the case of large reads, especially
> random reads. This was leading to very inefficient behaviour
> for a stream for AIO reads. (See the results a little later
> in this note)
>
> 1) We should be reading ahead at least the pages that are
> required by the current read request (even if the ra window
> is maximally shrunk). I think I've seen this in 2.4 - we
> seem to have lost that in 2.5.
> The result is that sometimes (large random reads) we end
> up doing reads one page at a time waiting for it to complete
> being reading the next page and so on, even for a large read.
> (until we buildup a readahead window again)
>
> 2) Once the ra window is maximally shrunk, the responsibility
> for reading the pages and re-building the window is shifted
> to the slow path in read, which breaks down in the case of
> a stream of AIO reads where multiple iocbs submit reads
> to the same file rather than serialise the wait for i/o
> completion.
>
> So here is a patch that fixes this by making sure we do
> (1) and pushing up the handle_ra_miss calls for the maximally
> shrunk case before the loop that waits for I/O completion.
>
> Does it make a difference ? A lot, actually.

2003-08-07 16:26:18

by Andrew Morton

Subject: Re: [PATCH][2.6-mm] Readahead issues and AIO read speedup

Badari Pulavarty <[email protected]> wrote:
>
> I noticed the exact same thing while testing on database benchmark
> on filesystems (without AIO). I added instrumentation in scsi layer to
> record the IO pattern and I found that we are doing lots of (4million)
> 4K reads, in my benchmark run. I was tracing that and found that all
> those reads are generated by slow read path, since readahead window
> is maximally shrunk. When I forced the readahead code to read 16k
> (my database pagesize), in case ra window closed - I see 20% improvement
> in my benchmark. I asked "Ramchandra Pai" ([email protected])
> to investigate it further.

But if all the file's pages are already in pagecache (a common case)
this patched kernel will consume extra CPU pointlessly poking away at
pagecache. Reliably shrinking the window to zero is important for this
reason.

If the database pagesize is 16k then the application should be submitting
16k reads, yes? If so then these should not be creating 4k requests at the
device layer! So what we need to do is to ensure that at least those 16k
worth of pages are submitted in a single chunk. Without blowing CPU if
everything is cached. Tricky. I'll take a look at what's going on.


Another relevant constraint here (and there are lots of subtle constraints
in readahead) is that often database files are fragmented all over the
disk, because they were laid out that way (depends on the database and
how it was set up). In this case, any extra readahead is a disaster
because it incurs extra seeks, needlessly.

2003-08-07 17:32:23

by Badari Pulavarty

Subject: Re: [PATCH][2.6-mm] Readahead issues and AIO read speedup

On Thursday 07 August 2003 09:28 am, Andrew Morton wrote:
> Badari Pulavarty <[email protected]> wrote:
> > I noticed the exact same thing while testing on database benchmark
> > on filesystems (without AIO). I added instrumentation in scsi layer to
> > record the IO pattern and I found that we are doing lots of (4million)
> > 4K reads, in my benchmark run. I was tracing that and found that all
> > those reads are generated by slow read path, since readahead window
> > is maximally shrunk. When I forced the readahead code to read 16k
> > (my database pagesize), in case ra window closed - I see 20% improvement
> > in my benchmark. I asked "Ramchandra Pai" ([email protected])
> > to investigate it further.
>
> But if all the file's pages are already in pagecache (a common case)
> this patched kernel will consume extra CPU pointlessly poking away at
> pagecache. Reliably shrinking the window to zero is important for this
> reason.

Yes !! I hardcoded it to 16k, since I know that all my reads will
be at least 16k. Doing readahead of the actual pages required by
the current read would be the correct solution (like Suparna
suggested).

>
> If the database pagesize is 16k then the application should be submitting
> 16k reads, yes?

Yes. The database always does IO in at least 16k chunks (in my case).

> If so then these should not be creating 4k requests at the
> device layer! So what we need to do is to ensure that at least those 16k
> worth of pages are submitted in a single chunk. Without blowing CPU if
> everything is cached. Tricky. I'll take a look at what's going on.

When the readahead window is closed, the slow read code will be
submitting IO in 4k chunks. In fact, it will wait for the IO to
finish before reading the next page. Isn't it ? How would you
ensure at least 16k worth of pages are submitted in a single
chunk here ?

I am hoping that forcing the readahead code to read the pages
needed by the current read would address this problem.

> Another relevant constraint here (and there are lots of subtle constraints
> in readahead) is that often database files are fragmented all over the
> disk, because they were laid out that way (depends on the database and
> how it was set up). In this case, any extra readahead is a disaster
> because it incurs extra seeks, needlessly.

Agreed. In my case, I made sure that all the files are almost contiguous.
(I put one file per filesystem - and verified it through debugfs).

Thanks,
Badari

2003-08-07 17:37:48

by Andrew Morton

Subject: Re: [PATCH][2.6-mm] Readahead issues and AIO read speedup

Badari Pulavarty <[email protected]> wrote:
>
> We should do readahead of actual pages required by the current
> read would be correct solution. (like Suparna suggested).

I repeat: what will be the effect of this if all those pages are already in
pagecache?

> When readahead window is closed, slow read code will be submitting IO in 4k
> chunks. In fact, it will wait for the IO to finish, before reading next page.
> Isn't it?

Yes.

> How would you ensure at least 16k worth of pages are submitted
> in a single chunk here ?

Need to work out why the window got itself shrunk, fix that, then constrain each
readahead chunk to the size of the application's read() if the application
just seeked, I think.

> I am hoping that forcing readahead code to read pages needed by current read
> would address this problem.

You'll have Martin and Anton coming after you with SDET regressions.


2003-08-07 19:36:49

by Joel Becker

Subject: Re: [PATCH][2.6-mm] Readahead issues and AIO read speedup

On Thu, Aug 07, 2003 at 09:28:00AM -0700, Andrew Morton wrote:
> If the database pagesize is 16k then the application should be submitting
> 16k reads, yes? If so then these should not be creating 4k requests at the
> device layer! So what we need to do is to ensure that at least those 16k

Um, I thought bio enforced single CDB I/O for contiguous chunks
of disk! If it doesn't, 2.6 is just as lame as 2.4.

Joel

--

"Three o'clock is always too late or too early for anything you
want to do."
- Jean-Paul Sartre

http://www.jlbec.org/
[email protected]

2003-08-07 20:52:33

by Badari Pulavarty

Subject: Re: [PATCH][2.6-mm] Readahead issues and AIO read speedup

On Thursday 07 August 2003 10:39 am, Andrew Morton wrote:
> Badari Pulavarty <[email protected]> wrote:
> > We should do readahead of actual pages required by the current
> > read would be correct solution. (like Suparna suggested).
>
> I repeat: what will be the effect of this if all those pages are already in
> pagecache?

Hmm !! Do you think just peeking at the pagecache and bailing out
if nothing needs to be done is too expensive ? Anyway, the slow
read code has to do this later. Does doing it one more time in
readahead cause a significant perf. hit ? And also, do you think
this is the most common case ?


Thanks,
Badari

2003-08-07 20:56:32

by Andrew Morton

Subject: Re: [PATCH][2.6-mm] Readahead issues and AIO read speedup

Badari Pulavarty <[email protected]> wrote:
>
> On Thursday 07 August 2003 10:39 am, Andrew Morton wrote:
> > Badari Pulavarty <[email protected]> wrote:
> > > We should do readahead of actual pages required by the current
> > > read would be correct solution. (like Suparna suggested).
> >
> > I repeat: what will be the effect of this if all those pages are already in
> > pagecache?
>
> Hmm !! Do you think just peeking at pagecache and bailing out if
> nothing needed to be done, is too expensive ? Anyway, slow read
> code has to do this later. Doing it in readahead one more time causes
> significant perf. hit ?

It has been observed, yes.

> And also, do you think this is the most common case ?

It is a very common case. It's one we need to care for. Especially when
lots of CPUs are hitting the same file.

There are things we can do to tweak it up, such as adding a max_index to
find_get_pages(), then do multipage lookups, etc. But not doing it at all
is always the fastest way.
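
As an illustration of that idea (a hypothetical sketch only,
modelled loosely on the existing find_get_pages(); the name, the
max_index argument and the locking details are assumptions, not
code from any tree):

unsigned int find_get_pages_max(struct address_space *mapping,
                                unsigned long start, unsigned long max_index,
                                unsigned int nr_pages, struct page **pages)
{
        unsigned int i, ret;

        spin_lock(&mapping->page_lock);
        ret = radix_tree_gang_lookup(&mapping->page_tree,
                                (void **)pages, start, nr_pages);
        for (i = 0; i < ret; i++) {
                /* stop once we are past the current request */
                if (pages[i]->index > max_index)
                        break;
                page_cache_get(pages[i]);
        }
        spin_unlock(&mapping->page_lock);
        /* only pages[0..i-1] have had their refcounts raised */
        return i;
}

A read path built on this could pick up references to a whole
request's worth of cached pages in one lookup under one lock,
rather than doing a find_get_page() per page.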

2003-08-08 05:42:53

by Jens Axboe

Subject: Re: [PATCH][2.6-mm] Readahead issues and AIO read speedup

On Thu, Aug 07 2003, Joel Becker wrote:
> On Thu, Aug 07, 2003 at 09:28:00AM -0700, Andrew Morton wrote:
> > If the database pagesize is 16k then the application should be submitting
> > 16k reads, yes? If so then these should not be creating 4k requests at the
> > device layer! So what we need to do is to ensure that at least those 16k
>
> Um, I thought bio enforced single CDB I/O for contiguous chunks
> of disk! If it doesn't, 2.6 is just as lame as 2.4.

It does, of course; that was one of the main points of the bio
structure. ->readpages() can submit those 16k minimum chunks in one bio.
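
For reference, a filesystem opts into that batching through its
->readpages() method; the 2.6-era ext2 hook, for example, is
essentially a one-liner through mpage_readpages(), which merges
contiguous pages into as few, as large bios as the on-disk layout
allows (quoted from memory, as an illustration):

static int
ext2_readpages(struct file *file, struct address_space *mapping,
                struct list_head *pages, unsigned nr_pages)
{
        return mpage_readpages(mapping, pages, nr_pages, ext2_get_block);
}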

--
Jens Axboe

2003-08-08 13:51:10

by Suparna Bhattacharya

Subject: Re: [PATCH][2.6-mm] Readahead issues and AIO read speedup

On Thu, Aug 07, 2003 at 01:58:19PM -0700, Andrew Morton wrote:
> Badari Pulavarty <[email protected]> wrote:
> >
> > On Thursday 07 August 2003 10:39 am, Andrew Morton wrote:
> > > Badari Pulavarty <[email protected]> wrote:
> > > > We should do readahead of actual pages required by the current
> > > > read would be correct solution. (like Suparna suggested).
> > >
> > > I repeat: what will be the effect of this if all those pages are already in
> > > pagecache?
> >
> > Hmm !! Do you think just peeking at pagecache and bailing out if
> > nothing needed to be done, is too expensive ? Anyway, slow read
> > code has to do this later. Doing it in readahead one more time causes
> > significant perf. hit ?
>
> It has been observed, yes.
>
> > And also, do you think this is the most common case ?
>
> It is a very common case. It's one we need to care for. Especially when
> lots of CPUs are hitting the same file.

So, you are concerned about the case when the window was maximally
shrunk via check_ra_success because the pages are already cached ?
Is it mainly for small reads that the performance hit due to the
extra pagecache lookup is significant compared to the cycles spent
in copy_to_user ? For less-than-page-sized reads clustering isn't
relevant; we can just keep the old behaviour there.

If on the other hand large cached reads are affected as well, then
we need to work on a solution.

>
> There are things we can do to tweak it up, such as adding a max_index to
> find_get_pages(), then do multipage lookups, etc. But not doing it at all
> is always the fastest way.

Regards
Suparna

--
Suparna Bhattacharya ([email protected])
Linux Technology Center
IBM Software Labs, India

2003-08-13 20:03:46

by Ram Pai

Subject: Re: [PATCH][2.6-mm] Readahead issues and AIO read speedup

I ran a database workload on the test3-mm1 bits and got around 20%
improvement. These bits have Suparna's readahead patch.

The number of 4k read requests sent to the driver dropped
significantly, by 95%, effectively generating a smaller number of
larger-size read requests.

And the number of readahead-window resets dropped by 50%.

FYI,
Ram Pai

On Fri, 8 Aug 2003, Suparna Bhattacharya wrote:

> On Thu, Aug 07, 2003 at 01:58:19PM -0700, Andrew Morton wrote:
> > Badari Pulavarty <[email protected]> wrote:
> > >
> > > On Thursday 07 August 2003 10:39 am, Andrew Morton wrote:
> > > > Badari Pulavarty <[email protected]> wrote:
> > > > > We should do readahead of actual pages required by the current
> > > > > read would be correct solution. (like Suparna suggested).

--
Ram Pai
[email protected]
----------------------------------




2003-09-23 00:41:30

by Ram Pai

Subject: Re: [PATCH][2.6-mm] Readahead issues and AIO read speedup

On Thu, 2003-08-07 at 13:58, Andrew Morton wrote:
> Badari Pulavarty <[email protected]> wrote:
> >
> > On Thursday 07 August 2003 10:39 am, Andrew Morton wrote:
> > > Badari Pulavarty <[email protected]> wrote:
> > > > We should do readahead of actual pages required by the current
> > > > read would be correct solution. (like Suparna suggested).
> > >
> > > I repeat: what will be the effect of this if all those pages are already in
> > > pagecache?
> >
> > Hmm !! Do you think just peeking at pagecache and bailing out if
> > nothing needed to be done, is too expensive ? Anyway, slow read
> > code has to do this later. Doing it in readahead one more time causes
> > significant perf. hit ?
>
> It has been observed, yes.

We found substantial improvements (around 20%) in a database
decision support workload on filesystems.

To address your concern regarding a possible SDET regression
generated by this patch, we (Steve Pratt) ran a bunch of
regression tests on 2.6.0-test5-mm2 (with and without the patch).
I have pasted the results of SDET and kernbench below. We did not
find any noticeable performance regression.


Here are some results from Steve on test5-mm2
**************************************************************************


sdet comparison
of
2.6.0-test5-mm2 vs 2.6.0-test5mm2-without-READAHEAD-patch


Results:Throughput

tolerance = 0.00 + 3.00% of 2.6.0-test5-mm2
              2.6.0-test5-mm2   2.6.0-test5mm2-without-READAHEAD
   Threads      Ops/sec      Ops/sec    %diff         diff    tolerance
---------- ------------ ------------ -------- ------------ ------------
         1         3089         3103     0.45        14.00        92.67
         4        11181        11294     1.01       113.00       335.43
        16        18436        18530     0.51        94.00       553.08
        64        18867        19002     0.72       135.00       566.01

****************************************************************************
**************************************************************************
kernbench comparison
of
2.6.0-test5-mm2 vs 2.6.0-test5mm2-without-READAHEAD


Results:Elapsed Time
tolerance = 0.00 + 3.00% of 2.6.0-test5-mm2
              2.6.0-test5-mm2   2.6.0-test5mm2-without-READAHEAD
                Seconds      Seconds    %diff         diff    tolerance
---------- ------------ ------------ -------- ------------ ------------
         2       96.015       95.035    -1.02        -0.98         2.88

**************************************************************************

Would you like us to run some other tests?

Thanks,
RP



>
> > And also, do you think this is the most common case ?
>
> It is a very common case. It's one we need to care for. Especially when
> lots of CPUs are hitting the same file.
>
> There are things we can do to tweak it up, such as adding a max_index to
> find_get_pages(), then do multipage lookups, etc. But not doing it at all
> is always the fastest way.