2007-05-16 22:51:26

by Wu Fengguang

[permalink] [raw]
Subject: [PATCH 5/9] readahead: on-demand readahead logic

This is a minimal readahead algorithm that aims to replace the current one.
It is more flexible and reliable, while maintaining almost the same behavior
and performance. Also it is full integrated with adaptive readahead.

It is designed to be called on demand:
- on a missing page, to do synchronous readahead
- on a lookahead page, to do asynchronous readahead

In this way it eliminated the awkward workarounds for cache hit/miss,
readahead thrashing, retried read, and unaligned read. It also adopts the
data structure introduced by adaptive readahead, parameterizes readahead
pipelining with `lookahead_index', and reduces the current/ahead windows
to one single window.

HEURISTICS

The logic deals with four cases:

- sequential-next
found a consistent readahead window, so push it forward

- random
standalone small read, so read as is

- sequential-first
create a new readahead window for a sequential/oversize request

- lookahead-clueless
hit a lookahead page not associated with the readahead window,
so create a new readahead window and ramp it up

In each case, three parameters are determined:

- readahead index: where the next readahead begins
- readahead size: how much to readahead
- lookahead size: when to do the next readahead (for pipelining)


BEHAVIORS

The old behaviors are maximally preserved for trivial sequential/random reads.
Notable changes are:

- It no longer imposes strict sequential checks.
It might help some interleaved cases, and clustered random reads.
It does introduce risks of a random lookahead hit triggering an
unexpected readahead. But in general it is more likely to do good
than to do evil.

- Interleaved reads are supported in a minimal way.
Their chances of being detected and proper handled are still low.

- Readahead thrashings are better handled.
The current readahead leads to tiny average I/O sizes, because it
never turn back for the thrashed pages. They have to be fault in
by do_generic_mapping_read() one by one. Whereas the on-demand
readahead will redo readahead for them.


OVERHEADS

The new code reduced the overheads of

- excessively calling the readahead routine on small sized reads
(the current readahead code insists on seeing all requests)

- doing a lot of pointless page-cache lookups for small cached files
(the current readahead only turns itself off after 256 cache hits,
unfortunately most files are < 1MB, so never see that chance)

That accounts for speedup of
- 0.3% on 1-page sequential reads on sparse file
- 1.2% on 1-page cache hot sequential reads
- 3.2% on 256-page cache hot sequential reads
- 1.3% on cache hot `tar /lib`

However, it does introduce one extra page-cache lookup per cache miss, which
impacts random reads slightly. That's 1% overheads for 1-page random reads on
sparse file.


PERFORMANCE

The basic benchmark setup is
- 2.6.20 kernel with on-demand readahead
- 1MB max readahead size
- 2.9GHz Intel Core 2 CPU
- 2GB memory
- 160G/8M Hitachi SATA II 7200 RPM disk

The benchmarks show that
- it maintains the same performance for trivial sequential/random reads
- sysbench/OLTP performance on MySQL gains up to 8%
- performance on readahead thrashing gains up to 3 times


iozone throughput (KB/s): roughly the same
==========================================
iozone -c -t1 -s 4096m -r 64k

2.6.20 on-demand gain
first run
" Initial write " 61437.27 64521.53 +5.0%
" Rewrite " 47893.02 48335.20 +0.9%
" Read " 62111.84 62141.49 +0.0%
" Re-read " 62242.66 62193.17 -0.1%
" Reverse Read " 50031.46 49989.79 -0.1%
" Stride read " 8657.61 8652.81 -0.1%
" Random read " 13914.28 13898.23 -0.1%
" Mixed workload " 19069.27 19033.32 -0.2%
" Random write " 14849.80 14104.38 -5.0%
" Pwrite " 62955.30 65701.57 +4.4%
" Pread " 62209.99 62256.26 +0.1%

second run
" Initial write " 60810.31 66258.69 +9.0%
" Rewrite " 49373.89 57833.66 +17.1%
" Read " 62059.39 62251.28 +0.3%
" Re-read " 62264.32 62256.82 -0.0%
" Reverse Read " 49970.96 50565.72 +1.2%
" Stride read " 8654.81 8638.45 -0.2%
" Random read " 13901.44 13949.91 +0.3%
" Mixed workload " 19041.32 19092.04 +0.3%
" Random write " 14019.99 14161.72 +1.0%
" Pwrite " 64121.67 68224.17 +6.4%
" Pread " 62225.08 62274.28 +0.1%

In summary, writes are unstable, reads are pretty close on average:

access pattern 2.6.20 on-demand gain
Read 62085.61 62196.38 +0.2%
Re-read 62253.49 62224.99 -0.0%
Reverse Read 50001.21 50277.75 +0.6%
Stride read 8656.21 8645.63 -0.1%
Random read 13907.86 13924.07 +0.1%
Mixed workload 19055.29 19062.68 +0.0%
Pread 62217.53 62265.27 +0.1%


aio-stress: roughly the same
============================
aio-stress -l -s4096 -r128 -t1 -o1 knoppix511-dvd-cn.iso
aio-stress -l -s4096 -r128 -t1 -o3 knoppix511-dvd-cn.iso

2.6.20 on-demand delta
sequential 92.57s 92.54s -0.0%
random 311.87s 312.15s +0.1%


sysbench fileio: roughly the same
=================================
sysbench --test=fileio --file-io-mode=async --file-test-mode=rndrw \
--file-total-size=4G --file-block-size=64K \
--num-threads=001 --max-requests=10000 --max-time=900 run

threads 2.6.20 on-demand delta
first run
1 59.1974s 59.2262s +0.0%
2 58.0575s 58.2269s +0.3%
4 48.0545s 47.1164s -2.0%
8 41.0684s 41.2229s +0.4%
16 35.8817s 36.4448s +1.6%
32 32.6614s 32.8240s +0.5%
64 23.7601s 24.1481s +1.6%
128 24.3719s 23.8225s -2.3%
256 23.2366s 22.0488s -5.1%

second run
1 59.6720s 59.5671s -0.2%
8 41.5158s 41.9541s +1.1%
64 25.0200s 23.9634s -4.2%
256 22.5491s 20.9486s -7.1%

Note that the numbers are not very stable because of the writes.
The overall performance is close when we sum all seconds up:

sum all up 495.046s 491.514s -0.7%


sysbench oltp (trans/sec): up to 8% gain
========================================
sysbench --test=oltp --oltp-table-size=10000000 --oltp-read-only \
--mysql-socket=/var/run/mysqld/mysqld.sock \
--mysql-user=root --mysql-password=readahead \
--num-threads=064 --max-requests=10000 --max-time=900 run

10000-transactions run
threads 2.6.20 on-demand gain
1 62.81 64.56 +2.8%
2 67.97 70.93 +4.4%
4 81.81 85.87 +5.0%
8 94.60 97.89 +3.5%
16 99.07 104.68 +5.7%
32 95.93 104.28 +8.7%
64 96.48 103.68 +7.5%
5000-transactions run
1 48.21 48.65 +0.9%
8 68.60 70.19 +2.3%
64 70.57 74.72 +5.9%
2000-transactions run
1 37.57 38.04 +1.3%
2 38.43 38.99 +1.5%
4 45.39 46.45 +2.3%
8 51.64 52.36 +1.4%
16 54.39 55.18 +1.5%
32 52.13 54.49 +4.5%
64 54.13 54.61 +0.9%

That's interesting results. Some investigations show that
- MySQL is accessing the db file non-uniformly: some parts are
more hot than others
- It is mostly doing 4-page random reads, and sometimes doing two
reads in a row, the latter one triggers a 16-page readahead.
- The on-demand readahead leaves many lookahead pages (flagged
PG_readahead) there. Many of them will be hit, and trigger
more readahead pages. Which might save more seeks.
- Naturally, the readahead windows tend to lie in hot areas,
and the lookahead pages in hot areas is more likely to be hit.
- The more overall read density, the more possible gain.

That also explains the adaptive readahead tricks for clustered random reads.


readahead thrashing: 3 times better
===================================
We boot kernel with "mem=128m single", and start a 100KB/s stream on every
second, until reaching 200 streams.

max throughput min avg I/O size
2.6.20: 5MB/s 16KB
on-demand: 15MB/s 140KB

Signed-off-by: Fengguang Wu <[email protected]>
---
include/linux/mm.h | 6 +
mm/readahead.c | 174 +++++++++++++++++++++++++++++++++++++++++++
2 files changed, 180 insertions(+)

--- linux-2.6.22-rc1-mm1.orig/mm/readahead.c
+++ linux-2.6.22-rc1-mm1/mm/readahead.c
@@ -611,3 +611,177 @@ unsigned long ra_submit(struct file_ra_s
return actual;
}
EXPORT_SYMBOL_GPL(ra_submit);
+
+/*
+ * Get the previous window size, ramp it up, and
+ * return it as the new window size.
+ */
+static unsigned long get_next_ra_size2(struct file_ra_state *ra,
+ unsigned long max)
+{
+ unsigned long cur = ra->readahead_index - ra->ra_index;
+ unsigned long newsize;
+
+ if (cur < max / 16)
+ newsize = cur * 4;
+ else
+ newsize = cur * 2;
+
+ return min(newsize, max);
+}
+
+/*
+ * On-demand readahead design.
+ *
+ * The fields in struct file_ra_state represent the most-recently-executed
+ * readahead attempt:
+ *
+ * |-------- last readahead window -------->|
+ * |-- application walking here -->|
+ * ======#============|==================#=====================|
+ * ^la_index ^ra_index ^lookahead_index ^readahead_index
+ *
+ * [ra_index, readahead_index) represents the last readahead window.
+ *
+ * [la_index, lookahead_index] is where the application would be walking(in
+ * the common case of cache-cold sequential reads): the last window was
+ * established when the application was at la_index, and the next window will
+ * be bring in when the application reaches lookahead_index.
+ *
+ * To overlap application thinking time and disk I/O time, we do
+ * `readahead pipelining': Do not wait until the application consumed all
+ * readahead pages and stalled on the missing page at readahead_index;
+ * Instead, submit an asynchronous readahead I/O as early as the application
+ * reads on the page at lookahead_index. Normally lookahead_index will be
+ * equal to ra_index, for maximum pipelining.
+ *
+ * In interleaved sequential reads, concurrent streams on the same fd can
+ * be invalidating each other's readahead state. So we flag the new readahead
+ * page at lookahead_index with PG_readahead, and use it as readahead
+ * indicator. The flag won't be set on already cached pages, to avoid the
+ * readahead-for-nothing fuss, saving pointless page cache lookups.
+ *
+ * prev_index tracks the last visited page in the _previous_ read request.
+ * It should be maintained by the caller, and will be used for detecting
+ * small random reads. Note that the readahead algorithm checks loosely
+ * for sequential patterns. Hence interleaved reads might be served as
+ * sequential ones.
+ *
+ * There is a special-case: if the first page which the application tries to
+ * read happens to be the first page of the file, it is assumed that a linear
+ * read is about to happen and the window is immediately set to the initial size
+ * based on I/O request size and the max_readahead.
+ *
+ * The code ramps up the readahead size aggressively at first, but slow down as
+ * it approaches max_readhead.
+ */
+
+/*
+ * A minimal readahead algorithm for trivial sequential/random reads.
+ */
+static unsigned long
+ondemand_readahead(struct address_space *mapping,
+ struct file_ra_state *ra, struct file *filp,
+ struct page *page, pgoff_t offset,
+ unsigned long req_size)
+{
+ unsigned long max; /* max readahead pages */
+ pgoff_t ra_index; /* readahead index */
+ unsigned long ra_size; /* readahead size */
+ unsigned long la_size; /* lookahead size */
+ int sequential;
+
+ max = ra->ra_pages;
+ sequential = (offset - ra->prev_index <= 1UL) || (req_size > max);
+
+ /*
+ * Lookahead/readahead hit, assume sequential access.
+ * Ramp up sizes, and push forward the readahead window.
+ */
+ if (offset && (offset == ra->lookahead_index ||
+ offset == ra->readahead_index)) {
+ ra_index = ra->readahead_index;
+ ra_size = get_next_ra_size2(ra, max);
+ la_size = ra_size;
+ goto fill_ra;
+ }
+
+ /*
+ * Standalone, small read.
+ * Read as is, and do not pollute the readahead state.
+ */
+ if (!page && !sequential) {
+ return __do_page_cache_readahead(mapping, filp,
+ offset, req_size, 0);
+ }
+
+ /*
+ * It may be one of
+ * - first read on start of file
+ * - sequential cache miss
+ * - oversize random read
+ * Start readahead for it.
+ */
+ ra_index = offset;
+ ra_size = get_init_ra_size(req_size, max);
+ la_size = ra_size > req_size ? ra_size - req_size : ra_size;
+
+ /*
+ * Hit on a lookahead page without valid readahead state.
+ * E.g. interleaved reads.
+ * Not knowing its readahead pos/size, bet on the minimal possible one.
+ */
+ if (page) {
+ ra_index++;
+ ra_size = min(4 * ra_size, max);
+ }
+
+fill_ra:
+ ra_set_index(ra, offset, ra_index);
+ ra_set_size(ra, ra_size, la_size);
+
+ return ra_submit(ra, mapping, filp);
+}
+
+/**
+ * page_cache_readahead_ondemand - generic file readahead
+ * @mapping: address_space which holds the pagecache and I/O vectors
+ * @ra: file_ra_state which holds the readahead state
+ * @filp: passed on to ->readpage() and ->readpages()
+ * @page: the page at @offset, or NULL if non-present
+ * @offset: start offset into @mapping, in PAGE_CACHE_SIZE units
+ * @req_size: hint: total size of the read which the caller is performing in
+ * PAGE_CACHE_SIZE units
+ *
+ * page_cache_readahead_ondemand() is the entry point of readahead logic.
+ * This function should be called when it is time to perform readahead:
+ * 1) @page == NULL
+ * A cache miss happened, time for synchronous readahead.
+ * 2) @page != NULL && PageReadahead(@page)
+ * A look-ahead hit occured, time for asynchronous readahead.
+ */
+unsigned long
+page_cache_readahead_ondemand(struct address_space *mapping,
+ struct file_ra_state *ra, struct file *filp,
+ struct page *page, pgoff_t offset,
+ unsigned long req_size)
+{
+ /* no read-ahead */
+ if (!ra->ra_pages)
+ return 0;
+
+ if (page) {
+ ClearPageReadahead(page);
+
+ /*
+ * Defer asynchronous read-ahead on IO congestion.
+ */
+ if (bdi_read_congested(mapping->backing_dev_info))
+ return 0;
+ }
+
+ /* do read-ahead */
+ return ondemand_readahead(mapping, ra, filp, page,
+ offset, req_size);
+}
+EXPORT_SYMBOL_GPL(page_cache_readahead_ondemand);
--- linux-2.6.22-rc1-mm1.orig/include/linux/mm.h
+++ linux-2.6.22-rc1-mm1/include/linux/mm.h
@@ -1157,6 +1157,12 @@ int do_page_cache_readahead(struct addre
pgoff_t offset, unsigned long nr_to_read);
int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
pgoff_t offset, unsigned long nr_to_read);
+unsigned long page_cache_readahead_ondemand(struct address_space *mapping,
+ struct file_ra_state *ra,
+ struct file *filp,
+ struct page *page,
+ pgoff_t offset,
+ unsigned long size);
unsigned long page_cache_readahead(struct address_space *mapping,
struct file_ra_state *ra,
struct file *filp,

--


2007-05-19 06:28:37

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 5/9] readahead: on-demand readahead logic

On Thu, 17 May 2007 06:47:57 +0800 Fengguang Wu <[email protected]> wrote:

> This is a minimal readahead algorithm that aims to replace the current one.
> It is more flexible and reliable, while maintaining almost the same behavior
> and performance. Also it is full integrated with adaptive readahead.
>
> It is designed to be called on demand:
> - on a missing page, to do synchronous readahead
> - on a lookahead page, to do asynchronous readahead
>
> In this way it eliminated the awkward workarounds for cache hit/miss,
> readahead thrashing, retried read, and unaligned read. It also adopts the
> data structure introduced by adaptive readahead, parameterizes readahead
> pipelining with `lookahead_index', and reduces the current/ahead windows
> to one single window.
>
> HEURISTICS
>
> The logic deals with four cases:
>
> ...
>

That would have to be the best changelog I've ever seen ;) Thanks for
persisting with this.

> sysbench oltp (trans/sec): up to 8% gain

Have you given any thought to identifying workloads which may be worsened
by your changes? Attempt to deliberately expose any weak spots?

2007-05-19 13:02:18

by Wu Fengguang

[permalink] [raw]
Subject: Re: [PATCH 5/9] readahead: on-demand readahead logic

On Fri, May 18, 2007 at 11:23:35PM -0700, Andrew Morton wrote:
>
> That would have to be the best changelog I've ever seen ;) Thanks for
> persisting with this.

Thank you :)

> > sysbench oltp (trans/sec): up to 8% gain
>
> Have you given any thought to identifying workloads which may be worsened
> by your changes? Attempt to deliberately expose any weak spots?

Yeah. All possible downsides I can imagine are:

- CPU overheads
Only random reads will be hurt.
That's 1% slow down for _sparse files_, and should be much smaller
when real I/O is involved.

- Behavior changes
It do not enforce strict check sequentialness.
- it is in general a good behavior for interleaved reads and
clustered-and-intermixed-random/sequential workloads.
- it might lead to more readahead misses
E.g. a random read sequence of 0,1,4,12,28,60,92,124,156,188,220
that is weird enough to start the readahead and hit all the
lookahead pages.
I highly doubt the possibility of such patterns happen in real
world. But if ever it happens repeatedly for some user, he can
work it around easily by tuning readahead_kb to some other value.

So, it is only a possibility that some random workload may be
worsened. But it's really hard to find one real world example.

2007-06-12 04:36:55

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 5/9] readahead: on-demand readahead logic

On Thu, 2007-05-17 at 06:47 +0800, Fengguang Wu wrote:
> +static unsigned long
> +ondemand_readahead(struct address_space *mapping,
> + struct file_ra_state *ra, struct file *filp,
> + struct page *page, pgoff_t offset,
> + unsigned long req_size)
> +{
> + unsigned long max; /* max readahead pages */
> + pgoff_t ra_index; /* readahead index */
> + unsigned long ra_size; /* readahead size */
> + unsigned long la_size; /* lookahead size */
> + int sequential;
> +
> + max = ra->ra_pages;
> + sequential = (offset - ra->prev_index <= 1UL) || (req_size > max);

Hi again!

This <= 1UL seems weird. prev_index is end of last request, so I'd
expect offset == prev_index + 1 for sequential reads? Does offset ==
ra->prev_index happen? If not, this would be clearer as (offset ==
ra->prev_index + 1).

(prev_index is not a great name either, but that's not your patch 8).

> + /*
> + * Lookahead/readahead hit, assume sequential access.
> + * Ramp up sizes, and push forward the readahead window.
> + */
> + if (offset && (offset == ra->lookahead_index ||
> + offset == ra->readahead_index)) {
> + ra_index = ra->readahead_index;
> + ra_size = get_next_ra_size2(ra, max);
> + la_size = ra_size;
> + goto fill_ra;
> + }

Will offset hit lookahead_index or readahead_index exactly? Should this
be checking the range from offset to offset + req_size?

> + ra_index = offset;
> + ra_size = get_init_ra_size(req_size, max);
> + la_size = ra_size > req_size ? ra_size - req_size : ra_size;

So if we're doing a big sequential read, ra_size < req_size, so next
time offset will be > ra->readahead_index and the "ramp up sizes" code
won't get run?

> + /*
> + * Hit on a lookahead page without valid readahead state.
> + * E.g. interleaved reads.
> + * Not knowing its readahead pos/size, bet on the minimal possible one.
> + */
> + if (page) {
> + ra_index++;
> + ra_size = min(4 * ra_size, max);
> + }

If I understand correctly, it's expected to happen when we have multiple
streams: we previously marked the lookahead page, but then the other
stream changed the ra to somewhere else in the file. We now change it
back to our stream, but we've lost information so we make it up.

This seems a little like two functions crammed into one. Do you think
page_cache_readahead_ondemand() should be split into
"page_cache_readahead()" which doesn't take a page*, and
"page_cache_check_readahead_page()" which is an inline which does the
PageReadahead(page) check as well?

Thanks,
Rusty.

2007-06-12 10:35:33

by Wu Fengguang

[permalink] [raw]
Subject: Re: [PATCH 5/9] readahead: on-demand readahead logic

Hi Rusty,

On Tue, Jun 12, 2007 at 02:36:26PM +1000, Rusty Russell wrote:
> On Thu, 2007-05-17 at 06:47 +0800, Fengguang Wu wrote:
> > +static unsigned long
> > +ondemand_readahead(struct address_space *mapping,
> > + struct file_ra_state *ra, struct file *filp,
> > + struct page *page, pgoff_t offset,
> > + unsigned long req_size)
> > +{
> > + unsigned long max; /* max readahead pages */
> > + pgoff_t ra_index; /* readahead index */
> > + unsigned long ra_size; /* readahead size */
> > + unsigned long la_size; /* lookahead size */
> > + int sequential;
> > +
> > + max = ra->ra_pages;
> > + sequential = (offset - ra->prev_index <= 1UL) || (req_size > max);
>
> This <= 1UL seems weird. prev_index is end of last request, so I'd
> expect offset == prev_index + 1 for sequential reads? Does offset ==
> ra->prev_index happen? If not, this would be clearer as (offset ==
> ra->prev_index + 1).

It's possible to have (offset == ra->prev_index) when someone is doing
1K reads or 10K reads, which do not always align to page boundaries.

> (prev_index is not a great name either, but that's not your patch 8).

It was just renamed from `prev_page', hehe.

> > + /*
> > + * Lookahead/readahead hit, assume sequential access.
> > + * Ramp up sizes, and push forward the readahead window.
> > + */
> > + if (offset && (offset == ra->lookahead_index ||
> > + offset == ra->readahead_index)) {
> > + ra_index = ra->readahead_index;
> > + ra_size = get_next_ra_size2(ra, max);
> > + la_size = ra_size;
> > + goto fill_ra;
> > + }
>
> Will offset hit lookahead_index or readahead_index exactly? Should this
> be checking the range from offset to offset + req_size?

Yes, normally lookahead_index will be hit(1). But in case readahead is
canceled because of IO congestion at that time, readahead_index will
be hit later(2).

The readahead code is called on two possible conditions:
(1) page != NULL and PageReadahead(page)
It will be an asynchronous readahead.
In this case, (offset == ra->lookahead_index) indicates sequential
reads that have been associated with a valid readahead window.

(2) page == NULL
It will be a synchronous readahead.
In this case, (offset == ra->readahead_index) indicates sequential
reads that has just consumed all of the readahead pages.

> > + ra_index = offset;
> > + ra_size = get_init_ra_size(req_size, max);
> > + la_size = ra_size > req_size ? ra_size - req_size : ra_size;
>
> So if we're doing a big sequential read, ra_size < req_size, so next
> time offset will be > ra->readahead_index and the "ramp up sizes" code
> won't get run?

For big reads, (ra_size = max) and (max < req_size), la_size will be
equal to ra_size, or max. So after this readahead invocation submits
max pages of I/O and returns to do_generic_mapping_read(), it will
*immediately* be called again because of lookahead hit:

if (!page) {
page_cache_readahead_ondemand(mapping,
&ra, filp, page,
index, last_index - index);
page = find_get_page(mapping, index);
if (unlikely(page == NULL))
goto no_cached_page;
}
lookahead hit: if (PageReadahead(page)) {
page_cache_readahead_ondemand(mapping,
&ra, filp, page,
index, last_index - index);
}

Then it will submit another max pages of readahead I/O, whether or not
the size ramp up code will be executed: either the remaining request
size is still > max and get_init_ra_size() is called, or the remaining
request size is <= max and get_next_ra_size() is called, in both cases
they will return max. This behavior is inherited from the current
readahead, and makes sense.

> > + /*
> > + * Hit on a lookahead page without valid readahead state.
> > + * E.g. interleaved reads.
> > + * Not knowing its readahead pos/size, bet on the minimal possible one.
> > + */
> > + if (page) {
> > + ra_index++;
> > + ra_size = min(4 * ra_size, max);
> > + }
>
> If I understand correctly, it's expected to happen when we have multiple
> streams: we previously marked the lookahead page, but then the other
> stream changed the ra to somewhere else in the file. We now change it
> back to our stream, but we've lost information so we make it up.

Yeah, exactly!

> This seems a little like two functions crammed into one. Do you think
> page_cache_readahead_ondemand() should be split into
> "page_cache_readahead()" which doesn't take a page*, and
> "page_cache_check_readahead_page()" which is an inline which does the
> PageReadahead(page) check as well?

page_cache_check_readahead_page(..., page) is a good idea.
But which part of the code should we put to page_cache_readahead()
that does not take a page param?

Thank you,
Fengguang

2007-06-13 01:41:08

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 5/9] readahead: on-demand readahead logic

On Tue, 2007-06-12 at 18:35 +0800, Fengguang Wu wrote:
> > This seems a little like two functions crammed into one. Do you think
> > page_cache_readahead_ondemand() should be split into
> > "page_cache_readahead()" which doesn't take a page*, and
> > "page_cache_check_readahead_page()" which is an inline which does the
> > PageReadahead(page) check as well?
>
> page_cache_check_readahead_page(..., page) is a good idea.
> But which part of the code should we put to page_cache_readahead()
> that does not take a page param?

OK, here's my attempt. I also made them return void, since none of the
callers seem to mind. I didn't change the internals much: I think
they're pretty clear and I didn't want to clash if you decided to rename
the "ra" fields. It compiles and boots.

Thoughts?

Signed-off-by: Rusty Russell <[email protected]>

diff -r 3c8ae0c37063 fs/ext3/dir.c
--- a/fs/ext3/dir.c Tue Jun 12 17:17:25 2007 +1000
+++ b/fs/ext3/dir.c Wed Jun 13 11:17:42 2007 +1000
@@ -139,10 +139,10 @@ static int ext3_readdir(struct file * fi
pgoff_t index = map_bh.b_blocknr >>
(PAGE_CACHE_SHIFT - inode->i_blkbits);
if (!ra_has_index(&filp->f_ra, index))
- page_cache_readahead_ondemand(
+ page_cache_sync_readahead(
sb->s_bdev->bd_inode->i_mapping,
&filp->f_ra, filp,
- NULL, index, 1);
+ index, 1);
filp->f_ra.prev_index = index;
bh = ext3_bread(NULL, inode, blk, 0, &err);
}
diff -r 3c8ae0c37063 fs/ext4/dir.c
--- a/fs/ext4/dir.c Tue Jun 12 17:17:25 2007 +1000
+++ b/fs/ext4/dir.c Wed Jun 13 11:19:56 2007 +1000
@@ -138,10 +138,10 @@ static int ext4_readdir(struct file * fi
pgoff_t index = map_bh.b_blocknr >>
(PAGE_CACHE_SHIFT - inode->i_blkbits);
if (!ra_has_index(&filp->f_ra, index))
- page_cache_readahead_ondemand(
+ page_cache_sync_readahead(
sb->s_bdev->bd_inode->i_mapping,
&filp->f_ra, filp,
- NULL, index, 1);
+ index, 1);
filp->f_ra.prev_index = index;
bh = ext4_bread(NULL, inode, blk, 0, &err);
}
diff -r 3c8ae0c37063 fs/splice.c
--- a/fs/splice.c Tue Jun 12 17:17:25 2007 +1000
+++ b/fs/splice.c Wed Jun 13 11:18:38 2007 +1000
@@ -312,8 +312,8 @@ __generic_file_splice_read(struct file *
*/
page = find_get_page(mapping, index);
if (!page) {
- page_cache_readahead_ondemand(mapping, &in->f_ra, in,
- NULL, index, nr_pages - spd.nr_pages);
+ page_cache_sync_readahead(mapping, &in->f_ra, in,
+ index, nr_pages - spd.nr_pages);

/*
* page didn't exist, allocate one.
@@ -360,8 +360,7 @@ __generic_file_splice_read(struct file *
this_len = min_t(unsigned long, len, PAGE_CACHE_SIZE - loff);
page = pages[page_nr];

- if (PageReadahead(page))
- page_cache_readahead_ondemand(mapping, &in->f_ra, in,
+ page_cache_check_readahead_page(mapping, &in->f_ra, in,
page, index, nr_pages - page_nr);

/*
diff -r 3c8ae0c37063 include/linux/mm.h
--- a/include/linux/mm.h Tue Jun 12 17:17:25 2007 +1000
+++ b/include/linux/mm.h Wed Jun 13 11:25:10 2007 +1000
@@ -1146,12 +1146,34 @@ int do_page_cache_readahead(struct addre
pgoff_t offset, unsigned long nr_to_read);
int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
pgoff_t offset, unsigned long nr_to_read);
-unsigned long page_cache_readahead_ondemand(struct address_space *mapping,
- struct file_ra_state *ra,
- struct file *filp,
- struct page *page,
- pgoff_t offset,
- unsigned long size);
+
+void page_cache_sync_readahead(struct address_space *mapping,
+ struct file_ra_state *ra,
+ struct file *filp,
+ pgoff_t offset,
+ unsigned long size);
+
+void page_cache_async_readahead(struct address_space *mapping,
+ struct file_ra_state *ra,
+ struct file *filp,
+ struct page *pg,
+ pgoff_t offset,
+ unsigned long size);
+
+/* If page has PG_readahead flag set, call async readahead logic. */
+static inline void
+page_cache_check_readahead_page(struct address_space *mapping,
+ struct file_ra_state *ra,
+ struct file *filp,
+ struct page *pg,
+ pgoff_t offset,
+ unsigned long size)
+{
+ if (!PageReadahead(pg))
+ return;
+ page_cache_async_readahead(mapping, ra, filp, pg, offset, size);
+}
+
unsigned long max_sane_readahead(unsigned long nr);

/* Do stack extension */
diff -r 3c8ae0c37063 mm/filemap.c
--- a/mm/filemap.c Tue Jun 12 17:17:25 2007 +1000
+++ b/mm/filemap.c Wed Jun 13 11:22:04 2007 +1000
@@ -913,18 +913,16 @@ find_page:
find_page:
page = find_get_page(mapping, index);
if (!page) {
- page_cache_readahead_ondemand(mapping,
- &ra, filp, page,
+ page_cache_sync_readahead(mapping,
+ &ra, filp,
index, last_index - index);
page = find_get_page(mapping, index);
if (unlikely(page == NULL))
goto no_cached_page;
}
- if (PageReadahead(page)) {
- page_cache_readahead_ondemand(mapping,
+ page_cache_check_readahead_page(mapping,
&ra, filp, page,
index, last_index - index);
- }
if (!PageUptodate(page))
goto page_not_up_to_date;
page_ok:
@@ -1382,16 +1380,14 @@ retry_find:
*/
if (VM_SequentialReadHint(vma)) {
if (!page) {
- page_cache_readahead_ondemand(mapping, ra, file, page,
+ page_cache_sync_readahead(mapping, ra, file,
fdata->pgoff, 1);
page = find_lock_page(mapping, fdata->pgoff);
if (!page)
goto no_cached_page;
}
- if (PageReadahead(page)) {
- page_cache_readahead_ondemand(mapping, ra, file, page,
+ page_cache_check_readahead_page(mapping, ra, file, page,
fdata->pgoff, 1);
- }
}

if (!page) {
diff -r 3c8ae0c37063 mm/readahead.c
--- a/mm/readahead.c Tue Jun 12 17:17:25 2007 +1000
+++ b/mm/readahead.c Wed Jun 13 11:09:39 2007 +1000
@@ -351,7 +351,7 @@ static unsigned long
static unsigned long
ondemand_readahead(struct address_space *mapping,
struct file_ra_state *ra, struct file *filp,
- struct page *page, pgoff_t offset,
+ bool hit_lookahead_marker, pgoff_t offset,
unsigned long req_size)
{
unsigned long max; /* max readahead pages */
@@ -379,7 +379,7 @@ ondemand_readahead(struct address_space
* Standalone, small read.
* Read as is, and do not pollute the readahead state.
*/
- if (!page && !sequential) {
+ if (!hit_lookahead_marker && !sequential) {
return __do_page_cache_readahead(mapping, filp,
offset, req_size, 0);
}
@@ -400,7 +400,7 @@ ondemand_readahead(struct address_space
* E.g. interleaved reads.
* Not knowing its readahead pos/size, bet on the minimal possible one.
*/
- if (page) {
+ if (hit_lookahead_marker) {
ra_index++;
ra_size = min(4 * ra_size, max);
}
@@ -413,50 +413,71 @@ fill_ra:
}

/**
- * page_cache_readahead_ondemand - generic file readahead
+ * page_cache_sync_readahead - generic file readahead
* @mapping: address_space which holds the pagecache and I/O vectors
* @ra: file_ra_state which holds the readahead state
* @filp: passed on to ->readpage() and ->readpages()
- * @page: the page at @offset, or NULL if non-present
* @offset: start offset into @mapping, in PAGE_CACHE_SIZE units
* @req_size: hint: total size of the read which the caller is performing in
* PAGE_CACHE_SIZE units
*
- * page_cache_readahead_ondemand() is the entry point of readahead logic.
- * This function should be called when it is time to perform readahead:
- * 1) @page == NULL
- * A cache miss happened, time for synchronous readahead.
- * 2) @page != NULL && PageReadahead(@page)
- * A look-ahead hit occured, time for asynchronous readahead.
- */
-unsigned long
-page_cache_readahead_ondemand(struct address_space *mapping,
- struct file_ra_state *ra, struct file *filp,
- struct page *page, pgoff_t offset,
- unsigned long req_size)
+ * page_cache_sync_readahead() should be called when a cache miss happened:
+ * it will submit the read. The readahead logic may decide to piggyback more
+ * pages onto the read request if access patterns suggest it will improve
+ * performance.
+ */
+void page_cache_sync_readahead(struct address_space *mapping,
+ struct file_ra_state *ra, struct file *filp,
+ pgoff_t offset, unsigned long req_size)
{
/* no read-ahead */
if (!ra->ra_pages)
- return 0;
-
- if (page) {
- /*
- * It can be PG_reclaim.
- */
- if (PageWriteback(page))
- return 0;
-
- ClearPageReadahead(page);
-
- /*
- * Defer asynchronous read-ahead on IO congestion.
- */
- if (bdi_read_congested(mapping->backing_dev_info))
- return 0;
- }
+ return;

/* do read-ahead */
- return ondemand_readahead(mapping, ra, filp, page,
- offset, req_size);
-}
-EXPORT_SYMBOL_GPL(page_cache_readahead_ondemand);
+ ondemand_readahead(mapping, ra, filp, false, offset, req_size);
+}
+EXPORT_SYMBOL_GPL(page_cache_sync_readahead);
+
+/**
+ * page_cache_async_readahead - file readahead for marked pages
+ * @mapping: address_space which holds the pagecache and I/O vectors
+ * @ra: file_ra_state which holds the readahead state
+ * @filp: passed on to ->readpage() and ->readpages()
+ * @page: the page at @offset which has the PageReadahead flag set
+ * @offset: start offset into @mapping, in PAGE_CACHE_SIZE units
+ * @req_size: hint: total size of the read which the caller is performing in
+ * PAGE_CACHE_SIZE units
+ *
+ * page_cache_async_ondemand() should be called when a page is used which
+ * has the PageReadahead flag: this is a marker to suggest that the application
+ * has used up enough of the readahead window that we should start pulling in
+ * more pages. */
+void
+page_cache_async_readahead(struct address_space *mapping,
+ struct file_ra_state *ra, struct file *filp,
+ struct page *page, pgoff_t offset,
+ unsigned long req_size)
+{
+ /* no read-ahead */
+ if (!ra->ra_pages)
+ return;
+
+ /*
+ * Same bit is used for PG_readahead and PG_reclaim.
+ */
+ if (PageWriteback(page))
+ return;
+
+ ClearPageReadahead(page);
+
+ /*
+ * Defer asynchronous read-ahead on IO congestion.
+ */
+ if (bdi_read_congested(mapping->backing_dev_info))
+ return;
+
+ /* do read-ahead */
+ ondemand_readahead(mapping, ra, filp, true, offset, req_size);
+}
+EXPORT_SYMBOL_GPL(page_cache_async_readahead);


2007-06-13 04:01:12

by Wu Fengguang

[permalink] [raw]
Subject: Re: [PATCH 5/9] readahead: on-demand readahead logic

On Wed, Jun 13, 2007 at 11:40:33AM +1000, Rusty Russell wrote:
> On Tue, 2007-06-12 at 18:35 +0800, Fengguang Wu wrote:
> > > This seems a little like two functions crammed into one. Do you think
> > > page_cache_readahead_ondemand() should be split into
> > > "page_cache_readahead()" which doesn't take a page*, and
> > > "page_cache_check_readahead_page()" which is an inline which does the
> > > PageReadahead(page) check as well?
> >
> > page_cache_check_readahead_page(..., page) is a good idea.
> > But which part of the code should we put to page_cache_readahead()
> > that does not take a page param?
>
> OK, here's my attempt. I also made them return void, since none of the
> callers seem to mind. I didn't change the internals much: I think
> they're pretty clear and I didn't want to clash if you decided to rename
> the "ra" fields. It compiles and boots.
>
> Thoughts?

Thank you, comments follow.

> +void page_cache_sync_readahead(struct address_space *mapping,
> + struct file_ra_state *ra,
> + struct file *filp,
> + pgoff_t offset,
> + unsigned long size);
> +
> +void page_cache_async_readahead(struct address_space *mapping,
> + struct file_ra_state *ra,
> + struct file *filp,
> + struct page *pg,
> + pgoff_t offset,
> + unsigned long size);

Got your idea: it looks like a nice split.

> +/* If page has PG_readahead flag set, call async readahead logic. */
> +static inline void
> +page_cache_check_readahead_page(struct address_space *mapping,
> + struct file_ra_state *ra,
> + struct file *filp,
> + struct page *pg,
> + pgoff_t offset,
> + unsigned long size)
> +{
> + if (!PageReadahead(pg))
> + return;
> + page_cache_async_readahead(mapping, ra, filp, pg, offset, size);
> +}

This function might not be necessary?
I guess the explicit code is clear enough.

> static unsigned long
> ondemand_readahead(struct address_space *mapping,
> struct file_ra_state *ra, struct file *filp,
> - struct page *page, pgoff_t offset,
> + bool hit_lookahead_marker, pgoff_t offset,

Or use names like async/is_async for hit_lookahead_marker?
Seems that you have accepted the 'lookahead' term ;)

> unsigned long req_size)
> {
> unsigned long max; /* max readahead pages */
> @@ -379,7 +379,7 @@ ondemand_readahead(struct address_space
> * Standalone, small read.
> * Read as is, and do not pollute the readahead state.
> */
> - if (!page && !sequential) {
> + if (!hit_lookahead_marker && !sequential) {
> return __do_page_cache_readahead(mapping, filp,
> offset, req_size, 0);
> }
> @@ -400,7 +400,7 @@ ondemand_readahead(struct address_space
> * E.g. interleaved reads.
> * Not knowing its readahead pos/size, bet on the minimal possible one.
> */
> - if (page) {
> + if (hit_lookahead_marker) {
> ra_index++;
> ra_size = min(4 * ra_size, max);
> }

> +/**
> + * page_cache_async_readahead - file readahead for marked pages
> + * @mapping: address_space which holds the pagecache and I/O vectors
> + * @ra: file_ra_state which holds the readahead state
> + * @filp: passed on to ->readpage() and ->readpages()
> + * @page: the page at @offset which has the PageReadahead flag set

^PG_readahead

> + * @offset: start offset into @mapping, in PAGE_CACHE_SIZE units
> + * @req_size: hint: total size of the read which the caller is performing in
> + * PAGE_CACHE_SIZE units

^in pagecache pages?
//Christoph is changing the fixed PAGE_CACHE_SIZE to variable ones.

> + *
> + * page_cache_async_ondemand() should be called when a page is used which
> + * has the PageReadahead flag: this is a marker to suggest that the application

^PG_readahead

> + * has used up enough of the readahead window that we should start pulling in
> + * more pages. */
> +void
> +page_cache_async_readahead(struct address_space *mapping,
> + struct file_ra_state *ra, struct file *filp,
> + struct page *page, pgoff_t offset,
> + unsigned long req_size)
> +{
> + /* no read-ahead */
> + if (!ra->ra_pages)
> + return;
> +
> + /*
> + * Same bit is used for PG_readahead and PG_reclaim.

I like this new comment!

> + */
> + if (PageWriteback(page))
> + return;
> +
> + ClearPageReadahead(page);

Thank you,
Fengguang

2007-06-13 05:51:45

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 5/9] readahead: on-demand readahead logic

On Wed, 2007-06-13 at 12:00 +0800, Fengguang Wu wrote:
> On Wed, Jun 13, 2007 at 11:40:33AM +1000, Rusty Russell wrote:
> > +/* If page has PG_readahead flag set, call async readahead logic. */
> > +static inline void
> > +page_cache_check_readahead_page(struct address_space *mapping,
> > + struct file_ra_state *ra,
> > + struct file *filp,
> > + struct page *pg,
> > + pgoff_t offset,
> > + unsigned long size)
> > +{
> > + if (!PageReadahead(pg))
> > + return;
> > + page_cache_async_readahead(mapping, ra, filp, pg, offset, size);
> > +}
>
> This function might not be necessary?
> I guess the explicit code is clear enough.

Hi Fengguang,

Yes, I think you're right.

> > static unsigned long
> > ondemand_readahead(struct address_space *mapping,
> > struct file_ra_state *ra, struct file *filp,
> > - struct page *page, pgoff_t offset,
> > + bool hit_lookahead_marker, pgoff_t offset,
>
> Or use names like async/is_async for hit_lookahead_marker?

I wasn't sure. The argument says "we've hit a marker, so do readahead
even if doesn't look sequential.". Now, you and I know that only
happens if it's an async readahead, but that's not really relevant to
this function.

> Seems that you have accepted the 'lookahead' term ;)

Yes, but I shouldn't, because marker is still called PG_readahead 8) I
changed this to "hit_readahead_marker" instead.

> > +/**
> > + * page_cache_async_readahead - file readahead for marked pages
> > + * @mapping: address_space which holds the pagecache and I/O vectors
> > + * @ra: file_ra_state which holds the readahead state
> > + * @filp: passed on to ->readpage() and ->readpages()
> > + * @page: the page at @offset which has the PageReadahead flag set
>
> ^PG_readahead

Oh, yes.

> > + * @offset: start offset into @mapping, in PAGE_CACHE_SIZE units
> > + * @req_size: hint: total size of the read which the caller is performing in
> > + * PAGE_CACHE_SIZE units
>
> ^in pagecache pages?
> //Christoph is changing the fixed PAGE_CACHE_SIZE to variable ones.

This came from your patch, but it sounds like a good change. Also the
PAGE_CACHE_SIZE units above.

> I like this new comment!

Heh, maybe that means I finally understand the code 8).

OK, here is revised patch with your changes:

====
Split ondemand readahead interface into two functions. I think this
makes it a little clearer for non-readahead experts (like Rusty).

Internally they both call ondemand_readahead(), but the page argument
is changed to an obvious boolean flag.

Signed-off-by: Rusty Russell <[email protected]>

diff -r e96f11f61577 fs/ext3/dir.c
--- a/fs/ext3/dir.c Sun Jun 10 18:25:28 2007 +1000
+++ b/fs/ext3/dir.c Wed Jun 13 15:16:49 2007 +1000
@@ -139,10 +139,10 @@ static int ext3_readdir(struct file * fi
pgoff_t index = map_bh.b_blocknr >>
(PAGE_CACHE_SHIFT - inode->i_blkbits);
if (!ra_has_index(&filp->f_ra, index))
- page_cache_readahead_ondemand(
+ page_cache_sync_readahead(
sb->s_bdev->bd_inode->i_mapping,
&filp->f_ra, filp,
- NULL, index, 1);
+ index, 1);
filp->f_ra.prev_index = index;
bh = ext3_bread(NULL, inode, blk, 0, &err);
}
diff -r e96f11f61577 fs/ext4/dir.c
--- a/fs/ext4/dir.c Sun Jun 10 18:25:28 2007 +1000
+++ b/fs/ext4/dir.c Wed Jun 13 15:16:49 2007 +1000
@@ -138,10 +138,10 @@ static int ext4_readdir(struct file * fi
pgoff_t index = map_bh.b_blocknr >>
(PAGE_CACHE_SHIFT - inode->i_blkbits);
if (!ra_has_index(&filp->f_ra, index))
- page_cache_readahead_ondemand(
+ page_cache_sync_readahead(
sb->s_bdev->bd_inode->i_mapping,
&filp->f_ra, filp,
- NULL, index, 1);
+ index, 1);
filp->f_ra.prev_index = index;
bh = ext4_bread(NULL, inode, blk, 0, &err);
}
diff -r e96f11f61577 fs/splice.c
--- a/fs/splice.c Sun Jun 10 18:25:28 2007 +1000
+++ b/fs/splice.c Wed Jun 13 15:16:49 2007 +1000
@@ -312,8 +312,8 @@ __generic_file_splice_read(struct file *
*/
page = find_get_page(mapping, index);
if (!page) {
- page_cache_readahead_ondemand(mapping, &in->f_ra, in,
- NULL, index, nr_pages - spd.nr_pages);
+ page_cache_sync_readahead(mapping, &in->f_ra, in,
+ index, nr_pages - spd.nr_pages);

/*
* page didn't exist, allocate one.
@@ -361,7 +361,7 @@ __generic_file_splice_read(struct file *
page = pages[page_nr];

if (PageReadahead(page))
- page_cache_readahead_ondemand(mapping, &in->f_ra, in,
+ page_cache_async_readahead(mapping, &in->f_ra, in,
page, index, nr_pages - page_nr);

/*
diff -r e96f11f61577 include/linux/mm.h
--- a/include/linux/mm.h Sun Jun 10 18:25:28 2007 +1000
+++ b/include/linux/mm.h Wed Jun 13 15:16:49 2007 +1000
@@ -1146,12 +1146,20 @@ int do_page_cache_readahead(struct addre
pgoff_t offset, unsigned long nr_to_read);
int force_page_cache_readahead(struct address_space *mapping, struct file *filp,
pgoff_t offset, unsigned long nr_to_read);
-unsigned long page_cache_readahead_ondemand(struct address_space *mapping,
- struct file_ra_state *ra,
- struct file *filp,
- struct page *page,
- pgoff_t offset,
- unsigned long size);
+
+void page_cache_sync_readahead(struct address_space *mapping,
+ struct file_ra_state *ra,
+ struct file *filp,
+ pgoff_t offset,
+ unsigned long size);
+
+void page_cache_async_readahead(struct address_space *mapping,
+ struct file_ra_state *ra,
+ struct file *filp,
+ struct page *pg,
+ pgoff_t offset,
+ unsigned long size);
+
unsigned long max_sane_readahead(unsigned long nr);

/* Do stack extension */
diff -r e96f11f61577 mm/filemap.c
--- a/mm/filemap.c Sun Jun 10 18:25:28 2007 +1000
+++ b/mm/filemap.c Wed Jun 13 15:16:49 2007 +1000
@@ -913,15 +913,15 @@ find_page:
find_page:
page = find_get_page(mapping, index);
if (!page) {
- page_cache_readahead_ondemand(mapping,
- &ra, filp, page,
+ page_cache_sync_readahead(mapping,
+ &ra, filp,
index, last_index - index);
page = find_get_page(mapping, index);
if (unlikely(page == NULL))
goto no_cached_page;
}
if (PageReadahead(page)) {
- page_cache_readahead_ondemand(mapping,
+ page_cache_async_readahead(mapping,
&ra, filp, page,
index, last_index - index);
}
@@ -1382,14 +1382,14 @@ retry_find:
*/
if (VM_SequentialReadHint(vma)) {
if (!page) {
- page_cache_readahead_ondemand(mapping, ra, file, page,
+ page_cache_sync_readahead(mapping, ra, file,
fdata->pgoff, 1);
page = find_lock_page(mapping, fdata->pgoff);
if (!page)
goto no_cached_page;
}
if (PageReadahead(page)) {
- page_cache_readahead_ondemand(mapping, ra, file, page,
+ page_cache_async_readahead(mapping, ra, file, page,
fdata->pgoff, 1);
}
}
diff -r e96f11f61577 mm/readahead.c
--- a/mm/readahead.c Sun Jun 10 18:25:28 2007 +1000
+++ b/mm/readahead.c Wed Jun 13 15:16:49 2007 +1000
@@ -351,7 +351,7 @@ static unsigned long
static unsigned long
ondemand_readahead(struct address_space *mapping,
struct file_ra_state *ra, struct file *filp,
- struct page *page, pgoff_t offset,
+ bool hit_readahead_marker, pgoff_t offset,
unsigned long req_size)
{
unsigned long max; /* max readahead pages */
@@ -379,7 +379,7 @@ ondemand_readahead(struct address_space
* Standalone, small read.
* Read as is, and do not pollute the readahead state.
*/
- if (!page && !sequential) {
+ if (!hit_readahead_marker && !sequential) {
return __do_page_cache_readahead(mapping, filp,
offset, req_size, 0);
}
@@ -400,7 +400,7 @@ ondemand_readahead(struct address_space
* E.g. interleaved reads.
* Not knowing its readahead pos/size, bet on the minimal possible one.
*/
- if (page) {
+ if (hit_readahead_marker) {
ra_index++;
ra_size = min(4 * ra_size, max);
}
@@ -413,50 +413,71 @@ fill_ra:
}

/**
- * page_cache_readahead_ondemand - generic file readahead
+ * page_cache_sync_readahead - generic file readahead
* @mapping: address_space which holds the pagecache and I/O vectors
* @ra: file_ra_state which holds the readahead state
* @filp: passed on to ->readpage() and ->readpages()
- * @page: the page at @offset, or NULL if non-present
- * @offset: start offset into @mapping, in PAGE_CACHE_SIZE units
+ * @offset: start offset into @mapping, in pagecache page-sized units
* @req_size: hint: total size of the read which the caller is performing in
- * PAGE_CACHE_SIZE units
- *
- * page_cache_readahead_ondemand() is the entry point of readahead logic.
- * This function should be called when it is time to perform readahead:
- * 1) @page == NULL
- * A cache miss happened, time for synchronous readahead.
- * 2) @page != NULL && PageReadahead(@page)
- * A look-ahead hit occured, time for asynchronous readahead.
- */
-unsigned long
-page_cache_readahead_ondemand(struct address_space *mapping,
- struct file_ra_state *ra, struct file *filp,
- struct page *page, pgoff_t offset,
- unsigned long req_size)
+ * pagecache pages
+ *
+ * page_cache_sync_readahead() should be called when a cache miss happened:
+ * it will submit the read. The readahead logic may decide to piggyback more
+ * pages onto the read request if access patterns suggest it will improve
+ * performance.
+ */
+void page_cache_sync_readahead(struct address_space *mapping,
+ struct file_ra_state *ra, struct file *filp,
+ pgoff_t offset, unsigned long req_size)
{
/* no read-ahead */
if (!ra->ra_pages)
- return 0;
-
- if (page) {
- /*
- * It can be PG_reclaim.
- */
- if (PageWriteback(page))
- return 0;
-
- ClearPageReadahead(page);
-
- /*
- * Defer asynchronous read-ahead on IO congestion.
- */
- if (bdi_read_congested(mapping->backing_dev_info))
- return 0;
- }
+ return;

/* do read-ahead */
- return ondemand_readahead(mapping, ra, filp, page,
- offset, req_size);
-}
-EXPORT_SYMBOL_GPL(page_cache_readahead_ondemand);
+ ondemand_readahead(mapping, ra, filp, false, offset, req_size);
+}
+EXPORT_SYMBOL_GPL(page_cache_sync_readahead);
+
+/**
+ * page_cache_async_readahead - file readahead for marked pages
+ * @mapping: address_space which holds the pagecache and I/O vectors
+ * @ra: file_ra_state which holds the readahead state
+ * @filp: passed on to ->readpage() and ->readpages()
+ * @page: the page at @offset which has the PG_readahead flag set
+ * @offset: start offset into @mapping, in pagecache page-sized units
+ * @req_size: hint: total size of the read which the caller is performing in
+ * pagecache pages
+ *
+ * page_cache_async_ondemand() should be called when a page is used which
+ * has the PG_readahead flag: this is a marker to suggest that the application
+ * has used up enough of the readahead window that we should start pulling in
+ * more pages. */
+void
+page_cache_async_readahead(struct address_space *mapping,
+ struct file_ra_state *ra, struct file *filp,
+ struct page *page, pgoff_t offset,
+ unsigned long req_size)
+{
+ /* no read-ahead */
+ if (!ra->ra_pages)
+ return;
+
+ /*
+ * Same bit is used for PG_readahead and PG_reclaim.
+ */
+ if (PageWriteback(page))
+ return;
+
+ ClearPageReadahead(page);
+
+ /*
+ * Defer asynchronous read-ahead on IO congestion.
+ */
+ if (bdi_read_congested(mapping->backing_dev_info))
+ return;
+
+ /* do read-ahead */
+ ondemand_readahead(mapping, ra, filp, true, offset, req_size);
+}
+EXPORT_SYMBOL_GPL(page_cache_async_readahead);


2007-06-13 07:07:37

by Wu Fengguang

[permalink] [raw]
Subject: Re: [PATCH 5/9] readahead: on-demand readahead logic

On Wed, Jun 13, 2007 at 03:51:14PM +1000, Rusty Russell wrote:
> > > static unsigned long
> > > ondemand_readahead(struct address_space *mapping,
> > > struct file_ra_state *ra, struct file *filp,
> > > - struct page *page, pgoff_t offset,
> > > + bool hit_lookahead_marker, pgoff_t offset,
> >
> > Or use names like async/is_async for hit_lookahead_marker?
>
> I wasn't sure. The argument says "we've hit a marker, so do readahead
> even if doesn't look sequential.". Now, you and I know that only
> happens if it's an async readahead, but that's not really relevant to
> this function.

Yeah, it's an interface for the message "hey I hit a readahead marker".
That's the point!

> OK, here is revised patch with your changes:
>
> ====
> Split ondemand readahead interface into two functions. I think this
> makes it a little clearer for non-readahead experts (like Rusty).
>
> Internally they both call ondemand_readahead(), but the page argument
> is changed to an obvious boolean flag.
>
> Signed-off-by: Rusty Russell <[email protected]>

Acked-by: Fengguang Wu <[email protected]>

Thank you,
Fengguang