Introduce a new page flag: PG_readahead.
It acts as a look-ahead mark, which tells the page reader:
Hey, it's time to invoke the read-ahead logic. For the sake of I/O pipelining,
don't wait until it runs out of cached pages!
Signed-off-by: Fengguang Wu <[email protected]>
---
include/linux/page-flags.h | 5 +++++
mm/page_alloc.c | 2 +-
2 files changed, 6 insertions(+), 1 deletion(-)
--- linux-2.6.22-rc1-mm1.orig/include/linux/page-flags.h
+++ linux-2.6.22-rc1-mm1/include/linux/page-flags.h
@@ -92,6 +92,7 @@
#define PG_lazyfree 20 /* MADV_FREE potential throwaway */
#define PG_booked 21 /* Has blocks reserved on-disk */
+#define PG_readahead 22 /* Reminder to do async read-ahead */
/* PG_owner_priv_1 users should have descriptive aliases */
#define PG_checked PG_owner_priv_1 /* Used by some filesystems */
@@ -206,6 +207,10 @@ static inline void SetPageUptodate(struc
#define SetPageBooked(page) set_bit(PG_booked, &(page)->flags)
#define ClearPageBooked(page) clear_bit(PG_booked, &(page)->flags)
+#define PageReadahead(page) test_bit(PG_readahead, &(page)->flags)
+#define SetPageReadahead(page) set_bit(PG_readahead, &(page)->flags)
+#define ClearPageReadahead(page) clear_bit(PG_readahead, &(page)->flags)
+
#define PageReclaim(page) test_bit(PG_reclaim, &(page)->flags)
#define SetPageReclaim(page) set_bit(PG_reclaim, &(page)->flags)
#define ClearPageReclaim(page) clear_bit(PG_reclaim, &(page)->flags)
--- linux-2.6.22-rc1-mm1.orig/mm/page_alloc.c
+++ linux-2.6.22-rc1-mm1/mm/page_alloc.c
@@ -659,7 +659,7 @@ static int prep_new_page(struct page *pa
if (PageReserved(page))
return 1;
- page->flags &= ~(1 << PG_uptodate | 1 << PG_error |
+ page->flags &= ~(1 << PG_uptodate | 1 << PG_error | 1 << PG_readahead |
1 << PG_referenced | 1 << PG_arch_1 |
1 << PG_owner_priv_1 | 1 << PG_mappedtodisk);
set_page_private(page, 0);
--
On Thu, 17 May 2007 06:47:53 +0800 Fengguang Wu <[email protected]> wrote:
> Introduce a new page flag: PG_readahead.
Is there any way in which we can avoid adding a new page flag?
We have the advantage that if the kernel very occasionally gets the wrong
result for PageReadahead(page), nothing particularly bad will happen, so we
can do racy things.
>From a quick peek, it appears that PG_readahead is only ever set against
non-uptodate pages. If true we could perhaps exploit that: say,
PageReadahead(page) == PG_referenced && !PG_uptodate?
> We have the advantage that if the kernel very occasionally gets the wrong
> result for PageReadahead(page), nothing particularly bad will happen, so we
> can do racy things.
On 64bit there is no particular shortage of page flags. If you ever do
racy things please do them 32bit only.
-Andi
On Fri, May 18, 2007 at 11:28:24PM -0700, Andrew Morton wrote:
> On Thu, 17 May 2007 06:47:53 +0800 Fengguang Wu <[email protected]> wrote:
>
> > Introduce a new page flag: PG_readahead.
>
> Is there any way in which we can avoid adding a new page flag?
>
> We have the advantage that if the kernel very occasionally gets the wrong
> result for PageReadahead(page), nothing particularly bad will happen, so we
> can do racy things.
>
> >From a quick peek, it appears that PG_readahead is only ever set against
> non-uptodate pages. If true we could perhaps exploit that: say,
> PageReadahead(page) == PG_referenced && !PG_uptodate?
PG_uptodate will flip to 1 before the reader touches the page :(
However, it may be possible to share the same bit with PG_reclaim or PG_booked.
Which one would be preferred?
On Sat, 19 May 2007 13:35:01 +0200 Andi Kleen <[email protected]> wrote:
> > We have the advantage that if the kernel very occasionally gets the wrong
> > result for PageReadahead(page), nothing particularly bad will happen, so we
> > can do racy things.
>
> On 64bit there is no particular shortage of page flags.
I think pretty much all of the upper 32 bits got used for ia64 fields,
but it's never been very clear how many were actually used or needed.
> If you ever do racy things please do them 32bit only.
hrm. It *really* won't matter if we make one suboptimal readahead decision
every second day. And there's value in having the same behaviour on all
architectures.
On Sat, 19 May 2007 20:30:31 +0800 Fengguang Wu <[email protected]> wrote:
> On Fri, May 18, 2007 at 11:28:24PM -0700, Andrew Morton wrote:
> > On Thu, 17 May 2007 06:47:53 +0800 Fengguang Wu <[email protected]> wrote:
> >
> > > Introduce a new page flag: PG_readahead.
> >
> > Is there any way in which we can avoid adding a new page flag?
> >
> > We have the advantage that if the kernel very occasionally gets the wrong
> > result for PageReadahead(page), nothing particularly bad will happen, so we
> > can do racy things.
> >
> > >From a quick peek, it appears that PG_readahead is only ever set against
> > non-uptodate pages. If true we could perhaps exploit that: say,
> > PageReadahead(page) == PG_referenced && !PG_uptodate?
>
> PG_uptodate will flip to 1 before the reader touches the page :(
OK.
> However, it may be possible to share the same bit with PG_reclaim or PG_booked.
> Which one would be preferred?
I'd like PG_booked to go away too - I don't think we've put that under the
microscope yet. If it remains then its scope will be "defined by the
filesystem", so readahead shouldn't use it. PG_reclaim belongs to core VFS
so that's much better.
Let's not do anything ugly, slow or silly in there, but please do take a
look, see if there is an opportunity here.
On Sat, May 19, 2007 at 08:25:04AM -0700, Andrew Morton wrote:
> On Sat, 19 May 2007 20:30:31 +0800 Fengguang Wu <[email protected]> wrote:
>
> > On Fri, May 18, 2007 at 11:28:24PM -0700, Andrew Morton wrote:
> > > On Thu, 17 May 2007 06:47:53 +0800 Fengguang Wu <[email protected]> wrote:
> > >
> > > > Introduce a new page flag: PG_readahead.
> > >
> > > Is there any way in which we can avoid adding a new page flag?
> > >
> > > We have the advantage that if the kernel very occasionally gets the wrong
> > > result for PageReadahead(page), nothing particularly bad will happen, so we
> > > can do racy things.
> > >
> > > >From a quick peek, it appears that PG_readahead is only ever set against
> > > non-uptodate pages. If true we could perhaps exploit that: say,
> > > PageReadahead(page) == PG_referenced && !PG_uptodate?
> >
> > PG_uptodate will flip to 1 before the reader touches the page :(
>
> OK.
>
> > However, it may be possible to share the same bit with PG_reclaim or PG_booked.
> > Which one would be preferred?
>
> I'd like PG_booked to go away too - I don't think we've put that under the
> microscope yet. If it remains then its scope will be "defined by the
> filesystem", so readahead shouldn't use it. PG_reclaim belongs to core VFS
> so that's much better.
>
> Let's not do anything ugly, slow or silly in there, but please do take a
> look, see if there is an opportunity here.
The reuse code would look like the attached one.
It still needs more testing, and would fail if Christoph reuses
PG_reclaim in higher order pagecache in the future.
Fengguang Wu
---
Subject: mm: share PG_readahead and PG_reclaim
Share the same page flag bit for PG_readahead and PG_reclaim.
One is used only on file reads, another is only for emergency writes.
One is used mostly for fresh/young pages, another is for old pages.
Combinations of possible interactions are:
a) clear PG_reclaim => implicit clear of PG_readahead
it will delay an asynchronous readahead into a synchronous one
it actually does _good_ for readahead:
the pages will be reclaimed soon, it's readahead thrashing!
in this case, synchronous readahead makes more sense.
b) clear PG_readahead => implicit clear of PG_reclaim
one(and only one) page will not be reclaimed in time
it can be avoided by checking PageWriteback(page) in readahead first
c) set PG_reclaim => implicit set of PG_readahead
will confuse readahead and make it restart the size rampup process
it's a trivial problem, and can mostly be avoided by checking
PageWriteback(page) first in readahead
d) set PG_readahead => implicit set of PG_reclaim
PG_readahead will never be set on already cached pages.
PG_reclaim will always be cleared on dirtying a page.
so not a problem.
In summary,
a) we get better behavior
b,d) possible interactions can be avoided
c) racy condition exists that might affect readahead, but the chance
is _really_ low, and the hurt on readahead is trivial.
Compound pages also use PG_reclaim, but for now they do not interact with
reclaim/readahead code.
Signed-off-by: Fengguang Wu <[email protected]>
---
include/linux/page-flags.h | 4 +++-
mm/page-writeback.c | 1 +
mm/readahead.c | 4 ++++
3 files changed, 8 insertions(+), 1 deletion(-)
--- linux-2.6.22-rc1-mm1.orig/include/linux/page-flags.h
+++ linux-2.6.22-rc1-mm1/include/linux/page-flags.h
@@ -92,7 +92,9 @@
#define PG_lazyfree 20 /* MADV_FREE potential throwaway */
#define PG_booked 21 /* Has blocks reserved on-disk */
-#define PG_readahead 22 /* Reminder to do async read-ahead */
+
+/* PG_readahead is only used for file reads; PG_reclaim is only for writes */
+#define PG_readahead PG_reclaim /* Reminder to do async read-ahead */
/* PG_owner_priv_1 users should have descriptive aliases */
#define PG_checked PG_owner_priv_1 /* Used by some filesystems */
--- linux-2.6.22-rc1-mm1.orig/mm/page-writeback.c
+++ linux-2.6.22-rc1-mm1/mm/page-writeback.c
@@ -922,6 +922,7 @@ int clear_page_dirty_for_io(struct page
BUG_ON(!PageLocked(page));
+ ClearPageReclaim(page);
if (mapping && mapping_cap_account_dirty(mapping)) {
/*
* Yes, Virginia, this is indeed insane.
--- linux-2.6.22-rc1-mm1.orig/mm/readahead.c
+++ linux-2.6.22-rc1-mm1/mm/readahead.c
@@ -447,6 +447,10 @@ page_cache_readahead_ondemand(struct add
if (!ra->ra_pages)
return 0;
+ /* It's PG_reclaim! */
+ if (PageWriteback(page))
+ return 0;
+
if (page) {
ClearPageReadahead(page);
On Sun, 20 May 2007, Fengguang Wu wrote:
> The reuse code would look like the attached one.
> It still needs more testing, and would fail if Christoph reuses
> PG_reclaim in higher order pagecache in the future.
Do not worry about that. All higher order pagecache patchsets remove that
sharing because we will then need these pages on the LRU and thus
PG_reclaim cannot be shared anymore.
On Thu, 2007-05-17 at 06:47 +0800, Fengguang Wu wrote:
> plain text document attachment (mm-introduce-pg_readahead.patch)
> Introduce a new page flag: PG_readahead.
>
> It acts as a look-ahead mark, which tells the page reader:
> Hey, it's time to invoke the read-ahead logic. For the sake of I/O pipelining,
> don't wait until it runs out of cached pages!
Hi Fengguang!
I've been reading your patches, and I have some (possibly dumb!)
questions.
For this patch: why set a bit in the page, rather than keep a value
inside the "struct file_ra_state"?
Thanks,
Rusty.
Hi Rusty,
On Tue, Jun 12, 2007 at 11:04:54AM +1000, Rusty Russell wrote:
> On Thu, 2007-05-17 at 06:47 +0800, Fengguang Wu wrote:
> > plain text document attachment (mm-introduce-pg_readahead.patch)
> > Introduce a new page flag: PG_readahead.
> >
> > It acts as a look-ahead mark, which tells the page reader:
> > Hey, it's time to invoke the read-ahead logic. For the sake of I/O pipelining,
> > don't wait until it runs out of cached pages!
>
> Hi Fengguang!
>
> I've been reading your patches, and I have some (possibly dumb!)
> questions.
>
> For this patch: why set a bit in the page, rather than keep a value
> inside the "struct file_ra_state"?
Good question. I should have documented it in the patch ;)
The short answer:
there can be multiple read streams per fd, i.e. interleaved reads.
file_ra_state can not easily track all of the streams. Solaris zfs
does it by managing a list in struct dnode. While PG_readahead plus
the context based readahead(http://lkml.org/lkml/2006/11/15/54) makes
another solution. The two solutions are comparable in complexity.
The context based readahead is a bit more feature rich: it is sensible
to memory pressure, and supports an unlimited number of streams. Note
that random reads can be regarded as a huge number of one-shot streams,
which can interfere with zfs's list-based readahead states badly.
Fengguang