2006-05-24 11:26:58

by Wu Fengguang

Subject: [PATCH 10/33] readahead: support functions

Several support functions for adaptive read-ahead.

Signed-off-by: Wu Fengguang <[email protected]>
---

include/linux/mm.h | 11 +++++
mm/readahead.c | 107 +++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 118 insertions(+)

--- linux-2.6.17-rc4-mm3.orig/include/linux/mm.h
+++ linux-2.6.17-rc4-mm3/include/linux/mm.h
@@ -1029,6 +1029,17 @@ void handle_ra_miss(struct address_space
			struct file_ra_state *ra, pgoff_t offset);
unsigned long max_sane_readahead(unsigned long nr);

+#ifdef CONFIG_ADAPTIVE_READAHEAD
+extern int readahead_ratio;
+#else
+#define readahead_ratio 1
+#endif /* CONFIG_ADAPTIVE_READAHEAD */
+
+static inline int prefer_adaptive_readahead(void)
+{
+	return readahead_ratio >= 10;
+}
+
/* Do stack extension */
extern int expand_stack(struct vm_area_struct *vma, unsigned long address);
#ifdef CONFIG_IA64
--- linux-2.6.17-rc4-mm3.orig/mm/readahead.c
+++ linux-2.6.17-rc4-mm3/mm/readahead.c
@@ -683,6 +683,113 @@ unsigned long max_sane_readahead(unsigne
}

/*
+ * Adaptive read-ahead.
+ *
+ * Good read patterns are compact both in space and time. The read-ahead logic
+ * tries to grant larger read-ahead size to better readers under the constraint
+ * of system memory and load pressure.
+ *
+ * It employs two methods to estimate the max thrashing safe read-ahead size:
+ * 1. state based - the default one
+ * 2. context based - the failsafe one
+ * The integration of the dual methods has the merit of being agile and robust.
+ * It makes the overall design clean: special cases are handled in general by
+ * the stateless method, leaving the stateful one simple and fast.
+ *
+ * To improve throughput and decrease read delay, the logic 'looks ahead'.
+ * In most read-ahead chunks, one page will be selected and tagged with
+ * PG_readahead. Later when the page with PG_readahead is read, the logic
+ * will be notified to submit the next read-ahead chunk in advance.
+ *
+ *                  a read-ahead chunk
+ *    +-----------------------------------------+
+ *    |       # PG_readahead                    |
+ *    +-----------------------------------------+
+ *            ^ When this page is read, notify me for the next read-ahead.
+ *
+ */
+
+#ifdef CONFIG_ADAPTIVE_READAHEAD
+
+/*
+ * The nature of read-ahead allows false tests to occur occasionally.
+ * Here we just do not bother to call get_page(), it's meaningless anyway.
+ */
+static inline struct page *__find_page(struct address_space *mapping,
+					pgoff_t offset)
+{
+	return radix_tree_lookup(&mapping->page_tree, offset);
+}
+
+static inline struct page *find_page(struct address_space *mapping,
+					pgoff_t offset)
+{
+	struct page *page;
+
+	read_lock_irq(&mapping->tree_lock);
+	page = __find_page(mapping, offset);
+	read_unlock_irq(&mapping->tree_lock);
+	return page;
+}
+
+/*
+ * Move pages in danger (of thrashing) to the head of inactive_list.
+ * Not expected to happen frequently.
+ */
+static unsigned long rescue_pages(struct page *page, unsigned long nr_pages)
+{
+	int pgrescue;
+	pgoff_t index;
+	struct zone *zone;
+	struct address_space *mapping;
+
+	BUG_ON(!nr_pages || !page);
+	pgrescue = 0;
+	index = page_index(page);
+	mapping = page_mapping(page);
+
+	dprintk("rescue_pages(ino=%lu, index=%lu nr=%lu)\n",
+			mapping->host->i_ino, index, nr_pages);
+
+	for (;;) {
+		zone = page_zone(page);
+		spin_lock_irq(&zone->lru_lock);
+
+		if (!PageLRU(page))
+			goto out_unlock;
+
+		while (page_mapping(page) == mapping &&
+				page_index(page) == index) {
+			struct page *the_page = page;
+			page = next_page(page);
+			if (!PageActive(the_page) &&
+					!PageLocked(the_page) &&
+					page_count(the_page) == 1) {
+				list_move(&the_page->lru, &zone->inactive_list);
+				pgrescue++;
+			}
+			index++;
+			if (!--nr_pages)
+				goto out_unlock;
+		}
+
+		spin_unlock_irq(&zone->lru_lock);
+
+		cond_resched();
+		page = find_page(mapping, index);
+		if (!page)
+			goto out;
+	}
+out_unlock:
+	spin_unlock_irq(&zone->lru_lock);
+out:
+	ra_account(NULL, RA_EVENT_READAHEAD_RESCUE, pgrescue);
+	return nr_pages;
+}
+
+#endif /* CONFIG_ADAPTIVE_READAHEAD */
+
+/*
* Read-ahead events accounting.
*/
#ifdef CONFIG_DEBUG_READAHEAD

--


2006-05-25 05:13:21

by Nick Piggin

Subject: Re: [PATCH 10/33] readahead: support functions

Wu Fengguang wrote:

>+#ifdef CONFIG_ADAPTIVE_READAHEAD
>+
>+/*
>+ * The nature of read-ahead allows false tests to occur occasionally.
>+ * Here we just do not bother to call get_page(), it's meaningless anyway.
>+ */
>+static inline struct page *__find_page(struct address_space *mapping,
>+ pgoff_t offset)
>+{
>+ return radix_tree_lookup(&mapping->page_tree, offset);
>+}
>+
>+static inline struct page *find_page(struct address_space *mapping,
>+ pgoff_t offset)
>+{
>+ struct page *page;
>+
>+ read_lock_irq(&mapping->tree_lock);
>+ page = __find_page(mapping, offset);
>+ read_unlock_irq(&mapping->tree_lock);
>+ return page;
>+}
>
>

Meh, this is just open-coded elsewhere in readahead.c; I'd either
open code it, or do a new patch to replace the existing callers.
find_page should be in mm/filemap.c, btw (or include/linux/pagemap.h).

>+
>+/*
>+ * Move pages in danger (of thrashing) to the head of inactive_list.
>+ * Not expected to happen frequently.
>+ */
>+static unsigned long rescue_pages(struct page *page, unsigned long nr_pages)
>
>

Should probably be in mm/vmscan.c

--


2006-05-25 11:13:19

by Wu Fengguang

Subject: Re: [PATCH 10/33] readahead: support functions

On Thu, May 25, 2006 at 03:13:16PM +1000, Nick Piggin wrote:
> Wu Fengguang wrote:
>
> >+#ifdef CONFIG_ADAPTIVE_READAHEAD
> >+
> >+/*
> >+ * The nature of read-ahead allows false tests to occur occasionally.
> >+ * Here we just do not bother to call get_page(), it's meaningless anyway.
> >+ */
> >+static inline struct page *__find_page(struct address_space *mapping,
> >+ pgoff_t offset)
> >+{
> >+ return radix_tree_lookup(&mapping->page_tree, offset);
> >+}
> >+
> >+static inline struct page *find_page(struct address_space *mapping,
> >+ pgoff_t offset)
> >+{
> >+ struct page *page;
> >+
> >+ read_lock_irq(&mapping->tree_lock);
> >+ page = __find_page(mapping, offset);
> >+ read_unlock_irq(&mapping->tree_lock);
> >+ return page;
> >+}
> >
> >
>
> Meh, this is just open-coded elsewhere in readahead.c; I'd either
> open code it, or do a new patch to replace the existing callers.
> find_page should be in mm/filemap.c, btw (or include/linux/pagemap.h).

Maybe it should stay in readahead.c.

I got this early warning from Andrew:
	find_page() is not meant to be a general API, for it can
	easily be abused.

> >+
> >+/*
> >+ * Move pages in danger (of thrashing) to the head of inactive_list.
> >+ * Not expected to happen frequently.
> >+ */
> >+static unsigned long rescue_pages(struct page *page, unsigned long nr_pages)
> >
> >
>
> Should probably be in mm/vmscan.c

Maybe. It's a highly specialized function: it protects a contiguous
range of sequential read-ahead pages in a file. Do you mean it should
move there because of the zone->lru_lock protected statements?

Regards,
Wu

2006-05-25 16:49:08

by Andrew Morton

Subject: Re: [PATCH 10/33] readahead: support functions

Wu Fengguang <[email protected]> wrote:
>
> +/*
> + * The nature of read-ahead allows false tests to occur occasionally.
> + * Here we just do not bother to call get_page(), it's meaningless anyway.
> + */
> +static inline struct page *__find_page(struct address_space *mapping,
> + pgoff_t offset)
> +{
> + return radix_tree_lookup(&mapping->page_tree, offset);
> +}
> +
> +static inline struct page *find_page(struct address_space *mapping,
> + pgoff_t offset)
> +{
> + struct page *page;
> +
> + read_lock_irq(&mapping->tree_lock);
> + page = __find_page(mapping, offset);
> + read_unlock_irq(&mapping->tree_lock);
> + return page;
> +}

Would much prefer that this be called probe_page() and that it return 0 or
1, so nobody is tempted to dereference `page'.

> +/*
> + * Move pages in danger (of thrashing) to the head of inactive_list.
> + * Not expected to happen frequently.
> + */
> +static unsigned long rescue_pages(struct page *page, unsigned long nr_pages)
> +{
> + int pgrescue;
> + pgoff_t index;
> + struct zone *zone;
> + struct address_space *mapping;
> +
> + BUG_ON(!nr_pages || !page);
> + pgrescue = 0;
> + index = page_index(page);
> + mapping = page_mapping(page);
> +
> + dprintk("rescue_pages(ino=%lu, index=%lu nr=%lu)\n",
> + mapping->host->i_ino, index, nr_pages);
> +
> + for(;;) {
> + zone = page_zone(page);
> + spin_lock_irq(&zone->lru_lock);
> +
> + if (!PageLRU(page))
> + goto out_unlock;
> +
> + while (page_mapping(page) == mapping &&
> + page_index(page) == index) {
> + struct page *the_page = page;
> + page = next_page(page);
> + if (!PageActive(the_page) &&
> + !PageLocked(the_page) &&
> + page_count(the_page) == 1) {
> + list_move(&the_page->lru, &zone->inactive_list);
> + pgrescue++;
> + }
> + index++;
> + if (!--nr_pages)
> + goto out_unlock;
> + }
> +
> + spin_unlock_irq(&zone->lru_lock);
> +
> + cond_resched();
> + page = find_page(mapping, index);
> + if (!page)
> + goto out;

Yikes! We do not have a reference on this page. Now, it happens that
page_zone() on a random freed page will work OK. At present. I think.
Depends on things like memory hot-remove, balloon drivers and heaven knows
what.

But it's not at all clear that the combination

	spin_lock_irq(&zone->lru_lock);

	if (!PageLRU(page))
		goto out_unlock;

is a safe thing to do against a freed page, or against a freed and
reused-for-we-dont-know-what page. It probably _is_ safe, as we're
probably setting and clearing PG_lru inside lru_lock in other places. But
it's not obvious that these things will be true for all time and Nick keeps
on trying to diddle with that stuff. There's quite a bit of subtle
dependency being introduced here.

2006-05-26 07:31:14

by Wu Fengguang

Subject: Re: [PATCH 10/33] readahead: support functions

On Thu, May 25, 2006 at 09:48:29AM -0700, Andrew Morton wrote:
> Wu Fengguang <[email protected]> wrote:
> >
> > +/*
> > + * The nature of read-ahead allows false tests to occur occasionally.
> > + * Here we just do not bother to call get_page(), it's meaningless anyway.
> > + */
> > +static inline struct page *__find_page(struct address_space *mapping,
> > + pgoff_t offset)
> > +{
> > + return radix_tree_lookup(&mapping->page_tree, offset);
> > +}
> > +
> > +static inline struct page *find_page(struct address_space *mapping,
> > + pgoff_t offset)
> > +{
> > + struct page *page;
> > +
> > + read_lock_irq(&mapping->tree_lock);
> > + page = __find_page(mapping, offset);
> > + read_unlock_irq(&mapping->tree_lock);
> > + return page;
> > +}
>
> Would much prefer that this be called probe_page() and that it return 0 or
> 1, so nobody is tempted to dereference `page'.

Good idea. I'd add them to filemap.c.

> > +/*
> > + * Move pages in danger (of thrashing) to the head of inactive_list.
> > + * Not expected to happen frequently.
> > + */
> > +static unsigned long rescue_pages(struct page *page, unsigned long nr_pages)
> > +{
> > + int pgrescue;
> > + pgoff_t index;
> > + struct zone *zone;
> > + struct address_space *mapping;
> > +
> > + BUG_ON(!nr_pages || !page);
> > + pgrescue = 0;
> > + index = page_index(page);
> > + mapping = page_mapping(page);
> > +
> > + dprintk("rescue_pages(ino=%lu, index=%lu nr=%lu)\n",
> > + mapping->host->i_ino, index, nr_pages);
> > +
> > + for(;;) {
> > + zone = page_zone(page);
> > + spin_lock_irq(&zone->lru_lock);
> > +
> > + if (!PageLRU(page))
> > + goto out_unlock;
> > +
> > + while (page_mapping(page) == mapping &&
> > + page_index(page) == index) {
> > + struct page *the_page = page;
> > + page = next_page(page);
> > + if (!PageActive(the_page) &&
> > + !PageLocked(the_page) &&
> > + page_count(the_page) == 1) {
> > + list_move(&the_page->lru, &zone->inactive_list);
> > + pgrescue++;
> > + }
> > + index++;
> > + if (!--nr_pages)
> > + goto out_unlock;
> > + }
> > +
> > + spin_unlock_irq(&zone->lru_lock);
> > +
> > + cond_resched();
> > + page = find_page(mapping, index);
> > + if (!page)
> > + goto out;
>
> Yikes! We do not have a reference on this page. Now, it happens that
> page_zone() on a random freed page will work OK. At present. I think.
> Depends on things like memory hot-remove, balloon drivers and heaven knows
> what.
>
> But it's not at all clear that the combination
>
> spin_lock_irq(&zone->lru_lock);
>
> if (!PageLRU(page))
> goto out_unlock;
>
> is a safe thing to do against a freed page, or against a freed and
> reused-for-we-dont-know-what page. It probably _is_ safe, as we're
> probably setting and clearing PG_lru inside lru_lock in other places. But
> it's not obvious that these things will be true for all time and Nick keeps
> on trying to diddle with that stuff. There's quite a bit of subtle
> dependency being introduced here.

I saw some code pieces like
	spin_lock_irqsave(&zone->lru_lock, flags);
	VM_BUG_ON(!PageLRU(page));
	__ClearPageLRU(page);
	del_page_from_lru(zone, page);
	spin_unlock_irqrestore(&zone->lru_lock, flags);

They give me the impression that PG_lru and page->lru are always changed together,
under the protection of zone->lru_lock...

Correctness is the top priority, so I'll stop playing with fire here.

Thanks,
Wu