2006-03-22 22:31:45

by Peter Zijlstra

Subject: [PATCH 00/34] mm: Page Replacement Policy Framework


This patch-set introduces a page replacement policy framework and 4 new
experimental policies.

The page replacement algorithm determines which pages to evict (swap out) when
memory runs short. The current algorithm has some problems that are becoming
increasingly noticeable, even on desktop workloads; the 4 new algorithms
introduced here are meant to address them.

Patches 01 - 25:

Introduction of the general framework. Piece by piece, the current
use-once LRU-2Q policy is isolated; with each patch a piece of the framework
API is introduced. (A condensed sketch of the compile-time policy selection
appears after the patch list below.)

Patches 26 - 29:

Adds a policy based on CLOCKPro. (http://linux-mm.org/PeterZClockPro2)

Patches 30 - 32:

Adds a policy based on CART. (http://linux-mm.org/PeterZCart)

Patch 33:

Adds a variation of the CART policy that tries to incorporate
cyclic access patterns.

Patch 34:

Adds a random page replacement policy: a simple policy that uses a
PRNG to make the eviction decision. More of a toy example than a real
alternative.
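
The policies are selected at compile time. A condensed sketch of the selection
mechanism, taken from patches 02 and 03 below (the Kconfig entries and header
branches for the other policies are added by their respective patches and are
not shown here):

	/* mm/Kconfig: exactly one policy is chosen */
	choice
		prompt "Page replacement policy"
		default MM_POLICY_USEONCE

	config MM_POLICY_USEONCE
		bool "LRU-2Q USE-ONCE"

	endchoice

	/* include/linux/mm_page_replace.h: the selected policy provides the
	 * inline parts of the framework API */
	#ifdef CONFIG_MM_POLICY_USEONCE
	#include <linux/mm_use_once_policy.h>
	#else
	#error no mm policy
	#endif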


Individual patches and a rollup can be found here:
http://programming.kicks-ass.net/kernel-patches/page-replace/


Measurements:

(Walltime, so lower is better)

cyclic-anon ; Cyclic access pattern with anonymous memory.
(http://programming.kicks-ass.net/benchmarks/cyclic-anon.c)

2.6.16-rc6 14:28
2.6.16-rc6-useonce 15:11
2.6.16-rc6-clockpro 10:51
2.6.16-rc6-cart 8:55
2.6.16-rc6-random 1:09:50

cyclic-file ; Cyclic access pattern with file backed memory.
(http://programming.kicks-ass.net/benchmarks/cyclic-file.c)

2.6.16-rc6 11:24
2.6.16-rc6-clockpro 8:14
2.6.16-rc6-cart 8:09

webtrace ; Replay of an IO trace from the Umass trace repository
(http://programming.kicks-ass.net/benchmarks/spc/)

2.6.16-rc6 8:27
2.6.16-rc6-useonce 8:24
2.6.16-rc6-clockpro 10:23
2.6.16-rc6-cart 15:30
2.6.16-rc6-random 15:52

mdb-bench ; Low frequency benchmark.
(http://linux-mm.org/PageReplacementTesting)

2.6.16-rc6 4:20:44
2.6.16-rc6 (mlock) 3:52:15
2.6.16-rc6-useonce 4:20:59
2.6.16-rc6-clockpro 3:56:17
2.6.16-rc6-cart 4:11:54
2.6.16-rc6-random 5:21:30

(I should do more runs to get error bounds on these values; each figure
is the average of 3 runs.)

Aside from tweaking the policies, the big thing left is to NUMA-ify the
nonresident page trackers.

The results merit further attention; please consider this for 2.6.18.

Peter

---

Documentation/vm/page_replacement_api.txt | 216 +++++++
fs/cifs/file.c | 4
fs/exec.c | 4
fs/mpage.c | 5
fs/ntfs/file.c | 4
fs/ramfs/file-nommu.c | 2
include/linux/mm_cart_data.h | 39 +
include/linux/mm_cart_policy.h | 141 ++++
include/linux/mm_clockpro_data.h | 21
include/linux/mm_clockpro_policy.h | 143 +++++
include/linux/mm_inline.h | 39 -
include/linux/mm_page_replace.h | 146 +++++
include/linux/mm_page_replace_data.h | 19
include/linux/mm_random_data.h | 9
include/linux/mm_random_policy.h | 47 +
include/linux/mm_use_once_data.h | 16
include/linux/mm_use_once_policy.h | 175 ++++++
include/linux/mmzone.h | 8
include/linux/nonresident-cart.h | 34 +
include/linux/nonresident.h | 12
include/linux/page-flags.h | 11
include/linux/pagevec.h | 8
include/linux/percpu.h | 5
include/linux/rmap.h | 4
include/linux/swap.h | 10
init/main.c | 2
mm/Kconfig | 32 +
mm/Makefile | 6
mm/cart.c | 723 +++++++++++++++++++++++++
mm/clockpro.c | 855 ++++++++++++++++++++++++++++++
mm/filemap.c | 20
mm/hugetlb.c | 5
mm/memory.c | 42 +
mm/mempolicy.c | 13
mm/mmap.c | 5
mm/nonresident-cart.c | 362 ++++++++++++
mm/nonresident.c | 167 +++++
mm/page_alloc.c | 76 --
mm/random_policy.c | 292 ++++++++++
mm/readahead.c | 9
mm/rmap.c | 26
mm/shmem.c | 2
mm/swap.c | 206 +------
mm/swap_state.c | 6
mm/swapfile.c | 17
mm/useonce.c | 489 +++++++++++++++++
mm/vmscan.c | 552 +++----------------
47 files changed, 4226 insertions(+), 803 deletions(-)


2006-03-22 22:31:55

by Peter Zijlstra

Subject: [PATCH 01/34] mm: kill-page-activate.patch


From: Marcelo Tosatti <[email protected]>

Get rid of activate_page() callers.

Instead, page activation is achieved through the mark_page_accessed()
interface.

Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>

---

include/linux/swap.h | 1 -
mm/swapfile.c | 4 ++--
2 files changed, 2 insertions(+), 3 deletions(-)

Index: linux-2.6/include/linux/swap.h
===================================================================
--- linux-2.6.orig/include/linux/swap.h 2006-03-13 20:37:08.000000000 +0100
+++ linux-2.6/include/linux/swap.h 2006-03-13 20:37:22.000000000 +0100
@@ -164,7 +164,6 @@ extern unsigned int nr_free_pagecache_pa
/* linux/mm/swap.c */
extern void FASTCALL(lru_cache_add(struct page *));
extern void FASTCALL(lru_cache_add_active(struct page *));
-extern void FASTCALL(activate_page(struct page *));
extern void FASTCALL(mark_page_accessed(struct page *));
extern void lru_add_drain(void);
extern int lru_add_drain_all(void);
Index: linux-2.6/mm/swapfile.c
===================================================================
--- linux-2.6.orig/mm/swapfile.c 2006-03-13 20:37:08.000000000 +0100
+++ linux-2.6/mm/swapfile.c 2006-03-13 20:37:22.000000000 +0100
@@ -435,7 +435,7 @@ static void unuse_pte(struct vm_area_str
* Move the page to the active list so it is not
* immediately swapped out again after swapon.
*/
- activate_page(page);
+ mark_page_accessed(page);
}

static int unuse_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
@@ -537,7 +537,7 @@ static int unuse_mm(struct mm_struct *mm
* Activate page so shrink_cache is unlikely to unmap its
* ptes while lock is dropped, so swapoff can make progress.
*/
- activate_page(page);
+ mark_page_accessed(page);
unlock_page(page);
down_read(&mm->mmap_sem);
lock_page(page);

2006-03-22 22:32:37

by Peter Zijlstra

Subject: [PATCH 02/34] mm: page-replace-kconfig-makefile.patch


From: Peter Zijlstra <[email protected]>

Introduce the configuration option and modify the Makefile accordingly.
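
For example, selecting the (default) use-once policy results in

	CONFIG_MM_POLICY_USEONCE=y

in .config, and the Makefile then adds useonce.o to the build; the config
symbols for the other policies appear once their patches are applied.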

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

mm/Kconfig | 11 +++++++++++
mm/Makefile | 2 ++
mm/useonce.c | 3 +++
3 files changed, 16 insertions(+)

Index: linux-2.6/mm/Kconfig
===================================================================
--- linux-2.6.orig/mm/Kconfig 2006-03-13 20:37:08.000000000 +0100
+++ linux-2.6/mm/Kconfig 2006-03-13 20:37:24.000000000 +0100
@@ -133,6 +133,17 @@ config SPLIT_PTLOCK_CPUS
default "4096" if PARISC && !PA20
default "4"

+choice
+ prompt "Page replacement policy"
+ default MM_POLICY_USEONCE
+
+config MM_POLICY_USEONCE
+ bool "LRU-2Q USE-ONCE"
+ help
+ This option selects the standard multi-queue LRU policy.
+
+endchoice
+
#
# support for page migration
#
Index: linux-2.6/mm/Makefile
===================================================================
--- linux-2.6.orig/mm/Makefile 2006-03-13 20:37:08.000000000 +0100
+++ linux-2.6/mm/Makefile 2006-03-13 20:37:24.000000000 +0100
@@ -12,6 +12,8 @@ obj-y := bootmem.o filemap.o mempool.o
readahead.o swap.o truncate.o vmscan.o \
prio_tree.o util.o $(mmu-y)

+obj-$(CONFIG_MM_POLICY_USEONCE) += useonce.o
+
obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
obj-$(CONFIG_NUMA) += mempolicy.o
Index: linux-2.6/mm/useonce.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/useonce.c 2006-03-13 20:37:24.000000000 +0100
@@ -0,0 +1,3 @@
+
+
+

2006-03-22 22:32:21

by Peter Zijlstra

Subject: [PATCH 03/34] mm: page-replace-insert.patch


From: Peter Zijlstra <[email protected]>

Abstract the insertion of pages into the page replacement policy's lists.

API:

give a hint to the page replace algorithm as to the
importance of the given page:

void page_replace_hint_active(struct page *);

insert the given page into a per-CPU pagevec:

void fastcall page_replace_add(struct page *);

flush either the current CPU's or the given CPU's pagevec:

void page_replace_add_drain(void);
void __page_replace_add_drain(unsigned int);

functions to insert a pagevec's worth of pages:

void __pagevec_page_replace_add(struct pagevec *);
void pagevec_page_replace_add(struct pagevec *);
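
A minimal caller-side sketch (the example functions and page sources are
hypothetical; the real conversions are in the diff below):

	#include <linux/mm_page_replace.h>
	#include <linux/pagemap.h>
	#include <linux/pagevec.h>

	/* A hot page, e.g. from an anonymous fault: hint, then insert.
	 * page_replace_add() takes its own reference on the page. */
	static void example_insert_hot(struct page *page)
	{
		page_replace_hint_active(page);
		page_replace_add(page);
	}

	/* Readahead-style batching: the pagevec consumes one reference per
	 * page, which __pagevec_page_replace_add() drops again. */
	static void example_insert_batch(struct page **pages, int nr)
	{
		struct pagevec pvec;
		int i;

		pagevec_init(&pvec, 0);
		for (i = 0; i < nr; i++) {
			page_cache_get(pages[i]);
			if (!pagevec_add(&pvec, pages[i]))
				__pagevec_page_replace_add(&pvec);
		}
		pagevec_page_replace_add(&pvec);	/* flush the remainder */
	}

Paths that are about to walk or tear down page tables flush the per-CPU
pagevecs first with page_replace_add_drain(), as zap_page_range() does below.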


Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

fs/cifs/file.c | 4 -
fs/exec.c | 4 -
fs/mpage.c | 5 -
fs/ntfs/file.c | 4 -
fs/ramfs/file-nommu.c | 2
include/linux/mm_page_replace.h | 36 +++++++++
include/linux/mm_use_once_policy.h | 12 +++
include/linux/pagevec.h | 8 --
include/linux/swap.h | 4 -
mm/filemap.c | 7 +
mm/memory.c | 14 ++-
mm/mempolicy.c | 3
mm/mmap.c | 5 -
mm/readahead.c | 9 +-
mm/shmem.c | 2
mm/swap.c | 134 -------------------------------------
mm/swap_state.c | 6 +
mm/useonce.c | 134 +++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 16 +---
19 files changed, 229 insertions(+), 180 deletions(-)

Index: linux-2.6-git/include/linux/mm_use_once_policy.h
===================================================================
--- /dev/null
+++ linux-2.6-git/include/linux/mm_use_once_policy.h
@@ -0,0 +1,12 @@
+#ifndef _LINUX_MM_USEONCE_POLICY_H
+#define _LINUX_MM_USEONCE_POLICY_H
+
+#ifdef __KERNEL__
+
+static inline void page_replace_hint_active(struct page *page)
+{
+ SetPageActive(page);
+}
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_MM_USEONCE_POLICY_H */
Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- /dev/null
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -0,0 +1,36 @@
+#ifndef _LINUX_MM_PAGE_REPLACE_H
+#define _LINUX_MM_PAGE_REPLACE_H
+
+#ifdef __KERNEL__
+
+#include <linux/mmzone.h>
+#include <linux/mm.h>
+#include <linux/pagevec.h>
+
+/* void page_replace_hint_active(struct page *); */
+extern void fastcall page_replace_add(struct page *);
+/* void page_replace_add_drain(void); */
+extern void __page_replace_add_drain(unsigned int);
+extern int page_replace_add_drain_all(void);
+extern void __pagevec_page_replace_add(struct pagevec *);
+
+#ifdef CONFIG_MM_POLICY_USEONCE
+#include <linux/mm_use_once_policy.h>
+#else
+#error no mm policy
+#endif
+
+static inline void pagevec_page_replace_add(struct pagevec *pvec)
+{
+ if (pagevec_count(pvec))
+ __pagevec_page_replace_add(pvec);
+}
+
+static inline void page_replace_add_drain(void)
+{
+ __page_replace_add_drain(get_cpu());
+ put_cpu();
+}
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_MM_PAGE_REPLACE_H */
Index: linux-2.6-git/mm/filemap.c
===================================================================
--- linux-2.6-git.orig/mm/filemap.c
+++ linux-2.6-git/mm/filemap.c
@@ -29,6 +29,7 @@
#include <linux/blkdev.h>
#include <linux/security.h>
#include <linux/syscalls.h>
+#include <linux/mm_page_replace.h>
#include "filemap.h"
/*
* FIXME: remove all knowledge of the buffer layer from the core VM
@@ -421,7 +422,7 @@ int add_to_page_cache_lru(struct page *p
{
int ret = add_to_page_cache(page, mapping, offset, gfp_mask);
if (ret == 0)
- lru_cache_add(page);
+ page_replace_add(page);
return ret;
}

@@ -1721,7 +1722,7 @@ repeat:
page = *cached_page;
page_cache_get(page);
if (!pagevec_add(lru_pvec, page))
- __pagevec_lru_add(lru_pvec);
+ __pagevec_page_replace_add(lru_pvec);
*cached_page = NULL;
}
}
@@ -2051,7 +2052,7 @@ generic_file_buffered_write(struct kiocb
if (unlikely(file->f_flags & O_DIRECT) && written)
status = filemap_write_and_wait(mapping);

- pagevec_lru_add(&lru_pvec);
+ pagevec_page_replace_add(&lru_pvec);
return written ? written : status;
}
EXPORT_SYMBOL(generic_file_buffered_write);
Index: linux-2.6-git/mm/useonce.c
===================================================================
--- linux-2.6-git.orig/mm/useonce.c
+++ linux-2.6-git/mm/useonce.c
@@ -1,3 +1,137 @@
+#include <linux/mm_page_replace.h>
+#include <linux/mm_inline.h>
+#include <linux/swap.h>
+#include <linux/module.h>
+#include <linux/pagemap.h>

+/**
+ * lru_cache_add: add a page to the page lists
+ * @page: the page to add
+ */
+static DEFINE_PER_CPU(struct pagevec, lru_add_pvecs) = { 0, };
+static DEFINE_PER_CPU(struct pagevec, lru_add_active_pvecs) = { 0, };

+/*
+ * Add the passed pages to the LRU, then drop the caller's refcount
+ * on them. Reinitialises the caller's pagevec.
+ */
+void __pagevec_page_replace_add(struct pagevec *pvec)
+{
+ int i;
+ struct zone *zone = NULL;

+ for (i = 0; i < pagevec_count(pvec); i++) {
+ struct page *page = pvec->pages[i];
+ struct zone *pagezone = page_zone(page);
+
+ if (pagezone != zone) {
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ zone = pagezone;
+ spin_lock_irq(&zone->lru_lock);
+ }
+ if (TestSetPageLRU(page))
+ BUG();
+ add_page_to_inactive_list(zone, page);
+ }
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ release_pages(pvec->pages, pvec->nr, pvec->cold);
+ pagevec_reinit(pvec);
+}
+
+EXPORT_SYMBOL(__pagevec_page_replace_add);
+
+static void __pagevec_lru_add_active(struct pagevec *pvec)
+{
+ int i;
+ struct zone *zone = NULL;
+
+ for (i = 0; i < pagevec_count(pvec); i++) {
+ struct page *page = pvec->pages[i];
+ struct zone *pagezone = page_zone(page);
+
+ if (pagezone != zone) {
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ zone = pagezone;
+ spin_lock_irq(&zone->lru_lock);
+ }
+ if (TestSetPageLRU(page))
+ BUG();
+ if (TestSetPageActive(page))
+ BUG();
+ add_page_to_active_list(zone, page);
+ }
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ release_pages(pvec->pages, pvec->nr, pvec->cold);
+ pagevec_reinit(pvec);
+}
+
+static inline void lru_cache_add(struct page *page)
+{
+ struct pagevec *pvec = &get_cpu_var(lru_add_pvecs);
+
+ page_cache_get(page);
+ if (!pagevec_add(pvec, page))
+ __pagevec_page_replace_add(pvec);
+ put_cpu_var(lru_add_pvecs);
+}
+
+static inline void lru_cache_add_active(struct page *page)
+{
+ struct pagevec *pvec = &get_cpu_var(lru_add_active_pvecs);
+
+ page_cache_get(page);
+ if (!pagevec_add(pvec, page))
+ __pagevec_lru_add_active(pvec);
+ put_cpu_var(lru_add_active_pvecs);
+}
+
+void fastcall page_replace_add(struct page *page)
+{
+ if (PageActive(page)) {
+ ClearPageActive(page);
+ lru_cache_add_active(page);
+ } else {
+ lru_cache_add(page);
+ }
+}
+
+void __page_replace_add_drain(unsigned int cpu)
+{
+ struct pagevec *pvec = &per_cpu(lru_add_pvecs, cpu);
+
+ if (pagevec_count(pvec))
+ __pagevec_page_replace_add(pvec);
+ pvec = &per_cpu(lru_add_active_pvecs, cpu);
+ if (pagevec_count(pvec))
+ __pagevec_lru_add_active(pvec);
+}
+
+#ifdef CONFIG_NUMA
+static void drain_per_cpu(void *dummy)
+{
+ page_replace_add_drain();
+}
+
+/*
+ * Returns 0 for success
+ */
+int page_replace_add_drain_all(void)
+{
+ return schedule_on_each_cpu(drain_per_cpu, NULL);
+}
+
+#else
+
+/*
+ * Returns 0 for success
+ */
+int page_replace_add_drain_all(void)
+{
+ page_replace_add_drain();
+ return 0;
+}
+#endif
Index: linux-2.6-git/mm/memory.c
===================================================================
--- linux-2.6-git.orig/mm/memory.c
+++ linux-2.6-git/mm/memory.c
@@ -48,6 +48,7 @@
#include <linux/rmap.h>
#include <linux/module.h>
#include <linux/init.h>
+#include <linux/mm_page_replace.h>

#include <asm/pgalloc.h>
#include <asm/uaccess.h>
@@ -872,7 +873,7 @@ unsigned long zap_page_range(struct vm_a
unsigned long end = address + size;
unsigned long nr_accounted = 0;

- lru_add_drain();
+ page_replace_add_drain();
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
end = unmap_vmas(&tlb, vma, address, end, &nr_accounted, details);
@@ -1507,7 +1508,8 @@ gotten:
ptep_establish(vma, address, page_table, entry);
update_mmu_cache(vma, address, entry);
lazy_mmu_prot_update(entry);
- lru_cache_add_active(new_page);
+ page_replace_hint_active(new_page);
+ page_replace_add(new_page);
page_add_new_anon_rmap(new_page, vma, address);

/* Free the old page.. */
@@ -1859,7 +1861,7 @@ void swapin_readahead(swp_entry_t entry,
}
#endif
}
- lru_add_drain(); /* Push any new pages onto the LRU now */
+ page_replace_add_drain(); /* Push any new pages onto the LRU now */
}

/*
@@ -1993,7 +1995,8 @@ static int do_anonymous_page(struct mm_s
if (!pte_none(*page_table))
goto release;
inc_mm_counter(mm, anon_rss);
- lru_cache_add_active(page);
+ page_replace_hint_active(page);
+ page_replace_add(page);
page_add_new_anon_rmap(page, vma, address);
} else {
/* Map the ZERO_PAGE - vm_page_prot is readonly */
@@ -2124,7 +2127,8 @@ retry:
set_pte_at(mm, address, page_table, entry);
if (anon) {
inc_mm_counter(mm, anon_rss);
- lru_cache_add_active(new_page);
+ page_replace_hint_active(new_page);
+ page_replace_add(new_page);
page_add_new_anon_rmap(new_page, vma, address);
} else {
inc_mm_counter(mm, file_rss);
Index: linux-2.6-git/mm/mmap.c
===================================================================
--- linux-2.6-git.orig/mm/mmap.c
+++ linux-2.6-git/mm/mmap.c
@@ -25,6 +25,7 @@
#include <linux/mount.h>
#include <linux/mempolicy.h>
#include <linux/rmap.h>
+#include <linux/mm_page_replace.h>

#include <asm/uaccess.h>
#include <asm/cacheflush.h>
@@ -1656,7 +1657,7 @@ static void unmap_region(struct mm_struc
struct mmu_gather *tlb;
unsigned long nr_accounted = 0;

- lru_add_drain();
+ page_replace_add_drain();
tlb = tlb_gather_mmu(mm, 0);
update_hiwater_rss(mm);
unmap_vmas(&tlb, vma, start, end, &nr_accounted, NULL);
@@ -1937,7 +1938,7 @@ void exit_mmap(struct mm_struct *mm)
unsigned long nr_accounted = 0;
unsigned long end;

- lru_add_drain();
+ page_replace_add_drain();
flush_cache_mm(mm);
tlb = tlb_gather_mmu(mm, 1);
/* Don't update_hiwater_rss(mm) here, do_exit already did */
Index: linux-2.6-git/mm/shmem.c
===================================================================
--- linux-2.6-git.orig/mm/shmem.c
+++ linux-2.6-git/mm/shmem.c
@@ -952,7 +952,7 @@ struct page *shmem_swapin(struct shmem_i
break;
page_cache_release(page);
}
- lru_add_drain(); /* Push any new pages onto the LRU now */
+ page_replace_add_drain(); /* Push any new pages onto the LRU now */
return shmem_swapin_async(p, entry, idx);
}

Index: linux-2.6-git/mm/swap.c
===================================================================
--- linux-2.6-git.orig/mm/swap.c
+++ linux-2.6-git/mm/swap.c
@@ -30,6 +30,7 @@
#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/init.h>
+#include <linux/mm_page_replace.h>

/* How many pages do we try to swap or page in/out together? */
int page_cluster;
@@ -132,77 +133,6 @@ void fastcall mark_page_accessed(struct

EXPORT_SYMBOL(mark_page_accessed);

-/**
- * lru_cache_add: add a page to the page lists
- * @page: the page to add
- */
-static DEFINE_PER_CPU(struct pagevec, lru_add_pvecs) = { 0, };
-static DEFINE_PER_CPU(struct pagevec, lru_add_active_pvecs) = { 0, };
-
-void fastcall lru_cache_add(struct page *page)
-{
- struct pagevec *pvec = &get_cpu_var(lru_add_pvecs);
-
- page_cache_get(page);
- if (!pagevec_add(pvec, page))
- __pagevec_lru_add(pvec);
- put_cpu_var(lru_add_pvecs);
-}
-
-void fastcall lru_cache_add_active(struct page *page)
-{
- struct pagevec *pvec = &get_cpu_var(lru_add_active_pvecs);
-
- page_cache_get(page);
- if (!pagevec_add(pvec, page))
- __pagevec_lru_add_active(pvec);
- put_cpu_var(lru_add_active_pvecs);
-}
-
-static void __lru_add_drain(int cpu)
-{
- struct pagevec *pvec = &per_cpu(lru_add_pvecs, cpu);
-
- /* CPU is dead, so no locking needed. */
- if (pagevec_count(pvec))
- __pagevec_lru_add(pvec);
- pvec = &per_cpu(lru_add_active_pvecs, cpu);
- if (pagevec_count(pvec))
- __pagevec_lru_add_active(pvec);
-}
-
-void lru_add_drain(void)
-{
- __lru_add_drain(get_cpu());
- put_cpu();
-}
-
-#ifdef CONFIG_NUMA
-static void lru_add_drain_per_cpu(void *dummy)
-{
- lru_add_drain();
-}
-
-/*
- * Returns 0 for success
- */
-int lru_add_drain_all(void)
-{
- return schedule_on_each_cpu(lru_add_drain_per_cpu, NULL);
-}
-
-#else
-
-/*
- * Returns 0 for success
- */
-int lru_add_drain_all(void)
-{
- lru_add_drain();
- return 0;
-}
-#endif
-
/*
* This path almost never happens for VM activity - pages are normally
* freed via pagevecs. But it gets used by networking.
@@ -295,7 +225,7 @@ void release_pages(struct page **pages,
*/
void __pagevec_release(struct pagevec *pvec)
{
- lru_add_drain();
+ page_replace_add_drain();
release_pages(pvec->pages, pagevec_count(pvec), pvec->cold);
pagevec_reinit(pvec);
}
@@ -325,64 +255,6 @@ void __pagevec_release_nonlru(struct pag
}

/*
- * Add the passed pages to the LRU, then drop the caller's refcount
- * on them. Reinitialises the caller's pagevec.
- */
-void __pagevec_lru_add(struct pagevec *pvec)
-{
- int i;
- struct zone *zone = NULL;
-
- for (i = 0; i < pagevec_count(pvec); i++) {
- struct page *page = pvec->pages[i];
- struct zone *pagezone = page_zone(page);
-
- if (pagezone != zone) {
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- zone = pagezone;
- spin_lock_irq(&zone->lru_lock);
- }
- if (TestSetPageLRU(page))
- BUG();
- add_page_to_inactive_list(zone, page);
- }
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- release_pages(pvec->pages, pvec->nr, pvec->cold);
- pagevec_reinit(pvec);
-}
-
-EXPORT_SYMBOL(__pagevec_lru_add);
-
-void __pagevec_lru_add_active(struct pagevec *pvec)
-{
- int i;
- struct zone *zone = NULL;
-
- for (i = 0; i < pagevec_count(pvec); i++) {
- struct page *page = pvec->pages[i];
- struct zone *pagezone = page_zone(page);
-
- if (pagezone != zone) {
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- zone = pagezone;
- spin_lock_irq(&zone->lru_lock);
- }
- if (TestSetPageLRU(page))
- BUG();
- if (TestSetPageActive(page))
- BUG();
- add_page_to_active_list(zone, page);
- }
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- release_pages(pvec->pages, pvec->nr, pvec->cold);
- pagevec_reinit(pvec);
-}
-
-/*
* Try to drop buffers from the pages in a pagevec
*/
void pagevec_strip(struct pagevec *pvec)
@@ -470,7 +342,7 @@ static int cpu_swap_callback(struct noti
if (action == CPU_DEAD) {
atomic_add(*committed, &vm_committed_space);
*committed = 0;
- __lru_add_drain((long)hcpu);
+ __page_replace_add_drain((long)hcpu);
}
return NOTIFY_OK;
}
Index: linux-2.6-git/mm/swap_state.c
===================================================================
--- linux-2.6-git.orig/mm/swap_state.c
+++ linux-2.6-git/mm/swap_state.c
@@ -15,6 +15,7 @@
#include <linux/buffer_head.h>
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
+#include <linux/mm_page_replace.h>

#include <asm/pgtable.h>

@@ -276,7 +277,7 @@ void free_pages_and_swap_cache(struct pa
{
struct page **pagep = pages;

- lru_add_drain();
+ page_replace_add_drain();
while (nr) {
int todo = min(nr, PAGEVEC_SIZE);
int i;
@@ -354,7 +355,8 @@ struct page *read_swap_cache_async(swp_e
/*
* Initiate read into locked page and return.
*/
- lru_cache_add_active(new_page);
+ page_replace_hint_active(new_page);
+ page_replace_add(new_page);
swap_readpage(NULL, new_page);
return new_page;
}
Index: linux-2.6-git/mm/vmscan.c
===================================================================
--- linux-2.6-git.orig/mm/vmscan.c
+++ linux-2.6-git/mm/vmscan.c
@@ -33,6 +33,7 @@
#include <linux/cpuset.h>
#include <linux/notifier.h>
#include <linux/rwsem.h>
+#include <linux/mm_page_replace.h>

#include <asm/tlbflush.h>
#include <asm/div64.h>
@@ -587,16 +588,7 @@ keep:
static inline void move_to_lru(struct page *page)
{
list_del(&page->lru);
- if (PageActive(page)) {
- /*
- * lru_cache_add_active checks that
- * the PG_active bit is off.
- */
- ClearPageActive(page);
- lru_cache_add_active(page);
- } else {
- lru_cache_add(page);
- }
+ page_replace_add(page);
put_page(page);
}

@@ -1111,7 +1103,7 @@ static void shrink_cache(struct zone *zo

pagevec_init(&pvec, 1);

- lru_add_drain();
+ page_replace_add_drain();
spin_lock_irq(&zone->lru_lock);
while (max_scan > 0) {
struct page *page;
@@ -1237,7 +1229,7 @@ refill_inactive_zone(struct zone *zone,
reclaim_mapped = 1;
}

- lru_add_drain();
+ page_replace_add_drain();
spin_lock_irq(&zone->lru_lock);
pgmoved = isolate_lru_pages(nr_pages, &zone->active_list,
&l_hold, &pgscanned);
Index: linux-2.6-git/fs/cifs/file.c
===================================================================
--- linux-2.6-git.orig/fs/cifs/file.c
+++ linux-2.6-git/fs/cifs/file.c
@@ -1604,7 +1604,7 @@ static void cifs_copy_cache_pages(struct
SetPageUptodate(page);
unlock_page(page);
if (!pagevec_add(plru_pvec, page))
- __pagevec_lru_add(plru_pvec);
+ __pagevec_page_replace_add(plru_pvec);
data += PAGE_CACHE_SIZE;
}
return;
@@ -1758,7 +1758,7 @@ static int cifs_readpages(struct file *f
bytes_read = 0;
}

- pagevec_lru_add(&lru_pvec);
+ pagevec_page_replace_add(&lru_pvec);

/* need to free smb_read_data buf before exit */
if (smb_read_data) {
Index: linux-2.6-git/fs/mpage.c
===================================================================
--- linux-2.6-git.orig/fs/mpage.c
+++ linux-2.6-git/fs/mpage.c
@@ -26,6 +26,7 @@
#include <linux/writeback.h>
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
+#include <linux/mm_page_replace.h>

/*
* I/O completion handler for multipage BIOs.
@@ -344,12 +345,12 @@ mpage_readpages(struct address_space *ma
nr_pages - page_idx,
&last_block_in_bio, get_block);
if (!pagevec_add(&lru_pvec, page))
- __pagevec_lru_add(&lru_pvec);
+ __pagevec_page_replace_add(&lru_pvec);
} else {
page_cache_release(page);
}
}
- pagevec_lru_add(&lru_pvec);
+ pagevec_page_replace_add(&lru_pvec);
BUG_ON(!list_empty(pages));
if (bio)
mpage_bio_submit(READ, bio);
Index: linux-2.6-git/fs/ntfs/file.c
===================================================================
--- linux-2.6-git.orig/fs/ntfs/file.c
+++ linux-2.6-git/fs/ntfs/file.c
@@ -441,7 +441,7 @@ static inline int __ntfs_grab_cache_page
pages[nr] = *cached_page;
page_cache_get(*cached_page);
if (unlikely(!pagevec_add(lru_pvec, *cached_page)))
- __pagevec_lru_add(lru_pvec);
+ __pagevec_page_replace_add(lru_pvec);
*cached_page = NULL;
}
index++;
@@ -2121,7 +2121,7 @@ err_out:
OSYNC_METADATA|OSYNC_DATA);
}
}
- pagevec_lru_add(&lru_pvec);
+ pagevec_page_replace_add(&lru_pvec);
ntfs_debug("Done. Returning %s (written 0x%lx, status %li).",
written ? "written" : "status", (unsigned long)written,
(long)status);
Index: linux-2.6-git/mm/readahead.c
===================================================================
--- linux-2.6-git.orig/mm/readahead.c
+++ linux-2.6-git/mm/readahead.c
@@ -14,6 +14,7 @@
#include <linux/blkdev.h>
#include <linux/backing-dev.h>
#include <linux/pagevec.h>
+#include <linux/mm_page_replace.h>

void default_unplug_io_fn(struct backing_dev_info *bdi, struct page *page)
{
@@ -135,7 +136,7 @@ int read_cache_pages(struct address_spac
}
ret = filler(data, page);
if (!pagevec_add(&lru_pvec, page))
- __pagevec_lru_add(&lru_pvec);
+ __pagevec_page_replace_add(&lru_pvec);
if (ret) {
while (!list_empty(pages)) {
struct page *victim;
@@ -147,7 +148,7 @@ int read_cache_pages(struct address_spac
break;
}
}
- pagevec_lru_add(&lru_pvec);
+ pagevec_page_replace_add(&lru_pvec);
return ret;
}

@@ -174,13 +175,13 @@ static int read_pages(struct address_spa
ret = mapping->a_ops->readpage(filp, page);
if (ret != AOP_TRUNCATED_PAGE) {
if (!pagevec_add(&lru_pvec, page))
- __pagevec_lru_add(&lru_pvec);
+ __pagevec_page_replace_add(&lru_pvec);
continue;
} /* else fall through to release */
}
page_cache_release(page);
}
- pagevec_lru_add(&lru_pvec);
+ pagevec_page_replace_add(&lru_pvec);
ret = 0;
out:
return ret;
Index: linux-2.6-git/fs/exec.c
===================================================================
--- linux-2.6-git.orig/fs/exec.c
+++ linux-2.6-git/fs/exec.c
@@ -49,6 +49,7 @@
#include <linux/rmap.h>
#include <linux/acct.h>
#include <linux/cn_proc.h>
+#include <linux/mm_page_replace.h>

#include <asm/uaccess.h>
#include <asm/mmu_context.h>
@@ -321,7 +322,8 @@ void install_arg_page(struct vm_area_str
goto out;
}
inc_mm_counter(mm, anon_rss);
- lru_cache_add_active(page);
+ page_replace_hint_active(page);
+ page_replace_add(page);
set_pte_at(mm, address, pte, pte_mkdirty(pte_mkwrite(mk_pte(
page, vma->vm_page_prot))));
page_add_new_anon_rmap(page, vma, address);
Index: linux-2.6-git/include/linux/pagevec.h
===================================================================
--- linux-2.6-git.orig/include/linux/pagevec.h
+++ linux-2.6-git/include/linux/pagevec.h
@@ -23,8 +23,6 @@ struct pagevec {
void __pagevec_release(struct pagevec *pvec);
void __pagevec_release_nonlru(struct pagevec *pvec);
void __pagevec_free(struct pagevec *pvec);
-void __pagevec_lru_add(struct pagevec *pvec);
-void __pagevec_lru_add_active(struct pagevec *pvec);
void pagevec_strip(struct pagevec *pvec);
unsigned pagevec_lookup(struct pagevec *pvec, struct address_space *mapping,
pgoff_t start, unsigned nr_pages);
@@ -81,10 +79,4 @@ static inline void pagevec_free(struct p
__pagevec_free(pvec);
}

-static inline void pagevec_lru_add(struct pagevec *pvec)
-{
- if (pagevec_count(pvec))
- __pagevec_lru_add(pvec);
-}
-
#endif /* _LINUX_PAGEVEC_H */
Index: linux-2.6-git/include/linux/swap.h
===================================================================
--- linux-2.6-git.orig/include/linux/swap.h
+++ linux-2.6-git/include/linux/swap.h
@@ -162,13 +162,11 @@ extern unsigned int nr_free_buffer_pages
extern unsigned int nr_free_pagecache_pages(void);

/* linux/mm/swap.c */
-extern void FASTCALL(lru_cache_add(struct page *));
-extern void FASTCALL(lru_cache_add_active(struct page *));
extern void FASTCALL(mark_page_accessed(struct page *));
-extern void lru_add_drain(void);
extern int lru_add_drain_all(void);
extern int rotate_reclaimable_page(struct page *page);
extern void swap_setup(void);
+extern void release_pages(struct page **, int, int);

/* linux/mm/vmscan.c */
extern int try_to_free_pages(struct zone **, gfp_t);
Index: linux-2.6-git/fs/ramfs/file-nommu.c
===================================================================
--- linux-2.6-git.orig/fs/ramfs/file-nommu.c
+++ linux-2.6-git/fs/ramfs/file-nommu.c
@@ -109,7 +109,7 @@ static int ramfs_nommu_expand_for_mappin
goto add_error;

if (!pagevec_add(&lru_pvec, page))
- __pagevec_lru_add(&lru_pvec);
+ __pagevec_page_replace_add(&lru_pvec);

unlock_page(page);
}
Index: linux-2.6-git/mm/mempolicy.c
===================================================================
--- linux-2.6-git.orig/mm/mempolicy.c
+++ linux-2.6-git/mm/mempolicy.c
@@ -86,6 +86,7 @@
#include <linux/swap.h>
#include <linux/seq_file.h>
#include <linux/proc_fs.h>
+#include <linux/mm_page_replace.h>

#include <asm/tlbflush.h>
#include <asm/uaccess.h>
@@ -332,7 +333,7 @@ check_range(struct mm_struct *mm, unsign

/* Clear the LRU lists so pages can be isolated */
if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
- lru_add_drain_all();
+ page_replace_add_drain_all();

first = find_vma(mm, start);
if (!first)

2006-03-22 22:33:56

by Peter Zijlstra

Subject: [PATCH 11/34] mm: page-replace-should_reclaim_mapped.patch


From: Peter Zijlstra <[email protected]>

Move the reclaim_mapped code over to its own function so that other
reclaim policies can make use of it.
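
As a worked example of the heuristic below: at prev_priority == 2 the distress
factor is 100 >> 2 = 25; if 40% of memory is mapped, mapped_ratio = 40 and
contributes 40 / 2 = 20; with vm_swappiness at its default of 60 this gives
swap_tendency = 20 + 25 + 60 = 105 >= 100, so should_reclaim_mapped() returns
1 and mapped pages become eligible for reclaim.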

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

mm/vmscan.c | 86 ++++++++++++++++++++++++++++++++----------------------------
1 file changed, 46 insertions(+), 40 deletions(-)

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c 2006-03-13 20:37:32.000000000 +0100
+++ linux-2.6/mm/vmscan.c 2006-03-13 20:37:33.000000000 +0100
@@ -978,6 +978,50 @@ done:
pagevec_release(&pvec);
}

+int should_reclaim_mapped(struct zone *zone, struct scan_control *sc)
+{
+ long mapped_ratio;
+ long distress;
+ long swap_tendency;
+
+ /*
+ * `distress' is a measure of how much trouble we're having
+ * reclaiming pages. 0 -> no problems. 100 -> great trouble.
+ */
+ distress = 100 >> zone->prev_priority;
+
+ /*
+ * The point of this algorithm is to decide when to start
+ * reclaiming mapped memory instead of just pagecache. Work out
+ * how much memory
+ * is mapped.
+ */
+ mapped_ratio = (sc->nr_mapped * 100) / total_memory;
+
+ /*
+ * Now decide how much we really want to unmap some pages. The
+ * mapped ratio is downgraded - just because there's a lot of
+ * mapped memory doesn't necessarily mean that page reclaim
+ * isn't succeeding.
+ *
+ * The distress ratio is important - we don't want to start
+ * going oom.
+ *
+ * A 100% value of vm_swappiness overrides this algorithm
+ * altogether.
+ */
+ swap_tendency = mapped_ratio / 2 + distress + vm_swappiness;
+
+ /*
+ * Now use this metric to decide whether to start moving mapped
+ * memory onto the inactive list.
+ */
+ if (swap_tendency >= 100)
+ return 1;
+
+ return 0;
+}
+
/*
* This moves pages from the active list to the inactive list.
*
@@ -1009,46 +1053,8 @@ refill_inactive_zone(struct zone *zone,
struct pagevec pvec;
int reclaim_mapped = 0;

- if (unlikely(sc->may_swap)) {
- long mapped_ratio;
- long distress;
- long swap_tendency;
-
- /*
- * `distress' is a measure of how much trouble we're having
- * reclaiming pages. 0 -> no problems. 100 -> great trouble.
- */
- distress = 100 >> zone->prev_priority;
-
- /*
- * The point of this algorithm is to decide when to start
- * reclaiming mapped memory instead of just pagecache. Work out
- * how much memory
- * is mapped.
- */
- mapped_ratio = (sc->nr_mapped * 100) / total_memory;
-
- /*
- * Now decide how much we really want to unmap some pages. The
- * mapped ratio is downgraded - just because there's a lot of
- * mapped memory doesn't necessarily mean that page reclaim
- * isn't succeeding.
- *
- * The distress ratio is important - we don't want to start
- * going oom.
- *
- * A 100% value of vm_swappiness overrides this algorithm
- * altogether.
- */
- swap_tendency = mapped_ratio / 2 + distress + vm_swappiness;
-
- /*
- * Now use this metric to decide whether to start moving mapped
- * memory onto the inactive list.
- */
- if (swap_tendency >= 100)
- reclaim_mapped = 1;
- }
+ if (unlikely(sc->may_swap))
+ reclaim_mapped = should_reclaim_mapped(zone, sc);

page_replace_add_drain();
spin_lock_irq(&zone->lru_lock);

2006-03-22 22:33:53

by Peter Zijlstra

Subject: [PATCH 12/34] mm: page-replace-shrink.patch


From: Peter Zijlstra <[email protected]>

Move the whole per-zone shrinker to the policy files.
Share the shrink_list logic across policies, since it no longer knows about
the policy internals and exclusively deals with pageout.

API:
void page_replace_shrink(struct zone *, struct scan_control *);

Shrink the specified zone.
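
With this, the generic reclaim paths only pick zones and delegate the actual
scanning; roughly (the real call sites are in the vmscan.c hunks below):

	static void shrink_caches(struct zone **zones, struct scan_control *sc)
	{
		int i;

		for (i = 0; zones[i] != NULL; i++) {
			struct zone *zone = zones[i];

			/* ... population, cpuset and all_unreclaimable
			 * checks elided ... */
			page_replace_shrink(zone, sc);
		}
	}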

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_page_replace.h | 1
include/linux/mm_use_once_policy.h | 3
include/linux/swap.h | 3
mm/useonce.c | 235 ++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 238 -------------------------------------
5 files changed, 241 insertions(+), 239 deletions(-)

Index: linux-2.6-git/include/linux/mm_use_once_policy.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_use_once_policy.h
+++ linux-2.6-git/include/linux/mm_use_once_policy.h
@@ -76,8 +76,5 @@ static inline int page_replace_activate(
return 1;
}

-extern int isolate_lru_pages(int nr_to_scan, struct list_head *src,
- struct list_head *dst, int *scanned);
-
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_USEONCE_POLICY_H */
Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -87,6 +87,7 @@ typedef enum {
/* reclaim_t page_replace_reclaimable(struct page *); */
/* int page_replace_activate(struct page *page); */
extern void page_replace_reinsert(struct list_head *);
+extern void page_replace_shrink(struct zone *, struct scan_control *);

#ifdef CONFIG_MIGRATION
extern int page_replace_isolate(struct page *p);
Index: linux-2.6-git/include/linux/swap.h
===================================================================
--- linux-2.6-git.orig/include/linux/swap.h
+++ linux-2.6-git/include/linux/swap.h
@@ -7,6 +7,7 @@
#include <linux/mmzone.h>
#include <linux/list.h>
#include <linux/sched.h>
+#include <linux/mm_page_replace.h>

#include <asm/atomic.h>
#include <asm/page.h>
@@ -171,6 +172,8 @@ extern void release_pages(struct page **
/* linux/mm/vmscan.c */
extern int try_to_free_pages(struct zone **, gfp_t);
extern int shrink_all_memory(int);
+extern int shrink_list(struct list_head *, struct scan_control *);
+extern int should_reclaim_mapped(struct zone *, struct scan_control *);
extern int vm_swappiness;

#ifdef CONFIG_NUMA
Index: linux-2.6-git/mm/useonce.c
===================================================================
--- linux-2.6-git.orig/mm/useonce.c
+++ linux-2.6-git/mm/useonce.c
@@ -3,6 +3,9 @@
#include <linux/swap.h>
#include <linux/module.h>
#include <linux/pagemap.h>
+#include <linux/writeback.h>
+#include <linux/buffer_head.h> /* for try_to_release_page(),
+ buffer_heads_over_limit */

/**
* lru_cache_add: add a page to the page lists
@@ -135,8 +138,8 @@ void page_replace_reinsert(struct list_h
*
* returns how many pages were moved onto *@dst.
*/
-int isolate_lru_pages(int nr_to_scan, struct list_head *src,
- struct list_head *dst, int *scanned)
+static int isolate_lru_pages(int nr_to_scan, struct list_head *src,
+ struct list_head *dst, int *scanned)
{
int nr_taken = 0;
struct page *page;
@@ -167,3 +170,231 @@ int isolate_lru_pages(int nr_to_scan, st
return nr_taken;
}

+/*
+ * shrink_cache() adds the number of pages reclaimed to sc->nr_reclaimed
+ */
+static void shrink_cache(struct zone *zone, struct scan_control *sc)
+{
+ LIST_HEAD(page_list);
+ struct pagevec pvec;
+ int max_scan = sc->nr_to_scan;
+
+ pagevec_init(&pvec, 1);
+
+ page_replace_add_drain();
+ spin_lock_irq(&zone->lru_lock);
+ while (max_scan > 0) {
+ struct page *page;
+ int nr_taken;
+ int nr_scan;
+ int nr_freed;
+
+ nr_taken = isolate_lru_pages(sc->swap_cluster_max,
+ &zone->inactive_list,
+ &page_list, &nr_scan);
+ zone->nr_inactive -= nr_taken;
+ zone->pages_scanned += nr_scan;
+ spin_unlock_irq(&zone->lru_lock);
+
+ if (nr_taken == 0)
+ goto done;
+
+ max_scan -= nr_scan;
+ nr_freed = shrink_list(&page_list, sc);
+
+ local_irq_disable();
+ if (current_is_kswapd()) {
+ __mod_page_state_zone(zone, pgscan_kswapd, nr_scan);
+ __mod_page_state(kswapd_steal, nr_freed);
+ } else
+ __mod_page_state_zone(zone, pgscan_direct, nr_scan);
+ __mod_page_state_zone(zone, pgsteal, nr_freed);
+
+ spin_lock(&zone->lru_lock);
+ /*
+ * Put back any unfreeable pages.
+ */
+ while (!list_empty(&page_list)) {
+ page = lru_to_page(&page_list);
+ if (TestSetPageLRU(page))
+ BUG();
+ list_del(&page->lru);
+ if (PageActive(page))
+ add_page_to_active_list(zone, page);
+ else
+ add_page_to_inactive_list(zone, page);
+ if (!pagevec_add(&pvec, page)) {
+ spin_unlock_irq(&zone->lru_lock);
+ __pagevec_release(&pvec);
+ spin_lock_irq(&zone->lru_lock);
+ }
+ }
+ }
+ spin_unlock_irq(&zone->lru_lock);
+done:
+ pagevec_release(&pvec);
+}
+
+/*
+ * This moves pages from the active list to the inactive list.
+ *
+ * We move them the other way if the page is referenced by one or more
+ * processes, from rmap.
+ *
+ * If the pages are mostly unmapped, the processing is fast and it is
+ * appropriate to hold zone->lru_lock across the whole operation. But if
+ * the pages are mapped, the processing is slow (page_referenced()) so we
+ * should drop zone->lru_lock around each page. It's impossible to balance
+ * this, so instead we remove the pages from the LRU while processing them.
+ * It is safe to rely on PG_active against the non-LRU pages in here because
+ * nobody will play with that bit on a non-LRU page.
+ *
+ * The downside is that we have to touch page->_count against each page.
+ * But we had to alter page->flags anyway.
+ */
+static void
+refill_inactive_zone(struct zone *zone, struct scan_control *sc)
+{
+ int pgmoved;
+ int pgdeactivate = 0;
+ int pgscanned;
+ int nr_pages = sc->nr_to_scan;
+ LIST_HEAD(l_hold); /* The pages which were snipped off */
+ LIST_HEAD(l_inactive); /* Pages to go onto the inactive_list */
+ LIST_HEAD(l_active); /* Pages to go onto the active_list */
+ struct page *page;
+ struct pagevec pvec;
+ int reclaim_mapped = 0;
+
+ if (unlikely(sc->may_swap))
+ reclaim_mapped = should_reclaim_mapped(zone, sc);
+
+ page_replace_add_drain();
+ spin_lock_irq(&zone->lru_lock);
+ pgmoved = isolate_lru_pages(nr_pages, &zone->active_list,
+ &l_hold, &pgscanned);
+ zone->pages_scanned += pgscanned;
+ zone->nr_active -= pgmoved;
+ spin_unlock_irq(&zone->lru_lock);
+
+ while (!list_empty(&l_hold)) {
+ cond_resched();
+ page = lru_to_page(&l_hold);
+ list_del(&page->lru);
+ if (page_mapped(page)) {
+ if (!reclaim_mapped ||
+ (total_swap_pages == 0 && PageAnon(page)) ||
+ page_referenced(page, 0)) {
+ list_add(&page->lru, &l_active);
+ continue;
+ }
+ }
+ list_add(&page->lru, &l_inactive);
+ }
+
+ pagevec_init(&pvec, 1);
+ pgmoved = 0;
+ spin_lock_irq(&zone->lru_lock);
+ while (!list_empty(&l_inactive)) {
+ page = lru_to_page(&l_inactive);
+ prefetchw_prev_lru_page(page, &l_inactive, flags);
+ if (TestSetPageLRU(page))
+ BUG();
+ if (!TestClearPageActive(page))
+ BUG();
+ list_move(&page->lru, &zone->inactive_list);
+ pgmoved++;
+ if (!pagevec_add(&pvec, page)) {
+ zone->nr_inactive += pgmoved;
+ spin_unlock_irq(&zone->lru_lock);
+ pgdeactivate += pgmoved;
+ pgmoved = 0;
+ if (buffer_heads_over_limit)
+ pagevec_strip(&pvec);
+ __pagevec_release(&pvec);
+ spin_lock_irq(&zone->lru_lock);
+ }
+ }
+ zone->nr_inactive += pgmoved;
+ pgdeactivate += pgmoved;
+ if (buffer_heads_over_limit) {
+ spin_unlock_irq(&zone->lru_lock);
+ pagevec_strip(&pvec);
+ spin_lock_irq(&zone->lru_lock);
+ }
+
+ pgmoved = 0;
+ while (!list_empty(&l_active)) {
+ page = lru_to_page(&l_active);
+ prefetchw_prev_lru_page(page, &l_active, flags);
+ if (TestSetPageLRU(page))
+ BUG();
+ BUG_ON(!PageActive(page));
+ list_move(&page->lru, &zone->active_list);
+ pgmoved++;
+ if (!pagevec_add(&pvec, page)) {
+ zone->nr_active += pgmoved;
+ pgmoved = 0;
+ spin_unlock_irq(&zone->lru_lock);
+ __pagevec_release(&pvec);
+ spin_lock_irq(&zone->lru_lock);
+ }
+ }
+ zone->nr_active += pgmoved;
+ spin_unlock(&zone->lru_lock);
+
+ __mod_page_state_zone(zone, pgrefill, pgscanned);
+ __mod_page_state(pgdeactivate, pgdeactivate);
+ local_irq_enable();
+
+ pagevec_release(&pvec);
+}
+
+/*
+ * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
+ */
+void page_replace_shrink(struct zone *zone, struct scan_control *sc)
+{
+ unsigned long nr_active;
+ unsigned long nr_inactive;
+
+ atomic_inc(&zone->reclaim_in_progress);
+
+ /*
+ * Add one to `nr_to_scan' just to make sure that the kernel will
+ * slowly sift through the active list.
+ */
+ zone->nr_scan_active += (zone->nr_active >> sc->priority) + 1;
+ nr_active = zone->nr_scan_active;
+ if (nr_active >= sc->swap_cluster_max)
+ zone->nr_scan_active = 0;
+ else
+ nr_active = 0;
+
+ zone->nr_scan_inactive += (zone->nr_inactive >> sc->priority) + 1;
+ nr_inactive = zone->nr_scan_inactive;
+ if (nr_inactive >= sc->swap_cluster_max)
+ zone->nr_scan_inactive = 0;
+ else
+ nr_inactive = 0;
+
+ while (nr_active || nr_inactive) {
+ if (nr_active) {
+ sc->nr_to_scan = min(nr_active,
+ (unsigned long)sc->swap_cluster_max);
+ nr_active -= sc->nr_to_scan;
+ refill_inactive_zone(zone, sc);
+ }
+
+ if (nr_inactive) {
+ sc->nr_to_scan = min(nr_inactive,
+ (unsigned long)sc->swap_cluster_max);
+ nr_inactive -= sc->nr_to_scan;
+ shrink_cache(zone, sc);
+ }
+ }
+
+ throttle_vm_writeout();
+
+ atomic_dec(&zone->reclaim_in_progress);
+}
Index: linux-2.6-git/mm/vmscan.c
===================================================================
--- linux-2.6-git.orig/mm/vmscan.c
+++ linux-2.6-git/mm/vmscan.c
@@ -336,7 +336,7 @@ cannot_free:
/*
* shrink_list adds the number of reclaimed pages to sc->nr_reclaimed
*/
-static int shrink_list(struct list_head *page_list, struct scan_control *sc)
+int shrink_list(struct list_head *page_list, struct scan_control *sc)
{
LIST_HEAD(ret_pages);
struct pagevec freed_pvec;
@@ -913,71 +913,6 @@ next:
}
#endif

-/*
- * shrink_cache() adds the number of pages reclaimed to sc->nr_reclaimed
- */
-static void shrink_cache(struct zone *zone, struct scan_control *sc)
-{
- LIST_HEAD(page_list);
- struct pagevec pvec;
- int max_scan = sc->nr_to_scan;
-
- pagevec_init(&pvec, 1);
-
- page_replace_add_drain();
- spin_lock_irq(&zone->lru_lock);
- while (max_scan > 0) {
- struct page *page;
- int nr_taken;
- int nr_scan;
- int nr_freed;
-
- nr_taken = isolate_lru_pages(sc->swap_cluster_max,
- &zone->inactive_list,
- &page_list, &nr_scan);
- zone->nr_inactive -= nr_taken;
- zone->pages_scanned += nr_scan;
- spin_unlock_irq(&zone->lru_lock);
-
- if (nr_taken == 0)
- goto done;
-
- max_scan -= nr_scan;
- nr_freed = shrink_list(&page_list, sc);
-
- local_irq_disable();
- if (current_is_kswapd()) {
- __mod_page_state_zone(zone, pgscan_kswapd, nr_scan);
- __mod_page_state(kswapd_steal, nr_freed);
- } else
- __mod_page_state_zone(zone, pgscan_direct, nr_scan);
- __mod_page_state_zone(zone, pgsteal, nr_freed);
-
- spin_lock(&zone->lru_lock);
- /*
- * Put back any unfreeable pages.
- */
- while (!list_empty(&page_list)) {
- page = lru_to_page(&page_list);
- if (TestSetPageLRU(page))
- BUG();
- list_del(&page->lru);
- if (PageActive(page))
- add_page_to_active_list(zone, page);
- else
- add_page_to_inactive_list(zone, page);
- if (!pagevec_add(&pvec, page)) {
- spin_unlock_irq(&zone->lru_lock);
- __pagevec_release(&pvec);
- spin_lock_irq(&zone->lru_lock);
- }
- }
- }
- spin_unlock_irq(&zone->lru_lock);
-done:
- pagevec_release(&pvec);
-}
-
int should_reclaim_mapped(struct zone *zone, struct scan_control *sc)
{
long mapped_ratio;
@@ -1023,171 +958,6 @@ int should_reclaim_mapped(struct zone *z
}

/*
- * This moves pages from the active list to the inactive list.
- *
- * We move them the other way if the page is referenced by one or more
- * processes, from rmap.
- *
- * If the pages are mostly unmapped, the processing is fast and it is
- * appropriate to hold zone->lru_lock across the whole operation. But if
- * the pages are mapped, the processing is slow (page_referenced()) so we
- * should drop zone->lru_lock around each page. It's impossible to balance
- * this, so instead we remove the pages from the LRU while processing them.
- * It is safe to rely on PG_active against the non-LRU pages in here because
- * nobody will play with that bit on a non-LRU page.
- *
- * The downside is that we have to touch page->_count against each page.
- * But we had to alter page->flags anyway.
- */
-static void
-refill_inactive_zone(struct zone *zone, struct scan_control *sc)
-{
- int pgmoved;
- int pgdeactivate = 0;
- int pgscanned;
- int nr_pages = sc->nr_to_scan;
- LIST_HEAD(l_hold); /* The pages which were snipped off */
- LIST_HEAD(l_inactive); /* Pages to go onto the inactive_list */
- LIST_HEAD(l_active); /* Pages to go onto the active_list */
- struct page *page;
- struct pagevec pvec;
- int reclaim_mapped = 0;
-
- if (unlikely(sc->may_swap))
- reclaim_mapped = should_reclaim_mapped(zone, sc);
-
- page_replace_add_drain();
- spin_lock_irq(&zone->lru_lock);
- pgmoved = isolate_lru_pages(nr_pages, &zone->active_list,
- &l_hold, &pgscanned);
- zone->pages_scanned += pgscanned;
- zone->nr_active -= pgmoved;
- spin_unlock_irq(&zone->lru_lock);
-
- while (!list_empty(&l_hold)) {
- cond_resched();
- page = lru_to_page(&l_hold);
- list_del(&page->lru);
- if (page_mapped(page)) {
- if (!reclaim_mapped ||
- (total_swap_pages == 0 && PageAnon(page)) ||
- page_referenced(page, 0)) {
- list_add(&page->lru, &l_active);
- continue;
- }
- }
- list_add(&page->lru, &l_inactive);
- }
-
- pagevec_init(&pvec, 1);
- pgmoved = 0;
- spin_lock_irq(&zone->lru_lock);
- while (!list_empty(&l_inactive)) {
- page = lru_to_page(&l_inactive);
- prefetchw_prev_lru_page(page, &l_inactive, flags);
- if (TestSetPageLRU(page))
- BUG();
- if (!TestClearPageActive(page))
- BUG();
- list_move(&page->lru, &zone->inactive_list);
- pgmoved++;
- if (!pagevec_add(&pvec, page)) {
- zone->nr_inactive += pgmoved;
- spin_unlock_irq(&zone->lru_lock);
- pgdeactivate += pgmoved;
- pgmoved = 0;
- if (buffer_heads_over_limit)
- pagevec_strip(&pvec);
- __pagevec_release(&pvec);
- spin_lock_irq(&zone->lru_lock);
- }
- }
- zone->nr_inactive += pgmoved;
- pgdeactivate += pgmoved;
- if (buffer_heads_over_limit) {
- spin_unlock_irq(&zone->lru_lock);
- pagevec_strip(&pvec);
- spin_lock_irq(&zone->lru_lock);
- }
-
- pgmoved = 0;
- while (!list_empty(&l_active)) {
- page = lru_to_page(&l_active);
- prefetchw_prev_lru_page(page, &l_active, flags);
- if (TestSetPageLRU(page))
- BUG();
- BUG_ON(!PageActive(page));
- list_move(&page->lru, &zone->active_list);
- pgmoved++;
- if (!pagevec_add(&pvec, page)) {
- zone->nr_active += pgmoved;
- pgmoved = 0;
- spin_unlock_irq(&zone->lru_lock);
- __pagevec_release(&pvec);
- spin_lock_irq(&zone->lru_lock);
- }
- }
- zone->nr_active += pgmoved;
- spin_unlock(&zone->lru_lock);
-
- __mod_page_state_zone(zone, pgrefill, pgscanned);
- __mod_page_state(pgdeactivate, pgdeactivate);
- local_irq_enable();
-
- pagevec_release(&pvec);
-}
-
-/*
- * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
- */
-static void
-shrink_zone(struct zone *zone, struct scan_control *sc)
-{
- unsigned long nr_active;
- unsigned long nr_inactive;
-
- atomic_inc(&zone->reclaim_in_progress);
-
- /*
- * Add one to `nr_to_scan' just to make sure that the kernel will
- * slowly sift through the active list.
- */
- zone->nr_scan_active += (zone->nr_active >> sc->priority) + 1;
- nr_active = zone->nr_scan_active;
- if (nr_active >= sc->swap_cluster_max)
- zone->nr_scan_active = 0;
- else
- nr_active = 0;
-
- zone->nr_scan_inactive += (zone->nr_inactive >> sc->priority) + 1;
- nr_inactive = zone->nr_scan_inactive;
- if (nr_inactive >= sc->swap_cluster_max)
- zone->nr_scan_inactive = 0;
- else
- nr_inactive = 0;
-
- while (nr_active || nr_inactive) {
- if (nr_active) {
- sc->nr_to_scan = min(nr_active,
- (unsigned long)sc->swap_cluster_max);
- nr_active -= sc->nr_to_scan;
- refill_inactive_zone(zone, sc);
- }
-
- if (nr_inactive) {
- sc->nr_to_scan = min(nr_inactive,
- (unsigned long)sc->swap_cluster_max);
- nr_inactive -= sc->nr_to_scan;
- shrink_cache(zone, sc);
- }
- }
-
- throttle_vm_writeout();
-
- atomic_dec(&zone->reclaim_in_progress);
-}
-
-/*
* This is the direct reclaim path, for page-allocating processes. We only
* try to reclaim pages from zones which will satisfy the caller's allocation
* request.
@@ -1224,7 +994,7 @@ shrink_caches(struct zone **zones, struc
if (zone->all_unreclaimable && sc->priority != DEF_PRIORITY)
continue; /* Let kswapd poll it */

- shrink_zone(zone, sc);
+ page_replace_shrink(zone, sc);
}
}

@@ -1440,7 +1210,7 @@ scan:
sc.nr_reclaimed = 0;
sc.priority = priority;
sc.swap_cluster_max = nr_pages? nr_pages : SWAP_CLUSTER_MAX;
- shrink_zone(zone, &sc);
+ page_replace_shrink(zone, &sc);
reclaim_state->reclaimed_slab = 0;
nr_slab = shrink_slab(sc.nr_scanned, GFP_KERNEL,
lru_pages);
@@ -1743,7 +1513,7 @@ int zone_reclaim(struct zone *zone, gfp_
*/
do {
sc.priority--;
- shrink_zone(zone, &sc);
+ page_replace_shrink(zone, &sc);

} while (sc.nr_reclaimed < nr_pages && sc.priority > 0);

2006-03-22 22:33:07

by Peter Zijlstra

Subject: [PATCH 05/34] mm: page-replace-generic-pagevec.patch


From: Peter Zijlstra <[email protected]>

Since PG_active is already used to discriminate between the active and inactive
lists, use it to collapse the two pagevec add functions into a single generic
helper.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_page_replace.h | 2 +
include/linux/mm_use_once_policy.h | 16 ++++++++
mm/swap.c | 31 ++++++++++++++++
mm/useonce.c | 68 ++-----------------------------------
4 files changed, 53 insertions(+), 64 deletions(-)

Index: linux-2.6-git/include/linux/mm_use_once_policy.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_use_once_policy.h
+++ linux-2.6-git/include/linux/mm_use_once_policy.h
@@ -8,6 +8,22 @@ static inline void page_replace_hint_act
SetPageActive(page);
}

+static inline void
+add_page_to_inactive_list(struct zone *zone, struct page *page)
+{
+ list_add(&page->lru, &zone->policy.inactive_list);
+ zone->policy.nr_inactive++;
+}
+
+static inline void
+__page_replace_add(struct zone *zone, struct page *page)
+{
+ if (PageActive(page))
+ add_page_to_active_list(zone, page);
+ else
+ add_page_to_inactive_list(zone, page);
+}
+
static inline void page_replace_hint_use_once(struct page *page)
{
}
Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -6,10 +6,12 @@
#include <linux/mmzone.h>
#include <linux/mm.h>
#include <linux/pagevec.h>
+#include <linux/mm_inline.h>

/* void page_replace_hint_active(struct page *); */
/* void page_replace_hint_use_once(struct page *); */
extern void fastcall page_replace_add(struct page *);
+/* void __page_replace_add(struct zone *, struct page *); */
/* void page_replace_add_drain(void); */
extern void __page_replace_add_drain(unsigned int);
extern int page_replace_add_drain_all(void);
Index: linux-2.6-git/mm/useonce.c
===================================================================
--- linux-2.6-git.orig/mm/useonce.c
+++ linux-2.6-git/mm/useonce.c
@@ -11,64 +11,6 @@
static DEFINE_PER_CPU(struct pagevec, lru_add_pvecs) = { 0, };
static DEFINE_PER_CPU(struct pagevec, lru_add_active_pvecs) = { 0, };

-/*
- * Add the passed pages to the LRU, then drop the caller's refcount
- * on them. Reinitialises the caller's pagevec.
- */
-void __pagevec_page_replace_add(struct pagevec *pvec)
-{
- int i;
- struct zone *zone = NULL;
-
- for (i = 0; i < pagevec_count(pvec); i++) {
- struct page *page = pvec->pages[i];
- struct zone *pagezone = page_zone(page);
-
- if (pagezone != zone) {
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- zone = pagezone;
- spin_lock_irq(&zone->lru_lock);
- }
- if (TestSetPageLRU(page))
- BUG();
- add_page_to_inactive_list(zone, page);
- }
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- release_pages(pvec->pages, pvec->nr, pvec->cold);
- pagevec_reinit(pvec);
-}
-
-EXPORT_SYMBOL(__pagevec_page_replace_add);
-
-static void __pagevec_lru_add_active(struct pagevec *pvec)
-{
- int i;
- struct zone *zone = NULL;
-
- for (i = 0; i < pagevec_count(pvec); i++) {
- struct page *page = pvec->pages[i];
- struct zone *pagezone = page_zone(page);
-
- if (pagezone != zone) {
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- zone = pagezone;
- spin_lock_irq(&zone->lru_lock);
- }
- if (TestSetPageLRU(page))
- BUG();
- if (TestSetPageActive(page))
- BUG();
- add_page_to_active_list(zone, page);
- }
- if (zone)
- spin_unlock_irq(&zone->lru_lock);
- release_pages(pvec->pages, pvec->nr, pvec->cold);
- pagevec_reinit(pvec);
-}
-
static inline void lru_cache_add(struct page *page)
{
struct pagevec *pvec = &get_cpu_var(lru_add_pvecs);
@@ -85,18 +27,16 @@ static inline void lru_cache_add_active(

page_cache_get(page);
if (!pagevec_add(pvec, page))
- __pagevec_lru_add_active(pvec);
+ __pagevec_page_replace_add(pvec);
put_cpu_var(lru_add_active_pvecs);
}

void fastcall page_replace_add(struct page *page)
{
- if (PageActive(page)) {
- ClearPageActive(page);
+ if (PageActive(page))
lru_cache_add_active(page);
- } else {
+ else
lru_cache_add(page);
- }
}

void __page_replace_add_drain(unsigned int cpu)
@@ -107,7 +47,7 @@ void __page_replace_add_drain(unsigned i
__pagevec_page_replace_add(pvec);
pvec = &per_cpu(lru_add_active_pvecs, cpu);
if (pagevec_count(pvec))
- __pagevec_lru_add_active(pvec);
+ __pagevec_page_replace_add(pvec);
}

#ifdef CONFIG_NUMA
Index: linux-2.6-git/mm/swap.c
===================================================================
--- linux-2.6-git.orig/mm/swap.c
+++ linux-2.6-git/mm/swap.c
@@ -306,6 +306,37 @@ unsigned pagevec_lookup_tag(struct pagev

EXPORT_SYMBOL(pagevec_lookup_tag);

+/*
+ * Add the passed pages to the LRU, then drop the caller's refcount
+ * on them. Reinitialises the caller's pagevec.
+ */
+void __pagevec_page_replace_add(struct pagevec *pvec)
+{
+ int i;
+ struct zone *zone = NULL;
+
+ for (i = 0; i < pagevec_count(pvec); i++) {
+ struct page *page = pvec->pages[i];
+ struct zone *pagezone = page_zone(page);
+
+ if (pagezone != zone) {
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ zone = pagezone;
+ spin_lock_irq(&zone->lru_lock);
+ }
+ if (TestSetPageLRU(page))
+ BUG();
+ __page_replace_add(zone, page);
+ }
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ release_pages(pvec->pages, pvec->nr, pvec->cold);
+ pagevec_reinit(pvec);
+}
+
+EXPORT_SYMBOL(__pagevec_page_replace_add);
+
#ifdef CONFIG_SMP
/*
* We tolerate a little inaccuracy to avoid ping-ponging the counter between

2006-03-22 22:33:06

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 04/34] mm: page-replace-use_once.patch


From: Peter Zijlstra <[email protected]>

Allow for a use-once hint.

API:

give a hint to the page replace algorithm:

void page_replace_hint_use_once(struct page *);
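
For illustration only (not part of this patch): a policy that keeps a
per-page "new" bit could implement the hint roughly as below. PG_new and
SetPageNew are assumed policy-private helpers here; the use-once policy's
no-op variant is in the diff further down.

	/* hypothetical sketch, not from this patch */
	static inline void page_replace_hint_use_once(struct page *page)
	{
		/*
		 * Mark the freshly added page so the scanner may reclaim
		 * it on its first pass unless it gets referenced again.
		 */
		SetPageNew(page);	/* PG_new: assumed policy-private flag */
	}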

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_page_replace.h | 1 +
include/linux/mm_use_once_policy.h | 4 ++++
mm/filemap.c | 13 ++++++++++++-
3 files changed, 17 insertions(+), 1 deletion(-)

Index: linux-2.6-git/mm/filemap.c
===================================================================
--- linux-2.6-git.orig/mm/filemap.c
+++ linux-2.6-git/mm/filemap.c
@@ -403,7 +403,18 @@ int add_to_page_cache(struct page *page,
error = radix_tree_insert(&mapping->page_tree, offset, page);
if (!error) {
page_cache_get(page);
- SetPageLocked(page);
+ /*
+ * shmem_getpage()
+ * lookup_swap_cache()
+ * TestSetPageLocked()
+ * move_from_swap_cache()
+ * add_to_page_cache()
+ *
+ * That path calls us with a LRU page instead of a new
+ * page. Don't set the hint for LRU pages.
+ */
+ if (!TestSetPageLocked(page))
+ page_replace_hint_use_once(page);
page->mapping = mapping;
page->index = offset;
mapping->nrpages++;
Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -8,6 +8,7 @@
#include <linux/pagevec.h>

/* void page_replace_hint_active(struct page *); */
+/* void page_replace_hint_use_once(struct page *); */
extern void fastcall page_replace_add(struct page *);
/* void page_replace_add_drain(void); */
extern void __page_replace_add_drain(unsigned int);
Index: linux-2.6-git/include/linux/mm_use_once_policy.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_use_once_policy.h
+++ linux-2.6-git/include/linux/mm_use_once_policy.h
@@ -8,5 +8,9 @@ static inline void page_replace_hint_act
SetPageActive(page);
}

+static inline void page_replace_hint_use_once(struct page *page)
+{
+}
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_USEONCE_POLICY_H */

2006-03-22 22:34:37

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 17/34] mm: page-replace-info.patch


From: Peter Zijlstra <[email protected]>

Isolate the printing of various policy-related information.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_page_replace.h | 3 ++
mm/page_alloc.c | 44 +--------------------------------
mm/useonce.c | 52 ++++++++++++++++++++++++++++++++++++++++
3 files changed, 57 insertions(+), 42 deletions(-)

Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -6,6 +6,7 @@
#include <linux/mmzone.h>
#include <linux/mm.h>
#include <linux/pagevec.h>
+#include <linux/seq_file.h>

struct scan_control {
/* Ask refill_inactive_zone, or shrink_cache to scan this many pages */
@@ -92,6 +93,8 @@ extern void page_replace_shrink(struct z
/* void page_replace_mark_accessed(struct page *); */
/* void page_replace_remove(struct zone *, struct page *); */
/* void __page_replace_rotate_reclaimable(struct zone *, struct page *); */
+extern void page_replace_show(struct zone *);
+extern void page_replace_zoneinfo(struct zone *, struct seq_file *);

#ifdef CONFIG_MIGRATION
extern int page_replace_isolate(struct page *p);
Index: linux-2.6-git/mm/useonce.c
===================================================================
--- linux-2.6-git.orig/mm/useonce.c
+++ linux-2.6-git/mm/useonce.c
@@ -426,3 +426,55 @@ void page_replace_shrink(struct zone *zo

atomic_dec(&zone->reclaim_in_progress);
}
+
+#define K(x) ((x) << (PAGE_SHIFT-10))
+
+void page_replace_show(struct zone *zone)
+{
+ printk("%s"
+ " free:%lukB"
+ " min:%lukB"
+ " low:%lukB"
+ " high:%lukB"
+ " active:%lukB"
+ " inactive:%lukB"
+ " present:%lukB"
+ " pages_scanned:%lu"
+ " all_unreclaimable? %s"
+ "\n",
+ zone->name,
+ K(zone->free_pages),
+ K(zone->pages_min),
+ K(zone->pages_low),
+ K(zone->pages_high),
+ K(zone->nr_active),
+ K(zone->nr_inactive),
+ K(zone->present_pages),
+ zone->pages_scanned,
+ (zone->all_unreclaimable ? "yes" : "no")
+ );
+}
+
+void page_replace_zoneinfo(struct zone *zone, struct seq_file *m)
+{
+ seq_printf(m,
+ "\n pages free %lu"
+ "\n min %lu"
+ "\n low %lu"
+ "\n high %lu"
+ "\n active %lu"
+ "\n inactive %lu"
+ "\n scanned %lu (a: %lu i: %lu)"
+ "\n spanned %lu"
+ "\n present %lu",
+ zone->free_pages,
+ zone->pages_min,
+ zone->pages_low,
+ zone->pages_high,
+ zone->nr_active,
+ zone->nr_inactive,
+ zone->pages_scanned,
+ zone->nr_scan_active, zone->nr_scan_inactive,
+ zone->spanned_pages,
+ zone->present_pages);
+}
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c
+++ linux-2.6-git/mm/page_alloc.c
@@ -1432,28 +1432,7 @@ void show_free_areas(void)
int i;

show_node(zone);
- printk("%s"
- " free:%lukB"
- " min:%lukB"
- " low:%lukB"
- " high:%lukB"
- " active:%lukB"
- " inactive:%lukB"
- " present:%lukB"
- " pages_scanned:%lu"
- " all_unreclaimable? %s"
- "\n",
- zone->name,
- K(zone->free_pages),
- K(zone->pages_min),
- K(zone->pages_low),
- K(zone->pages_high),
- K(zone->nr_active),
- K(zone->nr_inactive),
- K(zone->present_pages),
- zone->pages_scanned,
- (zone->all_unreclaimable ? "yes" : "no")
- );
+ page_replace_show(zone);
printk("lowmem_reserve[]:");
for (i = 0; i < MAX_NR_ZONES; i++)
printk(" %lu", zone->lowmem_reserve[i]);
@@ -2218,26 +2197,7 @@ static int zoneinfo_show(struct seq_file

spin_lock_irqsave(&zone->lock, flags);
seq_printf(m, "Node %d, zone %8s", pgdat->node_id, zone->name);
- seq_printf(m,
- "\n pages free %lu"
- "\n min %lu"
- "\n low %lu"
- "\n high %lu"
- "\n active %lu"
- "\n inactive %lu"
- "\n scanned %lu (a: %lu i: %lu)"
- "\n spanned %lu"
- "\n present %lu",
- zone->free_pages,
- zone->pages_min,
- zone->pages_low,
- zone->pages_high,
- zone->nr_active,
- zone->nr_inactive,
- zone->pages_scanned,
- zone->nr_scan_active, zone->nr_scan_inactive,
- zone->spanned_pages,
- zone->present_pages);
+ page_replace_zoneinfo(zone, m);
seq_printf(m,
"\n protection: (%lu",
zone->lowmem_reserve[0]);

2006-03-22 22:34:40

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 16/34] mm: page-replace-init.patch


From: Peter Zijlstra <[email protected]>

Move initialization of the replacement policy's variables into the
implementation.

API:

initialize the implementation's per-zone variables

void page_replace_init_zone(struct zone *);

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_page_replace.h | 2 ++
init/main.c | 2 ++
mm/useonce.c | 15 +++++++++++++++
mm/page_alloc.c | 8 ++------
4 files changed, 21 insertions(+), 6 deletions(-)

Index: linux-2.6/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6.orig/include/linux/mm_page_replace.h 2006-03-13 20:37:39.000000000 +0100
+++ linux-2.6/include/linux/mm_page_replace.h 2006-03-13 20:37:40.000000000 +0100
@@ -67,6 +67,8 @@ struct scan_control {
#define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0)
#endif

+extern void page_replace_init(void);
+extern void page_replace_init_zone(struct zone *);
/* void page_replace_hint_active(struct page *); */
/* void page_replace_hint_use_once(struct page *); */
extern void fastcall page_replace_add(struct page *);
Index: linux-2.6/mm/useonce.c
===================================================================
--- linux-2.6.orig/mm/useonce.c 2006-03-13 20:37:38.000000000 +0100
+++ linux-2.6/mm/useonce.c 2006-03-13 20:37:40.000000000 +0100
@@ -6,6 +6,21 @@
#include <linux/buffer_head.h> /* for try_to_release_page(),
buffer_heads_over_limit */

+void __init page_replace_init(void)
+{
+ /* empty hook */
+}
+
+void __init page_replace_init_zone(struct zone *zone)
+{
+ INIT_LIST_HEAD(&zone->active_list);
+ INIT_LIST_HEAD(&zone->inactive_list);
+ zone->nr_scan_active = 0;
+ zone->nr_scan_inactive = 0;
+ zone->nr_active = 0;
+ zone->nr_inactive = 0;
+}
+
static inline void
add_page_to_inactive_list(struct zone *zone, struct page *page)
{
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2006-03-13 20:37:05.000000000 +0100
+++ linux-2.6/mm/page_alloc.c 2006-03-13 20:37:40.000000000 +0100
@@ -37,6 +37,7 @@
#include <linux/nodemask.h>
#include <linux/vmalloc.h>
#include <linux/mempolicy.h>
+#include <linux/mm_page_replace.h>

#include <asm/tlbflush.h>
#include "internal.h"
@@ -2075,12 +2076,7 @@ static void __init free_area_init_core(s
zone->temp_priority = zone->prev_priority = DEF_PRIORITY;

zone_pcp_init(zone);
- INIT_LIST_HEAD(&zone->active_list);
- INIT_LIST_HEAD(&zone->inactive_list);
- zone->nr_scan_active = 0;
- zone->nr_scan_inactive = 0;
- zone->nr_active = 0;
- zone->nr_inactive = 0;
+ page_replace_init_zone(zone);
atomic_set(&zone->reclaim_in_progress, 0);
if (!size)
continue;
Index: linux-2.6/init/main.c
===================================================================
--- linux-2.6.orig/init/main.c 2006-03-13 20:37:05.000000000 +0100
+++ linux-2.6/init/main.c 2006-03-13 20:37:40.000000000 +0100
@@ -47,6 +47,7 @@
#include <linux/rmap.h>
#include <linux/mempolicy.h>
#include <linux/key.h>
+#include <linux/mm_page_replace.h>

#include <asm/io.h>
#include <asm/bugs.h>
@@ -507,6 +508,7 @@ asmlinkage void __init start_kernel(void
#endif
vfs_caches_init_early();
cpuset_init_early();
+ page_replace_init();
mem_init();
kmem_cache_init();
setup_per_cpu_pageset();

2006-03-22 22:35:11

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 15/34] mm: page-replace-rotate.patch


From: Peter Zijlstra <[email protected]>

Move knowledge of the rotation itself out of the generic code and into the policy.

API:

rotate the page to the candidate end of the page scanner
(when suitable for reclaim)

void __page_replace_rotate_reclaimable(struct zone *, struct page *);
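
As with the other double-underscore helpers, the caller must already hold
zone->lru_lock; the fragment below simply mirrors the rotate_reclaimable_page()
hunk in the diff and is shown only to make the calling convention explicit
(zone, page and flags are assumed to be in scope):

	spin_lock_irqsave(&zone->lru_lock, flags);
	__page_replace_rotate_reclaimable(zone, page);
	spin_unlock_irqrestore(&zone->lru_lock, flags);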

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_page_replace.h | 1 +
include/linux/mm_use_once_policy.h | 8 ++++++++
mm/swap.c | 8 +-------
3 files changed, 10 insertions(+), 7 deletions(-)

Index: linux-2.6-git/include/linux/mm_use_once_policy.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_use_once_policy.h
+++ linux-2.6-git/include/linux/mm_use_once_policy.h
@@ -127,5 +127,13 @@ static inline void page_replace_remove(s
}
}

+static inline void __page_replace_rotate_reclaimable(struct zone *zone, struct page *page)
+{
+ if (PageLRU(page) && !PageActive(page)) {
+ list_move_tail(&page->lru, &zone->inactive_list);
+ inc_page_state(pgrotated);
+ }
+}
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_USEONCE_POLICY_H */
Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -89,6 +89,7 @@ extern void page_replace_reinsert(struct
extern void page_replace_shrink(struct zone *, struct scan_control *);
/* void page_replace_mark_accessed(struct page *); */
/* void page_replace_remove(struct zone *, struct page *); */
+/* void __page_replace_rotate_reclaimable(struct zone *, struct page *); */

#ifdef CONFIG_MIGRATION
extern int page_replace_isolate(struct page *p);
Index: linux-2.6-git/mm/swap.c
===================================================================
--- linux-2.6-git.orig/mm/swap.c
+++ linux-2.6-git/mm/swap.c
@@ -78,18 +78,12 @@ int rotate_reclaimable_page(struct page
return 1;
if (PageDirty(page))
return 1;
- if (PageActive(page))
- return 1;
if (!PageLRU(page))
return 1;

zone = page_zone(page);
spin_lock_irqsave(&zone->lru_lock, flags);
- if (PageLRU(page) && !PageActive(page)) {
- list_del(&page->lru);
- list_add_tail(&page->lru, &zone->inactive_list);
- inc_page_state(pgrotated);
- }
+ __page_replace_rotate_reclaimable(zone, page);
if (!test_clear_page_writeback(page))
BUG();
spin_unlock_irqrestore(&zone->lru_lock, flags);

2006-03-22 22:35:34

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 22/34] mm: page-replace-shrink-new.patch


From: Peter Zijlstra <[email protected]>

Add a general shrinker that policies can make use of.
The policy defines MM_POLICY_HAS_SHRINKER when it does _NOT_ want
to make use of this framework.

API:
unsigned long __page_replace_nr_scan(struct zone *);

return the number of pages in the scanlist for this zone.

void page_replace_candidates(struct zone *, int, struct list_head *);

fill the @list with at most @nr pages from @zone.

void page_replace_reinsert_zone(struct zone *, struct list_head *, int);

reinsert the pages remaining on @list (those that could not be freed) into
@zone; @nr is the number of pages that were freed.
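
For illustration only, a hypothetical single-list policy (not part of this
series) that leaves MM_POLICY_HAS_SHRINKER undefined would provide roughly
the following hooks; zone->policy.nr_pages and the list handling are assumed
names, not real framework members:

	static inline unsigned long __page_replace_nr_scan(struct zone *zone)
	{
		return zone->policy.nr_pages;	/* assumed policy counter */
	}

	void page_replace_candidates(struct zone *zone, int nr,
				     struct list_head *list)
	{
		/* move up to @nr pages from the policy's scan list onto
		 * @list, under zone->lru_lock */
	}

	void page_replace_reinsert_zone(struct zone *zone,
					struct list_head *list, int nr_freed)
	{
		/* put the pages that shrink_list() could not free back
		 * onto the policy's list(s) */
	}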

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_page_replace.h | 6 +++++
include/linux/mm_use_once_policy.h | 2 +
mm/vmscan.c | 43 +++++++++++++++++++++++++++++++++++++
3 files changed, 51 insertions(+)

Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -128,5 +128,11 @@ static inline void page_replace_add_drai
put_cpu();
}

+#if ! defined MM_POLICY_HAS_SHRINKER
+/* unsigned long __page_replace_nr_scan(struct zone *); */
+void page_replace_candidates(struct zone *, int, struct list_head *);
+void page_replace_reinsert_zone(struct zone *, struct list_head *, int);
+#endif
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_PAGE_REPLACE_H */
Index: linux-2.6-git/mm/vmscan.c
===================================================================
--- linux-2.6-git.orig/mm/vmscan.c
+++ linux-2.6-git/mm/vmscan.c
@@ -958,6 +958,49 @@ int should_reclaim_mapped(struct zone *z
return 0;
}

+#if ! defined MM_POLICY_HAS_SHRINKER
+void page_replace_shrink(struct zone *zone, struct scan_control *sc)
+{
+ unsigned long nr_scan = 0;
+
+ atomic_inc(&zone->reclaim_in_progress);
+
+ if (unlikely(sc->swap_cluster_max > SWAP_CLUSTER_MAX)) {
+ nr_scan = zone->policy.nr_scan;
+ zone->policy.nr_scan =
+ sc->swap_cluster_max + SWAP_CLUSTER_MAX - 1;
+ } else
+ zone->policy.nr_scan +=
+ (__page_replace_nr_scan(zone) >> sc->priority) + 1;
+
+ while (zone->policy.nr_scan >= SWAP_CLUSTER_MAX) {
+ LIST_HEAD(page_list);
+ int nr_freed;
+
+ zone->policy.nr_scan -= SWAP_CLUSTER_MAX;
+ page_replace_candidates(zone, SWAP_CLUSTER_MAX, &page_list);
+ if (list_empty(&page_list))
+ continue;
+
+ nr_freed = shrink_list(&page_list, sc);
+
+ local_irq_disable();
+ if (current_is_kswapd())
+ __mod_page_state(kswapd_steal, nr_freed);
+ __mod_page_state_zone(zone, pgsteal, nr_freed);
+ local_irq_enable();
+
+ page_replace_reinsert_zone(zone, &page_list, nr_freed);
+ }
+ if (nr_scan)
+ zone->policy.nr_scan = nr_scan;
+
+ atomic_dec(&zone->reclaim_in_progress);
+
+ throttle_vm_writeout();
+}
+#endif
+
/*
* This is the direct reclaim path, for page-allocating processes. We only
* try to reclaim pages from zones which will satisfy the caller's allocation
Index: linux-2.6-git/include/linux/mm_use_once_policy.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_use_once_policy.h
+++ linux-2.6-git/include/linux/mm_use_once_policy.h
@@ -169,5 +169,7 @@ static inline unsigned long __page_repla
return zone->policy.nr_active + zone->policy.nr_inactive;
}

+#define MM_POLICY_HAS_SHRINKER
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_USEONCE_POLICY_H */

2006-03-22 22:36:13

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 19/34] mm: page-replace-data.patch


From: Peter Zijlstra <[email protected]>

Abstract the policy-specific variables out of struct zone.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_page_replace_data.h | 13 +++++++
include/linux/mm_use_once_data.h | 16 +++++++++
include/linux/mm_use_once_policy.h | 14 ++++----
include/linux/mmzone.h | 8 +---
mm/useonce.c | 60 +++++++++++++++++------------------
5 files changed, 68 insertions(+), 43 deletions(-)

Index: linux-2.6-git/include/linux/mm_use_once_data.h
===================================================================
--- /dev/null
+++ linux-2.6-git/include/linux/mm_use_once_data.h
@@ -0,0 +1,16 @@
+#ifndef _LINUX_MM_USEONCE_DATA_H
+#define _LINUX_MM_USEONCE_DATA_H
+
+#ifdef __KERNEL__
+
+struct page_replace_data {
+ struct list_head active_list;
+ struct list_head inactive_list;
+ unsigned long nr_scan_active;
+ unsigned long nr_scan_inactive;
+ unsigned long nr_active;
+ unsigned long nr_inactive;
+};
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_MM_USEONCE_DATA_H */
Index: linux-2.6-git/mm/useonce.c
===================================================================
--- linux-2.6-git.orig/mm/useonce.c
+++ linux-2.6-git/mm/useonce.c
@@ -13,19 +13,19 @@ void __init page_replace_init(void)

void __init page_replace_init_zone(struct zone *zone)
{
- INIT_LIST_HEAD(&zone->active_list);
- INIT_LIST_HEAD(&zone->inactive_list);
- zone->nr_scan_active = 0;
- zone->nr_scan_inactive = 0;
- zone->nr_active = 0;
- zone->nr_inactive = 0;
+ INIT_LIST_HEAD(&zone->policy.active_list);
+ INIT_LIST_HEAD(&zone->policy.inactive_list);
+ zone->policy.nr_scan_active = 0;
+ zone->policy.nr_scan_inactive = 0;
+ zone->policy.nr_active = 0;
+ zone->policy.nr_inactive = 0;
}

static inline void
del_page_from_active_list(struct zone *zone, struct page *page)
{
list_del(&page->lru);
- zone->nr_active--;
+ zone->policy.nr_active--;
}

/**
@@ -211,9 +211,9 @@ static void shrink_cache(struct zone *zo
int nr_freed;

nr_taken = isolate_lru_pages(sc->swap_cluster_max,
- &zone->inactive_list,
+ &zone->policy.inactive_list,
&page_list, &nr_scan);
- zone->nr_inactive -= nr_taken;
+ zone->policy.nr_inactive -= nr_taken;
zone->pages_scanned += nr_scan;
spin_unlock_irq(&zone->lru_lock);

@@ -292,10 +292,10 @@ refill_inactive_zone(struct zone *zone,

page_replace_add_drain();
spin_lock_irq(&zone->lru_lock);
- pgmoved = isolate_lru_pages(nr_pages, &zone->active_list,
+ pgmoved = isolate_lru_pages(nr_pages, &zone->policy.active_list,
&l_hold, &pgscanned);
zone->pages_scanned += pgscanned;
- zone->nr_active -= pgmoved;
+ zone->policy.nr_active -= pgmoved;
spin_unlock_irq(&zone->lru_lock);

while (!list_empty(&l_hold)) {
@@ -323,10 +323,10 @@ refill_inactive_zone(struct zone *zone,
BUG();
if (!TestClearPageActive(page))
BUG();
- list_move(&page->lru, &zone->inactive_list);
+ list_move(&page->lru, &zone->policy.inactive_list);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
- zone->nr_inactive += pgmoved;
+ zone->policy.nr_inactive += pgmoved;
spin_unlock_irq(&zone->lru_lock);
pgdeactivate += pgmoved;
pgmoved = 0;
@@ -336,7 +336,7 @@ refill_inactive_zone(struct zone *zone,
spin_lock_irq(&zone->lru_lock);
}
}
- zone->nr_inactive += pgmoved;
+ zone->policy.nr_inactive += pgmoved;
pgdeactivate += pgmoved;
if (buffer_heads_over_limit) {
spin_unlock_irq(&zone->lru_lock);
@@ -351,17 +351,17 @@ refill_inactive_zone(struct zone *zone,
if (TestSetPageLRU(page))
BUG();
BUG_ON(!PageActive(page));
- list_move(&page->lru, &zone->active_list);
+ list_move(&page->lru, &zone->policy.active_list);
pgmoved++;
if (!pagevec_add(&pvec, page)) {
- zone->nr_active += pgmoved;
+ zone->policy.nr_active += pgmoved;
pgmoved = 0;
spin_unlock_irq(&zone->lru_lock);
__pagevec_release(&pvec);
spin_lock_irq(&zone->lru_lock);
}
}
- zone->nr_active += pgmoved;
+ zone->policy.nr_active += pgmoved;
spin_unlock(&zone->lru_lock);

__mod_page_state_zone(zone, pgrefill, pgscanned);
@@ -385,17 +385,17 @@ void page_replace_shrink(struct zone *zo
* Add one to `nr_to_scan' just to make sure that the kernel will
* slowly sift through the active list.
*/
- zone->nr_scan_active += (zone->nr_active >> sc->priority) + 1;
- nr_active = zone->nr_scan_active;
+ zone->policy.nr_scan_active += (zone->policy.nr_active >> sc->priority) + 1;
+ nr_active = zone->policy.nr_scan_active;
if (nr_active >= sc->swap_cluster_max)
- zone->nr_scan_active = 0;
+ zone->policy.nr_scan_active = 0;
else
nr_active = 0;

- zone->nr_scan_inactive += (zone->nr_inactive >> sc->priority) + 1;
- nr_inactive = zone->nr_scan_inactive;
+ zone->policy.nr_scan_inactive += (zone->policy.nr_inactive >> sc->priority) + 1;
+ nr_inactive = zone->policy.nr_scan_inactive;
if (nr_inactive >= sc->swap_cluster_max)
- zone->nr_scan_inactive = 0;
+ zone->policy.nr_scan_inactive = 0;
else
nr_inactive = 0;

@@ -440,8 +440,8 @@ void page_replace_show(struct zone *zone
K(zone->pages_min),
K(zone->pages_low),
K(zone->pages_high),
- K(zone->nr_active),
- K(zone->nr_inactive),
+ K(zone->policy.nr_active),
+ K(zone->policy.nr_inactive),
K(zone->present_pages),
zone->pages_scanned,
(zone->all_unreclaimable ? "yes" : "no")
@@ -464,10 +464,10 @@ void page_replace_zoneinfo(struct zone *
zone->pages_min,
zone->pages_low,
zone->pages_high,
- zone->nr_active,
- zone->nr_inactive,
+ zone->policy.nr_active,
+ zone->policy.nr_inactive,
zone->pages_scanned,
- zone->nr_scan_active, zone->nr_scan_inactive,
+ zone->policy.nr_scan_active, zone->policy.nr_scan_inactive,
zone->spanned_pages,
zone->present_pages);
}
@@ -482,8 +482,8 @@ void __page_replace_counts(unsigned long
*inactive = 0;
*free = 0;
for (i = 0; i < MAX_NR_ZONES; i++) {
- *active += zones[i].nr_active;
- *inactive += zones[i].nr_inactive;
+ *active += zones[i].policy.nr_active;
+ *inactive += zones[i].policy.nr_inactive;
*free += zones[i].free_pages;
}
}
Index: linux-2.6-git/include/linux/mm_page_replace_data.h
===================================================================
--- /dev/null
+++ linux-2.6-git/include/linux/mm_page_replace_data.h
@@ -0,0 +1,13 @@
+#ifndef _LINUX_MM_PAGE_REPLACE_DATA_H
+#define _LINUX_MM_PAGE_REPLACE_DATA_H
+
+#ifdef __KERNEL__
+
+#ifdef CONFIG_MM_POLICY_USEONCE
+#include <linux/mm_use_once_data.h>
+#else
+#error no mm policy
+#endif
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_MM_PAGE_REPLACE_DATA_H */
Index: linux-2.6-git/include/linux/mmzone.h
===================================================================
--- linux-2.6-git.orig/include/linux/mmzone.h
+++ linux-2.6-git/include/linux/mmzone.h
@@ -13,6 +13,7 @@
#include <linux/numa.h>
#include <linux/init.h>
#include <linux/seqlock.h>
+#include <linux/mm_page_replace_data.h>
#include <asm/atomic.h>

/* Free memory management - zoned buddy allocator. */
@@ -151,12 +152,7 @@ struct zone {

/* Fields commonly accessed by the page reclaim scanner */
spinlock_t lru_lock;
- struct list_head active_list;
- struct list_head inactive_list;
- unsigned long nr_scan_active;
- unsigned long nr_scan_inactive;
- unsigned long nr_active;
- unsigned long nr_inactive;
+ struct page_replace_data policy;
unsigned long pages_scanned; /* since last reclaim */
int all_unreclaimable; /* All pages pinned */

Index: linux-2.6-git/include/linux/mm_use_once_policy.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_use_once_policy.h
+++ linux-2.6-git/include/linux/mm_use_once_policy.h
@@ -15,14 +15,14 @@ static inline void
del_page_from_inactive_list(struct zone *zone, struct page *page)
{
list_del(&page->lru);
- zone->nr_inactive--;
+ zone->policy.nr_inactive--;
}

static inline void
add_page_to_active_list(struct zone *zone, struct page *page)
{
- list_add(&page->lru, &zone->active_list);
- zone->nr_active++;
+ list_add(&page->lru, &zone->policy.active_list);
+ zone->policy.nr_active++;
}

static inline void
@@ -121,23 +121,23 @@ static inline void page_replace_remove(s
list_del(&page->lru);
if (PageActive(page)) {
ClearPageActive(page);
- zone->nr_active--;
+ zone->policy.nr_active--;
} else {
- zone->nr_inactive--;
+ zone->policy.nr_inactive--;
}
}

static inline void __page_replace_rotate_reclaimable(struct zone *zone, struct page *page)
{
if (PageLRU(page) && !PageActive(page)) {
- list_move_tail(&page->lru, &zone->inactive_list);
+ list_move_tail(&page->lru, &zone->policy.inactive_list);
inc_page_state(pgrotated);
}
}

static inline unsigned long __page_replace_nr_pages(struct zone *zone)
{
- return zone->nr_active + zone->nr_inactive;
+ return zone->policy.nr_active + zone->policy.nr_inactive;
}

#endif /* __KERNEL__ */

2006-03-22 22:36:20

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 25/34] mm: kswapd-writeout-wait.patch


From: Peter Zijlstra <[email protected]>

The new page reclaim implementations can require a lot more scanning
in order to find a suitable page. This causes kswapd to constantly hit:

blk_congestion_wait(WRITE, HZ/10);

without there being any submitted IO.

Count the number of async pages submitted by pageout() and only wait
for congestion when the last priority level has submitted more than
half SWAP_CLUSTER_MAX pages for IO.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_page_replace.h | 2 ++
mm/vmscan.c | 9 +++++++--
2 files changed, 9 insertions(+), 2 deletions(-)

Index: linux-2.6/mm/vmscan.c
===================================================================
--- linux-2.6.orig/mm/vmscan.c 2006-03-13 20:45:16.000000000 +0100
+++ linux-2.6/mm/vmscan.c 2006-03-13 20:45:25.000000000 +0100
@@ -343,6 +343,7 @@ int shrink_list(struct list_head *page_l
struct pagevec freed_pvec;
int pgactivate = 0;
int reclaimed = 0;
+ int writeout = 0;

cond_resched();

@@ -438,8 +439,10 @@ int shrink_list(struct list_head *page_l
case PAGE_ACTIVATE:
goto activate_locked;
case PAGE_SUCCESS:
- if (PageWriteback(page) || PageDirty(page))
+ if (PageWriteback(page) || PageDirty(page)) {
+ writeout++;
goto keep;
+ }
/*
* A synchronous write - probably a ramdisk. Go
* ahead and try to reclaim the page.
@@ -507,6 +510,7 @@ keep:
__pagevec_release_nonlru(&freed_pvec);
mod_page_state(pgactivate, pgactivate);
sc->nr_reclaimed += reclaimed;
+ sc->nr_writeout += writeout;
return reclaimed;
}

@@ -1232,6 +1236,7 @@ scan:
* pages behind kswapd's direction of progress, which would
* cause too much scanning of the lower zones.
*/
+ sc.nr_writeout = 0;
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
int nr_slab;
@@ -1283,7 +1288,7 @@ scan:
* OK, kswapd is getting into trouble. Take a nap, then take
* another pass across the zones.
*/
- if (total_scanned && priority < DEF_PRIORITY - 2)
+ if (sc.nr_writeout > SWAP_CLUSTER_MAX/2)
blk_congestion_wait(WRITE, HZ/10);

/*
Index: linux-2.6/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6.orig/include/linux/mm_page_replace.h 2006-03-13 20:45:16.000000000 +0100
+++ linux-2.6/include/linux/mm_page_replace.h 2006-03-13 20:45:25.000000000 +0100
@@ -18,6 +18,8 @@ struct scan_control {
/* Incremented by the number of pages reclaimed */
unsigned long nr_reclaimed;

+ unsigned long nr_writeout; /* page against which writeout was started */
+
unsigned long nr_mapped; /* From page_state */

/* Ask shrink_caches, or shrink_zone to scan at this priority */

2006-03-22 22:37:51

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 31/34] mm: cart-PG_reclaim3.patch


From: Peter Zijlstra <[email protected]>

Add a third PG_flag to the page reclaim framework.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/page-flags.h | 1 +
mm/hugetlb.c | 3 ++-
mm/page_alloc.c | 3 +++
3 files changed, 6 insertions(+), 1 deletion(-)

Index: linux-2.6-git/include/linux/page-flags.h
===================================================================
--- linux-2.6-git.orig/include/linux/page-flags.h
+++ linux-2.6-git/include/linux/page-flags.h
@@ -77,6 +77,7 @@
#define PG_uncached 19 /* Page has been mapped as uncached */

#define PG_reclaim2 20 /* reserved by the mm reclaim code */
+#define PG_reclaim3 21 /* reserved by the mm reclaim code */

/*
* Global page accounting. One instance per CPU. Only unsigned longs are
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c
+++ linux-2.6-git/mm/page_alloc.c
@@ -151,6 +151,7 @@ static void bad_page(struct page *page)
1 << PG_locked |
1 << PG_reclaim1 |
1 << PG_reclaim2 |
+ 1 << PG_reclaim3 |
1 << PG_dirty |
1 << PG_reclaim |
1 << PG_slab |
@@ -363,6 +364,7 @@ static inline int free_pages_check(struc
1 << PG_locked |
1 << PG_reclaim1 |
1 << PG_reclaim2 |
+ 1 << PG_reclaim3 |
1 << PG_reclaim |
1 << PG_slab |
1 << PG_swapcache |
@@ -521,6 +523,7 @@ static int prep_new_page(struct page *pa
1 << PG_locked |
1 << PG_reclaim1 |
1 << PG_reclaim2 |
+ 1 << PG_reclaim3 |
1 << PG_dirty |
1 << PG_reclaim |
1 << PG_slab |
Index: linux-2.6-git/mm/hugetlb.c
===================================================================
--- linux-2.6-git.orig/mm/hugetlb.c
+++ linux-2.6-git/mm/hugetlb.c
@@ -153,7 +153,8 @@ static void update_and_free_page(struct
for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++) {
page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
1 << PG_dirty | 1 << PG_reclaim1 | 1 << PG_reclaim2 |
- 1 << PG_reserved | 1 << PG_private | 1<< PG_writeback);
+ 1 << PG_reclaim3 | 1 << PG_reserved | 1 << PG_private |
+ 1<< PG_writeback);
set_page_count(&page[i], 0);
}
set_page_count(page, 1);

2006-03-22 22:37:05

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 30/34] mm: cart-nonresident.patch


From: Peter Zijlstra <[email protected]>

This code was originally started by Rik van Riel; I modified it heavily
to suit my needs. The comments in the file should make the details clear.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/nonresident-cart.h | 34 +++
mm/nonresident-cart.c | 362 +++++++++++++++++++++++++++++++++++++++
2 files changed, 396 insertions(+)

Index: linux-2.6/mm/nonresident-cart.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/nonresident-cart.c 2006-03-13 20:38:04.000000000 +0100
@@ -0,0 +1,362 @@
+/*
+ * mm/nonresident-cart.c
+ * (C) 2004,2005 Red Hat, Inc
+ * Written by Rik van Riel <[email protected]>
+ * Released under the GPL, see the file COPYING for details.
+ * Adapted by Peter Zijlstra <[email protected]> for use by ARC
+ * like algorithms.
+ *
+ * Keeps track of whether a non-resident page was recently evicted
+ * and should be immediately promoted to the active list. This also
+ * helps automatically tune the inactive target.
+ *
+ * The pageout code stores a recently evicted page in this cache
+ * by calling nonresident_put(mapping/mm, index/vaddr)
+ * and can look it up in the cache by calling nonresident_find()
+ * with the same arguments.
+ *
+ * Note that there is no way to invalidate pages after eg. truncate
+ * or exit, we let the pages fall out of the non-resident set through
+ * normal replacement.
+ *
+ *
+ * Modified to work with ARC like algorithms who:
+ * - need to balance two FIFOs; |b1| + |b2| = c,
+ *
+ * The bucket contains four single linked cyclic lists (CLOCKS) and each
+ * clock has a tail hand. By selecting a victim clock upon insertion it
+ * is possible to balance them.
+ *
+ * The first two lists are used for B1/B2 and a third for a free slot list.
+ * The fourth list is unused.
+ *
+ * The slot looks like this:
+ * struct slot_t {
+ * u32 cookie : 24; // LSB
+ * u32 index : 6;
+ * u32 listid : 2;
+ * };
+ *
+ * The bucket is guarded by a spinlock.
+ */
+#include <linux/swap.h>
+#include <linux/mm.h>
+#include <linux/cache.h>
+#include <linux/spinlock.h>
+#include <linux/bootmem.h>
+#include <linux/hash.h>
+#include <linux/prefetch.h>
+#include <linux/kernel.h>
+#include <linux/nonresident-cart.h>
+
+#define TARGET_SLOTS 64
+#define NR_CACHELINES (TARGET_SLOTS*sizeof(u32) / L1_CACHE_BYTES)
+#define NR_SLOTS (((NR_CACHELINES * L1_CACHE_BYTES) - sizeof(spinlock_t) - 4*sizeof(u8)) / sizeof(u32))
+#if 0
+#if NR_SLOTS < (TARGET_SLOTS / 2)
+#warning very small slot size
+#if NR_SLOTS <= 0
+#error no room for slots left
+#endif
+#endif
+#endif
+
+#define BUILD_MASK(bits, shift) (((1 << (bits)) - 1) << (shift))
+
+#define LISTID_BITS 2
+#define LISTID_SHIFT (sizeof(u32)*8 - LISTID_BITS)
+#define LISTID_MASK BUILD_MASK(LISTID_BITS, LISTID_SHIFT)
+
+#define SET_LISTID(x, flg) ((x) = ((x) & ~LISTID_MASK) | ((flg) << LISTID_SHIFT))
+#define GET_LISTID(x) (((x) & LISTID_MASK) >> LISTID_SHIFT)
+
+#define INDEX_BITS 6 /* ceil(log2(NR_SLOTS)) */
+#define INDEX_SHIFT (LISTID_SHIFT - INDEX_BITS)
+#define INDEX_MASK BUILD_MASK(INDEX_BITS, INDEX_SHIFT)
+
+#define SET_INDEX(x, idx) ((x) = ((x) & ~INDEX_MASK) | ((idx) << INDEX_SHIFT))
+#define GET_INDEX(x) (((x) & INDEX_MASK) >> INDEX_SHIFT)
+
+#define COOKIE_MASK BUILD_MASK(sizeof(u32)*8 - LISTID_BITS - INDEX_BITS, 0)
+
+struct nr_bucket
+{
+ spinlock_t lock;
+ u8 hand[4];
+ u32 slot[NR_SLOTS];
+} ____cacheline_aligned;
+
+/* The non-resident page hash table. */
+static struct nr_bucket * nonres_table;
+static unsigned int nonres_shift;
+static unsigned int nonres_mask;
+
+/* hash the address into a bucket */
+static struct nr_bucket * nr_hash(void * mapping, unsigned long index)
+{
+ unsigned long bucket;
+ unsigned long hash;
+
+ hash = (unsigned long)mapping + 37 * index;
+ bucket = hash_long(hash, nonres_shift);
+
+ return nonres_table + bucket;
+}
+
+/* hash the address and inode into a cookie */
+static u32 nr_cookie(struct address_space * mapping, unsigned long index)
+{
+ unsigned long hash;
+
+ hash = 37 * (unsigned long)mapping + index;
+
+ if (mapping && mapping->host)
+ hash = 37 * hash + mapping->host->i_ino;
+
+ return hash_long(hash, sizeof(u32)*8 - LISTID_BITS - INDEX_BITS);
+}
+
+DEFINE_PER_CPU(unsigned long[4], nonres_count);
+
+/*
+ * remove current (b from 'abc'):
+ *
+ * initial swap(2,3)
+ *
+ * 1: -> [2],a 1: -> [2],a
+ * * 2: -> [3],b 2: -> [1],c
+ * 3: -> [1],c * 3: -> [3],b
+ *
+ * 3 is now free for use.
+ *
+ * @nr_bucket: bucket to operate in
+ * @listid: list that the deletee belongs to
+ * @pos: slot position of deletee
+ * @slot: possible pointer to slot
+ *
+ * returns pointer to removed slot, NULL when list empty.
+ */
+static u32 * __nonresident_del(struct nr_bucket *nr_bucket, int listid, u8 pos, u32 *slot)
+{
+ int next_pos;
+ u32 *next;
+
+ if (slot == NULL) {
+ slot = &nr_bucket->slot[pos];
+ if (GET_LISTID(*slot) != listid)
+ return NULL;
+ }
+
+ --__get_cpu_var(nonres_count[listid]);
+
+ next_pos = GET_INDEX(*slot);
+ if (pos == next_pos) {
+ next = slot;
+ goto out;
+ }
+
+ next = &nr_bucket->slot[next_pos];
+ *next = xchg(slot, *next);
+
+ if (next_pos == nr_bucket->hand[listid])
+ nr_bucket->hand[listid] = pos;
+out:
+ BUG_ON(GET_INDEX(*next) != next_pos);
+ return next;
+}
+
+static inline u32 * __nonresident_pop(struct nr_bucket *nr_bucket, int listid)
+{
+ return __nonresident_del(nr_bucket, listid, nr_bucket->hand[listid], NULL);
+}
+
+/*
+ * insert before (d before b in 'abc')
+ *
+ * initial set 4 swap(2,4)
+ *
+ * 1: -> [2],a 1: -> [2],a 1: -> [2],a
+ * * 2: -> [3],b 2: -> [3],b 2: -> [4],d
+ * 3: -> [1],c 3: -> [1],c 3: -> [1],c
+ * 4: -> [4],nil 4: -> [4],d * 4: -> [3],b
+ *
+ * leaving us with 'adbc'.
+ *
+ * @nr_bucket: bucket to operator in
+ * @listid: list to insert into
+ * @pos: position to insert before
+ * @slot: slot to insert
+ */
+static void __nonresident_insert(struct nr_bucket *nr_bucket, int listid, u8 *pos, u32 *slot)
+{
+ u32 *head;
+
+ SET_LISTID(*slot, listid);
+
+ head = &nr_bucket->slot[*pos];
+
+ *pos = GET_INDEX(*slot);
+ if (GET_LISTID(*head) == listid)
+ *slot = xchg(head, *slot);
+
+ ++__get_cpu_var(nonres_count[listid]);
+}
+
+static inline void __nonresident_push(struct nr_bucket *nr_bucket, int listid, u32 *slot)
+{
+ __nonresident_insert(nr_bucket, listid, &nr_bucket->hand[listid], slot);
+}
+
+/*
+ * Remembers a page by putting a hash-cookie on the @listid list.
+ *
+ * @mapping: page_mapping()
+ * @index: page_index()
+ * @listid: list to put the page on (NR_b1, NR_b2 and NR_free).
+ * @listid_evict: list to get a free page from when NR_free is empty.
+ *
+ * returns the list an empty page was taken from.
+ */
+int nonresident_put(struct address_space * mapping, unsigned long index, int listid, int listid_evict)
+{
+ struct nr_bucket *nr_bucket;
+ u32 cookie;
+ unsigned long flags;
+ u32 *slot;
+ int evict = NR_free;
+
+ prefetch(mapping->host);
+ nr_bucket = nr_hash(mapping, index);
+
+ spin_lock_prefetch(nr_bucket); // prefetchw_range(nr_bucket, NR_CACHELINES);
+ cookie = nr_cookie(mapping, index);
+
+ spin_lock_irqsave(&nr_bucket->lock, flags);
+ slot = __nonresident_pop(nr_bucket, evict);
+ if (!slot) {
+ evict = listid_evict;
+ slot = __nonresident_pop(nr_bucket, evict);
+ if (!slot) {
+ evict ^= 1;
+ slot = __nonresident_pop(nr_bucket, evict);
+ }
+ }
+ BUG_ON(!slot);
+ SET_INDEX(cookie, GET_INDEX(*slot));
+ cookie = xchg(slot, cookie);
+ __nonresident_push(nr_bucket, listid, slot);
+ spin_unlock_irqrestore(&nr_bucket->lock, flags);
+
+ return evict;
+}
+
+/*
+ * Searches a page on the first two lists, and places it on the free list.
+ *
+ * @mapping: page_mapping()
+ * @index: page_index()
+ *
+ * returns listid of the list the item was found on with NR_found set if found.
+ */
+int nonresident_get(struct address_space * mapping, unsigned long index)
+{
+ struct nr_bucket * nr_bucket;
+ u32 wanted;
+ int j;
+ u8 i;
+ unsigned long flags;
+ int ret = 0;
+
+ if (mapping)
+ prefetch(mapping->host);
+ nr_bucket = nr_hash(mapping, index);
+
+ spin_lock_prefetch(nr_bucket); // prefetch_range(nr_bucket, NR_CACHELINES);
+ wanted = nr_cookie(mapping, index) & COOKIE_MASK;
+
+ spin_lock_irqsave(&nr_bucket->lock, flags);
+ for (i = 0; i < 2; ++i) {
+ j = nr_bucket->hand[i];
+ do {
+ u32 *slot = &nr_bucket->slot[j];
+ if (GET_LISTID(*slot) != i)
+ break;
+
+ if ((*slot & COOKIE_MASK) == wanted) {
+ slot = __nonresident_del(nr_bucket, i, j, slot);
+ __nonresident_push(nr_bucket, NR_free, slot);
+ ret = i | NR_found;
+ goto out;
+ }
+
+ j = GET_INDEX(*slot);
+ } while (j != nr_bucket->hand[i]);
+ }
+out:
+ spin_unlock_irqrestore(&nr_bucket->lock, flags);
+
+ return ret;
+}
+
+unsigned int nonresident_total(void)
+{
+ return (1 << nonres_shift) * NR_SLOTS;
+}
+
+/*
+ * For interactive workloads, we remember about as many non-resident pages
+ * as we have actual memory pages. For server workloads with large inter-
+ * reference distances we could benefit from remembering more.
+ */
+static __initdata unsigned long nonresident_factor = 1;
+void __init nonresident_init(void)
+{
+ int target;
+ int i, j;
+
+ /*
+ * Calculate the non-resident hash bucket target. Use a power of
+ * two for the division because alloc_large_system_hash rounds up.
+ */
+ target = nr_all_pages * nonresident_factor;
+ target /= (sizeof(struct nr_bucket) / sizeof(u32));
+
+ nonres_table = alloc_large_system_hash("Non-resident page tracking",
+ sizeof(struct nr_bucket),
+ target,
+ 0,
+ HASH_EARLY | HASH_HIGHMEM,
+ &nonres_shift,
+ &nonres_mask,
+ 0);
+
+ for (i = 0; i < (1 << nonres_shift); i++) {
+ spin_lock_init(&nonres_table[i].lock);
+ for (j = 0; j < 4; ++j)
+ nonres_table[i].hand[j] = 0;
+
+ for (j = 0; j < NR_SLOTS; ++j) {
+ nonres_table[i].slot[j] = 0;
+ SET_LISTID(nonres_table[i].slot[j], NR_free);
+ if (j < NR_SLOTS - 1)
+ SET_INDEX(nonres_table[i].slot[j], j+1);
+ else /* j == NR_SLOTS - 1 */
+ SET_INDEX(nonres_table[i].slot[j], 0);
+ }
+ }
+
+ for_each_cpu(i) {
+ for (j=0; j<4; ++j)
+ per_cpu(nonres_count[j], i) = 0;
+ }
+}
+
+static int __init set_nonresident_factor(char * str)
+{
+ if (!str)
+ return 0;
+ nonresident_factor = simple_strtoul(str, &str, 0);
+ return 1;
+}
+
+__setup("nonresident_factor=", set_nonresident_factor);
Index: linux-2.6/include/linux/nonresident-cart.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/nonresident-cart.h 2006-03-13 20:38:04.000000000 +0100
@@ -0,0 +1,34 @@
+#ifndef _LINUX_NONRESIDENT_CART_H_
+#define _LINUX_NONRESIDENT_CART_H_
+
+#ifdef __KERNEL__
+
+#include <linux/fs.h>
+#include <linux/preempt.h>
+#include <linux/percpu.h>
+
+#define NR_b1 0
+#define NR_b2 1
+#define NR_free 2
+
+#define NR_listid 3
+#define NR_found 0x80000000
+
+extern int nonresident_put(struct address_space *, unsigned long, int, int);
+extern int nonresident_get(struct address_space *, unsigned long);
+extern unsigned int nonresident_total(void);
+extern void nonresident_init(void);
+
+DECLARE_PER_CPU(unsigned long[4], nonres_count);
+
+static inline unsigned long nonresident_count(int listid)
+{
+ unsigned long count;
+ preempt_disable();
+ count = __sum_cpu_var(unsigned long, nonres_count[listid]);
+ preempt_enable();
+ return count;
+}
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_NONRESIDENT_CART_H_ */

2006-03-22 22:38:00

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 33/34] mm: cart-r.patch


From: Peter Zijlstra <[email protected]>

Another CART-based policy; this one extends CART to handle cyclic access patterns.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_cart_data.h | 8 ++++
include/linux/mm_cart_policy.h | 7 +++
include/linux/mm_page_replace.h | 2 -
include/linux/mm_page_replace_data.h | 2 -
mm/Kconfig | 6 +++
mm/Makefile | 1
mm/cart.c | 63 ++++++++++++++++++++++++++++++-----
7 files changed, 78 insertions(+), 11 deletions(-)

Index: linux-2.6-git/include/linux/mm_cart_data.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_cart_data.h
+++ linux-2.6-git/include/linux/mm_cart_data.h
@@ -13,11 +13,15 @@ struct page_replace_data {
unsigned long nr_T2;
unsigned long nr_shortterm;
unsigned long nr_p;
+#if defined CONFIG_MM_POLICY_CART_R
+ unsigned long nr_r;
+#endif
unsigned long flags;
};

#define CART_RECLAIMED_T1 0
#define CART_SATURATED 1
+#define CART_CYCLIC 2

#define ZoneReclaimedT1(z) test_bit(CART_RECLAIMED_T1, &((z)->policy.flags))
#define SetZoneReclaimedT1(z) __set_bit(CART_RECLAIMED_T1, &((z)->policy.flags))
@@ -27,5 +31,9 @@ struct page_replace_data {
#define SetZoneSaturated(z) __set_bit(CART_SATURATED, &((z)->policy.flags))
#define TestClearZoneSaturated(z) __test_and_clear_bit(CART_SATURATED, &((z)->policy.flags))

+#define ZoneCyclic(z) test_bit(CART_CYCLIC, &((z)->policy.flags))
+#define SetZoneCyclic(z) __set_bit(CART_CYCLIC, &((z)->policy.flags))
+#define ClearZoneCyclic(z) __clear_bit(CART_CYCLIC, &((z)->policy.flags))
+
#endif /* __KERNEL__ */
#endif /* _LINUX_CART_DATA_H_ */
Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -116,7 +116,7 @@ static inline int page_replace_isolate(s
#include <linux/mm_use_once_policy.h>
#elif defined CONFIG_MM_POLICY_CLOCKPRO
#include <linux/mm_clockpro_policy.h>
-#elif defined CONFIG_MM_POLICY_CART
+#elif defined CONFIG_MM_POLICY_CART || defined CONFIG_MM_POLICY_CART_R
#include <linux/mm_cart_policy.h>
#else
#error no mm policy
Index: linux-2.6-git/include/linux/mm_page_replace_data.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace_data.h
+++ linux-2.6-git/include/linux/mm_page_replace_data.h
@@ -7,7 +7,7 @@
#include <linux/mm_use_once_data.h>
#elif defined CONFIG_MM_POLICY_CLOCKPRO
#include <linux/mm_clockpro_data.h>
-#elif defined CONFIG_MM_POLICY_CART
+#elif defined CONFIG_MM_POLICY_CART || defined CONFIG_MM_POLICY_CART_R
#include <linux/mm_cart_data.h>
#else
#error no mm policy
Index: linux-2.6-git/mm/Kconfig
===================================================================
--- linux-2.6-git.orig/mm/Kconfig
+++ linux-2.6-git/mm/Kconfig
@@ -152,6 +152,12 @@ config MM_POLICY_CART
help
This option selects a CART based policy

+config MM_POLICY_CART_R
+ bool "CART-r"
+ help
+ This option selects a CART based policy modified to handle cyclic
+ access patterns.
+
endchoice

#
Index: linux-2.6-git/mm/cart.c
===================================================================
--- linux-2.6-git.orig/mm/cart.c
+++ linux-2.6-git/mm/cart.c
@@ -69,6 +69,9 @@ void __init page_replace_init_zone(struc
zone->policy.nr_T2 = 0;
zone->policy.nr_shortterm = 0;
zone->policy.nr_p = 0;
+#if defined CONFIG_MM_POLICY_CART_R
+ zone->policy.nr_r = 0;
+#endif
zone->policy.flags = 0;
}

@@ -166,6 +169,30 @@ static inline void __cart_p_dec(struct z
zone->policy.nr_p = 0UL;
}

+#if defined CONFIG_MM_POLICY_CART_R
+static inline void __cart_r_inc(struct zone *zone)
+{
+ unsigned long ratio;
+ ratio = (cart_longterm(zone) / (zone->policy.nr_shortterm + 1)) ?: 1;
+ zone->policy.nr_r += ratio;
+ if (zone->policy.nr_r > cart_c(zone))
+ zone->policy.nr_r = cart_c(zone);
+}
+
+static inline void __cart_r_dec(struct zone *zone)
+{
+ unsigned long ratio;
+ ratio = (zone->policy.nr_shortterm / (cart_longterm(zone) + 1)) ?: 1;
+ if (zone->policy.nr_r > ratio)
+ zone->policy.nr_r -= ratio;
+ else
+ zone->policy.nr_r = 0UL;
+}
+#else
+#define __cart_r_inc(z) do { } while (0)
+#define __cart_r_dec(z) do { } while (0)
+#endif
+
static unsigned long list_count(struct list_head *list, int PG_flag, int result)
{
unsigned long nr = 0;
@@ -236,6 +263,8 @@ void __page_replace_add(struct zone *zon

if (rflags & NR_found) {
SetPageLongTerm(page);
+ __cart_r_dec(zone);
+
rflags &= NR_listid;
if (rflags == NR_b1) {
__cart_p_inc(zone);
@@ -246,6 +275,7 @@ void __page_replace_add(struct zone *zon
/* ++cart_longterm(zone); */
} else {
ClearPageLongTerm(page);
+ __cart_r_inc(zone);
++zone->policy.nr_shortterm;
}
SetPageT1(page);
@@ -454,19 +484,28 @@ static int isolate_pages(struct zone *zo

static inline int cart_reclaim_T1(struct zone *zone)
{
+ int ret = 0;
int t1 = zone->policy.nr_T1 > zone->policy.nr_p;
int sat = TestClearZoneSaturated(zone);
int rec = ZoneReclaimedT1(zone);
+#if defined CONFIG_MM_POLICY_CART_R
+ int cyc = zone->policy.nr_r < cart_longterm(zone);

- if (t1) {
- if (sat && rec)
- return 0;
- return 1;
- }
+ t1 |= cyc;
+#endif

- if (sat && !rec)
- return 1;
- return 0;
+ if ((t1 && !(rec && sat)) ||
+ (!t1 && (!rec && sat)))
+ ret = 1;
+
+#if defined CONFIG_MM_POLICY_CART_R
+ if (ret && cyc)
+ SetZoneCyclic(zone);
+ else
+ ClearZoneCyclic(zone);
+#endif
+
+ return ret;
}


@@ -642,7 +681,10 @@ void page_replace_zoneinfo(struct zone *
"\n T2 %lu"
"\n shortterm %lu"
"\n p %lu"
- "\n flags %lu"
+#if defined CONFIG_MM_POLICY_CART_R
+ "\n r %lu"
+#endif
+ "\n flags %lx"
"\n scanned %lu"
"\n spanned %lu"
"\n present %lu",
@@ -654,6 +696,9 @@ void page_replace_zoneinfo(struct zone *
zone->policy.nr_T2,
zone->policy.nr_shortterm,
zone->policy.nr_p,
+#if defined CONFIG_MM_POLICY_CART_R
+ zone->policy.nr_r,
+#endif
zone->policy.flags,
zone->pages_scanned,
zone->spanned_pages,
Index: linux-2.6-git/include/linux/mm_cart_policy.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_cart_policy.h
+++ linux-2.6-git/include/linux/mm_cart_policy.h
@@ -82,6 +82,13 @@ static inline void page_replace_remove(s

static inline int page_replace_reclaimable(struct page *page)
{
+#if defined CONFIG_MM_POLICY_CART_R
+ if (PageNew(page) && ZoneCyclic(page_zone(page))) {
+ ClearPageNew(page);
+ return RECLAIM_OK;
+ }
+#endif
+
if (page_referenced(page, 1, 0))
return RECLAIM_ACTIVATE;

Index: linux-2.6-git/mm/Makefile
===================================================================
--- linux-2.6-git.orig/mm/Makefile
+++ linux-2.6-git/mm/Makefile
@@ -15,6 +15,7 @@ obj-y := bootmem.o filemap.o mempool.o
obj-$(CONFIG_MM_POLICY_USEONCE) += useonce.o
obj-$(CONFIG_MM_POLICY_CLOCKPRO) += nonresident.o clockpro.o
obj-$(CONFIG_MM_POLICY_CART) += nonresident-cart.o cart.o
+obj-$(CONFIG_MM_POLICY_CART_R) += nonresident-cart.o cart.o

obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o

2006-03-22 22:38:37

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 32/34] mm: cart-cart.patch


From: Peter Zijlstra <[email protected]>

This patch contains a Page Replacement Algorithm based on CART.
Please refer to the CART paper here -
http://www.almaden.ibm.com/cs/people/dmodha/clockfast.pdf

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_cart_data.h | 31 +
include/linux/mm_cart_policy.h | 134 ++++++
include/linux/mm_page_replace.h | 6
include/linux/mm_page_replace_data.h | 6
mm/Kconfig | 5
mm/Makefile | 1
mm/cart.c | 678 +++++++++++++++++++++++++++++++++++
7 files changed, 857 insertions(+), 4 deletions(-)

Index: linux-2.6-git/mm/cart.c
===================================================================
--- /dev/null
+++ linux-2.6-git/mm/cart.c
@@ -0,0 +1,678 @@
+/*
+ * mm/cart.c
+ *
+ * Written by Peter Zijlstra <[email protected]>
+ * Released under the GPLv2, see the file COPYING for details.
+ *
+ * This file contains a Page Replacement Algorithm based on CART
+ * Please refer to the CART paper here -
+ * http://www.almaden.ibm.com/cs/people/dmodha/clockfast.pdf
+ *
+ * T1 -> active_list |T1| -> nr_active
+ * T2 -> inactive_list |T2| -> nr_inactive
+ * filter bit -> PG_longterm
+ *
+ * The algorithm was adapted to work for linux which poses the following
+ * extra constraints:
+ * - multiple memory zones,
+ * - fault before reference,
+ * - expensive reference check.
+ *
+ * The multiple memory zones are handled by decoupling the T lists from the
+ * B lists, keeping T lists per zone while having global B lists. See
+ * mm/nonresident.c for the B list implementation. List sizes are scaled on
+ * comparison.
+ *
+ * The paper seems to assume we insert after/on the first reference, we
+ * actually insert before the first reference. In order to give 'S' pages
+ * a chance we will not mark them 'L' on their first cycle (PG_new).
+ *
+ * Also for efficiency's sake the replace operation is batched. This is to
+ * avoid holding the much contended zone->lru_lock while calling the
+ * possibly slow page_referenced().
+ *
+ * All functions that are prefixed with '__' assume that zone->lru_lock is taken.
+ */
+
+#include <linux/mm_page_replace.h>
+#include <linux/rmap.h>
+#include <linux/buffer_head.h>
+#include <linux/pagevec.h>
+#include <linux/bootmem.h>
+#include <linux/init.h>
+#include <linux/nonresident-cart.h>
+#include <linux/swap.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/writeback.h>
+
+#include <asm/div64.h>
+
+
+static DEFINE_PER_CPU(unsigned long, cart_nr_q);
+
+void __init page_replace_init(void)
+{
+ int i;
+
+ nonresident_init();
+
+ for_each_cpu(i)
+ per_cpu(cart_nr_q, i) = 0;
+}
+
+void __init page_replace_init_zone(struct zone *zone)
+{
+ INIT_LIST_HEAD(&zone->policy.list_T1);
+ INIT_LIST_HEAD(&zone->policy.list_T2);
+ zone->policy.nr_T1 = 0;
+ zone->policy.nr_T2 = 0;
+ zone->policy.nr_shortterm = 0;
+ zone->policy.nr_p = 0;
+ zone->policy.flags = 0;
+}
+
+static inline unsigned long cart_c(struct zone *zone)
+{
+ return zone->policy.nr_T1 + zone->policy.nr_T2 + zone->free_pages;
+}
+
+#define scale(x, y, z) ({ unsigned long long tmp = (x); \
+ tmp *= (y); \
+ do_div(tmp, (z)); \
+ (unsigned long)tmp; })
+
+#define B2T(x) scale((x), cart_c(zone), nonresident_total())
+#define T2B(x) scale((x), nonresident_total(), cart_c(zone))
+
+static inline unsigned long cart_longterm(struct zone *zone)
+{
+ return zone->policy.nr_T1 + zone->policy.nr_T2 - zone->policy.nr_shortterm;
+}
+
+static inline unsigned long __cart_q(void)
+{
+ return __sum_cpu_var(unsigned long, cart_nr_q);
+}
+
+static void __cart_q_inc(struct zone *zone, unsigned long dq)
+{
+ /* if (|T2| + |B2| + |T1| - ns >= c) q = min(q + 1, 2c - |T1|) */
+ /* |B2| + nl >= c */
+ if (B2T(nonresident_count(NR_b2)) + cart_longterm(zone) >=
+ cart_c(zone)) {
+ unsigned long nr_q = __cart_q();
+ unsigned long target = 2*nonresident_total() - T2B(zone->policy.nr_T1);
+
+ __get_cpu_var(cart_nr_q) += dq;
+ nr_q += dq;
+
+ if (nr_q > target) {
+ unsigned long tmp = nr_q - target;
+ __get_cpu_var(cart_nr_q) -= tmp;
+ }
+ }
+}
+
+static void __cart_q_dec(struct zone *zone, unsigned long dq)
+{
+ /* q = max(q - 1, c - |T1|) */
+ unsigned long nr_q = __cart_q();
+ unsigned long target = nonresident_total() - T2B(zone->policy.nr_T1);
+
+ if (nr_q < dq) {
+ __get_cpu_var(cart_nr_q) -= nr_q;
+ nr_q = 0;
+ } else {
+ __get_cpu_var(cart_nr_q) -= dq;
+ nr_q -= dq;
+ }
+
+ if (nr_q < target) {
+ unsigned long tmp = target - nr_q;
+ __get_cpu_var(cart_nr_q) += tmp;
+ }
+}
+
+static inline unsigned long cart_q(void)
+{
+ unsigned long q;
+ preempt_disable();
+ q = __cart_q();
+ preempt_enable();
+ return q;
+}
+
+static inline void __cart_p_inc(struct zone *zone)
+{
+ /* p = min(p + max(1, ns/|B1|), c) */
+ unsigned long ratio;
+ ratio = (zone->policy.nr_shortterm /
+ (B2T(nonresident_count(NR_b1)) + 1)) ?: 1UL;
+ zone->policy.nr_p += ratio;
+ if (unlikely(zone->policy.nr_p > cart_c(zone)))
+ zone->policy.nr_p = cart_c(zone);
+}
+
+static inline void __cart_p_dec(struct zone *zone)
+{
+ /* p = max(p - max(1, nl/|B2|), 0) */
+ unsigned long ratio;
+ ratio = (cart_longterm(zone) /
+ (B2T(nonresident_count(NR_b2)) + 1)) ?: 1UL;
+ if (zone->policy.nr_p >= ratio)
+ zone->policy.nr_p -= ratio;
+ else
+ zone->policy.nr_p = 0UL;
+}
+
+static unsigned long list_count(struct list_head *list, int PG_flag, int result)
+{
+ unsigned long nr = 0;
+ struct page *page;
+ list_for_each_entry(page, list, lru) {
+ if (!!test_bit(PG_flag, &(page)->flags) == result)
+ ++nr;
+ }
+ return nr;
+}
+
+static void __validate_zone(struct zone *zone)
+{
+#if 0
+ int bug = 0;
+ unsigned long cnt0 = list_count(&zone->policy.list_T1, PG_lru, 0);
+ unsigned long cnt1 = list_count(&zone->policy.list_T1, PG_lru, 1);
+ if (cnt1 != zone->policy.nr_T1) {
+ printk(KERN_ERR "__validate_zone: T1: %lu,%lu,%lu\n", cnt0, cnt1, zone->policy.nr_T1);
+ bug = 1;
+ }
+
+ cnt0 = list_count(&zone->policy.list_T2, PG_lru, 0);
+ cnt1 = list_count(&zone->policy.list_T2, PG_lru, 1);
+ if (cnt1 != zone->policy.nr_T2 || bug) {
+ printk(KERN_ERR "__validate_zone: T2: %lu,%lu,%lu\n", cnt0, cnt1, zone->policy.nr_T2);
+ bug = 1;
+ }
+
+ cnt0 = list_count(&zone->policy.list_T1, PG_longterm, 0) +
+ list_count(&zone->policy.list_T2, PG_longterm, 0);
+ cnt1 = list_count(&zone->policy.list_T1, PG_longterm, 1) +
+ list_count(&zone->policy.list_T2, PG_longterm, 1);
+ if (cnt0 != zone->policy.nr_shortterm || bug) {
+ printk(KERN_ERR "__validate_zone: shortterm: %lu,%lu,%lu\n", cnt0, cnt1, zone->policy.nr_shortterm);
+ bug = 1;
+ }
+
+ cnt0 = list_count(&zone->policy.list_T2, PG_longterm, 0);
+ cnt1 = list_count(&zone->policy.list_T2, PG_longterm, 1);
+ if (cnt1 != zone->policy.nr_T2 || bug) {
+ printk(KERN_ERR "__validate_zone: longterm: %lu,%lu,%lu\n", cnt0, cnt1, zone->policy.nr_T2);
+ bug = 1;
+ }
+
+ if (bug) {
+ BUG();
+ }
+#endif
+}
+
+/*
+ * Insert page into @zones CART and update adaptive parameters.
+ *
+ * @zone: target zone.
+ * @page: new page.
+ */
+void __page_replace_add(struct zone *zone, struct page *page)
+{
+ unsigned int rflags;
+
+ /*
+ * Note: we could give hints to the insertion process using the LRU
+ * specific PG_flags like: PG_t1, PG_longterm and PG_referenced.
+ */
+
+ rflags = nonresident_get(page_mapping(page), page_index(page));
+
+ if (rflags & NR_found) {
+ SetPageLongTerm(page);
+ rflags &= NR_listid;
+ if (rflags == NR_b1) {
+ __cart_p_inc(zone);
+ } else if (rflags == NR_b2) {
+ __cart_p_dec(zone);
+ __cart_q_inc(zone, 1);
+ }
+ /* ++cart_longterm(zone); */
+ } else {
+ ClearPageLongTerm(page);
+ ++zone->policy.nr_shortterm;
+ }
+ SetPageT1(page);
+
+ list_add(&page->lru, &zone->policy.list_T1);
+
+ ++zone->policy.nr_T1;
+ BUG_ON(!PageLRU(page));
+
+ __validate_zone(zone);
+}
+
+static DEFINE_PER_CPU(struct pagevec, cart_add_pvecs) = { 0, };
+
+void fastcall page_replace_add(struct page *page)
+{
+ struct pagevec *pvec = &get_cpu_var(cart_add_pvecs);
+
+ page_cache_get(page);
+ if (!pagevec_add(pvec, page))
+ __pagevec_page_replace_add(pvec);
+ put_cpu_var(cart_add_pvecs);
+}
+
+void __page_replace_add_drain(unsigned int cpu)
+{
+ struct pagevec *pvec = &per_cpu(cart_add_pvecs, cpu);
+
+ if (pagevec_count(pvec))
+ __pagevec_page_replace_add(pvec);
+}
+
+#ifdef CONFIG_NUMA
+static void drain_per_cpu(void *dummy)
+{
+ page_replace_add_drain();
+}
+
+/*
+ * Returns 0 for success
+ */
+int page_replace_add_drain_all(void)
+{
+ return schedule_on_each_cpu(drain_per_cpu, NULL);
+}
+
+#else
+
+/*
+ * Returns 0 for success
+ */
+int page_replace_add_drain_all(void)
+{
+ page_replace_add_drain();
+ return 0;
+}
+#endif
+
+#ifdef CONFIG_MIGRATION
+/*
+ * Isolate one page from the LRU lists and put it on the
+ * indicated list with elevated refcount.
+ *
+ * Result:
+ * 0 = page not on LRU list
+ * 1 = page removed from LRU list and added to the specified list.
+ */
+int page_replace_isolate(struct page *page)
+{
+ int ret = 0;
+
+ if (PageLRU(page)) {
+ struct zone *zone = page_zone(page);
+ spin_lock_irq(&zone->lru_lock);
+ if (TestClearPageLRU(page)) {
+ ret = 1;
+ get_page(page);
+
+ if (PageT1(page))
+ --zone->policy.nr_T1;
+ else
+ --zone->policy.nr_T2;
+
+ if (!PageLongTerm(page))
+ --zone->policy.nr_shortterm;
+ }
+ spin_unlock_irq(&zone->lru_lock);
+ }
+
+ return ret;
+}
+#endif
+
+/*
+ * Add page to a release pagevec, temp. drop zone lock to release pagevec if full.
+ *
+ * @zone: @page's zone.
+ * @page: page to be released.
+ * @pvec: pagevec to collect pages in.
+ */
+static inline void __page_release(struct zone *zone, struct page *page,
+ struct pagevec *pvec)
+{
+ if (TestSetPageLRU(page))
+ BUG();
+ if (!PageLongTerm(page))
+ ++zone->policy.nr_shortterm;
+ if (PageT1(page))
+ ++zone->policy.nr_T1;
+ else
+ ++zone->policy.nr_T2;
+
+ if (!pagevec_add(pvec, page)) {
+ spin_unlock_irq(&zone->lru_lock);
+ if (buffer_heads_over_limit)
+ pagevec_strip(pvec);
+ __pagevec_release(pvec);
+ spin_lock_irq(&zone->lru_lock);
+ }
+}
+
+void page_replace_reinsert(struct list_head *page_list)
+{
+ struct page *page, *page2;
+ struct zone *zone = NULL;
+ struct pagevec pvec;
+
+ pagevec_init(&pvec, 1);
+ list_for_each_entry_safe(page, page2, page_list, lru) {
+ struct zone *pagezone = page_zone(page);
+ if (pagezone != zone) {
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ zone = pagezone;
+ spin_lock_irq(&zone->lru_lock);
+ }
+ if (PageT1(page))
+ list_move(&page->lru, &zone->policy.list_T1);
+ else
+ list_move(&page->lru, &zone->policy.list_T2);
+
+ __page_release(zone, page, &pvec);
+ }
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ pagevec_release(&pvec);
+}
+
+/*
+ * zone->lru_lock is heavily contended. Some of the functions that
+ * shrink the lists perform better by taking out a batch of pages
+ * and working on them outside the LRU lock.
+ *
+ * For pagecache intensive workloads, this function is the hottest
+ * spot in the kernel (apart from copy_*_user functions).
+ *
+ * Appropriate locks must be held before calling this function.
+ *
+ * @nr_to_scan: The number of pages to look through on the list.
+ * @src: The LRU list to pull pages off.
+ * @dst: The temp list to put pages on to.
+ * @scanned: The number of pages that were scanned.
+ *
+ * returns how many pages were moved onto *@dst.
+ */
+static int isolate_pages(struct zone *zone, int nr_to_scan,
+ struct list_head *src,
+ struct list_head *dst, int *scanned)
+{
+ int nr_taken = 0;
+ struct page *page;
+ int scan = 0;
+
+ while (scan++ < nr_to_scan && !list_empty(src)) {
+ page = lru_to_page(src);
+ prefetchw_prev_lru_page(page, src, flags);
+
+ if (!TestClearPageLRU(page))
+ BUG();
+ list_del(&page->lru);
+ if (get_page_testone(page)) {
+ /*
+ * It is being freed elsewhere
+ */
+ __put_page(page);
+ SetPageLRU(page);
+ list_add(&page->lru, src);
+ continue;
+ } else {
+ list_add(&page->lru, dst);
+ nr_taken++;
+ if (!PageLongTerm(page))
+ --zone->policy.nr_shortterm;
+ }
+ }
+
+ zone->pages_scanned += scan;
+ if (src == &zone->policy.list_T1)
+ zone->policy.nr_T1 -= nr_taken;
+ else
+ zone->policy.nr_T2 -= nr_taken;
+
+ *scanned = scan;
+ return nr_taken;
+}
+
+static inline int cart_reclaim_T1(struct zone *zone)
+{
+ int t1 = zone->policy.nr_T1 > zone->policy.nr_p;
+ int sat = TestClearZoneSaturated(zone);
+ int rec = ZoneReclaimedT1(zone);
+
+ if (t1) {
+ if (sat && rec)
+ return 0;
+ return 1;
+ }
+
+ if (sat && !rec)
+ return 1;
+ return 0;
+}
+
+
+void page_replace_candidates(struct zone *zone, int nr_to_scan,
+ struct list_head *page_list)
+{
+ int nr_scan;
+ int nr_taken;
+ struct list_head *list;
+
+ page_replace_add_drain();
+ spin_lock_irq(&zone->lru_lock);
+
+ if (cart_reclaim_T1(zone)) {
+ list = &zone->policy.list_T1;
+ SetZoneReclaimedT1(zone);
+ } else {
+ list = &zone->policy.list_T2;
+ ClearZoneReclaimedT1(zone);
+ }
+
+ nr_taken = isolate_pages(zone, nr_to_scan, list, page_list,
+ &nr_scan);
+
+ if (!nr_taken) {
+ if (list == &zone->policy.list_T1) {
+ list = &zone->policy.list_T2;
+ ClearZoneReclaimedT1(zone);
+ } else {
+ list = &zone->policy.list_T1;
+ SetZoneReclaimedT1(zone);
+ }
+
+ nr_taken = isolate_pages(zone, nr_to_scan, list,
+ page_list, &nr_scan);
+ }
+ spin_unlock(&zone->lru_lock);
+ if (current_is_kswapd())
+ __mod_page_state_zone(zone, pgscan_kswapd, nr_scan);
+ else
+ __mod_page_state_zone(zone, pgscan_direct, nr_scan);
+ local_irq_enable();
+}
+
+void page_replace_reinsert_zone(struct zone *zone, struct list_head *page_list, int nr_freed)
+{
+ struct pagevec pvec;
+ unsigned long dqi = 0;
+ unsigned long dqd = 0;
+ unsigned long dsl = 0;
+ unsigned long target;
+
+ pagevec_init(&pvec, 1);
+ spin_lock_irq(&zone->lru_lock);
+
+ target = min(zone->policy.nr_p + 1UL, B2T(nonresident_count(NR_b1)));
+
+ while (!list_empty(page_list)) {
+ struct page * page = lru_to_page(page_list);
+ prefetchw_prev_lru_page(page, page_list, flags);
+
+ if (PageT1(page)) { /* T1 */
+ if (TestClearPageReferenced(page)) {
+ if (!PageLongTerm(page) &&
+ (zone->policy.nr_T1 - dqd + dqi) >= target) {
+ SetPageLongTerm(page);
+ ++dsl;
+ }
+ list_move(&page->lru, &zone->policy.list_T1);
+ } else if (PageLongTerm(page)) {
+ ClearPageT1(page);
+ ++dqd;
+ list_move(&page->lru, &zone->policy.list_T2);
+ } else {
+ /* should have been reclaimed or was PG_new */
+ list_move(&page->lru, &zone->policy.list_T1);
+ }
+ } else { /* T2 */
+ if (TestClearPageReferenced(page)) {
+ SetPageT1(page);
+ ++dqi;
+ list_move(&page->lru, &zone->policy.list_T1);
+ } else {
+ /* should have been reclaimed */
+ list_move(&page->lru, &zone->policy.list_T2);
+ }
+ }
+ __page_release(zone, page, &pvec);
+ }
+
+ if (!nr_freed)
+ SetZoneSaturated(zone);
+
+ if (dqi > dqd)
+ __cart_q_inc(zone, dqi - dqd);
+ else
+ __cart_q_dec(zone, dqd - dqi);
+
+ spin_unlock_irq(&zone->lru_lock);
+ pagevec_release(&pvec);
+}
+
+void __page_replace_rotate_reclaimable(struct zone *zone, struct page *page)
+{
+ if (PageLRU(page)) {
+ if (PageLongTerm(page)) {
+ if (TestClearPageT1(page)) {
+ --zone->policy.nr_T1;
+ ++zone->policy.nr_T2;
+ __cart_q_dec(zone, 1);
+ }
+ list_move_tail(&page->lru, &zone->policy.list_T2);
+ } else {
+ if (!PageT1(page))
+ BUG();
+ list_move_tail(&page->lru, &zone->policy.list_T1);
+ }
+ }
+}
+
+void page_replace_remember(struct zone *zone, struct page *page)
+{
+ int target_list = PageT1(page) ? NR_b1 : NR_b2;
+ int evict_list = (nonresident_count(NR_b1) > cart_q())
+ ? NR_b1 : NR_b2;
+
+ nonresident_put(page_mapping(page), page_index(page),
+ target_list, evict_list);
+}
+
+void page_replace_forget(struct address_space *mapping, unsigned long index)
+{
+ nonresident_get(mapping, index);
+}
+
+#define K(x) ((x) << (PAGE_SHIFT-10))
+
+void page_replace_show(struct zone *zone)
+{
+ printk("%s"
+ " free:%lukB"
+ " min:%lukB"
+ " low:%lukB"
+ " high:%lukB"
+ " T1:%lukB"
+ " T2:%lukB"
+ " shortterm:%lukB"
+ " present:%lukB"
+ " pages_scanned:%lu"
+ " all_unreclaimable? %s"
+ "\n",
+ zone->name,
+ K(zone->free_pages),
+ K(zone->pages_min),
+ K(zone->pages_low),
+ K(zone->pages_high),
+ K(zone->policy.nr_T1),
+ K(zone->policy.nr_T2),
+ K(zone->policy.nr_shortterm),
+ K(zone->present_pages),
+ zone->pages_scanned,
+ (zone->all_unreclaimable ? "yes" : "no")
+ );
+}
+
+void page_replace_zoneinfo(struct zone *zone, struct seq_file *m)
+{
+ seq_printf(m,
+ "\n pages free %lu"
+ "\n min %lu"
+ "\n low %lu"
+ "\n high %lu"
+ "\n T1 %lu"
+ "\n T2 %lu"
+ "\n shortterm %lu"
+ "\n p %lu"
+ "\n flags %lu"
+ "\n scanned %lu"
+ "\n spanned %lu"
+ "\n present %lu",
+ zone->free_pages,
+ zone->pages_min,
+ zone->pages_low,
+ zone->pages_high,
+ zone->policy.nr_T1,
+ zone->policy.nr_T2,
+ zone->policy.nr_shortterm,
+ zone->policy.nr_p,
+ zone->policy.flags,
+ zone->pages_scanned,
+ zone->spanned_pages,
+ zone->present_pages);
+}
+
+void __page_replace_counts(unsigned long *active, unsigned long *inactive,
+ unsigned long *free, struct pglist_data *pgdat)
+{
+ struct zone *zones = pgdat->node_zones;
+ int i;
+
+ *active = 0;
+ *inactive = 0;
+ *free = 0;
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ *active += zones[i].policy.nr_T1 + zones[i].policy.nr_T2 -
+ zones[i].policy.nr_shortterm;
+ *inactive += zones[i].policy.nr_shortterm;
+ *free += zones[i].free_pages;
+ }
+}
Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -112,10 +112,12 @@ extern int page_replace_isolate(struct p
static inline int page_replace_isolate(struct page *p) { return -ENOSYS; }
#endif

-#ifdef CONFIG_MM_POLICY_USEONCE
+#if defined CONFIG_MM_POLICY_USEONCE
#include <linux/mm_use_once_policy.h>
-#elif CONFIG_MM_POLICY_CLOCKPRO
+#elif defined CONFIG_MM_POLICY_CLOCKPRO
#include <linux/mm_clockpro_policy.h>
+#elif defined CONFIG_MM_POLICY_CART
+#include <linux/mm_cart_policy.h>
#else
#error no mm policy
#endif
Index: linux-2.6-git/include/linux/mm_page_replace_data.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace_data.h
+++ linux-2.6-git/include/linux/mm_page_replace_data.h
@@ -3,10 +3,12 @@

#ifdef __KERNEL__

-#ifdef CONFIG_MM_POLICY_USEONCE
+#if defined CONFIG_MM_POLICY_USEONCE
#include <linux/mm_use_once_data.h>
-#elif CONFIG_MM_POLICY_CLOCKPRO
+#elif defined CONFIG_MM_POLICY_CLOCKPRO
#include <linux/mm_clockpro_data.h>
+#elif defined CONFIG_MM_POLICY_CART
+#include <linux/mm_cart_data.h>
#else
#error no mm policy
#endif
Index: linux-2.6-git/mm/Kconfig
===================================================================
--- linux-2.6-git.orig/mm/Kconfig
+++ linux-2.6-git/mm/Kconfig
@@ -147,6 +147,11 @@ config MM_POLICY_CLOCKPRO
help
This option selects a CLOCK-Pro based policy

+config MM_POLICY_CART
+ bool "CART"
+ help
+ This option selects a CART based policy
+
endchoice

#
Index: linux-2.6-git/mm/Makefile
===================================================================
--- linux-2.6-git.orig/mm/Makefile
+++ linux-2.6-git/mm/Makefile
@@ -14,6 +14,7 @@ obj-y := bootmem.o filemap.o mempool.o

obj-$(CONFIG_MM_POLICY_USEONCE) += useonce.o
obj-$(CONFIG_MM_POLICY_CLOCKPRO) += nonresident.o clockpro.o
+obj-$(CONFIG_MM_POLICY_CART) += nonresident-cart.o cart.o

obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
Index: linux-2.6-git/include/linux/mm_cart_data.h
===================================================================
--- /dev/null
+++ linux-2.6-git/include/linux/mm_cart_data.h
@@ -0,0 +1,31 @@
+#ifndef _LINUX_CART_DATA_H_
+#define _LINUX_CART_DATA_H_
+
+#ifdef __KERNEL__
+
+#include <asm/bitops.h>
+
+struct page_replace_data {
+ struct list_head list_T1;
+ struct list_head list_T2;
+ unsigned long nr_scan;
+ unsigned long nr_T1;
+ unsigned long nr_T2;
+ unsigned long nr_shortterm;
+ unsigned long nr_p;
+ unsigned long flags;
+};
+
+#define CART_RECLAIMED_T1 0
+#define CART_SATURATED 1
+
+#define ZoneReclaimedT1(z) test_bit(CART_RECLAIMED_T1, &((z)->policy.flags))
+#define SetZoneReclaimedT1(z) __set_bit(CART_RECLAIMED_T1, &((z)->policy.flags))
+#define ClearZoneReclaimedT1(z) __clear_bit(CART_RECLAIMED_T1, &((z)->policy.flags))
+
+#define ZoneSaturated(z) test_bit(CART_SATURATED, &((z)->policy.flags))
+#define SetZoneSaturated(z) __set_bit(CART_SATURATED, &((z)->policy.flags))
+#define TestClearZoneSaturated(z) __test_and_clear_bit(CART_SATURATED, &((z)->policy.flags))
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_CART_DATA_H_ */
Index: linux-2.6-git/include/linux/mm_cart_policy.h
===================================================================
--- /dev/null
+++ linux-2.6-git/include/linux/mm_cart_policy.h
@@ -0,0 +1,134 @@
+#ifndef _LINUX_MM_CART_POLICY_H
+#define _LINUX_MM_CART_POLICY_H
+
+#ifdef __KERNEL__
+
+#include <linux/rmap.h>
+#include <linux/page-flags.h>
+
+#define PG_t1 PG_reclaim1
+#define PG_longterm PG_reclaim2
+#define PG_new PG_reclaim3
+
+#define PageT1(page) test_bit(PG_t1, &(page)->flags)
+#define SetPageT1(page) set_bit(PG_t1, &(page)->flags)
+#define ClearPageT1(page) clear_bit(PG_t1, &(page)->flags)
+#define TestClearPageT1(page) test_and_clear_bit(PG_t1, &(page)->flags)
+#define TestSetPageT1(page) test_and_set_bit(PG_t1, &(page)->flags)
+
+#define PageLongTerm(page) test_bit(PG_longterm, &(page)->flags)
+#define SetPageLongTerm(page) set_bit(PG_longterm, &(page)->flags)
+#define TestSetPageLongTerm(page) test_and_set_bit(PG_longterm, &(page)->flags)
+#define ClearPageLongTerm(page) clear_bit(PG_longterm, &(page)->flags)
+#define TestClearPageLongTerm(page) test_and_clear_bit(PG_longterm, &(page)->flags)
+
+#define PageNew(page) test_bit(PG_new, &(page)->flags)
+#define SetPageNew(page) set_bit(PG_new, &(page)->flags)
+#define TestSetPageNew(page) test_and_set_bit(PG_new, &(page)->flags)
+#define ClearPageNew(page) clear_bit(PG_new, &(page)->flags)
+#define TestClearPageNew(page) test_and_clear_bit(PG_new, &(page)->flags)
+
+static inline void page_replace_hint_active(struct page *page)
+{
+}
+
+static inline void page_replace_hint_use_once(struct page *page)
+{
+ if (PageLRU(page))
+ BUG();
+ SetPageNew(page);
+}
+
+extern void __page_replace_add(struct zone *, struct page *);
+
+static inline void page_replace_copy_state(struct page *dpage, struct page *spage)
+{
+ if (PageT1(spage))
+ SetPageT1(dpage);
+ if (PageLongTerm(spage))
+ SetPageLongTerm(dpage);
+ if (PageNew(spage))
+ SetPageNew(dpage);
+}
+
+static inline void page_replace_clear_state(struct page *page)
+{
+ if (PageT1(page))
+ ClearPageT1(page);
+ if (PageLongTerm(page))
+ ClearPageLongTerm(page);
+ if (PageNew(page))
+ ClearPageNew(page);
+}
+
+static inline int page_replace_is_active(struct page *page)
+{
+ return PageLongTerm(page);
+}
+
+static inline void page_replace_remove(struct zone *zone, struct page *page)
+{
+ list_del(&page->lru);
+ if (PageT1(page))
+ --zone->policy.nr_T1;
+ else
+ --zone->policy.nr_T2;
+
+ if (!PageLongTerm(page))
+ --zone->policy.nr_shortterm;
+
+ page_replace_clear_state(page);
+}
+
+static inline int page_replace_reclaimable(struct page *page)
+{
+ if (page_referenced(page, 1, 0))
+ return RECLAIM_ACTIVATE;
+
+ if (PageNew(page))
+ ClearPageNew(page);
+
+ if ((PageT1(page) && PageLongTerm(page)) ||
+ (!PageT1(page) && !PageLongTerm(page)))
+ return RECLAIM_KEEP;
+
+ return RECLAIM_OK;
+}
+
+static inline int fastcall page_replace_activate(struct page *page)
+{
+ /* just set PG_referenced, handle the rest in
+ * page_replace_reinsert()
+ */
+ if (!TestClearPageNew(page)) {
+ SetPageReferenced(page);
+ return 1;
+ }
+
+ return 0;
+}
+
+extern void __page_replace_rotate_reclaimable(struct zone *, struct page *);
+
+static inline void page_replace_mark_accessed(struct page *page)
+{
+ SetPageReferenced(page);
+}
+
+#define MM_POLICY_HAS_NONRESIDENT
+
+extern void page_replace_remember(struct zone *, struct page *);
+extern void page_replace_forget(struct address_space *, unsigned long);
+
+static inline unsigned long __page_replace_nr_pages(struct zone *zone)
+{
+ return zone->policy.nr_T1 + zone->policy.nr_T2;
+}
+
+static inline unsigned long __page_replace_nr_scan(struct zone *zone)
+{
+ return zone->policy.nr_T1 + zone->policy.nr_T2;
+}
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_MM_CART_POLICY_H_ */
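
For anyone wanting to poke at the adaptation logic outside the kernel, here is a
minimal userspace sketch of the p-target update done by __cart_p_inc() and
__cart_p_dec() above. The struct and the numbers are purely illustrative and not
part of the patch.

/* toy model of the CART T1-target adaptation; illustrative only */
#include <stdio.h>

struct cart_model {
	unsigned long p;		/* target size of T1 */
	unsigned long c;		/* cache size */
	unsigned long nr_shortterm;	/* ns: short-term resident pages */
	unsigned long nr_longterm;	/* nl: long-term resident pages */
	unsigned long b1, b2;		/* sizes of the history lists B1/B2 */
};

/* hit in B1: p = min(p + max(1, ns/|B1|), c) */
static void cart_p_inc(struct cart_model *m)
{
	unsigned long ratio = m->nr_shortterm / (m->b1 + 1);
	if (!ratio)
		ratio = 1;
	m->p += ratio;
	if (m->p > m->c)
		m->p = m->c;
}

/* hit in B2: p = max(p - max(1, nl/|B2|), 0) */
static void cart_p_dec(struct cart_model *m)
{
	unsigned long ratio = m->nr_longterm / (m->b2 + 1);
	if (!ratio)
		ratio = 1;
	m->p = (m->p >= ratio) ? m->p - ratio : 0;
}

int main(void)
{
	struct cart_model m = { .p = 0, .c = 1000, .nr_shortterm = 600,
				.nr_longterm = 400, .b1 = 50, .b2 = 150 };

	cart_p_inc(&m);		/* refault hit B1: grow the T1 target */
	cart_p_dec(&m);		/* refault hit B2: shrink the T1 target */
	printf("p = %lu\n", m.p);
	return 0;
}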

2006-03-22 22:38:42

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 34/34] mm: random.patch


From: Marcelo Tosatti <[email protected]>

Random page replacement.

Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>

---

include/linux/mm_page_replace.h | 2
include/linux/mm_page_replace_data.h | 2
include/linux/mm_random_data.h | 9 +
include/linux/mm_random_policy.h | 47 +++++
mm/Kconfig | 5
mm/Makefile | 1
mm/random_policy.c | 292 +++++++++++++++++++++++++++++++++++
7 files changed, 358 insertions(+)

Index: linux-2.6-git/mm/random_policy.c
===================================================================
--- /dev/null
+++ linux-2.6-git/mm/random_policy.c
@@ -0,0 +1,292 @@
+
+/* Random page replacement policy */
+
+#include <linux/module.h>
+#include <linux/mm_page_replace.h>
+#include <linux/swap.h>
+#include <linux/pagevec.h>
+#include <linux/init.h>
+#include <linux/rmap.h>
+#include <linux/hash.h>
+#include <linux/seq_file.h>
+#include <linux/writeback.h>
+#include <linux/buffer_head.h> /* for try_to_release_page(),
+ buffer_heads_over_limit */
+#include <asm/sections.h>
+
+void __init page_replace_init(void)
+{
+ printk(KERN_ERR "Random page replacement policy init!\n");
+}
+
+void __init page_replace_init_zone(struct zone *zone)
+{
+ zone->policy.nr_pages = 0;
+}
+
+static DEFINE_PER_CPU(struct pagevec, add_pvecs) = { 0, };
+
+void __page_replace_add(struct zone *zone, struct page *page)
+{
+ zone->policy.nr_pages++;
+}
+
+void fastcall page_replace_add(struct page *page)
+{
+ struct pagevec *pvec = &get_cpu_var(add_pvecs);
+
+ page_cache_get(page);
+ if (!pagevec_add(pvec, page))
+ __pagevec_page_replace_add(pvec);
+ put_cpu_var(add_pvecs);
+}
+
+void __page_replace_add_drain(unsigned int cpu)
+{
+ struct pagevec *pvec = &per_cpu(add_pvecs, cpu);
+
+ if (pagevec_count(pvec))
+ __pagevec_page_replace_add(pvec);
+}
+
+#ifdef CONFIG_NUMA
+static void drain_per_cpu(void *dummy)
+{
+ page_replace_add_drain();
+}
+
+/*
+ * Returns 0 for success
+ */
+int page_replace_add_drain_all(void)
+{
+ return schedule_on_each_cpu(drain_per_cpu, NULL);
+}
+
+#else
+
+/*
+ * Returns 0 for success
+ */
+int page_replace_add_drain_all(void)
+{
+ page_replace_add_drain();
+ return 0;
+}
+#endif
+
+#ifdef CONFIG_MIGRATION
+/*
+ * Isolate one page from the LRU lists and put it on the
+ * indicated list with elevated refcount.
+ *
+ * Result:
+ * 0 = page not on LRU list
+ * 1 = page removed from LRU list and added to the specified list.
+ */
+int page_replace_isolate(struct page *page)
+{
+ int ret = 0;
+
+ if (PageLRU(page)) {
+ struct zone *zone = page_zone(page);
+ spin_lock_irq(&zone->lru_lock);
+ if (TestClearPageLRU(page)) {
+ ret = 1;
+ get_page(page);
+ --zone->policy.nr_pages;
+ }
+ spin_unlock_irq(&zone->lru_lock);
+ }
+
+ return ret;
+}
+#endif
+
+static inline void __page_release(struct zone *zone, struct page *page,
+ struct pagevec *pvec)
+{
+ if (TestSetPageLRU(page))
+ BUG();
+ ++zone->policy.nr_pages;
+
+ if (!pagevec_add(pvec, page)) {
+ spin_unlock_irq(&zone->lru_lock);
+ if (buffer_heads_over_limit)
+ pagevec_strip(pvec);
+ __pagevec_release(pvec);
+ spin_lock_irq(&zone->lru_lock);
+ }
+}
+
+void page_replace_reinsert(struct list_head *page_list)
+{
+ struct page *page, *page2;
+ struct zone *zone = NULL;
+ struct pagevec pvec;
+
+ pagevec_init(&pvec, 1);
+ list_for_each_entry_safe(page, page2, page_list, lru) {
+ struct zone *pagezone = page_zone(page);
+ if (pagezone != zone) {
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ zone = pagezone;
+ spin_lock_irq(&zone->lru_lock);
+ }
+ __page_release(zone, page, &pvec);
+ }
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ pagevec_release(&pvec);
+}
+
+/*
+ * Lehmer simple linear congruential PRNG
+ *
+ * Xn+1 = (a * Xn + c) mod m
+ *
+ * where a, c and m are constants.
+ *
+ * Note that "m" is zone->present_pages, so in this case its
+ * really not constant.
+ */
+
+static unsigned long get_random(struct zone *zone)
+{
+ zone->policy.seed =
+ hash_long(zone->policy.seed, BITS_PER_LONG) + 3147484177UL;
+ return zone->policy.seed;
+}
+
+static struct page *pick_random_cache_page(struct zone *zone)
+{
+ struct page *page;
+ unsigned long pfn;
+ do {
+ pfn = zone->zone_start_pfn +
+ get_random(zone) % zone->present_pages;
+ page = pfn_to_page(pfn);
+ } while (!PageLRU(page));
+ zone->policy.seed ^= page_index(page);
+ return page;
+}
+
+static int isolate_pages(struct zone *zone, int nr_to_scan,
+ struct list_head *pages, int *nr_scanned)
+{
+ int nr_taken = 0;
+ struct page *page;
+ int scan = 0;
+
+ while (scan++ < nr_to_scan && zone->policy.nr_pages) {
+ page = pick_random_cache_page(zone);
+
+ if (!TestClearPageLRU(page))
+ BUG();
+ if (get_page_testone(page)) {
+ /*
+ * It is being freed elsewhere
+ */
+ __put_page(page);
+ SetPageLRU(page);
+ continue;
+ } else {
+ list_add(&page->lru, pages);
+ nr_taken++;
+ --zone->policy.nr_pages;
+ }
+ }
+ zone->pages_scanned += scan;
+ *nr_scanned = scan;
+ return nr_taken;
+}
+
+void page_replace_candidates(struct zone *zone, int nr_to_scan, struct list_head *pages)
+{
+ int nr_scan;
+ page_replace_add_drain();
+ spin_lock_irq(&zone->lru_lock);
+ isolate_pages(zone, nr_to_scan, pages, &nr_scan);
+ spin_unlock(&zone->lru_lock);
+ if (current_is_kswapd())
+ __mod_page_state_zone(zone, pgscan_kswapd, nr_scan);
+ else
+ __mod_page_state_zone(zone, pgscan_direct, nr_scan);
+ local_irq_enable();
+}
+
+void page_replace_reinsert_zone(struct zone *zone, struct list_head *pages, int nr_freed)
+{
+ struct pagevec pvec;
+ pagevec_init(&pvec, 1);
+ spin_lock_irq(&zone->lru_lock);
+ while (!list_empty(pages)) {
+ struct page *page = lru_to_page(pages);
+ list_del(&page->lru);
+ __page_release(zone, page, &pvec);
+ }
+ spin_unlock_irq(&zone->lru_lock);
+ pagevec_release(&pvec);
+}
+
+#define K(x) ((x) << (PAGE_SHIFT-10))
+
+void page_replace_show(struct zone *zone)
+{
+ printk("%s"
+ " free:%lukB"
+ " min:%lukB"
+ " low:%lukB"
+ " high:%lukB"
+ " cached:%lukB"
+ " present:%lukB"
+ " pages_scanned:%lu"
+ " all_unreclaimable? %s"
+ "\n",
+ zone->name,
+ K(zone->free_pages),
+ K(zone->pages_min),
+ K(zone->pages_low),
+ K(zone->pages_high),
+ K(zone->policy.nr_pages),
+ K(zone->present_pages),
+ zone->pages_scanned,
+ (zone->all_unreclaimable ? "yes" : "no")
+ );
+}
+
+void page_replace_zoneinfo(struct zone *zone, struct seq_file *m)
+{
+ seq_printf(m,
+ "\n pages free %lu"
+ "\n min %lu"
+ "\n low %lu"
+ "\n high %lu"
+ "\n cached %lu"
+ "\n scanned %lu"
+ "\n spanned %lu"
+ "\n present %lu",
+ zone->free_pages,
+ zone->pages_min,
+ zone->pages_low,
+ zone->pages_high,
+ zone->policy.nr_pages,
+ zone->pages_scanned,
+ zone->spanned_pages,
+ zone->present_pages);
+}
+
+void __page_replace_counts(unsigned long *active, unsigned long *inactive,
+ unsigned long *free, struct pglist_data *pgdat)
+{
+ struct zone *zones = pgdat->node_zones;
+ int i;
+
+ *active = 0;
+ *inactive = 0;
+ *free = 0;
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ *free += zones[i].free_pages;
+ }
+}
Index: linux-2.6-git/mm/Kconfig
===================================================================
--- linux-2.6-git.orig/mm/Kconfig
+++ linux-2.6-git/mm/Kconfig
@@ -158,6 +158,11 @@ config MM_POLICY_CART_R
This option selects a CART based policy modified to handle cyclic
access patterns.

+config MM_POLICY_RANDOM
+ bool "Random"
+ help
+ This option selects the random replacement policy.
+
endchoice

#
Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -118,6 +118,8 @@ static inline int page_replace_isolate(s
#include <linux/mm_clockpro_policy.h>
#elif defined CONFIG_MM_POLICY_CART || defined CONFIG_MM_POLICY_CART_R
#include <linux/mm_cart_policy.h>
+#elif defined CONFIG_MM_POLICY_RANDOM
+#include <linux/mm_random_policy.h>
#else
#error no mm policy
#endif
Index: linux-2.6-git/include/linux/mm_page_replace_data.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace_data.h
+++ linux-2.6-git/include/linux/mm_page_replace_data.h
@@ -9,6 +9,8 @@
#include <linux/mm_clockpro_data.h>
#elif defined CONFIG_MM_POLICY_CART || defined CONFIG_MM_POLICY_CART_R
#include <linux/mm_cart_data.h>
+#elif defined CONFIG_MM_POLICY_RANDOM
+#include <linux/mm_random_data.h>
#else
#error no mm policy
#endif
Index: linux-2.6-git/include/linux/mm_random_policy.h
===================================================================
--- /dev/null
+++ linux-2.6-git/include/linux/mm_random_policy.h
@@ -0,0 +1,47 @@
+#ifndef _LINUX_MM_RANDOM_POLICY_H
+#define _LINUX_MM_RANDOM_POLICY_H
+
+#ifdef __KERNEL__
+
+#include <linux/page-flags.h>
+
+
+#define page_replace_hint_active(p) do { } while (0)
+#define page_replace_hint_use_once(p) do { } while (0)
+
+extern void __page_replace_add(struct zone *, struct page *);
+
+#define page_replace_activate(p) 0
+#define page_replace_reclaimable(p) RECLAIM_OK
+#define page_replace_mark_accessed(p) do { } while (0)
+
+static inline
+void page_replace_remove(struct zone *zone, struct page *page)
+{
+ zone->policy.nr_pages--;
+}
+
+static inline
+void __page_replace_rotate_reclaimable(struct zone *zone, struct page *page)
+{
+}
+
+#define page_replace_copy_state(d, s) do { } while (0)
+#define page_replace_clear_state(p) do { } while (0)
+#define page_replace_is_active(p) 0
+
+#define page_replace_remember(z, p) do { } while (0)
+#define page_replace_forget(m, i) do { } while (0)
+
+static inline unsigned long __page_replace_nr_pages(struct zone *zone)
+{
+ return zone->policy.nr_pages;
+}
+
+static inline unsigned long __page_replace_nr_scan(struct zone *zone)
+{
+ return zone->policy.nr_pages;
+}
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_MM_RANDOM_POLICY_H */
Index: linux-2.6-git/include/linux/mm_random_data.h
===================================================================
--- /dev/null
+++ linux-2.6-git/include/linux/mm_random_data.h
@@ -0,0 +1,9 @@
+#ifdef __KERNEL__
+
+struct page_replace_data {
+ unsigned long nr_scan;
+ unsigned long nr_pages;
+ unsigned long seed;
+};
+
+#endif /* __KERNEL__ */
Index: linux-2.6-git/mm/Makefile
===================================================================
--- linux-2.6-git.orig/mm/Makefile
+++ linux-2.6-git/mm/Makefile
@@ -16,6 +16,7 @@ obj-$(CONFIG_MM_POLICY_USEONCE) += useon
obj-$(CONFIG_MM_POLICY_CLOCKPRO) += nonresident.o clockpro.o
obj-$(CONFIG_MM_POLICY_CART) += nonresident-cart.o cart.o
obj-$(CONFIG_MM_POLICY_CART_R) += nonresident-cart.o cart.o
+obj-$(CONFIG_MM_POLICY_RANDOM) += random_policy.o

obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o
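
For clarity, the victim selection above boils down to: advance the per-zone seed
with a hash step, reduce it modulo the number of frames, and retry until a frame
that is on the LRU is hit. A rough userspace sketch follows; the names, the guard
against spinning and the constants are illustrative and not the kernel code.

/* toy model of pick_random_cache_page(); illustrative only */
#include <stdio.h>

#define NR_FRAMES 1024

static unsigned char frame_on_lru[NR_FRAMES];	/* stand-in for PageLRU() */
static unsigned long seed = 12345;

static unsigned long get_random(void)
{
	/* crude mixing step standing in for hash_long() */
	seed = seed * 2654435761UL + 3147484177UL;
	return seed;
}

static long pick_random_lru_frame(void)
{
	unsigned long pfn;
	unsigned int tries = 0;

	do {
		pfn = get_random() % NR_FRAMES;
		if (++tries > 10 * NR_FRAMES)
			return -1;	/* toy guard; the kernel just keeps looking */
	} while (!frame_on_lru[pfn]);

	return (long)pfn;
}

int main(void)
{
	int i;

	for (i = 0; i < NR_FRAMES; i += 3)	/* mark some frames as on the LRU */
		frame_on_lru[i] = 1;

	printf("victim frame: %ld\n", pick_random_lru_frame());
	return 0;
}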

2006-03-22 22:37:07

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 28/34] mm: clockpro-PG_reclaim2.patch


From: Peter Zijlstra <[email protected]>

Add a second PG_flag to the page reclaim framework.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/page-flags.h | 2 ++
mm/hugetlb.c | 4 ++--
mm/page_alloc.c | 3 +++
3 files changed, 7 insertions(+), 2 deletions(-)

Index: linux-2.6/include/linux/page-flags.h
===================================================================
--- linux-2.6.orig/include/linux/page-flags.h 2006-03-13 20:38:26.000000000 +0100
+++ linux-2.6/include/linux/page-flags.h 2006-03-13 20:45:31.000000000 +0100
@@ -76,6 +76,8 @@
#define PG_nosave_free 18 /* Free, should not be written */
#define PG_uncached 19 /* Page has been mapped as uncached */

+#define PG_reclaim2 20 /* reserved by the mm reclaim code */
+
/*
* Global page accounting. One instance per CPU. Only unsigned longs are
* allowed.
Index: linux-2.6/mm/page_alloc.c
===================================================================
--- linux-2.6.orig/mm/page_alloc.c 2006-03-13 20:38:26.000000000 +0100
+++ linux-2.6/mm/page_alloc.c 2006-03-13 20:45:31.000000000 +0100
@@ -150,6 +150,7 @@ static void bad_page(struct page *page)
1 << PG_private |
1 << PG_locked |
1 << PG_reclaim1 |
+ 1 << PG_reclaim2 |
1 << PG_dirty |
1 << PG_reclaim |
1 << PG_slab |
@@ -361,6 +362,7 @@ static inline int free_pages_check(struc
1 << PG_private |
1 << PG_locked |
1 << PG_reclaim1 |
+ 1 << PG_reclaim2 |
1 << PG_reclaim |
1 << PG_slab |
1 << PG_swapcache |
@@ -518,6 +520,7 @@ static int prep_new_page(struct page *pa
1 << PG_private |
1 << PG_locked |
1 << PG_reclaim1 |
+ 1 << PG_reclaim2 |
1 << PG_dirty |
1 << PG_reclaim |
1 << PG_slab |
Index: linux-2.6/mm/hugetlb.c
===================================================================
--- linux-2.6.orig/mm/hugetlb.c 2006-03-13 20:38:26.000000000 +0100
+++ linux-2.6/mm/hugetlb.c 2006-03-13 20:45:31.000000000 +0100
@@ -152,8 +152,8 @@ static void update_and_free_page(struct
nr_huge_pages_node[page_zone(page)->zone_pgdat->node_id]--;
for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++) {
page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
- 1 << PG_dirty | 1 << PG_reclaim1 | 1 << PG_reserved |
- 1 << PG_private | 1<< PG_writeback);
+ 1 << PG_dirty | 1 << PG_reclaim1 | 1 << PG_reclaim2 |
+ 1 << PG_reserved | 1 << PG_private | 1<< PG_writeback);
set_page_count(&page[i], 0);
}
set_page_count(page, 1);
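
The new bit has no meaning on its own; the replacement policies later in the
series alias the reserved reclaim bits onto their own flags, for example:

/* include/linux/mm_clockpro_policy.h */
#define PG_hot PG_reclaim1
#define PG_test PG_reclaim2

/* include/linux/mm_cart_policy.h */
#define PG_t1 PG_reclaim1
#define PG_longterm PG_reclaim2
#define PG_new PG_reclaim3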

2006-03-22 22:39:52

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 24/34] mm: sum_cpu_var.patch


From: Peter Zijlstra <[email protected]>

A per-cpu summation helper, heavily used by the additional policies.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/percpu.h | 5 +++++
1 file changed, 5 insertions(+)

Index: linux-2.6/include/linux/percpu.h
===================================================================
--- linux-2.6.orig/include/linux/percpu.h 2006-03-13 20:38:20.000000000 +0100
+++ linux-2.6/include/linux/percpu.h 2006-03-13 20:45:24.000000000 +0100
@@ -15,6 +15,11 @@
#define get_cpu_var(var) (*({ preempt_disable(); &__get_cpu_var(var); }))
#define put_cpu_var(var) preempt_enable()

+#define __sum_cpu_var(type, var) ({ __typeof__(type) sum = 0; \
+ int cpu; \
+ for_each_cpu(cpu) sum += per_cpu(var, cpu); \
+ sum; })
+
#ifdef CONFIG_SMP

struct percpu_data {
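
Typical usage in the later policy code (mm/clockpro.c in patch 29) is to keep a
counter per cpu and fold it on read:

static DEFINE_PER_CPU(unsigned long, nonres_count) = 0;

static inline unsigned long __nonres_count(void)
{
	return __sum_cpu_var(unsigned long, nonres_count);
}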

2006-03-22 22:37:05

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 29/34] mm: clockpro-clockpro.patch


From: Peter Zijlstra <[email protected]>

This patch implements an approximation to the CLOCKPro page replace
algorithm presented in:
http://www.cs.wm.edu/hpcs/WWW/HTML/publications/abs05-3.html

<insert rant on coolness and some numbers that prove it/>

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_clockpro_data.h | 21
include/linux/mm_clockpro_policy.h | 143 +++++
include/linux/mm_page_replace.h | 2
include/linux/mm_page_replace_data.h | 2
mm/Kconfig | 5
mm/Makefile | 1
mm/clockpro.c | 855 +++++++++++++++++++++++++++++++++++
7 files changed, 1029 insertions(+)
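
Before the diff, a compact sketch of the per-page state machine the approximation
uses; this is a simplified userspace model, not kernel code - see
page_replace_activate() and rotate_hot() below for the real transitions.

/* toy model: (cold,!test) --ref--> (cold,test) --ref--> (hot) */
#include <stdio.h>

enum { COLD = 0, COLD_TEST, HOT };

static int activate(int state)		/* page got referenced */
{
	switch (state) {
	case COLD:	return COLD_TEST;	/* start a test period */
	case COLD_TEST:	return HOT;		/* reuse distance small enough */
	default:	return HOT;		/* already hot, stays hot */
	}
}

int main(void)
{
	int s = COLD;

	s = activate(s);	/* first reference: enter test period */
	s = activate(s);	/* second reference within the test period */
	printf("state = %d (2 == hot)\n", s);
	return 0;
}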

Index: linux-2.6-git/mm/clockpro.c
===================================================================
--- /dev/null
+++ linux-2.6-git/mm/clockpro.c
@@ -0,0 +1,855 @@
+/*
+ * mm/clockpro.c
+ *
+ * Written by Peter Zijlstra <[email protected]>
+ * Released under the GPLv2, see the file COPYING for details.
+ *
+ * This file implements an approximation to the CLOCKPro page replace
+ * algorithm presented in:
+ * http://www.cs.wm.edu/hpcs/WWW/HTML/publications/abs05-3.html
+ *
+ * ===> The Algorithm <===
+ *
+ * This algorithm strives to separate the pages with a small reuse distance
+ * from those with a large reuse distance. Pages with a small reuse distance
+ * are called hot pages and are not available for reclaim. Cold pages are those
+ * that have a large reuse distance. In order to track the reuse distance a
+ * test period is started when a reference is detected. When another reference
+ * is detected during this test period the page has a small enough reuse
+ * distance to be classified as hot.
+ *
+ * The test period is terminated when the page would get a larger reuse
+ * distance than the current largest hot page. This is directly coupled to the
+ * cold page target - the target number of cold pages. More cold pages
+ * mean fewer hot pages and hence the test period will be shorter.
+ *
+ * The cold page target is adjusted when a test period expires (dec) or when
+ * a page is referenced during its test period (inc).
+ *
+ * If we faulted in a nonresident page that is still in the test period, the
+ * inter-reference distance of that page is by definition smaller than that of
+ * the coldest page on the hot list. Meaning the hot list contains pages that
+ * are colder than at least one page that got evicted from memory, and the hot
+ * list should be smaller - conversely, the cold list should be larger.
+ *
+ * Since it is very likely that pages that are about to be evicted are still in
+ * their test period, their state has to be kept around until it expires, or
+ * the total number of pages tracked is twice the total of resident pages.
+ *
+ * The data structure used is a single CLOCK with three hands: Hcold, Hhot and
+ * Htest. The dynamics are as follows: Hcold is rotated to look for unreferenced cold
+ * pages - those can be evicted. When Hcold encounters a referenced page it
+ * either starts a test period or promotes the page to hot if it already was in
+ * its test period. Then if there are fewer cold pages left than targeted, Hhot
+ * is rotated which will demote unreferenced hot pages. Hhot also terminates
+ * the test period of all cold pages it encounters. Then if after all this
+ * there are more nonresident pages tracked than there are resident pages,
+ * Htest will be rotated. Htest terminates all test periods it encounters,
+ * thereby removing nonresident pages. (Htest is pushed by Hhot - Hcold moves
+ * independently)
+ *
+ *  res | h/c | tst | ref ||  Hcold |   Hhot |  Htest || Flt
+ * -----+-----+-----+-----++--------+--------+--------++-----
+ *   1  |  1  |  0  |  1  || = 1101 |   1100 | = 1101 ||
+ *   1  |  1  |  0  |  0  || = 1100 |   1000 | = 1100 ||
+ * -----+-----+-----+-----++--------+--------+--------++-----
+ *   1  |  0  |  1  |  1  ||   1100 |   1001 |   1001 ||
+ *   1  |  0  |  1  |  0  || N 0010 |   1000 |   1000 ||
+ *   1  |  0  |  0  |  1  ||   1010 | = 1001 | = 1001 ||
+ *   1  |  0  |  0  |  0  || X 0000 | = 1000 | = 1000 ||
+ * -----+-----+-----+-----++--------+--------+--------++-----
+ * -----+-----+-----+-----++--------+--------+--------++-----
+ *   0  |  0  |  1  |  1  ||        |        |        || 1100
+ *   0  |  0  |  1  |  0  || = 0010 | X 0000 | X 0000 ||
+ *   0  |  0  |  0  |  1  ||        |        |        || 1010
+ *
+ * The table gives the state transitions for each hand, '=' denotes no change,
+ * 'N' denotes becomes nonresident and 'X' denotes removal.
+ *
+ * (XXX: mention LIRS hot/cold page swapping which makes for the relocation on
+ * promotion/demotion)
+ *
+ * ===> The Approximation <===
+ *
+ * h/c -> PageHot()
+ * tst -> PageTest()
+ * ref -> page_referenced()
+ *
+ * Because pages can be evicted from one zone and paged back into another,
+ * nonresident page tracking needs to be inter-zone whereas resident page
+ * tracking is per definition per zone. Hence the resident and nonresident
+ * page tracking needs to be separated.
+ *
+ * This is accomplished by using two CLOCKs instead of one. One two handed
+ * CLOCK for the resident pages, and one single handed CLOCK for the
+ * nonresident pages. These CLOCKs are then coupled so that one can be seen
+ * as an overlay on the other - thereby approximating the relative order of
+ * the pages.
+ *
+ * The resident CLOCK has, as mentioned, two hands, one is Hcold (it does not
+ * affect nonresident pages) and the other is the resident part of Hhot.
+ *
+ * The nonresident CLOCK's single hand will be the nonresident part of Hhot.
+ * Htest is replaced by limiting the size of the nonresident CLOCK.
+ *
+ * The Hhot parts are coupled so that when all resident Hhot have made a full
+ * revolution so will the nonresident Hhot.
+ *
+ * (XXX: mention use-once, the two list/single list duality)
+ * TODO: numa
+ *
+ * All functions that are prefixed with '__' assume that zone->lru_lock is taken.
+ */
+
+#include <linux/mm_page_replace.h>
+#include <linux/rmap.h>
+#include <linux/buffer_head.h>
+#include <linux/pagevec.h>
+#include <linux/bootmem.h>
+#include <linux/init.h>
+#include <linux/swap.h>
+#include <linux/module.h>
+#include <linux/percpu.h>
+#include <linux/writeback.h>
+
+#include <asm/div64.h>
+
+#include <linux/nonresident.h>
+
+/* The nonresident code can be seen as a single handed clock that
+ * lacks the ability to remove tail pages. However it can report the
+ * distance to the head.
+ *
+ * What is done is to set a threshold that cuts off the clock tail.
+ */
+static DEFINE_PER_CPU(unsigned long, nonres_cutoff) = 0;
+
+/* Keep track of the number of nonresident pages tracked.
+ * This is used to scale the hand hot vs nonres hand rotation.
+ */
+static DEFINE_PER_CPU(unsigned long, nonres_count) = 0;
+
+static inline unsigned long __nonres_cutoff(void)
+{
+ return __sum_cpu_var(unsigned long, nonres_cutoff);
+}
+
+static inline unsigned long __nonres_count(void)
+{
+ return __sum_cpu_var(unsigned long, nonres_count);
+}
+
+static inline unsigned long __nonres_threshold(void)
+{
+ unsigned long cutoff = __nonres_cutoff() / 2;
+ unsigned long count = __nonres_count();
+
+ if (cutoff > count)
+ return 0;
+
+ return count - cutoff;
+}
+
+static void __nonres_cutoff_inc(unsigned long dt)
+{
+ unsigned long count = __nonres_count() * 2;
+ unsigned long cutoff = __nonres_cutoff();
+ if (cutoff < count - dt)
+ __get_cpu_var(nonres_cutoff) += dt;
+ else
+ __get_cpu_var(nonres_cutoff) += count - cutoff;
+}
+
+static void __nonres_cutoff_dec(unsigned long dt)
+{
+ unsigned long cutoff = __nonres_cutoff();
+ if (cutoff > dt)
+ __get_cpu_var(nonres_cutoff) -= dt;
+ else
+ __get_cpu_var(nonres_cutoff) -= cutoff;
+}
+
+static int nonres_get(struct address_space *mapping, unsigned long index)
+{
+ int found = 0;
+ unsigned long distance = nonresident_get(mapping, index);
+ if (distance != ~0UL) { /* valid page */
+ --__get_cpu_var(nonres_count);
+
+ /* If the distance is below the threshold the test
+ * period is still valid. Otherwise a tail page
+ * was found and we can decrease the cutoff.
+ *
+ * Even if not found the hole introduced by the removal
+ * of the cookie increases the avg. distance by 1/2.
+ *
+ * NOTE: the cold target was adjusted when the threshold
+ * was decreased.
+ */
+ found = distance < __nonres_cutoff();
+ __nonres_cutoff_dec(1 + !!found);
+ }
+
+ return found;
+}
+
+static int nonres_put(struct address_space *mapping, unsigned long index)
+{
+ if (nonresident_put(mapping, index)) {
+ /* nonresident clock eats tail due to limited
+ * size; hand test equivalent.
+ */
+ __nonres_cutoff_dec(2);
+ return 1;
+ }
+
+ ++__get_cpu_var(nonres_count);
+ return 0;
+}
+
+static inline void nonres_rotate(unsigned long nr)
+{
+ __nonres_cutoff_inc(nr * 2);
+}
+
+static inline unsigned long nonres_count(void)
+{
+ return __nonres_threshold();
+}
+
+void __init page_replace_init(void)
+{
+ nonresident_init();
+}
+
+/* Called to initialize the clockpro parameters */
+void __init page_replace_init_zone(struct zone *zone)
+{
+ INIT_LIST_HEAD(&zone->policy.list_hand[0]);
+ INIT_LIST_HEAD(&zone->policy.list_hand[1]);
+ zone->policy.nr_resident = 0;
+ zone->policy.nr_cold = 0;
+ zone->policy.nr_cold_target = 2*zone->pages_high;
+ zone->policy.nr_nonresident_scale = 0;
+}
+
+/*
+ * Increase the cold pages target; limit it to the total number of resident
+ * pages present in the current zone.
+ *
+ * @zone: current zone
+ * @dct: intended increase
+ */
+static void __cold_target_inc(struct zone *zone, unsigned long dct)
+{
+ if (zone->policy.nr_cold_target < zone->policy.nr_resident - dct)
+ zone->policy.nr_cold_target += dct;
+ else
+ zone->policy.nr_cold_target = zone->policy.nr_resident;
+}
+
+/*
+ * Decrease the cold pages target; limit it to the high watermark in order
+ * to always have some pages available for quick reclaim.
+ *
+ * @zone: current zone
+ * @dct: intended decrease
+ */
+static void __cold_target_dec(struct zone *zone, unsigned long dct)
+{
+ if (zone->policy.nr_cold_target > (2*zone->pages_high) + dct)
+ zone->policy.nr_cold_target -= dct;
+ else
+ zone->policy.nr_cold_target = (2*zone->pages_high);
+}
+
+/*
+ * Instead of a single CLOCK with two hands, two lists are used.
+ * When the two lists are laid head to tail two junction points
+ * appear, these points are the hand positions.
+ *
+ * This approach has the advantage that there is no pointer magic
+ * associated with the hands. It is impossible to remove the page
+ * a hand is pointing to.
+ *
+ * To allow the hands to lap each other the lists are swappable; eg.
+ * when the hands point to the same position, one of the lists has to
+ * be empty - however it does not matter which list is. Hence we make
+ * sure that the hand we are going to work on contains the pages.
+ */
+static inline
+void __select_list_hand(struct zone *zone, struct list_head *list)
+{
+ if (list_empty(list)) {
+ LIST_HEAD(tmp);
+ list_splice_init(&zone->policy.list_hand[0], &tmp);
+ list_splice_init(&zone->policy.list_hand[1],
+ &zone->policy.list_hand[0]);
+ list_splice(&tmp, &zone->policy.list_hand[1]);
+ }
+}
+
+static DEFINE_PER_CPU(struct pagevec, clockpro_add_pvecs) = { 0, };
+
+/*
+ * Insert page into @zone's clock and update adaptive parameters.
+ *
+ * Several page flags are used for insertion hints:
+ * PG_test - use the use-once logic
+ *
+ * For now we will ignore the active hint; the use once logic is
+ * explained below.
+ *
+ * @zone: target zone.
+ * @page: new page.
+ */
+void __page_replace_add(struct zone *zone, struct page *page)
+{
+ int found = 0;
+ struct address_space *mapping = page_mapping(page);
+ int hand = HAND_HOT;
+
+ if (mapping)
+ found = nonres_get(mapping, page_index(page));
+
+#if 0
+ /* prefill the hot list */
+ if (zone->free_pages > zone->policy.nr_cold_target) {
+ SetPageHot(page);
+ hand = HAND_COLD;
+ } else
+#endif
+ /* abuse the PG_test flag for pagecache use-once */
+ if (PageTest(page)) {
+ /*
+ * Use-Once insert; we want to avoid activation on the first
+ * reference (which we know will come).
+ *
+ * This is accomplished by inserting the page one state lower
+ * than usual so the activation that does come ups it to the
+ * normal insert state. Also we insert right behind Hhot so
+ * 1) Hhot cannot interfere; and 2) we lose the first reference
+ * quicker.
+ *
+ * Insert (cold,test)/(cold) so the following activation will
+ * elevate the state to (hot)/(cold,test). (NOTE: the activation
+ * will take care of the cold target increment).
+ */
+ if (!found)
+ ClearPageTest(page);
+ ++zone->policy.nr_cold;
+ hand = HAND_COLD;
+ } else {
+ /*
+ * Insert (hot) when found in the nonresident list, otherwise
+ * insert as (cold,test). Insert at the head of the Hhot list,
+ * ie. right behind Hcold.
+ */
+ if (found) {
+ SetPageHot(page);
+ __cold_target_inc(zone, 1);
+ } else {
+ SetPageTest(page);
+ ++zone->policy.nr_cold;
+ }
+ }
+ ++zone->policy.nr_resident;
+ list_add(&page->lru, &zone->policy.list_hand[hand]);
+
+ BUG_ON(!PageLRU(page));
+}
+
+void fastcall page_replace_add(struct page *page)
+{
+ struct pagevec *pvec = &get_cpu_var(clockpro_add_pvecs);
+
+ page_cache_get(page);
+ if (!pagevec_add(pvec, page))
+ __pagevec_page_replace_add(pvec);
+ put_cpu_var(clockpro_add_pvecs);
+}
+
+void __page_replace_add_drain(unsigned int cpu)
+{
+ struct pagevec *pvec = &per_cpu(clockpro_add_pvecs, cpu);
+
+ if (pagevec_count(pvec))
+ __pagevec_page_replace_add(pvec);
+}
+
+#ifdef CONFIG_NUMA
+static void drain_per_cpu(void *dummy)
+{
+ page_replace_add_drain();
+}
+
+/*
+ * Returns 0 for success
+ */
+int page_replace_add_drain_all(void)
+{
+ return schedule_on_each_cpu(drain_per_cpu, NULL);
+}
+
+#else
+
+/*
+ * Returns 0 for success
+ */
+int page_replace_add_drain_all(void)
+{
+ page_replace_add_drain();
+ return 0;
+}
+#endif
+
+#ifdef CONFIG_MIGRATION
+/*
+ * Isolate one page from the LRU lists and put it on the
+ * indicated list with elevated refcount.
+ *
+ * Result:
+ * 0 = page not on LRU list
+ * 1 = page removed from LRU list and added to the specified list.
+ */
+int page_replace_isolate(struct page *page)
+{
+ int ret = 0;
+
+ if (PageLRU(page)) {
+ struct zone *zone = page_zone(page);
+ spin_lock_irq(&zone->lru_lock);
+ if (TestClearPageLRU(page)) {
+ ret = 1;
+ get_page(page);
+ --zone->policy.nr_resident;
+ if (!PageHot(page))
+ --zone->policy.nr_cold;
+ }
+ spin_unlock_irq(&zone->lru_lock);
+ }
+
+ return ret;
+}
+#endif
+
+/*
+ * zone->lru_lock is heavily contended. Some of the functions that
+ * shrink the lists perform better by taking out a batch of pages
+ * and working on them outside the LRU lock.
+ *
+ * For pagecache intensive workloads, this function is the hottest
+ * spot in the kernel (apart from copy_*_user functions).
+ *
+ * Appropriate locks must be held before calling this function.
+ *
+ * @nr_to_scan: The number of pages to look through on the list.
+ * @src: The LRU list to pull pages off.
+ * @dst: The temp list to put pages on to.
+ * @scanned: The number of pages that were scanned.
+ *
+ * returns how many pages were moved onto *@dst.
+ */
+static int isolate_pages(struct zone *zone, int nr_to_scan,
+ struct list_head *src,
+ struct list_head *dst, int *scanned)
+{
+ int nr_taken = 0;
+ struct page *page;
+ int scan = 0;
+
+ __select_list_hand(zone, src);
+ while (scan++ < nr_to_scan && !list_empty(src)) {
+ page = lru_to_page(src);
+ prefetchw_prev_lru_page(page, src, flags);
+
+ if (!TestClearPageLRU(page))
+ BUG();
+ list_del(&page->lru);
+ if (get_page_testone(page)) {
+ /*
+ * It is being freed elsewhere
+ */
+ __put_page(page);
+ SetPageLRU(page);
+ list_add(&page->lru, src);
+ continue;
+ } else {
+ list_add(&page->lru, dst);
+ nr_taken++;
+ if (!PageHot(page))
+ --zone->policy.nr_cold;
+ }
+ }
+ zone->policy.nr_resident -= nr_taken;
+ zone->pages_scanned += scan;
+
+ *scanned = scan;
+ return nr_taken;
+}
+
+/*
+ * Add page to a release pagevec, temp. drop zone lock to release pagevec if full.
+ * Set PG_lru, update zone->policy.nr_cold and zone->policy.nr_resident.
+ *
+ * @zone: @page's zone.
+ * @page: page to be released.
+ * @pvec: pagevec to collect pages in.
+ */
+static void __page_release(struct zone *zone, struct page *page,
+ struct pagevec *pvec)
+{
+ if (TestSetPageLRU(page))
+ BUG();
+ if (!PageHot(page))
+ ++zone->policy.nr_cold;
+ ++zone->policy.nr_resident;
+
+ if (!pagevec_add(pvec, page)) {
+ spin_unlock_irq(&zone->lru_lock);
+ if (buffer_heads_over_limit)
+ pagevec_strip(pvec);
+ __pagevec_release(pvec);
+ spin_lock_irq(&zone->lru_lock);
+ }
+}
+
+void page_replace_reinsert(struct list_head *page_list)
+{
+ struct page *page, *page2;
+ struct zone *zone = NULL;
+ struct pagevec pvec;
+
+ pagevec_init(&pvec, 1);
+ list_for_each_entry_safe(page, page2, page_list, lru) {
+ struct zone *pagezone = page_zone(page);
+ if (pagezone != zone) {
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ zone = pagezone;
+ spin_lock_irq(&zone->lru_lock);
+ }
+ /* XXX: maybe discriminate between hot and cold pages? */
+ list_move(&page->lru, &zone->policy.list_hand[HAND_HOT]);
+ __page_release(zone, page, &pvec);
+ }
+ if (zone)
+ spin_unlock_irq(&zone->lru_lock);
+ pagevec_release(&pvec);
+}
+
+/*
+ * Try to reclaim a specified number of pages.
+ *
+ * Reclaim candidates have:
+ * - PG_lru cleared
+ * - 1 extra ref
+ *
+ * NOTE: hot pages are also returned but will be spat back by try_pageout();
+ * this preserves CLOCK order.
+ *
+ * @zone: target zone to reclaim pages from.
+ * @nr_to_scan: nr of pages to try for reclaim.
+ *
+ * returns candidate list.
+ */
+void page_replace_candidates(struct zone *zone, int nr_to_scan,
+ struct list_head *page_list)
+{
+ int nr_scan, nr_total_scan = 0;
+ int nr_taken;
+
+ page_replace_add_drain();
+ spin_lock_irq(&zone->lru_lock);
+
+ do {
+ nr_taken = isolate_pages(zone, nr_to_scan,
+ &zone->policy.list_hand[HAND_COLD],
+ page_list, &nr_scan);
+ nr_to_scan -= nr_scan;
+ nr_total_scan += nr_scan;
+ } while (nr_to_scan > 0 && nr_taken);
+
+ spin_unlock(&zone->lru_lock);
+ if (current_is_kswapd())
+ __mod_page_state_zone(zone, pgscan_kswapd, nr_total_scan);
+ else
+ __mod_page_state_zone(zone, pgscan_direct, nr_total_scan);
+ local_irq_enable();
+}
+
+static void rotate_hot(struct zone *, int, int, struct pagevec *);
+
+/*
+ * Reinsert those candidate pages that were not freed in shrink_list().
+ * Account pages that were promoted to hot by page_replace_activate().
+ * Rotate hand hot to balance the new hot and lost cold pages vs.
+ * the cold pages target.
+ *
+ * Candidate pages have:
+ * - PG_lru cleared
+ * - 1 extra ref
+ * undo that.
+ *
+ * @zone: zone we're working on.
+ * @page_list: the left over pages.
+ * @nr_freed: number of pages freed by shrink_list()
+ */
+void page_replace_reinsert_zone(struct zone *zone, struct list_head *page_list, int nr_freed)
+{
+ struct pagevec pvec;
+ unsigned long dct = 0;
+
+ pagevec_init(&pvec, 1);
+ spin_lock_irq(&zone->lru_lock);
+ while (!list_empty(page_list)) {
+ int hand = HAND_HOT;
+ struct page *page = lru_to_page(page_list);
+ prefetchw_prev_lru_page(page, page_list, flags);
+
+ if (PageHot(page) && PageTest(page)) {
+ ClearPageTest(page);
+ ++dct;
+ hand = HAND_COLD; /* relocate promoted pages */
+ }
+
+ list_move(&page->lru, &zone->policy.list_hand[hand]);
+ __page_release(zone, page, &pvec);
+ }
+ __cold_target_inc(zone, dct);
+ spin_unlock_irq(&zone->lru_lock);
+
+ /*
+ * Limit the hot hand to half a revolution.
+ */
+ if (zone->policy.nr_cold < zone->policy.nr_cold_target) {
+ int i, nr = 1 + (zone->policy.nr_resident / (2*SWAP_CLUSTER_MAX));
+ int reclaim_mapped = 0; /* should_reclaim_mapped(zone); */
+ for (i = 0; zone->policy.nr_cold < zone->policy.nr_cold_target &&
+ i < nr; ++i)
+ rotate_hot(zone, SWAP_CLUSTER_MAX, reclaim_mapped, &pvec);
+ }
+
+ pagevec_release(&pvec);
+}
+
+/*
+ * Puts cold pages that have their test bit set on the non-resident lists.
+ *
+ * @zone: dead pages zone.
+ * @page: dead page.
+ */
+void page_replace_remember(struct zone *zone, struct page *page)
+{
+ if (PageTest(page) &&
+ nonres_put(page_mapping(page), page_index(page)))
+ __cold_target_dec(zone, 1);
+}
+
+void page_replace_forget(struct address_space *mapping, unsigned long index)
+{
+ nonres_get(mapping, index);
+}
+
+static unsigned long estimate_pageable_memory(void)
+{
+#if 0
+ static unsigned long next_check;
+ static unsigned long total = 0;
+
+ if (!total || time_after(jiffies, next_check)) {
+ struct zone *z;
+ total = 0;
+ for_each_zone(z)
+ total += z->nr_resident;
+ next_check = jiffies + HZ/10;
+ }
+
+ // gave 0 first time, SIGFPE in kernel sucks
+ // hence the !total
+#else
+ unsigned long total = 0;
+ struct zone *z;
+ for_each_zone(z)
+ total += z->policy.nr_resident;
+#endif
+ return total;
+}
+
+/*
+ * Rotate the non-resident hand; scale the rotation speed so that when all
+ * hot hands have made one full revolution the non-resident hand will have
+ * too.
+ *
+ * @zone: current zone
+ * @dh: number of pages the hot hand has moved
+ */
+static void __nonres_term(struct zone *zone, unsigned long dh)
+{
+ unsigned long long cycles;
+ unsigned long nr_count = nonres_count();
+
+ /*
+ * Nhot = |n1| * Rhot / |r1| ~ |N| * Rhot / |R|
+ *
+ * NOTE depends on |N|, hence include the nonresident_del patch
+ */
+ cycles = zone->policy.nr_nonresident_scale + 1ULL * dh * nr_count;
+ zone->policy.nr_nonresident_scale =
+ do_div(cycles, estimate_pageable_memory() + 1UL);
+ nonres_rotate(cycles);
+ __cold_target_dec(zone, cycles);
+}
+
+/*
+ * Rotate hand hot;
+ *
+ * @zone: current zone
+ * @nr_to_scan: batch quanta
+ * @reclaim_mapped: whether to demote mapped pages too
+ * @pvec: release pagevec
+ */
+static void rotate_hot(struct zone *zone, int nr_to_scan, int reclaim_mapped,
+ struct pagevec *pvec)
+{
+ LIST_HEAD(l_hold);
+ LIST_HEAD(l_tmp);
+ unsigned long dh = 0, dct = 0;
+ int pgscanned;
+ int pgdeactivate = 0;
+ int nr_taken;
+
+ spin_lock_irq(&zone->lru_lock);
+ nr_taken = isolate_pages(zone, nr_to_scan,
+ &zone->policy.list_hand[HAND_HOT],
+ &l_hold, &pgscanned);
+ spin_unlock_irq(&zone->lru_lock);
+
+ while (!list_empty(&l_hold)) {
+ struct page *page = lru_to_page(&l_hold);
+ prefetchw_prev_lru_page(page, &l_hold, flags);
+
+ if (PageHot(page)) {
+ BUG_ON(PageTest(page));
+
+ /*
+ * Ignore the swap token; this is not actual reclaim
+ * and it will give a better reflection of the actual
+ * hotness of pages.
+ *
+ * XXX do something with this reclaim_mapped stuff.
+ */
+ if (/*(((reclaim_mapped && mapped) || !mapped) ||
+ (total_swap_pages == 0 && PageAnon(page))) && */
+ !page_referenced(page, 0, 1)) {
+ SetPageTest(page);
+ ++pgdeactivate;
+ }
+
+ ++dh;
+ } else {
+ if (TestClearPageTest(page))
+ ++dct;
+ }
+ list_move(&page->lru, &l_tmp);
+
+ cond_resched();
+ }
+
+ spin_lock_irq(&zone->lru_lock);
+ while (!list_empty(&l_tmp)) {
+ int hand = HAND_COLD;
+ struct page *page = lru_to_page(&l_tmp);
+ prefetchw_prev_lru_page(page, &l_tmp, flags);
+
+ if (PageHot(page) && PageTest(page)) {
+ ClearPageHot(page);
+ ClearPageTest(page);
+ hand = HAND_HOT; /* relocate demoted page */
+ }
+
+ list_move(&page->lru, &zone->policy.list_hand[hand]);
+ __page_release(zone, page, pvec);
+ }
+ __nonres_term(zone, nr_taken);
+ __cold_target_dec(zone, dct);
+ spin_unlock(&zone->lru_lock);
+
+ __mod_page_state_zone(zone, pgrefill, pgscanned);
+ __mod_page_state(pgdeactivate, pgdeactivate);
+
+ local_irq_enable();
+}
+
+#define K(x) ((x) << (PAGE_SHIFT-10))
+
+void page_replace_show(struct zone *zone)
+{
+ printk("%s"
+ " free:%lukB"
+ " min:%lukB"
+ " low:%lukB"
+ " high:%lukB"
+ " resident:%lukB"
+ " cold:%lukB"
+ " present:%lukB"
+ " pages_scanned:%lu"
+ " all_unreclaimable? %s"
+ "\n",
+ zone->name,
+ K(zone->free_pages),
+ K(zone->pages_min),
+ K(zone->pages_low),
+ K(zone->pages_high),
+ K(zone->policy.nr_resident),
+ K(zone->policy.nr_cold),
+ K(zone->present_pages),
+ zone->pages_scanned,
+ (zone->all_unreclaimable ? "yes" : "no")
+ );
+}
+
+void page_replace_zoneinfo(struct zone *zone, struct seq_file *m)
+{
+ seq_printf(m,
+ "\n pages free %lu"
+ "\n min %lu"
+ "\n low %lu"
+ "\n high %lu"
+ "\n resident %lu"
+ "\n cold %lu"
+ "\n cold_tar %lu"
+ "\n nr_count %lu"
+ "\n scanned %lu"
+ "\n spanned %lu"
+ "\n present %lu",
+ zone->free_pages,
+ zone->pages_min,
+ zone->pages_low,
+ zone->pages_high,
+ zone->policy.nr_resident,
+ zone->policy.nr_cold,
+ zone->policy.nr_cold_target,
+ nonres_count(),
+ zone->pages_scanned,
+ zone->spanned_pages,
+ zone->present_pages);
+}
+
+void __page_replace_counts(unsigned long *active, unsigned long *inactive,
+ unsigned long *free, struct pglist_data *pgdat)
+{
+ struct zone *zones = pgdat->node_zones;
+ int i;
+
+ *active = 0;
+ *inactive = 0;
+ *free = 0;
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ *active += zones[i].policy.nr_resident - zones[i].policy.nr_cold;
+ *inactive += zones[i].policy.nr_cold;
+ *free += zones[i].free_pages;
+ }
+}
Index: linux-2.6-git/include/linux/mm_clockpro_data.h
===================================================================
--- /dev/null
+++ linux-2.6-git/include/linux/mm_clockpro_data.h
@@ -0,0 +1,21 @@
+#ifndef _LINUX_CLOCKPRO_DATA_H_
+#define _LINUX_CLOCKPRO_DATA_H_
+
+#ifdef __KERNEL__
+
+enum {
+ HAND_HOT = 0,
+ HAND_COLD = 1
+};
+
+struct page_replace_data {
+ struct list_head list_hand[2];
+ unsigned long nr_scan;
+ unsigned long nr_resident;
+ unsigned long nr_cold;
+ unsigned long nr_cold_target;
+ unsigned long nr_nonresident_scale;
+};
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_CLOCKPRO_DATA_H_ */
Index: linux-2.6-git/include/linux/mm_clockpro_policy.h
===================================================================
--- /dev/null
+++ linux-2.6-git/include/linux/mm_clockpro_policy.h
@@ -0,0 +1,143 @@
+#ifndef _LINUX_MM_CLOCKPRO_POLICY_H
+#define _LINUX_MM_CLOCKPRO_POLICY_H
+
+#ifdef __KERNEL__
+
+#include <linux/rmap.h>
+#include <linux/page-flags.h>
+
+#define PG_hot PG_reclaim1
+#define PG_test PG_reclaim2
+
+#define PageHot(page) test_bit(PG_hot, &(page)->flags)
+#define SetPageHot(page) set_bit(PG_hot, &(page)->flags)
+#define ClearPageHot(page) clear_bit(PG_hot, &(page)->flags)
+#define TestClearPageHot(page) test_and_clear_bit(PG_hot, &(page)->flags)
+#define TestSetPageHot(page) test_and_set_bit(PG_hot, &(page)->flags)
+
+#define PageTest(page) test_bit(PG_test, &(page)->flags)
+#define SetPageTest(page) set_bit(PG_test, &(page)->flags)
+#define ClearPageTest(page) clear_bit(PG_test, &(page)->flags)
+#define TestClearPageTest(page) test_and_clear_bit(PG_test, &(page)->flags)
+
+static inline void page_replace_hint_active(struct page *page)
+{
+}
+
+static inline void page_replace_hint_use_once(struct page *page)
+{
+ if (PageLRU(page))
+ BUG();
+ if (PageHot(page))
+ BUG();
+ SetPageTest(page);
+}
+
+extern void __page_replace_add(struct zone *, struct page *);
+
+/*
+ * Activate a cold page:
+ * cold, !test -> cold, test
+ * cold, test -> hot
+ *
+ * @page: page to activate
+ */
+static inline int fastcall page_replace_activate(struct page *page)
+{
+ int hot, test;
+
+ hot = PageHot(page);
+ test = PageTest(page);
+
+ if (hot) {
+ BUG_ON(test);
+ } else {
+ if (test) {
+ SetPageHot(page);
+ /*
+ * Leave PG_test set for new hot pages in order to
+ * recognise them in reinsert() and do accounting.
+ */
+ return 1;
+ } else {
+ SetPageTest(page);
+ }
+ }
+
+ return 0;
+}
+
+static inline void page_replace_copy_state(struct page *dpage, struct page *spage)
+{
+ if (PageHot(spage))
+ SetPageHot(dpage);
+ if (PageTest(spage))
+ SetPageTest(dpage);
+}
+
+static inline void page_replace_clear_state(struct page *page)
+{
+ if (PageHot(page))
+ ClearPageHot(page);
+ if (PageTest(page))
+ ClearPageTest(page);
+}
+
+static inline int page_replace_is_active(struct page *page)
+{
+ return PageHot(page);
+}
+
+static inline void page_replace_remove(struct zone *zone, struct page *page)
+{
+ list_del(&page->lru);
+ --zone->policy.nr_resident;
+ if (!PageHot(page))
+ --zone->policy.nr_cold;
+ else
+ ClearPageHot(page);
+
+ page_replace_clear_state(page);
+}
+
+static inline reclaim_t page_replace_reclaimable(struct page *page)
+{
+ if (PageHot(page))
+ return RECLAIM_KEEP;
+
+ if (page_referenced(page, 1, 0))
+ return RECLAIM_ACTIVATE;
+
+ return RECLAIM_OK;
+}
+
+static inline void __page_replace_rotate_reclaimable(struct zone *zone, struct page *page)
+{
+ if (PageLRU(page) && !PageHot(page)) {
+ list_move_tail(&page->lru, &zone->policy.list_hand[HAND_COLD]);
+ inc_page_state(pgrotated);
+ }
+}
+
+static inline void page_replace_mark_accessed(struct page *page)
+{
+ SetPageReferenced(page);
+}
+
+#define MM_POLICY_HAS_NONRESIDENT
+
+extern void page_replace_remember(struct zone *, struct page *);
+extern void page_replace_forget(struct address_space *, unsigned long);
+
+static inline unsigned long __page_replace_nr_pages(struct zone *zone)
+{
+ return zone->policy.nr_resident;
+}
+
+static inline unsigned long __page_replace_nr_scan(struct zone *zone)
+{
+ return zone->policy.nr_resident;
+}
+
+#endif /* __KERNEL__ */
+#endif /* _LINUX_MM_CLOCKPRO_POLICY_H */
Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -114,6 +114,8 @@ static inline int page_replace_isolate(s

#ifdef CONFIG_MM_POLICY_USEONCE
#include <linux/mm_use_once_policy.h>
+#elif CONFIG_MM_POLICY_CLOCKPRO
+#include <linux/mm_clockpro_policy.h>
#else
#error no mm policy
#endif
Index: linux-2.6-git/include/linux/mm_page_replace_data.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace_data.h
+++ linux-2.6-git/include/linux/mm_page_replace_data.h
@@ -5,6 +5,8 @@

#ifdef CONFIG_MM_POLICY_USEONCE
#include <linux/mm_use_once_data.h>
+#elif CONFIG_MM_POLICY_CLOCKPRO
+#include <linux/mm_clockpro_data.h>
#else
#error no mm policy
#endif
Index: linux-2.6-git/mm/Kconfig
===================================================================
--- linux-2.6-git.orig/mm/Kconfig
+++ linux-2.6-git/mm/Kconfig
@@ -142,6 +142,11 @@ config MM_POLICY_USEONCE
help
This option selects the standard multi-queue LRU policy.

+config MM_POLICY_CLOCKPRO
+ bool "CLOCK-Pro"
+ help
+ This option selects a CLOCK-Pro based policy
+
endchoice

#
Index: linux-2.6-git/mm/Makefile
===================================================================
--- linux-2.6-git.orig/mm/Makefile
+++ linux-2.6-git/mm/Makefile
@@ -13,6 +13,7 @@ obj-y := bootmem.o filemap.o mempool.o
prio_tree.o util.o $(mmu-y)

obj-$(CONFIG_MM_POLICY_USEONCE) += useonce.o
+obj-$(CONFIG_MM_POLICY_CLOCKPRO) += nonresident.o clockpro.o

obj-$(CONFIG_SWAP) += page_io.o swap_state.o swapfile.o thrash.o
obj-$(CONFIG_HUGETLBFS) += hugetlb.o

2006-03-22 22:36:14

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 21/34] mm: page-replace-nonresident.patch


From: Peter Zijlstra <[email protected]>

Add hooks for nonresident page tracking.
The policy has to define MM_POLICY_HAS_NONRESIDENT when it makes
use of these.

API:
void page_replace_remember(struct zone *, struct page *);

Remember a page - insert it into the nonresident page tracking.

void page_replace_forget(struct address_space *, unsigned long);

Forget about a page - remove it from the nonresident page tracking.
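
For illustration only (this is not part of the patch): a policy that
defines MM_POLICY_HAS_NONRESIDENT could back these hooks with the
nonresident hash added elsewhere in this series (patch 26), roughly as
below; a real policy would also use @zone for its own bookkeeping:

	#define MM_POLICY_HAS_NONRESIDENT

	/* called just before the page leaves the page/swap cache,
	 * see the vmscan.c hunk below */
	void page_replace_remember(struct zone *zone, struct page *page)
	{
		/* store a cookie so a later refault can be recognised */
		nonresident_put(page_mapping(page), page_index(page));
	}

	/* the page can never refault again (truncate, swap slot freed),
	 * so drop the cookie */
	void page_replace_forget(struct address_space *mapping,
				 unsigned long index)
	{
		nonresident_get(mapping, index);
	}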

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_page_replace.h | 2 ++
include/linux/mm_use_once_policy.h | 3 +++
mm/memory.c | 28 ++++++++++++++++++++++++++++
mm/swapfile.c | 13 +++++++++++--
mm/vmscan.c | 2 ++
5 files changed, 46 insertions(+), 2 deletions(-)

Index: linux-2.6-git/include/linux/mm_use_once_policy.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_use_once_policy.h
+++ linux-2.6-git/include/linux/mm_use_once_policy.h
@@ -161,6 +161,9 @@ static inline int page_replace_is_active
return PageActive(page);
}

+#define page_replace_remember(z, p) do { } while (0)
+#define page_replace_forget(m, i) do { } while (0)
+
static inline unsigned long __page_replace_nr_pages(struct zone *zone)
{
return zone->policy.nr_active + zone->policy.nr_inactive;
Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -96,6 +96,8 @@ extern void page_replace_shrink(struct z
/* void page_replace_copy_state(struct page *, struct page *); */
/* void page_replace_clear_state(struct page *); */
/* int page_replace_is_active(struct page *); */
+/* void page_replace_remember(struct zone *, struct page*); */
+/* void page_replace_forget(struct address_space *, unsigned long); */
extern void page_replace_show(struct zone *);
extern void page_replace_zoneinfo(struct zone *, struct seq_file *);
extern void __page_replace_counts(unsigned long *, unsigned long *,
Index: linux-2.6-git/mm/memory.c
===================================================================
--- linux-2.6-git.orig/mm/memory.c
+++ linux-2.6-git/mm/memory.c
@@ -606,6 +606,31 @@ int copy_page_range(struct mm_struct *ds
return 0;
}

+#if defined MM_POLICY_HAS_NONRESIDENT
+static void free_file(struct vm_area_struct *vma,
+ unsigned long offset)
+{
+ struct address_space *mapping;
+ struct page *page;
+
+ if (!vma ||
+ !vma->vm_file ||
+ !vma->vm_file->f_mapping)
+ return;
+
+ mapping = vma->vm_file->f_mapping;
+ page = find_get_page(mapping, offset);
+ if (page) {
+ page_cache_release(page);
+ return;
+ }
+
+ page_replace_forget(mapping, offset);
+}
+#else
+#define free_file(a,b) do { } while (0)
+#endif
+
static unsigned long zap_pte_range(struct mmu_gather *tlb,
struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end,
@@ -621,6 +646,7 @@ static unsigned long zap_pte_range(struc
do {
pte_t ptent = *pte;
if (pte_none(ptent)) {
+ free_file(vma, pte_to_pgoff(ptent));
(*zap_work)--;
continue;
}
@@ -679,6 +705,8 @@ static unsigned long zap_pte_range(struc
continue;
if (!pte_file(ptent))
free_swap_and_cache(pte_to_swp_entry(ptent));
+ else
+ free_file(vma, pte_to_pgoff(ptent));
pte_clear_full(mm, addr, pte, tlb->fullmm);
} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));

Index: linux-2.6-git/mm/swapfile.c
===================================================================
--- linux-2.6-git.orig/mm/swapfile.c
+++ linux-2.6-git/mm/swapfile.c
@@ -28,6 +28,7 @@
#include <linux/mutex.h>
#include <linux/capability.h>
#include <linux/syscalls.h>
+#include <linux/mm_page_replace.h>

#include <asm/pgtable.h>
#include <asm/tlbflush.h>
@@ -300,7 +301,8 @@ void swap_free(swp_entry_t entry)

p = swap_info_get(entry);
if (p) {
- swap_entry_free(p, swp_offset(entry));
+ if (!swap_entry_free(p, swp_offset(entry)))
+ page_replace_forget(&swapper_space, entry.val);
spin_unlock(&swap_lock);
}
}
@@ -397,8 +399,15 @@ void free_swap_and_cache(swp_entry_t ent

p = swap_info_get(entry);
if (p) {
- if (swap_entry_free(p, swp_offset(entry)) == 1)
+ switch (swap_entry_free(p, swp_offset(entry))) {
+ case 1:
page = find_trylock_page(&swapper_space, entry.val);
+ break;
+
+ case 0:
+ page_replace_forget(&swapper_space, entry.val);
+ break;
+ }
spin_unlock(&swap_lock);
}
if (page) {
Index: linux-2.6-git/mm/vmscan.c
===================================================================
--- linux-2.6-git.orig/mm/vmscan.c
+++ linux-2.6-git/mm/vmscan.c
@@ -315,6 +315,7 @@ static int remove_mapping(struct address

if (PageSwapCache(page)) {
swp_entry_t swap = { .val = page_private(page) };
+ page_replace_remember(page_zone(page), page);
__delete_from_swap_cache(page);
write_unlock_irq(&mapping->tree_lock);
swap_free(swap);
@@ -322,6 +323,7 @@ static int remove_mapping(struct address
return 1;
}

+ page_replace_remember(page_zone(page), page);
__remove_from_page_cache(page);
write_unlock_irq(&mapping->tree_lock);
__put_page(page);

2006-03-22 22:35:13

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 20/34] mm: page-replace-pg_flags.patch


From: Peter Zijlstra <[email protected]>

Abstract the replacement policy specific pageflags.

API:

Copy the policy specific page flags

void page_replace_copy_state(struct page *, struct page *);

Clear the policy specific page flags

void page_replace_clear_state(struct page *);

Query whether the page is accounted as active

int page_replace_is_active(struct page *);
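
As an illustration of the intended calling pattern (cf. the
migrate_page_copy() hunk below), page migration replaces the old
open-coded PageActive()/SetPageActive() pair with:

	/* carry the policy's private flags over to the new page ... */
	page_replace_copy_state(newpage, page);

	/* ... and wipe them off the old page once it leaves the cache */
	page_replace_clear_state(page);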

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_page_replace.h | 3 +++
include/linux/mm_use_once_policy.h | 26 ++++++++++++++++++++++++++
include/linux/page-flags.h | 8 +-------
mm/hugetlb.c | 2 +-
mm/mempolicy.c | 2 +-
mm/page_alloc.c | 6 +++---
mm/vmscan.c | 6 +++---
7 files changed, 38 insertions(+), 15 deletions(-)

Index: linux-2.6-git/include/linux/mm_use_once_policy.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_use_once_policy.h
+++ linux-2.6-git/include/linux/mm_use_once_policy.h
@@ -5,6 +5,15 @@

#include <linux/fs.h>
#include <linux/rmap.h>
+#include <linux/page-flags.h>
+
+#define PG_active PG_reclaim1
+
+#define PageActive(page) test_bit(PG_active, &(page)->flags)
+#define SetPageActive(page) set_bit(PG_active, &(page)->flags)
+#define ClearPageActive(page) clear_bit(PG_active, &(page)->flags)
+#define TestClearPageActive(page) test_and_clear_bit(PG_active, &(page)->flags)
+#define TestSetPageActive(page) test_and_set_bit(PG_active, &(page)->flags)

static inline void page_replace_hint_active(struct page *page)
{
@@ -135,6 +144,23 @@ static inline void __page_replace_rotate
}
}

+static inline void page_replace_copy_state(struct page *dpage, struct page *spage)
+{
+ if (PageActive(spage))
+ SetPageActive(dpage);
+}
+
+static inline void page_replace_clear_state(struct page *page)
+{
+ if (PageActive(page))
+ ClearPageActive(page);
+}
+
+static inline int page_replace_is_active(struct page *page)
+{
+ return PageActive(page);
+}
+
static inline unsigned long __page_replace_nr_pages(struct zone *zone)
{
return zone->policy.nr_active + zone->policy.nr_inactive;
Index: linux-2.6-git/include/linux/page-flags.h
===================================================================
--- linux-2.6-git.orig/include/linux/page-flags.h
+++ linux-2.6-git/include/linux/page-flags.h
@@ -58,7 +58,7 @@

#define PG_dirty 4
#define PG_lru 5
-#define PG_active 6
+#define PG_reclaim1 6 /* reserved by the mm reclaim code */
#define PG_slab 7 /* slab debug (Suparna wants this) */

#define PG_checked 8 /* kill me in 2.5.<early>. */
@@ -244,12 +244,6 @@ extern void __mod_page_state_offset(unsi
#define TestSetPageLRU(page) test_and_set_bit(PG_lru, &(page)->flags)
#define TestClearPageLRU(page) test_and_clear_bit(PG_lru, &(page)->flags)

-#define PageActive(page) test_bit(PG_active, &(page)->flags)
-#define SetPageActive(page) set_bit(PG_active, &(page)->flags)
-#define ClearPageActive(page) clear_bit(PG_active, &(page)->flags)
-#define TestClearPageActive(page) test_and_clear_bit(PG_active, &(page)->flags)
-#define TestSetPageActive(page) test_and_set_bit(PG_active, &(page)->flags)
-
#define PageSlab(page) test_bit(PG_slab, &(page)->flags)
#define SetPageSlab(page) set_bit(PG_slab, &(page)->flags)
#define ClearPageSlab(page) clear_bit(PG_slab, &(page)->flags)
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c
+++ linux-2.6-git/mm/page_alloc.c
@@ -149,7 +149,7 @@ static void bad_page(struct page *page)
page->flags &= ~(1 << PG_lru |
1 << PG_private |
1 << PG_locked |
- 1 << PG_active |
+ 1 << PG_reclaim1 |
1 << PG_dirty |
1 << PG_reclaim |
1 << PG_slab |
@@ -360,7 +360,7 @@ static inline int free_pages_check(struc
1 << PG_lru |
1 << PG_private |
1 << PG_locked |
- 1 << PG_active |
+ 1 << PG_reclaim1 |
1 << PG_reclaim |
1 << PG_slab |
1 << PG_swapcache |
@@ -517,7 +517,7 @@ static int prep_new_page(struct page *pa
1 << PG_lru |
1 << PG_private |
1 << PG_locked |
- 1 << PG_active |
+ 1 << PG_reclaim1 |
1 << PG_dirty |
1 << PG_reclaim |
1 << PG_slab |
Index: linux-2.6-git/mm/hugetlb.c
===================================================================
--- linux-2.6-git.orig/mm/hugetlb.c
+++ linux-2.6-git/mm/hugetlb.c
@@ -152,7 +152,7 @@ static void update_and_free_page(struct
nr_huge_pages_node[page_zone(page)->zone_pgdat->node_id]--;
for (i = 0; i < (HPAGE_SIZE / PAGE_SIZE); i++) {
page[i].flags &= ~(1 << PG_locked | 1 << PG_error | 1 << PG_referenced |
- 1 << PG_dirty | 1 << PG_active | 1 << PG_reserved |
+ 1 << PG_dirty | 1 << PG_reclaim1 | 1 << PG_reserved |
1 << PG_private | 1<< PG_writeback);
set_page_count(&page[i], 0);
}
Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -93,6 +93,9 @@ extern void page_replace_shrink(struct z
/* void page_replace_mark_accessed(struct page *); */
/* void page_replace_remove(struct zone *, struct page *); */
/* void __page_replace_rotate_reclaimable(struct zone *, struct page *); */
+/* void page_replace_copy_state(struct page *, struct page *); */
+/* void page_replace_clear_state(struct page *); */
+/* int page_replace_is_active(struct page *); */
extern void page_replace_show(struct zone *);
extern void page_replace_zoneinfo(struct zone *, struct seq_file *);
extern void __page_replace_counts(unsigned long *, unsigned long *,
Index: linux-2.6-git/mm/vmscan.c
===================================================================
--- linux-2.6-git.orig/mm/vmscan.c
+++ linux-2.6-git/mm/vmscan.c
@@ -484,6 +484,7 @@ int shrink_list(struct list_head *page_l
goto keep_locked;

free_it:
+ page_replace_clear_state(page);
unlock_page(page);
reclaimed++;
if (!pagevec_add(&freed_pvec, page))
@@ -668,12 +669,11 @@ void migrate_page_copy(struct page *newp
SetPageReferenced(newpage);
if (PageUptodate(page))
SetPageUptodate(newpage);
- if (PageActive(page))
- SetPageActive(newpage);
if (PageChecked(page))
SetPageChecked(newpage);
if (PageMappedToDisk(page))
SetPageMappedToDisk(newpage);
+ page_replace_copy_state(newpage, page);

if (PageDirty(page)) {
clear_page_dirty_for_io(page);
@@ -681,8 +681,8 @@ void migrate_page_copy(struct page *newp
}

ClearPageSwapCache(page);
- ClearPageActive(page);
ClearPagePrivate(page);
+ page_replace_clear_state(page);
set_page_private(page, 0);
page->mapping = NULL;

Index: linux-2.6-git/mm/mempolicy.c
===================================================================
--- linux-2.6-git.orig/mm/mempolicy.c
+++ linux-2.6-git/mm/mempolicy.c
@@ -1774,7 +1774,7 @@ static void gather_stats(struct page *pa
if (PageSwapCache(page))
md->swapcache++;

- if (PageActive(page))
+ if (page_replace_is_active(page))
md->active++;

if (PageWriteback(page))

2006-03-22 22:40:38

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 27/34] mm: clockpro-ignore_token.patch


From: Peter Zijlstra <[email protected]>

Re-introduce the ignore_token argument to page_referenced(); hand hot
rotation will make use of this feature.
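
Illustrative sketch only; the hand-hot caller itself arrives with the
CLOCK-Pro patches:

	int referenced;

	/* default behaviour: the swap token holder's pages count as
	 * referenced while it is in the middle of a fault */
	referenced = page_referenced(page, 1, 0);

	/* hand hot rotation can opt out of that bias, so even the token
	 * holder's pages eventually lose their hot status */
	referenced = page_referenced(page, 1, 1);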

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_use_once_policy.h | 2 +-
include/linux/rmap.h | 4 ++--
mm/rmap.c | 26 ++++++++++++++++----------
mm/useonce.c | 2 +-
4 files changed, 20 insertions(+), 14 deletions(-)

Index: linux-2.6-git/include/linux/rmap.h
===================================================================
--- linux-2.6-git.orig/include/linux/rmap.h
+++ linux-2.6-git/include/linux/rmap.h
@@ -90,7 +90,7 @@ static inline void page_dup_rmap(struct
/*
* Called from mm/vmscan.c to handle paging out
*/
-int page_referenced(struct page *, int is_locked);
+int page_referenced(struct page *, int is_locked, int ignore_token);
int try_to_unmap(struct page *, int ignore_refs);
void remove_from_swap(struct page *page);

@@ -111,7 +111,7 @@ unsigned long page_address_in_vma(struct
#define anon_vma_prepare(vma) (0)
#define anon_vma_link(vma) do {} while (0)

-#define page_referenced(page,l) TestClearPageReferenced(page)
+#define page_referenced(page,l,i) TestClearPageReferenced(page)
#define try_to_unmap(page, refs) SWAP_FAIL

#endif /* CONFIG_MMU */
Index: linux-2.6-git/mm/rmap.c
===================================================================
--- linux-2.6-git.orig/mm/rmap.c
+++ linux-2.6-git/mm/rmap.c
@@ -329,7 +329,7 @@ pte_t *page_check_address(struct page *p
* repeatedly from either page_referenced_anon or page_referenced_file.
*/
static int page_referenced_one(struct page *page,
- struct vm_area_struct *vma, unsigned int *mapcount)
+ struct vm_area_struct *vma, unsigned int *mapcount, int ignore_token)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long address;
@@ -350,7 +350,7 @@ static int page_referenced_one(struct pa

/* Pretend the page is referenced if the task has the
swap token and is in the middle of a page fault. */
- if (mm != current->mm && has_swap_token(mm) &&
+ if (mm != current->mm && !ignore_token && has_swap_token(mm) &&
rwsem_is_locked(&mm->mmap_sem))
referenced++;

@@ -360,7 +360,7 @@ out:
return referenced;
}

-static int page_referenced_anon(struct page *page)
+static int page_referenced_anon(struct page *page, int ignore_token)
{
unsigned int mapcount;
struct anon_vma *anon_vma;
@@ -373,7 +373,8 @@ static int page_referenced_anon(struct p

mapcount = page_mapcount(page);
list_for_each_entry(vma, &anon_vma->head, anon_vma_node) {
- referenced += page_referenced_one(page, vma, &mapcount);
+ referenced += page_referenced_one(page, vma, &mapcount,
+ ignore_token);
if (!mapcount)
break;
}
@@ -392,7 +393,7 @@ static int page_referenced_anon(struct p
*
* This function is only called from page_referenced for object-based pages.
*/
-static int page_referenced_file(struct page *page)
+static int page_referenced_file(struct page *page, int ignore_token)
{
unsigned int mapcount;
struct address_space *mapping = page->mapping;
@@ -430,7 +431,8 @@ static int page_referenced_file(struct p
referenced++;
break;
}
- referenced += page_referenced_one(page, vma, &mapcount);
+ referenced += page_referenced_one(page, vma, &mapcount,
+ ignore_token);
if (!mapcount)
break;
}
@@ -447,10 +449,13 @@ static int page_referenced_file(struct p
* Quick test_and_clear_referenced for all mappings to a page,
* returns the number of ptes which referenced the page.
*/
-int page_referenced(struct page *page, int is_locked)
+int page_referenced(struct page *page, int is_locked, int ignore_token)
{
int referenced = 0;

+ if (!swap_token_default_timeout)
+ ignore_token = 1;
+
if (page_test_and_clear_young(page))
referenced++;

@@ -459,14 +464,15 @@ int page_referenced(struct page *page, i

if (page_mapped(page) && page->mapping) {
if (PageAnon(page))
- referenced += page_referenced_anon(page);
+ referenced += page_referenced_anon(page, ignore_token);
else if (is_locked)
- referenced += page_referenced_file(page);
+ referenced += page_referenced_file(page, ignore_token);
else if (TestSetPageLocked(page))
referenced++;
else {
if (page->mapping)
- referenced += page_referenced_file(page);
+ referenced += page_referenced_file(page,
+ ignore_token);
unlock_page(page);
}
}
Index: linux-2.6-git/mm/useonce.c
===================================================================
--- linux-2.6-git.orig/mm/useonce.c
+++ linux-2.6-git/mm/useonce.c
@@ -312,7 +312,7 @@ refill_inactive_zone(struct zone *zone,
if (page_mapped(page)) {
if (!reclaim_mapped ||
(total_swap_pages == 0 && PageAnon(page)) ||
- page_referenced(page, 0)) {
+ page_referenced(page, 0, 0)) {
list_add(&page->lru, &l_active);
continue;
}
Index: linux-2.6-git/include/linux/mm_use_once_policy.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_use_once_policy.h
+++ linux-2.6-git/include/linux/mm_use_once_policy.h
@@ -108,7 +108,7 @@ static inline reclaim_t page_replace_rec
if (PageActive(page))
BUG();

- referenced = page_referenced(page, 1);
+ referenced = page_referenced(page, 1, 0);
/* In active use or really unfreeable? Activate it. */
if (referenced && page_mapping_inuse(page))
return RECLAIM_ACTIVATE;

2006-03-22 22:36:15

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 26/34] mm: clockpro-nonresident.patch


From: Rik van Riel <[email protected]>

Track non-resident pages through a simple hashing scheme. This limits
the space overhead to one u32 per page (roughly 0.1%), and a lookup
costs a single cache miss.

Aside from seeing whether or not a page was recently evicted, we can
also take a reasonable guess at how many other pages were evicted since
this page was evicted.

TODO: make the entries unsigned long, currently we're limited to
2^32*NUM_NR*PAGE_SIZE bytes of memory. Even though this would end up
being 1008 TB of memory, I suspect the hash function will go to crap at
around 4 to 16 TB.
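
Back-of-the-envelope arithmetic behind the 0.1% figure, assuming a
64-byte cacheline, 4KB pages and sizeof(atomic_t) == 4:

	bucket size        = 64 bytes (one cacheline)
	entries per bucket = (64 - 4) / sizeof(u32) = 15
	overhead per page  = 64 / 15    ~= 4.3 bytes
	relative to 4KB    = 4.3 / 4096 ~= 0.1%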

Signed-off-by: Rik van Riel <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/nonresident.h | 12 +++
mm/nonresident.c | 167 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 179 insertions(+)

Index: linux-2.6/mm/nonresident.c
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/mm/nonresident.c 2006-03-13 20:45:26.000000000 +0100
@@ -0,0 +1,167 @@
+/*
+ * mm/nonresident.c
+ * (C) 2004,2005 Red Hat, Inc
+ * Written by Rik van Riel <[email protected]>
+ * Released under the GPL, see the file COPYING for details.
+ *
+ * Keeps track of whether a non-resident page was recently evicted
+ * and should be immediately promoted to the active list. This also
+ * helps automatically tune the inactive target.
+ *
+ * The pageout code stores a recently evicted page in this cache
+ * by calling nonresident_put(mapping, index) and can later look
+ * it up again by calling nonresident_get() with the same
+ * arguments.
+ *
+ * Note that there is no way to invalidate pages after eg. truncate
+ * or exit, we let the pages fall out of the non-resident set through
+ * normal replacement.
+ */
+#include <linux/mm.h>
+#include <linux/cache.h>
+#include <linux/spinlock.h>
+#include <linux/bootmem.h>
+#include <linux/hash.h>
+#include <linux/prefetch.h>
+#include <linux/kernel.h>
+
+/* Number of non-resident pages per hash bucket. Never smaller than 15. */
+#if (L1_CACHE_BYTES < 64)
+#define NR_BUCKET_BYTES 64
+#else
+#define NR_BUCKET_BYTES L1_CACHE_BYTES
+#endif
+#define NUM_NR ((NR_BUCKET_BYTES - sizeof(atomic_t))/sizeof(u32))
+
+struct nr_bucket
+{
+ atomic_t hand;
+ u32 page[NUM_NR];
+} ____cacheline_aligned;
+
+/* The non-resident page hash table. */
+static struct nr_bucket * nonres_table;
+static unsigned int nonres_shift;
+static unsigned int nonres_mask;
+
+static struct nr_bucket * nr_hash(void * mapping, unsigned long index)
+{
+ unsigned long bucket;
+ unsigned long hash;
+
+ hash = hash_ptr(mapping, BITS_PER_LONG);
+ hash = 37 * hash + hash_long(index, BITS_PER_LONG);
+ bucket = hash & nonres_mask;
+
+ return nonres_table + bucket;
+}
+
+static u32 nr_cookie(struct address_space * mapping, unsigned long index)
+{
+ unsigned long cookie = hash_ptr(mapping, BITS_PER_LONG);
+ cookie = 37 * cookie + hash_long(index, BITS_PER_LONG);
+
+ if (mapping && mapping->host) {
+ cookie = 37 * cookie + hash_long(mapping->host->i_ino, BITS_PER_LONG);
+ }
+
+ return (u32)(cookie >> (BITS_PER_LONG - 32));
+}
+
+unsigned long nonresident_get(struct address_space * mapping, unsigned long index)
+{
+ struct nr_bucket * nr_bucket;
+ int distance;
+ u32 wanted;
+ int i;
+
+ prefetch(mapping->host);
+ nr_bucket = nr_hash(mapping, index);
+
+ prefetch(nr_bucket);
+ wanted = nr_cookie(mapping, index);
+
+ for (i = 0; i < NUM_NR; i++) {
+ if (nr_bucket->page[i] == wanted) {
+ nr_bucket->page[i] = 0;
+ /* Return the distance between entry and clock hand. */
+ distance = atomic_read(&nr_bucket->hand) + NUM_NR - i;
+ distance %= NUM_NR;
+ return (distance << nonres_shift) + (nr_bucket - nonres_table);
+ }
+ }
+
+ return ~0UL;
+}
+
+u32 nonresident_put(struct address_space * mapping, unsigned long index)
+{
+ struct nr_bucket * nr_bucket;
+ u32 nrpage;
+ int i;
+
+ prefetch(mapping->host);
+ nr_bucket = nr_hash(mapping, index);
+
+ prefetchw(nr_bucket);
+ nrpage = nr_cookie(mapping, index);
+
+ /* Atomically find the next array index. */
+ preempt_disable();
+retry:
+ i = atomic_inc_return(&nr_bucket->hand);
+ if (unlikely(i >= NUM_NR)) {
+ if (i == NUM_NR)
+ atomic_set(&nr_bucket->hand, -1);
+ goto retry;
+ }
+ preempt_enable();
+
+ /* Statistics may want to know whether the entry was in use. */
+ return xchg(&nr_bucket->page[i], nrpage);
+}
+
+unsigned long fastcall nonresident_total(void)
+{
+ return NUM_NR << nonres_shift;
+}
+
+/*
+ * For interactive workloads, we remember about as many non-resident pages
+ * as we have actual memory pages. For server workloads with large inter-
+ * reference distances we could benefit from remembering more.
+ */
+static __initdata unsigned long nonresident_factor = 1;
+void __init nonresident_init(void)
+{
+ int target;
+ int i;
+
+ /*
+ * Calculate the non-resident hash bucket target. Use a power of
+ * two for the division because alloc_large_system_hash rounds up.
+ */
+ target = nr_all_pages * nonresident_factor;
+ target /= (sizeof(struct nr_bucket) / sizeof(u32));
+
+ nonres_table = alloc_large_system_hash("Non-resident page tracking",
+ sizeof(struct nr_bucket),
+ target,
+ 0,
+ HASH_EARLY | HASH_HIGHMEM,
+ &nonres_shift,
+ &nonres_mask,
+ 0);
+
+ for (i = 0; i < (1 << nonres_shift); i++)
+ atomic_set(&nonres_table[i].hand, 0);
+}
+
+static int __init set_nonresident_factor(char * str)
+{
+ if (!str)
+ return 0;
+ nonresident_factor = simple_strtoul(str, &str, 0);
+ return 1;
+}
+__setup("nonresident_factor=", set_nonresident_factor);
Index: linux-2.6/include/linux/nonresident.h
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/include/linux/nonresident.h 2006-03-13 20:45:26.000000000 +0100
@@ -0,0 +1,12 @@
+#ifndef _LINUX_NONRESIDENT_H_
+#define _LINUX_NONRESIDENT_H_
+
+#ifdef __KERNEL__
+
+extern void nonresident_init(void);
+extern unsigned long nonresident_get(struct address_space *, unsigned long);
+extern u32 nonresident_put(struct address_space *, unsigned long);
+extern unsigned long fastcall nonresident_total(void);
+
+#endif /* __KERNEL */
+#endif /* _LINUX_NONRESIDENT_H_ */

2006-03-22 22:35:12

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 14/34] mm: page-replace-remove-mm_inline.patch


From: Peter Zijlstra <[email protected]>

Remove mm_inline.h and abstract the removal of pages from the
page replacement policy.

API:

remove the page from the replacement policy's care

void page_replace_remove(struct zone *, struct page *);
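
Typical caller pattern, shown here only for orientation (cf. the
__page_cache_release() hunk below):

	struct zone *zone = page_zone(page);
	unsigned long flags;

	spin_lock_irqsave(&zone->lru_lock, flags);
	if (TestClearPageLRU(page))
		page_replace_remove(zone, page);
	spin_unlock_irqrestore(&zone->lru_lock, flags);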

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_inline.h | 39 -------------------------------------
include/linux/mm_page_replace.h | 2 -
include/linux/mm_use_once_policy.h | 25 +++++++++++++++++++++++
mm/swap.c | 5 +---
mm/useonce.c | 8 ++++++-
mm/vmscan.c | 1
6 files changed, 35 insertions(+), 45 deletions(-)

Index: linux-2.6-git/include/linux/mm_inline.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_inline.h
+++ linux-2.6-git/include/linux/mm_inline.h
@@ -1,41 +1,2 @@

-static inline void
-add_page_to_active_list(struct zone *zone, struct page *page)
-{
- list_add(&page->lru, &zone->active_list);
- zone->nr_active++;
-}
-
-static inline void
-add_page_to_inactive_list(struct zone *zone, struct page *page)
-{
- list_add(&page->lru, &zone->inactive_list);
- zone->nr_inactive++;
-}
-
-static inline void
-del_page_from_active_list(struct zone *zone, struct page *page)
-{
- list_del(&page->lru);
- zone->nr_active--;
-}
-
-static inline void
-del_page_from_inactive_list(struct zone *zone, struct page *page)
-{
- list_del(&page->lru);
- zone->nr_inactive--;
-}
-
-static inline void
-del_page_from_lru(struct zone *zone, struct page *page)
-{
- list_del(&page->lru);
- if (PageActive(page)) {
- ClearPageActive(page);
- zone->nr_active--;
- } else {
- zone->nr_inactive--;
- }
-}

Index: linux-2.6-git/include/linux/mm_use_once_policy.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_use_once_policy.h
+++ linux-2.6-git/include/linux/mm_use_once_policy.h
@@ -12,6 +12,20 @@ static inline void page_replace_hint_act
}

static inline void
+del_page_from_inactive_list(struct zone *zone, struct page *page)
+{
+ list_del(&page->lru);
+ zone->nr_inactive--;
+}
+
+static inline void
+add_page_to_active_list(struct zone *zone, struct page *page)
+{
+ list_add(&page->lru, &zone->active_list);
+ zone->nr_active++;
+}
+
+static inline void
add_page_to_inactive_list(struct zone *zone, struct page *page)
{
list_add(&page->lru, &zone->policy.inactive_list);
@@ -102,5 +116,16 @@ static inline int page_replace_activate(
return 1;
}

+static inline void page_replace_remove(struct zone *zone, struct page *page)
+{
+ list_del(&page->lru);
+ if (PageActive(page)) {
+ ClearPageActive(page);
+ zone->nr_active--;
+ } else {
+ zone->nr_inactive--;
+ }
+}
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_USEONCE_POLICY_H */
Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -6,7 +6,6 @@
#include <linux/mmzone.h>
#include <linux/mm.h>
#include <linux/pagevec.h>
-#include <linux/mm_inline.h>

struct scan_control {
/* Ask refill_inactive_zone, or shrink_cache to scan this many pages */
@@ -89,6 +88,7 @@ typedef enum {
extern void page_replace_reinsert(struct list_head *);
extern void page_replace_shrink(struct zone *, struct scan_control *);
/* void page_replace_mark_accessed(struct page *); */
+/* void page_replace_remove(struct zone *, struct page *); */

#ifdef CONFIG_MIGRATION
extern int page_replace_isolate(struct page *p);
Index: linux-2.6-git/mm/useonce.c
===================================================================
--- linux-2.6-git.orig/mm/useonce.c
+++ linux-2.6-git/mm/useonce.c
@@ -1,5 +1,4 @@
#include <linux/mm_page_replace.h>
-#include <linux/mm_inline.h>
#include <linux/swap.h>
#include <linux/module.h>
#include <linux/pagemap.h>
@@ -7,6 +6,13 @@
#include <linux/buffer_head.h> /* for try_to_release_page(),
buffer_heads_over_limit */

+static inline void
+del_page_from_active_list(struct zone *zone, struct page *page)
+{
+ list_del(&page->lru);
+ zone->nr_active--;
+}
+
/**
* lru_cache_add: add a page to the page lists
* @page: the page to add
Index: linux-2.6-git/mm/swap.c
===================================================================
--- linux-2.6-git.orig/mm/swap.c
+++ linux-2.6-git/mm/swap.c
@@ -22,7 +22,6 @@
#include <linux/pagevec.h>
#include <linux/init.h>
#include <linux/module.h>
-#include <linux/mm_inline.h>
#include <linux/buffer_head.h> /* for try_to_release_page() */
#include <linux/module.h>
#include <linux/percpu_counter.h>
@@ -118,7 +117,7 @@ void fastcall __page_cache_release(struc

spin_lock_irqsave(&zone->lru_lock, flags);
if (TestClearPageLRU(page))
- del_page_from_lru(zone, page);
+ page_replace_remove(zone, page);
if (page_count(page) != 0)
page = NULL;
spin_unlock_irqrestore(&zone->lru_lock, flags);
@@ -171,7 +170,7 @@ void release_pages(struct page **pages,
spin_lock_irq(&zone->lru_lock);
}
if (TestClearPageLRU(page))
- del_page_from_lru(zone, page);
+ page_replace_remove(zone, page);
if (page_count(page) == 0) {
if (!pagevec_add(&pages_to_free, page)) {
spin_unlock_irq(&zone->lru_lock);
Index: linux-2.6-git/mm/vmscan.c
===================================================================
--- linux-2.6-git.orig/mm/vmscan.c
+++ linux-2.6-git/mm/vmscan.c
@@ -24,7 +24,6 @@
#include <linux/blkdev.h>
#include <linux/buffer_head.h> /* for try_to_release_page(),
buffer_heads_over_limit */
-#include <linux/mm_inline.h>
#include <linux/pagevec.h>
#include <linux/backing-dev.h>
#include <linux/rmap.h>

2006-03-22 22:34:37

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 10/34] mm: page-replace-reinsert.patch


From: Peter Zijlstra <[email protected]>

API:
void page_replace_reinsert(struct list_head*);

reinserts pages taken with page_replace_isolate().
NOTE: these pages still have their reclaim page state and so can be
inserted at the proper place.
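
Sketch of the isolate/reinsert round trip as page migration uses it
(cf. the mempolicy.c hunks below); pagelist is the caller's private
collection list:

	LIST_HEAD(pagelist);

	if (page_replace_isolate(page))		/* reclaim state stays on the page */
		list_add_tail(&page->lru, &pagelist);

	/* ... migrate the pages on the list ... */

	if (!list_empty(&pagelist))
		page_replace_reinsert(&pagelist);	/* unmigrated pages go back */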

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_page_replace.h | 2 +-
mm/mempolicy.c | 6 +++---
mm/useonce.c | 11 +++++++++++
mm/vmscan.c | 26 --------------------------
4 files changed, 15 insertions(+), 30 deletions(-)

Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -86,7 +86,7 @@ typedef enum {

/* reclaim_t page_replace_reclaimable(struct page *); */
/* int page_replace_activate(struct page *page); */
-
+extern void page_replace_reinsert(struct list_head *);

#ifdef CONFIG_MIGRATION
extern int page_replace_isolate(struct page *p);
Index: linux-2.6-git/mm/useonce.c
===================================================================
--- linux-2.6-git.orig/mm/useonce.c
+++ linux-2.6-git/mm/useonce.c
@@ -107,6 +107,17 @@ int page_replace_isolate(struct page *pa
}
#endif

+void page_replace_reinsert(struct list_head *page_list)
+{
+ struct page *page, *page2;
+
+ list_for_each_entry_safe(page, page2, page_list, lru) {
+ list_del(&page->lru);
+ page_replace_add(page);
+ put_page(page);
+ }
+}
+
/*
* zone->lru_lock is heavily contended. Some of the functions that
* shrink the lists perform better by taking out a batch of pages
Index: linux-2.6-git/mm/vmscan.c
===================================================================
--- linux-2.6-git.orig/mm/vmscan.c
+++ linux-2.6-git/mm/vmscan.c
@@ -509,32 +509,6 @@ keep:
}

#ifdef CONFIG_MIGRATION
-
-static inline void move_to_lru(struct page *page)
-{
- list_del(&page->lru);
- page_replace_add(page);
- put_page(page);
-}
-
-/*
- * Add isolated pages on the list back to the LRU.
- *
- * returns the number of pages put back.
- */
-int putback_lru_pages(struct list_head *l)
-{
- struct page *page;
- struct page *page2;
- int count = 0;
-
- list_for_each_entry_safe(page, page2, l, lru) {
- move_to_lru(page);
- count++;
- }
- return count;
-}
-
/*
* Non migratable page
*/
Index: linux-2.6-git/mm/mempolicy.c
===================================================================
--- linux-2.6-git.orig/mm/mempolicy.c
+++ linux-2.6-git/mm/mempolicy.c
@@ -607,7 +607,7 @@ redo:
}
err = migrate_pages(pagelist, &newlist, &moved, &failed);

- putback_lru_pages(&moved); /* Call release pages instead ?? */
+ page_replace_reinsert(&moved); /* Call release pages instead ?? */

if (err >= 0 && list_empty(&newlist) && !list_empty(pagelist))
goto redo;
@@ -648,7 +648,7 @@ int migrate_to_node(struct mm_struct *mm
if (!list_empty(&pagelist)) {
err = migrate_pages_to(&pagelist, NULL, dest);
if (!list_empty(&pagelist))
- putback_lru_pages(&pagelist);
+ page_replace_reinsert(&pagelist);
}
return err;
}
@@ -800,7 +800,7 @@ long do_mbind(unsigned long start, unsig
err = -EIO;
}
if (!list_empty(&pagelist))
- putback_lru_pages(&pagelist);
+ page_replace_reinsert(&pagelist);

up_write(&mm->mmap_sem);
mpol_free(new);

2006-03-22 22:34:38

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 13/34] mm: page-replace-mark-accessed.patch


From: Peter Zijlstra <[email protected]>

Abstract the page activation.

API:
void page_replace_mark_accessed(struct page *);

Mark a page as accessed.

XXX: go through tree and rename mark_page_accessed() ?
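
For comparison: the use-once implementation in the hunk below does the
list juggling itself, whereas CLOCK-Pro (in its policy header elsewhere
in this series) merely records the reference and lets the clock hands
pick it up later:

	static inline void page_replace_mark_accessed(struct page *page)
	{
		SetPageReferenced(page);
	}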

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_page_replace.h | 1 +
include/linux/mm_use_once_policy.h | 26 ++++++++++++++++++++++++++
mm/swap.c | 28 +---------------------------
3 files changed, 28 insertions(+), 27 deletions(-)

Index: linux-2.6-git/include/linux/mm_use_once_policy.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_use_once_policy.h
+++ linux-2.6-git/include/linux/mm_use_once_policy.h
@@ -31,6 +31,32 @@ static inline void page_replace_hint_use
{
}

+/*
+ * Mark a page as having seen activity.
+ *
+ * inactive,unreferenced -> inactive,referenced
+ * inactive,referenced -> active,unreferenced
+ * active,unreferenced -> active,referenced
+ */
+static inline void page_replace_mark_accessed(struct page *page)
+{
+ if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
+ struct zone *zone = page_zone(page);
+
+ spin_lock_irq(&zone->lru_lock);
+ if (PageLRU(page) && !PageActive(page)) {
+ del_page_from_inactive_list(zone, page);
+ SetPageActive(page);
+ add_page_to_active_list(zone, page);
+ inc_page_state(pgactivate);
+ }
+ spin_unlock_irq(&zone->lru_lock);
+ ClearPageReferenced(page);
+ } else if (!PageReferenced(page)) {
+ SetPageReferenced(page);
+ }
+}
+
/* Called without lock on whether page is mapped, so answer is unstable */
static inline int page_mapping_inuse(struct page *page)
{
Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -88,6 +88,7 @@ typedef enum {
/* int page_replace_activate(struct page *page); */
extern void page_replace_reinsert(struct list_head *);
extern void page_replace_shrink(struct zone *, struct scan_control *);
+/* void page_replace_mark_accessed(struct page *); */

#ifdef CONFIG_MIGRATION
extern int page_replace_isolate(struct page *p);
Index: linux-2.6-git/mm/swap.c
===================================================================
--- linux-2.6-git.orig/mm/swap.c
+++ linux-2.6-git/mm/swap.c
@@ -98,37 +98,11 @@ int rotate_reclaimable_page(struct page
}

/*
- * FIXME: speed this up?
- */
-void fastcall activate_page(struct page *page)
-{
- struct zone *zone = page_zone(page);
-
- spin_lock_irq(&zone->lru_lock);
- if (PageLRU(page) && !PageActive(page)) {
- del_page_from_inactive_list(zone, page);
- SetPageActive(page);
- add_page_to_active_list(zone, page);
- inc_page_state(pgactivate);
- }
- spin_unlock_irq(&zone->lru_lock);
-}
-
-/*
* Mark a page as having seen activity.
- *
- * inactive,unreferenced -> inactive,referenced
- * inactive,referenced -> active,unreferenced
- * active,unreferenced -> active,referenced
*/
void fastcall mark_page_accessed(struct page *page)
{
- if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
- activate_page(page);
- ClearPageReferenced(page);
- } else if (!PageReferenced(page)) {
- SetPageReferenced(page);
- }
+ page_replace_mark_accessed(page);
}

EXPORT_SYMBOL(mark_page_accessed);

2006-03-22 22:42:40

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 23/34] mm: page-replace-documentation.patch


From: Marcelo Tosatti <[email protected]>

Documentation for the page replace framework.
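
To give a feel for the interface before the full text, this is roughly
the shape a policy takes. It is an illustrative skeleton only, with
made-up member names; the real implementations are mm/useonce.c and
mm/clockpro.c:

	/* include/linux/mm_<policy>_data.h -- per-zone state,
	 * embedded in struct zone as 'policy' */
	struct page_replace_data {
		struct list_head	list;
		unsigned long		nr_pages;
	};

	/* include/linux/mm_<policy>_policy.h -- the inline hooks */
	static inline void page_replace_hint_active(struct page *page)
	{
	}

	static inline void page_replace_remove(struct zone *zone,
					       struct page *page)
	{
		list_del(&page->lru);
		zone->policy.nr_pages--;
	}

	static inline unsigned long __page_replace_nr_pages(struct zone *zone)
	{
		return zone->policy.nr_pages;
	}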

Signed-off-by: Marcelo Tosatti <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>

---

Documentation/vm/page_replacement_api.txt | 216 ++++++++++++++++++++++++++++++
1 file changed, 216 insertions(+)

Index: linux-2.6/Documentation/vm/page_replacement_api.txt
===================================================================
--- /dev/null 1970-01-01 00:00:00.000000000 +0000
+++ linux-2.6/Documentation/vm/page_replacement_api.txt 2006-03-13 20:45:22.000000000 +0100
@@ -0,0 +1,216 @@
+ Page Replacement Policy Interface
+
+Introduction
+============
+
+This document describes the page replacement interfaces used by the
+virtual memory subsystem.
+
+When the system's free memory runs below a certain threshold, an action
+must be initiated to reclaim memory for future use. The decision of
+which memory pages to evict is called the replacement policy.
+
+There are several types of reclaimable objects which live in the
+system's memory:
+
+a) file cache pages
+b) anonymous process pages
+c) shared memory (shm) pages
+d) SLAB cache pages, used for internal kernel objects such as the inode
+and dentry caches.
+
+The policy API abstracts the replacement structure for pagecache objects
+(items a) b) and c)), separating it from the reclaim code path.
+
+This allows maintenance of different policies to deal with different
+workload requirements.
+
+Zoned VM
+========
+
+In Linux, physical memory is managed separately into zones, because
+certain types of allocations are address constrained.
+
+The operating system has to support types of hardware which cannot
+access full 32-bit addresses, but are limited to an address mask. For
+instance, ISA devices can only address the lower 24 bits (hence their
+visibility is limited to 16MB).
+
+Additionally, pages used for internal kernel data must be restricted to
+the direct kernel mapping, which is approximately 1GB in current default
+configurations.
+
+Different zones must be managed separately from the perspective of the
+page reclaim path, because particular zones might suffer more pressure
+than others.
+
+This means that the page replacement structures have to be maintained
+separately for each zone.
+
+Description
+===========
+
+The page replacement policy interface consists of a set of operations
+which are invoked from the common VM code.
+
+As mentioned before, the policy specific data has to be maintained
+separately for each zone, therefore "struct zone" embeds the following
+data structure:
+
+ struct page_replace_data policy;
+
+Which is to be defined by the policy in a separate header file.
+
+At the moment, this data structure is guarded by the "zone->lru_lock"
+spinlock, thus shared by all policies.
+
+Initialization (invoked during system bootup)
+--------------
+
+ * void __init page_replace_init(void)
+
+Policy private initialization.
+
+ * void __init page_replace_init_zone(struct zone *)
+
+Initialize zone specific policy data.
+
+
+Methods called by the VM
+------------------------
+
+ * void page_replace_hint_active(struct page *);
+ * void page_replace_hint_use_once(struct page *);
+
+Give the policy hints as to the importance of the page. These hints can
+be viewed as the page's initial priority: active is +1, use_once is -1.
+
+
+ * void fastcall page_replace_add(struct page *);
+
+Insert the page into per-CPU list(s), used for batching groups of pages
+to relieve zone->lru_lock contention. Called during page instantiation.
+
+
+ * void page_replace_add_drain(void);
+ * void page_replace_add_drain_cpu(unsigned int);
+
+Drain the per-CPU list(s), pushing pages to the actual cache.
+Called in locations where it is important not to have stale data
+in the per-CPU lists.
+
+
+ * void pagevec_page_replace_add(struct pagevec *);
+ * void __pagevec_page_replace_add(struct pagevec *);
+
+Insert a whole pagevec worth of pages directly.
+
+
+ * void page_replace_candidates(struct zone *, int, struct list_head *);
+
+Select candidates for eviction from the specified zone.
+
+@zone: which memory zone to scan for.
+@nr_to_scan: number of pages to scan.
+@page_list: list_head to add the selected pages to
+
+Called by mm/vmscan.c::shrink_cache(), the main function used to
+evict pagecache pages from a specific zone.
+
+
+ * reclaim_t page_replace_reclaimable(struct page *);
+
+Determines whether a page is reclaimable, used by shrink_list().
+This function encapsulates the call to page_referenced().
+
+
+ * void page_replace_activate(struct page *);
+
+Callback used to let the policy know this page was referenced.
+
+
+ * void page_replace_reinsert_zone(struct zone *, struct list_head *);
+
+Put unfreeable pages back into the zone's cache mgmt structures.
+
+@zone: memory zone which pages belong
+@page_list: list of pages to reinsert
+
+
+ * void page_replace_remove(struct zone *, struct page *);
+
+Remove page from cache. This function clears the page state.
+
+
+ * int page_replace_isolate(struct page *);
+
+Isolate a specified page; ie. remove it from the cache mgmt structures without
+clearing its page state (used for page migration).
+
+
+ * void page_replace_reinsert(struct list_head *);
+
+Reinsert a list of pages previously isolated by page_replace_isolate().
+Remember that these pages still have their page state; this property
+distinguishes this function from page_replace_add().
+NOTE: the pages on the list need not be in the same zone.
+
+
+ * void __page_replace_rotate_reclaimable(struct zone *, struct page *);
+
+Place this page so that it will be in the next candidate batch.
+
+
+ * void page_replace_remember(struct zone *, struct page*);
+ * void page_replace_forget(struct address_space *, unsigned long);
+
+Hooks for nonresident page management. Allows the policy to remember and
+forget about pages that are no longer resident.
+
+ * void page_replace_show(struct zone *);
+ * void page_replace_zoneinfo(struct zone *, struct seq_file *);
+
+Print the zone's statistics in the respective formats (console and seq_file).
+
+* void __page_replace_counts(unsigned long *, unsigned long *,
+ unsigned long *, struct pglist_data *);
+
+Gives the 'active', 'inactive' and free page counts for the selected
+pgdat; 'active'/'inactive' are open to the policy's interpretation.
+
+ * unsigned long __page_replace_nr_pages(struct zone *);
+
+Gives the total number of pages currently managed by the page replacement
+policy.
+
+
+ * unsigned long __page_replace_nr_scan(struct zone *);
+
+Gives the number of pages needed to drive the scanning.
+
+Helpers
+-------
+
+Certain helpers are shared by all policies; a description of them follows:
+
+1) int should_reclaim_mapped(struct zone *);
+
+The point of this algorithm is to decide when to start reclaiming mapped
+memory instead of clean pagecache.
+
+Returns 1 if mapped pages should be candidates for reclaim, 0 otherwise.
+
+Page flags
+----------
+
+A number of bits in page->flags are reserved for the page replacement
+policies, they are:
+
+ PG_reclaim1 /* bit 6 */
+ PG_reclaim2 /* bit 20 */
+ PG_reclaim3 /* bit 21 */
+
+The policy private semantics of these bits are to be defined in
+the policy implementation. These bits are internal to the policy and as
+such should not be interpreted in any way by external code.
+

2006-03-22 22:42:46

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 18/34] mm: page-replace-counts.patch


From: Peter Zijlstra <[email protected]>

Abstract the various page counts used to drive the scanner.

API:

give the 'active', 'inactive' and free counts for the selected pgdat
('active' and 'inactive' being open to the policy's interpretation)

void __page_replace_counts(unsigned long *, unsigned long *,
unsigned long *, struct pglist_data *);

total number of pages in the policy's care

unsigned long __page_replace_nr_pages(struct zone *);

number of pages to base the scan speed on

unsigned long __page_replace_nr_scan(struct zone *);
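
For illustration, how the two policies in this series fill in
__page_replace_nr_pages(); each defines its own copy in its policy
header, the #ifdef below is only for presentation:

	#ifdef CONFIG_MM_POLICY_USEONCE
	static inline unsigned long __page_replace_nr_pages(struct zone *zone)
	{
		return zone->nr_active + zone->nr_inactive;
	}
	#else	/* CONFIG_MM_POLICY_CLOCKPRO */
	static inline unsigned long __page_replace_nr_pages(struct zone *zone)
	{
		return zone->policy.nr_resident;
	}
	#endif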


Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_page_replace.h | 3 +++
include/linux/mm_use_once_policy.h | 5 +++++
mm/page_alloc.c | 12 +-----------
mm/useonce.c | 16 ++++++++++++++++
mm/vmscan.c | 6 +++---
5 files changed, 28 insertions(+), 14 deletions(-)

Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -95,6 +95,9 @@ extern void page_replace_shrink(struct z
/* void __page_replace_rotate_reclaimable(struct zone *, struct page *); */
extern void page_replace_show(struct zone *);
extern void page_replace_zoneinfo(struct zone *, struct seq_file *);
+extern void __page_replace_counts(unsigned long *, unsigned long *,
+ unsigned long *, struct pglist_data *);
+/* unsigned long __page_replace_nr_pages(struct zone *); */

#ifdef CONFIG_MIGRATION
extern int page_replace_isolate(struct page *p);
Index: linux-2.6-git/mm/useonce.c
===================================================================
--- linux-2.6-git.orig/mm/useonce.c
+++ linux-2.6-git/mm/useonce.c
@@ -478,3 +478,19 @@ void page_replace_zoneinfo(struct zone *
zone->spanned_pages,
zone->present_pages);
}
+
+void __page_replace_counts(unsigned long *active, unsigned long *inactive,
+ unsigned long *free, struct pglist_data *pgdat)
+{
+ struct zone *zones = pgdat->node_zones;
+ int i;
+
+ *active = 0;
+ *inactive = 0;
+ *free = 0;
+ for (i = 0; i < MAX_NR_ZONES; i++) {
+ *active += zones[i].nr_active;
+ *inactive += zones[i].nr_inactive;
+ *free += zones[i].free_pages;
+ }
+}
Index: linux-2.6-git/mm/page_alloc.c
===================================================================
--- linux-2.6-git.orig/mm/page_alloc.c
+++ linux-2.6-git/mm/page_alloc.c
@@ -1307,17 +1307,7 @@ EXPORT_SYMBOL(mod_page_state_offset);
void __get_zone_counts(unsigned long *active, unsigned long *inactive,
unsigned long *free, struct pglist_data *pgdat)
{
- struct zone *zones = pgdat->node_zones;
- int i;
-
- *active = 0;
- *inactive = 0;
- *free = 0;
- for (i = 0; i < MAX_NR_ZONES; i++) {
- *active += zones[i].nr_active;
- *inactive += zones[i].nr_inactive;
- *free += zones[i].free_pages;
- }
+ __page_replace_counts(active, inactive, free, pgdat);
}

void get_zone_counts(unsigned long *active,
Index: linux-2.6-git/include/linux/mm_use_once_policy.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_use_once_policy.h
+++ linux-2.6-git/include/linux/mm_use_once_policy.h
@@ -135,5 +135,10 @@ static inline void __page_replace_rotate
}
}

+static inline unsigned long __page_replace_nr_pages(struct zone *zone)
+{
+ return zone->nr_active + zone->nr_inactive;
+}
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_USEONCE_POLICY_H */
Index: linux-2.6-git/mm/vmscan.c
===================================================================
--- linux-2.6-git.orig/mm/vmscan.c
+++ linux-2.6-git/mm/vmscan.c
@@ -1033,7 +1033,7 @@ int try_to_free_pages(struct zone **zone
continue;

zone->temp_priority = DEF_PRIORITY;
- lru_pages += zone->nr_active + zone->nr_inactive;
+ lru_pages += __page_replace_nr_pages(zone);
}

for (priority = DEF_PRIORITY; priority >= 0; priority--) {
@@ -1175,7 +1175,7 @@ scan:
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;

- lru_pages += zone->nr_active + zone->nr_inactive;
+ lru_pages += __page_replace_nr_pages(zone);
}

/*
@@ -1219,7 +1219,7 @@ scan:
if (zone->all_unreclaimable)
continue;
if (nr_slab == 0 && zone->pages_scanned >=
- (zone->nr_active + zone->nr_inactive) * 4)
+ __page_replace_nr_pages(zone) * 4)
zone->all_unreclaimable = 1;
/*
* If we've done a decent amount of scanning and

2006-03-22 22:44:42

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 09/34] mm: page-replace-move-isolate_lru_pages.patch


From: Peter Zijlstra <[email protected]>

In anticipation that only the policy implementation will know anything about
the management of the pages, move isolate_lru_pages over to the policy
implementation.

API:
int page_replace_isolate(struct page*)

isolate a single page from the cache mgmt structures - used by page migration.
NOTE: this function leaves the reclaim page state untouched so that it can be
reinserted in the dest. zone at the correct place.
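
Caller sketch, essentially the mempolicy.c hunk below; pagelist is the
caller's collection list (the !CONFIG_MIGRATION stub simply returns
-ENOSYS):

	/* take the page off the LRU; its reclaim page state stays intact
	 * so the destination zone can reinsert it at the right spot */
	if (page_replace_isolate(page))
		list_add_tail(&page->lru, pagelist);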

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_page_replace.h | 6 ++
include/linux/mm_use_once_policy.h | 3 +
include/linux/swap.h | 2
mm/mempolicy.c | 2
mm/useonce.c | 81 +++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 79 ------------------------------------
6 files changed, 92 insertions(+), 81 deletions(-)

Index: linux-2.6-git/mm/mempolicy.c
===================================================================
--- linux-2.6-git.orig/mm/mempolicy.c
+++ linux-2.6-git/mm/mempolicy.c
@@ -552,7 +552,7 @@ static void migrate_page_add(struct page
* Avoid migrating a page that is shared with others.
*/
if ((flags & MPOL_MF_MOVE_ALL) || page_mapcount(page) == 1) {
- if (isolate_lru_page(page))
+ if (page_replace_isolate(page))
list_add_tail(&page->lru, pagelist);
}
}
Index: linux-2.6-git/mm/useonce.c
===================================================================
--- linux-2.6-git.orig/mm/useonce.c
+++ linux-2.6-git/mm/useonce.c
@@ -75,3 +75,84 @@ int page_replace_add_drain_all(void)
return 0;
}
#endif
+
+#ifdef CONFIG_MIGRATION
+/*
+ * Isolate one page from the LRU lists and put it on the
+ * indicated list with elevated refcount.
+ *
+ * Result:
+ * 0 = page not on LRU list
+ * 1 = page removed from LRU list and added to the specified list.
+ */
+int page_replace_isolate(struct page *page)
+{
+ int ret = 0;
+
+ if (PageLRU(page)) {
+ struct zone *zone = page_zone(page);
+ spin_lock_irq(&zone->lru_lock);
+ if (TestClearPageLRU(page)) {
+ ret = 1;
+ get_page(page);
+ if (PageActive(page))
+ del_page_from_active_list(zone, page);
+ else
+ del_page_from_inactive_list(zone, page);
+ }
+ spin_unlock_irq(&zone->lru_lock);
+ }
+
+ return ret;
+}
+#endif
+
+/*
+ * zone->lru_lock is heavily contended. Some of the functions that
+ * shrink the lists perform better by taking out a batch of pages
+ * and working on them outside the LRU lock.
+ *
+ * For pagecache intensive workloads, this function is the hottest
+ * spot in the kernel (apart from copy_*_user functions).
+ *
+ * Appropriate locks must be held before calling this function.
+ *
+ * @nr_to_scan: The number of pages to look through on the list.
+ * @src: The LRU list to pull pages off.
+ * @dst: The temp list to put pages on to.
+ * @scanned: The number of pages that were scanned.
+ *
+ * returns how many pages were moved onto *@dst.
+ */
+int isolate_lru_pages(int nr_to_scan, struct list_head *src,
+ struct list_head *dst, int *scanned)
+{
+ int nr_taken = 0;
+ struct page *page;
+ int scan = 0;
+
+ while (scan++ < nr_to_scan && !list_empty(src)) {
+ page = lru_to_page(src);
+ prefetchw_prev_lru_page(page, src, flags);
+
+ if (!TestClearPageLRU(page))
+ BUG();
+ list_del(&page->lru);
+ if (get_page_testone(page)) {
+ /*
+ * It is being freed elsewhere
+ */
+ __put_page(page);
+ SetPageLRU(page);
+ list_add(&page->lru, src);
+ continue;
+ } else {
+ list_add(&page->lru, dst);
+ nr_taken++;
+ }
+ }
+
+ *scanned = scan;
+ return nr_taken;
+}
+
Index: linux-2.6-git/mm/vmscan.c
===================================================================
--- linux-2.6-git.orig/mm/vmscan.c
+++ linux-2.6-git/mm/vmscan.c
@@ -509,6 +509,7 @@ keep:
}

#ifdef CONFIG_MIGRATION
+
static inline void move_to_lru(struct page *page)
{
list_del(&page->lru);
@@ -936,87 +937,9 @@ next:

return nr_failed + retry;
}
-
-/*
- * Isolate one page from the LRU lists and put it on the
- * indicated list with elevated refcount.
- *
- * Result:
- * 0 = page not on LRU list
- * 1 = page removed from LRU list and added to the specified list.
- */
-int isolate_lru_page(struct page *page)
-{
- int ret = 0;
-
- if (PageLRU(page)) {
- struct zone *zone = page_zone(page);
- spin_lock_irq(&zone->lru_lock);
- if (TestClearPageLRU(page)) {
- ret = 1;
- get_page(page);
- if (PageActive(page))
- del_page_from_active_list(zone, page);
- else
- del_page_from_inactive_list(zone, page);
- }
- spin_unlock_irq(&zone->lru_lock);
- }
-
- return ret;
-}
#endif

/*
- * zone->lru_lock is heavily contended. Some of the functions that
- * shrink the lists perform better by taking out a batch of pages
- * and working on them outside the LRU lock.
- *
- * For pagecache intensive workloads, this function is the hottest
- * spot in the kernel (apart from copy_*_user functions).
- *
- * Appropriate locks must be held before calling this function.
- *
- * @nr_to_scan: The number of pages to look through on the list.
- * @src: The LRU list to pull pages off.
- * @dst: The temp list to put pages on to.
- * @scanned: The number of pages that were scanned.
- *
- * returns how many pages were moved onto *@dst.
- */
-static int isolate_lru_pages(int nr_to_scan, struct list_head *src,
- struct list_head *dst, int *scanned)
-{
- int nr_taken = 0;
- struct page *page;
- int scan = 0;
-
- while (scan++ < nr_to_scan && !list_empty(src)) {
- page = lru_to_page(src);
- prefetchw_prev_lru_page(page, src, flags);
-
- if (!TestClearPageLRU(page))
- BUG();
- list_del(&page->lru);
- if (get_page_testone(page)) {
- /*
- * It is being freed elsewhere
- */
- __put_page(page);
- SetPageLRU(page);
- list_add(&page->lru, src);
- continue;
- } else {
- list_add(&page->lru, dst);
- nr_taken++;
- }
- }
-
- *scanned = scan;
- return nr_taken;
-}
-
-/*
* shrink_cache() adds the number of pages reclaimed to sc->nr_reclaimed
*/
static void shrink_cache(struct zone *zone, struct scan_control *sc)
Index: linux-2.6-git/include/linux/mm_use_once_policy.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_use_once_policy.h
+++ linux-2.6-git/include/linux/mm_use_once_policy.h
@@ -76,5 +76,8 @@ static inline int page_replace_activate(
return 1;
}

+extern int isolate_lru_pages(int nr_to_scan, struct list_head *src,
+ struct list_head *dst, int *scanned);
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_USEONCE_POLICY_H */
Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -88,6 +88,12 @@ typedef enum {
/* int page_replace_activate(struct page *page); */


+#ifdef CONFIG_MIGRATION
+extern int page_replace_isolate(struct page *p);
+#else
+static inline int page_replace_isolate(struct page *p) { return -ENOSYS; }
+#endif
+
#ifdef CONFIG_MM_POLICY_USEONCE
#include <linux/mm_use_once_policy.h>
#else
Index: linux-2.6-git/include/linux/swap.h
===================================================================
--- linux-2.6-git.orig/include/linux/swap.h
+++ linux-2.6-git/include/linux/swap.h
@@ -186,7 +186,6 @@ static inline int zone_reclaim(struct zo
#endif

#ifdef CONFIG_MIGRATION
-extern int isolate_lru_page(struct page *p);
extern int putback_lru_pages(struct list_head *l);
extern int migrate_page(struct page *, struct page *);
extern void migrate_page_copy(struct page *, struct page *);
@@ -195,7 +194,6 @@ extern int migrate_pages(struct list_hea
struct list_head *moved, struct list_head *failed);
extern int fail_migrate_page(struct page *, struct page *);
#else
-static inline int isolate_lru_page(struct page *p) { return -ENOSYS; }
static inline int putback_lru_pages(struct list_head *l) { return 0; }
static inline int migrate_pages(struct list_head *l, struct list_head *t,
struct list_head *moved, struct list_head *failed) { return -ENOSYS; }

2006-03-22 22:44:21

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 06/34] mm: page-replace-activate.patch


From: Peter Zijlstra <[email protected]>

Abstract page activation and the reclaimable condition.

API:

whether the page is reclaimable

reclaim_t page_replace_reclaimable(struct page *);

RECLAIM_KEEP - keep the page
RECLAIM_ACTIVATE - keep the page and activate
RECLAIM_REFERENCED - try to pageout even though referenced
RECLAIM_OK - try to pageout

activate the page

int page_replace_activate(struct page *page);
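
As an illustration of what a policy has to supply for these two hooks, here
is a minimal hypothetical sketch (it is not part of this patch set; the real
use-once implementations are in the mm_use_once_policy.h hunk further down).
A policy that keeps no reference information at all could simply always offer
the page for pageout:

static inline reclaim_t page_replace_reclaimable(struct page *page)
{
	/* hypothetical null policy: never keep, never activate */
	return RECLAIM_OK;
}

static inline int page_replace_activate(struct page *page)
{
	/* nothing to activate; returning 0 keeps the caller's
	 * pgactivate counter untouched */
	return 0;
}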

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_page_replace.h | 11 ++++++++
include/linux/mm_use_once_policy.h | 48 +++++++++++++++++++++++++++++++++++++
mm/vmscan.c | 42 ++++++++++----------------------
3 files changed, 72 insertions(+), 29 deletions(-)

Index: linux-2.6-git/include/linux/mm_use_once_policy.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_use_once_policy.h
+++ linux-2.6-git/include/linux/mm_use_once_policy.h
@@ -3,6 +3,9 @@

#ifdef __KERNEL__

+#include <linux/fs.h>
+#include <linux/rmap.h>
+
static inline void page_replace_hint_active(struct page *page)
{
SetPageActive(page);
@@ -28,5 +31,50 @@ static inline void page_replace_hint_use
{
}

+/* Called without lock on whether page is mapped, so answer is unstable */
+static inline int page_mapping_inuse(struct page *page)
+{
+ struct address_space *mapping;
+
+ /* Page is in somebody's page tables. */
+ if (page_mapped(page))
+ return 1;
+
+ /* Be more reluctant to reclaim swapcache than pagecache */
+ if (PageSwapCache(page))
+ return 1;
+
+ mapping = page_mapping(page);
+ if (!mapping)
+ return 0;
+
+ /* File is mmap'd by somebody? */
+ return mapping_mapped(mapping);
+}
+
+static inline reclaim_t page_replace_reclaimable(struct page *page)
+{
+ int referenced;
+
+ if (PageActive(page))
+ BUG();
+
+ referenced = page_referenced(page, 1);
+ /* In active use or really unfreeable? Activate it. */
+ if (referenced && page_mapping_inuse(page))
+ return RECLAIM_ACTIVATE;
+
+ if (referenced)
+ return RECLAIM_REFERENCED;
+
+ return RECLAIM_OK;
+}
+
+static inline int page_replace_activate(struct page *page)
+{
+ SetPageActive(page);
+ return 1;
+}
+
#endif /* __KERNEL__ */
#endif /* _LINUX_MM_USEONCE_POLICY_H */
Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -17,6 +17,17 @@ extern void __page_replace_add_drain(uns
extern int page_replace_add_drain_all(void);
extern void __pagevec_page_replace_add(struct pagevec *);

+typedef enum {
+ RECLAIM_KEEP,
+ RECLAIM_ACTIVATE,
+ RECLAIM_REFERENCED,
+ RECLAIM_OK,
+} reclaim_t;
+
+/* reclaim_t page_replace_reclaimable(struct page *); */
+/* int page_replace_activate(struct page *page); */
+
+
#ifdef CONFIG_MM_POLICY_USEONCE
#include <linux/mm_use_once_policy.h>
#else
Index: linux-2.6-git/mm/vmscan.c
===================================================================
--- linux-2.6-git.orig/mm/vmscan.c
+++ linux-2.6-git/mm/vmscan.c
@@ -244,27 +244,6 @@ int shrink_slab(unsigned long scanned, g
return ret;
}

-/* Called without lock on whether page is mapped, so answer is unstable */
-static inline int page_mapping_inuse(struct page *page)
-{
- struct address_space *mapping;
-
- /* Page is in somebody's page tables. */
- if (page_mapped(page))
- return 1;
-
- /* Be more reluctant to reclaim swapcache than pagecache */
- if (PageSwapCache(page))
- return 1;
-
- mapping = page_mapping(page);
- if (!mapping)
- return 0;
-
- /* File is mmap'd by somebody? */
- return mapping_mapped(mapping);
-}
-
static inline int is_page_cache_freeable(struct page *page)
{
return page_count(page) - !!PagePrivate(page) == 2;
@@ -431,7 +410,7 @@ static int shrink_list(struct list_head
struct address_space *mapping;
struct page *page;
int may_enter_fs;
- int referenced;
+ int referenced = 0;

cond_resched();

@@ -441,8 +420,6 @@ static int shrink_list(struct list_head
if (TestSetPageLocked(page))
goto keep;

- BUG_ON(PageActive(page));
-
sc->nr_scanned++;

if (!sc->may_swap && page_mapped(page))
@@ -455,10 +432,17 @@ static int shrink_list(struct list_head
if (PageWriteback(page))
goto keep_locked;

- referenced = page_referenced(page, 1);
- /* In active use or really unfreeable? Activate it. */
- if (referenced && page_mapping_inuse(page))
+ switch (page_replace_reclaimable(page)) {
+ case RECLAIM_KEEP:
+ goto keep_locked;
+ case RECLAIM_ACTIVATE:
goto activate_locked;
+ case RECLAIM_REFERENCED:
+ referenced = 1;
+ break;
+ case RECLAIM_OK:
+ break;
+ }

#ifdef CONFIG_SWAP
/*
@@ -568,8 +552,8 @@ free_it:
continue;

activate_locked:
- SetPageActive(page);
- pgactivate++;
+ if (page_replace_activate(page))
+ pgactivate++;
keep_locked:
unlock_page(page);
keep:

2006-03-22 22:33:08

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 08/34] mm: page-replace-move-scan_control.patch


From: Peter Zijlstra <[email protected]>

Move struct scan_control to the general page_replace header so that all
policies can make use of it.
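
With the structure in the shared header, an illustrative caller (hypothetical,
not from this patch set) would set up a scan_control and hand it to whatever
policy entry point is in use; only fields of the struct being moved are
touched here, and policy_shrink_zone() is a made-up name:

	struct scan_control sc = {
		.gfp_mask	  = GFP_KERNEL,
		.may_writepage	  = 1,
		.may_swap	  = 1,
		.swap_cluster_max = SWAP_CLUSTER_MAX,
	};

	sc.nr_to_scan = 128;	/* ask the policy to look at 128 pages */
	/* policy_shrink_zone(zone, &sc); */
	/* sc.nr_scanned and sc.nr_reclaimed now hold the accounting */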

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_page_replace.h | 30 ++++++++++++++++++++++++++++++
mm/vmscan.c | 30 ------------------------------
2 files changed, 30 insertions(+), 30 deletions(-)

Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -8,6 +8,36 @@
#include <linux/pagevec.h>
#include <linux/mm_inline.h>

+struct scan_control {
+ /* Ask refill_inactive_zone, or shrink_cache to scan this many pages */
+ unsigned long nr_to_scan;
+
+ /* Incremented by the number of inactive pages that were scanned */
+ unsigned long nr_scanned;
+
+ /* Incremented by the number of pages reclaimed */
+ unsigned long nr_reclaimed;
+
+ unsigned long nr_mapped; /* From page_state */
+
+ /* Ask shrink_caches, or shrink_zone to scan at this priority */
+ unsigned int priority;
+
+ /* This context's GFP mask */
+ gfp_t gfp_mask;
+
+ int may_writepage;
+
+ /* Can pages be swapped as part of reclaim? */
+ int may_swap;
+
+ /* This context's SWAP_CLUSTER_MAX. If freeing memory for
+ * suspend, we effectively ignore SWAP_CLUSTER_MAX.
+ * In this context, it doesn't matter that we scan the
+ * whole list at once. */
+ int swap_cluster_max;
+};
+
#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))

#ifdef ARCH_HAS_PREFETCH
Index: linux-2.6-git/mm/vmscan.c
===================================================================
--- linux-2.6-git.orig/mm/vmscan.c
+++ linux-2.6-git/mm/vmscan.c
@@ -52,36 +52,6 @@ typedef enum {
PAGE_CLEAN,
} pageout_t;

-struct scan_control {
- /* Ask refill_inactive_zone, or shrink_cache to scan this many pages */
- unsigned long nr_to_scan;
-
- /* Incremented by the number of inactive pages that were scanned */
- unsigned long nr_scanned;
-
- /* Incremented by the number of pages reclaimed */
- unsigned long nr_reclaimed;
-
- unsigned long nr_mapped; /* From page_state */
-
- /* Ask shrink_caches, or shrink_zone to scan at this priority */
- unsigned int priority;
-
- /* This context's GFP mask */
- gfp_t gfp_mask;
-
- int may_writepage;
-
- /* Can pages be swapped as part of reclaim? */
- int may_swap;
-
- /* This context's SWAP_CLUSTER_MAX. If freeing memory for
- * suspend, we effectively ignore SWAP_CLUSTER_MAX.
- * In this context, it doesn't matter that we scan the
- * whole list at once. */
- int swap_cluster_max;
-};
-
/*
* The list of shrinker callbacks used by to apply pressure to
* ageable caches.

2006-03-22 22:47:04

by Peter Zijlstra

[permalink] [raw]
Subject: [PATCH 07/34] mm: page-replace-move-macros.patch


From: Peter Zijlstra <[email protected]>

Move macros out of vmscan into the generic page replacement header so the
rest of the world can use them too.

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Marcelo Tosatti <[email protected]>

---

include/linux/mm_page_replace.h | 30 ++++++++++++++++++++++++++++++
mm/vmscan.c | 30 ------------------------------
2 files changed, 30 insertions(+), 30 deletions(-)

Index: linux-2.6-git/include/linux/mm_page_replace.h
===================================================================
--- linux-2.6-git.orig/include/linux/mm_page_replace.h
+++ linux-2.6-git/include/linux/mm_page_replace.h
@@ -8,6 +8,36 @@
#include <linux/pagevec.h>
#include <linux/mm_inline.h>

+#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
+
+#ifdef ARCH_HAS_PREFETCH
+#define prefetch_prev_lru_page(_page, _base, _field) \
+ do { \
+ if ((_page)->lru.prev != _base) { \
+ struct page *prev; \
+ \
+ prev = lru_to_page(&(_page->lru)); \
+ prefetch(&prev->_field); \
+ } \
+ } while (0)
+#else
+#define prefetch_prev_lru_page(_page, _base, _field) do { } while (0)
+#endif
+
+#ifdef ARCH_HAS_PREFETCHW
+#define prefetchw_prev_lru_page(_page, _base, _field) \
+ do { \
+ if ((_page)->lru.prev != _base) { \
+ struct page *prev; \
+ \
+ prev = lru_to_page(&(_page->lru)); \
+ prefetchw(&prev->_field); \
+ } \
+ } while (0)
+#else
+#define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0)
+#endif
+
/* void page_replace_hint_active(struct page *); */
/* void page_replace_hint_use_once(struct page *); */
extern void fastcall page_replace_add(struct page *);
Index: linux-2.6-git/mm/vmscan.c
===================================================================
--- linux-2.6-git.orig/mm/vmscan.c
+++ linux-2.6-git/mm/vmscan.c
@@ -93,36 +93,6 @@ struct shrinker {
long nr; /* objs pending delete */
};

-#define lru_to_page(_head) (list_entry((_head)->prev, struct page, lru))
-
-#ifdef ARCH_HAS_PREFETCH
-#define prefetch_prev_lru_page(_page, _base, _field) \
- do { \
- if ((_page)->lru.prev != _base) { \
- struct page *prev; \
- \
- prev = lru_to_page(&(_page->lru)); \
- prefetch(&prev->_field); \
- } \
- } while (0)
-#else
-#define prefetch_prev_lru_page(_page, _base, _field) do { } while (0)
-#endif
-
-#ifdef ARCH_HAS_PREFETCHW
-#define prefetchw_prev_lru_page(_page, _base, _field) \
- do { \
- if ((_page)->lru.prev != _base) { \
- struct page *prev; \
- \
- prev = lru_to_page(&(_page->lru)); \
- prefetchw(&prev->_field); \
- } \
- } while (0)
-#else
-#define prefetchw_prev_lru_page(_page, _base, _field) do { } while (0)
-#endif
-
/*
* From 0 .. 100. Higher means more swappy.
*/

2006-03-22 22:55:25

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 00/34] mm: Page Replacement Policy Framework

Peter Zijlstra <[email protected]> wrote:
>
>
> This patch-set introduces a page replacement policy framework and 4 new
> experimental policies.

Holy cow.

> The page replacement algorithm determines which pages to swap out.
> The current algorithm has some problems that are increasingly noticable, even
> on desktop workloads.

Rather than replacing the whole lot four times I'd really prefer to see
precise descriptions of these problems, see if we can improve the situation
incrementally rather than wholesale slash-n-burn...

Once we've done that work to the best of our ability, *then* we're in a
position to evaluate the performance benefits of this new work. Because
there's not much point in comparing known-to-have-unaddressed-problems old
code with fancy new code.

> Measurements:
>
> (Walltime, so lower is better)
>
> cyclic-anon ; Cyclic access pattern with anonymous memory.
> (http://programming.kicks-ass.net/benchmarks/cyclic-anon.c)
>
> 2.6.16-rc6 14:28
> 2.6.16-rc6-useonce 15:11
> 2.6.16-rc6-clockpro 10:51
> 2.6.16-rc6-cart 8:55
> 2.6.16-rc6-random 1:09:50
>
> cyclic-file ; Cyclic access pattern with file backed memory.
> (http://programming.kicks-ass.net/benchmarks/cyclic-file.c)
>
> 2.6.16-rc6 11:24
> 2.6.16-rc6-clockpro 8:14
> 2.6.16-rc6-cart 8:09
>
> webtrace ; Replay of an IO trace from the Umass trace repository
> (http://programming.kicks-ass.net/benchmarks/spc/)
>
> 2.6.16-rc6 8:27
> 2.6.16-rc6-useonce 8:24
> 2.6.16-rc6-clockpro 10:23
> 2.6.16-rc6-cart 15:30
> 2.6.16-rc6-random 15:52
>
> mdb-bench ; Low frequency benchmark.
> (http://linux-mm.org/PageReplacementTesting)
>
> 2.6.16-rc6 4:20:44
> 2.6.16-rc6 (mlock) 3:52:15
> 2.6.16-rc6-useonce 4:20:59
> 2.6.16-rc6-clockpro 3:56:17
> 2.6.16-rc6-cart 4:11:54
> 2.6.16-rc6-random 5:21:30

2.6.16-rc6 seems to do OK. I assume the cyclic patterns exploit the lru
worst case thing? Has consideration been given to tweaking the existing
code to detect the situation and work around the problem?
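
(For illustration only, and not the cyclic-anon.c benchmark linked in the
original posting: the LRU worst case being referred to is a strict cyclic
sweep over a working set slightly larger than RAM, so a recency-only policy
evicts every page just before it is needed again. A minimal sketch of such a
load, assuming 4 KiB pages:)

/* cyclic.c -- illustrative sketch only, not the real benchmark.
 * Sweeps an anonymous buffer a bit larger than RAM in a strict cycle.
 * Build: cc -O2 -o cyclic cyclic.c && ./cyclic <size-in-MiB> <passes>
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv)
{
	size_t mib    = argc > 1 ? strtoul(argv[1], NULL, 0) : 768;
	int    passes = argc > 2 ? atoi(argv[2]) : 4;
	size_t size   = mib << 20;
	size_t page   = 4096;		/* assumed page size */
	unsigned long sum = 0;
	char  *buf;
	size_t off;
	int    pass;

	buf = malloc(size);
	if (!buf) {
		perror("malloc");
		return 1;
	}
	memset(buf, 1, size);		/* fault the whole region in once */

	for (pass = 0; pass < passes; pass++)
		for (off = 0; off < size; off += page)
			sum += buf[off];	/* touch one byte per page, in order */

	printf("done, checksum %lu\n", sum);
	return 0;
}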

2006-03-22 23:04:05

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH 02/34] mm: page-replace-kconfig-makefile.patch

Peter Zijlstra wrote:
> From: Peter Zijlstra <[email protected]>
>
> Introduce the configuration option, and modify the Makefile.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Marcelo Tosatti <[email protected]>

For future patch posting, -please- use a sane email subject.

The email subject is used as a one-line summary for each changeset.
While "page-replace-kconfig-makefile.patch" certainly communicates
information, it's much less easy to read than normal. It also makes the
git changelog summary (git log $branch..$branch2 | git shortlog) that
Linus posts much uglier:

Peter Zijlstra:
[PATCH] mm: kill-page-activate.patch
[PATCH] mm: page-replace-kconfig-makefile.patch
[PATCH] mm: page-replace-insert.patch
[PATCH] mm: page-replace-use_once.patch

See http://linux.yyz.us/patch-format.html for more info.

Regards,

Jeff


2006-03-22 23:42:23

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH 00/34] mm: Page Replacement Policy Framework

Peter Zijlstra writes:

>
> This patch-set introduces a page replacement policy framework and 4 new
> experimental policies.
>
> The page replacement algorithm determines which pages to swap out.
> The current algorithm has some problems that are increasingly noticable, even
> on desktop workloads. As said, this patch-set introduces 4 new algorithms.

Wow with a new cpu scheduler and new vm we could fork linux now and take
over the world!...

Or not...

Good luck :)

Cheers,
Con



2006-03-23 02:21:18

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH 00/34] mm: Page Replacement Policy Framework

Andrew Morton wrote:
> Peter Zijlstra <[email protected]> wrote:
>
>>
>>This patch-set introduces a page replacement policy framework and 4 new
>>experimental policies.
>
>
> Holy cow.
>
>
>>The page replacement algorithm determines which pages to swap out.
>>The current algorithm has some problems that are increasingly noticable, even
>>on desktop workloads.
>
>
> Rather than replacing the whole lot four times I'd really prefer to see
> precise descriptions of these problems, see if we can improve the situation
> incrementally rather than wholesale slash-n-burn...
>

The other thing is that a lot of the "policy" stuff you've abstracted
out is actually low-level "mechanism" stuff that has implications beyond
page reclaim. Taking a refcount on lru pages, for example.

Also, as you work and find incremental improvements to the current code,
you should be submitting them (eg. patch 25, or patch 1) rather than
sitting on them and sending them in a huge patchset where they don't
really belong.

Some of the API names aren't very nice either. It's great that you want
to keep the namespace consistent, but it shouldn't be at the expense of
more descriptive names, and having the page_replace_ prefix itself makes
many functions read like crap. I'd suggest something like a pgrep_
prefix and try to make the rest of the name make sense.

Aside from all that, I'm with Andrew in that problems need to be
identified first and foremost. But also I don't like the chances of this
whole framework flying at all -- Linus vetoed a similar framework for
sched.c that was actually a reasonable API, with little or no
consequences outside sched.c. With good reason.

Nice work, though :)

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-23 04:02:13

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 00/34] mm: Page Replacement Policy Framework

On Wed, 22 Mar 2006, Andrew Morton wrote:

> 2.6.16-rc6 seems to do OK. I assume the cyclic patterns exploit the lru
> worst case thing? Has consideration been given to tweaking the existing
> code to detect the situation and work around the problem?

This can certainly be done. Rate-based clock-pro isn't that
far away mechanically from the current 2.6 code and can be
introduced in small steps.

I'll just have to make it work again ;)

--
All Rights Reversed

2006-03-23 17:53:25

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 00/34] mm: Page Replacement Policy Framework


On Wed, Mar 22, 2006 at 02:51:32PM -0800, Andrew Morton wrote:
> Peter Zijlstra <[email protected]> wrote:
> >
> >
> > This patch-set introduces a page replacement policy framework and 4 new
> > experimental policies.
>
> Holy cow.
>
> > The page replacement algorithm determines which pages to swap out.
> > The current algorithm has some problems that are increasingly noticable, even
> > on desktop workloads.
>
> Rather than replacing the whole lot four times I'd really prefer to see
> precise descriptions of these problems, see if we can improve the situation
> incrementally rather than wholesale slash-n-burn...
>
> Once we've done that work to the best of our ability, *then* we're in a
> position to evaluate the performance benefits of this new work. Because
> there's not much point in comparing known-to-have-unaddressed-problems old
> code with fancy new code.

IMHO the page replacement framework intent is wider than fixing the
currently known performance problems.

It allows easier implementation of new algorithms, which are being
invented/adapted over time as necessity appears. At the moment Linux is
stuck with a single policy - and if you think for a moment about the
wide range of scenarios where Linux is being used its easy to conclude
that one policy can't fit all.

It would be great to provide an interface for easier development of
such ideas - keep in mind that page replacement is an area of active
research.

One example (which I mentioned several times) is power saving:

PB-LRU: A Self-Tuning Power Aware Storage Cache Replacement Algorithm
for Conserving Disk Energy.

> 2.6.16-rc6 seems to do OK. I assume the cyclic patterns exploit the lru
> worst case thing? Has consideration been given to tweaking the existing
> code, detect the situation and work avoid the problem?

Use-once fixes the cyclic access pattern case - but at the moment we don't
have use-once for memory mapped pages.

http://marc.theaimsgroup.com/?l=linux-mm&m=113721453502138&w=2

Nick mentions:

"Yes, I found that also doing use-once on mapped pages caused fairly
huge slowdowns in some cases. File IO could much more easily cause X and
its applications to get swapped out."

Anyway, that's not the only problem with LRU, but one of them. The most
fundamental one is the lack of page access frequency bookkeeping:

http://www.linux-mm.org/PageReplacementTesting

Frequency

The most significant issue with LRU is that it uses too little
information on which to base the replacement decision: recency alone. It does not
take frequency of page accesses into account.

Here is one example from the LRU-K paper.

"The LRU-K Page-Replacement Algorithm for Database Disk Buffering":

"Consider a multi-user database application, which references randomly
chosen customer records through a clustered B-tree indexed key, CUST-ID,
to retrieve desired information. Assume simplistically that 20,000
customers exist, that a customer record is 2000 bytes in length, and
that space needed for the B-tree index at the leaf level, free space
included, is 20 bytes for each key entry. Then if disk pages contain
4000 bytes of usable space and can be packed full, we require 100 pages
to hold the leaf level nodes of the B-tree index (there is a single
B-tree root node), and 10,000 pages to hold the records. The pattern of
reference to these pages (ignoring the B-tree root node) is clearly:
I1, R1, I2, R2, I3, R3, ... alternate references to random index leaf
pages and record pages. If we can only afford to buffer 101 pages
in memory for this application, the B-tree root node is automatic;
we should buffer all the B-tree leaf pages, since each of them is
referenced with a probability of .005 (once in each 200 general *page*
references), while it is clearly wasteful to displace one of these leaf
pages with a data *page*, since data pages have only .00005 probability
of reference (once in each 20,000 general *page* references). Using the
LRU *algorithm*, however, the pages held in memory buffers will be the
hundred most recently referenced ones. To a first approximation, this
means 50 B-tree leaf pages and 50 record pages. Given that a *page* gets
no extra credit for being referenced twice in the recent past and that
this is more likely to happen with B-tree leaf pages, there will even
be slightly more data pages present in memory than leaf pages. This
is clearly inappropriate behavior for a very common paradigm of disk
accesses."

To me it appears natural that page replacement should be pluggable and
not hard coded in the operating system.


2006-03-23 18:13:50

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 00/34] mm: Page Replacement Policy Framework

Hi Nick,

On Thu, Mar 23, 2006 at 01:21:08PM +1100, Nick Piggin wrote:
> Andrew Morton wrote:
> >Peter Zijlstra <[email protected]> wrote:
> >
> >>
> >>This patch-set introduces a page replacement policy framework and 4 new
> >>experimental policies.
> >
> >
> >Holy cow.
> >
> >
> >>The page replacement algorithm determines which pages to swap out.
> >>The current algorithm has some problems that are increasingly noticable,
> >>even
> >>on desktop workloads.
> >
> >
> >Rather than replacing the whole lot four times I'd really prefer to see
> >precise descriptions of these problems, see if we can improve the situation
> >incrementally rather than wholesale slash-n-burn...
> >
>
> The other thing is that a lot of the "policy" stuff you've abstracted
> out is actually low-level "mechanism" stuff that has implications beyond
> page reclaim. Taking a refcount on lru pages, for example.

On "cache pages" you mean :)

Yes, some low-level mechanisms have also been abstracted away... I think
a nice way to avoid explicit knowledge of page reference acquisition at
the moment of candidate selection hasn't been found.

Do you have any suggestions?

> you should be submitting them (eg. patch 25, or patch 1) rather than
> sitting on them and sending them in a huge patchset where they don't
> really belong.

I guess Peter and myself expected folks to criticise and help shape the
API to something acceptable.

BTW, patches 1 and 25 are not crucial improvements for mainline (there's
not much point in having them in mainline), and I don't see any others?

> Some of the API names aren't very nice either. It's great that you want
> to keep the namespace consistent, but it shouldn't be at the expense of
> more descriptive names, and having the page_replace_ prefix itself makes
> many functions read like crap. I'd suggest something like a pgrep_
> prefix and try to make the rest of the name make sense.

"pgrep_" looks more pleasant to me.

> Aside from all that, I'm with Andrew in that problems need to be
> identified first and foremost.

See my previous message.

> But also I don't like the chances of this
> whole framework flying at all -- Linus vetoed a similar framework for
> sched.c that was actually a reasonable API, with little or no
> consequences outside sched.c. With good reason.

Aren't we talking about very different things here? IMHO there is a lot
of point in allowing pluggable page replacement instead of trying to
make one policy fit all needs (which is obviously impossible).

> Nice work, though :)

Indeed - Peter has done a very nice job.


2006-03-23 18:16:31

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 00/34] mm: Page Replacement Policy Framework



On Thu, 23 Mar 2006, Marcelo Tosatti wrote:
>
> IMHO the page replacement framework intent is wider than fixing the
> currently known performance problems.
>
> It allows easier implementation of new algorithms, which are being
> invented/adapted over time as necessity appears.

Yes and no.

It smells wonderful from a pluggable page replacement standpoint, but
here's a couple of observations/questions:
a) the current one actually seems to have beaten the on-comers (except
for loads that were actually made up to try to defeat LRU)
b) is page replacement actually a huge issue?

Now, the reason I ask about (b) is that these days, you buy a Mac Mini,
and it comes with half a gig of RAM, and some apple users seem to worry
about the fact that the UMA graphics removes 50MB or something of that is
a problem.

IOW, just under half a _gigabyte_ of RAM is apparently considered to be
low end, and this is when talking about low-end (modern) hardware!

And don't tell me that the high-end people care, because both databases
(high end commercial) and video/graphics editing (high end desktop) very
much do _not_ care, since they tend to try to do their own memory
management anyway.

> One example (which I mentioned several times) is power saving:
>
> PB-LRU: A Self-Tuning Power Aware Storage Cache Replacement Algorithm
> for Conserving Disk Energy.

Please name a load that really actually hits the page replacement today.

It smells like university research to me.

And don't flame me: I'm perfectly happy to be shown to be wrong. I just
get a very strong feeling that the people who care about tight memory
conditions and perhaps about page replacement are the same people who
think that our kernel is too big - the embedded people. And somehow I'm
not convinced they want the added abstraction either - they'd probably
rather just have a smaller kernel ;)

What I'm trying to say is that page replacement hasn't been what seems to
have worried people over the last year or two. We had some ugly problems
in the early 2.4.x timeframe, and I'll claim that most (but not all) of
those were related to highmem/zoning issues which we largely solved. Which
was about page replacement, but really a very specific issue within that
area.

So seriously, I suspect Andrew's "Holy cow" comes from the fact that he is
more worried about VM maintainability and stability than page replacement.
I certainly am.

Linus

2006-03-23 18:26:50

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 00/34] mm: Page Replacement Policy Framework

On Thu, 23 Mar 2006, Linus Torvalds wrote:

> a) the current one actually seems to have beaten the on-comers (except
> for loads that were actually made up to try to defeat LRU)

A valid concern. I am of the opinion that we should try to
introduce change in small increments, wherever possible.

> b) is page replacement actually a huge issue?

Being involved in RHEL support occasionally: YES!

Page replacement may be doing the right thing in 99% of the
cases, but the misbehaviour in "corner cases" can be very
significant. I put "corner cases" in quotes because they
are not cornercases to the users - these loads tend to be
the main workload on some systems!

IMHO, improving performance for most workloads is nowhere
near as important as increasing the coverage of the VM,
ie. the number of workloads that it handles well.


--
All Rights Reversed

2006-03-23 18:48:51

by Diego Calleja

[permalink] [raw]
Subject: Re: [PATCH 00/34] mm: Page Replacement Policy Framework

El Thu, 23 Mar 2006 10:15:47 -0800 (PST),
Linus Torvalds <[email protected]> escribió:

> IOW, just under half a _gigabyte_ of RAM is apparently considered to be
> low end, and this is when talking about low-end (modern) hardware!

If it's considered "low-end" it's because people actually use that
memory for something and the system starts swapping, not because it's
trendy.

The "powerful machines who never swaps" are always a minority. Being geeks
as we are we try to have the greatest machine possible, but the vast majority
of real users are "underpowered" I'm not talking of pentium 1 stuff, I can bet
there're far more pentium 4 machines with 256 MB out there than with 1 GB.

I know you don't hit those problems because you use expensive machines
with lots of ram ;) But in the _real_ world, lots of the machines are
already wasting most of their RAM by running the desktop environment alone.

Diego Calleja (A user with 1 GB of RAM who usually gets his system
into swapping easily by using desktop apps and could benefit from
better page replacement policies)

2006-03-23 19:03:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 00/34] mm: Page Replacement Policy Framework

On Thu, 2006-03-23 at 10:15 -0800, Linus Torvalds wrote:
>
> On Thu, 23 Mar 2006, Marcelo Tosatti wrote:
> >
> > IMHO the page replacement framework intent is wider than fixing the
> > currently known performance problems.
> >
> > It allows easier implementation of new algorithms, which are being
> > invented/adapted over time as necessity appears.
>
> Yes and no.
>
> It smells wonderful from a pluggable page replacement standpoint, but
> here's a couple of observations/questions:
> a) the current one actually seems to have beaten the on-comers (except
> for loads that were actually made up to try to defeat LRU)
> b) is page replacement actually a huge issue?
>
> Now, the reason I ask about (b) is that these days, you buy a Mac Mini,
> and it comes with half a gig of RAM, and some apple users seem to worry
> about the fact that the UMA graphics removes 50MB or something of that is
> a problem.
>
> IOW, just under half a _gigabyte_ of RAM is apparently considered to be
> low end, and this is when talking about low-end (modern) hardware!
>
> And don't tell me that the high-end people care, because both databases
> (high end commercial) and video/graphics editing (high end desktop) very
> much do _not_ care, since they tend to try to do their own memory
> management anyway.

Sure, however there is quite a large group in between. Especially the
desktop users.

For example see this thread:
http://lkml.org/lkml/2006/3/23/25
where Jens says:
http://lkml.org/lkml/2006/3/23/39

Typical desktop use cases that hit page-reclaim are things like
burning/copying/unrar'ing dvd images. Also bittorrent can hit the
page-cache quite hard.

Not to mention memory hogs such as Gnome and OpenOffice.

Other things that hit my page-cache are: git, cscope and grep.

> > One example (which I mentioned several times) is power saving:
> >
> > PB-LRU: A Self-Tuning Power Aware Storage Cache Replacement Algorithm
> > for Conserving Disk Energy.
>
> Please name a load that really actually hits the page replacement today.
>
> It smells like university research to me.
>
> And don't flame me: I'm perfectly happy to be shown to be wrong. I just
> get a very strong feeling that the people who care about tight memory
> conditions and perhaps about page replacement are the same people who
> think that our kernel is too big - the embedded people. And somehow I'm
> not convinced they want the added abstraction either - they'd probably
> rather just have a smaller kernel ;)

Well, I for one am not an embedded user, nor do I have any interests in
that direction.

As for the added abstraction, I find that having abstraction layers is
generally good - it forces one to think about the concepts involved.
Layering violations become painfully clear.

> What I'm trying to say is that page replacement hasn't been what seems to
> have worried people over the last year or two. We had some ugly problems
> in the early 2.4.x timeframe, and I'll claim that most (but not all) of
> those were related to highmem/zoning issues which we largely solved. Which
> was about page replacement, but really a very specific issue within that
> area.

As Rik said, it would be good to have the VM work in more cases.

> So seriously, I suspect Andrew's "Holy cow" comes from the fact that he is
> more worried about VM maintainability and stability than page replacement.
> I certainly am.

I can understand this with regard to having more than one policy in the
kernel. However I feel that if the abstraction is maintained, the
stock-kernel could just carry one, preferably CLOCK-Pro. This should
greatly reduce the maintenance burden.

However PB-LRU might be very interesting for laptop users (haven't
looked into that one yet though so just rambling here).

Peter

2006-03-23 19:30:55

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 00/34] mm: Page Replacement Policy Framework

Hi!

On Thu, Mar 23, 2006 at 10:15:47AM -0800, Linus Torvalds wrote:
>
>
> On Thu, 23 Mar 2006, Marcelo Tosatti wrote:
> >
> > IMHO the page replacement framework intent is wider than fixing the
> > currently known performance problems.
> >
> > It allows easier implementation of new algorithms, which are being
> > invented/adapted over time as necessity appears.
>
> Yes and no.
>
> It smells wonderful from a pluggable page replacement standpoint, but
> here's a couple of observations/questions:
> a) the current one actually seems to have beaten the on-comers (except
> for loads that were actually made up to try to defeat LRU)

Nope, LRU only beat CLOCK-Pro/CART on the "UMass trace" (which is trace
replay, which can be very sensitive and not necessarily meaningful).
Needs more study though (talk is cheap).

Anyway, smarter algorithms such as these two have been proven to be more
efficient than LRU under a large range of real life loads. LRU's lack of
frequency information is really terrible.

LRU's worst case scenarios were well known before I was born.

> b) is page replacement actually a huge issue?
>
> Now, the reason I ask about (b) is that these days, you buy a Mac Mini,
> and it comes with half a gig of RAM, and some apple users seem to worry
> about the fact that the UMA graphics removes 50MB or something of that is
> a problem.

And why do they worry about having 50MB less memory? Because memory
_is_ a very precious resource.

Therefore it should be managed as efficiently and intelligently as
possible.

> IOW, just under half a _gigabyte_ of RAM is apparently considered to be
> low end, and this is when talking about low-end (modern) hardware!

Exactly. A tight memory condition is not a prerequisite for page
replacement being a bottleneck.

> And don't tell me that the high-end people care, because both databases
> (high end commercial) and video/graphics editing (high end desktop) very
> much do _not_ care, since they tend to try to do their own memory
> management anyway.

So basically you're saying that if applications want smarter memory
management under Linux they should do it on their own?

What percentage of the applications can afford to do that? Oracle, yeah.

> > One example (which I mentioned several times) is power saving:
> >
> > PB-LRU: A Self-Tuning Power Aware Storage Cache Replacement Algorithm
> > for Conserving Disk Energy.
>
> Please name a load that really actually hits the page replacement today.

Any sort of indexing process where the dataset is larger than memory
(cscope is a great example) or database load cares _very much_, just
to name a few. The cost of waiting for a page to come from disk is
extremely bad compared to the time of a memory access, you know.

> It smells like university research to me.
>
> And don't flame me: I'm perfectly happy to be shown to be wrong. I just
> get a very strong feeling that the people who care about tight memory
> conditions and perhaps about page replacement are the same people who
> think that our kernel is too big - the embedded people. And somehow I'm
> not convinced they want the added abstraction either - they'd probably
> rather just have a smaller kernel ;)

Not really - small/medium end servers and desktops care more.

Embedded people don't care about page replacement: such scenarios have,
99.9% of the time, a working set smaller than total memory size.

> What I'm trying to say is that page replacement hasn't been what seems to
> have worried people over the last year or two.

Well, the system certainly works with LRU, but it can be faster with
a smarter algorithm. The popularity of the swap prefetching patchset is
a clear indication to me that people do care about memory management,
and that they usually have a working set larger than memory.

- "Every time I wake up in the morning updatedb has thrown my applications
out of memory".

- "Linux is awful every time I untar something larger than memory to disk".

BTW, Peter just pointed me to a message from Jens where he describes his
experience with CLOCK-Pro (http://lkml.org/lkml/2006/3/23/39):

"
> To prefetch applications from swap to physical memory when there is
> little activity seems so obvious that I can't believe it hasn't been
> implemented before.

It's a heuristic, and sometimes that will work well and sometimes it
will not. What if during this period of inactivity, you start bringing
everything in from swap again, only to page it right out because the
next memory hog starts running? From a logical standpoint, swap prefetch
and the vm must work closely together to avoid paging in things which
really aren't needed.

I've been running with the clockpro for the past week, which seems to
handle this sort of thing extremely well. On a 1GB machine, running the
vanilla kernels usually didn't see me use any swap. With the workload I
use, I typically had about ~100MiB of page cache and the rest of memory
full. Running clockpro, it's stabilized at ~288MiB of swap leaving me
more room for cache - with very rare paging activity going on. Hardly a
scientific test, but the feel is good."

> We had some ugly problems in the early 2.4.x timeframe, and I'll
> claim that most (but not all) of those were related to highmem/zoning
> issues which we largely solved. Which was about page replacement, but
> really a very specific issue within that area.
>
> So seriously, I suspect Andrew's "Holy cow" comes from the fact that
> he is more worried about VM maintainability and stability than page
> replacement. I certainly am.

It would be great if you/Andrew could go over the patchset and give your
comments on specific points, telling us where you think the API is dumb
and could be improved, etc.

It does not seem to introduce a hell of a lot of maintainability burden to me;
it's simple in general (and can certainly get better).

Page replacement is, without any doubt, an important issue. It does
matter _a lot_.


2006-03-23 20:51:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 00/34] mm: Page Replacement Policy Framework



On Thu, 23 Mar 2006, Marcelo Tosatti wrote:
>
> Nope, LRU only beat CLOCK-Pro/CART on the "UMass trace" (which is trace
> replay, which can be very sensitive and not necessarily meaningful).
> Needs more study though (talk is cheap).

Umm.. That _trace_ was the only thing that seemed to have any real-life
dataset, afaik. The others were totally synthetic.

> Anyway, smarter algorithms such as this two have been proven to be more
> efficient than LRU under a large range of real life loads. LRU's lack of
> frequency information is really terrible.
>
> LRU's worst case scenarios were well known before I was born.

The kernel doesn't actually use LRU, so the fact that LRU isn't good seems
a non-argument.

> - "Every time I wake up in the morning updatedb has thrown my applications
> out of memory".
>
> - "Linux is awful every time I untar something larger than memory to disk".

People seem to think that the fact that there are bad behaviours means
that there are somehow "magic" algorithms that don't have bad behaviours.

I'd really suggest somebody show better real-life numbers with a new
algorithm _before_ we do anything like this.

Linus

2006-03-23 21:00:20

by Rik van Riel

[permalink] [raw]
Subject: Re: [PATCH 00/34] mm: Page Replacement Policy Framework

On Thu, 23 Mar 2006, Linus Torvalds wrote:

> > LRU's worst case scenarios were well known before I was born.
>
> The kernel doesn't actually use LRU, so the fact that LRU isn't good seems
> a non-argument.

Agreed. The current algorithm in the kernel is close to 2Q,
just without the corrections that 2Q gets from non-resident
history and the further tuning that is done by clock-pro.

> > - "Every time I wake up in the morning updatedb has thrown my applications
> > out of memory".
> >
> > - "Linux is awful every time I untar something larger than memory to disk".
>
> People seem to think that the fact that there are bad behaviours means
> that there are somehow "magic" algorithms that don't have bad behaviours.
>
> I'd really suggest somebody show better real-life numbers with a new
> algorithm _before_ we do anything like this.

Remember that it's not necessarily about "making a VM that
handles the common case better", but rather about "making
the VM behave well more of the time".

Furthermore, all VM benchmarks are corner cases. After all,
most systems have enough memory most of the time, and will
not be evicting much at all. This makes interpreting VM
benchmark results harder than the interpretation of many
other benchmarks...

--
All Rights Reversed

2006-03-24 15:07:00

by Helge Hafting

[permalink] [raw]
Subject: Re: [PATCH 00/34] mm: Page Replacement Policy Framework

Linus Torvalds wrote:

>On Thu, 23 Mar 2006, Marcelo Tosatti wrote:
>
>
>>IMHO the page replacement framework intent is wider than fixing the
>>currently known performance problems.
>>
>>It allows easier implementation of new algorithms, which are being
>>invented/adapted over time as necessity appears.
>>
>>
>
>Yes and no.
>
>It smells wonderful from a pluggable page replacement standpoint, but
>here's a couple of observations/questions:
> a) the current one actually seems to have beaten the on-comers (except
> for loads that were actually made up to try to defeat LRU)
> b) is page replacement actually a huge issue?
>
>Now, the reason I ask about (b) is that these days, you buy a Mac Mini,
>and it comes with half a gig of RAM, and some apple users seem to worry
>about the fact that the UMA graphics removes 50MB or something of that is
>a problem.
>
>IOW, just under half a _gigabyte_ of RAM is apparently considered to be
>low end, and this is when talking about low-end (modern) hardware!
>
>And don't tell me that the high-end people care, because both databases
>(high end commercial) and video/graphics editing (high end desktop) very
>much do _not_ care, since they tend to try to do their own memory
>management anyway.
>
>
>
>>One example (which I mentioned several times) is power saving:
>>
>>PB-LRU: A Self-Tuning Power Aware Storage Cache Replacement Algorithm
>>for Conserving Disk Energy.
>>
>>
>
>Please name a load that really actually hits the page replacement today.
>
>
Any load where things go into swap?
Sure - I have 512MB in my machines, I still tend to
get 20-50 MB in swap after a while. And now and then
I wait for stuff to swap in again.

I don't claim the current system is bad, and a more gradual
approach may very well be the better way. But if someone can
demonstrate an improvement, then I'm for it. Getting an
improved selection for the 50M in swap will be noticeable at times.

Remember, people compensate for the bigger memory with bigger apps.
Linux should run those apps well. ;-)

>It smells like university research to me.
>
>And don't flame me: I'm perfectly happy to be shown to be wrong. I just
>get a very strong feeling that the people who care about tight memory
>conditions and perhaps about page replacement are the same people who
>think that our kernel is too big - the embedded people.
>
>
Well, how about desktop users? Snappiness is a nice thing.
The "enough memory" argument can be turned around too.
If the power users don't care because they have the memory - then
they shouldn't worry about someone changing the replacement
algorithms. :-)

>What I'm trying to say is that page replacement hasn't been what seems to
>have worried people over the last year or two. We had some ugly problems
>in the early 2.4.x timeframe, and I'll claim that most (but not all) of
>those were related to highmem/zoning issues which we largely solved. Which
>was about page replacement, but really a very specific issue within that
>area.
>
>So seriously, I suspect Andrew's "Holy cow" comes from the fact that he is
>more worried about VM maintainability and stability than page replacement.
>I certainly am.
>
>
Sure, the incremental approach is good, and this replaceable
system may be a thing for interested VM developers.
But there is definitely interest if they can show improvement.

Helge Hafting

2006-03-28 23:07:01

by Elladan

[permalink] [raw]
Subject: Re: [PATCH 00/34] mm: Page Replacement Policy Framework

On Thu, Mar 23, 2006 at 10:15:47AM -0800, Linus Torvalds wrote:
>
>
> On Thu, 23 Mar 2006, Marcelo Tosatti wrote:
> >
> > IMHO the page replacement framework intent is wider than fixing the
> > currently known performance problems.
> >
> > It allows easier implementation of new algorithms, which are being
> > invented/adapted over time as necessity appears.
>
> Yes and no.
>
> It smells wonderful from a pluggable page replacement standpoint, but
> here's a couple of observations/questions:
> a) the current one actually seems to have beaten the on-comers (except
> for loads that were actually made up to try to defeat LRU)
> b) is page replacement actually a huge issue?
>
> Now, the reason I ask about (b) is that these days, you buy a Mac Mini,
> and it comes with half a gig of RAM, and some apple users seem to worry
> about the fact that the UMA graphics removes 50MB or something of that is
> a problem.

Data point:

I run into swap all the time on my 1gig machine. There are a few reasons for
this.

* Applications are incredibly bloated. Just running a bunch of gnome apps
sucks down 1000 megs almost instantly. However, these apps don't seem to use
most of the space they bloat into, so after a bit of fighting for VM the
chaff gets forced out and they run fine.

* Apps are also incredibly buggy. Eg. Firefox seems to leak up to 50 megs per
second in some workloads, so I run it for a day or two and my machine tends
to go heavily into swap.

* VM system prefers disk cache over applications. Eg. updatedb runs at
3am and indexes all my files. Since the applications were idle, the
VM decides to page out all my executables and fill my ram with page
cache which is only used once. In the morning, my machine spends a few
minutes paging everything back in.

* Similarly, I have a 2gig machine available, and it's also showing about 512MB
swapped out and also 500MB free.

-J