Unmapped patches - Use two LRU:s per zone.
These patches break out the per-zone LRU into two separate LRU:s - one for
mapped pages and one for unmapped pages. The patches also introduce guarantee
support, which allows the user to set what percentage of all pages per node
should be kept in memory as mapped or unmapped pages. This guarantee
makes it possible to adjust the VM behaviour depending on the workload.
Reasons behind the LRU separation:
- Avoid unnecessary page scanning.
The current VM implementation rotates mapped pages on the active list
until the number of mapped pages is high enough to start unmapping and paging out.
By using two LRU:s we can avoid this scanning and shrink/rotate unmapped
pages only, not touching mapped pages until the threshold is reached.
- Make it possible to adjust the VM behaviour.
In some cases the user might want to guarantee that a certain number of
pages is kept in memory, overriding the standard behaviour. Separating
pages into mapped and unmapped LRU:s allows such a guarantee with low overhead.
I've performed many tests on a Dual PIII machine while varying the amount of
RAM available. Kernel compiles on a 64MB configuration get a small speedup,
while other configurations and workloads seem unaffected.
Apply on top of 2.6.16-rc5.
Comments?
/ magnus
Implement per-LRU guarantee through sysctl.
This patch introduces two new sysctl files, "node_mapped_guar" and
"node_unmapped_guar". Each file contains one percentage per node and tells
the system what percentage of all pages should be kept in RAM as mapped
or unmapped pages.
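As a usage illustration (a minimal user-space sketch, not part of the patch
itself): writing "30" on a single-node machine tells the VM not to shrink the
unmapped LRU below roughly 30% of that node's pages; on a machine with several
nodes, one value per node is written.

        #include <stdio.h>

        int main(void)
        {
                /* Hypothetical example: guarantee roughly 30% unmapped pages
                 * on node 0. With N nodes, write N space-separated values. */
                FILE *f = fopen("/proc/sys/vm/node_unmapped_guar", "w");

                if (!f)
                        return 1;
                fprintf(f, "30\n");
                fclose(f);
                return 0;
        }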
Signed-off-by: Magnus Damm <[email protected]>
---
include/linux/mmzone.h | 7 ++++
include/linux/sysctl.h | 2 +
kernel/sysctl.c | 18 ++++++++++++
mm/vmscan.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 95 insertions(+)
--- from-0003/include/linux/mmzone.h
+++ to-work/include/linux/mmzone.h 2006-03-06 18:07:22.000000000 +0900
@@ -124,6 +124,7 @@ struct lru {
unsigned long nr_inactive;
unsigned long nr_scan_active;
unsigned long nr_scan_inactive;
+ unsigned long nr_guaranteed;
};
#define LRU_MAPPED 0
@@ -459,6 +460,12 @@ int lowmem_reserve_ratio_sysctl_handler(
void __user *, size_t *, loff_t *);
int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int, struct file *,
void __user *, size_t *, loff_t *);
+extern int sysctl_node_mapped_guar[MAX_NUMNODES];
+int node_mapped_guar_sysctl_handler(struct ctl_table *, int, struct file *,
+ void __user *, size_t *, loff_t *);
+extern int sysctl_node_unmapped_guar[MAX_NUMNODES];
+int node_unmapped_guar_sysctl_handler(struct ctl_table *, int, struct file *,
+ void __user *, size_t *, loff_t *);
#include <linux/topology.h>
/* Returns the number of the current Node. */
--- from-0002/include/linux/sysctl.h
+++ to-work/include/linux/sysctl.h 2006-03-06 18:07:22.000000000 +0900
@@ -185,6 +185,8 @@ enum
VM_PERCPU_PAGELIST_FRACTION=30,/* int: fraction of pages in each percpu_pagelist */
VM_ZONE_RECLAIM_MODE=31, /* reclaim local zone memory before going off node */
VM_ZONE_RECLAIM_INTERVAL=32, /* time period to wait after reclaim failure */
+ VM_NODE_MAPPED_GUAR=33, /* percent of node memory guaranteed mapped */
+ VM_NODE_UNMAPPED_GUAR=34, /* percent of node memory guaranteed unmapped */
};
--- from-0002/kernel/sysctl.c
+++ to-work/kernel/sysctl.c 2006-03-06 18:07:22.000000000 +0900
@@ -899,6 +899,24 @@ static ctl_table vm_table[] = {
.strategy = &sysctl_jiffies,
},
#endif
+ {
+ .ctl_name = VM_NODE_MAPPED_GUAR,
+ .procname = "node_mapped_guar",
+ .data = &sysctl_node_mapped_guar,
+ .maxlen = sizeof(sysctl_node_mapped_guar),
+ .mode = 0644,
+ .proc_handler = &node_mapped_guar_sysctl_handler,
+ .strategy = &sysctl_intvec,
+ },
+ {
+ .ctl_name = VM_NODE_UNMAPPED_GUAR,
+ .procname = "node_unmapped_guar",
+ .data = &sysctl_node_unmapped_guar,
+ .maxlen = sizeof(sysctl_node_unmapped_guar),
+ .mode = 0644,
+ .proc_handler = &node_unmapped_guar_sysctl_handler,
+ .strategy = &sysctl_intvec,
+ },
{ .ctl_name = 0 }
};
--- from-0004/mm/vmscan.c
+++ to-work/mm/vmscan.c 2006-03-06 18:07:22.000000000 +0900
@@ -29,6 +29,7 @@
#include <linux/backing-dev.h>
#include <linux/rmap.h>
#include <linux/topology.h>
+#include <linux/sysctl.h>
#include <linux/cpu.h>
#include <linux/cpuset.h>
#include <linux/notifier.h>
@@ -1284,6 +1285,11 @@ shrink_lru(struct zone *zone, int lru_nr
unsigned long nr_active;
unsigned long nr_inactive;
+ /* No need to scan if guarantee is not fulfilled */
+
+ if ((lru->nr_active + lru->nr_inactive) <= lru->nr_guaranteed)
+ return;
+
atomic_inc(&zone->reclaim_in_progress);
/*
@@ -1325,6 +1331,68 @@ shrink_lru(struct zone *zone, int lru_nr
atomic_dec(&zone->reclaim_in_progress);
}
+spinlock_t sysctl_node_guar_lock = SPIN_LOCK_UNLOCKED;
+int sysctl_node_mapped_guar[MAX_NUMNODES];
+int sysctl_node_mapped_guar_ok[MAX_NUMNODES];
+int sysctl_node_unmapped_guar[MAX_NUMNODES];
+int sysctl_node_unmapped_guar_ok[MAX_NUMNODES];
+
+static void setup_per_lru_guarantee(int lru_nr)
+{
+ struct zone *zone;
+ unsigned long nr;
+ int nid, i;
+
+ for_each_zone(zone) {
+ nid = zone->zone_pgdat->node_id;
+
+ if ((sysctl_node_mapped_guar[nid] +
+ sysctl_node_unmapped_guar[nid]) > 100) {
+ if (lru_nr == LRU_MAPPED) {
+ i = sysctl_node_mapped_guar_ok[nid];
+ sysctl_node_mapped_guar[nid] = i;
+ }
+ else {
+ i = sysctl_node_unmapped_guar_ok[nid];
+ sysctl_node_unmapped_guar[nid] = i;
+ }
+ }
+
+ if (lru_nr == LRU_MAPPED) {
+ i = sysctl_node_mapped_guar[nid];
+ sysctl_node_mapped_guar_ok[nid] = i;
+ }
+ else {
+ i = sysctl_node_unmapped_guar[nid];
+ sysctl_node_unmapped_guar_ok[nid] = i;
+ }
+
+ nr = zone->present_pages - zone->pages_high;
+
+ zone->lru[lru_nr].nr_guaranteed = (nr * i) / 100;
+ }
+}
+
+int node_mapped_guar_sysctl_handler(ctl_table *table, int write,
+ struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+ spin_lock(&sysctl_node_guar_lock);
+ proc_dointvec_minmax(table, write, file, buffer, length, ppos);
+ setup_per_lru_guarantee(LRU_MAPPED);
+ spin_unlock(&sysctl_node_guar_lock);
+ return 0;
+}
+
+int node_unmapped_guar_sysctl_handler(ctl_table *table, int write,
+ struct file *file, void __user *buffer, size_t *length, loff_t *ppos)
+{
+ spin_lock(&sysctl_node_guar_lock);
+ proc_dointvec_minmax(table, write, file, buffer, length, ppos);
+ setup_per_lru_guarantee(LRU_UNMAPPED);
+ spin_unlock(&sysctl_node_guar_lock);
+ return 0;
+}
+
/*
* This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
*/
Move reclaim_mapped logic, sc->nr_mapped, keep active unmapped pages.
This patch moves the reclaim_mapped logic from refill_inactive_zone() to
shrink_zone(), where it is used to determine whether the mapped LRU should be
scanned. The sc->nr_mapped member is removed and replaced with code
that checks the number of pages placed on the per-zone mapped LRU.
refill_inactive_zone() is changed to allow rotation of active unmapped pages.
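As a rough worked example of the relocated heuristic (assuming the usual 2.6
defaults vm_swappiness = 60 and prev_priority = DEF_PRIORITY = 12, and a zone
with 40% of its present pages on the mapped LRU - numbers picked purely for
illustration):

        #include <stdio.h>

        int main(void)
        {
                long vm_swappiness = 60;  /* default /proc/sys/vm/swappiness */
                long prev_priority = 12;  /* DEF_PRIORITY: reclaim not in trouble */
                long mapped_ratio = 40;   /* (nr_active + nr_inactive) * 100 / present_pages */

                long distress = 100 >> prev_priority;  /* 100 >> 12 = 0 */
                long swap_tendency = mapped_ratio / 2 + distress + vm_swappiness;

                /* 20 + 0 + 60 = 80, below 100: shrink_zone() skips the mapped LRU */
                printf("swap_tendency = %ld, reclaim_mapped = %d\n",
                       swap_tendency, swap_tendency >= 100);
                return 0;
        }

Only when distress or the mapped ratio pushes swap_tendency to 100 or above
(or vm_swappiness is set to 100) does shrink_zone() also call shrink_lru()
for the mapped LRU.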
Signed-off-by: Magnus Damm <[email protected]>
---
vmscan.c | 106 +++++++++++++++++++++++++++++---------------------------------
1 files changed, 51 insertions(+), 55 deletions(-)
--- from-0003/mm/vmscan.c
+++ to-work/mm/vmscan.c 2006-03-06 16:58:27.000000000 +0900
@@ -61,8 +61,6 @@ struct scan_control {
/* Incremented by the number of pages reclaimed */
unsigned long nr_reclaimed;
- unsigned long nr_mapped; /* From page_state */
-
/* Ask shrink_caches, or shrink_zone to scan at this priority */
unsigned int priority;
@@ -1202,48 +1200,6 @@ refill_inactive_zone(struct zone *zone,
LIST_HEAD(l_active); /* Pages to go onto the active_list */
struct page *page;
struct pagevec pvec;
- int reclaim_mapped = 0;
-
- if (unlikely(sc->may_swap)) {
- long mapped_ratio;
- long distress;
- long swap_tendency;
-
- /*
- * `distress' is a measure of how much trouble we're having
- * reclaiming pages. 0 -> no problems. 100 -> great trouble.
- */
- distress = 100 >> zone->prev_priority;
-
- /*
- * The point of this algorithm is to decide when to start
- * reclaiming mapped memory instead of just pagecache. Work out
- * how much memory
- * is mapped.
- */
- mapped_ratio = (sc->nr_mapped * 100) / total_memory;
-
- /*
- * Now decide how much we really want to unmap some pages. The
- * mapped ratio is downgraded - just because there's a lot of
- * mapped memory doesn't necessarily mean that page reclaim
- * isn't succeeding.
- *
- * The distress ratio is important - we don't want to start
- * going oom.
- *
- * A 100% value of vm_swappiness overrides this algorithm
- * altogether.
- */
- swap_tendency = mapped_ratio / 2 + distress + vm_swappiness;
-
- /*
- * Now use this metric to decide whether to start moving mapped
- * memory onto the inactive list.
- */
- if (swap_tendency >= 100)
- reclaim_mapped = 1;
- }
lru_add_drain();
spin_lock_irq(&zone->lru_lock);
@@ -1257,13 +1213,10 @@ refill_inactive_zone(struct zone *zone,
cond_resched();
page = lru_to_page(&l_hold);
list_del(&page->lru);
- if (page_mapped(page)) {
- if (!reclaim_mapped ||
- (total_swap_pages == 0 && PageAnon(page)) ||
- page_referenced(page, 0)) {
- list_add(&page->lru, &l_active);
- continue;
- }
+ if ((total_swap_pages == 0 && PageAnon(page)) ||
+ page_referenced(page, 0)) {
+ list_add(&page->lru, &l_active);
+ continue;
}
list_add(&page->lru, &l_inactive);
}
@@ -1378,8 +1331,54 @@ shrink_lru(struct zone *zone, int lru_nr
static void
shrink_zone(struct zone *zone, struct scan_control *sc)
{
+ int reclaim_mapped = 0;
+
+ if (unlikely(sc->may_swap)) {
+ struct lru *lru = &zone->lru[LRU_MAPPED];
+ long mapped_ratio;
+ long distress;
+ long swap_tendency;
+
+ /*
+ * `distress' is a measure of how much trouble we're having
+ * reclaiming pages. 0 -> no problems. 100 -> great trouble.
+ */
+ distress = 100 >> zone->prev_priority;
+
+ /*
+ * The point of this algorithm is to decide when to start
+ * reclaiming mapped memory instead of just pagecache.
+ * Work out how much memory is mapped.
+ */
+ mapped_ratio = (lru->nr_active + lru->nr_inactive) * 100;
+ mapped_ratio /= zone->present_pages;
+
+ /*
+ * Now decide how much we really want to unmap some pages. The
+ * mapped ratio is downgraded - just because there's a lot of
+ * mapped memory doesn't necessarily mean that page reclaim
+ * isn't succeeding.
+ *
+ * The distress ratio is important - we don't want to start
+ * going oom.
+ *
+ * A 100% value of vm_swappiness overrides this algorithm
+ * altogether.
+ */
+ swap_tendency = mapped_ratio / 2 + distress + vm_swappiness;
+
+ /*
+ * Now use this metric to decide whether to start moving mapped
+ * memory onto the inactive list.
+ */
+ if (swap_tendency >= 100)
+ reclaim_mapped = 1;
+ }
+
shrink_lru(zone, LRU_UNMAPPED, sc);
- shrink_lru(zone, LRU_MAPPED, sc);
+
+ if (reclaim_mapped)
+ shrink_lru(zone, LRU_MAPPED, sc);
}
/*
@@ -1463,7 +1462,6 @@ int try_to_free_pages(struct zone **zone
}
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
- sc.nr_mapped = read_page_state(nr_mapped);
sc.nr_scanned = 0;
sc.nr_reclaimed = 0;
sc.priority = priority;
@@ -1552,7 +1550,6 @@ loop_again:
sc.gfp_mask = GFP_KERNEL;
sc.may_writepage = !laptop_mode;
sc.may_swap = 1;
- sc.nr_mapped = read_page_state(nr_mapped);
inc_page_state(pageoutrun);
@@ -1910,7 +1907,6 @@ int zone_reclaim(struct zone *zone, gfp_
sc.nr_scanned = 0;
sc.nr_reclaimed = 0;
sc.priority = ZONE_RECLAIM_PRIORITY + 1;
- sc.nr_mapped = read_page_state(nr_mapped);
sc.gfp_mask = gfp_mask;
disable_swap_token();
Use separate LRU:s for mapped and unmapped pages.
This patch creates two instances of "struct lru" per zone, both protected by
zone->lru_lock. A new bit in page->flags named PG_mapped is used to determine
which LRU the page belongs to. The rmap code is changed to move pages to the
mapped LRU, while the vmscan code moves pages back to the unmapped LRU when
needed. Pages moved to the mapped LRU are added to the inactive list, while
pages moved back to the unmapped LRU are added to the active list.
Signed-off-by: Magnus Damm <[email protected]>
---
include/linux/mm_inline.h | 77 ++++++++++++++++++++++++++++++++------
include/linux/mmzone.h | 21 +++++++---
include/linux/page-flags.h | 5 ++
mm/page_alloc.c | 44 ++++++++++++++--------
mm/rmap.c | 32 ++++++++++++++++
mm/swap.c | 17 ++++++--
mm/vmscan.c | 88 +++++++++++++++++++++++++-------------------
7 files changed, 208 insertions(+), 76 deletions(-)
--- from-0002/include/linux/mm_inline.h
+++ to-work/include/linux/mm_inline.h 2006-03-06 16:21:30.000000000 +0900
@@ -1,41 +1,94 @@
static inline void
-add_page_to_active_list(struct zone *zone, struct page *page)
+add_page_to_active_list(struct lru *lru, struct page *page)
{
- list_add(&page->lru, &zone->active_list);
- zone->nr_active++;
+ list_add(&page->lru, &lru->active_list);
+ lru->nr_active++;
}
static inline void
-add_page_to_inactive_list(struct zone *zone, struct page *page)
+add_page_to_inactive_list(struct lru *lru, struct page *page)
{
- list_add(&page->lru, &zone->inactive_list);
- zone->nr_inactive++;
+ list_add(&page->lru, &lru->inactive_list);
+ lru->nr_inactive++;
}
static inline void
-del_page_from_active_list(struct zone *zone, struct page *page)
+del_page_from_active_list(struct lru *lru, struct page *page)
{
list_del(&page->lru);
- zone->nr_active--;
+ lru->nr_active--;
}
static inline void
-del_page_from_inactive_list(struct zone *zone, struct page *page)
+del_page_from_inactive_list(struct lru *lru, struct page *page)
{
list_del(&page->lru);
- zone->nr_inactive--;
+ lru->nr_inactive--;
+}
+
+static inline struct lru *page_lru(struct zone *zone, struct page *page)
+{
+ return &zone->lru[PageMapped(page) ? LRU_MAPPED : LRU_UNMAPPED];
+}
+
+static inline int may_become_unmapped(struct page *page)
+{
+ if (!page_mapped(page) && PageMapped(page)) {
+ ClearPageMapped(page);
+ return 1;
+ }
+
+ return 0;
}
static inline void
del_page_from_lru(struct zone *zone, struct page *page)
{
+ struct lru *lru = page_lru(zone, page);
+
list_del(&page->lru);
if (PageActive(page)) {
ClearPageActive(page);
- zone->nr_active--;
+ lru->nr_active--;
} else {
- zone->nr_inactive--;
+ lru->nr_inactive--;
}
}
+static inline unsigned long active_pages(struct zone *zone)
+{
+ unsigned long sum;
+
+ sum = zone->lru[LRU_MAPPED].nr_active;
+ sum += zone->lru[LRU_UNMAPPED].nr_active;
+ return sum;
+}
+
+static inline unsigned long inactive_pages(struct zone *zone)
+{
+ unsigned long sum;
+
+ sum = zone->lru[LRU_MAPPED].nr_inactive;
+ sum += zone->lru[LRU_UNMAPPED].nr_inactive;
+ return sum;
+}
+
+static inline unsigned long active_pages_scanned(struct zone *zone)
+{
+ unsigned long sum;
+
+ sum = zone->lru[LRU_MAPPED].nr_scan_active;
+ sum += zone->lru[LRU_UNMAPPED].nr_scan_active;
+ return sum;
+}
+
+static inline unsigned long inactive_pages_scanned(struct zone *zone)
+{
+ unsigned long sum;
+
+ sum = zone->lru[LRU_MAPPED].nr_scan_inactive;
+ sum += zone->lru[LRU_UNMAPPED].nr_scan_inactive;
+ return sum;
+}
+
--- from-0002/include/linux/mmzone.h
+++ to-work/include/linux/mmzone.h 2006-03-06 16:21:30.000000000 +0900
@@ -117,6 +117,18 @@ struct per_cpu_pageset {
* ZONE_HIGHMEM > 896 MB only page cache and user processes
*/
+struct lru {
+ struct list_head active_list;
+ struct list_head inactive_list;
+ unsigned long nr_active;
+ unsigned long nr_inactive;
+ unsigned long nr_scan_active;
+ unsigned long nr_scan_inactive;
+};
+
+#define LRU_MAPPED 0
+#define LRU_UNMAPPED 1
+
struct zone {
/* Fields commonly accessed by the page allocator */
unsigned long free_pages;
@@ -150,13 +162,8 @@ struct zone {
ZONE_PADDING(_pad1_)
/* Fields commonly accessed by the page reclaim scanner */
- spinlock_t lru_lock;
- struct list_head active_list;
- struct list_head inactive_list;
- unsigned long nr_scan_active;
- unsigned long nr_scan_inactive;
- unsigned long nr_active;
- unsigned long nr_inactive;
+ spinlock_t lru_lock;
+ struct lru lru[2]; /* LRU_MAPPED, LRU_UNMAPPED */
unsigned long pages_scanned; /* since last reclaim */
int all_unreclaimable; /* All pages pinned */
--- from-0002/include/linux/page-flags.h
+++ to-work/include/linux/page-flags.h 2006-03-06 16:21:30.000000000 +0900
@@ -75,6 +75,7 @@
#define PG_reclaim 17 /* To be reclaimed asap */
#define PG_nosave_free 18 /* Free, should not be written */
#define PG_uncached 19 /* Page has been mapped as uncached */
+#define PG_mapped 20 /* Page might be mapped in a vma */
/*
* Global page accounting. One instance per CPU. Only unsigned longs are
@@ -344,6 +345,10 @@ extern void __mod_page_state_offset(unsi
#define SetPageUncached(page) set_bit(PG_uncached, &(page)->flags)
#define ClearPageUncached(page) clear_bit(PG_uncached, &(page)->flags)
+#define PageMapped(page) test_bit(PG_mapped, &(page)->flags)
+#define SetPageMapped(page) set_bit(PG_mapped, &(page)->flags)
+#define ClearPageMapped(page) clear_bit(PG_mapped, &(page)->flags)
+
struct page; /* forward declaration */
int test_clear_page_dirty(struct page *page);
--- from-0002/mm/page_alloc.c
+++ to-work/mm/page_alloc.c 2006-03-06 16:22:48.000000000 +0900
@@ -37,6 +37,7 @@
#include <linux/nodemask.h>
#include <linux/vmalloc.h>
#include <linux/mempolicy.h>
+#include <linux/mm_inline.h>
#include <asm/tlbflush.h>
#include "internal.h"
@@ -1314,8 +1315,8 @@ void __get_zone_counts(unsigned long *ac
*inactive = 0;
*free = 0;
for (i = 0; i < MAX_NR_ZONES; i++) {
- *active += zones[i].nr_active;
- *inactive += zones[i].nr_inactive;
+ *active += active_pages(&zones[i]);
+ *inactive += inactive_pages(&zones[i]);
*free += zones[i].free_pages;
}
}
@@ -1448,8 +1449,8 @@ void show_free_areas(void)
K(zone->pages_min),
K(zone->pages_low),
K(zone->pages_high),
- K(zone->nr_active),
- K(zone->nr_inactive),
+ K(active_pages(zone)),
+ K(inactive_pages(zone)),
K(zone->present_pages),
zone->pages_scanned,
(zone->all_unreclaimable ? "yes" : "no")
@@ -2034,6 +2035,16 @@ static __meminit void init_currently_emp
zone_init_free_lists(pgdat, zone, zone->spanned_pages);
}
+static void __init init_lru(struct lru *lru)
+{
+ INIT_LIST_HEAD(&lru->active_list);
+ INIT_LIST_HEAD(&lru->inactive_list);
+ lru->nr_active = 0;
+ lru->nr_inactive = 0;
+ lru->nr_scan_active = 0;
+ lru->nr_scan_inactive = 0;
+}
+
/*
* Set up the zone data structures:
* - mark all pages reserved
@@ -2076,12 +2087,10 @@ static void __init free_area_init_core(s
zone->temp_priority = zone->prev_priority = DEF_PRIORITY;
zone_pcp_init(zone);
- INIT_LIST_HEAD(&zone->active_list);
- INIT_LIST_HEAD(&zone->inactive_list);
- zone->nr_scan_active = 0;
- zone->nr_scan_inactive = 0;
- zone->nr_active = 0;
- zone->nr_inactive = 0;
+
+ init_lru(&zone->lru[LRU_MAPPED]);
+ init_lru(&zone->lru[LRU_UNMAPPED]);
+
atomic_set(&zone->reclaim_in_progress, 0);
if (!size)
continue;
@@ -2228,8 +2237,8 @@ static int zoneinfo_show(struct seq_file
"\n min %lu"
"\n low %lu"
"\n high %lu"
- "\n active %lu"
- "\n inactive %lu"
+ "\n active %lu (u: %lu m: %lu)"
+ "\n inactive %lu (u: %lu m: %lu)"
"\n scanned %lu (a: %lu i: %lu)"
"\n spanned %lu"
"\n present %lu",
@@ -2237,10 +2246,15 @@ static int zoneinfo_show(struct seq_file
zone->pages_min,
zone->pages_low,
zone->pages_high,
- zone->nr_active,
- zone->nr_inactive,
+ active_pages(zone),
+ zone->lru[LRU_UNMAPPED].nr_active,
+ zone->lru[LRU_MAPPED].nr_active,
+ inactive_pages(zone),
+ zone->lru[LRU_UNMAPPED].nr_inactive,
+ zone->lru[LRU_MAPPED].nr_inactive,
zone->pages_scanned,
- zone->nr_scan_active, zone->nr_scan_inactive,
+ active_pages_scanned(zone),
+ inactive_pages_scanned(zone),
zone->spanned_pages,
zone->present_pages);
seq_printf(m,
--- from-0002/mm/rmap.c
+++ to-work/mm/rmap.c 2006-03-06 16:21:30.000000000 +0900
@@ -45,6 +45,7 @@
*/
#include <linux/mm.h>
+#include <linux/mm_inline.h>
#include <linux/pagemap.h>
#include <linux/swap.h>
#include <linux/swapops.h>
@@ -466,6 +467,28 @@ int page_referenced(struct page *page, i
}
/**
+ * page_move_to_mapped_lru - move page to mapped lru
+ * @page: the page to add the mapping to
+ */
+static void page_move_to_mapped_lru(struct page *page)
+{
+ struct zone *zone = page_zone(page);
+ struct lru *lru = &zone->lru[LRU_MAPPED];
+ unsigned long flags;
+
+ spin_lock_irqsave(&zone->lru_lock, flags);
+
+ if (PageLRU(page)) {
+ del_page_from_lru(zone, page);
+ add_page_to_inactive_list(lru, page);
+ }
+
+ SetPageMapped(page);
+
+ spin_unlock_irqrestore(&zone->lru_lock, flags);
+}
+
+/**
* page_set_anon_rmap - setup new anonymous rmap
* @page: the page to add the mapping to
* @vma: the vm area in which the mapping is added
@@ -503,6 +526,9 @@ void page_add_anon_rmap(struct page *pag
if (atomic_inc_and_test(&page->_mapcount))
__page_set_anon_rmap(page, vma, address);
/* else checking page index and mapping is racy */
+
+ if (!PageMapped(page))
+ page_move_to_mapped_lru(page);
}
/*
@@ -519,6 +545,9 @@ void page_add_new_anon_rmap(struct page
{
atomic_set(&page->_mapcount, 0); /* elevate count by 1 (starts at -1) */
__page_set_anon_rmap(page, vma, address);
+
+ if (!PageMapped(page))
+ page_move_to_mapped_lru(page);
}
/**
@@ -534,6 +563,9 @@ void page_add_file_rmap(struct page *pag
if (atomic_inc_and_test(&page->_mapcount))
__inc_page_state(nr_mapped);
+
+ if (!PageMapped(page))
+ page_move_to_mapped_lru(page);
}
/**
--- from-0002/mm/swap.c
+++ to-work/mm/swap.c 2006-03-06 16:24:17.000000000 +0900
@@ -87,7 +87,8 @@ int rotate_reclaimable_page(struct page
spin_lock_irqsave(&zone->lru_lock, flags);
if (PageLRU(page) && !PageActive(page)) {
list_del(&page->lru);
- list_add_tail(&page->lru, &zone->inactive_list);
+ list_add_tail(&page->lru,
+ &(page_lru(zone, page)->inactive_list));
inc_page_state(pgrotated);
}
if (!test_clear_page_writeback(page))
@@ -105,9 +106,9 @@ void fastcall activate_page(struct page
spin_lock_irq(&zone->lru_lock);
if (PageLRU(page) && !PageActive(page)) {
- del_page_from_inactive_list(zone, page);
+ del_page_from_inactive_list(page_lru(zone, page), page);
SetPageActive(page);
- add_page_to_active_list(zone, page);
+ add_page_to_active_list(page_lru(zone, page), page);
inc_page_state(pgactivate);
}
spin_unlock_irq(&zone->lru_lock);
@@ -215,6 +216,9 @@ void fastcall __page_cache_release(struc
spin_lock_irqsave(&zone->lru_lock, flags);
if (TestClearPageLRU(page))
del_page_from_lru(zone, page);
+
+ ClearPageMapped(page);
+
if (page_count(page) != 0)
page = NULL;
spin_unlock_irqrestore(&zone->lru_lock, flags);
@@ -268,6 +272,9 @@ void release_pages(struct page **pages,
}
if (TestClearPageLRU(page))
del_page_from_lru(zone, page);
+
+ ClearPageMapped(page);
+
if (page_count(page) == 0) {
if (!pagevec_add(&pages_to_free, page)) {
spin_unlock_irq(&zone->lru_lock);
@@ -345,7 +352,7 @@ void __pagevec_lru_add(struct pagevec *p
}
if (TestSetPageLRU(page))
BUG();
- add_page_to_inactive_list(zone, page);
+ add_page_to_inactive_list(page_lru(zone, page), page);
}
if (zone)
spin_unlock_irq(&zone->lru_lock);
@@ -374,7 +381,7 @@ void __pagevec_lru_add_active(struct pag
BUG();
if (TestSetPageActive(page))
BUG();
- add_page_to_active_list(zone, page);
+ add_page_to_active_list(page_lru(zone, page), page);
}
if (zone)
spin_unlock_irq(&zone->lru_lock);
--- from-0002/mm/vmscan.c
+++ to-work/mm/vmscan.c 2006-03-06 16:45:16.000000000 +0900
@@ -1103,8 +1103,11 @@ static int isolate_lru_pages(int nr_to_s
/*
* shrink_cache() adds the number of pages reclaimed to sc->nr_reclaimed
*/
-static void shrink_cache(struct zone *zone, struct scan_control *sc)
+static void
+shrink_cache(struct zone *zone, int lru_nr, struct scan_control *sc)
{
+ struct lru *lru = &zone->lru[lru_nr];
+ struct lru *new_lru;
LIST_HEAD(page_list);
struct pagevec pvec;
int max_scan = sc->nr_to_scan;
@@ -1120,9 +1123,9 @@ static void shrink_cache(struct zone *zo
int nr_freed;
nr_taken = isolate_lru_pages(sc->swap_cluster_max,
- &zone->inactive_list,
+ &lru->inactive_list,
&page_list, &nr_scan);
- zone->nr_inactive -= nr_taken;
+ lru->nr_inactive -= nr_taken;
zone->pages_scanned += nr_scan;
spin_unlock_irq(&zone->lru_lock);
@@ -1148,11 +1151,14 @@ static void shrink_cache(struct zone *zo
page = lru_to_page(&page_list);
if (TestSetPageLRU(page))
BUG();
+ if (may_become_unmapped(page))
+ SetPageActive(page);
+ new_lru = page_lru(zone, page);
list_del(&page->lru);
if (PageActive(page))
- add_page_to_active_list(zone, page);
+ add_page_to_active_list(new_lru, page);
else
- add_page_to_inactive_list(zone, page);
+ add_page_to_inactive_list(new_lru, page);
if (!pagevec_add(&pvec, page)) {
spin_unlock_irq(&zone->lru_lock);
__pagevec_release(&pvec);
@@ -1183,8 +1189,10 @@ done:
* But we had to alter page->flags anyway.
*/
static void
-refill_inactive_zone(struct zone *zone, struct scan_control *sc)
+refill_inactive_zone(struct zone *zone, int lru_nr, struct scan_control *sc)
{
+ struct lru *lru = &zone->lru[lru_nr];
+ struct lru *new_lru;
int pgmoved;
int pgdeactivate = 0;
int pgscanned;
@@ -1239,10 +1247,10 @@ refill_inactive_zone(struct zone *zone,
lru_add_drain();
spin_lock_irq(&zone->lru_lock);
- pgmoved = isolate_lru_pages(nr_pages, &zone->active_list,
+ pgmoved = isolate_lru_pages(nr_pages, &lru->active_list,
&l_hold, &pgscanned);
zone->pages_scanned += pgscanned;
- zone->nr_active -= pgmoved;
+ lru->nr_active -= pgmoved;
spin_unlock_irq(&zone->lru_lock);
while (!list_empty(&l_hold)) {
@@ -1261,54 +1269,52 @@ refill_inactive_zone(struct zone *zone,
}
pagevec_init(&pvec, 1);
- pgmoved = 0;
spin_lock_irq(&zone->lru_lock);
while (!list_empty(&l_inactive)) {
page = lru_to_page(&l_inactive);
prefetchw_prev_lru_page(page, &l_inactive, flags);
+ if (may_become_unmapped(page)) {
+ list_move(&page->lru, &l_active);
+ continue;
+ }
if (TestSetPageLRU(page))
BUG();
if (!TestClearPageActive(page))
BUG();
- list_move(&page->lru, &zone->inactive_list);
- pgmoved++;
+ new_lru = page_lru(zone, page);
+ list_move(&page->lru, &new_lru->inactive_list);
+ new_lru->nr_inactive++;
+ pgdeactivate++;
if (!pagevec_add(&pvec, page)) {
- zone->nr_inactive += pgmoved;
spin_unlock_irq(&zone->lru_lock);
- pgdeactivate += pgmoved;
- pgmoved = 0;
if (buffer_heads_over_limit)
pagevec_strip(&pvec);
__pagevec_release(&pvec);
spin_lock_irq(&zone->lru_lock);
}
}
- zone->nr_inactive += pgmoved;
- pgdeactivate += pgmoved;
if (buffer_heads_over_limit) {
spin_unlock_irq(&zone->lru_lock);
pagevec_strip(&pvec);
spin_lock_irq(&zone->lru_lock);
}
- pgmoved = 0;
while (!list_empty(&l_active)) {
page = lru_to_page(&l_active);
prefetchw_prev_lru_page(page, &l_active, flags);
if (TestSetPageLRU(page))
BUG();
BUG_ON(!PageActive(page));
- list_move(&page->lru, &zone->active_list);
- pgmoved++;
+ may_become_unmapped(page);
+ new_lru = page_lru(zone, page);
+ list_move(&page->lru, &new_lru->active_list);
+ new_lru->nr_active++;
if (!pagevec_add(&pvec, page)) {
- zone->nr_active += pgmoved;
- pgmoved = 0;
spin_unlock_irq(&zone->lru_lock);
__pagevec_release(&pvec);
spin_lock_irq(&zone->lru_lock);
}
}
- zone->nr_active += pgmoved;
spin_unlock(&zone->lru_lock);
__mod_page_state_zone(zone, pgrefill, pgscanned);
@@ -1318,12 +1324,10 @@ refill_inactive_zone(struct zone *zone,
pagevec_release(&pvec);
}
-/*
- * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
- */
static void
-shrink_zone(struct zone *zone, struct scan_control *sc)
+shrink_lru(struct zone *zone, int lru_nr, struct scan_control *sc)
{
+ struct lru *lru = &zone->lru[lru_nr];
unsigned long nr_active;
unsigned long nr_inactive;
@@ -1333,17 +1337,17 @@ shrink_zone(struct zone *zone, struct sc
* Add one to `nr_to_scan' just to make sure that the kernel will
* slowly sift through the active list.
*/
- zone->nr_scan_active += (zone->nr_active >> sc->priority) + 1;
- nr_active = zone->nr_scan_active;
+ lru->nr_scan_active += (lru->nr_active >> sc->priority) + 1;
+ nr_active = lru->nr_scan_active;
if (nr_active >= sc->swap_cluster_max)
- zone->nr_scan_active = 0;
+ lru->nr_scan_active = 0;
else
nr_active = 0;
- zone->nr_scan_inactive += (zone->nr_inactive >> sc->priority) + 1;
- nr_inactive = zone->nr_scan_inactive;
+ lru->nr_scan_inactive += (lru->nr_inactive >> sc->priority) + 1;
+ nr_inactive = lru->nr_scan_inactive;
if (nr_inactive >= sc->swap_cluster_max)
- zone->nr_scan_inactive = 0;
+ lru->nr_scan_inactive = 0;
else
nr_inactive = 0;
@@ -1352,14 +1356,14 @@ shrink_zone(struct zone *zone, struct sc
sc->nr_to_scan = min(nr_active,
(unsigned long)sc->swap_cluster_max);
nr_active -= sc->nr_to_scan;
- refill_inactive_zone(zone, sc);
+ refill_inactive_zone(zone, lru_nr, sc);
}
if (nr_inactive) {
sc->nr_to_scan = min(nr_inactive,
(unsigned long)sc->swap_cluster_max);
nr_inactive -= sc->nr_to_scan;
- shrink_cache(zone, sc);
+ shrink_cache(zone, lru_nr, sc);
}
}
@@ -1369,6 +1373,16 @@ shrink_zone(struct zone *zone, struct sc
}
/*
+ * This is a basic per-zone page freer. Used by both kswapd and direct reclaim.
+ */
+static void
+shrink_zone(struct zone *zone, struct scan_control *sc)
+{
+ shrink_lru(zone, LRU_UNMAPPED, sc);
+ shrink_lru(zone, LRU_MAPPED, sc);
+}
+
+/*
* This is the direct reclaim path, for page-allocating processes. We only
* try to reclaim pages from zones which will satisfy the caller's allocation
* request.
@@ -1445,7 +1459,7 @@ int try_to_free_pages(struct zone **zone
continue;
zone->temp_priority = DEF_PRIORITY;
- lru_pages += zone->nr_active + zone->nr_inactive;
+ lru_pages += active_pages(zone) + inactive_pages(zone);
}
for (priority = DEF_PRIORITY; priority >= 0; priority--) {
@@ -1587,7 +1601,7 @@ scan:
for (i = 0; i <= end_zone; i++) {
struct zone *zone = pgdat->node_zones + i;
- lru_pages += zone->nr_active + zone->nr_inactive;
+ lru_pages += active_pages(zone) + inactive_pages(zone);
}
/*
@@ -1631,7 +1645,7 @@ scan:
if (zone->all_unreclaimable)
continue;
if (nr_slab == 0 && zone->pages_scanned >=
- (zone->nr_active + zone->nr_inactive) * 4)
+ (active_pages(zone) + inactive_pages(zone)) * 4)
zone->all_unreclaimable = 1;
/*
* If we've done a decent amount of scanning and
Magnus Damm wrote:
> Unmapped patches - Use two LRU:s per zone.
>
> These patches break out the per-zone LRU into two separate LRU:s - one for
> mapped pages and one for unmapped pages. The patches also introduce guarantee
> support, which allows the user to set how many percent of all pages per node
> that should be kept in memory for mapped or unmapped pages. This guarantee
> makes it possible to adjust the VM behaviour depending on the workload.
>
> Reasons behind the LRU separation:
>
> - Avoid unnecessary page scanning.
> The current VM implementation rotates mapped pages on the active list
> until the number of mapped pages are high enough to start unmap and page out.
> By using two LRU:s we can avoid this scanning and shrink/rotate unmapped
> pages only, not touching mapped pages until the threshold is reached.
>
> - Make it possible to adjust the VM behaviour.
> In some cases the user might want to guarantee that a certain amount of
> pages should be kept in memory, overriding the standard behaviour. Separating
> pages into mapped and unmapped LRU:s allows guarantee with low overhead.
>
> I've performed many tests on a Dual PIII machine while varying the amount of
> RAM available. Kernel compiles on a 64MB configuration gets a small speedup,
> but the impact on other configurations and workloads seems to be unaffected.
>
> Apply on top of 2.6.16-rc5.
>
> Comments?
>
I did something similar a while back which I called split active lists.
I think it is a good idea in general and I did see fairly large speedups
with heavy swapping kbuilds, but nobody else seemed to want it :P
So you split the inactive list as well - that's going to be a bit of
change in behaviour and I'm not sure whether you gain anything.
I don't think PageMapped is a very good name for the flag.
I test mapped lazily. Much better way to go IMO.
I had further patches that got rid of reclaim_mapped completely while
I was there. It is based on crazy metrics that basically completely
change meaning if there are changes in the memory configuration of
the system, or small changes in reclaim algorithms.
--
SUSE Labs, Novell Inc.
Magnus Damm wrote:
> Implement per-LRU guarantee through sysctl.
>
> This patch introduces the two new sysctl files "node_mapped_guar" and
> "node_unmapped_guar". Each file contains one percentage per node and tells
> the system how many percentage of all pages that should be kept in RAM as
> unmapped or mapped pages.
>
The whole Linux VM philosophy until now has been to get away from stuff
like this.
If your app is really that specialised then maybe it can use mlock. If
not, maybe the VM is currently broken.
You do have a real-world workload that is significantly improved by this,
right?
--
SUSE Labs, Novell Inc.
On 3/10/06, Nick Piggin <[email protected]> wrote:
> Magnus Damm wrote:
> > Unmapped patches - Use two LRU:s per zone.
> >
> > These patches break out the per-zone LRU into two separate LRU:s - one for
> > mapped pages and one for unmapped pages. The patches also introduce guarantee
> > support, which allows the user to set how many percent of all pages per node
> > that should be kept in memory for mapped or unmapped pages. This guarantee
> > makes it possible to adjust the VM behaviour depending on the workload.
> >
> > Reasons behind the LRU separation:
> >
> > - Avoid unnecessary page scanning.
> > The current VM implementation rotates mapped pages on the active list
> > until the number of mapped pages are high enough to start unmap and page out.
> > By using two LRU:s we can avoid this scanning and shrink/rotate unmapped
> > pages only, not touching mapped pages until the threshold is reached.
> >
> > - Make it possible to adjust the VM behaviour.
> > In some cases the user might want to guarantee that a certain amount of
> > pages should be kept in memory, overriding the standard behaviour. Separating
> > pages into mapped and unmapped LRU:s allows guarantee with low overhead.
> >
> > I've performed many tests on a Dual PIII machine while varying the amount of
> > RAM available. Kernel compiles on a 64MB configuration gets a small speedup,
> > but the impact on other configurations and workloads seems to be unaffected.
> >
> > Apply on top of 2.6.16-rc5.
> >
> > Comments?
> >
>
> I did something similar a while back which I called split active lists.
> I think it is a good idea in general and I did see fairly large speedups
> with heavy swapping kbuilds, but nobody else seemed to want it :P
I want it if it helps you! =)
I don't see why both mapped and unmapped pages should be kept on the
same list at all actually, especially with the reclaim_mapped
threshold used today. The current solution is to scan through lots of
mapped pages on the active list if the threshold is not reached. I
think avoiding this scanning can improve performance.
The single LRU solution today keeps mapped pages on the active list,
but always moves unmapped pages from the active list to the inactive
list. I would say that that solution is pretty different from having
two individual LRU:s with two lists each.
> So you split the inactive list as well - that's going to be a bit of
> change in behaviour and I'm not sure whether you gain anything.
Well, other parts of the VM still use lru_cache_add_active for some
mapped pages, so anonymous pages will mostly be in the active list on
the mapped LRU. My plan with using two full LRU:s is to provide two
separate LRU instances that individually will act as two-list LRU:s.
So active mapped pages should actually end up on the active list,
while seldom used mapped pages should be on the inactive list.
Also, I think it makes sense to separate mapped from unmapped because
mapped pages need to clear the young-bits in the pte to track usage,
but unmapped activity happens through mark_page_accessed(). So mapped
pages need to be scanned, but unmapped pages could, say, be moved to
the head of a list to avoid scanning. I'm not sure that is a win
though.
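For reference, unmapped page-cache activity already reaches the LRU through
something like this - a from-memory sketch of the 2.6 mark_page_accessed() in
mm/swap.c, which these patches leave untouched, so take it as an approximation:

        void fastcall mark_page_accessed(struct page *page)
        {
                if (!PageActive(page) && PageReferenced(page) && PageLRU(page)) {
                        /* second touch: promote straight to the active list */
                        activate_page(page);
                        ClearPageReferenced(page);
                } else if (!PageReferenced(page)) {
                        /* first touch: just remember that the page was used */
                        SetPageReferenced(page);
                }
        }

Driving the unmapped LRU entirely from this path would roughly mean doing the
promotion with a plain list movement instead of relying on the scanner.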
> I don't think PageMapped is a very good name for the flag.
Yeah, it's a bit confusing to both have PageMapped() and page_mapped().
> I test mapped lazily. Much better way to go IMO.
I will have a look at your patch to see how you handle things.
> I had further patches that got rid of reclaim_mapped completely while
> I was there. It is based on crazy metrics that basically completely
> change meaning if there are changes in the memory configuration of
> the system, or small changes in reclaim algorithms.
It is not very NUMA aware either, right?
I think there are many interesting things that could be improved
in the vmscan code, but I'm trying to change as little as possible for
now.
Thanks for the comments,
/ magnus
On 3/10/06, Nick Piggin <[email protected]> wrote:
> Magnus Damm wrote:
> > Implement per-LRU guarantee through sysctl.
> >
> > This patch introduces the two new sysctl files "node_mapped_guar" and
> > "node_unmapped_guar". Each file contains one percentage per node and tells
> > the system how many percentage of all pages that should be kept in RAM as
> > unmapped or mapped pages.
> >
>
> The whole Linux VM philosophy until now has been to get away from stuff
> like this.
Yeah, and Linux has never supported memory resource control either, right?
> If your app is really that specialised then maybe it can use mlock. If
> not, maybe the VM is currently broken.
>
> You do have a real-world workload that is significantly improved by this,
> right?
Not really, but I think there is a demand for memory resource control today.
The memory controller in ckrm also breaks out the LRU, but puts one
LRU instance in each class. My code does not depend on ckrm, but it
should be possible to have some kind of resource control with this
patch and cpusets. And yeah, add numa emulation if you are out of
nodes. =)
Thanks,
/ magnus
> Apply on top of 2.6.16-rc5.
>
> Comments?
my big worry with a split LRU is: how do you keep fairness and balance
between those LRUs? This is one of the things that made the 2.4 VM suck
really badly, so I really wouldn't want this bad...
On Fri, 2006-03-10 at 12:44 +0900, Magnus Damm wrote:
> Unmapped patches - Use two LRU:s per zone.
>
> These patches break out the per-zone LRU into two separate LRU:s - one for
> mapped pages and one for unmapped pages. The patches also introduce guarantee
> support, which allows the user to set how many percent of all pages per node
> that should be kept in memory for mapped or unmapped pages. This guarantee
> makes it possible to adjust the VM behaviour depending on the workload.
>
> Reasons behind the LRU separation:
>
> - Avoid unnecessary page scanning.
> The current VM implementation rotates mapped pages on the active list
> until the number of mapped pages are high enough to start unmap and page out.
> By using two LRU:s we can avoid this scanning and shrink/rotate unmapped
> pages only, not touching mapped pages until the threshold is reached.
>
> - Make it possible to adjust the VM behaviour.
> In some cases the user might want to guarantee that a certain amount of
> pages should be kept in memory, overriding the standard behaviour. Separating
> pages into mapped and unmapped LRU:s allows guarantee with low overhead.
>
> I've performed many tests on a Dual PIII machine while varying the amount of
> RAM available. Kernel compiles on a 64MB configuration gets a small speedup,
> but the impact on other configurations and workloads seems to be unaffected.
>
> Apply on top of 2.6.16-rc5.
>
> Comments?
I'm not convinced of special casing mapped pages, nor of tunable knobs.
I've been working on implementing some page replacement algorithms that
have neither.
Breaking the LRU in two like this breaks the page ordering, which makes
it possible for pages to stay resident even though they have much less
activity than pages that do get reclaimed.
I have a serious regression somewhere, but will post as soon as we've
managed to track it down.
If you're interested, the work can be found here:
http://programming.kicks-ass.net/kernel-patches/page-replace/
--
Peter Zijlstra <[email protected]>
On 3/10/06, Arjan van de Ven <[email protected]> wrote:
> > Apply on top of 2.6.16-rc5.
> >
> > Comments?
>
>
> my big worry with a split LRU is: how do you keep fairness and balance
> between those LRUs? This is one of the things that made the 2.4 VM suck
> really badly, so I really wouldn't want this bad...
Yeah, I agree this is important. I think linux-2.4 tried to keep the
LRU list lengths in a certain way (maybe 2/3 of all pages active, 1/3
inactive). In 2.6 there is no such thing; instead, the number of pages
scanned is related to the current scanning priority.
My current code just extends this idea, which basically means that
there is currently no relation between how many pages sit in each
LRU. The LRU with the largest number of pages will be shrunk/rotated
first. And on top of that come the guarantee logic and the
reclaim_mapped threshold, i.e. the unmapped LRU will be shrunk first by
default.
The current balancing code plays around with nr_scan_active and
nr_scan_inactive, but I'm not entirely sure why that logic is there.
If anyone can explain the reason behind that code I'd be happy to hear
it.
Thanks,
/ magnus
On 3/10/06, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2006-03-10 at 12:44 +0900, Magnus Damm wrote:
> > Unmapped patches - Use two LRU:s per zone.
> >
> > These patches break out the per-zone LRU into two separate LRU:s - one for
> > mapped pages and one for unmapped pages. The patches also introduce guarantee
> > support, which allows the user to set how many percent of all pages per node
> > that should be kept in memory for mapped or unmapped pages. This guarantee
> > makes it possible to adjust the VM behaviour depending on the workload.
> >
> > Reasons behind the LRU separation:
> >
> > - Avoid unnecessary page scanning.
> > The current VM implementation rotates mapped pages on the active list
> > until the number of mapped pages are high enough to start unmap and page out.
> > By using two LRU:s we can avoid this scanning and shrink/rotate unmapped
> > pages only, not touching mapped pages until the threshold is reached.
> >
> > - Make it possible to adjust the VM behaviour.
> > In some cases the user might want to guarantee that a certain amount of
> > pages should be kept in memory, overriding the standard behaviour. Separating
> > pages into mapped and unmapped LRU:s allows guarantee with low overhead.
> >
> > I've performed many tests on a Dual PIII machine while varying the amount of
> > RAM available. Kernel compiles on a 64MB configuration gets a small speedup,
> > but the impact on other configurations and workloads seems to be unaffected.
> >
> > Apply on top of 2.6.16-rc5.
> >
> > Comments?
>
> I'm not convinced of special casing mapped pages, nor of tunable knobs.
I think it makes sense to treat mapped pages separately because only
mapped pages require clearing of young-bits in pte:s. The logic for
unmapped pages could be driven entirely from mark_page_accessed(), no
scanning required. At least in my head that is. =)
Also, what might be an optimal page replacement policy for
unmapped pages might be suboptimal for mapped pages.
> I've been working on implementing some page replacement algorithms that
> have neither.
Yeah, I know that. =) I think your ClockPRO work looks very promising.
I would really like to see some better page replacement policy than
LRU merged.
> Breaking the LRU in two like this breaks the page ordering, which makes
> it possible for pages to stay resident even though they have much less
> activity than pages that do get reclaimed.
Yes, true. But this happens already with a per-zone LRU. LRU pages
that happen to end up in the DMA zone will probably stay there a
longer time than pages in the normal zone. That does not mean it is
right to break the page ordering though, I'm just saying it happens
already and the oldest piece of data in the global system will not be
reclaimed first - instead there are priorities, such as unmapped pages
being reclaimed before mapped ones, and so on. (I strongly feel that there
should be per-node LRU:s, but that's another story)
> I have a serious regression somewhere, but will post as soon as we've
> managed to track it down.
>
> If you're interrested, the work can be found here:
> http://programming.kicks-ass.net/kernel-patches/page-replace/
I'm definitely interested, but I also believe that the page reclaim
code is hairy as hell, and that complicated changes to the "stable"
2.6-tree are hard to merge. So I see my work as a first step (or just
something that starts a discussion if no one is interested), and in
the end a page replacement policy implementation such as yours will be
accepted.
Thanks!
/ magnus
On Fri, 2006-03-10 at 14:19 +0100, Magnus Damm wrote:
> On 3/10/06, Arjan van de Ven <[email protected]> wrote:
> > > Apply on top of 2.6.16-rc5.
> > >
> > > Comments?
> >
> >
> > my big worry with a split LRU is: how do you keep fairness and balance
> > between those LRUs? This is one of the things that made the 2.4 VM suck
> > really badly, so I really wouldn't want this bad...
>
> Yeah, I agree this is important. I think linux-2.4 tried to keep the
> LRU list lengths in a certain way (maybe 2/3 of all pages active, 1/3
> inactive).
not really
> My current code just extends this idea which basically means that
> there is currently no relation between how many pages that sit in each
> LRU. The LRU with the largest amount of pages will be shrunk/rotated
> first. And on top of that is the guarantee logic and the
> reclaim_mapped threshold, ie the unmapped LRU will be shrunk first by
> default.
that sounds wrong, you lose history this way. There is NO reason to
shrink only the unmapped LRU and not the mapped one. At minimum you
always need to pressure both. How you pressure (absolute versus
percentage) is an interesting question, but to me there is no doubt that
you always need to pressure both, and "equally" to some measure of equal
On Fri, 2006-03-10 at 15:04 +0900, Magnus Damm wrote:
> On 3/10/06, Nick Piggin <[email protected]> wrote:
> > Magnus Damm wrote:
> > > Implement per-LRU guarantee through sysctl.
> > >
> > > This patch introduces the two new sysctl files "node_mapped_guar" and
> > > "node_unmapped_guar". Each file contains one percentage per node and tells
> > > the system how many percentage of all pages that should be kept in RAM as
> > > unmapped or mapped pages.
> > >
> >
> > The whole Linux VM philosophy until now has been to get away from stuff
> > like this.
>
> Yeah, and Linux has never supported memory resource control either, right?
>
> > If your app is really that specialised then maybe it can use mlock. If
> > not, maybe the VM is currently broken.
> >
> > You do have a real-world workload that is significantly improved by this,
> > right?
>
> Not really, but I think there is a demand for memory resource control today.
As a person who is working on CKRM, I totally agree with this :)
>
> The memory controller in ckrm also breaks out the LRU, but puts one
> LRU instance in each class. My code does not depend on ckrm, but it
> should be possible to have some kind of resource control with this
I do not understand how breaking the lru lists into mapped/unmapped pages
and providing a knob to control the proportion of mapped/unmapped pages
in a node helps in resource control.
Can you explain, please. I am very interested.
> patch and cpusets. And yeah, add numa emulation if you are out of
> nodes. =)
>
> Thanks,
>
> / magnus
>
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- [email protected] | .......you may get it.
----------------------------------------------------------------------
On Fri, 10 Mar 2006, Magnus Damm wrote:
> Unmapped patches - Use two LRU:s per zone.
Note that if this is done then the default case of zone_reclaim becomes
trivial to deal with and we can get rid of the zone_reclaim_interval.
However, I have not looked at the rest yet.
On Fri, 10 Mar 2006, Magnus Damm wrote:
> Use separate LRU:s for mapped and unmapped pages.
>
> This patch creates two instances of "struct lru" per zone, both protected by
> zone->lru_lock. A new bit in page->flags named PG_mapped is used to determine
> which LRU the page belongs to. The rmap code is changed to move pages to the
> mapped LRU, while the vmscan code moves pages back to the unmapped LRU when
> needed. Pages moved to the mapped LRU are added to the inactive list, while
> pages moved back to the unmapped LRU are added to the active list.
The swapper moves pages to the unmapped list? So the mapped LRU
lists contain unmapped pages? That would get rid of the benefit that I
saw from this scheme. Pretty inconsistent.
On 3/10/06, Arjan van de Ven <[email protected]> wrote:
> On Fri, 2006-03-10 at 14:19 +0100, Magnus Damm wrote:
> > My current code just extends this idea which basically means that
> > there is currently no relation between how many pages that sit in each
> > LRU. The LRU with the largest amount of pages will be shrunk/rotated
> > first. And on top of that is the guarantee logic and the
> > reclaim_mapped threshold, ie the unmapped LRU will be shrunk first by
> > default.
>
> that sounds wrong, you lose history this way. There is NO reason to
> shrink only the unmapped LRU and not the mapped one. At minimum you
> always need to pressure both. How you pressure (absolute versus
> percentage) is an interesting question, but to me there is no doubt that
> you always need to pressure both, and "equally" to some measure of equal
Regarding whether shrinking the unmapped LRU only is bad or not: In the
vanilla version of refill_inactive_zone(), if reclaim_mapped is false
then mapped pages are rotated on the active list without the
young-bits getting cleared in the PTE:s. I would say this is very
similar to leaving the pages on the mapped active list alone as long
as reclaim_mapped is false in the dual LRU case. Do you agree?
Also, regarding losing history, do you mean that the order of the pages is not
kept? If so, then I think my refill_inactive_zone() rant above shows
that the order of the pages is not kept today. But yes, keeping the
order is probably a good idea.
It would be interesting to hear what you mean by "pressure" - do you
mean that both the active list and the inactive list are scanned?
Many thanks,
/ magnus
On 3/11/06, Christoph Lameter <[email protected]> wrote:
> On Fri, 10 Mar 2006, Magnus Damm wrote:
>
> > Unmapped patches - Use two LRU:s per zone.
>
> Note that if this is done then the default case of zone_reclaim becomes
> trivial to deal with and we can get rid of the zone_reclaim_interval.
That's a good thing, right? =)
> However, I have not looked at the rest yet.
Please do. I'd like to hear what you think about it.
Thanks,
/ magnus
On 3/11/06, Christoph Lameter <[email protected]> wrote:
> On Fri, 10 Mar 2006, Magnus Damm wrote:
>
> > Use separate LRU:s for mapped and unmapped pages.
> >
> > This patch creates two instances of "struct lru" per zone, both protected by
> > zone->lru_lock. A new bit in page->flags named PG_mapped is used to determine
> > which LRU the page belongs to. The rmap code is changed to move pages to the
> > mapped LRU, while the vmscan code moves pages back to the unmapped LRU when
> > needed. Pages moved to the mapped LRU are added to the inactive list, while
> > pages moved back to the unmapped LRU are added to the active list.
>
> The swapper moves pages to the unmapped list? So the mapped LRU
> lists contains unmapped pages? That would get rid of the benefit that I
> saw from this scheme. Pretty inconsistent.
The first (unreleased) versions of these patches modified rmap.c to
move the pages between the LRU:s both when adding and when removing
rmap:s, so the mapped LRU would in that case keep mapped pages only.
This did however introduce more overhead, because pages only mapped by
a single process would bounce between the LRU:s when such a process
starts or terminates.
The split active list implementation by Nick Piggin did however only
move pages between the active lists during vmscan (if I understood the
patch correctly), which is something that I have not tried yet.
I think it would be interesting to have three active lists: one for unmapped
pages, one for mapped file-backed pages and one for mapped anonymous
pages, and then let the vmscan code move pages between the lists.
Thank you for the comments!
/ magnus
On 3/11/06, Chandra Seetharaman <[email protected]> wrote:
> On Fri, 2006-03-10 at 15:04 +0900, Magnus Damm wrote:
> > On 3/10/06, Nick Piggin <[email protected]> wrote:
> > > Magnus Damm wrote:
> > > If your app is really that specialised then maybe it can use mlock. If
> > > not, maybe the VM is currently broken.
> > >
> > > You do have a real-world workload that is significantly improved by this,
> > > right?
> >
> > Not really, but I think there is a demand for memory resource control today.
>
> As a person who is working on CKRM, I totally agree with this :)
Hehe, good to hear that I'm not alone. =)
> > The memory controller in ckrm also breaks out the LRU, but puts one
> > LRU instance in each class. My code does not depend on ckrm, but it
> > should be possible to have some kind of resource control with this
>
> i do not understand how breaking lru lists into mapped/unmapped pages
> and providing a knob to control the proportion of mapped/unmapped pages
> in a node help in resource control.
It is one type of resource control. It is of course not a complete
solution like ckrm, but on machines with more than one node (or a
regular PC with numa emulation) it is possible to create partitions
using CPUSETS and then use this patch to control the amount of memory
that should be dedicated to, say, mapped pages on each node.
CKRM and CPUSETS are the ways to provide resource control today.
CPUSETS is coarse-grained, but CKRM aims for finer granularity. Neither
of them has a way to control the ratio between mapped and unmapped
pages, excluding this patch.
I'd like to see CKRM merged, but I'm not the one calling the shots
(probably fortunate enough for everyone). I think CKRM has the same
properties as the ClockPRO work - it would be nice to have it included
in mainline, but these patches modify lots of critical code and
therefore have problems getting accepted that easily.
So this patch is YASSITRD. (Yet Another Small Step In The Right Direction)
Thank you!
/ magnus
On Fri, 2006-03-10 at 14:19 +0100, Magnus Damm wrote:
> On 3/10/06, Arjan van de Ven <[email protected]> wrote:
> > > Apply on top of 2.6.16-rc5.
> > >
> > > Comments?
> >
> >
> > my big worry with a split LRU is: how do you keep fairness and balance
> > between those LRUs? This is one of the things that made the 2.4 VM suck
> > really badly, so I really wouldn't want this bad...
>
> Yeah, I agree this is important. I think linux-2.4 tried to keep the
> LRU list lengths in a certain way (maybe 2/3 of all pages active, 1/3
> inactive). In 2.6 there is no such thing, instead the number of pages
> scanned is related to the current scanning priority.
This sounds wrong; the active and inactive lists are balanced to a 1:1
ratio. This happens because the scan speed is directly proportional
to the size of the list. Hence the largest list will shrink fastest -
this gives a natural balance to equal sizes.
Peter
On Fri, 2006-03-10 at 14:38 +0100, Magnus Damm wrote:
> On 3/10/06, Peter Zijlstra <[email protected]> wrote:
> > Breaking the LRU in two like this breaks the page ordering, which makes
> > it possible for pages to stay resident even though they have much less
> > activity than pages that do get reclaimed.
>
> Yes, true. But this happens already with a per-zone LRU. LRU pages
> that happen to end up in the DMA zone will probably stay there a
> longer time than pages in the normal zone. That does not mean it is
> right to break the page ordering though, I'm just saying it happens
> already and the oldest piece of data in the global system will not be
> reclaimed first - instead there are priorities such as unmapped pages
> will be reclaimed over mapped and so on. (I strongly feel that there
> should be per-node LRU:s, but that's another story)
If reclaim works right* there is equal pressure on each zone
(proportional to its size) and hence each page will have an equal
lifetime expectancy.
(*) this is of course not possible for all workloads, however
balance_pgdat and the page allocator take pains to make it as true as
possible.
Peter
On 3/12/06, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2006-03-10 at 14:19 +0100, Magnus Damm wrote:
> > On 3/10/06, Arjan van de Ven <[email protected]> wrote:
> > > > Apply on top of 2.6.16-rc5.
> > > >
> > > > Comments?
> > >
> > >
> > > my big worry with a split LRU is: how do you keep fairness and balance
> > > between those LRUs? This is one of the things that made the 2.4 VM suck
> > > really badly, so I really wouldn't want this bad...
> >
> > Yeah, I agree this is important. I think linux-2.4 tried to keep the
> > LRU list lengths in a certain way (maybe 2/3 of all pages active, 1/3
> > inactive). In 2.6 there is no such thing, instead the number of pages
> > scanned is related to the current scanning priority.
>
> This sounds wrong, the active and inactive lists are balanced to a 1:1
> ratio. This is happens because the scan speed is directly proportional
> to the size of the list. Hence the largest list will shrink fastest -
> this gives a natural balance to equal sizes.
Yes, you are explaining the current 2.6 behaviour much better. Also,
some balancing logic with nr_scan_active/nr_scan_inactive is present
in the code today. I'm not entirely sure about the purpose of that
code.
Thanks,
/ magnus
On 3/12/06, Peter Zijlstra <[email protected]> wrote:
> On Fri, 2006-03-10 at 14:38 +0100, Magnus Damm wrote:
> > On 3/10/06, Peter Zijlstra <[email protected]> wrote:
>
> > > Breaking the LRU in two like this breaks the page ordering, which makes
> > > it possible for pages to stay resident even though they have much less
> > > activity than pages that do get reclaimed.
> >
> > Yes, true. But this happens already with a per-zone LRU. LRU pages
> > that happen to end up in the DMA zone will probably stay there a
> > longer time than pages in the normal zone. That does not mean it is
> > right to break the page ordering though, I'm just saying it happens
> > already and the oldest piece of data in the global system will not be
> > reclaimed first - instead there are priorities such as unmapped pages
> > will be reclaimed over mapped and so on. (I strongly feel that there
> > should be per-node LRU:s, but that's another story)
>
> If reclaim works right* there is equal pressure on each zone
> (proportional to their size) and hence each page will have an equal life
> time expectancy.
>
> (*) this is of course not possible for all workloads, however
> balance_pgdat and the page allocator take pains to make it as true as
> possible.
In shrink_zone(), there is +1 logic that adds at least one to
nr_scan_active/nr_scan_inactive, and resets them to zero when they
have reached sc->swap_cluster_max (32 or higher in some cases).
So nr_scan_active/nr_scan_inactive will in most cases be 16
(SWAP_CLUSTER_MAX / 2), regardless of the size of the zone. So, a
total of 256 calls to shrink_zone() on a zone with 4096 pages will
likely scan through 100% of the pages on both LRU lists, while 256
calls to shrink_zone() on a zone with say 8096 pages will result in
around 50% of the pages on the lists being scanned through.
Maybe not entirely true, but the bottom line is that the +1 logic will
scan through smaller zones faster than large ones.
/ magnus
On Sat, 2006-03-11 at 21:29 +0900, Magnus Damm wrote:
>
> > > The memory controller in ckrm also breaks out the LRU, but puts one
> > > LRU instance in each class. My code does not depend on ckrm, but it
> > > should be possible to have some kind of resource control with this
> >
> > i do not understand how breaking lru lists into mapped/unmapped pages
> > and providing a knob to control the proportion of mapped/unmapped pages
> > in a node help in resource control.
>
> It is one type of resource control. It is of course not a complete
> solution like ckrm, but on machines with more than one node (or a
> regular PC with numa emulation) it is possible to create partitions
> using CPUSETS and then use this patch to control the amount of memory
> that should be dedicated for say mapped pages on each node.
>
> CKRM and CPUSETS are the ways to provide resource control today.
> CPUSETS is coarse-grained, but CKRM aims for finer granularity. None
> of them have a way to control the ratio between mapped and unmapped
> pages, excluding this patch.
Oh... different type of resource control? Controlling _how_ a resource
is used rather than _who_ uses the resource (which is what CKRM intends
to provide).
>
> I'd like to see CKRM merged, but I'm not the one calling the shots
8-)
> (probably fortunate enough for everyone). I think CKRM has the same
> properties as the ClockPRO work - it would be nice to have it included
> in mainline, but these patches modify lots of crital code and
> therefore has problems getting accepted that easily.
>
> So this patch is YASSITRD. (Yet Another Small Step In The Right Direction)
>
> Thank you!
>
> / magnus
>
--
----------------------------------------------------------------------
Chandra Seetharaman | Be careful what you choose....
- [email protected] | .......you may get it.
----------------------------------------------------------------------