2010-12-23 18:09:41

by Balbir Singh

Subject: [PATCH 0/3] Unmapped Page Control (v3)

The following series implements page cache control. It is a
split-out version of patch 1 from version 3 of the page cache
optimization patches posted earlier.
Previous posting: http://lwn.net/Articles/419564/

For those with LWN.net access, there is detailed coverage
of the patchset at http://lwn.net/Articles/419713/

The previous few revisions received a lot of comments; I've tried to
address as many of those as possible in this revision. An earlier
series was Reviewed-by Christoph Lameter.

There were comments on overlap with Nick's changes. I don't feel
these changes impact Nick's work, and integration can/will be
considered as the patches evolve, if need be.

Detailed Description
====================
This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested unmapped page control.
This is useful in the following scenario:
- In a virtualized environment with cache=writethrough, we see
  double caching (once in the host and once in the guest). As
  we try to scale guests, cache usage across the system grows.
  The goal of this patch is to reclaim page cache when Linux is running
  as a guest and get the host to hold the page cache and manage it.
  There might be temporary duplication, but in the long run, memory
  in the guests would be used for mapped pages.

The feature is controlled via a boot option, so the administrator
can turn it on selectively, on a need-to-use basis.
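
The kswapd hook described above comes down to one extra condition in
the wakeup path. The following is only a userspace sketch of that
decision, modelled on the patch 3 changes to wakeup_kswapd(); struct
zone_model and its fields are hypothetical stand-ins for the kernel's
zone state, not kernel API.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the zone state consulted by the patch. */
struct zone_model {
	unsigned long unmapped_file_pages; /* unmapped page cache in the zone */
	unsigned long min_unmapped_pages;  /* floor derived from vm.min_unmapped_ratio */
	bool watermark_ok;                 /* free pages are above the low watermark */
};

#define UNMAPPED_PAGE_RATIO 16

/* Models should_reclaim_unmapped_pages(): only true when the boot option is set. */
bool should_reclaim_unmapped(const struct zone_model *z, bool control_enabled)
{
	return control_enabled &&
	       z->unmapped_file_pages > UNMAPPED_PAGE_RATIO * z->min_unmapped_pages;
}

/*
 * Models the patched check in wakeup_kswapd(): kswapd stays asleep only
 * if the watermark is fine AND no unmapped page reclaim is due.
 */
bool should_wake_kswapd(const struct zone_model *z, bool control_enabled)
{
	if (z->watermark_ok && !should_reclaim_unmapped(z, control_enabled))
		return false;
	return true;
}
```

Without the boot option the model reduces to the existing watermark
check, which is the intended behaviour when the feature is off.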

A lot of the code is borrowed from the zone_reclaim_mode logic for
__zone_reclaim(). One might argue that with ballooning and
KSM this feature is not very useful, but even with ballooning,
we need extra logic to balloon multiple VMs and it is hard
to figure out the correct amount of memory to balloon. With these
patches applied, each guest has a sufficient amount of free memory
available that can be easily seen and reclaimed by the balloon driver.
The additional memory in the guest can be reused for additional
applications or used to start additional guests/balance memory in
the host.

KSM currently does not de-duplicate host and guest page cache. The goal
of this patch is to help automatically balance unmapped page cache when
instructed to do so.

There are some magic numbers in use in the code: UNMAPPED_PAGE_RATIO
and the number of pages to reclaim when the unmapped_page_control
argument is supplied. These numbers were chosen to avoid reaping page
cache too aggressively or too frequently, while still providing control.
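
To make the two magic numbers concrete, here is a small userspace
sketch of the arithmetic as it appears in patch 3. RECLAIM_SHIFT is a
hypothetical name for the "nr_pages >>= 3" step; the helper names are
illustrative as well.

```c
#define UNMAPPED_PAGE_RATIO 16
#define RECLAIM_SHIFT 3 /* hypothetical name for the "nr_pages >>= 3" step */

/* Unmapped page count above which the allocator path wakes kswapd. */
unsigned long wakeup_threshold(unsigned long min_unmapped_pages)
{
	return UNMAPPED_PAGE_RATIO * min_unmapped_pages;
}

/*
 * Pages requested from one reclaim pass: one eighth of the excess over
 * the floor, so a single pass never tries to reclaim the whole excess.
 */
unsigned long reclaim_target(unsigned long unmapped,
			     unsigned long min_unmapped_pages)
{
	if (unmapped <= min_unmapped_pages)
		return 0;
	return (unmapped - min_unmapped_pages) >> RECLAIM_SHIFT;
}
```

With a floor of 1000 pages, kswapd is woken once more than 16000
unmapped pages exist, and a pass that sees 17000 unmapped pages asks
for 2000 of them.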

The sysctl for min_unmapped_ratio provides further control from
within the guest on the amount of unmapped pages to reclaim.
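
As a rough illustration of how the sysctl feeds into the per-zone
floor, the sketch below mirrors the computation done in
free_area_init_core() in patch 3. The function name and the example
zone size are assumptions made for the illustration.

```c
/*
 * Per-zone floor of unmapped pages, computed as a percentage of the
 * pages present in the zone (integer division, as in the kernel code).
 */
unsigned long min_unmapped_pages(unsigned long present_pages, int ratio_percent)
{
	return (present_pages * (unsigned long)ratio_percent) / 100;
}
```

For a zone of 262144 pages, a ratio of 1 gives a floor of 2621 pages
and a ratio of 5 gives 13107 pages.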

Data from the previous patchsets can be found at
https://lkml.org/lkml/2010/11/30/79

Size measurement

CONFIG_UNMAPPED_PAGECACHE_CONTROL and CONFIG_NUMA enabled
# size mm/built-in.o
  text     data     bss      dec     hex  filename
419431  1883047  140888  2443366  254866  mm/built-in.o

CONFIG_UNMAPPED_PAGECACHE_CONTROL disabled, CONFIG_NUMA enabled
# size mm/built-in.o
  text     data     bss      dec     hex  filename
418908  1883023  140888  2442819  254643  mm/built-in.o
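
The cost of enabling the option can be read off the two rows above.
The following sketch just redoes that arithmetic; struct size_row and
the helper names are made up for the illustration.

```c
/* One row of size(1) output, limited to the summed columns. */
struct size_row {
	unsigned long text;
	unsigned long data;
	unsigned long bss;
};

/* The "dec" column is the sum of text, data and bss. */
unsigned long size_total(struct size_row r)
{
	return r.text + r.data + r.bss;
}

/* Growth in bytes when going from the "without" build to the "with" build. */
long size_delta(struct size_row with, struct size_row without)
{
	return (long)size_total(with) - (long)size_total(without);
}
```

Applied to the numbers above, the option adds 547 bytes to
mm/built-in.o: 523 bytes of text and 24 bytes of data.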


---

Balbir Singh (3):
Move zone_reclaim() outside of CONFIG_NUMA
Refactor zone_reclaim code
Provide control over unmapped pages


Documentation/kernel-parameters.txt | 8 ++
include/linux/mmzone.h | 4 +
include/linux/swap.h | 21 +++++-
init/Kconfig | 12 +++
kernel/sysctl.c | 20 +++--
mm/page_alloc.c | 9 ++
mm/vmscan.c | 132 +++++++++++++++++++++++++++++++----
7 files changed, 175 insertions(+), 31 deletions(-)

--
Balbir Singh


2010-12-23 18:10:18

by Balbir Singh

Subject: [PATCH 1/3] Move zone_reclaim() outside of CONFIG_NUMA (v3)

This patch moves zone_reclaim and associated helpers
outside CONFIG_NUMA. This infrastructure is reused
in the patches for page cache control that follow.

Signed-off-by: Balbir Singh <[email protected]>
---
include/linux/mmzone.h | 4 ++--
include/linux/swap.h | 4 ++--
kernel/sysctl.c | 18 +++++++++---------
mm/vmscan.c | 2 --
4 files changed, 13 insertions(+), 15 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4890662..aeede91 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -302,12 +302,12 @@ struct zone {
*/
unsigned long lowmem_reserve[MAX_NR_ZONES];

-#ifdef CONFIG_NUMA
- int node;
/*
* zone reclaim becomes active if more unmapped pages exist.
*/
unsigned long min_unmapped_pages;
+#ifdef CONFIG_NUMA
+ int node;
unsigned long min_slab_pages;
#endif
struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index 84375e4..ac5c06e 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -253,11 +253,11 @@ extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern long vm_total_pages;

+extern int sysctl_min_unmapped_ratio;
+extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
#ifdef CONFIG_NUMA
extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
#else
#define zone_reclaim_mode 0
static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index a00fdef..e40040e 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1211,15 +1211,6 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
},
#endif
-#ifdef CONFIG_NUMA
- {
- .procname = "zone_reclaim_mode",
- .data = &zone_reclaim_mode,
- .maxlen = sizeof(zone_reclaim_mode),
- .mode = 0644,
- .proc_handler = proc_dointvec,
- .extra1 = &zero,
- },
{
.procname = "min_unmapped_ratio",
.data = &sysctl_min_unmapped_ratio,
@@ -1229,6 +1220,15 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one_hundred,
},
+#ifdef CONFIG_NUMA
+ {
+ .procname = "zone_reclaim_mode",
+ .data = &zone_reclaim_mode,
+ .maxlen = sizeof(zone_reclaim_mode),
+ .mode = 0644,
+ .proc_handler = proc_dointvec,
+ .extra1 = &zero,
+ },
{
.procname = "min_slab_ratio",
.data = &sysctl_min_slab_ratio,
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 42a4859..e841cae 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2740,7 +2740,6 @@ static int __init kswapd_init(void)

module_init(kswapd_init)

-#ifdef CONFIG_NUMA
/*
* Zone reclaim mode
*
@@ -2950,7 +2949,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)

return ret;
}
-#endif

/*
* page_evictable - test whether a page is evictable

2010-12-23 18:34:32

by Balbir Singh

Subject: [PATCH 2/3] Refactor zone_reclaim code (v3)

Changelog v3
1. Renamed zone_reclaim_unmapped_pages to zone_reclaim_pages

Refactor zone_reclaim() and move the reusable functionality out of
it into zone_reclaim_pages(), so that it can be called on its own.
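
The loop that gets factored out can be modelled in userspace as
follows. This is only a sketch: shrink_model() is a hypothetical
stand-in for shrink_zone() that reclaims "lru_pages >> priority" pages
per call, and struct scan_model keeps just the one field the loop
reads.

```c
#define ZONE_RECLAIM_PRIORITY 4

/* Reduced stand-in for struct scan_control. */
struct scan_model {
	unsigned long nr_reclaimed;
};

/* Hypothetical shrink_zone() model: a lower priority value scans more pages. */
void shrink_model(int priority, unsigned long *lru_pages, struct scan_model *sc)
{
	unsigned long n = *lru_pages >> priority;

	*lru_pages -= n;
	sc->nr_reclaimed += n;
}

/*
 * Same loop shape as zone_reclaim_pages(): start at ZONE_RECLAIM_PRIORITY
 * and scan harder on each pass until enough pages are reclaimed or
 * priority 0 has been tried. Returns the number of passes made.
 */
int reclaim_pages_model(unsigned long *lru_pages, struct scan_model *sc,
			unsigned long nr_pages)
{
	int priority = ZONE_RECLAIM_PRIORITY;
	int passes = 0;

	do {
		shrink_model(priority, lru_pages, sc);
		priority--;
		passes++;
	} while (priority >= 0 && sc->nr_reclaimed < nr_pages);

	return passes;
}
```

Because the caller passes in both the scan control and the target, the
same loop can serve __zone_reclaim() and any later caller that wants a
different target.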

Signed-off-by: Balbir Singh <[email protected]>
---
mm/vmscan.c | 35 +++++++++++++++++++++++------------
1 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/mm/vmscan.c b/mm/vmscan.c
index e841cae..3b25423 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2815,6 +2815,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
}

/*
+ * Helper function to reclaim unmapped pages, we might add something
+ * similar to this for slab cache as well. Currently this function
+ * is shared with __zone_reclaim()
+ */
+static inline void
+zone_reclaim_pages(struct zone *zone, struct scan_control *sc,
+ unsigned long nr_pages)
+{
+ int priority;
+ /*
+ * Free memory by calling shrink zone with increasing
+ * priorities until we have enough memory freed.
+ */
+ priority = ZONE_RECLAIM_PRIORITY;
+ do {
+ shrink_zone(priority, zone, sc);
+ priority--;
+ } while (priority >= 0 && sc->nr_reclaimed < nr_pages);
+}
+
+/*
* Try to free up some pages from this zone through reclaim.
*/
static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
@@ -2823,7 +2844,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
const unsigned long nr_pages = 1 << order;
struct task_struct *p = current;
struct reclaim_state reclaim_state;
- int priority;
struct scan_control sc = {
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2847,17 +2867,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;

- if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
- /*
- * Free memory by calling shrink zone with increasing
- * priorities until we have enough memory freed.
- */
- priority = ZONE_RECLAIM_PRIORITY;
- do {
- shrink_zone(priority, zone, &sc);
- priority--;
- } while (priority >= 0 && sc.nr_reclaimed < nr_pages);
- }
+ if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages)
+ zone_reclaim_pages(zone, &sc, nr_pages);

nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
if (nr_slab_pages0 > zone->min_slab_pages) {

2010-12-23 18:35:03

by Balbir Singh

Subject: [PATCH 3/3] Provide control over unmapped pages (v3)

Changelog v2
1. Use a config option to enable the code (Andrew Morton)
2. Explain the magic tunables in the code or at least attempt
to explain them (General comment)
3. Hint uses of the boot parameter with unlikely (Andrew Morton)
4. Use better names (balanced is not a good naming convention)

Provide control using zone_reclaim() and a boot parameter. The
code reuses functionality from zone_reclaim() to isolate unmapped
pages and reclaim them as a priority, ahead of mapped pages.
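
Restricting reclaim to unmapped pages is done by copying the caller's
scan control and overriding a few fields. A userspace sketch of that
step, assuming a reduced struct scan_control_model that carries only
the fields the patch touches:

```c
/* Reduced stand-in for struct scan_control. */
struct scan_control_model {
	int swappiness;
	int may_writepage;
	int may_unmap;
	unsigned long nr_reclaimed;
};

/*
 * Build the private scan control used for unmapped page reclaim.
 * The caller's settings are left untouched; the copy avoids anonymous
 * memory, does not write back dirty pages, does not unmap mapped
 * pages, and starts counting reclaimed pages from zero.
 */
struct scan_control_model unmapped_only_copy(const struct scan_control_model *sc)
{
	struct scan_control_model nsc = *sc;

	nsc.swappiness = 0;
	nsc.may_writepage = 0;
	nsc.may_unmap = 0;
	nsc.nr_reclaimed = 0;
	return nsc;
}
```

Working on a copy keeps kswapd's own accounting in the original scan
control separate from the count of unmapped pages reclaimed.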

Signed-off-by: Balbir Singh <[email protected]>
---
Documentation/kernel-parameters.txt | 8 +++
include/linux/swap.h | 21 ++++++--
init/Kconfig | 12 ++++
kernel/sysctl.c | 2 +
mm/page_alloc.c | 9 +++
mm/vmscan.c | 97 +++++++++++++++++++++++++++++++++++
6 files changed, 142 insertions(+), 7 deletions(-)

diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index dd8fe2b..f52b0bd 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2515,6 +2515,14 @@ and is between 256 and 4096 characters. It is defined in the file
[X86]
Set unknown_nmi_panic=1 early on boot.

+ unmapped_page_control
+ [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
+ is enabled. It controls the amount of unmapped memory
+ that is present in the system. This boot option plus
+ vm.min_unmapped_ratio (sysctl) provide granular control
+ over how much unmapped page cache can exist in the system
+ before kswapd starts reclaiming unmapped page cache pages.
+
usbcore.autosuspend=
[USB] The autosuspend time delay (in seconds) used
for newly-detected USB devices (default 2). This
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ac5c06e..773d7e5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -253,19 +253,32 @@ extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern long vm_total_pages;

+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
extern int sysctl_min_unmapped_ratio;
extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
-#ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
-extern int sysctl_min_slab_ratio;
#else
-#define zone_reclaim_mode 0
static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
{
return 0;
}
#endif

+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+extern bool should_reclaim_unmapped_pages(struct zone *zone);
+#else
+static inline bool should_reclaim_unmapped_pages(struct zone *zone)
+{
+ return false;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+extern int zone_reclaim_mode;
+extern int sysctl_min_slab_ratio;
+#else
+#define zone_reclaim_mode 0
+#endif
+
extern int page_evictable(struct page *page, struct vm_area_struct *vma);
extern void scan_mapping_unevictable_pages(struct address_space *);

diff --git a/init/Kconfig b/init/Kconfig
index 3eb22ad..78c9169 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -782,6 +782,18 @@ endif # NAMESPACES
config MM_OWNER
bool

+config UNMAPPED_PAGECACHE_CONTROL
+ bool "Provide control over unmapped page cache"
+ default n
+ help
+ This option adds support for controlling unmapped page cache
+ via a boot parameter (unmapped_page_control). The boot parameter,
+ together with the sysctl vm.min_unmapped_ratio, controls the total
+ number of unmapped pages in the system. This feature is useful if
+ you want to limit the amount of unmapped page cache or want
+ to reduce page cache duplication in a virtualized environment.
+ If unsure say 'N'
+
config SYSFS_DEPRECATED
bool "enable deprecated sysfs features to support old userspace tools"
depends on SYSFS
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e40040e..ab2c60a 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1211,6 +1211,7 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
},
#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
{
.procname = "min_unmapped_ratio",
.data = &sysctl_min_unmapped_ratio,
@@ -1220,6 +1221,7 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one_hundred,
},
+#endif
#ifdef CONFIG_NUMA
{
.procname = "zone_reclaim_mode",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1845a97..1c9fbab 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1662,6 +1662,9 @@ zonelist_scan:
unsigned long mark;
int ret;

+ if (should_reclaim_unmapped_pages(zone))
+ wakeup_kswapd(zone, order);
+
mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
if (zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags))
@@ -4154,10 +4157,12 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,

zone->spanned_pages = size;
zone->present_pages = realsize;
-#ifdef CONFIG_NUMA
- zone->node = nid;
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
/ 100;
+#endif
+#ifdef CONFIG_NUMA
+ zone->node = nid;
zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
#endif
zone->name = zone_names[j];
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 3b25423..9a6682c 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -158,6 +158,29 @@ static DECLARE_RWSEM(shrinker_rwsem);
#define scanning_global_lru(sc) (1)
#endif

+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+static unsigned long reclaim_unmapped_pages(int priority, struct zone *zone,
+ struct scan_control *sc);
+static int unmapped_page_control __read_mostly;
+
+static int __init unmapped_page_control_parm(char *str)
+{
+ unmapped_page_control = 1;
+ /*
+ * XXX: Should we tweak swappiness here?
+ */
+ return 1;
+}
+__setup("unmapped_page_control", unmapped_page_control_parm);
+
+#else /* !CONFIG_UNMAPPED_PAGECACHE_CONTROL */
+static inline unsigned long reclaim_unmapped_pages(int priority,
+ struct zone *zone, struct scan_control *sc)
+{
+ return 0;
+}
+#endif
+
static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
struct scan_control *sc)
{
@@ -2297,6 +2320,12 @@ loop_again:
shrink_active_list(SWAP_CLUSTER_MAX, zone,
&sc, priority, 0);

+ /*
+ * We do unmapped page reclaim once here and once
+ * below, so that we don't lose out
+ */
+ reclaim_unmapped_pages(priority, zone, &sc);
+
if (!zone_watermark_ok_safe(zone, order,
high_wmark_pages(zone), 0, 0)) {
end_zone = i;
@@ -2332,6 +2361,11 @@ loop_again:
continue;

sc.nr_scanned = 0;
+ /*
+ * Reclaim unmapped pages upfront, this should be
+ * really cheap
+ */
+ reclaim_unmapped_pages(priority, zone, &sc);

/*
* Call soft limit reclaim before calling shrink_zone.
@@ -2587,7 +2621,8 @@ void wakeup_kswapd(struct zone *zone, int order)
pgdat->kswapd_max_order = order;
if (!waitqueue_active(&pgdat->kswapd_wait))
return;
- if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
+ if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0) &&
+ !should_reclaim_unmapped_pages(zone))
return;

trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
@@ -2740,6 +2775,7 @@ static int __init kswapd_init(void)

module_init(kswapd_init)

+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
/*
* Zone reclaim mode
*
@@ -2960,6 +2996,65 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)

return ret;
}
+#endif
+
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+/*
+ * Routine to reclaim unmapped pages, inspired by the code under
+ * CONFIG_NUMA that does unmapped page and slab page control by keeping
+ * min_unmapped_pages in the zone. We currently reclaim just unmapped
+ * pages, slab control will come in soon, at which point this routine
+ * should be called reclaim cached pages
+ */
+unsigned long reclaim_unmapped_pages(int priority, struct zone *zone,
+ struct scan_control *sc)
+{
+ if (unlikely(unmapped_page_control) &&
+ (zone_unmapped_file_pages(zone) > zone->min_unmapped_pages)) {
+ struct scan_control nsc;
+ unsigned long nr_pages;
+
+ nsc = *sc;
+
+ nsc.swappiness = 0;
+ nsc.may_writepage = 0;
+ nsc.may_unmap = 0;
+ nsc.nr_reclaimed = 0;
+
+ nr_pages = zone_unmapped_file_pages(zone) -
+ zone->min_unmapped_pages;
+ /*
+ * We don't want to be too aggressive with our
+ * reclaim, it is our best effort to control
+ * unmapped pages
+ */
+ nr_pages >>= 3;
+
+ zone_reclaim_pages(zone, &nsc, nr_pages);
+ return nsc.nr_reclaimed;
+ }
+ return 0;
+}
+
+/*
+ * 16 is a magic number that was pulled out of a magician's
+ * hat. This number automatically provided the best performance
+ * to memory usage (unmapped pages). Lower than this and we spend
+ * a lot of time in frequent reclaims, higher and our control is
+ * weakened.
+ */
+#define UNMAPPED_PAGE_RATIO 16
+
+bool should_reclaim_unmapped_pages(struct zone *zone)
+{
+ if (unlikely(unmapped_page_control) &&
+ (zone_unmapped_file_pages(zone) >
+ UNMAPPED_PAGE_RATIO * zone->min_unmapped_pages))
+ return true;
+ return false;
+}
+#endif
+

/*
* page_evictable - test whether a page is evictable