The following series implements page cache control. It is a split-out
version of patch 1 of version 3 of the page cache optimization patches.
The previous posting is at http://lwn.net/Articles/425851/ and the
analysis is at http://lwn.net/Articles/419713/
Detailed Description
====================
This patch implements unmapped page cache control via preferred
page cache reclaim. The current patch hooks into kswapd and reclaims
page cache if the user has requested unmapped page control.
This is useful in the following scenario:
- In a virtualized environment with cache=writethrough, we see
double caching (once in the host and once in the guest). As
we try to scale guests, cache usage across the system grows.
The goal of this patch is to reclaim page cache when Linux is running
as a guest and get the host to hold the page cache and manage it.
There might be temporary duplication, but in the long run, memory
in the guests would be used for mapped pages.
- The feature is controlled via a boot option, and the administrator
can selectively turn it on as needed.
A lot of the code is borrowed from zone_reclaim_mode logic for
__zone_reclaim(). One might argue that with ballooning and
KSM this feature is not very useful, but even with ballooning,
we need extra logic to balloon multiple VMs and it is hard
to figure out the correct amount of memory to balloon. With these
patches applied, each guest has a sufficient amount of free memory
available that can be easily seen and reclaimed by the balloon driver.
The additional memory in the guest can be reused for additional
applications or used to start additional guests/balance memory in
the host.
KSM currently does not de-duplicate host and guest page cache. The goal
of this patch is to help automatically balance unmapped page cache when
instructed to do so.
The sysctl for min_unmapped_ratio provides further control from
within the guest on the amount of unmapped pages to reclaim. A similar
max_unmapped_ratio sysctl is added and helps decide when reclaim
should occur. It is tunable and set by default to 16 (based on
tradeoffs seen between aggressiveness in balancing versus size of
unmapped pages). Distros and administrators can further tweak this
for desired control.
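As a rough illustration of how the two ratios interact, here is a small
user-space C model (not kernel code). The struct and function names are
made up for this sketch. The arithmetic mirrors what patch 3 of this
series does: each ratio is converted to a per-zone page count, kswapd is
woken for unmapped reclaim only above the max threshold, and each reclaim
pass targets one eighth of the excess over the min threshold.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical user-space model of one zone; not the kernel's struct zone. */
struct zone_model {
	unsigned long present_pages;
	unsigned long unmapped_file_pages;
	unsigned long min_unmapped_pages;
	unsigned long max_unmapped_pages;
};

/* Convert a percentage (0..100) into a page count for a zone. */
unsigned long ratio_to_pages(unsigned long present_pages,
			     unsigned int ratio_percent)
{
	assert(ratio_percent <= 100);
	return (present_pages * ratio_percent) / 100;
}

/* Recompute both thresholds, as the sysctl handlers do per zone. */
void zone_model_set_ratios(struct zone_model *z,
			   unsigned int min_ratio, unsigned int max_ratio)
{
	z->min_unmapped_pages = ratio_to_pages(z->present_pages, min_ratio);
	z->max_unmapped_pages = ratio_to_pages(z->present_pages, max_ratio);
}

/* The max threshold decides whether kswapd should be woken. */
bool model_should_reclaim(const struct zone_model *z)
{
	return z->unmapped_file_pages > z->max_unmapped_pages;
}

/* Per-pass target: one eighth of the excess over the min threshold. */
unsigned long model_reclaim_target(const struct zone_model *z)
{
	if (z->unmapped_file_pages <= z->min_unmapped_pages)
		return 0;
	return (z->unmapped_file_pages - z->min_unmapped_pages) >> 3;
}
```

For a zone with 100000 present pages and the default ratios (1 and 16),
the thresholds come out at 1000 and 16000 pages. With 20000 unmapped file
pages the model asks for a wakeup and targets 2375 pages in one pass.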
Data from the previous patchsets can be found at
https://lkml.org/lkml/2010/11/30/79
---
Balbir Singh (3):
Move zone_reclaim() outside of CONFIG_NUMA
Refactor zone_reclaim code
Provide control over unmapped pages
Documentation/kernel-parameters.txt | 8 ++
Documentation/sysctl/vm.txt | 19 +++++
include/linux/mmzone.h | 11 +++
include/linux/swap.h | 25 ++++++-
init/Kconfig | 12 +++
kernel/sysctl.c | 29 ++++++--
mm/page_alloc.c | 35 +++++++++-
mm/vmscan.c | 123 +++++++++++++++++++++++++++++++----
8 files changed, 229 insertions(+), 33 deletions(-)
--
Three Cheers,
Balbir
This patch moves zone_reclaim and associated helpers
outside CONFIG_NUMA. This infrastructure is reused
in the patches for page cache control that follow.
Signed-off-by: Balbir Singh <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
---
include/linux/mmzone.h | 4 ++--
include/linux/swap.h | 4 ++--
kernel/sysctl.c | 16 ++++++++--------
mm/page_alloc.c | 6 +++---
mm/vmscan.c | 2 --
5 files changed, 15 insertions(+), 17 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 628f07b..59cbed0 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -306,12 +306,12 @@ struct zone {
*/
unsigned long lowmem_reserve[MAX_NR_ZONES];
-#ifdef CONFIG_NUMA
- int node;
/*
* zone reclaim becomes active if more unmapped pages exist.
*/
unsigned long min_unmapped_pages;
+#ifdef CONFIG_NUMA
+ int node;
unsigned long min_slab_pages;
#endif
struct per_cpu_pageset __percpu *pageset;
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ed6ebe6..ce8f686 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -264,11 +264,11 @@ extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern long vm_total_pages;
+extern int sysctl_min_unmapped_ratio;
+extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
#ifdef CONFIG_NUMA
extern int zone_reclaim_mode;
-extern int sysctl_min_unmapped_ratio;
extern int sysctl_min_slab_ratio;
-extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
#else
#define zone_reclaim_mode 0
static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 927fc5a..e3a8ce4 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1214,14 +1214,6 @@ static struct ctl_table vm_table[] = {
.proc_handler = proc_dointvec_unsigned,
},
#endif
-#ifdef CONFIG_NUMA
- {
- .procname = "zone_reclaim_mode",
- .data = &zone_reclaim_mode,
- .maxlen = sizeof(zone_reclaim_mode),
- .mode = 0644,
- .proc_handler = proc_dointvec_unsigned,
- },
{
.procname = "min_unmapped_ratio",
.data = &sysctl_min_unmapped_ratio,
@@ -1231,6 +1223,14 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one_hundred,
},
+#ifdef CONFIG_NUMA
+ {
+ .procname = "zone_reclaim_mode",
+ .data = &zone_reclaim_mode,
+ .maxlen = sizeof(zone_reclaim_mode),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_unsigned,
+ },
{
.procname = "min_slab_ratio",
.data = &sysctl_min_slab_ratio,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 6e1b52a..1d32865 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4249,10 +4249,10 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
zone->spanned_pages = size;
zone->present_pages = realsize;
-#ifdef CONFIG_NUMA
- zone->node = nid;
zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
/ 100;
+#ifdef CONFIG_NUMA
+ zone->node = nid;
zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
#endif
zone->name = zone_names[j];
@@ -5157,7 +5157,6 @@ int min_free_kbytes_sysctl_handler(ctl_table *table, int write,
return 0;
}
-#ifdef CONFIG_NUMA
int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
{
@@ -5174,6 +5173,7 @@ int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
return 0;
}
+#ifdef CONFIG_NUMA
int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
{
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 060e4c1..4923160 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2874,7 +2874,6 @@ static int __init kswapd_init(void)
module_init(kswapd_init)
-#ifdef CONFIG_NUMA
/*
* Zone reclaim mode
*
@@ -3084,7 +3083,6 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
return ret;
}
-#endif
/*
* page_evictable - test whether a page is evictable
Changelog v3
1. Renamed zone_reclaim_unmapped_pages to zone_reclaim_pages
Refactor zone_reclaim, move reusable functionality outside
of zone_reclaim. Make zone_reclaim_unmapped_pages modular
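The loop that moves into the helper can be modeled in user space roughly
as follows. This is only a sketch: model_shrink() is a hypothetical
stand-in for shrink_zone() that frees (reclaimable >> priority) pages per
pass, and the starting priority of 4 is an assumption standing in for
ZONE_RECLAIM_PRIORITY.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed starting priority; stands in for ZONE_RECLAIM_PRIORITY. */
#define MODEL_RECLAIM_PRIORITY 4

/* Hypothetical stand-in for the parts of scan_control used here. */
struct scan_model {
	unsigned long reclaimable;	/* pages that could still be freed */
	unsigned long nr_reclaimed;	/* pages freed so far */
	int passes;			/* number of shrink passes made */
};

/* Stand-in for shrink_zone(): a lower priority value scans and frees more. */
static void model_shrink(int priority, struct scan_model *sc)
{
	unsigned long freed = sc->reclaimable >> priority;

	sc->reclaimable -= freed;
	sc->nr_reclaimed += freed;
	sc->passes++;
}

/*
 * Same shape as the helper in this patch: shrink at least once, then
 * keep lowering the priority value until enough pages are freed or
 * priority 0 has been tried.
 */
void model_reclaim_pages(struct scan_model *sc, unsigned long nr_pages)
{
	int priority = MODEL_RECLAIM_PRIORITY;

	assert(sc != NULL);
	do {
		model_shrink(priority, sc);
		priority--;
	} while (priority >= 0 && sc->nr_reclaimed < nr_pages);
}
```

Note that the do/while form always performs one pass, even when nr_pages
is 0, and gives up after priority 0 whether or not the target was met.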
Signed-off-by: Balbir Singh <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
---
mm/vmscan.c | 35 +++++++++++++++++++++++------------
1 files changed, 23 insertions(+), 12 deletions(-)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 4923160..5b24e74 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -2949,6 +2949,27 @@ static long zone_pagecache_reclaimable(struct zone *zone)
}
/*
+ * Helper function to reclaim unmapped pages, we might add something
+ * similar to this for slab cache as well. Currently this function
+ * is shared with __zone_reclaim()
+ */
+static inline void
+zone_reclaim_pages(struct zone *zone, struct scan_control *sc,
+ unsigned long nr_pages)
+{
+ int priority;
+ /*
+ * Free memory by calling shrink zone with increasing
+ * priorities until we have enough memory freed.
+ */
+ priority = ZONE_RECLAIM_PRIORITY;
+ do {
+ shrink_zone(priority, zone, sc);
+ priority--;
+ } while (priority >= 0 && sc->nr_reclaimed < nr_pages);
+}
+
+/*
* Try to free up some pages from this zone through reclaim.
*/
static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
@@ -2957,7 +2978,6 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
const unsigned long nr_pages = 1 << order;
struct task_struct *p = current;
struct reclaim_state reclaim_state;
- int priority;
struct scan_control sc = {
.may_writepage = !!(zone_reclaim_mode & RECLAIM_WRITE),
.may_unmap = !!(zone_reclaim_mode & RECLAIM_SWAP),
@@ -2981,17 +3001,8 @@ static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
reclaim_state.reclaimed_slab = 0;
p->reclaim_state = &reclaim_state;
- if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages) {
- /*
- * Free memory by calling shrink zone with increasing
- * priorities until we have enough memory freed.
- */
- priority = ZONE_RECLAIM_PRIORITY;
- do {
- shrink_zone(priority, zone, &sc);
- priority--;
- } while (priority >= 0 && sc.nr_reclaimed < nr_pages);
- }
+ if (zone_pagecache_reclaimable(zone) > zone->min_unmapped_pages)
+ zone_reclaim_pages(zone, &sc, nr_pages);
nr_slab_pages0 = zone_page_state(zone, NR_SLAB_RECLAIMABLE);
if (nr_slab_pages0 > zone->min_slab_pages) {
Changelog v4
1. Added documentation for max_unmapped_pages
2. Better #ifdef'ing of max_unmapped_pages and min_unmapped_pages
Changelog v2
1. Use a config option to enable the code (Andrew Morton)
2. Explain the magic tunables in the code or at least attempt
to explain them (General comment)
3. Hint uses of the boot parameter with unlikely (Andrew Morton)
4. Use better names (balanced is not a good naming convention)
Provide control using zone_reclaim() and a boot parameter. The
code reuses functionality from zone_reclaim() to isolate unmapped
pages and reclaim them as a priority, ahead of other mapped pages.
Signed-off-by: Balbir Singh <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
---
Documentation/kernel-parameters.txt | 8 +++
Documentation/sysctl/vm.txt | 19 +++++++-
include/linux/mmzone.h | 7 +++
include/linux/swap.h | 25 ++++++++--
init/Kconfig | 12 +++++
kernel/sysctl.c | 13 +++++
mm/page_alloc.c | 29 ++++++++++++
mm/vmscan.c | 88 +++++++++++++++++++++++++++++++++++
8 files changed, 194 insertions(+), 7 deletions(-)
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index d4e67a5..f522c34 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2520,6 +2520,14 @@ bytes respectively. Such letter suffixes can also be entirely omitted.
[X86]
Set unknown_nmi_panic=1 early on boot.
+ unmapped_page_control
+ [KNL] Available if CONFIG_UNMAPPED_PAGECACHE_CONTROL
+ is enabled. It controls the amount of unmapped memory
+ that is present in the system. This boot option plus
+ vm.min_unmapped_ratio (sysctl) provide granular control
+ over how much unmapped page cache can exist in the system
+ before kswapd starts reclaiming unmapped page cache pages.
+
usbcore.autosuspend=
[USB] The autosuspend time delay (in seconds) used
for newly-detected USB devices (default 2). This
diff --git a/Documentation/sysctl/vm.txt b/Documentation/sysctl/vm.txt
index 30289fa..1c722f7 100644
--- a/Documentation/sysctl/vm.txt
+++ b/Documentation/sysctl/vm.txt
@@ -381,11 +381,14 @@ and may not be fast.
min_unmapped_ratio:
-This is available only on NUMA kernels.
+This is available only on NUMA kernels or when unmapped page cache
+control is enabled.
This is a percentage of the total pages in each zone. Zone reclaim will
only occur if more than this percentage of pages are in a state that
-zone_reclaim_mode allows to be reclaimed.
+zone_reclaim_mode allows to be reclaimed. If unmapped page cache control
+is enabled, this is the minimum level down to which the cache will be
+shrunk.
If zone_reclaim_mode has the value 4 OR'd, then the percentage is compared
against all file-backed unmapped pages including swapcache pages and tmpfs
@@ -396,6 +399,18 @@ The default is 1 percent.
==============================================================
+max_unmapped_ratio:
+
+This is available only when unmapped page cache control is enabled.
+
+This is a percentage of the total pages in each zone. kswapd is woken to
+reclaim unmapped page cache only if more than this percentage of pages in
+a zone are unmapped file pages and unmapped page cache control is enabled.
+
+The default is 16 percent.
+
+==============================================================
+
mmap_min_addr
This file indicates the amount of address space which a user process will
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 59cbed0..caa29ad 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -309,7 +309,12 @@ struct zone {
/*
* zone reclaim becomes active if more unmapped pages exist.
*/
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
unsigned long min_unmapped_pages;
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+ unsigned long max_unmapped_pages;
+#endif
#ifdef CONFIG_NUMA
int node;
unsigned long min_slab_pages;
@@ -776,6 +781,8 @@ int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
+int sysctl_max_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
+ void __user *, size_t *, loff_t *);
int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
void __user *, size_t *, loff_t *);
diff --git a/include/linux/swap.h b/include/linux/swap.h
index ce8f686..86cafc5 100644
--- a/include/linux/swap.h
+++ b/include/linux/swap.h
@@ -264,19 +264,36 @@ extern int vm_swappiness;
extern int remove_mapping(struct address_space *mapping, struct page *page);
extern long vm_total_pages;
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
extern int sysctl_min_unmapped_ratio;
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+extern int sysctl_max_unmapped_ratio;
+#endif
+
extern int zone_reclaim(struct zone *, gfp_t, unsigned int);
-#ifdef CONFIG_NUMA
-extern int zone_reclaim_mode;
-extern int sysctl_min_slab_ratio;
#else
-#define zone_reclaim_mode 0
static inline int zone_reclaim(struct zone *z, gfp_t mask, unsigned int order)
{
return 0;
}
#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+extern bool should_reclaim_unmapped_pages(struct zone *zone);
+#else
+static inline bool should_reclaim_unmapped_pages(struct zone *zone)
+{
+ return false;
+}
+#endif
+
+#ifdef CONFIG_NUMA
+extern int zone_reclaim_mode;
+extern int sysctl_min_slab_ratio;
+#else
+#define zone_reclaim_mode 0
+#endif
+
extern int page_evictable(struct page *page, struct vm_area_struct *vma);
extern void scan_mapping_unevictable_pages(struct address_space *);
diff --git a/init/Kconfig b/init/Kconfig
index 41b2431..222b3af 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -811,6 +811,18 @@ config SCHED_AUTOGROUP
config MM_OWNER
bool
+config UNMAPPED_PAGECACHE_CONTROL
+ bool "Provide control over unmapped page cache"
+ default n
+ help
+ This option adds support for controlling unmapped page cache
+ via a boot parameter (unmapped_page_control). The boot parameter
+ with sysctl (vm.min_unmapped_ratio) control the total number
+ of unmapped pages in the system. This feature is useful if
+ you want to limit the amount of unmapped page cache or want
+ to reduce page cache duplication in a virtualized environment.
+ If unsure say 'N'
+
config SYSFS_DEPRECATED
bool "Enable deprecated sysfs features to support old userspace tools"
depends on SYSFS
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index e3a8ce4..d9e77da 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1214,6 +1214,7 @@ static struct ctl_table vm_table[] = {
.proc_handler = proc_dointvec_unsigned,
},
#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
{
.procname = "min_unmapped_ratio",
.data = &sysctl_min_unmapped_ratio,
@@ -1223,6 +1224,18 @@ static struct ctl_table vm_table[] = {
.extra1 = &zero,
.extra2 = &one_hundred,
},
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+ {
+ .procname = "max_unmapped_ratio",
+ .data = &sysctl_max_unmapped_ratio,
+ .maxlen = sizeof(sysctl_max_unmapped_ratio),
+ .mode = 0644,
+ .proc_handler = sysctl_max_unmapped_ratio_sysctl_handler,
+ .extra1 = &zero,
+ .extra2 = &one_hundred,
+ },
+#endif
#ifdef CONFIG_NUMA
{
.procname = "zone_reclaim_mode",
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1d32865..5b89e5b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1669,6 +1669,9 @@ zonelist_scan:
unsigned long mark;
int ret;
+ if (should_reclaim_unmapped_pages(zone))
+ wakeup_kswapd(zone, order, classzone_idx);
+
mark = zone->watermark[alloc_flags & ALLOC_WMARK_MASK];
if (zone_watermark_ok(zone, order, mark,
classzone_idx, alloc_flags))
@@ -4249,8 +4252,14 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat,
zone->spanned_pages = size;
zone->present_pages = realsize;
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
zone->min_unmapped_pages = (realsize*sysctl_min_unmapped_ratio)
/ 100;
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+ zone->max_unmapped_pages = (realsize*sysctl_max_unmapped_ratio)
+ / 100;
+#endif
#ifdef CONFIG_NUMA
zone->node = nid;
zone->min_slab_pages = (realsize * sysctl_min_slab_ratio) / 100;
@@ -5157,6 +5166,7 @@ int min_free_kbytes_sysctl_handler(ctl_table *table, int write,
return 0;
}
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
{
@@ -5173,6 +5183,25 @@ int sysctl_min_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
return 0;
}
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+int sysctl_max_unmapped_ratio_sysctl_handler(ctl_table *table, int write,
+ void __user *buffer, size_t *length, loff_t *ppos)
+{
+ struct zone *zone;
+ int rc;
+
+ rc = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ if (rc)
+ return rc;
+
+ for_each_zone(zone)
+ zone->max_unmapped_pages = (zone->present_pages *
+ sysctl_max_unmapped_ratio) / 100;
+ return 0;
+}
+#endif
+
#ifdef CONFIG_NUMA
int sysctl_min_slab_ratio_sysctl_handler(ctl_table *table, int write,
void __user *buffer, size_t *length, loff_t *ppos)
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5b24e74..bb06710 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -158,6 +158,29 @@ static DECLARE_RWSEM(shrinker_rwsem);
#define scanning_global_lru(sc) (1)
#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+static void reclaim_unmapped_pages(int priority, struct zone *zone,
+ struct scan_control *sc);
+static int unmapped_page_control __read_mostly;
+
+static int __init unmapped_page_control_parm(char *str)
+{
+ unmapped_page_control = 1;
+ /*
+ * XXX: Should we tweak swappiness here?
+ */
+ return 1;
+}
+__setup("unmapped_page_control", unmapped_page_control_parm);
+
+#else /* !CONFIG_UNMAPPED_PAGECACHE_CONTROL */
+static inline void reclaim_unmapped_pages(int priority,
+ struct zone *zone, struct scan_control *sc)
+{
+ return;
+}
+#endif
+
static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
struct scan_control *sc)
{
@@ -2371,6 +2394,12 @@ loop_again:
shrink_active_list(SWAP_CLUSTER_MAX, zone,
&sc, priority, 0);
+ /*
+ * We do unmapped page reclaim once here and once
+ * below, so that we don't lose out
+ */
+ reclaim_unmapped_pages(priority, zone, &sc);
+
if (!zone_watermark_ok_safe(zone, order,
high_wmark_pages(zone), 0, 0)) {
end_zone = i;
@@ -2408,6 +2437,11 @@ loop_again:
continue;
sc.nr_scanned = 0;
+ /*
+ * Reclaim unmapped pages upfront, this should be
+ * really cheap
+ */
+ reclaim_unmapped_pages(priority, zone, &sc);
/*
* Call soft limit reclaim before calling shrink_zone.
@@ -2721,7 +2755,8 @@ void wakeup_kswapd(struct zone *zone, int order, enum zone_type classzone_idx)
}
if (!waitqueue_active(&pgdat->kswapd_wait))
return;
- if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0))
+ if (zone_watermark_ok_safe(zone, order, low_wmark_pages(zone), 0, 0) &&
+ !should_reclaim_unmapped_pages(zone))
return;
trace_mm_vmscan_wakeup_kswapd(pgdat->node_id, zone_idx(zone), order);
@@ -2874,6 +2909,7 @@ static int __init kswapd_init(void)
module_init(kswapd_init)
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
/*
* Zone reclaim mode
*
@@ -2900,6 +2936,10 @@ int zone_reclaim_mode __read_mostly;
*/
int sysctl_min_unmapped_ratio = 1;
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+int sysctl_max_unmapped_ratio = 16;
+#endif
+
/*
* If the number of slab pages in a zone grows beyond this percentage then
* slab reclaim needs to occur.
@@ -3094,6 +3134,52 @@ int zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
return ret;
}
+#endif
+
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+/*
+ * Routine to reclaim unmapped pages, inspired from the code under
+ * CONFIG_NUMA that does unmapped page and slab page control by keeping
+ * min_unmapped_pages in the zone. We currently reclaim just unmapped
+ * pages, slab control will come in soon, at which point this routine
+ * should be called reclaim cached pages
+ */
+void reclaim_unmapped_pages(int priority, struct zone *zone,
+ struct scan_control *sc)
+{
+ if (unlikely(unmapped_page_control) &&
+ (zone_unmapped_file_pages(zone) > zone->min_unmapped_pages)) {
+ struct scan_control nsc;
+ unsigned long nr_pages;
+
+ nsc = *sc;
+
+ nsc.swappiness = 0;
+ nsc.may_writepage = 0;
+ nsc.may_unmap = 0;
+ nsc.nr_reclaimed = 0;
+
+ nr_pages = zone_unmapped_file_pages(zone) -
+ zone->min_unmapped_pages;
+ /*
+ * We don't want to be too aggressive with our
+ * reclaim, it is our best effort to control
+ * unmapped pages
+ */
+ nr_pages >>= 3;
+
+ zone_reclaim_pages(zone, &nsc, nr_pages);
+ }
+}
+
+bool should_reclaim_unmapped_pages(struct zone *zone)
+{
+ if (unlikely(unmapped_page_control) &&
+ (zone_unmapped_file_pages(zone) > zone->max_unmapped_pages))
+ return true;
+ return false;
+}
+#endif
/*
* page_evictable - test whether a page is evictable
On Wed, 30 Mar 2011 11:00:26 +0530
Balbir Singh <[email protected]> wrote:
> Data from the previous patchsets can be found at
> https://lkml.org/lkml/2010/11/30/79
It would be nice if the data for the current patchset was present in
the current patchset's changelog!
On Wed, 30 Mar 2011 11:02:38 +0530
Balbir Singh <[email protected]> wrote:
> Changelog v4
> 1. Added documentation for max_unmapped_pages
> 2. Better #ifdef'ing of max_unmapped_pages and min_unmapped_pages
>
> Changelog v2
> 1. Use a config option to enable the code (Andrew Morton)
> 2. Explain the magic tunables in the code or at-least attempt
> to explain them (General comment)
> 3. Hint uses of the boot parameter with unlikely (Andrew Morton)
> 4. Use better names (balanced is not a good naming convention)
>
> Provide control using zone_reclaim() and a boot parameter. The
> code reuses functionality from zone_reclaim() to isolate unmapped
> pages and reclaim them as a priority, ahead of other mapped pages.
>
This:
akpm:/usr/src/25> grep '^+#' patches/provide-control-over-unmapped-pages-v5.patch
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#else
+#endif
+#ifdef CONFIG_NUMA
+#else
+#define zone_reclaim_mode 0
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#else /* !CONFIG_UNMAPPED_PAGECACHE_CONTROL */
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
+#endif
+#endif
+#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
is getting out of control. What happens if we just make the feature
non-configurable?
> +static int __init unmapped_page_control_parm(char *str)
> +{
> + unmapped_page_control = 1;
> + /*
> + * XXX: Should we tweak swappiness here?
> + */
> + return 1;
> +}
> +__setup("unmapped_page_control", unmapped_page_control_parm);
That looks like a pain - it requires a reboot to change the option,
which makes testing harder and slower. Methinks you're being a bit
virtualization-centric here!
> +#else /* !CONFIG_UNMAPPED_PAGECACHE_CONTROL */
> +static inline void reclaim_unmapped_pages(int priority,
> + struct zone *zone, struct scan_control *sc)
> +{
> + return;
> +}
> +#endif
> +
> static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> struct scan_control *sc)
> {
> @@ -2371,6 +2394,12 @@ loop_again:
> shrink_active_list(SWAP_CLUSTER_MAX, zone,
> &sc, priority, 0);
>
> + /*
> + * We do unmapped page reclaim once here and once
> + * below, so that we don't lose out
> + */
> + reclaim_unmapped_pages(priority, zone, &sc);
Doing this here seems wrong. balance_pgdat() does two passes across
the zones. The first pass is a read-only work-out-what-to-do pass and
the second pass is a now-reclaim-some-stuff pass. But here we've stuck
a do-some-reclaiming operation inside the first, work-out-what-to-do pass.
> @@ -2408,6 +2437,11 @@ loop_again:
> continue;
>
> sc.nr_scanned = 0;
> + /*
> + * Reclaim unmapped pages upfront, this should be
> + * really cheap
Comment is mysterious. Why is it cheap?
> + */
> + reclaim_unmapped_pages(priority, zone, &sc);
I dunno, the whole thing seems rather nasty to me.
It sticks a magical reclaim-unmapped-pages operation right in the
middle of regular page reclaim. This means that reclaim will walk the
LRU looking at mapped and unmapped pages. Then it will walk some more,
looking at only unmapped pages and moving the mapped ones to the head
of the LRU. Then it goes back to looking at mapped and unmapped pages.
So it rather screws up the LRU ordering and page aging, does it not?
Also, the special-case handling sticks out like a sore thumb. Would it
not be better to manage the mapped/unmapped bias within the core of the
regular scanning? ie: in shrink_page_list().
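The ordering concern above can be illustrated with a toy user-space model.
This is a sketch under stated assumptions, not kernel code: the LRU is a
small array with the oldest page at index 0, and scan_unmapped_only() is a
hypothetical stand-in for a scan with may_unmap cleared, which reclaims
unmapped pages and rotates mapped ones to the head.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define LRU_MODEL_MAX 16

/* Hypothetical page descriptor for the model. */
struct page_model {
	int id;
	bool mapped;
};

/* pages[0] is the tail (oldest); pages[count - 1] is the head (youngest). */
struct lru_model {
	struct page_model pages[LRU_MODEL_MAX];
	size_t count;
};

/*
 * Scan up to nr_to_scan pages from the tail. Unmapped pages are
 * reclaimed (dropped); mapped pages are rotated to the head, which
 * makes them look recently used. Returns the number reclaimed.
 */
size_t scan_unmapped_only(struct lru_model *lru, size_t nr_to_scan)
{
	size_t reclaimed = 0;

	assert(lru->count <= LRU_MODEL_MAX);
	if (nr_to_scan > lru->count)
		nr_to_scan = lru->count;

	for (size_t i = 0; i < nr_to_scan; i++) {
		struct page_model victim = lru->pages[0];

		/* Take the oldest page off the tail. */
		memmove(&lru->pages[0], &lru->pages[1],
			(lru->count - 1) * sizeof(lru->pages[0]));
		lru->count--;

		if (victim.mapped)
			lru->pages[lru->count++] = victim; /* rotate to head */
		else
			reclaimed++;
	}
	return reclaimed;
}
```

Starting from pages 1..6 (oldest first) with pages 1, 3 and 6 mapped, a
scan of four pages reclaims pages 2 and 4 and leaves the order 5, 6, 1, 3.
The two oldest mapped pages now sit at the head, ahead of pages that were
younger than them before the scan, so a later regular scan sees a
different age ordering than the one the workload produced.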
* Andrew Morton <[email protected]> [2011-03-30 16:36:07]:
> On Wed, 30 Mar 2011 11:00:26 +0530
> Balbir Singh <[email protected]> wrote:
>
> > Data from the previous patchsets can be found at
> > https://lkml.org/lkml/2010/11/30/79
>
> It would be nice if the data for the current patchset was present in
> the current patchset's changelog!
>
Sure, since there were no major changes, I put in a URL. The main
change was the documentation update.
--
Three Cheers,
Balbir
On Thu, 31 Mar 2011 10:57:03 +0530 Balbir Singh <[email protected]> wrote:
> * Andrew Morton <[email protected]> [2011-03-30 16:36:07]:
>
> > On Wed, 30 Mar 2011 11:00:26 +0530
> > Balbir Singh <[email protected]> wrote:
> >
> > > Data from the previous patchsets can be found at
> > > https://lkml.org/lkml/2010/11/30/79
> >
> > It would be nice if the data for the current patchset was present in
> > the current patchset's changelog!
> >
>
> Sure, since there were no major changes, I put in a URL. The main
> change was the documentation update.
Well some poor schmuck has to copy and paste the data into the
changelog so it's still there in five years' time. It's better to carry
this info around in the patch's own metadata, and to maintain
and update it.
>
> The following series implements page cache control,
> this is a split out version of patch 1 of version 3 of the
> page cache optimization patches posted earlier at
> Previous posting http://lwn.net/Articles/425851/ and analysis
> at http://lwn.net/Articles/419713/
>
> Detailed Description
> ====================
> This patch implements unmapped page cache control via preferred
> page cache reclaim. The current patch hooks into kswapd and reclaims
> page cache if the user has requested for unmapped page control.
> This is useful in the following scenario
> - In a virtualized environment with cache=writethrough, we see
> double caching - (one in the host and one in the guest). As
> we try to scale guests, cache usage across the system grows.
> The goal of this patch is to reclaim page cache when Linux is running
> as a guest and get the host to hold the page cache and manage it.
> There might be temporary duplication, but in the long run, memory
> in the guests would be used for mapped pages.
>
> - The option is controlled via a boot option and the administrator
> can selectively turn it on, on a need to use basis.
>
> A lot of the code is borrowed from zone_reclaim_mode logic for
> __zone_reclaim(). One might argue that the with ballooning and
> KSM this feature is not very useful, but even with ballooning,
> we need extra logic to balloon multiple VM machines and it is hard
> to figure out the correct amount of memory to balloon. With these
> patches applied, each guest has a sufficient amount of free memory
> available, that can be easily seen and reclaimed by the balloon driver.
> The additional memory in the guest can be reused for additional
> applications or used to start additional guests/balance memory in
> the host.
If anyone thinks this series works, they are just crazy. This patch
reintroduces two old issues.
1) Zone reclaim doesn't work if the system has multiple nodes and the
workload is file cache oriented (e.g. file server, web server, mail server,
et al.), because zone reclaim frees many more pages than zone->pages_min,
then a new page cache request consumes memory from the nearest node, and
that triggers the next zone reclaim. As a result, memory utilization is
reduced and unnecessary LRU discards increase dramatically.
SGI folks added a CPUSET-specific solution in the past
(cpuset.memory_spread_page), but global reclaim still has this issue.
Zone reclaim is an HPC-workload-specific feature, and HPC folks have no
reason not to use CPUSET.
2) Before 2.6.27, the VM had only one LRU, and calc_reclaim_mapped() was
used to decide whether to filter out mapped pages. That caused a lot of
problems for DB servers and large application servers because, if the
system has a lot of mapped pages, a) the LRU was churned and the reclaim
algorithm became a lottery, and b) reclaim latency became terribly slow,
so hangup detectors misdetected the state and forced a reboot. That was
a big problem for RHEL5-based banking systems.
So sc->may_unmap should be killed in the future; don't add more uses of it.
Also, this patch introduces new overhead in the allocator fast path. I
haven't seen any justification for it.
In other words, you have to kill the following three things to get an ack:
1) zone-reclaim-oriented reclaim, 2) filter-based LRU scanning (e.g.
sc->may_unmap), and 3) fast path overhead. Put differently, if you want a
feature for VM guests, any hardcoded machine configuration assumption
and/or workload assumption is wrong.
But I agree that we may now have to consider a somewhat larger VM change
(or perhaps not). OK, this is a good opportunity to fill in some background.
Historically, Linux MM has had a "free memory is wasted memory" policy, and
it worked completely fine. But now we have a few exceptions.
1) RT, embedded and finance systems. They really want to avoid reclaim
latency (i.e. avoid foreground reclaim completely), and they can accept
keeping slightly more pages free ahead of a memory shortage.
2) VM guests.
A VM host and a VM guest naturally form a two-level page cache model, and
the Linux page cache does not work well with two levels. It has two issues:
a) It is hard to visualize real memory consumption. That makes it harder
for ballooning to work well. Google also wants to visualize memory
utilization in order to pack in more jobs.
b) It is hard to build an in-kernel mechanism for improving memory
utilization.
And, now we have four proposal of utilization related issues.
1) cleancache (from Oracle)
2) VirtFS (from IBM)
3) kstaled (from Google)
4) unmapped page reclaim (from you)
Probably we can't merge all of them, and we need to consolidate some
requirements and implementations.
cleancache seems the most straightforward two-level cache handling for
virtualization, but it has some Xen-specific mess and currently doesn't fit RT
usage. VirtFS has another interesting de-duplication idea, but a filesystem-based
implementation naturally inherits some VFS interface limitations.
Google's approach is more unique. memcg doesn't have the double cache
issue, therefore they only want to visualize it.
Personally I think cleancache or another multi-level page cache framework
looks promising, but another solution is also acceptable. Anyway, I hope
everyone steps back for a 1000-foot bird's-eye view and sorts out all requirements
with all the people involved.
>
> KSM currently does not de-duplicate host and guest page cache. The goal
> of this patch is to help automatically balance unmapped page cache when
> instructed to do so.
>
> The sysctl for min_unmapped_ratio provides further control from
> within the guest on the amount of unmapped pages to reclaim, a similar
> max_unmapped_ratio sysctl is added and helps in the decision making
> process of when reclaim should occur. This is tunable and set by
> default to 16 (based on tradeoff's seen between aggressiveness in
> balancing versus size of unmapped pages). Distro's and administrators
> can further tweak this for desired control.
>
> Data from the previous patchsets can be found at
> https://lkml.org/lkml/2010/11/30/79
>
> ---
>
> Balbir Singh (3):
> Move zone_reclaim() outside of CONFIG_NUMA
> Refactor zone_reclaim code
> Provide control over unmapped pages
>
>
> Documentation/kernel-parameters.txt | 8 ++
> Documentation/sysctl/vm.txt | 19 +++++
> include/linux/mmzone.h | 11 +++
> include/linux/swap.h | 25 ++++++-
> init/Kconfig | 12 +++
> kernel/sysctl.c | 29 ++++++--
> mm/page_alloc.c | 35 +++++++++-
> mm/vmscan.c | 123 +++++++++++++++++++++++++++++++----
> 8 files changed, 229 insertions(+), 33 deletions(-)
* Andrew Morton <[email protected]> [2011-03-30 16:35:45]:
> On Wed, 30 Mar 2011 11:02:38 +0530
> Balbir Singh <[email protected]> wrote:
>
> > Changelog v4
> > 1. Added documentation for max_unmapped_pages
> > 2. Better #ifdef'ing of max_unmapped_pages and min_unmapped_pages
> >
> > Changelog v2
> > 1. Use a config option to enable the code (Andrew Morton)
> > 2. Explain the magic tunables in the code or at-least attempt
> > to explain them (General comment)
> > 3. Hint uses of the boot parameter with unlikely (Andrew Morton)
> > 4. Use better names (balanced is not a good naming convention)
> >
> > Provide control using zone_reclaim() and a boot parameter. The
> > code reuses functionality from zone_reclaim() to isolate unmapped
> > pages and reclaim them as a priority, ahead of other mapped pages.
> >
>
> This:
>
> akpm:/usr/src/25> grep '^+#' patches/provide-control-over-unmapped-pages-v5.patch
> +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
> +#endif
> +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
> +#endif
> +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
> +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
> +#endif
> +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
> +#else
> +#endif
> +#ifdef CONFIG_NUMA
> +#else
> +#define zone_reclaim_mode 0
> +#endif
> +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
> +#endif
> +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
> +#endif
> +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
> +#endif
> +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
> +#endif
> +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
> +#endif
> +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
> +#endif
> +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
> +#else /* !CONFIG_UNMAPPED_PAGECACHE_CONTROL */
> +#endif
> +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL) || defined(CONFIG_NUMA)
> +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
> +#endif
> +#endif
> +#if defined(CONFIG_UNMAPPED_PAGECACHE_CONTROL)
>
> is getting out of control. What happens if we just make the feature
> non-configurable?
>
I added the configuration based on review comments I received. If the
feature is made non-configurable, it should be easy to remove them or
just set the default value to "y" in the config.
> > +static int __init unmapped_page_control_parm(char *str)
> > +{
> > + unmapped_page_control = 1;
> > + /*
> > + * XXX: Should we tweak swappiness here?
> > + */
> > + return 1;
> > +}
> > +__setup("unmapped_page_control", unmapped_page_control_parm);
>
> That looks like a pain - it requires a reboot to change the option,
> which makes testing harder and slower. Methinks you're being a bit
> virtualization-centric here!
:-) The reason for the boot parameter is to ensure that people know
what they are doing.
>
> > +#else /* !CONFIG_UNMAPPED_PAGECACHE_CONTROL */
> > +static inline void reclaim_unmapped_pages(int priority,
> > + struct zone *zone, struct scan_control *sc)
> > +{
> > + return 0;
> > +}
> > +#endif
> > +
> > static struct zone_reclaim_stat *get_reclaim_stat(struct zone *zone,
> > struct scan_control *sc)
> > {
> > @@ -2371,6 +2394,12 @@ loop_again:
> > shrink_active_list(SWAP_CLUSTER_MAX, zone,
> > &sc, priority, 0);
> >
> > + /*
> > + * We do unmapped page reclaim once here and once
> > + * below, so that we don't lose out
> > + */
> > + reclaim_unmapped_pages(priority, zone, &sc);
>
> Doing this here seems wrong. balance_pgdat() does two passes across
> the zones. The first pass is a read-only work-out-what-to-do pass and
> the second pass is a now-reclaim-some-stuff pass. But here we've stuck
> a do-some-reclaiming operation inside the first, work-out-what-to-do pass.
>
The reason is primarily balancing: the zone watermarks do not give us
a good idea of whether unmapped pages are balanced, hence the code.
>
> > @@ -2408,6 +2437,11 @@ loop_again:
> > continue;
> >
> > sc.nr_scanned = 0;
> > + /*
> > + * Reclaim unmapped pages upfront, this should be
> > + * really cheap
>
> Comment is mysterious. Why is it cheap?
Cheap because we do a quick check to see if unmapped pages exceed a
threshold. Only selected users are expected to enable this functionality
(the use case is primarily for embedded and virtualization folks), so
this should be a simple check.
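As a rough illustration of why such a check can be cheap, here is a minimal
user-space sketch of a threshold test. It is not the code from the patch:
struct zone_counts, unmapped_pages() and should_reclaim_unmapped() are
hypothetical names, and the ratio is assumed to be a percentage of the pages
present in the zone, in the same spirit as min_unmapped_ratio.

```c
#include <stdbool.h>

/* Hypothetical, simplified stand-in for per-zone counters (not struct zone). */
struct zone_counts {
	unsigned long file_pages;    /* all page cache pages in the zone */
	unsigned long mapped_pages;  /* page cache pages mapped by a process */
	unsigned long present_pages; /* total pages present in the zone */
};

/* Unmapped page cache = file pages minus mapped file pages. */
static unsigned long unmapped_pages(const struct zone_counts *z)
{
	/* Counters may be sampled racily, so never let the result underflow. */
	if (z->file_pages <= z->mapped_pages)
		return 0;
	return z->file_pages - z->mapped_pages;
}

/*
 * The whole test is a subtraction, a multiply/divide and a compare.
 * max_unmapped_ratio is assumed here to be a percentage of present pages.
 */
static bool should_reclaim_unmapped(const struct zone_counts *z,
				    unsigned int max_unmapped_ratio)
{
	unsigned long limit = z->present_pages * max_unmapped_ratio / 100;

	return unmapped_pages(z) > limit;
}
```

With 1000 present pages and a ratio of 16, the limit is 160 pages, so a zone
holding 200 unmapped page cache pages would trigger reclaim and one holding
100 would not.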
>
> > + */
> > + reclaim_unmapped_pages(priority, zone, &sc);
>
>
> I dunno, the whole thing seems rather nasty to me.
>
> It sticks a magical reclaim-unmapped-pages operation right in the
> middle of regular page reclaim. This means that reclaim will walk the
> LRU looking at mapped and unmapped pages. Then it will walk some more,
> looking at only unmapped pages and moving the mapped ones to the head
> of the LRU. Then it goes back to looking at mapped and unmapped pages.
> So it rather screws up the LRU ordering and page aging, does it not?
>
This was brought up earlier, it is no different than zone_reclaim of
unmapped pages, in that
1. It is a specialized case where we want explicit control of unmapped
pages and what you mention is the price
2. This situation can be improved with an incremental patch to be
smart about page isolation.
> Also, the special-case handling sticks out like a sore thumb. Would it
> not be better to manage the mapped/unmapped bias within the core of the
> regular scanning? ie: in shrink_page_list().
Or in isolating, we could check if we really want to isolate mapped
pages or not. I intend to send an incremental patch to improve that
part.
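The LRU ordering concern raised above can be seen in a toy model. The sketch
below is only an illustration under simplifying assumptions, not kernel code:
struct page_sim, SCAN_MAX and scan_unmapped_only() are hypothetical, the LRU
is a plain array with the coldest page at index 0, and an unmapped-only scan
is assumed to reclaim unmapped pages and rotate the mapped pages it skips to
the head of the list.

```c
#include <stddef.h>
#include <string.h>

#define SCAN_MAX 64 /* arbitrary cap on one scan batch in this toy model */

struct page_sim {
	int id;
	int mapped; /* nonzero if the page is mapped by some process */
};

/*
 * lru[0] is the tail (coldest page), lru[len - 1] is the head (hottest).
 * Scan up to nr_to_scan pages from the tail: unmapped pages are reclaimed,
 * mapped pages are rotated to the head. Returns the new list length.
 */
static size_t scan_unmapped_only(struct page_sim *lru, size_t len,
				 size_t nr_to_scan)
{
	struct page_sim rotated[SCAN_MAX];
	size_t nr_rotated = 0;
	size_t kept, i;

	if (nr_to_scan > len)
		nr_to_scan = len;
	if (nr_to_scan > SCAN_MAX)
		nr_to_scan = SCAN_MAX;

	for (i = 0; i < nr_to_scan; i++) {
		if (lru[i].mapped)
			rotated[nr_rotated++] = lru[i];
		/* unmapped pages are simply dropped, i.e. reclaimed */
	}

	/* Pages that were not scanned slide down towards the tail. */
	kept = len - nr_to_scan;
	memmove(lru, lru + nr_to_scan, kept * sizeof(*lru));
	/* Skipped mapped pages now sit at the head, looking "hot". */
	memcpy(lru + kept, rotated, nr_rotated * sizeof(*lru));

	return kept + nr_rotated;
}
```

Starting from pages 1..5 (tail to head) with 2 and 4 unmapped, scanning four
pages leaves the order 5, 1, 3: the previously hottest page 5 is now the
coldest, and the two coldest mapped pages look like the hottest. Filtering at
isolation time would avoid touching the mapped pages at all.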
--
Three Cheers,
Balbir
* KOSAKI Motohiro <[email protected]> [2011-03-31 14:40:33]:
> >
> > The following series implements page cache control,
> > this is a split out version of patch 1 of version 3 of the
> > page cache optimization patches posted earlier at
> > Previous posting http://lwn.net/Articles/425851/ and analysis
> > at http://lwn.net/Articles/419713/
> >
> > Detailed Description
> > ====================
> > This patch implements unmapped page cache control via preferred
> > page cache reclaim. The current patch hooks into kswapd and reclaims
> > page cache if the user has requested for unmapped page control.
> > This is useful in the following scenario
> > - In a virtualized environment with cache=writethrough, we see
> > double caching - (one in the host and one in the guest). As
> > we try to scale guests, cache usage across the system grows.
> > The goal of this patch is to reclaim page cache when Linux is running
> > as a guest and get the host to hold the page cache and manage it.
> > There might be temporary duplication, but in the long run, memory
> > in the guests would be used for mapped pages.
> >
> > - The option is controlled via a boot option and the administrator
> > can selectively turn it on, on a need to use basis.
> >
> > A lot of the code is borrowed from zone_reclaim_mode logic for
> > __zone_reclaim(). One might argue that the with ballooning and
> > KSM this feature is not very useful, but even with ballooning,
> > we need extra logic to balloon multiple VM machines and it is hard
> > to figure out the correct amount of memory to balloon. With these
> > patches applied, each guest has a sufficient amount of free memory
> > available, that can be easily seen and reclaimed by the balloon driver.
> > The additional memory in the guest can be reused for additional
> > applications or used to start additional guests/balance memory in
> > the host.
>
> If anyone think this series works, They are just crazy. This patch reintroduce
> two old issues.
>
> 1) zone reclaim doesn't work if the system has multiple node and the
> workload is file cache oriented (eg file server, web server, mail server, et al).
> because zone recliam make some much free pages than zone->pages_min and
> then new page cache request consume nearest node memory and then it
> bring next zone reclaim. Then, memory utilization is reduced and
> unnecessary LRU discard is increased dramatically.
>
> SGI folks added CPUSET specific solution in past. (cpuset.memory_spread_page)
> But global recliam still have its issue. zone recliam is HPC workload specific
> feature and HPC folks has no motivation to don't use CPUSET.
>
I am afraid you misread the patches and the intent. The intent is to
explicitly enable control of unmapped pages, and it has nothing
specifically to do with multiple nodes at this point. The control is
system wide and carefully enabled by the administrator.
> 2) Before 2.6.27, VM has only one LRU and calc_reclaim_mapped() is used to
> decide to filter out mapped pages. It made a lot of problems for DB servers
> and large application servers. Because, if the system has a lot of mapped
> pages, 1) LRU was churned and then reclaim algorithm become lotree one. 2)
> reclaim latency become terribly slow and hangup detectors misdetect its
> state and start to force reboot. That was big problem of RHEL5 based banking
> system.
> So, sc->may_unmap should be killed in future. Don't increase uses.
>
Can you remove sc->may_unmap without removing zone_reclaim()? The LRU
churn can be addressed at the time of isolation, I'll send out an
incremental patch for that.
>
> But, I agree that now we have to concern slightly large VM change parhaps
> (or parhaps not). Ok, it's good opportunity to fill out some thing.
> Historically, Linux MM has "free memory are waste memory" policy, and It
> worked completely fine. But now we have a few exceptions.
>
> 1) RT, embedded and finance systems. They really hope to avoid reclaim
> latency (ie avoid foreground reclaim completely) and they can accept
> to make slightly much free pages before memory shortage.
>
> 2) VM guest
> VM host and VM guest naturally makes two level page cache model. and
> Linux page cache + two level don't work fine. It has two issues
> 1) hard to visualize real memory consumption. That makes harder to
> works baloon fine. And google want to visualize memory utilization
> to pack in more jobs.
> 2) hard to make in kernel memory utilization improvement mechanism.
>
>
> And, now we have four proposal of utilization related issues.
>
> 1) cleancache (from Oracle)
Cleancache requires both hypervisor and guest support. With these
patches, Linux can run well under a hypervisor if we know the hypervisor
does a lot of the IO and maintains the cache.
> 2) VirtFS (from IBM)
> 3) kstaled (from Google)
> 4) unmapped page reclaim (from you)
>
> Probably, we can't merge all of them and we need to consolidate some
> requirement and implementations.
>
>
> cleancache seems most straight forward two level cache handling for
> virtalization. but it has soem xen specific mess and, currently, don't fit RT
> usage. VirtFS has another interesting de-duplication idea. But filesystem based
I am planning to work on deduplication at KSM level at some point, but
I think again you misread the intention of these patches.
> implemenation naturally inherit some vfs interface limitations.
> Google approach is more unique. memcg don't have double cache
> issue, therefore they only want to visualize it.
>
> Personally I think cleancache or other multi level page cache framework
> looks promising. but another solution is also acceptable. Anyway, I hope
> to everyone back 1000feet bird eye at once and sorting out all requiremnt
> with all related person.
>
Thanks for the review!
--
Three Cheers,
Balbir
On Thu, 31 Mar 2011, KOSAKI Motohiro wrote:
> 1) zone reclaim doesn't work if the system has multiple node and the
> workload is file cache oriented (eg file server, web server, mail server, et al).
> because zone recliam make some much free pages than zone->pages_min and
> then new page cache request consume nearest node memory and then it
> bring next zone reclaim. Then, memory utilization is reduced and
> unnecessary LRU discard is increased dramatically.
That is only true if the webserver only allocates from a single node. If
the allocation load is balanced then it will be fine. It is useful to
reclaim pages from the node where we allocate memory since that keeps the
dataset node local.
> SGI folks added CPUSET specific solution in past. (cpuset.memory_spread_page)
> But global recliam still have its issue. zone recliam is HPC workload specific
> feature and HPC folks has no motivation to don't use CPUSET.
The spreading can also be done via memory policies. But that is only
required if the application has an unbalanced allocation behavior.
> 2) Before 2.6.27, VM has only one LRU and calc_reclaim_mapped() is used to
> decide to filter out mapped pages. It made a lot of problems for DB servers
> and large application servers. Because, if the system has a lot of mapped
> pages, 1) LRU was churned and then reclaim algorithm become lotree one. 2)
> reclaim latency become terribly slow and hangup detectors misdetect its
> state and start to force reboot. That was big problem of RHEL5 based banking
> system.
> So, sc->may_unmap should be killed in future. Don't increase uses.
Because a bank could not configure its system properly we need to get rid
of may_unmap? Maybe raise min_unmapped_ratio instead and take care that
either the allocation load is balanced or a round robin scheme is
used by the app?
> And, this patch introduce new allocator fast path overhead. I haven't seen
> any justification for it.
We could do the triggering differently.
> In other words, you have to kill following three for getting ack 1) zone
> reclaim oriented reclaim 2) filter based LRU scanning (eg sc->may_unmap)
> 3) fastpath overhead. In other words, If you want a feature for vm guest,
> Any hardcoded machine configration assumption and/or workload assumption
> are wrong.
It would be good if you could come up with a new reclaim scheme that
avoids the need for zone reclaim and still allows one to take advantage of
memory distances. I agree that the current scheme sometimes requires
tuning too many esoteric knobs to get useful behavior.
> But, I agree that now we have to concern slightly large VM change parhaps
> (or parhaps not). Ok, it's good opportunity to fill out some thing.
> Historically, Linux MM has "free memory are waste memory" policy, and It
> worked completely fine. But now we have a few exceptions.
>
> 1) RT, embedded and finance systems. They really hope to avoid reclaim
> latency (ie avoid foreground reclaim completely) and they can accept
> to make slightly much free pages before memory shortage.
In general we need a mechanism to ensure we can avoid reclaim during
critical sections of application. So some way to give some hints to the
machine to free up lots of memory (/proc/sys/vm/drop_caches is far too
drastic) may be useful.
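For comparison, one narrower hint that already exists is per-file:
posix_fadvise() with POSIX_FADV_DONTNEED asks the kernel to drop the cached
pages of a single file. The sketch below is only an illustration of that
existing interface, not the mechanism being asked for here; the helper name
hint_drop_file_cache() is hypothetical, and the call is advisory, so the
kernel may keep some pages (dirty ones, for example).

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200112L /* for posix_fadvise() */
#endif

#include <fcntl.h>
#include <unistd.h>

/*
 * Ask the kernel to drop cached pages for one file.
 * Returns 0 if the hint was accepted, -1 on any failure.
 */
static int hint_drop_file_cache(const char *path)
{
	int fd = open(path, O_RDONLY);
	int err;

	if (fd < 0)
		return -1;

	/* offset 0, len 0 means "the whole file" */
	err = posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
	close(fd);

	/* posix_fadvise() returns an error number directly, not -1/errno */
	return err ? -1 : 0;
}
```

This only covers files the application knows about, which is part of why a
system-level hint that sits between this and drop_caches could still be
useful.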
> And, now we have four proposal of utilization related issues.
>
> 1) cleancache (from Oracle)
> 2) VirtFS (from IBM)
> 3) kstaled (from Google)
> 4) unmapped page reclaim (from you)
>
> Probably, we can't merge all of them and we need to consolidate some
> requirement and implementations.
Well all these approaches show that we have major issues with reclaim and
large memory. Things get overly complicated. Time for a new approach that
integrates all the goals that these try to accomplish?
> Personally I think cleancache or other multi level page cache framework
> looks promising. but another solution is also acceptable. Anyway, I hope
> to everyone back 1000feet bird eye at once and sorting out all requiremnt
> with all related person.
Would be good if you could tackle that problem.
On Wed, Mar 30, 2011 at 11:00:26AM +0530, Balbir Singh wrote:
>
> The following series implements page cache control,
> this is a split out version of patch 1 of version 3 of the
> page cache optimization patches posted earlier at
> Previous posting http://lwn.net/Articles/425851/ and analysis
> at http://lwn.net/Articles/419713/
>
> Detailed Description
> ====================
> This patch implements unmapped page cache control via preferred
> page cache reclaim. The current patch hooks into kswapd and reclaims
> page cache if the user has requested for unmapped page control.
> This is useful in the following scenario
> - In a virtualized environment with cache=writethrough, we see
> double caching - (one in the host and one in the guest). As
> we try to scale guests, cache usage across the system grows.
> The goal of this patch is to reclaim page cache when Linux is running
> as a guest and get the host to hold the page cache and manage it.
> There might be temporary duplication, but in the long run, memory
> in the guests would be used for mapped pages.
What does this do that "cache=none" for the VMs and using the page
cache inside the guest doesn't achieve? That avoids double caching
and doesn't require any new complexity inside the host OS to
achieve...
Cheers,
Dave.
--
Dave Chinner
[email protected]
> On Wed, Mar 30, 2011 at 11:00:26AM +0530, Balbir Singh wrote:
> >
> > The following series implements page cache control,
> > this is a split out version of patch 1 of version 3 of the
> > page cache optimization patches posted earlier at
> > Previous posting http://lwn.net/Articles/425851/ and analysis
> > at http://lwn.net/Articles/419713/
> >
> > Detailed Description
> > ====================
> > This patch implements unmapped page cache control via preferred
> > page cache reclaim. The current patch hooks into kswapd and reclaims
> > page cache if the user has requested for unmapped page control.
> > This is useful in the following scenario
> > - In a virtualized environment with cache=writethrough, we see
> > double caching - (one in the host and one in the guest). As
> > we try to scale guests, cache usage across the system grows.
> > The goal of this patch is to reclaim page cache when Linux is running
> > as a guest and get the host to hold the page cache and manage it.
> > There might be temporary duplication, but in the long run, memory
> > in the guests would be used for mapped pages.
>
> What does this do that "cache=none" for the VMs and using the page
> cache inside the guest doesn't acheive? That avoids double caching
> and doesn't require any new complexity inside the host OS to
> acheive...
Right.
"cache=none" has no double caching issue, and KSM already solves
cross-guest cache sharing. So I _guess_ this is not the core motivation
of his patch. But I'm not him, so I'm not sure.
* Dave Chinner <[email protected]> [2011-04-01 08:40:33]:
> On Wed, Mar 30, 2011 at 11:00:26AM +0530, Balbir Singh wrote:
> >
> > The following series implements page cache control,
> > this is a split out version of patch 1 of version 3 of the
> > page cache optimization patches posted earlier at
> > Previous posting http://lwn.net/Articles/425851/ and analysis
> > at http://lwn.net/Articles/419713/
> >
> > Detailed Description
> > ====================
> > This patch implements unmapped page cache control via preferred
> > page cache reclaim. The current patch hooks into kswapd and reclaims
> > page cache if the user has requested for unmapped page control.
> > This is useful in the following scenario
> > - In a virtualized environment with cache=writethrough, we see
> > double caching - (one in the host and one in the guest). As
> > we try to scale guests, cache usage across the system grows.
> > The goal of this patch is to reclaim page cache when Linux is running
> > as a guest and get the host to hold the page cache and manage it.
> > There might be temporary duplication, but in the long run, memory
> > in the guests would be used for mapped pages.
>
> What does this do that "cache=none" for the VMs and using the page
> cache inside the guest doesn't acheive? That avoids double caching
> and doesn't require any new complexity inside the host OS to
> acheive...
>
There was a long discussion on cache=none in the first posting and the
downsides/impact on throughput. Please see
http://www.mail-archive.com/[email protected]/msg30655.html
--
Three Cheers,
Balbir
On Fri, Apr 01, 2011 at 08:38:11AM +0530, Balbir Singh wrote:
> * Dave Chinner <[email protected]> [2011-04-01 08:40:33]:
>
> > On Wed, Mar 30, 2011 at 11:00:26AM +0530, Balbir Singh wrote:
> > >
> > > The following series implements page cache control,
> > > this is a split out version of patch 1 of version 3 of the
> > > page cache optimization patches posted earlier at
> > > Previous posting http://lwn.net/Articles/425851/ and analysis
> > > at http://lwn.net/Articles/419713/
> > >
> > > Detailed Description
> > > ====================
> > > This patch implements unmapped page cache control via preferred
> > > page cache reclaim. The current patch hooks into kswapd and reclaims
> > > page cache if the user has requested for unmapped page control.
> > > This is useful in the following scenario
> > > - In a virtualized environment with cache=writethrough, we see
> > > double caching - (one in the host and one in the guest). As
> > > we try to scale guests, cache usage across the system grows.
> > > The goal of this patch is to reclaim page cache when Linux is running
> > > as a guest and get the host to hold the page cache and manage it.
> > > There might be temporary duplication, but in the long run, memory
> > > in the guests would be used for mapped pages.
> >
> > What does this do that "cache=none" for the VMs and using the page
> > cache inside the guest doesn't acheive? That avoids double caching
> > and doesn't require any new complexity inside the host OS to
> > acheive...
> >
>
> There was a long discussion on cache=none in the first posting and the
> downsides/impact on throughput. Please see
> http://www.mail-archive.com/[email protected]/msg30655.html
All there is in that thread is handwaving about the differences
between cache=none vs cache=writeback behaviour and about the amount
of data loss/corruption when failures occur. There is only one real
example provided about real world performance in the entire thread,
but the root cause of the performance difference is not analysed,
determined and understood. Hence I'm not convinced from this thread
that using cache=write* and using this functionality is
anything other than papering over some still unknown problem....
Cheers,
Dave.
--
Dave Chinner
[email protected]
Hi
> > 1) zone reclaim doesn't work if the system has multiple node and the
> > workload is file cache oriented (eg file server, web server, mail server, et al).
> > because zone recliam make some much free pages than zone->pages_min and
> > then new page cache request consume nearest node memory and then it
> > bring next zone reclaim. Then, memory utilization is reduced and
> > unnecessary LRU discard is increased dramatically.
> >
> > SGI folks added CPUSET specific solution in past. (cpuset.memory_spread_page)
> > But global recliam still have its issue. zone recliam is HPC workload specific
> > feature and HPC folks has no motivation to don't use CPUSET.
>
> I am afraid you misread the patches and the intent. The intent to
> explictly enable control of unmapped pages and has nothing
> specifically to do with multiple nodes at this point. The control is
> system wide and carefully enabled by the administrator.
Hm. OK, I may have misread.
Can you please explain why the de-duplication feature needs to be selectable and
disabled by default? Does "explicitly enable" mean this feature is aimed at a corner-case issue??
> > 2) Before 2.6.27, VM has only one LRU and calc_reclaim_mapped() is used to
> > decide to filter out mapped pages. It made a lot of problems for DB servers
> > and large application servers. Because, if the system has a lot of mapped
> > pages, 1) LRU was churned and then reclaim algorithm become lotree one. 2)
> > reclaim latency become terribly slow and hangup detectors misdetect its
> > state and start to force reboot. That was big problem of RHEL5 based banking
> > system.
> > So, sc->may_unmap should be killed in future. Don't increase uses.
> >
>
> Can you remove sc->may_unmap without removing zone_reclaim()? The LRU
> churn can be addressed at the time of isolation, I'll send out an
> incremental patch for that.
At least, I don't plan to do it, because the current zone_reclaim() works well on SGI
HPC workloads and a careless change could break them. In other words, they
understand that their workloads are HPC specific and they know how to handle them.
I'm worried about spreading zone_reclaim() usage _without_ removing its assumptions.
I wrote the following in my last mail.
> In other words, you have to kill following three for getting ack 1) zone
> reclaim oriented reclaim 2) filter based LRU scanning (eg sc->may_unmap)
> 3) fastpath overhead.
But there is probably another way. If you can improve zone_reclaim() for more generic
workloads so that it fits many more people, I'll ack this.
Thanks.
* KOSAKI Motohiro <[email protected]> [2011-04-01 16:56:57]:
> Hi
>
> > > 1) zone reclaim doesn't work if the system has multiple node and the
> > > workload is file cache oriented (eg file server, web server, mail server, et al).
> > > because zone recliam make some much free pages than zone->pages_min and
> > > then new page cache request consume nearest node memory and then it
> > > bring next zone reclaim. Then, memory utilization is reduced and
> > > unnecessary LRU discard is increased dramatically.
> > >
> > > SGI folks added CPUSET specific solution in past. (cpuset.memory_spread_page)
> > > But global recliam still have its issue. zone recliam is HPC workload specific
> > > feature and HPC folks has no motivation to don't use CPUSET.
> >
> > I am afraid you misread the patches and the intent. The intent to
> > explictly enable control of unmapped pages and has nothing
> > specifically to do with multiple nodes at this point. The control is
> > system wide and carefully enabled by the administrator.
>
> Hm. OK, I may misread.
> Can you please explain the reason why de-duplication feature need to selectable and
> disabled by defaut. "explicity enable" mean this feature want to spot corner case issue??
>
Yes, because given a selection of choices (including what you
mentioned in the review), it would be nice to have
this selectable.
--
Three Cheers,
Balbir
Hi Christoph,
Thanks for the long explanation.
> On Thu, 31 Mar 2011, KOSAKI Motohiro wrote:
>
> > 1) zone reclaim doesn't work if the system has multiple node and the
> > workload is file cache oriented (eg file server, web server, mail server, et al).
> > because zone recliam make some much free pages than zone->pages_min and
> > then new page cache request consume nearest node memory and then it
> > bring next zone reclaim. Then, memory utilization is reduced and
> > unnecessary LRU discard is increased dramatically.
>
> That is only true if the webserver only allocates from a single node. If
> the allocation load is balanced then it will be fine. It is useful to
> reclaim pages from the node where we allocate memory since that keeps the
> dataset node local.
Why?
Scheduler load balancing only considers CPU load. So memory
pressure is usually not completely symmetric. That's the reason why we get
bug reports periodically.
> > SGI folks added CPUSET specific solution in past. (cpuset.memory_spread_page)
> > But global recliam still have its issue. zone recliam is HPC workload specific
> > feature and HPC folks has no motivation to don't use CPUSET.
>
> The spreading can also be done via memory policies. But that is only
> required if the application has an unbalanced allocation behavior.
??
I wasn't talking about memory isolation. CPUSETS has a backdoor out of memory
isolation for file cache, but global allocation/reclaim doesn't.
Memory policy is for application-specific behavior customization. If all
applications required the same memory policy customization, that would be bad and
impractical. Application developers know application behavior but don't know
the machine configuration, and system administrators are the opposite. So an
application tuning feature can't substitute for a system-wide one.
> > 2) Before 2.6.27, VM has only one LRU and calc_reclaim_mapped() is used to
> > decide to filter out mapped pages. It made a lot of problems for DB servers
> > and large application servers. Because, if the system has a lot of mapped
> > pages, 1) LRU was churned and then reclaim algorithm become lotree one. 2)
> > reclaim latency become terribly slow and hangup detectors misdetect its
> > state and start to force reboot. That was big problem of RHEL5 based banking
> > system.
> > So, sc->may_unmap should be killed in future. Don't increase uses.
>
> Because a bank could not configure its system properly we need to get rid
> of may_unmap? Maybe raise min_unmapped_ratio instead and take care that
> either the allocation load is balanced or a round robin scheme is
> used by the app?
Hmm..
You and I seem to be talking about different topics. I wasn't talking about zone reclaim here.
I explained why filter based selective page reclaim may cause disaster. And
you seem to be talking about zone_reclaim() customization tips.
Firstly, setting this patch aside, if we are using zone_reclaim,
raising min_unmapped_ratio is an option. I agree. Searching is always problematic
because it doesn't scale. But we have some workarounds and we have used them.
Secondly, if we consider this patch, "by the app" is not an option. We can't
make any specific assumption about workloads on a VM guest in general.
> > And, this patch introduce new allocator fast path overhead. I haven't seen
> > any justification for it.
>
> We could do the triggering differently.
ok.
And I'd like to add a supplemental explanation. If the feature is a widely used one,
I don't object to the fastpath change. Cost and benefit should be compared fairly.
But if the feature is for only a few people, I strongly hope to avoid fastpath
overhead. The number of users is one of the biggest components of the benefit.
> > In other words, you have to kill following three for getting ack 1) zone
> > reclaim oriented reclaim 2) filter based LRU scanning (eg sc->may_unmap)
> > 3) fastpath overhead. In other words, If you want a feature for vm guest,
> > Any hardcoded machine configration assumption and/or workload assumption
> > are wrong.
>
> It would be good if you could come up with a new reclaim scheme that
> avoids the need for zone reclaim and still allows one to take advantage of
> memory distances. I agree that the current scheme sometimes requires
> tuning too many esoteric knobs to get useful behavior.
To be honest, I'd like to sort out the requirements of Balbir and the virtualization people
first. I feel his [patch 0/3] explanation and the implementation do not
exactly match. I'm worried about that.
BTW, when we talk about memory-distance-aware reclaim, we have to
recognize that traditional NUMA (i.e. external node interconnect) and on-chip
NUMA have different performance characteristics. On-chip remote node access
is not so slow, so an elaborate nearest-node allocation effort is not worth
so much, especially when a workload uses a lot of short-lived objects.
Current zone reclaim doesn't have much of an issue on traditional NUMA
because that fits your original design and assumptions, and administrators of
such systems are highly skilled and don't hesitate to learn esoteric knobs.
But recent on-chip and cheap NUMA systems are used by very different people than in the
past, and therefore new issues and complaints have been raised.
> > But I agree that we may now have to consider a somewhat larger VM change
> > (or perhaps not). OK, it's a good opportunity to lay some things out.
> > Historically, Linux MM has had a "free memory is wasted memory" policy, and it
> > worked completely fine. But now we have a few exceptions.
> >
> > 1) RT, embedded and finance systems. They really hope to avoid reclaim
> > latency (i.e. avoid foreground reclaim completely), and they can accept
> > keeping slightly more pages free before a memory shortage.
>
> In general we need a mechanism to ensure we can avoid reclaim during
> critical sections of an application. So some way to give some hints to the
> machine to free up lots of memory (/proc/sys/vm/drop_caches is far too
> drastic) may be useful.
Exactly.
I've heard this request multiple times from finance people. And I've also
heard the same request from bullet train control software people recently.
> * KOSAKI Motohiro <[email protected]> [2011-04-01 16:56:57]:
>
> > Hi
> >
> > > > 1) zone reclaim doesn't work if the system has multiple nodes and the
> > > > workload is file cache oriented (e.g. file server, web server, mail server, et al.),
> > > > because zone reclaim frees somewhat more pages than zone->pages_min,
> > > > then new page cache requests consume the nearest node's memory, and that
> > > > triggers the next zone reclaim. As a result, memory utilization is reduced and
> > > > unnecessary LRU discards increase dramatically.
> > > >
> > > > SGI folks added a CPUSET-specific solution in the past (cpuset.memory_spread_page),
> > > > but global reclaim still has this issue. Zone reclaim is an HPC-workload-specific
> > > > feature, and HPC folks have no reason not to use CPUSET.
> > >
> > > I am afraid you misread the patches and the intent. The intent is to
> > > explicitly enable control of unmapped pages, and it has nothing
> > > specifically to do with multiple nodes at this point. The control is
> > > system wide and carefully enabled by the administrator.
> >
> > Hm. OK, I may have misread.
> > Can you please explain why the de-duplication feature needs to be selectable and
> > disabled by default? Does "explicitly enable" mean this feature is aimed at a corner-case issue??
>
> Yes, because given a selection of choices (including what you
> mentioned in the review), it would be nice to have
> this selectable.
That's not a good answer. :-/
Who needs the feature and who shouldn't use it? Is it valuable enough for a large enough number of
people? That's the point of my question.
On Fri, 1 Apr 2011, KOSAKI Motohiro wrote:
> > On Thu, 31 Mar 2011, KOSAKI Motohiro wrote:
> >
> > > 1) zone reclaim doesn't work if the system has multiple nodes and the
> > > workload is file cache oriented (e.g. file server, web server, mail server, et al.),
> > > because zone reclaim frees somewhat more pages than zone->pages_min,
> > > then new page cache requests consume the nearest node's memory, and that
> > > triggers the next zone reclaim. As a result, memory utilization is reduced and
> > > unnecessary LRU discards increase dramatically.
> >
> > That is only true if the webserver only allocates from a single node. If
> > the allocation load is balanced then it will be fine. It is useful to
> > reclaim pages from the node where we allocate memory since that keeps the
> > dataset node local.
>
> Why?
> Scheduler load balancing only considers CPU load, so memory
> pressure is usually not completely symmetric. That's the reason why we get
> bug reports periodically.
The scheduler load balancing also considers caching effects. It does not
consider NUMA effects aside from heuristics though. If processes are
randomly moving around then zone reclaim is not effective. Processes need
to stay mainly on a certain node and memory needs to be allocatable from
that node in order to improve performance. zone_reclaim is useless if you
toss processes around the box.
> BTW, when we talk about memory-distance-aware reclaim, we have to
> recognize that traditional NUMA (i.e. external node interconnect) and on-chip
> NUMA have different performance characteristics. On-chip remote node access
> is not so slow, so an elaborate nearest-node allocation effort is not worth
> so much, especially when a workload uses a lot of short-lived objects.
> Current zone reclaim doesn't have much of an issue on traditional NUMA
> because that fits your original design and assumptions, and administrators of
> such systems are highly skilled and don't hesitate to learn esoteric knobs.
> But recent on-chip and cheap NUMA systems are used by very different people than in the
> past, and therefore new issues and complaints have been raised.
You can switch NUMA off completely at the BIOS level. Then the distances
are not considered by the OS. If they are not relevant then let's just
switch NUMA off. Managing NUMA distances can cause significant overhead.
* Andrew Morton <[email protected]> [2011-03-30 22:32:31]:
> On Thu, 31 Mar 2011 10:57:03 +0530 Balbir Singh <[email protected]> wrote:
>
> > * Andrew Morton <[email protected]> [2011-03-30 16:36:07]:
> >
> > > On Wed, 30 Mar 2011 11:00:26 +0530
> > > Balbir Singh <[email protected]> wrote:
> > >
> > > > Data from the previous patchsets can be found at
> > > > https://lkml.org/lkml/2010/11/30/79
> > >
> > > It would be nice if the data for the current patchset was present in
> > > the current patchset's changelog!
> > >
> >
> > Sure, since there were no major changes, I put in a URL. The main
> > change was the documentation update.
>
> Well some poor schmuck has to copy and paste the data into the
> changelog so it's still there in five years time. It's better to carry
> this info around in the patch's own metadata, and to maintain
> and update it.
>
Agreed, will do.
--
Three Cheers,
Balbir
* KOSAKI Motohiro <[email protected]> [2011-04-01 22:21:26]:
> > * KOSAKI Motohiro <[email protected]> [2011-04-01 16:56:57]:
> >
> > > Hi
> > >
> > > > > > 1) zone reclaim doesn't work if the system has multiple nodes and the
> > > > > > workload is file cache oriented (e.g. file server, web server, mail server, et al.),
> > > > > > because zone reclaim frees somewhat more pages than zone->pages_min,
> > > > > > then new page cache requests consume the nearest node's memory, and that
> > > > > > triggers the next zone reclaim. As a result, memory utilization is reduced and
> > > > > > unnecessary LRU discards increase dramatically.
> > > > > >
> > > > > > SGI folks added a CPUSET-specific solution in the past (cpuset.memory_spread_page),
> > > > > > but global reclaim still has this issue. Zone reclaim is an HPC-workload-specific
> > > > > > feature, and HPC folks have no reason not to use CPUSET.
> > > >
> > > > > I am afraid you misread the patches and the intent. The intent is to
> > > > > explicitly enable control of unmapped pages, and it has nothing
> > > > > specifically to do with multiple nodes at this point. The control is
> > > > > system wide and carefully enabled by the administrator.
> > >
> > > > Hm. OK, I may have misread.
> > > > Can you please explain why the de-duplication feature needs to be selectable and
> > > > disabled by default? Does "explicitly enable" mean this feature is aimed at a corner-case issue??
> >
> > Yes, because given a selection of choices (including what you
> > mentioned in the review), it would be nice to have
> > this selectable.
>
> That's not a good answer. :-/
I am afraid I cannot please you with my answers.
> Who needs the feature and who shouldn't use it? Is it valuable enough for a large enough number of
> people? That's the point of my question.
>
You can see the use cases documented, including when running Linux as
a guest under other hypervisors, today we have a choice of not using
host page cache with cache=none, but nothing the other way round.
There are other use cases for embedded folks (in terms of controlling
unmapped page cache), please see previous discussions.
--
Three Cheers,
Balbir
On 04/01/2011 09:17 AM, KOSAKI Motohiro wrote:
> Hi Christoph,
>
> Thanks for the long explanation.
>
>
>> On Thu, 31 Mar 2011, KOSAKI Motohiro wrote:
>>
>>> 1) zone reclaim doesn't work if the system has multiple nodes and the
>>> workload is file cache oriented (e.g. file server, web server, mail server, et al.),
>>> because zone reclaim frees somewhat more pages than zone->pages_min,
>>> then new page cache requests consume the nearest node's memory, and that
>>> triggers the next zone reclaim. As a result, memory utilization is reduced and
>>> unnecessary LRU discards increase dramatically.
>>
>> That is only true if the webserver only allocates from a single node. If
>> the allocation load is balanced then it will be fine. It is useful to
>> reclaim pages from the node where we allocate memory since that keeps the
>> dataset node local.
>
> Why?
> Scheduler load balancing only considers CPU load, so memory
> pressure is usually not completely symmetric. That's the reason why we get
> bug reports periodically.
Agreed. As Christoph said, if the allocation load is balanced it will be fine.
But I think it's not always true that the allocation load is balanced.
>>> But I agree that we may now have to consider a somewhat larger VM change
>>> (or perhaps not). OK, it's a good opportunity to lay some things out.
>>> Historically, Linux MM has had a "free memory is wasted memory" policy, and it
>>> worked completely fine. But now we have a few exceptions.
>>>
>>> 1) RT, embedded and finance systems. They really hope to avoid reclaim
>>> latency (i.e. avoid foreground reclaim completely), and they can accept
>>> keeping slightly more pages free before a memory shortage.
>>
>> In general we need a mechanism to ensure we can avoid reclaim during
>> critical sections of an application. So some way to give some hints to the
>> machine to free up lots of memory (/proc/sys/vm/drop_caches is far too
>> drastic) may be useful.
>
> Exactly.
> I've heard this request multiple times from finance people. And I've also
> heard the same request from bullet train control software people recently.
I completely agree with you. I have both kinds of customers, and they really need it
to make their critical sections deterministic.
Thanks,
Satoru
On Fri, Apr 01, 2011 at 10:17:56PM +0900, KOSAKI Motohiro wrote:
> > > But I agree that we may now have to consider a somewhat larger VM change
> > > (or perhaps not). OK, it's a good opportunity to lay some things out.
> > > Historically, Linux MM has had a "free memory is wasted memory" policy, and it
> > > worked completely fine. But now we have a few exceptions.
> > >
> > > 1) RT, embedded and finance systems. They really hope to avoid reclaim
> > > latency (i.e. avoid foreground reclaim completely), and they can accept
> > > keeping slightly more pages free before a memory shortage.
> >
> > In general we need a mechanism to ensure we can avoid reclaim during
> > critical sections of an application. So some way to give some hints to the
> > machine to free up lots of memory (/proc/sys/vm/drop_caches is far too
> > drastic) may be useful.
>
> Exactly.
> I've heard this request multiple times from finance people. And I've also
> heard the same request from bullet train control software people recently.
Well, that's enough to make me avoid Japanese trains in future. If
your critical control system has problems with memory reclaim
interfering with its operation, then you are doing something
very, very wrong.
If you have a need to avoid memory allocation latency during
specific critical sections then the critical section needs to:
a) have all its memory preallocated and mlock()d in advance
b) avoid doing anything that requires memory to be
allocated.
These are basic design rules for time-sensitive applications.
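As a rough userspace illustration of rule (a), and not something taken from any posted patch: the sketch below allocates a buffer, touches every page so it is actually faulted in, and then tries to pin it with mlock(). The helper names prealloc_pinned() and release_pinned() are made up for this example. mlock() can fail under a low RLIMIT_MEMLOCK, so the lock result is reported to the caller rather than assumed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: allocate `len` bytes of page-aligned memory, fault
 * every page in, and try to pin the range in RAM.
 * Returns the buffer (or NULL on allocation failure); *locked is set to 1
 * if mlock() succeeded and 0 if it did not (e.g. RLIMIT_MEMLOCK too low).
 */
static void *prealloc_pinned(size_t len, int *locked)
{
	long page = sysconf(_SC_PAGESIZE);
	void *buf = NULL;

	assert(len > 0 && locked != NULL);
	if (page <= 0 || posix_memalign(&buf, (size_t)page, len) != 0)
		return NULL;

	/* Touch every byte so the pages are really allocated now, not later. */
	memset(buf, 0, len);

	/* Pinning is best effort here; the caller decides what to do on failure. */
	*locked = (mlock(buf, len) == 0);
	return buf;
}

/* Hypothetical helper: undo prealloc_pinned(). */
static void release_pinned(void *buf, size_t len, int locked)
{
	if (locked)
		munlock(buf, len);
	free(buf);
}
```

A caller would run prealloc_pinned() during startup, well before the critical section, and then only use the returned buffer inside it. This covers the application's own memory only; it says nothing about allocations the kernel makes on the application's behalf.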
Fundamentally, if you just switch off memory reclaim to avoid the
latencies involved with direct memory reclaim, then all you'll get
instead is ENOMEM because there's no memory available and none will be
reclaimed. That's even more fatal for the system than doing reclaim.
IMO, you should tell the people requesting stuff like this to
architect their critical sections according to best practices.
Hacking the VM to try to work around badly designed applications is
a sure recipe for disaster...
Cheers,
Dave.
--
Dave Chinner
[email protected]
> On Fri, Apr 01, 2011 at 10:17:56PM +0900, KOSAKI Motohiro wrote:
> > > > But I agree that we may now have to consider a somewhat larger VM change
> > > > (or perhaps not). OK, it's a good opportunity to lay some things out.
> > > > Historically, Linux MM has had a "free memory is wasted memory" policy, and it
> > > > worked completely fine. But now we have a few exceptions.
> > > >
> > > > 1) RT, embedded and finance systems. They really hope to avoid reclaim
> > > > latency (i.e. avoid foreground reclaim completely), and they can accept
> > > > keeping slightly more pages free before a memory shortage.
> > >
> > > In general we need a mechanism to ensure we can avoid reclaim during
> > > critical sections of an application. So some way to give some hints to the
> > > machine to free up lots of memory (/proc/sys/vm/drop_caches is far too
> > > drastic) may be useful.
> >
> > Exactly.
> > I've heard this request multiple times from finance people. And I've also
> > heard the same request from bullet train control software people recently.
>
> Well, that's enough to make me avoid Japanese trains in future.
Feel free to do so. :)
>If
> your critical control system has problems with memory reclaim
> interfering with its operation, then you are doing something
> very, very wrong.
>
> If you have a need to avoid memory allocation latency during
> specific critical sections then the critical section needs to:
>
> a) have all its memory preallocated and mlock()d in advance
>
> b) avoid doing anything that requires memory to be
> allocated.
>
> These are basic design rules for time-sensitive applications.
I wonder why you think our VM folks don't know that.
> Fundamentally, if you just switch off memory reclaim to avoid the
> latencies involved with direct memory reclaim, then all you'll get
> instead is ENOMEM because there's no memory available and none will be
> reclaimed. That's even more fatal for the system than doing reclaim.
You have overlooked two things.
Firstly, *ALL* RT applications need to cooperate with other applications, the kernel,
and various other system-level daemons. That is not an issue specific to
this topic. OK, *IF* an RT application behaves egoistically, a system may hang
up easily, even through a mere simple busy loop, yes. But who wants to do that?
Secondly, you misparsed the "avoid direct reclaim" paragraph. We aren't talking
about "avoid direct reclaim even if system memory is not enough". We are talking
about "avoid direct reclaim by preparing before".
> IMO, you should tell the people requesting stuff like this to
> architect their critical sections according to best practices.
> Hacking the VM to try to work around badly designed applications is
> a sure recipe for disaster...
I hope this mail satisfies you. :)
> > > > Hm. OK, I may have misread.
> > > > Can you please explain why the de-duplication feature needs to be selectable and
> > > > disabled by default? Does "explicitly enable" mean this feature is aimed at a corner-case issue??
> > >
> > > Yes, because given a selection of choices (including what you
> > > mentioned in the review), it would be nice to have
> > > this selectable.
> >
> > That's not a good answer. :-/
>
> I am afraid I cannot please you with my answers.
>
> > Who needs the feature and who shouldn't use it? Is it valuable enough for a large enough number of
> > people? That's the point of my question.
> >
>
> You can see the use cases documented, including when running Linux as
> a guest under other hypervisors,
Which hypervisor? If this patch is irrelevant to 99.9999% of people, shouldn't you
reduce its negative impact?
> today we have a choice of not using
> host page cache with cache=none, but nothing the other way round.
> There are other use cases for embedded folks (in terms of controlling
> unmapped page cache), please see previous discussions.
Are there other use cases? Really? Where are they?
Why have you suddenly started to talk about embedded? I reviewed this as a virtualization feature
because you wrote so in [patch 0/3]. Why have you suddenly changed your point?
> On Fri, 1 Apr 2011, KOSAKI Motohiro wrote:
>
> > > On Thu, 31 Mar 2011, KOSAKI Motohiro wrote:
> > >
> > > > 1) zone reclaim doesn't work if the system has multiple nodes and the
> > > > workload is file cache oriented (e.g. file server, web server, mail server, et al.),
> > > > because zone reclaim frees somewhat more pages than zone->pages_min,
> > > > then new page cache requests consume the nearest node's memory, and that
> > > > triggers the next zone reclaim. As a result, memory utilization is reduced and
> > > > unnecessary LRU discards increase dramatically.
> > >
> > > That is only true if the webserver only allocates from a single node. If
> > > the allocation load is balanced then it will be fine. It is useful to
> > > reclaim pages from the node where we allocate memory since that keeps the
> > > dataset node local.
> >
> > Why?
> > Scheduler load balancing only considers CPU load, so memory
> > pressure is usually not completely symmetric. That's the reason why we get
> > bug reports periodically.
>
> The scheduler load balancing also considers caching effects. It does not
> consider NUMA effects aside from heuristics though. If processes are
> randomly moving around then zone reclaim is not effective. Processes need
> to stay mainly on a certain node and memory needs to be allocatable from
> that node in order to improve performance. zone_reclaim is useless if you
> toss processes around the box.
Agreed. There are situations where zone_reclaim works well and situations where it works badly.
> > BTW, when we talk about memory-distance-aware reclaim, we have to
> > recognize that traditional NUMA (i.e. external node interconnect) and on-chip
> > NUMA have different performance characteristics. On-chip remote node access
> > is not so slow, so an elaborate nearest-node allocation effort is not worth
> > so much, especially when a workload uses a lot of short-lived objects.
> > Current zone reclaim doesn't have much of an issue on traditional NUMA
> > because that fits your original design and assumptions, and administrators of
> > such systems are highly skilled and don't hesitate to learn esoteric knobs.
> > But recent on-chip and cheap NUMA systems are used by very different people than in the
> > past, and therefore new issues and complaints have been raised.
>
> You can switch NUMA off completely at the BIOS level. Then the distances
> are not considered by the OS. If they are not relevant then let's just
> switch NUMA off. Managing NUMA distances can cause significant overhead.
1) Some BIOSes don't have such a knob. BTW, OK, yes, *I* can switch NUMA off completely
because I don't have such a BIOS. 2) Turning it off at the BIOS level has some side effects;
for example, scheduler load balancing doesn't take NUMA into account anymore.
So your workaround is fine as a workaround, but it's not a solution.
On Sat, 2 Apr 2011, Dave Chinner wrote:
> Fundamentally, if you just switch off memory reclaim to avoid the
> latencies involved with direct memory reclaim, then all you'll get
> instead is ENOMEM because there's no memory available and none will be
> reclaimed. That's even more fatal for the system than doing reclaim.
Not for my use cases here. No one will die if reclaim happens but it's bad
for the bottom line. Reducing the chance of memory reclaim occurring in a
critical section is sufficient.
On Sun, 3 Apr 2011, KOSAKI Motohiro wrote:
> 1) Some BIOSes don't have such a knob. BTW, OK, yes, *I* can switch NUMA off completely
> because I don't have such a BIOS. 2) Turning it off at the BIOS level has some side effects;
> for example, scheduler load balancing doesn't take NUMA into account anymore.
Well then let's add a kernel parameter that switches all NUMA off.
Otherwise: if you just run a kernel built without NUMA support then you have a similar
effect.
Re #2) If you have the system toss processes around the system then the
load balancing heuristics do not bring you any benefit.
On Sun, Apr 03, 2011 at 06:32:16PM +0900, KOSAKI Motohiro wrote:
> > On Fri, Apr 01, 2011 at 10:17:56PM +0900, KOSAKI Motohiro wrote:
> > > > > But I agree that we may now have to consider a somewhat larger VM change
> > > > > (or perhaps not). OK, it's a good opportunity to lay some things out.
> > > > > Historically, Linux MM has had a "free memory is wasted memory" policy, and it
> > > > > worked completely fine. But now we have a few exceptions.
> > > > >
> > > > > 1) RT, embedded and finance systems. They really hope to avoid reclaim
> > > > > latency (i.e. avoid foreground reclaim completely), and they can accept
> > > > > keeping slightly more pages free before a memory shortage.
> > > >
> > > > In general we need a mechanism to ensure we can avoid reclaim during
> > > > critical sections of an application. So some way to give some hints to the
> > > > machine to free up lots of memory (/proc/sys/vm/drop_caches is far too
> > > > drastic) may be useful.
> > >
> > > Exactly.
> > > I've heard this request multiple times from finance people. And I've also
> > > heard the same request from bullet train control software people recently.
> >
[...]
> > Fundamentally, if you just switch off memory reclaim to avoid the
> > latencies involved with direct memory reclaim, then all you'll get
> > instead is ENOMEM because there's no memory available and none will be
> > reclaimed. That's even more fatal for the system than doing reclaim.
>
> You have overlooked two things.
>
> Firstly, *ALL* RT applications need to cooperate with other applications, the kernel,
> and various other system-level daemons. That is not an issue specific to
> this topic. OK, *IF* an RT application behaves egoistically, a system may hang
> up easily, even through a mere simple busy loop, yes. But who wants to do that?
Sure - that's RT-101. I think I have a good understanding of these
principles after spending 7 years of my life working on wide-area
distributed real-time control systems (think city-scale water and
electricity supply).
> Secondly, you misparsed the "avoid direct reclaim" paragraph. We aren't talking
> about "avoid direct reclaim even if system memory is not enough". We are talking
> about "avoid direct reclaim by preparing before".
I don't think I misparsed it. I am addressing the "avoid direct
reclaim by preparing before" principle directly. The problem with it
is that just enlarging the free memory pool doesn't guarantee future
allocation success when there are other concurrent allocations
occurring. IOWs, if you don't _reserve_ the free memory for the
critical area in advance then there is no guarantee it will be
available when needed by the critical section.
A simple example: the radix tree node preallocation code to
guarantee inserts succeed while holding a spinlock. If just relying
on free memory was sufficient, then GFP_ATOMIC allocations are all
that is necessary. However, even that isn't sufficient as even the
GFP_ATOMIC reserved pool can be exhausted by other concurrent
GFP_ATOMIC allocations. Hence preallocation is required before
entering the critical section to guarantee success in all cases.
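The preallocation pattern described above can be sketched in plain userspace C. This is only an analogy for what the kernel's radix tree preload does, not the kernel code itself; struct node_pool and the pool_*() helpers are hypothetical names invented for this sketch. The point it tries to show is the split: pool_preload() may allocate (and so may block or trigger reclaim) and runs before the critical section, while pool_take() never allocates and is the only call used inside it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical node type standing in for a tree node. */
struct node {
	struct node *next;
	int key;
};

/* Hypothetical private reserve of nodes owned by one caller. */
struct node_pool {
	struct node *head;
	size_t count;
};

/*
 * Fill the pool up to `want` nodes. This may call malloc(), so it must run
 * BEFORE the critical section. Returns 0 on success, -1 if memory ran out.
 */
static int pool_preload(struct node_pool *p, size_t want)
{
	assert(p != NULL);
	while (p->count < want) {
		struct node *nd = malloc(sizeof(*nd));

		if (nd == NULL)
			return -1;
		nd->key = 0;
		nd->next = p->head;
		p->head = nd;
		p->count++;
	}
	return 0;
}

/*
 * Take one node from the reserve. Never allocates, so it is safe inside the
 * critical section. Returns NULL only if the caller under-reserved.
 */
static struct node *pool_take(struct node_pool *p)
{
	struct node *nd;

	assert(p != NULL);
	nd = p->head;
	if (nd != NULL) {
		p->head = nd->next;
		p->count--;
		nd->next = NULL;
	}
	return nd;
}

/* Return unused reserved nodes to the allocator after the critical section. */
static void pool_drain(struct node_pool *p)
{
	struct node *nd;

	while ((nd = pool_take(p)) != NULL)
		free(nd);
}
```

Because the reserve is private to the caller, concurrent allocations elsewhere cannot drain it, which is the difference from merely keeping more memory free.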
And to state the obvious: doing allocation before the critical
section will trigger reclaim if necessary so there is no need to
have the application trigger reclaim.
Cheers,
Dave.
--
Dave Chinner
[email protected]
Hi Dave,
Thanks for the long explanation.
> > Secondly, you misparsed the "avoid direct reclaim" paragraph. We aren't talking
> > about "avoid direct reclaim even if system memory is not enough". We are talking
> > about "avoid direct reclaim by preparing before".
>
> I don't think I misparsed it. I am addressing the "avoid direct
> reclaim by preparing before" principle directly. The problem with it
> is that just enlarging the free memory pool doesn't guarantee future
> allocation success when there are other concurrent allocations
> occurring. IOWs, if you don't _reserve_ the free memory for the
> critical area in advance then there is no guarantee it will be
> available when needed by the critical section.
Right.
I wrote per-task memory reservation code many years ago when I was
working on embedded systems. So there are some design choices here: best effort
as Christoph described, or per-thread or RT-thread-specific reservation.
> A simple example: the radix tree node preallocation code to
> guarantee inserts succeed while holding a spinlock. If just relying
> on free memory was sufficient, then GFP_ATOMIC allocations are all
> that is necessary. However, even that isn't sufficient as even the
> GFP_ATOMIC reserved pool can be exhausted by other concurrent
> GFP_ATOMIC allocations. Hence preallocation is required before
> entering the critical section to guarantee success in all cases.
>
> And to state the obvious: doing allocation before the critical
> section will trigger reclaim if necessary so there is no need to
> have the application trigger reclaim.
Yes and no.
Preallocation is a core piece, yes. But almost all syscalls call
kmalloc() implicitly, so mlock() is not sufficient preallocation.
Almost all applications except HPC can't avoid using syscalls. I think that's
the reason why finance people repeatedly request this feature
from us.
Thanks!